A Vision for the Future Low-Temperature Geochemical Data-scape

S. L. Brantley1,15, Tao Wen2, Deb Agarwal3, Jeff Catalano4, Paul A. Schroeder5, Kerstin Lehnert6, Charuleka Varadharajan7, Julie Pett-Ridge8, Mark Engle9, Anthony M. Castronova10, Rick Hooper11, Xiaogang Ma12, Lixin Jin9, Kenton McHenry13, Emma Aronson14, A. R. Shaughnessy15, Lou Derry16, Justin Richardson17, Jerad Bales10, Eric Pierce18

1. Earth and Environmental Systems Institute, The Pennsylvania State University, University Park, PA, USA
2. Department of Earth and Environmental Sciences, Syracuse University, Syracuse, NY, USA
3. Advanced Computing for Science Department, Lawrence Berkeley National Lab, Berkeley, CA, USA
4. Department of Earth and Planetary Sciences, Washington University, St. Louis, MO, USA
5. Department of Geology, University of Georgia, Athens, GA, USA
6. Lamont-Doherty Earth Observatory, Columbia University, Palisades, NY, USA
7. Earth and Environmental Sciences Area, Berkeley Lab, Berkeley, CA, USA
8. Department of Crop and Soil Science, Oregon State University, Corvallis, OR, USA
9. Department of Geological Sciences, The University of Texas at El Paso, El Paso, TX, USA
10. Consortium of Universities for the Advancement of Hydrological Science, Inc, Cambridge, MA, USA
11. Department of Civil and Environmental Engineering, Tufts University, Medford, MA, USA
12. Department of Computer Science, University of Idaho, Moscow, ID, USA
13. National Center for Supercomputing Applications, University of Illinois, Urbana, IL, USA
14. Department of Microbiology and Plant Pathology, University of California Riverside, USA
15. Department of Geosciences, The Pennsylvania State University, University Park, PA, USA
16. Department of Earth and Atmospheric Sciences, Cornell University, Ithaca, NY, USA
17. Department of Geosciences, University of Massachusetts Amherst, Amherst, MA, USA
18. Oak Ridge National Laboratory, Environmental Sciences Division, P.O. Box 2008, Oak Ridge, TN, USA


Introduction
Scientific communities and publishers within the geosciences are organizing and publishing their data online and promoting new ways to analyze these data (e.g. CHRISTENSEN et al., 2009; ASPEN INSTITUTE, 2017; GIL et al., 2019; STALL et al., 2019; LIU et al., 2020; U.S.G.S., 2020a). Some publishers have promoted and agreed to the principles of Findability, Accessibility, Interoperability, and Reuse of digital assets (the FAIR Data Principles). Some geoscience communities (e.g., climate, oceanography, ice, ecology, genetics, atmospherics, and agricultural science) have progressed toward these goals in terms of managing their data online. Recently, the growth of the Open Science and Open Data movement has encouraged publishers and data repositories in the Earth sciences to collaborate, leading to the Coalition for Publishing Data in the Earth & Space Sciences (COPDESS, http://www.copdess.org). COPDESS now promotes best practices for data in geoscience publications (COPDESS, 2020). As part of this movement, journals managed by the American Geophysical Union have opted in to the 'Enabling FAIR Data' project, which increasingly requires data to be submitted to trusted, certified data repositories where they can be cited with a digital object identifier (DOI). As enforcement of these new policies has become more stringent, data submissions to data repositories have increased (ALBAREDE AND LEHNERT, 2019).
As this movement has progressed, improvements have generally been slow in many subfields of geoscience. For example, the transition in late 2018 to requiring basic data sharing for submissions to Geochimica et Cosmochimica Acta met initial resistance from many authors. Today, a majority of authors choose to attach their data to the published manuscript as supporting material, which remains behind a paywall, because this does not require the time-consuming data formatting or input protocols associated with contributing the data to a separate repository. At the same time, the explosion in the use of sensors, remote sensing, automatic instrumentation, and data analytics, and the increasing storage of data online in a globally connected information system, means that efficient and accessible data management for low-temperature geochemistry (LTG) - the "data-scape" for LTG - is of growing importance. To understand this situation and to chart an appropriate roadmap for management of LTG data, a 2.5-day workshop was held in February 2020 in Atlanta, Georgia, USA. Several participants were drawn from the data science community, but most were practicing geochemists with a wide span of research interests and prior experience with geochemical data collection. This paper summarizes the workshop conclusions. Because the participants were weighted toward practicing geochemists, with only a small number of data scientists, this paper is unusual among papers about data management systems in that it is written mostly from the perspective of bench and field scientists.
Understanding the perspective of the domain scientists helps data scientists appreciate the complexity of LTG data, which will be important as current data systems progress or as new data systems are developed in the future.
For the purposes of this paper, "low-temperature geochemistry" describes the field of geoscience that investigates earth processes pertaining to the chemistry of surficial Earth materials including water and biota. This field includes, but is not limited to, chemical and biogeochemical cycling of elements, aqueous processes, mineralogy and chemistry of earth materials, the role of life in the evolution of Earth's geochemical cycles, biomineralization, medical mineralogy and geochemistry, and the geochemical aspects of critical zone science and geomicrobiology. In addition to these topics, LTG also includes tools, methods, and models pertaining to the fields listed above. This LTG definition is drawn from the definition currently used by the U.S. National Science Foundation (NSF) for the U.S. LTG community. This paper is necessarily informed by that perspective because of the funding for the workshop, but it is offered also as an invitation for other scientists worldwide to contemplate the LTG data-scape into the future.
At the workshop, we recognized that some sub-sets of the LTG community have already self-organized their approaches to data management, sometimes initiating their own best practices for data management systems (e.g. Table 1). To enable conversation at the workshop between all the different sub-sets of the LTG community as well as the data informatics community, a short lexicon of terms was compiled (Table 2). We needed this lexicon because words were often used differently by domain scientists (geochemists) and information scientists, and often, words were used differently by different individuals within the information science community itself. Certainly, the lexicon was also helpful for communities who had yet to develop data management systems (e.g., Table 3). This paper was written for both LTG practitioners and data scientists interested in helping in the management of LTG data.
The main questions at the workshop addressed data management and sharing from different perspectives. We focused on three areas. First, who are the different stakeholders interested in coordinated management of LTG data, and what does each of them want to achieve? To answer this question, we discussed what we perceive to be the characteristics of the optimal management system from the perspective of different stakeholders (e.g., data producers, data users, modellers, funders, journal editors, government agencies, the public). Second, we asked, how can we best secure the longevity of data for the future given that a typical research project in LTG is only three years without possibility of renewal? In this regard we noted that data archived in older papers can still be read, while data in "aging" electronic peripheral devices such as floppy disks can only be read by specialty workers, emphasizing the importance of the type of media for storage and the resources available for data storage (e.g. CHRISTENSEN et al., 2009). Similarly, data stored within proprietary software may not be accessible in the future if the software changes or is not maintained. Finally, we looked at the question, what does the data life cycle look like today for LTG? We noted that many LTG practitioners only collect small volumes of data and publish them in papers, while others pursue meta-analysis of multiple datasets. This paper summarizes a roadmap for a future data-scape for LTG data, noting all these activities and usages.

Characteristics of LTG data
Geochemical data are highly variable in terms of type, volume, structure, dimensionality, and character. The one trait that these data tend to share is that they provide chemical analysis or descriptions of features related to chemical makeup. But the structure of the data nonetheless varies from one dataset to another. For example, analyses can focus on the possible 100+ elements, the 200+ stable and radiogenic isotopes, 5000+ minerals, or the many thousands of fluid species (including organic molecules) that have been identified. In addition, new types of analyses are always being developed and refined. A schematic showing the variety in the different chemical analyses that might be made for one soil sample in a given landscape is shown in Figure 1. A few data characteristics are described below; although not all of them are universal, many apply to other types of samples of interest to LTG scientists.

Figure 1. Schematic providing a sense of the number of analyses, sub-samples, and extractions that are often completed in creating an LTG dataset, even from a single sample.
Some geochemical data are sample-based. A "sample" is a physical object that can be archived (Table 2). Samples refer to both laboratory- and field-derived objects and can include any medium, from liquids to solids to gases. They can derive from any of the 5000+ minerals known to form naturally (FLEISCHER, 2018) or from the infinite number of possible mixtures of these minerals (e.g. rocks, rock aggregates, sediments, soils, etc.). In addition, geochemists also study non- and nano-crystalline materials (HOCHELLA et al., 2019). Of great importance among the non-crystalline materials are the many different types of organic matter (e.g. HEMINGWAY et al., 2019), as well as living and non-living organisms and biotic waste materials. Finally, geochemists are not interested only in analysis of natural samples: they also investigate human-made (i.e. engineered) materials and associated wastes (i.e. incidental materials).
With each sample, geochemists can complete bulk analyses, but they can also separate a single sample into multiple daughter sub-samples, or they can extract the materials for different species or different associations or affinities (e.g. PICKERING, 1981), as exemplified in Figure 1. Thus, Earth materials (rocks or soils, etc.) are ground for bulk analysis while, in addition, individual grains are separated and analyzed or targeted for laser-based analysis in a thin section. Similarly, when organisms are analyzed, the analysis can be for the bulk or for a specific part such as the leaves, trunk, xylem, brain, otolith, etc., and for each body part, the analysis can target the bulk or a sub-part such as the entrained water (e.g. ORLOWSKI et al., 2016). And of course, each of these sample-based analyses can target concentrations of different species: for example, elements, molecules, isotopes, isotopically-labelled molecules, etc. In addition, geochemical analyses do not consist only of tabulated analytical data; rather, they also include spectra, diffractograms, chemical maps, photographs, spectrograms, and other types of images or pixelated data. The volume of data associated with these datasets can be far larger than that of sample-based analytical data. Thus, whereas early datasets could be accommodated in a notebook, the larger data volumes can only be accommodated in online data systems (Figure 2).
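The sample-to-sub-sample-to-analysis hierarchy described above can be sketched as a simple nested data structure. The sketch below is purely illustrative: the class and field names are our own assumptions, not a community schema, and a real repository would need far richer metadata.

```python
# Hypothetical sketch of the sample -> sub-sample -> analysis hierarchy.
# All names here are illustrative assumptions, not a community standard.
from dataclasses import dataclass, field

@dataclass
class Analysis:
    method: str        # e.g. "ICP-MS", "XRD", "laser ablation"
    target: str        # e.g. an element, isotope, or mineral phase
    value: float       # tabulated result; images/spectra would live elsewhere
    units: str = ""

@dataclass
class SubSample:
    preparation: str   # e.g. "ground bulk", "sequential extraction step 2"
    analyses: list = field(default_factory=list)

@dataclass
class Sample:
    sample_id: str     # ideally a persistent identifier
    medium: str        # soil, water, rock, biota, ...
    subsamples: list = field(default_factory=list)

# One soil sample, one ground-bulk sub-sample, one elemental analysis.
soil = Sample("SOIL-001", "soil")
bulk = SubSample("ground bulk, <2 mm fraction")
bulk.analyses.append(Analysis("ICP-MS", "Sr", 120.0, "mg/kg"))
soil.subsamples.append(bulk)
```

Even this toy structure shows why flat templates struggle with LTG data: each level of the hierarchy can branch many times for a single physical sample.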
In contrast to sample-based data, LTG geochemists also collect time-series (sometimes referred to as longitudinal) or field-based measurements (taken without collecting a sample) of water, air, biota, and solids. Some of these time-series measurements are made by field workers, but increasingly, measurements are made with sensors (e.g. KIM et al., 2017) or remote sensing (e.g. BERATAN et al., 1997). Temporal variations are measured in real-time or intermittently over long durations (e.g. BENSON et al., 2010). Advances occurring in the technology of sensors and sensor networks are rapidly driving new types of data collection for water quality, soil and rock characteristics, air composition, and molecular biological properties.
Regardless of whether their measurements are sample-based, field measurement-based, or time-series, LTG scientists place great stock in new types of analyses. The upshot is that many LTG papers summarize data that are purely research grade. As shown schematically in Figure 3, these measurements are highly non-routine (one-of-a-kind or first-of-a-kind), in contrast to more established, routine measurements with accepted standards. Figure 3 emphasizes that, as innovation in the measurement protocol decreases from left to right, the ease of data management increases.

Figure 2. Range in data curation, availability, and harmonization imagined by LTG scientists for management of LTG data. Data are shown schematically as the pink-colored shaded area. Not all LTG observations will even be stored in a notebook, let alone in the sequentially more structured (higher) levels of data storage. Rather, only data that are scientifically useful will be stored in increasingly structured repositories. Currently, LTG scientists need to store more data in online data repositories. Many of these will be generalized data repositories that provide flexibility in management of data and metadata. As datasets in the generalized data repositories grow in scientific importance, or as demand for the data grows from users internal or external to LTG, shared online relational databases may be developed and maintained for specific data types.
Finally, in addition to these sample-, field-, and time-based measurements, many geochemical "data" now increasingly consist of model outputs or calculations. One type of model output that is often treated as data comprises measurements reported by instruments whose manufacturers keep data-processing protocols proprietary, leaving access to the raw data sequestered behind a paywall and limited to licensed users. Other types of model output are also stored and used by geochemists.
For example, global oceanic chemistry models used by oceanographers and geochemists can yield very large datasets of salinity or trace element content versus location. These models can include predicted data, so-called "re-analysis" data, model workflows, and model programs, and often the community wants to have access to all of these "data" sets (KALNAY et al., 1996).
Given all of this variety in data types and model outputs, some LTG datasets are large in volume while others are very small. For example, model-related output "data" are commonly associated with very large "data" volumes, as are sensor or remote sensing data, both of which can provide high spatio-temporal resolution. In contrast, many sample-based datasets may be relatively small in volume, at least partly because of the expense and time necessary to collect, prepare, sub-sample, and analyze samples (Figure 1).
However, almost all geochemical data are large in terms of the types of metadata that are needed. Here we refer to the information related to "who, what, when, where, how" for the data values (MICHENER, 2006).
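The "who, what, when, where, how" framing can be made concrete with a minimal metadata record for a single measured value. This is a hypothetical sketch: the keys, values, and level of detail are our own assumptions, not any repository's actual schema.

```python
# Hypothetical metadata record for one measured value, organized by the
# "who, what, when, where, how" framing (MICHENER, 2006). All field names
# and values are illustrative assumptions, not a standard schema.
metadata = {
    "who":   {"analyst": "J. Doe",
              "laboratory": "example university geochemistry lab"},
    "what":  {"analyte": "dissolved Ca",
              "units": "mg/L",
              "detection_limit": 0.01},
    "when":  {"sampled": "2019-06-14",
              "analyzed": "2019-06-20"},
    "where": {"latitude": 40.79,
              "longitude": -77.86,
              "depth_m": 1.5},
    "how":   {"method": "ICP-OES",
              "sample_prep": "0.45 um filtered, acidified"},
}
```

Even this minimal record shows why the metadata burden can dwarf the data themselves: one number requires more than a dozen contextual fields to remain interpretable and reusable.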

Lack of best practices, standards, and harmonization for LTG data
The variety of datasets and types of LTG research means that goals and workflows vary considerably from one team to the next. But the design of effective data repositories depends upon the goal of the data as well as the overall workflow for both data generation and data processing (e.g. RUEGG et al.). As a result, even where many examples of a certain type of data have been collected, and even when they may be organized into online libraries, it is rare in LTG that there is a generally accepted standard for the data. For example, quantitative phase analysis of Earth materials, whether they are rocks, soils, sediments, or something else, is fundamental to LTG, and there are several libraries for such data (Table 1), but community standards for sample preparation for X-ray diffraction, data collection, and meta-analysis have not been established. In another example, the team behind one NSF-supported geochemical data repository (the EarthChem Library) emphasized the most common methods and sample types to develop templates for petrologists to submit rock chemical data. When the EarthChem team tried to use the same templates for communities beyond petrology, they met resistance because the non-petrologists preferred templates tailored to their own workflows. As a consequence of the many types of workflows, many practicing LTG scientists report that they find the data and metadata protocols of highly standardized data repositories difficult to implement for their own datasets.
It is important to emphasize that the variety of workflows that characterize LTG is not just a consequence of competing egos or laboratories. The different workflows result from groups asking different questions. For example, soil scientists and geologists collect and analyze soils to pursue questions within LTG. But the former analyze only the <2 mm fraction (because it most affects soil fertility) while the latter analyze the entire sample (because they use whole-sample values in mass balance calculations). Thus, for routine analyses of many different types of soils, the National Cooperative Soil Survey (NCSS) database (N.R.C.S., 2020) is useful because all the soils have been sieved in the same way before analysis, but this database is not necessarily useful for the mass balance calculations pursued by geologists (BRIMHALL AND DIETRICH, 1987). In another example, many in-vitro analytical methods have been developed to assess the health impact and bioaccessibility of contaminants from dust particles in the human lungs (WISEMAN, 2015). These protocols differ significantly from analyses aimed at understanding leachability in environmental systems (PICKERING, 1981).
Another reason for the lack of agreement on standards and protocols for measuring and reporting data is LTG practitioners' strong emphasis on developing new or non-standardized techniques - for example, in sampling methodology, chemical extraction, analytical technique, and laboratory protocol. This emphasis results in a lack of data standards, difficulty in creating templates for data or metadata input, and, ultimately, difficulty in comparing datasets within the LTG community. Here, data standards are defined as policies or protocols that determine how geochemical data and metadata should be formatted, reported, and documented. Many LTG scientists have not heard of, let alone used, standards such as the Observations and Measurements protocol of the International Organization for Standardization (ISO) (COX, 2011). Likewise, few LTG scientists are aware of the 'Requirements for the Publication of Geochemical Data' that were agreed upon in 2014 by an editors' roundtable (a roundtable that included geochemists). These requirements explain how to report data and metadata in a structured, standardized manner.
Even where geochemical data are already compiled and accessible in one place, such as the National Water Quality Portal (USGS/EPA), the data are not harmonized, i.e., units, formats, analytical methods, detection limits, and other parameters are not presented consistently (e.g. SPRAGUE et al., 2016; SHAUGHNESSY et al., 2019). Apparently, data standards for agreed-upon units and measurement protocols have never emerged because i) communities have never felt enough need for, or placed enough value on, such standardization, or ii) variations in protocols were simply necessary to answer the proposed research questions. Neither have LTG scientists addressed, as a community, how to cite and reward or incentivize scientists who collate, curate, synthesize, and share published data for LTG or for other communities (data interoperability). The lack of standards has in turn hampered the development of automated flows of geochemical data into databases. For these and other reasons, geochemical data compilations have been growing slowly (LEHNERT AND ALBAREDE, 2019).
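To make concrete what "harmonization" entails, the sketch below reconciles two hypothetical records of the kind one might retrieve from a multi-source compilation. The field names, unit conventions, and below-detection flag are illustrative assumptions, not the actual schema of the Water Quality Portal or any other system.

```python
# Minimal sketch of a harmonization step: two hypothetical sources report
# nitrate with different units and different conventions for values below
# detection. All field names and conversions here are illustrative
# assumptions, not a real portal's schema.
def harmonize(record):
    value, units = record["value"], record["units"]
    if units == "ug/L":                    # convert micrograms to milligrams per litre
        value, units = value / 1000.0, "mg/L"
    censored = record.get("remark") == "<"  # "<" marks a below-detection value
    return {"analyte": "nitrate", "value": value, "units": units,
            "below_detection": censored}

source_a = {"value": 250.0, "units": "ug/L", "remark": ""}
source_b = {"value": 0.05,  "units": "mg/L", "remark": "<"}
harmonized = [harmonize(r) for r in (source_a, source_b)]
```

Real harmonization is far messier: analytical methods, detection limits, and site identifiers must also be reconciled, and censored values demand statistical care rather than a simple flag. The point is only that each inconsistency requires an explicit, documented rule.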

Current data management systems
To date, a variety of data management systems have been used by LTG scientists, including storage in notebooks, offline data infrastructures (e.g., individual computers), published works (e.g., theses, journal publications, and supplemental material), and online data infrastructures (e.g., personal webpages, dedicated data repositories). A schematic showing the trend of data management is shown in Figure 2. Shared online relational databases represent the most structured and demanding management systems, but they also provide the easiest data discovery and use for meta-analysis, and they best promote collaboration.
Some of the data repositories that have a track record of success for data types of interest to LTG (time-series water data, rock chemistry, atmospheric radiation measurements, CO2 flux, etc.) are summarized in Table 1. Some of these online data infrastructures are maintained and used as libraries (e.g., for spectra, electron micrographs, or diffraction patterns). Such libraries are not true data repositories, but rather databases that do not generate DOIs for the data provider and may only retain a limited number of examples for each measured entity. An instructive example for mineralogy is the International Centre for Diffraction Data (ICDD) that offers a detailed (paywalled) library of mineral structure data (both experimental and theoretical), which serves as a reference for identification and quantification of minerals. Other open-source databases for mineral structures are also available (e.g., Mineralogical Society of America Crystal Structure database).
Given that only a few highly structured targeted databases for LTG data are available, and that libraries are not true data repositories, many other LTG data types lack appropriate repositories (a few examples are listed in Table 3). For these "orphaned" data types, scientists either publish their data in a journal article or its supplement, leave it unpublished on their computer or within a thesis, publish it online on their personal website, or use generalized and unstructured data repositories that can accommodate any type of data file and can assign a DOI to the dataset. These generalized data repositories provide only minimal curation of metadata and do not police data quality. On the other hand, they generally provide long-term storage and require that the data provider record a modicum of metadata so that indexing and search features can be enabled. Some of these general-purpose repositories operate behind a firewall or paywall, while others are open and free. Some can be used by anyone while others are limited to specific clientele (e.g. from a specific university, country, or funded program) or types of data. For example, geochemists in the United States Geological Survey use ScienceBase (U.S.G.S., 2020c); geoscientists funded by the U.S. Department of Energy (DOE) use ESS-DIVE (see Supplemental Material) for ecosystem and watershed data (VARADACHARI et al., 1994), the ARM data center for cloud and aerosol properties, and EDX for data related to fossil fuel energy (N.E.T.L., 2020). Other such generalized data repositories are also becoming available through publishers, universities, federal agencies, and private entities. Two examples that are used by some NSF-funded geochemists are the EarthChem Library and CUAHSI's HydroShare (see Supplemental Material). To our knowledge, there is no portal that links to all the many data repositories used by LTG scientists.
Despite the examples in Table 1, we sense that most LTG scientists do not use data repositories. Thus, even for those parts of LTG science for which data management systems have been developed, many practitioners of LTG do not understand the repositories, how to use them, how to manage their data efficiently in preparation for ingesting data into a repository, or what kind of science the repositories could enable. The problem is somewhat circular in nature: some of the difficulties in data management could be reduced by 'best practices' in data management throughout the data life cycle, but often the data repository itself is simply not well suited to the scientists' data needs, making it unlikely to be used. As emphasized in Figure 2, a bottleneck has developed in which LTG scientists are not uploading data into online repositories.

Lessons learned
In this section we summarize important lessons learned during the workshop with respect to LTG data and data management systems (Table 4). Many of these lessons were gleaned from the history of several USA-centric data management systems discussed at the workshop (see Supplemental Materials).
A schematic of a general trend in data management from these histories is shown in Figure 2. From bottom to top on the diagram, as systems increasingly allow efficient and easy data discovery for personnel outside of the data producers' home group, the ease of collaboration among groups or across disciplines increases. At the same time, however, increasing the utility and efficiency of the system for the data user (bottom to top on the figure) generally entails more formalized and rigid rules for formatting and uploading data and less flexibility for the data provider (i.e., movement from left to right on the graph). Moving from left to right and bottom to top on the diagram also requires increasing effort by the community to identify and agree upon the questions of interest and the data standards needed to address those questions. Thus, from bottom to top and left to right, data use becomes easier but demands on the data provider become greater. Ten related lessons are summarized below and in Table 4.

Lesson 1. Improving the accessibility of geochemical data promotes better science and better societal decision-making. The value of scientific data increases to other scientists and to the public when data are published online and can be accessed even after a given program or project is terminated (BALL et al., 2004; CHRISTENSEN et al., 2009). As an example, background soil chemistry data from decades in the past can be used to assess pollution impacts or health risks for activities that are ongoing today.

Lesson 2. Although the data enterprise from measurement to meta-analysis is complex and fraught with opportunities for error, systematic management of data leads to improvements in data quality and promotes identification of large-scale trends. Few individuals understand the entire trajectory from sample collection or sensor deployment to measurement or observation to online publication and data interpretation in LTG. In addition, only a very few people within this complex data enterprise can assure the quality of the data, and these personnel tend to be those who made the measurements in the first place or who were responsible for the use of reference standards, methodologies, instrumentation upkeep, and quality assurance measures. Furthermore, as data are moved from the laboratory notebook to compiled datasets and shared data repositories (see, for example, Figure 2), new opportunities for errors arise. Simply put, it is impossible to maintain and manage data systems without errors. Nonetheless, during the complex trajectory from measurement to interpretation of geochemical data, increasingly systematic data management helps scientists, through the use of meta-analysis, to find issues related to data quality as well as large-scale trends and patterns in the data.
Lesson 3. Highly structured relational databases may be useful for LTG data where geochemical measurement protocols are reproduced identically in many locations on many samples, but less structured data repositories are needed for measurements that are implemented differently from sample to sample or place to place, or where protocols are developing or non-standardized. Some geochemical sampling and analytical strategies are routine, while many are under development or are implemented differently based on the particular nature of the media or the question being asked. The result is that "routine" data are relatively easy to standardize and manage in structured repositories while non-routine data are not (Figure 3). Examples of "routine" data are measurements of solute concentrations, pH, alkalinity, and other parameters completed on water samples by the U.S. Geological Survey's National Water Quality Laboratory or completed based on standard methods (APHA, 1998).
In contrast, data developed from non-standardized analytical techniques, or after refinements addressing specific issues in the collection or analysis of novel types of samples, are inherently non-routine. These data are more difficult to archive in standardized data management frameworks and may also require a large amount of metadata, including extensive discussion of analytical techniques and clear disclosure of the underlying assumptions used. Even with samples undergoing mostly routine analyses, some samples will always need to be treated differently than others. This means that flexibility is required in the data repository. For example, a geochemist may use one workflow of separation and extraction for one sample of a rock and a completely different technique for another sample from the same site (depending upon composition or sample provenance). A case in point is chemical analysis of shale, where a red shale generally requires one type of analytical workflow while a black shale requires another because bulk elemental analysis is affected by sulfur content. Thus, the many combinations of different sample preparations and chemical, mineralogical, or isotopic analyses can make data compilation in a structured repository a complex process (NIU et al., 2014). Data management systems for LTG, like quality

Lesson 4. The many goals and workflows of LTG scientists have resulted in a multitude of different data structures and metadata requirements, and a proliferation in the number of data repositories and portals housing or pointing to LTG data. Even when geochemical datasets are small in volume, they generally require complex metadata to make them of lasting usefulness (a point also made for ecological data (MICHENER, 2006)). This is partly because interpretation of chemical analyses requires understanding the different methods of sub-sampling, extraction, or density separation applied before analysis (Figure 1). Even after preparation, a range of measurements with a variety of instruments is possible: analyses of target elements, mineral species, aqueous species, isotopes, location-specific isotopes in molecules, microbiological characteristics, chemical bonding, photographs, diffractograms, spectra, and other operationally useful parameters (Figure 1). The result of all this complexity is that geochemical datasets are characterized by a wide variety of structures, metadata requirements, and formatting. Consequently, data management systems have proliferated (Table 1). Another reason for the proliferation is competition or differing preferences among individuals, teams, projects, networks, universities, agencies, and countries.
Lesson 5. LTG scientists often resist sharing data in online data management systems.
Geochemists at the workshop stated that they want sustainable, long-term repositories for their data so that they can have accountability with funding agencies, so they can brand their data as their own, and so that they can promote use and citation of their data by other scientists and, in some cases, the public.
However, like most LTG scientists, most workshop participants do not publish their data in online data repositories, nor do they train their students to do so. The few workshop scientists who had used repositories cited requirements for data publication by journal editors or a mandate from a funder as the immediate driver.
But some of the LTG scientists who had used online repositories expressed resistance to sharing data in such online systems (see also BRASIER et al., 2016). Sometimes their resistance stemmed from the natural tension between data providers and those who pursue meta-analysis. They also sometimes expressed fear about loss of control of the data, or the related fear of possible misuse of their data by others. One might also conjecture that some governments may be reluctant to share data for perceived reasons of national security, as in seismic monitoring.
But the most commonly cited reason for resistance to the use of data repositories by LTG scientists was the time-consuming nature of inputting data and metadata and the related lack of a reward structure for data management and publication. In most cases, this work falls on the shoulders of the geochemists who are completing the analyses. This may explain why, as pointed out (for ecological data) (MICHENER, 2006), "Obtaining metadata may be the most challenging aspect of data management. The investigators who collect, manipulate, perform QA [quality assurance] on, and initially analyze their particular part of the project's information know what they need to know about it. They have little intrinsic incentive to take the time to formalize and structure this knowledge, except for what is needed for reports and publications." Resistance to the time requirements of data publication in online repositories may be one of the main explanations for the bottleneck shown in Figure 2 for LTG science and will likely only be solved by improving data management systems (possibly also laboratory information management systems) or by changing the LTG culture.

Lesson 6. The data-scape for LTG must encompass i) flexible management systems for datasets
where measurement methods are less routine or still under development, ii) highly structured and managed data systems for datasets with established standards for measurement, and iii) a process whereby data systems can sometimes evolve from mode (i) to mode (ii). This finding is largely a corollary of the lessons explained above in that it can be difficult and time-consuming to format and input large volumes of metadata into structured data management systems even when they are designed specifically for an individual dataset; likewise, such data input simply does not make sense for less routine data. Such structured data systems need only be built for very large and important datasets where the measurements are more or less routine and the community agrees upon the need for and utility of the database. Two examples discussed previously manifest this finding: namely the development of a highly structured database for rock chemistry (PetDB) and the development of a highly structured database for water chemistry (CUAHSI HIS). These communities had rough measurement standards and protocols already, and agreed on the utility of the data, and so they self-organized and developed standardized data management systems. Without such agreed-upon formats and goals, other communities need data management systems that allow data to be stored in less structured systems. The benefits of these less structured systems are that they are often more intuitive to subject-matter experts, can be easier to archive in some data repositories, and are easy to re-structure (CHRISTENSEN et al., 2009). Of course, by definition, this type of data storage is not as useful to some data users (Figure 2) because datasets are compiled with different structures and other characteristics.
But it is important to emphasize that one reason that less structured data systems emerged from the rock and water communities was that there was resistance to the time commitment needed for uploading data and metadata into the highly structured databases. Therefore, even after the highly structured databases emerged (e.g., PetDB and CUAHSI HIS), the need for less structured data systems that would allow easier collation of data without the time-consuming input and metadata format requirements became apparent (see Supplementary Material). These two highly disparate communities (petrologists and water scientists) both discovered that i) some datasets and communities need structured data management systems, and some need less structured systems, and ii) many data providers resist time-intensive data uploading protocols.
Lesson 7. Standards for data and metadata are only developed when the community or the science demands it. Some communities have successfully brokered data sharing agreements (e.g., climate, biological oceanography, seismology); likewise, best practices have been endorsed for data publication and data citation that apply across some domains (e.g., LEHNERT AND HSU, 2015; ESIP, 2019; DATA CITATION SYNTHESIS GROUP, 2014; STALL et al., 2019; COPDESS, 2020). However, low-temperature geochemists so far have not agreed upon standards for data formatting or for data storage for many of their datasets.
It is likely that this is because the community has not yet seen enough value in data standards or in data harmonization relative to the time required to implement them. For example, if most LTG data are not intended for integration with other groups' or other disciplines' datasets, or if this integration is not valued, then the hard work of data standardization will not occur. As Earth system models are developed and improved and interoperability of datasets is increasingly valued, however, LTG practitioners will eventually develop ways to standardize and share some if not all of their data. One impetus that is already encouraging a demand for standards on the part of LTG scientists is new mandates from journal editors and granting agencies.

Lesson 8. LTG scientists need both archived (unchanging) and versioned (modified and
updatable) datasets. Some LTG datasets must be maintained as non-changing entities (long-term archives) while others are continuously updated or corrected over time (self-described longitudinal or versioned datasets). For example, water chemistry data have been used to investigate the impact of hydraulic fracturing on groundwater (Shale Network, Table 1). When meta-analyses are published, the data are referenced both as a growing dataset hosted by the CUAHSI Hydrologic Information System (HIS) (doi:10.4211/his-data-shalenetwork) and as a separately archived version of the dataset sampled at the time of analysis (doi:10.26208/8ag3-b743). Archiving a fixed version of the dataset was not possible in the CUAHSI HIS, so the scientists published it in their university data repository. That repository allowed archiving of a long-term copy of the data, whereas the other site showed the entire, growing dataset. From the perspective of data producers, it is particularly important to archive the dataset analyzed in a publication to ensure the reproducibility of the relevant research. On the other hand, scientists also need to update datasets and attach version numbers to evolving data. Thus, data curation is needed that tracks provenance, provides versioning capabilities, and allows citation (e.g., DOIs). Such utilities could be found in different data management systems or within one system.

Lesson 9.
Where geochemical databases have been successful, they have been funded over long periods of time, organized by groups of dedicated scientists, and focused on specific data types. A few entities have built very focused databases for geochemical data. For example, PetDB and the Geochemistry of Rocks of the Oceans and Continents (GEOROC) are successful synthesis databases for petrologic data, as is the CUAHSI HIS for time series water quality data (see Supplementary Material). The first two databases exclude large sectors of rocks and minerals of interest to LTG, while the CUAHSI HIS is built for time series but is not as easy to use for, e.g., depth profiles of soil porewater. A similarly successful data repository in another field is the USGS Produced Water Database, which provides water chemistry for samples collected from oil and gas reservoirs nationwide (Table 1).
These databases and other long-term repositories (Table 1) share some attributes. First, they target only a subset of data as defined by their mission or funding: PetDB, for example, was funded by NSF's RIDGE Program to provide a complete collection of the geochemistry of igneous and metamorphic rocks of the ocean floor. By definition, the databases do not include the geochemistry of all rock types.
Nonetheless, the databases have grown to accommodate similar geochemical data of other materials. For example, PetDB includes data for mantle xenoliths, ophiolites, marine sediments, volcanic gases, and astromaterials. Second, successful databases tend to receive consistent funding over many years from government agencies, private foundations, libraries, or universities. Third, some data systems are successful at least partly because developers are a small but dedicated group of scientists (<12) who attract and work with a much larger group of scientists contributing data.
Lesson 10. The science community must prioritize which datasets will be managed online as shared relational databases because development and maintenance of such databases is very time- and resource-consuming. Building cyberinfrastructure that facilitates access to geochemical data along the trend shown in Figure 2 is expensive, time-consuming, and demands specialized skills. The exact cost of both building and maintaining datasets or data repositories depends upon the type of database. For example, although relational databases are more powerful than flat files, they are also more difficult to maintain over time. They are also less intuitive for subject-matter experts and require more planning and documentation (CHRISTENSEN et al., 2009). In actual dollars, the annual cost of maintaining EarthChem's PetDB (Table 2) is $250,000, and this does not include resources for new development to keep up with changing technology demands. For large, multi-investigator projects, data management can cost 20-25% of the cost of the measurements themselves (BALL et al., 2004). The costs of maintenance are at least partly related to the ongoing evolution of computer hardware and software and the need to maintain the databases on the evolving platforms. Investments in funding, time, and personnel for updating, maintenance, and upgrades are therefore significant. Part of the problem is that research datasets are ever-changing, but very little money is typically available for changing data management structures, adding new metadata fields, and the like. It is of course always possible to write code to migrate data from one system to the next, but this also costs time and money.
All of these issues are amplified because of the large number of skillsets needed at the same time in a data management team: skillsets that are generally not found in the same small set of individuals. For example, information technology researchers who have developed new cyberinfrastructure are generally less interested in maintaining old infrastructure. Furthermore, personnel managing data cyberinfrastructures must not only support the software and hardware but must also provide help to the community of users. The latter requires people with geochemical skills, and very few people currently have both data management and geochemical skillsets. All of these problems point to the LTG community's inability to provide sufficient support for large numbers of specialized databases. They also point to the need to train the next generation of emerging geochemists in best practices for data management. Such training would help not only in data archival but also in promoting better LTG science as a whole.

A proposal for the future LTG data-scape
All data should be published. Workshop participants concluded that all primary LTG data should be shared publicly at the time of journal publication. LTG journals and government publications should consider mandating this, and should similarly consider mandating that computer code be made available and linked to journal articles, reports, and data in repositories (LIU et al., 2020). This could improve documentation and error checking for both data and codes, many of which currently have little external vetting.
Many best practices could be implemented by LTG scientists to manage data systematically. For the longest-term use, data should always be shared with appropriate metadata using community-defined, non-proprietary data formats. The U.S.G.S. faces issues similar to those of the LTG community in terms of data heterogeneity, and training modules for such data management have been established online that could be useful to LTG scientists (U.S.G.S., 2020b). If data and metadata are managed well, the job need only be done "once", and data presentation and publication will become easier for journals and online databases.
Researchers should plan for data management in advance of their research and be afforded additional funding within grant budgets for personnel time, hardware, or software. For larger projects, data management team members could be embedded in science teams to promote improved data and metadata publication. To enable all of this improved data management, LTG scientists suggest that funders should provide adequate funds for time and infrastructure while protecting resources for the science itself. Data scientists at the workshop pointed out that the use of consistent data templates, pulled from existing resources or customized for a laboratory, can be a simple but cost-effective way of lowering the burden of collecting consistent metadata. This is because once all the data are standardized into one format, it is easier to convert to some other metadata or data format if needed for ultimate publication in a data repository. Some pointed out that geochemical workflows could be supported and automatically recorded by intelligent software such as Laboratory Information Management Systems. At the same time, however, such systems can be expensive and time-intensive to implement and are usually only deployed in large laboratories or for very large datasets, both of which tend to plot to the right on the schematic in Figure 3.
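As a concrete illustration of the template idea, the following minimal Python sketch checks that each laboratory record carries a fixed set of metadata fields before the data leave the lab. The field names are hypothetical, not a community standard:

```python
# Hypothetical sketch of a laboratory metadata template check.
# Field names are illustrative only, not an LTG community standard.

REQUIRED_FIELDS = [
    "sample_id", "latitude", "longitude", "collection_date",
    "analyte", "value", "unit", "method", "detection_limit",
]

def validate_record(record: dict) -> list:
    """Return a list of problems; an empty list means the record passes."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]
    if "value" in record and not isinstance(record["value"], (int, float)):
        problems.append("value must be numeric")
    return problems

record = {"sample_id": "S-001", "latitude": 40.8, "longitude": -77.9,
          "collection_date": "2020-06-01", "analyte": "nitrate",
          "value": 1.2, "unit": "mg/L as N", "method": "IC",
          "detection_limit": 0.01}
print(validate_record(record))  # prints []
```

Once records pass such a check consistently, converting them to the format of a target repository becomes a mechanical mapping rather than a manual cleanup.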
The workshop participants concluded that most LTG data should be published in online data repositories and be referred to with DOIs (instead of in journal paper supplements). In that way, researchers can be evaluated efficiently for published data by peers (in the peer review process), by managers (in assessing salaries, promotion, tenure), and by agencies (in determining criteria for funding).
Some LTG practitioners, for example those involved in process-oriented science, pointed out, however, that measurements produced in some projects are so small in volume that they do not even warrant summary in a table in a paper, let alone in an online repository. Likewise, there are types of data (diffractograms, spectra, photomicrographs, wellbore logs, development-grade data such as on the left of Figure 3, etc.) for which there are no data repositories. Publishing these small-volume or unusual data, with all explanations, interpretations, and metadata, within a journal paper or its supplement might in some cases be better than publishing them in a repository. However, the problem with publication in pdf format in a journal article or its supplement, without archival in a data repository, is that such data are difficult to find, let alone ingest into models; in fact, some publishers no longer want to accept data in supplements as part of the 'Enabling FAIR Data' movement (COPDESS, 2020). In recognition of the difficulty of harvesting data from papers and supplements, the U.S. National Science Foundation has funded a group to build tools to find such data and publish them formally online (XDD, 2020). The workshop participants emphasized that it is in the best interest of the scientist, the science, and the public to publish all data in a repository, i.e., to break through the bottleneck in Figure 2, and not to publish data in pdf format alone.
Once more LTG data are published in online repositories, the problem of ambiguity in sample identification will still remain. Therefore, the workshop participants concluded that the community, funders, and journals all should require that LTG scientists use globally unique sample identifiers.

Choice of data repository. The choice of appropriate data repository must be made in consultation with the scientific community, editors, managers, and funders. The file format should be chosen so that it meets standards for long-term preservation. In other words, storing a data file in a proprietary spreadsheet format rather than as a CSV file might limit users' ability to use the data in the future if the proprietary format conventions change. Using non-proprietary data formats makes the most sense.
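To illustrate the non-proprietary-format point, a table of results can be written as plain CSV with nothing beyond the standard library, so it stays readable regardless of spreadsheet software. The column names below are illustrative only:

```python
# Minimal sketch: export results to plain CSV, a non-proprietary format.
# Column names are illustrative, not a community standard.
import csv

rows = [
    {"sample_id": "S-001", "analyte": "nitrate", "value": 1.2, "unit": "mg/L as N"},
    {"sample_id": "S-002", "analyte": "sulfate", "value": 8.7, "unit": "mg/L"},
]

with open("results.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))
    writer.writeheader()  # one header row naming the columns
    writer.writerows(rows)
```

Any future tool, from a command-line script to a repository ingest pipeline, can read such a file without licensing or version concerns.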
At a minimum, if the data are deposited in a repository, the dataset should be given a DOI, and this DOI should be called out in the journal publication. Appropriate repositories could include sites run by a scientific organization, publisher, government agency, or university. Some repositories will be hosted on a single server while others might be distributed data management systems (e.g., CUAHSI HIS or the NASA-funded EOSDIS Distributed Active Archive Centers (DAACs, https://earthdata.nasa.gov/eosdis/daacs)). The latter are also sometimes referred to as portals because they point to data that are housed on servers distributed among participants. If a data repository is available for a specific type of data, then the editor or program manager or funder should encourage (or enforce) publication in that repository. To find the most appropriate data repository, the Enabling FAIR Data Project provides a search tool, Repository Finder (https://repositoryfinder.datacite.org/). As of October 2020, 63 repositories were listed within the Finder when the search term "geochemistry" was used. Several were branded to nations, others to geophysical data, and some to geochemical data explicitly. However, not all the data systems summarized in Table 1 are returned by the Finder.
Eventually, LTG scientists should use only repositories that are certified. But currently only a few government agencies, funders, publishers, universities, or community organizations have articulated guidelines for certification of repositories (RE3DATA.ORG, 2020; THE FAIRSHARING TEAM, 2020). For example, the US Geological Survey (USGS) defines a trusted digital repository as "one whose mission is to provide reliable, long-term access to managed digital resources to its customers, now and in the future." The Survey also stipulates five criteria for a "trusted digital repository" and provides an internal certification for such repositories (https://www.usgs.gov/about/organization/science-support/office-science-quality-and-integrity/trusted-digital-repository). Specifically, the repository must 1) accept responsibility for the long-term maintenance of the material that is archived on the site; 2) be able to support not only the repository but also the digital information within the repository; 3) show "fiscal responsibility and sustainability"; 4) follow commonly accepted conventions and standards; and 5) participate in system evaluations defined by the community. Some of the repositories certified on the USGS site are run by the USGS while others are run by other entities (e.g., the Incorporated Research Institutions for Seismology, or IRIS). Other data repository certification protocols are being developed, including one that currently has 16 requirements (CORETRUSTSEAL.ORG, 2020).
One point of unanimous agreement at the LTG workshop was that the specialized, targeted, and highly structured data repositories that are currently successful in managing data for specific communities, and that plot at the upper right on Figure 2, should be maintained as long as their community of users finds them useful. They should also be the preferred location for data publication in their respective sub-disciplines. In addition to promoting these specialized, targeted, highly structured databases with a strong track record for growth and security, workshop participants argued that funding agencies should promote development of less-structured, generalized long-term data repositories to provide homes for LTG data that do not at present have one (e.g., Table 3). These repositories can host almost any kind of dataset, without any requirements with respect to data structure. These generalized data repositories are not organized around a research question and thus are useful even as the science itself changes. They are instead organized by an entity (a library or university or country or funding agency, for example) or are associated with a broad scientific target topic (water, climate, etc.). Good examples that have been funded by U.S. federal agencies are CUAHSI HydroShare, EarthChem Library (described in Supplementary Material), the NASA DAACs (https://earthdata.nasa.gov/eosdis/daacs), the USGS's Sciencebase (https://www.sciencebase.gov/catalog/), and the DOE's ESS-DIVE. These generalized data repositories are not as rigid in their metadata requirements, do not provide rigorous data curation, and are simpler and more intuitive to use.
Workshop participants agreed that in addition to the specialized, targeted, and highly structured databases for specific types of data (e.g. PetDB, CUAHSI HIS, etc.) and the generalized and unstructured data repositories (EarthChem Library, HydroShare, ESS-DIVE), there must also be a path by which some datasets within the generalized repositories can grow and eventually nucleate into a targeted, structured, specialized database. Such a transition might organically occur when the volume of data reaches a critical or threshold value, when the need for the data becomes critical, or when the user base becomes large (BALL et al., 2004). At that point, presumably, funding for the transition would become defensible. Not every dataset or data type will follow this trajectory nor receive such resources, but for a small number of datasets, funding could be made available on a competitive basis within the standard proposal format.
The data systems that move all the way to the upper right on Figure 2 will likely answer specific, important, and compelling questions that enable meta-analysis for broad, enduring problems.
The data management ecosystem. The workshop group also considered two generalized scenarios within the overall trend of development of the ecosystem of data repositories for LTG. The first scenario that was discussed, a data "superstore", would lead to development of one large repository or portal for most LTG data, regardless of the country of origin, funding agency, university, sub-discipline, or investigator. For example, the LTG program at NSF could fund a data management system that was required for NSF-funded LTG science but open to non-NSF scientists. The second scenario, a "street bazaar" for data systems, would encourage development of many repositories for LTG data, all differing in data volume, data type (generalized or specific), access characteristics, etc., much as shown in Table 1.
In general, the first scenario, one large data management system or data superstore for all LTG data or all NSF-LTG data, was considered neither desirable nor feasible. First, LTG datasets are already distributed among many repositories across the world and even within the U.S.A., and many data are stored in sites managed by non-US and non-NSF scientists (for example, see Table 1). Likewise, some already-functioning specialized data management systems (Table 1) could be better places for LTG data publication than a generalized NSF-branded or LTG-branded repository. Furthermore, some datasets might be well-managed in different ways by different scientific communities in different data management systems with different data measurement protocols, promoting different types of science.
Likewise, LTG data from a single study might include multiple types of data that might be better spread across multiple repositories (the opposite might be true as well). Hence, multiple data repositories must be expected and should be encouraged, and a street bazaar of data management systems, scenario two, is inevitable and, in many ways, desirable. For example, as more repositories become available, competition will likely drive improvement in capabilities. Perhaps data providers will begin to choose data repositories in the same way and for the same reasons that they choose journals for their publications today.
This led to a discussion of the need for better search tools to navigate this bazaar of data efficiently. To enable this, LTG scientists must work directly with data specialists to develop better search tools and better ways to tag datasets so they can be easily discovered. Tagging, of course, requires development of controlled vocabularies by LTG scientists themselves. Data could be tagged either during or after uploading, and tagging could either be required by funders or journal editors, or it could be voluntary. Workshop participants argued that funding should be prioritized for cybertools to find the data that have been placed online in trusted secure data repositories and to cross-reference these data for samples with unique identifiers. This search paradigm is a low-cost solution for scientific domain repositories to expose their data holdings in a highly cross-disciplinary manner.
In effect, the LTG participants advocated that we change the paradigm from "build data repository, data will come" to "publish data online, cybertools will find." Participants argued for less money for building data repositories and more for improving the capabilities of tag and search. With this new paradigm, every data provider would put their data into an online data repository with appropriate metadata that are tagged during upload or after, enabling future data discovery. Some researchers might go into datasets posted by others and tag them, just as internet users tag online photographs for Google Search, and funding agencies could reward this activity if specific data types were deemed especially important. While this shift would mean that reusability and interoperability of data would not be possible until tagging and search tools became available, the data publication process would be less onerous for the data providers, and would likely result over the long run in more data uploads with appropriate metadata.
Examples of the type of search tools that are needed are beginning to appear. For example, the Data Observation Network for the Earth (DataONE) currently enables cross-search amongst registered member nodes using indexed metadata. DataONE, a community project, links data repositories as well as providing data search functionality (https://www.dataone.org/). Another example is Google Dataset Search, which is built around a metadata vocabulary and codes created and maintained by Schema.org.
Schema.org, only recently adapted to Earth science data through the NSF-funded EarthCube P418 (https://www.earthcube.org/p418) and P419 (https://www.earthcube.org/p419) projects, provides a structured vocabulary that can be used to encode metadata, keywords, and web URLs in a machine-readable format.
Google Dataset Search crawls these encoded datasets, extracts metadata attributes, and catalogs them for search. The result is a catalog of datasets from many different sources, including data repositories, that can easily be searched via datasetsearch.google.com or from a more community-specific portal such as GeoCodes (e.g., https://geocodes.earthcube.org/geocodes/textSearch.html). End users in different disciplines can query and discover data across scientific domains and disciplines from a single access point. Several data repositories already expose datasets in a manner that allows them to be found by Google Dataset Search. Such capabilities for dataset search again point to the need for controlled vocabularies that can be indexed to promote discovery of precise data types and terms.
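As a sketch of what such encoding looks like, the snippet below assembles schema.org Dataset metadata as JSON-LD, the representation that Google Dataset Search crawls. All values are placeholders, including the DOI and URL:

```python
# A sketch of schema.org "Dataset" metadata serialized as JSON-LD.
# Every value below is a placeholder for illustration.
import json

dataset = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example stream chemistry time series",
    "description": "Weekly major-ion chemistry for a hypothetical catchment.",
    "keywords": ["geochemistry", "water quality", "nitrate"],
    "identifier": "https://doi.org/10.xxxx/example",  # placeholder DOI
    "url": "https://example.org/dataset/123",         # placeholder landing page
    "variableMeasured": [
        {"@type": "PropertyValue", "name": "nitrate", "unitText": "mg/L as N"},
    ],
}

# Embedding this JSON in a dataset landing page (in a script tag of type
# application/ld+json) is what makes the page crawlable as a dataset.
print(json.dumps(dataset, indent=2))
```

Because the vocabulary is shared across disciplines, the same few fields make a soil profile, a spectral library, or a stream time series discoverable from one search box.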
Data harmonization. The slow growth of geochemical databases (LEHNERT AND ALBAREDE, 2019) despite the emerging importance of meta-analysis in many disciplines emphasizes that the geochemistry community neither prioritizes nor rewards systematic data publication in repositories or data standards. Journal editors and funders can mandate movement through the bottleneck in Figure 2, and data search cybertools can help. But none of the data repositories, whether specialized, targeted, and structured or generalized and unstructured, have required data harmonization: providers simply do not use the same methods of sample preparation, analysis, or reporting. For example, in the USGS National Water Information System, 32 different name-unit conventions are used for dissolved nitrate alone (SHAUGHNESSY et al., 2019). In some cases, monitoring networks and government agencies have imposed common standards across specific projects, but among the networks, agencies, and projects, no norms have been established. This data harmonization problem will only be resolved when LTG practitioners themselves develop and accept standardized formats and controlled vocabularies across their discipline. This in turn will likely only happen if the community begins to prioritize and reward integrated databases and meta-analyses. It is possible, however, that data harmonization never emerges organically from the community because of all the issues discussed earlier in this paper (e.g., differences in analytical needs for different types or provenances of samples, and the high premium placed by LTG on development of new analyses). If the datasets are crucial enough, agencies could require or impose harmonization. Alternately, an agency could fund groups to help communities develop reporting formats, along the lines of the community-driven strategy used to develop ESS-DIVE (Table 1).
The latter strategy involves many contributors brokering agreements (see, for example, http://essdive.lbl.gov/community-projects/).
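To make the harmonization problem concrete, the sketch below maps a handful of variant nitrate name-unit conventions onto one canonical name and unit (mg/L as N), using the mass fraction of nitrogen in the nitrate ion to convert concentrations reported as NO3. The variant labels are invented for illustration; real systems such as NWIS use many more:

```python
# Illustrative only: harmonize a few variant nitrate name-unit conventions
# into one canonical form ("nitrate", reported as mg/L as N).

N_PER_NO3 = 14.007 / 62.004  # mass fraction of N in the NO3- ion (~0.226)

# (reported name, reported unit) -> (canonical name, conversion factor to mg/L as N)
CANONICAL = {
    ("Nitrate", "mg/L as NO3"): ("nitrate", N_PER_NO3),
    ("Nitrate-N", "mg/L"):      ("nitrate", 1.0),
    ("NO3-N", "mg/l"):          ("nitrate", 1.0),
}

def harmonize(name: str, unit: str, value: float):
    """Return (canonical_name, value in mg/L as N), or None if unmapped."""
    if (name, unit) not in CANONICAL:
        return None  # flag for manual review rather than guessing
    canonical, factor = CANONICAL[(name, unit)]
    return canonical, value * factor

print(harmonize("Nitrate", "mg/L as NO3", 4.43))  # ('nitrate', ~1.0 mg/L as N)
```

The mapping table is the hard part: it encodes community agreement about which conventions are equivalent, which is exactly what LTG practitioners would have to negotiate.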
It is also possible that cybertools could help with the problem of harmonization. Specifically, some funders have promoted the development and expansion of "translators" or thesauruses for controlled vocabularies used for data. For example, Skosmos/OZCAR provides lists of closely related controlled-vocabulary terms and their sources. Searching at https://in-situ.theialand.fr/skosmos/theia_ozcar_thesaurus/en/ for "ammonium" yields a list of 17 cognates, organized as "Exactly Matching Concepts", "Closely Matching Concepts", and "Related Matching Concepts", with links to the source of each one. This thesaurus can be used in combination with web search tools to identify common data across multiple databases that use different vocabularies, controlled or otherwise.
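In the same spirit, a thesaurus lookup can be sketched as a small table of SKOS-style match lists. The entries below are invented for illustration and are not drawn from the actual Theia/OZCAR thesaurus:

```python
# A toy SKOS-style thesaurus lookup. The match lists are invented for
# illustration; the real Theia/OZCAR thesaurus is queried over the web.

THESAURUS = {
    "ammonium": {
        "exactMatch":   ["NH4+", "ammonium ion"],
        "closeMatch":   ["ammonium nitrogen"],
        "relatedMatch": ["ammonia"],
    },
}

def cognates(term: str) -> list:
    """Return all recorded match terms for a concept, strongest matches first."""
    entry = THESAURUS.get(term.lower(), {})
    out = []
    for kind in ("exactMatch", "closeMatch", "relatedMatch"):
        out.extend(entry.get(kind, []))
    return out

print(cognates("Ammonium"))  # prints ['NH4+', 'ammonium ion', 'ammonium nitrogen', 'ammonia']
```

A search tool armed with such match lists can recognize that records labeled "NH4+" in one database and "ammonium nitrogen" in another describe the same analyte.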
Such thesauruses will only be successful, however, if LTG scientists work closely with data scientists to bring cyber- and subject-matter expertise together. As pointed out for a related problem by SCHROEDER (2018), computers can help impose some harmonization, but if algorithms to relate datasets are not agreed upon, then cybertools cannot solve the problem. This points again to the need for education of LTG students to promote better systematic data management in the future.

Conclusions
The LTG community increasingly recognizes the value of data sharing, but more guidance and education of the community is needed to push this recognition forward toward systematic data sharing. A workshop was convened with LTG and data scientists to discuss the future "data-scape" for LTG. The group advocated a change in paradigm from "build a data repository, and data will come" to "publish data online, and cybertools will find them." This paradigm shift is powerful and currently tractable. It will require funding agencies to work together across the domains of basic science and information science in ways that promote holistic science. Workshop participants also supported the notion that highly structured and specialized data repositories are needed for some geochemical data, but that less structured and more generalized repositories are needed to lower the burden of data and metadata input and to support types of LTG data that do not fit easily into a structured system. This means the geochemical data-scape will grow to include both highly structured super-databases (e.g., PetDB and CUAHSI HIS) and less structured repositories (e.g., EarthChem, HydroShare, ESS-DIVE). As this data-scape emerges along with powerful cybertools for search, increasingly powerful answers to LTG questions will arise. All of these data transformations within LTG require a new emphasis on data science for LTG professionals and, especially, for training future generations of LTG scientists.

Data harmonization
The process by which a compilation of data of the same type of measurement is re-calculated or re-normalized into the same units, species, or reporting protocol so that meta-analysis of the large dataset can proceed directly from the data.

Data quality
The characteristics that determine if data are fit for the purpose intended, including accuracy, relevance, accountability, reliability, and completeness.

Data repository
A site where multiple datasets are archived together. Data repositories can be of many types, including general-purpose repositories that accept any type of data (e.g., Figshare, Dryad); funder, institutional, or national cross-domain repositories (e.g., ESS-DIVE, CUAHSI HIS); and domain-specific repositories that are theme-based (e.g., NCBI, PetDB). Repositories in the first two categories, and sometimes the third, typically issue DOIs. Importantly, a data repository may or may not require specific preparation, analytical methods, and/or data reporting styles.
Data set or database
A group of data values for a given project, with some metadata.

DOI
A unique digital object identifier that allows a researcher to find a published paper or dataset.
Distributed data system
A system where one can access data from multiple users but the datasets themselves reside on the providers' servers.

Portal
An online site that allows a user to find many datasets.
Quality assurance of data
A management approach that focuses on implementing and improving procedures so that problems do not occur in the data.
Quality control of data
An approach that seeks to identify and correct problems in the data product before the product is published.

Query
A request to find data with certain metadata characteristics (e.g., find groundwater data from Idaho).

Registration
Getting an unique identifier for a sample.
Relational database
A database that allows the user to find data related to one another by various metadata (e.g., are there data for porewater and mineralogy and organic matter for this soil horizon in this location?).

Sample
A physical entity that could be archived.

Template
Form with pre-set structure for data input.

Table 4. Ten Lessons Learned
1. Improving the accessibility of geochemical data promotes better science and better societal decision-making.
2. Although the data enterprise from measurement to meta-analysis is complex and fraught with opportunities for error, systematic management of data leads to improvements in data quality and promotes identification of large-scale trends.
3. Highly structured relational databases may be useful for LTG data where geochemical measurement protocols are reproduced identically in many locations on many samples, but less structured data repositories are needed for measurements that are implemented differently from sample to sample or place to place, or where protocols are developing or non-standardized.
4. The many goals and workflows of LTG scientists have resulted in a multitude of different data structures and metadata requirements, and a proliferation in the number of data repositories and portals housing or pointing to LTG data.
5. LTG scientists often resist sharing data in online data management systems.
6. The data-scape for LTG must encompass i) flexible management systems for datasets where measurement methods are less routine or still under development, ii) highly structured and managed data systems for datasets with established standards for measurement, and iii) a process whereby data systems can sometimes evolve from mode (i) to mode (ii).
7. Standards for data and metadata are only developed when the community or the science demands it.
8. LTG scientists need both archived (unchanging) and versioned (modified and updatable) datasets.
9. Where geochemical databases have been successful, they have been funded over long periods of time, organized by groups of dedicated scientists, and focused on specific data types.
10. The science community must prioritize which datasets will be managed online as shared relational databases because development and maintenance of such databases is very time- and resource-consuming.

Description of Some of the Specialized and Generalized Data Repositories
The trajectories of four data management systems in the USA, three funded by the National Science Foundation and one by the Department of Energy, are summarized to identify lessons learned.
EarthChem. Several specialized and targeted databases for rock chemistry were developed through NSF funding. The first, for oceanic crust and mantle data, was PetDB, which received funding from the US National Science Foundation from 1996 to 2010 (some of these funds were also used to develop EarthChem datasets). Many countries, however, chose to implement their own national data policies and their own national, institutional, or programmatic data repositories.
The PetDB team developed templates to format and document data for ingestion from publications. These templates required use of controlled vocabularies, and included essential metadata.
Much of this information had to be requested from authors (with varying degrees of success) because original publications often did not contain the requisite information, or it had to be retrieved from secondary sources (cruise reports, method publications, or theses). As the EarthChem databases grew, user submissions alone did not produce a complete set of data and metadata, largely because data providers did not have enough time or resources to upload all of their data in the required format. The highly cited PetDB database is maintained at a cost of about $250,000/year. This funds software engineers (who maintain the software for data ingestion, online search, and retrieval), data curators, and user support and training. Curators are paid to ensure that newly published data are included in the database, that unique identifiers are used, and that essential metadata are included. In 2012, NSF gave additional funds to the EarthChem team to participate in developing a new data model (ODM2) that could accommodate and manage both sample-based and time-series data (HORSBURGH et al., 2016).
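The template-and-controlled-vocabulary workflow described above can be sketched as a simple validation pass run before ingestion. The field names and vocabulary entries below are hypothetical, not PetDB's actual schema; the point is that a record either conforms or is returned to the curator with a list of problems.

```python
# Sketch of a pre-ingestion template check: required metadata present,
# and controlled fields drawn from the controlled vocabulary.
# Field names and vocabulary entries are hypothetical.
REQUIRED_FIELDS = {"sample_id", "material", "method", "value", "unit"}
CONTROLLED_VOCAB = {
    "material": {"basalt", "gabbro", "peridotite"},
    "unit": {"wt%", "ppm"},
}

def validate_record(record):
    """Return a list of problems with one submitted record (empty if valid)."""
    problems = [f"missing field: {f}"
                for f in sorted(REQUIRED_FIELDS - record.keys())]
    for field, allowed in CONTROLLED_VOCAB.items():
        if field in record and record[field] not in allowed:
            problems.append(f"uncontrolled term in {field}: {record[field]!r}")
    return problems

rec = {"sample_id": "S-001", "material": "Basalt",
       "method": "XRF", "value": 49.2, "unit": "wt%"}
print(validate_record(rec))  # flags "Basalt": case differs from controlled term
```

Returning a full problem list, rather than failing on the first error, matches how curators work with authors: one round trip can resolve every issue in a submission.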
In recognition of i) the difficulty of getting scientists to input data, ii) the worldwide proliferation of data repositories, iii) the need for both an archive of unchanging datasets and a repository that is continually updated, and iv) the need for an archive of versioned datasets, the EarthChem group developed another mechanism for data storage and publication in addition to specialized and targeted databases like PetDB: in 2010, the EarthChem Library (ECL) was launched as a generalized, unstructured, untargeted data repository where users can deposit their datasets for archiving and publication, with fewer rules about the types of data and metadata. Upon upload to the ECL, a dataset is assigned a DOI.
The ECL offers templates for a range of sample and data types to help users format and document their data in compliance with published requirements. All data types are accepted and, with use, the data types are becoming more diverse. Curators encourage providers to include metadata such as analytical or data-reduction methods and uncertainties. Today, many data uploaded to the ECL are not geochemical, because some practitioners use the ECL simply to publish data and obtain a DOI, regardless of the nature of the dataset.

CUAHSI Water Data Services
The Water Data Services of the Consortium of Universities for the Advancement of Hydrologic Science, Inc. (CUAHSI) support discovery, publication, storage, and re-use of water data and models, including biological, geochemical, and/or physical properties. CUAHSI deploys two primary domain-specific, open-source technologies for data archival: the Hydrologic Information System (HIS) and HydroShare (GITHUB, 2018). CUAHSI also provides resources for collaboration, education, and outreach. The first development in the data services was the HIS, designed with NSF funding from 2004 to 2006; the system is still being maintained and modernized for the latest standards and technology. The HIS is used to publish and discover hydrometeorological and geochemical time-series data using a distributed software architecture consisting of databases, a centralized catalog, and a web portal (HORSBURGH et al., 2010). ODM2 was adopted by HydroShare to support measurements, especially for sample-based data (HORSBURGH et al., 2016). Importantly, the HIS is a distributed data system that provides access to water data from individual researchers as well as organizations such as the National Aeronautics and Space Administration (NASA), the National Oceanic and Atmospheric Administration (NOAA), and the USGS, while translating the data into a common format. As of 2020, 94 organizations have published data in the HIS, collectively representing 1.12 million locations and 478 distinct properties. Fifty-three percent of these data are hydrologic; 16 percent are soil properties (mostly soil moisture); and 16 percent are water chemistry.
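The translation step that a distributed system like the HIS performs can be sketched as follows: each provider serves data in its own layout, and the system maps every layout into one common record shape (site, variable, time, value, unit). The provider layouts below are invented for illustration; the HIS itself uses standardized web services rather than ad hoc adapters.

```python
# Sketch: mapping provider-specific time-series layouts into one
# common record shape. Provider layouts here are hypothetical.
from datetime import datetime

def from_provider_a(row):
    # Provider A serves dicts: {"station", "param", "ts", "val", "unit"}
    return {"site": row["station"], "variable": row["param"],
            "time": datetime.strptime(row["ts"], "%Y-%m-%d"),
            "value": row["val"], "unit": row.get("unit", "unknown")}

def from_provider_b(row):
    # Provider B serves positional tuples: (site, iso_time, variable, unit, value)
    site, iso_time, variable, unit, value = row
    return {"site": site, "variable": variable,
            "time": datetime.fromisoformat(iso_time),
            "value": value, "unit": unit}

common = [
    from_provider_a({"station": "WS-3", "param": "discharge",
                     "ts": "2020-06-01", "val": 1.8, "unit": "m3/s"}),
    from_provider_b(("WS-3", "2020-06-01T00:00:00", "discharge", "m3/s", 1.9)),
]
print(common[0]["site"], common[1]["variable"])
```

Once every provider's records share one shape, a central catalog can index them uniformly, which is what makes cross-provider search and retrieval possible.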
Just as EarthChem first developed PetDB and later discovered the need for the ECL, CUAHSI realized that not all water-related data fit into a time-series data structure. Recognizing that the community collects more than just time-series data, the NSF funded the development of HydroShare (GITHUB, 2018), which was ultimately given to CUAHSI for operational responsibility. With the appropriate metadata, essentially any file type can be archived in HydroShare. HydroShare was designed to provide a variety of functions, such as automatic metadata extraction, resource sharing between users and groups of users, and support for integrating third-party external applications using a public application programming interface (API). CUAHSI's JupyterHub is an example of a third-party web application: it provides cloud-computing environments that are pre-configured with scientific languages, libraries, and models, and is specifically designed to leverage data archived in HydroShare for research and educational activities. Any user can link a third-party web application to the HydroShare system, which provides a convenient and powerful mechanism for extending the native capabilities of HydroShare.
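A public API of this kind lets a third-party application fetch resource metadata with a plain HTTP request. The sketch below shows the general pattern; the endpoint path follows the general shape of HydroShare's `hsapi` but should be checked against the current API documentation before use, and the JSON layout in the example payload is an assumption for illustration (no network call is made here).

```python
# Sketch of building a repository REST call and summarizing its JSON
# response. Endpoint path and payload fields are assumptions; consult
# the HydroShare API documentation for the real interface.
BASE = "https://www.hydroshare.org/hsapi"

def resource_url(resource_id):
    """Build the (assumed) metadata URL for one resource."""
    return f"{BASE}/resource/{resource_id}/sysmeta/"

def summarize(payload):
    """Pull a few fields of interest out of a metadata payload dict."""
    return {"title": payload.get("resource_title"),
            "type": payload.get("resource_type"),
            "public": payload.get("public", False)}

# Offline example with a hand-written payload; a real client would
# fetch resource_url(...) with urllib or requests and parse the JSON.
example = {"resource_title": "Shale Hills stream chemistry",
           "resource_type": "CompositeResource", "public": True}
print(summarize(example))
```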
Overall, the CUAHSI HIS is somewhat like the EarthChem PetDB in that it is a specialized database, albeit for time-series water data rather than sample-based rock chemistry data. Likewise, HydroShare and the ECL are similar in that both are generalized data repositories that can give DOIs to published datasets, though HydroShare is more restrictive in providing DOIs than EarthChem. As new capabilities have been developed and more data have become available through the HIS and HydroShare, adoption by the community has not been as rapid as hoped. CUAHSI has learned that several ancillary activities are required: (1) an effective, intuitive user experience (if users encounter difficulties in using the system, they often search elsewhere); (2) a sustained marketing effort (CUAHSI has found that adoption increases in response to presentations at meetings, short courses, and workshops); and (3) actionable metrics for evaluation. This last requirement arises largely because the community is multi-disciplinary, international, and diverse, and thus targets for adoption, use, effectiveness, data re-use, and data citations are challenging to identify.

ESS-DIVE
The ESS-DIVE team used their experience working on discipline-specific repositories in support of AmeriFlux, FLUXNET, and other projects, where they helped move a community from ad hoc data submission to well-structured data reporting. A similar trajectory is now being followed to gradually move the multi-scale, multi-domain ESS community from unstructured data to structured data in agreed-upon formats that include a minimal set of agreed terms, metadata, file organization, and vocabularies. These reporting formats are developed by forging agreement among the scientists collecting data. One example of a geochemically relevant format under development is a water/sediment chemistry reporting format. An early multi-year effort focused on adoption and testing of IGSN identifiers to develop best practices in sample tracking across different analyses and datasets. ESS-DIVE also took over responsibility for data stored at the repository of the Carbon Dioxide Information Analysis Center (CDIAC) as it ended. Moving the CDIAC data (which were structured using web pages and FTP links) to the new repository required a complete restructuring and metadata-gathering activities that took a year.
This taught the team about the importance and challenge of maintaining long-term accessibility.
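The sample-tracking practice described above can be sketched as an index keyed on a persistent sample identifier (IGSN-style), so that all measurements made on one physical sample can be found together across datasets. The identifier values and record fields below are invented for illustration.

```python
# Sketch: linking analyses from different datasets back to one physical
# sample via a persistent identifier. IGSN values here are made up.
from collections import defaultdict

def index_by_sample(datasets):
    """Group analysis records from many datasets by sample identifier,
    so all measurements on one physical sample are discoverable together."""
    index = defaultdict(list)
    for dataset_name, records in datasets.items():
        for rec in records:
            index[rec["igsn"]].append((dataset_name, rec["analysis"]))
    return dict(index)

datasets = {
    "porewater_2019": [{"igsn": "IEXYZ0001", "analysis": "major ions"}],
    "mineralogy_2020": [{"igsn": "IEXYZ0001", "analysis": "XRD"},
                        {"igsn": "IEXYZ0002", "analysis": "XRD"}],
}
print(index_by_sample(datasets)["IEXYZ0001"])
```

This is the payoff of registering samples before analysis: without a shared identifier, the porewater and mineralogy records for the same sample cannot be reliably joined.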