Data Science for Geoscience: Recent Progress and Future Trends from the Perspective of a Data Life Cycle

Data science receives increasing attention in a variety of geoscience disciplines and applications. Many successful data-driven geoscience discoveries have been reported recently, and the number of geoinformatics and data science sessions at geoscience conferences has begun to increase. Across academia, industry, and government, there is strong interest in learning more about the current progress and the potential of data science for geoscience. To address that need, this article provides a review from the perspective of a data life cycle. The key steps in the data life cycle include concept, collection, preprocessing, analysis, archive, distribution, discovery, and repurposing. Those subjects are intuitive and easy to follow, even for geoscientists with very limited experience of cyberinfrastructure, statistics, and machine learning. The review includes two key parts. The first covers the fundamental concepts and theoretical foundation of data science, and the second summarizes highlights and shareable experience from existing publications, centered on each step in the data life cycle. At the end, a vision of future trends in data science applications in geoscience is discussed, including topics on open science, smart data, and the science of team science. We hope this review will be useful to data science practitioners in the geoscience community, and will lead to more discussion of best practices and future trends of data science for geoscience.

in the fourth paradigm is not only computational science, but should also incorporate theories and methods from many other disciplines. Many later publications (Drineas and Huo, 2016; Kelleher and Tierney, 2018; NASEM, 2018a) resonate with Hey et al. (2009)'s vision of the theoretical foundation of data science. It is now a common understanding that data science sets its roots in basic research in computer science, mathematics, statistics, information science, and other disciplines. Successful data-driven scientific discovery also requires an open cyberinfrastructure and innovative pathways to enable the synergy of data science methods and domain-specific research questions.
Researchers in geoinformatics and geomathematics have also reviewed and discussed the evolution of information technologies in their work. Merriam (2004) listed six stages in the history of quantitative geology: Origins (1650-1833), Formative (1833-1895), Exploration (1895-1941), Development (1941-1958), Automated (1958-1982), and Integration (1982-). Ma (2018) added that since the early 2010s geoinformatics has been in the Intelligent stage. Recently, several review articles have summarized the latest trends in different aspects of data science in geoscience. Chan et al. (2016) and Shipley and Tikoff (2019) analyzed the changes that open data and cyberinfrastructure can bring to the workflow of geoscience disciplines such as sedimentary geology and structural geology. Gil et al. (2019) analyzed the characteristics of research challenges in geoscience, and then proposed a roadmap to develop and deploy knowledge-rich intelligent systems to address those challenges. Karpatne et al. (2019), Bergen et al. (2019), and Reichstein et al. (2019) thoroughly reviewed the challenges and opportunities of machine learning and deep learning for geoscience. Each of those three articles also has its own highlights. Karpatne et al. (2019) pointed out the synergistic advancement that such applications can bring to both machine learning and geoscience. Bergen et al. (2019) analyzed the larger function space and data-processing capability of machine learning in comparison with conventional approaches in geoscience. Reichstein et al. (2019) argued that data-driven machine learning should be coupled with spatial and temporal context to obtain a better process understanding of the Earth system, and thus to improve prediction.
The quick progress of big data and data science has inspired plans and schemes for data-driven geoscience research at a larger scale. In 2018, the Carnegie Institution for Science started the Deep-time Data Driven Discovery (4D) initiative (4D Initiative, 2018). In 2019, the International Union of Geological Sciences initiated the Deep-time Digital Earth (DDE) big science program. In the vision (NASEM, 2020) for the next decade of Earth science priorities within the U.S. National Science Foundation (NSF), open data and communities of practice on cyberinfrastructure needs and advances were made part of the key recommendations. We are now at a dramatic tipping point in science: a time when open data resources, cyberinfrastructure facilities, and new data science methods for analysis and visualization will change the way geoscientists conduct their research. Keys to discovery lie in the continued development, integration, and exploitation of facilities, data, and expertise to build and explore pathways to a deeper understanding of the evolving Earth. The review and analysis presented in this article will try to answer questions such as "What changes can data science bring to geoscience?", "What are the fundamental data science skills that a geoscientist should learn?", "What will be the patterns of data science applications in the next five or ten years?", and "As a student of geoscience, how can I quickly learn the data science methods and use them in my work?".
The author would like to note that this article is written from the perspective of a data life cycle. The data life cycle includes key steps such as concept, data collection, preprocessing, archive, distribution, discovery, analysis, and repurposing. The theme of each step is intuitive and easy to follow. Through this structure, the article summarizes shareable experience from existing studies with regard to data science workflows in geoscience. The author has tried to present a comprehensive list and review of existing publications; however, the presented analysis might not cover all the highlights of the cited publications. The remainder of the article is organized as follows. Section 2 gives a summary of key concepts in data science. Section 3 reviews a number of the latest publications on each step of a data life cycle. Section 4 analyzes the trends of data science in geoscience. Finally, Section 5 concludes the article.

The science of data science
To get a better understanding of the workflows in data science, it is necessary to know a few fundamental concepts. The author has taught database and data science classes for senior undergraduate and graduate students in recent years. The experience has shown that even students majoring in computer science can get confused about the meaning of data, metadata, information, and knowledge. Data are the recorded representation of facts. In today's digital era, the records are normally in digital form, such as plain text, spreadsheets, relational databases, and graph databases. Besides data on hard disks, data can also be recorded on other types of media, such as paper and tape. Archived records from earlier days, such as literature printed in hardcopy, can be digitized. Metadata are data about data. Metadata are important in data sharing and reuse because they give an overview of the background of the data. An end user can get a quick summary and understanding of a piece of data by just reading the metadata. Structured metadata can improve the performance of search engines and enable them to accurately index records and find the best match for a request. Information is the meaning or message extracted from data. The information extraction process often depends on the purpose of data analysis, the methods and tools used, and the interpretation of the data analysis results. It is not unusual for the same piece of data to be used in studies of different topics to generate different information. Knowledge is expertise in and familiarity with a topic. In the traditional understanding, a human gains knowledge through learning, practice, and experience. In data science, there are now knowledge bases that store knowledge in quantitative and qualitative formats, which can in turn be used in the data analysis process.
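To make the distinction between data and metadata concrete, the following is a minimal, hypothetical sketch: a small data record alongside a metadata record describing it, loosely following common metadata fields (the field names, values, and lab name are all invented for illustration, not taken from any actual standard or dataset).

```python
# Hypothetical example: a small dataset and a metadata record describing it.
data = [
    {"sample_id": "S-001", "mineral": "quartz", "SiO2_wt_pct": 99.2},
    {"sample_id": "S-002", "mineral": "calcite", "SiO2_wt_pct": 0.4},
]

metadata = {
    "title": "Example mineral sample measurements",
    "creator": "Hypothetical Geoscience Lab",
    "dateCreated": "2021-01-15",
    "variables": ["sample_id", "mineral", "SiO2_wt_pct"],
    "license": "CC-BY-4.0",
}

# A user can assess the dataset from the metadata alone, without
# downloading or parsing the data itself.
summary = f"{metadata['title']} ({len(metadata['variables'])} variables)"
```

This illustrates why metadata matter for reuse: the `summary` line answers "what is this and can I use it?" without touching the data records themselves.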
The three concepts of data, information, and knowledge are also used together with other concepts, such as wisdom and action, to form a pyramid or flowchart that depicts the ability to use knowledge and insight gained from data to think and act in real-world practice (Figure 1a).

Figure 1. (b) the Data Documentation Initiative (DDI) data life cycle (DDI Alliance, 2021); (c) the cross-industry standard process for data mining (CRISP-DM) (Chapman et al., 2000); (d) the data life cycle in data science (Wing, 2020); (e) the data life cycle and surrounding data ecosystem (Berman et al., 2018); and (f) the data science process (Schutt and O'Neil, 2013).
Many researchers and communities have depicted the data life cycle and the data science process. Figure 1 presents a selection of diagrams from existing publications (Chapman et al., 2000; Schutt and O'Neil, 2013; Berman et al., 2018; Wing, 2020; DDI Alliance, 2021). Most of them are easy to read and understand, so we omit a detailed description of each. Nevertheless, some shared topics in those diagrams are worth highlighting. For instance, the data life cycles presented in Figures 1b and 1e both include the steps of data sharing, publication, and reuse. The step of data processing in Figures 1b and 1c actually means data cleansing, wrangling, and munging, which is similar to the step of data preprocessing in Figure 1f. In Figures 1d and 1f, the steps of visualization and interpretation address the needs of meaningful data science, i.e., to appropriately interpret the results of data analysis. This includes not only the precision and efficiency of algorithms but also the domain-specific meaning in the outputs of those algorithms. Also, the issues of data privacy and ethics have received more attention and discussion in recent publications that highlight data science as an ecosystem (Figures 1d and 1e).
The emergence and evolution of data science are the result of interdisciplinary collaboration. Donoho (2017) offered a thorough review of data science's evolution over the past decades. In particular, he summarized the perspectives of several statisticians on the need to expand the boundaries of classical statistics to cover topics of data preparation, presentation, and prediction. The review mentioned that the term "data science" was already used two decades ago by Cleveland (2001) for the envisioned new field. Recent discussions have clearly stated that the field of data science should be interdisciplinary, drawing on computer science, statistics, mathematics, information science, and progress in subject-matter applications (Drineas and Huo, 2016; Kelleher and Tierney, 2018). Those discussions are reflected in data science courses and their curricula. A recent National Academies of Sciences, Engineering, and Medicine report (NASEM, 2018a) summarized that a critical task of data science education is to establish data acumen, which includes these key concepts: mathematical foundations, computational foundations, statistical foundations, data management and curation, data description and visualization, data modeling and assessment, workflow and reproducibility, communication and teamwork, domain-specific considerations, and ethical problem solving. Those topics of data acumen are reflected in the data life cycle and data science process (Figure 1) to address the real-world needs of data science applications. Several universities have already opened data science courses. For example, the University of California, Berkeley's Data 8: Foundations of Data Science is for entry-level undergraduates in any major (Adhikari and DeNero, 2017). Its curriculum covers most of the subjects in the above data acumen list.
Many geoscience and geoinformatics researchers have analyzed the science of data science based on their experience from real-world practice. Mattmann (2013) discussed four advancements that are necessary to tackle the challenges of big data: algorithm integration, software development and stewardship, automated data format identification and reading, and training of data scientists. Fox and Hendler (2014) argued that the field of data science includes not only the disciplinary foundations but also strategies for real-world challenges. They gave details on four cross-cutting data science challenges: understanding scale in systems, sparse systems with incomplete and heterogeneous data, abductive reasoning, and next-generation semantic data infrastructure. Here abductive reasoning is similar to the Exploratory Data Analysis proposed by Tukey (1977), "a willingness to look for those things that we believe are not there, as well as those we believe to be there." Ho (1994) summarized that abduction creates, deduction explicates, and induction verifies. This means abduction is a good way to find clues to scientific questions through the activities of data exploration. Hazen (2014), based on his experience of data-driven studies in mineralogy, further summarized that deduction and induction discover what we know we do not know, while abduction discovers what we do not know we do not know. Recognition of the data science myths pointed out by Kitchin (2014) and Kelleher and Tierney (2018) is important to avoid unrealistic expectations. The myths are: 1) data science is an autonomous process that needs no human oversight; 2) every data science project needs big data and machine learning; 3) data science software is easy to use and data science is an easy job; and 4) data science pays for itself quickly. Awareness of those myths will help geoscientists understand the limitations of data science and get better prepared for problem-solving in the real world.

A reflection on the key steps of a data life cycle
Focusing on the theme of data science for geoscience, the following sub-sections will review a list of recent publications for each key step in the data life cycle, and summarize the shareable experience from them.

Business understanding and concept
The steps labeled "concept" in Figure 1b and "business understanding" in Figure 1c are intended to determine the objectives of a data science project and estimate the data needs (Chapman et al., 2000; DDI Alliance, 2021). They are about turning business goals into data science plans. If the planned activities include database construction, this step will also include work on data structures, such as the conceptual model, logical model, and physical model, as well as controlled vocabularies for data standardization. Researchers of cyberinfrastructure have recognized that consideration of and action on data semantics at an early stage help improve data interoperability when data are generated, collected, integrated, and shared at a later stage (Reitsma et al., 2009; Narock and Shepherd, 2017).
The Semantic Web extends the World Wide Web by adding structure and meaning to terms in documents on the Web (Berners-Lee et al., 2001). The key technical approach to enabling the Semantic Web is the use of ontologies, which are formal specifications of a shared conceptualization of a domain (Gruber, 1995). Researchers have suggested a semantic spectrum, consisting of a sequence of items such as catalog, glossary, taxonomy, thesaurus, conceptual schema, and formal logical model, for constructing and implementing ontologies in practice (Welty, 2002; McGuinness, 2003; Obrst, 2003; Uschold and Gruninger, 2004). The items in this spectrum provide a roadmap for increasing the semantic precision and interoperability of data in a variety of applications.
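A controlled vocabulary sits near the simple end of that semantic spectrum, and even this modest step makes data integration far easier. Below is a minimal sketch, with an entirely hypothetical synonym table, of how free-text terms from heterogeneous sources can be normalized to canonical vocabulary terms.

```python
# Hypothetical mini-vocabulary mapping free-text variants to canonical terms,
# a small step on the semantic spectrum between a glossary and a taxonomy.
SYNONYMS = {
    "qtz": "quartz",
    "quarz": "quartz",       # common misspelling
    "quartz": "quartz",
    "calc-spar": "calcite",  # historical name
    "calcite": "calcite",
}

def normalize(term: str) -> str:
    """Return the canonical term, or the input lowercased if unknown."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)

# Heterogeneous inputs from different data contributors become consistent:
records = ["Qtz", "calc-spar", "QUARTZ"]
canonical = [normalize(r) for r in records]
```

Once terms are canonical, records from different databases can be matched, counted, and queried together, which is exactly the interoperability payoff the semantic spectrum aims at.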

Figure 2. Comparing the layered structure of data interoperability with the Semantic Web architecture and the FAIR data principles (from Ma et al., 2020). For sources of sub-diagrams see the description in the text.
Data interoperability has received tremendous attention in recent years. The widely accepted FAIR (Findable, Accessible, Interoperable, and Reusable) data principles (Wilkinson et al., 2016) have a close relationship to the discussion of data interoperability over the past decades (Figure 2). Several researchers have presented the layered structure of data interoperability, including systems, syntax, schematics, semantics, and pragmatics (Bishr, 1998; Sheth, 1999; Ludäscher et al., 2003; Brodaric, 2007, 2018). A few other researchers explained those layers in layman's terms: discoverable, accessible, decodable, understandable, and usable (Wood et al., 2010; Ma et al., 2011). The layered structure of data interoperability and the FAIR principles can also be compared with the technical architecture of the Semantic Web (Berners-Lee, 2000). Many best practices of data interoperability can be seen in the domain of geoscience. The U.S. National Geologic Map Database of the USGS has taken the North American Geologic Map Data Model (NADM) (NADM Steering Committee, 2004) as a common schema for coordinating state-level geologic map databases. Such efforts on standards are continuously active at the USGS, such as the recently released Geologic Map Schema (GeMS) (USGS NCGMP, 2020). Similarly, NASA has implemented the Global Change Master Directory (GCMD) Keywords as a hierarchical set of controlled vocabularies to ensure the interoperability of its data and services (GCMD, 2020). In Europe, the INSPIRE Directive aims to create a European Union spatial data infrastructure (Bartha and Kocsis, 2011; Ma and Fox, 2014). Its data and metadata specifications cover 34 data themes in the Earth and environmental sciences, with full implementation required by 2021 across all the participating European nations.
Scientific communities such as the World Wide Web Consortium and the Open Geospatial Consortium have also summarized best practices of publishing and serving data on the Web (Loscio et al., 2017;Tandy et al., 2017).

Data understanding, generation and collection
Along with the quick development of hardware and software in the cyberinfrastructure, data are now generated at an ever-increasing speed. Sensor networks (Martinez et al., 2004; Hart and Martinez, 2006) greatly facilitate the generation, transmission, and integration of Earth and environmental data. NASA organizes about 100 missions and thousands of platforms, instruments, and sensors around the Earth and in nearby space, and is one of the biggest geoscience data producers in the world. It was reported (Shannon, 2019) that in 2016 NASA was already generating 12.1 TB of data every day, and that NASA was deploying new sensors which alone will be able to generate 24 TB of data every day. Similar advances in instruments and facilities for data generation, transmission, and management have also been seen in field-based geological surveys (Mookerjee et al., 2015). Wing (2019) made a distinction between data generation and collection, and discussed that not all data generated are collected (Figure 1d). This might be because we only want to collect a certain part of the data, or because the velocity of the data streams is too high for them to be processed with existing tools.
Crowd-sourcing platforms, such as social media and community portals, are generating massive amounts of data. Many of our daily activities, such as posting on Twitter or Facebook, watching and commenting on a video on YouTube, and searching on Google, generate digital records in ways that many of us are not even aware of. A lot of such social media data are used for scientific studies. For example, Twitter data were used for wildfire disaster management (Wang et al., 2016). Google search data were used for predicting the trends of seasonal influenza (Carneiro and Mylonakis, 2009). The community collaboration on OpenStreetMap greatly helped the rescue work after the 2010 Haiti earthquake (Ahmouda et al., 2015). The images on Flickr were used for ecosystem assessment in remote areas (Rossi et al., 2019). Besides the public social media, another type of crowd-sourcing platform focuses on a certain subject and is normally maintained by a community of enthusiasts. For example, Mindat.org is such a community platform focusing on mineral species. It has a small team of database administrators and data reviewers, and is open to thousands of data contributors and users across the world. Researchers have used data from Mindat in many recent studies on mineral evolution and mineral ecology (Hazen et al., 2011; Morrison et al., 2020).
The massive geoscience literature is another good source for collecting data. For example, GeoDeepDive (Zhang et al., 2013; Peters et al., 2014) is a machine learning package for discovering data and knowledge in published documents. By January 2021, it had preprocessed more than 13 million documents. Peters et al. (2014) successfully used the fossil records extracted by GeoDeepDive to enhance the Paleobiology Database. GeoDeepDive also provides an interface for other researchers to use the data resource to explore their own scientific topics. Very recently, there have also been many studies on using text mining technologies to extract knowledge graphs from the geoscience literature (Wang et al., 2018; Qiu et al., 2019).

Data preprocessing and preparation
Data preprocessing is an increasingly important step in data science. It also goes by several alternative names, such as data cleansing, data wrangling, and data munging. The general purpose of data preprocessing is to ensure the quality of data before any data analysis is conducted. In real-world practice, it might involve tasks such as cleaning out noisy and unreliable records, reducing data dimensionality, transforming data formats, selecting records of interest, enriching existing data with additional attributes, and combining data from different sources to build a new piece of data (Wang et al., 2018). Many researchers (Press, 2016; Mons, 2018), including geoscientists (Fox, 2019), spend 80% of their time on data cleansing and preparation before the actual data analysis (i.e., the 80/20 rule). Good data preprocessing can significantly increase the efficiency of data analysis and lead to remarkable scientific discoveries. For example, the above-mentioned Mindat data portal was used as a source for the Mineral Evolution Database. Nevertheless, a limitation of the original Mindat is that it lacks an age attribute recording a mineral species' first occurrence time on Earth. Golden et al. (2019) searched over 1,600 publications and several existing databases to extract such age data and then used them to enrich the Mineral Evolution Database. The updated database underpinned many new research discoveries, including mineral evolution and ecology (Morrison et al., 2020) and the co-evolution of the geosphere and the biosphere (Spielman and Moore, 2020). The database also inspired new designs of mineral species databases and discussions on better ways for data curation and sharing (Prabhu et al., 2021).
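A few of the preprocessing tasks named above (filtering unreliable records, transforming a format, and selecting attributes of interest) can be sketched in a few lines. The records, field names, and quality-control flag below are hypothetical, chosen only to make the tasks concrete.

```python
# Hypothetical raw sensor records: some are flagged as unreliable, and the
# temperature is stored as Fahrenheit text rather than Celsius numbers.
raw = [
    {"site": "A", "temp_f": "68.0", "qc_flag": "good"},
    {"site": "B", "temp_f": "NaN",  "qc_flag": "bad"},   # noisy record
    {"site": "C", "temp_f": "75.2", "qc_flag": "good"},
]

def fahrenheit_to_celsius(f: float) -> float:
    return (f - 32.0) * 5.0 / 9.0

# Preprocessing: drop unreliable records, convert units, keep needed fields.
clean = [
    {"site": r["site"],
     "temp_c": round(fahrenheit_to_celsius(float(r["temp_f"])), 1)}
    for r in raw
    if r["qc_flag"] == "good"
]
```

Real pipelines do the same things at scale with libraries such as pandas, but the logical steps (filter, transform, select) are exactly the ones shown here, and they are where the "80%" of the 80/20 rule is typically spent.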
Applying data standards to transform existing data or to mediate between databases is also a widely used approach in data preprocessing and preparation. The above-mentioned metadata and data specifications in the INSPIRE Directive are a good use case for that approach. Another example is the global OneGeology project for improving the accessibility of geological maps on the Internet (Jackson, 2010). OneGeology has developed a toolkit for setting up online services of geologic maps. More than 110 countries have participated in the project, and about half of them are serving map data to a web map portal. The original maps are heterogeneous because they are recorded in different formats and use different data models, terminology, and languages. Through the OneGeology map service toolkit, the online services of those maps are made consistent with each other and can be browsed in a centralized map window. The OneGeology-Europe project (Laxton, 2017) carried out additional work on multi-lingual vocabularies and used them to develop innovative capabilities for the online geologic maps of participating European nations. New functions of OneGeology-Europe include a multi-lingual user interface, federated queries across distributed geologic map services, consistency with other regional and international data standards, and more. As reflected in those examples, well-organized data preprocessing and preparation can significantly change the 80/20 rule in data science activities, or even reverse it.
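The mediation idea behind such standards efforts can be sketched as a field-name mapping from heterogeneous source schemas to one common schema. The mappings and records below are entirely hypothetical, made up in the spirit of NADM/GeMS-style harmonization rather than taken from those actual schemas.

```python
# A common target schema and per-source field mappings (all hypothetical).
COMMON_FIELDS = ("unit_name", "age", "lithology")

MAPPINGS = {
    "db_one": {"unit_name": "UnitLabel", "age": "GeoAge", "lithology": "RockType"},
    "db_two": {"unit_name": "name", "age": "epoch", "lithology": "lith"},
}

def to_common(record: dict, source: str) -> dict:
    """Translate one source record into the common schema."""
    mapping = MAPPINGS[source]
    return {field: record[mapping[field]] for field in COMMON_FIELDS}

# Two records from differently structured databases become comparable:
r1 = to_common({"UnitLabel": "Tg", "GeoAge": "Eocene", "RockType": "granite"}, "db_one")
r2 = to_common({"name": "Qa", "epoch": "Holocene", "lith": "alluvium"}, "db_two")
```

This is the essence of schema mediation: the source databases stay as they are, and a thin translation layer makes their records consistent for a centralized portal or federated query.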

Data archive, distribution, and discovery
It is now the new normal that funding agencies require researchers to include a data management plan in their grant proposals (Dietrich et al., 2012; NSF, 2015). Increasingly, data are treated as a formal research output and receive the same attention as paper publications. The FAIR data principles (Wilkinson et al., 2016) are now well received in almost all scientific disciplines, including geoscience (Lannom et al., 2020). The FAIR data principles build on many preceding efforts on data management and stewardship, and represent a systematic approach to sharing and reusing scientific data in open science. Those efforts include data infrastructure construction (Cutcher-Gershenfeld et al., 2016), persistent and resolvable identifiers for data publication (Klump et al., 2016), metadata standardization (Starr and Gastl, 2011), provenance documentation (Lebo et al., 2013), data citation (Parsons et al., 2010), and more. There are many general-purpose data portals where researchers can upload and share their data. Moreover, there are specific data portals that focus on only one or a few subjects, such as petrology, geochemistry, and geophysics. Data-producing agencies such as NASA, USGS, NOAA, and USDA all have their own data archives and data portals that allow users to search and access data of interest. For instance, USGS enables federated queries to a long list of mineral resources spatial databases through a central portal (USGS MRDATA, 2021). As workflow platforms such as Jupyter Notebook and R Markdown are increasingly used, many data portals have also developed packages to enable data access from workflow platforms, such as the paleobioDB R package for the Paleobiology Database (Varela et al., 2015) and the neotoma R package for the Neotoma Paleoecology Database (Goring et al., 2015).
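Under the hood, packages like those R clients typically wrap a portal's web API. The sketch below shows the general pattern of programmatic data access from a workflow platform; the endpoint URL and parameter names are invented for illustration and do not belong to any actual portal's API.

```python
# Hypothetical sketch: building a query URL for a web data service, the
# kind of call a notebook-based workflow would make to a data portal API.
from urllib.parse import urlencode

BASE = "https://example-data-portal.org/api/occurrences"
params = {"taxon": "Tyrannosaurus", "interval": "Cretaceous", "format": "json"}
query_url = f"{BASE}?{urlencode(params)}"
# In a notebook, the next step would be to fetch and parse the response,
# e.g. with urllib.request.urlopen(query_url) or the requests library.
```

Because the query is just code, it can live inside a Jupyter or R Markdown document, which makes the data-access step of an analysis reproducible rather than a manual download.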
The FAIR data principles put findability in first place. Indeed, from the perspective of a user, data discovery is a key step if the user's work needs to access data in external databases or data portals. A top-down approach can be used to search records in data portals with specific themes, such as EarthChem (earthchem.org), PANGAEA (pangaea.de), Neotoma (neotomadb.org), PaleoBioDB (paleobiodb.org), and many data portals organized by federal agencies. Moreover, there are also registries for metadata from multiple data portals, such as DataONE (dataone.org), as well as registries of data portals, such as RE3DATA (re3data.org). On those data portals, a user can quickly narrow down the scope of a search by selecting disciplines, subjects, geospatial range, time span, and other attributes. Another approach to data discovery is free-style search, such as that enabled by Schema.org (Noy et al., 2019). By providing metadata through the Schema.org specifications, the records in a data portal are made indexable to search engines. For example, Google has indexed millions of datasets on thousands of data portals and made them searchable through the Dataset Search engine (Noy et al., 2019). A user can search datasets with any combination of keywords. Once a dataset is identified on the search engine, the user can access it through the web address provided in the metadata. Recently, there have also been discussions about dataguides, a type of computer-aided analysis that can inform researchers about what data to collect and where to find them (Shipley and Tikoff, 2019).
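To show what Schema.org-style dataset markup looks like, here is a minimal sketch generating a Schema.org `Dataset` description as JSON-LD. `@context`, `@type`, `name`, `description`, `keywords`, and `url` are genuine Schema.org terms; the values and the example URL are hypothetical.

```python
# Minimal Schema.org "Dataset" metadata, serialized as JSON-LD. Publishing
# this kind of markup on a dataset's landing page is what makes the record
# indexable by dataset search engines.
import json

dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example seafloor sediment samples",
    "description": "Hypothetical example dataset for illustration.",
    "keywords": ["sediment", "seafloor", "geoscience"],
    "url": "https://example.org/dataset/123",
}

markup = json.dumps(dataset_jsonld, indent=2)
```

In practice the JSON-LD is embedded in the landing page inside a `<script type="application/ld+json">` element, where search-engine crawlers pick it up.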

Data analysis and result interpretation
Many people simply equate data science with data analysis. Indeed, data analysis is a key step in the data life cycle, but it is just one part of the process. In the past decades, there have been many studies of the theories and applications of statistical models and data mining in geoscience (Merriam, 2004; Sagar et al., 2018). In recent years, the fast-growing methods and technologies of big data (Yang et al., 2017, 2019), cloud computing (Li et al., 2015; He et al., 2019), machine learning (Lary et al., 2016; Bergen et al., 2019; Karpatne et al., 2019), and deep learning (Reichstein et al., 2019) have been widely used in geoscience and achieved significant outcomes. Many innovative data-driven discoveries have been seen in paleobiology (Peters et al., 2017), paleontology, mineralogy (Hystad et al., 2015), water resources (Wen et al., 2018; Sun and Scanlon, 2019), forest cover change (Hansen et al., 2013), and public health (Goovaerts, 2008, 2020). The data analysis step often includes two stages: exploratory data analysis and confirmatory data analysis (Figure 1f). This is a conventional method in statistics, and it can still be very useful for today's data science applications. Exploratory data analysis is used to get a better understanding of the data and to draw plausible research questions or hypotheses (Tukey, 1977; Camizuli and Carranza, 2018). Confirmatory data analysis, in contrast, is where more sophisticated models and/or algorithms are applied to prove or disprove the hypotheses. Data visualization has been increasingly discussed as an efficient way to improve the understandability of a data science process and the interpretability of data science results (Fox and Hendler, 2011; Ma et al., 2015; Wing, 2019). Data visualization means not only making the information visible but also making it easy for a reader to perceive.
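The two stages can be illustrated in a few lines: an exploratory look at summary statistics suggests a hypothesis, and a confirmatory statistic then tests it. The measurements below are invented, and the Welch t-statistic is computed from scratch only to keep the sketch self-contained; in practice one would use a statistics package and compare the statistic against a t-distribution.

```python
# Hypothetical measurements (e.g. grain sizes in mm) from two field sites.
import statistics

site_a = [2.1, 2.4, 2.2, 2.5, 2.3]
site_b = [1.6, 1.8, 1.7, 1.9, 1.5]

# Exploratory stage: summary statistics suggest the sites differ.
mean_a, mean_b = statistics.mean(site_a), statistics.mean(site_b)

# Confirmatory stage: Welch's t-statistic for the difference in means.
def welch_t(x, y):
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (statistics.mean(x) - statistics.mean(y)) / ((vx / len(x) + vy / len(y)) ** 0.5)

t = welch_t(site_a, site_b)
# A large |t| supports the hypothesis; the p-value would come from
# comparing t against a t-distribution with the appropriate degrees of freedom.
```

The division of labor matters: exploration is allowed to be opportunistic and visual, while confirmation applies a pre-stated test so that the hypothesis is not tailored to the same patterns that suggested it.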
Many people might think of visualization as just a way to present data science results, but in actual practice many data visualization techniques can also be used in data preprocessing and analysis. For example, the box plot is a widely used visualization in exploratory data analysis. Ma et al. (2017) used a three-dimensional cube matrix to explore the relationships between chemical elements and mineral species, and generated new research questions for detailed analyses. Morrison et al. (2017) applied network analysis to visualize the patterns of co-existence of minerals. In Dutkiewicz et al. (2015), machine learning was used to generate new hypotheses based on the analysis of big seafloor sediment data; the GPlates software (Müller et al., 2018) was used as a data visualization platform in that study and generated impressive results. Those examples show that data visualization is an efficient approach to facilitate collaboration among geoscientists, data scientists, mathematicians, and data managers, and to make the data science process and results understandable to a broader audience.
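The box plot mentioned above is a good example of visualization doing analytical work: the quantities it draws (the quartiles and the whisker fences) double as a screening tool for suspect records during preprocessing. The sketch below computes that five-number backbone with the standard library, using made-up measurement values; the 1.5 x IQR rule is the conventional whisker heuristic, not a universal outlier definition.

```python
# The numbers behind a box plot: quartiles, interquartile range, and the
# conventional 1.5 * IQR fences used to flag candidate outliers.
import statistics

values = [4.2, 4.8, 5.1, 5.3, 5.6, 5.9, 6.4, 6.8, 7.1, 12.5]

q1, q2, q3 = statistics.quantiles(values, n=4)  # quartiles (exclusive method)
iqr = q3 - q1

# Points beyond the whisker fences are drawn individually on a box plot
# and are worth inspecting during preprocessing.
outliers = [v for v in values if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
```

A plotting library would render these same numbers as the box, whiskers, and flyer points; computing them directly shows why the plot is useful before any modeling begins.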

Repurposing
Repurposing means that a piece of data can be reused in other projects, either by external users or by the data producers themselves. Data interoperability and reusability are the focus of this step. The FAIR data principles, as well as the open data and open science campaigns, suggest that metadata should include the provenance information of the original research activities that generated the data (Di et al., 2013; Gil et al., 2016; Wilkinson et al., 2016; Zeng et al., 2019; Lehmann et al., 2020). In the best practice, besides sharing the data, researchers should also document their software packages, workflow setup, and the context information that interconnects the entities, agents, and activities in a research program. Open data and open science are helping change the culture of research and create a virtuous data ecosystem in geoscience (Sinha et al., 2010; Donker and van Loenen, 2017; Caron, 2020). Many new scientific discoveries are based on research activities that use "other people's data" (Berman et al., 2018; Kelleher and Tierney, 2018; Wing, 2019).

From big data to data science ecosystem: a vision on the next decade
Along with the evolution of data science theory and methodology, the upgrading of computational facilities and capabilities, the thriving of big data and open data in geoscience, and the training of geoscientists with data science skill sets, it is certain that data science will see much wider application in geoscience and will lead to more scientific discoveries. What will the trends in methodology and technology be, and what should geoscientists be aware of, and better prepared for, in the data revolution? This section offers a few thoughts.

Open data and open science will be the new normal
The concept of open science is being widely accepted in academia (Donoho, 2017;NASEM, 2018b;Aspesi and Brand, 2020;Berendt et al., 2020). For example, data will become more open, more accessible, and more interoperable through various protocols and interfaces, such as those maintained by the World Wide Web Consortium and the Open Geospatial Consortium. Facilitated by the FAIR principles and other associated efforts, shared data will be better curated, which will save researchers' time on data preprocessing and preparation. The USGS mineral resources spatial data portal is an example of that trend (USGS MRDATA, 2021). The Giovanni infrastructure of NASA (Acker and Leptoukh, 2007) has also been working on cooperation among NASA's distributed data archives to enable federated data exploration and comparison (Lynnes, 2020). Relatedly, a key idea in the vision of the Semantic Web (Berners-Lee et al., 2001) is the persistence and traceability of resources on the Web. Similar to the Digital Object Identifier (DOI) for publications, many other entities and agents in open science, such as data, software packages, samples, researchers, organizations, and research grants, will also have persistent and resolvable identifiers on the Web. By connecting those identifiers, we can weave a graph of all the objects, steps, and workflows involved in the generation of a scientific finding.
Workflow platforms such as Jupyter Notebook, R Markdown, and others will be widely used in geoscience, from research projects to classroom education. These platforms are not only good tools for collaborative and reproducible research, but also well-organized environments for students to learn and use programming languages. Many geoscience data portals now offer Python or R packages that let users search and access data directly from a workflow, and there have been various successful applications in geoscience (Varela et al., 2015;Peters and McClennen, 2015;Choi et al., 2020;Rosenberg et al., 2020). We anticipate that workflow platforms will become even more popular in geoscience. Similar to the call from computer scientists and data scientists for trustworthy Artificial Intelligence (Floridi, 2019;Wing, 2020), geoscientists have also expressed the need for provenance in their workflows (Gil et al., 2019). Recently, packages for capturing provenance have been developed on workflow platforms. For example, the MetaClip framework (Bedia et al., 2019) can capture a provenance description of a climate product and embed it inside the resulting image; once that image is loaded into the MetaClip Web portal, the provenance information is read and visualized. To tackle large datasets, researchers have begun to deploy workflow platforms in cloud environments (Hamman et al., 2018;Sun et al., 2020), which will be a trend for big geoscience data processing in the near future.
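The basic idea behind capturing provenance in a workflow can be sketched in a few lines: wrap each analysis step so that its inputs and outputs are logged alongside the result. This is only a minimal illustration of the concept; frameworks such as MetaClip record far richer, standardized descriptions, and the function and log names below are our own.

```python
import functools
import time

# In a real system this log would be serialized and stored with the outputs.
PROVENANCE_LOG = []

def record_provenance(step):
    """Wrap a workflow step so every call is logged with inputs and outputs."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        result = step(*args, **kwargs)
        PROVENANCE_LOG.append({
            "step": step.__name__,
            "inputs": {"args": repr(args), "kwargs": repr(kwargs)},
            "output": repr(result),
            "timestamp": time.time(),
        })
        return result
    return wrapper

@record_provenance
def detrend(series):
    """Remove the mean from a series (a stand-in for a real analysis step)."""
    mean = sum(series) / len(series)
    return [x - mean for x in series]

detrend([1.0, 2.0, 3.0])
print(PROVENANCE_LOG[0]["step"])  # → detrend
```

Chaining such wrapped steps in a notebook yields a machine-readable trace of how each figure or data product was produced.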

Big data, smart data, data science and the changes they bring to geoscience
Big data does not mean we can simply dump and share data and rely on machine learning to identify patterns in the chaos. Many researchers have discussed the idea of smart data (Iafrate, 2014;Sheth, 2014;Maskey et al., 2020): applying metadata and semantics to add machine-readable structure during data generation and collection, and deploying intelligent algorithms to improve the precision of data discovery and analysis. Smart data will bring refreshing changes to the data life cycle and help researchers quickly identify the data to be used and extract value from them. Many geoscience data portals, such as EarthChem, Neotoma, and the Paleobiology Database, have already applied controlled vocabularies to improve the precision of data search and query. The Google Dataset Search engine, enabled by Schema.org, offers a playground for developing more innovative functions in data search. The geoscience community has already begun to work on approaches to expose Schema.org-compatible metadata on data portals (Shepherd, 2019;Valentine et al., 2020) and make the metadata indexable by the Google Dataset Search engine. When more data portals enable such functions, an end user will be able to search a variety of data through a single engine. Metadata portals for specific geoscience disciplines or subjects, such as deep time (Stephenson et al., 2020), can also be built with indexable metadata harvested from various data portals. These improved functionalities will greatly benefit end users (Chapman et al., 2020). With more provenance information of workflows documented and shared, there can even be smart search engines that use such information to recommend not only data, but also software packages for analyzing the data, potential research topics, and potential collaborators. For example, Mookerjee et al. (2021) discussed that, by using machine learning, data management systems will be able to make connections to other datasets that can potentially seed collaborations or suggest other geographical areas to study.
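Exposing Schema.org-compatible metadata typically means embedding a JSON-LD "Dataset" description in a portal's landing pages, which crawlers such as the Google Dataset Search engine can then index. The record below is a minimal sketch with placeholder values only, not a real dataset.

```python
import json

# Minimal schema.org "Dataset" description in JSON-LD. All values are
# hypothetical placeholders.
dataset_jsonld = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "Example mineral occurrence dataset",
    "description": "Placeholder record illustrating schema.org markup.",
    "identifier": "https://doi.org/10.0000/example",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "keywords": ["mineralogy", "occurrence", "geochemistry"],
}

# The serialized record would be embedded in the landing page inside a
# <script type="application/ld+json"> element.
print(json.dumps(dataset_jsonld, indent=2))
```

Because every portal emits the same vocabulary, a discipline-specific metadata portal can harvest and merge these records without custom parsers for each source.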
Smart data will save researchers' time on data discovery and allow them to put more effort into proposing research questions and conducting data analysis, whether they work with a small amount of data and well-defined research questions, or with a large amount of data that requires exploratory data analysis and hypothesis generation (Kitchin and Lauriault, 2015). Ma (2018) compared the data science process with conventional science approaches, and pointed out that a unique feature of data science in the big data era is that, while a lot of data are collected, we might not yet have a specific research question. Bergen et al. (2019) discussed that machine learning provides means to discover high-dimensional and complex relationships in data, and enables exploration of more scientific hypotheses. If the conventional approach is characterized by small data and small knowledge (i.e., domain experts and personal computers), then the data science process can enable big data and big knowledge (i.e., domain experts, smart data, machine learning, and cloud environments). In big data-enabled, multidisciplinary geoscience research projects, interpretability of the workflow will be a big advantage in helping people from different disciplinary backgrounds understand the results and findings (Reichstein et al., 2019). This overlaps with the work on explainable and meaningful Artificial Intelligence in computer science (Hagras, 2018;Holzinger, 2018;Chari et al., 2020). In the geoscience community, there has been some initial work on this topic in workflow platforms, such as the Meaningful Spatial Statistics initiative (Stasch et al., 2014), and we anticipate that more work will appear in the near future.
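Hypothesis generation via unsupervised learning, as discussed by Bergen et al. (2019), can be illustrated with a bare-bones k-means clustering sketch: groupings that emerge from the data become candidate hypotheses to investigate. The two-variable measurements below are made up for illustration, and a real analysis would use an established library such as scikit-learn rather than this hand-rolled loop.

```python
import math

def kmeans(points, centers, iterations=10):
    """Minimal k-means: assign points to the nearest center, then move each
    center to the mean of its assigned points, and repeat."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical two-variable measurements with two well-separated groups.
samples = [(1.0, 1.1), (0.9, 1.0), (1.1, 0.9),
           (8.0, 8.1), (7.9, 8.0), (8.1, 7.9)]
centers, clusters = kmeans(samples, centers=[(0.0, 0.0), (10.0, 10.0)])
print(len(clusters[0]), len(clusters[1]))  # → 3 3
```

The separation itself proves nothing; it is the follow-up question, why do these samples group this way, that turns an exploratory pattern into a testable hypothesis.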

Science of team science to facilitate data-driven geoscience discovery
In the data ecosystem underpinned by open science, there will be small data science projects that need only a small team, personal computers, and open source software packages. There will also be large-scale data science projects that cross disciplinary boundaries and require collaboration among researchers from different institutions, high performance computing facilities, efficient infrastructure for data storage and transmission, and large software programs for data management and processing. To succeed in such projects, the science of team science is recommended by many communities (NASEM, 2015). Key elements of the science of team science include: 1) clear communication to reach consensus on objectives among team members; 2) regular brainstorming activities to identify and specify research questions; 3) complementary expertise from team members on problem solving; 4) regular team meetings to review progress and seek alternative approaches; and 5) positive and supportive working relationships within the team. The recent collaboration on data-driven mineral evolution studies shows successful real-world practices of team science. In that work, a list of activities was organized to create an environment where people from different knowledge backgrounds could quickly step out of their comfort zones, get familiar with each other, and work together on focused scientific topics.
Geoscience communities also need some cultural change to fully embrace open data and open science. The NASEM (2020) report "Earth in Time" envisioned a list of priority science questions for the NSF Earth science programs in the next decade. The report also made two recommendations on cyberinfrastructure: one is a strategy to support FAIR data practices in community data efforts, and the other is the initiation of a community-based standing committee to provide advice on cyberinfrastructure needs and advances. Community of practice has received increasing attention in many academic associations and has been discussed as a catalyst for open science (Cutcher-Gershenfeld et al., 2017). Many researchers have been actively promoting open science in geoscience (Caron, 2020). For instance, the Earth Science Information Partners, through collaboration with EarthCube, the American Geophysical Union, the European Geosciences Union, the Geological Society of America, the American Meteorological Society, and other organizations, has organized many successful Data Help Desk activities and archived a long list of reusable resources (ESIP, 2020). Hundreds of researchers across the world have joined those activities as volunteers to answer questions and share research outcomes. We anticipate that more such activities will be organized in the future to promote data science applications and cultural change in geoscience.

Concluding remarks
This article presents a review of recent data science activities in geoscience from the perspective of a data life cycle. It first describes the basic concepts and theoretical foundation of data science. Then, following the steps of the data life cycle, it reviews a number of recent publications for each step and summarizes the shareable experience from them. Finally, it discusses a vision of the trends of data science applications in geoscience, including topics of open science, smart data, and the science of team science. We hope the review from the perspective of a data life cycle will lower the barrier to data science for geoscientists, especially newcomers to data science applications. Individual geoscientists can become aware of the resources available in the cyberinfrastructure, see representative examples of data science, and develop ideas for their own work. Research teams can learn methods for collaboration and team science. Geoscientists have been successfully embracing the strategy of community of practice to share data science resources and promote best practices. We hope the open science campaign will further facilitate data science applications in geoscience and lead to more data-driven scientific discoveries.