A Model Workflow for GeoDeepDive: Locating Pliocene and Pleistocene Ice-Rafted Debris

Machine learning technology promises a more efficient and scalable approach to locating and aggregating data and information from the burgeoning scientific literature. Realizing this promise requires provision of applications, data resources, and the documentation of analytic workflows. GeoDeepDive provides a digital library comprising over 13 million peer-reviewed documents and the computing infrastructure upon which to build and deploy search and text-extraction capabilities using regular expressions and natural language processing. Here we present a model GeoDeepDive workflow and accompanying R package to show how GeoDeepDive can be employed to extract spatiotemporal information about site-level records in the geoscientific literature. We apply these capabilities to a proof-of-concept subset of papers in a case study to generate a preliminary distribution of ice-rafted debris (IRD) records in both space and time. We use regular expressions and natural language processing utilities to extract and plot reliable latitude-longitude pairs from publications containing IRD, and also extract age estimates from those publications. This workflow and R package provide researchers from the geosciences and allied disciplines with a general set of tools for querying spatiotemporal information from GeoDeepDive for their own science questions.


Introduction
Peer-reviewed papers communicate knowledge to their audiences through text, figures, and tables. These elements have been refined over generations to efficiently communicate scientific insight to human readers, but the volume and variety of the peer-reviewed literature have challenged the efficient extraction of the underlying primary data. Efforts to improve the practice of data archiving in structured, sustainable data repositories are increasing (Sansone et al., 2019; Uhen et al., 2013; Williams et al., 2018) as individuals and groups recognize the importance of data sharing and curation (PAGES Scientific Steering Committee, 2018). Despite these efforts, however, a large volume of data and information still exists exclusively in published form as text within manuscripts, embedded in tables, or graphically within figures. In response, new automated software tools are being developed to extract information directly from the scientific literature (Pejić Bach et al., 2019; Tworowski et al., 2021). Various fields are developing tools for automated extraction of meaningful information from the scientific literature, including natural language processing (NLP) and other forms of machine learning (ML), the vast majority of which are being developed and deployed to extract information from general and freely available content, like Twitter feeds and publication abstracts. The development of these new software tools has rapidly outpaced their application in the geosciences, where they could allow the extraction of information from digital libraries and infrastructures to address questions at scales not available to traditional meta-analyses.
GeoDeepDive (http://geodeepdive.org, also known as xDD) is a digital library and computing system that currently contains over 13 million publications from multiple commercial and open-access content providers. Early versions of GeoDeepDive have been used to extract fossil occurrences from the scientific literature (Peters et al., 2014) and, for example, to understand the temporal patterns and possible drivers of stromatolite resurgences in the geological past (Peters et al., 2017). However, the newness of GeoDeepDive as a platform and the scarcity of software tools to leverage it have limited its impact. Here we provide a sample workflow and accompanying R package that lead the reader through key, public elements of GeoDeepDive. As a case study, we retrieve a sample set of papers on the distribution of ice-rafted debris (IRD) from the Pliocene to present and extract both geographic coordinates and temporal information. We choose to focus our effort on IRD because of the near uniqueness of the acronym in the geoscience literature and because IRD is almost exclusively restricted to ocean settings, thus simplifying identification of false positives (occurrences of IRD that do not refer to ice-rafted debris) in the training dataset. IRD distribution in marine sediments provides a key constraint on cryosphere development, yielding insight into past climate evolution (e.g. Andrews, 1998; Bassis et al., 2017; Bond and Lotti, 1995; Hemming, 2004; Ruddiman, 1977).
One implementation of GeoDeepDive uses sentences as the atomic unit, managing the sentence-level data within a PostgreSQL database. Each sentence within a paper is identified by 1) a unique document id (GDDID, or gddid in the accompanying code, which is an internally assigned unique identifier to accommodate publications that may not have their own formal digital object identifier [DOI]) and 2) a sentence number that is assigned and unique within the paper. A separate table relates GDDIDs to publication metadata (e.g. title, journal, authors). Hence, GeoDeepDive workflows and their individual steps effectively operate at two distinct levels: document-level and sentence-level. Because GeoDeepDive also provides unique IDs for each journal and links these to the sentence IDs, journal-level analytics are possible. Much of the power of GeoDeepDive derives from its ability to conduct sentence-level analytics. GeoDeepDive makes use of Stanford NLP (Manning et al., 2014), so it is also possible to obtain word-level analysis using indexing within sentences. For this demonstration paper, we focus on sentence- and document-level analytics.
Figure 1: Workflow used to go from a list of documents that mention ice-rafted debris (IRD; IRD is the actual search string in this case) and (Pliocene or Pleistocene or Holocene) to a cleaned set of documents that removes known irrelevant instances of 'IRD', and finally to a summary of the documents and relevant information. Modified from Marsicek et al. (2018).
This paper presents a sample workflow, intended to provide meaningful but preliminary results on past IRD distributions, with the main goal of illustrating the potential of sentence-level query capabilities in GDD, and showing potential users how GDD can be used to extract information from text. In this example workflow, we identify papers with mentions of ice-rafted debris (IRD) in the Pliocene and Pleistocene (Fig. 1), extract space and time coordinates using a new R toolkit called geodiveR (http://github.com/EarthCubeGeochron/geodiveR), and store the data and code in a GitHub repository (http://github.com/EarthCubeGeochron).
Many publications document the existence of IRD at the level of individual marine drilling sites, but assembling this information across publications into large-scale mapped syntheses is a non-trivial task that has traditionally taken years of painstaking literature compilation (Andrews et al., 1997; Bond and Lotti, 1995; Heinrich, 1988; Ruddiman, 1977; Stern and Lisiecki, 2013). A comprehensive, accurate database of IRD deposits and their spatial distribution, extracted from the published scientific literature through a combination of advanced software and human expertise, can help the scientific community better understand and characterize ice sheet dynamics over the last 5.3 million years, ideally leading to a better understanding of how glaciers respond to changes in climate and ocean circulation.
The goal is not to remove the expert sedimentologist or paleoceanographer from decision-making processes, but rather to show how the GeoDeepDive infrastructure can be employed to perform research tasks more efficiently. Various subtleties and complexities persist that are not yet tractable for machine learning. For example, ice-rafted debris is part of a complex of sedimentary deposits within ocean sediments, including iceberg, ice shelf, and sea ice rafted debris (Powell, 1984). Differences among these sediment types in the processes of sediment entrainment and deposition may challenge interpretation (Andrews, 1998). Sedimentological features evaluated by the expert can be used to differentiate particle sizes and shapes to offer a better understanding of the sources of sediments identified as ice-sourced (Hemming, 2004; Kleiven et al., 2002; St John et al., 2015; White et al., 2016), and thus provide a more complete picture of the processes that led to deposition. Furthermore, the geospatial and temporal information retrieved by GDD requires vetting.
Nonetheless, by providing a comprehensive corpus of documents, with identified publications, timings, and locations for identified deposits, GeoDeepDive can help speed the discovery and mapping of records by experts across a widely dispersed literature, help identify potential outliers or misidentified samples, identify gaps where new field campaigns or re-sampling of existing physical cores may provide new insights, and ultimately generate a more complete model of marine ice dynamics in the geologic record. As a first step forward, this paper provides a model workflow to be carried out with the GeoDeepDive infrastructure. We begin with a general walkthrough of the analytical steps and various considerations that arise at each stage, then move to a specific walkthrough that focuses on identifying papers with IRD records from the Pliocene and Pleistocene, extracting spatial and temporal coordinates, and mapping the returned results. We show that a relatively simple framework is already able to recover a substantial body of useful information that can inform further data processing, cleaning, and interpretation of paleoceanographic patterns and cryosphere-climate evolution over Earth's history.

Initial Returns and RegEx
Processing the entirety of the documents within GDD is time consuming because the body of accessible papers (the corpus) contains over 13 million peer-reviewed publications. For any given goal, only a small fraction of the total corpus is relevant. Keywords are a common first solution to reducing data volume: particular keywords within a document can be used to identify the subset of all documents that are potentially relevant. The GDD public API (https://geodeepdive.org/api) supports simple string matching using keywords. Here, we chose terms that would return a sufficient breadth of documents for this trial study. We used the acronym "IRD", as well as constraints on geologic time intervals: "Holocene", "Pleistocene" and "Pliocene". This search returned 5,315 total documents, which form the starting point for this model workflow.
Useful information about these terms can be derived from the "snippets" endpoint of GeoDeepDive's API. Snippets harnesses an ElasticSearch index spanning the full text of all PDFs that have a "native" text layer (i.e., PDFs with searchable text in them already, which constitute the vast majority of PDFs distributed by journal publishers):
    https://geodeepdive.org/api/snippets?term=IRD&full_results=true
The response to this API call is a JSON object that indicates the total number of "hits" of the term (n = 44,136 as of 2021-06-29 and n = 35,772 as of 2020-03-26) and basic bibliographic citation information for each document containing the term. The bibliographic information includes a link to the original PDF distributed by the publisher and a "snippet" of text around mentions of the term in the full text of the document:
    [
      {
        "pubname": "South African Geographical Journal",
        "publisher": "Taylor and Francis",
        "_gddid": "5946b1c8cf58f13cac0191e7",
        "title": "Reviews/Resensies",
        "doi": "10.1080/03736245.1976.10559569",
        "coverDate": "1976 04",
        "URL": "http://www.tandfonline.com/doi/abs/10.1080/03736245.1976.10559569",
        "authors": "Young, Bruce; Davies, R. J.; Hart, G. H. T.",
        "highlight": [
          "materials for course work in the second part. The th,<em class=\"hl\">ird<\/em> and fourth parl~ of the book. 0n the other hand,"
        ]
      }
    ]
The example snippet shown here is a non-relevant IRD instance, as suggested by the highlight text referring to course materials. By default, matching terms are highlighted with HTML tags. To remove the tags, the parameter &clean=true can be added to the URL for the snippet API. For searches with large numbers of results, such as the IRD example shown here, the results include a link to the next page of documents containing the term, allowing the user to scroll through large sets of results.
The GeoDeepDive snippet API also supports searches that combine multiple terms. The following API call, which returns only papers that contain both "IRD" and "Pleistocene", returns 4,892 hits as of 2021-06-29, far fewer than the more expansive search:
    https://geodeepdive.org/api/snippets?term=IRD,Pleistocene&inclusive=TRUE&full_results=true
The GeoDeepDive API, however, is not designed to provide full functionality, but rather to be deployed in user-constructed applications. Hence, the GDD API is suitable mainly for initial data extraction tasks. More powerful analyses require working with the PostgreSQL representation of the document text data.
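The snippet queries shown above can also be assembled programmatically. The following is a minimal sketch in base R; the helper name snippets_url is hypothetical (it is not part of geodiveR), and it only builds the URL from the parameters that appear in the examples above (term, inclusive, full_results, clean) without contacting the API.

```r
# Hypothetical helper (not part of geodiveR): build a snippets query URL
# from the parameters that appear in the example calls above.
snippets_url <- function(terms, inclusive = FALSE, full_results = TRUE,
                         clean = FALSE) {
  base <- "https://geodeepdive.org/api/snippets"
  # Percent-encode each term so that spaces and symbols survive in a URL
  encoded <- vapply(terms, utils::URLencode, character(1),
                    reserved = TRUE, USE.NAMES = FALSE)
  # Multiple terms are passed as one comma-separated list
  query <- paste0("term=", paste(encoded, collapse = ","))
  if (inclusive)    query <- paste0(query, "&inclusive=TRUE")
  if (full_results) query <- paste0(query, "&full_results=true")
  if (clean)        query <- paste0(query, "&clean=true")
  paste0(base, "?", query)
}

url_both <- snippets_url(c("IRD", "Pleistocene"), inclusive = TRUE)
print(url_both)
```

The resulting string could then be handed to any JSON reader (for example jsonlite::fromJSON) to retrieve a page of results; that step is left out here so the sketch runs without network access.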
Here, we show that text matching and data extraction from the retrieved body of papers can be achieved in PostgreSQL by using existing PostgreSQL text and string functions, plus regular expression matching in R using the stringr package (Wickham, 2019b). The Stanford NLP library also allows GDD workflows to take advantage of part-of-speech tagging and more advanced NLP tools, but these capabilities are not employed in this demonstration workflow.

Subsetting and Cleaning
We begin our analysis with a subset of the data, consisting of 150 papers, that was sampled from the 5,315 papers retrieved using the keyword constraints described above. The subset of papers may still include papers that are not appropriate (i.e. IRD may refer to something other than ice-rafted debris). To obtain a training dataset, we execute a second round with the same text matching, using the same keywords and rules used at the document level, but now enforced at the sentence level. For example, "IRD" must be located in a sentence with another term (e.g., "IRD" and "Holocene"). These additional rules restrict the total list returned to 81 documents for which any sentence contains a match to the keyword. This rule is likely too restrictive (i.e. it likely removes some ice-rafted debris papers) but is employed here to show how sentence-level queries can further constrain searches.
Searching for IRD as a keyword retrieves articles that use IRD as an acronym for ice-rafted debris, but it also, for instance, retrieves articles mentioning the French Research Institute for Development (Institut de Recherche pour le Développement). Throughout this paper we will refer to rules; generally these are statements that can resolve to a Boolean (TRUE/FALSE) output. So, for example, within our subset we could flag sentences that mention IRD but not CNRS:
    sentence <- "this,is,a,IRD,and,CNRS,sentence,we,didnt,want,."
    stringr::str_detect(sentence, "IRD") & !stringr::str_detect(sentence, "CNRS")
This statement evaluates to TRUE if IRD appears in a sentence without CNRS; for the example sentence it evaluates to FALSE. If we apply this sentence-level test at the document level (any(test == TRUE)) we can estimate which papers are most likely to have the correct mention of IRD for our purposes. This then further reduces the number of papers (and sentences) for our training dataset.
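The step from a sentence-level rule to a document-level decision can be sketched as follows. This is an illustration in base R on invented data, not the workflow's own code; the column names gddid and words are assumptions made for the example.

```r
# Toy stand-in for the sentence table; column names are assumptions.
sentences <- data.frame(
  gddid = c("doc_a", "doc_a", "doc_b", "doc_c"),
  words = c("IRD,layers,occur,in,the,Holocene,section,.",
            "The,core,was,sampled,at,5,cm,intervals,.",
            "Funded,by,IRD,and,CNRS,.",
            "Pleistocene,IRD,flux,increased,."),
  stringsAsFactors = FALSE
)

# Sentence-level rule: IRD present and CNRS absent.
# \\b keeps "IRD" from matching inside longer tokens such as "BIRD".
has_ird  <- grepl("\\bIRD\\b", sentences$words, perl = TRUE)
has_cnrs <- grepl("\\bCNRS\\b", sentences$words, perl = TRUE)
sentences$keep <- has_ird & !has_cnrs

# Document-level rule: keep a paper if any of its sentences passes.
doc_keep <- tapply(sentences$keep, sentences$gddid, any)
kept_docs <- names(doc_keep)[as.vector(doc_keep)]
print(kept_docs)
```

Here doc_b is dropped because its only IRD sentence also mentions CNRS, while doc_a survives on the strength of a single passing sentence. A stricter rule could require a minimum number of passing sentences per document.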

Extracting Data
After cleaning and subsetting, we develop a series of tests and workflows to iteratively extract information. In many cases this requires further text matching, and R packages such as stringr are useful for accomplishing this task. Additional support can come from the NLP output that can be generated for the data. In all of these cases, we generate clear rules to be tested, and then apply them to the documents.
Because understanding both the IRD distribution in ocean sediments and the timing of the deposition of IRD through the Pliocene and Pleistocene is critical for interpreting past ice dynamics, spatial coordinates and geochronologic constraints of the IRD deposits need to be identified within the paper. Hence, any paper that lacks spatial coordinates, ages, or both is filtered out here. Less restrictive searches could be defined.
Extracting spatial location and age information from a paper that contains "IRD", however, is not sufficient. We need to be able to distinguish between an age related to an event we are interested in and an age reported in a paper for some other reason. So, again, we must develop general rules that distinguish ages of interest from all ages, and spatial locations of interest from all spatial locations.

Exploratory Iteration
There are several reasons to continue to refine the rules used in this workflow to discover data; each is a potential pitfall for geoscientific applications of GeoDeepDive. First, extraction of text from the PDF and optical character recognition (OCR) are not always accurate, so some sentences and words are parsed incorrectly. This problem is particularly acute for geographic coordinates (see below). Second, many words have multiple meanings, leading to false positives if only string matching is used, as the IRD example illustrates. Third, semantic terms and concepts often vary subtly within and among disciplines and journals. As a simple example, if we were interested in retrieving paleoecological information we would need to know that paleoecology and palaeoecology refer to essentially identical concepts. Similarly, ice-rafted debris may also be referred to as sand-sized layers in the marine context (e.g., Ruddiman, 1977), while a paleoceanographer might want careful separation among different kinds of IRD, e.g. iceberg-rafted debris, ice-shelf-rafted debris, or sea-ice-rafted debris. Fourth, the context and placement of words matter. For example, temporal information like 'Holocene' and 'Pliocene' may be found in the Methods, where they refer to marine core locations, or in the Discussion, where they might refer to global climate trends.
Repeatedly reviewing matches at both the sentence level and document level (i.e., "Why did this irrelevant paper/sentence match or why didn't this relevant paper/sentence return a match?"), then refining the workflow rule-sets accordingly, is critical to developing a clear workflow and high-value corpus. In many cases, beginning with very broad tests and slowly paring down to more precise tests is an appropriate approach. For this kind of iteration, tools like RMarkdown are helpful for interactive data exploration, using packages like DT (Xie et al., 2018) and leaflet (Cheng et al., 2019).
We can assess the distribution of age-like elements within a paper and determine if they match with our initial expectations (e.g. "Why does the article 'Debris fields in the Miocene' contain Holocene-aged matches?"; "Why does a paper about Korea report locations in Belize?"). Depending on the success of the algorithm, the tests can be revised and the process repeated to increase the frequency of acceptable matches.

Reproducible and Test-Driven Workflows for a Dynamic Literature
As the GDD workflow develops and refines, we can begin to report patterns and findings. Some of these may be semi-qualitative (e.g., "The majority of sites are dated to the LGM"), while others may involve statistical analysis (e.g., "The presence of IRD increases after the Mid-Pliocene Transition (p<0.05)"). In an analysis where the underlying dataset is static or a version has been frozen, it is reasonable to develop a paper and report these findings.
However, the publication database in GDD is far from static; the GDD infrastructure acquires more than 10,000 papers per day from multiple sources. Given this, some patterns will change over time as more information is brought to bear. For example, a new ocean drilling campaign might reveal new insights into the spatiotemporal distribution of IRD, or the addition of new records may reveal previously undiscovered search artifacts within the publication record.
For this reason, it is critically important to use assertions or testable statements that can be evaluated to TRUE or FALSE within the workflow. Test-driven development is common in software development. As developers create new features, a good practice is to first develop tests for the features, to ensure that feature behavior matches expectations. The analogy in our scientific workflow is that findings are features, and as we report them, we want to be assured that those findings are valid.
In R we can use the assertthat package (Wickham, 2019a) to support test-driven development and assertions within the workflow. The package provides a tool for testing statements and gives robust feedback through custom error messages.
    percent_ages <- sum(howmany_dates$age_sentences) / nrow(howmany_dates)
    # msg is reported only when the assertion fails
    assertthat::assert_that(percent_ages < 0.1, msg = "Expected fewer than 10% of papers to have ages.")
With this workflow overview we now have mapped out an iterative process that is also responsive to the underlying data. We have developed clear tests under which our findings are valid. We can create a document that combines our code and text in an integrated manner, supporting FAIR Principles (Wilkinson et al., 2016) and the next generation of reproducible research. In the following section we run through this workflow in detail.
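To illustrate how an asserted finding responds to a growing corpus, here is a minimal base R sketch. The data are invented, the helper names assert_finding and share_with_ages are hypothetical, and stop() is used in place of assertthat so the example has no package dependencies.

```r
# Toy stand-in for `howmany_dates`: one row per paper, with a flag for
# whether any sentence in the paper contains an age (values invented).
howmany_dates <- data.frame(
  gddid = paste0("doc_", 1:20),
  age_sentences = c(TRUE, rep(FALSE, 19)),
  stringsAsFactors = FALSE
)

share_with_ages <- function(d) sum(d$age_sentences) / nrow(d)

# Hypothetical helper: stop with a message if a reported finding fails.
assert_finding <- function(condition, msg) {
  if (!isTRUE(condition)) stop(msg, call. = FALSE)
  invisible(TRUE)
}

finding_msg <- "Finding no longer holds: 10% or more of papers have ages."

# Holds for the frozen corpus (1 of 20 papers)
assert_finding(share_with_ages(howmany_dates) < 0.1, finding_msg)

# Simulated corpus update: five new papers that all report ages
updated <- rbind(
  howmany_dates,
  data.frame(gddid = paste0("new_", 1:5), age_sentences = TRUE,
             stringsAsFactors = FALSE)
)
outcome <- tryCatch(
  assert_finding(share_with_ages(updated) < 0.1, finding_msg),
  error = function(e) conditionMessage(e)
)
print(outcome)
```

After the simulated update the share rises to 6 of 25 papers, so the assertion fails and the message flags that the reported finding needs to be revisited.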

Finding Spatial Matches
To begin, we load the packages to be used and then import the data. The import produces an output object that includes a key for the publication (_gddid, linking to the publications variable), the sentence number of the parsed text, the parsed text itself, and some results from natural language processing. We also obtain a list of gddids to keep or drop, based on the regular expressions we used to find instances of IRD in the affiliations or references sections of the papers.

Extracting Spatial Coordinates
In this case study, we are interested in using GeoDeepDive to obtain site coordinates for locations that contain IRD data spanning the Pliocene and Pleistocene. The goal is to provide relevant site information for use in meta-analysis, or for comparing results to existing geographic locations from the relevant geocoded publications and then linking back to the publications using DOIs.
To obtain geographical coordinates from the paper we must consider several potential issues. The first is that not all coordinates will necessarily refer to an actual ocean core. We may also inadvertently find numeric strings that appear to be coordinates but are in fact simply numbers. Therefore, we must define what we expect coordinates to look like and build a regular expression (or set of regular expressions) to extract these values accurately. Since we process degree-minute-second (DMS) coordinates differently than decimal-degree (DD) coordinates, we generate two regular expressions. Both allow for negative or positive values and a degree component that may start with a 1, followed by one or two digits ({1,2}). From there the structures differ, reflecting the need to capture the degree symbols or, in the case of decimal degrees, the decimal component of the coordinates. The regular expression for decimal degrees is more rigorous (i.e., has fewer matching options) than the one for DMS coordinates. The more open-ended matching for DMS coordinates is needed because DMS symbols (e.g. °, ', '') may be interpreted in non-standard ways by OCR.
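As a rough illustration of the two kinds of pattern described above, the sketch below defines one decimal-degree and one DMS expression and applies them to invented sentences with base R. These patterns are assumptions written for this example; they are not the expressions used in the workflow, and the names dd_regex and dms_regex are reused only for readability.

```r
# Illustrative patterns only. Both allow an optional sign, an optional
# leading 1, and one or two further digits for the degrees.
deg <- "(?<![\\d.])[-+]?1?\\d{1,2}"

# Decimal degrees: the decimal part is mandatory, e.g. "-42.913"
dd_regex <- paste0(deg, "\\.\\d+")

# Degree-minute-second: symbol classes are deliberately loose because
# OCR may substitute look-alike characters for the DMS marks.
dms_regex <- paste0(
  deg, "\\s*[\u00b0\u00ba]\\s*",                        # degrees + symbol
  "\\d{1,2}(?:\\.\\d+)?\\s*['\u2032\u2019]?\\s*",       # minutes
  "(?:\\d{1,2}(?:\\.\\d+)?\\s*(?:\"|\u2033|'')\\s*)?",  # optional seconds
  "[NSEW]"                                              # quadrant
)

# Invented example sentences
text <- c(
  "Core A was recovered at 58\u00b012.64'N, 48\u00b022.38'W.",
  "The site is located at -42.913, 147.325 in 2150 m water depth.",
  "IRD counts rose by 15 grains per gram."
)

extract_all <- function(x, pattern) {
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}
dms_hits <- extract_all(text, dms_regex)
dd_hits  <- extract_all(text, dd_regex)
print(dms_hits[[1]])
print(dd_hits[[2]])
```

Note that this decimal-degree pattern also fires on the minute values inside the DMS sentence (for example "12.64"), which is one concrete reason the two sets of matches need to be reconciled and reviewed rather than trusted as returned.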
The regex commands were constructed to work with the stringr package (Wickham, 2019b), so that we obtain five elements from any match, including the full match, the degrees, the minutes, the seconds (which may be an empty string), and the quadrant (NESW).
    degmin <- str_match_all(nlp$word, dms_regex)
    decdeg <- str_match_all(nlp$word, dd_regex)
We expect that all coordinates are reported as pairs within sentences, and so we are most interested in finding all sentences that contain pairs of coordinates. We start by finding the publications with sentences that have coordinate pairs.
Table 2. Sample sentences from the IRD corpus that contain matches to the coordinate regular expression rule.
Even here, we can see that many of these matches work, but that some of the matches are incomplete. Given that there are 81 articles in the NLP dataset with matches to IRD-related terms, it is surprising that only 20 appear to support regex matches to coordinate pairs. We would expect the description of sites or locations using coordinate pairs to be common practice. The observed outcome is likely to be, in part, an issue with the OCR/regex processing. A next iterative step would be to review potential matches more thoroughly to find additional methods of detecting the coordinate systems.
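The pair requirement can be expressed as a count of matches per sentence, followed by a roll-up to documents. The sketch below uses invented data and a simplified decimal-degree pattern; the column names and the pattern are assumptions for illustration only.

```r
# Toy sentence table; column names are assumptions for this sketch.
nlp <- data.frame(
  gddid = c("doc_a", "doc_a", "doc_b"),
  sentence = c(12, 40, 7),
  text = c("The core was taken at 61.27 N, 24.17 W.",
           "Mean grain size was 63.5 microns.",
           "IRD appears above 210 cm."),
  stringsAsFactors = FALSE
)

# Simplified, illustrative decimal-degree pattern
coord_regex <- "(?<![\\d.])[-+]?1?\\d{1,2}\\.\\d+"

hits <- regmatches(nlp$text, gregexpr(coord_regex, nlp$text, perl = TRUE))
nlp$n_coords <- lengths(hits)       # matches per sentence
nlp$has_pair <- nlp$n_coords >= 2   # a pair needs at least two matches

# Documents with at least one sentence holding a coordinate pair
pair_docs <- unique(nlp$gddid[nlp$has_pair])
print(pair_docs)
```

The grain-size sentence shows why the pair rule helps: a lone decimal number matches the pattern but is not treated as a location.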

Converting Geographic Coordinates
Given the geographic coordinate strings, we need to be able to transform them into reliable latitude and longitude pairs with sufficient confidence to actually map the records. Two functions convert the GeoDeepDive word elements pulled out by the regular expression searches into decimal degrees that can account for reported locations (Baeten et al., 2014; Cofaigh et al., 2001; Ikehara and Itaki, 2007; Incarbona et al., 2008; Kaboth et al., 2016; Prokopenko et al., 2002; Rea et al., 2016; Rosell-Melé et al., 1997; Seki et al., 2005; Simon et al., 2013; Stoner et al., 1994).
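The conversion itself follows the standard relation: decimal degrees = degrees + minutes/60 + seconds/3600, negated for southern and western quadrants. A minimal sketch is given below; the function names dms_to_dd and plausible are hypothetical and this is not the geodiveR implementation.

```r
# Hypothetical helper: convert matched DMS elements to decimal degrees.
dms_to_dd <- function(degrees, minutes = "0", seconds = "0", quadrant = "N") {
  to_num <- function(x) {
    x <- as.character(x)
    x[is.na(x) | x == ""] <- "0"   # empty fields (e.g. no seconds) count as 0
    as.numeric(x)
  }
  deg <- to_num(degrees)
  magnitude <- abs(deg) + to_num(minutes) / 60 + to_num(seconds) / 3600
  # South and west, or an explicit minus sign, give negative values
  negative <- toupper(quadrant) %in% c("S", "W") | deg < 0
  ifelse(negative, -magnitude, magnitude)
}

# Hypothetical sanity check before mapping
plausible <- function(lat, lon) abs(lat) <= 90 & abs(lon) <= 180

lat <- dms_to_dd("58", "12.64", "", "N")
lon <- dms_to_dd("48", "22.38", "", "W")
print(c(lat = lat, lon = lon))
print(plausible(lat, lon))
```

A range check of this kind catches some OCR failures (for example a dropped degree symbol that turns two numbers into one), but it cannot tell whether a valid coordinate actually refers to a core site.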
After cleaning and subsetting the corpus, we find 11 papers with 30 coordinate pairs out of 150 documents in the IRDDive test dataset (Figure 3). This test suggests further improvements that can be made to the current methods or reporting in the literature. First, in some cases, we find papers where IRD is simply referenced, and the paper does not report primary data. Second, in some cases we find IRD but no coordinates or other core metadata; some papers simply do not contain coordinate information. Third, some papers mention IRD in the core data for continental cores (see Fig. 3 Central Asia location). These might possibly be valid instances of IRD in lacustrine deposits (Smith, 2000), or might represent papers that mention IRD without representing primary data. One solution would be to cross-reference the returned IRD coordinates with the location of continents (past or present) and remove coordinate pairs that fall within the continental boundaries. A fourth and last step would be to further refine the regex to obtain additional mentions of IRD, e.g. as 'IRD-rich layers', 'IRD-rich deposits', 'Heinrich layers', etc.

Extracting Ages and Age Ranges
The next step is to extract ages and age ranges that may be associated with IRD events. This requires building regular expressions that capture dates reported with many different naming conventions for units (e.g., years BP, kyr BP, ka BP, a BP, Ma BP). When we apply the code to the documents in the IRD corpus we begin to see patterns of dates (Table 3). A limitation is that the regular expression behind the variable is_range matches only BP ages preceded by a scale (a, ka, Ma). As currently defined, is_range allows only one term between the matched numbers and the string BP. Hence, more detailed methods would be needed to capture all age descriptors and types. Ultimately, multiple matching terms are likely required to find the breadth of age terms.
Table 3 (excerpt).
    {In,this,paper,",",the,outcome,from,the,analyses,of,these,results,is,presented,as,a,comprehensive,",",but,thoughtful,and,cautious,interpretation,of,the,history,of,the,BIIS,",",between,the,onset,of,the,LGM,until,",",but,not,including,",",the,Bølling,/,Allerød,Interstadial,-LRB-,27,to,15,ka,BP,-RRB-,.}
    11,--,3,ka,BP | 55052064e1382326932d8bce | 14 | {These,studies,reveal,HTM,summer,temperature,anomalies,of,$,0:5,--,3:0,C,relative,to,the,20th,century,mean,",",peaking,anywhere,from,$,11,--,3,ka,BP,-LRB-,e.g.,Kaufman,et,al.,",",2004
Most of the recovered ages fall within the last 15,000 years (Fig. 4). This suggests that the recovered ages are likely contextual and not directly associated with IRD events, because most known IRD events are associated with Pleistocene stadials or glacial periods, or with earlier time periods. Hence, these tests show that the workflow to date can successfully extract age information, but further work is needed to extract age information directly linked to IRD events.
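To make the age-range rule concrete, the following sketch matches two numbers joined by "to" or dashes and followed by a unit and "BP", then converts the result to years. The pattern, the scale table, and the function name parse_age_range are assumptions written for this example; like is_range as described above, the pattern allows only a single unit term between the numbers and BP.

```r
# Illustrative pattern, not the workflow's original expression.
range_regex <- paste0(
  "(\\d+(?:\\.\\d+)?)\\s*(?:to|-{1,2})\\s*",              # first number
  "(\\d+(?:\\.\\d+)?)\\s*(years|yr|kyr|ka|Ma|a)\\s*BP"    # second number, unit
)

# Multipliers to convert each unit to years
scale <- c(years = 1, yr = 1, a = 1, kyr = 1e3, ka = 1e3, Ma = 1e6)

# Hypothetical helper: return the first age range in a comma-separated
# word string as c(young, old) in years, or NULL when nothing matches.
parse_age_range <- function(words) {
  text <- gsub(",", " ", words, fixed = TRUE)
  m <- regmatches(text, regexec(range_regex, text, perl = TRUE))[[1]]
  if (length(m) == 0) return(NULL)
  ages <- as.numeric(m[2:3]) * scale[[m[4]]]
  c(young = min(ages), old = max(ages))
}

r1 <- parse_age_range("Interstadial,-LRB-,27,to,15,ka,BP,-RRB-,.")
r2 <- parse_age_range("anomalies,of,$,0:5,--,3:0,C,peaking,from,$,11,--,3,ka,BP")
r3 <- parse_age_range("IRD,flux,increased,after,2004")
print(r1)
print(r2)
```

Phrases such as "15 cal ka BP" or "between 27 and 15 ka" would be missed by this pattern, which is the kind of gap that motivates using several complementary expressions.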

Figure 4.
Incidence of time periods covered in the documents, using extracted age ranges described in papers (e.g., "5 -7 ka BP"). Individual papers often report multiple time ranges. The high prevalence of recovered ages younger than 15,000 yr suggests that most of the recovered ages are not directly related to Ice Rafted Debris (IRD) events.
One avenue for further exploration is to use the sentence position within the document to provide context for a retrieved geographic coordinate or age. For example, the distribution of coordinates within papers shows a well-defined pattern (Figure 5). Coordinates are generally presented in the Abstract, Introduction or Methods section, but rarely elsewhere. This differs from the distribution of ages and age ranges, which appear throughout papers, although age ranges tend to be most frequent in the Results, Discussion and Conclusions. Age information also often appears in the titles of papers listed in the References section. This is a promising avenue for further study.
Of course, while the methods presented here do extract ages and geographic locations, it should be clear that this alone is not sufficient to fully understand the underlying processes. Rather, this demonstration shows how these new capabilities will enable unprecedentedly detailed and powerful searches of the scientific literature, to accelerate the pace and scale of scientific synthesis and insight.
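Relative position can be computed by dividing each match's sentence number by the number of sentences in its document. The sketch below uses invented values; the column names and the doc_len lookup are assumptions for illustration.

```r
# Invented matches: which sentence of which document held a hit
hits <- data.frame(
  gddid = c("doc_a", "doc_a", "doc_a", "doc_b", "doc_b"),
  sentence = c(10, 150, 190, 40, 300),
  type = c("coordinate", "age", "age", "coordinate", "age"),
  stringsAsFactors = FALSE
)

# Total number of sentences per document (assumed known from the corpus)
doc_len <- c(doc_a = 200, doc_b = 400)

# Relative position: 0 = start of the paper, 1 = end
hits$rel_pos <- unname(hits$sentence / doc_len[hits$gddid])

# Bin into quarters of the document and cross-tabulate by match type
hits$quarter <- cut(hits$rel_pos, breaks = seq(0, 1, by = 0.25),
                    labels = c("Q1", "Q2", "Q3", "Q4"),
                    include.lowest = TRUE)
position_table <- table(hits$type, hits$quarter)
print(position_table)
```

With real data, a table like this could be used as a rule in its own right, for instance by down-weighting ages that fall in the final portion of a paper where the reference list usually sits.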

Figure 5:
Relative location of ages and spatial coordinates reported in GDD documents (i.e., are ages or spatial coordinates generally located at the beginning, middle, or end of a paper?). The line segments above represent the approximate position of the term "Abstract", "Methods", etc. in papers, not the extent of those terms.

Conclusions
Here we have provided a model workflow to obtain coordinate and age information associated with IRD deposits, a GitHub Repository for open code development and sharing, and an R toolkit (http://github.com/EarthCubeGeochron/geodiveR). The specific case study showcased here is designed as a first step towards helping researchers study the dynamics of ice sheets over the last 5 million years, while GeoDeepDive and the post-processing analytical workflows shown here will be helpful to those wanting to query for space and time information using GeoDeepDive. This will allow other researchers to import their own data from their own search logic, and output text and coordinates relevant to a researcher's question. The GitHub Repository and R package can act as building blocks that serve researchers not only in the geosciences, but allied disciplines as well.