A simplified palaeoceanography archiving system (PARIS) and GUI for storage and visualisation of marine sediment core proxy data vs age and depth

Scientific discovery can be aided when data is shared following the FAIR principles of findability, accessibility, interoperability and reusability (Wilkinson et al., 2016). Recent discussions in the palaeoclimate literature have focussed on defining the ideal database format for storing data and associated metadata. Here, we highlight an often overlooked primary process in the widespread adoption of FAIR data, namely the systematic creation of machine-readable data at source (i.e. at the field and laboratory level). We detail a file naming and structuring method that was used at LSCE to store data in text file format in a way that is machine-readable and also human-friendly to persons of all levels of computer proficiency, thus encouraging the adoption of a machine-readable ethos at the very start of a project. Thanks to the relative simplicity of downcore palaeoclimate data, we demonstrate the power of this simple file format to function as a basic database in itself: we provide a Matlab-based GUI tool that allows users to search and visualise data by sediment core location, proxy type and species type. The adoption of similarly accessible, machine-readable file formats at other laboratories will promote data sharing within projects, while also allowing for the automation of submission of data to online database repositories with particular formatting and/or metadata requirements, thus reducing post-hoc workload.

Introduction

A common desire for Earth Science laboratories in the computer age is the digital storage and archiving of datasets in searchable databases. Furthermore, a growing number of funding agencies and publication venues are mandating that datasets be deposited in an open repository, so that other researchers may have access to the data. The 'big data' benefits of such a system for palaeoceanography are clear: data from multiple locations and periods of the Earth's history can be searched, sorted and presented according to, for example, proxy and/or species type. Such an approach would save significant person hours currently spent by researchers worldwide in searching for, downloading, understanding and digitising datasets, thus allowing for much more efficient analysis of data. The principles guiding this process are those of findability, accessibility, interoperability and reusability (FAIR) (Wilkinson et al., 2016).
Much of the discussion involving the establishment of standardised digitised data has revolved around defining an ideal database format and/or repository for the storage of data (Bolliet et al., 2016; Jonkers et al., 2020; Khider et al., 2019; McKay and Emile-Geay, 2016), which is indeed a key prerequisite for the ultimate end goal whereby all data is stored on a common, publicly searchable/queryable online database in line with the goals of FAIR data.
However, an often overlooked primary step in the realisation of such an end goal is ensuring that palaeoclimate data produced within a laboratory and/or research group is stored in some kind of machine-readable format in the first place, i.e. during the creation step. Current practices at many laboratories involve multiple actors and researchers of various levels of computer proficiency saving their data using idiosyncratic and machine-unreadable file formats. These practices lead to increased workload both during the project and at the end of the project when submitting data to online repositories (i.e. due to laborious post-hoc data formatting and manual metadata entry at the time of submission). If a given laboratory instead uses an internally consistent and machine-readable format for saving data, post-hoc conversion to various database formats and/or uploading to a repository can essentially become an automated process. Therefore, we argue that the ideal database format should be a secondary consideration. A primary consideration should be to take concrete steps to promote and ensure early adoption and awareness of the machine-readable ethos within a project and/or laboratory (i.e. upon the creation of the data), by creating a machine-readable format that works for the laboratory in question.
Given the aforementioned issues, we determined that the ideal data file format for use within a research group should meet the following criteria: (1) it must be machine-readable across many operating system platforms, thus allowing for automated reading of data, as well as bulk conversion/uploading to common database formats; (2) it must be human-friendly, thus allowing the human eye to quickly access and understand the data contained within the file if needed, because not all project participants have sufficient proficiency with higher level storage formats such as SQL, NetCDF and/or JSON; (3) the file creation process must be as accessible as possible and cause as little burden as possible for laboratory members of all levels of computer proficiency, thus encouraging the seamless and autonomous creation of machine-readable data formats from the very beginning of the project workflow (e.g., in the field or at the time of laboratory analysis).
Here, we present a file naming and file structure format that is both human-friendly and machine-friendly. The Palaeoceanography ARchivIng System (PARIS) was developed as a spin-off from the ERC ACCLIMATE project at Laboratoire des Sciences du Climat et de l'Environnement (LSCE), Gif-sur-Yvette. It is optimised for human-accessibility from the very beginning of a project (in this case, the stable isotope laboratory environment). Files stored in such a machine-readable file structure can subsequently be batch-converted automatically to the specific format requirements of a particular data repository, thus avoiding repeated manual metadata entry upon repository submission. We also demonstrate the machine-readable power of this simple file format as a basis for a simplified database structure to use within a laboratory: we have built a fully documented GUI environment for interactive searching and plotting of data using our simple file format. This environment allows for the rapid searching and visual presentation of data by latitude, longitude, water depth, age and, where applicable, species type. The entire setup was designed with modular expansion in mind, and both the file formatting conventions and GUI environment can be used and/or modified by other laboratories for their own particular needs. The structure of the archiving system is shown in Figure 1 and described in the following sections.

File naming conventions
We use a file storage system based on universally readable, tab-delimited ASCII text files, which are more than sufficient for palaeoclimate datasets from sediment cores, because such cores contain discrete-depth measurements numbering only in the hundreds or thousands. Such files can easily be created directly from analytical software or by using basic spreadsheet software. A uniform file naming convention is used to create machine-readable identifiers containing information about the data contained within the file: core name, data type (six character code) and measured material (e.g., foraminifera species). Select examples are shown in Table 1. The underscore character in the file name functions as a marker to distinguish the various descriptive properties of the file, thus facilitating machine readability and automated searching of file names. As such, core names may not contain an underscore. The full species names associated with species abbreviations can be found in the file _abbreviations.txt.
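The role of the underscore as a field separator can be illustrated with a short Python sketch. It is not part of the PARIS distribution (which is Matlab-based), and it assumes that every file name carries exactly three fields in the order core name, six-character data type code and material. The function name, the core name CORE-01 and the codes abcdef and Gbul are hypothetical placeholders, not entries from Table 1.

```python
from pathlib import Path
from typing import NamedTuple


class ParsedName(NamedTuple):
    core: str
    data_type: str
    material: str


def parse_data_filename(filename: str) -> ParsedName:
    """Split '<core>_<datatype>_<material>.txt' into its three fields.

    Assumed layout (hypothetical): three underscore-separated fields,
    the second of which is a six-character data type code.
    """
    stem = Path(filename).stem          # drop the '.txt' extension
    parts = stem.split("_")
    if len(parts) != 3:
        # An underscore inside the core name would produce a fourth field,
        # which is why core names may not contain one.
        raise ValueError(f"expected 3 fields, got {len(parts)}: {filename!r}")
    core, data_type, material = parts
    if len(data_type) != 6:
        raise ValueError(f"data type code must have 6 characters: {data_type!r}")
    return ParsedName(core, data_type, material)


# Hypothetical file name, for illustration only
parsed = parse_data_filename("CORE-01_abcdef_Gbul.txt")
```

Because hyphens are not separators, a core name such as CORE-01 survives intact, whereas an underscore in the core name would shift every subsequent field.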

Raw data files
A common challenge preventing long-term data sharing in palaeoceanography is the publication of isotope data exclusively vs age, which prevents re-evaluation of the data by future researchers as understanding of geochronological methods improves and evolves. For this reason, all isotope and other palaeoclimate proxy data in the PARIS scheme are stored against core depth as the primary format, allowing for the later application of new geochronologies, and/or comparison of proxy data vs multiple geochronologies. A further ambiguity commonplace in palaeoceanography is the reporting of only a single core depth value for a particular data point, even though subsamples represent a depth interval. To avoid such ambiguity, each data point stored using the PARIS scheme has two depth values (depth1 and depth2), which correspond to the top and bottom of a particular core interval ("depth slice"). Within the PARIS scheme, it is also possible to enter NaN for depth2. In such a case, depth2 will simply be assumed to be 1 cm greater than depth1 (i.e. depth1 represents the depth value corresponding to the top of a depth interval with a thickness of 1 cm).
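The depth2 fallback rule described above can be written as a minimal Python sketch. It is not taken from the PARIS code; the function name and the default_thickness parameter are hypothetical, and depths are assumed to be in cm.

```python
import math


def depth_interval(depth1: float, depth2: float = math.nan,
                   default_thickness: float = 1.0) -> tuple[float, float]:
    """Return (top, bottom) of a depth slice in cm.

    If depth2 is NaN, the slice is assumed to be 1 cm thick,
    with depth1 marking its top.
    """
    if math.isnan(depth2):
        depth2 = depth1 + default_thickness
    return depth1, depth2
```

For example, depth_interval(10.0) and depth_interval(10.0, float("nan")) both describe the slice from 10 cm to 11 cm, whereas depth_interval(10.0, 12.5) is returned unchanged.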
The tab-delimited ASCII text format is used to structure data in column/row format, whereby data such as depth, measurement value and measurement uncertainty are stored in specified column numbers. When there is no data available for a particular sample (e.g. a δ¹⁸O value but no accompanying δ¹³C value), a NaN is entered as a placeholder for the missing value, thus ensuring structural integrity and machine-readability of the file. The formatting used for each type of proxy is detailed in the user manual included with the GUI software. All raw data files are stored within the "raw data" folder. Here, we supply a number of example files of previously published Atlantic Ocean sediment core stable isotope data (
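The benefit of NaN placeholders for machine reading can be sketched in a few lines of Python. The column layout used here (depth1, depth2, δ¹⁸O, δ¹³C) and the absence of a header row are assumptions made for illustration only; the actual per-proxy layouts are defined in the PARIS user manual.

```python
import csv
import io
import math


def read_raw_data(text: str) -> list[list[float]]:
    """Parse tab-delimited numeric text into rows of floats.

    'NaN' fields are converted to float('nan'), so every row keeps
    the same number of columns even when a value is missing.
    """
    rows = []
    for record in csv.reader(io.StringIO(text), delimiter="\t"):
        if not record:                  # skip empty lines
            continue
        rows.append([float(field) for field in record])
    return rows


# Hypothetical file content: depth1, depth2, d18O, d13C
example = "10\t11\t3.25\t0.80\n12\tNaN\t3.10\tNaN\n"
rows = read_raw_data(example)
missing_d13c = [math.isnan(row[3]) for row in rows]
```

Because the missing δ¹³C value in the second row is written as NaN rather than left blank, both rows have four columns and column positions never shift.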

Age-depth model files
Within the PARIS system, separate age-depth model files are used to assign age and age uncertainty to the raw data that is stored against depth. Age-depth model files (corename_admodel.txt) are contained in a folder called "master" within the "age models" folder. The reason for this additional subdirectory level is to allow different age model scenarios to be stored, which can subsequently be accessed from the GUI. For example, one might wish to store and compare different age-depth models based on different methods (¹⁴C, U/Th, etc.) for the same set of sediment cores. Similarly, one may want to compare age-depth models developed by different software packages (Blaauw and Christen, 2011; Bronk Ramsey, 1995; Haslett and Parnell, 2008; Lougheed and Obrochta, 2019). In that case, an additional folder can be made within the "age models" folder, and its contents will be accessible from the GUI. Age-depth model files use the "Undatable" (Lougheed and Obrochta, 2019) output file format by default, but users can switch to the file format of a different age-depth modelling software, or indeed any type of age-depth model file, by editing the admodelformat.m formatting file contained within the relevant subdirectory of the "age models" folder. Here, to demonstrate the PARIS system, we supply a number of age-depth model files produced for Atlantic Ocean sediment cores by Waelbroeck et al. (2019).
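The scenario folder layout lends itself to simple programmatic discovery. The following Python sketch is not the GUI's Matlab code; it assumes only the layout described above (one subfolder per scenario inside "age models", each holding corename_admodel.txt files), and both function names are hypothetical.

```python
from pathlib import Path


def list_scenarios(age_models_dir: Path) -> list[str]:
    """Return the names of all scenario subfolders, e.g. ['master']."""
    return sorted(p.name for p in age_models_dir.iterdir() if p.is_dir())


def admodel_path(age_models_dir: Path, core: str,
                 scenario: str = "master") -> Path:
    """Build the expected path of a core's age-depth model file."""
    return age_models_dir / scenario / f"{core}_admodel.txt"
```

Adding a new scenario then requires no change to the code: creating a new subfolder next to "master" is enough for list_scenarios to report it.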

Core information index file
All raw data files and age-depth model files contain a unique code identifying the sediment core that they come from. An additional file (_core information.txt) is present within the main folder of PARIS, which details some basic metadata for each core, namely location (latitude and longitude) and water depth (mbsl). This allows the PARIS system to first find the sediment core locations that match specific search criteria (e.g. a certain water depth or latitude/longitude bounding box) and then retrieve all raw data and age-depth models associated with those cores.
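A search of this kind reduces to a filter over the core index. The Python sketch below is an illustration under stated assumptions, not the PARIS implementation: the core names and coordinates are invented, the CoreInfo record and search_cores function are hypothetical, and the bounding box is assumed not to cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreInfo:
    name: str
    lat: float          # degrees north
    lon: float          # degrees east
    water_depth: float  # metres below sea level (positive)


def search_cores(cores, lat_range, lon_range, depth_range):
    """Return names of cores inside the lat/lon box and depth range."""
    lat_min, lat_max = lat_range
    lon_min, lon_max = lon_range
    depth_min, depth_max = depth_range
    return [
        c.name for c in cores
        if lat_min <= c.lat <= lat_max
        and lon_min <= c.lon <= lon_max
        and depth_min <= c.water_depth <= depth_max
    ]


# Invented index entries, for illustration only
index = [
    CoreInfo("CORE-A", 37.8, -10.2, 3146.0),
    CoreInfo("CORE-B", 55.5, -14.7, 2209.0),
    CoreInfo("CORE-C", -33.0, 15.0, 4000.0),
]
matches = search_cores(index, (30, 60), (-20, 0), (2000, 3500))
```

The returned core names can then be matched against the core name field of the raw data and age-depth model file names.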

Reference records and bathymetry
Laboratories may also wish to store climate reference records for display within the GUI or for easy access. For this reason, we include some climate reference records that can be viewed within the GUI. These include the Greenland ice-sheet δ¹⁸O and Ca²⁺ records (Rasmussen et al., 2006; Seierstad et al., 2014), temperature derived from the Greenland isotope temperature record (Kindler et al., 2014), and atmospheric CO₂ derived from the Antarctic ice core record (Lüthi et al., 2008). We also include a downscaled version of the GEBCO bathymetry (General Bathymetric Chart of the Oceans, 2015), which is used within the PARIS GUI to provide a simple map showing core locations superimposed upon bathymetry.

GUI search interface
To demonstrate the power of the text file based archiving system, and in order to provide a system with which laboratory members at LSCE could browse and visualise sediment core data, a GUI system was developed in Matlab (Fig. 2). This system allows the user to search for sediment core locations according to certain criteria and to specify which types of data to plot. The data are shown in three separate, vertically stacked panels (Fig. 3). Data from multiple sediment cores and/or species can be plotted on either of the first two panels, in order to facilitate inter-core comparison. Data can be plotted against depth or against age (from one of the supplied age-depth models), and the user can choose to plot with or without error bars. The third panel is reserved for plotting reference data and/or sediment accumulation rate (SAR) plots. The software automatically assigns a unique colour code to each sediment core, and a unique symbol and line type to each type of data and/or species. Legends are also shown for ease of user interpretation. Finally, every time a plot is generated, a publication quality PDF of the plotted panels is generated within the main PARIS folder, saved under a name specified by the user.
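The colour and symbol assignment can be pictured with the following Python sketch. It is one possible scheme, not the logic of the Matlab GUI: the palettes, the function name and the example series are all hypothetical, and the assigned styles are unique only while the number of cores and data types does not exceed the palette sizes.

```python
from itertools import cycle

# Hypothetical palettes (matplotlib-style names)
COLOURS = ["tab:blue", "tab:orange", "tab:green", "tab:red"]
MARKERS = ["o", "s", "^", "D"]


def assign_styles(series):
    """Map each (core, kind) pair to a (colour, marker) tuple.

    Colour depends only on the core; marker depends only on the
    data type or species ('kind'). Palettes wrap around when exhausted.
    """
    colours, markers = cycle(COLOURS), cycle(MARKERS)
    colour_of, marker_of, styles = {}, {}, {}
    for core, kind in series:
        if core not in colour_of:
            colour_of[core] = next(colours)
        if kind not in marker_of:
            marker_of[kind] = next(markers)
        styles[(core, kind)] = (colour_of[core], marker_of[kind])
    return styles


styles = assign_styles([("CORE-A", "Gbul"), ("CORE-A", "Cwuel"),
                        ("CORE-B", "Gbul")])
```

Tying colour to the core and symbol to the data type keeps the same species visually comparable across cores within one panel.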

Database inter-compatibility potential
Once all data from a given laboratory is stored using a common format, the process of submission to a given database or repository (i.e. changing the format to suit a particular repository) can be fully automated. One needs only to write a one-off batch script that can convert all files from the laboratory to the required format of the various repositories. Here, we provide an example of one such script (dataonage.m), which was used within the LSCE laboratory to systematically read in all isotope data vs depth and output all isotope data vs age according to their respective age-depth models. Hence, systematically updating the age values for data submitted to repositories becomes a simple and rapid task.
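The core step of such a depth-to-age conversion can be sketched in Python as follows. This is not a port of dataonage.m, whose internals are not described here. It assumes that the age-depth model is available as two equal-length lists with strictly increasing depths, that ages are obtained by linear interpolation, and that depths outside the modelled range receive NaN. All names and values are hypothetical.

```python
import math
from bisect import bisect_right


def interpolate_age(depth, model_depths, model_ages):
    """Linearly interpolate an age for a depth from an age-depth model.

    model_depths must be strictly increasing. Depths outside the
    modelled range return NaN rather than being extrapolated.
    """
    if len(model_depths) != len(model_ages):
        raise ValueError("model depths and ages must have equal length")
    if depth < model_depths[0] or depth > model_depths[-1]:
        return math.nan
    i = bisect_right(model_depths, depth)   # first node deeper than depth
    if i == len(model_depths):              # depth equals the deepest node
        return float(model_ages[-1])
    d0, d1 = model_depths[i - 1], model_depths[i]
    a0, a1 = model_ages[i - 1], model_ages[i]
    return a0 + (a1 - a0) * (depth - d0) / (d1 - d0)


# Invented age-depth model: depth in cm, age in years
model_depths = [0.0, 10.0, 20.0]
model_ages = [0.0, 1000.0, 3000.0]
ages = [interpolate_age(d, model_depths, model_ages) for d in (5.0, 15.0, 25.0)]
```

Running the same function over every raw data file with a different age-depth model scenario is all that is needed to regenerate the data-vs-age output when a geochronology is revised.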

Conclusion
The palaeoclimate literature has begun to embrace the principles of FAIR data, and many good examples of useful database structures have previously been provided (Bolliet et al., 2016; Jonkers et al., 2020; Khider et al., 2019; McKay and Emile-Geay, 2016). We have provided an example of a concrete first step in the journey towards FAIR data: the creation of machine-readable data at the field and laboratory level. The involvement of multiple actors within a project requires that a machine-readable format is fully accessible to persons with only basic computer proficiency. The simple PARIS file naming and structuring system, based on the ubiquitously accessible tabbed-text file format, is one such example, and has been successfully deployed at LSCE. The simple structure is nonetheless powerful in that data can easily be indexed and searched, as demonstrated here. Such an approach can encourage a laboratory to adhere to the FAIR data principles from the very outset of a project, thus saving much of the time and resources that would otherwise be spent on post-hoc data conversion.

Database and GUI availability
The database and GUI system can be downloaded from Zenodo: https://doi.org/10.5281/zenodo.4680717

Figure caption (beginning missing): ... (Vidal et al., 1997). Also shown are Greenland ice core oxygen isotope data (Rasmussen et al., 2006; Seierstad et al., 2014). Younger Dryas and Heinrich Stadial intervals are as defined by Waelbroeck et al. (2019).

Table 2. Stable isotope data from the following sediment cores included in the demonstration database.