Addressing Model Data Archiving Needs for the Department of Energy’s Environmental Systems Science Community

Researchers in the Department of Energy’s ESS program use a variety of models to advance robust, scale-aware predictions of terrestrial and subsurface ecosystems. ESS projects typically conduct field observations and experiments coupled with modeling exercises using a model-experimental (ModEx) approach that enables iterative co-development of experiments and models, and ensures that experimental data needed to parameterize and test models are collected. Thus, preserving “model data”, comprising the outputs from simulations as well as driving, parameterization, and validation data with associated codes, is becoming increasingly important. The ESS-DIVE repository stores data associated with the ESS programs and conducted a months-long survey of the ESS community to identify needs for archiving, sharing, and utilizing model data. Here, we present the results of the community survey and the proposed ESS-DIVE approach over the short term (next 3 years) and long term (4-10 years) to support the needs of the ESS modeling community. In the short term, ESS-DIVE proposes to work on functionality that supports archiving of model data associated with publications, with an emphasis on developing community guidelines and standards that make the data more discoverable, accessible, and usable. The long-term vision is to broadly enable data-model integration and knowledge generation from model and observational data. This vision will be achieved through close partnerships with the ESS community.


Introduction
The Environmental System Science (ESS) activity within the Department of Energy's (DOE) Climate and Environmental Science Division (CESD) in the Office of Biological and Environmental Research (BER) seeks to advance a robust, predictive understanding of underlying interactive terrestrial and subsurface ecosystem processes (hydrology, biogeochemistry, microbiology, vegetation dynamics) through integrated modeling and experimental efforts (ModEx). Research efforts in this program generate vastly diverse data that range across multiple scales (single-pore to global systems), which are used to inform Earth System Models (ESMs) and local-scale models and evaluate solutions to energy and environmental problems (Figure 1). For example, ESS models are used to predict how arctic and forest landscapes drive and respond to global and environmental change, and how watersheds evolve over time.
Figure 1 - ESS-DIVE stores diverse data generated from research in terrestrial and subsurface ecosystems through field observations and experiments coupled with modeling simulations. This model-experimental (MODEX) approach enables iterative co-development of experiments and models, and ensures that experimental data needed to parameterize and test models are collected and made available. The data help advance scientific understanding and prediction of hydro-biogeochemical and ecosystem processes that occur from bedrock through soil and vegetation to the atmospheric interface. [Figure from Varadharajan C., et al., Eos 100. DOI: 10.1029]
Data management and cyberinfrastructure are critical components of accelerating the process of scientific knowledge discovery across domains. These components are also goals of CESD's Data Management Activity, which funds the Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE). The ESS-DIVE repository was established on April 1, 2018 to serve as a long-term steward of data produced from ESS research projects and to enable discovery and efficient data use (Varadharajan 2019). Currently, the ESS community can publish different types of data (observational, experimental, modeling) on ESS-DIVE by creating single or multiple data packages containing any number of files for a given project through the web portal or Application Programming Interface (API).
The user completes a metadata form for each data package, which contains identifying information so that the data package can be discovered using the ESS-DIVE search engine. Once a data package is uploaded, a digital object identifier (DOI) is assigned to it. The data packages are archived at the DOE's National Energy Research Scientific Computing Center (NERSC). ESS-DIVE is a member of the DataONE network, and its published data packages are replicated at other nodes. ESS-DIVE's infrastructure is being developed in collaboration with NERSC and the National Center for Ecological Analysis and Synthesis (NCEAS). ESS-DIVE users are expected to comply with the repository's terms of use, including submitting data in community standards and formats (http://ess-dive.lbl.gov/about/terms/).
Model data can include output files of various dimensions and resolutions (final raw outputs, spin-up output files, restart files, test data files, and higher level outputs corresponding to figures); a variety of metadata files (some metadata may be embedded within output files such as those in NetCDF formats); visualization files; model code; input files (e.g., model parameters, meteorological data, surface data); scripts for model set-up and initialization; parameterization; post-processing; and visualizations.
A limited set of small-sized model data (e.g., protocols, outputs, inputs) have been published to date on ESS-DIVE (Fung 1993; Hilton and Baker 2018; Walker et al. 2018a, b; Arora et al. 2019; Dwivedi 2019). However, the repository has not yet been optimized for model data due to various challenges. For instance, there are software- and architecture-related limitations, such as thresholds for data upload, storage, and distribution. For uploads, default limits are a maximum of 10 GB/file through the web portal and 10 GB/upload through the API. As of December 2019, uploads of up to 100 GB through the API are possible by request. Even though the API is the preferred automated approach for uploading and downloading large datasets compared to the web interface, it still does not scale to large model data. Furthermore, there is no community consensus on several important questions, such as what model-related data are worth archiving, which standards to use, and how much storage space is needed.
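The upload thresholds quoted above can be turned into a simple pre-submission check. The following is a minimal sketch, not part of any ESS-DIVE tooling: the function names and category labels are hypothetical, and it assumes decimal gigabytes (10^9 bytes), since the text does not say whether the limits are decimal or binary.

```python
from pathlib import Path

# Assumption: decimal gigabytes; the text does not specify decimal vs. binary units.
GB = 10**9

DEFAULT_LIMIT_BYTES = 10 * GB      # default limit per file (portal) or per upload (API)
BY_REQUEST_LIMIT_BYTES = 100 * GB  # API uploads possible by request (as of December 2019)


def classify_upload(size_bytes):
    """Return a (hypothetical) label for the upload pathway a payload size falls under."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    if size_bytes <= DEFAULT_LIMIT_BYTES:
        return "default"
    if size_bytes <= BY_REQUEST_LIMIT_BYTES:
        return "api-by-request"
    return "exceeds-current-limits"


def classify_files(directory):
    """Classify every regular file under `directory` by its size on disk."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): classify_upload(p.stat().st_size)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


print(classify_upload(25 * GB))  # api-by-request
```

Running `classify_files` on a model output directory before submission would flag which files fit the default limits and which would need a by-request or alternative transfer pathway.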
This whitepaper addresses ESS-DIVE's goals of providing infrastructure to support management of model data, and a path for efficient integration of observational data and model development, testing, and analysis. Our main objective was to synthesize the model data archiving needs through community engagement, and to determine possible approaches to support those needs, which could include building new functionality in ESS-DIVE. First, we researched the capabilities of existing data systems that support model or large data archiving, including the Earth System Grid Federation (ESGF) (https://esgf.llnl.gov/), the National Aeronautics and Space Administration (NASA) Earth Observing System Data and Information System (EOSDIS) Distributed Active Archive Centers (DAACs) (https://earthdata.nasa.gov/eosdis/daacs), and the National Center for Atmospheric Research (NCAR) Research Data Archive (RDA) (https://rda.ucar.edu/) and Earth Observing Laboratory (EOL) data archive (https://data.eol.ucar.edu/). Based on our background research and our own modeling expertise, we designed a feedback form to gather structured community input. We used several types of outreach methods to solicit community feedback. To kick off the community engagement effort, we held an ESS-DIVE monthly webinar in November 2019 and received feedback on various needs, challenges, and approaches to archiving model data. Following the webinar, we distributed the feedback form to the ESS cyberinfrastructure working groups and the ESS-DIVE Archive Partnership Board comprising lead PIs of major ESS projects, and also sent 21 emails to identified individual modelers to schedule one-on-one interviews.
We received 12 responses, 8 of which were completed through interviews, spanning several institutions (LBNL, ORNL, PNNL, Stanford). We also met with various project groups to discuss their model data archiving needs during a site visit to Los Alamos National Laboratory in January 2020.
Here, we present the results of our community outreach effort (Section 2), and a synthesis of the model archiving needs of the ESS modeling community (Section 3). We then present the ESS-DIVE approach to support model data. Based on feasibility and the logical progression of building infrastructure to support the community, our approach is phased into short-term (1-3 years) next steps (Section 4) and a long-term (4-10 years) vision (Section 5).

Results from Community Feedback
Based on the feedback we received, the research community is tackling several pressing cyberinfrastructure and data management challenges related to model data. First, data are increasing in volume and complexity. For example, there is increasing use of ensemble model runs and very high resolution simulations, which are critical for the watershed models and the global land-surface modeling community (Wood et al. 2011), but produce very large output data volumes. Second, the data are extremely heterogeneous due to the diversity of scientific domains and spatial and temporal scales. Third, there is a disconnect between model and observational data, and fragmentation between workflows. In order to maintain scientific productivity and advance computational science (e.g., community benchmark problems, model testing and validation), the ESS modeling community needs cyberinfrastructure that can support data integration, visualization, and analytics using data on ESS-DIVE along with data from other repositories or data centers. This is particularly difficult for the many modeling workflows that require manual retrieval of data from multiple sources and subsequent pre-processing for use in modeling analyses.
As a use case illustrating a typical modeler's need to draw on many data sources, the following workflow is used to run watershed-based simulations of reactive transport (e.g., SFA modeling efforts, NGEE modeling efforts, ExaSheds). It involves accessing and integrating a wide variety of observational and modeled data from multiple sources (Ethan Coon, Personal Communication):
1. Identify watershed(s) of interest (human-driven using a basic data browser).
Due to the manual nature of this workflow, it is not scalable to modeling areas larger than a single catchment or site, highlighting the need for developing cross-portal analytical software and cyberinfrastructure to support it.
Next, we summarize the results from our survey of specific questions regarding the range of file specifications for the model data files generated for a typical simulation; the estimated total storage volume needed; opinions on which model data are worth archiving and for how long; the level of importance of various uses and features in a model data archive; and approaches to archiving the data.

ESS Modeling codes and Typical Simulations
A number of different multiscale, multiphysics and data-driven/hybrid modeling codes are used in ESS projects (Table 1). These models are run at different spatial and temporal scales and resolutions, spanning pore-scale to global simulations.
some journals and our community has raised is that these platforms are not guaranteed to be long-term archives.
There are numerous types of scripts used in a modeling workflow, ranging from one-offs for specific papers to scripts used every time for preparing model inputs. Most modelers felt that specific scripts used for analysis should be archived. However, if a modeler anticipates running the same kind of experiment many times, then the scripts and model outputs could be archived separately with DOIs, allowing the outputs to be updated over time.

Project portals with version control
In addition to archiving model-related files for long-term preservation (shelf-life of more than 5 years), respondents expressed interest in having collaboration spaces within ESS-DIVE that allow researchers on specific modeling projects to interact and to access complete data packages capable of reproducing the same outputs. Many online resources do not have the storage capacity or other features needed for this type of project space. Such a collaboration space would significantly shorten the time to publication for some groups, both in terms of the research itself and curation of the data package(s) to be published. An effective collaboration space could be an extension of the concept of project-specific data repositories, which many ESS projects have, and the portals feature that ESS-DIVE will roll out in April 2020. It is important that data storage support versioning of the different files generated during the many model runs, especially since many modelers change their archived data several times between manuscript preparation and final publication, and beyond.

Data center Interoperability and computational functionality
Another feature that researchers identified as very useful was the ability for ESS-DIVE to provide access to data in other repositories or data systems, such as other DOE systems (ARM, EMSL, etc.), DataONE member nodes, USGS National Map and NWIS, NOAA, etc. Furthermore, in order to fully enable the CESD modeling community, tools are needed to perform model-data integration and simulations within the modeling project collaboration spaces. For example, several respondents indicated that having the ability to run models on NERSC while accessing the data stored in ESS-DIVE would improve scientific productivity.

ESS Modeling Community Needs
Based on the results of our community engagement, several efforts have emerged as being important for ESS-DIVE to address to improve its support of model data. The primary need for most researchers currently is to archive data associated with publications to meet journal and funding requirements. Ideally, this would involve developing a model-to-archive pipeline, which would constitute various ESS-DIVE services that can support consistent archiving of model data across ESS projects. This effort could entail:
• Developing community-informed guidelines on creating standardized model data packages on ESS-DIVE
• Developing a pathway for model data packages above the 100 GB size threshold to be archived in ESS-DIVE
• Implementing the ability to extract specific subsets of model simulations corresponding to specific runs, locations, variables, or figures for data curation and discoverability
• Providing project portals for sharing and collaborating on pre-published model data
• In the long-term, building on the community-informed model data archiving guidelines to design an interface for ESS-DIVE data contributors that automates the writing and/or organization of the files comprising the data packages. This model-specific (i.e., all the necessary data to run the model), or journal-specific (i.e., all the required data for publishing a journal article) tool would be able to extract specific subsets of model simulations corresponding to specific runs, locations, variables, or figures. This development would be a collaborative process with specific CESD projects.

In the longer term, a data-to-model pipeline that can enable integration of the data on ESS-DIVE and other data systems with simulation codes would dramatically improve modeling workflows. This effort could entail:
• Supporting data formats that are typically used in model simulations (e.g., netCDF), including the ability to retrieve data either through programmatic means or export mechanisms into these formats.
• Developing interoperability between individual data packages in ESS-DIVE and other data centers for model-data integration, ultimately enabling MODEX through seamless data extraction of field observations, measurements from manipulative experiments, and remote sensing data to use for model development, parameterization, and performance testing, and to improve future measurement designs.

Proposed ESS-DIVE Short-term Approach (0-3 years)
Based on the synthesis of community needs described above, this section outlines a proposed ESS-DIVE short-term approach to supporting model data archiving. The proposed work would build on ESS-DIVE's current efforts to develop standards and functionality that support the needs and priorities of the broader ESS community.

Developing Model Data Archiving Guidelines
ESS-DIVE already has a number of community efforts related to standardizing different types of field and lab measurements across ESS projects (http://ess-dive.lbl.gov/community-projects/). Thus, a next focus could be the development of model data archiving guidelines in close collaboration with the community. These guidelines will outline best practices for curating model data packages, including the file directory structure; level of detail of model outputs; determining whether to partition collections of files into separate DOI-issued data packages; archiving procedures based on size of data (see Section 4.2); and formatting and naming conventions. Developing this protocol is a prerequisite for creating scripts that parse model data files that follow the reporting guidelines.
The first phase of the guidelines could include suggestions for model data submitters to archive (at a minimum) the required data for their targeted journal and to organize it following a standardized directory system outlined below. For example, it may require that for every plot in a journal article, the script, model code, initial and boundary conditions, and the workflow are archived.
Based on the feedback we have received so far and our recommendations for creating data packages, we first propose considerations for how model data packages should be organized.
• Optionally add more detailed metadata files within each of the sub-folders, as well as a fourth sub-folder (model) that contains the model code.
Additional considerations for how to bundle data into different data packages should include:
• Who deserves credit? If the DOI citations should differentiate the teams that did the work, consider partitioning the model data into separate data packages. For example, there may be different teams that developed the model, performed the model calibration, and ran the simulations.
• What's logically used together? If other researchers may want to use separate elements of the data collection, then split it into distinct data packages that are issued unique DOIs. This may be the case for model code, workflows, or input datasets that are repeatedly used.
• In considering the entire collection of files, if a subset has a pre-existing storage location, provide the DOI and hyperlink in the relevant metadata files and the high-level readme_all.txt (at a minimum).
Finally, long-term considerations for ensuring data reusability, adhering to the FAIR (Findable, Accessible, Interoperable, Reusable) principles (Wilkinson et al. 2016), are:
• Ensure files are machine-readable by (i) using common naming conventions for variables (e.g., CF, NSF), (ii) including metadata files that describe file structure, organization, and naming conventions used (e.g., netCDF and XML files have standard conventions), which should be consistent across the entire data package.
• Include critical metadata in simulation metadata files such as spatial information to enable advanced queries.
• Ensure traceability by including file(s) that document the workflow used. For example, create metadata files that describe the post-processing methods and scripts, the versions of models and software used, operating systems, and spatial and temporal domains.
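The traceability item above can be illustrated with a small script that writes a workflow metadata file alongside the model outputs. This is a minimal sketch under stated assumptions: the field names, the JSON format, and the example values are all hypothetical and are not an ESS-DIVE standard; only the categories of information (post-processing scripts, model and software versions, operating system, spatial and temporal domains) come from the text.

```python
import json
import platform
from datetime import datetime, timezone


def build_workflow_metadata(model_name, model_version, spatial_domain,
                            temporal_domain, postprocessing_scripts):
    """Collect traceability fields into one dictionary (field names are illustrative)."""
    return {
        "model": {"name": model_name, "version": model_version},
        "software": {"python": platform.python_version()},
        "operating_system": platform.platform(),
        "spatial_domain": spatial_domain,
        "temporal_domain": temporal_domain,
        "postprocessing_scripts": list(postprocessing_scripts),
        "created_utc": datetime.now(timezone.utc).isoformat(),
    }


def write_workflow_metadata(path, metadata):
    """Write the metadata as human- and machine-readable JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(metadata, fh, indent=2, sort_keys=True)


# Hypothetical example values, for illustration only.
example = build_workflow_metadata(
    model_name="example-model",
    model_version="1.0.0",
    spatial_domain={"west": -107.1, "east": -106.8, "south": 38.8, "north": 39.0},
    temporal_domain={"start": "2015-10-01", "end": "2016-09-30"},
    postprocessing_scripts=["scripts/make_figure_2.py"],
)
print(json.dumps(example, indent=2, sort_keys=True))
```

Recording the bounding box and time range as structured fields, rather than free text, is what would later allow the kind of spatial queries mentioned in the second item.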

Support for scaling to large model data files
Several respondents indicated it would be valuable to evaluate several potential paths forward to accommodate robust transfer and replication of very large model data packages by investigating potential partnerships and interoperability with various entities. ESS-DIVE is investigating pathways to enable archival and storage of very large data (where files may be larger than 100 GB).
There are three key components that need to be considered with respect to archival of datasets at this scale: (1) storage capacity, (2) data transfer capabilities, and (3) data replication for preservation.
ESS-DIVE considers these long-term priorities that will need to be addressed in a robust and scalable manner. There are current limitations in the capabilities of the ESS-DIVE software and hardware service that prevent archiving data at this scale. A pragmatic approach to enabling this functionality would involve partnering with external data infrastructure providers and facilities to utilize existing capabilities that can handle large volumes of data.
ESS-DIVE uses NERSC's "Community File System" resources for disk storage. This recent upgrade of the NERSC global file system has already allowed ESS-DIVE to increase its overall capacity (to 100 TB), and can be scaled up further as needed.
DOE has significant investments in existing infrastructure in the ESGF platform. ESGF manages a decentralized database for handling climate science data, with multiple petabytes of data at dozens of federated sites worldwide ( https://esgf.llnl.gov/mission.html ). ESS-DIVE is beginning early prototyping work with ESGF to utilize this infrastructure for storage, federation and retrieval of large datasets. This would allow data to be registered through ESS-DIVE (metadata, DOI management, search) and then made available through the ESGF infrastructure by leveraging underlying APIs.
Additionally, ESS-DIVE is discussing augmentations to its large data transfer and replication capabilities through the Globus platform. Globus provides a secure, unified interface to research data for high-performance data transfers. In particular, it can be used as a backend for managed, automated transfers of very large datasets, which would allow ESS-DIVE to make these datasets available for transfer and replication. The ESS-DIVE team has already begun efforts to enable large data file uploads and downloads through Globus. Note that ESGF also has existing support for Globus.
Currently, ESS-DIVE replication and federation are managed through the DataONE infrastructure. This infrastructure will need to be scaled up to manage large model datasets. As an alternative, ESS-DIVE can also consider peering arrangements with other DOE facilities or cloud service providers to provide a replicated copy of the data.
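Whichever transfer and replication pathway is chosen, a replicated copy is only useful for preservation if it can be shown to match the source. The following is a generic sketch of checksum-based verification using a file manifest. It does not represent how DataONE, Globus, or ESGF verify transfers; the function names are hypothetical, and SHA-256 is an assumption chosen for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so very large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to `directory`) to its SHA-256 checksum."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): sha256_of_file(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(source, replica):
    """Return files missing from the replica and files whose checksums differ."""
    missing = sorted(set(source) - set(replica))
    mismatched = sorted(
        name for name in set(source) & set(replica) if source[name] != replica[name]
    )
    return {"missing": missing, "mismatched": mismatched}
```

In this sketch, a manifest would be built once at the source and once at the replica, and an empty result from `compare_manifests` indicates that every source file arrived intact. Reading in chunks matters here because individual model output files may exceed available memory.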
Finally, ESS-DIVE is conducting extensive performance and load testing to determine the existing scaling limits of its infrastructure for data upload and download, and to help determine a long term path for being able to support very large data products.

Project spaces for collaboration
ESS-DIVE already supports the capability to create collections of data packages under the construct of a "Portal". Portals allow users to define a high-level page that describes the purpose and intent of the collection, and to create groups of associated datasets, such as those generated by a project, through a filter. Over the next three years, ESS-DIVE plans to design and develop "Project Spaces" that support administrative capabilities, permissions, and group management, so that projects can manage their data publication and users more easily. This capability would allow model researchers to address their need for interactive and collaborative spaces to co-develop models and organize associated modeling
Future efforts will include working with the ESS and broader science communities to figure out an optimal and pragmatic approach to version control of data packages, especially for use in journal bibliographies. One challenge with model data is that, with the large file sizes involved, storage of all copies of files may not be a viable option.

Fusion Database
ESS-DIVE's goal in the next 3 years is to build a fusion database that in the long-term will enable synthesis, subsetting, and querying of data stored within files across its packages. The database would be created by parsing data files that follow established community standards. Standards and reporting guidelines are an essential prerequisite for enabling these advanced query capabilities. A prototype of the fusion database is being developed using CSV files that are currently published on ESS-DIVE, which displays summary statistics of the data by variable. The fusion database could potentially address some of the needs related to search and subsetting of model data by specific spatial domains and variables.
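To make the idea of per-variable summary statistics from CSV files concrete, here is a minimal sketch using only the Python standard library. It is not the fusion database prototype: the function name, the choice of statistics, and the sample column names are hypothetical, and it assumes a header row that names each variable.

```python
import csv
import io
import statistics
from collections import defaultdict


def summarize_csv(text):
    """Return count, min, max, and mean for every numeric column of a CSV string."""
    columns = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        for name, value in row.items():
            try:
                number = float(value)
            except (TypeError, ValueError):
                continue  # skip non-numeric or missing cells
            columns[name].append(number)
    return {
        name: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": statistics.fmean(values),
        }
        for name, values in columns.items()
    }


# Hypothetical sample data, for illustration only.
sample = "site,water_temp_c,discharge_m3s\nA,4.0,1.5\nB,6.0,2.5\n"
for variable, stats in summarize_csv(sample).items():
    print(variable, stats)
```

The sketch only works because each column has a recognizable header, which mirrors the point made above: parsing files at scale depends on data following agreed standards and reporting guidelines.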

Integration with external repositories
As expressed by the modeling community and several other users, there is a need to be able to integrate data on ESS-DIVE with data from other repositories. There is also a need to store data on ESS-DIVE with pointers to data on other established, long-term public repositories (e.g., NASA, NCEI) to avoid duplicating data archival (particularly for large datasets) and creating multiple, possibly inconsistent, versions of public data. Over the next 3 years, we will explore options that include ingesting metadata from or linking to external databases (e.g., ESGF, ARM, NMDC, EMSL, USGS, other DataONE member nodes), either on the data portal (data.ess-dive.lbl.gov) or into the fusion database (Section 4.5). An essential element of this work would involve prioritizing efforts by surveying the community to identify external databases that are of high interest and value to multiple ESS projects.

ESS-DIVE Long-term Vision to Support Model Data (4-10 years)
Our long-term ESS-DIVE vision is in alignment with the model data archiving needs of the ESS community (Section 3) and will contribute to building a CESD data ecosystem that "helps transition the ESS research program from one associated with distributed datasets, specific process knowledge, and individual component models to one that enables a predictive understanding of key couplings and feedbacks among natural systems and anthropogenic processes across scales" (BERAC 2013). ESS-DIVE's long-term vision is to provide intuitive and efficient access to archived model data for a broad spectrum of scientific users that will enable knowledge discovery from those datasets.
Supported model types will include multi-scale, multiphysics, and emerging data-driven machine-learning and hybrid models. The capabilities we seek to develop with the community in the long-term will build on our short-term plans (Section 4), and will enable knowledge generation from models. In the long-term, ESS-DIVE seeks to include:
• Archiving capabilities to store and distribute increasingly large and heterogeneous data with fast access mechanisms and open data licenses.
• Data integration tools that connect and synthesize distributed datasets across data systems (e.g., ESS-DIVE, ESGF, ARM, Ameriflux, NASA, USGS) and enable users to easily discover, access, and integrate big, diverse datasets.
• Multiscale data assimilation tools to enable real-time integration of observation data with simulation codes.
• Data analytics and computational capabilities for data mining and deep learning; advanced statistical and information theory algorithms for time series and spatial analyses. This includes core libraries for data preprocessing such as QA/QC, subsetting, gridding.
• Different workflow tools and science gateways.
• A computational framework that enables community development of scripts and app-based tools with analytics engines to enable users to discover, query, subset, process, analyze and store data (similar to or built on existing cloud infrastructure such as Google Cloud Platform, Amazon Web Services Cloud, Microsoft Azure Cloud).
• Interactive visualizations and narrative interfaces built using recent advances in web-based tools to enable data exploration and knowledge discovery.
• A software repository for sharing programs developed by the community (e.g. QA/QC and data processing scripts) for reproducible science offering a limited number of compatible open source licenses.

This vision for ESS-DIVE is aligned with CESD's Data-Model Integration Scientific Grand Challenge (2018) that aims to "develop a broad range of interconnected infrastructure capabilities and tools that support the integration and management of models, experiments, and observations across a hierarchy of scales and complexity to address CESD scientific grand challenges."