EarthArXiv coversheet for: A high-resolution downscaled CMIP6 projections dataset of essential surface climate variables over the globe coherent with the ERA5-Land reanalysis for climate change impact assessments

A high-resolution climate projections dataset is obtained by statistically downscaling climate projections from the CMIP6 experiment using the ERA5-Land reanalysis from the Copernicus Climate Change Service. This global dataset has a spatial resolution of 0.1° x 0.1°, comprises 5 climate models and includes two daily surface variables aggregated to monthly resolution: air temperature and precipitation. Two greenhouse gas emissions scenarios are available: one with mitigation policy (SSP126) and one without mitigation (SSP585). The downscaling method is a Quantile Mapping (QM) method called the Cumulative Distribution Function transform (CDF-t), which was first used for wind values and is now referenced in dozens of peer-reviewed publications. The data processing includes quality control of metadata according to climate modelling community standards and value checking for outlier detection.


Value of the Data
• The high resolution and the number of models and variables available offer a great opportunity for both researchers and climate change adaptation practitioners to study climate change features and to feed these data into impact models for any region around the world.
• The dataset is obtained by statistically downscaling climate projections from the CMIP6 experiment using the ERA5-Land reanalysis from the Copernicus Climate Change Service, a data product extensively used around the world for historical climate analysis. A great advantage of this dataset is thus that it extends the ERA5-Land reanalysis into the future.
• The dataset is global, has a spatial resolution of 0.1° x 0.1°, comprises 5 climate models, allowing model uncertainty to be addressed, and includes 2 daily surface variables aggregated to monthly resolution: mean air temperature and precipitation.
• To sample future climate uncertainty from anthropogenic forcing, two greenhouse gas emissions scenarios are available: one with mitigation policy (SSP126) and one without mitigation (SSP585).

Data Description
The high-resolution climate projections dataset covers the globe at a 0.1° x 0.1° spatial resolution and at monthly temporal resolution for two surface variables. It comprises 5 models (see Table 1) from the CMIP6 experiment [1], with simulations for the historical period and the 21st century (2015 to 2100) under two emissions scenarios: one with mitigation policy (Shared Socio-Economic Pathway 126, or SSP126) and one with no mitigation (Shared Socio-Economic Pathway 585, or SSP585). The following two land surface variables are downscaled: mean daily temperature and total precipitation. The combination of models and scenarios represents 10 climate projections, each including both variables (see Table 1). Other variables, models and emissions scenarios could be added in the near future.
The data were produced with a statistical downscaling method using the ERA5-Land reanalysis [2] for calibration and training (see the next section for details). The value of the downscaled data lies in the removal of model biases at a resolution more compatible with the requirements of climate change impact assessments and impact modelling. In other words, it corrects the climatology (distribution) of model values to make them comparable with a reference observational dataset [3], which in this case is the ERA5-Land reanalysis. In the following subsections, we present the file naming conventions and then proceed with an illustration of the bias removal over the historical period and of the climate change signal differences at the end of the century.

File name conventions
There is no official Data Reference Syntax (DRS) defined by the climate modeling community for statistically bias-adjusted or downscaled CMIP projections as there is for CMIP6 and CMIP5 projections.
However, there is a DRS for CORDEX bias adjusted simulations defined in the context of the EURO-CORDEX experiment. We adapted the EURO-CORDEX DRS for adjusted projections to CMIP6/CMIP5 in order to produce a DRS for CMIP adjusted projections.
We kept the terms "bias-adjustment" and "adjustment" even though, strictly speaking, we are producing downscaled projections. The terms "bias-adjusted" and "downscaled" are used interchangeably in the scientific literature because they often involve the same statistical techniques (even though, more recently, "bias-adjustment" tends to be reserved for cases where the original model resolution is unchanged). We also wanted to rely only on what has been defined by the climate modeling community, and the resulting DRS appears sufficient to include all the information needed in the file naming.
The DRS we came up with for CMIP6 adjusted projections is presented below through a short illustrative example of file naming.
Data file containing original (uncorrected) model results: tas_day_IPSL-CM6A-LR_ssp126_r1i1p1f1_gr_20160101-20251231.nc, where:
• "tas" is the conventional name for surface temperature,
• "day" is the label for frequency (here daily),
• "IPSL-CM6A-LR" is the official name of the climate model,
• "ssp126" is the short name of the Shared Socioeconomic Pathway used,
• "r1i1p1f1" is the member identifier of the simulation,
• "gr" is the grid label of the original data,
• "20160101-20251231" are the start and end dates of the simulation.
For the corresponding downscaled data file:
• "Adjust" is appended to the conventional variable name,
• "mon" is the temporal resolution of the data (here monthly),
• "gr010" is the grid label of the downscaled data,
• "TCDF" is the short name of the organization that performed the post-processing,
• "CDFT23" is the label referencing the method,
• "ERA5Land" is the short name of the dataset used as observations,
• "1981-2010" is the period used for statistically calibrating the method.
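To make the convention concrete, the hypothetical helper below parses the components of the uncorrected example file name given above; the function name and the fixed field order are illustrative assumptions, not part of any official DRS tooling.

```python
from pathlib import Path

def parse_cmip6_filename(path: str) -> dict:
    """Split a file name of the form
    <variable>_<frequency>_<model>_<scenario>_<member>_<grid>_<start>-<end>.nc
    into its components (illustrative helper, not an official DRS tool)."""
    stem = Path(path).stem  # drop the ".nc" suffix
    variable, frequency, model, scenario, member, grid, period = stem.split("_")
    start, end = period.split("-")
    return {"variable": variable, "frequency": frequency, "model": model,
            "scenario": scenario, "member": member, "grid": grid,
            "start": start, "end": end}

print(parse_cmip6_filename("tas_day_IPSL-CM6A-LR_ssp126_r1i1p1f1_gr_20160101-20251231.nc"))
```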
The complete list of the data files is given in Annex 1.

Differences of interpolated and downscaled data with respect to the reanalysis data
Here we compare both the original model simulations (interpolated on the reanalysis grid and referred to as "interpolated") and the downscaled simulations (referred to as "downscaled") with the ERA5-Land reanalysis. We first look at the historical 30-year calibration period (1981-2010) for both daily mean temperature and total precipitation. We also look at the 1951-1980 period, for simulations only (since there is no reanalysis data for this period), to examine the differences over a 30-year period outside the calibration period.
In Figure 1, we illustrate the spatial differences between the interpolated (left) and downscaled data (right) over the calibration period. For temperature, the ensemble mean of the five CMIP6 models tends to overestimate temperature. This is particularly true for North America, where the bias between the model mean and ERA5-Land is above +5°C. We can further notice that temperature in mountainous areas (the Himalayas and the Rockies) is often underestimated by GCMs because of the poor representation of elevation. When considering the downscaled data, the models have temperatures comparable to the ERA5-Land reanalysis across the globe. For precipitation, there are overestimations over the oceans in the tropics, along the Andes Cordillera, south of the Arabian Peninsula and in the Gulf of Guinea. In contrast, rainfall is underestimated in the areas adjacent to these (e.g. West Africa).
In Figure 2a, the cumulative distribution functions (CDFs) are empirically estimated from monthly values averaged over the globe for the interpolated (left) and downscaled data (right) over the calibration period. There is a spread between the ERA5-Land reanalysis and the interpolated data, with both overestimations and underestimations. This spread is larger for precipitation than for temperature. For the downscaled data, the difference between the simulated CDFs and the observed CDF is greatly reduced. However, for precipitation, we observe differences for low and high monthly precipitation amounts. This is due to the fact that downscaling is performed at a daily scale and grid point by grid point, while the CDFs are estimated on monthly and spatially averaged data. The day-to-day (temporal) and spatial variability of the model data are preserved by the downscaling method, so residual biases can appear on monthly and spatial averages. Over the 1951-1980 period, outside the calibration period, we see the same type of changes between the interpolated and downscaled data, and the same features across variables, as over the calibration period. In the interpolated data, the spread of the cumulative distributions is similarly larger for precipitation than for temperature. The reduction of the CDF spread in the downscaled data is more pronounced for temperature than for precipitation.
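For readers wishing to reproduce such comparisons, the short sketch below shows one common way to estimate an empirical CDF from a series of monthly, globally averaged values; the synthetic data and names are placeholders.

```python
import numpy as np

def empirical_cdf(values: np.ndarray):
    """Return sorted values and their empirical cumulative probabilities."""
    x = np.sort(np.asarray(values))
    probs = np.arange(1, x.size + 1) / x.size
    return x, probs

# Placeholder series of 30 years of monthly, globally averaged temperatures (K).
rng = np.random.default_rng(0)
x_model, p_model = empirical_cdf(rng.normal(288.5, 1.0, size=360))
x_era5, p_era5 = empirical_cdf(rng.normal(288.0, 1.0, size=360))
```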

Changes at the end of the century after downscaling
Here we illustrate changes by the end of the century over the 2071-2100 period under scenario SSP585, comparing the interpolated and downscaled simulations. The analyses are based on daily mean temperature and total precipitation.
On the maps of Figure 3, we show the spatial differences between the interpolated and downscaled ensemble means. For temperature, the effect of downscaling is observed mainly in mountainous regions such as the Himalayan massif, the Andes and the Rockies. For precipitation, we can see severe underestimations in the tropics, particularly in South America, West Africa, India and Oceania. We can also notice that the simulated double ITCZ in the Pacific (e.g. [4]) is corrected. In extra-tropical areas, the corrections are smaller, between -2 and +2 mm/day.

Experimental Design, Materials and Methods
Four datasets are used in this work:
• The reanalysis data, used as reference for calibrating the statistical algorithm over a training period. The reanalysis grid sets the final resolution of the downscaled projections.
• The original model climate projections, which come in a variety of spatial resolutions (typically between 2.5° x 2.5° and 0.9° x 0.9°) and are referred to as "raw".
• The raw data interpolated onto the reanalysis grid, referred to as "interpolated".
• The downscaled data, obtained from the interpolated data and the reanalysis data used for statistical calibration (both on the same grid), referred to as "downscaled".
The raw and reanalysis data are input data that need to be sourced. The interpolated data is an intermediary dataset required by the methodology, while the downscaled data is the final dataset. These datasets correspond to the four-step process (data sourcing, remapping, downscaling, quality control) described below.

Data Sourcing
The reanalysis data is the ERA5-Land reanalysis [2]. ERA5-Land is the latest climate reanalysis produced by ECMWF as part of the implementation of the EU-funded Copernicus Climate Change Service (C3S), providing hourly data on land-surface parameters, together with estimates of uncertainty, from 1981 to the present day. ERA5-Land data are available on the C3S Climate Data Store on a regular latitude-longitude grid at 0.1° x 0.1° resolution. We compute the daily data from the ERA5-Land hourly data for all necessary variables. The climate simulations come from the Coupled Model Intercomparison Project Phase 6 (CMIP6) experiment [1]. They support the Sixth Assessment Report (AR6) of the Intergovernmental Panel on Climate Change (IPCC). We use projections from 2 emissions scenarios: SSP126 (a mitigation policy aligned with the 2°C Paris Agreement target) and SSP585 (no mitigation policy). Daily data of the necessary variables are extracted from the Copernicus Climate Change Service, which hosts a subset of the CMIP6 archive. The data cover the period from January 1st, 1951 to December 31st, 2100 (except for some models). The models have different spatial resolutions, ranging from 0.93° to 2.5°. The list of models is given in Table 1.
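As an illustration of the daily aggregation step, the sketch below reduces hourly ERA5-Land 2 m temperature to daily means with xarray; the file names are placeholders and the variable name "t2m" is an assumption following the usual ERA5-Land NetCDF naming, while accumulated variables such as precipitation would need different handling.

```python
import xarray as xr

# Hypothetical file retrieved from the C3S Climate Data Store; "t2m" (2 m air
# temperature) is assumed to follow the usual ERA5-Land NetCDF naming.
ds = xr.open_dataset("era5land_t2m_hourly_1981-2010.nc")

# Reduce hourly values to daily means. Accumulated variables such as total
# precipitation would need different handling, since ERA5-Land accumulates
# them over the forecast day rather than providing instantaneous values.
daily_mean = ds["t2m"].resample(time="1D").mean()

daily_mean.to_netcdf("era5land_t2m_daily_1981-2010.nc")
```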

Remapping
Remapping is a preliminary task required by the downscaling methodology. It consists in spatially interpolating the raw simulations (between 0.93° and 2.5° resolution) onto the ERA5-Land grid (0.1° x 0.1°). We use the Climate Data Operators (CDO, 2016) software from the Max Planck Institute, which gathers various interpolation algorithms used by the scientific community. Daily temperature (mean, minimum, maximum) and daily wind speed are interpolated with a bicubic method. Daily precipitation is interpolated sequentially (to 1.5°, then to 0.75°, and then to 0.1°) with a conservative method.
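As a minimal sketch of this step, the commands below call the standard CDO operators remapbic (bicubic) and remapcon (first-order conservative) from Python; the file names, the grid description file for the ERA5-Land 0.1° grid and the intermediate rNxM grid specifiers are illustrative assumptions, not the exact production set-up.

```python
import subprocess

def cdo(*args: str) -> None:
    """Run a CDO command and fail if it returns a non-zero exit code."""
    subprocess.run(["cdo", *args], check=True)

# Bicubic interpolation of daily mean temperature onto the ERA5-Land 0.1° grid
# ("era5land_grid.txt" stands for an assumed CDO grid description file).
cdo("remapbic,era5land_grid.txt", "tas_day_raw.nc", "tas_day_interp.nc")

# Sequential conservative remapping of daily precipitation: 1.5°, then 0.75°, then 0.1°
# (the rNxM global lon-lat grid specifiers are illustrative).
cdo("remapcon,r240x120", "pr_day_raw.nc", "pr_day_1.5deg.nc")
cdo("remapcon,r480x240", "pr_day_1.5deg.nc", "pr_day_0.75deg.nc")
cdo("remapcon,r3600x1800", "pr_day_0.75deg.nc", "pr_day_interp.nc")
```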

Downscaling
The downscaling method applied here is a Quantile Mapping-based (QM) method called the Cumulative Distribution Function transform (CDF-t) method [5][6][7][8][9]. CDF-t was first developed for wind values and is now referenced in dozens of peer-reviewed publications downscaling different sets of data and variables (e.g. [10][11][12]). QM methods relate the cumulative distribution function of a climate variable at large scale (e.g., from the GCM) to the CDF of the same variable at a local scale (e.g., from the reanalysis). They are increasingly popular in climate applications, although bias correction methods have also received criticism (e.g. [13]). For a review of recent QM methods see [3]. In our case the variables are downscaled at a daily resolution over the 1951-2100 period, using 1981-2010 as the calibration period. The precipitation variable is downscaled with a specific version of CDF-t referred to as "Singularity Stochastic Removal" (SSR), which addresses the specific challenges of rainfall occurrence and intensity [14]. The last step is to average the daily values into monthly ones to construct the dataset.
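To convey the quantile-mapping principle underlying CDF-t, the sketch below applies a basic empirical quantile mapping calibrated on a reference period. It only illustrates the general QM idea: it is not the CDF-t transform itself (which additionally accounts for changes in the model CDF between the calibration and projection periods), nor the SSR variant used for precipitation, and all names and synthetic data are illustrative.

```python
import numpy as np

def empirical_quantile_mapping(model_calib: np.ndarray, ref_calib: np.ndarray,
                               model_values: np.ndarray, n_quantiles: int = 100) -> np.ndarray:
    """Map model values onto the reference distribution estimated over the calibration period."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    model_q = np.quantile(model_calib, probs)   # model CDF over the calibration period
    ref_q = np.quantile(ref_calib, probs)       # reference (reanalysis) CDF
    # Each value is located within the model CDF and replaced by the reference value
    # at the same quantile (values outside the calibration range are clipped).
    return np.interp(model_values, model_q, ref_q)

# Illustrative use on synthetic daily temperatures for one grid point (placeholder data).
rng = np.random.default_rng(42)
tas_model_hist = rng.normal(290.0, 5.0, size=10957)    # model, 1981-2010
tas_era5land = rng.normal(288.5, 4.0, size=10957)      # reanalysis, 1981-2010
tas_model_future = rng.normal(293.0, 5.0, size=10957)  # model, e.g. 2071-2100
tas_corrected = empirical_quantile_mapping(tas_model_hist, tas_era5land, tas_model_future)
```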

Standardization
Standardization consists in rewriting output data files and related metadata to comply with standards used by the climate modeling community (e.g., the Climate and Forecast metadata convention and the Data Reference Syntax). We use the Climate Model Output Rewriter 2 (CMOR 2) library.

Quality control
We conduct two types of quality control. The first one is technical and consists in verifying data compliance with the climate community's standards, data consistency and metadata. This quality control is crucial for the data publication process and for data re-use. The second one is a value check for outliers in the downscaled data.

Technical quality control
We use the Quality Assurance tool (QA-DKRZ, https://readthedocs.org/projects/qa-dkrz) developed by the Deutsches Klimarechenzentrum (DKRZ) to check the conformance of the metadata of climate simulations in NetCDF format with the conventions and rules of the project. During the DKRZ Quality Assurance process, the following criteria are checked:
1. The number of datasets is correct and > 0
2. The size of every dataset is > 0
3. The datasets and corresponding metadata are accessible
4. The data sizes are controlled and correct
5. The spatial-temporal coverage metadata are consistent with the data
6. Time steps are correct and the time coordinate is continuous
7. The format is correct
8. Variable description and data are consistent
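QA-DKRZ performs these checks systematically. Purely as a simplified illustration of a few of the criteria above (non-zero file size, accessible variable and metadata, a continuous monthly time axis), a stand-alone check could look like the sketch below; it is not QA-DKRZ itself, and the file and variable names are placeholders.

```python
import os
import pandas as pd
import xarray as xr

def basic_checks(path: str, varname: str) -> None:
    """Simplified illustration of a few technical QC criteria (not QA-DKRZ itself)."""
    # Criterion: dataset size is > 0.
    assert os.path.getsize(path) > 0, "file is empty"
    ds = xr.open_dataset(path)
    # Criteria: data and metadata are accessible and consistent.
    assert varname in ds, f"variable {varname} is missing"
    assert "units" in ds[varname].attrs, "variable has no units attribute"
    # Criterion: time steps are correct and the time coordinate is continuous
    # (assuming monthly data with a standard calendar decoded to datetime64).
    months = ds.indexes["time"].to_period("M")
    expected = pd.period_range(months[0], months[-1], freq="M")
    assert months.equals(expected), "time axis has gaps, duplicates or is unordered"

# Example call on a placeholder file name:
# basic_checks("tasAdjust_mon_example.nc", "tasAdjust")
```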

Value quality control
The value quality control is built with CDO and NCO tools and consists in:
• Analyzing the difference between the downscaled values and the observations over the reference period.
• Analyzing the difference in time evolution between the downscaled and original model.

Difference between downscaled model and observations
First, we estimate two quantities:
• The average for each month over the reference period of the observations.
• The average for each month over the reference period of the downscaled model.
We then estimate the difference between these two quantities for every month. For each month, we take the 10th and 90th quantiles. That gives 12 values for each quantile. Finally, we verify that these 12 values fall within the following ranges (unpublished, R. Vautard personal communication):
• temperature between [-1 ; 1] K,
• precipitation between [-0.5 ; 0.5] mm/day.
These ranges are narrow and allow only small discrepancies, since modifications should be small over the historical period (and in particular over the calibration period). If values fall outside the range, the script raises an error and the simulation is rejected and thus not included in the dataset.
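A minimal sketch of this check is given below, assuming the downscaled data and observations over the reference period are available as xarray DataArrays on the same grid with "lat"/"lon" dimensions, and that the 10th and 90th quantiles are taken over all grid points for each month; the actual control is implemented with CDO and NCO tools.

```python
import xarray as xr

def check_reference_bias(downscaled: xr.DataArray, observations: xr.DataArray,
                         lower: float, upper: float) -> bool:
    """Check that the 10th/90th quantiles of the monthly-mean differences over the
    reference period stay inside [lower, upper] for all 12 months."""
    clim_down = downscaled.groupby("time.month").mean("time")   # 12 monthly averages
    clim_obs = observations.groupby("time.month").mean("time")
    diff = clim_down - clim_obs                                  # 12 difference maps
    q = diff.quantile([0.1, 0.9], dim=["lat", "lon"])            # 2 x 12 values
    return bool(((q >= lower) & (q <= upper)).all())

# Thresholds from the text: [-1, 1] K for temperature, [-0.5, 0.5] mm/day for precipitation.
# ok = check_reference_bias(tas_downscaled_ref, tas_era5land_ref, -1.0, 1.0)
```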

Difference of evolutions between downscaled model and original model
First, we estimate four quantities:
• The average for each season in the reference period for the original model.
• The average for each season in the reference period for the downscaled model.
• The average for each season in the future period (2071-2100) for the original model.
• The average for each season in the future period (2071-2100) for the downscaled model.
Then, for each season, we compute the evolution between the future and reference periods for both the original and downscaled model. We estimate the difference between these two evolutions and obtain 4 files as output (one per season). For each season (i.e. for each file), we take the 10th and 90th quantiles of the differences. That gives 4 values for each quantile. Finally, we verify that these 4 values fall within the following ranges (unpublished, R. Vautard personal communication):
• temperature between [-2 ; 2] K,
• precipitation between [-1 ; 1] mm/day.
In this case, the thresholds are wider than previously to allow for larger discrepancies, but remain small enough to avoid unrealistic changes. If values fall outside the range, the quality control raises an error and the simulation is rejected and thus not included in the dataset.
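Under the same assumptions as the previous sketch, the evolution check could be expressed as follows; it compares the seasonal climate-change signal (future minus reference) of the downscaled and original model data and verifies the 10th and 90th quantiles season by season.

```python
import xarray as xr

def check_evolution(orig_ref: xr.DataArray, orig_fut: xr.DataArray,
                    down_ref: xr.DataArray, down_fut: xr.DataArray,
                    lower: float, upper: float) -> bool:
    """Check that, for every season, the 10th/90th quantiles of the difference between
    the downscaled and original climate-change signals stay inside [lower, upper]."""
    def seasonal_mean(da: xr.DataArray) -> xr.DataArray:
        return da.groupby("time.season").mean("time")                # 4 seasonal averages
    signal_orig = seasonal_mean(orig_fut) - seasonal_mean(orig_ref)  # original model evolution
    signal_down = seasonal_mean(down_fut) - seasonal_mean(down_ref)  # downscaled evolution
    diff = signal_down - signal_orig
    q = diff.quantile([0.1, 0.9], dim=["lat", "lon"])                # 2 x 4 values
    return bool(((q >= lower) & (q <= upper)).all())

# Thresholds from the text: [-2, 2] K for temperature, [-1, 1] mm/day for precipitation.
```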