Title High-resolution downscaled CMIP5 projections dataset of essential surface climate variables over the globe coherent with ERA5 reanalyses for climate change impact assessments

A high-resolution climate projections dataset is obtained by statistically downscaling climate projections from the CMIP5 experiment using the ERA5 reanalyses from the Copernicus Climate Change Service. The dataset is global, has a spatial resolution of 0.25° x 0.25°, comprises 21 climate models and includes five daily surface variables: air temperature (mean, minimum, and maximum), precipitation, and mean near-surface wind speed. Two greenhouse gas emissions scenarios are available: one with mitigation policy (RCP4.5) and one without mitigation (RCP8.5). The downscaling method is a quantile mapping (QM) method called the Cumulative Distribution Function transform (CDF-t), which was first used for wind values and is now referenced in dozens of peer-reviewed publications. The data processing includes quality control of metadata according to the climate modelling community standards and value checking for outlier detection.


Value of the Data
• The high resolution, number of models and variables available in the dataset offer a great opportunity for researchers and climate change adaptation practitioners to study climate change features and feed the data into impact models for any region around the world.
• The dataset is obtained by statistically downscaling climate projections from the CMIP5 experiment using the ERA5 reanalyses from the Copernicus Climate Change service, a data product extensively used around the world for historical climate analysis. A great advantage of this dataset is thus to provide a coherent extension of the ERA5 reanalysis into the future.
• The dataset is global, has a spatial resolution of 0.25° x 0.25°, comprises 21 climate models, which allows model uncertainty to be addressed, and includes five daily surface variables: air temperature (mean, minimum, and maximum), precipitation, and mean near-surface wind speed.
• To sample future climate uncertainty from anthropogenic forcing, two greenhouse gas emissions scenarios are available: one with mitigation policy (RCP4.5) and one without mitigation (RCP8.5).

Data Description
The high-resolution climate projections dataset covers the globe at a 0.25° x 0.25° spatial resolution and at monthly temporal resolution for five surface variables. It comprises 21 models (see Table 1) from the CMIP5 experiment [1] with simulations for the historical period and the 21st century (2006 to 2100) under two emissions scenarios: one with mitigation policy (Representative Concentration Pathway 4.5 or RCP4.5) and one with no mitigation (Representative Concentration Pathway 8.5 or RCP8.5). The downscaled variables are five surface variables over land and ocean: daily mean temperature, daily minimum and maximum temperature (at 2 m), total precipitation, and surface wind speed (at 10 m). The combination of models and scenarios yields 36 climate projections, with all variables available except for a few models that did not provide some of them (see Table 1). Other variables, models and emissions scenarios could be added in the near future.
The data was produced with a statistical downscaling method using the ERA5 reanalyses [2] for calibration and training (see next section for details). The value of the downscaled data lies in the removal of model biases at a resolution more compatible with the requirements of impact assessments and further modelling of the impacts of climate change. In other words, the method corrects the climatology (distribution) of model values to make them comparable with a reference observational dataset [3], which in this case is the ERA5 reanalyses. We briefly illustrate below the bias removal over the historical period and the differences in the climate change signal at the end of the century.

Differences between interpolated and downscaled data with reanalysis data
Here we compare how both the original model simulations (interpolated onto the reanalysis grid and referred to as "interpolated") and the downscaled simulations (referred to as "downscaled") differ from the ERA5 reanalyses over the historical period (1981-2010) for daily mean temperature, total precipitation and wind speed.
In Figure 1, we illustrate the spatial differences between the interpolated (left) and downscaled data (right). For temperature, the ensemble mean of CMIP5 models tends to overestimate temperature. This is particularly true for North America, where the bias between the model mean and ERA5 is above +5°C.
We can further notice that the temperature in mountainous areas (the Himalayas and the Rockies) is often underestimated by GCMs because of the poor representation of elevation. With the downscaled data, models have a temperature comparable to the ERA5 reanalyses across the world. For precipitation, there are overestimations over the tropical oceans, along the Andes Cordillera, south of the Arabian Peninsula and in the Gulf of Guinea. Conversely, rainfall is underestimated in the areas adjacent to these (for example, West Africa). For wind speed, the interpolated data show an underestimation in the North (Greenland), in Antarctica and in mountainous areas (the Himalayas, the Rockies). This underestimation is removed in the downscaled data.
In Figure 2, the cumulative distributions are empirically estimated from monthly values averaged over the globe for the interpolated (left) and downscaled data (right). There is a spread between the ERA5 data and the interpolated data, with both overestimations and underestimations. The spread is larger for precipitation and wind speed than for temperature. For downscaled data, the difference between the individual model curves and the observations is greatly reduced. However, for precipitation, we observe differences for the low and high monthly precipitation amounts. This is because downscaling is performed at a daily scale and grid point by grid point, while the CDFs are estimated and drawn from monthly and spatially averaged data. The day-to-day (temporal) and spatial variability of the model data are mostly preserved by the downscaling method, but residual biases can appear in monthly and spatial averages.

Changes at the end of the century after downscaling
Here we illustrate changes by the end of the century over the 2071-2100 period under scenario RCP8.5, comparing the interpolated and downscaled simulations. The analyses are based on daily mean temperature, total precipitation and surface wind speed.
On the maps of Figure 3, we show the spatial difference between interpolated and downscaled data of the ensemble mean. For temperature, the effect of downscaling can be observed mainly in mountainous regions such as the Himalayan massif, the Andes and the Rockies. For precipitation, we can see that amounts are severely underestimated in the tropics, particularly in South America, West Africa, India and Oceania. We can also notice that the simulated double ITCZ in the Pacific (e.g. [4]) is corrected. In extra-tropical areas, the corrections are smaller, between -2 and +2 mm/day. For wind speed, as in Figure 1, the downscaled data show stronger wind speeds at the poles and lower wind speeds in mountainous regions.
In Figure 4, we now illustrate the difference between interpolated and downscaled data in terms of time series of annual averages (interpolated on the left, downscaled on the right). For temperature and precipitation, the general trend linked to the climate scenario is maintained, while the ensemble envelope is generally reduced in the future because inter-model differences are smaller; this reduction is quantified by the distributions shown in the boxplots. For temperature, the model trends are more pronounced. For precipitation, the general trend linked to the climate scenario is stronger and the interannual variability increases. For wind speed, there is no visible trend and the evolution is similar to the historical period. Both the interpolated and downscaled data show a reduced envelope that illustrates small inter-model differences and low interannual variability. Because the downscaling method is applied to daily data at each grid point and not globally to annual data, it does not affect the interannual variability of model data. There is also a smoothing effect from spatial and temporal averaging, because those characteristics are less pronounced at smaller scales (e.g. monthly point data).

Experimental Design, Materials and Methods
Four datasets are used in this work:
• The reanalysis data that is used as reference for calibrating the statistical algorithm over a training period. The reanalysis grid sets the final resolution of the downscaled projections.
• The original model climate projections that come in a variety of spatial resolutions (typically between 2.0°x2.0° and 0.75°x0.75°) and are referred to as "raw".
• The raw data interpolated on the reanalysis grid and referred to as "interpolated".
• The downscaled data obtained from the interpolated data and the reanalysis data used for statistical calibration (both on the same grid) and referred to as "downscaled".
The raw and reanalysis data are input data that need to be sourced. The interpolated data is an intermediate dataset needed by the methodology, while the downscaled data is the final dataset. These datasets correspond to the four-step process (data sourcing, remapping, downscaling, quality control) described below.

Data Sourcing
The reanalysis data is the ERA5 reanalyses [2]. ERA5 is the latest climate reanalysis produced by ECMWF as part of implementing the EU-funded Copernicus Climate Change Service (C3S), providing hourly data on atmospheric, land-surface and sea-state parameters together with estimates of uncertainty. ERA5 data are available on the C3S Climate Data Store on regular latitude-longitude grids at 0.25° x 0.25° resolution. We compute the daily data from the ERA5 hourly data for all necessary variables.
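As an illustration of the hourly-to-daily aggregation mentioned above, the following minimal sketch reduces an hourly series to daily mean, minimum and maximum values. It is not the processing chain used for the dataset: the function name `daily_stats` is hypothetical, and the sketch assumes a complete series of 24 values per day with day boundaries already aligned.

```python
from statistics import mean


def daily_stats(hourly, steps_per_day=24):
    """Split an hourly series into whole days and return (mean, min, max) per day.

    Assumption: the series starts at the first hour of a day and has no gaps.
    """
    if len(hourly) % steps_per_day != 0:
        raise ValueError("series must cover whole days")
    days = [hourly[i:i + steps_per_day]
            for i in range(0, len(hourly), steps_per_day)]
    return [(mean(day), min(day), max(day)) for day in days]


# Two synthetic days: a ramp from 0 to 23, then a constant day.
hourly_series = [float(h) for h in range(24)] + [10.0] * 24
stats = daily_stats(hourly_series)
```

The same pattern would apply per grid point; accumulated variables such as precipitation would need a daily sum rather than a mean, which is not shown here.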
The climate simulations are from the Coupled Model Intercomparison Project Phase 5 (CMIP5) experiment [1]. They support the Fifth Assessment Report (AR5) of the Intergovernmental Panel on Climate Change (IPCC). We use projections from two emissions scenarios: RCP4.5 (a moderate mitigation policy scenario) and RCP8.5 (a no-mitigation policy scenario). Daily data of the necessary variables are extracted from the Copernicus Climate Change Service, which hosts a subset of the CMIP5 archive. The data covers the period from 1 January 1950 to 31 December 2100. The models have different spatial resolutions, ranging from 0.75° to 3°. All models are listed in Table 1.

Remapping
Remapping is a preliminary task required by the downscaling methodology. It consists in spatially interpolating the raw simulations (between 0.75° and 3° resolution) onto the ERA5 grid (0.25° x 0.25°). We use the Climate Data Operators (CDO, 2016) software from the Max Planck Institute that gathers various algorithms for interpolation used by the scientific community. Daily temperature (mean, minimum, maximum) and daily wind speed are interpolated with a bicubic method while daily precipitation is interpolated sequentially (to 1.5° then to 0.75°, and then 0.25°) with a conservative method.
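The remapping strategy described above could be scripted along the following lines. This is only a sketch under stated assumptions: `remapbic` and `remapcon` are the CDO operators for bicubic and first-order conservative remapping, but the exact options used for the dataset are not given in the text. The grid identifiers `r240x120` (1.5°) and `r480x240` (0.75°), the grid description file `era5_grid.txt`, the variable name `pr` and the function `remap_commands` are all illustrative choices. The function only builds the command lines; it does not run CDO.

```python
def remap_commands(variable, infile, outfile, target_grid="era5_grid.txt"):
    """Build CDO command lines (as argument lists) to remap one variable.

    Assumptions: 'pr' denotes precipitation; target_grid is a hypothetical
    CDO grid description file for the 0.25° x 0.25° ERA5 grid.
    """
    if variable != "pr":
        # Temperature and wind speed: a single bicubic interpolation.
        return [["cdo", f"remapbic,{target_grid}", infile, outfile]]

    # Precipitation: conservative remapping in three successive steps
    # (1.5°, then 0.75°, then the 0.25° target grid).
    steps = ["r240x120", "r480x240", target_grid]
    commands, source = [], infile
    for index, grid in enumerate(steps):
        is_last = index == len(steps) - 1
        destination = outfile if is_last else f"{outfile}.step{index + 1}.nc"
        commands.append(["cdo", f"remapcon,{grid}", source, destination])
        source = destination
    return commands


precip_cmds = remap_commands("pr", "pr_raw.nc", "pr_interp.nc")
temp_cmds = remap_commands("tas", "tas_raw.nc", "tas_interp.nc")
```

Each returned list can be passed to a process launcher such as `subprocess.run`; intermediate files are chained so that each step reads the output of the previous one.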

Downscaling
The downscaling method used here is a quantile mapping (QM) method called the Cumulative Distribution Function transform (CDF-t) [5][6][7][8][9]. CDF-t was first developed for wind values and is now referenced in dozens of peer-reviewed publications to downscale different sets of data and variables (e.g. [10][11][12]). QM methods relate the cumulative distribution function (CDF) of a climate variable at large scale (e.g., from the GCM) to the CDF of the same variable at a local scale (e.g., from the reanalyses) and are increasingly popular in climate applications, although bias correction methods have received criticism (e.g. [13]). For a review of recent QM methods see [3]. Here the variables are downscaled at a daily resolution over the 1951-2100 period using 1981-2010 as the calibration period. The precipitation variable is downscaled with a specific version of CDF-t referred to as "Singularity Stochastic Removal" (SSR), which addresses the challenges of correcting both rainfall occurrence and intensity [14]. Daily values are then averaged monthly to construct the dataset.
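To make the general quantile mapping idea concrete, the sketch below implements a basic empirical quantile mapping: a model value is converted to its probability in the model calibration sample, and the reference value at that same probability is returned. This is deliberately simpler than CDF-t (it does not account for changes in the model distribution between the calibration and projection periods) and it is not the SSR variant. The function names and the plotting-position convention are assumptions made for illustration.

```python
from bisect import bisect_right


def probability_of(sorted_sample, x):
    """Empirical probability of x, linearly interpolated between order statistics."""
    n = len(sorted_sample)
    if x <= sorted_sample[0]:
        return 0.0
    if x >= sorted_sample[-1]:
        return 1.0
    i = bisect_right(sorted_sample, x)  # sorted_sample[i-1] <= x < sorted_sample[i]
    low, high = sorted_sample[i - 1], sorted_sample[i]
    return (i - 1 + (x - low) / (high - low)) / (n - 1)


def quantile_of(sorted_sample, p):
    """Empirical quantile at probability p, with linear interpolation."""
    position = p * (len(sorted_sample) - 1)
    low = int(position)
    high = min(low + 1, len(sorted_sample) - 1)
    return sorted_sample[low] + (position - low) * (sorted_sample[high] - sorted_sample[low])


def quantile_map(x, model_calibration, reference_calibration):
    """Map a model value onto the reference distribution (basic empirical QM)."""
    p = probability_of(sorted(model_calibration), x)
    return quantile_of(sorted(reference_calibration), p)


# Synthetic example: the model is uniformly 2 units warmer than the reference.
reference = [10.0, 11.0, 12.0, 13.0, 14.0]
model = [12.0, 13.0, 14.0, 15.0, 16.0]
corrected = quantile_map(14.5, model, reference)
```

In this synthetic case the mapping removes the constant offset. Values outside the calibration range are clamped to the reference extremes, which is one of several possible conventions.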

Standardization
Standardization consists in rewriting output data files and related metadata to comply with the climate community's standards (e.g., the Climate and Forecast metadata convention and the Data Reference Syntax). We use the Climate Model Output Rewriter 2 (CMOR 2) library.

Quality control
We conduct two types of quality control. The first one is technical and consists in verifying data compliance with the climate community's standards, data consistency and metadata. This quality control is crucial for the data publication process and for data re-use. The second one is a value check that looks for outlier values in the downscaled data.

Technical quality control
We use the Quality Assurance tool (QA-DKRZ, https://readthedocs.org/projects/qa-dkrz) developed by the Deutsches Klimarechenzentrum (DKRZ) to check that the metadata of climate simulations given in NetCDF format conform to the conventions and rules of projects. During the Quality Assurance process of the DKRZ, the following criteria are checked:
1. Number of datasets is correct and > 0
2. Size of every dataset is > 0
3. The datasets and corresponding metadata are accessible
4. The data sizes are controlled and correct
5. The spatial-temporal coverage description "metadata" is consistent with the data
6. Time steps are correct and the time coordinate is continuous
7. The format is correct
8. Variable description and data are consistent
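As a simple illustration of the kind of test behind the time-continuity criterion in the list above, the sketch below checks that a sequence of monthly time stamps has no missing or repeated month. It is not the QA-DKRZ implementation; the function names and the (year, month) representation are assumptions chosen because they do not depend on the model calendar.

```python
def next_month(year, month):
    """Return the (year, month) pair that follows the given one."""
    return (year + 1, 1) if month == 12 else (year, month + 1)


def monthly_gaps(stamps):
    """Return the (previous, current) pairs that break month-to-month continuity."""
    return [(previous, current)
            for previous, current in zip(stamps, stamps[1:])
            if next_month(*previous) != current]


continuous = [(2005, 11), (2005, 12), (2006, 1)]
broken = [(2005, 11), (2006, 1)]
gaps_continuous = monthly_gaps(continuous)
gaps_broken = monthly_gaps(broken)
```

An empty result means the time coordinate is continuous; any returned pair locates a discontinuity.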

Value quality control
The value quality control is built with CDO and NCO tools and consists in:
• Analyzing the difference between the downscaled values and the observations over the reference period.
• Analyzing the difference in time evolution between the downscaled and original model.

Difference between downscaled model and observations
First, we estimate two quantities:
• The average for each month over the reference period of the observations.
• The average for each month over the reference period of the downscaled model.
We then estimate the difference between these two quantities for every month. For each month we take the 10th and 90th quantiles. That gives 12 values for each quantile. Finally, we verify that these 12 values fall within the following ranges (unpublished, R. Vautard personal communication):
• temperature between [-1 ; 1] in K,
• precipitation between [-0.5 ; 0.5] in mm.day-1,
• surface wind speed between [-0.5 ; 0.5] in m.s-1.
These ranges are relatively narrow and allow only small discrepancies, since modifications should be small over the historical period (inherently so over the calibration period). If values are outside the range, the script raises an error and the simulation is rejected and thus not included in the dataset.
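The check described above can be sketched as follows for one calendar month. The thresholds are those given in the text; everything else is an assumption made for illustration: the quantiles are taken here over grid points, the variable keys (`tas`, `pr`, `sfcWind`) and function names are hypothetical, and the actual scripts rely on CDO and NCO rather than Python.

```python
# Accepted ranges from the text: K, mm/day and m/s respectively.
MONTHLY_LIMITS = {"tas": 1.0, "pr": 0.5, "sfcWind": 0.5}


def quantile(values, p):
    """Empirical quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = p * (len(ordered) - 1)
    low = int(position)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (position - low) * (ordered[high] - ordered[low])


def month_passes(obs_means, model_means, variable):
    """Check one calendar month.

    obs_means / model_means: monthly means over the reference period,
    one value per grid point (assumption), in the same order.
    """
    differences = [m - o for m, o in zip(model_means, obs_means)]
    q10 = quantile(differences, 0.10)
    q90 = quantile(differences, 0.90)
    limit = MONTHLY_LIMITS[variable]
    return -limit <= q10 and q90 <= limit


# Synthetic example: differences spread between 0 and 1 across 11 grid points.
observations = [280.0] * 11
downscaled = [280.0 + 0.1 * i for i in range(11)]
temperature_ok = month_passes(observations, downscaled, "tas")
```

A simulation would be accepted only if this test passes for all 12 months; the same synthetic differences would fail under the tighter 0.5 limit used for precipitation and wind speed.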

Difference of evolutions between downscaled model and original model
First, we estimate four quantities:
• Average for each season in the reference period for the original model.
• Average for each season in the reference period for the downscaled model.
• Average for each season in the future period (2071-2100) for the original model.
• Average for each season in the future period (2071-2100) for the downscaled model.
Then, for each season, we compute the evolution between future and reference periods for the original and downscaled model. We estimate the difference between them and get 4 output files (one per season). For each season (i.e. for each file), we take the 10th and 90th quantiles of the differences. That gives 4 values for each quantile. Finally, we check that these 4 values fall within the following ranges (unpublished, R. Vautard personal communication):
• temperature between [-2 ; 2] in K,
• precipitation between [-1 ; 1] in mm.day-1,
• surface wind speed between [-1 ; 1] in m.s-1.
In this case, the ranges are wider than in the previous check to allow for larger discrepancies, but narrow enough to avoid unrealistic changes. If values are outside the range, the quality control raises an error and the simulation is rejected and thus not included in the dataset.
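The change-signal check described above can be sketched in the same spirit for one season. Only the thresholds come from the text. The sketch assumes that the original model has been interpolated onto the same grid as the downscaled data, that the quantiles are taken over grid points, and that the "inclusive" method of Python's `statistics.quantiles` is an acceptable quantile estimator; the variable keys and function name are hypothetical.

```python
from statistics import quantiles

# Accepted ranges from the text: K, mm/day and m/s respectively.
SEASONAL_LIMITS = {"tas": 2.0, "pr": 1.0, "sfcWind": 1.0}


def change_signal_passes(orig_ref, orig_fut, down_ref, down_fut, variable):
    """Compare the future-minus-reference change of both datasets for one season.

    Each argument holds seasonal means, one value per grid point (assumption),
    for the original (orig_*) or downscaled (down_*) model.
    """
    orig_change = [f - r for f, r in zip(orig_fut, orig_ref)]
    down_change = [f - r for f, r in zip(down_fut, down_ref)]
    differences = [d - o for d, o in zip(down_change, orig_change)]

    # Deciles of the differences: the first is the 10th, the last the 90th quantile.
    deciles = quantiles(differences, n=10, method="inclusive")
    q10, q90 = deciles[0], deciles[-1]
    limit = SEASONAL_LIMITS[variable]
    return -limit <= q10 and q90 <= limit


# Synthetic example: the original model warms by 4 K everywhere,
# the downscaled model by 4 K to 4.5 K.
original_reference = [280.0] * 5
original_future = [284.0] * 5
downscaled_reference = [281.0] * 5
downscaled_future = [285.5, 285.0, 285.0, 285.0, 285.0]
season_ok = change_signal_passes(original_reference, original_future,
                                 downscaled_reference, downscaled_future, "tas")
```

A simulation would be kept only if this test passes for all four seasons.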