Comparing the Euro 2k reconstruction to a regional climate model simulation *

We do not expect simulations and reconstructions of past climate change to agree on the exact evolution of past climate variations. Simulations with sophisticated climate models present spatially discrete but complete estimates, which are consistent relative to the implemented physics and their uncertain forcing data. Reconstruction methods statistically infer past changes from natural archives who reacted to and recorded more than one external inﬂuence and, thus, carry an uncertainty by construction. However, comparison of both data sources is relevant not least as it addresses the disagreements between the data. This manuscript describes the disagreement of a regional simulation with an ensemble of regional reconstructions but also highlights some agreement. The results emphasize the congruence in modelling and reconstructing past changes. Besides the expected disagreement both sources of information about past climate changes show some potential agreement in low frequency variability, time-series properties, and spatial covariability. While we cannot aim for perfect matches we can consider simulations and reconstructions as estimates of probabilities of past changes if their general properties agree.


Preliminaries
Two sources provide us with information about past climates, simulations with sophisticated models of the climate-or earth-system and reconstructions based on environmental or documentary observations from past periods (PAGES 2k-PMIP3 group, 2015). Inferrences about past climates from these proxy-observations are uncertain not least since more than one environmental factor influences the recording of the sensor (Evans et al., 2013). The simulation results are uncertain not least since we do not know the exact past boundary conditions (Schmidt et al., 2011).
Even if we knew the major forcing conditions and had the perfect proxy, internal variability on various time-scales would likely result in distinct trajectories in simulation and reconstruction (Deser et al., 2012). Furthermore, proxies, reconstructions and simulations represent different spatial scales (e.g., PAGES Hydro2k Consortium, 2017). Proxies likely record local signals, from which we may make regionalized inferrences. Simulations provide spatially discrete grids with grid-point-representative timeseries. Regional models (Ludwig et al., 2018) and proxy-forward-model approaches (Evans et al., 2013;Tolwinski-Ward et al., 2011) as well as data assimilation techniques (e.g., Hakim et al., 2016) provide means for bridging the gap between the local proxies and the larger scale simulations.
Exact agreement between simulations and reconstructions is more than unlikely. Comparisons of both data sources can result in consilience of the two distinct lines of evidence about past climate evolutions (PAGES 2k-PMIP3 group, 2015). Even disagreement between both data sources is valuable as it highlights future research directions to increase our understanding of past changes. This report compares the Euro 2k summer surface air temperature reconstruction ensemble of Luterbacher et al. (2016) and a regional climate simulation (Wagner, personal communication;PRIME2, 2018;Bierstedt et al., 2016) with the regional climate model CCLM (Rockel et al., 2008). I follow the suggested methods of the PAGES 2k-PMIP3 group (2015) but add additional analyses on time-series properties. I exclude methods for detecting forcing influences. This report is in principle another supplement to the analyses of Luterbacher et al. (2016).
To my knowledge, there has not been an effort using regional climate simulations comparable to the work by the PAGES 2k-PMIP3 group (2015) who used multiple methods to compare global simulations and continental reconstructions. Obviously, this is not least because of a lack of long transient regional climate simulations over the recent centuries except for, e.g., those used by Gómez-Navarro et al. (2013, see also Gómez-Navarro et al., 2015 and Gómez-Navarro et al. (2015, Wagner, personal communication, see also Bierstedt et al., 2016).
The current approach differs from that of the PAGES 2k-PMIP3 group (2015) not only by using a single regional climate simulation instead of an ensemble of global simulations. The availability of an ensemble for the Euro 2k reconstruction provides a range of results for the chosen analyses, which by construction gives an uncertainty measure for the comparison of reconstructions and simulations.
The next section shortly discusses the data sets and gives a short comment on the chosen approach. Afterwards I present the results for the various analyses and end with a short summary discussing these results.   The BHM-median is the median of smoothed/sliding-mean series of the ensemble members. x-axes are years of the Common Era (Years CE), y-axis are temperature anomalies in Kelvin (K).

Reconstruction Data
3 Comparing the Euro2k-BHM-reconstruction and the regional CCLM simulation I generally compare the regional area average series for the simulation with the reconstruction ensemble median. The median refers to the median of the results of the analyses performed on the individual ensemble members. Occasionally I use random members of the ensemble. Since the reconstruction data is for summer means (June, July, August; JJA), all analyses consider summer data for all data sources.

Time series
Visually, 31-year running means of the time-series appear to be surprisingly similar (Fig 1). However, the visual comparison also highlights that there is no real agreement between simulated and reconstructed regional data. The plots suggest that trends are larger in the reconstruction data over the full period but multi-decadal variability may be larger in either source of information.
One striking feature in Fig 1 is the occurrence of multidecadal warmer episodes in the late 18th, late 19th and mid 20th century in simulations and reconstructions. These are more prominent in the simulation. The good congruence between the reconstructed series and the simulation data appears to be larger in the earlier part of the data. This may be due to the different trend sizes in later parts of the data and the choice of reference period. Random members of the BHM ensemble and their respective median do mostly not evolve overly diverse.
There is relatively large coherence between the different regional series within the two data sources The reconstructions indicate that variability was relatively stronger around the early 19th century. This is also seen in some simulated regional series, though generally weaker. Moving standard deviations generally differ strongly among regions.
Thus, summarising the visual agreement, the simulation and the reconstruction both show commonly a number of warmer periods during the last 400 years. Some regional data suggest common variations in the strength of their variability. Generally, neither the time-series nor their variability and trends are overly similar between reconstructions and simulations. I discuss the trends in more detail below.

Correlation analyses
In the next step, I consider correlation analyses to assess the interrelation between the ensemble members, between the ensemble and the simulation, between regions in the simulation and the reconstruction, and the evolution of interregional relations in both data sources.

Ensemble correlations for the period 1645-1850
Fig 4 shows intra-reconstruction-ensemble correlation histograms in red and histograms for correlations between simulation and ensemble members for the pre-1850 period. For most regions there are moderate correlations among the reconstruction members with correlation distribution modes larger than at least 0.4. The regional data for Turkey is an exception. Intra-ensemble correlations are particularly large for the Alps data.
Simulation-reconstruction correlations are generally weak but histograms show always a tendency towards positive correlations. For some regions like, e.g., the Alps or the full domain, correlations are always positive and their distribution mode is larger than 0.2. This may indicate some kind of common signal between both data sets over the period before strong anthropogenically forced trends.    Figure 6: Running correlations between regional series and the full domain series over 91-year moving windows. Black, regional CCLM-simulation, red, median of analyses for the BHM-reconstruction ensemble. Correlations are generally weakest for the Turkey region for both datasets in the interannual data and the smoothed data ( Fig 5). This behaviour is least pronounced in the smoothed simulation data. Turkey also correlates least with the full domain series.

Interregional correlations
Interregional correlations generally increase for the simulations with larger amount of smoothing (Sup- Noteworthy is particularly the Scandinavian data, which shows an up-and-down variation with increasing smoothing.

Running correlations
Running correlations allow to clarify how interrelations change over time. Correlating the unsmoothed series over short windows gives a rather noisy picture (not shown). However, moving correlations of smoothed series artificially accentuate features.
In summarising the correlation results, intra-ensemble correlations are notable, but correlations with the simulation data are always weak though generally positive. Inter-regional correlations are usually stronger in the simulation data, which fits the common finding of larger inter-relations in simulated data compared to reconstructions (compare, e.g., PAGES 2k-PMIP3 group, 2015). Moving correlations with the full domain also differ between simulation and reconstruction. Interestingly, there are periods of weak correlations particularly for the simulation. The changes in correlation may relate to the observed changes in variability in the regional series (compare Fig 3).

Trends
Figs 7 and 8 summarise moving window linear trends of different lengths for the simulation and the reconstruction ensemble, respectively. As I do not show observations for comparison, the panels only highlight the differences between simulated and reconstructed trends. Note, observational regional data is not available in acceptable quality over the full period. Luterbacher et al. (2016) have already shown the relative good agreement between instrumental data and reconstructions for the domain mean data, though not for the trends.
While the trends for the reconstructed data are stronger in recent windows, overall there are not any overly obvious contrasts between simulation and ensemble trend medians. The   Thus, trends in both data sources generally reflect our knowledge of past climate forcing and boundary conditions in the region of interest, i.e. Europe. Disagreement in timing and trend sizes, especially over long trend periods, may be, among other things, due to internal variability, the uncertainty in forcing data, the specific implementation of the forcing in the model, and the inability of a low top model to reproduce dynamical top-down processes originating in the middle and upper atmosphere (compare, e.g., Jungclaus et al., 2010). Annan and Hargreaves (2010) highlight the usefulness of the concept of ensemble reliability or ensemble consistency for climate studies. It was previously mainly used in forecast verification. For applications see also, e.g., Hargreaves et al. (2013). Jolliffe and Primo (2008), Anderson (1996), andHamill (2001) provide relevant discussions of probabilistic consistency and rank histograms, while Marzban et al. (2010) introduce the concept of climatological forecast consistency and residual quantile-quantile plots. Previous simulation-reconstruction comparisons include, e.g., the PAGES 2k-PMIP3 group's work (2015).

Consistency
Consistency assessments rely on the paradigm of a statistically indistinguishable ensemble (e.g., Annan and Hargreaves, 2010). For this, one assumes a validation target and an ensemble of data to be exchangeable samples from a common distribution.
Climatological consistency describes the similarity of the climatological probability distributions of the target and the ensemble over a selected period. Plotting the residual quantiles between the target and the ensemble helps to visualize this. To be more specific, classical quantile-quantile plots display the quantiles of one distribution against the quantiles of another, potentially theoretical distribution. Residual quantile-quantile plots display the differences between the pairs of both quantile series against the target quantiles. Thus, values of or close to zero for all residual quantiles signal consistency.
Probabilistic consistency analyses identify the position of the target within the range of the ensemble.
Rank histograms traditionally are an appropriate visualisation for such analyses. In this case, one orders the values of the ensemble and the target for a data point, identfies the position, i.e. the rank, of the target, and then plots counts of these ranks over all data points in histograms. A flat rank histogram signals probabilistic consistency.

Climatological consistency
Fig 9 displays the climatological consistency assessment for the ensemble reconstruction relative to the simulation data for the pre-1850 period. I plot the frequencies of residual quantiles between the reconstruction ensemble and the 'target' simulation data. An alternative approach would plot a series of residual quantiles for each ensemble member.
Unsurprisingly the extreme quantiles spread widely. Concentrating on the more confined central quantile residuals, there are mainly three behaviours of the data. For one, regions like Britain and Ireland, Central Europe, Eastern Europe, and also the full domain data show consistent residuals, i.e. they are close to zero. However, the Iberian Peninsula, the Alps, the Carpathians, and less strongly the Balkan and Turkey show an underdispersive behavior, residuals have a negative trend. That is, the distributions of the reconstruction are usually narrower than the simulation data, their variance is smaller. This behaviour is strongest for the Iberian Peninsula. Thirdly, Scandinavia shows a slight tendency to the opposite. Such a positive trend signals overdispersion. That is, the variance of the reconstruction data is larger than for the simulation. Nevertheless, it seems appropriate to say that there is not general (climatological) inconsistency between the data. However, the data are anomalies and thus do not include potential biases.

Probabilistic consistency
Keeping Note, that I use very sparse histograms. Only about 200 measurements are available for ranking in the 1001 classes. This affects statistics and implies that small counts are enough to result in an overdispersive histogram. On the other hand, it also emphasizes the finding that the target is often outside of the ensemble range. However, the analyses of Marzban et al. (2010) suggest that rank histograms for ensembles with strong intra-ensemble correlations but relative small ensemble-target correlations are likely to be overdispersive.
In summarising, there is not general climatological inconsistency. Similarly, spread-deviations allow to reject probabilistic consistency, but the large intra-ensemble correlations also prevent to declare the rank-histograms generally inconsistent.

Skill-analysis
The PAGES 2k-PMIP3-group (2015) also adapted the concept of skill analysis for reconstructionsimulation comparisons from Hargreaves et al. (2013). Here, I use a very simple skill measure to assess how the BHM-ensemble is able to capture the target CCLM simulation compared to assumptions of no change. Negative values indicate that the ensemble of reconstruction-series has no skill in representing the simulated data compared with the no-change-reference.
The boxplots for skill measures of the ensemble members in the left panel of Fig 11 indicate that the ensemble lacks skill considering unsmoothed interannually resolved data. If I use means over nonoverlapping windows of 10 or 30 years, there are some reconstruction ensemble members, which show skill. Note the different axis scales for the three panels as deviations increase at the same time with increasing window lengths. There is no pattern in the increase in skill measure, it remains unclear whether skill may be more likely or larger for certain regions.

Principal component analysis (PCA)
Analyses, so far, only considered individual data series or pairs of data. In the following, a principal component analysis (PCA) supplements them by providing more information about the interrelations among the regions. It summarises the covariability among the regional subdomains of the reconstructions and the simulations.
Note, I perform the PCA separately for simulation and reconstruction. Thereby, the comparison of time-series uses projections on different loadings. That is, the comparison of the patterns and of timeseries do not complement each other. An alternative approach would be to, first, compare the loadings and, then, project the data on a common set of loadings to compare the evolution with respect to this loading. Such a comparison would more appropriately consider the evolution of a certain set of modes.
However, as there is no guarantee that such a common set of modes is valid for all data sets, I compare the projections within each data set and refrain from making strong interpretations.
The first principal components for both datasets explain about 35% of the variability (Fig 12). The second PCs explain about 15% for the reconstruction but over 20% for the simulation. The ensemble range does not include the simulation for the second principal component and some higher order PCs. Interestingly, the smoothed PC time series also show some general resemblance (Fig 13, right column).
However, there are also pronounced differences especially for PC1 and PC4, which is to be expected considering the smoothed time-series in Fig 2. Again, some time-series, e.g. PC2, have a wide range for the reconstruction ensemble. Nevertheless, these results suggest that leading modes of variability are comparably captured in both the simulation and the reconstruction for the pre-1850 period for the spatially coarse resolution setup of nine regional average series.
I show the unsmoothed PC time series in the Appendix (Supplementary Fig 26). This Fig highlights that amplitudes of median series are small for PCs 2 to 4 for the reconstruction, i.e., that variations and variability differ strongly between ensemble members. Fig 14 shows the relation between the regional series and the PC time series differently by plotting the correlations between the regional series and the PC series. As for the PC loadings and the PC time series, correlations are comparable between the simulation and the reconstruction. The correlations also follow similar patterns -which is possibly to be expected as the PCs base on the covariability among regions.
For PC1, largest differences between simulation and reconstruction occur for the Turkey-region. The Turkey-series also shows the weakest correlations with PC1 and the largest spread in correlations with PC2, while PC3 correlations are largest for this data. It is worth noting that the full domain series correlates nearly perfectly with the first PC time series.
In summarising, the modes of variability appear comparable in reconstruction ensemble and simulation. However, I do these analyses on a spatially coarse setup of nine regional average series, and

Comparing Euro 2k and CCLM
Non-peer reviewed pre-print submitted to EarthArXiv 20 perform the analyses on simulations and reconstructions independently.

Spectral densities
Fig 15 plots sample spectral density estimates for the simulation data against analyses for all ensemble members. Envelopes of the reconstruction ensemble spectral densities are generally wide over the full frequency range for all regions. Spectral density estimates for the simulation fall mostly inside these envelopes.
There are few clear differences between reconstruction and simulation spectra. Some reconstructed regional series ensembles show a maximum in the densities at frequencies approximately equivalent to a seven year period, which is missing in the simulation data.
Generally the comparison of the simulation spectral densities with the ensemble range indicates a lack of disagreement, which is in line with the visual assessment of comparable variability on high and low frequencies in simulation and reconstruction series. However, this cannot necessarily be interpreted as agreement of spectral properties of the data.

Hurst exponent
The Hurst coefficient (e.g., Weron, 2002) is a measure of self-similarity or long-range dependence of a time-series. Economics and hydrology use it commonly but the paleoclimatology-community also got interested in recent years (e.g., Bunde et al., 2013). Weron (2002) discusses uncertainty of the estimation process. which itself is confined to values larger than 0.5 but smaller than 0.8, i.e., there is usually some amount of long-term positive auto-correlation in the time-series.
Uncertainty of individual estimates is obviously large since the time-series are relatively short. Furthermore, the reconstruction-method may imprint certain time-series properties onto the data. Nilsen et al. (2018) discuss how the assumptions of the reconstruction method potentially result in a reconstruction more in line with the assumed time-series characteristics than with the real input time-series characteristics. The similarity in Fig 16 nevertheless suggests some agreement in self-similarity between simulations and reconstructions over the considered period. Fits suggest an order of 1 for the majority of ensemble members for all regions. However, fits to the simulation data give larger orders than for the bulk of the ensemble-members for the most regions.

Autoregressive (AR) order
This points to larger autocorrelation in the simulation-data. Only Turkey and the Carpathian region give orders of zero for the simulation data and the Balkan gives an order of 1. Please note, that Luterbacher et al. (2016) assume the temporal autocorrelation of the grid-point data to follow an AR(1)-process. However, this does not necessarily imply that regional averages also follow such a process (compare also Nilsen et al., 2018).
The boxplots in the last panel of Fig 17 suggests that, considering forced AR(1)-fits, the simulation is often similar to the reconstruction series especially for regions in north-western and central Europe and the full domain. Differences appear to be larger in the South-East of Europe, where the simulation gives small AR(1)-fits compared to the majority of reconstruction-ensemble-members. However, I again have to emphasize that there is no proxy-data from this region entering the reconstruction.
The AR-results fit the general assessment of the time-series properties, which shows generally agreement between the simulation and the ensemble reconstruction range. They also agree with the pattern that both data sources agree less in the South-East of Europe.

Comparison of the proxies with the simulation grid-points
Another means of assessing simulations and paleo-observations is the comparison of grid-point data, pseudo-proxies, or forward-modelled virtual-proxies from the simulation to the orginal proxy-series.
Such an approach could consider all the above analyses plus additional detection and attribution approaches (compare, e.g., PAGES 2k-PMIP3 group, 2015). Although this is a worthwhile enterprise it is a separate document to be written. In the following I only show a short comparison of the original proxies to the nearest simulation grid-point series and the principal components. Alps, French Alps, Pyrenees, and Albania.

Normalized nearest CCLM-point series vs Euro2K-proxies
As expected, series do not generally agree between simulations and reconstructions. Correlations are often negligible either for unsmoothed or smoothed data and occasionally for both. Among the proxies, Luterbacher et al. (2016) excluded the data from the Tatra and Albania due to lacking summer temperature signals. Interestingly the Tatra data is the only series notably negatively correlated with the simulation data whereas the data from Albania has some correlation with the smoothed simulation data. However the proxy shows much more variability than the closest simulation grid-point.
Correlations for smoothed data are largest for the Swiss Alps but they are also prominent for the Pyrenees, the French Alps, the Austrian Alps, and Torneträsk. The Northern Scandinavia data, the Jämtland series, and the Carpathian data do not show relevant correlations with the smoothed simulation data although there appear to be some similarities visually. The lack of correlations is mainly due to their opposite evolution in other periods.

Proxies and PCA
Fig 19 relates the proxies to the reconstruction principal component series, and the closest simulation grid point data to the simulation principal component series. It also adds the correlations between the proxies and the simulation principal component series for information's sake.
Interestingly, the pattern of correlations is similar in the simulation and the reconstruction for all PCs.
Even the cross-data-source correlation pattern looks similar for PC1 though correlations are very weak in this case.
Correlations are positive with PC1. Grid-point correlations with PC1 can be very strong in the simulation for grid-points outside of Scandinavia but correlations are more moderate between the original proxies and the reconstruction PC1.
Correlations are predominantly likely negative or about zero for PC2, but the Carpathian and Albania simulation grid-points correlate strongly with this PC. There is a pattern from more likely positive correlations in Scandinavia to more likely negative correlations in more southern regions for PC3. Scandinavia is likely to correlate negatively with PC4, but other regional correlations are about zero or slightly positive. Generally, correlations can be large in the simulation and can be non-negligible for the reconstruction.
The results suggest again that the reconstruction and the simulation show similar spatial interrelations between the locations. On the other hand, there appears to be little to no agreement between the original proxies and the simulated grid-point series.

Summary
This report provides simple additional comparisons of the Luterbacher et al. (2016) Euro 2k spatial field reconstruction ensemble with output from a regional climate simulation. Specifically, I use data from a regional simulation with the CCLM model (for the model see Rockel et al., 2008; for the simulation, e.g. Bierstedt et al., 2016).
Reconstruction ensemble and simulation do not generally agree on the evolution over time. Differences in trends and changes in variability are especially noteworthy. Some regions agree better than others.
If I consider the reconstructions as predictions of the simulated temperature evolution, the ensemble generally lacks skill compared to a no-change-reference. Measures of consistency are ambiguous but it is not possible to describe the data sets as climatologically inconsistent.
Similarly, it is not possible to describe time-series properties like spectra, Hurst-exponents, or autoregressive orders as generally different between both data sets. Nevertheless, the simulation data regularly give higher autoregressive orders than the reconstruction data, which may relate to the reconstruction methods.
This lack of disagreement is encouraging. Even more encouraging is the apparent agreement in spatial covariations between the regional series as found in a principal component analysis, which even translates to their evolution over time. The similarity in relations also extends to comparisons of the original proxies or equivalent simulation grid-points with the PC time series.
In my analyses, I take advantage of the ensemble-character of the Luterbacher et al. (2016) Bayesian Hierarchichal Modelling Euro 2k field reconstruction. This provides, on the one hand, a, possibly incomplete, estimate of the uncertainty of the reconstruction. More importantly, it also gives a natural confidence limit for the comparison of the simulation data with the reconstructed estimates of past summer temperature changes in Europe and the sub-domains I use.
The use of the full ensemble highlights the benefits of ensemble approaches or more general of providing credible uncertainty estimates for reconstructions. Uncertainty estimates for the original proxies would also be helpful for further comparison studies. The document also adds a few additional analyses to the suite of analyses suggested by the PAGES 2k-PMIP3 group (2015).
One interesting feature of most analyses is that simulation and reconstruction ensemble are least similar in the South-East of Europe. On the one hand, the reconstruction does not include any proxyinformation from this region. On the other hand, this is the region where the influences of the large scale weather and climate situation over the North Atlantic and western Europe are least important.
The apparent occasional agreement in some regional time-series, trends, and variability-series warrants a short additional discussion. While this is encouraging I cannot rule out that this agreement is coincidental due to different effects in proxies and simulation. That is, reconstruction and simulation agree due to the wrong reasons. Periods of agreement appear visually to relate to reconstructed changes of past climate forcings (e.g., Schmidt et al., 2011;Fernández-Donado et al., 2013;Luterbacher et al., 2016;PAGES 2k-PMIP3 group, 2015). Numerous studies document the influence of volcanic eruptions on surface temperature in Europe, but evidence is weaker for a temperature-effect of solar variability. In the simulation, potential relations between solar forcing variations and surface temperature have to be due to a direct effect because of the vertical extent of the simulation setup and the implementation of forcing variations. That is, there is unlikely a complex chain of dynamic interactions leading to regional temperature variability. See Shindell et al. (2003) or, e.g., also Anet et al. (2014) for studies of the influence of solar forcing variations on the climate. Put clearly, if there is an influence of the solar variations on the near surface air temperature in the simulation it is not due to a credible mechanism. This also accounts for the possibility that the lateral forcing by the global simulation imprints this relation on the regional simulation.
It is tempting to intepret the reconstruction similarly. However, the proxies generally react to more than one environmental influence. Therefore, solar signals in the proxy may also be due to factors like precipitation, dryness, cloud cover, or circulation despite a general temperature sensitivity. The proxy may not represent a direct temperature influence and thus a solar effect on temperature but a signal of circulation variability due to variations in solar activity.

Competing interests
I am not aware of any circumstances that might be seen as competing interests.

Acknowledgements
This work was mostly done while I received funding within PRIME 2 and PALMOD (www.palmod.de).
Sebastian Wagner provided the CCLM model data. This report contributes to the PAGES 2k Network especially its PALEOLINK project. Comments by Johannes Werner helped to improve the manuscript.

Comparing Euro 2k and CCLM
Non-peer reviewed pre-print submitted to EarthArXiv 29

Appendix: Supplementary Figures
The following simply collects a number of additional Figures without comment.     x-axes for the loadings are the regions, and years CE for the time-series.