Hydropower information for power system modelling: the JRC-EFAS-Hydropower dataset

Hydropower plays a very important role in European power systems. Consequently, any power system model aiming to reproduce the behaviour of current and future European power systems should include an accurate representation of the natural variability of water availability, i.e. the amount of water that can be transformed into energy. The JRC-EFAS-Hydropower dataset contains the weekly hydropower inflow of pure storage plants and the daily run-of-river generation for 27 European countries. The dataset is based on the river discharge provided by the European Flood Awareness System (EFAS) and on the JRC Hydropower Database. The discharge time series for all the hydropower plants in a specific country are used as predictors for a ridge regression model calibrated on the actual data provided by the ENTSO-E Transparency Platform for the years 2015-2019. The model is then used to extend the inflows and the generation to the entire time span available for the predictors, the years 1991-2019, thereby reconstructing the natural variability observed over the entire period.
1 https://open-power-system-data.org/data-sources#5_Hydro_power_data

Figure 1. Diagram of the workflow used to generate the dataset


Background & Summary
To accurately model the impact of climate variability on power systems, it is important to use long time series, possibly spanning many decades. While a lot of attention has been dedicated to the impact of meteorological factors on wind and solar power, much less research has focused on hydropower, which currently plays a very important role in Europe (in 2018, hydropower generated more than 600 TWh in Europe, about 16% of the total generation 1).
Moreover, considering the big challenge posed by climate change, understanding the impact of climate on our power systems is of vital importance to plan efficient and secure low-carbon energy systems [2][3][4][5]. A very important tool for studying the energy system is a power system model, i.e. software able to reproduce some of the most complex behaviour of a real power system. Many datasets with actual or estimated hydropower inflow/generation are available (OPSD 1, for example, provides a list of available datasets); however, only a few provide information for multiple countries and for several years with an adequate temporal resolution for power system modelling. These are the ENTSO-E Transparency Platform (TP) data 6, Copernicus Climate Change Service (C3S) ECEM 7, Restore2050 8 and Wattsight data 9. The ENTSO-E TP provides actual data collected by national Transmission System Operators (TSOs) for generation and reservoir levels since 2015. C3S ECEM and Restore2050 are based on meteorological reanalyses and provide, respectively, daily hydropower generation and inflow for European countries spanning multiple decades. Wattsight is a commercial dataset providing hydropower information (inflow, generation) for several European countries.
The ENTSO-E TP data are too short for many power system model studies but can be considered the best estimate of the real situation of power systems in Europe; they are therefore used for the calibration of JRC-EFAS-Hydropower. Wattsight data are more detailed but, given their commercial nature, they could not be used for calibration and we use them only for validation purposes. ECEM and Restore2050 both have a wide temporal and spatial coverage, but neither uses any information on the location of power plants; in addition, the former does not provide inflow and the latter lacks any distinction among hydropower technologies.
JRC-EFAS-Hydropower has been developed to introduce climate-sensitive information on water availability into power system models. The dataset is based on the geographical location of actual hydropower plants and on a state-of-the-art hydrological model updated in near real-time. The methodology presented here can be used to update the dataset when new and/or better inputs are available and, in principle, it can be applied to estimate sub-national inflows if enough observations are available.

Methods
The dataset presented in this paper consists of two variables aggregated at country level: hydropower inflow and run-of-river generation (Table 1).

Variable | Description | Spatial res. | Temporal res. | Period
Hydropower inflow | Energy available for hydropower generation | country | weekly | 1991-2019
Run-of-river generation | Generation of hydropower run-of-river plants | country | daily | 1991-2019

Table 1. Variables contained in the JRC-EFAS-Hydropower dataset

The variables are generated following the same workflow (Figure 1), which is based on a regression model using as predictors the river discharge time series extracted at the locations of hydropower plants.
The proposed methodology assumes that the river discharges in the proximity of all the hydropower plants for a specific country contain all the information to reconstruct both the inflow of the pure storage power plants and the generation of run-of-river plants.

Input data
The methodology is based on the following three datasets:
1. River discharge and related historical data from the European Flood Awareness System (EFAS) 10,11
2. JRC Hydropower database 12
3. ENTSO-E Transparency Platform data 6,13
The EFAS dataset provides gridded daily hydrological data, including river discharge, at a resolution of 5 km. Data are available on the Copernicus Climate Data Store (CDS) from 1 January 1991 up to near real-time (with a delay of 6 weeks).
The JRC Hydropower database collects the information on more than 4000 hydropower plants in Europe. The dataset provides several variables, including the coordinates, the installed capacity and the typology of the power plant (pure storage, run-of-river or pumping plant).
Finally, the ENTSO-E Transparency Platform is an online data platform for market information on European electricity systems. It has provided real-time information since 2015 for a wide range of power system and electricity market variables, but it does not directly provide the hydropower inflow. We estimate the inflow starting from two variables: hydropower generation and the level of water reservoirs. The first is provided at hourly resolution and describes the aggregated generation per country; the second, provided weekly, is the level of stored energy in the national water reservoirs. We then define the inflow in week w as the sum of the generation in that week and the change in stored energy over the same week:

$\mathrm{inflow}_w = \mathrm{generation}_w + \Delta\mathrm{storage}_w$

All the variables involved are expressed in energy units (MWh). The weekly generation ($\mathrm{generation}_w$) is computed by summing the 168 hourly generation values in each week. This equation assumes that the hydropower inflow in a week can be used in two ways: to generate electricity in the same week or to be stored in the reservoirs. Due to lack of data, the calculation of the inflow does not consider spillage or evaporation.

For run-of-river plants we instead use the hourly generation aggregated at daily level to match the river discharge from EFAS.
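As an illustration of the inflow definition above, the following minimal Python sketch sums hourly generation into weekly totals and adds the week-to-week change in stored energy. The function names, the toy numbers and the convention that each reservoir value refers to the end of its week are assumptions made for this example; they are not taken from the ENTSO-E data specification.

```python
def weekly_generation(hourly_mwh, hours_per_week=168):
    """Sum hourly generation (MWh) into consecutive weekly totals."""
    n_weeks = len(hourly_mwh) // hours_per_week
    return [
        sum(hourly_mwh[w * hours_per_week:(w + 1) * hours_per_week])
        for w in range(n_weeks)
    ]


def weekly_inflow(generation_mwh, stored_energy_mwh):
    """Inflow = generation + change in stored energy (all in MWh).

    Assumption: stored_energy_mwh[w] is the level at the end of week w, so
    the first week has no previous level and is skipped.
    """
    return [
        generation_mwh[w] + (stored_energy_mwh[w] - stored_energy_mwh[w - 1])
        for w in range(1, len(generation_mwh))
    ]


# Toy example: three weeks of constant 10 MWh per hour.
hourly = [10.0] * (168 * 3)
generation = weekly_generation(hourly)   # 1680 MWh in each week
storage = [5000.0, 5200.0, 4900.0]       # hypothetical weekly stored energy
inflow = weekly_inflow(generation, storage)
```

In this toy case the reservoir gains 200 MWh in the second week and loses 300 MWh in the third, so the estimated inflow is respectively higher and lower than the generation of those weeks.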

Extraction of river discharge time-series
The extraction phase generates a river discharge time series for each hydropower plant in the JRC Hydropower database. To limit the extraction to the most significant and pertinent locations, we select only the power plants that are not classified as pumping and that have more than 1 MW of installed capacity (1 234 plants in release 6 of the JRC Hydropower database). For each power plant, we select the grid point of the EFAS dataset where the power plant is located together with its eight neighbouring grid points. From this pool of nine grid points, which form a square of 15x15 km, we choose the maximum value of the river discharge at each daily time step. This procedure reduces the risk of selecting the wrong grid point due, for example, to inaccurate coordinates in the dataset.
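The selection of the maximum over the nine grid points can be sketched as follows. This is a minimal illustration that assumes each daily field is available as a plain two-dimensional list and that the grid indices of the plant are already known; the function names and the toy grids are hypothetical, and reading the actual EFAS files is not covered.

```python
def max_discharge_3x3(grid, row, col):
    """Return the maximum discharge in the 3x3 window centred on (row, col).

    grid is a 2-D list (rows x columns) of discharge values for one day.
    Cells that would fall outside the grid are ignored at the domain edges.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    values = [
        grid[r][c]
        for r in range(max(row - 1, 0), min(row + 2, n_rows))
        for c in range(max(col - 1, 0), min(col + 2, n_cols))
    ]
    return max(values)


def extract_series(daily_grids, row, col):
    """Apply the 3x3 maximum independently to every daily time step."""
    return [max_discharge_3x3(grid, row, col) for grid in daily_grids]


# Toy example: two days on a 4x4 grid, plant located at cell (1, 1).
day1 = [[1, 2, 0, 0], [0, 3, 9, 0], [0, 0, 0, 0], [0, 0, 0, 50]]
day2 = [[7, 0, 0, 0], [0, 1, 0, 0], [0, 0, 2, 0], [0, 0, 0, 50]]
series = extract_series([day1, day2], row=1, col=1)
```

Because the maximum is taken separately for each day, the grid point that supplies the value can change from one time step to the next, as it does between the two toy days here.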

Model calibration
The river discharges extracted in the first phase are considered here the best source of information to estimate the hydropower variables (inflow and generation). The ENTSO-E data mentioned above are provided at country level, so we seek a model able to learn the relationship between all the river discharges at the power plants of a country and the hydropower variable (inflow or generation). The river discharge time series are preprocessed with the following steps:
1. They are aggregated temporally to match the temporal resolution of the target variable
2. They are normalised to zero mean and unit standard deviation
The preprocessed river discharges are joined with the target variable to create a calibration dataset (see Figure 2). Given that the ENTSO-E data cover only the period since 2015, the calibration data contain only the discharge data starting from that year.
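The two preprocessing steps can be sketched in a few lines of Python. The text does not state whether the temporal aggregation is a mean or a sum, nor which standard deviation estimator is used, so the block average and the population standard deviation below are assumptions, as are the function names.

```python
from statistics import mean, pstdev


def aggregate(daily_values, days_per_step=7):
    """Average daily discharge over consecutive blocks (7 days = 1 week)."""
    n_steps = len(daily_values) // days_per_step
    return [
        mean(daily_values[s * days_per_step:(s + 1) * days_per_step])
        for s in range(n_steps)
    ]


def standardise(values):
    """Rescale a series to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Toy example: 28 days of steadily increasing discharge.
daily = [float(d) for d in range(1, 29)]
weekly = aggregate(daily)            # four weekly averages
weekly_z = standardise(weekly)       # zero mean, unit standard deviation
```

For the daily run-of-river target the aggregation step is not needed, since the river discharge is already daily.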
Several models and methods have been tested to learn the link between river discharges and hydropower generation. Using the root mean square error (RMSE) and the correlation computed with a K-fold cross-validation method, we have compared a simple linear regression model, a ridge regression model and random forests (which have been used successfully for a similar task 14).
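A K-fold comparison of this kind can be sketched with a small harness that returns out-of-fold predictions for any model given as a pair of fit and predict functions. The contiguous fold layout, the two toy models (a constant mean and a one-predictor least-squares line) and all names below are assumptions chosen to keep the example self-contained; they are not the models or settings used for the dataset.

```python
import math


def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_predict(fit, predict, X, y, k=5):
    """Return one out-of-fold prediction for every sample."""
    predictions = [None] * len(y)
    for test_idx in k_fold_indices(len(y), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = predict(model, X[i])
    return predictions


def rmse(y_true, y_pred):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def fit_mean(X, y):
    """Baseline model: always predict the training mean."""
    return sum(y) / len(y)


def predict_mean(model, row):
    return model


def fit_line(X, y):
    """Least-squares line through one predictor (rows have one element)."""
    xs = [row[0] for row in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    slope = sum((x - mx) * (t - my) for x, t in zip(xs, y)) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx


def predict_line(model, row):
    slope, intercept = model
    return intercept + slope * row[0]


# Toy data following an exact linear relationship.
X = [[float(i)] for i in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]
rmse_mean = rmse(y, cross_val_predict(fit_mean, predict_mean, X, y))
rmse_line = rmse(y, cross_val_predict(fit_line, predict_line, X, y))
```

Because every prediction comes from a model that did not see the corresponding sample, the resulting scores can be compared across models; a correlation can be computed on the same out-of-fold predictions.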
The best results have been obtained with random forests and ridge regression but, considering the calibration and execution speed, the latter has been preferred. For a description of the random forest method, we refer to the original paper 15. A ridge regression model consists of a linear regression with a shrinkage method applied to the coefficients: the coefficients are shrunk by imposing a penalty on their size. The amount of this penalty is defined by a parameter λ. For each use of the ridge regression model, the value of λ is chosen by applying a cross-validation procedure. This procedure not only tends to produce more accurate models but also mitigates the impact of correlated predictors 16.
To enhance the model interpretability we have set the lower bound of the coefficients to zero, thus avoiding negative values. In this way, it is possible to give a 'physical' meaning to the coefficients, as a measure of the importance of each hydropower plant with respect to the aggregated sum.
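A ridge regression with coefficients bounded below by zero can be illustrated with a short projected gradient descent. This is only a sketch of the idea: the solver, the scaling of the penalty term, the fixed number of iterations and the function name are assumptions, and the software actually used to calibrate the dataset may differ in all of these respects.

```python
def fit_nonneg_ridge(X, y, lam, n_iter=2000):
    """Ridge regression with coefficients constrained to be >= 0.

    Minimises (1/2n) * sum((y - b - X w)^2) + (lam/2) * sum(w^2) subject to
    w >= 0 by projected gradient descent. The intercept b is fixed to
    mean(y), which is the optimal value when the columns of X have zero mean.
    """
    n, p = len(X), len(X[0])
    b = sum(y) / n
    yc = [t - b for t in y]
    # Step size from an upper bound on the Lipschitz constant of the gradient:
    # trace(X^T X) / n + lam >= largest eigenvalue of (X^T X / n + lam * I).
    lipschitz = sum(X[i][j] ** 2 for i in range(n) for j in range(p)) / n + lam
    step = 1.0 / lipschitz
    w = [0.0] * p
    for _ in range(n_iter):
        residual = [sum(X[i][j] * w[j] for j in range(p)) - yc[i] for i in range(n)]
        grad = [
            sum(X[i][j] * residual[i] for i in range(n)) / n + lam * w[j]
            for j in range(p)
        ]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, w[j] - step * grad[j]) for j in range(p)]
    return b, w


# Toy data: two zero-mean, uncorrelated predictors; the target depends
# positively on the first and negatively on the second.
X = [[-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0], [1.0, 1.0]]
y = [8.0, 14.0, 6.0, 12.0]          # y = 10 + 3 * x1 - 1 * x2
b, w = fit_nonneg_ridge(X, y, lam=0.5)
```

In this toy case an unconstrained fit would assign a negative coefficient to the second predictor; with the lower bound that coefficient is held at zero, while the first is shrunk by the penalty from 3 to 2.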

Extension
The final step to create the hydropower time series for the entire period (1991-2019) consists of applying the calibrated models to the inputs outside the calibration period. In this way, each model reconstructs the past variability of the predictand based on the observed historical river discharges.

Data Records
The entire dataset is stored in a Tabular Data Package 17 consisting of two CSV files, jrc-efas-hydropower-inflow.csv and jrc-efas-hydropower-ror.csv, for the weekly inflow and the daily run-of-river generation respectively, and a JSON file with the metadata. Data have been published on Zenodo 18.
The file 'jrc-efas-hydropower-inflow.csv' has 40 770 rows and it contains the following columns:

This is a non-peer reviewed preprint submitted to EarthArXiv.

Technical Validation
To validate the quality of the estimated variables we analyse the cross-validation output, i.e. the predicted variable estimated in cross-validation, and, for the inflow, external datasets covering the same time period. Two metrics are used in this section:
1. Spearman correlation coefficient
2. Normalised RMSE (NRMSE): the RMSE divided by the maximum value observed in the target variable
The Spearman correlation measures the monotonic association between two variables and is used here to evaluate the capability of JRC-EFAS-Hydropower to reproduce the seasonal cycle. The NRMSE, on the other hand, measures the magnitude of the error between the modelled and the actual inflow.
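Both metrics are straightforward to compute. The sketch below obtains the Spearman coefficient as the Pearson correlation of the ranks, with tied values sharing their average rank, and the NRMSE as defined above. The function names and the toy values are illustrative only.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def nrmse(observed, modelled):
    """RMSE divided by the maximum observed value."""
    mse = sum((o - m) ** 2 for o, m in zip(observed, modelled)) / len(observed)
    return math.sqrt(mse) / max(observed)


# Toy example with hypothetical observed and modelled values.
observed = [10.0, 20.0, 30.0, 40.0]
modelled = [12.0, 18.0, 33.0, 39.0]
rho = spearman(observed, modelled)
error = nrmse(observed, modelled)
```

The toy series are ordered identically, so the Spearman coefficient is 1 even though the values differ; the NRMSE captures that remaining difference in magnitude.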

Weekly inflow
The weekly inflow is validated both using the cross-validation output and a commercial dataset (Wattsight).
The numbers in Table 3 show that JRC-EFAS-Hydropower correctly reproduces the seasonal pattern, as indicated by the correlation metric, for the countries with the highest installed capacity. The cross-validation correlation coefficient is greater than 0.75 for the top 12 countries, which represent 75% of the entire hydropower installed capacity in the ENTSO-E area (35 countries).
The NRMSE calculated in cross-validation ranges between 6.2% and 13.2% for the top 12 countries. The metrics computed on the Wattsight dataset show in general similar values, with a few cases where the NRMSE is higher (in particular for DE and CH). This can be explained by the fact that the Wattsight and ENTSO-E datasets might not be consistent, due to different methodologies in measuring/estimating the inflow or a different categorisation of hydropower plants.

Table 3. Comparison for the estimated weekly inflows. The 99th percentile of the inflow refers to the one in the ENTSO-E Transparency Platform data used for the calibration.

In general, we should expect multiple factors to have an impact on the accuracy of the JRC-EFAS-Hydropower estimated inflow: a) the quality of the data provided by ENTSO-E, which might contain discrepancies or errors; b) the coverage of the power plants in the JRC Hydropower database, which might be inaccurate (e.g. wrong capacity or coordinates) or lack a specific plant; c) the presence of important cross-border basins, which might lead to the absence of relevant predictors; and d) the complexity of some hydropower systems, which would be very difficult to model without very specific information (e.g. cascading hydropower plants).
The dataset described in this paper is produced with the best information currently available; however, improving one or more of the above-mentioned factors might lead to more accurate results.

Daily run-of-river generation
As for the weekly inflow, we assess the quality of the estimated daily run-of-river generation by presenting the correlation and the NRMSE computed in cross-validation (Table 4).
The biggest difference between the run-of-river generation and the inflow concerns not only the temporal scale (daily and weekly, respectively) but also the dispatchability of run-of-river plants in many European countries. As visible in Figure 4, the daily generation of run-of-river plants is not smooth but follows the market price. This can be explained by two facts: a) the discharge (and thus the generation) of run-of-river plants might be regulated and then partially dispatched like other power plants, and b) the run-of-river data we use from ENTSO-E TP also include pondage plants (defined by ENTSO-E as plants with a reservoir with a filling period below 400 hours). The statistical model behind JRC-EFAS-Hydropower uses river discharges as predictors and therefore cannot reproduce human-induced behaviour such as dispatching. This is particularly evident for a country with many long and regulated rivers like Finland (centre panel in Figure 4). However, as also shown by the correlation coefficients, JRC-EFAS-Hydropower is able to reproduce the seasonal pattern of the water availability.

Table 4. Comparison for the estimated daily run-of-river generation. The 99th percentile of the generation refers to the one in the ENTSO-E Transparency Platform data used for the calibration.

As discussed for the weekly inflow, the quality of the JRC-EFAS-Hydropower estimates is affected by the quality of the data used as inputs, both for the predictand (ENTSO-E TP data) and the predictors (river discharge at the locations of hydropower plants).

Usage Notes
Data are stored in tabular format, as CSV files following the datapackage convention 17. They can therefore be opened easily with common spreadsheet software and programming languages.
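As a minimal illustration, a CSV file of this kind can be read with the Python standard library alone. The column names and values in the in-memory sample below are hypothetical placeholders, used only so that the example runs without the published files; the actual column names should be taken from the file header or from the JSON metadata of the data package.

```python
import csv
import io


def read_table(handle):
    """Read a CSV file handle into (column_names, rows_as_dicts)."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    return reader.fieldnames, rows


# Stand-in for: open("jrc-efas-hydropower-inflow.csv", newline="")
# The header and the two data rows are hypothetical placeholders.
sample = io.StringIO(
    "country,year,week,inflow\n"
    "XX,1991,1,100.0\n"
    "XX,1991,2,120.5\n"
)
columns, rows = read_table(sample)
inflow = [float(row["inflow"]) for row in rows]
```

All fields are read as text, so numeric columns need an explicit conversion, as done here for the placeholder inflow column.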
Whenever using JRC-EFAS-Hydropower as input for a power system model, the quality of the dataset for the different modelled countries (Tables 3 and 4) should be taken into account.