Evaluating precipitation datasets for large-scale distributed hydrological modelling

Abstract We are experiencing a proliferation of satellite-derived precipitation datasets. The advantages and limitations of their promising application in hydrological modelling have been broadly investigated. However, most studies have analysed the performance of only one or a few datasets, were limited to selected small-scale case studies, or used lumped models when investigating large-scale basins. In this study, we compared the performance of 18 different precipitation datasets when used as the main forcing in a grid-based distributed hydrological model to assess streamflow in medium to large-scale river basins. These datasets are classified as Uncorrected Satellite (Class 1), Corrected Satellite (Class 2) and Reanalysis and Gauge-based (Class 3) datasets. To provide a broad-based analysis, 8 large-scale river basins (Amazon, Brahmaputra, Congo, Danube, Godavari, Mississippi, Rhine and Volga) having different sizes, hydrometeorological characteristics and human influence were selected. The distributed hydrological model was recalibrated for each precipitation dataset individually. We found that there is no unique best-performing precipitation dataset for all basins and that results are very sensitive to basin characteristics. However, a few datasets persistently outperform the others: SM2RAIN-ASCAT for Class 1; CHIRPS V2.0, MSWEP V2.1 and CMORPH-CRT V1.0 for Class 2; GPCC and WFDEI GPCC for Class 3. Surprisingly, precipitation datasets showing the highest model accuracy at basin outlets do not show the same high performance at internal locations, supporting the use of a distributed modelling approach rather than a lumped one.


Introduction
In a recent study about worldwide information on precipitation ground measurements, Kidd et al. (2017) estimated that "The total area measured globally by all currently available rain gauges is surprisingly small, equivalent to less than half a football field or soccer pitch". This limited gauge representativeness, the scarce and unequal spatial distribution of rain gauges (Maggioni and Massari, 2018) and the concern for the global decline of in-situ hydrologic measurements (Stokstad, 1999; Shiklomanov et al., 2002) have motivated increasing attention to the potential offered by the growing availability of satellite-retrieved precipitation products as an alternative source of input data for hydrological modelling. In particular, the interest in satellite products has grown over the past decade, with the increase in their temporal and spatial resolutions (Stephens and Kummerow, 2007; Kidd and Huffman, 2011; Xie and Xiong, 2011; Brocca et al., 2013; Funk et al., 2015; Duan et al., 2016; Beck et al., 2017a).
From the first studies exploring the potential of incorporating satellite-based precipitation data (e.g., Barrett and Martin, 1981; Schulz, 1996; Tsintikidis et al., 1999) to the latest global-scale comprehensive evaluation of a number of precipitation datasets (Beck et al., 2017b), over the past three decades the scientific literature has produced a variety of valuable research works that have advanced our understanding of the advantages and limitations of satellite-derived precipitation information in hydrologic modelling. Recently, Maggioni and Massari (2018) proposed a comprehensive review of previous studies on satellite-based precipitation as input forcing for hydrological models.
The performance of precipitation datasets in hydrological applications has been assessed by comparing either simulated and observed soil moisture (e.g., Brocca et al., 2013) or simulated and observed river discharge. Here we focus on the latter and group previous studies based on the common approaches used in their assessment.
Also, most of the previous studies have based the evaluation of precipitation datasets on hydrological models that are lumped/conceptual (e.g., Behrangi et al., 2011;Essou et al., 2016), thus not accounting for the inherent spatial variability of river basin characteristics that are averaged over the watershed (e.g., Boyle et al., 2001;Carpenter and Georgakakos, 2006).
Furthermore, only a limited number of applications have explored the suitability of satellite-retrieved precipitation products in global/large-scale hydrological modelling (e.g., Voisin et al., 2008). A recent contribution by Beck et al. (2017b) performed a comprehensive global evaluation of 22 precipitation datasets using gauge observations and hydrological modelling. Nevertheless, these analyses did not explore the full range of available precipitation products (i.e., Voisin et al., 2008), whose offering has greatly increased over the past fifteen years, were based on lumped models (Beck et al., 2017b) or did not compare the results of the hydrological simulations with observed values (Fekete et al., 2004).
Our research questions emerged from the consideration that, notwithstanding the intrinsic uncertainty and errors associated with satellite-retrieved precipitation data (e.g., Hossain and Anagnostou, 2004; Gottschalck et al., 2005; Hossain and Lettenmaier, 2006; Hong et al., 2006; Ebert et al., 2007; Tian and Peters-Lidard, 2010; Stampoulis and Anagnostou, 2012; Libertino et al., 2016; Maggioni and Massari, 2018), these data represent a unique opportunity for hydrologic applications and, in particular, can potentially shed light on improved flow estimation in data-scarce or data-poor basins (Serrat-Capdevila et al., 2014).
We thus posed the following research questions, drawn from the highlighted gaps in the current literature: a) How does precipitation estimation from satellite data using different products compare when forcing a distributed hydrologic model to estimate river discharge in basins with different spatial scales, climatic zones and human influence (e.g. presence of reservoirs)? b) How does the density of the precipitation monitoring network used to correct (some) precipitation datasets affect river discharge estimation in different river basins? c) Do spatial details gained by using distributed hydrological modelling and different precipitation datasets lead to an improved representation of the river flow within the basin? d) Is there a specific dataset that always outperforms the other ones at both outlet and internal basin locations?
To answer these research questions, we investigated the behaviour of 18 different alternative sources of rainfall data in estimating river discharge over large areas. In particular, we tested three classes of (quasi-)global precipitation datasets as the main forcing of a grid-based distributed hydrological model: 1) satellite-based rainfall products; 2) gauge-corrected satellite rainfall products; and 3) reanalysis and gauge-measured rainfall data. Eight large-scale river basins of different size and hydrometeorological characteristics are used as case studies: the Amazon, Brahmaputra, Congo, Danube, Godavari, Mississippi, Rhine and Volga river basins.

Case studies
The eight river basins (see Figure 1) were selected based on basin size, hydrometeorological and climatic characteristics, presence of hydraulic structures and density of the precipitation gauge network. We thus selected mid to large-scale river basins belonging to different continents and covering different climatic zones (Kottek et al., 2006). The Danube, Rhine and Mississippi basins have a dense network of precipitation gauges and a significant presence of dams and reservoirs, which alter the natural hydrological response of the river basins.
On the other hand, the Amazon, Congo and Volga basins can be considered less affected by human interventions. Table 1 (Molinier et al., 1993; Gaillardet et al., 1997; Wieriks and Schulte-Wulwer-Leidig, 1997; Goolsby and Battaglin, 2001; Immerzeel, 2008; Jha et al., 2009; Csagoly et al., 2016; Harrison et al., 2016) summarizes the main characteristics of each river, including climatic zones (Peel et al., 2007), size of the drainage area, length of the main river, average river flow, number of flow monitoring stations used to evaluate model performance (GRDC, 2018), number of reservoirs from the GRanD v1 dataset (Lehner et al., 2011) and percentage of human footprint (i.e., human population density, land transformation, electrical power infrastructure and access to land data, as reported by the Center for International Earth Science Information Network, CIESIN) (Kareiva et al., 2007).

Discharge dataset
We calibrated and validated the hydrological models using river discharge data provided by the Global Runoff Data Centre (GRDC; http://www.bafg.de/GRDC/). The GRDC collects river discharge data for more than 9500 stations from 161 countries. The collection period varies spatially: the earliest data are from 1807, while the most recent are from 2018. In this study, the daily discharge data (from 2007 to 2013) for 46 sensors across the 8 river basins (shown in Figure 1) were extracted from the whole GRDC database.
In particular, information from flow sensors located at the outlet of the 8 river basins were used to calibrate and validate the hydrological model. Flow values at internal sensors were used in the validation process of 5 of the 8 basins (Amazon, Danube, Godavari, Mississippi and Rhine) for which GRDC flow data were available within the catchment for the simulated period.

Distributed hydrological model
In this study, a grid-based hydrological model was developed to spatially estimate the flow within river catchments at different spatial scales. Figure 2 shows a schematic representation of the proposed modelling framework. For each grid cell (0.25º × 0.25º in this study), a conceptual HBV-96 model (Bergström, 1992) was implemented to represent the rainfall-runoff processes, and its grid output was then routed downstream using the widely known Muskingum model (Cunge, 1969). The choice of the HBV model was driven by its computational efficiency and its ability to provide accurate forecasts under a wide range of climatic conditions (e.g. Merz and Blöschl, 2004; Bardossy, 2007; Jin et al., 2009; Demirel et al., 2015; Vetter et al., 2015). The connectivity between grid cells was estimated using the flow direction and flow accumulation dataset developed by Wu et al. (2011).
The modelling steps implemented to calculate the discharge accumulation over the drainage network at time step t are:
1. Calculate the flow contribution QR generated in each grid cell of the basin, as the response of rainfall-runoff processes simulated by a conceptual lumped hydrological model.
2. Propagate the flows generated at the grid cells with the lowest flow accumulation value (the most upstream part of the catchment) downstream using the Muskingum routing model, following the connectivity characteristics (indicated as QP in Figure 2).
3. Calculate the total discharge in each grid cell: consider cell N, shown in Figure 2, located downstream of cells i and j. Once the upstream flows from grids i and j are routed to the downstream cell N, the total discharge at cell N is calculated as:

QT(N) = QR(N) + QP(i→N) + QP(j→N)

where QR is the flow generated by the conceptual hydrological model, QP is the flow propagated from the upstream grids and QT is the total flow at cell N.
4. Propagate the discharge QT at grid N to the downstream grid cells.
5. Repeat steps 2 to 4 sequentially for each flow accumulation value up to the highest one (i.e. the grid cell at the basin outlet), in order to calculate the distributed discharge accumulation QC within the catchment at time step t.
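As a minimal sketch of steps 1 to 5, the accumulation can be implemented as a single upstream-to-downstream sweep over the cells ordered by increasing flow accumulation. This is not the authors' actual implementation: the function names, the Muskingum parameter values (K, X, dt) and the toy network are illustrative assumptions, and the HBV rainfall-runoff response is replaced by a precomputed runoff field QR.

```python
# Hypothetical sketch of the per-time-step discharge accumulation (steps 1-5).
# The HBV response is replaced by a precomputed runoff dict `qr`; K, X and dt
# are illustrative values, not the calibrated ones.

def muskingum_step(inflow_prev, inflow_now, outflow_prev, K=12.0, X=0.2, dt=24.0):
    """One Muskingum routing update: O_t = c1*I_t + c2*I_{t-1} + c3*O_{t-1}."""
    denom = 2.0 * K * (1.0 - X) + dt
    c1 = (dt - 2.0 * K * X) / denom
    c2 = (dt + 2.0 * K * X) / denom
    c3 = (2.0 * K * (1.0 - X) - dt) / denom
    return c1 * inflow_now + c2 * inflow_prev + c3 * outflow_prev

def accumulate_discharge(cells, downstream, qr):
    """Total flow QT per cell: QT = QR + routed upstream inflows QP.
    `cells` must be ordered by increasing flow accumulation (upstream first);
    `downstream` maps each cell to its downstream neighbour (outlet -> absent)."""
    qp = {c: 0.0 for c in cells}      # propagated inflow reaching each cell
    qt = {}
    for c in cells:                   # upstream-to-downstream sweep (steps 2-5)
        qt[c] = qr[c] + qp[c]         # step 3: QT = QR + QP
        d = downstream.get(c)
        if d is not None:             # step 4: route QT to the downstream cell
            qp[d] += muskingum_step(0.0, qt[c], 0.0)  # cold routing states, for brevity
    return qt
```

For the two-cell example of Figure 2 (cells i and j draining into N), the total flow at N is its local runoff plus the two routed inflows.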
It is worth noting that model states were initialized by running the model twice over the entire record. Although the effect of a model's initial conditions has been widely discussed in the literature (see for example Goodrich et al., 1994; Minet et al., 2011; Seck et al., 2015), there is still no clear and automatic rule to determine the optimal spin-up time in hydrological models. According to Rahman et al. (2016), up-to-date modelling exercises are currently performed by running the simulation recursively through a specific period (typically a year) or by running the model multiple times for different climatological conditions. For this reason, we decided to initialize the states using two model runs. In addition, the model states in each grid cell are independent of the neighbouring cells, and interactions occur only through the propagation of the generated total discharge.
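The "run the model twice" initialization can be sketched as follows; `model_step` is a hypothetical per-time-step function (state in, updated state and discharge out), not the actual HBV interface.

```python
# Hedged sketch of spin-up by running the full record once and reusing the
# final states as initial condition for the actual simulation.

def spin_up_states(model_step, forcing, state0):
    """First pass over the whole forcing record, used only to warm up the
    model states; returns the final states for the second (real) run."""
    state = dict(state0)
    for f in forcing:
        state, _ = model_step(state, f)   # discharge of the first pass is discarded
    return state
```

With a toy linear reservoir (S' = 0.9 S + P), for example, the spun-up storage approaches its steady-state value P / (1 - 0.9) after enough steps.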
For more information about the version of the HBV model implemented in this study, the reader is referred to Seibert and Vis (2002) and Beck et al. (2016).

Performance measures
The performance of the distributed hydrological model forced by the different precipitation inputs was evaluated by means of the Nash-Sutcliffe efficiency (NSE; Nash and Sutcliffe, 1970) and the Bias index, both widely used among hydrologists (Moriasi et al., 2007). The reason for using NSE instead of other indices, such as the Root Mean Square Error, the Pearson coefficient or the Kling-Gupta efficiency, is that NSE is highly sensitive to peak flow values (Krause et al., 2005; Beck et al., 2017b). In fact, the correct simulation of peak flows is highly dependent on the precipitation forcing used in hydrological modelling, which is the main focus of this study.
An NSE value of 1 represents a perfect model simulation, while an NSE value of 0 indicates that the model is only as accurate as the mean of the observed flows. Ritter and Muñoz-Carpena (2013) indicated that models scoring NSE ≥ 0.65 provide acceptable results.
The Bias index was selected to assess the tendency of the model to overestimate (Bias index greater than 1) or underestimate (Bias index smaller than 1) the observed flows. The Bias index is calculated as the ratio between the mean of the simulated and the mean of the observed flow values.
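Under the definitions above, the two performance measures can be computed as in the following sketch (function names are illustrative):

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe efficiency: 1 - sum((obs-sim)^2) / sum((obs-mean(obs))^2).
    1 = perfect fit; 0 = no better than the mean of the observations."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return float(1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2))

def bias_index(obs, sim):
    """Bias index: mean(sim) / mean(obs); >1 overestimation, <1 underestimation."""
    return float(np.mean(sim) / np.mean(obs))
```

For example, a simulation equal to the observed series gives NSE = 1, while a constant simulation equal to the observed mean gives NSE = 0.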

Model calibration
One of the main challenges in large-scale hydrological modelling is the proper estimation of model parameters (Anderton et al., 2002). In the grid-based distributed model proposed in this study, 14 model parameters need to be calibrated for each grid cell: 12 parameters from the HBV model and 2 from the Muskingum routing model. Here, we used the global regionalized dataset developed by Beck et al. (2016), spatially distributed at a 0.25º resolution, to assign the initial values of the 12 HBV parameters. These parameter fields were then individually perturbed by correction coefficients, in order to estimate the optimal set of parameters for each precipitation product and river basin. We re-calibrated the hydrological model for each precipitation dataset in order to obtain an unbiased and comprehensive comparison among the datasets.
The 14 correction coefficients were calibrated by means of a least squares minimization technique using the Broyden-Fletcher-Goldfarb-Shanno variant of the Davidon-Fletcher-Powell minimization (DFPMIN) algorithm (Press et al., 1992). In particular, we aimed to maximize the NSE between the observed and simulated discharge values at the outlet, for all the considered basins and precipitation datasets. This approach may be affected by the "equifinality" problem (Beven, 2001; 2002; Brooks et al., 2007) for large-scale basins in which calibration is performed using only the outlet discharge rather than flow measurements at internal points. However, due to the limited flow information available at sensor locations within the catchments, we decided to use these measurements for model validation rather than calibration purposes.
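A hedged sketch of this calibration loop: SciPy's BFGS implementation stands in here for the DFPMIN routine of Press et al. (1992), and `run_model` is a trivial linear surrogate with two coefficients (instead of the 14 used in the study) so that the example stays self-contained.

```python
import numpy as np
from scipy.optimize import minimize

def run_model(coeffs, forcing):
    """Placeholder for the gridded HBV + Muskingum model: a linear response."""
    return coeffs[0] * forcing + coeffs[1]

def calibrate(obs, forcing, n_coeffs=2):
    """Find correction coefficients maximizing NSE, i.e. minimizing
    1 - NSE = sum((obs-sim)^2) / sum((obs-mean(obs))^2)."""
    obs = np.asarray(obs, float)
    def one_minus_nse(coeffs):
        sim = run_model(coeffs, forcing)
        return np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)
    res = minimize(one_minus_nse, x0=np.ones(n_coeffs), method="BFGS")
    return res.x
```

Because maximizing NSE is equivalent to minimizing the sum of squared residuals (the denominator is fixed by the observations), the least-squares and NSE-based formulations lead to the same optimum.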
The length and characteristics of the flood events used for model calibration have a great influence on the estimation of the optimal model parameters. The calibration and validation periods are shown in Table 3. Note that for some basins the calibration period is later than the validation one, and vice versa: flood events included in the calibration period were specifically selected to properly represent the hydrological variability of the basin, in terms of both high and low flows. In fact, previous studies demonstrated that applying a hydrological model under flow conditions different from those used in calibration results in unreliable model performance (Seibert, 2003; Singh and Bardossy, 2012; Brigode et al., 2013).
The validation of the distributed hydrological model was then performed on independent periods.
The results of the calibration and validation analyses are shown in the next sections.

Model calibration results
Following the approach described in the previous section, the distributed hydrological model was calibrated for the 18 precipitation datasets, for each of the 8 river basins: 144 optimal sets of perturbation model parameters were calculated. The results of the calibration phase are first shown by class of dataset (i.e., Class 1: satellite-based; Class 2: gauge-corrected; Class 3: reanalysis/gauge measured) and then individually for each precipitation dataset (Table 1). Analysing Figure 4 in terms of class behaviour, we can see that models forced by Class 1 NRT products tend to show higher variability in NSE values and an overall mean NSE lower than that obtained with Class 2 and 3 products, while comparable performance can be found between Class 2 and 3 products.

Model validation results at basins outlets
We validated the distributed hydrological model at basin outlets for time periods and durations different from those used in calibration. We present our findings focusing on: a) model performance by forcing dataset over the eight case studies and b) model performance by river basin, highlighting the best and worst performing forcing datasets. Concerning the comparison among different classes, Figure 6 shows that, when using uncorrected satellite products (Class 1, green bars in the figure) as input for the hydrological model, the highest variability in model performance was obtained in most of the basins. For example, high performance variability was obtained on the Mississippi basin using Class 1 datasets. This is presumably due to the tendency of Class 1 products to provide results that are more variable in winter than in summer (as recently shown by Beck et al., 2019). In fact, one major challenge for Class 1 products is to properly represent the snowfall and light rainfall occurring in winter (Kongoli et al., 2003; Liu and Seo, 2013; Habib et al., 2009; Tian et al., 2009; Beck et al., 2019). Class 2 and 3 products gave NSE values higher than 0.5 for six out of the eight basins. However, the class providing the highest NSE depends on the considered basin, as shown also during calibration. In fact, Class 2 datasets gave the highest median NSE values on the

Model validation results at internal locations
In this analysis, only those basins for which GRDC flow data were available at internal locations (Amazon, Danube, Godavari, Mississippi and Rhine) were considered. The variability of model performance drastically increases when considering internal locations. This is due to the fact that models were calibrated using only outlet flow information and no internal data. As previously shown during model calibration and validation at basin outlets, Class 1 datasets show higher variability in model performance, while Class 2 and 3 products lead to more comparable results. Furthermore, no dataset constantly outperforms the others; model performance changes according to the river basin and its characteristics. The highest average NSE values were achieved using TMPA 3B42 RT V7, GPCC, CMORPH-CRT V1.0 and CHIRPS V2.0 for the Amazon, Danube, Godavari, Mississippi and Rhine, respectively. The good performance obtained using CHIRPS as model forcing may be due to the use of sub-monthly gauge observations to improve precipitation estimates and to the fact that CHIRPS is specifically designed to provide the most temporally homogeneous record possible (Beck et al., 2017a). Similar results for the TMPA 3B42 RT V7 dataset on the Amazon were obtained by Zubieta et al. (2015).
From the results reported in Figure 7, it can be observed that the distributed model was unable to reliably represent flow within the Mississippi river, as a high variation of the NSE values is displayed for almost all the forcing datasets. In addition, the median values of the boxplots for the majority of the datasets are lower than zero, indicating poor model performance. This can be related to the inability of the model to properly represent hydrological processes in highly regulated basins (e.g. with a high number of dams) rather than to poor performance of the precipitation datasets. More reliable results were achieved in the other four basins.
As previously demonstrated, on the Amazon, datasets of Class 1 generally outperformed those in Classes 2 and 3. This can be due to the systematic limitations of Class 3 products in properly detecting precipitation across South America (Sun et al., 2018).
On the Danube basin, both Class 2 and Class 3 products showed NSE values higher than 0.5, while on the Rhine basin there was higher variability of the median values of these two dataset classes, which might be related to the high human influence in this basin. Moreover, on the Godavari, corrected satellite products (Class 2) gave higher median NSE values and lower variability of model results than Class 3.

b) Analysis of the model performance at different basins
Concerning model results with respect to the spatial distribution of the sensors, Figure 8 represents the boxplots of the NSE values obtained at the different sensor locations across all precipitation datasets. In addition, the human footprint at each internal gauge is represented using the dataset provided by Kareiva et al. (2007). Model performance at each location varied with the size of the basin: the variability of the median NSE values at internal sensors was found to increase with basin size. This was clear for the Amazon and Mississippi rivers. Here, it was difficult to find a consistent spatial pattern of NSE values with respect to the order of the river reaches. For example, sensor 6 of the Amazon basin had a higher median NSE than sensor 2 (downstream of sensor 6) and sensors 10 and 11 (upstream of sensor 6). On the contrary, sensor 12 had a higher median NSE than both sensors 5 and 3, which are located downstream along that particular reach.
An additional remark is that model performance showed erratic behaviour at upstream sensors located on reaches converging into the same downstream reach. It is interesting to note that no correlation was found between the human footprint at each gauge location and model performance. As mentioned in the method section, the overall flow at a particular location is given by the sum of the flow generated by the model in that particular grid cell and the upstream contribution routed to the downstream location. For this reason, the flow value at a specific location is not affected by the high or low value of the human footprint at that location, as it may not be representative of the average human footprint within the basin (e.g. in the upstream part of the basin), which has a greater influence on model performance. This is clear in the Amazon basin, where the majority of the basin has a low human footprint (see Table 1), but flow gauges are located in urbanized areas with a higher human footprint close to the river.
Furthermore, the results of this analysis showed that model performance at the basin outlets was higher than at internal points of the basins. This was not surprising, since the model was calibrated at the outlet of each basin and no information from internal points was used during the calibration process. However, there were cases in which the model performed better at internal locations than at the outlet of the basin. To further investigate this aspect, Figure 9 represents the percentage of sensors at which NSE was higher at internal points (NSEI) than at the outlet of the basin (NSEO) for the different precipitation datasets.
The first visible result is that on the Amazon, Danube and Rhine basins only a few precipitation datasets have a percentage of NSEI>NSEO higher than zero. As already mentioned in the previous analyses, datasets belonging to Class 1 showed higher variability in model results (especially PERSIANN, PERSIANN-CCS, and CMORPH V1.0) and a higher percentage of NSEI>NSEO than the other two dataset classes. This is evident in Figure 9 for the Mississippi and Godavari basins. The latter shows a persistent percentage of NSEI>NSEO higher than 0 because the simulated flow at sensor 4 (see Figure 8) always outperformed the simulation at the basin outlet.

c) Analysis of the model bias performance
To further assess model performance at internal points, Figure 10 illustrates, for each sensor location, the percentage of precipitation datasets that overestimate (Bias index greater than 1) the observed flow: this is called %Bias in the following. A red circle indicates that 100% of the datasets overestimate the observed flow, while 0% indicates that all the precipitation datasets led to an underestimation of the river flow at that sensor location.
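The %Bias statistic defined above reduces to a simple count over the per-dataset Bias indices at a sensor; a minimal sketch (function name illustrative):

```python
def percent_overestimating(bias_by_dataset):
    """%Bias at a sensor: percentage of precipitation datasets whose Bias
    index exceeds 1, i.e. that overestimate the observed flow there."""
    over = sum(1 for b in bias_by_dataset if b > 1.0)
    return 100.0 * over / len(bias_by_dataset)
```

For instance, if at a sensor two out of four datasets give Bias indices above 1, the sensor's %Bias is 50%.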
A very interesting result is that, despite the NSE values being highly dependent on sensor location, %Bias was not influenced by the different dataset classes. An interesting aspect visible on the Amazon, Mississippi and Rhine basins is that, for pairs of upstream reaches converging into the basin outlet, the model usually overestimated the observed flow along one reach and underestimated it along the other, while providing accurate results at the outlet. This phenomenon might be due to the model calibration, which optimises the model parameters using only the outlet flow, with a consequent compensation between the opposite upstream reaches. This is clearly illustrated in Figure 11, where the ensemble of simulated flows using all the precipitation datasets is represented in red for overestimation (red arrow in the map, Bias index greater than 1.1), blue for underestimation (blue arrow, Bias index smaller than 0.9) and green for Bias index values close to 1 (green circle).

d) Best performing datasets
To conclude the assessment of model results at internal locations, in Figure 12 we show the datasets that provided the highest NSE value at each sensor location in each river basin. It can be observed that there was no clear "winner" dataset, as was also the case in the validation analysis at the basin outlets. However, datasets of Class 2 often provide the highest NSE values. This can be due to the explicit use of daily gauge data for correcting satellite products. In particular, out of the 47 locations within the 5 river basins, Class 2 outperformed the other two classes 26 times (see Table 5). Regarding Class 1 datasets, SM2RAIN-ASCAT provided the best results on the Amazon (at 4 sensors) and Danube (2 sensors). This result obtained with SM2RAIN-ASCAT on the Amazon was quite surprising, as Massari et al. (2017) stated that soil-moisture-based rainfall estimates perform reasonably well in semi-arid rather than wet climates. We speculate that the good performance of SM2RAIN-ASCAT on the Amazon is mainly driven by the model calibration and the optimal set of parameters used. TMPA 3B42 RT V7 and PERSIANN-CCS also showed high NSE values on the Amazon (3 times), as they provide good precipitation estimates in wet tropical regions. Class 3 datasets gave the highest NSE values for 11 sensors of the five river basins. In particular, besides PFD, all datasets of Class 3 provided the highest NSE for at least one sensor in the basins. The results of this last analysis are collected in Table 5.
Another interesting result was that the datasets producing the best model results at the basin outlets did not show the same high performance at internal locations.

Conclusions
In this study, we compared the performance of 18 different precipitation datasets when used as the main forcing in a grid-based distributed hydrological model to simulate river flow in large river basins. 1) The distributed hydrological model was able to properly represent river flow, even though it is highly sensitive to differences in precipitation forcing (in agreement with Voisin et al., 2008). In particular, the model was mainly affected by the scale of the basin and by the human influence expressed as the presence of reservoirs (see number of reservoirs in Table 1) and the human footprint in the basin; different climatic zones had an indirect impact on the hydrological model through their influence on the precipitation datasets. 2) Uncorrected Satellite datasets (Class 1), which can be considered NRT products, provided the lowest and most variable model results when compared to the other two classes of products. SM2RAIN-ASCAT is the dataset that provided the best results among Class 1 products. These findings also point to a future research direction: a global sensitivity analysis of distributed hydrological models to both input data and model parameters, in order to understand the spatial distribution of the governing hydrological processes and to guide the modeller towards an informed selection of the precipitation dataset.

3) Precipitation datasets belonging to Class 2 outperformed the other datasets in basins with
The outcomes of this study are valuable to support the selection of a precipitation dataset to achieve reliable model results in global and large-scale applications. This is the first study that attempts to model large-scale basins using multiple global precipitation datasets within a distributed hydrological model. Our research offers promising results that might be key in assessing flow values in data-scarce river basins.