Accuracy vs Realism: Does including reservoirs improve hydrological models?

Brazil has invested considerably in reservoir construction during the past decades, mainly for irrigation and hydropower generation. Despite their large impact on catchment hydrology, reservoir dynamics are often not included in hydrological models due to their complexity. In this study, we investigated the effect of including reservoir dynamics (realism) in hydrological models on the model performance (accuracy). Combined, realism and accuracy form the model fidelity. We used the HBV-EC and GR4J models to simulate hydrological processes and daily streamflow of 403 catchments across Brazil in two scenarios, with and without reservoirs. The model performances were assessed with the Kling-Gupta Efficiency (KGE) and its components, and were compared between the models and scenarios. We found a significant increase in the HBV-EC model performance when the reservoirs were taken into account, although the overall performance was relatively poor. The average KGE increased from 0.21 without the reservoirs to 0.40 with the reservoirs. The GR4J model, on the other hand, showed better overall performance, but no improvement when the reservoirs were included; the average KGE decreased slightly from 0.57 to 0.56. In the catchments with the largest reservoir capacity, HBV-EC in the scenario with reservoirs outperformed GR4J in both scenarios. We note that better model performance can still be obtained with a smaller spatial scale or other methods of including reservoirs, which require more data and detailed studies. With this paper, we demonstrate that model performance can improve when including reservoir dynamics, but this depends on model structure and does not always increase model fidelity.


INTRODUCTION
Models are simplifications of reality and therefore inherently come with uncertainties. Model fidelity is the degree to which the model simulations relate to the real world. Fidelity is achieved both by the sufficient

(1) To get the right answers for the right reasons (Kirchner, 2006, p.1), not only a good model performance

Table 5). The three classes, divided by the vertical lines, in the left upper cdf panel coincide with the classes indicated in the map on the left.

and therefore it was not included in this study.

Two hydrological modeling structures were compared, using the RAVEN modular modeling framework (Craig et al., 2020). RAVEN is a flexible framework, which allows many different algorithms to be used for different parts of the water cycle as well as various routing mechanisms. Several hydrological modeling structures can be reproduced nearly exactly: UBCWM (Quick, 1995), HBV-EC (Bergström, 1995), HMETS (Martel et al., 2017), MOHYSE (Fortin and Turcotte, 2006) and GR4J (Perrin et al., 2003). This framework was chosen because it includes some modules that allow modeling of human interference. It can thus easily be adapted to include the reservoir dynamics.

The HBV-EC and GR4J models were selected in this study. HBV-EC is a Canadian version of the HBV (in Swedish, Hydrologiska Byråns Vattenbalansavdelning) model (Bergström, 1995; Lindström et al., 1997). It is a semi-distributed conceptual model with 16 parameters, employed in this study as a lumped

conceptual rainfall-runoff model developed by Perrin et al. (2003). However, the RAVEN emulation contains two additional parameters to add a snow routine to GR4J. In general, HBV-EC has a slightly more complex structure than GR4J, but both are relatively simple and widely used in previous studies with good performance (e.g., Engeland and Hisdal, 2009; Payan et al., 2008; Unduche et al., 2018). An overview of the RAVEN interpretations of both models is given in Table 1. The complete model schemes of HBV-EC and GR4J are shown in Figures 8 and 9, respectively.

To run the models in RAVEN, the readily available templates for the HBV-EC and GR4J models were implemented (Craig et al., 2020). Given that we work with lumped models, each catchment was represented by a single Hydrological Response Unit (HRU). The majority of the parameters in both models were calibrated (see Table 3 for HBV-EC and Table 4 for GR4J). For the few remaining parameters, CAMELS-BR data were used where possible, including soil types, groundwater depth and land use types (such as forest fraction).

To include the reservoirs in the model structures, an extra open-water HRU was added, which accounts for the storage of the reservoir and the open-water evaporation from the reservoir. Note that the lumped nature of the models implies that the total reservoir capacity is placed at the outlet of the catchment; therefore, we do not account for concatenating or cascading effects of reservoirs. Accounting for these is also not possible with the information provided in CAMELS-BR, which gives only the total reservoir capacity per catchment. The lake-like reservoirs require information about the weir coefficient (C; default 0.6), crest width (calibrated), maximum depth (h) and surface area (A). A (km²) and h (m) can be calculated from the reservoir capacity (V, in 10⁶ m³) by reversing the equations given by Chagas et al. (2020): for reservoirs for which depth h information was available, and for the reservoirs where depth information was not available.
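To illustrate why the crest width matters for a lake-like reservoir, the sketch below evaluates the textbook rectangular weir equation, Q = (2/3) C sqrt(2g) L h^(3/2), for a given head above the crest. This is an assumption made for illustration only: the excerpt does not state which outflow relation RAVEN applies internally, and the function name and arguments are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def weir_outflow(head_m, crest_width_m, weir_coeff=0.6):
    """Discharge (m^3/s) over a rectangular weir.

    Assumed textbook form, not taken from RAVEN:
        Q = (2/3) * C * sqrt(2 g) * L * h^(3/2)
    head_m        : water level above the crest (m); no flow at or below the crest
    crest_width_m : crest width L (m), the parameter calibrated in the reservoir scenario
    weir_coeff    : weir coefficient C (default 0.6, as in the text)
    """
    if crest_width_m <= 0:
        raise ValueError("crest width must be positive")
    if head_m <= 0:
        return 0.0
    return (2.0 / 3.0) * weir_coeff * math.sqrt(2.0 * G) * crest_width_m * head_m ** 1.5


if __name__ == "__main__":
    # Outflow grows linearly with crest width for a fixed head of 1 m.
    for width in (1.0, 10.0, 50.0):
        print(width, round(weir_outflow(1.0, width), 2))
```

Under this assumed relation, a wider crest releases water faster for the same water level, so the calibrated crest width effectively controls how strongly the reservoir attenuates peaks.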

Two scenarios were investigated in this study: without reservoirs (the so-called benchmark scenario) and

were included with an extra HRU, and the models were calibrated again before assessing their performance.

selected to test the methods in both modeling scenarios). The run time of PSO was over thirty minutes for three catchments, compared to just a few minutes with DDS. We therefore proceeded with DDS. The best parameters found through calibration with DDS were used for validation.
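The excerpt does not describe DDS itself. Assuming it refers to the dynamically dimensioned search algorithm, the following is a minimal sketch of its core idea: a greedy search that perturbs fewer and fewer parameters as the iteration budget is used up. The function name, the perturbation factor r = 0.2 and the toy objective are illustrative assumptions, not the configuration used in the study.

```python
import math
import random


def dds_maximize(objective, bounds, max_iter=500, r=0.2, seed=None):
    """Greedy dynamically dimensioned search that maximizes `objective`.

    bounds   : list of (low, high) tuples, one per parameter
    max_iter : number of candidate evaluations (must be >= 2)
    r        : perturbation size as a fraction of each parameter range
    """
    if max_iter < 2:
        raise ValueError("max_iter must be at least 2")
    rng = random.Random(seed)
    n = len(bounds)
    x_best = [rng.uniform(lo, hi) for lo, hi in bounds]
    f_best = objective(x_best)
    for i in range(1, max_iter + 1):
        # Probability of perturbing each parameter shrinks from 1 to 0.
        p = 1.0 - math.log(i) / math.log(max_iter)
        selected = [j for j in range(n) if rng.random() < p]
        if not selected:
            selected = [rng.randrange(n)]  # always perturb at least one
        x_new = list(x_best)
        for j in selected:
            lo, hi = bounds[j]
            x_j = x_best[j] + r * (hi - lo) * rng.gauss(0.0, 1.0)
            # Reflect at the bounds; fall back to the bound if still outside.
            if x_j < lo:
                x_j = lo + (lo - x_j)
                if x_j > hi:
                    x_j = lo
            elif x_j > hi:
                x_j = hi - (x_j - hi)
                if x_j < lo:
                    x_j = hi
            x_new[j] = x_j
        f_new = objective(x_new)
        if f_new >= f_best:  # greedy acceptance
            x_best, f_best = x_new, f_new
    return x_best, f_best


def toy_objective(x):
    """Hypothetical stand-in for a KGE-like score with its optimum at (0.3, 25)."""
    return -((x[0] - 0.3) ** 2 + ((x[1] - 25.0) / 49.0) ** 2)


if __name__ == "__main__":
    # Second bound mimics the 1-50 m crest width range from the text.
    print(dds_maximize(toy_objective, [(0.0, 1.0), (1.0, 50.0)], seed=0))
```

Because each iteration needs only one model run, a search of this kind stays cheap when a fixed evaluation budget is set, which is consistent with the short run times reported above.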

For the benchmark scenario, sixteen and six parameters were calibrated for the HBV-EC and GR4J models, respectively (Tables 3 and 4). For the reservoir scenario, the calibration was repeated with an extra calibration parameter that represents the unknown crest width. The range for this parameter was 1-50 m. This extra parameter provides an extra degree of freedom that could lead to higher model performance,

Model performance was assessed using the KGE. Its components were also assessed to determine the main cause of the difference in performance. These components are the linear correlation coefficient (r), the bias (β) and the variability (α). All three are optimal at 1; r is always lower than or equal to 1, while α and β can also be higher. The components all have equal weights in the KGE, as shown in
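For reference, the KGE in its widely used original form is KGE = 1 - sqrt((r - 1)^2 + (α - 1)^2 + (β - 1)^2), with α the ratio of simulated to observed standard deviation and β the ratio of simulated to observed mean. The sketch below computes it under the assumption that this is the variant used here; the excerpt does not show the exact equation, and the function names are hypothetical.

```python
import math
from statistics import mean, pstdev


def kge_components(sim, obs):
    """Return (r, alpha, beta) for paired simulated and observed streamflow.

    r     : linear correlation coefficient
    alpha : variability ratio, std(sim) / std(obs)
    beta  : bias ratio, mean(sim) / mean(obs)
    """
    if len(sim) != len(obs) or len(sim) < 2:
        raise ValueError("series must have equal length of at least 2")
    mu_s, mu_o = mean(sim), mean(obs)
    sd_s, sd_o = pstdev(sim), pstdev(obs)
    if sd_s == 0 or sd_o == 0 or mu_o == 0:
        raise ValueError("KGE is undefined for constant series or zero observed mean")
    cov = sum((s - mu_s) * (o - mu_o) for s, o in zip(sim, obs)) / len(sim)
    return cov / (sd_s * sd_o), sd_s / sd_o, mu_s / mu_o


def kge(sim, obs):
    """Kling-Gupta Efficiency with equal weights on r, alpha and beta."""
    r, alpha, beta = kge_components(sim, obs)
    return 1.0 - math.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)


if __name__ == "__main__":
    observed = [1.0, 3.0, 2.0, 5.0, 4.0]
    simulated = [0.5 * q for q in observed]  # perfectly timed, but half the volume
    print(kge_components(simulated, observed), kge(simulated, observed))
```

In this example the timing is perfect (r = 1) while both α and β equal 0.5, so the KGE drops to about 0.29. Inspecting the components separately shows that the loss comes entirely from underestimated mean and variability.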

Model performance analysis
The change in KGE between the scenarios was assessed with a paired-samples t-test. This showed whether including reservoirs increased the model performance significantly across all 403 catchments together. We also assessed whether catchment characteristics influenced these results. These catchment characteristics include seasonality, asynchronicity, land use, catchment area, total reservoir capacity, total relative reservoir capacity, latitude and longitude (Table 5). Aridity is here defined as the ratio of mean evaporation to mean
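The paired-samples t-test mentioned above works on the per-catchment differences in KGE between the two scenarios. The sketch below computes the t statistic from those differences; the two-sided p-value uses a normal approximation to the t distribution, which is an assumption that is reasonable for a sample as large as 403 catchments but poor for small samples. Names and example values are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_t_test(kge_reservoir, kge_benchmark):
    """Paired-samples t statistic with an approximate two-sided p-value.

    Each position in the two lists must refer to the same catchment.
    The p-value relies on a normal approximation (large-sample assumption).
    """
    if len(kge_reservoir) != len(kge_benchmark) or len(kge_reservoir) < 2:
        raise ValueError("need two paired samples of equal length >= 2")
    diffs = [a - b for a, b in zip(kge_reservoir, kge_benchmark)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("differences are constant; t statistic undefined")
    t_stat = mean(diffs) / (sd / math.sqrt(len(diffs)))
    p_approx = 2.0 * (1.0 - NormalDist().cdf(abs(t_stat)))
    return t_stat, p_approx


if __name__ == "__main__":
    # Made-up KGE values for four catchments, for illustration only.
    benchmark = [0.1, 0.2, 0.3, 0.4]
    reservoir = [0.3, 0.5, 0.4, 0.8]
    print(paired_t_test(reservoir, benchmark))
```

A positive t statistic indicates a higher mean KGE in the reservoir scenario. For an exact p-value with the proper degrees of freedom, a statistics library routine for the paired t-test would be the natural replacement for the approximation used here.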

Visual inspection of hydrographs of ten random catchments revealed that the simulated benchmark streamflow often had higher, narrower peaks and lower base flows than the observed streamflow. Examples

0.65. The values of β are below 1 for over 80% of the catchments for both scenarios, which means that the simulated mean streamflow is generally underestimated.

An advantage of working with a large sample of catchments is that the results can be linked to catchment characteristics. To look into spatial differences, Figure 5 shows the KGE values at the outlet of each catchment. However, no clear spatial pattern was observed. The catchment classes described in Section 2.2.3 and Table 5 were investigated to see whether differences in model performance could be found based on several catchment characteristics. Most classes show the same general trend that the KGE was significantly higher for the reservoir scenario (Table 6). The only class that did not result in a significant improvement was the class with the smallest relative reservoir capacity. This makes sense, since the difference between both scenarios is the addition of the reservoirs, and a smaller reservoir thus leads to

The largest increase in KGE between the scenarios is seen for the catchments with the largest total reservoir capacity (a mean increase of 0.37) and relative reservoir capacity (a mean increase of 0.33) (see Table 2). This is depicted in Figure 3. The benchmark scenario performance decreases with relatively larger reservoir capacities, while the reservoir scenario performance increases. However, for both total and relative reservoir capacity, the middle classes have a higher mean KGE for both scenarios compared to the class with the largest (relative) reservoir capacity (see Table 2). There are two potential explanations. Firstly, the more arid the region is, the more water needs to be stored to maintain water supply. HBV-EC has difficulties simulating arid conditions (see the relatively poor performance in arid regions, Table 2), while these are also the catchments that profit most from including reservoirs in the model structure. Besides, the semi-arid

The overall model performance achieved with HBV-EC is low but increases when the reservoirs are included.
The achieved model performance for both scenarios using GR4J is shown in Figure 6. On average, the

Also for the GR4J results, the model performance for both scenarios was linked to several catchment characteristics (Table 7). Again, the total reservoir capacity and relative reservoir capacity appear as relevant characteristics to explain the differences between the scenarios (see Table 2 and Figure 6). The difference in mean KGE between both scenarios is highest for the classes with the largest relative and absolute reservoir capacity. However, in contrast to the results achieved with HBV-EC, the reservoir scenario in this case leads to lower model performance.

The reservoir scenario does not result in improved model performance, and for some specific characteristics even results in a (slightly) lower performance. The overall model performance for both scenarios is lower and decreases most when including reservoirs in the catchments with a (relatively) larger total reservoir capacity. This can indicate that the way in which the reservoirs were included in this study is not appropriate given the GR4J model structure. However, as was also seen for HBV-EC, the model performance for GR4J is low in highly arid regions, and this might also explain some of these results, since arid regions are known to have a high density of smaller reservoirs, leading to cascading effects not accounted for in this study.

Whereas GR4J was able to achieve a higher overall accuracy than HBV-EC, increasing the realism by including the reservoirs did not lead to an improvement in accuracy. Therefore, it remains unclear whether we were able to improve the fidelity of the model.

The differences between the performance of the two models can be observed by comparing Figures 3 and 6. As an overview of the main differences between the results, Figure 7 shows the KGEs for both models with different colors for the relative reservoir capacity classes. Overall, GR4J performs significantly (p < 0.05) better than HBV-EC, both with and without reservoirs included. The differences are smaller for the reservoir scenario. For some catchment characteristic classes, the HBV-EC reservoir scenario performance is better than the GR4J performance, but this is never significant.

The most interesting results are found for the relative reservoir capacity classes. For the scenario with reservoirs included, the difference between the performance of the two models is largest for the class with the smallest relative reservoir capacity, with GR4J performing better. However, the class with the largest relative reservoir capacity shows one of the largest differences between the two models, in favor of HBV-EC. The mean KGE of this class is slightly (but not significantly) higher for HBV-EC than for GR4J. This is visible in Figure 7b, where the points for the catchments with a relative reservoir capacity > 20% lie around the 1:1 line. Although no clear conclusions can be drawn from this, it suggests that with a larger relative total reservoir capacity, the reservoir scenario of HBV-EC might work better than GR4J. Possible reasons for these results are discussed below. Model structure, parameters and results of other studies in which these models were employed are considered.

The models have a different structure and a different number of parameters. HBV-EC has a more complex model structure, including more processes. One of these processes is related to snow, but this is assumed to be negligible because of the low amounts of snowfall in the catchments. In addition, canopy is explicitly included in the HBV-EC model, which can lead to different evaporation patterns. Specific to GR4J is the groundwater exchange term, which can be a source or sink of water. The flexibility of this model to drain water to the groundwater or to obtain water from seepage helps to close the water balance. Especially when the forcing and streamflow observations do not have a closed balance, this term can compensate for issues in the input and calibration data. This might explain why GR4J was able to achieve higher model performance than HBV-EC. A thorough evaluation of the quality of the data in the CAMELS-BR basins could confirm this.

The more complex HBV-EC model also has more parameters, 16 compared to 6 for GR4J. It might be expected that a more complex model has a better performance, but this also depends on the availability of data. With lower data availability, less complex models are likely to perform better (Grayson and Blöschl, 2001). Nevertheless, the increase in information by including the reservoir may be handled better by this more complex model.