A Regional Semi-Distributed Streamflow Model Using Deep Learning

Abstract: Recent studies have shown that deep learning models in hydrological applications significantly improved streamflow predictions at multiple stations compared to traditional machine learning approaches. However, most studies lack generalization; i.e. researchers are training separate models for each location. The spatial and temporal generalization ability of deep learning models in hydrology that can be gained by training a single model for multiple stations is evaluated in this study. We developed a generalized model with a multi-site structure for hourly streamflow hindcasts on 125 USGS gauges. Considering watershed-scale features including drainage area, time of concentration, slope, and soil types, the proposed models have acceptable performance and slightly higher median NSE value than training individual models for each USGS station. Furthermore, we showed that the trained generalized model can be applied to any new gauge in the state of Iowa that was not used in the training set with acceptable accuracy. This study demonstrates the potential of deep learning studies in hydrology where more domain knowledge and physical features can support further generalization.


Introduction
Increases in streamflow due to heavy precipitation is the main driving force of flooding, which has become an ongoing concern in Iowa over the past three decades. In particular, the severe flooding in 2008(e.g. Smith et al., 2012Krajewski and Mantilla, 2010) gave impetus to many new developments. In 2009 the state legislators established the Iowa Flood Center (Krajewski et al. 2017), that in turn has made significant improvements in streamflow monitoring (Kruger et al. 2016;Quintero et al. 2021), development of flood inundation maps (Gilles et al. 2011), streamflow forecasting (Krajewski et al. 2017;Quintero et al. 2020a,b), and communication and outreach to the public (e.g. Demir and Krajewski, 2012). Additional relevant advances are documented in Seo and Krajewski (2020), Ha et al. (2020), Ghimire et al. (2018), and Perez et al. (2018). Many other technological developments have been demonstrated as concepts or prototypes Demir and Szczepanek 2017;Demir 2019, Yildirim andXu et al., 2020).
These above developments provide a stage for exploring the latest methods of Artificial Intelligence to improve hydrologic modeling, analysis and forecasting (Agliamzanov et al., 2020;. While many hydrologic models have been developed over the past 50 years, the challenge of providing streamflow forecasts accurately, efficiently and everywhere at all times remains. Several studies have applied deep learning in water resources fields, including surface water quality (Hu et al., 2019;Zhou, 2020), streamflow forecasting (Feng et al., 2020;Li et al., 2020;Qian et al., 2020;Sarkar et al., 2020; Van et al., 2020;Yue et al., 2020), soil moisture , groundwater Yu et al., 2020), hydrometeorology Lee et al., 2020), and water management (Liu et al., 2019). Recent studies (Chang et al., 2015;Granata et al., 2016;Faruk, 2010;Sit and Demir, 2019) have shown that many machine learning and deep learning models could be valuable in streamflow forecasting.
However, these studies have not emphasized the inclusion of physical watershed characteristics in the models. Recent studies (Chu et al., 2020;Kratzert et al., 2018;Ni et al., 2020; have applied deep learning models to multiple watersheds while using a separate model for each watershed or gauge. While developing one model for each watershed improves accuracy of prediction, it is inefficient and difficult to apply on a larger scale. Kratzert et al. (2018) tested deep learning models trained on multiple gauges in a hydrologic unit; this study concluded that a generalized model trained on multiple gauges may perform better than training one model for each watershed if there is a strong correlation in their discharge behavior.
On the other hand, regional modeling has been thoroughly studied in hydrology. As shown in the review by Razavi and Coulibaly (2013), six different approaches can be used to predict ungauged basins, including the spatial distance approach, physical similarity approach, hydrological similarity approach, arithmetic mean method, scaling relationships, and linear and nonlinear regression-based methods. More recently, machine learning studies have applied classic models such as support vector machine (Yahya et al., 2019) and random forests (Prieto et al., 2019) to ungauged watersheds using various regionalization methods.
Each of these methods has limitations, thus, it remains an important challenge in hydrology (Sivapalan et al., 2003). As proposed by Kokkonen et al. (2003), the generalized model requires incorporating the physical features of the watershed, which can be categorized in two groups: (1) physiographic information (i.e., watershed area, and slope); and (2) meteorological features (i.e., mean rainfall, temperature) (Razavi and Coulibaly, 2013). These features can easily be included as input and applied to machine learning and deep learning models. Kratzert et al. (2019) designed a new deep learning layer that allows for learning watershed similarities as a feature layer and trained a single model on 531 basins in the 4 CAMELS dataset (Addor et al., 2017). The results outperformed the models trained on a single basin. This confirmed that machine learning models can learn regional hydrological behaviors. This provides the impetus to develop a single generalized model for all watersheds rather than training one model for each watershed. However, there are several limitations in such generalized approaches. The first is the requirement for a large amount of data that should be used at the community level for model benchmarks (Ebert-Uphoff et al., 2017).
Some watershed features (i.e., depth to bedrock) may not be available in other datasets. In addition, the selected watersheds in Kratzert's study are less than 2,000 km 2 . However, many USGS gauges (i.e., 51 out of 125 selected USGS streamflow gauges in Iowa) are monitoring watersheds larger than 2,000 km 2 . The behavior of streamflow fluctuations in small and large watersheds are different in terms of the time scale of the variability. Predicting the value of streamflow in a large (10,000 km 2 ) basin in the next hour is an easy task, most likely it's the same value that is being currently observed, even if it's raining, on the other hand, predicting the value of streamflow in a small (10 km 2 ) basin in the next hour while it's still raining can be a major challenge because the rainfall-runoff processes occur on the same time scale of the streamflow fluctuations. These limitations could restrict the applicability of the regional learning approach using deep learning.  proposed a deep learning based semi-distributed hydrological model named the Neural Runoff Model (NRM) to predict the next 120 hours on gauged watersheds; using this approach 125 NRM models have been trained on 125 USGS gauges in Iowa with better performance than traditional machine learning models.  averaged prediction performance indexes for fix lead times (e.g. 5-hours), however that can lead to misinterpretation of the actual performance of the model. The mix of how many large watersheds vs. small watersheds can influence the value of the average. This is a limitation on evaluation that is present in other studies as well (Kratzert et al., 2018;Kratzert et al., 2019). We address this limitation and others in this study.
In particular, in this study, the regional generalization is considered in the deep learning-based streamflow modeling. We create a generalized streamflow simulation model for different watersheds using physical features with standard deep learning architectures, to illustrate how the generalized model performs compared to the one model for each watershed approach in streamflow hindcasting across scales. In addition, we tested the generalized model on additional watersheds outside of the training dataset (e.g., ungauged watersheds) to evaluate the transferability of the final models. Our study is limited to hindcasting, where "future rainfall" is known perfectly. Future work, on forecasting, and other issues are also outlined in the conclusions of this work.

NRM-Generalized and NRM-Generalized-Distributed
We investigate two generalized model architectures, the NRM-Generalized and the NRM-Generalized-Distributed. These two model architectures were developed and improved upon based on previous work of NRM based on an encoder-decoder structured recurrent neural network . NRM-Generalized is a lumped model that uses watershedscale time-series data and physical features as input to simultaneously predict streamflow in all the catchments. The structure of NRM-Generalized proposed in this study is shown in Fig. 1. This regional generalized model includes physical characteristics such as watershed area, soil type, slope, etc., which are not considered in the original NRM because each model is trained independently and thus watershed features are constant for each NRM. However, these features are needed to characterize different watersheds for the development of the generalized model for multiple watersheds. 6 The NRM-Generalized model considered four physical features for each watershed: the average slope, area, time of concentration (Tc) based on the longest path, and the proportion of the 12 soil types in the watersheds. All these physical features are assumed to be constant over time. These data, combined with hourly precipitation (Seo et al., 2019), hourly streamflow, and monthly evapotranspiration, are used as the input in this study. All these data are available from public sources; details, including pretreatment of these data, are shown in Table 1 (1) where: Y , is the observation at the time step i and sample instance j; Ŷ , is the model result at the time step i and sample instance j; Ȳ is the mean of all observations among time steps in one sample instance; m is the total number of the sample instances; and n is the total number of time steps. In this study, the n prediction timestep setting is 120; m is the batch size setting of 32.
To better model the streamflow, we applied a semi-distributed structure based on the NRM-Generalized and named this structure NRM-Generalized-Distributed. NRM-Generalized-Distributed is a semi-distributed model and considers each gauged basin as a sub-basin for large watersheds, and all the time-series data and physical features are based on the sub-basin only. In the previous study, NRM-Distributed models (with semi-distributed model structure) were trained one by one, sequentially from upstream to downstream.
However, this is impossible in a generalized model because all watersheds must be trained simultaneously in one model. Thus, the upstream-downstream relationship is constructed, and it serves as the input in the training. Table 2

Model Settings and Evaluation
Using the upstream-downstream relation, two separate datasets were created with 18 and 19 input features as shown in Table 2 respectively. We used the training and validation datasets to calibrate the models. The test dataset was used for the final evaluation. All analyses in section 3 results are based on the test dataset. The statistics of the features among these watersheds are shown in Table 3.
After removing missing data for 125 USGS streamflow gauges, a total of 2,808,832 instances remained in the training dataset; 1,552,150 instances in the validation dataset; and 866,664 instances in the test dataset. These datasets were split by the gauges based on the tests. This study used a small batch size of 32 that randomly trained 32 instances at a time.
The optimizer is RMSprop, which is known to perform well on time-series models. Because there were 12 more inputs from 4 features, a small learning rate of 0.00003 was used for the optimizer. All other settings in this study were the same as were used in the NRM model to maintain the consistency for our comparison. All these settings were implemented by the Keras library in Python3 with the Tensorflow backend using NVIDIA Titan V GPU.
where: Y is the observation at the time i; Ŷ is the model result at the time i; Ȳ is the mean of all observations; n is the total number of observations; r is the Pearson correlation coefficient; σ is the standard deviation; and μ is the mean.
The goal in this study is to simulate the hourly streamflow at multiple watersheds using one model for multiple hours; it is obvious that longer lead times are harder to predict. Although it may seem fair to compare the prediction skill for a fix lead time, as previous studies did , the prediction performance of different watersheds was significantly affected by their size and concentration time, and the statistics based on lead time. It has been demonstrated in some studies that the prediction skill of temporal persistence as a benchmark method is sensitive to the lead time and concentration time, and the temporal persistence offers good performance especially for larger basins . Especially in the streamflow modeling task in this study, the prediction of a small watershed with short concentration time was much harder than the prediction of a large watershed with a longer concentration time.

Case Studies
We carried out four tests shown in Table 4 to develop and evaluate the regional generalized models. The first test used 62 independent upstream gauges, the most upstream stations in Iowa, to develop the generalized model. Although a total of 125 USGS streamflow gauges exist in Iowa, as shown in Fig. 2 In the last test, we implemented NRM-Generalized-Distributed approach to two new regional models for 63 downstream watersheds and all watersheds with 125 USGS gauges in Iowa with and without distributed structure. As shown in Table 4, we trained four models in this test. The downstream models were trained on 63 gauges; the overall models were trained on 125 gauges. Furthermore, we applied the distributed structure to the generalized models.
In the distributed models, the observed upstream streamflow in the future were used in the training, and the model output of upstream gauges were used in the test for evaluation. In the end, we evaluated our best models for 125 watersheds in this test for our final evaluation and visualization.

Performance Benchmarks
One reference used in this study is temporal persistence (or autocorrelation) in time. As a benchmark in hydrology, persistence forecast is also known as a hard to beat model for large 14 watersheds at short to medium forecast range. Many studies suggest that persistence can be an alternative for evaluating future estimates of streamflow at short to medium range Krajewski et al. 2020).
Because this study aims to explore potential generalization ability, its main benchmarks are the watershed-based model NRM and NRM-Distributed

Results and Discussions
In this study, we reviewed two deep learning models, NRM-Generalized and NRM-Generalized-Distributed, and applied these models to different watersheds as shown in Table   4.

NRM-Generalized on Upstream Watersheds
Sixty-two (62) independent upstream Iowa watersheds with USGS streamflow gauge data in the study years are selected. The median values for the statistical metrics from 62 upstream stations of are shown in Figure 3. For upstream watersheds, the fraction of lead time over concentration time ranges from 0.063 (or 10 -2.2 ) to 4 (or 10 0.6 ) in log scale at a step of 10 0.2 .
The longest hindcast window in our study is 120 hours with lead times ranging from 1 to 120 hours, the lead time over concentration time varies for different watersheds. Most of these are less than four.
These results show that the increase in lead time decreased the performance of all three benchmark models for both NSE and KGE. The temporal persistence model shown by the black line (Fig. 3)  Our NRM-Generalized model on 62 upstream watersheds is shown in light blue (Fig.   3). Compared to the NRM and NRM-Generalized-Baseline, the NRM-Generalized model considered additional features including the area, time of concentration, average slope, and soil composition ratios for each watershed. We found that our generalized model had a better performance for 62 watersheds than NRM's 62 individual models did, based on the NSE. For the KGE, we found that the NRM-Generalized model had a slightly poorer performance for short-term hindcast but better performance for the long-term hindcast and overall performance than NRM. Overall, it is obvious that NRM-Generalized increased the model's performance significantly compared to NRM-Generalized-Baseline. This test successfully applied the generalized model on 62 independent watersheds in Iowa with a better median performance than 62 individual models. This is parallel to the results by Kratzert et al. (2018), which showed that a generalized model may perform better on multiple watersheds when those watersheds are similar. The success of NRM-Generalized also indicates that the rainfallrunoff behavior of watersheds can be effectively characterized by the features of slope, area, time of concentration, and soil types.

Spatial Generalization Test
In this test, the generalized model and baselines shown in Figure  It is indicated from the results that the NRM-Generalized is an efficient regional model with acceptable spatial generalization ability that can be applied to untrained watersheds, both external independent watersheds and extended downstream watersheds, with a limited decrease in model performance. However, it needs to be noted that, when modeling these watersheds, the past 72 hours of streamflow data are still used for data assimilation.
Therefore, this study cannot be applied to an ungauged watershed; however, it provides a simulation method for newly gauged watersheds. For example, in Iowa, Iowa Flood Center (IFC) has deployed nearly 300 stream level sensors in the past several years to measure the stream stage (Krajewski et al., 2017) and developed synthetic rating curves for those 5 locations (Quintero et al., 2021). The generalized model proposed in this study can be applied to make simulations in these areas without relying on decades of historical data.

Ablation Study
We conducted an ablation study to learn whether all four watershed features are necessary by removing one feature at a time. As shown in     showing that in large watersheds with higher baseflow yields, it is more likely to get a better persistence result. The NRM-Distributed-Baseline does not include any of these input features in the modeling, and it has the r 2 higher than 0.1 for most of the The NRM-Distributed model proposed in this study shows correlations lower than 0.1 between model efficiency and most of features, which further demonstrates that it is a wellgeneralized model that does not show significant bias toward any watershed topological feature. Ablation tests 1, 3, and 4 show poorer results than NRM-Distributed with more than one values higher than 0.1. We also found that the deep learning models are relatively robust against the input features, and removing any single feature in this test will cause no certain bias. For example, ablation 2 does not consider the watershed area feature, but it still not biased on areas.
In an earlier study, Kratzert et al. (2019) suggested that meteorological data (e.g., mean precipitation and aridity) are the most important factor in the sensitivity analysis of their hydrologic model. This is possible because baseline models such as persistence and NRM-Distributed-Baseline have a high correlation between meteorological data (e.g., mean precipitation Pmean and mean streamflow Qmean) and model performance. However, our models do not show bias for mean precipitation. We suggest that the hydrological behavior is more related to topological features (e.g., area, Tc, soil types, and slope) in the streamflow modeling. Watershed-scale meteorological data may not have a significant impact because the lead time is short and the precipitation data are already input features, which can be captured by deep learning models.   0

NRM-Generalized-Distributed on Downstream and All Watersheds
In addition to 62 upstream gauges, there are 63 USGS gauges located at the downstream of the gauges. We applied the NRM-Generalized model on 63 downstream gauges in this test first. The NSE results of different models are shown in Figure 5. For downstream watersheds, the figure shows the fraction of lead time over concentration time ranges from 0.004 to 1 in log scale. The reason why the maximum scale is 1 in Figure 5 rather than 4 in Figure 3 is the downstream watersheds normally have larger concentration time and their median concentration time is 99 hours as shown in Table 3.
When the fraction of lead time over concentration time is 1, the median NSE is 0.826 in the NRM-Generalized model training on 63 downstream watersheds. As shown in Figure   5, results of NRM-Generalized are better than NRM in both NSE and KGE at all lead hours, which is even better than the results on 62 upstream watersheds. The main difference here is these downstream watersheds are not independent and many of them are sharing the same areas. The performance increase from NRM to NRM-Generalized can be explained by the information among different watersheds that are well utilized since they have higher similarities.
After  In this test, we developed two models, with and without distributed structure, to include all 125 watersheds in one model as our final tests. The results are shown in Figure 6.
It shows that the distributed model promoted the model performance as well. This is expected since our studies simplified three-dimensional spatial-temporal precipitation data into onedimensional temporal data by spatial averaging for each watershed. As shown in Table 3, the upstream and downstream watershed size varies significantly. The distributed structure considers the sub-watersheds only, which significantly decreased the median watersheds' sizes from 1,675 km 2 to 947 km 2 , and the median travel time from 53 hours to 26 hours in the modeling. Thus, the models with distributed structures can reduce the errors caused by the uneven spatial distribution of precipitation since the sub-watershed boundaries are used.
In contrast, without a distributed structure, the variances of slope and soil types may be neglected when the watershed sizes are too large. Since distributed models have a much smaller area and time of concentration, the features of the slope and soil types can characterize these sub-watersheds more accurately and improve the model performance.
When we compare the results of NRM-Generalized (upstream) + NRM-Generalized (downstream) and NRM-Generalized (125 all), it indicates that two regional models could be better than one model when some watershed features (i.e., area) are highly heterogeneous in the dataset. And we concluded that the regional generalized models can provide better results only when all features are effectively pre-treated and in a reasonable variance. Thus, we used our best model, NRM-Generalized for upstream and NRM-Generalized-Distributed for downstream, in our further analysis. Figure 7 shows the 120-hr lead time simulation NSE and KGE distribution against the concentration time, area, and slope among 125 watersheds in the test year. We only compared the best models with and without the generalization. Thus, we are comparing the overall performances of the deep learning model in Iowa using 125 watershed models and 2 generalized models. For the 120-hr lead time simulations, the overall median NSE is 0.742 and 0.776 in NRM and NRM-Generalized models, respectively, using distributed structures.

Overview of Generalized Models
This corresponds to the results in previous sections. Although the median NSE of the generalized model is better than the individual models trained on each watershed, our generalized model has lower performance at both NSE and KGE metrics for smaller watersheds with less concentration time.  In addition, the watersheds with the largest and smallest mean slopes have significantly reduced performance in the generalized model as well. This indicates that the generalized model may not perform as well on extreme cases as training one model for each watershed.
This is a common issue in the model generalization because these watersheds are underrepresented in the dataset. This indicates that the proposed generalized model could be more stable on ordinary watersheds for an average situation rather than special watersheds with extreme features.

Conclusion
In this study, we developed two generalized deep learning-based rainfall-runoff models, NRM-Generalized and NRM-Generalized-Simulcast. The models are designed to predict the next 120 hours of streamflow using observed rainfall, a "perfect" rainfall forecast, observed In the ablation study, we confirmed that all the features included in the generalized model are contributing to the performance. The correlation analysis shows that including these four watershed features helps to reduce the bias among these features compared to the NRM-Generalized-Baseline model, which has no watershed feature.
After introducing the distributed model structure, we tested the NRM-Generalized-Distributed model on 63 downstream watersheds. Results show that our proposed distributed generalized model is effective and has a limited sacrifice of model performance compared to the baselines by execution sequentially for each watershed based on the stream network. We also compared the performance of the NRM-Generalized-Distributed as a single model on all 125 watersheds to the results from two regional models, NRM-Generalization (62 upstream) and NRM-Generalization (63 downstream). Two regional models outperformed the single model. This might be due to some watershed features (i.e., area) being highly heterogeneous in the dataset. We concluded that the regional generalized models can provide better results only when all features are effectively pre-treated and in a reasonable variance.
In the final overview, this study shows that our generalized model could be more stable on ordinary watersheds at an average situation rather than special watersheds with extreme features. Our generalized model may also underperform on small watersheds and at low flows. Our results also emphasize the importance of accurate precipitation forecast input for the deep learning models for operational real-time applications. These are some of the limitations for the generalized model using deep learning.
This study successfully developed a generalized model, NRM-Generalized-Distributed, using a standard deep learning library with watershed features of the area, time of concentration, average slope, and proportions of 12 soil types. Compared to our previous best performing distributed model (NRM-Distributed), which trained one model for each watershed, the proposed generalized model shows a better performance on the median of 125 USGS streamflow gauges at the lead time of 120 hours with training only two regional models.
Researchers of deep learning in hydrology and water resources have increased significantly in the past several years , but there are limited considerations on physical features and processes. As shown in this study, deep learning models considering watershed-scale physical features and semi-distributed structures can work on multiple watersheds using regional models with limited sacrifice of accuracy. We reiterate our call to deep learning studies in hydrology to include more domain knowledge and physical features in the future.  watersheds, NRM-Generalized for upstream watersheds and NRM-Generalized-Distributed for downstream watersheds, at all lead times with 1, 2, 3, 4, and 5 days precipitation as input.
The temporal persistence is used as a reference.
All the results in this study focus on five-day predictions using accurate hourly precipitations. Since current meteorological forecasts can only provide accurate short-range hourly precipitation predictions (i.e., HRRRv3 (High-Resolution Rapid Refresh) version 3), an hourly precipitation forecast product can provide at most a 36-hour precipitation forecast.
Our study tested the modeling performance under limited precipitation in a potential real-time application of our models. As shown by the purple lines in Fig. A2, our model has a stable performance in all metrics for five days of rainfall because it assumes that the next 120 hours of accurate precipitation data are known. By assuming the hourly precipitation is 0 after 24, 48, 72, and 96 hours, we have four additional models that only know the future rainfall for 4, 3, 2, and 1 days respectively; the results are shown in Fig. A2. Since the long-term hourly precipitation data is set to 0 after different timestamps in these tests, the long-range model accuracies after the timestamp decrease dramatically in watersheds 1 and 3. This is reasonable because precipitation is the key driving force of the streamflow; these two stations have a very small concentration time. Watersheds 2 and 4 are relatively large, with concentration times of 101 hours and 148 hours, respectively. The decrease in model performance is very limited compared to stations 1 and 3 in our predictions at most 120 hours. It is also obvious that the model accuracy curves overlap if the rainfall is known. The model performance only decreases after the precipitation data sets to zero because our model is designed to predict streamflow hour by hour in the NRM architecture. This indicates that the deep learning model using time series architectures is reliable for short-time predictions even with limited accurate long-term precipitation data.