Fully distributed rainfall-runoff modeling using spatial-temporal graph neural network

Recent studies using latest deep learning algorithms such as LSTM (Long Short-Term Memory) have shown great promise in time-series modeling. There are many studies focusing on the watershed-scale rainfall-runoff modeling or streamflow forecasting, often considering a single watershed with limited generalization capabilities. To improve the model performance, several studies explored an integrated approach by decomposing a large watershed into multiple sub-watersheds with semi-distributed structure. In this study, we propose an innovative physics-informed fully-distributed rainfall-runoff model, NRM-Graph (Neural Runoff Model-Graph), using Graph Neural Networks (GNN) to make full use of spatial information including the flow direction and geographic data. Specifically, we applied a time-series model on each grid cell for its runoff production. The output of each grid cell is then aggregated by a GNN as the final runoff at the watershed outlet. The case study shows that our GNN based model successfully represents the spatial information in predictions. NRM-Graph network has shown less over-fitting and a significant improvement on the model performance compared to the baselines with spatial information. Our research further confirms the importance of spatially distributed hydrological information in rainfall-runoff modeling using deep learning, and we encourage researchers to incorporate more domain knowledge in modeling. This manuscript is an EarthArXiv preprint and has not been peer-reviewed before. Subsequent versions of this manuscript may have slightly different content.


Introduction
In the water cycle, rainfall and runoff processes are of great significance to the environment and ecology of the entire basin. Runoff occurs when precipitation reaches the ground, and the excess rainfall flows across the surface of the land and into nearby streams. The rainfallrunoff modeling is difficult due non-linear structure of the process which involves many interconnected variables (Sitterson et al., 2018). Modeling the rainfall-runoff process can be used to simulate peak flow, dynamics of water quality, and other systems related to the rainfall, and it helps to understand and monitor the quality and quantity of water resources (Boyle, 2000). Rainfall-runoff modeling is critical for flood forecasting studies which support communities in disaster mitigation (Xu et al., 2020;Yildirim and Demir, 2021) and decision making (Teague et al., 2021;Ewing and Demir, 2021).
Deep learning models have been used increasingly in hydrological studies for the last couple of years  including data augmentation , image synthesis (Gautam et al., 2020) and web-based modeling (Ramirez et al, 2021). Many studies (Izumi et al., 2016;He et al., 2019;Kim et al., 2020;Roy et al., 2021) have been using the basic DNN (deep neural networks) algorithms on the rainfall-runoff modeling for runoff and flood prediction. Since rainfall-runoff modeling is a time-series task, the deep learning model like LSTM (Long Short-Term Memory) was used for daily (Kratzert et al., 2018;Hu et al., 2018) and monthly (Yuan et al., 2018) modeling since 2018. The studies using LSTM have shown significant improvements compared to other machine learning and data-driven models, which inspired many researchers to explore the potential of deep learning models. Another type of advanced deep learning model designed for temporal modeling is TCN (temporal convolutional network). With gated TCN layers, Deep Mind researchers developed a model named WaveNet for the speech recognition of the voice wave. And recent studies have shown that TCN models work well on rainfall-runoff modeling as well (Lin et al., 2020;Van et al., 2020;Xu et al., 2021). However, these studies developed lumped models by considering each watershed as a unit, and all available geospatial information is not used.
Several studies have integrated geospatial information into deep learning. Asadi et al., (2019) proposed an ANN (artificial neural network) model including the feature of hydrological connectivity index, which effectively characterized the topography and other catchment characteristics. Other studies proposed deep learning models using LSTM  and Graph Neural Network (GNN) (Zhao et al., 2020) considering the upstreamdownstream relationships between watersheds in streamflow forecast projects. Considering the watershed relationships, large watersheds were decomposed into multiple sub-watersheds, and the model results from upstream sub-watersheds can be used in the modeling of downstream sub-watersheds. In recent studies (Kratzert et al., 2019;Kratzert et al., 2018) proposed deep learning models used LSTM considering watershed-scale spatial information including area, slope, land use, soil type on developing one regional model on multiple watersheds. In these studies, some spatial information is used, and the results show that geospatial information is critical for predictions in ungauged basins and developing regional models. However, in these studies, each sub-watershed with runoff observation is still considered as a point, and the spatial information in each sub-watershed is still not fully considered. For example, the PRISM and NCEP/EMC Stage IV are providing the daily and hourly precipitation data at 4km grids. When modeling watersheds with hundreds of square This manuscript is an EarthArXiv preprint and has not been peer-reviewed before.
Subsequent versions of this manuscript may have slightly different content.
kilometers without internal monitoring with a stream gage, the watershed scale models simply taking the average precipitation of all the watershed areas, which ignored the spatial distribution of rainfall. The uneven spatial distribution of precipitation as it happens during rainfall events may cause significant errors in large watersheds compared to average precipitation. Convolutional Neural Network (CNN) is a deep learning model that can deal with data with spatial features. CNNs have been studied since 2012 (Krizhevsky et al., 2012), and are widely applied in many fields including face recognition, medical image analysis, and object detection (LeCun et al., 2015). A convolution kernel will work on each node or pixel and each node will be calculated with its neighbors with the same convolution operation. Thus, CNN can be used for the precipitation data from satellite or radar images. One of the popular methods in meteorology is to combine CNN with LSTM based approaches. Shi et al., (2015) proposed a deep learning model named ConvLSTM that takes convolution calculation for spatial features in each temporal-dimension calculation in the LSTM layer in the precipitation forecast. In hydrology, several studies (Barzegar et al., 2020;Barzegar et al., 2021;Ha et al., 2021;Anderson & Radic, 2021) have used the combination of CNN and LSTM for forecasting streamflow, water quality, and water levels. In detail, CNN module extracts the valid weather information at each time step for each watershed, and feeds the treated data into an LSTM module for the time-series modeling and forecast. However, these models can hardly work on hydrological modeling at high resolutions theoretically. The convolution kernel considers all the local neighbors of each node which might be an issue for hydrological systems. The streams or hillslopes can be spatially close to each other, however they might be split by drainage divide and not have an upstream-and-downstream relation.
To solve this issue, GNN is proposed to capture the dependency of graphs via message passing between the nodes of graphs (Zhou et al., 2018). The water flow direction map can be converted into a unidirectional directed graph, and we can design a neural network on this graph. There are limited studies in water resources that are using GNN. The first study (Zhao et al., 2020) is using a Graph Convolution Network (GCN) with LSTM for the streamflow forecast using connections between gauged sub-watersheds with considering each subwatershed as a point. Deng and Hooi (2021( is using GCN for anomaly detection in the water treatment plant, and the pipe networks are considered as a graph. Jia et al., (2021) proposed a model named PGRGrN using LSTM with GCN to model the river networks for the streamflow and temperature predictions. All these studies selected the default GCN as the GNN layer, and both streamflow studies are focusing on the streamflow connections and only considered the watersheds in a semi-distributed structure.
In this study, with high-resolution precipitation data, we proposed a GNN structure for the fully-distributed rainfall-runoff modeling to reduce the error from the uneven spatial distribution of precipitation in deep learning models. The following section describes the GNN model proposed in this study. Section 3 describes the case study, dataset, evaluation methods, and baselines. Section 4 describes the results and discussion. The conclusion provides a summary of the work and recommendations for future research. This manuscript is an EarthArXiv preprint and has not been peer-reviewed before.
Subsequent versions of this manuscript may have slightly different content. 4

Sequence Modeling Using Deep Learning in Rainfall-Runoff
Rainfall-runoff modeling is a time-series task. It is important to ensure the model performance at an acceptable range without consideration of spatial information. In this study, we used four state-of-art deep learning architectures for modeling the sequence relationships of precipitation data. The first architecture is the LSTM (Schmidhuber & Hochreiter, 1997). The LSTM layer provides a set of non-linear operations on sequence inputs. Its input is sequence data, such as time-series, and LSTM applies the same set of operations on each time-step recurrently. Thus, the LSTM assumes each time step as same, and it calculates every step from the beginning to the end of the sequence. The LSTM has been applied to hydrology studies for the first time in 2017 (Fang et al., 2017) and hundreds of studies have been published in the past years that stated LSTM as a state-of-art time-series model in rainfall-runoff or streamflow forecast tasks.
Each set of LSTM non-linear operations contains six calculations as shown in Equations 1 to 6. In Equation 1, forget gate parameter , decides how much of the previous cell state needs to be forgotten by a sigmoid function with a linear calculation on the current input and the previous result ℎ −1 . The linear equations in all steps have different weights (W) and biases (b) in each LSTM calculation step. In Equation 2, input gate is the new information to be learned, which is calculated by the sigmoid function with a linear relation on and ℎ −1 as well. In Equation 3, ̃ is the candidate of new cell state values, which was calculated by a tanh function with a linear relation on and ℎ −1 . In Equation 4, the cell state is updated by a linear operation. In Equation 5, the output parameter is calculated by the sigmoid function with a linear relation on and ℎ −1 . In Equation 6, the final output result at the current step, ℎ is calculated. In this study, a stacked LSTM is used, and the Figure 1a shows an example LSTM model with 3 layers.
The second architecture is the Gated Temporal Convolution Network (GTCN). It is the principal module in WaveNet, a state-of-art speech recognition model designed by DeepMind. Different to LSTM, GTCN is a convolutional model that applies a 1-D convolution kernel to each sequence window or reception field. For the time-series modeling, GTCN used a causal convolution to make sure there is no leakage from the future into the past. For reducing the model size on long sequences, the dilated convolution is used to increase the reception field. A residual block is also designed for multiple layers to improve the model performance. A gating mechanism similar to LSTM is used to allow GTCN to keep or forget information more efficiently (Kong et al., 2020). This manuscript is an EarthArXiv preprint and has not been peer-reviewed before.
Subsequent versions of this manuscript may have slightly different content.

5
Each set of the GTCN operation contains three equations as shown in Equation 7 to 9. In Equation 7, h is the candidate of new cell state values, and it is calculated from the 1-D convolution kernel W with a size of k and the input with a dilation factor d. The gate decides how much information from the last layer is going to be remembered using a sigmoid function. In the end, the product of and h is the final output of the timestep t. In this study, a stacked GTCN is used, and Figure 1b shows an example GTCN model for a maximum reception field of 12 using 3-layer GTCN with k=3 and d=1,2,4. This manuscript is an EarthArXiv preprint and has not been peer-reviewed before.
Subsequent versions of this manuscript may have slightly different content. 6 copy of the input sequence. And at the end of each bidirectional layer, the final output is a combination of the output from the forward layer and reversed layer as shown in Equation 10.
Bidirectional models have more parameters and can capture the information from both directions of the sequence (Schuster & Paliwal, 1997). Studies have shown significant improvement of model performance in many sequence learning fields including natural language process (Huang et al., 2015), speech recognition (Graves et al., 2013), and rainfallrunoff modeling (Kang et al., 2020;Lees et al., 2021) The advantage of the LSTM and BiLSTM is that the model parameters will not be affected by the sequence length, but they have a drawback that the computation speed will reduce when the sequence length increase since they need to be calculated sequentially. GTCN and BiGTCN naturally requires more layers to capture long-term information. Thus, the parameters of GTCN and BiGTCN will increase when the length is increased. We designed our LSTM model based on our previous studies that a 5-layer model with 32 neurons in each layer has the best performance with higher computational efficiency. For the hourly rainfall-runoff predictions, 72 hours is also proved to be efficient and adequate in our previous studies . We stacked GTCN and BiGTCN five times as well, and the designed structure is set with a maximum length of 72 as well to make them comparable. The values of k and d are set to be reasonable to have a reception field higher than 72 at the 5 th layer. Since the bidirectional layer doubled the size of its output, the hidden layers from the second layer have double input and double output. Thus, the number of parameters at each hidden layer is quadruple comparing to the unidirectional layer. In this study, we are proposing rainfall-runoff models using the rainfall up to 72 hours to predict the rainfall data of the next hour at each grid cell as shown in Equation 11. Runoff = (Precipitation , Precipitation −1 , . . . , Precipitation −71 ) (11)

Graph Representation of A Watershed
In hydrology, a watershed model can be lumped, semi-distributed, or fully distributed (Rathjens et al., 2015;Pina et al., 2016). A lumped model represents the watershed as a whole and no spatial information is taken into consideration. Semi-distributed models break down the watershed into sub-watersheds, and they lump meteorological variables and physical parameters into these regions. Fully distributed models decomposes the watershed into grid cells, and they are more realistic and physically-based since they maximizes the use of data and minimizes the calculation and the simplifications that applied at sub-watersheds (Pina et al., 2016). In this rainfall-runoff modeling study, we proposed our graph neural network with a fully distributed structure since the original high-resolution precipitation data are available in grids.
As an example, Figure 2a shows a fully distributed flow direction map of a watershed in an irregular shape. We cannot apply CNN models in this structure for two reasons. First, not all neighbors are physically meaningful. For example, the water from the precipitation at grid This manuscript is an EarthArXiv preprint and has not been peer-reviewed before.
Subsequent versions of this manuscript may have slightly different content. 7 cells a and b will only flow to its top left and will not contribute to the stream in the watershed in yellow. Thus, when calculating precipitation at cell c, its neighbors a and b should not be considered. Second, not all neighbors are equally valuable when they have the same positional relation. For example, the grid cell c is on the top of cell e, and the cell d is on the top of cell f. It is obvious from the water flow direction map that the patterns between cells c and e are different from the pattern between cells d and f. However, when applying a CNN model, it is assumed that their relationships are the same. Thus, a graph that represents a watershed can be converted from a water flow direction map. Figure 2b shows the converted graph G(V,E), where 8 nodes represent different land grid cells, and the edges represent different runoff channels. The flow direction map can be calculated using either d4 or d8 algorithm. d4 is a kernel method that routes flow to one of the four cardinal directions based on the steepest descent, and d8 considered eight surrounding directions (Shelef &Hilley 2013). In this study, we created the flow direction map in both ways. A graph is then generated by considering each grid cell as a node and each flow direction as an edge. In this study, we assume that each graph edge has the same distance in both d4 and d8 algorithms. The reason why we made this assumption is, the sequence models such as LSTM and GTCN only work on the data that has the same time step in the sequence.

Figure 2. Water Flow Directional Map
For a watershed, the geospatial information (i.e., drainage area, land use, soil type, etc.,) of the grid cell are natural attributes and can be used as the node features. Studies (Kratzert et al., 2019;Xiang et al., 2021) have shown that these features are effective in deep learning based modeling to represent different watersheds in regional models. Thus, one or more features from available geospatial information can be used in the graph networks based on the data availability.

A Spatial-Temporal GNN Model for Rainfall-Runoff Modeling
Based on the design principles of graph neural networks, we proposed an innovative physicsinformed rainfall-runoff model named Graph Neural Rainfall-Runoff Models (GNRRM). There are two key points in our design: k-hop with k value at the maximum distance, and the use of sequence model for spatial information.
In GNN studies, k-hop neighborhood means the neighborhood of radius k. Some traditional graph research tasks such as the social networks and traffic networks only consider This manuscript is an EarthArXiv preprint and has not been peer-reviewed before.
Subsequent versions of this manuscript may have slightly different content. 8 low order similarity, which means they normally take the 1-hop or less than 3-hop neighborhoods into consideration. This is reasonable in many graph network tasks since highorder neighborhoods are meaningless in these tasks. For example, according to the theory of six degrees of separation, anyone's 6-hop friend can be anyone on the earth, which is meaningless. In engineering studies such as the traffic predictions, studies (Cui et al., 2019) used at most 3-hop neighborhoods in their studies since the traffic congestion is mainly a short-term issue and it is very rare to have traffic congestion over 100 kilometers. However, in hydrology, precipitation that falls over land runs off into streams, which would affect downstream regions (Alexander et al., 2018). Thus, it is very critical to take all the rainfall in the watershed into account to make it physically meaningful. Thus, we use the k value at the maximum distance in our GNRRM. Second, we share the same equations and parameters on different hop neighborhoods. In most GNN models, including the most popular models Graph Convolutional Networks (Kipf & Welling, 2016) and GraphSAGE (Hamilton et al., 2017), the parameters on each hop neighborhood are designed to be different values for a stronger characterization capability. However, in hydrology, it is a natural way to consider each step between the neighborhoods is the same, just like many distributed hydrologic models. The design of our GNRRM is shown in Figure 3a, where the diagram is designed for the sample watershed shown in Figure 2. The calculation process of our designed GNRRM-TS (Temporal Spatial) is Sequence-Aggregation-Sequence. At first, we apply one sequence model on each node for the time-series data. The detailed design of this part is shown in Section 2.1. Then, we aggregated our data in nodes based on its distance to the outlet node using the graph method in Section 2.2. In detail, we create a simple hierarchical graph representation of the original graph based on the node distance. And the node aggregation operations can be sum for the area, weighted average for precipitation intensity, and mode for soil type. Finally, we applied the second sequence model on the hierarchical graph for spatial modeling. This spatial sequence model applied the same calculation between each node, which assumes the water flow between two nodes is the same. Each of the sequence model s listed in Section 2.1 can be used as the spatial sequence model.
Compared to our lumped model, the calculation of GNRRM shown is costly since it takes all nodes into account for the time-series modeling. To simplify the calculation and increase the speed, we simplified our model by aggregating the areas having the same distance to the outlet before using the sequence model. Thus, the calculation process of the second design is Aggregation-Sequence-Sequence. Theoretically, this simplification will make the calculation much faster by sacrificing some model accuracy on a spatial level. We named the simplified model as GNRRM-ST (Spatial Temporal).
To enhance non-linear modeling ability, two additional fully-connected layers (a.k.a. dense layer) are used in GNRRM-TS and GNRRM-ST. Right after each dense layer, a This manuscript is an EarthArXiv preprint and has not been peer-reviewed before.
Subsequent versions of this manuscript may have slightly different content. 9 dropout layer is used, which helps to avoid overfitting. In terms of hydrology, our spatialtemporal model is using the rainfall up to 72 hours for each grid cell to predict the rainfall data of the next hour as shown in Equation 12. . The sample input data for each node is shown in yellow. GNRRM-TS model applied the same temporal sequence model on each node first before the spatial aggregation. GNRRM-ST model applied the spatial aggregation first and then applied the same temporal sequence model on each aggregated nodes. For both GNRRM models, the spatial sequence model and several fully-distributed dense layers are used to generate the final output.

Case Study 3.1. Study Area
In this study, we implemented our GNRRM models on a watershed located in the eastern part of Iowa. The USGS gauge at the drainage outlet is located at latitude 42°09'51.6" and longitude 90°43'45.6", named "USGS 05418400 North Fork Maquoketa River near Fulton, IA" (USGS, 2021). According to traditional classification (Singh & Frevert 2010), it is a large watershed (over 1,000 km 2 ) with a drainage area of 1,308 km 2 . In addition, the rainfall This manuscript is an EarthArXiv preprint and has not been peer-reviewed before. Subsequent versions of this manuscript may have slightly different content.
is the only real-time data we have since there is no other USGS streamflow gauge inside of this watershed. It is also a well-studied watershed that many streamflows and water quality measurements are going on. According to the flood level data and rating curve from IFIS (Demir and Krajewski, 2013) and USGS (USGS, 2021), its flood action stage flow rate is 187.66 m 3 /s, and flood stage flow rate is 219.48 m 3 /s. And the time of concentration from the farthest point is estimated at 45 hours in IFIS.

Data Processing
Multiple public datasets obtained from different agencies are used in this study. As is shown in the summary in Table 1, the most coarse and indivisible spatial resolution is a 4-km grid in precipitation data. Thus, we used this as our spatial resolution in the modeling. DEM, drainage area polygon, and precipitation at 4-km grid are used to create the flow direction map as shown in Figures 4b and 4c. The flow direction map is then converted to the graph using the method shown in Section 2.2. In this study area, 72 hourly precipitation data at each of 104 grid cells in yellow-shadow were taken into consideration. As shown in previous studies , smoothing precipitation data helps to remove random variation and may improve the deep learning model performance in hydrology.
In this study, we used the same simple moving average of two-way 6 hours as the pretreatment of the precipitation data. Since the USGS measures the streamflow rate in 15minutes interval only, we aggregated it into hourly data and then extracted the quick-response runoff using the baseflow separation method proposed by Hewlett & Hibbert (1967). This process is automatically calculated by the Hydrograph-py package (Terink, 2019) using Python. In this watershed, geoinformation such as the land use and soil type are about the same in most of the grid cells. Thus, we only take the drainage area in each grid cell as an additional feature other than precipitation. As shown in previous studies (Kratzert et al., 2019), the drainage area is the most important geospatial feature in regional modeling. As a last pretreatment step, both input features, the drainage area, and precipitation, are standardized using a min-max scaler.

Model Setup
There are two sequence model layers in GNRRM. The first is used for the temporal scale modeling, and the parameters are summarized in Table 2. The second sequence model layer is used for the spatial scale, and it would be very different for the different study areas. The parameters of the spatial sequence model are summarized in Table 3. In this study, based on the watershed flow direction map shown in Figure 4, it can be converted into a 22-hop graph using the d4 algorithm or a 16-hop graph using the d8 algorithm. Thus, we have different parameters for GTCN and BiGTCN layers. For the graph aggregation operation, we used the sum for the drainage area, and the area-based weighted average is used for the aggregation of the precipitation. After the temporal and spatial operations, the final two dense layers have 32 and 16 neurons respectively, and the dropout rate at each layer is 0.2. To show the calculation process more clearly, we listed the data dimensions after each calculation from the input to the output in Table 4. In this study, we processed seven years of data, from the water year 2012 to 2018. In all models including baseline approaches, the data in the water year 2012-2017 is used for the model calibration, and the data in the water year 2018 is the test dataset used for the final evaluation. In detail, in the deep learning models, the data in the water year 2014-2017 is used for the training, and the water year in 2012-2013 is used for the validation. We only considered April and November each year to better represent the hydrological process in This manuscript is an EarthArXiv preprint and has not been peer-reviewed before.
Subsequent versions of this manuscript may have slightly different content. rainfall-runoff due to a lack of accurate ice, snow, and spring snow melt data in high spatial and temporal resolutions. In total, we have 40,404 hourly data instances, and they are split into 23,137 for training, 11,491 for validation, and 5,776 for the testing. The deep learning model parameters are all randomly initialized, and the optimizer used in this study is Adam (Kingma & Ba, 2014), which is the best among the adaptive optimizers in most cases. The model learning rate is set in a range of 0.001 to 0.00001 for different models. The model loss or cost function is the mean squared error (MSE), which is an effective and simple error to minimize in regression. The batch size in the training is set as 128, thus, each epoch of training takes 181 iterations in the training. We did not set up a limit of epoch or iteration to stop the model. Instead, we used an early-stopping algorithm to automonitor the training and validation loss, and the training process would be stopped when the validation loss did not decrease five epochs in a row. This helps to get a model that is well trained before over-fitting.
We used five statistical equations to evaluate our final model performance. They are KGE (Gupta et al., 2009), NSE (a.k.a., coefficient of determination, r 2 ), and root mean square error (RMSE). Statistically, the values are higher the better with the maximum of 1 for KGE and NSE, and the values are smaller the better for RMSE. We also used the correlation coefficient r for correlation and collinearity analyses.

Baseline Model Setup
We designed multiple baseline models in this study. The operations are shown in Table 4. The first two baseline models are the Lasso-lumped and Lasso-all, which are linear regressions with L1 regularization. The use of L1 regularization helps to reduce model complexity and reduce overfitting. Lasso-lumped calculated the total precipitation by using This manuscript is an EarthArXiv preprint and has not been peer-reviewed before.
Subsequent versions of this manuscript may have slightly different content.
13 the area-based weighted sum as the first step, and only 72 hourly precipitation is the input of Lasso. Lasso-all simply took 7,488 inputs since all the precipitation data in 104 grid cells each of 72 hours is used. Since our proposed graph neural networks aim to improve the rainfall-runoff models at the spatial level, we designed multiple baseline models using different pre-treatment methods to deal with the spatial information. For these spatial-temporal models, we used the same sequence model for the temporal level. The third baseline model is Sequence-lumped. It calculated the watershed-scale total precipitation by using the area-based weighted sum as the first step and then used the temporal sequence models such as LSTM, BiLSTM, TCN, BiTCN. This is the most important baseline model that many lumped rainfall-runoff modeling studies are using (Kratzert et al., 2018;Hu et al., 2018;Katzert et al., 2019). The fourth baseline model is Sequence-all, which is using all the precipitation data as input, with all other layers the same as graph networks. This model considered the precipitation at each grid cell as an independent feature, which is another naï ve baseline. The fifth and sixth baseline models are the Sequence-lumped-lagged and Sequence-all-lagged. These two models considered the precipitation input at each grid cell with lag from 0 to 42 hours due to their average distance to the outlet. As the seventh baseline model is CNNtemporal, where we used a convolutional layer to treat the spatial information before the temporal modeling. This is known in some recent studies (Anderson & Radic, 2021;Baek et al., 2020) as CNN-LSTM when using LSTM as the temporal sequence layer. CNN layer with non-linear activation function such as ReLU helps to address nonlinear problems better. Since our designed model GNRRM do the calculations temporally first, we constructed two temporal-spatial baseline models as well. These two baseline models are Temporal-sum and Temporal-CNN. They both used the same temporal sequence models such as LSTM, BiLSTM, TCN, BiTCN to generate the potential runoff at each grid cell. Temporal-sum simply sums up all the output at each grid cell. Temporal-CNN provides non-linear calculations on the spatial level to help to model.

Results and Discussions 4.1. Baseflow Separation
In this study, we extracted the runoff from the observed discharge data from USGS, and the results are shown in Figure 5. Due to the ice and snow in winter, and the snow meltwater in early spring, we have a reduced data availability and quality from December to March. For better analysis of rainfall-runoff models, we only used the data from April to November every year. Figures 6a and 6b show two different events, Event A and Event B, that both have peak runoff at about 90 m 3 /s. In detail, Event A shows a double-peak hydrograph with double-peak rainfall, and both lag time from peak rainfall to peak runoff is 7 hours. Event B shows a single-peak hydrograph with one-peak rainfall, and the lag time is 2 hours. It is also noticeable that Event B has a long tail with an unusual shape, and this is uncommon in a single-peak hydrograph. It is difficult to perform accurate rainfall-runoff modeling using hydrological domain knowledge with watershed-level rainfall data. This manuscript is an EarthArXiv preprint and has not been peer-reviewed before. Subsequent versions of this manuscript may have slightly different content. 15 If we have higher-resolution rainfall data, we could then explain this hydrograph. As shown in Figures 6c and 6d, the effective hourly precipitation at the peak rainfall of the two events is different. Figure 6c shows that Event A is a common event that the precipitation happened in the middle of the watershed. However, Figure 6d shows that Event B has strong precipitations, over 100 m 3 /s rainfall water in the 4km grid cell, around the watershed outlet, which caused the lag time much less than Event A.
Other than the lag time issue, it is also obvious that the hydrograph shapes are different. Event A has two runoff peaks, and Event B has one peak with a long steady tail rather than two peaks. As is shown in Figure 6d, the heavy rainfall on the upper-middle area of the watershed took about 20 hours to reach to the outlet, and the high transmission loss is possible to make it a long tail. This is consistent with the conclusions of a summary by Cataldo et al., (2004) that the transmission loss over storm events is very high in Midwest and Western streams, and the average and median loss is 44% and 42%. Thus, the transmission loss is not negligible on large watersheds. In this section, we showed that both the lag time and transmission loss are important, however, this information is lost when using a lumped model. And the high spatial resolution precipitation data is necessary for accurate rainfallrunoff modeling.

Baseline Models
The results of naï ve baseline models without considering spatial information including linear model Lasso, and four deep learning models with four different inputs are shown in Table 5. The analyses resulted that the BiLSTM model has the highest KGE of 0.784, and the BiGTCN model has the highest NSE of 0.642 and the lowest RMSE of 6.486 m 3 /s. Comparing the models, we find that bidirectional models have better performance than unidirectional ones. And the linear model Lasso performs the worst in most cases. This result This manuscript is an EarthArXiv preprint and has not been peer-reviewed before. Subsequent versions of this manuscript may have slightly different content. 16 proved the effectiveness of the deep learning models especially the ones with bidirectional structure. Comparing the input pretreatment methods, contrary to our knowledge and discussion in the last section, we were surprised to find that the lumped models using the watershed-level averaged precipitation shows much better performance than the models using all the precipitation data. Especially for the deep learning models, all four time-series models have higher KGEs when using only 1 input per time step rather than 104.
The pairwise correlation coefficient heatmap between the moving-averaged precipitation intensity values in 104 grid cells shows that the lowest Pearson correlation coefficient is 0.630, and 96.9% of the Pearson pairwise correlation coefficients are higher than 0.7 as shown in Figure 7. Statistically, higher than 0.3 would represent a moderate correlation, and the value higher than 0.7 is normally considered a strong correlation (Vatcheva et al., 2016). The strong correlation would cause a multicollinearity issue and there is no doubt that high linear intercorrelation between the input variables may result in incorrect results in regression models (Kim, 2019;Vatcheva et al., 2016). This would be the reason that caused a reduced model performance in deep learning models that is shown in Table 5.
We analyzed the loss function MSE of the deep learning model BiGTCN in training processes as shown in Figure 7. The solid lines show that the BiGTCN-all has slightly higher MSE values on the training datasets than BiGTCN-lumped after the first epoch. However, the MSE of the BiGTCN-all decreased significantly faster than the lumped model. At the end of the training, after 1086 iterations, the training loss of the BiGTCN-all model is 38% lower than BiGTCN-lumped. The dotted lines show that, on the validation dataset, BiGTCN-all has slightly higher MSE than BiGTCN-lumped after the first epoch. However, the MSE of BiGTCN-all did not decrease like it did on the training dataset. In the evaluation of the test dataset as shown in Table 5, the model performance of BiGTCN-all is significantly worse than BiGTCN-lumped. In our tests, a similar phenomenon happens on all other deep learning models (GTCN, BiLSTM, LSTM) as well.
We can clearly see that the direct use of spatial rainfall data in all 104 grid cells would reduce the training loss. However, the model efficiency on our validation dataset and test dataset decreased. This implies that our model has an overfitting issue due to the data, and we need to use more appropriate approaches to deal with spatial relationships to eliminate overfitting caused by multicollinearity.
Comparing the models using the input with lag-time, the model performance does not improve for the models with all input, and the model performance decreased significantly in all lumped models. This indicates that the use of time lag to represent the geospatial information is not good enough to increase the model performance of deep learning timeseries modeling. The time lag simply changed the time step but it did not consider the transmission losses, which may be important as shown in our analysis in Section 4.1.

Spatial-Temporal Models
The spatial-temporal models are shown in Table 6. In this study, we considered the lumped model as a special spatial-temporal model that calculated the area-based averaged precipitation intensity as the first step. Another type of baseline model is CNN-temporal models, which includes CNN-LSTM, CNN-BiLSTM, CNN-GTCN, and CNN-BiGTCN. Among all these 16 models in Table 6  This manuscript is an EarthArXiv preprint and has not been peer-reviewed before. Subsequent versions of this manuscript may have slightly different content. In this test, the CNN-based models do not always perform better than lumped models. For example, the CNN-BiLSTM and CNN-BiGTCN models both show lower KGE than the lumped-BiLSTM and lumped-BiGTCN models. The CNN layer in this study provides a temporal-wise nonlinear operation, which may provide stronger power to represent the spatial information than the lumped models. However, the improvement is very limited since the CNN layers considered all 104 features independent as well, and the higher correlation between these features would affect the CNN layer with overfitting issues as the models using all features.
With the designed GNRRM-ST, the model performances are higher than CNN-based models and lumped models in most cases. The GNN based models have higher KGE than the CNN-based models in each of the four sequence layers. This represents that the GNN structure designed in this study is an effective model structure for rainfall-runoff modeling, and it supports any sequence layer including the LSTM, BiLSTM, GTCN, and BiGTCN. It is also noticed that the proposed GNRRM-ST models do not always have a higher performance using d4 and d8 algorithms. The models with the d8 algorithm perform better than d4 using sequence layers such as GTCN and BiGTCN. The models with the d4 algorithm perform better than d8 using other sequence layers such as LSTM and BiLSTM.

Temporal-Spatial models
We proposed a unique type of model in this study which is a temporal-spatial model. Different from spatial-temporal models, these models applied the time-series modeling in each grid cell first and then applied algorithms to aggregate the results. The results are shown in Table 7. Among all these 16 models in Table 7, the graph model using the BiGTCN layers with the d8 algorithm has the highest KGE of 0.842. The graph models BiGTCN with d4, BiLSTM with d4, and BiLSTM with d8 have the second to fourth highest KGE values. For NSE, the graph model using BiGTCN with d8 has the highest NSE of 0.714, and the graph models using BiGTCN with d4 and GTCN with d4 have the second and third highest NSE values of 0.704 and 0.700. For RMSE, the lowest and the second-lowest values are 5.786 and 5.799 m 3 /s, which are the results of the graph model using BiGTCN with d4 and d8 algorithms.
In this test, when applying temporal layers first, the issue of high correlation in rainfall data is solved. This is the reason that all CNN-based temporal-spatial models have better performance with higher KGE values than the models simply applied a simple sum as a lumped model in spatial-temporal model. The graph models in the temporal-spatial models always show higher model performance (KGE) than other CNN-based models with any of This manuscript is an EarthArXiv preprint and has not been peer-reviewed before.
Subsequent versions of this manuscript may have slightly different content. 20 four sequence layers. This trend is the same as the results of temporal-spatial models, and it further proves the high model efficiency and applicability of graph neural network architecture GNRRM-TS. Algorithmically, the temporal-spatial models have a similar model complexity with the spatial-temporal models since the same temporal operations are applied to each node. Although the temporal-spatial models can be operated parallelly, the spatialtemporal models take much longer since the time complexity is linear time to the size.

Discussion
In previous sections, we presented that the GNRRM models have the best performances among all results. This section provides a detailed comparison between six different spatial aggregation approaches and a discussion of the limitation. Figure 9 shows the best performance among 6 different spatial aggregation approaches in a stormwater event that starts on September 1, 2018. It is a complex event with multiple rainfalls and runoff peaks in 6 days. From the statistics shown in the figure, on this 6-day event, the graph-based model GNRRM-TS using the d8 algorithm with BiGTCN as the sequence layer has the best model performance with the highest NSE, second-highest KGE, lowest RMSE, and the lowest bias. GNRRM-ST using d4 algorithm with BiGTCN as sequence layer performs the second-best with highest KGE, second-highest NSE, second-lowest RMSE, and the third-lowest bias.
Visually, the plot also shows that the GNRRM-TS matches the observation runoff better than other models, especially in the second half. Spatial-temporal models such as lumped model and CNN-BiLSTM cannot perform as well as graph models since the spatial-temporal models aggregated the spatial information first which cannot use the spatial information efficiently as graph models. As the first attempt at a graph neural network in rainfall-runoff modeling using spatial information, this study and the model GNRRM have some limitations. First, we did not pursue the highest spatial resolution. The graph neural network-based model GNRRM is developed at 4-km resolution in our case study, which is lower than other fully distributed rainfall-runoff modeling studies in 1km (Huang et al., 2019) or 250m (Tren andDe Niel, 2018) resolutions. However, the proposed method could be expanded to higher resolution theoretically with high-resolution rainfall data (Seo et al., 2019) and more computing resources or distributed computing (Agliamzanov et al., 2020).
One of our contributions in this study is to apply the sequence model to the spatial dimension. Thus, the proposed GNRRM shares the same assumptions and limitations as to the sequence models. In deep learning models, the sequence models were designed for timeseries tasks, and the sequence interval is assumed to be the same. However, in this study, to use the sequence model on the spatial dimension, we simply considered the distance in the number of grid cells as the sequence interval, and this simplification ignores the curvature of the river network in the grid. This may reduce the model accuracy in a rough resolution of 4km. Figure 9. Watershed-level precipitation intensity and runoff from 6-day event on Sep 1, 2018 Another limitation is the sequence models are inefficient for long sequences. Although the sequence models, such as LSTM, were designed for long-term dependency, their model performance would decrease with a very long sequence as well. Typically, the sequence models have a limit of several hundred time steps, and this would be the limit of the watershed size when applying sequence models on the spatial dimension. In this study, we tested two basic algorithms, d4 and d8, for the flow direction, and they both show good results. However, many algorithms including multiple four-direction, multiple eightdirection, and triangular multiple-direction may have better performance, and these will be studied in future studies.

Conclusion
Deep learning models have been used increasingly in hydrological studies recently. Many studies have been using the DNN, LSTM, and TCN models on the rainfall-runoff modeling and streamflow forecasting for the runoff simulation and flood prediction. However, most studies developed lumped models by considering each watershed or sub-watershed as a unit, and the geospatial information is not effectively used.
To use the geospatial information from the precipitation data, this study proposed a new type of network architecture using the idea of graph neural networks for the rainfall-runoff modeling named GNRRM-TS and GNRRM-ST. We applied the baseline models such as lumped models using the state-of-the-arttime-series models including LSTM, BiLSTM, GTCN, and BiGTCN. As an improvement on lumped models, we applied the CNN-based models using convolutional layers, and GNRRM-ST models using graph-based layers as two spatial-temporal models. Moreover, we developed temporal-spatial models including GNRRM and other baselines by reversing the layers for temporal and spatial dimensions.
Results show that models with designed graph architectures effectively used spatial information and avoided overfitting issues due to the spatial correlation.
Our method also has some limitations. First, we did not pursue the highest spatial resolution. In future studies, the proposed method could be expanded to higher resolution theoretically, but that would need even higher-resolution data and higher computing speeds with larger memories. Second, GNRRM-ST and GNRRM-TS both use sequence models on the spatial dimension, thus, the sequence interval representing the spatial distance is assumed to be the same. And this simplification ignores the curvature of the river network in the grid.
Last, we only applied two basic algorithms, d4 and d8, for the flow direction. Other algorithms such as multiple four-direction, multiple eight-direction, and triangular multipledirection will be studied in future studies. Since the successful design and application of GNRRM is a physics-informed model inspired from the traditional physical rainfall-runoff models, we highly recommended researchers working on the integration of physical models with the current deep learning models. Also these type of comparative modeling studies can be further improved by using benchmark datasets  to adapt specific best practices in hydrology and water resources (Ebert-Uphoff et al., 2017).

Acknowledgments
The work reported in this study was made possible by the support of members of the Iowa Flood Center and the Department of Civil and Environmental Engineering at the University of Iowa.