Improved Accuracy of Watershed-Scale General Circulation Model Runoff Using Deep Neural Networks

Projecting impacts of climate change on water resources is a vital research task, and general circulation models (GCMs) are important tools for this work. However, the spatial resolution of downscaled GCMs makes them difficult to apply to non-grid-conforming scales relevant to water resources management: individual watersheds. Machine learning techniques like deep neural networks (DNNs) may address this issue. Here we use a DNN to predict monthly watershed-scale runoff (i.e., stream discharge divided by watershed area) from monthly gridded and downscaled Coupled Model Intercomparison Project Phase 5 (CMIP5) GCM hydroclimatic fluxes (i.e., precipitation, evapotranspiration, and temperature). We used hydroclimatic fluxes and biotic and abiotic characteristics from 2,731 watersheds in the conterminous United States to train and test a DNN that can predict watershed-scale runoff. The DNN explained 93% (Pearson's correlation coefficient = 0.962) of the variability in observed runoff and was temporally and spatially robust. The median absolute error (MAE) of DNN predictions was approximately 25 percentage points lower than that of gridded, downscaled GCM runoff or monthly normal runoff (i.e., the 30-year average of runoff observations at the watershed outlet). DNN monthly runoff predictions had the lowest MAE of all the grid-to-watershed-scale conversion approaches we tested, including linear ridge regression, support vector machines, extreme gradient boosting, and artificial neural networks. We demonstrated why using DNNs to convert gridded GCM hydroclimatic fluxes to watershed scales is relevant to water resources research and management. We also provided a methods guide for hydrologists interested in implementing machine learning techniques.

Projecting impacts of climate change on water resources is a vital research task, and general circulation models (GCMs) are important tools for this work. However, the spatial resolution of downscaled GCMs makes them difficult to apply to non-grid-conforming scales relevant to water […]

[…] map input data into a high-dimensional feature space and then classify data into groups by minimizing classification error to hyperplanes within this high-dimensional space (Raghavendra et al., 2014). RVMs are similar to SVMs but rely on probabilistic Bayesian learning to classify data into groups (Ghosh et al., 2008). ANNs consist of layers of nodes (also called cells or neurons) and edges, where each layer of nodes and edges represents a linear or non-linear input-output mapping (Shen, 2018). The values of ANN nodes and edges are adjusted during training to minimize a loss function that compares the ANN-predicted output to the observed output. […] a regression-style approach (e.g., Trigo & Palutikof, 1999), ANNs are used to develop self-organizing maps that aid statistical downscaling via weather-typing schemes (e.g., Ramseyer et al., 2018). ANNs have up to four layers (Figure S1a), including an input layer, two hidden layers, and an output layer. In contrast, DNNs are extensions of ANNs containing more than two hidden layers (Figure S1b). We know of one study using convolutional neural networks (CNNs), a class of DNNs applied to multiple two-dimensional inputs such as images, to develop seasonal and regional extreme weather classifications from gridded GCM outputs (Knighton et al., 2019). To the best of our knowledge, no studies have used DNNs to downscale gridded GCM runoff to the watershed scale.

Machine learning-based downscaling methods offer benefits over other downscaling methods. Machine learning techniques such as DNNs are agnostic to the mathematical parameterization of physical processes, even though they may effectively recreate those processes from related data or be used in coordination with physically-based models […] and outputs; diverse model architectures, unsupervised pretraining, and weight sharing improve computational convergence in DNNs (Shen, 2018). […]

We aggregated observed runoff from a daily mean, as provided in the GAGES-II dataset, up to a monthly mean and then used this monthly runoff as a response variable when training and testing the DNN (Figure S2). […] For the watershed characteristics included in this study, see Table S1. In addition to GAGES-II data, we downloaded monthly, gridded, downscaled precipitation, temperature, evapotranspiration, and runoff GCM outputs for each of the 2,731 study watersheds at a spatial resolution of 1/8° x 1/8° (14 km x 11 km at 40°N, 100°W) for the previously mentioned 30-year study period for an ensemble of 18 CMIP5 GCMs (model abbreviations: bcc_csm1-1, ccsm4, cesm1-cam5, csiro-mk3-6-0, fio-esm, gfdl-cm3, gfdl-esm2g, gfdl-esm2m, giss-e2-r, hadgem2-ao, hadgem2-es, ipsl-cm5a-lr, ipsl-cm5a-mr, miroc-esm, miroc-esm-chem, miroc5, noresm1-m, and noresm1-me) (Maurer et al., 2007; Taylor et al., 2012). We downloaded CMIP5 data from http://gdo-dcp.ucllnl.org/downscaled_cmip_projections/. We used temperature data from all 18 CMIP5 GCMs.
For precipitation, evapotranspiration, and runoff data, we excluded the ipsl-cm5a-lr and noresm1-me GCMs because they provided only annual averages over the study period (i.e., 1970-1999), and the hydroclimatic fluxes (i.e., runoff) needed for model comparisons were not available. We used each watershed boundary to calculate a watershed areal average value (i.e., the area-weighted average of gridded GCM data falling within the watershed boundary) for each CMIP5 variable (i.e., temperature, precipitation, evapotranspiration, and runoff) and each GCM. We then calculated the mean of each CMIP5 variable across GCMs for each watershed. This resulted in a monthly ensemble value, which we used for all remaining analyses. We also calculated the one-, two-, and three-month time lags in monthly average GCM ensemble precipitation, temperature, and evapotranspiration using a similar areal average approach (Table S1). […]

The combination of a 30-year study period and 2,731 study watersheds resulted in a total of 972,960 monthly observations of runoff that were ≥ 99% complete. We constructed the DNN train set by randomly sampling observations (i.e., 75% of observations from each ecoregion and either reference/non-reference class) at each time step (i.e., monthly) over the 30-year study period. We refer to this grouped random selection as semi-random sampling; its purpose is to ensure that the trained DNN model can accurately represent non-random spatio-temporal autocorrelation in the original dataset by explicitly forcing consistent and complete spatio-temporal coverage (Rice et al., 2019). We used the remaining 25% of the data as a test set to assess model performance (i.e., DNN testing). For a complete breakdown of data included in the train and test sets, see Figure S3. […]

[…] (Table S1) and one output layer node to represent the regression output of watershed-scale runoff predictions (Table 1). However, to arrive at the final DNN hidden layer architecture, our basic approach was to start with a large number of hidden layers with many nodes and prune both down based on DNN training loss performance (i.e., overall prediction accuracy as well as the time it takes for the DNN to converge to a solution). More specifically, we initialized the DNN architecture with a large number of hidden layers, where the first hidden layer had approximately 10x more nodes than the input layer. Subsequent hidden layers had approximately half as many nodes as the previous hidden layer. Hidden layers 2 and 3 were an exception to this because we observed that slowing down the node "size decay" reduced training loss (i.e., improved DNN predictions). The initial DNN architecture contained 14 hidden layers, but we trimmed it down to 7 after monitoring training loss and the DNN's ability (or inability) to converge in a reasonable amount of time. This is one of several suggested approaches for determining DNN hyperparameters, such as the number of hidden layers and hidden layer nodes. Beginners may look to established guides that discuss these approaches in more detail (e.g., […]). […] variables. The number of parameters represents flexibility in this non-linear mapping rather than the dimensionality of the data space. This is in contrast to, for example, linear regression, where p variables are used to fit a line passing through each of p points.
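As a concrete illustration of the tapering architecture described above, a minimal sketch follows, written with the Keras API in Python. The library choice and exact node counts are assumptions for illustration; the study's final layer sizes are given in Table 1, and the 62 input features correspond to the variables in Table S1.

```python
# Minimal sketch of a tapering DNN for watershed-scale runoff regression
# (assumed Keras/TensorFlow implementation; node counts are illustrative,
# not the study's final architecture in Table 1).
import tensorflow as tf

n_features = 62  # input variables (Table S1)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_features,)),
    tf.keras.layers.Dense(620, activation="relu"),  # hidden layer 1: ~10x the inputs
    tf.keras.layers.Dense(512, activation="relu"),  # hidden layers 2-3: slower "size decay"
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),  # remaining layers roughly halve
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),                       # single node: monthly runoff (regression)
])
model.compile(optimizer="adam", loss="mse")
```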
Best practices, such as model evaluation using an independent test set, help reduce the risk of DNN model overfitting.
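The semi-random 75/25 split described above can be sketched as a grouped random sample. The snippet below is illustrative only: the column names (ecoregion, ref_class) are hypothetical stand-ins for the GAGES-II fields actually used, and the grouping could be extended to include the monthly time step.

```python
# Illustrative "semi-random" split: sample 75% of observations within each
# ecoregion x reference/non-reference group for training; the rest form the test set.
# Column names are hypothetical stand-ins for the GAGES-II fields.
import pandas as pd

def semi_random_split(df: pd.DataFrame, frac: float = 0.75, seed: int = 42):
    train = (df.groupby(["ecoregion", "ref_class"], group_keys=False)
               .sample(frac=frac, random_state=seed))
    test = df.drop(train.index)
    return train, test

# usage: train_df, test_df = semi_random_split(monthly_obs)
```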

We used bias (i.e., y-axis intercept), slope, Pearson's correlation coefficient (PCC), and median absolute error expressed as a percentage (MAE) to test DNN performance. We obtained bias, slope, and PCC from the (linear) line of best fit between observed versus modeled watershed-scale runoff. We bootstrapped 95% confidence intervals (n = 1,000) for MAE and PCC using […]

We tested the ability of four other grid-to-watershed-scale conversion approaches to predict observed monthly runoff at the watershed scale (Table 2). These included linear ridge regression, SVM, extreme gradient boosting (XGBoost), and ANN modeling approaches. As with the DNN, we tested the performance of these four approaches using bias, slope, MAE, and PCC (Section 2.2). The linear ridge regression model used an L1 regularization penalty applied to the loss function (squared error) to impose sparsity on model features (i.e., parameters for variables in Table S1 should not get too large). The SVM, described previously (Section 1), was a linear SVM with L1 regularization (Drucker et al., 1997). XGBoost is a more advanced version […]

[…] which we refer to henceforth as "monthly normal runoff". As with computing GCM runoff, monthly normal runoff was estimated at the watershed extent as described in Section 2.1. This process was implemented on a monthly time step prior to computing 30-year means. We note that monthly normal runoff relies on only three features, while the five models mentioned previously rely on 62 features (see Table S1). As a result, GCM runoff and monthly normal runoff serve as model comparison controls.
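The metrics above can be computed as sketched below. This is an assumed implementation: the exact definition of the percentage-based MAE and the bootstrap details (e.g., percentile intervals) are not spelled out in the text, so the versions here are plausible stand-ins.

```python
# Sketch of the evaluation metrics: bias (intercept) and slope from the linear fit of
# observed on modeled runoff, PCC, and median absolute error as a percentage of the
# observations (one plausible definition). Bootstrap CIs use n = 1000 resamples.
import numpy as np
from scipy import stats

def evaluate(obs, mod):
    slope, bias, pcc, _, _ = stats.linregress(mod, obs)   # obs ~ bias + slope * mod
    mae_pct = np.median(np.abs(obs - mod) / obs) * 100.0  # assumes obs > 0
    return {"bias": bias, "slope": slope, "PCC": pcc, "MAE_%": mae_pct}

def bootstrap_ci(obs, mod, metric, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(obs), len(obs))          # resample with replacement
        vals.append(evaluate(obs[idx], mod[idx])[metric])
    return np.percentile(vals, [100 * alpha / 2, 100 * (1 - alpha / 2)])
```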

Data and Script Availability
We analyzed these data using Python ([…]).

[…] (Table S4), slope ranged from 0.87 to 1.08 (Table S5), and PCC ranged from 0.80 to 0.97 (Table S6). DNN residuals were spread around zero when plotted against spatio-temporal variables such as time, latitude, longitude, and watershed area (Figures S5d-S5f and S6). PCCs between DNN residuals and spatio-temporal variables were close to zero (Figures S5d-S5f, S6, and S10); they ranged from -0.05 to 0.04 (Table 3). At the watershed scale, DNN median residuals were distributed around zero for the test set (Figures 3, S5a, and S7a). The same was true for Q10 and Q90 events. […]

The trained DNN predicted monthly runoff more accurately than the controls (i.e., GCM runoff and monthly normal runoff) and was able to effectively translate gridded GCM outputs into watershed-scale runoff, as demonstrated by several key results. First, the DNN explained more variation in observed monthly runoff and had a lower MAE than any other method we considered (Table 2). Second, the DNN runoff predictions approximated observed runoff with little bias (Figure 2a). Third, DNN residuals were close to zero and relatively symmetric (Figures S5a, S5b, and S7a). This indicates the absence of a systematic tendency for the DNN to overestimate or underestimate watershed-scale runoff. Fourth, we observed a near-zero correlation between DNN residuals and variables related to time, location, or watershed size (i.e., time, longitude, latitude, and watershed area; Table 3, Figures S5d-S5f, S6, S7, and S10). This indicates that the DNN was generally robust to spatio-temporal variation. However, we observed that the DNN overpredicted monthly runoff in California, Texas, and Florida, as indicated by large negative (i.e., < -25%) median watershed residuals (Figure 3). Future studies may use local […]

The trained DNN adequately predicted monthly Q10 and Q90 runoff events, although Q90 events tended to be more accurately predicted than Q10 events. More specifically, the DNN explained a larger percentage of variation (i.e., higher PCC) in observed monthly runoff for test set Q90 events than for Q10 events (Table S2). Also, the scatter plot of observed versus modeled runoff for Q90 events tracked the 1:1 line more closely than that for Q10 events (Figures 2b and 2c). Points below the 1:1 line support the finding that the DNN tended to overpredict Q10 events (Figure 2b). We also observed a higher MAE for Q10 events than for Q90 events (Table S2). Compared to GCM runoff and monthly normal runoff, the DNN was more effective at predicting Q10 and Q90 monthly runoff events, as supported by a consistently higher PCC and lower MAE.

In addition to scale (i.e., CONUS and watershed scale) and extreme events, the DNN accurately predicted monthly runoff for non-reference as well as reference watersheds and across all nine GAGES-II watershed ecoregions. More specifically, PCC was 0.954 for non-reference watersheds in the test set and 0.971 for reference watersheds in the test set (Table S3). Bias was less than 8, slope was close to one, and PCC was > 0.8 for watersheds in all ecoregions. We observed the largest PCC in the West Mountains (Table S6). Compared to GCM runoff and monthly normal runoff, the DNN was better at predicting non-reference and reference site monthly runoff, as supported by a consistently higher PCC and lower MAE (Table S3).
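The residual screening reported above can be reproduced in outline as shown below; the column names and the use of a decimal-year time variable are assumptions for illustration.

```python
# Sketch of residual analysis: residuals as a percentage of observations, correlated
# against spatio-temporal covariates (assumed numeric, e.g., decimal year for time).
# Column names are hypothetical placeholders.
import pandas as pd
from scipy import stats

def residual_correlations(df, covariates=("decimal_year", "latitude", "longitude", "area_km2")):
    resid_pct = 100.0 * (df["observed"] - df["predicted"]) / df["observed"]
    return {c: stats.pearsonr(resid_pct, df[c])[0] for c in covariates}
```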

Model Performance Comparisons
Compared to the four other grid-to-watershed-scale conversion techniques, the DNN explained the most variation in observed monthly runoff (i.e., highest PCC) and had the lowest MAE (Table 2). We found that the linear ridge regression and SVM methods both had higher MAE, higher bias (either negative or positive), and lower PCC than the control methods (i.e., GCM runoff and monthly normal runoff; Table 2). Therefore, we do not recommend using these methods for converting gridded, downscaled monthly GCM hydroclimatic fluxes to watershed-scale monthly runoff for the CONUS. The ANN, which represents a simpler neural network structure than the DNN (Figure S1a), could adequately predict monthly runoff, albeit not as well as the DNN (Table 2). This finding is likely explained by the difference in hidden layers between the ANN and DNN: the DNN has more hidden layers, which enable it to represent more complex relationships between data inputs and outputs (Shen, 2018). XGBoost predicted monthly runoff better than the controls and had a higher PCC and MAE than the ANN, but it underperformed relative to the DNN (Table 2).
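For readers who want a starting point for the four alternatives, the sketch below uses scikit-learn and xgboost with placeholder hyperparameters. Note that the text describes the linear model with an L1 penalty, so Lasso is used here as the closest scikit-learn analogue, and LinearSVR stands in for the linear SVM; neither is guaranteed to match the study's exact configuration.

```python
# Illustrative baseline grid-to-watershed-scale conversion models
# (placeholder hyperparameters; not the study's actual settings).
from sklearn.linear_model import Lasso
from sklearn.svm import LinearSVR
from sklearn.neural_network import MLPRegressor
from xgboost import XGBRegressor

baselines = {
    "linear_l1": Lasso(alpha=0.01),                       # L1-penalized linear regression
    "svm": LinearSVR(C=1.0, max_iter=10_000),             # linear support vector regression
    "xgboost": XGBRegressor(n_estimators=500, learning_rate=0.05),
    "ann": MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=500),  # two hidden layers (Figure S1a)
}

# usage, with evaluate() from the earlier metrics sketch:
# for name, mdl in baselines.items():
#     mdl.fit(X_train, y_train)
#     print(name, evaluate(y_test, mdl.predict(X_test)))
```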

The DNN outperformed the model controls as well as the four alternative techniques when it came to predicting Q10 and Q90 monthly runoff, non-reference and reference monthly runoff, and monthly runoff across ecoregions. Similar to the results discussed above, we do not recommend linear ridge regression because this technique tended to perform worse than the model controls for Q10 and Q90 events; it had a higher MAE and lower PCC (Table S2). Linear ridge regression also had a lower PCC than the controls for non-reference and reference watersheds (Table S3) as well as for watersheds in all ecoregions (Table S6). SVM sometimes outperformed the model controls for Q10 and Q90 events, but there were exceptions to this finding (e.g., the MAE for Q10 events was 736%; Table S2). SVM tended to have a lower PCC than the model controls for non-reference and reference watersheds (Table S3) as well as for watersheds in some ecoregions (e.g., West Mountains). While not as accurate as the DNN, XGBoost and the ANN consistently had a lower MAE and higher PCC for Q10 and Q90 monthly runoff than the model controls (Table S2). Additionally, compared to the model controls, XGBoost and the ANN had a higher PCC for non-reference and reference watersheds as well as for watersheds in all ecoregions.

When it came to computing power, the DNN required the second-longest computing time of all the grid-to-watershed-scale monthly runoff conversion methods tested and required a GPU (Table 2). However, other well-performing approaches required either a GPU (i.e., ANN) or parallel computing on a CPU (i.e., XGBoost; Table 2). This finding highlights a potential limitation of DNN-based methods; hydrologists interested in using machine learning methods to convert gridded GCM hydroclimatic fluxes to watershed-scale runoff may wish to consider available computing resources before implementing DNNs. DNNs can be trained on a single desktop workstation in less than a day using open-source software, or users may seek out cloud-based computing methods to carry out analyses if research budgets are more limited. Based on these results, DNNs hold great promise as a tool for improving the accuracy of GCM-derived runoff estimates for watershed-scale research.

Applying Deep Learning to Climate Model Downscaling: Examples in Colorado and North Carolina, USA

To illustrate the efficacy of using the DNN to convert gridded, downscaled GCM outputs to watershed-scale runoff, we consider two example watersheds (Figure 4a). […]

In 2006, the UCR consisted primarily (53.8%) of forest land cover, followed by 24% shrubland, 9.9% grassland, 4.1% agriculture, 3.8% barren, 1.7% wetlands, and 1.5% development, and […]

[…] the MAE of monthly runoff for the UCR was 12.2%, nearly 50 percentage points better. More specifically, many of the monthly runoff peaks were overpredicted by GCM runoff (Figure 4b). To put these numbers into perspective, mean monthly runoff from this basin during the study period was 11.0 mm, or 511 million m³. Using that value as a guideline, the reduction in error associated with applying the correction model to monthly, gridded GCM runoff for the UCR equates to an improvement in accuracy on the order of 66 million m³ of water per month (i.e., […] water withdrawals for Colorado in 2015; Dieter et al., 2018). For the UCR, the DNN enhanced the accuracy of GCM runoff and the applicability of GCM output to water resources management.
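To make the depth-to-volume conversion above explicit, the short calculation below back-computes the implied drainage area from the reported numbers; the roughly 46,500 km² value is an approximation derived from those numbers, not a figure taken from the study.

```python
# Converting runoff depth to volume: volume (m^3) = depth (m) x area (m^2).
depth_mm = 11.0                                  # mean monthly runoff depth (reported)
volume_m3 = 511e6                                # reported equivalent volume
area_km2 = volume_m3 / (depth_mm / 1000) / 1e6   # implied drainage area, ~46,500 km^2
improvement_m3 = 66e6                            # reported monthly accuracy gain
print(f"implied area: {area_km2:,.0f} km^2; "
      f"gain: {improvement_m3 / volume_m3:.0%} of mean monthly volume")
```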

Implications and Future Directions
To the best of our knowledge, this study was the first to combine watershed characteristics from a large publicly available dataset with gridded GCM hydroclimatic fluxes (i.e., precipitation, temperature, and evapotranspiration) to develop a DNN that accurately predicted monthly runoff for watersheds across the CONUS (Figure S2). The trained DNN was robust to spatio-temporal changes in monthly runoff, accounted for non-reference and reference site characteristics, and was robust across the nine GAGES-II watershed ecoregions. Additionally, the trained DNN adequately predicted Q90 events; however, it had a more difficult time predicting Q10 events. We also compared DNN runoff predictions to two controls (i.e., GCM runoff and monthly normal runoff) and four statistical grid-to-watershed-scale conversion techniques. The DNN outperformed all alternative techniques but required more computing power and computing time than some alternatives. This work highlights key benefits of DNNs as well as future opportunities for the application of DNNs to statistical GCM downscaling.

In addition to the key benefits discussed in Section 1, DNN structure, including input, hidden, and output layers (Figure S1b), […]

• […] which it has never "seen". Ideally, the DNN may also be evaluated with a separate validation set that includes newly generated data. For example, this may include using current precipitation and temperature data as inputs to the model. Researchers may also wish to consider their method for splitting up data (i.e., random sampling or semi-random hold out). Here, we used semi-random sampling because we wanted to make sure the DNN was robust in time and space. Thus, we are choosing which input variables are important for the DNN to represent.
• model evaluation metrics - Consider using multiple model evaluation metrics when assessing DNN performance. These may include bias, slope, R², and MAE as well as others we did not use here (e.g., Nash-Sutcliffe Efficiency). For a thorough review of standard hydrology model evaluation metrics, see Moriasi et al. (2007).
• residual analysis - Residual analysis, including plotting residuals against observations and, in this case, important spatio-temporal variables, is an important statistical evaluation technique for assessing whether or not the DNN is robust to changes in model inputs.
• architecture - Researchers should consider whether they will start simply and add layers and nodes, or start with a large model and remove layers and nodes. Both approaches can lead to useful capabilities, as we discussed in Section 2.2.
• model training improvement techniques - In this study, we implemented a number of techniques to improve model training accuracy and reduce model training time (e.g., early stopping; a minimal sketch follows this list). Researchers should consider including some of these; fortunately, many are easily implemented using existing Python and R libraries. For a thorough description of these techniques, look to Goodfellow et al. (2016).
• data quality and research framing - The old adage "garbage in, garbage out" is important to consider when it comes to implementing machine learning methods. If your data are biased, the machine learning model may learn to reproduce those biases.
For example, if a DNN model is conditioned only on water samples collected after a precipitation event, the model may have a hard time predicting water quality metrics before or during a storm. Just as importantly, it is key to be mindful that machine learning, while powerful, is simply another tool for extracting insights from data. Therefore, machine learning is best used in combination with well-framed research questions and relevant, high-quality data.
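As referenced in the training-improvement item above, early stopping is straightforward to add with the Keras callback API; the patience and validation split below are illustrative values, not the study's settings.

```python
# Minimal early-stopping sketch (Keras callback); hyperparameters are illustrative.
import tensorflow as tf

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)

# usage, with `model` defined as in the earlier architecture sketch:
# model.fit(X_train, y_train, validation_split=0.1,
#           epochs=500, batch_size=1024, callbacks=[early_stop])
```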

Supporting Information
This manuscript supporting information document includes conceptual diagrams showing the architecture of machine learning models used in this study, deep neural network (DNN) model inputs and outputs, and the semi-random training and test set split proportions. We included a figure showing the location and region of watersheds used in this study and several model testing figures. In addition to figures, we included a table describing the DNN input variables and tables with model assessment metrics for non-reference and reference test set results, Q10 and Q90 test set results, and ecoregional results.

Figure S8. Deep neural network test set Q10 event median watershed residuals expressed as a percent relative to observations. Point location represents the watershed centroid.
Figure S9. Deep neural network test set Q90 event median watershed residuals expressed as a percent relative to observations. Point location represents the watershed centroid.