Multi-Task Deep Learning of Daily Streamflow and Water Temperature

Most of the DL models in the hydrology literature, to this point, have been trained to perform only one task, that is, to predict only one variable (e.g., streamflow (Hu et al., 2018; Kratzert, Klotz, Shalev, et al., 2019), water level (Bowes et al., 2019), or water temperature (Read et al., 2019)). In this paper, we explore a variation of DL for hydrologic modeling: multi-task learning, in which a single DL model is trained to predict two related hydrologic variables. Multi-task models have been shown to generalize better and reduce overfitting (Ruder, 2017) and have proven effective in several applications including natural language processing (D. Chen et al., 2014; Seltzer & Droppo, 2013) and computer vision (Girshick, 2015). In the hydrologic sciences, many variables are related to each other, and a multi-task model could learn to leverage information from a related variable to make better predictions. Specifically, when trained to predict multiple variables that are driven by the same underlying physical processes, a multi-task DL model could learn to better represent the shared hydrologic processes and therefore better predict the hydrologic variable of interest. In this paper, we focus on the variables of streamflow and water temperature.
Multi-task modeling may also be a way to address, at least in part, a limitation of DL models: the need for large observational data sets with which a model can be trained. In the earth sciences, these data can be sparse given the expense with which many observations are obtained. By implementing multi-task learning, we may be able to leverage the information from one variable to improve the prediction of another, related variable that is observed less often. Such an imbalance in observations arises when one variable is more expensive to observe than another; for example, long-term streamflow is much more expensive to observe than water temperature.
We selected streamflow and stream temperature, in part, because changes in streamflow and temperature are part of the energy transfer process. Most generally, heat is transferred to or from a stream segment via advective and non-advective processes. Advective heat transfer manifests in changes to both streamflow and stream temperature. Changes in streamflow can change water temperature and, similarly, changes in water temperature can indicate changes in streamflow. A model trained using streamflow and stream temperature observations has the potential to better learn about the advective energy transfer process and, therefore, improve predictions of both variables.
Some studies have leveraged the relationship between hydrologic variables using data-driven models. Rahmani et al. (2020) and Van Vliet et al. (2011) found that including streamflow as an input to an ML model and a statistical model (respectively) for stream temperature prediction improved model performance. Our contribution relative to these studies is that, in multi-task learning, streamflow is used not as a model input but as an output alongside water temperature. Furthermore, we examine the benefit of multi-task learning for improving streamflow predictions, whereas Rahmani et al. (2020) and Van Vliet et al. (2011) focused on improvements in water temperature predictions. Kraft et al. (2020) used a multi-task model to predict different parts of the water balance equation, including evapotranspiration, snow water equivalent, and groundwater recharge, but did not formally assess the benefits of the multi-task approach.
To test the value of multi-task learning for streamflow and water temperature, we evaluated whether a model trained to predict the two variables together made more accurate predictions than a model trained to predict only one variable. We performed this evaluation using sites distributed across the conterminous United States. Additionally, we tested whether the multi-task approach improved predictions under data-sparse conditions for streamflow, which are common given the high cost of streamflow observations. The results of our experiments shed light on the potential of multi-task learning as a means of using process knowledge and making use of available data to improve streamflow and stream temperature predictions.

Data
The data for our experiments came from the Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) data set. For each site in the CAMELS data set, nearly complete daily observations of streamflow and daily basin-averaged weather forcing data are available from 1 January 1980 through 31 December 2014. Of the 671 total sites in the CAMELS data set, 101 had at least 50 water temperature observations per location collected by the U.S. Geological Survey (USGS) between 1980 and 2014. These basins ranged from 5.4 to 14,269 square kilometers in area. The number of daily water temperature observations ranged from 50 to 12,515 with a median of 838 observations. The sites were distributed fairly evenly across the conterminous United States except for a cluster of sites in the Mid-Atlantic. The input data we used were the seven basin-averaged Daymet forcing variables (also known as features in DL terminology) from CAMELS (Table 1). All data were at a daily timestep.

The DL Model
The DL model we used for our experiments consisted of three main parts: a Long Short-Term Memory (LSTM) layer (Hochreiter & Schmidhuber, 1997) and two parallel, densely connected output layers (one for streamflow output and one for water temperature output; Figure 1). Our model was implemented in Python using the TensorFlow library (Abadi et al., 2015). The LSTM is a type of recurrent neural network (RNN). RNNs have explicit representations of sequences, where the information from one member of the sequence (i.e., one time step) is part of the input for the next member (i.e., the next time step). This is especially important for streamflow and water temperature because both variables are autocorrelated (i.e., values at one time step are strongly correlated with values at the next time step). Compared to a standard RNN, LSTMs are even better suited to modeling time series of hydrologic variables because the LSTM's memory cells can store long-term temporal dependencies such as snowpack (Kratzert et al., 2018).
The LSTM layer in our model takes sequences of weather forcing inputs (Table 1) and converts these into intermediate output sequences. These output sequences are passed as input to the two output layers. The two output layers produce the predictions: one output layer predicts streamflow, and the other output layer predicts water temperature. Both output layers are given the same intermediate values from the LSTM layer. Each layer in the model has parameters (i.e., weights and biases). For more details on the exact equations describing how an LSTM's parameters are used to convert model inputs to model outputs, see Hochreiter and Schmidhuber (1997). The model parameters are adjusted in model training to reduce the error between observations and model predictions. The model hyperparameters are architectural and training process decisions that are typically tuned manually.
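The architecture described above can be sketched in a few lines of TensorFlow/Keras. This is our own minimal sketch, not the authors' code; the hidden size (20), dropout rate (0.15), sequence length (365), and input count (7) follow the values reported later in this section, and the layer names are illustrative assumptions.

```python
import tensorflow as tf

# A shared LSTM layer feeding two parallel dense output heads,
# one for streamflow and one for water temperature.
inputs = tf.keras.Input(shape=(365, 7))  # 365-day sequences of 7 Daymet forcings
shared = tf.keras.layers.LSTM(20, return_sequences=True, dropout=0.15)(inputs)
flow = tf.keras.layers.Dense(1, name="streamflow")(shared)    # output head 1
temp = tf.keras.layers.Dense(1, name="temperature")(shared)   # output head 2
model = tf.keras.Model(inputs, [flow, temp])
```

Both heads receive the same intermediate sequence from the LSTM, so any representation the shared layer learns serves both prediction tasks.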
The equations for updating the DL model parameters for the output layers are

θ_main(i+1) := θ_main(i) − α ∇_{θ_main} L_main(θ_LSTM(i), θ_main(i))    (1)

θ_aux(i+1) := θ_aux(i) − α ∇_{θ_aux} L_aux(θ_LSTM(i), θ_aux(i))    (2)

where θ_main represents the parameters for the main variable output layer, θ_aux represents the parameters for the auxiliary variable output layer, L is the loss in either the main or the auxiliary variable, ∇ is the gradient with respect to a set of parameters, α is the learning rate hyperparameter, θ_LSTM represents the parameters for the LSTM layer, and i is the training step. These equations are written generically. If streamflow were the "main" variable, θ_main would be replaced with θ_flow, and the same would be done with the ∇ and L terms; likewise, stream temperature would be used for the "aux" terms. If stream temperature were the "main" variable, streamflow would be the "aux" variable. Our experiments tested multi-task modeling with streamflow as the main variable and stream temperature as the auxiliary variable, and vice versa.
For each output layer, only the gradient with respect to the loss in the relevant variable is used to update the layer's parameters (Equations 1 and 2). For example, only errors in the streamflow predictions are used to update the weights in the streamflow output layer. In contrast, the gradients with respect to the losses from both variables are combined and used to update the parameters of the shared LSTM layer. The equation to update the parameters of the shared LSTM layer is

θ_LSTM(i+1) := θ_LSTM(i) − α [∇_{θ_LSTM} L_main + λ ∇_{θ_LSTM} L_aux]    (3)

where λ is a scaling factor applied to the auxiliary gradient.
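The shared-layer update in Equation 3 amounts to ordinary gradient descent on a weighted sum of the two gradients. The toy sketch below illustrates this; the function name, parameter values, and gradient vectors are our own hypothetical examples, not values from the paper.

```python
import numpy as np

def update_shared(theta_lstm, grad_main, grad_aux, lr=0.01, lam=80.0):
    """One gradient-descent step on the shared LSTM parameters (Equation 3)."""
    return theta_lstm - lr * (grad_main + lam * grad_aux)

theta = np.zeros(3)
g_main = np.array([1.0, -1.0, 0.5])   # hypothetical gradient from the main loss
g_aux = np.array([0.1, 0.2, -0.1])    # hypothetical gradient from the auxiliary loss

# With lam=0 the auxiliary loss is ignored (single-task behavior);
# with lam>0 both losses shape the shared parameters (multi-task behavior).
single = update_shared(theta, g_main, g_aux, lam=0.0)
multi = update_shared(theta, g_main, g_aux, lam=80.0)
```

Setting lam to zero recovers the single-task update exactly, which is why the single-task model is a special case of this formulation.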

The overall gradient is a weighted sum of the main and auxiliary gradients, where λ weights the influence of the auxiliary gradient relative to the main gradient. If λ = 0, the auxiliary gradient has no effect on the update of the LSTM parameters; in this case, the LSTM parameters are adjusted only in proportion to the errors in the main variable and the model is effectively a single-task model. If λ > 0, errors in both the main and auxiliary variables affect the gradient and the model is a multi-task model. For the remainder of the paper, we refer to λ as the "multi-task scaling factor." The loss function was based on the Nash-Sutcliffe Efficiency (NSE), a common metric for measuring streamflow prediction performance (Nash & Sutcliffe, 1970):

NSE = 1 − [Σ_{t=1}^{T} (y_t − ŷ_t)²] / [Σ_{t=1}^{T} (y_t − ȳ)²]    (4)

where ŷ_t is the predicted value, which is a function of the inputs and the model parameters (θ_LSTM and either θ_main for the main variable or θ_aux for the auxiliary variable), y_t is the observed value, ȳ is the mean of the observed values, t is a time step of the time series, and T is the full time series length. The unmodified NSE values range from negative infinity to one, with one being a perfect model. We used the normalized Nash-Sutcliffe Efficiency (NNSE), which ranges from 0 to 1:

NNSE = 1 / (2 − NSE)    (5)

Because the model training algorithm was minimizing the loss, our loss function was 1 minus the NNSE so that a smaller loss would correspond to a better model and vice versa:

Loss = 1 − NNSE    (6)
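The NSE-to-loss pipeline of Equations 4-6 can be written as a short function. This is a minimal sketch of the described loss, not the authors' implementation; the example series is hypothetical.

```python
import numpy as np

def nnse_loss(y_obs, y_pred):
    """1 - NNSE: 0 for a perfect model, approaching 1 as NSE -> -inf."""
    nse = 1.0 - np.sum((y_obs - y_pred) ** 2) / np.sum((y_obs - np.mean(y_obs)) ** 2)
    nnse = 1.0 / (2.0 - nse)  # maps NSE in (-inf, 1] to (0, 1]
    return 1.0 - nnse

y = np.array([1.0, 2.0, 3.0, 4.0])
print(nnse_loss(y, y))                     # perfect prediction -> 0.0
print(nnse_loss(y, np.full(4, y.mean())))  # predicting the mean: NSE = 0 -> 0.5
```

The normalization matters for gradient-based training: unlike raw NSE, the loss is bounded, so a few very poor predictions cannot produce arbitrarily large loss values.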

Multi-Task Learning Experiments
To test the ability of a multi-task model to improve streamflow and water temperature predictions, we ran four experiments (A-D). In our experiments, we compared a single-task model with multi-task models using a variety of multi-task scaling factors (see Equation 3 for the meaning of the multi-task scaling factor, λ). The multi-task scaling factors explored for each experiment are given in Table 2.
The available data were split into three sets: a training set, a validation set, and a test set. The training data were used in updating the model parameters. The validation data were used to select the multi-task scaling factor. Other common hyperparameters were fixed and are described in Section 2.4. The test set was used for final model evaluation and was not used to adjust any part of the model.
The amount of data used for training varied in each experiment (Figure 2). For Experiments A, C, and D we trained individual models for each site in our data set. For Experiment B, we trained a single model with the data from all available sites. All of our experiments were scripted using the Snakemake workflow management system (Köster & Rahmann, 2012).

Experiment A: Multi-Task Modeling With Streamflow as the Main Variable Using 25 Years of Training Data and Single-Site Models
In Experiment A, we trained a separate model for each site and evaluated the multi-task model performances using 25 years of training data. This experiment was intended to represent predicting streamflow at a well-monitored location. In this experiment, streamflow was the main variable and stream temperature was the auxiliary variable. We used 1 January 1980 to 31 December 1984 as a validation set and 1 January 1985 to 31 December 1989 as a test set. We used data from the latter part of the data set for training because there is substantially more water temperature data available to train the multi-task model in later years compared to earlier years. There were 101 sites that had data in all three partitions for this experiment.

Given that there is a process-based relationship between streamflow and water temperature, it follows that the relationship between the two variables at a given site is expected to affect the performance of the multi-task approach at that site. Because catchment characteristics such as groundwater influence, land cover, and climate could influence the relationship between streamflow and water temperature, we hypothesized that we could use these attributes to predict and/or understand how beneficial a multi-task model would be. To explore the predictability of the improvement of a multi-task model compared to a single-task model, we used a Random Forest model (Breiman, 2001) to predict the improvement of multi-task models over single-task models across the 101 sites. We used the catchment characteristics from the CAMELS data set as input features to the Random Forest model. The details of our implementation of the Random Forest model are in Supporting Information S1.

Experiment B: Multi-Task Modeling With Streamflow as the Main Variable Using 25 Years of Training Data and a Multi-Site Model
Experiment B was identical to Experiment A except that, instead of training separate models for each site, we trained one model using the data from all available sites. This experiment was intended to place the single-site model experiments, which permit meaningfully diverse replication within the available data set, in the context of multi-site models that have been shown to be effective in several other studies (Kratzert, Klotz, Herrnegger, et al., 2019; Rahmani et al., 2020). Training a multi-site model required that the model be given site-specific information that would allow it to differentiate one site from another. Similar to the approach of Kratzert, Klotz, Herrnegger, et al. (2019), we provided the model with catchment characteristic data in addition to the meteorological inputs. The catchment characteristics (see Table S2 in Supporting Information S1) were part of the CAMELS data set and represented attributes such as soil types, land use/land cover, and catchment size. These data were repeated for each time step and concatenated to the time-varying meteorological inputs.
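The "repeated for each time step and concatenated" step can be sketched as follows. The shapes are illustrative assumptions (the attribute count here is arbitrary), not values from the paper.

```python
import numpy as np

seq_len, n_met, n_attrs = 365, 7, 27
met = np.zeros((seq_len, n_met))   # time-varying daily meteorological inputs
attrs = np.ones(n_attrs)           # static catchment characteristics for one site

# Tile the static attributes across every time step, then join them to the
# meteorological inputs along the feature axis.
x = np.concatenate([met, np.tile(attrs, (seq_len, 1))], axis=1)
print(x.shape)  # (365, 34)
```

Because the attributes differ between sites but not over time, this gives the shared model a constant per-site signal at every time step that lets it distinguish one catchment from another.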

Experiment C: Multi-Task Modeling With Streamflow as the Main Variable Using 2 Years of Training Data
In Experiment C, we trained the multi-task, single-site models with only 2 years of streamflow data each. This experiment was designed to simulate the situation where only a small period of record is available for a site. We used different 2-year periods to account for differences in events and flow regimes that may or may not be represented in one 2-year period compared to another. The data that we used to train the models in Experiment C were each non-overlapping 2-year period in the years 1990-2014 (i.e., 1 January 1990 to 31 December 1991, 1 January 1992 to 31 December 1993, 1 January 1994 to 31 December 1995, etc.). The validation and test sets were the same as in Experiment A. There were 98 sites that had data in all three partitions for this experiment.
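The non-overlapping 2-year training windows described above can be enumerated directly; this is our own sketch of that enumeration, not the authors' code.

```python
# Consecutive, non-overlapping 2-year start/end years beginning in 1990.
periods = [(start, start + 1) for start in range(1990, 2014, 2)]
print(periods[0], periods[-1], len(periods))  # (1990, 1991) (2012, 2013) 12
```

Twelve distinct 2-year training sets per site make it possible to separate the effect of the multi-task approach from the luck of which particular 2-year record a site happens to have.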

Experiment D: Multi-Task Modeling With Temperature as the Main Variable
In Experiment D, we evaluated the performance of the multi-task models for improving stream temperature predictions. We performed this experiment to see how the multi-task approach would perform for a primary variable with very different dynamics than streamflow. In this experiment, streamflow was the auxiliary variable. Because there were fewer temperature observations than streamflow observations available at these sites, especially in the early years of the data record, we used the period 1 January 1980 to 31 December 2011 as the training period, the year 2013 as the validation period, and the year 2014 as the testing period. There were 53 sites that had data in all three partitions for this experiment.

Model Configuration, Training, and Evaluation Metrics
The training data were divided into samples (or sequences) and batches. A training sample consisted of a 365-day sequence for a single site. Making the sequence length 365 days is important because stream temperature and streamflow have long temporal dependencies that can extend back into previous seasons. A batch consisted of a number of samples. In the single-site experiments (Experiments A, C, and D), all of the 365-day sequences were combined into a single batch. Therefore, in Experiment A the training batch was 25 365-day sequences (one sequence for each year), in Experiment C the batch was two 365-day sequences, and in Experiment D the batch was 32 365-day sequences. In the multi-site experiment (Experiment B), the training batches were made up of one 365-day sequence from each of the 101 sites; therefore, there were 25 batches in Experiment B, where each batch had 101 samples. In model training, each batch was used to update the model parameters in turn: the gradient from the first batch was used to update the model parameters, and those parameters were then used as the starting point for training on the second batch (if there was one), and so on. Once each batch had been used to update the model parameters, one training epoch was complete.
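The arrangement of daily data into 365-day samples and batches can be sketched as below. The array contents are placeholders; only the shapes illustrate the batching described above.

```python
import numpy as np

# Experiment A setup: 25 years of daily data for one site, 7 input features.
n_years, seq_len, n_features = 25, 365, 7
daily = np.zeros((n_years * seq_len, n_features))  # stacked daily records

# One 365-day sample per year; in the single-site experiments all samples
# form a single training batch of shape (samples, timesteps, features).
batch = daily.reshape(n_years, seq_len, n_features)
print(batch.shape)  # (25, 365, 7)
```

The (samples, timesteps, features) layout is the standard input shape for recurrent layers, so each gradient step in the single-site experiments sees every year of training data at once.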
Each model was trained for 400 epochs for Experiments A, B, and D and 200 epochs for Experiment C. We trained the models for fewer epochs in Experiment C to reduce overfitting that was more likely with a much smaller data set (using 2 years of training data instead of 25). For all the models, we used a dropout rate of 0.15 to reduce overfitting. The optimizer used for the model training was the Adam optimizer (Kingma & Ba, 2015).
In each experiment, 30 replicates of each model were trained, each with a different random initialization of the model parameters. This was to account for variation in model performance due solely to the random starting parameters. All 30 models were then given inputs from the validation and testing sets to make predictions, and those predictions were compared with the observations for evaluation. The LSTM hidden states were initialized as all zeros. All models had 20 hidden units in the LSTM layer. Because our objective was to assess the effect of the multi-task approach, we did not perform a thorough hyperparameter tuning exercise and used the same set of hyperparameters for both the single-task and multi-task models. Although some other combination of hyperparameters might perform better, we found these hyperparameters to work well enough for the purposes of our study.
Both the weather input features (Table 1) and the target data (streamflow and stream temperature) were scaled and centered by subtracting the mean from each variable and dividing by the standard deviation; in this way each scaled and centered variable had a mean of zero and a standard deviation of one. Because we were training multi-task models, it was necessary to scale the target data as well as the input data so that the streamflow and temperature predictions would be on the same scale during model training. All the data were scaled and centered using the means and standard deviations from the training data only.
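The key detail above is that the scaling statistics come from the training split only and are then reused on the other splits; a minimal sketch (with synthetic data standing in for the forcings):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(5.0, 2.0, size=(1000, 7))  # synthetic "training" features
test = rng.normal(5.0, 2.0, size=(200, 7))    # synthetic "test" features

# Fit statistics on the training data only.
mean, std = train.mean(axis=0), train.std(axis=0)

train_scaled = (train - mean) / std  # mean ~0, std ~1 by construction
test_scaled = (test - mean) / std    # reuses the training statistics
```

Reusing the training statistics avoids leaking information from the validation and test periods into the trained model.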
We compared the multi-task models and the single-task models using NSE. Formal statistical comparisons were done on a site-wise basis using a one-sided Welch's t-test comparing mean NSEs (Satterthwaite, 1946). In the t-test, our hypothesis was that, for a given site, a multi-task model is more accurate in terms of NSE than the single-task model. To calculate the t-score in the Welch's t-test for a given site, s, the equation is

t_s = (N̄SE_multi − N̄SE_single) / √(σ²_multi/n + σ²_single/n)    (7)

where N̄SE is the NSE averaged across the 30 replicates, σ is the standard deviation of the NSE across the replicates, and n is the number of replicates. The t-score and p-value were calculated using Python's SciPy package.
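The t-score in the equation above can be computed directly; the NSE samples below are hypothetical stand-ins for 30 replicates per model, not results from the paper.

```python
import numpy as np

def welch_t(a, b):
    """Welch's t-score for two samples with unequal variances."""
    return (a.mean() - b.mean()) / np.sqrt(
        a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b)
    )

rng = np.random.default_rng(1)
nse_multi = rng.normal(0.65, 0.02, 30)   # hypothetical multi-task replicate NSEs
nse_single = rng.normal(0.60, 0.02, 30)  # hypothetical single-task replicate NSEs

t = welch_t(nse_multi, nse_single)  # positive t favors the multi-task model
```

In practice, the same statistic (with its p-value) is available via scipy.stats.ttest_ind with equal_var=False.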

Influence of Multi-Task Scaling Factor, λ
For most of the single-site model configurations, the multi-task model performed better than the single-task model at predicting streamflow (Figures 3a and 3b). The multi-task scaling factor (λ from Equation 3) had a large influence on the accuracy of the multi-task models. As the multi-task scaling factor increased from 0 to 80, the performance of the model generally increased. In other words, as the emphasis the model gave to predicting stream temperature grew, the accuracy of the streamflow predictions increased. However, when the multi-task scaling factor exceeded 80, overall model performance in terms of median NSE decreased with further increases in the scaling factor.
The decrease in accuracy in terms of median NSE with multi-task scaling factors greater than 80 in Experiment A likely occurred because the combined gradient used to update the parameters became more influenced by the temperature loss contribution. With the gradient more influenced by the temperature loss, the parameter adjustments in gradient descent were directed more to reducing error in the temperature predictions. When the multi-task scaling factor was high enough, those adjustments reduced error in the temperature predictions at the expense of streamflow predictive performance. This is likely the reason that there was an overall decrease in streamflow prediction accuracy in terms of median NSE with the multi-task scaling factors of 160 and 320.
The multi-task approach made substantial improvements to the worst performing single-task models. The minimum NSE for the single-task models was −3.5, whereas the minimum NSE for the multi-task model with a multi-task scaling factor of 80 was −0.8.

Figure 3. (a) Model performance by multi-task scaling factor; (b) the same data as panel (a), zoomed in so that the variation between the models can be better seen; (c) the number of sites for which each multi-task scaling factor performed best in terms of median NSE. The dashed horizontal line in panels (a) and (b) is the median performance of the single-task model. In the boxplots, each box spans the first quartile (Q1) to the third quartile (Q3) with the middle bar at the median; whiskers extend to the most extreme points within 1.5 × IQR, where IQR is the interquartile range (Q3 − Q1), and diamonds mark points outside the whiskers.

Although there was a clear overall pattern in the effect of increasing the multi-task scaling factor on model performance, there was also substantial variability site to site (Figure 3c). The multi-task scaling factor that performed best in terms of median NSE for the largest number of sites in the validation period was 80. However, for each multi-task scaling factor we evaluated, there were at least six sites at which that multi-task scaling factor performed the best. For example, for six sites, the best-performing multi-task scaling factor was 10. For eight sites, the best multi-task scaling factor was 0, meaning that the multi-task approach did not improve overall streamflow predictions compared to the single-task model. Examining the geography, catchment characteristics, and number of temperature observations of these sites showed no obvious reason that the single-task models performed best at these sites.
Given the variability in the performance of the multi-task scaling factors in the validation period across the 101 sites, we selected the best-performing, non-zero multi-task scaling factor for each site to evaluate in the test set. For the eight sites for which the best-performing model was the single-task model, we used the second-best multi-task scaling factor. The best multi-task models performed better overall than the single-task models in the test set in terms of NSE (Figure 4). The median NSE of the best multi-task models in the test set was 0.66 compared to 0.61 for the single-task models.
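The per-site selection rule described above is simple to express in code; the validation median-NSE values below are hypothetical for a single site.

```python
# Hypothetical validation-period median NSE for one site, keyed by λ
# (λ = 0 is the single-task model).
val_median_nse = {0: 0.62, 10: 0.64, 20: 0.65, 80: 0.66, 160: 0.63}

# Pick the best-performing non-zero scaling factor for the test set.
best_lambda = max(
    (lam for lam in val_median_nse if lam > 0),  # exclude the single-task case
    key=val_median_nse.get,
)
print(best_lambda)  # 80
```

Excluding λ = 0 mirrors the paper's choice of using the second-best (first non-zero) factor at sites where the single-task model won on the validation set.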
The performance in the validation time period was better (single-task median NSE approximately 0.68) than the test period (single-task median NSE approximately 0.61). We compared the training data's flow characteristics in the upper part of the flow duration curve (q50, q60, q70, q80, q90, q95, q99, where q represents the flow quantile) to the same characteristics in the validation and test period. In this comparison, we found that the flow characteristics of the training data set were more similar to the validation set than to the test set. We believe that this is the likely reason that, overall, the models performed better in the validation period compared to the test period.
With a median NSE of 0.61, the performance of our single-site models is comparable to that of Kratzert, Klotz, Shalev, et al. (2019), whose single-site LSTM models predicting runoff at CAMELS sites had a median NSE of 0.59. The spatial distribution of model performance was also similar to Kratzert, Klotz, Shalev, et al. (2019), with the models in North Dakota performing the worst (Figure 5). There was little spatial trend in the improvements made by the multi-task models. For each multi-task scaling factor except 320, the multi-task performance was better than the single-task models for more than half of the sites (Figure 6), although the sites that improved for one multi-task scaling factor were not necessarily the same ones that improved for another. When the multi-task scaling factor was 20, 40, or 80, and when the best site-specific multi-task scaling factors were used, a majority of sites had statistically significant (p < 0.05) improvements (Table 3). A larger multi-task scaling factor resulted in larger improvements at some sites and a larger number of significantly improved sites up until a factor of 160; it also resulted in larger-magnitude decreases in performance at some sites. The site-specific best multi-task models performed the same in terms of median root mean squared error as when all sites used a multi-task scaling factor of 80; however, with a multi-task scaling factor of 80, a majority of sites (57 sites) had significant improvements while 17 sites had significantly worse performance. In contrast, only four sites had significantly worse performance when the best site-specific multi-task scaling factor was used.

Catchment Characteristics
The Random Forest model was trained to predict the t-score comparing the NSEs of the multi-task models to those of the single-task models when the multi-task scaling factor was 80. The model had a mean coefficient of determination of 0.2 (standard deviation of 0.29), measured on sites not present in the training data across 20 random data partitions. This performance indicates that the model was able to learn some general trends that could help predict the multi-task improvements from the catchment characteristics. One likely limiting factor in the Random Forest model performance was the small data set: there were only 101 sites to draw from, and with a 75/25 train/test split, only 77 of those could be used for training.
The importance of the Random Forest predictor variables was calculated based on the effect that perturbing a given predictor variable had on the prediction error; a description of each predictor variable is given in Table S2 in Supporting Information S1. The most important variable was base-flow index (Figure 7). Sites with a larger base-flow index will have smoother flow patterns and a thermal buffer against extremes in air temperature affecting water temperature compared to sites with a lower base-flow index. This buffer can alter the relative influence of different parts of the energy balance equation that governs the energy transfer process and that links the streamflow and water temperature variables. It is reasonable, therefore, that base-flow index is an important variable in predicting how much a multi-task DL approach would benefit a given site.

Figure caption: t-scores of the multi-task models in the test set of Experiment A. The "best" is the λ that was best for each site based on the validation set. The distributions are across the 101 sites used in Experiment A. Each box spans the first quartile (Q1) to the third quartile (Q3) with the white circle at the median; whiskers extend to the most extreme points within 1.5 × IQR, where IQR is the interquartile range (Q3 − Q1). The dotted lines mark where the t-scores are statistically significantly worse (below zero) or better (above zero) at p = 0.05.

Seasonal Trends
The improvements on the single-task models made by the multi-task models had a noticeable seasonal trend, with larger improvements in higher flow months and smaller improvements in lower flow months (Figure 8). A greater portion of the training loss occurred in the high-flow periods. This is because the loss function is based on the squared error: a 10% error when the observed value is 40 cubic meters per second (CMS) has 100 times the effect on the overall loss of a 10% error when the observed value is 4 CMS. Therefore, the multi-task (as well as the single-task) model parameters adjusted to reduce error in springtime. These results indicate that the multi-task models were able to exploit the relationship between streamflow and water temperature to further improve model performance in the spring months. The parameter values that adjusted to fit that relationship, however, did not seem to benefit the model in the summer months, where, overall, the multi-task models performed worse than the single-task models. This difference indicates that the relationship learned in the high-flow months of the spring did not apply in the summertime.

Table 3. Number of Sites That Were Significantly Better and Significantly Worse Than the Single-Task Models for Several Multi-Task Scaling Factors in Our Experiments. Note: Significance was tested using Welch's t-test with p < 0.05; for brevity, not all multi-task scaling factors for all experiments are reported.

Figure 7. Feature importance for predicting t-scores for Experiment A when the multi-task scaling factor is 80. Error bars are the standard deviation across 20 random train/test splits.

Experiment B: Multi-Task Modeling With Streamflow as the Main Variable Using 25 Years of Training Data and a Multi-Site Model
In Experiment B, the overall performance of the single-task, multi-site models (median NSE of 0.69) was better than that of the single-task, single-site models (median NSE of 0.61; Figure 4). The multi-task approach had a much smaller effect on model performance in the multi-site models: in the test period, the best site-specific multi-task models had a median NSE of 0.69 (the same as the single-task model). This is likely in large part because the amount of training data for each model in Experiment B was roughly 100 times that in Experiment A, because the Experiment B models learned from data from all of the sites.
Fewer sites had statistically improved performance for the multi-site models. As in Experiment A, we found the multi-task scaling factor for the multi-site model that performed best for each site. The site-specific best multi-task configurations of the multi-site model were significantly better than the single-task configuration for 17 of the 101 sites (see Table 3). When data are available, using a multi-site approach, with or without multi-tasking, may improve model performance more than using a single-site approach with even the best multi-task configuration. Additionally, the best multi-task configurations of the multi-site model were slightly more accurate than the single-task multi-site configuration.

Experiment C: Multi-Task Modeling With Streamflow as the Main Variable Using 2 Years of Training Data
In general, the multi-task models with the best multi-task scaling factors improved upon the single-task models in Experiment C (Figure 4 and Table 3). This experiment tested whether a multi-task approach could be used to improve predictions when only sparse data are available. The effect of the multi-task approach was smaller in the sparse data conditions. The model performance across the single-task models and all the multi-task scaling factors was worse in Experiment C (single-site median NSE of 0.45) than in Experiment A (overall median of 0.61). This was expected given that there was a much smaller amount of training data from which the model could learn the flow dynamics (and the temperature dynamics in the case of the multi-task models). There was more variability in the results and in the effect of the multi-task approach when the data were sparser. This variation was likely due to variation in the training data the models used, because only 2 years of training data were given and there can be substantial variability between one 2-year period of streamflow record and another. The multi-task scaling factor that was best for the most sites in Experiment C was much smaller than in Experiment A: in Experiment A it was 80, in Experiment C it was 10.

Experiment D: Multi-Task Modeling With Temperature as the Main Variable
The multi-task approach did not improve upon the single-task model as substantially for predicting water temperature as it did for predicting streamflow (Figure 4). The multi-task scaling factor that was the best for the most sites in Experiment D was much smaller compared to Experiment A: in Experiment A it was 80, in Experiment D it was 0.01. When using the best site-specific multi-task scaling factor, the multi-task models significantly improved upon the single-task models for seven sites. When a multi-task scaling factor of 1 was used (as would be the default if no scaling factor was selected) the multi-task models performed significantly worse for 47 of the 53 sites (see Table 3).

Performance of Auxiliary Variable
Although we focused on how multi-task learning improves the main variable's predictions, Figure 9 shows the change in performance for the auxiliary variable in Experiment D (streamflow) in addition to the performance for the main variable (stream temperature). As expected, the performance of the auxiliary variable was poorest when the scaling factor was zero. As the scaling factor increased, the flow predictions improved. However, when the site-specific best multi-task scaling factors based on temperature were used, the performance was sub-optimal for the streamflow predictions (median NSE of 0.72 compared to 0.75 if a scaling factor of 0.05 was used). Therefore, although there was a tradeoff between optimizing solely for streamflow performance and optimizing solely for stream temperature performance, a configuration for this model setup produced good (if not optimal) predictions for both variables. This tradeoff is a reality with all multi-objective optimization approaches (Efstratiadis & Koutsoyiannis, 2010).

Discussion
The goal of our experiments was to evaluate the performance of multi-task modeling for predicting streamflow and water temperature. We compared the accuracy of the multi-task approach (predicting streamflow and water temperature simultaneously with one model) to a single-task modeling approach (predicting streamflow or water temperature with their own independent models). Because streamflow and water temperature are related, we hypothesized that a multi-task model would be more accurate than a single-task model. The results of our experiments were nuanced. In some cases, the multi-task approach performed significantly better, in some there was no significant difference, and in some the multi-task model performed significantly worse. The improvements made by the multi-task approach seemed to depend on (a) seasonal factors, (b) which variable was being used as the primary variable, (c) the site being modeled, and (d) data availability. Together these affected the optimal value of the multi-task scaling factor. Given these nuances, we suggest that a multi-task modeling approach should be pursued with care. However logical and reasonable a multi-task approach may seem at the outset, it cannot be assumed that the approach will be beneficial without exploration of the factors above.

Seasonality of Experiment A Results
There was an apparent seasonal pattern to the improvements made by the multi-task approach where, in general, the improvements were greater in the high flow months and lesser in the low flow months (Figure 8). The influence of snowmelt and base flow are two possible physical mechanisms that could cause differences in the streamflow-water temperature relationship in spring compared to summer and therefore cause this seasonal pattern. Snowmelt and base flow have seasonal patterns and can be strong influences on both streamflow and water temperature (Anderson, 2005; Constantz, 2008; Lisi et al., 2015). For a site that is substantially affected by snowmelt, a multi-task model might have parameters that are trained to predict a decrease in stream temperature for conditions in which an increase in streamflow is also predicted, as would happen during a snowmelt event. Those parameters could make performance worse, however, when increases in streamflow are not due to snowmelt but to rainfall, as would happen in the summer months. Similarly, a multi-task model whose parameters capture the flow and temperature dynamics of high-flow months may be less well-suited to predict those dynamics in the summertime when the relative influence of base flow on streamflow and water temperature changes.

Figure 9. Performance in terms of Nash-Sutcliffe Efficiency in the test set for temperature (the main variable) and flow (the auxiliary variable) in Experiment D. The bottom of each box is the first quartile of the data (Q1), the middle bar is the median, and the top of each box is the third quartile (Q3). The lower whisker is the lowest point within Q1 − 1.5 × IQR, where IQR is the interquartile range (Q3 − Q1). The upper whisker is the highest point within Q3 + 1.5 × IQR. The diamonds are points that lie outside of the whiskers. Printed numbers show values below the range shown; these were represented in this way to better show the distribution of the majority of the data.

The tendency to favor the high-flow periods would not be desirable for certain use cases, such as drought applications. This issue could be addressed in future work in several ways. As a first approach, the model could be trained to predict log-transformed streamflow, which would substantially lessen the influence of high flows on the loss function. A more tailored loss function, such as log-NSE, could also be used. As another approach, the loss could be changed to an average of the NSEs of each season. That way, the errors during the low-flow summer season would be considered separately from, and would not be overshadowed by, the errors during the higher-flow spring season. This could, however, produce a model that is mediocre for each of the seasons instead of better in springtime and worse in summertime. As a more extensive solution, one could train season-specific models, each trained using data from only a certain season. When making predictions, the season-specific model outputs would be concatenated together. In this way, each model would be able to focus on season-specific dynamics between streamflow and water temperature. One drawback of this approach, however, is the significantly smaller data set that would result from dividing the training record by season. Another opportunity is to adjust the multi-task scaling factor dynamically for different time steps; for example, the multi-task scaling factor could be a function of the day of year.
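To illustrate the loss variants discussed above, the following NumPy sketch shows an NSE-based loss on log-transformed flows and a season-averaged NSE loss. The function names and forms are our illustration, not the loss functions used in the models reported here:

```python
import numpy as np

def nse(obs, sim):
    """Nash-Sutcliffe Efficiency: 1 is a perfect fit, 0 matches the mean."""
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - np.mean(obs)) ** 2)

def log_nse(obs, sim, eps=1e-6):
    """NSE computed on log-transformed flows, damping high-flow errors."""
    return nse(np.log(obs + eps), np.log(sim + eps))

def seasonal_nse_loss(obs, sim, season_ids):
    """Average of per-season (1 - NSE) terms, so errors in low-flow seasons
    are not overshadowed by errors in the high-flow season."""
    losses = [1.0 - nse(obs[season_ids == s], sim[season_ids == s])
              for s in np.unique(season_ids)]
    return float(np.mean(losses))
```

Because each season contributes equally to `seasonal_nse_loss`, a large spring error and a small summer error carry the same weight, which is the behavior sketched in the paragraph above.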

Smaller Improvements in Predicting Water Temperature
The multi-task models had much smaller improvements on the single-task models when the main variable was stream temperature compared to when the main variable was streamflow. One reason for this difference may be that, in terms of NSE, water temperature is easier to predict than streamflow, so less improvement can be made in predicting water temperature. Both the seasonal trends in water temperature and the constrained range (between 0 and about 30 degrees C) make water temperature predictions easier in terms of annual NSE. In general, stream temperatures have a predictable sinusoidal pattern with highest temperatures in the summer and lowest temperatures in the winter.

Selecting an Appropriate Multi-Task Scaling Factor
Our results anecdotally suggest that a larger multi-task scaling factor would be best if the auxiliary variable is easier to predict. In terms of NSE, predicting stream temperature is easier than predicting streamflow; therefore, when streamflow was the main variable, the best multi-task scaling factors were larger (most commonly 80 in Experiment A). When predicting stream temperature, the best multi-task scaling factors were smaller (most commonly 0.01 in Experiment D). One way to get a general sense of the relative predictability of the two variables could be the ratio of the variables' standard deviations or ranges. For example, the ratio of standard deviations between streamflow and stream temperature averaged across all sites was 3.3 and the ratio of the scaled ranges was 6.7. Because both values are greater than one, this indicates that the unscaled multi-task loss will favor streamflow. If streamflow is the primary variable, a multi-task scaling factor greater than one would likely be needed for stream temperature errors to noticeably impact the overall gradient. Although more tuning would be needed to find the optimal factor, this simple comparison could provide a starting point for the search.
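The heuristic described above can be sketched as follows; the function name and its use as a starting point for the scaling factor search are our illustration, not a tuned procedure:

```python
import numpy as np

def scaling_factor_start(main_obs, aux_obs):
    """Rough starting point for the multi-task scaling factor search: the
    ratio of the main variable's spread to the auxiliary variable's. A
    ratio greater than one suggests the unscaled loss favors the main
    variable, so the auxiliary term likely needs a factor above one to
    noticeably influence the gradient."""
    return float(np.std(main_obs) / np.std(aux_obs))
```

A hyperparameter search could then cover a grid of values around this starting point rather than spanning several orders of magnitude blindly.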
Our results also suggest that a larger multi-task scaling factor should be used if the auxiliary variable is represented in only a small percentage of training samples. In our data some training samples (365-day sequences) were without any temperature observations. These sequences produced gradients in model training that were only informed by the streamflow data; these were effectively single-task gradients. In Experiment A, we saw, in general, an inverse relationship between the number of training samples with water temperature data and the best multi-task scaling factor (Figure 10). It seemed that when there were fewer training samples with auxiliary data, a larger multi-task scaling factor was needed to influence the overall gradient used to update the model weights. The larger multi-task scaling factor counter-balanced the influence from the larger number of gradients from the samples with only data from the main variable. This explanation also holds for Experiment C, in which the percentage of training samples with auxiliary data was higher (either 50% or 100%) and the best multi-task scaling factors were typically smaller (most commonly 10).
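The masking behavior described above can be sketched as a combined loss in which time steps with missing auxiliary observations contribute nothing; the mean-squared-error form here is an illustrative assumption rather than the exact loss used in our models:

```python
import numpy as np

def multitask_loss(y_main, pred_main, y_aux, pred_aux, scale):
    """Combined loss: main-variable MSE plus `scale` times the auxiliary
    MSE. Time steps with missing (NaN) auxiliary observations are masked
    out, so a sequence with no auxiliary data yields an effectively
    single-task loss."""
    main_term = np.mean((y_main - pred_main) ** 2)
    mask = ~np.isnan(y_aux)
    # With no auxiliary observations, the auxiliary term (and hence its
    # gradient) vanishes entirely.
    aux_term = np.mean((y_aux[mask] - pred_aux[mask]) ** 2) if mask.any() else 0.0
    return main_term + scale * aux_term
```

When many sequences fall into the all-NaN case, only a minority of gradient updates carry auxiliary information, which is why a larger `scale` was needed to counter-balance the single-task gradients.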

Multi-Task Learning Versus Using the Auxiliary Information as Input
Multi-task learning is but one approach for using information contained in an auxiliary variable to improve main variable performance. Another approach is to use the auxiliary variable as an input to the model. For example, when Rahmani et al. (2020) used streamflow observations as an input variable to predict water temperature, water temperature predictions improved. When the auxiliary variable is used as an input to a model, the model parameters are trained to leverage the information therein during the model training phase, an automated process. This approach uses the auxiliary information without the time cost of tuning the additional hyperparameter needed in multi-task learning (i.e., the multi-task scaling factor). Although the approach of using the auxiliary data as an input does not require this hyperparameter tuning, it does require a value for the auxiliary variable at each time step for each location modeled in the training, validation, and testing periods.
Compared to using the auxiliary data as input, multi-task modeling may be a more efficient use of available auxiliary data when locations only have sparse observations. For example, in our data set, only 4 out of 101 sites had more than 80% coverage for water temperature observations in the training, validation, and test time periods. A forecasting scenario is another case where auxiliary observations are not available. Although in both sparse data and forecast scenarios a secondary model could be used to predict future or missing data, a multi-task approach has the advantage of being able to learn from the available auxiliary data without relying on the existence or accuracy of a secondary model.
Convenience aside, multi-task learning appears to produce more accurate predictions than using auxiliary data as input in some sparse data conditions. We performed a preliminary investigation comparing streamflow prediction results between the best-configured multi-task models and models that used stream temperature data as input. Our results showed that when a substantial number of water temperature observations were missing, the best-configured multi-task models were more accurate overall (see Text S2, Table S2, and Figure S1 in Supporting Information S1).

Limitations and Potential Future Work
As with most data-driven approaches, the models we used were limited by the training data available. Likewise, the effectiveness of the approach rests on the assumption that the training conditions are a good predictor of the prediction conditions. Evidence indicates that this assumption of stationarity is becoming less valid given the observed and expected climate change (McCabe & Wolock, 2002). Nevertheless, in our results the models performed well even when trained on data from certain decades (e.g., 1990-2014 for Experiment A) and evaluated on data from other decades (e.g., 1980-1989). Despite any changes to the climate that may have already occurred and that may occur in the future, the underlying physical, chemical, and biological mechanisms that drive the processes that are modeled will not change. This highlights the importance of the infusion of process understanding into data-driven models rather than relying wholly on data that cannot adequately capture the full process being modeled.
There may be more efficient methods of deciding on an appropriate multi-task scaling factor compared to brute-force hyperparameter tuning. For example, Kraft et al. (2020) used the uncertainty in each variable to scale the influence of the variable in the loss function. One reason we did not explore this approach is that it does not allow for the explicit designation of a main variable of interest, whose performance should not be sacrificed at the expense of the other variables being predicted. We were instead interested in an approach that would only benefit the prediction of the main variable. Some more automated approaches for achieving this objective could also be explored, such as those by Du et al. (2020) and Liu et al. (2019).
It can reasonably be assumed that the appropriateness of a multi-task scaling factor is largely dependent on the process relationship between the two variables of interest. Because the statistical relationship between two variables can be influenced (if not determined) by physical characteristics and physical, chemical, and/or biological processes, it could be possible to leverage our process understanding to estimate an appropriate multi-task scaling factor for a given site and combination of variables. Although we did not discover a predictable pattern for estimating an appropriate multi-task scaling factor value based on catchment characteristics and data availability, that pattern may exist for streamflow and stream temperature with another set of catchment characteristics or a different method of relating them to the multi-task scaling factor. If multi-task learning were being explored for other variables, there may be more easily discoverable patterns that would narrow the search for a beneficial multi-task scaling factor. This paper examines multi-task learning for just two hydrologic variables; however, the approach could be applied to predicting other variables. We hypothesize that for multi-task learning to have benefit, the variables would have to have some common drivers (stemming from a physical/chemical/biological relationship). There would also need to be sufficient data for both variables from which the multi-task model could learn. For example, other parts of the water balance equation that are observed, such as soil moisture, could be used to train a multi-task model with a main objective of predicting streamflow. Another example is the modeling of dissolved oxygen and water temperature together.
Multi-task learning could be used to predict any number of observed variables. For example, a model could be trained to predict streamflow, water temperature, and soil moisture. As we discussed above, the selection of the multi-task scaling factor would be an important part of a successful multi-task model. This selection would be even more important (and likely more difficult) if multiple variables were used as auxiliary variables because each auxiliary variable would need its own multi-task scaling factor. This would also open the possibility of assimilating observations of multiple observed variables in a forecast application.
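With multiple auxiliary variables, the loss would carry one scaling factor per auxiliary variable; a minimal sketch (variable names are hypothetical):

```python
def multi_aux_loss(main_err, aux_errs, scales):
    """Total loss with several auxiliary variables, each error term
    weighted by its own multi-task scaling factor."""
    return main_err + sum(scales[name] * err for name, err in aux_errs.items())
```

Each added auxiliary variable adds one more scaling factor to tune, so the hyperparameter search space grows multiplicatively with the number of auxiliary variables.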
Although the variables for multi-task learning must have enough data for training, that data could be simulated data. For example, it may be fruitful to train a multi-task model on all water balance outputs produced from a process-based hydrologic model such as the Soil and Water Assessment Tool (SWAT) model (e.g., evapotranspiration, runoff, infiltration; USDA, 2018). Then the model could be fine-tuned on just the data for which observations are available. The model would presumably learn the dynamics of the hydrologic cycle more completely from the process model output and could then better predict the variable of interest when given unseen inputs.
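The pretrain-then-fine-tune workflow could look like the following toy sketch, in which a simple linear model stands in for the DL model, synthetic data stand in for process-model output and for a sparse gauge record, and all names are illustrative:

```python
import numpy as np

def fit_linear(x, y, w=0.0, b=0.0, lr=0.5, steps=2000):
    """Minimal gradient-descent fit of y ~ w*x + b, starting from (w, b)."""
    for _ in range(steps):
        err = (w * x + b) - y
        w -= lr * np.mean(err * x)
        b -= lr * np.mean(err)
    return w, b

rng = np.random.default_rng(0)

# "Pretrain" on abundant simulated output (a stand-in for process-model
# output such as SWAT runoff) ...
x_sim = rng.uniform(0.0, 1.0, 500)
w_pre, b_pre = fit_linear(x_sim, 2.0 * x_sim + 0.5)

# ... then fine-tune on a handful of observations with a slightly different
# relationship (a stand-in for a sparse observational record), starting from
# the pretrained parameters rather than from scratch.
x_obs = rng.uniform(0.0, 1.0, 10)
w_ft, b_ft = fit_linear(x_obs, 2.2 * x_obs + 0.4, w=w_pre, b=b_pre, steps=1000)
```

The key design choice is that fine-tuning starts from the pretrained parameters, so the sparse observations only need to nudge an already-reasonable model rather than determine it entirely.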

Conclusions
Our study examined the value of multi-task DL for leveraging information from related hydrologic variables to improve predictions. In the multi-task approach, one DL model was trained to predict both streamflow and water temperature. The effectiveness of the approach was tested using four experiments: (a) predicting streamflow as the main variable with 25 years of training data using site-specific models, (b) predicting streamflow as the main variable with 25 years of training data using one multi-site model, (c) predicting streamflow as the main variable with 2 years of training data using site-specific models, (d) and predicting water temperature as the main variable using site-specific models. For the experiments we used data from sites throughout the conterminous United States where coincidental streamflow and water temperature observations were available.
When using single-site models, the best site-specific, multi-task configuration significantly improved streamflow predictions for 56 of the 101 monitoring sites with the full 25 years of training data in terms of NSE. At 4 sites, the best multi-task models were significantly worse than the single-task models, and at the remaining 43 sites there was no significant difference. The improvements made by the multi-task approach when using the multi-site model were much smaller: with the site-specific best configuration, 89 sites had no significant difference and 11 were significantly improved. With only 2 years of training data, the best site-specific multi-task approach significantly improved streamflow predictions at 46 out of 98 sites. When predicting water temperature, the multi-task approach improved predictions at only seven sites. The main reason that the multi-task approach had a smaller benefit for stream temperature predictions was likely that the stream temperature predictions were already very good in terms of NSE (median NSE > 0.95).
One of the most important factors in the effectiveness (or lack thereof) of the approach for a given site was the multi-task scaling factor. This model hyperparameter scaled the influence of the errors in the secondary (or auxiliary) variable within the overall error used to update the model parameters. When streamflow was the primary (or main) variable, the scaling factor had to be quite high (most commonly the best factor was 80) for the single-site models. In contrast, when the main variable was stream temperature, the scaling factor had to be very low (most commonly the best factor was 0.01). A naïve factor of 1 led to models that performed much worse overall for predicting stream temperature than the single-task models. The main reason for the difference in the scaling factors could be attributed to the difference in variability between the two variables. Seasonal trends in the effectiveness of the multi-task model, as well as the characteristics of the catchments that benefitted most and least from the approach, indicate that base flow is an important factor in whether this approach successfully improves streamflow predictions.
Streamflow and water temperature are but two among many variables that are part of the physical, chemical, and biological processes that are ecologically and societally consequential. Our findings indicate that multi-task learning, when configured properly, can be an effective way to leverage knowledge of the relationship between hydrologic variables in a DL model for improved predictions. The multi-task approach did not improve predictions in all of the cases we tried, and where there were benefits, proper configuration was needed to realize them.