Uncertainty Estimation with Deep Learning for Rainfall-Runoff Modelling

Deep Learning is becoming an increasingly important way to produce accurate hydrological predictions across a wide range of spatial and temporal scales. Uncertainty estimates are critical for actionable hydrological forecasting, and while standardized community benchmarks are becoming an increasingly important part of hydrological model development and research, similar tools for benchmarking uncertainty estimation are lacking. We establish an uncertainty estimation benchmarking procedure and present four Deep Learning baselines, three of which are based on Mixture Density Networks and one on Monte Carlo dropout. Additionally, we provide a post-hoc model analysis to put forward some qualitative understanding of the resulting models. Most importantly, however, we show that accurate, precise, and reliable uncertainty estimation can be achieved with Deep Learning.

high resolution (terminology adopted from Renard et al., 2010). Reliability measures how consistent the provided uncertainty estimates are with respect to the available observations, and resolution measures the "sharpness" of the distributional predictions (i.e., how thin the body of the distribution is). Generally, models with higher resolution are preferable. However, this preference is conditional on the models being reliable: A model should not be overly precise relative to its accuracy (over-confident) or overly disperse relative to its accuracy (under-confident). We note that the best metrics for comparing distributional predictions would be proper scoring rules, such as likelihoods (see, e.g., Gneiting and Raftery, 2007). Likelihoods, however, do not exist on an absolute scale (it is generally only possible to compare likelihoods between models), which makes them difficult to interpret (although, see: Weijs et al., 2010). Additionally, they can be difficult to compute for certain types of uncertainty estimation approaches, and are thus not completely general for future benchmarking studies. We therefore based the assessment of reliability on probability plots, and evaluated resolution with a set of summary statistics.
Reliability. Probability plots (Laio and Tamea, 2007) are based on the following observation: If we insert the observations into the estimated cumulative distribution function, a consistent model will provide a uniform distribution on the interval [0,1].
The probability plot uses this to provide a diagnostic. The theoretical quantiles of this uniform distribution are plotted on the x-axis, and the fraction of observations that fall below the corresponding predictions on the y-axis (Figure 2). Deficiencies appear as deviations from the 1:1 line: a perfect model should capture 10% of the observations below the 10% threshold, 20% below the 20% threshold, and so on. If the relative counts of observations in particular modeled quantiles are higher than the theoretical quantiles, the model is under-confident. Similarly, if the relative counts are lower than the theoretical quantiles, the model is over-confident. Laio and Tamea (2007) proposed to use the probability plot in a continuous fashion to avoid arbitrary binning. We preferred to use discrete steps in our quantile estimates to avoid falsely reporting overly precise results (e.g., Cole, 2015). As such, we chose a 10% step size for the binning thresholds in our experiments. We used 10 thresholds in total: one for each of the resulting steps, plus the additional 1.0 threshold, which used the highest sampled value as an upper bound so that an intuition regarding the upper limit can be obtained. Subtracting the thresholds from the relative counts yields the deviation from the 1:1 line (the sum of which is sometimes referred to as the expected calibration error; see, e.g., Naeini et al., 2015). For the evaluation we depicted this counting error alongside the probability plots for better readability (see: Figure 2 (b)).
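To make the counting procedure concrete, the coverage computation and the resulting calibration error can be sketched as follows (a NumPy sketch; the array layout and function names are ours, not prescribed by the benchmarking procedure). For a well-calibrated model the deviations hover around zero.

```python
import numpy as np

def probability_plot_counts(samples, obs, thresholds):
    """Empirical coverage for a probability plot (cf. Laio and Tamea, 2007).

    samples:    (T, S) predictive samples, one row per timestep
    obs:        (T,) observed values
    thresholds: quantile levels, e.g. [0.1, 0.2, ..., 1.0]
    Returns the fraction of observations below each predicted quantile
    and the deviation from the 1:1 line
    (positive: under-confident, negative: over-confident).
    """
    qs = np.quantile(samples, thresholds, axis=1)   # (K, T) predicted quantiles
    coverage = (obs[None, :] <= qs).mean(axis=1)    # fraction captured per threshold
    return coverage, coverage - np.asarray(thresholds)

def expected_calibration_error(samples, obs, thresholds):
    """Mean absolute deviation from the 1:1 line (cf. Naeini et al., 2015)."""
    _, deviation = probability_plot_counts(samples, obs, thresholds)
    return np.abs(deviation).mean()
```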
A deficit of the probability plot is its coarseness, since it represents an aggregate over time and basins. As such, it provides a general overview, but necessarily neglects many aspects of hydrological importance. Many expansions of the analytical range are possible. One that suggested itself was to examine the deviations from the 1:1 line for different basins. Therefore, we evaluated the probability plot for each basin specifically, computed the deviations from the 1:1 line, and examined their distributions. We did not include the 1.0 threshold in this analysis since it consisted of large spikes only.
Resolution. To motivate why further metrics are required on top of the reliability plot it is useful to make the following observation: There are infinitely many models that produce perfect probability plots. One edge-case example is a model that simply ignores the inputs and produces the unconditional empirical data distribution at every timestep. Another edge-case example is a hypothetical "perfect" model that produces delta distributions at exactly the observations every time. Both of these models have precision that exactly matches accuracy, and the two could not be distinguished from each other using a probability plot. Similarly, a model which is consistently under-confident for low flows can compensate for this by being over-confident for high flows. Thus, to better assess the uncertainty estimations, at least one further dimension of the problem has to be checked: the resolution.

Table 1. Overview of the benchmarking metrics for assessing model resolution. Each metric is applied to the distributional streamflow predictions at each individual timestep and then aggregated over all time steps and basins. All metrics are defined on the interval [0, ∞) and lower values are preferable (but not unconditionally on the reliability).

Benchmarking metric (a) | Description
Mean absolute deviation | More robust than standard deviation and variance.
Standard deviation | We use Bessel's correction to account for one degree of freedom.
Variance | We use Bessel's correction to account for one degree of freedom.
Average width of the 0.2 to 0.9 quantiles | We compute the width of each of the inner quantiles and take the mean.
Distance between the 0.25 and 0.75 quantiles | Average interquartile range.
Distance between the 0.1 and 0.9 quantiles | Average interdecile range.

(a) All metrics are computed for the samples of each timestep and then averaged over time and basins.
To assess the resolution of the provided uncertainty estimates, we used a group of metrics (Table 1). Each metric was computed for all available data points and averaged over all time steps and basins. The results are statistics that characterize the overall sharpness of the provided uncertainty estimates (roughly speaking, they give us a notion of how thin the body of the distributional predictions is). Further, to provide an anchor for interpreting the magnitudes of these statistics, we also computed them for the observed streamflow values (this yields an unconditional empirical distribution for each basin that can be aggregated). The two are not strictly comparable, but we believe that the latter still provides some form of guidance.
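The statistics in Table 1 can be sketched directly on the per-timestep samples (a NumPy sketch; taking the absolute deviation about the median is an assumption on our part, and the dictionary keys are our own shorthand):

```python
import numpy as np

def resolution_metrics(samples):
    """Sharpness statistics in the spirit of Table 1, computed per
    timestep and then averaged.

    samples: (T, S) array of predictive samples, one row per timestep.
    """
    med = np.median(samples, axis=1, keepdims=True)
    mad = np.abs(samples - med).mean(axis=1)      # mean absolute deviation (about the median)
    std = samples.std(axis=1, ddof=1)             # Bessel's correction (one degree of freedom)
    var = samples.var(axis=1, ddof=1)
    q = np.quantile(samples, [0.1, 0.25, 0.75, 0.9], axis=1)
    iqr = q[2] - q[1]                             # interquartile range (0.25 to 0.75)
    idr = q[3] - q[0]                             # interdecile range (0.1 to 0.9)
    return {name: vals.mean()                     # average over timesteps
            for name, vals in
            {"mad": mad, "std": std, "var": var, "iqr": iqr, "idr": idr}.items()}
```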

Baselines: Uncertainty Estimation with Deep Learning
We tested four strategies for uncertainty estimation with deep learning. These strategies fall into two broad categories: Mixture Density Networks (MDN) and Monte Carlo Dropout (MCD). We believe that these approaches represent a useful set of baselines for benchmarking.

Mixture Density Networks
The first class of approaches uses a Neural Network to mix different probability densities. Mixing is done using weighted sums. Mixture weights are larger than zero and collectively sum to one, which guarantees that the mixture is also a density function. These weights can therefore be seen as the probability of a particular mixture component. Usually, the number of mixture components is discrete; however, this is not a strict requirement.

Figure 3. [Example mixtures with fixed component parameters.] Plot (c) shows the same juxtaposition, but the mixture is derived from a different weighting w = (0.80, 0.15, 0.05). The plots demonstrate that even in this simple example with fixed parameters the skewness and form of the mixed distribution can vary strongly.
The core idea is to use the outputs of a Neural Network to determine the mixture weights and the parameters of a mixture of densities (see Figure 3). That is, for a given input, the network determines a conditional density function, which it builds by mixing a set of predefined base-densities (the so-called components). The output of an MDN is thus an estimation of a conditional density, since the mixture directly depends on a given input (Figure 4). The mixture changes every time the network receives new inputs (i.e., in our case at every timestep). We thus obtain time-varying predictive distributions that can approximate a wide variety of distributions (they can, for example, account for asymmetric and multimodal properties). The resulting model is trained by maximizing the log-likelihood of the observations under the predicted mixture distributions. We view MDNs as intrinsically distributional, in the sense that they provide probability distributions instead of first making deterministic streamflow estimates and then appending a sampling distribution.
In this study, we tested three different MDN approaches:
1. Gaussian Mixture Models (GMMs) are MDNs with Gaussian mixture components. Appendix B1 provides a more formal definition as well as details on the loss/objective function.
2. Countable Mixtures of Asymmetric Laplacians (CMAL) are similar to GMMs but instead of Gaussians, the mixture components are asymmetric Laplacian distributions (ALD). This allows for an intrinsic representation of the asymmetric uncertainties that often occur with hydrological variables like streamflow. Appendix B2 provides a more formal description as well as details on the loss/objective function.
3. Uncountable Mixtures of Asymmetric Laplacians (UMAL) also use asymmetric Laplacians as mixture components but the mixture is not discretized. Instead, UMAL approximates the conditional density by using Monte Carlo integration over distributions obtained from quantile regression (Brando et al., 2019). Appendix B3 provides a more formal description as well as details on the loss/objective function.
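For concreteness, the GMM objective, i.e., the negative log-likelihood of the observations under the predicted mixture, can be sketched as follows (a NumPy sketch of the loss described in Appendix B1; in the actual model the weights, means, and standard deviations are produced by the network head):

```python
import numpy as np

def gmm_nll(weights, means, stds, y):
    """Negative log-likelihood of observations under a Gaussian mixture.

    weights, means, stds: (T, K) per-timestep mixture parameters,
                          with weights summing to one per row.
    y: (T,) observations.
    """
    # Log-density of each Gaussian component, evaluated at y: (T, K)
    log_comp = (-0.5 * ((y[:, None] - means) / stds) ** 2
                - np.log(stds) - 0.5 * np.log(2 * np.pi))
    log_mix = np.log(weights) + log_comp
    # Log-sum-exp over components for numerical stability
    m = log_mix.max(axis=1, keepdims=True)
    log_lik = m[:, 0] + np.log(np.exp(log_mix - m).sum(axis=1))
    return -log_lik.mean()
```

Minimizing this quantity over the network parameters is what "maximizing the log-likelihood" means in practice.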
One can read this enumeration as a transition from simple to complex: We start with Gaussian mixture components, then replace them with ALD mixture components, and lastly transition from a fixed number of mixture components to an implicit approximation. There are two reasons why we think the more complex MDN methods might be more promising than a simple GMM. First, error distributions in hydrologic simulations often have heavy tails. A Laplace component lends itself towards thicker-tailed uncertainty (Figure 5). Second, streamflow uncertainty is often asymmetrical, and thus the ALD component could make more sense than a symmetric distribution in this application. For example, even a single ALD component can be used to account for zero flows (compare Figure 5 (b)). UMAL extends this to avoid having to pre-specify the number of mixture components, which removes one of the more subjective degrees of freedom from the model design.

Figure 5. Characterization of distributions that are used as mixture components in our networks. Plot (a) superimposes a Gaussian with a Laplace distribution. The latter is sharper around its center, but this sharpness is traded for thicker tails: the difference in area is moved to the center and the tails of the distribution. Plot (b) illustrates how the asymmetric Laplace distribution (ALD) can accommodate differences in skewness (via an additional parameter).
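As a sketch of this building block, the log-density of an asymmetric Laplace distribution in its quantile parameterization can be written as follows (the parameter names are ours; the exact parameterization used for the CMAL/UMAL components is given in the appendices):

```python
import numpy as np

def ald_logpdf(y, mu, b, tau):
    """Log-density of the asymmetric Laplace distribution.

    mu:  location, b: scale, tau in (0, 1): asymmetry.
    For tau = 0.5 this reduces to the symmetric Laplace distribution.
    """
    u = (y - mu) / b
    rho = u * (tau - (u < 0))                  # tilted (pinball) loss
    return np.log(tau * (1 - tau) / b) - rho
```

The asymmetry parameter tau shifts probability mass to one side of the location, which is what lets a single component represent skewed streamflow uncertainty.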

Monte Carlo Dropout
MCD provides an approach for estimating a basic form of epistemic uncertainty. In the following we provide the intuition behind its application. Dropout is a regularization technique for Neural Networks, but it can also be used for uncertainty estimation (Gal and Ghahramani, 2016). Dropout randomly ignores specific network units (see Figure 6). Hence, each time the model is evaluated during training, the network structure is different. Repeating this procedure many times results in an ensemble of many submodels within the network. Dropout regularization is used during training only, while during model evaluation the whole Neural Network is used. Gal and Ghahramani (2016) showed that dropout can also be used as a sampling technique for Bayesian inference - hence the name Monte Carlo dropout.
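The mechanism can be sketched in a few lines. The following toy example (NumPy, a two-layer feed-forward net rather than the paper's LSTM, with weights supplied by the caller) keeps dropout active at inference and aggregates repeated forward passes:

```python
import numpy as np

def mc_dropout_predict(x, W1, W2, p=0.2, n_samples=100, rng=None):
    """Monte Carlo dropout: repeated stochastic forward passes.

    x:  (1, D) input; W1: (D, H) and W2: (H, 1) weight matrices.
    Returns the predictive mean and spread over n_samples passes.
    """
    rng = rng or np.random.default_rng()
    preds = []
    for _ in range(n_samples):
        mask = rng.random(W1.shape[1]) >= p              # drop hidden units at random
        h = np.maximum(x @ W1, 0.0) * mask / (1 - p)     # inverted-dropout scaling
        preds.append(h @ W2)
    preds = np.array(preds)
    return preds.mean(axis=0), preds.std(axis=0)
```

The spread of the resulting ensemble of submodel outputs is what serves as the uncertainty estimate.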

Model Setup
All models are based on the LSTMs from Kratzert et al. (2020), with an additional hidden layer after the LSTM to give the model more flexibility in estimating the mixture weights and components. This means that the presented predictions are from simulation models sensu Beven and Young (2013), i.e., no previous discharge observations were used as inputs. The LSTM in the context of hydrology is described elsewhere (e.g., Kratzert et al., 2018) and not repeated here. However, all models can be adapted to a forecasting setting.
In short, our setting was the following: Each model takes a set of meteorological inputs (namely: precipitation, solar radiation, min. and max. daily temperature, and vapor pressure) from a set of products (namely: NLDAS, Maurer, and DayMet). As in our previous studies, a set of static attributes is concatenated to the inputs (see: Kratzert et al., 2019b). The training period is from 01 October 1980 to 30 September 1990. The validation period is from 01 October 1990 to 30 September 1995. Finally, we trained all MDNs with the log-likelihood, and the MCD as in Kratzert et al. (2020), except that the loss was the mean squared error (as proposed by Gal and Ghahramani, 2016). All hyperparameters were selected so as to provide the smallest average deviation from the 1:1 line of the probability plot for each model. For GMM this resulted in 10 components and for CMAL in 3 components (Appendix A).
To make the benchmarking procedure work at the most general level, we employed the setup depicted in Figure 7. It allows any approach that can generate samples to be plugged into the framework (as evidenced by the inclusion of MCD). For each basin and time step the models either predict the streamflow directly (MCD) or provide a distribution over the streamflow (GMM, CMAL, and UMAL). In the latter case, we then sampled from the distribution to get 7500 sample points for each data point. Since the distributions have infinite support, samples below 0 m³ d⁻¹ are possible. In this case, we truncated the distribution by setting such samples to zero. All in all, this resulted in 531 × 3650 × 7500 simulation points for each model and metric. Exaggerating a little, we could say that we actually deal with "multi-point" predictions here.
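The sampling-and-truncation step can be sketched as follows (a NumPy sketch with Gaussian components; the function name and array layout are ours):

```python
import numpy as np

def sample_mixture_truncated(weights, means, stds, n_samples, rng=None):
    """Draw per-timestep samples from a Gaussian mixture and truncate
    negative values to zero (the distributions have infinite support,
    but streamflow cannot be negative).

    weights, means, stds: (T, K) mixture parameters per timestep.
    Returns a (T, n_samples) array.
    """
    rng = rng or np.random.default_rng()
    T, K = weights.shape
    # Pick a component per sample according to the mixture weights
    comp = np.array([rng.choice(K, size=n_samples, p=w) for w in weights])
    rows = np.arange(T)[:, None]
    draws = rng.normal(means[rows, comp], stds[rows, comp])
    return np.maximum(draws, 0.0)   # truncation at zero streamflow
```

Because the mixture parameters come from a single forward pass, arbitrarily many samples can be drawn without re-executing the model.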

Post-hoc Model Examination: Checking Model Behavior
We performed a post-hoc model examination as a complement to the benchmarking to avoid potential blind spots. The analysis has three parts, each associated with a specific property: 1. Accuracy: How accurate are single-point predictions obtained from the distributional predictions?
2. Internal consistency: How are the mixture components used with regard to flow conditions?
3. Estimation quality: How can we examine the properties of the distributional predictions with regard to second-order uncertainties?
The following subsections lay out our rationale and method of investigating each of these questions.

Accuracy: Single-Point Predictions
To address accuracy, we used standard performance metrics applied to single-point predictions. The term single-point prediction is used here in the statistical sense of a point estimator, to differentiate it from distributional predictions. Single-point predictions were derived as the mean of the distributional prediction at each timestep and evaluated by aggregating over the different basins, using the mean and median as aggregation operators (as in Kratzert et al., 2019b). Table 2 summarises the different metrics.
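A minimal sketch of this derivation step (the mixture mean as point estimate, here for Gaussian components) together with the Nash-Sutcliffe efficiency, which we use as an illustrative stand-in for the metrics in Table 2 (the table itself is not reproduced here):

```python
import numpy as np

def gmm_point_prediction(weights, means):
    """Single-point prediction as the mean of the distributional
    prediction; for a Gaussian mixture this is simply the weighted
    sum of the component means.

    weights, means: (T, K) per-timestep mixture parameters.
    """
    return (weights * means).sum(axis=1)

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 is perfect, 0 matches the
    observed mean as a predictor."""
    return 1.0 - ((sim - obs) ** 2).sum() / ((obs - obs.mean()) ** 2).sum()
```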

Internal Consistency: Mixture Component Behavior
To get an impression of the model consistency we looked at the behavioral properties of the mixture densities themselves.
The goal was to get some qualitative understanding about how the mixture components are used in different situations. As a prototypical example for this kind of examination we refer to the study of Ellefsen et al. (2019). It examined how LSTMs use the mixture weights to predict the future within a simple game setting. Similarly, Nearing et al. (2020a) reported that a GMM produced probabilities that changed in response to different flow regimes. We conducted the same exploratory experiment with the best-performing benchmarked approach.

Estimation Quality: Second Order Uncertainty
MDNs allow us to check the quality of the given distributional predictions. The basic idea here is that the predicted distributions are estimations themselves. MDNs provide us with an estimation of the aleatoric uncertainty in the data, and MCD provides a basic estimation of the epistemic uncertainty. The estimated uncertainties are thus not the uncertainties themselves, but estimates thereof, and as such subject to uncertainty in turn. This, of course, holds for all forms of uncertainty estimates, not just for MDNs. However, MDNs provide us with single-point predictions of the distribution parameters and mixture weights. We can therefore assess the uncertainty of the estimated mixture components by checking how perturbations (e.g., in the form of input noise) influence the distributional predictions. This can be important in practice. For example, if we mistrust a given input - let us say because similar events have rarely been observed so far, or because we suspect some form of error - we can use a second-order check to obtain a qualitative understanding of the goodness of the estimate.
Concretely, we examined how second-order effects on the estimated uncertainty can be checked with the MCD approach (which provides estimations of some form of epistemic uncertainty), as it can be layered on top of the MDN approaches (which provide estimations of the aleatoric uncertainty). This means that the Gaussian Process interpretation by Gal and Ghahramani (2016) cannot be strictly applied. We can nonetheless use MCD as a perturbation method, since it still forces the model to learn an internal ensemble.
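As a generic stand-in for this kind of check, the following sketch perturbs the inputs with relative noise (instead of the dropout masks used in the paper, which require access to the network internals) and reports how much the predicted distribution parameters move:

```python
import numpy as np

def second_order_spread(predict_params, x, sigma=0.01, n=50, rng=None):
    """Probe second-order uncertainty by perturbing the inputs and
    measuring the stability of the predicted parameters.

    predict_params: callable mapping an input array to a parameter
                    vector (e.g., mixture weights and locations).
    Returns the mean and spread of the parameters over n perturbed
    forward passes.
    """
    rng = rng or np.random.default_rng()
    params = np.array([
        predict_params(x * (1 + sigma * rng.standard_normal(x.shape)))
        for _ in range(n)
    ])
    return params.mean(axis=0), params.std(axis=0)
```

A large spread for a given input flags a distributional prediction whose parameters are themselves poorly determined.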

Benchmarking Results
The probability plots for each model are shown in Figure 8. The approaches that used mixture densities performed better than MCD, and among all of them, the ones that used asymmetric components (CMAL and UMAL) performed better than GMM.
CMAL had the best performance overall. All methods, except UMAL, tended to give estimates above the 1:1 line for thresholds lower than 0.5 (the median). This means that the models were generally under-confident in low-flow situations. GMM was the only approach that showed this type of under-confidence throughout all flow regimes - in other words, it was above the 1:1 line everywhere. The largest under-confidence occurred for MCD in the mid-flow range (between the 0.3 and 0.6 quantile thresholds).
For higher flow volumes, both UMAL and MCD underestimated the uncertainty. Overall, CMAL was close to the 1:1 line, and the distributions from CMAL and UMAL were better centered than those from GMM and MCD. At the outer bounds, a bias was induced by evaluating in probability space: It is more difficult to be over-confident as the thresholds get lower and, vice versa, more difficult to be under-confident as the thresholds become higher. At higher thresholds, UMAL had a larger tendency to fall below the center-line, which is also visible in the probability plot. Again, this is a consequence of the over-confident predictions from the UMAL approach for larger flow volumes (MCD also exhibited the same pattern of over-confidence for the highest threshold).
Lastly, Table 3 shows the results of the resolution benchmark. In general, UMAL and MCD provide the sharpest distributions.
This goes along with the over-confidently narrow distributions that both approaches exhibit for high flow volumes. These flows, having the largest uncertainties, also influence the average resolution the most. The other two approaches, GMM and CMAL, provide less sharp distributions. In the case of GMM, this is reflected in under-confidently wide distributions in the probability plot. Notably, the predictions of CMAL lie between those of the over-confident UMAL and the under-confident GMM. This makes sense from a methodological viewpoint, since we designed CMAL as an "intermediate step" between the two approaches. Moreover, this reflects a trade-off in the log-likelihood (which the models are trained for), where a balance between reliability and resolution has to be found in order to minimize the loss.

Table 4.

Accuracy
Among the uncertainty estimation approaches, the models with asymmetric mixture components (CMAL and UMAL) performed best. UMAL provided the best point estimates. This is in line with the high resolution in the uncertainty estimation benchmark: the sharpness makes the mean a better predictor of the likelihood's maximum, and indicates again that the approach trades reliability for accuracy. That said, even with our naive approach for obtaining single-point estimates (i.e., simply taking the mean), both CMAL and UMAL manage to outperform the model that is optimized for single-point predictions with regard to some metrics. This suggests that it could make sense to train a model to estimate distributions and then recover best estimates. One possible reason why this might be the case is that single-point loss functions (e.g., MSE) define an implicit probability distribution themselves (e.g., minimizing an MSE loss is equivalent to maximizing a Gaussian likelihood with fixed variance). Hence, using a more nuanced loss function (i.e., one that is the likelihood of a multimodal, asymmetrical, heterogeneous distribution) can improve performance even for the purpose of making non-distributional estimates. In fact, it is reasonable to expect that the results of the MDN approaches can be improved even further by using a more sophisticated strategy for obtaining single-point predictions (e.g., searching for the maximum of the likelihood).

Figure 10. Predictions and weight behavior for basin #14182500. Top: Hydrograph of an exemplary event in the test period with both the 5% to 95% and the 25% to 75% quantile range. Bottom: The weights (αi) of the CMAL mixture components for these predictions.

Falling limbs of the hydrograph (corresponding roughly to throughflow) are associated with the second mixture component (α2), which is low for both rising limbs and low-flow periods. The third component (α3) mainly controls the rising limbs and the peak-runoff, but also has some influence throughout the rest of the hydrograph. In effect, CMAL learns to separate the hydrograph into three parts - rising limb, falling limb, and low-flows - which correspond to the standard hydrological conceptualization. No explicit knowledge of this particular hydrological conceptualization is provided to the model - it is solely trained to maximize the overall likelihood.

Estimation Quality
In this experiment, we are interested in how uncertainties in the estimation of weights and parameters of the mixtures influence the distributional predictions. Figure 11 illustrates the procedure: The upper part shows a hydrograph with the 25%-75%

Computational Demand
This section gives an overview of the computational demand required to compute the different uncertainty estimations. All of the reported execution times were obtained on an NVIDIA P100 (16 GB RAM), using the PyTorch library (Paszke et al., 2019). A single execution of the CMAL model with a batch size of 256 takes 0.026 +0.001 −0.002 seconds (here, and in the following, the base value gives the median over 100 model runs; the sub- and superscript show the deviations to the 10% quantile and the 90% quantile, respectively). An execution of the MCD model takes 0.055 +0.002 −0.001 seconds. The slower execution time of the MCD approach here is explained by its larger hidden size: it used 500 hidden cells, in comparison to the 250 hidden cells of CMAL (see Appendix A).
Generating all the needed samples for the evaluation with MCD and a batch size of 256 would take approximately 360 days (since 7500 samples have to be generated for 531 basins and 10 years at a daily resolution). In practice, we could shorten this time to under a week by using considerably larger batch sizes and distributing the computations for different basins over multiple GPUs. In comparison, computing the same number of samples by re-executing the CMAL model would take around 174 days. In practice, however, only a single run of the CMAL model is needed, since MDNs provide us with a density estimate from which we can directly sample in a parallel fashion (and without needing to re-execute the model run). Thus, the CMAL model, with a batch size of 256, takes only ∼14 hours to generate all needed samples.

Conclusions and Outlook
Benchmarking hydrological models is a large undertaking because setting up a model over many watersheds requires substantial effort. Benchmarking uncertainty estimation approaches requires even more effort: In addition to setting up at least one hydrological model, most uncertainty estimation strategies require the setup and calibration of additional parts to predict uncertainty. These additions might be sampling distributions, ensembles, post-processors, and so forth. Regardless of these difficulties, uncertainty estimation is critical for hydrological forecasting. However, as of now, there is no way to assess different uncertainty estimation strategies for general or particular setups. One of our goals was thus to provide a public data-driven benchmarking scheme. Other modeling groups can build on it for more formal, or at least community-minded, benchmarking.
Our hope is that this will encourage better community practice and provide a foundation for others to build on.
With respect to the uncertainty estimation baselines presented here, the results using deep learning seem promising. The distributional predictions of the uncertainty estimation approaches tested here are dynamic, in that the probability distributions change in response to dynamic inputs (precipitation, etc.). Their general behavior is in accordance with basic hydrological intuition, in that the uncertainty increases with a rise in streamflow. This relationship is, however, non-linear and not a simple 1:1 mapping.
The MCD approach provided the worst uncertainty estimates. For one, this is very likely because the Gaussian assumption underlying its uncertainty estimates is difficult to uphold for many low- and high-flow situations. There is, however, also a more nuanced aspect to consider. MCD estimates a particular form of epistemic uncertainty, while the MDNs estimate the aleatoric uncertainty. These two uncertainty types can be seen as orthogonal to each other. They do, however, co-appear in our setting, since both the epistemic and the aleatoric uncertainties are largest for high flow volumes.
There was an advantage to using asymmetric distributions as exhibited by UMAL and CMAL, as compared to GMM both in terms of reliability of distributional predictions and accuracy of derived single-point predictions. The CMAL approach in particular gave distributional predictions that were very good in terms of reliability and sharpness (and single-point estimates).
We demonstrated that it exhibits a direct link between the predicted probabilities and hydrologic behavior, in that different distributions were activated (i.e., received larger mixture weights) for rising vs. falling limbs, etc.
Nevertheless, it is important for us to remind the reader that likelihood-based approaches (for estimating the aleatoric uncertainty) are prone to give over-confident predictions. We were not able to diagnose this empirically. This might be more a result of the limits of the inquiry than of the non-existence of the phenomenon. Thus, to conclude, we discuss the limits of the chosen benchmarking setup and the post-hoc examination: Our personal observation regarding benchmark data is that the current generation of open datasets is the result of a growing enthusiasm for community efforts. New benchmarking possibilities will become available in the near future. However, a downside of relying on open datasets is the missing defense against over-fitting on the test data. We therefore hope that in the future the development of datasets will withhold some test data to make even more rigorous benchmarking possible. As far as we know, this is, for example, already happening in the PLUMBER-2 intercomparison project developed at the University of New South Wales.
With regard to the metrics, the proposed methods exhaust the diagnostic capacity of the probability plot (at least as long as we use it in the proposed form). This is an encouraging sign for our ability to make reliable hydrologic predictions. The downside is that it might be hard for models to improve on this metric going forward. As mentioned throughout the manuscript, it is important to be aware that the probability plot misses several important aspects of probabilistic prediction (for example, precision, consistency, or event specific properties). We thus expect that future efforts will derive and use more powerful metrics, similar to the way we continue to develop diagnostic tooling for single-point predictions (see: Nearing et al., 2018).
Stronger baselines will then emerge in tandem with stronger metrics. From our perspective, there is plenty of room to build better ones. A perhaps self-evident example for the potential of improvements are ensembles: Kratzert et al. (2019b) showed the benefit of LSTM ensembles for single-point predictions, and we believe that similar approaches could be developed for uncertainty estimation. We are therefore sure that future research will yield improved approaches, and move us closer to achieving holistic diagnostic tools for the evaluation of uncertainty estimations (sensu Nearing and Gupta, 2015).
Our post-hoc examination demonstrated how the uncertainty estimation models can be analysed with regard to hydrological points of interest. Remarkably, one of the results showed that the distributional predictions are not only reliable and precise but also yield strong single-point estimates. By and large, however, this is just a start. Rainfall-runoff modelling is a complex endeavor. Often we do not just care that models perform well, but also about the way these predictions are made. Ultimately, we want to get the "right answers for the right reasons" (Kirchner, 2006). Current data is noisy, and metrics do not capture everything we want models to achieve (see also: Muller, 2018; Thomas and Uminsky, 2020). We therefore expect that, at least as of now, there will remain blind spots in comparative studies. Post-hoc examination can potentially remedy this fact and demonstrate to users how model checking can be done. We believe that it will play a central role in future benchmarking efforts and model developments.

Code and data availability. We will make the code for the experiments and the data of all produced results available online. We trained all our machine learning models with the neuralhydrology Python library (https://github.com/neuralhydrology/neuralhydrology). The CAMELS dataset with static basin attributes is accessible at https://ral.ucar.edu/solutions/products/camels.

A1 General Setup
Table A provides the general setup for the hyperparameter search and model training.

A2 Noise Regularization
Adding noise to the data during training can be viewed as a form of data augmentation and regularization that biases towards smooth functions. These are large topics in themselves, and at this stage we refer to Rothfuss et al. (2019) for an investigation on the theoretical properties of noise regularization and some empirical demonstrations. In short, plain maximum likelihood estimation can lead to strong over-fitting (resulting in a spiky distribution that generalizes poorly beyond the training data).
Training with noise regularization results in smoother density estimates that are closer to the true conditional density.
Following these findings, we also add noise as a smoothness regularization in our experiments. Concretely, we decided to use relative additive noise as a first-order approximation to the sort of noise contamination we expect in hydrological time series. The regularization operation is:

z* = z + z · N(0, σ),

where z is a placeholder variable for either the dynamic or static input variables or the observed runoff (and, as before, the time index is omitted for the sake of simplicity), N(0, σ) denotes a Gaussian noise term with mean zero and standard deviation σ, and z* is the obtained noise-contaminated variable.
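As a concrete illustration, the contamination above can be sketched in a few lines of NumPy (the function name, the noise level, and the example series are our own choices for this sketch, not the settings used in the experiments):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def add_relative_noise(z: np.ndarray, sigma: float) -> np.ndarray:
    """Contaminate z with relative additive Gaussian noise: z* = z + z * eps."""
    eps = rng.normal(loc=0.0, scale=sigma, size=z.shape)
    return z + z * eps

# Example: perturb a runoff series with 5 % relative noise during training.
runoff = np.array([1.0, 2.5, 10.0, 0.5])
noisy = add_relative_noise(runoff, sigma=0.05)
```

Because the perturbation scales with the magnitude of z, large runoff values receive proportionally larger noise, which matches the heteroscedastic error structure one expects in discharge observations.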

A3 Hyperparameter Search
To provide a meaningful comparison we conducted a hyperparameter search for each of the four conditional density estimators.
A hyperparameter search is an extended, usually computationally intensive, search for the best pre-configuration of a machine learning model.
In our case we searched over combinations of six different hyperparameters (see Table A2). To balance computational resources against search depth, we took the following course of action:

- First, we informally searched for sensible general presets.
- Second, we trained models for each combination of the four hyperparameters "Hidden Size" (the number of cells in the LSTM, see Kratzert et al., 2019b), "Noise" (added relative noise on the output, see Appendix A2), "Number of Densities" (the number of density heads in the mixture, only needed for MDN and CMAL), and "Dropout Rate" (the rate of dropout employed during training, and also during inference in the case of MCD). These are marked in Table A2 with a white background.
- Third, we chose the best resulting model and refined it by searching for the best settings of the hyperparameters "Batch Size" (the number of samples shown per backpropagation step) and "Learning Rate" (the parameter controlling the update per batch).
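The staged procedure above amounts to an exhaustive sweep over a small grid. A minimal sketch (the grid values are illustrative, not the settings from Table A2, and the objective is a placeholder):

```python
import itertools

# Hypothetical grid; the real values are listed in Table A2.
grid = {
    "hidden_size": [64, 128, 256],
    "noise": [0.0, 0.05, 0.1],
    "n_densities": [1, 3, 5],
    "dropout_rate": [0.0, 0.25, 0.5],
}

def evaluate(config):
    """Placeholder objective: in the real setup this would train an LSTM
    with the given configuration and return its validation loss."""
    return sum(float(v) for v in config.values())

best_config, best_score = None, float("inf")
for values in itertools.product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    score = evaluate(config)
    if score < best_score:
        best_config, best_score = config, score
```

In practice each `evaluate` call is expensive, which is why the search is split into a coarse stage over the four model hyperparameters and a refinement stage over batch size and learning rate.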

A4 Results
The results of the hyperparameter search are summarized in Table A3.

B1 Gaussian Mixture Models

Gaussian mixture models (GMMs; Bishop, 1994) are well-established for producing distributional predictions from a single input. The principle of GMMs is to have a neural network predict the parameters of a mixture of Gaussians (i.e., the means, standard deviations, and weights) and to use these mixtures as the distributional output. GMMs are a powerful concept. They have seen usage in diverse applications such as acoustics (Richmond et al., 2003), handwriting generation (Graves, 2013), sketch generation (Ha and Eck, 2017), and predictive control (Ha and Schmidhuber, 2018).
Given the rainfall-runoff modelling context, a GMM models the runoff q ∈ R at a given time step (subscript omitted for the sake of simplicity) as a probability distribution p(·) conditioned on the input x ∈ R^{M×T} (where M indicates the number of defined inputs, such as precipitation and temperature; and T the number of time steps provided to the neural network) as a mixture of K Gaussians:

p(q | x) = Σ_{k=1}^{K} α_k(x) N(q | µ_k(x), σ_k(x)),

where the α_k are mixture weights with the properties α_k(x) ≥ 0 and Σ_{k=1}^{K} α_k(x) = 1 (convex sum); and N(µ_k(x), σ_k(x)) denotes a Gaussian with mean µ_k and standard deviation σ_k. All three defining variables, i.e., the mixture weights, the means, and the standard deviations, are set by a neural network and are thus functions of the inputs x.
The negative logarithm of the likelihood of the training data under the estimated conditional distribution is used as the loss, i.e.:

L = −log( Σ_{k=1}^{K} α_k(x) N(q | µ_k(x), σ_k(x)) ). (B2)
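To make the mixture density and its negative log-likelihood loss concrete, they can be evaluated as follows (the parameter values are made up for illustration; in the actual model they are produced by the network as functions of x):

```python
import numpy as np

def gaussian_pdf(q, mu, sigma):
    """Density of N(mu, sigma) evaluated at q."""
    return np.exp(-0.5 * ((q - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def gmm_nll(q, alphas, mus, sigmas):
    """Negative log-likelihood of a K-component Gaussian mixture
    for a single observation q, given the network outputs."""
    mixture_density = np.sum(alphas * gaussian_pdf(q, mus, sigmas))
    return -np.log(mixture_density)

# Example: a 2-component mixture evaluated at an observed runoff value.
alphas = np.array([0.7, 0.3])   # non-negative, sum to 1 (convex sum)
mus = np.array([1.0, 3.0])
sigmas = np.array([0.5, 1.0])
loss = gmm_nll(q=1.2, alphas=alphas, mus=mus, sigmas=sigmas)
```

In training, this per-sample loss is averaged over a batch and minimized by gradient descent; the softmax and softplus activations commonly used on the network heads enforce the weight and standard-deviation constraints.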

B2 Countable Mixture of Asymmetric Laplacians
Countable mixtures of asymmetric Laplacian distributions (CMAL, for short) are another form of MDN, in which asymmetric Laplacian distributions (ALDs) are used as the kernel function. The acronym is a reference to UMAL, since CMAL serves as a natural intermediate stage between GMM and UMAL, as will become clear in the respective section. As far as we are aware, the use of ALDs for quantile regression was proposed by Yu and Moyeed (2001), and their application in MDNs was first proposed by Brando et al. (2019). The components of CMAL already intrinsically provide a measure for asymmetric distributions and are therefore inherently more expressive than GMMs. However, since they also necessitate the estimation of more parameters, one can expect that they are also more difficult to handle than GMMs. The density of the ALD is:

ALD(q | µ, s, τ) = (τ(1 − τ)/s) exp( −((q − µ)/s)(τ − I[q < µ]) ),

where τ is the asymmetry parameter, µ the location parameter, and s the scale parameter. Using the ALD as a component, CMAL can be defined in analogy to the GMM:

p(q | x) = Σ_{k=1}^{K} α_k(x) ALD(q | µ_k(x), s_k(x), τ_k(x)),

where the parameters and weights are estimated by a neural network. Training is done by minimizing the negative log-likelihood of the training data under the estimated distribution:

L = −log( Σ_{k=1}^{K} α_k(x) ALD(q | µ_k(x), s_k(x), τ_k(x)) ).

UMAL, described next, removes the need to learn the asymmetry parameters explicitly; since the network then only has to account for the scale and the location parameter, considerably fewer parameters have to be estimated than for GMM or CMAL.
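A minimal NumPy sketch of the ALD density and the CMAL loss (the parameter values are invented for illustration; in the model they are network outputs):

```python
import numpy as np

def ald_pdf(q, mu, s, tau):
    """Density of the asymmetric Laplacian ALD(mu, s, tau)."""
    u = (q - mu) / s
    rho = u * (tau - (u < 0))          # tilted ("pinball") penalty term
    return (tau * (1.0 - tau) / s) * np.exp(-rho)

def cmal_nll(q, alphas, mus, scales, taus):
    """Negative log-likelihood of a mixture of asymmetric Laplacians
    for a single observation q."""
    mixture = np.sum(alphas * ald_pdf(q, mus, scales, taus))
    return -np.log(mixture)

# Example: a 2-component CMAL evaluated at an observed runoff value.
alphas = np.array([0.6, 0.4])
mus = np.array([1.0, 2.0])
scales = np.array([0.5, 0.8])
taus = np.array([0.3, 0.7])
loss = cmal_nll(q=1.5, alphas=alphas, mus=mus, scales=scales, taus=taus)
```

Note the extra parameter vector `taus` relative to the Gaussian case: each component carries its own asymmetry, which is what makes CMAL more expressive (and harder to fit) than a GMM.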
B3 Uncountable Mixture of Asymmetric Laplacians

In analogy to the CMAL model equations, the extensions of UMAL lead to the following equation for the conditional density:

p(q | x) ≈ (1/K) Σ_{k=1}^{K} ALD(q | µ_k(x, τ_k), s_k(x, τ_k), τ_k),

where the asymmetry parameter τ_k is randomly sampled K times to provide a Monte Carlo approximation to an implicitly learned distribution. After training, modellers can choose from how many discrete samples the learned distribution is approximated. As with the other mixture density networks, training is done by minimizing the negative log-likelihood of the training data under the estimated distribution:

L = −log( Σ_{k=1}^{K} ALD(q | µ_k(x, τ_k), s_k(x, τ_k), τ_k) ) + log(K).
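The Monte Carlo approximation can be sketched as follows (here `mu_fn` and `s_fn` are toy stand-ins for the network heads, which in UMAL are conditioned on both the inputs x and the sampled τ; the sampling range and toy functions are our own choices):

```python
import numpy as np

rng = np.random.default_rng(seed=42)

def ald_pdf(q, mu, s, tau):
    """Density of the asymmetric Laplacian ALD(mu, s, tau)."""
    u = (q - mu) / s
    return (tau * (1.0 - tau) / s) * np.exp(-u * (tau - (u < 0)))

def umal_density(q, mu_fn, s_fn, n_samples=100):
    """Approximate the UMAL conditional density by averaging ALD
    components whose asymmetry parameters are sampled at random."""
    taus = rng.uniform(0.01, 0.99, size=n_samples)  # avoid degenerate tails
    return float(np.mean([ald_pdf(q, mu_fn(t), s_fn(t), t) for t in taus]))

# Toy stand-ins: the location shifts with tau (like a quantile function),
# the scale is constant.
density = umal_density(q=1.0, mu_fn=lambda t: 2.0 * t, s_fn=lambda t: 0.5)
```

Increasing `n_samples` trades computation for a smoother approximation, which is the choice left to the modeller after training.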

B4 Monte Carlo Dropout
Monte Carlo Dropout (MCD; Gal and Ghahramani, 2016) has found widespread adoption and has already been used in a large variety of applications (e.g., Zhu and Laptev, 2017; Kendall and Gal, 2017; Smith and Gal, 2018). The MCD mechanism can be expressed as:

p(q | x) ≈ N(q | µ*(x), σ*(x)),

where µ*(x) is the expectation over the sub-networks given the dropout rate r (and σ*(x) the corresponding spread of the sub-network predictions), such that:

µ*(x) = E_{d∼B(r)}[f(x, d, θ)] ≈ (1/M) Σ_{m=1}^{M} f(x, d_m, θ),

where d is a dropout mask sampled from a Bernoulli distribution B with rate r, d_m is a particular realization of a dropout mask, θ are the network weights, and f(·) is the neural network. Note that f(x, d_m, θ) is equivalent to a particular sub-network of f.
MCD is trained by maximizing the expectation, that is, by minimizing the mean squared error. As such, it is quite different from the MDN approaches. It provides an estimate of the epistemic uncertainty and does not supply a heterogeneous, multimodal estimate (it assumes a Gaussian form). For evaluation studies of MCD in hydrological settings we refer to Fang et al. (2019), who investigated its usage in the context of soil-moisture prediction. We also note that MCD has been observed to underestimate the epistemic uncertainty (e.g., Fort et al., 2019).
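The sampling scheme can be illustrated with a toy linear "network" (the model, dropout rate, and sample count are arbitrary choices for this sketch; a real application would sample forward passes of the trained LSTM with dropout kept active):

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def dropout_forward(x, weights, rate):
    """One forward pass of a toy linear model with a sampled dropout mask."""
    mask = rng.random(weights.shape) > rate        # Bernoulli keep-mask
    kept = weights * mask / (1.0 - rate)           # inverted-dropout scaling
    return float(x @ kept)

def mcd_predict(x, weights, rate=0.25, n_samples=200):
    """Monte Carlo dropout: average many sub-network predictions.

    Returns the mean mu*(x) and the sample standard deviation, which
    serves as the (Gaussian) uncertainty estimate."""
    preds = np.array([dropout_forward(x, weights, rate)
                      for _ in range(n_samples)])
    return preds.mean(), preds.std()

x = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -0.2, 0.1])
mu_star, sigma_star = mcd_predict(x, weights)
```

Each sampled mask selects a different sub-network, so the spread of the predictions reflects how sensitive the model output is to the dropped units.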
Author contributions. DK, FK, MG, and GN designed all experiments. DK conducted all experiments, and the results were analyzed together with the rest of the authors. FK provided the main setup for the "Accuracy" analysis; AKS and GN for the "Internal Consistency" analysis, and DK for the "Estimation Quality" analysis. GN supervised the manuscript from the hydrologic perspective, GK and SH from the machine learning perspective. All authors worked on the manuscript.
Competing interests. The authors declare that they have no conflict of interest.