Comparison of estimation methods for a nonstationary index-flood

27 Due to climatic or anthropogenic causes, changes in flood magnitudes in many parts of the world 28 have been observed and are expected to continue in the future. To characterize such changes, 29 nonstationary models have focussed on the modeling of stations with long records, but in practice 30 such models may be needed to improve the evaluation of flood risk for stations having shorter 31 records. In this study, a nonstationary index-flood model for peaks over threshold is investigated 32 to reduce model uncertainty in such situations. A procedure is proposed to automatically calibrate 33 such models for at-site and regional frequency analysis. The assumption of an index-flood model 34 is used to define a probability structure that is stable in time. This requires adapting existing 35 automatic procedures for threshold selection and the delineation methods for forming pooling 36 groups to the nonstationary models. Four estimators are investigated in a simulation study to 37 determine which perform best in different situations. Two methods are based on the combination 38 of regression techniques and L-moments, while the other two methods employ likelihood-based 39 3 techniques. A case study of 425 stations in Canada is considered to verify if a nonstationary index40 flood model using pooling groups that combine stationary and nonstationary stations can reduce 41 the uncertainty of design levels associated with a finite reference period. 42


30
The evaluation of risk associated with flood events is an important factor in the design of safe and reliable infrastructure. In particular, estimation of accurate flood quantiles is challenging as it requires the extrapolation of the tail of a probability distribution estimated from a limited number of extreme events. To increase the amount of information, threshold models were introduced as an alternative to the more common strategy of modeling annual maximum 35 discharges. Comparative studies showed that this strategy, which allows the inclusion of more than one peak per year, can effectively reduce model uncertainties (Bezak et al., 2014;, although the efficiency of threshold methods may vary, based on factors such as the number of peaks per year (Cunnane, 1973). Another strategy to reduce model uncertainty in flood frequency analysis is to include information from nearby stations with similar hydrological 40 properties. Such an approach, called regional frequency analysis, is often recommended by governmental authorities to perform frequency analysis on stations having few years of data (Robson and Reed, 1999;USGS, 2018). Among popular regional approaches, the index-flood model is widely applied (Hosking and Wallis, 1997;Ilorme and Griffis, 2013;Nobert et al., 2014;Wright et al., 2014). The latter assumes that every station inside a homogenous group 45 follows the same distribution called growth curve or regional distribution up to a scale factor. This hypothesis proved to be a flexible approach that led to various generalizations, such as multivariate frequency analysis of peaks and volumes (Chebana and Ouarda, 2009;Requena et al., 2016) and spatial extremes characterized by max-stable processes (Wang et al., 2014) .
There is evidence that changes in flood regimes are occurring due to either climatic (Burn et al., 50 2016;Kiem et al., 2003;Kundzewicz, 2012) or anthropogenic causes (Prosdocimi et al., 2015;Villarini et al., 2009). Distinguishing between long-term persistence and short oscillation patterns in environmental time series represents an important dilemma that affects the interpretation of the observed changes. Some authors pointed out that stationary time series may exhibit a trend in the data , even though the probability structure of the studied phenomenon does 55 not change in time (Koutsoyiannis, 2005;Montanari and Koutsoyiannis, 2014). Consequently, research that has addressed the issues related to change in flood regimes has mostly focussed on analysis with long records to characterize and attribute changes to specific drivers (Blöschl et al., 2007;Hall et al., 2014;Mediero et al., 2015). Some studies have considered regional approaches to attribute changes in flood regimes (Renard and Lall, 2014;Sun et al., 2014), but such 60 approaches remain relatively marginal and changes are generally investigated for a specific station (Viglione et al., 2016). In a nonstationary frequency analysis, the major challenge remains to adequately predict trends. Recent studies demonstrate that even when stations present significant signs of nonstationarity, the variability of the trends estimated using uniquely time as covariate are still too important for nonstationary models to provide a valuable replacement to 65 existing stationary models (Renard et al., 2013;Serinaldi and Kilsby, 2015). Indeed, even a simple linear trend can diverge considerably from the truth over the years. As a compromise, Luke et al. (2017) recommended in the United States the approach of updated stationarity where the predicted trend is constant and equal to the last observed year. This provides a balance 5 between opting for a stationary model that ignores the observed trend and a nonstationary model 70 that leads to an unrealistic flood estimate.
To adapt flood frequency models to nonstationary situations, a common approach is to let the parameters of extreme distributions evolve as a function of temporal covariates (El Adlouni et al., 2007;Katz, 2013;Tramblay et al., 2013;Villarini et al., 2010). For threshold modeling, the Generalized Pareto distribution (GPA) generally has a time-varying scale parameter and constant 75 shape parameter. Similarly, it was shown that using common regression models applied to the logarithm of annual maximum floods were proper methods to describe the trends observed in most watersheds in the United States (Serago and Vogel, 2018;Vogel et al., 2011). Additionally, several studies for modeling peaks over threshold have also considered a time-dependent threshold to ensure that the exceedance probability remains constant in time (Kyselý et al., 2010;80 Northrop and Jonathan, 2011). A nonstationary index-flood model was presented by Cunderlik and Burn (2003) and Hanel et al. (2009) in the case of annual maxima and by Roth et al. (2012) in the case of peaks over threshold. These study introduction of a time-dependent scaling function to replace the constant scale factor. This modification describes the probabilistic structure of a group of stations using a unique growth curve that is stable in time. Accordingly, a 85 homogenous region becomes a group of sites following the same distribution up to a scaling function. Moreover, they suggested a further generalization that lets the parameters of the regional growth curve vary in time to describe temporal trends common to the homogenous region. For modeling flood peaks over a given threshold,  investigated a procedure to estimate a stationary index-flood model based on L-moments. Their 90 methodology estimates a regional shape parameter using L-coefficient of variation and the scale parameter was taken as the at-site mean. Contrary to Roth et al. (2012) that used the threshold as 6 a scaling function, a generalization of the  model involves using a time-dependent mean excess function. The direct correspondence between the mean excess and the scale parameter of the GPA distribution implies an equivalent representation, but with a 95 clearer separation of the time-dependent component (mean excess) and the probabilistic structure (growth curve). O'Brien and Burn (2014) applied a nonstationary index-flood model to the estimation of flood quantiles in Canada using the annual maximum of river discharges that, contrary to Hanel et al. (2009), used a constant scaling factor and a time-dependent growth curve. Their study reveals an additional challenge in the application of regional and 100 nonstationary flood frequency analysis, because tests for trend applied on a large network of gauged stations resulted in a limited number of stations presenting significant signs of nonstationarity. The scarcity of the nonstationary stations restricted the availability of nearby sites having similar hydrological properties, which complicated the formation of the homogenous regions. 105 Homogenous regions can be formed following the notion of regions of influence, which was demonstrated to lead to more accurate estimates of flood quantiles than fixed regions (Burn, 1990;GREHYS, 1996;Tasker et al., 1996). A region of influence entails the formation of pooling groups that include the stations nearest to a target site. A particularity of this delineation method is that the pooling groups are specific to a target and the same station can be part of two 110 pooling groups having similar but different growth curves. Consequently, the rationale of combining the region of influence methodology with index-flood models is to obtain a neighborhood of similar stations where the global probability structure can be locally approximated by a unique growth curve. Another important aspect for modeling exceedances is the selection of a proper threshold. A threshold should normally be selected as low as possible, 7 while respecting the model assumptions. In this regard, an important aspect for selecting a threshold is the notion of stability that entails that if GPA is a proper model for the exceedances of a given threshold, then the exceedances of a higher threshold should also follow a GPA distribution with the same shape parameter (Coles, 2001). An indicator for determining if a threshold was correctly selected is to use a goodness-of-fit test to verify that GPA is a proper 120 distribution (Davison and Smith, 1990). The p-value of such tests were used recently to develop automatic selection procedures based on the identification of the maximum p-value and the first threshold respecting a given significance level (Durocher et al., 2018b;Solari et al., 2017).
The objective of the present study is to propose an automatic procedure to perform the calibration of a nonstationary index-flood model for peaks over threshold. In addition to testing 125 for trends in the frequency of occurrences and magnitudes of the threshold exceedances, the methodology includes a way of selecting the time-dependent threshold and mean excess function. More precisely, this procedure involves using the stability of the growth curve to adapt existing methods to nonstationary models. It also allows the formation of pooling groups that combine stationary and nonstationary stations to maximize the information provided by the 130 nearby stations. In at-site flood frequency analysis, L-moments are often preferred for curve fitting, to the alternative method of maximum likelihood, due to their robustness and lower bias (Hosking, 1990;. Similarly, the proposed procedure suggests a methodology based on regression techniques and L-moments. The relative performance of this estimation method is compared to existing likelihood-based methods to identify the best method for 135 different circumstances. The present study does not address the task of forecasting trend, which demands a separate focused attention. The focus is rather put on a methodology that could be applied in practice and that can improve flood quantile estimates for stations having limited data 8 and for which nonstationarity is suspected. According to the idea of update stationarity (Luke et al., 2017), there is interest in investigating flood quantiles of the most recent years, as they may 140 be among the best indicators of future flood risk. Consequently, this study attempts to put forward models that reduce the uncertainties of design levels that summarize flood risks over the most recent year of observation (Cooley, 2013;Salas et al., 2018).
The investigated model has 3 important components: the threshold, the mean excess series and the growth curve. Section 2 explains these components in more detail and presents the stepwise 145 procedure for calibrating the model. For simplicity, time is used as a predictor (covariate), but this could be replaced by other meaningful descriptors. Section 3 provides a simulation study that compares the relative performance of four estimators of the nonstationary index-flood model including 2 at-site and 2 regional ones. In section 4, a case study based on 425 stations in Canada is used to verify that the regional version of the proposed method can reduce the uncertainty of 150 estimated design levels in comparison to existing at-site methods. Finally, Section 5 discusses the results and draws conclusions.

Methodology
This section describes the investigated nonstationary index-flood model. The general procedure is schematized in Figure 1 and details are provided in the following subsections. First the model 155 components are defined in the context of an at-site frequency analysis. Next, the selection of the threshold and estimation of the components are discussed. Afterward, a procedure for forming pooling groups is presented to extend the approach to a regional frequency analysis. Finally, an alternative estimator based on likelihood theory is proposed and design levels are suggested as an alternative to return periods for quantifying risk in a nonstationary context. 160 9

Threshold
Let's consider first the modeling of a single station. Stationary threshold models assume that the probability of exceeding the threshold is constant through time. When this hypothesis is unrealistic, quantile regression is suggested to define a time-dependent threshold that restores a 165 constant probability of exceeding the threshold (Kyselý et al., 2010;Northrop and Jonathan, 2011). Quantile regression estimates conditional quantiles with respect to covariates without choosing a specific distribution to fit to the data (Koenker and Bassett, 1978). For streamflow data, a declustering method is necessary to identify independent peaks from a series of daily river discharge. To this end, the declustering method recommended by the Water Resource Council of 170 the United States as described in Lang et al. (1999) is applied. In brief, two adjacent peaks must respect the following two conditions: i) they must be separated by 4 log( ) A  days, where A is the drainage area in square kilometers; and ii) the minimal intermediate flow must be less than 75% of the lowest of the two peaks. Utilization of a declustering method creates a discrepancy between the exceedance probability of the peaks and the probability associated with the 175 conditional quantile of the quantile regression. For this reason, the exceedance probability is estimated as the ratio nN representing the extracted number of peaks divided by the number of daily observations (Coles, 2001). Henceforth, let be a time-dependent threshold associated with a rate  representing the average number of peaks per year (PPY). Note that this rate is proportional to the exceedance probability, the proportionality factor being the number of 180 days in a year (365.25) and has comparable interpretation in both stationary and nonstationary models.

Mean excess
The second component of the model is the mean excess that can be constant or time-dependent.
In the latter case, it describes a trend in the magnitude of the exceedances. Let Y(t) be a random 185 variable characterizing exceedances. Based on theoretical arguments, Y(t) follows a Generalized Pareto (GPA) distribution (Davison and Smith, 1990) (1)

Growth curve
The third component of the model is a dimensionless growth curve that describes the probability structure of the exceedances. Using the mean excess as a scaling factor leads to the definition of the standardized exceedance ( ) controlled uniquely by the shape parameter  . Note that the representation using 200 11 the mean excess and the growth curve is equivalent to directly using the shape parameter of the GPA model. However, the proposed methodology appears more straightforward when using this form.
When the model is applied in a regional analysis, it can be assumed that all stations inside a homogenous region have the same (regional) growth curve, which is the assumption of an index-205 flood model (Hosking and Wallis, 1997). A further generalization is introduced by considering a space-dependent GPA shape parameter related to a linear predictor where s is the station of interest,  is a vector of parameters and () s x a vector of descriptors.
Overall, for a station s the proposed model evaluates the flood quantile of probability p at a 210 specific time t as where () p qs is a growth curve, , () s ut  is the station threshold and () s t  is the station mean excess. The choice of using a linear predictor that depends on station characteristics is made to provide a clear separation between the temporal and spatial components of the model. 215

Selection of the threshold
The nonstationary model is calibrated following an automatic procedure that verifies the model hypothesis in a stepwise manner. A simple automatic procedure to determine the threshold u  consists of selecting the lowest threshold for which a goodness-of-fit test, such as the Anderson-Darling (AD), cannot reject the hypothesis of a GPA distribution (Davison and Smith, 1990). 220 The rationale of this heuristic is to find a balance between selecting a threshold high enough to respect the model assumptions and low enough to reduce model uncertainty by maintaining most of the available information. Considering a set of candidate thresholds, one can iterate until the p-value of the goodness-of-fit test is greater than a chosen value. This significance-based strategy was criticized because it fails in some situations to provide a stable threshold based on common 225 visual diagnostics. As a solution, Solari et al. (2017) showed that a proper alternative is to use the threshold that leads to the maximum p-value of the goodness-of-fit test. A comparison of these two approaches was later performed by Durocher et al. (2018b). They noticed that the significance-based method fails in a limited number of cases and that when it doesn't fail, it leads to models with lower uncertainty. They also found that a large discrepancy between the flood 230 quantile of a candidate threshold and the one of a lower reference threshold provides a good indicator of failures. This resulted in the proposition of a hybrid procedure where an alternative method to the significance-based method is preferred only when the discrepancy is considered large enough to suggest that stability has not been reached.
In this study, a set of 30 candidate thresholds between 0.8 and 2.5 PPY are identified by an initial 235 screening process. The significance-based and the maximum p-value thresholds are then searched among the candidates and the final threshold is taken as the one with the highest number of peaks. Following Durocher et al. (2018b), the significance-based threshold is chosen as the first threshold that has a p-value greater than 0.25 and a relative discrepancy with the 1 PPY threshold lower than 25%. To speed up the computation, a table is used to interpolate the p-240 values of the Anderson-Darling test (Choulakian and Stephens, 2001). Although this table does not allow the evaluation of p-values greater than 0.5, it was shown that such restriction does not substantially affect the performance of the two automatic procedures (Durocher et al., 2018b).

13
Once the exceedances are identified, a logistic regression model is applied to check for the presence of a significant trend in the occurrences of peaks (Frei and Schär, 2001), i.e. an 245 evolving probability of (exceeding/not exceeding) the threshold each day. If the slope is not significant, the threshold is assumed to be constant; otherwise a time-dependent threshold is added to the model. The same automatic selection procedure is therefore repeated using quantile regression to identify the exceedances. Afterward, the hypothesis of a time-dependent mean excess function is verified by the Mann-Kendall test (Helsel and Hirsch, 2002). To account for 250 possible temporal correlation, block bootstraps are employed to obtain a more robust evaluation of the significance level (Önöz and Bayazit, 2012). If a trend in the magnitude of the exceedances cannot be rejected, a time-dependent mean excess function is added. At this point, the automatic selection procedure cannot be used directly, because the goodness-of-fit test is not applied on identically distributed data. The reformulation of the GPA distribution in terms of a 255 growth curve scaled by the mean excess function becomes useful as the automatic procedure can be applied on the standardized exceedances. In this context, the automatic procedure ensures that the growth curve has reached stability for the selected threshold.   depends on the GPA shape parameter  . In that context,  is treated as a nuisance parameter in the sense that it is not needed to estimate the mean. The quasi-likelihood approach is an estimation method that mimics the properties of the maximum likelihood approach, but uses only the information from the first two moments. For this study, the mean excess has the form 270

Estimation of the mean excess and growth curve
is a vector of parameters and g is a link function that relates the mean excess to a linear predictor. Here, the link function is restricted to a constant or an exponential function; where the latter may be useful to enforce positive values. Please see McCullagh and Nelder (1989) for a more in-depth introduction of GLM modeling.
Once the threshold and mean excess of the model are estimated, empirical values of the 285 standardized exceedances can be computed. Using only at-site information, one possible estimator for the GPA shape parameter  of the growth curve is the L-moment estimator where  is the empirical estimate of the L-coefficient of variation ).

290
In this study, the pooling groups are built using a hierarchical structure that accounts for more than one notion of similarity. Similar strategies were proposed, for instance, by Eng et al. (2007) and Durocher et al. (2018a). First, the 0 m nearest stations to the target are identified according to geographical distance. Then, among the identified stations, the final m stations are selected as the most similar to the target in terms of seasonality. Mostafi Zadeh et al. (2019) indicated that 295 regional frequency analysis performed with pooling groups based on a seasonality measure using annual maximums are more accurate than a seasonality measure based on peaks over threshold.
In the seasonality space, a station can be represented as a circular statistic (Mardia, 1975)  For each station j s of a pooling group delineated using the seasonality measure, the at-site 305 estimates of the GPA shape parameter ˆj  can be obtained by using the steps described in Section 2.3. The drainage area (AREA) and mean annual precipitation (MAP) defines the linear predictor, Equation (3), that characterizes the relationship between the GPA shape parameter and its descriptors. To reduce skewness and impose a scale for comparison, both descriptors are initially transformed using the logarithm function and standardized. A GLM model assuming a 310 Normal distribution is employed to estimate the parameter  of the linear predictor, Equation (3). To find appropriate neighborhood sizes, the objective is to determine 0 mm  such that the pooling groups best predict the target GPA shape parameter, which is accomplished by leave-one out cross-validation. In turn, the GPA shape parameter () j  of the target station is predicted as if it was ungauged. This process is repeated for every pooling group and the cross-validation score 315 For at-site frequency analysis, the GPA shape parameter is constant () s   , which simplifies the well-established nonstationary models for threshold modeling (Coles, 2001). Using the invariance property of the maximum likelihood estimator, the likelihood function ( , ; ) L  by can be optimized before using the properties in Equation (2) to derive the mean excess and the growth curve from the estimated parameters. 330 Although dependence structure and estimation methods exist for modeling spatial extremes (Davison et al., 2012;Padoan et al., 2010;Thibaud et al., 2013), when the focus is the quality of the fitted distribution and not the realism of the simulations, simpler estimation methods were shown to lead to proper inference without specifying such dependence structure. One approach is the independent likelihood method that optimizes jointly the likelihood of multiple stations as if 335 all stations were independent (Roth et al., 2012;Wang et al., 2014). The maximization of the independent likelihood is sometimes challenging. Here, the algorithm is initialized using the parameters estimated by the regression approach and follows the procedure described in Roth et al. (2012). In brief, it alternates between a phase where the growth curve is optimized assuming the mean excess of all stations is known and a phase where the at-site estimation of each station is optimized separately assuming that the growth curve is known. 345 Asymptotic results for the distribution of the parameters estimated by the independent likelihood method are presented, for instance, in Varin et al. (2011). However, the present study relies on a parametric bootstrap method to quantify the uncertainty of the model because the same method can be applied to the regression approach. In particular, bootstraps are necessary to propagate the error made at each step of the methodology. The resampling scheme includes an adjustment for 350 intersite correlation by using simulations generated from a multivariate Normal distribution with constant coefficient of correlation (Hosking and Wallis, 1997). Contrary to regional models using annual maximums, peaks over threshold events are not observed at regular time steps.
Consequently, it is assumed that correlation only affects pairs of exceedances separated by less than a month and the dependence parameter is estimated as the average correlation coefficient. 355 The multivariate simulations are transformed to GPA distributions using the parameters obtained by the at-site estimation of each station by the combination of regression techniques and Lmoments.

Evaluation of flood risk
For a stationary model, flood risk is measured in terms of a return period, T, corresponding to the 360 expected waiting time before the occurrence of an event of similar magnitude. For threshold models, a return period is associated with the quantile of the GPA distribution having probability 1 1 ( ) T pT    , where  corresponds to the mean number of peaks per year . For nonstationary models, a different flood quantile is evaluated each year and the usual correspondence between exceeding probability and expected waiting time does not 365 hold. In practice, a generalization of expected waiting time to a nonstationary model is more challenging, because to evaluate the expected waiting time of a 100 year return period, it is necessary to know the trends for a period longer than 100 years (Cooley, 2013). This can be especially problematic considering that accurate prediction of future trends remains an open problem in flood frequency analysis (Luke et al., 2017;Serinaldi and Kilsby, 2015). 370 Measuring risk as a probability associated with a finite reference period is simpler. The reliability associated with a given design level is defined as the probability that no event of such magnitude occurs during this period. The probability of failure is then one minus the reliability.

Simulation study
A simulation study is performed to evaluate the relative performance of the regression versus the likelihood-based approach for estimating the parameters of the nonstationary index-flood model.
Both at-site and regional models are considered. The comparison is based on a group of stations 395 that have the same mean excess and regional growth curve. The first station of the group is treated as the target site the pooling group. (i) The notation LMM is used to denote the at-site method using the combination of regression techniques and L-moments (Section 2.

3). (ii)
Similarly, RLM represents the regional version of the LMM method, where the GPA shape parameter is estimated by a second regression model (Section 2.4). (iii) The notation MLE 400 corresponds to the at-site maximum likelihood estimator and (iv) the notation IND refers to using an independent likelihood method that jointly fits the 20 stations assuming a constant regional GPA shape parameter (Section 2.5).
The simulated data represented a table rs nn  where nr is the record length and ns is the size of the pooling group. Each column corresponds to a GPA sample simulated according to the 405 nonstationary index-flood model. Among many factors, the quality of the estimation method will be affected by the number of peaks. Three record lengths of nr = 30, 50 and 100 years are considered and an average of two peaks per year is assumed to evaluate the return periods. The reference period for the occurrence of peaks is set to be the interval [0, nr] where the time of observation for each exceedance is selected at random. The pooling group size is chosen as ns = 410 20. For every experiment, the threshold is zero and assumed to be known. The mean excess is defined as a linear trend that is one at the origin and has a 1% annual increase. Every experiment is repeated 1000 times for several GPA shape parameters,  , ranging from -0.3 to 0.3 by steps of 0.1.
The accuracy of some model components is summarized using bias and root mean square error 415 (RMSE). Figure 2 reports the RMSE of the shape, slope and design level (flood quantile) Q100 for the different simulated GPA shape parameters. Note that Q100 is based on the last 30 years of simulated data. As expected, since the design of the simulation study is based on a known regional model, the third row shows that the regional methods (RLM, IND) are more accurate than the at-site methods (LMM, MLE). In particular, the IND method is found to be 420 systematically the best method for predicting Q100. However, the differences between RLM and IND are relatively small, in particular for simulated GPA shape parameters between -0.1 and 0.1.
The differences outside this range seem to mostly result from lower accuracy in the estimation of the linear trend because the RMSE of the GPA shape parameter for the IND method is not systematically better than the RMSE for RLM. Note that in the second row of Figure 2, RLM 425 and LMM have the same estimated slope and thus the difference of accuracy is a consequence of the approach used for the estimation of the GPA shape parameter. In terms of bias, which is represented in Figure 3, the first row of this figure indicates that the RLM has smaller bias than 22 the IND method for GPA shape parameters lower than -0.1. This difference in bias performance between the methods also translates to lower bias for Q100. For the slope parameter, the bias of 430 all methods is found relatively low and does not depend on the shape parameter.
The comparison of the at-site methods also has a special interest as the procedure proposed to guide the choice of the time-dependent components is based only on the data of the target station. Figure 2 indicates lower RMSE in the GPA shape parameter and design level Q100 for the LMM method in comparison to the MLE method when simulations are performed using 435 negative GPA shape parameters and 30 or 50 years of simulations. This conclusion suggests that LMM is more robust in the sense that it is better at estimating the GPA shape parameters of smaller samples with heavy tails, which impact the accuracy of Q100. MLE performs relatively better when simulations have positive GPA shape parameters and more data. When looking at the bias in Figure 3, both estimation methods tend to overestimate the GPA shape parameter, but 440 LMM is found to be less biased. In particular, as the GPA shape parameter becomes more positive, LMM becomes relatively less biased, while MLE becomes more biased. The same conclusion applied to Q100, even though bias remains relatively small. By comparison, the design level associated with a GPA shape parameter of zero is 4.6 and the highest relative bias of MLE for Q100 is 2%. Overall this shows that using the LMM approach for calibration is a safer 445 approach because when the RMSE is large (heavy tails) it provides a gain of accuracy and when RMSE is small (light tails) it is less biased. Section 2.5 described the parametric bootstrap procedure used to evaluate the uncertainty of the four estimators. For that resampling scheme, the LMM estimate is used as plug-in value to transform the marginal distribution. This choice can be motivated by the relatively lower bias compared to the likelihood-based estimator. 450

23
Other experiments were performed, but detailed results are not reported. In particular, the impact of including intersite correlation using a multivariate Normal distribution with constant correlation coefficient was considered. Increasing the intersite correlation did decrease the overall accuracy of all methods, but did not affect the relative performance of the four estimation methods. Adding a small perturbation to the GPA shape parameter was also considered, but 455 again none of the estimators performed relatively better than the others under this type of model misspecification.

Data and local trends
The Water Survey of Canada (WSC, 2018) manages a large network of gauge stations that 460 provide daily measurements of streamflow across the country. For the purpose of this study, a total of 425 stations are selected that have unregulated streamflows and at least 27 years of complete data during the reference period of 1988 and 2017. This reference period of thirty years was selected because it represents a common window to evaluate persistence in climate data (IPCC, 2019;WMO, 2019) and it is used to evaluate design levels. Furthermore, it ensures a 465 minimal record length for each station and a good representation of the trend during the reference period. Table 1 Burn et al. (2016) investigated changes in peaks over threshold data in Canada using a classification based on three types of flood regimes. A similar approach is adopted herein, where hierarchical clustering (Ward, 1963) is applied to define seasonality regions using the seasonality 475 measure of Equation (9). Figure 5 illustrates the locations and seasonal properties of 6 clusters.
The average monthly maximum flow of each station is computed to create a profile vector that offers a second representation of the flood seasonality. To account for catchment scale, every profile vector is standardized to a unit norm. The two panels at the bottom of Figure Table 1). Conversely, the proportion of nonstationary stations with nival regime is 4.9%, which implies that the rejection of the hypothesis of stationarity is similar to random. Stations with a pluvial or mixed regime include 76.2% of the stations with a timedependent threshold. In particular, 10 of the 13 stations with mixed regimes have positive trends and among the 3 negative trends, 2 of them have a near zero slope with more than 60 years of 500 data. Burn et al. (2016) indicated an increase in the prevalence of rainfall-driven flood events that is coherent with the present findings. At the same time, 13 of the 18 stations (68.4%) with a time-dependent mean excess are located in southern Ontario or southern British Columbia.
Across Canada, stations with a time-dependent mean excess are generally negative in a proportion of 2.2 to 1. 505

Calibration of a single station
To illustrate the calibration of one station, the stepwise procedure is described in more detail for station 02HL003 located on the Black River, Ontario, which has a mixed flood regime. See the station having both trends in Figure 4. First, the automatic selection procedure is applied to the stationary at-site model. It leads to a threshold having a significant trend in peak occurrences (p-510 value of 0.03) according to the logistic regression model. The automatic selection procedure is therefore repeated with a time-dependent threshold. The newly selected time-dependent threshold leads to a trend in the magnitude of the exceedances with a p-value of 0.03 for the Mann-Kendall test. A final run of the automatic selection procedure is performed with both timedependent components. In this case, both tests reject the hypothesis of trends. 515 Outputs of the automatic selection procedure is presented in Figure 6. It reports the estimated GPA shape parameter (denoted "Shape") and the p-value of the Anderson-Darling test for the 26 threshold candidates. Notice that the selected threshold is associated with 2.48 PPY and corresponds to the first candidate below 2.5 PPY that reaches the maximum p-value of 0.5. The GPA shape parameters associated with the candidate thresholds are relatively stable between 520 approximately 1.4 and 2.8 PPY, but there is a clear evolution below An overall visualization of the model is presented in Figure 8, which includes the estimated timedependent threshold and mean excess along with the daily streamflow data. A slightly positive slope is reported for the threshold, while the mean excess has a slightly negative slope. Jointly, the flood quantiles of probabilities 0.9 and 0.99, respectively R10 and R100 summarize flood 540 27 risk for each year. The observed negative slope for R100 shows the relative importance of the mean excess in the evaluation of flood risk. Figure 8 also presents the design levels Q10 and Q100. As expected, the comparison between R10 (R100) and Q10 (Q100) indicates that the design levels behave similarly to an average flood quantile.

545
For each station, a nonstationary index-flood model is set-up following the stepwise procedure.
For forming the pooling groups, the calibration of the hierarchical scheme presented in section 2.4 is performed using the transformed mean annual precipitation (MAP) and the drainage area (AREA). Figure 9 reports the results of the leave-one-out cross-validation using a traditional index-flood model (constant growth curve) and a linearly evolving GPA shape parameter as 550 presented in Equation (3). The left panel presents for each value of 0 m (the size of the initial neighborhood formed using geographical distance) the minimal cross-validation score. When considering the linear predictor, the results suggest that it is preferable to not impose too severe a restriction on the geographical extent because the selected value is 0 m =350 and higher 0 m leads to similar scores. It is seen that restricting the geographical distance improves the cross-555 validation of pooling groups with constant growth curve. In this case 0 m = 125 is selected.
Overall, Figure 9 shows that the inclusion of the linear predictor improves the modeling of the GPA shape parameter by the members of the pooling groups by about 5%. The right panel of Figure 9 presents the cross-validation score with respect to the final pooling group size (m) for the best 0 m . One finds that the best pooling group sizes are respectively 35 and 25. 560 Once the pooling groups are formed, the four estimators: LMM, MLE, RLM and IND can be evaluated on each pooling group. Bootstrap samples of size 1000 are drawn to obtain an 28 approximate distribution of every model parameter and design level. The relative difference between two estimators in terms of variability is measured by the ratio of their variance. In Figure 10, the variance ratios of each estimation method are compared to the IND method, which 565 corresponds to the denominator and the x-axis represents the GPA shape parameter of the growth curve estimated by the IND method. On the logarithmic scale (base 2), variance ratios below zero indicate that the design levels are estimated with less uncertainty than IND. In particular, values of -1 and 1 indicate that the estimator has half or double the variance of the IND estimates. The first two rows of Figure 10 summarize the comparison between at-site and 570 regional models. This shows that for almost all stations the design levels are estimated more accurately by the regional models. The evaluation of Q100 corresponds to the extrapolation to higher risk than Q10. It is then reasonable to see that the variance ratios associated with Q10 exhibit less spread than Q100. The comparison between the RLM and IND methods shows similar results to the simulation study. Indeed, the IND estimator is found to be in general more 575 accurate than RLM when the GPA shape parameter is outside the interval [-0.1,0.1], while the opposite seems to be true in the present case study. However, note that for both design levels the difference between the regional methods is relatively small in comparison with the difference between the at-site methods. To better understand these scales, notice that a logarithm value of 0.25 corresponds to a standard deviation 9% higher, while a value of 2.5 corresponds to a 580 standard deviation 238% higher.
To understand the impact of selecting a nonstationary index-flood model versus a stationary index flood model, Figure 11 reports the relative difference between the design level estimated by the independent likelihood for both approaches. Note that the GPA shape parameter is a regional estimate and that most stations are stationary, consequently the GPA shape parameter 585 29 does not differ substantially in both approaches. Figure 11 shows that the nonstationary stations with more than 60 years of data have equal or lower design level in comparison with stationary models with an average of 5%. For the stations with fewer than 60 years of data, the median of the relative difference is not significantly different from zero according to a Wilcoxon signedrank test. Several reasons may explain this outcome. As mentioned, for stations with shorter time 590 series, the relative difference between stationary and nonstationary models may not represent a persistent change and may be a consequence of shorter oscillation patterns. Additionally, in this situation the reference period represents a large proportion of the observed years and the design levels of the two approaches may be similar, even if an important trend is observed. Overall, Figure 11 suggests that for sufficiently long time series the replacement of the stationary models 595 by nonstationary models entails smaller flood risk. These results are in agreement with research that shows that due to global warming, the important spring snowmelt events that characterize major floods in a majority of rivers in Canada are expected to occur earlier during the year and drain water from smaller snowpacks (Burn et al., 2016;Cunderlik and Ouarda, 2009).

600
A stepwise procedure was introduced to calibrate a nonstationary model using trend tests, Lmoments and regression techniques. For this procedure, a time-dependent GPA distribution was separated into a mean excess and a growth curve. This representation, characteristic of indexflood models, allowed the adaptation of existing automatic procedures for selecting threshold exceedances by ensuring the stability of the growth curve. A second benefit of this representation 605 is that mixed pooling groups containing nonstationary and stationary stations were created to resolve the issue of finding stations with similar hydrological properties. Indeed, among 425 stations in Canada, the stepwise procedure led to the consideration of time-dependent components in 10% of the studied stations, although higher concentrations of nonstationary models were found in regions characterized by pluvial and mixed regimes. 610 A comparison of four estimation methods was carried out in a simulation study. For a single station, the comparison between the regression approach and the maximum likelihood method has shown that design levels derived from the regression approach were generally less biased and more accurate for shorter time series having negative GPA shape parameters (thick-tailed distributions). This suggested that the regression approach could be recommended as a robust 615 strategy to perform at-site frequency analysis. The GPA distribution has a clear variance function that makes the use of quasi-likelihood straightforward in the stepwise procedure. McCullagh and Nelder (1989) mentioned that quasi-likelihood tends to behave similarly to the log-likelihood function. Therefore, it can be argued that the main difference between the regression approach and the maximum likelihood method is the estimation of the GPA shape parameter by the L-620 moments. Indeed, similar qualities attributed to the regression approach in this study are shared by L-moment estimators in a stationary context (Hosking, 1990). For the regional model, it was also demonstrated in the simulation study that using the independent likelihood method led to the most accurate estimates of the design levels Q10 and Q100. For the Canadian case study, 43 stations were found to require a time-dependent threshold 625 or mean excess. Using the variance ratios between the four estimation methods it was shown that the estimates provided by the regression approach have a comparable variability level to those of the independent likelihood when the GPA shape parameter is in the interval [-0.1, 0.1]. Although the majority of the stations have a GPA shape parameter in this interval, when the GPA shape parameter is outside this interval, the independent likelihood method was found to reduce model 630 31 uncertainty. For stations with more than 60 years of data, the comparison of the design levels based on a 30-year period indicated that the utilization of nonstationary models should result in a lower evaluation of the flood risk than stationary models. Luke et al. (2017) put forward the idea of update stationarity, which recommends that flood risk associated with recent years of data be used as a way to predict future flood risks. Further research is necessary to assess if the design 635 levels as defined in this study represent a reliable indicator for that purpose. However, in the meantime, this study shows that a nonstationary index-flood model using pooling groups that mix stationary and nonstationary stations can be recommended to reduce the variability of design levels.

640
This work was supported by the Natural Sciences and Engineering Research Council (NSERC) Canadian FloodNet (# NETGP 451456 -13). Drainage area and streamflow data are available on the website of the water survey of Canada (https://wateroffice.ec.gc.ca/search/historical_e.html).

Mean annual precipitation was provided by Environment and Climate Change Canada (ECCC).
A special thanks to Dr. Shabnam Motofi Zadeh for its help with collecting the data. The authors 645 would also like to thank the editor and three anonymous reviewers for their valuable comments and suggestions that contributed to improve the quality of the paper.