Predictive Modeling of Envelope Flood Extents Using Geomorphic and Climatic‐Hydrologic Catchment Characteristics

A topographic index (flood descriptor) that combines the scaling of bankfull depth with morphology was shown to describe the tendency of an area to be flooded. However, this approach depends on the quality and availability of flood maps and assumes that outcomes can be directly extrapolated and downscaled. This work attempts to relax these problems and answer two questions: (1) Can functional relationships be established between a flood descriptor and geomorphic and climatic‐hydrologic catchment characteristics? (2) If so, can they be used for low‐complexity predictive modeling of envelope flood extents? Linear stepwise and random forest regressions are developed based on classification outcomes of a flood descriptor, using high‐resolution flood modeling results as training benchmarks, and on catchment characteristics. Elementary catchments of four river basins in Europe (Thames, Weser, Rhine, and Danube) serve as training data set, while those of the Rhône river basin in Europe serve as testing data set. Two return periods are considered, the 10‐ and 10,000‐year. Prediction of envelope flood extents and flood‐prone areas show that both models achieve high hit rates with respect to testing benchmarks. Average values were found to be above 60% and 80% for the 10‐ and the 10,000‐year return periods, respectively. In spite of a moderate to high false discovery rate, the critical success index value was also found to be moderate to high. It is shown that by relating classification outcomes to catchment characteristics, the prediction of envelope flood extents may be achieved for a given region, including ungauged basins.


Introduction
Floods pose a serious threat to individuals and communities, as shown by disaster data found, for example, in the International Disaster Database of the Centre for Research of the Epidemiology of Disasters (CRED EM-DAT). A United Nations report on the Human Cost of Weather-Related Disasters (CRED/UNISDR, 2016) illustrates the dimension of the problem well: 47% of all disasters since 1995 have been floods, affecting a total of 2.3 billion people worldwide. This previous experience paints a grim picture, and there is mounting evidence for an increase in frequency and intensity of severe floods due to climate change (Aerts et al., 2006;Alfieri et al., 2017;Barichivich et al., 2018;Kleinen & Petschel-Held, 2007;Milly et al., 2002;Sassi et al., 2019). On the other hand, the increase of socio-economic activities in flood-prone areas perseveres in a number of countries, making exposure of persons and assets, including critical infrastructure, a serious problem ( Thus, understanding flood risk is of great importance to the management of socio-economic and environmental impacts. Disaster risk reduction (i.e., mitigation, preparedness, response, and recovery) can greatly benefit from innovative tools, able to further inform the decision-making process. Governmental organizations use flood maps-a critical component of risk assessment-for emergencies, spatial planning, and awareness raising, while, in the insurance sector, flood maps are critical for managing portfolios, risk screening, and assessing long-term financial solvency (de Moel et al., 2009). The vital importance of flood mapping is also recognized in the EU Floods Directive (2007/60/EC) that mandates member states to produce flood hazard and risk maps.
Flood mapping is traditionally available at the reach-scale and in urban settings (Horritt & Bates, 2002), where researchers devote themselves to increasing the level of detail (i.e., physics and spatial resolution) of flood models, an example being the move toward hyperresolution flood modeling (e.g., Noh et al., 2018). Instead, from regional to global scales, researchers tend to focus more on simplification, as parsimonious models are more suitable to be used over larger domains due to their higher computationally efficiency (Alfieri et al., 2014;Dottori et al., 2016;Neal, Schumann, et al., 2012;Neal, Villanueva, et al., 2012;Pappenberger et al., 2012;Rebolho et al., 2018;Sampson et al., 2015;Winsemius et al., 2013;Yamazaki et al., 2011;Zheng et al., 2018). The level of detail in flood models and their large spatial coverage are two desirable but often competing properties; in other words, it is hard to set up a flood model with one of these characteristics without compromising the other. Adding to long simulation times, the calibration and validation of models in standard approaches to flood hazard mapping (please refer to Grimaldi et al., 2013, for more details) pose significant challenges. In global hydrological models (Bierkens, 2015), calibration is crucial to improve simulations (e.g., Hirpa et al., 2018) but requires large amounts of reliable streamflow observations that are scarcely available. In hydrodynamic models, calibration is also invaluable (e.g., Wood et al., 2016), and it is not uncommon to find numerical instabilities that need to be solved beforehand (e.g., by adjusting the time step or adding numerical diffusion). All these challenges, on top of the need for computational power and the difficulty in finding reliable validation data (e.g., Bernhofen et al., 2018), counters the effort for up-to-date flood maps at any location or any time.
Bottlenecks, such as the ones presented above, motivate a number of authors to produce alternative low-complexity solutions that rely on data-driven methods (Caprario & Finotti, 2019;Giovannettone et al., 2018;Tang et al., 2018;Zhao et al., 2019). Some authors take advantage of the causality between historical floods and the floodplain hydraulic geometry (e.g., Bhowmik, 1984;Dodov & Foufoula-Georgiou, 2006;McGlynn & Seibert, 2003) and make use of digital elevation models (DEMs) that are data sets representing the Earth's surface, distributed as gridded values of local terrain elevations (Tavares da . For example, Nardi et al. (2006Nardi et al. ( , 2013Nardi et al. ( , 2019, Morrison et al. (2018), and Annis et al. (2019) employed a flat-water approach (i.e., intersection of a water level with the surrounding DEM, or a variation of it, such as the HAND-Height Above the Nearest Drainage; Rennó et al., 2008;Nobre et al., 2016) to delineate floodplains. The authors used a variable water level at each stream pixel from a stream-order averaged linear scaling relation (power law of upslope contributing area), obtained either from a generalization of outlet discharges and the Manning uniform flow equation (Manning, 1891) or through calibration with reference data. Interestingly, using a measure-of-fit of the delineated floodplains, the authors found that consistent floodplain delineations can be obtained with constant values of power-law coefficients . Moreover, Annis et al. (2019) also found that the optimal power-law exponent varied with the spatial resolution of the DEM, the return period, and with Strahler stream order. Degiorgis et al. (2012) proposed the delineation of flood-prone areas from a location where a flood map exists to one where it does not. This extrapolation procedure was achieved by threshold binary classification: in other words, by the identification of the isoline (the optimal threshold, TH) of a chosen flood descriptor that best approximated the areal extent of an existing flood map. Flood descriptors can be defined as quantitative layers extracted from DEMsthat correlate to the tendency of an area to flood. Manfreda et al. (2014Manfreda et al. ( , 2015 and Samela et al. (2016) improved the method introduced by Degiorgis et al. (2012) by evaluating different flood descriptors in terms of their suitability to delineate flood-prone areas. In their studies, the Geomorphic Flood Index (GFI, Samela et al., 2017) was found to be the best performing and the most consistent flood descriptor among the ones analyzed (Manfreda et et al., 2016, 2017), of which the HAND was one. Building upon this, Samela et al. (2017) and Tavares da Costa, Manfreda, et al. (2019) successfully delineated flood-prone areas at the continental scale by dramatically reducing computational time and costs, opening new possibilities for flood risk assessment and management over large scales. In Tavares da Costa, Manfreda, et al. (2019), optimal thresholds of the GFI were also shown to be positively correlated to flood extents associated with specific return periods.
In a different effort, Jafarzadegan and Merwade (2017) experimented with regression models, obtained by analysis of climate and catchment characteristics, to delineate the 100-year floodplains in North Carolina, USA. The delineation was performed based on a range of thresholds of the HAND model, used as the flood descriptor. The authors utilized the United States Federal Emergency Management Agency's Flood Insurance Rate Maps as benchmark for validating the results, but pointed out their uncertain and subjective nature. This approach was later extended by Jafarzadegan et al. (2018) and Jafarzadegan and Merwade (2019) to include a probabilistic description of the 100-year floodplains.
The data-driven approach described above complements traditional flood modeling, because its underlying principle is different and because it provides a cost-effective alternative that can fully exploit big, high-resolution data sets, without limiting the scale of application nor compromising computational speed (Di Baldassarre et al., 2020). Being mostly DEM-based, the approach also lessens the problem of data-scarcity often found, enabling the delineation of flood extents in any given catchment, based solely on the regression between envelope flood extents and catchment characteristics. The questions that the approach poses are: (1) Can functional relationships also be established between the GFI and catchment characteristics? (2) If so, can these relationships be used for low-complexity predictive modeling of envelope flood extents?
In this study, and differently from Jafarzadegan and Merwade (2017), a deterministic classification of the GFI, based on a specific objective function, is used to derive a TH for a larger number of training elementary catchments (i.e., hydrological units defined as the "portion of basin directly drained by a river stretch, between two confluences, or from the headwater to the first confluence" in Castellarin et al., 2018; see Figure S1 in the supporting information for an illustration of this concept) of four different major river basins in Europe with significantly different characteristics and for two return periods (the 10-and 10,000-year). Values of TH correspond to unique envelope flood extents: in other words, the GFI layer isolines that best envelope a given benchmark flood hazard map. Consistent binary masks of high-resolution flood hazard maps are used as benchmark and were obtained from Risk Management Solutions (RMS), a recognized catastrophe risk modeling company that applied a standard flood hazard mapping approach.
Subsequently, regression models are established between the TH-the target variable-and a set of geomorphic and climatic-hydrologic catchment characteristics-the explanatory variables-different from those in Jafarzadegan and Merwade (2017), tested beforehand in terms of multicollinearity. Regression models were objectively calibrated and optimized for the prediction of flood extents and flood-prone areas.
In this way, two major drawbacks found in previous applications are relaxed: (1) the complete dependence on benchmarks, since in this approach they are only needed to train the models and not for every location where flood-prone areas are to be delineated, as is the case in Degiorgis et al. (2012), for example, and (2) the assumption of transferability of the TH without any physical basis, as is also the case in Degiorgis et al. (2012), for example, since here catchment geomorphic and climatic-hydrologic characteristics are used to regress the TH.
Finally, flood-prone areas in elementary catchments of a distinct major river basin in Europe-from those used in the regression models-are delineated using a TH predicted by the regression models. Each resulting delineation is compared to the benchmark to assess the ability of the method to predict the extent of the envelope of major floods.
This paper is organized as follows: In section 2, the workflow to develop and test the predictive models of envelope flood extents is presented, and each step of this workflow is described in detail; in section 3, the four river basins in Europe (Thames, Weser, Rhine, and Danube), used to train the models, and the Rhône river basin in Europe, used to validate the models, are presented alongside the data sources used to obtain the physical characteristics of the river basins. Furthermore, a brief description of the workflow used to obtain the benchmark flood maps is explained; in section 4, the results obtained for each part of the methodological workflow, that is, classification and physical characterization of elementary catchments, model development, and prediction, are presented; in section 5, the main conclusions are drawn, and future work is addressed.

Development and Testing of Predictive Models
The methodology adopted for the development and testing of predictive models is based on the prior definition of a flood descriptor, the GFI, whose isolines are used to classify benchmark flood extents. The unique TH values resulting from the classification are then related to a selection of geomorphic and climatic-hydrologic catchment characteristics by two distinct types of regression models. Establishing such relations allows for the prediction of TH based on physical inputs of any given river basin. In Figure 1, the general methodological workflow is presented.

Catchment Delineation
The delineation of elementary catchments accounts for a hierarchical structure that reflects the topology of the river network (Verdin & Verdin, 1999). The main reason for choosing this scale of analysis is the division of each river basin in topographic areas that may contribute significantly to discharge and that play a central role in the management of water resources. It also serves the purpose of making computations more manageable through concurrent programming (Tavares da Costa, Manfreda, et al., 2019). Catchments are delineated following the constraint that catchment areas should be less than~1,200 km 2 .

Geomorphic and Climatic-Hydrologic Catchment Characterization
Catchment characteristics used in this study as explanatory variables (see Table 1) are strictly geomorphic and climatic-hydrologic, as defined by Horton (1932), where soil, geology, and vegetation are not taken into account. For consistency, the same high-resolution DEM of the proprietary flood maps is used to extract geomorphic catchment characteristics.
The following single geomorphic catchment characteristics were considered: area of elementary catchment (A); flow accumulation at the elementary catchment outlet (F), defined as the cumulative sum of raster cells upstream of the outlet, which relates to A; relief of the elementary catchment (Δz), defined as with z max the maximum and z min the minimum elevation of the elementary catchment; the total river channel length in the elementary catchment (L ch ), which also relates to A; and mean river channel fall, or relief, in the elementary catchment (Δz ch ): with z ch,max the maximum and z ch,min the minimum elevation of the stream channel in the corresponding elementary catchment. Single geomorphic catchment characteristics were chosen according to their relevance as potential drivers of flood (e.g., Gaál et al., 2012). The magnitude of a flood can be related to A, and F, as it conditions the amount of precipitation that may enter a catchment, that is intercepted by vegetation, infiltrates the soil or is routed to the stream channel. On the other hand, catchment relief (Δz) determines whether water infiltrates or flows quickly to a stream channel. In turn, channel relief (Δz ch ) determines whether water flows quickly to the outlet or accumulates and leads to a flood. Finally, stream channel length (L ch ), in combination with Δz ch , determines whether travel times of water are shorter or longer and the amount of storage; this affects the response of a catchment to precipitation.
Composite geomorphic catchment characteristics used in this study are representative of the mean declivity. Within each elementary catchment, the relief-area ratio is defined as while the relief ratio of the river channel in the elementary catchment is defined as From a hydrological perspective, the relief-area ratio has an important relation to surface runoff, to the concentration of rainfall in river channels, and to flood magnitude (Horton, 1932). On the other hand, relief ratio of the river channel gives an estimate of channel storage and time length required by a flood wave to traverse the channel (Horton, 1932), while it also relates to the linear head loss found in the Manning's equation for uniform flow. High mean declivities equate to water entering the channel quicker and thus to higher flooding likelihood. By contrast, gentle sloping channels are slower to route the incoming runoff and have a lower flooding likelihood.
Precipitation is the most important factor driving a flood. In particular, multiday rainfall events are an important cause of flooding (Fowler & Kilsby, 2003) as they increase the likelihood of exceeding the catchments' drainage capacity. In this study, the annual highest 10 consecutive day precipitation is reported, with a likelihood of occurring or of being exceeded every 10 and 10,000 years on average (P 10 and P 10k , respectively). These statistics were calculated based on the ensembles daily gridded observational data set for precipitation, temperature, and sea level pressure in Europe from the European Climate Assessment & Dataset (ECA&D E-OBS; Cornes et al., 2018). Principal component analysis was applied to precipitation anomalies in the data set for the 1950-2010 period, in order to identify dominant rainfall patterns across Europe. Stochastic precipitation fields were obtained for 50,000 years as linear combinations of empirical orthogonal functions and principal components (Zanardo et al., 2019). To complement these statistics, the mean annual precipitation (MAP) calculated by averaging the annual totals obtained from the ECA&D E-OBS data set is also reported.
Proxies for long-term average runoff are obtained by accumulating precipitation statistics downstream using the hierarchy of connected elementary catchments. The general water balance equation for each elementary catchment is given by with P the precipitation received at each elementary catchment, ΔQ the change in specific runoff, E the evapotranspiration, and ΔW S the change in water storage. If subsurface water flow and evapotranspiration losses are neglected as a simplification for severe rainstorms and very humid conditions-meaning that overland flow suffers either from saturation excess or infiltration excess and that evapotranspiration losses are much lower than water entering the elementary catchment-the direct conversion of precipitation into runoff may be assumed dominant at each elementary catchment, water yield tends to 1, and the following equation holds: with Q out the runoff at each elementary catchment outlet and Q in the runoff from upstream elementary catchments. Equation 6 can be further expanded, to cater for the lumped cascading estimation of direct runoff at the elementary catchment: with P 0 corresponding to a unique long-term average precipitation statistic (i.e., P 10 , P 10k , or MAP) associated with the elementary catchment under analysis with area A 0 , while P i is the unique precipitation statistic associated with the nth upstream elementary catchment with area A i . Results are reported as unit discharge at each elementary catchment outlet: with cell size (in m 2 ) equal to the product of pixel length by pixel width, specific to each DEM.
Spurious values found in the computed catchment characteristics were filtered out. In specific, some elementary catchments were found to be unrealistically small (A < 2 km 2 ); this was due to automatic delineation problems, namely, polygon intersections and invalid geometries, and some of the precipitation statistics presented a number of values that deviated markedly from the other data points.

Flood Descriptor
A flood descriptor is understood here as a raster layer that is able to identify the susceptibility to flooding in a given area. It can be obtained by combining different factors into a unique raster (e.g., terrain analysis, land 10.1029/2019WR026453

Water Resources Research
da COSTA ET AL.
use, and land cover). In this study, the flood descriptor, GFI, is presented as a combination of hydrogeomorphic factors (Samela et al., 2017). The GFI computation requires several steps (Tavares da Costa, Manfreda, et al., 2019).

Terrain Analysis
The extraction of the terrain analysis layers from the DEM for each European river basin precedes the computation of the GFI. Terrain analysis employed in this work follows a simple workflow (see Figure S2 for an illustration of this workflow) using the TauDEM toolbox. For more details, the reader can refer to Tavares da Costa, Manfreda, et al. (2019) and Tavares da Costa, Mazzoli, and Bagli (2019).

The Geomorphic Flood Index
The GFI is a raster layer estimated from preprocessed terrain analysis layers extracted from a DEM. The computation of the GFI for each of the major river basins in Europe considered in this study is given by It is composed of two terms, computed following the steepest downslope path given by a convergent eight direction flow model (abbreviated as D8 flow model). The first term, h ij (in meters), consists of an empirically derived bankfull depth estimated by means of a power-law hydraulic scaling relation of bankfull depth and upslope contributing area (Dodov & Foufoula-Georgiou, 2006;Manfreda et al., 2015;Nardi et al., 2006;Samela et al., 2016Samela et al., , 2017. The empirically derived bankfull depth in each cell under analysis (i, j) is computed using the upslope contributing area specific to the river centerline cell hydrologically connected to cell i, j following the D8 flow model: with F the flow accumulation specific to the river centerline cell hydrologically connected to cell i, j following the D8 flow model.
For simplicity, the power-law constant a and exponent n are kept constant with values of 0.1 and 0.4 (Samela et al., 2017), respectively. The reason for integrating a bankfull depth that scales with the contributing area in the GFI is to mark a clear boundary between channel and floodplain and avoid problems with the actual channel. The second term of the GFI consists of the HAND calculated between the cell under analysis (i, j) and the river centerline cell hydrologically connected to cell i, j, following the D8 flow model: with z ij the DEM elevation value of the cell under analysis and z ch k the DEM elevation value of the hydrologically connected river centerline cell. The HAND model makes each cell elevation value relative to the connected channel cell instead of the mean sea level. This is crucial in order to determine unique thresholds that best define a specific envelope flood extent. The GFI is rescaled before use to a range of values lying between 0 and 1, corresponding to low (i.e., away from the river centerline) and high hazard levels (i.e., near the river centerline), respectively. Note that moving away from the river centerline, H ij increases while the GFI decreases. Scaling is achieved by resorting to the minimum and maximum values of the GFI (Tavares da Costa, Manfreda, et al., 2019). The rescaled GFI can effectively be used as a classifier of flood-prone areas (Manfreda et al., 2015;Samela et al., 2016Samela et al., , 2017Tavares da Costa, Manfreda, et al., 2019) and of the extent of the envelope of major floods that is confined to the floodplain, between the active river channel at bankfull and the surrounding marked topography. The GFI computation is summarized in Figure 2.

Discrete Statistical Classification
The threshold binary classification, introduced by Degiorgis et al. (2012), is adopted in this study to find the TH that produce the best possible representation of the RMS flood maps, the benchmark, in terms of flood extent (see Figure S3 for an illustration of the workflow). The resulting unique TH per elementary catchment of the Thames, Weser, the upper Rhine, and the upper Danube river basins are subsequently used as the target variable for training the regression models.

10.1029/2019WR026453
Water Resources Research da COSTA ET AL.

Threshold Binary Classifier
The binary classifier consists of a mathematical optimization that outputs the best possible representation of known binary values from a benchmark. The algorithm starts by creating, through image segmentation of the GFI, a binary flood extent representation associated with each threshold, out of a large number of possible values from zero to one, hereinafter called the segmented GFI. The algorithm then searches all these binary cases to find the one that best approximates the benchmark. The optimal case, represented by a unique TH, is indicated by the maximization of a specific objective function that expresses the correctness of a representation.
The classifier performs better when binary categories (flood-prone and flood-free) are symmetrically distributed (Kubat et al., 1998), that is, when raster cells of one binary class are not much greater in number than the other (i.e., class imbalance). Therefore, a portion of the GFI, namely, a classification area corresponding to a fixed buffer (~1 km) around the river network centerline of the largest benchmark, is adopted in order to handle class imbalance by discarding the number of flood-free raster cells in excess.
The classification results are evaluated with two specific metrics based on a 2 × 2 binary contingency matrix (see Table S1 for an example of such matrix) constituted by values of tp or the number of raster cells marked as flood-prone in both the segmented GFI and the benchmark flood hazard maps; fn or the number of raster cells marked as flood-free in the segmented GFI but marked as flood-prone in the benchmark; tn or the number of raster cells marked as flood-free in both the segmented GFI and the benchmark; and fp or the number of raster cells marked as flood-prone in the segmented GFI but marked as flood-free in the benchmark. The first of these metrics assesses the discerning capability of the GFI itself (the receiver operating characteristic, or ROC analysis; see section 2.4.3), while the other measures the degree of association between flood-prone areas resulting from the threshold binary classification and a benchmark (the r ϕ ; see section 2.4.4). Both metrics were found to work well for this type of classification (Tavares da Costa, Manfreda, et al., 2019).

Objective Function
The maximization of the true skill statistic (TSS) (Peirce, 1884) is adopted in this study as the classification rule that defines which threshold is optimal to select for each elementary catchment. The TSS represents the point of maximum forecast value of the classifier; in other words, it is the point in the ROC that has the maximum perpendicular distance from the line of no-skill (Manzato, 2007), which translates to a good representation of the binary categories in the benchmark (see Figure S4 for an example of a ROC and the respective TSS point identified in the curve).
The TSS, also based on the contingency matrix, has been used elsewhere with success by several authors (Alfieri et al., 2012;Bartholmes et al., 2009) and can also be interpreted as the probability of making an informed decision in terms of the proportion of correct binary categories, assuming for this specific study that the misclassification of flood-prone areas is as undesirable as the misclassification of flood-free areas. The TSS is defined as with TPR the true positive rate or the probability of a correct hit and FPR the false positive rate or the probability of an incorrect hit. The TSS is negative when the segmented GFI has a higher number of fp and fn than tp and tn; it is positive when the opposite happens, with TSS ¼ 1 indicating that the segmented GFI perfectly matches the benchmark. The case of TSS ¼ 0 implies that the classifier does not provide any useful information.

ROC Analysis
ROC analysis has been used by several authors to distinguish between decision values in a classifier and their trade-offs between costs and benefits (Bradley, 1997;Fawcett, 2006;. It is considered a threshold-independent performance measure, as points falling along the ROC curve (see Figure S4 for an example) represent unique evaluations, in terms of TPR and FPR, of a considered segmented GFI against the benchmark. The top left corner of the ROC space represents the perfect classification, such that TPR ¼ 1 and FPR ¼ 0; instead, the diagonal line dividing the ROC space represents the line of no-skill.
In the specific context of this study, the area under the ROC curve (AUC) summarizes the overall discerning capability of the GFI in a single threshold-independent value per elementary catchment. As such, irrespective of the threshold, it represents the probability of correctly classifying a randomly chosen raster cell as flood-prone rather than incorrectly classifying it as such (Bradley, 1997;Fawcett, 2006). The AUC can be estimated by trapezoidal rule approximation of the definite integral and may take values from AUC ¼ 0.5, meaning no discerning capability of the GFI, to AUC ¼ 1, the perfect classifier. An AUC ≤ 0.5 is used to filter out elementary catchments that are not well suited to serve as a classifier and may therefore impact the formulation of the statistical relationships.

Degree of Association
The modified Pearson's correlation coefficient for discrete dichotomous problems, r ϕ (Cramér, 1999;Matthews, 1975), is used as a measure of magnitude of association and direction of the linear relationship between flood-prone areas and a benchmark: where χ 2 is the Pearson chi-square statistical test (Pearson, 1900), with N the total number of samples. It can be seen as the geometric mean of the TSS and its complementary term. The r ϕ is used in this study to evaluate the degree of association between the segmented GFI and the benchmark. As a rule of thumb, it is assumed that values between 1 and 0.5 represent a strong positive degree of association, between 0.5 and 0.3 a moderate degree of association, 0.3 to 0.1 a weak degree of association, and from 0 to 0.1 a complete absence of association. A r ϕ ≤ 0.3 is used to filter out elementary catchments that have a weak degree of association.

10.1029/2019WR026453
Water Resources Research da COSTA ET AL.

Formulation of Statistical Relationships
Multivariate statistical methods can be used to describe the relationship between unique TH, the target variable, and a set of explanatory variables represented by catchment characteristics that are scaled and mean centered before use. In this study, the stepwise regression and random forest are set up as models to predict envelope flood extents.
Ideally, catchment characteristics should not be highly correlated to each other or to their linear combination, since multicollinearity may increase the variance of parameter estimates and potentially lead to unreliable results. Therefore, before developing the statistical models, multicollinearity is diagnosed with the variance inflation factor (VIF). Multicollinearity may be present when VIF > 10 (Hirsch et al., 1992;Merz & Blöschl, 2005) and therefore variables above such values are considered for elimination prior to model fitting.
The problem of estimating the regression coefficients, or the fitting problem, is solved by stepwise analysis with bidirectional elimination (i.e., the sequential addition) and replacement or elimination of explanatory variables based on the relative quality of each competing model. The trade-off between maximum likelihood and explanatory variables, or model's simplicity in this context, is measured by the Akaike's information criterion (AIC) (Akaike, 1974). In practical terms, model selection is based on the minimum possible AIC obtained for competing models (Haddad et al., 2012). A formal definition of MLR, VIF, and AIC can be found in the supporting information.
To obtain classical evaluation metrics such as R 2 and the root mean square error (RMSE), a 10-fold crossvalidation procedure is used. The cross-validation consists of randomly splitting the data set in 10 equally sized groups, one of which is retained for testing and the remaining are used for training. The training and testing procedures are repeated 10 times, until every single group has been selected once. Performance results from each of the 10 validations are averaged to produce a single final estimation.
Some important advantages of the random forest method are that it does not need any specific assumption about the probability distribution (nonparametric), it works well when the relationship between explanatory variables, and response is nonlinear, as well as when there are high order interactions (Snelder et al., 2013). Furthermore, random forest is relatively robust against outliers, noise and overfitting (Breiman, 2001), and can handle the problem of multicollinearity well .
As opposed to MLR, a chief disadvantage of this method is that it cannot predict target values outside the range of the explanatory variables in the training data set. Another limitation of the random forest is that it does not provide an easy understanding of the statistical relationships between explanatory variables. However, it does provide a simple visualization of the model structure and of the covariate influence, in contrast to other machine learning methods, such as artificial neural networks (Shortridge et al., 2016).
The random forest regression model used in this study goes through an automatic and distributed optimization procedure (grid search) of the setup parameters in order to find the best performing model, in terms of both accuracy and computational efficiency. In specific, the optimized parameters are the number of decision trees in the ensemble, the number of sampled variables at each tree node, and the maximum depth of each tree. The optimization of the random forest regression is achieved a priori, using a 10-fold crossvalidation to obtain evaluation metrics and compare the multiple models.

Performance Assessment
To evaluate the fit between predicted flood-prone areas, obtained through image segmentation of the GFI using the predicted TH, and the benchmark, four common performance metrics, also based on the with values ranging from FDR ¼ 0 (no false alarms) to FDR ¼ 1 (overprediction); the critical success index, with values ranging from C ¼ 0 when there is no match between the delineated flood-prone areas and the benchmark and C ¼ 1 when there is a perfect match; and the error bias, that indicates whether there is a tendency toward underprediction, 0 ≤ E < 1, or overprediction, E > 1, with E ¼ 1 an indication of no bias.
As the objective is to predict envelope flood extents, overpredicting the flood-prone areas might benefit the tp and inflate the TPR. This inflation would be of concern if there was no other reported measure that would give an alternative account of the performance. By reporting the FDR, an account of the percentage of cells that are overpredicted is given. At the same time, the critical success index extends the TPR by including the fp, accounting for both underprediction and overprediction, while the error bias gives the ratio between the fp and the fn, indicating whether there is a tendency for underprediction or overprediction. The reporting of these four measures should give a reasonable overall account of the performance.

Study Area, Data Sources, Training, and Test Sets
The River Thames in the United Kingdom constitutes the longest one in southern England (~350 km length). It drains an area (~13,478 km 2 ) of relatively flat terrain (mean elevation of~100 m a.s.l.) to the North Sea. The Thames river basin has a MAP ranging from~610 to 778 mm yr −1 that results in a mean annual runoff (MAR) ranging from 100 to 295 mm yr −1 . The Thames is prone to major flooding; the 2013/2014 winter floods that the valley sustained are an example of this (Fenn et al., 2016;Huntingford et al., 2014).
The River Weser in Germany has an overall length of~452 km. It drains an area of (~43,857 km 2 ) relatively flat terrain (mean elevation of~200 m a.s.l.) to the North Sea. The Weser river basin has a MAP ranging from 575 to 1,195 mm yr −1 that results in a MAR ranging from 190 to 930 mm yr −1 . In 2013, the Weser river basin was affected by high flood levels with peak discharges above the 50-year return period (Schröter et al., 2015).
The River Rhine (~1,230 km total length) has its source in the Swiss Alps and flows through several major cities in Switzerland, France, Germany, and the Netherlands, where it drains to the North Sea. The upper Rhine river basin (drainage area of~32,114 km 2 ), the portion of the Rhine river basin considered in this study, presents a relatively mountainous terrain (mean elevation of~1,065 m a.s.l.), with MAP ranging from 825 to 1,715 mm yr −1 and resulting MAR ranging from 330 to 2,250 mm yr −1 . The upper Rhine river basin is prone to major flooding; for example, in 2007, one person died, at least 100 were affected, and the country withstood a total estimated damage of more than 312 million EUR (CRED EM-DAT).
Originating in Germany and flowing through major cities (e.g., Vienna in Austria) in 10 different countries before draining to the Black Sea, the River Danube is the second longest river in Europe (~2,850 km length). The upper Danube river basin (drainage area of~97,000 km 2 ), portion considered in this study, is characterized by a relatively mountainous terrain (mean elevation of~890 m a.s.l.), MAP ranging from~460 to 1,785 mm yr −1 and resulting MAR ranging from~23 to 1,282 mm yr −1 . The upper Danube river basin is 10.1029/2019WR026453

Water Resources Research
da COSTA ET AL. prone to major flooding; for example, in 2013, four persons lost their lives, at least 200 were affected, and the country withstood a total estimated damage of more than 893 million EUR (CRED EM-DAT).
The River Rhône in France originates in the Swiss Alps and runs through southeastern France, where it finally drains to the Mediterranean Sea. The Rhône river basin, with an area of~96,475 km 2 , has a mean elevation of~785 m a.s.l. It is characterized by a MAP ranging from~561 to 1,890 mm yr −1 , resulting in a MAR ranging from~119 to 1,551 mm yr −1 . The winter floods of 2003 marked the largest flood in the Rhône river basin since 1856. Consequences arising from this event were severe, with the country withstanding a total estimated damage of~1.130 billion EUR (Arnaud-Fassetta, 2013). Mean elevation values presented in this section were estimated from the European Environmental Agency (EEA) EU-DEM, MAP values from the ECA&D E-OBS, and MAR values from the University of New Hampshire Global Composite Runoff Fields (Fekete et al., 2002).
The five river basins were selected for this study mostly for their record of major floods and their importance in Europe; their geographical locations can be visualized in Figure 3. The GFI raster layer and the catchment characteristics are computed for all selected river basins using a proprietary DEM data set, hereafter referred to as RMS-DEM, at~50 m spatial resolution. The RMS-DEM is suited for flood inundation modeling, as it does not contain artifacts such as trees, buildings, or bridges that can adversely affect the accuracy of the simulations. The (1) Thames river basin in the United Kingdom, with 83 elementary catchments delineated, the (2) Weser river basin in Germany, with 170 elementary catchments delineated, the (3) upper Rhine river basin in Switzerland, with 109 elementary catchments delineated, and the (4) upper Danube river basin in Austria, with 286 elementary catchments delineated, are used for training the regression models. Their merging into a single data set resulted in a total of 648 elementary catchments for each return period, of which, after filtering out issues such as poor classification results (see sections 2.2, 2.4.3, and 2.4.4 for more details), 453 were effectively used for the 10-year return period and 486 for the 10,000-year return period, giving a total of 939 data points effectively used and 357 discarded. The first two river basins are representative of flatter regions and the last two of mountainous regions. The (5) Rhône river basin in France, with 277 elementary catchments delineated, is instead used for testing the regression models.
The benchmarks used in the threshold binary classification to find the unique TH values, and also in the evaluation of the final predictions, are the high-resolution flood hazard maps for Europe developed by RMS and currently used by global insurance and reinsurance companies. The RMS flood maps were created for several return periods at~50 m resolution. They are based on a cascade of sequential modeling components. Rainfall-runoff processes are modeled with a semidistributed, TOPMODEL-based approach (Beven & Kirkby, 1979). Flows are routed through the river network using the Muskingum-Cunge 1D wave propagation method (Cunge, 1969;Georgakakos et al., 1990). Inundation depths and extents are derived by applying rating curves to river flows in every river segment of 500 m, calculating the associated river depth, and filling the river cross-section extracted from the DEM for each segment. The maximum flood depths over the floodplain, after propagating the flood wave through the main river channel, represent the flood hazard map for an event. The benchmarks are used in the form of binary masks (raster cells marked as flood-prone or floodfree) obtained through image segmentation with a cut-off depth set to 0.01 m. The overall accuracy of the benchmarks is not expected to be very high because of the methodology employed; however, they should provide a fairer comparison, since the GFI is not able to represent the dynamics of the flow over the floodplain.

Classification Outcomes and Catchment Characteristics
The classification of the GFI layer to obtain the TH was performed using each elementary catchment that constitutes the training set, composed of four major river basins in Europe. In Figure 4, the data used in the development of estimators and prediction of envelope flood extents for the 10-and 10,000-year return periods are presented.
Average AUC values of~95% and 91% are found for the 10 and 10,000-year return periods, respectively, which indicates a very high discerning capability of the GFI classifier. These AUC values translate to a high probability of correctly classifying a raster cell as flood-prone or flood-free.

10.1029/2019WR026453
Water Resources Research da COSTA ET AL.
Average r ϕ values between~60% and~64% are found for the 10-and 10,000-year return periods, respectively, which indicates a strong positive degree of association (i.e., between 1 and 0.5) of the best possible representation of flood-prone areas.
Values of TH for the elementary catchments of the training set are found to range between 0.18 and 1, with a mean value of~0.44 and 0.39 for the 10-and 10,000-year return periods, respectively. As expected, there is a tendency toward a value decrease with increasing return period (Tavares da Costa, Manfreda, et al., 2019).
For most catchment characteristics, more than 50% of test data is contained within the training set interquartile range. Exceptions to this can be found for the precipitation statistics, P 10 and P 10k , with median slightly above the test set interquartile range, and for the corresponding unit discharge estimates, q 10 and q 10k . The   Table 1 and section 2.2 for a complete description and units of variables. (a) Area under the relative operating characteristic curve AUC, the modified Pearson's correlation coefficient for discrete dichotomous problems r ϕ , and the optimal Geomorphic Flood Index (GFI) thresholds TH obtained from the classification of training catchments (Thames, Weser, upper Rhine, and upper Danube river basins). (b) Geomorphic and climatic-hydrologic catchment characteristics for the combined training and test catchments (Rhône river basin). (c) Correlation matrix between TH and catchment characteristics.

10.1029/2019WR026453
sample variability of the test set is larger than that of the training set for the A, the S, the L ch , and the S ch . The explanatory variable A is the only one that is noticeably positively skewed, while the MAP and corresponding q MAP are negatively skewed. These differences are expected to impact the final prediction of envelope flood extents as the training set does not represent the test set in the most exhaustive manner.
The correlation matrix in Figure 4c provides an evaluation of the magnitude of association and direction of the linear relationship between explanatory variables and the dependent variable (in Figure S5, the correlation matrices for the 10-and 10,000-year return periods are presented). Correlation between the TH and the F indicate a moderate positive linear relationship. Moreover, a moderate negative linear relationship is revealed between TH, the Δz ch , and S ch . The remaining catchment characteristics reveal weak linear relationships to the TH.
Furthermore, correlations that exist between different catchment characteristics may be indicative of multicollinearity. Disregarding the correlations between composite explanatory variables and their constituting parts, a moderate to strong positive correlation between precipitation statistics, discharge proxies, and Δz is noticeable. Also, A shows a strong positive correlation with Δz and L ch . Collinearity between catchment characteristics is an undesirable condition that can negatively impact the quality of the statistical relationships for the prediction of envelope flood extents and needed to be addressed before any further step is taken.

Estimators of Envelope Flood Extents
Two types of regression models were built from the classification outcomes TH and catchment characteristics of the training set, namely, the stepwise regression and the random forest regression models.
Several data transformations were tried for building different models (log-linear, linear-log, and principal component analysis, which were tested but did not produce any beneficial result). Log-log-transformed variables (note that the GFI is already a logarithm) were used as they substantially improved both models' statistical tests and performance metrics.
Several data splits were also tried for building different models. For example, one model for the 10-and another for the 10,000-year return period were tried but did not yield significantly different results from the ones presented in Table 2. Furthermore, the inclusion of only some specific river basins was tested. Namely, two out of the four training river basins were held out at the time for testing the models, built with the remainders. None of the six river basin combinations (4 ! /2 ! * (4 − 2) ! ¼ 6) significantly improved the overall performance, and in most cases, holding out specific river basins actually decreased it. Note. Variable importance, as well as variance inflation factor (VIF), Akaike's information criterion (AIC) and the statistical tests, R 2 , and root mean squared error (RMSE), refer to the optimized models.

10.1029/2019WR026453
Water Resources Research da COSTA ET AL. In the stepwise regression, multicollinearity tests point toward strong collinearity between the composite explanatory variables and their individual constituents. Additionally, VIF values above 2 were found between the combined q 10 and q 10k and the q MAP . Thus, the catchment characteristics A, Δz, L ch , Δz ch , and q MAP were considered for elimination given the results of the multicollinearity tests and taking into consideration the correlation results presented in Figure 4c.
The previous steps were followed by a stepwise selection of explanatory variables based on the AIC, which reflects the trade-offs between maximum likelihood and model simplicity. A very low AIC leads to the following equation: As can be seen from Table 2, the final optimized linear model is constituted by six of the 11 original explanatory variables and is characterized by a high F statistic (>3) and very low p value (<0.01), which indicates a high degree of significance of individual explanatory variables and of the model. The 10-fold crossvalidation procedure results in an R 2 value of~42%, indicating a moderate explanatory power of the model, and an RMSE of 0.0597.
In the random forest regression, optimization was performed by an automatic search of the best possible combination of input parameters that lead to the highest possible decrease in RMSE obtained through cross-validation. This also ensured that overfitting was avoided. The final optimized random forest regression corresponds to a model with 644 trees, three explanatory variables randomly sampled at each tree node, and a maximum depth of 25 nodes. As shown in Table 2, the random forest regression results have substantially improved the explained variance obtained by the stepwise regression model, from~42% to~64% R 2 , and the RMSE, which decreased from 0.0598 to 0.0466.
Variable importance was assessed based on the absolute value of the t statistic for the MLR model (for more details about the assessment of variable relative importance in linear models, please refer to Lindeman et al., 1980). For the random forest regression model, variable importance was assessed based on the empirical improvement of the squared-error as a result of a split in a nonterminal tree node, averaged over all trees (Breiman et al., 1983). It is shown that the S ch is the explanatory variable with the highest relative importance in both models,~14% and 19%, followed by the F with~10% and 19%. In the random forest regression model, Δz ch is also found to have a fairly high variable importance with~14%. The remaining catchment characteristics are ranked as relatively less important or not included at all (stepwise regression). However, it should be noted, for the case of the random forest model, that as one explanatory variable is randomly selected at a tree node, the importance of other variables is substantially reduced, particularly if there is collinearity. In light of this, variable importance should be interpreted with caution, as explained in Seibert et al. (2017).

Prediction of Envelope Flood Extents
Using the models presented in the previous section, envelope flood extents were predicted based on the physical characteristics extracted for each elementary catchment of the Rhône river basin.
Catchment characteristics matching the ones used for training of the regression models were obtained for the 10-and 10,000-year return periods and used as input. A unique TH was predicted per elementary catchment, return period, and model. The predicted TH values were used to segment the original GFI raster layer of each corresponding elementary catchment of the Rhône river basin and to delineate the flood-prone areas (see Figure 5b for an example).
By comparing each raster cell of the binary mask of predicted flood-prone/flood-free areas with each corresponding raster cell of the benchmark RMS flood maps, it was possible to obtain a contingency matrix for each model and for each return period considered, from which the performance metrics described in section 2.6 were computed.
In Figure 5 and Table 3, it can be observed that the TPR is high for the great majority of elementary catchments (average above 80%), similar between models, and slightly higher for the 10-year return period. At the same time, the FDR is high for the 10-year return period (average~63%), moderate for the 10,000-year return period (average~39%), and slightly higher for the stepwise regression. In turn, the C is moderate for the 10-year return period (average~34%), high for the 10,000-year return period (average~52%), and slightly higher for the random forest. It is also shown by the E that more than~85% of the flood-prone areas obtained for the elementary catchments of the Rhône river basin suffer from overestimation (E > 1). On average, the TPR decreased with increasing return period, but this seems to be compensated by a significantly lower number of false alarms, as a higher C is observed for the 10,000-year return period and a decrease of E (for more details, see Figure S6 where the distribution of the tp, fn, and fp values, in terms of absolute predicted area, are presented).
Furthermore, cases with a high number of upstream elementary catchments were found to have limited impact on the average results (for more details, see Figure S7 where the performance results for individual elementary catchments of the Rhône river basin are plotted against the corresponding number of upstream elementary catchments). However, there appears to be a higher dispersion of TPR, and a tendency for high FDR values for the 10-year return period, with an increasing number of upstream elementary catchments (see Figure S6 for more details).

Discussion and Future Research
The work presented in this paper documents the use of two data-driven, expeditious, and cost-effective approaches to predict the extents of the envelope of major floods in diverse river basins, gauged or ungauged, and for diverse return periods. This has been achieved by establishing functional relationships in the form of linear and nonlinear regression models between specific isolines of a flood descriptor (TH) and geomorphic and climatic-hydrologic characteristics of elementary catchments. This advancement extends a previous approach employed in Tavares da Costa, Manfreda, et al. (2019), by relaxing its complete dependence on benchmark flood maps and by providing a physical basis for the transferability of the TH between catchments, also giving a physical basis to the extrapolation and downscaling goals described in Tavares da Costa, Manfreda, et al. (2019). The results obtained in this work are promising for such a novel approach and give an optimistic view about future directions that could improve or extend it. At the same time, the limitations found, and discussed below, should encourage further investigation as new developments can be beneficial to flood managers and insurers, giving them nearly instantaneous views of the areas that can potentially be affected by floods.
The classification stage of the methodological workflow showed that, overall and in spite of some outliers, the GFI has a high overall discerning capability of flood-prone areas, as shown by the average AUC value above 91% for any of the return periods. At the same time, the average value of r ϕ above 60% for any of the return periods indicates a strong positive degree of association between the GFI delineated flood-prone areas and the benchmark flood maps. These values are significantly higher than the ones reported by Tavares da Costa, Manfreda, et al. (2019).
The development of estimators of envelope flood extents has shown that in the stepwise regression the VIF and AIC selection of catchment characteristics has been valuable in obtaining statistically significant explanatory variables that improved the explained variance (R 2 ) of the target variable and the fit of the initial model (RMSE).
In comparison to the stepwise regression, the random forest regression proved to be a much more flexible and straightforward approach to setup. The final optimized random forest model could be obtained without any prior selection of catchment characteristics and still substantially increase the R 2 and decrease the 10.1029/2019WR026453

Water Resources Research
da COSTA ET AL.

RMSE.
Moreover, the improvement of the statistical tests by the random forest model seems to provide some evidence of nonlinear behavior between TH and catchment characteristics.
Efforts toward further improving the R 2 and the RMSE could be undertaken in future work, but only with caution, since the method is not able to represent the dynamics of flooding during specific events and, therefore, cannot entirely replicate results from hydrodynamic models or flooded areas detected using satellite imagery. However, if this was to be pursued, it could be perhaps achieved by adding new or replacing existing single and composite catchment characteristics, which may be specific to each elementary catchment response, and by testing different data transformations in order to increase the correlation with TH. It should also be noted that the improvement of some of these quantities may demand more data, raise additional issues, and make replicability more difficult without any guarantee of significant performance enhancement in terms of metrics based on the contingency matrix.
When it comes to the predictions of envelope flood extents in the elementary catchments of the Rhône river basin, the random forest regression model performed marginally better than the stepwise regression for any of the return periods considered. Both the stepwise and the random forest regression outputted high TPR values, while at the same time moderate to high FDR values. This was reflected in a moderate to high critical success index, C, and in the E values always above 1, indicative of overprediction, especially at lower return periods. As this paper tries to deal with envelope flood extents, overprediction was already expected. Overall, predicted flood-prone areas better match the benchmarks at higher return periods, and, particularly for the 10,000-year return period, it is interesting to note that the average performance obtained is in line with some modeling results reported by Wing et al. (2017).
Limitations to this methodological approach are known. For instance, the GFI itself, as detailed by Manfreda et al. (2014Manfreda et al. ( , 2015, Samela et al. (2017), Tavares da Costa, Manfreda, et al. (2019), and Tavares da Costa, Mazzoli, and Bagli (2019), is found to be less than optimal in identifying flood-prone areas in flat terrain, which may explain the high FDR and overestimation. In other words, in a more incised fluvial valley, there are more independent GFI contours to choose from, which allows for a better representation of flood-prone areas. In contrast, in flat or convex terrain (e.g., alluvial fans), these fail to constrain flood extents in terms of elevation differences, which means that a small variation of TH can translate into overpredicted flood extents.
Being aware of the limitations of the methodology, there are possibilities to improve flood descriptors. This could be done by (1) further improving the spatial resolution, vertical accuracy, and the processing of the DEM (e.g., de-noising, smoothing, and hydrological conditioning, including stream burning); (2) calibrating the coefficient of the power-law that scales bankfull depth with contributing area (also in terms of DEM spatial resolution, return period, and Strahler stream order); (3) using a multidirectional flow model (e.g., D-Infinity, Tarboton, 1997); and (4) testing other channel initiation methods (e.g., Li et al., 2020). At the same time, upgrading the threshold binary classification and understanding why some elementary catchments are not producing higher r ϕ values, will likely help. A number of additional findings are also worth noting: • Although the regression models proved to be reasonably robust, considering that the sample variability of the training data was limited in comparison to that of the testing, a training of the models with a broader range of values and more degrees of freedom could improve the generalization properties and prediction capability. • The random forest regression is not able to predict target values outside the range found in the training data set, and this can be particularly problematic for lower TH values. To account for this feature, different algorithms would need to be considered or modifications to the random forest would need to be implemented. • Explicitly including the case where flooding does not occur in the models (e.g., accurate representation of the river at bankfull flow and the corresponding physical climatic-hydrologic characteristics that lead to it) may benefit the analysis. However, such cases should be completely withheld from the performance analysis, as they might artificially influence the performance. • Additional tests using the EU-DEM, not reported here, have revealed that the use of a DEM to compute the GFI that is different from the DEM used in the modeling of the benchmark flood maps negatively influences the results. Caution should, thus, be exercised in the selection of the DEM, as in this study (a consistent use of the RMS-DEM), but also in its processing (e.g., terrain analysis, river network, and catchment delineation; Tavares da Costa, Mazzoli, & Bagli, 2019).
Besides what was mentioned above, future work could • Investigate how such methodology performs in other test river basins and whether it could actually be generalized to the global scale. • Investigate how such methodology would work with benchmark flood extents obtained from remote sensing detection (e.g., Schumann & Moller, 2015;Westerhoff et al., 2013). It should be noted, however, that observed flood extents will largely reflect the dynamics of the flow, while the simplified approach presented here does not. On the other hand, observed flood extents will be dependent on the orbital pass and on the imaging time-windows, and it is thus important to consider multiple images of the same flood event in order to ensure that the maximum flood extent is best captured (Bernhofen et al., 2018). • Include more return periods in the analysis, particularly the 100-year return period flood, as it is considered the standard for risk assessment in many places (e.g., by the US National Flood Insurance Program). At the same time, investigate if bias in performance assessment could be reduced by excluding lower return period from higher return period flood-prone areas (e.g., exclude from the 10,000-year flood-prone areas the raster cells corresponding to the 10-year return period flood-prone areas). • Include a broader range of durations for the precipitation statistics, besides the 10 consecutive days precipitation. • Take a step further and provide a way to estimate flood depth, even if coarsely (e.g., .

Data Availability Statement
Data used in this study have been uploaded to an open and free-to-use repository along with scripts (https:// doi.org/10.17632/7r8pp8c5ct.1). Some data sets, on which parts of the results are based, cannot be made available, as that would violate the commercial licenses of the products. Licensing information can be found in the company website (https://www.rms.com/).