Non-peer reviewed preprint submitted to Water Resources Research Predictive modelling of envelope flood extents using geomorphic and climatic- hydrologic catchment characteristics

A topographic index, or flood descriptor, that combines the scaling of bankfull depth with morphology was shown to describe well the tendency of an area to be flooded. However, this approach depends on the quality and availability of flood maps and assumes that outcomes can be directly extrapolated and downscaled. This work attempts to relax these problems and answer two questions: 1) Can functional relationships be established between a flood descriptor and geomorphic and climatic-hydrologic catchment characteristics? 2) If so, can they be used for low-complexity predictive modelling of envelope flood extents? Linear stepwise and random forest regressions are developed based on classification outcomes of a flood descriptor, using high-resolution flood modelling results as training benchmarks, and on catchment characteristics. Elementary catchments of four river basins in Europe (Thames, Weser, Rhine and Danube) serve as training dataset, while those of the Rhône river basin in Europe serve as testing dataset. Two return periods are considered, the 10and 10,000-year. Prediction of envelope flood extents and flood-prone areas show that both models achieve high hit rates with respect to testing benchmarks. Average values were found to be above 60% and 80% for the 10and the 10,000year return periods, respectively. In spite of a moderate to high false discovery rate, the critical success index value was also found to be moderate to high. It is shown that by relating classification outcomes to catchment characteristics the prediction of envelope flood extents may be achieved for a given region, including ungauged basins. Plain Language Summary Topographic features can be extracted from a digital terrain to identify floodplains. In turn, the classification of these features can be used to represent flood extents. The classification, however, depends on existing flood maps to be used as reference for its calibration, in a small portion of the study area, before flood extents can be extrapolated. This work seeks to improve this dependence on the existence of a reference and the assumption that extrapolation can be performed without considering the physical differences between areas. To do so, a machine learning model was developed using the classification outcomes and characteristics of four river basins in Europe (Thames, Weser, Rhine and Danube) and was tested in the Rhône river basin in Europe, showing promising results. This development should help flood managers and insurers to allocate time and money better, by giving them nearly instantaneous views of areas that can potentially be affected by floods, with a physical basis for extrapolation and without a reference for every case.


Introduction
Floods pose a serious threat to individuals and communities as shown by disaster data found, for example, in the International Disaster Database (EM-DAT) of the Centre for Research of the Epidemiology of Disasters (CRED). A United Nations report on the Human cost of weather related disasters (CRED/UNISDR, 2016) illustrates the dimension of the problem well: 47% of all disasters since 1995 have been floods, affecting a total of 2.3 billion people worldwide. This previous experience paints a grim picture and there is mounting evidence for an increase in frequency and intensity of severe floods due to climate change (Milly et al., 2002;Aerts et al., 2006;Kleinen and Petschel-Held, 2007;Alfieri et al., 2017;Barichivich et al., 2018;Sassi et al., 2019). On the other hand, the increase of socio-economic activities in flood-prone areas perseveres in a number of countries, making exposure of persons and assets a serious problem (de Moel et al., 2009;EEA, 2016;Kron et al., 2019).
Thus, understanding flood risk is of great importance to the management of socioeconomic and environmental impacts. Disaster risk reduction (i.e., mitigation, preparedness, response and recovery) can greatly benefit from innovative tools, able to further inform the decision-making process. Governmental organizations use flood maps -a critical component of risk assessment -for emergency, spatial planning and awareness raising, while, in the insurance sector, flood maps are critical for managing portfolios, risk screening and assessing long-term financial solvency (de Moel et al., 2009). The vital importance of flood mapping is also recognized in the EU Floods Directive (2007/60/EC) that mandates member states to produce flood hazard and risk maps.
Flood mapping is traditionally available at the reach-scale and in urban settings (Horritt & Bates, 2002), where researchers devote themselves to increasing the level of detail (i.e., physics and spatial resolution) of flood models, an example being the move towards hyperresolution flood modelling (Noh et al., 2018). Instead, from regional to global scales, authors tend to focus more on simplification, as parsimonious models are more suitable to be used over larger domains due to their higher computationally efficiency (Yamazaki et al., 2011;Neal et al., 2012a,b;Pappenberger et al., 2012;Winsemius et al., 2013;Alfieri et al., 2014;Sampson et al., 2015;Dottori et al, 2016;Rebolho et al., 2018;Zheng et al., 2018). The level of detail in flood models and their large spatial coverage are two desirable but often competing properties; in other words, it is hard to setup a flood model with one of these characteristics without compromising the other. Adding to long simulation times, the calibration and validation of models in standard approaches to flood hazard mapping (please refer to Grimaldi et al., 2013, for more details) pose significant challenges. In global hydrological models (Bierkens, 2015), calibration is crucial to improve simulations (e.g., Hirpa et al., 2018), but requires large amounts of reliable streamflow observations that are scarcely available. In hydrodynamic models, calibration is also invaluable (e.g., Wood et al., 2016) and it is not uncommon to find numerical instabilities that need to be solved beforehand (e.g., by adjusting the time step or adding numerical diffusion). All these complexities, on top of the need for computational power and the difficulty in finding reliable validation data (e.g., Bernhofen et al., 2018) counters the effort for up-to-date flood maps at any location or any time.
Bottlenecks, such as the ones presented above, motivated a number of authors to produce alternative low-complexity solutions that rely on data-driven methods (Schumann et al., 2014a;Tang et al., 2018;Giovannettone et al., 2018;Caprario & Finotti, 2019;Zhao et al., 2019). Some take advantage of the causality between historical floods and the floodplain hydraulic geometry (e.g., Bhowmik, 1984;McGlynn and Seibert, 2003;Dodov and Foufola-Georgiou, 2006) and make use of digital elevation models (DEMs), or digital elevation datasets representing the Earth's surface that are distributed as gridded values representing local terrain elevations (Tavares da Costa et al., 2019b). For example, Nardi et al. (2006Nardi et al. ( , 2013Nardi et al. ( , 2019, Morrison et al. (2018) and Annis et al. (2019) employed a flat-water approach (i.e., simple intersection of a water level with the surrounding DEM) to delineate floodplains. The authors used a variable water level at each stream pixel from a stream-order averaged linear scaling relation (power law of upslope contributing area) obtained from a generalization of outlet discharges and the Manning uniform flow equation. Interestingly, using a measure-of-fit of the delineated floodplains, the authors found support for the use of constant values for the power law coefficients (Nardi et al., 2019). However, Annis et al. (2019) also found that the power law exponent varied with spatial resolution of the DEM, return period and Strahler stream order. Degiorgis et al. (2012) proposed the delineation of flood-prone areas from a location where a flood map exists to one where it does not. This extrapolation procedure was achieved by threshold binary classification or, in other words, by the identification of the isoline (the optimal threshold, TH) of a chosen flood descriptor that best approximated the areal extent of an existing flood map. Flood descriptors can be defined as quantitative layers extracted from DEMs that correlate to the tendency of an area to flood. Manfreda et al. (2014Manfreda et al. ( , 2015 and Samela et al. (2016) improved the method introduced by Degiorgis et al. (2012) by evaluating different flood descriptors in terms of their suitability to delineating flood-prone areas. In their studies, the Geomorphic Flood Index (GFI, Samela et al., 2017), was found to be the best performing and the most consistent hydrogeomorphic descriptor amongst the ones analysed (Manfreda et al., 2015;Samela et al., 2016Samela et al., , 2017, of which the Height Above the Nearest Drainage (HAND) (Rennó et al., 2008;Nobre et al., 2016) was one. Building upon this, Samela et al. (2017) and Tavares da Costa et al. (2019a) successfully delineated flood-prone areas at the continental scale by dramatically reducing computational times and costs, opening new possibilities for flood risk assessment and management over largescales. In Tavares da Costa et al. (2019a), optimal thresholds of the GFI were also shown to be positively correlated to flood extents associated with specific return periods.
In a different effort, Jafarzadegan and Merwade (2017) experimented with regression models, obtained by analysis of climate and catchment characteristics, to delineate the 100-year floodplains in North Carolina, US. The delineation was performed based on a range of thresholds of the HAND model, used as the flood descriptor. The authors used the United States Federal Emergency Management Agency's Flood Insurance Rate Maps as a benchmark for validating the results, but pointed out their uncertain and subjective nature. This approach was later extended by Jafarzadegan et al. (2018) and Jafarzadegan and Merwade (2019) to include a probabilistic description of the 100-year floodplains.
The data-driven approach described above complements traditional flood modelling, providing a cost-effective alternative that can fully exploit big, high-resolution datasets, without limiting the scale of application nor compromising computational speed. Being mostly DEMbased, the approach also lessens the problem of data-scarcity often found, enabling the delineation of flood extents in any given catchment, based solely on the regression between envelope flood extents and catchment characteristics. The questions that the approach poses are: 1) Can functional relationships also be established between the GFI and catchment characteristics? 2) If so, can these relationships be used for low-complexity predictive modelling of envelope flood extents?
In this study, and differently from Jafarzadegan and Merwade (2017), a deterministic classification of the GFI, based on a specific objective function, is used to derive a TH for a larger number of training elementary catchments (i.e., hydrological units defined as the "portion of basin directly drained by a river stretch, between two confluences, or from the headwater to the first confluence" in Castellarin et al. (2018); see supporting information Figure S1 for an illustration of this concept) of four different major river basins in Europe with significantly different characteristics and for two return periods (the 10-and 10,000-year). Values of TH correspond to unique envelope flood extents; in other words, the GFI layer isolines that best envelope a given benchmark flood hazard map. Consistent binary masks of high-resolution flood hazard maps are used as benchmark and were obtained from Risk Management Solutions (RMS), a recognized catastrophe risk modelling company, that applied a standard flood hazard mapping approach.
Subsequently, regression models are established between the TH -the target variableand a set of geomorphic and climatic-hydrologic catchment characteristics -the explanatory variables -different from those in Jafarzadegan and Merwade (2017), tested beforehand in terms of multicollinearity. Regression models were objectively calibrated and optimized for the prediction of flood extents and flood-prone areas.
In this way, two major drawbacks found in previous applications are relaxed: 1) the complete dependence on benchmarks, since in this approach they are only needed to train the models and not for every location where flood-prone areas are to be delineated, as is the case in Degiorgis et al. (2012) for example; and, 2) the assumption of transferability of the TH without any physical basis, as is also the case in Degiorgis et al. (2012) for example, since catchment geomorphic and climatic-hydrologic characteristics are used to regress the TH.
Finally, flood-prone areas in elementary catchments of a distinct major river basin in Europe -from those used in the regression models -are delineated using a TH predicted by the regression models. Each resulting delineation is compared to the benchmark to assess the ability of the method to predict the extent of the envelope of major floods. This paper is organized as follows: in section 2, the workflow to develop and test the predictive models of envelope flood extents is presented and each step of this workflow is described in detail; in section 3, the four river basins in Europe (Thames, Weser, Rhine and Danube), used to train the models, and the Rhône river basin in Europe, used to validate the models, are presented alongside the data sources used to obtain the physical characteristics of the river basins. Furthermore, a brief description of the workflow used to obtain the benchmark flood maps is explained; in section 4, the results obtained for each part of the methodological workflow, i.e., classification and physical characterization of elementary catchments, model development and prediction, are presented; and, in section 5, the main conclusions are drawn and future work is addressed.

Development and testing of predictive models
The methodology adopted for the development and testing of predictive models is based on the prior definition of a flood descriptor, the GFI, whose isolines are used to classify benchmark flood extents. The unique TH values resulting from the classification are then statistically related to a selection of geomorphic and climatic-hydrologic catchment characteristics by two distinct types of regression models. The establishing of such relations allows for the prediction of TH based on physical inputs of any given river basin. In Figure 1, the general methodological workflow is presented. Figure 1. Workflow for developing predictive models of envelope flood extents using geomorphic and climatic-hydrologic catchment characteristics.

Catchment delineation
The delineation of elementary catchments accounts for a hierarchical structure that reflects the topology of the river network (Verdin and Verdin, 1999). The main reason for choosing this scale of analysis is the division of each river basin in topographic areas that may contribute significantly to discharge and play a central role in the management of water resources. It also serves the purpose of making computations more manageable through concurrent programming (Tavares da Costa et al., 2019a). Catchments are delineated following the constraint that catchment areas should be less than ca. 1200 km 2 .

Geomorphic and climatic-hydrologic catchment characterization
Catchment characteristics used in this study as explanatory variables (see Table 1) are strictly geomorphic and climatic-hydrologic, as defined by Horton (1932), where soil, geology and vegetation are not taken into account. For consistency, the same high-resolution DEM of the proprietary flood maps is used to extract geomorphic catchment characteristics. The following single geomorphic catchment characteristics were considered: area of elementary catchment (A); flow accumulation at the elementary catchment outlet (F), which relates to A, defined as the cumulative sum of raster cells upstream of the outlet; relief of the elementary catchment (Δz) defined as: with zmax the maximum and zmin the minimum elevation of the elementary catchment; the total river channel length in the elementary catchment (Lch), which also relates to A; and mean river channel fall, or relief, in the elementary catchment (Δzch): with zch,max the maximum and zch,min the minimum elevation of the stream channel in the corresponding elementary catchment. Single geomorphic catchment characteristics were chosen according to their relevance as potential drivers of flood (e.g., Gaál et al, 2012). The magnitude of a flood can be related to A, and F, as it conditions the amount of precipitation that may enter a catchment, that is intercepted by vegetation, infiltrates the soil or is routed to the stream channel. On the other hand, catchment relief (Δz) determines whether water infiltrates or flows quickly to a stream channel. In turn, channel relief (Δzch) determines whether water flows quickly to the outlet or accumulates and leads to a flood. Finally, stream channel length (Lch), in combination with Δzch, determines whether travel times of water are shorter or longer and the amount of storage, this affects the response of a catchment to precipitation.
Composite geomorphic catchment characteristics used in this study are representative of the mean declivity. Within each elementary catchment, the relief-area ratio is defined as: while the relief ratio of the river channel in the elementary catchment is defined as: From a hydrological perspective, relief-area ratio has an important relation to surface runoff, to the concentration of rainfall in river channels and to flood magnitude (Horton, 1932). On the other hand, relief ratio of the river channel gives an estimate of channel storage and time length required by a flood wave to traverse the channel (Horton, 1932), while it also relates to the linear head loss found in the Manning's equation for uniform flow (Manning, 1891). High mean declivities equate to water entering the channel quicker, and thus to higher flooding likelihood. By contrast, gentle sloping channels are slower to route the incoming runoff and have a lower flooding likelihood.
Precipitation is the most important factor driving a flood. In particular, multi-day rainfall events are an important cause of flooding (Fowler & Kilsby, 2003) as they increase the likelihood of exceeding the catchments drainage capacity. In this study, the annual highest 10 consecutive day precipitation is reported, which has a likelihood of occurring or of being exceeded every 10 and 10,000 years on average (P10 and P10k, respectively). These statistics were calculated based on the ECA&D E-OBS 0.1 degree regular gridded precipitation dataset (Cornes et al., 2018). Principal component analysis was applied to precipitation anomalies in the dataset for the 1950-2010 period, in order to identify dominant rainfall patterns across Europe. Stochastic precipitation fields were obtained for 50,000 years as linear combinations of empirical orthogonal functions and principal components (Zanardo et al., 2019). To complement these statistics, the mean annual precipitation (MAP) calculated by averaging the annual totals obtained from the ECA&D E-OBS dataset is also reported.
Proxies for long-term average runoff are obtained by accumulating precipitation statistics downstream using the hierarchy of connected elementary catchments. The general water balance equation for each elementary catchment is given by: with P the precipitation received at each elementary catchment, ΔQ the change in specific runoff, E the evapotranspiration and ΔWS the change in water storage. If subsurface water flow and evapotranspiration losses are neglected as a simplification for severe rain storms and very humid conditions -meaning that overland flow suffers either from saturation excess or infiltration excess and that evapotranspiration losses are much lower than water entering the elementary catchment -the direct conversion of precipitation into runoff may be assumed dominant at each elementary catchment, water yield tends to 1, and the following equation holds: with Qout the runoff at each elementary catchment outlet and Qin the runoff from upstream elementary catchments. Equation 6 can be further expanded, to cater for the lumped cascading estimation of direct runoff at the elementary catchment: with P0 corresponding to a unique long-term average precipitation statistic (i.e., P10, P10k or MAP) associated with the elementary catchment under analysis with area A0, while Pi is the unique precipitation statistic associated with the n-th upstream elementary catchment with area Ai. Results are reported as unit discharge at each elementary catchment outlet: with cell size (in m 2 ) equal to the product of pixel length by pixel width, specific to each DEM.
Spurious values found in the computed catchment characteristics were filtered out. In specific, some elementary catchments were found to be unrealistically small (@ < 2 km 2 ), this was due to automatic delineation problems, namely polygon intersections and invalid geometries, and some of the precipitation statistics presented a number of values that deviated markedly from the other data points.

Flood descriptor
A flood descriptor is understood here as a raster layer that is able to identify the susceptibility to flooding in a given area. It can be obtained by combining different factors into a unique raster layer (e.g., terrain analysis, land use and land cover, and so forth). In this study, the flood descriptor, GFI, is presented as a combination of hydrogeomorphic factors (Samela et al., 2017). The GFI computation requires several steps (Tavares et al., 2019a).

Terrain analysis
The extraction of the terrain analysis layers from the DEM for each European river basin precedes the computation of the GFI. Terrain analysis follows a simple workflow (see supporting information Figure S2 for an illustration of this workflow) using the TauDEM toolbox (Tarboton, 2015). For more details, the reader can refer to Tavares da Costa et al. (2019a, b).

The Geomorphic Flood Index
The GFI is a raster layer estimated from pre-processed terrain analysis layers extracted from a DEM. The computation of the GFI for each of the major river basins in Europe considered in this study is given by: It is composed of two terms, computed following the steepest downslope path given by a convergent eight direction flow model (abbreviated as D8 flow model). The first term, ℎ (S (in meters), consists of an empirically derived bankfull depth estimated by means of a power law hydraulic scaling relation of bankfull depth and upslope contributing area (Nardi et al., 2006;Dodov & Foufoula-Georgiou, 2006;Manfreda et al., 2015;Samela et al., 2016Samela et al., , 2017. The empirically derived bankfull depth in each cell under analysis (i, j) is computed using the upslope contributing area specific to the river centreline cell hydrologically connected to cell i, j following the D8 flow model: ℎ (S = \(@* + ) ) , with @* + = Q^* + * `abb cdea .
with F the flow accumulation specific to the river centreline cell hydrologically connected to cell i, j following the D8 flow model.
For simplicity, the power law constant a and exponent n are kept constant with values of 0.1 and 0.4 (Samela et al., 2017), respectively. The reason for integrating a bankfull depth that scales with contributing area in the GFI is to mark a clear boundary between channel and floodplain and avoid problems with the actual channel. The second term of the GFI consists of the HAND calculated between the cell under analysis (i, j) and the river centreline cell hydrologically connected to cell i, j, following the D8 flow model: with e (S the DEM elevation value of the cell under analysis and e* + the DEM elevation value of the hydrologically connected river centreline cell. The HAND model makes each cell elevation value relative to the connected channel cell instead of the mean sea level. This is crucial in order to determine unique thresholds that best define a specific envelope flood extent. The GFI is rescaled before use to a range of values lying between 0 and 1, corresponding to low (i.e., away from the river centreline) and high hazard levels (i.e., near the river centreline), respectively. Note that moving away from the river centreline, f (S increases while the GFI decreases. Scaling is achieved by resorting to the minimum and maximum values of the GFI (Tavares da Costa et al., 2019a). The rescaled GFI can effectively be used as a classifier of flood-prone areas (Manfreda et al., 2015;Samela et al., 2016Samela et al., , 2017Tavares da Costa et al., 2019a) and of the extent of the envelope of major floods that is confined to the floodplain, between the active river channel at bankfull and the surrounding marked topography. The GFI computation is summarized in Figure 2.

Discrete statistical classification
The threshold binary classification, introduced by Degiorgis et al. (2012), is adopted in this study to find the TH that produce the best possible representation of the RMS flood maps in terms of flood extent (see supporting information Figure S3 for an illustration of the workflow). The resulting unique TH per elementary catchment of the Thames, Weser, the upper Rhine and the upper Danube river basins are subsequently used as the target variable for training the regression models.

Threshold binary classifier
The binary classifier consists of a mathematical optimization that outputs the best possible representation of known binary values from a benchmark. The algorithm starts by creating, through image segmentation of the GFI, a binary flood extent representation associated to each threshold, out of a large number of possible values from zero to one, hereinafter called the segmented GFI. The algorithm then searches all these binary cases to find the one that best approximates the benchmark. The optimal case, represented by a unique TH, is indicated by the maximization of a specific objective function that expresses the correctness of a representation.
The classifier performs better when binary categories (flood-prone and flood-free) are symmetrically distributed (Kubat et al., 1998), i.e., when raster cells of one binary class are not much greater in number than the other (i.e, class imbalance). Therefore, a portion of the GFI, namely a classification area corresponding to a fixed buffer (ca. 1 km) around the river network centreline of the largest benchmark, is adopted in order to handle class imbalance by discarding the number of flood-free raster cells in excess.
The classification results are evaluated with two specific metrics based on a 2 × 2 binary contingency matrix (see supporting information Table S1 for an example of such matrix) that is constituted by values of: tp or the number of raster cells marked as flood-prone in both the segmented GFI and the benchmark flood hazard maps; fn or the number of raster cells marked as flood-free in the segmented GFI but marked as flood-prone in the benchmark; tn or the number of raster cells marked as flood-free in both the segmented GFI and the benchmark; and, fp or the number of raster cells marked as flood-prone in the segmented GFI but marked as flood-free in the benchmark. The first of these metrics assesses the discerning capability of the GFI itself (the receiver operating characteristic, or ROC analysis, see section 2.4.3) while the other measures the degree of association between flood-prone areas resulting from the threshold binary classification and a benchmark (the h i , see section 2.4.4). Both metrics were found to work well for this type of classification (Tavares da Costa et al., 2019a).

Objective function
The maximization of the True Skill Statistic (TSS) (Peirce, 1884) is adopted in this study as the classification rule that defines which threshold is optimal to select for each elementary catchment. The TSS represents the point of maximum forecast value of the classifier; in other words, it is the point in the ROC that has the maximum perpendicular distance from the line of no-skill (Manzato, 2007), which translates to a good representation of the binary categories in the benchmark (see supporting information Figure S4 for an example of a ROC and the respective TSS point identified in the curve).
The TSS , also based on the contingency matrix, has been used elsewhere with success by several authors (Bartholmes et al., 2009;Alfieri et al., 2012) and can also be interpreted as the probability of making an informed decision in terms of the proportion of correct binary categories, assuming for this specific study that the misclassification of flood-prone areas is as undesirable as the misclassification of flood-free areas. The TSS is defined as: with TPR, the true positive rate or the probability of a correct hit; and, FPR the false positive rate or the probability of an incorrect hit. The TSS is negative when the segmented GFI has a higher number of fp and fn than tp and tn; it is positive when the opposite happens, with jkk = 1 indicating that the segmented GFI perfectly matches the benchmark. The case of jkk = 0 implies that the classifier does not provide any useful information.

ROC analysis
ROC analysis has been used by several authors to distinguish between decision values in a classifier and their trade-offs between costs and benefits (Bradley, 1997;Fawcett, 2006;Schumann et al., 2014b). It is considered a threshold-independent performance measure, as points falling along the ROC curve (see Figure S4 for an example) represent unique evaluations, in terms of TPR and FPR, of a considered segmented GFI against the benchmark. The top left corner of the ROC space represents the perfect classification, such that j4u = 1 and Q4u = 0; instead, the diagonal line dividing the ROC space represents the line of no-skill.
In the specific context of this study, the Area Under the ROC Curve (AUC) summarizes the overall discerning capability of the GFI in a single threshold-independent value per elementary catchment. As such, irrespective of the threshold, it represents the probability of correctly classifying a randomly chosen raster cell as flood-prone rather than incorrectly classifying it as such (Bradley, 1997;Fawcett, 2006). The AUC can be estimated by trapezoidal rule approximation of the definite integral and may take values from @vw = 0.5, meaning no discerning capability of the GFI, to @vw = 1, the perfect classifier. An @vw ≤ 0.5 is used to filter out elementary catchments that are not well suited to serve as a classifier and may therefore impact the formulation of the statistical relationships.

Degree of association
The modified Pearson's correlation coefficient for discrete dichotomous problems, h i (Cramér, 1946;Matthews, 1975), is used as a measure of magnitude of association and direction of the linear relationship between flood-prone areas and a benchmark: where Ç É is the Pearson chi-square statistical test (Pearson, 1900), with Ñ the total number of samples. It can be seen as the geometric mean of the TSS and its complementary term. The h i is used in this study to evaluate the degree of association between the segmented GFI and the benchmark. As a rule of thumb, it is assumed that values between 1 and 0.5 represent a strong positive degree of association, between 0.5 and 0.3 a moderate degree of association, 0.3 to 0.1 a weak degree of association and from 0 to 0.1 a complete absence of association. A h i ≤ 0.3 is used to filter out elementary catchments that have a weak degree of association.

Formulation of statistical relationships
Multivariate statistical methods can be used to describe the relationship between unique TH, the target variable, and a set of explanatory variables represented by catchment characteristics that are scaled and mean centred before use. In this study, the stepwise regression and random forest are setup as models to predict envelope flood extents.
Ideally, catchment characteristics should not be highly correlated to each other or to their linear combination, since multicollinearity may increase the variance of parameter estimates and potentially lead to unreliable results. Therefore, before developing the statistical models, multicollinearity is diagnosed with the variance inflation factor (VIF). Multicollinearity may be present when ÜRQ > 10 (Hirsch et al., 1992;Merz & Blöschl, 2005) and therefore variables above such values are considered for elimination prior to model fitting.
The problem of estimating the regression coefficients, or the fitting problem, is solved by stepwise analysis with bidirectional elimination (i.e., the sequential addition) and replacement or elimination of explanatory variables based on the relative quality of each competing model. The trade-off between maximum likelihood and explanatory variables, or model's simplicity in this context, is measured by the Akaike's information criterion (AIC) (Akaike, 1974). In practical terms, model selection is based on the minimum possible AIC obtained for competing models (Haddad et al., 2012). A formal definition of MLR, VIF and AIC can be found in supporting information.
To obtain classical evaluation metrics such as R 2 and the root mean square error (RMSE), a 10-fold cross-validation procedure is used. The cross-validation consists of randomly splitting the dataset in ten equally-sized groups, one of which is retained for testing and the remaining are used for training. The training and testing procedures are repeated ten times, until every single group has been selected once. Performance results from each of the ten validations are averaged to produce a single final estimation.
Some important advantages of the random forest method are that it does not need any specific assumption about the probability distribution (non-parametric), it works well when the relationship between explanatory variables and response is non-linear, as well as when there are high order interactions (Snelder et al., 2013). Furthermore, random forest is relatively robust against outliers, noise and overfitting (Breiman, 2001) and can handle the problem of multicollinearity well .
As opposed to MLR, a chief disadvantage of this method is that it cannot predict target values outside the range of the explanatory variables in the training dataset. Another limitation of the random forest is that it does not provide an easy understanding of the statistical relationships between explanatory variables. Even though, it does provide a simple visualization of the model structure and of the covariate influence, in contrast to other machine learning methods, such as artificial neural networks (Shortridge et al., 2016).
The random forest regression model used in this study goes through an automatic and distributed optimization procedure (grid search) of the setup parameters in order to find the best performing model, in terms of both accuracy and computational efficiency. In specific, the optimized parameters are the number of decision trees in the ensemble, the number of sampled variables at each tree node and the maximum depth of each tree. The optimization of the random forest regression is achieved a priori, using a 10-fold cross-validation to obtain evaluation metrics and compare the multiple models.

Performance assessment
To evaluate the fit between predicted flood-prone areas, obtained through image segmentation of the GFI using the predicted TH, and the benchmark, four common performance metrics, also based on the contingency matrix, are selected in order to keep consistency with other published works (e.g., Wing et al., 2017;Jafarzadegan and Merwade, 2017). These are the hit rate, or TPR, as described in section 2.4.2; the false discovery rate: with values ranging from Qàu = 0 (no false alarms) to Qàu = 1 (overprediction); the critical success index: with values ranging from w = 0 when there is no match between the delineated flood-prone areas and the benchmark and w = 1 when there is a perfect match; and the error bias: that indicates whether there is a tendency towards underprediction, 0 ≤ 7 < 1, or overprediction, 7 > 1, with 7 = 1 an indication of no bias.
As the objective is to predict envelope flood extents, overpredicting the flood-prone areas might benefit the tp and inflate the TPR. This inflation would be of concern if there was no other reported measure that would give an alternative account of the performance. By reporting the FDR, an account of the percentage of cells that are overpredicted is given. At the same time, the critical success index extends the TPR by including the fp, accounting for both under-and overprediction, while the error bias gives the ratio between the fp and the fn indicating whether there is a tendency for under-or overprediction. The reporting of these four measures should give a reasonable overall account of the performance.

Study Area, Data Sources, Training and Test Sets
The River Thames in the UK constitutes the longest one in southern England (ca. 350 km length). It drains an area (ca. 13,478 km 2 ) of relatively flat terrain (mean elevation of ca. 100 m a. s. l.) to the North Sea. The Thames river basin has a MAP ranging from ca. 610 to 778 mm yr -1 that results in a mean annual runoff (MAR) ranging from 100 to 295 mm yr −1 . The Thames is prone to major flooding; the 2013/14 winter floods that the valley sustained are an example of this (Huntingford et al., 2014;Fenn et al., 2016).
The River Weser in Germany has an overall length of ca. 452 km. It drains an area of (ca. 43,857 km 2 ) relatively flat terrain (mean elevation of ca. 200 m a. s. l.) to the North Sea. The Weser river basin has a MAP ranging from ca. 575 to 1,195 mm yr -1 that results in a MAR ranging from 190 to 930 mm yr −1 . In 2013, the Weser river basin was affected by high flood levels with peak discharges above the 50-year return period (Schröter et al., 2015).
The River Rhine (ca. 1,230 km total length) has its source in the Swiss Alps and flows through several major cities in Switzerland, France, Germany and the Netherlands, where it drains to the North Sea. The upper Rhine river basin (drainage area of ca. 32,114 km 2 ), the portion of the Rhine river basin considered in this study, presents a relatively mountainous terrain (mean elevation of ca. 1,065 m a. s. l.), with MAP ranging from ca. 825 to 1,715 mm yr -1 and resulting MAR ranging from 330 to 2,250 mm yr −1 . The upper Rhine river basin is prone to major flooding; for example, in 2007, one person lost its life, at least 100 were affected and the country withstood a total estimated damage of more than 312 million EUR (CRED EM-DAT).
Originating in Germany and flowing through major cities (e.g., Vienna in Austria) in 10 different countries before draining to the Black Sea, the River Danube is the second longest river in Europe (ca. 2,850 km length). The upper Danube river basin (drainage area of ca. 97,000 km 2 ), portion considered in this study, is characterized by a relatively mountainous terrain (mean elevation of ca. 890 m a. s. l.), MAP ranging from ca. 460 to 1,785 mm yr -1 and resulting MAR ranging from ca. 23 to 1,282 mm yr −1 . The upper Danube river basin is prone to major flooding; for example, in 2013, four persons lost their lives, at least 200 were affected and the country withstood a total estimated damage of more than 893 million EUR (CRED EM-DAT).
The River Rhône in France originates in the Swiss Alps and runs through south-eastern France, where it finally drains to the Mediterranean Sea. The Rhône river basin, with an area of ca. 96,475 km 2 has a mean elevation of ca. 785 m a. s. l. It is characterized by a MAP ranging from ca. 561 to 1,890 mm yr -1 , resulting in a MAR ranging from ca. 119 to 1,551 mm yr −1 . The winter floods of 2003 marked the largest flood in the Rhône river basin since 1856. Consequences arising from this event were severe, with the country withstanding a total estimated damage of ca. 1.130 billion EUR (Arnaud-Fassetta, 2013). Mean elevation values presented in this section were estimated from the EEA EU-DEM, MAP values from the ECA&D E-OBS and MAR values from the UNH/GRDC runoff dataset (Fekete et al., 2012).
The five river basins were selected for this study mostly for their record of major floods and their importance in Europe; their geographical locations can be visualized in Figure 3. The GFI raster layer and the catchment characteristics are computed for all selected river basins using a proprietary DEM dataset, hereafter referred to as RMS-DEM, at ca. 50 m spatial resolution. The RMS-DEM is suited for flood inundation modelling, as it does not contain artefacts such as trees, buildings or bridges that can adversely affect the accuracy of the simulations. The (1) Thames river basin in the UK, with 83 elementary catchments delineated, the (2) Weser river basin in Germany, with 170 elementary catchments delineated, the (3) upper Rhine river basin in Switzerland, with 109 elementary catchments delineated, and the (4) upper Danube river basin in Austria, with 286 elementary catchments delineated, are used for training the regression models. Their merging into a single dataset resulted in a total of 648 elementary catchments for each return period, of which, after filtering out issues such as poor classification results (see sections 2.2, 2.4.3 and 2.4.4 for more details), 453 were effectively used for the 10-year return period and 486 for the 10,000-year return period, giving a total of 939 data points effectively used and 357 discarded. The first two river basins are representative of flatter regions and the last two of mountainous regions. The (5) Rhône river basin in France, with 277 elementary catchments delineated, is instead used for testing the regression models. Figure 3. Study area comprising five major European river basins, with drainage divide highlighted in black in the lower left map. The training of the regression models is performed using four river basins, namely a) the Thames river basin in the UK, b) the Weser river basin in Germany. c) the upper Rhine river basin in Switzerland, and d) the upper Danube river basin in Austria. Testing of the regression models is performed using e) the Rhône river basin in France.
The benchmarks used in the threshold binary classification to find the unique TH values, and also in the evaluation of the final predictions, are high-resolution flood hazard maps for Europe, developed by RMS and currently used by global insurance and reinsurance companies. The RMS flood maps were created for several return periods at ca. 50 m resolution. They are based on a cascade of sequential modelling components. Rainfall runoff processes are modelled with a semi-distributed, TOPMODEL-based approach (Beven & Kirkby, 1979). Flows are routed through the river network using the Muskingum-Cunge 1D wave propagation method (Cunge, 1969;Georgakakos et al., 1990). Inundation depths and extents are derived by applying rating curves to river flows in every river segment of 500 m, calculating the associated river depth and filling the river cross-section extracted from the DEM for each segment. The maximum flood depths over the floodplain, after propagating the flood wave through the main river channel, represent the flood hazard map for an event. The benchmarks are used in the form of a binary masks (raster cells marked as flood-prone or flood-free) obtained through image segmentation with a cut-off depth set to 0.01 m. The overall accuracy of the benchmarks is not expected to be very high because of the methodology employed; however, they should provide a fairer comparison, since the GFI is not able to represent the dynamics of the flow over the floodplain.

Classification outcomes and catchment characteristics
The classification of the GFI layer to obtain the TH was performed using each elementary catchment that constitutes the training set, composed of four major river basins in Europe. In Figure 4, the data used in the development of estimators and prediction of envelope flood extents for the 10-and 10,000-year return periods is presented.  Table 1 and section 2.2 for a complete description and units of variables. a) Area Under the relative operating characteristic Curve AUC the modified Pearson's correlation coefficient for discrete dichotomous problems rf and the optimal Geomorphic Flood Index (GFI) thresholds TH obtained from the classification of training catchments (Thames, Weser, upper Rhine and upper Danube river basins). b) Geomorphic and climatic-hydrologic catchment characteristics for the combined training and test catchments (Rhône river basin). c) Correlation matrix between TH and catchment characteristics.
Average AUC values of ca. 95% and 91% are found for the 10 and 10,000-year return periods, respectively, which indicates a very high discerning capability of the GFI classifier. These AUC values translate to a high probability of correctly classifying a raster cell as floodprone or flood-free.
Average h i values between ca. 60% and ca. 64% are found for the 10-and 10,000-year return periods, respectively, which indicates a strong positive degree of association (i.e., between 1 and 0.5) of the best possible representation of flood-prone areas.
Values of TH for the elementary catchments of the training set are found to range between 0.18 and 1, with a mean value of ca. 0.44 and 0.39 for the 10-and 10,000-year return periods, respectively. As expected, there is a tendency towards a value decrease with increasing return period (Tavares da Costa et al., 2019a).
For most catchment characteristics more than 50% of test data is contained within the training set interquartile range. Exceptions to this can be found for the precipitation statistics, P10 and P10k, with median slightly above the test set interquartile range, and for the corresponding unit discharge estimates, q10 and q10k. The sample variability of the test set is larger than that of the training set for the A, the S, the Lch, and the Sch. The explanatory variable A is the only that is noticeably positively skewed, while the MAP and corresponding qMAP, are negatively skewed. These differences are expected to impact the final prediction of envelope flood extents as the training set does not represent the test set in the most exhaustive manner.
The correlation matrix in Figure 4c provides an evaluation of the magnitude of association and direction of the linear relationship between explanatory variables and the dependent variable (in Figure S5 of supporting information the correlation matrices for the 10and 10,000-year return periods are presented). Correlations between the TH and the F, indicate a moderate positive linear relationship. Moreover, a moderate negative linear relationship is revealed between TH, the Δzch and Sch. The remaining catchment characteristics reveal weak linear relationships to the TH.
Furthermore, correlations that exist between different catchment characteristics may be indicative of multicollinearity. Disregarding the correlations between composite explanatory variables and their constituting parts, a moderate to strong positive correlation between precipitation statistics, discharge proxies and Δz is noticeable. Also, A shows a strong positive correlation with Δz and Lch. Collinearity between catchment characteristics is an undesirable condition that can negatively impact the quality of the statistical relationships for the prediction of envelope flood extents and need to be addressed before any further step is taken.

Estimators of envelope flood extents
Two types of regression models were built from the classification outcomes TH and catchment characteristics of the training set, namely the stepwise regression and the random forest regression models.
Several data transformations were tried for building different models (log-linear, linearlog and principal component analysis, which were tested but did not produce any beneficial result). Log-log transformed variables (note that the GFI is already a logarithm) were used as they substantially improved both models' statistical tests and performance metrics.
Several data splits were also tried for building different models. For example, one model for the 10-and another for the 10,000-year return period, were tried but did not yield significantly different results from the ones presented in Table 2. Furthermore, the inclusion of specific river basins was tested. Namely, two out of the four training river basins were held out at the time for testing the models built with the remainders. None of the six river basin combinations (4! 2! * (4 − 2)! ⁄ = 6) significantly improved the overall performance and in most cases holding out specific river basins actually decreased it.
In the stepwise regression, multicollinearity tests point towards strong collinearity between the composite explanatory variables and their individual constituents. Additionally, VIF values above 2 are found between the combined q10 and q10k and the qMAP. Thus, the catchment characteristics A, Δz, Lch, Δzch and qMAP were considered for elimination given the results of the multicollinearity tests and taking into consideration the correlation results presented in Figure 4c.
The previous steps were followed by a stepwise selection of explanatory variables based on the AIC, which reflects the trade-offs between maximum likelihood and model simplicity. A very low AIC lead to the following equation: jf = 0.1580 − k *+ >.>éèC + Q >.>èêë + k >.>>Éè − 4 C>/C>> .>ììê + î@4 >.CÉéì + D C>/C>> .>Cëé . (16) As can be seen from Table 2, the final optimized linear model is constituted by six of the 11 original explanatory variables and is characterized by a high F-statistic (> 3) and very low pvalue (< 0.01), which indicates a high degree of significance of individual explanatory variables and of the model. From the 10-fold cross-validation procedure, results a R 2 value of ca. 42%, indicating a moderate explanatory power of the model, and a RMSE of 0.0597. Table 2. Optimization results for the stepwise regression (swr) and the random forest regression model (rf) for simultaneously predicting the 10-and the 10,000-year return period optimal Geomorphic Flood Index (GFI) threshold (TH) using geomorphic and climatic-hydrologic catchment characteristics as explanatory variables.

Model
Catchment characteristics In the random forest regression, optimization was performed by an automatic search of the best possible combination of input parameters that lead to the highest possible decrease in RMSE obtained through cross-validation. This also ensured that overfitting was avoided. The final optimized random forest regression corresponds to a model with 644 trees, three explanatory variables randomly sampled at each tree node and a maximum depth of 25 nodes. As shown in Table 2, the random forest regression results have substantially improved the explained variance obtained by the stepwise regression model, from ca. 42% to ca. 64% R 2 , and the RMSE, which decreased from 0.0598 to 0.0466.
Variable importance was assessed based on the absolute value of the t-statistic for the MLR model (for more details about the assessment of variable relative importance in linear models please refer to Lindeman et al., 1980). For the random forest regression model, variable importance was assessed based on the empirical improvement of the squared-error as a result of a split in a non-terminal tree node, averaged over all trees (Breimen et al., 1983). It is shown that the Sch is the explanatory variable with the highest relative importance in both models, ca. 14% and 19%, followed by the F with ca. 10% and 19%. In the random forest regression model Δzch is also found to have a fairly high variable importance with ca. 14%. The remaining catchment characteristics are ranked as relatively less important, or not included at all (stepwise regression). However, it should be noted for the case of the random forest model that as one explanatory variable is randomly selected at a tree node, the importance of other variables is substantially reduced, particularly if there is collinearity. In light of this, variable importance should be interpreted with caution, as explained in Seibert et al. (2017).

Prediction of envelope flood extents
Using the models presented in the previous section, envelope flood extents were predicted based on the physical characteristics extracted for each elementary catchment of the Rhône river basin.
Catchment characteristics matching the ones used for training of the regression models were obtained for the 10-and 10,000-year return periods and used as input. A unique TH was predicted per elementary catchment, return period and model. The predicted TH values were used to segment the original GFI raster layer of each corresponding elementary catchment of the Rhône river basin and to delineate the flood-prone areas (see Figure 5b for an example).

Figure 5. a)
True positive rate (TPR), false discovery rate (FDR) and critical success index (C) for the elementary catchments of the Rhône river basin, for the different regression models (rfrandom forest; swr -stepwise regression) and for the 10-year and 10,000-year return periods. b) Overlaid samples of flood-prone areas and corresponding envelope flood extents, predicted with the rf model for two return periods.
By comparing each raster cell of the binary mask of predicted flood-prone/flood-free areas with each corresponding raster cell of the benchmark RMS flood maps, it was possible to obtain a contingency matrix for each model and for each return period considered, from which the performance metrics described in section 2.6 were computed.
In Figure 5 and Table 3, it can be observed that the TPR is high for the great majority of elementary catchments (average above 80%), similar between models and slightly higher for the 10-year return period. At the same time, the FDR is high for the 10-year return period (average ca. 63%), moderate for the 10,000-year return period (average ca. 39%), and slightly higher for the stepwise regression. In turn, the C is moderate for the 10-year return period (average ca. 34%), high for the 10,000-year return period (average ca. 52%), and slightly higher for the random forest. It is also shown by the E that more than ca. 85% of the flood-prone areas obtained for the elementary catchments of the Rhône river basin suffer from overestimation (7 > 1). On average, the TPR decreased with increasing return period, but this seems to be compensated by a significantly lower number of false alarms, as a higher C is observed for the 10,000-year return period and a decrease of E (for more details see supporting information Figure S6 where the distribution of the tp, fn and fp values, in terms of absolute predicted area, are presented). Table 3. Average performance of the regression models expressed as true positive rate (TPR), false discovery rate (FDR), critical success index (C) and error bias (E), for the 10-and 10,000year return period flood-prone areas in the Rhône river basin using the RMS flood maps as benchmark. Furthermore, cases with a high number of upstream elementary catchments were found to have limited impact on the average results (for more details see supporting information Figure  S7 where the performance results for individual elementary catchments of the Rhône river basin are plotted against the corresponding number of upstream elementary catchments). However, there appears to be a higher dispersion of TPR, and a tendency for high FDR values for the 10year return period, with an increasing number of upstream elementary catchments (see supporting information Figure S6 for more details).

Conclusion
The work presented in this paper documents the use of two data-driven, expeditious and cost-effective approaches to predict the extents of the envelope of major floods in diverse river basins, gauged or ungauged, and for diverse return periods. This has been achieved by establishing functional relationships in the form of linear and non-linear regression models between specific isolines of a flood descriptor (TH) and geomorphic and climatic-hydrologic characteristics of elementary catchments. This advancement extends a previous approach employed in Tavares da Costa et al. (2019a), by relaxing its complete dependence on benchmark flood maps and by providing a physical basis for the transferability of the TH between catchments, also giving a physical basis to the extrapolation and downscaling goals described in Tavares da Costa et al. (2019a). The results obtained in this work are promising for such a novel approach and provide an optimistic view about future directions that could improve or extend it. At the same time, the limitations found, and discussed below, should encourage further investigation as new developments can be beneficial to flood managers and insurers giving them nearly instantaneous views of the areas that can potentially be affected by floods.
The classification stage of the methodological workflow showed that, overall and in spite of some outliers, the GFI has an overall high discerning capability of flood-prone areas, as shown by the average AUC value above 91% for any of the return periods. At the same time, average value of h i above 60% for any of the return periods indicates a strong positive degree of association between the GFI delineated flood-prone areas and the benchmark flood maps. These values are significantly higher than the ones reported by Tavares da Costa et al. (2019a).
The development of estimators of envelope flood extents has shown that in the stepwise regression the VIF and AIC selection of catchment characteristics has been valuable in obtaining statistically significant explanatory variables that improved the explained variance (R 2 ) of the target and the fit of the initial model (RMSE).
In comparison to the stepwise regression, the random forest regression proved to be a much more flexible and straightforward approach to setup. The final optimized random forest model could be obtained without any prior selection of catchment characteristics and still substantially increase the R 2 and decrease the RMSE. Moreover, the improvement of the statistical tests by the random forest model seems to provide some evidence of non-linear behaviour between TH and catchment characteristics.
Efforts towards further improving the R 2 and the RMSE should be undertaken in any future work. This could be achieved by adding new or replacing existing single and composite catchment characteristics, which may be specific to each elementary catchment response, and by testing different data transformations in order to increase the correlation with TH. It should be noted, however, that the improvement of some of these quantities may demand more data, raise additional issues and make replicability more difficult without any guarantee of significant performance enhancement.
When it comes to the predictions of envelope flood extents in the elementary catchments of the Rhône river basin, the random forest regression model performed marginally better than the stepwise regression for any of the return periods considered. Both the stepwise and the random forest regression outputted high TPR values, while at the same time moderate to high FDR values. This was reflected in a moderate to high critical success index values, C, obtained and in the E values always above 1, indicative of overprediction, especially at lower return periods. As this paper tries to deal with envelope flood extents, overprediction was already expected. Overall, predicted flood-prone areas better match the benchmarks at higher return periods and, particularly for the 10,000-year return period, it is interesting to note that the average performance obtained is in line with some modelling results reported by Wing et al. (2017).
Nevertheless, limitations to this methodological approach are known. For instance, the GFI itself, as detailed by Manfreda et al. (2014Manfreda et al. ( , 2015, Samela et al. (2017) and Tavares da Costa et al. (2019a, b), is found to be less than optimal in identifying flood-prone areas in flat terrain, which may explain the high FDR and overestimation. In a more incised fluvial valley there are more independent GFI contours to choose from, this allows a better representation of flood-prone areas; whereas, flat terrain fails to constrain flood extents in terms of elevation differences along fluvial valleys, the HAND component in the GFI, which means that a small variation of TH can translate into overpredicted flood extents.
Being aware of the limitations of the methodology, there are possibilities to improve the quality of flood descriptors. This could be done by: 1) further improving the spatial resolution, vertical accuracy and the processing of the DEM (e.g., de-noising, smoothing and hydrological conditioning); 2) calibrating the coefficient of the power law that scales bankfull depth with contributing area (also in terms of DEM spatial resolution, return period and Strahler stream order); 3) using a multi-directional flow model (e.g., D-Infinity, Tarboton, 1997); and, 4) testing other channel initiation methods (e.g., Li et al., 2020). At the same time upgrading the threshold binary classification and understanding why some elementary catchments are not producing higher h i values, will most likely help. A number of additional findings are also worth noting: • Although the regression models proved to be reasonably robust, considering that the sample variability of the training data was limited in comparison to that of the testing, a training of the models with a broader range of values and more degrees of freedom could improve the generalization properties and prediction capability.
• The random forest regression is not able to predict target values outside the range found in the training dataset and this can be particularly problematic for lower TH values. To account for this feature, different algorithms would need to be considered or modifications to the random forest would need to be implemented. • Explicitly including the case where flooding does not occur (e.g., accurate representation of the river at bankfull flow and the corresponding physical climatic-hydrologic characteristics that lead to it) in the models may benefit the analysis. However, such cases should be completely withheld from the performance analysis, as they might artificially influence the performance. • Additional tests using the EEA EU-DEM, not reported here, have revealed that the use of a DEM to compute the GFI that is different from the DEM used in the modelling of the benchmark flood maps negatively influences the results. Caution should, thus, be exercised in the selection of the DEM, as in this study (a consistent use of the RMS-DEM), but also in its processing (e.g., terrain analysis, river network and catchment delineation; Tavares da Costa et al., 2019b).
Besides what was mentioned above, future work could: • Investigate how such methodology performs in other test river basins and whether it could actually be generalized to the global scale.
• Investigate how such methodology would work with benchmark flood extents obtained from remote sensing detection (e.g., Westerhoff et al., 2013;Schumann et al., 2015). It should be noted, however, that observed flood extents will largely reflect the dynamics of the flow, whilst the simplified approach presented here does not. On the other hand, observed flood extents will be dependent on the orbital pass and on the imaging time-windows, it is thus important to consider multiple images of the same flood event in order to ensure that the maximum flood extent is best captured (Bernhofen et al., 2018). • Include more return periods in the analysis, particularly the 100-year return period flood; as it is considered the standard for risk assessment in many places (e.g., by the US National Flood Insurance Program). At the same time, investigate if bias in performance assessment could be reduced by excluding lower return period from higher return period flood-prone areas (e.g., exclude from the 10,000-year flood prone areas the raster cells corresponding to the 10year return period flood prone areas).
• Take a step further and provide a way to estimate flood depth, even if coarsely (e.g., Manfreda & Samela, 2019).