Integrating ecosystem services information into water resource management: an indicator-based approach

Natural ecosystems are fundamental to local water cycles and the water-related ecosystem services that humans enjoy, such as water provision and protection from natural hazards. However, integrating ecosystem services into water resources management requires that they be acknowledged, quantified, and communicated to decision makers. We present an indicator framework that incorporates the supply of, and demand for, freshwater-related ecosystem services, which provides an initial diagnostic for natural resource managers as well as a mechanism for evaluating tradeoffs through future scenarios. Building on a risk assessment framework, we present a three-tiered indicator for measuring where demand exceeds supply of services, addressing the scope (spatial extent), frequency, and amplitude for which objectives (service delivery) are not met. The Ecosystem Service Indicator is presented on a scale of 0-100 which encompasses none to total service delivery. We demonstrate the framework and its applicability to a variety of services and data sources from case studies in China and the Lower Mekong region. We also evaluate the sensitivity of the indicator score derived from these methods, to communicate uncertainty. The proposed indicator framework is conceptually simple, robust, and flexible enough to accommodate the inevitable evolution and expansion of tools, models and data sources used to measure and evaluate the value of water-related ecosystem services.


Introduction
Rivers, lakes, wetlands, and groundwater provide people with a variety of ecosystem services including water, fisheries, erosion prevention, flood protection, wildlife habitat, and cultural services (Brauman et al. 2007). Their importance is captured in the United Nations' Sustainable Development Goal Target 6.6, to protect and restore water-related ecosystems. However, water diversion, forest degradation, wetland loss, urbanization, and channel development are among the factors degrading these ecosystems and, by extension, the quantity, quality, timing, and location of water-related ecosystem services. Addressing these threats and preserving related ecosystem services requires new approaches to measuring, valuing, and managing freshwater ecosystems. Despite decades of progress in illuminating the contributions of ecosystem services to human well-being, there remains a need to translate the concept into practical terms for decision making (Inostroza et al., 2017).
When managing water resources, it is important to identify where, and to what extent, water-related ecosystems support services. This helps maximize service provision and mitigate negative tradeoffs (Liu et al., 2013). However, incorporating water-related ecosystem services into integrated water resource management is challenging. Most water-related ecosystem services cannot be easily assigned an economic value, which makes it difficult to evaluate them against conventional gray infrastructure (e.g., wastewater treatment plants, dams, and levees) or to include them in cost-benefit analyses. Instead, proxies such as land cover (Burkhard et al., 2014) or process-based models have been applied to estimate ecosystem service supply (Vigerstol and Aukema, 2011; Crossman et al. 2013). Consequently, most assessments have focused on surface water supply, as this is the most analytically tractable approach (using geospatial and biophysical datasets). Moreover, regulating services, such as flood mitigation, sediment retention, and water filtration, can be difficult to quantify or data-intensive to assess. Thus, they are seldom factored into resource planning decisions (Villamagna et al. 2013). Measuring water-related ecosystem services is also challenging because services are often produced upstream but consumed either downstream or outside of a watershed, complicating attribution from point of supply to point of demand, particularly when services partially depend on gray infrastructure for delivery. As a result, the spatial flow and demand for services are understudied.
A variety of techniques are used to assess ecosystem services due to variations in data availability, technical capacity, geographic scale of interest, and research question (Pandeya et al. 2016, Harrison-Atlas et al. 2016). The flow of water-related services is typically measured with a hydrologic model (e.g., Nedkov and Burkhard, 2012), while demand is often measured using population-based proxies (e.g., number of people downstream of a reservoir). Ecosystem services assessments need to be both site-specific (Fisher et al. 2009) and context-specific (Cowling et al. 2008, Wissen Hayek et al. 2016), but a unified framework for measuring water-related ecosystem services may improve communication (Polasky et al. 2015, Grizzetti et al. 2016). Whilst early assessments of ecosystem services raised awareness about the links between ecosystem functioning and human benefits, contemporary assessments need to integrate both policy and management concerns (Harrison-Atlas et al. 2016, Inostroza et al. 2017). This requires assessing both service flow and demand (Burkhard et al. 2014), as opposed to a supply-centered assessment. These assessments may also incorporate socioeconomic data and catchment hydrology. Land cover-based assessments of ecosystem services sacrifice accuracy for expediency (Eigenbrod et al. 2010) and are generally not suitable for incorporating demand. Unvalidated process-based modeling approaches can only provide relative estimates of change. Moreover, complex models may create a false sense of confidence in their outputs, without adding value to decision-making processes (Polasky 2011, Bagstad et al. 2013).
We have developed a quantitative ecosystem services indicator framework that bridges the gap between current ecosystem services science and decision maker needs. Effective indicator frameworks should provide a flexible and decision-relevant approach to measuring ecosystem services (Grizzetti et al. 2016).
Such frameworks should also distill and frame scientific information representing complex environmental phenomena, whether to evaluate current conditions or to set future goals (Heink and Kowarik 2010). Our need to evaluate water-related ecosystem services was driven by the social-ecological framework of the Freshwater Health Index (FHI; Vollmer et al. 2018). The FHI frames freshwater ecosystems as dynamic social-ecological networks, with linkages and feedbacks that reveal human water needs and the ecological effects of using fresh water in a watershed. This focus on freshwater ecosystem health engages stakeholders and decision-makers (Souter et al. 2020; Wen et al., 2020). The FHI incorporates both supply of and demand for water-related ecosystem services, which provides a diagnosis of freshwater health for land and water resource managers and a mechanism for evaluating tradeoffs.
We present a framework for assessing ecosystem service delivery covering six types of water-related services: 1) water and 2) biomass provision, and regulation of 3) water quality, 4) sediment, 5) flooding, and 6) water-related disease. Within this framework we describe and evaluate three alternate methods of metric calculation, allowing for flexibility based on the amount and type (categorical, numeric) of data available. We provide examples from two case studies, the Dongjiang basin in south-eastern China and the Sesan, Srepok and Sekong basin in Southeast Asia. We then conduct sensitivity analyses to evaluate how varying the quality of data inputs affects the outputs. Finally, we discuss the strengths of the various calculation methods when balancing salience (information that is useful to those who can act on it), legitimacy (the perception that it respects divergent values and has been developed in an unbiased way), and credibility (its scientific and technical rigor) (e.g. Vollmer et al. 2016).

Case study basins
The Dongjiang (Fig. 1) is the smallest tributary of the three main rivers comprising the Pearl River system in southern China. The basin covers 35,340 km² and has an annual average discharge of 739 m³/s. It is the primary source of water for close to 40 million people and hence water supply, quality and sediment regulation are services in high demand from it. The Sesan, Srepok and Sekong in Southeast Asia are transboundary basins (Fig. 2) and important tributaries to the Mekong River. Collectively referred to as the 3S basin, they provide close to a quarter of the Mekong's discharge and cover approximately 78,650 km². With a large rural population, subsistence fisheries, flood and disease regulation are services of concern for this region.
Besides calculating an indicator score for ecosystem service delivery for these basins with various grades of available data (Table 1), we use water supply provisioning in the Dongjiang as an example to demonstrate the framework's flexibility and robustness. We used water supply reliability (the ability of available water to meet scheduled allocations, calculated as a percentage) to represent water supply provisioning. As an illustrative dataset, monthly projections of water supply reliability for major municipalities and sectors in China's Dongjiang basin (Fig. 1a) were constructed from modelled water supply reliability estimates for the six major municipalities of the region during the severe drought of 1991 (Zhang et al., 2008; Table S1). The municipalities receiving water from the Dongjiang basin are Heyuan, Huizhou, Dongguan, Hong Kong, Shenzhen, and Guangzhou. We further sub-divided total demand in each municipality among three sectors: Residential use (R), Industry (I), and Agriculture (A). Aside from Hong Kong, which did not use water for agriculture, all municipalities required water for all sectors (Fig. 1b).

General principles
Each assessment begins by identifying a spatial area of interest and a time period over which we evaluate ecosystem service supply and demand. We determine if demand is being met by supply, in both space and time, and if not, we quantify the magnitude of unmet demand. To do this we define a satisfying criterion. For some ecosystem services, this criterion can be sharp: a univariate, quantifiable threshold-based objective. For example, irrigators may receive a yearly volumetric allocation of water. The objective fails if insufficient water is available to meet their allocations, with the magnitude of unmet demand being the difference between the allocated and supplied volumes. In the illustrative water supply dataset for Dongjiang, we use 100% water supply reliability as the threshold. Thus, any time supply falls below demand, the objective in this case would fail. For other services, the criterion will be fuzzy: multi-variate threshold-based objectives that require indirect estimates. For example, ecosystems can reduce flood risk by various means (Nedkov and Burkhard, 2012), but the threshold for evaluating demand may include the number of flood-related fatalities, houses inundated, livestock lost, or economic damage. Thus, when a single objective cannot be defined, the threshold is fuzzy. However, some losses can be unrelated to the ecosystem service (e.g. drowning due to misadventure) and floods can provide beneficial services (e.g. inundating floodplain habitat or nourishing agricultural lands). In these cases, the threshold may be a subjective or multi-criteria decision and can be defined through stakeholder surveys or by combining multiple metrics.
With the objective established, we evaluate ecosystem service delivery across three dimensions: Scope, Frequency and Amplitude. These dimensions are similar to those used in the Canadian Council of Ministers of the Environment (CCME) Water Quality Index (Saffran et al. 2001) and mirror the aspects of risk source, exposure and consequences used in risk assessment calculations (Merkhofer 2012, Covello & Merkhofer 2013). We define the three dimensions as:
• Scope (F1): The proportion of the study area that fails to meet the objective (threshold) at least once over the evaluation period;
• Frequency (F2): The frequency with which the objectives (thresholds) are not met over the evaluation period; and
• Amplitude (F3): The degree or magnitude by which the objectives (thresholds) remain unfulfilled.
We scaled the scores for each dimension between 0 and 100 and combined them into a final Ecosystem Service Indicator (ESI) score. The following section details the steps taken to determine the scores.

Defining spatial & temporal elements
The study area (often a river basin) should be divided into discrete spatial units over which the three dimensions (F1, F2, and F3) can be evaluated. It is important that spatial units provide a representative coverage of the study area as selective coverage can bias the results. To avoid this, we recommend dividing the study area into sub-basins or administrative units. Spatial units can be combined with a thematic variable. For instance, with the monthly water supply reliability dataset for Dongjiang, we have 6 municipalities covering 3 sectors, giving potentially 18 spatial units for the calculation. As we do not have any agricultural demand for one of the municipalities, there are 17 spatial units.
A one to five-year evaluation-period over which data is divided into smaller intervals (e.g., monthly) provides an aggregate assessment of current condition. The state of a spatial unit at a specific interval is an instance, for which we evaluate compliance with the objective (e.g. Table 1).
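As a small illustration of how spatial units and instances are counted, the sketch below enumerates the municipality-sector combinations of the Dongjiang water supply example over twelve monthly intervals. The variable names are hypothetical and the code is not part of the assessment itself.

```python
from itertools import product

municipalities = ["Heyuan", "Huizhou", "Dongguan", "Hong Kong", "Shenzhen", "Guangzhou"]
sectors = ["R", "I", "A"]  # Residential use, Industry, Agriculture

# Hong Kong has no agricultural demand, so that combination is dropped.
spatial_units = [
    (m, s) for m, s in product(municipalities, sectors)
    if not (m == "Hong Kong" and s == "A")
]

# One instance = the state of one spatial unit in one interval (here, one month).
months = range(1, 13)
instances = [(unit, month) for unit in spatial_units for month in months]

print(len(spatial_units))  # 17 spatial units
print(len(instances))      # 204 instances over one year
```

Each instance is then checked for compliance with the objective.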

Calculating dimensions and total scores
We present and compare three methods for calculating and combining F1, F2, and F3 (Table 2) into a single Ecosystem Service Indicator (ESI). The first method (M1) is identical to Saffran et al.'s (2001) Canadian Water Quality Index. Due to shortcomings in this method we present and evaluate two adjusted methods, M2 and M3.

Table 2. Calculation of the excursion, F3, and the combined score under methods M1, M2 and M3.

Excursion
1. Services where a univariate 'sharp' threshold for non-compliance can be defined:
When the target must not fall short of the objective, the excursion is defined as:
$\text{excursion}_i = \dfrac{\text{objective}}{\text{failed value}_i} - 1$
Alternately, when the target must not exceed the objective, the excursion is defined as:
$\text{excursion}_i = \dfrac{\text{failed value}_i}{\text{objective}} - 1$
2. Services where a univariate 'sharp' threshold for non-compliance cannot be defined:
The excursion for each instance i is ranked on a scale of 1 to 10, corresponding to a low to high gap between supply and demand. The ranking is derived by stakeholder survey or multi-criteria analysis.

F3
M1 (Amplitude) and M2 (Frequency and Amplitude): From n instances among the spatial units (SUs) where the objective is not met, a normalized sum of excursions (nse) is calculated:
$nse = \dfrac{\sum_{i=1}^{n} \text{excursion}_i}{\text{Total no. of instances}}$
F3 is now calculated by scaling nse to between 0-100:
$F3 = \left( \dfrac{nse}{nse + 1} \right) \times 100$
M3 (Amplitude): From n instances among the SUs where the objective is not met, a mean of excursions (moe) is calculated:
$moe = \dfrac{\sum_{i=1}^{n} \text{excursion}_i}{n}$
F3 is now calculated by scaling moe to between 0-100:
$F3 = \left( \dfrac{moe}{moe + 1} \right) \times 100$

Combined score
If only able to determine F1: [equation not recovered]

Scope (F1) and Frequency (F2)
All three methods retain Saffran et al.'s (2001) definition and formula for scope (F1) and frequency (F2) (Table 2). Scope is the percentage of spatial units for which the ecosystem service has not met demand at least once over the evaluation period, relative to the total number of spatial units. Frequency measures exposure as the percentage of all monitored instances, across all spatial units, in which demand for the ecosystem service was not met.
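The two definitions can be sketched as follows for a sharp threshold, assuming a table that maps each spatial unit to its values per interval and that an instance fails when its value is strictly below the threshold. The function name and the toy data are hypothetical.

```python
def scope_and_frequency(table, threshold=100.0):
    """Return (F1, F2) as percentages for a {spatial_unit: [values]} table."""
    total_units = len(table)
    total_instances = sum(len(values) for values in table.values())

    # Scope: spatial units that fail at least once over the evaluation period.
    failed_units = sum(
        1 for values in table.values() if any(v < threshold for v in values)
    )
    # Frequency: failed instances out of all monitored instances.
    failed_instances = sum(
        1 for values in table.values() for v in values if v < threshold
    )

    f1 = 100.0 * failed_units / total_units
    f2 = 100.0 * failed_instances / total_instances
    return f1, f2


# Toy reliability table (%): four spatial units, four intervals each.
toy = {
    "unit_a": [100, 100, 90, 100],
    "unit_b": [100, 100, 100, 100],
    "unit_c": [80, 95, 100, 100],
    "unit_d": [100, 100, 100, 100],
}
f1, f2 = scope_and_frequency(toy)
print(f1, f2)  # 50.0 18.75
```

In this toy case two of four units fail at least once (F1 = 50) while only three of sixteen instances fail (F2 = 18.75), so F2 can never exceed F1.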

Amplitude (F3)
Amplitude (F3) measures the magnitude by which ecosystem service demand was not met. To estimate amplitude of the whole domain, the gap between supply and demand for each failed instance is calculated (Table 2, row 3) as an excursion for all failed instances. As described in section 2.2.1, demand thresholds may be either sharp or fuzzy and our methods can determine excursions for both cases.
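For a sharp threshold, an excursion can be sketched as below. This assumes the excursion takes the form used in the CCME Water Quality Index (Saffran et al. 2001), on which M1 is based; the function name and the example values are hypothetical.

```python
def excursion(value, objective, must_not_fall_below=True):
    """Relative gap between a monitored value and its objective.

    Returns 0.0 for a compliant instance. Assumes the CCME-style ratio form.
    """
    if value <= 0 or objective <= 0:
        raise ValueError("value and objective must be positive")
    if must_not_fall_below:
        # e.g. water supply reliability must not drop below 100%
        return 0.0 if value >= objective else objective / value - 1.0
    # e.g. soil erosion must not exceed a standard
    return 0.0 if value <= objective else value / objective - 1.0


print(excursion(80.0, 100.0))                             # 0.25
print(excursion(25.0, 20.0, must_not_fall_below=False))   # 0.25
print(excursion(100.0, 100.0))                            # 0.0
```

Fuzzy thresholds do not use this ratio; there the excursion is a rank assigned by stakeholders or multi-criteria analysis.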
These excursion values are combined to derive amplitude. Here, the three methods (M1, M2 and M3) diverge (Table 2). For M1 (following Saffran et al.'s (2001) original method), F3 is calculated by averaging the total excursions over all monitored instances (Table 2, row 4). However, use of all monitored instances, whether compliant or not, means that the resulting score merges frequency and amplitude. Combining all three dimensions to derive ESI (Table 2, row 5) leads to double counting between F2 and F3.
To avoid double counting, M2 and M3 adopt different approaches for defining and calculating F3 and the ESI score. M2 retains M1's calculation of F3 but defines it as a combination of frequency and amplitude. When calculating the ESI score, M2 uses either F2 or F3 (not both), depending on the level of evidence available. In M3, F3 only accounts for amplitude. This is achieved by calculating a mean of excursions only for failed instances. We examined differences in the behavior of these methods using a series of tests (section 2.3) to select the method most suitable for water-related ecosystem services.
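The difference between the two amplitude calculations can be illustrated with a minimal sketch. It assumes that the normalized sum of excursions (nse) is scaled to 0-100 in the same way as the mean of excursions (moe), i.e. x / (x + 1) × 100; the function names and toy numbers are hypothetical.

```python
def f3_from_nse(excursions, total_instances):
    """M1/M2 style: failed-instance excursions averaged over ALL instances."""
    nse = sum(excursions) / total_instances
    return 100.0 * nse / (nse + 1.0)


def f3_from_moe(excursions):
    """M3 style: excursions averaged over failed instances only."""
    if not excursions:
        return 0.0
    moe = sum(excursions) / len(excursions)
    return 100.0 * moe / (moe + 1.0)


# Same size of failure (excursion 0.25), but 2 versus 20 failures in 100 instances.
few = [0.25] * 2
many = [0.25] * 20

print(f3_from_nse(few, 100), f3_from_nse(many, 100))  # rises with failure count
print(f3_from_moe(few), f3_from_moe(many))            # 20.0 in both cases
```

The nse-based value grows as failures become more frequent even though each failure is equally severe, which is why it mixes frequency with amplitude; the moe-based value responds to severity only.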

Combined score (ESI1, ESI2 or ESI3)
The three dimensions are then combined into an ESI score (Table 2, row 5). Data quality and availability will determine how many of the three dimensions can be calculated, reflecting the level of evidence and confidence in the final score. This is critical to understanding uncertainty in an assessment and should be reported alongside the final score. The scores are denoted ESI1, ESI2 or ESI3, representing low, medium, or high levels of evidence, respectively. M1 allows for ESI3 as the only output. As ESI1 is calculated using only F1, it provides the least complete description of the ecosystem service. As ESI3 is calculated using measures of scope, frequency, and amplitude, it provides the most complete description and is also likely to have used the most detailed datasets. Other sources of uncertainty in the assessment can come from the accuracy of data, parameters, and models.
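Because M1 is described as identical to the index of Saffran et al. (2001), its ESI3 can be sketched with the combination rule published for the CCME Water Quality Index: the root of the summed squares of the three factors, rescaled to 0-100. This illustrates M1 only; the function name is hypothetical, and the combination rules for M2 and M3 are those of Table 2 and are not reproduced here.

```python
import math


def esi3_m1(f1, f2, f3):
    """CCME-style combination of scope, frequency and amplitude (each 0-100).

    The CCME documentation gives the divisor as 1.732, i.e. sqrt(3), so that
    the result stays within 0-100.
    """
    return 100.0 - math.sqrt(f1 ** 2 + f2 ** 2 + f3 ** 2) / math.sqrt(3)


print(esi3_m1(0.0, 0.0, 0.0))    # 100.0: demand always met everywhere
print(esi3_m1(50.0, 20.0, 5.0))  # lower score as failures spread and deepen
```

Because the factors are squared before summing, the largest factor dominates the score under this rule.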

Framework testing
Several factors can influence the ESI score including differences in spatial coverage and threshold setting; calculation method, either M1, M2 or M3; and for M2 and M3, the level of evidence (ESI1 to ESI3). The choice of method, and levels of evidence are likely to have the largest influence on indicator output. Understanding this influence is thus critical when determining which method may be most suitable for a particular case and interpreting results. We examined the importance of level of evidence and calculation method, by defining the expected level of ecosystem service delivery and testing the precision of scores generated by each method.

Metrics for Monte-Carlo simulation: Probability of failure & range of failure
We used Monte-Carlo simulations to generate multiple (n = 10,000) water-reliability tables analogous to the water supply reliability dataset of the Dongjiang case study. We constrained the value generated (Fig. 3) for each instance j in each table i (of n) with two metrics: probability of failure, which controlled the rate of non-compliance; and range of failure, which controlled the magnitude of non-compliance. For example, when probability of failure was set at 1%, each randomly generated reliability value had a 1% probability of being below the required threshold. On average, due to the low probability of failure of each instance, we would expect few cases of non-compliance. Thus, we expected higher ESI scores when compared to a case where probability of failure was >1%. Setting the range of failure at, for example, 5% limits the magnitude of a failed instance to a maximum of 5% below the threshold. Thus, when the threshold is 100%, the value of water supply reliability will vary between 95% and 100%. The probability of failure and range of failure metrics pre-determined the expected level of ecosystem service delivery. Thus, based on the structure of the indicator system, our expectation was that probability of failure would have a greater influence on scope and frequency, whilst range of failure should influence amplitude. By using these metrics, we tested each method's credibility, as it must provide outputs matching those that were expected.
For each of the 10,000 (n) water-supply reliability tables we randomly set both probability and range of failure between 0% and 100%. Consequently, each table generated by the simulation represented a system with different failure characteristics and, overall, the 10,000 tables covered a wide range of values for the framework's three dimensions. Using this dataset, we compared the characteristics of the three dimensions and calculation methods. We generated summary statistics (median and percentiles) for the three dimensions and ESI scores to confirm that the outputs behaved as expected and summarized the results of each method over a wide range of scenarios. We visually examined the relationship between both probability of failure and range of failure and the three dimensions using hexbin bivariate histograms, which depict the count of observations within hexagonal bins, and the ordinary least squares coefficient of determination (R²).
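A minimal sketch of how such tables could be generated is given below. It assumes 17 spatial units by 12 monthly intervals, a threshold of 100%, and a uniform draw for the size of each failure; the uniform distribution, the function name, and the use of fractions (0-1) for the two metrics are assumptions made for illustration.

```python
import random


def simulate_table(prob_failure, range_failure, n_units=17, n_intervals=12,
                   threshold=100.0, rng=None):
    """Generate one reliability table as a list of rows (one row per unit).

    prob_failure  -- chance (0-1) that an instance falls below the threshold
    range_failure -- largest shortfall as a fraction (0-1) of the threshold
    """
    rng = rng or random.Random()
    table = []
    for _ in range(n_units):
        row = []
        for _ in range(n_intervals):
            if rng.random() < prob_failure:
                shortfall = rng.uniform(0.0, range_failure) * threshold
                row.append(threshold - shortfall)
            else:
                row.append(threshold)
        table.append(row)
    return table


# Each table gets its own randomly drawn failure characteristics.
rng = random.Random(0)
tables = [simulate_table(rng.random(), rng.random(), rng=rng) for _ in range(100)]
print(len(tables), len(tables[0]), len(tables[0][0]))  # 100 17 12
```

The sketch draws 100 tables to stay quick; the same loop scales to the 10,000 tables described above.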

Testing flexibility afforded by three-tiered calculation
We examined level of evidence by calculating the three dimensions from a subset of the Monte-Carlo simulation dataset. Tables for three sets of probability of failure, (a) 0-10%, (b) 20-30% and (c) 40-50%, were extracted from the dataset. ESI scores were then calculated for the tables in these three sets using methods M2 and M3 (as these two methods allow a score to be calculated at different levels of evidence). Based on the probability of failure sets, we would expect that, on average, ESI scores for set (a) would be higher than those from set (b), which would in turn be higher than those from set (c). We also examined whether this trend (should it be confirmed) persisted with a loss of information. We assessed this by calculating three scores, ESI1, ESI2 and ESI3, corresponding to different levels of evidence: ESI1 performs the fewest operations on the data in a table, while ESI3 is the most rigorous. We examined the effect of a loss of information between the different levels of evidence on the final score using scatterplots and Pearson's correlation coefficient (r). We expected ESI scores calculated using the three levels of evidence to be correlated as an artefact of the calculation process, as all use F1 in calculating the final metric.

Comparison of adjusted calculation methods
We compared the three calculation methods by examining ESI3 scores calculated from the Monte-Carlo simulation dataset. A suitable method should reflect changes in the system by being sensitive to all three dimensions, especially the higher dimensions (F2 and F3), which will change first in response to any management action. Further, the indicator values should not be biased toward either end of the scale (0-100) but should vary along the full range. We compared the scores from methods M1, M2 and M3 using Pearson's correlation coefficient (r) and heatmap graphs showing variation in ESI3 scores for x-y combinations of the two metrics.
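For readers who wish to reproduce this kind of comparison, Pearson's correlation coefficient can be computed as in the sketch below. The function name and the toy values are hypothetical and are not results from the simulation.

```python
import math


def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data: a dimension (e.g. F1) and a score that falls as the dimension rises.
dimension = [0.0, 25.0, 50.0, 75.0, 100.0]
score = [100.0 - d for d in dimension]

r = pearson_r(dimension, score)
print(r)       # -1.0: a perfectly inverse linear relationship
print(r ** 2)  # for a single predictor, the OLS R² equals r squared
```

A dimension that drives the score down produces a negative r, so a strongly negative value indicates high sensitivity of the score to that dimension.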

Water supply, quality, and sediment regulation from Dongjiang basin
Our assessment of monthly data from 1991 gave a total of 204 (17 spatial units x 12 months) instances. As expected for a severe drought year, several sectors and municipalities in the basin were unable to meet the water demand. The water reliability table and its analysis are included as supplementary material (Tables S1-S3). Nearly half of the spatial units failed to meet the delivery threshold at least once over the year, but the frequency of failure was low and mostly occurred from February to May. Excursions for the sharp threshold approach were moderately high. Despite the differences in F3 scores, the final ESI3 scores with the three methods were similar, indicating a moderately stressed to stressed basin as expected during a drought period.
The main difference between the sharp and two fuzzy threshold assessments was in the range of scores for the different excursion calculations. ESI3 M2 was most sensitive with a difference of nearly 30 units between the three approaches (Table 3). ESI3 M3 was the least sensitive and differed only by 8 units.
The 2017 freshwater health assessment of the Dongjiang basin used data from 2012-2016. This gave a Water Supply Reliability score of 86 (ESI3 M2), as Guangzhou, Shenzhen and Heyuan reported modest supply shortages. However, future demand management may be challenging, as most of the basin's surface water is allocated and the freshwater ecosystem has been highly modified to capture and supply water. Over the past decade, considerable effort has been expended on improving water quality. This was reflected in the water quality regulation score of 76 (ESI3 M2) for the entire basin, calculated by comparing measured water quality parameters such as dissolved oxygen, biological oxygen demand, ammonium-N, chemical oxygen demand, fecal coliforms, and heavy metals (zinc, copper, lead, cadmium) against water quality targets derived from government regulation. Sediment regulation is another important service for the Dongjiang, as soil erosion elevates suspended solids and causes reservoir sedimentation. As monitored data for this service were scarce, modelled soil erosion estimates were used to calculate the indicator, with the southern China soil erosion standard (20 t/ha/yr) used as the threshold, giving a score of 75 (ESI3 M2). Erosion hotspots were mostly found near downstream urban areas, where erosion may be less of a concern since it may contribute to the natural dynamics of the Pearl River Delta. Sub-basins upstream of the major reservoirs exhibited low erosion rates due to the provincial government's efforts to maintain the headwater forests as protected areas.

Biomass provisioning, flood, and disease regulation for 3S basin
Subsistence fishing is an important service for the 3S basin, but actual catch data are sparse and unreliable. Consequently, as a surrogate measure of biomass available for consumption, Souter et al. (2020) assessed the basin's migratory fish habitat. Migratory fish preferentially use habitat closer to the three rivers' confluence with the Mekong. Hydropower dam development causes fragmentation, which limits access to these important habitats and thus reduces the availability of migratory fish for subsistence fishing. This approach was limited as it excluded non-migratory fish due to a lack of data. Over the assessment period this approach gave a high score (ESI3 M2 = 95), as most dams were located in the basin's highlands. However, the construction of the Lower Sesan II dam, close to the basin's outlet, reduced the ESI to 26 as migratory fish habitat in the Sesan and Sekong rivers became isolated. Whilst this assessment gave an overview of the likely impact of hydropower development, local catch data are needed to understand the impact of rapid hydropower dam development.
The 3S received a high flood regulation score (ESI3 M2 = 88) as few floods were observed at the four gauging sites over the 2010-2015 evaluation period. However, the 3S is believed to be at a high risk of flooding (MRC, 2010). Either the period in question was unusually low in floods or the four stations did not adequately describe the entire basin. Flood risk in the 3S is closely tied to the steep terrain and landslides. Uncoordinated releases from dam cascades under development on the river system could present additional challenges for managing future flood risk.
The lowest score was for disease regulation (67). This is driven by two endemic water-related diseases: Mekong schistosomiasis (Schistosoma mekongi) and dengue fever. Dengue is more widespread in the basin (particularly during the wet season), while Mekong schistosomiasis is confined to a smaller region, mostly near the confluence of the 3S rivers.

Outputs from the Monte-Carlo simulation
We evaluated 9871 of the 10,000 water-reliability tables generated using the Monte-Carlo simulation. We discarded 129 tables as they were either repetitions of a no-failure state (F1=F2=F3=0) or unrealistic cases where the excursion for all 'failed' points was zero (Ex=0). As expected, the median values for both the probability of failure and range of failure inputs were 50%, with the 25th and 75th percentiles at 25% and 75%, respectively, and each input covered the full range of 0-100% (Fig. 4). The distributions of the F1, F2, and F3 scores were also as expected. F2 showed a median of 50% failure, which mirrored the probability of failure input distribution. F1 summary scores were high, due to the high probability of numerous sites failing in any one scenario. The scores for F3 (M1 and M2) were lower than those for F3 (M3), as the latter is calculated using non-compliant instances only. Indicator scores derived from each method (M1, M2 and M3) at the highest level of evidence (ESI3) followed distinct distributions, indicating that each method has different characteristics. Overall, M1 gave the lowest scores and M2 the highest.
As probability of failure values increased from 0 to 100, F1 rose sharply (ordinary least squares coefficient of determination, R² = 0.867; Fig. 5a). When probability of failure was greater than 50% (i.e., the chances of water supply at any instance being unable to meet demand were >50%), finding a water-reliability table where any of the 17 spatial units did not have at least one failure over the 12 months became highly unlikely. Frequency (F2) increased linearly with probability of failure (R² = 0.994; Fig. 5b); however, the range of values also increased with probability of failure. As expected, probability of failure was not a good predictor of amplitude (shown here using F3 from M3; R² = 0.637; Fig. 5c). The range and value of F3 increased with range of failure (R² = 0.979; Fig. 5d). Range of failure was not a good predictor for either F1 or F2 (R² = 0.706 and 0.576, respectively).

Ability of lower level of evidence to depict system state
The average ESI3 M2 scores for set (a), (b) & (c) were 94.5, 72.3 & 62.0, respectively. And as expected ESI scores calculated with a lower number of dimensions (i.e., ESI1 or ESI2) were correlated with ESI3 (Fig. 6).
The correlation between ESI1 and ESI3 was lower than the correlation between ESI2 and ESI3, which was also expected given the nature of the calculation method. Probability of failure had a greater impact on the correlation than the number of dimensions, with the level of correlation decreasing as probability of failure increased. The comparison of ESI3 with ESI2 (Fig. 6, a1-c1) examines the influence of amplitude. When amplitude was high (clustered red dots at the bottom of the graph), the values of ESI3 and ESI2 were closer in magnitude. When amplitude was low, ESI2 was always lower than ESI3 (blue dots, top of graph). Thus, when the magnitude of the gap between supply and demand is unknown, ESI2 is likely to give a conservative score of the state of ecosystem service delivery. This value will generally, but not always, increase when information about amplitude becomes available. Moving from row (a) to (c), the extent by which scores can increase (upon additional information on amplitude becoming available) also increases. This is seen as the increasing depth of the scatterplot along the y-axis, and hence the correlation decreases.
The correlation between ESI1 and ESI3 was high for both M2 and M3 when set (a) was compared against set (b) & (c). And for all cases, ESI1 score was lower than either the ESI2 or ESI3 scores. A high ESI1 score (>60) indicates moderate to high ecosystem service delivery. However, a functioning system can also give a low ESI1 score when for example, numerous sites fail only once and by a small magnitude.

Comparison of the calculation method
The correlation between ESI3 M2 and ESI3 M3 was higher (r = 0.979) than that between ESI3 M1 and ESI3 M2 (r = 0.722) or between ESI3 M1 and ESI3 M3 (r = 0.819). ESI3 M1 showed limited variation as range of failure changed (Fig. 7). This was further confirmed by its correlation values with F1 (r = -0.943), F2 (r = -0.912) and F3 (r = -0.589). M2 and M3, by comparison, showed variation upon change in either metric. ESI3 M2 was most sensitive to changes in F3 (r = -0.961), followed by F2 (r = -0.642) and then F1 (r = -0.596). We expected the dimensions to be inversely correlated with the ESI scores, as an increase in Scope, Frequency or Amplitude sees a decline in ESI. However, the high inverse correlation between ESI3 M1 and F1 is a drawback for M1, as a change in the magnitude of the gap between supply and demand (resulting in a change in F3) had a diminished impact on the total ESI score. Thus, M1 is overly sensitive to the failure threshold, as outlier values can change F1 values significantly and 'swing' the scores. Box 1 describes a hypothetical illustration of the practical effect of this sensitivity. ESI3 M2 had a higher inverse correlation with range of failure (r = -0.724) than with probability of failure (r = -0.601). Conversely, ESI3 M3 had a higher inverse correlation with probability of failure (r = -0.705) than with range of failure (r = -0.614). This implies a greater sensitivity of M2 to excursion values, as was also observed with the sample dataset examined earlier when using the fuzzy thresholds (Table 3).

Focusing on delivery of water-related ecosystem services
The idea of using water-related ecosystem services to link ecosystems to basin management has wide appeal, demonstrated by its widespread uptake in both academic studies and global policy forums. However, moving from a conceptual understanding of freshwater ecosystem services to measuring their actual value requires information that is often unavailable. In using this indicator framework in seven basins (covering Asia, Africa and Latin America) we have found that most stakeholders, whilst familiar with the concept of water-related ecosystem services, have considered them less often than bio-physical or governance aspects. Furthermore, the need for ecosystem services has yet to be legislated in many jurisdictions (e.g. Liu et al. 2019). This hinders the development of the systematic monitoring or data collection needed to make detailed assessments of water-based ecosystem services. Consequently, we found the adoption and use of models focused on water-related ecosystem services to be low or non-existent. This becomes a vicious circle: as the benefits of modelling ecosystem services have not been demonstrated, the motivation to devote resources to it is limited, especially given the lack of technical capacity in these regions. We found that moving from a discussion of value to delivery engaged local stakeholders. People associate freshwater ecosystems with the delivery of certain services. Focusing on the question "Are those services being provided?" places water-related ecosystem services in a demand and supply framework that can be measured using existing data and provide an initial diagnosis of the state of the services. Our approach focuses on the extent to which ecosystem service supply meets demand irrespective of the type of infrastructure supplying it.
Vollmer et al. (2018) found that while stakeholders largely attribute ecosystem services to functioning ecosystems, there was also a clear acknowledgment that built infrastructure maintains water provisioning (reservoirs and supply network), purification (wastewater plants to artificial wetlands), flood defenses (dykes and diversion channels) and other interventions. Further, the distinction between natural and built infrastructure may be of little consequence if the desired services appear to meet demand for the foreseeable future. Information on the gap between ecosystem service supply and demand was more relevant for decision making than the type of infrastructure supporting the service.
Our two case studies demonstrate how the framework allows for a comparative assessment of multiple water-related ecosystem services. This prevents focus on what is perceived to be the most important or dominant service from the basin, and instead opens a wider dialogue on gaps in knowledge, tradeoffs between different services, and linkages between services and ecosystem state. For example, concerns regarding tradeoffs were evident in the 3S basin when dam development (Souter et al. 2020) and the potential changes to several ecosystem services resulting from it, especially subsistence fishing, were presented to stakeholders. Given the uncertainty of the dataset used for the initial assessment, it also became apparent that improving data on fish harvests should be prioritized, to inform monitoring and better calibrate any modeling efforts.

Using the flexibility of data requirement to initiate assessment
Much of the emphasis of our framework is on providing a starting point for assessment in basins with little to no prior work on water-related ecosystem services. Progressively assessing ecosystem service delivery according to scope, frequency, and amplitude provides considerable flexibility. The three levels allow for an assessment to be made with varying levels of confidence, which is largely dependent upon the type and quality of data available. For the FHI case studies in Asia (Souter et al. 2020) and Latin America, the majority of indicators could be calculated with data collected by basin authorities or with remotely sensed data. A few of these calculations relied on low-resolution estimates and were limited to a low level of evidence (F1). The ability to calculate the indicator with easily available data increases its applicability and the likelihood that an ecosystem services-based approach will gain recognition, often an initial hurdle to overcome. Spatial data that can estimate supply at a low level of evidence (e.g. extent and frequency) can increasingly be obtained using remotely sensed or other global datasets of flood maps, drought indicators, soil erosion estimates, land cover and degradation, etc. (e.g. Shaad 2018, Mukherjee et al. 2018, Liu et al. 2018). As information on higher dimensions (such as damage or flood levels) becomes available, demand is better understood, and analysis leads to more accurate thresholds. This leads to a rise in confidence in the estimated state of ecosystem service delivery. With growing capacity, using data generated from modelled scenarios to re-calculate indicators can help understand system sensitivity and explore higher-order issues like the relative contribution of the natural (green) and managed (grey) systems.
As our framework testing shows, adequately representing the underlying system has important implications for the final indicator score, which need to be considered when applying this framework at lower levels of evidence. We found that scores calculated with only scope (F1) have basic diagnostic value. A high score obtained with only one dimension (scope) implies that the system is functioning well, but the converse may not always be true. At least two dimensions are needed to convey reliable information about the system. Omitting the highest dimension (amplitude) was found to create an upward bias for scores, i.e. the scores implicitly assume that amplitude is low and thus suggest better service delivery than is actually the case.

Performance of the adjusted methods
Our results show that the original Canadian Water Quality Index method, on which our approach was based, was more sensitive to scope than either of our two adjusted methods. The framework testing shows that this has a practical implication, as this in turn makes the original method highly sensitive to the value of the threshold. Sharp thresholds are often based on either scientific consensus, a regional policy, or long-term means from local monitoring. While for some water quality parameters treating them as strict limits is justified, this may not apply for many of the other water-related ecosystem services. Therefore, treating thresholds as solid boundaries may prevent stakeholders from using the results to inform deliberative and nuanced management discussions. Indicator scores calculated using the adjusted methods (M2 and M3) were more balanced in their response to changes in values of scope and were most sensitive to changes in amplitude. As the magnitude of the gap between ecosystem service supply and demand is most likely to change first in response to an improvement or deterioration of a service, the method that is most sensitive to that change will best reflect reality. We found M2 to be marginally more sensitive to amplitude/excursion values than M3 and thus recommend it as the standard Freshwater Health Index method.

Conclusion
Our ecosystem service indicator framework transparently measures the demand for a suite of water-related ecosystem services. It is a flexible framework that can accommodate the inevitable evolution and expansion of tools, models and data used to measure and evaluate the value of water-related ecosystem services. As the importance of freshwater ecosystems in meeting not just demand for water supply, but a host of other services, is increasingly realized, our indicator framework integrates quantitative information about water-related ecosystem services into decision-making. We hope that the framework will stimulate further discussion between various management agencies within basins concerned with different aspects of freshwater governance and lead to an improved appreciation of the value freshwater ecosystems have for human well-being.