Streamflow depletion estimation for conjunctive water management in a heavily-stressed aquifer using analytical depletion functions

Groundwater pumping can lead to reductions in streamflow (‘streamflow depletion’) and estimating streamflow depletion is critical for conjunctive groundwater-surface water management. Streamflow depletion can be quantified using either analytical models, which have low data requirements but many simplifying assumptions, or numerical models, which represent physical processes more realistically but have high data, effort, and expertise requirements. Analytical depletion functions are a recently-developed tool that address some of the limitations of analytical models, but to date have only been evaluated under relatively simple conditions. Here, we evaluate eight different analytical depletion functions across a range of groundwater abstraction, physiographic, and hydrostratigraphic conditions via comparison to the Republican River Compact Administration groundwater model, a calibrated MODFLOW numerical model used for conjunctive water management in a heavily-stressed portion of the High Plains Aquifer (USA). We find mostly strong agreement between the analytical depletion functions and the MODFLOW model, though analytical depletion functions underestimate depletion for wells close to surface water features in high transmissivity environments. Compared to previous work, there is little variability among the eight analytical depletion functions, indicating that function formulation plays a minor role in this setting. Analytical depletion function performance is strongly influenced by hydrostratigraphic parameters (storativity and transmissivity) but performance is insensitive to pumping rate, confirming a key assumption of analytical models. Overall, analytical depletion functions provide comparable estimates of streamflow depletion to numerical models at a fraction of the time and data requirements. Accurate hydrostratigraphic data are essential to estimating streamflow depletion regardless of modeling approach.


Introduction
Groundwater is an essential contributor to streamflow around the world (Beck et al. 2013) , providing a relatively cool and stable supply of water particularly during dry periods. Groundwater inflow to streams ('baseflow') is essential for a number of aquatic and groundwater-dependent ecosystems (Rohde et al. 2017;Larsen and Woelfle-Erskine 2018) . However, groundwater abstractions can lead to reductions in streamflow ('streamflow depletion') via the capture of discharge, which includes interception of water which otherwise would have discharged into a stream or, in extreme cases, induced infiltration from a previously gaining stream (Bredehoeft et al. 1982;Bredehoeft 2002; Barlow et al. 2018) .
Streamflow depletion cannot be directly observed and is often difficult to estimate due to complex groundwater flow systems, lag times between pumping and streamflow reductions, and natural variability in streamflow resulting from other processes such as weather, water control structures such as dams, and surface water withdrawals (Barlow and Leake 2012) . For this reason, conjunctive groundwater-surface water management typically relies on numerical models (e.g. MODFLOW), which are physics-based simulations of groundwater flow processes (McDonald and Harbaugh 1988;Fienen et al. 2018) . However, there are large time, effort, and computational costs associated with numerical models (Fienen et al. 2015(Fienen et al. , 2016 . They are, therefore, not available in most settings. In locations where numerical models are not available, analytical models are often used instead (Hamilton and Seelbach 2011;Huang et al. 2018) . Analytical models are simpler representations of stream-aquifer interactions, but contain many limiting assumptions such as one (or occasionally two) streams, homogeneous subsurface conditions, and simplified stream and aquifer geometry. While analytical models have been proposed that account for some of these assumptions (Butler et al. 2007;Yeh et al. 2008;Zlotnik and Tartakovsky 2008;Singh 2009) , there are still many real-world environments which violate the core assumptions of analytical models including settings where there are multiple and/or highly sinuous streams.
Recently, analytical depletion functions ( Figure 1) have been proposed as a potential extension of existing analytical models which empirically address some of these limitations for use in complex, real-world settings (Zipper et al. 2019b) . An analytical depletion function consists of (i) stream proximity criteria , which identify the streams that may be affected by a well; (ii) a depletion apportionment equation , which calculates how depletion from a single well should be allocated to multiple stream segments meeting the stream proximity criteria; and (iii) an analytical model , which estimates depletion in each stream segment meeting the proximity criteria. Analytical depletion functions have only been subjected to limited testing in a small number of study sites. During the development of the State of Michigan's Water Withdrawal Assessment Tool, Reeves et al. (2009) compared nine different depletion apportionment equations for a small watershed in Michigan and found that an inverse distance-based approach best matched output from a MODFLOW numerical model. Subsequently, Zipper et al. (2018) evaluated five depletion apportionment equations for several real-world stream networks in British Columbia, finding that a new inverse distance-based approach which considers stream geometry (called web squared; Figure 1b) performed the best under steady-state conditions. Zipper et al. (2019b) introduced the concept of stream proximity criteria ( Figure 1a) and performed a sensitivity analysis to 50 combinations of stream proximity criteria, depletion apportionment equation, and analytical model, finding that analytical depletion functions were able to accurately estimate the distribution and magnitude of streams affected by a well. However, Zipper et al. (2019b) compared analytical depletion functions to an archetypal numerical model with several simplifications including a homogeneous subsurface and stream properties. Li et al. (2020) conducted the first comparison of analytical depletion functions to a calibrated numerical model, but this was in unstressed conditions with only a single well pumping at any given time and did not evaluate different analytical depletion function formulations.
As a result, it remains unknown whether analytical depletion functions are suitable tools for heavily-stressed aquifers with significant ongoing pumping, particularly since evidence indicates that cumulative impacts of multiple wells may not be linearly additive (Ahlfeld et al. 2016) , and if so what are the controls over analytical depletion function performance. Thus, the ability of analytical depletion functions to predict streamflow depletion in complex, heterogeneous, and highly-stressed real-world settings where cumulative impacts of multiple wells are occurring simultaneously remains unknown. In this study, we address this knowledge gap by comparing a suite of analytical depletion functions from a complex, calibrated groundwater model of the Republican River Basin (USA) which is currently used for conjunctive water management (RRCA 2003) . Specifically, we ask: (1) Can analytical depletion functions accurately estimate the spatial distribution and magnitude of streamflow depletion associated with a single well in a complex unconfined aquifer with ongoing pumping? (2) How does analytical depletion function performance vary as a function of formulation, hydrogeological properties, stream properties, landscape attributes, and time of year?

Republican River Compact Administration groundwater model
The Republican River Watershed drains approximately 64,500 km 2 (24,900 mi 2 ) of the US High Plains, flowing through Colorado, Nebraska, and Kansas (RRCA 2003) . There is significant irrigated agriculture within the watershed (Deines et al. 2017(Deines et al. , 2019 and, as a result, the surface and groundwater resources in the Republican River Watershed and surrounding High Plains Aquifer are heavily stressed (Haacker et al. 2016;Butler et al. 2018) . To allocate and manage the water of the Republican River, the three states entered into the Republican River Compact in 1942 (Khan and Brown 2019) . Following a US Supreme Court decision in 2002, representatives from Kansas, Nebraska, Colorado, the US Bureau of Reclamation, and the US Geological Survey collaboratively constructed the Republican River Compact Administration (RRCA) groundwater model as a tool for quantifying streamflow depletion caused by groundwater pumping. The model is updated annually to guide water allocations among the three states.
The RRCA model spans an area larger than the Republican River watershed and is bounded by the Platte River in the north and outcrops of the High Plains Aquifer on the east, west, and south. The model is constructed using MODFLOW-2000(Harbaugh et al. 2000 covering a total active domain of 77,868 km 2 (30,065 mi 2 ), which is discretized into 2.6 km 2 (1 mi 2 ) grid cells. The model is updated annually with each year's estimated pumping data submitted by each state. Here, we use version 12p, which is the version originally released that spans the period 1918-2000 at a monthly timestep. While the High Plains Aquifer in this region is unconfined, the RRCA model simulates the groundwater flow system as confined in order to use a constant transmissivity value and improve model stability. Therefore, the model represents aquifer storage properties as specific storage, which is calculated for each grid cell as the estimated specific yield divided by saturated thickness in each grid cell (RRCA 2005) .
The model represents surface water features using a combination of three MODFLOW packages ( Figure 2; Figure S1): the stream package (STR), which is used for the Republican River and tributaries and allows stream cells to dry in response to pumping; constant head boundaries (CHB), which are used for the Platte River at the north edge and the eastern edge of the model; and the drain package (DRN), which represent springs and are primarily along the southeastern portion of the domain. There are a total of 9372 pumping wells in the domain, each of which has a monthly pumping schedule which primarily occurs during the June-September growing season ( Figure S2; Figure S3). The model was calibrated via comparison to groundwater level and baseflow data for the historical pumped period, with hydraulic conductivity and precipitation recharge rates as the primary calibration parameters. The model is described in detail in RRCA (2003) and all model files are available on the Republican River Compact Administration website ( https://www.republicanrivercompact.org/ ).

Analytical depletion functions
Analytical depletion functions are described in detail in Zipper et al. (2019b), so only a brief overview is presented here. Analytical depletion functions consist of three components: (i) stream proximity criteria, which identify stream segments that may be depleted by a well; (ii) a depletion apportionment equation, which calculates the fraction of a well's total depletion, f i , that is sourced from each stream segment, i , meeting the stream proximity criteria; and (iii) an analytical model, which calculates the streamflow depletion, Qa i , caused by a well in each stream segment, i, meeting the stream proximity criteria. The estimated volumetric streamflow depletion in each segment, Qs i , is then calculated as, Qs i = f i * Qa i . Numerous options exist for each component, which were systematically evaluated in Zipper et al. (2019b) via comparison to an uncalibrated numerical model of the Navarro River Watershed (California, USA). Zipper et al. (2019b) found that analytical depletion function performance was most sensitive to the choice of depletion apportionment equation, secondarily sensitive to the choice of stream proximity criteria, and relatively insensitive to the choice of analytical model.
In this study, we will compare a subset of 8 of the 50 combinations evaluated by Zipper et al. (2019b). Since Zipper et al. (2019b) found the greatest sensitivity of model performance to the choice of depletion apportionment equation, we compared four unique but related depletion apportionment equations here: web and web squared, which were the two best-performing equations for the Navarro River Watershed, CA (Zipper et al. 2019b) ; and inverse distance and inverse distance squared, the former of which was the best-performing equation for the Kalamazoo Valley, MI (Reeves et al. 2009) . All four depletion apportionment equations can be expressed as: The result, f i is the fraction of total depletion occurring in stream segment i.
Required inputs include d , the distance from the well to the stream segment; w, a weighting factor that is equal to 1 for inverse distance and web and equal to 2 for inverse distance squared and web squared; n is the total number of affected stream segments identified by the stream proximity criteria; P is the number of points each stream segment is divided into. The inverse distance and inverse distance squared methods are simplified formulations of Eq. 1 where only the closest point to the well on each stream segment is used and therefore P = 1.
Since stream proximity criteria were of secondary importance, we compared two stream proximity criteria: adjacent catchments only, which was used by Reeves et al. (2009), and adjacent+expanding, which Zipper et al. (2019b) found worked best in the Navarro River Watershed. The adjacent+expanding methods includes both adjacent catchments to the well and any catchments in which the estimated streamflow depletion would be greater than or equal to 1% of the pumping rate at a given timestep. As a result, the number of stream segments meeting the stream proximity criteria increases over time.
We tested a single, relatively simple Glover and Balmer (1954) analytical model (herein referred to as the Glover model; Eq. 2):

{Eq. 2}
In Eq. 2, Qw is the pumping rate of the well; S is the storativity, typically specific yield or specific storage; T is the transmissivity; and t is the time since pumping began. To account for monthly variation in Qw , we used standard superposition techniques to turn the wells on and off (Jenkins 1968) . Since Zipper et al. (2019b) found minimal sensitivity to the choice of analytical model and the RRCA MODFLOW model represents surface water using a mixture of packages, some of which do not include streambed conductance as an input which is required for more complex analytical models such as Hunt (1999) , we did not test multiple analytical models in this study. However, many analytical models exist with different formulations (reviewed in Huang et al. 2018) and in other hydrogeological settings, different analytical models may be appropriate.

Analytical depletion function performance evaluation 2.3.1 Selecting pumping well sample
The goal of our study is to examine the performance of analytical depletion functions over a range of hydrogeological and physiographic characteristics. Therefore, we selected a subset of 166 wells (of 9372 total pumping wells in the domain) based on the following characteristics: (i) the mean pumping rate, (ii) pre-development water table depth from MODFLOW steady-state output, (iii) the transmissivity of the MODFLOW cell containing the well (log-transformed), (iv) the storativity of the MODFLOW cell containing the well, (v) the distance from the well to the closest surface water feature (active STR, DRN, or CHB grid cell), and (vi) the distance from the well to the closest cell with potential phreatophytic ET (active EVT cell). The distribution of these properties among our experimental sample is shown in Figure 2 and the distribution among all wells is shown in Figure S4. To span each of these characteristics as uniformly as possible, we used the Latin Hypercube Sampling method (McKay et al. 1979) to randomly sample 250 parameter combinations spanning these six characteristics (Zipper et al. 2018b) . We then selected the pumping well that had the shortest euclidean distance in parameter space to each parameter sample, resulting in a total of 166 unique pumping wells in our final evaluation because some wells were closest to multiple samples.

Calculating streamflow depletion in MODFLOW model
To calculate the streamflow depletion associated with each of these pumping wells in the MODFLOW model, we first ran a baseline RRCA simulation to calculate the stream-aquifer flux under baseline conditions for each cell containing a surface water feature. The cell-resolution stream-aquifer fluxes were then aggregated to net stream-aquifer fluxes for each stream segment, which were defined (for the STR package) or manually delineated based on stream network geometry (for the DRN and CHB packages). Since many of the DRN cells were discontinuous ( Figure 2) and represent springs or seeps rather than traditional stream channels, each DRN segment was delineated as a cluster of nearby DRN cells. To evaluate the sensitivity of the calculated performance metrics to the inclusion of these features, we performed analytical depletion function calculations both including and excluding DRN cells. We then turned off each of the selected wells one-at-a-time for 166 unique numerical experiments (which we refer to as 'pumping simulations' herein) and calculated the net stream-aquifer flux for each stream segments in each pumping simulation. The decrease in stream-aquifer flux in the baseline simulation relative to a pumping simulation is equal to the streamflow depletion caused by that pumping well (and potential numerical model error), and therefore we can calculate streamflow depletion in each segment and at each stress period for each of our 166 pumping wells. We automated MODFLOW runs using the FloPy package for Python (Bakker et al. 2016(Bakker et al. , 2018 . Since changing groundwater model boundary conditions (such as pumping rates) can impact model convergence and stability, we screened MODFLOW output for anomalous results prior to comparison. We used two approaches to screen for anomalous results in all MODFLOW simulations, each of which corresponds to a single pumping well. First, we identified additive outliers in the timeseries of the difference in overall mass balance error between the pumping simulation and the baseline simulation ( Chen and Liu 1993;López-de-Lacalle 2019) . Second, we identified any stress period where MODFLOW estimated negative streamflow depletion either in an individual segment or summed across all segments exceeding 1% of the maximum pumping rate for that well. For both of these comparisons, we identified the stress periods at which anomalous MODFLOW results occurred and limited our comparison to the time period between the onset of pumping and the stress period prior to the first anomalous MODFLOW mass balance change ( Figure S5). Mass balance outliers indicating convergence issues were identified for 31 of the 166 pumping experiments tested, occurring as early as the 499th stress period and as late as the final (996th) stress period and stress periods with anomalous results from these 31 pumping simulations were removed from analysis as described above. For all other pumping experiments, our comparison was conducted between the onset of pumping and the end of the model simulation period (December 2000)

Calculating streamflow depletion with analytical depletion functions
Input for the analytical depletion functions (the formulations of which are described in Section 2.1) was extracted directly from the RRCA MODFLOW model, described in Section 2.1. Using consistent parameters between the two approaches was intended to focus our comparison on the impact of the simplifications of analytical depletion functions relative to the numerical model on predicted streamflow depletion, since we do not do a direct comparison to field observations of streamflow depletion. The pumping rate ( Qw ) was extracted from the MODFLOW WEL package ( Figure S3), with each grid cell representing a unique pumping well. The well-stream geometry ( d ) was extracted as the distance from each well to each grid cell with a STR, CHB, and DRN cell, which were grouped into segments as described in Section 2.1. Due to the discretization of the MODFLOW model, the point spacing for the discretization of segments in the web and web squared depletion apportionment equations was 2.6 km 2 (1 mi 2 ). Effective transmissivity ( T ) and storativity ( S ) were calculated as the average T and S for all MODFLOW cells intersected by a straight line between the well and each stream segment. For S , both specific yield and specific storage were used to test the sensitivity of the comparison to the storage parameter. The output from the analytical depletion functions is the estimated streamflow depletion in each stream segment caused by each well throughout the entire RRCA simulation period.

Comparison between analytical depletion functions and MODFLOW
To systematically assess different aspects of analytical depletion function performance, we calculated four fit metrics which are described below. To assess the drivers of analytical depletion function performance, we conducted a regional sensitivity analysis for each of the fit metrics in response to each of the well and landscape characteristics used to define the well sample ( Figure 2; Section 2.3.1). The regional sensitivity analysis is meant to identify conditions under which the performance of analytical depletion functions exceeds a defined performance threshold (Spear and Hornberger 1980;Wagener et al. 2001;Pianosi et al. 2016) , which we set separately for each of the four fit metrics: 1. Spatial distribution of primary impact: The percentage of pumping simulations in which the stream segment most affected by groundwater pumping matched between the analytical depletion functions and MODFLOW model. For regional sensitivity analysis, a threshold value of 50% was used to separate good (match > 50%) from poor (match < 50%) performance. 2. Magnitude of primary impact: The mean absolute difference (MAD) between the volumetric depletion ( Qs ) in each stream segment predicted by the analytical depletion function and MODFLOW model, normalized to the range of predicted Qs values from MODFLOW. The normalization allows for comparison across a range of depletion conditions -for instance, a difference in predicted Qs of 100 m 3 d -1 is more problematic when actual Qs is 200 m 3 d -1 than when actual Qs is 5000 m 3 d -1 . For regional sensitivity analysis, a threshold value of 0.25 was used to separate good (normalized MAD < 0.25 ) from poor (normalized MAD > 0.25) performance. 3. Spatial distribution of overall impacts: The Kling-Gupta Efficiency (KGE; Gupta et al. 2009;Kling et al. 2012) between the volumetric depletion ( Qs ) in each stream segment predicted by the analytical depletion function and MODFLOW model. A KGE value of 1.0 indicates perfect agreement and lower values indicate worse performance. As a benchmark, KGE > -0.41 indicates better agreement than simply using the mean (Knoben et al. 2019) . For regional sensitivity analysis, a threshold value of -0.41 was used to separate good (KGE > -0.41) from poor (KGE < -0.41) performance.

Magnitude of overall impacts:
The bias between the total streamflow depletion across all segments for an individual well simulated by the analytical depletion function and the MODFLOW model, normalized to the range of predicted total streamflow depletion from MODFLOW. For regional sensitivity analysis, a threshold value of 75% was used to separate good (absolute bias < 75%) from poor (absolute bias > 75%) performance.

Overall performance
Overall, we find a strong agreement between the analytical depletion function and MODFLOW predictions of streamflow depletion. There is substantial variability among analytical depletion function formulations, which is explored in the following section. The best-performing analytical depletion function combined the adjacent+expanding stream proximity criteria, the web squared depletion apportionment equation, and the Glover analytical model, which agrees with the results from a previous comparison in the Navarro River Watershed (Zipper et al. 2019b) .
Performance for this analytical depletion function is strong across our four performance criteria (Table 1). Over the final 20 years of the simulation period, the most-affected stream segment is identified correctly for 53.9% of pumping wells, the MAD of predicted depletion in the most-affected segment is 4.8% of the range in predicted depletion, the KGE of predicted depletion across all segments is 0.779, and the bias for predicted total streamflow depletion is 0.4%. Analytical depletion function performance is significantly better than using a standalone analytical model (without stream proximity criteria or depletion apportionment equations) for all metrics except the identification of the most affected stream segment (Table 1), highlighting the ability of analytical depletion functions to improve predictions for real-world settings compared to a standalone analytical model.
Identification of the most-affected segment is substantially lower than previous work, which was generally >70% correct in the Navarro River Watershed (Zipper et al. 2019b) . Results from the standalone analytical model, which always used the stream segment closest to each well, had the same skill in identifying the most-affected segment (53.9%) which indicates that for 46.1% of wells the most-affected stream segment was not the closest stream segment to the well. This may be driven by the fact that well-stream distances and numerical model discretization are an order of magnitude larger in this domain compared to the Navarro River watershed, and therefore subsurface controls on flow such as spatial heterogeneity in T and S exert a stronger control over the distribution of pumping impacts.
Despite a negligible overall bias, this analytical depletion function tends to have a higher estimate than MODFLOW for both segment-resolution and total streamflow depletion at low magnitudes and a lower estimate at high magnitudes ( Figure 3). The largest differences between MODFLOW and analytical depletion function estimates tend to be driven by a relatively small number of wells which are located near the edge of the domain. The cluster of points in which the analytical depletion function produces substantially higher estimates of depletion compared to MODFLOW are associated with two wells on the southeastern margin of the domain that have several extreme characteristics relative to the overall well sample. These wells are the minimum possible distance from surface water cells given the model discretization (one grid cell, or 1.6 km) yet relatively far from evapotranspiration cells (>45 km, or 85th percentile among well sample), and also have very low transmissivity (~50 m 2 d -1 , or 6th percentile). In contrast, the points in which analytical depletion functions have the largest underestimate relative to MODFLOW are associated with two wells on the northern edge of the domain that are also extremely close to a stream (two grid cells, or 3.2 km) but have a very high transmissivity (~1900 m 2 d -1 , or 95th percentile).
While there is variability in bias among analytical depletion functions (Table 1), there are no analytical depletion function formulations that have substantially larger streamflow depletion predictions at high magnitude ( Figure S6 and Figure S7), indicating that this may be a persistent problem for analytical depletion functions in this setting. This finding is in contrast to the typical assumption that analytical approaches provide conservative estimates of streamflow depletion (Sophocleous et al. 1995;Rathfelder 2016) may be problematic from a water management perspective because underestimating the depletion from the wells with the largest impacts could lead to an overallocation of water resources (Zipper et al. 2018a) . As a result, analytical depletion functions should not be considered a "worst-case" estimate of depletion, but rather a non-biased estimate which may be an overestimate or underestimate depletion relative to other approaches.  Figure 3. Comparison between analytical depletion function and MODFLOW predictions of (a) segment-scale streamflow depletion, (b) total streamflow depletion summed across all segments for a given well. Circles on depletion plot indicate 'higher estimate' and 'lower estimate' points discussed in Section 3.1 text.

Performance response to analytical depletion function formulation and input data
Analytical depletion formulation had relatively little impact on most of the model performance metrics. Comparing among stream proximity criteria, there is little difference between the Adjacent and Adjacent + Expanding stream proximity criteria (Table 1), likely due to the large size of the domain and relatively sparse stream network compared with Zipper et al. (2019b). Comparing among depletion apportionment equations, t he inverse distance and inverse distance squared depletion apportionment equations do not perform the best for any of the fit metrics evaluated (Table 1) indicating that considering stream network geometry with the web and web squared improves performance, though differences between approaches are only a few percentage points. Notably, the web squared depletion apportionment equation performed the best at the spatial distribution of the overall impacts (regardless of stream proximity criteria used), demonstrating its effectiveness at identifying streamflow depletion across a stream network. The superior overall performance of the web squared depletion apportionment equation related to the web equation is due to a negative bias for wells with high amounts of total streamflow depletion (Table 1; Figure S7). This is caused by the increased weight given to near-well stream segments in the web squared approach (Zipper et al. 2018a) .
The storage parameter used in analytical depletion functions has a substantial impact on depletion predictions. While specific yield is typically used to represent storage in unconfined aquifers, the RRCA MODFLOW model uses a specific yield-based approximation of specific storage since the model assumes a constant saturated thickness (details in Section 2.2). Recent work found that accurate estimates of specific yield are critical to obtain accurate estimates of groundwater depletion in the High Plains Aquifer, including parts of the RRCA domain (Butler et al. 2020) . To assess the importance of storage parameterization in the analytical depletion functions, we compared our best-performing analytical depletion function in Table 1 (adjacent + expanding stream proximity criteria and web squared depletion apportionment equation) with both specific yield (as in Table 1) and specific storage as input. Using specific storage instead of specific yield slightly improved identification of the most affected segments (57.9%, compared to 53.9% using specific yield) but led to substantially worse performance on the other three performance metrics. Normalized MAD of depletion for the most affected segment rose to 0.130 with specific storage (compared to 0.048 with specific yield), KGE for depletion of all stream segments declined to -3.40 with specific storage (compared to 0.779 with specific yield), and percent bias for total streamflow depletion increased to 431% with specific storage (compared to 0.4% with specific yield).
The degradation in performance when analytical depletion functions use specific storage as input is primarily driven by a systematic overestimation of depletion relative to the MODFLOW model during the pumping season (Figure 4). This occurs because specific storage values in the RRCA MODFLOW model were defined by dividing specific storage by the saturated thickness, causing them to be 1-2 orders of magnitude lower than specific yield values. The lower specific storage values causes the streamflow depletion to occur much more quickly after the onset of pumping. These results indicate that data collection efforts should prioritize high-accuracy estimates of storage parameters in order to improve accuracy of both streamflow depletion and groundwater depletion predictions. In areas with high-quality water use and water level data, emerging data-driven approaches to estimating storage parameters may be valuable Butler et al. 2016) . Both the magnitude of predicted depletion and the relationship between MODFLOW and analytical depletion functions varied as a function of the boundary condition used in the MODFLOW model. In general, the highest levels of depletion tended to be predicted for the constant head boundary, which runs along the north side of the model domain (Figure 2a), and analytical depletion function estimates of depletion were consistently lower than MODFLOW (Figure 5a). In contrast, predicted depletion from the stream package tended to be more evenly distributed with a mixture of overestimates and underestimates (Figure 5b). Depletion from drain features (which are diffuse boundaries representing springs in this model; Figure 2a) was small, but analytical depletion functions had consistently higher estimates than MODFLOW ( Figure  5c).
The differences in predicted depletion among these boundary conditions are likely driven by a combination of hydrostratigraphy and MODFLOW model design. First, the constant head boundaries are found along the north side of the model domain and this region has the highest estimated transmissivity (Figure 2a) due to more conductive sediments and a greater saturated thickness (RRCA 2003) . These higher conductivity materials, along with the close proximity of some wells to the stream (discussed in Section 3.1), explains why the largest depletion estimates are found for the northern part of the domain along the constant head boundary. Second, MODFLOW uses a streambed conductance term to simulate potential low-conductivity streambed materials for stream and drain features but not for constant head boundaries. This streambed layer is not represented in the Glover analytical model we use in our analytical depletion functions. Use of an analytical model that includes streambed conductance, such as Hunt (1999), may improve agreement for stream and drain boundary conditions, but would cause further underestimation of depletion in constant head boundaries.
Since analytical models are not traditionally designed for use in diffuse discharge features such as the springs represented by the drain package, we also compare analytical depletion function performance with and without drain features ( Figure S8). Removing drains had relatively small impacts on model performance, particularly at high levels of depletion. Overall, removing DRNs meant that estimates of depletion in DRN segments went to 0, and as a result estimates of depletion in some other boundary conditions increased ( Figure S8). Removing DRN features from the analytical depletion functions had mixed impacts on our four performance metrics. When DRNs were excluded from analytical depletion function calculations, the identification of the most affected segments degraded (42.2%, compared to 53.9% when DRNs were included) and total streamflow depletion had a negative bias of -8.4% (compared to 0.4% with DRNs). However, normalized MAD of depletion for the most affected segment improved to 0.043 without DRNs (compared to 0.048 with DRNs) and KGE for depletion of all stream segments rose to 0.811 without DRNs (compared to 0.779 with DRNs).

Well and landscape drivers of performance variability
Using the threshold defined in Section 2.3.4 for each performance metric, we found that 13 wells had good performance for all four metrics, 13 wells had good performance for three metrics, 33 wells had good performance for two metrics, 38 wells had good performance for a single metric, and 23 wells had good performance for no metrics. To investigate the relative importance of each parameter with comparable sample sizes, we compared wells with good performance in at least two fit metrics to wells with good performance in less than two fit metrics ( Figure 6). Differences in the empirical cumulative distribution functions between the two samples for a given parameter indicates a potentially significant impact of that parameter on overall analytical depletion function performance (Wagener et al. 2001;Pianosi et al. 2016) . Comparing across all fit metrics concurrently, the only two parameters that led to significant differences between these two samples were the hydrostratigraphic properties of transmissivity and specific storage. Wells tended to have better performance at intermediate to high values of transmissivity, agreeing with the observed drivers of over-and underestimated depletion ( Figure  3). Specific storage has a strongly skewed relationship in our domain, but performance tended to be better when specific storage was higher. Figure 6. Regional sensitivity analysis results, expressed as empirical cumulative distribution functions for wells with good performance in at least two fit metrics and wells with good performance in less than 2 fit metrics. In shaded panels, distributions of the two samples are expected to be drawn from the same distribution (two-sample Kolmogorov-Smirnov test p > 0.05).
Investigating individual fit metrics, the regional sensitivity analysis found a significant impact of all of the well and landscape properties except pumping rate on at least one of the fit metrics, but no well or landscape property had a significant impact on all fit metrics (Figure 7). For pumping rate, none of the fit metrics differed significantly between wells with a good and poor fit, which supports the long-held assumption that streamflow response to pumping is independent of abstraction rate (Theis 1941;Glover and Balmer 1954;Hunt 1999) . Transmissivity affected the most fit metrics, with intermediate transmissivity values associated with improved prediction of depletion (MAD in the most-affected segment and KGE in all segments) and bias. The difference between the 'good' and 'poor' wells at intermediate values of transmissivity further supports the observation that performance degrades in extremely high and low transmissivity settings near streams (Figure 3), and that variability in hydraulic conductivity is an important control over analytical depletion function accuracy (Sophocleous et al. 1995;Li et al. 2016) . Unlike the overall assessment (Figure 6), looking at individual fit metrics revealed only a minor sensitivity to specific storage for one fit metric, the KGE of segment-resolution streamflow depletion, where good performing wells had a lower specific storage (Figure 7). The specific storage distribution across the domain is strongly skewed (Figure 2b), and the differences indicate that assessing the relative impact and importance of a given parameter depends on the aspect of model performance being considered.
There was an apparent threshold-type response as distance to the closest surface water feature increased; the identification of the most-affected segment and the total streamflow depletion bias both degraded significantly at distances greater than ~20 km. For distance to cells with ET, which is an alternate source of capture in the MODFLOW model that is not considered by the analytical depletion functions, wells that performed poorly for identifying the most-affected segment and MAD of depletion were primarily concentrated at shorter distances, while the MAD of depletion estimates improved at further distances to ET. Since phreatophytic ET is primarily concentrated along stream channels in the MODFLOW model (RRCA 2003) , this indicates a spatial interplay between streamflow and ET capture sources which merits future investigation (Condon and Maxwell 2019) . The water table depth had only a minor influence on the fit metrics, with better depletion predictions at intermediate water table depths (~20-50 m). Figure 7. Regional sensitivity analysis results, expressed as empirical cumulative distribution functions for wells with each performance criteria. In shaded panels, distributions of 'good' and 'poor' groups are expected to be drawn from the same distribution (two-sample Kolmogorov-Smirnov test p > 0.05).

Synthesis with previous analytical depletion function evaluations
This work extends previous evaluations of analytical depletion functions by comparing their output to a calibrated numerical model in a highly stressed aquifer, a setting where analytical depletion functions have not previously been tested. Synthesizing across studies, we find general agreement that the adjacent + expanding stream proximity criteria and the web squared depletion apportionment equation produce the best agreement with numerical model output (Zipper et al. 2018a(Zipper et al. , 2019b . We also found performance was best when wells are close to streams, a finding that is consistent with previous work in coastal California (Zipper et al. 2019b) but in contrast to a study in British Columbia that found better agreement for wells further from streams (Li et al. 2020) . We also extend this previous work by testing performance across 166 wells with a variety of pumping rates and demonstrated that performance is insensitive to pumping rate, indicating that analytical depletion functions are likely to be equally useful regardless of the magnitude of groundwater abstractions.
This study and previous work raise several key questions for further evaluation. First, we identify a potential spatial performance-related interaction between the distance from the well to the closest stream and closest ET cell (Figure 7). Additional testing is necessary to determine the conditions in which phreatophytic ET confounds analytical depletion function estimates of streamflow depletion (Condon and Maxwell 2019) . Second, this and previous evaluations have focused on evaluation of a single well in isolation. While the current study investigates performance in the context of a heavily-stressed aquifer with many pumping wells, we isolated effects of an individual well by turning wells on/off one-at-a-time. While it is widely assumed that the output from analytical models is additive, work in the Republican River Watershed has shown this may not be the case (Schneider et al. 2017) . For application of analytical depletion functions in heavily-stressed aquifers, systematic testing of cumulative impacts by evaluating the impacts of multiple wells concurrently is critical. Finally, recent field investigations found that analytical model performance in an urban setting varied as a function of stream stage (Flores et al. 2020) , and will be important to test analytical depletion functions in a variety of stream stage conditions.

Conclusions
Reliable estimates of streamflow depletion are critical for effective conjunctive management of groundwater and surface water resources. Our systematic evaluation of analytical depletion functions in a heavily-stressed unconfined aquifer found that analytical depletion functions can produce reliable estimates of streamflow depletion during both the pumping and non-pumping seasons. Comparing among eight different analytical depletion functions, we found relatively little sensitivity to analytical depletion function formulation, but a strong response to the input storage parameter, indicating the critical importance of reliable parameter estimates. Among the analytical depletion functions, the best-performing one included time-varying stream proximity criteria and a depletion apportionment equation that accounted for stream network geometry, which is consistent with previous studies. The performance of this analytical depletion function is best for wells within ~20 km of a stream and intermediate values of transmissivity, and is unaffected by pumping rate. These results do not suggest that numerical models should be replaced or superseded by analytical depletion functions, but rather that analytical depletion functions are a useful low-cost, low-effort approach to obtain reliable estimates of streamflow depletion in settings where calibrated numerical models are not available.