Additive Model Perturbations Scaled by Physical Tendencies for Use in Ensemble Prediction

Imperfections and uncertainties in forecast models are often represented in ensemble prediction systems by stochastic perturbations of model equations. In this article, we present a new technique to generate model perturbations. The technique is termed Additive Model-uncertainty perturbations scaled by Physical Tendencies (AMPT). The generated perturbations are independent between different model variables and scaled by the local-area-averaged modulus of physical tendency. The previously developed Stochastic Pattern Generator is used to generate space and time-correlated pseudo-random fields. AMPT attempts to address some weak points of the popular model perturbation scheme known as Stochastically Perturbed Parametrization Tendencies (SPPT). Specifically, AMPT can produce non-zero perturbations even at grid points where the physical tendency is zero and avoids perfect correlations in the perturbation fields in the vertical and between different variables. Due to a non-local link from physical tendency to the local perturbation magnitude, AMPT can generate significantly greater perturbations than SPPT without causing instabilities. Relationships between biases and spreads caused by AMPT and SPPT were studied in an ensemble of forecasts. The non-hydrostatic, convection-permitting forecast model COSMO was used. In ensemble prediction experiments, AMPT perturbations led to statistically significant improvements (compared to SPPT) in probabilistic performance scores such as spread-skill relationship, CRPS, Brier Score, and ROC area for near-surface temperature. AMPT had similar but weaker effects on near-surface wind speed and mixed effects on precipitation.


Introduction
Forecasting natural phenomena such as weather cannot be perfect. Knowing the degree of the imperfection (the forecast uncertainty) is always desirable and sometimes vital, for example, in decision making. A rough estimate of forecast uncertainty can be obtained by comparing forecasts with verifying observations and averaging the forecast-minus-observation statistics over time and space. This simple approach often results in useful estimates. However, the 'climatological' estimates can be insufficient if we predict the state of a nonlinear chaotic system like the Earth's atmosphere, where the forecast uncertainty can vary. The uncertainty in meteorological forecasts depends on the weather situation (on the local structure of the atmospheric flow) and on the observational coverage.
To allow for a situation-dependent assessment of the forecast uncertainty, a forecast of forecast uncertainty is needed. In probabilistic terms, we have to predict not just the state of the system (in our case, atmosphere) but also the probability distribution of the unknown true atmospheric state around the forecast. This is a challenging problem. An attempt to explicitly forecast the multi-dimensional probability density is not feasible due to the ultrahigh dimensionality of the problem. The Monte-Carlo based approach known as ensemble prediction, in which the probability distribution of the truth is represented by a small number (tens to hundreds) of points in state space (ensemble members), has proven to be both feasible and useful in predicting major features of the forecast uncertainty.
For ensemble prediction to be successful in predicting the forecast uncertainty, ensemble members need to be pseudo-random draws from a probability distribution which is reasonably close to the conditional probability distribution of the truth given the data available prior to the forecast. The most promising approach here is, arguably, to identify all sources of uncertainty that affect the forecast and then stochastically model each of those 'input' uncertainties individually. The forecast uncertainty is caused by uncertainties in (i) meteorological observations, (ii) data assimilation techniques, (iii) boundary conditions, and (iv) forecast model itself. In this study we are concerned with the uncertainty in the forecast model, often called model error or model uncertainty.

Model uncertainty
A numerical weather prediction model computes the forecast by time stepping. At each time step, the model has on input, typically, the current model state (defined on a spatial grid) and computes the model state at the next time step or, equivalently, computes the forecast tendency, the difference between the next-time-step and the present model states. The model uncertainty is, by definition, the uncertainty in the forecast tendency. Being accumulated and transformed during the time stepping, the uncertainty in the tendency contributes to the uncertainty in the forecast fields.
The model uncertainty is caused by the following imperfections in the forecast model (listed in order of increasing importance). 1. Atmospheric model's partial differential equations normally involve a number of simplifications like the neglect of variations (horizontal and vertical) in the gravitational force or the ideal gas law assumption. The model may also omit some processes like chemical reactions or development and impact of electric charges of hydrometeors.

2.
A classical (i.e., based on laws of physics) meteorological forecast model solves a set of time and space discretized differential equations. The discretization leads to truncation error.
3. Subgrid-scale processes are accounted for in atmospheric models in an approximate manner. On the one hand, these processes cannot be reproduced by the discretized model equations because the model grid is too scarce. On the other hand, subgrid-scale processes do impact grid-scale fields due to nonlinearity of physical laws that govern the evolution of the atmosphere. In forecast models, this impact is assessed using simplified sub-models known as physical parametrization schemes. Simplifications (made in these schemes for computational reasons) along with the inherent uncertainty in the unresolved scales lead to uncertainties/errors in physical parametrizations.
Note that some processes in the atmosphere (such as turbulence, convection, gravity waves) can become increasingly resolved by the model equations just by refining the computational grid, that is, by increasing the model's spatial and temporal resolution. Therefore, these processes can, actually, be regarded as part of truncation error. The respective physical parametrization schemes, thus, attempt to reduce truncation error while introducing their own (presumably, smaller) errors/uncertainties.
Our focus in this research is on uncertainties/errors in physical parametrizations of subgridscale processes.

Error and uncertainty
At each model time step, nonlinear interactions of the subgrid-scale field x SGS with itself and with the grid-scale field x GS produce a spectrum of combination wavenumbers some of which fall within the resolvable (grid-scale) range. This yields a contribution of the subgrid-scales to the (grid-scale) forecast tendency. Having a perfect model, we could compute this contribution, let it be denoted πpx SGS , x GS q. Deterministic physical parametrization schemes, having, by definition, access only to x GS , attempt to assess this impact, producing a physical tendency, Ppx GS q. The uncertainty in all physical parametrization schemes combined is the difference between Ppx GS q and the true physical tendency πpx SGS , x GS q.
The subgrid-scale field x SGS is not explicitly defined in the model, therefore it is unknown and even unknowable to the model (to the extent that x SGS is not determined by x GS ). Therefore we assume that x SGS is a random field with some conditional probability density ppx SGS | x GS q (we condition on x GS because it is available to the model). This density induces the probability density ppπ | x GS q. Then, the ideal (best in the mean square sense) deterministic physical tendency is the conditional expectation of the true tendency given the grid-scale field, P determ ideal px GS q Epπ | x GS q, where E stands for expectation. It is P determ ideal px GS q that a deterministic physical parametrization seeks to approximate by Ppx GS q. In these terms, it is meaningful to call the error in the deterministic physical tendency P. But a deterministic physical tendency is incapable of simulating the variability of π around its conditional mean value Epπ | x GS q, (2) This is the inherent (irreducible) uncertainty in πpx SGS , x GS q given x GS . Then, the full uncertainty is the difference of the error ε P and the irreducible uncertainty δ P : The irreducible uncertainty δ P is especially important in so-called 'gray zones', where a process is partly resolved by the model, meaning that a length scale of x SGS is comparable to the grid spacing. In this case, in each grid cell, the grid-scale impact of the subgrid-scale process is, effectively, a sum of a small (and random) number, ν, of random contributions: π °ν i1 π i . With convection, the contributions π i to the convective tendency are due to individual convective plumes (Plant and Craig, 2008). With boundary-layer turbulence, π i are due to individual turbulent eddies (Kober and Craig, 2016). Assuming, for presentation purposes, that all π i are the same in a grid cell and their number ν follows the Poisson distribution, we readily obtain that | E π|{sd π c E ν, where sd denotes standard deviation. This expression shows that the standard deviation (which measures randomness) and the mean value of the gridscale impact of subgrid-scale processes are comparable to each other in magnitude if E ν 1, which is the case in a gray zone as discussed above. This makes the irreducible uncertainty in π indeed substantial and implies that it needs to be properly accounted for in ensemble prediction.
It is worth remarking that the irreducible uncertainty δ P caused by randomness of π given x GS is an aleatory (truly random) uncertainty, therefore it cannot be called error. Whereas ε P is a kind of systematic error (reducible conditional bias), which is caused by imperfections in the physical parametrization schemes. This latter kind of uncertainty due to lack of knowledge is generically called epistemic. Both aleatory and epistemic uncertainties need to be taken into account in building an ensemble prediction scheme. The ideal random physical tendency (often called the perturbed physical tendency) P ¦ ideal px GS q is a 'possible true π consistent with the grid-scale field x GS ', that is, a random draw from the probability distribution ppπ | x GS q. It is reasonable to anticipate that the aleatory part of the uncertainty, being caused by random subgrid noise, should be associated with small spatial and time scales. On the contrary, the epistemic part of the uncertainty caused by systematic and, likely, flow-dependent deficiencies of the physical schemes, may be characterized by larger spatio-temporal scales.
In an ensemble, representing both kinds of the uncertainty in physical tendency using pseudo-random spatial fields results in model perturbations introduced at each model time step during the forecast. Techniques that add stochasticity on top of existing deterministic physical parametrization schemes or replace the deterministic schemes with stochastic ones are known as 'stochastic physics'.

Existing stochastic physics schemes
The first question is how to represent the aleatory uncertainty in physical tendency in ensemble prediction schemes? We believe that the most sensible way to address this question is to build intrinsically stochastic (rather than traditional deterministic) parametrization schemes.
Research in this direction is underway, see Plant and Craig (2008); Dorrestijn et al. (2013); Sakradzija et al. (2015); Hirt et al. (2019); Machulskaya and Seifert (2019); Clark et al. (2021) and others. The second question is how to represent the epistemic uncertainty in physical tendency produced either by a deterministic or by a stochastic physical parametrization scheme?
We state that, on the one hand, stochastic parametrizations have not yet replaced deterministic ones. So we still need techniques to represent uncertainties in deterministic physical parametrizations. Ad-hoc approaches are in wide use here. On the other hand, stochastic parametrizations are not going to be devoid of epistemic uncertainties either. To represent those uncertainties, we will, most likely, resort to ad-hoc schemes, too. Our focus in this research is on ad-hoc model perturbation schemes.
Currently, the two most popular ad-hoc techniques to represent uncertainties in physical parametrizations are Stochastically Perturbed Parametrization Tendencies (SPPT, Buizza et al., 1999;Leutbecher et al., 2017) and Stochastically Perturbed Parametrizations (SPP, Ollinaho et al., 2017). SPPT generates model perturbations relying on the assumption that the magnitude of the error in physical tendency is proportional to the magnitude of the physical tendency itself. There is also a flavor of SPPT called iSPPT in which tendencies from different physical parametrizations are perturbed independently . We selected SPPT (described in more detail in section 2.1) as a starting point for our development in this study because it attempts to do exactly what is needed to represent uncertainty in physical parametrizations: it perturbs the physical tendency (see above section 1.2).
In SPP, some selected model parameters are made spatio-temporal random fields rather than fixed numbers. Advantages of SPP include reliance on expert knowledge to design parameter perturbation distributions and spatio-temporal patterns, internal consistency of the resulting numerical scheme, conservation properties, and capability of generating significant spread in the ensemble (Lang et al., 2021;McTaggart-Cowan et al., 2022). Disadvantages of SPP are more conceptual. First, it accounts only for parametric uncertainty so that inadequacies in modeling assumptions are not taken into account. Second, it is hard to even suggest how parameter-perturbation probability distributions can be objectively justified. The reason is that parameters may have no counterparts in nature (there is no 'diffusion coefficient' in nature) and even in a high-resolution model used as a proxy to the truth. Besides, both SPPT and SPP can lead to biases, see, e.g., Leutbecher et al. (2017) and Bouttier et al. (2022), respectively.
In this study, we analyze limitations of SPPT and build a technique that attempts to address those limitations. The new scheme termed Additive Model-uncertainty perturbations scaled by Physical Tendencies (AMPT) is tested in numerical experiments with a convective-scale ensemble prediction system.

Methodology
In this section, we review SPPT and introduce a new approach to generation of modeluncertainty perturbations. The new AMPT scheme builds on SPPT and attempts to avoid/relax some of the deficiencies of SPPT. AMPT is applied to both atmosphere and soil.

Background on SPPT and notation
To facilitate the presentation of AMPT, we first outline SPPT . By Ppx, y, ζ, tq pP 1 px, y, ζ, tq, . . . , P n fields px, y, ζ, tqq we denote the vector-valued physical tendency (the net physical tendency, that is, generated by all physical parametrizations combined) at the spatial grid point with the Cartesian horizontal coordinates px, yq, the vertical coordinate ζ, and the forecast time t. Here P i px, y, ζ, tq is the component of Ppx, y, ζ, tq in the i-th model field (variable) X i , and n fields is the number of model fields selected to be perturbed (most often, temperature, winds, and humidity, e.g., Christensen et al. (2017)).
In SPPT, the perturbed physical tendency P ¦ is postulated to be P ¦ px, y, ζ, tq p1 κ ξpx, y, tqq ¤ Ppx, y, ζ, tq, where ξpx, y, tq is a scalar (i.e., common for all physical tendency components) zero-mean and unit-variance spatio-temporal random field and κ the parameter (magnitude multiplier) that controls the magnitude of the perturbation. For stability reasons (to avoid sign reversal of the physical tendency ), ξpx, y, tq is required to be bounded so that the following inequality always holds: 1 κ ξpx, y, tq ¡ 0.
(5) This constraint places an upper limit on the magitude multiplier κ and thus limits the magnitude of perturbations in SPPT.
Here and elsewhere ∆ denotes a perturbation.

Motivation
The following deficiencies of SPPT led us to propose the new approach.
1. In SPPT, perturbations are large (small) when and where the physical parametrizations generate a large (small) physical tendency P. This formulation gives rise to a meaningful scaling of perturbations in situations when the model predicts high or moderate intensity of subgrid-scale processes and produces a large or moderate physical tendency. But it cannot cover situations in which the physical tendency appears to be small or even zero whereas the model uncertainty is, in fact, large. This may occur if, for example, in some grid cell, convection is initiated in nature whilst a convective parametrization fails to be activated (note that in this example switching to iSPPT would not help either). (6) implies that the multivariate perturbation vector ∆Ppx, y, ζ, tq is strictly proportional to the physical tendency vector Ppx, y, ζ, tq. As noted by Leutbecher et al. (2017), this implies that SPPT tacitly assumes that only the magnitude of the vector P is in error and not its direction, which is highly unlikely. In other words, SPPT 'assumes' that the ratios of the physical tendencies in different variables i and j at the same point px, y, ζ, tq of the 4D model grid are error-free. As a consequence, the SPPT perturbations (and thus the assumed model uncertainties) are perfectly (100%) correlated or perfectly (-100%) anticorrelated for any pair of model variables.

Equation
Indeed, from Eq.(6) written component-wise, we have ∆P i κ ξP i , where κ is a constant and ξ has zero mean, E ξ 0, and unit standard deviation, sd ξ 1. This implies that Ep∆P i | P i q 0, so that the covariance between the perturbations in the i-th and the j-th variables (given the unperturbed physical tendencies P i , P j ) is Cov p∆P i , ∆P j | P i , P j q Ep∆P i ∆P j | P i , P j q κ 2 P i P j . Taking into account that sd p∆P i | P i q κ|P i | and sd p∆P j | P j q κ|P j |, the correlation, that is, the covariance normalized by the product of the two standard deviations, becomes Corr p∆P i , ∆P j | P i , P j q ¨1.
3. Similarly (and also noted by Leutbecher et al. (2017)), since the SPPT random pattern ξpx, y, tq does not depend on the vertical coordinate, the SPPT perturbations are perfectly coherent (i.e., perfectly correlated) for all variables at all levels in a vertical column, which again is unrealistic.
4. Moreover, it has appeared that in order for SPPT to give rise to a significant spread in the ensemble, the length and time scales need to be really large even for convective-scale models. E.g., in Maurer et al. (2014), the tuned length and time scales in the 2.2-km resolution COSMO model were as large as 500 km and 6 h, respectively. This implies that the above unphysical¨100% correlation of SPPT perturbations approximately holds for all model variables in huge 4D volumes spanning the whole atmosphere in the vertical, hundreds of kilometers in the horizontal, and hours of forecast time.
With iSPPT , the above points 1-3 pertain to the tendency due to a single parametrization scheme. iSPPT alleviates weak points 4-5, allowing for smaller spatial scales of the random pattern and having the capability to generate somewhat larger spread in the ensemble than SPPT (Wastl et al., 2019;Christensen et al., 2017).

Approach
In AMPT, we propose to address the above deficiencies of SPPT as follows.

Univariate AMPT design
We rely on the SPPT's assumption that the standard deviation of the model-uncertainty perturbation of the field X i is proportional to the modulus of physical tendency, sd p∆X i q κ|P i |.
But we define this dependency to be more general than just point-wise.
Specifically, we propose to generate model-uncertainty perturbations for the field X i in the additive (rather than multiplicative as in SPPT) way and postulate their magnitude to be proportional to an area-averaged (rather than point-wise as in SPPT) |P i |. This allows AMPT to generate non-zero perturbations even at grid points with zero physical tendency -if there are nearby points with non-zero physical tendency. Theoretically, this approach can be justified as follows.
Assume that the unknown true model-uncertainty field ε i psq (where s is the spatio-temporal coordinate vector px, y, ζ, tq) can be modeled as a conditionally Gaussian random field given its unknown standard deviation field σ i psq and a spatial correlation function, which are random fields by themselves. In these settings, SPPT, effectively, estimates σ i psq as κ|P i psq|, see Eq.(6), that is, the only predictor to estimate σ i at the point s in space and time is the physical tendency at the same point. We propose to acknowledge that |P i psq| is a noisy 'observation' of the model-uncertainty standard deviation σ i psq and therefore it is worth looking for other predictors, i.e., other data that can contain information on σ i psq. In AMPT, we hypothesize that these additional relevant data are values of the absolute physical tendency in a vicinity of the spatio-temporal point in question s. Linearly combining these noisy data, we obtain the AMPT's estimate of the unknown σ i psq given the known field |P i p.q|: where W i p., .q is a weighting function and the integral is over the model domain. With the simplifying assumption that the weighting function is homogeneous, W i ps, s I q w i ps ¡ s I q, we rewrite Eq.(7) as is the local-area averaged absolute physical tendency (we will call it scaling physical tendency), and K i psq 1 κ i w i psq is the averaging kernel such that ³ K i psq ds 1. Having estimated σ i psq, we can simulate the model uncertainty ε i psq, which, by assumption, is zero-mean Gaussian given σ i psq. The simulated model uncertainty is then where ξ i psq is a zero-mean and unit-variance Gaussian random field postulated to be stationary in space and time. Its spatio-temporal correlations are discussed below in section 2.3.4. Spatial (and temporal) non-stationarity of the model-uncertainty field ε i comes from variability in the scaling physical tendency P i psq. Non-Gaussianity of the model-uncertainty field comes from randomness of P i psq, see Eq.(10).

Scaling physical tendency
Technically, with the gridded fields, the integral in Eq.(9) is replaced with a sum in which the averaging is performed in the horizontal only: where px q , y r q is the horizontal grid point, i (we recall) labels the model variable, and w qr px, yq are the averaging weights. The latter are specified to be non-zero only inside the averaging area |x q ¡x| A i , |y q ¡y| A i , where A i is the half-size of the averaging area (the averaging length scale) in both x and y directions. For simplicity and due to lack of knowledge on the spatial structure of uncertainties associated with |P i px q , y r , ζ, tq| as 'observations' on σ i px, y, ζ, tq we adopt the simplest design: the weights w qr px, yq are equal to each other within the averaging area and normalized so that for any x, y, we have°q r w qr px, yq 1.
In the context of limited area modeling, if the point px, yq, where P i is evaluated, is near the model's boundary, A i is reduced so that the averaging area is still a square and is within the model domain. If A i is specified greater than the size of the domain, the averaging is performed over the whole domain so that P i px, y, ζ, tq P i pζ, tq is the same for all grid points in the horizontal. This is how we computed the scaling physical tendency for atmospheric temperature and winds and for soil temperature. For less Gaussian fields such as humidity and soil moisture, P i is computed by averaging over a significantly smaller moving window centered at the grid point in question (see section 3 for details).

Multivariate and 3D aspects
First, we allow in AMPT not only for error in the modulus of the vector P but also for errors in the direction of P. We do so by introducing independent driving random fields ξ i for different model variables X i (see Eq. (10)). Though the purely uncorrelated perturbations may seem unphysical, we rely on the model to introduce physically meaningful relationships between the variables during its adaptation to the perturbation we have added. As the magnitude of the perturbation at each model time step is tiny, the adaptation is expected to go smoothly.
Second, to get rid of the perfect coherency of the perturbations in the vertical, we switch from the 3D random pattern ξpx, y, tq in SPPT to the 4D random fields ξ i px, y, ζ, tq in AMPT.
Third, the perturbation-magnitude multiplier κ i is variable-specific in AMPT.

Length scales and perturbation magnitudes
Like in SPPT, we specify ad-hoc spatio-temporal correlation functions, length scales, and time scales for the random fields ξ i . The AMPT's length and time scales are postulated to be multiples of the minimal effectively resolvable features in the respective dimension. Say, in the horizontal, with the nominal resolution (grid spacing) h x , we assume that the effective resolution is h eff (Frehlich and Sharman, 2008) and tune the length scale L i of the field ξ i around the postulated default value 10 h eff x . As for AMPT magnitude multipliers κ i , the non-local dependence of the scaling physical tendency on the modulus of the unperturbed physical tendency (Eqs. (9) and (11)) suggests that κ i do not need to obey the SPPT's strict upper limit on the magnitude multiplier, Eq.(5).
This allows AMPT to cause greater spread than SPPT in the ensemble, if needed.

Further details
The rest of the Methodology section is organized as follows.
Details on AMPT perturbations for specific model fields are given below in section 2.4.
The Stochastic Pattern Generator (SPG, Tsyrulnikov and Gayfulin, 2017) used in this study to generate the 4D pseudo-random fields ξ i is outlined in section 2.5. Section 2.6 explains how SPG fields are mapped from the SPG domain onto the model domain. In section 2.7 we briefly discuss stability, conservation properties of the new scheme, and possible biases due to nonlinearity of the forecast model.

Treatment of specific model fields
In the atmosphere, we experimented with perturbations of the 3D fields of temperature T , pressure p, horizontal wind components u, v, and specific humidity q v . We also tried perturbing cloud ice and cloud water but found that those perturbations had little overall impact, so we abandoned them. In the soil, we perturbed 3D fields of soil temperature and soil moisture.

Atmospheric temperature, pressure, and winds
Independent perturbations of T, u, v are computed following Eq.(10). The pressure perturbation ∆p is computed from ∆T by integrating the hydrostatic equation (in which ∆q v is neglected as a small contribution to a small perturbation).
As for wind perturbations, we note that theoretically, it is 'better' to rely on mutually uncorrelated random stream function and velocity potential perturbation fields -rather than on mutually uncorrelated u and v (i.e., zonal and meridional wind) perturbation fields. The reason is that the former approach allows for isotropic vector-wind perturbations (Monin and Yaglom, 2013, section 12.3), unlike the latter approach. However, in practical terms, we were not able to identify any practically significant flaw in the vector field composed of two independent u and v perturbation fields. For this reason and due to the lack of evidence on the actual structure of model uncertainties, we stick to the simpler formulation of AMPT with mutually independent ∆upx, y, ζ, tq and ∆vpx, y, ζ, tq in this study.

Humidity
The salient difference of humidity q v from T, u, v is that q v has a narrow range of values (from zero to saturation or somewhat higher than saturation). The range of q v is narrow in the sense that it is comparable to the standard deviation of the natural variability in q v . To make sure that the AMPT-perturbed q v is within this range (i.e., from zero to saturation) and does not directly introduce any bias into the model, we modify the above formulation of AMPT. Specifically, we employ a kind of 'perturbation symmetrization' as follows. With the model's specific humidity at some grid point px, y, ζ, tq denoted by q v and the saturated specific humidity q sat , we compute the scaling physical tendency P qv following Eq.(11). Then, we compute a zero-mean Gaussian perturbation with the standard deviation σ qv κ qv P qv as in Eq.(10) and symmetrically truncate it at¨c, where c minpq v , q sat ¡ q v q. Due to the truncation, the perturbed q ¦ v q v ∆q v is, first, within the admissible range from 0 to q sat and second, no bias is directly introduced: which would not be the case if we just truncated q ¦ v at 0 and q sat ).

Soil fields
In the land (soil) model, tendencies of two model fields are perturbed: soil temperature T so and soil moisture, more specifically, soil water content W so per unit area within the soil layer in question. As compared to the atmospheric AMPT, the differences in the treatment of the soil fields are the following.
1. The scaling tendencies P Tso and P Wso are computed by averaging the total (not physical) tendency, assuming that all processes in the soil that contribute to the total tendency are modeled with substantial uncertainty.
2. The averaging in Eq.(11) is performed over land only.
3. The perturbation patterns in the soil, ξ Tso px, y, tq and ξ Wso px, y, tq, used to generate perturbations following Eq.(10) are two-dimensional (not three-dimensional as in the atmosphere).
Whole-domain averaging (i.e., with a large averaging length scale A Tso ) is used to compute the scaling physical tendency P Tso . Local-area averaging (i.e., with the averaging length scale A Wso much smaller than the domain size) is employed to compute the scaling physical tendency P Wso . This choice is motivated by higher variability/non-Gaussianity of soil moisture as compared to soil temperature (not shown).
The perturbed W so is truncated to ensure that the volumetric soil water content η so is between wilting-point η wp and field capacity η fc : where Z 1, 2, . . . , n Z labels the soil layer, ∆ Z is the thickness of the Z-th layer, and n Z is the number of layers.

Initial soil perturbations
In the soil, processes have much longer time scales than in the atmosphere, therefore the role of model perturbations can be revealed only in long-range forecasts or in cycled systems.
Dealing in this study with short-range forecasts without cycling and having an under-dispersive ensemble of initial conditions, we had to develop a generator of initial T so and W so perturbations.
The initial soil temperature perturbation is specified as where κ ini Tso is the external parameter that defines the magnitude of the perturbation at the uppermost level, c Tso ¡ 1 is the vertical-decay external parameter, Z 0, 1, . . . , n Z labels the vertical level (the levels are counted downwards), and ξpx, yq is the 2D SPG pseudo-random field. Note that the perturbation pattern ξ is the same for all vertical levels, whilst the magnitude of the perturbation exponentially decreases downwards.
With the soil moisture W so , the technique (inspired by Schraff et al. (2016)) is to perturb as follows: where κ ini S is the magnitude parameter like κ ini Tso in Eq.(13), c Wso is the vertical-decay parameter, and Z 1, 2, . . . , n Z . If at some grid point the perturbed S appears to lie outside the meaningful range r0, 1s, the perturbation ∆S is truncated accordingly. Using Eqs. (14) and (12), we finally convert the perturbation ∆Spx, y, Zq into the perturbation of W so :

Stochastic pattern generator (SPG)
In this study, we rely on the limited-area Stochastic Pattern Generator (SPG) developed by (Tsyrulnikov and Gayfulin, 2017). It generates independent pseudo random fields ξ i that are used to compute perturbations following Eq.(10). Each ξ i is computed by solving the stochastic pseudo-differential equation ξpt, x, y, zq σαpt, x, y, zq.
Here x, y, z are the SPG-space spatial coordinates, ∆ is the 3D Laplacian, αpt, x, y, zq is the standard spatio-temporal white Gaussian noise, and λ, U, σ are the parameters. λ determines the spatial scale of ξ. Given λ, the characteristic velocity U determines the time scale of ξ.
Given λ and U , the parameter σ determines the variance of ξ and is selected to ensure that 1. The solution ξ satisfies the so-called proportionality of scales property (Tsyroulnikov, 2001). The meaning of this property is the following. For any t, let us expand ξ in Fourier series in space: ξpt, x, y, zq °r ξ mn ptq e ipmx ny zq , where m, n, are the spatial wavenumbers and r ξ mn ptq are the Fourier coefficients. Then Eq.(17) decouples into a series of ordinary stochastic differential equations for the random processes r ξ mn ptq. The proportionality of scales means that for a large total wavenumber K c m 2 n 2 2 , the time scale of the process r ξ mn ptq is proportional to its spatial scale 1{K, a property often possessed by natural spatio-temporal processes.
2. The spatial spectra of ξ should be convergent, that is, the process variance should be finite in the space-continuous case and thus should be bounded above as the spatial resolution increases in the space-discrete case.
The solution to Eq.(17) is a zero-mean and unit-variance homogeneous (stationary in time and space) 4D random field, which has 'nice' spatio-temporal correlations: the shape of the correlation function is the same along any direction in the 4D space (thus, including spatial and temporal correlations). This universal correlation function belongs to the very popular in spatial statistics Matérn class, see (Tsyrulnikov and Gayfulin, 2017) for details.
Since the SPG computational domain is the cube whereas the limited-area-model domain can be approximated by a rectangular parallelepiped, we use an anisotropic Laplacian ∆ ¦ instead of ∆ in Eq. (17): Here γ is the aspect ratio of the model domain in the horizontal and δ controls the length scale along the z axis.
The SPG internal parameters (for the model variable labeled by i) λ i and δ i are computed from the respective horizontal and vertical length scales, L i and H i , which are SPG external parameters (this is done by taking advantage of the known correlation functions of the 4D SPG field in space and time). The time scale is T i L i {U, where U is the above characteristic velocity (another external parameter).
With each realization of the driving white noise α, the solution to Eq. (17)  A technically simpler one, which we adopted in this study, is to update P i not at every time step but less frequently, once per T update P i lead time during the forecast, where T update P i is an external parameter defined below in section 3.
A somewhat more involved but potentially more powerful approach (which we left for future research) is to calculate P i px, y, ζ, tq from the unperturbed (control) model run and then use it to compute AMPT perturbations for all ensemble members. This will completely destroy the harmful positive feedback loop. Strictly speaking, it is this regime in which the AMPT perturbations are truly additive.
An even more promising approach is to perturb fluxes instead of prognostic fields as discussed in the next subsection.

Conservation properties
Neither SPPT nor AMPT preserve balances in conserved quantities. The reason is that in both SPPT and AMPT the model fields are perturbed whereas the fluxes are not. Switching from perturbing state variables to perturbing fluxes (like this is done by Van Ginderachter et al. (2020) for deep convection), including boundary fluxes, would completely solve this problem.
However, a purely flux-based model perturbation scheme aimed to represent uncertainties in multiple physical parametrizations remains to be built and tested, which is beyond the scope of this study.

Biases
It is well known and easy to understand that feeding a nonlinear system with an unbiased signal can lead to a biased system output. Biases can appear even in

Numerical experiments
The program code of AMPT was built into the limited-area non-hydrostatic COSMO model (Baldauf et al., 2011), which has both atmospheric and soil prediction modules. The model was, in turn, embedded in a limited-area ensemble prediction system. The AMPT-generated model perturbations along with an ensemble of initial and lateral-boundary conditions allowed us to compute (and verify) ensemble forecasts. The goal was to assess the effect of AMPT on deterministic and probabilistic forecasts and compare it with the effect of SPPT perturbations.
In this study, we were mostly interested in ensemble predictions of near-surface temperature and precipitation.  The length scale A i of averaging the absolute physical tendency (see section 2.3.2) was specified equal to the length scale L i of the respective SPG random field ξ i for q v and W so .
For T, u, v and T so , the respective A i were selected large enough to ensure the whole-domain averaging of |P i |. For the soil fields T so and W so , the common time scale T so was specified 12 times as large as the atmospheric-temperature time scale T T (so that the time scales of the perturbation fields were, roughly, 1 h in the atmosphere and 12 h in the soil).
In the initial soil perturbation scheme (see section 2.4.4), the magnitude parameter κ ini Tso , was 1 K. The magnitude parameter of the initial soil moisture index perturbation, κ ini S was only 0.01 (larger values led to unrealistically large model tendencies in T so ). The vertical-decay parameters were c Tso 1.75 and c Wso 2.
The mapping from the SPG space to the model space in the vertical was performed using Option 1 described in section 2.6 (for technical reasons we couldn't perform enough numerical experiments with the more physical Option 2 to judge which option is better).
In SPPT, the spatial scale was about 500 km and the time scale was 6 h. The magnitude multiplier (i.e., the standard deviation of the Gaussian random field) was set to 1. The Gaussian distribution was truncated at the value (range) of 0.8. This SPPT setup implied greater perturbations than those explored with the same-resolution COSMO model by Maurer et al. (2014): they used the standard deviations varying from 0.25 to 0.5 and the range from 0.625 to 1. (We opted for stronger SPPT perturbations because otherwise they generated too little spread in the ensemble forecasts.) The supersaturation limiter was off in SPPT. In AMPT, the impact of the supersaturation limiter is discussed below in section 4.1.
Tapering (i.e., gradual reduction) of perturbations (i) in the lower troposphere towards the surface and (ii) in the stratosphere from the tropopause upwards was handled in SPPT as follows. The stratospheric tapering was always active because it is believed that the radiation tendency, which is dominant in the stratosphere, is quite accurate in clear-sky conditions (e.g., Leutbecher et al., 2017). As for the lower-tropospheric tapering, which is intended to prevent instabilities due to inconsistencies of perturbed physical tendencies and unperturbed surface fluxes (Wastl et al., 2019), we found that SPPT worked better without it. Specifically, our experiments showed that, on the one hand, SPPT without tapering was stable in the boundary layer. On the other hand, with tapering, SPPT led to an unacceptably small ensemble spread in the near-surface fields, so we switched off the lower-tropospheric tapering in SPPT. In AMPT, tapering was off everywhere.
Note that the above spatial and time scales employed in AMPT were an order of magnitude less than those in SPPT (50 km vs. 500 km and 1 h vs. 6 h). This is reasonable because in SPPT, the perturbation is multiplicative, i.e., the random field (pattern), ξpx, y, tq, is multiplied by the physical tendency, Ppx, y, ζ, tq, which is a rather small-scale random field itself, so that the perturbation, κ ξpx, y, tq Ppx, y, ζ, tq, is a reasonably small-scale field even if ξpx, y, tq is largescale. If ξpx, y, tq is small-scale in SPPT, the product ξ P becomes too patchy, which reduces the effect of the perturbation on the forecast. To find out if this argument is reasonable, we ran SPPT with the smaller time scale of 1 h and the spatial scales of 50 km and 100 km. The resulting spread in the ensemble forecast was indeed very small, confirming the conclusions of Maurer et al. (2014) (made for a different domain, orography, physiography, etc.) and justifying the choice of the SPPT parameters for our domain (the setup was also consistent with that employed in AROME-EPS by Bouttier et al. (2012)).

Biases and spreads induced by AMPT and SPPT
In preliminary experiments, we found that humidity perturbations did substantially increase ensemble spread but at the expense of introducing a significant bias into the forecast. This led us to explore the impact of humidity perturbations (and the supersaturation limiter) on the spread of ensemble forecasts and on the biases of the ensemble-mean forecast (in terms of precipitation and near-surface temperature). The aim was to decide whether it is worth perturbing humidity in the AMPT scheme at all. The (informal) criterion was twofold: AMPT should generate significantly more spread in the ensemble forecast than SPPT while having the bias-to-spread ratio as low as possible.
In the experiments described in this section, to goal was to isolate the roles of different model perturbations. To this end, we did not change initial and lateral-boundary conditions in ensemble members and switched off all soil perturbations. Additionally, hydrostatically balanced pressure perturbations in AMPT were deactivated. The bias (caused solely by model perturbations) is defined in this section as the domain averaged difference between the ensemble mean and the unperturbed deterministic forecast.
Results are presented in terms of bias-spread scatterplots. Biases (on the x axis) and spreads (on the y axis) are combined for lead times 3,6,9,...,24 h and shown on a single scatterplot.
Ensemble forecasts of near-surface temperature and accumulated total precipitation were examined.

Role of supersaturation limiter
The question was whether the supersaturation limiter leads to biases in the ensemble mean forecast and how it impacts the spread of the ensemble? First, we found that its impact on the near-surface temperature was rather small, so we focused on its impact on precipitation. Figure 2 shows the bias-spread scatterplots for two configurations of AMPT: without and with the supersaturation limiter (satLim). One can see that the bias-to-spread ratio is almost the same for the two configurations whilst the spread is larger without satLim. Therefore, we decided that the supersaturation limiter is not worth being activated in AMPT. So, for the experiments described in the rest of the article, the supersaturation limiter was off in both SPPT and AMPT.

Roles of humidity and temperature perturbations
Here we compare three configurations of AMPT and SPPT in terms of their impacts on the bias and the spread of the ensemble. In the AMPT-TUVQv configuration, T, u, v, q v fields were perturbed. In the AMPT-TUV configuration, only T, u, v fields were perturbed (that is, without humidity model perturbations). In the AMPT-UV configuration, only u, v fields were perturbed.
With AMPT-UV, we found that the spread in the ensemble was somewhat too small. This led us to increase the amplitude multiplier κ in this configuration from its default value 0.75 (see section 3) to 1. Figure 3 shows bias-spread scatterplots for near-surface temperature. The following conclusions can be drawn from these scatterplots. Perturbing humidity on top of T, u, v did not change much neither the biases nor the spreads so that AMPT-TUV and AMPT-TUVQv performed similarly in these experiments. The other pair of schemes, SPPT and AMPT-UV, had significantly lower biases and spreads. In dry conditions (the top left plot), SPPT had smaller both spread and bias-to-spread ratio than AMPT-UV. In wet conditions (the top right plot), SPPT and AMPT-UV performed similarly in terms of both spread and bias-to-spread ratio. In very wet conditions (the bottom plot), it was AMPT-UV that had the smallest bias-to-spread ratio while having nearly the same spread as SPPT. Figures 4 shows bias-spread scatterplots for accumulated precipitation. In both wet (the left plot) and very wet (the right plot) conditions, the differences between the four schemes were more systematic than in Fig. 3. Like in Fig. 3, the magnitudes of precipitation biases and the spreads of SPPT and AMPT-UV were quite similar. Again, like for T 2m , AMPT-TUV and AMPT-TUVQv had significantly greater (than SPPT and AMPT-UV) biases and spreads. But unlike Fig. 3, the bias-to-spread ratio for AMPT-TUV was significantly lower than for AMPT-TUVQv. We summarize findings from Figs. 3 and 4 as follows.
1. The magnitudes of forecast biases and the spreads due to SPPT were similar to those due to AMPT-UV (i.e., AMPT without temperature and humidity perturbations).
2. The addition of temperature perturbations in AMPT (in AMPT-TUV) led to significantly greater spreads but somewhat higher (i.e., worse) bias-to-spread ratios.
3. Perturbing humidity on top of T, u, v (in AMPT-TUVQv) led to a significant growth in spread for precipitation but at the expense of a significant degradation in the bias-to-spread ratio.
Thus, without temperature perturbations in AMPT (i.e., perturbing only u, v), we could not outperform SPPT in terms of spread. But with humidity perturbations in AMPT (i.e., perturbing T, u, v, q v ), forecast biases for precipitation were too large. Therefore, in the ensemble prediction experiments described in the next subsection, AMPT temperature (and winds) perturbations were switched on whereas humidity perturbations were switched off.
5 Testing AMPT in the ensemble prediction system

Setup
Ensemble forecasts were initialized every day at 00 UTC during the two-month period of February-March 2014. As indicated in section 3, initial and boundary conditions were provided by the parent ensemble prediction system COSMO-S14-EPS. The local time was UTC+4h. The results were verified against near-surface observations (about 40 stations) using the VERification System Unified Survey (VERSUS) developed within the COSMO consortium (Gofa et al., 2010). The list of experiments and their basic features are presented in Table 1. In AMPT-SOIL, initial soil perturbations generated following section 2.4.4 were added to the respective fields of the members of the initial-conditions ensemble.

AMPT-SOIL atmospheric and soil AMPT perturbations
Besides the four model perturbation schemes listed in Table 1, we also tested a hybrid of AMPT and SPPT. As AMPT perturbations in different model variables are mutually independent (section 2.3.3) whereas SPPT perturbations are perfectly correlated (section 2.2), we expected that an AMPT/SPPT hybrid could perform better than each of the two schemes separately. Somewhat surprisingly, that did not happen. We could not see any benefit from the hybridization of the two schemes (not shown), so we abandoned the hybrid scheme. We regard the lack of effect from that hybrid as an indication that with very small perturbations introduced at every model time step, it is not worth trying to specify inter-dependencies between model perturbation fields. The model itself seems to be capable of balancing such small disturbances, as we suggested in section 2.3.3.
In this section, we show verification scores for T 2m forecasts averaged over the whole twomonth period. Similar results were obtained for each of the two months separately (not shown).
With precipitation, the only statistically significant effect of AMPT was an increase in ensemble spread. This can be seen in the model-perturbations-only experiments reported above in section 4 (see Fig. 4) and this was observed in the experiments described in this section (not shown). Other verification scores for precipitation forecasts were ambiguous because, on the one hand, the time period considered was rather dry and on the other hand, the observation network was too scarce to detect relatively rare and localized precipitation events (not shown). Figure 5 shows root-mean-square errors (RMSE) of the T 2m ensemble mean forecast (the upper bunch of curves). Each curve corresponds to a model perturbation scheme from the list in Table 1. One can see in Fig. 5 that excluding the initial transient (spinup) period of some 3 h (likely, due to imbalances in initial conditions), the RMSEs had a prominent diurnal cycle with a broad minimum at night and a narrower maximum shortly after midday.

Deterministic verification
To highlight the barely seen in Fig. 5 differences between the curves in the upper bunch,   pRMSE NOPERT ¡ RMSEq{RMSE NOPERT , the higher the better.

Probabilistic verification: reliability and resolution
The lower bunch of curves in Fig. 5 shows ensemble spreads for the four schemes examined.
One can see that the spreads were somewhat too low with both SPPT and AMPT, though AMPT perturbations induced substantially greater spread than SPPT perturbations.
Note that in an ideal ensemble, its members are indistinguishable from the truth. Both members of such an ensemble and the truth can be viewed as independent draws from the same probability distribution. In particular, the expectation of any member equals the expectation of the truth. The standard deviation of any member at some point in space and time (let it be denoted by σ) is the same as the respective standard deviation of the truth. Then, the expected square of the ensemble spread (defined as the square root of the unbiased sample variance) equals σ 2 by construction. At the same time, the expected square of the error in the ensemble mean (the mean-square error, MSE) equals N 1 N σ 2 , where N is the ensemble size (Fortin et al., 2014). Being averaged in space and time, the expected square of the ensemble spread becomes equal to σ 2 . Correspondingly, the averaged in space and time MSE becomes N 1 N σ 2 . Thus, it is clear that for large ensembles, the root-mean-square spread should be very close to RMSE. For an ensemble of size N 10 used in this study, RMSE should, on average, exceed the spread by N 1 N ¡ 1 0.05, which is of course much smaller than the differences between the RMSEs and the spreads seen in Fig. 5.
Another reason why RMSE can be greater than spread is observation error. In complex terrain, the dominant source of error can be representativeness uncertainty, which accounts for the fact that observations can poorly represent grid-cell averaged fields provided by the model.
We did not account for the contribution of observation error to RMSE.
If spread is systematically and significantly different (say, lower, as in our experiments) than RMSE, then the conditional cumulative distribution function of the verification (truth, observation) x v given the ensemble cumulative distribution function F e pxq of the model variable x, i.e., P px v ¤x | F e q (where P stands for probability), is systematically different from F e pxq.
This kind of inconsistency is called lack of reliability of the probabilistic forecast (Toth et al., 2003). In these terms, Fig. 5 demonstrates that AMPT perturbations are capable of significantly improving reliability of ensemble-based probabilistic forecasts as compared to SPPT perturbations. Higher in the atmosphere, the AMPT-induced spread was consistently higher than the spread induced by SPPT, too (not shown).
Note that the variations in the RMSE as functions of lead time seen in Fig. 5 are not accompanied by corresponding variations in the spreads. The reason is that the diurnal-cycle variations in RMSE were largely caused by systematic forecast errors (biases). Indeed, verification results of COSMO model forecasts on different domains presented in (Rieger et al., 2021) show that the magnitude of diurnal variations in the forecast bias is between 1K and 2K, with a broad maximum at night and a narrow minimum in the afternoon. But biases in the model cannot be represented by random zero-mean perturbations. Therefore, with stochastic model perturbations, spread cannot reflect the contribution of bias to RMSE. Either the forecasts are to be debiased or a multi-physics/multi-model ensemble (in which different members have different biases) is to be used, see, e.g., Berner et al. (2015).
A reliable uncalibrated ensemble-based probabilistic forecast tells the user that if the ensemble variance (i.e., spread squared) at some point in space and time equals some number d, then the expected MSE (RMSE squared) also approximately equals d. This is a valuable information provided by an ensemble but reliability alone does not fully characterize the quality of a probabilistic forecast. Indeed, the constant spread equal to the 'climatological' RMSE would imply a perfectly reliable forecast, which is, however, not much useful. The ensemble needs to generate sufficiently different spreads. If it does and if different spreads correspond to sufficiently different expected RMSEs (the property known as resolution), then the ensemble can provide relevant information about variations in RMSE at different sites and dates.
One useful measure of resolution combined with reliability is the continuous ranked probability score (CRPS). Figure 7 displays CRPS for T 2m . One can see that both SPPT and AMPT perturbations did improve CRPS (as compared with NOPERT) while AMPT led to substantially greater improvements, especially with additional soil perturbations. Another popular measure of reliability and resolution is the Brier score, which, in contrast to CRPS, is computed for specific events. If an event is defined as x θ (or x ¡ θ), where x is the continuous model variable in question (say, temperature) and θ is a threshold, then CRPS is the integral of the Brier score over all possible θ (Hersbach, 2000). Therefore, Fig. 7 implies that the integrated over all temperature thresholds Brier score is better with AMPT than with SPPT.
To find out what happens for specific distinct thresholds, we selected several meteorologically relevant events (T 2m ¡5 ¥ C, T 2m ¡ 0 ¥ C, T 2m ¡ 5 ¥ C) and verified the ensemble forecasts using the Brier score. AMPT did outperform SPPT for all these events, we show the results for the most populated event T 2m ¡ 0 ¥ C in Fig. 8.
We also examined the reliability and resolution components of the Brier Score. With the reliability substantially improved, the resolution was barely changed or slightly reduced (i.e., degraded, not shown). As noted by Candille and Talagrand (2005), an increase in spread may lead to degradation in resolution because a larger spread is more akin to climatological spread, which yields no resolution. So, we state that AMPT substantially improved reliability without significantly degrading resolution.

Probabilistic verification: discrimination
The ability of a probabilistic forecast to predict the uncertainty in the forecast on the pointby-point basis can be measured by reliability and resolution (as discussed above) through the conditional distribution of the verification x v given the ensemble cumulative distribution function F e pxq, i.e., ppx v | F e q (where p stands for probability density). The reciprocal and complementary view on the accuracy of the ensemble-based prediction of forecast uncertainty is through ppF e | x v q, i.e., conditional on verification. The capability of a probabilistic forecast to concentrate the probability mass close to a real-world x v (thus, discriminating between different outcomes x v ) is called discrimination (e.g., Wilks, 2011).
Discrimination is commonly assessed using the Relative Operating Characteristic (ROC) defined for a specific meteorological event and, more succinctly, using the area under the ROC curve (ROC area). Figure 9 shows the ROC area for the event T 2m ¡ 0 ¥ C (similar results were obtained for the two other events, T 2m ¡5 ¥ C and T 2m ¡ 5 ¥ C, not shown). From this figure, we draw three conclusions. First, the overall level of the ROC area score was quite high (note that its perfect value is 1 whereas a value less than 0.5 indicates no skill) so that the probabilistic forecasts in question were quite skillful. Second, the AMPT-SOIL scheme was uniformly better than SPPT and NOPERT (just like the other above mentioned probabilistic scores). Third, we note that the ROC area is known to be insensitive to the degree of reliability and cannot be improved by calibration of a probabilistic forecast (e.g., Wilks, 2011). Therefore, the superiority of AMPT-SOIL over SPPT and NOPERT in terms of ROC area implies that AMPT not just inflates spread (which could be done by calibration), it yields 'good spread' in the sense of Eckel and Mass (2005), meaning greater spread when and where it is relevant.

Statistical significance
In contrast to section 4, where the results were obtained by averaging over large samples (the whole spatial grid and two months of data), the ensemble forecasts described above in section 5 were verified against a relatively scarce observation network, which entails a larger statistical uncertainty and therefore requires statistical significance testing. Here, we examine how statistically significant are the observed benefits of AMPT as compared to SPPT in ensemble prediction experiments.

Methodology
Let us test the hypothesis that a verification score ψ is greater for scheme 1 than for scheme 2. More precisely, we are going to compare the scores ψ 1 and ψ 2 that could be obtained from a very large sample of forecasts, whilst having their unbiased estimates p ψ 1 and p ψ 2 obtained from smaller samples. That is, the hypothesis to be tested is H : ψ 1 E p ψ 1 ¡ E p ψ 2 ψ 2 . The default 'null' hypothesis states that there is no difference between the two schemes: H 0 : ψ 1 ψ 2 .
Since atmospheric states are correlated in time and forecast errors depend on the atmospheric flow, forecast errors are also temporally correlated (dependent). Accounting for these time correlations would complicate statistical hypothesis testing, so for simplicity, we employ the following approach. We divide the whole two-month period into n 8 weeks and compute the score ψ for each of the two schemes for each week separately, getting the respective sequences p ψ 1i and p ψ 2i (for i 1, 2, . . . , n) and their differences δ obs i p ψ 1i ¡ p ψ 2i . We treat δ obs i as (observed) realizations of the respective random variables δ i . Assuming that temporal dependencies in δ i decay on a scale shorter than one week, we regard δ i as a sample of statistically independent and identically distributed random numbers. This can be easily done using an approach to statistical inference known as bootstrap. In its simplest flavor, bootstrap postulates that the data distribution (i.e., the distribution of δ i ) equals the empirical distribution (which is concentrated at the observed data, in our case, δ obs i ).
However, this setup cannot be directly applied here because H 0 assumes that E δ i 0 whereas the empirical distribution is, normally, biased (δ obs $ 0). Following Efron and Tibshirani (1994, ch.16), to define the bootstrap distribution, we shift the empirical distribution so that it has zero mean. Then, we proceed as usual, sampling (with replacement) from this discrete distribution (concentrated with equal probabilities at the points ∆ obs i δ obs i ¡δ obs ), compute the mean in each bootstrap sample,δ ¦ , and calculate the fraction of bootstrap samples in whichδ ¦ ¡δ obs . This fraction is the estimate of the p-value. We also use the classical Student's t-test.

Results
Statistical significance of improvements in near-surface temperature forecasts due to AMPT as compared with SPPT was tested for the following scores (averaged over all lead times up to 48 h to reduce statistical uncertainty): spread, CRPS, Brier Score, and ROC area. The significance level was set at α 0.05. The number of bootstrap samples was 10 5 . We tested improvements of AMPT-NOSOIL over SPPT because both schemes involved no soil perturbations and thus are comparable. With the scores defined for specific meteorological events, that is, for Brier Score and ROC area, the most populated event T 2m ¡ 0 ¥ C was examined, again, to avoid excessive uncertainty due to small sample size. Table 2 shows p-values for the above bootstrap test, p bootstrap , and the one-sided Student's t-test, p t¡test . For spread, the advantage of AMPT-NOSOIL over SPPT was very statistically significant. For all 8 weeks and all lead times, the forecast spread with AMPT-NOSOIL was substantially greater than the spread with SPPT. For other scores, AMPT-NOSOIL statistically significantly outperformed SPPT as well. In addition, as it is follows from Figs. 5 (the lower bunch of curves), 7, 8, and 9, AMPT-SOIL was nearly uniformly better than AMPT-NOSOIL.
Therefore the advantage of AMPT-SOIL over SPPT was even more statistically significant than the above superiority of AMPT-NOSOIL over SPPT.

Summary of the ensemble prediction results
For near-surface temperature forecasts, atmospheric AMPT perturbations gave rise to statistically significant improvements (as compared to SPPT) in the performance of the ensemble prediction system -in terms of spread-skill relationship, CRPS, Brier score, and ROC area.
The effects of AMPT on RMSE of the ensemble-mean near-surface temperature forecast as well as impacts on precipitation were mixed. Soil AMPT perturbations imposed in addition to atmospheric AMPT perturbations led to nearly uniform further improvements in the ensemble performance.

Conclusions
A new technique called Additive Model-uncertainty perturbations scaled by Physical Tendency (AMPT) has been proposed. AMPT is a modification of SPPT aimed at addressing some of SPPT's weak points: (i) inability to generate significant perturbations at points where physical tendency is small, (ii) perfect correlation of perturbations between different model variables and between different model levels, (iii) limitations on the magnitude of perturbations.
AMPT employs the Stochastic Pattern Generator (Tsyrulnikov and Gayfulin, 2017) to generate four-dimensional random fields with tunable spatio-temporal correlations (but can be used with any pattern generator). The random fields generated by SPG to perturb different model fields are mutually independent. They are scaled by the area-averaged modulus of physical tendency and added to the model fields at every model time step. AMPT perturbations of three-dimensional atmospheric model fields of temperature, pressure (computed from temperature perturbations via hydrostatics), wind, humidity, and three-dimensional soil fields (temperature and moisture) were imposed and their effects on convection-permitting ensemble forecasts (based on the COSMO forecast model) were examined.
The main findings of the study are the following.
Humidity perturbations generated significant ensemble spread for precipitation but led to a high bias-to-spread ratio. For this reason AMPT was systematically tested without humidity perturbations.
Withholding temperature perturbations led to a substantial reduction in the bias and in the bias-to-spread ratio, but at the expense of a substantial reduction in spread as well.
For this reason, temperature (and pressure) were included in the list of perturbed model fields in the final ensemble prediction experiments.
In ensemble prediction experiments, it was found that AMPT was much more effective in generating spread in the ensemble than SPPT, thus considerably improving reliability of the ensemble. This was the case for the upper-air fields as well as for the near-surface fields.
AMPT was able to reduce the average root-mean-square error (RMSE) of the ensemble mean near-surface temperature forecast w.r.t. the configuration of the ensemble without model perturbations and w.r.t. SPPT, however, for some parts of the diurnal cycle, RMSE was degraded.
Most probabilistic ensemble verification scores (spread-skill relationship, CRPS, Brier Score, and ROC area) for near-surface temperature were improved due to atmospheric AMPT perturbations as compared with SPPT (at the statistical significance level of 0.05).
AMPT perturbations of the soil fields further improved the deterministic and probabilistic verification scores for the near-surface temperature.
Probabilistic verification of precipitation ensemble forecasts gave mixed results, perhaps, due to an insufficient number of precipitating events during the time period examined and scarcity of the observation network.
The technique can be extended along the following directions. First, it looks reasonable to introduce state dependence not only in the magnitude of perturbation fields but also in their spatial and temporal scales. This will require a non-stationary (in space and time) stochastic random field generator and a method to specify a non-stationarity pattern. Second, non-Gaussianity of perturbations may become more important with higher resolution forecast models. Third, in its current formulation, AMPT takes the net physical tendency as input but it can be used at the process level as well, i.e., for each parametrization scheme separately.
Finally, we note that AMPT can be used in data assimilation schemes as well as in ensemble prediction systems: in the atmosphere, in the soil, and also in the ocean.
The Fortran source code of SPG is available from https://github.com/gayfulin/SPG.