Three common statistical missteps we make in reservoir characterization

Reservoir characterization analyses resulting from incorrect applications of statistics can be found in the literature, particularly where integration of several disciplines is needed. Here, we examine three misapplications of ordinary least squares linear regression (LSLR), show how they can lead to poor results, and offer better alternatives where available. Using actual and synthetic datasets, we illustrate the effects these errors have on analysis and discuss some implications for machine learning results.


Introduction
Statistical methods are an important tool for many petroleum engineers. While idealized models and their governing equations offer valuable insights into the behavior of petroleum systems, statistics provides methods to analyze real systems and determine whether their behavior comports with our idealizations. The typical university engineering education invests considerable effort in the development of system models and the solution of the resulting differential equations. However, it often devotes relatively little time to the appropriate application and interpretation of statistical methods.
Therefore, it should come as no surprise that the petroleum literature contains a number of misapplications of statistical methods, particularly in using ordinary least squares linear regression (LSLR).
The basics of LSLR are well known, and only the principles relevant to this study need be mentioned. We consider the two-variable case because it is the simplest, and almost all of its principles apply to the multi-variate situation. Assume that we have a series of measured values of characteristics X and Y: (X1, Y1), (X2, Y2), …, (XN, YN). We assign X the role of explanatory (or predictor) variable and Y the role of response (or predicted) variable, so we are seeking a relationship which will give us estimates of Y based on known X values, or "Y on X regression".
If N is large (e.g., hundreds or thousands), there may be no need for LSLR because we have many data to define the relationship (Fig. 1A). For each measured value of X which lies in the interval ui to ui+1, the best estimate of Y is Yi, where Yi is the average value of Y for all the measured values having ui < X < ui+1. Since the arithmetic average is a random variable, Yi may change if we collect more data within the interval ui to ui+1.
Of course, it is common that N is not large. In that case, we often impose engineering judgement (based on experience or physical models) in the form of the X-Y relationship (Fig. 1B). We then use LSLR with the data to estimate values of the relationship parameters. If we specify, for example, a linear X-Y relationship, we are using Eq. 1:

Y = a + bX + e, (Eq. 1)

where a and b are parameters to be estimated and e is a random variable, typically assumed to have a normal distribution with zero mean and fixed variance (written as e ~ N(0, σ²)). The term "linear" in LSLR comes from the way the unknowns a and b appear in Eq. 1. The estimates of a and b are the values that minimize the sum of squared deviations,

S² = Σ (Yi − a − bXi)². (Eq. 2)
Hence, the term "least squares" in LSLR. Equation 2 and the random error term in Eq. 1 imply that the estimated values of a and b will be random variables. This is because, by Eq. 1, the Yi's are random variables, so that S² is a random variable. Thus, minimizing S² by adjusting a and b also makes them random variables.
Another line can be calculated from the (X, Y) data. This is the X on Y line,

X = c + dY + e*, (Eq. 3)

where c and d are estimated by minimizing

S*² = Σ (Xi − c − dYi)². (Eq. 4)

The error term e* serves the same purpose as e, but now X is the quantity associated with error and Y is assumed to be measured without error.
Ignoring the error terms e and e*, we can apply some algebraic manipulation to Eq. 1 to obtain

c = −a/b (Eq. 5a)

and

d = 1/b. (Eq. 5b)

We demonstrate later that these formulas for c and d do not give the same values as minimizing Eq. 4.
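The difference between the two lines is easy to see numerically. The following numpy sketch (synthetic data; the slope, intercept, and noise level are illustrative assumptions) fits both regressions and compares the X on Y slope with the algebraically inverted Y on X slope:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 50
X = rng.uniform(0, 10, N)
Y = 2.0 + 0.5 * X + rng.normal(0, 1.0, N)   # true intercept 2, slope 0.5

# Y on X regression: minimizes vertical (Y) deviations
b, a = np.polyfit(X, Y, 1)                  # Y ~ a + b*X

# X on Y regression: minimizes horizontal (X) deviations
d, c = np.polyfit(Y, X, 1)                  # X ~ c + d*Y

# Algebraic inversion of the Y on X line gives different parameters
d_alg = 1.0 / b
c_alg = -a / b
# d < d_alg whenever the data are noisy, because the product of the
# two regression slopes equals the squared correlation, which is < 1
```

The gap between d and d_alg grows as the scatter about the line grows, which is the root of Misstep 1 below.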
Finally, we mention the coefficient of determination, R², which is frequently used as a measure of how well the X-Y relationship (with the LSLR-estimated values of a and b) fits the data. R² compares the scatter of Y values about the line to the total variability of Y values in the dataset.
It is defined as

R² = 1 − Σ(Yi − Ŷi)² / Σ(Yi − Y̅)², (Eq. 6a)

where Ŷi is the value predicted by the fitted line, Y̅ is the mean of the Yi, and the summations are for i = 1 to N. When R² = 1, all the data agree with the relationship. In other words, measuring X and using the model fully explains all the variability in Y. If R² = 0, the model explains none of the Y variability.
Sometimes, R values are reported instead of R². Perhaps this is because of the temptation to cite larger values. While R² is a measure of the explanatory ability of X and the model, R values have a different and more complicated interpretation related to the normal distribution. See, for example, Montgomery and Peck (1982, Section 2.10).
The most important statistical assumption for R² in LSLR is that the residuals are independent of each other and identically distributed (often abbreviated IID). Ideally (though this can be relaxed), the error term is e ~ N(0, σe²). Given that the expected value of e² is σe², we can examine Eq. 6a and obtain the following relation between R² and the standard deviation of the error term:

R² = 1 − σe²/σY²,

where σY² is the variance of the response variable Y. We now consider examples in petroleum studies where the above concepts are mis-applied and lead to incorrect interpretations. Our aim is not to bring reproach on the cited authors, reviewers, and editors. We simply want to demonstrate that, even with the best of intentions, these mistakes occur and have consequences for the results.
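The definition in Eq. 6a and its link to the error variance can be checked numerically. This minimal numpy sketch (synthetic data with a known error standard deviation) computes R² both ways:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5000
sigma_e = 1.0
X = rng.normal(0, 2, N)
Y = 1.0 + 0.8 * X + rng.normal(0, sigma_e, N)

b, a = np.polyfit(X, Y, 1)
Y_hat = a + b * X

# Eq. 6a: R^2 = 1 - SSE/SST
sse = np.sum((Y - Y_hat) ** 2)
sst = np.sum((Y - Y.mean()) ** 2)
r2 = 1 - sse / sst

# Large-sample check against the error-variance relation
r2_theory = 1 - sigma_e**2 / np.var(Y)
# r2 and r2_theory agree closely for large N
```

With N this large, the two values agree to within sampling error, illustrating that R² is fixed by the ratio of the error variance to the response variance.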

Misstep 1: Applying algebra to random variables
We take as our example of this problem the use of pore throat size (PTS) equations. Several PTS predictors have been published since the 1980s, with Kolodzie (1980) being among the first to report his and Winland's relationships. These models typically take the form

log(rp) = a log(k) + b log(φ) + c, (Eq. 7)

where rp is the pore throat size (µm) at some stipulated fluid saturation, k is permeability (md), and φ is porosity (percent). In this case, log(k) and log(φ) are explanatory variables and log(rp) is the response variable. The parameters a, b, and c are obtained from LSLR applied to data sets of various sizes. For the Winland relationship, for example, a = 0.588, b = −0.864, and c = 0.732 (Kolodzie, 1980).
Several studies (Comisky et al., 2007, their Eq. 15; Jennings and Lucia, 2003; Ghanbarian et al., 2019) have algebraically rearranged Eq. 7 to predict log(k), instead of using the model

log(k) = d log(rp) + f log(φ) + g, (Eq. 8)

where d, f, and g are estimated using X on Y LSLR. Pittman (1992) uses the correct procedure by separately calculating the Y on X and X on Y lines for rp at 25% mercury saturation with N = 202 data, obtaining the Y on X relation (Eq. 9a) and

log(k) = 1.512 log(rp) + 1.415 log(φ) − 1.221. (Eq. 9b)
(Pittman (1992) used permeabilities uncorrected for gas slippage.) If Pittman had simply applied algebra to Eq. 9a and solved for log(k), as done by Comisky et al. (2007), Jennings and Lucia (2003), and Ghanbarian et al. (2019), he would have obtained a different relation (Eq. 9c). The difference between Eqs. 9b and 9c is plotted in Fig. 3. Equation 9c over-predicts the permeability by less than 20% over the range 0.1 < k < 36 md (Fig. 3B). Outside this interval, the over-prediction can become quite significant, particularly for small permeabilities. The logarithmic presentation in Fig. 3A visually understates this consistent bias.
The case from Pittman (1992) is, of course, just one example but it is valuable because it is based on actual reservoir data. Another illustration using the Winland equation and Monte Carlo modelling can be made to help confirm the impact of the problem.
We begin with a hypothetical poro-perm dataset of size N = 100 (Fig. 4A), with log(rp) specified from Eq. 7 but with N(0, 0.4²) noise added. Figure 4B shows one realization of the 100 generated and is typical of those observed for this type of data, as reported by Kwon and Pickett (1975), who found about 67% of such data fell between the dotted lines for their datasets. While the role-reversal problem with PTS predictors is the example used here, the issue will arise any time a study offers a relationship which later investigators want to re-express. Examples include fluid properties, waterflood recoveries, and well stimulation performance. The re-expression will lead to biased estimates.
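A minimal version of this Monte Carlo experiment can be sketched as follows. The Winland coefficients come from Eq. 7; the porosity and permeability distributions are illustrative assumptions, not those of Fig. 4:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100
log_phi = rng.normal(np.log10(15), 0.1, N)   # porosity around 15 percent
log_k = rng.normal(1.0, 1.0, N)              # permeability spread in md
# Winland coefficients (Eq. 7) plus N(0, 0.4^2) noise
log_rp = 0.588 * log_k - 0.864 * log_phi + 0.732 + rng.normal(0, 0.4, N)

# Y on X: fit log(rp) from log(k) and log(phi)
A = np.column_stack([log_k, log_phi, np.ones(N)])
coef_yx, *_ = np.linalg.lstsq(A, log_rp, rcond=None)

# Correct X on Y: fit log(k) from log(rp) and log(phi)
B = np.column_stack([log_rp, log_phi, np.ones(N)])
coef_xy, *_ = np.linalg.lstsq(B, log_k, rcond=None)

slope_alg = 1.0 / coef_yx[0]   # algebraic inversion of the Y on X slope
# coef_xy[0] < slope_alg: the inverted equation has too steep a slope
# in log(rp), which is the source of the permeability bias
```

Repeating this over many realizations (the loop is omitted for brevity) shows the inverted slope is consistently steeper than the directly regressed one.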
The problems arising from using a Y on X line to predict X values raise the question of whether the X on Y line can be calculated from the Y on X line. To our knowledge, there is no way to do this without the underlying data, except for one case discussed in the statistical literature. Using the Y on X line to predict X values is called the calibration problem or inverse estimation (e.g., Seber, 1977, Sec. 7.2.6; Montgomery and Peck, 1982, Sec. 9.7). This method, however, requires X to have been a controllable variable, i.e., a variable that can be adjusted by the experimenter during the process of measuring Y.

Misstep 2: De-transforming regression relationships
Numerous studies have described the error involved in applying LSLR to equations of the form

log(Y) = a + bX (Eq. 10)

and then de-transforming the relationship to predict Y using the algebraically correct but statistically naïve formula

Y = 10^(a + bX). (Eq. 11)

Those describing the problem include Etnyre (1984), Troutman and Williams (1987), Jensen et al. (2000), and Delfiner (2007). The problem becomes clearer if we include the error term in Eq. 10:

log(Y) = a + bX + e, (Eq. 12)

because this is the model we use with LSLR. Then, on de-transformation, we have

Y = h 10^(a + bX), (Eq. 13)

where h = 10^e is a function of the error variability. The additive error in Eq. 12 becomes multiplicative in Eq. 13.
Another way to explain this effect is shown in Fig. 6A. When log(Y) is used for the response, the LSLR line slope and intercept are estimated to minimize deviations in log(Y) about the line (Eq. 2), giving equal weight to deviations above and below the line. For example, at X = 10 and log(Y) = 0 on the line, a deviation of 1 above the line (∆log(Y) = log(10) − log(1), or ∆Y = 9) has the same weight as a deviation of 1 below the line (∆log(Y) = log(1) − log(0.1), or ∆Y = −0.9). That is, the line slope and intercept are determined by giving equal weight to changes of 9 and 0.9. The result is that the de-transformed line (Eq. 11) underestimates the average value of Y in the linear space (Fig. 6B). This effect has implications for any procedure which evaluates the log-transformed response variable, such as R² values and artificial neural networks.
There are several methods to de-bias (adjust) the line. Etnyre (1984) multiplies the naïve prediction by a factor based on the residual variance S² in log10 space:

Y = 10^(a + bX) exp(2.3² S²/2), (Eq. 14)

where the factor of 2.3² converts the log10 variation to natural-log variation. For the dataset and line in Fig. 6A, the adjusted line is Y = 1.85 × 10^(0.18X − 1.8). That is, the de-biased line predicts Y values that are 85% larger than the naïve de-transformed line. Another approach is described by Wendt et al. (1986), who found such adjustments unsatisfactory for their case study. Craig (1991) used a different line, the reduced major axis line, to characterize the log(Y) on X relationship but does not give a reason for this preference.
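Etnyre's adjustment can be verified on synthetic data. In the following sketch, the slope, intercept, and noise level are illustrative choices, not the values of Fig. 6A:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 2000
X = rng.uniform(0, 10, N)
s = 0.4                                          # noise s.d. in log10 space
logY = 0.18 * X - 1.8 + rng.normal(0, s, N)
Y = 10 ** logY

b, a = np.polyfit(X, np.log10(Y), 1)
S2 = np.mean((np.log10(Y) - (a + b * X)) ** 2)   # residual variance, log10 space

Y_naive = 10 ** (a + b * X)                      # Eq. 11: biased low
Y_debiased = Y_naive * np.exp(0.5 * np.log(10) ** 2 * S2)  # the 2.3^2 factor
# mean(Y / Y_naive) sits well above 1 (the naive line under-predicts),
# while mean(Y / Y_debiased) is close to 1
```

The ratio mean(Y/Y_naive) approaches exp(2.3² S²/2) as N grows, which is exactly the correction factor.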
This problem applies, for example, to pore throat size, net pay, and permeability predictors, including the Winland equation (Eqs. 7, 9a, and 9b). Since we usually plot rp or k on a logarithmic scale, we do not see the bias in the predictions. Unless we have the value of S² for the dataset, however, we cannot apply Eq. 14 to evaluate the adjustment. Roadifer and Scheihing (2011) show how use of the naïve Eq. 11 affects the porosity cutoff for net pay evaluation. Equation 11 will give a larger porosity cutoff than an appropriately de-biased equation.
Studies where Eq. 11 has been used instead of Eqs. 13 and 14 include formation factors (e.g., Holtz and Major, 2004, their Fig. 24), porosity cutoff values (e.g., Worthington and Cosentino, 2005, their Figs. 5-7), permeability prediction from porosity (e.g., Worthington, 2004, his capillary pressure-based permeability predictor), and pore throat size (e.g., Ghanbarian et al., 2019, their Figs. 4-6). For example, Ghanbarian et al. (2019) argue that the Winland-based method of rock typing is inferior to their formation-factor approach (their Fig. 6). However, the Winland lines they show for rp = 0.19 and 0.45 µm (at 35% mercury saturation) are not the lines for those rp values but for larger pore throat sizes. This reduces the difference between Winland and their proposed method.

Misstep 3: Interpreting R 2 incorrectly
An assumption underlying the calculation of R² is that the errors e are independent samples from a normal distribution. This assumption is violated in three cases: A) the response is sampled from measurements that depend on each other, B) the response is sampled from multiple distributions (such as a bi-modal distribution), or C) the response is sampled from a heavy-tailed, outlier-rich, or censored distribution.
Case A often occurs when examining time-series data, which makes R² a poor metric for judging fits of a model to production and drilling data. It can be detected by calculating the correlation between the data and a time-lagged version of themselves (autocorrelation).
A simple example of this occurs when a production analysis fit is provided, as in Fig. 7. Here, production is related to time through the Arps hyperbolic equation

q(t) = qi (1 + bDt)^(−1/b) + e,

where e ~ N(0, 0.01²), b = 1.3, and D = 0.7, typical values observed for the decline of a hydrofractured well. These time-dependent flow rates are then fit to an (incorrect) exponential model of the form q(t) = A exp(−t/τ). We find that the R² of the fit is 0.952, which would suggest a very good fit, but the behaviors of the two relationships are quite different, particularly at the short- and long-time extremes.
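The effect can be reproduced with a short script. For simplicity, this sketch sets the initial rate qi to 1 and fits the exponential model in log space rather than by nonlinear least squares, so the exact R² differs somewhat from the 0.952 quoted above:

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0.05, 5.0, 120)           # time, arbitrary units
b, D = 1.3, 0.7
# Arps hyperbolic decline (initial rate 1) with slight noise
q = (1 + b * D * t) ** (-1 / b) * (1 + rng.normal(0, 0.01, t.size))

# Fit the (wrong) exponential model q = A exp(-t/tau), in log space
slope, intercept = np.polyfit(t, np.log(q), 1)
A, tau = np.exp(intercept), -1.0 / slope
q_hat = A * np.exp(-t / tau)

r2 = 1 - np.sum((q - q_hat) ** 2) / np.sum((q - q.mean()) ** 2)
# r2 is large even though the exponential model badly misses
# the earliest times
```

Despite the high R², the fitted curve under-predicts the earliest rates, which is exactly where forecasts matter most.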
Examples of R² being used improperly to judge the appropriateness of production analysis methods include Gupta et al. (2018), Ren et al. (2019), and many others. In our experience, this misuse of R² does not necessarily produce incorrect comparisons between multiple production forecasting approaches, but it consistently leads to overconfidence in the final model.
A better treatment for time-series analysis is the Newey-West estimator (Newey and West, 1987), with implementations available in R, Matlab, and Python. It estimates the coefficient covariance for models whose residuals are autocorrelated, which can be used to make more honest inferences from the regression.
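The estimator itself is short enough to write out. The following numpy sketch (Bartlett weights; the AR(1) data are illustrative) shows how autocorrelated residuals make the naïve OLS standard errors too optimistic:

```python
import numpy as np

def newey_west_cov(X, resid, lags):
    """Newey-West (HAC) covariance of OLS coefficients, Bartlett weights."""
    xtx_inv = np.linalg.inv(X.T @ X)
    u = X * resid[:, None]              # per-observation score x_t * e_t
    S = u.T @ u                         # lag-0 ("White") term
    for lag in range(1, lags + 1):
        w = 1.0 - lag / (lags + 1.0)    # Bartlett kernel weight
        gamma = u[lag:].T @ u[:-lag]
        S += w * (gamma + gamma.T)
    return xtx_inv @ S @ xtx_inv

rng = np.random.default_rng(5)
n = 500
x = np.zeros(n)
e = np.zeros(n)
for t in range(1, n):                   # AR(1) regressor and AR(1) errors
    x[t] = 0.8 * x[t - 1] + rng.normal()
    e[t] = 0.8 * e[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + e

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

se_nw = np.sqrt(np.diag(newey_west_cov(X, resid, lags=10)))
se_ols = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * resid.var(ddof=2))
# se_nw[1] > se_ols[1]: the naive slope standard error is too small
```

In practice, a library routine (e.g., statsmodels' HAC covariance option in Python) is preferable to hand-rolled code; the sketch is only meant to show what the correction does.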
Case A can also occur when measuring geologic data, such as well logs. Semi-variogram analysis is useful for identifying the persistence length of such measurements, allowing the careful analyst to restrict tests to data separated by more than the range (correlation distance) identified by the semivariogram.
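An empirical semivariogram for a regularly sampled series takes only a few lines. The moving-average "log" below is a synthetic stand-in with a known persistence length of about 10 samples:

```python
import numpy as np

def semivariogram(z, max_lag):
    """Empirical semivariogram of a regularly sampled series:
    gamma(h) = 0.5 * mean((z[i + h] - z[i])^2)."""
    return np.array([0.5 * np.mean((z[h:] - z[:-h]) ** 2)
                     for h in range(1, max_lag + 1)])

rng = np.random.default_rng(6)
noise = rng.normal(0, 1, 2000)
z = np.convolve(noise, np.ones(10) / 10, mode="valid")  # persistence ~10 samples

gamma = semivariogram(z, 30)
# gamma rises with lag and flattens (reaches its sill) near lag 10,
# the persistence length built into the synthetic series
```

The lag at which gamma flattens estimates the correlation distance; samples separated by more than this lag can be treated as approximately independent.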
In Case B, the response is sampled from multiple distributions. This can happen, for example, when there are two flow regimes (such as turbulent and laminar flow), two dominant facies (common in fluvial environments composed of alternating sandstone and mudrock facies), or two similar measurements of the same property (neutron porosity and density porosity or whole core and core plug permeability).
When the response is sampled from two or more distributions, R² is predicated on the ability to predict which distribution is being sampled from and on how different the average response is for each distribution, rather than on how accurate predictions are within each distribution. At the extreme, when the means of the two distributions are several standard deviations apart in both the explanatory and response variables, R² > 0.9 is possible for a model with no explanatory value within either distribution. A synthetic example of calculating R² after sampling from two distributions is given in Fig. 8. Here, X consists of a mixture of X1 and X2, where X1 ~ N(0, 1) and X2 ~ N(6, 1), and Y = 1 + X + 0.5X² + e with e ~ N(0, 3²). A naïve linear model of the form Y = bX + c + e gives R² = 0.91, which appears very good, but the model fails to capture the quadratic relationship between X and Y (Fig. 8A). An examination of the residuals (Fig. 8B) shows strong evidence that they are not normally distributed and that their average varies with X, providing clues that the linear model does not reflect the X-Y relationship correctly, despite the large R² value.
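The Fig. 8 experiment can be reproduced directly (the sample size here is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = np.concatenate([rng.normal(0, 1, n), rng.normal(6, 1, n)])  # two clusters
Y = 1 + X + 0.5 * X**2 + rng.normal(0, 3, X.size)

slope, intercept = np.polyfit(X, Y, 1)
resid = Y - (slope * X + intercept)
r2 = 1 - np.sum(resid**2) / np.sum((Y - Y.mean())**2)
# r2 is large mainly because the line separates the two clusters,
# not because it captures the quadratic X-Y relationship
```

Plotting resid against X (not shown) reveals the tell-tale curved pattern; a quadratic fit reduces the residual sum of squares substantially even though the linear R² already looks excellent.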
An example of this is present in Al-Mudhafar (2017), where R 2 = 0.995 for training data that are strongly bimodal (see his Figure 7), and R 2 = 0.95 for testing data, in spite of little predictive power within each facies (see his Figure 9). Because of this problem, Al-Mudhafar (2017) has essentially built a neural network that predicts facies, rather than permeability.
Case C is the most common and most easily remedied problem for R². LSLR is robust to minor deviations from the normality assumption, but large deviations can create problems. A heteroscedastic error term can also leave the response variable strongly skewed. In these cases, the response variable can be transformed back to an approximately normal distribution through a log transform, a Box-Cox transform, or a Yeo-Johnson transform, which often has the side effect of reducing the heteroscedasticity. Outliers can be identified and, if necessary, removed. Truncated or censored data can also be treated.
When non-normal data are not treated, R² values can be inflated, in the case of a heavy-tailed distribution, or deflated, in the case of truncated data or outliers. In Fig. 9A we show a synthetic example where Y = a exp(bX) + e, with a = 1, b = 1, X ~ N(2, 1), and e ~ N(0, X²), so that the variance changes with X. A naïve fit of the exponential equation to the data gives the estimates a = 1.3 and b = 0.91, with R² = 0.72. Log-transforming the response and then fitting gives a = 0.98, b = 0.99, and R² = 0.90 (Fig. 9B). Using the Yeo-Johnson power transform leads to R² = 0.89.
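The log-transform step of the Fig. 9 experiment reduces to a few lines. Because the additive noise can drive a handful of responses negative, this sketch keeps only the positive responses before taking logs (an assumption of the sketch, not of the original figure):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 1000
X = rng.normal(2, 1, n)
Y = np.exp(X) + rng.normal(0, np.abs(X), n)    # error s.d. grows with X

# Keep the positive responses so the log transform is defined
mask = Y > 0
b, a = np.polyfit(X[mask], np.log(Y[mask]), 1)

resid = np.log(Y[mask]) - (a + b * X[mask])
logY = np.log(Y[mask])
r2 = 1 - np.sum(resid**2) / np.sum((logY - logY.mean())**2)
# b recovers the true exponent of 1, and the log-space residuals are
# far closer to constant-variance than the raw ones
```

The estimated exponent lands near the true value of 1, and the log-space R² is an honest summary because the transformed residuals are roughly homoscedastic.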
There are a few options for identifying heteroscedasticity. The Breusch-Pagan test regresses the squared residuals of a model on the explanatory variables and can help identify heteroscedastic errors. An alternative, also based on regressing functions of the residuals, is the White test (White, 1980). Finally, it is instructive to plot the residuals of every regression against the predictor and response variables, and as a Q-Q plot. Fig. 10 shows a synthetic example where Y = X + e, X ~ N(1, 1), and Y is censored so that all negative values are set to 0. R² = 0.82 for the non-censored case, but here R² = 0.78. In addition, censoring decreases the slope of the best-fit line from 1 to 0.91 and increases the y-intercept from 0 to 0.17.
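The studentized (Koenker) form of the Breusch-Pagan statistic, n times the R² of the auxiliary regression, is simple to compute by hand, as this numpy sketch with deliberately heteroscedastic errors shows:

```python
import numpy as np

def breusch_pagan_lm(X, resid):
    """Breusch-Pagan LM statistic (studentized form): n * R^2 from
    regressing the squared residuals on the explanatory variables."""
    n = X.shape[0]
    u2 = resid ** 2
    beta, *_ = np.linalg.lstsq(X, u2, rcond=None)
    fitted = X @ beta
    r2_aux = 1 - np.sum((u2 - fitted) ** 2) / np.sum((u2 - u2.mean()) ** 2)
    return n * r2_aux

rng = np.random.default_rng(9)
n = 500
x = rng.uniform(1, 5, n)
y = 2 + 3 * x + rng.normal(0, x, n)     # error s.d. grows with x

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
lm = breusch_pagan_lm(X, y - X @ beta)
# lm greatly exceeds the chi-square(1) 5% critical value of 3.84,
# so constant variance is rejected
```

Under homoscedasticity the statistic follows a chi-square distribution with degrees of freedom equal to the number of regressors (here one), so values far above 3.84 flag heteroscedastic errors.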
Censoring is common when reaching the limit of measurement for a system, such as with permeability or pressure measurements. Di and Jensen (2015), for example, identified censoring in Kwon's (1975) mercury injection capillary pressure data and applied trimming to the data, as described by Mosteller and Tukey (1977), to correct for the bias generated by censoring.
An example of predicting a strongly skewed response variable is Ahmadi et al. (2013), who show scatter diagrams of the dependent variable against the predicted response that are heavily right-skewed. This leads to heteroscedastic errors, inflating the R² of the model by overestimating the covariance between the predicted and measured responses.
To avoid misinterpreting the R² of a regression, it is crucial to examine the residuals. If the residuals are A) correlated, B) the result of two populations, or C) heteroscedastic, this will lead to a misleading R² and possibly a biased model.

Discussion
LSLR is a very versatile and powerful method but, as we have seen above, misapplications can lead to erroneous results. Sometimes LSLR alternatives give superior results. For example, as Jensen and Menke (2006) explained in relation to the net pay problem, choosing a porosity cutoff based on a permeability cutoff and an LSLR line leads to erroneous results. As an alternative, they propose two methods, depending on how the net pay estimate is to be used. Both methods avoid regression and estimate cutoff values that minimize errors for the intended application. The results are estimators that can incorporate measurement errors and the differing costs of overlooking pay or including non-pay, considerations not easily accommodated in the LSLR approach.
A very popular discussion of the risks of statistical methods is the still-relevant, best-selling book How to Lie With Statistics (Huff, 1954). Other papers covering statistical mistakes include:
• King (1986), who discussed, among other things, the dangers of treating R² as a general measure of model validity,
• Habibzedah (2013), who covered issues relating to standard errors and making statistical inferences in biological papers, and
• Hurd (1979), who covered the issue of heteroscedasticity in greater detail than is present here, in the context of econometrics.
The economics literature includes countless examples of time-series forecasting, and of repeated measurements of the same quantity being fed into statistical models using random-effect and fixed-effect models.
In the petroleum literature, Wallace et al. (2019) show examples motivating the need for blind testing when building regression models for predicting well performance. The aforementioned Di and Jensen (2015) and Jensen et al. (2000) identify problems and offer solutions in regressions related to petrophysics. Male and Duncan (2020) identify heteroscedasticity in permeability predictions and reduce its effects through performing a Box-Cox transformation on porosity.
If the errors are heteroscedastic, robust regression, weighted least squares regression, and heteroscedasticity-consistent standard errors (HCSE) can ease the problem. Long and Ervin (2000) explain the use of HCSE. The popular statistical programming languages R, Python, and Matlab provide HCSE algorithms for properly interpreting linear models with heteroscedastic errors.
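White's HC0 variant of these standard errors is compact enough to write out. The data below are synthetic, with the error variance deliberately growing with the predictor:

```python
import numpy as np

def hc0_se(X, resid):
    """White's HC0 heteroscedasticity-consistent standard errors."""
    xtx_inv = np.linalg.inv(X.T @ X)
    meat = (X * (resid ** 2)[:, None]).T @ X   # sum_i e_i^2 x_i x_i'
    return np.sqrt(np.diag(xtx_inv @ meat @ xtx_inv))

rng = np.random.default_rng(10)
n = 1000
x = rng.uniform(1, 5, n)
y = 2 + 3 * x + rng.normal(0, x**2, n)         # error s.d. grows as x^2

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

se_robust = hc0_se(X, resid)
se_naive = np.sqrt(np.diag(np.linalg.inv(X.T @ X)) * resid.var(ddof=2))
# se_robust[1] exceeds se_naive[1] because the naive formula assumes
# one error variance for all observations
```

Library implementations (HC0 through HC3 variants are described by Long and Ervin, 2000) should be preferred in production work; this sketch only illustrates the mechanics.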
Some of the preceding critique has implications for machine learning (ML). Algebraic manipulations (Misstep 1) are unlikely to be encountered in ML processes, but biased predictions for log-transformed variables (Misstep 2) and misinterpretations of R² (Misstep 3) are relevant. In particular, R² is often used as a measure of model validity when analyzing the results of neural network training and test datasets (e.g., He et al., 2019), although alternative measures exist (e.g., Schuetter et al., 2015).

Conclusions
This manuscript details three common statistical mistakes in the reservoir characterization literature. These are:
1. Applying algebra to random variables in regression models. The Winland pore throat size equation is a particular example that has been algebraically re-expressed to predict permeability. This leads to a biased predictor which can over-estimate permeability by large amounts, factors of 2 or more, depending on the data set and the value to be predicted. Such overestimation is consistent with studies finding the "Winland equation" over-predicts permeability.
2. De-transforming regression relationships without accounting for the error term. The naïve de-transformation under-predicts the average response and biases quantities derived from it, such as porosity cutoffs for net pay.
3. Interpreting R² incorrectly. Autocorrelated, multi-modal, or heteroscedastic residuals can produce large R² values for models with little predictive value, so the residuals of every regression should be examined.

Figure 3. A. Horizontal-axis values are from the statistically correct method and the vertical axis is the naïve method using algebra. B. Ratio of the two predictors. To obtain these results, we had to assume a φ-k relationship.
Figure 4. A. Porosity (porosity units or percent) and permeability (millidarcies) values for a hypothetical dataset. B. Permeability-to-porosity ratio vs. pore throat size for noisy data. Dotted lines are 2.5 times (upper) and 1/2.5 times (lower) and show scatter similar to Kwon and Pickett (1975).
Figure 10. Censored outputs from a device that cannot measure below a certain limit. The best-fit line has a shallower slope and a larger Y-intercept than would the fit without censoring.

Supplementary data
All analyses come from synthetic data. The authors will make the code used to generate these analyses available at https://github.com/frank1010111/statistical_missteps.