Beyond prediction: methods for interpreting complex models of soil variation

Understanding the spatial variation of soil properties is central to many sub-disciplines of soil science. Commonly in soil mapping studies, a soil map is constructed through prediction by a statistical or non-statistical model calibrated with measured values of the soil property and environmental covariates of which maps are available. In recent years, the field has gradually shifted attention towards more complex statistical and algorithmic tools from the field of machine learning. These models are particularly useful for their predictive capabilities and are often more accurate than classical models, but they lack interpretability and their functioning cannot be readily visualized. There is a need to understand how these these models can be used for purposes other than making accurate prediction and whether it is possible to extract information on the relationships among variables found by the models. In this paper we describe and evaluate a set of methods for the interpretation of complex models of soil variation. An overview is presented of how modelindependent methods can serve the purpose of interpreting and visualizing different aspects of the model. We illustrate the methods with the interpretation of two mapping models in a case study mapping topsoil organic carbon in France. We reveal the importance of each driver of soil variation, their interaction, as well as the functional form of the association between environmental covariate and the soil property. Interpretation is also conducted locally for an area and two spatial locations with distinct land use and climate. We show that in all cases important insights can be obtained, both into the overall model functioning and into the decision made by the model for a prediction at a location. This underpins the importance of going beyond accurate prediction in soil mapping studies. Interpretation of mapping models reveal how the predictions are made and can help us formulating hypotheses on the underlying soil processes and mechanisms driving soil variation. This manuscript is a non-peer reviewed preprint that has been submitted for publication. Subsequent versions of this manuscript may have updated content. Feedback and comments are welcomed, feel free to contact the corresponding author: Alexandre Wadoux alexandre.wadoux@sydney.edu.au


83
(c) Obtain covariate importance for the j th covariate by the ratio (f (X * ), y)/ (f (X), y) 84 or the difference (f (X * ), y) − (f (X), y). 85 Permutation of the covariate matrix involves randomness and is usually repeated to obtain a distri-86 bution of the importance metric. Figure 1 shows an example of permutation covariate importance 87 using the ratio of RMSE. The technique can be extended to measure the importance of group of 88 covariates, by permuting the group of covariate simultaneously instead of a single covariate. Calculation of covariate importance with permutation is computationally efficient as it does not 91 require re-calibrating the model at each permutation. In case where covariates are dependent, 92 however, the values obtained by permutation might be misleading and result in incorrect ranking for a grid of x i,S while keeping fixed the values of x i,C (Fig. 2a).
When comparing ICE curves, it is convenient to center the individual ICE curves to a baseline 108 value. The centered ICE curves show the partial dependence of the predicted value at a location 109 to a covariate, expressed in terms of difference to the baseline value. The centered ICE curve is 110 expressed as: where x 0 is the baseline value, usually the minimum, maximum or average of the values in the 112 calibration dataset (Fig. 2b). ICE curves are an intuitive way to explore the effect of covariates to individual spatial locations.

115
ICE curves can further be computed for group of spatial locations within an area, and their average 116 value (i.e. their partial dependence plot, see also Section 2.3) compared to that of another area.

117
This may provide insight into local or regional dependence to a covariate. However, ICE are 118 also calculated from the marginal covariate distribution and are thus they are reliable only when 119 covariates are independent. More information on this is provided in Section 2.3.
In practice the numerical integration required to estimate the marginal distribution of X C is 127 approximated by averaging over the n observation locations: where x 1C , x 2C , . . . , x nC are the row-vectors of X C . Eq. 3 shows that the PDP of the calibration dataset is the average of the n ICE curves. Accordingly, Fig. 2a is the derivative of the prediction function with respect to covariates approximated by a step function over the K intervals: where k j (x j ) is the interval that x j falls into. The right-hand side of Eq. 5 is the difference in 155 prediction computed over the range ]z k−1,j , z k,j ], which quantifies the effect of the covariate for an formula in Eq. 5 is a step function which can be smooth by linear interpolation. The ALE is 160 centered at zero by: An example of one and two-dimensional ALE plot is shown in Fig. 3. Note that interpretation of the two-dimensional ALE plot is different from that of a PDP. ALE 168 is formally interpreted as being the centered difference in prediction (i.e. the effect) when the 169 observations within an interval are moved from one border of the interval to another other. Fig. 3a 170 shows the effect of woody biomass on SOC for a range of values of woody biomass, and compared 171 to the average prediction. Fig. 3b shows the pure interaction effect of woody biomass and elevation 172 on SOC compared to the average prediction. For example, the ALE estimate of woody biomass 173 in Fig. 3a illustrates that for large values of woody biomass (i.e. greater than 300 Mg/ha), the 174 predicted values of SOC are lower by nearly 20 dg/kg compared to the average prediction. The estimates of ALE tend to be more robust than the PDP for correlated covariates, because of 177 averaging and accumulating the local effect over the conditional distribution. However, this comes 178 at the expense of determining of having a more localized interpretation (within intervals), and 179 possibly non-intuitive interpretations for some data-generating processes (Grömping, 2020).

Interaction between covariates
Interaction between covariates can be estimated with the H-statistic (Friedman & Popescu, 2008).

182
Interaction is the variation that remains unexplained after summing the individual effects of the 183 covariates on the model prediction. In other words, there is interaction when the combination of 184 two covariates explains more of the data variance than the sum of these same two covariates taken  two-way interaction, is measured by the H-statistics as: The interaction between a single covariate x j with all combinations of covariates is: The H-statistics is dimensionless and usually between 0 and 1, but can exceed one if the variance    from the original data set) is given by: where |S| is the size of the subset which excludes the jth covariate, S ∪ {j} is the subset S with the   We apply the local and global interpretation methods described in Section 2. We interpret the     valley bottom flatness has, on average, a moderate impact in model prediction (Fig. 9a), but this 328 is more subtle than that (Fig. 9b).  The two-dimensional relationship of SOC with elevation and precipitation seems more complex 342 (Fig. 11) than the one-dimensional figures in Fig. 10. Fig. 11a shows that SOC content generally 343 increases with higher elevation and more precipitation. However, the ALE plot in Fig. 11b has a 344 different pattern: for elevation lower than 250 m, the SOC content increases with precipitation, 345 while an opposite pattern is seen for elevation values larger than 250. In Fig. 11, both plots have a noticeable increasing pattern of SOC with higher precipitation, but only for low relief. Above an 347 elevation of 1000 m, few SOC observations exist, which means that interpretations of effects for 348 this elevation should be cautious.

349
How does SOC prediction depend on interactions among covariates?
350 Figure 12 shows the strength of the interaction between environmental covariates for the RF 351 model. Note that the MLR is not expected to contain an interaction effect between covariates 352 unless explicitly specified. Fig. 12a shows the presence of a strong overall interaction effect in 353 the random forest model. Satellite imageries MODIS red, green and SWIR 2 are involved in 354 interactions with other covariates. Elevation also substantially interacts with other covariates.

355
Covariates standard deviation of monthly solar radiation and soil water content, conversely, have 356 negligible interaction. Fig. 12b  Interaction with elevation b) Figure 12: Estimate of the overall interaction (Eq. 8) between the environmental covariates used in the random forest model (a) and estimate of the two-way interaction (Eq. 7) with elevation (b) for the RF model.
How to summarize the model? rules shown in the right-hand side of Fig. 13

Local interpretation 376
What is the local functional form of the association between environmental covariates and SOC?     to obtain new pedological knowledge. We recommend to use the interpretation methods described 483 in this paper to obtain insights into the pattern found by the model, and then to translate the 484 pattern into the formulation of hypotheses through connection of patterns to possible soil processes.

486
Another option, especially applicable when producing quantitative soil information (i.e. predic-487 tion) is the main objective, is to use interpretation methods to perform a diagnostic on the model.

488
In many soil mapping studies issues of hypothesis generation are not present, so an assessment of 489 potential causalities is not a priority. Often however, the modelling process is made of refining, soil models. Thus, we did not present LIME in this study but we acknowledge that this might be 507 a valuable approach too.

509
The alternative to these model-independent methods is the use of prediction models that are not 510 "black boxes" or interpretation methods that are specific to a model. In many instances sufficient 511 insights into soil processes can be obtained through the rule sets generated by methods that rely 512 on a statistical model. Geostatistical models of soil variation, for example, through the analysis of 513 the variogram and kriging, can be interpreted in terms of the estimated variogram parameters and 514 plausibility of the assumptions, which all give us insights into the nature of soil variation. Notably, 515 geostatistical models are powerful for prediction and provision to address complex non-stationary 516 soil variation exist (e.g. through wavelet transform).

518
Finally, in the Introduction we presented a set of interpretation methods that are specific to a 519 model. These methods are valid and useful for the interpretation of complex models. We refer to 520 Biecek & Burzykowski (2021, Section 1.5) for an overview and to Molnar et al. (2020b, Section 10) 521 for a summary of model-specific methods for interpreting artificial neural networks. Further inves-522 tigations are needed to understand how these methods can be used for the interpretation of soil 523 models.

525
We have presented methods to obtain insights into complex models of soil variation. These methods 526 were reviewed and evaluated in a case study for mapping topsoil organic carbon in France using a 527 large set of environmental covariates as predictor and two complex models. From the results and 528 discussion we draw the following conclusions:

529
• The methods presented in this paper allows one to extract and visualize different aspects of 530 a complex model.

531
• In a case study, we reveal i) the importance of each driver of soil variation, ii) their interaction 532 and iii) the functional form of the association between environmental covariates and the soil 533 property.

534
• Interpretation could also be performed locally, for an area or a spatial location of interest.

535
• The use of Shapley values for interpreting complex models of soil variation is a promising 536 future line of research because it is versatile, enables both local and global interpretation, is 537 easy to interpret and has an underlying theory.

538
• Different methods might produce seemingly similar results. Ample attention should be paid 539 to the conclusions that can effectively be drawn with the interpretation methods.

540
• A number of assumptions underlie the use of the interpretation methods, the most common 541 of which is that of independence between covariates. Deviation from this assumption does 542 not preclude the use of the methods, but results should be interpreted with care.

543
• We presented a summary table as a guide for selecting the interpretation method, given the 544 purpose of the study and the pros and cons of the method. 545 We stress the importance of going beyond prediction in the use of complex statistical or non-546 statistical models. Interpretation of models reveal how the predictions are made and can help us 547 formulating hypotheses on the underlying soil processes and mechanisms driving soil variation.

548
Interpretation methods are also valuable when the production of quantitative soil information 549 (i.e. prediction) is the main interest, to assist model refining and evaluation of model prediction