Machine learning for digital soil mapping: Applications, challenges and suggested solutions

The uptake of machine learning (ML) algorithms in digital soil mapping (DSM) is transforming the way soil scientists produce their maps. Within the past two decades, soil scientists have applied ML to a wide range of scenarios, mapping soil properties or classes with various ML algorithms, at spatial scales from the local to the global, and across soil depths. The wide adoption of ML for soil mapping was made possible by the increase in data availability, the ease of accessing environmental spatial data, and the development of software and computational tools to analyse them. In this article, we review the current use of ML in DSM, identify the key challenges and suggest solutions from the existing literature. Interest in the use of ML in DSM is growing. Most studies emphasize prediction and the accuracy of the predicted maps for applications such as baseline production of quantitative soil information. Few studies account for existing soil knowledge in the modelling process or quantify the uncertainty of the predicted maps. Further, we discuss the challenges related to the application of ML for soil mapping and suggest solutions from existing studies in the natural sciences. The challenges are: sampling, resampling, accounting for the spatial information, multivariate mapping, uncertainty analysis, validation, integration of pedological knowledge and interpretation of the models. Overall, the current literature shows few attempts at understanding the underlying soil structure or process using the predicted maps and the ML model, for example by generating hypotheses on mechanistic relationships among variables. In this regard, several additional challenging aspects need to be considered, such as the inclusion of pedological knowledge in the ML algorithm or the interpretability of the calibrated ML model. Tackling these challenges is critical for ML to gain credibility and scientific consistency in soil science.
We conclude that, for future developments, ML could incorporate three core elements: plausibility, interpretability, and explainability, which will prompt soil scientists to couple model prediction with pedological explanation and understanding of the underlying soil processes.


Introduction
In recent years, soil science has witnessed a considerable increase in digital soil mapping activities. This is caused by the convergence of several timely factors which are, among others, a huge demand for quantitative and spatial soil information, the

where S is the soil and the acronym clorpt stands for climate, organisms, relief, parent material and time, respectively. In short, clorpt is a list of variables which, if they are known without error, are likely to explain the soil variation over a region.

Conventionally, spatial prediction of soil has been embedded in the geostatistical framework (Heuvelink & Webster, 2001) in which a sample of a soil property is modelled as a sum of a linear combination of environmental covariates and a spatially autocorrelated (stochastic) residual, and prediction at unobserved locations is made by kriging. Geostatistical models are often used in soil mapping because they have several advantages (Oliver, 1987). First, a statistically sound model is assumed for spatial variation. This enables interpretation of the underlying physical processes conveyed (inferred) by the model. Secondly, spatial autocorrelation is explicitly modelled. This is relevant for environmental variables such as soil which vary from place to place, but exhibit correlation between places. Thirdly, an explicit measure of the uncertainty is associated with the prediction. In many circumstances such as in a decision-making process, the prediction is not the only interest and uncertainty maps are required for the evaluation of the map quality or modelling risk.

Geostatistical mapping of soil has, conversely, several limitations which have only partially been resolved in the current literature. To begin, the residuals are assumed normally distributed, stationary (with constant mean and unit variance) and isotropic. Next, modelling the non-linear relation between a soil property or class and numerous cross-correlated covariates is not straightforward and introduces additional challenges (e.g. many parameters have to be estimated). Finally, geostatistical models are computationally demanding if the sample size and/or the number of prediction locations are large (Cressie & Johannesson, 2008).
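The geostatistical model described above can be written compactly. The block below is a sketch in assumed notation that does not appear in the source: s denotes a location, x_j(s) the j-th environmental covariate, beta_j its regression coefficient, epsilon(s) the spatially autocorrelated residual with covariance function C, and lambda_i(s_0) the kriging weights at a prediction location s_0.

```latex
% Assumed notation, for illustration only (not taken from the source).
\begin{align}
  Z(s) &= \sum_{j=0}^{p} \beta_j \, x_j(s) + \varepsilon(s), \qquad x_0(s) = 1, \\
  \mathrm{E}[\varepsilon(s)] &= 0, \qquad
  \mathrm{Cov}\big(\varepsilon(s), \varepsilon(s + h)\big) = C(\lVert h \rVert), \\
  \hat{Z}(s_0) &= \sum_{j=0}^{p} \hat{\beta}_j \, x_j(s_0)
                 + \sum_{i=1}^{n} \lambda_i(s_0) \, \hat{\varepsilon}(s_i).
\end{align}
```

The first line is the sum of a linear trend in the covariates and a stochastic residual. The second line encodes the stationarity and isotropy assumptions, since the covariance depends only on the separation distance. The third line is the prediction at an unobserved location, in which the fitted residuals at the n sampled locations are interpolated by kriging.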

As an alternative, machine learning (ML) emerged in the 1990s as a tool for spatial prediction and digital soil mapping (Lagacherie, 2008). Machine learning techniques refer to a large class of non-linear data-driven algorithms employed primarily for data mining and pattern recognition purposes, and now frequently used for

WoSIS is a harmonised database of more than 6 million geo-referenced soil records (Batjes et al., 2017). Additionally, numerous spatially exhaustive scorpan covariates are available at global scale for climate (Fick & Hijmans, 2017), elevation (Yamazaki et al., 2017), and parent material (Hartmann & Moosdorf, 2012). Further potential

to predict a set of input values to output values using an error-minimization procedure. Since ML algorithms are not conditioned to follow any statistical assumptions, they often appear more accurate than conventional models. The exact path between input and output is ignored, and may not resemble an actual process described by the existing knowledge. In soil science, the explosion of articles using ML algorithms has made it difficult to see the difference between model fitting and inference and, as a result, between data science and soil science. Research seems to be driven by the technique rather than by the hypothesis to be tested. This seems a poor bet for the advancement of knowledge since "almost invariably the technician's skill is a solution looking for a problem" (Braben, 1985).

Table 1 summarizes some recent case studies of digital soil maps that have been produced using a ML algorithm. There is a large range of case studies, mapping soil properties or classes from the plot (<1 km²) to the global (>10⁷ km²) scale. Most studies in our literature review predict at a local to regional scale.
The mean extent of the study area is 3,900 km², but most (90%) studies consider a study area smaller than 650,000 km² (equivalent to the size of metropolitan France). Few studies map

The sampling design is the spatial location of the sampling units used to calibrate or validate the ML algorithm. Most studies do not specify the sampling design used to generate the observations. It is speculated that the sample originates from multiple sources, e.g. legacy data, expert-based designs, and combination of

1 Plot: 0-1 km²; Local: > 1 to 10⁴ km²; Regional: > 10⁴ to 10⁷ km²; Global: > 10⁷ km².
2 RF: random forest; ANN: artificial neural networks; CNN: convolutional neural networks; GBM: gradient boosting machine; BRT: boosted regression tree; GEP: gene expression programming; QRF: quantile regression forest; avNNet: neural networks using model averaging; ctree: conditional inference trees; evtree: evolutionary algorithm for classification and regression tree; NN: neural networks; GBM: generalized boosted regression; k-NN: k-nearest neighbors; RT: regression tree; SVM: support vector machine; MARS: multivariate adaptive regression splines; SGB: stochastic gradient boosting; CART: classification and regression tree; NSC: nearest shrunken centroids; CT: classification tree; BCT: bagged classification tree; DT: decision tree; LMT: logistic model tree; EGB: extreme gradient boosting.
3 R²: coefficient of determination; R²adj: adjusted coefficient of determination; RMSE: root mean square error; IQR: interquartile range; MAE: mean absolute error; CCC: Lin's concordance correlation coefficient; MSE: mean square error; ME: mean error; MBE: mean bias error; RPIQ: ratio of performance to interquartile distance; NRMSD: normalized root mean squared deviation; MARE: median absolute relative error; NMSE: normalized mean square error; sMAPE: symmetric mean absolute percentage error; SS: skill score; RMSD: minimum root mean square deviation; RPD: residual prediction deviation; SDE: standard deviation of the error; EC: overall ratio; OA: overall accuracy; PA: producer accuracy; UA: user accuracy; AUROC: area under receiver operating characteristic curve; AUC: area under the curve.

tate the identification of a missing spatial process. In some cases, a map of residuals exhibits a clear pattern (e.g. increasing residuals with distance from the river) and might help to generate a new hypothesis or to refine the existing model (see Fig. 3).

[Figure labels: Step 1 [a + b]; Step 2 [b + c]; Step 3 [a + b + c]]

Most studies to date do not provide an estimate of the uncertainty (Table 1)

Fig. 2). This is the result of ML algorithms being very poor predictors when extrapolating to areas of the covariate space that are not covered by the calibration sample.
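The extrapolation problem noted above can be screened for before a map is produced. The following is a minimal sketch, not a method from the reviewed studies: it flags prediction locations whose covariate values fall outside the range seen in the calibration sample. The function names and the covariate names in the usage note are hypothetical, and a per-covariate range check is only a crude univariate screen that ignores combinations of covariates.

```python
def covariate_ranges(calibration):
    """Return {covariate: (min, max)} over the calibration sample.

    `calibration` is a list of dicts mapping covariate name to value
    (a hypothetical data layout chosen for this sketch).
    """
    names = calibration[0].keys()
    return {
        name: (
            min(row[name] for row in calibration),
            max(row[name] for row in calibration),
        )
        for name in names
    }


def outside_calibration_range(location, ranges):
    """List the covariates whose value at `location` lies outside the calibration range."""
    return [
        name
        for name, (low, high) in ranges.items()
        if not low <= location[name] <= high
    ]
```

For example, with a calibration sample spanning elevations of 100 to 450 m, a prediction location at 1200 m would be flagged on elevation, which suggests that the prediction there is an extrapolation and deserves a wider uncertainty or a mask.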

Uncertainty quantification that separates out data and model uncertainties is thus recommended to complete the evaluation of the predicted maps.

shows that this is due to the selected covariates, some having strong spatial autocor-

Figure 3: The recommended framework for digital soil mapping with machine learning. The modeller must first decide whether a legacy soil sample or a new sample is collected. The modeller must also decide whether the objective is a categorical or quantitative map. The recommended framework enables the separation between the variation explained by the pedologically relevant covariates and by the spatial covariates. It is recommended to use a spatial cross-validation strategy for validation, but also for parameter tuning and covariate selection.
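The spatial cross-validation recommended in the Figure 3 caption can take several forms. The sketch below shows one simple variant, block cross-validation, under stated assumptions: sampling locations are given as projected (x, y) coordinates, square blocks of a user-chosen size are dealt to folds in a fixed order, and the calibration mean stands in for the ML model. All function names are hypothetical and none of these choices come from the source.

```python
import math


def spatial_block_folds(coords, block_size, n_folds):
    """Assign a fold to each point so that points in the same square block share a fold."""
    # Identify the grid block that contains each point.
    block_of = [
        (math.floor(x / block_size), math.floor(y / block_size)) for x, y in coords
    ]
    # Deal the distinct blocks to folds in sorted order so the result is reproducible.
    fold_of_block = {
        block: i % n_folds for i, block in enumerate(sorted(set(block_of)))
    }
    return [fold_of_block[block] for block in block_of]


def rmse(observed, predicted):
    """Root mean square error between two equally long sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def spatial_cv_rmse(coords, values, block_size, n_folds):
    """RMSE per held-out fold, with the calibration mean as a trivial stand-in model."""
    folds = spatial_block_folds(coords, block_size, n_folds)
    scores = {}
    for k in sorted(set(folds)):
        train = [v for v, f in zip(values, folds) if f != k]
        test = [v for v, f in zip(values, folds) if f == k]
        if not train:
            continue  # cannot calibrate when every point is held out
        prediction = sum(train) / len(train)
        scores[k] = rmse(test, [prediction] * len(test))
    return scores
```

In practice the stand-in mean would be replaced by the ML model under study, and the same fold assignment could be reused for parameter tuning and covariate selection so that all three steps are evaluated on spatially separated data. The block size is a judgment call; one plausible choice is a size comparable to the range of spatial autocorrelation of the target variable.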
of our knowledge of soil forming processes, is the best option.

to the domain (the human language) requires some attention and further research.

More discussion on this issue is found in Gahegan et al. (2001).

The lowest level is the reality, the unknown real-world soil that one wants to predict. The second level is the dataset that is extracted from the reality. We collect a fraction of the reality, a sample, and link it to exhaustively known environmental covariates. The relationship between the covariates and the sample is learned by a black-box machine learning model (level 3), on top of which comes the interpretation level to extract some knowledge from the structure of the calibrated model. The structure of the model is converted to human-understandable knowledge.
A straightforward way to increase interpretability is to decrease model complexity,

Including other criteria to assess the overall performance of a ML model would certainly be a step towards "conscious" digital soil mapping, and contribute to the uptake of knowledge discovery via machine learning in soil science.

or not, and without any assessment on whether the fitted relationships relate to a real-world soil process.