Integrating Scientific Knowledge into Machine Learning using Interactive Decision Trees

Decision Trees (DT) are a machine learning method that has been widely used in the geosciences to automatically extract patterns from complex and high-dimensional data. However, like any data-based method, the application of DT is hindered by data limitations and potentially physically unrealistic results. We develop interactive DT (iDT) that put the human in the loop and integrate the power of experts' scientific knowledge with the power of the algorithms to automatically learn patterns from large datasets. We created an open-source Python toolbox that implements the iDT framework. Users can create new composite variables, manually change the variable and threshold to split, manually prune the tree, and group variables based on their physical meaning. We demonstrate with three case studies that iDTs help experts incorporate their knowledge in the DT development, achieving higher interpretability.


Introduction
In the past few decades, our ability to collect, store and access large volumes of earth systems data has increased at unprecedented rates thanks to improved monitoring and sensing techniques (Hart and Martinez, 2006; Butler, 2007). Machine Learning (ML) aims to imitate the way that humans learn (IBM, 2020). The main purpose of ML is to develop algorithms that can learn from historical data and perform tasks (e.g. predictions and classification) on new input data. The capability of ML methods to automatically extract patterns from large volumes of complex and high-dimensional data makes them attractive for such applications (Table 1).

In this paper we focus on Decision Trees (DT) (Breiman et al., 1984), a supervised ML method that is widely used in the geosciences. A DT model is developed through an automatic algorithm that recursively partitions the space of input variables into subspaces using a set of hierarchical decisions. In Figure 1, we show a DT with a schematic representation of the recursive partitioning of the dataset, along with basic terms used in this paper. A DT model is a hierarchical tree structure that comprises nodes and branches. Each node is associated with a logical expression, i.e. a "split", which consists of the variable and threshold to split, e.g. "Xi smaller than X̄i,j". Each node leads to two branches that correspond to the two possible outcomes of the split. The terminal nodes are called leaves and are associated with either a class or a specific value for the output. The paths from root to leaf thus represent a set of classification (or regression) rules for the output. DTs are commonly used for (Flach, 2012):
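The node/branch/leaf anatomy described above can be made concrete with a small sketch in scikit-learn, the DT implementation the paper relies on. The data here is synthetic (a stand-in, not the paper's dataset); the loop walks the fitted tree and prints each node's split "X_i <= threshold" or its leaf class.

```python
# Minimal sketch of a DT's node/branch/leaf structure using scikit-learn.
# Synthetic data: the output class is driven entirely by the first variable.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 2))     # two input variables X0, X1
y = (X[:, 0] < 0.5).astype(int)          # output class driven by X0

dt = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
t = dt.tree_

# Internal nodes store a split "X_i <= threshold"; leaves have no children.
for node in range(t.node_count):
    if t.children_left[node] == -1:      # terminal node (leaf)
        print(f"node {node}: leaf, class {int(np.argmax(t.value[node]))}")
    else:
        print(f"node {node}: split X{t.feature[node]} <= {t.threshold[node]:.3f}")
```

Each root-to-leaf path printed this way is one classification rule, exactly as in Figure 1.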

• Classification: The DT is trained on output data that are categorized under different classes (discrete values or non-numerical categories) and then predicts classes for unseen data. In geosciences applications, the output classes are sometimes obtained by previously grouping continuous variables.

DTs are quite appealing in the geosciences because geophysical processes often reveal a hierarchical structure of controlling variables, and the hierarchical structure of DTs, with nodes, branches and splits, is a straightforward way to capture those significant controlling variables and how they are organized to lead to different outputs.

Integration of human experts in the DT development process, and hence of their domain knowledge and their cognitive ability to formulate hypotheses and theories, may help overcome some of these challenges (Table 1).

In this paper, we propose a framework to develop "interactive Decision Trees" (iDTs) that put human experts in the loop.

2.3 Evaluating DT predictive and interpretive performance

DT evaluation is typically based on statistical metrics of prediction ability (Lipton, 2018). Examples of such metrics include classification accuracy, confusion matrices, precision, recall, accuracy rate, root mean square error, and mean error (Pedregosa et al., 2011). However, in geosciences applications we often would like the DT to be not only a good predictor, but also to be interpretable (Lipton, 2018). Differently from predictive performance, interpretability is a less well-defined concept, and metrics to measure interpretability are not well established (Doshi-Velez and Kim, 2017).
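The predictive metrics listed above are all available in scikit-learn (Pedregosa et al., 2011); a minimal sketch on toy labels (not the paper's data):

```python
# Predictive-performance metrics for a classifier: accuracy, confusion
# matrix, precision and recall, computed on illustrative toy labels.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0]   # observed classes
y_pred = [0, 1, 1, 1, 0, 0]   # classes predicted by a DT

print("accuracy :", accuracy_score(y_true, y_pred))
print("confusion:", confusion_matrix(y_true, y_pred).tolist())
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
```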

A widely used proxy for interpretability is the complexity of the tree, as it can be reasonably assumed that a less complex tree is easier to interpret (Molnar, 2020; Lipton, 2018). The complexity of a DT can be easily quantified through the number of leaf nodes and/or the depth of the tree (Molnar, 2020). We will adopt these simple metrics to evaluate DT interpretability in our first case study.
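Both complexity proxies are directly exposed by scikit-learn; a sketch comparing a constrained and an unconstrained tree on the Iris dataset (used here only as a stand-in):

```python
# Tree-complexity proxies for interpretability: number of leaf nodes
# and tree depth, for a constrained vs. an unconstrained DT.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
simple = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0).fit(X, y)
full = DecisionTreeClassifier(random_state=0).fit(X, y)

print("constrained :", simple.get_n_leaves(), "leaves, depth", simple.get_depth())
print("unconstrained:", full.get_n_leaves(), "leaves, depth", full.get_depth())
```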

The need for interpretability is often linked to the use of models to assist scientific understanding (Doshi-Velez and Kim, 2017). The evaluation of interpretability for scientific understanding, though, is context specific. In case study 2, we will give an example of a case-specific definition of interpretability, based on the consistency of the DT partitioning of the input space with an independent classification system of some of the input variables (climate, in our case).

In Figure 3 we show the statistically optimal DT initially delivered by the automatic DT algorithm. Nodes are coloured based on impurity, a default choice in many software packages. Figure 3 also shows the graphical interface of the InteractiveDT tool, which allows users to define groups of input variables and colour-code the nodes accordingly. The resulting tree, with nodes colour-coded based on their meaning, is shown below. With this visualization, it is evident that the first three levels of the tree are dominated by "geophysical properties" and "slope geometry variables", while levels 4 and 5 are mainly dominated by "design storm properties". Furthermore, the colour coding helps spot a repetition of two variables, cohesion (c_0) and thickness of topsoil (H0), in the first levels of the tree, which indicates that these two variables are key controls.
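The idea of colour-coding nodes by expert-defined variable groups can be sketched on a fitted scikit-learn tree. This is an illustration of the concept, not the InteractiveDT interface itself; the grouping, colours and dataset below are hypothetical placeholders.

```python
# Sketch: map each split node to a colour according to the expert-defined
# group of its split variable (grouping and colours are illustrative).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
dt = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Hypothetical grouping of the four input variables by physical meaning.
groups = {0: "geometry", 1: "geometry", 2: "geophysical", 3: "geophysical"}
colors = {"geometry": "#1f77b4", "geophysical": "#ff7f0e"}

t = dt.tree_
node_color = {n: colors[groups[t.feature[n]]]
              for n in range(t.node_count) if t.children_left[n] != -1}
print(node_color)   # node id -> colour of its variable's group
```

A plotting routine (e.g. scikit-learn's `plot_tree`) can then use `node_color` to shade the boxes, making level-by-level dominance of a group visible at a glance.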

The Holdridge scheme provides a classification of land areas based on annual precipitation and aridity index (i.e. the ratio between potential evaporation and precipitation; Figure S.2 in the Supplementary material shows the original and our simplified scheme). By imposing that the threshold values for Precipitation (Pm) and Aridity index (AI) in the DT be the same as the Holdridge chart thresholds, we aim to obtain a more physically meaningful tree. We then want to explore whether a tree so constructed leads to leaf nodes that are more interpretable, i.e. they map into fewer Holdridge life zones, and whether this gain in interpretability comes with a significant loss in classification accuracy.
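The constraint amounts to snapping each algorithm-chosen threshold to the nearest value on the Holdridge chart. A minimal sketch (the chart values below are illustrative placeholders, not the exact ones used in the paper):

```python
# Snap a DT splitting threshold to the nearest value from a fixed set of
# physically meaningful chart thresholds.
def snap_to_chart(threshold, chart_values):
    """Return the chart value closest to the algorithm's threshold."""
    return min(chart_values, key=lambda v: abs(v - threshold))

# Illustrative precipitation thresholds (mm), not the paper's exact chart.
holdridge_pm = [125, 250, 500, 1000, 2000, 4000]
print(snap_to_chart(639.075, holdridge_pm))   # -> 500
```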

To answer these questions, we generated 15 datasets of 1000 samples each by random sampling from the original dataset (of 17,000,000 samples). For each dataset we derived a statistically optimal (SO) and an interactive (iDT) decision tree. To derive the SO decision tree, we tried different combinations of the algorithm tuning parameters (splitting criterion based on "Gini impurity" or "entropy", maximum number of leaf nodes varied from 15 to 25, minimum impurity decrease of 10⁻⁵, 10⁻⁶, 10⁻⁷) and retained the best SO tree based on a 10-fold cross-validation strategy. To derive the corresponding iDT, we used the iDT framework to manually change all the splitting thresholds for Pm and AI to the closest Holdridge chart threshold values.
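This tuning loop can be sketched with scikit-learn's `GridSearchCV` over the same three parameters (the impurity-decrease setting corresponds to scikit-learn's `min_impurity_decrease`). The Iris dataset stands in for the paper's recharge dataset, which is not reproduced here.

```python
# Sketch of SO-tree selection: grid over splitting criterion, maximum
# number of leaf nodes and minimum impurity decrease, scored by 10-fold CV.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
grid = {
    "criterion": ["gini", "entropy"],
    "max_leaf_nodes": list(range(15, 26)),
    "min_impurity_decrease": [1e-5, 1e-6, 1e-7],
}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid, cv=10)
search.fit(X, y)
print(search.best_params_)   # tuning of the retained SO tree
```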
Figure 5 shows an example of a statistically optimal DT (top) and the corresponding iDT (middle). Below the leaf nodes, we report the number of Holdridge life zones (HLZs) each leaf is mapped to. The bottom panel in Figure 5 shows the average number of HLZs for each leaf across the four recharge classes and in total. Overall, the Figure shows that, when moving from the statistically optimal DT to the iDT, the number of HLZs associated with each leaf node tends to decrease. This may increase the interpretability of the iDTs, as the leaf nodes not only provide a prediction of the output class (amount of groundwater recharge) but also have a clearer mapping into climate zones. Figure S.3 in the Supplementary material shows the performance of the statistically optimal DTs and the iDTs on the training and test set. Generally, the differences are not pronounced, which means the changes made by the expert to the trees did not lead to a significant change in performance. As expected, the statistically optimal DTs always show a slightly higher classification accuracy in the training sets. Interestingly though, the iDTs outperform the statistically optimal trees in most cases (9 out of 15) in the test sets. In conclusion, this example shows that incorporating knowledge in the DT development can improve interpretability without a significant loss in predictive performance.

This case study is an example of application of iDTs in cases where certain classes are under-represented in a dataset, a situation known as "imbalanced datasets". We again use the dataset from Sarrazin (2018) as in Sec. 3.2, and randomly generated 5 subsample datasets of increasing sizes (1000, 5000, 10000, 50000 and 100000 samples). We then split each subsample dataset into a training and a test set (75% and 25% of the dataset size, respectively) and randomly remove data points that belong to class C2 from the training dataset. Therefore, the training sets contain only few data points of class C2 (<2%). Similarly to Sec. 3.2, for each dataset we train a statistically optimal (SO) decision tree and then derive an iDT by manually changing the nodes' variables and thresholds until the iDT includes the under-represented class C2 in some of its leaf nodes. In some cases, we also manually change the class of a leaf node to class C2. For example, in Figure 6a on the left we show a part of the SO tree obtained for sample dataset 2. We know from Sarrazin (2018) that the low recharge class C2 should appear for low precipitation values, but the algorithm fails to include the C2 class in the SO tree as the class is under-represented in the training dataset. Hence, we manually change the threshold in the split node "Pm<=639.075" and the node variable in the split "Vr<=201.14", so as to create a branch in the tree that specifically explores low precipitation cases. In response to these manual changes, the algorithm creates
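The kind of manual threshold edit described here can be sketched directly on a fitted scikit-learn tree, which exposes its split thresholds as a writable array (the paper's toolbox wraps this interaction in a GUI; the single-variable "precipitation" data below is synthetic).

```python
# Sketch: manually override a split threshold on a fitted tree so that a
# branch isolates low-precipitation cases. Relies on scikit-learn exposing
# tree_.threshold as a mutable array; data and values are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.uniform(0, 2000, size=(300, 1))   # stand-in "precipitation" Pm
y = (X[:, 0] > 900).astype(int)           # class 1 above ~900 mm

dt = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X, y)
root = 0
old = dt.tree_.threshold[root]
dt.tree_.threshold[root] = 300.0          # expert-imposed threshold
print(f"root threshold changed from {old:.1f} to 300.0")
print("prediction at Pm=500:", int(dt.predict([[500.0]])[0]))
```

After the edit, samples with Pm between 300 and the original threshold are routed down the other branch, which is exactly how the expert carves out a branch for the under-represented low-precipitation class.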

One direction for future research could look at how to achieve closer interaction between human experts and machine algorithms by including domain knowledge in algorithmic form (Solomatine and Ostfeld, 2008). For example, experts could force the algorithm to search for thresholds in a specific range of values for selected variables, or they could define constraints on variable selection to eliminate unrealistic sequences of variables to split. Another area for future improvement would be to expand the range of visualisation techniques (e.g. partial dependence plots, accumulated local effects plots).
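The first idea, restricting the threshold search to an expert-specified range, can be sketched as a standalone exhaustive 1-D split search (not part of any existing library; the accuracy-based scoring is a simplification of the impurity criteria real DT algorithms use):

```python
# Sketch: find the best binary split threshold for one variable, but only
# among candidates inside an expert-specified range [lo, hi].
import numpy as np

def best_split_in_range(x, y, lo, hi):
    """Best threshold for x by classification accuracy, restricted to [lo, hi]."""
    candidates = [t for t in np.unique(x) if lo <= t <= hi]
    def acc(t):
        pred = (x > t).astype(int)
        # Either orientation of the split may be the better one.
        return max((pred == y).mean(), (pred != y).mean())
    return max(candidates, key=acc)

x = np.array([10., 20., 30., 40., 50., 60.])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split_in_range(x, y, 25., 45.))   # search confined to [25, 45]
```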

We hope that this paper will contribute to fostering the development and use of interactive decision trees and, more broadly, of methods to better integrate domain knowledge in ML, which can be particularly relevant for geoscience applications.

Access to the code, datasets and workflows to reproduce the results presented in this paper: