Chemistry speedup in reactive transport simulations: purely data-driven and physics-based surrogates

The computational costs associated with coupled reactive transport simulations are mostly due to the chemical subsystem: replacing it with a pre-trained statistical surrogate is a promising strategy to achieve decisive speedups at the price of small accuracy losses, and thus to extend the scale of problems which can be handled. We introduce a hierarchical coupling scheme in which "full physics", equation-based geochemical simulations are partially replaced by surrogates. Errors on mass balance resulting from multivariate surrogate predictions effectively assess the accuracy of the multivariate regressions at runtime: inaccurate surrogate predictions are rejected and the more expensive equation-based simulations are run instead. Gradient boosting regressors such as xgboost, which require no data standardization and are able to handle Tweedie distributions, proved a suitable emulator. Finally, we devise a surrogate approach based on geochemical knowledge, which overcomes the issue of robustness when encountering previously unseen data and which can serve as a basis for further development of hybrid physics-AI models.


Introduction
Coupled reactive transport simulations (Steefel et al., 2005, 2015) are very expensive, effectively hampering their widespread application. While hydrodynamic simulations on finely resolved spatial discretizations containing millions of grid elements are routinely run on common workstations, the size of the computationally affordable reactive transport simulations on the same hardware decreases by a factor of ten to one hundred as soon as chemical reactions are coupled in (De Lucia et al., 2015; Jatnieks et al., 2016; De Lucia et al., 2017; Leal et al., 2020; Prasianakis et al., 2020). This usually forces oversimplifications of the subsurface domain, reduced to 2D or very coarse 3D, and of the geochemical complexity as well.

In the classical operator splitting approaches such as the Sequential Non-Iterative Approach (SNIA), the three interacting physical processes (hydrodynamic flow, solute transport, and chemical interactions between solute species and rock-forming minerals) are solved sequentially. Chemistry usually represents the bottleneck of coupled simulations, taking up to 90 % of computational time (De Lucia and Kühn, 2013; De Lucia et al., 2015; Leal et al., 2020). The numerical model for geochemical speciation and reactions requires in general the integration of one stiff differential-algebraic system of equations per grid element per simulation time step. Parallelization is thus required to tackle large spatial discretizations, which is why many modern codes are developed to run on high performance computing (HPC) clusters with many thousand CPUs. However, parallelization alone is still not sufficient to ensure numerical convergence of the simulations, a problem routinely encountered by many practitioners. Furthermore, large uncertainties affect the phenomenological model itself (e.g., reaction kinetics spanning orders of magnitude; consistent activity models for the concentrated solutions usually encountered in the subsurface) and even larger ones concern the parametrization of the subsurface, in particular the spatial heterogeneity of rocks, which is mostly unknown and hence often disregarded (De Lucia et al., 2011). It may thus appear unjustified to allocate large computational resources to solve very expensive but actually oversimplified or uncertain problems. Reducing the computational cost associated with reactive transport modelling is thus of paramount importance to ensure its wide application to a range of otherwise unattackable problems (Prommer et al., 2019).
The much desired speedup of this class of numerical models has been the focus of intensive research in the last few years. Among the proposed solutions, Jatnieks et al. (2016) suggest replacing the "full physics" numerical models of the geochemical subsystem with emulators or surrogates, employed at runtime during the coupled simulations. A surrogate is a statistical multivariate regressor which has to be trained in advance on a set of precalculated "full physics" solutions of the geochemical model at hand, spanning the whole parameter range expected in the simulations. Since the regressors are much quicker to compute than the setup and integration of a DAE system of equations, this promises a significant speedup and has thus found resonance in the scientific community (e.g., De Lucia et al., 2017; Laloy and Jacques, 2019; Guérillot and Bruyelle, 2020). However, all approximations, and especially purely data-driven surrogates, introduce accuracy losses into the coupled simulations. These must be kept low in order to generate meaningful simulation results. Ultimately, replacing a fully fledged geochemical simulator with a surrogate amounts to trading computational time for accuracy of the simulations. Due to the non-linear nature of the geochemical subprocesses, even small errors in surrogate predictions propagate through successive iterations, so that within a few time steps rapidly diverging trajectories of the coupled models originate, leading to unphysical results. Mass and charge imbalances, i.e., "creation" of matter, happened to be the most common source of unphysicality in our early tests. It is thus of paramount importance to obtain highly accurate surrogates, which in turn may require very large and densely sampled training datasets and long training times.
Any regression algorithm can theoretically be employed to replace the "full physics" equation-based geochemical model. The thriving developments in data science and machine learning in recent years have produced many different and efficiently implemented regressors, readily available and usable in high-level programming languages such as Python or R. Among the most widely successful, one can cite Gaussian processes, support vector machines, artificial neural networks and decision-tree based algorithms such as random forest or gradient boosting. Most of these algorithms are "black boxes" which non-linearly relate many output variables to many input variables. Their overall accuracy can be statistically assessed by measuring their performance on the training dataset or on a subset of the available training data left out for the specific purpose of testing the models. In any case, these training and/or test datasets must be obtained beforehand by computing an appropriate number of points with the "full physics" model. Geochemistry is usually largely multivariate, meaning that many input and many output variables are passed to and from the geochemical subsystem at each time step. In general, different regressors may better capture each output variable depending on many factors (e.g., the problem at hand, where variables display different non-linear behaviors; the sampling density of the training dataset, which may be biased). For this reason, we argue that the most sensible choice for a surrogate modelling framework is that of multiple multivariate regression: one multivariate regressor, making use of many or all inputs as predictors, is trained independently for each distinct output variable, while the choice of regressor may vary from variable to variable. With algorithms such as Artificial Neural Networks (ANN) it is possible to train one single network, and hence one single surrogate, for all variables at once. However, ANN usually require long CPU times for training and quite large training datasets.
In practice, the CPU time, the user interaction and the overall skill required to optimally train complex regressors should not be underestimated and may prove overwhelming for geoscientists. The whole process of hyperparameter tuning, required by most advanced machine learning algorithms, while being an active area of development and research, is still hardly fully automatable.

This work showcases and analyses two different approaches to surrogate geochemical modelling in reactive transport simulations. The first is completely data-driven, disregarding any possible knowledge about the ongoing processes. In the second approach we derive a surrogate which exploits the actual equations solved by the full physics representation of chemistry. Both are applied and evaluated on the same 1D benchmark implemented in a reactive transport framework. This includes a hierarchical submodel coupling strategy, which is advantageous when the predictions for one subprocess reach different accuracy levels.

Methods: simulation environment and benchmark problem
All the code used for running the reactive transport simulations, generating the datasets and training the surrogates is included in the RedModRphree package for the R environment (R Core Team, 2020), making use of the geochemical simulator PHREEQC (Appelo et al., 2013). Since chemistry is an inherently embarrassingly parallel task, the speedup achieved on a single CPU as in this work will transfer, given simulation grids large enough to make up for the overhead, to parallel computations.

Numerical simulation of flow and transport
We consider a stationary, fully saturated, incompressible, isothermal 1D Darcy flow in a homogeneous medium. Transport is restricted to pure advection; the feedback of mineral precipitation and dissolution on porosity and permeability is disregarded, and the fluid density is considered constant. Advection is numerically computed via a forward Euler explicit resolution scheme:

$$C_i(x, t + \Delta t) = C_i(x, t) - \frac{u\,\Delta t}{\Delta x}\left[\,C_i(x, t) - C_i(x - \Delta x, t)\,\right] \qquad (1)$$

where u is the module of the Darcy velocity, C_i(x, t) the volumetric concentration (molality) of the i-th solute species at the point x and time t, and ∆x the size of a grid element. For this scheme, the Courant-Friedrichs-Lewy (CFL) stability condition imposes that the Courant number ν be less than or equal to 1:

$$\nu = \frac{u\,\Delta t}{\Delta x} \leq 1 \qquad (2)$$

For Courant numbers less than 1, numerical dispersion arises; the scheme is unstable for ν > 1. The only solution for advection which is both stable and precise is obtained with ν = 1. The CFL condition is thus very limiting with respect to ∆t: a factor two refinement of the spatial discretization corresponds to a factor two decrease in ∆t, and hence to a doubling of the required iterations. Note that equation (1) is not written in terms of porosity, so that effectively the Darcy velocity is assumed equal to the fluid flux or, alternatively, the porosity is equal to unity. This assumption does not have any impact on the calculations besides the initial scaling of the system.

The implemented advection relies on the transport of total elemental concentrations instead of the actual dissolved species, since all solutes are subject to the very same advection equation (Parkhurst and Wissmeier, 2015). Special attention needs to be paid to pH, which is defined in terms of the activity of protons:

$$\mathrm{pH} = -\log_{10}\left[\mathrm{H}^+\right]$$

and is hence not additive. Assuming that the activity coefficient for protons is constant throughout the simulation, the activity [H+] can be transported instead of pH. Charge imbalance and redox potential (pe) can be safely disregarded for this redox-insensitive model, with absolutely insignificant errors when compared to the same problem simulated, e.g., with PHREEQC's ADVECTION keyword (not shown).
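As a minimal illustration, the following R sketch implements the upwind scheme of eq. (1) for one transported quantity; the function name and the treatment of the inlet boundary are our own choices, not the RedModRphree implementation:

```r
## Explicit upwind advection of eq. (1) for one transported quantity.
## `C`: concentrations along the column; `C_in`: inlet concentration;
## `nu`: Courant number u*dt/dx (nu = 1 gives dispersion-free transport).
advect_upwind <- function(C, C_in, nu = 1) {
  upstream <- c(C_in, C[-length(C)])  # value advected from the left neighbour
  C - nu * (C - upstream)
}

## usage: inject a 0.001 molal solute at the inlet of a 50-cell column
Mg <- rep(0, 50)
Mg <- advect_upwind(Mg, 0.001)
```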

The chemical benchmark
The chemical benchmark used throughout this work is inspired by Engesgaard and Kipp (1992) and is well known, with many variants, in the reactive transport community (e.g., Shao et al., 2009; Leal et al., 2020). It was chosen because it has been studied by many different authors and is challenging enough from a computational point of view.
At the inlet of a column of 0.5 m (conventionally on the left side in the figures throughout this work) a 0.001 molal magnesium chloride (MgCl2) solution is injected into a porous medium whose initial solution is at thermodynamic equilibrium with calcite. With the movement of the reactive front, calcite starts to dissolve and dolomite is transiently precipitated. Kinetic control is imposed on all mineral reactions following a Lasaga rate expression from Palandri and Kharaka (2004), limited to the neutral and H+ mechanisms (parameters are summarized in Table 1) and with constant reactive surfaces, hence independent of the actual amounts of minerals. The rate of precipitation, relevant only for dolomite, is set equal to the rate of dissolution. Temperature is set for simplicity at a constant 25 °C, in disregard of the actual physical meaningfulness of the model concerning dolomite precipitation (Möller and De Lucia, 2020). Detailed initial and boundary conditions are summarized in Table 2. To achieve a complete description of the chemical system at any time, seven input variables are required: pH, C, Ca, Mg, Cl, calcite and dolomite. These can be considered, with an abuse of language, state variables, in the sense that they constitute the necessary and sufficient inputs of the geochemical subsystem, and all reactions only depend on them. The outcome of the "full physics" calculations is completely defined (at least with the simplifications discussed above) by four distinct quantities: the amounts of reaction affecting the two minerals calcite and dolomite in the given time step, from which the changes in the solutes Ca, Mg and C can be back-calculated, Cl, which is actually non-reactive, and pH. In a completely process-agnostic, data-driven framework, however, the relationship between minerals and aqueous concentrations is disregarded, and the output of the chemical subsystem is expressed in terms of the input variables.

Reference simulations and training data
For the remainder of this work the geochemical benchmark described above is solved on a 1D column of length 0.5 m, with a constant fluid velocity of u = 9.375·10⁻⁶ m/s. The domain is discretised with grid refinements ranging between 50 and 500 grid elements. Higher refinements have a double effect: on the one side, larger grids obviously increase the overall computational load, in particular for chemistry; on the other side, given the restriction of the implemented forward Euler explicit advection scheme, the time step required for the coupled simulations to be free of numerical dispersion decreases accordingly. Smaller time steps decrease the computational load of geochemistry at each iteration, since they require shorter time integrations, but also require more coupled iterations to reach the same simulation time. More iterations also mean more chances for the errors introduced by surrogates to propagate further into the simulations in both space and time. In the presence of significant overhead due to, e.g., data passing between different softwares or the setup of the geochemical simulations, the advantage of shorter time steps vanishes. However, these aspects become more relevant in the context of the parallelization of geochemistry and are not addressed in the present work.

All coupled simulations, both reference and with surrogates, are run with a constant time step either honouring the CFL condition with ν = 1, and thus free of numerical dispersion, or, for comparing how the speedup scales with larger grids, with a fixed time step small enough for the CFL condition (eq. 2) to be satisfied for every discretisation. As previously noted, the latter simulations will be affected by grid-dependent numerical dispersion, which we do not account for in the present work. This makes the results incomparable across grids in terms of transport. However, since the focus is on the acceleration of geochemistry through pre-computed surrogates, this is in our opinion an acceptable simplification.
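For concreteness, eq. (2) with ν = 1 fixes the dispersion-free time step for each grid; with the velocity above, these are the values also used in section 4:

$$\Delta t = \frac{\Delta x}{u} = \frac{0.5/50\ \mathrm{m}}{9.375\cdot 10^{-6}\ \mathrm{m\,s^{-1}}} \approx 1066\ \mathrm{s}\ (\text{grid } 50), \qquad \Delta t = \frac{0.5/500\ \mathrm{m}}{9.375\cdot 10^{-6}\ \mathrm{m\,s^{-1}}} \approx 106.6\ \mathrm{s}\ (\text{grid } 500).$$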
The comparison between the reference simulations, obtained by coupling the transport with the PHREEQC simulator, and those obtained with surrogates is based on an error measure composed as the geometric mean of the relative RMSEs of each variable i, using the variable's maximum value at a given time step as norm:

$$E(t) = \left[\ \prod_{i=1}^{m} \frac{\sqrt{\frac{1}{n}\sum_{j=1}^{n}\left(\tilde{C}_i(x_j, t) - C_i(x_j, t)\right)^2}}{\max_j C_i(x_j, t)}\ \right]^{1/m} \qquad (3)$$

where m is the number of variables to compare, n the grid dimension, t the particular time step where the error is computed, and the tilde denotes the surrogate simulation.
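A minimal R sketch of this error measure, assuming the reference and surrogate results for one time step are held in two matrices of identical layout (variable names are ours):

```r
## Error measure of eq. (3): geometric mean over the m variables of the
## RMSE relative to each variable's maximum at the given time step.
## `ref`, `surr`: n x m matrices (n grid elements, m variables).
geom_mean_error <- function(ref, surr) {
  rel_rmse <- sapply(seq_len(ncol(ref)), function(i)
    sqrt(mean((surr[, i] - ref[, i])^2)) / max(ref[, i]))
  exp(mean(log(rel_rmse)))   # geometric mean
}
```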
In this work the datasets used for training the surrogates are obtained directly by storing all calls to the full physics simulator and its responses during the reference coupled reactive transport simulations, possibly limited to a given simulation time. This way of proceeding is considered more practical than, e.g., an a priori sampling of a given parameter space, where the bounds of the parameter space are defined by the ranges of the input/output variables occurring in the reference coupled simulations. This strategy mimics the problem of training a surrogate directly at runtime during the coupled simulations. Furthermore, an a priori statistical sampling of the parameter space, in the absence of restrictions based on the physical relationships between the variables, would include unphysical and irrelevant input combinations. By employing only the input/output tables actually required by the full physics simulations, this issue is automatically solved; however, the resulting datasets will in general be skewed, multimodal and highly inhomogeneously distributed within the parameter space, with highly dense samples in some regions and even larger empty ones.

Hierarchical coupling of chemistry
In this work we consider only a Sequential Non-Iterative Approach (SNIA) coupling scheme, meaning that the subprocesses flow, transport and chemistry are solved numerically one after another before advancing to the next simulation step. For the sake of simplicity, we let the CFL condition (2) for advection dictate the allowable time step for the coupled simulations.
Replacing the time-consuming, equation-based numerical simulator for geochemistry with an approximate but quick surrogate introduces inaccuracies into the coupled simulations. These may quickly propagate in space and time during the coupled simulations and ultimately lead to inconsistent and unusable results.
A way to mitigate error propagation, and thus to reduce the accuracy required of the surrogates, is represented by a hierarchy of models used to compute chemistry at each time step during the coupled simulations. The idea is to first interrogate the surrogate, identify implausible or unphysical predictions, and run the full physics chemical simulator for the rejected predictions. The surrogates can thus be tuned to capture with good accuracy the bulk of the training data, and no particular attention needs to be paid to the most difficult "corner cases". For the highly non-linear systems usually encountered in geochemistry, this is of great advantage.

In practice, however, there is still the need for a reliable and cheap error estimation on surrogate predictions at runtime.
It is important to understand that the criteria employed to accept or reject the surrogate predictions depend strictly on the architecture of the multivariate surrogate and on the actual regression method used. Methods such as kriging offer error models based, e.g., on the distance of the estimated point from the nearest training data. However, in the general case, any error estimation requires first the training and then the evaluation at runtime of a second "hidden" model.
Both steps can be time consuming; furthermore, in the general case one can only guarantee that the error is expected, in a probabilistic sense, to be lower than a given threshold.
In a completely data-driven surrogate approach, where each of the output variables is independently approximated by a different multivariate regressor, checking mass conservation is a very inexpensive way to estimate the reliability of a given surrogate prediction, since it only requires the evaluation of linear combinations of predictors and predictions. Other constraints suited to the chemical problem at hand may be added, such as charge balance; however, we only use mass balance in the present work. Figure 1 illustrates schematically this simple hierarchical coupling.

Figure 1. Schematic view of the hierarchical sequential non-iterative coupling. The decision whether or not to accept the predictions of a multiple multivariate surrogate is based on computing the mass balances for the three elements forming dolomite and calcite before and after reaction, and computing their mean absolute error. If this error trespasses a given threshold, the more expensive equation-based geochemical simulator is run instead.

For the chemical benchmark of section 2.2, three mass balance equations can be written, one for each of the elements C, Ca and Mg, accounting for the stoichiometry of the minerals' brute formulas. If a surrogate prediction exceeds a given, predetermined tolerance on the Mean Absolute Error of the balance equations, that particular prediction is rejected and a more expensive full physics simulation is run instead.
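The following sketch shows how such a check could look in R for the present benchmark, with calcite CaCO3 and dolomite CaMg(CO3)2; the column names and the assumption of consistent units between solutes and minerals are ours:

```r
## Mean absolute error of the elemental balances (C, Ca, Mg) between the
## surrogate input `inp` and its prediction `out`, one row per grid element.
balance_mae <- function(inp, out) {
  eC  <- (out$C  + out$Calcite + 2 * out$Dolomite) -
         (inp$C  + inp$Calcite + 2 * inp$Dolomite)
  eCa <- (out$Ca + out$Calcite +     out$Dolomite) -
         (inp$Ca + inp$Calcite +     inp$Dolomite)
  eMg <- (out$Mg +                   out$Dolomite) -
         (inp$Mg +                   inp$Dolomite)
  rowMeans(abs(cbind(eC, eCa, eMg)))
}
## elements whose balance_mae() exceeds the tolerance (e.g. 1e-6) are
## recomputed with the full physics simulator
```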

This approach moderates the need for extremely accurate regressions, especially in instances of non-linear behaviour of the chemical models, for example when a mineral precipitates for the first time or when it is completely depleted, which are hard for regressors to capture. However, the number of rejected predictions must be low to produce relevant speedups; the scheme is effectively a trade-off between the accuracy of the surrogates (and the effort and time which go into it) and the speedup achieved in the coupled simulations.
3 Fully data-driven approach

The first approach is a completely general one, fully data-driven and thus process-agnostic: it can be employed for any kind of numerical model or process which can be expressed in the form of input and output tables, i.e., virtually any. In our case, the tables produced by the geochemical subprocess during the reference coupled simulations are used to train seven multiple multivariate regressors, one for each output.
The reference simulations, and hence the dataset for training the surrogate, are fully coupled simulations on grids 50, 100 and 200 with a fixed time step of 210 s, run until 33600 s, or 161 total coupling iterations. As previously noted, these simulations are not comparable among themselves due to significant numerical dispersion; however, from the point of view of the geochemical processes, this strategy has the advantage of spreading the "perturbations" due to the transport of the results of geochemistry in the previous time step. Instead of the usual random split of the dataset into train and test subsets, customary in the machine learning community, we retained only the data resulting from the first 101 iterations for training the surrogates, and evaluated the resulting reactive transport simulations until iteration 161, where the geochemical surrogate is faced with 60 iterations on unseen or "out of sample" geochemical data. The training dataset comprises tables with 13959 unique rows or input combinations. All simulations, the reference ones and those with surrogates, are run on a single CPU core.
The choice of the regressor for each output is actually arbitrary, and nothing forbids using different regressors for each variable, or even different regressors in different regions of the parameter space of each variable. Without going into details on all the algorithms that we tested, we found that decision-tree based methods such as Random Forest and their recent gradient boosting evolutions appear the most flexible and successful for our purposes. Their edge can in our opinion be summarized by: (1) implicit feature selection by construction, meaning that the algorithm automatically recognizes which input variables are most important for the estimation of the output; (2) no need for standardization of inputs and outputs; (3) ability to deal with any probability distribution; (4) fairly quick training with sensible hyperparameter defaults; (5) extremely efficient implementations available.
Points (2)-(4) cannot be overlooked. Data normalization or standardization techniques, also called "preprocessing" in machine learning lingo, are redundant for decision-tree based algorithms, whereas they have a significant impact on results and training efficiency for other regressors such as Support Vector Machines and Artificial Neural Networks. The distributions displayed by the variables in the geochemical data are extremely variable and cannot in general be assumed uniform, Gaussian or lognormal. We found that the Tweedie distribution is suited to reproduce many of the variables in the training dataset. The Tweedie distribution is a special case of the exponential dispersion models introduced by Tweedie (1984) and thoroughly described by Jørgensen (1987), which finds application in many actuarial and signal processing problems (Hassine et al., 2017).
Tweedie distributions are characterized by a power mean-variance relationship, Var(Y) = φ·μ^p. This means that they form a family depending on the power parameter p: Gaussian if p = 0, Poisson if p = 1, gamma if p = 2 and inverse Gaussian if p = 3. The interesting case, which is normally referred to when using the term "Tweedie", is 1 < p < 2. This distribution represents positive variables with a positive mass at zero, meaning that it preserves the "physical meaning" of zero. This is intuitively an important property when modelling solute concentrations and mineral abundances: the geochemical system solved by the full physics simulator is radically different when, e.g., a mineral is present or not.
Extreme gradient boosting, xgboost (Chen and Guestrin, 2016), is a decision-tree based algorithm which has enjoyed enormous success in the machine learning community in recent years. It offers out-of-the-box the capability to perform regression on Tweedie variables and is extremely efficient in both training and prediction. Using the Tweedie regression objective with fixed p = 1.2, a maximum tree depth of 20, the default η = 0.3 and 1000 boosting iterations with early stopping at 50, all results in the dataset are reproduced with great accuracy, and the training itself takes around 20 seconds for all seven outputs on our workstation, using four cores. Contrary to expectation, the accuracy of the predictions is largely enhanced if the labels (i.e., the output table) are scaled before training. We used the maximum value of each label divided by 1·10⁻⁵ as scale. The default evaluation metric when performing Tweedie regression is the Root Mean Squared Log Error:

$$\mathrm{RMSLE} = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(\log(\tilde{y}_k + 1) - \log(y_k + 1)\right)^2} \qquad (4)$$

In the previous section it was claimed that in the framework of hierarchical coupling there is no practical need to further refine the regressions. This could be achieved by hyperparameter tuning and by using a different, better adapted probability distribution for each label, including a proper fitting of the parameter p for the Tweedie variables. While this would of course be beneficial, we proceed now by plugging such a rough surrogate into the reactive transport simulations. The coupled simulations with surrogates are performed on the three grids for 161 iterations, setting the tolerance on mass balance to 10⁻⁵ and 10⁻⁶, and additionally relying only on the surrogate, meaning no call to PHREEQC even if a large mass balance error is detected.
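A minimal sketch of this setup for a single output variable, using the xgboost R package; the label scaling follows the description above, and the variable names are illustrative:

```r
## One Tweedie xgboost regressor per output variable; `X` (predictors)
## and `y` (one label column) are placeholders for the training tables.
library(xgboost)

train_output <- function(X, y) {
  scl    <- max(y) / 1e-5                    # label scaling used in the text
  dtrain <- xgb.DMatrix(as.matrix(X), label = y / scl)
  params <- list(objective = "reg:tweedie",
                 tweedie_variance_power = 1.2,
                 max_depth = 20, eta = 0.3,
                 eval_metric = "rmsle")
  bst <- xgb.train(params, dtrain, nrounds = 1000,
                   watchlist = list(train = dtrain),
                   early_stopping_rounds = 50, verbose = 0)
  list(model = bst, scale = scl)             # keep the scale for prediction
}
```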

Figure 2 exemplarily displays the variable profiles for grid 100 and tolerance 10⁻⁶ at two different time steps: iteration 101, which is the last one within the training dataset, and the end of the simulation time, after 60 coupling iterations in "unseen territory" for the surrogates. The accuracy of the surrogate simulations is excellent at the 101st iteration, but by iteration 161, while still acceptable, some discrepancies start to show. The number of rejected surrogate responses at each time step does not remain constant during the simulations, but increases steadily. An overview of all the simulations is given in Figure 3 (top panel). The more stringent mass balance tolerance of 10⁻⁶ (solid lines) obviously rejects many more predictions, which goes hand in hand with the excellent accuracy of the results (Figure 3, bottom panel; error measured with the formula of eq. 3, excluding pH). It was expected, and it is demonstrated by the evaluation, that starting with the first "out of sample" time step the accuracy of the surrogates drops significantly, which triggers a steep increase of rejected predictions and conversely of calls to PHREEQC. The hierarchical coupling ensures that the errors in the surrogate simulations do not follow the same steep increase, but from this moment on there is a loss of computational efficiency, visible in the simulations with tolerance 10⁻⁶, which makes the surrogate predictions actually useless in terms of speedup even before they become too inaccurate to be usable. It is also apparent from the error panel in Figure 3 (bottom) that errors introduced in the coupled simulations at early time steps propagate through the rest of the simulations, so that the overall discrepancy between reference and surrogate simulations also steadily increases.

Note that this "diverging behavior" also tends to bring the geochemistry "out of sample" in the sense of seen vs. unseen geochemical data: the training data only comprise "physical" input combinations but, due to the introduced inaccuracies, we are asking the surrogate for more and more predictions based on slightly "unphysical" input combinations. Having highly accurate surrogates would hence be beneficial in this regard as well.
It is difficult to discriminate a priori between acceptable and unacceptable simulation results based on a threshold of an error measure such as that of eq. (3), which can be roughly interpreted as a "mean percentage error". This is also a point where, in our opinion, further research is needed. Relying on the visualization of the surrogate simulation results and of the reference, we can summarize that the tolerance on mass balance of 10⁻⁶ (solid lines in Figure 3) produces accurate coupled simulations: excellent accuracy within the time steps of the training data and good accuracy after the 60 out of sample iterations. The tolerance of 10⁻⁵, as well as the simulations based solely on surrogates, produce acceptable accuracy in sample but unusable and rapidly diverging results out of sample. For the given chemical problem, the 10⁻⁶ tolerance on mass balance could be relaxed, whereas 10⁻⁵ is too optimistic; the optimal value, at least for the considered time steps, lies between these two.

Figure 3. Purely data-driven approach: evaluation of calls to the full physics simulator for the runs with hierarchical coupling for the three discretizations with 50, 100 and 200 elements, and of the overall discrepancy between surrogate simulations and reference. When the surrogate enters the region of "unseen data", its accuracy degrades significantly, which causes a loss of efficiency rather than accuracy.

The overall speedup with respect to the reference simulations, in terms of total wall clock time of the coupled simulations, thus also including the CPU time used for advection and all the overheads, although both are much less computationally intensive than chemistry, and therefore termed pseudo speedup, is summarized in Figure 4. Here the whole 161 iterations, including all the out of sample ones, are considered. The pseudo speedup increases with grid size, as expected. The accurate 10⁻⁶ simulations are not accelerated on grid 50 (pseudo speedup of 0.86), but they reach 1.33 on the 200 grid. The surrogate-only speedup starts at around 2.6 for the 50 grid and reaches 4.2 for the 200 grid, which can be taken as a measure of the achievable speedup, in projection, when using extremely accurate surrogates. Considering only the first 101 iterations, the 10⁻⁶ simulations would achieve a speedup slightly larger than one already on the 50 elements grid, and well over 2 on the 200 grid. With grids of 10⁵ to 10⁶ elements, speedups in the order of 25-50 are achievable for this chemical problem. The speedup will arguably increase even further in the presence of more complex chemistry.

4 Surrogates based on geochemical knowledge
The fully data-driven approach presented above disregards any domain knowledge or known physical relationships between variables besides those which are picked up automatically by the multivariate algorithms operating on the inputs/outputs in the training data. We start this second approach by considering the actual "true" degrees of freedom of the geochemical problem, which is fully described by seven inputs and four outputs: ∆calcite, ∆dolomite, Cl and pH. This means that we will have to calculate back the changes in the concentrations of C, Ca and Mg, risking a quicker propagation of errors if the reaction rates of the minerals are incorrectly predicted.
The reference simulations for this part are run with ν = 1, and thus without numerical dispersion, on four different grids: 50, 100, 200 and 500 elements respectively. This implies that the simulation on grid 500 has ten times more coupling iterations than the 50 grid or, in other terms, that the allowable time step on grid 500 is a tenth of that on grid 50.
A common way to facilitate the task of the regressors by "injecting" physical knowledge into the learning task is to perform feature engineering: this simply means computing new variables defined as non-linear functions of the original ones, which may give further insights regarding multivariate dependencies, hidden conditions or relevant subsets of the original data.

For any geochemical problem involving dissolution or precipitation of minerals, each mineral's saturation ratio (SR), or its logarithm SI (Saturation Index), discriminates the direction of the reaction. If SR > 1 (and thus SI > 0) the mineral is oversaturated and precipitates; it is undersaturated and dissolves if SR < 1 (SI < 0); SR = 1 (SI = 0) implies local thermodynamic equilibrium. Writing the reaction of calcite dissolution:

$$\mathrm{CaCO_3} \rightleftharpoons \mathrm{Ca^{2+}} + \mathrm{CO_3^{2-}} \qquad (5)$$

The Law of Mass Action (LMA) relates, at equilibrium, the activities of the species present in the equation. We conventionally indicate activity with square brackets. For eq. (5), the LMA reads:

$$K^{eq}_{Cc} = \left[\mathrm{Ca^{2+}}\right]\left[\mathrm{CO_3^{2-}}\right] = \gamma_{Ca}\,\mathrm{Ca^{2+}}\cdot\gamma_{CO_3}\,\mathrm{CO_3^{2-}} \qquad (6)$$

where γ stands for the activity coefficient of the subscripted aqueous species. The solubility product K^eq_Cc at equilibrium, tabulated in thermodynamic databases, is a function of temperature and pressure and defines the saturation ratio:

$$SR_{Cc} = \frac{\left[\mathrm{Ca^{2+}}\right]\left[\mathrm{CO_3^{2-}}\right]}{K^{eq}_{Cc}} \qquad (7)$$

The estimation of the saturation ratios of calcite and dolomite from the elemental concentrations in our training data constitutes the first, natural feature engineering attempt for this problem. It turns out that this is not only achievable with only a few sensible assumptions, but it also allows us to demonstrate how to construct a "physics-based surrogate" from the data.

Using total elemental concentrations as proxies for species activities implies neglecting the difference between concentration and activity (the "true" activity [H+] is known from the pH), but also not relying on the actual speciation to estimate the ion activity products which appear in the definition of the saturation ratios. For the chemical problem at hand, as will be shown, this is a viable approximation, but it will not be in the presence of strong gradients of ionic strength or, in general, for more complex or concentrated systems. An exception to this simplification is required for dissolved carbon due to the well-known carbonate buffer. In this case, given that the whole model stays at pH between 7 and 10, we may assume that two single species dominate the dissolved carbon speciation: CO₃²⁻ and HCO₃⁻. The relationship between the activities of these two species is always kept at equilibrium in the PHREEQC models and thus, up to the "perturbation" due to transport, also in our dataset. This relationship is expressed by the reaction and the corresponding law of mass action written in eq. (8):

$$\mathrm{HCO_3^-} \rightleftharpoons \mathrm{CO_3^{2-}} + \mathrm{H^+}, \qquad K^{eq}_{carb} = \frac{\left[\mathrm{CO_3^{2-}}\right]\left[\mathrm{H^+}\right]}{\left[\mathrm{HCO_3^-}\right]} \qquad (8)$$

The closure equation, expressing the approximation of the total carbon concentration as the sum of the two species, gives us the second equation for the two unknowns:

$$C \approx \mathrm{HCO_3^-} + \mathrm{CO_3^{2-}} \qquad (9)$$

Combining eqs. (8) and (9) we get the estimation of dissolved bicarbonate (the wide tilde indicates that it is an estimation) from the variables total carbon and pH comprised in our dataset and an externally calculated thermodynamic constant:

$$\widetilde{\mathrm{HCO_3^-}} = \frac{C}{1 + K^{eq}_{carb}/\left[\mathrm{H^+}\right]} \qquad (10)$$

It immediately follows that we can estimate the calcite saturation ratio SR_Cc with the formula:

$$\widetilde{SR}_{Cc} = \frac{\mathrm{Ca}\cdot K^{eq}_{carb}\cdot\widetilde{\mathrm{HCO_3^-}}}{\left[\mathrm{H^+}\right]\,K^{eq}_{Cc}} \qquad (11)$$

The two thermodynamic quantities (at 25 °C and atmospheric pressure), K^eq_carb = 10⁻¹⁰·³²⁸⁸ and K^eq_Cc = 10⁻²·⁰⁰¹³⁵, were computed with the CHNOSZ package for the R environment (Dick, 2019), but may also be derived with simple algebraic calculations from, e.g., the same PHREEQC database employed in the reactive transport simulations.
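In R, the two engineered features follow directly from eqs. (10) and (11); this sketch assumes a data frame with columns pH, C and Ca (total molalities), as in the training tables:

```r
K_carb <- 10^-10.3288   # HCO3- <=> CO3-- + H+ at 25 degC, eq. (8)
K_cc   <- 10^-2.00135   # calcite constant as given in the text

engineer_features <- function(df) {
  aH   <- 10^(-df$pH)                           # proton activity from pH
  HCO3 <- df$C / (1 + K_carb / aH)              # eq. (10): bicarbonate
  SRcc <- df$Ca * K_carb * HCO3 / (aH * K_cc)   # eq. (11): saturation ratio
  transform(df, HCO3 = HCO3, SIcc = log10(SRcc))
}
```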
Do these two newly defined variables, or "engineered features", the bicarbonate and the calcite saturation ratio, actually help to better understand and characterize our dataset? This can simply be assessed by plotting ∆calcite against the logarithm of SR_Cc, which is the SI_Cc (Figure 5a, left panel; dataset from the reference simulations on grid 200, which will be used from now on to illustrate the analysis since it contains enough data points). While many points remarkably lie on a smooth curve (colored in black), many others are scattered throughout the graph (in red). It is easy to observe that those red points are either on the trivial ∆calcite = 0 line, implying that calcite is undersaturated but not present in the system so nothing happens, or else the reaction did not reach the amount which could have been expected based on its initial undersaturation, simply because calcite was completely depleted during the time step. All the red points correspond in fact to simulations with calcite = 0 in the labels (results) dataset. The retained black points, however, belong to time steps where the dissolution of calcite is limited by kinetics and not by its initial amount, and can thus be used to estimate the reaction rate.
We could now try to derive analytically a functional dependency between the observed amount of dissolved calcite in the retained points and the estimated SI_Cc, since it is a function of the kinetic law; or we can simply use a regressor instead. The practically bijective relationship between the calculated ∆calcite and the estimated SI_Cc means that we should be able to regress the first using only the second. In the right panel of Figure 5, the in sample predictions of a Multivariate Adaptive Regression Spline (MARS) model (Friedman, 1991, 1993), computed through the earth R package (Milborrow, 2018) and based only on SI_Cc, are plotted in blue. The accuracy is indeed already acceptable; however, including further predictors from the already available features, in this case pH, Cl and the newly computed HCO₃, a much better regression (in red) is achieved, improving the RMSE by more than a factor of two.

Before moving forward, two considerations are important. First, the red points of Figure 5a should not be used when trying to estimate the rate of calcite dissolution, since they result from a steep and "hidden" non-linearity or discontinuity in the underlying model. This is a typical example of features potentially leading to overfitting of the dataset. Secondly, this "filter" does not need to be applied at runtime during the coupled reactive transport simulations: it suffices to estimate the reaction rate correctly and then ensure that calcite does not reach negative values.
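The corresponding fits with the earth package are one-liners; `retained` stands for the black points of Figure 5a with the engineered features appended (our naming):

```r
library(earth)
## monovariate MARS fit on the estimated saturation index alone (blue)
fit_si  <- earth(dCalcite ~ SIcc, data = retained)
## extended predictor set (red), improving the RMSE by more than factor two
fit_ext <- earth(dCalcite ~ SIcc + pH + Cl + HCO3, data = retained)
```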

More interesting and more demanding is the case of dolomite, which first precipitates and then redissolves in the benchmark simulations. In a completely analogous manner as above we define its saturation ratio SR_Dol as:

$$\widetilde{SR}_{Dol} = \frac{\mathrm{Ca}\cdot\mathrm{Mg}\cdot\widetilde{\mathrm{HCO_3^-}}^{\,2}}{\left[\mathrm{H^+}\right]^2 K^{eq}_{Dol}} \qquad (12)$$

thus using the total elemental dissolved concentrations and the bicarbonate estimated from C, and with K^eq_Dol = 10³·⁶⁴⁷ resulting from the reaction:

$$\mathrm{CaMg(CO_3)_2} + 2\,\mathrm{H^+} \rightleftharpoons \mathrm{Ca^{2+}} + \mathrm{Mg^{2+}} + 2\,\mathrm{HCO_3^-} \qquad (13)$$

The theoretical value of K^eq_Dol = 10³·⁶⁴⁷ used for the calculation of SI_Dol does not discriminate the initially undersaturated from the oversaturated samples (dashed vertical black line in Figure 6). The "offset" which would serve for a correct discrimination is nothing else than the maximum value of SR_Dol restricted to the region where ∆dolomite ≤ 0. We correspondingly update the definition of SR_Dol:

$$\widetilde{SR}_{Dol} \leftarrow \frac{\widetilde{SR}_{Dol}}{\max\left(\widetilde{SR}_{Dol}\ \middle|\ \Delta\mathrm{dolomite} \leq 0\right)} \qquad (14)$$

Now we are guaranteed that the vertical line SR_Dol = 1 (or equivalently SI_Dol = 0, plotted with a dashed blue line in Figure 6) correctly divides the parameter space into four distinct quadrants. Note that this offset emerges from the actual data considered, and depends on the perturbation of the concentrations due to transport and thus, in our simple advective scheme, on the grid resolution through the time step. It follows that a different offset is expected for the other grids, and a separate training for each grid is necessary.
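A sketch of the estimation of eqs. (12) and (14) on the training data, reusing engineer_features() above; `dDol` denotes the observed dolomite reaction amounts (our naming):

```r
K_dol <- 10^3.647   # eq. (13) at 25 degC

SR_dol <- function(df) {
  aH   <- 10^(-df$pH)
  HCO3 <- df$C / (1 + K_carb / aH)
  df$Ca * df$Mg * HCO3^2 / (aH^2 * K_dol)   # eq. (12)
}

SRraw  <- SR_dol(train)
offset <- max(SRraw[train$dDol <= 0])   # eq. (14): empirical offset
SIdol  <- log10(SRraw / offset)         # corrected saturation index
```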

The green shaded, top right quadrant points to dolomite precipitation in initially supersaturated samples; the bottom left, blue shaded quadrant contains solutions initially undersaturated w.r.t. dolomite and, if present, dissolving; the top left, orange shaded quadrant is the most problematic: dolomite is initially undersaturated but, presumably due to the concurrent dissolution of calcite, it becomes supersaturated during the time step and hence precipitates.
First of all, we note that the initial presence of calcite is a perfect proxy for SI_Dol. If calcite is initially present in points reached by the reactive magnesium chloride solution, then dolomite precipitates. When calcite is completely depleted, dolomite starts dissolving again. The dissolution of dolomite in the absence of calcite follows the same logic as the dissolution of calcite above: a few points are scattered between the line ∆dolomite = 0 and the envelope of points lying on a well defined curve. These scattered points are again those where dolomite is depleted within the time step, so they are excluded. For the remaining points, an xgboost regressor based on the predictors SI_Dol, pH, C, Cl, Mg and dolomite achieves an excellent accuracy in reproducing the observed ∆dolomite (Figure 7).

The top right quadrant of Figure 6, corresponding to the case of dolomite precipitating while calcite is dissolving, cannot be explained based only on the estimated SI_Dol, since their relationship is not surjective (Figure 8a). Here again we can use a piece of domain knowledge to engineer a new feature to move forward. The Mg/Ca ratio is often used to study the thermodynamics of the dissolution of calcite and the precipitation of dolomite (Möller and De Lucia, 2020). Effectively, the occurring overall reaction which transforms calcite into dolomite reads:

$$2\,\mathrm{CaCO_3} + \mathrm{Mg^{2+}} \rightleftharpoons \mathrm{CaMg(CO_3)_2} + \mathrm{Ca^{2+}} \qquad (15)$$

By applying the law of mass action to reaction (15), it is apparent that its equilibrium constant is a function of the Mg/Ca ratio (of its inverse in the form of equation 15). Plotting ∆dolomite versus the initial Mg/Ca ratio, a particular ratio of 7.345 discriminates between two distinct regions for this reaction. Incidentally, this splitting value corresponds to the highest observed SR_Dol in the training data; again, as previously noted for the offset on the estimated saturation index, this numerical value depends on the considered grid and time step. In the left-hand region we observe a smooth, quasi-linear dependency of the amount of precipitated dolomite on the initial Mg/Ca. This is a simple bijective relationship where we can apply a simple monovariate regression: the amount of precipitated dolomite is accurately predicted by a MARS regressor using the sole Mg/Ca as predictor.

The region on the right of the splitting ratio can be best understood considering that the precipitation of dolomite is limited, in this region, by a concurrent amount of calcite dissolution. The full physics chemical solver finds iteratively the correct amounts of calcite dissolution and dolomite precipitation while honoring both the kinetic laws and all the other conditions of a geochemical DAE system (mass action equations, electroneutrality, positive concentrations, activity coefficients, . . . ). We cannot reproduce such an articulate and "interdependent" behavior without knowing the actual amount of dissolved calcite: we are forced here to employ the previously estimated ∆calcite as a "new feature" to estimate the amount of dolomite precipitation, albeit limited to this particular region of the parameter space. A surprisingly simple expression, fortunately, captures this relationship quite accurately (Figure 9). This implies of course that during the coupled simulations the ∆calcite must be computed first and, relying on this value, the ∆dolomite can then be estimated.

The last parameter space region which is left to consider is the orange shaded, top left quadrant of Figure 6.
Here, although dolomite is undersaturated at the beginning of the time step, it still precipitates in the end, following the concurrent dissolution of calcite which changes its saturation state. Since, however, we have already calculated the ∆calcite, we can update the concentrations of dissolved Ca and C by the corresponding amounts. One of these two concentrations, together with that of Mg, will constitute the limiting factor for the precipitation of dolomite. Hence, plotting ∆dolomite against the minimum value of these three concentrations at each point (C must be divided by two for the stoichiometry of dolomite), we obtain a piecewise-linear relationship with limited non-linear effects. A very simple regression is hence sufficient to capture the bulk of the "true model behavior" for all these data points (Figure 10). Now the behavior of calcite and dolomite is fully understood, and we dispose of a surrogate for both of them. Among the remaining output variables, only pH needs to be regressed: Cl is non-reactive, meaning that its surrogate is the identity function. For pH, while it would be possible to derive a simplified regression informed by geochemical knowledge, we chose for simplicity to use the xgboost regressor.
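The limiting-concentration feature can be sketched as follows; the sign convention assumed here is that ∆calcite is the change of the mineral amount, negative when calcite dissolves and releases Ca and C to the solution:

```r
## Minimum of the updated concentrations limiting dolomite precipitation
## in the orange quadrant; C is halved for the dolomite stoichiometry.
limiting_conc <- function(Ca, C, Mg, dCc) {
  pmin(Ca - dCc, Mg, (C - dCc) / 2)
}
```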

Summarizing, we effectively designed a decision tree, based on domain knowledge, which enabled us to make sense of the "true" data, to perform physically meaningful feature engineering and ultimately to define a surrogate model "translated" to the data domain (Figure 11). The training of this decision tree surrogate consists merely in computing the engineered features, finding the apparent offset for the SI_Dol and the split value for the Mg/Ca ratio, and performing six distinct regressions on data subsets, of which three are monovariate and two use fewer predictors than the corresponding completely data-driven counterpart. All of them, excluding pH, only use a subset of the original training dataset. On our workstation, this operation takes a few seconds. The resulting surrogate is valid for the ∆t of the corresponding training data.
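At prediction time the surrogate reduces to a dispatch over the parameter space regions followed by the back-calculation of the solutes; the following skeleton is our illustrative rendering, not the RedModRphree API, with `S` collecting the fitted regressors as functions:

```r
predict_tree_surrogate <- function(state, S) {
  feat <- engineer_features(state)           # HCO3, SIcc, SIdol, Mg/Ca
  dCc  <- S$calcite_rate(feat, state)        # MARS fits of Figure 5
  dCc  <- pmax(dCc, -state$Calcite)          # mineral amounts stay positive
  dDol <- S$dolomite_rate(feat, state, dCc)  # region-dependent regressions
  dDol <- pmax(dDol, -state$Dolomite)
  ## back-calculate the solutes (mineral changes negative on dissolution)
  state$Calcite  <- state$Calcite  + dCc
  state$Dolomite <- state$Dolomite + dDol
  state$Ca <- state$Ca - dCc - dDol
  state$Mg <- state$Mg - dDol
  state$C  <- state$C  - dCc - 2 * dDol
  state$pH <- S$pH(feat, state)              # xgboost regression for pH
  state                                      # Cl: identity, left untouched
}
```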

To evaluate the performance of this surrogate approach, a decision tree is trained separately for each grid (and hence ∆t) using the reference time steps until 42000 seconds, whereas the coupled simulations are prolonged to 60000 s, so that at least 30 % of the simulation time is computed on unseen geochemical data.
Figure 11. Decision tree for the surrogate based on physical interpretation of the training dataset. The engineered features are used as splits and as predictors for different regressions depending on the region of parameter space. The abbreviations "lm" and "pwl" stand for "linear model" and "piecewise-linear regression", respectively.

The top panel of Figure 12 shows the results of the coupled simulation for grid 50 using the surrogate trained on the same data, at the end of the iterations used for training. Discrepancies with respect to the reference full physics simulation are already evident. The problem here is that the training
dataset is too small and the time step too large for the decision tree surrogate to be accurate. However, nothing forbids performing "inner iterations" for the chemistry using a surrogate trained on a finer grid, which directly corresponds to a smaller ∆t. For grid 50 (∆t = 1066 s) we can hence use the surrogate trained on grid 500 (∆t = 106.6 s), simply calling it 10 times within each coupling iteration. The bottom panel of Figure 12 displays the corresponding results. The same problem affects grid 100, which also requires the surrogate trained on grid 500, reiterated 5 times in this case. The grids 200 and 500 are fine with their own reference data, as can be seen in Figure 13, this time displaying the end of the simulation time at 60000 s.

Figure 14 summarizes the errors of the surrogate simulations (top panel) and the overall pseudo speedup after 60000 s (bottom panel). While inaccuracies are introduced into the coupled simulations by the decision tree surrogate, they reach a plateau quite early and are not sensitive to "new data", with no increase in error after crossing the "out of sample" boundary. Even if the overall error is slightly larger than in the corresponding purely data-driven simulations with 10⁻⁶ tolerance, the robustness of the physics-based approach is a major advantage. Moreover, since no calls to PHREEQC

are issued at all during the simulations, the performance of the coupled simulations is not going to degrade during the simulation time. The physics-based surrogates achieve large pseudo speedups, starting at a little over 3 for the grid 50 and reaching 7.2 for the 500 grid (Figure 14, bottom panel).
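The "inner iterations" then amount to repeated application of a surrogate trained at a finer time step, e.g. ten calls of the grid-500 surrogate (∆t = 106.6 s) per coupling step on grid 50 (∆t = 1066 s); names as in the sketch above:

```r
advance_chemistry <- function(state, S, n_inner = 10) {
  for (k in seq_len(n_inner))
    state <- predict_tree_surrogate(state, S)
  state
}
```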
Note that the decision tree approach has been implemented in pure high-level R language (up to the calls to the regressors xgboost and earth, which are efficiently implemented in low-level languages such as C/C++) and is not optimized. A better implementation would further improve its performance, especially in the case where repeated calls to the surrogate are performed at each coupled iteration.

5 Discussion

The results presented in this work devise some strategies which can be successfully exploited to speed up reactive transport simulations. The simplifications concerning transport and the coupling itself (stationary flow; pure advection with a dispersive fully explicit forward Euler scheme; no feedback of chemistry on porosity and permeability; initially homogeneous medium; kinetic rates not depending on reactive surfaces) are obviously severe, but most of them should only marginally affect the validity of the benchmarks concerning the achievable speedup of geochemistry in a broad class of problems. A fully data-driven approach, combined with a hierarchical coupling in which full physics simulations are performed only if surrogate predictions are found implausible, works and promises significant speedups for large scale problems.
The main advantage of this approach is that the same "code infrastructure" can be used to replace any physical process, not limited to geochemistry: it is completely general and could be implemented in any multiphysics toolbox for any co-simulated process. The hierarchy of models for process co-simulation is a vast research field in itself. This idea has, to our knowledge, never been implemented specifically for reactive transport, but has been proposed, e.g., for particular problem settings in fluid dynamics and elastomechanics (Altmann, 2013; Altmann and Heiland, 2015) and in the broader context of theoretical model reduction and error control (Domschke et al., 2011). This is however a fertile interdisciplinary research task, and it is not difficult to foresee that significant progress in this area will soon be required to facilitate and fully leverage the powerful machine learning algorithms already available, in order to speed up any complex, multiscale numerical simulation.

The coupling hierarchy implemented in this work is obviously extremely simple and cannot be directly compared with the above cited works, since it is merely based on an a posteriori evaluation of the plausibility of the geochemical simulations. Furthermore, it exploits redundant regressions, which is suboptimal, albeit practical: in effect, regressing more variables than strictly necessary is not much different from regressing the true independent variables and their error models. Since the surrogate predictions are so cheap compared to the full physics, it would be only slightly beneficial to first interrogate the error model and then go directly to the full physics, instead of computing the whole surrogate predictions at once and checking them afterwards. Nevertheless, several improvements can be implemented with respect to the hierarchy presented in this work. The first would be adding charge balance to the error check at runtime. For different classes of chemical processes, other criteria may be required; for example, checks on mass action laws can be implemented for models requiring explicit speciation, as in the simulations of radionuclide diffusion and sorption in storage formations. Another improvement would be to actually eliminate one or more redundant regressions and base the error check on the accordance between the overlapping ones. As an example, one could regress ∆dolomite, ∆calcite and ∆Ca, limiting in practice the mass balance check to one element.

In our opinion there is no point in discussing whether there is one most suitable or most efficient regression algorithm. This largely depends on the problem at hand and on the skills of the modeller. While we rather focused on gradient boosting decision-tree regressors for the reasons briefly discussed in section 3, a considerable number of authors have successfully applied artificial neural networks to a variety of geochemical problems and coupled simulations (Laloy and Jacques, 2019; Guérillot and Bruyelle, 2020; Prasianakis et al., 2020).
We would like to point out that transforming geochemistry, as any other process, into a pure machine learning problem requires on the one hand skills that are usually difficult for geoscientists to acquire, and on the other hand it fatally overlooks domain knowledge that can be used to improve at least the learning task, which directly results in more accurate and robust predictions, as we demonstrated in section 4. Feature engineering based on known physical relationships and equations should be part of any machine learning workflow anyway; building experience in this matter and devising suitable strategies for a broad class of geochemical problems is in our opinion much more profitable than trying to tune overly complex "black box" models of general applicability. The purely data-driven approach has its own merits and applications. As already noted, it is a completely process-agnostic approach which can be implemented in any simulator for any process. However, in the absence of physical knowledge within the surrogate, the training data must cover beforehand all the processes and scenarios happening in the coupled simulations. On-demand training and successive incremental updates of the surrogates at runtime during the coupled simulations would mitigate this issue. This would require a careful choice of the regressors, since not all of them have this capability, and possibly a sophisticated load balancing, likely viable only in the context of parallel computing. In perspective, however, this is a feature that in our opinion should be implemented in the numerical simulators. A second issue, related to the first, is that a data-driven surrogate trained on a specific chemical problem (intending here initial conditions, concentration of the injected solutions, mineral abundances, time steps, . . . ) is not automatically transferable to different problem settings, even when, for example, only a single kinetic constant is varied. Again, shaping the surrogate following the physical process to be simulated seems here the most straightforward way to overcome this issue, at least partially.

One would then dispose of partial regressions in specific parameter space regions, which could be varied following changes in the underlying parameters.
It remains to be assessed whether it is possible to generalize and automate the physics-based surrogate approach devised in section 4 for geochemical problems of higher complexity, i.e., with many minerals reacting. No claim of optimality is made about the actual choice of engineered features for this chemical benchmark: different features could possibly explain the data even more simply, and thus further reduce the learning task.

Conclusions
Employing surrogates to replace computationally intensive geochemical calculations is a viable strategy to speed up reactive transport simulations. A hierarchical coupling of the geochemical subprocess, allowing recourse to "full physics" simulations when surrogate predictions are not accurate enough, is advantageous to mitigate the inevitable inaccuracies introduced by the approximated surrogate solutions. In the case of purely data-driven surrogates, which represent a completely general approach not limited to geochemistry, the regressors operate exclusively on input/output data, oblivious of known relationships. Here, redundant information content can be employed effectively to obtain a cheap estimation of the plausibility of surrogate predictions at runtime, by checking the errors on mass balance. This estimation works well, at least for the presented geochemical benchmark. Our tests show a consistent advantage of decision-tree based regression algorithms, especially those belonging to the gradient boosting family.
Feature engineering based on domain knowledge, i.e., the actual governing equations of the chemical problem as solved by the full physics simulator, can be used to construct a surrogate approach in which the learning task is enormously reduced. The strategy consists in partitioning the parameter space based on the engineered features and looking for bijective relationships within each region. This approach reduces both the required multivariate predictions and the size of the training dataset upon which each regressor must operate. Algorithmically it can be represented by a decision tree, and it proved both accurate and robust, being equipped to handle unseen data and less sensitive to sparse training datasets, since it embeds and exploits knowledge about the modelled process. Further research is required to generalize and automate it for more complex chemical problems, as well as to adapt it to specific needs such as sensitivity and uncertainty analysis.
Both approaches constitute non-mutually-exclusive valid strategies in the arsenal of modellers dealing with overwhelmingly CPU-expensive reactive transport simulations, required by present day challenges in subsurface utilization. We are in particular persuaded that hybrid AI-physics models will offer the decisive computational advantage needed to overcome current limitations of classical equation-based numerical modelling.
Code availability. The code used in the present work is available in the form of an R package at https://gitext.gfz-potsdam.de/delucia/RedModRphree or, after January 1st, 2021, at https://git.gfz-potsdam.de/delucia/RedModRphree

Author contributions. MDL shaped the research, performed the analyses and the programming, and wrote the manuscript. MK helped providing funding, shaping the research, and revised the manuscript.
Competing interests. The authors declare no conflict of interest.
Acknowledgements. The authors gratefully acknowledge the Helmholtz Association of German Research Centers - Initiative and Networking Fund - for the funding in the framework of the project "Reduced Complexity Models - Explore advanced data science techniques to create models of reduced complexity", reference number ZT-I-0010.