Global models for short-term earthquake forecasting and predictive skill assessment

We present rigorous tests of global short-term earthquake forecasts using Epidemic Type Aftershock Sequence (ETAS) models with two different time kernels: one with an exponentially tapered Omori kernel (ETOK) and another with a linearly magnitude dependent Omori kernel (MDOK). The tests are conducted with three different magnitude cutoffs for the auxiliary catalog (M3, M4 or M5) and two different magnitude cutoffs for the primary catalog (M5 or M6), in 30 day long pseudo prospective experiments designed to forecast worldwide M ≥ 5 and M ≥ 6 earthquakes during the period from January 1981 to October 2019. MDOK ETAS models perform significantly better than ETOK ETAS models. The superiority of the MDOK ETAS models adds further support to the multifractal stress activation model proposed by Ouillon and Sornette (2005). We find a significant improvement of forecasting skill when lowering the auxiliary catalog magnitude cutoff from 5 to 4. We unearth evidence for a self-similarity of the triggering process, as models trained on lower magnitude events have the same forecasting skill as models trained on higher magnitude earthquakes. Expressing our forecasts in terms of the full distribution of earthquake rates at different spatial resolutions, we present tests of the consistency of our model, which is often found satisfactory but also points to a number of potential improvements, such as incorporating anisotropic spatial kernels and accounting for spatial and depth dependent variations of the ETAS parameters. The model has been implemented as a reference model on the global earthquake prediction platform RichterX, facilitating predictive skill assessment and allowing anyone to review its prospective performance.


Introduction
In this study, our aim is to compare the performance of different models (described in Section 3.2) in forecasting future earthquakes. We do this by means of pseudo prospective experiments, which stand in for truly prospective forecasts of future unseen observations. Truly prospective experiments are time consuming, as they require many years before enough future observations accumulate to strengthen (or weaken) the evidence in favor of a model [Kossobokov, 2013]. A practical solution is to conduct pseudo prospective experiments. In these experiments, one uses only the early part of the dataset for the calibration of the models and leaves the rest as virtually unseen future data. Subsequently, the calibrated models are used to simulate a forecast of future observations, and the left out data is used to obtain a score for each of the forecasting models. These scores can then be compared to identify the best model.

Although (pseudo) prospective experiments have started to catch on in the field of earthquake research, they have not become the norm. In this regard, the work done by the Collaboratory for the Study of Earthquake Predictability (CSEP) and others [Schorlemmer et al., 2018; Kossobokov, 2013] has been very commendable, as they have tried to standardize forecast evaluation. CSEP forecasts are typically specified on latitude-longitude cells for computational convenience. However, as the area of these cells varies with latitude, becoming smaller as one moves north or south from the equator, a model gets evaluated at different spatial resolutions at different latitudes, thus, by construction, yielding different performances as a function of latitude. This areal effect on the performance then gets convolved with the underlying spatial variation in the performance of the model. For instance, modelers might find that their models yield better performance in California, but very poor performance in Indonesia at a fixed spatial resolution. Should they then infer that their models are better suited for a strike slip regime than for a subduction setting?
Due to the inherent nature of the varying spatial resolution of the grid prescribed by CSEP, the answer to this question becomes obfuscated.

Another aspect of pseudo prospective experiments that is poorly handled by CSEP is that modelers are forced to assume Poissonian rates in a given space-time-magnitude window, irrespective of their best judgement. Nandan et al. [2019b] showed that this choice puts the models that do not comply with the Poissonian assumption on weaker footing than those models which agree with this assumption. As a result, the reliability of the relative rankings obtained from the models evaluated by CSEP remains questionable to some extent.

Last but not least, in some model evaluation categories [Werner et al., 2011; Schorlemmer and Gerstenberger, 2007], CSEP evaluates the models using only "background" earthquakes, which are identified as independent using Reasenberg's declustering algorithm [Reasenberg, 1985]. However, as earthquakes do not naturally come with labels such as "background", "aftershock" and "foreshock" that could be used for validation, this a posteriori identification remains highly subjective. Regarding CSEP's use of Reasenberg's declustering algorithm, Nandan et al. [2019c] pointed out that the subjective nature of declustering introduces a bias towards models that are consistent with the declustering technique, rather than with the observed earthquakes as a whole. This puts into question the value of such experiments, as their results are subject to change as a function of the declustering parameters.
Since the aim of our forecasting experiment is to assess which model is more suitable to serve as a reference model for the global earthquake prediction platform RichterX, it becomes important to address the drawbacks mentioned above. We have therefore designed the pseudo prospective experiments in this study with the following settings:

1. Many training and testing periods: We start testing models beginning on January 1, 1990 and continue testing till October 31, 2019, spanning a duration of nearly 30 years. The maximum duration of an earthquake prediction that can be submitted on the RichterX platform is 30 days. Using this time window, our pseudo prospective experiments are composed of 362 non overlapping, 30 day long testing periods. To create the forecast for each of the testing periods, the models use only data prior to the beginning of that testing period, for calibration as well as simulation. The forecasts are specified on an equal area mesh with predefined spatial resolution.

2. Equal area mesh: To create this equal area mesh, we tile the whole globe with spherical triangles whose area is constant all over the globe. This mesh is designed in a hierarchical fashion. To create a higher resolution mesh from a lower resolution one, the triangles in the lower resolution mesh are divided into four equal area triangles. In this way, we create eleven levels of resolution: at the first level, the globe is tiled with 20 equal area triangles (corresponding to an areal resolution of ≈ 25.5 × 10^6 km² each); at the second level, 80 equal area triangles tile the globe, and so on. Finally, at level eleven, ≈ 21 × 10^6 triangles tile the globe with an areal resolution of ≈ 24 km². In this study, we evaluate the models at level six (unless otherwise stated), which has an areal resolution equivalent to a circle with radius ≈ 90 km.
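The quoted areal resolutions follow directly from this construction: level k contains 20 · 4^(k−1) equal area triangles, each covering an equal share of the Earth's surface. The short sketch below (assuming a mean Earth radius of 6371 km) reproduces the numbers above:

```python
import math

R_EARTH_KM = 6371.0  # mean Earth radius (assumed value)

def n_triangles(level):
    """Number of equal-area spherical triangles at a given mesh level.

    Level 1 tiles the sphere with the 20 faces of an icosahedral tiling;
    each refinement splits every triangle into 4 equal-area children.
    """
    return 20 * 4 ** (level - 1)

def triangle_area_km2(level):
    """Area of one triangle: total sphere area divided evenly."""
    sphere_area = 4.0 * math.pi * R_EARTH_KM ** 2
    return sphere_area / n_triangles(level)

def equivalent_radius_km(level):
    """Radius of the circle whose area equals one triangle's area."""
    return math.sqrt(triangle_area_km2(level) / math.pi)

# Reproduce the resolutions quoted in the text:
for lvl in (1, 5, 6, 7, 11):
    print(lvl, n_triangles(lvl), round(triangle_area_km2(lvl)),
          round(equivalent_radius_km(lvl)))
```

Level 6 indeed gives 20,480 triangles of ≈ 24,900 km² each, i.e. circles of radius ≈ 89 km, and level 11 gives ≈ 21 × 10^6 triangles of ≈ 24 km².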
To test the sensitivity of our results to the choice of areal resolution, we also evaluate the models at level five and level seven, which correspond to an areal resolution equivalent to circles with radii ≈ 180 km and ≈ 45 km, respectively. In principle, the models can be evaluated at all spatial resolutions (from 1 to 11). The resolutions in this study are chosen to be in accordance with the spatial extents used on the RichterX platform (radius of 30 to 300 km).

4. Performance evaluation using all earthquakes with M ≥ 5 or M ≥ 6: We test the models against target sets consisting of M ≥ 5 and M ≥ 6 events that occurred during each testing period. During a given testing period, competing models forecast a distribution of earthquake numbers (M ≥ 5 or M ≥ 6, depending on the choice of target magnitude threshold M_t) in each triangular pixel. We then count the actual number of observed earthquakes in each pixel. With these two pieces of information, the log likelihood LL_A^i of a given model A during the i-th testing period is defined as:

LL_A^i = Σ_j log Pr_j^i(n_j^i),

where Pr_j^i is the probability density function (PDF) of earthquake numbers forecasted by model A in pixel j, and n_j^i is the number of earthquakes observed in that pixel. In order to ascertain if the information gain of model A over B is statistically significant over all testing periods, we conduct a right tailed paired t-test, in which we test the null hypothesis that the mean information gain

over all the 362 testing periods is equal to 0, against the alternative that it is larger than 0.

Model Description

In this model, the seismicity rate λ(t, x, y|H_t) at any time t and location (x, y) depends on the history of the seismicity H_t up to t in the following way:

λ(t, x, y|H_t) = μ(x, y) + Σ_{i: t_i < t} g(t − t_i, x − x_i, y − y_i, M_i),

where μ is the time-independent background intensity function, g is the triggering function, and (t_i, x_i, y_i, M_i) represents the time, x-coordinate, y-coordinate and magnitude of the i-th earthquake in the catalog, respectively.
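The rate computation can be sketched as follows. The concrete kernel forms and numerical values below are standard ETAS choices adopted purely for illustration; they are not the calibrated forms or parameter values of this study:

```python
import math

# Illustrative parameter values only -- not the calibrated values of the paper.
MU = 1e-5                  # background rate per unit area and time, assumed
K, A, M0 = 0.02, 1.5, 5.0  # productivity kernel parameters, assumed
C, OMEGA = 0.01, 0.2       # Omori-Utsu c (days) and exponent omega, assumed
D, Q = 10.0, 1.5           # spatial kernel scale (km) and decay, assumed

def g(dt, dx, dy, m):
    """Schematic triggering function: productivity x time x space kernels.

    The paper factorizes g into three kernels; the forms used here
    (exponential productivity, power-law Omori decay, isotropic spatial
    power law) are generic ETAS choices for illustration only.
    """
    if dt <= 0:
        return 0.0
    productivity = K * math.exp(A * (m - M0))
    time_kernel = (dt + C) ** -(1.0 + OMEGA)
    r2 = dx * dx + dy * dy
    space_kernel = 1.0 / (r2 + D * D) ** Q
    return productivity * time_kernel * space_kernel

def intensity(t, x, y, history):
    """lambda(t, x, y | H_t) = mu + sum of g over all past events."""
    rate = MU
    for (ti, xi, yi, mi) in history:
        if ti < t:
            rate += g(t - ti, x - xi, y - yi, mi)
    return rate

# Two past events (time in days, coordinates in km, magnitude):
history = [(0.0, 0.0, 0.0, 6.5), (2.0, 5.0, -3.0, 5.2)]
print(intensity(10.0, 1.0, 1.0, history))
```

With an empty history the rate reduces to the background μ; each past event adds a contribution that decays in time and space and grows exponentially with its magnitude.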

The triggering function g in Equation 3 is formulated as the product of three kernels: a productivity kernel describing the expected number of directly triggered events as a function of the triggering earthquake's magnitude, a time kernel describing the temporal decay of the triggering rate, and a spatial kernel describing its spatial decay. An important consideration before calibrating the ETAS model is the choice of the primary and auxiliary catalogs.

The main difference between these two catalogs is that earthquakes in the primary catalog can act both as triggers and as targets, whereas earthquakes in the auxiliary catalog act only as potential triggers. We use different magnitude thresholds for the auxiliary catalog to test the hypothesis that smaller earthquakes play an important role in triggering and can improve the forecasting potential of the ETAS models.

Note that, even though the available ANSS catalog extends down to magnitude 0, we do not use such a low magnitude cutoff [… and Sornette, 2003]. For a branching ratio < 1, the system is in the sub-critical regime. For a branching ratio > 1, the system is in the super-critical regime [Helmstetter and Sornette, 2002]. In addition, in Figure S1, we show the […]. In the MDOK model, the triggering kernel is modified to account for a possible magnitude dependence of the Omori-Utsu parameters c and ω.

The time kernel for this model is redefined by making the Omori-Utsu parameters magnitude dependent: c(M_i) = 10^(c_0 + c_1 M_i) and ω(M_i) = ω_0 + ω_1 M_i. The multifractal stress activation (MSA) model of Ouillon and Sornette (2005) suggests that the c-value should correlate negatively with mainshock magnitude, as it predicts that the larger the stress perturbation, the shorter the duration between the mainshock and the onset of the power-law decay.
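A minimal sketch of this magnitude dependent parameterization follows; the coefficients are made-up values chosen only to illustrate the predicted signs of c_1 (negative) and ω_1 (positive), not the calibrated values of the paper, and the exponential taper and normalization of the full kernel are omitted for brevity:

```python
# Illustrative MDOK coefficients (assumed, not the calibrated values):
C0, C1 = -2.0, -0.3          # log10 intercept and slope of c(M)
OMEGA0, OMEGA1 = -0.3, 0.1   # intercept and slope of omega(M)

def c_of_m(m):
    """Magnitude-dependent Omori-Utsu c-value: c(M) = 10**(c0 + c1*M)."""
    return 10.0 ** (C0 + C1 * m)

def omega_of_m(m):
    """Magnitude-dependent Omori exponent: omega(M) = omega0 + omega1*M."""
    return OMEGA0 + OMEGA1 * m

def mdok_time_kernel(dt, m):
    """Un-normalized MDOK time decay, exponential taper omitted."""
    return (dt + c_of_m(m)) ** -(1.0 + omega_of_m(m))

# Larger mainshocks get a shorter c and a steeper effective decay:
for m in (5.0, 6.0, 7.0):
    print(m, c_of_m(m), omega_of_m(m), mdok_time_kernel(1.0, m))
```

With these signs, a larger mainshock starts its power-law decay earlier (smaller c) and decays with a larger Omori exponent, as the MSA model predicts.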

Regardless of the underlying mechanism for the dependence of the c-value on mainshock magnitude, the evidence for such an exponential dependence is rather clear, and thus warrants an explicit formulation within the ETAS model.

The linear dependence of the Omori exponent ω on the mainshock magnitude is based on the work of Ouillon and Sornette [2005], and is reminiscent of the behavior of systems close to a critical point, where critical exponents are linked by such relationships.

In this forecasting experiment, we aim to systematically test the idea that explicitly taking account of the magnitude dependence of these two Omori parameters leads to an improvement in the forecasting ability of the modified ETAS models relative to the ones in which these dependencies are ignored. Each of the twelve resulting models is shown in Table 1 and is individually calibrated on the 362 training periods.

Figure 2: (a-d) […]; (e-l) probability density functions (PDF) of earthquake numbers forecasted by the model in circular geographic regions of radius 300 km around "earthquake prone" cities of the world.

Model         1   2   3   4   5   6   7   8   9   10  11  12
Omori Kernel  ET  ET  ET  ET  ET  ET  MD  MD  MD  MD  MD  MD
M_pri         5   5   5   6   6   6   5   5   5   6   6   6
M_aux         3   4   5   3   4   5   3   4   5   3   4   5

Table 1: All twelve models resulting from different calibration choices; ET and MD stand for ETAS models with exponentially tapered and magnitude dependent Omori kernels, respectively.

In this section, we illustrate how the forecasts of different models are constructed. We do this only for a selected model and a particular testing period, as the procedure for all other testing periods and models is the same.

It is important to mention here that these average rate maps are just used for the sake of illustration of the generated forecasts, as they provide a convenient representation. In reality, each pixel on the globe is associated with a full distribution of forecasted earthquake numbers. To illustrate this, we show in Figure 2 (e-l) the probability density function (PDF) of earthquake numbers that is forecasted by the model in circular geographic regions (with 300 km radii) around some of the "earthquake prone" cities of the world. These PDFs are obtained by counting the number of simulations in which a certain number of earthquakes were observed and then by dividing those by the total number of simulations that were performed. In this study, we perform 100,000 simulations for all the models and for all testing periods. Notice that the PDF of the forecasted number of earthquakes varies significantly from one city to another, despite the fact that none of the competing models features spatial variation of the ETAS parameters. This variation can be attributed to variation in the local history of seismicity from one place to another. Other factors that control the shape of these distributions include the time duration of the testing period and the size of the region of interest (see Figure 1 in Nandan et al. [2019b]). It is also evident that the forecasted distributions of earthquake numbers around these selected cities display thick tails and cannot be approximated by a Poisson distribution. In fact, Nandan et al. [2019b] showed that, if a Poissonian assumption is imposed, the ETAS model yields a worse forecast relative to the case in which it is allowed to use the full distribution. Therefore, we use the full distribution approach proposed by Nandan et al. [2019b] to evaluate the forecasting performance of the models in the following section.
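The construction of these empirical PDFs can be sketched as follows, with a toy set of 10 simulated counts standing in for the 100,000 simulations used in the study:

```python
from collections import Counter

def empirical_pdf(counts_per_simulation):
    """Empirical PDF of earthquake numbers from simulated catalogs.

    Each entry of `counts_per_simulation` is the number of target events
    that fell in the region of interest in one simulated catalog; the PDF
    is the fraction of simulations producing each count.
    """
    n_sims = len(counts_per_simulation)
    tally = Counter(counts_per_simulation)
    return {n: k / n_sims for n, k in sorted(tally.items())}

# Toy example with 10 simulated catalogs (the paper uses 100,000):
sim_counts = [0, 1, 0, 2, 1, 0, 5, 1, 0, 0]
pdf = empirical_pdf(sim_counts)
print(pdf)  # {0: 0.5, 1: 0.3, 2: 0.1, 5: 0.1}
```

Because the distribution is built directly from the simulations, it naturally reproduces the thick tails discussed above (note the isolated count of 5 in the toy data), with no Poissonian assumption imposed.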
In Figure 3b, we show the cumulative information gain of these six models along with those variants that have been trained specifically with M_pri = 6. The performance of these models is tracked with dashed lines. The configurations of all twelve models are indicated in Figure 3b. From both panels in Figure 3, we can make the following observations: […]

Figure 4: (a) Pairwise mean information gain (MIG, per testing period) matrix of the six models used to forecast M ≥ 5 earthquakes; the (i, j) element indicates the MIG of the i-th model over the j-th model; (b) MIG matrix of the twelve models in the experiments dealing with forecasting M ≥ 6 earthquakes; black and grey labels correspond to models trained with M_pri = 5 and M_pri = 6, respectively; (c) p-value matrix obtained from the right tailed paired t-test of the null hypothesis that the MIG of the i-th model over the j-th model, when forecasting M ≥ 5 earthquakes, is equal to 0 against the alternative that it is larger than 0; (d) same as panel (c), but when forecasting M ≥ 6 earthquakes.

Having compared the cumulative information gains of the models over the null model, we also compare the models pairwise in order to assess whether one model performs significantly better than the others. In Figure 4a, we show the matrix of pairwise MIG over all testing periods. Note that this matrix is antisymmetric. In Figure 4b, we show the MIG matrix for the twelve models in the experiments dealing with forecasting M ≥ 6 earthquakes. The models that have been trained with M_pri = 6 are labelled in grey, while the ones trained with M_pri = 5 are labelled in black.

In order to find out whether the MIG of one model over another is statistically significant, we perform a right tailed paired t-test.

In this test, we test the null hypothesis that the MIG of the i-th model over the j-th model is equal to 0 against the alternative that it is larger than 0. Figures 4c and 4d show the matrices of log10(p-value) obtained from the t-tests corresponding to the MIGs shown in panels a and b, respectively. From these MIG and p-value matrices, we can make the following observations: […]

3. We also find that, when decreasing M_aux from 5 to 4, the models tend to always perform statistically significantly better (all other settings being the same), not only when forecasting M ≥ 5 earthquakes but also nearly always when forecasting M ≥ 6 earthquakes. In the latter case, there is just one exception, i.e. when the MDOK kernels are used with the M_pri = 6 setting. However, the same trend does not hold when decreasing M_aux from 4 to 3.

4. We also find that the models that have been trained specifically with M_pri = 6 almost never significantly […]

For these two resolutions (levels five and seven), we also present the tables of pairwise MIG and p-values in Figures S7 and S8, respectively. We find that the observations made earlier from Figures 3 and 4 are robust with respect to the choice of spatial resolution.
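The pairwise comparison machinery can be sketched as follows. The log likelihood values are toy numbers, and for brevity the p-value uses a large-sample normal approximation of the Student t distribution, which is very close to the exact paired t-test with 362 testing periods:

```python
import math
from statistics import NormalDist

def mean_information_gain(ll_a, ll_b):
    """MIG of model A over model B: mean per-period log-likelihood difference."""
    diffs = [a - b for a, b in zip(ll_a, ll_b)]
    return sum(diffs) / len(diffs)

def right_tailed_paired_test(ll_a, ll_b):
    """Right-tailed paired test of H0: mean gain = 0 vs H1: mean gain > 0.

    Returns the t statistic and a normal-approximation p-value.
    """
    diffs = [a - b for a, b in zip(ll_a, ll_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    t = mean / math.sqrt(var / n)
    p = 1.0 - NormalDist().cdf(t)
    return t, p

# Toy per-period log likelihoods of two models over 8 testing periods:
ll_a = [-10.0, -12.5, -9.1, -11.0, -10.2, -13.0, -9.8, -10.5]
ll_b = [-10.6, -13.0, -9.5, -11.8, -10.1, -13.4, -10.4, -10.9]
print(mean_information_gain(ll_a, ll_b), right_tailed_paired_test(ll_a, ll_b))
```

Looping this comparison over all model pairs yields the antisymmetric MIG matrix and the companion p-value matrix of Figure 4.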

Sensitivity to the Number of Simulations

An important point to consider when evaluating and comparing the models is the number of simulations to perform. As the models are evaluated based on the empirical distribution of earthquake numbers that they provide in a given space-time-magnitude bin, performing too few simulations would yield a poor approximation of this distribution. This difficulty is compounded by the fat-tailed distribution of seismic rates [Saichev et al., 2005; Saichev and Sornette, 2006a, 2007], which implies strong sample to sample fluctuations and a slow convergence of statistical properties [Sornette, 2006]. On the other hand, more simulations come at higher computational costs. As a result, it is important to optimize this trade-off. Figure S9 shows the net log-likelihood (summed over all testing periods) that a model obtains as a function of the number of simulations. The default number of simulations (100,000) used in this study to obtain all the results is indicated with a shaded vertical bar. At 100,000 simulations, all the models show a slow convergence towards their "true" log likelihood scores. Furthermore, the relative ranking of the models seems to be stable beyond 100,000 simulations at all spatial resolutions, further justifying this choice.

[…] dependence with a magnitude independent ω. We then calibrate these models on all the 362 training periods. The time evolution of the estimated parameters is reported in Figures S10 and S11, respectively. We use the estimated parameters to simulate 100,000 catalogs for the corresponding testing periods. To limit the needed computational resources, we only calibrate these models with the M_aux = 5 and M_pri = 5 setting, and then use these two models to create and evaluate the forecasts for M ≥ 5 earthquakes. We then compare (Figure S12) the performance of these two models to the one […]

In Figure S12c, we assess whether accounting for a negative correlation between c and mainshock magnitude could lead to any information gain over the model with the ETOK kernel. We find that the c(M) model does not provide any systematic information gain. One possible reason for the poor performance of the c(M) models in forecasting could lie in the short-term aftershock incompleteness [Hainzl, 2016b], which is present in both the training and testing catalogs. This rate-dependent incompleteness would not only dampen the negative correlation between c and mainshock magnitude, but also lead to very low information gain, as the events that would have led the c(M) type model to be more informative are missing from the testing catalog in the first place.

On the Importance of Small Earthquakes in Forecasting

Our results also indicate that including smaller earthquakes in the model calibration improves forecasting skill, consistent with the significant information gain obtained when lowering M_aux from 5 to 4. […]

On the Possible Self-Similarity of the Triggering Process

The insignificant difference in the performance of the models that have been trained with M_pri = 6 and M_pri = 5 suggests the existence of self-similarity in the triggering process. More concretely, the models do not need to be trained specifically with M_pri = 6 to perform best in forecasting M ≥ 6 earthquakes, as even the models trained with M_pri = 5 can do an equally good job. This observation could potentially be generalized to even higher magnitude thresholds, although we have not tested it in this work.

On the Exclusivity of the Two Model Improvements

Finally, the cumulative improvements obtained by changing the time kernel from ETOK to MDOK and M_aux from 5 to 4 indicate that these two modifications capture, to some extent, mutually exclusive aspects of the triggering process. Furthermore, these two modifications seem to be equally important, as they separately lead to similar information gains over the base model (see the solid orange and red curves in Figure 3).

In an earthquake forecasting experiment, consistency tests play an important part, as they allow for the direct comparison of a model's expectations with the observations, thus serving as necessary sanity checks. One such important sanity check is the "N-test", in which the overall number of earthquakes forecasted by a model is compared against the actual number of earthquakes observed during the testing period. Indeed, this test, along with other consistency tests such as the L, M and S tests (see Rhoades et al. [2011] for details), has been used by CSEP to measure the consistency of the models relative to the data. It is important to note that these tests are not used to rank the models.

Not surprisingly, one of the hard-coded assumptions in these tests, thus far in CSEP, has been that the distribution of the overall number of earthquakes forecasted by the models is Poissonian. Thus, when the numbers of earthquakes forecasted by the models are compared against the observed numbers (especially when aftershocks are deliberately not removed), the models are most often found to be inconsistent (see, e.g., Figure 9 in Werner et al. [2011]). For instance, Werner et al. [2011] showed that, with a retrospective assumption of a negative binomial distribution, the smoothed seismicity models developed in their study "passed" the N-tests for all testing periods.
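An N-test that respects each model's own distributional assumptions can instead be built from the forecast's simulated totals. A minimal sketch of such an empirical version follows, using synthetic simulated counts; the two-tailed acceptance rule with level `alpha` is one reasonable choice, not the specific convention of any existing testing center:

```python
import random

def empirical_n_test(simulated_totals, observed_total, alpha=0.05):
    """Empirical N-test: locate the observed count within the forecast's
    own simulated distribution of total earthquake numbers, without any
    Poissonian (or other parametric) assumption.

    The model "passes" if the observed total does not fall in either
    alpha/2 tail of the empirical distribution.
    """
    n_sims = len(simulated_totals)
    # Empirical probabilities of totals <= and >= the observed value:
    p_le = sum(1 for s in simulated_totals if s <= observed_total) / n_sims
    p_ge = sum(1 for s in simulated_totals if s >= observed_total) / n_sims
    passed = (p_le > alpha / 2.0) and (p_ge > alpha / 2.0)
    return p_le, p_ge, passed

# Toy example: 1,000 synthetic 30-day totals and one observed total.
random.seed(1)
sims = [random.randint(80, 140) for _ in range(1000)]
print(empirical_n_test(sims, observed_total=110))
```

Because the test compares the observation against the model's own simulated distribution, a forecast with thick tails is judged on its actual spread rather than on an imposed Poissonian one.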

Indeed, it is prohibitively reductive to enforce the same assumption on all models regardless of their formulation.

Furthermore, the assumptions of the models should not be modified retrospectively. Last but not least, the assumptions in a model should be self consistent at all scales. For instance, if a model assumes that the rate of future earthquakes is Poissonian, it cannot then be evaluated using a negative binomial assumption for the N-test and a Poissonian assumption for estimating the information gain. In summary, the consistency tests should be modified to allow for the simultaneous testing of models with diverse assumptions. One possible way to do this for the "N-test" is to build an empirical PDF […].

Figure S13 shows the same information as Figure 5, but for the twelve models used to forecast M ≥ 6 earthquakes. Recall that the six extra models in this case come from the distinction introduced by the minimum threshold of the primary catalog (M_pri = 5 or 6) used to train the models. […] Simplifications such as ignoring the spatial variation and depth dependence of the parameters may also be at the origin of some of these inconsistencies. The quantification of the extent to which each of these different factors contributes to the inconsistencies will be undertaken in future studies. We also observe from Figure 5 that there are extended periods (such as between […]); these features are less evident for M ≥ 6 earthquakes (Figure S13), possibly because of their sparse numbers during a given testing period.

The models developed and tested in this work constitute a first imperfect attempt at developing global models that are capable of making short-term operational forecasts. Several simplifications have been made, especially in terms of the diversity of the models developed and tested.
Some of the obvious simplifications include (a) considering only ETAS type models, (b) assuming the parameters of the ETAS models to be spatially homogeneous and time invariant, (c) ignoring the depth dependence of parameters, (d) ignoring errors in the data, (e) assuming isotropic spatial kernels, and so on. Nevertheless, by introducing fair and reliable testing schemes, in which modellers have the flexibility to adhere to their best judgement consistently, this study can serve as a framework for further model developments. Indeed, by operationalizing the best performing model as a benchmark for the RichterX prediction contest, we enable fellow modellers to use our results as a stepping stone for improving their models. This also constitutes a continuing process of peer-review, whereby anyone who finds the forecast probabilities too low or high can issue a to-occur or a not-to-occur prediction, providing us with important prospective feedback to improve our model.

On more general grounds, forecasting models can be split into two broad categories, namely statistical models (such as ETAS) and physical ones (using quantities such as static or dynamic stress transfer). The latter require knowledge of many additional parameters, including the spatial extent and orientation of each rupture, as well as a detailed description of the slip over the failure planes. Nandan [2017] showed that our ability to forecast aftershock sequences using a stress-transfer approach increases if one takes into account the triggering probabilities provided by an independent ETAS declustering process (the stress-based forecast being logically more appropriate for direct aftershocks). This, in turn, suggests that a better knowledge of the space-time variations of the stress field may help to improve the forecasts of ETAS-like models. Nevertheless, the difficulty of such a forecasting framework is that the details of rupture must be known in real time for all past events, and forecasted as well for all future events. As this is clearly out of scope given our very limited knowledge of the deterministic structure of fault networks in the Earth's crust, the MSA model thus offers the best opportunity to encode some of the universal properties of the mechanics of brittle media within a purely statistical framework. This certainly explains the superiority of this model for forecasting purposes, even in its simplified, linearized form presented in this paper.