Another look at the treatment of data uncertainty in Markov chain Monte Carlo inversion and other probabilistic methods

natural way, without having to make arbitrary choices regarding residual thresholds. In particular, we can regard the data uncertainty distribution as a mixture of a Gaussian distribution, which represents valid measurements with some measurement error, and a uniform distribution, which represents invalid measurements. The relative balance between them is an unknown parameter to be estimated alongside the standard deviation of the Gaussian distribution. For each data point, the algorithm can then assign a probability of being an outlier, and the influence of each data point is effectively downgraded according to that probability. Furthermore, this assignment can change as the McMC search explores different parts of the model space. The approach is demonstrated with both synthetic and real tomography examples. In a synthetic test, the proposed mixed measurement error distribution allows recovery of the underlying model even in the presence of 60 per cent outliers, which completely destroy the ability of a regular McMC or linear search to provide a meaningful image. Applied to an actual ambient noise tomography study based on automatically picked dispersion curves, the resulting model is shown to be much more consistent for different data sets, which differ in the applied quality criteria, while retaining the ability to recover strong anomalies in selected parts of the model.

In the past, solving a geophysical inverse problem generally implied finding an optimum model that fits the observed data in a least-squares sense (e.g. Backus & Gilbert 1967; Aster et al. 2005; Menke 2012) and fulfils a number of essentially arbitrary regularisation constraints (Engl et al. 2000) such as damping (minimization of the model perturbation norm) or smoothing (minimization of first- or second-order model derivatives). Whereas this paradigm persists for computing-intensive inverse problems such as full-waveform modelling in two or three dimensions (e.g. Komatitsch & Vilotte 1998; Virieux & Operto 2009), increasing computer power has enabled the practical application of algorithms that apply Bayes' theorem (Bayes 1763) to generate not only an 'optimum model' but an estimate of the probability distribution of the model parameters, m, given the observed data d and an a priori probability distribution of the model parameters, p(m). The latter encodes our knowledge of the range of possible model parameters and their likelihood prior to acquiring the data (note that d and m are vectors, but the number of elements of m needed to adequately describe the model might not be known in advance). Bayes' theorem as applied to model inference reads:

p(m|d) = p(d|m) p(m) / p(d),    (1)

where p(m|d) is the a posteriori probability (density) of the model, p(d|m) is the probability (density) of the data given a model under consideration, also known as the likelihood, and p(d) is the unconditional probability (density) of the data (Smith 1991). In theory, p(d) can be obtained by integration of the conditional probability (density) over all possible models, but this is difficult to carry out in practice in higher-dimensional model spaces (in the following, we will simply write probability for conciseness, but in most cases a probability density is implied). In most cases an estimate of the absolute probability is not required, as only the relative probabilities of two models are compared. Then, only the ratios of eq. (1) evaluated for different values of m are required, and the denominator term p(d), being independent of m, cancels. We are therefore free to ignore this term for the remainder of the paper.
As the model parameter space is usually too vast to be searched exhaustively with a grid search, a strategy is needed to focus this search on regions of the model space which contribute significantly to the overall probability, that is, where p(m|d) is large. One of the earliest approaches using a Monte Carlo inversion was conducted by Press (1968) to retrieve global Earth models. A popular method is the Markov chain Monte Carlo (McMC) method (e.g. Sambridge & Mosegaard 2002; Sisson 2005) with the Metropolis-Hastings acceptance rule (see MacKay 2003, for a detailed overview), where a chain of models is generated following this algorithm:

(1) Generate a starting model m^(1) and start with iteration k = 1.
(2) Generate a trial model m′ from m^(k) according to the transition probability q(m′|m^(k)).
(3) Calculate the acceptance probability of the trial model from the following ratio:

α = [p(d|m′) p(m′)] / [p(d|m^(k)) p(m^(k))].    (2)

(4) Generate a random number β from a uniform distribution between 0 and 1.
  (a) If β < α, accept the trial model, that is, m^(k+1) = m′, and add it to the chain (i.e. for α ≥ 1 the new model is always accepted).
  (b) Otherwise, reject the trial model and add the previous one to the chain: m^(k+1) = m^(k).
(5) Go back to the second step to find the next element of the chain.
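The loop above is compact enough to sketch directly. The following is a minimal illustration, not the implementation used in this study; the names `log_likelihood`, `log_prior` and `propose` are placeholders for the problem-specific ingredients, and a symmetric proposal is assumed so that q cancels from the acceptance ratio:

```python
import math
import random

def metropolis_hastings(log_likelihood, log_prior, propose, m0, n_iter, seed=0):
    """Minimal Metropolis-Hastings sampler with a symmetric proposal.

    log_likelihood(m) -> log p(d|m); log_prior(m) -> log p(m);
    propose(m, rng) -> trial model (proposal must be symmetric).
    Returns the chain of sampled models, including repeats on rejection.
    """
    rng = random.Random(seed)
    chain = [m0]
    log_post = log_likelihood(m0) + log_prior(m0)
    for _ in range(n_iter):
        m_trial = propose(chain[-1], rng)
        log_post_trial = log_likelihood(m_trial) + log_prior(m_trial)
        # log-space acceptance rule: accept if log(beta) < delta log-posterior,
        # which also accepts automatically whenever the posterior improves
        if math.log(rng.random()) < log_post_trial - log_post:
            chain.append(m_trial)
            log_post = log_post_trial
        else:
            chain.append(chain[-1])
    return chain
```

Run on a one-dimensional standard-normal target with a flat prior, the histogram of the chain (after burn-in) approximates the posterior, illustrating the convergence property discussed below.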
In practice, the transition probability is often chosen to be symmetric, that is, q(m^(k)|m′) = q(m′|m^(k)), and the logarithm of the probabilities is used in order to avoid round-off errors or the inability to represent the probability as a floating point number. This leads to the acceptance condition

log β < log p(d|m′) − log p(d|m^(k)) + log p(m′) − log p(m^(k)).    (3)

It has been shown that the distribution of models in the chain will converge to the a posteriori probability distribution p(m|d) (MacKay 2003), although in reality the tendency of the chain to get trapped in local minima means that for many problems of practical interest excessively long run times would be required to achieve convergence in a single chain. However, this shortcoming can partially be countered by running many chains in parallel, and the flexibility and ease of use of this algorithm have made it quite popular for geophysical applications (Sambridge et al. 2013).
The observed data are generally understood to represent the solution of a forward problem for the 'true' model, perturbed by an error term e, that is, d = g(m_true) + e, where g(·) is a vector function representing the solution of the forward problem for all data points. We ignore here that the parametrization scheme can never fully represent reality, and that therefore there is in fact no true model. Parametrization- or modelling-related errors, for example from insufficient spatial sampling, can be thought of as part of the error term e, if the unrepresentable part can be described statistically. An example is probabilistic earthquake location in a 1-D model, where lateral heterogeneities can be thought of as causing correlated measurement errors for neighbouring stations [see Lomax et al. (2000) for an implementation of that approach]. The probability of the data can thus be described by

p(d|m) = f(r; λ_1, λ_2, ...),    (4)

where f is the probability distribution of the measurement errors, r = d − g(m) is the residual vector, and the λ_i are arbitrary parameters of the distribution f; they can be vectors taking on different values for each data point, or parameters describing how the measurement errors of different data points are linked.
Although the McMC method imposes hardly any limitations on the assumed error distributions, most geophysical applications of the McMC method (e.g. Sambridge & Mosegaard 2002; Bodin et al. 2012b) assume Gaussian distributed errors, the probabilistic equivalent of minimization of the L2 norm in optimization problems. This assumption is theoretically justified by the idea that measurement errors arise from the sum of many small (independent) perturbations, which the central limit theorem tells us will approximately yield a normal distribution irrespective of the underlying distributions of each contributing perturbation. It is also practically justified by the empirical observation that histograms of residuals after model optimization often resemble normal distributions quite closely. For geophysical data it is often quite difficult to estimate the measurement uncertainty (Rawlinson et al. 2014) based on knowledge of the measurement process, and a part of the error can also arise from inadequacies of the forward model, either due to simplification of the physics or an overly simplified model parametrization, as described above. The data uncertainty itself, as expressed by the variance of the Gaussian distribution, then becomes an unknown parameter, whose probability distribution is determined within the McMC search (Bodin et al. 2012a). The parameters describing the noise distribution are often referred to as 'hyperparameters' (Gelman et al. 2004, ch. 5), and the approach has been described as 'Hierarchical Bayes', but in fact, as far as the algorithm is concerned, the noise parameters can be treated like any of the physical model parameters, and there is no obvious hierarchical relationship to the physical parameters but rather an interdependence. Instead, they could be considered 'nuisance parameters', a term coined by Jaynes (2003) to describe parameters that are of no inherent interest but must be taken into account, with their uncertainties, in order to determine the physical model parameters of interest. Importantly, we do not really need to reconstruct the probability distribution of the nuisance parameters, as long as their interaction with the physical model parameters is accounted for. The first part of this paper describes how this can be achieved for normally distributed errors by simple marginalization over the standard deviation (representative of the data uncertainty), leading to efficiency gains with respect to the standard approach of including the standard deviation as a parameter in the Markov chain search.
In practice, errors in most geophysical problems are not normally distributed (Rawlinson et al. 2014), the superficial similarity of typical distributions to a Gaussian one notwithstanding. In our experience, the primary deviation from normality is often the much more frequent occurrence of large error values than predicted, as seen in the example shown in Fig. 1. When strict data selection criteria are applied, these outliers are generally small in absolute number, but because of their extreme rarity under the Gaussian distribution they exert an undue influence on the model estimation: the model probability distribution will be skewed to substantially reduce the distance between predicted and observed values for these outliers, at the cost of slightly increasing the misfit for a much larger number of observations. The mitigating strategy in gradient-based optimization problems is usually to remove or heavily down-weight outliers prior to inversion (e.g. Egbert & Booker 1986; Kissling et al. 1994; Dreiling et al. 2018). In a Bayesian context this approach is unsatisfactory, as it introduces arbitrariness in the form of the choice of threshold. Even worse, the ratio in eq. (2) requires the number of data points to stay the same, as removal of individual data points during the chain construction would result in probability densities of different dimensionality, which cannot form the (dimensionless) ratio needed for application of the acceptance rule. A more promising approach is to replace the assumption of a normal distribution with that of a double-exponential distribution, also known as the Laplacian distribution, which is equivalent to imposing an L1 norm in an optimization context. This distribution falls off more slowly and is thus not significantly biased by the presence of outliers. However, the sharp peak of the Laplacian distribution is not a commonly observed feature of actual residual distributions for most geophysical problems, which, as pointed out above, seem to be modelled quite well by Gaussian distributions except for the presence of outliers. Whereas the assumption of a Laplacian error distribution is more robust, it is therefore known to be incorrect, and the model parameter uncertainties will consequently not be estimated correctly under this assumption. Furthermore, the width of the Laplacian distribution confounds the frequency of outliers with the typical uncertainties of valid measurements, making it difficult to interpret. The second part of the paper will thus introduce a mixed probability distribution which explicitly accounts for outliers.

GAUSSIAN DATA UNCERTAINTY
The most common assumption for the data uncertainty is the Gaussian distribution. Let us start with the assumption of identically and independently distributed data errors with an (unknown) standard deviation σ for a total of N observations; the approach will later be straightforwardly generalized to more complex multivariate Gaussian distributions:

p(d|m, σ) = (2πσ²)^(−N/2) exp[ −Σ_i r_i² / (2σ²) ].    (5)

Remember that r_i = d_i − g_i(m), that is, the residuals depend on m. With a known σ, the log-likelihood is

log p(d|m, σ) = −RSS/(2σ²) − N log σ + C,    (6)

where RSS = Σ_i r_i² is the residual sum of squares and C is a constant dependent only on N; that is, only the first term actually depends on the model parameters, and maximization of the likelihood corresponds to least-squares minimization.
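The log-likelihood of eq. (6) can be evaluated directly. The following is an illustrative sketch (not the study's code), with C written out explicitly as −(N/2) log(2π):

```python
import math

def gaussian_log_likelihood(residuals, sigma):
    """Log-likelihood for i.i.d. Gaussian errors, eq. (6):
    -RSS/(2 sigma^2) - N log(sigma) + C, with C = -(N/2) log(2 pi)."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return (-rss / (2.0 * sigma ** 2)
            - n * math.log(sigma)
            - 0.5 * n * math.log(2.0 * math.pi))
```

By construction this equals the sum of the individual Gaussian log-densities of the residuals, and only the RSS term varies with the model parameters.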
In transdimensional inference problems the number of parameters used to represent the model can change during the McMC search in a data-driven fashion; see Bodin & Sambridge (2009) for details. Therefore, in a transdimensional context, or when the model prior is non-uniform, the fixed standard deviation will affect the complexity of the model, or how far it is allowed to stray from its prior. When σ is unknown, it can be subject to a parameter search and a posterior PDF derived for it. It is then necessary to define a prior PDF for σ, p(σ), where most applications in geophysics have opted for a uniform distribution between zero and some set maximum value (e.g. Bodin et al. 2012a; Galetti et al. 2015; Ravenna et al. 2018). However, as the standard deviation is a scale parameter, the uninformative prior representing the state of no prior information is the Jeffreys prior p(σ) ∝ σ^(−1) (Jaynes 2003), which essentially says that there is no a priori knowledge of the scale, that is, the probability density in log-space, p(log σ) d(log σ), is uniform; for example, values ten times larger are just as likely as values ten times smaller. Bodin et al. (2012a) already pointed this out but nevertheless proceeded to impose a uniform prior, a practice followed by most geophysical applications, even though it implies a non-uniform prior in log-space. Although for any reasonable data set the choice between these two priors should not greatly affect the posterior PDF of the model parameters, use of the uniform prior might be the reason for the observed slight overestimation of the standard deviation in McMC tests on synthetic data where the data uncertainty was assumed unknown (Bodin et al. 2012a). The Jeffreys prior is not normalizable, but as the McMC chain only ever requires probability ratios, this shortcoming is of no concern here.
Whereas the standard approach includes σ as one of the parameters to be varied in the McMC search and calculates p(d|m, σ) at each McMC step, we do not really need to care about the PDF of σ because it is a nuisance parameter. We thus marginalize by integration:

p(d|m) = ∫_0^∞ p(d|m, σ) p(σ) dσ ∝ ∫_0^∞ σ^(−N−1) exp[ −RSS/(2σ²) ] dσ.    (7)

[Figure 1. (a) Histogram of traveltime residuals (Tilmann et al. 2001). At first glance, the distribution looks close to Gaussian, but there are a few outliers, only barely visible above the x-axis line. (b) The same residuals as in (a), but plotted as a quantile-quantile plot (Q-Q plot), which plots the observed values on the y-axis versus the theoretical quantiles of the distribution, here the Gaussian distribution, for the number of data points on the x-axis (see Aster et al. 2005, appendix B.7). The mean and standard deviation of the Gaussian distribution were estimated from the data. The red line shows the line of identity, around which the measurement points (blue crosses) would scatter if they truly followed a Gaussian distribution. The deviation from a Gaussian distribution is clearly marked by the sigmoidal shape of the Q-Q plot, that is, extreme values are farther from the mean than predicted at both the bottom and top end of the distribution. The slope of the identity line does not fit the implied slope for the central range of the residuals, implying an overestimation of σ.]
The integral in eq. (7) is solved easily via the standard definite integral

∫_0^∞ x^(−N−1) exp(−a/x²) dx = Γ(N/2) / (2 a^(N/2)).    (8)

The marginal log-likelihood is thus

log p(d|m) = −(N/2) log RSS + log Γ(N/2) + (N/2 − 1) log 2 + C′,    (9)

where C′ absorbs the constant from the Gaussian normalization. The second and third terms (and C′) are constant and can be ignored, as in the McMC algorithm only the ratio of the likelihoods, that is, the difference of the log-likelihoods, matters. The maximum likelihood model is still the model corresponding to the smallest RSS, but the decrease in likelihood away from this peak is softened by the logarithm. In a transdimensional context this encourages exploration of a range of models of varying complexity, some fitting the data less well with a simple parametrization and some fitting the data better with a more refined parametrization. No approximation is involved in the marginalization, so it will give the same results as the standard approach of explicitly including σ in the McMC search (provided the latter has converged properly), but at reduced computational cost. One disadvantage is that no empirical PDF is generated for σ in this way, but the a posteriori maximum likelihood value of σ can easily be set according to the residual RMS of the maximum likelihood model, or alternatively an approximate (mean) value for σ can be estimated from the residual RMS of the average model. If an actual probability distribution for σ were desired, it could easily be generated by sampling from the probability distribution in eq. (5) each time the model parameters are sampled.
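The marginalization can be checked numerically. The sketch below (an illustration, not the study's code) implements the model-dependent part of eq. (9) and compares likelihood differences between two residual vectors against a brute-force quadrature of the integrand in eq. (7); constants cancel in the difference, which is all the McMC algorithm ever uses:

```python
import math

def marginal_loglik(residuals):
    """Marginalized log-likelihood under a Jeffreys prior, eq. (9),
    keeping only the model-dependent term -(N/2) log(RSS)."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    return -0.5 * n * math.log(rss)

def numeric_marginal(residuals, smin=1e-3, smax=50.0, steps=200000):
    """Brute-force midpoint-rule integration of sigma^(-N-1) exp(-RSS/(2 sigma^2))
    over sigma, i.e. the integrand of eq. (7); returns the log of the integral."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    total, ds = 0.0, (smax - smin) / steps
    for i in range(steps):
        s = smin + (i + 0.5) * ds
        total += s ** (-n - 1) * math.exp(-rss / (2.0 * s * s)) * ds
    return math.log(total)
```

The difference marginal_loglik(r1) − marginal_loglik(r2) agrees with the corresponding difference of the numerically integrated values, confirming that the analytic marginalization introduces no approximation.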
Curiously, if one assumes a uniform prior for p(σ) and evaluates p(d|m, σ̂) at σ̂, the maximum likelihood value of σ, as an approximation for the marginalized distribution p(d|m) (Dosso et al. 2012; Sambridge 2014, appendix B), exactly the same expression as eq. (9) is obtained. Therefore, we are in the happy situation that an approximate solution for an arguably poorly motivated prior is identical to the exact solution for a better justified prior.
The expression generalizes easily to a few common, more general cases.
(1) Relative errors for different data points. Often some information is available on which data points are more or less reliable, even though the absolute uncertainty is poorly constrained. If this relative uncertainty can be expressed as normalized standard deviations for each data point, σ̃_i, then the RSS, that is, Σ_i r_i², can simply be replaced by Σ_i (r_i/σ̃_i)² (equivalent to the definition of χ² in classical statistics). In this case, σ represents the scaling factor for the normalized standard deviations.
(2) Correlated data errors. If the correlations are described by a data correlation matrix C_d, then the RSS must be replaced by the matrix-vector product r^T C_d^(−1) r. Again, σ is then a scaling factor.
(3) Multiple data types. In joint-inversion-type problems, if the data belong to M different classes or data types, each consisting of N^(k) data points with their independent but unknown standard deviation σ^(k), then for a full marginalization the integral in eq. (7) will turn into a multidimensional integral, which, however, is separable into M regular integrals and can thus be solved exactly as above. The resulting likelihood function is

log p(d|m) = Σ_{k=1}^{M} [ −(N^(k)/2) log RSS^(k) ] + C,    (10)

where RSS^(k) is the residual sum of squares of data type k and C is a constant. These different cases can be combined, of course.
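For the multiple-data-type case, the separability means the joint marginal log-likelihood is simply a sum of per-group terms. A minimal sketch (illustrative only; the grouping of residuals into data types is an assumption of this example):

```python
import math

def joint_marginal_loglik(residual_groups):
    """Model-dependent part of eq. (10): sum over data types k of
    -(N^(k)/2) log RSS^(k), each type's sigma^(k) marginalized out
    under its own Jeffreys prior. residual_groups is a list of
    residual lists, one per data type."""
    total = 0.0
    for group in residual_groups:
        n = len(group)
        rss = sum(r * r for r in group)
        total += -0.5 * n * math.log(rss)
    return total
```

With a single group this reduces to the single-data-type expression of eq. (9), as expected.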

DISTRIBUTION WITH OUTLIERS
In order to reduce the strong bias of extreme values on the overall probability p(d|m), a mixture of a Gaussian distribution with a uniform distribution is used to represent the data error for each data point (Fig. 2; the choice of the uniform distribution will be discussed later in Section 5). Under this assumption, the joint log-likelihood for independent measurement errors is

log p(d|m, σ, f) = Σ_i log[ (1 − f) (2πσ²)^(−1/2) exp(−r_i²/(2σ²)) + f/W ],    (11)

where f ∈ [0, 1] is the fraction of outliers and W is the width of the range spanned by the outliers. By definition, all of the data points d_i must be possible, such that the range of the uniform distribution should be sufficiently large to include all observations; in many cases it will be preferable to define the range based on the residuals r_i with respect to some reference model rather than on the raw data spread. We note that the normal part of the distribution is not truncated, but is assumed to have reached such extremely small values at the edges of the uniform distribution range that the implied probability densities reach 'impossibility' level for all practical purposes. What this mixed distribution achieves is that for small residuals close to the centre of the distribution, changes in the model will have an impact on the total likelihood similar to that for a pure normal distribution. The probability for large residuals on the far tail of the Gaussian distribution is essentially constant and independent of the model. Therefore these data points, likely outliers, are no longer assumed to carry any information about the model, and will not bias it. Also, an upward bias of the standard deviation estimate due to outliers is avoided. In a transdimensional context, a realistic estimate of the standard deviation is needed in order to choose models of an appropriate level of complexity; overestimated standard deviations would lead to oversimplified or oversmoothed models. Crucially, the assignment of a data point as outlier or good measurement is not made based on a threshold, nor only once, but can change as the model evolves, and thus as residuals become larger or smaller, and also as the values of f and σ change in the random walk. For some data points, the magnitude of the residual will be such that both terms in eq. (11) are of approximately similar size. In this case, the probability of these points will still change with model adaptations, but not as strongly as for an equivalent normal distribution. Effectively, the weight of these data points is reduced.
An analytical marginalization of the mixed distribution is no longer possible, and we must include σ and f as parameters in the Markov chain.The value of W should be fixed at a reasonable value, usually the maximum range of data residuals in some easily evaluated background model.In Appendix B, we discuss the choice of W in more detail.
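The mixture likelihood of eq. (11), and the per-point outlier probability it implies, can be written down in a few lines. This is an illustrative sketch, not the study's implementation; `width` corresponds to W:

```python
import math

def mixture_loglik(residuals, sigma, f, width):
    """Gaussian + uniform outlier mixture, eq. (11): each residual
    contributes log[(1 - f) N(r; 0, sigma) + f / W]."""
    ll = 0.0
    for r in residuals:
        gauss = ((1.0 - f) / (math.sqrt(2.0 * math.pi) * sigma)
                 * math.exp(-r * r / (2.0 * sigma ** 2)))
        ll += math.log(gauss + f / width)
    return ll

def outlier_probability(r, sigma, f, width):
    """Posterior probability that a single residual r is an outlier,
    i.e. the relative weight of the uniform term in the mixture."""
    gauss = ((1.0 - f) / (math.sqrt(2.0 * math.pi) * sigma)
             * math.exp(-r * r / (2.0 * sigma ** 2)))
    unif = f / width
    return unif / (gauss + unif)
```

For small residuals the uniform term is negligible and the point behaves as a valid measurement; for residuals far out on the Gaussian tail the outlier probability approaches one and the point's influence on the model vanishes, exactly the soft down-weighting described above.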
As above, we consider how this Markov chain approach is expected to generalize:

(1) Relative errors for different data points (applied to the Gaussian distribution). The normal part of the mixture distribution in eq. (11) can be straightforwardly extended with the normalized (relative) standard deviation of each data point, σ̃_i, in which case σ is interpreted as a scaling factor for these relative standard deviations:

log p(d|m, σ, f) = Σ_i log[ (1 − f) (2πσ²σ̃_i²)^(−1/2) exp(−r_i²/(2σ²σ̃_i²)) + f/W ].    (12)

(2) Multiple data types with separate, but unknown, standard deviations and outlier fractions. This entails the introduction of one parameter pair (σ^(k), f^(k)) for each data type. Then, eq. (11) can be extended by an outer summation over the different data types. Of course, each data type must have a sufficient number of data points associated with it in order to obtain meaningful estimates of σ^(k) and f^(k).
(3) Correlated data errors. What is meant by this is that the Gaussian part of the error distribution is described by a covariance matrix, whose off-diagonal terms describe the correlation between the measurement errors of different data points, while the outliers are assumed to be uncorrelated. In this case, each possible combination of a data point either being a valid measurement with some uncertainty or an outlier would give rise to a new term in an extended log-likelihood equation. For N data points, the summation over N terms in eq. (11) would thus have to be replaced by a summation over 2^N terms, which is not practical for the numbers of measurements typically encountered in geophysical inference problems. In the case of very strongly correlated errors it is thus advisable to simply subsample the data set, while very weak correlations can probably be safely ignored, as is done quite often in practice in any case, even when only carrying out least-squares minimization.

Application to mean value estimation
As a toy problem illustrating the effect of outliers, we consider the estimation of a single parameter m_1 = μ from several imperfect measurements, that is, g(m_1) = (m_1, m_1, m_1, ...)^T, where the length of the vector corresponds to the number of data points. Note that the argument of g is normally a vector, but here the model has exactly one unknown parameter, m_1; the value of this parameter is simply the predicted outcome of the measurement in the absence of errors. Samples are generated by drawing from a normal distribution with mean μ (here 2.0) and an assumed measurement uncertainty σ (here 0.2). In addition, with a certain probability, here 10 per cent, the values are replaced by uniformly distributed outliers; in this example the limits of the uniform distribution are set at −5 and 5 s, respectively. Effectively, this corresponds to drawing from the mixed distribution (eq. 11). The estimation of the three parameters of the mixed distribution is accomplished with the Metropolis-Hastings algorithm described above; for details see Appendix A and Table A1.
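The data-generation recipe just described, and the robustness it is meant to test, can be sketched as follows. This is a simplified illustration, not the experiment in Appendix A: here σ and f are held fixed at their true values and only μ is found by a grid search over the mixture likelihood, whereas the full algorithm samples all three parameters with Metropolis-Hastings:

```python
import math
import random

def draw_mixed_samples(n, mu=2.0, sigma=0.2, f_out=0.10, lo=-5.0, hi=5.0, seed=1):
    """Synthetic measurements as in the toy problem: Gaussian around mu
    with std sigma, replaced with probability f_out by uniform outliers
    on [lo, hi]."""
    rng = random.Random(seed)
    return [rng.uniform(lo, hi) if rng.random() < f_out else rng.gauss(mu, sigma)
            for _ in range(n)]

def robust_mean(data, sigma=0.2, f=0.1, width=10.0):
    """Maximum-likelihood mu under the mixture of eq. (11), found by a
    simple grid search for illustration."""
    def loglik(mu):
        total = 0.0
        for d in data:
            r = d - mu
            gauss = ((1.0 - f) / (math.sqrt(2.0 * math.pi) * sigma)
                     * math.exp(-r * r / (2.0 * sigma ** 2)))
            total += math.log(gauss + f / width)
        return total
    grid = [-5.0 + 0.01 * i for i in range(1001)]  # candidate mu values in [-5, 5]
    return max(grid, key=loglik)
```

With 200 samples and 10 per cent outliers, the mixture-based estimate lands close to the true mean of 2.0, whereas the plain sample mean is pulled towards the centre of the outlier distribution.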
Fig. 3(a) shows the histogram of samples and the resulting PDFs for the three model parameters for one realization with 200 samples, that is, a strongly overdetermined problem. Unsurprisingly, the Gaussian maximum likelihood model (red line in Fig. 3a) is grossly wrong, with a standard deviation overestimated by a factor of approximately five and a biased mean value. When using the mixed distribution p_{n+o} (with W = 10), not only are the estimated mean value and standard deviation closer to the true values, but the estimated errors of these values are also realistic, that is, the true value lies within one or two standard deviations. Although only one realization is shown here, we repeated this experiment hundreds of times to verify that the returned PDFs for the model parameters represent the actual uncertainty of the estimate.
We also carried out this experiment with only 20 samples (Fig. 3b), but otherwise identical parameters. Given the typical redundancy in geophysical data sets, this is a more realistic test than the previous example. It is clear that now the outcome will be highly dependent on the realization: on average we expect two outliers, but with such small numbers, realizations with 0, 1, 2, 3 or 4 outliers all have a reasonable probability, as illustrated by the particular realization shown in Fig. 3(b).

In geophysical problems, it is conceivable that much larger outlier fractions than 10 per cent occur in some contexts, depending strongly on the strictness of the data selection criteria applied before including measurements in the inversion. As an end-member case, we therefore also considered an example with a 60 per cent outlier fraction and 50 measurements in total, shown in Fig. 3(c). Here, on average only 20 of the measurements contain useful information. With such a high outlier fraction, the Gaussian estimate becomes nearly useless, although by eye it is still quite obvious where the true mean value is found. Not only is the estimated standard error of the mean very large, but the difference between the estimated and true value is approximately five times the estimated standard error. In contrast, the mixed distribution manages to return a mean value estimate close to the true value, with a reasonable error estimate.
As the final example of this section, we consider the case where the assumption of a uniform distribution for the outliers is actually wrong. Specifically, we construct an example simulating cycle skipping, whereby outlier measurement errors correspond to small integer multiples of the dominant period of the analysed signal. Cycle skips can occur, for example, in estimating body wave traveltimes from cross-correlations (VanDecar & Crosson 1990; Pavlis & Vernon 2010), or in estimating surface wave phase velocities (e.g. Sadeghisorkhani et al. 2018). The cycle-skipping errors are implemented by adding to or subtracting from the synthetic measurement, which already includes the Gaussian noise, an integer multiple of the dominant period, here assumed to be 1. In the simulation, negative cycle skips are more likely than positive ones, which can be a realistic assumption for some problems; see, for example, the real data case study presented in Section 4. Multiple cycle skips are allowed, but with diminishing probability. The true PDF for measurements is thus unbounded, but each realization will have a maximum and minimum measurement. For the example shown in Fig. 3(d), the probability of a (single or multiple) negative cycle skip was set to 17.5 per cent and the probability of a positive cycle skip to 7.5 per cent; the probability of a double negative cycle skip was set to 0.3 times the probability of a single cycle skip, and so on for higher orders (i.e. a negative skip by three cycles has 0.3 times the probability of a double negative skip). For positive skips, the probability density ratio of subsequent numbers of cycles skipped was set to 0.2. The assumed width of the uniform distribution was set to the actual range of the samples, as shown in Fig. 3(d). Because of the asymmetry in the likelihood of positive and negative cycle skips, the Gaussian mean value estimate is biased, and the standard error of the mean underestimates the true misfit; σ is strongly overestimated, just as in the earlier examples. Even though the mixed distribution makes a wrong assumption, the inferred values of μ, σ and f are reasonably close to their true values, also in relation to their respective probability distributions. Of course, an assumed measurement error distribution which explicitly takes into account the possibility of cycle skips would presumably have performed even better, as the current approach effectively discards the information contained in the cycle-skipped measurements. Such distributions, fine-tuned to the physics of the measurement process, are an interesting target for future improvements, but would introduce additional unknowns into the inference process.
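The cycle-skip perturbation scheme just described can be sketched as a small sampler. This is an illustrative reconstruction under the stated probabilities (a geometric decay per extra cycle is assumed to realise the 0.3 and 0.2 ratios between successive skip orders):

```python
import random

def add_cycle_skips(value, rng, period=1.0, p_neg=0.175, p_pos=0.075,
                    decay_neg=0.3, decay_pos=0.2):
    """Perturb a measurement by an integer multiple of the dominant period,
    mimicking cycle skipping. A negative skip occurs with probability p_neg,
    a positive one with p_pos; each additional cycle of skip occurs with
    probability decay_neg (resp. decay_pos) relative to the previous order."""
    u = rng.random()
    if u < p_neg:
        n = 1
        while rng.random() < decay_neg:  # multi-cycle skips, geometrically rarer
            n += 1
        return value - n * period
    if u < p_neg + p_pos:
        n = 1
        while rng.random() < decay_pos:
            n += 1
        return value + n * period
    return value
```

Applied to a clean measurement, about a quarter of the outputs are shifted by whole periods, with negative shifts dominating, reproducing the asymmetric, discrete outlier population of this example.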

Application to synthetic tomography problem
Next, we consider a 2-D traveltime tomography problem, as might be encountered in ambient-noise-based studies involving inversion of interstation group or phase arrival times at a selected period for the corresponding velocity variations. The model domain is a 400 × 200 km² area and is parametrized as slowness perturbations within 10 × 10 km² cells of constant slowness, resulting in 40 × 20, that is, 800 model parameters. A total of 34 stations is placed within the domain in an irregular configuration, such that some parts of the domain are well illuminated and others only sparsely sampled. With this number of stations, 561 possible pairs exist, but as in real ambient noise studies not all pairs yield successful measurements, a subset of 449 pairs is randomly selected. Anomalies are assumed to be small enough that the ray paths are not significantly perturbed by the velocity heterogeneity, and the reference model is assumed uniform, such that ray paths are straight lines and the Fréchet kernel matrix can be constructed from the lengths of the ray paths in each cell. The assumption of straight rays is not realistic for many ambient noise tomography studies, but the assumption of stronger contrasts inducing significant ray bending would not directly add insight into the issue of probabilistically accounting for outliers, and is therefore not considered here. The interaction of non-linearity and non-Gaussian distributions might, however, be an interesting topic for future investigations. Of course, as the problem is linear, it could be solved directly using singular-value decomposition, but it would have to be heavily damped due to sparse coverage in some parts of the model. Furthermore, a damped least-squares solution implicitly corresponds to the assumption of a Gaussian error distribution and also a Gaussian distribution for the model prior, p(m)
(Tarantola & Valette 1982). Therefore, we implemented a transdimensional inversion following Bodin & Sambridge (2009), but with the further simplification that for all grid cells it is determined with which Voronoi cell their centre is associated, and the whole grid cell is then given the slowness value of that Voronoi cell. Also, we parametrize the model in terms of slowness perturbations rather than absolute velocities. We use 20 chains, each running for 4 × 10⁶ iterations, of which the initial 2 × 10⁶ are used for burn-in and subsequently discarded. Parallel tempering (Sambridge 2014) is used to avoid getting stuck in local minima, and also to remove poorly converging chains from the final average. Twelve chains are run at T = 1, and the remaining eight chains are run at gradually increasing temperatures up to T = 5, with exchanges between chains allowed every 50 000 iterations. For chains at higher temperatures, acceptance of less well fitting models is more likely, allowing a wider exploration of the model space; see Sambridge (2014) for details. Uniform priors are assumed for the number of Voronoi cells, with a maximum of 200, and for the slowness perturbations, with limits of ±0.02 s km⁻¹. Finally, the average model is calculated from the 12 chains at T = 1 and the iterations post burn-in. This model is considered to be a representative estimate of the underlying model.
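The two geometric ingredients of this setup, building one row of the Fréchet kernel matrix from straight-ray path lengths and mapping a Voronoi-cell model onto the regular grid, can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the sampling-based length approximation and the use of SciPy's k-d tree are our own choices.

```python
import numpy as np
from scipy.spatial import cKDTree

NX, NY, CELL = 40, 20, 10.0   # 400 x 200 km domain, 10 x 10 km cells

def ray_row(src, rec, n_samp=4000):
    """One row of the Frechet kernel matrix G for a straight ray from src
    to rec: the path length (km) accumulated in each grid cell,
    approximated by fine, equal-step sampling along the ray."""
    src, rec = np.asarray(src, float), np.asarray(rec, float)
    length = np.linalg.norm(rec - src)
    t = (np.arange(n_samp) + 0.5) / n_samp          # midpoints of equal segments
    pts = src + t[:, None] * (rec - src)
    ix = np.clip((pts[:, 0] // CELL).astype(int), 0, NX - 1)
    iy = np.clip((pts[:, 1] // CELL).astype(int), 0, NY - 1)
    row = np.zeros(NX * NY)
    np.add.at(row, iy * NX + ix, length / n_samp)   # accumulate segment lengths
    return row                                      # traveltime anomaly = row @ ds

def voronoi_to_grid(nuclei_xy, nuclei_ds):
    """Map a transdimensional Voronoi model onto the regular grid: each grid
    cell takes the slowness perturbation of the nucleus nearest to its
    centre (the simplification described in the text)."""
    xc = (np.arange(NX) + 0.5) * CELL
    yc = (np.arange(NY) + 0.5) * CELL
    gx, gy = np.meshgrid(xc, yc)
    centres = np.column_stack([gx.ravel(), gy.ravel()])
    _, idx = cKDTree(nuclei_xy).query(centres)      # nearest nucleus per centre
    return nuclei_ds[idx]                           # flat, same ordering as ray_row
```

With these two pieces, the predicted traveltime anomaly of a trial model for one station pair is simply the dot product `ray_row(src, rec) @ voronoi_to_grid(nuclei_xy, nuclei_ds)`.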
We choose a sparse checkerboard model as a test case because the success of recovery is easily judged visually (Fig. 4a). For the first test with a purely Gaussian distribution, the forward-modelled traveltime anomalies are additionally perturbed with Gaussian noise with a standard deviation of 0.1 s, which corresponds to 9 per cent of the largest absolute traveltime anomaly (1.13 s) and 53 per cent of the mean absolute anomaly (0.19 s). We run the McMC search assuming two Gaussian error models: (i) a Gaussian distribution with the measurement standard deviation σ fixed to the true value of 0.1 s (Fig. 4c) and (ii) a Gaussian distribution with unknown σ and a Jeffreys prior for σ (Fig. 4e). In each case the central part with a high number of crossing paths is recovered very well, with minimal smearing and very good recovery of the absolute magnitude of the anomalies. Even anomalies in the poorly covered margin are recovered at least in the sign of the anomaly but, as expected, are more diffuse and the true anomaly is underestimated. Visually, both models are very similar, with only subtle differences. Both models actually slightly overfit, as the final residual RMS is a little lower than the standard deviation of the noise added to the data.
For the second test, for better comparability we take exactly the same traveltime measurements as in the first test, that is, using the same realization of the Gaussian noise and the same selection of ray paths. In addition, for 28 additional ray paths (equivalent to an ∼6 per cent outlier fraction), entirely spurious observations are generated by drawing from a uniform distribution with a range of −3 to +3 s (Fig. 4b).
The presence of outliers nearly completely ruins the McMC average model under the assumption of a fixed σ (Fig. 4d). Because the residual RMS is far larger than the imposed σ, the McMC search will seek to improve the data fit for the outliers, almost regardless of the price in terms of model complexity. In fact, the number of Voronoi cells in this test quickly converged to the maximum allowed value of 200. When we switch to the assumption of an unknown standard deviation, the McMC search is more tolerant of very large residuals, suppressing overly complex models and allowing recovery of a hint of the basic pattern in the well-covered area (Fig. 4f). However, the effect of outliers is visible as streaks, particularly when they are associated with long paths, and the image is essentially uninterpretable. Because we use the approach described in Section 2, we do not determine an explicit probability distribution for σ, but the RMS is approximately identical to its maximum likelihood value, here 0.70 s, that is, far larger than the actual σ of the underlying Gaussian distribution of the errors of the well-behaved major part of the data set. Therefore, the recovered model is both undercomplex and still strongly biased by the outlier observations.
Finally, when the mixed distribution is assumed (with unknown σ and outlier fraction), most of the anomalies of the input model are recovered very well, both in shape and amplitude, because the inclusion of the uniform distribution greatly diminishes the influence of the outliers (Fig. 4h). The residual RMS is actually significantly larger than for the inversions with the Gaussian error assumption, because no attempt is made to fit the outliers, which dominate the residual sum of squares. The mean posterior value for σ is 0.0112 ± 0.0008, and for the outlier fraction f it is 7.3 ± 1.4 per cent (ranges show one standard deviation), which is close to the true values. It could be argued that the recovery is still somewhat poorer than for the forward model without any outliers (Fig. 4c). This is to be expected, as the inference algorithm does not know a priori which measurements are the outliers. Reduced model recovery arises both because some of the more extreme good measurements might be associated with a significant probability of being an outlier, and because outliers might by chance fall close to the range of good measurements. In that case they are not clearly identifiable as outliers and still end up (erroneously) influencing the recovered model.
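The mixed error model used here can be sketched in a few lines. This is a minimal illustration under the assumptions stated in the text (valid measurements follow a zero-mean Gaussian with standard deviation sigma; outliers follow a uniform distribution of width W and occur with probability f); the function names are hypothetical.

```python
import numpy as np

def mixed_loglike(residuals, sigma, f, W):
    """Log-likelihood of the residuals under the mixed error model:
    each residual is drawn from N(0, sigma^2) with probability (1 - f),
    or from a uniform distribution of width W with probability f."""
    phi_gauss = np.exp(-0.5 * (residuals / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    phi_uni = 1.0 / W
    return np.sum(np.log((1 - f) * phi_gauss + f * phi_uni))

def outlier_probability(residuals, sigma, f, W):
    """Posterior probability of each data point being an outlier: the
    uniform component divided by the total mixture density."""
    phi_gauss = np.exp(-0.5 * (residuals / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    p_out = f / W
    return p_out / (p_out + (1 - f) * phi_gauss)
```

Because the uniform term keeps the mixture density bounded away from zero, a single gross outlier can no longer drive the log-likelihood towards minus infinity, which is exactly the mechanism by which its influence on the model is downgraded.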
We also checked the outcome of applying the mixed distribution to the data set of the first test, that is, when there are actually no outliers present (Fig. 4g). The recovered model is again visually very close to the models in Figs 4(c) and (e), and its RMS is accordingly only very slightly larger. The posterior mean estimate for σ is 0.0104 ± 0.0005 and for f it is 0.8 ± 0.6 per cent, that is, again both are close to the true values of 0.01 s and 0 per cent.
It has to be acknowledged that the deleterious effect of outliers has been exaggerated in this test compared to real applications, because in most geophysical inversions obvious outliers can be removed prior to the formal inference procedure. However, if this is done too aggressively, it is likely that anomalies are systematically underestimated because good measurements are erroneously removed. Therefore, the basic conclusions are expected to hold in more realistic settings, too.

U S E C A S E : S C A N A R R A Y A M B I E N T N O I S E 2 - D P H A S E V E L O C I T Y T O M O G R A P H Y
Finally, we apply the algorithm to real Rayleigh wave phase dispersion measurements obtained from stacked cross-correlations between stations of the ScanArray experiment (including dedicated temporary stations (Thybo et al. 2012), the stations of the permanent Swedish network and the NEONOR2 temporary stations) and permanent stations in Scandinavia, covering most of Sweden and Norway as well as the Baltic Sea and western Finland. Standard processing procedures were followed in constructing the cross-correlation stacks, described in detail in Mauerberger et al. (2019); they are considered as empirical Green's functions between station pairs. From the stacked correlation functions the phase dispersion curves are determined using two different automated algorithms, both based on the principle of measuring the zero crossings of the real part of the Fourier-transformed cross-correlation function (Ekström et al. 2009). The two algorithms represent end members with regard to how conservatively candidate picks are accepted. For this paper, we only consider the observations at a period of 4 s.

Figure 4. […] (eq. 11) with unknown standard deviation and outlier fraction. The RMS of residuals is shown above each subfigure. For comparison, the initial RMS in the uniform reference model is 0.28 s for the Gaussian noise (c, e, g) and 0.52 s for the Gaussian noise plus outliers (d, f, h).
Although a detailed analysis of the sources of error and the error patterns in the Scandinavian data set along the lines of, for example, Olugboji et al. (2017) is beyond the scope of the current work, some understanding of the differences between the algorithms and the resulting error patterns is useful for evaluating the effectiveness of the mixed-distribution approach. The first data set (termed HS) is based on the code described in Sadeghisorkhani et al. (2018), and the second approach (termed EKr) is based on a reimplementation of the algorithm described by Kästle et al. (2016). Both algorithms use a reference curve to pick the correct branch at long period and then progress to shorter periods. To compare the two automatic algorithms we use the same reference curve and feed them with the same whitened and windowed (group velocity filtered) cross-correlations. The reference curve of the average phase velocity is estimated following the approach of Sadeghisorkhani et al. (2018), making use of all ∼15 800 station pairs. This approach is based on fitting a Bessel function to the real part of the spectrum as a function of distance at different periods.
To calculate the dispersion curves for each station pair, Kästle et al. (2016) reduced the problem caused by spurious zero crossings by applying a low-pass filter in the amplitude-frequency domain before measurement (effectively smoothing the spectrum), whereas Sadeghisorkhani et al. (2018) first use whitening and then a group velocity filter in the time domain, which significantly improves the stability of the estimate.
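The zero-crossing principle both pickers share can be illustrated with a small sketch: for a diffuse wavefield, the real part of the cross-spectrum behaves like J0(ωΔ/c), so each observed zero crossing yields one candidate phase velocity per zero of J0, and the picker's task is to select the correct branch. This is a hedged sketch of the principle, not either group's code; the function name is hypothetical.

```python
import numpy as np
from scipy.special import jn_zeros

def candidate_phase_velocities(f_obs, distance_km, n_max=50):
    """Candidate phase velocities (km/s) for one observed zero crossing at
    frequency f_obs (Hz) of the real cross-spectrum of a station pair at
    the given interstation distance: if Re[spectrum] ~ J0(omega*dist/c),
    each zero z_n of J0 gives one candidate branch
        c_n = 2*pi*f_obs*distance / z_n.
    Branch selection (via a reference curve and smoothness criteria) is
    what distinguishes the HS and EKr algorithms."""
    z = jn_zeros(0, n_max)          # first n_max zeros of the Bessel function J0
    return 2.0 * np.pi * f_obs * distance_km / z
```

For example, at 4 s period (0.25 Hz) and 100 km interstation distance the physically plausible branch near 3 km/s corresponds to a high-order zero of J0, while the first zeros give unphysically fast candidates; without a reference curve the measurement is therefore highly ambiguous, and a branch mis-pick produces exactly the cycle-skipping outliers discussed below.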
Both algorithms mainly rely on a smoothness constraint as the picked curve is extended to shorter periods, but differ in the weight given to this constraint. There are further criteria in both methods discouraging picks far from the next expected value. In the HS algorithm, these criteria relate to the range of allowable interstation distances (see Ekström 2017; here we use measurements for interstation distances between 1.5 and 30 times the wavelength), the maximum deviation from the reference curve, the ratio of the amplitude where the measurement is made to the maximum amplitude for all periods, and the deviation of a picked zero crossing from its expected value based on extrapolation; finally, each trace is only accepted if not too many of its corresponding zero crossings are rejected based on the other criteria. The picked dispersion points should be highly reliable, but the algorithm relies heavily on the reference curve. The EKr algorithm mainly uses three criteria: first, the frequency-step width of the zero crossings has to be in an acceptable range; second, a threshold prevents jumps to the next branch (cycle skipping); and third, a gradient-based smoothness criterion with respect to the reference curve is imposed. Because it is less reliant on the reference curve, it can pick dispersion points with higher velocity variations, but if a previous period is incorrectly picked for any reason, the following picks at shorter periods are doubtful.
A total of 2897 measurements were obtained with the HS algorithm. Even with these measures, the residual distribution appears to be skewed and has a heavier tail than Gaussian, at least at the upper end (see Fig. 5, top; this observation implies that it is more likely to find paths associated with very low rather than with very high velocities). An even graver concern is the possibility that valid measurements could have been excluded erroneously. As Scandinavia has no significant sedimentary cover, the approach of selecting by similarity to the average phase curve is probably valid almost everywhere at the 4 s period considered here. Nevertheless, there is concern that it excludes measurements from the most anomalous areas, almost by design. This would be much more of a concern in more heterogeneous areas, where it is unlikely that a threshold can be found that does not exclude valid data in a highly systematic way and still does not introduce too many erroneous measurements.
Implementation of the EKr algorithm for the ScanArray data results in a very large number of picks (15 433). In order to keep the size of the data set manageable for the McMC algorithm, all picks for distances larger than 240 km were removed; this corresponds to about 20 times the wavelength at the 4 s period, and for larger distances the different branches get very close and are difficult to distinguish. Even the reduced data set of 2205 measurements appears to have a large number of outlier picks (see Fig. 5, bottom, for the residuals). In fact, there is weak evidence for cycle skipping visible in the histogram, which shows a small secondary peak near 4 s, indicating a systematic measurement error due to confusion of branches, not consistent with a Gaussian assumption for the measurement errors (although the secondary peaks are obviously also not consistent with a uniform distribution, an assumed uniform distribution implies that they do not provide constraints on the model; see the 'Discussion' section below). Interestingly, even though the RMS is unsurprisingly different for the HS and EKr data sets (1.19 and 1.76 s, respectively), the median absolute deviation (MAD) is identical, and visually the spread of the Gaussian function in the histograms in Fig. 5 is accordingly similar.
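The contrast between RMS and MAD reported here is easy to reproduce: a small fraction of uniform outliers inflates the RMS strongly while barely moving the MAD. This is a toy illustration with made-up numbers, not the ScanArray data.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 0.5, size=2000)           # well-behaved residuals
tainted = clean.copy()
tainted[:100] = rng.uniform(-6.0, 6.0, size=100)  # 5 per cent uniform outliers

def rms(x):
    return np.sqrt(np.mean(x ** 2))

def mad(x):
    return np.median(np.abs(x - np.median(x)))

# The outliers inflate the RMS strongly but barely move the MAD,
# mirroring the behaviour of the HS and EKr data sets described above.
print(rms(clean), rms(tainted))   # RMS roughly doubles
print(mad(clean), mad(tainted))   # MAD nearly unchanged
```

This robustness of the MAD is why the similar MAD of the two data sets suggests a common underlying error level for the valid measurements, despite their very different outlier content.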
The basic transdimensional McMC algorithm followed is essentially identical to the one used in the synthetic tomography test, albeit with different parameters (10⁶ iterations, 60 chains, of which 44 are at T = 1), except that ray paths are recalculated every 200 000 iterations (five times per chain). The chains at T = 1 were arranged into four independent groups, and for each group its respective mean model was used to recalculate the ray paths, that is, in total 20 ray-path recalculations were carried out. In fact, for Scandinavia the heterogeneities are small enough that this step was not really necessary, and straight ray paths would have been adequate.
We now determine the posterior model PDF using the McMC method, first under the assumption of a Gaussian measurement error with unknown standard deviation. Fig. 6 compares the models obtained for both data sets and the data fit achieved by them. Visually, the mean models are not too dissimilar, with the slowest velocities along the western coast and relatively faster velocities in the southeast of the resolved region and along the eastern edge. However, it is apparent that over most of the study region there is a bias towards faster velocities in the EKr-derived model. This bias is visible both in the map view of the differences in the top right of Fig. 6(a) and in the scatter plot on the bottom left. It can probably be attributed to a skew in the distribution of outliers. This skew might arise because noisy data tend to increase the number of zero crossings rather than reduce it; additional zero crossings manifest themselves as faster velocities, which could cause a bias towards faster velocities. The other difference occurs along the west coast next to Lofoten, where velocities in the EKr data tend to be slower.
When instead the mixed distribution is assumed for the measurement error, there is only a very minor bias remaining between the models estimated from the two data sets (Fig. 7a). Remaining differences are mostly only visible in areas of poor coverage (e.g. the Gulf of Bothnia, the northernmost arm of the Baltic Sea, and at the northern coast). However, within the Lofoten area along the west coast, velocities remain much lower for the EKr model than for the HS model. This is probably because many measurements in the EKr data set consistently show these low velocities, and the model therefore prefers to adapt rather than to increase the outlier percentage. It is likely that these measurements were systematically excluded by the HS measurement process because they deviated too much from the expectation based on the reference model, although very low velocities in this area are physically reasonable.

Figure 5. […] EKr: automatic phase arrival measurements based on a new implementation of the algorithm described in Kästle et al. (2016). Only measurements at less than 240 km interstation distance, equivalent to approximately 20 wavelengths, were used for the EKr data sets (blue dots), although measurements were available for all possible pairs (plotted in grey in the residual-versus-distance plot, for distances less than 500 km).
The residual RMS of the average posterior model for the assumed mixed distribution (Fig. 7b) is a little worse than for the one based on the assumption of a Gaussian distribution (Fig. 6b). This is to be expected, as outliers have a very strong effect on the RMS, and in the former case the likelihood depends less on the model adapting to these outliers. On the other hand, the MAD for the mixed-distribution models is actually lower, meaning that the typical data point is fit somewhat better. Again, this is to be expected, as the model in the mixed case does not need to accommodate measurements inconsistent with neighbouring measurements.
In order to check whether the use of the mixed distribution leads to local minima, the different chains were initialized with outlier fractions between 0 and 20 per cent. All chains quickly converged to a relatively narrow range of outlier fractions of 3-5 per cent for the HS data set and 10-12 per cent for the EKr data set (Fig. 8), with the exception of those run at higher temperature, which do not contribute to the posterior distribution estimate. The remaining spread represents the actual uncertainty of these nuisance parameters. Inspection of the different chains showed that there is only a minor trade-off between the standard deviation and the outlier fraction, but of course some trade-off between model complexity and the uncertainty parameters. This trade-off represents the ambiguity inherent in non-unique and sometimes contradictory data, which the McMC estimation properly captures.

D I S C U S S I O N A N D C O N C L U S I O N
First, it was demonstrated that for assumed Gaussian distributions with unknown standard deviation, the standard deviation as a nuisance parameter does not need to be explicitly included as a free parameter to be perturbed, for example, in a McMC algorithm, but can be marginalized analytically, which can somewhat reduce the computational effort needed.
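For a zero-mean Gaussian error model with N residuals r_i and a Jeffreys prior p(σ) ∝ 1/σ, the marginalization can be sketched as follows (a standard calculation consistent with the text; the normalisation constant is irrelevant when only likelihood ratios are compared):

```latex
\begin{aligned}
p(\mathbf{d}\,|\,\mathbf{m})
 &\propto \int_0^{\infty} (2\pi\sigma^2)^{-N/2}
    \exp\!\left(-\frac{\sum_i r_i^2}{2\sigma^2}\right)\frac{\mathrm{d}\sigma}{\sigma} \\
 &= \frac{\Gamma(N/2)}{2\,(2\pi)^{N/2}}
    \left(\frac{2}{\sum_i r_i^2}\right)^{N/2}
 \;\propto\; \Bigl(\sum_i r_i^2\Bigr)^{-N/2},
\end{aligned}
```

so that the marginal likelihood depends on the data only through the sum of squared residuals, and σ never needs to be perturbed explicitly in the chain.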
The second, more important conclusion is that a distribution involving the mixture of a Gaussian distribution with unknown standard deviation and a uniform distribution allows analysis of data sets tainted by a significant number of outliers, that is, where the distribution of measurement errors is not well described by a Gaussian distribution.The effectiveness of this approach was demonstrated for traveltime tomography with a synthetic example and a real use case.
Although both conclusions were demonstrated for a McMC algorithm, they can be applied straightforwardly to any gradient-free parameter estimation algorithm involving explicit evaluation of a likelihood function, for example the neighbourhood algorithm. One might question the assumption of a uniform distribution to represent the outliers, as in realistic examples there is often some relation of the 'true' value of a data point to the measurement even for outlier points, whereas the uniform distribution implies that the outlier measurements hold no information about the model whatsoever. If the actual distribution of outlier points is known, then it would of course be preferable to use this alternative distribution to exploit the information contained in the data points. However, if the distribution is not known, the choice of a uniform distribution is the conservative choice that prioritizes the avoidance of bias due to outlier points at the cost of potentially throwing away some information still contained in them.
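The mixture also suggests a simple way to temper individual data points in iterative schemes: weight each datum by the square root of its probability of being a valid measurement, which reproduces the steepest-descent direction of the mixture's negative log-likelihood in a weighted least-squares cost. This is a sketch under those assumptions; the function name and the use of NumPy are ours.

```python
import numpy as np

def mixed_weights(residuals, sigma, f, W):
    """Per-datum weights for an iterative gradient-based scheme: w_i is
    the square root of the probability that datum i is a valid
    measurement under the mixed distribution, so that the gradient of
    sum((w_i * r_i)**2)/2 is proportional, term by term, to the gradient
    of the negative log-likelihood of the mixture."""
    phi_gauss = np.exp(-0.5 * (residuals / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    p_valid = (1 - f) * phi_gauss / ((1 - f) * phi_gauss + f / W)
    return np.sqrt(p_valid)
```

Small residuals receive weights near 1, while gross outliers are driven towards zero weight, so their contribution to the descent direction vanishes smoothly rather than through a hard residual threshold.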
Some iterative gradient-search based inference schemes use ad hoc outlier removal at each iteration (e.g. Dreiling et al. 2018). The approach described here can be adapted to gradient-based algorithms to reduce the influence of outliers in a data-driven manner by determining the current best estimate of the standard deviation σ and the outlier fraction f of the residual distribution with respect to the reference model of each iteration, for example through McMC as was done for the toy problem discussed earlier (but imposing a zero mean). Then, each data point i is weighted with a factor

w_i = [ (1 − f)φ_gauss(r_i) / ( (1 − f)φ_gauss(r_i) + fφ_uni(r_i) ) ]^(1/2).

This weight factor is chosen such that the steepest-descent direction of a least-squares cost function based on the weighted data is the same as the steepest-descent direction of the negative log-likelihood function of the mixed distribution. However, as the nature of the true model might be such that the predicted residuals with respect to the reference model would initially appear as outliers, there is no guarantee that the iterative procedure converges to the same model as a fully exhaustive non-linear search would have, even when the unweighted least-squares cost function is convex. Also, the Hessians of the two cost functions are not the same, and the practicability of this approach needs to be tested in future work.

A P P E N D I X A

The Markov chains were run for 300 000 iterations in total, of which the first 100 000 were discarded as the burn-in phase. Table A1 shows all relevant parameters. A Gaussian proposal distribution with a standard deviation σ_prop was used, as indicated. New proposals were generated by changing one parameter at a time. The four acceptance ratios quoted for each parameter type in the table apply to the examples with (N = 200, f = 0.1), (N = 20, f = 0.1), (N = 50, f = 0.60) and (N = 200, f = 0.25, cycle-skipping simulation), respectively.

A P P E N D I X B : I N F L U E N C E O F T H E W I D T H O F T H E U N I F O R M D I S T R I B U T I O N
First of all, the definition of the uniform distribution is ambiguous as to whether the bounds apply to the actual data points or to the residuals. In some respects, this distinction is irrelevant, as the outlier term in eq. (11) is just a constant and does not depend on the actual values, but it has some implications when discussing bounds. Because the actual spread in data point values can be very wide in geophysical problems, and the actual spread of residuals in models with a large likelihood is not known a priori, we recommend considering residuals with respect to some easily evaluated reference model, for example, for a surface wave arrival time tomography problem, simply a uniform model. The bounds of the uniform distribution could theoretically be considered unknowns that must be estimated as part of the McMC search itself. Intuitively, the lower bound must be smaller than or equal to the smallest residual, and the upper bound larger than or equal to the largest residual. We did not carry out such a search, but can gain some intuition by simply evaluating the likelihood as a function of varying either the lower or the upper boundary, while keeping all other quantities at their mean values. The maximum likelihood is indeed obtained just at these extreme values (compare Fig. B1 with the actual distribution of samples in Figs 3a and b, top). Where a large number of outliers is expected (the case N = 200), the actually covered range gives a good impression of the underlying range, and the likelihood therefore decays quickly away from these limits. Where only a small number of outliers is expected, there is a large chance that they appear far from the boundaries, and the likelihood decays much more slowly away from the extremal values (e.g. for N = 20).
The bounds are fictitious in the sense that a uniform distribution with sharp boundaries is unlikely to truly describe the distribution of outliers. Instead, the bounded uniform distribution is used to ensure that the likelihood of unambiguous outliers does not depend on the model, so that they do not influence it. As such, there is no meaningful interpretation of the bounds. For performance reasons we also prefer to avoid the introduction of additional parameters into the McMC search. We therefore investigate the sensitivity of the results to the assumed value for W in Table B1, which lists the estimated distribution parameters for different values of W, and in Fig. B2, which shows the likelihood as a function of outlier fraction for the true width of the generating distribution (top panels), the empirically determined width, which by definition must underestimate the true width (centre panels), and a grossly overestimated width, here an overestimate by a factor of 3 (bottom panels). Table B1 demonstrates that generally the estimated distribution parameters show only a very weak sensitivity to W, with variations well within the uncertainty range for all parameters. This is true in particular for μ, which is the real target in our toy example, as σ and f only describe errors in the measurement process itself and can thus be considered nuisance parameters. The exception to this general rule is example (d), the cycle-skipping simulation. Here, the inferred values of σ increase and the values of f decrease with increasing W, implying that some of the singly cycle-skipped measurements are misclassified as valid measurements on the flanks of the Gaussian distribution for larger values of W. However, even in this case, the inferred values of μ do not seem strongly biased, and in practical applications it does not seem likely that the appropriate value for W would be so grossly overestimated.

Figure B1. […] (upper bound), that is, with the normal distribution parameters and the fraction of outliers fixed at their mean values, and the other bound set to its maximum likelihood value. The likelihood functions have additionally been normalized to have a maximum value of 1, and were calculated from eq. (11) based on the two realizations shown in Figs 3(a) (N = 200) and 3(b) (N = 20). The true limits of the uniform distribution used to generate the outliers are shown with black vertical lines.
Figure B2. […] (eq. 11) for the true values of μ, σ and W; for the cycle-skipping example no true value for W exists, and thus the observed range of samples is used instead. The centre panel shows the likelihood function with the empirical value for W, that is, the range of the samples, and the inferred mean values of μ and σ (as reported in the central legends in Fig. 3). The bottom panel shows the likelihood function with an assumed value of W = 30, which is three times more than the true value (in a-c); the values of μ and σ are set to their inferred mean values for W = 30 (see Table B1). In addition, the likelihood function for W = 30, but with the true values of μ and σ, is plotted as a dotted line for comparison; only in (d) are the two curves significantly different. The vertical black lines show the true outlier fractions of the generative distribution, whereas the dotted line shows the actual fraction of outliers in the particular realization (using privileged knowledge of the random numbers that were used to decide whether a particular sample was drawn from the normal or the uniform distribution). See text for an explanation of the outlier percentages reported in the headers of each panel.

Although the absolute probabilities differ by many orders of magnitude, the shape of the probability distributions, and particularly the values at which they attain their maximum, depend only weakly on the assumed width for the cases shown in Figs B2(a)-(c). The difference in absolute probability densities can be easily understood, as each outlier adds approximately a factor of 1/W to the final probability density, but it is usually not relevant, as only likelihood ratios will be considered in any case. As a further check, we also estimated the total number of outliers by two methods.

MaxL (maximum likelihood): We check for each data point whether it is more likely to be an outlier or a valid data point, and then sum the number of points with an outlier probability of more than 50 per cent, that is,

N_MaxL = Σ_i H( fφ_uni(r_i) − (1 − f)φ_gauss(r_i) ),

where H(·) is the Heaviside step function and r_i = d_i − μ.

Cum (cumulative): For each data point, we determine the probability of being an outlier, that is, the ratio of the probability of the uniform distribution divided by the total probability of the data point (the sum of the uniform and Gaussian probabilities), and then sum those fractional probabilities, that is,

N_Cum = Σ_i fφ_uni(r_i) / [ (1 − f)φ_gauss(r_i) + fφ_uni(r_i) ].

Those estimates of the number of outliers for the particular realizations are shown in the titles of the panels in Fig. B2 (compare the histograms in the bottom panels of Fig. 3). They are also very similar for different estimated widths W of the uniform distribution, giving further confidence in the consistency of the estimates. For example (d) and W = 30 (bottom panel in Fig. B2d), the maximum of the likelihood is attained at a much lower value of f than for the empirically determined W = 7.8 (central panel). This discrepancy is mostly due to the overestimation of σ, increases of which trade off with decreasing f. The discrepancy is also reflected in the estimates of the number of outliers based on the 'Cum' and 'MaxL' methods.
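The two count estimates can be sketched as follows (a minimal implementation of the formulas described above; the function name is hypothetical, and the residuals are assumed to be centred on μ already):

```python
import numpy as np

def outlier_counts(residuals, sigma, f, W):
    """The two outlier-count estimates described above.
    'MaxL': number of points more likely drawn from the uniform than
    from the Gaussian component (Heaviside of f*phi_uni - (1-f)*phi_gauss).
    'Cum': sum of the per-point outlier probabilities (fractional counts)."""
    phi_gauss = np.exp(-0.5 * (residuals / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    p_out = f / W                       # f * phi_uni
    p_in = (1 - f) * phi_gauss          # (1 - f) * phi_gauss
    n_maxl = int(np.sum(p_out > p_in))
    n_cum = float(np.sum(p_out / (p_out + p_in)))
    return n_maxl, n_cum
```

For a data set with a few gross outliers the two estimates nearly coincide; they diverge when many residuals sit near the crossover between the two components, which is exactly the regime discussed for example (d).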
Another way to think about the effect of the choice of W is to consider how it influences the magnitude of the residual for which the probability of a sample to be an outlier (i.e. to have been drawn from the uniform distribution) matches the probability to be a valid data point (i.e. to have been drawn from the Gaussian distribution). We term this residual magnitude the crossover distance, that is, the value r_cross = d_i − μ for which fφ_uni(r_cross) = (1 − f)φ_gauss(r_cross). Data points with residuals larger than r_cross rapidly lose the ability to influence the inferred model parameters. The equation is easily solved for r_cross, and we obtain:

r_cross = σ √( 2 ln[ (1 − f)W / ( fσ√(2π) ) ] ).

This relation is graphed as a function of W for the true values of σ and f in Fig. B3. We note that widths less than the actual range of measured data points are not consistent with the data, so the parts of the curve to the left of the dotted line are not relevant in practice. The most important message from this figure is that r_cross shows only a weak dependency on W, explaining the relative insensitivity of the results to the assumed W, which we demonstrated above. When, instead of using the true values of μ, σ and f, the values inferred from McMC, evaluated with the respective value of W, are used (circles in Fig. B3), there is a bias, but the same trend is followed. The bias arises because of the specificities of the particular realization considered. The exception is again example (d), where the crossover distance based on inferred values increases much faster than that predicted from the true values. This pattern is caused by the increasing values of σ that are inferred for larger W. The exact reason for this behaviour is not known, but we note that the crossover distance at the empirical value of W is close to the half-way point between the valid and the singly cycle-skipped measurements. Therefore, even a small increase in the crossover distance potentially results in a small number of outliers being allowed to contribute to the estimate of σ, which then results in a small increase in the estimated σ, which in turn allows further true outliers to contribute.
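The closed-form crossover distance can be checked numerically by substituting it back into the defining balance between the two mixture components (a sketch; the parameter values are arbitrary, and the formula assumes the argument of the logarithm exceeds 1, i.e. the components actually cross within the uniform bounds):

```python
import numpy as np

def crossover_distance(sigma, f, W):
    """Residual magnitude at which a datum is equally likely to be an
    outlier or a valid measurement, i.e. the solution of
    f * (1/W) = (1 - f) * phi_gauss(r; sigma)."""
    return sigma * np.sqrt(2.0 * np.log((1 - f) * W / (f * sigma * np.sqrt(2 * np.pi))))
```

Because r_cross depends on W only through a square-rooted logarithm, even a several-fold overestimate of W shifts the crossover point only modestly, which is the quantitative reason for the insensitivity demonstrated above.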

Figure 1. (a) Example of a P-wave residual distribution from a teleseismic traveltime tomography study (Tilmann et al. 2001). At first glance, the distribution looks close to Gaussian, but there are a few outliers, only barely visible above the x-axis line. (b) The same residuals as in (a), but plotted as a quantile-quantile plot (Q-Q plot), which plots the observed values on the y-axis versus the theoretical quantiles of the distribution, here the Gaussian distribution, for the number of data points on the x-axis (see Aster et al. 2005, appendix B.7). The mean and standard deviation of the Gaussian distribution were estimated from the data. The red line shows the line of identity, around which the measurement points (blue crosses) would scatter if they truly followed a Gaussian distribution. The deviation from a Gaussian distribution is clearly marked by the sigmoidal shape of the Q-Q plot, that is, extreme values are farther from the mean than predicted, at both the bottom and top ends of the distribution. The slope of the identity line does not fit the implied slope for the central range of the residuals, implying an overestimation of σ.

Figure 2. Illustrative plot of the probability density function for mixed distributions, where the standard deviation of the normal part of the distribution plotted in blue is two times the standard deviation of the distribution plotted in green. The fraction of outliers, that is, the probability that any given sample is drawn from the uniform distribution, is set to 0, 20 and 40 per cent, respectively. (All distributions are assumed to have the same upper and lower bounds for the uniform part.)
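The mixed density behind Fig. 2 is a simple weighted sum of a Gaussian and a uniform component. The following sketch (our construction; the function name `mixed_pdf` and argument names are ours, following the paper's eq. 11 conceptually) evaluates it and checks that it integrates to one for the outlier fractions used in the figure.

```python
# Sketch of the mixed error PDF: (1 - f) * Normal(mu, sigma) + f * Uniform(lo, hi).
# Names (mixed_pdf, f, lo, hi) are illustrative, not taken from the paper's code.
import numpy as np
from scipy import stats

def mixed_pdf(x, mu, sigma, f, lo, hi):
    """Mixture of a Gaussian (valid data) and a uniform distribution (outliers)."""
    gauss = stats.norm.pdf(x, loc=mu, scale=sigma)
    uni = np.where((x >= lo) & (x <= hi), 1.0 / (hi - lo), 0.0)
    return (1.0 - f) * gauss + f * uni

x = np.linspace(-5, 5, 1001)
for f in (0.0, 0.2, 0.4):        # outlier fractions as in Fig. 2
    p = mixed_pdf(x, mu=0.0, sigma=1.0, f=f, lo=-5.0, hi=5.0)
    print(f, np.trapz(p, x))     # each curve integrates to ~1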
(b) In fact, only one outlier was generated in this realization. As a result, the estimation of the outlier fraction f becomes very difficult and the McMC search considers values of f up to ∼35 per cent possible. The most likely value indicated by the histogram in Fig. 3(b) is just under 10 per cent, but because the PDF is skewed, the mean value of 14 per cent actually overestimates the true outlier fraction. The true values are still contained in the one-standard-error range around the mean, though. The standard deviations for μ and σ are increased by (very) approximately a factor of three, that is, similar to √10 = √(200/20), which would be the factor expected for a pure normal distribution.

Figure 3. Mean value estimation with outliers. (a) Sample size N = 200, outlier fraction f = 0.1. The histogram in the top panel shows one realization of a mixed probability distribution, where samples either are good measurements following a Gaussian distribution, or are outliers following a uniform distribution between the limits of the graph. The blue line shows the true distribution from which the samples were drawn (target distribution), the red line shows the pure Gaussian distribution estimated from the sample mean and standard deviation, and the magenta line shows the mixed distribution estimated from this particular realization. Note that the PDF of the uniform distribution is so small that its value is nearly indistinguishable from the x-axis at the plotted scale. The bottom panels show the estimated unnormalized posterior PDF for the parameters of the mixed distribution determined from an McMC search, with the red asterisk showing the true value, and the solid line a Gaussian fit to this distribution. The mean values and standard deviations for the three model parameters are also reported within the legend. (b) N = 20, f = 0.1. (c) N = 100, f = 0.06. (d) N = 200, f = 0.25; outlier samples are drawn from a distribution simulating cycle skipping (see text).
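The Fig. 3 experiment can be mimicked with a short random-walk Metropolis chain over (μ, σ, f). The sketch below is our own minimal construction, not the paper's implementation: it draws one realization of the mixed distribution (N = 200, f = 0.1, as in panel a), fixes the uniform bounds at the empirical data range, and samples the three remaining parameters under flat priors.

```python
# Minimal sketch (our construction) of the Fig. 3(a) setup: draw samples from
# the mixed distribution, then estimate (mu, sigma, f) by random-walk Metropolis
# with the uniform-term width fixed to the empirical data range.
import numpy as np

rng = np.random.default_rng(1)
N, f_true, mu_true, sig_true = 200, 0.1, 2.0, 0.2
lo, hi = -2.0, 6.0
is_out = rng.random(N) < f_true
d = np.where(is_out, rng.uniform(lo, hi, N), rng.normal(mu_true, sig_true, N))
W_emp = d.max() - d.min()          # empirical width of the uniform term

def loglike(mu, sigma, f):
    if sigma <= 0 or not (0.0 <= f <= 1.0):
        return -np.inf
    gauss = np.exp(-0.5 * ((d - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    p = (1.0 - f) * gauss + f / W_emp
    if np.any(p <= 0):
        return -np.inf
    return np.log(p).sum()

# Random-walk Metropolis with flat priors inside the admissible ranges
theta = np.array([d.mean(), d.std(), 0.5])
ll = loglike(*theta)
chain = []
for _ in range(20000):
    prop = theta + rng.normal(0.0, [0.02, 0.02, 0.02])
    ll_p = loglike(*prop)
    if np.log(rng.random()) < ll_p - ll:   # Metropolis acceptance rule
        theta, ll = prop, ll_p
    chain.append(theta)
chain = np.array(chain[5000:])             # discard burn-in
print(chain.mean(axis=0))                  # posterior means for (mu, sigma, f)
```

Histogramming the three chain columns gives (unnormalized) marginal posteriors analogous to the bottom panels of Fig. 3; the paper's full scheme additionally treats the bounds of the uniform term as discussed in Appendix B.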

Figure 4. 2-D tomography synthetic test with Gaussian errors (left-hand column) and Gaussian errors plus a few outliers (right-hand column). Colours show perturbations with respect to a uniform reference model. Triangles show station locations, which act as sources and receivers. The x- and y-axis scales are distances in km. (a, b) Input model. Black lines show ray paths. Magenta lines in (b) show additional ray paths with spurious measurements. (c, d) Mean model, assuming a normal distribution with known standard deviation in the McMC search. (e, f) Mean model, assuming a normal distribution with unknown standard deviation. (g, h) Mean model, assuming the mixed distribution in eq. (11) with unknown standard deviation and outlier fraction. The RMS of residuals is shown above each subfigure. For comparison, the initial RMS in the uniform reference model is 0.28 s for the Gaussian noise (c, e, g), and 0.52 s for the Gaussian noise plus outliers (d, f, h).

Figure 5. Distribution of phase arrival time residuals for 4 s Rayleigh waves measured from ScanArray ambient noise stacks for two types of automatic measurement tools. The residuals are relative to a uniform velocity background model optimized to minimize the HS residuals in a least-squares sense. The left-hand column shows histograms; dashed lines indicate the 10th and 90th percentiles, and header lines report the residual root mean square (RMS), the median absolute deviation (MAD, the median of the absolute values of all residuals) and the largest absolute value (MaxAbs); like the RMS, the MAD is representative of the typical magnitude of the residuals but is unaffected by outliers. The right-hand column shows a scatter diagram of residuals versus interstation distance. HS: automatic phase arrival measurements based on added functionality of the phase velocity measurement tool introduced by Sadeghisorkhani et al. (2018). EKr: automatic phase arrival measurements based on a new implementation of the algorithm described in Kästle et al. (2016). Only measurements at less than 240 km interstation distance, equivalent to approximately 20 wavelengths, were used for the EKr data sets (blue dots), although measurements were available for all possible pairs (plotted in grey in the residual-versus-distance plot, for distances less than 500 km).

Figure 6. (a) Comparison between posterior mean models based on the HS and EKr data sets under the assumption of normally distributed errors with unknown σ. Triangles show station locations of the ScanArray network. The main diagonal shows the derived models with a consistent colour scale, where cells with a posterior model standard deviation of more than 0.1 km s−1 are masked. The upper right panel visualizes the difference between the two models; the unit of the colour bar is again km s−1. The bottom left panel shows a scatter plot of the velocities for those cells that are present (not masked) in both models. Identical models would all fall on the line of identity. In order to help visual association of the points in the scatter plot with the map view, they are coloured based on the average of the velocities in the two models. (b) Residual histograms corresponding to the models shown on the left.

Figure 7. Comparison between posterior mean models based on the HS and EKr data sets under the assumption of a mixed normal and uniform distribution. Figure format as in Fig. 6.

Figure 8. Evolution of the standard deviation and outlier fraction as a function of iteration number for the HS (a) and EKr (b) data sets. Only models corresponding to the dark green portions are used for the final average model estimate. The grey and dark green curves show the chains at temperature T = 1, and the red curves show the chains at T > 1. The grey parts of the curves show the burn-in phase. The burn-in phase is intermittent at larger iteration numbers, because after each recalculation of ray paths a new burn-in phase was started.

Figure B1. Likelihood functions for the placement of the lower and upper bounds, p(d|μ, σ, f, x, max(d)) (lower bound) and p(d|μ, σ, f, min(d), x) (upper bound), that is, with the normal distribution parameters and the fraction of outliers fixed at their mean values, and the other bound set to its maximum likelihood value. The likelihood functions have additionally been normalized to a maximum value of 1, and were calculated from eq. (11) based on the two realizations shown in Figs 3(a) (N = 200) and 3(b) (N = 20). The true limits of the uniform distribution used to generate the outliers are shown with black vertical lines.
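The bound-placement likelihood can be profiled numerically. The sketch below (our notation; the realization and the function `loglike_upper` are illustrative, not the paper's code) holds μ, σ, f and the lower bound fixed and evaluates eq. (11) as a function of the assumed upper bound x; because the uniform density f/(x − lo) shrinks as the bound widens, the likelihood peaks at the smallest admissible bound, i.e. at max(d), and decays beyond it.

```python
# Sketch (our construction) of profiling the likelihood of the upper bound x
# of the uniform term, with mu, sigma, f and the lower bound held fixed.
import numpy as np

rng = np.random.default_rng(2)
d = np.concatenate([rng.normal(2.0, 0.2, 180), rng.uniform(-2.0, 6.0, 20)])
mu, sigma, f, lo = 2.0, 0.2, 0.1, d.min()

def loglike_upper(x):
    if x < d.max():            # every sample must lie inside [lo, x]
        return -np.inf
    gauss = np.exp(-0.5 * ((d - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log((1.0 - f) * gauss + f / (x - lo)).sum()

xs = np.linspace(d.max(), d.max() + 5.0, 200)
ll = np.array([loglike_upper(x) for x in xs])
ll -= ll.max()                 # normalize so the peak likelihood is 1
print(xs[np.argmax(ll)])       # maximum likelihood bound hugs max(d)
```

This reproduces the qualitative behaviour shown in Fig. B1: the likelihood is sharply limited on the inside by the most extreme sample and falls off gradually outward.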

Figure B2. Likelihood functions p(d|...) as a function of the outlier fraction for the realizations shown in Figs 3(a)-(d), for different assumed widths of the uniform (outlier) distribution, scaled to a peak value of 1; the scaling factor is given in the header line of each panel. The top panel in each subfigure shows the likelihood function (eq. 11) for the true values of μ, σ and W; for the cycle-skipping example no true value for W exists, and the observed range of samples is used instead. The centre panel shows the likelihood function with the empirical value for W, that is, the range of the samples, and the inferred mean values of μ and σ (as reported in the central legends of Fig. 3). The bottom panel shows the likelihood function for an assumed value of W = 30, three times the true value (in a-c); the values of μ and σ are set to the mean values inferred when W = 30 is used (see Table B1). In addition, the likelihood function for W = 30 but with the true values of μ and σ is plotted as a dotted line for comparison; only in (d) are the two curves significantly different. The vertical black lines show the true outlier fractions of the generative distribution, whereas the dotted lines show the actual fraction of outliers in the particular realization (using privileged knowledge of the random numbers that were used to decide whether a particular sample was drawn from the normal or the uniform distribution). See text for an explanation of the outlier percentages reported in the headers of each panel.
). For examples (a)-(c), both counts of the number of outliers are well within the uncertainty range of the estimated outlier fraction (as shown in Table B1).

Table B1. Resulting estimated values (mean and standard error inferred from the distribution in the McMC chain) for μ, σ and f for different assumed values of W. The first column corresponds to the empirically determined width W, for which the results are shown in detail in Fig. 3. For all experiments the true values are μ = 2.0 and σ = 0.2, with the f values as quoted in the headings. (a) N = 200, f = 10%.

Figure B3. Variation of the crossover distance as a function of the assumed width of the uniform distribution. The crossover distance is defined as the distance between a sample value and the inferred mean value at which the probability that a sample was drawn from the uniform distribution equals the probability that it was drawn from the Gaussian distribution (see text). The continuous blue line shows the theoretical crossover distance for the true values of σ and f. The circles show the inferred crossover distance based on the values of σ and f estimated from the actual realization (as shown in the top panels of Figs 3a-d), when the respective value of W is assumed for the uniform distribution term, σ_uni (eqs 11, 13).
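The crossover distance has a closed form that follows directly from its definition. Setting the weighted component densities equal, (1 − f) N(x; μ, σ) = f/W, and solving for |x − μ| gives the expression implemented below (our derivation from the definition in the caption; the function name is illustrative).

```python
# Sketch of the crossover distance: the offset from mu at which the weighted
# Gaussian and uniform components of the mixture are equally probable,
# i.e. (1 - f) * exp(-0.5 (x/sigma)^2) / (sigma sqrt(2 pi)) = f / W.
import math

def crossover_distance(sigma, f, W):
    ratio = (1.0 - f) * W / (f * sigma * math.sqrt(2.0 * math.pi))
    if ratio <= 1.0:
        return 0.0            # uniform component dominates everywhere
    return sigma * math.sqrt(2.0 * math.log(ratio))

# A wider assumed outlier distribution pushes the crossover farther out,
# consistent with the trend in Fig. B3 (values here are illustrative).
print(crossover_distance(0.2, 0.1, 10))   # ≈ 0.64
print(crossover_distance(0.2, 0.1, 30))   # larger
```

Because the width W enters only logarithmically, the crossover distance grows slowly with W, which is why the theoretical curve in Fig. B3 flattens for large assumed widths.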