Samples, Symmetries and Extensions: Exploring Parameter Space in Nonlinear Problems

Many scientific and technological advances require the values of a set of parameters to be constrained or estimated using recorded data. Geophysical parameters usually describe a geological process, or the state of the Earth at a particular time. It is often possible to model data that would be recorded for particular parameter values using a computable and in general nonlinear function, where its inverse does not exist as a unique-valued mapping. The space of parameter values must then be explored to find values that are consistent with measured data. This may be achieved by sampling values at a sufficiently dense set of points in parameter space, evaluating the forward function for each sample, and testing their modelled data against the measured data. However, for manyparameter (high-dimensional) systems, achieving sufficient sample density may be infeasible due to the ‘curse of dimensionality’ – the exponential increase in sampling required to achieve similar sample density in spaces of increasing dimensionality. This article shows that for each sample, significantly more information may be available than the forward function evaluation alone, due to symmetries and other properties of the physical system. Call this additional information the extension of a sample. In travel time tomographic imaging, the extension of every sample is shown to provide forward function values throughout continuous parameter semi-subspaces, almost for free. The dimensionality of these subspaces increases exponentially with the dimensionality of the problem, almost matching the rate of information increase required by the so-called curse of dimensionality. The use of extensions within physics-based sampling schemes may therefore contribute to increase the dimensionality of problems in which one can explore parameter space, solve inverse or inference problems, or provide distributed information about forward function values directly.


Introduction
Many scientific and technological advances require the values of a set of model parameters in parameter space to be constrained or estimated using recorded data in data space . It is often the case that it is possible to estimate for any particular values of using some known, computable, in general nonlinear function ( ), but the inverse function is unknown and does not exist as a unique-valued mapping. Since the goal is to estimate given values of , inverse or inference theory must then be deployed.
The most commonly used inverse methods involve linearising ( ) around an initial estimate of the parameter values, solving the resulting inverse problem of estimating values of that best fit recorded data under the linearised (approximate) version of , then re-linearising the function around the new parameter estimates and solving the newly linearised inverse problem. This process is iteratived to convergence. Given parameterised uncertainty on data , corresponding uncertainty in values of may be estimated using the gradients of linearised around the final parameter estimate (Tarantola 2005).
While apparently computationally efficient, linearised methods may fail to converge towards the bestfit solution when is nonlinear, and whether or not the best-fit solution has been found remains unknown. In addition, gradient-based uncertainty estimates account only for an approximation to the physics that is valid near the solution found, so they do not represent true parameter uncertainty. As a result, in significantly nonlinear problems, both the minimum-misfit solution found, and corresponding linearised uncertainty estimates may be highly misleading (Galetti et al., 2015;Zhang et al., 2018).
It is therefore increasingly common to solve inverse problems without linearization, using samplingbased methods. These methods explore space by evaluating the consistency between parameter values , recorded data , and any other pertinent a priori (prior) information that is independent of the data, at a succession of points or samples of . The samples may be chosen pseudo-randomly as in Monte Carlo methods (Gelfand & Smith 1990;Mosegaard & Tarantola 1995;Sambridge 1999;Malinverno 2002), from a deterministic set of values of each parameter such as in grid-search methods Lomax et al., 2009), or using a variety of other algorithms such as particle swarms (Kennedy & Eberhart 2001), simulated annealing (Kirkpatrick et al., 1983), genetic algorithms (Goldberg 1989;Sambridge and Gallagher 1993) or variational methods (Zhang & Curtis 2020a,b). Several of these can be tuned either to focus on exploration of more broadly, or to converge around already-discovered models that are consistent with recorded data and prior information.
In a third class of methods, samples can be distributed probabilistically to reflect to the state of knowledge prior to collecting a current data set. These so-called prior samples can then be used to train machine learning or other algorithms to solve inverse problems swiftly for any particular data set that may arise in future, and to provide fully nonlinear parameter uncertainty estimates (Devilee et al., 1999;Meier et al., 2007a,b;Maiti et al. 2007;Shahraeeni & Curtis 2011;Shahraeeni et al., 2012;Käufl et al., 2016;Karmakar et al., 2018;Nawaz & Curtis 2019;. In all such sampling algorithms, the function is evaluated for each sample. These evaluations may be used to find a set of models that are consistent with measured data, or for more formal uncertainty analysis using Bayesian inference (Tarantola 2005). However, in all cases the principle holds that where the function value has not been evaluated, it remains unknown. If we wish to infer results from such explorations it is therefore important to have included samples that span all relevant parts of parameter space (those with a significant probability of being close to the truth), with reasonable sample density. This is problematic because a priori we do not know which parts of are relevant. If is impossible to explore fully given available computing power, we must adopt a pragmatic strategy.
The barriers to exploring space using dense sampling are almost always a combination of (i) is too high-dimensional; (ii) prior information is too weak to constrain the search to a sufficiently small subspace or sub-volume of ; and (iii) the cost of evaluating each sample (the computational cost of evaluating ) is too high. The first barrier reflects the so-called 'curse of dimensionality' which states that the number of samples required to explore with a given sample density expands exponentially with the number of dimensions . The second simply states that prior information is too weak to overcome the expansion in computation imposed by this curse. The third indicates that even for low-dimensional problems the expense of computing the forward function may preclude dense sampling. Nevertheless, when parameters are numerous and (i) and (ii) apply, the third barrier is the least significant: one could reduce the cost of the forward evaluation almost to zero, and it would make little or no difference to our practical ability to explore high-dimensional problems by sampling. The curse of dimensionality is so extreme that without prior information to reduce its effects, even storing dense sample sets on disk becomes problematic or impossible as the number of parameters gets large. And finally another barrier exists: (iv) the 'No Free Lunch theorem' states that the performance of any one sampling algorithm when averaged over all problems is no better than the performance of any other (Wolpert & McReady 1987). More efficient sampling algorithms can therefore only be devised or conceived of for specific classes of problems, not in general.
Standard sampling methods attempt to increase efficiency by exploring only at a level of detail proportional to the probability of parameters being true, in other words proportionately to the consistency of parameters with prior information and recorded data. In Bayesian solutions, this consistency is embodied in a so-called posterior probability density function (pdf) constructed using Bayes rule as shown below. A common sampling strategy is to attempt to select samples with sampledensity equal to this pdf. This is called importance sampling, and may be approximated using Markov chain Monte Carlo algorithms (Green 1995;Mosegaard & Tarantola 1995;Malinverno 2002;Bodin & Sambridge 2009;Fichtner et al., 2019) or variational methods (Liu & Wang 2016;Zhang & Curtis 2020a,b), or performed exactly using so-called exact sampling methods (Propp & Wilson 1996;Walker & Curtis 2014).
With the exception of exact sampling methods, such algorithms only distribute their samples proportionally to the true posterior pdf in the limit of infinite samples. Indeed, the mathematical basis for such methods requires that in the infinite limit they do in fact explore all parts of parameter space for which the posterior pdf is positive. If we aim to discover all higher probability regions of parameter space to achieve importance sampling, we must first explore all of that space. Many importance sampling algorithms do not explore space efficiently since they instead focus sampling on higher probability regions that have already been discovered. In practice therefore, importance sampling algorithms are approximate, use a limited number of samples, and explore in detail only the higherprobability regions of parameter space that happen to have been discovered using the set of previous samples. Any inferences made from a finite set of samples are thus potentially biased to an unknown extent due to unexplored parts of that may have non-zero posterior probability.
If we are to believe conclusions reached using sampling methods, it is therefore important to reduce the computational power required to explore relevant parts of . This has been attempted previously by reducing the dimensionality of the problem to be solved (Douma et al., 1996;Grana et al., 2019), tightening a priori constraints on parameter values (Curtis & Wood 2004;Walker & Curtis, 2014;Nawaz & Curtis 2019;Linde et al., 2015), developing more efficient forward computations (Rawlinson & Sambridge 2005;Nissen-Meyer et al., 2014;van Driel et al., 2015;Krischer et al., 2017), approximating the forward function with an emulator that can be explored more rapidly (Das et al., 2018;Moseley et al., 2020), or improving predictions of where samples might usefully be located in unexplored parts of (e.g., Fichtner et al., 2019;Khoshkholgh et al., 2020). However, in all such studies the same principle holds: where the parameter space has not been sampled, the value of is unknown.
By contrast, introducing new analytical results within the sampling algorithm formulation has been shown to allow sampling to be avoided either partially or entirely in spatially structured inverse problems (e.g., Nawaz & Curtis 2017. Many Geophysical problems are formulated in a spatial domain, and have additional underlying structure imposed by their physics. The rest of this article therefore investigates the effectiveness of increasing information about the structure and symmetries of the physics in the forward function . It is common to assume that has properties of continuity or smoothness, but we may have far more knowledge about its structure and symmetries. While attempts have been made to include mathematical symmetries in sampling methods, for example in Physics (Chuang Chen et al., 2018;Ji-Yao Chen et al., 2018) and in Genetics (e.g., Felsenstein 1981), that work is highly specialised to those fields and not easily transferable. This article shows how understanding of the geophysics of problems to be solved may lead to highly efficient exploration of parameter space.
Mathematically, symmetries of a function are transformations ( ) that leave ( ) unchanged, that is ( ) = [ ( )]. The existence of symmetries thus implies that far more information is available from each sample than simply the evaluation of at those parameter values. For each sample evaluated ( indexes the sample number), let us call this extra information the extension of , defined as all information about other forward function evaluations that is available from ( ) without requiring any further evaluation of . The extension may include information in parameter space , in data space , or in a variety of other parameter or data spaces of varying dimensionalities.
This article shows that while some types of extension are already used in a number of Geophysical applications, using them directly for sampling and exploration problems may lead to radical new algorithms. The information in extensions depends on the particular physics of the problem, and this article suggests that it may be possible to construct highly efficient, problem-specific sampling algorithms, thus circumnavigating limitations imposed by the No Free Lunch theorem.
An example below concerns the sample extensions of wave or particle speed maps that are related to travel time data using ray-based physics. These extensions are shown to offer the possibility for extreme reductions in the number of samples required to explore to any desired level of detail compared to standard search methods. The criterion used to select the samples is entirely different to those underlying existing algorithms, and may eventually lead to methods that involve no random search at all. And importantly, as the number of parameters increases, information provided by extensions increases approximately proportionately to the added volume of parameter space (i.e., exponentially); this suggests that it may be possible to explore parameter space despite the curse of dimensionality, at least in some specific classes of problems. Similar models, data and physics to this example occur in diverse problems of acoustic, seismic and electromagnetic travel time tomography or in positioning and location problems, and in inversions for parameters controlling fluid flow given fluid break-through times. Sample extensions may therefore significantly expand the range of scenarios in which nonlinear uncertainty analysis can be applied.

Inverse Problem Formulation
Say we seek to constrain the values of a set of parameters described by vector ∈ where parameter space is defined by parameterisation . These will be constrained using recorded data in data space which is defined as the set of all possible measured values using survey or experimental design . Variable defines how the mathematical model ( ): → that relates parameters to data is parameterised since this is usually subject to choice; for example a spatial model might be represented within a variety of coordinate axis orientations and scalings, and may be parameterised using continuous basis functions of arbitrary resolution or approximated using different families of discrete cells. Variable defines data types, and how, where and when the data are measured, all of which may also be a matter of choice.
It is common to use Bayes rule to solve such problems: In equation (1), ( | ) is referred to as the likelihood and measures the extent to which parameter values are consistent with the measured data, and ( ) is the prior pdf which defines the state of information about independent of information in the data. ( | ) is called the posterior probability distribution function (pdf) and represents the final state of information about parameters given the prior information and the data , and is the solution to the overall inverse problem. ( ) is called the evidence and is a measure of the consistency between data , prior information about values , and the mathematical model ( ) that is used to relate the two within the likelihood.
Notice that all terms in equation (1) depend on either the model or data parameterisations or both, and that such dependencies were dropped from notation. The full expression is The prior pdf does not depend on since under the Bayesian approach it represents information that exists independently of the current data set. We will use the notation in equation (1) but dependencies in equation (2) are implicit.
It is often the case that data are assumed to be mutually independent, and to have a symmetric uncertainty depending on the distance from each measured datum under an ℎ-norm for some ℎ. In that case the likelihood term may be assumed to take the form (Tarantola 2005) where is the th of data in vector , and 1 is a normalising constant which ensures that the right side of the equation defines a valid probability distribution (its integral over parameter space must be 1). It is common to assume a squared norm (ℎ = 2) and Gaussian uncertainties with standard deviation on the th datum, whence the likelihood becomes The expression on the right can be rewritten ∏ (2 2 ) which is the product of likelihoods ( | ) of the sub-problem involving the single datum given parameters . We therefore obtain ( | ) = ∏ ( | ) =1 and Bayes rule becomes We may also be able to assume that the evidence can be written as the product of pdf's over the evidence in sub-problems involving individual data: e.g., for fixed model and data parameterisations the evidence is independent of parameter values , and hence it is constant. It can then be divided between the various sub-problems, where the latter equality is obtained by Bayes rule. Thus we see that the solution to the full Bayesian problem is the product of the posterior pdf's of the set of Bayesian sub-problems, each involving constraints from a single datum.
The expression for ( | ) in equation (6) evaluates the pdf of any particular set of parameter values . Often we wish to estimate statistics of this pdf such as its mean or variance. This involves calculating integrals of the form ∫ ( | ) ∈ to obtain the th moment of the th parameter.
In turn, this requires that we know ( | ) over at least a representative set of samples , and from equation (4) we see that this involves evaluating the forward function ( ) for each . For example, using equation (5) the mean of parameter is given by Even if we do not take a Bayesian approach to inversion, in order to estimate confidence intervals on parameters we must know ( ) for each of a representative set of samples . It is therefore necessary to explore ( ) over parameter space , and by equations (1) and (5) this exploration may be performed considering all data at once, or one datum at a time.
Almost all standard Bayesian sampling methods to solve nonlinear inverse problems consider all of the data at once and explore the solution using equation (1). This follows from an assumption that the more data constraints placed on parameter vector , the more limited will be high-likelihood areas of parameter space, hence the more tractable will become the sampling problem. Alternatively, one might choose to solve single-datum travel time tomography problems to obtain ( | ): as shown below, extensions to single-datum problems may be easier to describe and evaluate, and may span large hyper sub-spaces of . Single-datum solutions can then be combined using equation (5).

Sample Extensions
Define the th sample of parameter space to be = [ ,1 , … , , ] ∈ where , is the th of parameters of the mathematical model. For each sample we assume that the forward function ( ) has been evaluated. The total state of information derived from the th sample is then Here, ( , , ) is the forward function (noting explicitly that the model and data space affect the form of ), and represents the extension of sample . The extension is defined to be all information about other forward function evaluations that is obtainable from { , ( , , )} without performing further forward function evaluations. Extension is a set: it may contain information about the value of for parameter values other than , or values of different forward functions ̂(̂,̂,̂) which map samples ̂ in space ̂ to space ̂, where spaces ̂ and ̂ may differ from and , respectively. By analogy, we define the second order sample extension , of samples and to be all information about other forward function evaluations that is obtainable from the set { , ( ; , , , ) ; , ( ; , , , )} without performing further forward function evaluations. Again, either or both of model and data spaces (and hence the forward functions) might differ between parameter values and , hence the additional subscripts and throughout. Higher order extensions are defined analogously. In this article we refer to the first order sample extension simply as the extension of , and to simplify notation the forward function ( , , ) is denoted ( ) except where we wish to focus on the parameter and data space dependencies.
Symmetries always provide sample extensions. By definition, ( ) = [ ( )], so if ( ) has already been evaluated then [ ( )] is a member of the extension of that is available at the cost of evaluating ( ). In linear problems, the particular set of symmetries for which the parameters best-fit recorded data are referred to as the null space and in theory can be found using matrix methods (Snieder & Trampert 2000). In nonlinear problems they must be found using sampling methods (Fichtner & Zunino 2018). Symmetries extend the concept of a nonlinear null space, as they provide information about parameters throughout , irrespective of fit to a particular data set.
In addition to those derived from symmetries, a variety of other types of extension may also be available. Extensions generally fall into four main types: function value, data space, parameter space and interpretational extensions of ( ), defined as follows.

Function Value Extensions:
These provide the values of ( ) for parameter values other than . Functional symmetries lead to function value extensions as explained above. Other types of information may also exist within these extensions, such as function values for numerical combinations of already-evaluated parameter values.
Parameter Space Extensions: These provide forward evaluations from spaces that differ from . They typically arise by changing parameterisation in ways that affect the function value predictably. This may involve adding, removing or otherwise changing parameters. Parameter space extensions may be particularly useful in problems involving many sets of parameter values with different parameterisations, including model selection problems (Box & Draper 1987;Linde 2014), reversiblejump Monte Carlo methods (Green 1995;Malinverno 2002;Bodin & Sambridge 2009) and interrogation problems (Arnold & Curtis 2018).

Data Space Extensions:
These provide forward evaluations to spaces that differ from . Data space extensions are already used in various contexts. They are particularly important for experimental design problems where we attempt to define an optimal suite of data types, locations and recording times (all defined by design ) in order to obtain particular target information (Curtis 2004;Maurer et al., 2010;Bloem et al., 2020). In these studies, the ability to test different designs without having to evaluate different functions increases efficiency.
Interpretational Extensions: These provide alternative interpretations of the same parameter, data or function values. They are important in a variety of settings, and such extensions may differ substantially in their nature.
Extensions are particularly important when they can be evaluated at substantially lower cost than evaluating . In such cases, extensions of previous samples might be used to guide more efficient future sampling strategies, or may answer questions of interest directly. The importance of each type of extension becomes clear when discussing tangible cases, so we now introduce a specific application that we follow through the rest of this article.

Travel Time Tomography
Travel time tomography is an inverse problem that uses the measured travel times of energy or particles travelling between source-receiver pairs, to infer the speed or slowness (the reciprocal of speed) of propagation across the spatial domain through which the energy travelled. The parameterisation consists of any means to define the slowness map across the domain, so the parameter space is the set of all parameter values that define a valid slowness map. The data space is the set of possible travel times between each source-receiver pair considered in the problem. One set of travel times is recorded experimentally, which constitutes the measured data. Prior information usually exists about physical parameters, and must be combined with information from the data to find the posterior uncertainty in the parameter values. The posterior pdf that describes this uncertainty is the solution to the inverse problem.

Figure 1 (a) Map of slowness of wave propagation (colour) and ray (green) of first-arriving energy at receiver (triangle) from the source (star). (b) Corresponding map of time taken to travel from the source (star) to each grid cell (colours) and rays (black) to two of several receivers (triangles).
A specific example is introduced in Figure 1. Figure 1a depicts a gridded (cellular) slowness structure of a 2-dimensional isotropic medium. This is referred to as a slowness map, and represents sample where each element of vector is the slowness in one cell. A ray path between a source and receiver depicts the fastest route between the two points. For an isotropic source, this is the path taken by the first energy to arrive at the receiver, and we define its travel time to be . The forward function in this case calculates given map and the source and receiver locations. It is now standard practice to calculate travel times by solving the eikonal equation across the grid using a fast marching method (e.g., Rawlinson & Sambridge 2005). This computation is the forward function = ( ). If desired, rays can then be calculated by tracing the negative gradient of travel time from each receiver back to the source.

Function-Value Extensions
We obtain a first example of a function value extension by invoking the physics of refracting rays. For easy reference, we define this extension using the following Lemma: Lemma 1: Given travel times = ( ) through slowness map ∈ , the corresponding travel times ( ) for map are given by for any positive scalar .
Proof: Ray geometries are defined by angles of incidence and refraction at each cell boundary, which in turn are governed by Snell's law: 1 cos 1 = 2 cos 2 where angles 1 and 2 and slownesses 1 and 2 are defined for one cell boundary in Figure 2a. If the slownesses are multiplied by positive scalar value , Snell's law remains identical: appears on both sides and cancels, so the ray path remains in its original position. The travel time of any packet of energy travelling along a ray is the length of path in each cell multiplied by the cell slowness, summed over all cells. Since the ray and hence these path lengths do not change, the travel time along the ray in slowness map is the travel time in map multiplied by the same scalar ( Figure 2b). End Proof. Say we evaluate the travel time function = ( ) for any slowness map to obtain a sourcereceiver travel time . By Lemma 1, we obtain the corresponding travel times at an infinite number of other slownesses essentially for free by using the relation = ( ). All such models in the set { : ∀ ∈ ℝ >0 } are therefore in the extension of , where ℝ >0 is the set of strictly positive real numbers.
It is useful to visualise the extension of models in parameter space. Figure 3a plots a conceptual slowness map in a graph of its first three slowness parameters (labelled 1 to 3 ), and we evaluate = ( ). The extension { : ∀ ∈ ℝ >0 } is depicted by the red line in that figure, showing that we immediately know the function value at an infinite number of other models. A similar line, generally at different trajectories, exists in all parameter sub-spaces spanned by slownesses in .

(a) Sample (circle) and extension (line). (b) & (c) Similar to (a) but extensions are planar (red) if cell 1 (b) or cell 2 (c) does not lie on the ray. (d) Extension is a solid volume (red) if neither cells 1 nor 2 lie on the ray.
Function value extensions can also be derived using the following result: Lemma 2: Let 1 be a vector that describes a shortest travel-time path between a fixed source and receiver, through a medium with slowness map . Increasing the slowness in any part of that does not lie along ray 1 does not alter the smallest travel time source-to-receiver path 1 , nor the travel time 1 of the first-arriving energy at the receiver.
Proof: Consider any perturbation to parameter values such that ≥ 0 ∀ , and = 0 for cells intersected by ray 1 . By definition of the travel time along path 1 does not change in map + . Consider any second path 2 ≠ 1 between the source and receiver. By definition of 1 , energy travelling along path 2 in map has travel time 2 ≥ 1 . And by definition, cannot increase speed (decrease slowness) anywhere on path 2 . Thus path 2 has travel time 2 ≥ 1 in map + . Thus path 1 remains a shortest travel time path, and the travel time remains 1 . End Proof.
Lemma 2 defines symmetry 1 ( ) = + with defined as above, since ( ) = [ 1 ( )]. This provides travel times for a family of other maps without further evaluation of . The power of this symmetry can be observed by applying Lemma 2 to any of the models in the extensions of . Assume that cell in the model does not lie on the particular source-receiver ray path for which ( ) was evaluated. Lemma 2 states that slowness can be increased arbitrarily without changing travel time . Thus we observe that if = [0, … , 0, 1, 0, … , 0] is the unit -vector with th element equal to 1, then the set of models { + : ∀ ∈ ℝ >0 } lies in the extension of . Since there exists a similar extension to every model along the extension defining the line in Figure 3a, all models in the set { + : ∀ , ∈ ℝ >0 } lie in the extension of . This set is the plane shown in Figure 3b for the case in which cell 1 does not lie on the source-receiver ray, and in Figure 3c if cell 2 does not lie on the ray. If neither cell 1 nor cell 2 lie on the ray then every model in the extension in the 1 direction ( Figure 3b) can also be extended in the 2 direction, so the solid wedge of models in Figure 3d also lies in the extension of . Thus it is clear that if cells do not lie on the ray, there is an extension of the line { : ∀ ∈ ℝ >0 } into dimensions of parameter space; this creates a hyperwedge in + 1 dimensions of the -dimensional parameter space within which the forward function values are obtainable given a single function evaluation for parameters .  dimensions. This example shows the power of sample extensions: a space of such dimensionality is impossible to search with Monte Carlo or grid search methods (having a sample density of only 2 samples per dimension in 227 dimensional space requires that we evaluate the forward function 2 227 = 2x10 68 times. This is currently infeasible on any computer for any forward function, either to evaluate the function or to store the results. Yet sample extensions make this information available almost for free anywhere inside the hyper-wedge, at the cost of a single forward model evaluation and tracing a ray. Clearly this extension type may be important when the goal is to explore the values of the function across parameter space . Notice that the extension defined using Lemma 2 is exactly half of the null space that would be calculated using linearised methods if the recorded travel time was equal to the calculated value for any sample. Such methods define the set of linear combinations of slownesses that can be added to a particular solution * of an inverse problem without affecting the data values predicted by the linearised forward function ̅ ( + ) = ( * ) + [ ( * )⁄ ] where the th row of the matrix in square brackets is [ ( * )⁄ ] assuming column vectors. Standard practice is to evaluate singular values of the matrix of partial derivatives, identify parameter space singular vectors with associated singular values equal to zero, and define the linearised null space by adding any multiple of those vectors to the particular solution * . In the above example, some of these vectors would span all off-ray cells, and so the null space would define part of the extension of sample * under the approximate function ̅ ( ). Unfortunately this function only approximates the true forward function if gradients ( )⁄ are close to ( * )⁄ for all . This is only guaranteed in the vicinity of * for forward functions with smoothly varying first derivatives. Smoothness is not guaranteed if we decrease the slowness of cells that are not intersected by the ray. The fastest ray may then jump to a different location, changing first derivatives discontinuously. This explains why Lemma 2 restricts consideration to linearised null-space vectors that give positive perturbations to slownesses away from the ray: this guarantees that the ray remains constant, and that the corresponding extension exists for the true, nonlinear forward function ( ).
A similar type of extension exists for some cases where the data consists of recordings of waveforms of arriving wave energy. In waveform tomography, the recording of the first-arriving wave (rather than its travel time) may be used as data to invert for the slowness map. In that case the forward function involves either solving the wave equation to estimate wavefields between each source and receiver from which the first-arriving waves are extracted (e.g., Zhou et al., 2018), or estimation of first arriving waveforms directly (e.g., Lomax 1994;. It will usually be the case that slownesses across much of the map have no influence on the first arriving waveform, and for such cells we can invoke Lemma 2 to define similar extensions to those above. For a cell to have influence, energy must be able to travel from the source to that cell, and from that cell back to the receiver, before the end of the first arriving waveform. Say we evaluate two travel time calculations across the grid similar to that in Figure 1 (these are typically far cheaper than calculating waveforms): the first has a source at the true source location, the second has a source at the receiver location. These are sufficient to calculate the travel time along the fastest path from the source to any cell and then onward to the receiver. Any cell for which this total time is larger than that of the latest energy in the first arriving wave cannot affect that waveform. We may therefore increase the slowness in such cells without inducing a change in those waveforms, and all such perturbed slowness structures lie in the extension of the unperturbed model.

Data Space Extensions
Data space extensions are often available when the evaluation of at existing or proposed spatiotemporal data-recording points and types defined by experimental design , necessarily also produces synthetic data for other recording points and types. Such data do not lie in data space , hence they extend sample information to other spaces. Function evaluations at these other data points and types are often obtained almost for free.
A typical case occurs when evaluating requires a computer simulation of a physical process that evolves dynamically through space or time until it spans observation points in design . The full simulation then also represents values of the underlying process at other space-time points, perhaps also for different data types, the set of which define an experimental design ′. These synthetic data would be members of a different data space ′ , and hence are values of a forward function ̂( , , ′ ) where ̂: → ′ . Such information would be included in extension because it is available without requiring further forward evaluations -it merely requires that we store the dynamic evolution at other space-time points as we evaluate .  Figure 1a shows a slowness map , and a ray path between a source and an existing receiver that depicts the fastest route between the two points. Say the travel time along that ray is . Simulating using the fast-marching method constitutes forward function = ( ) and also provides travel times to every point on the grid (Figure 1b). So as a by-product of this calculation we obtain travel times to any other receiver location. Such travel times lie in a different data space, and hence in the data space extension. As another example, the derivatives of in space can be calculated by applying finite-difference operators to the values of in Figure 1b. If spatial derivatives of travel times are measured, as for example in stereo slope tomography (Tavakoli F. et al., 2017), then this represents a change in data type and hence experimental design. All such information is included in the data-space extension of slowness , and is available at low computational cost given the original calculation of ( ).

Parameter Space Extensions
Common examples of parameter space extension are referred to as grid refinement, grid coarsening and effective medium theories. In all of these, a newly gridded parameterisation ′ produces a sample ̂ from a new parameter space ′ . This sample has known function value ̂(̂) where ̂: ′ → , which is obtained without any additional function evaluations, and hence is in the parameter space extension.
Say parameterisation defines a gridded model where each grid-cell represents a region that has certain parameterised properties as in Figure 1, and evaluation of requires a hypothesised physical process (here, travel time) to be simulated across this grid. Any grid cell may usually be replaced by a partition of smaller grid cells with properties that exactly match those of the original cell, without changing the value of since the old and new grids represent exactly the same distribution of properties in the underling physical space or process. This refined parameterisation increases the dimensionality of , as depicted in Figure 5 for the same travel time calculation as in Figure 1. Grid coarsening (sometimes called up-scaling) is the opposite process, and may be applied for example where neighbouring cells contain identical properties, or affect the physical process as though they were a single cell with a single 'effective' property: such cells can be replaced by a single, larger cell containing the effective property without changing the evolution of the underlying physical process. For example, if all of the red cells in Figure 5 had the same slowness, any adjacent subset of them could be replaced by a single cell with the same slowness without changing either the ray path or the travel time of the wave travelling between the source and receiver. This change in parameterisation reduces the dimensionality of . Above we showed that Lemma 2 defines a symmetry 1 ( ) that provides travel times for a family of other models of the same dimensionalities. In fact it does so also for models of different dimensionalities, found for example by combining cells, as depicted in Figure 6. The slowness map shown in the upper left panel is the same as that in Figure 1. Four cells shown in the upper-right inset have differing slownesses, and none of those cells lie on the source-receiver ray path. Lemma 1 states that if we increase the slownesses of any such cells, the source-receiver travel time remains unchanged. This is illustrated in the lower plots where three cells have increased slowness to match the maximum of the original four slownesses. Since the four cells now have equal slowness they can be combined into a single cell as shown lower-right, and thus we obtain the slowness distribution in the lower-left panel. This has lower dimensionality 252 rather than the original 255, and the travel time for this model is known without further evaluations of ( ), and is equal to that in map . Any sets of neighbouring cells that do not lie on the ray path can be combined in a similar manner to provide models of significantly reduced dimensionality and with known travel time, all of which lie in the extension of the original sample .
The above are simple examples of effective medium theories. These essentially define symmetries across parameterisations of different spatial scales, and therefore also imply parameter space extensions. They are used to find properties of parameters at a coarser scale, which produce exactly the same data as media with (in general) different properties at finer scales. They exist not only for travel time data, but for data sets consisting of amplitudes, anisotropy, or even full seismic waveforms (Capdeville 2010;Capdeville & Cance 2015).

Interpretational Extensions
This type of extension involves no changes in parameter or data spaces, nor in the values of function , but extends the interpretation of the model, data or function values. This may be important in a variety of settings, between which such extensions may differ substantially in both nature and form.
One philosophical example is that under certain conditions on parameter and data uncertainty distributions, it can be shown that a linearised inverse problem solution can be formulated that has exactly the same mathematical form under a Bayesian or non-Bayesian interpretation of the parameter-data system (e.g., Snieder & Trampert 2000, equations 44-48). Since numerical values of the data, and the mathematical operations are identical, the numerical values of parameters in the solution are identical. Solving the problem within either the Bayesian or the frequentist philosophy thus also provides the solution to the problem interpreted within the other philosophy, for free.
Interpretational extensions of a different nature are often available from an evaluation of a forward function at one scale in a physical domain, or when the function is evaluated in dimensionless units. Extensions to any other scaling in the physical domain may then be represented by identical values in parameter space. For example, the scalar wave equation Then exactly the same waveforms would be obtained in a spatial domain that is times as large with a speed map that is times as fast -and we obtain that solution for free through a simple reinterpretation of the original solution. Thus, the solutions to infinitely many other problems may be interpreted from a single forward calculation, and all of these alternative interpretations lie in its extension.

Example: 3-parameter travel time tomography
We now define a particular travel time tomography problem that has only three parameters so that parameter space, samples and extensions can be represented fully in three coordinate axes. Equation (5) shows that the Bayesian solution can be found provided that the likelihood can be integrated through parameter space, and equation (4) shows that this typically requires that the forward function is known across this space. So our goal here is to define the forward function ( ) at a dense set of points distributed throughout . In this example, parameter vectors ∈ consist of three slowness values that define the full medium considered. Figure 7(a) shows a simple 3-cell model for the slowness structure around a source and receiver pair between which a first-arrival travel time is measured. The cells have laterallyconstant wave slowness (grey shade) and fixed geometry, leaving only the 3 cell slowness parameters ( = slowness in cell ; = 1,2,3) variable.
The function ( ) that predicts travel time data given a set of parameters comprises components Here, ( , ) is the slowness at location in the medium which is specified by parameters , is the ray path between the th source and receiver pair, and is the travel time along that path. While this integral looks linear provided that is linear in , in fact is always nonlinear since ray ( ) is the path along which energy takes least time to travel from source to receiver. This is also a function of the slowness distribution in the medium, hence ray also depends on parameters , and the problem is nonlinear.
The forward problem consists of calculating the travel time of the earliest arriving wave travelling between the source and receiver, in this case found by calculating the fastest ray path and integrating the slowness along its length. Figure 7(a) shows an example slowness structure and corresponding source-receiver ray path, for a particular parameter-space sample shown as a red spot in Figure 7(b). Lemma 1 showed that if the absolute magnitudes of the slownesses are scaled by any constant then the ray does not change, and the corresponding travel time is given by the same scaling. Hence all scaled structures lie in the extension of the sample, and these are represented by the blue line through the origin in Figure 7(b).
The ray does not pass through cell 3. Lemma 2 proves that making cell 3 slower than its current value can not change the shortest travel time; hence, all parameter sets constructed from those within the blue extension by making cell 3 slower, are also in the extension of this sample. These are represented by the green slice in Figure 7(b). Thus, it can be seen that any slowness map that lies in a significant part of a 2-dimensional subspace now has known travel time. Given a measured datum, the likelihood in equation (5) is also directly calculable at every such map , and this information is available given only a single forward function evaluation.
Figure 8(a) shows 15 random samples { : = 1, … ,15} generated from a Uniform prior distribution in the range (0,1] in each slowness. An average of 2 samples per dimension would require 2 3 = 8 samples, whereas 3 samples per dimension would require 3 3 = 27 samples, so 15 samples represents a density of between 2 and 3 samples per dimension. It can clearly be seen that much of parameter space has not been sampled: throughout the region of white space neither the forward function value (travel time) nor the likelihood is known, so we can not know whether or not we have sampled the high probability regions.
Samples in this figure have been colour coded to represent the ray that gives the fastest path from source to receiver. If the ray passes through cells 1 and 2 similarly to that in Figure 7 the sample is green, if the ray goes directly from source to receiver within cell 1 the sample is blue, whereas if the ray goes through cells 1 and 3 the sample is red. Similarly to Figure 7, samples whose rays go through cells 1 and 2 have both absolute scaling extensions (Lemma 1), and extensions in the direction of larger slowness in cell 3 (Lemma 2), and these are represented for each sample by green semi-planes in Figure 8(b). The symmetric situation is where rays pass through cells 1 and 3 and hence have extensions towards increasing slowness in cell 2, represented by red semi-planes in Figure 8(b). Rays that only pass through cell 1 have extensions towards larger values of slowness in cells 2 or 3: this creates volumetric 3D pyramid-shaped extensions, which are represented in Figure 8(b) as sets of multi-coloured, closely spaced planes so that extensions of one sample that lies within the extension of another can also be seen.  Figure 8(b), but these are dramatically reduced in volume compared to that in Figure  8(a). Figure 8(b) also shows that extensions render some of the random samples entirely redundantspecifically those that lie within the extension of another sample. These can be seen as pyramids within the largest pyramidal extension in that figure. The latter pyramid spans around 1/6 th of parameter space (and below it is shown that this could have been up to around 1/4), so a large proportion of forward model evaluations could be considered to be wasted if selected randomly without reference to extensions. Indeed, the situation might be significantly worsened had we deployed typical Markov chain Monte Carlo (McMC) sampling schemes: the strength of those methods is to focus future samples around previously discovered areas of high probability. If such a method discovers a high probability region within the largest pyramid, it would tend to focus future samples there too, with the possibility that much of this computation would be redundant.
Figure 9(a) shows an alternative set of 13 non-random samples. These have been chosen in a systematic manner: six closely spaced pairs of red and green samples lie either side of a line (not plotted) rising diagonally in the 1 = 1 plane from the point (1,0,0), terminating in a single blue sample in the 3 = 1 plane. Their importance is apparent in Figure 8(b) which shows their extensions: parameter space is now relatively densely sampled, with known travel time values across regularly distributed portions of continuous subspaces in either 2 or 3 dimensions. There are no redundant samples (the origin and two coordinate axes appear to be repeated by several samples but are precluded on physical grounds since zero slowness is impossible), and the distribution of known travel times and hence likelihoods is far more informative in Figure 8 than in Figure 7a or 7b, despite using fewer samples. The samples in Figure 9 have the following characteristics. The central point between each pair of red and green samples describes a slowness map which exhibits multi-pathing along two fastest rays (let us call this double-pathing). In fact, any map that lies exactly on the line passing between the sample pairs has two rays that give the same shortest travel time between source and receiver -one path going through cell 2, the other through cell 3. Moving the sample even infinitesimally in the direction of increasing slowness in cell 2 ensures that the path through cell 3 becomes uniquely fastest, and vice versa, thus producing each pair of samples shown. It can therefore be understood that the red-green pairs of samples are merely for illustration, since their extensions are both given from the single double-pathed point that lies in between: that point produces both the red and green extensions. Using double-pathed samples instead of sample pairs reduces the total number of samples required to produce this set of extensions to 7 -fewer than 2 samples per dimension, and fewer than half the number used in Figure 8.
The entire pyramid-shaped extension is generated by the single sample in blue. The sample has a straight ray between source and receiver within cell 1, and varying the absolute scaling of this sample produces the line between the sample and origin; the pyramid consists of all models that can be derived from those on that line by increasing the slowness in cells 2 or 3.
The blue sample is clearly the most important from the point of view of exploring parameter space since its extension spans almost a quarter of the total volume. It is in fact the single point of triplepathing: rays could take either a direct path, a path through cell 2, or a path through cell 3 through the corresponding slowness map, and arrive at the receiver in the same shortest possible time.
Graphically, if we were to place another pair of samples just to the left of the blue point they would have red and green 2D extensions similar to those of the other sample pairs, each 2D extension bounding the pyramid on one side. However, similarly to above, this implies that we only need the single, triple-pathed sample to define all three extensions: by increasing the slowness value of any two cells infinitesimally (such that the travel time through those cells increases by an infinitesimally small amount) we obtain the extension of the path that traverses the third.
It is possible to reduce sampling further while also increasing information density if we invoke more physics about rays, and allow a certain degree of approximation. Fermat's principle states that travel time varies to first order with changes in slowness, and to second order with changes in ray path. In other words, if we compare small changes in ray paths versus small changes in cell slowness along the ray, the latter have far greater effect on travel times. This is often used to justify linearising tomographic problems by fixing ray paths in their prior best estimate locations, and solving for slownesses along those rays that are consistent with data. By contrast, here we use Fermat's theorem only to infer that travel time varies slowly with changes in ray geometry, compared to changes in slownesses, leading to a different linearization as follows.
All sample pairs in Figure 9 describe maps with rays that pass through the same cells. However, rays in different sample pairs vary in their exact geometry. Provided ray paths do not change too much, Fermat's theorem states that it is reasonable to interpolate the travel time between the top and bottom point pair. This implies that to within a certain approximation, the travel time of any doublepathed slowness structure can be inferred from only two samples: the lower-left double-pathed sample (A in Figure 9), and the upper-right triple-pathed sample (B in Figure 9) since the latter is also double-pathed with similar rays. The approximate travel time of any intermediate sample is obtained by interpolation, and since Lemmas 1 and 2 apply to that sample, the approximate travel time throughout the entire continuous parameter space can then be estimated -in this case using only two samples. This is an example of a second-order (approximate) extension: information in the extension is obtained from pairs of samples rather than individual samples alone.
Whether 2 samples are used to obtain approximate information about the forward function continuously throughout parameter space, or 7 samples are used to obtain exact information throughout the coloured regions in Figure 9 (which can be interpolated to every point in that space), reasonably dense information about ( ) is now available throughout parameter space. Similar physical principles hold for any source and receiver geometry, so a similarly dense set of information could be obtained for each travel time datum providing information about ( ). Equation (4) can therefore be evaluated to produce an estimate of the posterior pdf at all points in parameter space, allowing integrals of moments in equation (8) to be calculated efficiently. Thus we solve the tomographic inverse problem.

Discussion
In the absence of strong prior information, the curse of dimensionality indicates that to explore solutions to high-dimensional problems in a computationally tractable manner we must obtain far more information than simply probabilities of point samples. The additional information contributed by the extension of each sample depends on the physics of the problem at hand. In the examples above, the ratio of the dimensionality of extensions to the dimensionality of parameter space approaches 1 as the density of cells in slowness maps (hence the number of parameters) increases. This exponential increase in the volume of information provided by extensions almost matches the order of increase in volume that must be searched due to the curse of dimensionality. This indicates that physics-based sampling methods may be tractable in high-dimensional systems even with weak prior information, at least for some classes of problem.
The results in the simple tomographic problem analysed above suggest that the information within a set of extensions is likely to be maximised using multi-pathed samples, for which the shortest sourcereceiver travel time occurs along multiple rays simultaneously. This is also true in more general tomographic problems: for an -pathed sample we need only increase the slowness anywhere along exactly − 1 of those rays in order to produce another sample for which (a) we know that the single remaining ray produces the first arrival, (b) we know its travel time, and hence (c) we know its extension. Systematically perturbing the slownesses on each subset of − 1 rays produces all single-pathed extensions, so all such extensions are included in the extension of the original -pathed sample.
This implies that to use travel-time tomographic extensions efficiently we might develop a procedure to find multi-pathed slowness maps. For example, considering samples at a time that have different rays, one could attempt to generate a single sample that produces the same first arrival time along a subset or all of the rays. For example, the bifocal design algorithms of Winterfors & Curtis (2008; focus on travel times from two samples at a time, and it is possible that a similar approach could be taken for the task above. The forward function of the original samples might then have to be evaluated, and these samples may have useful extensions in their own right. However, the single multi-pathed sample could contain additional information. For example, compare the combined extensions of any red and a green sample in Figure 7 with the single multipathed example lying between any sample pair in Figure 8: the latter produces both red and green extensions that are always larger than the former combination.

Figure 10 Cartoon of a slowness map comprising slow (white) and fast (blue) cells. Two rays are shown (yellow) between a source (star) and receiver (triangle).
A more systematic algorithm might generate multi-pathed slowness maps directly. Figure 10 shows a map that has two fast paths between the source and receiver. This map could be adjusted in a number of ways: given the travel time along a ray that follows the lower path, slownesses along the upper path can be directly assigned to match that travel time, thus generating a double-pathed model. Similarly a triple pathed model could be generated by suitably increasing the speed of cells that lie on the straight line between source and receiver. The rays can then also be shifted by decreasing the speed in blue cells and increasing the speed of their neighbours. All such maps will have highdimensional extensions in parameter space. This suggests that an algorithm might be conceived that systematically samples sets of multi-pathed rays; maps that embody those rays could be generated automatically.
Given travel time data between multiple source-receiver pairs, generally different parameter values will produce multi-pathed slowness maps for each datum. This might suggest that one should develop sampling algorithms that focus on multiple data at a time. However, above it was shown that interpretational extensions often allow the forward evaluation for one source-receiver pair to be converted to a similar evaluation for another source-receiver pair, by a simple rescaling. If the system can also be rotated and translated in space without affecting the forward function, then the forward evaluation for any source-receiver pair also provides a related forward evaluation for all other such pairs. The entire system of extensions found for any one datum may therefore be applicable to all others, by rescaling, rotation and translation, so single-datum extensions may remain important even in multiple-data scenarios.
The approximate extension proposed in the three parameter tomographic example demonstrates the increased efficiency with which information can be obtained. Not only are the number of samples required reduced to two, but the information obtained from their first and second order extensions spans all of parameter space continuously. While for the region outside of the pyramidal extension this information is approximate, it is nevertheless useful for estimating approximate moments of the posterior pdf using equation (8), or for identifying regions of parameter space that are more likely to provide good data fits and hence high posterior probabilities. As a result this information could be used to guide sampling algorithms that invoke proposal distributions for each new sample selected, such as McMC algorithms (Mosegaard & Tarantola 1995). Proposal distributions are used to select samples that are expected to have a good chance of fitting the data before their forward function has been evaluated, and improving proposal distributions has been shown to dramatically improve the performance of McMC (Khoshkholgh et al. 2020). While exact extensions provide posterior information directly, approximate extensions could guide McMC towards better areas to sample.
McMC and other importance sampling algorithms are usually used for integration, for example to estimate the expected value of any function over a posterior pdf, including moments of that pdf as in equation (8). Sampling apparently redundant areas within previously discovered extensions may then have value: if samples are distributed according to the posterior pdf then integrals such as equation (8) can be evaluated by simply averaging values of that function over the set of samples (Mosegaard & Tarantola 1995). Such algorithms might be made significantly more efficient if instead of evaluating computationally intensive forward functions for every sample, they used known values from previously discovered extensions. Indeed, McMC integration could be deployed with no further function evaluations at all, by estimating likelihoods for each new sample by interpolating previously discovered likelihoods between the nearest neighbour extensions.
The redundancy of samples chosen in Figure 8, the deterministic (non-random) nature of highlyinformative samples in Figure 9, and the method of selecting samples using Figure 10 all suggest that efficient sampling of parameter space for travel time tomography may require a departure from existing random sampling methods, and indeed that it might be achieved deterministically. However achieved, the arguments presented herein show that physics-based sampling algorithms avoid limitations that might otherwise be imposed by the No Free Lunch theorem, can be particularly efficient in terms of number of samples evaluated, may add continuous and widespread information throughout parameter subspaces for each sample evaluated, avoid redundancy that may be rife in randomly-selected samples, and crucially may provide information that grows exponentially with the number of dimensions. The latter property provides hope that sampling algorithms may in future be tractable even in high-dimensional problems.