Bayesian Population Correlation: A probabilistic approach to inferring and comparing population distributions for detrital zircon ages

Populations of detrital zircons are shaped by geologic factors such as sediment transport, erosion mechanisms, and the zircon fertility of source areas. Zircon U-Pb age datasets are influenced both by these geologic factors and by the statistical effects of sampling. Such statistical effects introduce significant uncertainty into the inference of parent population age distributions from detrital zircon samples. This uncertainty must be accounted for in order to understand which features of sample age distributions are attributable to earth processes and which are sampling effects. Sampling effects are likely to be significant at a range of common detrital zircon sample sizes (particularly when n 300). In order to more accurately account for the uncertainty in estimating parent population age distributions, we introduce a new method to infer probability model ensembles (PMEs) from detrital zircon samples. Each PME represents a set of the potential parent populations that are likely to have produced a given zircon age sample. PMEs form the basis of a new metric of correspondence between two detrital zircon samples, Bayesian Population Correlation (BPC), which is shown in a suite of numerical experiments to be unbiased with respect to sample size. BPC uncertainties can be directly estimated for a specific sample comparison, and BPC results conform to analytical predictions when comparing populations with known proportions of shared ages. We implement all of these features in a set of MATLAB R © scripts made freely available as open-source code and as a standalone application. The robust uncertainties, lack of sample size bias, and predictability of BPC are desirable features that differentiate it from existing detrital zircon correspondence metrics. Additionally, analysis of other sample limited Submitted to Chemical Geology March 28, 2019 Final version published in the journal Chemical Geology is available at: https://doi.org/10.1016/j.chemgeo.2019.03.039 This version is the accepted manuscript. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/3.0/ or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA. datasets with complex probability distributions may also benefit from our approach.


Introduction
This version is the accepted manuscript.  provided by basis-spline or b-spline functions (Fig. 3). In b-splining, model  : Third order basis splines provide an efficient way to represent probability density functions (PDFs) using a finite number of model parameters. (a) Each spline basis function b k is a piecewise function (boundaries, called knots, shown by dashed lines) that is composed of quadratic pieces and is smooth and differentiable (De Boor, 1978). Importantly, each basis function is defined such that it has non-zero value over only a limited region. A coefficient θ k is multiplied by the basis function to control its height. (b) In our application, third order basis splines, shown in color, are distributed at regular intervals along the x axis, which corresponds to zircon U-Pb age. The resulting modeled curve, s(x), is the sum of all basis functions multiplied by their respective coefficients. This example shows the effect of having the coefficient for each basis function equal to one. (c) This example shows a modeled curve when spline coefficients are varied, with these particular values generated randomly and shown beneath the basis functions. To simplify computation, our application uses these modeled curves as natural log-transformed probability density functions, as discussed in Sections 2.1, S1. model curve s(x|θ), the basis functions are multiplied by their respective coefficients and summed (Fig. 3b, c). In our method, many basis functions are distributed at fixed, regular intervals over an x axis that corresponds to zircon 143 U-Pb age and basis functions correspond candidate parent population PDFs 144 for a given sample (similar to Eilers and Marx, 1996). In order to simplify 145 our modeling, we use splining to model natural log-transformed probabilities, 146 such that each spline curve s(x|θ) is a natural log-transformed probability 147 model (see discussion of the advantages of our approach in Section S4). Each 148 spline curve s(x|θ) corresponds with a PDF that we define

Concentration of PDF curves
Each candidate parent population PDF, g(x|θ), is uniquely identified by its 150 set of basis function coefficients, θ. Sets of likely parent populations aggre-151 gated by our method will indicate the range of permissible basis function 152 coefficients warranted by a given zircon age dataset, providing a direct esti-153 mate of the uncertainty of age peak heights.

154
The fact that each basis function has non-zero value over a fixed and 155 localized area means that each parameter θ k has highly localized influence.

156
The localized influence of each θ k means that the probability of observing a 157 grain of a certain age is a function of a small subset of the model parameters,  ages increases proportional to the measured age (Gehrels, 2000). The pro-  portionality of analytical uncertainty to age in the <1 Ga range suggests that 176 a logarithmic age scale is appropriate in order to capture more precise age 177 peaks at younger ages. For ages >1 Ga, the 207 Pb/ 206 Pb ratio generally yields 178 the most precise ages, and analytical uncertainties associated with these ages 179 show an extremely slight, poor negative correlation with age (Gehrels, 2000),  a candidate parent population. The likelihood of a single age, P(d i |θ) is 188 determined by evaluating the PDF that corresponds to the coefficients θ for 189 datapoint d i (Fig. 4). As mentioned above, we use spline curves to repre-190 sent natural log-transformed probability density functions, which simplifies 191 our calculations (see further discussion in Section S1). Thus, each spline 192 curve must be exponentiated to evaluate likelihood. As mentioned above, where g(x|θ) is the spline function that corresponds to model parameters also calculated by Riemann summation.

205
Because each zircon age measurement represents an independent draw 206 from its parent population, the likelihood of observing an entire sample d 207 given model θ is the product of the likelihood of each individual sample age: where n is the sample size.
where range(Zircon U-Pb age) here refers to the log-age range over which 231 modeling is conducted, 1 -4000 Ma in our application. The expected value 232 shown here will yield a uniform distribution over x that integrates to 1. This 233 expected value is the peak of the prior distribution for each parameter.

234
The second assumption we can reasonably make about a PDF inferred 235 from a finite sample is that if the sample size increases, the height of the 236 age peaks should increase reflecting the increased confidence gained from a 237 larger sample. To form a rough guideline for how much peak height should 238 increase with added sampling, we use the example of a unimodal age dis-239 tribution. If a unimodal age distribution contains n ages, then the total 240 integrated probability mass away from the lone age peak should equal ∼ 1 2n .

241
This is so because if a second age peak existed in the sampled population 242 and constituted a share of the population of > 1 2n , then a sample of size n is 243 more likely than not to include at least one zircon age from this second age This version is the accepted manuscript.
tions that we tested, the Cauchy and multivariate Student t distributions, 251 were able to achieve this desired scaling, whereas the multivariate Gaussian  to a single parameter. In our case, the movement of the walker and the re-  This version is the accepted manuscript.
where d 1 and d 2 are the age data of sample 1 and sample 2, and d J is the 321 union of d 1 and d 2 , representing the combined data of samples 1 and 2. We  In general, (7) For ease of interpretation, we seek to scale Λ to occupy the range between 332 0 and 1. Such scaling can be accomplished by normalizing Λ with respect 333 to two end member scenarios: (1) identical parent populations, which will 334 produce the maximum Λ value and (2) parent populations with no shared 335 ages, which will produce the minimum Λ value. Here, we calculate these end 336 member results analytically, making the simplifying assumption that samples 337 are large enough to accurately represent parent population PDFs (Fig. 5).

338
Following from Eqns. (6, 7), the expected value of Λ, Λ ideal , is defined as where F 1 , F 2 , and F J are well characterized parent population PDFs and
(3) shows that multiplying F by a constant will also scale P(d i |F ) by 354 the same constant. Therefore, P(d i |F J ) can be calculated from P(d i |F 1 ) and 355 P(d i |F 2 ) as follows: Considering first the case of d i taken from sample 1, we assert that because We define Bayesian Population Correlation as a remapping of Λ onto 368 the range of 0 to 1, which we accomplish using the two endmember cases 369 described above:  separate PMEs. These taller peaks result in higher likelihoods for each zir-389 con age than the age peaks of the PME inferred for the sample by itself.

390
The higher likelihood of each sample age under the joint PME results in a 391 BPC value slightly greater than 1 (Eqns. 6, 13). This same phenomenon can 392 occur for two nearly identical samples. Despite this undesired behavior, we 393 believe the use of a likelihood ratio is justified because of its well understood 394 behavior. When BPC is greater than 1, estimated uncertainty (see Section   In order to verify that BPC behaves as intended in a simple case, BPC 406 was evaluated for samples of two synthetic, Gaussian parent populations 407 that are systematically displaced relative to one another (Fig. 6). For this 408 experiment, PMEs were modeled over an x value domain of 0 to 6.9, which 409 corresponds to the natural logarithmic transformation of the age range 1 -410 1000 Ma, thus paralleling our modeling procedure for zircon U-Pb age data 411 <1 Ga. One Gaussian parent is centered on 2.5 and a second Gaussian 412 parent of equal width is defined with its center located at 3, 3.5, 4, etc.

413
The standard deviation of each Gaussian parent population is 0.5. Random 414 values are drawn from these Gaussian parent populations and compared to 415 one another using BPC (Fig. 6). The analytical uncertainty ascribed to each      Fig. S5). In addition, we also find that the test statistics and have been used to assess detrital zircon correspondence, are biased according 461 to sample size for our resampling experiments (Fig. S4). The sample size bi-462 asing and dependence on KDE method of these previously published metrics 463 is explored in more detail elsewhere (Saylor and Sundell, 2016). BPC, which 464 shows minimal or no sample size biasing and is not based on kernel density 465 estimation, avoids these issues.  ages shared between two populations is systematically varied (Fig. 8). The    uncertainty of a typical detrital zircon age measurement. However, we sus-572 pect that they will be adequate for most detrital zircon age populations, and 573 Figure 2 shows that our implementation completely recovers the age peaks 574 revealed by a popular KDE method (Botev et al., 2010).
where θ J,i , θ 1,i , and θ 2,i are models randomly drawn from their respective

597
In order to test the reliability of σ BP C , we apply it to the synthetic samples 598 used in the previous section (Fig. 9). We calculate σ BP C for random sub-  Fig. 7). For samples drawn from identical 602 parent populations (Fig. 9a), σ BP C proves to be a conservative estimate of   617 We have shown that BPC is a correspondence metric that varies pre-618 dictably between 0 and 1, shows minimal or no sample size biasing (Fig. 7), 619 and for which uncertainties can be readily estimated (Fig. 9). These features 620 indicate that BPC is potentially a more reliable metric of correspondence be-621 tween detrital zircon populations than other metrics currently available. 622 We have shown that the functional relationship between BPC and the 623 proportions (f 1 , f 2 ) of ages that are shared between two populations can be 624 derived analytically from probability theory (Fig. 8, also see Section S5). for example, Niemi, 2013, and Section S5 for further discussion). 645 We have also shown that a PME is a representative set of the possible 646 parent populations that are likely to have produced a given detrital zircon 647 sample (Section 2). The diversity of PMEs mirrors the diversity of samples 648 of a certain size obtained from a population (Fig. 2). Thus, PMEs may prove such as to other types of detrital geochronology or thermochronology data.

669
As shown here, multi-modal datasets with limited sample sizes may not be