THE CHI-SQUARE TEST, ANNEALING VARIABILITY, AND ANALYTICAL CONSIDERATIONS

The way we apply statistical analysis rests on an idealized concept of how fission tracks are generated and measured, and practice becomes more complicated when circumstances depart from that ideal. The fission track (FT) method is a low-precision technique, and we overcome this shortcoming with numbers, that is, by repeating many low-precision individual analyses to provide a better estimate of some 'true' age or to make inferences about the overall age population. The chi-square (χ²) test is a simple diagnostic used to assess whether single-grain age estimates are statistically homogeneous and consistent with a common true age (1). Significance of the test is assessed by calculating the p-value, which is the probability that a value drawn from the χ² distribution with n-1 degrees of freedom (where n is the number of grains) exceeds the observed χ² statistic (1). A small p-value indicates that the data are inconsistent with a common ρs/ρi ratio (for the external detector method, EDM) or a common U/Ca ratio (for the apatite LA-ICP-MS ζ-method), with the convention being that a p-value <0.05 offers sufficient evidence against the null hypothesis (of a common true age) and a p-value <0.01 is strong evidence against the null hypothesis (1). Typically, a χ² p-value of <0.05 is taken to indicate that the single-grain ages are unlikely to be drawn from a single Poissonian distribution with a discrete mean value and may represent multiple age populations (1-3). However, a large p-value only means there is a lack of evidence against the null hypothesis; it does not lend support to an alternative hypothesis (1).
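The test described above can be sketched for EDM count data as follows. This is a minimal illustration, assuming the commonly used count-based form of the statistic, χ² = (1/(Ns·Ni)) Σ (Nsj·Ni - Nij·Ns)² / (Nsj + Nij), with n-1 degrees of freedom, where Nsj and Nij are the spontaneous and induced counts for grain j and Ns and Ni are their totals. The function names and the example counts are hypothetical and are not taken from the cited references.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Closed-form starting values of the regularized upper incomplete gamma Q(a, h):
    # Q(1, h) = exp(-h) for even df, Q(1/2, h) = erfc(sqrt(h)) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence: Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def edm_chi_square(ns, ni):
    """Return (chi-square statistic, p-value) for per-grain EDM track counts.

    ns, ni: spontaneous and induced track counts, one entry per grain.
    """
    if len(ns) != len(ni) or len(ns) < 2:
        raise ValueError("need matching count lists with at least two grains")
    total_s, total_i = sum(ns), sum(ni)
    stat = 0.0
    for s, i in zip(ns, ni):
        if s + i == 0:
            continue  # a grain with no tracks at all carries no information
        stat += (s * total_i - i * total_s) ** 2 / (s + i)
    stat /= total_s * total_i
    return stat, chi2_sf(stat, len(ns) - 1)


# Hypothetical counts for five grains (illustration only).
stat, p = edm_chi_square([12, 30, 25, 8, 40], [20, 52, 41, 15, 66])
print(f"chi-square = {stat:.2f}, p-value = {p:.3f}")
```

A p-value below 0.05 from such a calculation would, following the convention above, count as evidence against a common ρs/ρi ratio; it says nothing by itself about why the grains differ.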

The χ² test applied to FT data is not a 'data quality' filter. A χ² pass or failure does not equate to 'good' or 'bad' fission track data, respectively. A sample should still undergo evaluation if it passes χ², and if a sample fails, an attempt should be made to understand why. Even though the χ² statistic is designed to test the assumption of a single underlying age component for a sample, a problem encountered in many published fission-track studies is that the results of the χ² test are noted when samples 'pass' but are ignored or downplayed when samples 'fail', leading to a tendency to incorrectly judge the quality of FT data based on the χ² test. This is likely because, without more information, we do not necessarily know how to explain a χ² failure. For example, it has been acknowledged for decades that chemical composition affects fission-track annealing and consequently FT ages (e.g., 4-8); however, many studies lack proper quantification of mineral composition to evaluate the potential cause(s) of high age dispersion and χ² failure. The amount of thermal annealing, and therefore FT shortening, experienced by a sample may differ between grains due to variable chemical composition (2,5,8,9). In this case, a distribution of FT single-grain ages should exist, rather than a single common value (1,3). This point is critical: if a crystalline basement or detrital fission track sample is characterized by variable intra-grain chemical composition (or additionally, variable provenance for detrital samples) and has experienced thermal annealing, then a χ² failure should be expected. Such a sample would be considered 'multikinetic', and if the relationship between mineral chemical composition and discrete age populations can be characterized (i.e. variable thermal annealing kinetics), then each kinetic population within the sample will be sensitive to a specific temperature range, broadening the overall thermal sensitivity of the sample (10-12). This essentially means that multiple thermochronometers are contained within a single sample, characterized by grain age populations with distinctive annealing behavior. Multikinetic behavior is extremely useful because it allows the user to 'do more with less': we can constrain more thermal information with a single sample and obtain improved time-temperature resolution for thermal history analysis, which is the goal of most FT studies.
Other factors also matter when assessing fission track age dispersion and require attention when interpreting a dataset. Factors that play a role in statistical age dispersion include: (i) the precision of individual fission track grain ages within a population, (ii) the number of spontaneous tracks (Ns) counted by the analyst, and (iii) the total number of grains analyzed. Uranium concentration and Ns are the main contributors to the analytical uncertainty during fission track data collection (13,14). There is some concern that the standard error of LA-ICP-MS ages may be underestimated, causing age over-dispersion, and that the lack of a 'matched pairs' design (as used in the EDM) may produce inaccurate ages because U heterogeneity is difficult to account for (15). This is more of an issue for very young or U-poor apatite grains, and it can be alleviated by counting tracks only within the laser ablation spot or by carrying out multiple U spot analyses on a grain (15). Nevertheless, a closer examination of LA-ICP-MS error propagation may be warranted.
Young or U-poor samples typically have low Ns and larger associated errors, whereas old or U-rich samples have high Ns and higher precision ages. Generally, if a population of grain ages is characterized by low precision, then a χ² pass (p-value >0.05) and low overall age dispersion would be expected, whereas if a dataset contains highly precise ages, then a χ² failure would be expected (1,2). The order-of-magnitude better age precision obtained by the LA-ICP-MS fission track technique (partly due to direct and more precise measurement of U compared to the EDM) will often produce χ² failures that are difficult to interpret without additional information such as mineral composition, and will complicate the parsing of kinetic populations that align with radial plot mixture model predictions.
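The link between track counts and precision can be made concrete with a short sketch. It assumes pure Poisson counting errors for an EDM age and uses the first-order approximation that the relative age error is sqrt(1/Ns + 1/Ni + 1/Nd), where Nd is the dosimeter track count; the ζ calibration uncertainty is deliberately left out, and the function name and example counts are hypothetical.

```python
import math


def edm_relative_age_error(ns, ni, nd):
    """Approximate relative standard error of an EDM grain age.

    Assumes Poisson-distributed counts and ignores the zeta uncertainty.
    ns, ni, nd: spontaneous, induced and dosimeter track counts (all > 0).
    """
    return math.sqrt(1.0 / ns + 1.0 / ni + 1.0 / nd)


# Hypothetical grains: a young or U-poor grain versus an old or U-rich grain.
low_count = edm_relative_age_error(ns=5, ni=10, nd=5000)
high_count = edm_relative_age_error(ns=400, ni=800, nd=5000)
print(f"low-count grain:  {100 * low_count:.0f}% relative error")
print(f"high-count grain: {100 * high_count:.0f}% relative error")
```

Under these assumptions the counting term for the spontaneous tracks dominates when Ns is small, which is why low-count grains plot close to the origin of a radial plot.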
An exaggerated example of the effect of age errors is shown in Figure 1, which uses a radial plot (16) to display single-grain ages and their precisions for two synthetic samples of 20 grains each, with randomly generated ages between 50 and 100 Ma. We hold the grain ages fixed for both samples and change only the individual age errors at random. The first sample is precise (i.e. more ages plot further from the origin on the radial plot; Figure 1A), with individual grain-age errors varying between ~5-15%, while the second sample has low precision (Figure 1B), with age errors between ~12-85%. Clearly, the errors on individual ages influence the sample χ² probability (0.0 vs. 0.98), the percent age dispersion (20% vs. 0%), and the central age of the sample (~75 Ma vs. ~83 Ma), and they also modify the mixture model ages and the number of peaks identified during deconvolution (17-19).
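The same effect can be reproduced numerically with a small sketch. It does not regenerate the synthetic samples of Figure 1; the six ages and the two error levels below are hypothetical. It assumes the general homogeneity statistic χ² = Σ ((zj - z̄)/σj)², where zj is the log of each grain age, σj is approximated by the relative age error, and z̄ is the inverse-variance weighted mean of the zj.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable, integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    # Start from Q(1, h) = exp(-h) (even df) or Q(1/2, h) = erfc(sqrt(h)) (odd df).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))
    # Recurrence: Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0 - 1e-9:
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def homogeneity_chi_square(ages, rel_errors):
    """Return (statistic, p-value) testing for a common age on the log scale."""
    z = [math.log(t) for t in ages]
    w = [1.0 / (s * s) for s in rel_errors]  # weights = 1 / sigma^2
    z_bar = sum(wj * zj for wj, zj in zip(w, z)) / sum(w)
    stat = sum(wj * (zj - z_bar) ** 2 for wj, zj in zip(w, z))
    return stat, chi2_sf(stat, len(ages) - 1)


# Hypothetical grain ages (Ma), held fixed; only the errors change.
ages = [50.0, 60.0, 70.0, 80.0, 90.0, 100.0]
stat_precise, p_precise = homogeneity_chi_square(ages, [0.10] * len(ages))
stat_imprecise, p_imprecise = homogeneity_chi_square(ages, [0.50] * len(ages))
print(f"10% errors: chi-square = {stat_precise:.1f}, p = {p_precise:.2g}")
print(f"50% errors: chi-square = {stat_imprecise:.1f}, p = {p_imprecise:.2g}")
```

With identical ages, the precise set fails the test and the imprecise set passes, purely because the errors in the denominator differ.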
Fission track analysis is a geochronological method that is highly dependent on choices made by an analyst. Therefore, evaluating, monitoring for, and mitigating fission-track analyst bias should not be marginalized by users of fission-track data (20,21). Complex Earth system interactions often produce complex datasets that are difficult to interpret or understand if we do not have all the necessary information at our disposal. Consequently, fission track datasets that are 'too good', with perfect 100% χ² passes, should not be expected in every case and require scrutiny like any other heteroscedastic dataset. Imagine a hypothetical scenario in which an old basement sample yields Precambrian apatite FT ages. Most of the time we would assume a priori that this sample should yield a single age population and pass the χ² test, since all apatites are derived from the same rock. The analyst may struggle to find suitable grains for measurement because of the high track density in many grains on the mount, and may select those considered easier to measure, unconsciously biasing towards 'young' ages. The analyst could also count only some, but not all, of the tracks in the counting area, which would yield low Ns totals and magnify the individual age errors. Ultimately, the central age would be more or less representative of the sample and we would have a χ² pass. The sole-source apatite assumption and the omission of apatite compositional data would implicitly validate our glowing statistics, but this may not be true.
"Mistakes often come from assuming something is true just because there is little or no evidence against it." -R. Galbraith, Statistics for Fission Track Analysis, p. 190.
In addition to analyst bias, analytical canon has been to measure (up to) 20 grain ages and 100 track lengths per sample. The χ² test is sensitive to sample size. The 'power' of the χ² test increases when the sample size is large enough and the absolute differences between individual ages become a progressively smaller proportion of the expected or 'true' value. This means that any small deviations from an assumed 'true' age model in the dataset may appear statistically significant and provide support for the addition of mixture model age peaks (22). In other words, the alternative to the null hypothesis is a very general one when n is large. Therefore, the higher the number of grains analyzed for an FT dataset, the greater the potential to fail the χ² test due to sample size (2), which is even more likely if single-grain ages are high precision. The reader is referred to (1,15,20,22) for further discussion of these topics.
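The sensitivity to sample size can be illustrated with a deliberately artificial sketch. It assumes a hypothetical sample in which grain ages alternate between 70 and 80 Ma, each with a 5% relative error, so the scatter per grain is the same however many grains are measured. To keep the code short, the p-value uses the Wilson-Hilferty normal approximation to the χ² distribution rather than an exact calculation; all names and numbers are illustrative.

```python
import math


def approx_chi2_p(stat, df):
    """Approximate upper-tail chi-square probability (Wilson-Hilferty)."""
    c = 2.0 / (9.0 * df)
    z_score = ((stat / df) ** (1.0 / 3.0) - (1.0 - c)) / math.sqrt(c)
    return 0.5 * math.erfc(z_score / math.sqrt(2.0))


def p_value_for_n_grains(n, age_a=70.0, age_b=80.0, rel_error=0.05):
    """Chi-square p-value for n grains alternating between two ages (Ma)."""
    z = [math.log(age_a if j % 2 == 0 else age_b) for j in range(n)]
    z_bar = sum(z) / n  # equal errors, so the weighted mean is the plain mean
    stat = sum(((zj - z_bar) / rel_error) ** 2 for zj in z)
    return approx_chi2_p(stat, n - 1)


for n in (6, 20, 60):
    print(f"n = {n:2d} grains: p is approximately {p_value_for_n_grains(n):.2g}")
```

The per-grain scatter never changes, yet the p-value falls steadily as grains are added, which is the behavior described above: a modest departure from a single age becomes 'significant' simply because more grains were counted.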

Thanks to Rex Galbraith and Nathan Cogné for informal comments and discussion.
This technical note is a preprint and has not been submitted for journal publication. Please be aware that this manuscript was informally commented on but did not undergo formal peer review. These notes are meant for a general Earth science audience and users of thermochronology data. Please consult listed references for more detailed discussion of these topics.
Subsequent versions of this manuscript may have slightly different content.
Feel free to contact the author; all feedback is welcome.