A new method to study inhomogeneities in climate records: Brownian motion or random deviations?

Climate data are affected by inhomogeneities due to historical changes in the way the measurements were performed. Understanding these inhomogeneities is important for accurate estimates of long‐term changes in the climate. These inhomogeneities are typically characterized by the number of breaks and the size of the jumps or the variance of the break signal, but a full characterization of the break signal also includes its temporal behaviour. This study develops a method to distinguish between two types of breaks: random deviations from a baseline and Brownian motion. Strength and frequency of both break types are estimated by using the variance of the spatiotemporal differences in the time series of two nearby stations as input. Thus, the result is directly obtained from the data without running a homogenization algorithm to estimate the break signal from the data. This opens the possibility to determine the total number of breaks and not only that of the significantly large ones. The application to German temperature observations suggests generally small inhomogeneities dominated by random deviations from a baseline. U.S. stations, on the other hand, also show the characteristics of a strong Brownian‐motion‐type component.


| INTRODUCTION
Climate station records are affected by inhomogeneities resulting from relocations of stations or changes in the measuring methods. To study long-term changes in the climate accurately, the influence of these inhomogeneities on observational data needs to be reduced as much as possible. How well this is possible using statistical homogenization algorithms depends on the statistical properties of the inhomogeneities and of the station network.
Inhomogeneities are detected and corrected by comparing neighbouring stations with each other. For temperature data, these comparisons are performed by computing a difference time series of a candidate station and its neighbouring reference station. This removes the complicated and noisy regional climate signal and makes it much easier to detect and adjust the inhomogeneities.
The most important properties of the inhomogeneities are their size and frequency. However, these properties do not fully describe inhomogeneities present in a station series; also important is the temporal behaviour of the inhomogeneities. In this article, we will study two classical types of processes with very different temporal behaviours: random deviations (RDs) from a baseline and Brownian motion (BM). As explained in Section 2 in more detail, this matters because for the same break sizes these two behaviours differ in the total variance of the break signal and in how large the trend errors due to the inhomogeneities are.
The realism of benchmark datasets is important for a fair comparison of homogenization algorithms and vital for accurate estimates of errors remaining in the data after homogenization. Early validation datasets often had only one break inhomogeneity or only large ones, while not allowing breaks to be close together. They also often did not have an explicit reference series or assumed the reference was homogeneous. Menne and Williams Jr. (2005) generated an annual dataset with randomly positioned breaks in the reference time series and simulated the break signal as BM, just as  and Williams et al. (2012). Furthermore, Menne and Williams Jr. (2005) studied the sizes of breaks known from station histories and found that break sizes have a normal distribution, including small breaks. We will use this distribution to model breaks in this paper. Domonkos (2011b) generated several validation datasets from simple to complex. In the complex so-called Hungarian standard dataset, the inhomogeneities were tuned to make the properties of the detected breaks fit to those of the detected breaks in the Hungarian network. The breaks were in principle simulated as BM, but the maximum deviation was limited (Domonkos, pers. Comm., 2018). Furthermore, also many platform-like break pairs were inserted. The latter pairs with opposite signs make the sign statistics of consecutive breaks more similar to RDs, but the long-term behaviour would still be similar to BM.
The above-mentioned papers do not explicitly mention that they model breaks as BM. By accident, the simulated data of the benchmarking study of the COST Action HOME (2007)(2008)(2009)(2010)(2011)(2012) used RDs (Venema et al., 2012). Later studies used both processes. A study on daily temperature has generated data that is mostly BM (Killick, 2016), while newer studies on monthly temperature and daily humidity inserted breaks as RDs (Guijarro et al., 2017;Chimani et al., 2018).
In case of BM, the probability of a break going up or down is always 50%, independent of the previous level or jump size. However, for RD from a baseline, the statistic is different: 2 of 3 break pairs will have opposite signs (platform-like break pairs). Analysing the sign statistics of consecutive detected breaks, Venema et al. (2012) found that pairs with opposite signs were well above 50% (not fitting to pure BM). However, this percentage was slightly lower in the real European networks (59%) than in the simulated RD data (63%). These statistics suggest that breaks mostly behave like RDs, but also provide weak evidence that there may be a BM component.
The article is structured as follows. In the next section, we explain why it matters whether the break signal is modelled as RDs or BM. For the same break sizes, the variance of the break time series and the trend errors due to the breaks are much larger for BM. In Section 3, we describe our statistical model for the break signals and introduce the variance of the spatiotemporal difference considered as a function of time lag. This function is used as a metric that allows us to decompose a break signal into its RD and BM components. We derive equations how this metric depends on the strength of the noise as well as on the number and size of breaks for RD and BM inhomogeneities. In Section 4, we test this method by simulated data and propose a retrieval method for the characteristic parameters of the break signals. For accurate estimates of these parameters, the method needs a large number of stations.
The method is applied to monthly temperature observations from the United States and Germany in Section 5. In Germany, the break signal is dominated by RDs, no sign of BM is found. In the United States, there is a BM component and the total number of estimated breaks is found to be large. For this reason, we study in Section 6 the fraction of breaks that can be detected by statistical homogenization, which can explain a large part of the difference between the number of detected breaks reported in the literature and the number our retrieval finds. The article finishes with Discussions (Section 7) and Conclusions (Section 8). Figure 1 shows a conceptual model how to describe an identical break signal by two different approaches. In both cases, a random variable ΔT produces a typical break signal in the form of a step function. The question is which of the two sketched ΔT is assumed to be an independent random variable, the heights of the segment levels themselves or the jumps heights between two levels? In the first case RDs result (Figure 1a), in the second case BM (Figure 1b). Depending on our choice, different statistical properties of the break signal will result.

| THE DIFFERENT CHARACTERISTICS OF BROWNIAN MOTION AND RANDOM DEVIATION
The variance of an RD process is in principle equal to σ δ 2 , the variance of the random numbers used to create the segment levels. In this estimation, we neglect the slightly increased effective variance due to the assumed different lengths of the segments, which are caused by the stochastic occurrence of the breaks (Lindau and Venema, 2018b). Nevertheless, the jump heights between two levels consist of the sum of two of such independent random numbers so that their variance σ β 2 is twice as large.
The variance of a BM process grows linearly in time, because the kth value is defined as the sum of k random independent numbers. Thus, at the end of a BM time series, the variance is equal to kσ β 2 where k denotes the number of breaks and σ β 2 the jump height variance. Since the variance is linearly growing in time, the average variance over time is equal to the half: A comparison of Equations (1) and (2) shows that the variance of a BM-break signal is k times larger compared to an RD-break signal with the same magnitude of jump sizes, which is often used to characterize the break signal. The latter could also explain why many authors actually assume BM-type breaks. If the jump heights are used to characterize the break sizes, it feels appropriate to assume that they can be modelled as an independent random variable, which leads automatically to BM-type breaks.
Following Venema et al. (2012), we usually assume five breaks within a time series of 100 years for a typical European station. A difference time series will have twice as much breaks as two stations contribute. Thus, BM-type breaks produce a 10 times larger variance in the difference time series compared with those of RD type, when the jump sizes are equally large.
With a 10 times larger variance, BM-type breaks should be visible already by eye and the linearly growing variance seems unrealistic especially for longer time series. But the above discussion deals with the total variance. Differences in the mean of candidate and reference contribute to this total variance. However, this difference is partially due to real differences of the climate normals between two stations of unknown magnitude. As a consequence, we are forced to remove this mean difference.
Following Lindau (2003) and Lindau and Venema (2013), the total variance can be decomposed into an external and an internal part. This rule is often referred as to Law of the Total Variance: Equation (3) states that the variance of the external averages plus the mean internal variance constitutes the total variance. In the present case, we consider several time series of a dataset and apply the above variance decomposition. The variance between the different means of each time series is considered as external and the mean variance within the times series as internal. As shown in Appendix A, two-thirds of the total variance of a BM process is allotted to the variance of the means (the external part) and only one-third to the internal variance. Because real climatic differences between two stations also contribute to the external variance, we need to remove it completely in our data analysis and thus also inherently remove that part of the external variance, which is caused by the inhomogeneities. The remaining internal variance amounts to only σ β 2 k/6. For RD breaks, the major part of the variance (a fraction of [k − 1]/k) is internal. Thus, for k = 6, the internal BM variance is about twice as large as for RD. However, the break variance by itself cannot be used to distinguish the cases. The more so as observed data can have both BM and RD characteristics.
Even more important are the different effects, the two breaks types will have on the trend. Lindau and Venema (2018b) investigated the trend introduced by RD breaks. Assuming zero mean jump sizes, they found that the error variance of the trend decreases with the number of breaks. This can be understood considering that more underlying independents will make the random fluctuations of the trend smaller.
BM breaks have the opposite effect. To estimate the trend error, consider that the last homogeneous subperiod (HSP) of a BM has the expected value zero and varies with an SD of ffiffi ffi k p σ β , whereas the first one is constant and zero. The difference between the first and last HSP divided by the length n does not give the completely correct trend but is nevertheless a good estimate of its magnitude. Thus, for BM, the error variance of the trend grows linearly with k, the number of breaks. For a long time, we were not fully aware that there are two different approaches. Williams et al. (2012) modelled BM-type breaks whereas Venema et al. (2012) inserted RDtype breaks into their benchmark dataset. Only the SDs applied to generate the break sizes were communicated. These should be reduced by a factor of ffiffi ffi 2 p when transferred from BM to RD, but, most important, the statistical properties of the signals differ considerably depending on the break type created.
As mentioned in the introduction, Venema et al. (2012) briefly raised this topic and analysed the statistics of the break signal to decide whether the detected breaks are of BM or RD type. An important difference to the present study is that their results are based on the retrieved break signal, which may differ from the true one. We assume the break signal to be a step function that consists of a number of HSPs. Venema et al. (2012) considered triples of consecutive HSPs and distinguished between stairs and platforms. Stairs are defined by a monotonic increase or decrease, whereas for platforms the middle HSP is either the maximum or the minimum of the considered triple. The idea is that the percentage of stairs and platforms should be equal for BM because the second break is independent of the first break, whereas platforms should occur more frequently in RD data, because there is a tendency to revert back to the baseline. The considered triples are defined by three random numbers (the levels of the three HSPs). There are six different possibilities of rank order for these three values, which for RD all have the same probability. One combination is an upward stair, a second one is a downward stair, and the other four possibilities are platforms. Consequently, the theoretical fraction of platforms for RD breaks is 2/3.
As mentioned above, Venema et al. (2012) found only 62 to 64% platforms for the benchmark data containing pure artificial RD-type breaks, although a fraction of 2/3 is expected. This suggests that stairs are easier to detect than platforms, although also gradual inhomogeneities modelled as linear trends in the station data could have lowered the percentage of detected platforms. The percentage of stairs and platforms as they are detectable by a homogenization algorithm will be further dependent, for example, on the signal-to-noise ratio (SNR), which is shown to be a key parameter for break detection (Lindau and Venema, 2016;Lindau and Venema, 2018a). Thus, to conclude that the reduced number of detected platforms is directly caused by an admixture of BM-type breaks is risky and has been a motivation for this further study.

| Stochastic simulation
To analyse the effect of the different break types, we used simulated data that are described in the following. We generated time series with a total length of n years: that consists of four superimposed signals: 1. The climate signal γ, which is identical for all stations of a network 2. Noise ε added to the climate signal, which mimics the differences between the stations within a network, for example, due to the weather 3. BM-type inhomogeneities β 4. RD-type inhomogeneities δ We assume that the BM and RD-type break signals are independent from each other. Furthermore, the timings of the breaks occur randomly with a mean frequency, which may be different for the two types. For both break types, the signal remains at a constant level between two adjacent break points. For RD-type breaks, these levels are random, independent, and normal distributed. For the BM-type inhomogeneities, this is not the case for the levels themselves, but for the jumps from one level to the next one.

| Spatiotemporal differences
Our approach distinguishes the two break signals by analysing the variance of the spatiotemporal differences of the data. First, the spatial difference time series dif between two neighbouring stations x 1 and x 2 is built, which eliminates the common climate signal γ.
Then, temporal differences D (with lag L) within the time series dif are built: Inserting Equation (6) into Equation (7), we get: Thus, D consists of the sum (or difference) of 12 numbers with known statistical properties: three types (β, δ, and ε) from two stations (x 1 and x 2 ) and two time points (i and i + L). In a final step, the variance of D is calculated for classes of constant time lags L.
It is a general rule that the variance of the sum (or the difference) of two variables is equal to the sum of their variances increased (or reduced) by twice the covariance between them: Applying Equation (9) to Equation (8), we obtain: In Equation (10), 12 variance terms occur, one for each for the 12 variables in Equation (8). The four terms for RDtype breaks do not depend on time so that they can be summarized to one term only. The same is true for the noise variance. However, for BM-type inhomogeneities, the variance grows with time so that two separate terms, one for time step i and another for i + L, must be used. Most of the covariances between the 12 variables are zero by definition so that they can be omitted. This is trivial for the noise, but also valid for the inhomogeneities when different stations or types are considered. Only the temporal covariance for identical stations and types are non-zero. These occur two times for each station and a further factor 2 comes directly from the definition in Equation (9). Together, they lead to the factor 4 in front of the covariance terms.

| Random deviations
The last term in Equation (10) denotes the covariance for RD-type breaks, where the years i and i + L are considered. For this break type, the levels of two different inhomogeneity segments are independent. Only for internal pairs, that is, if i and i + L are both situated in the same segment, the covariance is equal to the variance Var(δ) itself; in all other cases it is zero. Consequently, the covariance is equal to the variance multiplied by the probability of internal pairs P int .
As mentioned above, we assume that breaks occur with a mean frequency p δ . Consequently, the probability f to find k breaks within a time span L is Poisson-distributed: Setting k = 0 in Equation (12) yields an equation for the probability of internal pairs: By inserting Equation (13) into Equation (11), we finally get

| Brownian motion
Now we consider the three β-terms in Equation (10), which describe the BM-type breaks. A classical BM is defined as following: with a being a random variable. The variance of this process increases linearly in time, because at each time step a growing sum of random, independent, and normal distributed numbers is obtained. The variance at time step i is However, in our case, a new value is created not every year, but only in years when actually a break occurs, which happens only with a probability of p β . Consequently, the variance growth is reduced by this factor and analogously The covariance of two time steps within a BM is equal to the variance of the earlier one, because both values have all random numbers in common that produce the earlier value. Any additional break after the first time step does not contribute to the covariance.
Summarizing Equations (17a), (17b), and (18) yields 3.5 | The overall variance of spatiotemporal differences We insert Equations (14) and (19) into Equation (10) and replace the expressions Var(δ) and Var(ε) by σ δ 2 and σ ε 2 , respectively: Equation (20) shows that we are able to identify the different effects of BM-type breaks, RD-type breaks, and noise by calculating the variance of spatiotemporal differences of climate time series as described in this section. This is possible since the three processes produce different types of signals: The first term on the right-hand side describes the signal of BM-type breaks, which grows linearly with L; the second term denotes RD-type breaks, which can be identified by a saturating exponential growth of the variance; the third term, the noise, produces a constant offset.

| APPLICATION TO SIMULATED DATA
In this section, the developed method is applied to simulated data. As we are finally aiming at climate data of about one century, we generated 1,000 pairs of time series with a length of n = 100 as described in Section 3. Also, to stay close to the application, we use the physical dimension of temperature (K) and give break rates in the dimension per year (a −1 ). In a first step, the two different break types are inserted separately. Figure 2a shows the result for RD-type breaks alone. In this example, we set the noise variance σ ε 2 = .05 K 2 , the RD-break variance σ δ 2 = .10 K 2 , and the frequency of breaks p δ = .05 a −1 . The O-symbols give the empirically obtained variance V emp for each time lag L, whereas the fat line denotes the theoretical expectation of V according to Equation (20). Data and theory are in almost perfect agreement, both showing a saturating exponential increase (by 1 − e −x ) with a saturation level for large L and a constant offset caused by the noise. When only BM breaks plus noise (with σ β 2 = .05 K 2 and p β = .05 a −1 ) and no RD breaks are inserted (Figure 2b), the variance increases linearly as predicted by the theory. Figure 2c shows the result for a mixture of both break types with a non-linear steep increase of variance for small lags and a transition to a linear increase, which is slower because at larger lags only BM breaks contribute. Again, the simulated data are closely following the theory, which shows that the method suggested in Section 3 works.

| Retrieval for the characteristic parameters
In Figure 2, the known five parameters used to generate the data (i.e., strength of noise and strength and frequency of both break types) are merely inserted into Equation (20) to show the good agreement between simulated data and theory. However, for real observations, these parameters will not be known a priori, but they have to be retrieved by fitting a function of the form given in Equation (20) This is one parameter less (4) than necessary to generate the data (5). The strength and frequency of the BM-type breaks cannot be separated from each other but arise as product in parameter b; many small BM-type breaks have the same effect as a few large ones.
We fit the function given in Equation (21) by a two-step approach. Initially, a first guess of the four parameters is computed, which is then used as central point of an exhaustive search in the surrounding. The first guess is obtained by fitting two tangents, the linear functions V 1 for L ! 0 and V 2 for L ! ∞. In this way, four coefficients are available (2 for each tangent) from which the four unknowns b, c, d, and e can be derived. The details are given in Appendix B. Figure 3 illustrates the above-described parameter retrieval. Using the five input parameters (given in the upper left corner), 10,000 time series pairs are generated. Their mean spatiotemporal variance for each time lag is denoted by O-symbols. The obtained retrieval is given by a curve that is in nearly perfect agreement with the data. In the lower right corner, the retrieved parameters b, c, d, and e defining this curve are explicitly shown. Furthermore, the two tangents V 1 and V 2 are shown, which offer the possibility to interpret the found parameters geometrically by using Equations (B7-B10) in Appendix B.
The parameters d and e have the physical dimension K 2 and describe the strength of RD-type breaks and noise, respectively. They can be read at the y-axis: e is equal to the intercept of the tangent V 1 , whereas the intercept of V 2 is giving the sum of d and e. For both parameters, the expected value is .4 K 2 , which can be obtained by inserting the according input values into Equations (21d) and (21e). With .38 and .39 K 2 , actual estimates differ by a few percent.
For the characterization of BM-type breaks, we have only one parameter b combining strength and frequency. It has the dimension K 2 a −1 and is represented by the slope of tangent V 2 . The retrieval yields .0105 K 2 a −1 , which differs by 5% from the expectation (0.0100) given in Equation (21b).
The parameter c denotes the frequency of RD-type breaks, which has the dimension a −1 . Its reciprocal, the average time between two break events, can be read as the xvalue of the crossing point of the two tangents as indicated by the vertical dashed line in Figure 3. Its expectation is matched better than 1% (0.0500 vs. 0.0498).
So far, we considered the result of 10,000 data pairs to show that the method works in principle. However, for applications to real observations, we may have considerably less data, which will diminish the accuracy of the retrieval. To estimate this effect, we reduced the number of pairs drastically to 100 and repeated the procedure five times to get an idea how much the results vary. Figure 4 shows the outcome separately for the five randomly different datasets (again given by O-symbols) together with the fitted curve (given by the solid lines). The spread between the curves is large especially for higher lags. The problem is mainly that the data itself vary between the different samples, whereas the retrieved curve fits well to the according data. These random fluctuations are quantified by the SD for each parameter in Figure 4. For the parameters b, c, and d, they reach about 50% when expressed as relative error in the mean. The error of the noise parameter e is, with about 5%, much smaller. Increasing the number of repetitions from 5 to 20, the results are more stable and summarized in Table 1; Table 2 shows the according findings for 1,000 data pairs.  (21) is fitted, which is given by the closely following continuous curve. The two tangents, separately fitted for small and large lags, can be used for a geometrical interpretation of the retrieved parameters The accuracy increases when only RD breaks and noise are inserted, but no BM breaks. For 100 station pairs, the SD of the four parameters b, c, d and e decreases to 0.133 (from 0.317), 1.812 (from 2.618), 0.098 (from 0.184), and 0.009 (from 0.015). The numbers in brackets are extracted from Table 1 for comparison.

| APPLICATION TO OBSERVATIONS
Finally, the above-described method is applied to real observations. We used data of the International Surface Temperature Initiative (Rennie et al., 2014). In a first step, we restricted the data to the 20th century to limit the effects of possible non-stationarities. Spatial difference time series are built against the best neighbour station, which is defined as the nearest neighbour within a maximum distance of 100 km and with at least 80 years of overlapping data. To avoid mutual double pairs, the best neighbour is only searched in the eastward direction. In the contiguous United States, 1,459 station pairs fulfil these constraints ( Figure 5). Temporal differences are built from monthly resolution data, but only between identical calendar months to avoid any spurious effect caused by the annual cycle. Thus, only whole year lags are possible, but the resulting parameters are valid for monthly resolution.
The resulting variance of the spatiotemporal difference according to Equation (20) is given in the left panel of Figure 5. For time lags smaller than 20 years, the saturating exponential increase induced by RD-type breaks is obvious; for larger time lags, the increase is clearly linear indicating that also BM-type breaks are included in the data.
For Germany, only 45 station pairs ( Figure 6) could be built, which is certainly at the lower limit for a useful application of the method. However, compared to the United States, German data show a distinctly different behaviour. Especially, any evidence for BM-type breaks is missing, as the curve saturates at a constant level for higher time lags. Reconverting the coefficients b to e by Equation (21b-e), we get the estimates summarized in Table 3.
The SDs for noise and RD-type breaks are of comparable size with .3 to .4 K for both Germany and the United States. However, there are four times more RD-type breaks in U.S. data (one each 6 years compared to one each 24 years).
The break sizes are commonly communicated as the SD of the jump heights between two adjacent HSPs. To convert the variance σ δ 2 given in Table 3 into this traditional parameter, we have to multiply by 2 (as two RDs build one jump) and calculate the square root. For U.S. data .485 K results, for Germany .505 K.
In Section 2, we roughly estimated the error variance of the temperature change caused by BM breaks to kσ β 2 . This is equal to the BM parameter p β σ β 2 (Table 3) multiplied by the length of the series, which gives for U.S. stations .45 K 2 for a length of 100 years. Thus, the BM-type breaks cause an error of about .67 K for the temperature change for 100 years. The number of breaks for U.S. station appears extraordinarily large. , for example, found only one significant break every 15 to 20 years. However, the latter estimate contains only the breaks which are large enough to be detected by a homogenization algorithm. Our approach, in contrast, comprises all breaks. It would imply that only one third of the breaks is usually detected. In Section 6, we show that such an assumption is not unrealistic. In contrast, the examination of a commonly used stop F I G U R E 4 As Figure 3 but for a data volume reduced from 10,000 to 100 pairs. The variability of five random repetitions illustrates the increased uncertainty criterion, by which the number of detected breaks is determined, suggests actually a fraction smaller than 0.5.

| FRACTION OF UNDETECTED BREAKS
Detection algorithms search for the optimum number of breaks and their positions. For each number of breaks, an optimum combination of positions exists, which is defined by explaining the maximum of variance. However, more breaks are of course able to explain more variance. Thus, to decide which number of breaks is optimal a stop criterion is needed that penalizes the insertion of breaks. Caussinus and Mestre (2004), Domonkos (2011a) and Lindau and Venema (2018a) use the Lyazrhi stop criterion (Caussinus and Lyazrhi, 1997), which is given by where n denotes the length of the time series, k the number of test breaks, and v the fraction of variance explained by the segmentation. Thus, k 0 , the number of detected breaks, can be obtained by setting the first derivative of Equation (22) equal to zero: The right-hand side of Equation (23) is a constant, which is for monthly data (where n ≈ 1,000) equal to 0.014. Thus, Equation (23) states that the search is stopped if an additional test break is not able to explain at least 1.4% of the so far unexplained variance.
The total (normalized) variance is equal to the sum of break variance and noise variance: In the vicinity of the optimum solution k 0 , most of the break variance is already explained and the unexplained variance will be approximately equal to the noise: We substitute the unexplained variance in Equation (23) by the above approximation and get For the ideal case without noise, Lindau and Venema (2018a) showed that the explained variance grows with k by where n k denotes the true number of breaks. The derivation of Equation (27) with respect to k yields If the method works perfectly it explains only break variance and no noise. Then we can insert Equation (28) into Equation (26): Solved for the underestimation factor f u : To estimate f u , we insert the values found for RD breaks in the United States given in Table 3. These are valid for monthly scale so that n is in the order of 1,000 (month). For such a time span, 14 breaks (17.1 century −1 × 0.83 century) are expected, which has to be doubled because two stations contribute their breaks in a difference time series: Thus, only about 50% of the breaks are detected, that is, 14 of 28, which means 7 per station.
For a Gaussian distribution, 50.3% of the data lies inside 0.71 SDs. In Section 5, we found an SD of 0.485 K for RD breaks. Thus, if the detectability is only a function of size, an estimate for the minimum detectable break size is d =0:71× 0:48K =0:344K: The above calculations provide of course only a rough estimate and for larger SNR (2) and less breaks (5 per station) the underestimation factor is considerably smaller: Furthermore, we assumed a perfect performance of the search method. This means that it is solely the break variance which is gradually explained. However, Lindau and Venema (2018a) showed that in practice also some noise is erroneously treated as signal, which can lead, especially for SNRs far below 1, to large errors. In the context discussed here, it is important that the detected number of breaks may increase by this effect. For SNRs of about 1, as it is the case here, these errors are expected to remain limited.
Above, we estimated the detection threshold of U.S. data to about .344 K. An alternative way to calculate this parameter is the following. The potentially attainable (but just not accepted) increase of the explained variance at k 0 (for the last detected break) gives an estimate of the minimum detectable break variance. Using Equation (26), we get T A B L E 3 Characteristic break parameters for U.S. and German station data It is difficult to determine exactly which break size (jump height) corresponds to a specific amount of additionally explained variance. Any new solution for k breaks will change all break positions of the previous solution for k − 1 breaks. However, for a rough estimate, we assume that one of the existing segments is divided in the middle, while the others are left unchanged, as it is applied in hierarchical splitting methods. In this case, a jump height of h will produce a variance of h 2 /4. However, as only one segment is affected, we have to multiply by its relative length. An estimate for that is 1/k 0 so that a rough estimate for the just not accepted variance gain is By combining Equations (34) and (35), we can solve for the minimum break size h. Inserting n = 1,000, k 0 = 14, and v N = .155 K 2 yields: This is in good agreement to the estimate of the detection threshold made in Equation (32).
The theoretical estimates above are rechecked by running the detection algorithm presented by Lindau and Venema (2018a). Thousand time series with the characteristics found for American RD breaks (length 1,000 month, σ δ 2 = .1175 K 2 , σ ε 2 = .1550, n k = 28, see Equation 31) are modelled as described in Section 3 and then analysed by the search algorithm (Figure 7). The histogram of the inserted jump heights is given by the crosses and follows the expected normal distribution N(0, 2σ δ 2 ). The O-symbols denote the detected breaks. Large breaks are nearly entirely retrieved, whereas breaks sizes below 0.1 K are completely missing. The maxima are reached at the break size classes ±0.5 K, which is in accordance with  and Dunn et al. (2014). In total, 11,369 of 27,790 breaks are detected, which is equal to 40.9%. This largely confirms our theoretical analysis (Equation 31), which found 49.7%.

| DISCUSSION
The above analysis found that the break signal of contiguous United States is a mixture of both BM and RDs, while in Germany it is dominated by RDs. This fits qualitatively to the findings of the HOME benchmarking study, which found for several networks in Europe that the break signal was mostly RDs, but also had a BM component. This gives us two independent lines of evidence that the break signal has both characteristics. As explained in Section 2, Venema et al. (2012) studied the sign of consecutive break pairs found by statistical homogenization. Thus Venema et al. (2012) analyse the break signal from one break to the next, while the method in this paper considers the characteristic of the break signal as whole. The current study, furthermore, does not require a previous homogenization.
The SNR for the German observations is close to 1, which is twice as high as the ratio found by Lindau and Venema (2018a) for the German network. The smaller SNR in our older work may be caused by two reasons: The first is that in the earlier study only data from the second half of the 20th century was considered where we expect smaller inhomogeneities. A second reason may be that previously neighbour stations with a maximum correlation were selected, whereas the present study uses the distance. Maximum correlation selects for both a similar climate, but also for a similar break signal. This reduces the break variance in the difference time series and consequently produces a smaller SNR. Thus, selecting station pairs with highest correlation may not be the best method to detect breaks.
The number of breaks detected by homogenization will be certainly lower than the actual number of breaks in a dataset. The main effect is that small breaks are not detected, but homogenization algorithms will also combine multiple subsequent breaks into one. Using the Pairwise Homogenization Algorithm Williams, 2009), Dunn et al. (2014) found a break frequency of 6.8 breaks per century in the global dataset HadISD. For conterminous United States (U.S .Historical Climatology Network),  found a break frequency of 5 to 6.7 breaks per century. The minimum detectable break size is not well defined and hard to estimate from the histogram of detected breaks. This histogram is bimodal, it looks like a normal distribution where many values around zero are missing. If we take the two maxima of this histogram as a coarse estimate of the minimum detectable size, it is for both datasets about 1 SD of the distribution of all breaks. That would mean that about 68% of all breaks are not detected. In Section 6, we analyse the behaviour of a multiple breakpoint detection method, which largely confirms this estimate. It suggests that in case of the U.S. data analysed above we would only detect 40 to 50% of the actual break points. Homogenization detection algorithms combining subsequent breaks will further reduce this fraction so that one third is not unrealistic. Consequently, it seems reasonable that our method found much higher break frequencies than the previously reported significant break frequencies based on statistical homogenization.
For Germany, our estimate of four breaks per century is comparable to Lindau and Venema (2018a), who found six breaks per century. As in this paper, they obtained the break rate directly from the data and without running a homogenization algorithm. The earlier estimate comprises 10 times more stations and only half of the time period, which could be the reason for such a difference. Four breaks per century correspond roughly to 1.3 to 2 breaks per century (one third to one half) that are large enough to be detected by a homogenization algorithm. This value is small, but in good agreement to the 1.5 breaks per century Müller-Westermeier (2003) found using Standard Normal Homogeneity Test for the long-temperature series in the German network.

| CONCLUSIONS
We presented a method to estimate the strength and frequency of inhomogeneities and to distinguish between BM and RD-type breaks. This is possible because the two break types produce different signals in the variance of spatiotemporal difference between two stations and between two dates. BM-type breaks generate a linear increase of variance as a function of the time lag. The corresponding function for RD-type breaks is saturating exponential; for noise it is a constant.
Applying the method to German data, we found only a small number of stations (45 pairs), that fulfil the strict requirements concerning spatial closeness of the stations and temporal length and completeness. The results suggest that German stations contain only RD-type breaks and no BMtype breaks.
In the United States, much more data are available (1,459 station pairs). The method clearly identifies both break types, BM and RD. RD breaks occur on average each 6 years and their size has an SD of .34 K. If expressed as jump height, this value has to be multiplied by ffiffi ffi 2 p so that .485 K results.
The number of breaks is, with 17 per century, three times larger than commonly estimated. However, our estimate includes all breaks, whereas the traditionally estimates are often denoted as significant breaks because they comprise only the detected ones. Thus, the total and the detected number of breaks are essentially two different parameters. As demonstrated in Section 6, there is some evidence from the theory that more than half of the breaks may be actually not detected during homogenization. Since they are of small size, their impact may be limited for the climatological practice. However, the new possibility to estimate also the total number of breaks could improve our insight into the homogenization problem.
If the inhomogeneities in benchmark datasets are modelled as BM-type breaks, they are probably easier detectable due to their large impact on the variance. Therefore, results from benchmark datasets with different break types are not directly comparable; not only strength and frequency of the breaks are crucial, but also the break type. A mixture of both breaks types seems reasonable at least if the conditions in the United States shall be modelled.

APP EN DI X B: Parameter retrieval
In this appendix, we describe in detail how the theoretically derived function given in Equation (21) is fitted to the data. In a first step, two tangents V 1 and V 2 are determined by fitting a straight line to the data curve between L = 1 to 3 and L = 51 to 80, respectively.
For L ! 0, we use e -x ≈ 1 − x so that Equation (21a) can be approximated by V 1 L ð Þ= bL +cdL+ e= slp 1 L+ con 1 ðB1Þ with slp 1 = b +cd, ðB2Þ For L ! ∞, the exponential term in Equation (21a) can be neglected and it follows V 2 L ð Þ=bL +d +e = slp 2 L+ con 2 with slp 2 =b ðB5Þ Equations (B2), (B3), (B5) and (B6) give four equations for the four characteristic parameters yielding the first guess, which we mark by a zero index b 0 =slp 2, ðB7Þ c 0 = slp 1 −slp 2 con 2 −con 1 , ðB8Þ d 0 =con 2 −con 1, ðB9Þ e 0 = con 1: The final fit is performed by varying each of the four first guess parameters in 41 samples around the initial value and testing which of the 41 4 combinations yields a minimum in the squared difference Δ between simulated data V sim and theoretical function V theo . The step width for the parameters b and c is 0.001 and 0.01 for d and e. The number of data pairs decrease linearly with L, which is taken into account by an according weighting factor.