This paper is a non-peer-reviewed preprint submitted to EarthArXiv.

Massive-Parallel Trajectory Calculations version 2.2 (MPTRAC-2.2): Lagrangian transport simulations on Graphics Processing Units (GPUs)

Lagrangian models are powerful tools to study atmospheric transport processes. However, conducting large-scale Lagrangian transport simulations with many air parcels can become numerically rather costly. In this study, we assessed the potential of exploiting graphics processing units (GPUs) to accelerate Lagrangian transport simulations. We ported the Massive-Parallel Trajectory Calculations (MPTRAC) model to GPUs using the open accelerator (OpenACC) programming model. The trajectory calculations conducted within the MPTRAC model have been fully ported to GPUs, i. e., except for feeding in the meteorological input data and for extracting the particle output data, the code operates entirely on the GPU devices without frequent data transfers between CPU and GPU memory. Model verification, performance analyses, and scaling tests of the MPI/OpenMP/OpenACC hybrid parallelization of MPTRAC have been conducted on the JUWELS Booster supercomputer operated by the Jülich Supercomputing Centre, Germany. The JUWELS Booster comprises 3744 NVIDIA A100 Tensor Core GPUs, providing a peak performance of 71.0 PFlop/s. As of June 2021, it is the most powerful supercomputer in Europe and listed among the most energy-efficient systems internationally. For large-scale simulations comprising 10^8 particles driven by the European Centre for Medium-Range Weather Forecasts' ERA5 reanalysis, the performance evaluation showed a maximum speedup of a factor of 16 due to the utilization of GPUs compared to CPU-only runs on the JUWELS Booster. In the large-scale GPU run, about 67 % of the runtime is spent on the physics calculations, being conducted on the GPUs. Another 15 % of the runtime is required for file I/O, mostly to read the ERA5 data from disk. Meteorological data preprocessing on the CPUs also requires about 15 % of the runtime.
Although this study identified potential for further improvements of the GPU code, we consider the MPTRAC model to be ready for production runs on the JUWELS Booster in its present form. The GPU code provides a much faster time to solution than the CPU code, which is particularly relevant for near-real-time applications of a Lagrangian transport model.


Introduction
Lagrangian transport models are indispensable tools to study the chemical and dynamical processes of the Earth's atmosphere.
A wide range of Lagrangian transport models has been developed for research studies and operational applications during the past decades (Draxler and Hess, 1998; McKenna et al., 2002b, a; Lin et al., 2003; Stohl et al., 2005; Jones et al., 2007; Stein et al., 2015; Sprenger and Wernli, 2015; Pisso et al., 2019). While Eulerian models represent fluid flows in the atmosphere based on the flow between regular grid boxes of the model, Lagrangian models represent the transport of trace gases and aerosols based on large sets of air parcel trajectories following the fluid flow. Both approaches have distinct advantages and disadvantages. Lagrangian models are particularly well suited to study fine-scale structures, filamentary transport, and mixing processes in the atmosphere, as their spatial resolution is not inherently limited to the resolution of Eulerian grid boxes and numerical diffusion is particularly low for these models. However, Lagrangian transport simulations may become rather costly because subgrid-scale processes, such as diffusion and mesoscale wind fluctuations, need to be represented in a statistical sense, i. e., by adding stochastic perturbations to large sets of air parcel trajectories.
In this study, the Massive-Parallel Trajectory Calculations (MPTRAC) model is applied to exploit the potential of conducting Lagrangian transport simulations on graphics processing units (GPUs). MPTRAC was first described by Hoffmann et al. (2016), discussing Lagrangian transport simulations for volcanic eruptions with different meteorological data sets. Heng et al. (2016) and Liu et al. (2020) discussed inverse transport modeling to estimate volcanic sulfur dioxide emissions. The performance and scaling of all three components of the MPI/OpenMP/OpenACC hybrid parallelization are assessed. Section 4 provides the summary and conclusions of this study.

The call graph shown here has been derived automatically from the C code of the MPTRAC model (by means of the cflow tool), but it has been edited and strongly simplified in order to present only the most relevant features. Following the input-process-output (IPO) pattern, the call graph reveals the principal input functions (read_*), the processing functions (module_*), and the output functions (write_*) of MPTRAC.
Three main input functions are available, i. e., read_ctl to read the model control parameters, read_atm to read the initial particle data, and read_met to read the meteorological input data. The read_met function further splits into functions that deal directly with reading of the meteorological data files (read_met_grid, read_met_levels, and read_met_surface), extrapolation and sampling of the meteorological data (read_met_extrapolate, read_met_periodic, read_met_sample, and read_met_detrend), and the calculation of additional meteorological variables (read_met_geopot, read_met_pv, read_met_pbl, read_met_tropo, read_met_cloud, and read_met_cape). Although the additional meteorological variables are often available directly via the meteorological input data files, the MPTRAC model allows users to recalculate these data consistently from different meteorological input data sets. The MPTRAC model input data are further discussed in Sect. 2.2.
The processing functions of MPTRAC provide the capabilities to calculate kinematic trajectories of the particles using given ω-velocities (module_advection) and to add stochastic perturbations to the trajectories to simulate the effects of diffusion and subgrid-scale wind fluctuations (module_diffusion_turb and module_diffusion_meso). The functions module_convection and module_sedi may alter the particle positions along the trajectories by simulating the effects of unresolved convection and sedimentation. The function module_isosurf allows us to vertically constrain the particle positions to different types of isosurfaces. The function module_position enforces the particles to remain within the boundaries of the model domain. The function module_meteo allows us to sample the meteorological data along the trajectories at fixed time intervals. The functions module_decay, module_oh_chem, module_dry_deposition, module_wet_deposition, and module_bound_cond affect the mass or volume mixing ratios of the particles based on a given e-folding lifetime, chemical decomposition by the hydroxyl radical, dry and wet deposition, or by enforcing boundary conditions, respectively. Most of the chemistry and physics modules are presented in more detail in Sect. 2.3. Finally, the model output is directed via a generic function (write_output) towards individual functions that can be used to write particle data (write_atm), gridded data (write_grid), or ensemble data for groups of particles (write_ens).
The functions write_csi, write_prof, write_sample, and write_station can be used to sample the model data in specific ways to enable comparisons with observational data, such as satellite measurements or data from measurement stations. The model output is further discussed in Sect. 2.4.
All the processing functions (module_*) have been highlighted in Fig. 1 to indicate that they have been ported to GPUs in the present work. Porting these processing functions to GPUs is the natural choice, as these functions typically require most of the computing time, in particular for simulations with many particles. The input and output functions cannot easily be ported to GPUs, as the input and output operations are directed over the CPUs and utilize CPU memory in the first place. The approach used for porting the physics and chemistry modules to GPUs and the parallelization strategy of MPTRAC are discussed in more detail in Sect. 2.5.

Model input data
2.2.1 Initial air parcel positions

In order to enable trajectory calculations, MPTRAC requires an input file providing the initial positions of the air parcels. The initial positions are defined in terms of time, log-pressure height, longitude, and latitude. Internally, MPTRAC applies pressure as its vertical coordinate. For convenience, the initial pressure p is specified in terms of a log-pressure height z obtained from the barometric formula, p(z) = p_0 exp(−z/H), with a fixed reference pressure p_0 = 1013.25 hPa and scale height H = 7 km. Longitude and latitude refer to a spherical Earth approximation, with a constant radius of curvature of R_e = 6367.421 km. Unless the initial positions of the trajectories are derived from measurements, a set of tools provided with MPTRAC can be used to create initial air parcel files. The tool atm_init creates a regular grid of air parcel positions in time, log-pressure height, longitude, and latitude. Around each grid point and time, the tool can create random distributions of multiple particles by applying either uniform distributions with given width or a normal distribution with given standard deviations. The total mass of all air parcels can be specified, which is distributed equally over all air parcels. Alternatively, a fixed volume mixing ratio for each air parcel can be specified. With this tool, initializations for a wide range of sources (e. g., point source, vertical column, surface, or volume source) with spontaneous or continuous emissions can be created.
A tool to modify air parcel positions is atm_split. It can be used to create a larger set of air parcel positions based on randomized resampling from a given, smaller set of air parcel positions. It "splits" the initial small set into a larger set of air parcels. The total number of air parcels desired for the large set can be specified, which is helpful because it provides control over the computational time needed for a simulation. The probability by which an air parcel is selected during the resampling can be weighted by its mass. The total mass of the initial air parcels can be retained or adjusted during the resampling process. During the resampling, uniform or Gaussian stochastic perturbations in space and time can be applied to the air parcel positions. These stochastic perturbations can be used to spread the air parcels over a larger volume, which might represent the measurement volume covered by an instrument, for example. Finally, a vertical weighting function can be applied in the resampling process in order to redistribute the air parcel positions to follow the vertical sensitivity of a nadir sounding satellite instrument.

Requirements and preprocessing of meteorological input data
Next to the particle positions, the other main input data required by MPTRAC are meteorological data from a global forecasting or reanalysis system. At a minimum, MPTRAC requires 3-D fields of temperature T, zonal wind u, and meridional wind v. The vertical velocity ω, specific humidity q, ozone mass mixing ratio x_O3, cloud liquid water content (CLWC), cloud rain water content (CRWC), cloud ice water content (CIWC), and cloud snow water content (CSWC) are optional, but often needed for various applications. The model also requires the surface pressure p_s or log-surface pressure (ln p_s) and the geopotential height at the surface Z_g to be provided as 2-D fields. Optionally, near-surface temperature T_2m and near-surface winds (u_10m, v_10m) can be ingested from the input files.

The model requires the input data in terms of separate netCDF files at regular time steps. The meteorological data have to be provided on a regular horizontal latitude × longitude grid. The model internally uses pressure p as its vertical coordinate, i. e., the meteorological data should be provided on pressure levels. However, the model can also ingest meteorological data on other types of vertical levels, e. g., hybrid sigma vertical coordinates η as used by ECMWF's Integrated Forecast System (IFS). In this case, the model requires p as a 3-D coordinate on the input model levels in order to facilitate vertical interpolation of the data to a given list of pressure levels for further use in the model.
Once the mandatory and optional 3-D and 2-D meteorological input variables have been read in from disk, MPTRAC provides the capability to calculate additional meteorological variables from the input data. This includes the option to calculate geopotential heights, potential vorticity, the tropopause height, cloud properties such as cloud layer depth and total column cloud water, the convective available potential energy (CAPE), and the planetary boundary layer (PBL). Having the option to calculate these additional meteorological data directly in the model helps to reduce the disk space needed to save them as input data. This is particularly relevant for large input data sets such as ERA5. It is also an advantage in terms of consistency if the same algorithms can be applied to infer the additional meteorological variables from different meteorological data sets. The algorithms and selected examples of the meteorological data preprocessing are presented in Appendix A.

Boundary conditions for meteorological data
The upper boundary of the model is defined by the lowest pressure level of the meteorological input data. The lower boundary of the model is defined by the 2-D surface pressure field p_s at a given time. If an air parcel leaves the pressure range covered by the model during the trajectory calculations, it will be pushed back to the lower or upper boundary of the model, respectively. Therefore, the particles will follow the horizontal wind and slide along the surface or the upper boundary until they encounter an up- or downdraft, respectively. For the surface, this approach follows Draxler and Hess (1997). We recognize that pressure is not the best choice for the vertical coordinate of a Lagrangian model, in particular for the boundary layer, because a broad range of surface pressure variations needs to be represented appropriately by a fixed set of pressure levels. However, as the scope of MPTRAC is on applications covering the free troposphere and the stratosphere, we consider the limitation of having reduced vertical resolution in terms of pressure levels near the surface acceptable for the time being.
For global meteorological data sets, MPTRAC applies periodic boundary conditions in longitude. During the trajectory calculations, air parcels may sometimes cross either the pole or the longitudinal boundary of the meteorological data. If an air parcel crosses the North or South Pole, its longitude will be shifted by 180° and the latitude will be recalculated to match the ±90° latitude range accordingly. If an air parcel leaves the longitude range, it will be re-positioned by a shift of 360° to fit the range again. The model works with meteorological data provided at either −180°…+180° or 0°…360° of longitude. Following Stohl et al. (2005), we mirror data from the western boundary at the eastern boundary of the model in order to have full coverage of the 360° longitude range and to avoid extrapolation of the meteorological input data.
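The periodic boundary handling described above can be sketched as follows. This is our illustrative reading of the text, not the MPTRAC source; function names are hypothetical.

```c
/* Fold a longitude [deg] back into the data range [lon_min, lon_min + 360). */
double wrap_longitude(double lon, double lon_min) {
    while (lon < lon_min)          lon += 360.0;
    while (lon >= lon_min + 360.0) lon -= 360.0;
    return lon;
}

/* If a parcel crosses a pole, shift its longitude by 180 deg and fold the
   latitude back into the +/-90 deg range. */
void fix_pole_crossing(double *lon, double *lat) {
    if (*lat > 90.0)  { *lat =  180.0 - *lat; *lon += 180.0; }
    if (*lat < -90.0) { *lat = -180.0 - *lat; *lon += 180.0; }
}
```

After a pole crossing, the wrapped longitude would subsequently be folded back into the data range by wrap_longitude.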

Interpolation and sampling of meteorological data
In MPTRAC, 4-D linear interpolation is applied to sample the meteorological data at a given position and time. Although higher-order interpolation schemes may achieve better accuracy, linear interpolation is considered a standard choice in Lagrangian particle dispersion models (Bowman et al., 2013). The approach requires that two time steps of the meteorological input data are kept in memory at each time. Higher-order interpolation schemes would require more time steps of the input data and would therefore require more memory.
Following the memory layout of the data structures used for the meteorological input data in our model, and to make efficient use of memory caches, the interpolations are conducted first in the vertical, followed by latitude and longitude, and finally in time. Interpolations can be conducted for a single variable or all the meteorological data at once. Two separate index look-up functions have been implemented for regularly gridded data (longitude and latitude) and for irregularly gridded data (pressure).
Interpolation weights are kept in caches to improve efficiency of the calculations.
For various studies it is interesting to investigate the sensitivity of the results with respect to the spatial and temporal resolution of the meteorological input data. For instance, Hoffmann et al. (2019) assessed the sensitivity of Lagrangian transport simulations for the free troposphere and stratosphere with respect to downsampled versions of the ERA5 reanalysis data set. To enable such studies, we implemented an option for spatial downsampling of the meteorological input data. The downsampling is conducted in two steps. In the first step, the full-resolution data are smoothed by applying triangular filters in the respective dimensions. Smoothing of the data is required in order to avoid aliasing effects. In the second step, the data are downsampled by selecting only each (l, m, n)-th grid point in longitude, latitude, and pressure, respectively.
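The two-step downsampling can be sketched for one dimension as follows. The three-point triangular weights (1/4, 1/2, 1/4) and the clamped edge handling are assumptions of this sketch, not taken from the MPTRAC source.

```c
/* Smooth with a 3-point triangular filter, then keep every n-th grid point.
   Edges are clamped. out must have room for ceil(nin / n) values. */
void downsample_1d(const double *in, int nin, int n, double *out, int *nout) {
    *nout = 0;
    for (int i = 0; i < nin; i += n) {
        double left  = in[i > 0 ? i - 1 : 0];
        double right = in[i < nin - 1 ? i + 1 : nin - 1];
        out[(*nout)++] = 0.25 * left + 0.5 * in[i] + 0.25 * right;
    }
}
```

A full implementation would apply such a filter along longitude, latitude, and pressure with separate decimation factors (l, m, n).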
Note that downsampling has not been implemented rigorously for the time domain. Nevertheless, the user can easily specify the time interval at which the meteorological input data should be ingested. If filtering of the downsampled data is required, downsampling in time needs to be done externally by the user by applying tools such as the Climate Data Operators (CDO) (Schulzweida, 2014) to preprocess the input data of different time steps accordingly. In any case, it needs to be carefully considered whether the meteorological input data represent instantaneous or time-averaged diagnostics at the given time steps of the data.

Advection

The advection of an air parcel, i. e., the position x(t) at time t for a given velocity field v(x, t), is given by the trajectory equation,

dx/dt = v(x, t). (1)

Here, the position x is provided in a meteorological coordinate system, x = (λ, ϕ, p), with longitude λ, latitude ϕ, and pressure p. The velocity vector v = (u, v, ω) is composed of the zonal wind component u, the meridional wind component v, and the vertical velocity ω, respectively. Based on its accuracy and computational efficiency for trajectory calculations (Rößler et al., 2018), we apply the explicit midpoint method to solve the trajectory equation,

x(t + ∆t) = x(t) + ∆t v(x(t) + ∆t/2 v(x(t), t), t + ∆t/2). (2)

The model time step ∆t controls the trade-off between the accuracy of the solution and the computational effort. Similar to the Courant-Friedrichs-Lewy (CFL) condition, ∆t should generally be selected so that the steps ∆x = x(t + ∆t) − x(t) remain smaller than the grid box size of the meteorological data. MPTRAC will issue a warning if ∆t > ∆λ_met R_e / u_max for a longitudinal grid spacing ∆λ_met of the meteorological data, the mean radius of Earth R_e, and a maximum zonal wind speed u_max = 150 m s⁻¹. A default time step ∆t = 180 s has been selected to match the effective horizontal resolution of ∆x ≈ 31 km of ECMWF's ERA5 reanalysis.
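The explicit midpoint scheme can be sketched generically as follows, with the wind-field callback standing in for MPTRAC's meteorological interpolation; this is an illustrative sketch, not the MPTRAC implementation.

```c
/* One explicit midpoint step for dx/dt = v(x, t). */
typedef void (*wind_fn)(const double x[3], double t, double v[3]);

void midpoint_step(double x[3], double t, double dt, wind_fn wind) {
    double v0[3], xm[3], vm[3];
    wind(x, t, v0);                        /* v(x(t), t) */
    for (int i = 0; i < 3; i++)
        xm[i] = x[i] + 0.5 * dt * v0[i];   /* midpoint position */
    wind(xm, t + 0.5 * dt, vm);            /* velocity at the midpoint */
    for (int i = 0; i < 3; i++)
        x[i] += dt * vm[i];                /* full step with midpoint velocity */
}

/* Example wind field: constant velocity (for testing only). */
void const_wind(const double x[3], double t, double v[3]) {
    (void)x; (void)t;
    v[0] = 1.0; v[1] = 2.0; v[2] = -0.5;
}
```

In the model, the coordinate transformations of Eq. (3) are applied to convert between the velocity components (u, v, ω) and increments in (λ, ϕ, p).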
Calculations of the numerical solution of the trajectory equation require coordinate transformations between Cartesian coordinate distances (∆x, ∆y, ∆z) and spherical coordinate distances (∆λ, ∆ϕ, ∆p),

∆x = R_e cos(ϕ) ∆λ, ∆y = R_e ∆ϕ, ∆z = −H_0 ∆p / p. (3)

The vertical coordinate transformation between ∆p and ∆z uses an average scale height of H_0 = 7 km. However, note that the vertical transformation is currently not needed to solve Eq. (2), because the vertical velocity is already provided in terms of pressure change, ω = dp/dt, in the meteorological data. The transformation is reported here for completeness, as it is required in other parts of the model. For convenience, we assume λ and ϕ to be given in radians, i. e., scaling factors of π/180° and 180°/π to convert from degrees to radians and vice versa are not introduced here.
The coordinate transformation between ∆λ and ∆x in Eq. (3) bears numerical problems due to the singularities near the poles. Although this problem might be solved by means of a change to a more suitable coordinate system at high latitudes, we implemented a rather simple solution. Near the poles, for |ϕ| > ϕ_max, we discard the longitudinal distance ∆λ calculated from Eq. (3) and use ∆λ = 0 instead. The high-latitude threshold has been set to ϕ_max = 89.999°, a distance of about 110 m from the poles, following an assessment of its effect on trajectory calculations conducted by Rößler (2015).
As an example, Fig. 2 shows the results of kinematic trajectory calculations for particles located in the vicinity of the

Turbulent diffusion
Rather complex parametrizations of atmospheric diffusivity are available for the planetary boundary layer (e. g., Draxler and Hess, 1997; Stohl et al., 2005), which have not been implemented in MPTRAC so far. Much less is known about the diffusivities in the free troposphere and in the stratosphere, which are in the scope of our model. In MPTRAC, the effects of atmospheric diffusion are simulated by adding stochastic perturbations to the positions x of the air parcels at each time step ∆t of the model,

x_i(t + ∆t) → x_i(t + ∆t) + √(2 D_i ∆t) ξ_i.

This method requires a vector ξ = (ξ_x, ξ_y, ξ_z) of random variates to be drawn from the standard normal distribution for each air parcel at each time step. The vector of diffusivities D = (D_x, D_x, D_z) is composed of the horizontal diffusivity D_x and the vertical diffusivity D_z. The model allows specifying D_x and D_z separately for the troposphere and stratosphere as control parameters. Following choices made for the FLEXPART model (Stohl et al., 2005), default values of D_x = 50 m² s⁻¹ and D_z = 0 have been selected for the troposphere, and D_x = 0 and D_z = 0.1 m² s⁻¹ have been selected for the stratosphere. Diffusivities will change if an air parcel intersects the tropopause. A smooth transition between tropospheric and stratospheric diffusivities is created by linear interpolation of D_x and D_z within a ±1 km log-pressure altitude range around the tropopause.

Subgrid-scale wind fluctuations
In addition to turbulent diffusion, the effects of unresolved subgrid-scale winds, also referred to as mesoscale wind perturbations, are considered. The starting point for this approach is the separation of the horizontal wind and vertical velocity vector v into the grid-scale mean v̄ and the subgrid-scale perturbation v′, i. e., v = v̄ + v′. It is further assumed that the mean wind v̄ is given by the coarse-grid meteorological data, whereas the wind perturbations v′ need to be parameterized. The subgrid-scale wind perturbations are calculated by means of the Langevin equation,

v′_i(t + ∆t) = r v′_i(t) + √(1 − r²) f σ_i ξ_i.

From a mathematical point of view, this represents a Markov chain or a random walk, which adds temporally correlated stochastic perturbations to the winds over time. The degree of correlation depends on the correlation coefficient r, and therefore on the time step ∆t of the model and the time step ∆t_met of the meteorological data. The variance (f σ_i)² of the random component added at each time step depends on the grid-scale variance σ_i² of the horizontal wind and vertical velocity data and a scaling factor f, which is used for downscaling to the subgrid scale. The scaling factor f needs to be specified as a control parameter of the model. The default value is f = 40 %, following a choice made for the FLEXPART model (Stohl et al., 2005).
For each air parcel, the grid-scale variance σ_i² is calculated from the neighboring eight grid boxes and two time steps of the meteorological data. To make computations more efficient, the grid-scale variance of each parcel is kept in a cache. As before, ξ is a vector of random variates to be drawn from the standard normal distribution. Note that several studies provided estimates of atmospheric diffusivities (Legras et al., 2003, 2005; Pisso et al., 2009), but the data required for the parametrizations in the free troposphere and stratosphere are typically still not well constrained. Figure 4 also shows results of sensitivity tests on the parameter choices for the diffusion and subgrid-scale wind parametrizations. In addition, in earlier work we found that the parametrizations of diffusion and subgrid-scale winds used here may yield different particle spread for different meteorological data sets, even if the same parameter choices are applied.
The parameter choices may therefore need to be tuned individually for each meteorological data set. This also opens the possibility for potentially interesting considerations regarding sensitivities to the relative importance of diffusion and sub-grid

Convection
The spatial resolution of global meteorological input data is often too coarse to allow for explicit representation of convective up- and downdrafts. Although the downdrafts may occur on larger horizontal scales, the updrafts are usually confined to horizontal scales below a few kilometers. Furthermore, convection occurs on timescales of a few hours. Hourly ERA5 data may better represent convection than six-hourly ERA-Interim data. A parametrization to better represent unresolved convective up- and downdrafts in global simulations has been implemented in MPTRAC, which is similar to the convection parametrization implemented in the HYSPLIT and STILT models (Draxler and Hess, 1997; Gerbig et al., 2003). In the present study, the convection parametrization was applied for synthetic tracer simulations as discussed in Sect. 3.3.
The convection parametrization requires as input the convective available potential energy (CAPE) and the equilibrium level from the input data (Appendix A5) to be interpolated to the horizontal position of each air parcel. If the interpolated CAPE value is larger than a threshold CAPE_0, an important control parameter of the parametrization, it is assumed that the up- and downdrafts within the convective cloud are strong enough to yield a well-mixed vertical column. Therefore, the parametrization will randomly redistribute the air parcels in the vertical column between the surface and the equilibrium level, which is considered as an estimate of the cloud top height. Within this process, the random distribution of the particles needs to be vertically weighted by air density.
The globally applied threshold CAPE_0 can be set to a value near zero, implying that convection will always take place everywhere below the equilibrium level. This approach is referred to as the "extreme convection method". The extreme convection method will provide an upper limit to the effects of unresolved convection. On the contrary, switching off the convection parametrization completely will provide a lower limit for the effects of unresolved convection. Intermediate states can be simulated by selecting other specific values of the threshold CAPE_0. The threshold CAPE_0 is therefore an important tuning parameter of the parametrization. Figure A7 provides some guidance on how different choices of CAPE_0 affect the simulated convection.
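The redistribution step can be sketched as follows. This is our illustrative reading, not the MPTRAC source: drawing the new pressure uniformly between the surface pressure and the equilibrium-level pressure corresponds, in a hydrostatic atmosphere, to a vertical distribution weighted by air density (mass-weighted mixing).

```c
#include <stdlib.h>

/* If CAPE exceeds the threshold CAPE0 and the parcel sits below the
   equilibrium level, assign a new pressure drawn uniformly between the
   surface pressure ps and the equilibrium-level pressure pel [hPa]. */
double convect_parcel(double p, double cape, double cape0,
                      double ps, double pel) {
    if (cape < cape0 || p < pel)
        return p;  /* no convection, or parcel above the cloud top */
    double u = rand() / ((double)RAND_MAX + 1.0);  /* uniform in [0, 1) */
    return pel + u * (ps - pel);
}
```

Setting cape0 near zero reproduces the "extreme convection method" described above.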

Wet deposition
Wet deposition causes the removal of trace gases and aerosol particles from the atmosphere within or below clouds by absorption into droplets followed by droplet precipitation or impaction on the surface. In the first step, it is determined whether an air parcel is located below a cloud top. The cloud top pressure p_ct is determined from the meteorological data as the highest vertical level where cloud water or ice (i. e., CLWC, CRWC, CIWC, or CSWC) is present. For numerical efficiency, the level p_ct is inferred at the time when the meteorological input data are processed (Appendix A4). During the model run, p_ct can therefore be determined quickly by interpolation from the preprocessed input data. Likewise, the cloud bottom pressure p_cb is determined quickly from precalculated data.
In the second step, the wet deposition parametrization determines an estimate of the subgrid-scale precipitation rate I_s, which is needed to calculate the scavenging coefficient Λ. We estimated I_s (in units of mm h⁻¹) from the total column cloud water c_l, based on a regression analysis using existing cloud and precipitation data. Similar to p_ct, the total column cloud water c_l is calculated by vertically integrating the cloud water content data at the time when the meteorological input data are processed (Appendix A4). During the model run, c_l can therefore also be determined quickly by horizontal interpolation of the preprocessed input data. For efficiency, the parametrization of wet deposition is not evaluated further if I_s falls below a lower limit.

In the third step, it is inferred whether the air parcel is located within or below the cloud, because scavenging coefficients will be different under these conditions. The position of the air parcel within or below the cloud is determined by interpolating the cloud water content to the position of the air parcel and by testing whether the interpolated values are larger than zero.
In the fourth step, the scavenging coefficient Λ is calculated from the precipitation rate I_s,

Λ = a I_s^b.

For aerosol particles, the constants a and b need to be specified as control parameters for the "within cloud" and "below cloud" cases, respectively. Values of a = 4…8 × 10⁻⁵ and b = 0.79 have been used in the HYSPLIT model (Draxler and Hess, 1997).
For gases, wet deposition depends upon their solubility, and the scavenging coefficient can be calculated from

Λ = H R T I_s / ∆z_c,

where H is Henry's law constant, R is the universal gas constant, T is temperature, and ∆z_c is the depth of the cloud layer, which we calculate from p_ct and p_cb. The temperature dependence of Henry's law constant is obtained from

H(T) = H⊖ exp[(∆_sol H / R) (1/T − 1/T⊖)].

The constants H⊖ and ∆_sol H / R, with enthalpy of dissolution ∆_sol H, at the reference temperature T⊖ = 298.15 K have to be specified as control parameters. Values for a wide range of species have been tabulated by Sander (2015). The parameters for selected species of interest are included as defaults in MPTRAC and summarized in Table 1.

Finally, once the scavenging coefficient Λ for the gases or aerosol particles within or below a cloud is determined, an exponential loss of mass m over the time step ∆t is calculated,

m(t + ∆t) = m(t) exp(−Λ ∆t).

Dry deposition
Dry deposition leads to a loss of mass of aerosol particles or trace gases by gravitational settling or chemical and physical interactions with the surface in the dry phase. In the parametrization implemented in MPTRAC, dry deposition is calculated for air parcels located in the lowermost ∆p = 30 hPa layer above the surface. This corresponds to a layer width of ∆z ≈ 210 m at standard conditions. For aerosol particles, the deposition velocity v_dep will be calculated as described in Sect. 2.3.7 as a function of surface pressure p and temperature T as well as particle radius r_p and particle density ρ_p. For trace gases, the deposition velocity v_dep needs to be specified as a control parameter. Currently, this parameter is set to a constant value across the globe for each trace gas. For future applications with a stronger focus on the boundary layer, v_dep will need to vary geographically to account for the dependence on surface characteristics.
For both particles and gases, we calculate the loss of mass based on the deposition velocity v_dep, the model time step ∆t, and the surface layer width ∆z from

m(t + ∆t) = m(t) exp(−v_dep ∆t / ∆z).

Note that both the dry deposition and the wet deposition parametrizations have been implemented only very recently and need further testing and evaluation in future work. They are described here for completeness, but they were not included in the model verification and performance analysis in Sect. 3.

Sedimentation
In order to take into account the gravitational settling of particles, the sedimentation velocity v_s needs to be calculated. Once v_s is known, it can be used to calculate the change of the vertical position of the particles over the model time step Δt. In MPTRAC, v_s is calculated for spherical particles following the method described by Jacobson (1999). In the first step, we calculate the density ρ of dry air,

\[ \rho = \frac{p}{R T}, \]

from pressure p, temperature T, and the specific gas constant R of dry air. Next, the dynamic viscosity of air,

\[ \eta = 1.8325 \times 10^{-5}\,\mathrm{kg\,m^{-1}\,s^{-1}} \left( \frac{416.16\,\mathrm{K}}{T + 120\,\mathrm{K}} \right) \left( \frac{T}{296.16\,\mathrm{K}} \right)^{3/2}, \]

and the thermal velocity of an air molecule,

\[ v = \sqrt{\frac{8 k_B T}{\pi m_{air}}}, \]

with Boltzmann constant k_B and mean molecular mass m_air of air, are used to calculate the mean free path of an air molecule,

\[ \lambda = \frac{2 \eta}{\rho v}, \]

as well as the Knudsen number for air,

\[ Kn = \frac{\lambda}{r_p}, \]

where r_p refers to the particle radius. The Cunningham slip-flow correction is calculated from

\[ G = 1 + Kn \left[ 1.249 + 0.42 \exp\left( -\frac{0.87}{Kn} \right) \right]. \]

Finally, the sedimentation velocity is obtained by means of Stokes' law and the slip-flow correction,

\[ v_s = \frac{2 r_p^2 \rho_p g G}{9 \eta}, \tag{22} \]

with particle density ρ_p and conventional standard gravitational acceleration g. Note that r_p and ρ_p can be specified individually for each air parcel. A larger set of parcels can be used to represent a size distribution of aerosol or cloud particles. Figure 5 shows sedimentation velocities calculated by Eq. (22) for different particle diameters as well as pressure levels and mean temperatures at mid-latitude atmospheric conditions. A standard particle density of ρ_p = 1000 kg m⁻³ was considered.
The sedimentation velocity v_s scales nearly linearly with ρ_p, i. e., increasing ρ_p by a factor of two will also increase v_s by the same factor. Note that the overall sensitivity to temperature T is much weaker than the sensitivity to pressure p (not shown).
Sedimentation velocities are shown here up to particle Reynolds numbers Re_p ≤ 1, for which v_s is expected to be accurate to better than 10 % (Hesketh, 1996). Larger particles will require additional corrections, which have not been implemented.

Hydroxyl chemistry
In this section, we discuss the MPTRAC module that is used to simulate the loss of mass of a chemical species by means of reaction with the hydroxyl radical. The hydroxyl radical (OH) is an important oxidant in the atmosphere, causing the decay of many gas phase species. The oxidation of different gas phase species with OH can be classified into two main categories: bimolecular reactions (e. g., reactions of CH4 or NH3) and termolecular reactions (e. g., CO or SO2).

For bimolecular reactions, the rate constant is calculated from the Arrhenius law,

\[ k = A \exp\left( -\frac{E}{R T} \right), \tag{23} \]

where A refers to the Arrhenius factor, E to the activation energy, and R to the universal gas constant. For bimolecular reactions, the Arrhenius factor A and the ratio E/R need to be specified as control parameters of the MPTRAC model. The reaction rate coefficients for various gas phase species can be found in the Jet Propulsion Laboratory (JPL) data evaluation (Burkholder et al., 2019). For termolecular reactions, the rate constant depends on pressure and temperature and is calculated from the low-pressure-limiting rate k_0 and the high-pressure-limiting rate k_∞. Based on the molecular density of air,

\[ [M] = \frac{N_A \, p}{R T}, \tag{24} \]

with Avogadro constant N_A, the effective second-order rate constant k is calculated from

\[ k = \frac{k_0 [M]}{1 + k_0 [M] / k_\infty} \; 0.6^{\left[ 1 + \left( \log_{10}(k_0 [M] / k_\infty) \right)^2 \right]^{-1}}. \tag{25} \]

The low- and high-pressure limits of the reaction rate constant are given by

\[ k_0 = k_0^{298} \left( \frac{T}{298\,\mathrm{K}} \right)^{-n}, \qquad k_\infty = k_\infty^{298} \left( \frac{T}{298\,\mathrm{K}} \right)^{-m}. \]

The constants k_0^298 and k_∞^298 at the reference temperature of 298 K and the exponents n and m need to be specified as control parameters. The exponents can be set to zero in order to neglect the temperature dependence of the low- or high-pressure limits k_0 and k_∞. The termolecular reaction rate coefficients implemented directly into the MPTRAC model are listed in Table 3.
Based on the bimolecular reaction rate k = k(T) from Eq. (23) or the termolecular reaction rate k = k(T, [M]) from Eq. (25), the loss of mass of the gas phase species over time is calculated from

\[ m(t + \Delta t) = m(t) \exp(-k \, [\mathrm{OH}] \, \Delta t). \]

The hydroxyl concentrations [OH] are obtained by bilinear interpolation from the monthly mean zonal mean climatology of Pommrich et al. (2014). This approach is most suitable for global simulations covering at least several days, as hydroxyl concentrations may vary significantly between day- and nighttime as well as with the local atmospheric composition.

Exponential decay
A generic module has been implemented to simulate the loss of mass of an air parcel due to any kind of exponential decay process, e. g., chemical loss or radioactivity, over a model time step Δt,

\[ m(t + \Delta t) = m(t) \exp(-\Delta t / \tau), \]

with the e-folding lifetime τ specified as a control parameter. We also implemented an option to specify separate lifetimes for the troposphere and stratosphere. A smooth transition between the tropospheric and stratospheric lifetime is created within a ±1 km log-pressure altitude range around the tropopause.

Boundary conditions
Finally, a module has been implemented to impose constant boundary conditions on particle mass or volume mixing ratio. At each time step of the model, all particles that are located within a given latitude and pressure range can be assigned a constant mass or volume mixing ratio as defined by a control parameter of the model. Next to specifying fixed pressure ranges in the free troposphere, stratosphere, or mesosphere to define the boundary conditions, it is also possible to specify a fixed pressure range with respect to the surface pressure, in order to define a near-surface layer.
The boundary condition module can be used to conduct synthetic tracer simulations. For example, the synthetic tracer E90 (Prather et al., 2011; Abalos et al., 2017) is emitted uniformly at the surface and has a constant e-folding lifetime of 90 days throughout the entire atmosphere. In our study, we implemented this by assigning a constant volume mixing ratio of 150 ppb to all air parcels residing in the lowermost 30 hPa pressure range above the surface. With its 90-day lifetime, the tracer E90 quickly becomes well-mixed in the troposphere. However, the lifetime is much shorter than typical timescales of stratospheric transport. The tracer E90 therefore exhibits sharp gradients across the tropopause. Being a passive tracer in the upper troposphere and lower stratosphere region, E90 is evaluated in studies related to the chemical tropopause and stratosphere-troposphere exchange. Another example of a synthetic tracer is ST80, which is defined by a constant volume mixing ratio of 200 ppb above 80 hPa and a uniform, fixed 25-day e-folding lifetime in the troposphere (Eyring et al., 2013). Like E90, the tracer ST80 is of interest in investigating stratosphere-troposphere exchange. The tracer NH50 is considered to assess interhemispheric gradients in the troposphere. It is defined by a surface layer volume mixing ratio of 100 ppb over 30° to 50° N and a uniform 50-day e-folding lifetime (Eyring et al., 2013). Simulation results of E90, ST80, and NH50 obtained with the CPU and GPU code of MPTRAC will be discussed in Sect. 3.3.

Sampling of meteorological data
For diagnostic purposes, it is often necessary to obtain meteorological data along the trajectories. However, typically only a limited subset of the meteorological variables is of interest. Therefore, we introduced the concept of "quantities", allowing the user to select which variables should be calculated and stored in the output data files. In addition to selecting the specific types of model output that are desired, a list of the specific quantities that should be provided as output needs to be specified via control parameters. In total, about 50 output variables are implemented in MPTRAC (Table 4). In addition to the predefined output variables, it is possible to include user-defined variables. The user-defined variables are typically not modified during the simulations, but they are useful to store information that is required for further analyses. For example, the starting time and position of the particles can be stored as user-defined variables, and they can later be used to calculate the trajectory time or the distance from the origin.
A distinct module has been implemented into MPTRAC that is used to sample the meteorological data along the trajectories.
This module can become demanding in terms of computing time, as it requires the interpolation of up to ten 3-D variables and sixteen 2-D variables of meteorological data that are either read in from the meteorological input data files or calculated during the preprocessing of the meteorological data. However, the computing time needs are typically strongly reduced by the fact that meteorological data along the trajectories are usually not required at every model time step but only at user-defined output intervals.
In addition to sampling the meteorological data along the trajectories, we also implemented tools to provide direct output of the meteorological data. This includes tools to extract maps, zonal mean cross-sections, and vertical profiles on pressure or isentropic levels. These tools allow for time averaging over multiple meteorological data files. Another tool is available to sample the meteorological data based on a list of given times and positions, without involving any trajectory calculations. Altogether, these tools provide great flexibility in interpolating meteorological data for many applications.
At present, MPTRAC offers seven output options, referred to as "atmospheric data", "gridded data", "CSI/skill scores", "ensemble output", "profile output", "sample data", and "station output". The most comprehensive output of MPTRAC are the particle data files, also referred to as "atmospheric data". Particle data files can be generated at user-defined output time intervals, which need to be integer multiples of the model time step Δt. The particle data files are the most comprehensive type of output because they contain the time, location, and the values of all user-defined variables of each individual air parcel.

As the particle data output can easily become too large for further analyses, in particular if many air parcels are involved, the output of "gridded data" has been implemented as well. This output is generated by integrating over the mass of all parcels in regular longitude × latitude × log-pressure height grid boxes. From the total mass per grid box and the air density, the column density and the volume mixing ratio of the tracer are calculated. Alternatively, if the volume mixing ratio per air parcel is specified, the mean volume mixing ratio per grid box is reported. In the vertical, it is possible to select only a single layer for the grid output, in order to obtain total column data. Similarly, by selecting only one grid box in longitude, it is possible to calculate zonal means.
Another type of output that we used in several studies (e. g., Heng et al., 2016) is the critical success index (CSI) output. This output is produced by analyzing model and observational data on a regular grid. The analysis is based on a 2 × 2 contingency table of model predictions and observations. Here, predictions and observations are counted as "yes" if the model column density or the observed variable exceeds a user-defined threshold. Otherwise, they are counted as "no". Next to the CSI, the counts allow us to calculate the probability of detection (POD) and the false alarm rate (FAR), which are additional skill scores often considered in model verification. More recently, the CSI output was extended to also include the equitable threat score (ETS), the linear and rank-order correlation coefficients, the bias, the root mean square (RMS) difference, and the mean absolute error. A more detailed discussion of the skill scores is provided by Wilks (2011).
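The skill scores derived from the 2 × 2 contingency table can be written compactly; the following C helpers use one common set of definitions (Wilks, 2011), which may differ in detail from MPTRAC's implementation:

```c
/* Skill scores from the 2 x 2 contingency table described in the text:
   hits (model yes, observation yes), false alarms (model yes, observation
   no), and misses (model no, observation yes). */

/* Critical success index. */
double csi(int hits, int false_alarms, int misses) {
  return (double) hits / (hits + false_alarms + misses);
}

/* Probability of detection. */
double pod(int hits, int misses) {
  return (double) hits / (hits + misses);
}

/* False alarm ratio; note that the term "false alarm rate" is sometimes
   also used for false alarms relative to the number of non-events. */
double far(int hits, int false_alarms) {
  return (double) false_alarms / (hits + false_alarms);
}
```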

Another option to condense comprehensive particle data is provided by means of the "ensemble output". This type of output requires a user-defined ensemble index value to be assigned to each air parcel. Instead of the individual air parcel data, the ensemble output will contain the mean positions as well as the means and standard deviations of the quantities selected for output for each set of air parcels having the same ensemble index. The ensemble output is of interest if, for instance, tracer dispersion from multiple point sources needs to be quantified by means of a single model run.

The "profile output" of MPTRAC is similar to the grid output as it creates vertical profiles from the model data on a regular longitude × latitude horizontal grid. However, the vertical profiles contain not only volume mixing ratios of the species of interest, but also profiles of pressure, temperature, water vapor, and ozone as inferred from the meteorological input data. This output is intended to be used as input for a radiative transfer model, in order to simulate satellite observations for the given model output. In combination with real satellite observations, this output can be used for model validation but also as a basis for radiance data assimilation.
The "sample output" of MPTRAC was implemented most recently. It allows the user to extract model information for a list of given locations and times, by calculating the column density and volume mixing ratio of all parcels located within a user-specified horizontal search radius and vertical height range. For large numbers of sampling locations and air parcels, this type of output can become rather time-consuming. It requires an efficient implementation and parallelization because it needs to be tested at each time step of the model whether an air parcel is located within a sampling volume or not. The numerical effort scales linearly with both the number of air parcels and the number of sampling volumes. The sample output was first applied in the study of Cai et al. (2021) to sample MPTRAC data directly on TROPOspheric Monitoring Instrument (TROPOMI) satellite observations.
Finally, the "station output" collects the data of air parcels that are located within a search radius around a given location (latitude, longitude). The vertical position is not considered here, i. e., the information of all air parcels within the vertical column over the station is collected. In order to avoid double-counting of air parcels over multiple time steps, a data variable "stat" has been introduced that keeps track of whether an air parcel has already been accounted for in the station output or not.
We used this type of output in studies estimating volcanic emissions from satellite observations using the backward trajectory method (Wu et al., 2017, 2018).

By default, all output functions of MPTRAC create data files in an ASCII table format. This type of output is usually simple to understand and usable with many tools for data analysis and visualization. However, in the case of large-scale simulations it is desirable to use more efficient file formats. Therefore, an option was implemented to write particle data to binary output files.
Likewise, reading particle data from a binary file is much more efficient than from an ASCII file. The binary input and output is an efficient way to save or restore the state of the model during intermediate steps of a workflow. Another interesting output option is to "pipe" the data directly from the model to a visualization tool. This will keep the output data in memory and directly forward it from MPTRAC to the visualization tool. This option has been successfully tested for the particle and the grid output in combination with the graphing utility gnuplot (Williams and Kelley, 2020).

Trajectory statistics
For diagnostic purposes, it is often helpful to calculate trajectory statistics. The tool atm_stat can calculate the mean, standard deviation, skewness, kurtosis, median, absolute deviation, median absolute deviation, minimum, and maximum of the positions and variables of air parcel data sets. The calculations can cover the entire set or a selection of air parcels, based on the ensemble variable or extracted by means of the tool atm_select. The tool atm_select allows the user to select air parcels based on a range of parcel indices, a log-pressure height × longitude × latitude box, or a search radius around a given geolocation.
The trajectory statistics are calculated by applying the algorithms of the GNU Scientific Library (Gough, 2003).

The tool atm_dist allows us to calculate statistics of the deviations between two sets of trajectories. Foremost, this includes the absolute horizontal and vertical transport deviations (AHTD and AVTD, Kuo et al., 1985; Rolph and Draxler, 1990),

\[ \mathrm{AHTD}(t) = \frac{1}{N} \sum_{i=1}^{N} \sqrt{ \left( x_i(t) - X_i(t) \right)^2 + \left( y_i(t) - Y_i(t) \right)^2 }, \]
\[ \mathrm{AVTD}(t) = \frac{1}{N} \sum_{i=1}^{N} \left| z_i(t) - Z_i(t) \right|, \]

where (x_i, y_i, z_i) and (X_i, Y_i, Z_i) refer to the positions of the N pairs of air parcels in the two trajectory sets. Here, the horizontal distances are calculated approximately by converting the geographic longitudes and latitudes of the particles to Cartesian coordinates, followed by calculation of the Euclidean distance of the Cartesian coordinates. Vertical distances are calculated based on conversion of particle pressure to log-pressure altitude using the barometric formula.
Next to AHTDs and AVTDs, the tool atm_dist can calculate the relative horizontal and vertical transport deviations (RHTD and RVTD), which relate the absolute transport deviations to the horizontal and vertical path lengths of individual trajectories. It can also be used to quantify the mean absolute and relative deviations as well as the absolute and relative bias of meteorological variables or the relative tracer conservation errors of dynamical tracers over time. The definitions of these quantities and more detailed descriptions are provided by Hoffmann et al. (2019). In this study, we mostly present absolute transport deviations, to evaluate the differences between CPU and GPU trajectory calculations (Sect. 3.2).

Parallelization strategy
Lagrangian particle dispersion models are well suited for parallel computing as large sets of air parcel trajectories can be calculated independently of each other. The workload is "embarrassingly parallel" or "perfectly parallel" as little to no effort is needed to separate the problem into a number of independent parallel computing tasks. In this section, we discuss the MPI/OpenMP/OpenACC hybrid parallelization implemented in MPTRAC. The term "hybrid parallelization" refers to the fact that several parallelization techniques are employed in a simulation at the same time.
The Message Passing Interface (MPI) is a communication protocol and standardized interface that enables point-to-point and collective communication between computing tasks. MPI provides high performance, scalability, and portability and is a leading approach in high-performance computing. The application programming interface Open Multi-Processing (OpenMP) supports multi-platform shared-memory multiprocessing for various programming languages and computing architectures. It consists of a set of compiler directives, library routines, and environment variables that are used to distribute the workload over a set of computing threads on a compute node. An application built with an MPI/OpenMP hybrid approach can run on a compute cluster, such that OpenMP can be used to exploit the parallelism of all hardware threads within a multi-core node while MPI is used for parallelism between the nodes. Open Accelerators (OpenACC) is a programming model that enables parallel programming of heterogeneous CPU/GPU systems. As in OpenMP, the source code of the program is annotated with "pragmas" to identify the areas that should be accelerated by using GPUs. In combination with MPI and OpenMP, OpenACC can be used to conduct multi-GPU simulations.
The GPU parallelization is a new feature since version 2.0 of MPTRAC. Various concepts for employing GPUs are available for scientific high-performance computing. The simplest option is to replace standard computing libraries by GPU-enabled versions, which are provided by the vendors of the GPU hardware such as the NVIDIA company. The most prominent example of such a library, which has been ported to GPUs, is the Basic Linear Algebra Subprograms (BLAS) library for matrix and vector operations. Another option is to rewrite the compute kernels of a code by means of NVIDIA's Compute Unified Device Architecture (CUDA). The CUDA programming model is often considered to be most flexible and allows for the most detailed control over the GPU hardware. However, CUDA requires solid knowledge of GPU technology and it may cause error-prone and time-consuming work if legacy code needs to be rewritten (Li and Shih, 2018).
OpenACC can help to overcome the practical difficulties of CUDA and reduces the coding workload for developers significantly. We considered OpenACC to be a more practical choice for porting MPTRAC to GPUs, as the code of the model should remain understandable and maintainable by students and domain scientists who may not be familiar with the details of more complex GPU programming concepts such as CUDA. By using OpenACC, we found that we were able to maintain the same code base of the MPTRAC model rather than having to develop a CPU and a GPU version of the model independently of each other. Besides ease of maintenance and portability of the code, the common code base is a significant advantage when checking for bit-level agreement of CPU and GPU computations and reproducibility of simulation results.
Algorithm 1 illustrates the GPU porting of the C code of MPTRAC by means of OpenACC pragmas. Two important aspects have to be considered. First, the pragma #pragma acc parallel loop independent gang vector is used to distribute the calculations within the loops over the particles over the compute elements of the GPU device. In this example, the loop over the particles within the advection module used to calculate the trajectories has been ported to GPUs. For compute kernels that are used inside of a GPU-parallelized code block, the compiler has to create both a CPU and a GPU version of the kernel. In this example, the pragma #pragma acc routine (intpol_met_time_3d) instructs the compiler to create CPU and GPU code for the function used to interpolate meteorological data, as it is used within and outside of the advection module. Algorithm 1 also shows that if the code is compiled without GPU support (as identified by the _OPENACC flag), the fall-back option is to apply OpenMP parallelization for the compute loops on the CPUs (#pragma omp parallel for). The OpenMP parallelization had already been implemented in an earlier CPU-only version of MPTRAC.

The second aspect of GPU porting by means of OpenACC is concerned with data management. In principle, the data required for a calculation need to be copied from CPU to GPU memory before the calculation and need to be copied back from GPU to CPU memory after the computation has finished. Although NVIDIA's "CUDA Unified Memory" technique allows these data transfers to be handled automatically, frequent data transfers can easily become a bottleneck of the code. Therefore, we implemented additional pragmas to instruct the compiler when a data transfer is actually required.

The pragma #pragma acc enter data create(ctl,atm[:1],met0[:1],met1[:1]) in Algorithm 1 creates a data region and allocates memory for the model control parameters (ctl), air parcel data (atm), and meteorological data (met0, met1) on the GPU. The data region is deleted and the GPU memory is freed by #pragma acc exit data delete(ctl,atm,met0,met1) at the end of the main function. Within the data region, the associated data remain resident in GPU memory, and transfers between CPU and GPU memory are triggered only by explicit update directives. In total, about 60 pragmas had to be implemented in the code to parallelize the computational loops and to handle the data management to facilitate the GPU porting of MPTRAC by means of OpenACC. Code parts that have been ported to GPUs are highlighted in the call graph shown in Fig. 1. Next to the OpenACC pragmas, some additional code changes were necessary to enable the GPU porting. In the original CPU code, the GNU Scientific Library (GSL) was applied to conduct a number of tasks; for instance, the GSL was used to compute statistical quantities of data arrays. As the GSL has not been ported to GPUs, the corresponding functions had to be rewritten without usage of the GSL. The GSL was also used to generate random numbers. Here, we implemented NVIDIA's CUDA random number generation library (cuRAND) as an efficient replacement for distributed random number generation on GPUs. Also, we found that some standard math library functions were not available with PGI's C compiler; e. g., a replacement had to be implemented for the math function fmod() used to calculate the floating point modulo. For earlier versions of the PGI C compiler, we also found a severe bug when large data structures (> 2 GByte) were transferred from CPU to GPU memory. Nevertheless, despite a number of technical issues, we consider the porting of MPTRAC to GPUs by means of OpenACC to be straightforward and would recommend it for other applications.
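The data management pattern can be sketched as follows; the function and array names are illustrative, and without an OpenACC compiler the pragmas are simply ignored, so the loop runs on the CPU:

```c
/* Sketch of the OpenACC data management pattern described above: a data
   region keeps the array resident in GPU memory across many kernel
   launches, and data are moved only when explicitly requested. Names are
   illustrative, not MPTRAC's. */

void decay_on_gpu(int np, double *mass, int nstep) {
#pragma acc enter data copyin(mass[0:np])
  for (int step = 0; step < nstep; step++) {
    /* kernel launches inside the data region trigger no data transfers */
#pragma acc parallel loop present(mass[0:np])
    for (int ip = 0; ip < np; ip++)
      mass[ip] *= 0.99;
  }
  /* copy results back to the host only once, at the end */
#pragma acc update host(mass[0:np])
#pragma acc exit data delete(mass)
}
```

The present() clause asserts that the data are already on the device, so the many kernel launches inside the loop incur no CPU-GPU transfers; the single update at the end replaces per-step copies.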
3 Model verification and performance analysis

Description of the GPU test system and software environment
The model verification and performance analysis described in this section has been conducted on the Jülich Wizard for European Leadership Science (JUWELS) system (Jülich Supercomputing Centre, 2019). The JUWELS system is considered a major target system for running Lagrangian transport simulations with MPTRAC in future work. Smaller transport deviations are found for this height range due to favorable background wind and velocity conditions (Hoffmann et al., 2019).
Note that the CPU and GPU binaries created by PGCC (colored curves in Fig. 6) show somewhat better agreement (by a factor of 3 to 5) compared to deviations between the CPU binaries created by GCC and PGCC (solid gray curves). These differences between the compilers may also be attributed to different optimization settings applied at compile time. However, overall, the transport deviations found here are several orders of magnitude smaller than deviations caused by other factors, such as deviations related to different numerical integration schemes and the choice of the model time step (Rößler et al., 2018) or the impact of subgrid-scale winds and diffusion (Hoffmann et al., 2019). Therefore, we consider this result to show excellent agreement between the kinematic trajectory calculations with the GPU and CPU code and the different compilers.

Comparison of synthetic tracer simulations
With our current code, we cannot directly compare individual trajectories including the effects of turbulent diffusion and subgrid-scale winds as well as convection, because these modules add stochastic perturbations to the trajectories that are generated by means of the different random number generators of the cuRAND library for the GPUs and the GSL library for the CPUs. In order to compare CPU and GPU simulations considering these modules, we conducted global transport simulations of synthetic tracers with ERA5 data. In these simulations, we included the tropospheric synthetic tracers E90 and NH50 as well as the stratospheric synthetic tracer ST80 (Sect. 2.3.10). The simulations have been initialized with 10⁶ globally homogeneously distributed air parcels in the height range from the surface up to the lower mesosphere on 1 January 2017, 00:00 UTC. The simulations cover the entire year 2017. We assigned volume mixing ratios to the individual air parcels according to the specific initial and boundary conditions of the synthetic tracers (Sect. 2.3.10). Expecting that differences between individual trajectories largely cancel out due to the averaging, we then compared monthly mean zonal means of the synthetic tracer volume mixing ratios from the CPU and GPU simulations.

As a representative example, Fig. 7c shows the monthly mean zonal mean distribution of E90 in July 2017 from our ERA5 simulations with the GPU code of MPTRAC. Following a spin-up time of half a year, i. e., about twice the lifetime of the E90 tracer, it is found that E90 is well-mixed and rather homogeneously distributed throughout the troposphere. The 90 ppbv contour of the E90 tracer is expected to indicate the chemical tropopause (Prather et al., 2011). For our simulations, we found that the 90 ppbv contour line of E90 agrees well with the monthly mean zonal mean dynamical tropopause. The E90 distribution in Fig. 7c qualitatively compares well with the climatological mean presented by Abalos et al. (2017). Figure 7d shows the differences of the E90 monthly mean zonal means between the GPU and CPU code. Despite the fact that, even without diffusion or convection being considered, individual CPU and GPU trajectories already differ significantly after 60 days (Sect. 3.2), the E90 monthly mean zonal means of the CPU and GPU code agree quite well. The differences of the means are typically below ±8 ppb, which is about 5 % of the maximum volume mixing ratio of the E90 tracer. The largest differences between the CPU and GPU simulations are found near the tropopause. This region is most sensitive to statistical effects, as the tracer gradients are largest in this region. However, no systematic bias between the E90 distributions of the CPU and GPU code is revealed. Overall, this comparison of the E90 simulations shows good agreement between the GPU and CPU simulations. Corresponding comparisons have been conducted for the stratospheric tracer ST80 and the simulation results for the northern hemisphere tracer NH50. Similar to E90, the comparisons for ST80 and NH50 show that individual differences between the monthly mean zonal means from the CPU and GPU simulations are mostly below ±5 % of the maximum volume mixing ratios and that the largest differences are found in regions where sharp gradients are present in the tracer fields.
Taking into account the spin-up time of the simulations and the fact that stratospheric transport has much longer timescales than tropospheric transport, which causes some latency in particular for ST80, similar results are also found for the other months of the year 2017.

OpenMP scaling analysis
Some parts of the MPTRAC code cannot be or have not been ported to GPUs. This includes the file input and output operations, which are directed via the hosting CPUs and the CPU memory, and the functions that are used for preprocessing of the meteorological data, such as the codes to calculate geopotential heights, potential vorticity, convective available potential energy, the planetary boundary layer, the tropopause height, and others. For those parts of the MPTRAC code that have not been ported to GPUs, it is important to study the OpenMP scalability on the host devices. In particular, good OpenMP scaling up to 12 threads needs to be demonstrated, as this is the number of physical cores associated with each GPU device on a JUWELS Booster compute node.
Here, we focus on an OpenMP strong scaling test conducted by means of binaries compiled with the PGI compiler, as this compiler is also used for the GPU runs. The results of the OpenMP strong scaling test are shown in Fig. 8. The runtimes indicate that the calculations in the physics modules (trajectories, diffusion, convection, etc.) and the meteorological data processing yield the major contributions to the total runtime up to 12 OpenMP threads. At larger numbers of threads, the file input becomes the leading contribution to the total runtime. Concerning the GPU simulations, the OpenMP scaling of the meteorological data processing on the CPUs is of interest, as this is the leading component of the calculations that has not been ported to GPUs. In contrast, the scaling of the physics calculations can be ignored here for now, as these calculations have been ported to GPUs. For the meteorological data processing, OpenMP provides a speedup of 10.7 (a parallel efficiency of 90 %) up to 12 threads and 20.2 (84 %) up to 24 threads (Fig. 8b). If the simulations cover more than one socket (48 threads), the parallel efficiency drops to 65 %. Enabling simultaneous multithreading (SMT) provides a further improvement of the speedup from 31.3 at 48 threads to 37.2 at 96 threads. The OpenMP scaling of the meteorological data processing is particularly efficient compared to other parts of the code. This might be related to the fact that we selected a rather suitable memory layout for the meteorological data. Vertical columns of the meteorological data are ordered sequentially in memory, which allows for efficient use of memory caches, as the preprocessing of the meteorological data is mostly conducted on vertical columns of the data.
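The quoted parallel efficiencies follow directly from the measured speedups and thread counts:

```c
/* Parallel efficiency from a measured speedup and the number of threads,
   as used for the percentages quoted in the scaling analysis. */
double parallel_efficiency(double speedup, int nthreads) {
  return speedup / nthreads;
}
```

For example, a speedup of 10.7 on 12 threads corresponds to an efficiency of about 0.89, and 20.2 on 24 threads to about 0.84, matching the rounded values given above.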

GPU scaling analysis
In this section, we discuss the GPU scaling of MPTRAC simulations conducted on JUWELS Booster nodes with respect to the problem size, i. e., the number of particles or trajectories being calculated. The analysis of the scaling behavior with respect to the problem size is particularly interesting, because the NVIDIA A100 GPUs provide a particularly large number of parallel compute elements (Sect. 3.1), which require a high degree of parallelism of the problem in order to fully exploit the GPU computing capacities. Here, we tested problem sizes of 10^0 to 10^8 particles and measured the corresponding runtimes on individual JUWELS Booster A100 GPU devices. The scaling analysis shows that the total runtime required for the simulations is rather constant, in the range of 122 s to 127 s, between 10^0 and 10^6 particles. For larger numbers of particles, linear scaling is observed. The total runtime increases to 153 s for 10^7 particles and 409 s for 10^8 particles.
The analysis of individual contributions shows that reading of the ERA5 input data (60 to 64 s, referred to as INPUT in Fig. 9a) and preprocessing of the meteorological data (60 to 62 s, referred to as METEO) account for most of the total runtime for ≤ 10^7 particles. The runtime for this part of the code depends on the size of the meteorological input data, but is independent of the number of particles. The computing time required to calculate the trajectories, diffusion, convection, and the decay of particle mass (referred to as PHYSICS) is rather constant (0.1 to 0.4 s) for ≤ 10^4 particles, followed by linear scaling for larger numbers of particles. A transition from constant to linear scaling in the range of 10^3 to 10^4 particles is expected, as at least 3456 particles must be simulated at once in order to fully exploit the parallel compute capacities of an A100 GPU (Sect. 3.1). For smaller numbers of particles, a fraction of the compute elements of the GPU device remains unused. Beyond 3 × 10^7 particles, the runtime required for the physics calculations exceeds the runtime required for file input and meteorological data preprocessing.
The runtime for file output can be neglected in the current setup.
The same scaling test was used to estimate the speedup achieved by applying the GPU devices over a simulation conducted on CPUs only. Note that defining a GPU-over-CPU speedup is a difficult task in general, as the results depend not only on the GPU devices, but on the individual compute capacities of both the CPU and the GPU devices. Nevertheless, to estimate this speedup, we conducted corresponding CPU-only simulations and runtime measurements applying 12 cores of a JUWELS Booster compute node. A number of 12 cores was chosen here, as this corresponds to the number of physical cores sharing a GPU device on a Booster node. We think that this specific approach of estimating the GPU-over-CPU speedup provides a fair and more realistic comparison, in contrast to comparing the GPU compute capacity to just a single CPU core, for example.
The estimated speedup due to applying the GPU devices of a JUWELS Booster compute node is shown in Fig. 9b. In case of not using the "managed memory" option (also referred to as NVIDIA Unified Memory), which corresponds to the results shown earlier, we found a GPU-over-CPU speedup of less than one for ≤ 10^5 particles, i. e., the GPU simulations are actually slower than the CPU-only simulations. This behavior changes starting from about 10^6 particles, with a speedup of about 1.5 for 10^6 particles and about 5.3 for 10^7 particles. The GPU-over-CPU speedup finally saturates near 16.4 for 10^8 particles.
At these particle numbers, the speedup of the total runtime is largely determined by the scaling of the physics calculations.
Additionally, we tested NVIDIA's Unified Memory capability, which allows the programmer to avoid specifying explicit data transfers between the host and the device by triggering those transfers automatically when page faults occur. It is convenient to not have to specify the data transfers explicitly in the code, but our tests showed that the managed memory option causes a performance degradation of about 20 % in the GPU-over-CPU speedup.
Finally, we conducted a more detailed runtime profile analysis of the GPU simulations based on about 40 individual timers, which have been implemented directly into the code. Table 5 shows the results of the profile analysis for the largest GPU simulation conducted here, covering 10^8 particles. This is the most compute-intensive simulation, in which about 67 % of the runtime is spent on the physics calculations conducted on the GPUs. Next to the physics modules, other parts of the code required about 33 % of the overall runtime of this large-scale simulation.
Reading the input data required about 15 % of the runtime, where most time was needed to read the 3-D level data of the meteorological input. A similar amount of time (about 15 %) was required for the processing of the meteorological data, with most time being spent on calculating CAPE (5.4 %), geopotential heights (4.9 %), and the tropopause (2.1 %). The runtime required for output was 2.4 % in this simulation. However, it needs to be considered that output was written only every 6 h here.
The runtime required for output would scale accordingly for more frequent output. The runtime required for data transfers between GPU and CPU memory was about 1 %, with most time being spent on transferring the meteorological input data from CPU to GPU memory. Although this profile analysis did not reveal any major bottlenecks, there is room for further improving the code of MPTRAC. Most attention should be devoted to further optimizing the advection module, as it requires most of the runtime of large-scale simulations.

Timeline analysis of GPU simulations
For a more detailed analysis of the large-scale MPTRAC simulation comprising 10^8 particles as discussed in Sect. 3.5, we analyzed the timeline of the simulation generated with the NVIDIA Nsight Systems performance analysis tool. Figure 10a shows the timeline for the entire 24 h simulation time period, completed in a runtime of 460 s. Note that the runtime of the Nsight Systems run comprises an overhead of about 50 s compared to the corresponding runtime profile presented in Sect. 3.5. Overall, the timeline analysis indicates that the code provides a high utilization rate of the GPU devices, which is a basic requirement for using the GPU devices efficiently. However, the timeline analysis also reveals optimization opportunities. For instance, overlapping the file input and the meteorological data preprocessing on the CPUs with the physics calculations on the GPUs could hide parts of these costs and reduce the wall clock time significantly. As the meteorological data are finally required on the GPU devices to conduct the Lagrangian transport simulations, it could also be beneficial to port the meteorological data processing from the CPU to the GPU devices to accelerate the processing.

Finally, a more detailed inspection of the physics calculations on the GPUs shows that about 70 % of the time is spent in the advection module, which is used to solve the trajectory equation.

MPI scaling analysis

Figure 11 shows the results of an MPI weak scaling test on the JUWELS Booster, utilizing the MPI/OpenMP/OpenACC hybrid parallelization. Here, the test setup was slightly modified compared to earlier work. We conducted the simulations with 5 × 10^7 particles. This number of particles was chosen to achieve similar fractions of runtime required for file-I/O, meteorological data preprocessing, and the physics calculations. File-I/O was given a significant fraction of the total runtime in this experiment (at least 25 %), as we expected this part to be most sensitive to performance degradation in the MPI scaling test. The simulation time was reduced from 24 h to 6 h to limit the overall consumption of computing resources. During the 6 h simulation time, ERA5 data for seven synoptic time steps were read from the global file system. Particle data and grid output were written once at the end of the 6 h time interval.
We measured the total runtime as well as the runtime for the physics calculations, meteorological data processing, file input, and file output for each MPI task. Figure 11 shows the means and standard deviations of the runtime between 1 and 256 MPI tasks. The scaling test shows that the runtime for the meteorological data preprocessing and the physics calculations is rather constant over the tested numbers of tasks, whereas the runtime for file input grows notably with increasing numbers of tasks. This indicates that file-I/O will become a bottleneck for larger simulations, at least if large-scale data sets, such as the ERA5 reanalysis, are used to drive the simulations.

Conclusions
In this paper, we provide a comprehensive description of the Lagrangian transport model MPTRAC version 2.2. We give an overview of the main features of the model and briefly introduce the code structure. Requirements on the model input data, i. e., global meteorological reanalysis or forecast data as well as particle data, are discussed. MPTRAC provides the functionality to calculate various additional meteorological variables from basic prognostic variables, such as the 3-D fields of pressure, temperature, winds, water vapor, and cloud ice and liquid water content. This includes functions to calculate geopotential heights and potential vorticity, to derive additional cloud properties, such as the cloud top pressure and the total column cloud water, to estimate the convective available potential energy, and to determine the planetary boundary layer and the tropopause height. An evaluation of the results of the meteorological data processing code of MPTRAC against data provided along with ECMWF's ERA5 reanalysis is presented.
As its main component, MPTRAC provides an advection module to calculate the trajectories of air parcels from given wind fields based on the explicit midpoint method. Individual stochastic perturbations can be added to the trajectories to account for the effects of diffusion and subgrid-scale wind fluctuations. Additional modules are implemented to simulate the effects of unresolved convection, wet and dry deposition, sedimentation, hydroxyl chemistry, and other types of exponential loss processes of particle mass. MPTRAC provides a variety of output options, including the particle data themselves, gridded data, profile data, station data, sampled data, ensemble data, and verification statistics and other measures of the model skill. Additional tools are provided to further analyze the particle output, including tools to calculate statistics of particle positions and transport deviations over time.
Next to providing a detailed model description, the focus of this study was to assess the potential for accelerating Lagrangian transport simulations by exploiting graphics processing units (GPUs). We ported the Lagrangian transport model MPTRAC to GPUs by means of the OpenACC programming model. The GPU porting mainly comprised (i) creating a large data region over most of the code to keep the particle data in GPU memory during the entire simulation, (ii) implementing data transfers of the meteorological input data from CPU to GPU memory, (iii) implementing data transfers of the particle data from GPU to CPU memory for output, (iv) parallelizing and offloading the compute loops for the trajectories and other physical processes of the particles, and (v) removing calls to the GNU Scientific Library, which is not available for GPUs, and adding calls to the cuRAND library for random number generation on the GPUs. Next to various minor changes and fixes, about 60 OpenACC pragmas had to be introduced into the code to manage the data transfers and to offload the compute loops. With the OpenACC approach used here, it was possible to maintain the same code base for both the CPU and the GPU version of MPTRAC.
Appendix A: Meteorological data preprocessing

A1 Calculation of geopotential heights
Various applications require information about the geopotential heights of the air parcels. MPTRAC calculates geopotential heights from the pressure level data of temperature and specific humidity as well as surface pressure and geopotential height at the surface. Starting from the surface pressure p_0 and geopotential height Z_0, the geopotential height increments ∆Z_{i,i+1} between the neighboring pressure levels of a vertical column are calculated and summed to determine the geopotential height Z_j of the corresponding pressure level p_j.
The geopotential height differences ∆Z_{i,i+1} between neighboring pressure levels p_i and p_{i+1} are calculated from the hypsometric equation,

∆Z_{i,i+1} = (R T_v / g) ln(p_i / p_{i+1}),    (A1)

where T_v is the layer-mean virtual temperature, R = 287.058 J kg^−1 K^−1 is the specific gas constant of dry air, and g = 9.80665 m s^−2 is the conventional standard gravitational acceleration. The virtual temperature T_v is computed from air temperature T and water vapor volume mixing ratio x_H2O according to

T_v = T (1 + x_H2O / ϵ) / (1 + x_H2O),

where ϵ = M_H2O / M_a ≈ 0.622 is the ratio of the molar mass of water vapor, M_H2O = 18.01528 g mol^−1, and the molar mass of dry air, M_a = 28.9644 g mol^−1. The water vapor volume mixing ratio x_H2O relates to specific humidity q via

x_H2O = q / (ϵ (1 − q)).

Figure A1. Geopotential heights at 5, 20, 100, and 500 hPa as calculated with the meteorological data processing code of MPTRAC from ERA5 input data for 1 January 2017, 00:00 UTC.
As an example, Fig. A1 shows maps of geopotential heights at selected pressure levels in the troposphere and stratosphere as calculated with the meteorological data processing code of MPTRAC using ERA5 data on 1 January 2017, 00:00 UTC as input. We verified the geopotential height calculations of MPTRAC by comparing them to geopotential height data provided directly along with the ERA5 reanalysis. Figure A2 shows that the zonal root mean square (RMS) differences between the geopotential heights from MPTRAC and ERA5 are less than 10–15 m at 5–500 hPa. At pressure levels above 1–2 hPa, interpolation and integration errors start to accumulate and the RMS differences increase to 20–50 m. With relative errors being always less than 0.2 %, this comparison indicates reasonable accuracy of the method used to obtain geopotential heights from ERA5 data in the troposphere and stratosphere with MPTRAC.

A2 Calculation of potential vorticity
Potential vorticity is defined as

P = α (2Ω + ∇ × u) · ∇θ,

Figure A2. Zonal root mean square (RMS) differences between geopotential heights calculated by the meteorological data processing code of MPTRAC and from ERA5 directly at pressure levels from 500 to 1 hPa on 1 January 2017, 00:00 UTC.
where α is the specific volume, Ω the angular velocity vector of the Earth's rotation, u the three-dimensional vector velocity relative to the rotating Earth, and θ the potential temperature. Potential temperature is defined as

θ = T (p_0 / p)^κ    (A5)

for the reference pressure p_0 = 1000 hPa and κ = R/c_p ≈ 0.286, with R = 287.058 J kg^−1 K^−1 being the specific gas constant of dry air and c_p = 1003.5 J kg^−1 K^−1 being the specific heat capacity of dry air at constant pressure. Both potential vorticity and potential temperature are dynamical tracers. In the absence of friction and heat sources, P is a materially conservative property that remains constant for each particle.
On pressure levels, Ertel's potential vorticity P can be calculated from temperature T and horizontal wind (u, v) as

P = −g [ (f + ∂v/∂x − ∂u/∂y) ∂θ/∂p − (∂v/∂p)(∂θ/∂x) + (∂u/∂p)(∂θ/∂y) ].    (A6)

The Coriolis parameter f at latitude ϕ is calculated from

f = 2 Ω sin ϕ,

with Earth's rotation rate Ω = 7.2921 × 10^−5 rad s^−1. The horizontal partial derivatives in Eq. (A6) are evaluated by means of the central differencing scheme. The partial derivatives with respect to pressure are evaluated similarly, but need to take into account the irregular vertical grid spacing. The numerical implementation in our code follows a software package to calculate potential vorticity developed by Barlow (2017). Potential vorticity is reported in potential vorticity units, 1 PVU = 1.0 × 10^−6 m^2 s^−1 K kg^−1.

Figure A3. Potential vorticity at 5, 20, 100, and 500 hPa as calculated with the meteorological data processing code of MPTRAC from ERA5 input data for 1 January 2017, 00:00 UTC.
As an example, Fig. A3 shows potential vorticity at different pressure levels in the troposphere and stratosphere as calculated with the meteorological data processing code of MPTRAC using ERA5 data on 1 January 2017, 00:00 UTC as input. We verified the potential vorticity calculations of MPTRAC by comparing the results to data provided along with the ERA5 reanalysis. Corresponding zonal mean RMS differences are shown in Fig. A4. The mean relative differences of the potential vorticity data of MPTRAC and ERA5 are mostly below ±5 % at mid and high latitudes at the pressure levels from 1 to 500 hPa considered here. At low latitudes, the relative errors may become larger. However, Fig. A4 shows that the absolute RMS differences in the tropics are similar to the RMS differences in the extratropics. Somewhat larger differences of the PV calculations at specific pressure levels and latitudes are related to different factors, including elevated terrain (at 30–40° N and 500 hPa), the subtropical jet (at 30–40° S and 200 hPa), and deep convection (at 0–10° N and 100 hPa).

Figure A4. Zonal root mean square (RMS) differences between potential vorticity calculated by the meteorological data processing code of MPTRAC and ERA5 at pressure levels from 500 to 1 hPa on 1 January 2017, 00:00 UTC.

A3 Determination of the tropopause
In the MPTRAC model, we implemented code to estimate the pressure level of the thermal lapse rate tropopause as defined by the World Meteorological Organization (WMO). We also implemented code to estimate the pressure level of the second thermal tropopause, the cold point, and the dynamical tropopause. In order to improve the vertical resolution of the calculations, we first interpolated the reanalysis temperatures, geopotential heights, and potential vorticity data from the pressure levels of the meteorological input data to a fine vertical grid (100 m grid spacing in log-pressure altitude coordinates). Cubic spline interpolation was applied, as this is expected to yield a more realistic representation of real temperature profiles than linear interpolation, in particular near the tropical tropopause region. The vertical range of the tropopause determination was restricted to pressure levels between 47–530 hPa or 4.5–21.5 km in log-pressure altitude. If the algorithm fails to identify a tropopause in that vertical range, a missing value is reported.
The pressure levels of the first and second thermal tropopause were estimated following the definition of the World Meteorological Organization (WMO, 1957). The first tropopause is defined as the lowest level at which the lapse rate falls to 2 °C km^−1 or less, provided the average lapse rate between this level and all higher levels within 2 km also does not exceed 2 °C km^−1. If above the first tropopause the average lapse rate between any level and all higher levels within 1 km exceeds 3 °C km^−1, then a second tropopause is defined by the same criterion. Considering that the input data are given on pressure levels, the lapse rate Γ is calculated using the hydrostatic equation and the ideal gas law according to

Γ = −dT/dz = (g p)/(R T) ∂T/∂p.

The local lapse rates Γ_i based on the temperatures T_i on the pressure levels p_i of the fine vertical grid are therefore calculated according to

Γ_i = (g / R) (p_i + p_{i+1})/(T_i + T_{i+1}) (T_{i+1} − T_i)/(p_{i+1} − p_i).

The local lapse rates Γ_i on the refined vertical grid are averaged over the given height ranges of the WMO criterion in order to estimate the tropopause pressure.
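The WMO first-tropopause search can be sketched as follows, here formulated directly on the refined altitude grid for brevity. This is an illustrative example, not the MPTRAC implementation; the function name and the altitude-based formulation are simplifications.

```c
#include <assert.h>

/* Sketch of the WMO (1957) first-tropopause search on a refined vertical
   grid. Inputs: log-pressure altitudes z[i] [km, ascending, ~0.1 km
   spacing] and temperatures t[i] [K]. Returns the index of the first
   tropopause level, or -1 if no tropopause is found. */
int wmo_first_tropopause(const double *z, const double *t, int n) {
  for (int i = 1; i < n - 1; i++) {
    /* local lapse rate -dT/dz [K/km] via central differences */
    double gamma = -(t[i + 1] - t[i - 1]) / (z[i + 1] - z[i - 1]);
    if (gamma > 2.0)
      continue;
    /* WMO criterion: the mean lapse rate between this level and all
       higher levels within 2 km must not exceed 2 K/km */
    int ok = 1;
    for (int j = i + 1; j < n && z[j] - z[i] <= 2.0; j++)
      if (-(t[j] - t[i]) / (z[j] - z[i]) > 2.0) {
        ok = 0;
        break;
      }
    if (ok)
      return i;
  }
  return -1;
}
```

For a standard-atmosphere-like profile (6.5 K/km lapse rate below 12 km, isothermal above), this search returns the first grid level just above the 12 km temperature kink.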
Another quantity of interest is the cold point. The cold point corresponds to the local minimum of the temperature profile in the upper troposphere and lower stratosphere at a given time and location. We calculated it by applying a search algorithm to the temperature values T_i at the pressure levels p_i of the refined vertical grid. We calculated the cold point globally, but it is typically meaningful only in the tropics, i. e., within a latitude range of about 30° S to 30° N. At mid and high latitudes the cold point is typically not well defined, as the temperature profiles tend to become isothermal in the upper troposphere and lower stratosphere, in particular in the winter season.
Finally, we also estimated a dynamical tropopause based on potential vorticity at mid and high latitudes and potential temperature in the tropics. The pressure level p_i of the dynamical tropopause is found as the lowest level at which either the absolute value of the potential vorticity first exceeds a threshold of 3.5 PVU or the potential temperature first exceeds a threshold of 380 K. Previous studies typically used thresholds between 2 and 4 PVU to define the dynamical tropopause. Here we decided for a value of 3.5 PVU following Hoerling et al. (1991) and Hoinka (1998), who showed that statistical deviations between the thermal tropopause and the dynamical tropopause are minimized for that value.
As an example, zonal mean geopotential heights and temperatures of the first and second thermal tropopause, the cold point, and the dynamical tropopause as estimated with MPTRAC from ERA5 input data on 1 January 2017 are presented in Fig. A5.

1050
The daily mean zonal mean tropopause data from MPTRAC are in accordance with first thermal tropopause data extracted from Global Positioning System (GPS) radio occultation observations on the same day. A more detailed description of the extraction of tropopause information with MPTRAC as well as the application of the data to identify stratospheric ice clouds is provided by Zou et al. (2020, 2021).

A4 Calculation of cloud properties
Some applications of the MPTRAC model require the calculation of specific cloud properties. In particular, the parametrization of wet deposition (Sect. 2.3.5) requires estimates of the cloud top pressure p_ct, the cloud bottom pressure p_cb, and the total column cloud water c_l. The cloud pressure levels p_ct and p_cb are determined from the 3-D fields of CLWC, CRWC, CIWC, and CSWC. We vertically scan the columns of these data to determine p_ct and p_cb as the uppermost and lowermost mid-level where at least one of the variables changes from zero to non-zero, respectively. At the same time, the total column cloud water is obtained by vertically integrating the total cloud water content q_c = q_CLWC + q_CRWC + q_CIWC + q_CSWC over the column,

c_l = (1/g) ∫ q_c dp.    (A10)

Using the hydrostatic equation, Eq. (A10) can be numerically evaluated as a sum of the layer-mean cloud water contents weighted by the pressure differences of the neighboring levels of the vertical column.

A5 Calculation of convective available potential energy

The convective available potential energy (CAPE) is a measure of the maximum buoyancy of an undiluted air parcel and is therefore related to the potential updraft strength of thunderstorms. CAPE is calculated by vertically integrating the local buoyancy of an air parcel from the level of free convection (LFC) to the equilibrium level (EL),

CAPE = g ∫ from z_LFC to z_EL of (T_v,ap − T_v,env)/T_v,env dz,    (A13)

where T_v,ap is the virtual temperature of the air parcel and T_v,env is the virtual temperature of the environment. The level of free convection (LFC) is the altitude where the temperature of the environment decreases faster than the moist adiabatic lapse rate of a saturated air parcel at the same level. Above the LFC, the equilibrium level (EL) is the altitude at which a rising parcel of air reaches the same temperature as its environment.
In order to establish the vertical range of integration, the lifting condensation level (LCL) needs to be found first. The LCL is formally defined as the height at which the relative humidity (RH) of an air parcel reaches 100 % with respect to liquid water when it is cooled by dry adiabatic lifting. To find the LCL, the potential temperature and water vapor volume mixing ratio of an air parcel near the surface need to be determined first. A common procedure is to calculate MLCAPE50, which is obtained using the mixed-layer mean values of θ and x_H2O for the lowermost 50 hPa above the surface. Starting from the surface pressure, the air parcel pressure p is continuously reduced and T is recalculated from Eq. (A5) by keeping θ constant until RH approaches 100 %.
The relative humidity over liquid water is calculated from

RH(p, T, x_H2O) = p_w(p, x_H2O) / p_sat(T) · 100 %,

with the partial water vapor pressure

p_w(p, x_H2O) = p x_H2O / (1 + (1 − ϵ) x_H2O),

and the saturation water vapor pressure over liquid water p_sat(T), which is evaluated with a standard empirical formula referenced to T_0 = 273.15 K. Starting from the air parcel pressure at the LCL, we sequentially reduce the pressure in steps ∆p of about 100 m in log-pressure height. The corresponding layer width ∆z in terms of geopotential heights is calculated from Eq. (A1). The temperature change ∆T = −Γ_m ∆z is calculated by means of the moist adiabatic lapse rate,

Γ_m = g (1 + (L_v r)/(R T)) / (c_pd + (L_v^2 r ϵ)/(R T^2)),

with the specific gas constant of dry air R, the specific heat capacity of dry air at constant pressure c_pd, the latent heat of vaporization of water L_v = 2.501 × 10^6 J kg^−1, and the mixing ratio of the mass of water vapor to the mass of dry air, r = ϵ x_H2O. The water vapor volume mixing ratio x_H2O is recalculated to keep RH at 100 %. With this procedure, the virtual temperature of the air parcel T_v,ap can be calculated along the pressure steps and compared to the virtual temperature of the environment T_v,env. If T_v,ap > T_v,env, the LFC is found. The EL is found as the first level above the LFC where T_v,ap < T_v,env. CAPE is calculated by vertically integrating the relative differences of the virtual temperatures T_v,ap and T_v,env from the LFC to the EL according to Eq. (A13).
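Two ingredients of this parcel ascent, the partial water vapor pressure and the moist adiabatic lapse rate, can be sketched as follows. This is an illustrative example with hypothetical function names (p_water, lapse_rate_moist), not code from MPTRAC.

```c
#include <assert.h>
#include <math.h>

#define RA 287.058     /* specific gas constant of dry air [J kg^-1 K^-1] */
#define CPD 1003.5     /* specific heat of dry air at constant pressure */
#define LV 2.501e6     /* latent heat of vaporization of water [J kg^-1] */
#define EPS 0.622      /* ratio of molar masses of water vapor and dry air */
#define G0 9.80665     /* standard gravitational acceleration [m s^-2] */

/* partial water vapor pressure from pressure p and water vapor volume
   mixing ratio x */
double p_water(double p, double x) {
  return p * x / (1.0 + (1.0 - EPS) * x);
}

/* moist adiabatic lapse rate [K m^-1] for temperature t [K] and mass
   mixing ratio r = eps * x (standard pseudoadiabatic approximation) */
double lapse_rate_moist(double t, double r) {
  return G0 * (1.0 + LV * r / (RA * t))
         / (CPD + LV * LV * r * EPS / (RA * t * t));
}
```

In the dry limit (r = 0) the expression reduces to the dry adiabatic lapse rate g / c_pd of roughly 9.8 K km^−1, while moist air yields smaller values, as required for the parcel ascent above the LCL.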
Figure A6 compares the results of the CAPE calculations of MPTRAC with CAPE data provided along with the ERA5 reanalysis on 1 January 2017, 00:00 UTC. Qualitatively, the data sets compare well and show similar patterns on the global scale. However, CAPE maxima from MPTRAC typically exceed the ERA5 data by up to 500 to 1000 J kg^−1 in the tropics (Fig. A6c). Figure A7a compares tail distributions of the CAPE data from MPTRAC and ERA5. This more detailed analysis shows that CAPE values larger than about 20 J kg^−1 are up to 5 percentage points more frequent in MPTRAC compared to ERA5.

Figure A7. Convective available potential energy (CAPE) estimated with MPTRAC using ERA5 data. (a) Occurrence frequencies of CAPE as estimated from the MPTRAC meteorological data processing code and the ERA5 reanalysis and (b) equilibrium levels versus CAPE for individual grid boxes of the reanalysis data (gray dots) and a running mean through the data (black curve). All data refer to 1 January 2017, 00:00 UTC.
Note that CAPE calculations can be quite sensitive to parameter choices, such as the depth of the surface layer used to calculate the mixed-layer mean values of θ and x_H2O. Tests with different depths of the surface layer (50 or 100 hPa) with the MPTRAC code already showed quite significant variations in CAPE. For this reason, we attribute the differences between the MPTRAC and ERA5 CAPE data seen here to the different methodologies and parameter choices applied for the CAPE calculations.
Both CAPE and the heights of the equilibrium levels are required as input data for the convection parametrization implemented in MPTRAC (Sect. 2.3.4). For reference, Fig. A6d shows the log-pressure heights of the equilibrium levels found in the CAPE calculations with MPTRAC. Corresponding data from ERA5 were not available for comparison. Figure A7b shows the log-pressure height of the equilibrium levels versus CAPE. A running mean through the data indicates a strong correlation between CAPE and the height of the equilibrium level. CAPE values less than 20 J kg^−1 typically correspond to equilibrium levels below 2 km, whereas CAPE values larger than 1000 J kg^−1 may correspond to equilibrium levels as high as 16 km in tropical convection.

A6 Determination of the planetary boundary layer
Most recently, code has been added to the MPTRAC model to estimate the height of the planetary boundary layer (PBL) using the bulk Richardson number method (Vogelezang and Holtslag, 1996). The bulk Richardson number at height z above the surface is calculated from

Ri(z) = (g / θ_v,s) z (θ_v(z) − θ_v,s) / [ (u(z) − u_s)^2 + (v(z) − v_s)^2 ],    (A19)

where θ_v is the virtual potential temperature and the index s denotes the surface. The height of the PBL is found as the vertical level where Ri exceeds a critical value of 0.25, considering that layers with Richardson numbers less than this critical value are dynamically unstable and likely to become or remain turbulent. The bulk Richardson number method is generally considered suitable for both stable and convective boundary layers, identifies a non-negative height in all cases, and is not strongly dependent on the vertical resolution of the input data (Seidel et al., 2012).
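The bulk Richardson number search of Eq. (A19) can be sketched as follows, including the empirical lower limit on the denominator discussed below. This is an illustrative example with a hypothetical function name (pbl_height), not the MPTRAC implementation.

```c
#include <assert.h>
#include <math.h>

/* Sketch of the bulk Richardson number method (Eq. A19) for the PBL height.
   Profiles on nz levels, index 0 at the surface: heights z [m, ascending],
   virtual potential temperatures thv [K], and winds u, v [m s^-1].
   Returns the height of the first level where Ri exceeds 0.25. */
double pbl_height(const double *z, const double *thv,
                  const double *u, const double *v, int nz) {
  const double g = 9.80665, ric = 0.25, shear_min = 25.0;  /* (5 m/s)^2 */
  for (int i = 1; i < nz; i++) {
    double du = u[i] - u[0], dv = v[i] - v[0];
    double shear = du * du + dv * dv;
    if (shear < shear_min)   /* empirical lower limit on the denominator */
      shear = shear_min;
    double ri = g / thv[0] * (z[i] - z[0]) * (thv[i] - thv[0]) / shear;
    if (ri > ric)
      return z[i];           /* first level exceeding the critical value */
  }
  return z[nz - 1];          /* no exceedance found within the profile */
}
```

For a calm, stably stratified test profile (θ_v increasing by 1 K per 100 m, no wind), the shear floor dominates the denominator and the critical value of 0.25 is first exceeded at the 200 m level.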
As an example, Fig. A8 shows PBL heights on 1 January 2017, 00:00 UTC as estimated with the method implemented in the MPTRAC model as well as the differences with respect to data provided directly along with ERA5. Overall, this comparison shows that the large-scale patterns of the PBL heights are similar, but also that some systematic differences are present. In particular, the PBL heights from MPTRAC over mid-latitude ocean regions can be 500 to 1000 m lower than the ERA5 data. Consistent with other studies (Zhang et al., 2014), our tests showed that the calculation of PBL heights is particularly sensitive to the choice of the critical value of Ri and of the surface data. We also found it necessary to impose an empirical lower limit of (5 m s^−1)^2 on the denominator of Eq. (A19) to account for the surface friction velocity and to achieve numerical stability of the results.