Bayesian Models for Deriving Biogeochemical Information from Satellite Ocean Color

The era of satellite ocean color began in 1978 with the 1 launch of NASA’s Coastal Ocean Color Scanner (CZCS) on 2 board the Nimbus-7 spacecraft. Through measurement of the 3 quantity and quality of the light reflected from the ocean, 4 CZCS revolutionized our understanding of the intimate re5 lationships between ocean physics and phytoplankton distri6 bution in the world ocean (1). Generations of spaceborne 7 sensors have subsequently followed, and satellite ocean color 8 measurements now provide spatial and temporal distributions 9 of phytoplankton (2) and other aquatic biogeochemical con10 stituents (3), estimates of ocean primary productivity (4, 5), 11 and have become a vital input to global models of Earth 12 system processes and their response to a changing climate 13 (6, 7). 14 Ocean color is derived from the signal collected at the top 15 of the atmosphere (TOA) by a satellite spectroradiometer. 16 The majority of this signal is due to scattering from atmo17 spheric aerosols and reflection by wind-generated whitecaps, 18 with only ~10% maximum of the spectrum due to radiance 19 either reflected from the ocean surface or scattered back out 20 through the air-water interface. The atmospheric contribution 21 must, therefore, be ‘subtracted’ from the TOA signal in order 22 to derive the oceanic contribution. This is achieved opera23 tionally through the process of atmospheric correction (AC), 24 which removes the influence of sun glint, whitecaps formed 25 by wind, and the contribution by atmospheric aerosols. This 26 latter step takes advantage of the fact that the water body 27 can be considered to be totally absorbing (i.e. black) in the 28 near infrared (NIR). Any TOA signal detected at wavelengths 29 in the NIR is then attributed to atmospheric contributions 30 and a suitable aerosol radiance model is chosen to extrapolate 31 to shorter wavelengths. This derived atmospheric radiance 32 is subtracted from the total TOA spectrum, and the signal 33 that remains is the water-leaving radiance. Remote sensing 34 reflectance (Rrs, sr−1) i.e. the light exiting the water normal35 ized to a hypothetical condition of an overhead Sun and no 36 atmosphere (8, 9), can then be calculated. Following calcu37 lation of Rrs, various approaches are used to estimate water 38 constituent concentrations. These fall broadly into two classes 39 of algorithms: i) empirical band-ratio algorithms, which are 40 derived from the statistical relationship between the ratio of 41 two or more wavebands (blue and green) of Rrs and in situ 42 measurements of chlorophyll a concentration, Chl a (mg m−3), 43 a proxy for phytoplankton biomass (2), and ii) semi-analytical 44 algorithms that are based on a combination of radiative trans45 fer theory and empirically derived parameters, and that permit 46 the retrieval of inherent optical properties (IOPs) such as spec47 tral particulate backscattering, bbp (m−1), and phytoplankton 48 absorption, aph (m−1), coefficients that can be related to the 49 water constituents of interest (3). 50 In a very general sense, AC approaches perform well over 51 the open ocean where the water is totally absorbing in the 52 NIR and the aerosol assemblage can be well modeled. How53 ever, AC performance becomes severely limited in coastal and 54 inland waters where bottom reflectance can contaminate water55 leaving signals, suspended sediments may produce a non-zero 56 reflectance in the NIR, and/or absorbing aerosols, e.g. those 57 generated by terrestrial anthropogenic sources (10), are dif58 ficult to model accurately. The performance of the in-water 59 algorithms is also degraded in these regions for a variety of 60 reasons. The band ratio algorithms were developed for use in 61 case 1 waters, i.e. those in which ocean color is dominated by 62 Chl a and all other optically active water constituents covary 63 (11). In case 2 waters (11), where other optically active water 64 constituents vary independently of Chl a (e.g. coastal waters), 65 band ratio algorithms often perform poorly as colored dissolved 66 organic material (CDOM) and suspended particulate material 67 compete with phytoplankton for the absorption and scattering 68 of blue photons, thereby confounding the algorithm’s assump69 tion of co-variability. Semi-analytical models may perform 70 satisfactorily in case 2 waters, but model parameters such as 71 the spectral slopes of CDOM+detrital absorption and partic72 ulate backscattering may need to be regionally tuned as their 73 local values can vary widely (12–14). Additionally, the signal 74 of interest may simply be swamped by competing processes – a 75 common occurrence in case 2 waters where CDOM absorption 76 coefficients can be an order of magnitude or greater than that 77 of phytoplankton (15, 16). Finally, if AC is inaccurate, the 78 spectral shape of the retrieved Rrs spectrum may be distorted, 79 meaning that the starting point for any of these ocean color 80 models will be fundamentally flawed. As a result of these 81 challenges, satellite measurements made over such water bod82 ies are often unusable for quantitative studies. The loss of 83 information from these systems is particularly egregious as 84 they are vulnerable to climate and anthropogenic forcing (17), 85 play host to highly productive fisheries (18), or are regions 86 of intense atmospheric CO2 uptake (19) or sinking of organic 87 matter for climate-relevant time scales (20, 21). 88

coefficients can be an order of magnitude or greater than that 77 of phytoplankton (15,16). Finally, if AC is inaccurate, the 78 spectral shape of the retrieved Rrs spectrum may be distorted, 79 meaning that the starting point for any of these ocean color 80 models will be fundamentally flawed. As a result of these 81 challenges, satellite measurements made over such water bod-82 ies are often unusable for quantitative studies. The loss of 83 information from these systems is particularly egregious as 84 they are vulnerable to climate and anthropogenic forcing (17), 85 play host to highly productive fisheries (18), or are regions 86 of intense atmospheric CO2 uptake (19) or sinking of organic 87 matter for climate-relevant time scales (20,21).

88
Work has been devoted to improving the standard AC and 89 a number of alternative approaches have been proposed (22). 90 These include a multiband AC that uses multiple NIR and 91 shortwave infrared channels (23)    Here, we present a machine learning effort that extends

156
Three hierarchical Bayesian models predicting phytoplankton 157 absorption at 6 wavebands were successfully fitted. In increas-158 ing order of complexity, these models were linear regression, 159 linear regression with first order interaction terms, and neural 160 network. Input variables included 6 principal components de-161 rived from 6 Rayleigh-corrected bands, in addition to a number 162 of ancillary predictors (see Materials and Methods section). 163 All three models were built to highlight predictor relevance. 164 This is depicted in order of relevance for each model in the 165 forest plots shown in Fig. 1 for phytoplankton absorption at 166 411 nm, a ph (411). Note that we use a ph (411) in Fig. 1 as an 167 illustrative example because this region of the spectrum is typ-168 ically the most affected by inaccurate atmospheric correction 169 and, therefore, represents the most challenging scenario for 170 ocean color retrievals. For all models, the first three principal 171 components appear among the more influential variables. The 172 linear regression model identified sea surface temperature and 173 bathymetry (sst and dep in Fig. 1, top panel) as significantly 174 relevant in predicting a ph (411). For the linear regression with 175 interaction model, interaction between the first two spectral 176 principal components (pc1 and pc2 ), and interaction between 177 the fourth principal component and the solar zenith angle (pc4 178 and solz) were found to be the most influential variables (Fig. 179 1, middle panel). Interestingly, the neural network deemed 180 only PC spectral information as relevant in a ph (411) prediction 181 (Fig. 1, bottom panel).

182
The uncertainties around the relevant parameters were 183 similar in magnitude between the two types of linear regression 184 models. In the case of the neural network, the most relevant 185 parameters exhibit the greatest uncertainty (Fig. 1, bottom 186 panel), likely an effect of the small size of the data set used. 187 This pattern changes, however, where model prediction skill 188 is concerned. To asses each model's prediction skill, a small 189 out-of-sample data set was used and the following criteria 190 examined: 1) how tight the 95% credibility interval of the 191 posterior predictive simulation was (Fig. 2); 2) where out-192 of-sample observations occur in relation to the 95% and 50% 193 credibility intervals (Fig. 2); and 3) how closely average 194 predictions tracked out-of-sample observations (Figs. 2 -5). 195 We found that for all bands, and for all performance criteria 196 listed above, expected predictive performance on out-of-sample 197 data (i.e. future, unseen data) increased with model complex-198 ity. Linear regression was the least proficient of the three 199 models, while the Bayesian neural network was the model 200 most likely to generalize well. Of all 6 bands tested, a ph (555) 201 was the most challenging to fit across all models, likely due to 202 the fact that phytoplankton absorption is weakest in the green 203 spectral region. This behavior was also observed by Craig et 204 al. (16) in their PC-based models.

206
This study illustrates the feasibility of retrieving inherent op-207 tical properties, in this case phytoplankton absorption, from 208 optically complex coastal waters, using Rayleigh-corrected 209 Top-of-Atmosphere reflectance as principal input to a num-210 ber of models. Of these, BNN resulted in the most robust 211 predictions. This is significant because coastal water IOPs, 212 including phytoplankton absorption have until now remained 213 essentially invisible to ocean color remote sensing.

214
• Bayesian approaches have not been utilized for the pre-215 diction of IOPs and offer an alternative to traditional RT 216 models. Additionally, uncertainty estimates come for free. 217 • Of particular note is the model's ability to accurately 218

249
Data pre-processing. The NOMAD in situ a ph (λ) data was 250 provided at 20 wavelengths. This was reduced to 6 to match 251 SeaWiFS visible wavelengths of 412, 443, 490, 510, 555, 670 252 nm. Data points were discarded if no in situ a ph (λ) data 253 existed or had missing wavelengths, and if any of the satellite 254 wavelengths were missing or contained zero values. Three ad-255 ditional pre-processing steps were performed: i) The principal 256 components (PCs) of the 6 Rayleigh-corrected remote sensing 257 reflectances were computed. After initial model trials, it was 258 found that the PCs were consistently more powerful predictors 259 than the parent reflectance spectra, in agreement with the 260 findings of Craig et al. (16) who observed that Rrs PCs were 261  weak relationship between predictors and predicted variables, 269 when in fact the relationship is much stronger. ii) The data, 270 which included sea surface temperature, solar zenith angle, 271 and reflectance principal components, span widely varying 272 scales. Therefore, they were standardized by subtracting the 273 mean from each predictor variable and dividing by its respec-274 tive standard deviation. iii) The data was split into training 275 and testing sets, with the training set used for model fitting, 276 while the testing (i.e., out-of-sample) set was used for model 277 predictive skill evaluation.

278
Model development and fitting. All models described were de-279 veloped in the Python language using the probabilistic pro-280 gramming library PyMC3 (41). Bayesian models to predict the 281 spectral phytoplankton absorption coefficient, a ph (λ), were 282 developed. By definition, Bayesian model parameters and 283 their resulting predictions are probabilistic in nature. In brief, 284 Bayes' rule provides a way to update beliefs based on the 285 sion models, in that they assume a number of the the model 320 parameters will effectively shrink to 0. The horseshoe prior 321 holds a significant advantage over Lasso or Ridge regression 322 priors in that it assigns high probability to 0 while providing a 323 thick tail, which reduces bias. The regularized horseshoe has 324 the advantage that it provides a way to adjust the shrinking 325 rate of non-zero parameters, thereby preventing the model 326 from overfitting on the features corresponding to these non-327 zero parameters. Fig. 8 (top panel) shows the structure of the 328 linear regression models, with a common intercept parameter 329 given a Gaussian prior, the predictor slope parameters are 330 also Gaussians with the scale parameter σ β inherited from a 331 combination of 3 hyperpriors (a hyperprior is an assumption 332 made about a parameter in a prior probability assumption) as 333 specified in (42). The linear regression equation is used as the 334 mean of a Gaussian likelihood, which has a standard deviation 335 with a half-Cauchy prior.

336
Similarly, the Bayesian neural network's construction fea-337 tures automatic relevance determination (ARD) (43). The 338 neural network is fully connected and features one hidden 339 layer. The weights between the input layer and the hidden 340 layer have Gaussian priors as in the linear regression models. 341 However, the spreads of priors for the weights correspond-342 ing to each predictor variable have independent half-Cauchy 343 hyper-priors; this is the basis for ARD. On the other hand, the 344 weights connecting the hidden layer to the output layer have 345 Gaussian priors with a common hyperprior for their spread. 346 The Bayesian neural network's architecture is depicted in Fig. 347 8, bottom panel.

348
All models were fit using the No U-Turn Sampler, a variant 349 of Hamiltonian Monte-Carlo (44). For the regression models, 350 2000 samples were drawn after a tuning period made up of 351 2000 preliminary samples that were subsequently discarded. A 352 similar fitting procedure was followed for the Bayesian neural 353 network, with the difference that 2000 samples were collected 354 after a 15000-iteration tuning step. In all cases, the sampling 355 was performed four times concurrently, but independently, to 356 ensure convergence. The Gelman-Rubin statistic (45) was 357 used to verify that convergence was equivalent between in-358 dependent sampling runs. Relatively naive priors were used, 359 codifying the rather loose constraint that reasonable values of 360 the target variable would remain highly probable. An addi-361 tional constraint was applied to the Bayesian neural network 362 to address the problem of weight space symmetry (46), which 363 affects the weights applied to the input nodes, represented 364 as edges connecting input nodes x1...n to hidden layer nodes 365 h1...m as shown in the bottom panel of Fig. 8. The problem 366 arises from the fact that, without an additional constraint, 367 there is nothing to differentiate hidden layer nodes from one 368 another. In practice this results in the sampler encountering 369 difficulty in converging on the same mode for the affected 370 weights. The constraint applied consists of enforcing a numer-371 ical order within the weights applied to each input node. This 372 guarantees that no overlap can occur, thus eliminating the 373 exchangeability problem.

374
Reproducibility. The code describing the preparation and 375 transformation of data, as well as the code for the devel-376 opment, fitting, and evaluation of the models are available 377 through github https://github.com/madHatter106/Bayesian-378 ML-4-IOP-from-TOA. The raw data is available through our 379 project page on the Open Science Foundation website OSF 380 link.