This is a Preprint and has not been peer reviewed. This is version 2 of this Preprint.

PhyX - Predicting Phytoplankton Community Composition from Satellite Ocean Color
Abstract
The way in which phytoplankton communities are structured - often referred to as phytoplankton community composition (PCC) - exerts fundamental control on ocean biogeochemical cycling, climate regulation, and marine ecosystem dynamics. Accurate quantification of phytoplankton groups from satellite ocean color data remains challenging due to spectral similarities among phytoplankton types and the limitations of existing empirical and semi-analytical models. In this study, we used an extreme gradient boosting (XGBoost) tree-based regression model to retrieve multiple PCCs and total chlorophyll-a concentrations from simulated hyperspectral top-of-atmosphere (TOA) ocean color remote sensing data together with ancillary data. The intent is to mimic what could be gathered from the NASA Plankton, Aerosol, Cloud, ocean Ecosystem (PACE) mission and from auxiliary data sources that characterize the environment. In its final form, the model, validated on an out-of-sample set, demonstrated strong predictive performance across most functional groups, with R² values exceeding 0.95. Dinoflagellate retrievals showed lower accuracy (R² = 0.53). Further analysis revealed that temperature was a key predictor alongside hyperspectral TOA radiance, suggesting that integrating external temperature data could enhance future retrieval models. Furthermore, although the model used only 10% of the available hyperspectral bands, feature importance analysis showed that specific spectral regions contributed disproportionately to model predictions. These findings highlight the potential of machine learning for phytoplankton classification and inform future algorithm development for hyperspectral ocean color missions.
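As a rough illustration of the per-group validation metric quoted above, the sketch below computes R² on held-out values. It assumes the standard coefficient of determination, 1 - SS_res / SS_tot, which the abstract does not explicitly confirm. The function name `r_squared`, the group labels, and all numbers are hypothetical and are not taken from the study.

```python
from statistics import fmean


def r_squared(observed, predicted):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed definition)."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_obs = fmean(observed)
    # Residual sum of squares: squared error of the predictions
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    # Total sum of squares: spread of the observations around their mean
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when observed values are constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out (observed, predicted) pairs, made up for illustration only
held_out = {
    "group_a": ([0.10, 0.40, 0.80, 1.20], [0.12, 0.38, 0.83, 1.15]),
    "group_b": ([0.05, 0.10, 0.20, 0.40], [0.15, 0.12, 0.30, 0.20]),
}

# One score per group, mirroring a per-functional-group evaluation
scores = {name: r_squared(obs, pred) for name, (obs, pred) in held_out.items()}

for name, score in scores.items():
    print(f"{name}: R^2 = {score:.3f}")
```

Scoring each group separately, rather than pooling all targets, is what lets a weaker retrieval stand out against otherwise high scores, as the dinoflagellate result does in the abstract.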
DOI
https://doi.org/10.31223/X5QQ9K
Subjects
Marine Biology
Keywords
phytoplankton, Regression, XGBoost, SHAP, Explainable AI
Dates
Published: 2025-08-02 22:01
Last Updated: 2025-08-21 20:12
Additional Metadata
Data Availability (Reason not available):
Data and code are available.