This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.
Evaluating Sampling Bias and Model Uncertainty in Species Distribution Models of Marine Plankton Using Virtual Ecosystem Data
Abstract
Understanding the biodiversity and biogeography of plankton in the ocean is essential for predicting their responses to environmental change and for informing ocean conservation and management strategies. Species distribution models (SDMs) are a pivotal tool in this regard. This study used data from a global marine ecosystem model as a testbed to assess the reliability of various SDMs, including Generalized Linear Models (GLM), Generalized Additive Models (GAM), Random Forest (RF), Boosted Regression Trees (BRT), and Artificial Neural Networks (ANN). We constructed artificial datasets that replicate the sampling patterns of three observational datasets: a compiled dataset of global scope, the Tara Oceans dataset, and the Atlantic Meridional Transect (AMT) project. Our findings indicate that the tree-based algorithms, RF and BRT, exhibit better predictive accuracy and stability than GLM, GAM, and ANN, especially when trained on more spatially resolved datasets. We highlight the significant influence of sampling bias on model performance: models trained on the more comprehensive global dataset outperform those trained on data that are more biased latitudinally (Tara) or longitudinally (AMT). Furthermore, we demonstrate that broad spatial coverage is a more critical determinant of predictive skill than sample size alone, as simply increasing sampling density within a biased region is insufficient to overcome poor spatial representation. Overall, this research underscores the need to consider sampling strategies and model selection carefully in plankton species distribution modelling.
DOI
https://doi.org/10.31223/X5HV0P
Subjects
Life Sciences
Keywords
Species distribution model, ecosystem model, model evaluation
Dates
Published: 2026-04-29 03:05
Last Updated: 2026-04-29 03:05
License
CC BY Attribution 4.0 International
Additional Metadata
Data Availability:
The physical model used in the Darwin simulation is the MIT General Circulation Model (MITgcm), accessible at http://mitgcm.org. The generic ecosystem code is available at https://gitlab.com/jahn/gud, and detailed equations and documentation can be found at https://darwin3.readthedocs.io/en/latest/phys_pkgs/darwin.html. The Darwin model data can be downloaded from https://doi.org/10.7910/DVN/RPL6PT and https://doi.org/10.7910/DVN/LQH9PX. The SDM scripts are available on GitHub at https://github.com/ZhiboShao/uncertainty-and-predicitability-of-plankton-SDM-. The data are archived on Zenodo at https://doi.org/10.5281/zenodo.14219377.