Reducing Bias in Cropland Soil Organic Carbon and Clay Predictions using Sentinel-2 Composites and Data Balancing

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Authors

Tom Broeg, Axel Don, Thomas Scholten, Stefan Erasmi

Abstract

Accurate maps of cropland soil organic carbon stocks (SOCS) and clay content are essential for climate-smart agriculture. Soil reflectance composites (SRC), derived from multispectral bare soil observations, offer a scalable approach to high-resolution soil mapping. While studies often focus on maximizing model performance, challenges remain regarding (1) the bias introduced by masking and excluding soil samples during SRC generation and (2) the accurate representation of the full range and distribution of soil properties in the resulting maps. Evaluating different SRC parameters, we found that commonly used indices such as the Normalized Burn Ratio 2 (NBR2) and the Normalized Difference Vegetation Index (NDVI) were significantly correlated with clay content and SOCS, respectively. These dependencies can lead to the systematic exclusion of samples with high SOCS (>80 Mg ha⁻¹) and high clay content (>30 mass%) during SRC generation, introducing bias in the resulting maps. Models trained solely on SRC bands failed to capture the full range of the training data, limiting the applicability of the soil property maps. While the inclusion of additional remote sensing features, such as spectral-temporal metrics and indices, significantly improved the prediction accuracy, the representation of the imbalanced samples remained challenging. We demonstrated that a combined framework of spatial data augmentation and majority undersampling was effective in improving the range and concordance correlation coefficient (CCC) of the predictions (SOCS = 0.82; clay = 0.9). Our findings emphasize the importance of (1) evaluating excluded samples to identify potential SRC-induced bias, and (2) optimizing model predictions so that they reflect the observed data range, to improve the reliability and usability of the resulting soil maps.
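For readers unfamiliar with the quantities named in the abstract, the following is a minimal sketch of how NDVI, NBR2, and the concordance correlation coefficient are commonly computed. It is not taken from the preprint. The function names are hypothetical, the index definitions are the standard normalized-difference forms (NDVI from near-infrared and red reflectance, NBR2 from the two shortwave-infrared bands), and the CCC follows Lin's definition with population (biased) variance and covariance. The preprint's own processing chain may differ in detail.

```python
import numpy as np


def normalized_difference(a, b):
    """Return (a - b) / (a + b) element-wise; undefined where a + b == 0."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return (a - b) / (a + b)


def ndvi(nir, red):
    """Normalized Difference Vegetation Index from NIR and red reflectance."""
    return normalized_difference(nir, red)


def nbr2(swir1, swir2):
    """Normalized Burn Ratio 2 from the two shortwave-infrared bands."""
    return normalized_difference(swir1, swir2)


def ccc(observed, predicted):
    """Lin's concordance correlation coefficient (population moments)."""
    x = np.asarray(observed, dtype=float)
    y = np.asarray(predicted, dtype=float)
    mean_x, mean_y = x.mean(), y.mean()
    # Population covariance (divide by n), matching np.var's default ddof=0.
    cov_xy = ((x - mean_x) * (y - mean_y)).mean()
    return 2.0 * cov_xy / (x.var() + y.var() + (mean_x - mean_y) ** 2)
```

Unlike the Pearson correlation, the CCC penalizes both a shift in the mean and a compressed prediction range, which is why it suits the abstract's concern that models fail to reproduce the full range of observed values. For example, predictions that are perfectly correlated with the observations but offset by a constant still score below 1.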

DOI

https://doi.org/10.31223/X50M91

Subjects

Physical Sciences and Mathematics

Keywords

Soil Reflectance Composite, Bare Soil Mosaic, Digital soil mapping, Imbalanced Regression, Soil Organic Carbon Stocks, Clay Content, cropland

Dates

Published: 2025-05-15 09:46

Last Updated: 2025-05-15 09:46

License

CC BY Attribution 4.0 International

Additional Metadata

Conflict of interest statement:
None

Data Availability:
The Sentinel-2 soil reflectance composite and predicted soil maps are available in Zenodo at https://doi.org/10.5281/zenodo.15402687 and https://doi.org/10.5281/zenodo.15403341. The soil data from the German Agricultural Soil Inventory (BZE-LW) are available from OpenAgrar at https://doi.org/10.3220/DATA20200203151139.