This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.
From Points to Predictions: Data Curation for Geospatial Machine Learning
Abstract
The quality of training datasets can strongly affect Machine Learning (ML) models, yet this aspect of the pipeline frequently receives less scrutiny than it should. In geospatial mapping from point-scale field data, quality control strategies that remove erroneous or misleading data can be applied before model training to improve performance. However, such strategies and their impact are rarely reported, in contrast to the extensive discussion of model selection and tuning. To investigate the potential for spatial data error correction, we examine the case of peatland mapping from peat core samples. We assess several curation strategies and compare fully automated filters against filters that require monitoring by domain experts. We find that cleaning strategies based on location precision and on landcover classification filtering to detect mismatches can significantly improve performance metrics. We also find that blind reliance on fully automated classification may lead to worse results. Despite the additional effort required, we conclude that manual spatial data quality control is an important component of large-scale spatial modelling, and we discuss recommended approaches for scaling it effectively to large datasets.
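As a rough illustration of the two kinds of filter named in the abstract, the sketch below drops point samples whose coordinates were recorded with too few decimal places and sets aside peat-labelled samples that fall on an implausible landcover class. It is not the authors' implementation: the field names, the `min_decimals` threshold, the `IMPLAUSIBLE_FOR_PEAT` class list, and the example records are all assumptions made for illustration, and counting decimal places is only a crude proxy for location precision.

```python
# Hypothetical landcover classes on which a peat label is treated as suspect.
IMPLAUSIBLE_FOR_PEAT = {"water", "built_up", "bare_rock"}


def decimal_places(coord_text: str) -> int:
    """Count the digits after the decimal point in a coordinate as recorded."""
    _, _, fraction = coord_text.strip().partition(".")
    return len(fraction)


def curate(samples, min_decimals=4):
    """Split samples into (kept, flagged); imprecise samples are dropped.

    Flagged samples are set aside for expert review, not deleted automatically.
    """
    kept, flagged = [], []
    for sample in samples:
        precision = min(decimal_places(sample["lat"]),
                        decimal_places(sample["lon"]))
        if precision < min_decimals:
            continue  # location too coarse to trust at the mapping resolution
        if sample["is_peat"] and sample["landcover"] in IMPLAUSIBLE_FOR_PEAT:
            flagged.append(sample)  # label and landcover disagree
        else:
            kept.append(sample)
    return kept, flagged


# Made-up example records; coordinates are kept as text to preserve precision.
samples = [
    {"id": "a", "lat": "53.12345", "lon": "-7.98765",
     "is_peat": True, "landcover": "wetland"},
    {"id": "b", "lat": "53.1", "lon": "-7.9",
     "is_peat": True, "landcover": "wetland"},
    {"id": "c", "lat": "53.20011", "lon": "-8.00123",
     "is_peat": True, "landcover": "built_up"},
]

kept, flagged = curate(samples)
print([s["id"] for s in kept], [s["id"] for s in flagged])
```

Routing mismatches to a review list instead of deleting them is a design choice made here to mirror the abstract's caution about blind reliance on fully automated classification.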
DOI
https://doi.org/10.31223/X59N2H
Subjects
Physical Sciences and Mathematics
Keywords
geospatial machine learning, data-centric machine learning, peatland mapping, location accuracy, landcover filtering
Dates
Published: 2026-01-20 15:33
Last Updated: 2026-01-20 15:33