This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Data-Driven Facies Prediction: A Comparative Study of Random Forest, XGBoost, SVM, CatBoost, and K-Means
Abstract
Facies classification plays a critical role in characterizing subsurface heterogeneity and supporting effective reservoir development. Traditional methods, which often rely on core interpretation and manual log analysis, are limited by interpreter subjectivity and sparse data coverage. This study aims to improve facies prediction by comparing the performance of five machine learning models: Random Forest, XGBoost, Support Vector Machine, CatBoost, and K-Means clustering. The dataset is derived from sandstone formations on Labuan Island, Malaysia, and is enhanced using synthetic data generated through Latin Hypercube Sampling to address data scarcity. Feature selection is performed using three independent techniques to identify the most informative variables, and Principal Component Analysis is used to investigate feature relationships. Model evaluation is based on classification accuracy, precision-recall metrics, receiver operating characteristic curves, and confusion matrices. Among the models tested, CatBoost achieved the highest cross-validation accuracy at 95.4%, followed by XGBoost at 93.7%. Random Forest achieved a test accuracy of 89.5%, while Support Vector Machine performed less reliably with a test accuracy of 85.6%. K-Means clustering, the only unsupervised method in the comparison, reached an overall accuracy of 49.7% when its clusters were aligned with the true facies labels. The results demonstrate the effectiveness of ensemble methods in facies classification and support the use of augmented data in enhancing model performance. This approach provides a practical framework for applying machine learning in geological settings, with potential benefits for reservoir modeling and development planning.
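The abstract does not say how K-Means clusters were matched to facies labels before accuracy was computed. One common approach, sketched here under that assumption, is to search for the one-to-one mapping from cluster IDs to labels that maximizes the number of agreeing samples. The function name `aligned_cluster_accuracy` and the toy labels are hypothetical and are not taken from the study's data or code.

```python
from itertools import permutations


def aligned_cluster_accuracy(true_labels, cluster_ids):
    """Return (accuracy, mapping) for the best one-to-one cluster-to-label mapping.

    Illustrative sketch only: brute-force search, practical for the handful
    of classes typical of facies problems.
    """
    if len(true_labels) != len(cluster_ids) or not true_labels:
        raise ValueError("inputs must be non-empty and of equal length")

    labels = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    # Pad with None so surplus clusters still get a distinct target;
    # None never equals a true label, so those clusters score zero hits.
    targets = labels + [None] * max(0, len(clusters) - len(labels))

    # Contingency counts: how often each (cluster, label) pair co-occurs.
    counts = {}
    for label, cluster in zip(true_labels, cluster_ids):
        counts[(cluster, label)] = counts.get((cluster, label), 0) + 1

    best_hits, best_map = -1, {}
    for perm in permutations(targets, len(clusters)):
        hits = sum(counts.get((c, t), 0) for c, t in zip(clusters, perm))
        if hits > best_hits:
            best_hits, best_map = hits, dict(zip(clusters, perm))
    return best_hits / len(true_labels), best_map


# Toy example with invented labels (not from the Labuan dataset).
true_facies = ["sand", "sand", "sand", "shale", "shale", "silt", "silt", "silt"]
kmeans_ids = [2, 2, 0, 0, 0, 1, 1, 2]
accuracy, mapping = aligned_cluster_accuracy(true_facies, kmeans_ids)
print(accuracy)  # 0.75
```

The brute-force search grows factorially with the number of clusters. For larger problems, the same assignment can be solved on the contingency table with the Hungarian algorithm, for example `scipy.optimize.linear_sum_assignment`.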
DOI
https://doi.org/10.31223/X5244X
Subjects
Analysis, Earth Sciences, Geology, Sedimentology
Keywords
Facies classification, machine learning, geostatistics, synthetic data augmentation, ensemble models
Dates
Published: 2025-06-11 02:07
Last Updated: 2025-06-11 02:07
License
CC BY Attribution 4.0 International
Additional Metadata
Conflict of interest statement:
The authors declare no relevant financial or non-financial competing interests that could have influenced the research findings.
Data Availability (Reason not available):
The data belong to a YUTP project.