Data-Driven Facies Prediction: A Comparative Study of Random Forest, XGBoost, SVM, CatBoost, and K-Means

Muhammad Risha; Paul Liu

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Abstract

Facies classification plays a critical role in characterizing subsurface heterogeneity and supporting effective reservoir development. Traditional methods, which often rely on core interpretation and manual log analysis, are limited by subjectivity and sparse data coverage. This study aims to improve facies prediction by comparing the performance of five machine learning models: Random Forest, XGBoost, Support Vector Machine (SVM), CatBoost, and K-Means clustering. The dataset is derived from sandstone formations on Labuan Island, Malaysia, and is augmented with synthetic data generated through Latin Hypercube Sampling to address data scarcity. Feature selection is performed using three independent techniques to identify the most informative variables, and Principal Component Analysis is used to investigate feature relationships. Model evaluation is based on classification accuracy, precision-recall metrics, receiver operating characteristic curves, and confusion matrices. Among the models tested, CatBoost achieved the highest cross-validation accuracy at 95.4%, followed by XGBoost at 93.7%. Random Forest achieved a test accuracy of 89.5%, while SVM performed less reliably, with a test accuracy of 85.6%. The unsupervised K-Means clustering approach yielded an overall accuracy of 49.7% when its clusters were aligned with the true facies labels. The results demonstrate the effectiveness of ensemble methods in facies classification and support the use of augmented data to enhance model performance. This approach provides a practical framework for applying machine learning in geological settings, with potential benefits for reservoir modeling and development planning.
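
The Latin Hypercube Sampling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation, which the abstract does not describe. The function name `latin_hypercube`, the two feature ranges, and the sample count are all hypothetical and chosen only for illustration. The idea is that each feature range is split into as many equal-width strata as there are samples, and every stratum is used exactly once per feature, which spreads synthetic points more evenly than plain random sampling.

```python
import random


def latin_hypercube(n_samples, bounds, seed=None):
    """Draw n_samples points so that, in every dimension, each of the
    n_samples equal-width strata contains exactly one point.

    bounds is a list of (low, high) pairs, one per feature.
    """
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum of the unit interval.
        unit = [(i + rng.random()) / n_samples for i in range(n_samples)]
        # Shuffle so strata are paired at random across features.
        rng.shuffle(unit)
        # Rescale from [0, 1) to the feature's own range.
        columns.append([low + u * (high - low) for u in unit])
    # Transpose: one row per synthetic sample, one column per feature.
    return [list(row) for row in zip(*columns)]


# Hypothetical feature ranges (assumptions, not values from the study):
# porosity as a fraction and gamma-ray reading in API units.
bounds = [(0.05, 0.35), (20.0, 150.0)]
samples = latin_hypercube(5, bounds, seed=0)

for row in samples:
    print(row)
```

In practice a library routine such as `scipy.stats.qmc.LatinHypercube` would normally be preferred over a hand-written sampler. The sketch above only shows the stratification idea and does not attempt to preserve correlations between features or facies labels.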

DOI

https://doi.org/10.31223/X5244X

Subjects

Analysis, Earth Sciences, Geology, Sedimentology

Keywords

Facies classification, machine learning, geostatistics, synthetic data augmentation, ensemble models

Dates

Published: 2025-06-10 18:07

Last Updated: 2025-06-10 18:07

License

CC BY Attribution 4.0 International

Additional Metadata

Conflict of interest statement:
The authors declare no relevant financial or non-financial competing interests that could have influenced the research findings.

Data Availability:
The data belong to a YUTP project.
