Forecasting Urban Water Escherichia coli Contamination Using Machine Learning Models

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Authors

Vidhatri Iyer

Abstract

The state of Indiana ranks first in the nation for water recreation impairments due to contaminated waterways. According to the U.S. Environmental Protection Agency, 73% of rivers and streams and 23% of lakes and reservoirs are impaired for recreational uses such as swimming, fishing, and boating. Increased urban population density and agricultural activity are key contributors to run-off into urban watersheds. The fecal coliform bacterium Escherichia coli (E. coli) has been used as an indicator of bacterial pollution in streams. Local governmental water authorities and non-profit organizations routinely collect urban water samples weekly or biweekly to measure water quality parameters, including E. coli counts. These analytical methods are time-consuming and provide only a retrospective view of E. coli loads. Forecasting E. coli contamination in urban waters is therefore necessary to give the public real-time information about the suitability of these waters for bodily contact, recreation, fishing, boating, and domestic use. Another limitation of current methods is that they do not integrate local climatic conditions such as changes in temperature and precipitation. In this study, E. coli contamination in urban streams was predicted from the last 20 years of climatic data (temperature, precipitation) and water sample analysis data. E. coli data for three streams were collected from the Marion County (Indiana) watershed project for the period 2003-2022. Daily temperature and precipitation data for Marion County were obtained from the National Oceanic and Atmospheric Administration site. The two data sources were combined using the date field as the common key. An initial exploratory data analysis was performed to understand how each parameter correlates with E. coli levels.
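As an illustration of combining the two sources on the date field, here is a minimal sketch in plain Python. The record layout, the field names (`site`, `ecoli`, `tavg_c`, `precip_mm`), and all values are hypothetical and are not taken from the Marion County or NOAA datasets. The sketch assumes an inner join, so samples without a weather record on the same date are dropped.

```python
from datetime import date

# Hypothetical E. coli sample records; values are illustrative only.
samples = [
    {"date": date(2020, 6, 1), "site": "stream_a", "ecoli": 120.0},
    {"date": date(2020, 6, 8), "site": "stream_a", "ecoli": 980.0},
    {"date": date(2020, 6, 15), "site": "stream_a", "ecoli": 310.0},
]

# Hypothetical daily weather records for the same county.
weather = [
    {"date": date(2020, 6, 1), "tavg_c": 21.0, "precip_mm": 0.0},
    {"date": date(2020, 6, 8), "tavg_c": 24.5, "precip_mm": 18.3},
]


def join_on_date(sample_rows, weather_rows):
    """Inner join: keep each sample that has a weather record on the same date."""
    weather_by_date = {row["date"]: row for row in weather_rows}
    merged_rows = []
    for sample in sample_rows:
        match = weather_by_date.get(sample["date"])
        if match is not None:
            # Combine the fields of both records into one row.
            merged_rows.append({**sample, **match})
    return merged_rows


merged = join_on_date(samples, weather)
for row in merged:
    print(row["date"], row["ecoli"], row["tavg_c"], row["precip_mm"])
```

Because sampling is weekly or biweekly while weather is daily, a join keyed on the sample date keeps one weather row per sample. Any lagged or windowed weather features would need to be computed on the daily series before this step.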
Next, additional calculated features such as cumulative degree days and maximum precipitation over 10 or 15 days were included as inputs to six machine learning models (Logistic Regression, Random Forest Classifier, Extra Trees Classifier, Decision Tree Classifier, Gradient Boosting Classifier, and XGB Classifier). Feature importance and overall accuracy scores were compared across the six models to identify the best one. The XGB Classifier consistently achieved an ROC value above 85% for each of the three individual streams.
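The derived features and the ROC score can be sketched in plain Python as follows. This is only an illustration under stated assumptions, and the abstract does not give the exact definitions used. Here, cumulative degree days are taken as a running sum of daily mean temperature above a hypothetical 10 °C base, the precipitation feature is the maximum over a trailing window that includes the current day, and the ROC score is the area under the ROC curve, with label 1 standing for a sample classed as contaminated. All input values are made up.

```python
def cumulative_degree_days(daily_temps_c, base_c=10.0):
    """Running sum of degrees above base_c (the base temperature is an assumption)."""
    total = 0.0
    result = []
    for temp in daily_temps_c:
        total += max(0.0, temp - base_c)
        result.append(total)
    return result


def rolling_max(values, window):
    """Maximum over the trailing `window` days, including the current day."""
    return [max(values[max(0, i - window + 1): i + 1]) for i in range(len(values))]


def roc_auc(labels, scores):
    """Area under the ROC curve: the probability that a random positive
    is scored above a random negative, counting ties as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Illustrative inputs only.
temps_c = [8.0, 12.0, 15.0, 9.0]
precip_mm = [0.0, 5.0, 1.0, 0.0, 12.0]

degree_days = cumulative_degree_days(temps_c)
max_precip_3d = rolling_max(precip_mm, window=3)  # the abstract uses 10 or 15 days

# Hypothetical labels (1 = contaminated) and model scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(degree_days, max_precip_3d, auc)
```

Computing the same score for each candidate model on held-out samples gives one way to rank the six classifiers. An ROC value of 85% corresponds to 0.85 on this scale.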

DOI

https://doi.org/10.31223/X5D99N

Subjects

Microbiology

Keywords

E. coli, urban water streams, machine learning models

Dates

Published: 2024-03-26 17:07

Last Updated: 2024-03-26 17:07

License

No Creative Commons license