Towards HydroLLM: A Benchmark Dataset for Hydrology-Specific Knowledge Assessment for Large Language Models

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.


Authors

Dilara Kizilkaya, Ramteja Sajja, Yusuf Sermet, Ibrahim Demir

Abstract

The rapid advancement of Large Language Models (LLMs) has enabled their integration into a wide range of scientific disciplines. This paper introduces a comprehensive benchmark dataset specifically designed for evaluating recent large language models in the hydrology domain. Leveraging a collection of research articles and a hydrology textbook, we generated a wide array of hydrology-specific questions in various formats, including True/False, Multiple Choice, Open-Ended, and Fill-in-the-Blank. These questions serve as a robust foundation for evaluating the performance of state-of-the-art LLMs, including GPT-4o-mini, Llama3:8B, and Llama3.1:70B, in addressing domain-specific queries. Our evaluation framework employs accuracy metrics for objective question types and cosine similarity measures for subjective responses, ensuring a thorough assessment of the models' proficiency in understanding and responding to hydrological content. The results underscore both the capabilities and limitations of Artificial Intelligence (AI)-driven tools within this specialized field, providing valuable insights for future research and the development of educational resources. By introducing HydroLLM-Benchmark, this study contributes a vital resource to the growing body of work on domain-specific AI applications, demonstrating the potential of LLMs to support complex, field-specific tasks in hydrology.
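
For illustration, a minimal sketch of how the cosine-similarity scoring of subjective responses might be implemented, assuming a sentence-embedding encoder such as all-MiniLM-L6-v2; the specific encoder and the cosine_score helper below are assumptions for illustration, not details taken from the paper.

    # Minimal sketch of cosine-similarity scoring for subjective answers.
    # Assumption: a sentence-embedding encoder (all-MiniLM-L6-v2 here) is used
    # to embed the LLM answer and the reference answer; the paper may use a
    # different encoder or similarity pipeline.
    from sentence_transformers import SentenceTransformer, util

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice

    def cosine_score(model_answer: str, reference_answer: str) -> float:
        """Return cosine similarity between an LLM answer and the reference answer."""
        embeddings = encoder.encode([model_answer, reference_answer], convert_to_tensor=True)
        return util.cos_sim(embeddings[0], embeddings[1]).item()

    # Example: score one open-ended response against a textbook-derived reference.
    print(cosine_score(
        "Infiltration is the process by which water on the surface enters the soil.",
        "Infiltration describes the movement of surface water into the soil profile.",
    ))

A score near 1.0 indicates a response semantically close to the reference, while objective formats (True/False, Multiple Choice, Fill-in-the-Blank) can be graded directly with exact-match accuracy.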

DOI

https://doi.org/10.31223/X5R410

Subjects

Environmental Engineering

Keywords

benchmark dataset, large language models, hydrology, question generation, domain-specific AI, natural language processing

Dates

Published: 2025-03-15 19:34

Last Updated: 2025-03-16 02:32

License

CC BY Attribution 4.0 International

Additional Metadata

Conflict of interest statement:
None

Data Availability:
Data is shared in the paper.