This is a Preprint and has not been peer reviewed. This is version 2 of this Preprint.

Towards HydroLLM: Building a Domain-Specific Language Model for Hydrology
Authors
Abstract
As large language models (LLMs) continue to expand, their effective adaptation to specialized fields remains a critical challenge. This work presents an initial step toward the development of HydroLLM, a domain-specific LLM for hydrology. We construct a dataset of approximately 8,800 hydrology-focused question–answer pairs, each with a supporting context passage drawn from textbooks and scientific articles. The dataset includes four instructional formats: multiple choice, true/false, fill-in-the-blank, and open-ended. Using this corpus, we fine-tune several LLMs of varying type and scale, from compact (1.5B) to large (32B) parameter counts, using parameter-efficient LoRA (Low-Rank Adaptation) methods. We compare the fine-tuned models and evaluate their performance across task types using accuracy and cosine similarity metrics. Results show that larger model size is not always advantageous: among the fine-tuned models, the 8B DeepSeek Llama variant achieved the strongest overall performance, while the 32B model overfit and the 1.5B model underperformed, emphasizing the need to match model capacity to dataset size. This work demonstrates that effective domain adaptation requires careful consideration of model architecture, parameter count, and task complexity, with fill-in-the-blank tasks proving particularly challenging across all models. By establishing baseline performance and identifying the limits of current fine-tuning approaches, we take a concrete step toward building HydroLLM as a robust, domain-specific language model for hydrological analysis and decision support.
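The two evaluation metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the exact-match normalization, and the word-count vectors used for cosine similarity are all assumptions made here for illustration. The abstract does not say how answers are normalized or which text representation underlies the similarity score.

```python
import math
from collections import Counter


def accuracy(predictions, references):
    """Fraction of predictions that match their reference exactly.

    Assumption: answers are compared after trimming whitespace and
    lowercasing, which suits multiple-choice and true/false items.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return hits / len(references)


def cosine_similarity(text_a, text_b):
    """Cosine similarity between word-count vectors of two strings.

    Assumption: whitespace-tokenized word counts stand in for whatever
    text representation the study actually uses.
    """
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words shared by both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Under these assumptions, `accuracy` would score the closed-form formats (multiple choice, true/false), while `cosine_similarity` would give a graded score for open-ended answers, where an exact match is too strict.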
DOI
https://doi.org/10.31223/X51H99
Subjects
Engineering
Keywords
HydroLLM, Large Language Models (LLMs), Fine-Tuning, hydrology, Question Generation, Domain-Specific AI, Natural Language Processing (NLP)
Dates
Published: 2025-07-11 10:38
Last Updated: 2025-07-11 19:58
License
CC BY Attribution 4.0 International
Additional Metadata
Conflict of interest statement:
None
Data Availability (Reason not available):
All data is available upon reasonable request.