This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Domain-Specific Embedding Models for Hydrology and Environmental Sciences: Enhancing Semantic Retrieval and Question Answering in RAG Pipelines
Abstract
Large Language Models (LLMs) have shown strong performance across natural language processing tasks, yet their general-purpose embeddings often fall short in domains with specialized terminology and complex syntax, such as hydrology and environmental science. This study introduces HydroEmbed, a suite of open-source sentence embedding models fine-tuned for four QA formats: multiple-choice (MCQ), true/false (TF), fill-in-the-blank (FITB), and open-ended questions. Models were trained on the HydroLLM Benchmark, a domain-aligned dataset combining textbook and scientific article content. Fine-tuning strategies included MultipleNegativesRankingLoss, CosineSimilarityLoss, and TripletLoss, selected to match each task's semantic structure. Evaluation was conducted on a held-out set of 400 textbook-derived QA pairs, using top-k similarity-based context retrieval and GPT-4o-mini for answer generation. Results show that the fine-tuned models match or exceed the performance of strong proprietary and open-source baselines, particularly in FITB and open-ended tasks, where domain alignment significantly improves semantic precision. The MCQ/TF model also achieved competitive accuracy. These findings highlight the value of task- and domain-specific embedding models for building robust retrieval-augmented generation (RAG) pipelines and intelligent QA systems in scientific domains. This work represents a foundational step toward HydroLLM, a domain-specialized language model ecosystem for environmental sciences.
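The top-k similarity-based context retrieval step mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the 3-dimensional vectors are synthetic placeholders standing in for embeddings that a fine-tuned HydroEmbed model would produce (e.g. via a sentence-transformers `model.encode(...)` call), and the function name `top_k_contexts` is hypothetical.

```python
import numpy as np

def top_k_contexts(query_vec, corpus_vecs, corpus_texts, k=3):
    """Return the k corpus passages most cosine-similar to the query.

    query_vec:   (d,) embedding of the question.
    corpus_vecs: (n, d) embeddings of candidate context passages.
    corpus_texts: list of n passage strings.
    """
    # Normalize so the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ q                    # cosine similarities, shape (n,)
    order = np.argsort(-sims)[:k]   # indices of the k highest scores
    return [(corpus_texts[i], float(sims[i])) for i in order]

# Toy corpus with made-up 3-d "embeddings"; a real RAG pipeline would
# embed textbook passages with the fine-tuned model instead.
corpus = ["infiltration", "evapotranspiration", "runoff"]
vecs = np.array([[1.0, 0.1, 0.0],
                 [0.0, 1.0, 0.1],
                 [0.9, 0.2, 0.1]])
query = np.array([1.0, 0.0, 0.0])
print(top_k_contexts(query, vecs, corpus, k=2))
```

The retrieved passages would then be concatenated into the prompt given to the answer-generation model (GPT-4o-mini in the study).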
DOI
https://doi.org/10.31223/X5DQ71
Subjects
Education, Engineering
Keywords
Domain-Specific Embeddings, Fine-tuning, HydroLLM, Large Language Models (LLM), Retrieval-Augmented Generation (RAG), Semantic Retrieval
Dates
Published: 2025-07-11 15:25
Last Updated: 2025-07-11 15:25
License
CC BY Attribution 4.0 International
Additional Metadata
Data Availability (Reason not available):
The metrics for model comparison are shared in the paper. The models are open-sourced.