Water Science is Becoming More Interdisciplinary

We use Natural Language Processing (NLP) to assess topic diversity at the level of (i) individual articles, (ii) individual journals, and (iii) the whole corpus of research article abstracts in eighteen water science journals. Interdisciplinarity within individual articles in water science and hydrology journals is increasing. No such discernible trend exists at the corpus level: topic diversity in the overall hydrology and water science corpus is not increasing.

We assess the interdisciplinarity of 74,479 water science and hydrology research articles at multiple levels (article and corpus) for eighteen water science journals. In doing so, we leverage Natural Language Processing (NLP) tools and apply unsupervised learning to extract a diverse range of topics and carry out contextual analyses. We observe the strongest rise in interdisciplinarity of articles published in Water Resources Research (WRR), Advances in Water Resources (AWR), and Journal of Contaminant Hydrology (JCH), while the rest of the journals demonstrate slightly rising to slightly decreasing trends. At the corpus level, Journal of Hydrometeorology (JHM), Hydrogeology Journal (HGJ), Hydrology and Earth System Sciences (HESS), and Journal of the American Water Resources Association (JAWRA) show a slightly rising trend. We analyze the topics in terms of their trends, and also identify eleven isolated topics (subdisciplines) in this field, some of which have become increasingly isolated over time. These findings contribute to the discourse on interdisciplinarity in the water science and hydrology domain.

manuscript submitted to Water Resources Research

Topic modeling is a particular type of NLP that uses statistical algorithms to extract semantic information from a collection of texts in the form of thematic classes (Jiang, Qiang, & Lin, 2016). Topic models can be applied to massive collections of documents (Blei, 2012) and have been used to recommend scientific articles based on content and user ratings (C. Wang & Blei, 2011). Topic modeling has also been used to cluster sci-  (Priva & Austerweil, 2015), and exploring topic divergence and similarities in scientific conferences (Hall, Jurafsky, & Manning, 2008). As opposed to scientometrics techniques (Mingers & Leydesdorff, 2015), which have been traditionally used for ranking articles and authors based on citation data, topic modeling allows for a contextual understanding of particular scientific domains and disciplines.

Motivated by the success of topic modeling in a wide range of applications, we explore its potential to aid bibliometric exploration of peer-reviewed water science literature. In particular, we explore the question of whether peer-reviewed water science literature is increasing in interdisciplinarity with respect to sub-topics in the discipline. The specific hypotheses that we will explore are:

• Individual hydrology research papers are becoming more topically diverse, i.e., interdisciplinarity is increasing at a document level.

• The hydrology and water science corpus is becoming more topically-diverse.

• Articles published in certain journals are becoming more interdisciplinary.

We would additionally like to understand whether certain topics in water science are contributing more or less to interdisciplinary work, including whether certain topics are isolated in the community research output.

Performance of topic modeling is influenced by the quality of input training data. Article abstracts were preprocessed into a canonical format for efficacious feature extraction (Feldman, Sanger, et al., 2007). To prepare the data, we used separate temporally-segregated dataframes of abstracts and metadata from each journal. All sets of data were processed through identical multi-layered cleaning routines. We used the spaCy and NLTK Python libraries to filter non-semantic elements such as stopwords, punctuation, and symbols, and in addition we manually identified and removed unwanted elements that were common in our article abstracts (the cleaned abstracts are available in the repository linked in the Data and Code Availability statement at the end of this article).
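As a rough illustration of the filtering step described above, the sketch below removes punctuation, symbols, and stopwords from a single abstract using only the Python standard library. It is a simplified stand-in, not the cleaning routine used in this study: the stop-word list, the function name `clean_abstract`, and the example sentence are all hypothetical, and the spaCy and NLTK libraries provide far more complete stop-word lists and tokenizers.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# spaCy and NLTK ship much larger lists.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "we"}


def clean_abstract(text):
    """Lowercase, strip punctuation and symbols, split on whitespace, drop stopwords."""
    text = text.lower()
    # Replace anything that is not an ASCII letter, digit, or whitespace with a space.
    # (An ASCII-only simplification; accented characters would also be removed.)
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return [token for token in text.split() if token not in STOPWORDS]


tokens = clean_abstract("We model the runoff (mm/day) in a snow-dominated basin.")
print(tokens)
```

Applying the same function to every abstract mirrors the idea of passing all journals through identical cleaning routines.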

In the next step, we formed bi-grams and segmented texts by tokenizing with whitespaces as word boundaries. This was followed by lemmatization, to extract semantic roots from conjugations, etc. Using this corpus, we created a map between words and integer identifiers. We then converted this dictionary into a bag-of-words format, making the

Here, we use an LDA implementation in the Python Gensim package with VEM. We trained our models with the number of passes set to 5000 and chunksize (number of documents in a batch) set to 100. We used a parallelized implementation of LDA in Gensim to train individual models with topic sizes ranging from K = 10 to K = 80; each model was trained using 40 shared-memory cores on a single node of a high performance cluster.

Using these settings, it takes on the order of a few hours to train a single model: between 3 and 15 hours per model on our particular machine, depending on K.

year, t: . (3)

where $\mu$ is the distribution of topics over document d. We also calculated the mean Shannon diversity in documents per year as $H^t_d$:

Finally, we calculated the Shannon diversity per article per journal per year, $H^t_{dj}$, as:

We calculated Shannon diversity at the corpus level and then computed these corpus indexes for each journal. To do this, we began by calculating the K-nomial distribution over topics $\mu_j$ in a particular journal j:

where $\mu_{kj}$ is the relative popularity of a particular topic in a particular journal as a fraction of popularity of all topics in the journal. We then calculated the total entropy of each $\mu_j$, $H_j$, as a measure of the Shannon diversity of the per-journal topic distributions:

The popularity of a particular topic in a particular journal for a particular year, $\mu^t_{kj}$, is a fraction of the popularity of all topics in that journal and year:
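The diversity measures described above lend themselves to a short sketch. The Python below assumes the standard Shannon entropy with a natural logarithm, $H = -\sum_k \mu_k \ln \mu_k$, and further assumes that a journal's topic popularity is the sum of its documents' topic weights. Both are assumptions made for illustration, and the function names and toy data are hypothetical rather than taken from the authors' code.

```python
import math


def shannon_diversity(weights):
    """Shannon entropy (natural log) of a topic distribution.

    Weights are normalized first; zero weights contribute nothing.
    """
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


def mean_diversity_by_year(docs):
    """docs: iterable of (year, topic_weights) pairs.

    Returns {year: mean per-document Shannon diversity}.
    """
    by_year = {}
    for year, weights in docs:
        by_year.setdefault(year, []).append(shannon_diversity(weights))
    return {year: sum(vals) / len(vals) for year, vals in by_year.items()}


def journal_topic_distribution(doc_weights):
    """Assumed aggregation: sum topic weights over a journal's documents, then normalize."""
    n_topics = len(doc_weights[0])
    totals = [sum(doc[k] for doc in doc_weights) for k in range(n_topics)]
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


# Toy example with K = 2 topics.
docs = [(2000, [0.5, 0.5]), (2000, [1.0, 0.0]), (2001, [0.5, 0.5])]
per_year = mean_diversity_by_year(docs)
journal_mu = journal_topic_distribution([[0.5, 0.5], [1.0, 0.0]])
journal_H = shannon_diversity(journal_mu)
```

A document that spreads its weight evenly over topics scores the maximum diversity, ln K, while a document dominated by a single topic scores close to zero.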

The correlation coefficient between topic weights over the whole corpus M for each pair of topics, $r_{k,j}$, was calculated as:

where $\mu_k$ is the weight for topic k assigned to document d, $\bar{\mu}_k$ is the mean weight for topic k assigned over all documents in the corpus, $\mu_j$ is the weight for topic j assigned to document d, and $\bar{\mu}_j$ is the mean weight for topic j assigned over all documents in the corpus. We only report correlations greater than 0.1.

We identified topics that frequently appear isolated using the correlation coefficient

were conceptually similar to these; however, LDA was able to extract a larger and more nuanced set of topics through unsupervised learning.

year for each of the eighteen journals as shown in Figure 6. As before, we used linear regression to assess the significance of temporal trends in these per-journal time series.

pus lends us a snapshot of co-appearing and disjointed topics, they also assist in segregating isolated topics.

Runoff" topic also correlates with "Urban Drainage" ($r_{k,j}$ = 0.14), and "Watershed Hy-

dictably) between "River Flow" and "Streamflow" ($r_{k,j}$ = 0.12), "River Flow" and "Tem-

The most insular topics in our corpus tend to reduce the paper-wise diversity when they appear in an article (meaning they are less likely to appear alongside a wide variety of other topics). We refer to these topics as being 'isolated'. It is important to remember that these topics are actually collections of words (Figure 3), and thus topic isolation means that there is a subsection of water science literature that uses a particular vocabulary that is somehow disconnected from other portions of the community.
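To make the pairwise topic correlation concrete, the following sketch computes a correlation coefficient between every pair of topic-weight columns and keeps only those above 0.1, matching the reporting threshold mentioned earlier. It assumes the standard Pearson form, which is consistent with the per-document weights and corpus means defined in the text but is not confirmed by the equation itself; the function names and the four-document toy matrix are hypothetical.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation between two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlated_topic_pairs(doc_topic, threshold=0.1):
    """doc_topic: list of per-document topic-weight lists (documents x topics).

    Returns {(k, j): r} for every topic pair with r above the threshold.
    """
    n_topics = len(doc_topic[0])
    columns = [[doc[k] for doc in doc_topic] for k in range(n_topics)]
    pairs = {}
    for k in range(n_topics):
        for j in range(k + 1, n_topics):
            r = pearson(columns[k], columns[j])
            if r > threshold:
                pairs[(k, j)] = r
    return pairs


# Toy corpus: topics 0 and 1 tend to co-occur; topic 2 appears on its own.
doc_topic = [
    [0.6, 0.3, 0.1],
    [0.2, 0.1, 0.7],
    [0.5, 0.4, 0.1],
    [0.1, 0.2, 0.7],
]
pairs = correlated_topic_pairs(doc_topic)
```

In this toy corpus, topic 2 has no correlation above the threshold with any other topic, which is the pattern the text describes as isolation.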

with a 2018 IF greater than 0.9). We tested the hypotheses that interdisciplinarity was increasing in both respects and found evidence to support one of those hypotheses but not the other. Individual researchers appear to be broadening their scope across different subtopics in the discipline (i.e., per-paper topic diversity is increasing; Figure 5), and while individual topics are changing in popularity over time (Figure 4), the water science and hydrology corpus as a whole is not becoming more or less topically diverse overall (Figure 7).

The primary findings of this study are:

Change Impacts", "Solute Transport", and "Surface-GW Interactions").

Perplexity is a decreasing function of the probability assigned to each per-document word distribution. Lower perplexity indicates a better model.
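A minimal sketch of this relationship, assuming the commonly used definition of perplexity as the exponential of the negative average per-word log-likelihood, is given below. The function name and the toy numbers are hypothetical; in practice the per-document log-likelihoods would come from the fitted topic model.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """exp(-(sum of per-document log-likelihoods) / (total number of words))."""
    total_log_likelihood = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_log_likelihood / total_words)


# Toy check: a model that assigns probability 0.25 to every word
# (uniform over a four-word vocabulary) has a perplexity of 4.
lengths = [10, 5]
uniform = perplexity([n * math.log(0.25) for n in lengths], lengths)

# A model that assigns higher probability (0.5) to each observed word
# yields a lower perplexity.
sharper = perplexity([n * math.log(0.5) for n in lengths], lengths)
```

Because the log-likelihood enters with a negative sign, any model that assigns higher probability to the observed words yields a lower perplexity.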

Topic coherence c is a measure of similarity in semantics between the high prob-

Journals with a fairly recent publication history (i.e., ESWRT, ISWCR, JHREG, and WRI) had lower overall diversity compared to the rest of the corpus, which is expected.