LLM-augmented taxonomy for >4500 palaeopalynology genera

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Authors

Michael Henry Stephenson, Jiaxi Yang, Alessandro Carniti, Shuzhong Shen, Junxuan Fan, Jan A.I. Hennissen, Jieping Ye

Abstract

Large Language Models (LLMs), being text-based, are a form of artificial intelligence well suited to the complexities of palaeontological taxonomy, because palaeontology depends on published textual descriptions as the primary, authoritative record of a taxon. This paper describes (1) the preparation of the palynological taxonomic text covering the >4500 genera of the Jansonius and Hills palaeopalynological catalogue (JHC) for an LLM-augmented taxonomy system (LATS), palynology being the study of organic-walled microfossils; (2) the efficiency and accuracy of the LATS; and (3) examples of possible further uses of the LATS beyond aids to identification. The conversion of the JHC into a LATS is typical of the challenges of making so-called ‘long tail’ data suitable for AI development, and can involve considerable manual checking. Principles of development include (1) ‘inclusion’, that is, making sure that the LATS as far as possible includes rather than excludes candidate genera; (2) ‘assistance’ rather than supplantation, so that the LATS is intended as an aid to taxonomy, not a replacement for a human taxonomist; and (3) ‘non-intervention’, whereby no alterations are made to original authoritative genus descriptions or diagnoses. Training involved 500 question/answer (QA) pairs generated for the JHC by specialists, together with additional synthetic QA pairs; the combined set was used for supervised fine-tuning of the LLM.
The LATS functions through Retrieval-Augmented Generation (RAG) and returns candidate genera with statistical measures of match against the prompt(s). Access to full descriptions of genera extracted from the JHC, and to scans of the original catalogue cards, allows taxonomists to use their own judgment in final identification. The LATS produces generally good results, but there are two types of limitation: those that emanate from the JHC (and from palaeopalynological taxonomy itself), and those that emanate from the working of the LATS. Limitations due to the JHC include (1) poor potential for discrimination between some genera because of poor original descriptions that have not subsequently been emended, (2) numerous candidate genus names that may be synonyms, and (3) invalid candidate genera, i.e. those that were illegitimately published. Limitations that emanate from the working of the LATS include evidence of bias against finely described genera.
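The retrieval step described above, in which candidate genera are returned with a measure of match against a prompt, can be illustrated with a minimal sketch. This is not the LATS implementation: it assumes a simple bag-of-words cosine similarity in place of whatever retrieval model the authors use, and the genus names and descriptions are invented placeholders, not JHC entries.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query, descriptions, top_k=3):
    """Return the top_k (genus, score) pairs, best match first."""
    q = Counter(tokenize(query))
    scored = [
        (name, cosine(q, Counter(tokenize(text))))
        for name, text in descriptions.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical placeholder descriptions, invented for illustration only.
TOY_DESCRIPTIONS = {
    "Genus A": "trilete spore with smooth exine and circular amb",
    "Genus B": "monolete spore with verrucate exine",
    "Genus C": "bisaccate pollen with reticulate sacci",
}

candidates = rank_candidates("smooth trilete spore", TOY_DESCRIPTIONS, top_k=2)
for name, score in candidates:
    print(f"{name}: {score:.2f}")
```

Returning several scored candidates rather than a single answer mirrors the ‘inclusion’ and ‘assistance’ principles: the final choice stays with the taxonomist.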
As well as aiding a palynologist through the stages of identification, the information and ‘understanding’ that the underlying system holds about a large area of palaeopalynological taxonomy means that it could be put to more general uses, for example identifying genus names that are likely synonyms and investigating the distribution of genera (or taxa) in ‘morphological space’.
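One way such a synonym screen might work, sketched under strong assumptions: compare every pair of genus descriptions and flag pairs whose similarity exceeds a threshold for review by a specialist. The Jaccard measure, the threshold of 0.6, and the placeholder descriptions below are all illustrative choices, not details taken from the preprint.

```python
import re
from itertools import combinations


def term_set(text):
    """Set of lower-cased alphabetic word tokens in a description."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity: shared terms divided by all terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def likely_synonym_pairs(descriptions, threshold=0.6):
    """Return (genus, genus, score) for pairs at or above the threshold."""
    terms = {name: term_set(text) for name, text in descriptions.items()}
    pairs = []
    for left, right in combinations(sorted(terms), 2):
        score = jaccard(terms[left], terms[right])
        if score >= threshold:
            pairs.append((left, right, score))
    # Highest-scoring pairs first, so reviewers see the strongest cases.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical placeholder descriptions, invented for illustration only.
TOY_DESCRIPTIONS = {
    "Genus A": "trilete spore with smooth exine and circular amb",
    "Genus B": "trilete spore with smooth exine and subcircular amb",
    "Genus C": "bisaccate pollen with reticulate sacci",
}

pairs = likely_synonym_pairs(TOY_DESCRIPTIONS)
for left, right, score in pairs:
    print(f"{left} ~ {right}: {score:.2f}")
```

A flagged pair is only a prompt for human review; high textual similarity between descriptions does not by itself establish synonymy.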
The development of the LATS described here has implications for other palaeontological groups, both for the textual basis of their taxonomy (for example, variable quality of descriptions and inconsistent terminology) and for their suitability for the development of other LLM-assisted taxonomic aids.

DOI

https://doi.org/10.31223/X55J3M

Subjects

Earth Sciences, Environmental Sciences, Mathematics

Keywords

taxonomy, palaeontology, LLM, Artificial intelligence, palynology, biodiversity

Dates

Published: 2026-02-07 07:41

Last Updated: 2026-02-07 07:41

License

CC BY Attribution 4.0 International

Additional Metadata

Conflict of interest statement:
None
