Unofficial Bookmarks for STRATI 2026 Program v0.1.7
G17 June 29 · 14:30–14:45 · International Room II (7F)

LLM-Augmented Fossil Taxonomy: Reshaping the Practice of Biostratigraphy?

G17 Quantitative Stratigraphy: Concepts, Principles, Methods and Applications

Michael H. Stephenson, Jiaxi Yang, Alessandro P. Carniti, Shu-zhong Shen, Junxuan Fan, Jieping Ye

Fossils are essential for classifying, dating and correlating stratigraphic sections around the globe, and for reconstructing the environmental context of their deposition. However, any attempt to use fossils as tools for biostratigraphy or palaeoecological reconstruction should be rooted in careful identification of the collected specimens, that is, in paleontological taxonomy. Because of its breadth, this discipline requires years of expert training for each fossil group. It also rests on textual descriptions of fossil taxa scattered through a heterogeneous corpus developed over more than two centuries of study, written in a wide range of languages and formats, and in part difficult to access online. These issues have made the practice of paleontological taxonomy difficult and compartmentalized across continents and across geological eras and systems, exaggerating differences between sections and regions rather than indicating similarity and connection. Because paleontological taxonomy is text based, Large Language Models (LLMs) are ideal tools to facilitate its practice. Here we introduce two open-access LLM-augmented taxonomy systems (LATS), one based on the Treatise on Invertebrate Paleontology for brachiopods (TIPB) and one on the Jansonius and Hills Catalogue (JHC) for palynomorphs. Brachiopods provide high-resolution, regionally consistent biostratigraphic frameworks for much of the Palaeozoic, while palynomorphs are the primary stratigraphic tool in Phanerozoic continental basins (e.g., Devonian–Carboniferous coal measures). For both data sources, the LATS converts the original descriptions of fossil taxa into globally accessible, machine-readable knowledge graphs that enable guided genus-level identification by matching the user's specimen description against genus diagnoses and descriptions.
The Retrieval-Augmented Generation (RAG) technique was adapted to search for the best genus matches based on user descriptions or prompts, using iterative sequences of prompts where necessary. The LATS determines the number of candidate taxa and the statistical quality of the match between the prompt and the candidate genus diagnoses and descriptions from the JHC and TIPB. The system is built to help the taxonomist make an informed determination by providing relevant information; the final decision rests with the taxonomist. The LATS provides quick access to thousands of unaltered, authoritative original taxon descriptions, making the work of taxonomists easier, quicker and better informed. By showing a rationale for each selected candidate genus, the LATS exposes the practice and reasoning behind taxonomy, offering a potential educational tool for training the next generation of biostratigraphers. Finally, the LATS can be used as a discovery engine: cluster analyses run on its database of thousands of taxon descriptions can detect probable synonymies among taxa defined in different regions and stratigraphic intervals, and reveal morphological trends through geological time and across geography. Future development of the LATS could reshape the practice of fossil taxonomy and biostratigraphy, making it more efficient and reliable in the context of modern stratigraphy.
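
The retrieval step described above, ranking candidate genera by how well their diagnoses match a user's specimen description, can be illustrated with a minimal sketch. It is not the LATS implementation: the abstract does not say which similarity measure or embedding model is used, so plain TF-IDF cosine similarity stands in here as an assumption. The genus names, the diagnosis texts and the function `rank_genera` are hypothetical placeholders, not content from the TIPB or JHC.

```python
import math
import re
from collections import Counter

# Toy diagnoses invented for illustration only; NOT taken from the TIPB or JHC.
DIAGNOSES = {
    "Genus A": "shell biconvex, strongly costate, with dorsal fold and ventral sulcus",
    "Genus B": "shell concavo-convex, finely costellate, with spines along the hinge",
    "Genus C": "trilete spore, circular amb, with thick equatorial cingulum",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector; tokens unknown to the corpus are dropped."""
    tf = Counter(tokens)
    return {t: tf[t] * idf[t] for t in tf if t in idf}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_genera(description, diagnoses, top_k=3):
    """Return up to top_k (genus, score) pairs, best match first."""
    names = list(diagnoses)
    docs = [tokenize(diagnoses[name]) for name in names]
    idf = build_idf(docs)
    query = vectorize(tokenize(description), idf)
    scored = [(name, cosine(query, vectorize(doc, idf)))
              for name, doc in zip(names, docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    query = "biconvex costate shell with a fold and sulcus"
    for name, score in rank_genera(query, DIAGNOSES):
        print(f"{name}: {score:.2f}")
```

In a RAG setting, the top-ranked diagnoses would then be passed to the language model as context, together with the scores, so that the taxonomist sees both the candidate genera and how strongly each one matched.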

paleontology, taxonomy, large language models, brachiopods, palynomorphs
Affiliations
  1. Stephenson Geoscience Consulting Ltd, Keyworth, Nottingham, UK
  2. Zhejiang Laboratory, Hangzhou, China
  3. School of Earth Sciences and Engineering, Nanjing University, Nanjing, China