Structured information extraction from scientific text with large language models

  • John Dagdelen
  • Alexander Dunn
  • Sanghoon Lee
  • Nicholas Walker
  • Andrew S. Rosen
  • Gerbrand Ceder
  • Kristin A. Persson
  • Anubhav Jain

Research output: Contribution to journal › Article › peer-review

322 Scopus citations

Abstract

Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
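
As an illustration of the "list of JSON objects" output format mentioned above, the following is a minimal sketch of how a passage and its annotated records might be paired for fine-tuning, and how a model completion might be decoded back into records. It is not the authors' implementation: the separator strings, the function names, the example sentence, and the record keys ("host", "dopants") are all assumptions chosen for illustration.

```python
import json

# Hypothetical separator strings (assumptions, not taken from the paper);
# any markers that cannot occur inside the serialized JSON would work.
PROMPT_END = "\n\n###\n\n"
COMPLETION_END = "\n\nEND\n\n"


def make_training_example(passage, records):
    """Pair a passage with its annotated records, serialized as a JSON list."""
    return {
        "prompt": passage.strip() + PROMPT_END,
        "completion": " " + json.dumps(records) + COMPLETION_END,
    }


def parse_completion(text):
    """Decode a model completion into a list of record dicts.

    Returns an empty list when the text is not valid JSON or not a JSON list;
    list items that are not JSON objects are dropped.
    """
    # Keep only the text before the stop marker, if the marker is present.
    body = text.partition(COMPLETION_END)[0].strip()
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    return [item for item in data if isinstance(item, dict)]


# Hypothetical dopant/host example; sentence and schema are illustrative only.
passage = "ZnO doped with Al and Ga shows improved conductivity."
records = [{"host": "ZnO", "dopants": ["Al", "Ga"]}]

example = make_training_example(passage, records)
recovered = parse_completion(example["completion"])
```

Serializing with json.dumps escapes newlines inside strings, so the stop marker cannot collide with record content, and malformed completions degrade to an empty list rather than raising an error.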

Original language: English (US)
Article number: 1418
Journal: Nature Communications
Volume: 15
Issue number: 1
State: Published - Dec 2024
Externally published: Yes

All Science Journal Classification (ASJC) codes

  • General Chemistry
  • General Biochemistry, Genetics and Molecular Biology
  • General Physics and Astronomy
