TY - JOUR
T1 - Structured information extraction from scientific text with large language models
AU - Dagdelen, John
AU - Dunn, Alexander
AU - Lee, Sanghoon
AU - Walker, Nicholas
AU - Rosen, Andrew S.
AU - Ceder, Gerbrand
AU - Persson, Kristin A.
AU - Jain, Anubhav
N1 - Publisher Copyright: © The Author(s) 2024.
PY - 2024/12
Y1 - 2024/12
AB - Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
UR - http://www.scopus.com/inward/record.url?scp=85185243599&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85185243599&partnerID=8YFLogxK
U2 - 10.1038/s41467-024-45563-x
DO - 10.1038/s41467-024-45563-x
M3 - Article
C2 - 38360817
AN - SCOPUS:85185243599
SN - 2041-1723
VL - 15
JO - Nature Communications
JF - Nature Communications
IS - 1
M1 - 1418
ER -