TY - JOUR
T1 - Language Models as Science Tutors
AU - Chevalier, Alexis
AU - Geng, Jiayi
AU - Wettig, Alexander
AU - Chen, Howard
AU - Mizera, Sebastian
AU - Annala, Toni
AU - Aragon, Max Jameson
AU - Fanlo, Arturo Rodríguez
AU - Frieder, Simon
AU - Machado, Simon
AU - Prabhakar, Akshara
AU - Thieu, Ellie
AU - Wang, Jiachen T.
AU - Wang, Zirui
AU - Wu, Xindi
AU - Xia, Mengzhou
AU - Xia, Wenhan
AU - Yu, Jiatong
AU - Zhu, Jun Jie
AU - Ren, Zhiyong Jason
AU - Arora, Sanjeev
AU - Chen, Danqi
N1 - Publisher Copyright:
Copyright 2024 by the author(s)
PY - 2024
Y1 - 2024
AB - NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TUTOREVAL and TUTORCHAT. TUTOREVAL is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TUTOREVAL helps measure real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multidisciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TUTOREVAL. Therefore, we create TUTORCHAT, a dataset of 80,000 long synthetic dialogues about textbooks. We use TUTORCHAT to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TUTOREVAL while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations publicly.
UR - http://www.scopus.com/inward/record.url?scp=85203823328&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85203823328&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85203823328
SN - 2640-3498
VL - 235
SP - 8310
EP - 8335
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 41st International Conference on Machine Learning, ICML 2024
Y2 - 21 July 2024 through 27 July 2024
ER -