TY - JOUR
T1 - Holistic Evaluation of Language Models
AU - Liang, Percy
AU - Bommasani, Rishi
AU - Lee, Tony
AU - Tsipras, Dimitris
AU - Soylu, Dilara
AU - Yasunaga, Michihiro
AU - Zhang, Yian
AU - Narayanan, Deepak
AU - Wu, Yuhuai
AU - Kumar, Ananya
AU - Newman, Benjamin
AU - Yuan, Binhang
AU - Yan, Bobby
AU - Zhang, Ce
AU - Cosgrove, Christian
AU - Manning, Christopher D.
AU - Ré, Christopher
AU - Acosta-Navas, Diana
AU - Hudson, Drew A.
AU - Zelikman, Eric
AU - Durmus, Esin
AU - Ladhak, Faisal
AU - Rong, Frieda
AU - Ren, Hongyu
AU - Yao, Huaxiu
AU - Wang, Jue
AU - Santhanam, Keshav
AU - Orr, Laurel
AU - Zheng, Lucia
AU - Yuksekgonul, Mert
AU - Suzgun, Mirac
AU - Kim, Nathan
AU - Guha, Neel
AU - Chatterji, Niladri
AU - Khattab, Omar
AU - Henderson, Peter
AU - Huang, Qian
AU - Chi, Ryan
AU - Xie, Sang Michael
AU - Santurkar, Shibani
AU - Ganguli, Surya
AU - Hashimoto, Tatsunori
AU - Icard, Thomas
AU - Zhang, Tianyi
AU - Chaudhary, Vishrav
AU - Wang, William
AU - Li, Xuechen
AU - Mai, Yifan
AU - Zhang, Yuhui
AU - Koreeda, Yuta
N1 - Publisher Copyright:
© 2023, Published in Transactions on Machine Learning Research. All rights reserved.
PY - 2023
Y1 - 2023
AB - Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what’s missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios to the extent possible (87.5% of the time), ensuring that metrics beyond accuracy don’t fall by the wayside, and that trade-offs across models and metrics are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to more deeply analyze specific aspects (e.g. knowledge, reasoning, memorization/copyright, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, including 21 scenarios that were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on a set of core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings concerning the interplay between different scenarios, metrics, and models. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit for easily adding new scenarios, models, metrics, and prompting strategies. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
UR - https://www.scopus.com/pages/publications/105003305555
UR - https://www.scopus.com/inward/citedby.url?scp=105003305555&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:105003305555
SN - 2835-8856
VL - 2023-August
JO - Transactions on Machine Learning Research
JF - Transactions on Machine Learning Research
ER -