TY - GEN
T1 - QualEval: Qualitative Evaluation for Model Improvement
T2 - 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024
AU - Murahari, Vishvak
AU - Deshpande, Ameet
AU - Clark, Peter
AU - Rajpurohit, Tanmay
AU - Sabharwal, Ashish
AU - Narasimhan, Karthik
AU - Kalyan, Ashwin
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - Quantitative evaluation metrics have been pivotal in gauging the advancements of AI systems like large language models (LLMs). However, due to the intricate nature of real-world tasks, a single scalar that quantifies and compares performance trivializes the fine-grained nuances of model behavior. Additionally, metrics do not yield actionable diagnostics for model improvement, thus requiring extensive manual effort from scientists, who must sift through vast datasets and attempt hit-or-miss adjustments to training data or setups. In this work, we address the shortcomings of quantitative metrics by proposing QUALEVAL, which uses automated qualitative evaluation as a vehicle for model improvement. QUALEVAL uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights that, when applied, accelerate model improvement. The insights are supported by a dashboard report with fine-grained visualizations and human-interpretable analyses. We corroborate the faithfulness of QUALEVAL by demonstrating that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points on a challenging dialogue task (DialogSum) relative to baselines. QUALEVAL successfully increases the pace and quality of model development by eliminating the need for arduous manual analysis, thus serving as a data-scientist-in-a-box.
UR - http://www.scopus.com/inward/record.url?scp=85200250108&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85200250108&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85200250108
T3 - Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2024
SP - 2093
EP - 2111
BT - Long Papers
A2 - Duh, Kevin
A2 - Gomez, Helena
A2 - Bethard, Steven
PB - Association for Computational Linguistics (ACL)
Y2 - 16 June 2024 through 21 June 2024
ER -