Rethinking Math Benchmarks for LLMs using IRT

Research output: Contribution to journal, Conference article, peer-review

Abstract

Several datasets have been created to evaluate LLM performance on mathematical reasoning tasks. Performance on these benchmarks is used as a proxy for a model's math ability and to rank that model against others. These rankings play a crucial role for AIEd practitioners in selecting models for applications like math tutoring. Recent research has argued that several of these benchmarks have become saturated, prompting the creation of new datasets with more difficult tasks. How can we gauge the effectiveness of these benchmarks for measuring math skills and producing reliable rankings? Leveraging the psychometric framework of Item Response Theory (IRT), we examine three math benchmarks: GSM8K, MATH, and MathOdyssey. We find that GSM8K and MathOdyssey are not suited to properly evaluate the current range of frontier model abilities, and are instead suited to models with lower and higher math abilities, respectively. Moreover, current rankings derived from these benchmarks are unstable and fail to reliably capture the latent math ability they aim to measure. To remedy these issues, we recommend integrating IRT analysis into the process of selecting questions for future benchmarks.

Original language: English (US)
Pages (from-to): 66-82
Number of pages: 17
Journal: Proceedings of Machine Learning Research
Volume: 273
State: Published - 2025
Event: 39th Annual AAAI Conference on Innovation and Responsibility in AI-Supported Education Workshop, iRAISE 2025 - Philadelphia, United States
Duration: Mar 3 2025 → …

All Science Journal Classification (ASJC) codes

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence

Keywords

  • AIEd
  • Item Response Theory
  • LLM Benchmarks
  • LLM Evaluation
