Benchmark suites instead of leaderboards for evaluating AI fairness

Angelina Wang, Aaron Hertzmann, Olga Russakovsky

Research output: Contribution to journal › Review article › peer-review

Abstract

Benchmarks and leaderboards are commonly used to track the fairness impacts of artificial intelligence (AI) models. Many critics argue against this practice, since it incentivizes optimizing for metrics in an attempt to build the “most fair” AI model. However, this is an inherently impossible task since different applications have different considerations. While we agree with the critiques against leaderboards, we believe that the use of benchmarks can be reformed. Thus far, the critiques of leaderboards and benchmarks have become unhelpfully entangled. However, benchmarks, when not used for leaderboards, offer important tools for understanding a model. We advocate for collecting benchmarks into carefully curated “benchmark suites,” which can provide researchers and practitioners with tools for understanding the wide range of potential harms and trade-offs among different aspects of fairness. We describe the research needed to build these benchmark suites so that they can better assess different usage modalities, cover potential harms, and reflect diverse perspectives. By moving away from leaderboards and instead thoughtfully designing and compiling benchmark suites, we can better monitor and improve the fairness impacts of AI technology.

Original language: English (US)
Article number: 101080
Journal: Patterns
Volume: 5
Issue number: 11
DOI:
State: Published - Nov 8, 2024

All Science Journal Classification (ASJC) codes

  • General Decision Sciences
