TY - JOUR
T1 - Benchmark suites instead of leaderboards for evaluating AI fairness
AU - Wang, Angelina
AU - Hertzmann, Aaron
AU - Russakovsky, Olga
N1 - Publisher Copyright:
© 2024 The Author(s)
PY - 2024/11/8
Y1 - 2024/11/8
N2 - Benchmarks and leaderboards are commonly used to track the fairness impacts of artificial intelligence (AI) models. Many critics argue against this practice, since it incentivizes optimizing for metrics in an attempt to build the “most fair” AI model. However, this is an inherently impossible task since different applications have different considerations. While we agree with the critiques against leaderboards, we believe that the use of benchmarks can be reformed. Thus far, the critiques of leaderboards and benchmarks have become unhelpfully entangled. However, benchmarks, when not used for leaderboards, offer important tools for understanding a model. We advocate for collecting benchmarks into carefully curated “benchmark suites,” which can provide researchers and practitioners with tools for understanding the wide range of potential harms and trade-offs among different aspects of fairness. We describe the research needed to build these benchmark suites so that they can better assess different usage modalities, cover potential harms, and reflect diverse perspectives. By moving away from leaderboards and instead thoughtfully designing and compiling benchmark suites, we can better monitor and improve the fairness impacts of AI technology.
UR - http://www.scopus.com/inward/record.url?scp=85208104592&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85208104592&partnerID=8YFLogxK
U2 - 10.1016/j.patter.2024.101080
DO - 10.1016/j.patter.2024.101080
M3 - Review article
AN - SCOPUS:85208104592
SN - 2666-3899
VL - 5
JO - Patterns
JF - Patterns
IS - 11
M1 - 101080
ER -