TY - GEN
T1 - InstructEval: Systematic Evaluation of Instruction Selection Methods
T2 - Findings of the Association for Computational Linguistics: NAACL 2024
AU - Ajith, Anirudh
AU - Pan, Chris
AU - Xia, Mengzhou
AU - Deshpande, Ameet
AU - Narasimhan, Karthik
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - In-context learning (ICL) performs tasks by prompting a large language model (LLM) using an instruction and a small set of annotated examples called demonstrations. Recent work has shown that precise details of the inputs used in the ICL prompt significantly impact performance, which has incentivized instruction selection algorithms. The effect of instruction choice, however, is severely underexplored, with existing analyses restricted to shallow subsets of models and tasks, limiting the generalizability of their insights. We develop InstructEval, an ICL evaluation suite to conduct a thorough assessment of these techniques. The suite includes 13 open-sourced LLMs of varying scales from four model families, and covers nine tasks across three categories. Using the suite, we evaluate the relative performance of seven popular instruction selection methods over five metrics relevant to ICL. Our experiments reveal that using curated manually-written instructions or simple instructions without any task-specific descriptions often elicits better overall ICL performance than automatic instruction-induction methods, pointing to a lack of generalizability among the latter. We release our evaluation suite for benchmarking instruction selection approaches and enabling more generalizable methods in this space.
UR - http://www.scopus.com/inward/record.url?scp=85197892892&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85197892892&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85197892892
T3 - Findings of the Association for Computational Linguistics: NAACL 2024
SP - 4336
EP - 4350
BT - Findings of the Association for Computational Linguistics: NAACL 2024
A2 - Duh, Kevin
A2 - Gomez, Helena
A2 - Bethard, Steven
PB - Association for Computational Linguistics (ACL)
Y2 - 16 June 2024 through 21 June 2024
ER -