TY - GEN
T1 - Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
T2 - 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024
AU - Prabhakar, Akshara
AU - Griffiths, Thomas L.
AU - McCoy, R. Thomas
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
N2 - Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). However, debates persist about whether LLMs exhibit abstract generalization or rely on shallow heuristics when given CoT prompts. To understand the factors influencing CoT reasoning, we provide a detailed case study of the symbolic reasoning task of decoding shift ciphers (Andress, 2014), in which each letter is shifted forward some number of steps in the alphabet. We analyze the pattern of results produced by three LLMs (GPT-4, Claude 3, and Llama 3.1) performing this task with CoT prompting. By focusing on a single, relatively simple task, we are able to identify three factors that systematically affect CoT performance: the probability of the task's expected output (probability), what the model has implicitly learned during pre-training (memorization), and the number of intermediate operations involved in reasoning (noisy reasoning). We show that these factors can drastically influence task accuracy across all three LLMs; e.g., when tested with GPT-4, varying the output's probability of occurrence shifts accuracy from 26% to 70%. Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning.
AB - Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). However, debates persist about whether LLMs exhibit abstract generalization or rely on shallow heuristics when given CoT prompts. To understand the factors influencing CoT reasoning, we provide a detailed case study of the symbolic reasoning task of decoding shift ciphers (Andress, 2014), in which each letter is shifted forward some number of steps in the alphabet. We analyze the pattern of results produced by three LLMs (GPT-4, Claude 3, and Llama 3.1) performing this task with CoT prompting. By focusing on a single, relatively simple task, we are able to identify three factors that systematically affect CoT performance: the probability of the task's expected output (probability), what the model has implicitly learned during pre-training (memorization), and the number of intermediate operations involved in reasoning (noisy reasoning). We show that these factors can drastically influence task accuracy across all three LLMs; e.g., when tested with GPT-4, varying the output's probability of occurrence shifts accuracy from 26% to 70%. Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning.
UR - http://www.scopus.com/inward/record.url?scp=85217621960&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85217621960&partnerID=8YFLogxK
U2 - 10.18653/v1/2024.findings-emnlp.212
DO - 10.18653/v1/2024.findings-emnlp.212
M3 - Conference contribution
AN - SCOPUS:85217621960
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
SP - 3710
EP - 3724
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
Y2 - 12 November 2024 through 16 November 2024
ER -