Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (2201.11903v6)

Published 28 Jan 2022 in cs.CL and cs.AI

Abstract: We explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of LLMs to perform complex reasoning. In particular, we show how such reasoning abilities emerge naturally in sufficiently large LLMs via a simple method called chain of thought prompting, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three LLMs show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks. The empirical gains can be striking. For instance, prompting a 540B-parameter LLM with just eight chain of thought exemplars achieves state of the art accuracy on the GSM8K benchmark of math word problems, surpassing even finetuned GPT-3 with a verifier.

Citations (6,541)

Summary

  • The paper introduces chain-of-thought prompting that guides LLMs through intermediate reasoning steps to improve multi-step solving capabilities.
  • It demonstrates substantial performance gains on arithmetic, commonsense, and symbolic reasoning tasks with models over 100B parameters.
  • The approach requires no fine-tuning, offering enhanced interpretability and practical insights into the model’s reasoning process.

This paper, "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (2201.11903), introduces a simple prompting technique called chain-of-thought (CoT) prompting that significantly enhances the reasoning abilities of LLMs. The core idea is to include a sequence of intermediate reasoning steps—a "chain of thought"—in the few-shot exemplars provided in the prompt, guiding the model to generate similar intermediate steps before producing the final answer.

Core Idea and Motivation

Standard few-shot prompting, where the model is given input-output pairs, has been successful for many tasks but often falls short on those requiring multi-step reasoning, like arithmetic word problems or complex commonsense questions. Prior work addressed this by training or finetuning models to generate intermediate steps or rationales, but creating large datasets of high-quality rationales is costly. Chain-of-thought prompting (2201.11903) combines the benefits of generating intermediate steps with the advantages of few-shot prompting. Instead of just input -> output examples, CoT prompting uses input -> chain of thought -> output examples. This approach requires no model finetuning, allowing a single LLM to perform various reasoning tasks using only few-shot prompting.
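
To make the contrast concrete, here is a minimal Python sketch of the two exemplar formats. The question and chain of thought are the tennis-ball example from the paper's Figure 1; the helper functions are illustrative and not taken from any released code.

```python
# Minimal sketch of standard vs chain-of-thought few-shot exemplars.

QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)
CHAIN_OF_THOUGHT = (
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
    "5 + 6 = 11."
)
ANSWER = "The answer is 11."


def standard_exemplar(question: str, answer: str) -> str:
    """Standard few-shot prompting: input -> output."""
    return f"Q: {question}\nA: {answer}"


def cot_exemplar(question: str, chain: str, answer: str) -> str:
    """Chain-of-thought prompting: input -> chain of thought -> output."""
    return f"Q: {question}\nA: {chain} {answer}"


def build_prompt(exemplars: list[str], test_question: str) -> str:
    """Concatenate the few-shot exemplars and append the unsolved test question."""
    return "\n\n".join(exemplars + [f"Q: {test_question}\nA:"])
```

At inference time the prompt is simply the concatenation of several such exemplars followed by the new question, and the model is decoded until it produces a final answer.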

Experimental Setup

The researchers evaluated CoT prompting on a diverse set of reasoning tasks:

  1. Arithmetic Reasoning: Math word problems from benchmarks like GSM8K, SVAMP, ASDiv, AQuA, and MAWPS.
  2. Commonsense Reasoning: Tasks including CSQA, StrategyQA, Date Understanding, Sports Understanding, and SayCan robot planning.
  3. Symbolic Reasoning: Toy tasks like last letter concatenation and coin flip, designed to test the model's ability to manipulate symbols and track state.
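
These two toy tasks can be generated programmatically; below is a hedged sketch (the paper's exact name lists and phrasing differ), and increasing the number of words or people produces the longer out-of-domain instances used later to probe length generalization.

```python
import random

# Hedged sketch of the two symbolic toy tasks: last letter concatenation and
# coin flip state tracking. Only the construction idea matches the paper.


def last_letter_concatenation(names: list[str]) -> tuple[str, str]:
    """E.g. ["Amy", "Brown"] -> question about last letters, answer "yn"."""
    phrase = " ".join(names)
    question = f'Take the last letters of the words in "{phrase}" and concatenate them.'
    answer = "".join(name[-1] for name in names)
    return question, answer


def coin_flip(num_people: int, rng: random.Random) -> tuple[str, str]:
    """A coin starts heads up; each person either flips it or not."""
    steps, flips = [], 0
    for i in range(num_people):
        if rng.random() < 0.5:
            steps.append(f"Person {i + 1} flips the coin.")
            flips += 1
        else:
            steps.append(f"Person {i + 1} does not flip the coin.")
    question = "A coin is heads up. " + " ".join(steps) + " Is the coin still heads up?"
    answer = "yes" if flips % 2 == 0 else "no"
    return question, answer
```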

Experiments were conducted using various LLMs, including LaMDA, GPT-3 (InstructGPT variants), PaLM, UL2, and Codex. For each task, a small number of few-shot exemplars (typically 8, manually composed) were used. The standard prompting baseline used the same exemplars but excluded the intermediate chain-of-thought steps. Greedy decoding was primarily used for generation. For arithmetic tasks, the authors also investigated the effect of using an external Python calculator to evaluate the mathematical expressions generated within the chain of thought, demonstrating that errors can stem from either reasoning logic or arithmetic computation itself.
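
This summary does not reproduce the calculator mechanism, so the following is only a sketch of the general idea: scan the generated chain of thought for arithmetic expressions and overwrite the model's stated results with recomputed values, so that remaining errors reflect flawed reasoning rather than flawed arithmetic.

```python
import re

# Hedged sketch of external-calculator post-processing: recompute the right-hand
# side of every "<expression> = <number>" found in the generated chain of thought.
# The regex and evaluation rules used in the paper may differ.

EQUATION = re.compile(
    r"(\d+(?:\.\d+)?(?:\s*[-+*/]\s*\d+(?:\.\d+)?)+)\s*=\s*(\d+(?:\.\d+)?)"
)


def apply_calculator(chain_of_thought: str) -> str:
    def recompute(match: re.Match) -> str:
        expression = match.group(1)
        value = eval(expression, {"__builtins__": {}})  # digits and + - * / only
        value_str = str(int(value)) if float(value).is_integer() else str(value)
        return f"{expression} = {value_str}"

    return EQUATION.sub(recompute, chain_of_thought)


# The model's arithmetic slip "5 + 6 = 12" is corrected to "5 + 6 = 11".
print(apply_calculator("Roger has 5 + 6 = 12 tennis balls."))
```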

Key Findings

The experiments revealed several significant findings:

  • Emergent Ability: Chain-of-thought reasoning was found to be an emergent ability of model scale (2201.11903). It did not consistently improve performance, and sometimes even hurt it, for models smaller than approximately 100 billion parameters. Only with sufficiently large models (e.g., GPT-3 175B, PaLM 540B) did CoT prompting consistently and significantly improve performance on reasoning tasks compared to standard prompting. Smaller models tended to produce fluent but often illogical or incoherent chains of thought.
  • Performance Gains: CoT prompting yielded substantial performance improvements across the tested benchmarks.
    • On GSM8K (math word problems), PaLM 540B with CoT achieved a solve rate of 56.9%, a significant jump from 17.9% with standard prompting, surpassing prior state-of-the-art results.
    • Similar large gains were observed on other math datasets like SVAMP and MAWPS, particularly on the more complex multi-step subsets.
    • For commonsense tasks like StrategyQA and Date Understanding, CoT prompting also improved performance, demonstrating its applicability beyond purely numerical problems.
    • In symbolic reasoning tasks (last letter concatenation, coin flip), CoT enabled impressive performance, often approaching 100% accuracy for in-domain examples on large models.
  • Generalization to Length: CoT prompting facilitated generalization to out-of-domain examples with more steps than seen in the few-shot prompt (e.g., longer names for concatenation, more flips for coin tracking), a capability largely absent in standard prompting.
  • Ablation Studies: Experiments compared CoT prompting against variants (rough prompt formats are sketched after this list):
    • Equation only: Prompting the model to output only a mathematical equation before the answer provided some benefit for simpler arithmetic tasks but was less effective than full CoT on complex problems like GSM8K, suggesting the natural language steps are crucial for semantic understanding and decomposition (2201.11903).
    • Variable compute only: Prompting the model to output a series of dots equivalent to the computation length showed little improvement, indicating that simply spending more tokens is not the key; the content of the intermediate steps matters (2201.11903).
    • Reasoning after answer: Placing the chain of thought after the final answer did not improve performance, suggesting that the sequential generation of reasoning steps leading to the answer is essential for deriving the solution (2201.11903).
  • Robustness: While exemplar-based prompting can be sensitive, CoT prompting showed robustness across different annotators who wrote the chains of thought, different sets of exemplars (including those from a separate dataset), and variations in the number and order of exemplars (2201.11903).
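
For concreteness, the three ablation prompt formats can be illustrated roughly as follows; the wording of the paper's actual ablation exemplars may differ.

```python
# Rough illustrations of the three ablation exemplar formats, reusing the
# tennis-ball question from earlier. Illustrative only, not the paper's prompts.

QUESTION = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?"
)

# Equation only: the exemplar gives just a mathematical equation before the answer.
equation_only = f"Q: {QUESTION}\nA: 5 + 2 * 3 = 11. The answer is 11."

# Variable compute only: dots stand in for the extra tokens a chain of thought
# would use, matching its length but carrying no content.
variable_compute_only = f"Q: {QUESTION}\nA: {'.' * 30} The answer is 11."

# Reasoning after answer: the chain of thought appears only after the final answer.
reasoning_after_answer = (
    f"Q: {QUESTION}\nA: The answer is 11. Roger started with 5 balls. "
    "2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11."
)
```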

Manual Analysis

A manual analysis of generated chains of thought for LaMDA 137B on GSM8K provided insight into why CoT works and where models still fail. For correct answers, the generated chains of thought were mostly logically and mathematically sound. For incorrect answers, errors were categorized:

  • Minor errors (calculator errors, symbol mapping errors, one step missing) accounted for a significant portion of mistakes (46%). Scaling models from 62B to 540B was observed to fix many of these types of errors, suggesting improved semantic understanding and logical flow with scale.
  • Major errors (semantic understanding errors, incoherent reasoning) constituted the remaining mistakes (54%).

This analysis suggests that improvements in foundational abilities like semantic understanding and the ability to maintain coherent, step-by-step logic contribute to the emergence of CoT reasoning at scale (2201.11903).

Practical Implications and Limitations

CoT prompting offers a powerful way to unlock the reasoning capabilities of existing LLMs without needing expensive task-specific finetuning datasets. It provides a degree of interpretability by showing the steps the model took.

However, the approach has limitations:

  • It is most effective only on very large models, which are costly to train and serve.
  • There is no guarantee that the generated chains of thought are factually correct or logically sound, even if they lead to a correct answer, particularly for non-arithmetic tasks. Ensuring the factuality and coherence of generated reasoning remains an open challenge.
  • While few-shot annotation cost is minimal, creating extensive CoT data for potential finetuning applications would be expensive, although synthetic data generation could be explored.
  • Chain of thought may not be beneficial for all tasks, particularly simple ones where standard prompting already performs well or tasks that don't naturally decompose into sequential steps.

The paper concludes that CoT prompting demonstrates that standard prompting may only show a lower bound of LLMs' capabilities and highlights the potential for further exploration of language-based reasoning methods (2201.11903).
