
Chain of Thoughtlessness: An Analysis of CoT in Planning

(2405.04776)
Published May 8, 2024 in cs.AI

Abstract

Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated by modifying prompts to include examples with chains of thought (demonstrations of solution procedures), with the intuition that it is possible to in-context teach an LLM an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in prompt, and complexity of problems queried with each prompt. While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and those improvements quickly deteriorate as the size n of the query-specified stack grows past the size of stacks shown in the examples. Our results hint that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations, but instead depend on carefully engineering highly problem-specific prompts. This spotlights drawbacks of chain of thought, especially because of the sharp tradeoff between possible performance gains and the amount of human labor necessary to generate examples with correct reasoning traces.

Figure: target distributions for the expected generality of progression proof prompts across PDDL and Blocksworld domains.

Overview

  • The paper examines the effectiveness of Chain of Thought (CoT) prompting, which aims to guide LLMs through complex reasoning tasks without retraining, using experiments in the Blocksworld planning domain.

  • Five CoT setups were examined: Zero-shot CoT, Progression Proof CoT, the Blocksworld Universal Algorithm, the Stacking Prompt, and Lexicographic Stacking. Results were mixed, with performance tied closely to prompt specificity.

  • The findings indicate that while CoT improves task-specific performance in some cases, its scalability and practical utility are limited, suggesting future research should aim at balancing prompt generality with performance.

Exploring the Limits of Chain of Thought Prompting in Blocksworld with LLMs

Introduction to Chain of Thought Prompting

The idea behind Chain of Thought (CoT) prompting in LLMs is captivating for both practitioners and researchers in AI. By inserting intermediate reasoning steps into prompts, the so-called chains of thought, the goal is to guide LLMs to perform better on complex reasoning tasks without retraining. The ambition to 'teach' LLMs to solve problems through worked examples is inspired by human problem-solving. Yet how this plays out in practice, especially in a planning domain like Blocksworld, raises a series of challenges and revelations.

The Setup: Experiments in Blocksworld

In a nutshell, Blocksworld entails rearranging blocks to reach a target configuration. It is a classic AI planning domain with simple, well-defined dynamics, which makes it a clean testbed for checking whether an LLM can apply demonstrated reasoning steps to unseen but similar tasks. Five types of CoT setups were examined (a minimal prompt sketch follows the list):

  1. Zero-shot CoT: The most general approach where the model is merely prompted to "think step by step."
  2. Progression Proof CoT: Introduces planning knowledge tied to the PDDL specification used in Blocksworld while trying to stay somewhat generic.
  3. Blocksworld Universal Algorithm: Offers a specific algorithmic approach tailored for any Blocksworld problem.
  4. Stacking Prompt: Directly focuses on problems where blocks begin on the table and must be assembled into a single stack.
  5. Lexicographic Stacking: Targets only a subset of stacking problems where blocks must be stacked in a specific sequence.
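
To make the contrast between these setups concrete, here is a minimal sketch of how a table-to-stack query and two of the prompt variants above might be assembled. The prompt wording, helper names, and the worked example are illustrative assumptions for exposition, not the authors' actual prompts.

```python
# Minimal sketch (not the paper's exact prompts): build a table-to-stack
# Blocksworld query plus two of the prompt variants discussed above.
# All strings and helper names are illustrative assumptions.

def blocksworld_query(blocks, goal_order):
    """Describe an instance where all blocks start on the table and must be
    stacked in goal_order (bottom to top)."""
    init = ", ".join(f"{b} is on the table" for b in blocks)
    goal = ", ".join(
        f"{top} is on {bottom}"
        for bottom, top in zip(goal_order, goal_order[1:])
    )
    return f"Initial state: {init}. Goal: {goal}. What is the plan?"

def zero_shot_cot(query):
    # The most general variant: no examples, just the standard CoT trigger.
    return query + "\nLet's think step by step."

def stacking_prompt(query, example_blocks=("red", "blue", "green")):
    # A problem-class-specific variant: one worked example with an explicit
    # reasoning trace for a smaller table-to-stack instance.
    example = blocksworld_query(list(example_blocks), list(example_blocks))
    trace = (
        "All blocks are clear and on the table, so stack from the bottom up: "
        + "; ".join(
            f"pick up {top} and stack it on {bottom}"
            for bottom, top in zip(example_blocks, example_blocks[1:])
        )
        + "."
    )
    return f"{example}\n{trace}\n\n{query}\nReasoning:"

query = blocksworld_query(["b1", "b2", "b3", "b4"], ["b1", "b2", "b3", "b4"])
print(zero_shot_cot(query))
print(stacking_prompt(query))
```

A Lexicographic Stacking variant would presumably narrow things further, constraining both the example and the query so the goal order follows a fixed (e.g., alphabetical) sequence, which is exactly the kind of narrowing the paper argues is needed before CoT gains appear.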

Observations and Results

The results paint a sobering picture of CoT's effectiveness. When the prompts were exceedingly specific to the problem class, performance improvements were noticeable. However, these gains diminished rapidly as the problems deviated even slightly from the examples provided. Here's a brief rundown (a sketch of how such plans can be checked follows the list):

  • Zero-shot and Progression Proof CoT showed minimal improvements, proving insufficient for even moderately complex planning tasks.
  • Blocksworld Universal Algorithm prompted better responses from the LLMs, yet struggled significantly as the complexity of Blocksworld scenarios increased.
  • Stacking and Lexicographic Prompts yielded high performance on narrowly defined tasks but failed to generalize across slightly broader problem sets despite still being within the stack-assembly category.
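
One reason these failure modes are easy to measure is that stack-assembly plans are cheap to verify mechanically. The sketch below simulates pick-up and stack actions from a table-start state and checks whether a candidate plan reaches the goal stack; it is an illustrative assumption about how such plans could be scored, not the paper's actual evaluation harness.

```python
# Illustrative plan-validity check for table-to-stack Blocksworld instances.
# This is a sketch for exposition, not the authors' evaluation code.

def check_stacking_plan(blocks, goal_order, plan):
    """Simulate plan, a list of ("pickup", b) / ("stack", b, target) steps,
    from a state where every block starts clear on the table, and report
    whether the final state is the goal stack (bottom to top)."""
    on = {b: "table" for b in blocks}   # what each block rests on
    clear = set(blocks)                 # blocks with nothing on top
    holding = None
    for step in plan:
        if step[0] == "pickup":
            b = step[1]
            if holding is not None or b not in clear or on[b] != "table":
                return False
            holding, clear = b, clear - {b}
        elif step[0] == "stack":
            b, target = step[1], step[2]
            if holding != b or target not in clear:
                return False
            on[b] = target
            clear = (clear - {target}) | {b}
            holding = None
        else:
            return False
    if holding is not None:
        return False
    return on[goal_order[0]] == "table" and all(
        on[top] == bottom
        for bottom, top in zip(goal_order, goal_order[1:])
    )

# A correct 4-block stacking plan passes the check.
blocks = ["b1", "b2", "b3", "b4"]
plan = [("pickup", "b2"), ("stack", "b2", "b1"),
        ("pickup", "b3"), ("stack", "b3", "b2"),
        ("pickup", "b4"), ("stack", "b4", "b3")]
print(check_stacking_plan(blocks, blocks, plan))  # True
```

Because validity can be checked this way at any stack size n, it is straightforward to observe the drop-off the paper reports once n exceeds the sizes shown in the prompt examples.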

Implications and Speculations on Future Developments

These findings suggest that while CoT can nudge LLMs toward better task-specific performance, the scope of such enhancements is heavily tethered to the prompt’s specificity relative to the problem. This poses significant implications:

  • Scalability: Broad application of CoT is limited. As the problem's complexity grows, so does the need for incredibly precise prompts, making this strategy less scalable across diverse tasks without substantial human input.
  • Practical Utility: The reliance on detailed, problem-specific prompts diminishes the utility of CoT for general problem-solving using LLMs. For real-world applications, generating such detailed prompts could become an overhead that outweighs the benefits.
  • Future Research: Continual improvement in CoT methodologies might focus on finding a balance between prompt generality and performance, potentially through more sophisticated techniques of teaching LLMs to better abstract and generalize from examples.

Conclusion

Investigating CoT performance across setups ranging from highly specific to fully general in Blocksworld yields a clear verdict: performance is closely tied to how specifically the CoT prompt is aligned with the problem class. LLMs, in their current form, excel on tasks that closely mirror the provided examples but show diminished aptitude when required to generalize strategies to broader scenarios. The dream of leveraging LLMs for general reasoning through CoT remains, for now, a meticulously crafted prompt away. Further explorations could illuminate pathways to enhance the flexibility and learning capacity of these models.
