
Chain of Thoughtlessness: An Analysis of CoT in Planning

(2405.04776)
Published May 8, 2024 in cs.AI

Abstract

Large language model (LLM) performance on reasoning problems typically does not generalize out of distribution. Previous work has claimed that this can be mitigated by modifying prompts to include examples with chains of thought (demonstrations of solution procedures), with the intuition that it is possible to in-context teach an LLM an algorithm for solving the problem. This paper presents a case study of chain of thought on problems from Blocksworld, a classical planning domain, and examines the performance of two state-of-the-art LLMs across two axes: generality of examples given in prompt, and complexity of problems queried with each prompt. While our problems are very simple, we only find meaningful performance improvements from chain of thought prompts when those prompts are exceedingly specific to their problem class, and those improvements quickly deteriorate as the size n of the query-specified stack grows past the size of stacks shown in the examples. Our results hint that, contrary to previous claims in the literature, CoT's performance improvements do not stem from the model learning general algorithmic procedures via demonstrations, but instead depend on carefully engineering highly problem-specific prompts. This spotlights drawbacks of chain of thought, especially because of the sharp tradeoff between possible performance gains and the amount of human labor necessary to generate examples with correct reasoning traces.

Figure: target distributions for the expected generality of progression proof prompts across PDDL and Blocksworld domains.

Overview

  • The paper examines the effectiveness of Chain of Thought (CoT) prompting, which aims to guide LLMs through complex reasoning tasks without retraining, using experiments in the Blocksworld planning domain.

  • Five CoT setups were examined: Zero-shot CoT, Progression Proof CoT, the Blocksworld Universal Algorithm, the Stacking Prompt, and Lexicographic Stacking. Results were mixed, with performance tied closely to prompt specificity.

  • The findings indicate that while CoT improves task-specific performance in some cases, its scalability and practical utility are limited, suggesting future research should aim at balancing prompt generality with performance.

Exploring the Limits of Chain of Thought Prompting in Blocksworld with LLMs

Introduction to Chain of Thought Prompting

The idea behind Chain of Thought (CoT) prompting in LLMs is captivating for both practitioners and researchers in AI. By inserting intermediate reasoning steps into prompts, the so-called chains of thought, the goal is to guide LLMs to perform better on complex reasoning tasks without retraining. The ambition to 'teach' LLMs to solve problems through worked examples is inspired by human problem-solving. Yet how this plays out in practice, especially in a planning domain like Blocksworld, raises a series of challenges and revelations.

The Setup: Experiments in Blocksworld

In a nutshell, Blocksworld entails rearranging blocks to reach a target configuration. It is a classic AI planning domain with simple, well-defined dynamics, which makes it a clean testbed for checking whether an LLM can apply demonstrated reasoning steps to unseen but similar tasks. Five types of CoT setups were examined (a minimal prompt sketch follows the list):

  1. Zero-shot CoT: The most general approach where the model is merely prompted to "think step by step."
  2. Progression Proof CoT: Introduces planning knowledge tied to the PDDL specification used in Blocksworld while trying to stay somewhat generic.
  3. Blocksworld Universal Algorithm: Offers a specific algorithmic approach tailored for any Blocksworld problem.
  4. Stacking Prompt: Directly focuses on problems where blocks begin on the table and must be assembled into a single stack.
  5. Lexicographic Stacking: Targets only a subset of stacking problems where blocks must be stacked in a specific sequence.
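
To make the contrast between these setups concrete, here is a minimal sketch of how a table-to-stack query and two of the prompt variants above might be assembled. The prompt wording, helper names, and the worked example are illustrative assumptions for exposition, not the authors' actual prompts.

```python
# Minimal sketch (not the paper's exact prompts): build a table-to-stack
# Blocksworld query plus two of the prompt variants discussed above.
# All strings and helper names are illustrative assumptions.

def blocksworld_query(blocks, goal_order):
    """Describe an instance where all blocks start on the table and must be
    stacked in goal_order (bottom to top)."""
    init = ", ".join(f"{b} is on the table" for b in blocks)
    goal = ", ".join(
        f"{top} is on {bottom}"
        for bottom, top in zip(goal_order, goal_order[1:])
    )
    return f"Initial state: {init}. Goal: {goal}. What is the plan?"

def zero_shot_cot(query):
    # The most general variant: no examples, just the standard CoT trigger.
    return query + "\nLet's think step by step."

def stacking_prompt(query, example_blocks=("red", "blue", "green")):
    # A problem-class-specific variant: one worked example with an explicit
    # reasoning trace for a smaller table-to-stack instance.
    example = blocksworld_query(list(example_blocks), list(example_blocks))
    trace = (
        "All blocks are clear and on the table, so stack from the bottom up: "
        + "; ".join(
            f"pick up {top} and stack it on {bottom}"
            for bottom, top in zip(example_blocks, example_blocks[1:])
        )
        + "."
    )
    return f"{example}\n{trace}\n\n{query}\nReasoning:"

query = blocksworld_query(["b1", "b2", "b3", "b4"], ["b1", "b2", "b3", "b4"])
print(zero_shot_cot(query))
print(stacking_prompt(query))
```

A Lexicographic Stacking variant would presumably narrow things further, constraining both the example and the query so the goal order follows a fixed (e.g., alphabetical) sequence, which is exactly the kind of narrowing the paper argues is needed before CoT gains appear.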

Observations and Results

The results paint a sobering picture of CoT's effectiveness. When the prompts were exceedingly specific to the problem class, performance improvements were noticeable. However, these gains diminished rapidly as the problems deviated even slightly from the examples provided. Here's a brief rundown (a sketch of how such plans can be checked follows the list):

  • Zero-shot and Progression Proof CoT showed minimal improvements, proving insufficient for even moderately complex planning tasks.
  • Blocksworld Universal Algorithm prompted better responses from the LLMs, yet struggled significantly as the complexity of Blocksworld scenarios increased.
  • Stacking and Lexicographic Prompts yielded high performance on narrowly defined tasks but failed to generalize across slightly broader problem sets despite still being within the stack-assembly category.
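
One reason these failure modes are easy to measure is that stack-assembly plans are cheap to verify mechanically. The sketch below simulates pick-up and stack actions from a table-start state and checks whether a candidate plan reaches the goal stack; it is an illustrative assumption about how such plans could be scored, not the paper's actual evaluation harness.

```python
# Illustrative plan-validity check for table-to-stack Blocksworld instances.
# This is a sketch for exposition, not the authors' evaluation code.

def check_stacking_plan(blocks, goal_order, plan):
    """Simulate plan, a list of ("pickup", b) / ("stack", b, target) steps,
    from a state where every block starts clear on the table, and report
    whether the final state is the goal stack (bottom to top)."""
    on = {b: "table" for b in blocks}   # what each block rests on
    clear = set(blocks)                 # blocks with nothing on top
    holding = None
    for step in plan:
        if step[0] == "pickup":
            b = step[1]
            if holding is not None or b not in clear or on[b] != "table":
                return False
            holding, clear = b, clear - {b}
        elif step[0] == "stack":
            b, target = step[1], step[2]
            if holding != b or target not in clear:
                return False
            on[b] = target
            clear = (clear - {target}) | {b}
            holding = None
        else:
            return False
    if holding is not None:
        return False
    return on[goal_order[0]] == "table" and all(
        on[top] == bottom
        for bottom, top in zip(goal_order, goal_order[1:])
    )

# A correct 4-block stacking plan passes the check.
blocks = ["b1", "b2", "b3", "b4"]
plan = [("pickup", "b2"), ("stack", "b2", "b1"),
        ("pickup", "b3"), ("stack", "b3", "b2"),
        ("pickup", "b4"), ("stack", "b4", "b3")]
print(check_stacking_plan(blocks, blocks, plan))  # True
```

Because validity can be checked this way at any stack size n, it is straightforward to observe the drop-off the paper reports once n exceeds the sizes shown in the prompt examples.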

Implications and Speculations on Future Developments

These findings suggest that while CoT can nudge LLMs toward better task-specific performance, the scope of such enhancements is heavily tethered to the prompt’s specificity relative to the problem. This poses significant implications:

  • Scalability: Broad application of CoT is limited. As the problem's complexity grows, so does the need for incredibly precise prompts, making this strategy less scalable across diverse tasks without substantial human input.
  • Practical Utility: The reliance on detailed, problem-specific prompts diminishes the utility of CoT for general problem-solving using LLMs. For real-world applications, generating such detailed prompts could become an overhead that outweighs the benefits.
  • Future Research: Continual improvement in CoT methodologies might focus on finding a balance between prompt generality and performance, potentially through more sophisticated techniques of teaching LLMs to better abstract and generalize from examples.

Conclusion

Investigating CoT performance across setups ranging from highly specific to fully general in Blocksworld yields a clear verdict: performance is closely tied to how specifically the CoT prompt is aligned with the problem class. LLMs, in their current form, excel on tasks that closely mirror the provided examples but show diminished aptitude when required to generalize strategies to broader scenarios. The dream of leveraging LLMs for general reasoning through CoT remains, for now, a meticulously crafted prompt away. Further explorations could illuminate pathways to enhance the flexibility and learning capacity of these models.
