
On the Planning Abilities of Large Language Models: A Critical Investigation (2305.15771v2)

Published 25 May 2023 in cs.AI

Abstract: Intrigued by the claims of emergent reasoning capabilities in LLMs trained on general web corpora, in this paper, we set out to investigate their planning capabilities. We aim to evaluate (1) the effectiveness of LLMs in generating plans autonomously in commonsense planning tasks and (2) the potential of LLMs in LLM-Modulo settings where they act as a source of heuristic guidance for external planners and verifiers. We conduct a systematic study by generating a suite of instances on domains similar to the ones employed in the International Planning Competition and evaluate LLMs in two distinct modes: autonomous and heuristic. Our findings reveal that LLMs' ability to generate executable plans autonomously is rather limited, with the best model (GPT-4) having an average success rate of ~12% across the domains. However, the results in the LLM-Modulo setting show more promise. In the LLM-Modulo setting, we demonstrate that LLM-generated plans can improve the search process for underlying sound planners and additionally show that external verifiers can help provide feedback on the generated plans and back-prompt the LLM for better plan generation.

Citations (170)

Summary

  • The paper critically examines LLMs' autonomous planning and finds limited success: GPT-4 reaches only about 34% correctness on Blocksworld tasks and roughly 12% average success across domains.
  • The study employs various prompting strategies, including zero-shot, one-shot, and chain-of-thought, to reveal a reliance on contextual cues over abstract reasoning.
  • LLM-Modulo mode demonstrates heuristic value by guiding external planners and verifiers, while also highlighting challenges in bias mitigation and safety.

On the Planning Abilities of LLMs: A Critical Investigation

Introduction

LLMs, particularly those based on transformer architectures, have gained significant traction in NLP due to their impressive performance across a range of language processing tasks. However, the extent to which LLMs can plan, either by independently generating executable plans or by assisting humans and automated systems in planning tasks, remains unclear. This paper scrutinizes these capabilities through experiments across several domains, covering both autonomous plan generation and the LLM-Modulo mode, in which LLMs guide external planners or verifiers (Figure 1).

Figure 1: The diagrammatic overview of the two modes of LLMs for planning.

Autonomous Planning Abilities

Evaluation in Common Planning Domains

This section assesses LLMs' ability to autonomously generate plans in domains such as Blocksworld and Logistics. Despite the simplicity of these domains, where humans are generally proficient, LLMs achieve quite limited success: even the strongest model evaluated, GPT-4, reaches only about 34% correctness on Blocksworld instances (Figure 2).

Figure 2: Assessment of GPT-4 plans with relaxations in the Blocksworld domain.

The models were evaluated under several prompting configurations, including zero-shot, one-shot, and chain-of-thought. The results show that even state-tracking prompts do not substantially improve performance. Furthermore, obfuscating action and object names in Blocksworld diminishes LLM effectiveness even further, revealing a reliance on contextual cues rather than abstract reasoning.
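To make concrete what an "executable, goal-achieving" plan means in this evaluation, the following is a minimal, hypothetical Blocksworld plan checker. It is purely illustrative: the paper validates plans against PDDL domain models using the VAL tool, not a hand-rolled simulator like this one.

```python
# Hypothetical minimal Blocksworld plan checker (illustrative only; the paper
# uses PDDL domain files and the VAL validator rather than this simulator).

from dataclasses import dataclass, field

@dataclass
class State:
    on: dict = field(default_factory=dict)      # block -> block it sits on
    clear: set = field(default_factory=set)     # blocks with nothing on top
    on_table: set = field(default_factory=set)  # blocks resting on the table
    holding: str | None = None                  # block currently held, if any

def apply(state: State, action: tuple) -> bool:
    """Apply one action in place; return False if its preconditions fail."""
    name, *args = action
    if name == "pick-up":                       # pick a clear block off the table
        (b,) = args
        if state.holding or b not in state.clear or b not in state.on_table:
            return False
        state.on_table.discard(b); state.clear.discard(b); state.holding = b
    elif name == "put-down":                    # put the held block on the table
        (b,) = args
        if state.holding != b:
            return False
        state.holding = None; state.on_table.add(b); state.clear.add(b)
    elif name == "unstack":                     # lift block b off block c
        b, c = args
        if state.holding or b not in state.clear or state.on.get(b) != c:
            return False
        del state.on[b]; state.clear.discard(b); state.clear.add(c); state.holding = b
    elif name == "stack":                       # place held block b onto clear block c
        b, c = args
        if state.holding != b or c not in state.clear:
            return False
        state.holding = None; state.on[b] = c
        state.clear.discard(c); state.clear.add(b)
    else:
        return False
    return True

def plan_achieves_goal(state: State, plan: list, goal_on: dict) -> bool:
    """A plan is valid only if every action executes and the goal then holds."""
    for act in plan:
        if not apply(state, act):
            return False
    return all(state.on.get(b) == c for b, c in goal_on.items())

# Example: goal is A on B; this two-step plan should be judged valid.
init = State(on={}, clear={"A", "B"}, on_table={"A", "B"})
plan = [("pick-up", "A"), ("stack", "A", "B")]
print(plan_achieves_goal(init, plan, {"A": "B"}))  # True
```

Under this notion of validity, a plan that merely "sounds plausible" but violates a precondition (for example, picking up a block that is not clear) is counted as a failure, which is the standard the autonomous-mode results are measured against.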

LLM-Modulo Mode

Heuristic Value of LLM-Generated Plans

In heuristic settings, LLM-generated plans can serve as guidance for sound planners such as LPG. The experiments show that when LLM-generated plans are provided as initial seeds to LPG, they reduce the number of search steps required to produce valid plans, indicating their utility as heuristic suggestions (Figure 3).

Figure 3: Comparison across Blocksworld instances, noting successful plans for prompts that include one example.
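The sketch below illustrates how such seeding might be wired up. It is assumption-laden: the authors used a modified version of LPG, and the command-line options shown here (in particular the seed-plan flag and the output pattern parsed for search effort) are hypothetical, not the actual interface.

```python
# Hypothetical sketch of seeding LPG with an LLM-generated plan and comparing
# search effort. The flags and output format below are assumptions.

import re
import subprocess
from pathlib import Path

def run_lpg(domain: str, problem: str, seed_plan: str | None = None) -> str:
    """Run LPG on a PDDL instance, optionally starting from an LLM-provided plan."""
    cmd = ["lpg-td", "-o", domain, "-f", problem, "-n", "1"]
    if seed_plan is not None:
        # Assumed option name; the authors' modified LPG initializes its
        # action graph from the LLM plan rather than from an empty graph.
        cmd += ["-input_plan", seed_plan]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

def search_steps(lpg_output: str) -> int | None:
    """Extract a search-effort figure from LPG's textual output (pattern assumed)."""
    m = re.search(r"search steps:\s*(\d+)", lpg_output, re.IGNORECASE)
    return int(m.group(1)) if m else None

# Compare LPG from scratch against LPG seeded with the LLM plan on one instance.
Path("llm_plan.txt").write_text("(pick-up a)\n(stack a b)\n")
baseline = search_steps(run_lpg("bw_domain.pddl", "bw_instance01.pddl"))
seeded = search_steps(run_lpg("bw_domain.pddl", "bw_instance01.pddl", "llm_plan.txt"))
print(f"search steps: scratch={baseline}, seeded={seeded}")
```

The design point is that the LLM plan does not need to be correct to be useful here; a sound planner repairs it, and the comparison of search steps measures how much the seed helps.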

Interaction with External Verifiers

Another promising approach pairs LLMs with external verifiers such as VAL. Feedback from the verifier about flaws in an LLM-generated plan can be fed back to the LLM as a new prompt, encouraging it to refine its proposal. This back-prompting improves plan correctness in the easier domains but is less effective in obfuscated domains where action and object names are disguised.
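A minimal sketch of this back-prompting loop is shown below, assuming a VAL-style `Validate` binary and a placeholder `query_llm` helper; the paper's exact prompts, feedback format, and stopping criteria may differ.

```python
# Illustrative back-prompting loop with a VAL-style validator. The binary name,
# its output phrasing, and the query_llm helper are assumptions.

import subprocess

def validate_plan(domain: str, problem: str, plan_file: str) -> tuple[bool, str]:
    """Run the validator; return (is_valid, diagnostic_text)."""
    result = subprocess.run(
        ["Validate", "-v", domain, problem, plan_file],
        capture_output=True, text=True,
    )
    ok = "Plan valid" in result.stdout  # assumed success marker
    return ok, result.stdout

def query_llm(prompt: str) -> str:
    """Placeholder for an LLM call (e.g., GPT-4); implementation not shown."""
    raise NotImplementedError

def backprompt_loop(domain: str, problem: str, task_prompt: str, max_rounds: int = 5):
    prompt = task_prompt
    for _ in range(max_rounds):
        plan_text = query_llm(prompt)
        with open("candidate_plan.txt", "w") as f:
            f.write(plan_text)
        ok, feedback = validate_plan(domain, problem, "candidate_plan.txt")
        if ok:
            return plan_text
        # Feed the validator's diagnosis back so the LLM can repair its plan.
        prompt = task_prompt + "\nYour previous plan was invalid:\n" + feedback
    return None  # no valid plan found within the round budget
```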

Human-in-the-loop Assistance

The paper also explores using LLMs as assistants alongside human planners. The user study indicates no significant increase in plan success rates or in subjective ease of the task when participants receive LLM suggestions, underscoring how difficult it is to make human-LLM collaboration meaningful without introducing bias or error risks (Figure 4).

Figure 4: Interface at the plan writing phase without LLM assistance.

Broader Implications

Deploying LLMs for planning risks perpetuating biases present in the training data and poses challenges for maintaining factual correctness, especially in safety-critical domains. Mitigation strategies should include using verified domain models and employing automated or human-led verification steps to ensure plan viability and safety.

Conclusion

The planning capabilities of LLMs, when operating autonomously, remain subpar, particularly for generating executable, goal-achieving plans. Although LLMs show promise when paired with verifiers or sound planners, their autonomous utility is limited. Future work must address improving execution-level reasoning within LLM architectures, alongside robust bias mitigation and safety assurance, when employing LLMs in planning-centric applications.
