
When do you need Chain-of-Thought Prompting for ChatGPT? (2304.03262v2)

Published 6 Apr 2023 in cs.AI

Abstract: Chain-of-Thought (CoT) prompting can effectively elicit complex multi-step reasoning from large language models (LLMs). For example, simply adding the CoT instruction "Let's think step-by-step" to each input query of the MultiArith dataset improves GPT-3's accuracy from 17.7% to 78.7%. However, it is not clear whether CoT is still effective on more recent instruction-finetuned (IFT) LLMs such as ChatGPT. Surprisingly, on ChatGPT, CoT is no longer effective for certain tasks such as arithmetic reasoning, while remaining effective on other reasoning tasks. Moreover, on the former tasks, ChatGPT usually achieves the best performance and can generate CoT even without being instructed to do so. Hence, it is plausible that ChatGPT has already been trained on these tasks with CoT and has thus memorized the instruction, implicitly following it when applied to the same queries even without an explicit CoT prompt. Our analysis reflects a potential risk of overfitting/bias toward instructions introduced in IFT, which is becoming more common in training LLMs. In addition, it indicates possible leakage of the pretraining recipe, e.g., one can verify whether a dataset and instruction were used in training ChatGPT. Our experiments report new baseline results for ChatGPT on a variety of reasoning tasks and shed novel insight into LLM profiling, instruction memorization, and pretraining dataset leakage.

Citations (38)

Summary

  • The paper demonstrates that ChatGPT performs zero-shot reasoning on arithmetic tasks without explicit chain-of-thought prompts.
  • It reveals task-dependent behavior: chain-of-thought prompting still improves non-arithmetic reasoning, much as it does for GPT-3.
  • The findings raise concerns about pretraining recipe leakage and highlight evolving instruction-following capabilities in large language models.

Understanding ChatGPT's Reasoning Ability

Introduction to Chain-of-Thought Prompting

Chain-of-Thought (CoT) prompting has emerged as a technique to elicit complex, multi-step reasoning from LLMs like GPT-3. By instructing these models to "think step-by-step," researchers have seen significant improvements in task performance. But does this prompting strategy hold its ground with more recent Instruction Finetuned (IFT) LLMs such as ChatGPT?
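
To make the setup concrete, here is a minimal sketch of how the zero-shot CoT trigger is attached to a query; the Q/A prompt template and the sample question are illustrative assumptions, not the paper's exact format.

```python
# Minimal sketch of zero-shot CoT prompting: append the trigger phrase
# "Let's think step-by-step" to an otherwise plain query. The Q/A
# template and the sample question are illustrative assumptions.
COT_TRIGGER = "Let's think step-by-step."

def build_prompt(question: str, use_cot: bool = True) -> str:
    """Return a zero-shot prompt, optionally with the CoT trigger appended."""
    prompt = f"Q: {question}\nA:"
    return f"{prompt} {COT_TRIGGER}" if use_cot else prompt

if __name__ == "__main__":
    q = "There are 3 cars and each car has 4 wheels. How many wheels in total?"
    print(build_prompt(q, use_cot=True))   # CoT-prompted query
    print(build_prompt(q, use_cot=False))  # plain zero-shot query
```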

ChatGPT's Performance Without Explicit CoT

The University of Maryland paper conducts experiments with ChatGPT, focusing on its capabilities in zero-shot reasoning, that is, a model's ability to reach correct answers without task-specific examples or finetuning. The findings suggest that on certain tasks, such as arithmetic reasoning, ChatGPT already generates step-by-step reasoning without an explicit CoT prompt. Interestingly, compared to its predecessor GPT-3, ChatGPT sometimes performs best when no CoT instruction is given, pointing toward an ingrained familiarity with such tasks.
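
A hypothetical A/B harness along these lines is sketched below. It assumes the OpenAI Python SDK (openai>=1.0) with an API key in the environment; the model name and question are stand-ins, not the paper's exact setup.

```python
# Hypothetical A/B harness: query the same model with and without the
# CoT trigger, then compare the two generations. Assumes the OpenAI
# Python SDK (openai>=1.0) with OPENAI_API_KEY set; the model name is
# a stand-in for "ChatGPT", not the exact checkpoint from the paper.
from openai import OpenAI

client = OpenAI()

def ask(question: str, use_cot: bool) -> str:
    prompt = f"Q: {question}\nA:"
    if use_cot:
        prompt += " Let's think step-by-step."
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
        temperature=0,          # reduce sampling noise for the comparison
    )
    return resp.choices[0].message.content

question = "A pet store had 12 puppies and sold 7. How many are left?"
plain = ask(question, use_cot=False)
cot = ask(question, use_cot=True)
# If `plain` already walks through intermediate steps, the model is
# producing CoT-style reasoning without being told to.
print("WITHOUT CoT:\n", plain)
print("WITH CoT:\n", cot)
```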

ChatGPT and Different Reasoning Tasks

The experiments reveal task-dependent behavior. On non-arithmetic reasoning tasks, ChatGPT benefits from a CoT instruction much as GPT-3 does, reaching higher reasoning accuracy. For arithmetic and commonsense reasoning tasks, however, ChatGPT generally performs best without CoT prompts, even producing the step-by-step rationale on its own, a stark contrast to earlier models, for which the CoT instruction almost always improved performance.
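
Scoring such comparisons requires extracting a final answer from free-form generations. The sketch below uses a generic last-number heuristic and toy records; both are assumptions for illustration, not the paper's actual parser or data.

```python
# Sketch of per-condition accuracy scoring over collected generations.
# The last-number extraction heuristic and the toy `results` records
# are illustrative assumptions, not the paper's parser or data.
import re

def extract_final_number(text: str) -> str | None:
    """Take the last number appearing in the generation as its final answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def accuracy(records: list[dict]) -> float:
    hits = sum(extract_final_number(r["output"]) == r["gold"] for r in records)
    return hits / len(records)

# One record per (generation, gold answer), keyed by task and condition.
results = {
    "arithmetic/no_cot": [{"output": "12 - 7 = 5, so 5 puppies remain.", "gold": "5"}],
    "arithmetic/cot":    [{"output": "Step 1: 12 - 7 = 5. Answer: 5.",   "gold": "5"}],
}
for condition, records in results.items():
    print(f"{condition}: {accuracy(records):.2%}")
```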

Implications of Findings

These distinct behaviors hint that ChatGPT was trained during IFT on these datasets together with CoT instructions, leading it to internalize the CoT reasoning process for certain types of questions. This raises concerns about 'pretraining recipe leakage': one could deduce elements of a model's training recipe simply by observing its responses to well-chosen tasks, for instance, verifying whether a particular dataset and instruction pair was used. Furthermore, the model's varying response to CoT prompting across tasks poses new questions about how instruction-following capabilities generalize in LLMs after IFT.
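
One way to operationalize such a probe, under loudly stated assumptions (the 0.5 threshold and the crude line-count rationale detector below are invented for illustration), is to flag a task when no-CoT accuracy matches CoT accuracy while the no-CoT outputs already contain multi-step rationales:

```python
# Speculative probe for instruction memorization: if no-CoT accuracy
# matches or beats CoT accuracy AND the no-CoT outputs already contain
# multi-step rationales, treat that as weak evidence that the dataset
# and instruction appeared during IFT. The 0.5 threshold and the
# line-count rationale detector are invented here for illustration.
def looks_like_rationale(text: str, min_lines: int = 2) -> bool:
    """Crude proxy: multi-line answers are treated as step-by-step rationales."""
    return len([ln for ln in text.splitlines() if ln.strip()]) >= min_lines

def memorization_signal(acc_no_cot: float, acc_cot: float,
                        no_cot_outputs: list[str]) -> bool:
    spontaneous = sum(map(looks_like_rationale, no_cot_outputs)) / len(no_cot_outputs)
    return acc_no_cot >= acc_cot and spontaneous > 0.5

print(memorization_signal(0.80, 0.78, ["12 - 7 = 5.\nSo the answer is 5."]))
```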

Final Thoughts

The University of Maryland's examination of ChatGPT's reasoning skills suggests a nuanced, training-ingrained grasp of certain tasks, highlighting the model's advanced capabilities while also raising questions about instruction dependency and training-data leakage. As the field continues to push the boundaries of LLMs, this paper underscores the need to continually reassess prompting strategies, both to leverage the full potential of these models and to avoid unintended side effects of their training methods.
