
Abstract

LLMs have unveiled remarkable reasoning capabilities by exploiting chain-of-thought (CoT) prompting, which generates intermediate reasoning chains to serve as the rationale for deriving the answer. However, current CoT methods either simply employ general prompts such as "Let's think step by step", or heavily rely on pre-defined task-specific demonstrations to attain preferable performance, thereby engendering an inescapable gap between performance and generalization. To bridge this gap, we propose GeM-CoT, a Generalizable CoT prompting mechanism in Mixed-task scenarios where the type of input questions is unknown. GeM-CoT first categorizes the question type and subsequently samples or constructs demonstrations from the corresponding data pool in an automatic pattern. With this technical design, GeM-CoT simultaneously enjoys superior generalization capabilities and remarkable performance on 10 public reasoning tasks and 23 BBH tasks.

Figure: Flow chart of the GeM-CoT mechanism.

Overview

  • The paper introduces GeM-CoT, a mechanism designed to improve chain-of-thought (CoT) prompting in mixed-task scenarios using LLMs, by generalizing CoT prompts across diverse and unstructured question types.

  • GeM-CoT employs a dynamic process involving type matching, demonstration acquisition, answer derivation, and data cache updating to maintain high performance and adaptability across various tasks.

  • Experimental results show that GeM-CoT outperforms existing CoT techniques across multiple reasoning tasks and demonstrates robust performance in diverse and evolving question sets using GPT-3.5-Turbo and GPT-4 as backbone models.

Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with LLMs

The paper titled "Generalizable Chain-of-Thought Prompting in Mixed-task Scenarios with LLMs" by Anni Zou, Zhuosheng Zhang, Hai Zhao, and Xiangru Tang presents a novel mechanism called GeM-CoT aimed at addressing the challenges of chain-of-thought (CoT) prompting in mixed-task scenarios using LLMs. This work stands out by focusing on the automatic generalization of CoT prompts across diverse and unstructured question types, thereby addressing a practical gap in the application of CoT methods.

Introduction and Background

The advent of LLMs has significantly boosted the capabilities of automatic reasoning systems. CoT prompting, a notable technique in this domain, generates intermediate reasoning steps before arriving at the final answer, thereby enhancing the reasoning robustness of LLMs. However, existing CoT techniques, which can be categorized into General Zero-Shot-CoT and Specific Few-Shot-CoT, struggle to balance performance with generalization in unknown or mixed-task scenarios. The former, while generalizable, often sacrifices performance; the latter, despite higher performance, does not adapt to tasks for which no demonstrations have been prepared.
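To make the distinction concrete, the snippet below illustrates the two prompting styles. The prompt wording and the example question are illustrative assumptions, not the paper's exact templates.

    # Two CoT prompting styles (illustrative wording, not the paper's templates).
    question = "A parking lot has 3 cars and each car has 4 wheels. How many wheels in total?"

    # General Zero-Shot-CoT: a single generic trigger phrase, no demonstrations.
    zero_shot_prompt = f"Q: {question}\nA: Let's think step by step."

    # Specific Few-Shot-CoT: task-specific demonstrations with worked rationales.
    demos = [
        ("Tom has 2 boxes with 5 apples each. How many apples does he have?",
         "Each box holds 5 apples and there are 2 boxes, so 2 * 5 = 10. The answer is 10."),
    ]
    demo_text = "\n\n".join(f"Q: {q}\nA: {r}" for q, r in demos)
    few_shot_prompt = f"{demo_text}\n\nQ: {question}\nA:"

    print(zero_shot_prompt)
    print(few_shot_prompt)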

Proposed Mechanism: GeM-CoT

GeM-CoT aims to bridge this gap by introducing a framework that is both generalizable and performance-oriented in mixed-task settings. The system handles questions whose task types are unknown and that arrive in arbitrary order and format. It operates through an iterative process of type matching, demonstration acquisition, answer derivation, and data cache updating, ensuring continuous refinement and adaptability.
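A high-level sketch of this per-question loop is given below. The function names, the sentence encoder, and the matching threshold are illustrative assumptions rather than the authors' exact implementation.

    # Hypothetical sketch of the GeM-CoT routing loop (names and threshold are assumptions).
    import numpy as np
    from sentence_transformers import SentenceTransformer  # any sentence encoder would do

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def route_question(question, demo_pool, data_cache, threshold=0.8):
        """Return a prompt: few-shot if the question matches a known task type, zero-shot otherwise."""
        q_emb = encoder.encode(question)

        # Type matching: score the question against each task type's demonstrations.
        best_type, best_score = None, -1.0
        for task_type, demos in demo_pool.items():
            score = max(cosine(q_emb, encoder.encode(d["question"])) for d in demos)
            if score > best_score:
                best_type, best_score = task_type, score

        if best_score >= threshold:
            # Demo acquisition + few-shot answer derivation.
            demos = demo_pool[best_type]
            demo_text = "\n\n".join(f"Q: {d['question']}\nA: {d['rationale']}" for d in demos)
            return f"{demo_text}\n\nQ: {question}\nA:"

        # No match: fall back to zero-shot CoT and cache the question for the
        # later clustering and demo-construction step.
        data_cache.append(question)
        return f"Q: {question}\nA: Let's think step by step."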

Key Components

  1. Type Matching: GeM-CoT first decides whether an input question can be matched to a pre-constructed demo pool, using a similarity-based approach. This step ensures that subsequent reasoning is conducted with relevant demonstrations, improving the accuracy of the outputs.

  2. Demo Acquisition: For matched inputs, GeM-CoT fetches pertinent demonstrations from a demo pool. This pool is dynamically maintained, ensuring that it reflects the most up-to-date and relevant examples for inference.

  3. Answer Derivation: This stage derives the answer using either few-shot reasoning with demonstrations (if a match is found) or zero-shot reasoning without demonstrations (if no match is found). This dual approach allows the model to maintain high accuracy even when task-specific examples are not available.

  4. Data Cache Update: The mechanism updates the data cache with unmatched questions, performs density-based clustering over them, and constructs new demonstrations from the resulting clusters (a sketch follows this list). This continuous learning cycle enhances the model’s ability to generalize to previously unseen task types.
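Below is such a sketch. The clustering algorithm (DBSCAN here), its parameters, and the demonstration format are illustrative assumptions; the description above only states that the clustering is density-based.

    # Hypothetical sketch of the data-cache update step (parameters are assumptions).
    import numpy as np
    from sklearn.cluster import DBSCAN
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def update_demo_pool(data_cache, demo_pool, zero_shot_cot, eps=0.3, min_samples=5):
        """Cluster cached (unmatched) questions; when a dense cluster emerges,
        build new demonstrations for it and register it as a new task type."""
        if len(data_cache) < min_samples:
            return  # not enough unmatched questions to form a cluster yet

        embeddings = np.array([encoder.encode(q) for q in data_cache])
        labels = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit_predict(embeddings)

        for label in set(labels) - {-1}:  # -1 marks noise points
            members = [q for q, l in zip(data_cache, labels) if l == label]
            # Construct demonstrations by answering representative questions with
            # zero-shot CoT and keeping the generated rationales.
            demos = [{"question": q, "rationale": zero_shot_cot(q)} for q in members[:4]]
            demo_pool[f"auto_type_{len(demo_pool)}"] = demos
            # Drop the now-covered questions from the cache.
            for q in members:
                data_cache.remove(q)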

Experimental Results

The system was evaluated on ten reasoning tasks covering arithmetic, commonsense, and symbolic reasoning, as well as on 23 BBH tasks to verify its stability and generality. Using GPT-3.5-Turbo and GPT-4 as backbone models, GeM-CoT achieved superior performance across all test settings:

  • Performance on Reasoning Datasets: GeM-CoT outperformed strong baselines, including task-specific and general CoT techniques. Notably, it achieved an average accuracy of 82.3%, which marks an improvement over both Zero-Shot-CoT and other generalization methods.
  • Performance on BBH Tasks: When applied to BBH datasets under a realistic streaming scenario, GeM-CoT demonstrated increasingly robust performance, attesting to its capability of handling diverse and evolving question types.

Analysis and Examination

The paper further dissected the mechanism’s performance by examining different selection methods for demonstrations, the efficacy of the type matching module, and the impact of varying the matching threshold. The insights highlight that diversity in demonstration selection drives higher performance, and the similarity-based type matching mechanism effectively balances accuracy and flexibility.
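To illustrate the difference between similarity-driven and diversity-driven demonstration selection, the sketch below contrasts the two strategies; the encoder, the value of k, and the function names are assumptions rather than the paper's implementation.

    # Two demonstration-selection strategies (illustrative, not the paper's code).
    import numpy as np
    from sklearn.cluster import KMeans
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def select_by_similarity(question, candidates, k=4):
        """Pick the k candidate questions most similar to the input question."""
        q = encoder.encode(question)
        embs = np.array([encoder.encode(c) for c in candidates])
        scores = embs @ q / (np.linalg.norm(embs, axis=1) * np.linalg.norm(q))
        return [candidates[i] for i in np.argsort(-scores)[:k]]

    def select_by_diversity(candidates, k=4):
        """Pick one representative question from each of k clusters."""
        embs = np.array([encoder.encode(c) for c in candidates])
        km = KMeans(n_clusters=k, n_init=10).fit(embs)
        picks = []
        for c in range(k):
            idx = np.where(km.labels_ == c)[0]
            dists = np.linalg.norm(embs[idx] - km.cluster_centers_[c], axis=1)
            picks.append(candidates[idx[np.argmin(dists)]])  # member closest to the centroid
        return picks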

Implications and Future Directions

The approach proposed in this study holds significant implications for practical applications of LLMs in real-world scenarios where task types are varied and not pre-defined. By ensuring that CoT prompting can automatically adapt and generalize across diverse tasks, GeM-CoT can enhance the applicability of LLMs in more complex and dynamic environments. Future work can explore incorporating additional reasoning improvement techniques and optimizing the efficiency of the demonstration selection process.

In conclusion, GeM-CoT represents a substantial advancement in the application of LLMs for reasoning tasks, achieving a practical blend of generalization and performance through a dynamic and self-improving mechanism. This work paves the way for more adaptive and robust AI systems capable of handling a wide array of real-world challenges.
