Abstract

Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of LLMs, establishing itself as a primary approach to solving complex reasoning tasks. Existing CoT synthesis approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts. In response to this challenge, we present an empirical investigation of CoT prompting and introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts. CoTGenius is developed based on three major evolution strategies, i.e., complicate, diversify, and specify, alongside two filtering mechanisms: evolutionary success judgement and correctness verification. We further employ CoTGenius to create an extensive CoT dataset, and subsequently fine-tune the Llama 2-Chat 7B and 13B models on this dataset. We call the resulting model ChainLM. To deal with the cumulative error issue in reasoning steps, we propose a step-level debating method, wherein multiple debaters discuss each reasoning step to arrive at the correct answer. Extensive experiments demonstrate that our ChainLM models exhibit enhanced proficiency in addressing a spectrum of complex reasoning problems compared to existing models. In addition, we conduct an in-depth analysis of the impact of data categories within CoTGenius on the model performance. We release our dataset and code at https://github.com/RUCAIBox/ChainLM.

Overview

  • The paper 'ChainLM: Empowering LLMs with Improved Chain-of-Thought Prompting' introduces the CoTGenius framework, consisting of three evolution strategies—Complicate, Diversify, and Specify—to generate high-quality Chain-of-Thought (CoT) prompts.

  • By fine-tuning the Llama 2-Chat 7B and 13B models on a new dataset of 44,335 CoT prompts, the resulting ChainLM models demonstrate improved reasoning capabilities across various complex task categories, outperforming both open-source and closed-source models.

  • The study's empirical analysis, dataset, and code contributions shed light on enhancing LLM training and suggest future exploration of additional evolution strategies and real-world applications.

ChainLM: Empowering LLMs with Improved Chain-of-Thought Prompting

Introduction

The paper titled "ChainLM: Empowering LLMs with Improved Chain-of-Thought Prompting" by Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, and Ji-Rong Wen presents a refined framework to enhance the reasoning capabilities of LLMs through a novel approach termed CoTGenius. By addressing the limitations of previous Chain-of-Thought (CoT) prompting strategies, which often falter on complex reasoning tasks, the authors introduce an empirical investigation alongside the development of more sophisticated prompting techniques.

Core Contributions

1. CoTGenius Framework

CoTGenius is designed to automatically generate high-quality CoT prompts by employing three major evolution strategies:

  1. Complicate: Adding additional constraints and depth to problems to increase their complexity and stimulate multi-step reasoning.
  2. Diversify: Altering problem scenarios and drawing on novel inspirations based on the given questions to enhance the topic diversity.
  3. Specify: Refining and detailing existing reasoning steps to create a more comprehensive step-by-step thought process.
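The three strategies above can be pictured as prompt templates handed to an LLM. The following is a minimal, hypothetical sketch; the template wording is illustrative and not the paper's actual prompts.

```python
# Illustrative templates for the three CoTGenius evolution strategies.
# The exact phrasing used by the authors is not reproduced here.
EVOLUTION_TEMPLATES = {
    "complicate": (
        "Rewrite the following question by adding extra constraints so that "
        "answering it requires more reasoning steps:\n{question}"
    ),
    "diversify": (
        "Create a new question on a different topic that is inspired by, but "
        "not a paraphrase of, the following question:\n{question}"
    ),
    "specify": (
        "Expand the reasoning for the following question so that every "
        "intermediate step is stated explicitly:\n{question}"
    ),
}

def build_evolution_prompt(strategy: str, question: str) -> str:
    """Fill in the template for one evolution strategy."""
    if strategy not in EVOLUTION_TEMPLATES:
        raise ValueError(f"unknown strategy: {strategy}")
    return EVOLUTION_TEMPLATES[strategy].format(question=question)
```

In a full pipeline, the returned prompt would be sent to an LLM and the response treated as a candidate evolved question, subject to the filtering mechanisms described next.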

Accompanying these strategies are two filtering mechanisms:

  • Evolutionary Success Judgement: This mechanism leverages multiple LLMs (ChatGPT, Claude, and PaLM) to evaluate the effectiveness of evolved questions and reasoning steps.
  • Correctness Verification: Ensures the accuracy of newly generated reasoning steps through LLM-based validations.
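One plausible way to combine the judge models is majority voting, both for deciding whether an evolution succeeded and for keeping only samples whose independently generated answers agree. The sketch below assumes majority voting, which the paper does not spell out; the judge callables stand in for calls to ChatGPT, Claude, and PaLM.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

def evolution_success(judges: Sequence[Callable[[str, str], bool]],
                      original: str, evolved: str) -> bool:
    """Return True if a majority of judge models deem the evolved question
    a successful variant of the original (hypothetical aggregation rule)."""
    votes = [judge(original, evolved) for judge in judges]
    return sum(votes) > len(votes) / 2

def verify_correctness(answers: Sequence[str]) -> Optional[str]:
    """Keep a sample only when independently generated answers agree:
    return the majority answer, or None when no majority exists."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > len(answers) / 2 else None
```

Samples failing either filter would be discarded before entering the fine-tuning dataset.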

2. Creation of ChainLM

Utilizing CoTGenius, the authors constructed an extensive dataset of 44,335 CoT prompts and fine-tuned the Llama 2-Chat 7B and 13B models on this dataset. The resulting models, termed ChainLM, demonstrate enhanced proficiency in handling complex reasoning problems as compared to other models.
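Before fine-tuning, each generated sample must be serialized into an instruction-tuning record. A minimal sketch of such a formatting step is shown below; the field names and prompt wording are assumptions for illustration, not the authors' exact format.

```python
def format_cot_example(question: str, rationale: str, answer: str) -> dict:
    """Turn one CoT sample (question, reasoning steps, final answer) into a
    prompt/completion record for supervised fine-tuning."""
    prompt = (
        "Answer the question step by step.\n"
        f"Question: {question}\n"
    )
    completion = f"{rationale}\nTherefore, the answer is {answer}."
    return {"prompt": prompt, "completion": completion}
```

Records in this shape can be fed to any standard supervised fine-tuning loop over the Llama 2-Chat checkpoints.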

Experimental Validation

The authors conducted experiments across various reasoning task categories, including commonsense reasoning, mathematical reasoning, scientific reasoning, and symbolic reasoning, utilizing datasets such as CommonsenseQA, MATH, ScienceQA, and several others.

Key Results:

  • ChainLM models exhibited robust improvements in accuracy across most datasets when benchmarked against both closed-source models such as ChatGPT and InstructGPT, and open-source models including Llama 2-Chat, Falcon, and Vicuna.
  • For instance, ChainLM achieved an accuracy of 34.13% in Elementary Mathematics, significantly outperforming Llama 2-Chat's 24.88%.
  • The step-level debating method, which facilitates agent discussions for each reasoning step, further enhanced ChainLM's performance, especially in tasks prone to intermediate step errors.
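The step-level debating procedure can be sketched as a loop in which, at every step, each debater proposes a candidate next step and the group commits to one before moving on. The majority-vote aggregation and the stopping condition below are assumptions made for illustration; the debater callables stand in for LLM agents.

```python
from collections import Counter
from typing import Callable, List, Sequence

# A debater proposes the next reasoning step, given the question and the
# chain of steps agreed on so far.
Debater = Callable[[str, List[str]], str]

def step_level_debate(question: str,
                      debaters: Sequence[Debater],
                      max_steps: int = 10) -> List[str]:
    """Hypothetical sketch of step-level debating: at each step every
    debater proposes a candidate, the majority proposal is appended to the
    shared chain, and debate stops once a step announces an answer."""
    chain: List[str] = []
    for _ in range(max_steps):
        proposals = [d(question, chain) for d in debaters]
        step, _ = Counter(proposals).most_common(1)[0]
        chain.append(step)
        if "answer" in step.lower():
            break
    return chain
```

Because agreement is reached step by step rather than only on the final answer, an error introduced at an intermediate step can be overruled before it propagates, which is the cumulative-error issue the method targets.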

Implications and Future Work

The strong numerical results obtained by ChainLM suggest that the CoTGenius framework and its evolution strategies can substantially improve LLMs' reasoning capabilities. This has critical implications for various applications, notably in domains requiring intricate reasoning, such as scientific research, mathematical problem solving, and automated decision-making systems.

Theoretical Contributions:

  • The empirical analysis sheds light on crucial factors like inference completeness, prompt specificity, and reasoning logicality, providing deeper insights into the mechanics of effective CoT prompts.

Practical Contributions:

  • The 44,335-sample CoT dataset, the CoTGenius code, and the fine-tuned ChainLM models are publicly released, providing reusable resources for training and evaluating reasoning-oriented LLMs.

Future Developments:

  • Exploring additional evolution strategies and optimizing the debate mechanisms for CoT prompting could further elevate the performance of LLMs in complex tasks.
  • There is also potential to investigate the application of CoTGenius improvements in real-world scenarios and industry-specific use cases.

In conclusion, the methodologies and innovations presented in this paper mark a significant step towards refining the reasoning capabilities of LLMs, fostering advancements that could pave the way for more sophisticated AI systems in the future.