Abstract

In open-domain question-answering (ODQA), most existing questions require only single-hop reasoning over commonsense knowledge. To extend this task, we formally introduce open-domain multi-hop reasoning (ODMR): answering multi-hop questions with explicit reasoning steps in an open-domain setting. Recently, LLMs have found significant utility in facilitating ODQA without an external corpus. Furthermore, chain-of-thought (CoT) prompting boosts the reasoning capability of LLMs even further, via either manual or automated paradigms. However, existing automated methods lack quality assurance, while manual approaches suffer from limited scalability and poor diversity, hindering the capabilities of LLMs. In this paper, we propose Self-prompted Chain-of-Thought (SP-CoT), an automated framework to mass-produce high-quality CoTs of LLMs, by LLMs and for LLMs. SP-CoT introduces an automated generation pipeline for high-quality ODMR datasets, an adaptive sampler for in-context CoT selection, and self-prompted inference via in-context learning. Extensive experiments on four multi-hop question-answering benchmarks show that our proposed SP-CoT not only significantly surpasses previous SOTA methods on large-scale (175B) LLMs, but also nearly doubles the zero-shot performance of small-scale (13B) LLMs. Further analysis reveals the remarkable capability of SP-CoT to elicit direct and concise intermediate reasoning steps, recalling ∼50% of intermediate answers on the MuSiQue-Ans dataset.

Figure: The SP-CoT framework for automated ODMR dataset generation, adaptive CoT selection, and self-prompted inference via ICL.

Overview

  • The paper addresses the challenge of open-domain multi-hop reasoning (ODMR) in question-answering, introducing an automated pipeline for generating high-quality ODMR datasets and a self-prompted Chain-of-Thought (SP-CoT) methodology.

  • The authors propose an adaptive sampler mechanism to enhance the diversity and quality of in-context Chain-of-Thought (CoT) demonstrations for LLMs, and validate their approach through extensive experiments on multiple multi-hop question-answering benchmarks.

  • The SP-CoT method significantly outperforms existing techniques on large-scale LLMs, improves zero-shot performance on smaller models, and demonstrates effective intermediate reasoning abilities, suggesting its applicability across diverse reasoning tasks.

Self-prompted Chain-of-Thought on LLMs for Open-domain Multi-hop Reasoning

The study presented by Wang et al. addresses a key challenge in open-domain question-answering (ODQA): multi-hop reasoning. Traditionally, ODQA has centered on single-hop questions answerable from commonsense knowledge. This paper ventures into the more complex task of open-domain multi-hop reasoning (ODMR), where multiple reasoning steps are required to derive a correct answer without relying on an explicit context corpus.

Key Contributions and Methodology

Main Contributions:

  1. Automated Generation of ODMR Datasets: The authors introduce an automated pipeline that generates high-quality ODMR datasets. These datasets are composed of multi-hop questions, encompassing up to four reasoning steps, alongside intermediate questions and answers.
  2. Adaptive Sampler and Self-prompted CoT: They propose an adaptive sampler mechanism for in-context Chain-of-Thought (CoT) selection, leveraging the in-context learning (ICL) capabilities of LLMs. This method ensures diversity and quality in the selected CoTs (an illustrative sketch follows this list).
  3. Experimental Validation: Through extensive experiments on four distinct multi-hop question-answering benchmarks (HotpotQA, ComplexWebQuestions, 2WikiMultiHopQA, and MuSiQue-Ans), the proposed Self-prompted Chain-of-Thought (SP-CoT) methodology significantly outperforms state-of-the-art techniques.
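
The paper's sampler implementation is not reproduced in this summary; below is a minimal, illustrative sketch of one way a clustering-based demonstration selector for in-context CoT selection could look. It assumes a pool of self-generated (question, CoT, answer) triples; the TF-IDF embedding and k-means clustering are stand-ins for whatever representation and selection strategy the authors actually use.

```python
import numpy as np
from dataclasses import dataclass
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

@dataclass
class CoTExample:
    question: str
    chain_of_thought: str
    answer: str

def select_demonstrations(pool: list, k: int = 8) -> list:
    """Select k diverse CoT demonstrations by clustering the pool's
    questions and taking the example nearest each cluster centroid."""
    questions = [ex.question for ex in pool]
    X = TfidfVectorizer().fit_transform(questions)  # (n_pool, vocab)
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    demos = []
    for c in range(k):
        members = np.flatnonzero(km.labels_ == c)
        # pick the member closest to this cluster's centroid
        dists = np.linalg.norm(
            X[members].toarray() - km.cluster_centers_[c], axis=1)
        demos.append(pool[members[dists.argmin()]])
    return demos

def build_prompt(demos: list, test_question: str) -> str:
    """Assemble an ICL prompt: CoT demonstrations, then the test question."""
    blocks = [
        f"Q: {d.question}\nA: {d.chain_of_thought} "
        f"So the answer is {d.answer}."
        for d in demos
    ]
    blocks.append(f"Q: {test_question}\nA:")
    return "\n\n".join(blocks)
```

Clustering pushes the selected demonstrations apart in question space, which is one simple way to operationalize the diversity requirement the paper emphasizes.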

Methodology:

  1. 2-Hop QA Generation: The pipeline starts by iteratively generating 2-hop QA pairs, where each pair consists of a context, question, answer, and explanation. Generated questions must involve commonsense knowledge and pass a double-check quality-assurance test.
  2. Multi-Hop QA Composition: The 2-hop QA pairs are then composed into larger multi-hop question sets under a stringent composability criterion, which guarantees logical coherence and avoids cycles in reasoning chains (see the sketch after this list).
  3. Benchmarking on ODMR: The authors construct four new ODMR test sets from the above-mentioned benchmarks, highlighting the versatility of their approach across diverse question-answering tasks.
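
To make the composability criterion concrete, here is a hedged sketch rather than the authors' code: it treats two 2-hop pairs as composable when the first pair's final answer appears in the second pair's question, and it rejects chains in which any bridge or answer entity recurs, mirroring the requirement that reasoning chains stay logically coherent and acyclic. The field names and string-matching rule are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class TwoHopQA:
    question: str  # e.g. "Who directed the film that won Best Picture in 1995?"
    bridge: str    # intermediate entity linking the pair's two hops
    answer: str    # final answer of the 2-hop pair

def composable(prev: TwoHopQA, nxt: TwoHopQA) -> bool:
    """Two pairs compose when prev's final answer is the entity
    that nxt's question pivots on."""
    return prev.answer.lower() in nxt.question.lower()

def compose_chains(pairs: list, max_pairs: int = 2) -> list:
    """Greedily chain 2-hop pairs into multi-hop groups (max_pairs pairs
    yields up to 2 * max_pairs hops), rejecting cycles: no bridge or
    final answer may reappear later in the same chain."""
    chains = []
    for start in pairs:
        chain = [start]
        seen = {start.bridge.lower(), start.answer.lower()}
        for cand in pairs:
            if len(chain) == max_pairs:
                break
            new = {cand.bridge.lower(), cand.answer.lower()}
            if cand not in chain and composable(chain[-1], cand) \
                    and not (new & seen):
                chain.append(cand)
                seen |= new
        if len(chain) > 1:
            chains.append(chain)
    return chains
```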

Experimental Results

The SP-CoT method demonstrates:

  • A significant performance advantage on large-scale models (175B parameters), showcasing enhancements over prior CoT methodologies.
  • Almost doubling the zero-shot performance on smaller-scale models (13B parameters), confirming its broad applicability and efficiency.
  • Remarkable capability to elicit direct and concise intermediate reasoning steps, recalling ∼50% of intermediate answers on the MuSiQue-Ans dataset (a sketch of one plausible recall metric follows this list).
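
The exact matching rule behind the ∼50% figure is not spelled out in this summary, so the sketch below assumes a SQuAD-style normalized-containment metric: an intermediate gold answer counts as recalled if its normalized form appears in the generated chain of thought.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace
    (the usual SQuAD-style normalization; the paper may differ)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def intermediate_recall(reasoning: str, gold_intermediate_answers) -> float:
    """Fraction of gold intermediate answers that occur in the model's
    generated chain of thought."""
    cot = normalize(reasoning)
    hits = sum(normalize(ans) in cot for ans in gold_intermediate_answers)
    return hits / max(len(gold_intermediate_answers), 1)

# Example: 1 of 2 intermediate answers recalled -> 0.5
print(intermediate_recall(
    "The film was directed by Forrest Gump's director, Robert Zemeckis.",
    ["Robert Zemeckis", "Tom Hanks"],
))
```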

Implications and Future Directions

Theoretical Implications: The strong performance across varying scales of LLMs—both large and small—suggests that the automated generation and adaptive CoT sampling methods can generalize well and are not strictly bound by model size. This holds promise for scalable improvements in other complex reasoning tasks beyond question-answering.

Practical Implications: In real-world applications, particularly those involving automated reasoning and analytics, the ability to generate and rely on high-quality self-structured datasets can reduce the dependency on extensive manually curated training data. This can lead to more generalizable AI systems capable of robust reasoning and decision-making across open domains.

Future Research: The techniques introduced could inspire further research in several avenues:

  • Enhanced Instruction-Following: Exploring the boundaries of instruction-following capabilities in LLMs and how different self-prompting mechanisms can be optimized for diverse downstream applications.
  • Refinement Through Feedback Loops: Incorporating mechanisms for real-time feedback and correction, which can iteratively improve the quality of generated CoTs.
  • Cross-Domain Applications: Extending the use of SP-CoT to domains requiring complex, multi-modal reasoning such as legal analysis, scientific research, and strategic gaming AIs.

Conclusion

The approach introduced by Wang et al. for open-domain multi-hop reasoning with LLMs, culminating in SP-CoT, represents a significant advance in solving complex Q&A tasks. The method’s automated generation and efficient prompting framework, paired with large-scale evaluation, provide a solid foundation for future work on enhancing LLMs' reasoning capabilities. By unlocking higher reasoning potential through self-prompted automation, the research paves the way for robust and scalable AI solutions in diverse fields.
