Prompting-based LLMs are surprisingly powerful at generating natural language reasoning steps or Chains-of-Thoughts (CoT) for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up-to-date within its parameters. While using the question to retrieve relevant text from an external knowledge source helps LLMs, we observe that this one-step retrieve-and-read approach is insufficient for multi-step QA. Here, \textit{what to retrieve} depends on \textit{what has already been derived}, which in turn may depend on \textit{what was previously retrieved}. To address this, we propose IRCoT, a new approach for multi-step QA that interleaves retrieval with steps (sentences) in a CoT, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Using IRCoT with GPT3 substantially improves retrieval (up to 21 points) as well as downstream QA (up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. We observe similar substantial gains in out-of-distribution (OOD) settings as well as with much smaller models such as Flan-T5-large without additional training. IRCoT reduces model hallucination, resulting in factually more accurate CoT reasoning. Code, data, and prompts are available at \url{https://github.com/stonybrooknlp/ircot}
The paper introduces Interleaving Retrieval with Chain-of-Thought (IRCoT), a method for multi-step open-domain question answering.
IRCoT interleaves step-by-step reasoning with incremental retrieval of external information during the answering process (see the prompt sketch below these points).
Reasoning and retrieval reinforce each other: each new CoT sentence guides what to retrieve next, and the retrieved documents in turn inform the next reasoning step.
With GPT3 and Flan-T5, IRCoT substantially outperforms one-step retrieve-and-read baselines on HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC.
IRCoT also shows promise for future applications in knowledge-intensive question answering.
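To make the interleaving concrete, below is a minimal sketch of how the input for a single CoT-generation step might be assembled: all paragraphs retrieved so far, followed by the question and the partial chain of thought. The names used here (`Paragraph`, `format_step_prompt`) and the exact layout are illustrative assumptions, not the paper's released prompt template.

```python
# Illustrative sketch only -- the exact template in the released IRCoT prompts
# may differ; Paragraph and format_step_prompt are hypothetical names.
from dataclasses import dataclass


@dataclass
class Paragraph:
    title: str  # e.g., a Wikipedia page title
    text: str   # the paragraph body


def format_step_prompt(paragraphs: list[Paragraph], question: str, cot_so_far: list[str]) -> str:
    """Build the input for generating the next CoT sentence: all collected
    paragraphs, then the question, then the chain of thought generated so far."""
    context = "\n\n".join(f"Wikipedia Title: {p.title}\n{p.text}" for p in paragraphs)
    reasoning = " ".join(cot_so_far)
    return f"{context}\n\nQ: {question}\nA: {reasoning}".strip()
```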
The paradigm of prompting LLMs to perform natural language reasoning step by step, known as chain-of-thought (CoT) prompting, has seen significant success, especially for tasks where all necessary information can be assumed to reside in the LLM's learned parameters. However, LLMs often falter on open-domain multi-step question answering (QA), where external or more up-to-date knowledge is required. The standard remedy is to augment the LLM with a single retrieval step that uses the question as the query, but this one-step retrieve-and-read approach is less effective for complex questions, where the information to retrieve emerges incrementally as reasoning progresses.
This work proposes Interleaving Retrieval with Chain-of-Thought (IRCoT), an approach that interlaces retrieval with the CoT process. Initially, paragraphs are retrieved using the question itself as the query. From then on, retrieval and reasoning inform each other: the CoT-generation step conditions on the question, the reasoning so far, and all paragraphs collected so far to produce the next CoT sentence; conversely, the newly generated sentence serves as the query for retrieving additional evidence. This cycle repeats until a termination criterion is met (for example, the generated CoT states the answer or a maximum number of steps is reached), simultaneously improving the quality of the CoT and the relevance of the retrieved information.
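The loop just described can be sketched roughly as follows, reusing the `Paragraph` and `format_step_prompt` helpers from the earlier sketch. The `retrieve` function (e.g., a BM25-style retriever over Wikipedia) and `generate_next_sentence` (a single LLM call) are hypothetical stand-ins, and the answer-detection check is a simplification of the paper's termination criterion.

```python
from typing import Callable, List, Tuple


def ircot(
    question: str,
    retrieve: Callable[[str, int], List[Paragraph]],  # hypothetical retriever, e.g., BM25 over Wikipedia
    generate_next_sentence: Callable[[str], str],     # hypothetical single LLM call returning one CoT sentence
    k: int = 4,
    max_steps: int = 8,
) -> Tuple[List[Paragraph], List[str]]:
    """Interleave retrieval with chain-of-thought generation (rough sketch)."""
    paragraphs = retrieve(question, k)  # initial retrieval uses the question itself as the query
    cot: List[str] = []

    for _ in range(max_steps):
        # Reason step: generate the next CoT sentence conditioned on the question,
        # the CoT so far, and every paragraph collected so far.
        sentence = generate_next_sentence(format_step_prompt(paragraphs, question, cot))
        cot.append(sentence)

        # Simplified termination check: stop once the CoT states an answer.
        if "answer is" in sentence.lower():
            break

        # Retrieve step: use the new CoT sentence as the next query and
        # add any newly retrieved paragraphs to the collected set.
        for p in retrieve(sentence, k):
            if p not in paragraphs:
                paragraphs.append(p)

    return paragraphs, cot
```

Feeding the full, de-duplicated paragraph set back into every generation step is what lets later reasoning draw on earlier retrievals; the collected paragraphs can then be passed to a downstream reader or used directly to answer the question.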
IRCoT is evaluated on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. With both GPT3 and Flan-T5 models, it surpasses single-step retrieval baselines by a substantial margin, both in retrieval recall and in downstream QA performance. It also proves robust in out-of-distribution settings and remains effective with much smaller LLMs. To aid replication and future research, the code, data, and prompts are publicly available.
IRCoT stands out as a notable approach that intertwines retrieval with CoT generation to handle open-domain, multi-step QA effectively. The technique improves both the relevance of the retrieved information and the factual accuracy of the generated CoTs, with gains across diverse LLM sizes and testing conditions. Although it depends on certain LLM capabilities, namely zero- or few-shot CoT generation and support for longer contexts, IRCoT represents a step forward for knowledge-intensive QA and may inform a variety of future retrieval-augmented applications.