Prompting-based LLMs are surprisingly powerful at generating natural language reasoning steps or Chains-of-Thoughts (CoT) for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up-to-date within its parameters. While using the question to retrieve relevant text from an external knowledge source helps LLMs, we observe that this one-step retrieve-and-read approach is insufficient for multi-step QA. Here, \textit{what to retrieve} depends on \textit{what has already been derived}, which in turn may depend on \textit{what was previously retrieved}. To address this, we propose IRCoT, a new approach for multi-step QA that interleaves retrieval with steps (sentences) in a CoT, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Using IRCoT with GPT3 substantially improves retrieval (up to 21 points) as well as downstream QA (up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. We observe similar substantial gains in out-of-distribution (OOD) settings as well as with much smaller models such as Flan-T5-large without additional training. IRCoT reduces model hallucination, resulting in factually more accurate CoT reasoning. Code, data, and prompts are available at \url{https://github.com/stonybrooknlp/ircot}
The paper introduces Interleaving Retrieval with Chain-of-Thought (IRCoT), a method for multi-step open-domain question answering.
IRCoT interleaves step-by-step reasoning with incremental retrieval of external information during the answering process (see the prompt sketch below these points).
Reasoning and retrieval reinforce each other: each new CoT sentence guides what to retrieve next, and the retrieved documents in turn inform the next reasoning step.
With GPT3 and Flan-T5, IRCoT substantially outperforms one-step retrieve-and-read baselines on HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC.
IRCoT also shows promise for future applications in knowledge-intensive question answering.
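To make the interleaving concrete, below is a minimal sketch of how the input for a single CoT-generation step might be assembled: all paragraphs retrieved so far, followed by the question and the partial chain of thought. The names used here (`Paragraph`, `format_step_prompt`) and the exact layout are illustrative assumptions, not the paper's released prompt template.

```python
# Illustrative sketch only -- the exact template in the released IRCoT prompts
# may differ; Paragraph and format_step_prompt are hypothetical names.
from dataclasses import dataclass


@dataclass
class Paragraph:
    title: str  # e.g., a Wikipedia page title
    text: str   # the paragraph body


def format_step_prompt(paragraphs: list[Paragraph], question: str, cot_so_far: list[str]) -> str:
    """Build the input for generating the next CoT sentence: all collected
    paragraphs, then the question, then the chain of thought generated so far."""
    context = "\n\n".join(f"Wikipedia Title: {p.title}\n{p.text}" for p in paragraphs)
    reasoning = " ".join(cot_so_far)
    return f"{context}\n\nQ: {question}\nA: {reasoning}".strip()
```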
The paradigm of prompting LLMs to perform natural language reasoning step by step, known as chain-of-thought (CoT) prompting, has seen significant success, especially for tasks where all necessary information can be assumed to reside in the LLM's learned parameters. However, LLMs often falter on open-domain multi-step question answering (QA), where external or more up-to-date knowledge is required. The standard remedy is to augment the LLM with a single retrieval step that uses the question as the query, but this one-step retrieve-and-read approach is less effective for complex questions, where the information to retrieve emerges incrementally as reasoning progresses.
This work proposes Interleaving Retrieval with Chain-of-Thought (IRCoT), an approach that interlaces retrieval with the CoT process. Initially, paragraphs are retrieved using the question itself as the query. From then on, retrieval and reasoning inform each other: the CoT-generation step conditions on the question, the reasoning so far, and all paragraphs collected so far to produce the next CoT sentence; conversely, the newly generated sentence serves as the query for retrieving additional evidence. This cycle repeats until a termination criterion is met (for example, the generated CoT states the answer or a maximum number of steps is reached), simultaneously improving the quality of the CoT and the relevance of the retrieved information.
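The loop just described can be sketched roughly as follows, reusing the `Paragraph` and `format_step_prompt` helpers from the earlier sketch. The `retrieve` function (e.g., a BM25-style retriever over Wikipedia) and `generate_next_sentence` (a single LLM call) are hypothetical stand-ins, and the answer-detection check is a simplification of the paper's termination criterion.

```python
from typing import Callable, List, Tuple


def ircot(
    question: str,
    retrieve: Callable[[str, int], List[Paragraph]],  # hypothetical retriever, e.g., BM25 over Wikipedia
    generate_next_sentence: Callable[[str], str],     # hypothetical single LLM call returning one CoT sentence
    k: int = 4,
    max_steps: int = 8,
) -> Tuple[List[Paragraph], List[str]]:
    """Interleave retrieval with chain-of-thought generation (rough sketch)."""
    paragraphs = retrieve(question, k)  # initial retrieval uses the question itself as the query
    cot: List[str] = []

    for _ in range(max_steps):
        # Reason step: generate the next CoT sentence conditioned on the question,
        # the CoT so far, and every paragraph collected so far.
        sentence = generate_next_sentence(format_step_prompt(paragraphs, question, cot))
        cot.append(sentence)

        # Simplified termination check: stop once the CoT states an answer.
        if "answer is" in sentence.lower():
            break

        # Retrieve step: use the new CoT sentence as the next query and
        # add any newly retrieved paragraphs to the collected set.
        for p in retrieve(sentence, k):
            if p not in paragraphs:
                paragraphs.append(p)

    return paragraphs, cot
```

Feeding the full, de-duplicated paragraph set back into every generation step is what lets later reasoning draw on earlier retrievals; the collected paragraphs can then be passed to a downstream reader or used directly to answer the question.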
IRCoT is evaluated on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. With both GPT3 and Flan-T5 models, it surpasses single-step retrieval baselines by a substantial margin, both in retrieval recall and in downstream QA performance. It also proves robust in out-of-distribution settings and remains effective with much smaller LLMs. To aid replication and future research, the code, data, and prompts are publicly available.
IRCoT stands out as a notable approach that intertwines retrieval with CoT generation to handle open-domain, multi-step QA effectively. The technique improves both the relevance of the retrieved information and the factual accuracy of the generated CoTs, with gains across diverse LLM sizes and testing conditions. Although it depends on certain LLM capabilities, namely zero- or few-shot CoT generation and support for longer contexts, IRCoT represents a step forward for knowledge-intensive QA and may inform a variety of future retrieval-augmented applications.