Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions (IRCoT)

Abstract

Prompting-based LLMs are surprisingly powerful at generating natural language reasoning steps or Chains-of-Thoughts (CoT) for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up-to-date within its parameters. While using the question to retrieve relevant text from an external knowledge source helps LLMs, we observe that this one-step retrieve-and-read approach is insufficient for multi-step QA. Here, what to retrieve depends on what has already been derived, which in turn may depend on what was previously retrieved. To address this, we propose IRCoT, a new approach for multi-step QA that interleaves retrieval with steps (sentences) in a CoT, guiding the retrieval with CoT and in turn using retrieved results to improve CoT. Using IRCoT with GPT3 substantially improves retrieval (up to 21 points) as well as downstream QA (up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. We observe similar substantial gains in out-of-distribution (OOD) settings as well as with much smaller models such as Flan-T5-large without additional training. IRCoT reduces model hallucination, resulting in factually more accurate CoT reasoning. Code, data, and prompts are available at https://github.com/stonybrooknlp/ircot

Overview

  • The paper introduces a method called Interleaved Retrieval guided by Chain-of-Thought (IRCoT) for multi-step open-domain question answering.

  • IRCoT combines step-by-step reasoning with the retrieval of external information incrementally during the answering process.

  • The method involves mutual reinforcement between reasoning steps and the retrieval of relevant documents.

  • Performance improvements over traditional single-step retrieval models have been demonstrated across various datasets using GPT3 and Flan-T5.

  • IRCoT also shows promise for future applications and advancements in knowledge-intensive question answering.

Introduction

The paradigm of prompting LLMs to perform natural language reasoning step by step, known as chain-of-thought (CoT) prompting, has seen significant success, especially for tasks where all necessary information is assumed to reside in the LLM's learned parameters. However, LLMs often falter on open-domain multi-step question answering (QA), where external or up-to-date knowledge is required. Traditional solutions augment the LLM with a single retrieval step, but this is less effective for complex questions where the needed information must be retrieved incrementally, as reasoning progresses.
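For contrast, below is a minimal sketch of the one-step retrieve-and-read setup described above. The names `retrieve` and `generate_answer` are hypothetical stand-ins for a lexical retriever (e.g., BM25) and a prompted LLM reader; they are not interfaces from the paper's released code.

```python
# One-step retrieve-and-read baseline (illustrative sketch).
# `retrieve` and `generate_answer` are hypothetical stand-ins, not the paper's API.

def one_step_retrieve_and_read(question, retrieve, generate_answer, top_k=15):
    """Retrieve once using the question, then answer from those paragraphs."""
    paragraphs = retrieve(query=question, top_k=top_k)  # single retrieval step
    return generate_answer(question, paragraphs)        # read and answer in one shot
```

The limitation IRCoT targets is visible here: the query never changes, so evidence that only becomes identifiable after partial reasoning (for example, an intermediate entity mentioned in the first reasoning step) is never retrieved.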

Interleaved Retrieval and CoT (IRCoT) Method

This work proposes Interleaved Retrieval guided by Chain-of-Thought (IRCoT), an approach that interleaves retrieval with the CoT process. Initially, paragraphs are retrieved using the question as the query. Thereafter, retrieval and reasoning inform each other in alternation: each CoT generation step builds on the reasoning so far and the collected paragraphs to produce the next sentence of the CoT, and, conversely, the newly generated CoT sentence serves as the query to retrieve additional evidence. This cycle repeats until a termination criterion is reached, simultaneously improving both the quality of the generated CoT and the relevance of the retrieved information.
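The loop can be summarized with the following minimal sketch, assuming a generic `retrieve` function (e.g., BM25 over a Wikipedia corpus) and a `generate_next_cot_sentence` function wrapping a CoT-prompted LLM. These names, the paragraph cap, and the exact termination check are illustrative assumptions rather than the authors' released implementation.

```python
# IRCoT interleaving loop (illustrative sketch, not the released implementation).

def ircot(question, retrieve, generate_next_cot_sentence,
          top_k=4, max_steps=8, max_paragraphs=15):
    # Retrieve step 0: seed the evidence pool using the question as the query.
    paragraphs = retrieve(query=question, top_k=top_k)
    cot_sentences = []

    for _ in range(max_steps):
        # Reason step: extend the CoT from the question and all evidence so far.
        sentence = generate_next_cot_sentence(question, paragraphs, cot_sentences)
        cot_sentences.append(sentence)

        # Terminate once the CoT declares an answer (or after max_steps iterations).
        if "answer is" in sentence.lower():
            break

        # Retrieve step: the newly generated CoT sentence becomes the next query.
        for paragraph in retrieve(query=sentence, top_k=top_k):
            if paragraph not in paragraphs and len(paragraphs) < max_paragraphs:
                paragraphs.append(paragraph)

    # The collected paragraphs can be passed to a downstream QA reader,
    # or the final CoT sentence can be used directly as the answer.
    return cot_sentences, paragraphs
```

The key design choice is that retrieval and reasoning reinforce each other: each retrieval round is guided by the latest reasoning step, and each reasoning step is grounded in the evidence accumulated so far.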

Efficacy of IRCoT

IRCoT has been evaluated on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. With both GPT3 and Flan-T5 models, it surpasses the one-step retrieval baseline by a substantial margin, in both retrieval recall and downstream QA accuracy. IRCoT also proves robust in out-of-distribution settings and remains effective with smaller LLMs. To aid replication and future research, the code, data, and prompts are publicly available.

Conclusions and Further Remarks

IRCoT stands as a notable approach, intertwining retrieval and CoT generation to tackle open-domain, multi-step QA effectively. The technique improves both the relevance of retrieved information and the factual reliability of generated CoTs, with gains observed across diverse LLM sizes and evaluation conditions. Although it depends on specific LLM capabilities, such as zero- or few-shot CoT generation and support for longer contexts, IRCoT represents a step forward in knowledge-intensive QA that may inform a variety of future language model applications.
