Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions (2212.10509v2)
Abstract: Prompting-based large language models (LLMs) are surprisingly powerful at generating natural language reasoning steps, or Chains-of-Thought (CoT), for multi-step question answering (QA). They struggle, however, when the necessary knowledge is either unavailable to the LLM or not up-to-date within its parameters. While using the question to retrieve relevant text from an external knowledge source helps LLMs, we observe that this one-step retrieve-and-read approach is insufficient for multi-step QA: *what to retrieve* depends on *what has already been derived*, which in turn may depend on *what was previously retrieved*. To address this, we propose IRCoT, a new approach for multi-step QA that interleaves retrieval with steps (sentences) in a CoT, guiding the retrieval with the CoT and in turn using retrieved results to improve the CoT. Using IRCoT with GPT3 substantially improves retrieval (by up to 21 points) as well as downstream QA (by up to 15 points) on four datasets: HotpotQA, 2WikiMultihopQA, MuSiQue, and IIRC. We observe similar substantial gains in out-of-distribution (OOD) settings as well as with much smaller models such as Flan-T5-large, without additional training. IRCoT also reduces model hallucination, resulting in factually more accurate CoT reasoning. Code, data, and prompts are available at https://github.com/stonybrooknlp/ircot
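To make the interleaving concrete, below is a minimal sketch of how such a retrieve/reason loop might look, based only on the abstract's description. The names `ircot`, `retrieve`, and `llm_generate_sentence`, the parameters `k` and `max_steps`, and the "answer is" stopping heuristic are all illustrative assumptions, not the authors' actual interfaces.

```python
# Hypothetical sketch of an IRCoT-style loop: retrieval is guided by the
# latest CoT sentence, and each new CoT sentence is generated from the
# paragraphs retrieved so far. All helper names are stand-ins.

from typing import Callable, List, Tuple

def ircot(
    question: str,
    retrieve: Callable[[str, int], List[str]],          # query -> top-k paragraphs (assumed)
    llm_generate_sentence: Callable[[str, List[str], List[str]], str],  # assumed LLM call
    k: int = 4,
    max_steps: int = 8,
) -> Tuple[List[str], List[str]]:
    """Alternate retrieval and CoT generation for one question.

    The first retrieval is seeded with the question; every later retrieval
    is guided by the most recently derived CoT sentence.
    """
    paragraphs: List[str] = retrieve(question, k)       # seed retrieval with the question
    cot: List[str] = []
    for _ in range(max_steps):
        # Reason step: extend the CoT by one sentence, conditioned on the
        # question, everything retrieved so far, and the CoT so far.
        sentence = llm_generate_sentence(question, paragraphs, cot)
        cot.append(sentence)
        if "answer is" in sentence.lower():             # assumed termination heuristic
            break
        # Retrieve step: use the newly derived sentence as the next query,
        # accumulating (deduplicated) paragraphs across steps.
        for p in retrieve(sentence, k):
            if p not in paragraphs:
                paragraphs.append(p)
    return cot, paragraphs
```

One design choice worth noting: the sketch accumulates the union of paragraphs across steps rather than keeping only the latest batch, so that later reasoning steps can still draw on evidence retrieved earlier in the chain.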
- Learning to retrieve reasoning paths over Wikipedia graph for question answering. In International Conference on Learning Representations.
- Improving language models by retrieving from trillions of tokens. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 2206–2240. PMLR.
- Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
- Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
- Multi-step retriever-reader interaction for scalable open-domain question answering. In International Conference on Learning Representations.
- Multi-hop paragraph retrieval for open-domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2296–2309, Florence, Italy. Association for Computational Linguistics.
- IIRC: A dataset of incomplete information reading comprehension questions. In EMNLP.
- Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.
- Large language models are reasoning teachers. arXiv preprint arXiv:2212.10071.
- Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In COLING.
- Atlas: Few-shot learning with retrieval augmented language models. arXiv preprint arXiv:2208.03299.
- Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics.
- RealTime QA: What’s the answer right now? arXiv preprint arXiv:2207.13332.
- Baleen: Robust multi-hop reasoning at scale via condensed retrieval. In Advances in Neural Information Processing Systems.
- Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive NLP.
- Decomposed prompting: A modular approach for solving complex tasks. In The Eleventh International Conference on Learning Representations.
- Large language models are zero-shot reasoners. In ICML 2022 Workshop on Knowledge Retrieval and Language Models.
- Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115.
- Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems, volume 33, pages 9459–9474. Curran Associates, Inc.
- Teaching small language models to reason. arXiv preprint arXiv:2212.08410.
- WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332.
- Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems.
- Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350.
- Answering complex open-domain questions through iterative query generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2590–2602, Hong Kong, China. Association for Computational Linguistics.
- The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.
- Recitation-augmented language models. arXiv preprint arXiv:2210.01296.
- UL2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations.
- MuSiQue: Multihop questions via single-hop question composition. TACL, 10:539–554.
- Chain of thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
- Answering complex open-domain questions with multi-hop dense retrieval. In International Conference on Learning Representations.
- HotpotQA: A dataset for diverse, explainable multi-hop question answering. In EMNLP.
- ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629.
- Generate rather than retrieve: Large language models are strong context generators. In The Eleventh International Conference on Learning Representations.
- Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv preprint arXiv:2101.00774.