Emergent Mind

Abstract

We investigate the ability of language models to perform compositional reasoning tasks, where the overall solution depends on correctly composing the answers to sub-problems. We measure how often models can correctly answer all sub-problems yet fail to generate the overall solution, a ratio we call the compositionality gap. We evaluate this ratio by asking multi-hop questions whose answers require composing multiple facts that are unlikely to have been observed together during pretraining. We show that in the GPT-3 family of models, as model size increases, single-hop question answering performance improves faster than multi-hop performance does, so the compositionality gap does not decrease. This surprising result suggests that while more powerful models memorize and recall more factual knowledge, they show no corresponding improvement in their ability to perform this kind of compositional reasoning. We then demonstrate how elicitive prompting (such as chain of thought) narrows the compositionality gap by reasoning explicitly. We present a new method, self-ask, that further improves on chain of thought. In our method, the model explicitly asks itself (and answers) follow-up questions before answering the initial question. We finally show that self-ask's structured prompting lets us easily plug in a search engine to answer the follow-up questions, which additionally improves accuracy.
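The self-ask control flow described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the language model is replaced by a toy fact lookup (`TOY_FACTS`), and the follow-up questions are scripted rather than generated by the model. In the actual method, the model itself decides whether follow-ups are needed, generates them, and answers them (or a plugged-in search engine answers them); all names below are hypothetical.

```python
def self_ask(question, followups, answer_fn):
    """Build a self-ask style trace: pose scripted follow-up questions,
    answer each one, then emit the last answer as the final answer."""
    trace = [f"Question: {question}",
             "Are follow up questions needed here: Yes."]
    final = None
    for fq in followups:
        ans = answer_fn(fq)  # in the paper: an LM or a search engine
        trace.append(f"Follow up: {fq}")
        trace.append(f"Intermediate answer: {ans}")
        final = ans
    trace.append(f"So the final answer is: {final}")
    return final, "\n".join(trace)


# Toy stand-in for the model's factual knowledge. Note that the second
# follow-up depends on the first answer ("1911"); in the real method the
# model composes it on the fly, here it is written out by hand.
TOY_FACTS = {
    "When was superconductivity discovered?": "1911",
    "Who was president of the U.S. in 1911?": "William Howard Taft",
}

final, trace = self_ask(
    "Who was president of the U.S. when superconductivity was discovered?",
    list(TOY_FACTS),
    TOY_FACTS.get,
)
print(trace)
```

The printed trace mirrors the prompt scaffold the method relies on: because the follow-up questions appear as clearly delimited lines, an external search engine can be substituted for `answer_fn` without changing the surrounding structure.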


