TheoremQA: A Theorem-driven Question Answering dataset

(arXiv:2305.12524)
Published May 21, 2023 in cs.CL and cs.AI

Abstract

Recent LLMs such as GPT-4 and PaLM-2 have made tremendous progress on fundamental math benchmarks like GSM8K, achieving over 90% accuracy. However, their ability to solve more challenging math problems that require domain-specific knowledge (i.e., theorems) has yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' ability to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies such as Chain-of-Thoughts and Program-of-Thoughts. We find that GPT-4 is unparalleled on these problems, reaching 51% accuracy with Program-of-Thoughts prompting, while all existing open-source models score below 15%, barely surpassing the random-guess baseline. Given its diversity and broad coverage, we believe TheoremQA can serve as a better benchmark for evaluating LLMs' ability to solve challenging science problems. The data and code are released at https://github.com/wenhuchen/TheoremQA.
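To make the Program-of-Thoughts (PoT) strategy mentioned in the abstract concrete, here is a minimal sketch of PoT-style evaluation on a TheoremQA-like question: the model is prompted to write a Python program whose final variable holds the answer, the program is executed, and the result is matched numerically against the ground truth. The prompt template, the `generate` stub standing in for an LLM call, and the matching tolerance are illustrative assumptions, not the paper's exact evaluation harness.

```python
# Minimal sketch of Program-of-Thoughts (PoT) evaluation on a
# TheoremQA-style question. The prompt template, the `generate` stub
# (a stand-in for a call to an LLM such as GPT-4), and the numeric
# tolerance are illustrative assumptions, not the paper's exact harness.
import math

POT_TEMPLATE = (
    "Write a Python program that solves the question below and stores "
    "the final numeric answer in a variable named `ans`.\n\n"
    "Question: {question}\n\n# Python program:\n"
)

def generate(prompt: str) -> str:
    """Stand-in for an LLM call; returns a canned program for this demo."""
    return "import math\nans = math.comb(10, 3)  # choose 3 of 10\n"

def pot_answer(question: str) -> float:
    """Prompt for a program, execute it, and read back `ans`."""
    code = generate(POT_TEMPLATE.format(question=question))
    scope: dict = {}
    exec(code, scope)  # run the model-written program (sandbox in practice)
    return float(scope["ans"])

def is_correct(pred: float, gold: float, rel_tol: float = 1e-2) -> bool:
    """Loose numeric match, since model outputs are often rounded."""
    return math.isclose(pred, gold, rel_tol=rel_tol)

q = "In how many ways can 3 items be chosen from 10 distinct items?"
pred = pot_answer(q)
print(pred, is_correct(pred, 120.0))  # -> 120.0 True
```

A Chain-of-Thoughts run differs only in the prompt: the model reasons step by step in natural language and the final answer is parsed from its text rather than computed by executing generated code.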

References
  1. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2357–2367.
  2. Program Synthesis with Large Language Models
  3. Constitutional AI: Harmlessness from AI Feedback
  4. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
  5. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  6. Sparks of Artificial General Intelligence: Early experiments with GPT-4
  7. UniGeo: Unifying geometry logical reasoning via reformulating mathematical expression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3313–3323, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  8. GeoQA: A geometric question answering benchmark towards multimodal numerical reasoning. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 513–523.
  9. Evaluating Large Language Models Trained on Code
  10. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
  11. FinQA: A dataset of numerical reasoning over financial data. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3697–3711.
  12. PaLM: Scaling Language Modeling with Pathways
  13. Training Verifiers to Solve Math Word Problems
  14. Compositional Semantic Parsing with Large Language Models
  15. PAL: Program-aided Language Models
  16. Google. 2023. PaLM 2 Technical Report. https://ai.google/static/documents/palm2techreport.pdf.
  17. Measuring mathematical problem solving with the MATH dataset. Conference on Neural Information Processing Systems.
  18. Training Compute-Optimal Large Language Models
  19. Learning to solve arithmetic word problems with verb categorization. In EMNLP, pages 523–533.
  20. Wolfram Research, Inc. Mathematica, Version 13.2. Champaign, IL
  21. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.
  22. Parsing algebraic word problems into equations. Transactions of the Association for Computational Linguistics, 3:585–597.
  23. MAWPS: A math word problem repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1152–1157.
  24. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
  25. StarCoder: may the source be with you!
  26. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 158–167.
  27. Visual Instruction Tuning
  28. Inter-GPS: Interpretable geometry problem solving with formal language and symbolic reasoning. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6774–6786, Online. Association for Computational Linguistics.
  29. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.
  30. Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
  31. Dynamic prompt learning via policy gradient for semi-structured mathematical reasoning. In International Conference on Learning Representations (ICLR).
  32. IconQA: A new benchmark for abstract diagram understanding and visual language reasoning. In The 35th Conference on Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks.
  33. A survey of deep learning for mathematical reasoning. In The 61st Annual Meeting of the Association for Computational Linguistics (ACL).
  34. A diverse corpus for evaluating and developing English math word problem solvers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 975–984.
  35. Lila: A unified benchmark for mathematical reasoning. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP).
  36. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis
  37. Show your work: Scratchpads for intermediate computation with language models. In Deep Learning for Code Workshop.
  38. GPT-4 Technical Report
  39. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  40. Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.
  41. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR.
  42. Scaling Language Models: Methods, Analysis & Insights from Training Gopher
  43. Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1743–1752.
  44. Analysing Mathematical Reasoning Abilities of Neural Models
  45. Solving geometry problems: Combining text and diagram interpretation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1466–1476, Lisbon, Portugal. Association for Computational Linguistics.
  46. Task Ambiguity in Humans and Language Models
  47. Galactica: A Large Language Model for Science
  48. LLaMA: Open and Efficient Foundation Language Models
  49. Shyam Upadhyay and Ming-Wei Chang. 2015. DRAW: A challenging and diverse algebra word problem set. Technical report, Citeseer.
  50. Shyam Upadhyay and Ming-Wei Chang. 2017. Annotating derivations: A new evaluation strategy and dataset for algebra word problems. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 494–504, Valencia, Spain. Association for Computational Linguistics.
  51. Self-Consistency Improves Chain of Thought Reasoning in Language Models
  52. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854.
  53. CodeT5+: Open Code Large Language Models for Code Understanding and Generation
  54. Emergent abilities of large language models. Transactions on Machine Learning Research.
  55. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
  56. GLM-130B: An Open Bilingual Pre-trained Model
  57. OPT: Open Pre-trained Transformer Language Models
  58. Progressive-Hint Prompting Improves Reasoning in Large Language Models
  59. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models
  60. TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3277–3287.
