ToolQA: A Dataset for LLM Question Answering with External Tools

arXiv:2306.13304
Published Jun 23, 2023 in cs.CL and cs.AI

Abstract

LLMs have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge in order to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.
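To make the evaluation setup concrete, here is a minimal sketch of what a ToolQA-style tool-augmented QA loop and exact-match scoring might look like. All names (`calculator`, `table_lookup`, the flight-delay table, the hard-coded tool chain) are illustrative assumptions, not the paper's actual API; a real agent would let the LLM decide which tools to invoke at each step.

```python
# Hypothetical sketch of a tool-augmented QA evaluation loop.
# Tool names, the data table, and the fixed tool chain are all
# illustrative assumptions, not ToolQA's actual interface.

def calculator(expression: str) -> str:
    """Toy numerical tool: evaluates a simple arithmetic expression."""
    # Builtins are disabled so only plain arithmetic is allowed.
    return str(eval(expression, {"__builtins__": {}}))

def table_lookup(table: dict, key: str) -> str:
    """Toy retrieval tool: fetches a value from an external table
    (standing in for knowledge the LLM should not have memorized)."""
    return str(table.get(key, "NOT FOUND"))

# External data the model is assumed NOT to know from pre-training.
FLIGHTS = {"UA100 delay (min)": "42", "UA200 delay (min)": "17"}

def answer_with_tools(question: str) -> str:
    # A real agent would let the LLM plan the tool calls; here the
    # chain is hard-coded for one example question.
    d1 = table_lookup(FLIGHTS, "UA100 delay (min)")
    d2 = table_lookup(FLIGHTS, "UA200 delay (min)")
    return calculator(f"({d1} + {d2}) / 2")

def exact_match(prediction: str, reference: str) -> bool:
    """Score a prediction against the gold answer."""
    return prediction.strip() == reference.strip()

pred = answer_with_tools("What is the average delay of UA100 and UA200?")
print(pred, exact_match(pred, "29.5"))  # → 29.5 True
```

Because the answer depends on values the tools retrieve from an external table, a model answering from parametric memory alone cannot score well, which is the separation the benchmark is designed to measure.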

