
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs (2307.16789v2)

Published 31 Jul 2023 in cs.AI, cs.CL, and cs.LG

Abstract: Despite the advancements of open-source LLMs, e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.


Summary

  • The paper introduces ToolLLM, which enables LLMs to interact with 16,000+ real-world APIs by combining a new instruction-tuning dataset (ToolBench) with a depth-first search-based decision tree (DFSDT) algorithm.
  • It presents ToolLLaMA, a fine-tuned LLaMA version that, aided by a neural API retriever, rivals closed-source models on unseen APIs and achieves strong zero-shot performance.
  • An automated evaluation framework, ToolEval, validates the approach with high pass and win rates, offering a scalable benchmark for tool-use capabilities.

ToolLLM: Facilitating LLMs to Master 16000+ Real-world APIs

Introduction

The paper "ToolLLM: Facilitating LLMs to Master 16000+ Real-world APIs" focuses on improving the tool-use capabilities of open-source LLMs such as LLaMA. Current instruction tuning concentrates on basic language tasks and largely ignores tool use, so these models remain limited in their ability to call external tools (APIs) to fulfill human instructions, unlike closed-source counterparts such as ChatGPT. To address this gap, the work introduces ToolLLM, a general tool-use framework covering data construction, model training, and evaluation.

Figure 1: Architectural overview of ToolLLM showcasing its components and workflow.

ToolBench: Instruction-Tuning Dataset

A significant contribution of this work is ToolBench, an instruction-tuning dataset designed specifically for tool use. ToolBench is constructed automatically with ChatGPT, and its instructions cover more than 16,000 real-world RESTful APIs. Construction proceeds in three stages:

  1. API Collection: Gathering 16,464 APIs spanning 49 categories from RapidAPI Hub.
  2. Instruction Generation: Using ChatGPT to create diverse instructions for these APIs in both single-tool and multi-tool scenarios.
  3. Solution Path Annotation: Using ChatGPT to search for a valid solution path (a chain of API calls) for each instruction, with a novel depth-first search-based decision tree algorithm widening the search.

    Figure 2: Dataset construction stages illustrating API collection, instruction generation, and solution path annotation.
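
The three stages can be pictured as filling in one record per instruction. The sketch below is illustrative only: the class names, field names, and the `annotate` helper are assumptions made for this example and are not ToolBench's actual schema, and the annotation step is a stand-in for the ChatGPT-driven search.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ApiSpec:
    """One collected API (stage 1). Field names are hypothetical."""
    tool: str
    name: str
    category: str


@dataclass
class ToolBenchExample:
    """One instruction (stage 2) plus its annotated call chain (stage 3)."""
    instruction: str
    relevant_apis: list[ApiSpec]
    solution_path: list[str] = field(default_factory=list)

    @property
    def is_multi_tool(self) -> bool:
        # Multi-tool here means the APIs come from more than one tool.
        return len({api.tool for api in self.relevant_apis}) > 1


def annotate(example: ToolBenchExample, calls: list[str]) -> ToolBenchExample:
    """Stage 3 stand-in: attach a chain of API calls found by an annotator."""
    known = {api.name for api in example.relevant_apis}
    unknown = [call for call in calls if call not in known]
    if unknown:
        raise ValueError(f"calls outside the sampled APIs: {unknown}")
    example.solution_path = list(calls)
    return example


# Hypothetical APIs and instruction, invented for illustration.
weather = ApiSpec(tool="weather_tool", name="get_forecast", category="Weather")
flights = ApiSpec(tool="flight_tool", name="search_flights", category="Travel")
example = ToolBenchExample(
    instruction="Find a flight to Oslo and tell me the forecast there.",
    relevant_apis=[weather, flights],
)
annotate(example, ["search_flights", "get_forecast"])
print(example.is_multi_tool, example.solution_path)
```

Keeping the solution path as an ordered list of call names mirrors the "chain of API calls" wording in the abstract; a real record would also need the call arguments and API responses.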

Decision Tree Algorithm for Enhanced Reasoning

The paper introduces a depth-first search-based decision tree (DFSDT) algorithm to strengthen the reasoning capabilities of LLMs. DFSDT lets the model evaluate multiple reasoning traces and so explores a larger search space than conventional methods such as Chain-of-Thought (CoT) and ReACT. Because it can abandon a failing branch and try another instead of committing to a single chain, DFSDT finds valid solution paths for complex instructions where those methods fail.
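
The difference between committing to one chain and backtracking over a tree can be shown with a toy example. This is a simplified sketch, not the paper's implementation: `TOY_TREE`, `propose`, `is_solution`, and `single_chain` are hypothetical stand-ins, with a lookup table playing the role of the LLM that proposes next actions.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Path = Tuple[str, ...]

# Hypothetical decision tree: each partial path maps to candidate next calls.
TOY_TREE: Dict[Path, List[str]] = {
    (): ["search_hotels", "search_flights"],
    ("search_hotels",): ["book_hotel"],  # this branch leads to a dead end
    ("search_flights",): ["get_prices"],
    ("search_flights", "get_prices"): ["finish"],
}


def propose(path: Path) -> Sequence[str]:
    """Stand-in for the model proposing next actions from the current path."""
    return TOY_TREE.get(path, [])


def is_solution(path: Path) -> bool:
    return bool(path) and path[-1] == "finish"


def dfs_decision_tree(
    propose: Callable[[Path], Sequence[str]],
    is_solution: Callable[[Path], bool],
    max_depth: int = 4,
    path: Path = (),
) -> Optional[Path]:
    """Depth-first search that backtracks when a branch cannot be completed."""
    if is_solution(path):
        return path
    if len(path) >= max_depth:
        return None
    for action in propose(path):
        found = dfs_decision_tree(propose, is_solution, max_depth, path + (action,))
        if found is not None:
            return found
    return None  # every child failed, so the caller tries its next sibling


def single_chain(
    propose: Callable[[Path], Sequence[str]],
    is_solution: Callable[[Path], bool],
    max_depth: int = 4,
) -> Optional[Path]:
    """Baseline that always takes the first proposal and never backtracks."""
    path: Path = ()
    while len(path) < max_depth and not is_solution(path):
        candidates = propose(path)
        if not candidates:
            return None
        path += (candidates[0],)
    return path if is_solution(path) else None


print(single_chain(propose, is_solution))       # None
print(dfs_decision_tree(propose, is_solution))  # ('search_flights', 'get_prices', 'finish')
```

The single-chain baseline gets stuck after `book_hotel`, while the depth-first search returns to the root and succeeds through the second branch. The cost is extra calls on the failed branch, which is why a depth limit is included.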

ToolLLaMA and ToolEval

ToolLLaMA, a fine-tuned version of LLaMA, is developed using ToolBench and enhanced with a neural API retriever. This retriever aids in recommending appropriate APIs for each instruction, increasing the model's effectiveness in executing complex tasks. Experiments indicate that ToolLLaMA rivals the performance of state-of-the-art closed-source models like ChatGPT, particularly in scenarios involving unseen APIs, thereby showcasing strong zero-shot generalization.
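
The retrieval step amounts to ranking API descriptions by similarity to the instruction and passing the top few to the model. The sketch below deliberately swaps the neural retriever for a bag-of-words cosine similarity so that it runs with the standard library alone; `API_DOCS`, `embed`, and `retrieve` are hypothetical names, and a learned dense encoder would replace `embed` in practice.

```python
import math
from collections import Counter
from typing import Dict, List

# Hypothetical API descriptions, invented for illustration.
API_DOCS: Dict[str, str] = {
    "get_forecast": "return the weather forecast for a city",
    "search_flights": "search flights between two airports",
    "convert_currency": "convert an amount between two currencies",
}


def embed(text: str) -> Counter:
    """Toy embedding: token counts. A neural encoder would go here instead."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(instruction: str, api_docs: Dict[str, str], k: int = 2) -> List[str]:
    """Return the names of the k APIs whose descriptions best match the instruction."""
    query = embed(instruction)
    ranked = sorted(
        api_docs,
        key=lambda name: cosine(query, embed(api_docs[name])),
        reverse=True,
    )
    return ranked[:k]


print(retrieve("what is the weather forecast in Oslo", API_DOCS, k=1))
```

Only the ranking logic carries over to a real system. With thousands of candidate APIs, the descriptions would normally be embedded once and indexed rather than re-encoded per query as done here.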

Additionally, the paper introduces ToolEval, an automated evaluation framework to assess tool-use capabilities with two key metrics: pass rate and win rate. ToolEval, validated for high correlation with human evaluations, offers a scalable solution for benchmarking LLM performance in tool-use tasks.

Figure 3: ToolEval schematic illustrating the evaluation process for a model's tool-use capabilities.
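
Both metrics reduce to a proportion over per-instruction judgments. The sketch below assumes that pass rate is the share of instructions a model completes and that win rate is the share of pairwise comparisons in which its solution is preferred over a reference model's. The judgments are made-up values, not results from the paper, and in ToolEval they would come from the automatic evaluator rather than being hard-coded.

```python
from typing import Sequence


def rate(judgments: Sequence[bool]) -> float:
    """Fraction of True judgments; refuses an empty evaluation set."""
    if not judgments:
        raise ValueError("need at least one judgment")
    return sum(judgments) / len(judgments)


# Made-up judgments for four instructions (illustration only).
passed = [True, True, False, True]      # did the model complete the instruction?
preferred = [True, False, False, True]  # was its solution preferred over the reference?

pass_rate = rate(passed)
win_rate = rate(preferred)
print(f"pass rate = {pass_rate:.2f}, win rate = {win_rate:.2f}")
```

How ties or unjudgeable comparisons are counted is left out here and would need an explicit convention before win rates from different runs are compared.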

Experimental Results

The experiments show that ToolLLaMA handles both single-tool and multi-tool instructions well, outperforming models such as Claude-2 and approaching the performance of ChatGPT. Using DFSDT as the reasoning strategy yields better outcomes than ReACT. Furthermore, ToolLLaMA demonstrates strong zero-shot generalization on an out-of-distribution tool-use dataset, APIBench.

Conclusion

This research marks significant progress in equipping open-source LLMs with advanced tool-use skills. By integrating novel data construction methods, enhanced reasoning algorithms, and robust evaluation metrics, ToolLLM effectively bridges the capability gap between open-source and closed-source models. The flexibility and generalization ability of ToolLLaMA open new avenues for AI systems that can dynamically interact with a diverse range of real-world applications, paving the way for future developments in autonomous decision-making and API-driven AI models.
