Despite the advancements of open-source LLMs, e.g., LLaMA, they remain significantly limited in tool-use capabilities, i.e., using external tools (APIs) to fulfill human instructions. The reason is that current instruction tuning largely focuses on basic language tasks but ignores the tool-use domain. This is in contrast to the excellent tool-use capabilities of state-of-the-art (SOTA) closed-source LLMs, e.g., ChatGPT. To bridge this gap, we introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning dataset for tool use, which is constructed automatically using ChatGPT. Specifically, the construction can be divided into three stages: (i) API collection: we collect 16,464 real-world RESTful APIs spanning 49 categories from RapidAPI Hub; (ii) instruction generation: we prompt ChatGPT to generate diverse instructions involving these APIs, covering both single-tool and multi-tool scenarios; (iii) solution path annotation: we use ChatGPT to search for a valid solution path (chain of API calls) for each instruction. To enhance the reasoning capabilities of LLMs, we develop a novel depth-first search-based decision tree algorithm. It enables LLMs to evaluate multiple reasoning traces and expand the search space. Moreover, to evaluate the tool-use capabilities of LLMs, we develop an automatic evaluator: ToolEval. Based on ToolBench, we fine-tune LLaMA to obtain an LLM ToolLLaMA, and equip it with a neural API retriever to recommend appropriate APIs for each instruction. Experiments show that ToolLLaMA demonstrates a remarkable ability to execute complex instructions and generalize to unseen APIs, and exhibits comparable performance to ChatGPT. Our ToolLLaMA also demonstrates strong zero-shot generalization ability in an out-of-distribution tool-use dataset: APIBench.
A framework named ToolLLM was developed to improve open-source LLMs' ability to work with a wide range of real-world APIs, reaching performance comparable to closed-source models such as ChatGPT.
ToolBench, an instruction-tuning dataset covering more than 16,000 RESTful APIs, was built by collecting APIs, generating instructions, and annotating solution paths, so that trained LLMs can execute API calls and generalize to APIs not seen during training.
An evaluation system called ToolEval measures an LLM's proficiency in executing instructions, with the fine-tuned model ToolLLaMA showing strong performance and generalization.
ToolLLaMA adapts effectively to unseen instructions and generalizes to an out-of-distribution dataset; when paired with the API retriever, it can even outperform the setting in which ground-truth APIs are provided.
The study highlights the potential of open-source LLMs for tool use and the impact of datasets and algorithms like ToolBench and DFSDT on the future of LLMs' instruction tuning capabilities.
The integration of LLMs with APIs to accomplish complex tasks has been a focal area of interest in AI research. Open-source models such as LLaMA have shown versatility through various instruction tuning approaches. However, their tool-use capabilities, that is, interacting with external tools or APIs to fulfill complex human instructions, are not yet on par with state-of-the-art (SOTA) closed-source models like ChatGPT. To address this, a framework named ToolLLM has been presented, aimed at enabling open-source LLMs to master a wide array of real-world APIs.
The construction of the ToolBench dataset is a central aspect of this framework. ToolBench is designed to help LLMs learn to execute APIs and generalize to new ones not encountered during training. The dataset spans 16,464 RESTful APIs across 49 categories, collected from RapidAPI Hub, and is built in three stages: collecting APIs, generating diverse instructions, and annotating solution paths (chains of API calls). It covers both single-tool and multi-tool scenarios and is constructed automatically using ChatGPT, minimizing the need for human supervision. A depth-first search-based decision tree (DFSDT) algorithm strengthens LLMs' reasoning by letting them evaluate multiple reasoning traces and expand the search space, improving upon existing reasoning strategies such as ReAct.
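The idea of searching over several reasoning traces, rather than committing to a single one, can be illustrated with a minimal sketch. This is not the authors' DFSDT implementation: it is a plain depth-first search with backtracking over a toy environment, and the function names (`dfs_solution_path`, `propose`, `execute`, `is_goal`), the state strings, and the API names are all hypothetical. In a real system, `propose` would be an LLM suggesting candidate API calls and `execute` would call the actual API.

```python
from typing import Callable, List, Optional, Sequence


def dfs_solution_path(
    state: str,
    propose: Callable[[str], Sequence[str]],
    execute: Callable[[str, str], Optional[str]],
    is_goal: Callable[[str], bool],
    max_depth: int = 4,
    path: Optional[List[str]] = None,
) -> Optional[List[str]]:
    """Return the first chain of actions that reaches a goal state, or None."""
    path = [] if path is None else path
    if is_goal(state):
        return list(path)
    if len(path) >= max_depth:
        return None  # depth budget exhausted on this branch
    for action in propose(state):
        next_state = execute(state, action)
        if next_state is None:
            continue  # simulated API failure: abandon this branch
        path.append(action)
        found = dfs_solution_path(
            next_state, propose, execute, is_goal, max_depth, path
        )
        path.pop()  # backtrack before trying the next candidate
        if found is not None:
            return found
    return None


# Hypothetical toy environment: states are strings, actions are made-up API names.
TRANSITIONS = {
    ("start", "search_flights"): None,  # fails without a city code
    ("start", "get_city_code"): "has_code",
    ("has_code", "search_hotels"): "dead_end",
    ("has_code", "search_flights"): "done",
}


def propose(state: str) -> List[str]:
    """Stand-in for an LLM proposing candidate API calls for a state."""
    return [action for (s, action) in TRANSITIONS if s == state]


def execute(state: str, action: str) -> Optional[str]:
    """Stand-in for calling an API; None models an error response."""
    return TRANSITIONS.get((state, action))


def is_goal(state: str) -> bool:
    return state == "done"


result = dfs_solution_path("start", propose, execute, is_goal)
print(result)  # ['get_city_code', 'search_flights']
```

In this toy run the search first hits a failing call, then follows a branch into a dead end, and recovers both times by backtracking. A strategy that follows a single trace would have stopped at the first failure.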
ToolEval, the automatic evaluator developed alongside ToolBench, offers metrics that quantify an LLM's ability to execute instructions effectively. The fine-tuned LLaMA model, referred to as ToolLLaMA, is equipped with a neural API retriever and executes complex instructions with performance comparable to ChatGPT, while generalizing well even to out-of-distribution tool-use datasets. The neural API retriever removes the need to select APIs manually from a large collection and recommends appropriate APIs for each instruction with high precision.
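The retrieval step can be sketched as ranking API descriptions by similarity to the instruction. The sketch below is an illustration under strong assumptions, not the retriever described in the paper: a bag-of-words count vector stands in for the neural encoder, and the API names, descriptions, and helper functions (`embed`, `cosine`, `retrieve_apis`) are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def embed(text: str) -> Counter:
    """Stand-in for a neural encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_apis(
    instruction: str, api_docs: Dict[str, str], top_k: int = 2
) -> List[str]:
    """Return the names of the top_k APIs whose descriptions best match."""
    query = embed(instruction)
    ranked = sorted(
        api_docs,
        key=lambda name: cosine(query, embed(api_docs[name])),
        reverse=True,
    )
    return ranked[:top_k]


# Hypothetical API collection: names and descriptions are made up.
API_DOCS = {
    "weather_forecast": "get the weather forecast for a city",
    "currency_convert": "convert an amount between two currencies",
    "flight_search": "search flights between two cities",
}

top = retrieve_apis("what is the weather forecast in Paris", API_DOCS, top_k=1)
print(top)  # ['weather_forecast']
```

A learned encoder would replace `embed` so that instructions and API documentation match on meaning rather than on shared words; the ranking logic around it would stay the same.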
ToolLLaMA offers compelling evidence of the adaptability of open-source LLMs to unseen instructions and tools, with results that rival those of its teacher model, ChatGPT. Its generalization extends to an out-of-distribution (OOD) dataset, APIBench, where ToolLLaMA performs strongly despite never having been trained on APIBench's domains. Notably, ToolLLaMA combined with the API retriever surpasses its own performance when given ground-truth APIs, arguably because the retriever can identify more appropriate APIs for a given instruction from the extensive collection.
In conclusion, ToolLLM is a comprehensive framework that gives open-source LLMs strong tool-use capabilities, promoting the democratization of AI technologies and community-driven innovation. The components developed within this framework, including ToolBench, DFSDT, ToolEval, and the neural API retriever, point to the future direction of instruction tuning and tool use in LLMs.