Abstract

We introduce AnyTool, a large language model agent designed to revolutionize the utilization of a vast array of tools in addressing user queries. We utilize over 16,000 APIs from Rapid API, operating under the assumption that a subset of these APIs could potentially resolve the queries. AnyTool primarily incorporates three elements: an API retriever with a hierarchical structure, a solver aimed at resolving user queries using a selected set of API candidates, and a self-reflection mechanism, which re-activates AnyTool if the initial solution proves impracticable. AnyTool is powered by the function calling feature of GPT-4, eliminating the need for training external modules. We also revisit the evaluation protocol introduced by previous works and identify a limitation in this protocol that leads to an artificially high pass rate. By revising the evaluation protocol to better reflect practical application scenarios, we introduce an additional benchmark, termed AnyToolBench. Experiments across various datasets demonstrate the superiority of our AnyTool over strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization. For instance, AnyTool outperforms ToolLLM by +35.4% in terms of average pass rate on ToolBench. Code will be available at https://github.com/dyabel/AnyTool.

AnyTool resolves queries using 16k+ APIs, incorporating a hierarchical retriever, solver, and self-reflection mechanism.

Overview

  • AnyTool is an LLM agent that draws on over 16,000 APIs from Rapid API to improve how LLMs respond to user queries.

  • The model features a hierarchical API retriever that sorts through APIs efficiently and a self-reflection mechanism that refines the search process.

  • AnyTool achieves better results than existing models, with up to a 20% improvement in pass rates after 4-6 rounds of self-reflection.

  • The AnyToolBench benchmark and revised evaluation protocol demonstrate AnyTool's superior performance with a +35.4% lead in average pass rate over other models.

Introduction

AnyTool represents a significant contribution to the field of LLMs by introducing an agent that leverages over 16,000 APIs to address user queries without training any external modules. The agent integrates a hierarchical API retriever, a solver, and a self-reflection mechanism, which together form a closed-loop system for efficient query resolution. AnyTool outperforms existing approaches, most notably achieving a +35.4% improvement in average pass rate over ToolLLM on ToolBench.

Hierarchical API Retriever

The essence of AnyTool lies in its API retriever, which employs a hierarchical structure to search a large collection of APIs efficiently. Inspired by the divide-and-conquer strategy, the retriever consists of meta-agents, category agents, and tool agents, which sequentially narrow the search space by leveraging the API organization defined by Rapid API. This design significantly mitigates the constraints imposed by the maximum context length of LLMs. Across the evaluated datasets, pass rates improve as the number of self-reflection rounds increases, with gains of up to 20% after 4-6 rounds.
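
The sketch below illustrates the divide-and-conquer idea under some assumptions: a Rapid API-style hierarchy of categories, tools, and APIs, and a placeholder relevance check. The names (Category, Tool, API, relevant, retrieve) and the keyword-matching heuristic are illustrative only; AnyTool delegates these decisions to GPT-4 agents via function calling rather than local rules.

```python
# Minimal sketch of a hierarchical (divide-and-conquer) API retriever.
# Assumed structure: categories -> tools -> APIs, as organized by Rapid API.
from dataclasses import dataclass, field


@dataclass
class API:
    name: str
    description: str


@dataclass
class Tool:
    name: str
    apis: list[API] = field(default_factory=list)


@dataclass
class Category:
    name: str
    tools: list[Tool] = field(default_factory=list)


def relevant(query: str, description: str) -> bool:
    # Placeholder relevance check; in AnyTool this judgment is made by a
    # GPT-4 agent, not by keyword overlap.
    return any(word in description.lower() for word in query.lower().split())


def retrieve(query: str, categories: list[Category], budget: int = 64) -> list[API]:
    """Narrow the search hierarchically: the meta level filters categories,
    the category level filters tools, and the tool level selects APIs,
    so no single agent ever sees the full 16k+ API pool."""
    candidates: list[API] = []
    for category in categories:            # meta-agent level
        if not relevant(query, category.name):
            continue
        for tool in category.tools:        # category-agent level
            if not relevant(query, tool.name):
                continue
            for api in tool.apis:          # tool-agent level
                if relevant(query, api.description):
                    candidates.append(api)
                if len(candidates) >= budget:
                    return candidates
    return candidates
```

Keeping only a bounded candidate list at each level is what lets the approach sidestep the context-length limits that a flat search over all APIs would hit.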

Self-Reflection Mechanism

In addition to the hierarchical retriever, AnyTool features a self-reflection mechanism activated when initial solutions fail. It allows AnyTool to consider reasons for failure and previous context, leading to refined search strategies and reducing the propensity for "over-search". AnyTool's self-reflection is applied to both the API retriever and the solver, refining their operations continuously to improve overall performance.
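
A minimal sketch of the closed retrieve-solve-reflect loop is shown below. The helper callables (retrieve_apis, solve, is_solved) and the textual failure context are hypothetical stand-ins; AnyTool implements these steps with GPT-4 function calling rather than local functions.

```python
# Sketch of the closed-loop control flow with self-reflection.
from typing import Callable, Optional


def answer_query(
    query: str,
    retrieve_apis: Callable[[str, list[str]], list[str]],  # hierarchical retriever (hypothetical signature)
    solve: Callable[[str, list[str], list[str]], str],      # solver over candidate APIs
    is_solved: Callable[[str, str], bool],                  # e.g. a GPT-4 check of the answer
    max_reflections: int = 6,
) -> Optional[str]:
    """Retrieve candidates, attempt a solution, and on failure feed the
    failure context back into the next retrieval/solving round."""
    context: list[str] = []  # accumulated failure reasons and prior candidates
    for round_id in range(max_reflections + 1):
        apis = retrieve_apis(query, context)
        solution = solve(query, apis, context)
        if is_solved(query, solution):
            return solution
        # Self-reflection: record why this round failed so the next round can
        # broaden or redirect the API search instead of repeating it.
        context.append(f"round {round_id}: candidates {apis} did not solve the query")
    return None
```

Because the failure context conditions both the retriever and the solver, each round searches more selectively, which is what curbs the "over-search" behavior noted above.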

Evaluation Protocol & Benchmarks

AnyTool proposes a revised evaluation protocol for user query resolution, addressing a flaw in previous methodologies in which queries misclassified as "non-solvable" inflated the pass rate. By introducing AnyToolBench, a supplementary benchmark whose queries are manually reviewed to ensure they are solvable with specific APIs, AnyTool demonstrates that it outperforms strong baselines such as ToolLLM and a GPT-4 variant tailored for tool utilization, by a margin of +35.4% in average pass rate on ToolBench.
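
The toy comparison below illustrates one hedged reading of the scoring change: under the earlier, lenient protocol an agent that declares a query non-solvable is not penalized, whereas under the revised protocol every benchmark query is verified solvable in advance, so such declarations count as failures. The outcome labels and function name are hypothetical, not taken from the ToolBench or AnyTool code.

```python
# Illustrative pass-rate computation under the two protocols.
def pass_rate(outcomes: list[str], lenient: bool) -> float:
    """Lenient (previous) protocol: 'declared_unsolvable' is not penalized.
    Revised protocol: queries are pre-verified as solvable, so declaring
    them unsolvable counts as a failure."""
    passes = 0
    for outcome in outcomes:
        if outcome == "solved":
            passes += 1
        elif outcome == "declared_unsolvable" and lenient:
            passes += 1  # source of the artificially high pass rate
    return passes / len(outcomes) if outcomes else 0.0


outcomes = ["solved", "declared_unsolvable", "failed", "solved"]
print(pass_rate(outcomes, lenient=True))   # 0.75 under the previous protocol
print(pass_rate(outcomes, lenient=False))  # 0.50 under the revised protocol
```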

Conclusion

AnyTool has set a new standard for tool utilization in LLMs, providing a compelling model that efficiently combines thousands of APIs to address complex user queries. Its hierarchical structure and self-reflective mechanism not only simplify the retrieval process but also significantly enhance the problem-solving abilities of LLMs. This achievement is firmly substantiated by its robust numerical results, and AnyTool's code availability offers the research community a valuable resource to further explore and expand upon its innovative approach.
