GTA: A Benchmark for General Tool Agents

Published 11 Jul 2024 in cs.CL and cs.AI | (2407.08713v2)

Abstract: Significant focus has been placed on integrating LLMs with various tools in developing general-purpose agents. This poses a challenge to LLMs' tool-use capabilities. However, there are evident gaps between existing tool-use evaluations and real-world scenarios. Current evaluations often use AI-generated queries, single-step tasks, dummy tools, and text-only interactions, failing to reveal the agents' real-world problem-solving abilities effectively. To address this, we propose GTA, a benchmark for General Tool Agents, featuring three main aspects: (i) Real user queries: human-written queries with simple real-world objectives but implicit tool-use, requiring the LLM to reason the suitable tools and plan the solution steps. (ii) Real deployed tools: an evaluation platform equipped with tools across perception, operation, logic, and creativity categories to evaluate the agents' actual task execution performance. (iii) Real multimodal inputs: authentic image files, such as spatial scenes, web page screenshots, tables, code snippets, and printed/handwritten materials, used as the query contexts to align with real-world scenarios closely. We design 229 real-world tasks and executable tool chains to evaluate mainstream LLMs. Our findings show that real-world user queries are challenging for existing LLMs, with GPT-4 completing less than 50% of the tasks and most LLMs achieving below 25%. This evaluation reveals the bottlenecks in the tool-use capabilities of current LLMs in real-world scenarios, which provides future direction for advancing general-purpose tool agents. The code and dataset are available at https://github.com/open-compass/GTA.


Authors (7)


Summary

The paper introduces a realistic evaluation framework by combining human-authored queries with operational tools and authentic multimodal inputs.
Researchers constructed a dataset of 229 tasks with executable tool chains to rigorously assess LLMs' reasoning and planning capabilities.
Experimental evaluations reveal that even top models like GPT-4 complete under 50% of tasks, highlighting critical challenges in argument prediction and instruction-following.

An Academic Review of "GTA: A Benchmark for General Tool Agents"

The paper "GTA: A Benchmark for General Tool Agents" introduces GTA, a benchmark designed to evaluate the tool-use capabilities of LLMs in real-world scenarios. The study is motivated by growing efforts to integrate LLMs with diverse external tools in order to build general-purpose agents capable of complex problem-solving. Existing benchmarks rely on AI-generated queries, single-step tasks, dummy tools, or text-only interactions, and so fail to reflect genuine task environments. The paper addresses this gap with an evaluation benchmark that more closely mirrors practical use.

Key Contributions

Realistic Evaluation Framework: GTA is built around three main aspects. First, it uses real user queries (human-written, with simple objectives but implicit tool use), which require the model to reason about which tools are suitable and to plan the solution steps. Second, it employs real, deployed tools across the perception, operation, logic, and creativity categories to assess actual task execution. Third, it uses real multimodal inputs, namely authentic image files such as spatial scenes, web page screenshots, tables, code snippets, and printed or handwritten materials, as query context.
Dataset and Methodology: The researchers constructed a dataset of 229 tasks, each paired with an executable tool chain, to test mainstream LLMs. Completing a task requires invoking a sequence of tools in a planned order. The dataset therefore supports detailed analysis of how well LLMs reason and plan when using external tools.
Performance Metrics: The paper introduces fine-grained metrics that capture different aspects of tool execution, including InstAcc (instruction-following accuracy), ToolAcc (tool selection accuracy), ArgAcc (argument accuracy), and SummAcc (accuracy of the summary produced after the tool sequence completes).
Experimental Evaluations: Evaluations of 16 LLMs show that even the most advanced models, such as GPT-4, complete fewer than 50% of the tasks, and most LLMs complete fewer than 25%. The paper identifies instruction-following ability, and in particular producing arguments in the correct format, as the critical bottleneck.
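
To make the step-level metrics above more concrete, here is a minimal Python sketch of how InstAcc, ToolAcc, and ArgAcc could be computed by comparing a predicted tool chain against a reference chain. This is an illustration under stated assumptions, not the paper's evaluation code: the `ToolCall` type, the `step_metrics` function, the tool names `OCR` and `Calculator`, the position-by-position comparison, and the exact-match rule for arguments are all hypothetical. SummAcc is left out because it depends on judging the final answer, which this summary does not describe.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step of a tool chain: a tool name plus its keyword arguments."""
    tool: str
    args: dict = field(default_factory=dict)


def step_metrics(predicted, reference):
    """Compare predicted steps with reference steps, position by position.

    An entry of None in `predicted` (or a missing entry) stands for a step
    whose model output could not be parsed into a tool call. That is an
    assumption of this sketch about what an instruction-following failure is.
    """
    n = len(reference)
    if n == 0:
        raise ValueError("reference chain must not be empty")

    parsed = tool_ok = args_ok = 0
    for i, ref in enumerate(reference):
        pred = predicted[i] if i < len(predicted) else None
        if pred is None:
            continue  # unparseable or missing step: no credit on any metric
        parsed += 1
        if pred.tool == ref.tool:
            tool_ok += 1
            # Arguments only count when the right tool was chosen (assumed).
            if pred.args == ref.args:
                args_ok += 1

    return {"InstAcc": parsed / n, "ToolAcc": tool_ok / n, "ArgAcc": args_ok / n}


# Hypothetical two-step chain: read an image, then compute a value.
reference = [
    ToolCall("OCR", {"image": "receipt.png"}),
    ToolCall("Calculator", {"expression": "12.5*3"}),
]
predicted = [
    ToolCall("OCR", {"image": "receipt.png"}),
    ToolCall("Calculator", {"expression": "12.5 * 3"}),  # same tool, different string
]

scores = step_metrics(predicted, reference)
print(scores)  # {'InstAcc': 1.0, 'ToolAcc': 1.0, 'ArgAcc': 0.5}
```

In this example the model picks the right tool at every step but loses half of its argument score to a formatting difference. Strict matching is a deliberate choice here: it shows how an argument-format error can lower the score even when tool selection is correct. A real harness might normalize arguments before comparing them.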

Insights and Implications

The findings suggest that better argument prediction is crucial for improving tool-augmented LLMs. The large performance gap between API-based models and their open-source counterparts also points to areas where open-source systems can improve, particularly instruction adherence and the balance between aggressive and conservative tool-use strategies.

Future Directions

The research has implications for developing more robust LLMs that can act as autonomous, context-aware agents in dynamic environments. The benchmark lays the groundwork for future research on integrating reasoning and execution models. Extending such benchmarks to multilingual settings could further broaden their applicability.

The study clarifies the current limitations and potential of LLM tool use, and encourages continued work toward capable, general-purpose AI agents.

In summary, the paper grounds its findings in a rigorous experimental setup and offers useful insight into the challenges LLMs face on real-world tasks, marking a step toward more comprehensive agent system development.
