TaskBench: Benchmarking Large Language Models for Task Automation (2311.18760v4)
Abstract: In recent years, the remarkable progress of large language models (LLMs) has sparked interest in task automation, in which a complex task described by a user instruction is decomposed into sub-tasks and external tools are invoked to execute them; this capability plays a central role in autonomous agents. However, systematic and standardized benchmarks to drive the development of LLMs for task automation are lacking. To address this, we introduce TaskBench, a comprehensive framework for evaluating the capability of LLMs in task automation. Specifically, task automation can be divided into three critical stages: task decomposition, tool selection, and parameter prediction. To tackle the complexities inherent in these stages, we introduce the concept of a Tool Graph to represent decomposed tasks and adopt a back-instruct method to generate high-quality user instructions. We also propose TaskEval, a multi-faceted evaluation methodology that assesses LLM performance across all three stages. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation. Experimental results demonstrate that TaskBench effectively reflects the capabilities of various LLMs in task automation: it provides insights into model performance across different task complexities and domains, and highlights the limits of what current models can achieve. TaskBench offers a scalable, adaptable, and reliable benchmark for advancing LLM-based autonomous agents.
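To make the three stages concrete, below is a minimal Python sketch of how a decomposed task might be represented as a tool graph. The schema, the class names (`ToolNode`, `ToolGraph`), and the example tools and parameters are illustrative assumptions for this sketch, not TaskBench's actual data format:

```python
from dataclasses import dataclass
from typing import Any

# Sketch of the three stages over a tool graph:
#   1. task decomposition  -> a set of sub-tasks
#   2. tool selection      -> nodes (tools) and edges (dependencies)
#   3. parameter prediction -> argument values for each selected tool

@dataclass
class ToolNode:
    name: str                   # selected tool (stage 2)
    parameters: dict[str, Any]  # predicted arguments (stage 3)

@dataclass
class ToolGraph:
    nodes: list[ToolNode]
    # Directed edges (source, target): source's output feeds target's input.
    edges: list[tuple[str, str]]

# Hypothetical instruction: "Summarize this video, then translate the
# summary into French." Tool names and parameters are invented here.
graph = ToolGraph(
    nodes=[
        ToolNode("video_to_text", {"video": "input.mp4"}),
        ToolNode("text_summarizer", {"max_length": 100}),
        ToolNode("translator", {"target_lang": "fr"}),
    ],
    edges=[
        ("video_to_text", "text_summarizer"),
        ("text_summarizer", "translator"),
    ],
)
```

Under a representation like this, task decomposition fixes the set of sub-tasks, tool selection determines the nodes and edges, and parameter prediction fills in each node's arguments, so an evaluation along the lines of TaskEval could score each stage separately against a reference graph.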