Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models (2403.12881v1)
Abstract: Open-sourced LLMs have achieved great success in various NLP tasks; however, they remain far inferior to API-based models when acting as agents. How to integrate agent abilities into general LLMs is therefore a crucial and urgent problem. This paper first delivers three key observations: (1) the current agent training corpus entangles format following with agent reasoning, deviating significantly from the distribution of the pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches introduce hallucinations as a side effect of improving agent abilities. Based on these findings, we propose Agent-FLAN to effectively Fine-tune LLMs for Agents. Through careful decomposition and redesign of the training corpus, Agent-FLAN enables Llama2-7B to outperform prior best works by 3.5% across various agent evaluation datasets. With comprehensively constructed negative samples, Agent-FLAN greatly alleviates hallucination issues on our established evaluation benchmark. Moreover, it consistently improves the agent capability of LLMs as model size scales, while slightly enhancing their general capability. The code will be available at https://github.com/InternLM/Agent-FLAN.