AgentInstruct: Toward Generative Teaching with Agentic Flows (2407.03502v1)
Abstract: Synthetic data is becoming increasingly important for accelerating the development of LLMs, both large and small. Despite several successful use cases, researchers have also raised concerns around model collapse and the drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically creating data with powerful models to teach a new skill or behavior to another model; we refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and responses, using only raw data sources like text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post-training dataset of 25 million pairs to teach LLMs different skills, such as text editing, creative writing, tool usage, coding, and reading comprehension. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b on this data. Comparing the resulting model, Orca-3, to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks: for example, a 40% improvement on AGIEval, 19% on MMLU, 54% on GSM8K, 38% on BBH, and 45% on AlpacaEval. Orca-3 also consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.
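The abstract describes AgentInstruct as an agentic flow that turns raw seed documents into prompt-response pairs for instruction tuning. As a rough illustration of how such a pipeline might be wired up, the sketch below chains three hypothetical stages (content transformation, seed instruction generation, and instruction refinement). Note that `call_llm`, the stage prompts, and all function names are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for any chat-completion client (assumption)."""
    raise NotImplementedError("plug in a real model client here")


@dataclass
class InstructionPair:
    prompt: str
    response: str


def content_transformation(seed_text: str) -> str:
    # Stage 1: rewrite the raw seed (document or code file) into an
    # intermediate passage that is easier to derive instructions from.
    return call_llm(
        "Rewrite the following text as a concise passage suitable for "
        f"question generation:\n{seed_text}"
    )


def seed_instruction_generation(passage: str, n: int = 3) -> list[str]:
    # Stage 2: ask the model for diverse task prompts grounded in the passage.
    out = call_llm(
        f"Write {n} diverse instructions answerable from this passage, "
        f"one per line:\n{passage}"
    )
    return [line.strip() for line in out.splitlines() if line.strip()]


def instruction_refinement(instruction: str) -> str:
    # Stage 3: increase the complexity of a candidate instruction.
    return call_llm(
        "Make this instruction more challenging while keeping it "
        f"answerable:\n{instruction}"
    )


def generate_pairs(seed_text: str) -> list[InstructionPair]:
    # End-to-end flow: seed -> transformed passage -> refined instructions
    # -> (prompt, response) pairs usable for instruction tuning.
    passage = content_transformation(seed_text)
    pairs = []
    for instruction in seed_instruction_generation(passage):
        refined = instruction_refinement(instruction)
        response = call_llm(
            f"Passage:\n{passage}\n\nInstruction: {refined}\n"
            "Respond with a high-quality answer."
        )
        pairs.append(InstructionPair(prompt=refined, response=response))
    return pairs
```

In the full framework each stage is itself a flow of multiple specialized agents; the single-call functions here only gesture at that structure.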
- Phi-3 technical report: A highly capable language model locally on your phone, 2024. URL https://arxiv.org/abs/2404.14219.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- CodeParrot. GitHub-code clean dataset, 2022. https://huggingface.co/datasets/codeparrot/github-code-clean [Accessed 06/15/2024].
- Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023.
- DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1246. URL https://aclanthology.org/N19-1246.
- Query of CC: Unearthing large-scale domain-specific knowledge from public corpora. arXiv preprint arXiv:2401.14624, 2024.
- The false promise of imitating proprietary LLMs, 2023. URL https://arxiv.org/abs/2305.15717.
- Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- Camels in a changing climate: Enhancing LM adaptation with Tulu 2, 2023. URL https://arxiv.org/abs/2311.10702.
- Mistral 7B, 2023.
- RLAIF: Scaling reinforcement learning from human feedback with AI feedback, 2023. URL https://arxiv.org/abs/2309.00267.
- CAMEL: Communicative agents for "mind" exploration of large language model society, 2023a. URL https://arxiv.org/abs/2303.17760.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023b.
- Benchmarking generation and evaluation capabilities of large language models for instruction controllable summarization, 2023. URL https://arxiv.org/abs/2311.09184.
- LMSYS. MT-Bench, 2023. URL https://huggingface.co/spaces/lmsys/mt-bench/tree/cf27f9f9da48f72169bce3c3e784d24347d1e833/data/mt_bench/model_answer.
- Loubna Ben Allal, Anton Lozhkov, Daniel van Strien. Cosmopedia: How to create large-scale synthetic data for pre-training, 2024. URL https://huggingface.co/blog/cosmopedia.
- Orca 2: Teaching small language models how to reason, 2023. URL https://arxiv.org/abs/2311.11045.
- Orca-Math: Unlocking the potential of SLMs in grade school math. arXiv preprint arXiv:2402.14830, 2024.
- XtremeDistil: Multi-stage distillation for massive multilingual models, 2020.
- Orca: Progressive learning from complex explanation traces of GPT-4. arXiv preprint arXiv:2306.02707, 2023.
- OpenAI. GPT-4 technical report, 2023.
- Samuel J. Paech. EQ-Bench: An emotional intelligence benchmark for large language models, 2024. URL https://arxiv.org/abs/2312.06281.
- Instruction tuning with GPT-4, 2023. URL https://arxiv.org/abs/2304.03277.
- InfoBench: Evaluating instruction following ability in large language models, 2024. URL https://arxiv.org/abs/2401.03601.
- ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023.
- GPQA: A graduate-level Google-proof Q&A benchmark, 2023. URL https://arxiv.org/abs/2311.12022.
- Direct nash optimization: Teaching language models to self-improve with general preferences, 2024. URL https://arxiv.org/abs/2404.03715.
- The curse of recursion: Training on generated data makes models forget, 2024. URL https://arxiv.org/abs/2305.17493.
- Re(gEx|DoS)Eval: Evaluating generated regular expressions and their proneness to DoS attacks. In Proceedings of the 46th International Conference on Software Engineering, NIER Track (ICSE-NIER '24), 2024. doi: 10.1145/3639476.3639757.
- Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- ACI-Bench: A novel ambient clinical intelligence dataset for benchmarking automatic visit note generation, 2023.
- AutoGen: Enabling next-gen LLM applications via multi-agent conversation, 2023. URL https://arxiv.org/abs/2308.08155.
- FoFo: A benchmark to evaluate LLMs' format-following capability, 2024. URL https://arxiv.org/abs/2402.18667.
- Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178, 2024.
- WizardLM: Empowering large language models to follow complex instructions, 2023.
- MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- AutoMathText: Autonomous data selection with language models for mathematical texts. arXiv preprint arXiv:2402.07625, 2024.
- AGIEval: A human-centric benchmark for evaluating foundation models, 2023.
- Instruction-following evaluation for large language models, 2023. URL https://arxiv.org/abs/2311.07911.
Authors: Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-Ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah