- The paper introduces AutoIF, a fully automated self-play framework that generates reliable instruction-following training data, improving LLM instruction-following performance.
- It combines execution feedback with multiple verification steps, including back-translation and cross-verification, to ensure high data quality for both SFT and RLHF.
- Evaluations on Qwen2 and LLaMA3 models show significant gains in instruction-following accuracy, demonstrating scalability, robustness, and data efficiency.
Self-play with Execution Feedback: Improving Instruction-following Capabilities of LLMs
Introduction
The paper introduces AutoIF, a scalable and automated framework for generating high-quality instruction-following data for LLMs without manual annotation. AutoIF leverages self-play and execution feedback, transforming the validation of instruction-following data into a code verification problem. The system automatically generates instructions, corresponding verification code, and unit tests, then uses execution feedback-based rejection sampling to produce data for both supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). The approach is evaluated on leading open-source LLMs (Qwen2, LLaMA3) and yields substantial improvements on instruction-following benchmarks, including IFEval and FollowBench.
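To make the pipeline concrete, here is a minimal sketch of the loop in Python. The `llm` and `run_code` callables are hypothetical interfaces (any text-in/text-out model call and any sandboxed Python executor returning a dict with `ok` and `stdout` keys); the prompts and helper names are illustrative, not the authors' released code, and the quality filters described in the following sections are omitted for brevity.

```python
from typing import Callable, Dict, List

def autoif_pipeline(
    seed_instructions: List[str],
    queries: List[str],
    llm: Callable[[str], str],        # text-in / text-out model call (hypothetical interface)
    run_code: Callable[[str], Dict],  # sandboxed executor returning {"ok": bool, "stdout": str}
) -> List[Dict]:
    """Sketch of the AutoIF loop: augment instructions, have the LLM write verification
    code, and keep only (instruction, query, response) triples that pass execution."""
    data: List[Dict] = []
    for inst in seed_instructions:
        # 1) Self-Instruct-style augmentation of the seed instruction.
        variants = [inst] + llm(f"Rewrite this instruction in three different ways:\n{inst}").splitlines()
        for v in variants:
            # 2) Ask the LLM for a verification function for this instruction.
            verifier = llm(f"Write a Python function check(response) -> bool that verifies: {v}")
            # 3) Discard instructions whose verifier does not even compile/run.
            if not run_code(verifier).get("ok", False):
                continue
            # 4) Rejection sampling: keep responses the verifier accepts.
            for q in queries:
                response = llm(f"{v}\n{q}")
                result = run_code(verifier + f"\nprint(check({response!r}))")
                if result.get("stdout", "").strip() == "True":
                    data.append({"instruction": v, "query": q, "response": response})
    return data
```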
AutoIF Framework
AutoIF consists of two main stages: instruction augmentation/verification and query augmentation/verification. The process is designed to be fully automated and scalable, relying on LLMs for all data generation and validation steps.
Instruction Augmentation and Verification
- Seed Instruction Construction: Begin with a small set of hand-written atomic instructions.
- Self-Instruct: Use LLMs to rewrite and augment seed instructions, expanding the instruction set.
- Automated Quality Cross Verification: For each instruction, LLMs generate verification functions and test cases. Python executors check that the code compiles and that verification functions and test cases agree with each other, filtering out low-quality samples (a sketch of this step follows the list).
- Back-translation Verification: Verification functions are back-translated into instructions using LLMs. Natural language inference (NLI) models ensure semantic consistency between original and back-translated instructions, discarding contradictory pairs.
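A sketch of the two verification steps above, under the assumption that each candidate verification function defines `check(response)` and that an NLI model is available as a callable returning "entailment", "neutral", or "contradiction". The `min_accuracy` threshold and function names are illustrative; in practice the code would run inside a sandboxed executor rather than a bare `exec`.

```python
import textwrap
from typing import Callable, List, Tuple

def cross_verify(
    candidate_funcs: List[str],          # LLM-written verification functions (Python source)
    test_cases: List[Tuple[str, bool]],  # LLM-written (response, expected_label) pairs
    min_accuracy: float = 0.8,           # illustrative threshold, not a value from the paper
) -> List[str]:
    """Keep only verification functions that compile and agree with enough test cases."""
    kept = []
    for src in candidate_funcs:
        namespace: dict = {}
        try:
            # In practice this runs inside a sandbox, not a bare exec().
            exec(textwrap.dedent(src), namespace)
            check = namespace["check"]
            hits = sum(bool(check(resp)) == label for resp, label in test_cases)
        except Exception:
            continue  # discard functions that fail to compile or crash on the tests
        if test_cases and hits / len(test_cases) >= min_accuracy:
            kept.append(src)
    return kept

def backtranslation_consistent(
    original_instruction: str,
    back_translated_instruction: str,
    nli: Callable[[str, str], str],  # returns "entailment", "neutral", or "contradiction"
) -> bool:
    """Drop pairs where the back-translated instruction contradicts the original one."""
    return nli(original_instruction, back_translated_instruction) != "contradiction"
```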
Query Augmentation and Verification
- Query Reforming and Augmentation: For each verified instruction, authentic queries are sampled from ShareGPT and further augmented by LLMs.
- Instruction-following Verification: Verification functions assess whether generated responses adhere to instruction constraints.
- Query Quality Verification: LLMs assign a matching score to each (instruction, query, response) triple; samples scoring below a threshold are filtered out, ensuring high-quality training data (see the filtering sketch after this list).
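A sketch of execution-feedback rejection sampling over (instruction, query, response) samples, assuming per-instruction `check()` functions obtained from the previous stage and an LLM-backed `score_fn`; the `min_score` value is an assumed placeholder, not a threshold reported in the paper.

```python
from typing import Callable, Dict, List

def filter_sft_samples(
    samples: List[Dict],                          # each: {"instruction", "query", "response"}
    verifiers: Dict[str, Callable[[str], bool]],  # instruction -> its check() function
    score_fn: Callable[[Dict], float],            # LLM-assigned instruction/query/response match score
    min_score: float = 8.0,                       # assumed placeholder threshold
) -> List[Dict]:
    """Execution-feedback rejection sampling: keep a sample only if the verification
    function accepts the response and the quality score clears the threshold."""
    kept = []
    for sample in samples:
        check = verifiers.get(sample["instruction"])
        if check is None:
            continue
        try:
            passes = bool(check(sample["response"]))
        except Exception:
            passes = False  # a crashing verifier rejects the sample
        if passes and score_fn(sample) >= min_score:
            kept.append(sample)
    return kept
```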

Figure 1: The two left panels show the quality ablation studies on instructions and queries; the two right panels show the scaling analysis of SFT data and DPO pairs.
Training Strategies
AutoIF supports multiple training paradigms:
- Supervised Fine-tuning (SFT): Standard cross-entropy loss on (instruction, query, response) triples.
- Offline DPO: Pairwise preference data is mined from verification results (pair construction sketched below), enabling Direct Preference Optimization (DPO) after SFT.
- Iterative Online DPO: On-policy training with self-sampled responses and verification feedback, iteratively improving instruction-following capabilities.
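A sketch of how preference pairs for DPO could be mined from execution feedback: responses that pass the instruction's verification function become "chosen" and failing ones "rejected". The full cross-product pairing is an illustrative choice, not necessarily the authors' exact scheme; for iterative online DPO, the same mining step would be re-run on responses sampled from the current policy after each round.

```python
from itertools import product
from typing import Callable, List, Tuple

def mine_dpo_pairs(
    prompt: str,
    sampled_responses: List[str],
    check: Callable[[str], bool],  # execution-feedback verifier for this instruction
) -> List[Tuple[str, str, str]]:
    """Build (prompt, chosen, rejected) preference pairs: responses that pass the
    verification function are treated as chosen, failing ones as rejected."""
    passed, failed = [], []
    for response in sampled_responses:
        try:
            (passed if check(response) else failed).append(response)
        except Exception:
            failed.append(response)  # a crashing verifier counts as a failure
    return [(prompt, good, bad) for good, bad in product(passed, failed)]
```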
Experimental Results
AutoIF is evaluated on Qwen2 and LLaMA3 models across both strong-to-weak distillation and self-alignment settings. The main findings are:
- Instruction-following Performance: AutoIF yields significant improvements in IFEval and FollowBench benchmarks. Qwen2-72B and LLaMA3-70B achieve loose instruction accuracy rates of 88.0% and 90.4%, respectively, surpassing previous open-source models and even proprietary baselines in some metrics.
- On-policy vs. Off-policy: Online DPO (on-policy) consistently outperforms offline DPO, confirming the efficacy of iterative execution feedback for targeted model improvement.
- Scaling Effects: Larger models benefit more from AutoIF, with greater absolute improvements in instruction-following metrics.
- Generalization: AutoIF preserves or slightly improves general abilities (MMLU, C-Eval), mathematical reasoning (GSM8k), and coding (HumanEval), indicating no trade-off between instruction-following and other capabilities.
Ablation and Analysis
- Supervision Model Strength: Using a stronger supervision model (e.g., GPT-4) for data synthesis further boosts alignment, with >15% gains in some metrics.
- Quality Control: Ablation studies show that removing any quality filtering step (cross-verification, back-translation, query verification) degrades performance, with cross-verification being most critical.
- Data Quality vs. Quantity: Raising the pass-rate threshold for verification functions improves model performance but reduces data quantity, revealing a quality-quantity trade-off (see the sketch after this list).
- Scaling Analysis: Even with 1/64 of the AutoIF-generated data, models achieve strong performance, demonstrating high data efficiency.
- Contamination Analysis: AutoIF-generated datasets exhibit lower contamination rates than ShareGPT, ensuring robust evaluation.
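The quality-quantity trade-off can be made concrete with a small filter over per-instruction test outcomes; `threshold` plays the role of the pass-rate cutoff discussed above, and the data layout is assumed purely for illustration.

```python
from typing import Dict, List

def filter_by_pass_rate(
    outcomes_per_instruction: Dict[str, List[bool]],  # instruction -> test-case outcomes
    threshold: float,
) -> List[str]:
    """Keep instructions whose verification pass rate meets the threshold: a higher
    threshold yields higher-quality but fewer training instructions."""
    return [
        instruction
        for instruction, outcomes in outcomes_per_instruction.items()
        if outcomes and sum(outcomes) / len(outcomes) >= threshold
    ]
```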
Implementation Details
- Hardware: Training conducted on NVIDIA A100 and H800 GPUs, with Qwen2-7B/LLaMA3-8B on 8 A100s and Qwen2-72B/LLaMA3-70B on 64 H800s.
- Optimization: DeepSpeed ZeRO Stage 3 and Flash-Attention 2 are used for efficient large-scale training.
- Hyperparameters: SFT uses a learning rate of 7e-6, batch size 128 (7B/8B models) or 512 (72B/70B models), 3 epochs, bf16 precision, and context lengths up to 8192 tokens. DPO uses a learning rate of 5e-7, batch size 64, sigmoid loss with beta = 0.3, 2 epochs, and a context length of 4096 tokens (both collected in the config sketch below).
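For reference, the hyperparameters above can be collected into simple config objects; the dataclass and field names are illustrative and not tied to any specific training library.

```python
from dataclasses import dataclass

@dataclass
class SFTConfig:
    learning_rate: float = 7e-6
    batch_size: int = 128   # 512 for the 72B/70B models
    epochs: int = 3
    precision: str = "bf16"
    max_seq_len: int = 8192

@dataclass
class DPOConfig:
    learning_rate: float = 5e-7
    batch_size: int = 64
    loss: str = "sigmoid"
    beta: float = 0.3
    epochs: int = 2
    max_seq_len: int = 4096
```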
Practical Implications
AutoIF enables fully automated, scalable, and reliable construction of instruction-following datasets for LLM alignment. The framework is model-agnostic and can be applied to any LLM with sufficient code generation and reasoning capabilities. By leveraging execution feedback, AutoIF circumvents the limitations of manual annotation and behavior imitation, producing high-quality, verifiable supervision signals. The open-sourcing of AutoIF datasets facilitates reproducibility and further research in instruction-following alignment.
Limitations and Future Directions
The current focus is on atomic, verifiable instructions. Extending AutoIF to handle compositional and cross-instructions—by combining multiple atomic constraints and automating their verification—remains an open challenge. Further research should explore scaling AutoIF to more complex instruction types and integrating advanced semantic verification techniques.
Conclusion
AutoIF represents a robust, scalable approach for enhancing instruction-following capabilities in LLMs via self-play and execution feedback. The method achieves state-of-the-art results on open benchmarks, preserves general model abilities, and provides a reproducible pipeline for automated alignment. Future work should address compositional instruction synthesis and further improve the reliability and coverage of automated verification.