AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback (2305.14387v4)
Abstract: Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their strong instruction-following abilities. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following process requires tackling three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these challenges with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM prompts to simulate human feedback that are 50x cheaper than crowdworkers and display high agreement with humans. Second, we propose an automatic evaluation and validate it on human instructions obtained from real-world interactions. Third, we contribute reference implementations for several methods (PPO, DPO, best-of-n, expert iteration, and more) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of real human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning, and that our reference PPO implementation leads to a +10% improvement in win-rate against Davinci003. We release all components of AlpacaFarm at https://github.com/tatsu-lab/alpaca_farm.
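To make one of the listed methods concrete, the sketch below shows best-of-n reranking against a learned reward model: sample several responses to a prompt and keep the one the reward model scores highest. The `generate` and `reward_model` callables are hypothetical placeholders for this illustration, not the AlpacaFarm API; see the repository for the actual reference implementations.

```python
# Minimal sketch of best-of-n reranking with a reward model (assumed interfaces,
# not the AlpacaFarm codebase).
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],             # draws one sampled response per call
    reward_model: Callable[[str, str], float],  # scores a (prompt, response) pair
    n: int = 16,
) -> str:
    """Sample n candidate responses and return the highest-scoring one."""
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: reward_model(prompt, response))
```

Because it only reranks samples from a fixed policy, best-of-n needs no policy training, which is why it serves as a simple reward-model baseline alongside PPO and expert iteration.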