AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback

(arXiv:2305.14387)
Published May 22, 2023 in cs.LG, cs.AI, and cs.CL

Abstract

Large language models (LLMs) such as ChatGPT have seen widespread adoption due to their strong instruction-following abilities. Developing these LLMs involves a complex yet poorly understood workflow requiring training with human feedback. Replicating and understanding this instruction-following requires tackling three major challenges: the high cost of data collection, the lack of trustworthy evaluation, and the absence of reference method implementations. We address these challenges with AlpacaFarm, a simulator that enables research and development for learning from feedback at a low cost. First, we design LLM prompts to simulate human feedback that are 50x cheaper than crowdworkers and display high agreement with humans. Second, we propose an automatic evaluation and validate it against human instructions obtained on real-world interactions. Third, we contribute reference implementations for several methods (PPO, DPO, best-of-n, expert iteration, and more) that learn from pairwise feedback. Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate eleven models on 10k pairs of real human feedback and show that rankings of models trained in AlpacaFarm match rankings of models trained on human data. As a demonstration of the research possible in AlpacaFarm, we find that methods that use a reward model can substantially improve over supervised fine-tuning and that our reference PPO implementation leads to a +10% improvement in win-rate against Davinci003. We release all components of AlpacaFarm at https://github.com/tatsu-lab/alpaca_farm.
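
To make the workflow concrete, below is a minimal Python sketch of how the pieces described in the abstract fit together: simulated pairwise feedback from a prompted LLM judge, best-of-n reranking with a reward model, and win-rate evaluation against a baseline such as Davinci003. The helpers `query_llm_judge`, `policy_sample`, and `reward_model` are hypothetical placeholders, not the AlpacaFarm API; the actual reference implementations live in the repository linked above.

```python
# Minimal sketch (not the AlpacaFarm API): simulated pairwise feedback,
# best-of-n reranking, and win-rate evaluation. `query_llm_judge`,
# `policy_sample`, and `reward_model` are hypothetical callables
# supplied by the user.

import random


def simulated_preference(instruction, output_a, output_b, query_llm_judge):
    """Return 0 if the judge prefers output_a, else 1.

    Stands in for the paper's idea of replacing crowdworker annotators with
    prompted API LLMs; the prompt wording and judge model are placeholders.
    """
    prompt = (
        f"Instruction: {instruction}\n"
        f"Output (a): {output_a}\n"
        f"Output (b): {output_b}\n"
        "Which output better follows the instruction? Answer 'a' or 'b'."
    )
    answer = query_llm_judge(prompt)
    return 0 if answer.strip().lower().startswith("a") else 1


def best_of_n(instruction, policy_sample, reward_model, n=16):
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates = [policy_sample(instruction) for _ in range(n)]
    return max(candidates, key=lambda out: reward_model(instruction, out))


def win_rate(instructions, model_outputs, baseline_outputs, query_llm_judge):
    """Fraction of instructions on which the judge prefers the model over the baseline.

    Presentation order is randomized as a crude control for position bias.
    """
    wins = 0
    for inst, ours, theirs in zip(instructions, model_outputs, baseline_outputs):
        if random.random() < 0.5:
            wins += simulated_preference(inst, ours, theirs, query_llm_judge) == 0
        else:
            wins += simulated_preference(inst, theirs, ours, query_llm_judge) == 1
    return wins / len(instructions)
```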
