Chain of Hindsight Aligns Language Models with Feedback

(arXiv:2302.02676)
Published Feb 6, 2023 in cs.LG and cs.CL

Abstract

Learning from human preferences is important for language models to match human needs and to align with human and social values. Prior works have achieved remarkable successes by learning from human feedback to understand and follow instructions. Nonetheless, these methods are either founded on hand-picked model generations that are favored by human annotators, making them data-inefficient and difficult to apply in general, or they depend on reinforcement learning, which often suffers from imperfect reward functions and relies on extremely challenging optimization. In this work, we propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. Our idea is inspired by how humans learn from extensive feedback presented in the form of language. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model, allowing us to take advantage of the language comprehension capabilities of language models. We condition the model on a sequence of model generations paired with feedback; the model is thereby trained to generate outputs based on feedback while learning to identify and correct negative attributes or errors. Applying our method to large language models, we observe that Chain of Hindsight significantly surpasses previous methods in aligning language models with human preferences. We report significant improvements on summarization and dialogue benchmarks, and our approach is markedly preferred in human evaluations.
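
The abstract describes converting preference feedback into natural-language sequences and fine-tuning on chains of model generations paired with hindsight feedback. As a rough illustration of that idea, the Python sketch below builds one training sequence from a preference pair and masks the language-modeling loss so that it falls only on answer tokens rather than on the feedback phrases; the feedback templates ("A good answer is:", "A bad answer is:"), the Hugging Face-style `tokenizer` interface, and the exact masking choice are assumptions made for illustration, not the paper's verbatim recipe.

```python
# Illustrative sketch of Chain-of-Hindsight-style data construction and loss
# masking. Assumptions (not taken verbatim from the paper): the feedback
# phrases, the Hugging Face-style tokenizer interface, and applying the loss
# only to answer tokens.

from dataclasses import dataclass
from typing import List, Tuple

import torch
import torch.nn.functional as F


@dataclass
class PreferencePair:
    prompt: str     # e.g. a question or a document to summarize
    preferred: str  # the generation humans rated higher
    rejected: str   # the generation humans rated lower


# Hypothetical hindsight feedback templates: both generations appear in one
# sequence, each introduced by verbal feedback.
BAD_PREFIX = " A bad answer is:"
GOOD_PREFIX = " A good answer is:"


def build_coh_example(pair: PreferencePair, tokenizer,
                      max_len: int = 1024) -> Tuple[List[int], List[int]]:
    """Concatenate prompt, negative feedback + rejected answer, and positive
    feedback + preferred answer into one token sequence.

    Returns input_ids and a loss mask that is 1 only on answer tokens, so the
    model learns to produce answers conditioned on feedback rather than to
    reproduce the feedback phrases themselves."""
    segments = [
        (pair.prompt, 0),           # context: no loss
        (BAD_PREFIX, 0),            # feedback phrase: no loss
        (" " + pair.rejected, 1),   # model generation: loss applied
        (GOOD_PREFIX, 0),           # feedback phrase: no loss
        (" " + pair.preferred, 1),  # model generation: loss applied
    ]
    input_ids: List[int] = []
    loss_mask: List[int] = []
    for text, supervised in segments:
        ids = tokenizer.encode(text, add_special_tokens=False)
        input_ids.extend(ids)
        loss_mask.extend([supervised] * len(ids))
    return input_ids[:max_len], loss_mask[:max_len]


def masked_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor,
                   loss_mask: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy restricted to positions where loss_mask == 1.

    Shapes: logits (batch, seq, vocab); input_ids, loss_mask (batch, seq)."""
    logits = logits[:, :-1, :]   # position t predicts token t + 1
    targets = input_ids[:, 1:]
    mask = loss_mask[:, 1:].float()
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (per_token * mask).sum() / mask.sum().clamp(min=1.0)
```

At inference time, conditioning the fine-tuned model on the positive feedback phrase (here `GOOD_PREFIX`) would then steer it toward the preferred behavior; the specific phrasing used for that prompt is likewise an assumption of this sketch.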

