Chain of Hindsight Aligns Language Models with Feedback (2302.02676v8)
Abstract: Learning from human preferences is important for LLMs to match human needs and to align with human and social values. Prior work has achieved remarkable success by learning from human feedback to understand and follow instructions. Nonetheless, these methods either rely on hand-picked model generations favored by human annotators, which makes them data-inefficient and hard to apply in general, or they depend on reinforcement learning, which often suffers from imperfect reward functions and requires extremely challenging optimization. In this work, we propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity. Our idea is inspired by how humans learn from extensive feedback presented in the form of language. We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model, allowing us to take advantage of the language comprehension capabilities of LLMs. We condition the model on a sequence of model generations paired with feedback. By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors. Applying our method to LLMs, we observe that Chain of Hindsight significantly surpasses previous methods in aligning LLMs with human preferences. We report significant improvements on summarization and dialogue benchmarks, and our approach is markedly preferred in human evaluations.
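The data construction the abstract describes (turning paired feedback into a single training sequence and fine-tuning with a standard language-modeling objective) can be illustrated with a short sketch. The example below is a minimal, hypothetical illustration: the feedback phrases ("A good response:", "A bad response:") and the whitespace "tokenizer" are assumptions for readability, not the paper's exact templates or tooling, and the loss mask reflects the paper's idea of conditioning on feedback while training only on the model-generated spans.

```python
# Minimal sketch of building a Chain-of-Hindsight training example.
# Feedback phrases and the whitespace "tokenizer" are illustrative
# placeholders, not the paper's exact templates or implementation.

from typing import List, Tuple

GOOD = "A good response:"   # assumed positive-feedback phrase (illustrative)
BAD = "A bad response:"     # assumed negative-feedback phrase (illustrative)


def build_chain_of_hindsight(
    prompt: str, preferred: str, dispreferred: str
) -> Tuple[List[str], List[int]]:
    """Chain the prompt, negative feedback + worse output, and positive
    feedback + better output into one sequence. The mask is 1 only on
    model-output tokens, so the language-modeling loss ignores the prompt
    and the feedback phrases."""
    tokens: List[str] = []
    mask: List[int] = []

    def extend(text: str, trainable: bool) -> None:
        pieces = text.split()          # stand-in for a real tokenizer
        tokens.extend(pieces)
        mask.extend([1 if trainable else 0] * len(pieces))

    extend(prompt, trainable=False)
    extend(BAD, trainable=False)
    extend(dispreferred, trainable=True)
    extend(GOOD, trainable=False)
    extend(preferred, trainable=True)
    return tokens, mask


if __name__ == "__main__":
    toks, m = build_chain_of_hindsight(
        prompt="Summarize: The meeting covered budget, hiring, and the Q3 roadmap.",
        preferred="The meeting covered the budget, hiring plans, and the Q3 roadmap.",
        dispreferred="Stuff was discussed.",
    )
    for flag, tok in zip(m, toks):
        print(f"{flag} {tok}")
```

At inference time, the model would be conditioned on the prompt followed by the positive-feedback phrase so that generation follows the preferred behavior; the exact prompts, templates, and masking details in the authors' released implementation may differ from this sketch.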