RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment (2304.06767v4)

Published 13 Apr 2023 in cs.LG, cs.AI, cs.CL, cs.CV, and stat.ML

Abstract: Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently enhancing the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve the model performance in both reward learning and other automated metrics in both LLMs and diffusion models.


Summary

  • The paper introduces RAFT, a novel framework that fine-tunes generative models via iterative reward ranking.
  • The methodology alternates between data sampling, ranking with a reward model, and fine-tuning to boost metrics like mean reward and diversity.
  • The framework offers scalable, robust alignment that reduces memory load and mitigates issues such as reward hacking and overfitting.

Reward rAnked FineTuning for Generative Foundation Model Alignment (RAFT)

Introduction

Generative foundation models have shown proficiency across diverse tasks in domains such as natural language processing and computer vision. These models, including LLMs and diffusion models, are capable of generating high-quality, meaningful outputs. However, models trained on large-scale unsupervised datasets inherit biases that may lead to undesirable, skewed, or unfair outcomes. This paper addresses such concerns by introducing Reward rAnked FineTuning (RAFT), a novel framework designed to align generative models with human ethics and preferences. RAFT enhances model performance by fine-tuning on high-quality samples selected with a reward model (2304.06767).

Methodology

RAFT Framework

The RAFT framework operates in iterations, alternating between data sampling, data ranking, and model fine-tuning. Each iteration begins by generating candidate outputs from the current model. A reward model then quantifies the quality of each sample, and only those with high rewards are retained for fine-tuning. Because responses are ranked by quality, each iterative update moves the model toward the desired behavior.

The RAFT process can be formalized as follows:

  1. Data Collection: Sample a batch of prompts and generate multiple responses from the model.
  2. Data Ranking: Assess each response using the reward model, keeping only the responses with the highest rewards.
  3. Model Fine-tuning: Fine-tune the generative model using the high-reward responses to steer future outputs towards higher quality and ethically aligned results (see the sketch below).
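
The loop below is a minimal, framework-agnostic sketch of one RAFT iteration. The names `model.generate`, `reward_model`, and `model.finetune` are placeholder interfaces rather than the paper's implementation, and the acceptance ratio is illustrative.

```python
# Minimal sketch of one RAFT iteration (illustrative, not the authors' code).
# `model.generate`, `reward_model`, and `model.finetune` are assumed interfaces.
def raft_iteration(model, reward_model, prompts, k_candidates=8, accept_ratio=1 / 8):
    selected = []
    for prompt in prompts:
        # 1. Data collection: sample several candidate responses per prompt.
        candidates = [model.generate(prompt) for _ in range(k_candidates)]
        # 2. Data ranking: score candidates with the reward model and keep the best.
        ranked = sorted(candidates, key=lambda r: reward_model(prompt, r), reverse=True)
        n_keep = max(1, int(len(ranked) * accept_ratio))
        selected.extend((prompt, response) for response in ranked[:n_keep])
    # 3. Model fine-tuning: supervised fine-tuning on the filtered (prompt, response) pairs.
    model.finetune(selected)
    return model
```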

This methodology ensures continuous improvement in model alignment by repeatedly exposing the model to, and training it on, its own highest-quality responses.

Figure 1: A typical RAFT training curve under different hyperparameter settings, showing reward improving over iterations.

Experimental Evaluation

LLM Experiments

RAFT was evaluated on the LLaMA-7B model and compared against a supervised fine-tuning (SFT) baseline and PPO (Proximal Policy Optimization). Performance was assessed via mean reward on a test set, perplexity, and diversity metrics such as MSTTR, Distinct-1, and Distinct-2. The RAFT-aligned LLaMA-7B-SFT model exhibited significant improvements in mean reward, outperforming the PPO-aligned model while achieving better diversity metrics.
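
For reference, Distinct-n is the ratio of unique n-grams to total n-grams over a set of generations. The sketch below assumes simple whitespace tokenization, which may differ from the paper's exact setup.

```python
# Sketch of the Distinct-n diversity metric (unique n-grams / total n-grams).
# Assumes whitespace tokenization; the paper's tokenization may differ.
def distinct_n(texts, n):
    total, unique = 0, set()
    for text in texts:
        tokens = text.split()
        ngrams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)

responses = ["the cat sat on the mat", "a dog ran in the park"]
print(distinct_n(responses, 1), distinct_n(responses, 2))  # Distinct-1, Distinct-2
```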

Diffusion Model Experiments

Beyond LLMs, RAFT was applied to diffusion models to improve the resolution-adaptation capability of Stable Diffusion v1.5 (SD-1.5) operating at a reduced resolution of 256x256 pixels. Improvements were assessed with CLIP scores and aesthetic scores, and RAFT demonstrated notable gains in both metrics on in-domain and out-of-domain samples.

Figure 2: Test reward over iterations for different settings of the hyperparameter K.
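
As a rough illustration of reward-ranked selection for text-to-image samples, the sketch below scores generated images against their prompt with a CLIP model from Hugging Face `transformers` and keeps the top-scoring ones. The checkpoint choice is an assumption, and the paper's additional aesthetic-score reward is omitted here.

```python
# Sketch: rank generated images by CLIP image-text similarity and keep the best.
# Uses Hugging Face `transformers`; the checkpoint choice is an assumption.
import torch
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_top_k(images, prompt, k=1):
    """images: list of PIL images generated for `prompt`; returns the k best."""
    inputs = processor(text=[prompt], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = clip(**inputs).logits_per_image.squeeze(-1)  # shape: (num_images,)
    top = torch.topk(scores, k=min(k, len(images))).indices.tolist()
    return [images[i] for i in top]
```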

Implications and Future Prospects

The RAFT framework offers enhanced stability and robustness over traditional reinforcement learning-based alignment approaches like PPO. By decoupling data generation and model fine-tuning, RAFT reduces memory burden and improves flexibility in model training. Its ranking-based data selection is more resistant to reward scale variations and noise, potentially mitigating issues like reward hacking and model overfitting.
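
One way to see the robustness to reward scale: top-k selection depends only on the ordering of rewards, so any strictly increasing transformation of the reward leaves the selected set unchanged. A quick illustration (not from the paper):

```python
# Top-k selection is invariant to strictly increasing reward transformations.
import math

def top_k_indices(scores, k):
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

rewards = [0.2, 1.5, -0.3, 0.9]
assert top_k_indices(rewards, 2) == top_k_indices([10 * r + 3 for r in rewards], 2)
assert top_k_indices(rewards, 2) == top_k_indices([math.exp(r) for r in rewards], 2)
```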

Future work could explore integrating more diverse data sources and expert generators with RAFT. Investigating prompt engineering and post-processing could further optimize response quality for generative models. As a modular and adaptable framework, RAFT has the potential to be applied across different model architectures and domains.

Conclusion

RAFT provides a powerful, effective, and scalable solution for aligning generative foundation models with human preferences. Its iterative refinement of models via high-quality sample selection markedly enhances both ethical alignment and performance outcomes without the complexities and inefficiencies associated with reinforcement learning methods. This framework promises significant contributions to safe and responsible AI deployment.
