RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment (2304.06767v4)

Published 13 Apr 2023 in cs.LG, cs.AI, cs.CL, cs.CV, and stat.ML

Abstract: Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to the successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discarding those that exhibit undesired behavior, and subsequently enhancing the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve the model performance in both reward learning and other automated metrics in both LLMs and diffusion models.


Summary

  • The paper presents RAFT, a framework that ranks model-generated samples based on rewards to achieve robust alignment with human preferences.
  • RAFT improves stability and efficiency by decoupling sample generation from optimization, significantly reducing GPU memory needs compared to traditional RL methods.
  • Empirical results show RAFT maintains language fluency and outperforms SFT and PPO in mean rewards and diversity metrics across LLM and diffusion model tasks.

An Analysis of "RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment"

The paper "RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment" addresses a pertinent challenge in the domain of AI: the alignment of generative foundation models with human ethics and preferences. Traditionally reliant on Reinforcement Learning from Human Feedback (RLHF), this adroitly proposed framework, termed Reward rAnked FineTuning (RAFT), seeks to enhance model alignment with improved stability and simplicity over conventional RL methods.

Core Contributions

RAFT ranks model-generated samples with a reward function and then fine-tunes the model on the highest-quality ones; a minimal sketch of this loop follows the list below. The primary contributions of RAFT lie in its simplicity, computational efficiency, and flexibility:

  1. Stability and Robustness: RAFT utilizes a fine-tuning strategy akin to supervised learning, circumventing the instabilities often associated with RL algorithms. This ensures a streamlined process with fewer hyper-parameters, thus facilitating easier implementation and adjustment.
  2. Efficient Resource Utilization: Unlike RL algorithms such as PPO, which incur significant computational overhead because several LLMs must be held in memory concurrently during optimization, RAFT decouples sample generation from model optimization, thus reducing GPU memory requirements.
  3. Broad Applicability: RAFT’s versatility extends across various generative models, including both LLMs and diffusion models, provided a suitable reward model is available.
  4. Clear Preference Objectives: The framework prioritizes high-reward samples, thus mitigating reward hacking through transparency and interpretability of the training data.
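
To make this workflow concrete, here is a minimal sketch of one RAFT-style round under stated assumptions: the caller supplies a `sample` function that draws k candidate responses per prompt, a `reward` scorer for (prompt, response) pairs, and a `finetune` routine that performs a supervised pass on the selected pairs. These callables and their signatures are illustrative placeholders, not the authors' implementation.

```python
from typing import Callable, List, Tuple

def raft_round(
    sample: Callable[[str, int], List[str]],            # draws k candidate responses for a prompt
    reward: Callable[[str, str], float],                 # scores a (prompt, response) pair
    finetune: Callable[[List[Tuple[str, str]]], None],   # supervised fine-tuning on selected pairs
    prompts: List[str],
    k: int = 8,
    keep_per_prompt: int = 1,
) -> List[Tuple[str, str]]:
    """One reward-ranked fine-tuning round: generate, score, filter, fine-tune."""
    selected: List[Tuple[str, str]] = []
    for prompt in prompts:
        candidates = sample(prompt, k)
        # Rank the k candidates by reward and keep only the top ones (best-of-k filtering).
        ranked = sorted(candidates, key=lambda c: reward(prompt, c), reverse=True)
        selected.extend((prompt, c) for c in ranked[:keep_per_prompt])
    # Generation and optimization are decoupled: only the filtered, high-reward
    # samples are passed to an ordinary supervised fine-tuning step.
    finetune(selected)
    return selected
```

Because scoring and filtering happen before the training step, optimization reduces to supervised fine-tuning on the selected pairs, which is where the memory and stability advantages noted above come from.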

Empirical Evaluation

The empirical evaluation benchmarks RAFT against PPO on LLMs, using the LLaMA-7B model and the HH-RLHF dataset. RAFT maintained language fluency while achieving higher mean rewards, indicative of successful alignment. Notably, the RAFT-aligned models outperformed both the SFT baseline and PPO on various diversity metrics without degrading perplexity, and the results also suggest robustness to common complications such as reward noise.
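
The summary does not specify which diversity metrics were used; as one common illustration (an assumption about the metric family, not the paper's exact evaluation code), a distinct-n style measure counts the fraction of unique n-grams among all n-grams in the generated responses:

```python
from typing import List

def distinct_n(responses: List[str], n: int = 2) -> float:
    """Distinct-n: unique n-grams divided by total n-grams across all responses.

    A rough proxy for output diversity; higher means more varied generations.
    Whitespace tokenization is used here purely for illustration.
    """
    ngrams = []
    for text in responses:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0
```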

In diffusion model experiments, RAFT was effective in two settings: improving generation at resolutions the base model handles poorly and aligning text-to-image outputs with their prompts. The ability to adapt generation resolution and match images more closely to contextual prompts further corroborates RAFT's efficacy and adaptability in visual tasks.

Implications and Future Directions

By proposing RAFT, the authors illuminate practical paths to achieving a balance between model performance and alignment with human feedback. The paper implicitly suggests the potential for extending RAFT to other domains in AI where ethical alignment is crucial. Furthermore, the decoupled nature of RAFT presents opportunities to integrate supplementary data sources and advanced generation techniques, thereby improving inference quality.

The paper opens several avenues for future research. An immediate step is to explore the integration of more sophisticated reward functions, potentially leveraging insights from continual learning and meta-learning. Also, considerations of scale and real-world deployment scenarios could further reveal the strengths and limitations of RAFT.

Conclusion

"RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment" provides a methodologically sound and pragmatic approach to the optimization of generative foundation models. Its design reflects a balance between simplicity and performance, offering a robust alternative to incumbent RLHF techniques. As LLMs and diffusion models continue to advance, RAFT proposes a viable mechanism to ensure these models increasingly align with human ethical expectations and social values.