Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy (2403.04283v1)

Published 7 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Reinforcement Learning from Human Feedback (RLHF) is the prevailing approach to ensure LLMs align with human values. However, existing RLHF methods require a high computational cost, one main reason being that RLHF assigns both the generation and alignment tasks to the LLM simultaneously. In this paper, we introduce Proxy-RLHF, which decouples the generation and alignment processes of LLMs, achieving alignment with human values at a much lower computational cost. We start with a novel Markov Decision Process (MDP) designed for the alignment process and employ Reinforcement Learning (RL) to train a streamlined proxy model that oversees the token generation of the LLM, without altering the LLM itself. Experiments show that our method achieves a comparable level of alignment with only 1% of the training parameters of other methods.
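The abstract describes the method only at a high level: a frozen LLM handles generation, while a small, separately trained proxy handles alignment by supervising which tokens are emitted. The Python/PyTorch sketch below is a hypothetical illustration of that decoupling, not the paper's actual algorithm; the names (ProxyHead, guided_step, accept_threshold) and the gating scheme are invented for the example, and the RL training loop over the alignment MDP is omitted.

# Hypothetical sketch of the decoupled setup described in the abstract:
# a frozen generator proposes candidate next tokens, and a small trainable
# "proxy" decides which proposals to accept. All names and hyperparameters
# here are illustrative assumptions, not details from the paper.
import torch
import torch.nn as nn

class ProxyHead(nn.Module):
    """Lightweight module that scores a candidate token given the
    generator's hidden state and the candidate's embedding."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state: torch.Tensor, cand_emb: torch.Tensor) -> torch.Tensor:
        # Returns an acceptance logit for each candidate token.
        return self.scorer(torch.cat([state, cand_emb], dim=-1)).squeeze(-1)

def guided_step(state, cand_embs, cand_logprobs, proxy, accept_threshold=0.0):
    """One decoding step: keep only candidates the proxy accepts, then
    sample among them according to the frozen generator's probabilities."""
    accept_logits = proxy(state.expand(cand_embs.size(0), -1), cand_embs)
    mask = accept_logits > accept_threshold
    if not mask.any():  # fall back to the generator's top choice
        return torch.argmax(cand_logprobs).item()
    masked = cand_logprobs.masked_fill(~mask, float("-inf"))
    return torch.distributions.Categorical(logits=masked).sample().item()

# Toy usage with random tensors standing in for a frozen LLM's outputs.
hidden_dim, top_k = 16, 8
proxy = ProxyHead(hidden_dim)            # only these parameters would be trained
state = torch.randn(hidden_dim)          # generator hidden state at this step
cand_embs = torch.randn(top_k, hidden_dim)
cand_logprobs = torch.log_softmax(torch.randn(top_k), dim=-1)
print("chosen candidate index:", guided_step(state, cand_embs, cand_logprobs, proxy))

Because only the proxy's parameters are trained while the generator stays frozen, a setup of this shape is consistent with the abstract's claim of reaching comparable alignment with roughly 1% of the training parameters of full-model RLHF.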
