Proxy-RLHF: Decoupling Generation and Alignment in Large Language Model with Proxy (2403.04283v1)

Published 7 Mar 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Reinforcement Learning from Human Feedback (RLHF) is the prevailing approach to ensure LLMs align with human values. However, existing RLHF methods require a high computational cost, one main reason being that RLHF assigns both the generation and alignment tasks to the LLM simultaneously. In this paper, we introduce Proxy-RLHF, which decouples the generation and alignment processes of LLMs, achieving alignment with human values at a much lower computational cost. We start with a novel Markov Decision Process (MDP) designed for the alignment process and employ Reinforcement Learning (RL) to train a streamlined proxy model that oversees the token generation of the LLM, without altering the LLM itself. Experiments show that our method achieves a comparable level of alignment with only 1% of the training parameters of other methods.
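The abstract only sketches the mechanism, so a small, hedged illustration may help. The code below is an assumption-laden reading of "a streamlined proxy model that oversees the token generation of the LLM, without altering the LLM itself", not the paper's implementation: the ProxyModel architecture, the accept/veto decoding loop, and all names (frozen_llm, proxy, oversee_generation, threshold) are hypothetical, and the paper's actual MDP formulation and the RL training of the proxy (e.g., with a standard algorithm such as PPO) are not reproduced here.

```python
# Hypothetical sketch of the decoupled setup described in the abstract (not the
# authors' code): the LLM stays frozen and only proposes tokens; a small proxy
# policy, trained separately with RL against a reward signal, vets each proposal.

import torch
import torch.nn as nn


class ProxyModel(nn.Module):
    """Tiny policy that scores a (generation state, candidate token) pair."""

    def __init__(self, hidden_size: int, embed_size: int):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_size + embed_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, hidden_state: torch.Tensor, token_embedding: torch.Tensor) -> torch.Tensor:
        # Acceptance probability for appending this candidate token in this state.
        x = torch.cat([hidden_state, token_embedding], dim=-1)
        return torch.sigmoid(self.scorer(x))


@torch.no_grad()
def oversee_generation(frozen_llm, proxy, input_ids,
                       max_new_tokens=64, top_k=20, threshold=0.5):
    """Decoding loop where the proxy can veto the frozen LLM's token proposals.

    Assumes a HuggingFace-style causal LM (output_hidden_states, .logits,
    get_input_embeddings); the LLM's weights are never updated.
    """
    embed = frozen_llm.get_input_embeddings()
    for _ in range(max_new_tokens):
        out = frozen_llm(input_ids, output_hidden_states=True)
        hidden = out.hidden_states[-1][:, -1, :]          # state at the current step
        logits = out.logits[:, -1, :]
        candidates = torch.topk(logits, k=top_k).indices[0]
        chosen = candidates[0]                            # fall back to the LLM's top choice
        for tok in candidates:                            # walk in the LLM's preference order
            if proxy(hidden, embed(tok).unsqueeze(0)).item() >= threshold:
                chosen = tok                              # first candidate the proxy accepts
                break
        input_ids = torch.cat([input_ids, chosen.view(1, 1)], dim=-1)
    return input_ids
```

In this reading, only the proxy's parameters would ever be trained, consistent with the abstract's claim of needing roughly 1% of the training parameters of full RLHF, while the generator serves purely as a frozen proposal distribution.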
