Preference Ranking Optimization for Human Alignment (2306.17492v2)
Abstract: Large language models (LLMs) often produce misleading content, underscoring the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment, but it has two main drawbacks: (1) RLHF is complex, unstable, and sensitive to hyperparameters in contrast to SFT. (2) Although RLHF requires massive trial-and-error sampling, the multiple samples are reduced to pair-wise contrasts, lacking comparisons from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm that directly fine-tunes LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the remaining responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by the LLM with the human preference ranking over these responses. Experiments show that PRO outperforms baseline algorithms and achieves results comparable to ChatGPT and human responses under automatic, reward-based, GPT-4, and human evaluations.
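The list-wise contrast described in the abstract can be made concrete with a short sketch: for candidates ordered from most to least preferred, each step treats the current best candidate as the "positive" and softmaxes its score against everything ranked below it. The sketch below is a minimal, hedged illustration under the assumption that a length-normalized sequence log-probability has already been computed for each candidate elsewhere; the paper's full objective additionally includes an SFT loss on the top-ranked response, which is omitted here.

```python
import torch

def pro_ranking_loss(candidate_logprobs: torch.Tensor) -> torch.Tensor:
    """Illustrative list-wise ranking loss in the spirit of PRO.

    candidate_logprobs: shape (n,), the length-normalized log-probability the
    LLM assigns to each of n candidate responses, ordered from most preferred
    (index 0) to least preferred (index n-1). Computing these scores from the
    model is assumed to happen elsewhere.
    """
    n = candidate_logprobs.size(0)
    loss = candidate_logprobs.new_zeros(())
    # At step k, contrast the k-th best response against all responses ranked
    # below it, pushing the model to assign it the highest score in that tail.
    for k in range(n - 1):
        tail = candidate_logprobs[k:]                       # candidate k and everything worse
        loss = loss - (tail[0] - torch.logsumexp(tail, 0))  # -log softmax of candidate k over the tail
    return loss


# Toy usage with three ranked candidates and hypothetical scores.
scores = torch.tensor([-1.2, -1.5, -2.3], requires_grad=True)
loss = pro_ranking_loss(scores)
loss.backward()
print(loss.item(), scores.grad)
```

With n = 2 this reduces to the familiar pair-wise preference contrast; larger n is what lets PRO exploit an entire human preference ranking rather than isolated pairs.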