Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets (2405.18952v2)
Abstract: Training LLMs with Reinforcement Learning from AI Feedback (RLAIF) aligns model outputs more closely with human preferences. This involves an evaluator model ranking multiple candidate responses to user prompts. However, the rankings from popular evaluator models such as GPT-4 can be inconsistent. We propose the Repeat Ranking method, in which we evaluate the same responses multiple times and train only on those responses that are consistently ranked. Using 2,714 prompts in 62 languages, we generated responses from 7 top multilingual LLMs and had GPT-4 rank them five times each. Evaluated on MT-Bench chat benchmarks in six languages, our method outperformed the standard practice of training on all available prompts. Our work highlights the quality-versus-quantity trade-off in RLAIF dataset generation and offers a stackable strategy for enhancing dataset quality, and thus model quality.
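The core idea is a consistency filter over repeated rankings. Below is a minimal sketch of how such a filter might look, using Kendall's coefficient of concordance (W) as one plausible agreement measure across the five repeated rankings; the 0.8 threshold, function names, and data layout are illustrative assumptions, not details taken from the paper.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance (W) for m complete, tie-free
    rankings of n items. W = 1 means the repeated rankings agree perfectly.

    `rankings` is a list of m lists; each inner list gives the rank (1..n)
    assigned to response i by one evaluation run.
    """
    m = len(rankings)       # number of repeated evaluations (e.g. 5)
    n = len(rankings[0])    # number of candidate responses (e.g. 7)
    # Sum of ranks each response received across the m runs.
    rank_sums = [sum(run[i] for run in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


def filter_consistent(prompt_rankings, threshold=0.8):
    """Keep only prompts whose repeated rankings agree above `threshold`.

    `prompt_rankings` maps a prompt id to its list of repeated rankings.
    The 0.8 cutoff is an illustrative choice, not the paper's setting.
    """
    return {p: runs for p, runs in prompt_rankings.items()
            if kendalls_w(runs) >= threshold}


# Example: 5 repeated rankings of 7 candidate responses to one prompt.
example = {
    "prompt_0": [
        [1, 2, 3, 4, 5, 6, 7],
        [1, 2, 3, 4, 5, 6, 7],
        [1, 3, 2, 4, 5, 6, 7],
        [1, 2, 3, 4, 5, 7, 6],
        [2, 1, 3, 4, 5, 6, 7],
    ],
}
print(filter_consistent(example).keys())  # keeps 'prompt_0' (W ~ 0.96)
```

Prompts that survive the filter would then be used to build preference pairs for training; prompts whose rankings disagree across runs would simply be discarded, trading dataset size for label reliability.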