Insights into Alignment: Evaluating DPO and its Variants Across Multiple Tasks (2404.14723v2)
Abstract: This study evaluates Direct Preference Optimization (DPO) and its variants for aligning LLMs with human preferences, testing three configurations: (1) with Supervised Fine-Tuning (SFT), (2) without SFT, and (3) without SFT but using an instruction-tuned model. We further investigate how training set size influences model performance. Our evaluation spans 13 benchmarks covering dialogue, reasoning, mathematical problem-solving, question answering, and truthfulness, including MT-Bench, Big Bench, and the Open LLM Leaderboard. We find that: (1) alignment methods often achieve near-optimal performance even with smaller subsets of training data; (2) although they offer limited improvements on complex reasoning tasks, they enhance mathematical problem-solving; and (3) using an instruction-tuned model improves truthfulness. These insights highlight the conditions under which alignment methods excel, as well as their limitations.
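For context, DPO (Rafailov et al., listed in the references below) optimizes a policy directly on preference pairs against a frozen reference model, with no separately trained reward model. A standard statement of its loss, using notation from the DPO paper rather than this summary (policy $\pi_\theta$, reference model $\pi_{\mathrm{ref}}$, preferred and dispreferred responses $y_w$ and $y_l$, scaling parameter $\beta$, and logistic sigmoid $\sigma$), is:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```

DPO variants such as KTO, IPO, and CPO (all cited below) modify the shape of this loss or its assumptions about the reference model and preference data, while keeping the same direct, reward-model-free alignment setup to which the three training configurations above are applied.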
- PaLM 2 technical report.
- Program synthesis with large language models.
- A general theoretical paradigm to understand learning from human preferences.
- Open LLM Leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
- BIG-bench authors. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research.
- PIQA: Reasoning about physical commonsense in natural language.
- Heejong Bong and Alessandro Rinaldo. 2022. Generalized results for the existence and consistency of the MLE in the Bradley-Terry-Luce model.
- Language models are few-shot learners.
- Sparks of artificial general intelligence: Early experiments with GPT-4.
- Evaluating large language models trained on code.
- Self-play fine-tuning converts weak language models to strong language models.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- PaLM: Scaling language modeling with pathways.
- Deep reinforcement learning from human preferences.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge.
- Training verifiers to solve math word problems. CoRR, abs/2110.14168.
- Enhancing chat language models by scaling high-quality instructional conversations.
- KTO: Model alignment as prospect theoretic optimization. arXiv preprint arXiv:2402.01306.
- Measuring massive multitask language understanding.
- LoRA: Low-rank adaptation of large language models.
- Mistral 7B.
- Solving quantitative reasoning problems with language models.
- Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097.
- TruthfulQA: Measuring how models mimic human falsehoods.
- Statistical rejection sampling improves preference optimization.
- Can a suit of armor conduct electricity? A new dataset for open book question answering. In EMNLP.
- Training language models to follow instructions with human feedback.
- Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
- Scaling language models: Methods, analysis & insights from training gopher.
- Direct preference optimization: Your language model is secretly a reward model.
- WinoGrande: An adversarial Winograd Schema Challenge at scale.
- Multitask prompted training enables zero-shot task generalization.
- Proximal policy optimization algorithms.
- Llama 2: Open foundation and fine-tuned chat models.
- Zephyr: Direct distillation of LM alignment.
- Amos Tversky and Daniel Kahneman. 1992. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297–323.
- TRL: Transformer reinforcement learning. https://github.com/huggingface/trl.
- Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903.
- Pairwise proximal policy optimization: Harnessing relative feedback for LLM alignment.
- Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. arXiv preprint arXiv:2401.08417.
- RRHF: Rank responses to align language models with human feedback without tears.
- HellaSwag: Can a machine really finish your sentence?
- SLiC-HF: Sequence likelihood calibration with human feedback.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena.