Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process (2405.11870v2)
Abstract: Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are two fundamental processes for enhancing the capabilities of Language Models (LMs) after pre-training, aligning them better with human preferences. Although SFT excels in training efficiency, PO delivers better alignment, so the two are often combined. However, common practice simply applies them sequentially without integrating their optimization objectives, overlooking the opportunity to bridge their paradigm gap and draw on the strengths of both. To obtain a unified understanding, we interpret SFT and PO through two sub-processes -- Preference Estimation and Transition Optimization -- defined at the token level within the Markov Decision Process (MDP) framework. This modeling shows that SFT is merely a special case of PO with inferior estimation and optimization: PO evaluates the quality of the model's entire generated answer, whereas SFT only scores predicted tokens conditioned on preceding tokens taken from the target answers. Consequently, SFT overestimates the model's ability, leading to inferior optimization. Building on this view, we introduce Intuitive Fine-Tuning (IFT) to integrate SFT and Preference Optimization into a single process. IFT captures LMs' intuitive sense of entire answers through a temporal residual connection, yet relies solely on a single policy and the same volume of non-preference-labeled data as SFT. Our experiments show that IFT performs comparably or even superiorly to sequential recipes of SFT and some typical Preference Optimization methods across several tasks, particularly those requiring generation, reasoning, and fact-following abilities. An explainable Frozen Lake game further validates the effectiveness of IFT in obtaining a competitive policy.
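The abstract's central distinction can be made concrete: SFT scores each target token conditioned on the target prefix (teacher forcing), whereas PO-style methods evaluate an answer that the policy itself generated. The sketch below illustrates only this contrast with standard Hugging Face APIs; it is not the paper's IFT implementation, and the model name `gpt2`, the toy prompt, and the greedy 8-token rollout are assumptions made purely for the example.

```python
# Minimal sketch (not the paper's IFT code): contrast SFT-style estimation,
# which scores target tokens under target prefixes, with PO-style evaluation
# of the policy's own generated answer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM; chosen only for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Q: What is 2 + 2?\nA:"
target = " 4"

# --- SFT-style estimation: teacher-forced cross-entropy on the target answer.
# The model is only asked to predict target tokens given *target* prefixes,
# so its own rollout behavior is never assessed.
full = tok(prompt + target, return_tensors="pt")
prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
labels = full.input_ids.clone()
labels[:, :prompt_len] = -100  # ignore the prompt positions in the loss
with torch.no_grad():
    sft_loss = model(**full, labels=labels).loss

# --- PO-style estimation: generate the policy's *own* answer for evaluation.
# (A reward or preference signal would normally score this whole rollout;
# here we only produce it to show which tokens get evaluated.)
prompt_ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    rollout = model.generate(
        prompt_ids,
        max_new_tokens=8,
        do_sample=False,
        pad_token_id=tok.eos_token_id,
    )

print("SFT teacher-forced loss:", sft_loss.item())
print("Policy's own answer:", tok.decode(rollout[0, prompt_ids.shape[1]:]))
```

Because the SFT loss never exposes the model to its own continuations, it can look deceptively low even when greedy rollouts go wrong, which is the overestimation gap the abstract attributes to SFT.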