LiPO: Listwise Preference Optimization through Learning-to-Rank

(2402.01878)
Published Feb 2, 2024 in cs.CL and cs.LG

Abstract

Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the form of a ranked list over multiple responses to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, there has been little study of directly fitting on a list of responses. In this work, we formulate LM alignment as a listwise ranking problem and describe the Listwise Preference Optimization (LiPO) framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives, especially pairwise ones. Following this connection, we examine ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO and SLiC by a clear margin on two preference alignment tasks.

Overview

  • The paper introduces LiPO (Listwise Preference Optimization), a framework for aligning LLMs with human preferences by casting alignment as a Learning-to-Rank (LTR) problem.

  • LiPO treats LM alignment as a listwise ranking problem, improving upon traditional pairwise methods with more holistic listwise objectives.

  • LiPO-λ, a method within the LiPO framework, offers significant performance improvements by using a state-of-the-art listwise ranking objective and incorporating label values into the optimization.

  • Evaluations on the Reddit TL;DR and AnthropicHH datasets, using proxy reward models, AutoSxS, and human raters, show that LiPO-λ aligns better with human preferences than prior methods and performs well across natural language processing tasks.

Introduction

LLMs such as GPT-4 and Gemini have shown their prowess across a breadth of tasks, from casual conversational roles to complex coding problems. To employ these models viably in everyday applications, however, one must align them with human values and preferences—a process termed 'LM alignment'. Traditional reinforcement learning techniques for this task are notoriously complex and resource-intensive. The paper "LiPO: Listwise Preference Optimization through Learning-to-Rank" proposes an alternative that treats LM alignment as a Learning-to-Rank (LTR) problem, aiming to leverage the efficiency of ranking-based methods over traditional ones in optimizing language models according to human feedback.

The LiPO Framework

The authors argue that prevalent preference optimization methods rarely go beyond pairwise comparisons, which may be inadequate given that human feedback often takes the form of a ranked list. In response, they devise the Listwise Preference Optimization (LiPO) framework, which frames LM alignment as a listwise ranking problem. This framework not only generalizes existing methods but also allows the exploration of listwise objectives.

Under LiPO, previous alignment methods can be understood as special cases of ranking objectives. For instance, DPO and SLiC reduce to pairwise ranking losses when the list size is two (see the sketch below), while LiPO admits listwise objectives that may better capture the structure of human rankings. Particularly noteworthy is the introduction of LiPO-λ, a new method employing a theoretically grounded listwise ranking objective that shows improved performance over its counterparts across evaluation tasks.
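To make the connection concrete, here is a minimal sketch (not the authors' code) of how the DPO and SLiC losses arise as pairwise ranking losses over response-level scores built from policy and reference log-probabilities. The `beta` and `delta` values and the toy log-probabilities are illustrative assumptions.

```python
# Minimal sketch: DPO and SLiC as pairwise ranking losses under the LiPO view.
# Assumes summed log-probabilities of each response under the policy and the
# reference (SFT) model are already available; beta and delta are assumed values.
import torch
import torch.nn.functional as F

def response_scores(policy_logps, ref_logps, beta=0.1):
    """s_i = beta * (log pi(y_i|x) - log pi_ref(y_i|x)) for each listed response."""
    return beta * (policy_logps - ref_logps)

def dpo_loss(scores_preferred, scores_rejected):
    """Pairwise logistic ranking loss, which matches DPO when the list has two responses."""
    return -F.logsigmoid(scores_preferred - scores_rejected).mean()

def slic_loss(scores_preferred, scores_rejected, delta=1.0):
    """Pairwise hinge ranking loss, in the spirit of SLiC's calibration term (margin delta assumed)."""
    return torch.clamp(delta - (scores_preferred - scores_rejected), min=0.0).mean()

# Toy usage with made-up log-probabilities for a batch of two-response lists.
policy_logps = torch.tensor([[-12.3, -15.9], [-10.1, -11.4]])
ref_logps    = torch.tensor([[-12.0, -14.8], [-10.5, -11.0]])
s = response_scores(policy_logps, ref_logps)
print(dpo_loss(s[:, 0], s[:, 1]), slic_loss(s[:, 0], s[:, 1]))
```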

Advantages of Listwise Ranking

LiPO's advantage lies in its listwise perspective. Where traditional methods consider response pairs in isolation, LiPO-λ learns from entire lists of responses, a more holistic approach. Additionally, LiPO-λ incorporates label values into its optimization, a detail that earlier methods ignored; it thus accounts for the graded spectrum of response quality and can make more informed alignment decisions (see the sketch below). Empirically, in experiments on the Reddit TL;DR and AnthropicHH datasets, LiPO-λ outperformed existing methods such as DPO and SLiC by clear margins, and its benefits grew as the size of the response lists increased.
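As a rough illustration of this idea, the following sketch implements a LambdaLoss-style weighting over a list: each preference pair is scaled by how much swapping the two responses would change a DCG-like metric. The gain and discount definitions follow the standard LambdaLoss recipe and are assumptions here, not the paper's exact implementation.

```python
# Rough sketch of a LambdaLoss-style listwise objective in the spirit of LiPO-lambda.
import torch
import torch.nn.functional as F

def lipo_lambda_loss(scores, labels):
    """scores: (list_size,) policy-vs-reference scores s_i for one prompt's list.
    labels: (list_size,) graded quality labels psi_i (higher = better)."""
    # Rank positions induced by the current scores (1 = highest score).
    ranks = torch.empty_like(scores)
    ranks[torch.argsort(scores, descending=True)] = torch.arange(
        1, len(scores) + 1, dtype=scores.dtype)
    gains = 2.0 ** labels - 1.0                # DCG-style gain per response
    discounts = 1.0 / torch.log2(1.0 + ranks)  # DCG-style position discount

    # Lambda weight for every ordered pair: |gain gap| * |discount gap|.
    delta = torch.abs(gains[:, None] - gains[None, :]) * \
            torch.abs(discounts[:, None] - discounts[None, :])

    # Pairwise logistic loss on score differences, counted only where
    # response i is labeled strictly better than response j.
    pair_loss = -F.logsigmoid(scores[:, None] - scores[None, :])
    better = (labels[:, None] > labels[None, :]).float()
    return (delta * pair_loss * better).sum() / better.sum().clamp(min=1.0)

# Toy list of four responses with graded labels.
scores = torch.tensor([0.7, -0.2, 0.1, -0.5])
labels = torch.tensor([3.0, 1.0, 2.0, 0.0])
print(lipo_lambda_loss(scores, labels))
```

With the list size reduced to two and the lambda weights held constant, this objective falls back to the pairwise logistic loss above, which is the DPO special case noted earlier.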

Evaluation and Applications

The evaluations, conducted with three distinct approaches (proxy reward model, AutoSxS, and human evaluation), converge in affirming LiPO-λ's strengths. Under the proxy reward model, for instance, LiPO-λ's generated responses were rated more favorably against the SFT target than those of the competing methods. Moreover, its scalability to larger LM policies suggests wider applicability across natural language processing tasks.

Concluding Remarks

Listwise Preference Optimization (LiPO) brings a nuanced approach to aligning LMs with human preferences. Its incorporation of Learning-to-Rank techniques both simplifies and enhances the alignment process. The strong results of LiPO-λ substantiate its potential as a powerful tool for refining LLMs for real-world deployment, pointing toward a more efficient phase of model alignment techniques. Future work offers numerous possibilities, from deeper theoretical analysis of LambdaLoss's effectiveness in LM alignment to online learning strategies that further reduce distribution shift.
