LiPO: Listwise Preference Optimization through Learning-to-Rank

(2402.01878)
Published Feb 2, 2024 in cs.CL and cs.LG

Abstract

Aligning language models (LMs) with curated human feedback is critical to control their behaviors in real-world applications. Several recent policy optimization methods, such as DPO and SLiC, serve as promising alternatives to the traditional Reinforcement Learning from Human Feedback (RLHF) approach. In practice, human feedback often comes in the form of a ranked list over multiple responses to amortize the cost of reading the prompt. Multiple responses can also be ranked by reward models or AI feedback. However, there has been little study of directly fitting on a list of responses. In this work, we formulate LM alignment as a listwise ranking problem and describe the Listwise Preference Optimization (LiPO) framework, where the policy can potentially learn more effectively from a ranked list of plausible responses given the prompt. This view draws an explicit connection to Learning-to-Rank (LTR), where most existing preference optimization work can be mapped to existing ranking objectives, especially pairwise ones. Following this connection, we examine ranking objectives that are not well studied for LM alignment, with DPO and SLiC as special cases when the list size is two. In particular, we highlight a specific method, LiPO-λ, which leverages a state-of-the-art listwise ranking objective and weights each preference pair in a more advanced manner. We show that LiPO-λ can outperform DPO and SLiC by a clear margin on two preference alignment tasks.

Overview

  • The paper introduces LiPO (Listwise Preference Optimization), a framework for aligning LLMs with human preferences by casting alignment as a Learning-to-Rank (LTR) problem.

  • LiPO treats LM alignment as a listwise ranking problem, improving upon traditional pairwise methods with more holistic listwise objectives.

  • LiPO-λ, a method within the LiPO framework, offers significant performance improvements by using a state-of-the-art listwise ranking objective and incorporating label values into the optimization.

  • Evaluations on the Reddit TL;DR and AnthropicHH datasets, using proxy reward models, AutoSxS, and human raters, show that LiPO-λ aligns better with human preferences than prior methods and performs well across natural language processing tasks.

Introduction

LLMs such as GPT-4 and Gemini have shown their prowess across a breadth of tasks, from casual conversational roles to complex coding problems. To employ these models viably in everyday applications, however, one must align them with human values and preferences—a process termed 'LM alignment'. Traditional reinforcement learning techniques for this task are notoriously complex and resource-intensive. The paper "LiPO: Listwise Preference Optimization through Learning-to-Rank" proposes an alternative that treats LM alignment as a Learning-to-Rank (LTR) problem, aiming to leverage the efficiency of ranking-based methods over traditional ones in optimizing language models according to human feedback.

The LiPO Framework

The authors argue that prevalent preference optimization methods rarely go beyond pairwise comparisons, which may be inadequate given that human feedback often takes the form of a ranked list. In response, they devise the Listwise Preference Optimization (LiPO) framework, which frames LM alignment as a listwise ranking problem. This framework not only generalizes existing methods but also allows the exploration of listwise objectives.

Under LiPO, previous alignment methods can be understood as special cases of ranking objectives. For instance, DPO and SLiC reduce to pairwise ranking losses when the list size is two (see the sketch below), while LiPO admits listwise objectives that may better capture the structure of human rankings. Particularly noteworthy is the introduction of LiPO-λ, a new method employing a theoretically grounded listwise ranking objective that shows improved performance over its counterparts across evaluation tasks.
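To make the connection concrete, here is a minimal sketch (not the authors' code) of how the DPO and SLiC losses arise as pairwise ranking losses over response-level scores built from policy and reference log-probabilities. The `beta` and `delta` values and the toy log-probabilities are illustrative assumptions.

```python
# Minimal sketch: DPO and SLiC as pairwise ranking losses under the LiPO view.
# Assumes summed log-probabilities of each response under the policy and the
# reference (SFT) model are already available; beta and delta are assumed values.
import torch
import torch.nn.functional as F

def response_scores(policy_logps, ref_logps, beta=0.1):
    """s_i = beta * (log pi(y_i|x) - log pi_ref(y_i|x)) for each listed response."""
    return beta * (policy_logps - ref_logps)

def dpo_loss(scores_preferred, scores_rejected):
    """Pairwise logistic ranking loss, which matches DPO when the list has two responses."""
    return -F.logsigmoid(scores_preferred - scores_rejected).mean()

def slic_loss(scores_preferred, scores_rejected, delta=1.0):
    """Pairwise hinge ranking loss, in the spirit of SLiC's calibration term (margin delta assumed)."""
    return torch.clamp(delta - (scores_preferred - scores_rejected), min=0.0).mean()

# Toy usage with made-up log-probabilities for a batch of two-response lists.
policy_logps = torch.tensor([[-12.3, -15.9], [-10.1, -11.4]])
ref_logps    = torch.tensor([[-12.0, -14.8], [-10.5, -11.0]])
s = response_scores(policy_logps, ref_logps)
print(dpo_loss(s[:, 0], s[:, 1]), slic_loss(s[:, 0], s[:, 1]))
```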

Advantages of Listwise Ranking

LiPO's advantage lies in its listwise perspective. Where traditional methods consider response pairs in isolation, LiPO-λ learns from entire lists of responses, a more holistic approach. Additionally, LiPO-λ incorporates label values into its optimization, a detail that earlier methods ignored; it thus accounts for the graded spectrum of response quality and can make more informed alignment decisions (see the sketch below). Empirically, in experiments on the Reddit TL;DR and AnthropicHH datasets, LiPO-λ outperformed existing methods such as DPO and SLiC by clear margins, and its benefits grew as the size of the response lists increased.
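As a rough illustration of this idea, the following sketch implements a LambdaLoss-style weighting over a list: each preference pair is scaled by how much swapping the two responses would change a DCG-like metric. The gain and discount definitions follow the standard LambdaLoss recipe and are assumptions here, not the paper's exact implementation.

```python
# Rough sketch of a LambdaLoss-style listwise objective in the spirit of LiPO-lambda.
import torch
import torch.nn.functional as F

def lipo_lambda_loss(scores, labels):
    """scores: (list_size,) policy-vs-reference scores s_i for one prompt's list.
    labels: (list_size,) graded quality labels psi_i (higher = better)."""
    # Rank positions induced by the current scores (1 = highest score).
    ranks = torch.empty_like(scores)
    ranks[torch.argsort(scores, descending=True)] = torch.arange(
        1, len(scores) + 1, dtype=scores.dtype)
    gains = 2.0 ** labels - 1.0                # DCG-style gain per response
    discounts = 1.0 / torch.log2(1.0 + ranks)  # DCG-style position discount

    # Lambda weight for every ordered pair: |gain gap| * |discount gap|.
    delta = torch.abs(gains[:, None] - gains[None, :]) * \
            torch.abs(discounts[:, None] - discounts[None, :])

    # Pairwise logistic loss on score differences, counted only where
    # response i is labeled strictly better than response j.
    pair_loss = -F.logsigmoid(scores[:, None] - scores[None, :])
    better = (labels[:, None] > labels[None, :]).float()
    return (delta * pair_loss * better).sum() / better.sum().clamp(min=1.0)

# Toy list of four responses with graded labels.
scores = torch.tensor([0.7, -0.2, 0.1, -0.5])
labels = torch.tensor([3.0, 1.0, 2.0, 0.0])
print(lipo_lambda_loss(scores, labels))
```

With the list size reduced to two and the lambda weights held constant, this objective falls back to the pairwise logistic loss above, which is the DPO special case noted earlier.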

Evaluation and Applications

The evaluations, conducted with three distinct approaches (proxy reward model, AutoSxS, and human evaluation), converge in affirming LiPO-λ's strengths. Under the proxy reward model, for instance, LiPO-λ's generated responses were rated more favorably against the SFT target than those of the competing methods. Moreover, its scalability to larger LM policies suggests wider applicability across natural language processing tasks.

Concluding Remarks

Listwise Preference Optimization (LiPO) brings a nuanced approach to aligning LMs with human preferences. Its incorporation of Learning-to-Rank techniques both simplifies and enhances the alignment process. The strong results of LiPO-λ substantiate its potential as a powerful tool for refining LLMs for real-world deployment, pointing toward a more efficient phase of model alignment techniques. Future work offers numerous possibilities, from deeper theoretical analysis of LambdaLoss's effectiveness in LM alignment to online learning strategies that further reduce distribution shift.
