Emergent Mind

Abstract

Practitioners commonly align LLMs using pairwise preferences, i.e., given labels of the type response A is preferred to response B for a given input. Perhaps less commonly, methods have also been developed for binary feedback, i.e. training models given labels of type response A is good or bad. We show how an existing performant binary feedback method, the Cringe Loss (Adolphs et al., 2022), can be generalized to the pairwise preference setting using a simple soft margin extension. Pairwise Cringe Loss is straightforward to implement and efficient to train, and we find it outperforms state-of-the-art preference optimization algorithms such as PPO and DPO on the AlpacaFarm benchmark. We show that iterations of training of our model are important for improved results, and that we can generalize DPO to Iterative DPO in the same way.

Overview

  • The paper introduces the Pairwise Cringe Loss for optimizing LLM performance using pairwise preference data.

  • Pairwise Cringe Loss builds upon the Cringe Loss method designed for binary feedback, introducing a soft margin for preferred vs. less preferred model responses.

  • In experiments, Pairwise Cringe Loss reduces repetition in generated text and yields higher-quality responses than baselines such as PPO and DPO.

  • The method outperforms PPO and DPO on the AlpacaFarm benchmark, which evaluates how well responses follow instructions.

  • It can be combined with the binary Cringe Loss when both pairwise and binary feedback are available.

Introduction to Pairwise Cringe Loss

LLM alignment methods differ in the type of feedback data they train on. The Pairwise Cringe Loss takes an established technique for binary feedback, in which each model response is labeled good or bad, and extends it to pairwise preferences, in which one response is chosen over another for a given input. The binary technique it builds on is the Cringe Loss (Adolphs et al., 2022).

Binary Feedback and Its Extension

The Cringe Loss was originally designed for binary feedback. It applies a standard training loss to acceptable examples and a contrastive loss to unacceptable ones, which makes the unacceptable sequences less likely to rank among the model's top candidates. Iterating the procedure, with the model's own generations labeled and used as new training data, improves performance further. Binary feedback is the less common case, however: pairwise preference data is widely used for training LLMs, so a method that handles it directly is needed. The Pairwise Cringe Loss does this with a soft margin that switches the loss on or off depending on the probability gap between the preferred and the less preferred response under the model. When the preferred response is already sufficiently more probable, the pair contributes little to training; otherwise the loss is applied. The margin compares whole sequences, while the underlying loss terms operate on individual token probabilities.
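The gating idea described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the sigmoid form of the gate, the `margin`, `temperature`, and `k` parameters, the unweighted sum of the two loss terms, and all function names are hypothetical choices made for this example.

```python
import math
import random

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def positive_token_loss(logits, target):
    """Standard negative log-likelihood for one token of an acceptable response."""
    return -log_softmax(logits)[target]

def cringe_token_loss(logits, negative, k=3, rng=random):
    """Contrast a token from an unacceptable response against a token sampled
    from the model's top-k predictions (assumed form; requires k >= 2)."""
    top_k = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    candidates = [i for i in top_k if i != negative]
    m = max(logits)
    weights = [math.exp(logits[i] - m) for i in candidates]
    positive = rng.choices(candidates, weights=weights, k=1)[0]
    # Two-way softmax: push the sampled token above the negative token.
    return -log_softmax([logits[positive], logits[negative]])[0]

def sequence_log_prob(step_logits, tokens):
    """Sum of per-token log-probabilities of a sequence."""
    return sum(log_softmax(l)[t] for l, t in zip(step_logits, tokens))

def margin_gate(logp_win, logp_lose, margin=1.0, temperature=1.0):
    """Soft switch: close to 1 while the preferred response is not yet ahead
    by `margin`, close to 0 once it is (assumed sigmoid form)."""
    return sigmoid((margin - (logp_win - logp_lose)) / temperature)

def pairwise_cringe_loss(win_logits, win_tokens, lose_logits, lose_tokens,
                         margin=1.0, temperature=1.0, k=3, rng=random):
    """Gate (sequence level) times the binary loss terms (token level)."""
    gate = margin_gate(sequence_log_prob(win_logits, win_tokens),
                       sequence_log_prob(lose_logits, lose_tokens),
                       margin, temperature)
    pos = sum(positive_token_loss(l, t) for l, t in zip(win_logits, win_tokens))
    neg = sum(cringe_token_loss(l, t, k, rng) for l, t in zip(lose_logits, lose_tokens))
    return gate * (pos + neg), gate
```

With toy logits in which the preferred response is already far more probable than the rejected one, the gate is close to zero and the pair is effectively ignored; swapping the two responses drives the gate toward one and the full loss applies.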

Experiments and Performance Comparison

In experiments, the Pairwise Cringe Loss was compared with the original binary Cringe Loss and with preference optimization methods such as PPO and DPO. It was more effective at reducing repetition, a common failure mode of LLM generation, and produced higher-quality text. On the AlpacaFarm benchmark, its responses followed the given instructions better than those of several state-of-the-art methods. A key observation is that iterative training improves results: the trained model generates new responses, a reward model scores them, and the scored responses form updated training data for the next training iteration.
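One way the data-collection step of such an iteration might look is sketched below. The number of samples per prompt, the choice of the highest- and lowest-scoring responses as the pair, and the `generate` and `reward` callables are assumptions made for illustration; the summary above does not specify them.

```python
def build_preference_pairs(prompts, generate, reward, num_samples=4):
    """Sample responses from the current model, score them with a reward
    model, and keep the best and worst as a (chosen, rejected) pair.

    `generate(prompt)` and `reward(prompt, response)` are hypothetical
    callables standing in for the trained model and the reward model.
    """
    pairs = []
    for prompt in prompts:
        scored = []
        for _ in range(num_samples):
            response = generate(prompt)
            scored.append((reward(prompt, response), response))
        best_score, best = max(scored, key=lambda item: item[0])
        worst_score, worst = min(scored, key=lambda item: item[0])
        # Skip prompts where the reward model cannot separate the samples.
        if best_score > worst_score:
            pairs.append({"prompt": prompt, "chosen": best, "rejected": worst})
    return pairs
```

The resulting pairs would be added to the existing preference data before the next round of training, and the loop repeated with the newly trained model as the generator.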

Concluding Remarks

The main takeaway is that the Pairwise Cringe Loss is a simple and efficient method for training instruction-following LLMs, and that it performs well when benchmarked against leading alternatives. It can also be used alongside binary feedback by combining the binary Cringe Loss with the Pairwise Cringe Loss when the data contains both feedback types. This makes it a strong candidate for future LLM training and alignment work.
