
Decoding-Time Language Model Alignment with Multiple Objectives

(2406.18853)
Published Jun 27, 2024 in cs.LG

Abstract

Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives. Here, we propose $\textbf{multi-objective decoding (MOD)}$, a decoding-time algorithm that outputs the next token from a linear combination of predictions of all base models, for any given weightings over different objectives. We exploit a common form among a family of $f$-divergence regularized alignment approaches (such as PPO, DPO, and their variants) to identify a closed-form solution by Legendre transform, and derive an efficient decoding strategy. Theoretically, we show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method. Empirical results demonstrate the effectiveness of the algorithm. For example, compared to a parameter-merging baseline, MOD achieves 12.8% overall reward improvement when equally optimizing towards $3$ objectives. Moreover, we experiment with MOD on combining three fully-finetuned LLMs of different model sizes, each aimed at different objectives such as safety, coding, and general user preference. Unlike traditional methods that require careful curation of a mixture of datasets to achieve comprehensive improvement, we can quickly experiment with preference weightings using MOD to find the best combination of models. Our best combination reduces toxicity on Toxigen to nearly 0% and achieves 7.9--33.3% improvement across the other three metrics ($\textit{i.e.}$, Codex@1, GSM-COT, BBH-COT).

Figure: Safety alignment comparison using different divergence metrics; MOD's frontier consistently outperforms RS and is smoother than MODPO's.

Overview

  • The paper proposes Multi-Objective Decoding (MOD), a novel algorithm for aligning language models with human preferences by optimizing multiple objectives simultaneously.

  • Theoretical insights are provided using $f$-divergence regularized alignment approaches, and MOD combines predictive distributions of various base models to allow flexible, on-the-fly adjustment.

  • Empirical validation demonstrates MOD's superior performance across tasks like Reddit summarization, helpful assistant tasks, safety alignment, and open instruction-following, outperforming traditional parameter-merging methods.

Decoding-Time Language Model Alignment with Multiple Objectives

The paper "Decoding-Time Language Model Alignment with Multiple Objectives" by Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, and Simon S. Du proposes a novel approach to align language models (LMs) with human preferences by optimizing multiple objectives simultaneously. This work addresses a significant limitation in existing methods that typically focus on a single reward function, thereby enhancing the adaptability and practical utility of LMs for diverse and dynamic user needs.

Key Contributions and Methodology

This paper introduces Multi-Objective Decoding (MOD), a decoding-time algorithm that combines the predictive distributions of multiple base models, each tuned for a different objective. MOD allows on-the-fly adjustment of LMs to varying preference weightings without retraining, providing a versatile and efficient solution for multi-objective alignment.
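As a rough illustration of the decoding loop (a minimal sketch, not the authors' implementation: the model names, the greedy token selection, and the assumption that all base models share one tokenizer are ours), the per-token log-probabilities of the base models are combined with the chosen weights before picking the next token:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoints: base LMs each aligned to a single objective
# (names are hypothetical, not from the paper); all must share a tokenizer.
MODEL_NAMES = ["org/helpful-lm", "org/harmless-lm"]
WEIGHTS = [0.6, 0.4]  # user-chosen preference weighting, summing to 1

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAMES[0])
models = [AutoModelForCausalLM.from_pretrained(n).eval() for n in MODEL_NAMES]

@torch.no_grad()
def mod_generate(prompt: str, max_new_tokens: int = 64) -> str:
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Per-token log-probabilities from each base model at the current step.
        log_probs = [
            torch.log_softmax(m(input_ids).logits[:, -1, :], dim=-1) for m in models
        ]
        # Linear combination in log space, i.e. a weighted geometric mean of the
        # base distributions (the reverse-KL case); renormalization is unnecessary
        # here because only the argmax is needed.
        combined = sum(w * lp for w, lp in zip(WEIGHTS, log_probs))
        next_token = combined.argmax(dim=-1, keepdim=True)  # greedy for simplicity
        input_ids = torch.cat([input_ids, next_token], dim=-1)
        if next_token.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```

Sampling instead of greedy decoding would simply apply a softmax to the combined log-probabilities; the preference weighting can be changed per query without touching any model weights.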

Theoretical Foundations

The authors leverage a common form among a family of $f$-divergence regularized alignment approaches to identify a closed-form solution via Legendre transformation. This theoretical insight supports the derivation of an efficient decoding strategy. Specifically, MOD employs strong barrier functions to ensure the optimality of the combined predictions from multiple base models, allowing precise control over the generation characteristics.
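For intuition, here is a sketch of the reverse-KL special case (the general result replaces the exponential with the map obtained from the Legendre transform of the chosen $f$). Each base policy $\pi_i$ maximizes $\mathbb{E}[r_i] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})$, so $\pi_i \propto \pi_{\text{ref}} \exp(r_i/\beta)$. For a weighted reward $\sum_i w_i r_i$, the optimal policy then factors through the base policies:

$$
\pi^{*}(y \mid x) \;\propto\; \pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\textstyle\sum_i w_i\, r_i(x, y)\Big) \;=\; \pi_{\text{ref}}(y \mid x)\prod_i \left(\frac{\pi_i(y \mid x)}{\pi_{\text{ref}}(y \mid x)}\right)^{w_i},
$$

which, when $\sum_i w_i = 1$, is simply $\prod_i \pi_i(y \mid x)^{w_i}$ up to normalization; neither the reward models nor any retraining is needed at decoding time.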

Empirical Validation

The paper presents robust empirical evidence supporting the efficacy of MOD across various tasks and datasets:

  1. Reddit Summary Task: MOD demonstrates superior performance over parameter-merging baselines and MORLHF by achieving higher rewards in summary quality and faithfulness.
  2. Helpful Assistant Task: In optimizing towards attributes such as helpfulness, harmlessness, and humor, MOD consistently outperforms parameter-merging baselines and exhibits competitive results against MORLHF.
  3. Safety Alignment Task: The implementation of MOD with $f$-DPO models highlights its robustness across diverse $f$-divergences, including Reverse KL-divergence, JSD, and other parameterized divergences, outperforming baselines such as RS and showing effectiveness even in scenarios with mixed positive and negative weightings.
  4. Open Instruction-Following Task: MOD effectively combines large-scale models tuned for different objectives, enhancing overall performance in tasks requiring attributes like safety, coding accuracy, and reasoning ability.

Theoretical Analysis and Insights

Sub-optimality of Parameter Merging

The paper rigorously demonstrates the limitations of the parameter-merging paradigm, particularly under commonly used $f$-divergences. It shows that the optimal policy for a combined objective often lies outside the set of policies reachable by interpolating the parameters of the base policies. This suboptimality underscores the necessity of the proposed MOD algorithm, which avoids such pitfalls.
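A toy contrast (purely illustrative, not the paper's formal counterexample): parameter merging interpolates model weights before decoding, whereas MOD interpolates the models' output distributions at each step; for nonlinear models the two generally yield different next-token distributions.

```python
import torch

torch.manual_seed(0)

def policy_logits(params, x):
    """A toy nonlinear 'policy': a two-layer net mapping context features to logits."""
    A, B = params
    return B @ torch.tanh(A @ x)

x = torch.randn(4)                                # toy context features
theta1 = (torch.randn(8, 4), torch.randn(5, 8))   # policy tuned for objective 1
theta2 = (torch.randn(8, 4), torch.randn(5, 8))   # policy tuned for objective 2
w = 0.5                                           # equal preference weighting

# Parameter merging: interpolate the weights first, then compute one distribution.
theta_merged = tuple(w * a + (1 - w) * b for a, b in zip(theta1, theta2))
p_merge = torch.softmax(policy_logits(theta_merged, x), dim=-1)

# MOD-style combination: interpolate the policies' output log-probabilities.
lp1 = torch.log_softmax(policy_logits(theta1, x), dim=-1)
lp2 = torch.log_softmax(policy_logits(theta2, x), dim=-1)
p_mod = torch.softmax(w * lp1 + (1 - w) * lp2, dim=-1)

print(p_merge)
print(p_mod)  # generally differs from p_merge
```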

Necessity of Barrier Functions

The authors establish that barrier functions are crucial for ensuring the solvability of the multi-objective optimization problem: such functions prevent large deviations from the reference policy and guarantee a feasible solution space for aligning with multiple objectives. Reverse KL regularization behaves this way, for example, since its penalty grows without bound as the policy places probability mass on tokens to which the reference policy assigns vanishing probability.

Robustness Against Sub-optimal Base Policies

The paper also explores the robustness of MOD when base policies are sub-optimal. The performance bounds and error propagation analyses indicate that MOD maintains its efficacy even when the base models are not fully optimal, making it a practical solution for real-world applications.

Practical and Theoretical Implications

The practical implications of this research are substantial. MOD provides a flexible and efficient method for aligning LMs with complex, multi-faceted user preferences without requiring extensive retraining. This capability is particularly valuable in dynamic environments where user needs and preferences can change rapidly.

Theoretically, this work opens avenues for further exploration in multi-objective optimization in LMs, particularly in the context of $f$-divergences and their role in model alignment. It also highlights the potential for extending the framework to other settings, such as supervised fine-tuning and proxy-tuning, further broadening the scope of its applicability.

Future Directions

Potential future developments in this line of research could include:

  1. Extension to Larger Model Architectures: Scaling MOD to even larger models and more diverse sets of objectives.
  2. Integration with Energy-Based Models: Enhancing the decoding efficiency and robustness using energy-based approaches.
  3. User-Specific Customization: Developing methods to further personalize LMs for individual users based on real-time feedback and preferences.

In conclusion, the paper makes significant advancements in the field of LM alignment, providing both practical tools and theoretical insights that pave the way for more adaptive and user-aligned AI systems.
