Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning LLMs to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate RM and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when finetuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings. Our code and models are available at https://github.com/shenao-zhang/SELM.

Figure: Impact of $\alpha$ on SELM win rates, reward distribution, and the exploration-to-exploitation transition.


  • The SELM framework enhances preference optimization in LLMs by integrating active exploration into Reinforcement Learning from Human Feedback, aiming for better alignment with human intentions.

  • SELM incorporates an optimism term into the reward objective, encouraging exploration of out-of-distribution responses and thus enabling more dynamic and effective learning.

  • Empirical results show that SELM significantly outperforms traditional methods across multiple benchmarks, highlighting its robustness and superior performance in various instruction-following tasks.

An Expert Overview of the SELM Framework for Preference Optimization in LLMs

The paper introduces Self-Exploring Language Models (SELM), a framework for preference optimization in LLMs. SELM integrates active exploration into Reinforcement Learning from Human Feedback (RLHF), aiming to produce LLMs that are better aligned with human intentions and that score higher on instruction-following benchmarks.

Core Approach and Theoretical Foundations

The SELM framework builds on the premise that collecting feedback online, on the model's own generations, tends to yield more capable reward models and better-aligned LLMs than training on a fixed dataset. Standard RLHF procedures can still get stuck in local optima, because responses sampled from a reward-maximizing LLM cover only a small part of the response space. SELM addresses this by adding an optimism term to the reward-fitting objective, which encourages the exploration of out-of-distribution (OOD) responses.

The paper introduces a bilevel optimization objective that adds an optimism term $\alpha \max_y r(x, y)$ to reward fitting. This term biases the reward model toward potentially high-reward responses that have not yet been explored, so that subsequent sampling covers more of the response space. SELM then solves the inner-level problem with a reparameterized reward function, which removes the need for a separate reward model (RM) and leaves a single, simpler objective for updating the LLM.
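To make the optimism term concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the standard DPO reparameterization, in which the implicit reward is $\beta$ times the log-ratio of policy to reference probabilities, and it approximates $\max_y r(x, y)$ by a maximum over a small, hypothetical set of candidate responses. The function names, the candidate set, and the default values of `alpha` and `beta` are illustrative assumptions; the closed-form SELM objective derived in the paper may differ from this direct transcription.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def implicit_reward(logp_policy: float, logp_ref: float, beta: float) -> float:
    """Implicit reward under the DPO reparameterization (up to a prompt-only constant)."""
    return beta * (logp_policy - logp_ref)


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    margin = (implicit_reward(logp_w, ref_logp_w, beta)
              - implicit_reward(logp_l, ref_logp_l, beta))
    return -log_sigmoid(margin)


def optimistic_loss(pair, candidates, alpha=0.01, beta=0.1):
    """DPO loss minus alpha times the best implicit reward among the candidates.

    pair:       (logp_w, logp_l, ref_logp_w, ref_logp_l) for one preference pair
    candidates: list of (logp_policy, logp_ref) for sampled responses; this finite
                set is an assumption standing in for the max over all responses y
    """
    fit = dpo_loss(*pair, beta=beta)
    bonus = max(implicit_reward(lp, ref_lp, beta) for lp, ref_lp in candidates)
    return fit - alpha * bonus


if __name__ == "__main__":
    # Hypothetical log-probabilities, for illustration only.
    pair = (-10.0, -12.0, -11.0, -11.0)
    candidates = [(-8.0, -9.0), (-7.0, -7.5)]
    print("DPO loss:       ", dpo_loss(*pair))
    print("Optimistic loss:", optimistic_loss(pair, candidates, alpha=0.5))
```

With `alpha=0` the function reduces to the plain DPO loss; a larger `alpha` lowers the loss further whenever the policy places high implicit reward on at least one candidate.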

Empirical Validation

Experiments support the efficacy of SELM across multiple benchmarks. The framework was applied to the Zephyr-7B-SFT and Llama-3-8B-Instruct models, and it improved performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0. Specifically, SELM outperforms the baseline iterative Direct Preference Optimization (DPO) by margins of +16.24% and +11.75% on AlpacaEval 2.0 and +2.31 and +0.32 on MT-Bench, respectively.

Additionally, SELM improved results on various academic benchmarks in zero-shot, few-shot, and Chain-of-Thought (CoT) settings. The gains were consistent across training iterations, which suggests that the method is robust.

Implications and Future Directions

On the theoretical side, SELM has a clear implication for AI alignment: by actively exploring OOD regions, it reduces the risk that a model settles into a local optimum and raises the chance of discovering globally better responses. On the practical side, adding optimism to the RLHF process offers a more efficient way to fine-tune LLMs, since exploration is directed toward potentially high-reward responses rather than left to random sampling.

The optimism-based exploration in SELM could also be combined with other online RLHF methods, and future research could examine whether such combinations yield further gains.


In summary, the SELM framework introduces an effective approach to preference optimization in LLMs. By exploring actively through an optimism-biased objective, SELM improves the alignment and performance of LLMs across various benchmarks. This research points to the importance of exploration-based strategies in preference optimization. The code and models associated with this study are available at https://github.com/shenao-zhang/SELM.

