Direct Alignment of Language Models via Quality-Aware Self-Refinement (2405.21040v1)

Published 31 May 2024 in cs.CL and cs.AI

Abstract: Reinforcement Learning from Human Feedback (RLHF) has been commonly used to align the behaviors of LLMs with human preferences. Recently, a popular alternative is Direct Policy Optimization (DPO), which replaces an LLM-based reward model with the policy itself, thus obviating the need for extra memory and training time to learn the reward model. However, DPO does not consider the relative qualities of the positive and negative responses, and can lead to sub-optimal training outcomes. To alleviate this problem, we investigate the use of intrinsic knowledge within the on-the-fly fine-tuning LLM to obtain relative qualities and help to refine the loss function. Specifically, we leverage the knowledge of the LLM to design a refinement function to estimate the quality of both the positive and negative responses. We show that the constructed refinement function can help self-refine the loss function under mild assumptions. The refinement function is integrated into DPO and its variant Identity Policy Optimization (IPO). Experiments across various evaluators indicate that they can improve the performance of the fine-tuned models over DPO and IPO.

Citations (6)

Summary

  • The paper introduces a quality-aware self-refinement approach using Sr-DPO and Sr-IPO to integrate intrinsic LLM knowledge into loss function adjustments.
  • It demonstrates that these methods outperform traditional DPO, achieving higher accuracy on benchmarks such as the Open LLM Leaderboard, ARC, and TruthfulQA.
  • The study implies that dynamic self-assessment in models can reduce reliance on human-annotated data, paving the way for more adaptable AI alignment strategies.

Direct Alignment of LLMs via Quality-Aware Self-Refinement

The paper "Direct Alignment of LLMs via Quality-Aware Self-Refinement" addresses the problem of aligning LLMs directly with human preferences. Capturing human feedback accurately during fine-tuning is pivotal for developing AI systems that are both safe and controllable.

Key Contributions and Methodology

The authors start from the Direct Policy Optimization (DPO) approach, which substitutes the policy itself for a learned reward model, bypassing the extra memory and training time that a reward model requires. A significant drawback they note is that DPO ignores the relative quality of the positive and negative responses, which can lead to sub-optimal training outcomes, especially when the quality gap between the two responses is small. To mitigate this issue, the paper introduces a refinement mechanism that draws on the intrinsic knowledge of the LLM being fine-tuned: this knowledge is used to construct a refinement function that adjusts the loss function dynamically during training, potentially improving the model's performance without requiring a pre-defined reward model.
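
For context, standard DPO minimizes a logistic loss on the implicit reward margin between the chosen and rejected responses, and it treats every preference pair identically regardless of how large that quality gap actually is. The following is a minimal PyTorch sketch of that baseline loss; the tensor names and the value of beta are illustrative and not taken from the paper's code.

```python
import torch.nn.functional as F

def dpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l, beta=0.1):
    """Standard DPO loss on summed log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and a frozen reference model."""
    # Implicit rewards: scaled log-ratio of policy to reference likelihood.
    reward_w = beta * (policy_logps_w - ref_logps_w)
    reward_l = beta * (policy_logps_l - ref_logps_l)
    # Logistic loss on the reward margin; it does not account for *how much*
    # better the chosen response is -- the gap the paper targets.
    return -F.logsigmoid(reward_w - reward_l).mean()
```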

The proposed methodology is an enhancement over existing DPO strategies, focusing on two novel approaches: Self-refined DPO (Sr-DPO) and Self-refined Identity Policy Optimization (Sr-IPO). The refinement function operationalizes intrinsic LLM knowledge to self-adjust the loss function during training. Sr-DPO and Sr-IPO integrate this refinement into DPO's framework, promoting effective model alignment with human feedback.
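
The abstract does not give the exact functional form of the refinement function, so the sketch below rests on an assumption: the quality of each response is estimated by its length-normalized log-likelihood under the policy itself, and the resulting gap enters the DPO and IPO objectives as an adaptive margin. The helper quality_gap, the weight lam, and the detached gradient are illustrative choices for this sketch, not equations reproduced from the paper.

```python
import torch.nn.functional as F

def quality_gap(policy_logps_w, policy_logps_l, len_w, len_l):
    """Assumed quality estimate: length-normalized log-likelihood gap between
    the chosen (w) and rejected (l) responses under the policy itself."""
    return policy_logps_w / len_w - policy_logps_l / len_l

def sr_dpo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l,
                len_w, len_l, beta=0.1, lam=0.5):
    """Sketch of a quality-aware DPO variant: the estimated quality gap acts
    as an adaptive margin on the implicit reward difference."""
    margin = beta * ((policy_logps_w - ref_logps_w) -
                     (policy_logps_l - ref_logps_l))
    delta = quality_gap(policy_logps_w, policy_logps_l, len_w, len_l).detach()
    return -F.logsigmoid(margin - lam * delta).mean()

def sr_ipo_loss(policy_logps_w, policy_logps_l, ref_logps_w, ref_logps_l,
                len_w, len_l, tau=0.1, lam=0.5):
    """Sketch of the analogous IPO-style variant: squared-error regression of
    the log-ratio difference toward a target shifted by the quality gap."""
    h = (policy_logps_w - ref_logps_w) - (policy_logps_l - ref_logps_l)
    delta = quality_gap(policy_logps_w, policy_logps_l, len_w, len_l).detach()
    return ((h - (1.0 / (2.0 * tau) + lam * delta)) ** 2).mean()
```

In both sketches the refinement term shifts the training target per example, so pairs whose quality difference the model itself judges to be large receive a different effective margin than near-ties.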

Experimental Evaluations

Three experimental benchmarks were used to validate the efficacy of the proposed methods: MT-Bench, Vicuna-Bench, and the Open LLM Leaderboard. Using a selection of diverse datasets, including the HH-RLHF dataset for supervised fine-tuning and the UltraFeedback dataset for large-scale preference learning, the authors demonstrate that the self-refined approaches generally outperform their non-self-refined counterparts.

Quantitative results indicate that Sr-DPO and Sr-IPO effectively reduce the reward difference margin while maintaining high accuracy. For example, Sr-DPO outperformed traditional DPO in accuracy across several tasks on the Open LLM Leaderboard, with its largest improvements on the ARC and TruthfulQA benchmarks.

Implications and Future Directions

The research implications extend to refining and improving LLM alignment methodologies, particularly those employing offline and online alignment strategies. By leveraging the self-assessment capability of the LLM itself, the paper suggests a potential path toward reducing reliance on extensive human-annotated datasets.

Future research could explore online, policy-based direct alignment, which could combine real-time feedback mechanisms with direct alignment processes. Paired with quality-aware self-refinement strategies, such developments could foster more robust, adaptable AI systems and broaden their application spectrum.

In sum, this paper delineates a novel paradigm in LLM alignment that emphasizes inherent model capabilities, refines the training process, and achieves superior alignment accuracy. It not only improves the immediate alignment task at hand but also offers a framework adaptable to evolving AI challenges, setting the stage for further developments in AI alignment methodologies.
