
RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs

(2407.02552)
Published Jul 2, 2024 in cs.CL, cs.AI, and cs.LG

Abstract

Preference optimization techniques have become a standard final stage for training state-of-the-art LLMs. However, despite widespread adoption, the vast majority of work to date has focused on first-class citizen languages like English and Chinese. This not only captures a small fraction of the languages in the world, but also makes it unclear which aspects of current state-of-the-art research transfer to a multilingual setting. In this work, we perform an exhaustive study to achieve a new state-of-the-art in aligning multilingual LLMs. We introduce a novel, scalable method for generating high-quality multilingual feedback data to balance data coverage. We establish the benefits of cross-lingual transfer and increased dataset size in preference training. Our preference-trained model achieves a 54.4% win-rate against Aya 23 8B, the current state-of-the-art multilingual LLM in its parameter class, and a 69.5% win-rate or higher against widely used models like Gemma-1.1-7B-it, Llama-3-8B-Instruct, and Mistral-7B-Instruct-v0.3. As a result of our study, we expand the frontier of alignment techniques to 23 languages covering half of the world's population.

Figure: Win-rates of the preference-trained model vs. state-of-the-art models across 23 languages.

Overview

  • The paper by Dang et al. explores optimizing multilingual LLMs using Preference Optimization techniques, addressing challenges such as data scarcity and training instability across multiple languages.

  • Key contributions include a novel method for generating high-quality multilingual feedback data, a comparison of offline and online preference optimization techniques, and a thorough analysis of training dynamics and downstream performance.

  • Significant findings highlight the necessity of multilingual preference data for optimal performance, the superiority of online optimization methods over offline ones, and substantial implications for the inclusivity and utility of LLMs across diverse linguistic groups.

LLMs have achieved significant successes in recent years, largely confined to high-resource languages such as English and Chinese; extending these models to low-resource languages remains a challenge. The paper "RLHF Can Speak Many Languages: Unlocking Multilingual Preference Optimization for LLMs" by Dang et al. addresses this gap with a systematic study of preference optimization for multilingual LLMs.

Core Contributions and Methodology

The study addresses several key challenges in multilingual LLM training, such as data scarcity and quality, training instability across multiple languages, and the effectiveness of preference optimization methods. The core contributions of this work can be summarized as follows:

  1. Novel Data Generation Method: The authors propose a scalable method for generating high-quality multilingual feedback data. By leveraging LLMs such as Cohere’s Command and Command R+, they create diverse, in-language completion pairs, avoiding the common pitfalls of translationese (a sketch of such a pipeline follows this list).
  2. Extensive Evaluation: The paper comprehensively evaluates training dynamics, comparing offline preference optimization (Direct Preference Optimization, DPO) with online methods (REINFORCE-Leave-One-Out, RLOO).
  3. Training and Performance Metrics: The effectiveness of preference optimization across multiple languages is thoroughly investigated. For instance, the authors report that their preference-trained model achieves a 54.4% win-rate against the state-of-the-art Aya 23 8B model and substantially higher win-rates against other popular models like Gemma-1.1-7B-it and Llama-3-8B-Instruct.
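
To make the data-generation recipe concrete, here is a minimal sketch of such a pipeline in Python, assuming in-language prompts are already available; the `generate_strong`, `generate_weak`, and `judge` callables are hypothetical stand-ins for the LLM APIs involved, not the authors' actual implementation.

```python
# Illustrative sketch: build (prompt, chosen, rejected) preference pairs per language
# by sampling completions from two models and letting an LLM judge pick the winner.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class PreferencePair:
    language: str
    prompt: str
    chosen: str
    rejected: str

def build_preference_data(
    prompts_by_lang: dict[str, Iterable[str]],
    generate_strong: Callable[[str], str],  # completion from a stronger model
    generate_weak: Callable[[str], str],    # completion from a weaker model
    judge: Callable[[str, str, str], int],  # 0 if the first completion wins, 1 otherwise
) -> list[PreferencePair]:
    pairs: list[PreferencePair] = []
    for lang, prompts in prompts_by_lang.items():
        for prompt in prompts:
            a, b = generate_strong(prompt), generate_weak(prompt)
            winner = judge(prompt, a, b)
            chosen, rejected = (a, b) if winner == 0 else (b, a)
            pairs.append(PreferencePair(lang, prompt, chosen, rejected))
    return pairs
```

Because completions are generated natively in each language rather than translated, the resulting pairs sidestep translationese while still covering all target languages.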

Key Findings

Cross-lingual Transfer: One of the most notable findings is the evidence of cross-lingual transfer in preference optimization: even when preference training uses English-only data, performance in other languages improves noticeably. These gains become even more pronounced when the preference data includes a small number of additional languages.
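
These per-language improvements are reported as head-to-head win-rates. A minimal sketch of such a per-language win-rate computation, assuming an LLM-judge callable and a simple evaluation-set layout (both illustrative, not the paper's evaluation harness):

```python
# Illustrative sketch: per-language win-rate of a tuned model against a baseline,
# decided example-by-example by an LLM judge.
from collections import defaultdict
from typing import Callable

def per_language_win_rates(
    eval_set: list[dict],                   # items: {"language", "prompt", "ours", "baseline"}
    judge: Callable[[str, str, str], int],  # 0 if the first completion wins, 1 otherwise
) -> dict[str, float]:
    wins: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for ex in eval_set:
        lang = ex["language"]
        totals[lang] += 1
        if judge(ex["prompt"], ex["ours"], ex["baseline"]) == 0:
            wins[lang] += 1
    return {lang: wins[lang] / totals[lang] for lang in totals}
```

Breaking the metric down per language is what makes the transfer effect visible: win-rates rise even for languages whose data never appeared in preference training.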

Multilingual Data Necessity: The study underscores the necessity of using multilingual preference data for optimal performance in multilingual LLMs. The addition of more languages in the training data leads to improved win-rates, indicating the value of diverse linguistic input.

Online vs. Offline Optimization: The research demonstrates that online preference optimization (RLOO) significantly outperforms the offline method (DPO). Notably, RLOO shows better cross-lingual transfer, consistently higher win-rates, and stronger performance on languages not seen during preference training.
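
To clarify the distinction, here is a simplified, side-by-side sketch of the two objectives in PyTorch; tensor shapes, reward values, and hyperparameters are assumed for illustration, and this is not the paper's training code.

```python
# Simplified contrast of offline DPO vs. online RLOO, operating on per-sequence
# log-probabilities (summed over tokens).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Offline: learns directly from a fixed dataset of (chosen, rejected) pairs."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

def rloo_loss(sample_logps, sample_rewards):
    """Online: k fresh samples per prompt are scored by a reward model; each
    sample's baseline is the mean reward of the other k-1 samples."""
    k = sample_rewards.shape[1]                       # shapes: (batch, k)
    baseline = (sample_rewards.sum(dim=1, keepdim=True) - sample_rewards) / (k - 1)
    advantage = (sample_rewards - baseline).detach()  # leave-one-out advantage
    return -(advantage * sample_logps).mean()         # REINFORCE-style surrogate loss
```

The practical difference is that DPO consumes a static preference dataset, whereas RLOO repeatedly samples from the current policy and scores those samples with a reward model, which is what makes it an online method.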

Application of Findings: The paper expands alignment techniques to 23 languages, covering roughly half of the world’s population. The preference-trained Aya 23 8B model outperforms both its base model and widely used open-weight models, underscoring the practical and theoretical value of the presented methods.

Practical and Theoretical Implications

Practical: The methodologies and findings have substantial implications for broadening the accessibility and utility of LLMs across different linguistic groups. This approach allows for more inclusive AI models that are beneficial to underrepresented communities.

Theoretical: The robustness of multilingual preference optimization and cross-lingual transfer provides insights into how linguistic diversity in training data influences model performance. It supports the hypothesis that multilingual learning benefits from diverse inputs and that these methods can be generalized to other AI training paradigms.

Future Directions

Future work may explore several promising directions:

  1. Scaling to More Languages: Increasing the number of languages supported by LLMs will be a crucial step to further enhance the global applicability and fairness of these models.
  2. Improving Annotation and Dataset Quality: Addressing the inherent biases in synthetic data and translations remains an ongoing challenge. Future research could explore more sophisticated data augmentation and annotation methods to mitigate these biases.
  3. Exploring Larger Models: Due to compute constraints, the current study is limited to 8-billion-parameter models. Future work should investigate the effects of scaling these techniques to larger models to understand the implications for performance and stability.

Conclusion

The work by Dang et al. stands as a comprehensive and rigorous examination of multilingual preference optimization, opening new avenues for multilingual LLM research. Through innovative data generation methods, thorough evaluation, and significant performance improvements, the paper makes substantial contributions toward making AI tools more inclusive and effective across diverse linguistic landscapes.
