Abstract

Aligning language models (LMs) based on human-annotated preference data is a crucial step in obtaining practical and performant LM-based systems. However, multilingual human preference data are difficult to obtain at scale, making it challenging to extend this framework to diverse languages. In this work, we evaluate a simple approach for zero-shot cross-lingual alignment, where a reward model is trained on preference data in one source language and directly applied to other target languages. On summarization and open-ended dialog generation, we show that this method is consistently successful under comprehensive evaluation settings, including human evaluation: cross-lingually aligned models are preferred by humans over unaligned models on up to >70% of evaluation instances. We moreover find that a different-language reward model sometimes yields better aligned models than a same-language reward model. We also identify best practices when there is no language-specific data for even supervised finetuning, another component in alignment.

Figure: Cross-lingual reward model transfer, with an English RM repurposed for Spanish alignment, bypassing the usual monolingual pipeline.

Overview

  • This paper explores the potential of using a single-language reward model (RM) to align language models (LMs) across multiple languages, focusing on tasks like summarization and dialog generation.

  • The study uses reinforcement learning and best-of-$n$ reranking as reward optimization methods, finding that cross-lingually transferred RMs sometimes yield better-aligned models than same-language RMs.

  • Key findings indicate that RMs generalize strongly across languages, and that using an RM trained on a different language may reduce overfitting to language-specific artifacts.

  • Future recommendations suggest leveraging RMs from high-resource languages to guide LM alignment in less resourced languages, coupled with thorough evaluations to ensure semantic and pragmatic consistency across languages.

Evaluating Zero-Shot Cross-Lingual Alignment in Language Models Using a Single-Language Reward Model

Introduction

Cross-lingual transfer of reward models (RMs) is a candidate approach for language model (LM) alignment when multilingual preference data are scarce. This work investigates whether a single-language RM can align LMs across multiple languages, offering a potential route to scaling alignment to languages for which no preference data exist.

Zero-Shot Cross-Lingual Transfer of Reward Models

The core methodology transfers an RM trained on one source language to guide the alignment of LMs in target languages. This sidesteps the need for target-language annotated preference data by leveraging the interlingual generality of pretrained multilingual models. The paper studies two tasks, summarization and open-ended dialog generation, using reinforcement learning and best-of-$n$ reranking as reward optimization techniques.
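To make the procedure concrete, below is a minimal sketch of best-of-$n$ reranking with a cross-lingually transferred RM, using Hugging Face transformers. The model names, the prompt formatting fed to the RM, and the pairing of an English-trained RM with a Spanish prompt are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: best-of-n reranking where the reward model was trained on
# preference data in a different language than the prompts it scores.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

policy_name = "my-org/multilingual-sft-policy"      # hypothetical SFT policy
rm_name = "my-org/english-trained-reward-model"     # hypothetical English-trained RM

policy_tok = AutoTokenizer.from_pretrained(policy_name)
policy = AutoModelForCausalLM.from_pretrained(policy_name)
rm_tok = AutoTokenizer.from_pretrained(rm_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(rm_name, num_labels=1)

def best_of_n(prompt: str, n: int = 8, max_new_tokens: int = 128) -> str:
    """Sample n candidates from the policy and keep the one the RM scores highest."""
    inputs = policy_tok(prompt, return_tensors="pt")
    outputs = policy.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,
        num_return_sequences=n,
        max_new_tokens=max_new_tokens,
    )
    prompt_len = inputs["input_ids"].shape[1]
    candidates = [
        policy_tok.decode(seq[prompt_len:], skip_special_tokens=True)
        for seq in outputs
    ]
    # Score each candidate with the RM, even though the RM was trained on
    # preference data in a different language. The prompt+response formatting
    # here is an assumption; use whatever format the RM was trained with.
    scores = []
    for cand in candidates:
        rm_inputs = rm_tok(prompt + "\n" + cand, return_tensors="pt", truncation=True)
        with torch.no_grad():
            scores.append(reward_model(**rm_inputs).logits.squeeze().item())
    return candidates[max(range(n), key=lambda i: scores[i])]

# Example: a Spanish summarization prompt scored by the English-trained RM.
print(best_of_n("Resume el siguiente artículo:\n..."))
```

The same RM can serve as the reward signal in an RL loop (e.g., PPO); best-of-$n$ is shown here only because it isolates the RM's cross-lingual scoring behavior without any policy training.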

Cross-lingual effectiveness is measured through comprehensive evaluation, including direct human evaluation and automated judgments from larger LMs (GPT-4 and PaLM-2-L), revealing a surprising observation: models aligned with an RM transferred from another language often surpassed the alignment quality of models that used a same-language RM. This suggests that RMs generalize robustly across input languages, and that biases a same-language RM absorbs from its training data may be sidestepped by using a different-language RM.
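As a rough illustration of this evaluation protocol, the sketch below computes a pairwise win rate from an LM judge's verdicts. The `judge` callable is a placeholder (the actual judging prompts and APIs are not specified here), and the order randomization is a common precaution against position bias rather than a detail taken from the paper.

```python
# Sketch: pairwise win-rate evaluation with an LM judge (e.g., GPT-4 or PaLM-2-L).
# `judge` is a placeholder callable returning "A", "B", or "tie".
import random
from typing import Callable, Iterable, Tuple

def win_rate(
    pairs: Iterable[Tuple[str, str, str]],          # (prompt, aligned_out, baseline_out)
    judge: Callable[[str, str, str], str],
) -> float:
    """Fraction of instances on which the judge prefers the aligned model's output."""
    wins, total = 0, 0
    for prompt, aligned, baseline in pairs:
        # Randomize presentation order to control for position bias in the judge.
        if random.random() < 0.5:
            aligned_won = judge(prompt, aligned, baseline) == "A"
        else:
            aligned_won = judge(prompt, baseline, aligned) == "B"
        wins += aligned_won
        total += 1
    return wins / total
```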

Key Results and Observations

  • Generalizability of RMs: Despite being trained on data from a single language, RMs effectively drove alignment in other languages, with human evaluators preferring the aligned models on up to over 70% of evaluation instances.
  • Comparison with Translate-Train Baseline: RMs transferred cross-lingually in a zero-shot fashion outperformed the translate-train baseline, in which the RM's training data are machine-translated into the target language (see the sketch after this list), suggesting that the original RMs transfer across languages more faithfully than RMs trained on translated data.
  • Unexpected Superiority of Cross-lingual Alignment: In several instances, an RM from a different language yielded better alignment than an RM trained on the target language; the hypothesis is that a different-language RM is less likely to overfit to language-specific artifacts in the target-language training data.
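The translate-train baseline referenced above can be sketched as follows. The field names and the `translate` placeholder are assumptions standing in for whatever MT system and preference-data schema are actually used; the point is only that every side of each preference pair is translated before RM training.

```python
# Sketch: translate-train baseline, where source-language preference data are
# machine-translated into the target language before training the RM.

def translate(text: str, target_lang: str) -> str:
    """Placeholder for an MT system; plug in any translation backend here."""
    raise NotImplementedError

def translate_train_dataset(preference_data, target_lang: str = "es"):
    """Map (prompt, chosen, rejected) triples into the target language."""
    translated = []
    for ex in preference_data:
        translated.append({
            "prompt": translate(ex["prompt"], target_lang),
            "chosen": translate(ex["chosen"], target_lang),
            "rejected": translate(ex["rejected"], target_lang),
        })
    return translated  # the RM is then trained on this data as usual
```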

Implications and Future Directions

The findings underscore the potential to lower the barriers for deploying multilingual LMs aligned to human preferences, especially for under-resourced languages. Cross-lingual RM transfer, by avoiding the need for extensive language-specific annotated data, could democratize the benefits of advanced LMs globally.

However, the implications of this strategy are complex. It opens questions about the extent to which language-agnostic principles of generation quality hold across different contexts and cultural nuances. Conducting further studies on tasks or domains with heavier cultural or context-specific elements could enrich our understanding of the limits of cross-lingual RM transferability.

Recommendations

For practical deployment, an RM trained on a high-resource language such as English can be used to guide alignment in other languages. This should be complemented by rigorous evaluation, and by comparison against in-language RMs where available, to confirm that alignment preserves the intended semantic and pragmatic properties in each target language.

In conclusion, this work represents an important step towards scalable, cross-lingual alignment of LMs, though future research is necessary to refine these methods and fully understand the boundary conditions under which they operate optimally.
