Abstract

Though reasoning abilities are considered language-agnostic, existing LLMs exhibit inconsistent reasoning across languages: owing to the imbalance of multilingual training data, reasoning in a dominant language such as English is superior to reasoning in other languages. To enhance reasoning in non-dominant languages, we propose the Multilingual-Alignment-as-Preference Optimization framework (MAPO), which aims to align the reasoning processes in other languages with those in the dominant language. Specifically, we harness an off-the-shelf translation model to score the consistency between answers in non-dominant and dominant languages, and adopt this score as the preference for optimization, e.g., Direct Preference Optimization (DPO) or Proximal Policy Optimization (PPO). Experiments show that MAPO stably achieves significant improvements in the multilingual reasoning of various models on all three benchmarks (MSVAMP +16.2%, MGSM +6.1%, and MNumGLUESub +13.3%), together with improved reasoning consistency across languages.

Figure: Accuracy across ten languages after training MathOctopus 7B on preference datasets built with various translation models.

Overview

  • The MAPO framework aims to enhance the reasoning abilities of LLMs across multiple languages by aligning reasoning processes in non-dominant languages to those in a dominant language, typically English, through preference optimization techniques.

  • The framework employs a two-stage process: first, estimating multilingual alignment with an off-the-shelf translation model, and then optimizing the model with that alignment as the preference signal, via Proximal Policy Optimization (PPO) or Direct Preference Optimization (DPO). This method does not rely on expensive and error-prone reasoning annotations.

  • Experimental results across three multilingual benchmarks (MSVAMP, MGSM, and MNumGLUESub) in 10 languages showed significant improvements in accuracy, demonstrating the framework's ability to enhance robustness and generalization in multilingual reasoning tasks.

MAPO: Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization

The paper "MAPO: Advancing Multilingual Reasoning through Multilingual-Alignment-as-Preference Optimization" addresses the challenge of enhancing the reasoning abilities of LLMs across multiple languages. While reasoning abilities should theoretically be language-agnostic, practical inconsistencies across languages exist, notably due to the imbalance of multilingual training data. The central proposal of the work is the Multilingual-Alignment-as-Preference Optimization (MAPO) framework, which aims to align reasoning processes in non-dominant languages with those in a dominant language, typically English.

Key Contributions and Methodology

The authors propose a two-stage framework for improving multilingual reasoning:

  1. Preference Estimation via Multilingual Alignment: This stage leverages an off-the-shelf translation model to estimate how well the reasoning process in a non-dominant language aligns with the reasoning process in the dominant language. The translation probability between the two serves as an alignment score, indicating the consistency and correctness of the non-dominant-language reasoning (see the scoring sketch after this list).
  2. Preference Optimization: In this stage, the alignment scores are used as preferences to guide optimization. The authors employ two established preference optimization techniques: Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO). Through iterative optimization, the model is trained to generate reasoning processes in non-dominant languages that align more closely with those in the dominant language (an illustrative pair-construction sketch appears below, after the next paragraph).

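To make stage 1 concrete, the sketch below scores a non-English reasoning chain by the average per-token log-probability that an off-the-shelf translation model assigns to the corresponding English reasoning chain, given the non-English one. The model name (facebook/mbart-large-50-many-to-many-mmt), the language codes, and the length-normalized scoring are illustrative assumptions standing in for the paper's exact configuration.

```python
# A minimal sketch of "translation probability as alignment score".
# Assumptions: mBART-50 as a stand-in translation model; the score is the
# mean per-token log-probability of the English reasoning given the
# non-English reasoning (teacher forcing). Not the paper's exact setup.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "facebook/mbart-large-50-many-to-many-mmt"  # assumed stand-in
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME).eval()

@torch.no_grad()
def alignment_score(non_en_reasoning: str, en_reasoning: str,
                    src_lang: str = "sw_KE", tgt_lang: str = "en_XX") -> float:
    """Higher score = the non-English chain 'translates into' the English
    chain more readily, i.e., the two reasoning processes are better aligned."""
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    enc = tokenizer(non_en_reasoning, text_target=en_reasoning,
                    return_tensors="pt", truncation=True)
    # With labels supplied, the model returns the mean cross-entropy over
    # target tokens; its negation is the average per-token log-probability.
    out = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"],
                labels=enc["labels"])
    return -out.loss.item()
```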
Notably, the MAPO framework does not rely on reasoning annotations, which are typically expensive and error-prone, especially when translated into non-dominant languages. Instead, alignment leverages the model's stronger reasoning in the dominant language as a reference, thereby striving for improved consistency and accuracy across languages.
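Building on the hypothetical alignment_score helper above, one plausible way to turn alignment into training data is to sample several reasoning chains in the non-dominant language for each question, rank them by their alignment with the English chain, and treat the best and worst as a DPO (chosen, rejected) pair. The pairing rule below is an assumption made for illustration, not necessarily the paper's exact recipe; for PPO, the same score could instead be used directly as a scalar reward.

```python
# Illustrative construction of DPO-style preference pairs from alignment
# scores. Reuses the assumed alignment_score() helper sketched above.
from dataclasses import dataclass
from typing import List

@dataclass
class PreferencePair:
    prompt: str    # the question posed in the non-dominant language
    chosen: str    # sampled reasoning chain best aligned with the English chain
    rejected: str  # sampled reasoning chain least aligned with the English chain

def build_pair(prompt: str, sampled_reasonings: List[str],
               en_reasoning: str) -> PreferencePair:
    # Rank the model's sampled non-English chains by how well each aligns
    # with the dominant-language (English) reasoning for the same question.
    ranked = sorted(sampled_reasonings,
                    key=lambda r: alignment_score(r, en_reasoning))
    return PreferencePair(prompt=prompt, chosen=ranked[-1], rejected=ranked[0])
```

Pairs in this (prompt, chosen, rejected) format match what standard DPO implementations consume, so the resulting dataset can feed directly into the preference-optimization stage.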

Experimental Results

The experiments were conducted on three multilingual reasoning benchmarks: MSVAMP, MGSM, and MNumGLUESub, encompassing 10 languages. The MAPO framework demonstrated substantial improvements across all datasets:

  • MSVAMP: Accuracy improved by up to 16.2%, the largest gain among the three benchmarks. Since MSVAMP serves as an out-of-domain test set, this result underscores the framework's ability to achieve state-of-the-art performance and indicates enhanced robustness and generalization.
  • MGSM and MNumGLUESub: Accuracy improved by up to 6.1% and 13.3%, respectively, corroborating the efficacy of the framework on diverse and challenging multilingual reasoning tasks.

Implications and Future Directions

The implications of the MAPO framework are significant for both practical applications and theoretical advancements in AI. Practically, the framework offers a scalable and efficient method to enhance reasoning capabilities across languages without the need for extensive multilingual annotated datasets. Theoretically, it underscores the viability of preference optimization as an effective strategy for aligning reasoning processes and improving multilingual AI performance.

Future research could explore several avenues:

  1. Scaling to Larger Models: Investigating the impact of MAPO on larger models, such as 13B and 70B parameter LLMs, to determine its scalability and potential for even greater performance gains.
  2. Broader Range of Preference Optimization Techniques: Examining other preference optimization methods to possibly improve the effectiveness and efficiency of the MAPO framework.
  3. Real-world Applications: Implementing MAPO in real-world scenarios to assess its practical utility and to fine-tune the framework based on empirical insights.

Conclusion

The MAPO framework represents a notable advancement in the field of multilingual reasoning. By aligning the reasoning processes across languages through preference optimization techniques, the authors demonstrate significant improvements in model performance and consistency. This work paves the way for further advancements in multilingual AI, potentially leading to more robust, accurate, and language-agnostic reasoning capabilities in LLMs.
