
Abstract

LLMs have become a go-to solution not just for text generation, but also for natural language understanding (NLU) tasks. Acquiring extensive knowledge through language modeling on web-scale corpora, they excel on English NLU, yet struggle to extend their NLU capabilities to underrepresented languages. In contrast, machine translation (MT) models produce excellent multilingual representations, resulting in strong translation performance even for low-resource languages. MT encoders, however, lack the knowledge necessary for comprehensive NLU that LLMs obtain through language modeling training on immense corpora. In this work, we get the best of both worlds by integrating MT encoders directly into LLM backbones via sample-efficient self-distillation. The resulting MT-LLMs preserve the inherent multilingual representational alignment from the MT encoder, allowing lower-resource languages to tap into the rich knowledge embedded in English-centric LLMs. Merging the MT encoder and LLM in a single model, we mitigate the propagation of translation errors and inference overhead of MT decoding inherent to discrete translation-based cross-lingual transfer (e.g., translate-test). Evaluation spanning three prominent NLU tasks and 127 predominantly low-resource languages renders MT-LLMs highly effective in cross-lingual transfer. MT-LLMs substantially and consistently outperform translate-test based on the same MT model, showing that we truly unlock multilingual language understanding for LLMs.

Figure: Stage 1 of the approach, merging the MT encoder with the LLM and training adapters via self-supervised distillation.

Overview

  • The study introduces a method for enhancing the cross-lingual capabilities of LLMs by integrating machine translation (MT) encoders with LLM backbones through self-distillation, forming a hybrid model called MT-LLM.

  • The methodology includes self-supervised general adaptation to align MT encoder and LLM representations, followed by task-specific distillation to fine-tune the hybrid model on labeled data, thereby improving multilingual natural language understanding (NLU).

  • Experimental results demonstrate that the MT-LLM outperforms standard LLMs and standalone MT models in cross-lingual NLU tasks across multiple languages and datasets, achieving significant accuracy improvements and reducing inference overhead.

Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages

This study presents a novel methodology for enhancing the cross-lingual capabilities of LLMs by integrating machine translation (MT) encoders with LLM backbones through self-distillation. The resulting hybrid model, termed MT-LLM, aims to leverage the strengths of both LLMs and MT encoders to perform natural language understanding (NLU) tasks across a diverse set of languages, including many low-resource languages.

Introduction

LLMs like GPT-3 and Llama 3 have demonstrated impressive performance on a variety of NLU tasks, particularly in English. However, their effectiveness diminishes significantly for languages that are typologically distant from English or poorly represented in their training data. Conversely, state-of-the-art MT models like NLLB and MADLAD-400 provide strong multilingual representations but lack the extensive world knowledge embedded in LLMs. To bridge this gap, the authors propose integrating MT encoders directly into LLM backbones through a self-distillation process, thereby enhancing the cross-lingual transfer capabilities of LLMs.

Methodology

The integration is achieved in two primary stages:

  1. Self-Supervised General Adaptation: This initial stage aligns the representation spaces of the MT encoder and the LLM. New trainable parameters (a projection matrix and LoRA adapters) are optimized with a sequence-level alignment objective that maps MT encoder outputs into the LLM's input embedding space, enabling the LLM to consume the multilingual representations produced by the MT encoder (see the first sketch after this list).
  2. Task-Specific Distillation: In this stage, the LLM is first fine-tuned on labeled task data; this task-specific knowledge is then transferred to the MT-LLM hybrid by aligning the output representations of the task-tuned LLM (teacher) and the MT-LLM (student), as in the second sketch below.
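
The following is a minimal, illustrative PyTorch sketch of what such a sequence-level alignment objective could look like. The module names, hidden sizes, mean-pooling, and MSE loss are assumptions for illustration rather than the paper's exact recipe, and the frozen MT encoder and teacher LLM are stood in for by toy tensors.

```python
# Hedged sketch of Stage 1 (self-supervised general adaptation).
# Only the projection (and, in the full setup, LoRA adapters on the LLM)
# would be trained; the MT encoder and teacher LLM stay frozen.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_MT, D_LLM = 1024, 4096  # assumed hidden sizes (NLLB-style encoder / Llama-style LLM)

class MTToLLMProjection(nn.Module):
    """Trainable map from the MT-encoder space to the LLM input-embedding space."""
    def __init__(self, d_mt: int, d_llm: int):
        super().__init__()
        self.proj = nn.Linear(d_mt, d_llm, bias=False)

    def forward(self, mt_states: torch.Tensor) -> torch.Tensor:
        return self.proj(mt_states)  # (batch, seq, d_llm)

def sequence_alignment_loss(student_states, teacher_states, student_mask, teacher_mask):
    """Align mean-pooled sequence representations of the MT-LLM (student)
    and the original LLM run on the same text (teacher)."""
    def pool(h, m):
        m = m.unsqueeze(-1).float()
        return (h * m).sum(1) / m.sum(1).clamp(min=1.0)
    return F.mse_loss(pool(student_states, student_mask),
                      pool(teacher_states, teacher_mask).detach())

# Toy tensors standing in for real (frozen) encoder and LLM outputs.
mt_out   = torch.randn(2, 16, D_MT)    # frozen MT-encoder states
llm_out  = torch.randn(2, 20, D_LLM)   # frozen teacher-LLM states
mask_mt  = torch.ones(2, 16, dtype=torch.long)
mask_llm = torch.ones(2, 20, dtype=torch.long)

projection = MTToLLMProjection(D_MT, D_LLM)
# In the full setup the projected states would be fed through the LoRA-adapted LLM;
# here the projection output stands in for the student's final states.
loss = sequence_alignment_loss(projection(mt_out), llm_out, mask_mt, mask_llm)
loss.backward()  # gradients flow only into the trainable projection
```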

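A similarly hedged sketch of the task-specific distillation step follows, assuming a standard soft-label KL distillation over class logits, optionally mixed with hard-label cross-entropy; the function name, temperature, and mixing weight are illustrative and not taken from the paper.

```python
# Hedged sketch of Stage 2 (task-specific distillation): the teacher is the
# task-tuned LLM, the student is the MT-LLM hybrid.
import torch
import torch.nn.functional as F

def distillation_step(student_logits, teacher_logits, labels=None,
                      temperature: float = 2.0, alpha: float = 0.5):
    """Combine soft-label KL distillation with optional hard-label cross-entropy."""
    t = temperature
    soft = F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * (t * t)
    if labels is None:
        return soft
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy example: 3-way NLI logits from the teacher (task-tuned LLM)
# and the student (MT-LLM) on the same English training batch.
teacher = torch.randn(4, 3)
student = torch.randn(4, 3, requires_grad=True)
labels  = torch.tensor([0, 2, 1, 1])
loss = distillation_step(student, teacher, labels)
loss.backward()
```
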
Experimental Setup and Results

Tasks and Languages

The authors evaluated the proposed MT-LLM across three NLU tasks:

  1. Natural Language Inference (NLI): Evaluated on XNLI, AmericasNLI, and Kardeş-NLU datasets.
  2. Sentiment Classification: Evaluated on the NusaX dataset, covering 10 local languages of Indonesia.
  3. Multiple-Choice Machine Reading Comprehension (MRC): Evaluated on the Belebele benchmark, which includes 122 languages.

Cross-Lingual Transfer Setups

The study employed two standard cross-lingual transfer setups:

  1. Zero-Shot Cross-Lingual Transfer (ZS-XLT): The model is fine-tuned on English training data and evaluated directly on target language instances.
  2. Translate-Test: Target-language instances are first translated into English by the MT model and then processed by the English-tuned LLM (a sketch contrasting the two setups follows this list).
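
To make the inference-cost contrast concrete, here is an illustrative sketch of the two setups; `mt_llm`, `translate_to_english`, and `llm_classifier` are hypothetical callables, and the point is only that translate-test pays for autoregressive MT decoding before classification, whereas the MT-LLM consumes the MT encoder's representations directly.

```python
# Illustrative contrast between ZS-XLT with the MT-LLM and translate-test.
from typing import Callable, List

def zero_shot_xlt(texts: List[str], mt_llm: Callable[[str], int]) -> List[int]:
    # English-fine-tuned MT-LLM applied directly to target-language inputs:
    # a single forward pass per instance, no translation step.
    return [mt_llm(t) for t in texts]

def translate_test(texts: List[str],
                   translate_to_english: Callable[[str], str],
                   llm_classifier: Callable[[str], int]) -> List[int]:
    # Decode a full English translation first (slow, and MT errors propagate),
    # then classify the translation with the English-only LLM.
    return [llm_classifier(translate_to_english(t)) for t in texts]
```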

Numerical Results

The MT-LLM significantly outperformed both standard LLMs and the standalone NLLB encoder on the cross-lingual NLU tasks, reaching an average accuracy of 81.4% on XNLI and 82.1% on Kardeş-NLU. The MT-LLM also surpasses the translate-test setup built on the same MT model, achieving better accuracy while eliminating the inference overhead of MT decoding.

Discussion

The study sheds light on the computational efficiency of the proposed self-distillation method, which requires only a few thousand training steps to achieve significant alignment between the MT and LLM backbones. This efficiency is crucial given the extensive computational resources typically required for training such models.

Implications and Future Work

The integration of MT encoders into LLMs through self-distillation holds considerable promise for improving multilingual capabilities in NLU tasks. By extending LLMs' access to the rich multilingual representations of MT encoders, this approach mitigates the constraints posed by typological differences and low-resource language representations.

Future research could explore the inclusion of token-level alignment objectives to further enhance the alignment and generalization capabilities of MT-LLMs. Additionally, extending this approach to support even more languages through post-hoc adaptation of both LLM and MT encoders may yield further gains in cross-lingual NLU performance.

Conclusion

This paper introduces a novel and effective method to enhance the cross-lingual NLU capabilities of LLMs by integrating MT encoders through self-distillation. The resulting MT-LLMs demonstrate superior cross-lingual performance, validating the efficacy of the proposed approach and paving the way for more inclusive and efficient multilingual language models.
