
Abstract

To democratize LLMs for most natural languages, it is imperative to make these models capable of understanding and generating texts in many languages, in particular low-resource ones. While recent multilingual LLMs demonstrate remarkable performance in such capabilities, these LLMs still support a limited number of human languages due to the lack of training data for low-resource languages. Moreover, these LLMs are not yet aligned with human preferences for downstream tasks, which is crucial for the success of LLMs in English. In this paper, we introduce xLLaMA-100 and xBLOOM-100 (collectively xLLMs-100), which scale the multilingual capabilities of LLaMA and BLOOM to 100 languages. To do so, we construct two datasets: a multilingual instruction dataset covering 100 languages, which represents the largest language coverage to date, and a cross-lingual human feedback dataset encompassing 30 languages. We perform multilingual instruction tuning on the constructed instruction data and further align the LLMs with human feedback using the DPO algorithm on our cross-lingual human feedback dataset. We evaluate the multilingual understanding and generation capabilities of xLLMs-100 on five multilingual benchmarks. Experimental results show that xLLMs-100 consistently outperforms its peers across the benchmarks by considerable margins, defining a new state-of-the-art multilingual LLM that supports 100 languages.


Overview

  • The paper introduces new multilingual LLMs, xLLaMA-100 and xBLOOM-100, that support up to 100 languages by leveraging a vast multilingual instruction dataset and cross-lingual human feedback.

  • Key methodologies include translating instructions with the Google Translate API and ChatGPT, generating responses through a hybrid approach, and fine-tuning the models with LoRA and the DPO algorithm to align them with human feedback.

  • The experimental evaluation shows that xLLMs-100 improves multilingual comprehension and generation and mitigates the off-target issue, and it underscores the importance of cross-lingual feedback for low-resource language performance.

An Insight into "LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback"

The paper titled "LLMs Beyond English: Scaling the Multilingual Capability of LLMs with Cross-Lingual Feedback" focuses on expanding the multilingual capabilities of LLMs such as LLaMA and BLOOM to support up to 100 languages, including low-resource ones. The study is motivated by the inherent bias of current LLMs, which are primarily optimized for English and a handful of other high-resource languages. To achieve a broader linguistic scope, the authors introduce two new models, xLLaMA-100 and xBLOOM-100, collectively referred to as xLLMs-100.

Methodology and Data Construction

Multilingual Instruction Dataset: The cornerstone of the enhancement strategy is a multilingual instruction dataset spanning 100 languages. The dataset was built by translating the Alpaca instructions with ChatGPT and the Google Translate API, playing to each tool's strengths to bridge gaps in low-resource languages where translation quality can be weak. The construction process involves:

  • Instruction Translation: Instructions from the Alpaca dataset are translated with the Google Translate API into 100 languages, with the NLLB model serving as a fallback for languages Google Translate does not support.
  • Hybrid Response Generation: Responses are generated through a hybrid approach that routes low-resource languages to the Google Translate API and languages where ChatGPT performs well to ChatGPT (a sketch of this routing follows this list).
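
The paper does not release its pipeline code, but the routing logic described above can be sketched as follows. The `translate_with_google` and `translate_with_chatgpt` functions are hypothetical stubs standing in for real API wrappers, and the language sets are illustrative, not the paper's actual lists:

```python
# Hypothetical sketch of the hybrid routing described above; not the paper's code.
from transformers import pipeline

HIGH_RESOURCE = {"de", "fr", "es", "zh"}   # illustrative set, not exhaustive
NLLB_FALLBACK = {"bho": "bho_Deva"}        # langs assumed missing from Google Translate

def translate_with_google(text: str, lang: str) -> str:
    raise NotImplementedError("stub: wire up to the Google Cloud Translation API")

def translate_with_chatgpt(text: str, lang: str) -> str:
    raise NotImplementedError("stub: wire up to the OpenAI API")

def translate_response(text: str, lang: str) -> str:
    """Route each translation to the backend assumed to handle `lang` best."""
    if lang in HIGH_RESOURCE:
        return translate_with_chatgpt(text, lang)
    if lang in NLLB_FALLBACK:  # NLLB covers languages Google Translate cannot
        nllb = pipeline("translation",
                        model="facebook/nllb-200-distilled-600M",
                        src_lang="eng_Latn", tgt_lang=NLLB_FALLBACK[lang])
        return nllb(text)[0]["translation_text"]
    return translate_with_google(text, lang)
```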

Cross-Lingual Human Feedback: Recognizing that effective cross-lingual feedback is pivotal for improving generative capabilities, the paper details the construction of a dataset incorporating cross-lingual feedback across 30 languages. This novel dataset entails:

  • Instruction Design: Translating English instructions into a source language and combining them with target-language instructions.
  • Response Generation: Leveraging ChatGPT to translate interactions and to rank responses by correctness, coherence, and naturalness (see the sketch below).
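
A minimal sketch of how one such preference record could be assembled and ranked with the OpenAI chat API; the prompt wording and the `gpt-3.5-turbo` model choice are illustrative assumptions, not details from the paper:

```python
# Hypothetical sketch: rank two candidate responses with ChatGPT and emit a
# DPO-style preference record. Prompt wording and model name are assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def build_preference_record(instruction: str, resp_a: str, resp_b: str) -> dict:
    """Return {'prompt', 'chosen', 'rejected'} based on the model's verdict."""
    prompt = (
        f"Instruction:\n{instruction}\n\n"
        f"Response A:\n{resp_a}\n\nResponse B:\n{resp_b}\n\n"
        "Judge by correctness, coherence, and naturalness. Answer only 'A' or 'B'."
    )
    verdict = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    chosen, rejected = (resp_a, resp_b) if verdict.startswith("A") else (resp_b, resp_a)
    return {"prompt": instruction, "chosen": chosen, "rejected": rejected}
```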

Multilingual Instruction Tuning

The model training procedure involves:

  • Supervised Fine-Tuning (SFT): Parameter-efficient fine-tuning with LoRA on the multilingual instruction data, which updates only a small set of adapter parameters and avoids the cost of full fine-tuning.
  • Aligning with Human Feedback: The DPO algorithm fine-tunes the SFT models on the collected cross-lingual human feedback, optimizing the LLMs for better alignment with human preferences while bypassing the computational overhead usually associated with RLHF (both stages are sketched after this list).
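
To make the two stages concrete, here is a minimal sketch assuming Hugging Face peft and PyTorch. The checkpoint name and hyperparameters (rank 8, beta 0.1) are assumptions rather than the paper's reported settings; the DPO objective is written out directly from its standard definition:

```python
# Sketch of both tuning stages; rank, target modules, and beta are
# illustrative defaults, not the paper's reported hyperparameters.
import torch.nn.functional as F
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("huggyllama/llama-7b")  # assumed checkpoint
lora_cfg = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                      target_modules=["q_proj", "v_proj"],  # LLaMA attention projections
                      task_type="CAUSAL_LM")
model = get_peft_model(base, lora_cfg)  # SFT now trains only the LoRA adapters

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO objective: -log sigmoid(beta * margin). Each argument is the summed
    log-probability of a chosen/rejected response under the policy (pi) or
    the frozen SFT reference model."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```

Because the reference model stays frozen, DPO needs only these per-response log-probabilities rather than a separately trained reward model, which is where the savings over RLHF come from.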

Experimental Evaluation

Benchmarks: The models were evaluated on:

  • Understanding Tasks: PAWS-X and Self-Instruct*, which require text understanding across high-resource and low-resource languages.
  • Generation Tasks: FLORES-101 for machine translation and XL-Sum for summarization across a range of languages.
  • Reasoning Tasks: XCOPA, which evaluates commonsense reasoning across multiple languages.

Results: xLLMs-100 demonstrated notable improvements across all benchmarks, excelling in both multilingual comprehension and generation. Specifically:

  • The models exhibited superior language democratization, delivering more balanced performance across languages.
  • xLLMs-100 effectively mitigated the off-target issue, replying in the requested language far more reliably across the tested languages (one way to measure this is sketched after this list).
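
The off-target issue refers to a model answering in a language other than the one requested. Below is a minimal sketch of how an off-target rate could be quantified, assuming fastText's public lid.176 language-ID model; the paper does not specify which detector it uses:

```python
# Sketch: estimate the off-target rate with fastText language ID.
# Assumes the public lid.176.bin model; the paper's detector is unspecified.
import fasttext

lid = fasttext.load_model("lid.176.bin")  # downloadable from fasttext.cc

def off_target_rate(generations: list[str], expected: str) -> float:
    """Fraction of outputs whose detected language differs from `expected`
    (an ISO 639-1 code such as 'de')."""
    wrong = 0
    for text in generations:
        labels, _ = lid.predict(text.replace("\n", " "))  # fastText rejects newlines
        wrong += labels[0].removeprefix("__label__") != expected
    return wrong / max(len(generations), 1)

print(off_target_rate(["Bonjour tout le monde"], "de"))  # 1.0: French, not German
```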

Ablation Studies: Ablation studies underscored the significance of cross-lingual human feedback in enhancing low-resource language outputs. Further, comparisons between multilingual instruction datasets and multilingual parallel corpora highlighted the robustness of instruction datasets in avoiding performance degradation due to catastrophic forgetting.

Implications and Future Work

The paper suggests significant implications for the design and deployment of multilingual LLMs. By successfully scaling LLMs to include low-resource languages and optimizing through cross-lingual feedback, this research proposes an effective pathway to democratize AI and NLP tools globally. Future research directions include:

  • Extending the cross-lingual feedback dataset beyond the current 30 languages.
  • Addressing tokenizer inefficiencies to better support a wider range of languages.
  • Scaling the experiments to larger models (e.g., 13B or 70B) to further push the boundaries of multilingual LLMs.

Conclusion

This study has successfully constructed multilingual datasets and applied innovative tuning methods to significantly broaden the linguistic capabilities of LLMs. As the paper shows, effectively scaling LLMs while maintaining performance across a multitude of languages can democratize AI technology, making it accessible and beneficial globally. The insights derived from this work pave the way for enhanced AI inclusivity, stressing the importance of cross-lingual approaches in next-generation LLM development.
