
Abstract

The rapid development of LLMs demonstrates remarkable multilingual capabilities in natural language processing, attracting global attention in both academia and industry. Developing language-fair technology is important for mitigating potential discrimination and for enhancing usability and accessibility for diverse language user groups. Despite these breakthroughs, the investigation of LLMs in multilingual scenarios remains insufficient, and a comprehensive survey summarizing recent approaches, developments, limitations, and potential solutions is desirable. To this end, we provide a survey with multiple perspectives on the utilization of LLMs in multilingual scenarios. We first rethink the transition from previous to current research on pre-trained language models. We then introduce several perspectives on the multilingualism of LLMs, including training and inference methods, model security, multi-domain applications with language culture, and the usage of datasets. We also discuss the major challenges that arise in these aspects, along with possible solutions. In addition, we highlight future research directions aimed at further enhancing LLMs with multilingualism. The survey aims to help the research community address multilingual problems and to provide a comprehensive understanding of the core concepts, key techniques, and latest developments in LLM-based multilingual natural language processing.

Figure: Transition of LLM paradigms, model architectures, and multilingualism frontiers.

Overview

  • The paper provides a comprehensive survey exploring the multilingual capabilities of LLMs like GPT-3, GPT-4, and LLaMA, highlighting training paradigms, inference strategies, security, and applications across various domains.

  • It discusses training from scratch and continual training methods for LLMs, emphasizing challenges like obtaining high-quality multilingual datasets and managing catastrophic forgetting.

  • The survey identifies key inference strategies, security vulnerabilities, and the necessity of specialized datasets and benchmarks to address biases and improve LLM performance in multilingual contexts.

Understanding Multilingual Capabilities in LLMs

LLMs like GPT-3, GPT-4, and LLaMA have transformed NLP in ways that would have seemed unlikely a few years ago. Yet, despite their profound impact, a significant gap remains in how these models perform across different languages. A recent comprehensive survey tackles this issue by examining the multilingual capabilities of LLMs from several perspectives, including training paradigms, inference strategies, security, and multi-domain applications.

Training Paradigms

From Scratch

Training LLMs from scratch with multilingual data involves incorporating diverse languages from the outset. A notable example here is XLM, which uses Translation Language Modeling (TLM) to enhance cross-lingual capabilities. Similarly, PolyLM employs curriculum learning to balance language data during pre-training. However, this approach underscores a crucial challenge: obtaining vast, high-quality multilingual datasets, especially for low-resource languages.
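
As a concrete illustration of the TLM objective, the sketch below builds a single XLM-style training example: a sentence and its translation are concatenated, and tokens on both sides are masked so the model can use cross-lingual context to recover them. The whitespace tokenization, the literal "[MASK]"/"</s>" strings, and the masking rate are toy assumptions for illustration, not details taken from the survey.

```python
# Minimal sketch of building one XLM-style Translation Language Modeling (TLM)
# example: a sentence and its translation are concatenated, and tokens on BOTH
# sides are randomly masked so the model can recover them from cross-lingual
# context. Whitespace tokenization and the "[MASK]"/"</s>" strings are toy
# placeholders for a real tokenizer and vocabulary.
import random

MASK, SEP = "[MASK]", "</s>"

def build_tlm_example(src: str, tgt: str, mask_prob: float = 0.15, seed: int = 0):
    """Concatenate a translation pair and mask random tokens on both sides."""
    rng = random.Random(seed)
    tokens = src.split() + [SEP] + tgt.split()
    inputs, labels = [], []
    for tok in tokens:
        if tok != SEP and rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)    # the model is trained to predict the original token
        else:
            inputs.append(tok)
            labels.append(None)   # position ignored by the masked-LM loss
    return inputs, labels

if __name__ == "__main__":
    inp, lab = build_tlm_example("the cat sits on the mat",
                                 "le chat est assis sur le tapis")
    print(inp)
    print(lab)
```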

Continual Training

An efficient alternative to training from scratch is continual training of foundation models on new multilingual data. This method leverages existing knowledge while updating the model with additional language data. For instance, BigTrans and Chinese-LLaMA build on pre-trained models, improving their multilingual abilities without incurring the enormous cost of retraining. Nevertheless, this approach must contend with catastrophic forgetting, where newly learned knowledge degrades previously acquired capabilities, and with data scarcity in low-resource language settings.
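
One common mitigation for catastrophic forgetting during continual training is to replay a small fraction of data from the original training distribution alongside the new multilingual data. The sketch below illustrates such batch mixing; the corpora, replay ratio, and batch size are illustrative assumptions rather than settings reported for BigTrans or Chinese-LLaMA.

```python
# Minimal sketch of replay-based batch mixing for continual pre-training, one
# common way to reduce catastrophic forgetting: each batch combines mostly
# new-language text with a small fraction replayed from the original corpus.
import random

def mixed_batches(new_corpus, replay_corpus, batch_size=8, replay_ratio=0.25, seed=0):
    """Yield batches of mostly new-language samples plus replayed original samples."""
    rng = random.Random(seed)
    n_replay = max(1, int(batch_size * replay_ratio))
    n_new = batch_size - n_replay
    while True:
        batch = rng.sample(new_corpus, n_new) + rng.sample(replay_corpus, n_replay)
        rng.shuffle(batch)
        yield batch

# Usage: feed each yielded batch to the continued pre-training step of the base model.
batches = mixed_batches(new_corpus=[f"zh-{i}" for i in range(100)],
                        replay_corpus=[f"en-{i}" for i in range(100)])
print(next(batches))
```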

Inference Strategies

Direct Inference

Direct inference, where models process text natively in multiple languages, is becoming more viable with advances in LLMs. Models like GPT-4 and PaLM-2 show promising results. Direct inference preserves linguistic nuances and ensures efficient processing by eliminating the translation step, but performance can still suffer in low-resource languages.

Pre-Translation

Pre-translation approaches convert input text into a high-resource language, like English, before processing. While this may allow models to leverage their strongest language proficiency, it introduces dependencies on high-quality translation tools and potential errors, which can distort meaning.
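
The pipeline can be summarized as: translate the input into the pivot language, query the LLM, and translate the answer back. The sketch below makes that flow explicit; `translate` and `llm_generate` are hypothetical callables standing in for whatever MT system and LLM API a real deployment would use.

```python
# Minimal sketch of a pre-translation pipeline: translate the input into a
# high-resource pivot language (here English), query the LLM, and translate
# the answer back into the user's language.
from typing import Callable

def pre_translation_answer(
    question: str,
    user_lang: str,
    translate: Callable[[str, str, str], str],  # (text, src_lang, tgt_lang) -> text
    llm_generate: Callable[[str], str],
    pivot: str = "en",
) -> str:
    pivot_question = translate(question, user_lang, pivot)  # user language -> pivot
    pivot_answer = llm_generate(pivot_question)              # reason in the pivot language
    return translate(pivot_answer, pivot, user_lang)         # pivot -> user language

# Example with trivial stand-ins (a real system would plug in an MT model and an LLM):
fake_translate = lambda text, src, tgt: f"[{src}->{tgt}] {text}"
fake_llm = lambda prompt: f"(answer to: {prompt})"
print(pre_translation_answer("¿Cuál es la capital de Francia?", "es",
                             fake_translate, fake_llm))
```

Note that every translation hop is a potential source of error, which is exactly the dependency on translation quality discussed above.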

Multilingual CoT

Chain-of-Thought (CoT) strategies, initially successful in monolingual settings, have been adapted to multilingual contexts. The reasoning instruction can be given either in the user's native language or in English (e.g., "Let's think step by step"). Effectiveness varies across languages, with better results typically reported when the instruction is given in English.
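
The sketch below shows how such prompts are typically assembled, with the reasoning trigger given either in English or in the user's language. The native-language triggers in the dictionary are illustrative translations, not a list taken from the survey.

```python
# Minimal sketch of assembling multilingual chain-of-thought prompts: append a
# reasoning trigger either in English or in the question's own language.
COT_TRIGGERS = {
    "en": "Let's think step by step.",
    "de": "Denken wir Schritt für Schritt.",
    "zh": "让我们一步一步地思考。",
}

def build_cot_prompt(question: str, lang: str, english_trigger: bool = True) -> str:
    """Append a CoT trigger, either in English or in the question's language."""
    trigger = COT_TRIGGERS["en"] if english_trigger else COT_TRIGGERS.get(lang, COT_TRIGGERS["en"])
    return f"{question}\n{trigger}"

print(build_cot_prompt("Wie viele Beine haben drei Katzen?", lang="de", english_trigger=False))
print(build_cot_prompt("Wie viele Beine haben drei Katzen?", lang="de", english_trigger=True))
```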

Retrieval Augmented Generation (RAG)

RAG enhances LLMs by integrating external knowledge during text generation. This approach shows significant promise, especially for low-resource languages where models show a predisposition towards hallucinations or factual inaccuracies.
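
A minimal RAG loop retrieves supporting passages and prepends them to the prompt before generation, which grounds the answer in external knowledge. In the sketch below, the retriever and LLM call are hypothetical callables; a real multilingual system would plug in a multilingual dense retriever and the target model.

```python
# Minimal sketch of retrieval-augmented generation: retrieve supporting passages
# and prepend them to the prompt so the answer is grounded in external knowledge.
from typing import Callable, List

def rag_answer(
    question: str,
    retrieve: Callable[[str], List[str]],
    llm_generate: Callable[[str], str],
) -> str:
    passages = retrieve(question)
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_generate(prompt)

# Example with trivial stand-ins:
print(rag_answer("Ni nani rais wa kwanza wa Tanzania?",
                 retrieve=lambda q: ["Julius Nyerere was the first president of Tanzania."],
                 llm_generate=lambda p: "(model answer grounded in the context)"))
```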

Code-Switching

Handling code-switching in multilingual dialogue settings, where speakers switch between languages, remains challenging for LLMs. Recent work shows that even powerful models struggle without tailored fine-tuning.

Security

Attack Methods

LLMs are vulnerable to various attacks, including jailbreaks that trick them into bypassing safety protocols. Prompt-based methods, gradient-based methods such as Greedy Coordinate Gradient (GCG), and multilingual-specific attacks expose these vulnerabilities. Prompts in certain languages, particularly low-resource ones, can often bypass safety checks because safety fine-tuning is less extensive in those languages.

Defense Methods

Defense strategies range from enhanced training protocols to real-time input analysis, but no method is foolproof yet. Approaches like SmoothLLM show promise by randomly perturbing input prompts and aggregating the resulting responses to avoid generating unsafe outputs.
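
The sketch below captures the SmoothLLM-style idea at a high level: query the model on several randomly perturbed copies of the prompt and aggregate by majority vote, refusing when most copies elicit unsafe output. The character-swap perturbation, vote threshold, and the `llm_generate`/`is_unsafe` callables are simplified assumptions, not the exact procedure from the SmoothLLM paper.

```python
# Minimal sketch of a SmoothLLM-style defense: run the model on perturbed copies
# of the prompt and refuse if the majority of outputs looks unsafe.
import random
import string
from typing import Callable, Optional

def perturb(prompt: str, rate: float, rng: random.Random) -> str:
    """Randomly replace a fraction of characters in the prompt."""
    chars = list(prompt)
    for i in range(len(chars)):
        if rng.random() < rate:
            chars[i] = rng.choice(string.ascii_letters)
    return "".join(chars)

def smoothed_response(
    prompt: str,
    llm_generate: Callable[[str], str],
    is_unsafe: Callable[[str], bool],
    n_copies: int = 5,
    rate: float = 0.05,
    seed: int = 0,
) -> Optional[str]:
    rng = random.Random(seed)
    outputs = [llm_generate(perturb(prompt, rate, rng)) for _ in range(n_copies)]
    if sum(is_unsafe(o) for o in outputs) > n_copies // 2:
        return None  # majority of perturbed copies look unsafe -> refuse
    return next(o for o in outputs if not is_unsafe(o))
```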

Multi-Domain Applications

In specialized fields like medicine and law, building effective multilingual models poses challenges that go beyond those of general LLM tasks. Models like MMedLM2 and BioMistral adapt LLMs to medical contexts across multiple languages, showing significant improvements within their domains. However, acquiring high-quality multilingual data remains a major hurdle, exacerbated by the cultural and contextual intricacies unique to each language.

Data Resources and Benchmarking

The scarcity of large, high-quality multilingual datasets is a significant bottleneck. Datasets like MultiLegalPile and XMedBench offer initial steps towards bridging this gap. Comprehensive benchmarks that account for cultural and contextual factors across languages need to be developed to accurately reflect LLM performance in multilingual environments.

Bias and Fairness

Addressing biases in multilingual LLMs involves understanding both language-specific and demographic biases. While incremental improvements have been made through techniques such as up-sampling and adversarial training, the field still lacks robust tools and datasets to fully mitigate these biases.
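
As one concrete example of up-sampling, multilingual pre-training pipelines often rebalance language data with temperature-based sampling, drawing language i with probability proportional to q_i^alpha, where q_i is that language's share of the corpus and alpha < 1 flattens the distribution toward low-resource languages. The sketch below implements that formula; the token counts are purely illustrative.

```python
# Minimal sketch of temperature-based up-sampling for rebalancing multilingual
# data: p_i is proportional to q_i**alpha, so alpha < 1 boosts low-resource languages.
def sampling_probs(token_counts: dict, alpha: float = 0.3) -> dict:
    total = sum(token_counts.values())
    weights = {lang: (count / total) ** alpha for lang, count in token_counts.items()}
    z = sum(weights.values())
    return {lang: w / z for lang, w in weights.items()}

print(sampling_probs({"en": 1_000_000, "sw": 10_000, "is": 5_000}))
```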

Conclusion

Despite the impressive strides made in multilingual capabilities of LLMs, much work remains. Researchers must continue exploring advanced training strategies and inference methods while developing robust evaluation benchmarks and addressing biases to truly achieve language-fair AI. For both academia and industry, fostering collaboration and sharing resources will be crucial in overcoming these challenges and unlocking the full potential of LLMs in multilingual contexts.
