
Abstract

Multilingual LLMs build on powerful LLMs to handle and respond to queries in multiple languages, and have achieved remarkable success in multilingual natural language processing tasks. Despite these breakthroughs, there is still no comprehensive survey summarizing existing approaches and recent developments in this field. To this end, in this paper we present a thorough review of the multilingual LLM (MLLM) literature and provide a unified perspective on its recent progress and emerging trends. The contributions of this paper can be summarized as follows: (1) First survey: to our knowledge, this is the first thorough review of the MLLM research field organized around multilingual alignment; (2) New taxonomy: we offer a new and unified perspective to summarize the current progress of MLLMs; (3) New frontiers: we highlight several emerging frontiers and discuss the corresponding challenges; (4) Abundant resources: we collect abundant open-source resources, including relevant papers, data corpora, and leaderboards. We hope our work gives the community quick access to these resources and spurs breakthrough research in MLLMs.

Figure: Comparison between monolingual and multilingual large language models.

Overview

  • The paper presents a comprehensive analysis of Multilingual LLMs (MLLMs), focusing on their development, strategies for aligning multiple languages, and emerging frontiers.

  • It introduces a novel taxonomy for MLLMs, distinguishing between parameter-tuning alignment and parameter-frozen alignment, alongside a detailed exploration of data resources.

  • Future directions in MLLMs research are discussed, including challenges such as hallucination mitigation, knowledge editing, ensuring safety and fairness, and extensions into new languages and multi-modality.

  • This survey serves as a foundational piece for researchers in the field, offering insights into current methodologies, challenges, and potential innovations in multilingual AI.

Comprehensive Review on Multilingual LLMs: Resources, Taxonomy, and Emerging Trends

Introduction to Multilingual LLMs

Multilingual LLMs (MLLMs) have gained significant attention for their ability to understand and generate text in multiple languages, overcoming the limitations of monolingual LLMs. The paper presents a cohesive analysis of multilingual data resources, a proposed taxonomy of MLLM alignment strategies, and emerging frontiers in the development and application of MLLMs. Notably, it positions itself as the first comprehensive survey focused specifically on MLLMs, offering insights into parameter-tuning alignment, parameter-frozen alignment, and the gap between current methodologies and future potential.

Data Resources

The survey categorizes the data resources vital to MLLM development across training stages such as pre-training, supervised fine-tuning, and reinforcement learning from human feedback (RLHF). It showcases an extensive collection of datasets created through manual curation, web crawling, and benchmark adaptation, highlighting the breadth of linguistic diversity captured. This section provides the foundation for understanding the data that drives the performance and linguistic coverage of MLLMs.
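
As a rough illustration of this categorization, the sketch below groups a few well-known multilingual corpora by training stage and creation method. The dataset names are common public examples rather than a list drawn from the survey, and the grouping simplifies the paper's scheme.

```python
# Illustrative grouping of multilingual data resources by training stage and
# creation method. The dataset names are common public examples and the grouping
# is a simplification; neither is an exhaustive list taken from the survey.
DATA_RESOURCES = {
    "pretraining": {
        "web_crawling": ["mC4", "OSCAR", "CC-100"],
        "manual_creation": ["Wikipedia dumps"],
    },
    "supervised_fine_tuning": {
        "benchmark_adaptation": ["xP3", "translated English instruction sets"],
        "manual_creation": ["human-written multilingual instructions"],
    },
    "rlhf": {
        "manual_creation": ["multilingual human preference pairs"],
    },
}

def corpora_for(stage: str) -> list[str]:
    """Flatten every example corpus listed under a given training stage."""
    return [name for group in DATA_RESOURCES.get(stage, {}).values() for name in group]

if __name__ == "__main__":
    print(corpora_for("pretraining"))  # ['mC4', 'OSCAR', 'CC-100', 'Wikipedia dumps']
```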

Taxonomy of Alignment Strategies for MLLMs

The paper introduces a novel taxonomy that bifurcates MLLMs based on their alignment methods into two primary types: parameter-tuning alignment and parameter-frozen alignment.

  • Parameter-tuning alignment covers methods that update model parameters to improve multilingual performance, spanning the pre-training, supervised fine-tuning, reinforcement learning from human feedback (RLHF), and downstream fine-tuning stages. It therefore includes both the initial alignment achieved during training and later fine-tuning that strengthens the model's handling of multiple languages.
  • Parameter-frozen alignment, in contrast, leaves the model parameters untouched and relies on prompting to elicit the desired behavior from an existing MLLM. The survey outlines four prompting techniques: direct prompting, code-switching prompting, translation-alignment prompting, and retrieval-augmented alignment (see the sketch after this list). This delineation emphasizes distinct paths to cross-lingual alignment that require no changes to the model parameters.
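
To make the parameter-frozen side concrete, here is a minimal sketch of translation-alignment prompting, one of the four techniques above: the model is asked to translate the query into a pivot language (English here), reason in it, and answer in the original language. The `generate` callable is a placeholder for whatever chat-style LLM API is in use, and the prompt wording is illustrative rather than taken from the paper.

```python
from typing import Callable

def translation_alignment_prompt(query: str, source_lang: str, pivot_lang: str = "English") -> str:
    """Build a prompt that pivots the query through a high-resource language."""
    return (
        f"First translate the following {source_lang} question into {pivot_lang}, "
        f"then answer it step by step in {pivot_lang}, "
        f"and finally state the answer in {source_lang}.\n\n"
        f"Question ({source_lang}): {query}"
    )

def answer_with_pivot(query: str, source_lang: str, generate: Callable[[str], str]) -> str:
    """Run translation-alignment prompting with any text-in/text-out LLM callable."""
    return generate(translation_alignment_prompt(query, source_lang))

if __name__ == "__main__":
    # Dummy stand-in for a real model call, so the sketch runs end to end.
    dummy_generate = lambda prompt: f"[model output for a {len(prompt)}-character prompt]"
    print(answer_with_pivot("¿Cuál es la capital de Australia?", "Spanish", dummy_generate))
```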

Future Directions and New Frontiers

The exploration of new frontiers examines the challenges and opportunities in the MLLM domain, including hallucination mitigation, knowledge editing, safety, fairness, language extension, and multi-modality extension.

  • Hallucination in MLLMs concerns the reliability and factual accuracy of multilingual generation.
  • Knowledge Editing in MLLMs addresses the dynamic nature of knowledge and the need to keep edits consistent across languages.
  • Safety in MLLMs underscores the ethical implications and privacy concerns that pervade multilingual models.
  • Fairness in MLLMs highlights performance disparities and token-consumption inequities across languages (illustrated in the sketch after this list).
  • Language Extension in MLLMs considers methods for incorporating new languages with minimal impact on existing language capabilities.
  • Multi-Modality Extension in MLLMs goes beyond text, exploring how MLLMs can engage with visual, auditory, and video data, thereby broadening their applicability and interaction modalities.
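
The token-consumption inequity noted under fairness is easy to observe directly: the same sentence often tokenizes into many more subword tokens in some languages than in English, so per-token pricing and fixed context windows penalize those languages. Below is a small sketch using the Hugging Face `transformers` tokenizer API, with an English-centric checkpoint and example translations chosen purely for illustration.

```python
from transformers import AutoTokenizer

# Count subword tokens for roughly parallel sentences. The checkpoint and the
# translations are illustrative only: GPT-2's English-centric BPE makes the gap
# easy to see, and multilingual tokenizers typically narrow it without fully
# closing it. Exact counts depend on the tokenizer.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentences = {
    "en": "The weather is very nice today.",
    "de": "Das Wetter ist heute sehr schön.",
    "th": "วันนี้อากาศดีมาก",
}

for lang, text in sentences.items():
    n_tokens = len(tokenizer(text)["input_ids"])
    print(f"{lang}: {n_tokens} tokens")
```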

Conclusion

This survey critically assesses the current landscape, methodologies, and challenges in the development and application of Multilingual LLMs. By laying out a detailed taxonomy, analyzing data resources, and projecting future research directions, it offers a comprehensive perspective on the fast-growing field of MLLMs. It serves both as a guide for current researchers and as groundwork for future innovations in multilingual AI.
