SeaLLMs -- Large Language Models for Southeast Asia (2312.00738v2)

Published 1 Dec 2023 in cs.CL

Abstract: Despite the remarkable achievements of LLMs in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of LLMs that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon the Llama-2 model and further advanced through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM-13b models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.

Summary

The paper demonstrates that expanding vocabulary by 16,512 tokens and employing a four-stage training protocol significantly improves language processing for SEA languages compared to mainstream models.
The extended tokenizer shortens tokenized sequences, by up to 2.7-fold for Thai, so that more text fits within the model's context window.
The findings imply that region-specific LLM adaptations democratize AI by providing cost-effective, high-performance tools for underrepresented Southeast Asian languages.

Evaluation of SeaLLMs: Addressing Linguistic Disparity in Southeast Asian LLMs

The paper "SeaLLMs -- Large Language Models for Southeast Asia" describes an effort to reduce the linguistic bias of mainstream LLMs by introducing a series of models built specifically for Southeast Asian (SEA) languages. Although LLMs perform strongly on many language tasks, their output quality drops for languages outside the high-resource group, primarily because relevant training data is scarce. The paper addresses this with SeaLLM-13B and its variants, which specialize in SEA languages such as Vietnamese, Thai, and Indonesian as well as underrepresented regional languages like Khmer and Lao, and which outperform ChatGPT-3.5 in non-Latin-script languages such as Thai, Khmer, Lao, and Burmese.

Advanced Linguistic Representation

The authors build on Meta's Llama-2, adapting it to SEA languages through vocabulary expansion that addresses the inefficient tokenization of non-Latin scripts. The extension adds 16,512 new tokens that substantially compress text representation, as shown by the reported reduction ratios, such as a 2.7-fold reduction in Thai token length. Shorter token sequences let the models fit more text into the same context window, which matters when processing long texts in low-resource languages.
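The arithmetic behind the vocabulary expansion can be sketched as follows. This is a minimal illustration, not the authors' code: the 16,512 added tokens and the 2.7-fold Thai ratio come from the summary above, while the Llama-2 base vocabulary size (32,000), the 13B hidden size (5,120), the 4,096-token context window, and the untied input/output embeddings are assumptions taken from the publicly released Llama-2 configuration. All function names are hypothetical.

```python
# Illustrative arithmetic only. Values marked "assumption" are not stated
# in the summary; they come from the public Llama-2 configuration.
LLAMA2_VOCAB_SIZE = 32_000       # assumption: Llama-2 base vocabulary
ADDED_TOKENS = 16_512            # from the summary
LLAMA2_13B_HIDDEN_SIZE = 5_120   # assumption: Llama-2-13B hidden size
LLAMA2_CONTEXT_TOKENS = 4_096    # assumption: Llama-2 context window
THAI_COMPRESSION_RATIO = 2.7     # from the summary


def extended_vocab_size(base: int = LLAMA2_VOCAB_SIZE,
                        added: int = ADDED_TOKENS) -> int:
    """Vocabulary size after appending the new tokens."""
    return base + added


def added_embedding_params(added: int, hidden: int, tied: bool = False) -> int:
    """Extra parameters from new embedding rows.

    With untied embeddings (assumed here), both the input embedding
    matrix and the output projection gain `added` rows.
    """
    matrices = 1 if tied else 2
    return added * hidden * matrices


def compression_ratio(tokens_before: int, tokens_after: int) -> float:
    """How many times shorter a text becomes under the extended tokenizer."""
    if tokens_after <= 0:
        raise ValueError("tokens_after must be positive")
    return tokens_before / tokens_after


def equivalent_base_tokens(context_tokens: int, ratio: float) -> int:
    """Base-tokenizer tokens' worth of text that fits in the same window."""
    return round(context_tokens * ratio)


if __name__ == "__main__":
    print("extended vocab:", extended_vocab_size())
    print("added params:",
          added_embedding_params(ADDED_TOKENS, LLAMA2_13B_HIDDEN_SIZE))
    print("Thai text per window (base-token equivalent):",
          equivalent_base_tokens(LLAMA2_CONTEXT_TOKENS, THAI_COMPRESSION_RATIO))
```

Under these assumptions, the new rows are a small fraction of a 13B-parameter model, while the same context window holds roughly 2.7 times as much Thai text.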

Training Paradigm and Performance

The adaptation continues with a four-stage training protocol: continual pre-training, a hybrid of pre-training and supervised fine-tuning, targeted supervised fine-tuning, and self-preferencing optimization. This regimen is intended to let SeaLLMs retain the general language ability inherited from Llama-2's extensive pre-training while also performing well on the comprehension and generation tasks that SEA languages require.
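The ordering of the four stages can be pictured as a simple sequential pipeline in which each stage starts from the checkpoint produced by the previous one. The sketch below only models that ordering. The stage names mirror the summary, but `Stage`, `run_pipeline`, and `dummy_train` are hypothetical placeholders, and no actual training, data, or hyperparameters from the paper are represented.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Stage:
    """One training stage; only the name is taken from the summary."""
    name: str


# Stage order as listed in the summary.
STAGES = (
    Stage("continual_pretraining"),
    Stage("pretrain_sft_hybrid"),
    Stage("supervised_finetuning"),
    Stage("self_preferencing_optimization"),
)


def run_pipeline(checkpoint: str,
                 stages: Sequence[Stage],
                 train_fn: Callable[[str, Stage], str]) -> str:
    """Apply each stage in order, feeding its output checkpoint to the next."""
    for stage in stages:
        checkpoint = train_fn(checkpoint, stage)
    return checkpoint


def dummy_train(checkpoint: str, stage: Stage) -> str:
    """Hypothetical stand-in for real training: just records the lineage."""
    return f"{checkpoint}->{stage.name}"


if __name__ == "__main__":
    # "llama-2-13b" is an illustrative label for the starting checkpoint.
    print(run_pipeline("llama-2-13b", STAGES, dummy_train))
```

Keeping the stages as an explicit ordered sequence makes the dependency clear: later stages such as self-preferencing optimization operate on a model that has already been through pre-training and supervised fine-tuning.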

Empirical evaluations show SeaLLM-13B's advantage over ChatGPT-3.5 in non-Latin SEA languages, with large margins for languages like Khmer and Burmese. The Sea-bench results support the claim that SeaLLMs are both linguistically inclusive and, being lightweight, cost-effective to operate.

Theoretical and Practical Implications

On the theoretical side, this research shows that region-specific adaptations of LLMs can remain competitive with dominant general-purpose models in multilingual tasks. By combining vocabulary expansion with culturally nuanced tuning, SeaLLMs offer a blueprint for developing LLMs that respect linguistic diversity while maintaining high performance. On the practical side, the democratization of AI tools through SeaLLMs stands to give underserved populations better access to advanced language technologies, paving the way for localized AI applications that reflect cultural and social norms.

Future Prospects

Looking ahead, this work opens avenues for continuous adaptation of LLMs using hybrid data strategies, forming an iterative cycle of model improvement as more training data becomes available. The implications for AI-assisted communication across diverse linguistic landscapes are significant, particularly for linguistically rich regions like Southeast Asia. Future work may extend these methods to other underrepresented languages and evaluate their long-term impact across different applications.

In conclusion, the development of SeaLLMs represents a substantive stride towards addressing linguistic inequality in AI, providing specialized tools that enhance understanding and communication within Southeast Asia while emulating the best practices of leading LLMs.

GitHub

GitHub - DAMO-NLP-SG/SeaLLMs: SeaLLMs - Large Language Models for Southeast Asia (169 stars)
