
Abstract

LLMs have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.

Figure: Language distribution in the SFT dataset.

Overview

  • The paper 'SeaLLMs 3: Open Foundation and Chat Multilingual LLMs for Southeast Asian Languages' introduces advanced language models tailored for Southeast Asian (SEA) languages, addressing the gap in linguistic technology for low-resource languages.

  • The authors employed innovative methods like Language-Specific Neuron (LSN) training and extensive supervised fine-tuning (SFT) to enhance model performance while ensuring cultural and linguistic accuracy through native speaker involvement.

  • Evaluations demonstrated SeaLLMs 3's superiority in multilingual knowledge, mathematics, instruction-following, and translation tasks, with a strong focus on safety and reliability, setting a new standard for developing inclusive AI technologies for underrepresented languages.

SeaLLMs 3: Open Foundation and Chat Multilingual LLMs for Southeast Asian Languages

The paper "SeaLLMs 3: Open Foundation and Chat Multilingual LLMs for Southeast Asian Languages" by Wenxuan Zhang and colleagues presents an extensive exploration into the development of LLMs specifically tailored for Southeast Asian (SEA) languages. The authors address a significant disparity in language technology advancement that predominantly favors high-resource languages like English and Chinese. SeaLLMs 3 marks a crucial advancement in rectifying this imbalance by providing robust, specialized models for a diverse array of SEA languages.

Introduction and Background

LLMs such as GPT-4 and Qwen have demonstrated remarkable proficiency across various linguistic tasks. However, these models have largely focused on high-resource languages, leaving low-resource languages, particularly those in SEA, without adequate support. SEA is a linguistically rich region encompassing languages such as Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. SeaLLMs 3 specifically targets these underserved linguistic communities by enhancing model training techniques and optimizing performance to meet the unique needs of these languages.

Methodology

Pre-training

SeaLLMs 3 diverges from conventional methods of continued pretraining by employing Language-Specific Neuron (LSN) training. This method selectively enhances language-specific neurons within a foundational model, significantly reducing training costs while preserving the performance of high-resource languages from the base model. The LSNs, identified using a parallel detection method, are fine-tuned with targeted training data, ensuring efficient and effective learning for each specific language.
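The paper does not spell out reference code for this step, so the following is only a minimal PyTorch sketch of how neuron-level selective training can be realized: the activation-difference scoring rule and the `lsn_masks` structure (parameter name mapped to a same-shape 0/1 mask) are illustrative assumptions, not the authors' exact parallel detection method.

```python
# Minimal sketch of language-specific-neuron (LSN) style selective fine-tuning
# with PyTorch. The scoring rule and `lsn_masks` layout are illustrative
# assumptions, not the paper's exact procedure.
import torch

@torch.no_grad()
def score_neurons(model, tokenizer, target_texts, reference_texts, mlp_layer):
    """Score neurons of one MLP layer by how much more they fire on target-language text."""
    captured = {}

    def hook(module, inputs, output):
        # output: (batch, seq_len, hidden); mean absolute activation per neuron
        captured["act"] = output.abs().mean(dim=(0, 1))

    handle = mlp_layer.register_forward_hook(hook)
    means = []
    for texts in (target_texts, reference_texts):
        enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        model(**enc)
        means.append(captured["act"].clone())
    handle.remove()
    return means[0] - means[1]  # larger value = more specific to the target language

def mask_gradients(model, lsn_masks):
    """Keep gradients only for selected neurons; call after loss.backward()."""
    for name, param in model.named_parameters():
        if param.grad is None:
            continue
        mask = lsn_masks.get(name)
        if mask is None:
            param.grad.zero_()                           # parameter fully frozen
        else:
            param.grad.mul_(mask.to(param.grad.dtype))   # update LSN weights only
```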

The pretraining data amalgamates diverse sources, including Wikipedia, textbooks, CC-News, CulturaX, and synthetic data generated through stronger models. Enhanced regional knowledge integration is achieved through meticulously annotated training data, ensuring comprehensive coverage of SEA languages.

Supervised Fine-Tuning (SFT)

The SFT process involves constructing a diverse pool of task-specific datasets, with native speakers actively engaged throughout data collection and validation to ensure linguistic and cultural accuracy. The authors emphasize a balanced representation of languages within the SFT data, mitigating the dominance of English. Furthermore, the SFT data has been expanded to encompass a wide array of task types, enhancing the model's capability across domains such as coding, mathematics, education, dialogue, and translation.
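As a rough illustration of the language-balancing idea (the exponent value and language pools below are assumptions, not the paper's actual configuration), low-resource languages can be upsampled relative to their raw share of the data:

```python
# Illustrative language-balanced sampling for an SFT mixture; the exponent and
# the language pools are hypothetical, not the paper's exact recipe.
import random

def balanced_sample(pools, n_samples, alpha=0.5):
    """Draw examples with probability proportional to pool_size ** alpha.

    alpha < 1 upsamples low-resource languages relative to their raw counts,
    mitigating the dominance of English data.
    """
    langs = list(pools)
    weights = [len(pools[lang]) ** alpha for lang in langs]
    samples = []
    for _ in range(n_samples):
        lang = random.choices(langs, weights=weights, k=1)[0]
        samples.append(random.choice(pools[lang]))
    return samples

# Usage with hypothetical pools:
# mixed = balanced_sample({"en": en_sft, "th": th_sft, "id": id_sft}, 100_000)
```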

Special attention is given to safety and reliability through the introduction of refusal-type data and culturally sensitive safety scenarios, ensuring that the models avoid hallucinations and provide contextually appropriate responses.
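To make the refusal-type data concrete, a single training record might look like the following; the question, the fictitious title, and the field names are invented for illustration and are not drawn from the paper's dataset.

```python
# A hypothetical refusal-type SFT record: the question asks about an entity that
# does not exist, and the target response declines instead of hallucinating.
refusal_example = {
    "instruction": "Summarize the plot of the Indonesian novel 'Langit Keempat Belas' (1987).",
    "response": (
        "I could not find reliable information about an Indonesian novel titled "
        "'Langit Keempat Belas' from 1987, so I cannot summarize it. "
        "Could you double-check the title or author?"
    ),
}
```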

Evaluations and Results

Multilingual World Knowledge

SeaLLMs 3 demonstrates superior performance in multilingual world knowledge tasks across various benchmarks such as M3Exam and MMLU. Notably, SeaLLMs-v3-7B-Chat achieved the highest average scores in SEA languages, outperforming competitive models like Qwen2-7B and Meta-Llama-3. These results underscore the model’s robustness in handling educational and cross-linguistic alignment tasks.
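For context, world-knowledge benchmarks of this kind are typically scored as multiple-choice accuracy. A minimal log-likelihood-based scorer might look like the sketch below; the prompt template and option handling are assumptions, not the benchmarks' official evaluation harnesses.

```python
# Hedged sketch of multiple-choice scoring in the spirit of M3Exam/MMLU-style
# evaluation; the prompt format is illustrative only.
import torch

@torch.no_grad()
def pick_option(model, tokenizer, question, options):
    """Return the index of the answer letter with the highest next-token logit."""
    letters = ["A", "B", "C", "D"][: len(options)]
    prompt = (
        question
        + "\n"
        + "\n".join(f"{l}. {o}" for l, o in zip(letters, options))
        + "\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]
    letter_ids = [tokenizer.encode(" " + l, add_special_tokens=False)[-1] for l in letters]
    return int(torch.stack([next_token_logits[i] for i in letter_ids]).argmax())
```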

Multilingual Math and Instruction-following

In mathematics, SeaLLMs-v3-7B-Chat exhibited high proficiency, particularly in Indonesian and Thai, showcasing superior adaptability and robustness. Moreover, the model excelled in multi-turn instruction-following tasks, garnering high scores in Indonesian, Thai, and Vietnamese. This highlights the model's capability to generate coherent and contextually appropriate multi-turn responses.

Translation

SeaLLMs 3's translation capabilities were evaluated using the Flores-200 dataset, where it consistently achieved higher chrF scores across multiple languages, including low-resource ones like Khmer and Lao. This suggests that SeaLLMs 3 is highly effective in multilingual translation tasks.
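chrF can be computed with the sacrebleu library; a minimal example is shown below, with placeholder hypothesis and reference strings standing in for actual Flores-200 data and model outputs.

```python
# Computing chrF with sacrebleu; the sentences are placeholders, and loading
# Flores-200 plus generating model translations is assumed to happen elsewhere.
from sacrebleu.metrics import CHRF

hypotheses = ["Selamat pagi, apa kabar?"]             # model translations (illustrative)
references = [["Selamat pagi, bagaimana kabarmu?"]]   # one reference stream

chrf = CHRF()
print(chrf.corpus_score(hypotheses, references).score)  # higher chrF = closer to reference
```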

Model Trustworthiness

The model's trustworthiness was assessed through its ability to refuse questions beyond its knowledge boundaries (hallucination) and to handle unsafe queries (safety). SeaLLMs-v3-7B-Chat significantly outperformed baseline models in refusing questions about non-existent entities, as demonstrated in the SeaRefuse evaluations. Moreover, it achieved the highest safety scores across multiple languages, indicating that its safety behavior holds up across different linguistic contexts.
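As an illustrative stand-in for how refusal behavior might be measured (the marker phrases and judging rule below are assumptions; SeaRefuse may use a different judging procedure), a simple phrase-matching scorer could count how often the model declines such questions.

```python
# Illustrative refusal-rate scoring via phrase matching; the marker list and
# rule are assumptions, not the SeaRefuse benchmark's actual judge.
REFUSAL_MARKERS = [
    "i could not find", "i couldn't find", "i am not aware",
    "does not exist", "i don't have information", "no record of",
]

def is_refusal(response):
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    """Fraction of answers that decline questions about non-existent entities."""
    return sum(is_refusal(r) for r in responses) / max(len(responses), 1)
```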

Implications and Future Directions

The successful development and deployment of SeaLLMs 3 have promising implications for the inclusion of SEA languages in advanced AI technologies. The efficient language enhancement techniques and comprehensive instruction tuning datasets used in this model set a new standard for developing LLMs for low-resource languages. Future research could focus on extending this methodology to other underrepresented language groups, further democratizing access to advanced language technologies.

Conclusion

SeaLLMs 3 represents a significant stride towards inclusive AI, providing advanced LLM capabilities for SEA languages. By adopting innovative training methods and emphasizing cultural and linguistic nuances, the SeaLLMs 3 models offer high performance, safety, and reliability. The open-sourcing of these models encourages further research and application development, fostering broader access and innovation in AI for Southeast Asia. The authors' commitment to addressing language technology disparities aligns with the broader goal of equitable and inclusive AI advancement.
