
Abstract

LLMs are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations, with a special focus on Turkish. We conduct an in-depth analysis to evaluate the impact of training strategies, model choices, and data availability on the performance of LLMs designed for underrepresented languages. Our approach includes two methodologies: (i) adapting existing LLMs originally pretrained in English to understand Turkish, and (ii) developing a model from the ground up using Turkish pretraining data, both supplemented with supervised fine-tuning on a novel Turkish instruction-tuning dataset aimed at enhancing reasoning capabilities. The relative performance of these methods is evaluated through the creation of a new leaderboard for Turkish LLMs, featuring benchmarks that assess different reasoning and knowledge skills. Furthermore, we conduct experiments on data and model scaling, both during pretraining and fine-tuning, simultaneously emphasizing the capacity for knowledge transfer across languages and addressing the challenges of catastrophic forgetting encountered during fine-tuning on a different language. Our goal is to offer a detailed guide for advancing the LLM framework in low-resource linguistic contexts, thereby making NLP benefits more globally accessible.

Overview

  • The paper explores challenges and solutions for developing LLMs for the Turkish language, using methods like fine-tuning English-pretrained LLMs and training models from scratch with Turkish data.

  • To improve reasoning capabilities of Turkish LLMs, the study employs supervised instruction-tuning on a novel Turkish dataset and introduces new benchmarks and a leaderboard for fair evaluation.

  • Experiments reveal that adapted models perform better in Turkish tasks compared to scratch-trained models, highlighting issues like catastrophic forgetting and stressing the importance of dataset quality and computational efficiency.

Advancing LLMs for Turkish: Strategies and Evaluations

The paper "Bridging the Bosphorus: Advancing Turkish LLMs through Strategies for Low-Resource Language Adaptation and Benchmarking" by Emre Can Acikgoz et al. addresses the significant challenges and strategic solutions in advancing LLMs for the Turkish language. The study includes two main methodologies for developing Turkish LLMs: (i) fine-tuning existing LLMs pretrained in English and (ii) training models from scratch using Turkish pretraining data. Both approaches are supplemented with supervised instruction-tuning on a novel Turkish dataset to enhance reasoning capabilities. This paper also introduces new benchmarks and a leaderboard specifically for Turkish LLMs to facilitate fair and reproducible evaluation of these models.

Challenges in Developing LLMs for Turkish

Although Turkish is not a low-resource language per se, it suffers from a lack of research focus, solid base models, and standardized benchmarks. The primary obstacles identified include data scarcity, model selection, evaluation challenges, and computational limitations. Addressing these gaps is essential for making advanced NLP technologies accessible for underrepresented languages like Turkish.

Methodological Approaches

The study pursued two distinct approaches: adapting existing LLMs (Mistral-7B and GPT2-xl) to Turkish, and training a series of decoder-only models, named Hamza, from scratch. The aim was to explore the feasibility and performance differences between these approaches.

  1. Adaptation of Existing LLMs: The selected LLMs, Mistral-7B and GPT2-xl, were further pretrained on Turkish data, with the training corpus progressively enlarged to ensure efficient adaptation. LoRA (Low-Rank Adaptation) was employed to keep training cost-efficient while mitigating catastrophic forgetting (see the first sketch after this list).

  2. Training from Scratch: A family of decoder-only models named Hamza, ranging from 124M to 1.3B parameters, was trained on the Turkish split of CulturaX, a multilingual dataset. The training followed architectural and training principles similar to GPT-2's, with adaptations for the specifics of the Turkish language (see the second sketch after this list).
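The LoRA technique referenced in (1) trains only small low-rank adapter matrices injected into a frozen base model, which is what makes continued pretraining of a 7B model affordable. Below is a minimal sketch using the Hugging Face PEFT library; the rank, scaling factor, and target modules are illustrative choices rather than the paper's reported settings.

```python
# Minimal LoRA sketch with Hugging Face PEFT; hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension (assumed)
    lora_alpha=32,                        # adapter scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# From here, continue pretraining on Turkish text with a standard
# causal-LM training loop or the Hugging Face Trainer.
```

Because gradients flow only through the adapters, memory and compute requirements drop substantially, and keeping the original weights frozen helps limit how much of the model's English capability is overwritten.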
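For the from-scratch route in (2), a GPT-2-style decoder at roughly the smallest Hamza scale can be instantiated in a few lines. The vocabulary size below assumes a Turkish tokenizer and is not the paper's exact configuration; the CulturaX identifier matches the dataset on the Hugging Face Hub.

```python
# Sketch: a GPT-2-style decoder-only model at roughly the 124M-parameter scale.
# The 32k vocabulary assumes a Turkish BPE tokenizer; not the Hamza config.
from datasets import load_dataset
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32000,  # assumed Turkish tokenizer size
    n_positions=1024,  # context length, as in GPT-2
    n_embd=768,        # hidden size of the 124M-class GPT-2
    n_layer=12,
    n_head=12,
)
model = GPT2LMHeadModel(config)  # randomly initialized, trained from scratch
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")

# Turkish split of CulturaX (the dataset is gated; accepting its terms on the
# Hub may be required before this call succeeds)
dataset = load_dataset("uonlp/CulturaX", "tr", split="train", streaming=True)
```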

Instruction-Tuning and Evaluation Datasets

A novel Turkish instruction-tuning (IT) dataset was designed following the Self-Instruct framework to improve the models' reasoning abilities. The dataset was meticulously validated and augmented with diverse and challenging tasks.
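Self-Instruct bootstraps an instruction dataset by repeatedly showing an LLM a handful of existing tasks, asking it to write a new one in the same style, and keeping only candidates that differ enough from the pool. The sketch below captures that loop in schematic form; generate stands in for any completion call, the prompt wording is invented, and the character-based similarity filter replaces the ROUGE-L check used in the original framework.

```python
# Schematic Self-Instruct loop: bootstrap new instructions from seed tasks.
# `generate` is a placeholder for any LLM completion function, not a real API.
import random
from difflib import SequenceMatcher

def self_instruct(seed_tasks, generate, rounds=100, sim_threshold=0.7):
    pool = list(seed_tasks)
    for _ in range(rounds):
        examples = random.sample(pool, min(8, len(pool)))
        prompt = (
            "Write one new Turkish instruction in the style of these examples:\n"
            + "\n".join(f"- {t}" for t in examples)
        )
        candidate = generate(prompt).strip()
        # Keep the candidate only if it is sufficiently novel relative to the
        # pool (the original framework filters with ROUGE-L instead).
        if all(
            SequenceMatcher(None, candidate, t).ratio() < sim_threshold
            for t in pool
        ):
            pool.append(candidate)
    return pool
```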

The paper also released two new evaluation datasets, TruthfulQA-TR and ARC-TR, which were carefully translated and validated by annotators to ensure quality. These datasets were designed to evaluate models on their ability to avoid common falsehoods and answer grade-school science questions, respectively.
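Benchmarks of this kind are usually scored by comparing the log-likelihood a model assigns to each answer option rather than by free-form generation. The paper does not spell out its evaluation harness, so the following is only a sketch of that standard recipe; it also assumes the tokenization of question plus answer splits cleanly at the prompt boundary.

```python
# Sketch of likelihood-based multiple-choice scoring (the usual recipe for
# ARC- and TruthfulQA-style benchmarks); not the paper's exact harness.
import torch

def score_choice(model, tokenizer, question, choice):
    """Sum of log-probabilities the model assigns to the answer tokens."""
    prompt_len = tokenizer(question, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(question + " " + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Position p-1 predicts token p, so shift by one when gathering log-probs.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    choice_ids = full_ids[0, prompt_len:]
    positions = torch.arange(prompt_len - 1, full_ids.shape[1] - 1)
    return log_probs[positions, choice_ids].sum().item()

def predict(model, tokenizer, question, choices):
    # The option with the highest total log-likelihood is the prediction.
    return max(choices, key=lambda c: score_choice(model, tokenizer, question, c))
```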

Experimental Results

The study's experimental results highlighted the following key findings:

Performance on Benchmark Datasets:

The adapted models, particularly Hamza-Mistral, showed superior performance in Turkish question-answering tasks compared to models trained from scratch.

Bits-Per-Character (BPC) Metrics:

Lower BPC values indicated that the adapted models were better at compressing Turkish text, suggesting more efficient language modeling capabilities.
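BPC normalizes cross-entropy by raw character count instead of token count, so models with different tokenizers can be compared fairly: a model cannot improve its score simply by splitting text into fewer tokens. Assuming the mean per-token loss in nats from a causal language model, the conversion is straightforward:

```python
import math

def bits_per_character(mean_loss_nats, num_tokens, num_chars):
    """Convert mean per-token cross-entropy (in nats) to bits per character."""
    total_bits = mean_loss_nats * num_tokens / math.log(2)  # nats -> bits
    return total_bits / num_chars

# Example: a loss of 2.0 nats/token over 1M tokens covering 4.5M characters
print(bits_per_character(2.0, 1_000_000, 4_500_000))  # ~0.64 BPC
```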

Implications of Fine-Tuning:

The supervised fine-tuning with the novel Turkish IT dataset yielded improvements in model performance across downstream benchmarks.

Catastrophic Forgetting:

The study found that models pretrained on English experienced a decline in English task performance when further trained on Turkish data, indicating catastrophic forgetting. This finding suggests the need for balanced training data to maintain multilingual capabilities.
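One common way to act on the balanced-data suggestion is to interleave a small "replay" fraction of English pretraining data into the Turkish stream during continued training. The sketch below uses the Hugging Face datasets library; the 90/10 mixing ratio is an assumption for illustration, not a value reported in the paper.

```python
# Sketch: mix a replay fraction of English data into Turkish continual
# pretraining to mitigate catastrophic forgetting. The 90/10 ratio is an
# illustrative assumption, not the paper's setting.
from datasets import interleave_datasets, load_dataset

turkish = load_dataset("uonlp/CulturaX", "tr", split="train", streaming=True)
english = load_dataset("uonlp/CulturaX", "en", split="train", streaming=True)

mixed = interleave_datasets(
    [turkish, english],
    probabilities=[0.9, 0.1],  # mostly Turkish, with English replay
    seed=42,
)
```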

Practical and Theoretical Implications

  1. Practical Implications: The release of new datasets and the open-source nature of Hamza models contribute significantly to the NLP community. They provide essential resources for future research in Turkish language processing and facilitate the development of better models.

  2. Theoretical Implications: The paper explores the challenges of adapting high-resource language models for underrepresented languages and provides insights into effective strategies for overcoming these challenges. It emphasizes the importance of dataset quality and balanced training to avoid catastrophic forgetting.

Future Developments in AI

The future of AI and NLP, as inferred from this study, involves creating more efficient, adaptable, and inclusive models. There is a need for high-quality, diverse, and representative datasets. Additionally, innovative training techniques like LoRA and better fine-tuning strategies are crucial in advancing LLMs for underrepresented languages. Future efforts could also focus on improving hardware capabilities to support large-scale model training in resource-constrained settings.

Conclusion

This study's comprehensive approach sheds light on the development of LLMs for Turkish, presenting a clear roadmap for other underrepresented languages. The methodologies, benchmarks, and models introduced in this paper significantly enrich the field by making NLP technologies more accessible and effective for diverse linguistic contexts. The research underscores the need to balance between extending existing models and developing new ones from scratch, highlighting the importance of dataset quality, computational efficiency, and multilingual adaptability.
