
Abstract

Building LLMs for languages other than English is in great demand because existing multilingual LLMs are often unavailable for those languages or perform poorly on them, for instance in understanding local context. The problem is especially acute for low-resource languages, which lack the instruction sets needed for fine-tuning. In a multilingual country like India, LLMs supporting Indic languages are needed to provide generative AI and LLM-based technologies and services to its citizens. This paper presents our approach of i) generating a large Odia instruction set, including domain-knowledge data suitable for LLM fine-tuning, and ii) building a Llama2-fine-tuned model tailored for enhanced performance in the Odia domain. The proposed work will help researchers build instruction sets and LLMs, particularly for Indic languages. We will release the model and instruction set publicly for research and noncommercial purposes.

Overview

  • The paper discusses the creation of a Llama2-based LLM specifically fine-tuned for the Odia language, aimed at improving AI services for speakers of this underrepresented language.

  • A bespoke dataset of 181K instructions, combining instructions translated from English with Odia-specific domain knowledge, was compiled using existing AI models for translation to support the model's fine-tuning.

  • The Llama2-7b model underwent fine-tuning with specific hyperparameters, employing 4-bit quantization and LoRA (Low-Rank Adaptation) to keep memory and compute requirements manageable.

  • Objective and human evaluations indicate the model's strong grasp of Odia grammar but highlight limitations in certain tasks such as classification and generating lengthy responses.

  • Despite some issues like response hallucination and verbosity, the research marks progress towards digital equality for Odia and offers a basis for future improvements in AI for lesser-resourced languages.

Introduction

Generative AI and LLMs (Large Language Models) have transformed computational linguistics, offering remarkable capabilities for language generation and understanding. However, the effectiveness of these models for languages with limited digital resources, such as those in the Indic group, remains underwhelming. The need to build robust LLMs for these languages is evident, especially for regional languages like Odia, spoken in the eastern Indian state of Odisha. Addressing this need, the paper describes the development of a Llama2-based LLM, fine-tuned with domain-specific instruction sets in Odia, aiming to bridge the representation gap and enhance AI-enabled services for Odia speakers.

Dataset Construction

The foundation of this LLM is its dataset: 181K instructions for Odia, combining common instructions translated from English with Odia domain-specific knowledge instructions. These instructions span subjects from local cuisine to historical landmarks, sports, and general knowledge about Odisha, giving the fine-tuning process broad domain coverage. The dataset was prepared by leveraging existing AI models, such as GPT-family models, for translation and for augmenting domain knowledge.
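
The summary does not include the data-generation code. Below is a minimal sketch of how such a translated instruction set could be assembled, assuming an open translation model (NLLB-200, whose language code for Odia is `ory_Orya`) stands in for the GPT-family models the authors used, and assuming an Alpaca-style `instruction`/`input`/`output` JSON schema; both choices are assumptions, not details from the paper.

```python
import json
from transformers import pipeline

# Assumption: NLLB-200 as the translator (the authors used GPT-family models);
# "ory_Orya" is the NLLB-200 code for Odia.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="ory_Orya",
)

def translate_en_to_odia(text: str) -> str:
    return translator(text, max_length=512)[0]["translation_text"] if text else ""

def to_odia_record(instruction_en: str, input_en: str, output_en: str) -> dict:
    # Translate each field of an English instruction triple while keeping the
    # Alpaca-style schema; native Odia domain-knowledge records (cuisine,
    # landmarks, sports, ...) can be appended directly in the same schema.
    return {
        "instruction": translate_en_to_odia(instruction_en),
        "input": translate_en_to_odia(input_en),
        "output": translate_en_to_odia(output_en),
    }

english_seed = [
    ("List three historical landmarks of Odisha.", "",
     "Konark Sun Temple, Jagannath Temple in Puri, and the Lingaraj Temple."),
]

with open("odia_instructions.jsonl", "w", encoding="utf-8") as f:
    for triple in english_seed:
        f.write(json.dumps(to_odia_record(*triple), ensure_ascii=False) + "\n")
```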

Experimental Methodology

For the experimental setup, the paper describes fine-tuning the Llama2-7b model with tailored hyperparameters, including a learning rate of 2e-4 and seven training epochs. By employing 4-bit quantization, the researchers reduced memory and computational demands. They also applied LoRA (Low-Rank Adaptation), which fine-tunes the model by training small low-rank matrices injected into the attention layers instead of updating all of the base weights. The training regimen included regularization techniques such as dropout and weight decay to prevent overfitting. The steadily decreasing loss values reported indicate effective learning over time.
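
The training code itself is not part of this summary. The following is a minimal sketch of 4-bit quantization plus LoRA fine-tuning of Llama2-7b with the Hugging Face transformers/peft stack, using the stated learning rate and epoch count; the LoRA rank and alpha, weight-decay value, batch size, and prompt template are illustrative assumptions.

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-2-7b-hf"

# 4-bit quantization keeps the frozen base weights small in GPU memory.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# LoRA (Low-Rank Adaptation): train small rank-decomposition matrices on the
# attention projections instead of all 7B base weights. r/alpha are illustrative.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM"))

def to_text(ex):
    # Alpaca-style prompt template; the paper's exact template is an assumption here.
    return {"text": f"### Instruction:\n{ex['instruction']}\n\n"
                    f"### Input:\n{ex['input']}\n\n### Response:\n{ex['output']}"}

ds = load_dataset("json", data_files="odia_instructions.jsonl", split="train").map(to_text)
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-7b-odia",
        learning_rate=2e-4,             # stated in the summary
        num_train_epochs=7,             # stated in the summary
        weight_decay=0.01,              # regularization; exact value assumed
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        logging_steps=50,
        bf16=True,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```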

Results and Observations

The fine-tuned model was assessed using objective metrics such as ROUGE and BLEU as well as by human evaluators fluent in Odia. The model scored 0.6583 on ROUGE and 0.6158 on BLEU, indicating strong alignment with reference texts. In human evaluations assessing readability, perplexity, and correctness, the model demonstrated a substantial command of Odia grammar but showed limitations on classification tasks and in generating lengthy responses.
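
The summary reports the scores but not an evaluation script. A small sketch of how ROUGE and BLEU are commonly computed with the Hugging Face `evaluate` library follows; the placeholder texts, and the exact metric variants and scaling used by the authors, are assumptions.

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")

# Placeholders: in practice these lists hold Odia model outputs and gold
# references from the held-out test set.
predictions = ["model-generated answer text"]
references = [["gold reference answer text"]]

rouge_scores = rouge.compute(predictions=predictions,
                             references=[refs[0] for refs in references])
bleu_scores = bleu.compute(predictions=predictions, references=references)

# Note: sacrebleu reports on a 0-100 scale; the paper's 0.6158 suggests a
# 0-1 scaled variant, which is not specified in this summary.
print("ROUGE-L:", rouge_scores["rougeL"], "BLEU:", bleu_scores["score"])
```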

During inference, the model handled arithmetic questions well and maintained the formal honorifics customary in Odia interactions. It nonetheless showed imperfections, such as hallucinations in longer responses, pointing to areas for further refinement. The analysis identifies concrete directions for improvement, particularly generating concise answers and managing the complexity of longer responses, indicating that the journey toward linguistic parity for Odia is ongoing.
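
For the inference behaviour described above, a minimal generation sketch with the fine-tuned adapter is shown below; the adapter path and prompt template are placeholders carried over from the sketches above, not the name or format of the authors' released checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "meta-llama/Llama-2-7b-hf"
adapter = "llama2-7b-odia"   # path to the LoRA adapter trained above (placeholder)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16,
                                             device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

prompt = "### Instruction:\n<Odia question here>\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Capping max_new_tokens and sampling conservatively is one way to curb the
# rambling, hallucination-prone long answers noted in the evaluation.
output = model.generate(**inputs, max_new_tokens=256, do_sample=True,
                        temperature=0.7, top_p=0.9)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```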

Conclusion

The development of the fine-tuned Llama2 model for Odia is a stride towards linguistic equality in AI representation for lesser-resourced languages. Creating a robust dataset and meticulously fine-tuning the model have yielded a tool that enhances digital inclusivity for the Odia-speaking populace. While shortcomings persist, such as hallucination issues and a proclivity for verbosity, the research lays a sturdy foundation for future enhancements. The authors commit to investigating and addressing limitations while comparing the model with other multilingual LLMs and exploring distillation methods for creating compact yet competent models for languages like Odia.
