
Abstract

Building LLMs for languages other than English is in great demand because existing multilingual LLMs are often unavailable for those languages or perform poorly on them, for instance in understanding local context. The problem is especially acute for low-resource languages, which lack the instruction sets needed for fine-tuning. In a multilingual country like India, LLMs supporting Indic languages are needed to provide generative AI and LLM-based technologies and services to its citizens. This paper presents our approach of i) generating a large Odia instruction set, including domain-knowledge data suitable for LLM fine-tuning, and ii) building a Llama2-fine-tuned model tailored for enhanced performance in the Odia domain. The proposed work will help researchers build instruction sets and LLMs, particularly for Indic languages. We will release the model and instruction set publicly for research and noncommercial purposes.

Overview

  • The paper discusses the creation of a Llama2-based LLM specifically fine-tuned for the Odia language, aimed at improving AI services for speakers of this underrepresented language.

  • A bespoke dataset of 181K instructions, combining instructions translated from English with Odia-specific domain knowledge, was compiled using existing AI models for translation to support the model's fine-tuning.

  • The Llama2-7b model underwent fine-tuning with specific hyperparameters, employing 4-bit quantization and LoRA (Low-Rank Adaptation) to keep memory and compute requirements manageable.

  • Objective and human evaluations indicate the model's strong grasp of Odia grammar but highlight limitations in certain tasks such as classification and generating lengthy responses.

  • Despite some issues like response hallucination and verbosity, the research marks progress towards digital equality for Odia and offers a basis for future improvements in AI for lesser-resourced languages.

Introduction

Generative AI and LLMs (Large Language Models) have transformed computational linguistics, offering remarkable capabilities for language generation and understanding. However, the effectiveness of these models for languages with limited digital resources, such as those in the Indic group, remains underwhelming. The need to build robust LLMs for these languages is evident, especially for regional languages like Odia, spoken in the eastern Indian state of Odisha. Addressing this need, the paper describes the development of a Llama2-based LLM, fine-tuned with domain-specific instruction sets in Odia, aiming to bridge the representation gap and enhance AI-enabled services for Odia speakers.

Dataset Construction

The foundation of this LLM is its dataset: 181K instructions for Odia, combining common instructions translated from English with Odia domain-specific knowledge instructions. These instructions span subjects from local cuisine to historical landmarks, sports, and general knowledge about Odisha, giving the fine-tuning process broad domain coverage. The dataset was prepared by leveraging existing AI models, such as GPT-family models, for translation and for augmenting domain knowledge.
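
The summary does not include the data-generation code. Below is a minimal sketch of how such a translated instruction set could be assembled, assuming an open translation model (NLLB-200, whose language code for Odia is `ory_Orya`) stands in for the GPT-family models the authors used, and assuming an Alpaca-style `instruction`/`input`/`output` JSON schema; both choices are assumptions, not details from the paper.

```python
import json
from transformers import pipeline

# Assumption: NLLB-200 as the translator (the authors used GPT-family models);
# "ory_Orya" is the NLLB-200 code for Odia.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="ory_Orya",
)

def translate_en_to_odia(text: str) -> str:
    return translator(text, max_length=512)[0]["translation_text"] if text else ""

def to_odia_record(instruction_en: str, input_en: str, output_en: str) -> dict:
    # Translate each field of an English instruction triple while keeping the
    # Alpaca-style schema; native Odia domain-knowledge records (cuisine,
    # landmarks, sports, ...) can be appended directly in the same schema.
    return {
        "instruction": translate_en_to_odia(instruction_en),
        "input": translate_en_to_odia(input_en),
        "output": translate_en_to_odia(output_en),
    }

english_seed = [
    ("List three historical landmarks of Odisha.", "",
     "Konark Sun Temple, Jagannath Temple in Puri, and the Lingaraj Temple."),
]

with open("odia_instructions.jsonl", "w", encoding="utf-8") as f:
    for triple in english_seed:
        f.write(json.dumps(to_odia_record(*triple), ensure_ascii=False) + "\n")
```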

Experimental Methodology

For the experimental setup, the paper describes fine-tuning the Llama2-7b model with tailored hyperparameters, including a learning rate of 2e-4 and seven training epochs. By employing 4-bit quantization, the researchers reduced memory and computational demands. They also applied LoRA (Low-Rank Adaptation), which fine-tunes the model by training small low-rank matrices injected into the attention layers instead of updating all of the base weights. The training regimen included regularization techniques such as dropout and weight decay to prevent overfitting. The steadily decreasing loss values reported indicate effective learning over time.
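
The training code itself is not part of this summary. The following is a minimal sketch of 4-bit quantization plus LoRA fine-tuning of Llama2-7b with the Hugging Face transformers/peft stack, using the stated learning rate and epoch count; the LoRA rank and alpha, weight-decay value, batch size, and prompt template are illustrative assumptions.

```python
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base = "meta-llama/Llama-2-7b-hf"

# 4-bit quantization keeps the frozen base weights small in GPU memory.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                         bnb_4bit_compute_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, quantization_config=bnb,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# LoRA (Low-Rank Adaptation): train small rank-decomposition matrices on the
# attention projections instead of all 7B base weights. r/alpha are illustrative.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM"))

def to_text(ex):
    # Alpaca-style prompt template; the paper's exact template is an assumption here.
    return {"text": f"### Instruction:\n{ex['instruction']}\n\n"
                    f"### Input:\n{ex['input']}\n\n### Response:\n{ex['output']}"}

ds = load_dataset("json", data_files="odia_instructions.jsonl", split="train").map(to_text)
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-7b-odia",
        learning_rate=2e-4,             # stated in the summary
        num_train_epochs=7,             # stated in the summary
        weight_decay=0.01,              # regularization; exact value assumed
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        logging_steps=50,
        bf16=True,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```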

Results and Observations

The fine-tuned model was assessed using objective metrics such as ROUGE and BLEU as well as by human evaluators fluent in Odia. The model scored 0.6583 on ROUGE and 0.6158 on BLEU, indicating strong alignment with reference texts. In human evaluations assessing readability, perplexity, and correctness, the model demonstrated a substantial command of Odia grammar but showed limitations on classification tasks and in generating lengthy responses.
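
The summary reports the scores but not an evaluation script. A small sketch of how ROUGE and BLEU are commonly computed with the Hugging Face `evaluate` library follows; the placeholder texts, and the exact metric variants and scaling used by the authors, are assumptions.

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("sacrebleu")

# Placeholders: in practice these lists hold Odia model outputs and gold
# references from the held-out test set.
predictions = ["model-generated answer text"]
references = [["gold reference answer text"]]

rouge_scores = rouge.compute(predictions=predictions,
                             references=[refs[0] for refs in references])
bleu_scores = bleu.compute(predictions=predictions, references=references)

# Note: sacrebleu reports on a 0-100 scale; the paper's 0.6158 suggests a
# 0-1 scaled variant, which is not specified in this summary.
print("ROUGE-L:", rouge_scores["rougeL"], "BLEU:", bleu_scores["score"])
```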

During inference, the model handled arithmetic questions well and maintained the formal honorifics customary in Odia interactions. It nonetheless showed imperfections, such as hallucinations in longer responses, pointing to areas for further refinement. The analysis identifies concrete directions for improvement, particularly generating concise answers and managing the complexity of longer responses, indicating that the journey toward linguistic parity for Odia is ongoing.
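
For the inference behaviour described above, a minimal generation sketch with the fine-tuned adapter is shown below; the adapter path and prompt template are placeholders carried over from the sketches above, not the name or format of the authors' released checkpoint.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "meta-llama/Llama-2-7b-hf"
adapter = "llama2-7b-odia"   # path to the LoRA adapter trained above (placeholder)

tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16,
                                             device_map="auto")
model = PeftModel.from_pretrained(model, adapter)

prompt = "### Instruction:\n<Odia question here>\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Capping max_new_tokens and sampling conservatively is one way to curb the
# rambling, hallucination-prone long answers noted in the evaluation.
output = model.generate(**inputs, max_new_tokens=256, do_sample=True,
                        temperature=0.7, top_p=0.9)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```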

Conclusion

The development of the fine-tuned Llama2 model for Odia is a stride towards linguistic equality in AI representation for lesser-resourced languages. Creating a robust dataset and meticulously fine-tuning the model have yielded a tool that enhances digital inclusivity for the Odia-speaking populace. While shortcomings persist, such as hallucination issues and a proclivity for verbosity, the research lays a sturdy foundation for future enhancements. The authors commit to investigating and addressing limitations while comparing the model with other multilingual LLMs and exploring distillation methods for creating compact yet competent models for languages like Odia.
