
Abstract

Despite the considerable advancements in English LLMs, progress in building comparable models for other languages has been hindered by the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages and containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated, manually verified data; unverified yet valuable data; and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and use LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generating non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released under permissive licenses.

Figure: Components of IndicLLMSuite, showcasing the system's diverse parts.

Overview

  • IndicLLMSuite seeks to bridge the linguistic gap in AI by providing tailored datasets and resources for 22 Indian languages, aiming for equitable AI advancements.

  • With 251B tokens for pre-training and 74.8M instruction-response pairs for fine-tuning, the suite enhances the development of LLMs in Indian languages.

  • Sangraha, a part of IndicLLMSuite, combines manually verified, unverified, and synthetic data from diverse sources to create a rich pre-training dataset.

  • IndicLLMSuite demonstrates the viability of synthetic data for low-resource languages while emphasizing the need for culturally nuanced models in practical AI applications.

IndicLLMSuite: Empowering Indic Language Models with Rich Resources

Introduction

The monumental growth of research and development in LLMs primarily benefits English due to the abundance of resources. In contrast, languages from the Indian subcontinent, spoken by over 1.4 billion people, lag behind due to the dearth of comparable datasets and tailored resources. This research introduces IndicLLMSuite, a comprehensive suite aimed at bridging this gap by providing tools, datasets, and resources tailor-made for 22 constitutionally recognized Indian languages. With a total of 251B tokens for pre-training and 74.8M instruction-response pairs for fine-tuning, the suite is a significant step towards equitable AI advancements across languages.

Sangraha: A Multifaceted Pre-training Dataset

Sangraha is distinguished by its unique composition of manually verified data, unverified data, and synthetic data, aggregating a total of 251B tokens. The dataset comprises diverse sources including web content, PDFs, and videos. A notable feature of Sangraha is its emphasis on quality through human verification, alongside leveraging synthetic data to enhance dataset diversity. This approach offers a balanced representation of different content types, ensuring that the dataset is not only vast but also rich in quality and variety.
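As a concrete illustration, a single Sangraha subset could be inspected with the Hugging Face `datasets` library. The repository id and the subset/language directory layout below are assumptions inferred from the paper's description of verified, unverified, and synthetic subsets; the released dataset card is the authority on the actual names.

```python
# A minimal sketch of streaming one Sangraha subset with Hugging Face `datasets`.
# The repo id and data_dir layout are assumptions (verified/unverified/synthetic
# subsets with per-language folders); check the dataset card for the exact names.
from datasets import load_dataset

sangraha_hi = load_dataset(
    "ai4bharat/sangraha",     # assumed Hugging Face repository id
    data_dir="verified/hin",  # assumed layout: <subset>/<language code>
    split="train",
    streaming=True,           # at 251B total tokens, stream rather than download
)

for doc in sangraha_hi.take(3):  # peek at the first three documents
    print(doc)
```

Streaming keeps the example practical: the full corpus is far too large to download just to inspect a handful of records.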

Setu: A Robust Curation Pipeline

The curation of Sangraha is facilitated by Setu, a Spark-based distributed pipeline customized for Indian languages. This pipeline addresses several critical steps in data processing, including extraction, cleaning, flagging, and deduplication. Setu's comprehensive architecture ensures the sanitization and refinement of data, making Sangraha a reliable source for training robust language models.
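Because Setu is Spark-based, its stages map naturally onto DataFrame transformations. The sketch below is a minimal, hypothetical clean-flag-deduplicate flow in that spirit; the input schema, length threshold, and flagging lexicon are illustrative placeholders, not Setu's actual configuration.

```python
# A hypothetical PySpark sketch of a Setu-style clean -> flag -> dedup flow.
# Column names, thresholds, and the word list are placeholders, not Setu's real config.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("setu-style-curation").getOrCreate()

docs = spark.read.json("raw_docs.jsonl")  # assumes one JSON doc per line with a `text` field

FLAG_WORDS = ["badword1", "badword2"]     # placeholder flagging lexicon

curated = (
    docs
    # Cleaning: collapse whitespace and drop very short documents.
    .withColumn("text", F.regexp_replace("text", r"\s+", " "))
    .filter(F.length("text") > 200)
    # Flagging: mark (rather than silently drop) documents containing flagged terms.
    .withColumn("flagged", F.lower(F.col("text")).rlike("|".join(FLAG_WORDS)))
    # Deduplication: exact dedup via an MD5 hash of the cleaned text.
    .withColumn("doc_hash", F.md5("text"))
    .dropDuplicates(["doc_hash"])
)

curated.write.mode("overwrite").parquet("curated_docs.parquet")
```

Flagging rather than dropping preserves the option to apply different filtering policies downstream, which matters when the same corpus feeds models with different safety requirements.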

IndicAlign: Enriching Instruction Fine-Tuning Data

IndicAlign, part of IndicLLMSuite, offers a wide array of prompt-response pairs across 20 languages. It merges existing datasets, translates English datasets, and employs both human and synthetic generation methods to create context-grounded conversations. This diversity enriches the suite with culturally and contextually relevant datasets, aiding in comprehensive model training.
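To make the translate-English-datasets step concrete, the sketch below converts one English instruction-response pair into Hindi. The model choice (Meta's NLLB) is a stand-in picked because it runs directly through the `transformers` translation pipeline, not necessarily the system the suite itself uses, and the example pair is invented.

```python
# A hedged sketch of translating an English instruction-response pair into Hindi.
# NLLB is a stand-in translation model; the example pair is invented.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",  # Hindi in Devanagari; swap the code for other languages
)

pair_en = {
    "instruction": "Explain photosynthesis in one sentence.",
    "response": (
        "Photosynthesis is the process by which plants convert sunlight, "
        "water, and carbon dioxide into glucose and oxygen."
    ),
}

# Translate each field independently, keeping the pair structure intact.
pair_hi = {
    key: translator(text, max_length=256)[0]["translation_text"]
    for key, text in pair_en.items()
}
print(pair_hi)
```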

Theoretical and Practical Implications

The theoretical implications of this research are profound, demonstrating the viability of synthetic data generation in supporting low-resource languages. Practically, the release of IndicLLMSuite paves the way for advanced research and development of LLMs in Indian languages. It serves as a blueprint for extending similar efforts to other languages, advocating for a global approach toward equitable AI development.

Future Directions

This research invites collaboration for training high-quality Indian language LLMs through community-driven initiatives. By pooling resources, the AI community can achieve significant milestones in developing models that are not only linguistically inclusive but also culturally nuanced.

IndicLLMSuite represents a pivotal step towards closing the linguistic divide in AI, supporting the growth of LLMs across Indian languages. This stride encourages diversity and inclusivity in AI development, fostering advances that resonate with a broader spectrum of the global population.
