
Abstract

Despite the considerable advancements in English LLMs, progress in building comparable models for other languages has been hindered by the scarcity of tailored resources. Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages and containing a total of 251B tokens and 74.8M instruction-response pairs. Recognizing the importance of both data quality and quantity, our approach combines highly curated, manually verified data; unverified yet valuable data; and synthetic data. We build a clean, open-source pipeline for curating pre-training data from diverse sources, including websites, PDFs, and videos, incorporating best practices for crawling, cleaning, flagging, and deduplication. For instruction fine-tuning, we amalgamate existing Indic datasets, translate/transliterate English datasets into Indian languages, and use LLaMa2 and Mixtral models to create conversations grounded in articles from Indian Wikipedia and Wikihow. Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios and then generating non-toxic responses by feeding these toxic prompts to an aligned LLaMa2 model. We hope that the datasets, tools, and resources released as part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages. The data and other artifacts created as part of this work are released under permissive licenses.

Figure: Components of IndicLLMSuite, showcasing the system's diverse parts.

Overview

  • IndicLLMSuite seeks to bridge the linguistic gap in AI by providing tailored datasets and resources for 22 Indian languages, aiming for equitable AI advancements.

  • With 251B tokens for pre-training and 74.8M instruction-response pairs for fine-tuning, the suite enhances the development of LLMs in Indian languages.

  • Sangraha, a part of IndicLLMSuite, combines manually verified, unverified, and synthetic data from diverse sources to create a rich pre-training dataset.

  • IndicLLMSuite demonstrates the viability of synthetic data for low-resource languages while emphasizing the need for culturally nuanced models in practical AI applications.

IndicLLMSuite: Empowering Indic Language Models with Rich Resources

Introduction

The monumental growth of research and development in LLMs primarily benefits English due to the abundance of resources. In contrast, languages from the Indian subcontinent, spoken by over 1.4 billion people, lag behind due to the dearth of comparable datasets and tailored resources. This research introduces IndicLLMSuite, a comprehensive suite aimed at bridging this gap by providing tools, datasets, and resources tailor-made for 22 constitutionally recognized Indian languages. With a total of 251B tokens for pre-training and 74.8M instruction-response pairs for fine-tuning, the suite is a significant step towards equitable AI advancements across languages.

Sangraha: A Multifaceted Pre-training Dataset

Sangraha is distinguished by its unique composition of manually verified data, unverified data, and synthetic data, aggregating a total of 251B tokens. The dataset comprises diverse sources including web content, PDFs, and videos. A notable feature of Sangraha is its emphasis on quality through human verification, alongside leveraging synthetic data to enhance dataset diversity. This approach offers a balanced representation of different content types, ensuring that the dataset is not only vast but also rich in quality and variety.
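As a concrete illustration, a single Sangraha subset could be inspected with the Hugging Face `datasets` library. The repository id and the subset/language directory layout below are assumptions inferred from the paper's description of verified, unverified, and synthetic subsets; the released dataset card is the authority on the actual names.

```python
# A minimal sketch of streaming one Sangraha subset with Hugging Face `datasets`.
# The repo id and data_dir layout are assumptions (verified/unverified/synthetic
# subsets with per-language folders); check the dataset card for the exact names.
from datasets import load_dataset

sangraha_hi = load_dataset(
    "ai4bharat/sangraha",     # assumed Hugging Face repository id
    data_dir="verified/hin",  # assumed layout: <subset>/<language code>
    split="train",
    streaming=True,           # at 251B total tokens, stream rather than download
)

for doc in sangraha_hi.take(3):  # peek at the first three documents
    print(doc)
```

Streaming keeps the example practical: the full corpus is far too large to download just to inspect a handful of records.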

Setu: A Robust Curation Pipeline

The curation of Sangraha is facilitated by Setu, a Spark-based distributed pipeline customized for Indian languages. This pipeline addresses several critical steps in data processing, including extraction, cleaning, flagging, and deduplication. Setu's comprehensive architecture ensures the sanitization and refinement of data, making Sangraha a reliable source for training robust language models.
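Because Setu is Spark-based, its stages map naturally onto DataFrame transformations. The sketch below is a minimal, hypothetical clean-flag-deduplicate flow in that spirit; the input schema, length threshold, and flagging lexicon are illustrative placeholders, not Setu's actual configuration.

```python
# A hypothetical PySpark sketch of a Setu-style clean -> flag -> dedup flow.
# Column names, thresholds, and the word list are placeholders, not Setu's real config.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("setu-style-curation").getOrCreate()

docs = spark.read.json("raw_docs.jsonl")  # assumes one JSON doc per line with a `text` field

FLAG_WORDS = ["badword1", "badword2"]     # placeholder flagging lexicon

curated = (
    docs
    # Cleaning: collapse whitespace and drop very short documents.
    .withColumn("text", F.regexp_replace("text", r"\s+", " "))
    .filter(F.length("text") > 200)
    # Flagging: mark (rather than silently drop) documents containing flagged terms.
    .withColumn("flagged", F.lower(F.col("text")).rlike("|".join(FLAG_WORDS)))
    # Deduplication: exact dedup via an MD5 hash of the cleaned text.
    .withColumn("doc_hash", F.md5("text"))
    .dropDuplicates(["doc_hash"])
)

curated.write.mode("overwrite").parquet("curated_docs.parquet")
```

Flagging rather than dropping preserves the option to apply different filtering policies downstream, which matters when the same corpus feeds models with different safety requirements.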

IndicAlign: Enriching Instruction Fine-Tuning Data

IndicAlign, part of IndicLLMSuite, offers a wide array of prompt-response pairs across 20 languages. It merges existing datasets, translates English datasets, and employs both human and synthetic generation methods to create context-grounded conversations. This diversity enriches the suite with culturally and contextually relevant datasets, aiding in comprehensive model training.
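To make the translate-English-datasets step concrete, the sketch below converts one English instruction-response pair into Hindi. The model choice (Meta's NLLB) is a stand-in picked because it runs directly through the `transformers` translation pipeline, not necessarily the system the suite itself uses, and the example pair is invented.

```python
# A hedged sketch of translating an English instruction-response pair into Hindi.
# NLLB is a stand-in translation model; the example pair is invented.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="hin_Deva",  # Hindi in Devanagari; swap the code for other languages
)

pair_en = {
    "instruction": "Explain photosynthesis in one sentence.",
    "response": (
        "Photosynthesis is the process by which plants convert sunlight, "
        "water, and carbon dioxide into glucose and oxygen."
    ),
}

# Translate each field independently, keeping the pair structure intact.
pair_hi = {
    key: translator(text, max_length=256)[0]["translation_text"]
    for key, text in pair_en.items()
}
print(pair_hi)
```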

Theoretical and Practical Implications

The theoretical implications of this research are profound, demonstrating the viability of synthetic data generation in supporting low-resource languages. Practically, the release of IndicLLMSuite paves the way for advanced research and development of LLMs in Indian languages. It serves as a blueprint for extending similar efforts to other languages, advocating for a global approach toward equitable AI development.

Future Directions

This research invites collaboration for training high-quality Indian language LLMs through community-driven initiatives. By pooling resources, the AI community can achieve significant milestones in developing models that are not only linguistically inclusive but also culturally nuanced.

IndicLLMSuite represents a pivotal step towards closing the linguistic divide in AI, supporting the growth of LLMs across Indian languages. This stride encourages diversity and inclusivity in AI development, fostering advances that resonate with a broader spectrum of the global population.
