Emergent Mind

Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi

(arXiv:2407.09855)
Published Jul 13, 2024 in cs.CL and cs.AI

Abstract

LLMs have demonstrated transformative capabilities in many applications that require automatically generating responses to human instructions. However, a major challenge in building LLMs, particularly for Indic languages, is the availability of high-quality data for training foundation models. In this paper, we propose a large pre-training dataset for Hindi. The collected data spans several domains and includes major Hindi dialects. The dataset contains 1.28 billion Hindi tokens. We describe our pipeline, covering data collection, pre-processing, and availability for LLM pre-training. The proposed approach can easily be extended to other Indic and low-resource languages and will be made freely available for LLM pre-training and research purposes.

Gathering and processing data to produce a Hindi pre-training dataset.

Overview

  • The paper presents a comprehensive methodology for building a large pre-training dataset for Hindi Large Language Models (LLMs), focusing on data collection, preprocessing, and accessibility.

  • It highlights the importance of high-quality datasets for Hindi and other Indic languages in NLP and examines various existing models and resources.

  • The dataset, comprising 1.28 billion tokens from diverse sources such as Wikipedia, legal texts, and social media, aims to enhance NLP tasks like text generation, language modeling, and domain-specific applications.

Building Pre-train LLM Dataset for the Indic Languages: A Case Study on Hindi

The paper "Building Pre-train LLM Dataset for the Indic Languages: A Case Study on Hindi" presents a robust methodology for addressing the challenges of constructing Large Language Models (LLMs) for Hindi, with particular emphasis on data collection, preprocessing, and availability. Authored by Shantipriya Parida, Shakshi Panwar, Kusum Lata, Sanskruti Mishra, and Sambit Sekhar, the study provides a comprehensive dataset to facilitate advancements in NLP for Indic languages, particularly Hindi.

Introduction

The introduction highlights the significance of pre-trained LLMs in NLP, especially their applications in various tasks such as speech recognition, sentiment analysis, and machine translation. The authors underscore the imperative need for high-quality datasets for non-English languages like Hindi to create reliable NLP systems. The scarcity of large-scale, high-quality datasets for Hindi presents a formidable challenge in developing effective NLP applications.

Literature Survey

The literature survey provides a meticulous overview of relevant works that contribute significantly to the domain of LLMs and their adaptation to new languages. Key references include studies on large language models' architectures, training methodologies, and use cases across various languages. The authors also discuss the MuRIL and L3Cube-HindBERT/DevBERT models, which emphasize the importance of tailored approaches to Devanagari-based languages like Hindi. Additionally, the paper mentions the INDICLLMSUITE initiative, which represents a significant effort in providing resources for Indic language models.

Focused Language

Hindi, a member of the Indo-Aryan branch of the Indo-European language family, serves as the focal language of this study. The linguistic features of Hindi, including its subject-object-verb (SOV) word order, extensive inflectional system, and rich lexicon influenced by several languages, underscore the complexity involved in creating LLMs tailored for Hindi. The dataset prepared for this study encapsulates this linguistic diversity by encompassing various texts from different domains, genres, and dialects.
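Since the dataset targets Devanagari-script text, a simple script-based heuristic can illustrate how Hindi content might be identified during collection. This is a minimal sketch, not the paper's method; the Unicode block boundaries are standard, but the 0.5 threshold is an illustrative assumption.

```python
# Sketch: heuristic check for whether a line of text is predominantly
# written in the Devanagari script (used by Hindi).
# The threshold value is an assumption, not taken from the paper.

def devanagari_ratio(text: str) -> float:
    """Fraction of non-space characters in the Devanagari block (U+0900-U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097F")
    return hits / len(chars)

def is_probably_hindi(text: str, threshold: float = 0.5) -> bool:
    return devanagari_ratio(text) >= threshold

print(is_probably_hindi("नमस्ते दुनिया"))  # True
print(is_probably_hindi("hello world"))    # False
```

Note that script detection alone cannot separate Hindi from other Devanagari-written languages such as Marathi or Nepali, which is one reason curated sources matter.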

Dataset Preparation

The dataset preparation phase involved collecting data from multiple domains, such as:

  • Wikipedia: A comprehensive general knowledge repository providing 43.67 million tokens across 1.85 million sentences.
  • Dialect Hindi Dataset: Focused on capturing regional dialect variations in Hindi.
  • AI4Bharat IndicParaphrase: Containing paraphrased sentences to aid in linguistic understanding.
  • MIRACL Corpus: Targeting legal discourse with 33.66 million tokens.
  • OSCAR: Providing a diverse range of textual content from literary works to social media posts.
  • BigScience/xP3all: Focused on scientific discourse.

These datasets collectively amount to a total of 1.28 billion tokens, providing an expansive linguistic resource for training LLMs.
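The reported per-source and total token counts imply some tokenization scheme, which this summary does not specify. A minimal sketch of how such counts could be tallied, assuming simple whitespace tokenization (an illustrative assumption, not the paper's tokenizer):

```python
# Sketch: tallying token counts across corpus files.
# Whitespace splitting is an assumption; the paper's tokenizer is not specified here.

from collections import Counter
from pathlib import Path

def count_tokens(lines) -> int:
    """Count whitespace-delimited tokens across an iterable of lines."""
    return sum(len(line.split()) for line in lines)

def corpus_token_counts(paths) -> Counter:
    """Map each file name to its token count."""
    counts = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            counts[Path(path).name] = count_tokens(f)
    return counts

sample = ["यह एक वाक्य है।", "दूसरा वाक्य यहाँ है।"]
print(count_tokens(sample))  # 8
```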

Data Processing

Data processing involved extensive steps to ensure the dataset's quality and consistency. Key processes included filtering out extraneous metadata, normalizing text elements, and addressing language-related errors. This resulted in a refined dataset comprising uniform and consistent text, suitable for effective model training.
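The cleaning steps described above can be sketched as a small pipeline. The specific rules here (NFC Unicode normalization, stripping markup remnants, a minimum length of 3 tokens) are illustrative assumptions; the paper's exact filters are not spelled out in this summary.

```python
# Sketch of a cleaning pipeline: normalize Unicode, strip leftover markup,
# collapse whitespace, and drop lines too short to be useful.
# All thresholds and rules are illustrative assumptions.

import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # leftover HTML/XML tags
WS_RE = re.compile(r"\s+")       # runs of whitespace

def clean_line(line: str) -> str:
    line = unicodedata.normalize("NFC", line)  # canonical composition
    line = TAG_RE.sub(" ", line)               # strip markup remnants
    line = WS_RE.sub(" ", line).strip()        # collapse whitespace
    return line

def clean_corpus(lines, min_tokens: int = 3):
    """Yield cleaned lines that meet a minimum token count."""
    for line in lines:
        cleaned = clean_line(line)
        if len(cleaned.split()) >= min_tokens:
            yield cleaned

raw = ["<p>यह   एक  वाक्य है।</p>", "ठीक"]
print(list(clean_corpus(raw)))  # ['यह एक वाक्य है।']
```

NFC normalization matters for Devanagari in particular, since visually identical text can be encoded with different combining-character sequences, which would otherwise inflate vocabulary size during training.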

Analysis and Discussion

The analysis section explores the diversity and richness of the collected datasets, which span various domains such as general knowledge, regional dialects, paraphrases, legal discourse, and scientific literature. The paper emphasizes the importance of this diversity in creating robust and adaptable language models. The integration of these diverse datasets ensures that the trained LLMs can handle a variety of linguistic contexts and applications.

Use Cases

The paper outlines several use cases for the developed dataset:

  • Pre-training: Essential for training LLMs tailored to Indian languages, significantly improving their performance on various NLP tasks.
  • Language Modeling: Enhancing text generation, sentence completion, and next-word prediction for Hindi.
  • Generating Synthetic Data: Augmenting existing datasets and solving data scarcity issues by generating synthetic examples.
  • Domain-Specific Improvement: Fine-tuning models using domain-specific datasets for specialized applications like legal document analysis.
  • Multilingual NLP Research: Extending the dataset's use to other Indian languages, promoting inclusivity in NLP research.
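To make the language-modeling use case above concrete, here is a toy count-based bigram model illustrating next-word prediction on a tiny Hindi sample. Real pre-training uses neural LLMs over billions of tokens; this sketch only shows the interface, and the sample sentences are invented for illustration.

```python
# Sketch: a toy bigram next-word predictor, illustrating (not implementing)
# the language-modeling use case. Neural LLMs replace these counts in practice.

from collections import Counter, defaultdict

def train_bigrams(sentences):
    """Count which word follows which across whitespace-tokenized sentences."""
    model = defaultdict(Counter)
    for sent in sentences:
        tokens = sent.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            model[prev][nxt] += 1
    return model

def predict_next(model, word):
    """Most frequent follower of `word`, or None if unseen."""
    followers = model.get(word)
    return followers.most_common(1)[0][0] if followers else None

corpus = ["यह एक किताब है", "यह एक कलम है"]
model = train_bigrams(corpus)
print(predict_next(model, "यह"))  # एक
```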

Availability

The dataset is made freely available on Hugging Face, allowing researchers and practitioners to access and utilize it for further research and model training.

Conclusion and Future Work

The paper concludes by emphasizing the significance of creating a large-scale pre-trained dataset for Hindi, which lays a strong foundation for advancing NLP research and applications for Indic languages. Future work includes refining evaluation criteria, addressing biases, and extending the dataset's applicability to other low-resource languages. Moreover, enhancing resource availability and accessibility is crucial for fostering collaboration and innovation.

Limitations

The authors acknowledge several limitations, including potential biases in the source data, incomplete linguistic diversity representation, and resource constraints in dataset development. Ethical considerations regarding data privacy and bias mitigation are also discussed, emphasizing the need for transparency and careful handling of textual data.

In summary, the paper presents a comprehensive approach to building a pre-trained LLM dataset specifically for the Hindi language, addressing key challenges in data collection, preprocessing, and accessibility while providing valuable insights and resources for the NLP research community.
