
Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi (2407.09855v1)

Published 13 Jul 2024 in cs.CL and cs.AI

Abstract: LLMs have demonstrated transformative capabilities in many applications that require automatically generating responses based on human instruction. However, a major challenge in building LLMs, particularly for Indic languages, is the availability of high-quality data for training foundation models. In this paper, we propose a large pre-training dataset for Hindi. The collected data spans several domains and includes major dialects of Hindi. The dataset contains 1.28 billion Hindi tokens. We describe our pipeline, covering data collection, pre-processing, and availability for LLM pre-training. The proposed approach can be easily extended to other Indic and low-resource languages, and the dataset will be freely available for LLM pre-training and research purposes.

Authors (5)
  1. Shantipriya Parida
  2. Shakshi Panwar
  3. Kusum Lata
  4. Sanskruti Mishra
  5. Sambit Sekhar

Summary

Building Pre-train LLM Dataset for the Indic Languages: A Case Study on Hindi

The paper "Building Pre-train LLM Dataset for the Indic Languages: A Case Study on Hindi" presents a robust methodology aimed at addressing the challenges of constructing large pre-trained LLMs for Hindi, with a particular emphasis on data collection, preprocessing, and availability. This paper is authored by Shantipriya Parida, Shakshi Panwar, Kusum Lata, Sanskruti Mishra, and Sambit Sekhar and it focuses on providing a comprehensive dataset to facilitate advancements in NLP for Indic languages, particularly Hindi.

Introduction

The introduction highlights the significance of pre-trained LLMs in NLP, especially their applications in tasks such as speech recognition, sentiment analysis, and machine translation. The authors underscore the pressing need for high-quality datasets in non-English languages like Hindi to build reliable NLP systems; the scarcity of large-scale, high-quality Hindi data remains a formidable obstacle to developing effective NLP applications.

Literature Survey

The literature survey gives a meticulous overview of work on LLMs and their adaptation to new languages, covering large model architectures, training methodologies, and use cases across various languages. The authors also discuss the MuRIL and L3Cube-HindBERT/DevBERT models, which emphasize the importance of approaches tailored to Devanagari-script languages like Hindi. Additionally, the paper mentions the IndicLLMSuite initiative, a significant effort to provide resources for Indic LLMs.

Focused Language

Hindi, a member of the Indo-Aryan branch of the Indo-European language family, is the focal language of this paper. Its linguistic features, including subject-object-verb (SOV) word order (e.g., "राम सेब खाता है", literally "Ram apple eats"), an extensive inflectional system, and a lexicon enriched by several languages, underscore the complexity of building LLMs for Hindi. The dataset prepared for this work captures this diversity by drawing texts from different domains, genres, and dialects.

Dataset Preparation

The dataset preparation phase involved collecting data from multiple domains, such as:

  • Wikipedia: A comprehensive general-knowledge repository providing 43.67 million tokens across 1.85 million sentences.
  • Dialect Hindi Dataset: Focused on capturing regional dialect variations in Hindi.
  • AI4Bharat IndicParaphrase: Containing paraphrased sentences to aid in linguistic understanding.
  • MIRACL Corpus: Targeting legal discourse with 33.66 million tokens.
  • OSCAR: Providing a diverse range of textual content, from literary works to social media posts.
  • BigScience/xP3all: Focused on scientific discourse.

These datasets collectively amount to a total of 1.28 billion tokens, providing an expansive linguistic resource for training LLMs.
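
The summary does not reproduce the authors' loading code; as a minimal sketch, assuming the publicly available Hugging Face versions of these sources (the dataset identifiers and column names below are illustrative, not taken from the paper), a combined corpus could be assembled roughly as follows:

```python
from datasets import load_dataset, concatenate_datasets

# Illustrative source identifiers; the paper's exact snapshots may differ.
SOURCES = [
    ("wikimedia/wikipedia", "20231101.hi"),   # Hindi Wikipedia dump
    ("ai4bharat/IndicParaphrase", "hi"),      # paraphrase pairs
    ("miracl/miracl-corpus", "hi"),           # passage corpus
    ("oscar-corpus/OSCAR-2301", "hi"),        # web-crawled text
]

def extract_text(example):
    # Different sources store raw text under different column names;
    # normalize everything to a single "text" column.
    for key in ("text", "input", "passage"):
        if key in example and example[key]:
            return {"text": example[key]}
    return {"text": ""}

parts = []
for name, config in SOURCES:
    ds = load_dataset(name, config, split="train")
    ds = ds.map(extract_text, remove_columns=ds.column_names)
    parts.append(ds)

corpus = concatenate_datasets(parts)
print(f"{len(corpus):,} documents collected")
```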

Data Processing

Data processing involved extensive steps to ensure the dataset's quality and consistency. Key processes included filtering out extraneous metadata, normalizing text elements, and addressing language-related errors. This resulted in a refined dataset comprising uniform and consistent text, suitable for effective model training.
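
The paper's exact filtering rules are not given in this summary. As a minimal sketch of the kind of cleanup described (Unicode normalization, whitespace and markup removal, and dropping lines with too little Devanagari content), with the 0.5 ratio threshold as an assumed parameter:

```python
import re
import unicodedata

DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def clean_line(line):
    """Normalize a line of raw Hindi text; return None if it should be dropped."""
    # Canonical Unicode composition so visually identical strings compare equal.
    line = unicodedata.normalize("NFC", line)
    # Collapse runs of whitespace left over from HTML/metadata stripping.
    line = re.sub(r"\s+", " ", line).strip()
    if not line:
        return None
    # Drop lines that are mostly non-Devanagari (boilerplate, markup, URLs).
    devanagari_ratio = len(DEVANAGARI.findall(line)) / len(line)
    if devanagari_ratio < 0.5:  # assumed threshold, not from the paper
        return None
    return line

raw = "  यह   एक <b>उदाहरण</b> वाक्य है।  "
print(clean_line(re.sub(r"<[^>]+>", "", raw)))  # -> "यह एक उदाहरण वाक्य है।"
```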

Analysis and Discussion

The analysis section explores the diversity and richness of the collected datasets, which span various domains such as general knowledge, regional dialects, paraphrases, legal discourse, and scientific literature. The paper emphasizes the importance of this diversity in creating robust and adaptable LLMs. The integration of these diverse datasets ensures that the trained LLMs can handle a variety of linguistic contexts and applications.

Use Cases

The paper outlines several use cases for the developed dataset:

  • Pre-training: Essential for training LLMs tailored to Indian languages, significantly improving their performance on various NLP tasks (a tokenizer-training sketch follows this list).
  • Language modeling: Enhancing text generation, sentence completion, and next-word prediction for Hindi.
  • Generating Synthetic Data: Augmenting existing datasets and solving data scarcity issues by generating synthetic examples.
  • Domain-Specific Improvement: Fine-tuning models using domain-specific datasets for specialized applications like legal document analysis.
  • Multilingual NLP Research: Extending the dataset's use to other Indian languages, promoting inclusivity in NLP research.
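
As one concrete illustration of the pre-training use case (not code from the paper), a Devanagari-aware subword tokenizer could be trained on the released corpus with SentencePiece before any pre-training run; the file names and hyperparameters below are assumptions:

```python
import sentencepiece as spm

# Train a unigram subword tokenizer on the cleaned Hindi corpus.
# "hindi_corpus.txt" (one document per line) and the hyperparameters
# are placeholders, not values from the paper.
spm.SentencePieceTrainer.train(
    input="hindi_corpus.txt",
    model_prefix="hindi_sp",
    vocab_size=32000,
    model_type="unigram",
    character_coverage=0.9995,  # keep rare Devanagari conjuncts
)

sp = spm.SentencePieceProcessor(model_file="hindi_sp.model")
print(sp.encode("यह एक उदाहरण वाक्य है।", out_type=str))
```

A unigram model with high character coverage is a common choice for Devanagari text, since an aggressive coverage cutoff would otherwise drop rare conjunct characters.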

Availability

The dataset is made freely available on Hugging Face, allowing researchers and practitioners to access and utilize it for further research and model training.
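
The summary does not state the exact Hugging Face repository name, so the identifier below is a placeholder; loading would follow the standard datasets API:

```python
from datasets import load_dataset

# "org-name/hindi-pretrain-corpus" is a placeholder identifier; substitute
# the actual repository published with the paper. A "text" column is assumed.
corpus = load_dataset("org-name/hindi-pretrain-corpus",
                      split="train", streaming=True)
for doc in corpus.take(3):
    print(doc["text"][:80])
```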

Conclusion and Future Work

The paper concludes by emphasizing the significance of creating a large-scale pre-trained dataset for Hindi, which lays a strong foundation for advancing NLP research and applications for Indic languages. Future work includes refining evaluation criteria, addressing biases, and extending the dataset's applicability to other low-resource languages. Moreover, enhancing resource availability and accessibility is crucial for fostering collaboration and innovation.

Limitations

The authors acknowledge several limitations, including potential biases in the source data, incomplete linguistic diversity representation, and resource constraints in dataset development. Ethical considerations regarding data privacy and bias mitigation are also discussed, emphasizing the need for transparency and careful handling of textual data.

In summary, the paper presents a comprehensive approach to building a pre-trained LLM dataset specifically for the Hindi language, addressing key challenges in data collection, preprocessing, and accessibility while providing valuable insights and resources for the NLP research community.
