AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages

Published 30 Apr 2020 in cs.CL | (2005.00085v1)

Abstract: We present the IndicNLP corpus, a large-scale, general-domain corpus containing 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We show that the IndicNLP embeddings significantly outperform publicly available pre-trained embeddings on multiple evaluation tasks. We hope that the availability of the corpus will accelerate Indic NLP research. The resources are available at https://github.com/ai4bharat-indicnlp/indicnlp_corpus.


Authors (7)



Summary

The paper introduces the IndicNLP corpus, a general-domain collection of 2.7 billion words across ten Indic languages, substantially expanding the resources available for NLP research.
The paper uses FastText to train word embeddings whose subword information helps model rich morphology; these embeddings outperform publicly available pre-trained embeddings on word similarity and text classification tasks.
The paper presents new news-classification benchmarks and develops unsupervised morphology analyzers to address data sparsity in morphologically rich Indic languages.


The paper "AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages" presents an extensive collection of language resources aimed at advancing NLP research for Indian languages. Although Indic languages are spoken by a large population worldwide, NLP resources for them lag behind because large-scale corpora and pre-trained models are scarce. This work aims to bridge that gap by introducing the IndicNLP corpus, which comprises 2.7 billion words across ten Indic languages from two language families, Indo-Aryan and Dravidian.

Contribution to Resources for Indic Languages

The paper outlines several key contributions:

Monolingual Corpora: The IndicNLP corpus covers ten languages and provides at least 100 million words for each language except Oriya. The data sources include news websites and Wikipedia, reflecting contemporary usage and covering a range of topics.
Pre-trained Word Embeddings: Using FastText, the authors train word embeddings for each language on these corpora. FastText's use of subword information is especially suitable for morphologically rich Indic languages, because inflected forms of a word share subword units and therefore share parts of their representation.
Text Classification Datasets: New benchmarks for news article categorization across nine languages are presented, supporting the development and evaluation of classification models.
Unsupervised Morphology Analyzers: The authors develop unsupervised morphological analyzers using these corpora, aimed particularly at morphologically rich languages, to mitigate data sparsity.
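To illustrate why subword information helps with inflected word forms, here is a minimal sketch of FastText-style character n-gram extraction. It is not the authors' code or the FastText library itself: the function names are hypothetical, the n-gram range of 3 to 6 follows FastText's commonly cited defaults rather than anything stated in this summary, and the romanized example words are made up for illustration.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return character n-grams of `word`, with FastText-style boundary markers.

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    produce n-grams distinct from word-internal ones.
    min_n and max_n are assumed defaults, not values from the paper.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


def shared_subwords(word_a, word_b, min_n=3, max_n=6):
    """Return the set of n-grams that two words have in common."""
    return set(char_ngrams(word_a, min_n, max_n)) & set(
        char_ngrams(word_b, min_n, max_n)
    )


if __name__ == "__main__":
    # Two hypothetical romanized inflections of the same stem.
    print(sorted(shared_subwords("ladka", "ladke")))
```

In a subword model, a word's vector is built from the vectors of its n-grams, so two inflections that share most of their n-grams end up with related representations even if one of them is rare in the corpus. Note that this sketch slices Python strings by code point, so in native Indic scripts a vowel sign or other combining mark counts as its own character.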

Evaluation and Performance

The authors evaluate the IndicNLP embeddings against existing publicly available embeddings on word similarity, word analogy, and text classification tasks across a variety of datasets, and the IndicNLP embeddings outperform those baselines. Specifically, the IndicNLP embeddings achieve an average accuracy of 97.40% on the IndicNLP news category dataset and produce better results on bilingual lexicon induction, indicating improved cross-lingual performance. Together, these results show that embeddings trained on the corpus improve NLP task performance across Indic languages.
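As a rough illustration of how a word similarity evaluation of this kind typically works, the sketch below scores word pairs by cosine similarity and correlates those scores with human judgments. It makes several assumptions: the summary does not say which correlation statistic the authors report, so Spearman's rank correlation is used here only as a common choice; the toy vectors, word names, and human scores are invented; and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ranks(values):
    """1-based ranks of `values`; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def evaluate_similarity(embeddings, scored_pairs):
    """Correlate model cosine similarities with human similarity scores.

    Pairs containing an out-of-vocabulary word are skipped.
    """
    model_scores, human_scores = [], []
    for word_a, word_b, human in scored_pairs:
        if word_a in embeddings and word_b in embeddings:
            model_scores.append(cosine(embeddings[word_a], embeddings[word_b]))
            human_scores.append(human)
    return spearman(model_scores, human_scores)


if __name__ == "__main__":
    # Toy 2-dimensional vectors and made-up human scores, for illustration only.
    toy_embeddings = {"a": [1.0, 0.0], "b": [0.9, 0.1],
                      "c": [0.0, 1.0], "d": [0.5, 0.5]}
    toy_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0)]
    print(evaluate_similarity(toy_embeddings, toy_pairs))
```

A higher correlation means the embedding space orders word pairs more like human annotators do. Comparing two sets of embeddings on the same pair list gives the kind of head-to-head result the evaluation describes.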

Implications and Future Directions

The availability of the IndicNLP corpus has significant implications for NLP research and applications in India. It facilitates the development of language technologies for consuming digital content in native languages, serving a linguistically diverse population. The corpus also supports the creation of multilingual embeddings and cross-lingual transfer learning, which is particularly beneficial given the structural similarities among Indic languages that result from prolonged linguistic contact.

Looking forward, the authors aim to expand their corpora collection to achieve at least one billion words for the major Indian languages. Future work includes developing richer pre-trained models such as BERT and ELMo for Indic languages and constructing additional evaluation benchmarks for broader NLP tasks.

Conclusion

The IndicNLP corpus is a substantial contribution to NLP for Indian languages. By making the corpus, embeddings, and classification datasets public, the work gives the NLP community a foundation to build on for further research and applications. Continued expansion of these resources holds promise for supporting the linguistic diversity of the Indian subcontinent with suitable language technology.
