L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library

Published 29 May 2022 in cs.CL and cs.LG | (2205.14728v2)

Abstract: Despite being the third most popular language in India, the Marathi language lacks useful NLP resources. Moreover, popular NLP libraries do not have support for the Marathi language. With L3Cube-MahaNLP, we aim to build resources and a library for Marathi natural language processing. We present datasets and transformer models for supervised tasks like sentiment analysis, named entity recognition, and hate speech detection. We have also published a monolingual Marathi corpus for unsupervised language modeling tasks. Overall we present MahaCorpus, MahaSent, MahaNER, and MahaHate datasets and their corresponding MahaBERT models fine-tuned on these datasets. We aim to move ahead of benchmark datasets and prepare useful resources for Marathi. The resources are available at https://github.com/l3cube-pune/MarathiNLP.


Authors (1)

Raviraj Joshi

Citations (23)


Summary

The paper presents extensive datasets such as MahaCorpus, MahaSent, MahaNER, and MahaHate to address existing Marathi NLP resource gaps.
It develops fine-tuned transformer models including MahaBERT variants and MahaGPT that enhance performance on language-specific tasks.
The initiative advances Marathi NLP research and practical applications while providing a scalable framework for low-resource languages.

An Overview of L3Cube-MahaNLP: Datasets and Models for Marathi NLP

The paper "L3Cube-MahaNLP: Marathi Natural Language Processing Datasets, Models, and Library" introduces a suite of resources for Marathi NLP. Although the abstract describes Marathi as the third most popular language in India, the language has few NLP resources and popular NLP libraries do not support it. This work aims to close that gap by releasing datasets, models, and tooling that cover several common NLP tasks for a low-resource language.

Core Contributions

Datasets: The paper offers several curated datasets:
- MahaCorpus: A large monolingual corpus of 24.8 million sentences, amounting to 289 million tokens. It supports unsupervised language modeling and is drawn from both news and non-news Marathi sources.
- MahaSent: A sentiment analysis dataset with sentiment-labeled Marathi tweets, containing 12,114 training, 2,250 test, and 1,500 validation samples.
- MahaNER: A named entity recognition dataset comprising 25,000 manually tagged sentences that span eight entity classes.
- MahaHate: A hate speech detection dataset with over 25,000 tweets annotated into categories such as hate, offensive, profane, and neutral.
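
To put the dataset sizes above in perspective, the following minimal sketch derives two figures from the quoted counts: the average sentence length implied for MahaCorpus and the share of MahaSent held by each split. It uses only the numbers stated in this summary; the function and constant names are illustrative and are not part of the L3Cube-MahaNLP library.

```python
# Figures quoted in the summary above; nothing here is downloaded or measured.
MAHACORPUS_SENTENCES = 24_800_000
MAHACORPUS_TOKENS = 289_000_000
MAHASENT_SPLITS = {"train": 12_114, "test": 2_250, "validation": 1_500}


def average_tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Mean sentence length implied by corpus-level counts."""
    return tokens / sentences


def split_fractions(splits: dict[str, int]) -> dict[str, float]:
    """Share of the total sample count that each split represents."""
    total = sum(splits.values())
    return {name: count / total for name, count in splits.items()}


if __name__ == "__main__":
    avg = average_tokens_per_sentence(MAHACORPUS_TOKENS, MAHACORPUS_SENTENCES)
    print(f"MahaCorpus: about {avg:.1f} tokens per sentence")

    total = sum(MAHASENT_SPLITS.values())
    print(f"MahaSent: {total} labeled tweets in total")
    for name, fraction in split_fractions(MAHASENT_SPLITS).items():
        print(f"  {name}: {fraction:.1%}")
```

By these counts, MahaCorpus averages a little under 12 tokens per sentence, and roughly three quarters of MahaSent is reserved for training.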
Models: L3Cube-MahaNLP also presents several models, most of them Transformer-based:
- MahaBERT variants (MahaBERT, MahaAlBERT, MahaRoBERTa): Monolingual models trained using the MahaCorpus with the Masked Language Modeling (MLM) objective, tailored specifically for effective Marathi language understanding.
- MahaGPT: A generative Transformer model trained on the full Marathi corpus with a causal language modeling objective, used for generating and predicting Marathi text.
- MahaFT: FastText word embeddings tailored for Marathi, enabling efficient utilization in multiple NLP applications.
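
The two training objectives named above differ in how inputs and prediction targets are built from the same text. The sketch below illustrates that difference on a toy token list. It is not the authors' training code: the 15% masking rate is the value commonly used for BERT-style models and is an assumption here, whitespace-split words stand in for the subword tokens a real model would use, and all function names are hypothetical.

```python
import random

MASK = "[MASK]"
IGNORE = None  # marks positions that do not contribute to the loss


def mlm_example(tokens, mask_prob=0.15, seed=0):
    """Masked language modeling: hide some tokens and predict only those.

    Returns (inputs, labels). A label is the original token at masked
    positions and IGNORE everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(IGNORE)
    return inputs, labels


def causal_lm_example(tokens):
    """Causal language modeling: every position predicts the next token.

    Returns (inputs, targets), where targets is inputs shifted left by one.
    """
    return tokens[:-1], tokens[1:]


if __name__ == "__main__":
    # Toy Marathi sentence ("I speak Marathi"), split on whitespace.
    sentence = "मी मराठी बोलतो".split()

    # A high mask_prob is used only so the tiny example shows some masking.
    mlm_inputs, mlm_labels = mlm_example(sentence, mask_prob=0.5, seed=1)
    print("MLM inputs:   ", mlm_inputs)
    print("MLM labels:   ", mlm_labels)

    clm_inputs, clm_targets = causal_lm_example(sentence)
    print("Causal inputs: ", clm_inputs)
    print("Causal targets:", clm_targets)
```

In the masked setup the model sees context on both sides of each hidden token, which suits understanding tasks such as classification and tagging. In the causal setup it sees only the preceding tokens, which is what makes left-to-right text generation possible.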

Implications and Future Prospects

These datasets and models substantially expand what is available for Marathi NLP. The monolingual models outperform existing multilingual models on various Marathi NLP tasks, which demonstrates the value of language-specific resources. Their availability supports both research and practical applications such as sentiment analysis, named entity recognition, and hate speech detection in Marathi.

Going forward, the authors aim to expand into areas such as natural language generation and to simplify access to these models through Python packages. This could further strengthen NLP for low-resource languages and improve their representation in the global research community.

Overall, the L3Cube-MahaNLP initiative enriches Marathi NLP and offers a template for resource development that other low-resource languages could follow.
