
BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla (2101.00204v4)

Published 1 Jan 2021 in cs.CL

Abstract: In this work, we introduce BanglaBERT, a BERT-based Natural Language Understanding (NLU) model pretrained in Bangla, a widely spoken yet low-resource language in the NLP literature. To pretrain BanglaBERT, we collect 27.5 GB of Bangla pretraining data (dubbed 'Bangla2B+') by crawling 110 popular Bangla sites. We introduce two downstream task datasets on natural language inference and question answering and benchmark on four diverse NLU tasks covering text classification, sequence labeling, and span prediction. In the process, we bring them under the first-ever Bangla Language Understanding Benchmark (BLUB). BanglaBERT achieves state-of-the-art results outperforming multilingual and monolingual models. We are making the models, datasets, and a leaderboard publicly available at https://github.com/csebuetnlp/banglabert to advance Bangla NLP.

Citations (154)

Summary

  • The paper introduces BanglaBERT, a BERT-based model pretrained on a 27.5 GB Bangla corpus using the ELECTRA framework and the Replaced Token Detection (RTD) objective.
  • It establishes new benchmark datasets for Bangla NLI and QA, consolidating them into the first Bangla Language Understanding Benchmark (BLUB).
  • The model outperforms multilingual and monolingual baselines with a 77.09 BLUB score and demonstrates strong zero-shot cross-lingual capabilities.

An Evaluation of BanglaBERT: Advancements in Bangla NLP

This paper introduces BanglaBERT, a BERT-based model pretrained specifically for natural language understanding (NLU) in Bangla. Although Bangla is the sixth most spoken language worldwide, it remains under-resourced in the NLP literature. The authors address this gap by assembling a 27.5 GB pretraining corpus named 'Bangla2B+', crawled from 110 popular Bangla websites, with the goal of improving Bangla language processing through pretraining tailored specifically to this language.
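The cleaning pipeline behind Bangla2B+ is not detailed in this summary; as a loose sketch of the kind of normalization and deduplication such a web-crawled corpus typically requires, the snippet below keeps lines that are mostly Bengali script and drops exact duplicates. The file layout, threshold, and hashing scheme are illustrative assumptions, not the authors' actual pipeline.

```python
# Illustrative sketch only: the actual Bangla2B+ pipeline is not described in this summary.
# File paths, the Bengali-script ratio threshold, and the dedup strategy are assumptions.
import hashlib
import re
from pathlib import Path

BANGLA_CHARS = re.compile(r"[\u0980-\u09FF]")  # Bengali Unicode block

def mostly_bangla(line: str, threshold: float = 0.5) -> bool:
    """Keep lines whose non-space characters are mostly from the Bengali block."""
    letters = [c for c in line if not c.isspace()]
    if not letters:
        return False
    bangla = sum(1 for c in letters if BANGLA_CHARS.match(c))
    return bangla / len(letters) >= threshold

def clean_corpus(raw_dir: str, out_file: str) -> None:
    seen = set()  # hashes of lines already emitted (exact deduplication)
    with open(out_file, "w", encoding="utf-8") as out:
        for path in Path(raw_dir).glob("*.txt"):
            for line in path.read_text(encoding="utf-8").splitlines():
                line = " ".join(line.split())  # collapse whitespace
                if not mostly_bangla(line):
                    continue
                h = hashlib.md5(line.encode("utf-8")).hexdigest()
                if h in seen:
                    continue
                seen.add(h)
                out.write(line + "\n")
```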

Contributions and Methodology

  1. Model Development: The authors present two models, BanglaBERT and its bilingual counterpart BanglishBERT, the latter also pretrained on English data to enable zero-shot cross-lingual transfer. BanglaBERT is pretrained with the ELECTRA framework, using the Replaced Token Detection (RTD) objective for compute-efficient pretraining (a minimal RTD sketch follows this list).
  2. Dataset and Benchmark Creation: They introduce new datasets for Bangla Natural Language Inference (NLI) and Question Answering (QA), and consolidate these with existing datasets into the Bangla Language Understanding Benchmark (BLUB). This marks the first Bangla-specific benchmark to assess model performance across text classification, sequence labeling, and span prediction tasks.
  3. Results: BanglaBERT delivers state-of-the-art results, outperforming both multilingual models such as mBERT and XLM-R and monolingual baselines in the supervised setting, reaching a BLUB score of 77.09 (see the aggregation sketch after this list). In the zero-shot setting, BanglishBERT shows strong cross-lingual transfer, rivaling XLM-R (large) despite being considerably smaller.
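As a rough illustration of the Replaced Token Detection objective mentioned in item 1, the sketch below shows the core idea with toy PyTorch modules: a small generator fills in masked positions, and a discriminator is trained to classify every token as original or replaced. The module sizes, masking rate, and sampling scheme are illustrative assumptions, not the paper's actual ELECTRA configuration.

```python
# Conceptual sketch of ELECTRA-style Replaced Token Detection (RTD); not the authors' code.
# The 15% masking rate and the tiny toy modules are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size, hidden, mask_rate = 32000, 256, 0.15
generator = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, vocab_size))
discriminator_body = nn.Sequential(nn.Embedding(vocab_size, hidden), nn.Linear(hidden, hidden), nn.GELU())
rtd_head = nn.Linear(hidden, 1)  # per-token "original vs. replaced" classifier

def rtd_step(input_ids: torch.Tensor, mask_token_id: int) -> torch.Tensor:
    # 1) Corrupt a random subset of positions with a [MASK] token.
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_rate
    corrupted = input_ids.masked_fill(mask, mask_token_id)

    # 2) The generator proposes plausible replacements for the masked positions.
    gen_logits = generator(corrupted)
    sampled = torch.distributions.Categorical(logits=gen_logits).sample()
    replaced = torch.where(mask, sampled, input_ids)

    # 3) The discriminator predicts, for every token, whether it was replaced.
    labels = (replaced != input_ids).float()
    logits = rtd_head(discriminator_body(replaced)).squeeze(-1)
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)

# Example: one toy step on a random batch of token ids.
loss = rtd_step(torch.randint(0, vocab_size, (8, 128)), mask_token_id=4)
```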

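The summary reports a single aggregate BLUB score (77.09) but does not spell out how it is computed; if BLUB follows the GLUE convention of macro-averaging per-task scores, the aggregation would look roughly like the sketch below. Task names are placeholders, and the scores are left at zero rather than guessing the paper's per-task numbers.

```python
# Hypothetical BLUB aggregation, assuming a GLUE-like macro-average over per-task scores.
# Task names are placeholders; scores are zeros rather than guesses at reported numbers.
def blub_score(task_scores: dict[str, float]) -> float:
    """Macro-average the per-task evaluation scores into a single benchmark number."""
    return sum(task_scores.values()) / len(task_scores)

placeholder_scores = {
    "sentiment_classification": 0.0,
    "named_entity_recognition": 0.0,
    "natural_language_inference": 0.0,
    "question_answering": 0.0,
}
print(f"BLUB score: {blub_score(placeholder_scores):.2f}")
```
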
Implications

Practical Implementation: The public availability of BanglaBERT and its accompanying datasets provides a key resource for Bangla NLP applications and for regional language technology more broadly. The work offers a clear path toward efficient, task-specific Bangla NLP tools for applications such as sentiment analysis and named entity recognition; a minimal loading sketch follows below.
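As a concrete starting point, a downstream Bangla classifier could plausibly be set up with the Hugging Face transformers library as sketched below. The checkpoint id csebuetnlp/banglabert mirrors the linked repository name and is an assumption here; the label count and example sentence are illustrative, and the classification head remains randomly initialized until fine-tuned.

```python
# Sketch of loading BanglaBERT for a downstream text-classification task with Hugging Face
# transformers. The checkpoint id "csebuetnlp/banglabert" mirrors the linked repository name
# and is an assumption; num_labels and the example sentence are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "csebuetnlp/banglabert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("এই সিনেমাটি খুব ভালো ছিল", return_tensors="pt")  # "This movie was very good"
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities from the (not yet fine-tuned) head
```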

Theoretical Insights: The work underscores the benefits of language-specific models over multilingual ones, particularly for low-resource languages. It also shows how a bilingual model can draw on the resources of a high-resource language through cross-lingual transfer, complementing the limited supervision available for the low-resource target language.

Future Directions: The BLUB benchmark could be extended with additional NLU tasks, such as dependency parsing, to provide a more comprehensive evaluation suite. Furthermore, initializing Bangla Natural Language Generation (NLG) models from BanglaBERT could further strengthen the Bangla language processing ecosystem.
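One common recipe for such an initialization (not prescribed by the paper) is warm-starting a sequence-to-sequence model from a pretrained encoder. A hedged sketch using Hugging Face's EncoderDecoderModel follows; the checkpoint ids and the choice of decoder are assumptions.

```python
# Hedged sketch: warm-starting an encoder-decoder NLG model from a pretrained encoder.
# The checkpoint ids and the choice of decoder are assumptions, not the paper's recipe.
from transformers import AutoTokenizer, EncoderDecoderModel

encoder_ckpt = "csebuetnlp/banglabert"         # assumed Hugging Face id for BanglaBERT
decoder_ckpt = "bert-base-multilingual-cased"  # any decoder that supports cross-attention

model = EncoderDecoderModel.from_encoder_decoder_pretrained(encoder_ckpt, decoder_ckpt)
dec_tokenizer = AutoTokenizer.from_pretrained(decoder_ckpt)

# Generation settings needed before fine-tuning on a Bangla sequence-to-sequence task.
model.config.decoder_start_token_id = dec_tokenizer.cls_token_id
model.config.pad_token_id = dec_tokenizer.pad_token_id
```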

This paper bridges an important gap in Bangla NLP resources and opens the door to more tailored language models that accurately reflect the linguistic nuances of low-resource languages like Bangla. The public release of the datasets and models encourages academic and practical exploration in this domain, supporting community-driven advancement of Bangla language technologies.
