IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

Published 2 Nov 2020 in cs.CL | (2011.00677v1)

Abstract: Although the Indonesian language is spoken by almost 200 million people and is the 10th most spoken language in the world, it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the IndoLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse. We additionally release IndoBERT, a new pre-trained language model for Indonesian, and evaluate it over IndoLEM, in addition to benchmarking it against existing resources. Our experiments show that IndoBERT achieves state-of-the-art performance over most of the tasks in IndoLEM.




Summary

The paper presents IndoLEM, a comprehensive benchmark dataset spanning seven linguistic tasks, alongside IndoBERT, a pre-trained model tailored for Indonesian.
It demonstrates IndoBERT’s superior performance compared to multilingual models in tasks such as POS tagging, NER, and dependency parsing.
The work establishes standardized evaluation metrics that promote future advancements in Indonesian language processing.

Overview of "IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP"

The paper addresses the under-representation of Indonesian in NLP research, which it attributes to a lack of annotated datasets, a scarcity of language resources, and insufficient standardization. The authors introduce IndoLEM, a benchmark dataset of seven NLP tasks covering morpho-syntactic, semantic, and discourse competencies for Indonesian. They also introduce IndoBERT, a pre-trained language model for Indonesian that achieves state-of-the-art performance on most of the IndoLEM tasks.

IndoLEM: A Comprehensive Resource

IndoLEM consists of tasks across multiple linguistic dimensions, carefully designed to capture a wide array of competencies:

Morpho-syntactic/Sequence Labeling Tasks: These include Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and dependency parsing. The datasets for these tasks draw on publicly available corpora and previous work, with standardized metrics and splits to ensure reproducibility.
Semantic Tasks: Two tasks are included: sentiment analysis and summarization. Sentiment analysis is a low-resource binary classification task with data drawn from social media and reviews, while summarization is framed as single-document extractive summarization over the IndoSum dataset.
Discourse Coherence Tasks: The authors propose two novel tasks that measure discourse coherence in Twitter threads: next tweet prediction and tweet ordering. Both are built from real tweet threads, so the model must either predict the tweet that follows a given context or restore the original order of a thread.
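To make the tweet ordering task concrete, the sketch below builds one instance by shuffling a thread and scores a predicted order against the original. The summary does not say how orderings are scored; Spearman rank correlation is used here as an assumption, and the function names and example thread are hypothetical.

```python
import random


def make_ordering_instance(thread, seed=0):
    """Return a shuffled copy of a thread; the original order is the gold label."""
    rng = random.Random(seed)
    shuffled = list(thread)
    rng.shuffle(shuffled)
    return shuffled


def spearman_rho(predicted, gold):
    """Spearman rank correlation between two orderings of the same unique items.

    Assumed metric for illustration: 1.0 means identical order,
    -1.0 means exactly reversed order.
    """
    n = len(gold)
    if n < 2:
        raise ValueError("need at least two items to compare orderings")
    gold_rank = {item: rank for rank, item in enumerate(gold)}
    # Sum of squared differences between predicted and gold positions.
    d2 = sum((rank - gold_rank[item]) ** 2 for rank, item in enumerate(predicted))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical four-tweet thread (placeholder text, not from the dataset).
thread = ["tweet A", "tweet B", "tweet C", "tweet D"]
shuffled = make_ordering_instance(thread, seed=42)
perfect = spearman_rho(thread, thread)
reversed_score = spearman_rho(thread[::-1], thread)
```

A model for this task would take `shuffled` as input and output a permutation, which is then compared with `thread` using the score above.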

IndoBERT: A Pre-Trained Language Model

IndoBERT is a monolingual BERT-style model trained with a masked language modeling objective on Indonesian corpora. Its architecture mirrors that of BERT-Base (12 layers, hidden size 768), and its training data comes from Indonesian Wikipedia, news outlets, and the Indonesian Web Corpus, totaling over 220 million words. Its evaluation on IndoLEM shows strong performance relative to existing models.
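As an illustration of the masked language modeling objective, the following is a minimal sketch of how training inputs can be corrupted. The 15% masking rate and the 80/10/10 split between `[MASK]`, a random token, and the unchanged token follow the original BERT recipe and are assumptions here; the summary does not state IndoBERT's exact settings. The function name and the example sentence are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Corrupt a token sequence for masked language modeling.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)  # position is not part of the loss
            corrupted.append(tok)
    return corrupted, labels


# Toy example; the sentence's own tokens stand in for a real vocabulary.
tokens = "saya suka makan nasi goreng di pagi hari".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, seed=1)
```

The training loss is computed only at positions where `labels` is not `None`, which is why unselected positions are left untouched.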

Comparative Evaluation and Results

The experiments compare IndoBERT against other models such as multilingual BERT (mBERT) and MalayBERT. IndoBERT performs best on a majority of tasks:

Morpho-syntactic Tasks: IndoBERT achieves competitive accuracy in POS tagging and notably higher F1 scores in NER tasks compared to mBERT and MalayBERT.
Dependency Parsing: The BiAffine parser augmented with IndoBERT outperforms the other models on UD-Indo-GSD, whereas mBERT does better on UD-Indo-PUD, an advantage attributed to its alignment with multilingual contexts.
Semantic and Discourse Tasks: IndoBERT clearly improves sentiment analysis results and yields measurable gains in extractive summarization. For discourse coherence, IndoBERT approaches human-level performance in next tweet prediction and ranks highly in tweet ordering, though the complexity of these tasks still leaves substantial room for further development.
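The F1 scores reported for NER can be read more easily with a concrete computation in hand. The sketch below computes entity-level precision, recall, and F1 from BIO tag sequences. Whether the benchmark scores NER at the entity level or the token level is not stated in this summary, so the entity-level choice is an assumption, and the function names and example tags are hypothetical.

```python
def bio_spans(tags):
    """Extract (entity_type, start, end) spans from BIO tags; end is exclusive.

    Lenient decoding (an assumption): an I- tag that does not continue the
    current entity type starts a new span.
    """
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if continues:
            continue
        if etype is not None:
            spans.add((etype, start, i))  # close the span that just ended
        if tag.startswith("B-") or tag.startswith("I-"):
            start, etype = i, tag[2:]
        else:
            start, etype = None, None
    return spans


def span_f1(gold_tags, pred_tags):
    """Entity-level precision, recall, and F1: a span counts only on exact match."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = ["B-PER", "I-PER", "O", "B-LOC", "O"]
pred = ["B-PER", "I-PER", "O", "B-ORG", "O"]
scores = span_f1(gold, pred)  # one of two predicted spans matches exactly
```

In this example the person span matches exactly while the second span has the wrong type, so precision, recall, and F1 are all 0.5.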

Implications and Future Directions

IndoBERT and IndoLEM set a new reference point for Indonesian NLP and provide a foundation for future research. IndoLEM offers a common platform for evaluating NLP models on Indonesian, making it easier to standardize and compare results in this domain. IndoBERT can serve as a base model for task-specific fine-tuning, and the approach suggests a path for improving language processing in other lesser-studied languages. Future research may use IndoLEM as a starting point for broader benchmarks and more diverse language modeling challenges, with the aim of matching the progress seen for high-resource languages.

IndoLEM and IndoBERT's contributions hold substantial potential to enrich NLP capabilities for Indonesian, thereby enhancing computational approaches available for Southeast Asian languages in general.
