BERTje: A Dutch BERT Model (1912.09582v1)

Published 19 Dec 2019 in cs.CL

Abstract: The transformer-based pre-trained language model BERT has helped to improve state-of-the-art performance on many NLP tasks. Using the same architecture and parameters, we developed and evaluated a monolingual Dutch BERT model called BERTje. Compared to the multilingual BERT model, which includes Dutch but is only based on Wikipedia text, BERTje is based on a large and diverse dataset of 2.4 billion tokens. BERTje consistently outperforms the equally-sized multilingual BERT model on downstream NLP tasks (part-of-speech tagging, named-entity recognition, semantic role labeling, and sentiment analysis). Our pre-trained Dutch BERT model is made available at https://github.com/wietsedv/bertje.

Summary

The paper introduces a monolingual Dutch BERT model that outperforms multilingual BERT in key NLP tasks such as NER and sentiment analysis.
It employs Sentence Order Prediction and whole-word masking on a diverse 2.4B-token corpus to enhance linguistic coherence and representation.
The model demonstrates improved accuracy in tasks like POS tagging and semantic role labeling, emphasizing practical benefits for Dutch NLP applications.

Insights and Evaluation of BERTje: A Dutch BERT Model

The paper introduces BERTje, a monolingual Dutch BERT model that uses the same architecture and parameters as BERT and is comparable in size to BERT_base, with the aim of improving performance on Dutch NLP tasks. It addresses a limitation of multilingual BERT, which includes Dutch but is trained only on Wikipedia text and so does not fully represent everyday language use.

Pre-training Process

Data Collection and Pre-processing: BERTje is trained on a diverse dataset of 2.4 billion tokens compiled from multiple high-quality Dutch corpora. These include books, a multifaceted news corpus, a reference corpus, and data from Dutch news websites, alongside the Dutch segment of Wikipedia. This comprehensive, multi-domain dataset is intended to offer a broader linguistic perspective than the Wikipedia-driven multilingual BERT.

Training Considerations: The paper notes two key adjustments to the pre-training objectives. Because the Next Sentence Prediction (NSP) task in BERT had proven ineffective, BERTje uses Sentence Order Prediction (SOP) instead, which is intended to improve the model's grasp of sentence coherence. In addition, the Masked Language Modeling (MLM) task was adapted to mask complete words rather than individual word pieces, since a masked piece is too easy to predict when the other pieces of the same word remain visible.
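
The two objectives can be illustrated with a minimal sketch. It is not the authors' pre-training code: it assumes WordPiece-style tokens where a leading "##" marks a continuation piece, it replaces every selected piece with [MASK] (omitting BERT's usual 80/10/10 replacement scheme), and the function names, the 0.15 masking rate, and the SOP label convention (1 = original order, 0 = swapped) are illustrative choices.

```python
import random


def group_word_pieces(tokens):
    """Group token indices into whole words ('##' marks a continuation piece)."""
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)   # continuation: attach to the current word
        else:
            words.append([i])     # start of a new word
    return words


def whole_word_mask(tokens, mask_prob=0.15, rng=None, mask_token="[MASK]"):
    """Select whole words at random and mask every piece of each selected word."""
    if rng is None:
        rng = random.Random()
    masked = list(tokens)
    labels = [None] * len(tokens)  # None = position is not a prediction target
    for word in group_word_pieces(tokens):
        if rng.random() < mask_prob:
            for i in word:
                labels[i] = tokens[i]
                masked[i] = mask_token
    return masked, labels


def make_sop_example(first, second, rng=None):
    """Build a Sentence Order Prediction pair from two consecutive segments.

    Returns (segment_a, segment_b, label); label 1 means original order,
    label 0 means the two segments were swapped (assumed convention).
    """
    if rng is None:
        rng = random.Random()
    if rng.random() < 0.5:
        return first, second, 1
    return second, first, 0


if __name__ == "__main__":
    pieces = ["de", "fiets", "##en", "##maker", "werkt", "hard"]
    print(whole_word_mask(pieces, mask_prob=0.5, rng=random.Random(0)))
    print(make_sop_example("Het regent.", "Ik neem een paraplu.", random.Random(0)))
```

Because masking is decided once per word group, the pieces "fiets", "##en", and "##maker" are always masked together or left intact together, which is the property that distinguishes whole-word masking from piece-level masking.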

Performance Evaluation

To benchmark BERTje, the authors fine-tuned it on several NLP tasks: named-entity recognition (NER), part-of-speech (POS) tagging, semantic role labeling, and sentiment analysis. BERTje consistently outperformed the equally-sized multilingual BERT on these tasks.

Named-Entity Recognition: BERTje achieved F1 scores significantly higher than the multilingual BERT on both the Dutch CoNLL-2002 and SoNaR-1 datasets. This indicates its enhanced capability in contextually understanding and identifying entities within Dutch texts.
Part-of-Speech Tagging: In POS tagging tasks derived from Universal Dependencies and SoNaR-1, BERTje outperformed its multilingual counterpart, maintaining superior accuracy across various tag complexities.
Semantic Role Labeling and Spatio-Temporal Relations: Results also show BERTje's comprehensive advantage in tasks involving hierarchical linguistic annotations and spatio-temporal relations, showcasing its versatility in handling nuanced language structures.
Sentiment Analysis: On binary sentiment classification of Dutch book reviews, BERTje approached state-of-the-art performance without extensive hyperparameter tuning, which underscores its efficacy on high-level semantic tasks.
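
For readers unfamiliar with how NER F1 is usually computed, the sketch below shows an entity-level (span-level) F1 over BIO tags for a single tag sequence. It is an illustration only: the evaluation script used in the paper is not described here, the function names are hypothetical, and a real benchmark would aggregate span counts over the whole test set rather than score one sentence.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) entity spans in a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a final span
        continues = tag.startswith("I-") and etype == tag[2:]
        if not continues:
            if etype is not None:
                spans.add((start, i, etype))      # close the open span (end exclusive)
                start, etype = None, None
            if tag.startswith("B-") or tag.startswith("I-"):
                start, etype = i, tag[2:]         # a stray I- tag leniently opens a span
    return spans


def span_f1(gold_tags, pred_tags):
    """Entity-level F1: a predicted span counts only if boundaries and type match."""
    gold, pred = extract_spans(gold_tags), extract_spans(pred_tags)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["B-PER", "I-PER", "O", "B-LOC"]
    pred = ["B-PER", "I-PER", "O", "B-ORG"]
    print(span_f1(gold, pred))  # one of two spans matches: P = R = 0.5, so F1 = 0.5
```

Scoring whole spans rather than individual tokens means that a correct boundary with the wrong entity type earns no credit, which is why F1 differences between models on NER can be larger than token-level accuracy would suggest.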

Theoretical and Practical Implications

The findings of this paper suggest that monolingual models like BERTje offer tangible performance benefits over multilingual models, particularly in cases where high linguistic specificity is required. Importantly, the research emphasizes a structured pre-training approach, where establishing a robust foundation of low-level linguistic structures is pivotal to subsequently achieving proficiency in more abstract language tasks. This progressive learning trajectory can inform further iterations of monolingual models across other languages.

Future Directions

The exploration of BERTje's potential can be deepened by investigating how different layers capture various linguistic phenomena, specifically aiming at improving its performance in complex sentence-level tasks. Additionally, the examination of higher-level tasks such as document classification or dialogue-based interactions could further harness the sophisticated interplay of semantic understanding BERTje facilitates.

In conclusion, while multilingual models offer broader coverage, BERTje illustrates the advantages of tailored, language-specific models for NLP applications and motivates the development of comparable monolingual BERT models for other languages. Because BERTje is publicly available, it also provides a foundation for future work in Dutch NLP.

GitHub

GitHub - wietsedv/bertje: BERTje is a Dutch pre-trained BERT model developed at the University of Groningen. (EMNLP Findings 2020) "What’s so special about BERT’s layers? A closer look at the NLP pipeline in monolingual and multilingual models" (138 stars)