- The paper presents a novel bidirectional LSTM-CRF framework that integrates character and word embeddings for enhanced Vietnamese sequence labeling.
- It achieves 93.52% accuracy in POS tagging and 94.88% F1 in NER by incorporating syntactic features, surpassing previous state-of-the-art models.
- The approach minimizes reliance on hand-crafted features, offering a robust baseline for processing morphologically rich, low-resource languages.
Review of "Neural sequence labeling for Vietnamese POS Tagging and NER" (1811.03754)
This paper presents a neural sequence labeling architecture for Vietnamese Part-of-Speech (POS) tagging and Named Entity Recognition (NER). Building on the Bidirectional LSTM-CRF framework, the system integrates both character-based and word-level embeddings, demonstrating significant performance improvements over preceding methods for Vietnamese language processing.
Architectural Overview
The proposed model leverages a hierarchical architecture:
- Character Embedding Layer: Each word is decomposed into a sequence of characters and embedded via randomly initialized vectors. These character embeddings are processed by a bidirectional LSTM, and the final states from both directions are concatenated to form a character-level representation. This approach facilitates capturing morphological patterns and addresses out-of-vocabulary word issues without reliance on hand-engineered features.
- Word Embedding Layer: Pre-trained 300-dimensional word embeddings are employed, derived from a large Vietnamese news corpus. Out-of-vocabulary words are represented by a uniformly-initialized vector ("UNK").
- Feature Integration: For the NER task, additional one-hot encoded syntactic features, namely POS and chunk tags, are concatenated with the word representation for enhanced linguistic context.
- Sequence Contextualization: These rich word representations (incorporating character-level, pretrained word, and, for NER, syntactic features) are fed into a bidirectional LSTM to capture broader contextual dependencies.
- Decoding Layer: A linear-chain CRF forms the top layer, facilitating joint decoding over the entire sequence and effectively modeling tag dependencies.
- Regularization: Dropout (rate 0.35) is applied to the inputs and outputs of both LSTM layers to mitigate overfitting and prevent over-reliance on any single component of the representation.
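The pipeline above can be sketched end to end. This is a minimal illustration, not the authors' implementation: NumPy stand-ins replace the trained networks (mean-pooling in place of the character BiLSTM, random scores in place of the word BiLSTM's emissions), but the dimensions (100-dim character embeddings, 300-dim word embeddings) and the Viterbi decoding over a linear-chain CRF follow the description above.

```python
import numpy as np

rng = np.random.default_rng(0)
CHAR_EMB, CHAR_HIDDEN, WORD_EMB = 100, 100, 300  # sizes from the paper

def char_representation(word: str) -> np.ndarray:
    """Stand-in for the character BiLSTM: embeds each character and
    mean-pools per direction, mimicking the concatenated final states."""
    chars = rng.normal(size=(len(word), CHAR_EMB))
    fwd = chars.mean(axis=0)[:CHAR_HIDDEN]   # pretend forward final state
    bwd = chars.mean(axis=0)[:CHAR_HIDDEN]   # pretend backward final state
    return np.concatenate([fwd, bwd])        # (200,)

def word_representation(word: str) -> np.ndarray:
    """Concatenate the character-level representation with a (here random)
    300-dim pretrained word embedding: 200 + 300 = 500 dims."""
    return np.concatenate([char_representation(word), rng.normal(size=WORD_EMB)])

def viterbi_decode(emissions: np.ndarray, transitions: np.ndarray) -> list:
    """Exact MAP decoding over a linear-chain CRF.
    emissions:   (seq_len, num_tags) per-token scores.
    transitions: (num_tags, num_tags) score of moving from tag i to tag j."""
    seq_len, num_tags = emissions.shape
    score = emissions[0].copy()                       # best score ending in each tag
    backptr = np.zeros((seq_len, num_tags), dtype=int)
    for t in range(1, seq_len):
        # score[i] + transitions[i, j] + emissions[t, j], maximized over i
        total = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    best = [int(score.argmax())]                      # follow back-pointers
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]

sentence = ["Hà", "Nội", "là", "thủ", "đô"]
reps = np.stack([word_representation(w) for w in sentence])      # (5, 500)
num_tags = 3                                                     # e.g. B-LOC / I-LOC / O
emissions = rng.normal(size=(len(sentence), num_tags))           # stand-in BiLSTM scores
transitions = rng.normal(size=(num_tags, num_tags))              # learned tag transitions
tags = viterbi_decode(emissions, transitions)
print(reps.shape, tags)
```

In the real model the 500-dim vectors feed a word-level BiLSTM whose outputs become the CRF emission scores; the Viterbi recursion itself is standard and exact.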
Experimental Setup and Results
The system was evaluated on two established Vietnamese datasets:
- POS Tagging: VietTreebank (VTB)
- NER: VLSP 2016 (with gold word segmentation, POS, and chunk tags)
Crucially, the system avoids hand-crafted features or language-specific resources, requiring only annotated data and raw text for embedding pretraining.
Hyperparameters
| Hyperparameter | Value |
| --- | --- |
| Character Embedding | 100 |
| Word Embedding | 300 |
| Char-LSTM Hidden Size | 100 |
| Word-LSTM Hidden Size | 150 |
| Dropout | 0.35 |
| Optimizer | Adam |
| Learning Rate | 0.0035 |
| Batch Size | 8 |
| Early Stopping | Yes |
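For concreteness, these settings map onto a training configuration like the following; the key names are illustrative, not taken from the paper's released code:

```python
# Hypothetical configuration mirroring the hyperparameter table above.
config = {
    "char_emb_dim": 100,       # character embedding size
    "word_emb_dim": 300,       # pretrained word embedding size
    "char_lstm_hidden": 100,   # char-LSTM hidden size (per direction)
    "word_lstm_hidden": 150,   # word-LSTM hidden size (per direction)
    "dropout": 0.35,
    "optimizer": "adam",
    "learning_rate": 0.0035,
    "batch_size": 8,
    "early_stopping": True,
}
# Input to the word BiLSTM: concatenated char BiLSTM states + word embedding.
word_input_dim = 2 * config["char_lstm_hidden"] + config["word_emb_dim"]
print(word_input_dim)
```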
Results
- POS Tagging: Achieved 93.52% accuracy using 10-fold cross-validation, outperforming previous state-of-the-art systems (e.g., RDRPOSTagger at 92.59% and NNVLP at 91.92%).
- NER: Attained 94.88% micro-F1 by integrating chunk and POS features, clearly surpassing both feature-rich CRFs (93.93%) and NNVLP (92.91%). Removal of character embeddings led to substantial performance degradation (F1 drops to 91.36%), highlighting the critical role of subword modeling.
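As a reminder of the metric, micro-F1 aggregates true positives, false positives, and false negatives over all entity types before computing precision and recall. The counts below are purely illustrative, not taken from the paper:

```python
# Toy micro-F1 computation with hypothetical entity counts.
tp, fp, fn = 90, 5, 7                 # aggregated over all entity types
precision = tp / (tp + fp)            # 90 / 95
recall = tp / (tp + fn)               # 90 / 97
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))                   # → 0.9375
```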
The results are summarized below:
| Method | POS Accuracy | NER F1 |
| --- | --- | --- |
| Proposed BiLSTM-CRFs +POS +Chunk | 93.52 | 94.88 |
| Feature-rich CRFs | - | 93.93 |
| NNVLP (CNN char) | 91.92 | 92.91 |
| BiLSTM-CRFs (no char) | 91.74-92.22 | 91.36 |
Analysis and Discussion
The architecture demonstrates that LSTM-based character embeddings are highly effective for Vietnamese, likely because they model sequential dependencies across a word's characters, which matter in a language whose words are formed largely by compounding and complex word formation. This gives them an edge over CNN-based character representations, which capture local character n-grams but model long-range order less directly.
The inclusion of syntactic features (POS, chunk) for NER yields further gains, reinforcing the importance of encoding linguistic structure in neural models. The improvements over NNVLP, despite employing the same word embeddings, are attributed to both model differences (use of LSTM over CNN for character composition) and careful hyperparameter tuning.
Implications
Practical Implications:
- The system provides an end-to-end, feature-minimal baseline for Vietnamese sequence labeling, reducing dependencies on domain experts for feature engineering or lexical resources.
- Generalizable to other morphologically rich, low-resource languages by retraining on new datasets and assembling appropriate word embeddings from large unlabelled corpora.
- The use of publicly released code enhances reproducibility and potential for extension.
Theoretical Implications:
- Supports empirical findings that LSTM-based character encoding offers language-agnostic benefits for modeling subword information, notably in languages with complex morphosyntax.
- Validates the effectiveness of linear-chain CRFs in neural sequence labeling, especially where tag dependencies are linguistically meaningful.
Future Directions
Potential avenues for future work include:
- Exploring hybrid or transformer-based sequence models to further enhance context modeling.
- Investigating language-agnostic pretraining using cross-lingual or multilingual embeddings to improve low-resource adaptation.
- Extending to other Vietnamese NLP tasks such as dependency parsing, or expanding to joint models across segmentation, POS, and NER.
- Conducting ablation studies to quantify the interaction between various syntactic feature integrations and their impact on different tasks.
Conclusion
This paper establishes a robust, neural baseline for Vietnamese sequence labeling, emphasizing the utility of character-level modeling via bidirectional LSTMs and the seamless integration of syntactic features. The demonstrated performance improvements and absence of hand-crafted features mark this approach as a strong candidate for both immediate deployment in Vietnamese NLP pipelines and as a template architecture for other under-resourced languages.