Emergent Mind

Constrained Decoding for Cross-lingual Label Projection

(2402.03131)
Published Feb 5, 2024 in cs.CL and cs.LG

Abstract

Zero-shot cross-lingual transfer utilizing multilingual LLMs has become a popular learning paradigm for low-resource languages with no labeled training data. However, for NLP tasks that involve fine-grained predictions on words and phrases, the performance of zero-shot cross-lingual transfer learning lags far behind supervised fine-tuning methods. Therefore, it is common to exploit translation and label projection to further improve the performance by (1) translating training data that is available in a high-resource language (e.g., English) together with the gold labels into low-resource languages, and/or (2) translating test data in low-resource languages to a high-resource language to run inference on, then projecting the predicted span-level labels back onto the original test data. However, state-of-the-art marker-based label projection methods suffer from translation quality degradation due to the extra label markers injected in the input to the translation model. In this work, we explore a new direction that leverages constrained decoding for label projection to overcome the aforementioned issues. Our new method not only can preserve the quality of translated texts but also has the versatility of being applicable to both translating training and translating test data strategies. This versatility is crucial as our experiments reveal that translating test data can lead to a considerable boost in performance compared to translating only training data. We evaluate on two cross-lingual transfer tasks, namely Named Entity Recognition and Event Argument Extraction, spanning 20 languages. The results demonstrate that our approach outperforms the state-of-the-art marker-based method by a large margin and also shows better performance than other label projection methods that rely on external word alignment.

Figure: Approaches to project English label spans to low-resource languages, with projection errors underlined.

Overview

  • The paper introduces CODEC, a method for cross-lingual label projection that addresses the balance between accuracy and translation quality.

  • CODEC operates by translating data without markers and applies a custom decoding algorithm for marker insertion, thus maintaining high translation quality necessary for precise label projection.

  • The method outperformed state-of-the-art techniques in tasks like Named Entity Recognition and Event Argument Extraction, showing significant improvements, especially in low-resource languages.

  • CODEC enhances efficiency by approximating the multi-span projection problem and applying pruning strategies, achieving faster decoding with minimal impact on performance.

Introduction

Cross-lingual transfer learning is a powerful tool for extending applications to low-resource languages without labeled data. Though multilingual large language models (LLMs) facilitate zero-shot learning, their performance on fine-grained tasks like Named Entity Recognition (NER) or Event Argument Extraction (EAE) is generally subpar compared to supervised models fine-tuned on labeled data. Traditional label projection techniques translate labeled data from a resource-rich language and align the labels onto the low-resource language data. Alignment-based methods preserve translation quality but lag in projection accuracy, while marker-based methods lead in accuracy but compromise translation fidelity by injecting markers before translation.
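The marker-based strategy the paper builds on can be sketched as follows. This is a minimal illustration, not the paper's implementation: markers such as `[0]`/`[/0]` are injected around gold spans before the text is sent through a translation model, then parsed out of the translated output to recover the projected span positions. The marker format and helper names here are illustrative assumptions.

```python
# Sketch of marker-based label projection (the baseline CODEC improves on).
# Markers like [0]...[/0] are injected around gold spans before translation,
# then stripped from the translated text to recover projected spans.
# The marker format is an illustrative assumption.
import re

def inject_markers(text: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) character span in indexed markers."""
    out, prev = [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        out.append(text[prev:start])
        out.append(f"[{i}]{text[start:end]}[/{i}]")
        prev = end
    out.append(text[prev:])
    return "".join(out)

def extract_markers(translated: str) -> tuple[str, list[tuple[int, int]]]:
    """Strip markers from a (translated) string; return clean text + spans."""
    pattern = re.compile(r"\[(/?)(\d+)\]")
    clean, spans, open_at = [], {}, {}
    pos, last = 0, 0
    for m in pattern.finditer(translated):
        clean.append(translated[last:m.start()])
        pos += m.start() - last
        idx = int(m.group(2))
        if m.group(1):                 # closing marker [/i]
            spans[idx] = (open_at[idx], pos)
        else:                          # opening marker [i]
            open_at[idx] = pos
        last = m.end()
    clean.append(translated[last:])
    return "".join(clean), [spans[i] for i in sorted(spans)]
```

The quality problem the paper identifies arises between these two steps: the injected markers perturb the translation model, so the marked translation can differ from (and be worse than) the translation of the clean sentence.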

Constrained Decoding for Label Projection

A novel approach, Constrained Decoding for Cross-lingual Label Projection (CODEC), addresses quality degradation while maintaining the accuracy benefits of marker-based projection. CODEC first translates the input (training or test data) without markers, then inserts the markers in a second pass via a custom constrained decoding algorithm. This maintains high translation quality, which is vital for accurate projection. CODEC's constrained decoding ensures that only likely marker positions and hypotheses with the correct number of labels are considered.
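The second pass can be illustrated with a deliberately simplified sketch: the marker-free translation is held fixed, and the search scores every valid placement of one open/close marker pair over it. The `score_fn` callable stands in for the translation model's log-probability of the marked sequence; the single-pair enumeration, the bracket tokens, and all names here are assumptions, since the paper's actual algorithm handles multiple spans and uses beam-style constrained decoding.

```python
# Simplified sketch of constrained marker insertion over a fixed translation.
# score_fn is a stand-in for the translation model's log-probability of the
# marked sequence (an assumption); the real CODEC handles multiple spans.
from itertools import combinations

def insert_pair(tokens: list[str], i: int, j: int) -> list[str]:
    """Return tokens with an opening marker before i and a closing one before j."""
    return tokens[:i] + ["["] + tokens[i:j] + ["]"] + tokens[j:]

def constrained_marker_search(tokens, score_fn, candidate_positions=None):
    """Score every valid (open, close) marker placement over the fixed token
    sequence and return the best-scoring hypothesis. candidate_positions
    optionally restricts where markers may go (the pruning idea); by default
    every position is considered."""
    positions = list(candidate_positions or range(len(tokens) + 1))
    best, best_score = None, float("-inf")
    for i, j in combinations(positions, 2):  # enforces open < close
        hyp = insert_pair(tokens, i, j)
        s = score_fn(hyp)
        if s > best_score:
            best, best_score = hyp, s
    return best
```

The key constraints from the paper survive even in this toy form: the underlying translation is never altered, markers come in balanced pairs, and the search space is restricted to candidate positions rather than free-form generation.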

Experimental Results

CODEC was evaluated across 20 languages on NER and EAE tasks, showing marked improvements over state-of-the-art methods. In the NER task, CODEC outperformed the established EasyProject method by a wide margin, with particularly large gains in underperforming languages such as chiShona. In the EAE task, CODEC outperformed alignment-based projection methods on both Arabic and Chinese, showcasing the efficacy of its constrained decoding strategy.

Algorithm Efficiency

Further efficiency is achieved by approximating the multi-span projection problem and pruning unlikely marker positions to expedite decoding. CODEC operates effectively even with long sequences and numerous labeled spans. Its design removes unpromising branches early in the search and uses heuristics to aggressively reduce decoding time with minimal performance impact.
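The pruning idea can be sketched as a pre-filter: before the constrained search branches, keep only the k most promising insertion points according to a per-position score. The per-position score here is a placeholder assumption, standing in for however the model estimates that a marker token is likely at that position; the function name and interface are illustrative.

```python
# Illustrative position pruning (an assumption-level sketch): keep only the
# k highest-scoring marker positions so the downstream constrained search
# branches far less often. position_scores stands in for model estimates.
def prune_positions(position_scores: list[float], k: int) -> list[int]:
    """Return indices of the k highest-scoring positions, in ascending order."""
    ranked = sorted(range(len(position_scores)),
                    key=lambda i: position_scores[i], reverse=True)
    return sorted(ranked[:k])
```

Restricting the search to these positions shrinks the number of hypotheses from quadratic in sequence length to quadratic in k, which is where the reported speedup with minimal performance loss would come from under this reading.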

Conclusion

Prompted by the challenges of multilingual label projection, CODEC presents a versatile, constraint-based decoding methodology that balances translation quality with the precision of label projection. It first translates without markers, preserving text integrity, then performs constrained decoding to insert markers afterward. Experiments show that it outperforms existing methods, setting a new bar for cross-lingual label projection and offering a practical route to better data augmentation for low-resource languages.
