Emergent Mind

Constrained Decoding for Cross-lingual Label Projection

(2402.03131)
Published Feb 5, 2024 in cs.CL and cs.LG

Abstract

Zero-shot cross-lingual transfer utilizing multilingual LLMs has become a popular learning paradigm for low-resource languages with no labeled training data. However, for NLP tasks that involve fine-grained predictions on words and phrases, the performance of zero-shot cross-lingual transfer learning lags far behind supervised fine-tuning methods. Therefore, it is common to exploit translation and label projection to further improve the performance by (1) translating training data that is available in a high-resource language (e.g., English) together with the gold labels into low-resource languages, and/or (2) translating test data in low-resource languages to a high-resource language to run inference on, then projecting the predicted span-level labels back onto the original test data. However, state-of-the-art marker-based label projection methods suffer from translation quality degradation due to the extra label markers injected in the input to the translation model. In this work, we explore a new direction that leverages constrained decoding for label projection to overcome the aforementioned issues. Our new method not only can preserve the quality of translated texts but also has the versatility of being applicable to both translating training and translating test data strategies. This versatility is crucial as our experiments reveal that translating test data can lead to a considerable boost in performance compared to translating only training data. We evaluate on two cross-lingual transfer tasks, namely Named Entity Recognition and Event Argument Extraction, spanning 20 languages. The results demonstrate that our approach outperforms the state-of-the-art marker-based method by a large margin and also shows better performance than other label projection methods that rely on external word alignment.

Figure: Approaches to project English label spans to low-resource languages, with projection errors underlined.

Overview

  • The paper introduces CODEC, a method for cross-lingual label projection that addresses the balance between accuracy and translation quality.

  • CODEC operates by translating data without markers and applies a custom decoding algorithm for marker insertion, thus maintaining high translation quality necessary for precise label projection.

  • The method outperformed state-of-the-art techniques in tasks like Named Entity Recognition and Event Argument Extraction, showing significant improvements, especially in low-resource languages.

  • CODEC enhances efficiency by approximating the multi-span projection problem and applying pruning strategies, achieving faster decoding with minimal impact on performance.

Introduction

Cross-lingual transfer learning is a powerful tool for extending applications to low-resource languages without labeled data. Though multilingual large language models (LLMs) facilitate zero-shot learning, their performance on fine-grained tasks like Named Entity Recognition (NER) or Event Argument Extraction (EAE) is generally subpar compared to supervised models fine-tuned on labeled data. Traditional label projection techniques translate labeled data from a resource-rich language and align the labels onto the low-resource language data. Alignment-based methods preserve translation quality but lag in projection accuracy, while marker-based methods lead in accuracy but compromise translation fidelity by injecting markers before translation.
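The marker-based strategy the paper builds on can be sketched as follows. This is a minimal illustration, not the paper's implementation: markers such as `[0]`/`[/0]` are injected around gold spans before the text is sent through a translation model, then parsed out of the translated output to recover the projected span positions. The marker format and helper names here are illustrative assumptions.

```python
# Sketch of marker-based label projection (the baseline CODEC improves on).
# Markers like [0]...[/0] are injected around gold spans before translation,
# then stripped from the translated text to recover projected spans.
# The marker format is an illustrative assumption.
import re

def inject_markers(text: str, spans: list[tuple[int, int]]) -> str:
    """Wrap each (start, end) character span in indexed markers."""
    out, prev = [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        out.append(text[prev:start])
        out.append(f"[{i}]{text[start:end]}[/{i}]")
        prev = end
    out.append(text[prev:])
    return "".join(out)

def extract_markers(translated: str) -> tuple[str, list[tuple[int, int]]]:
    """Strip markers from a (translated) string; return clean text + spans."""
    pattern = re.compile(r"\[(/?)(\d+)\]")
    clean, spans, open_at = [], {}, {}
    pos, last = 0, 0
    for m in pattern.finditer(translated):
        clean.append(translated[last:m.start()])
        pos += m.start() - last
        idx = int(m.group(2))
        if m.group(1):                 # closing marker [/i]
            spans[idx] = (open_at[idx], pos)
        else:                          # opening marker [i]
            open_at[idx] = pos
        last = m.end()
    clean.append(translated[last:])
    return "".join(clean), [spans[i] for i in sorted(spans)]
```

The quality problem the paper identifies arises between these two steps: the injected markers perturb the translation model, so the marked translation can differ from (and be worse than) the translation of the clean sentence.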

Constrained Decoding for Label Projection

A novel approach, Constrained Decoding for Cross-lingual Label Projection (CODEC), addresses quality degradation while maintaining the accuracy benefits of marker-based projection. CODEC first translates the input (training or test data) without markers, then inserts the markers in a second pass via a custom constrained decoding algorithm. This maintains high translation quality, which is vital for accurate projection. CODEC's constrained decoding ensures that only likely marker positions and hypotheses with the correct number of labels are considered.
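The second pass can be illustrated with a deliberately simplified sketch: the marker-free translation is held fixed, and the search scores every valid placement of one open/close marker pair over it. The `score_fn` callable stands in for the translation model's log-probability of the marked sequence; the single-pair enumeration, the bracket tokens, and all names here are assumptions, since the paper's actual algorithm handles multiple spans and uses beam-style constrained decoding.

```python
# Simplified sketch of constrained marker insertion over a fixed translation.
# score_fn is a stand-in for the translation model's log-probability of the
# marked sequence (an assumption); the real CODEC handles multiple spans.
from itertools import combinations

def insert_pair(tokens: list[str], i: int, j: int) -> list[str]:
    """Return tokens with an opening marker before i and a closing one before j."""
    return tokens[:i] + ["["] + tokens[i:j] + ["]"] + tokens[j:]

def constrained_marker_search(tokens, score_fn, candidate_positions=None):
    """Score every valid (open, close) marker placement over the fixed token
    sequence and return the best-scoring hypothesis. candidate_positions
    optionally restricts where markers may go (the pruning idea); by default
    every position is considered."""
    positions = list(candidate_positions or range(len(tokens) + 1))
    best, best_score = None, float("-inf")
    for i, j in combinations(positions, 2):  # enforces open < close
        hyp = insert_pair(tokens, i, j)
        s = score_fn(hyp)
        if s > best_score:
            best, best_score = hyp, s
    return best
```

The key constraints from the paper survive even in this toy form: the underlying translation is never altered, markers come in balanced pairs, and the search space is restricted to candidate positions rather than free-form generation.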

Experimental Results

CODEC was evaluated across 20 languages on NER and EAE tasks, showing marked improvements over state-of-the-art methods. In the NER task, CODEC outperformed the established EasyProject method by a wide margin, with particularly large gains in underperforming languages such as chiShona. In the EAE task, CODEC outperformed alignment-based projection methods on both Arabic and Chinese, showcasing the efficacy of its constrained decoding strategy.

Algorithm Efficiency

Further efficiency is achieved by approximating the multi-span projection problem and pruning unlikely marker positions to expedite decoding. CODEC operates effectively even with long sequences and numerous labeled spans. Its design removes unpromising branches early in the search and uses heuristics to aggressively reduce decoding time with minimal performance impact.
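The pruning idea can be sketched as a pre-filter: before the constrained search branches, keep only the k most promising insertion points according to a per-position score. The per-position score here is a placeholder assumption, standing in for however the model estimates that a marker token is likely at that position; the function name and interface are illustrative.

```python
# Illustrative position pruning (an assumption-level sketch): keep only the
# k highest-scoring marker positions so the downstream constrained search
# branches far less often. position_scores stands in for model estimates.
def prune_positions(position_scores: list[float], k: int) -> list[int]:
    """Return indices of the k highest-scoring positions, in ascending order."""
    ranked = sorted(range(len(position_scores)),
                    key=lambda i: position_scores[i], reverse=True)
    return sorted(ranked[:k])
```

Restricting the search to these positions shrinks the number of hypotheses from quadratic in sequence length to quadratic in k, which is where the reported speedup with minimal performance loss would come from under this reading.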

Conclusion

Prompted by the challenges of multilingual label projection, CODEC presents a versatile, constraint-based decoding methodology that balances translation quality with the precision of label projection. It first translates without markers, preserving text integrity, then performs constrained decoding to insert markers afterward. Experiments show that it outperforms existing methods, setting a new bar for cross-lingual label projection and offering a practical route to better data augmentation for low-resource languages.
