Emergent Mind

Abstract

Data scarcity in low-resource languages can be addressed with word-to-word translations from labeled task data in high-resource languages using bilingual lexicons. However, bilingual lexicons often have limited lexical overlap with task data, which results in poor translation coverage and lexicon utilization. We propose lexicon-conditioned data generation (LexC-Gen), a method that generates low-resource-language classification task data at scale. Specifically, LexC-Gen first uses high-resource-language words from bilingual lexicons to generate lexicon-compatible task data, and then translates them into low-resource languages with bilingual lexicons via word-to-word translation. Across 17 extremely low-resource languages, LexC-Gen-generated data is competitive with expert-translated gold data, and yields on average 5.6- and 8.9-point improvements over existing lexicon-based word translation methods on sentiment analysis and topic classification tasks, respectively. We show that conditioning on bilingual lexicons is the key component of LexC-Gen. LexC-Gen is also practical -- it needs only a single GPU to generate data at scale, works well with open-access LLMs, and costs one-fifth as much as GPT-4-based multilingual data generation.

Figure: Accuracy impact of lexicon conditioning on sentiment analysis, compared with fine-tuning on expert-translated gold data.

Overview

  • LexC-Gen is introduced as a novel methodology for generating classification task data for extremely low-resource languages (LRLs) using LLMs and bilingual lexicons.

  • The methodology involves generating high-resource-language (HRL) data conditioned on bilingual lexicons and translating these datasets into LRLs through word-to-word substitution.

  • Empirical evaluations show significant performance improvements in sentiment analysis and topic classification tasks for 17 LRLs, making LexC-Gen a cost-effective and scalable solution for NLP challenges in low-resource settings.

Generating Data for Extremely Low-Resource Languages with LLMs and Bilingual Lexicons

In NLP, the scarcity of labeled data is a significant obstacle for extremely low-resource languages (LRLs). This paper introduces a novel approach, lexicon-conditioned data generation (LexC-Gen), which leverages LLMs and bilingual lexicons to generate classification task data at scale for such languages.

Methodology and Contributions

The approach of translating labeled data from high-resource languages (HRLs) using bilingual lexicons is not new, but the authors recognize a key issue: existing task data and bilingual lexicons often exhibit low lexical overlap. This mismatch results in suboptimal translation coverage and underutilization of lexicons. To address these challenges, the authors propose LexC-Gen, a two-stage methodology designed to maximize the lexical overlap between task data and bilingual lexicons:
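The mismatch the authors describe can be made concrete with two simple metrics: translation coverage (how many task-data tokens the lexicon can translate) and lexicon utilization (how many lexicon entries the task data actually uses). The sketch below computes both on toy data; the example sentences and lexicon entries are illustrative, not from the paper.

```python
# Sketch: quantifying lexical overlap between task data and a bilingual lexicon.
# Toy sentences and lexicon entries below are hypothetical, for illustration only.

def translation_coverage(sentences, lexicon):
    """Fraction of task-data tokens that have an entry in the lexicon."""
    tokens = [tok.lower() for s in sentences for tok in s.split()]
    covered = sum(1 for tok in tokens if tok in lexicon)
    return covered / len(tokens)

def lexicon_utilization(sentences, lexicon):
    """Fraction of lexicon entries that actually appear in the task data."""
    vocab = {tok.lower() for s in sentences for tok in s.split()}
    used = sum(1 for word in lexicon if word in vocab)
    return used / len(lexicon)

task_data = ["the movie was great", "the plot felt slow"]
en_to_lrl = {"movie": "filem", "great": "hebat", "river": "sungai", "slow": "perlahan"}

print(translation_coverage(task_data, en_to_lrl))  # 3 of 8 tokens covered -> 0.375
print(lexicon_utilization(task_data, en_to_lrl))   # 3 of 4 entries used  -> 0.75
```

Low values on either metric signal exactly the failure mode LexC-Gen targets: untranslated tokens remain in the high-resource language, and much of the lexicon goes unused.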

  1. Lexicon-Compatible High-Resource Language Data Generation: LexC-Gen initially uses LLMs to generate high-resource-language task data conditioned on words from bilingual lexicons. This step ensures that the generated data have a high lexical overlap with the lexicon, thereby improving the quality of subsequent translations.
  2. Word-to-Word Translation: Following the generation of lexicon-compatible HRL data, these data are translated into LRLs using the bilingual lexicon through word-to-word substitution.
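The two stages above can be sketched as follows. The `build_prompt` helper and the tiny English-to-LRL lexicon are hypothetical stand-ins (the actual LLM call and lexicon are not shown in this summary); only the overall flow mirrors the described method.

```python
# Minimal sketch of LexC-Gen's two stages, under assumed names:
# `build_prompt` stands in for prompting an instruction-tuned LLM,
# and `en_to_lrl` is an illustrative bilingual lexicon.

import random

def build_prompt(lexicon, label, n_words=3):
    """Stage 1: sample high-resource-language words from the lexicon and
    condition generation on them, so the output overlaps with the lexicon."""
    words = random.sample(sorted(lexicon), k=min(n_words, len(lexicon)))
    return (f"Write a sentence with {label} sentiment "
            f"using the words: {', '.join(words)}")

def word_to_word_translate(sentence, lexicon):
    """Stage 2: substitute each token with its low-resource-language
    translation where the lexicon has an entry; keep other tokens as-is."""
    return " ".join(lexicon.get(tok.lower(), tok) for tok in sentence.split())

en_to_lrl = {"movie": "filem", "great": "hebat", "slow": "perlahan"}

print(build_prompt(en_to_lrl, "positive"))
# A lexicon-compatible sentence an LLM might return for such a prompt:
generated = "the movie was great"
print(word_to_word_translate(generated, en_to_lrl))  # "the filem was hebat"
```

Because stage 1 draws its content words from the lexicon itself, stage 2's substitution covers most tokens of the generated sentence, which is the core idea behind the improved translation coverage.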

The efficacy of LexC-Gen was evaluated across 17 extremely low-resource languages on sentiment analysis and topic classification tasks. Classifiers trained on LexC-Gen-generated data showed significant improvements, with average gains of 5.6 points in sentiment analysis and 8.9 points in topic classification over existing lexicon-based methods. Notably, these classifiers were competitive with those trained on expert-translated gold data, despite the synthetic nature of the training data.

Key Findings and Implications

  1. Improved Lexical Overlap and Translation Quality: The lexicon-conditioned generation method ensures high lexical overlap, leading to better translation coverage and lexicon utilization. This enhancement directly contributes to the improved performance of LRL classifiers.
  2. Cost-Effectiveness and Practicality: LexC-Gen is computationally efficient, requiring only a single GPU to generate data at scale, making it accessible for researchers with limited computational resources. The cost of generating data using open-access LLMs with permissive licenses (e.g., BLOOMZ) is only a fifth of that required by GPT-4-based methods.
  3. Scalability and Flexibility: The methodology is robust and scalable, capable of generating large volumes of training data swiftly. This scalability is crucial for significantly underrepresented languages where collecting labeled data is otherwise prohibitively difficult.
  4. Cross-Lingual Applications: The ability to generate high-quality synthetic data for LRLs opens new avenues for advancing NLP research and applications in multilingual settings. By improving data availability, LRLs can benefit from advanced NLP techniques historically limited to HRLs.

Future Developments

Given the promising results of LexC-Gen, future research could focus on several exciting directions:

  1. Expanding Task Domains: While the current study focuses on sentiment analysis and topic classification, evaluating the methodology on other NLP tasks, such as named entity recognition or machine translation, could further validate and expand the utility of LexC-Gen.
  2. Enhancing Translation Accuracy: Integrating linguistic information or contextual data into bilingual lexicons could mitigate issues related to word sense disambiguation, thus refining the translation process and improving data quality.
  3. Exploration of Further LLMs: Additional studies could investigate the performance of other instruction-tuned LLMs or alternatives to BLOOMZ, optimizing for different languages and tasks.
  4. Incorporating Syntactic Structures: Addressing syntactic mismatches between HRLs and LRLs through syntactic transformation techniques could enhance the applicability of LexC-Gen across a wider variety of languages and technical contexts.

In conclusion, the LexC-Gen framework introduces a practical and scalable solution to the data scarcity problem in NLP for low-resource languages, leveraging the strength of LLMs and the breadth of bilingual lexicons. The method not only offers a significant performance boost over traditional lexicon-based methods but also underscores the potential of synthetic data in bridging linguistic disparities. The implications of this research extend beyond immediate performance improvements, highlighting a pathway towards more inclusive and representative linguistic technologies.
