Learning Word Vectors for 157 Languages

Published 19 Feb 2018 in cs.CL and cs.LG | (1802.06893v2)

Abstract: Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.



Summary

The paper introduces high-quality word embeddings for 157 languages using combined data from Wikipedia and Common Crawl.
It evaluates skipgram and CBOW models with subword enhancements, achieving significant improvements in multilingual word analogy tasks.
The study demonstrates that leveraging diverse data sources boosts performance, especially for low-resource languages.


Overview

The paper "Learning Word Vectors for 157 Languages," by Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov, describes how the authors trained high-quality word embeddings for 157 languages. They draw on both Wikipedia and the Common Crawl project, which lets them cover a wide range of languages, including those with small corpora. To evaluate the vectors, they introduce new word analogy datasets for French, Hindi, and Polish. They then compare their vectors against previous models on ten languages for which evaluation datasets exist and report very strong performance.

Data Sources and Preprocessing

The training data combines Wikipedia and Common Crawl. The authors note that Wikipedia text is high quality because it is curated, but its coverage and volume are limited for many languages. Common Crawl data is noisier but offers far more text and broader linguistic coverage.

The preprocessing pipeline includes:

Language Identification: A fastText-based language detector is employed to classify each line of text and retain only those with a high confidence score.
Deduplication: Duplicate lines are removed by hashing, which is particularly critical for the web data to eliminate boilerplate content.
Tokenization: Various tokenizers are used according to the script and language, ensuring proper conversion of raw text to a tokenized format suitable for training.
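The first two steps above can be sketched as a single streaming filter. This is a minimal illustration, not the authors' pipeline: the function name `filter_and_deduplicate`, the `identify` callable (a stand-in for a real language identifier), and the `min_confidence` default are all assumptions introduced here.

```python
import hashlib
from typing import Callable, Iterable, Iterator, Tuple


def filter_and_deduplicate(
    lines: Iterable[str],
    identify: Callable[[str], Tuple[str, float]],
    target_lang: str,
    min_confidence: float = 0.8,  # assumed threshold, not stated in the text above
) -> Iterator[str]:
    """Yield lines detected as target_lang, skipping low-confidence and repeated lines."""
    seen = set()  # hashes of lines already emitted
    for line in lines:
        text = line.strip()
        if not text:
            continue
        # `identify` is a hypothetical callable returning (language, confidence).
        lang, confidence = identify(text)
        if lang != target_lang or confidence < min_confidence:
            continue
        # Store a fixed-size digest instead of the full line to bound memory use.
        digest = hashlib.sha1(text.encode("utf-8")).digest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text
```

Storing digests rather than raw lines keeps memory proportional to the number of distinct lines, which matters for web-scale text where boilerplate repeats heavily.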

Model Architectures

Two main models are evaluated:

Skipgram Model with Subword Information: The fastText skipgram model, which extends the original skipgram model with character n-grams to improve the quality and robustness of word vectors.
CBOW (Continuous Bag of Words) with Position Weights and Subword Information: Another variant that uses context words with positional information to predict target words, also enriched with subword data.
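To make the subword idea concrete, the sketch below extracts character n-grams from a word wrapped in boundary markers and builds a word vector by summing n-gram vectors. It is a simplification under stated assumptions: the helper names, the `<`/`>` boundary markers, the n-gram length defaults, and the plain dictionary lookup are illustrative choices (a production system would typically hash n-grams into a fixed number of buckets).

```python
from typing import Dict, List, Sequence


def char_ngrams(word: str, min_n: int = 3, max_n: int = 6) -> List[str]:
    """Return all character n-grams of the word wrapped in boundary markers."""
    wrapped = f"<{word}>"  # markers distinguish prefixes and suffixes from word-internal n-grams
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


def word_vector(
    word: str,
    ngram_vectors: Dict[str, Sequence[float]],
    dim: int,
    min_n: int = 3,
    max_n: int = 6,
) -> List[float]:
    """Sum the vectors of the word's known n-grams; unknown n-grams contribute nothing."""
    total = [0.0] * dim
    for gram in char_ngrams(word, min_n, max_n):
        vector = ngram_vectors.get(gram)
        if vector is not None:
            for i in range(dim):
                total[i] += vector[i]
    return total


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
```

Because the vector is assembled from n-grams, a word never seen in training still receives a representation as long as some of its n-grams were seen.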

Both models are further tuned, notably by increasing the number of negative examples and the number of training epochs, to make the resulting vectors more accurate and robust.
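For the position-weighted CBOW variant listed above, one common formulation (assumed here, since this summary does not spell it out) scales each context word vector element-wise by a learned vector tied to its offset from the target word before summing. The function name and the dictionary-based layout below are hypothetical.

```python
from typing import Dict, List, Sequence


def position_weighted_context(
    context: Dict[int, Sequence[float]],
    position_weights: Dict[int, Sequence[float]],
) -> List[float]:
    """Combine context word vectors, each scaled element-wise by its offset's weight vector.

    `context` maps a relative offset (e.g. -2, -1, 1, 2) to that context word's vector.
    `position_weights` maps the same offsets to weight vectors (learned in a real model).
    """
    dim = len(next(iter(context.values())))
    hidden = [0.0] * dim
    for offset, vector in context.items():
        weights = position_weights[offset]
        for i in range(dim):
            hidden[i] += weights[i] * vector[i]
    return hidden
```

With all weight vectors set to ones, this reduces to the plain bag-of-words sum, which ignores word order; the per-offset weights are what let the model treat "one word to the left" differently from "two words to the right".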

Evaluation Procedures

The primary evaluation is the word analogy task, where the goal is to complete an analogy such as Paris : France :: Berlin : ? and performance is measured as the fraction of analogies answered correctly. The task is run on ten languages, using datasets from prior work together with the newly introduced datasets for French, Hindi, and Polish.
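A standard way to answer such analogy questions, assumed here rather than taken from this summary, is vector arithmetic: compute b - a + c and return the nearest remaining word by cosine similarity. The toy four-dimensional vectors below are made up purely for illustration and are not real embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, defined as 0.0 when either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def solve_analogy(a: str, b: str, c: str, vectors: Dict[str, Sequence[float]]) -> str:
    """Answer a : b :: c : ? with the word closest to b - a + c, excluding a, b, and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(
    questions: List[Tuple[str, str, str, str]],
    vectors: Dict[str, Sequence[float]],
) -> float:
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(solve_analogy(a, b, c, vectors) == d for a, b, c, d in questions)
    return correct / len(questions)


# Toy vectors (invented for this example): [french, german, capital, country]
toy = {
    "Paris": [1.0, 0.0, 1.0, 0.0],
    "France": [1.0, 0.0, 0.0, 1.0],
    "Berlin": [0.0, 1.0, 1.0, 0.0],
    "Germany": [0.0, 1.0, 0.0, 1.0],
    "Spain": [0.0, 0.0, 0.0, 1.0],
}
print(solve_analogy("Paris", "France", "Berlin", toy))  # Germany
```

Excluding the three query words from the candidates is the usual convention, since the nearest neighbor of b - a + c is otherwise often one of the inputs.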

Results

The experimental results reveal several key insights:

Model Enhancements: Incorporating character ngrams of varying lengths, adding more negative samples, and increasing training epochs significantly boost model accuracy.
Training Data Influence: For high-resource languages, combining Wikipedia and Common Crawl data slightly improves or maintains performance. However, for low-resource languages, the inclusion of Common Crawl data results in significant accuracy gains.
Overall Performance: The enriched CBOW model consistently outperforms the baseline fastText skipgram model, highlighting the effectiveness of the proposed enhancements.

Implications and Future Directions

Practically, high-quality word vectors for 157 languages can benefit many NLP applications, particularly in multilingual settings, and open avenues for improved machine translation, cross-lingual information retrieval, and cultural data analysis. Theoretically, the work shows that a mixed-source data approach can mitigate disparities in data availability across languages, suggesting new directions for improving fairness and inclusivity in NLP models.

Future research could explore better techniques for handling noisy crawl data and for improving word vector quality in extremely low-resource languages. Exploring domain-specific variations and extending this work to other embedding types, such as contextual word embeddings, could further advance the field.

In summary, the comprehensive approach to training word vectors on both Wikipedia and Common Crawl data, together with the robust evaluation, underlines the significance of this study for multilingual NLP.
