A Large Parallel Corpus of Full-Text Scientific Articles

Published 6 May 2019 in cs.CL | (1905.01852v1)

Abstract: The Scielo database is an important source of scientific information in Latin America, containing articles from several research domains. A striking characteristic of Scielo is that many of its full-text contents are presented in more than one language, thus being a potential source of parallel corpora. In this article, we present the development of a parallel corpus from Scielo in three languages: English, Portuguese, and Spanish. Sentences were automatically aligned using the Hunalign algorithm for all language pairs, and for a subset of trilingual articles also. We demonstrate the capabilities of our corpus by training a Statistical Machine Translation system (Moses) for each language pair, which outperformed related works on scientific articles. Sentence alignment was also manually evaluated, presenting an average of 98.8% correctly aligned sentences across all languages. Our parallel corpus is freely available in the TMX format, with complementary information regarding article metadata.

Summary

The paper describes the creation of a large multilingual parallel corpus from full-text scientific articles in English, Portuguese, and Spanish using automated sentence alignment.
This corpus, including over 2.9 million aligned sentences for EN-PT, significantly improves Statistical Machine Translation (SMT) performance compared to previous benchmarks.
The resource supports NLP tasks such as multilingual text mining and cross-language plagiarism detection, and it can help improve Named Entity Recognition (NER) tools.

The paper "A Large Parallel Corpus of Full-Text Scientific Articles" describes the creation of a parallel corpus from full-text scientific articles in the Scielo database, a key source of scientific literature in Latin America. Many Scielo articles are published in more than one of English, Portuguese, and Spanish, which makes the database a suitable source of data for NLP tasks and Statistical Machine Translation (SMT) applications.

Key Contributions and Methodology

Corpus Construction: The authors built the parallel corpus from Scielo articles available in English, Portuguese, and Spanish. Sentences were aligned automatically with the Hunalign algorithm, which aligns multilingual texts using sentence-length information and dictionary-based realignment.
Scope and Scale: This work presents an improvement over previous efforts by including full-text articles across multiple domains beyond the biomedical scope. The corpus comprises more than 2.9 million aligned sentences for the English-Portuguese language pair alone, alongside significant datasets for the other language pairs and trilingual samples.
Structural Alignment and Metadata: The corpus is organized according to the hierarchical structure of articles, preserving sections and paragraphs, which benefits tasks such as text summarization. Additionally, metadata such as journal name and subject area are included, enhancing the utility for text classification.
Legal Considerations: To comply with Creative Commons licenses, only articles whose licenses permit derivative works are included, so the corpus can be distributed legally. This matters because building the corpus modifies the articles, for example by removing non-textual elements.
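
The length-based part of sentence alignment mentioned under Corpus Construction can be illustrated with a toy aligner. The sketch below is not Hunalign and not the authors' pipeline: it is a minimal dynamic program that assumes character length is the only signal. The bead penalties in `BEADS` and the names `length_cost` and `align` are chosen purely for illustration, and the dictionary-based step that Hunalign also uses is omitted.

```python
from typing import List, Tuple

# Allowed bead shapes (source sentences, target sentences), each with a fixed
# penalty. The penalty values are illustrative assumptions, not Hunalign's.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len: int, tgt_len: int) -> float:
    """Relative character-length mismatch in [0, 1]; 0 means equal length."""
    longest = max(src_len, tgt_len)
    return abs(src_len - tgt_len) / longest if longest else 0.0


def align(src: List[str], tgt: List[str]) -> List[Tuple[List[int], List[int]]]:
    """Return beads as (source indices, target indices), using length only."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Fill the table in row-major order; every bead moves forward in it.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + penalty + length_cost(s_len, t_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)

    # Walk the back-pointers from the end to recover the cheapest path.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Calling `align(src, tgt)` on two lists of sentences returns a list of index pairs; a bead such as `([0, 1], [0])` means the first two source sentences were matched to the first target sentence. A real aligner would add lexical evidence on top of this length signal.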

Evaluation and Results

SMT Performance: The paper evaluates the corpus by training SMT systems with Moses. The reported BLEU scores are 48.51 for EN→PT and 49.24 for PT→EN, which the authors report as higher than those of related work on scientific articles.
Alignment Quality: Manual evaluation of sentence alignment found that correct alignments exceeded 98% for all language pairs, with an average of 98.8%, illustrating the robustness of the Hunalign algorithm when extended with domain-specific dictionaries.
Comparison with EuroMatrix: BLEU scores depend on the domain-specific traits of each corpus, so comparisons across corpora are only indicative. With that caveat, the results are comparable to established benchmarks such as those on the Europarl corpus, which suggests competitive quality in the scientific article domain.
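
For readers unfamiliar with how BLEU figures like those above are obtained, the metric combines clipped n-gram precisions (up to 4-grams) with a brevity penalty. The following is a minimal single-reference sketch, assuming whitespace tokenization and no smoothing. It is not the scoring script used in the paper, and its numbers will not match standard tools exactly because tokenization differs. The names `ngrams` and `corpus_bleu` are illustrative.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps: List[str], refs: List[str], max_n: int = 4) -> float:
    """Corpus-level BLEU (0-100) with one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0

    for hyp, ref in zip(hyps, refs):
        h, r = hyp.split(), ref.split()  # assumption: whitespace tokens
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_counts, r_counts = ngrams(h, n), ngrams(r, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, r_counts[g]) for g, c in h_counts.items())
            totals[n - 1] += sum(h_counts.values())

    # Without smoothing, any zero precision makes the geometric mean zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty lowers the score when the output is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

Because the score depends on tokenization and on the test set, values computed this way are only comparable to each other, not to figures published with a different setup.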

Implications and Future Work

The corpus is designed to support a range of NLP applications, including multilingual text mining, cross-language plagiarism detection, and potentially the improvement of Named Entity Recognition (NER) tools across multiple languages. The authors suggest future directions such as using the corpus to train Neural Machine Translation (NMT) systems and applying it to text classification tasks. Expansion into more domains and additional language pairs could further broaden the applicability and impact of the corpus.
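
Since the abstract states that the corpus is distributed in the TMX format, any of the applications above would start by extracting sentence pairs from TMX files. The sketch below uses only the Python standard library and assumes the generic TMX layout: `<tu>` units containing `<tuv xml:lang="...">` variants with one `<seg>` each, and no XML namespace on the tags. The actual layout, language codes, and metadata fields of the released files are not described here and may differ. `read_tmx` and `SAMPLE` are illustrative names, and the sample sentences are invented.

```python
import io
import xml.etree.ElementTree as ET
from typing import Dict, Iterator

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx(source) -> Iterator[Dict[str, str]]:
    """Yield one {language: sentence} dict per <tu> translation unit.

    `source` is a file path or a binary file object. Parsing is incremental,
    so large files are not loaded into memory at once.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        if unit:
            yield unit
        elem.clear()  # release the children of the processed unit


# Invented two-language sample in the assumed layout.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>The corpus is freely available.</seg></tuv>
      <tuv xml:lang="pt"><seg>O corpus está disponível gratuitamente.</seg></tuv>
    </tu>
  </body>
</tmx>""".encode("utf-8")

pairs = list(read_tmx(io.BytesIO(SAMPLE)))
for pair in pairs:
    print(pair["en"], "|||", pair["pt"])
```

Passing a file path instead of the in-memory sample streams a real file unit by unit; trilingual units would simply produce dictionaries with three keys.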