GECToR -- Grammatical Error Correction: Tag, Not Rewrite (2005.12592v2)

Published 26 May 2020 in cs.CL and cs.LG

Abstract: In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an $F_{0.5}$ of 65.3/66.5 on CoNLL-2014 (test) and $F_{0.5}$ of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system. The code and trained models are publicly available.

Summary

The paper introduces a sequence tagging approach that transforms tokens individually to correct grammatical errors, offering an efficient alternative to traditional seq2seq models.
The model is pre-trained on synthetic data and then fine-tuned in two stages, first on errorful corpora and then on a combination of errorful and error-free corpora. The best single model reaches an $F_{0.5}$ of 65.3 on CoNLL-2014 (test) and 72.4 on BEA-2019 (test).
Inference is up to 10 times as fast as a Transformer-based seq2seq GEC system, which makes the tagger a good fit for real-time and resource-constrained applications.

GECToR: A Sequence Tagging Approach to Grammatical Error Correction

The paper introduces GECToR, an approach to Grammatical Error Correction (GEC) that tags each input token with an edit instead of generating the corrected sentence from scratch. The tagger is built on a Transformer encoder, with no autoregressive decoder. The paper shows that this design speeds up inference while keeping accuracy competitive with sequence generation methods.

Training proceeds in three stages: pre-training on a large synthetic dataset, fine-tuning on errorful corpora, and a final fine-tuning pass on a combination of errorful and error-free parallel corpora. On the CoNLL-2014 test set the model reaches $F_{0.5}$ scores of 65.3 (single model) and 66.5 (ensemble); on the BEA-2019 test set it reaches 72.4 and 73.6, respectively.
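The $F_{0.5}$ metric reported for both benchmarks is the F-beta score with beta = 0.5, which weights precision more heavily than recall. A minimal sketch of the computation follows; the function name and the precision and recall values in the usage line are illustrative assumptions, not figures from the paper.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Return the F-beta score; beta < 1 favors precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was predicted correctly
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only (not taken from the paper).
score = f_beta(precision=0.8, recall=0.4)
print(round(score, 4))
```

With beta = 0.5, a system that proposes fewer but more reliable corrections scores higher than one with the same harmonic mean but lower precision, which matches the usual preference in GEC for avoiding wrong edits.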

A key element of GECToR is its set of custom token-level transformations, each of which maps an input token to its target correction. These transformations cover common error types such as spelling, noun number, and verb form. Because the model predicts edits per token instead of decoding an output sequence, inference is up to ten times as fast as a Transformer-based sequence-to-sequence GEC system.
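To make the idea of token-level transformations concrete, here is a minimal sketch of how one predicted tag per source token could be applied to produce a corrected sentence. The tag names ($KEEP, $DELETE, $REPLACE_x, $APPEND_x), the function name, and the example sentence are assumptions chosen for illustration; this summary does not list the system's actual tag vocabulary, and the sketch omits the grammar-specific transformations (noun number, verb form) mentioned above.

```python
from typing import List


def apply_tags(tokens: List[str], tags: List[str]) -> List[str]:
    """Apply one edit tag per source token and return the corrected tokens."""
    if len(tokens) != len(tags):
        raise ValueError("need exactly one tag per token")
    out: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag == "$KEEP":
            out.append(token)                         # leave the token unchanged
        elif tag == "$DELETE":
            continue                                  # drop the token
        elif tag.startswith("$REPLACE_"):
            out.append(tag[len("$REPLACE_"):])        # substitute a new token
        elif tag.startswith("$APPEND_"):
            out.extend([token, tag[len("$APPEND_"):]])  # keep token, add one after it
        else:
            raise ValueError(f"unknown tag: {tag}")
    return out


# Hypothetical tagger output for an errorful sentence.
tokens = "She go to school yesterday .".split()
tags = ["$KEEP", "$REPLACE_went", "$KEEP", "$KEEP", "$KEEP", "$KEEP"]
print(" ".join(apply_tags(tokens, tags)))
```

Each tag depends only on its own position, so all tags can be predicted in one forward pass of the encoder; this is the property that lets a tagger avoid the token-by-token decoding loop of a seq2seq model.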

The experiments use several datasets, including PIE-synthetic, Lang-8, and NUCLE, under different training conditions. The authors use pretrained Transformer encoders including BERT, RoBERTa, and XLNet, and compare the resulting taggers with other state-of-the-art GEC models.

By focusing on sequence tagging, the GECToR system sidesteps some inherent challenges of seq2seq models, including the demand for extensive computing resources and issues surrounding interpretability. The authors address these by factorizing token-level edits into manageable transformations, considerably simplifying the error correction task.

The speed and efficiency gains matter in practice. GECToR is a strong candidate for deployment where computing resources are limited or where low latency is essential, for example in real-time editing tools and educational software.

In conclusion, GECToR offers an alternative to conventional GEC systems that combines competitive accuracy with much faster inference. Pairing the tagging approach with newer Transformer encoders may improve its results further.
