
Contextual Text Denoising with Masked Language Models (1910.14080v2)

Published 30 Oct 2019 in cs.CL and cs.LG

Abstract: Recently, with the help of deep learning models, significant advances have been made in different NLP tasks. Unfortunately, state-of-the-art models are vulnerable to noisy texts. We propose a new contextual text denoising algorithm based on a ready-to-use masked language model. The proposed algorithm does not require retraining of the model and can be integrated into any NLP system without additional training on paired clean and noisy data. We evaluate our method under synthetic and natural noise and show that the proposed algorithm can use context information to correct noisy text and improve performance on noisy inputs in several downstream tasks.
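To make the core idea concrete, here is a minimal sketch of using an off-the-shelf masked language model to propose a contextual correction for a suspicious token. This is not the authors' exact pipeline (the paper's candidate filtering and scoring details are omitted); it assumes the Hugging Face `transformers` library with the `bert-base-uncased` checkpoint, and the function name `denoise_token` is illustrative.

```python
# Sketch: mask a suspected-noisy token and let a pretrained masked LM
# suggest contextual replacements. Assumes `pip install torch transformers`.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def denoise_token(words, noisy_index, top_k=5):
    """Mask the word at `noisy_index` and return the MLM's top contextual guesses."""
    masked = list(words)
    masked[noisy_index] = tokenizer.mask_token
    inputs = tokenizer(" ".join(masked), return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Locate the [MASK] position and rank vocabulary items by probability.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0][0]
    probs = logits[0, mask_pos].softmax(dim=-1)
    top = probs.topk(top_k)
    return [tokenizer.convert_ids_to_tokens(i.item()) for i in top.indices]

# Example: "tge" is a noisy version of "the"; context should recover it.
print(denoise_token("he went to tge store yesterday".split(), 3))
```

In the paper's setting, the model's prediction is combined with context so that no paired clean/noisy training data or retraining is needed; the sketch above only shows the contextual-prediction step.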

