Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding (2004.14870v4)

Published 30 Apr 2020 in cs.CL, cs.AI, cs.LG, and cs.NE

Abstract: Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.
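As an illustrative sketch of the encoding step, the Python snippet below lemmatizes each word to a base form and reinjects the lost grammatical information as a special tag symbol. This is an approximation of the paper's idea, not the authors' released implementation: the NLTK lemmatizer, the Penn Treebank tags, and the <TAG> symbol format are all stand-ins for BITE's own base-form extraction and inflection-symbol inventory.

# Sketch of Base-Inflection Encoding: reduce each word to a base form,
# then reinject the grammatical signal as a special symbol.
import nltk
from nltk.stem import WordNetLemmatizer

# Newer NLTK releases may instead need "punkt_tab" and
# "averaged_perceptron_tagger_eng"; extra downloads fail quietly.
for resource in ("punkt", "punkt_tab", "wordnet",
                 "averaged_perceptron_tagger",
                 "averaged_perceptron_tagger_eng"):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()

# Map the first letter of a Penn Treebank tag to a WordNet POS class.
PTB_TO_WORDNET = {"N": "n", "V": "v", "J": "a", "R": "r"}

def bite_encode(sentence: str) -> list[str]:
    """Encode a sentence as base forms plus inflection symbols."""
    encoded = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        wn_pos = PTB_TO_WORDNET.get(tag[0], "n")
        base = lemmatizer.lemmatize(word.lower(), pos=wn_pos)
        encoded.append(base)
        if base != word.lower():
            # The word was inflected: reinject the grammatical
            # information as a special symbol, here the fine-grained
            # PTB tag (e.g. VBD for past tense, NNS for plural).
            encoded.append(f"<{tag}>")
    return encoded

print(bite_encode("She walked two dogs"))
# -> ['she', 'walk', '<VBD>', 'two', 'dog', '<NNS>']

Because non-standard inflections map to the same base form as their standard counterparts, a model fine-tuned on this encoding sees a more uniform token stream across dialects, which is the intuition behind BITE's robustness gains.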

Authors (4)
  1. Samson Tan (21 papers)
  2. Shafiq Joty (187 papers)
  3. Lav R. Varshney (126 papers)
  4. Min-Yen Kan (92 papers)
Citations (32)
