Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text (2311.18805v1)
Abstract: While LLMs have achieved remarkable performance in many tasks, much about their inner workings remains unclear. In this study, we present novel experimental insights into the resilience of LLMs, particularly GPT-4, when subjected to extensive character-level permutations. To investigate this, we first propose the Scrambled Bench, a suite designed to measure the capacity of LLMs to handle scrambled input, in terms of both recovering scrambled sentences and answering questions given scrambled context. The experimental results indicate that the most powerful LLMs demonstrate a capability akin to typoglycemia, the phenomenon whereby humans can understand the meaning of words even when the letters within those words are scrambled, as long as the first and last letters remain in place. More surprisingly, we find that only GPT-4 processes inputs with such unnatural errors nearly flawlessly, even under the most extreme condition, a task that poses significant challenges for other LLMs and often even for humans. Specifically, GPT-4 can almost perfectly reconstruct the original sentences from scrambled ones, decreasing the edit distance by 95%, even when all letters within each word are entirely scrambled. This resilience is counter-intuitive given the severe disruption that scrambled text causes to input tokenization.
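To make the two scrambling settings and the recovery measure concrete, here is a minimal illustrative sketch in Python. The function names (scramble_word, recovery_rate, etc.) are my own and are not taken from the released Scrambled Bench; the metric is assumed here to be the relative reduction in Levenshtein edit distance between the scrambled input and the model's reconstruction, consistent with the "decreasing the edit distance by 95%" figure quoted in the abstract.

```python
import random

def scramble_word(word: str, keep_edges: bool = True) -> str:
    """Scramble the letters of one word.

    keep_edges=True mimics the typoglycemia setting (first and last
    letters stay in place); keep_edges=False shuffles every letter,
    i.e. the extreme condition described in the abstract.
    """
    if keep_edges:
        if len(word) <= 3:
            return word
        middle = list(word[1:-1])
        random.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]
    letters = list(word)
    random.shuffle(letters)
    return "".join(letters)

def scramble_sentence(sentence: str, keep_edges: bool = True) -> str:
    """Apply word-level scrambling to every whitespace-separated token."""
    return " ".join(scramble_word(w, keep_edges) for w in sentence.split())

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (Levenshtein, 1966)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute
        prev = curr
    return prev[-1]

def recovery_rate(original: str, scrambled: str, recovered: str) -> float:
    """Relative reduction in edit distance to the original sentence
    achieved by the model's reconstruction (1.0 = perfect recovery)."""
    before = levenshtein(scrambled, original)
    after = levenshtein(recovered, original)
    return (before - after) / before if before else 1.0

# Example: fully scramble a sentence, then score a hypothetical reconstruction.
src = "large language models are surprisingly robust"
scr = scramble_sentence(src, keep_edges=False)
print(scr)
print(recovery_rate(src, scr, src))  # a perfect reconstruction scores 1.0
```

Under these assumptions, a recovery rate of 0.95 corresponds to the "decreasing the edit distance by 95%" result reported for GPT-4 in the fully scrambled setting.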