Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

(arXiv:1910.10683)
Published Oct 23, 2019 in cs.LG, cs.CL, and stat.ML

Abstract

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in NLP. The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
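The core idea is that every task (translation, classification, summarization, and so on) is cast as feeding the model an input string and training it to produce a target string, with a short prefix identifying the task. As a minimal illustration, the sketch below runs the released T5 checkpoints through the community Hugging Face `transformers` port rather than the authors' original TensorFlow codebase; the `t5-small` checkpoint name and prefixes such as `translate English to German:` and `summarize:` follow the paper's conventions, while the example inputs are made up for illustration.

```python
# Minimal sketch of the text-to-text framing: every task is "string in,
# string out", selected by a task prefix. Uses the community Hugging Face
# `transformers` port of the released T5 checkpoints (an assumption about
# tooling, not the authors' original code).
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

examples = [
    "translate English to German: That is good.",    # translation
    "cola sentence: The course is jumping well.",     # acceptability (GLUE CoLA)
    "summarize: state authorities dispatched emergency crews tuesday "
    "to survey the damage after an onslaught of severe weather.",  # summarization
]

for text in examples:
    # Tokenize the prefixed input and generate the target string.
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_length=50)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Because inputs and targets are always plain text, the same model, loss, and decoding procedure serve every task; only the task prefix and the training data change.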
