Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (1910.10683v4)
Abstract: Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in NLP. The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
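As a concrete illustration of the text-to-text format described in the abstract, the sketch below casts a few representative tasks into (input, target) string pairs. The task prefixes ("translate English to German:", "summarize:", "cola sentence:", "stsb sentence1/sentence2:") follow the conventions reported in the paper; the function name, example fields, and helper logic are illustrative assumptions rather than the paper's released preprocessing code.

```python
# Minimal sketch of the text-to-text casting described in the abstract.
# The task prefixes follow the conventions reported in the paper; the helper
# name, example record fields, and label wording here are illustrative only.

def to_text_to_text(task: str, example: dict) -> tuple[str, str]:
    """Convert a task-specific example into an (input_text, target_text) pair."""
    if task == "translate_en_de":
        return (f"translate English to German: {example['en']}", example["de"])
    if task == "summarize":
        return (f"summarize: {example['article']}", example["summary"])
    if task == "cola":
        # Classification: labels become literal target words.
        return (f"cola sentence: {example['sentence']}",
                "acceptable" if example["label"] == 1 else "unacceptable")
    if task == "stsb":
        # Regression cast as text: similarity score rounded to the nearest 0.2.
        return (f"stsb sentence1: {example['s1']} sentence2: {example['s2']}",
                f"{round(example['score'] * 5) / 5:.1f}")
    raise ValueError(f"unknown task: {task}")


if __name__ == "__main__":
    inp, tgt = to_text_to_text(
        "translate_en_de",
        {"en": "That is good.", "de": "Das ist gut."},
    )
    print(inp)  # translate English to German: That is good.
    print(tgt)  # Das ist gut.
```

Because every task shares this single string-in, string-out interface, one model, objective, and decoding procedure can be reused across translation, summarization, classification, and regression without task-specific heads.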