Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks

Published 2 Nov 2018 in cs.CL | (1811.01088v2)

Abstract: Pretraining sentence encoders with language modeling and related unsupervised tasks has recently been shown to be very effective for language understanding tasks. By supplementing language-model-style pretraining with further training on data-rich supervised tasks, such as natural language inference, we obtain additional performance improvements on the GLUE benchmark. Applying supplementary training on BERT (Devlin et al., 2018), we attain a GLUE score of 81.8, the state of the art (as of 02/24/2019) and a 1.4 point improvement over BERT. We also observe reduced variance across random restarts in this setting. Our approach yields similar improvements when applied to ELMo (Peters et al., 2018a) and Radford et al. (2018)'s model. In addition, the benefits of supplementary training are particularly pronounced in data-constrained regimes, as we show in experiments with artificially limited training data.


Authors (3)

Citations (457)


Summary

The paper introduces STILTs, adding an intermediate supervised training step to improve sentence encoder accuracy.
It demonstrates performance gains on models like BERT, GPT, and ELMo, achieving a GLUE score of 81.8 with BERT, with the largest benefits in low-data setups.
The study shows reduced variance across training runs, highlighting STILTs’ potential to stabilize fine-tuning in data-constrained regimes.


This paper examines Supplementary Training on Intermediate Labeled-data Tasks (STILTs), an approach to improving the performance of sentence encoders. The method builds on the prevalent paradigm of pretraining a model on large-scale unsupervised tasks and then fine-tuning it on a specific target task. The central hypothesis is that inserting an additional stage of training on a data-rich supervised task between these two steps improves target-task performance, particularly in data-constrained settings.
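The staged schedule described above (pretrained encoder, then an intermediate task, then the target task) can be sketched as a small orchestration loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `Stage`, `run_stilts`, and `make_toy_train` are hypothetical names, parameters are plain dictionaries, the training function is a stand-in for real fine-tuning, RTE is used only as an example of a target task, and re-initializing the task-specific head at each stage is assumed here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration: an "encoder" and a "head" are just
# dictionaries of named parameters here.
Params = Dict[str, float]
TrainFn = Callable[[Params, Params], Tuple[Params, Params]]


@dataclass
class Stage:
    name: str                        # label for the task trained in this stage
    init_head: Callable[[], Params]  # builds a fresh task-specific output layer
    train: TrainFn                   # updates (encoder, head) on this task's data


def run_stilts(pretrained_encoder: Params,
               stages: List[Stage]) -> Tuple[Params, Params, List[str]]:
    """Train the stages in order, carrying the encoder forward and
    re-initializing the task head at every stage (an assumption of this sketch)."""
    encoder = dict(pretrained_encoder)  # copy so the pretrained weights stay intact
    head: Params = {}
    history: List[str] = []
    for stage in stages:
        head = stage.init_head()        # the previous task's head is discarded
        encoder, head = stage.train(encoder, head)
        history.append(stage.name)
    return encoder, head, history


def make_toy_train(step: float) -> TrainFn:
    """Stand-in for real fine-tuning: nudges every parameter by `step`."""
    def train(encoder: Params, head: Params) -> Tuple[Params, Params]:
        new_encoder = {k: v + step for k, v in encoder.items()}
        new_head = {k: v + step for k, v in head.items()}
        return new_encoder, new_head
    return train


pretrained = {"w": 0.0}
stages = [
    Stage("intermediate (e.g. MNLI)", lambda: {"out": 0.0}, make_toy_train(1.0)),
    Stage("target (e.g. RTE)", lambda: {"out": 0.0}, make_toy_train(0.5)),
]
encoder, head, history = run_stilts(pretrained, stages)
print(history, encoder, head)
```

Dropping the first entry of `stages` gives the plain pretrain-then-fine-tune baseline, which is the comparison the paper's results rest on.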

Main Contributions

Introduction of STILTs: The authors propose STILTs, a supplementary training phase on intermediate tasks. These tasks provide additional labeled data which can bridge the gap between initial unsupervised pretraining and final supervised fine-tuning. The result is a potentially more robust and effective target task model.
Application to Existing Models: STILTs is applied to three well-known pretrained sentence encoders: BERT, GPT, and ELMo. The authors use a selection of intermediate tasks such as MNLI, SNLI, QQP, and a custom fake-sentence-detection task to empirically validate the efficacy of STILTs.
Performance on GLUE Benchmark: STILTs improves results on the GLUE benchmark, particularly on tasks with limited training data. Notably, BERT with supplementary training achieves a GLUE score of 81.8, a 1.4-point improvement over BERT alone and the state of the art as of 02/24/2019.
Reduced Variance in Training: Additionally, STILTs reduces variance in model performance across random restarts, which is particularly beneficial for tasks with small datasets where fine-tuning can be unstable.
Data-Constrained Regimes: The study includes experiments that simulate data-constrained scenarios by limiting the training set to 1k and 5k examples. In such regimes, STILTs provides even more pronounced performance gains, reinforcing its utility in real-world situations where labeled data is often scarce.
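The data-constrained setup in the last item can be approximated by drawing a fixed-size subset of a task's training set. The following is a hedged sketch: `subsample` is a hypothetical helper, uniform sampling without replacement is an assumption (the summary does not say how the 1k and 5k subsets were chosen), and `full_train` is synthetic stand-in data.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample(examples: Sequence[T], k: int, seed: int) -> List[T]:
    """Return k examples drawn without replacement, reproducibly for a given seed."""
    if k >= len(examples):
        return list(examples)      # nothing to cut: keep the full set
    rng = random.Random(seed)      # local RNG so global random state is untouched
    return rng.sample(list(examples), k)


# Synthetic (text, label) pairs standing in for a real training set.
full_train = [(f"sentence {i}", i % 2) for i in range(20_000)]

# The two limited-data regimes mentioned above: 1k and 5k training examples.
limited = {k: subsample(full_train, k, seed=0) for k in (1_000, 5_000)}
print({k: len(v) for k, v in limited.items()})
```

Fixing the seed keeps the subset identical across the baseline and STILTs runs, so any difference in score comes from the training schedule and not from which examples were kept.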

Implications and Future Directions

The findings suggest that STILTs can be a valuable extension to the current pretraining and fine-tuning paradigm, offering improvements in both accuracy and stability. It makes effective use of intermediate tasks that either share some structural similarity with the target task or are simply data-rich.
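One way to quantify the stability claim is to compare the mean and sample standard deviation of target-task scores across random restarts. The sketch below assumes that metric; `summarize_restarts` is a hypothetical helper, and the score lists are placeholder values chosen for illustration, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_restarts(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of scores from repeated runs."""
    return mean(scores), stdev(scores)


# Placeholder values only (NOT from the paper): one score per random seed.
placeholder_baseline = [0.50, 0.70, 0.60, 0.80, 0.40]
placeholder_with_intermediate = [0.68, 0.70, 0.72, 0.69, 0.71]

for label, scores in [("baseline", placeholder_baseline),
                      ("with intermediate task", placeholder_with_intermediate)]:
    avg, spread = summarize_restarts(scores)
    print(f"{label}: mean={avg:.3f} std={spread:.3f}")
```

A lower standard deviation at a similar or higher mean is the pattern that the reduced-variance finding describes.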

Future work could explore a broader selection of intermediate tasks and study the interactions between different intermediate and target tasks to fully exploit STILTs' potential. Additionally, this approach could be valuable beyond natural language tasks, extending to various domains involving structured prediction problems.

Overall, STILTs is a simple addition to the training pipeline, and its results argue for a more nuanced view of transfer learning with deep language models.
