Shortformer: Better Language Modeling using Shorter Inputs

Published 31 Dec 2020 in cs.CL | (2012.15832v2)

Abstract: Increasing the input length has been a driver of progress in language modeling with transformers. We identify conditions where shorter inputs are not harmful, and achieve perplexity and efficiency improvements through two new methods that decrease input length. First, we show that initially training a model on short subsequences before moving on to longer ones both reduces overall training time and, surprisingly, substantially improves perplexity. Second, we show how to improve the efficiency of recurrence methods in transformers, which let models condition on previously processed tokens when generating sequences that exceed the maximal length the transformer can handle at once. Existing methods require computationally expensive relative position embeddings; we introduce a simple alternative of adding absolute position embeddings to queries and keys instead of to word embeddings, which efficiently produces superior results. We show that these recurrent models also benefit from short input lengths. Combining these techniques speeds up training by a factor of 1.65, reduces memory usage, and substantially improves perplexity on WikiText-103, without adding any parameters.

Summary

The paper introduces a two-stage training process that trains on short subsequences before longer ones, cutting training time and memory usage while improving perplexity.
It introduces Position-Infused Attention, which adds absolute position embeddings to the queries and keys rather than to the word embeddings, so that cached representations can be reused efficiently.
Empirical results on WikiText-103 show perplexity falling from 18.65 to 17.47 and training running 1.65 times faster than the baseline.

The paper "Shortformer: Better Language Modeling Using Shorter Inputs" challenges the prevailing assumption that longer input sequences invariably lead to better performance in transformer-based language models. The researchers propose two innovations, staged training and position-infused attention, which allow transformers to use shorter inputs without hurting perplexity and, in the reported experiments, while improving it.

Key Contributions

This work introduces and tests methods for improving both the efficiency and the perplexity of transformer language models by using shorter input sequences. The two key contributions are outlined below:

Staged Training: The authors propose a two-stage training process in which a model is first trained on short subsequences and then on longer ones. This approach is shown to reduce overall training time, decrease memory usage, and, surprisingly, improve perplexity. The improvement is attributed to the lower complexity the model faces in the initial stage of training, which may help it generalize better.
Position-Infused Attention (PIA): The paper introduces a different way of incorporating position information into the attention mechanism of transformers. Absolute position embeddings are added to the queries and keys rather than to the word embeddings, so the representations computed for one subsequence carry no explicit position information and can be cached and reused when the model processes the next subsequence. This removes the need for computationally expensive relative position embeddings while still improving perplexity.
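The position-infused attention idea described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a single head and a single layer, sinusoidal position embeddings, and positions numbered from the start of the cached tokens. The names `pia_attention`, `sinusoidal`, `project`, and `random_matrix` are hypothetical helpers introduced for this example.

```python
import math
import random


def sinusoidal(pos, dim):
    """Sinusoidal embedding of one position (an assumed embedding type)."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project(weight, x):
    """Multiply a square weight matrix (a list of rows) by a vector."""
    return [dot(row, x) for row in weight]


def pia_attention(current, cache, w_q, w_k, w_v):
    """Single-head causal attention over cache + current.

    `current` and `cache` hold position-free token representations.
    Positions are added to queries and keys only, never to values.
    """
    context = cache + current
    dim = len(context[0])
    scale = math.sqrt(dim)
    # Positions are assigned afresh over the whole context, so a cached
    # vector can take a different position each time it is reused.
    with_pos = [
        [x_d + p_d for x_d, p_d in zip(x, sinusoidal(j, dim))]
        for j, x in enumerate(context)
    ]
    keys = [project(w_k, xp) for xp in with_pos]
    values = [project(w_v, x) for x in context]  # no position information

    outputs = []
    for i in range(len(cache), len(context)):
        query = project(w_q, with_pos[i])
        # Causal mask: token i attends to tokens 0..i only.
        scores = [dot(query, keys[j]) / scale for j in range(i + 1)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * values[j][d] for j, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return outputs


def random_matrix(dim, rng):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]


rng = random.Random(0)
dim = 4
w_q, w_k, w_v = (random_matrix(dim, rng) for _ in range(3))
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]

# Process six tokens as two subsequences of three. The position-free
# inputs of the first subsequence serve as the cache for the second.
first = pia_attention(tokens[:3], [], w_q, w_k, w_v)
second = pia_attention(tokens[3:], tokens[:3], w_q, w_k, w_v)
```

Because the values never receive position embeddings, each output is a weighted average of position-free vectors, which is what makes the cached vectors reusable at a new offset. In a multi-layer model the cache would hold each layer's outputs for the previous subsequence; with one layer, as here, the cache is simply the previous inputs.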

Experimental Validation

The authors validate their methods empirically on the WikiText-103 dataset, a standard language modeling benchmark. The quantitative results show several improvements:

Combining staged training with position-infused attention speeds up training by a factor of 1.65 compared to a baseline model trained with conventional methods.
Both staged training and position-infused attention reduce perplexity on the WikiText-103 dataset when compared to a standard transformer language model.
The Shortformer model, which combines staged training and position-infused attention, achieves a perplexity of 17.47 against the baseline's 18.65, and it does so more efficiently because its attention matrices are significantly smaller.
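To see why shorter inputs shrink the attention cost, a rough count of attention scores is enough. The sketch below assumes a two-stage schedule with illustrative values (a short length of 128, a long length of 3072, and a switch after 50 of 100 epochs); these numbers and the function names are assumptions chosen for the example, not settings taken from this summary. It counts attention scores only, ignoring feed-forward layers and other overheads, so it is not expected to reproduce the reported 1.65 speed-up.

```python
def subsequence_length(epoch, short_len, long_len, switch_epoch):
    """Two-stage schedule: short subsequences first, then long ones."""
    return short_len if epoch < switch_epoch else long_len


def attention_scores_per_token(seq_len):
    """Causal self-attention over a subsequence of length L computes
    L * (L + 1) / 2 scores, which is (L + 1) / 2 per token."""
    return (seq_len + 1) / 2


def relative_attention_cost(total_epochs, short_len, long_len, switch_epoch):
    """Attention-score count of the staged schedule divided by that of
    training at long_len throughout. Every epoch covers the same tokens,
    so per-token cost is proportional to per-epoch cost."""
    staged = sum(
        attention_scores_per_token(
            subsequence_length(epoch, short_len, long_len, switch_epoch)
        )
        for epoch in range(total_epochs)
    )
    baseline = total_epochs * attention_scores_per_token(long_len)
    return staged / baseline


# Illustrative values only (assumed for this example).
ratio = relative_attention_cost(
    total_epochs=100, short_len=128, long_len=3072, switch_epoch=50
)
print(f"staged schedule computes {ratio:.2%} of the baseline attention scores")
```

The same per-token count explains the memory savings: the attention matrix for a subsequence grows quadratically with its length, so a model that attends over short subsequences plus a cache holds far fewer scores at once than one that attends over a single long subsequence.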

Implications and Future Directions

The findings of this study have direct implications for the design and training of large-scale language models. By demonstrating that shorter input sequences can match and sometimes exceed the performance of longer ones, this research opens the way to models that use less memory and train faster. As demand for resource-efficient AI models grows, such strategies may become particularly useful.

As for future work, combining the proposed methods with existing models such as the Compressive Transformer or Routing Transformer could potentially yield further gains. Moreover, while the current research focuses on language modeling, applying these methods to other sequential tasks, such as video processing or time-series prediction, would test how broadly they apply.

Conclusion

This paper argues against the assumption that longer input subsequences are inherently beneficial for transformer-based language models. With staged training and position-infused attention, the researchers improve both efficiency and perplexity while using shorter inputs. The Shortformer shows the value of rethinking conventional strategies in language modeling and points toward more resource-conscious models.
