- The paper introduces an adaptive input representation method that allocates capacity based on word frequency to improve language modeling.
- It demonstrates training more than twice as fast as character-input CNN models, a perplexity of 18.7 on WikiText-103, and a 61% reduction in parameters compared to prior models.
- The approach balances computational efficiency and accuracy, offering practical benefits for resource-constrained and large-scale applications.
Adaptive Input Representations for Neural Language Modeling
In the paper titled "Adaptive Input Representations for Neural Language Modeling," the authors introduce a novel approach to enhance neural language models with adaptive input representations. The method extends the adaptive softmax framework from the output layer to the input word embeddings, allocating embedding capacity according to word frequency. The primary focus is achieving high efficiency and accuracy in model training and evaluation on language modeling tasks.
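For orientation, the adaptive softmax that this method generalizes is available in PyTorch as `nn.AdaptiveLogSoftmaxWithLoss`. The short sketch below uses an illustrative vocabulary size and cluster cutoffs (not the paper's settings) to show how the output vocabulary is split into a frequent-word head and progressively smaller tail clusters.

```python
import torch
import torch.nn as nn

# Illustrative sizes only; the paper's configurations differ.
vocab_size, hidden_dim = 50_000, 512

# Token ids are assumed to be ordered by frequency (0 = most frequent).
# The head covers ids < 5,000; tail clusters cover 5,000-20,000 and
# 20,000-50,000, with projections shrunk by div_value per cluster.
adaptive_softmax = nn.AdaptiveLogSoftmaxWithLoss(
    in_features=hidden_dim,
    n_classes=vocab_size,
    cutoffs=[5_000, 20_000],
    div_value=4.0,
)

hidden = torch.randn(8, hidden_dim)           # decoder states for 8 positions
targets = torch.randint(0, vocab_size, (8,))  # gold next-token ids
output, loss = adaptive_softmax(hidden, targets)
print(loss)                                   # mean negative log-likelihood
```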
The paper conducts a thorough exploration of input and output factorization strategies within self-attentional architectures, with an emphasis on adaptive embeddings. Character-based input via convolutional neural networks (CNNs), sub-word units produced by byte-pair encoding (BPE), and full-word inputs are all considered. Adaptive input representations stand out because they allocate more capacity to frequently occurring words and less to rare ones, mitigating overfitting on rare words.
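A minimal sketch of the input-side idea, written from scratch in PyTorch rather than taken from the authors' code (fairseq provides the reference implementation): the frequency-sorted vocabulary is partitioned into bands, each band gets its own embedding table with a progressively smaller dimension, and a linear projection maps every band back to the model dimension. The cutoffs, sizes, and reduction factor below are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AdaptiveInput(nn.Module):
    """Frequency-banded input embeddings, in the spirit of adaptive softmax.

    Band i covers token ids in [edges[i], edges[i+1]) and uses an embedding
    of dimension d_model / factor**i, projected back up to d_model.
    Token ids are assumed to be sorted by frequency (0 = most frequent).
    """
    def __init__(self, vocab_size, d_model, cutoffs=(5_000, 20_000), factor=4):
        super().__init__()
        self.d_model = d_model
        self.edges = [0, *cutoffs, vocab_size]
        self.bands = nn.ModuleList()
        for i in range(len(self.edges) - 1):
            band_size = self.edges[i + 1] - self.edges[i]
            band_dim = max(d_model // factor ** i, 1)
            self.bands.append(nn.Sequential(
                nn.Embedding(band_size, band_dim),         # small table for rare bands
                nn.Linear(band_dim, d_model, bias=False),  # project to model dim
            ))

    def forward(self, tokens):
        out = torch.zeros(*tokens.shape, self.d_model, device=tokens.device)
        for i, band in enumerate(self.bands):
            lo, hi = self.edges[i], self.edges[i + 1]
            mask = (tokens >= lo) & (tokens < hi)
            if mask.any():
                out[mask] = band(tokens[mask] - lo)  # re-index within the band
        return out

# Usage with illustrative sizes: frequent ids keep full-width embeddings,
# rarer bands use 4x and 16x smaller tables.
emb = AdaptiveInput(vocab_size=50_000, d_model=512)
tokens = torch.randint(0, 50_000, (2, 16))
print(emb(tokens).shape)  # torch.Size([2, 16, 512])
```

Because the rare-word bands use much narrower embeddings, the long tail of the vocabulary contributes only a small fraction of the embedding parameters.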
Experimental Results and Observations
- Training Efficiency: The paper reports substantial training efficiency gains with adaptive embeddings: models train more than twice as fast as character-input CNN models while using fewer parameters.
- Perplexity Improvement: On the WikiText-103 dataset, models with adaptive input representations achieve a perplexity of 18.7, an improvement of 10.5 perplexity over the previous state of the art. On the Billion Word benchmark, perplexity is reduced to 23.02, likewise a notable decrease relative to earlier results.
- Parameter Reduction: By pairing adaptive input embeddings with an adaptive softmax output layer, the authors cut the total number of parameters by 61%. This reduction does not come at the expense of model performance, as the improved perplexity scores demonstrate.
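To make the parameter savings concrete, the back-of-the-envelope comparison below contrasts a fixed-size embedding table with a frequency-banded one. The vocabulary size, dimensions, and cutoffs are illustrative assumptions, and the count covers the input embedding layer only, so the resulting percentage is not the paper's 61% figure (which refers to the whole model with an adaptive softmax output).

```python
# Rough parameter count: fixed-size vs. adaptive (frequency-banded) embeddings.
# All numbers are illustrative assumptions, not the paper's configuration.
vocab_size, d_model, factor = 260_000, 1024, 4
cutoffs = [20_000, 60_000, vocab_size]   # frequency band boundaries

fixed = vocab_size * d_model             # one full-width vector per word

adaptive, lo = 0, 0
for i, hi in enumerate(cutoffs):
    band_dim = d_model // factor ** i    # 1024, 256, 64
    adaptive += (hi - lo) * band_dim     # band embedding table
    adaptive += band_dim * d_model       # projection back to d_model
    lo = hi

print(f"fixed:    {fixed / 1e6:.1f}M parameters")
print(f"adaptive: {adaptive / 1e6:.1f}M parameters")
print(f"saving:   {1 - adaptive / fixed:.0%}")
```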
Comparative Analysis
Detailed comparisons are drawn between word-based models with fixed-size embeddings, character-input models, and sub-word models. Adaptive input representations not only outperform these alternatives in perplexity but also lower computational and memory requirements by allocating embedding capacity according to word frequency.
Implications and Future Directions
The practical implications of this work are significant for real-world applications where computational resources are constrained. Theoretically, the approach opens avenues for refining the balance between model capacity and computational cost, pointing toward more efficient yet capable language models.
Future work could extend adaptive input representations beyond core language modeling to tasks such as translation or speech recognition, which could benefit from similar efficiency gains. Improved regularization techniques for very large vocabularies, as well as dynamic adjustment of the capacity allocation during training, could further enhance model performance and applicability.
In conclusion, the paper makes a substantial contribution to neural language modeling by proposing an adaptive approach to input representations that balances model complexity, accuracy, and computational efficiency.