Hash Layers For Large Sparse Models

Published 8 Jun 2021 in cs.LG and cs.CL | (2106.04426v3)

Abstract: We investigate the training of sparse layers that use different parameters for different inputs based on hashing in large Transformer models. Specifically, we modify the feedforward layer to hash to different sets of weights depending on the current token, over all tokens in the sequence. We show that this procedure either outperforms or is competitive with learning-to-route mixture-of-expert methods such as Switch Transformers and BASE Layers, while requiring no routing parameters or extra terms in the objective function such as a load balancing loss, and no sophisticated assignment algorithm. We study the performance of different hashing techniques, hash sizes and input features, and show that balanced and random hashes focused on the most local features work best, compared to either learning clusters or using longer-range context. We show our approach works well both on large language modeling and dialogue tasks, and on downstream fine-tuning tasks.


Authors (4)

Citations (188)


Summary

The paper introduces a hash-based routing strategy that eliminates extra parameters and load balancing losses for efficient sparse training in Transformer models.
The method routes tokens with fixed hash functions, of which balanced hashes work best, and is competitive with or outperforms learned MoE routing such as Switch Transformers and BASE Layers.
Experiments on the pushshift.io Reddit, RoBERTa+cc100en, and Wikitext-103 datasets cover language modeling, dialogue, and downstream fine-tuning, and show that the simpler routing also trains faster than learned routing.

Analysis of Hash Layers For Large Sparse Models

The paper "Hash Layers For Large Sparse Models" investigates a novel sparse training technique for Transformer models in NLP applications. The authors propose a hash-based routing strategy for sparse layers within large Transformer models, which determines specific parameters for different inputs based on hashing. This approach positions itself as an alternative to learning-to-route Mixture-of-Experts (MoE) methods such as Switch Transformers and BASE Layers. Below, we provide a detailed overview of the paper's contributions, results, and implications.

Methodology

The hash-based routing approach modifies the feedforward layers of Transformer models so that each token selects one of several sets of weights, with the choice made by hashing the current token. This differs from other MoE methods in that it requires no routing parameters, no load balancing loss, and no sophisticated assignment algorithm. Because the routing is fixed rather than learned, the method is simple to implement and adds little computational overhead, which is attractive when training very large models. The authors experiment with various hashing strategies, hash sizes, and input features, finding that balanced and random hashes targeting the most local features are the most effective, compared to either learned clusters or longer-range context. The method is evaluated on large language modeling and dialogue tasks, along with downstream fine-tuning tasks.
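The routing described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes a random hash over token ids, a ReLU feedforward expert without biases, and toy sizes. All names (`make_random_hash`, `hash_layer`, `VOCAB`, and so on) are hypothetical.

```python
import random

def make_random_hash(vocab_size, num_experts, seed=0):
    """Fixed lookup table (token id -> expert index), drawn once and never trained."""
    rng = random.Random(seed)
    return [rng.randrange(num_experts) for _ in range(vocab_size)]

def make_expert(d_model, d_ff, rng):
    """One feedforward expert: weights for d_model -> d_ff -> d_model (no biases)."""
    w_in = [[rng.gauss(0.0, 0.1) for _ in range(d_ff)] for _ in range(d_model)]
    w_out = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_ff)]
    return w_in, w_out

def matvec(x, w):
    """Row vector x (length n) times matrix w (n x m)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def expert_ffn(x, expert):
    """Apply one expert: linear, ReLU, linear."""
    w_in, w_out = expert
    hidden = [max(0.0, h) for h in matvec(x, w_in)]
    return matvec(hidden, w_out)

def hash_layer(token_ids, hidden_states, hash_table, experts):
    """Route each position to the expert selected by hashing its token id."""
    return [expert_ffn(h, experts[hash_table[t]])
            for t, h in zip(token_ids, hidden_states)]

# Toy sizes, chosen only for illustration.
VOCAB, EXPERTS, D_MODEL, D_FF = 50, 4, 8, 16
rng = random.Random(1)
hash_table = make_random_hash(VOCAB, EXPERTS)
experts = [make_expert(D_MODEL, D_FF, rng) for _ in range(EXPERTS)]

token_ids = [3, 17, 3, 42]
hidden_states = [[rng.gauss(0.0, 1.0) for _ in range(D_MODEL)] for _ in token_ids]
outputs = hash_layer(token_ids, hidden_states, hash_table, experts)
```

Note that the expert choice depends only on the token id, so both occurrences of token 3 go to the same expert regardless of their hidden states, and there is nothing in the routing step to learn or to balance with an auxiliary loss.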

Experimental Results

The paper presents empirical comparisons of hash layers against established sparse routing strategies on several datasets:

Performance Against Mixture-of-Experts Methods: Hash Layers are either competitive with or outperform Switch Transformers and BASE Layers. This is especially notable because hash layers dispense with added routing network parameters and sophisticated balancing mechanisms.
Experimental Results with Various Hash Strategies: The balanced assignment hash generally performs best; compared with a fixed random assignment, it distributes the workload more evenly among experts. In contrast, routing by clusters of similar tokens did not perform well, suggesting that spreading similar tokens across different experts matters for effective MoE operation.
Model Training and Efficiency: Because routing decisions are fixed rather than learned, the method avoids the overhead of a routing network and assignment step, which results in faster training than the learned-routing MoE baselines and removes the training instabilities associated with learning to route.
Robustness Across Tasks: The evaluations span the pushshift.io Reddit, RoBERTa+cc100en, and Wikitext-103 datasets, covering language modeling, dialogue, and downstream fine-tuning, and the approach performs well across these settings.
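To make the difference between a random and a balanced hash concrete, the sketch below builds a balanced lookup table with a simple greedy rule: visit token types from most to least frequent and give each one to the expert with the lowest load so far. This is one plausible way to balance a hash from training-set token counts and is offered as an assumption, not as the paper's exact procedure; the function name `balanced_hash` and the toy corpus are hypothetical.

```python
from collections import Counter

def balanced_hash(token_counts, num_experts):
    """Greedy balanced assignment of token types to experts.

    Tokens are visited from most to least frequent (ties broken by token
    text) and each goes to the currently least-loaded expert, where load
    is the number of training tokens already routed to that expert.
    """
    load = [0] * num_experts
    table = {}
    ordered = sorted(token_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for token, count in ordered:
        expert = min(range(num_experts), key=lambda e: load[e])
        table[token] = expert
        load[expert] += count
    return table, load

# Toy corpus, for illustration only.
corpus = "the cat sat on the mat and the dog sat on the log".split()
counts = Counter(corpus)
table, load = balanced_hash(counts, num_experts=3)
```

With a purely random table, a single very frequent token can leave one expert handling far more tokens than the others; the greedy rule keeps the per-expert loads close together without any auxiliary loss.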

Implications

The study extends our understanding of sparse neural architectures by demonstrating that non-learned, hash-based routing can rival more complex learned methods on the NLP tasks studied. Its low cost and simplicity make it a strong alternative for scaling model parameters without a commensurate increase in computation or training overhead. Future research might focus on expanding the method's applicability, optimizing hash functions further, or combining it with other state-of-the-art NLP techniques to improve generalization.
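The claim about scaling parameters without scaling computation follows from simple arithmetic: a sparse layer stores one feedforward block per expert, but each token passes through only one of them. The sketch below counts weights only (biases ignored), and the layer sizes are hypothetical values chosen for illustration, not figures from the paper.

```python
def ffn_params(d_model, d_ff):
    """Weight count of one feedforward block: d_model x d_ff plus d_ff x d_model."""
    return 2 * d_model * d_ff

def sparse_ffn_stats(d_model, d_ff, num_experts):
    """Stored parameters versus parameters actually used by a single token."""
    per_expert = ffn_params(d_model, d_ff)
    return {
        "total_params": num_experts * per_expert,   # grows with the expert count
        "active_params_per_token": per_expert,      # independent of the expert count
    }

# Hypothetical sizes, for illustration only.
stats = sparse_ffn_stats(d_model=1024, d_ff=4096, num_experts=64)
```

Here the layer stores 64 times as many weights as a dense feedforward block of the same shape, while the per-token work stays that of a single block.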

Future Directions

Looking forward, further investigation could explore adaptive hash functions or hashing mechanisms that account for how the token distribution changes during training. A closer look at the conditions under which such hashing fails could also clarify the boundaries of the technique's applicability. Evaluating hash layers at the scale of large industrial NLP applications would provide more insight into their broader applicability and limitations.

In essence, the introduction of hash layers into sparse neural models presents a promising direction for efficient large-scale NLP. This work serves as a foundation for exploring sparse architectures that derive efficiency from simple, rule-based operations rather than computationally expensive learned components.
