Block Pruning For Faster Transformers

Published 10 Sep 2021 in cs.LG and cs.CL | (2109.04838v1)

Abstract: Pre-training has improved model accuracy for both classification and generation tasks at the cost of introducing much larger and slower models. Pruning methods have proven to be an effective way of reducing model size, whereas distillation methods are proven for speeding up inference. We introduce a block pruning approach targeting both small and fast models. Our approach extends structured methods by considering blocks of any size and integrates this structure into the movement pruning paradigm for fine-tuning. We find that this approach learns to prune out full components of the underlying model, such as attention heads. Experiments consider classification and generation tasks, yielding among other results a pruned model that is a 2.4x faster, 74% smaller BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models in speed and pruned models in size.


Authors (4)

Citations (205)


Summary

The paper introduces block pruning as a novel method for reducing transformer model size while significantly accelerating inference speed.
Experiments show that structured block pruning yields, among other results, a BERT model on SQuAD v1 that is 2.4x faster and 74% smaller with a 1% drop in F1.
Because the method learns to remove entire components such as attention heads, it indicates which parts of a transformer are redundant, which matters for latency-sensitive deployments.

An Analysis of Block Pruning for Optimizing Transformer Efficiency

The paper "Block Pruning For Faster Transformers" by Lagunas et al. explores methodologies for enhancing the efficiency of pre-trained transformer models, focusing on both classification and generation tasks. The challenge addressed by the authors is the tendency of state-of-the-art models to increase in size, thereby increasing computational cost and inference latency. The primary contributions of this work are the introduction and evaluation of a block pruning approach as a means to reconcile size reduction with improved inference speed.

The authors begin by surveying existing model compression techniques. Pruning methods, such as magnitude pruning and movement pruning, are well established for reducing model size, but they typically produce unstructured sparsity, which standard hardware cannot exploit efficiently. Movement pruning in particular reduces parameter storage without significantly improving inference time. In contrast, distillation methods like DistilBERT and TinyBERT offer substantial speed-ups, but the resulting models remain comparatively large unless carefully engineered.
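
To make the difference between the two pruning criteria concrete, here is a minimal sketch in plain Python. It assumes the usual description of the two rules: magnitude pruning ranks weights by absolute value, while movement pruning accumulates a score of the form -lr * gradient * weight over fine-tuning steps, so weights moving away from zero score high. The function names and the toy numbers are illustrative assumptions, not taken from the paper.

```python
def magnitude_scores(weights):
    """Magnitude pruning: importance is the absolute value of each weight."""
    return [abs(w) for w in weights]


def movement_scores(weights_per_step, grads_per_step, lr=1.0):
    """Movement pruning (sketch): accumulate -lr * grad * weight over steps.

    A weight that gradient descent pushes away from zero gains score;
    a weight that is pushed toward zero loses score.
    """
    scores = [0.0] * len(weights_per_step[0])
    for weights, grads in zip(weights_per_step, grads_per_step):
        for i, (w, g) in enumerate(zip(weights, grads)):
            scores[i] -= lr * g * w
    return scores


def top_k_mask(scores, keep):
    """Keep the `keep` highest-scoring entries (1 = keep, 0 = prune)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = set(order[:keep])
    return [1 if i in kept else 0 for i in range(len(scores))]


# Toy trajectory (made-up values): weight 0 is large but shrinking,
# weights 1 and 2 are small but growing in magnitude.
weights_per_step = [[0.9, 0.1, -0.2], [0.8, 0.2, -0.3]]
grads_per_step = [[0.5, -0.3, 0.1], [0.5, -0.3, 0.1]]

magnitude_mask = top_k_mask(magnitude_scores(weights_per_step[-1]), keep=2)
movement_mask = top_k_mask(movement_scores(weights_per_step, grads_per_step), keep=2)
print(magnitude_mask)  # [1, 0, 1]
print(movement_mask)   # [0, 1, 1]
```

The two criteria disagree on this toy input: magnitude keeps the large weight that is shrinking, while the movement score keeps the two weights that are growing. Either way the surviving weights are scattered individually, which is the unstructured sparsity discussed above.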

Block pruning is proposed as a compression strategy that balances the trade-offs between these two paradigms. Unlike unstructured pruning, which removes individual weights, block pruning removes contiguous blocks of parameters. This structured sparsity maps more readily onto the dense matrix operations that GPUs execute efficiently. Experiments demonstrate its efficacy, with block pruning leading to substantial speed-ups (e.g., a 2.4x speedup on SQuAD v1.1) while maintaining competitive F1 scores.
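
The idea of pruning blocks rather than individual weights can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a keep/prune decision is stored per block, expanded to a per-weight mask, and rows or columns that end up entirely masked are dropped to leave a smaller dense matrix. The helper names, the 4x4 matrix, and the 2x2 block shape are all hypothetical.

```python
def expand_block_mask(block_keep, block_rows, block_cols):
    """Expand a per-block 0/1 decision into a per-weight 0/1 mask."""
    mask = []
    for block_row in block_keep:
        expanded = []
        for keep in block_row:
            expanded.extend([keep] * block_cols)
        for _ in range(block_rows):
            mask.append(list(expanded))
    return mask


def compact(weight, mask):
    """Apply the mask, then drop rows and columns that are fully masked out."""
    keep_rows = [i for i, m_row in enumerate(mask) if any(m_row)]
    keep_cols = [j for j in range(len(mask[0])) if any(m_row[j] for m_row in mask)]
    return [[weight[i][j] * mask[i][j] for j in keep_cols] for i in keep_rows]


weight = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]

# Only the top-left 2x2 block survives: the matrix shrinks to 2x2.
mask_a = expand_block_mask([[1, 0], [0, 0]], block_rows=2, block_cols=2)
print(compact(weight, mask_a))  # [[1, 2], [5, 6]]

# Two diagonal blocks survive: no full row or column is empty,
# so the matrix keeps its 4x4 shape even though half of it is zero.
mask_b = expand_block_mask([[1, 0], [0, 1]], block_rows=2, block_cols=2)
print(len(compact(weight, mask_b)), len(compact(weight, mask_b)[0]))  # 4 4
```

The second case shows why block structure alone does not guarantee a speed-up on dense kernels in this sketch: the matrix only shrinks when the pruned blocks line up into complete rows or columns.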

Noteworthy is the method's ability to prune entire components of transformer models, such as attention heads. This suggests that many attention heads are redundant and contribute little to model performance. The implication is significant: the pattern of components that survive pruning is empirical evidence about which parts of the architecture matter, and it could inform future model designs.
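
One way to see how block-level decisions can add up to removing a whole attention head is to check, for each head, whether its entire slice of a projection mask is zero. The sketch below assumes a layout in which each head owns a contiguous band of rows in the query projection (as in a weight stored with shape output x input); the function name and the tiny 4x4 mask are hypothetical.

```python
def prunable_heads(mask, num_heads):
    """Return indices of heads whose band of the projection mask is all zero.

    Assumes each head owns a contiguous band of len(mask) // num_heads rows.
    """
    head_dim = len(mask) // num_heads
    empty = []
    for h in range(num_heads):
        band = mask[h * head_dim:(h + 1) * head_dim]
        if not any(any(row) for row in band):
            empty.append(h)
    return empty


# Toy mask for a projection with 2 heads of dimension 2 (made-up values).
# Head 0 still has surviving weights; head 1 has none.
query_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
print(prunable_heads(query_mask, num_heads=2))  # [1]
```

A head flagged this way produces no output from that projection, so in this simplified picture it can be dropped from the layer altogether rather than stored as zeros.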

Experiments conducted by the authors underscore the potential of block pruning across various datasets, including SQuAD, QQP, and CNN/DailyMail. In terms of performance, the pruned models exhibit only a slight degradation in accuracy metrics despite substantial reductions in parameter count and appreciable speed improvements. For instance, the block pruning method was able to yield models that were 74% smaller with minimal accuracy loss.
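
The headline figures are easier to compare once converted to ratios. The short calculation below is plain arithmetic on the numbers quoted in the abstract (74% smaller, 2.4x faster); the function names are illustrative.

```python
def compression_ratio(fraction_removed):
    """Original size divided by pruned size."""
    return 1.0 / (1.0 - fraction_removed)


def relative_latency(speedup):
    """Pruned inference time as a fraction of the original time."""
    return 1.0 / speedup


print(round(compression_ratio(0.74), 2))  # 3.85
print(round(relative_latency(2.4), 3))    # 0.417
```

In other words, a model that is 74% smaller keeps about 26% of the counted parameters, roughly a 3.85x reduction, and a 2.4x speed-up means each inference takes about 42% of the original time.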

This work also points to the energy-efficiency gains that come with faster inference. While training block-pruned models may require more epochs (i.e., a longer time investment), that cost is paid once, and inference is substantially faster afterward. Such efficiency is critical in resource-limited environments or applications requiring real-time processing.

Nevertheless, the presented approach is not without limitations. The effectiveness of block pruning depends on block size and structure, and further work is needed to guide the choice of these hyperparameters. Moreover, the granularity at which blocks are defined may affect performance differently across model architectures and tasks, which remains an open topic for research.

In conclusion, Lagunas et al. position block pruning as a viable general-purpose method for improving transformer efficiency within practical trade-offs of speed, size, and accuracy. Looking forward, combining block pruning with other techniques, such as knowledge distillation, could open further avenues for compressing large-scale models at modest additional computational cost. For applications that demand efficient, real-time inference, these insights may prove essential.
