
Exponentially Faster Language Modelling

(2311.10770)
Published Nov 15, 2023 in cs.CL, cs.AI, cs.LG, and cs.NE

Abstract

Language models only really need to use an exponential fraction of their neurons for individual inferences. As proof, we present UltraFastBERT, a BERT variant that uses 0.3% of its neurons during inference while performing on par with similar BERT models. UltraFastBERT selectively engages just 12 out of 4095 neurons for each layer inference. This is achieved by replacing feedforward networks with fast feedforward networks (FFFs). While no truly efficient implementation currently exists to unlock the full acceleration potential of conditional neural execution, we provide high-level CPU code achieving 78x speedup over the optimized baseline feedforward implementation, and a PyTorch implementation delivering 40x speedup over the equivalent batched feedforward inference. We publish our training code, benchmarking setup, and model weights.

Overview

  • LLMs have improved NLP but are computationally expensive.

  • UltraFastBERT reduces neuron usage during inference to just 0.3% while performing on par with comparable BERT models.

  • UltraFastBERT retains at least 96% of BERT-base's downstream predictive performance on the GLUE benchmark.

  • Conditional matrix multiplication (CMM) makes UltraFastBERT's feedforward layers up to 78x faster than an optimized dense baseline on standard CPU hardware.

  • Future work includes integrating CMM into deep learning frameworks to maximize processing speed.

Introduction

The field of natural language processing has witnessed significant advancements with the introduction of LLMs that have dramatically improved comprehension and generation abilities. These models often come with a high computational cost due to their extensive number of parameters, especially during inference time. To address this, research has been directed toward optimizing the efficiency of such models while maintaining performance levels.

Model Architecture

An emerging approach for efficient language modeling is introduced through UltraFastBERT, a model built upon the architecture of BERT (Bidirectional Encoder Representations from Transformers). UltraFastBERT distinguishes itself by incorporating fast feedforward networks (FFFs) in place of the conventional feedforward layers in BERT's architecture. This structure dramatically reduces the number of neurons required during inference: only 0.3% of the model's neurons are engaged in the process. Specifically, within each layer, UltraFastBERT activates only 12 of 4095 neurons for an individual inference. Despite this massive reduction in active neurons, UltraFastBERT performs on par with BERT-like models of similar size and training regimen.
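To make the layer structure concrete, below is a minimal PyTorch sketch of a single FFF layer under the assumptions described above: the neurons form a balanced binary tree of depth 11 (2^12 - 1 = 4095 nodes), each token evaluates only the 12 neurons on one root-to-leaf path, and the sign of each neuron's pre-activation selects the next child. The class and parameter names (FFFLayer, w_in, w_out) and the GELU activation are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of a fast feedforward (FFF) layer: neurons arranged in a
# balanced binary tree, with each token evaluating only one root-to-leaf path.
import torch
import torch.nn as nn

class FFFLayer(nn.Module):
    def __init__(self, d_model: int, depth: int):
        super().__init__()
        self.depth = depth                       # depth 11 -> 2^12 - 1 = 4095 neurons
        n_nodes = 2 ** (depth + 1) - 1
        self.w_in = nn.Parameter(torch.randn(n_nodes, d_model) / d_model ** 0.5)
        self.w_out = nn.Parameter(torch.randn(n_nodes, d_model) / d_model ** 0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, d_model); one tree level is processed per loop iteration.
        batch = x.shape[0]
        node = torch.zeros(batch, dtype=torch.long)          # every sample starts at the root
        y = torch.zeros_like(x)
        for _ in range(self.depth + 1):                       # 12 nodes per path at depth 11
            logit = (x * self.w_in[node]).sum(dim=-1)         # one dot product per sample
            y = y + torch.nn.functional.gelu(logit).unsqueeze(-1) * self.w_out[node]
            node = 2 * node + 1 + (logit > 0).long()          # sign picks left or right child
        return y

# Only depth + 1 = 12 of the 4095 neurons contribute to each token's output.
layer = FFFLayer(d_model=768, depth=11)
out = layer(torch.randn(4, 768))
print(out.shape)  # torch.Size([4, 768])
```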

Downstream Performance

To validate the efficacy of UltraFastBERT, comprehensive evaluations were conducted using the GLUE benchmark, a widely recognized suite of natural language understanding tasks. The reported results indicate that UltraFastBERT, with substantially fewer active neurons, retained at least 96% of BERT-base's downstream predictive performance. Interestingly, the reduction in performance due to the model’s sparse activation was mainly noticeable in a single GLUE task, suggesting that the overall approach is sound. For those interested in replicating or extending this research, the model weights have been made public.

Inference Acceleration and Compatibility

UltraFastBERT introduces conditional matrix multiplication (CMM) as the core of its efficiency gains. CMM departs from the dense matrix multiplication (DMM) traditionally used in feedforward networks: it computes dot products conditionally on the input, so not all neurons need to be engaged for every token. Remarkably, even rudimentary implementations of CMM on standard hardware already yield a 78x speedup over DMM. An analysis of CPU and GPU compatibility suggests that with optimized device-specific programming, the actual speedup could approach the theoretical maximum, which for a BERT-base-sized model is a 341x improvement (only 12 of a layer's 4095 neurons are evaluated, and 4095/12 ≈ 341).
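The arithmetic behind that ceiling can be seen in the rough NumPy sketch below, which contrasts the work done by DMM and CMM for a single token. The weight layout and variable names are illustrative, not the paper's released kernels: DMM computes all 4095 dot products of the layer, while CMM computes one per tree level, i.e. 12, which is where the 4095/12 ≈ 341x theoretical reduction in dot products comes from.

```python
# Rough comparison of dense vs. conditional matrix multiplication for one token.
import numpy as np

d_model, depth = 768, 11
n_nodes = 2 ** (depth + 1) - 1                 # 4095 neurons in the tree
W = np.random.randn(n_nodes, d_model) / np.sqrt(d_model)
x = np.random.randn(d_model)

# Dense matrix multiplication (DMM): every neuron's dot product is computed.
dense_logits = W @ x                           # 4095 dot products

# Conditional matrix multiplication (CMM): the sign of each dot product selects
# which single row participates next, so only one neuron per tree level is touched.
node, cmm_dot_products = 0, 0
for _ in range(depth + 1):
    logit = W[node] @ x                        # 1 dot product
    cmm_dot_products += 1
    node = 2 * node + 1 + int(logit > 0)

print(len(dense_logits), cmm_dot_products)     # 4095 vs 12
print(len(dense_logits) / cmm_dot_products)    # ~341x fewer dot products (theoretical ceiling)
```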

Conclusion and Future Outlook

The pioneering UltraFastBERT model demonstrates the vast potential of LLMs that leverage conditional neural execution. This work paves the way for substantial enhancements in processing speeds, potentially making large-scale language models accessible on devices with far less computational power. A key takeaway is that through refined implementations of FFFs, the AI community stands on the brink of achieving unprecedented efficiency in LLMs while preserving their impressive language understanding and generation capabilities. The next step involves integrating native support for CMM into deep learning frameworks and hardware firmware to fully capitalize on the impressive speedup potential demonstrated by UltraFastBERT.
