
Accelerating Large Language Model Inference with Self-Supervised Early Exits

(arXiv:2407.21082)
Published Jul 30, 2024 in cs.CL, cs.LG, and stat.ML

Abstract

This paper presents a novel technique for accelerating inference in large, pre-trained LLMs by introducing early exits during inference. The computational demands of these models, used across a wide range of applications, can be substantial. By capitalizing on the inherent variability in token complexity, our approach enables selective acceleration of the inference process. Specifically, we propose the integration of early exit "heads" atop existing transformer layers, which facilitate conditional terminations based on a confidence metric. These heads are trained in a self-supervised manner using the model's own predictions as training data, thereby eliminating the need for additional annotated data. The confidence metric, established using a calibration set, ensures a desired level of accuracy while enabling early termination when confidence exceeds a predetermined threshold. Notably, our method preserves the original accuracy and reduces computational time on certain tasks, leveraging the existing knowledge of pre-trained LLMs without requiring extensive retraining. This lightweight, modular modification has the potential to greatly enhance the practical usability of LLMs, particularly in applications like real-time language processing in resource-constrained environments.

Figure: Accuracy and entropy metrics for different early exit model configurations with varied entropy penalties and initialization.

Overview

  • The paper introduces a technique to improve the efficiency of LLM inference by leveraging early exit mechanisms that selectively accelerate token processing based on their complexity.

  • This method employs self-supervised learning to dynamically determine the termination of computations without the need for extensive retraining or additional annotated data.

  • Experimental findings suggest that this approach can reduce inference time significantly while maintaining model accuracy, which is particularly useful for real-time applications and resource-constrained environments.

Accelerating Large Language Model Inference with Self-Supervised Early Exits

In the paper "Accelerating Large Language Model Inference with Self-Supervised Early Exits," Valade et al. introduce an innovative technique aimed at improving the efficiency of inference in large pre-trained LLMs. This method's core concept involves the utilization of early exit mechanisms within the transformer architecture to allow selective acceleration of token processing based on their complexity. By leveraging self-supervised learning techniques, these early exits can dynamically terminate computations during inference without necessitating extensive retraining or additional annotated data.

Methodology

The proposed technique employs strategically positioned early exit heads atop the existing transformer layers in an LLM. These heads act as conditional termination points, using a confidence metric to decide whether to continue processing or to exit with the current prediction. This metric is established through a calibration set and ensures that the desired accuracy levels are maintained.
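
To make the architecture concrete, the sketch below shows one way such an exit head could be attached to an intermediate layer's hidden states. This is a minimal illustration assuming PyTorch; the class, the layer-norm-plus-linear head design, and the max-softmax confidence score are our assumptions, not necessarily the paper's exact choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EarlyExitHead(nn.Module):
    """Lightweight LM head attached atop an intermediate transformer layer."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_size)
        self.lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Map the intermediate hidden state directly to vocabulary logits.
        return self.lm_head(self.norm(hidden_states))


def confidence(logits: torch.Tensor) -> torch.Tensor:
    # One simple confidence metric: the maximum softmax probability of the
    # head's predictive distribution (higher means more certain).
    return F.softmax(logits, dim=-1).max(dim=-1).values
```

Because each exit point only adds a normalization and a vocabulary projection, the modification stays lightweight and modular, in line with the paper's emphasis.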

The early exit heads are trained in a self-supervised manner, using the model's own predictions as training data and thus circumventing the need for external annotations. During training, a loss function combining cross-entropy with an entropy penalty encourages the heads to output probability distributions that reflect the uncertainty of their predictions.
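
A hedged sketch of what that objective could look like is shown below: the head is fit to the frozen full model's own next-token predictions (the self-supervised pseudo-labels), and an entropy term weighted by λ rewards less peaked distributions so that low confidence can flag tokens that should keep flowing through the network. The exact way the two terms are combined is an assumption on our part, inferred from the λ configurations reported later, not the paper's verbatim loss.

```python
import torch
import torch.nn.functional as F


def early_exit_loss(head_logits: torch.Tensor,
                    full_model_logits: torch.Tensor,
                    lam: float = 0.95) -> torch.Tensor:
    # Self-supervised targets: the full model's own argmax predictions,
    # so no externally annotated data is needed.
    targets = full_model_logits.argmax(dim=-1)

    # Cross-entropy of the exit head against those pseudo-labels.
    ce = F.cross_entropy(head_logits.view(-1, head_logits.size(-1)),
                         targets.view(-1))

    # Entropy of the head's predictive distribution; rewarding it discourages
    # overconfident heads on tokens the head cannot actually resolve.
    probs = F.softmax(head_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1).mean()

    # Assumed combination: a larger lam trades some accuracy for higher entropy.
    return ce - lam * entropy
```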

Calibration of these heads involves determining confidence thresholds through a calibration dataset, with thresholds fine-tuned to balance accuracy and computational efficiency. During inference, the model evaluates the confidence at each early exit head and decides whether to produce a final prediction or continue processing based on the predefined thresholds.
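
Putting calibration and the exit decision together, a rough sketch might look like the following. The threshold search, the dictionary-of-heads layout, and the single-token decoding assumption are illustrative choices of ours (reusing the hypothetical `confidence` helper from the earlier sketch); the paper's actual calibration procedure is only described at the level given above.

```python
import torch


@torch.no_grad()
def calibrate_threshold(head_conf: torch.Tensor,
                        head_agree: torch.Tensor,
                        target_accuracy: float = 0.95) -> float:
    """Pick the smallest confidence threshold whose exits meet the accuracy target.

    head_conf:  confidence scores of one exit head over a calibration set
    head_agree: 1.0 where the head's token matches the full model's token, else 0.0
    """
    for t in torch.sort(head_conf).values:
        exited = head_conf >= t
        if exited.any() and head_agree[exited].mean() >= target_accuracy:
            return t.item()
    return float("inf")  # this head never gets to exit


@torch.no_grad()
def forward_with_early_exit(x, layers, exit_heads, thresholds, final_head):
    """Run one decoding step, exiting at the first sufficiently confident head.

    exit_heads / thresholds: dicts keyed by layer index.
    """
    for i, layer in enumerate(layers):
        x = layer(x)
        if i in exit_heads:
            logits = exit_heads[i](x)
            if confidence(logits).item() >= thresholds[i]:
                return logits          # confident enough: stop here
    return final_head(x)               # otherwise use the full model's head
```

In the paper's experiments the exits sit after layers 6, 12, 18, and 24 of the 32-layer Phi-2 model, so `exit_heads` and `thresholds` would hold four entries each, with the threshold values (the epsilon levels discussed below) determined on the calibration set.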

Experimental Findings

The paper's experiments, conducted on the Phi-2 model, demonstrate the efficacy of the early exit strategy. Early exit heads are placed at regular intervals (after layers 6, 12, 18, and 24) within the 32-layer transformer model. Different configurations were tested to tune the loss function and initialization strategies for the early exit heads:

  1. Low Entropy Penalty ($\lambda = 0.1$): Higher accuracy with slightly reduced entropy.
  2. High Entropy Penalty ($\lambda = 0.95$): Balanced accuracy with significantly increased entropy.
  3. Copied Head Initialization Without Penalty: High accuracy but insufficient entropy.
  4. Copied Head Initialization With Penalty ($\lambda = 0.95$): Moderate accuracy and entropy.

The results indicate that the combination of high penalty and newly initialized heads provides a good balance of high accuracy with useful entropy levels, facilitating effective early exits without substantial loss in model performance.

Performance Impact

The paper reports that the early exit mechanism preserves the model's accuracy while offering substantial computational savings. This is evident from results on benchmarks like MMLU, where performance remains relatively stable across varying epsilon values. However, other benchmarks (e.g., HellaSwag and Winogrande) do show performance degradation at lower epsilon values, indicating that task-specific calibration is crucial for preserving accuracy while retaining the efficiency gains.

The overall speedup, especially at lower epsilon levels, demonstrates the method's potential to reduce inference time significantly. Because a large proportion of tokens exit early, computational resources are used more efficiently, making this approach well suited to real-time language processing in resource-constrained environments.

Implications and Future Directions

The implications of this research are manifold. The practical usability of LLMs can be greatly enhanced by accelerating their inference, which is particularly advantageous for applications requiring real-time processing, such as conversational AI and machine translation on mobile devices. The modular nature of the proposed enhancement means it can be integrated into a wide range of pre-trained models with minimal adjustment.

The theoretical contribution is also notable. This work adds to the broader field of model efficiency, emphasizing an approach that retains the integrity of the model's predictions while reducing computational overhead, and it points future research toward optimizing inference in large models through similarly lightweight, modular integration techniques.

Future research could extend this approach by exploring its adaptability across larger, more complex LLMs. It could also involve refining the calibration and confidence metrics tailored to diverse application domains or identifying alternative early exit strategies that might offer further efficiency improvements.

In conclusion, Valade et al.'s method for integrating self-supervised early exits into transformer architectures presents a promising path toward optimizing the inference efficiency of LLMs, balancing the trade-off between speed and accuracy without requiring additional labeled data. This advancement is a significant step toward making high-performing language models more accessible and practical for a broader range of applications.
