Emergent Mind

Abstract

LLMs have achieved remarkable performance across a wide variety of natural language tasks; however, their large size makes their inference slow and computationally expensive. Focusing on this problem, we propose to instruction tune LLMs with additional explicit losses from the intermediate layers (LITE) and show that it enables these layers to acquire 'good' generation ability without affecting the generation ability of the final layer. We perform 'dynamic confidence-based early exiting' at the token level from the intermediate layers, which improves the efficiency of text generation without compromising its quality. We conduct comprehensive experiments by instruction tuning LLaMA-2 models on the Alpaca dataset and holistically evaluating on four different human-instruction test sets. We show that dynamic early exiting achieves consistent and considerable reductions in inference computation cost (37.86% for the 7B model and 46.35% for the 13B model) while maintaining the generation quality of the responses. We further conduct a thorough analysis of the results along several important dimensions, such as the semantic similarity of the outputs and the number of tokens generated per response. In summary, our work contributes to improving the efficiency of LLM inference while maintaining generation quality, a crucial step en route to enabling the widespread adoption of LLMs.

Figure: Comparison of response quality between models tuned with standard instruction tuning (IT) versus IT with LITE.

Overview

  • The paper addresses the challenge of accelerating the inference process in LLMs by making effective use of intermediate layers.

  • The authors propose a technique called instruction tuning with Losses from the InTermediate layErs (LITE) to improve text generation without degrading the final layer's performance.

  • LITE allows intermediate layers to contribute to text generation, which can enable early exiting from inference for computational savings.

  • The paper introduces dynamic confidence-based early exiting to optimize the inference process and presents empirical results showing significant improvements in efficiency.

  • This research demonstrates potential computational savings of up to 46.35% for certain models while maintaining output quality, promoting the wider adoption of LLMs.

Introduction

Accelerating the inference process of LLMs is a continuous challenge due to their significant computational demands. Traditionally, inference quality has been tightly coupled to the use of a model's final layer, leaving intermediate layers underutilized. Varshney et al. target this inefficiency by exploring the potential of intermediate layers in LLMs for text generation tasks. Their approach, instruction tuning with Losses from the InTermediate layErs (LITE), instills generative capability in the intermediate layers without sacrificing the final layer's performance.

Accelerating Inference through Intermediate Layer Utilization

The authors identify a key limitation of LLMs trained with a loss on the final layer alone: the last layer is well optimized for high-quality text generation, but the intermediate layers are not. This dependency on the final layer rules out effective early exiting, in which the forward pass is stopped at an intermediate layer to save computation, because exiting early would typically degrade output quality. To rectify this, the authors propose instruction tuning with LITE, which equips the intermediate layers to produce quality text outputs.

Instruction Tuning with LITE

LITE adds a weighted sum of losses from the intermediate layers to the training objective, fostering better alignment and generation capacity within these layers without affecting the final layer's output. To validate this approach, the authors present experimental results from instruction tuning LLaMA-2 models on the Alpaca dataset and a holistic evaluation across four different human-instruction test sets. The experiments demonstrate that while intermediate layers do not inherently possess high-quality generation capabilities, with LITE they indeed gain such abilities.
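As a rough illustration, the LITE objective can be sketched as a weighted sum of per-layer cross-entropy losses over the layers that receive supervision. The NumPy setup and the per-layer weights below are hypothetical, not the paper's exact implementation or weighting scheme:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, target_ids):
    # Mean negative log-likelihood of the target tokens.
    probs = softmax(logits)
    picked = probs[np.arange(len(target_ids)), target_ids]
    return -np.mean(np.log(picked + 1e-12))

def lite_loss(layer_logits, target_ids, weights):
    """Weighted sum of cross-entropy losses over supervised layers.

    layer_logits: list of (seq_len, vocab) arrays, one per supervised
                  layer, with the final layer's logits last.
    weights:      per-layer loss weights (hypothetical values; the
                  paper's actual weighting may differ).
    """
    assert len(layer_logits) == len(weights)
    return sum(w * cross_entropy(lg, target_ids)
               for w, lg in zip(weights, layer_logits))
```

With all intermediate weights set to zero this reduces to standard final-layer instruction tuning, which is why the final layer's behavior can be preserved while the intermediate layers learn to generate.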

Dynamic Confidence-Based Early Exiting

Building on LITE, dynamic confidence-based early exiting is introduced. It relies on the probability signals from the intermediate layers' token predictions to decide on the fly whether the current token can be emitted from an intermediate layer, skipping the remaining layers. Results from this method indicate significant improvements in inference efficiency without quality trade-offs: inference cost drops by 37.86% for the 7B model and 46.35% for the 13B model, while the outputs remain semantically similar and coherent even when exiting early.
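The per-token exit decision can be sketched as a confidence check at each supervised layer: emit the token at the first layer whose top prediction is sufficiently probable, otherwise fall through to the final layer. The fixed threshold and the minimal NumPy scaffolding are illustrative assumptions; the paper tunes its confidence criterion empirically:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def early_exit_next_token(per_layer_logits, threshold=0.9):
    """Pick the next token at the earliest sufficiently confident layer.

    per_layer_logits: list of (vocab,) logit arrays from the supervised
                      layers, ordered shallow -> deep (final layer last).
    threshold:        hypothetical confidence cutoff on the top token's
                      softmax probability.
    Returns (token_id, exit_layer_index); in a real model, computation
    for layers after the exit point would simply never run.
    """
    last = len(per_layer_logits) - 1
    for i, logits in enumerate(per_layer_logits):
        probs = softmax(logits)
        tok = int(np.argmax(probs))
        if probs[tok] >= threshold or i == last:
            return tok, i
```

Because the check runs independently for every generated token, easy tokens exit at shallow layers while hard tokens still use the full network, which is where the reported compute savings come from.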

The paper contributes to the growing body of work aiming to optimize the utilization of LLMs. By enhancing the representational quality of intermediate layers and allowing dynamic exits, their approach marks a significant step toward efficient inference in resource-intensive LLMs. This methodology not only achieves computational efficiency but does so with minimal impact on generation quality, making it a promising avenue for facilitating the broader adoption of LLMs.
