Think Big, Generate Quick: LLM-to-SLM for Fast Autoregressive Decoding

(2402.16844)
Published Feb 26, 2024 in cs.LG, cs.AI, and cs.CL

Abstract

LLMs have become ubiquitous in practice and are widely used for generation tasks such as translation, summarization and instruction following. However, their enormous size and reliance on autoregressive decoding increase deployment costs and complicate their use in latency-critical applications. In this work, we propose a hybrid approach that combines language models of different sizes to increase the efficiency of autoregressive decoding while maintaining high performance. Our method utilizes a pretrained frozen LLM that encodes all prompt tokens once in parallel, and uses the resulting representations to condition and guide a small language model (SLM), which then generates the response more efficiently. We investigate the combination of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families and only require fine-tuning of the SLM. Experiments with various benchmarks show substantial speedups of up to $4\times$, with minor performance penalties of $1-2\%$ for translation and summarization tasks compared to the LLM.

A large language model enhances a smaller one for efficient, high-quality response decoding.

Overview

  • The paper introduces a hybrid model approach called LLM-to-SLM for improved efficiency in autoregressive decoding, leveraging the strengths of both large and small language models.

  • A novel framework utilizes the encoding capabilities of a pretrained frozen LLM to condition an SLM, which then performs autoregressive decoding, significantly reducing computational requirements.

  • Empirical evaluations demonstrate that the LLM-to-SLM approach maintains near-LLM performance levels while achieving substantial efficiency gains in tasks like machine translation and summarization.

  • Future directions include exploring decoder-only LLMs, dynamic invocation of LLMs, and scalability to further enhance the efficiency and practicality of language model deployment.

LLM-to-SLM: Enhancing Autoregressive Decoding Efficiency with Hybrid Language Models

Introduction to LLM-to-SLM

In the domain of Natural Language Generation (NLG), deploying LLMs efficiently has been a significant challenge, primarily due to their substantial computational demands and the sequential nature of autoregressive decoding. A promising solution to this problem is presented in a recent study through a hybrid model approach termed LLM-to-SLM (Large Language Model to Small Language Model). This approach capitalizes on the strengths of both large and small models, leveraging the high-quality representation capabilities of LLMs to condition a more computationally efficient SLM for the task of autoregressive generation. The core innovation lies in performing a single pass of encoding with an LLM to guide the generation process of an SLM, striking a balance between maintaining high performance and reducing computational overhead.
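
The source of the savings can be made concrete with a rough back-of-envelope estimate (our notation, not the paper's): generating $N$ response tokens with the LLM alone costs on the order of $N \cdot c_{\text{LLM}}$, where $c_{\text{LLM}}$ is the large model's per-token decoding cost, whereas LLM-to-SLM costs roughly $c_{\text{enc}} + N \cdot c_{\text{SLM}}$, i.e. one parallel encoding pass over the prompt plus $N$ cheap SLM decoding steps. Since $c_{\text{SLM}} \ll c_{\text{LLM}}$ and the encoding pass is amortized over the whole response, the speedup approaches $c_{\text{LLM}} / c_{\text{SLM}}$ for long generations.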

Methodology

The study introduces a novel framework where the encoding capabilities of a pretrained LLM are utilized to generate a comprehensive representation of the input prompt. This representation then conditions an SLM, which is responsible for generating the output sequence. This method significantly reduces the computational burden by limiting the use of the computationally heavy LLM to a single encoding pass, thus delegating the autoregressive decoding to the more efficient SLM.
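
The overall flow can be sketched in a few lines of PyTorch. This is a minimal illustration under our own assumptions: the module names, the two-layer projector, and the way the SLM consumes the conditioning states are placeholders, not the paper's released code.

    import torch
    import torch.nn as nn

    class LLMtoSLM(nn.Module):
        """Frozen LLM encoder -> MLP projector -> small autoregressive decoder."""
        def __init__(self, llm_encoder: nn.Module, slm: nn.Module,
                     llm_dim: int, slm_dim: int):
            super().__init__()
            self.llm_encoder = llm_encoder.eval()      # frozen, never updated
            for p in self.llm_encoder.parameters():
                p.requires_grad = False
            # Simple MLP projector from the LLM's embedding space to the SLM's.
            self.projector = nn.Sequential(
                nn.Linear(llm_dim, slm_dim), nn.GELU(), nn.Linear(slm_dim, slm_dim))
            self.slm = slm                             # small model, fine-tuned

        @torch.no_grad()
        def encode_prompt(self, prompt_ids):
            # Single parallel pass of the large model over all prompt tokens.
            return self.llm_encoder(prompt_ids)        # (batch, prompt_len, llm_dim)

        def forward(self, prompt_ids, decoder_input_ids):
            cond = self.projector(self.encode_prompt(prompt_ids))
            # The SLM attends to (or otherwise consumes) the projected prompt
            # representation and predicts the next tokens autoregressively.
            return self.slm(decoder_input_ids, encoder_hidden_states=cond)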

Key elements of this methodology include:

  • Hybrid Model Architecture: The integration of encoder-decoder LLMs with both encoder-decoder and decoder-only SLMs from different model families, requiring only fine-tuning of the SLM.
  • Efficiency Gains: Empirical results demonstrate substantial efficiency improvements, achieving speedups of up to 4 times, with only a minor performance decrease in comparison to using an LLM alone.
  • Implementation Details: LLM-to-SLM uses a simple MLP projector to map the prompt's representation from the LLM's embedding space into that of the SLM, enabling the hybrid model's autoregressive generation (see the decoding sketch after this list).
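
Building on the sketch above, a greedy decoding loop makes the cost profile explicit: the LLM runs exactly once on the prompt, and every generated token afterwards touches only the SLM. The token ids (bos_id, eos_id) and the SLM's call signature are again illustrative assumptions.

    import torch

    def generate(model: "LLMtoSLM", prompt_ids: torch.Tensor,
                 bos_id: int, eos_id: int, max_new_tokens: int = 128):
        # One expensive, parallel LLM pass; the result is reused at every step.
        cond = model.projector(model.encode_prompt(prompt_ids))
        out = torch.full((prompt_ids.size(0), 1), bos_id,
                         dtype=torch.long, device=prompt_ids.device)
        for _ in range(max_new_tokens):
            logits = model.slm(out, encoder_hidden_states=cond)  # cheap SLM step
            next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
            out = torch.cat([out, next_tok], dim=-1)
            if (next_tok == eos_id).all():
                break
        return out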

Empirical Evaluation

The paper's empirical evaluation spans several benchmarks, including machine translation, summarization, and instruction tuning, across different languages and datasets. The results show that the method maintains close-to-LLM performance while significantly increasing computational efficiency. Notably, on translation and summarization tasks, the LLM-to-SLM configuration achieves speedups of 4.2$\times$ and 3.0$\times$, respectively, with only a 1-2% drop in performance metrics.

Theoretical and Practical Implications

The approach underscores a pivotal shift toward more computationally efficient deployment of language models, particularly in scenarios where latency and computational resources are limiting factors. Theoretically, it makes a compelling case for distributing work among models of varying sizes, a principle that could extend beyond language models to other domains within AI. Practically, the method opens up new possibilities for deploying advanced NLG applications on edge devices, where computational resources are scarce.

Future Directions

The study outlines several areas for future development, including exploring the potential of decoder-only LLMs within this framework, investigating the dynamic invocation of LLMs for further efficiency gains, and extending the approach to models with billions of parameters to understand scalability implications fully. These directions not only promise to refine the LLM-to-SLM approach but also contribute to the broader research landscape on efficient AI model deployment.

Conclusion

This paper introduces a novel method, LLM-to-SLM, that elegantly addresses the computational inefficiencies associated with autoregressive decoding in LLMs. By leveraging the high-quality encodings of an LLM to guide the generation process of an SLM, it achieves significant improvements in speed and efficiency without substantially compromising on performance. As this research area continues to evolve, the LLM-to-SLM method stands as a significant step towards more sustainable and practical applications of language models in real-world scenarios.
