
Abstract

While LLMs have shown remarkable abilities, they are hindered by significant resource consumption and considerable latency due to autoregressive processing. In this study, we introduce Adaptive N-gram Parallel Decoding (ANPD), an innovative and lossless approach that accelerates inference by allowing the simultaneous generation of multiple tokens. ANPD incorporates a two-stage approach: it begins with a rapid drafting phase that employs an N-gram module, which adapts to the current interactive context, followed by a verification phase, during which the original LLM assesses and confirms the proposed tokens. Consequently, ANPD preserves the integrity of the LLM's original output while enhancing processing speed. We further leverage a multi-level architecture for the N-gram module to enhance the precision of the initial draft, thereby reducing inference latency. ANPD eliminates the need for retraining or extra GPU memory, making it an efficient and plug-and-play enhancement. In our experiments, models such as LLaMA and its fine-tuned variants have shown speed improvements of up to 3.67x, validating the effectiveness of our proposed ANPD.

Figure: The ANPD pipeline processes text via tokenization, N-gram initialization, and LLM-based autoregression.

Overview

  • The paper introduces Adaptive N-gram Parallel Decoding (ANPD), a method enhancing Large Language Model (LLM) inference efficiency by combining parallel N-gram token generation with LLM verification, thus reducing latency and preserving accuracy.

  • ANPD is a flexible plug-and-play approach that does not require additional training or significant modifications to existing LLM architectures.

  • It employs a Multi-Level N-gram (MLN) algorithm to manage the sparsity of higher-order N-grams, ensuring precision and efficiency in the acceleration process.

  • Experimental validations across various models and tasks show that ANPD significantly improves processing speeds, with increases ranging from 1.95x to 3.67x, while maintaining high-quality outputs.

Accelerating Large Language Model Inference with Adaptive N-gram Parallel Decoding (ANPD)

Introduction

Recent advancements in LLMs have been accompanied by increased computational demands, particularly during the inference phase. This paper introduces an innovative method, Adaptive N-gram Parallel Decoding (ANPD), which improves the inference efficiency of LLMs. ANPD uses a unique two-stage approach that combines rapid parallel token generation via an N-gram model with a subsequent verification process by the LLM, ensuring output accuracy while markedly reducing latency.

Core Contributions

The primary contributions of this paper are:

  • The development of ANPD, a plug-and-play approach designed to speed up LLM inference without the need for additional training or significant changes to existing model architectures.
  • Integration of an adaptive N-gram modeling technique which simplifies language modeling while retaining high accuracy and reducing the reliance on large textual datasets.
  • Introduction of a Multi-Level N-gram (MLN) algorithm to increase the precision of the draft outputs, thus enhancing the efficiency of the acceleration process.
  • Comprehensive validation of ANPD's effectiveness across various models and datasets, demonstrating substantial improvements in processing speed.

Methodology

Adaptive N-gram Parallel Decoding (ANPD)

ANPD enhances LLM inference by using an N-gram module to generate multiple candidate tokens, which the original LLM then verifies. The process begins with tokenization of the input; the N-gram predictions are then dynamically updated from the running text, forming a cyclical draft-and-verify loop, as sketched below.
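
The following is a minimal sketch of this draft-and-verify loop, not the authors' implementation: `llm_next_token` and `ngram_module` are illustrative placeholders (a model call returning the next token, and a drafter exposing `propose`/`update`), and verification is shown token by token even though in practice the LLM checks the whole draft in a single parallel forward pass.

```python
def anpd_generate(prompt_ids, llm_next_token, ngram_module,
                  max_new_tokens=128, draft_len=4):
    """Sketch of ANPD generation: N-gram drafting followed by LLM verification."""
    output = list(prompt_ids)
    while len(output) - len(prompt_ids) < max_new_tokens:
        # 1) Drafting: the N-gram module cheaply extends the current context.
        draft = ngram_module.propose(output, draft_len)

        if draft:
            # 2) Verification: the LLM checks each drafted token; shown here
            #    sequentially, but in practice this is a single parallel
            #    forward pass over the whole draft.
            for token in draft:
                predicted = llm_next_token(output)
                if predicted != token:
                    output.append(predicted)  # keep the LLM's token and stop
                    break
                output.append(token)          # draft token confirmed (lossless)
        else:
            # No N-gram match at any level: take one ordinary LLM step.
            output.append(llm_next_token(output))

        # 3) Adaptation: refresh the N-gram statistics with the accepted text
        #    so later drafts track the running context.
        ngram_module.update(output)
    return output
```

Because every emitted token is either the LLM's own prediction or a draft token the LLM explicitly confirmed, the output matches plain autoregressive decoding, which is what makes the acceleration lossless.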

Multi-Level N-gram (MLN)

To balance performance and accuracy, ANPD incorporates a Multi-Level N-gram setup that manages the sparsity of higher-order N-grams: when a high-order N-gram fails to produce a match, the module falls back to smaller, more reliable N-grams. A sketch of this fallback lookup follows.
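
Below is a minimal sketch of such a multi-level drafter, compatible with the `ngram_module` interface assumed in the earlier loop. The maximum order, tie-breaking by raw count, and the full-rebuild update strategy are assumptions made for clarity rather than details taken from the paper.

```python
from collections import defaultdict

class MultiLevelNgram:
    """Multi-Level N-gram drafter with high-to-low order fallback (sketch)."""

    def __init__(self, n_max=4):
        self.n_max = n_max
        # tables[n] maps a context tuple of length n-1 to {next_token: count}.
        self.tables = {n: defaultdict(lambda: defaultdict(int))
                       for n in range(2, n_max + 1)}

    def update(self, tokens):
        # Rebuild counts from the running token stream; an incremental update
        # would work equally well, a full pass just keeps the sketch simple.
        for n in range(2, self.n_max + 1):
            table = self.tables[n]
            table.clear()
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                table[context][tokens[i + n - 1]] += 1

    def next_token(self, tokens):
        # Try the highest-order table first; fall back to lower orders when
        # the longer context has never been observed (sparsity).
        for n in range(self.n_max, 1, -1):
            if len(tokens) < n - 1:
                continue
            candidates = self.tables[n].get(tuple(tokens[-(n - 1):]))
            if candidates:
                return max(candidates, key=candidates.get)  # most frequent
        return None  # no level matched; the caller falls back to the LLM

    def propose(self, tokens, draft_len):
        # Chain single-token lookups into a short draft.
        draft, history = [], list(tokens)
        for _ in range(draft_len):
            token = self.next_token(history)
            if token is None:
                break
            draft.append(token)
            history.append(token)
        return draft


# Toy usage: after seeing the stream once, the drafter proposes a continuation.
mln = MultiLevelNgram(n_max=3)
mln.update([1, 2, 3, 4, 1, 2, 3])
print(mln.propose([1, 2], draft_len=2))  # -> [3, 4]
```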

Experiments

Comprehensive experiments were conducted with several models, including LLaMA and its fine-tuned variants, across tasks such as text summarization and code generation. Across the models assessed, speed improvements ranged from 1.95x to 3.67x. The experiments used multiple datasets, including CNN/Daily Mail and XSUM for summarization and HumanEval for code generation, to ensure a diverse and robust assessment.
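
For context, a speedup figure of this kind is typically the ratio of generation throughput (equivalently, the inverse ratio of wall-clock latency) with and without ANPD; the numbers below are invented purely to illustrate the arithmetic.

```python
# Hypothetical throughput measurements (tokens per second); not from the paper.
baseline_tps = 30.0   # plain autoregressive decoding
anpd_tps = 75.0       # decoding with ANPD drafting and verification

speedup = anpd_tps / baseline_tps
print(f"speedup = {speedup:.2f}x")  # -> 2.50x, within the reported 1.95x-3.67x range
```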

Results and Discussion

The ANPD method demonstrated consistent performance improvements across different models and tasks. For instance, in text summarization tasks with LLaMA-7B, a speed improvement of approximately 2.98x was observed. Moreover, in code generation tasks using the CodeLLaMa-13B model, the speed improvement was as high as 3.67x. These results underline the effectiveness of ANPD in reducing inference times while maintaining the output quality of the original LLMs.

Future Directions

Moving forward, the exploration could extend to:

  • Adapting ANPD to exploit specific features of different LLMs to optimize its effectiveness further.
  • Extending the parallel processing capabilities during the LLM's verification phase to enhance performance gains.

Conclusion

The paper presents a scalable and efficient method for accelerating LLM inference without compromising output quality. ANPD's ability to integrate seamlessly as a plug-and-play solution makes it well suited to real-world use, providing a substantial boost in processing speed across various AI-driven applications.
