
BASS: Batched Attention-optimized Speculative Sampling

(2404.15778)
Published Apr 24, 2024 in cs.LG and cs.CL

Abstract

Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting LLMs. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, for an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget in which regular decoding does not finish, our system is able to generate sequences with a HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what is feasible with single-sequence speculative decoding. Peak GPU utilization during decoding reaches as high as 15.8%, more than 3X that of regular decoding and around 10X that of single-sequence speculative decoding.

Figure: Comparison of latency and GPU usage across regular decoding (RD), speculative decoding (SD), and BASS on two models.

Overview

  • BASS enhances speculative decoding in LLMs by introducing batch processing for multiple sequences to improve latency, GPU usage, and output quality under time constraints.

  • The method handles variable-length sequences within a batch using customized CUDA kernels for ragged tensors, and dynamically adjusts the number of draft tokens during inference.

  • Significant performance improvements were observed in real-world AI applications such as coding assistants and conversational agents, facilitating more efficient real-time interactions.

Batched Attention-optimized Speculative Sampling: Innovations in Multi-Sequence Generation for LLMs

Introduction

The evolution of language model inference methods continues to be a critical area of research, particularly as models are scaled to billions of parameters. Batched Attention-optimized Speculative Sampling (BASS) addresses a key inefficiency of existing speculative decoding, which operates on single-sequence batches, by handling multiple sequences simultaneously. It demonstrates significant improvements in latency, GPU utilization, and generation quality under tight time constraints, optimizing speculative decoding across multiple dimensions.

Key Challenges and Methodology

Existing Limitations

Traditional speculative decoding techniques are largely limited to single-sequence processing, curbing their ability to exploit the parallelism of modern GPU hardware. This limitation is particularly impactful in scenarios where multiple output sequences are required simultaneously, as is common in practical AI applications where latency and throughput are critical.

BASS Overview

BASS extends speculative decoding beyond these limitations by leveraging batched processing and a novel approach to computing attention across variable-length sequences within a batch. The core technique performs speculative token drafting in parallel across the batch, with dynamic adjustments based on how many draft tokens each sequence accepts, which significantly enhances throughput.
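
To make the draft-then-verify flow concrete, below is a minimal, self-contained sketch of one batched speculative-decoding step. The toy models, the padding-based batching, and the greedy acceptance rule are illustrative assumptions for exposition only; BASS itself replaces the padding with custom CUDA kernels that operate on ragged tensors and uses its own acceptance and draft-length logic.

```python
# Minimal sketch of one batched speculative-decoding step.
# Names, the toy models, and the padding used here are illustrative only.
import torch

torch.manual_seed(0)
VOCAB = 50

class ToyLM(torch.nn.Module):
    """Stand-in for a decoder-only LM: returns logits for every position."""
    def __init__(self, dim=32):
        super().__init__()
        self.emb = torch.nn.Embedding(VOCAB, dim)
        self.out = torch.nn.Linear(dim, VOCAB)
    def forward(self, ids):                 # ids: (batch, seq_len)
        return self.out(self.emb(ids))      # (batch, seq_len, VOCAB)

def pad_batch(seqs, pad_id=0):
    """Right-pad variable-length sequences into one (batch, max_len) tensor."""
    max_len = max(len(s) for s in seqs)
    out = torch.full((len(seqs), max_len), pad_id, dtype=torch.long)
    for i, s in enumerate(seqs):
        out[i, : len(s)] = s
    return out

@torch.no_grad()
def batched_speculative_step(draft_model, target_model, seqs, draft_len=4):
    """One draft-then-verify step; each sequence may accept a different
    number of draft tokens, so the batch stays ragged."""
    # 1) Draft: the small model proposes draft_len greedy tokens per sequence.
    drafts = []
    for s in seqs:
        cur = s.clone()
        for _ in range(draft_len):
            nxt = draft_model(cur[None])[0, -1].argmax()
            cur = torch.cat([cur, nxt[None]])
        drafts.append(cur[len(s):])

    # 2) Verify: one batched forward pass of the target model over
    #    sequence + draft (padded here; BASS avoids padding with ragged kernels).
    full = [torch.cat([s, d]) for s, d in zip(seqs, drafts)]
    logits = target_model(pad_batch(full))

    # 3) Accept the longest prefix of each draft that matches the target
    #    model's own greedy choice, then append the target's next token.
    new_seqs = []
    for i, (s, d) in enumerate(zip(seqs, drafts)):
        preds = logits[i, len(s) - 1 : len(s) - 1 + draft_len].argmax(-1)
        n_ok = 0
        while n_ok < draft_len and preds[n_ok] == d[n_ok]:
            n_ok += 1
        kept = torch.cat([d[:n_ok], preds[n_ok : n_ok + 1]])
        new_seqs.append(torch.cat([s, kept]))
    return new_seqs

seqs = [torch.randint(1, VOCAB, (n,)) for n in (5, 9, 7)]   # ragged batch
seqs = batched_speculative_step(ToyLM(), ToyLM(), seqs)
print([len(s) for s in seqs])
```

The batched setting shows up in step 3: each sequence may accept a different number of draft tokens per step, so sequence lengths diverge within the batch, which is exactly why BASS needs attention kernels that handle ragged tensors.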

Technical Innovations

  • CUDA Kernels and Ragged Tensors: Handling of ragged tensors via customized CUDA kernels, facilitating efficient memory management and parallel computation.
  • Dynamic Draft Length: An algorithmic mechanism that adjusts the number of draft tokens per step during inference, improving flexibility and adaptability (a simple illustrative heuristic is sketched after this list).
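
This summary does not spell out the paper's exact adjustment rule, but an acceptance-driven heuristic conveys the idea. The function name and thresholds below are illustrative assumptions, not values from the paper.

```python
def adjust_draft_length(draft_len, accepted_counts, min_len=2, max_len=10):
    """Illustrative heuristic for dynamic draft length (not the paper's rule).

    accepted_counts: number of draft tokens accepted for each sequence in the
    batch during the last step. If most drafts were fully accepted, drafting
    is cheap relative to verification, so propose more tokens next time; if
    acceptance was poor, draft work is being wasted, so propose fewer.
    """
    avg_accept = sum(accepted_counts) / (len(accepted_counts) * draft_len)
    if avg_accept > 0.8:          # drafts almost always accepted -> draft more
        draft_len = min(max_len, draft_len + 1)
    elif avg_accept < 0.4:        # drafts mostly rejected -> draft less
        draft_len = max(min_len, draft_len - 1)
    return draft_len

# Example: batch of 4 sequences that accepted 4, 3, 4, 4 of 4 draft tokens
print(adjust_draft_length(4, [4, 3, 4, 4]))   # -> 5 (avg acceptance ~0.94)
```

The intuition is that high acceptance means each verification pass of the large model is amortized over more generated tokens, while low acceptance means the extra draft tokens are mostly discarded.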

Experimental Setup and Results

Models and Metrics

The system was evaluated with models such as OPT 13B and CodeGen-Mono 16B, using metrics that include HumanEval pass@k (as well as Pass@First and Pass@All within a time budget) for code generation and ROUGE for summarization.
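
For reference, the standard unbiased pass@k estimator used with HumanEval (Chen et al., 2021) can be computed as follows; this is background on the metric rather than code from the paper, and the Pass@First/Pass@All numbers quoted in the abstract are simpler first/any checks over a generated batch.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k samples,
    drawn from n generated samples of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 8 generations per problem, 3 of them correct
print(pass_at_k(n=8, c=3, k=1))   # 0.375
```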

Findings

With BASS, notable improvements were observed:

  • Latency and Throughput: Achieved up to 2.15 times speed-up over optimized regular decoding methods.
  • GPU Utilization: Peak GPU utilization during decoding reaches 15.8%, more than threefold that of regular decoding and roughly tenfold that of single-sequence speculative decoding.

These improvements translate into better performance for applications such as coding assistants and conversational agents, without the extended wait times typical of earlier methods.

Implications and Future Directions

Practical Implications

BASS allows real-world AI systems, especially those requiring real-time interaction and generation of multiple responses, to function more efficiently. This capability can transform user experiences across interfaces where rapid responses are essential.

Theoretical Contributions

This research contributes to the understanding of how speculative decoding can be synergized with batch processing to overcome core challenges in AI inference, such as those posed by memory bandwidth and latency.

Future Research

Further exploration into reducing disparities in GPU utilization across different phases of model inference could yield even faster and more efficient systems. Adapting BASS to a wider range of model architectures and sizes, and further optimizing CUDA implementations could also extend the benefits observed.

In conclusion, Batched Attention-optimized Speculative Sampling sets a new benchmark in the utilization and efficiency of LLMs, offering pathways to both theoretical and practical enhancements in the field of AI model inference.
