Splitwise: Efficient generative LLM inference using phase splitting

(2311.18677)
Published Nov 30, 2023 in cs.AR and cs.DC

Abstract

Recent innovations in generative LLMs have made their applications and use-cases ubiquitous. This has led to large-scale deployments of these models, using complex, expensive, and power-hungry AI accelerators, most commonly GPUs. These developments make LLM inference efficiency an important challenge. Based on our extensive characterization, we find that there are two main phases during an LLM inference request: a compute-intensive prompt computation, and a memory-intensive token generation, each with distinct latency, throughput, memory, and power characteristics. Despite state-of-the-art batching and scheduling, the token generation phase underutilizes compute resources. Specifically, unlike compute-intensive prompt computation phases, token generation phases do not require the compute capability of the latest GPUs, and can be run with lower power and cost. With Splitwise, we propose splitting the two phases of an LLM inference request onto separate machines. This allows us to use hardware that is well-suited for each phase, and provision resources independently per phase. However, splitting an inference request across machines requires state transfer from the machine running prompt computation over to the machine generating tokens. We implement and optimize this state transfer using the fast back-plane interconnects available in today's GPU clusters. We use the Splitwise technique to design LLM inference clusters using the same or different types of machines for the prompt computation and token generation phases. Our clusters are optimized for three key objectives: throughput, cost, and power. In particular, we show that we can achieve 1.4x higher throughput at 20% lower cost than current designs. Alternatively, we can achieve 2.35x more throughput with the same cost and power budgets.

Overview

  • Splitwise improves the efficiency of LLM inference by splitting each request into two phases, prompt computation and token generation, and mapping each phase to hardware suited to its characteristics.

  • Phase-splitting allows for better utilization of computing resources, with separate machines for different inference phases and high-speed state transfer between them.

  • Designs proposed by the paper achieve up to 1.4 times higher throughput at 20% lower cost, or a 2.35 times increase in throughput within the same cost and power envelope.

  • Splitwise's provisioning methodology ensures service level objective (SLO) compliance and can adapt to various configurations and workloads.

  • The paper shows Splitwise's potential for real-world applications in the AI community, owing to its adaptability and efficiency in the deployment of generative LLMs.

Introduction

Generative LLMs have rapidly become a foundational aspect of modern artificial intelligence research and deployment. Using these models effectively and efficiently during inference is a challenge that has garnered significant attention, particularly given the implications for computational resources. In the paper "Splitwise: Efficient Generative LLM Inference Using Phase Splitting," a novel approach to these challenges is presented, employing a technique called Splitwise to optimize LLM inference clusters for throughput, cost, and power objectives.

Phase Splitting in LLM Inference

One of the core insights from the paper is the identification of two distinct phases during LLM inference: prompt computation and token generation. Prompt computation is a compute-intensive phase dominated by floating-point throughput. Conversely, token generation is a memory-bound phase involving serialized computation, where each new token depends on previously generated tokens and cached context. Current deployments underutilize compute resources during the token generation phase, even with state-of-the-art batching; this is where Splitwise comes into play. It proposes running prompt computation and token generation on separate machines, allowing each phase to leverage hardware that is optimally suited to its computational profile. This separation requires efficient state transfer, specifically of the model's key-value cache, from the prompt-processing machine to the token-generating machine, optimized through the high-speed networking available in GPU clusters.
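To make the split concrete, here is a minimal toy sketch of the idea (not the paper's implementation; all class and function names are hypothetical). A "prompt machine" role runs the compute-heavy prefill and produces the KV cache plus the first token; a "token machine" role receives the cache and continues the memory-bound decode loop. The `transfer_kv_cache` placeholder stands in for the optimized transfer over the cluster's back-plane interconnect.

```python
# Hypothetical sketch of Splitwise-style phase splitting (illustrative only).
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class KVCache:
    # Toy stand-in for the per-layer key/value tensors kept during decoding.
    entries: List[str]


def prompt_phase(prompt_tokens: List[str]) -> Tuple[KVCache, str]:
    """Compute-intensive prefill: process the whole prompt in parallel."""
    cache = KVCache(entries=list(prompt_tokens))
    first_token = f"<gen-after:{prompt_tokens[-1]}>"
    return cache, first_token


def transfer_kv_cache(cache: KVCache) -> KVCache:
    """Placeholder for the KV-cache transfer (e.g., over InfiniBand)."""
    return KVCache(entries=list(cache.entries))  # copy to mimic a send


def token_phase(cache: KVCache, first_token: str, max_new_tokens: int) -> List[str]:
    """Memory-bound decode: generate one token at a time, growing the cache."""
    generated = [first_token]
    for i in range(max_new_tokens - 1):
        cache.entries.append(generated[-1])
        generated.append(f"<tok{i}>")
    return generated


if __name__ == "__main__":
    cache, first = prompt_phase(["The", "quick", "brown", "fox"])
    remote_cache = transfer_kv_cache(cache)  # prompt machine -> token machine
    print(token_phase(remote_cache, first, max_new_tokens=4))
```

The point of the split is visible in the structure: `prompt_phase` touches the whole prompt at once and benefits from raw compute, while `token_phase` loops one token at a time and is limited by memory bandwidth and capacity, so the two roles can be provisioned on different hardware.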

Splitwise: Design and Optimization

The paper introduces the design of LLM inference clusters built on Splitwise's phase-splitting technique and optimized for throughput, cost, and power. Three cluster designs are evaluated, spanning homogeneous and heterogeneous GPU combinations, including configurations where the prompt phase runs on high-performance GPUs while token generation runs on hardware optimized for memory bandwidth and capacity. Benchmark results indicate that the Splitwise-based designs can attain up to 1.4 times higher throughput at 20% lower cost compared to conventional designs, or alternatively, a 2.35 times increase in throughput within the same cost and power envelope. These gains come from targeting the distinct requirements of each inference phase, pushing the efficiency boundaries of LLM deployment.
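As a rough illustration of how such a cost/power trade-off could be explored, the sketch below searches over mixes of prompt- and token-optimized machines under fixed budgets. All per-machine figures are invented for illustration and are not from the paper.

```python
# Hypothetical back-of-the-envelope cluster sizing (illustrative numbers only).
from itertools import product

# Assumed per-machine figures: cost (arbitrary units), power (W), and the
# requests/s each machine type sustains for the phase it is provisioned for.
PROMPT_MACHINE = {"cost": 10.0, "power": 700.0, "rps": 4.0}  # prefill-optimized
TOKEN_MACHINE = {"cost": 6.0, "power": 400.0, "rps": 3.0}    # decode-optimized

COST_BUDGET = 100.0
POWER_BUDGET = 6000.0


def cluster_throughput(n_prompt: int, n_token: int) -> float:
    """End-to-end throughput is capped by the slower of the two pools."""
    return min(n_prompt * PROMPT_MACHINE["rps"], n_token * TOKEN_MACHINE["rps"])


best = None
for n_prompt, n_token in product(range(11), repeat=2):
    cost = n_prompt * PROMPT_MACHINE["cost"] + n_token * TOKEN_MACHINE["cost"]
    power = n_prompt * PROMPT_MACHINE["power"] + n_token * TOKEN_MACHINE["power"]
    if cost <= COST_BUDGET and power <= POWER_BUDGET:
        tput = cluster_throughput(n_prompt, n_token)
        if best is None or tput > best[0]:
            best = (tput, n_prompt, n_token, cost, power)

print("best mix (throughput, prompt machines, token machines, cost, power):", best)
```

The key design choice this mirrors is that a split cluster can buy cheaper, lower-power machines for the decode pool and spend the saved budget where prefill compute is actually needed, rather than provisioning every machine for the worst-case phase.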

Cluster Provisioning and Scalability

The provisioning methodology under Splitwise is thoroughly outlined, accommodating different models, workloads, and service level objectives (SLOs). The paper examines numerous cluster configurations, ensuring SLO compliance across percentile targets for the key latencies (end-to-end, time to first token, and time between tokens). Splitwise's design is shown to be adaptable across workloads and robust to model variations and load fluctuations, suggesting significant potential for real-world applicability. The discussion and related work point to opportunities for innovation both in hardware tailored to the prompt and token phases and in scheduling strategies for heterogeneous platforms.
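A provisioning loop of this kind needs a way to test a candidate cluster against its latency SLOs. The sketch below (thresholds, percentiles, and sample data are assumptions for illustration, not the paper's values) checks end-to-end, time-to-first-token, and time-between-tokens latencies at a target percentile.

```python
# Hypothetical SLO-compliance check over measured latency samples.
import random
import statistics

# Assumed SLOs: (metric name, percentile, threshold in seconds)
SLOS = [("e2e", 99, 10.0), ("ttft", 99, 1.0), ("tbt", 99, 0.2)]


def percentile(samples, p):
    """p-th percentile via the 'inclusive' method of statistics.quantiles."""
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]


def slo_compliant(measurements: dict) -> bool:
    """Return True only if every metric meets its percentile threshold."""
    for metric, p, threshold in SLOS:
        observed = percentile(measurements[metric], p)
        if observed > threshold:
            print(f"violation: p{p} {metric} = {observed:.3f}s > {threshold}s")
            return False
    return True


if __name__ == "__main__":
    random.seed(0)
    fake = {  # synthetic latency samples standing in for a load test
        "e2e": [random.uniform(2.0, 8.0) for _ in range(1000)],
        "ttft": [random.uniform(0.1, 0.8) for _ in range(1000)],
        "tbt": [random.uniform(0.02, 0.15) for _ in range(1000)],
    }
    print("compliant:", slo_compliant(fake))
```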

In summary, "Splitwise: Efficient Generative LLM Inference Using Phase Splitting" offers a practical approach to optimizing the deployment of generative LLMs, achieving higher efficiency and throughput while balancing cost and power constraints. The presented technique and findings are anticipated to be crucial for the AI community as it leans towards more scalable and efficient use of language models in numerous applications.
