BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems (2401.17644v5)

Published 31 Jan 2024 in cs.DC and cs.PF

Abstract: Serving systems for LLMs are often optimized to improve quality of service (QoS) and throughput. However, due to the lack of open-source LLM serving workloads, these systems are frequently evaluated under unrealistic workload assumptions. Consequently, performance may degrade when systems are deployed in real-world scenarios. This work presents BurstGPT, an LLM serving workload with 10.31 million traces from regional Azure OpenAI GPT services over 213 days. BurstGPT captures LLM serving characteristics from user, model and system perspectives: (1) User request concurrency: burstiness variations of requests in Azure OpenAI GPT services, revealing diversified concurrency patterns in different services and model types. (2) User conversation patterns: counts and intervals within conversations for service optimizations. (3) Model response lengths: auto-regressive serving processes of GPT models, showing statistical relations between requests and their responses. (4) System response failures: failures of conversation and API services, showing intensive resource needs and limited availability of LLM services in Azure. The details of the characteristics can serve multiple purposes in LLM serving optimizations, such as system evaluation and trace provisioning. In our demo evaluation with BurstGPT, frequent variations in BurstGPT reveal declines in efficiency, stability, or reliability in realistic LLM serving. We identify that the generalization of KV cache management, scheduling and disaggregation optimizations can be improved under realistic workload evaluations. BurstGPT is publicly available now at https://github.com/HPMLL/BurstGPT and is widely used to develop prototypes of LLM serving frameworks in the industry.

Summary

The paper introduces BurstGPT, the first trace dataset capturing real-world LLM serving workloads with over 1.4 million request-response pairs.
It analyzes bursty workload patterns using Gamma distributions, highlighting distinct behaviors between conversational and API services.
The study evaluates performance impacts, exposing GPU memory bottlenecks and establishing a benchmark suite for reliable LLM serving system assessments.

Overview of "Towards Efficient and Reliable LLM Serving: A Real-World Workload Study"

The paper "Towards Efficient and Reliable LLM Serving: A Real-World Workload Study" analyzes real-world workload characteristics of LLM serving systems, focusing on Generative Pre-trained Transformer (GPT) models. It emphasizes the operational challenges of deploying LLMs, chiefly their substantial cost and resource demands. To address the lack of public workload data, it presents the first trace dataset of real-world LLM workloads, termed "BurstGPT," which captures user, system, and model behavior over two months within a campus setting.

Key Contributions and Findings

Real-World Workload Dataset:
- BurstGPT provides empirical insight into LLM serving workloads. The dataset comprises about 1.43 million (1,429.7 thousand) request-response pairs from both ChatGPT and GPT-4 models, covering conversational and API service interactions. The dataset omits actual content to protect user privacy and records only metadata such as request and response lengths and timestamps.
Analysis of Burstiness and Patterns:
- The paper identifies significant bursty patterns in LLM workloads, emphasizing discrepancies between conversational and API services. BurstGPT reveals unique characteristics, such as periodically high activity in conversational services and irregular, bursty patterns in API services, particularly influenced by automated usage patterns.
- The paper models these bursts using Gamma distributions, emphasizing the variability in temporal patterns, which poses challenges for workload provisioning.
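
To make the Gamma-based burst modeling concrete, the following is a minimal sketch that fits a Gamma distribution to request inter-arrival times and reads burstiness off the fitted shape. It assumes a method-of-moments fit, which is a simplification chosen here and not necessarily the paper's fitting procedure. The function names `fit_gamma_moments` and `sample_arrivals` are hypothetical, and the arrival times are synthetic placeholders, not BurstGPT data.

```python
import random
import statistics


def fit_gamma_moments(intervals):
    """Method-of-moments Gamma fit (an assumption, not the paper's procedure).

    For Gamma(shape k, scale theta): mean = k * theta, variance = k * theta**2.
    """
    mean = statistics.fmean(intervals)
    var = statistics.variance(intervals)
    shape = mean * mean / var
    scale = var / mean
    return shape, scale


def sample_arrivals(shape, scale, n, seed=0):
    """Generate n arrival timestamps with Gamma-distributed inter-arrival gaps."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    for _ in range(n):
        t += rng.gammavariate(shape, scale)  # alpha = shape, beta = scale
        arrivals.append(t)
    return arrivals


# Synthetic stand-in for trace timestamps (seconds); not real BurstGPT data.
arrivals = sample_arrivals(shape=0.5, scale=2.0, n=20000, seed=42)
intervals = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]

shape_hat, scale_hat = fit_gamma_moments(intervals)

# Coefficient of variation of a Gamma distribution is 1 / sqrt(shape).
# CV > 1 (shape < 1) means burstier than a Poisson process, whose CV is 1.
cv = shape_hat ** -0.5
print(f"shape={shape_hat:.3f} scale={scale_hat:.3f} cv={cv:.3f}")
```

A fitted shape below 1 indicates arrivals that cluster more than a Poisson process would, which is one simple way to quantify the burstiness described above.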
Performance and Reliability Evaluation:
- A major finding is the vulnerability of LLM systems to short-term burstiness, which impacts GPU memory usage and performance stability. This research highlights frequent request failures due to memory bottlenecks, especially in high-concurrency scenarios typical of LLM serving.
- The paper introduces a benchmark suite derived from BurstGPT to enable evaluations that reflect real-world workload distributions, facilitating precise performance analysis of LLM serving systems.
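
The link between short-term bursts and memory-related failures can be illustrated with a toy trace replay. This is a sketch under strong assumptions and not the paper's benchmark suite: the function `replay` is hypothetical, each request is assumed to reserve memory for all of its request and response tokens on arrival and to hold it for a fixed duration, and a request that does not fit in the token budget is counted as failed. Real serving systems typically allocate KV cache incrementally and may queue or preempt instead of rejecting.

```python
import heapq


def replay(trace, token_budget):
    """Replay a trace against a fixed token budget (toy admission model).

    trace: list of (arrival_s, request_tokens, response_tokens, duration_s),
           sorted by arrival time.
    Returns (failed_requests, peak_tokens_in_use).
    """
    active = []  # min-heap of (finish_time, tokens_held)
    in_use = 0
    peak = 0
    failed = 0
    for arrival, req_tok, resp_tok, duration in trace:
        # Release memory held by requests that finished before this arrival.
        while active and active[0][0] <= arrival:
            _, tokens = heapq.heappop(active)
            in_use -= tokens
        need = req_tok + resp_tok
        if in_use + need > token_budget:
            failed += 1  # assumed behavior: reject when memory is exhausted
            continue
        in_use += need
        peak = max(peak, in_use)
        heapq.heappush(active, (arrival + duration, need))
    return failed, peak


# Ten identical synthetic requests: 100 request tokens, 100 response tokens, 2 s each.
smooth = [(i * 1.0, 100, 100, 2.0) for i in range(10)]  # one arrival per second
bursty = [(i * 0.1, 100, 100, 2.0) for i in range(10)]  # all arrive within 1 s

print("smooth:", replay(smooth, token_budget=600))  # → (0, 400)
print("bursty:", replay(bursty, token_budget=600))  # → (7, 600)
```

Both traces carry the same total work, yet only the bursty one exhausts the budget, which mirrors why evaluating with realistic arrival patterns matters more than evaluating with average load alone.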

Implications and Future Directions

The paper's analysis has both practical and theoretical implications for the deployment and optimization of LLM serving systems. Practically, understanding bursty workload patterns helps in designing more elastic and reliable serving frameworks capable of adjusting resources dynamically to meet service-level objectives (SLOs). Theoretically, the paper provides a foundation for developing more sophisticated models of LLM behavior, which can inform the design of more efficient serving architectures.

Importantly, the availability of BurstGPT as a public resource encourages further research into workload optimization strategies, including resource allocation and scheduling policies for LLMs. Future developments could explore advanced predictive analytics to better anticipate workload surges, improving system adaptability and reliability.

In conclusion, this paper provides critical insight into the behavior of LLM serving systems under realistic conditions. By establishing a baseline dataset and analytical framework, it lays the groundwork for more robust, efficient, and user-centric LLM applications in diverse industrial contexts. As AI systems continue to scale, such empirical studies will be instrumental in guiding the evolution of the infrastructure required to support them.

GitHub

GitHub - HPMLL/BurstGPT: A GPT-3.5 & GPT-4 Workload Trace to Optimize LLM Serving Systems (128 stars)

Tweets

https://twitter.com/HPCPapers/status/1752935541480284263