
Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems (2407.07000v2)

Published 9 Jul 2024 in cs.LG, cs.AI, cs.CL, and cs.DC

Abstract: Serving LLMs in production can incur substantial costs, which has prompted recent advances in inference system optimizations. Today, these systems are evaluated against conventional latency and throughput metrics (e.g., TTFT, TBT, normalized latency, and TPOT). However, these metrics fail to fully capture the nuances of LLM inference, leading to an incomplete assessment of user-facing performance crucial for real-time applications such as chat and translation. In this paper, we first identify the pitfalls of current performance metrics in evaluating LLM inference systems. We then propose Etalon, a comprehensive performance evaluation framework that includes fluidity-index -- a novel metric designed to reflect the intricacies of the LLM inference process and its impact on real-time user experience. Finally, we evaluate various existing open-source platforms and model-as-a-service offerings using Etalon, discussing their strengths and weaknesses. Etalon is available at https://github.com/project-etalon/etalon.


Summary

  • The paper introduces Etalon, a framework built around a new fluidity-index metric that tracks token generation deadlines in LLM inference systems.
  • The evaluation shows that traditional metrics such as TTFT and TPOT mask latency spikes, while deadline-based fluid token rate analysis gives a clearer picture of performance.
  • Etalon's holistic approach enables operator-defined SLOs and benchmarking of both proprietary and open-source systems under varying loads.

"Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems"

Introduction

The paper presents "Metron," a framework for evaluating LLM inference systems by addressing the limitations of current performance metrics. It focuses on enhancing real-time user experience metrics for LLMs used in applications like chat and translation. Standard metrics such as Time To First Token (TTFT), Time Between Tokens (TBT), and normalized latency often fail to capture the complex, temporal dynamics of LLM systems. Instead, Metron introduces a new metric called the "fluidity-index," which aims to assess the real-life performance of LLMs more accurately.

Motivation and Shortcomings of Existing Metrics

Current metrics evaluate LLM inference systems inadequately. TTFT, which measures prompt processing efficiency, grows superlinearly with prompt length, since attention cost is quadratic in the number of prompt tokens; this makes fixed TTFT SLOs impractical, especially for models with long context support (Figure 1). A toy latency model after the figure illustrates the effect.

Figure 1: Increase in prefill latency with prompt length (Yi-34B on 2-H100) makes it infeasible to operate with fixed TTFT SLOs, especially for models with long context support.
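To make the scaling argument concrete, here is a minimal sketch of a prefill-latency model in Python. The coefficients are illustrative assumptions, not measurements from the paper; the point is only that any curve with a quadratic term eventually overruns a fixed deadline:

```python
# Illustrative latency model (hypothetical coefficients, not measurements
# from the paper): prefill cost has a linear term plus a quadratic
# attention term, so it eventually overruns any fixed TTFT deadline.
def prefill_latency_s(prompt_len: int, a: float = 2e-5, b: float = 3e-9) -> float:
    return a * prompt_len + b * prompt_len ** 2

FIXED_TTFT_SLO_S = 1.0  # a fixed deadline that looks generous for short prompts

for n in (1_000, 8_000, 32_000, 64_000):
    lat = prefill_latency_s(n)
    verdict = "meets" if lat <= FIXED_TTFT_SLO_S else "violates"
    print(f"{n:>6} tokens: prefill ~ {lat:6.2f} s ({verdict} fixed SLO)")
```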

Additionally, metrics such as TPOT and normalized latency can mask irregularities in token generation because they average over the entire decode. Tail latencies, likewise, give a misleading estimate of token throughput because they ignore the distribution and magnitude of generation stalls (Figure 2). A toy calculation after the figure makes this concrete.

Figure 2: (a) Decode tokens can be intermittently stalled due to prefills from incoming requests. (b) Naively normalizing total decode latency in TPOT hides these latency spikes and overestimates system token throughput. (c) Simply observing tail latency does not capture the nuances of the latency distribution: P85 latency for Sarathi-Serve is higher than for vLLM, while its P99 latency is much lower. Performance evaluation with the fluid token generation rate accounts for all these variations and provides an accurate, balanced view of system performance. Here, for the fluid token generation rate, we enforce that 99% of requests meet deadlines at least 90% of the time (fluidity-index > 0.9).
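To see how normalization hides a stall, consider a toy example with hypothetical timings (not data from the paper): nine fast tokens and one long stall yield a TPOT that looks healthy even though the user experienced a one-second freeze.

```python
# Hypothetical per-token inter-arrival times (seconds) for a 10-token
# decode: nine fast tokens plus one 1.0 s stall from a competing prefill.
tbt = [0.03] * 9 + [1.00]

tpot = sum(tbt) / len(tbt)             # the normalized metric many benchmarks report
print(f"TPOT:    {tpot:.3f} s/token")  # ~0.127 s/token -- looks acceptable
print(f"Max TBT: {max(tbt):.2f} s")    # 1.00 s -- a user-visible stall TPOT averages away
```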

Etalon: Design and Implementation

Etalon addresses these issues by introducing the fluidity-index metric, which evaluates token generation against a timeline of expected deadlines rather than an average token rate. The approach draws an analogy between LLM token generation and frame rendering in real-time video systems: if a few tokens are generated ahead of schedule, a later delay can be absorbed by the buffer, reducing the stall perceived by the user.

The key is assigning each token an expected generation time (a deadline) covering both the prefill and decode stages. A defined scheduling slack adjusts for system load and unexpected delays, refining deadline accuracy. The fluidity-index then measures how closely a system meets these deadlines, yielding a metric that reflects real-time user experience.
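The following is a minimal sketch of this deadline-tracking idea in Python. It is a simplification, not the authors' implementation; in particular, re-anchoring the schedule after a miss (a "deadline refresh") is our assumption about how a single stall is kept from penalizing every later token.

```python
def fluidity_index(arrivals, ttft_deadline, target_tbt):
    """Fraction of token deadlines met (simplified sketch).

    arrivals: absolute arrival time (s) of each token, measured from
    request start. The first token is due at ttft_deadline; each later
    token is due target_tbt seconds after the previous deadline. After
    a miss, the schedule is re-anchored at the late token's arrival
    (an assumed "deadline refresh"), so one stall is not charged
    against every subsequent token.
    """
    met, deadline = 0, ttft_deadline
    for t in arrivals:
        if t <= deadline:
            met += 1
            deadline += target_tbt     # tokens ahead of schedule build a buffer
        else:
            deadline = t + target_tbt  # refresh the schedule after a miss
    return met / len(arrivals)

# One stall between tokens 3 and 4; 4 of 5 deadlines are met.
print(fluidity_index([0.40, 0.43, 0.46, 1.50, 1.53],
                     ttft_deadline=0.5, target_tbt=0.05))  # 0.8
```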

Evaluation

The authors conducted extensive performance evaluations using Etalon on both open-source and proprietary LLM systems. For example, public APIs such as Anyscale, Groq, and Fireworks were analyzed with Etalon's black-box evaluation to gauge their performance under various configurations (Figure 3); a sketch of the measurement idea follows the figure.

Figure 3: Evaluation of proprietary serving offerings for Mixtral-8x7B and Llama3-70B performed over a duration of 24 hours. (a) shows the token throughput as estimated by different decode latency metrics, (b) presents the overall decode latency distribution across all requests, (c) shows the TTFT for different prompt lengths, and (d) provides a full characterization of the system by showing the fluidity-index as a function of target TBT.
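Because the evaluation is black-box, only client-side token arrival times are needed; no server instrumentation is required. A minimal sketch of such a measurement wrapper (the streaming client it wraps is assumed, not part of the paper):

```python
import time

def record_arrivals(stream):
    """Black-box measurement sketch: wrap any token stream (e.g. tokens
    yielded from a provider's streaming API) and record each token's
    arrival time relative to when the request was issued. These
    client-side timestamps are all a deadline-based metric needs."""
    start = time.monotonic()
    return [time.monotonic() - start for _token in stream]

# Hypothetical usage with an assumed streaming client:
# arrivals = record_arrivals(client.stream(prompt))
# score = fluidity_index(arrivals, ttft_deadline=0.5, target_tbt=0.05)
```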

Open-source systems, specifically vLLM and Sarathi-Serve, were evaluated with Etalon to define operator-driven Service Level Objectives (SLOs) for deployment scenarios, demonstrating Etalon's ability to identify the maximum capacity each system can sustain under a given service quality requirement (Figure 4). A sketch of this capacity search appears after the figure.

Figure 4: Capacity evaluation of open-source systems vLLM and Sarathi-Serve performed on H100 for Llama3-8B. (a) shows the overall capacity obtained using different decode latency metrics: TBT, TPOT, and fluidity-index. (b) captures the distribution of deadline miss rate at the capacity point for the two systems.
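The capacity search can be sketched as a load sweep against an SLO like the one in Figure 2 (99% of requests with fluidity-index above 0.9). The benchmark harness here is hypothetical, standing in for whatever load generator an operator uses:

```python
def meets_slo(fluidity_per_request, min_fluidity=0.9, quantile=0.99):
    """SLO check for the capacity search: at least `quantile` of
    requests must achieve a fluidity-index above `min_fluidity`."""
    ok = sum(f > min_fluidity for f in fluidity_per_request)
    return ok / len(fluidity_per_request) >= quantile

def find_capacity(run_benchmark, qps_levels):
    """Sweep increasing offered load; capacity is the highest load at
    which the SLO still holds. `run_benchmark(qps)` is a hypothetical
    harness returning one fluidity-index per completed request."""
    capacity = None
    for qps in qps_levels:
        if meets_slo(run_benchmark(qps)):
            capacity = qps
        else:
            break  # SLO violated; higher loads will only be worse
    return capacity
```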

Conclusion

Etalon emerges as a comprehensive LLM evaluation framework, offering detailed insight into inference performance from a user-centric perspective. By introducing the fluidity-index, Etalon bridges the gap between traditional metrics and real-world application demands, providing a holistic view of system efficiency. The work is a step toward standardized evaluation of LLM inference systems, helping developers and researchers align system performance with user experience. Etalon's open-source availability further supports its adoption and potential integration as a standard evaluation tool across the field.
