
Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems (2407.07000v2)

Published 9 Jul 2024 in cs.LG, cs.AI, cs.CL, and cs.DC

Abstract: Serving LLMs in production can incur substantial costs, which has prompted recent advances in inference system optimizations. Today, these systems are evaluated against conventional latency and throughput metrics (e.g., TTFT, TBT, normalized latency, and TPOT). However, these metrics fail to fully capture the nuances of LLM inference, leading to an incomplete assessment of user-facing performance crucial for real-time applications such as chat and translation. In this paper, we first identify the pitfalls of current performance metrics in evaluating LLM inference systems. We then propose Etalon, a comprehensive performance evaluation framework that includes fluidity-index -- a novel metric designed to reflect the intricacies of the LLM inference process and its impact on real-time user experience. Finally, we evaluate various existing open-source platforms and model-as-a-service offerings using Etalon, discussing their strengths and weaknesses. Etalon is available at https://github.com/project-etalon/etalon.


Summary

  • The paper introduces Etalon, a framework built around a new fluidity-index metric that tracks token generation deadlines in LLM inference systems.
  • The evaluation shows that traditional metrics such as TTFT and TPOT mask latency spikes, while deadline-based fluid token rate analysis gives a clearer picture of performance.
  • Etalon's holistic approach enables operator-defined SLOs and benchmarking of both proprietary and open-source systems under varying loads.

"Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems"

Introduction

The paper presents "Metron," a framework for evaluating LLM inference systems by addressing the limitations of current performance metrics. It focuses on enhancing real-time user experience metrics for LLMs used in applications like chat and translation. Standard metrics such as Time To First Token (TTFT), Time Between Tokens (TBT), and normalized latency often fail to capture the complex, temporal dynamics of LLM systems. Instead, Metron introduces a new metric called the "fluidity-index," which aims to assess the real-life performance of LLMs more accurately.

Motivation and Shortcomings of Existing Metrics

Current metrics evaluate LLM inference systems inadequately. TTFT, which measures prompt processing efficiency, grows superlinearly with prompt length, since attention cost is quadratic in the number of prompt tokens; this makes fixed TTFT SLOs impractical, especially for models with long context support (Figure 1). A toy latency model after the figure illustrates the effect.

Figure 1: Increase in prefill latency with prompt length (Yi-34B on 2-H100) makes it infeasible to operate with fixed TTFT SLOs, especially for models with long context support.
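To make the scaling argument concrete, here is a minimal sketch of a prefill-latency model in Python. The coefficients are illustrative assumptions, not measurements from the paper; the point is only that any curve with a quadratic term eventually overruns a fixed deadline:

```python
# Illustrative latency model (hypothetical coefficients, not measurements
# from the paper): prefill cost has a linear term plus a quadratic
# attention term, so it eventually overruns any fixed TTFT deadline.
def prefill_latency_s(prompt_len: int, a: float = 2e-5, b: float = 3e-9) -> float:
    return a * prompt_len + b * prompt_len ** 2

FIXED_TTFT_SLO_S = 1.0  # a fixed deadline that looks generous for short prompts

for n in (1_000, 8_000, 32_000, 64_000):
    lat = prefill_latency_s(n)
    verdict = "meets" if lat <= FIXED_TTFT_SLO_S else "violates"
    print(f"{n:>6} tokens: prefill ~ {lat:6.2f} s ({verdict} fixed SLO)")
```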

Additionally, metrics such as TPOT and normalized latency can mask irregularities in token generation because they average over the entire decode. Tail latencies, likewise, give a misleading estimate of token throughput because they ignore the distribution and magnitude of generation stalls (Figure 2). A toy calculation after the figure makes this concrete.

Figure 2: (a) Decode tokens can be intermittently stalled due to prefills from incoming requests. (b) Naively normalizing total decode latency in TPOT hides these latency spikes and overestimates system token throughput. (c) Simply observing tail latency does not capture the nuances of the latency distribution: P85 latency for Sarathi-Serve is higher than for vLLM, while its P99 latency is much lower. Performance evaluation with the fluid token generation rate accounts for all these variations and provides an accurate, balanced view of system performance. Here, for the fluid token generation rate, we enforce that 99% of requests meet deadlines at least 90% of the time (fluidity-index > 0.9).
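To see how normalization hides a stall, consider a toy example with hypothetical timings (not data from the paper): nine fast tokens and one long stall yield a TPOT that looks healthy even though the user experienced a one-second freeze.

```python
# Hypothetical per-token inter-arrival times (seconds) for a 10-token
# decode: nine fast tokens plus one 1.0 s stall from a competing prefill.
tbt = [0.03] * 9 + [1.00]

tpot = sum(tbt) / len(tbt)             # the normalized metric many benchmarks report
print(f"TPOT:    {tpot:.3f} s/token")  # ~0.127 s/token -- looks acceptable
print(f"Max TBT: {max(tbt):.2f} s")    # 1.00 s -- a user-visible stall TPOT averages away
```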

Etalon: Design and Implementation

Etalon addresses these issues by introducing the fluidity-index metric, which evaluates token generation against a timeline of expected deadlines rather than an average token rate. The approach draws an analogy between LLM token generation and frame rendering in real-time video systems: if a few tokens are generated ahead of schedule, a later delay can be absorbed by the buffer, reducing the stall perceived by the user.

The key is assigning each token an expected generation time (a deadline) covering both the prefill and decode stages. A defined scheduling slack adjusts for system load and unexpected delays, refining deadline accuracy. The fluidity-index then measures how closely a system meets these deadlines, yielding a metric that reflects real-time user experience.
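The following is a minimal sketch of this deadline-tracking idea in Python. It is a simplification, not the authors' implementation; in particular, re-anchoring the schedule after a miss (a "deadline refresh") is our assumption about how a single stall is kept from penalizing every later token.

```python
def fluidity_index(arrivals, ttft_deadline, target_tbt):
    """Fraction of token deadlines met (simplified sketch).

    arrivals: absolute arrival time (s) of each token, measured from
    request start. The first token is due at ttft_deadline; each later
    token is due target_tbt seconds after the previous deadline. After
    a miss, the schedule is re-anchored at the late token's arrival
    (an assumed "deadline refresh"), so one stall is not charged
    against every subsequent token.
    """
    met, deadline = 0, ttft_deadline
    for t in arrivals:
        if t <= deadline:
            met += 1
            deadline += target_tbt     # tokens ahead of schedule build a buffer
        else:
            deadline = t + target_tbt  # refresh the schedule after a miss
    return met / len(arrivals)

# One stall between tokens 3 and 4; 4 of 5 deadlines are met.
print(fluidity_index([0.40, 0.43, 0.46, 1.50, 1.53],
                     ttft_deadline=0.5, target_tbt=0.05))  # 0.8
```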

Evaluation

The authors conducted extensive performance evaluations using Etalon on both open-source and proprietary LLM systems. For example, public APIs such as Anyscale, Groq, and Fireworks were analyzed with Etalon's black-box evaluation to gauge their performance under various configurations (Figure 3); a sketch of the measurement idea follows the figure.

Figure 3: Evaluation of proprietary serving offerings for Mixtral-8x7B and Llama3-70B performed over a duration of 24 hours. (a) shows the token throughput as estimated by different decode latency metrics, (b) presents the overall decode latency distribution across all requests, (c) shows the TTFT for different prompt lengths, and (d) provides a full characterization of the system by showing the fluidity-index as a function of target TBT.
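Because the evaluation is black-box, only client-side token arrival times are needed; no server instrumentation is required. A minimal sketch of such a measurement wrapper (the streaming client it wraps is assumed, not part of the paper):

```python
import time

def record_arrivals(stream):
    """Black-box measurement sketch: wrap any token stream (e.g. tokens
    yielded from a provider's streaming API) and record each token's
    arrival time relative to when the request was issued. These
    client-side timestamps are all a deadline-based metric needs."""
    start = time.monotonic()
    return [time.monotonic() - start for _token in stream]

# Hypothetical usage with an assumed streaming client:
# arrivals = record_arrivals(client.stream(prompt))
# score = fluidity_index(arrivals, ttft_deadline=0.5, target_tbt=0.05)
```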

Open-source systems, specifically vLLM and Sarathi-Serve, were evaluated with Etalon to define operator-driven Service Level Objectives (SLOs) for deployment scenarios, demonstrating Etalon's ability to identify the maximum capacity each system can sustain under a given service quality requirement (Figure 4). A sketch of this capacity search appears after the figure.

Figure 4: Capacity evaluation of open-source systems vLLM and Sarathi-Serve performed on H100 for Llama3-8B. (a) shows the overall capacity obtained using different decode latency metrics: TBT, TPOT, and fluidity-index. (b) captures the distribution of deadline miss rate at the capacity point for the two systems.
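The capacity search can be sketched as a load sweep against an SLO like the one in Figure 2 (99% of requests with fluidity-index above 0.9). The benchmark harness here is hypothetical, standing in for whatever load generator an operator uses:

```python
def meets_slo(fluidity_per_request, min_fluidity=0.9, quantile=0.99):
    """SLO check for the capacity search: at least `quantile` of
    requests must achieve a fluidity-index above `min_fluidity`."""
    ok = sum(f > min_fluidity for f in fluidity_per_request)
    return ok / len(fluidity_per_request) >= quantile

def find_capacity(run_benchmark, qps_levels):
    """Sweep increasing offered load; capacity is the highest load at
    which the SLO still holds. `run_benchmark(qps)` is a hypothetical
    harness returning one fluidity-index per completed request."""
    capacity = None
    for qps in qps_levels:
        if meets_slo(run_benchmark(qps)):
            capacity = qps
        else:
            break  # SLO violated; higher loads will only be worse
    return capacity
```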

Conclusion

Etalon emerges as a comprehensive LLM evaluation framework, offering detailed insight into inference performance from a user-centric perspective. By introducing the fluidity-index, Etalon bridges the gap between traditional metrics and real-world application demands, providing a holistic view of system efficiency. The work is a step toward standardized evaluation of LLM inference systems, helping developers and researchers align system performance with user experience. Etalon's open-source availability further supports its adoption and potential integration as a standard evaluation tool across the field.
