
Towards Optimizing the Costs of LLM Usage (2402.01742v1)

Published 29 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Generative AI and LLMs in particular are heavily used nowadays for various document processing tasks such as question answering and summarization. However, different LLMs come with different capabilities for different tasks as well as with different costs, tokenization, and latency. In fact, enterprises are already incurring huge costs of operating or using LLMs for their respective use cases. In this work, we propose optimizing the usage costs of LLMs by estimating their output quality (without actually invoking the LLMs), and then solving an optimization routine for the LLM selection to either keep costs under a budget, or minimize the costs, in a quality and latency aware manner. We propose a model to predict the output quality of LLMs on document processing tasks like summarization, followed by an LP rounding algorithm to optimize the selection of LLMs. We study optimization problems trading off the quality and costs, both theoretically and empirically. We further propose a sentence simplification model for reducing the number of tokens in a controlled manner. Additionally, we propose several deterministic heuristics for reducing tokens in a quality aware manner, and study the related optimization problem of applying the heuristics optimizing the quality and cost trade-off. We perform extensive empirical validation of our methods on not only enterprise datasets but also on open-source datasets, annotated by us, and show that we perform much better compared to closest baselines. Our methods reduce costs by 40%-90% while improving quality by 4%-7%. We will release the annotated open source datasets to the community for further research and exploration.

References (40)
  1. GPT-3 API Latency — Model Comparison. https://medium.com/@evyborov/gpt-3-api-latency-model-comparison-13888a834938.
  2. gptcache. https://github.com/zilliztech/GPTCache.
  3. gptrim. https://www.gptrim.com/.
  4. NLTK. https://www.nltk.org/.
  5. OpenAI. https://openai.com/.
  6. OpenAI Pricing. https://openai.com/pricing.
  7. pyspellchecker. https://pypi.org/project/pyspellchecker/.
  8. thesaurus. https://github.com/zaibacu/thesaurus.
  9. Tiktoken. https://github.com/openai/tiktoken.
  10. Ashoori, M. Decoding the true cost of generative AI for your enterprise. https://www.linkedin.com/pulse/decoding-true-cost-generative-ai-your-enterprise-maryam-ashoori-phd/, 2023. [Online; accessed Oct-12-2023].
  11. MS MARCO: A human generated machine reading comprehension dataset, 2016.
  12. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (Ann Arbor, Michigan, June 2005), Association for Computational Linguistics, pp. 65–72.
  13. FrugalML: How to use ML prediction APIs more accurately and cheaply, 2020.
  14. Efficient online ML API selection for multi-label classification tasks, 2021.
  15. FrugalGPT: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176 (2023).
  16. The economic potential of generative AI: The next productivity frontier. https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier#introduction, 2023. [Online; accessed Oct-12-2023].
  17. Efficient unsupervised sentence compression by fine-tuning transformers with reinforcement learning, 2022.
  18. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization, 2019.
  19. Cosmos QA: Machine reading comprehension with contextual commonsense reasoning, 2019.
  20. BabyBear: Cheap inference triage for expensive language models, 2022.
  21. Neural text generation from structured data with application to the biography domain, 2016.
  22. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet physics doklady (1966), vol. 10, Soviet Union, pp. 707–710.
  23. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, 2019.
  24. Lin, C.-Y. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (Barcelona, Spain, July 2004), Association for Computational Linguistics, pp. 74–81.
  25. Natural language inference in context – investigating contextual reasoning over long texts, 2020.
  26. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning, 2020.
  27. TangoBERT: Reducing inference cost by using cascaded architecture, 2022.
  28. MUSS: Multilingual unsupervised sentence simplification by mining paraphrases, 2020.
  29. Controllable sentence simplification, 2020.
  30. fairseq: A fast, extensible toolkit for sequence modeling, 2019.
  31. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (Philadelphia, Pennsylvania, USA, July 2002), Association for Computational Linguistics, pp. 311–318.
  32. Ranodeb Banerjee, O. Automatic document processing with large language models. https://www.linkedin.com/pulse/automatic-document-processing-large-language-models-ranodeb-banerjee/?utm_source=rss&utm_campaign=articles_sitemaps&utm_medium=google_news, 2023. [Online; accessed Oct-12-2023].
  33. Sallam, R. The economic potential of generative AI: The next productivity frontier. https://www.gartner.com/en/articles/take-this-view-to-assess-roi-for-generative-ai, 2023. [Online; accessed Oct-12-2023].
  34. Shafaq Naz, E. C. Reinventing logistics: Harnessing generative AI and GPT for intelligent document processing. https://www.e2enetworks.com/blog/reinventing-logistics-harnessing-generative-ai-and-gpt-for-intelligent-document-processing, 2023. [Online; accessed Oct-12-2023].
  35. BigPatent: A large-scale dataset for abstractive and coherent summarization, 2019.
  36. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  37. XtractEdge. Cutting through the noise – how generative AI will change the IDP landscape. https://www.edgeverve.com/xtractedge/blogs/transforming-idp-with-generative/, 2023. [Online; accessed Oct-12-2023].
  38. ReClor: A reading comprehension dataset requiring logical reasoning, 2020.
  39. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675 (2019).
  40. Sentence simplification with deep reinforcement learning, 2017.

Summary

  • The paper proposes QC-Opt, a framework that leverages a BERTScore predictor and a budget-aware optimization algorithm to minimize LLM usage costs.
  • It demonstrates a 40% to 90% cost reduction while achieving a 4% to 7% quality improvement compared to baseline approaches.
  • The framework integrates token optimization strategies that shorten inputs while preserving semantic content and output quality.

Optimizing the Costs of LLM Usage

Introduction

The increasing reliance on LLMs for document processing tasks has made it essential to manage the associated cost and performance trade-offs. Because different LLMs differ in cost structure, latency, and task-specific capability, model selection must balance cost against quality and latency constraints. This paper introduces QC-Opt, a comprehensive framework that optimizes LLM usage by estimating output quality up front and algorithmically selecting models to minimize cost while maintaining the desired quality.

QC-Opt Framework

QC-Opt consists of a multi-step process aimed at minimizing costs and optimizing input token lengths:

  1. Quality Assessment: A BERTScore predictor estimates each LLM's output quality on each document section without invoking the model (a minimal sketch follows Figure 1).
  2. Optimization Algorithm: A budget-aware optimization algorithm selects LLMs to maximize expected performance subject to budget and latency constraints.
  3. Token Optimization Module: This module reduces input token lengths in a controlled, quality-aware manner (see Figure 1).

    Figure 1: QC-Opt. First, a BERTScore predictor predicts the output quality of each LLM on each section; second, a budget-aware optimization algorithm optimizes the LLM selection to maximize expected (predicted) performance subject to budget and latency constraints; third, a token optimization module reduces token length in a quality-aware manner.
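The summary describes the predictor only at a high level; the following is a minimal sketch of the idea, assuming a gradient-boosted regressor trained on historical (section, model) pairs labeled with observed BERTScores. The features and toy data are illustrative assumptions, not the paper's design.

```python
# Sketch of step 1: predict a BERTScore-like quality score for a
# (section, model) pair without invoking the LLM. Features, regressor,
# and data are illustrative assumptions, not the paper's exact design.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def featurize(section: str, model_id: int) -> list[float]:
    words = section.split()
    return [
        float(len(words)),                                # section length in words
        sum(len(w) for w in words) / max(len(words), 1),  # mean word length
        float(section.count(".")),                        # rough sentence count
        float(model_id),                                  # candidate LLM
    ]

# Toy training data: historical sections with observed BERTScores per model.
history = [
    ("Short overview of quarterly costs.", 0, 0.84),
    ("Short overview of quarterly costs.", 1, 0.78),
    ("A long, dense technical section about token-level pricing.", 0, 0.80),
    ("A long, dense technical section about token-level pricing.", 1, 0.69),
]
X = np.array([featurize(s, m) for s, m, _ in history])
y = np.array([score for _, _, score in history])

predictor = GradientBoostingRegressor().fit(X, y)

# Predicted quality of model 1 on a new section, with no LLM call.
q_hat = predictor.predict(np.array([featurize("New section to summarize.", 1)]))[0]
print(f"predicted quality: {q_hat:.3f}")
```

In deployment, such a predictor would presumably be retrained as new (section, model, score) observations accumulate, keeping quality estimates current without extra LLM calls.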

Model Selection and Routing

Budget-Aware Optimization

The core challenge is selecting the best LLM for each document section under a budget constraint. The paper formulates this as a constrained optimization problem: maximize total expected quality subject to cost and latency constraints. Although the problem is NP-hard, an efficient LP-rounding algorithm, together with simple greedy solutions for relaxed variants, keeps the approach practical.
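To make the formulation concrete, here is a minimal LP-relaxation-plus-rounding sketch of the selection problem. The quality and cost numbers are invented, and the paper's algorithm also handles latency constraints and rounds more carefully than the naive argmax used here.

```python
# Sketch: pick one LLM per document section to maximize total predicted
# quality under a budget, via LP relaxation plus naive rounding. All
# numbers are illustrative; the paper's LP-rounding algorithm is more
# careful (the naive rounding below can overshoot the budget).
import numpy as np
from scipy.optimize import linprog

quality = np.array([[0.90, 0.70, 0.50],   # predicted quality: sections x models
                    [0.80, 0.75, 0.60]])
cost = np.array([[10.0, 4.0, 1.0],        # invocation cost per (section, model)
                 [12.0, 5.0, 1.5]])
budget = 10.0
n_sec, n_mod = quality.shape

c = -quality.ravel()                      # linprog minimizes, so negate quality

# Budget constraint: sum_{i,j} cost[i,j] * x[i,j] <= budget.
A_ub = cost.ravel()[None, :]
b_ub = [budget]

# Each section must pick exactly one model: sum_j x[i,j] = 1.
A_eq = np.zeros((n_sec, n_sec * n_mod))
for i in range(n_sec):
    A_eq[i, i * n_mod:(i + 1) * n_mod] = 1.0
b_eq = np.ones(n_sec)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=(0, 1), method="highs")
x = res.x.reshape(n_sec, n_mod)

choice = x.argmax(axis=1)                 # naive per-section rounding
total_cost = cost[np.arange(n_sec), choice].sum()
print("model per section:", choice, "| total cost:", total_cost)
```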

Performance and Cost Trade-Off

The framework empirically demonstrates substantial cost reductions (40% to 90%) alongside quality gains of 4% to 7% over the closest baselines. These improvements come from routing each document section to a model that is inexpensive yet strong enough for it, combined with token length optimization.

Token Optimization Strategies

The Token Optimization process involves:

  1. Text Simplification: Inspired by sentence simplification models, the framework rephrases input to reduce token count while preserving semantic content.
  2. General Token Reduction Heuristics: A set of deterministic heuristics trims token counts by adjusting spaces and capitalization and by applying lemmatization or synonym replacement, in a quality-aware manner (see Figure 2 and the sketch that follows).

    Figure 2: Ablation study of the various token-reduction heuristics.
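To give a flavor of these heuristics, here is a rough sketch combining whitespace normalization and NLTK lemmatization, with tiktoken measuring the savings. The specific rules are illustrative; the paper's heuristic set (including synonym replacement) is broader and applied in a quality-aware order.

```python
# Sketch of deterministic token-reduction heuristics: collapse whitespace
# and lemmatize words, measuring token savings with tiktoken. Illustrative
# only; a quality-aware version would keep an edit only if the predicted
# output quality does not drop.
import re

import nltk
import tiktoken
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

enc = tiktoken.get_encoding("cl100k_base")
lemmatizer = WordNetLemmatizer()

def n_tokens(text: str) -> int:
    return len(enc.encode(text))

def reduce_tokens(text: str) -> str:
    text = re.sub(r"\s+", " ", text).strip()                 # collapse whitespace runs
    words = [lemmatizer.lemmatize(w) for w in text.split()]  # e.g. "documents" -> "document"
    return " ".join(words)

original = "The  selected   models were    summarizing the documents repeatedly."
reduced = reduce_tokens(original)
print(n_tokens(original), "tokens ->", n_tokens(reduced), "tokens")
```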

Empirical Evaluation

Extensive validation on enterprise and open-source datasets shows that QC-Opt outperforms approaches such as FrugalGPT-inspired cascades, achieving comparable quality at markedly lower cost. A user study further shows that the quality predictor aligns well with human judgments, reinforcing its practical reliability (see Figure 3).

Figure 3: Comparison with an LLM cascade baseline inspired by FrugalGPT. QC-Opt achieves the same quality at considerably lower cost and latency (latency not shown here).
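For contrast, a FrugalGPT-style cascade queries models from cheapest to most expensive and stops once a scorer judges the answer good enough, so it still pays for the cheap calls it discards. A minimal sketch, where `call_model` and `score` are hypothetical stand-ins rather than APIs from the paper:

```python
# Sketch of an LLM cascade baseline in the spirit of FrugalGPT: try models
# cheapest-first and accept the first answer whose quality score clears a
# threshold. call_model and score are hypothetical stand-ins.
from typing import Callable

def cascade(prompt: str,
            models: list[str],                      # ordered cheap -> expensive
            call_model: Callable[[str, str], str],  # (model, prompt) -> answer
            score: Callable[[str, str], float],     # (prompt, answer) -> quality
            threshold: float = 0.8) -> str:
    answer = ""
    for model in models:
        answer = call_model(model, prompt)          # this call is paid for either way
        if score(prompt, answer) >= threshold:      # good enough: stop escalating
            break
    return answer
```

QC-Opt avoids even these discarded calls by predicting quality before any invocation, which is where the cost and latency gains in Figure 3 come from.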

Conclusion

QC-Opt establishes a comprehensive framework for LLM cost optimization that balances performance metrics and cost constraints efficiently. Future extensions could further refine this framework to dynamically adapt LLM selections based on real-time contextual evaluations.

In the evolution of AI-driven document processing, the practical adaptability and efficiency of cost-optimization frameworks like QC-Opt could prove instrumental in maximizing both economic and computational resource utilization.
