
The economic trade-offs of large language models: A case study (2306.07402v1)

Published 8 Jun 2023 in cs.CL and cs.AI

Abstract: Contacting customer service via chat is a common practice. Because employing customer service agents is expensive, many companies are turning to NLP models that assist human agents by auto-generating responses that can be used directly or with modifications. LLMs are a natural fit for this use case; however, their efficacy must be balanced with the cost of training and serving them. This paper assesses the practical cost and impact of LLMs for the enterprise as a function of the usefulness of the responses that they generate. We present a cost framework for evaluating an NLP model's utility for this use case and apply it to a single brand as a case study in the context of an existing agent assistance product. We compare three strategies for specializing an LLM - prompt engineering, fine-tuning, and knowledge distillation - using feedback from the brand's customer service agents. We find that the usability of a model's responses can make up for a large difference in inference cost for our case study brand, and we extrapolate our findings to the broader enterprise space.
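The abstract frames model choice as a trade-off between inference cost and the usability of the responses a specialized model generates. As a rough illustration only, and not the paper's actual cost framework, the sketch below compares hypothetical specialization strategies by cost per usable response, amortizing a one-time training cost over serving volume. All strategy names, prices, and usability rates here are invented placeholders, not figures from the study.

```python
# Minimal sketch (not the paper's framework): compare specialization strategies
# by cost per usable agent response. All numbers below are hypothetical.

from dataclasses import dataclass


@dataclass
class Strategy:
    name: str
    inference_cost_per_response: float  # hypothetical $ per generated response
    usable_fraction: float              # hypothetical share of responses agents accept
    fixed_cost: float = 0.0             # hypothetical one-time training/fine-tuning cost


def cost_per_usable_response(s: Strategy, monthly_volume: int) -> float:
    """Amortize the fixed cost over monthly volume, then divide by usability."""
    amortized = s.fixed_cost / monthly_volume
    return (s.inference_cost_per_response + amortized) / s.usable_fraction


# Hypothetical strategies mirroring the three approaches named in the abstract.
strategies = [
    Strategy("prompt engineering", inference_cost_per_response=0.020, usable_fraction=0.40),
    Strategy("fine-tuning", inference_cost_per_response=0.020, usable_fraction=0.55, fixed_cost=500.0),
    Strategy("knowledge distillation", inference_cost_per_response=0.002, usable_fraction=0.50, fixed_cost=2000.0),
]

for s in strategies:
    cost = cost_per_usable_response(s, monthly_volume=100_000)
    print(f"{s.name}: ${cost:.4f} per usable response")
```

With these placeholder inputs, a cheaper-to-serve model only wins if its responses stay usable often enough, which is the kind of trade-off the paper's case study quantifies with real agent feedback.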
