Optimizing LLM Queries in Relational Data Analytics Workloads (2403.05821v2)
Abstract: Batch data analytics is a growing application for LLMs, which let users perform a wide range of natural language tasks, such as classification, entity extraction, and translation, over large datasets. However, LLM inference is costly and slow: an NVIDIA L4 GPU running Llama 3 8B processes only about 6 KB of text per second, taking roughly a day to handle 15 GB of data, and processing a similar volume costs around $10K with OpenAI's GPT-4o. In this paper, we propose novel techniques that significantly reduce the cost of LLM calls in relational data analytics workloads. Our key contribution is a set of efficient algorithms for reordering the rows, and the fields within each row, of an input table to maximize key-value (KV) cache reuse during LLM serving. Because these techniques only reorder the model's input, they can be applied easily to existing analytics systems and serving platforms. Our evaluation shows that our solution yields up to a 3.4x improvement in job completion time on a benchmark of diverse LLM-based queries using Llama 3 models, and achieves 32% cost savings under OpenAI and Anthropic pricing models.
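The abstract's core idea, reordering rows and fields so that consecutive per-row prompts share long common prefixes, can be illustrated with a simple heuristic: place low-cardinality fields first, then sort rows lexicographically under that field order. This is a minimal sketch for intuition, not the paper's actual algorithms; the function and serialization format below are hypothetical.

```python
def reorder_for_prefix_sharing(rows):
    """Toy heuristic for KV-cache-friendly table ordering.

    Fields with fewer distinct values are placed earlier in each
    serialized prompt, so more rows share that leading token span;
    sorting rows lexicographically under the chosen field order then
    makes rows with identical prefixes consecutive.
    """
    if not rows:
        return [], []
    fields = list(rows[0].keys())
    # Distinct-value count per field: fewer distinct values => earlier position.
    cardinality = {f: len({r[f] for r in rows}) for f in fields}
    field_order = sorted(fields, key=lambda f: cardinality[f])
    # Lexicographic sort groups shared prefixes into consecutive rows.
    ordered_rows = sorted(rows, key=lambda r: [r[f] for f in field_order])
    return field_order, ordered_rows

def serialize(row, field_order):
    """Render one row as the text that would be appended to the prompt."""
    return "; ".join(f"{f}: {row[f]}" for f in field_order)
```

Under this ordering, rows that agree on the leading fields produce prompts with identical prefixes in back-to-back requests, which is precisely when prefix-based KV-cache sharing in serving systems such as vLLM or SGLang can avoid recomputation.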