Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 63 tok/s
Gemini 2.5 Pro 49 tok/s Pro
GPT-5 Medium 14 tok/s Pro
GPT-5 High 19 tok/s Pro
GPT-4o 100 tok/s Pro
Kimi K2 174 tok/s Pro
GPT OSS 120B 472 tok/s Pro
Claude Sonnet 4 36 tok/s Pro
2000 character limit reached

Query Rewriting via Large Language Models (2403.09060v1)

Published 14 Mar 2024 in cs.DB

Abstract: Query rewriting is one of the most effective techniques for coping with poorly written queries before passing them down to the query optimizer. Manual rewriting is not scalable, as it is error-prone and requires deep expertise. Similarly, traditional query rewriting algorithms can only handle a small subset of queries: rule-based techniques do not generalize to new query patterns and synthesis-based techniques cannot handle complex queries. Fortunately, the rise of LLMs, equipped with broad general knowledge and advanced reasoning capabilities, has created hopes for solving some of these previously open problems. In this paper, we present GenRewrite, the first holistic system that leverages LLMs for query rewriting. We introduce the notion of Natural Language Rewrite Rules (NLR2s), and use them as hints to the LLM but also a means for transferring knowledge from rewriting one query to another, and thus becoming smarter and more effective over time. We present a novel counterexample-guided technique that iteratively corrects the syntactic and semantic errors in the rewritten query, significantly reducing the LLM costs and the manual effort required for verification. GenRewrite speeds up 22 out of 99 TPC queries (the most complex public benchmark) by more than 2x, which is 2.5x--3.2x higher coverage than state-of-the-art traditional query rewriting and 2.1x higher than the out-of-the-box LLM baseline.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (45)
  1. dbt Labs — Transform Data in Your Warehouse. https://www.getdbt.com/.
  2. LeetCode. https://leetcode.com/problemset/database/.
  3. Looker business intelligence platform embedded analytics. https://cloud.google.com/looker.
  4. OpenAI Introduction. https://platform.openai.com/docs/introduction/tokens.
  5. Prompt Engineering for Generative AI. https://developers.google.com/machine-learning/resources/prompt-eng.
  6. TPC-DS Benchmark. https://www.tpc.org/tpcds/.
  7. A comparative survey of recent natural language interfaces for databases. The VLDB Journal 28 (2019), 793–819.
  8. Cost-based query transformation in Oracle. In VLDB, Vol. 6. 1026–1036.
  9. Apache calcite: A foundational framework for optimized query processing over heterogeneous data sources. In Proceedings of the 2018 International Conference on Management of Data. 221–230.
  10. Computation reuse via fusion in Amazon Athena. In 2022 IEEE 38th International Conference on Data Engineering (ICDE). IEEE, 1610–1620.
  11. Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. arXiv:2305.15778 [cs.SE]
  12. The snowflake elastic data warehouse. In Proceedings of the 2016 International Conference on Management of Data. 215–226.
  13. ARe-SQL: An Online, Automatic and Non-Intrusive Approach for Rewriting SQL Queries. Journal of Information and Data Management 5, 1 (2014), 28–28.
  14. Proving Query Equivalence Using Linear Integer Arithmetic. Proceedings of the ACM on Management of Data 1, 4 (2023), 1–26.
  15. SlabCity: Whole-Query Optimization Using Program Synthesis. Proceedings of the VLDB Endowment 16, 11 (2023), 3151–3164.
  16. Sérgio Fernandes and Jorge Bernardino. 2015. What is bigquery?. In Proceedings of the 19th International Database Engineering & Applications Symposium. 202–203.
  17. CatSQL: Towards Real World Natural Language to SQL Applications. Proceedings of the VLDB Endowment 16, 6 (2023), 1534–1547.
  18. Text-to-SQL Empowered by Large Language Models: A Benchmark Evaluation. CoRR abs/2308.15363 (2023). arXiv preprint arXiv:2308.15363 (2023).
  19. Goetz Graefe. 1987. Rule-based query optimization in extensible database systems. Technical Report. University of Wisconsin-Madison Department of Computer Sciences.
  20. Baseball: an automatic question-answerer. In Papers presented at the May 9-11, 1961, western joint IRE-AIEE-ACM computer conference. 219–224.
  21. Few-shot Text-to-SQL Translation using Structure and Content Prompt Learning. Proceedings of the ACM on Management of Data 1, 2 (2023), 1–28.
  22. Towards complex text-to-sql in cross-domain database with intermediate representation. arXiv preprint arXiv:1905.08205 (2019).
  23. Selecting subexpressions to materialize at datacenter scale. Proceedings of the VLDB Endowment 11, 7 (2018), 800–812.
  24. George Katsogiannis-Meimarakis and Georgia Koutrika. 2023. A survey on deep learning approaches for text-to-SQL. The VLDB Journal (2023), 1–32.
  25. Natural language to SQL: Where are we today? Proceedings of the VLDB Endowment 13, 10 (2020), 1737–1750.
  26. Alon Yitzchak Levy and Inderpal Singh Mumick. 1997. Query optimization by predicate move-around. US Patent 5,659,725.
  27. Table-gpt: Table-tuned gpt for diverse table tasks. arXiv preprint arXiv:2310.09263 (2023).
  28. Natural language data management and interfaces. Springer.
  29. Large Language Model for Table Processing: A Survey. arXiv preprint arXiv:2402.05121 (2024).
  30. M Muralikrishna et al. 1992. Improved unnesting algorithms for join aggregate SQL queries. In VLDB, Vol. 92. Citeseer, 91–102.
  31. Can foundation models wrangle your data? arXiv preprint arXiv:2205.09911 (2022).
  32. Extensible/rule based query rewrite optimization in Starburst. ACM Sigmod Record 21, 2 (1992), 39–48.
  33. Mohammadreza Pourreza and Davood Rafiei. 2024. Din-sql: Decomposed in-context learning of text-to-sql with self-correction. Advances in Neural Information Processing Systems 36 (2024).
  34. Cost-based optimization for magic: Algebra and implementation. In Proceedings of the 1996 ACM SIGMOD international conference on Management of data. 435–446.
  35. Learning contextual representations for semantic parsing with generation-augmented pre-training. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 13806–13814.
  36. Annotating columns with pre-trained language models. In Proceedings of the 2022 International Conference on Management of Data. 1493–1503.
  37. Sql-palm: Improved large language model adaptation for text-to-sql. CoRR, abs/2306.00739, 2023a. doi: 10.48550. arXiv preprint ARXIV.2306.00739 ([n. d.]).
  38. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313 (2024).
  39. Rat-sql: Relation-aware schema encoding and linking for text-to-sql parsers. arXiv preprint arXiv:1911.04942 (2019).
  40. A natural language interface for database: Achieving transfer-learnability using adversarial method for question understanding. In 2020 IEEE 36th International conference on data engineering (ICDE). IEEE, 97–108.
  41. Wetune: Automatic discovery and verification of query rewrite rules. In Proceedings of the 2022 International Conference on Management of Data. 94–107.
  42. Guoliang Li Xuanhe Zhou, Zhaoyan Sun. 2023. DB-GPT: Large Language Model Meets Database.
  43. Siren’s song in the AI ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219 (2023).
  44. A learned query rewrite system using monte carlo tree search. Proceedings of the VLDB Endowment 15, 1 (2021), 46–58.
  45. D-bot: Database diagnosis system using large language models. arXiv preprint arXiv:2312.01454 (2023).
Citations (5)
List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

  • The paper introduces GenRewrite, a system that leverages LLMs with counterexample-guided correction for SQL query rewriting.
  • It employs Natural Language Rewrite Rules (NLR2s) to transfer rewrite knowledge between queries without relying on query-specific data.
  • Evaluation on the TPC-DS benchmark shows GenRewrite optimizes 33.3% of queries, outperforming state-of-the-art methods with significant speedups.

Query Rewriting via LLMs

Introduction

The paper "Query Rewriting via LLMs" presents GenRewrite, a novel system leveraging LLMs for rewriting SQL queries to optimize performance. This research addresses significant limitations of traditional query rewriting systems and explores the capabilities of LLMs to enhance query performance autonomously.

Challenges in Traditional Query Rewriting

Traditional query rewriting methods rely heavily on rule-based systems that struggle with novel query patterns and complex queries. These systems require predefined rewrite rules, leading to difficulties in handling new or unforeseen query patterns. Synthesis-based approaches, although not rule-bound, often falter when navigating extensive query landscapes, especially in complex benchmarks like TPC-DS.

GenRewrite: Leveraging LLMs for Query Optimization

GenRewrite introduces several innovative techniques to overcome these challenges:

  • Natural Language Rewrite Rules (NLR2s): NLR2s are textual explanations summarizing query rewrites. They transfer knowledge gained from rewriting one query to another, effectively guiding the LLM and providing understanding to users. This system avoids query-specific data in NLR2s, enhancing the generality of the knowledge transfer.
  • Counterexample-Guided Correction: This technique iteratively refines potential rewrites through semantic and syntactic corrections. The LLM addresses semantic mismatches, while database feedback ensures syntactic correctness. This dual-phase correction process ensures that candidate queries align with the intended semantics and are executable. Figure 1

    Figure 1: High-level Workflow of GenRewrite showcasing iterative processes.

Evaluation and Performance

The empirical evaluation of GenRewrite on the TPC-DS benchmark reveals its superior performance in query rewriting. GenRewrite notably optimizes 33.3% of queries beyond a 10% speedup, surpassing state-of-the-art methods like LearnedRewrite (LR) and Fusion by significant margins. Figure 2

Figure 3: Comparison of speedup achieved by different approaches on queries with at least 2x speedup.

Contributions and Implications

  1. First Comprehensive LLM Query Rewriting Analysis: This work offers an in-depth analysis of the challenges and potentials of LLMs for query rewriting, setting a foundation for future research.
  2. Introduction of NLR2s: GenRewrite enhances the LLM's capability by employing NLR2s, which help in achieving effective rewrites and knowledge transfer.
  3. Iterative Correction Approach: The introduction of a counterexample-guided correction method significantly improves rewrite accuracy, making the system robust against errors typical in LLM responses.

Conclusion

GenRewrite demonstrates a pioneering approach in automating query rewriting by harnessing the power of LLMs. Its ability to adaptively transfer knowledge and refine rewrites iteratively makes it a promising solution for complex query optimization. The insights and methodologies outlined in this paper may drive further developments in using LLMs for other database-related tasks, emphasizing the versatile application of advanced AI models in the field of database management and optimization.

The research opens pathways for future enhancements in AI-driven query optimization, potentially leading to more intelligent and adaptive database management systems in the future.

Authors (2)

X Twitter Logo Streamline Icon: https://streamlinehq.com