LLM-R2: A Large Language Model Enhanced Rule-based Rewrite System for Boosting Query Efficiency (2404.12872v1)
Abstract: Query rewrite, which aims to generate more efficient queries by altering a SQL query's structure without changing the query result, has been an important research problem. In order to maintain equivalence between the rewritten query and the original one during rewriting, traditional query rewrite methods always rewrite the queries following certain rewrite rules. However, some problems still remain. Firstly, existing methods of finding the optimal choice or sequence of rewrite rules are still limited and the process always costs a lot of resources. Methods involving discovering new rewrite rules typically require complicated proofs of structural logic or extensive user interactions. Secondly, current query rewrite methods usually rely highly on DBMS cost estimators which are often not accurate. In this paper, we address these problems by proposing a novel method of query rewrite named LLM-R2, adopting a LLM to propose possible rewrite rules for a database rewrite system. To further improve the inference ability of LLM in recommending rewrite rules, we train a contrastive model by curriculum to learn query representations and select effective query demonstrations for the LLM. Experimental results have shown that our method can significantly improve the query execution efficiency and outperform the baseline methods. In addition, our method enjoys high robustness across different datasets.
- [n.d.]. Apache Calcite Rewrite Rules. https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/rules/package-summary.html.
- [n.d.]. Introduction of OpenAI Text Generation APIs. https://platform.openai.com/docs/guides/text-generation.
- [n.d.]. LLM As Database Administrator. https://github.com/TsinghuaDatabaseGroup/DB-GPT.
- [n.d.]. PostgreSQL. https://www.postgresql.org.
- [n.d.]. TPC-H Toolkit. https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp.
- QueryBooster: Improving SQL Performance Using Middleware Services for Human-Centered Query Rewriting. arXiv:2305.08272 [cs.DB]
- Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD/PODS ’18). ACM. https://doi.org/10.1145/3183713.3190662
- Curriculum Learning. In Proceedings of the 26th Annual International Conference on Machine Learning (Montreal, Quebec, Canada) (ICML ’09). Association for Computing Machinery, New York, NY, USA, 41–48. https://doi.org/10.1145/1553374.1553380
- Language Models are Few-Shot Learners. arXiv:2005.14165 [cs.CL]
- SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017). Association for Computational Linguistics. https://doi.org/10.18653/v1/s17-2001
- DSB: A Decision Support Benchmark for Workload-Driven and Traditional Database Systems. In VLDB 2022. https://www.microsoft.com/en-us/research/publication/dsb-a-decision-support-benchmark-for-workload-driven-and-traditional-database-systems/
- The Faiss library. (2024). arXiv:2401.08281 [cs.LG]
- SimCSE: Simple Contrastive Learning of Sentence Embeddings. arXiv:2104.08821 [cs.CL]
- Goetz Graefe. 1995. The Cascades Framework for Query Optimization. IEEE Data(base) Engineering Bulletin 18 (1995), 19–29. https://api.semanticscholar.org/CorpusID:260706023
- Goetz Graefe and David J. DeWitt. 1987. The EXODUS optimizer generator. In Proceedings of the 1987 ACM SIGMOD International Conference on Management of Data (San Francisco, California, USA) (SIGMOD ’87). Association for Computing Machinery, New York, NY, USA, 160–172. https://doi.org/10.1145/38713.38734
- G. Graefe and W.J. McKenna. 1993. The Volcano optimizer generator: extensibility and efficient search. In Proceedings of IEEE 9th International Conference on Data Engineering. 209–218. https://doi.org/10.1109/ICDE.1993.344061
- Survey of Hallucination in Natural Language Generation. Comput. Surveys 55, 12 (March 2023), 1–38. https://doi.org/10.1145/3571730
- How Good Are Query Optimizers, Really? Proc. VLDB Endow. 9 (2015), 204–215. https://api.semanticscholar.org/CorpusID:7953847
- Feifei Li. 2019. Cloud-Native Database Systems at Alibaba: Opportunities and Challenges. Proc. VLDB Endow. 12, 12 (aug 2019), 2263–2272. https://doi.org/10.14778/3352063.3352141
- Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs. arXiv:2305.03111 [cs.CL]
- Unified Demonstration Retriever for In-Context Learning. arXiv:2305.04320 [cs.CL]
- Learning Word Vectors for Sentiment Analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, Portland, Oregon, USA, 142–150. http://www.aclweb.org/anthology/P11-1015
- Extensible/Rule Based Query Rewrite Optimization in Starburst. SIGMOD Rec. 21, 2 (jun 1992), 39–48. https://doi.org/10.1145/141484.130294
- Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. https://arxiv.org/abs/1908.10084
- Toolformer: Language Models Can Teach Themselves to Use Tools. arXiv:2302.04761 [cs.CL]
- SQL-PaLM: Improved Large Language Model Adaptation for Text-to-SQL. arXiv:2306.00739 [cs.CL]
- WeTune: Automatic Discovery and Verification of Query Rewrite Rules. In Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA) (SIGMOD ’22). Association for Computing Machinery, New York, NY, USA, 94–107. https://doi.org/10.1145/3514221.3526125
- Larger language models do in-context learning differently. arXiv:2303.03846 [cs.CL]
- Factor Windows: Cost-based Query Rewriting for Optimizing Correlated Window Aggregates. arXiv:2008.12379 [cs.DB]
- DB-GPT: Empowering Database Interactions with Private Large Language Models. arXiv:2312.17449 [cs.DB]
- ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL]
- Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv:2309.01219 [cs.CL]
- QueryFormer: a tree transformer model for query plan representation. Proceedings of the VLDB Endowment 15 (04 2022), 1658–1670. https://doi.org/10.14778/3529337.3529349
- A Comparative Study and Component Analysis of Query Plan Representation Techniques in ML4DB Studies. Proc. VLDB Endow. 17, 4 (2024).
- A Learned Query Rewrite System Using Monte Carlo Tree Search. Proc. VLDB Endow. 15, 1 (sep 2021), 46–58. https://doi.org/10.14778/3485450.3485456
- R3superscript𝑅3R^{3}italic_R start_POSTSUPERSCRIPT 3 end_POSTSUPERSCRIPT-NL2GQL: A Hybrid Models Approach for for Accuracy Enhancing and Hallucinations Mitigation. arXiv:2311.01862 [cs.CL]