Middleware for LLMs: Tools Are Instrumental for Language Agents in Complex Environments (2402.14672v2)
Abstract: The applications of large language models (LLMs) have expanded well beyond the confines of text processing, signaling a new era where LLMs are envisioned as generalist agents capable of operating within complex environments. These environments are often highly expansive, making it impossible for the LLM to process them within its short-term memory. Motivated by recent research on extending the capabilities of LLMs with tools, we investigate the potential of tools to help LLMs handle such complexity. We introduce a novel class of tools, termed middleware, that aid proactive exploration of these massive environments. Such specialized tools can serve as a middleware layer shielding the LLM from environmental complexity. In two representative complex environments -- knowledge bases (KBs) and databases -- we demonstrate the significant potential of augmenting language agents with tools. Notably, equipped with the middleware, GPT-4 achieves 2.8X the performance of the best baseline in tasks requiring access to database content and 2.2X in KB tasks. Our findings illuminate the path for advancing language agents in real-world applications.
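To make the idea of a middleware layer concrete, here is a minimal sketch for the database setting, assuming a SQLite database. The class `DatabaseMiddleware`, its method names, the `max_items` cap, and the toy `singer`/`concert` tables are all hypothetical and chosen only for illustration; they are not taken from the paper. The point is that the agent never reads the database itself: it calls small tools, and each call returns a short, bounded observation.

```python
import sqlite3


class DatabaseMiddleware:
    """Hypothetical tool layer: the agent calls these methods instead of
    loading the database contents into its prompt."""

    def __init__(self, conn: sqlite3.Connection, max_items: int = 5):
        self.conn = conn
        self.max_items = max_items  # cap on how many values one observation returns

    def list_tables(self) -> list[str]:
        """Return the names of all tables, sorted alphabetically."""
        rows = self.conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return [name for (name,) in rows]

    def list_columns(self, table: str) -> list[str]:
        """Return the column names of one table, in declaration order."""
        if table not in self.list_tables():
            raise ValueError(f"unknown table: {table}")
        rows = self.conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        return [row[1] for row in rows]  # index 1 of each row is the column name

    def distinct_values(self, table: str, column: str) -> list:
        """Return at most max_items distinct values stored in a column."""
        if column not in self.list_columns(table):
            raise ValueError(f"unknown column: {table}.{column}")
        rows = self.conn.execute(
            f'SELECT DISTINCT "{column}" FROM "{table}" ORDER BY 1 LIMIT ?',
            (self.max_items,),
        ).fetchall()
        return [value for (value,) in rows]

    def find_columns_containing(self, value) -> list[tuple[str, str]]:
        """Return (table, column) pairs in which the exact value occurs."""
        hits = []
        for table in self.list_tables():
            for column in self.list_columns(table):
                found = self.conn.execute(
                    f'SELECT 1 FROM "{table}" WHERE "{column}" = ? LIMIT 1',
                    (value,),
                ).fetchone()
                if found is not None:
                    hits.append((table, column))
        return hits


def build_demo_db() -> sqlite3.Connection:
    """Create a toy in-memory database (illustrative data only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE singer (name TEXT, country TEXT);
        CREATE TABLE concert (venue TEXT, year INTEGER);
        INSERT INTO singer VALUES ('Ana', 'France'), ('Ben', 'Japan'), ('Cy', 'France');
        INSERT INTO concert VALUES ('Olympia', 2014), ('Budokan', 2015);
        """
    )
    return conn
```

In an agent loop, the model would first call something like `find_columns_containing("France")` to learn where a value mentioned in the question is stored, then write its query against that column, rather than guessing from the schema alone. The cap on returned values is what keeps each observation small regardless of how large the underlying table is.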