Grounded Language Agent for Product Search via Intelligent Web Interactions (2404.10887v2)
Abstract: The development of agents powered by LLMs to accomplish complex, high-level user intents has attracted significant attention recently. However, employing LLMs with billions of parameters (e.g., GPT-4) may incur substantial costs on top of handcrafting extensive prompts. To address this, we introduce a Grounded Language Agent for Intelligent Web Interactions, named GLAINTEL. GLAINTEL employs Flan-T5 as its backbone and can be trained in several settings: unsupervised learning, supervised learning, and unsupervised domain adaptation. Specifically, we tackle both the challenge of learning without human demonstrations and the opportunity to leverage human demonstrations effectively when they are available. Additionally, we explore unsupervised domain adaptation for cases where demonstrations are limited to a specific domain. Experimental evaluations across diverse setups demonstrate the effectiveness of GLAINTEL in unsupervised settings, where it outperforms in-context learning-based approaches that employ larger models with up to 540 billion parameters. Surprisingly, behavioral cloning-based methods that straightforwardly use human demonstrations do not outperform unsupervised variants of GLAINTEL. Additionally, we show that combining human demonstrations with reinforcement learning-based training yields results comparable to methods utilizing GPT-4. The code is available at: https://github.com/MultifacetedNLP/WebAgents-Unsupervised.