Grounded Language Agent for Product Search via Intelligent Web Interactions (2404.10887v2)

Published 16 Apr 2024 in cs.CL

Abstract: The development of agents powered by LLMs to accomplish complex, high-level user intents has attracted significant attention recently. However, employing LLMs with billions of parameters (e.g., GPT-4) may incur substantial costs on top of handcrafting extensive prompts. To address this, we introduce a Grounded Language Agent for Intelligent Web Interactions, named GLAINTEL. GLAINTEL employs Flan-T5 as its backbone and can be trained in various settings: unsupervised learning, supervised learning, and unsupervised domain adaptation. Specifically, we tackle both the challenge of learning without human demonstrations and the opportunity to leverage human demonstrations effectively when they are available. Additionally, we explore unsupervised domain adaptation for cases where demonstrations are limited to a specific domain. Experimental evaluations across diverse setups demonstrate the effectiveness of GLAINTEL in unsupervised settings, outperforming in-context learning-based approaches that employ larger models with up to 540 billion parameters. Surprisingly, behavioral cloning-based methods that straightforwardly use human demonstrations do not outperform unsupervised variants of GLAINTEL. Additionally, we show that combining human demonstrations with reinforcement learning-based training yields results comparable to methods utilizing GPT-4. The code is available at: https://github.com/MultifacetedNLP/WebAgents-Unsupervised.
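The reinforcement learning-based training the abstract refers to can be illustrated, in miniature, with a REINFORCE-style policy-gradient update on a sparse purchase reward. This is a deliberately simplified sketch, not the paper's method: `ToyShopEnv` and `SoftmaxPolicy` are hypothetical stand-ins (a single-step bandit-like task and a preference table) for the multi-step WebShop-style environment and the Flan-T5 policy that GLAINTEL actually uses.

```python
import math
import random

random.seed(0)  # deterministic for reproducibility

class ToyShopEnv:
    """Hypothetical single-step stand-in for a WebShop-style product-search task.

    A real web agent takes many search/click actions per episode; here the
    episode collapses to one action so the sparse reward signal stays simple.
    """
    def __init__(self, target="red running shoes"):
        self.target = target

    def reset(self):
        return f"instruction: buy {self.target}"

    def step(self, action):
        # Sparse terminal reward: 1.0 only if the right product is bought.
        return 1.0 if action == f"buy[{self.target}]" else 0.0

class SoftmaxPolicy:
    """Toy stand-in for the language-model policy: one preference per action."""
    def __init__(self, actions):
        self.prefs = {a: 0.0 for a in actions}

    def probs(self):
        z = sum(math.exp(v) for v in self.prefs.values())
        return {a: math.exp(v) / z for a, v in self.prefs.items()}

    def sample(self):
        p = self.probs()
        r, acc = random.random(), 0.0
        for a, pa in p.items():
            acc += pa
            if r <= acc:
                return a, p
        return a, p  # guard against floating-point rounding

def train(env, policy, episodes=300, lr=0.5):
    for _ in range(episodes):
        env.reset()
        action, p = policy.sample()
        reward = env.step(action)
        # REINFORCE gradient for a softmax over discrete actions:
        # raise the sampled action's preference, lower the others,
        # scaled by the (sparse) reward.
        for a in policy.prefs:
            grad = (1.0 - p[a]) if a == action else -p[a]
            policy.prefs[a] += lr * reward * grad
    return policy

env = ToyShopEnv()
policy = SoftmaxPolicy(["buy[red running shoes]", "buy[blue hat]", "search[laptop]"])
train(env, policy)
best = max(policy.prefs, key=policy.prefs.get)
```

After training, `best` is the correct purchase action: only episodes that happen to sample it receive a nonzero reward, so its preference is the only one pushed upward. The paper's contribution lies precisely in scaling this kind of reward-driven update to a Flan-T5 policy over real web observations, with or without human demonstrations in the mix.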
