Emergent Mind

Abstract

The User Interface (UI) is pivotal for human interaction with the digital world, facilitating efficient control of machines, information navigation, and complex task completion. To achieve easy, efficient, and free-form interactions, researchers have been exploring the potential of encapsulating traditional Programming Language Interfaces (PLIs) and Graphical User Interfaces (GUIs) into Natural Language Interfaces (NLIs). However, due to the limited capabilities of small models, prior work has mainly focused on tasks that require only a single step, which largely constrains the application of NLIs. Recently, Large Language Models (LLMs) have exhibited robust reasoning and planning abilities, yet their potential for multi-turn interactions in complex environments remains under-explored. To assess LLMs as NLIs in real-world graphical environments, we introduce Mobile-Env, a GUI interaction platform targeting mobile apps. Mobile-Env improves interaction flexibility, task extensibility, and environment adaptability compared with previous environments. A GUI task set based on the WikiHow app is collected on Mobile-Env to form a benchmark covering a range of GUI interaction capabilities. We further conduct comprehensive evaluations of LLM agents, including various versions of GPT, LLaMA 2, and AgentLM, on the WikiHow task set to gain insights into the potentials and challenges of LLMs in GUI interactions.
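The multi-turn agent–environment interaction the abstract describes can be pictured as an episodic observe–act–reward loop. The sketch below is a minimal toy illustration of that loop; the environment, its `reset`/`step` methods, and the scripted agent are all hypothetical stand-ins, not Mobile-Env's actual API.

```python
# Toy sketch of an episodic GUI-agent loop: observe the screen, emit a
# natural-language action, receive a reward. All names here are illustrative
# assumptions, not Mobile-Env's real interface.
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: str   # e.g., screen text or a view-hierarchy summary
    reward: float
    done: bool


class ToyGUIEnv:
    """Minimal stand-in environment: the task is to 'open' a target page."""

    def __init__(self, target: str):
        self.target = target
        self.screen = "home"

    def reset(self) -> str:
        self.screen = "home"
        return self.screen

    def step(self, action: str) -> StepResult:
        # Actions are free-form commands, as an NLI agent would emit.
        if action == f"open {self.target}":
            self.screen = self.target
            return StepResult(self.screen, 1.0, True)
        return StepResult(self.screen, 0.0, False)


def scripted_agent(observation: str, target: str) -> str:
    # Stand-in for an LLM policy: map the current observation to a command.
    return f"open {target}"


env = ToyGUIEnv(target="wikihow_article")
obs = env.reset()
total_reward = 0.0
for _ in range(5):  # multi-turn interaction budget
    result = env.step(scripted_agent(obs, env.target))
    obs, total_reward = result.observation, total_reward + result.reward
    if result.done:
        break
print(obs, total_reward)  # wikihow_article 1.0
```

In a real evaluation, `scripted_agent` would be replaced by an LLM prompted with the observation and task instruction, and the environment would render actual app screens; the loop structure itself is the part this sketch means to convey.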
