
Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs

(arXiv:2404.05719)
Published Apr 8, 2024 in cs.CV, cs.CL, and cs.HC

Abstract

Recent advancements in multimodal LLMs (MLLMs) have been noteworthy, yet these general-domain MLLMs often fall short in their ability to comprehend and interact effectively with user interface (UI) screens. In this paper, we present Ferret-UI, a new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. Given that UI screens typically exhibit a more elongated aspect ratio and contain smaller objects of interest (e.g., icons, texts) than natural images, we incorporate "any resolution" on top of Ferret to magnify details and leverage enhanced visual features. Specifically, each screen is divided into 2 sub-images based on the original aspect ratio (i.e., horizontal division for portrait screens and vertical division for landscape screens). Both sub-images are encoded separately before being sent to LLMs. We meticulously gather training samples from an extensive range of elementary UI tasks, such as icon recognition, text finding, and widget listing. These samples are formatted for instruction-following with region annotations to facilitate precise referring and grounding. To augment the model's reasoning ability, we further compile a dataset for advanced tasks, including detailed description, perception/interaction conversations, and function inference. After training on the curated datasets, Ferret-UI exhibits outstanding comprehension of UI screens and the capability to execute open-ended instructions. For model evaluation, we establish a comprehensive benchmark encompassing all the aforementioned tasks. Ferret-UI excels not only beyond most open-source UI MLLMs, but also surpasses GPT-4V on all the elementary UI tasks.

Figure: The Ferret-UI-anyres architecture enhances Ferret-UI-base by adding detailed image features through additional processing layers for improved response generation.

Overview

  • Ferret-UI introduces a Multimodal Large Language Model specifically tailored for understanding mobile UI screens, addressing unique challenges such as elongated aspect ratios and dense elements.

  • The model utilizes an 'any resolution' approach that divides each screen into sub-images for detailed processing, enhancing comprehension of UI elements.

  • In performance evaluations, Ferret-UI surpasses existing open-source UI MLLMs and even GPT-4V in elementary UI tasks and shows strong capabilities in advanced reasoning about UIs.

  • Ferret-UI's development signifies a step toward more intuitive AI interaction with mobile interfaces, with notable implications for accessibility and app testing.

Ferret-UI: Implementing Multimodal LLMs for Enhanced Mobile UI Understanding

Introduction

Mobile applications are ubiquitous in our daily activities, assisting us in a wide array of tasks from information search to entertainment. The drive for more effective interaction with these interfaces has led to the development of systems designed to interpret and act upon UI screens autonomously. This paper introduces Ferret-UI, a tailored Multimodal Large Language Model (MLLM) aimed at understanding mobile UI screens through advanced referring, grounding, and reasoning capabilities. Traditional MLLMs, while proficient with natural images, often falter when applied directly to UI understanding because of the unique characteristics of UI screens, such as elongated aspect ratios and dense, small-sized elements. Ferret-UI tackles these challenges with a purpose-built architecture and training datasets that magnify UI details and improve comprehension of, and interaction with, mobile interfaces.

Model Architecture and Training

Ferret-UI is built on the foundation of Ferret, an MLLM known for its adeptness in referring and grounding tasks. To adapt to the distinct features of UI screens, Ferret-UI introduces an "any resolution" approach, dividing each screen into sub-images for detailed processing. This method ensures that enhanced visual features of UI elements are captured, aiding the model's understanding and interaction capabilities.
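
As a rough illustration of the "any resolution" idea described above, the Python sketch below divides a screenshot along its longer axis (a horizontal cut for portrait screens, a vertical cut for landscape) and encodes the full screen plus the two sub-images separately. The `split_screen` and `encode_screen` helpers, the 336x336 resize, and the `vision_encoder` callable are illustrative assumptions, not the paper's actual implementation.

```python
import torch
from PIL import Image


def split_screen(img: Image.Image):
    """Divide a UI screenshot into two sub-images along its longer axis:
    a horizontal cut for portrait screens, a vertical cut for landscape.
    Hypothetical helper sketching the aspect-ratio-based division."""
    w, h = img.size
    if h >= w:  # portrait: top / bottom halves
        return [img.crop((0, 0, w, h // 2)), img.crop((0, h // 2, w, h))]
    else:       # landscape: left / right halves
        return [img.crop((0, 0, w // 2, h)), img.crop((w // 2, 0, w, h))]


def encode_screen(img: Image.Image, vision_encoder, resize=(336, 336)):
    """Encode the full screen and both sub-images separately, then
    concatenate the resulting visual tokens before passing them to the LLM.
    `vision_encoder` is a placeholder for any image encoder that returns a
    (num_tokens, dim) tensor per view."""
    views = [img] + split_screen(img)
    feats = [vision_encoder(v.resize(resize)) for v in views]
    return torch.cat(feats, dim=0)
```

Concatenating the sub-image tokens with the global-image tokens mirrors the idea of supplementing the base features with magnified detail, as the figure caption above suggests.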

Training Ferret-UI involved curating a diverse dataset that not only covers elementary UI tasks such as icon recognition and text finding, but also targets advanced reasoning through data for detailed description, perception/interaction conversations, and function inference. Samples are formatted for instruction following, with region annotations to support precise referring and grounding, so the model learns both to execute elementary UI tasks and to engage in complex reasoning about UI screens.
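
To make the data format concrete, here is a hedged sketch of what instruction-following samples with region annotations might look like for one elementary task (widget listing) and one advanced task (function inference). The field names, `<box>` notation, file paths, and normalized coordinates are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical training samples; the schema below is for illustration only.
widget_listing_sample = {
    "task": "widget_listing",                      # elementary task
    "image": "screenshots/settings_portrait.png",  # hypothetical path
    "instruction": "List the widgets on this screen with their locations.",
    "response": (
        "Button 'Wi-Fi' <box>[0.05, 0.12, 0.95, 0.18]</box>; "
        "Toggle 'Bluetooth' <box>[0.05, 0.20, 0.95, 0.26]</box>"
    ),
}

function_inference_sample = {
    "task": "function_inference",                  # advanced task
    "image": "screenshots/settings_portrait.png",
    "instruction": (
        "What is the purpose of the region "
        "<box>[0.05, 0.20, 0.95, 0.26]</box>?"
    ),
    "response": "It toggles Bluetooth connectivity on or off.",
}
```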

Evaluation and Benchmarking

Ferret-UI's performance was rigorously evaluated on a comprehensive benchmark of UI understanding tasks. Results show that Ferret-UI significantly outperforms existing open-source UI MLLMs and even surpasses GPT-4V on elementary UI tasks. Evaluation on the advanced tasks further demonstrates Ferret-UI's strong ability to understand and interact with UIs through natural language instructions, highlighting its potential impact on accessibility, app testing, and multi-step navigation.

Implications and Future Directions

The development of Ferret-UI represents a notable step toward more nuanced and effective interaction with mobile UIs through AI. Its ability to understand and reason about UI elements has significant implications for building more intuitive and accessible digital interfaces. Future research could focus on expanding Ferret-UI's capabilities to encompass more varied UI designs and interaction modes. Additionally, exploring the integration of Ferret-UI with real-world applications offers a promising avenue for enhancing user experience and accessibility across mobile platforms.

Ferret-UI's architecture, tailored datasets, and performance across a diverse set of tasks underscore its potential to transform how AI systems understand and interact with mobile user interfaces. As AI continues to evolve, models like Ferret-UI pave the way for more intelligent and user-friendly applications, further advancing human-computer interaction.
