
CogAgent: A Visual Language Model for GUI Agents

(2312.08914)
Published Dec 14, 2023 in cs.CV

Abstract

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. LLMs such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW, advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM.

Figure: examples of CogAgent's outputs across varied tasks.

Overview

  • Visual Language Models (VLMs) enable AI agents to interpret and navigate graphical user interfaces (GUIs).

  • CogAgent is an 18-billion-parameter VLM built to understand and navigate GUIs from high-resolution screenshots.

  • It features a novel high-resolution cross-module that keeps the cost of processing 1120×1120 inputs manageable.

  • CogAgent is trained on diverse, purpose-built datasets and outperforms prior methods on tasks such as visual question answering and GUI navigation.

  • CogAgent's innovation paves the way for future developments in AI-assisted digital interactions.

Introduction to Visual Language Models

Advances in AI have led to Visual Language Models (VLMs) that can interpret and navigate graphical user interfaces (GUIs), an essential part of digital interaction today. Such agents offer a new way to assist users who interact with computers and smartphones through screens.

The Rise of CogAgent

CogAgent is an 18-billion-parameter VLM that specializes in understanding and automating tasks within GUI environments. Unlike standard VLMs, which are constrained to low input resolutions and therefore miss fine detail, CogAgent is engineered to accept 1120×1120 input, allowing it to recognize small GUI elements and read text embedded in screenshots.

Architectural Advancements

CogAgent builds on an existing VLM foundation but introduces a novel high-resolution cross-module: a lightweight high-resolution encoder extracts fine-grained image features, which are injected into the language model through cross-attention. This lets the model work at higher image resolutions without a prohibitive increase in computational cost. By combining low-resolution and high-resolution image encoders, CogAgent handles the detailed visual features found in GUIs, such as icons and embedded text.
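The paper's reference implementation lives in the CogVLM repository; the snippet below is only a minimal PyTorch sketch of the idea, not the released code. The class name, layer sizes, and the convolutional stand-in for the small high-resolution encoder are all illustrative assumptions: a cheap encoder turns the 1120×1120 screenshot into image tokens, and the decoder's hidden states attend to those tokens via cross-attention.

```python
import torch
import torch.nn as nn


class HighResCrossModule(nn.Module):
    """Illustrative sketch of a high-resolution cross-module (assumed names/sizes)."""

    def __init__(self, hidden_dim=1024, hires_dim=256, num_heads=8):
        super().__init__()
        # Lightweight encoder for the 1120x1120 input; a couple of strided
        # convolutions stand in for the small image encoder used in practice.
        self.hires_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=14, stride=14),        # 1120 -> 80
            nn.GELU(),
            nn.Conv2d(64, hires_dim, kernel_size=2, stride=2),   # 80 -> 40
        )
        # Decoder hidden states (queries) attend to high-resolution tokens.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=hidden_dim, kdim=hires_dim, vdim=hires_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, hidden_states, hires_image):
        # hidden_states: (batch, seq_len, hidden_dim) from the language model
        # hires_image:   (batch, 3, 1120, 1120) screenshot
        feats = self.hires_encoder(hires_image)        # (B, C, 40, 40)
        feats = feats.flatten(2).transpose(1, 2)       # (B, 1600, C)
        attended, _ = self.cross_attn(hidden_states, feats, feats)
        # Residual injection of high-resolution detail into the hidden states.
        return hidden_states + attended


# Usage example with random tensors.
module = HighResCrossModule()
hidden = torch.randn(1, 32, 1024)
screenshot = torch.randn(1, 3, 1120, 1120)
print(module(hidden, screenshot).shape)  # torch.Size([1, 32, 1024])
```

Because the 1120×1120 screenshot only feeds a small encoder and a cross-attention path, rather than being tokenized for the full self-attention stack, the added compute stays modest while fine detail still reaches every decoding step.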

Training and Evaluation

To train CogAgent, the researchers constructed large-scale pre-training datasets focused on recognizing text in varied fonts and sizes and on grounding GUI elements and layouts. CogAgent was then evaluated on several benchmarks, including text-rich visual question-answering (VQA) tasks and GUI navigation suites on both PC (Mind2Web) and Android (AITW), where it achieved state-of-the-art results using only screenshots as input.
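The sketch below illustrates the shape of one screenshot-only navigation step in this setting. The function name, the `vlm_generate` inference call, the prompt wording, and the action format are hypothetical stand-ins for illustration, not the released model's actual API or action schema.

```python
def agent_step(vlm_generate, screenshot_path: str, task: str, history: list[str]) -> str:
    """One GUI-navigation step: screenshot + task in, structured action out.

    `vlm_generate` is an assumed callable taking (image, prompt) and returning text.
    """
    prompt = (
        f"Task: {task}\n"
        f"Previous actions: {'; '.join(history) or 'none'}\n"
        "What is the next action? Answer as ACTION(arguments), "
        "e.g. TAP(x, y) with coordinates normalized to [0, 1]."
    )
    # The screenshot is the only observation -- no HTML or accessibility tree.
    return vlm_generate(image=screenshot_path, prompt=prompt)


# Usage example with a stubbed model that always taps the screen center.
action = agent_step(lambda image, prompt: "TAP(0.50, 0.50)",
                    "screenshot.png", "Open the settings app", [])
print(action)  # TAP(0.50, 0.50)
```

The key point this illustrates is the input contract: unlike LLM-based agents that consume extracted HTML text, the model receives only pixels plus the task and action history.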

The Future of AI Agents and VLMs

CogAgent represents a significant stride in the realm of AI agents and VLMs. With its high-resolution input capabilities and efficient architecture, CogAgent holds promise for future research and applications in increasingly automated and AI-assisted digital interactions across various devices.
