
CogAgent: A Visual Language Model for GUI Agents

(2312.08914)
Published Dec 14, 2023 in cs.CV

Abstract

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. LLMs such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120×1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks, including VQAv2, OK-VQA, Text-VQA, ST-VQA, ChartQA, InfoVQA, DocVQA, MM-Vet, and POPE. CogAgent, using only screenshots as input, outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks -- Mind2Web and AITW, advancing the state of the art. The model and code are available at https://github.com/THUDM/CogVLM.

Figure: examples of CogAgent's outputs across varied tasks.

Overview

  • Visual Language Models (VLMs) enable AI agents to interpret and navigate graphical user interfaces (GUIs).

  • CogAgent is an 18-billion-parameter VLM built to understand and navigate GUIs from high-resolution screenshots.

  • It features a novel high-resolution cross-module that keeps the cost of processing 1120×1120 inputs manageable.

  • CogAgent is trained on diverse, purpose-built datasets and outperforms prior methods on tasks such as visual question answering and GUI navigation.

  • CogAgent's innovation paves the way for future developments in AI-assisted digital interactions.

Introduction to Visual Language Models

Advances in AI have led to Visual Language Models (VLMs) that can interpret and navigate graphical user interfaces (GUIs), an essential part of digital interaction today. Such agents offer a new way to assist users who interact with computers and smartphones through screens.

The Rise of CogAgent

CogAgent is an 18-billion-parameter VLM that specializes in understanding and automating tasks within GUI environments. Unlike standard VLMs, which are constrained to low input resolutions and therefore miss fine detail, CogAgent is engineered to accept 1120×1120 input, allowing it to recognize small GUI elements and read text embedded in screenshots.

Architectural Advancements

CogAgent builds on an existing VLM foundation but introduces a novel high-resolution cross-module: a lightweight high-resolution encoder extracts fine-grained image features, which are injected into the language model through cross-attention. This lets the model work at higher image resolutions without a prohibitive increase in computational cost. By combining low-resolution and high-resolution image encoders, CogAgent handles the detailed visual features found in GUIs, such as icons and embedded text.
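The paper's reference implementation lives in the CogVLM repository; the snippet below is only a minimal PyTorch sketch of the idea, not the released code. The class name, layer sizes, and the convolutional stand-in for the small high-resolution encoder are all illustrative assumptions: a cheap encoder turns the 1120×1120 screenshot into image tokens, and the decoder's hidden states attend to those tokens via cross-attention.

```python
import torch
import torch.nn as nn


class HighResCrossModule(nn.Module):
    """Illustrative sketch of a high-resolution cross-module (assumed names/sizes)."""

    def __init__(self, hidden_dim=1024, hires_dim=256, num_heads=8):
        super().__init__()
        # Lightweight encoder for the 1120x1120 input; a couple of strided
        # convolutions stand in for the small image encoder used in practice.
        self.hires_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=14, stride=14),        # 1120 -> 80
            nn.GELU(),
            nn.Conv2d(64, hires_dim, kernel_size=2, stride=2),   # 80 -> 40
        )
        # Decoder hidden states (queries) attend to high-resolution tokens.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=hidden_dim, kdim=hires_dim, vdim=hires_dim,
            num_heads=num_heads, batch_first=True,
        )

    def forward(self, hidden_states, hires_image):
        # hidden_states: (batch, seq_len, hidden_dim) from the language model
        # hires_image:   (batch, 3, 1120, 1120) screenshot
        feats = self.hires_encoder(hires_image)        # (B, C, 40, 40)
        feats = feats.flatten(2).transpose(1, 2)       # (B, 1600, C)
        attended, _ = self.cross_attn(hidden_states, feats, feats)
        # Residual injection of high-resolution detail into the hidden states.
        return hidden_states + attended


# Usage example with random tensors.
module = HighResCrossModule()
hidden = torch.randn(1, 32, 1024)
screenshot = torch.randn(1, 3, 1120, 1120)
print(module(hidden, screenshot).shape)  # torch.Size([1, 32, 1024])
```

Because the 1120×1120 screenshot only feeds a small encoder and a cross-attention path, rather than being tokenized for the full self-attention stack, the added compute stays modest while fine detail still reaches every decoding step.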

Training and Evaluation

To train CogAgent, the researchers constructed large-scale pre-training datasets focused on recognizing text in varied fonts and sizes and on grounding GUI elements and layouts. CogAgent was then evaluated on several benchmarks, including text-rich visual question-answering (VQA) tasks and GUI navigation suites on both PC (Mind2Web) and Android (AITW), where it achieved state-of-the-art results using only screenshots as input.
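The sketch below illustrates the shape of one screenshot-only navigation step in this setting. The function name, the `vlm_generate` inference call, the prompt wording, and the action format are hypothetical stand-ins for illustration, not the released model's actual API or action schema.

```python
def agent_step(vlm_generate, screenshot_path: str, task: str, history: list[str]) -> str:
    """One GUI-navigation step: screenshot + task in, structured action out.

    `vlm_generate` is an assumed callable taking (image, prompt) and returning text.
    """
    prompt = (
        f"Task: {task}\n"
        f"Previous actions: {'; '.join(history) or 'none'}\n"
        "What is the next action? Answer as ACTION(arguments), "
        "e.g. TAP(x, y) with coordinates normalized to [0, 1]."
    )
    # The screenshot is the only observation -- no HTML or accessibility tree.
    return vlm_generate(image=screenshot_path, prompt=prompt)


# Usage example with a stubbed model that always taps the screen center.
action = agent_step(lambda image, prompt: "TAP(0.50, 0.50)",
                    "screenshot.png", "Open the settings app", [])
print(action)  # TAP(0.50, 0.50)
```

The key point this illustrates is the input contract: unlike LLM-based agents that consume extracted HTML text, the model receives only pixels plus the task and action history.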

The Future of AI Agents and VLMs

CogAgent represents a significant stride in the realm of AI agents and VLMs. With its high-resolution input capabilities and efficient architecture, CogAgent holds promise for future research and applications in increasingly automated and AI-assisted digital interactions across various devices.
