
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

(2402.04615)
Published Feb 7, 2024 in cs.CV and cs.AI

Abstract

Screen user interfaces (UIs) and infographics, sharing similar visual language and design principles, play important roles in human communication and human-machine interaction. We introduce ScreenAI, a vision-language model that specializes in UI and infographics understanding. Our model improves upon the PaLI architecture with the flexible patching strategy of pix2struct and is trained on a unique mixture of datasets. At the heart of this mixture is a novel screen annotation task in which the model has to identify the type and location of UI elements. We use these text annotations to describe screens to LLMs and automatically generate question-answering (QA), UI navigation, and summarization training datasets at scale. We run ablation studies to demonstrate the impact of these design choices. At only 5B parameters, ScreenAI achieves new state-of-the-art results on UI- and infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and InfographicVQA) compared to models of similar size. Finally, we release three new datasets: one focused on the screen annotation task and two others focused on question answering.

Figure: Pipeline showing screen annotation, task generation via LLMs, and optional validation by LLM or humans.

Overview

  • ScreenAI is a new model from Google Research designed to improve understanding and interaction with user interfaces and infographics.

  • The model combines the PaLI architecture with the pix2struct flexible patching mechanism and is trained on large datasets that require minimal human labeling.

  • Despite having only 5 billion parameters, it sets new state-of-the-art or best-in-class results on a range of UI- and infographics-based benchmarks.

  • The training pipeline uses LLMs to generate large datasets from textual screen annotations (a minimal sketch of this step follows the list), and fine-tuning with OCR data further improves QA performance.
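To make the data-generation step concrete, here is a minimal sketch, under assumed formats, of how a textual screen annotation could be turned into QA training examples with an LLM. The UIElement schema, the prompt wording, and the call_llm helper are illustrative assumptions rather than the paper's exact pipeline.

```python
# Illustrative sketch of LLM-driven QA data generation from screen annotations.
# The annotation schema, prompt wording, and call_llm() helper are assumptions
# made for illustration; they are not the paper's exact formats.
from dataclasses import dataclass

@dataclass
class UIElement:
    ui_type: str   # e.g. "BUTTON", "TEXT", "IMAGE", "INPUT_FIELD"
    text: str      # visible text, empty string if none
    bbox: tuple    # (x0, y0, x1, y1) in normalized screen coordinates

def screen_to_text(elements: list[UIElement]) -> str:
    """Serialize detected UI elements into a plain-text screen description."""
    lines = []
    for e in elements:
        x0, y0, x1, y1 = e.bbox
        lines.append(f'{e.ui_type} "{e.text}" at ({x0:.2f}, {y0:.2f}, {x1:.2f}, {y1:.2f})')
    return "\n".join(lines)

def build_qa_prompt(screen_description: str, num_questions: int = 3) -> str:
    """Ask an LLM to invent QA pairs grounded only in the listed elements."""
    return (
        "You are given a textual description of a mobile screen.\n"
        f"Generate {num_questions} question-answer pairs that can be answered "
        "only from the elements listed below.\n\n"
        f"{screen_description}\n\n"
        "Format: one 'Q: ... A: ...' pair per line."
    )

# Example usage with a placeholder LLM call:
elements = [
    UIElement("TEXT", "Flight to Zurich", (0.05, 0.08, 0.70, 0.12)),
    UIElement("BUTTON", "Book now", (0.30, 0.85, 0.70, 0.92)),
]
prompt = build_qa_prompt(screen_to_text(elements))
# qa_pairs = call_llm(prompt)   # call_llm stands in for any LLM API
```

Generated pairs can then be validated automatically by another LLM pass or by human raters, as the pipeline figure above indicates.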

Summary

Google Research introduces ScreenAI, a model for understanding screen user interfaces (UIs) and infographics that combines the PaLI architecture with the flexible patching mechanism of pix2struct. Focusing on tasks such as question answering, navigation, and summarization, ScreenAI sets new state-of-the-art or best-in-class results on multiple UI- and infographics-based benchmarks despite a modest parameter count of 5 billion. It also introduces a textual representation of UIs and uses LLMs to generate large training datasets automatically.

Introduction

ScreenAI responds to the shared design principles of infographics and digital UIs, both widely used for exchanging information and supporting user interaction. Infographics are designed so that complex data can be interpreted with little cognitive effort, a goal mirrored in UI design for smooth human-computer interaction. ScreenAI is an effort to understand and interact with such pixel-rich visual content through a single unified model, and in doing so to set a new state of the art.

Methodology

At its core, ScreenAI is a vision-language model (VLM) trained on a mixture of datasets built largely through self-supervision, with minimal human labeling. Central to this mixture is a new screen annotation task in which the model must identify the type and location of UI elements; the resulting text annotations are then passed to LLMs to generate question-answering, navigation, and summarization training data at scale. The architecture follows PaLI, with one key change: pix2struct's flexible patching is used so that images of arbitrary resolution and aspect ratio can be handled. The model's parameters are split between the vision encoder and the language model, and training mixes conventional image-and-text sources with screen-specific tasks.
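The flexible patching can be illustrated with a short sketch: instead of resizing every screenshot to a fixed square, the image is divided into a grid of fixed-size patches whose row and column counts follow its aspect ratio, subject to a maximum patch budget. The patch size and budget below are illustrative defaults, not ScreenAI's actual configuration.

```python
# Minimal sketch of pix2struct-style flexible patching: choose a patch grid
# that roughly preserves the image's aspect ratio while keeping the number
# of patches (the vision sequence length) under a fixed budget.
import math

def flexible_patch_grid(height: int, width: int,
                        patch_h: int = 16, patch_w: int = 16,
                        max_patches: int = 1024) -> tuple[int, int]:
    """Return (rows, cols) with rows * cols <= max_patches such that the
    resized image of size (rows*patch_h, cols*patch_w) keeps its aspect ratio."""
    # Scale factor that would exactly exhaust the patch budget.
    scale = math.sqrt(max_patches * patch_h * patch_w / (height * width))
    rows = max(1, math.floor(scale * height / patch_h))
    cols = max(1, math.floor(scale * width / patch_w))
    return rows, cols

# A tall phone screenshot gets more rows than columns; a wide desktop page
# gets the opposite, instead of both being squashed into the same square.
print(flexible_patch_grid(2400, 1080))   # portrait mobile screen
print(flexible_patch_grid(1080, 1920))   # landscape desktop screen
```

Keeping the native aspect ratio matters for screens, where small text and densely packed widgets are easily destroyed by the distortion of a fixed-resolution resize.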

Evaluations and Results

In evaluation, ScreenAI leads or matches other advanced models across multiple benchmarks, including Multi-page DocVQA, WebSRC, MoTIF, Widget Captioning, and InfographicVQA. The results highlight the model's proficiency in question answering, screen comprehension, and navigation. One notable source of improvement is the incorporation of OCR data during fine-tuning: although it increases the input length, it measurably boosts performance on QA tasks.
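The summary does not spell out how the OCR results are injected, so the following is only a hedged sketch of one common recipe: recognized text is concatenated to the model's text input alongside the question, which lengthens the sequence but gives the decoder direct access to on-screen strings. The input template and the ocr_results structure are assumptions made for illustration.

```python
# Hedged sketch of adding OCR output to the text input at fine-tuning time.
# The "question: ... ocr: ..." template and the ocr_results format are
# illustrative assumptions, not ScreenAI's actual input encoding.

def build_text_input(question: str, ocr_results: list[dict],
                     max_ocr_words: int = 512) -> str:
    """Concatenate the question with OCR'd words in reading order, truncating
    the OCR stream so the overall input length stays bounded."""
    # Each OCR result is assumed to look like {"text": "Zurich", "bbox": (...)}.
    ocr_words = [r["text"] for r in ocr_results][:max_ocr_words]
    return f"question: {question} ocr: {' '.join(ocr_words)}"

print(build_text_input(
    "What is the destination city?",
    [{"text": "Flight", "bbox": None}, {"text": "to", "bbox": None},
     {"text": "Zurich", "bbox": None}],
))
```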

Conclusion

ScreenAI marks a significant step for VLMs that interpret and act on digital content spanning documents, infographics, and a variety of UI formats. Its approach, from LLM-driven data generation to a carefully constructed training mixture, demonstrates how a single unified VLM can handle a broad range of screen-based interactions. Google Research also releases three new datasets, one for screen annotation and two for question answering, giving the community a concrete resource for building and evaluating future models for screen-based QA and related tasks.
