
Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

(2406.20098)
Published Jun 28, 2024 in cs.CV, cs.AI, and cs.CL

Abstract

Multimodal LLMs (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.

Figure: Evaluation methods for webpage generation and understanding, using online image comparison and offline QA pairs.

Overview

  • The paper introduces Web2Code, a novel large-scale dataset designed to improve the capabilities of multimodal LLMs (MLLMs) in generating HTML code from webpage screenshots.

  • The dataset contains 1,179.7k webpage-based instruction-response pairs, constructed and refined using models such as GPT-3.5 and GPT-4, and is paired with evaluations of both webpage understanding and code generation.

  • Experimental results demonstrate that training MLLMs on Web2Code significantly enhances their performance in HTML code generation, with potential applications in web development automation, accessibility tools, and virtual prototyping.

Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

The paper "Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs" presents a novel dataset and accompanying evaluation framework designed to enhance the capabilities of multimodal LLMs (MLLMs) in understanding and generating HTML code from webpage screenshots. The authors introduce Web2Code, a comprehensive dataset comprising 1,179.7k webpage-based instruction-response pairs and establish a thorough evaluation suite to benchmark the performance of MLLMs in tasks related to webpage understanding and HTML code generation.

Motivation

The motivation behind Web2Code stems from the current inadequacies of existing MLLMs in accurately interpreting webpage screenshots and translating them into HTML code. Despite the proficiency of these models in handling multimodal inputs, such as images, videos, and audio, their performance falters significantly with web-based content. This shortcoming poses substantial limitations for applications requiring accurate webpage representations, such as UI prototyping, task automation, and accessibility enhancements.

Dataset Construction

The construction of the Web2Code dataset involves several strategic steps:

  1. Creation of New Webpage Image-Code Pairs (DWCG): Using GPT-3.5, the authors generated 60K new high-quality HTML webpages, rendered them into screenshots to form image-code pairs, and converted these into instruction-following data (an example record format is sketched after this list). This step ensures the inclusion of well-structured, diverse HTML samples in the dataset.
  2. Refinement of Existing Webpage Code Generation Data (DWCG_R): The authors refined existing datasets like WebSight and Pix2Code by enhancing the quality of HTML code through GPT-4, converting these datasets into an instruction-following format compatible with MLLMs.
  3. Generation of Webpage Understanding Data (DWU): To cater to tasks requiring comprehensive web content understanding, the authors generated 243.5K question-answer pairs using GPT-4, focusing on various webpage elements and their configurations.
  4. Refinement of Webpage Understanding Data (DWU_R): Existing datasets such as WebSRC were refined to enhance their quality and eliminate duplications, ensuring high fidelity of the instruction data.
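
The released schema is not reproduced in the paper summary, but a minimal sketch of what one such instruction-following record could look like is shown below, assuming a LLaVA-style conversation format in which the webpage screenshot is the visual input and the HTML source plus natural language QA pairs form the responses. The field names, identifier, and instruction wording are illustrative assumptions, not the dataset's actual format.

```python
# Hypothetical Web2Code-style training record (LLaVA-style conversation format).
# All field names and values are illustrative, not the released schema.
example_record = {
    "id": "dwcg_000001",                # hypothetical sample identifier
    "image": "images/dwcg_000001.png",  # rendered webpage screenshot
    "conversations": [
        {
            "from": "human",
            # code-generation instruction paired with the screenshot
            "value": "<image>\nGenerate the HTML code that reproduces this webpage.",
        },
        {
            "from": "gpt",
            # response is the webpage's HTML source
            "value": "<!DOCTYPE html>\n<html>\n<head>...</head>\n<body>...</body>\n</html>",
        },
        {
            "from": "human",
            # webpage-understanding QA pair (DWU-style) about the same page
            "value": "What text appears in the page header?",
        },
        {
            "from": "gpt",
            "value": "The header reads 'Example Storefront'.",
        },
    ],
}
```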

Evaluation Framework

The authors propose a dual-faceted evaluation framework comprising two benchmarks:

  • Webpage Understanding Benchmark (WUB): This benchmark tests a model's ability to answer "Yes/No" questions about various aspects of webpage content, using 5,990 question-answer pairs generated with the GPT-4 Vision API.
  • Webpage Code Generation Benchmark (WCGB): This benchmark renders the model's output HTML back into an image and compares it with the ground-truth screenshot using the GPT-4 Vision API (a sketch of such a pipeline follows this list). Scoring covers four categories: Visual Structure and Alignment, Color and Aesthetic Design, Textual and Content Consistency, and User Interface and Interactivity.
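
The paper does not ship the grader below; it is a minimal sketch of how a WCGB-style render-and-compare step might be implemented, assuming Playwright for headless rendering and the OpenAI chat completions API with a vision-capable model (gpt-4o appears here as a stand-in for the GPT-4 Vision API named in the paper). The rubric prompt and scoring scale are illustrative assumptions rather than the authors' exact setup.

```python
import base64

from openai import OpenAI                        # assumes the openai>=1.x client
from playwright.sync_api import sync_playwright  # assumes Playwright is installed

RUBRIC = (
    "Compare the predicted webpage screenshot (first image) with the ground-truth "
    "screenshot (second image). Score each of the following from 0 to 10: "
    "Visual Structure and Alignment, Color and Aesthetic Design, "
    "Textual and Content Consistency, User Interface and Interactivity. "
    "Answer with four comma-separated numbers."
)

def render_html(html: str, out_path: str) -> None:
    """Render an HTML string to a full-page screenshot with headless Chromium."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.set_content(html, wait_until="networkidle")
        page.screenshot(path=out_path, full_page=True)
        browser.close()

def encode_image(path: str) -> str:
    """Base64-encode an image file for use in an image_url message part."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def score_prediction(pred_html: str, gt_image_path: str) -> str:
    """Render the predicted HTML, then ask a vision-capable model to grade it
    against the ground-truth screenshot along the four WCGB-style axes."""
    render_html(pred_html, "pred.png")
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o",  # stand-in; the paper uses the GPT-4 Vision API
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": RUBRIC},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image('pred.png')}"}},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{encode_image(gt_image_path)}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Under this setup, the WUB side reduces to prompting the same vision-capable model with a screenshot and a "Yes/No" question and checking the answer against the reference label; the code-generation benchmark is the part that needs the extra render-and-compare step.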

Experimental Results

The authors conducted extensive experiments to validate the utility of their dataset, training MLLMs built on LLM backbones such as CrystalChat-7B and Vicuna-1.5-7B with Web2Code. The results reveal that fine-tuning MLLMs on Web2Code notably enhances their capabilities in HTML code generation without degrading general visual reasoning performance. In particular, models trained with the Web2Code dataset achieved superior results on the WCGB benchmark, with strong scores in categories such as Visual Structure and Alignment and Color and Aesthetic Design.

Implications and Future Work

The implications of this work are significant for both theoretical and practical applications. Theoretically, the introduction of a large-scale, high-quality dataset specifically tailored for webpage-to-code translation presents a new avenue for research in multimodal learning. Practically, improvements in MLLMs’ ability to accurately generate HTML code from webpage screenshots can revolutionize fields such as web development automation, accessibility tools, and virtual prototyping.

Future developments may include expanding the dataset to encompass more diverse webpage examples, refining evaluation metrics to include aspects like code efficiency, and integrating additional tasks into the evaluation framework. Moreover, extending the scope to include dynamic web elements and scripting languages like JavaScript could further enhance the applicability of MLLMs in real-world web development scenarios.

In conclusion, the Web2Code dataset and evaluation framework represent a substantial advancement in the domain of multimodal LLMs, paving the way for more robust and capable models in web-based applications.
