WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs (2404.06369v2)
Abstract: Automatically generating webpage code from webpage designs can significantly reduce the workload of front-end developers, and recent multimodal large language models (MLLMs) have shown promising potential in this area. However, our investigation reveals that most existing MLLMs are constrained by the absence of high-quality, large-scale, real-world datasets, resulting in inadequate performance in automated webpage code generation. To fill this gap, this paper introduces WebCode2M, a new dataset comprising 2.56 million instances, each containing a design image along with the corresponding webpage code and layout details. Sourced from real-world web resources, WebCode2M offers a rich and valuable dataset for webpage code generation across a variety of applications. Dataset quality is ensured by a scoring model that filters out instances with aesthetic deficiencies or incomplete elements. To validate the effectiveness of WebCode2M, we introduce a baseline model based on the Vision Transformer (ViT), named WebCoder, and establish a benchmark for fair comparison. Additionally, we introduce a new metric, TreeBLEU, to measure structural hierarchy recall. The benchmarking results demonstrate that our dataset significantly improves the ability of MLLMs to generate code from webpage designs, confirming its effectiveness and usability for future applications in front-end design tools. Finally, we highlight several practical challenges introduced by our dataset, calling for further research. The code and dataset are publicly available at our project homepage: https://webcode2m.github.io.
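The abstract does not define how TreeBLEU is computed, only that it measures structural hierarchy recall. As a rough illustration under that reading, the sketch below assumes the metric is the fraction of the reference DOM's 1-height subtrees (a parent tag paired with its ordered child tags) that also appear in the generated DOM; the parser, helper names, and exact subtree granularity are illustrative assumptions, not the authors' implementation.

```python
from html.parser import HTMLParser

# HTML void elements never receive a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}

class _TreeBuilder(HTMLParser):
    """Parses HTML into a nested (tag, children) tree."""
    def __init__(self):
        super().__init__()
        self.root = ("root", [])
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self._stack[-1][1].append(node)
        if tag not in VOID_TAGS:          # void elements are leaves
            self._stack.append(node)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and len(self._stack) > 1:
            self._stack.pop()

def _subtrees(node, acc):
    """Collects every 1-height subtree: (parent tag, ordered child tags)."""
    tag, children = node
    if children:
        acc.add((tag, tuple(child[0] for child in children)))
    for child in children:
        _subtrees(child, acc)
    return acc

def tree_bleu(generated_html: str, reference_html: str) -> float:
    """Recall of the reference's 1-height subtrees in the generated DOM."""
    gen, ref = _TreeBuilder(), _TreeBuilder()
    gen.feed(generated_html)
    ref.feed(reference_html)
    gen_sub = _subtrees(gen.root, set())
    ref_sub = _subtrees(ref.root, set())
    return len(gen_sub & ref_sub) / len(ref_sub) if ref_sub else 0.0

# The generated page misses the <footer>, so 3 of the 4 reference
# subtrees are recovered: the call below prints 0.75.
print(tree_bleu(
    "<html><body><div><p></p><img></div></body></html>",
    "<html><body><div><p></p><img></div><footer></footer></body></html>",
))
```

Counting subtree matches rather than token matches makes such a score insensitive to differences in text content and attributes, focusing the comparison on the tag hierarchy itself.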