
Abstract

LLMs struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset with the original model, then applying automated tools to aggressively filter, score, and de-duplicate the data into a refined, higher-quality dataset. The original LLM is improved by finetuning on this refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our evaluation shows the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.

Figure: matrix of win probabilities of model A vs. model B, highlighting training improvements over StarChat.

Overview

  • UICoder proposes a novel method to improve LLMs for generating user interface (UI) code using automated feedback mechanisms, avoiding reliance on expensive human feedback.

  • The methodology involves an iterative self-training process where an LLM generates a large synthetic dataset that is refined using compilers and vision-language models, resulting in enhanced models through iterations.

  • Experimental results show UICoder's performance rivals proprietary models such as GPT-4, with high compilation success rates and strong human preference evaluations, indicating its practical value for streamlined UI code generation.

UICoder: Finetuning LLMs to Generate User Interface Code through Automated Feedback

The paper "UICoder: Finetuning LLMs to Generate User Interface Code through Automated Feedback" presents a sophisticated methodology for improving LLMs to generate user interface (UI) code, specifically using SwiftUI. This work addresses a notable gap where existing LLMs often struggle to consistently generate compilable and visually relevant UI code. Conventional approaches often rely on expensive human feedback or output from proprietary models to enhance LLM performance. This paper, however, explores the use of automated feedback mechanisms such as compilers and multi-modal models to create high-quality outputs.

Methodology

The authors introduce an iterative self-training process that begins with prompting an existing LLM to generate a large synthetic dataset of UI programs. This dataset is then rigorously filtered and refined using automated tools. Notably, a compiler ensures the syntactic correctness and compilability of the code, while a vision-language model assesses the relevance of the generated UI to the input description. This iterative process produces a higher-quality dataset which is used to finetune the original LLM. The result is an enhanced model, termed UICoder, which demonstrates significant improvements through five iterations, accumulating nearly a million synthetic SwiftUI programs for training.
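
To make the pipeline concrete, here is a minimal sketch of one filtering pass in Python, assuming a local Swift toolchain (`swiftc`) and hypothetical `generate_program` and `clip_relevance` helpers; the paper's actual scoring thresholds, rendering setup, and de-duplication strategy are not reproduced here:

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path

def compiles(swift_source: str) -> bool:
    """Return True if the generated SwiftUI program compiles with swiftc."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "Generated.swift"
        src.write_text(swift_source)
        result = subprocess.run(
            ["swiftc", str(src), "-o", str(Path(tmp) / "app")],
            capture_output=True,
        )
        return result.returncode == 0

def refine(descriptions, generate_program, clip_relevance, threshold=0.3):
    """One filtering pass: keep only compilable, relevant, de-duplicated samples."""
    kept, seen = [], set()
    for desc in descriptions:
        code = generate_program(desc)               # sample from the current LLM (hypothetical helper)
        if not compiles(code):                      # compiler feedback: discard non-compiling programs
            continue
        # clip_relevance is assumed to render the UI and score it against the description.
        if clip_relevance(desc, code) < threshold:  # vision-language feedback: discard irrelevant UIs
            continue
        digest = hashlib.sha256(code.encode()).hexdigest()
        if digest in seen:                          # de-duplicate exact repeats (the paper may use fuzzier matching)
            continue
        seen.add(digest)
        kept.append({"prompt": desc, "completion": code})
    return kept
```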

This approach diverges from traditional methods in several key ways:

  1. Automated Dataset Generation: Instead of relying on additional external data or human intervention, the model generates and refines its own training data.
  2. Use of Compilation Success and Vision-Language Models for Filtering: This ensures that only compilable and contextually relevant code is used for finetuning.
  3. Iterative Self-Training: The model continually improves by leveraging its updated outputs in subsequent training iterations (see the sketch after this list).
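
Combining these ingredients, the outer loop might look like the following sketch, reusing the `refine` pass above; `sample_descriptions`, `score_with_clip`, and `finetune` are hypothetical helpers, and the round sizes are illustrative rather than the paper's exact numbers:

```python
def self_train(base_model, iterations=5, samples_per_round=200_000):
    """Iterative self-training: each round's model generates the next round's training data."""
    model = base_model
    for _ in range(iterations):
        descriptions = sample_descriptions(samples_per_round)  # hypothetical source of UI descriptions
        dataset = refine(
            descriptions,
            generate_program=model.generate,  # self-generated synthetic SwiftUI programs
            clip_relevance=score_with_clip,   # hypothetical vision-language scorer
        )
        model = finetune(model, dataset)      # finetune on the refined, higher-quality subset
    return model
```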

Experimental Results

The paper provides comprehensive experimental results that include comparisons with several baseline models, including proprietary and restricted ones. Key metrics used for evaluation include the compilation success rate and the CLIP score, which measures the relevance of the generated UI to the input description. Human preference evaluations were also utilized, deriving Elo ratings from pairwise comparisons to assess the visual quality and adherence to design principles.
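
As an illustration of the CLIP-based relevance metric, a typical setup embeds a rendered screenshot of the generated UI and the input description with a CLIP model and takes their cosine similarity. The sketch below uses the Hugging Face `transformers` CLIP classes with an assumed checkpoint; the paper's exact model and preprocessing may differ:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint for illustration; the paper's CLIP variant may differ.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(screenshot_path: str, description: str) -> float:
    """Cosine similarity between a rendered UI screenshot and its text description."""
    image = Image.open(screenshot_path)
    inputs = processor(text=[description], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    image_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    return float((image_emb * text_emb).sum())
```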

The results are summarized as follows:

  • The UICoder models significantly outperform all downloadable baselines and approach the performance of larger proprietary models such as GPT-3.5 and GPT-4.
  • The compilation rate of UICoder-Top reached 0.82, slightly higher than GPT-4's 0.81.
  • The CLIP score for UICoder-Filtered achieved 0.404, closely aligning with GPT-4's score of 0.419.
  • Human preference evaluations assigned UICoder models high Elo ratings, indicating strong subjective performance in visual quality and relevance to the input descriptions (a minimal Elo update sketch follows this list).
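
For the Elo ratings, one standard way to turn pairwise preferences into scores is the classic Elo update rule. The minimal sketch below is illustrative; the paper's exact K-factor, initialization, and comparison ordering are not reproduced here:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_ratings(comparisons, k: float = 32.0, initial: float = 1000.0):
    """comparisons: iterable of (model_a, model_b, outcome), where outcome is
    1.0 if A wins, 0.0 if B wins, and 0.5 for a tie."""
    ratings = {}
    for a, b, outcome in comparisons:
        ra = ratings.setdefault(a, initial)
        rb = ratings.setdefault(b, initial)
        ea = expected_score(ra, rb)
        ratings[a] = ra + k * (outcome - ea)
        ratings[b] = rb + k * ((1.0 - outcome) - (1.0 - ea))
    return ratings
```

For example, `elo_ratings([("UICoder", "StarChat", 1.0), ("GPT-4", "UICoder", 0.5)])` would produce ratings from two hypothetical pairwise judgments.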

Theoretical and Practical Implications

Theoretical implications of this work suggest that automated feedback mechanisms can be effectively employed to dramatically enhance the performance of LLMs in specialized tasks such as UI code generation. This methodology could be adapted to other domains where high-quality training data is scarce but essential for achieving accurate outputs.

Practically, UICoder provides a valuable tool for developers aiming to streamline the generation of syntactically correct and visually coherent UI code, potentially reducing the time and expertise required for UI development. The results also highlight the feasibility of using current open models without falling back on proprietary systems, fostering greater flexibility and control over the generation process.

Future Developments

Looking forward, integrating more complex forms of automated feedback, such as advanced program-verification techniques or more sophisticated vision-language models, could further refine the quality and capabilities of LLMs in generating UI code. Additionally, expanding the scope beyond SwiftUI to other UI frameworks such as Flutter or React Native could make the approach even more versatile.

Conclusion

The paper offers a compelling method for enhancing the capabilities of LLMs in generating UI code through a combination of iterative self-training and automated feedback. UICoder not only achieves superior performance compared to other freely available models but also demonstrates potential parity with leading proprietary models. The results underscore the effectiveness of leveraging automated tools to guide the generation of high-quality code, providing a pathway for future research and practical applications in automated code generation.
