
UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback (2406.07739v1)

Published 11 Jun 2024 in cs.CL, cs.HC, and cs.SE

Abstract: LLMs struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset using an original model, applying automated tools to aggressively filter, score, and de-duplicate the data into a refined higher quality dataset. The original LLM is improved by finetuning on this refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our evaluation shows the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.

Summary

  • The paper introduces an automated feedback approach that uses compiler success and CLIP scores to filter and iteratively refine synthetic UI code data.
  • It employs a multi-stage training pipeline combining supervised finetuning and preference alignment to boost both syntactic correctness and semantic relevance.
  • Empirical results demonstrate a significant increase in compilation rates and competitive performance compared to proprietary models in UI code generation.

Automated Feedback for LLM-Based UI Code Generation: An Analysis of UICoder

Introduction

The paper "UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback" (2406.07739) presents a systematic approach for improving LLMs in the domain of user interface (UI) code generation, specifically targeting SwiftUI. The authors address the scarcity of high-quality, domain-specific training data and the limitations of relying on expensive human feedback or proprietary model distillation. Their method leverages automated feedback from compilers and vision-language models to iteratively filter, score, and refine self-generated synthetic datasets, enabling the finetuning of LLMs for robust UI code generation. The resulting UICoder models demonstrate strong empirical performance, approaching that of larger proprietary models and outperforming all other downloadable baselines.

Methodology

Multi-Stage Training Pipeline

The training pipeline consists of three primary stages: (1) base model selection, (2) supervised finetuning with automated feedback, and (3) preference alignment. The process is depicted in Figure 1.

Figure 1: A flow chart showing an overview of the multi-step training process, including a base model, supervised-tuned model, and an aligned model.

Data Generation and Filtering

  • Synthetic Data Generation: The base LLM (StarChat-Beta, a 15B parameter model) is prompted to generate SwiftUI code from a curated set of UI descriptions.
  • Automated Filtering: Generated samples are filtered using the following criteria (a code sketch follows this list):
    • Compilation Success: Only compilable programs are retained.
    • CLIP Score: A vision-language model (CLIP) scores the semantic alignment between the rendered UI screenshot and the input description.
    • De-duplication: Density-based clustering (DBSCAN) on CLIP embeddings removes visually redundant samples.
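
The sketch below illustrates this filtering stage in Python. The CLIP scoring uses the Hugging Face transformers API; `compile_swiftui` and `render_screenshot` are hypothetical stand-ins for the paper's Swift compiler and macOS renderer, and the score threshold and DBSCAN parameters are illustrative, not the paper's reported settings.

```python
import numpy as np
import torch
from sklearn.cluster import DBSCAN
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score_and_embed(screenshot, description):
    """Cosine similarity between a rendered UI and its text description,
    plus the normalized image embedding (reused for de-duplication)."""
    inputs = processor(text=[description], images=screenshot,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img @ txt.T).item()), img.squeeze(0).numpy()

def filter_samples(samples, score_threshold=0.35):
    """Keep compilable, on-description, visually distinct (desc, code) pairs."""
    kept, embeds = [], []
    for desc, code in samples:
        ok, binary = compile_swiftui(code)      # hypothetical: invoke the Swift compiler
        if not ok:
            continue                            # drop non-compiling programs
        shot = render_screenshot(binary)        # hypothetical: macOS screenshot renderer
        score, embed = clip_score_and_embed(shot, desc)
        if score < score_threshold:             # illustrative threshold
            continue                            # drop off-description UIs
        kept.append((desc, code))
        embeds.append(embed)
    # De-duplicate visually redundant samples via DBSCAN on CLIP embeddings.
    labels = DBSCAN(eps=0.1, metric="cosine").fit_predict(np.array(embeds))
    seen, deduped = set(), []
    for sample, label in zip(kept, labels):
        if label == -1 or label not in seen:    # keep outliers and one sample per cluster
            seen.add(label)
            deduped.append(sample)
    return deduped
```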

Iterative Self-Improvement

The filtered high-quality samples are used to finetune the LLM, which is then used to generate improved datasets in subsequent iterations. This bootstrapping process is repeated, progressively enhancing the model's ability to generate syntactically correct and semantically relevant UI code.
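
In pseudocode, the bootstrap reads roughly as follows; `generate` and `finetune` are hypothetical wrappers around sampling and supervised training, `filter_samples` is the filtering stage sketched above, and the iteration and sample counts are illustrative.

```python
def bootstrap(base_model, descriptions, iterations=5):
    """Filter-then-train bootstrapping: each round's model produces
    the next round's training data."""
    model = base_model
    refined = []
    for _ in range(iterations):
        # 1. Self-generate SwiftUI programs from the curated UI descriptions.
        samples = [(desc, code)
                   for desc in descriptions
                   for code in generate(model, desc, num_samples=4)]
        # 2. Aggressively filter, score, and de-duplicate the generations.
        refined = filter_samples(samples)
        # 3. Finetune on the refined dataset; the improved model generates
        #    the next iteration's data.
        model = finetune(model, refined)
    return model, refined
```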

Preference Alignment

After supervised finetuning, three alignment strategies are explored:

  • Direct Preference Optimization (DPO): Outputs are ranked pairwise using automated rules, and DPO optimizes the model directly on the resulting preference pairs (a sketch of the objective follows this list).
  • Top Output Supervision: Supervised finetuning on the highest-ranked output per prompt.
  • Filter-then-Train: An additional iteration of the filter-then-train process without explicit ranking.
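
For reference, the DPO objective on such rule-ranked pairs can be written in a few lines of PyTorch. This is the standard loss from Rafailov et al. (2023), not code from the paper, and the per-sequence log-probabilities are assumed to be summed over completion tokens.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss; beta is the usual DPO
    temperature (the value here is illustrative)."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between preferred and dispreferred outputs.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```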

Training Infrastructure

The infrastructure comprises distributed GPU servers for code generation, macOS-based renderers for screenshot generation, and automated pipelines for scoring and dataset assembly. LoRA is used for parameter-efficient finetuning, and QLoRA is employed for DPO due to VRAM constraints.
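
A plausible configuration with the Hugging Face peft library is sketched below; the rank, target modules, and quantization settings are illustrative assumptions, not the paper's reported hyperparameters.

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["c_attn", "c_proj"],   # attention projections; model-dependent
    task_type="CAUSAL_LM",
)

# LoRA for parameter-efficient supervised finetuning.
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/starchat-beta",
                                             torch_dtype=torch.bfloat16)
model = get_peft_model(model, lora_config)

# QLoRA (4-bit base weights) when VRAM is the constraint, e.g. for DPO.
qmodel = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceH4/starchat-beta",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16),
)
qmodel = prepare_model_for_kbit_training(qmodel)
qmodel = get_peft_model(qmodel, lora_config)
```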

Experimental Evaluation

Performance Trajectory

The iterative training process yields substantial improvements in both automated metrics:

  • Compilation Rate: Increases from 0.03 (base) to 0.79 (UICoder-Filtered).
  • CLIP Score: Rises from 0.334 to 0.404 (Figure 2).

Figure 2: A plot of two automatically calculated metrics over time (on a held-out set): compilation rate and mean CLIP score. Over the course of training, the model improves on the metrics used to filter its training data.

The largest gains occur in the initial iterations, with diminishing returns in later stages. The inclusion of LLM-augmented and paraphrased descriptions, as well as platform diversity (iOS and Android), further enhances model robustness.

Baseline and Distillation Comparisons

UICoder models are benchmarked against proprietary (GPT-3.5, GPT-4), restricted (WizardCoder, MPT-30B-Chat), and permissive (StarChat-Beta, Octocoder) baselines. Additionally, the utility of UICoder-generated data for distillation is demonstrated by finetuning other LLMs (MPT-30B, MPT-7B, Octocoder) on the synthetic dataset (Figure 3).

Figure 3: Matrix showing the predicted win probability of model A against model B. The training technique raised an initially poorly performing base model (StarChat) to be competitive with larger proprietary models (UICoder).
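
As an aside, a common way to turn pairwise human preferences into such a predicted win-probability matrix is to fit a Bradley-Terry model; the sketch below uses the standard MM fitting procedure (Hunter, 2004) and is an assumption about the analysis, not code from the paper.

```python
import numpy as np

def fit_bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    """wins[i, j] = number of times model i was preferred over model j."""
    n = wins.shape[0]
    strength = np.ones(n)
    for _ in range(iters):                 # MM updates (Hunter, 2004)
        games = wins + wins.T              # comparisons between each pair
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum(games[i, j] / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            if denom > 0:
                strength[i] = total_wins / denom
        strength /= strength.sum()         # fix the overall scale
    return strength

def win_probability(strength: np.ndarray) -> np.ndarray:
    """P(model i beats model j) under the fitted Bradley-Terry model."""
    s = strength
    return s[:, None] / (s[:, None] + s[None, :])
```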

Key findings:

  • UICoder-Top achieves a compilation rate of 0.82, surpassing GPT-4 (0.81) and approaching GPT-3.5 (0.88).
  • CLIP scores for UICoder variants (0.393–0.404) are close to proprietary models (0.416–0.419).
  • Distilled models (e.g., MPT-30B++) also show marked improvements, validating the generalizability of the synthetic dataset.

Qualitative Analysis

The paper provides qualitative evidence of the model's ability to generate visually plausible and structurally complex UIs from natural language descriptions (Figure 4).

Figure 4: Screenshots rendered from SwiftUI code generated by the models. For illustration purposes, stock photos and icons were manually included; the model-generated code was not modified except for image asset names.

Limitations

The approach is subject to several limitations:

  • Domain Specificity: The focus on SwiftUI limits immediate applicability to other toolkits, though the methodology is generalizable.
  • Synthetic Data Bias: Reliance on self-generated data may constrain diversity and realism.
  • Automated Feedback Granularity: Compilation success is a binary signal; CLIP scores may not capture subtle design or accessibility issues.
  • Evaluation Scope: Human evaluation is limited to expert raters and static screenshots, potentially underrepresenting usability and interactivity.

Implications and Future Directions

Practical Implications

  • Open-Source Model Enhancement: The method enables significant improvements to open-source LLMs in specialized domains without proprietary data or human annotation.
  • Automated Feedback Loops: Compiler and VLM-based feedback can be integrated into other code generation or domain-specific LLM finetuning pipelines.
  • Data Efficiency: The iterative, filter-then-train paradigm demonstrates strong data efficiency, especially in low-resource domains.

Theoretical Implications

  • Self-Improvement via Automated Feedback: The results support the hypothesis that LLMs can be bootstrapped in underrepresented domains through self-generated, automatically filtered data.
  • Preference Modeling: The limited gains from DPO and preference alignment suggest that further research is needed to optimize reward modeling in code generation tasks.

Future Work

  • Generalization to Other Toolkits: Extending the methodology to web (React, HTML/CSS) or cross-platform (Flutter) UI code generation.
  • Enhanced Feedback Signals: Incorporating program analysis, static verification, or more sophisticated VLMs for finer-grained feedback.
  • Broader Evaluation: Scaling human evaluation to include end-users and interactive assessments; direct measurement of code quality and maintainability.

Conclusion

UICoder demonstrates that automated feedback from compilers and vision-language models can be effectively leveraged to finetune LLMs for UI code generation, circumventing the need for human-labeled data or proprietary model outputs. The approach yields models that are competitive with state-of-the-art proprietary systems and establishes a reproducible, scalable framework for domain-specific LLM enhancement. The release of model weights and synthetic datasets further facilitates research and application in UI code generation and related areas.
