How LLMs Aid in UML Modeling: An Exploratory Study with Novice Analysts (2404.17739v2)

Published 27 Apr 2024 in cs.SE

Abstract: Since the emergence of GPT-3, LLMs have caught the eyes of researchers, practitioners, and educators in the field of software engineering. However, there has been relatively little investigation regarding the performance of LLMs in assisting with requirements analysis and UML modeling. This paper explores how LLMs can assist novice analysts in creating three types of typical UML models: use case models, class diagrams, and sequence diagrams. For this purpose, we designed the modeling tasks of these three UML models for 45 undergraduate students who participated in a requirements modeling course, with the help of LLMs. By analyzing their project reports, we found that LLMs can assist undergraduate students as novice analysts in UML modeling tasks, but LLMs also have shortcomings and limitations that should be considered when using them.


Summary

  • The paper demonstrates that LLMs support UML model drafting, achieving 88.89% correctness in identifying use cases and 82.22% in sequencing messages.
  • The study found that while LLMs perform moderately in class diagram creation (66.67% for classes, 75.56% for operations), they struggle with identifying relationships (24.44% correctness).
  • Hybrid-created diagrams, which combine AI generation with human refinement, outperformed other formats, underscoring the importance of human oversight.

How LLMs Aid in UML Modeling: An Exploratory Study with Novice Analysts

Introduction

The study "How LLMs Aid in UML Modeling: An Exploratory Study with Novice Analysts" explores the capacity of LLMs to assist undergraduate students in creating UML models, specifically use case diagrams, class diagrams, and sequence diagrams. This investigation comes in the context of a requirements modeling course involving 45 participants, aiming to understand the practical impact of LLMs in software engineering tasks.

Experimentation and Design

The experimental design involved a structured task in which students used LLMs, predominantly ChatGPT, to aid in creating UML diagrams for a given case study. Each participant submitted a project report comprising the generated UML models and the transcript of their interactions with the LLMs.

Figure 1: The process of the experiment.

Results of UML Model Creation

Use Case Modeling

In evaluating the use case models generated with LLM assistance, several insights were evident:

  • LLMs excelled at identifying use cases, achieving 88.89% correctness.
  • However, the identification of actors and their relationships was notably less effective, achieving only 31.11% and 17.78% correctness, respectively.

Class Diagram Modeling

For class diagram creation, LLMs demonstrated good performance in identifying classes and operations, with correctness rates of 66.67% and 75.56%, respectively.

Figure 2: Distribution of the participants with/without experience of using LLMs.

  • The recognition of relationships among classes presented challenges, with a correctness rate of merely 24.44%.

Sequence Diagram Modeling

Sequence diagrams benefitted from LLM assistance in recognizing objects and sequencing messages, where the correctness of object identification reached 73.33%.

  • Correct sequence ordering achieved 82.22% correctness, indicating a capacity for LLMs to comprehend and arrange chronological activities effectively.

Output Formats and Analysis

The research further delved into the output formats utilized in UML creation:

  • Hybrid-created diagrams performed best, with an average score of 8.20, showcasing the significant role of human intervention and optimization.
  • PlantUML-based diagrams had a moderate performance (average score 6.94), benefitting from auto-generated code but still requiring manual correction; a minimal example of this format is sketched after this list.
  • Simple wireframe outputs were the least effective, with an average score of 5.5, often lacking the necessary detail and accuracy.

Figure 3: Distribution of the LLMs used in UML modeling tasks.
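
To make the comparison of formats concrete, the sketch below writes out the kind of PlantUML-based class-diagram draft an LLM might produce; the small ordering domain (Customer/Order/OrderItem) is an illustrative assumption, not the case study from the paper. In the hybrid workflow, a human reviewer would edit precisely the relationship lines before rendering the final diagram.

```python
from pathlib import Path

# A small, hypothetical PlantUML class-diagram draft of the kind an LLM might
# produce from requirements text. The class/attribute/operation lines are the
# elements the study reports high correctness for; the relationship lines at
# the end are the part the hybrid workflow typically has to correct by hand.
PLANTUML_DRAFT = """\
@startuml
class Customer {
  - name : String
  + placeOrder()
}
class Order {
  - createdAt : Date
  + addItem()
}
class OrderItem {
  - quantity : int
}

' Relationships: verify kind (association vs. composition) and multiplicity
Customer "1" --> "*" Order : places
Order *-- OrderItem
@enduml
"""

# Write the draft so it can be rendered with the PlantUML toolchain
# (e.g., `plantuml order.puml`) or imported into compatible modeling tools.
Path("order.puml").write_text(PLANTUML_DRAFT)
```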

Discussion

This study highlights that while LLMs are capable of aiding in software modeling, substantial limitations persist. LLMs often struggle with identifying complex relationships, underscoring a need for further enhancement in understanding relational constructs. These findings are pivotal for educators and industry professionals, suggesting that while LLMs serve as useful tools, reliance on them for complete accuracy without human intervention is premature.

Implications for Software Engineering

The implications for software engineering education are profound. LLMs can be integrated as supplementary tools in teaching UML modeling, leveraging their capacity to generate initial drafts of models while requiring critical human oversight. Educators and professionals must focus on training students to collaborate effectively with LLMs, enhancing their understanding while avoiding blind reliance on AI-generated outputs.

Figure 4: Distribution of the languages used in the human-LLM interaction.

Conclusion

The exploratory study demonstrates that LLMs hold potential for assisting novice analysts with UML modeling tasks but still have significant shortcomings in relational analysis and diagram precision. As AI continues to evolve, ongoing research and refinement are essential to transform LLMs into reliable partners in software engineering practice.

Practical Applications

Immediate Applications

The following applications can be deployed today by leveraging the paper’s findings that LLMs reliably extract UML elements from natural language while struggling with relationships, and that hybrid human-in-the-loop workflows and PlantUML-based outputs improve quality.

Industry (Software/IT, product teams, consulting)

  • UML copilot for early requirements modeling
    • Use case: Prompt an LLM with domain text to draft use cases, classes, attributes/operations, and sequence-flow steps; a human modeler validates and finalizes relationships.
    • Workflow/product: a “UML Copilot” plugin for StarUML/Visual Studio Code that:
      • Generates PlantUML code for use case/class/sequence diagrams from requirements text.
      • Flags low-confidence relationships for manual review (focus on generalization/associations); a minimal sketch of this pattern appears after this list.
    • Dependencies/assumptions: High-quality textual requirements; team uses PlantUML or compatible tooling; human reviewer signs off on relationships.
  • Hybrid-created diagrams as a default modeling pattern
    • Use case: Adopt the paper’s best-performing workflow—LLM textual suggestions → human refinement → diagramming in a tool (StarUML, PlantUML).
    • Tools: Prompt templates for element extraction; pre-commit checklist for verifying relationships (inheritance, association, aggregation, composition).
    • Dependencies: Basic UML skills among staff; availability of an approved LLM (e.g., GPT-4 or an enterprise/private LLM).
  • Rapid prototyping for sequence behaviors
    • Use case: Generate initial sequence diagrams from user stories to facilitate design discussions, test planning, and stakeholder demos (LLMs were strongest on sequence diagram criteria).
    • Tools: “Sequence Assistant” that turns user stories into PlantUML sequence diagrams.
    • Assumptions: Stable user story format; acceptance that messages may need refinement.
  • Model quality gate in CI/CD
    • Use case: Add an automated step that uses LLMs to review UML artifacts for missing core elements and obvious contradictions, then require human validation of relationships.
    • Tools: “Relationship Validator” script + LLM checker for completeness against a project-specific glossary.
    • Dependencies: Modeling artifacts versioned as text (PlantUML/Mermaid); policy allowing LLM use in pipelines.
  • Onboarding aids from models
    • Use case: LLMs summarize existing UML diagrams and generate natural-language walkthroughs for new team members.
    • Assumptions: Non-sensitive models; access controls to prevent leakage.
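
As a concrete illustration of the copilot and quality-gate ideas above, a minimal sketch follows. It assumes a placeholder `call_llm` function standing in for whatever approved LLM client a team uses, asks for PlantUML output, and flags every relationship line for mandatory human review, reflecting the paper's finding that element identification is far more reliable than relationship identification. The prompt wording and the heuristic for spotting relationship lines are assumptions for illustration.

```python
import re

# Prompt template (illustrative wording): elements first, relationships last,
# so the error-prone relationship lines are easy to isolate for review.
ELEMENT_PROMPT = (
    "From the requirements below, produce a PlantUML class diagram. "
    "List every class with its attributes and operations first, then put all "
    "relationship lines (association, aggregation, composition, generalization) "
    "on separate lines at the end.\n\nRequirements:\n{requirements}"
)

# Heuristic: PlantUML relationship lines contain an arrow such as -->, --,
# *--, o--, <|--, or ..>; comment lines (starting with ') are skipped below.
RELATIONSHIP_ARROW = re.compile(r"<\|--|--\|>|\*--|o--|-->|\.\.>|--")

def call_llm(prompt: str) -> str:
    """Placeholder for the team's approved LLM client (e.g., an enterprise
    ChatGPT deployment). Replace with a real API call."""
    raise NotImplementedError

def draft_class_diagram(requirements: str) -> tuple[str, list[str]]:
    """Return the LLM-drafted PlantUML text plus the relationship lines that
    must be signed off by a human modeler before the diagram is accepted."""
    plantuml = call_llm(ELEMENT_PROMPT.format(requirements=requirements))
    needs_review = [
        line for line in plantuml.splitlines()
        if RELATIONSHIP_ARROW.search(line) and not line.strip().startswith("'")
    ]
    return plantuml, needs_review
```

A CI step can fail the build when `needs_review` is non-empty and no reviewer sign-off has been recorded, which is the “model quality gate” pattern described above.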

Academia (Education, training, curriculum)

  • LLM-assisted modeling exercises and formative feedback
    • Use case: Assignments where students use LLMs to generate UML drafts and then improve them, guided by the paper’s rubric (elements vs. relationships).
    • Tools: “ReqModel Coach” for rubric-aligned feedback (actors/use cases/classes/attributes/operations/messages/order); a minimal rubric sketch appears after this list.
    • Dependencies: Clear academic integrity guidelines; curated prompts; instructor-provided evaluation criteria.
  • Comparative labs on output formats
    • Use case: Students compare Simple Wireframe, PlantUML-based, and Hybrid-created outputs to observe quality differences; learn why hybrid wins.
    • Assumptions: Access to StarUML/PlantUML; reproducible prompts.
  • Prompt engineering as a modeling skill
    • Use case: Teach prompt patterns that separate “element extraction” from “relationship validation,” reflecting LLM strengths/weaknesses.
    • Dependencies: Up-to-date LLM access; example corpora.
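
As a sketch of how rubric-aligned feedback could be structured, assuming an instructor-provided reference solution: the criterion names below mirror the paper's evaluation dimensions (elements versus relationships per model type), while the data structures and scoring helper are illustrative, not tooling from the study.

```python
from dataclasses import dataclass, field

# Criteria mirror the paper's evaluation dimensions across the three models:
# use case model, class diagram, and sequence diagram.
CRITERIA = [
    "actors", "use_cases", "use_case_relationships",
    "classes", "attributes", "operations", "class_relationships",
    "objects", "messages", "message_order",
]

@dataclass
class Submission:
    """Items identified per criterion, e.g. {"classes": {"Order", "Customer"}}."""
    items: dict[str, set[str]] = field(default_factory=dict)

def score(student: Submission, reference: Submission) -> dict[str, float]:
    """Per-criterion correctness: the fraction of reference items found.
    The relationship criteria are where LLM drafts (and novices) lose points."""
    result = {}
    for criterion in CRITERIA:
        expected = reference.items.get(criterion, set())
        found = student.items.get(criterion, set())
        result[criterion] = len(found & expected) / len(expected) if expected else 1.0
    return result
```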

Policy and Governance (Org-level SDLC policy, compliance)

  • Responsible-use guidelines for AI-assisted modeling
    • Policy: Mandate human review for relationships; restrict sharing of sensitive requirements; log prompts/outputs as design artifacts (a minimal logging sketch follows this list).
    • Tools: Lightweight “AI-in-the-loop” SOPs and checklists tied to modeling milestones.
    • Dependencies: Organizational risk assessment; legal/privacy input; auditable LLM usage.
  • Procurement and vendor RFP modeling support
    • Use case: Teams use LLMs to rapidly produce standardized UML views for vendor briefings; vendors respond with LLM-assisted diagrams reviewed by humans.
    • Assumptions: Contracts clarify AI use and IP; agreed modeling standards.
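
One way to make “log prompts/outputs as design artifacts” operational is an append-only JSON-lines audit log recording each interaction together with the reviewer sign-off; the field names and file location below are illustrative assumptions, not requirements stated in the paper.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path

LOG_FILE = Path("ai_modeling_audit.jsonl")  # illustrative location

def log_interaction(prompt: str, output: str, reviewer: str, relationships_reviewed: bool) -> None:
    """Append one auditable record per LLM interaction used in modeling.
    Content hashes let an auditor verify that logged artifacts were not altered."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "output_sha256": hashlib.sha256(output.encode("utf-8")).hexdigest(),
        "prompt": prompt,
        "output": output,
        "reviewer": reviewer,
        "relationships_reviewed": relationships_reviewed,
    }
    with LOG_FILE.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```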

Daily Life and Individual Practitioners (Students, indie developers)

  • Quick-start UML for side projects
    • Use case: Generate initial use case/class/sequence diagrams from README or feature lists; refine manually.
    • Tools: Prompt templates + PlantUML snippets.
    • Dependencies: Basic UML literacy; acceptance of iteration.
  • Study aid for understanding modeling
    • Use case: Students paste a case study and get suggested UML plus explanations; compare with course solutions.
    • Assumptions: Non-plagiarized usage; feedback from instructors.

Long-Term Applications

These rely on further research, better models (especially relationship reasoning), standard datasets, tighter tool integration, and policy maturation.

Industry (Software/IT, regulated sectors: healthcare, finance, energy)

  • End-to-end modeling assistant integrated into ALM/IDE
    • Vision: Multi-modal LLMs that co-create, validate, and repair UML with strong relationship reasoning; continuous synchronization between requirements, models, and code.
    • Tools/products: “Requirements-to-Model-to-Code” assistants in Jira/Azure DevOps/IntelliJ; model repair via constraint solvers (e.g., OCL) guided by LLM.
    • Dependencies: Fine-tuning on large UML corpora; formal constraints; reliable on-prem LLMs.
  • Domain-aware modeling copilots
    • Vision: Sector-specific ontologies (HL7/FHIR for healthcare, FIX/ISO 20022 for finance) boost relationship accuracy and traceability.
    • Tools: “Healthcare Model Copilot,” “Finance Model Copilot” with pre-trained vocabularies and compliance patterns.
    • Dependencies: Curated, licensed domain datasets; governance for safety and bias.
  • Continuous model-code traceability and conformance
    • Vision: LLMs maintain bidirectional links between requirements, UML, and code/tests; detect drift and propose fixes.
    • Tools: “TraceGuard” service monitoring repositories; automatic sequence diagrams from runtime traces reconciled with design.
    • Dependencies: Stable trace frameworks; organization-wide modeling discipline.

Academia (Education research, curriculum reform)

  • Benchmarks and shared datasets for AI-in-modeling
    • Vision: Public corpora of annotated UML and requirements; standardized metrics beyond binary scoring (continuous rubric scores).
    • Tools: Open evaluation suites; leaderboards for “relationship extraction” and “model conformance.”
    • Dependencies: Community curation; privacy-safe data; sponsorship.
  • Competency-based curricula integrating AI modeling literacy
    • Vision: Programs that formally teach “AI-assisted modeling” competencies, including risk management, verification, and human-in-the-loop design.
    • Dependencies: Accreditation alignment; faculty development.

Policy and Standards (Standards bodies, regulators, enterprise governance)

  • Standards for AI-assisted modeling artifacts and audits
    • Vision: ISO/OMG guidance on provenance, review requirements, and auditability for AI-generated UML in safety/finance-critical systems.
    • Tools: “AI Modeling Audit Pack” templates embedded in QMS.
    • Dependencies: Multi-stakeholder consensus; regulator participation.
  • Compliance automation for AI-in-the-loop modeling
    • Vision: Automated evidence generation that shows human review of relationships and conformance to modeling standards during audits.
    • Dependencies: Tool interoperability; tamper-evident logs.

Cross-cutting Tools and Methods

  • Relationship-first modeling engines
    • Vision: New LLM prompting and model-checking pipelines that explicitly reason about inheritance, associations, aggregations, and compositions, with constraint-based verification (e.g., OCL); a lightweight structural precursor is sketched after this list.
    • Products: “Relationship Reviewer Pro” microservice integrated with modeling environments.
    • Dependencies: Improved LLM reasoning; formal rule sets.
  • Privacy-preserving, on-prem LLMs for modeling
    • Vision: Secure deployment of LLMs behind the firewall to process sensitive requirements/models.
    • Dependencies: Enterprise-grade LLM stacks; model governance.
  • Multilingual modeling support
    • Vision: Comparable accuracy across languages; robust performance on non-English requirements.
    • Dependencies: Multilingual fine-tuning; localized datasets.
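
A lightweight precursor to the relationship-first idea above, sketched here as an assumption rather than a description of existing tooling: before any constraint solving, a structural check can parse a PlantUML draft and report relationships whose endpoints are not declared classes, one common symptom of the relationship errors the study observed. The regular expressions are a simplification, not a full PlantUML grammar.

```python
import re

# Declared types: class/interface/enum declarations, optionally abstract.
CLASS_DECL = re.compile(r"^\s*(?:abstract\s+)?(?:class|interface|enum)\s+(\w+)", re.MULTILINE)

# Simplified relationship pattern: "<Left> [multiplicity] <arrow> [multiplicity] <Right>".
RELATIONSHIP = re.compile(
    r'^\s*(\w+)\s*(?:"[^"]*"\s*)?'              # left endpoint, optional multiplicity
    r'(?:<\|--|--\|>|\*--|o--|-->|\.\.>|--)'    # relationship arrow
    r'\s*(?:"[^"]*"\s*)?(\w+)',                 # optional multiplicity, right endpoint
    re.MULTILINE,
)

def undeclared_endpoints(plantuml: str) -> list[tuple[str, str]]:
    """Return relationships whose endpoints do not match any declared class."""
    declared = set(CLASS_DECL.findall(plantuml))
    return [
        (left, right)
        for left, right in RELATIONSHIP.findall(plantuml)
        if left not in declared or right not in declared
    ]
```

Mismatches found this way can be fed back to the LLM as a repair prompt or escalated to a human reviewer, the loop that the long-term tooling above would automate.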

Notes on feasibility and assumptions across applications:

  • Current LLMs are strong at extracting elements but weaker on relationship accuracy; human-in-the-loop remains essential.
  • Output format matters: hybrid workflows and PlantUML-based outputs outperform raw wireframes.
  • Results were derived from novice analysts; expert performance and different domains may shift outcomes.
  • Data privacy, IP, and compliance constraints may limit prompt content; on-prem or privacy-preserving LLMs may be required.
  • Variance in LLM outputs implies the need for reproducible prompts, logs, and review checkpoints.
