When LLM-based Code Generation Meets the Software Development Process (2403.15852v1)
Abstract: Software process models play a pivotal role in fostering collaboration and communication within software teams, enabling them to tackle intricate development tasks effectively. This paper introduces LCG, a code generation framework inspired by established software engineering practices. LCG leverages multiple LLM agents to emulate various software process models, namely LCG_Waterfall, LCG_TDD, and LCG_Scrum. Each model assigns LLM agents specific roles such as requirement engineer, architect, developer, tester, and scrum master, mirroring typical development activities and communication patterns. Collaborating through chain-of-thought and prompt composition techniques, the agents iteratively refine the generated code to enhance its quality. Using GPT-3.5 as both the underlying LLM and the baseline (GPT), we evaluate LCG across four code generation benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Results indicate that LCG_Scrum outperforms the other models, achieving Pass@1 scores of 75.2, 65.5, 82.5, and 56.7 on HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively, an average 15% improvement over the GPT baseline. Analysis reveals that individual development activities affect the generated code in distinct ways: design and code review contribute to better exception handling, while design, testing, and code review all mitigate code smells. Furthermore, temperature values exhibit negligible influence on Pass@1 across all models. In contrast, the baseline's Pass@1 varies substantially across GPT-3.5 model versions, ranging from 5 to over 60 on HumanEval, whereas LCG remains stable. This stability underscores the importance of adopting software process models to bolster the quality and consistency of LLM-generated code.
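To make the agent workflow concrete, below is a minimal sketch of a Scrum-style role pipeline in the spirit of the abstract. The `llm` stub, the role prompts, and the `run_pipeline` helper are hypothetical illustrations of the role-assignment and prompt-composition ideas, not the paper's actual prompts or implementation.

```python
# Minimal sketch of a Scrum-style multi-agent code-generation loop.
# `llm`, the role prompts, and `run_pipeline` are illustrative stand-ins,
# not the paper's actual prompts or implementation.

def llm(prompt: str) -> str:
    """Stand-in for a chat-completion call to the underlying LLM (e.g., GPT-3.5)."""
    return f"<completion for: {prompt[:40]}...>"

ROLES = {
    "requirement engineer": "Restate the task as precise requirements:\n{task}",
    "architect": "Propose a design for the requirements so far:\n{context}",
    "developer": "Think step by step, then write Python code for:\n{context}",
    "tester": "Write unit tests and report any failures for:\n{context}",
}

def run_pipeline(task: str, sprints: int = 2) -> str:
    """Each sprint, every role extends a shared context (prompt composition)."""
    context = task
    for _ in range(sprints):  # the scrum-master role: iterate and hand off work
        for role, template in ROLES.items():
            output = llm(template.format(task=task, context=context))
            context = f"{context}\n\n[{role}]\n{output}"
    return context  # accumulated context holds the refined code and tests

print(run_pipeline("Check whether a string is a palindrome."))
```

Accumulating each role's output into a shared context is one simple way to mimic the communication patterns the abstract describes; the paper's actual message routing between agents may differ.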
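For reference, the Pass@1 scores reported above follow the standard unbiased pass@k estimator introduced by Chen et al. (2021); here is a small, self-contained implementation of that estimator (not taken from the paper's code):

```python
# Unbiased pass@k estimator (Chen et al., 2021): the expected probability
# that at least one of k samples drawn from n generations passes all tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated per problem, c = samples that pass all tests."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# For k=1 the estimator reduces to c/n, the fraction of passing samples.
assert pass_at_k(n=1, c=1, k=1) == 1.0
assert pass_at_k(n=1, c=0, k=1) == 0.0
assert abs(pass_at_k(n=10, c=3, k=1) - 0.3) < 1e-12
```

A benchmark-level Pass@1, such as the 75.2 on HumanEval cited above, is then the mean of this per-problem estimate over all problems in the benchmark.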