Prompting Large Language Models to Tackle the Full Software Development Lifecycle: A Case Study (2403.08604v3)

Published 13 Mar 2024 in cs.CL and cs.SE

Abstract: Recent advancements in LLMs have significantly enhanced their coding capabilities. However, existing benchmarks have predominantly focused on simplified or isolated aspects of coding, such as single-file code generation or repository issue debugging, falling short of measuring the full spectrum of challenges raised by real-world programming activities. In this case study, we explore the performance of LLMs across the entire software development lifecycle with DevEval, spanning software design, environment setup, implementation, acceptance testing, and unit testing. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented within DevEval. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
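The paper's evaluation harness is not reproduced on this page. As a rough illustration of the lifecycle-wide, per-stage scoring the abstract describes, here is a minimal Python sketch of a stage-by-stage evaluation loop. Every name in it (Stage, Task, query_model, check_output, evaluate) is a hypothetical placeholder, not DevEval's actual API.

```python
# Minimal, hypothetical sketch of a DevEval-style evaluation loop.
# None of these names come from the paper; they only illustrate scoring
# an LLM stage by stage across the lifecycle the abstract lists.
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    DESIGN = "software design"
    ENV_SETUP = "environment setup"
    IMPLEMENTATION = "implementation"
    ACCEPTANCE_TEST = "acceptance testing"
    UNIT_TEST = "unit testing"


@dataclass
class Task:
    repo: str          # source repository for the task
    language: str      # DevEval covers four programming languages
    prompts: dict      # Stage -> prompt text


def query_model(prompt: str) -> str:
    """Placeholder for an LLM call; wire up any chat-completion API here."""
    raise NotImplementedError


def check_output(stage: Stage, task: Task, output: str) -> bool:
    """Placeholder for the per-stage verified metric the abstract mentions,
    e.g. building the environment or executing generated tests."""
    raise NotImplementedError


def evaluate(tasks: list) -> dict:
    """Return the fraction of tasks passing each lifecycle stage."""
    passed = {stage: 0 for stage in Stage}
    for task in tasks:
        for stage in Stage:
            output = query_model(task.prompts[stage])
            if check_output(stage, task, output):
                passed[stage] += 1
    total = max(len(tasks), 1)
    return {stage.value: count / total for stage, count in passed.items()}
```

A real harness would likely thread earlier-stage outputs into later prompts (e.g., the generated design informing implementation); this flat loop omits that dependency for brevity.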

