
Generative AI for Pull Request Descriptions: Adoption, Impact, and Developer Interventions (2402.08967v1)

Published 14 Feb 2024 in cs.SE

Abstract: GitHub's Copilot for Pull Requests (PRs) is a promising service aiming to automate various developer tasks related to PRs, such as generating summaries of changes or providing complete walkthroughs with links to the relevant code. As this innovative technology gains traction in the Open Source Software (OSS) community, it is crucial to examine its early adoption and its impact on the development process. Additionally, it offers a unique opportunity to observe how developers respond when they disagree with the generated content. In our study, we employ a mixed-methods approach, blending quantitative analysis with qualitative insights, to examine 18,256 PRs in which parts of the descriptions were crafted by generative AI. Our findings indicate that: (1) Copilot for PRs, though in its infancy, is seeing a marked uptick in adoption. (2) PRs enhanced by Copilot for PRs require less review time and have a higher likelihood of being merged. (3) Developers using Copilot for PRs often complement the automated descriptions with their manual input. These results offer valuable insights into the growing integration of generative AI in software development.
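The quantitative side of the study hinges on two measurable signals per PR: whether its description contains AI-generated content, and how long the PR took to review before being merged or closed. A minimal sketch of such a measurement is shown below; the marker string `COPILOT_MARKER` and the field names are assumptions for illustration (the paper's actual detection criterion is not given in this abstract), and the timestamps follow the ISO 8601 format used by the GitHub API.

```python
from datetime import datetime, timezone

# Hypothetical marker text: an assumption for illustration only,
# not the detection criterion used in the paper.
COPILOT_MARKER = "Generated by Copilot"

def is_copilot_assisted(description: str) -> bool:
    """Flag a PR whose description contains the (assumed) Copilot marker."""
    return COPILOT_MARKER.lower() in (description or "").lower()

def review_time_hours(created_at: str, closed_at: str) -> float:
    """Review time as the span between PR creation and close/merge,
    with timestamps in the ISO 8601 'Z' form returned by the GitHub API."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime.strptime(created_at, fmt).replace(tzinfo=timezone.utc)
    end = datetime.strptime(closed_at, fmt).replace(tzinfo=timezone.utc)
    return (end - start).total_seconds() / 3600.0

# Toy record: one AI-assisted PR reviewed over five hours.
pr = {
    "body": "This PR refactors the auth module.\n\nGenerated by Copilot",
    "created_at": "2024-02-01T10:00:00Z",
    "closed_at": "2024-02-01T15:00:00Z",
}
print(is_copilot_assisted(pr["body"]))                         # True
print(review_time_hours(pr["created_at"], pr["closed_at"]))    # 5.0
```

Aggregating these two signals over a corpus of PRs is what would support comparisons like the paper's finding (2), that Copilot-assisted PRs see shorter review times and higher merge rates.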
