Practical Program Repair in the Era of Large Pre-trained Language Models (2210.14179v2)

Published 25 Oct 2022 in cs.SE

Abstract: Automated Program Repair (APR) aims to help developers automatically patch software bugs. However, current state-of-the-art traditional and learning-based APR techniques face the problem of limited patch variety, failing to fix complicated bugs. This is mainly due to the reliance on bug-fixing datasets to craft fix templates or directly predict potential patches. Large Pre-Trained Language Models (PLMs), trained using billions of text/code tokens, can potentially help avoid this issue. Very recently, researchers have directly leveraged PLMs for APR without relying on any bug-fixing datasets. Meanwhile, such existing work either failed to include state-of-the-art PLMs or was not evaluated on realistic datasets. In this work, we perform the first extensive study on directly applying PLMs for APR. We select 9 recent state-of-the-art PLMs, including both generative and infilling models, ranging from 125M to 20B in size. We designed 3 different repair settings to evaluate the different ways we can use PLMs to generate patches. We apply the PLMs under these repair settings on 5 datasets across 3 different languages and compare different PLMs in the number of bugs fixed, generation speed and compilation rate. Our study demonstrates that directly applying state-of-the-art PLMs can already substantially outperform all existing APR techniques on all our datasets. Among the studied PLMs, the scaling effect exists for APR where larger models tend to achieve better performance. Also, we show for the first time that suffix code after the buggy line (adopted in infilling-style APR) is important in not only generating more fixes but more patches with higher compilation rate. Besides patch generation, the PLMs consider correct patches to be more natural than other ones, and can even be leveraged for effective patch ranking or patch correctness checking.

Citations (154)

Summary

The paper evaluates nine state-of-the-art Large Pre-trained Language Models (PLMs) for automated program repair (APR) and finds they significantly outperform existing techniques.
Evaluation across five datasets and three repair settings shows that larger PLMs perform better and incorporating suffix information improves patch generation.
PLMs fixed significantly more bugs than traditional tools (e.g., Codex fixed 32 more on Defects4J 1.2) and can facilitate patch ranking using metrics like entropy.

Practical Program Repair in the Era of Large Pre-trained Language Models

Automated Program Repair (APR) aims to reduce the burden on developers by automatically patching software bugs. This paper explores the application of Large Pre-trained Language Models (PLMs) to APR. It argues that traditional and learning-based techniques suffer from limited patch variety because they rely on bug-fixing datasets, either to craft fix templates or to directly predict patches, and that this limits their ability to fix complicated bugs. PLMs offer an alternative: trained on billions of text and code tokens, they can be applied to repair directly, without any bug-fixing dataset.

The paper evaluates nine state-of-the-art PLMs ranging from 125M to 20B parameters, including both generative and infilling models, on five datasets across three programming languages. The evaluation covers three repair settings: complete function generation, infilling a code chunk between a given prefix and suffix, and single-line generation. Models are compared on the number of bugs fixed, generation speed, and compilation rate. The results show a scaling effect, with larger models tending to perform better. Supplying the suffix code after the buggy line, as infilling-style repair does, is found to produce both more fixes and patches with a higher compilation rate.
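To make the difference between the three repair settings concrete, here is a minimal sketch of how the model input might be assembled for a single-line bug. The `MASK` placeholder, the prompt wording, and the `build_repair_inputs` helper are all illustrative assumptions and are not taken from the paper; real infilling models each define their own mask tokens and prompt formats.

```python
from typing import Dict, List

# Hypothetical placeholder; each real infilling model defines its own mask token.
MASK = "<INFILL>"


def build_repair_inputs(function_lines: List[str], buggy_index: int) -> Dict[str, str]:
    """Return one model input per repair setting for a single-line bug."""
    if not 0 <= buggy_index < len(function_lines):
        raise IndexError("buggy_index must point at a line of the function")

    prefix = function_lines[:buggy_index]
    suffix = function_lines[buggy_index + 1:]

    # Complete function generation: the model sees the whole buggy function
    # and is asked to write a fixed version (this prompt wording is assumed).
    complete = "\n".join(["# Buggy function", *function_lines, "# Fixed function", ""])

    # Infilling: the buggy line is replaced by a mask, so the model sees
    # both the code before it and the code after it.
    infill = "\n".join([*prefix, MASK, *suffix])

    # Single-line generation: only the prefix is given and the model
    # continues with the next line, so the suffix is never seen.
    single_line = "".join(line + "\n" for line in prefix)

    return {
        "complete_function": complete,
        "infilling": infill,
        "single_line": single_line,
    }
```

Only the infilling input carries the code after the buggy line, which is the suffix information the study reports as helpful.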

A key finding is that directly applied PLMs outperform existing APR techniques on all of the studied datasets in the number of bugs fixed. For instance, Codex, one of the PLMs studied, fixed 32 more bugs than established tools on the Defects4J 1.2 dataset. PLMs also treat correct patches as more "natural" than other patches, so entropy computed by the model can be used for patch ranking and patch correctness checking. The paper further indicates that PLM-based APR can be improved by increasing the number of samples and by incorporating fix template information.
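The entropy-based ranking idea can be sketched as follows. This assumes that per-token log-probabilities are available for every candidate patch, and it takes mean entropy to be the negative average token log-probability; that definition, the function names, and the input layout are illustrative assumptions rather than the paper's implementation.

```python
from typing import Dict, List


def mean_entropy(token_logprobs: List[float]) -> float:
    """Negative average log-probability of a patch's tokens (lower = more natural)."""
    if not token_logprobs:
        raise ValueError("a patch needs at least one token")
    return -sum(token_logprobs) / len(token_logprobs)


def rank_patches(candidates: Dict[str, List[float]]) -> List[str]:
    """Order candidate patches from most to least natural.

    `candidates` maps patch text to the log-probabilities the model
    assigned to each generated token (assumed to be supplied by the caller).
    """
    return sorted(candidates, key=lambda patch: mean_entropy(candidates[patch]))
```

Averaging over tokens keeps long and short patches comparable; summing instead would penalize longer patches regardless of how likely each token is.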

The implications are both practical and theoretical. Practically, PLMs could reduce the effort and time developers spend on bug fixing, which may lead to more reliable software. Theoretically, the findings encourage a reevaluation of approaches trained on bug-fixing datasets in favor of models trained on more general, diverse data, and may spur advances in other code-related tasks.

Future developments may focus on refining the capabilities of PLMs, exploring hybrid approaches that combine traditional heuristics with large-scale language models, and extending APR to more programming languages and ecosystems. Addressing the data leakage issues associated with code models and ensuring ethical usage also remain priorities for further investigation.

In summary, the application of PLMs in APR not only promises improvement in the patch generation process but also encourages broader adoption of LLMs in software engineering problems. Such advancements will likely catalyze new research avenues and drive innovations in automated programming solutions.
