- The paper presents DebugBench, a novel benchmark evaluating LLM debugging using 4,253 diverse bug instances across C++, Java, and Python.
- The paper details a three-phase methodology of data collection, GPT-4-driven bug implantation, and strict quality control to ensure a comprehensive and reliable evaluation.
- The paper reveals that closed-source models, especially GPT-4, outperform open-source counterparts, yet exhibit notable gaps compared to human debugging skills.
DebugBench: Evaluating Debugging Capability of LLMs
Introduction
The paper "DebugBench: Evaluating Debugging Capability of LLMs" (2401.04621) presents a new benchmark named DebugBench for assessing the debugging capabilities of LLMs. This benchmark is directed towards evaluating models in a lesser-explored domain of debugging, addressing the limitations of previous evaluations such as data leakage risks, limited dataset scale, and insufficient bug variety. DebugBench consists of 4,253 instances covering four major and eighteen minor bug categories across C++, Java, and Python programming languages.
Construction of DebugBench
The construction of DebugBench involves three main phases: source data collection, bug implantation, and quality control.
Source Data Collection:
The benchmark draws on the LeetCode community, using only code snippets released after July 2022 to reduce the risk that they appear in the pre-training data of the evaluated LLMs. This helps ensure that performance on DebugBench reflects genuine debugging skill rather than memorization.
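The release-date cutoff is essentially a filter over problem metadata. A minimal sketch of such a filter, assuming a hypothetical `problems` list whose entries carry a `release_date` field (not the authors' actual collection code):

```python
from datetime import date

# Problems released on or before the cutoff are discarded to limit overlap
# with LLM pre-training corpora.
CUTOFF = date(2022, 7, 1)

def recent_only(problems: list[dict]) -> list[dict]:
    """Keep only problems released after the cutoff date."""
    return [p for p in problems if p["release_date"] > CUTOFF]
```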
Figure 1: This figure illustrates the construction of DebugBench. We first collect code snippets from LeetCode, then employ GPT-4 for bug implantation, and finally conduct human and LLM evaluation on the benchmark.
Bug Implantation:
Bugs are implanted using GPT-4, following a taxonomy based on Barr's classification criteria, comprising Syntax, Reference, Logic, and Multiple errors. This synthetic approach provides control over error diversity and mitigates data exposure concerns associated with traditional datasets like Defects4J.
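The implantation step amounts to prompting GPT-4 with a correct solution and a target bug category. The following is a minimal sketch assuming the OpenAI Python client (openai>=1.0); the prompt wording and parameters are illustrative, not the paper's actual prompts or pipeline:

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()

# Hypothetical prompt template; the paper's real prompts may differ.
IMPLANT_PROMPT = (
    "You are given a correct {language} solution to a programming problem.\n"
    "Insert exactly one '{bug_type}' bug without changing anything else.\n"
    "Return only the modified code.\n\n{code}"
)

def implant_bug(code: str, language: str, bug_type: str) -> str:
    """Ask GPT-4 to implant a bug of the requested category."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": IMPLANT_PROMPT.format(
                       language=language, bug_type=bug_type, code=code)}],
        temperature=1.0,  # encourage diverse implantations
    )
    return response.choices[0].message.content
```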
Quality Control:
A combination of automatic filtering and manual inspection ensures benchmark integrity. Automatic filtering assesses test suite performance and data leakage risk, while manual inspection verifies bug validity, security, and alignment with real-world scenarios.
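The test-based part of the automatic filter reduces to a simple predicate: an implanted instance is kept only if the reference solution passes the full test suite while the buggy version fails it. A minimal sketch, with `run_tests` as a hypothetical helper that executes a solution against the problem's test suite:

```python
from typing import Callable

def keep_instance(original: str, buggy: str,
                  run_tests: Callable[[str], bool]) -> bool:
    """Retain a benchmark instance only if the reference solution passes
    every test and the implanted version fails at least one."""
    return run_tests(original) and not run_tests(buggy)
```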
Evaluation and Results
DebugBench is used to evaluate two closed-source models (GPT-4, GPT-3.5) and three open-source models (BLOOM, CodeLlama-34b, CodeLlama-34b-Instruct) on the debugging task in a zero-shot setting. The evaluation reveals several distinct findings:
Closed-Source Models:
Closed-source models exhibit stronger debugging capabilities than the open-source models but still fall short of human proficiency on some bug types. GPT-4 achieves the highest pass rate, outperforming the open-source models by a wide margin.
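Pass rate here denotes the fraction of benchmark instances for which the model's repaired code passes the instance's full test suite. A minimal sketch of the metric, again assuming a hypothetical per-instance `run_tests` helper:

```python
from typing import Callable, Sequence, Tuple

def pass_rate(results: Sequence[Tuple[str, Callable[[str], bool]]]) -> float:
    """results holds (model_fix, run_tests) pairs, one per instance;
    run_tests executes the fix against that instance's full test suite."""
    if not results:
        return 0.0
    passed = sum(1 for fix, run_tests in results if run_tests(fix))
    return passed / len(results)
```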
Figure 2: Pass Rate of GPT-4 vs. alternative models in debugging tasks.
Open-Source Models:
The open-source models performed poorly, all achieving a pass rate of zero. This highlights their current limitations in debugging, which the evaluation attributes to inadequate exposure to debugging-specific training data.
Bug Complexity:
The difficulty of debugging varies with bug type. Syntax and Reference errors are generally easier to repair, while Logic and Multiple errors pose greater challenges, requiring deeper comprehension of program intent and behavior.
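As a purely illustrative picture of the four major categories (hypothetical examples, not instances drawn from the benchmark), consider a small Python function:

```python
def max_of(nums):          # reference solution: returns the largest element
    best = nums[0]
    for x in nums[1:]:
        if x > best:
            best = x
    return best

# Syntax bug:    "if x > best" without the colon -- the file no longer parses.
# Reference bug: "return bset" -- a misspelled, undefined name.
# Logic bug:     "if x < best:" -- the code runs but returns the minimum.
# Multiple bugs: two or more of the above combined in one snippet.
```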
In-depth Analysis
Impact of Multiple Sampling and Runtime Feedback:
Allowing models to generate multiple candidate fixes improves performance, illustrating a trade-off between inference token usage and debugging effectiveness. Similarly, providing runtime feedback helps with Syntax and Reference errors but is less effective for Logic errors, where the feedback is not granular enough to pinpoint the fault.
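A minimal sketch of how multiple sampling and runtime feedback could be combined; `generate_fix` (an LLM call) and `run_tests_with_trace` (test execution returning pass/fail plus an error trace) are hypothetical helpers, not part of the paper's released code:

```python
from typing import Callable, Optional, Tuple

def debug_with_feedback(buggy_code: str, k: int,
                        generate_fix: Callable[[str, str], str],
                        run_tests_with_trace: Callable[[str], Tuple[bool, str]]
                        ) -> Optional[str]:
    """Sample up to k candidate fixes, feeding the runtime/test trace of each
    failed attempt back into the next generation."""
    feedback = ""
    for _ in range(k):
        candidate = generate_fix(buggy_code, feedback)
        passed, trace = run_tests_with_trace(candidate)
        if passed:
            return candidate      # first fix that passes every test
        feedback = trace          # e.g. a SyntaxError or a failing-test diff
    return None                   # no successful fix within k samples
```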
Figure 3: Effect of runtime feedback on debugging performance, showing improvement in Syntax and Reference error handling.
Correlation with Code Generation:
There is a notable correlation between debugging and code generation performance for the closed-source models. Repairing Syntax and Reference errors is easier than generating the solution from scratch, whereas Logic and Multiple errors are roughly as challenging as full code generation on the same problems.
Figure 4: Pass Rate comparison of coding vs. debugging tasks with the same problems.
Conclusion
DebugBench provides a comprehensive framework for evaluating the debugging capabilities of LLMs, revealing significant gaps between current model performance and human capabilities. Future developments could focus on expanding debugging scenarios to include real-world and interactive environments, as well as enhancing open-source models with more targeted datasets. DebugBench represents a crucial step in understanding and advancing LLM capabilities in debugging, suggesting directions for further research and application improvements.