
DebugBench: Evaluating Debugging Capability of Large Language Models (2401.04621v3)

Published 9 Jan 2024 in cs.SE, cs.AI, and cs.CL

Abstract: LLMs have demonstrated exceptional coding capability. However, as another critical component of programming proficiency, the debugging capability of LLMs remains relatively unexplored. Previous evaluations of LLMs' debugging ability are significantly limited by the risk of data leakage, the scale of the dataset, and the variety of tested bugs. To overcome these deficiencies, we introduce `DebugBench', an LLM debugging benchmark consisting of 4,253 instances. It covers four major bug categories and 18 minor types in C++, Java, and Python. To construct DebugBench, we collect code snippets from the LeetCode community, implant bugs into source data with GPT-4, and assure rigorous quality checks. We evaluate two commercial and four open-source models in a zero-shot scenario. We find that (1) while closed-source models exhibit inferior debugging performance compared to humans, open-source models deliver relatively lower pass rate scores; (2) the complexity of debugging notably fluctuates depending on the bug category; (3) incorporating runtime feedback has a clear impact on debugging performance which is not always helpful. As an extension, we also compare LLM debugging and code generation, revealing a strong correlation between them for closed-source models. These findings will benefit the development of LLMs in debugging.

Summary

  • The paper presents DebugBench, a novel benchmark evaluating LLM debugging using 4,253 diverse bug instances across C++, Java, and Python.
  • The paper details a three-phase methodology—data collection, GPT-4 driven bug implantation, and strict quality control—to ensure comprehensive evaluation.
  • The paper reveals that closed-source models, especially GPT-4, outperform open-source counterparts, yet exhibit notable gaps compared to human debugging skills.

DebugBench: Evaluating Debugging Capability of LLMs

Introduction

The paper "DebugBench: Evaluating Debugging Capability of LLMs" (2401.04621) presents a new benchmark named DebugBench for assessing the debugging capabilities of LLMs. This benchmark is directed towards evaluating models in a lesser-explored domain of debugging, addressing the limitations of previous evaluations such as data leakage risks, limited dataset scale, and insufficient bug variety. DebugBench consists of 4,253 instances covering four major and eighteen minor bug categories across C++, Java, and Python programming languages.

Construction of DebugBench

The construction of DebugBench involves three main phases: source data collection, bug implantation, and quality control.

Source Data Collection:

The benchmark draws from the LeetCode community, focusing on code snippets released after July 2022 to prevent data leakage from pre-training datasets of LLMs. This ensures that models' performance on DebugBench reflects genuine debugging skill rather than memorization (Figure 1).

Figure 1: This figure illustrates the construction of DebugBench. We first collect code snippets from LeetCode, then employ GPT-4 for bug implantation and finally conduct human/LLM evaluation on the benchmark.
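
As an illustration of how such a cutoff can be enforced in practice, the sketch below filters scraped snippets by release date. The record fields and the exact cutoff date are assumptions chosen for illustration, not the paper's actual pipeline.

```python
from datetime import date

# Hypothetical scraped records; the field names here are illustrative only.
snippets = [
    {"problem": "two-sum-variant", "language": "python3",
     "released": date(2023, 3, 14), "code": "..."},
    {"problem": "two-sum", "language": "cpp",
     "released": date(2015, 8, 1), "code": "..."},
]

# Keep only problems released after the assumed pre-training cutoff.
CUTOFF = date(2022, 7, 1)

def low_leakage_risk(snippet: dict) -> bool:
    """True if the snippet postdates the cutoff, reducing contamination risk."""
    return snippet["released"] > CUTOFF

filtered = [s for s in snippets if low_leakage_risk(s)]
print(f"Kept {len(filtered)} of {len(snippets)} snippets")
```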

Bug Implantation:

Bugs are implanted using GPT-4, following a taxonomy based on Barr's classification criteria, comprising Syntax, Reference, Logic, and Multiple errors. This synthetic approach provides control over error diversity and mitigates data exposure concerns associated with traditional datasets like Defects4J.
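
The paper's exact implantation prompts are not reproduced here; the sketch below shows one way bug implantation with GPT-4 could be scripted, assuming the OpenAI Python client (`openai>=1.0`). The prompt wording and the bug-type labels are illustrative stand-ins for the paper's 18-type taxonomy.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Minor bug types grouped under the major categories described above;
# these labels are illustrative, not the paper's exact taxonomy.
BUG_TYPES = {
    "syntax": ["missing colons", "unclosed parentheses"],
    "reference": ["undefined variables", "wrong method names"],
    "logic": ["off-by-one conditions", "wrong operators"],
}

def implant_bug(code: str, language: str, major: str, minor: str) -> str:
    """Ask the model to insert exactly one bug of the requested type
    (prompt wording is illustrative, not the paper's template)."""
    prompt = (
        f"Insert exactly one {minor} bug ({major} category) into the "
        f"following {language} code.\n"
        "Return only the modified code, with no explanation.\n\n"
        f"{code}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content
```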

Quality Control:

A combination of automatic filtering and manual inspection ensures benchmark integrity. Automatic filtering assesses test suite performance and data leakage risk, while manual inspection verifies bug validity, security, and alignment with real-world scenarios.
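
A minimal sketch of the automatic-filtering idea: an instance is kept only if the original solution passes the problem's test suite and the implanted-bug version fails it. The `run_tests` callable is a hypothetical stand-in for executing a snippet against a test suite; the paper's filtering additionally checks leakage risk, with manual inspection applied on top.

```python
from typing import Callable

def keep_instance(original_code: str, buggy_code: str,
                  run_tests: Callable[[str], bool]) -> bool:
    """Automatic filter for a candidate benchmark instance.

    Keep the instance only if the original solution passes its test suite
    and the bug-implanted version fails it; otherwise the implantation is
    either invalid or the 'bug' is not observable.
    """
    original_passes = run_tests(original_code)
    buggy_fails = not run_tests(buggy_code)
    return original_passes and buggy_fails
```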

Evaluation and Results

DebugBench is used to evaluate closed-source models (GPT-4, GPT-3.5) and open-source models including BLOOM, CodeLlama-34b, and CodeLlama-34b-Instruct under zero-shot conditions on the debugging tasks. The evaluation yields several distinct findings:

Closed-Source Models:

Closed-source models exhibit superior debugging capabilities compared to open-source models but still fall below human proficiency in some bug types. GPT-4 achieves the highest pass rate, notably outperforming open-source models by a considerable margin (Figure 2).

Figure 2: Pass Rate of GPT-4 vs. alternative models in debugging tasks.
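
The pass rate underlying these comparisons is, roughly, the fraction of buggy instances for which the model's repaired code passes the full test suite. A minimal sketch, with `debug_with_model` and `run_tests` as hypothetical helpers rather than the paper's actual harness:

```python
def pass_rate(instances, debug_with_model, run_tests) -> float:
    """Fraction of instances whose model-repaired code passes all tests.

    `instances` is an iterable of dicts with a 'buggy_code' field;
    `debug_with_model(code)` returns the model's repaired code and
    `run_tests(code)` returns True if the full test suite passes.
    Both helpers are hypothetical stand-ins.
    """
    instances = list(instances)
    if not instances:
        return 0.0
    passed = sum(run_tests(debug_with_model(i["buggy_code"])) for i in instances)
    return passed / len(instances)
```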

Open-Source Models:

The open-source models underperformed, achieving pass rates at or near zero. This highlights their current limitations in debugging, which the paper attributes to inadequate exposure to debugging-specific training data.

Bug Complexity:

The complexity of debugging varies with the bug type. Syntax and Reference errors are generally easier to address, while Logical and Multiple errors pose more significant challenges, requiring deeper comprehension and analysis capabilities.
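
To make the categories concrete, the toy Python snippets below show one bug of each major single-bug category (a Multiple-error instance combines several of them). These examples are illustrative and are not drawn from the benchmark itself.

```python
# Syntax error: the function header is missing its colon.
# def add(a, b)            # buggy
def add(a, b):             # fixed
    return a + b

# Reference error: the code calls a name that is never defined.
def total(xs):
    # return summ(xs)      # buggy: `summ` does not exist
    return sum(xs)         # fixed

# Logic error: an off-by-one loop bound silently drops the final factor.
def factorial(n):
    result = 1
    # for i in range(1, n):        # buggy
    for i in range(1, n + 1):      # fixed
        result *= i
    return result
```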

In-depth Analysis

Impact of Multiple Sampling and Runtime Feedback:

Allowing models to generate multiple responses improves performance, illustrating a trade-off between inference token usage and debugging effectiveness. Similarly, providing runtime feedback enhances performance for Syntax and Reference errors but is less effective for Logical errors, where the feedback granularity is not sufficiently informative (Figure 3).

Figure 3: Effect of runtime feedback on debugging performance, showing improvement in Syntax and Reference error handling.
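
One way to operationalize runtime feedback is to execute the buggy program, capture the interpreter's error output (e.g., a traceback or failing-test message), and append it to the debugging prompt. A minimal sketch, assuming a hypothetical `ask_model` helper; the prompt format is illustrative and not the paper's template.

```python
import subprocess
import sys

def capture_runtime_feedback(path: str, timeout: int = 10) -> str:
    """Run a Python file and return its stderr (e.g., a traceback) as feedback."""
    proc = subprocess.run([sys.executable, path],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stderr.strip()

def debug_with_feedback(buggy_code: str, feedback: str, ask_model) -> str:
    """Append runtime feedback to the debugging prompt (format is illustrative)."""
    prompt = (
        "Fix the bug in the following code.\n\n"
        f"{buggy_code}\n\n"
        f"Runtime feedback:\n{feedback}\n\n"
        "Return only the corrected code."
    )
    return ask_model(prompt)
```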

Correlation with Code Generation:

There is a strong correlation between debugging and code-generation performance for closed-source models. While repairing Syntax and Reference errors is easier than generating correct code from scratch, Logical and Multiple errors are roughly as challenging as full code generation (Figure 4).

Figure 4: Pass Rate comparison of coding vs. debugging tasks with the same problems.
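
The reported correlation can be quantified by pairing, for each bug category or model, the code-generation pass rate with the corresponding debugging pass rate and computing Pearson's r. A minimal sketch; the inputs are the measured rates, which are not supplied here.

```python
from statistics import correlation  # Pearson's r; requires Python 3.10+

def generation_debugging_correlation(gen_pass_rates, debug_pass_rates) -> float:
    """Pearson correlation between paired pass rates, e.g. one pair per
    bug category or per model. Callers supply the measured values."""
    return correlation(gen_pass_rates, debug_pass_rates)
```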

Conclusion

DebugBench provides a comprehensive framework for evaluating the debugging capabilities of LLMs, revealing significant gaps between current model performance and human capabilities. Future developments could focus on expanding debugging scenarios to include real-world and interactive environments, as well as enhancing open-source models with more targeted datasets. DebugBench represents a crucial step in understanding and advancing LLM capabilities in debugging, suggesting directions for further research and application improvements.
