Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step (2402.16906v6)
Abstract: LLMs are driving significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs. However, these works treat a generated program as an indivisible entity, which leaves LLMs ill-equipped to debug it, especially when the program contains complex logic flows and data operations. In contrast, when human developers debug programs, they typically set breakpoints and selectively examine runtime execution information. The execution flow and the intermediate variables play a crucial role in the debugging process, yet they are underutilized in the existing literature on code generation. In this study, we introduce the LLM Debugger (LDB), a novel debugging framework that enables LLMs to refine their generated programs with runtime execution information. Specifically, LDB segments a program into basic blocks and tracks the values of intermediate variables after each block throughout execution. This allows LLMs to concentrate on simpler code units within the overall execution flow, verify their correctness against the task description block by block, and efficiently pinpoint any potential errors. Experiments demonstrate that LDB consistently enhances baseline performance by up to 9.8% across the HumanEval, MBPP, and TransCoder benchmarks, achieving new state-of-the-art performance in code debugging for various LLM backbones.
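To make the core idea concrete, below is a minimal sketch of runtime variable tracking using Python's standard `sys.settrace` hook. It is not the authors' implementation: it records local-variable snapshots at line granularity rather than true basic blocks, and the `trace_variables` and `running_max` names are hypothetical, introduced only for illustration.

```python
import sys
from typing import Any, Callable

def trace_variables(func: Callable, *args: Any) -> list[dict]:
    """Run `func` and snapshot its local variables after every
    executed line (a line-level stand-in for LDB's block-level
    tracking of intermediate states)."""
    snapshots: list[dict] = []

    def tracer(frame, event, arg):
        # Only record events from the target function's own frame.
        if frame.f_code is func.__code__ and event == "line":
            snapshots.append({"line": frame.f_lineno,
                              "locals": dict(frame.f_locals)})
        return tracer  # keep receiving line events for this frame

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    snapshots.append({"return": result})
    return snapshots

# A buggy function whose intermediate states expose the fault.
def running_max(xs):
    best = xs[0]
    for x in xs:
        if x < best:      # bug: should be `x > best`
            best = x
    return best

for snap in trace_variables(running_max, [3, 1, 4, 1, 5]):
    print(snap)
```

In LDB proper, such per-block variable states, together with the task description, are what the LLM inspects to verify each unit and localize the fault; in this sketch, the trace shows `best` shrinking from 3 to 1 instead of growing toward 5, pointing directly at the faulty comparison.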