Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step (2402.16906v6)
Abstract: LLMs are driving significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs. However, these works treat a generated program as an indivisible entity, which leaves LLMs ill-equipped to debug it, especially when the program contains complex logic flows and data operations. In contrast, when human developers debug programs, they typically set breakpoints and selectively examine runtime execution information. The execution flow and the intermediate variables play a crucial role in the debugging process, yet they are underutilized in the existing literature on code generation. In this study, we introduce the LLM Debugger (LDB), a novel debugging framework that enables LLMs to refine their generated programs with runtime execution information. Specifically, LDB segments a program into basic blocks and tracks the values of intermediate variables after each block throughout execution. This allows LLMs to concentrate on simpler code units within the overall execution flow, verify their correctness against the task description block by block, and efficiently pinpoint any potential errors. Experiments demonstrate that LDB consistently enhances baseline performance by up to 9.8% across the HumanEval, MBPP, and TransCoder benchmarks, achieving new state-of-the-art performance in code debugging for various LLM backbones.
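To make the core idea concrete, below is a minimal sketch of runtime variable tracking using Python's standard `sys.settrace` hook. It is not the authors' implementation: it records local-variable snapshots at line granularity rather than true basic blocks, and the `trace_variables` and `running_max` names are hypothetical, introduced only for illustration.

```python
import sys
from typing import Any, Callable

def trace_variables(func: Callable, *args: Any) -> list[dict]:
    """Run `func` and snapshot its local variables after every
    executed line (a line-level stand-in for LDB's block-level
    tracking of intermediate states)."""
    snapshots: list[dict] = []

    def tracer(frame, event, arg):
        # Only record events from the target function's own frame.
        if frame.f_code is func.__code__ and event == "line":
            snapshots.append({"line": frame.f_lineno,
                              "locals": dict(frame.f_locals)})
        return tracer  # keep receiving line events for this frame

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    snapshots.append({"return": result})
    return snapshots

# A buggy function whose intermediate states expose the fault.
def running_max(xs):
    best = xs[0]
    for x in xs:
        if x < best:      # bug: should be `x > best`
            best = x
    return best

for snap in trace_variables(running_max, [3, 1, 4, 1, 5]):
    print(snap)
```

In LDB proper, such per-block variable states, together with the task description, are what the LLM inspects to verify each unit and localize the fault; in this sketch, the trace shows `best` shrinking from 3 to 1 instead of growing toward 5, pointing directly at the faulty comparison.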