NExT: Teaching Large Language Models to Reason about Code Execution

(2404.14662)
Published Apr 23, 2024 in cs.LG, cs.CL, cs.PL, and cs.SE

Abstract

A fundamental skill among human developers is the ability to understand and reason about program execution. As an example, a programmer can mentally simulate code execution in natural language to debug and repair code (aka. rubber duck debugging). However, LLMs of code are typically trained on the surface textual form of programs, and thus may lack a semantic understanding of how programs execute at run-time. To address this issue, we propose NExT, a method to teach LLMs to inspect the execution traces of programs (variable states of executed lines) and reason about their run-time behavior through chain-of-thought (CoT) rationales. Specifically, NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions (e.g., fixed programs) without laborious manual annotation. Experiments on program repair tasks based on MBPP and HumanEval demonstrate that NExT improves the fix rate of a PaLM 2 model by 26.1% and 14.3% absolute, respectively, with significantly improved rationale quality as verified by automated metrics and human raters. Our model can also generalize to scenarios where program traces are absent at test-time.

Figure: fine-tuning an LLM to naturalize execution traces for coding task solutions.

Overview

  • The study introduces a novel framework named Naturalized Execution Tuning (NExT), designed to enhance the reasoning capabilities of LLMs in complex software engineering tasks, specifically for program repair by utilizing execution information.

  • NExT enhances LLMs by incorporating detailed program execution traces and chain-of-thought reasoning methods to improve error detection and rectification in code.

  • The method was evaluated using two major datasets, MBPP and HumanEval, with the PaLM 2-L model, showing significant performance improvements in program repair tasks.

  • NExT promises to advance LLM applications in software development, particularly automated debugging and program repair, and suggests future extensions to more programming languages and varied coding tasks.

Enhancing LLMs' Reasoning with Execution Information for Program Repair

Introduction to Naturalized Execution Tuning (NExT)

The paper details a novel approach to enhance the capability of LLMs in handling complex software engineering tasks—specifically, program repair tasks leveraging execution information. The introduced framework, Naturalized Execution Tuning (NExT), focuses on teaching LLMs to reason about code execution by incorporating program execution traces alongside chain-of-thought (CoT) reasoning methods to generate more sophisticated natural language rationales.

Key Concepts and Implementations

Task and Challenges Addressed:

  • NExT addresses the challenge of aiding LLMs in reasoning about program execution to solve programming tasks.
  • By providing models with detailed execution traces (variable values and states, line by line), NExT aims to increase a model's ability to detect and rectify errors in code; a sketch of how such a trace might be collected follows this list.
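
As a concrete illustration of what such a trace contains, below is a minimal sketch (not the paper's tooling) that records the local variable state at each executed line of a deliberately buggy function, using Python's standard `sys.settrace` hook; the `running_sum` example is purely illustrative.

```python
import sys

def collect_trace(func, *args):
    """Run func(*args) and record (line number, local variables) at each executed line."""
    trace = []

    def tracer(frame, event, arg):
        # Only record line events from the function under test.
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, trace

def running_sum(xs):
    total = 1          # illustrative bug: the accumulator should start at 0
    for x in xs:
        total += x
    return total

result, trace = collect_trace(running_sum, [1, 2, 3])
for lineno, state in trace:
    print(lineno, state)   # e.g. a line number followed by {'xs': [1, 2, 3], 'total': ...}
```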

Methodology Overview:

  • The proposed method finetunes LLMs using weakly-supervised self-training.
  • Each iteration involves generating and selecting natural language (NL) rationales and the code corrections they lead to, which are verified against unit tests for correctness.
  • The approach runs multiple rounds of sampling, filtering based on test execution, and finetuning, iteratively improving the LLM's capabilities; a minimal sketch of one such round follows this list.
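
To make the loop concrete, here is a minimal sketch of one self-training round. It assumes a placeholder sampler `sample_fn(prompt, n)` that returns candidate (rationale, fixed code) pairs from the current model, problems stored as plain dicts, and unit tests written as Python assertions; none of these names or interfaces come from the paper.

```python
def passes_tests(code: str, tests: str) -> bool:
    """Execute a candidate fix together with its unit tests; True iff nothing raises."""
    env = {}
    try:
        exec(code, env)   # define the repaired function(s)
        exec(tests, env)  # run assert-style unit tests against them
        return True
    except Exception:
        return False

def self_training_round(sample_fn, problems, n_samples=32):
    """Collect execution-verified (prompt, rationale, fixed code) triples for finetuning.

    sample_fn(prompt, n) is a stand-in for sampling n completions from the current LLM,
    each already split into an NL rationale and a candidate repaired program.
    """
    accepted = []
    for problem in problems:
        for rationale, fixed_code in sample_fn(problem["prompt"], n_samples):
            # Keep only rationales whose proposed fix actually passes the unit tests.
            if passes_tests(fixed_code, problem["tests"]):
                accepted.append(
                    {"prompt": problem["prompt"],
                     "rationale": rationale,
                     "target": fixed_code}
                )
    return accepted  # the next step would finetune the LLM on these verified examples
```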

Execution Traces Representation:

  • The model uses a compact, inline representation of execution traces as code comments (illustrated below), a simple and efficient way to provide execution context.
  • This allows models to leverage complex execution behaviors within their normal text comprehension methods without requiring specialized architectures.
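
As an illustration (not the paper's exact notation), a buggy function annotated with the variable states observed on a failing input might look like the following, with the trace rendered as ordinary comments that the model reads as plain text:

```python
# Hypothetical rendering of an execution trace as inline comments.
def running_sum(xs):       # called with xs = [1, 2, 3]
    total = 1              # total = 1   <- suspicious: accumulator starts at 1
    for x in xs:           # x takes the values 1, 2, 3
        total += x         # total becomes 2, then 4, then 7
    return total           # returns 7, but the expected output is 6
```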

Experimental Validation

Datasets and Models:

  • The study utilizes two primary datasets, MBPP for Python program repair and the HumanEval fix dataset (HE), to train and validate the LLMs enhanced by NExT.
  • PaLM 2-L model serves as the base LLM for enhancements through NExT.

Results and Observations:

  • Models enhanced with NExT demonstrated a significant improvement in fix rate, with notable absolute gains of 26.1% and 14.3% on the MBPP and HE datasets, respectively.
  • Evaluations show that even when execution traces are unavailable at test time, the trained models outperform the base models, indicating that the learned execution-reasoning ability transfers and generalizes.

Comparative Analysis:

  • NExT generally matches or exceeds the performance of several strong LLM baselines.
  • Proxy-based evaluations reveal that generated rationales not only help the main model but also assist smaller LLMs in achieving higher success rates in code fixes.

Conclusion and Future Directions

NExT presents a promising approach to significantly elevate the capabilities of LLMs in software development applications, particularly automated debugging and program repair. By expressing execution semantics in natural language, the approach integrates run-time behavior into the model's reasoning, improving both the interpretability and the functional correctness of model outputs. Future work may extend NExT to a wider range of programming languages and more diverse coding tasks, potentially incorporating more dynamic aspects of program execution and exploring how such models scale to larger and more complex datasets.
