Emergent Mind

NExT: Teaching Large Language Models to Reason about Code Execution

Published Apr 23, 2024 in cs.LG, cs.CL, cs.PL, and cs.SE


A fundamental skill among human developers is the ability to understand and reason about program execution. As an example, a programmer can mentally simulate code execution in natural language to debug and repair code (a.k.a. rubber duck debugging). However, LLMs of code are typically trained on the surface textual form of programs, and thus may lack a semantic understanding of how programs execute at run-time. To address this issue, we propose NExT, a method to teach LLMs to inspect the execution traces of programs (variable states of executed lines) and reason about their run-time behavior through chain-of-thought (CoT) rationales. Specifically, NExT uses self-training to bootstrap a synthetic training set of execution-aware rationales that lead to correct task solutions (e.g., fixed programs) without laborious manual annotation. Experiments on program repair tasks based on MBPP and HumanEval demonstrate that NExT improves the fix rate of a PaLM 2 model by 26.1% and 14.3% absolute, respectively, with significantly improved rationale quality as verified by automated metrics and human raters. Our model can also generalize to scenarios where program traces are absent at test-time.

Figure: NExT fine-tunes an LLM to naturalize execution traces for coding task solutions.


  • The study introduces Naturalized Execution Tuning (NExT), a framework that uses execution information to improve how LLMs reason in software engineering tasks, specifically program repair.

  • NExT enhances LLMs by incorporating detailed program execution traces and chain-of-thought reasoning methods to improve error detection and rectification in code.

  • The method was evaluated on program repair tasks based on two datasets, MBPP and HumanEval, with the PaLM 2-L model, improving the fix rate by 26.1% and 14.3% absolute, respectively.

  • NExT promises to advance LLM applications in software development, particularly automated debugging and program repair, and suggests future extensions to more programming languages and varied coding tasks.

Enhancing LLMs' Reasoning with Execution Information for Program Repair

Introduction to Naturalized Execution Tuning (NExT)

The paper describes an approach to improve how LLMs handle complex software engineering tasks, specifically program repair, by using execution information. The framework, Naturalized Execution Tuning (NExT), teaches LLMs to reason about code execution by combining program execution traces with chain-of-thought (CoT) reasoning, so that the model produces natural language rationales grounded in run-time behavior.

Key Concepts and Implementations

Task and Challenges Addressed:

  • NExT addresses the challenge of aiding LLMs in reasoning about program execution to solve programming tasks.
  • By providing models with detailed execution traces (variable values and states line-by-line), NExT aims to increase a model's ability to detect and rectify errors in code.

Methodology Overview:

  • The proposed method involves finetuning LLMs using weakly-supervised self-training.
  • In each iteration, the model generates natural language (NL) rationales and the corresponding code fixes; the fixes are then checked for correctness against unit tests.
  • The approach entails multiple iterations of sampling, filtering based on test executions, and finetuning, focusing on iteratively improving the LLM’s capabilities.
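
The sample, filter, and finetune loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `sample_fn`, `finetune_fn`, the task dictionary layout, and the string stand-in for a model are all hypothetical, and whether each round finetunes from the base model or from the previous checkpoint is a design choice this sketch does not settle.

```python
from typing import Callable, Dict, List, Tuple

# A candidate is a (rationale, fixed_code) pair sampled from the model.
Candidate = Tuple[str, str]


def passes_tests(code: str, tests: List[str]) -> bool:
    """Run the candidate code, then each assert-style test, in one namespace.

    exec() is used only for illustration; real systems should sandbox this.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True


def sample_and_filter(model, tasks: List[Dict], sample_fn: Callable,
                      num_samples: int) -> List[Dict]:
    """Keep only sampled (rationale, fix) pairs whose fix passes the unit tests."""
    kept = []
    for task in tasks:
        for rationale, fixed_code in sample_fn(model, task, num_samples):
            if passes_tests(fixed_code, task["tests"]):
                kept.append({"task": task, "rationale": rationale, "fix": fixed_code})
    return kept


def self_train(model, tasks, sample_fn, finetune_fn, iterations, num_samples):
    """Repeat: sample candidates, filter by test execution, finetune on the survivors."""
    for _ in range(iterations):
        kept = sample_and_filter(model, tasks, sample_fn, num_samples)
        model = finetune_fn(model, kept)
    return model


# --- Toy stand-ins (hypothetical) so the sketch runs without an actual LLM ---
def toy_sampler(model, task, n) -> List[Candidate]:
    candidates = [
        ("The function should add x to itself.", "def square(x):\n    return x + x"),
        ("x * 2 should be x * x.", "def square(x):\n    return x * x"),
    ]
    return candidates[:n]


def toy_finetune(model, examples):
    # Pretend that finetuning yields a new checkpoint name.
    return model + "+"


task = {"buggy": "def square(x):\n    return x * 2", "tests": ["assert square(3) == 9"]}
kept = sample_and_filter("base", [task], toy_sampler, num_samples=2)
final_model = self_train("base", [task], toy_sampler, toy_finetune,
                         iterations=2, num_samples=2)
```

The filter is the only supervision signal here: a rationale is kept because the fix that follows it passes the tests, not because the rationale itself was checked, which is why the source calls the procedure weakly supervised.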

Execution Traces Representation:

  • NExT represents execution traces compactly as inline code comments, which gives the model execution context without a separate trace format.
  • This allows models to leverage complex execution behaviors within their normal text comprehension methods without requiring specialized architectures.
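
To make the idea of an inline trace concrete, here is a small sketch that runs a function under Python's `sys.settrace` and appends the variable updates observed after each executed line as a trailing comment. The comment layout, the helper name `annotate_with_trace`, and the choice to show only changed variables are assumptions made for illustration; the paper's exact trace format may differ.

```python
import sys
from typing import Any, Dict, List


def annotate_with_trace(source: str, func_name: str, *args: Any) -> str:
    """Run func_name from source and append observed variable updates as comments."""
    namespace: Dict[str, Any] = {}
    exec(compile(source, "<traced>", "exec"), namespace)

    changes: Dict[int, List[str]] = {}   # line number -> variable updates seen
    state = {"line": None, "vars": {}}   # last executed line and last snapshot

    def record(frame) -> None:
        # Snapshot locals as reprs so in-place mutations are also detected.
        snapshot = {name: repr(value) for name, value in frame.f_locals.items()}
        if state["line"] is not None:
            diff = [f"{k}={v}" for k, v in snapshot.items()
                    if state["vars"].get(k) != v]
            if diff:
                changes.setdefault(state["line"], []).append(", ".join(diff))
        state["vars"] = snapshot

    def tracer(frame, event, arg):
        if frame.f_code.co_filename != "<traced>":
            return None  # ignore frames outside the traced snippet
        if event == "call":
            state["line"] = None
            record(frame)  # baseline: arguments at function entry
        elif event == "line":
            record(frame)  # effects of the previously executed line
            state["line"] = frame.f_lineno
        elif event == "return":
            record(frame)
            state["line"] = None
        return tracer

    sys.settrace(tracer)
    try:
        namespace[func_name](*args)
    except Exception:
        pass  # a buggy program may crash; keep the partial trace
    finally:
        sys.settrace(None)

    annotated = []
    for number, text in enumerate(source.splitlines(), start=1):
        if number in changes:
            text += "  # " + "; ".join(changes[number])
        annotated.append(text)
    return "\n".join(annotated)


SOURCE = "\n".join([
    "def total(nums):",
    "    s = 0",
    "    for n in nums:",
    "        s += n",
    "    return s",
])

annotated = annotate_with_trace(SOURCE, "total", [1, 2])
print(annotated)
```

Because the trace lives in ordinary comments, the annotated program is still plain source text that can be placed directly in a prompt. The sketch handles a single non-recursive function; nested or recursive calls would need per-frame bookkeeping.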

Experimental Validation

Datasets and Models:

  • The study utilizes two primary datasets, MBPP for Python program repair and the HumanEval fix dataset (HE), to train and validate the LLMs enhanced by NExT.
  • The PaLM 2-L model serves as the base LLM that NExT fine-tunes.

Results and Observations:

  • Models trained with NExT showed a higher fix rate, with absolute improvements of 26.1% and 14.3% on the MBPP and HE datasets, respectively.
  • Even when execution traces are unavailable at test time, the trained models outperform the base models, which suggests that the learned execution reasoning transfers to settings without traces.

Comparative Analysis:

  • NExT generally matches or exceeds the performance of several strong LLM baselines.
  • Proxy-based evaluations reveal that generated rationales not only help the main model but also assist smaller LLMs in achieving higher success rates in code fixes.

Conclusion and Future Directions

NExT presents a promising approach to improving LLMs in software development applications, particularly automated debugging and program repair. By expressing execution semantics in natural language rationales, the approach aims to improve both the interpretability and the functional correctness of model outputs. Future work may extend NExT to a wider range of programming languages and more diverse coding tasks, incorporate more dynamic aspects of program execution, and study how the approach scales to larger and more complex datasets.
