LEVER: Learning to Verify Language-to-Code Generation with Execution

Published 16 Feb 2023 in cs.LG, cs.CL, cs.PL, and cs.SE | (2302.08468v3)

Abstract: The advent of LLMs trained on code (code LLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine LLM decoding with sample pruning and reranking using test cases or heuristics based on the execution results. However, it is challenging to obtain test cases for many real-world language-to-code applications, and heuristics cannot well capture the semantic features of the execution results, such as data type and value range, which often indicate the correctness of the program. In this work, we propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. The sampled programs are reranked by combining the verification score with the LLM generation probability, and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA and basic Python programming, LEVER consistently improves over the base code LLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.

Authors (7)

Citations (171)

Summary

The paper introduces LEVER, a method that trains verifiers to judge whether LLM-generated programs are correct, using the natural language input, the program, and its execution results.
It reranks sampled programs by combining the verifier score with the LLM generation probability, improving execution accuracy by 4.6% to 10.9% with code-davinci-002 across four datasets.
The approach sets new state-of-the-art results on table QA, math QA, and basic Python programming tasks without finetuning the code LLM itself; only the verifier is trained.

LEVER: Learning to Verify Language-to-Code Generation with Execution

The paper introduces LEVER, an approach that improves language-to-code generation with LLMs trained on code (code LLMs) by adding a verification step grounded in execution results. State-of-the-art methods typically combine LLM decoding with sample pruning and reranking based on test cases or on heuristics over execution results. However, test cases are often unavailable in real-world applications, and heuristics capture poorly the semantic features of execution results, such as data type and value range, that often indicate whether a program is correct. LEVER addresses this by training verifiers that judge the correctness of each sampled program from three inputs: the natural language input, the program itself, and its execution results.
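As a rough illustration of the three verifier inputs, the sketch below runs a candidate program and packs the question, the program, and the execution result into one text input. It is a minimal sketch under stated assumptions, not the paper's implementation: the names `ExecutionResult`, `run_program`, and `build_verifier_input`, the convention that a program stores its output in a variable called `answer`, and the input format are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ExecutionResult:
    ok: bool     # True if the program ran without raising
    value: Any   # value bound to `answer`, or the error message on failure


def run_program(code: str) -> ExecutionResult:
    """Run a candidate program and read back the variable `answer`.

    WARNING: exec() on untrusted model output is unsafe. A real system
    would isolate this step in a sandboxed process with a timeout.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        return ExecutionResult(ok=True, value=namespace.get("answer"))
    except Exception as err:
        # Keep the error text: a failure is also a signal for the verifier.
        return ExecutionResult(ok=False, value=f"{type(err).__name__}: {err}")


def build_verifier_input(question: str, code: str, result: ExecutionResult) -> str:
    """Concatenate the three signals into one text input (hypothetical format)."""
    status = "result" if result.ok else "error"
    return f"question: {question}\nprogram: {code}\n{status}: {result.value!r}"


# Usage: one sampled program for a toy math question.
question = "If a pen costs 3 dollars, how much do 4 pens cost?"
candidate = "answer = 3 * 4"
print(build_verifier_input(question, candidate, run_program(candidate)))
```

A trained verifier would take a string like this and output a probability that the program is correct. Including the execution result lets it check properties such as data type and value range that the program text alone does not reveal.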

Key Contributions

Verification with Execution: LEVER trains verifiers to decide whether a program sampled from the LLM is correct, based on the natural language input, the program, and its execution results.
Reranking Framework: Sampled programs are reranked by combining the verification score with the LLM generation probability and marginalizing over programs that produce the same execution results, so that the results supported by the most probability mass rank highest.
Performance: LEVER consistently improves over the base code LLMs on four datasets spanning table question answering (QA), math QA, and basic Python programming, with execution accuracy gains of 4.6% to 10.9% using code-davinci-002 and new state-of-the-art results on all four.
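The reranking step described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the `Candidate` class, the `rerank` function, and the probabilities in the example are made up, and multiplying raw probabilities (rather than working in log space) is a simplification. Each sample is scored by its generation probability times the verifier's probability of correctness, and scores are summed over samples that share an execution result.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Hashable, List


@dataclass
class Candidate:
    program: str
    exec_result: Hashable  # execution result, used as the grouping key
    p_lm: float            # generation probability from the code LLM
    p_verifier: float      # verifier's probability that the program is correct


def rerank(candidates: List[Candidate]) -> Candidate:
    """Return a program from the execution-result group with the highest total score."""
    group_score = defaultdict(float)
    for c in candidates:
        # Score of one sample: generation probability times verifier probability.
        group_score[c.exec_result] += c.p_lm * c.p_verifier
    best_result = max(group_score, key=group_score.get)
    # All members of the winning group produce the same result; returning the
    # highest-scoring one is a choice made for this sketch.
    members = [c for c in candidates if c.exec_result == best_result]
    return max(members, key=lambda c: c.p_lm * c.p_verifier)


# Usage with made-up probabilities.
candidates = [
    Candidate("answer = 3 * 4", 12, p_lm=0.30, p_verifier=0.80),  # score 0.24
    Candidate("answer = 4 * 3", 12, p_lm=0.20, p_verifier=0.75),  # score 0.15
    Candidate("answer = 3 + 4", 7, p_lm=0.40, p_verifier=0.85),   # score 0.34
]
best = rerank(candidates)
print(best.program, best.exec_result)
```

In this toy example the single highest-scoring sample returns 7 (score 0.34), but the two samples that return 12 together score 0.39, so a program returning 12 is selected. This is the effect of marginalizing over programs with the same execution result.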

Numerical Results

LEVER improved accuracy on Spider, WikiTableQuestions, GSM8k, and MBPP through execution-informed reranking. On Spider, it raised execution accuracy from 75.3% to 81.9%, a gain of 6.6 percentage points, surpassing both the previous few-shot and finetuned state-of-the-art models. Similar gains were observed on the other datasets, supporting the value of execution feedback in the language-to-code pipeline.

Implications

Practical: LEVER improves code generation without finetuning the code LLM itself; only the verifier is trained. This is valuable for applications where test cases are unavailable or where compute for finetuning a large model is limited.

Theoretical: The results indicate that semantic information in execution results, which heuristics capture poorly, can be learned by a verifier and used to improve program synthesis accuracy.

Future Directions

Further work could test how well LEVER scales to more programming languages and model architectures. Another promising direction is to use dynamic datasets in which the verifier's feedback iteratively refines the LLM's generations.

LEVER's methodology shows the promise of coupling LLM-based generation with learned, execution-informed verification for translating natural language into executable code. The paper lays a foundation for future research on improving code generation through semantic verification.
