- The paper introduces a causal masking objective that enables InCoder to effectively infill arbitrary code segments using bidirectional context.
- The model unifies program synthesis and editing by generating code both left-to-right and through infilling, outperforming left-to-right baselines on infilling tasks while remaining competitive on standard benchmarks such as HumanEval and MBPP.
- The paper demonstrates InCoder's zero-shot capabilities on tasks such as type inference and comment generation, paving the way for advanced AI-driven development tools.
Analysis of InCoder: A Generative Model for Code Infilling and Synthesis
The research paper "InCoder: A Generative Model for Code Infilling and Synthesis" presents a unified generative model for neural program synthesis and code editing. The work addresses a key limitation of left-to-right autoregressive language models: they cannot condition on code that follows the insertion point. InCoder can infill arbitrary code regions using bidirectional context, and its primary innovation is the causal masking objective, which lets the model exploit both left and right context for code generation and infilling.
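To make the infilling setup concrete, the sketch below shows how an infilling prompt exposes both sides of a missing region. The sentinel strings and the decoding call are illustrative placeholders, not InCoder's exact special tokens or API.

```python
# Illustrative only: <MASK:0> and <EOM> stand in for the model's real
# sentinel tokens, and model_generate is a hypothetical decoding call.

left = "def clamp(x, lo, hi):\n    "
right = "\n    return x"

# The prompt places a sentinel where code is missing and repeats it at the
# end, so the decoder generates the missing span conditioned on BOTH sides.
prompt = left + "<MASK:0>" + right + "<MASK:0>"

# completion = model_generate(prompt)   # e.g. "x = max(lo, min(x, hi))<EOM>"
# infilled = left + completion.removesuffix("<EOM>") + right
```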
Technical Contributions
The authors distinguish InCoder by training a single model to handle both program synthesis and editing. Unlike conventional models restricted to left-to-right generation, InCoder can infill code regions, making it a more versatile tool across programming tasks. Key contributions include:
- Causal Masking Objective: Unlike BERT-style masked language models, which predict short masked tokens in place, InCoder's causal masking objective supports infilling of long, variable-length spans: during training, contiguous spans are sampled, replaced with sentinel tokens, and moved to the end of the sequence, where they are generated left-to-right conditioned on the surrounding context (see the sketch after this list).
- Unified Model for Synthesis and Editing: The model integrates two traditionally distinct tasks—program synthesis and code editing—by generating code both left-to-right and through infilling.
- Zero-Shot Capability: InCoder is evaluated for zero-shot generalization across challenging tasks, such as type inference and comment generation, without task-specific fine-tuning.
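The sketch below illustrates a simplified, single-span version of this training transformation; the paper samples multiple spans per document, and the token strings here are placeholders rather than InCoder's actual vocabulary.

```python
import random

def make_causal_masking_example(tokens, mask_tok="<MASK:0>", eom_tok="<EOM>"):
    """Single-span causal-masking example (illustrative simplification).

    A contiguous span is cut out, replaced by a sentinel, and appended to
    the end of the sequence, so the model learns to generate the span
    left-to-right while conditioning on both preceding and following code.
    """
    start = random.randrange(len(tokens))
    end = random.randint(start + 1, len(tokens))
    span = tokens[start:end]
    masked_doc = tokens[:start] + [mask_tok] + tokens[end:]
    # Training target: the masked document with the span moved to the end.
    return masked_doc + [mask_tok] + span + [eom_tok]

example = make_causal_masking_example("def add ( a , b ) : return a + b".split())
```

At inference time the same format is reused: the prompt ends with the sentinel and generation stops at the end-of-mask token, which is what lets one model serve both left-to-right synthesis (hole at the end) and editing (hole in the middle).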
Experimental Evaluation
The authors conduct thorough evaluations on multiple infilling tasks and existing benchmarks like HumanEval and MBPP. InCoder demonstrates:
- Superior Infilling Performance: On tasks involving single-line and multi-line infilling, InCoder outperforms baseline left-to-right models, affirming the utility of bidirectional context for program comprehension and code generation.
- Effective Context Utilization: In tasks like docstring generation and return type prediction, InCoder achieves competitive results and, in some cases, rivals state-of-the-art models fine-tuned on similar tasks (see the sketch after this list).
- Scalability Insights: Ablation studies show how model scale and training-data composition affect synthesis performance: training on a broader, multi-language corpus slightly reduces scores on Python-specific benchmarks but yields a more versatile model.
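As an example of how such tasks can be cast as infilling without fine-tuning, the sketch below places the missing region at the return annotation; as before, the sentinel strings are illustrative placeholders rather than the model's exact tokens.

```python
# Zero-shot return type prediction framed as infilling (illustrative).
# The hole sits where the annotation belongs; the function body to the
# right of the hole is what makes the prediction well-informed.
signature = "def count_words(path: str) -> "
body = ":\n    with open(path) as f:\n        return len(f.read().split())"

prompt = signature + "<MASK:0>" + body + "<MASK:0>"
# A model trained with causal masking would be expected to infill
# something like "int" before emitting its end-of-mask token.
```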
Implications and Future Directions
InCoder's success in integrating synthesis and infilling capabilities highlights a potential shift in how code generation models are conceptualized. By unifying these tasks, the research opens the door to more flexible and interactive development tools. Practically, such models can be instrumental in real-time code editing applications where rapid iteration and refinement are routine.
Theoretically, InCoder's approach could influence how architectures and training objectives are designed in machine learning for code, encouraging models that combine synthesis and editing within a single objective. Future work might explore improved fine-tuning strategies or extend bidirectional-context infilling to other domains.
In conclusion, the InCoder model effectively tackles prominent challenges in neural program synthesis and editing, establishing a robust foundation for future advances and applications in AI-driven development environments. The findings mark clear progress in applying generative models to the complex reasoning inherent in programming tasks.