A Neural Model for Generating Natural Language Summaries of Program Subroutines (1902.01954v1)

Published 5 Feb 2019 in cs.SE

Abstract: Source code summarization -- creating natural language descriptions of source code behavior -- is a rapidly-growing research topic with applications to automatic documentation generation, program comprehension, and software maintenance. Traditional techniques relied on heuristics and templates built manually by human experts. Recently, data-driven approaches based on neural machine translation have largely overtaken template-based systems. But nearly all of these techniques rely almost entirely on programs having good internal documentation; without clear identifier names, the models fail to create good summaries. In this paper, we present a neural model that combines words from code with code structure from an AST. Unlike previous approaches, our model processes each data source as a separate input, which allows the model to learn code structure independent of the text in code. This process helps our approach provide coherent summaries in many cases even when zero internal documentation is provided. We evaluate our technique with a dataset we created from 2.1m Java methods. We find improvement over two baseline techniques from SE literature and one from NLP literature.

Citations (272)

Summary

  • The paper presents a dual-input neural model that separately processes code text and AST data, achieving a BLEU score of 20.9 on Java method summaries.
  • It employs distinct GRU layers and an attention mechanism within a sequence-to-sequence framework to effectively merge structural and textual information.
  • The study demonstrates that the model can generate useful summaries even for poorly documented or obfuscated code, indicating strong potential for automated documentation.

Overview of a Neural Model for Source Code Summarization

The paper presents a neural model tailored to source code summarization: automatically generating natural language summaries of program subroutines. This task is highly relevant in software engineering, as it aids program comprehension and reduces the manual effort required for documentation. The authors address a key limitation of previous techniques, which relied heavily on the availability of meaningful identifier names and internal comments in the code, an assumption that is often unrealistic in practice.

Contributions and Methodology

The authors introduce an innovative approach that leverages both textual information from the code and structural information derived from the Abstract Syntax Tree (AST) in tandem. This methodology notably diverges from traditional models by processing these inputs separately rather than as a composite representation. A key strength of this approach lies in its capacity to generate cohesive summaries even in cases where internal documentation is nonexistent or sparse.

The neural architecture comprises two distinct GRU encoders: one processes code structure via a flattened AST sequence, and the other processes the code as plain text. An attention mechanism integrates context from both representations when predicting each word of the summary, within a sequence-to-sequence (seq2seq) learning framework.
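To make this concrete, below is a minimal sketch of such a dual-encoder model in Keras. The vocabulary sizes, sequence lengths, and layer widths are illustrative placeholders rather than the paper's exact hyperparameters, and the layer wiring is an assumption based on the description above, not the authors' released implementation.

```python
# Hedged sketch of a dual-encoder seq2seq summarizer (code text + flattened AST).
# All sizes are illustrative, not the paper's configuration.
import tensorflow as tf
from tensorflow.keras import layers, Model

CODE_VOCAB, AST_VOCAB, SUM_VOCAB = 10000, 100, 10000
CODE_LEN, AST_LEN, SUM_LEN = 100, 100, 13
EMB, UNITS = 100, 256

# Encoder 1: source code tokens treated as plain text
code_in = layers.Input(shape=(CODE_LEN,))
code_emb = layers.Embedding(CODE_VOCAB, EMB)(code_in)
code_out, code_state = layers.GRU(UNITS, return_sequences=True,
                                  return_state=True)(code_emb)

# Encoder 2: flattened AST token sequence
ast_in = layers.Input(shape=(AST_LEN,))
ast_emb = layers.Embedding(AST_VOCAB, EMB)(ast_in)
ast_out = layers.GRU(UNITS, return_sequences=True)(ast_emb)

# Decoder: summary words generated so far, initialized from the code encoder
dec_in = layers.Input(shape=(SUM_LEN,))
dec_emb = layers.Embedding(SUM_VOCAB, EMB)(dec_in)
dec_out = layers.GRU(UNITS, return_sequences=True)(dec_emb,
                                                   initial_state=code_state)

# Dot-product attention over each encoder, one context vector per decoder step
code_ctx = layers.Attention()([dec_out, code_out])
ast_ctx = layers.Attention()([dec_out, ast_out])

# Merge both contexts with the decoder state and predict the next summary word
merged = layers.Concatenate()([dec_out, code_ctx, ast_ctx])
hidden = layers.TimeDistributed(layers.Dense(UNITS, activation="relu"))(merged)
next_word = layers.TimeDistributed(
    layers.Dense(SUM_VOCAB, activation="softmax"))(hidden)

model = Model(inputs=[code_in, ast_in, dec_in], outputs=next_word)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```

Keeping the two encoders separate, rather than interleaving AST tokens with code tokens in one input stream, is what lets the model learn structural regularities that remain usable even when the code text itself is uninformative.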

Evaluation and Results

Evaluation is conducted on a dataset of 2.1 million Java methods, with comparisons against baseline techniques from both the software engineering and natural language processing literature. The model outperforms these baselines, achieving a BLEU score of 20.9 with ensemble decoding that combines predictions from its text and AST representations, surpassing every other tested configuration.
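The sketch below illustrates one common way to realize this kind of ensemble decoding and BLEU scoring: average the next-word distributions of two trained models at each greedy decoding step, then score the resulting summaries with corpus BLEU. The `greedy_ensemble` helper, the special-token ids, and the assumption that both models share the input layout of the earlier sketch are illustrative, not the paper's released code.

```python
# Hedged sketch of ensemble decoding plus BLEU scoring; assumes two trained
# Keras models with the (code, ast, decoder) input layout shown earlier.
import numpy as np
from nltk.translate.bleu_score import corpus_bleu

START, END = 1, 2  # assumed ids of the start/end tokens
MAX_LEN = 13       # assumed maximum summary length

def greedy_ensemble(model_a, model_b, code_seq, ast_seq, max_len=MAX_LEN):
    """Greedy decoding that averages the two models' next-word distributions."""
    summary = [START]
    for _ in range(max_len - 1):
        dec = np.zeros((1, max_len))
        dec[0, :len(summary)] = summary
        p_a = model_a.predict([code_seq, ast_seq, dec], verbose=0)
        p_b = model_b.predict([code_seq, ast_seq, dec], verbose=0)
        step = len(summary) - 1
        next_id = int(np.argmax((p_a[0, step] + p_b[0, step]) / 2.0))
        summary.append(next_id)
        if next_id == END:
            break
    return summary

# Corpus BLEU over reference/hypothesis token lists (toy example)
references = [[["returns", "the", "name", "of", "the", "node"]]]
hypotheses = [["returns", "the", "node", "name"]]
print(corpus_bleu(references, hypotheses))
```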

The paper further explores a "challenge" context whereby the model is trained and tested using only the abstract syntax structure, simulating scenarios such as obfuscated or poorly documented code. Remarkably, the model still provides summaries with a BLEU score of 9.5, indicating potential even under stringent conditions.

Implications and Future Directions

The implications of this research are twofold: practically, the ability of the model to deliver meaningful summaries with minimal reliance on internal documentation expands the applicability of automatic documentation tools. Theoretically, the adoption of a dual-input processing paradigm encourages further exploration into unique handling of heterogeneous data types within neural models.

The reported orthogonality between the model's predictions and those of a strong NLP baseline suggests opportunities for more advanced ensemble learning, potentially improving prediction accuracy through intelligent model combination strategies. Additionally, refinements in how each input type is processed, such as language-specific features or more sophisticated AST encoding, warrant further investigation and could improve the model's robustness across diverse programming environments.

This research contributes a nuanced perspective on applying neural architectures to software engineering tasks, emphasizing the separate processing of textual and structural code representations. It marks a significant step towards more versatile and effective automatic code summarization systems.