Multi-task Learning based Pre-trained Language Model for Code Completion

Published 29 Dec 2020 in cs.SE | (2012.14631v1)

Abstract: Code completion is one of the most useful features in Integrated Development Environments (IDEs), which can accelerate software development by suggesting the next probable token based on the contextual code in real time. Recent studies have shown that statistical language modeling techniques can improve the performance of code completion tools through learning from large-scale software repositories. However, these models suffer from two major drawbacks: a) Existing research uses static embeddings, which map a word to the same vector regardless of its context. The differences in the meaning of a token in varying contexts are lost when each token is associated with a single representation; b) Existing language model based code completion models perform poorly on completing identifiers, and the type information of the identifiers is ignored in most of these models. To address these challenges, in this paper, we develop a multi-task learning based pre-trained language model for code understanding and code generation with a Transformer-based neural architecture. We pre-train it with hybrid objective functions that incorporate both code understanding and code generation tasks. Then we fine-tune the pre-trained model on code completion. During the completion, our model does not directly predict the next token. Instead, we adopt multi-task learning to predict the token and its type jointly and utilize the predicted type to assist the token prediction. Experimental results on two real-world datasets demonstrate the effectiveness of our model when compared with state-of-the-art methods.


Citations (182)


Summary

The paper presents CugLM, a Transformer-based pre-trained language model that uses multi-task learning to improve code completion.
The model is built in two phases, pre-training and fine-tuning, and combines objectives such as Masked Bidirectional Language Modeling with identifier type prediction to improve identifier handling.
On Java and TypeScript datasets, CugLM outperforms state-of-the-art baselines in code completion accuracy.

Multi-task Learning based Pre-trained Language Model for Code Completion

In the study titled "Multi-task Learning based Pre-trained Language Model for Code Completion," the authors propose improving code completion in Integrated Development Environments (IDEs) by applying multi-task learning to a pre-trained language model. The research addresses two limitations of existing language-model-based code completion systems: static embeddings and ineffective handling of identifiers. The proposed model, named CugLM, uses a Transformer-based neural architecture and is pre-trained with multiple objective functions that cover both code understanding and code generation tasks.
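The static-embedding limitation can be illustrated with a toy sketch. This is not CugLM's architecture: the two-dimensional vectors in `STATIC` are made up, and `contextual_embed` is a bare dot-product self-attention without learned projections, used here only as a stand-in for what a Transformer layer does. It shows that a lookup table returns the same vector for a token in every context, while an attention-based representation changes with the surrounding tokens.

```python
import math

# Hypothetical static embedding table (values chosen only for illustration).
STATIC = {
    "list": [1.0, 0.0],
    "import": [0.0, 1.0],
    "add": [1.0, 1.0],
    ".": [0.5, 0.5],
}


def static_embed(tokens):
    """Look up each token independently; context is ignored."""
    return [STATIC[t] for t in tokens]


def contextual_embed(tokens):
    """Mix each token's vector with its context via dot-product self-attention.

    Simplification: queries, keys, and values are the raw static vectors;
    a real Transformer would apply learned projections and several layers.
    """
    vecs = static_embed(tokens)
    out = []
    for query in vecs:
        scores = [sum(a * b for a, b in zip(query, key)) for key in vecs]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum((w / total) * v[d] for w, v in zip(weights, vecs))
            for d in range(len(query))
        ])
    return out


context_a = ["import", "list"]
context_b = ["list", ".", "add"]

# Same token, two contexts: static vectors match, contextual vectors differ.
same_static = static_embed(context_a)[1] == static_embed(context_b)[0]
same_contextual = contextual_embed(context_a)[1] == contextual_embed(context_b)[0]
```

Here `same_static` is true and `same_contextual` is false, which is the distinction the authors draw between static and contextualized representations.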

The authors identify two main problems with previous language models: static embeddings that cannot reflect how a token's meaning varies with context, and poor performance on identifiers because type information is not used. Their solution has two phases: first, pre-training the language model on a curated dataset of Java and TypeScript projects; second, fine-tuning it specifically for code completion. The multi-task pre-training framework improves the model's representation of code through three objectives: Masked Bidirectional Language Modeling, Next Code Segment Prediction, and Unidirectional Language Modeling. During completion, the model also predicts the type of the next token and uses that predicted type to assist the token prediction, which improves accuracy on identifiers.
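The idea of predicting a token and its type jointly, then letting the predicted type assist the token prediction, can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the vocabulary `TOKENS`, the type inventory `TYPES`, the `TYPE_OF` mapping, and all scores are hypothetical, and reweighting each token's probability by the probability of its type is just one simple way to combine the two predictions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical candidate identifiers and their declared types.
TOKENS = ["count", "name", "items"]
TYPES = ["int", "String", "List"]
TYPE_OF = {"count": "int", "name": "String", "items": "List"}


def type_assisted_distribution(token_scores, type_scores):
    """Reweight token probabilities by the predicted probability of each token's type."""
    token_probs = softmax(token_scores)
    type_probs = dict(zip(TYPES, softmax(type_scores)))
    weighted = [p * type_probs[TYPE_OF[t]] for t, p in zip(TOKENS, token_probs)]
    total = sum(weighted)
    return {t: w / total for t, w in zip(TOKENS, weighted)}


def joint_loss(token_scores, type_scores, gold_token, gold_type, type_weight=1.0):
    """Multi-task loss: token cross-entropy plus weighted type cross-entropy.

    type_weight is an assumed hyperparameter for balancing the two tasks.
    """
    token_nll = -math.log(softmax(token_scores)[TOKENS.index(gold_token)])
    type_nll = -math.log(softmax(type_scores)[TYPES.index(gold_type)])
    return token_nll + type_weight * type_nll


# The token head alone slightly prefers "name"; the type head strongly predicts "int".
token_scores = [1.0, 1.2, 0.5]
type_scores = [3.0, 0.0, 0.0]

plain_best = TOKENS[max(range(len(TOKENS)), key=lambda i: token_scores[i])]
assisted = type_assisted_distribution(token_scores, type_scores)
assisted_best = max(assisted, key=assisted.get)
loss = joint_loss(token_scores, type_scores, "count", "int")
```

In this toy setup the token scores alone rank `name` first, while the type-assisted distribution ranks `count` first because the type prediction favors `int`. The `joint_loss` function shows how two task losses can be summed so that a shared model is trained on both predictions at once.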

Experimental results on Java and TypeScript datasets compare favorably against state-of-the-art baselines such as the Pointer Mixture Network and BPE-based neural language models. CugLM outperformed these baselines, with a notable improvement in identifier prediction, which remains one of the hardest parts of code completion.

The paper has implications for both practice and further research. Practically, integrating contextualized language models into code completion systems could improve developer productivity and code quality by producing more reliable, context-relevant predictions. Theoretically, the results support the use of multi-task learning in language model pre-training, an approach that could be extended beyond code completion to other software engineering and natural language processing tasks.

Future work might extend this methodology to additional programming languages, train the model on larger and more diverse datasets, or integrate the system into real-world IDEs. Given the rapid progress of Transformer models and their applications, CugLM's methodology offers a promising direction for improving automated code completion.
