UniXcoder: Unified Cross-Modal Pre-training for Code Representation (2203.03850v1)

Published 8 Mar 2022 in cs.CL, cs.PL, and cs.SE

Abstract: Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion that requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for programming language. The model utilizes mask attention matrices with prefix adapters to control the behavior of the model and leverages cross-modal contents like AST and code comment to enhance code representation. To encode AST that is represented as a tree in parallel, we propose a one-to-one mapping method to transform AST in a sequence structure that retains all structural information from the tree. Furthermore, we propose to utilize multi-modal contents to learn representation of code fragment with contrastive learning, and then align representations among programming languages using a cross-modal generation task. We evaluate UniXcoder on five code-related tasks over nine datasets. To further evaluate the performance of code fragment representation, we also construct a dataset for a new task, called zero-shot code-to-code search. Results show that our model achieves state-of-the-art performance on most tasks and analysis reveals that comment and AST can both enhance UniXcoder.

Authors (6)

Daya Guo (37 papers)
Shuai Lu (91 papers)
Nan Duan (172 papers)
Yanlin Wang (76 papers)
Ming Zhou (182 papers)
Jian Yin (67 papers)

Citations (459)

Summary

The paper introduces a unified model that integrates code semantics using cross-modal inputs such as ASTs and comments.
It combines masked and unidirectional language modeling, denoising objectives, and contrastive learning to support both code understanding and generation tasks.
Evaluation on five code-related tasks over nine datasets shows state-of-the-art performance on most tasks, with notable improvements in semantic code retrieval.

Overview of UniXcoder: Unified Cross-Modal Pre-training for Code Representation

The paper presents UniXcoder, a unified pre-trained model designed to handle both code understanding and code generation tasks. Unlike earlier approaches that rely solely on either a bidirectional or a unidirectional framework, UniXcoder takes a cross-modal approach that draws on additional semantic and syntactic information, specifically code comments and Abstract Syntax Trees (ASTs). The aim is for the learned representations to capture both what the code means and how it is structured.

UniXcoder distinguishes itself by utilizing a novel technique to encode ASTs, typically represented as trees, into sequences that preserve structural information. This transformation enables the integration of ASTs within a Transformer architecture. The model is based on a multi-layer Transformer and utilizes mask attention matrices with prefix adapters to manage the context visibility for tokens, enhancing its flexibility to support both understanding and generative paradigms.
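The two mechanisms in this paragraph can be illustrated with a minimal sketch. It assumes a toy tree representation (a leaf is a string, a non-leaf is a `(name, children)` pair) and a bracketing scheme in which each non-leaf node is wrapped in left/right boundary tokens; the token format, the function names, and the mode strings are illustrative choices, not the paper's exact implementation. The second function builds a boolean visibility matrix for three modes of operation: fully bidirectional, strictly left-to-right, and a mixed form in which a source segment is read bidirectionally while a target segment is generated left to right.

```python
from typing import List, Tuple, Union

# Hypothetical toy tree type: a leaf is a str, a non-leaf is (name, children).
# A real pipeline would obtain the tree from a parser instead.
Node = Union[str, Tuple[str, list]]


def flatten_ast(node: Node) -> List[str]:
    """Flatten a tree into a token sequence from which the nesting can be rebuilt.

    Leaves emit their own name. Every non-leaf node is wrapped in a pair of
    boundary tokens, so the tree structure is recoverable from the sequence.
    """
    if isinstance(node, str):
        return [node]
    name, children = node
    tokens = [f"<{name},left>"]
    for child in children:
        tokens.extend(flatten_ast(child))
    tokens.append(f"<{name},right>")
    return tokens


def build_attention_mask(mode: str, src_len: int, tgt_len: int = 0) -> List[List[bool]]:
    """Return a matrix where mask[i][j] is True if position i may attend to j."""
    n = src_len + tgt_len
    if mode == "encoder-only":
        # Every token sees every other token.
        return [[True] * n for _ in range(n)]
    if mode == "decoder-only":
        # Each token sees itself and the tokens before it.
        return [[j <= i for j in range(n)] for i in range(n)]
    if mode == "encoder-decoder":
        # Source tokens see the whole source; target tokens see the source
        # plus the target tokens generated so far.
        return [
            [(j < src_len) if i < src_len else (j <= i) for j in range(n)]
            for i in range(n)
        ]
    raise ValueError(f"unknown mode: {mode}")


# Toy tree for the statement `x = a + b` (node names are made up).
tree = ("assignment", ["x", "=", ("binary_op", ["a", "+", "b"])])
ast_tokens = flatten_ast(tree)
causal = build_attention_mask("decoder-only", 3)
seq2seq = build_attention_mask("encoder-decoder", src_len=2, tgt_len=2)
```

Because each non-leaf node contributes a matched pair of boundary tokens, a simple stack-based pass over `ast_tokens` would be enough to rebuild the original nesting, which is what makes the mapping reversible in this sketch.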

Pre-training Tasks and Methodology

The model employs several pre-training strategies to learn robust code representations:

Masked Language Modeling (MLM): This task involves masking portions of the input and predicting these masked tokens using bidirectional context. It helps UniXcoder incorporate semantic nuances from comments and syntactic features from ASTs.
Unidirectional Language Modeling (ULM): Here, the model learns to predict subsequent tokens, thus facilitating its application to auto-regressive tasks such as code completion.
Denoising Objectives: Inspired by previous work like T5, UniXcoder uses a sequence generation task where randomly masked spans are reconstructed. This serves the dual purpose of inferring code semantics and supporting generative tasks.
Contrastive and Cross-Modal Learning: By leveraging multi-modal data, the model learns cross-language representations of code fragments, aligning them with semantic counterparts like code comments. This is achieved through contrastive learning and cross-modal generation, enhancing the model's ability to abstract language-agnostic code semantics.
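The contrastive objective in the last item can be sketched as an InfoNCE-style loss with in-batch negatives. This is a generic formulation under stated assumptions, not necessarily the paper's exact loss: the function names, the temperature value, and the toy vectors are all illustrative, and how positive pairs are constructed is deliberately left out.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(
    anchors: List[Sequence[float]],
    positives: List[Sequence[float]],
    temperature: float = 0.05,  # illustrative value, not taken from the paper
) -> float:
    """Mean InfoNCE loss over a batch.

    anchors[i] should match positives[i]; every other positive in the batch
    serves as a negative for anchor i.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy batch: two fragment embeddings and their matching views.
batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(batch, [[1.0, 0.0], [0.0, 1.0]])
mismatched_loss = info_nce_loss(batch, [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each anchor is closest to its own positive and large when the pairing is scrambled, which is the pressure that pulls representations of the same fragment together and pushes unrelated fragments apart.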

Evaluation and Results

UniXcoder was evaluated on five code-related tasks over nine datasets. For code understanding, these are clone detection and code search; for generation, code summarization and code generation; and for auto-regressive use, code completion. The authors also construct a dataset for a new task, zero-shot code-to-code search, to assess the quality of code fragment representations. The model achieves state-of-the-art performance on most tasks, and it does particularly well where rich semantic understanding is required, as in zero-shot code-to-code search.
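To make the zero-shot code-to-code search setting concrete, here is a minimal retrieval sketch. It assumes that each code fragment has already been mapped to a fixed-size embedding by a pre-trained encoder used without any task-specific fine-tuning; the vectors below are made-up toy values, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(
    query_vec: Sequence[float], candidate_vecs: List[Sequence[float]]
) -> List[int]:
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings standing in for encoder outputs (not real model outputs).
query = [1.0, 0.2, 0.0]
candidates = [
    [0.0, 1.0, 0.0],   # weakly related fragment
    [0.9, 0.1, 0.1],   # near-duplicate in meaning
    [-1.0, 0.0, 0.0],  # unrelated fragment
]
ranking = rank_candidates(query, candidates)
```

In this setup the quality of the ranking depends entirely on the embeddings, which is why the task serves as a direct probe of how well the pre-trained representations capture code semantics.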

The performance gain is attributed primarily to the cross-modal learning strategies and to the way ASTs and comments are represented and used. This integration helps UniXcoder capture the relationship between natural language and programming languages, which yields substantial improvements on semantic code retrieval tasks.

Implications and Future Developments

By proposing a pre-trained model that unifies code representation through multi-modal inputs, this research lays a foundation for extending the applicability of pre-trained models to more complex and semantically rich code intelligence tasks. Future directions could explore:

Scalability: Expanding the model architecture to accommodate larger datasets and longer sequences, particularly for languages with verbose syntax.
Cross-Task Learning: Investigating the transferability of representations learned for one task to another, potentially reducing the need for extensive task-specific fine-tuning.
Enhanced Multi-Modal Integration: Further refining the integration of multi-modal content, possibly incorporating more nuanced semantic features from documentation or domain-specific lexicons.

In conclusion, UniXcoder's design underscores the importance of integrating syntax and semantics in code representation, pushing the boundaries of what's achievable in automated code analysis and generation. As the field progresses, such cross-modal and unified approaches could lead to more intelligent and adaptable AI systems in software engineering.
