CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation (2102.04664v2)

Published 9 Feb 2021 in cs.SE and cs.CL

Abstract: Benchmark datasets have a significant impact on accelerating research in programming language tasks. In this paper, we introduce CodeXGLUE, a benchmark dataset to foster machine learning research for program understanding and generation. CodeXGLUE includes a collection of 10 tasks across 14 datasets and a platform for model evaluation and comparison. CodeXGLUE also features three baseline systems, including the BERT-style, GPT-style, and Encoder-Decoder models, to make it easy for researchers to use the platform. The availability of such data and baselines can help the development and validation of new methods that can be applied to various program understanding and generation problems.

Summary

The paper introduces CodeXGLUE, a standardized benchmark dataset and evaluation framework designed to accelerate research in code intelligence.
It covers 10 tasks across 14 datasets and provides baseline models such as CodeBERT and CodeGPT for them.
Reported baseline results range from high scores on clone detection (F1 of 96.5 on BigCloneBench) to more modest ones on defect detection (accuracy of 62.08 on Devign).

The paper "CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation" presents a dataset collection and benchmark suite for machine learning on program understanding and generation. CodeXGLUE provides a standardized evaluation framework intended to accelerate research and development in code intelligence.

Overview of CodeXGLUE

CodeXGLUE encompasses a diverse set of 14 datasets covering 10 distinct programming language tasks, divided into four main categories: code-code, text-code, code-text, and text-text. These tasks include:

Clone Detection: Using datasets like BigCloneBench and POJ-104 to evaluate the semantic similarity between code snippets.
Defect Detection: Identifying vulnerabilities in code using the Devign dataset.
Cloze Test: Assessing the ability to predict masked tokens in code across six programming languages.
Code Completion: Predicting the next tokens (token-level) or lines (line-level) in code using datasets such as PY150 and Github Java Corpus.
Code Translation: Translating code between programming languages, exemplified by a newly curated dataset of Java and C# function pairs.
Code Search: Measuring the semantic relatedness between natural language queries and code using datasets like CodeSearchNet AdvTest and WebQueryTest.
Code Repair: Automatically fixing bugs in code using the Bugs2Fix dataset.
Text-to-Code Generation: Generating code from natural language descriptions using the CONCODE dataset.
Code Summarization: Generating natural language comments for code using the CodeSearchNet dataset.
Documentation Translation: Translating code documentation between different natural languages using the Microsoft Docs dataset.
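
The grouping of the ten tasks into the four categories can be sketched as a small lookup table. This is only an illustrative sketch: the assignment of tasks to categories follows the task overview in the CodeXGLUE paper, while the names `TASKS_BY_CATEGORY` and `category_of` are hypothetical and not part of the benchmark's code.

```python
from typing import Dict, List

# Illustrative mapping of the ten tasks listed above to the four categories.
# The variable and function names are assumptions made for this sketch.
TASKS_BY_CATEGORY: Dict[str, List[str]] = {
    "code-code": [
        "Clone Detection",
        "Defect Detection",
        "Cloze Test",
        "Code Completion",
        "Code Translation",
        "Code Repair",
    ],
    "text-code": ["Code Search", "Text-to-Code Generation"],
    "code-text": ["Code Summarization"],
    "text-text": ["Documentation Translation"],
}


def category_of(task: str) -> str:
    """Return the category (input-output modality pair) a task belongs to."""
    for category, tasks in TASKS_BY_CATEGORY.items():
        if task in tasks:
            return category
    raise KeyError(f"unknown task: {task}")


if __name__ == "__main__":
    for name in ("Clone Detection", "Code Summarization"):
        print(name, "->", category_of(name))
```

Each category name reads as "input-output": a code-text task such as Code Summarization takes code as input and produces natural language text.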

Baseline Models and Experimental Results

CodeXGLUE includes three baseline model frameworks: BERT-style (CodeBERT), GPT-style (CodeGPT), and Encoder-Decoder models. These models facilitate the replication of results and the comparison of new methods against established benchmarks.

CodeXGLUE Task Performance

Clone Detection: CodeBERT achieves an F1 score of 96.5 on the BigCloneBench dataset and a MAP score of 84.29 on POJ-104, outperforming several baseline models and demonstrating its efficacy in capturing semantic similarities.
Defect Detection: CodeBERT achieves an accuracy of 62.08 on the Devign dataset, a modest figure for a binary classification task that leaves considerable room for improvement in identifying vulnerable code.
Cloze Test: CodeBERT significantly outperforms RoBERTa with an overall accuracy score of 85.66 across six programming languages.
Code Completion: CodeGPT-adapted achieves an overall score of 71.28, highlighting its effectiveness in predicting tokens and lines of code.
Code Search: CodeBERT reaches an MRR of 27.19 on CodeSearchNet AdvTest and an F1 score of 58.95 on WebQueryTest, which suggests that both test sets remain challenging.
Text-to-Code Generation: CodeGPT-adapted achieves a CodeBLEU score of 35.98, surpassing other baseline models.
Code Summarization: CodeBERT achieves an overall BLEU score of 17.83 across six programming languages.
Documentation Translation: The pretrained Transformer model initialized with XLM-R achieves an overall BLEU score of 66.16 on multilingual documentation translation.
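
To make the metrics in the list above concrete, here is a minimal sketch of three of them: F1 (clone detection), average precision (the per-query quantity behind MAP), and MRR (code search). These are textbook definitions, not the benchmark's official evaluation scripts, and the function names and toy inputs are assumptions for illustration. In particular, the average precision below assumes every relevant item appears in the ranked list, and the exact MAP variant used by the benchmark may differ.

```python
from typing import Sequence


def f1_score(predicted: Sequence[int], gold: Sequence[int]) -> float:
    """Binary F1: harmonic mean of precision and recall for label 1."""
    tp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(predicted, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(predicted, gold) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision(ranked_relevance: Sequence[int]) -> float:
    """Average precision for one query.

    ranked_relevance[k] is 1 if the item at rank k+1 is relevant, else 0.
    Assumes all relevant items occur somewhere in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_reciprocal_rank(first_hit_ranks: Sequence[int]) -> float:
    """MRR: mean of 1/rank of the first correct result, one rank per query."""
    return sum(1.0 / r for r in first_hit_ranks) / len(first_hit_ranks)


if __name__ == "__main__":
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
    print(average_precision([1, 0, 1]))
    print(mean_reciprocal_rank([1, 2, 4]))
```

MAP is then the mean of `average_precision` over all queries. The scores in the list above are these quantities expressed on a 0 to 100 scale.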

Implications and Future Directions

The introduction of CodeXGLUE is poised to significantly impact both theoretical and practical aspects of code intelligence research. The availability of diverse and high-quality datasets alongside robust baseline models facilitates:

Benchmarking and Comparison: CodeXGLUE provides a consistent framework to evaluate and compare different machine learning models, ensuring that advancements are objectively measured.
Model Development: The inclusion of various code understanding and generation tasks opens avenues for developing more generalized models capable of handling multiple programming tasks simultaneously.
Cross-Disciplinary Research: By providing datasets that intersect multiple programming languages and tasks, CodeXGLUE promotes research that bridges gaps between natural language processing and software engineering domains.

In terms of future developments, there is potential for expanding CodeXGLUE to cover more programming languages and tasks, such as idiom mining, bug localization, and test case generation. Moreover, incorporating structural information from program code, such as ASTs and control flows, into pretrained models like CodeBERT could further enhance their performance.
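
As a rough illustration of what "structural information" from an AST looks like, the sketch below uses Python's standard `ast` module to list the node types in a small function. It is only meant to show the kind of signal that is invisible in a flat token sequence. It does not reflect how CodeBERT or any CodeXGLUE baseline consumes code, and the helper name `ast_node_types` is hypothetical.

```python
import ast
from collections import Counter
from typing import List


def ast_node_types(source: str) -> List[str]:
    """Parse Python source and return the type name of every AST node."""
    tree = ast.parse(source)
    return [type(node).__name__ for node in ast.walk(tree)]


# Toy snippet chosen for illustration only.
snippet = "def add(a, b):\n    return a + b\n"
node_counts = Counter(ast_node_types(snippet))

if __name__ == "__main__":
    for name, count in sorted(node_counts.items()):
        print(name, count)
```

A structure-aware model could, for example, receive such node types or the parent-child edges between them as additional input alongside the source tokens.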

By laying a robust foundation for evaluating program understanding and generation tasks, CodeXGLUE is positioned to drive significant advancements in code intelligence, ultimately contributing to the productivity and efficiency of software development processes.
