Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks

Published 8 Sep 2019 in cs.SE, cs.CR, cs.LG, and stat.ML | (1909.03496v1)

Abstract: Vulnerability identification is crucial to protect the software systems from attacks for cyber security. It is especially important to localize the vulnerable functions among the source code to facilitate the fix. However, it is a challenging and tedious process, and also requires specialized security expertise. Inspired by the work on manually-defined patterns of vulnerabilities from various code representation graphs and the recent advance on graph neural networks, we propose Devign, a general graph neural network based model for graph-level classification through learning on a rich set of code semantic representations. It includes a novel Conv module to efficiently extract useful features in the learned rich node representations for graph-level classification. The model is trained over manually labeled datasets built on 4 diversified large-scale open-source C projects that incorporate high complexity and variety of real source code instead of synthesis code used in previous works. The results of the extensive evaluation on the datasets demonstrate that Devign outperforms the state of the arts significantly with an average of 10.51% higher accuracy and 8.68% F1 score, increases averagely 4.66% accuracy and 6.37% F1 by the Conv module.

Authors (5)

Citations (665)

Summary

The paper introduces Devign, a model that transforms source code into joint graphs to capture comprehensive program semantics for vulnerability detection.
It applies a novel Conv module to the node representations produced by gated recurrent units (GRUs) to extract higher-level features, and reports on average 10.51% higher accuracy and 8.68% higher F1 score than the baselines.
The work demonstrates the potential of GNNs in automating vulnerability detection, reducing manual analysis and shaping future cybersecurity research.

An Overview of Devign: Effective Vulnerability Identification Using Graph Neural Networks

The paper "Devign: Effective Vulnerability Identification by Learning Comprehensive Program Semantics via Graph Neural Networks" applies graph neural networks (GNNs) to vulnerability detection in source code. Its model, Devign, works at the function level: each function is classified as vulnerable or not. The model learns from several code representation graphs at once and reports higher accuracy and F1 score than prior methods on real-world C projects.

Core Contribution

The authors introduce Devign, a GNN-based model that treats vulnerability identification as graph-level classification. Source code is converted into a single graph that combines multiple semantic representations, so that the model can pick up patterns spanning syntax, control flow, and data flow that may indicate a vulnerability. The main elements are:

Graph Construction: Code is transformed into a joint graph that integrates Abstract Syntax Trees (ASTs), Control Flow Graphs (CFGs), Data Flow Graphs (DFGs), and Natural Code Sequences (NCS). Together these representations cover syntax, control dependencies, and data dependencies, as well as the order in which the code appears in the source.
Conv Module: A novel Conv module extracts features from the node representations generated by gated recurrent units (GRUs). It selects the higher-level representations that are relevant to graph-level classification.
Performance and Evaluation: The model was trained and evaluated on manually labeled datasets built from four large-scale open-source C projects. Devign outperformed existing state-of-the-art models, with on average 10.51% higher accuracy and 8.68% higher F1 score than the baseline methods. The Conv module alone accounted for an average gain of 4.66% in accuracy and 6.37% in F1 score.
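To make the joint graph concrete, the following is a minimal sketch of one way to store a single node set with four typed edge sets. It is not the authors' implementation: the function names `build_joint_graph` and `neighbours_by_type` are hypothetical, and the nodes and edges are written by hand for a two-statement toy snippet rather than produced by a real parser.

```python
from collections import defaultdict

# The four edge types named in the text.
EDGE_TYPES = ("AST", "CFG", "DFG", "NCS")


def build_joint_graph(nodes, typed_edges):
    """Return one adjacency map per edge type over a shared node set.

    nodes: list of node labels.
    typed_edges: iterable of (edge_type, src_index, dst_index) triples.
    """
    adjacency = {t: defaultdict(list) for t in EDGE_TYPES}
    for edge_type, src, dst in typed_edges:
        if edge_type not in adjacency:
            raise ValueError(f"unknown edge type: {edge_type}")
        if not (0 <= src < len(nodes) and 0 <= dst < len(nodes)):
            raise IndexError(f"edge ({src}, {dst}) references a missing node")
        adjacency[edge_type][src].append(dst)
    return adjacency


def neighbours_by_type(adjacency, node):
    """List a node's outgoing neighbours, grouped by edge type."""
    return {t: list(adjacency[t].get(node, [])) for t in EDGE_TYPES}


# Hand-written toy graph (an assumption, for illustration only) for:
#     int x = read();
#     buf[x] = 0;
nodes = [
    "body",                # 0: function body
    "decl x = read()",     # 1: first statement
    "assign buf[x] = 0",   # 2: second statement
    "x",                   # 3: declared variable
    "read()",              # 4: call expression
    "buf[x]",              # 5: indexed write
    "0",                   # 6: literal
]
edges = [
    # Syntax: parent to child.
    ("AST", 0, 1), ("AST", 0, 2),
    ("AST", 1, 3), ("AST", 1, 4),
    ("AST", 2, 5), ("AST", 2, 6),
    # Control flow: first statement executes before the second.
    ("CFG", 1, 2),
    # Data flow: x is defined in the declaration and used as an index.
    ("DFG", 3, 5),
    # Natural code sequence: leaves in source order.
    ("NCS", 3, 4), ("NCS", 4, 5), ("NCS", 5, 6),
]

graph = build_joint_graph(nodes, edges)
print(neighbours_by_type(graph, 3))
# {'AST': [], 'CFG': [], 'DFG': [5], 'NCS': [4]}
```

Keeping the edge types separate, rather than merging them into one adjacency list, lets a model treat a data-flow neighbour differently from a syntactic one. Here node 3 (`x`) has no syntactic children, yet it is still connected to the indexed write through a data-flow edge.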

Implications and Future Work

The research has practical implications for automated vulnerability detection in software engineering: it offers a tool that reduces reliance on manual analysis, which is slow and requires specialized security expertise. By encoding comprehensive program semantics in graphs, Devign illustrates how GNNs can support software security work.

From a theoretical perspective, the study contributes to machine learning more broadly by showing that GNNs apply beyond their typical use cases, extending them to code analysis. The use of composite graphs and the Conv module may inspire further work on graph-level prediction tasks.
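As an illustration of what a graph-level prediction involves, the sketch below reduces a matrix of per-node features to a single score using a 1-D convolution, ReLU, max pooling, a linear layer, averaging, and a sigmoid. It is a simplified stand-in written in plain Python, not the paper's Conv module: the function names, kernel, weights, and pooling size are all assumptions chosen for the example, and nothing here is trained.

```python
import math


def conv1d_relu(rows, kernel):
    """Slide a kernel along the node axis, per feature column, then apply ReLU."""
    k = len(kernel)
    width = len(rows[0])
    return [
        [
            max(0.0, sum(kernel[j] * rows[i + j][c] for j in range(k)))
            for c in range(width)
        ]
        for i in range(len(rows) - k + 1)
    ]


def max_pool(rows, size):
    """Non-overlapping max pooling along the node axis."""
    return [
        [max(col) for col in zip(*rows[i:i + size])]
        for i in range(0, len(rows) - size + 1, size)
    ]


def graph_score(node_features, kernel, weights, bias=0.0, pool=2):
    """Map a (num_nodes x dim) feature matrix to one score in (0, 1)."""
    pooled = max_pool(conv1d_relu(node_features, kernel), pool)
    if not pooled:
        raise ValueError("too few nodes for this kernel and pooling size")
    # Linear layer per pooled row, averaged over rows, squashed by a sigmoid.
    per_row = [sum(w * x for w, x in zip(weights, row)) + bias for row in pooled]
    mean = sum(per_row) / len(per_row)
    return 1.0 / (1.0 + math.exp(-mean))


# Made-up features for a graph with five nodes and two feature dimensions.
features = [
    [1.0, 0.0],
    [0.0, 2.0],
    [3.0, 1.0],
    [1.0, 1.0],
    [0.0, 0.0],
]
score = graph_score(features, kernel=[0.5, 0.5], weights=[1.0, -1.0])
print(round(score, 4))
```

The point of the design is that the number of nodes varies from function to function, while the classifier needs one fixed-size output per graph. Pooling and averaging along the node axis provide that reduction.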

Looking forward, potential developments could involve refining Devign for scalable, real-world deployment and enhancing its adaptability to other programming languages and code structures. Additionally, research might explore integrating program slicing to handle larger functions more efficiently.

In conclusion, Devign is a step towards more automated and reliable vulnerability detection in software, built on graph-based deep learning. It is relevant to both academic research and practical applications in cybersecurity.
