DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection (2304.00409v2)

Published 1 Apr 2023 in cs.CR, cs.AI, cs.LG, and cs.SE

Abstract: We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites, extracting vulnerability-fixing commits and source codes from the corresponding projects. Our new dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295 more projects than all previous datasets combined. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rate, low F1 score, and difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. We show that increasing the volume of training data may not further improve the performance of deep learning models for vulnerability detection, but might be useful to improve the generalization ability to unseen projects. We also identify hopeful future research directions. We demonstrate that LLMs are a promising research direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features in our experiments. Moreover, developing source code specific pre-training objectives is a promising research direction to improve the vulnerability detection performance.

Summary

The paper introduces DiverseVul, a large-scale C/C++ vulnerability dataset that covers more projects than all previous datasets combined, giving a broader basis for evaluating deep learning detection models.
It evaluates eleven model architectures and finds that pre-trained large language models outperform GNN-based approaches once ample training data is available.
The findings underscore challenges in model generalization and label noise, prompting further research into code-specific pretraining strategies.

Insights into Deep Learning for Vulnerability Detection: The DiverseVul Dataset

This paper introduces DiverseVul, a comprehensive C/C++ dataset developed for deep learning-based software vulnerability detection. Comprising 18,945 vulnerable functions across 150 Common Weakness Enumerations (CWEs) and 330,492 non-vulnerable functions derived from 7,514 commits, DiverseVul surpasses previous datasets in both scale and diversity. This dataset offers a larger scope for evaluating deep learning techniques in identifying vulnerabilities in software projects, a task of critical importance given the potential for cybercrime and financial damage resulting from software flaws.

Dataset Details and Methodology

DiverseVul was built by crawling security issue websites to identify vulnerability-fixing commits, then extracting source code from the corresponding open-source projects. Notably, it covers 295 more projects than all previous datasets combined, which increases both the size of the dataset and the diversity of the code it contains. Labels are derived from the commits themselves: the pre-commit version of each function changed by a vulnerability-fixing commit is labeled vulnerable, while the post-commit versions and all functions the commit left unchanged are labeled non-vulnerable. This approach mirrors the methodology used in other datasets such as CVEFixes, but on a significantly broader scale.
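The labeling rule can be illustrated with a minimal Python sketch. This is not the authors' pipeline: `LabeledFunction` and `label_commit` are hypothetical names, and the sketch assumes functions have already been extracted into dictionaries mapping function name to source text before and after a single vulnerability-fixing commit.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledFunction:
    """Hypothetical record for one labeled function."""
    name: str
    body: str
    vulnerable: bool


def label_commit(before: dict[str, str], after: dict[str, str]) -> list[LabeledFunction]:
    """Label the functions touched (or not) by one vulnerability-fixing commit.

    `before` and `after` map function name -> source text before and after
    the commit. Both are assumed inputs; real extraction needs a C/C++ parser.
    """
    labeled = []
    for name, old_body in before.items():
        new_body = after.get(name)
        if new_body is None:
            # Function deleted by the commit: this summary does not say how
            # such cases are handled, so the sketch skips them.
            continue
        if new_body != old_body:
            # Changed function: pre-commit version is vulnerable,
            # post-commit version is non-vulnerable.
            labeled.append(LabeledFunction(name, old_body, True))
            labeled.append(LabeledFunction(name, new_body, False))
        else:
            # Unchanged function: non-vulnerable.
            labeled.append(LabeledFunction(name, old_body, False))
    return labeled
```

Note that this rule marks every changed function as vulnerable, including functions changed for reasons unrelated to the fix, which is one way label noise can enter a dataset built this way.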

Evaluation of Deep Learning Models

With this dataset, the paper evaluates how well different deep learning models detect vulnerabilities. Eleven model architectures were considered, drawn from four families: Graph Neural Networks (GNNs), RoBERTa, GPT-2, and T5. The results indicate that LLMs, particularly models such as CodeT5 and NatGen that are pre-trained on code with code-specific objectives, outperform the state-of-the-art GNN-based model ReVeal. Notably, the performance edge of LLMs becomes apparent only with a significant increase in training data volume, which emphasizes the importance of large datasets like DiverseVul. When previous datasets are augmented with DiverseVul, F1 scores increase across several model architectures.
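For reference, the F1 score discussed here and the false positive rate cited in the abstract can both be computed from binary predictions. The following is a plain-Python sketch, with the hypothetical helper name `detection_metrics` and label 1 assumed to mean "vulnerable"; it is not the paper's evaluation code.

```python
def detection_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and false positive rate.

    y_true and y_pred are equal-length sequences of 0/1 labels,
    where 1 is assumed to mean "vulnerable".
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Fraction of non-vulnerable functions that were flagged as vulnerable.
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "fpr": fpr}
```

Because non-vulnerable functions vastly outnumber vulnerable ones, even a small false positive rate can translate into many false alarms, which is why both metrics matter here.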

Implications and Future Directions

Despite these improvements, the paper concludes that current deep learning models are not yet ready for deployment in real-world vulnerability detection, because of high false positive rates and low F1 scores. Models trained on existing data generalize poorly to unseen projects, which marks a critical area for development. These findings underline the importance of advancing pretraining strategies tailored to source code and of further work on architectures that generalize better across diverse projects.
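Measuring generalization to unseen projects requires that no project contributes functions to both the training and test sets. The sketch below shows one way to build such a split; the function name `split_by_project` and the `"project"` key on each sample are assumptions for illustration, not the dataset's actual schema or the paper's exact protocol.

```python
import random


def split_by_project(samples, test_fraction=0.2, seed=0):
    """Split samples so that train and test share no project.

    Each sample is assumed to be a dict with a "project" key.
    """
    projects = sorted({s["project"] for s in samples})
    rng = random.Random(seed)
    rng.shuffle(projects)

    # Hold out whole projects rather than individual functions.
    n_test = max(1, round(len(projects) * test_fraction))
    test_projects = set(projects[:n_test])

    train = [s for s in samples if s["project"] not in test_projects]
    test = [s for s in samples if s["project"] in test_projects]
    return train, test
```

A random split over functions, by contrast, lets functions from the same project appear on both sides, which tends to overstate how well a model will do on a codebase it has never seen.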

Addressing Challenges

The paper acknowledges a significant challenge in generalizing to unseen data, a scenario frequently encountered in practical deployment. An attempt to address this through various weighting schemes, including class weights, yielded only modest improvements in generalization. This observation suggests that further innovation is required before models can reliably recognize vulnerabilities in diverse and previously unseen software projects.
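This summary does not specify the exact weighting schemes, so the following sketch shows only one common choice, inverse-frequency class weights applied to a binary cross-entropy loss, as an assumption. The helper names are hypothetical; the class counts are the dataset totals quoted above.

```python
import math


def inverse_frequency_weights(counts):
    """Return weights proportional to 1 / class frequency.

    counts maps class label -> number of samples. With this scheme every
    class contributes the same total weight to the loss.
    """
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


def weighted_bce(y_true, p_pred, weights, eps=1e-12):
    """Mean binary cross-entropy with a per-class weight on each sample.

    p_pred holds predicted probabilities of the positive class (label 1).
    """
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
    return total / len(y_true)


# Dataset totals from the summary: 1 = vulnerable, 0 = non-vulnerable.
weights = inverse_frequency_weights({1: 18945, 0: 330492})
```

With roughly one vulnerable function for every seventeen non-vulnerable ones, an unweighted loss is dominated by the majority class; weighting raises the cost of missing a vulnerable function at the risk of more false positives.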

Label Noise and Dataset Accuracy

DiverseVul, like similar datasets, suffers from label noise, which can skew evaluation results. The paper presents a critical assessment of label accuracy, finding that vulnerability-fixing commits also change functions that are not themselves vulnerable, and that these changes produce erroneous labels. The labeling analysis indicates a need for improved labeling techniques or post-processing strategies that can mitigate these errors and make the dataset more reliable.

Conclusion

DiverseVul represents a significant step toward more robust research into vulnerability detection by providing a larger and more diverse dataset. It opens new avenues for examining the efficacy of LLMs and the impact of dataset diversity. The paper challenges researchers to close the generalization gap and to exploit code-specific pretraining to drive future advances. The release of DiverseVul to the broader community is expected to facilitate further research and innovation in this critical domain.
