Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities (2311.16169v3)

Published 16 Nov 2023 in cs.CR, cs.PL, and cs.SE

Abstract: While automated vulnerability detection techniques have made promising progress in detecting security vulnerabilities, their scalability and applicability remain challenging. The remarkable performance of LLMs, such as GPT-4 and CodeLlama, on code-related tasks has prompted recent works to explore whether LLMs can be used to detect vulnerabilities. In this paper, we perform a more comprehensive study by concurrently examining a larger number of datasets, languages, and LLMs, and qualitatively evaluating performance across prompts and vulnerability classes while addressing the shortcomings of existing tools. Concretely, we evaluate the effectiveness of 16 pre-trained LLMs on 5,000 code samples from five diverse security datasets. These balanced datasets encompass both synthetic and real-world projects in Java and C/C++ and cover 25 distinct vulnerability classes. Overall, LLMs across all scales and families show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets. They are significantly better at detecting vulnerabilities that require only intra-procedural analysis, such as OS Command Injection and NULL Pointer Dereference. Moreover, they report higher accuracies on these vulnerabilities than popular static analysis tools, such as CodeQL. We find that advanced prompting strategies involving step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average). Interestingly, we observe that LLMs show promising abilities at performing parts of the analysis correctly, such as identifying vulnerability-related specifications and leveraging natural language information to understand code behavior (e.g., to check if code is sanitized). We expect our insights to guide future work on LLM-augmented vulnerability detection systems.

Citations (25)

Summary

  • The paper presents a comprehensive evaluation of LLMs (GPT-4, GPT-3.5, CodeLlama) for detecting code vulnerabilities using varied prompting techniques.
  • It shows that dataflow-based prompting and self-reflection enhance explainability and recall on synthetic datasets, though performance drops on real-world code.
  • The study highlights LLMs as complementary tools to static analysis, while addressing challenges of context limitations and overfitting in fine-tuning.

LLMs for Security Vulnerability Detection: A Comprehensive Evaluation

Introduction

This paper presents a systematic evaluation of LLMs, such as GPT-4 and CodeLlama, for the task of detecting security vulnerabilities in source code. The paper benchmarks LLMs against an established static analysis tool (CodeQL) and a deep learning-based detector (LineVul) across five diverse datasets, spanning both synthetic and real-world code in Java and C/C++. The work further investigates the impact of prompting strategies, model fine-tuning, and adversarial code transformations on detection performance, with a focus on explainability and robustness.

Methodology and Experimental Design

The evaluation encompasses five datasets: OWASP (Java, synthetic), Juliet Java (synthetic), Juliet C/C++ (synthetic), CVEFixes Java (real-world), and CVEFixes C/C++ (real-world). The LLMs assessed include GPT-4, GPT-3.5, and CodeLlama (7B and 13B parameter variants). The paper introduces four prompting strategies:

  • Basic Prompt: Asks if a snippet is vulnerable.
  • CWE-Specific Prompt: Asks about a specific CWE.
  • Dataflow Analysis-Based Prompt: Instructs the model to identify sources, sinks, and sanitizers, simulating static taint analysis.
  • Dataflow Analysis with Self-Reflection: Adds a self-validation step, prompting the model to critique and potentially revise its own analysis (a minimal sketch of both prompts follows this list).
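
The exact prompt wording is not reproduced in this summary; the Java sketch below is a minimal, hypothetical rendering of the dataflow analysis-based prompt and its self-reflection follow-up. The class and method names (VulnPrompts, dataflowPrompt, selfReflectionPrompt) and the template text are invented for illustration.

```java
// Minimal, hypothetical sketch of the dataflow-analysis prompt and the self-reflection
// follow-up; the template wording is an assumption, not the paper's exact prompt text.
public final class VulnPrompts {

    /** Prompt asking the model to emulate taint analysis on a single snippet. */
    public static String dataflowPrompt(String cwe, String code) {
        return """
            You are a security analyst performing taint analysis.
            1. List the sources of untrusted input in the code below.
            2. List the security-sensitive sinks relevant to %s.
            3. List any sanitizers applied between sources and sinks.
            4. Report every unsanitized source-to-sink path.
            5. Conclude with "vulnerable" or "not vulnerable" and a one-line justification.

            Code:
            %s
            """.formatted(cwe, code);
    }

    /** Follow-up that asks the model to critique and possibly revise its first answer. */
    public static String selfReflectionPrompt(String firstAnswer) {
        return """
            Review your previous analysis below. Check each reported source, sink,
            sanitizer, and path for mistakes, then confirm or revise your verdict.

            Previous analysis:
            %s
            """.formatted(firstAnswer);
    }
}
```

Structuring the response around sources, sinks, sanitizers, and unsanitized paths is what yields the explicit reasoning chains discussed in the results below.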

Comparisons are made to CodeQL (static analysis) and LineVul (transformer-based vulnerability detection). Metrics include accuracy, precision, recall, and F1 score. The paper also examines the effect of adversarial code modifications (dead code injection, variable renaming, branch insertion) and the generalizability of fine-tuned models.
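
For reference, these metrics follow their standard definitions (standard formulas, not reproduced from the paper):

```latex
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
```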

Key Results

Performance on Synthetic vs. Real-World Datasets

LLMs, particularly GPT-4 with dataflow-based prompting and self-reflection, achieve high F1 scores on synthetic datasets: 0.79 (OWASP), 0.86 (Juliet Java), and 0.89 (Juliet C/C++). On real-world datasets, performance drops: 0.48 (CVEFixes Java) and 0.62 (CVEFixes C/C++). The performance gap is attributed to the self-contained nature of synthetic samples versus the contextual dependencies in real-world code, where relevant information may be outside the provided snippet.

Comparative Analysis: LLMs vs. Static and Deep Learning Tools

GPT-4 outperforms CodeQL on OWASP and Juliet C/C++ (by 0.05 and 0.29 F1, respectively), but not on Juliet Java. Notably, GPT-4 detects 416 OS Command Injection vulnerabilities in Juliet C/C++ that CodeQL misses due to incomplete sink specifications (Figure 1).

Figure 1: CodeQL fails to detect an OS Command Injection vulnerability due to missing sink specification, while GPT-4 correctly identifies both the source and sink and explains the lack of sanitization.
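
The figure itself is not reproduced here. As a rough illustration of the source-sink reasoning it describes, the hedged Java sketch below shows the CWE-78 pattern: an untrusted value (source) reaches a shell invocation (sink) with no sanitizer in between. The class and method names are invented; this is not the actual Juliet C/C++ sample from the figure.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

// Illustrative OS Command Injection (CWE-78) pattern, not the sample shown in Figure 1.
public class PingService {

    // 'host' stands in for an untrusted value, e.g. an HTTP request parameter (the source).
    public void ping(String host) throws Exception {
        // Sink: the attacker-controlled value reaches a shell without sanitization,
        // so an input like "127.0.0.1; rm -rf /" executes an arbitrary command.
        Process p = Runtime.getRuntime()
                .exec(new String[] {"sh", "-c", "ping -c 1 " + host});

        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            out.lines().forEach(System.out::println);
        }
    }
}
```

A correct dataflow-style answer would identify host as the source, the exec call as the sink, and note that no sanitization or allow-listing is applied on the path between them.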

LineVul achieves perfect F1 on Juliet C/C++ but only 0.51 on CVEFixes C/C++, whereas GPT-4 and CodeLlama-7B reach 0.60 and 0.65, respectively, on the latter. Unlike LLMs, LineVul provides no interpretable explanations.

Prompting Strategies and Explainability

Prompt engineering is critical. The dataflow analysis-based prompt consistently improves recall and interpretability, enabling the model to output explicit reasoning chains (sources, sinks, sanitizers, unsanitized paths). The self-reflection step further prunes false positives, especially in synthetic datasets, but can reduce recall in real-world settings due to the model's tendency to abstain when context is missing.

Robustness to Adversarial Attacks

LLMs exhibit mild degradation under adversarial code transformations: up to 12.67% reduction in accuracy for branch insertion, 11% for dead code injection, and 4.33% for variable renaming. This suggests that LLMs are not simply memorizing training data and retain some robustness to superficial code changes.
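
To make these transformations concrete, the sketch below applies all three to a small, benign Java method. The paper applies such perturbations to vulnerable benchmark samples; this snippet is purely illustrative.

```java
// Illustrative only: a benign method and a semantically equivalent variant after the
// three perturbations studied in the paper (variable renaming, dead code injection,
// and insertion of a never-taken branch). Not drawn from the paper's benchmarks.
public class AdversarialTransformDemo {

    // Original snippet.
    static int sum(int[] values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Transformed variant: behavior is unchanged, but surface features differ.
    static int sumTransformed(int[] arr0) {     // renaming: values -> arr0, total -> acc
        int acc = 0;
        int unused = arr0.length * 2;           // dead code: computed but never read
        if (arr0.length < 0) {                  // inserted branch: condition is never true
            acc = -1;
        }
        for (int x : arr0) {
            acc += x;
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] xs = {1, 2, 3};
        System.out.println(sum(xs) == sumTransformed(xs)); // prints true
    }
}
```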

Fine-Tuning and Generalization

Fine-tuning smaller models (GPT-3.5, CodeLlama-7B) on synthetic datasets yields substantial gains, sometimes surpassing GPT-4. However, fine-tuning on real-world datasets provides limited improvement and poor cross-dataset generalization, indicating overfitting to dataset-specific patterns.

Analysis of Vulnerability Classes

LLMs perform best on vulnerabilities that are local and self-contained, such as Out-of-Bounds Read/Write (CWE-125, CWE-787), Null Pointer Dereference (CWE-476), and Integer Overflow (CWE-190). Performance is notably lower for vulnerabilities requiring broader context, such as Improper Authentication (CWE-287) and Path Traversal (CWE-22), especially in real-world datasets.
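
As a hedged illustration of what "local and self-contained" means here, the invented snippet below contains a CWE-476 defect whose verdict requires only the visible method body, whereas a class like CWE-287 typically hinges on authentication logic spread across the project and is therefore harder to judge from a single snippet.

```java
// Illustrative CWE-476 (NULL Pointer Dereference): every fact needed for a verdict is
// visible within the method, the kind of localized case on which LLMs performed best.
public class ConfigReader {

    static String readTimeout(java.util.Properties props) {
        String raw = props.getProperty("timeout");   // returns null when the key is absent
        return raw.trim();                           // possible null dereference (CWE-476)
    }
}
```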

Implications and Future Directions

Practical Implications

  • LLMs as Complementary Tools: LLMs can detect vulnerabilities missed by static analysis, particularly when specifications are incomplete or APIs are not modeled. Their ability to provide human-readable explanations is a significant advantage for developer adoption and debugging.
  • Prompt Engineering: Dataflow-inspired prompts and self-reflection mechanisms are essential for eliciting reliable and interpretable predictions from LLMs.
  • Limitations in Real-World Contexts: The inability to access non-local context in real-world code limits LLM effectiveness. Integrating LLMs with static analysis to extract relevant context or using retrieval-augmented generation may address this.

Theoretical Implications

  • Emergent Reasoning: The results support the hypothesis that LLMs exhibit emergent reasoning abilities for code analysis when prompted appropriately, but these abilities are bounded by context and model scale.
  • Generalization Challenges: Fine-tuning on narrow distributions leads to overfitting, highlighting the need for more diverse and context-rich training data for robust vulnerability detection.

Recommendations for Future Research

  • Hybrid Systems: Combining LLMs with static analysis tools to leverage the strengths of both approaches for broader vulnerability coverage.
  • Contextual Augmentation: Developing methods to provide LLMs with relevant project-level context, possibly via static analysis or code search.
  • Dataset Curation: Improving the quality and granularity of real-world vulnerability datasets, with accurate labeling and richer context, to better benchmark and train detection systems.
  • Extrinsic Feedback: Exploring new forms of feedback (e.g., program analysis-based) to further reduce false positives and improve model calibration.

Conclusion

This paper demonstrates that LLMs, when equipped with carefully designed prompts, can match or exceed the performance of traditional static and deep learning-based vulnerability detection tools on synthetic benchmarks, and provide valuable explanations for their predictions. However, significant challenges remain in scaling these capabilities to real-world code, particularly due to context limitations and dataset quality. The integration of LLMs with static analysis and the development of improved datasets and prompting strategies represent promising avenues for advancing automated vulnerability detection.

Practical Applications

Immediate Applications

Below is a curated set of practical use cases that can be deployed now, leveraging the paper’s findings on LLM-based vulnerability detection, explainability, and prompting strategies.

  • IDE and code review assistant for CWE-aware, dataflow-driven vulnerability explanations
    • Sectors: software, cybersecurity
    • Tools/products/workflows: IDE plugin or PR bot that applies the Dataflow analysis-based prompt to provide source–sink–sanitizer reasoning, flags likely CWEs, and suggests mitigations (a minimal skeleton of this workflow is sketched after this list).
    • Why now: The paper shows GPT-4 and CodeLlama deliver usable F1 scores on synthetic datasets and produce explainable dataflow traces for many classes (e.g., OS command injection).
    • Assumptions/dependencies: Code snippets must fit within context windows; best performance for localized vulnerabilities; cloud LLM use requires code privacy controls (or on-prem models).
  • CI/CD pre-commit and pull-request scanning for Top-25 CWEs
    • Sectors: software, DevSecOps
    • Tools/products/workflows: Pipeline step that runs CWE-specific and Dataflow analysis-based prompts, posts findings as annotations; integrates with GitHub/GitLab.
    • Why now: The prompting strategies yield competitive detection and interpretable outputs without needing full project builds.
    • Assumptions/dependencies: Token/cost budgeting; rate limits; false positives/negatives must be managed with thresholds and human triage; limited context harms inter-procedural detection.
  • Hybrid SAST augmentation (LLM + static analysis) for coverage gaps
    • Sectors: cybersecurity tooling
    • Tools/products/workflows: Use LLMs to (a) triage static analyzer findings, (b) propose missing sinks/sources/sanitizers (e.g., exec variants), and (c) prioritize high-risk classes (like CWE-78).
    • Why now: Paper shows LLMs caught many OS command injection cases that CodeQL missed due to incomplete sink specifications.
    • Assumptions/dependencies: Human-in-the-loop validation; ongoing spec curation; careful handling to avoid new false positives.
  • Security documentation and developer guidance generation
    • Sectors: software, education
    • Tools/products/workflows: Automatically generate human-readable explanations for flagged issues (CWE mapping, dataflow paths), code snippets for sanitization, and secure patterns.
    • Why now: LLM explanations are already reliable for many classes and improve developer understanding.
    • Assumptions/dependencies: Explanations need review; tailor guidance to organizational code standards and libraries.
  • Vulnerability triage for CVE-linked commits in open-source maintenance
    • Sectors: open-source, software maintenance
    • Tools/products/workflows: Classify diffs by CWE; summarize root cause and recommended fixes; help maintainers quickly assess and respond.
    • Why now: LLMs can classify and explain at the method level without full build contexts.
    • Assumptions/dependencies: Commit diffs must be clean and representative; caution for complex multi-file changes.
  • Targeted scanning of “hard-to-build” or monolithic systems
    • Sectors: embedded, energy/ICS, robotics
    • Tools/products/workflows: Apply LLM prompts to partial code or components where building or complete analysis is impractical.
    • Why now: LLMs do not require compilation and can reason over localized vulnerabilities.
    • Assumptions/dependencies: Reduced accuracy for non-local context; need secure deployment (on-premise) in regulated environments.
  • Developer training and secure coding coursework
    • Sectors: education, workforce development
    • Tools/products/workflows: Course modules pairing common CWEs with example prompts, dataflow-based analyses, and adversarial variations to teach resilience; interactive exercises in courses.
    • Why now: The paper provides prompt patterns and measured performance across datasets, enabling immediate curriculum design.
    • Assumptions/dependencies: Curate examples; ensure correctness of explanations; integrate assessments.
  • Compliance and audit support with explainable findings
    • Sectors: finance, healthcare, government
    • Tools/products/workflows: Generate CWE-tagged, explainable vulnerability reports for audits and SDLC checkpoints; map findings to policy controls.
    • Why now: Explainability makes results defensible and useful for compliance narratives.
    • Assumptions/dependencies: Governance for usage of AI outputs; SOPs to reconcile disagreements with traditional tools; privacy controls.
  • Bug bounty reconnaissance and triage enhancement
    • Sectors: cybersecurity services, bounty programs
    • Tools/products/workflows: Use LLM scanners to surface likely CWE-prone areas and produce human-readable descriptions of potentially exploitable paths for deeper manual verification.
    • Why now: Strong detection in certain CWE classes; fast explainability reduces triage overhead.
    • Assumptions/dependencies: Legal/ethical constraints; no blind trust in AI outputs; keep human validation.
  • Academic experimentation and benchmarking
    • Sectors: academia, research
    • Tools/products/workflows: Reproduce and extend the paper’s evaluations and prompt variants; measure robustness to adversarial code; compare across languages.
    • Why now: Datasets (OWASP, Juliet, CVEFixes) and prompt designs are readily applicable to research setups.
    • Assumptions/dependencies: Access to LLMs; careful handling of data leakage and adversarial conditions.
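
As a sketch of the IDE/PR-bot and CI scanning workflows in the first two items above, the hypothetical Java skeleton below scans changed files with the dataflow prompt from the Methodology section. LlmClient is an assumed interface, not a real SDK; chunking for context limits, PR annotation posting, and human triage are left as comments.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical skeleton of the "PR bot / CI step" workflow: scan changed files with the
// dataflow-analysis prompt (VulnPrompts sketch above) and collect CWE-tagged verdicts.
public class PrScanBot {

    /** Assumed single-call LLM interface; wire it to an on-prem or hosted model. */
    interface LlmClient {
        String complete(String prompt);
    }

    private final LlmClient llm;

    PrScanBot(LlmClient llm) {
        this.llm = llm;
    }

    /** Scans each changed file for one target CWE and returns raw model verdicts. */
    List<String> scan(List<Path> changedFiles, String targetCwe) {
        return changedFiles.stream()
                .map(path -> {
                    try {
                        String code = Files.readString(path);
                        String prompt = VulnPrompts.dataflowPrompt(targetCwe, code);
                        // In practice: truncate or chunk files that exceed the context window,
                        // post findings as PR annotations, and route them to human triage.
                        return path + ": " + llm.complete(prompt);
                    } catch (Exception e) {
                        return path + ": scan failed (" + e.getMessage() + ")";
                    }
                })
                .toList();
    }
}
```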

Long-Term Applications

The applications below will benefit from further research, scaling, or productization to reach reliable, inter-procedural, enterprise-grade performance.

  • End-to-end hybrid analysis platforms with code-aware retrieval and inter-procedural reasoning
    • Sectors: software, cybersecurity tooling
    • Tools/products/workflows: Combine LLM dataflow reasoning with static analysis, code indexing/RAG, and symbolic execution to gather non-local context across files and services.
    • Why later: Paper shows performance drops on real-world samples due to missing broader context; RAG + analysis is needed for scale.
    • Assumptions/dependencies: Scalable code indexing; orchestration of multiple analyzers; cost and latency tuning; robust evaluation.
  • Automated API specification generation for static analyzers (source/sink/sanitizer discovery)
    • Sectors: cybersecurity tooling
    • Tools/products/workflows: “SpecBot” that continuously mines projects/libraries to propose and update CodeQL-like specifications; includes confidence scoring and review workflows.
    • Why later: LLMs demonstrated knowledge of missing sinks (e.g., exec variants); turning this into maintainable specs needs verification and governance.
    • Assumptions/dependencies: Human curation; reproducible tests; audit trails; integration with rule repositories.
  • AI-assisted auto-remediation and patch synthesis validated by tests and analysis
    • Sectors: software
    • Tools/products/workflows: Generate sanitized code variants; auto-create unit/integration tests; verify with static analysis and fuzzing before merge.
    • Why later: Requires high-precision fix generation and multi-tool validation; reduces risk while scaling remediation.
    • Assumptions/dependencies: Test generation quality; rollback strategies; performance and correctness regressions monitoring.
  • Domain-specific, on-prem LLMs for sensitive codebases
    • Sectors: finance, healthcare, defense, critical infrastructure
    • Tools/products/workflows: Fine-tuned models for in-house stacks (Java/C/C++/Rust) with hardened privacy/security controls and adversarial robustness features.
    • Why later: Paper shows fine-tuning smaller models helps synthetic tasks but generalization is limited; robust enterprise-grade models need curated data and red-teaming.
    • Assumptions/dependencies: High-quality domain datasets; strong MLOps; adversarial testing; cost of training/serving.
  • LLM-guided fuzzing and driver generation
    • Sectors: software, cybersecurity
    • Tools/products/workflows: Use LLMs to generate fuzz drivers, seed inputs, and harness code targeting discovered dataflow paths; integrate with coverage-guided fuzzers.
    • Why later: Bridges scale limitations of fuzzing with AI-generated drivers informed by vulnerability reasoning.
    • Assumptions/dependencies: Stable interfaces to fuzzers; validation of generated harnesses; resource-intensive pipelines.
  • Security supply chain risk scoring integrated with SBOMs
    • Sectors: software supply chain, policy/compliance
    • Tools/products/workflows: LLM-based vulnerability scanning across dependencies; CWE mapping to SBOM artifacts; risk dashboards for procurement and compliance.
    • Why later: Needs broad ecosystem data and standardized reporting.
    • Assumptions/dependencies: Data freshness; SBOM interoperability; governance and audits.
  • Standards and governance for explainable AI SAST
    • Sectors: policy, regulators, consortia
    • Tools/products/workflows: Guidance on acceptable use, explainability requirements, metrics (precision/recall/F1), and model robustness for vulnerability detection in regulated industries.
    • Why later: Multi-stakeholder alignment required; field evidence to inform policy.
    • Assumptions/dependencies: Industry buy-in; harmonization with existing secure SDLC standards; independent evaluations.
  • Robust adversarial defense and evaluation frameworks for code analysis
    • Sectors: academia, cybersecurity
    • Tools/products/workflows: Standardized testbeds with dead-code injection, variable renaming, branch manipulation; automated resilience scoring and hardening techniques.
    • Why later: Paper shows mild degradation (up to ~12.67% accuracy reduction); formalizing defenses improves trust.
    • Assumptions/dependencies: Shared benchmarks; community adoption; continual updates as attacks evolve.
  • Cross-language expansion (Python, JavaScript, Rust, Go) and multi-paradigm coverage
    • Sectors: software
    • Tools/products/workflows: Extend CWE-aware prompts and dataflow reasoning across languages and frameworks; support web, systems, and embedded stacks.
    • Why later: Requires language-specific sources/sinks/sanitizers and larger context handling.
    • Assumptions/dependencies: Language-specific datasets; tooling integration; generalization across paradigms.
  • Interactive education platforms with adversarial code labs
    • Sectors: education, workforce development
    • Tools/products/workflows: Hands-on labs where learners see how attacks (e.g., dead-code injection) affect detection, and practice improving prompts/analysis.
    • Why later: Needs curated content, instrumentation, and grading/autofeedback systems at scale.
    • Assumptions/dependencies: Platform development; reliable scoring mechanisms; alignment with industry skills.
