
How Far Have We Gone in Vulnerability Detection Using Large Language Models (2311.12420v3)

Published 21 Nov 2023 in cs.AI, cs.CL, and cs.CR

Abstract: As software becomes increasingly complex and prone to vulnerabilities, automated vulnerability detection is critically important, yet challenging. Given the significant successes of LLMs in various tasks, there is growing anticipation of their efficacy in vulnerability detection. However, a quantitative understanding of their potential in vulnerability detection is still missing. To bridge this gap, we introduce a comprehensive vulnerability benchmark VulBench. This benchmark aggregates high-quality data from a wide range of CTF (Capture-the-Flag) challenges and real-world applications, with annotations for each vulnerable function detailing the vulnerability type and its root cause. Through our experiments encompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models and static analyzers, we find that several LLMs outperform traditional deep learning approaches in vulnerability detection, revealing an untapped potential in LLMs. This work contributes to the understanding and utilization of LLMs for enhanced software security.

