LLbezpeky: Leveraging Large Language Models for Vulnerability Detection (2401.01269v2)
Abstract: Despite continued research and progress in building secure systems, Android applications remain riddled with vulnerabilities, necessitating effective detection methods. Current strategies built on static and dynamic analysis tools come with limitations, such as an overwhelming number of false positives and a limited scope of analysis, that make either difficult to adopt. Over the past years, machine-learning-based approaches have been extensively explored for vulnerability detection, but their real-world applicability is constrained by data requirements and feature-engineering challenges. LLMs, with their vast parameters, have shown tremendous potential in understanding the semantics of human as well as programming languages. We investigate the efficacy of LLMs for detecting vulnerabilities in the context of Android security, focusing on building an AI-driven workflow to assist developers in identifying and rectifying vulnerabilities. Our experiments show that LLMs exceed our expectations in finding issues within applications, correctly flagging insecure apps in 91.67% of cases on the Ghera benchmark. We use inferences from our experiments to build a robust and actionable vulnerability detection system and demonstrate its effectiveness. Our experiments also shed light on how various simple configurations can affect the True Positive (TP) and False Positive (FP) rates.
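As a rough illustration of what such an LLM-driven detection workflow can look like, the sketch below prompts a chat model with an app's source files, asks for a vulnerability verdict, and scores TP/FP rates over insecure/secure benchmark pairs. It is a minimal sketch assuming the OpenAI chat completions client; the prompt text, the `classify_app` and `evaluate` helpers, and the model name are illustrative stand-ins, not the paper's actual prompts, configurations, or implementation.

```python
import os
from pathlib import Path

from openai import OpenAI  # assumes the official `openai` Python client (v1+) is installed

# Hypothetical system prompt; the paper's actual prompts and configurations differ.
SYSTEM_PROMPT = (
    "You are an Android security analyst. Given application source code, "
    "decide whether it contains a security vulnerability. "
    "Answer with 'VULNERABLE' or 'SAFE' followed by a one-line justification."
)

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])


def classify_app(source_files: list[Path], model: str = "gpt-4") -> bool:
    """Return True if the model flags the app as insecure (illustrative only)."""
    # Concatenate the relevant source files into one prompt; a real workflow
    # would need chunking or retrieval for apps exceeding the context window.
    code = "\n\n".join(f"// {p.name}\n{p.read_text()}" for p in source_files)
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # reduce randomness so benchmark runs are repeatable
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": code},
        ],
    )
    verdict = response.choices[0].message.content or ""
    return verdict.strip().upper().startswith("VULNERABLE")


def evaluate(insecure_apps: list[list[Path]], secure_apps: list[list[Path]]) -> tuple[float, float]:
    """Compute (TP rate, FP rate): fraction of insecure apps flagged vs. secure apps flagged."""
    tp_rate = sum(classify_app(app) for app in insecure_apps) / len(insecure_apps)
    fp_rate = sum(classify_app(app) for app in secure_apps) / len(secure_apps)
    return tp_rate, fp_rate
```

A benchmark such as Ghera pairs each insecure app with a secure counterpart, so a verdict on both sides of each pair yields the TP and FP rates directly.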