Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications (2403.16073v3)
Abstract: Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that LLMs have potential in auditing smart contracts, but the state of the art indicates that even GPT-4 achieves only 30% precision (when both the decision and the justification are correct). This is likely because off-the-shelf LLMs were primarily pre-trained on a general text/code corpus and not fine-tuned on the specific domain of Solidity smart contract auditing. In this paper, we propose iAudit, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, iAudit is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. As such, iAudit employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate the causes of vulnerabilities. However, fine-tuning alone struggles to accurately identify the optimal cause of a vulnerability. Therefore, we introduce two LLM-based agents, the Ranker and the Critic, which iteratively select and debate the most suitable cause of vulnerability based on the output of the fine-tuned Reasoner model. To evaluate iAudit, we collected a balanced dataset with 1,734 positive and 1,810 negative samples for fine-tuning. We then compared iAudit with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) as well as prompt-learning-based LLMs (GPT-4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, iAudit achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by iAudit achieve a consistency of about 38% with the ground-truth causes.
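The abstract describes a four-role pipeline (Detector, Reasoner, Ranker, Critic). Below is a minimal sketch of that control flow, assuming hypothetical callable interfaces for each role; the function names, signatures, and the `max_rounds` parameter are illustrative and are not the authors' actual API.

```python
# Hypothetical sketch of the iAudit control flow described in the abstract.
# The detector/reasoner/ranker/critic callables are assumed stand-ins, not
# the paper's real implementation.
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class AuditResult:
    vulnerable: bool
    cause: Optional[str]


def audit(
    code: str,
    detector: Callable[[str], bool],          # fine-tuned Detector: yes/no decision
    reasoner: Callable[[str], List[str]],     # fine-tuned Reasoner: candidate causes
    ranker: Callable[[str, List[str]], str],  # LLM agent: pick the most plausible cause
    critic: Callable[[str, str], bool],       # LLM agent: accept or reject the pick
    max_rounds: int = 3,                      # assumed cap on the debate loop
) -> AuditResult:
    """Two-stage decision/justification flow with an iterative Ranker-Critic debate."""
    # Stage 1: intuitive decision, mirroring how an auditor first "perceives" an issue.
    if not detector(code):
        return AuditResult(vulnerable=False, cause=None)

    # Stage 2: generate candidate causes, then iteratively select and debate one.
    candidates = reasoner(code)
    best: Optional[str] = None
    for _ in range(max_rounds):
        best = ranker(code, candidates)
        if critic(code, best):
            break
    return AuditResult(vulnerable=True, cause=best)
```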