
Combining Fine-Tuning and LLM-based Agents for Intuitive Smart Contract Auditing with Justifications (2403.16073v3)

Published 24 Mar 2024 in cs.SE

Abstract: Smart contracts are decentralized applications built atop blockchains like Ethereum. Recent research has shown that LLMs have potential in auditing smart contracts, but the state-of-the-art indicates that even GPT-4 can achieve only 30% precision (when both decision and justification are correct). This is likely because off-the-shelf LLMs were primarily pre-trained on a general text/code corpus and not fine-tuned on the specific domain of Solidity smart contract auditing. In this paper, we propose iAudit, a general framework that combines fine-tuning and LLM-based agents for intuitive smart contract auditing with justifications. Specifically, iAudit is inspired by the observation that expert human auditors first perceive what could be wrong and then perform a detailed analysis of the code to identify the cause. As such, iAudit employs a two-stage fine-tuning approach: it first tunes a Detector model to make decisions and then tunes a Reasoner model to generate causes of vulnerabilities. However, fine-tuning alone faces challenges in accurately identifying the optimal cause of a vulnerability. Therefore, we introduce two LLM-based agents, the Ranker and Critic, to iteratively select and debate the most suitable cause of vulnerability based on the output of the fine-tuned Reasoner model. To evaluate iAudit, we collected a balanced dataset with 1,734 positive and 1,810 negative samples to fine-tune iAudit. We then compared it with traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) as well as prompt learning-based LLMs (GPT4, GPT-3.5, and CodeLlama-13b/34b). On a dataset of 263 real smart contract vulnerabilities, iAudit achieves an F1 score of 91.21% and an accuracy of 91.11%. The causes generated by iAudit achieved a consistency of about 38% compared to the ground truth causes.


Summary

  • The paper presents a novel approach combining two-stage fine-tuning with LLM-based agents to both detect vulnerabilities and provide detailed justifications.
  • The methodology mimics expert human auditors by first identifying issues with a Detector and Reasoner, then refining causes with Ranker and Critic agents.
  • Empirical results show iAudit outperforms both traditional fine-tuned models and prompt learning-based LLMs, reaching a 91.21% F1 score and 91.11% accuracy on real smart contract vulnerabilities.

Unified Fine-Tuning and LLM Agents for Intuitive Smart Contract Auditing with Justifications

Introduction to iAudit

iAudit is a framework for auditing smart contracts that integrates fine-tuning with LLM-based agents to both detect vulnerabilities and justify the identified issues. Given the critical role of smart contracts in decentralized financial applications, ensuring their security is paramount. Traditional methods have shown limitations, especially against complex logic vulnerabilities, and although recent work has demonstrated the potential of LLMs in this domain, precision has remained low. iAudit aims to improve both detection precision and the clarity of its rationales by emulating the intuitive and analytical processes of expert human auditors.

Fine-Tuning and LLM-Based Agents Framework

iAudit employs a two-stage fine-tuning strategy comprising a Detector model and a Reasoner model: the Detector first decides whether a vulnerability is present, and the Reasoner then determines its cause. This staging mimics how human auditors form an intuition before performing a detailed analysis, and it targets the low precision of existing solutions.
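The two-stage flow can be sketched as a simple pipeline. This is an illustrative stand-in, not the paper's implementation: `detect` and `reason` are hypothetical placeholders for the two fine-tuned LLMs, using a trivial string check in place of real model inference.

```python
# Minimal sketch of a Detector-then-Reasoner audit pipeline.
# Both helpers are toy stand-ins for the fine-tuned LLMs iAudit trains.

def detect(code: str) -> str:
    """Stand-in for the fine-tuned Detector: labels a function
    'vulnerable' or 'safe'. A real Detector is an LLM fine-tuned
    on labeled Solidity functions, not a string check."""
    return "vulnerable" if "call.value" in code else "safe"

def reason(code: str, decision: str) -> list[str]:
    """Stand-in for the fine-tuned Reasoner: proposes candidate
    causes explaining the Detector's decision."""
    if decision == "safe":
        return []
    return ["reentrancy via external call before state update"]

def audit(code: str) -> tuple[str, list[str]]:
    # Stage 1: intuitive decision; Stage 2: causal explanation.
    decision = detect(code)
    causes = reason(code, decision)
    return decision, causes

decision, causes = audit("msg.sender.call.value(amount)(); balances[msg.sender] = 0;")
print(decision, causes)
```

The key design point is the separation of concerns: the Detector is tuned only for the binary decision, while the Reasoner is tuned to explain it, rather than asking one model to do both at once.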

Moreover, iAudit introduces two LLM-based agents, the Ranker and the Critic, which iteratively select and debate the most suitable vulnerability cause among the Reasoner's outputs. This iterative process yields a more accurate and defensible identification of smart contract vulnerabilities.

Empirical Evaluation

The evaluation of iAudit contrasted its performance against both traditional fine-tuned models (CodeBERT, GraphCodeBERT, CodeT5, and UnixCoder) and prompt learning-based LLMs (GPT-4, GPT-3.5, and CodeLlama-13b/34b). The fine-tuning dataset comprised balanced positive and negative samples derived from reputable auditing reports and extended through a data augmentation method. On a dataset of 263 real smart contract vulnerabilities, iAudit outperformed the benchmark models, achieving an F1 score of 91.21% and an accuracy of 91.11%, with about 38% of generated causes consistent with the ground truth. This performance underscores iAudit's capability for precise vulnerability detection and justification in smart contract auditing.
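The reported F1 and accuracy follow the standard binary-classification definitions; a small sketch of how such scores are computed from predicted versus ground-truth labels:

```python
# Standard binary F1 and accuracy from parallel label lists
# (1 = vulnerable, 0 = safe).

def f1_and_accuracy(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return f1, accuracy

f1, acc = f1_and_accuracy([1, 1, 0, 0], [1, 0, 0, 0])
print(round(f1, 3), acc)  # prints: 0.667 0.75
```

Because the evaluation dataset is roughly balanced, F1 and accuracy land close together (91.21% vs. 91.11%), as the trade-off between precision and recall is not skewed by class imbalance.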

Ablation Studies and Consideration of Call Graph Information

Ablation studies confirmed the efficacy of the two-stage fine-tuning approach and highlighted the benefit of employing multiple prompts with majority voting. Further examination revealed a nuanced impact from incorporating call graph information: depending on where it enters the model's reasoning process, it can either help or hurt.
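The multiple-prompts-plus-majority-voting step can be sketched as follows. The prompt texts and the `query_detector` placeholder are illustrative assumptions; only the voting structure reflects the ablation discussed above.

```python
# Sketch of majority voting across prompt variants: the Detector is
# queried once per variant and the final label is the majority vote.

from collections import Counter

PROMPT_VARIANTS = [
    "Is this function vulnerable?",
    "Audit this Solidity function for flaws.",
    "Does this code contain a security bug?",
]

def query_detector(prompt: str, code: str) -> str:
    """Placeholder for one call to the fine-tuned Detector;
    a real call would run LLM inference with this prompt."""
    return "vulnerable" if "delegatecall" in code else "safe"

def majority_vote(code: str) -> str:
    votes = Counter(query_detector(p, code) for p in PROMPT_VARIANTS)
    return votes.most_common(1)[0][0]

print(majority_vote("addr.delegatecall(data);"))
```

Using an odd number of prompt variants avoids ties, and voting smooths out the sensitivity of a single prompt wording, which the ablation identifies as a source of the gain.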

Implications and Future Directions

iAudit's robust performance in detecting and justifying smart contract vulnerabilities has significant implications for both research and practice in blockchain security. Because the model closely mirrors expert human intuition and analytical rigor, it provides a promising foundation for future automated auditing tools. Further work on integrating contextual information and refining the iteration among the LLM-based agents could yield even higher precision and reliability in smart contract vulnerability auditing.

Conclusion

This paper introduced iAudit, a framework that advances the auditing of smart contracts through a synergistic combination of fine-tuned models and LLM-based agents. By addressing the limitations of existing LLM applications in this domain, iAudit improves the precision of vulnerability detection while providing cogent justifications, a notable contribution to the security of decentralized applications.
