Counteracting Concept Drift by Learning with Future Malware Predictions (2404.09352v1)
Abstract: The accuracy of deployed malware-detection classifiers degrades over time due to changes in data distributions and increasing discrepancies between training and testing data. This phenomenon is known as concept drift. While concept drift can have many causes in general, new malicious files are created by malware authors with the clear intention of avoiding detection. The existence of this intention opens the possibility of predicting such future samples, and including the predicted samples in the training data should in turn increase the accuracy of classifiers on new testing data. We compare two methods for predicting future samples: (1) adversarial training and (2) generative adversarial networks (GANs). The first method explicitly seeks adversarial examples against the classifier, which are then used as part of the training data. Similarly, GANs also generate synthetic training data: we use GANs to learn how data distributions change between different time periods of the training data and then apply these changes to generate samples that could appear in the testing data. We compare the two prediction methods on two datasets: (1) the public Ember dataset and (2) an internal dataset of files incoming to Avast. We show that while adversarial training yields more robust classifiers, it is not a good predictor of future malware in general. This contrasts with previously reported positive results in other domains (including natural language processing and spam detection). On the other hand, we show that GANs can be used successfully as predictors of future malware. We specifically examine malware families whose data distributions change significantly over time, and the experimental results confirm that GAN-based predictions can significantly improve the accuracy of the classifier on new, previously unseen data.
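To make the first prediction method concrete, below is a minimal sketch of FGSM-style adversarial training, not the paper's exact procedure: the model architecture, the perturbation budget `epsilon`, and the synthetic feature vectors are all illustrative assumptions standing in for static PE feature vectors and the authors' actual classifier.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: real inputs would be static PE feature vectors
# (e.g., EMBER-style); random data and labels are used purely for the sketch.
torch.manual_seed(0)
X = torch.randn(512, 64)               # feature vectors
y = (X.sum(dim=1) > 0).float()         # synthetic benign/malware labels

model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
epsilon = 0.1                          # perturbation budget (assumed)

for epoch in range(5):
    # 1) Craft adversarial examples against the current classifier (FGSM).
    #    For simplicity all samples are perturbed; in malware detection one
    #    would typically perturb only malicious files toward evasion.
    X_adv = X.clone().requires_grad_(True)
    loss = loss_fn(model(X_adv).squeeze(1), y)
    loss.backward()
    X_adv = (X_adv + epsilon * X_adv.grad.sign()).detach()

    # 2) Train on the union of clean and adversarial samples.
    X_train = torch.cat([X, X_adv])
    y_train = torch.cat([y, y])
    opt.zero_grad()
    loss = loss_fn(model(X_train).squeeze(1), y_train)
    loss.backward()
    opt.step()
```

The second prediction method can be sketched in the same spirit: a generator is trained to translate samples from one time period into the distribution of the next, and the learned shift is then applied to the most recent period to synthesize plausible future samples. Again, the architectures, the two synthetic "periods" `X_old` and `X_new`, and the plain GAN objective are assumptions for illustration, not the authors' model.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for two consecutive time periods of one malware
# family; real inputs would be per-period static feature vectors.
torch.manual_seed(0)
X_old = torch.randn(512, 64)           # samples from period t
X_new = torch.randn(512, 64) + 0.5     # samples from period t+1 (shifted)

G = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 64))  # old -> new
D = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 1))   # real vs. translated
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for step in range(200):
    # Discriminator: tell real period-(t+1) samples from translated period-t ones.
    fake = G(X_old).detach()
    d_loss = bce(D(X_new), torch.ones(512, 1)) + bce(D(fake), torch.zeros(512, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator: make translated period-t samples indistinguishable from period-(t+1).
    g_loss = bce(D(G(X_old)), torch.ones(512, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

# Apply the learned period-to-period shift to the newest data to extrapolate
# one step further; the synthesized samples can be added to the training set.
X_predicted_future = G(X_new).detach()
```

The design choice mirrors the abstract's description: rather than modeling the absolute data distribution, the GAN models the change between periods, so applying it to the latest period yields candidates for what the next (unseen) period may contain.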