
Classification and Online Clustering of Zero-Day Malware (2305.00605v2)

Published 1 May 2023 in cs.CR and cs.LG

Abstract: A large amount of new malware is constantly being generated, which must not only be distinguished from benign samples, but also classified into malware families. For this purpose, investigating how existing malware families are developed and examining emerging families need to be explored. This paper focuses on the online processing of incoming malicious samples to assign them to existing families or, in the case of samples from new families, to cluster them. We experimented with seven prevalent malware families from the EMBER dataset, four in the training set and three additional new families in the test set. Based on the classification score of the multilayer perceptron, we determined which samples would be classified and which would be clustered into new malware families. We classified 97.21% of streaming data with a balanced accuracy of 95.33%. Then, we clustered the remaining data using a self-organizing map, achieving a purity from 47.61% for four clusters to 77.68% for ten clusters. These results indicate that our approach has the potential to be applied to the classification and clustering of zero-day malware into malware families.

Citations (2)

Summary

  • The paper introduces a two-phase approach that uses a multilayer perceptron for confident classification and shifts ambiguous zero-day samples to clustering.
  • It evaluates online clustering methods and finds that a self-organizing map (SOM) performs best, reaching a purity of 77.68% with ten clusters when grouping samples from emerging malware families.
  • The study highlights the integration of automated malware detection with expert analysis to strengthen cybersecurity defenses.


Introduction

The paper "Classification and Online Clustering of Zero-Day Malware" (2305.00605) describes a system that processes incoming zero-day malware samples and either classifies them into known malware families or clusters them into new, emerging families. This matters because new malware is generated constantly, and each sample must be not only distinguished from benign files but also attributed to a family.

Methodology

The proposed system is based on a two-phase approach:

  1. Phase One (Classification and Dividing the Data): A multilayer perceptron (MLP) scores each sample in the incoming stream. Samples whose classification probability is high enough are assigned to a known malware family; samples with insufficient confidence are passed on to clustering. The classification probability is calculated using either a single multiclass classifier or multiple binary classifiers, one per known family.
  2. Phase Two (Clustering): For samples not confidently classified, an online clustering method groups them into potential new malware families. Self-Organizing Maps (SOM), Online $k$-Means (OKM), and Basic Sequential Algorithmic Scheme (BSAS) are explored for clustering.

    Figure 1: The architecture of our proposed model for processing zero-day malware to malware families.
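
The two phases can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the routing rule compares the highest class probability with a threshold t, and it uses BSAS with running-mean centroids as the phase-two clusterer only because it is the simplest of the three methods named above (the paper's best results come from SOM). The names route_sample, BSAS, process_stream, theta, and max_clusters are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple


def route_sample(probs: Sequence[float], t: float) -> Optional[int]:
    """Phase one (assumed rule): return the index of the most probable known
    family, or None when the top probability is below the threshold t."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best if probs[best] >= t else None


class BSAS:
    """Phase two: Basic Sequential Algorithmic Scheme, one sample at a time."""

    def __init__(self, theta: float, max_clusters: int) -> None:
        self.theta = theta                # distance above which a new cluster opens
        self.max_clusters = max_clusters  # upper bound on the number of clusters
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def partial_fit(self, x: Sequence[float]) -> int:
        """Assign one sample to a cluster and return the cluster index."""
        if self.centroids:
            dists = [math.dist(x, c) for c in self.centroids]
            j = min(range(len(dists)), key=dists.__getitem__)
            too_far = dists[j] > self.theta
            if not (too_far and len(self.centroids) < self.max_clusters):
                # Join the nearest cluster and update its centroid as a running mean.
                self.counts[j] += 1
                n = self.counts[j]
                self.centroids[j] = [
                    c + (xi - c) / n for c, xi in zip(self.centroids[j], x)
                ]
                return j
        # First sample, or too far from every centroid: open a new cluster.
        self.centroids.append(list(x))
        self.counts.append(1)
        return len(self.centroids) - 1


def process_stream(
    samples: Sequence[Sequence[float]],
    probabilities: Sequence[Sequence[float]],
    t: float,
    clusterer: BSAS,
) -> List[Tuple[str, int]]:
    """Return ("family", k) or ("cluster", j) for each incoming sample."""
    results = []
    for x, probs in zip(samples, probabilities):
        family = route_sample(probs, t)
        if family is not None:
            results.append(("family", family))
        else:
            results.append(("cluster", clusterer.partial_fit(x)))
    return results
```

In this sketch the classifier probabilities are passed in precomputed, so any model that outputs per-family probabilities could feed it. Raising t sends more samples to the clusterer, which is the trade-off the paper's results explore.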

Experimental Setup

The system was evaluated using the EMBER dataset, which provides static analysis features extracted from portable executable files. Seven malware families were considered, with four used for training and three (representing zero-day samples) introduced in the test set as new families.

The features were processed using Principal Component Analysis (PCA) and standard score normalization to optimize classifier performance. Classification accuracy was measured using balanced accuracy (BAC), and clustering quality was evaluated with purity and silhouette coefficients.
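
For readers unfamiliar with the two headline metrics, the following sketch computes them from label lists. It assumes the standard definitions (balanced accuracy as the mean of per-class recall, purity as the share of samples that belong to the majority family of their cluster); the summary does not state the paper's exact formulas, and the function names are hypothetical. The silhouette coefficient is omitted for brevity.

```python
from collections import Counter, defaultdict
from typing import Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / n for c, n in total.items()) / len(total)


def purity(labels: Sequence[Hashable], clusters: Sequence[Hashable]) -> float:
    """Fraction of samples that belong to the majority family of their cluster."""
    by_cluster = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster][label] += 1
    majority = sum(max(counts.values()) for counts in by_cluster.values())
    return majority / len(labels)
```

Under this definition purity tends to rise as the number of clusters grows, since smaller clusters are easier to keep homogeneous, so purity values are best compared at the same cluster count.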

Results

Classification Performance:

  • The multilayer perceptron achieved its best balanced accuracy (BAC) of 98.60% when 67.97% of test samples were classified confidently using a high threshold ($t = 0.99999$).

Clustering Performance:

  • SOM outperformed OKM and BSAS, achieving a purity of 77.68% for ten clusters when the threshold was $t = 0.9999999$, which implies that 44.56% of samples were clustered.

    Figure 2: Classification probabilities prediction $(p_1, \ldots, p_k)$ from a multiclass classifier.

    Figure 3: Classification probabilities prediction $(p'_1, \ldots, p'_k)$ from the $k$ binary classifiers.
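
The difference between the two probability vectors in Figures 2 and 3 can be illustrated with a small sketch. It assumes, and this is not stated in the summary, that the multiclass classifier ends in a softmax layer and that each binary classifier ends in a sigmoid; the function names and the raw scores (logits) are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Multiclass case (assumed): probabilities p_1..p_k that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def one_vs_rest(logits: Sequence[float]) -> List[float]:
    """Binary case (assumed): independent probabilities p'_1..p'_k, one per family."""
    return [sigmoid(z) for z in logits]
```

Under these assumptions the softmax vector always sums to 1, so some family always receives a relatively large share, whereas the independent binary outputs can all be small at once for a sample that resembles no known family. Either vector can then be compared against the threshold $t$.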

The paper concludes that classification handled the majority of samples accurately (the abstract reports 97.21% of the streaming data classified with a balanced accuracy of 95.33%), while clustering offered a way to group new and unknown samples and flag them for further analysis, with purity ranging from 47.61% for four clusters to 77.68% for ten clusters.

Implications and Future Work

The results suggest that machine learning algorithms can automate malware detection and family classification in real time without relying solely on pre-existing signatures, which helps mitigate the threat of zero-day malware. The paper also suggests combining clustering results with malware analysts' insights to improve identification and response times.

Future work includes refining the threshold that separates classification from clustering, in order to optimize the trade-off between accuracy and the proportion of samples classified. The paper also suggests improving the clustering techniques so that they better distinguish novel classes, especially in highly dynamic and evolving threat landscapes.
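
One simple way to examine that trade-off is to sweep candidate thresholds over held-out predictions and record, for each, the share of samples classified and the accuracy on those samples. The sketch below is an illustration only: the function name and inputs are hypothetical, and it reports plain accuracy for brevity where the paper uses balanced accuracy.

```python
from typing import List, Sequence, Tuple


def sweep_thresholds(
    max_probs: Sequence[float],
    correct: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """For each threshold t, return (t, coverage, accuracy on accepted samples).

    max_probs[i] is the top class probability of sample i (hypothetical input);
    correct[i] says whether that top class matches the true family.
    """
    rows = []
    for t in thresholds:
        accepted = [ok for p, ok in zip(max_probs, correct) if p >= t]
        coverage = len(accepted) / len(max_probs)
        # Accuracy is undefined when no sample passes the threshold.
        accuracy = sum(accepted) / len(accepted) if accepted else float("nan")
        rows.append((t, coverage, accuracy))
    return rows
```

A threshold can then be picked from the resulting rows, for example the lowest t whose accuracy meets a target, which keeps the share of samples sent to clustering as small as that target allows.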

In summary, this research paves the way for more adaptive and real-time malware analysis systems, crucial for maintaining robust cybersecurity defenses in the face of continuously evolving threats.
