- The paper introduces a two-phase approach that uses a multilayer perceptron for confident classification and shifts ambiguous zero-day samples to clustering.
- It groups emerging malware families with online clustering; the best-performing method, SOM, reaches a purity of 77.68%.
- The study highlights the integration of automated malware detection with expert analysis to strengthen cybersecurity defenses.
Classification and Online Clustering of Zero-Day Malware
Introduction
The paper "Classification and Online Clustering of Zero-Day Malware" (2305.00605) presents a system that processes incoming zero-day malware samples and either classifies them into known malware families or clusters them into new, emerging families. The problem matters because the number of new malware samples appearing each day keeps growing, which poses a substantial threat to cybersecurity.
Methodology
The proposed system is based on a two-phase approach:
- Phase One (Classification and Dividing the Data): A multilayer perceptron (MLP) assesses each sample in the incoming stream. Samples whose classification probability is confident enough are assigned to a known malware family; the rest are passed to clustering. The classification probabilities come from either a single multiclass classifier or multiple binary classifiers, one per known family.
- Phase Two (Clustering): For samples not confidently classified, an online clustering method groups them into potential new malware families. Self-Organizing Maps (SOM), Online k-Means (OKM), and Basic Sequential Algorithmic Scheme (BSAS) are explored for clustering.
Figure 1: The architecture of our proposed model for processing zero-day malware to malware families.
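The routing step in Phase One can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the decision rule compares the largest predicted probability against a threshold t, and the function name `route_samples` and the use of `>=` rather than `>` are hypothetical choices made for the example.

```python
from typing import List, Sequence, Tuple


def route_samples(
    probabilities: Sequence[Sequence[float]],
    threshold: float,
) -> Tuple[List[Tuple[int, int]], List[int]]:
    """Split samples into confidently classified ones and ones sent to clustering.

    probabilities[i][j] is the predicted probability that sample i belongs to
    known family j. Returns (classified, to_cluster): classified holds
    (sample_index, family_index) pairs, to_cluster holds sample indices.
    """
    classified: List[Tuple[int, int]] = []
    to_cluster: List[int] = []
    for i, probs in enumerate(probabilities):
        # Index of the most probable known family for this sample.
        best_family = max(range(len(probs)), key=lambda j: probs[j])
        # Assumed rule: accept the prediction only if it reaches the threshold.
        if probs[best_family] >= threshold:
            classified.append((i, best_family))
        else:
            to_cluster.append(i)
    return classified, to_cluster
```

With a threshold such as 0.99999, a sample predicted as `[0.6, 0.4]` would be sent to clustering, while one predicted as `[0.999999, 0.000001]` would be assigned to the first family.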
Experimental Setup
The system was evaluated using the EMBER dataset, which provides static analysis features extracted from portable executable files. Seven malware families were considered, with four used for training and three (representing zero-day samples) introduced in the test set as new families.
The features were processed using Principal Component Analysis (PCA) and standard score normalization to optimize classifier performance. Classification accuracy was measured using balanced accuracy (BAC), and clustering quality was evaluated with purity and silhouette coefficients.
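For reference, the two headline metrics can be computed as below. This sketch assumes the standard definitions (balanced accuracy as the mean of per-class recall, purity as the fraction of samples that belong to the majority class of their cluster); the paper's exact formulas may differ in detail, and the function names are illustrative.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Sequence


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Mean of per-class recall over the classes present in y_true."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)


def purity(y_true: Sequence[Hashable], cluster_ids: Sequence[Hashable]) -> float:
    """Fraction of samples that carry the majority label of their cluster."""
    members: Dict[Hashable, Counter] = defaultdict(Counter)
    for label, cluster in zip(y_true, cluster_ids):
        members[cluster][label] += 1
    # For each cluster, count only the samples of its most common true label.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(y_true)
```

Balanced accuracy weights every family equally regardless of size, which matters when malware families are unevenly represented. Purity only rewards homogeneous clusters, so it tends to rise as the number of clusters grows.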
Results
Classification Performance:
- The multilayer perceptron achieved the best balanced accuracy (BAC) of 98.60% when 67.97% of test samples were classified confidently at a high threshold (t=0.99999); the remaining 32.03% were passed to clustering.
Clustering Performance:
- SOM outperformed OKM and BSAS, achieving a purity of 77.68% with ten clusters at the threshold t=0.9999999, at which 44.56% of samples were sent to clustering.
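Of the three online methods compared, BSAS is the simplest to illustrate. The sketch below follows the textbook Basic Sequential Algorithmic Scheme with Euclidean distance and running-mean centroids; the class name, method name, and parameter values are assumptions for the example, not the configuration used in the paper. Note that `distance_threshold` here is a dissimilarity bound for opening new clusters and is unrelated to the classification threshold t.

```python
import math
from typing import List, Sequence


class BSAS:
    """Textbook BSAS: one pass over the stream, at most max_clusters clusters."""

    def __init__(self, distance_threshold: float, max_clusters: int) -> None:
        self.distance_threshold = distance_threshold
        self.max_clusters = max_clusters
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def partial_fit_predict(self, x: Sequence[float]) -> int:
        """Assign one sample to a cluster, updating the model, and return its id."""
        if not self.centroids:
            return self._new_cluster(x)
        distances = [
            math.sqrt(sum((c - xi) ** 2 for c, xi in zip(centroid, x)))
            for centroid in self.centroids
        ]
        nearest = min(range(len(distances)), key=distances.__getitem__)
        # Open a new cluster only if the sample is far from every centroid
        # and the cluster budget is not exhausted.
        if (distances[nearest] > self.distance_threshold
                and len(self.centroids) < self.max_clusters):
            return self._new_cluster(x)
        # Otherwise join the nearest cluster and update its running mean.
        self.counts[nearest] += 1
        n = self.counts[nearest]
        self.centroids[nearest] = [
            c + (xi - c) / n for c, xi in zip(self.centroids[nearest], x)
        ]
        return nearest

    def _new_cluster(self, x: Sequence[float]) -> int:
        self.centroids.append([float(v) for v in x])
        self.counts.append(1)
        return len(self.centroids) - 1
```

Because each sample is seen once and only the centroids are stored, memory use is independent of the stream length. The result does depend on the order in which samples arrive.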
Figure 2: Classification probabilities prediction (p1,…,pk) from a multiclass classifier.
Figure 3: Classification probabilities prediction (p1′,…,pk′) from the k binary classifiers.
The paper concludes that classification handled the majority of samples accurately, while clustering offered a workable way to handle new and unknown samples and flag them for further analysis, with reasonably strong purity values across various cluster numbers.
Implications and Future Work
The results suggest that machine learning can automate malware detection and classification in real time without relying solely on pre-existing signatures, which helps mitigate the threat of zero-day malware. The paper also suggests combining clustering results with malware analysts' insights to improve identification and response times.
Future work includes refining the threshold for separating classification from clustering, thereby optimizing the trade-off between accuracy and the proportion of samples classified. Moreover, enhancing the clustering techniques to better discern between novel classes, especially in highly dynamic and evolving threat landscapes, is suggested.
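One simple way to explore this trade-off is to sweep candidate thresholds on a labeled validation set and record, for each one, the share of samples classified and the accuracy on that share. The sketch below is an illustration only: the function name is hypothetical, it uses plain accuracy rather than BAC for brevity, and it assumes the same max-probability rule as a stand-in for the paper's decision procedure.

```python
from typing import List, Optional, Sequence, Tuple


def sweep_thresholds(
    probabilities: Sequence[Sequence[float]],
    y_true: Sequence[int],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, Optional[float]]]:
    """Return (threshold, coverage, accuracy) for each candidate threshold.

    coverage is the fraction of samples classified confidently; accuracy is
    measured on those samples only and is None when none are classified.
    """
    rows: List[Tuple[float, float, Optional[float]]] = []
    for t in thresholds:
        # Keep (predicted_family, true_family) for confidently classified samples.
        kept = [
            (max(range(len(p)), key=p.__getitem__), y)
            for p, y in zip(probabilities, y_true)
            if max(p) >= t
        ]
        coverage = len(kept) / len(y_true)
        accuracy = sum(pred == y for pred, y in kept) / len(kept) if kept else None
        rows.append((t, coverage, accuracy))
    return rows
```

Raising the threshold typically lowers coverage and raises accuracy on the retained samples, which matches the pattern in the reported results.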
In summary, this research paves the way for more adaptive and real-time malware analysis systems, crucial for maintaining robust cybersecurity defenses in the face of continuously evolving threats.