Introduction
The detection and classification of malware in network traffic are crucial for ensuring the security and integrity of computer networks. Traditional approaches often rely on extracting specific features from network data, such as packet statistics, flow characteristics, or payload content. However, these methods may not fully exploit the valuable information embedded in the raw packet capture (pcap) files. In this study, we propose a unique approach for malware classification using Convolutional Neural Networks (CNN).
The novelty of this work lies in converting elements of the IP header from network sessions in pcap files into binary data, which serves as the basis for generating the images used for classification.
Several papers in the literature [1], [2], [3] achieve above 99% accuracy with various feature selection and machine learning algorithms, making it very hard to distinguish which approach is better. Several studies showed that even baseline methods achieved near-perfect accuracy [2], [3]. This may be because the datasets used in those studies make the classification problem easy. Additionally, most of the papers identified in the literature train and test on the same dataset, so it is unclear whether a model trained on one dataset will work well on other datasets.
Datasets from different sources often share the same malware labels, yet models trained on one dataset are rarely tested against datasets from other sources. In this paper, the Malware Capture Facility Project (MCFP) dataset [4] was chosen for training and validation. MCFP is the largest malware dataset we could find: it contains more than 140 malware classes and 460 gigabytes of packet capture (pcap) files, and the distribution of samples over these classes varies significantly. These factors make it much harder to achieve high accuracies compared to the datasets used in the literature. Our test data come from three different sources: USTC-TFC2016 [1], Taltech.ee MedBIoT [5], and IEEE-Mirai [6]. These datasets were chosen because their malware labels overlap with those of the MCFP dataset.
In this paper, a novel methodology for transforming pcap data into visual representations is presented, and its effectiveness for malware classification is demonstrated. The remainder of this paper is organized as follows: Section II describes related work. Section III describes our methodology, including data collection, image generation, CNN model creation, the baseline method for comparison, training, and evaluation. Section IV presents detailed results of our method and the baseline method on validation and test datasets. Discussion, conclusion, and future work are presented in Section V.
Related Work
Most traffic classification research in the literature is based on selecting the best features from network traffic data and applying machine learning algorithms designed for structured data [7], [8], [9], [10], [11], [12], [13]. Features are usually statistical properties of network flows or sessions, such as average packet size, number of packets per second, and more advanced statistical features. Recently, with the success of deep neural networks, new approaches have started to emerge.
Wang et al. [1] generated images directly from the raw bytes of network traffic and classified them with a CNN, introducing the USTC-TFC2016 dataset in the process; their training and testing were performed within that single dataset.
Bendiab et al. [15] used Binvis [16], a binary file visualization tool, to generate images from pcap files. They generated only 1,000 images, a comparatively small dataset.
Agrafiotis et al. [2] also generated images from packet data and applied deep learning models for malware traffic classification; as noted above, even their baseline methods achieved near-perfect accuracy on the datasets they used.
Bo et al. [3] applied neural networks originally created for natural language modeling to malicious traffic. They treated each byte in the raw packet as a word input to a Gated Recurrent Unit (GRU) neural network to classify malware traffic. Their datasets include ISCX2012 [20], USTC-TFC2016 [1], and CICIDS2017 [18], and they achieved high accuracy on all of them; specifically, the accuracy on USTC-TFC2016 [1] was 99.94%. As with Wang et al. [1], their training and testing were performed within the same dataset.
There are several papers utilizing the MCFP dataset, but most used only a small subset of it. For example, Liu et al. [21] used 8 MCFP malware classes in their experiments to generate malware traffic fingerprints for classification. Zhang et al. [22] also used 8 MCFP malware classes in detecting encrypted malicious traffic. Li et al. [23] selected a random set of 20 MCFP malware classes for unknown malware class detection. Zhao et al. [24] used 10 MCFP malware classes in their prototype-based learning method for malware classification. Rong et al. [25] used more MCFP classes than the work above, a total of 42, but experimented only with 5-way and 20-way classifications; their few-shot learning algorithm reached an F1 score of about 97% on 20-way classification. In this paper, we used all 142 malware classes in MCFP and achieved a similar F1 score.
Methodology
A. Machine Learning Pipeline
Figure 1 shows the full machine learning pipeline of this research. The process begins with downloading network traffic data from four sources: the Malware Capture Facility Project (MCFP), USTC-TFC2016, MedBIoT, and IEEE-Mirai datasets. Datasets were stored in a directory tree organized by source and malware class label. In the next stage, the pipeline diverges into image generation and statistical feature extraction. Images generated from the dataset were used to train a CNN model, while numeric features extracted in the other branch were used to train Random Forest, SVM, and XGBoost models. Finally, the evaluation stage compared F1 scores of the CNN versus the traditional ML models. Each of these steps is described in detail in the following subsections.
B. Datasets
1) MCFP Dataset [4]
The Malware Capture Facility Project, affiliated with the Stratosphere IPS Project, captures malware and network data to refine machine learning algorithms in network security. They utilize Malware, Normal, and Background traffic to assess detection and algorithm performance. Each dataset includes pcap files and password-protected zip files related to the malware execution.
Of the diverse file types in the MCFP, the focus was solely on raw pcap files representing malware. A script was created to navigate the dataset's JSON structure; the pcaps corresponding to each malware type were extracted and organized into folders named after the malware class.
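A minimal sketch of such an extraction script is shown below. The JSON index layout assumed here (malware class name mapped to a list of pcap paths) is purely illustrative, since the actual MCFP metadata format is not reproduced in this paper.

```python
import json
import shutil
from pathlib import Path

# Hypothetical index format: {"Trickbot": ["path/a.pcap", ...], ...}
index = json.loads(Path("mcfp_index.json").read_text())

for malware_class, pcap_paths in index.items():
    dest = Path("datasets/MCFP") / malware_class
    dest.mkdir(parents=True, exist_ok=True)   # one folder per malware class
    for p in pcap_paths:
        shutil.copy(p, dest / Path(p).name)
```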
The dataset represents over 140 classes, including malware such as Neeris, Wootbot, Sogou, RBot, NSIS Agent, Zbot, Trojan MSIL Inject, Sirefef, Dridex, and many others. Pcap file sizes range from 4KB to 70GB. This dataset was used for training and validation. The dataset can be downloaded here [26].
2) USTC-TFC2016 Dataset [1]
The USTC-TFC2016 dataset was created by Wang et al. for their malware classification research. It contains both benign and malicious network traffic; the malicious traffic is based on the CTU dataset [26] collected from 2011 to 2015, with some modifications by the authors. The dataset includes a total of 20 classes, of which 10 are malware classes such as Cridex, Geodo, Htbot, Virut, and Zeus. Pcap file sizes range from 2.5MB to 282MB. This was used as a test dataset. The dataset can be downloaded here [27].
3) Medbiot Dataset [5]
The MedBIoT dataset was developed by Manzanares et al. from the Department of Software Science at the Tallinn University of Technology, Estonia. The dataset incorporates a combination of real and emulated IoT devices (83 devices in total) and provides genuine malware network data from three notable botnets: Mirai, BashLite, and Torii. Only Mirai and Torii were used here; their pcap file sizes range from 25MB to 635MB. These were utilized as a test dataset. The dataset can be downloaded here [28].
4) IEEE-Mirai Dataset
The Mirai-based multi-class dataset [6] was created by Gebrye et al. Derived from prior IoT intrusion data, this dataset emphasizes benign traffic and three specific Mirai botnet attacks: SYN-Flooding, ACK-Flooding, and HTTP-Flooding, all treated under the general umbrella of the Mirai class. Pcap file sizes range from 12MB to 87MB. This was used as a test dataset. The dataset can be downloaded here [29].
C. Image Generation
For each session in a pcap file, a fixed-size 3-channel RGB image is generated by converting selected IP-header fields of the session's packets into binary pixel data.
Figure 2 shows six randomly selected sample images arranged in a grid.
Next, the detailed steps for generating these images from the pcap files are described. The Scapy [30] Python library was used to extract the sessions from the pcap files. However, the library is extremely slow in handling large pcap files, and its time complexity appears to be more than linear; many pcap files in the MCFP dataset are several gigabytes in size. To make image generation faster, we used the 'tcpdump' command to split the large pcap files into sets of smaller pcap files of 25 megabytes each, a size the Scapy [30] library handles reasonably quickly. A Python script was written to iterate through all the sessions identified by the Scapy library and generate an image for each one. For each session, the script reads the first 50 packets, extracts the fields mentioned above, and uses the PIL Python library to render them as an image.
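A condensed sketch of this step is shown below. The specific IP-header fields and the output image size are assumptions for illustration (the exact choices are described above only in general terms); the Scapy session extraction and PIL image writing mirror the process just described.

```python
from scapy.all import rdpcap
from scapy.layers.inet import IP
from PIL import Image
import numpy as np

# Large captures were first split, e.g.: tcpdump -r big.pcap -C 25 -w part.pcap
MAX_PACKETS = 50   # only the first 50 packets of each session are used
IMG_SIZE = 224     # assumed output dimension, not the paper's exact size

def session_to_image(packets, out_path):
    """Encode selected IP-header fields of one session as an RGB image."""
    rows = []
    for pkt in packets[:MAX_PACKETS]:
        if IP not in pkt:
            continue
        ip = pkt[IP]
        # Illustrative field selection; the paper's exact field list is assumed.
        raw = bytes([ip.version, ip.ihl, ip.tos, ip.ttl, ip.proto])
        raw += int(ip.len).to_bytes(2, "big") + int(ip.id).to_bytes(2, "big")
        rows.append(np.frombuffer(raw, dtype=np.uint8))
    if not rows:
        return
    canvas = np.zeros((IMG_SIZE, IMG_SIZE, 3), dtype=np.uint8)
    flat = np.concatenate(rows)
    n = min(flat.size, canvas.size)
    canvas.reshape(-1)[:n] = flat[:n]          # pack header bytes into the canvas
    Image.fromarray(canvas, "RGB").save(out_path)

for i, sess in enumerate(rdpcap("sample.pcap").sessions().values()):
    session_to_image(sess, f"session_{i:05d}.png")
```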
Figures 3 and 4 show the distributions of the 50 largest and 50 smallest malware classes. The largest malware class, Trickbot, has about 175,000 images, while the smallest, jRAT, has fewer than 10. This wide spread in the number of samples per class makes classification more challenging for machine learning algorithms.
D. CNN Architecture
VGG [31] was initially used to classify images from the ImageNet [32] dataset, which contains more than 14 million images and 1,000 classes. Packet images are simpler than those of ImageNet, and our number of classes (142) is also smaller, so the task should be less challenging than ImageNet classification. The decision was therefore made to simplify VGG to meet the project's requirements.
Figure 5 shows the architecture of our CNN model. The input to the network is the RGB session image described above.
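As a rough PyTorch sketch of such a simplified VGG-style network (the layer counts and channel widths here are assumptions, not the paper's exact configuration):

```python
import torch.nn as nn

class SmallVGG(nn.Module):
    """Simplified VGG-style CNN for 142 malware classes (illustrative sizes)."""
    def __init__(self, num_classes=142):
        super().__init__()
        def block(cin, cout):
            # Two 3x3 convolutions followed by 2x2 max pooling, as in VGG
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(cout, cout, 3, padding=1), nn.ReLU(inplace=True),
                nn.MaxPool2d(2))
        self.features = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 256), nn.ReLU(inplace=True),
            nn.Dropout(0.5), nn.Linear(256, num_classes))

    def forward(self, x):
        return self.classifier(self.features(x))
```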
E. Training CNN Model
A random 80–20 split was performed on the MCFP dataset, resulting in 1,242,468 images in the training set and 310,618 images in the validation set. The AdamW optimizer from the PyTorch library was used for training.
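A minimal training-loop sketch is shown below; the learning rate, batch size, and epoch count are placeholder assumptions (the paper's exact hyperparameters are not given in this excerpt), and `SmallVGG` refers to the sketch above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

# Images organized one folder per malware class, as described in Section III
full = datasets.ImageFolder("mcfp_images", transform=transforms.ToTensor())
n_train = int(0.8 * len(full))                       # 80-20 split
train_ds, val_ds = random_split(full, [n_train, len(full) - n_train])
train_dl = DataLoader(train_ds, batch_size=256, shuffle=True, num_workers=8)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SmallVGG(num_classes=142).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(10):                              # assumed epoch count
    model.train()
    for x, y in train_dl:
        x, y = x.to(device), y.to(device)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
```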
F. Baseline Method
The image-based CNN model was compared against a method commonly used in the literature, in which statistical features are extracted from the packet captures and machine learning algorithms for structured data are applied. Netml [33] is a Python library that extracts 10 of the most common statistics in the literature: flow duration, number of packets per second, number of bytes per second, and the following statistics of packet sizes: mean, standard deviation, the first to third quartiles, the minimum, and the maximum. A Python script was written with the netml library to extract these statistical features from the same MCFP dataset. Note that, just as only the first 25 megabytes of large pcap files were used for generating images, the same limit was applied when extracting statistical features. Note also that these statistical features are flow-based, whereas images are generated per session; the statistical dataset therefore has 4,472,196 samples, about twice the number of image samples. Random Forest, Support Vector Machine, and XGBoost were applied to this statistical dataset. Detailed results are shown in the Results section.
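A hedged sketch of this baseline pipeline is shown below, assuming netml's documented `PCAP` interface (with 'STATS' selecting the statistical flow features); the file path and label assignment are illustrative, with labels taken from the class folder names.

```python
from netml.pparser.parser import PCAP
from sklearn.ensemble import RandomForestClassifier

# Assumed netml usage; 'STATS' yields the statistical flow features
# (duration, pkts/s, bytes/s, and packet-size statistics).
pcap = PCAP("datasets/MCFP/Trickbot/part0.pcap")   # hypothetical path
pcap.pcap2flows()                  # group packets into flows
pcap.flow2features("STATS")        # one feature vector per flow
X = pcap.features
y = ["Trickbot"] * len(X)          # label flows by their source folder

clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
clf.fit(X, y)
```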
G. Evaluation Methodology
Accuracy is not a good measure when classes are unevenly distributed, and Figures 3 and 4 show that the class distribution is very uneven across the malware classes in our datasets. The decision was therefore made to use the macro F1 score to evaluate the performance of the machine learning models. To compute the macro F1 score, precision and recall values are computed for each individual class. For each class $i$: \begin{align*} \text {Precision}_{i} &= \frac {TP_{i}}{TP_{i} + FP_{i}} \tag{1}\\ \text {Recall}_{i} &= \frac {TP_{i}}{TP_{i} + FN_{i}} \tag{2}\\ \text {F1}_{i} &= 2 \cdot \frac {\text {Precision}_{i} \cdot \text {Recall}_{i}}{\text {Precision}_{i} + \text {Recall}_{i}} \tag{3}\end{align*}
\begin{align*} \text {Macro Precision} &= \frac {1}{n} \sum _{i=1}^{n} \text {Precision}_{i} \tag{4}\\ \text {Macro Recall} &= \frac {1}{n} \sum _{i=1}^{n} \text {Recall}_{i} \tag{5}\\ \text {Macro F1} &= \frac {1}{n} \sum _{i=1}^{n} \text {F1}_{i} \tag{6}\end{align*} where $n$ is the number of classes.
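These macro scores correspond directly to scikit-learn's macro averaging; a minimal sketch with toy labels standing in for the 142 MCFP classes:

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Toy stand-ins for the per-sample labels; in practice these would be the
# predicted and true MCFP class labels for the validation or test set.
y_true = ["Trickbot", "Zbot", "Dridex", "Zbot"]
y_pred = ["Trickbot", "Zbot", "Zbot", "Zbot"]

macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
print(f"macro P={macro_p:.2f} R={macro_r:.2f} F1={macro_f1:.2f}")

# Per-class precision/recall/F1, as in the classification reports cited above
print(classification_report(y_true, y_pred, zero_division=0))
```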
To evaluate the CNN model and the baseline models, classification reports were first run on the MCFP validation dataset: for each of the 142 malware classes, precision, recall, and F1 score were computed, and the macro precision, recall, and F1 scores were compared. To further test performance, classification reports were run on test datasets from different sources: USTC-TFC2016, Taltech.ee MedBIoT, and IEEE-Mirai. Because these test datasets were captured independently of one another, the evaluation results are more convincing in assessing the performance of our image-based approach. Detailed evaluation results are presented in the next section.
Results
The results demonstrate that representing malware packet sessions as images and training a deep convolutional neural network (CNN) yields significantly better accuracy in malware classification than traditional methods using numerical features from sessions. Specifically, a deep CNN was trained on a dataset of millions of labeled images representing 142 malware classes from the Malware Capture Facility Project (MCFP). It achieved a 97% macro F1 score on the validation set, far exceeding the maximum 74% macro F1 score of Random Forest models trained on numerical features from the same MCFP dataset. Table 1 summarizes the classification report on the MCFP validation dataset. The CNN on image data generated from malware sessions achieved macro precision, recall, and F1 scores of 98%, 97%, and 97%, respectively. Random Forest on the aggregated numerical feature dataset extracted from the same packet sessions achieved macro precision, recall, and F1 scores of 79%, 78%, and 78%, respectively. Other machine learning algorithms, including XGBoost and SVM, were also tried on the numerical feature dataset. XGBoost's macro F1 score was only 24%. Due to the higher time complexity of SVM with an RBF kernel, training on the entire training dataset could not be completed; on a much smaller subset of the training data, its macro F1 score was 40%.
Tests were also conducted on datasets from different sources, and the image-based deep learning approach consistently outperformed traditional methods by a wide margin. Both trained models were tested on the USTC-TFC2016 [1], Taltech.ee MedBIoT [5], and Mirai-based multi-class [6] datasets. These are much smaller than the MCFP dataset, together containing about 11 malware classes, all of which are a subset of MCFP's malware classes. Tables 2 and 3 show classification reports for these malware classes on the MCFP validation dataset. Table 2 shows the results of the Random Forest algorithm on the aggregated numerical feature dataset extracted from the MCFP validation set, restricted to these 11 malware classes from the test datasets. Table 3 shows the results of the CNN on images extracted from the MCFP validation set, restricted to the same classes. Note that these tables show results on the validation dataset.
Table 4 shows the accuracy of the trained CNN and Random Forest models on the three test datasets. Overall, the results indicate that the CNN model trained on image representations of malware sessions provides superior malware classification performance on test datasets from different sources. Except for Nsis and Emotet, the CNN outperformed Random Forest by a significant margin. A closer look at these two malware classes shows (Table 1) that the CNN's F1 scores on the MCFP validation dataset were also the lowest for Nsis and Emotet among all classes. For Emotet, the CNN's recall on the validation dataset was only 0.87, much lower than its precision of 0.99. This suggests the CNN model produces relatively many false negatives for this class, causing the low accuracy on the test dataset.
Because the test datasets come from different sources, they may label the same malware differently from MCFP. In Table 4, Geodo from the USTC-TFC2016 dataset had a CNN and Random Forest accuracy of 0%; during testing, most of the test samples were predicted as Emotet. Emotet is also known as Geodo [35]. Testing the CNN model against the Emotet label yielded 71.5% accuracy, whereas the Random Forest model yielded 78%. A similar situation exists for Zeus and Trojan MSIL Inject. Zeus had an accuracy of 0% for both CNN and Random Forest; both models predicted these samples as Trojan MSIL Inject instead. The MCFP dataset has a label for Zeus Variant rather than the original Zeus, while the USTC-TFC2016 dataset has a label for Zeus rather than Zeus Variant. There is a good chance that what the USTC-TFC2016 dataset labels as Zeus is actually Trojan MSIL Inject; Kaspersky classifies Zeus as a Trojan [36]. The CNN's accuracy on Mirai was not good compared to other malware classes, and Random Forest's accuracy was much worse. This could be because the Mirai malware has many variants [37]; one of them, Okiru, is a separate malware class in the MCFP dataset. In general, it is much more challenging to apply a model trained on one dataset to other datasets. Nevertheless, the CNN-based model consistently outperformed the baseline method.
Discussion and Conclusion
The results show that the image-based deep convolutional neural network model performs significantly better than the various machine learning algorithms using baseline statistical features of flows or sessions in network traffic. The image-based CNN achieved a 97% macro F1 score in classifying 142 malware classes in the MCFP dataset, and its superior performance is evident on both validation and test datasets.
This performance gain is likely because the images preserve information about individual packets that is lost in aggregated statistical features. Even though only the first 50 packets of a session are used to generate an image, this was enough to significantly outperform the statistical features. Convolutional neural networks excel at automatically learning features from unstructured image data; by encoding packets into images, they can detect the patterns unique to different malware packet images.
There are some limitations to the current image generation algorithm. Statistical features include time-related information such as flow duration and average number of packets per second. Although the current images do not include this information, the time gap between packets could be encoded into the images to capture it. Also, the image generation method is based only on IP headers; in the next stage of research, images could be generated to include the TCP header and application layer headers. The method presented in this paper could be used not only for malware classification, but also for general network traffic classification tasks.
ACKNOWLEDGMENT
The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government.