Introduction
Malware is malicious software created with the intent of causing harm to a computer system, network, or cloud. It includes viruses, worms, trojan horses, spyware, ransomware, and more. Malware is usually spread through malicious emails, websites, and downloads [1]. For example, business email compromise (BEC) is an increasingly common form of cyber attack that targets businesses, organizations, and governments, according to Symantec’s threat report (2020) [2]. LokiBot is malware designed to steal passwords, cryptocurrency wallets, and other sensitive data by placing malicious code into files that are hosted on cloud-based services. It was first discovered in 2019 and is believed to be related to other malicious programs such as TrickBot and Azorult. LokiBot can also evade detection by using encrypted communication and hiding its malicious code within legitimate applications [3]. Ryuk is a dangerous form of ransomware that is used to extort large sums of money from victims. It typically targets large organizations, encrypting all of their data and demanding a ransom in exchange for its release. Ryuk is also known for its ability to spread quickly [4].
Malware detection techniques can be classified in two ways, each providing a distinct perspective. The first classification (static, dynamic, and hybrid) is based on the analysis approach employed [5]. The second classification (signature-based and anomaly-based) is based on the detection strategy utilized [6].
Static malware detection analyzes the code and metadata of a sample, such as size, developer information, number of downloads, and other relevant details, to identify malicious behavior. It is a quick initial step that can detect suspicious code without execution. However, code obfuscation and the need for reverse engineering make it time-consuming and pose challenges in identifying threats [5]. In contrast, dynamic malware detection executes the suspicious code in a controlled environment to analyze its behavior and capabilities. It provides detailed insights into the code’s actions and risks, and it is often used as a second step in malware detection. However, it can produce false positives, reducing detection accuracy [5]. Hybrid malware detection combines both static and dynamic methods [5].
The most common approaches used to detect malware are signature-based methods [7]. Signature-based malware detectors rely on a database of known malware to identify and prevent malicious activities. They compare file characteristics to known signatures and take action if a match is found, which makes them effective against known threats [8]. However, signature-based detection cannot detect new or modified threats and can be computationally intensive because it processes large amounts of data [8]. Anomaly-based malware detectors identify malicious activities by detecting deviations from normal behaviour, aiming to detect unknown or zero-day malware that may evade traditional signature-based methods. These techniques employ machine learning algorithms [8]. Although machine learning-based malware detectors are effective and have good accuracy, their features must be collected manually and carefully by subject matter experts for the detectors to perform well [9]. They may also generate false positives or miss sophisticated attacks that mimic normal behaviour [7].
Recently, image visualization-based malware detection is becoming increasingly popular as it effectively detects malicious activities on a computer system. It is an effective way to detect sophisticated and complex malware that can bypass traditional methods of detection [5]. Instead of focusing on code analysis or behavioural patterns, this approach leverages the visual representation of malware samples to uncover potential malicious activities. The executable files or network traffic are converted into grey-scale or RGB images, and then these images are analyzed by the classification model to decide whether malware or not using visual features [10]. Image visualization-based malware detection has several advantages over traditional security techniques. First, it is more accurate in detecting malicious activities. Second, it is faster and easier to use than traditional methods. Lastly, it is more cost-effective than traditional methods, as it requires less human resources and time to set up and maintain [9].
Most of the recent studies focus on using convolutional neural networks (CNNs) as dynamic techniques for visualization-based malware detection [1], [5], [11], [12]. CNNs can analyze the visual representation of malware codes which are collected from executable files or network traffic, and identify malicious patterns based on the features that are extracted from the input image. However, CNNs are limited in their ability to identify subtle differences between malware variants [12]. As a result, they may not be able to detect small changes in malicious code that could affect the way the malware behaves [12].
Transformer-based approaches were originally applied in natural language processing applications [13]. Some transformer-based approaches, such as the Vision Transformer (ViT) [13], are now applied to image classification tasks such as plant disease classification [14], [15] and chest X-ray image classification [16]. Although ViT outperforms CNNs in terms of accuracy and retains the spatial information of the image [17], only a few earlier studies have developed a ViT-based malware classifier. Moreover, some limitations of the ViT architecture should be tackled to improve its performance in malware classification. ViT can process images at a much higher resolution than traditional CNNs, but processing very large images can still be computationally intensive. The shallow layers of ViT yield only the global representation of the input image, whereas the local representation should also be considered. ViT also requires large amounts of data to be trained accurately, and it can be computationally expensive, requiring powerful hardware and long training times [17].
This paper builds a B_ViT butterfly construction-based vision transformer model for visualization-based malware classification and detection. The main contributions of B_ViT are summarized as follows:
B_ViT is an enhanced ViT architecture that captures both the local and the global spatial representations of malware images.
All the variants of B_ViT such as BViT/B16, BViT/B32, BViT/L16, and BViT/L32 are experimented and evaluated on the MalImg [9], Microsoft BIG [18] datasets, and top-1000 PE imports. Then, they are compared with the performance of IEViT and ViT.
The following sections in this paper are organized as follows: Section II describes the related works and classifies the malware detection and classification approaches as per their functions. Section III discusses the grayscale malware image generation and the architecture of the proposed model B_ViT in detail. The materials and datasets are discussed in section IV. Section V analyzes the robustness of the proposed model to obfuscation. The performance analysis of B_ViT and the comparative analysis with the other ViT-based architectures are presented in section VI. In addition, section VI shows a parallel analysis of B_ViT over IEViT, and ViT. A discussion and results interpretation are provided in section VII. Finally, the conclusion and limitations are provided in section VIII.
Related Work
Static malware detection techniques detect malicious software (malware) without actually executing it. This type of detection analyzes the code or other static characteristics of the malware. Naik et al. [19] proposed a static malware detection method called fuzzy-import hashing, which combines two hashing techniques, fuzzy hashing and import hashing, that complement each other to improve the malware detection rate. Liangboonprakong and Sornil [20] used N-gram sequential pattern features to build a malware classification approach. The N-grams are extracted by disassembling the binary executable into a hexadecimal string. The next step is sequential pattern extraction, which looks for frequently occurring sequences and uses them to build a feature vector for classification. Narouei et al. [21] proposed a static malware detection technique to detect malware accurately and resist packing and injection of malware into legitimate software. Avdiienko et al. [22] mined the data flow of benign Android apps and used it to detect malicious Android apps; their MUDFLOW tool mines and classifies the data flow in apps using FlowDroid. Static malware detection techniques are useful in terms of performance because they identify suspicious code without having to run it. However, these techniques cannot resist code obfuscation or code packing, and they require domain experts and reverse engineering [5].
Dynamic malware detection techniques detect malicious code while it is running and can provide insights into the code’s behavior within the operating system, such as its system calls and use of system resources. These behavioral features can be used to implement machine learning or deep learning-based frameworks for malware detection [5]. Li et al. [23] proposed DMalNet, a dynamic malware detection technique based on API semantic features and graph learning. The semantic features are extracted from API arguments and names by a hybrid encoder. The relationship between API calls is then converted into structural graph data using an API call graph that is derived from the API call sequence. Li et al. [24] implemented a malware detection framework using deep learning models that extract intrinsic features by capturing and combining significant features. API call-based detection approaches achieve good accuracy in malware detection, but their complexity is high and they are hard to use in a real-time environment, which demands high capabilities. Recently, many state-of-the-art studies have suggested hardware-based detection approaches that enhance the performance of the model or framework [11]. Tian et al. [11] proposed MDCHD, a detection approach based on the control flow of the target software at run-time, collected from processor traces. Chen et al. [25] used the control flow traces of the target software to implement a deep learning-based malware detector. For minimal-overhead execution tracing, Chen et al. [25] used processors with Intel Processor Trace enabled. Dynamic malware detection techniques have some limitations, such as the need to select a proper environment in which to execute the software, malware that changes its behavior, and the generation of false sequences. Dynamic analysis-based methods are also non-scalable because they need a lot of resources and are environment-specific [5].
In recent studies, visualization-based malware detection has become popular for detecting complex and sophisticated malware that can bypass traditional malware detection techniques [5]. These methods are accurate and cost-effective, and they require no feature extraction and fewer human resources [9]. Falana et al. [26] proposed a visualization-based malware detection approach that converts the binary files of malware into RGB images and then extracts the features from these images using an ensemble method containing two neural networks: Deep-CNN and Deep-GAN. Landman et al. [1] detected unknown malware in a Linux-based cloud environment. They built a framework called Deep-Hook that extracts the memory dump sequences of a virtual machine while it is running and then reshapes them into visual images. Finally, a CNN-based classifier detects the malware based on the visual image. Kumar et al. [5] proposed DTMIC, a malware classification method based on deep transfer learning, i.e., a CNN model pre-trained on the ImageNet dataset. DTMIC converts portable executable files into grayscale images; each image is then processed by the pre-trained deep CNN to determine the malware family to which it belongs. Vasan et al. [12] proposed IMCFN, an approach that fine-tunes a CNN model to classify image-based malware. IMCFN converts the binary sequences of malware into RGB images and then processes them with a fine-tuned CNN model that was previously trained on the ImageNet dataset. Tian et al. [11] proposed MDCHD, a visualization-based malware detection approach for virtual machine environments. MDCHD collects the control flow of the target software at run-time using the IPT technique and then converts it into RGB images, which are processed by a CNN architecture that detects the malware. Makandar et al. [27] used image processing concepts for malware classification based on a multi-class SVM. Using the discrete wavelet transform, GIST, Gabor wavelet, and other features, multi-resolution wavelets are utilized to construct an efficient texture feature vector. Narayanan et al. [28] visualized viruses and malware as images. The features are then extracted from the images using principal component analysis (PCA), and hybrid techniques such as ANN together with SVM and KNN are used to classify the malware.
Convolutional neural networks (CNNs) are dynamic technology that is mostly used in related studies to detect malware. CNNs have a limited ability to distinguish minute variations among malware types. They might therefore be unable to identify minute alterations in malicious code that might have an impact on how the malware behaves [12]. Recently, most of the studies applied attention mechanisms along with the CNN model to enhance the performance of CNN. Xu et al. [29] proposed Malbert, a malware detection approach that depends on a pre-trained deep learning model using 12 transformers, each transformer contains an attention layer followed by a deep neural network. Zhang et al. [30] proposed a static ransomware detection technique that uses N-gram opcodes and self-attention-based CNN. In 2021, Dosovitskiy et al. [13] proposed a vision transformer (ViT) approach which is an attention-based approach, to be used as an image-based classifier. ViT was applied in several fields and achieved high performance in all studies. Xu et al. [31] created a multi task classification framework using ViT that is capable of simultaneously predicting four glioma molecular expressions based on MR images. Okolo et al. [16] enhanced the architecture of ViT that can perform better for classifying chest X-ray images. Haurum et al. [32] proposed a multi-scale hybrid ViT to classify sewer defects. Sheynin et al. [33] proposed a method that provides simultaneous coarse-grained local interactions and global interactions.
For malware classification and detection, CNNs may not be able to detect small changes in malicious code that could affect how the malware behaves. ViT is a better solution for capturing the spatial information of malware images. Park et al. [34], proposed an enhanced ViT for malware classification. This method incorporates multiple patch encodings which capture both the location information of local features and global features. Seneviratne et al. [35] introduce SHERLOCK, a novel approach for Android malware detection. The method utilizes a Self-Supervised ViT, which is trained on a large corpus of Android application (APK) files. SHERLOCK captures the representations of APK files and based on that, decides whether a given APK file contains malware or not. In this paper, we proposed a B_ViT architecture, a butterfly construction-based vision transformer for malware image classification. The proposed architecture B_ViT tackles the limitations of ViT architecture and image malware-based studies. The proposed malware classifier captures the local spatial representation and global spatial representation of malware images. It does not require domain experts. It supports parallel processing for malware images. Moreover, it is scalable, time-effective, and data-satiable.
Methodology
In this paper, a visualization-based malware classifier/detector namely, butterfly construction-based vision transformer (B_ViT) is proposed. The input of B_ViT model could be any grayscale malware image or a portable executable import after being converted into a grayscale image. As output, B_ViT model classifies the malware image or detects the malware in the input image.
Before the methodology sub-sections, Table 1 lists the notation used, for reference and clarification.
A. Grayscale Image Generation Phase
In this phase, the binary sequence of portable executable (PE) import is converted into a 2-dimensional grayscale image. Then, the image would be used by the proposed model B_ViT to classify or detect the malware. The steps of the PE imports conversion into grayscale images are shown in Algorithm 1. Firstly, the binary sequence of PE import
Algorithm 1: Conversion of a Malware Portable Executable Import Into a Grayscale Image (input: malware PE import)
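The conversion step can be sketched in Python as follows. This is a minimal illustration rather than a reproduction of Algorithm 1: it assumes the commonly used scheme in which every byte of the binary becomes one pixel intensity (0 to 255) and the byte stream is wrapped into rows of a fixed width. The function name `bytes_to_grayscale`, the `width` parameter, and the zero-padding of the last row are assumptions made for this example.

```python
import numpy as np


def bytes_to_grayscale(data: bytes, width: int = 256) -> np.ndarray:
    """Map a byte sequence to a 2-D uint8 array (one byte = one pixel)."""
    if width <= 0:
        raise ValueError("width must be positive")
    pixels = np.frombuffer(data, dtype=np.uint8)
    height = -(-pixels.size // width)  # ceiling division
    # Zero-pad the final row so the array is rectangular (assumption).
    padded = np.zeros(height * width, dtype=np.uint8)
    padded[: pixels.size] = pixels
    return padded.reshape(height, width)


# Ten bytes wrapped into rows of four pixels give a 3 x 4 image.
image = bytes_to_grayscale(bytes(range(10)), width=4)
print(image.shape)  # (3, 4)
```

In practice the width would be chosen so that the resulting image can be resized to the input resolution expected by the classifier.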
B. The Proposed Butterfly Construction-Based Vision Transformer B_ViT
B_ViT is an enhanced vision transformer (ViT) architecture that relies on the parallel processing of image segments and captures local and global spatial representations of malware images. B_ViT utilized local and global positional encoding of malware images. Local and global transformer encoders are also included in B_ViT. The architecture of the proposed malware classifier B_ViT is shown in Figure 2. B_ViT has four phases: image partitioning & patches embeddings; local attention; global attention; and training and malware classification. The details of each phase are shown in the following sections.
1) Image Partitioning & Patches Embeddings Phase
Grid partitioning is followed in the proposed model: the malware image $x$ is divided into $N$ patches (Eq. 1), and each patch is mapped to a patch embedding $P^{(i)}_{E}$ by a linear projection (Eq. 2).
\begin{align*} x &\rightarrow [x^{(1)}, x^{(2)}, \ldots, x^{(N)}]; \quad x^{(i)} \in R^{P \times P \times C}, \quad i = 1, \ldots, N \tag{1}\\ P^{(i)}_{E} &= LP(F(x^{(i)})) = Dense_{D}(F(x^{(i)})); \quad i = 1, \ldots, N \tag{2}\end{align*}
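Equations (1) and (2) can be illustrated with a short NumPy sketch. It assumes that $F$ flattens a patch into a vector and that the dense layer is a plain affine map; the function names, the 224 x 224 input size, and the random weights are illustrative assumptions, not values taken from the paper.

```python
import numpy as np


def partition_into_patches(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into N = (H/P) * (W/P) patches of shape (P, P, C)."""
    height, width, channels = image.shape
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    grid = image.reshape(rows, patch_size, cols, patch_size, channels)
    # Bring the two grid axes to the front, then merge them into one patch axis.
    grid = grid.transpose(0, 2, 1, 3, 4)
    return grid.reshape(rows * cols, patch_size, patch_size, channels)


def embed_patches(patches: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Flatten each patch (assumed F) and project it to D dimensions (Dense_D)."""
    flat = patches.reshape(patches.shape[0], -1)
    return flat @ weight + bias


rng = np.random.default_rng(0)
image = rng.random((224, 224, 1))            # grayscale image, C = 1 (assumed size)
patches = partition_into_patches(image, 16)  # N = 14 * 14 = 196 patches
weight = rng.normal(scale=0.02, size=(16 * 16 * 1, 768))
embeddings = embed_patches(patches, weight, np.zeros(768))
print(patches.shape, embeddings.shape)       # (196, 16, 16, 1) (196, 768)
```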
Algorithm 2: B_ViT, the Butterfly Construction-Based Vision Transformer Algorithm
2) Local Attention Phase
Unlike the original ViT, B_ViT applies self-attention-based transformers to process the malware image in segments simultaneously, where more focus is placed on local features and local representation.
\begin{align*} \begin{cases} Z_{i} = \left[P_{E}^{((i-1)\times \frac{N}{k} + 1)} + Lpos_{E}(1), \ldots, P_{E}^{(i\times \frac{N}{k})} + Lpos_{E}\left(\frac{N}{k}\right)\right] \\ Z_{i} \gets \left[x^{(0)}_{class} \cdot Linear\_projection + Lpos_{E}(0) \; || \; Z_{i}\right] \end{cases} \tag{3}\end{align*}
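The segment construction in Eq. (3) can be sketched as follows, assuming that $N$ is divisible by $k$ and that all segments share one set of local positional encodings. The function name `build_local_segments`, the value $k = 4$, and the random inputs are hypothetical choices for illustration; the text does not state the value of $k$.

```python
import numpy as np


def build_local_segments(patch_emb, local_pos, class_emb, k):
    """Split N patch embeddings into k segments and prepend a class token to each.

    patch_emb: (N, D) patch embeddings P_E
    local_pos: (N/k + 1, D) local positional encodings Lpos_E(0 .. N/k)
    class_emb: (D,) linearly projected class token
    Returns an array of shape (k, N/k + 1, D).
    """
    n, d = patch_emb.shape
    if n % k:
        raise ValueError("N must be divisible by k")
    seg_len = n // k
    # Segment i holds patches (i-1)*N/k + 1 .. i*N/k, each plus Lpos_E(1 .. N/k).
    segments = patch_emb.reshape(k, seg_len, d) + local_pos[1:]
    # Class token plus Lpos_E(0), repeated once per segment.
    token = np.broadcast_to(class_emb + local_pos[0], (k, 1, d))
    return np.concatenate([token, segments], axis=1)


rng = np.random.default_rng(0)
N, D, k = 196, 768, 4
patch_emb = rng.normal(size=(N, D))
local_pos = rng.normal(size=(N // k + 1, D))
class_emb = rng.normal(size=D)
segments = build_local_segments(patch_emb, local_pos, class_emb, k)
print(segments.shape)  # (4, 50, 768)
```

Because the segments are stacked along a leading axis, they can be processed by the local encoders as one batch, which is one way to realise the simultaneous processing described above.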
Finally, as shown in line 20 of Algorithm 2, each segment is passed through a transformer encoder:
\begin{align*} \begin{cases} a = MhSa_{h}(Norm(Z)) + Z \\ \tilde{Z} = MLP(Norm(a)) + a \end{cases}; \quad Z \in R^{N \times D} \tag{4}\end{align*}
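A minimal NumPy sketch of the encoder block in Eq. (4) is given below. It is a simplified illustration: layer normalization has no learnable scale or shift, biases are omitted, the MLP uses the tanh approximation of GELU, and all weights are random. The parameter names (`w_qkv`, `w_out`, `w1`, `w2`) and the sizes are assumptions for the example.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def multi_head_self_attention(z, w_qkv, w_out, heads):
    n, d = z.shape
    head_dim = d // heads
    qkv = (z @ w_qkv).reshape(n, 3, heads, head_dim)
    # q, k, v each have shape (heads, n, head_dim).
    q, k, v = (qkv[:, i].transpose(1, 0, 2) for i in range(3))
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim))
    context = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return context @ w_out


def encoder_block(z, params, heads):
    # a = MhSa_h(Norm(Z)) + Z
    a = multi_head_self_attention(layer_norm(z), params["w_qkv"], params["w_out"], heads) + z
    # Z~ = MLP(Norm(a)) + a
    return gelu(layer_norm(a) @ params["w1"]) @ params["w2"] + a


rng = np.random.default_rng(0)
dim, heads = 64, 4
params = {
    "w_qkv": rng.normal(scale=0.02, size=(dim, 3 * dim)),
    "w_out": rng.normal(scale=0.02, size=(dim, dim)),
    "w1": rng.normal(scale=0.02, size=(dim, 4 * dim)),
    "w2": rng.normal(scale=0.02, size=(4 * dim, dim)),
}
segment = rng.normal(size=(50, dim))
encoded = encoder_block(segment, params, heads)
print(encoded.shape)  # (50, 64)
```

The two residual additions mean that the block leaves its input unchanged when the output projections are zero, which is a convenient sanity check for an implementation.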
3) Global Attention Phase
In this phase, the input image is processed in one block by one global transformer encoder, with more focus on global features and representation. The steps of this phase are shown in lines 22-33 of Algorithm 2. The global representation of the malware image is generated as follows:
\begin{align*} \tilde{Z} &\gets \{\tilde{Z}_{1} \; || \; \tilde{Z}_{2} \; || \; \ldots \; || \; \tilde{Z}_{k}\} \backslash \{\tilde{Z}^{(0)}_{i}\} + P_{E}^{(j)} + Gpos_{E}(j); \\ &\qquad Gpos_{E} \in R^{N \times D}; \quad j = 1, \ldots, N; \quad \tilde{Z} \in R^{(N+k) \times D} \tag{5}\end{align*}
After that, as shown in line 28 of Algorithm 2,
Finally, the global transformer embeddings are divided back into $k$ segments:
\begin{equation*} \tilde{\tilde{Z}}_{i} \gets \left[\tilde{\tilde{Z}}_{i}^{((i-1)\times \frac{N+k}{k} + 1)}, \ldots, \tilde{\tilde{Z}}_{i}^{(i\times \frac{N+k}{k})}\right] \tag{6}\end{equation*}
4) Training and Malware Classification Phase
B_ViT model implements same ViT-adopted variants [13] and IEViT-adopted variants [16]. Therefore, four variants of B_ViT such as BViT/B16, BViT/B32, BViT/L16, and BViT/L32 have been experimented. Table 2 shows the details of ViT-adopted, IEViT-adopted, and B_ViT-adopted variants. More details in all the variants of B_ViT such as BViT/B16, BViT/B32, BViT/L16, and BViT/L32 are provided as follows:
BViT/B16 is a medium-sized Vision Transformer with a patch size of $16\times 16$ pixels. It is suitable for tasks with fewer classes or limited computational resources. Each $16\times 16$ patch is linearly projected and processed by 12 Transformer encoder layers executed in 4 rounds. The local transformer encoders have 4 heads and the global transformer encoders have 12 heads. These transformer encoders generate embeddings with a size of 768.
BViT/B32 is a medium-sized Vision Transformer with a patch size of $32\times 32$ pixels. It captures a coarser-grained representation by reducing the number of patches compared to BViT/B16. It is suitable for lower-resolution images or when spatially localized features are more prevalent. Each $32\times 32$ patch is linearly projected and processed by 12 Transformer encoder layers executed in 4 rounds. The local transformer encoders have 4 heads and the global transformer encoders have 12 heads. These transformer encoders generate embeddings with a size of 768.
BViT/L16 is a larger Vision Transformer with a patch size of $16\times 16$ pixels. It has a higher capacity for capturing fine-grained details and complex patterns in images. It is suitable for tasks with many classes or when higher accuracy is prioritized over computational efficiency. Each $16\times 16$ patch is linearly projected and processed by 24 Transformer encoder layers executed in 8 rounds. The local transformer encoders have 4 heads and the global transformer encoders have 16 heads. These transformer encoders generate embeddings with a size of 1024.
BViT/L32 is a larger Vision Transformer with a patch size of $32\times 32$ pixels. It captures a more holistic view of the input image, considering broader context and global dependencies. It is advantageous for higher-resolution images or those with large-scale spatial structures. Each $32\times 32$ patch is linearly projected and processed by 24 Transformer encoder layers executed in 8 rounds. The local transformer encoders have 4 heads and the global transformer encoders have 16 heads. These transformer encoders generate embeddings with a size of 1024.
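For reference, the hyperparameters of the four variants described above can be collected in a small configuration table. The class name `BViTVariant` and its field names are hypothetical; the values are those stated in the text. Dividing the number of layers by the number of rounds gives three layers per round for every variant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BViTVariant:
    patch_size: int    # side length of a square patch, in pixels
    layers: int        # total transformer encoder layers
    rounds: int        # rounds of local + global attention
    local_heads: int   # attention heads in the local encoders
    global_heads: int  # attention heads in the global encoder
    embed_dim: int     # embedding size D

    @property
    def layers_per_round(self) -> int:
        return self.layers // self.rounds


VARIANTS = {
    "BViT/B16": BViTVariant(16, 12, 4, 4, 12, 768),
    "BViT/B32": BViTVariant(32, 12, 4, 4, 12, 768),
    "BViT/L16": BViTVariant(16, 24, 8, 4, 16, 1024),
    "BViT/L32": BViTVariant(32, 24, 8, 4, 16, 1024),
}

for name, variant in VARIANTS.items():
    print(name, variant.layers_per_round)  # 3 for every variant
```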
For training all B_ViT variants, the local attention and global attention phases are iterated several rounds to fine-tune the training parameters of transformer encoders. In each round, three transformers’ layers are performed where
B_ViT is a transfer learning-based model that uses the ViT model pre-trained on the ImageNet dataset [36]. The class prediction $P_{class}$ is obtained by applying the MLP head to the concatenation $\hat{Z}$:
\begin{align*} \begin{cases} \hat{Z} = \{\hat{Z}_{1} \; || \; \hat{Z}_{2} \; || \; \ldots \; || \; \hat{Z}_{k}\} \\ P_{class} = MLP_{head}(\hat{Z}) \end{cases}; \quad \hat{Z} \in R^{N \times D} \tag{7}\end{align*}
The use of transfer learning in B_ViT, being an efficient model, brings several benefits. Without transfer learning, B_ViT would face the following challenges:
Training from scratch: B_ViT would require an extensive amount of labelled malware images (ranging from 14 million to 300 million) and significant computational resources, leading to lengthy training times spanning weeks.
Generalization limitations: B_ViT’s performance may suffer when encountering new or unknown malware samples since it lacks the knowledge and patterns obtained from pretraining on a large-scale dataset.
Overfitting risks: In scenarios with limited training data, B_ViT would be more susceptible to overfitting, compromising its ability to generalize well to unseen malware instances.
The proposed model is trained with the cross-entropy loss $L_{CE}$:
\begin{equation*} L_{CE} = - \sum_{j=1}^{M} d_{j} \log(y_{j}) \tag{8}\end{equation*}
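As a worked example of Eq. (8), the sketch below evaluates the loss for a single sample. It assumes the usual reading of the symbols, namely that $d_{j}$ is the one-hot target for class $j$, $y_{j}$ is the predicted probability for class $j$, and $M$ is the number of classes; the clipping constant `eps` is an addition for numerical safety and is not part of the equation.

```python
import math
from typing import Sequence


def cross_entropy(target: Sequence[float], predicted: Sequence[float], eps: float = 1e-12) -> float:
    """L_CE = -sum_j d_j * log(y_j) for one sample."""
    if len(target) != len(predicted):
        raise ValueError("target and predicted must have the same length")
    # Clip probabilities away from zero so that log() stays finite.
    return -sum(d * math.log(max(y, eps)) for d, y in zip(target, predicted))


# Three classes, true class is the second one, predicted with probability 0.7.
loss = cross_entropy([0.0, 1.0, 0.0], [0.2, 0.7, 0.1])
print(round(loss, 4))  # 0.3567
```

With a one-hot target only the term of the true class survives, so the loss reduces to the negative log-probability assigned to that class.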
Data augmentation is applied in deep learning-based image classification approaches during the training phase to increase the number of training samples and overcome the overfitting issue [37]. Data augmentation includes some methods such as shearing, scaling, flipping, shifting, rotating, and zooming [38]. Table 3 shows the augmentation steps that are followed in this paper to generate more training data from the original malware images. Furthermore, the testing images are excluded from the augmentation for getting realistic results from the original images.
As shown in Figure 3, Figure 4, and Table 7, the three datasets that are used for training B_ViT are imbalanced i.e., some malware families images are significantly larger or smaller compared to other malware families. However, by increasing the malware images in some classes via data augmentation, the impact of unbalanced data can be limited. The details of data augmentation such as training samples (before augmentation), training samples (after augmentation), and testing samples for Malimg, Microsoft BIG, and PE imports are presented in Table 4, Table 5, and Table 6 respectively.
Datasets
The variants of the proposed model, i.e., BViT/B16, BViT/B32, BViT/L16, and BViT/L32, are trained and evaluated on three publicly available malware image datasets: MalImg [9], Microsoft BIG [18], and top-1000 PE imports [39]. Tables 7, 8, and 9 show details of MalImg, Microsoft BIG, and top-1000 PE imports respectively. MalImg contains over 9,000 grayscale malware images distributed across 25 malware families, as shown in Figure 3. Microsoft BIG contains over 10,000 malware byte files represented in hexadecimal and distributed across 9 malware families, as shown in Figure 4. The top-1000 PE imports dataset, which is used for malware detection, contains over 47,000 portable executable imports collected from virusshare.com and distributed across 2 classes (benign and malware), as shown in Table 7. In the experiments, MalImg, Microsoft BIG, and PE imports are each divided in an 80:20 ratio for training and testing.
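The 80:20 division can be sketched as a shuffled split. The text does not say whether the split was stratified by malware family or which random seed was used, so the plain shuffle, the function name `split_train_test`, and the `seed` parameter below are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(samples: Sequence[T], train_fraction: float = 0.8, seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and split them into training and testing parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local generator, reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(range(100))
print(len(train), len(test))  # 80 20
```

Since the datasets are imbalanced, a per-family (stratified) split would keep the class proportions equal in both parts; that variant is not shown here.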
The implementation of the proposed architecture
Analysis of Obfuscation Resistance
Obfuscation covers a variety of techniques, such as code encryption, code scrambling, and code packing, that malware authors use to evade detection and keep malware active on a compromised system for a longer time. Polymorphic and metamorphic obfuscations are two common circumvention techniques [40].
Polymorphic obfuscation alters the appearance and signature of malware using an encryption key. It uses self-replicating code and a mutation engine to swiftly morph the malware’s code and continuously modify its shape by changing the encryption keys. Because many signatures are generated as a result, preventing and detecting such malware with signature-based detection methods is a big challenge.
Metamorphic obfuscation rewrites the code of the malware itself to generate a different variant at each iteration. The new variants are generated without using an encryption key, and their functionality stays the same. Four methods for metamorphic obfuscation are frequently used: instruction substitution, register reassignment, dead-code insertion, and code transposition.
More than 80% of malware binary sequences employ obfuscation techniques to evade detection. Obfuscation techniques can be overcome by texture-based malware classification [12], [41]. The texture-based features of malware images are obtained by converting the binary sequence into a 2-dimensional grayscale image, as shown in Algorithm 1. Then, B_ViT classifies the malware image based on texture features. The proposed study considers the malware families in three datasets, i.e., MalImg, Microsoft BIG, and top-1000 PE imports, and focuses only on polymorphic obfuscation. The results show the efficiency of texture-based malware detection as well as the resilience of B_ViT to polymorphic obfuscation. Some obfuscated malware samples from the ObfuscatorAD family in MalImg are shown in Figure 5.
Performance Analysis
The proposed model, B_ViT (butterfly construction-based vision transformer), is evaluated for visualization-based malware classification and detection. Malware classification involves categorizing malware into classes based on characteristics, behaviour, or attributes. Malware detection identifies the presence of malware and raises an alert. Detection is the main goal of system security, while classification aids in developing detection techniques. Malware detection entails a classification process that distinguishes between malware data samples and benign data samples. The MalImg [9] and Microsoft BIG [18] datasets are used to evaluate all the variants of B_ViT, i.e., BViT/B16, BViT/B32, BViT/L16, and BViT/L32, for malware classification. Moreover, we conducted an experiment on Microsoft PE imports to detect malware, where the data samples were categorized into two classes: “malware” and “benign”. On all three datasets, the B_ViT variants are compared with the respective variants of IEViT and ViT. The performance of B_ViT, IEViT, and ViT is measured using the most common metrics, namely accuracy, recall, precision, and F1-score, which are defined in equations 9, 10, 11, and 12 respectively, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives. Moreover, the parallel performance of B_ViT variants is analyzed. Finally, B_ViT is compared with the previous state-of-the-art visualization-based malware classifiers.
\begin{align*} Accuracy &= \frac{TP + TN}{TP + TN + FP + FN} \tag{9}\\ Recall &= \frac{TP}{TP + FN} \tag{10}\\ Precision &= \frac{TP}{TP + FP} \tag{11}\\ F1\text{-}score &= \frac{2 \times Recall \times Precision}{Recall + Precision} \tag{12}\end{align*}
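Equations (9) to (12) translate directly into code. The sketch below computes the four metrics from confusion-matrix counts; the function name `classification_metrics`, the zero-denominator guard, and the example counts are illustrative assumptions and are not results from the paper.

```python
from typing import Dict


def _ratio(numerator: float, denominator: float) -> float:
    # Return 0.0 instead of raising when a class has no relevant samples.
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> Dict[str, float]:
    """Accuracy, recall, precision, and F1-score as defined in Eqs. (9)-(12)."""
    accuracy = _ratio(tp + tn, tp + tn + fp + fn)
    recall = _ratio(tp, tp + fn)
    precision = _ratio(tp, tp + fp)
    f1 = _ratio(2 * recall * precision, recall + precision)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Made-up counts for a binary malware/benign task.
metrics = classification_metrics(tp=90, tn=85, fp=15, fn=10)
print(metrics["accuracy"], metrics["recall"])  # 0.875 0.9
```

For multi-class datasets such as MalImg, the same function can be applied per family by treating that family as the positive class and all other families as negative.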
A. Comparison of B_ViT With IEViT and ViT for MalImg Dataset
Comparative analysis of the proposed method B_ViT with IEViT and ViT for image malware classification using MalImg dataset is performed. Figure 6 shows the comparative analysis results of B_ViT variants with IEViT and ViT variants. The accuracy of BViT/B16, BViT/B32, BViT/L16, and BViT/L32 for 25 malware families compared to IEViT/B16, IEViT/B32, IEViT/L16, and IEViT/L32 as well as ViT/B16, ViT/B32, ViT/L16, and ViT/L32 is shown in Figure 6 (a, b, c, d) respectively. The recall of BViT/B16, BViT/B32, BViT/L16, and BViT/L32 for 25 malware families compared to IEViT/B16, IEViT/B32, IEViT/L16, and IEViT/L32 as well as ViT/B16, ViT/B32, ViT/L16, and ViT/L32 is shown in Figure 6 (e, f, g, h) respectively. The precision of BViT/B16, BViT/B32, BViT/L16, and BViT/L32 for 25 malware families compared to IEViT/B16, IEViT/B32, IEViT/L16, and IEViT/L32 as well as ViT/B16, ViT/B32, ViT/L16, and ViT/L32 is shown in Figure 6 (i, j, k, l) respectively. The F1-score of BViT/B16, BViT/B32, BViT/L16, and BViT/L32 for 25 malware families compared to IEViT/B16, IEViT/B32, IEViT/L16, and IEViT/L32 as well as ViT/B16, ViT/B32, ViT/L16, and ViT/L32 as shown in Figure 6 (m, n, o, p) respectively. It has been noted that B_ViT variants outperform the IEViT and ViT variants in visualization-based malware classification using MalImg dataset where accuracy, recall, precision, and F1-score are very close to 1 in most malware families.
Figure 6. Comparative analysis of B_ViT variants with IEViT and ViT variants in terms of accuracy, recall, precision, and F1-score for MalImg.
The confusion matrices of B_ViT variants i.e, BViT/B16, BViT/B32, BViT/L16, and BViT/L32 for MalImg dataset are shown in Figure 7 (a, b, c, d) respectively. The Swizzor.gen!E and Swizzor.gen!I families are difficult to identify from one another in MalImg datasets since they are so similar as shown in Figure 8. The variants of the proposed model BViT/B16, BViT/B32, BViT/L16, and BViT/L32 achieve 98.8%, 99.3%, 99.08%, and 98.72% accuracies for Swizzor.gen!E respectively; and 98.5%, 99.0%, 98.76%, and 98.56% accuracies for Swizzor.gen!I respectively. Therefore, B_ViT is an effective malware classifier even for malware families that are difficult to distinguish.
The confusion matrices of B_ViT variants: (a) BViT/B16, (b) BViT/B32, (c) BViT/L16, and (d) BViT/L32 for the MalImg dataset.
The overall malware classification performance of the B_ViT variants compared with the IEViT and ViT variants on the MalImg dataset is shown in Table 8. The B_ViT variants outperform the IEViT and ViT variants in visualization-based malware classification on the MalImg dataset in terms of accuracy, recall, precision, and F1-score.
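The per-family scores discussed in this subsection can be derived from a confusion matrix. The following is a minimal sketch, assuming a one-vs-rest definition of per-family accuracy, recall, precision, and F1-score; the function name `per_class_metrics` and the toy counts are illustrative assumptions and are not taken from the paper's experiments.

```python
def per_class_metrics(cm):
    """Return one dict of one-vs-rest metrics per class.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j (rows = true class, columns = predicted class).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # class c predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp  # other classes predicted as c
        tn = total - tp - fn - fp
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        accuracy = (tp + tn) / total if total else 0.0
        metrics.append({"accuracy": accuracy, "recall": recall,
                        "precision": precision, "f1": f1})
    return metrics


# Toy 3-family confusion matrix with made-up counts (illustration only).
toy_cm = [[8, 2, 0],
          [1, 9, 0],
          [0, 0, 10]]
toy_metrics = per_class_metrics(toy_cm)
for family, m in enumerate(toy_metrics):
    print(family, {name: round(value, 4) for name, value in m.items()})
```

Under this one-vs-rest definition, per-family accuracy counts the true negatives of all other families, so it can stay high even when two similar families are confused with each other; recall and precision are more sensitive to such confusions.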
B. Comparison of B_ViT With IEViT and ViT for Microsoft BIG Dataset
A comparative analysis of the proposed method B_ViT with IEViT and ViT for image malware classification is performed using the Microsoft BIG dataset. Figure 9 shows the per-family results for the 9 malware families, comparing BViT/B16, BViT/B32, BViT/L16, and BViT/L32 with IEViT/B16, IEViT/B32, IEViT/L16, and IEViT/L32 as well as ViT/B16, ViT/B32, ViT/L16, and ViT/L32. Accuracy is shown in Figure 9 (a, b, c, d), recall in Figure 9 (e, f, g, h), precision in Figure 9 (i, j, k, l), and F1-score in Figure 9 (m, n, o, p), respectively. For each of the four metrics, the B_ViT variants consistently achieve higher scores than the IEViT and ViT variants across all malware families, with values nearing 1. Among the B_ViT variants, BViT/L16 and BViT/L32 show more stable and higher performance on all four metrics, particularly for the simda and kelihos_ver1 families, because they have more global heads and more layers. Overall, the B_ViT variants outperform the IEViT and ViT variants in visualization-based malware classification on the Microsoft BIG dataset, with accuracy, recall, precision, and F1-score very close to 1 for most malware families.
Comparative analysis of B_ViT variants with IEViT and ViT variants in terms of accuracy, recall, precision, and F1-score for the Microsoft BIG dataset.
The confusion matrices of the B_ViT variants, i.e., BViT/B16, BViT/B32, BViT/L16, and BViT/L32, for the Microsoft BIG dataset are shown in Figure 10 (a, b, c, d), respectively. The simda and kelihos_ver1 families are difficult to distinguish from one another in the Microsoft BIG dataset because they are so similar. The variants of the proposed model, BViT/B16, BViT/B32, BViT/L16, and BViT/L32, achieve accuracies of 98.10%, 98.43%, 100%, and 99.88% for simda, respectively, and 98.10%, 100%, 98.76%, and 99.77% for kelihos_ver1, respectively. Therefore, B_ViT, and in particular BViT/L16 and BViT/L32, is an effective malware classifier even for malware families that are difficult to distinguish.
The confusion matrices of B_ViT variants: (a) BViT/B16, (b) BViT/B32, (c) BViT/L16, and (d) BViT/L32 for the Microsoft BIG dataset.
The overall malware classification performance of the B_ViT variants compared with the IEViT and ViT variants on the Microsoft BIG dataset is shown in Table 9. The B_ViT variants outperform the IEViT and ViT variants in visualization-based malware classification on the Microsoft BIG dataset in terms of accuracy, recall, precision, and F1-score.
C. Comparison of B_ViT With IEViT and ViT for TOP-1000 PE Imports Dataset
A comparative analysis of the proposed method B_ViT with IEViT and ViT for image malware detection is performed using the top-1000 PE imports dataset. Figure 11 (a, b, c, d) shows the confusion matrices of the B_ViT variants, i.e., BViT/B16, BViT/B32, BViT/L16, and BViT/L32, for the top-1000 PE imports dataset. Most of the benign and malware samples are correctly classified, with only a few false positives (FP) and false negatives (FN). Therefore, B_ViT is an effective visualization-based malware detector.
The confusion matrices of B_ViT variants: (a) BViT/B16, (b) BViT/B32, (c) BViT/L16, and (d) BViT/L32 for the top-1000 PE imports dataset.
The overall malware detection performance of the B_ViT variants compared with the IEViT and ViT variants on the PE imports dataset is shown in Table 10. The B_ViT variants outperform the IEViT and ViT variants in visualization-based malware detection on the PE imports dataset in terms of accuracy, recall, precision, and F1-score.
Finally, Tables 11 and 12 show the improvement of the B_ViT variants over the IEViT and ViT variants in terms of F1-score for the MalImg, Microsoft BIG, and PE imports datasets.
D. Parallel Analysis of B_ViT and Comparison With IEViT and ViT
B_ViT is a parallel-based architecture that supports the parallel processing of image patches. Therefore, a parallel analysis of B_ViT is performed and compared with the performance of IEViT and ViT in terms of speed-up (S), overhead (T0), efficiency (E), and running cost (C), where ET_S and ET_P denote the epoch times of the sequential and parallel runs, respectively, and k is the number of threads:
\begin{align*} S& = \frac {ET_{S} (s)}{ET_{P} (s)} \tag{13}\\ T_{0}& = (k \times ET_{P}) - ET_{S} \tag{14}\\ E &= \frac {S}{k} \tag{15}\\ C& = k \times ET_{P} \tag{16}\end{align*}
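As a worked illustration of Equations (13)-(16), the sketch below computes the four parallel metrics from a sequential epoch time, a parallel epoch time, and a thread count. The function name `parallel_metrics` and the input values are hypothetical and are not measurements from Table 13.

```python
def parallel_metrics(et_seq, et_par, k):
    """Compute the parallel metrics of Equations (13)-(16).

    et_seq: epoch time of the sequential run (ET_S)
    et_par: epoch time of the parallel run (ET_P), same time unit as et_seq
    k:      number of threads
    """
    if et_par <= 0 or k <= 0:
        raise ValueError("et_par and k must be positive")
    speedup = et_seq / et_par        # Eq. (13): S = ET_S / ET_P
    overhead = k * et_par - et_seq   # Eq. (14): T_0 = (k x ET_P) - ET_S
    efficiency = speedup / k         # Eq. (15): E = S / k
    cost = k * et_par                # Eq. (16): C = k x ET_P
    return {"S": speedup, "T0": overhead, "E": efficiency, "C": cost}


# Hypothetical timings: 120 s sequential, 32 s parallel, 4 threads.
example = parallel_metrics(et_seq=120.0, et_par=32.0, k=4)
print(example)
```

With these hypothetical inputs the speed-up is 3.75 on 4 threads, so the efficiency is 0.9375 and the overhead is 8 s out of a running cost of 128 s. An ideal parallel run would have S = k, E = 1, and T0 = 0.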
Table 13 shows the parallel analysis of B_ViT over sequential B_ViT, IEViT, and ViT. The epoch time of B_ViT is smaller than that of sequential B_ViT; the speed-up is approximately equal to k (the number of threads); the overhead is very small compared with the running cost; and the efficiency of B_ViT is approximately equal to 1, which is the optimal value. Therefore, the parallel-based architecture of ViT, i.e., B_ViT, is time-effective for malware classification and detection as well as for training. Moreover, B_ViT is faster than IEViT and ViT, as shown in Table 13. The average speed-up of the B_ViT variants over the IEViT and ViT variants is 2.42 and 1.81, respectively.
E. Performance Comparison of B_ViT With Recent Visualization-Based Malware Classifiers
Table 14 compares the proposed malware classifier/detector B_ViT with recent visualization-based malware classifiers in terms of four major performance criteria, i.e., accuracy, recall, precision, and F1-score. The comparison is restricted to recent visualization-based malware classifiers that use the same datasets as this paper. The proposed method, which uses the B_ViT architecture, outperforms recent visualization-based malware classification methods that use CNN architectures. Therefore, it may be concluded that the butterfly construction-based vision transformer architecture achieves higher performance than CNNs in malware image classification, because B_ViT captures both the local and the global spatial representation of the malware image and extracts the features without the need for domain experts or feature engineering. In addition, B_ViT detects obfuscated malware and packed malware that is injected into legitimate software. Finally, B_ViT is a scalable model: a ViT model pre-trained on the ImageNet dataset (≥10 million images) is used to transfer and initialize the training parameters of the transformers.
Discussion
Recent studies focus on using CNNs for dynamic visualization-based malware detection, but they struggle to detect subtle differences between malware variants. ViT outperforms CNNs, but few studies explore ViT-based malware classifiers. ViT still has limitations: it is computationally intensive for large images, lacks local representation, requires abundant training data, and demands powerful hardware and long training times. Therefore, B_ViT, a butterfly construction-based vision transformer model for visualization-based malware classification and detection, is proposed. All B_ViT variants, i.e., BViT/B16, BViT/B32, BViT/L16, and BViT/L32, are evaluated using the MalImg and Microsoft BIG datasets for image malware classification and compared with the respective variants of IEViT and ViT. The performance comparison demonstrates the superiority of the B_ViT architecture over recent visualization-based malware classification methods that use CNN and ViT architectures, as indicated in Table 14, because B_ViT captures both the local and the global spatial representation of malware images. Moreover, the B_ViT variants demonstrate improved speed-up (S), reduced overhead (T0), improved efficiency (E), and optimized running cost (C) compared to the IEViT and ViT variants, as indicated in Table 11, Table 12, and Table 13. This is achieved through B_ViT's parallel-based architecture, which enables efficient parallel processing of image patches. In the comparative analysis of B_ViT with IEViT and ViT for image malware classification and detection, the B_ViT variants consistently achieve higher accuracy, recall, precision, and F1-score than the IEViT and ViT variants across all malware families, with values nearing 1, as shown in Figures 6 and 9. Among the B_ViT variants, BViT/L16 and BViT/L32 show more stable and higher performance, particularly for the simda and kelihos_ver1 families in Microsoft BIG, as shown in Figure 9, and for the Swizzor.gen!E and Swizzor.gen!I families in MalImg, as shown in Figure 6. This is because these variants have more global heads and more layers. In Figure 7, it can be observed that 12% of the tested samples belonging to the Swizzor.gen!I family were misclassified as the Swizzor.gen!E family, while 16% of the tested samples belonging to the Swizzor.gen!E family were misclassified as the Swizzor.gen!I family. This misclassification can be attributed to the high similarity between these two families, which makes them difficult to distinguish accurately. In Figure 10, it can be observed that 12% of the tested samples belonging to the kelihos_ver1 family were misclassified as the simda family, while 0% of the tested samples belonging to the simda family were misclassified as the kelihos_ver1 family. This misclassification can likewise be attributed to the high similarity between the two families.
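Pairwise misclassification percentages of the kind quoted above are row-normalised confusion-matrix entries. The sketch below shows one way to compute them, assuming rows hold the true family and columns the predicted family; the function name `confusion_rate`, the family labels, and the counts are hypothetical and do not reproduce Figure 7 or Figure 10.

```python
def confusion_rate(cm, labels, true_label, predicted_label):
    """Fraction of samples of `true_label` that were predicted as `predicted_label`.

    cm[i][j] counts samples of true class labels[i] predicted as labels[j].
    """
    i = labels.index(true_label)
    j = labels.index(predicted_label)
    row_total = sum(cm[i])  # number of test samples of the true family
    return cm[i][j] / row_total if row_total else 0.0


# Hypothetical families and counts (illustration only).
labels = ["family_a", "family_b", "family_c"]
toy_cm = [[44, 6, 0],    # 6 of 50 family_a samples predicted as family_b
          [0, 50, 0],    # no family_b sample predicted as family_a
          [1, 0, 49]]

a_as_b = confusion_rate(toy_cm, labels, "family_a", "family_b")
b_as_a = confusion_rate(toy_cm, labels, "family_b", "family_a")
print(f"family_a -> family_b: {a_as_b:.0%}")
print(f"family_b -> family_a: {b_as_a:.0%}")
```

The toy matrix is deliberately asymmetric: one direction of confusion is non-zero while the reverse direction is zero, which is the same kind of pattern as the one reported for kelihos_ver1 and simda.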
Conclusion
This paper proposes a butterfly construction-based vision transformer (B_ViT) model for visualization-based malware classification and detection. B_ViT is trained and evaluated on grayscale malware images collected from the MalImg or Microsoft BIG datasets or converted from portable executable imports. B_ViT has four phases: image partitioning and patch embedding; local attention; global attention; and training and malware classification. In the local attention phase, self-attention-based local transformer encoders, along with local positional encoding, process the input image's patches simultaneously to capture the local representation and features of the malware image. In the global attention phase, one self-attention-based global transformer encoder, along with global positional encoding, processes the input image as one block to capture the global representation and features of the malware image. B_ViT is a transfer learning-based model that uses a ViT model pre-trained on the ImageNet dataset to initialize the training parameters of the transformers; B_ViT is then fine-tuned to fit the malware classification task. All B_ViT variants, i.e., BViT/B16, BViT/B32, BViT/L16, and BViT/L32, are evaluated using the MalImg and Microsoft BIG datasets for image malware classification and compared with the respective variants of IEViT and ViT. The comparative analysis shows that the B_ViT variants outperform the IEViT and ViT variants in visualization-based malware classification, achieving accuracies of 98.65%, 98.28%, 99.32%, and 99.11% on MalImg and 98.80%, 98.62%, 99.49%, and 99.26% on Microsoft BIG for BViT/B16, BViT/B32, BViT/L16, and BViT/L32, respectively. In addition, the B_ViT variants are evaluated using portable executable imports for image malware detection and compared with the respective variants of IEViT and ViT. The comparative analysis shows that the B_ViT variants outperform the IEViT and ViT variants in visualization-based malware detection, achieving accuracies of 99.84%, 99.87%, 99.99%, and 99.97% for BViT/B16, BViT/B32, BViT/L16, and BViT/L32, respectively. The results show that B_ViT achieves an average F1-score improvement over IEViT and ViT of 1.21% and 2.48%, respectively. Since B_ViT is a parallel-based architecture, a parallel analysis of B_ViT over sequential B_ViT, IEViT, and ViT is performed. The results show that B_ViT is time-effective for malware classification and detection, with average speed-ups of the B_ViT variants over the sequential B_ViT, IEViT, and ViT variants of 3.84, 2.42, and 1.81, respectively. Moreover, the analysis shows the efficiency of texture-based malware detection as well as the resilience of B_ViT to polymorphic obfuscation. Because the proposed malware classifier/detector is visualization-based, it does not require domain experts for feature extraction or feature engineering. Finally, the proposed method, which uses the B_ViT architecture, outperforms recent visualization-based malware classification methods that use CNN architectures as well as ViT-based malware classifiers.
The practical use of the B_ViT-based malware classifier/detector has certain limitations that should be acknowledged. Firstly, its implementation requires a high-resource platform, especially when a high degree of parallelism is employed, because more local transformer encoders must be run to capture the local representation of malware images. Secondly, to ensure its effectiveness, the proposed method must be thoroughly tested in an on-site environment. To overcome these limitations, future work should focus on further optimization and refinement of the approach.