Introduction
With our increasing reliance on mobile devices, people keep valuable information in their devices and online storage. The wicked people get lured by the potential access to confidential information mainly for illicit financial gain, and other purposes [1]. Such wicked people carry out cyber-attacks to steal that information or hamper the regular usage of the devices. One of the most common attacks on mobile devices is the ransomware attack, in which an attacker either encrypts the data on the devices or locks the devices, thereby preventing legitimate users from accessing the data or devices. Since Android is the most popular operating system in the world of mobile devices [2], these devices are the common target of ransomware attacks. An effective ransomware attack detection technique should be devised to ensure the security of our Android device. Next, innovative approaches are discussed for detecting Android malware. In [3], the authors propose a permission-based malware detection method to differentiate between benign and malware applications in the case of an unknown signature. This method involves two steps: feature selection using information gain and classification using Naïve Bayes, random forest, and decision tree (J48). In [3], the author uses the Genome Project dataset with 100 newly downloaded benign and 100 malware. The authors then use the APK file to extract permission. The random forest gives a higher accuracy of 89.8% than the others. The true positive rate (TPR) and false positive rate (FPR) are 0.890 and 0.110, respectively. In [4], the authors propose to find the best feature selection set with the best classifier for detecting Android malware. A dataset of 30 malicious and 30 benign applications is created. Four sets of features are selected on the basis of four feature selection algorithms. Then, five classifiers named Naïve Bayes, k-nearest neighbor, decision tree (J48), multi-layer perceptron (MLP), and random forest are trained against the selected features. The MLP classifier with Chi-square and information gain achieves the best overall result, with a TPR of 0.9, FPR of 0.23, and accuracy of 0.83. However, the reduction of 70% features still results in a good performance. In [5], an optimizer named genetic algorithm is used to select the best subset from the feature set. The genetic algorithm’s fitness function is defined so that the chromosome that gives the machine learning-based classifier high accuracy is assigned a higher weight. The author works on 40000 APK files with 99 extracted features. After applying the SVM and MLP classifiers, 33 and 40 features are selected from 99 features using SVM and MLP. In this [6], the author proposes a feature selection based on a self-variant genetic algorithm (SV-GA) for detecting Android malware. The proposed algorithm is compared to the existing genetic algorithm using the UCI dataset [7]. Four classifiers are trained using 88-feature datasets. The SV-GA attains 93.6% accuracy.
A static Android malware detection method is proposed that uses ML classifiers based on the symmetric features of Android malware applications. Three feature selection techniques are used to find a lightweight ML algorithm that can reduce data size and features. In [8], a framework is proposed that uses machine learning to prevent permission-based malware attacks. The authors first extract the features from the Android.apk files and then create a dataset from the extracted features from the Android applications. In feature selection, the
In contrast to most of the above works in which API analysis is considered, a ransomware detection method based on traffic analysis is proposed in [17]. Several feature selection algorithms are used to find the most influential features. A total of 19 features are selected. Then, the long short-term memory (LSTM) is used to classify the ransomware. Only binary classification (ransomware vs. benign traffic) is performed. An accuracy of 97.08% is achieved. Article [18] suggests another method to find ransomware based on traffic analysis. This study investigates the accuracy of finding different ransomware. In each case, a binary classification is performed between a type of ransomware and benign traffic. The random forest is found to achieve the best performance among various machine learning algorithms. A semi-supervised approach is used in [19], where a binary classification between ransomware and benign traffic is performed. In addition, a family classification similar to [18] is also performed. In [17], [18], and [19], CICAndMal2017 dataset is used. Among the studies discussed above, most articles work with API calls to detect ransomware detection. The PSO is used in [15] for feature selection of ransomware detection. However, the number of features is about 200, which incurs a substantial computational complexity. It has yet to explore the field of traffic analysis. The detection of a specific class of ransomware is done in [20]. Decision tree algorithm is used, where hyperparameter tuning is performed using the grid search algorithm. No feature selection is made in [20]. In [21], about 97% precision is obtained with 19 features. However, the flow information, such as the source IP, destination IP, protocol, and timestamp features, are considered. However, these features are experimentally dependent, and the values of these features will not be related to the production network. As a result, using these features gives the experiment a very high degree of accuracy. However, the accuracy of the production network will be much lower. In [18] and [19], the source and destination IP addresses, protocol, and timestamp features are removed to eliminate the experiment dependency. The accuracy of ransomware detection is limited, and feature selection techniques are not applied. Because of the above limitations, we need to find a ransomware detection technique that uses a minimum number of features while providing better accuracy. This paper proposes an Android ransomware detection method from traffic analysis using PSO-based feature selection. We separately select the most influential features for both binary and multiclass classification. An extensive experiment compares the proposed method’s performance with various feature selection methods and classifiers.
Proposed Ransomware Detection Algorithm
In this chapter, we present the proposed Android ransomware detection system. The system model of the proposed system is shown in Fig. 1. The details of each of the constituents of the proposed system are described in the following.
A. Dataset
The CICAndMal2017 dataset [22] is generated by the Canadian center for the cyber security of the University of New Brunswick. Besides benign traffic, this dataset contains traffic information for four different types of malware: scareware, adware, ransomware, and SMS malware. They installed the applications in real Android devices to avoid the stealthy features of various malware in an emulator environment. They collected traffic data in three different phases: (i) just after installing the apps, (ii) 15 minutes before rebooting the devices, and (iii) 15 minutes after rebooting the devices. Each of the malware categories has about ten families. For example, the ransomware category comprises ten families: charger, jisut, koler, lockerpin, simplocker, pletor, porndroid, ransomBO, svpeng, and wannalocker. Each instance of the dataset contains 84 features. The detection of Android ransomware is the subject of this study. As a result, we look at the samples that correspond to the ransomware. To do this, we created a new dataset that included both ransomware traffic and benign traffic. In Table 1, we give the number of instances per ransomware class. We merge plenty of ransomware into a single file and add 54161 instances of benign traffic samples captured in 2017 to reduce the imbalance issue in the dataset. Thus, the number of samples in the dataset (ransomware and benign) becomes 402834.
In the aggregated dataset, the F1 (Flow ID), F2 (Source IP), F4 (Destination IP), and F7 (Timestamp) features depend on the experimental setup and cannot be matched in a production network. For this reason, we remove these features, as they have no impact on traffic categories, as is done in [18] and [19]. In addition, some features have zero values for all instances. Since the values of the features do not change for the instances, they do not correlate with the class labels. For this reason, we removed those features. The all-zero features include: F38, F39, F40, F52, F56, F57, F63, F64, F65, F66, F67, and F68. In addition, we also removed the F62 feature, as it duplicates the F41 feature. After removing the above features, we have 68 features left in the dataset.
B. Preprocessing
1) Normalization
We normalize the dataset to lessen the dominance of some features with high values relative to other features with low values in the prediction results. In this paper, we use minimax normalization, which is defined as follows:\begin{equation*} x_{i}=\frac {x_{i}-x_{\mathrm {min}}}{x_{\mathrm {max}}-x_{\mathrm {min}}}\tag{1}\end{equation*}
2) Oversampling
The number of samples per class of ransomware differs significantly, which leads to a severe imbalance in the aggregated ransomware dataset. A machine learning model trained against an imbalanced dataset suffers from the effect of bias toward the classes having a higher number of samples. It predicts more frequently that the class has more samples to achieve higher accuracy. Such a model cannot perform in the production network as efficiently as on the test set. To eliminate the bias effect, we perform an over-sampling of the dataset to produce an equal number of samples in each class. We perform the oversampling by replicating the samples of the classes that have a lower number of samples.
C. Feature Selection
Once the preprocessing is done, we split the dataset into training and testing parts. We select the set of most influential features using the training dataset by applying particle swarm optimization. Then, a machine learning model is trained against the training dataset with the selected features only. After that, we remove unwanted features from the training dataset, keeping only the most influential features. Finally, the trained model is evaluated against the modified test dataset. The proposed method for selecting features uses a metaheuristic algorithm to find the best way to improve the accuracy of classification. We use particle swarm optimization (PSO) as the optimizer. The reasons for using the PSO include its easy implementation, low memory requirement, and high speed of convergence [23]. PSO optimization follows the techniques that various groups of living beings, such as a flock of birds or a school of fish, follow to find their desired target in a search space. Their current directions of searching, the best individual results (so far), influence the movements of the birds in the flock, the best results found by all birds (so far), and a random perturbation. Each bird keeps tracking its own best solution. Then, it exchanges its position with other birds. The exchange of information among birds in a flock enables them to find the desired target. A single bird’s search effort cannot achieve substantial progress in finding the target. However, the flock of birds can find the target. Next, we will describe the use of the PSO algorithm for selecting the best features of the CICAndMal2017 dataset to achieve the highest accuracy with lower computation complexity.
The PSO algorithm for the proposed feature selection is given in Algorithm 1. We use the common PSO proposed in [24]. The algorithm starts with a random population, where the population is the positions of a set of particles. Each particle presents a possible solution where a solution is the subset of features that meet some performance criteria most efficiently. The position is defined in a \begin{equation*} F_{i}=qa_{i}+\frac {1-q}{N_{f}^{sel,i}}\tag{2}\end{equation*}
Next, we select the best fitness value for each particle (line 5). Since this is the initialization phase, the best fitness value of a particle is the fitness value computed in line 4. For the
An iterative process begins when the global best-fitness value and particle are calculated. For each particle, the velocity vector, \begin{equation*} v_{i}=wv_{i}+r_{p} \left ({P_{i}^{pb}-S_{i}}\right)+r_{g} \left ({P_{i}^{gb}-S_{i}}\right)\tag{3}\end{equation*}
Experimental Results and Analysis
In this section, we evaluate the performance of the proposed Android ransomware detection method. We use the decision tree classifier to classify traffic, whereas the PSO is used to optimize the system performance. The parameters of the proposed method are given in Table 2. The decision tree is implemented using the default parameters of scikit-learn. In this study, we consider two scenarios: Scenario I: binary classification between ransomware traffic and benign traffic, and scenario II: multiclass classification among 10 types of ransomware traffic and benign traffic.
We present the proposed ransomware detection method for binary classification in the first case. In this case, the goal is to determine if there is ransomware in the traffic. Before detection, we use a PSO-based feature selection algorithm to select the most important features, improving the detection accuracy and providing a high detection rate. Table 3 shows the selected features. We perform experiments for three different weights,
We investigated the impact of the proposed method on the fitness function in Fig. 2. We see that the value of
The impact of the selection of features on the accuracy is shown in Fig. 3 for binary classification. It is evident that the accuracy improves significantly because of the proposed technique, and the value of
Illustration of the optimization of the accuracy with respect to the iteration for ransomware detection.
To investigate how the number of selected features is optimized, we can consider Fig. 4. We can see that just after one iteration, the number of selected features is reduced by 41.93%, 46.77%, and 54.84% in the
Optimization of the number of selected features with the iteration for ransomware detection.
Next, we investigate the performance of the proposed method for ransomware family classification. There are 11 classes with 10 ransomware types and benign traffic. Fig. 5 shows the optimization of the fitness for several iterations. As is seen, the fitness function improves with iteration. The fitness values are 0.678, 0.596, and 0.53 in
Optimization of the fitness with respect to iteration for ransomware family classification.
The precision of the types of ransomware classification is shown in Fig. 6. The accuracy ranges from 67.1% to 67.9%, which is 12%-13% less than the binary classification. In this scenario, we see that
Illustration of the optimization of the accuracy with respect to the iteration for ransomware family classification.
Optimization of the number of selected features with the iteration for ransomware family classification.
A performance comparison in terms of the accuracy of the proposed method with four prominent ML algorithms is shown in Fig. 8. We use random forest for binary classification in the proposed feature selection method experiment. In contrast, the decision tree for the multiclass classification as these two classifiers works best without feature selection. The proposed method outperforms all other methods in detecting ransomware (binary classification) and its classes (multiclass classification). In binary classification, the proposed method with random forest achieves 2.26% more accuracy than the random forest (RF) ensemble learning method. In [19], the ransomware detection accuracy is 69.5%, which is 12.08% less than our proposed method. The proposed method performs better than any other method in the detection of the types of ransomware. Whereas the proposed method provides 67.8% accuracy, the decision tree, which offers the best performance among other techniques, provides 64.1% accuracy. Thus, the improvement, in this case, is 3.7%. Although the XGBoost is an ensemble classifier, its performance is not even close to the proposed method.
We compare the performance of the proposed feature selection method with four benchmark feature selection methods: recursive feature elimination (RFE), Anova-F, Chi-square, and information gain. The decision tree is used as the classifier for all feature selection methods. We select the same number of features in all cases. The accuracy for various feature selection methods for ransomware classification is shown in Table 5. For 26 features, whereas the RFE obtains the highest accuracy of 77.27% among the benchmarks, the proposed method attains 2.87% higher accuracy than the RFE. The gains in accuracy for 13 and 8 features are 2.82% and 2.29%, respectively. In the classification of ransomware types (see Table 6), the gains in accuracy for 23, 8, and 5 are 3.81%, 7.64%, and 10.59%, respectively. When comparing Tables 3 and 5, it is found that among the benchmarks, while the RFE method works best to select a higher number of features, the information gain performs best for a minimal number of features. In Table 7, the performance of the proposed method is compared to that of the existing techniques. The proposed method outperforms the existing procedure for ransomware detection and ransomware types detection.
Conclusion
In this paper, we proposed a method for detecting Android ransomware using machine learning. The core part of the method uses the nature-inspired optimization algorithm, PSO. In this method, we iteratively selected a subset of the available features to optimize the accuracy and computational overhead of the detection method. We exploited the traffic of the Android devices to train a machine learning classifier. We showed that the proposed method outperformed the conventional classifier regarding optimization variables throughout the experiments. It achieved up to 81.58% accuracy in detecting ransomware attacks. Furthermore, the types of ransomware attacks could be detected with up to 67.8% of accuracy compared to the conventional method. The computational overhead could be significantly reduced by shedding up to 91.94% of the features. The proposed optimization method convergence quickly (by around 10 iterations). Compared to the traditional feature selection methods, the proposed method attained up to 10.59% more accuracy for the same number of features. In this paper, we use the default parameter settings of various classifiers. However, in the future, we plan to employ optimization in two levels. In the first optimization level, we are interested in enhancing the hyperparameters of the classifier. In the second level, we aim to improve the precision of the ransomware detection system.