
A Method for Automatic Android Malware Detection Based on Static Analysis and Deep Learning




Abstract:

Smartphones are replacing computers for most internet users around the world, and Android holds the largest share of the smartphone market. This rise in the usage of smartphones in general, and of the Android system in particular, creates a strong need to secure Android effectively, as malware developers target it with sophisticated and obfuscated malware applications. Consequently, many studies have proposed methods to detect and classify Android malicious software (malware). Some were effective, some were not, with accuracy below 90%, and some are outdated, relying on datasets of applications built for old versions of Android that are rarely used today. In this paper, a new method is proposed that uses static analysis to gather as many useful features of Android applications as possible, along with two newly proposed features, and passes them to a functional API deep learning model we built. The method was implemented on a new, classified Android application dataset of 14,079 malware and benign samples in total, with the malware samples classified into four malware classes. Two major experiments were conducted with this dataset: one for malware detection, with the samples categorized into just two classes (malware and benign), and a second for malware detection and classification using all five classes of the dataset. As a result, our model outperforms the related works when using just two classes, with an F1-score of 99.5%. High malware detection and classification performance was also obtained using the five classes, with an F1-score of 97%.
Graphical abstract: Overview of Automatic Android Malware Detection Method.
Published in: IEEE Access ( Volume: 10)
Page(s): 117334 - 117352
Date of Publication: 03 November 2022
Electronic ISSN: 2169-3536

CC BY — This article is distributed under the Creative Commons Attribution 4.0 License (https://creativecommons.org/licenses/by/4.0/); IEEE is not the copyright holder of this material.
SECTION I.

Introduction

There is a growing need for an efficient malware detection tool for Android, since Android is the most widely used mobile system [1] and hundreds of new applications for it are released each day [2]; scanning and checking them manually has thus become infeasible. More than six million mobile malware samples were encountered by McAfee in 2014 [3], and over 12 thousand Android malware samples were detected daily in 2018 [4]. Along with this rapid growth, new Android malware samples are more sophisticated than previously detected ones at avoiding anti-virus detection through code obfuscation and encryption [4].

ML-based methods, and more specifically deep neural networks, have proven efficient at detecting malware, as they are able to learn features and patterns automatically, with multiple levels of abstraction [5], from a limited set of training examples, thus eliminating the need to explicitly define signatures when developing malware detectors [6].

Although many studies tackle this subject, most of them utilize only application permissions and API calls, which does not seem effective against new, sophisticated samples or over the long term, as benign applications now request more permissions and include more API calls than malware applications. These two most-used features are therefore not sufficient alone and do not allow learning the real characteristics of malware; more useful features and techniques are needed for malware detection. It is also noticeable that the datasets used in previous papers are outdated and no longer available.

In addition, most existing studies do not consider malware classification: they train models and classifiers to predict just one of two classes, namely malware and benign, ignoring the malware classification problem, which is an important issue in this field and essential for the cybersecurity community to take the right action.

Yet another limitation of most existing studies is the use of traditional machine learning algorithms, which prove inefficient in the feature engineering process by depending heavily on human intelligence and individual judgment [4]. In contrast, the layered structure of deep-learning-based models improves the learning of abstract and highly non-linear patterns, which helps learn features automatically and capture the substantial characteristics of complex data, improving generality on new data.

Deep Learning methods, in particular, are typically better suited to capturing semantic knowledge within Android applications than classic ML methods, especially when enough data is available to build a meaningful semantic embedding [4].

In this paper, static analysis is used and a functional API deep learning model is proposed, which takes as inputs the most useful observed features of Android applications, namely: the file size, permissions, services, API function calls, broadcast receivers, opcode sequences, and the fuzzy hash, which is used for similarity detection. They are extracted automatically using Bash scripts and Python 3, which execute commands provided by the Androguard tool [7].

The multi-layer structure of the model helps it learn the most distinctive information from the given inputs. Additionally, the tokenization and embedding layers help cluster and detect similarity in the discrete, independent text features, such as permissions. The RNN part of the model learns the characteristics of the fuzzy hash values in order; to the best of our knowledge, the use of the fuzzy hash in this field is a novel technique, and together with a recurrent neural network the model can detect similarity between Android applications very efficiently, which also enhances the classification of new and modified samples. The model performance was evaluated using four metrics, namely accuracy, F1-score, recall, and precision, and the obtained result is 96%.

The dataset used is ‘CICMalDroid2020’ [8], the newest and most diverse dataset, which combines new APK samples as well as samples from well-known datasets used in previous studies, such as AMD and MalDozer [9]. It categorizes the Android application samples into five classes: SMS malware, banking malware, riskware, adware, and benign. Using this diverse, categorized dataset, our model outputs a prediction of the application’s class, which helps in detecting as well as classifying Android malware.

In summary, the contribution of our work is as follows:

  • A new method for malware detection and classification is proposed, with two newly proposed features in the static analysis scope.

  • Using two target classes, malware and benign, our model achieves very high malware detection rates as evaluated by the precision, recall, F1-score, and accuracy metrics, with an obtained value of 99.5% for all of them.

  • Using the five classes of the dataset, our model achieves high malware detection and classification performance as evaluated by the precision, recall, F1-score, and accuracy metrics, with obtained values of 97%, 96%, 97%, and 97%, respectively.

The rest of this paper is organized as follows: Section 2 lays out the background to this study. Section 3 surveys existing methods of Android malware detection, while Section 4 provides a detailed description of our proposed method with all its steps. Section 5 explores the results and discusses them, Section 6 shows the results of some tested variations of our model used to tune its hyperparameters, and Section 7 shows an experimental evaluation of other models, including the traditional machine learning classifiers most used in previous studies. Finally, Section 8 concludes the paper.

SECTION II.

Background

In this section, we provide the necessary background that is relevant to our proposed method.

A. Android

The most widely used mobile operating system is Android, which is built on the Linux kernel and uses the Java programming language. Additionally, a number of drivers and libraries have been altered to improve Android’s performance on mobile devices. In 2005, Google began supporting Android Inc. financially, and in 2008 the operating system’s first smartphone (the HTC Dream) was released. Because it is open source and distributed under the Apache License, the operating system has seen widespread and rapid development. According to AppBrain [10], over 2.6 million Android apps existed in the Google Play store as of the first quarter of 2022, with 37 percent identified as low-quality apps.

Android uses a special virtual machine, the Dalvik virtual machine, which uses its own bytecode; therefore, standard Java bytecode cannot run on Android. To convert Java class files into “dex” (Dalvik executable) files, Android provides the “dx” tool. The “aapt” (Android Asset Packaging Tool) bundles Android applications into an “APK” (Android Package) file.

B. Android Application Basics

Android applications have the extension APK, which stands for Android Package Kit. An APK is made of a collection of components, categorized into four types, and each application can be composed of one or more of these types [11]. The four types of Android application components are:

  • Activities: interface components that implement interactions with the user. Each activity is designed to handle a single user action. For instance, an appointments list in a task manager application is one activity, and showing the details of an appointment is the role of a second activity. Each activity is composed of one or more view objects, which are interface objects such as buttons, labels, etc.

  • Services: background components that run independently of the user interface. They can operate for a long time even after the user switches to another application. A service can play music, download a file, or handle network transactions, all in the background. Services can also be used for interprocess communication (IPC) between Android applications [12].

  • Broadcast receivers: system-wide broadcast events can occur when a device starts or receives an SMS or a call; broadcast receivers listen for these events and react to them. They run in the background even when the app is closed.

  • Content providers: Components that allow external apps and system components to access application data.

Android applications are typically written in Java or Kotlin and compiled into a single archive file (Android package, or APK), along with data and resource files. The components of the APK, which can be inspected as sketched after this list, include:
  1. An XML manifest file that contains information such as the app description and component declarations (i.e., activities, permissions, etc.).

  2. One or more classes.dex files: Dalvik executable files that run in their own instance of a Dalvik Virtual Machine (or the Android Runtime in newer versions of Android).

  3. A “/res” directory for indexed resources such as images, icons, music, etc.

  4. A “/lib” directory for compiled code.

  5. A “/META-INF” folder including the app certificate, the list of resources, SHA-1 digests, etc.

  6. A resources.arsc file, which contains compiled resources.
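Because an APK is a standard ZIP archive, these components can be listed without any specialized tooling. A minimal sketch using only Python’s standard library (the file path is a placeholder):

```python
import zipfile

# List the top-level components of an APK (an APK is an ordinary ZIP archive).
apk_path = "sample.apk"  # placeholder path
with zipfile.ZipFile(apk_path) as apk:
    for name in apk.namelist():
        print(name)  # e.g. AndroidManifest.xml, classes.dex, res/..., META-INF/...
```

Note that the AndroidManifest.xml inside an APK is stored in a binary XML format, so reading its contents requires a dedicated tool such as Androguard, discussed later.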

C. Malware Types

There are many malicious software (malware) types; here we define just those present in the dataset used in this paper:

  • Adware: Adware is a type of software that is designed to automatically deliver unwanted and annoying advertisements to the user.

  • Riskware: any legitimate program that poses a potential risk due to security vulnerabilities, software incompatibility, or legal violations. These applications are not designed for malicious purposes but have features that can be misused. If used with malicious intent, a riskware program can be considered malware. This gray area of security makes riskware a particularly difficult threat to deal with.

  • Spyware: monitors and exfiltrates information from the victim’s system by capturing keystrokes, gaining access to the microphone or webcam, etc. Spyware does this by modifying the security settings on users’ devices, and it often bundles itself with legitimate software. The SMS malware and banking malware categories in the used dataset belong to spyware; they steal personal information from text messages and account information from banking applications, respectively.

D. Types of Malware Analysis

There are two main approaches for automatic malware detection which are static analysis and dynamic analysis.

Static analysis detects a malicious application without the need to actually run it. This is done by analyzing the packaged files and the code, which is obtained using disassemblers or decompilers, a process called reverse engineering (or back engineering). However, static analysis cannot detect some sophisticated malware whose malicious behavior appears only at runtime, for example generating a dynamic string which in turn downloads a malicious file. There are also approaches that complicate reverse engineering by using tools that obfuscate the code. The most common free tool for this is ProGuard [13], which obfuscates the code by renaming the classes, fields, and methods with short, meaningless names. A commercial tool built on it, DexGuard [14], is claimed to complicate static as well as dynamic analysis. Other tools include Ijiami ApkProtect [15] and Bangcle [16].

From the perspective of Android app developers, it is insecure to allow their code to be decompiled, because reverse engineering is also used by attackers for many purposes, such as infecting the apps or the servers that control them, searching for sensitive data hardcoded in the code, or reusing the code for their own benefit. So, we cannot consider applications that contain obfuscated or encrypted code to be malicious.

Another limitation of static analysis is external functions, which cannot be fetched statically; and since benign apps use external functions to reduce their size in phone storage, we cannot assume that an app with many calls to external functions is malware.

Dynamic analysis detects a malicious application by executing it in a virtual system, so that the behavior of the application can be observed in action without the risk of letting it infect a real system or escape into the enterprise network. Dynamic analysis requires thousands of application runs to train and test the models; thus, it demands a lot of resources and is also slower than static analysis. While it is generally considered a powerful solution for detecting malware, with higher accuracy than static analysis, it also cannot detect some sophisticated malware that can discover it is running in a virtual system and change its behavior to deceive it.

Static analysis cannot detect sophisticated malicious code, and sophisticated malware can hide from and deceive the virtual system used in dynamic analysis; by combining these two techniques, hybrid analysis can give the security team the best of both approaches. For example, hybrid analysis can apply static analysis to data generated by dynamic analysis: when malicious code runs and makes changes in memory, dynamic analysis detects that and alerts the security team to perform static analysis on the memory dump. Even the most sophisticated malware threats can be found through hybrid analysis, but it is also an expensive and time-consuming solution.

E. Tools for Malware Analysis

There are different tools and programs that use reverse engineering to decompile and debug Android applications; some of them are Radare2 [17], Dex2Jar [18], JADX [19], and the one we use in this paper, Androguard [7], a complete framework developed in Python that can analyze APK files and extract many features from them, such as services, resources, dex files, and many others. Additionally, every Androguard feature can be used from customized Python scripts, making it simple to get comprehensive information on a file.

One limitation of Androguard is that it can be very slow in analyzing APK files larger than 10 MB. However, in testing other existing tools, none gave better performance. Some tools, like ClassyShark [20] and ApkStudio [21], are limited to a user interface, so automating the feature extraction process is not possible with them. Also, not all Android decompilers allow the extraction of all features of Android applications: some are limited to decompiling dex classes, and others to the manifest file.

There are also various websites for malware detection; the best known is VirusTotal [22], a multi-scanner that aggregates many antivirus engines. Obviously, such signature-based scanning cannot reliably detect new samples.

For dynamic analysis, there is DroidBox [23], which uses API call logs to help explain APK behaviors. Another tool is AppsPlayground [24], which aims to automate the dynamic analysis of Android apps; however, access to it requires registration. SandDroid [25] combines both static and dynamic analysis techniques.

F. Cryptographic Hash Functions

In information security, hash methods are used to generate “fingerprints” for documents, characterizing a possibly large document as unambiguously as possible by means of a short string with a fixed number of characters. Hash functions are also used to store passwords securely in an obfuscated, irreversible form. Hashing has two main properties: first, the output is drastically altered if even one bit of the input changes; second, given an input and its hash, finding another input that generates the identical hash is computationally infeasible. Examples of hash algorithms are SHA-1 (the 1995 revision of SHA), which generates 160-bit hash values; the SHA-2 family, which includes SHA-224, SHA-256, SHA-384, and SHA-512, where the number is the length of the hash value in bits; and the MD5 hash function.

Hash algorithms are used by forensic examiners to locate known files in collections of unknown files. An examiner compiles a list of known files and generates and maintains the cryptographic hash value of each. During future investigations, the hash value of every file under inquiry can be computed and compared against the previously computed known values. However, malicious individuals can defeat this strategy by altering a known file by even one bit, which totally changes the file’s cryptographic hash. This limitation can be overcome with SSDeep, introduced in [26]: SSDeep “is a program for computing context triggered piecewise hashes (CTPH). Also called fuzzy hashes, CTPH can match inputs that have homologies. Such inputs have sequences of identical bytes in the same order, although bytes in between these sequences may be different in both content and length.” This means that changing some bits, in other words modifying a file, will not change the complete hash value, and we will still be able to detect similarity between the original and the modified file; that is why CTPH is a better option for detecting malware and its variations.
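As an illustrative sketch of this difference, assuming the python-ssdeep binding (the byte strings below are toy data), the fuzzy hashes of two nearly identical inputs remain comparable while their SHA-256 digests differ completely:

```python
import hashlib
import ssdeep  # python-ssdeep binding around the ssdeep CTPH library

original = b"some application content " * 1000
modified = original + b"one small appended change"

# Cryptographic hashes: a tiny change yields a completely different digest.
print(hashlib.sha256(original).hexdigest() == hashlib.sha256(modified).hexdigest())  # False

# Fuzzy (CTPH) hashes: ssdeep.compare returns a 0-100 similarity score.
h1, h2 = ssdeep.hash(original), ssdeep.hash(modified)
print(ssdeep.compare(h1, h2))  # high score, close to 100 for near-identical inputs
```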

SECTION III.

Existing Android Malware Detection Techniques

There are many published studies on Android malware detection based on machine learning. The majority use static analysis, where the APK packages are analyzed and perceptive features are extracted, such as permissions, API calls, and opcode sequences. Other studies use dynamic analysis, where the applications are executed in a virtual environment and their behaviors, such as CPU and RAM consumption, are analyzed. Some studies use a combination of static and dynamic analysis, called hybrid analysis.

Some of these studies developed tools for the end user that can detect malware directly on the Android device (on-device); other studies analyze and detect malware outside the Android device (off-device).

Figure 1 gives a general overview of all the studies reviewed in this article, with a machine learning approach as the common feature and different analysis types.

FIGURE 1. Overview of the existing studies.

A. Static Analysis Based Malware Detection

Opcode sequences were used in [27] and [28] with convolutional neural networks (CNNs) that take the sequences as one-hot vectors; the resulting accuracies were 88 and 99 percent, respectively.

The papers [29], [30], [31], [37], [44] use both permissions and API calls and pass them to different machine learning algorithms such as random forest, SVM, and logistic regression. Using different datasets, accuracies in the range of 87–98% were obtained.

The papers [32], [33], [34], [35], [36] use only permissions, with clustering algorithms and cross-validation. The last of these proposed an on-device malware detector and remover, but its accuracy is not reported.

The study [37] uses permissions and API calls and applies a genetic algorithm in the feature selection process. The selected features are then passed to different ML algorithms such as J48, decision tree, random forest, and naive Bayes. With 6000 Android samples in total, the best obtained accuracy was about 97%.

The paper [38] extracts opcode sequences from selected datasets gathered between 2012 and 2015 and applies semantic simplification of the opcode sequences to enhance resilience.

The study [39] uses opcode sequences and NLP (natural language processing) algorithms. The obtained average accuracy was 83.56% using 5 malware families. It was observed that training the model to classify more malware families reduces the average accuracy significantly.

The paper [40] uses API call graphs with a CNN and a lightweight classifier. The obtained accuracy was 91.27%.

The study [41] uses deep learning and ensemble classifiers for malware detection and classification. Using multiple datasets, they train their model on static and dynamic features, each independently, and just two classes (malware and benign) were used for the static features.

The paper [42] uses deep learning algorithms, mainly graph neural networks (GNN) and generative adversarial networks (GAN), for Android malware detection. Again, this study divides Android samples into just two classes, malware and benign, ignoring the classes of malware.

The study [43] uses an unconventional technique: by converting the APK packages into audio files, they tested many different classifiers and obtained high malware detection performance.

The study [44] concentrates on the ransomware malware category and uses a swarm optimization algorithm to tune the classification algorithm’s hyperparameters. They proposed a method based on the SVM algorithm, categorized their collected samples into ransomware and benign, and applied different oversampling algorithms to obtain a balanced dataset. Their method achieved high ransomware detection performance.

A paper [27] published in 2017 uses a convolutional neural network (CNN) for malware classification based on static analysis. The dex classes in APK files are disassembled into Smali files, from which the opcode sequences are extracted, discarding the operands. The sequence of opcode instructions is then encoded as one-hot vectors and passed to a CNN layer. The last layer is a multi-layer perceptron (MLP) that outputs the probability that the current example is malware. Three different datasets were used, and the mean accuracy was 88%.

The paper [28] proposed a method that uses deep convolutional neural network to learn from opcode sequences. The experiments gave an accuracy of 99%.

DroidDeepLearner [45] uses static analysis with permission and API function call features; testing different algorithms, the best obtained result was 93.96% with the DBN algorithm and the F1-score metric.

DroidDeep [46] also uses static analysis and DBN algorithm. The overall obtained accuracies for different accepted ratios between benign and malware samples were in the range 92–97.5%.

The Mlifdect approach, presented in the study [47], uses a parallel machine learning classification method applied to 8,385 Android samples. The obtained classification accuracy was about 98%.

Another study [29] uses parallel machine learning classifiers that employ diverse algorithms with inherently different characteristics for early detection of Android malware. The extracted features are API calls, permissions, and the standard OS and Android framework commands, which are typically placed in hidden files within the APK. The ML algorithms used include decision tree, simple logistic, naive Bayes, and others. They used supervised learning algorithms with the McAfee dataset and a combined classification strategy, which merges the classification judgments made by each individual classifier in parallel. The best detection rate of 97.5% was obtained.

The paper [30] uses static analysis, where the requested permissions and API calls are extracted, multi-hot encoded, and then passed to different machine learning algorithms: SVM, random forest, and logistic regression. The best obtained accuracy was about 98%. Three datasets were used: AMD, Drebin, and UpDroid.

Another paper [31] uses risky permissions and API calls as features for the SVM algorithm. To evaluate the proposed method, experiments were conducted using the Drebin dataset and some Google Play applications. The overall obtained accuracy was 86%.

The study [33] also uses permissions, passing them to the K-means clustering algorithm to detect Android malware. In an experiment with 500 sample Android applications, the best obtained accuracy was 91.75%.

The paper [32] presents an adaptive neuro-fuzzy inference system with fuzzy c-means clustering that depends on the permissions feature. This approach achieves a highest classification accuracy of 91%.

PUMA [34] uses the permissions approach with k-fold cross-validation to evaluate the performance of different machine learning classifiers. Among them, SimpleLogistic showed the best result, with 84% accuracy.

Static analysis and machine learning were used by [48]. They extracted the API classes under the <uses-permission> tag in the manifest file and used three algorithms: SVM, J48, and random forest. It is claimed that these three algorithms give the best results in malware detection research compared with many other algorithms such as KNN, naive Bayes, and bagging bootstrap. The highest obtained accuracy was 92.40%.

The study [49] uses static analysis that runs directly on smartphones to analyze and check downloaded applications. The Android Asset Packaging Tool was used to extract the following information from the manifest file: permissions, activities, services, content providers, broadcast receivers, and filtered intents. Using a lightweight Android disassembler, they extracted the API calls and network addresses contained in the applications’ disassembled code. They then embedded these features in a vector space, trained a model offline, and transferred the trained model to smartphones for direct analysis and prediction. The obtained accuracy was 93.90%. The drawbacks of this study are as follows:

  • The used dataset is too unbalanced: 123,453 benign samples vs just 5,560 malware samples.

  • Some collected features are not that useful: taking for instance the IP addresses found in the disassembled code, it is hard to find two applications that use the same addresses; the activity names merely represent the names of windows and interface elements such as buttons and labels; and the intents collected from the manifest files are empty for most samples (at least today’s samples, as we observed in this work). Thus, these features can be a waste of time and memory.

  • It is now relatively old: the data were collected between August 2010 and October 2012. Thus, the model cannot perform well on today’s new samples.

The study [50] uses Android samples collected between 2015 and 2016; the perfectly balanced dataset with a high number of samples, and the use of only two target classes, helped give a high accuracy of 98%. The last two drawbacks discussed for the previous study apply to this study, too.

The study [51] uses static analysis and extracts only some of the information found in manifest files, namely permissions and intent filters. The used dataset consists of 30 malware and 30 benign apps. An experiment was then made with 235 benign and 130 malware samples to test the effectiveness of the proposed method; the overall obtained accuracy was about 90%.

The paper [52] uses both permissions and control flow graphs to detect malware statically. The algorithm used is the support vector machine; the best result was with the recall metric, at about 85%, while other metrics, such as F1-score, were far lower, below 30%.

MalDozer [9] was proposed, which uses API calls with a convolutional neural network; its machine learning model was tested on different datasets and could detect malware samples and attribute them to their actual families with an F1-score of 96–99%.

The paper [53] proposed a new method that uses creator information (the serial number) as well as API calls, permissions, intent filters, file hashes, and system commands. The obtained average malware classification accuracy was 98%. In this paper, a serial number blacklist was made, which we believe is not very useful, because new malware will certainly have new serial numbers, so the detection and classification accuracy for new applications is questionable.

In Table 1, we see a comparison between studies that use static analysis and the different features they extract from APK packages.

TABLE 1. The Extracted Android Applications’ Features in Studies With Static Analysis.

B. Dynamic Analysis Based Malware Detection

The study [54] extracts system calls from the CIC-ANDMAL2017 dataset, which are then fed to different machine learning algorithms; it is reported that, among them, the K-nearest neighbor and decision tree algorithms gave the best malware detection rates with the F1-score metric, at 85% and 72%, respectively.

The study [55] proposes a parallel machine learning model that uses different classifiers, such as J48, KNN, SVM, and random forest, to detect and classify Android malware using dynamic features.

The Andro-profiler system was proposed in the study [56]; it depends on system logs, including the system calls extracted during dynamic analysis.

Another study [57] uses dynamic analysis to extract sequences of system calls to detect Android malware. They assume that malicious behaviors, such as sending premium-rate SMS messages or ciphering data for ransom, are implemented by specific system call sequences; the method is thus based on the “fingerprint” of the malware. Implementing the trained model on a real device, the obtained detection accuracy was 97%.

Dynamic analysis is also used in the paper [58] to capture system calls during applications’ run-time interactions with the Android system. J48 and random forest algorithms were used to classify a dataset consisting of 50 malware and 50 benign samples.

Random forest classification was used in another study [59]. They worked with the free parameters of the random forest algorithm (such as the number of trees and the depth of each tree in the forest) on an Android feature dataset built using dynamic analysis, observing different features in the battery, binder, CPU, memory, network, and permission categories, and obtained high accuracy with a very small misclassification ratio. They also found that, in this case and for this algorithm, more trees in the random forest appear to be better, the depth of the trees should be no less than 16, and fewer features per tree is better.

C. Hybrid Analysis Based Malware Detection

Hybrid analysis was used by Kabakus and Dogru in their paper [60], where they applied dynamic analysis to two datasets using a virtual Android device (emulator), followed by static analysis based on the permissions approach and API calls. They thereby discovered shared signs among malicious apps, such as disabling the mobile data connection and over-privileged permissions, which are more common than in benign apps.

In another paper [61], deep learning is used to characterize and detect Android malware. They developed an online deep-learning-based Android malware detection engine, “DroidBox”, that can automatically detect whether an app is malware or not. Their engine was based on TaintDroid, which can run a dynamic taint analysis with system hooking at the application framework level and keep an eye on various app operations such as data leaks, cryptography operations, and phone calls. A hybrid analysis technique was used to extract features from each app, falling under one of three types: required permissions, sensitive APIs, and dynamic behaviors. The total number of obtained features was 192. They managed to understand the characteristics of malware and reached a high classification accuracy of 96.76%.

The study [62] uses system calls to create data pattern sets for both malware and benign samples. New samples’ patterns are compared with the created pattern sets to detect their category. The obtained detection accuracy was about 91%.

The OmniDroid dataset was proposed in the paper [63]; it consists of 22,000 malware and benign Android samples. Static and dynamic analysis were applied to this dataset, and ensemble classifiers were trained.

In the paper [64], a deep learning model was used for Android malware detection based on hybrid analysis; the model was tested on 165 benign and 146 malware samples. The test results show that hybrid analysis could increase malware detection accuracy by 5%.

The paper [65] proposes a tree-augmented naive Bayes based method using API call, permission, and system call features. The results showed that malware detection with this method can take a long time, with an accuracy of 97%.

D. Limitations of Existing Techniques

From the information above, we notice different limitations. For example, the accuracy in [27], [31], [34], and [52] was below 90%. Also, as mentioned previously, the used features may be insufficient, as in the paper [28], which handles opcode sequences alone, or [30], which considers just API calls and permissions. The study [29], in addition to the few-features limitation, uses multiple machine learning algorithms in parallel, which has two disadvantages: first, traditional machine learning algorithms are shallow and ineffective in learning complex data, as discussed before; second, this technique, with all these algorithms run in parallel, can be time-consuming and expensive in terms of the resources used. The cost in time and resources expands significantly with the dynamic and hybrid types of analysis.

Another important limitation is that most previous studies categorize the data into just two classes, namely malware and benign. This helps in detecting malware samples, but not in classifying the malware into its correct category, which in turn reduces the chance of taking the appropriate precautions to protect Android devices [55].

Finally, the datasets used in most of the studies are from the years 2009–2015; even new papers released after 2020, such as [38] and [37], use datasets containing samples built for older versions of the Android system. Because of the rapid evolution of Android malware and its techniques, which definitely affects the extracted values, such datasets are considered outdated.

SECTION IV.

The Proposed Android Malware Detection Method

In this paper, we propose a method based on static analysis, where we extract all the observed useful features from Android applications, namely the permissions, services, broadcast receivers, API calls, and opcode sequences. Very few studies extract and use all of these features together, as extracting all of them is expensive in terms of time and of resources such as RAM and disk space. It is also hard to combine all these features in one model, which is why we built a functional API model: unlike sequential models, it is a practical solution that gives the freedom to handle each input individually and independently.

Additionally, we propose two new features: the file size and the fuzzy hash. The former can be a very useful feature for detecting malware, as it is strongly observed that the sizes of malware samples are generally under 5 MB, while the sizes of benign samples are generally over 10 MB, so there is a big difference between the sizes of malware and benign applications. The fuzzy hash is a powerful technique used for similarity detection, and by using it along with a gated recurrent unit (GRU) layer, any modified application can be effectively detected. The GRU layer lets our model learn the hash characters as an ordered sequence, as the order of the characters in the hash values is important and meaningful and should not be discarded.

By combining all of these features, there is little risk of feature correlation: the permissions, services, and broadcast receivers are extracted from the manifest XML file and each represents a different thing, so they do not correlate with each other, whereas the API calls and opcode sequences are extracted from the classes.dex files. Although the latter two are extracted from the same source, they are largely independent of each other, as the opcodes represent only the operations to be performed and include no information about the system functions being called. Obviously, the two added features are also completely independent of the others, as they are not extracted from the APK components at all, but computed over the APK as a single file without extraction.

Figure 2 shows an overview of our method. The input is an APK file, and the output is a prediction and classification into one of five classes: benign, SMS malware, banking malware, adware, and riskware. Between the APK input and the prediction output sit two processes: the feature extraction process, which uses the manifest.xml file and the decompiled classes.dex files as well as the APK file itself, followed by data preprocessing of the extracted features, after which the data are passed to our trained model to produce the prediction. More details about the feature extraction process and the data preprocessing can be found in the next subsections. The data preparation and the constructed model blocks that appear in Figure 2 are shown more clearly and in more detail in Figure 3.

FIGURE 2. Method overview.

FIGURE 3. The proposed functional API deep learning model for Android malware classification.

In the next subsections, we describe our method’s steps: the feature extraction process is explained in detail, then the data preprocessing, and after that the construction and training of our functional API deep learning model.

A. Feature Extraction

The first step, after obtaining the data (14,079 categorized Android applications), is to extract the features from them. After extensive research, and from the Android components described previously, we found the most useful features for Android malware detection to be as follows:

  1. Application size: it is noticed that malware samples are generally small; most benign apps are about three times the size of malware apps (or more). This is logical, because a malware developer does not build a real app with many useful functionalities, but instead wants to accomplish his purpose immediately, writing filler code alongside the malware functions. This can also be observed from the application’s category: most malware applications fall under accessories, such as wallpaper changers, image beautifiers, YouTube downloaders, etc. Unfortunately, there is no dataset that gives the application’s category in this way; otherwise, we believe it could be a very important feature.

  2. Permissions: extracted from the manifest.xml file. An app that has no access to the internet and does not read or write external storage can be assumed benign; on the other hand, what would a wallpaper changer app do with the phone’s location or with the “receive boot completed” permission?

  3. API calls: extracted from the decompiled code; they represent the system functions that the application calls in the background during its execution.

  4. Services: those are described previously and extracted from the manifest.xml file.

  5. Receivers: same as the previous one.

  6. Fuzzy Hash: which can detect similarity as explained previously in the second section.

  7. Opcode sequences: extracted from the decompiled code, these represent the instructions that the application executes at runtime, extracted in their original written order.

During the feature extraction process with the Androguard tool, errors could occur due to reading or extraction failures on some APK files; this was tackled by writing and executing extra code to eliminate the error-causing samples. Finally, the API calls, which were extracted independently, were merged with the other features into the generated CSV file.
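As an illustration, a condensed extraction script might look like the following. This is a sketch under our assumptions: it uses Androguard’s AnalyzeAPK entry point and the python-ssdeep binding, approximates the API calls by the external methods found by Androguard’s analysis object, and the helper name extract_features is ours, not part of any tool:

```python
import os
import ssdeep                           # python-ssdeep binding (CTPH fuzzy hashes)
from androguard.misc import AnalyzeAPK

def extract_features(apk_path):
    """Collect the seven features of one APK (hypothetical helper, simplified)."""
    a, dex_files, dx = AnalyzeAPK(apk_path)  # APK object, dex objects, analysis object

    # API calls: external methods are calls into code outside the APK,
    # i.e. mostly Android framework functions.
    api_calls = [str(m.get_method()) for m in dx.get_methods() if m.is_external()]

    # Opcode sequence: the numeric opcode of each instruction, operands discarded,
    # kept in the order in which the instructions are written.
    opcodes = []
    for dex in dex_files:
        for method in dex.get_methods():
            if method.get_code() is not None:          # skip abstract/native methods
                opcodes.extend(ins.get_op_value() for ins in method.get_instructions())

    return {
        "size": os.path.getsize(apk_path),             # file size in bytes
        "fuzzy_hash": ssdeep.hash_from_file(apk_path),
        "permissions": " ".join(a.get_permissions()),
        "services": " ".join(a.get_services()),
        "receivers": " ".join(a.get_receivers()),
        "api_calls": " ".join(api_calls),
        "opcodes": opcodes,
    }
```

In our pipeline, such a function would be invoked per APK from a Bash driver script, writing one CSV row per sample.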

B. Data Preprocessing

The Android applications we have are classified into five classes, so the features were extracted from each class separately. Five CSV files were thus obtained, one per class; these were merged together and shuffled, the rows containing a null value in any column were deleted, and then each feature was prepared individually to be passed to our deep learning model, as follows (a condensed preprocessing sketch follows this list):

  • Application’s size: using a standard scaler, each size was converted to a float number in the range of −1 to +1.

  • Opcode sequences: this feature was planned to be passed to RNN layer(s); however, each sequence was quite long, the longest having more than 2 million opcodes (represented as integers), so padding all the others to that length and passing them to an RNN layer did not seem useful or efficient. Instead, the count of each opcode was calculated; as the opcodes are integers in the range of 1 to 768, a matrix of 768 columns was created.

  • Fuzzy hash: a tokenizer preprocessing layer at the character level was used to represent each character in the hashes as an integer; taking upper- and lower-case letters as well as symbols into account, 65 unique characters were embedded. The hashes were converted to the integer sequences produced by the tokenizer’s “texts_to_sequences” method. Pre-padding was then applied to bring all hash values to the same length, and finally a reshape was applied to give the hash values three dimensions, as required by the following GRU layer.

  • Permissions, receivers, services, API calls: TextVectorization layers are used to build vocabularies and map these features’ values to integers. Because the number of unique values differs across these four features, different vocabulary sizes were given; they are shown in brackets in Figure 3.

  • Category: this is the target, a number from 0 to 4 representing the class of the application, which again can be benign, SMS malware, riskware, banking malware, or adware. This column was one-hot encoded in order to train the model to output a probability for each class.
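A condensed sketch of these steps, assuming scikit-learn and the Keras preprocessing utilities, with df a pandas DataFrame built from the merged CSV files (the column names are ours):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Application size: standardized via StandardScaler.
sizes = StandardScaler().fit_transform(df[["size"]].astype(float))

# Opcode sequences: a count vector over the 768 possible opcode values,
# instead of padding million-long sequences for an RNN.
def opcode_counts(seq):
    return np.bincount(seq, minlength=769)[1:]          # columns for opcodes 1..768
opcode_matrix = np.stack([opcode_counts(s) for s in df["opcodes"]])

# Fuzzy hash: character-level tokenization, pre-padding, and a trailing
# feature axis so the GRU layer receives 3-D input.
tok = Tokenizer(char_level=True)
tok.fit_on_texts(df["fuzzy_hash"])
hash_seqs = pad_sequences(tok.texts_to_sequences(df["fuzzy_hash"]), padding="pre")
hash_seqs = hash_seqs[..., np.newaxis]

# Permissions, services, receivers, and API calls stay as raw strings here;
# the TextVectorization layers inside the model handle them (see Figure 3).

# Category: integer class labels one-hot encoded for the softmax output.
labels = to_categorical(df["category"], num_classes=5)
```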

C. The Proposed Model Construction

Figure 3 shows the construction of the model, where a functional API model was built in order to pass inputs of different types and dimensionalities to the model.

As we can see, three consecutive dense layers handle the count vectors of the opcode sequences. The value in brackets represents the number of neurons, a hyperparameter that can be tuned for the best results.

The fuzzy hash, treated as a sequence, is passed to a gated recurrent unit (GRU) layer to learn from the order of the characters in the hashes; the output is then passed to a flatten layer to reduce the shape and make it suitable for a dense layer.

The permissions, services, receivers, and API call inputs are passed to embedding layers to understand the features and cluster them (similar words get similar embeddings), followed by flatten layers to make their shapes appropriate. Note that we put the preprocessing text vectorization layers inside the model, so that these four inputs can be passed directly to the model as raw data; this is then called an end-to-end model, because it accepts unhandled raw data. We could not do this for the other features, so the model is partially end-to-end.

The outputs of the flatten layers of the four branches just mentioned, the output of the fuzzy hash’s flatten layer, and the scaled size input are passed to a dense layer with 64 neurons and ReLU activation, which outputs non-negative values. The output of this layer, along with the output of the last dense layer of the opcode sequence branch, is passed to a dropout layer, which helps prevent the model from overfitting. This output is passed to a 32-neuron dense layer, whose output goes to a final dense layer with five or two neurons (the number of classes) and softmax activation; the neuron with the maximum probability value, in the range of 0 to 1, wins and is given as the model’s prediction. With this topology, the total number of trainable parameters of the constructed model is 12,351,695.
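To make the topology concrete, the following is a minimal Keras sketch of such a multi-input functional API model. The layer widths, vocabulary sizes, embedding dimensions, and sequence lengths here are illustrative placeholders, not the exact values from Figure 3, and the adapt() calls needed by the TextVectorization layers are elided:

```python
import tensorflow as tf
from tensorflow.keras import Model, layers

N_OPCODES, HASH_LEN, N_CLASSES = 768, 148, 5   # placeholders; 2 classes in the detection variant

# Opcode count vector -> three consecutive dense layers.
opcode_in = layers.Input(shape=(N_OPCODES,), name="opcodes")
x_op = layers.Dense(256, activation="relu")(opcode_in)
x_op = layers.Dense(128, activation="relu")(x_op)
x_op = layers.Dense(64, activation="relu")(x_op)

# Fuzzy hash as an ordered character sequence -> GRU -> flatten.
hash_in = layers.Input(shape=(HASH_LEN, 1), name="fuzzy_hash")
x_hash = layers.Flatten()(layers.GRU(32, return_sequences=True)(hash_in))

# Raw-string inputs -> TextVectorization -> Embedding -> Flatten (end-to-end branches).
def text_branch(name, vocab_size):
    inp = layers.Input(shape=(1,), dtype=tf.string, name=name)
    vec = layers.TextVectorization(max_tokens=vocab_size, output_sequence_length=64)
    # vec.adapt(...) must be called on the training texts before fitting.
    emb = layers.Embedding(vocab_size, 16)(vec(inp))
    return inp, layers.Flatten()(emb)

perm_in, x_perm = text_branch("permissions", 1000)
serv_in, x_serv = text_branch("services", 5000)
recv_in, x_recv = text_branch("receivers", 5000)
api_in,  x_api  = text_branch("api_calls", 20000)

size_in = layers.Input(shape=(1,), name="size")        # already standard-scaled

# Merge all branches except the opcodes into a 64-neuron dense layer, then join
# the opcode branch, apply dropout, and finish with softmax over the classes.
merged = layers.concatenate([x_hash, x_perm, x_serv, x_recv, x_api, size_in])
h = layers.Dense(64, activation="relu")(merged)
h = layers.Dropout(0.2)(layers.concatenate([h, x_op]))  # Keras drops 20%, i.e. keeps 80%
h = layers.Dense(32, activation="relu")(h)
out = layers.Dense(N_CLASSES, activation="softmax")(h)

model = Model([opcode_in, hash_in, perm_in, serv_in, recv_in, api_in, size_in], out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy", tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
```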

D. Model Training

During the training of the model, four performance metrics are used: accuracy, F1-score, precision, and recall. Our model was trained on two variations of the dataset, the first with two target classes (malware and benign), and the other with the five classes of the dataset. In Figures 4 and 5, it is evident that during training the evaluation metrics progressively rise while the training loss falls. We can see that the model achieves high accuracy after just 2 epochs. We can tell that the model has converged to the solution, with no overfitting, as the performance metrics give almost the same values both during training and when evaluating the model on the test set.

FIGURE 4. Learning curves: the mean training loss and evaluation metrics measured over each epoch with two target classes.

FIGURE 5. Learning curves: the mean training loss and evaluation metrics measured over each epoch with five target classes.

During training, 10% of the dataset was held out for testing, so as not to lose much training data; since we have a high number of samples, the test set contains more than 1,400 samples, already bigger than the test sets used in many previous papers.

The number of epochs was set to 30, and an early stopping callback was used with a patience value of 4, which stops the training when the model does not improve for 4 epochs. As a result, this callback stopped the training after 16 epochs.
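A sketch of this training setup, assuming the model defined above and the standard Keras EarlyStopping callback (the variable names are ours):

```python
from tensorflow.keras.callbacks import EarlyStopping

# train_inputs/test_inputs: dicts mapping the model's input names to arrays,
# after holding out 10% of the samples as an independent test set.
early_stop = EarlyStopping(monitor="val_loss", patience=4, restore_best_weights=True)
history = model.fit(train_inputs, train_labels,
                    validation_data=(test_inputs, test_labels),
                    epochs=30,               # upper bound; early stopping ends training sooner
                    callbacks=[early_stop])
```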

We can tell that there is no overfitting from the evaluation results on the test set, which are shown in the next section.

SECTION V.

Results and Discussion

In this section, we describe the characteristics of the dataset we used and then present the obtained results. Finally, the tools used for preparing the data and training the model are given.

A. Dataset

Many datasets are available from different sources. The dataset used in this paper is CICMalDroid 2020, provided by the University of New Brunswick, which contains 17,341 current and advanced Android samples categorized as adware, banking malware, SMS malware, riskware, and benign. These samples come from various sources, including VirusTotal [22], the Contagio security blog, AMD, MalDozer, and other datasets used in recent research papers, making it the most recent and diverse dataset available today. Full details of the dataset can be found in its research paper [8] and its underlying principles [66].

The total number of applications that were used and had their features extracted is 14,079. The percentage of each class can be seen in Figure 6.

FIGURE 6. The percentage of each class in the five target classes.

In order to compare with most studies, which use two classes (malware and benign), we also made another variation of the dataset, where we merged the four malware categories into one, simply called “Malware”. As a result, we got the two classes shown with their percentage distribution in Figure 7.

FIGURE 7. The percentage of each class in the two target classes.

B. Model Evaluation

In this section, we evaluate the model with the two variations of the dataset: one with two target classes (malware, benign), for malware detection only, and the other with five target classes, to detect as well as classify Android malware.

1) Evaluating Our Model With Two Target Classes

Since most previous studies classify Android samples into just two classes, malware and benign, and since more classes result in reduced overall prediction accuracy, as seen clearly in paper [39], we tested our model on our relabeled dataset, where we merged the four malware classes mentioned previously (banking, SMS, riskware, adware) into a single malware class; as expected, we obtained significantly higher accuracy.

Table 2 shows the classification report obtained after evaluating our trained model on the independent test set.

TABLE 2. Classification Report for Our Model With Just Two Target Classes.

Figure 8 shows the confusion matrix of the newly trained model evaluated on the test set.

FIGURE 8. Confusion matrix of our deep learning model with two target classes.

2) Evaluating Our Model With Five Target Classes

At the end of the training, the values obtained from the performance metrics for each class are presented in Table 3. We can see that the values are significantly high, with a macro average over all classes of 96% (the unweighted mean per label); the weighted average is also 96%. The number of actual occurrences of each class in the test dataset can be seen in the “#samples” column.

TABLE 3. Classification Report for Our Model With Five Target Classes.

The confusion matrix in Figure 9 shows a high and acceptable effectiveness of the model: the classes are predicted correctly in the range of 92–99%, with the highest classification accuracy for the SMS malware class (99%) and the lowest for the banking malware class (92%). The percentages of false classifications are significantly low, the highest being adware predicted as banking malware, at 3.9%.

FIGURE 9. Confusion matrix of our deep learning model with five target classes.

C. Tools

Static analysis was used in our method because, as discussed previously, sophisticated malware samples can detect and deceive the virtual system used in dynamic analysis. With static analysis we can detect malware samples before installing and running them on a device; it is also faster, requires fewer resources, and allows more code coverage.

Androguard [7] was used for the feature extraction process; although it is slow in handling big files, no other tested tool gave better performance. Additionally, some tools, like ClassyShark [20] and ApkStudio [21], are limited to a user interface, so automating the feature extraction is not possible with them. Also, not all Android decompilers allow the extraction of all features of Android applications: some are limited to decompiling dex classes, and others to manifest files.

The ssdeep tool [26] was used to calculate the fuzzy hashes, and the ‘ls’ terminal command was used to obtain the file sizes of the Android applications. To automate the process, Bash scripts and Python were used under a GNU/Linux operating system (the Linux Mint 20 distribution) to extract the features from each APK file one by one and build the CSV dataset.

For data preparation and machine learning, the pandas library was used to read the data (the resulting CSV files), together with the NumPy, scikit-learn, TensorFlow, and Keras libraries. The Matplotlib library was used to show the results visually in diagrams and confusion matrices.

The process of reading and processing the data and training the model was carried out on Google Colab Pro online, to benefit from its powerful shared resources.

SECTION VI.

Tuning the Model’s Hyperparameters

Although the accuracy obtained by our model is high, we tried to improve it by tuning hyperparameters such as the optimizer and the dropout rate, and by altering the number of neurons, which in turn alters the number of trainable parameters. All the results are shown for both dataset variations (two classes, five classes).

Tables 4 and 5 compare different dropout rates and the corresponding results of the four evaluation metrics obtained on the test set. The last row in each table indicates that no dropout was used.

TABLE 4. Evaluation Metrics Results With Different Dropout Rates in Our Model and Two Target Classes.
TABLE 5. Evaluation Metrics Results With Different Dropout Rates in Our Model and Five Target Classes.

The dropout layer helps prevent the model from overfitting the training data by eliminating part of the outputs of the neurons in the previous layer. In the convention used here, a dropout rate of 0.8 means that 80% of the outputs are passed to the next layer; so a low rate like 0.2 means much data is lost, which reduces the model performance, as seen in Table 5, while in Table 4 the results for the 0.2 and 0.8 rates are almost the same. Thus, the best experimented dropout rate according to the overall evaluation results is 0.8, and we adopted this hyperparameter.

We then tried to reduce the number of trainable parameters by eliminating the two dense layers with 64 neurons each, shown in green as the third and fourth hidden layers in Figure 3. As a result, the number of trainable parameters decreased from 12,351,695 to 7,510,319; we call this “Variation A” of our model.

Another variation was tested by adding two dense layers with 128 neurons each, one before each of the two 64-neuron dense layers mentioned above, while keeping those two layers; this increased the number of trainable parameters to 22,042,639. Increasing the trainable parameters causes the model to overfit the training dataset, reducing prediction accuracy on new data, as can be seen in the results under the name “Variation B”.

Lastly, with the same model shown in Figure 3 (without adding or deleting any layer), we wanted to increase the trainable parameters slightly, so as not to cause overfitting as in the previous variation. We therefore increased the number of neurons in the last hidden (dense) layer from 32 to 40, which raises the number of trainable parameters only slightly, to 12,352,767. We call this “Variation C”.
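To make these variations concrete, the sketch below shows a functional-API structure of the kind described here: one input branch per feature type, a shared trunk with the two 64-neuron dense layers (removed in Variation A) and a final 32-neuron dense layer (40 in Variation C). The branch widths and input dimensions are illustrative assumptions, not the exact architecture of Figure 3.

```python
# Sketch: multi-input functional-API model with the layers discussed above.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_model(input_dims, n_classes, last_hidden=32, drop_fraction=0.2):
    inputs, branches = [], []
    for name, dim in input_dims.items():
        inp = layers.Input(shape=(dim,), name=name)
        inputs.append(inp)
        branches.append(layers.Dense(128, activation="relu")(inp))
    x = layers.concatenate(branches)
    # Keras drops this fraction; keeping ~80% of activations matches the
    # adopted dropout setting read as a keep probability.
    x = layers.Dropout(drop_fraction)(x)
    x = layers.Dense(64, activation="relu")(x)   # removed in Variation A
    x = layers.Dense(64, activation="relu")(x)   # removed in Variation A
    x = layers.Dense(last_hidden, activation="relu")(x)  # 32 -> 40 in Variation C
    out = layers.Dense(n_classes, activation="softmax")(x)
    return Model(inputs=inputs, outputs=out)

# Example usage (hypothetical feature dimensions):
# model = build_model({"permissions": 400, "api_calls": 5000}, n_classes=5)
```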

Tables 6 and 7 show a comparison between the evaluation metrics’ results for our basic model structure and its tested variations.

TABLE 6 Evaluation Metrics Results for Our Model Variations With Two Target Classes
TABLE 7 Evaluation Metrics Results for Our Model Variations With Five Target Classes

As clearly observed, reducing the trainable parameters reduces the model's training time but also reduces its effectiveness. Likewise, increasing the trainable parameters does not help the model perform better, as seen with Variations B and C.

We also performed other experiments by tweaking the number of neurons in other layers, but all the obtained results were below those of the original configuration.

Finally, we tried different optimizers, and the obtained results are shown in Tables 8 and 9. Note that the Adagrad optimizer required more training time (19 epochs) and gave the lowest results, while the Adamax optimizer took the longest time (28 epochs).

TABLE 8 Evaluation Metrics Results for Our Model With Different Optimizers and Two Target Classes
TABLE 9 Evaluation Metrics Results for Our Model With Different Optimizers and Five Target Classes

We notice that the Adamax optimizer gave the same results as the Adam optimizer but required much more training time, needing 30 epochs to reach that accuracy.
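One way such a sweep can be reproduced is sketched below, reusing build_model from the earlier sketch; early stopping lets each optimizer run for as many epochs as it needs (capped at 30, matching the epoch counts reported here). The loss and data shapes are assumptions for the five-class case with integer labels.

```python
# Sketch: train the same architecture with a given optimizer and report
# the epochs used and the best validation accuracy.
import tensorflow as tf

def try_optimizer(opt_name, train_data, val_data, input_dims, n_classes):
    model = build_model(input_dims, n_classes)
    model.compile(optimizer=opt_name,  # e.g. "adam", "adamax", "adagrad"
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=3, restore_best_weights=True)
    hist = model.fit(*train_data, validation_data=val_data,
                     epochs=30, callbacks=[stop], verbose=0)
    return len(hist.history["loss"]), max(hist.history["val_accuracy"])
```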

SECTION VII.

Experimental Evaluation of Other Models

In this section, we test the classifiers most commonly used in previous studies and compare them with our model; we also compare our model with models from recent studies that use the same dataset we used.

A. Experimental Evaluation of Other Classifiers

To enable a comparison between our proposed model and the models used in previous works, the most commonly used of those models were tested on our dataset. The tested classifiers are Stochastic Gradient Descent (SGD), Decision Tree, Random Forest, C-Support Vector Classifier (SVC), K-Nearest Neighbor (KNN), XGBoost, and Gaussian Naive Bayes (GaussianNB).

All the results will be shown for both dataset variations (two classes, five classes).
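The comparison loop itself is simple, as sketched below: each classifier is trained and scored on a held-out split of whichever feature matrix it is given (all features or a subset), with default hyperparameters as a baseline. The split parameters are assumptions, not necessarily those used in our experiments.

```python
# Sketch: fit and score every baseline classifier on one feature matrix.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

CLASSIFIERS = {
    "SGD": SGDClassifier(),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
    "XGBoost": XGBClassifier(),
    "GaussianNB": GaussianNB(),
}

def score_all(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)
    return {name: clf.fit(X_tr, y_tr).score(X_te, y_te)  # overall accuracy
            for name, clf in CLASSIFIERS.items()}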

At the beginning, only the first four features were passed to the classifiers, namely permissions, receivers, services, and API calls. The reason is that these features were the most commonly used in previous works, so we wanted to test whether the additional features we used provide better results. Tables 10 and 11 show the accuracy of the tested classifiers with only these four features.
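Restricting the classifiers to those four feature groups is then a matter of column selection, as in the sketch below; the column-name prefixes are hypothetical stand-ins for however the CSV encodes feature types, and score_all comes from the earlier sketch.

```python
# Sketch: four commonly used feature groups vs. all extracted features.
four_cols = [c for c in df.columns
             if c.startswith(("perm_", "recv_", "srv_", "api_"))]
four_feature_acc = score_all(df[four_cols].values, y)              # Tables 10-11
all_feature_acc = score_all(df.drop(columns=["label"]).values, y)  # Tables 12-13
```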

TABLE 10 Overall Accuracy of Each Tested Classifier With Only the First Four Features and Two Target Classes
TABLE 11 Overall Accuracy of Each Tested Classifier With Only the First Four Features and Five Target Classes

We then passed all the extracted features to these classifiers and re-checked the results, which are shown in Tables 12 and 13. According to the average accuracy across all classifiers, it is clear that using all the features we extracted gives better accuracy than using only the features most commonly used in previous papers.

TABLE 12 Overall Accuracy of Each Tested Classifier With All Features and Two Target Classes
TABLE 13 Overall Accuracy of Each Tested Classifier With All Features and Five Target Classes

Using all the features, in both the two-class and five-class variations, the performance of the best classifier was clearly lower than that of our proposed model. We also see clearly, as expected, that reducing the number of classes makes the results significantly better.

In the following subsections, the results of each tested classifier (using all features) are discussed in detail.
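The row-normalized matrices shown in the following figures can be produced as sketched below, where normalize="true" turns each row into per-class recall; the class-label ordering shown is an assumption based on alphabetical label encoding.

```python
# Sketch: row-normalized confusion matrix for one fitted classifier.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

def plot_cm(clf, X_tr, y_tr, X_te, y_te, labels):
    clf.fit(X_tr, y_tr)
    ConfusionMatrixDisplay.from_predictions(
        y_te, clf.predict(X_te), display_labels=labels,
        normalize="true", values_format=".2f", cmap="Blues")
    plt.show()

# e.g. plot_cm(CLASSIFIERS["SVC"], X_tr, y_tr, X_te, y_te,
#              ["Adware", "Banking", "Benign", "Riskware", "SMS"])
```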

1) Evaluation of SVC Classifier

Figure 10 shows the confusion matrix of the SVC classifier trained to predict just two classes: malware and benign. In this case, the classifier gives high prediction accuracy for the malware class, but 36% of benign samples are misclassified as malware.

FIGURE 10. Confusion matrix of SVC classifier with two target classes.

In the case of five target classes, the confusion matrix in Figure 11 shows that 4.4% of the benign samples were incorrectly classified as adware and 1.7% as riskware. Since we care most about distinguishing the benign class from malware, the 88% classification accuracy for the benign class is not particularly high.

FIGURE 11. Confusion matrix of SVC classifier with five target classes.

2) Evaluation of Random Forest Classifier

In Figure 12, for the case of two target classes, we see that 17% of benign samples are classified as malware, a lower false classification rate than that of the previous classifier, while the prediction accuracy for the malware class remains 99%.

FIGURE 12. Confusion matrix of RandomForest classifier with two target classes.

With five target classes, the true classification rate of the benign class was no better than that of the SVC classifier: the confusion matrix in Figure 13 shows the same 88% rate for the benign class. The class most often wrongly classified as benign is adware, at 7.7%, a relatively high false classification rate.

FIGURE 13. Confusion matrix of RandomForest classifier with five target classes.

3) Evaluation of SGD Classifier

With two target classes, Figure 14 shows a very low classification accuracy for the benign class of just 20%.

FIGURE 14. Confusion matrix of SGD classifier with two target classes.

In the case of five target classes, the Stochastic Gradient Descent classifier's rates fall between those of SVC and Random Forest. The confusion matrix in Figure 15 shows that the benign class was most often wrongly classified as banking malware, at 8.3%.

FIGURE 15. Confusion matrix of SGD classifier with five target classes.

4) Evaluation of Decision Tree Classifier

The confusion matrix for two target classes in Figure 16 shows that the Decision Tree classifier gives high prediction accuracy for both the malware and benign classes, making it the best of the classifiers shown so far for the two-class case.

FIGURE 16. Confusion matrix of Decision Tree classifier with two target classes.

The confusion matrix for five target classes in Figure 17 shows that misclassified benign samples were mostly labeled as banking malware and riskware (5% each) and never wrongly classified as SMS malware; the SMS malware class has the highest true classification rate, at 98%.

FIGURE 17. Confusion matrix of Decision Tree classifier with five target classes.

5) Evaluation of GaussianNB Classifier

Figure 18 shows that, in the case of two target classes, the Gaussian Naive Bayes classifier classifies almost all samples as benign, despite the benign samples being far fewer than the malware samples.

FIGURE 18. Confusion matrix of GaussianNB classifier with two target classes.

With five target classes, the Gaussian Naive Bayes classifier gave the lowest results among the tested classifiers. The confusion matrix in Figure 19 shows that the true classification rate of the benign class (48%) is lower than its misclassification rate as SMS malware (55%); this is the only tested classifier for which a false classification rate exceeds the true classification rate.

FIGURE 19. Confusion matrix of GaussianNB classifier with five target classes.

6) Evaluation of XGBoost Classifier

In the case of two target classes, the XGBoost classifier performs very well, with a prediction accuracy of 99% for the malware class and 92% for the benign class, as seen in Figure 20.

FIGURE 20. Confusion matrix of XGBoost classifier with two target classes.

In the case of five target classes, the confusion matrix in Figure 21 shows that the XGBoost classifier performs well overall, with the lowest accuracy (85%) for the adware class and the highest (98%) for the SMS malware class.

FIGURE 21. Confusion matrix of XGBoost classifier with five target classes.

7) Evaluation of KNN Classifier

In the case of two target classes, the confusion matrix of the KNN classifier shows a prediction accuracy of 98% for the malware class and 72% for the benign class, as seen in Figure 22.

FIGURE 22. Confusion matrix of KNN classifier with two target classes.

In the case of five target classes, the K-Nearest Neighbor classifier gave relatively good prediction accuracy. The confusion matrix in Figure 23 shows that the adware class has the lowest true classification rate, at 68%, while SMS malware has the highest, at 96%. The true classification rate of the benign class is 75%.

FIGURE 23. Confusion matrix of KNN classifier with five target classes.

Based on the results of these classifiers and of our proposed model in the five-class case, the banking and adware classes have the lowest classification accuracy, which suggests that the samples used were insufficient and that more samples for these classes are needed to improve the classification accuracy.

8) Experimental Comparison With the Proposed Model

In Tables 14 and 15, we present an overall comparison between all the classifiers and our proposed model for the two-class and five-class cases, respectively. It is clearly observed that our proposed model gives the highest scores among the tested classifiers across all metrics and for all classes.
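The four comparison metrics for any fitted model can be gathered as sketched below; the averaging scheme for the multi-class case is an assumption (weighted averaging, which accounts for the class imbalance noted earlier).

```python
# Sketch: accuracy, precision, recall, and F1 for one set of predictions.
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

def four_metrics(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="weighted"),
        "recall": recall_score(y_true, y_pred, average="weighted"),
        "f1": f1_score(y_true, y_pred, average="weighted"),
    }
```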

TABLE 14 Overall Comparison of Our Proposed Model and the Tested Classifiers With Two Target Classes
TABLE 15 Overall Comparison of Our Proposed Model and the Tested Classifiers With Five Target Classes

B. Experimental Comparison With Related Works

In this section, we compare our model with the studies that use the same dataset we used, considering only the two-target-class case (malware and benign), as these studies concentrate on the malware detection problem alone, without addressing malware classification.

As seen in Table 16, our model performs better than the few existing recent studies that use the same dataset.

TABLE 16 Experimental Comparison Between Our Model and Related Works With Two Target Classes

SECTION VIII.

Conclusion

Because of the rapidly increasing number of malware samples targeting the Android operating system with ever newer and more sophisticated techniques, many studies have been performed and published attempting to develop a tool or system to automatically detect Android malware. Most of these studies do not cover all of an Android application's features and information; many report accuracy below 90%, and the ability of the others to maintain high accuracy on new samples is questionable.

In this paper, we proposed a new method for Android malware detection, building and training a new functional-API deep learning model that handles each feature individually, since we combined many features of different types and dimensionalities. We extracted the most useful features we observed, namely the permissions, API calls, services, broadcast receivers, and opcode sequences, combining all of them in one study and one model; to the best of our knowledge, very few studies use all of these features together. Additionally, we proposed two completely new static features, namely the application size and the fuzzy hash. The former is a strong marker of malware, given the large observed difference between the size of benign samples, which generally exceeds 10 MB, and the size of malware samples, which is generally below 5 MB. The fuzzy hash is also an important added feature, used for similarity detection, which in turn is essential in this domain.

We used a recent and diverse dataset, CICMalDroid 2020, to train our model, extracting the mentioned features from 14,079 samples belonging to five classes. This is an added advantage, as the model not only detects malware but also predicts its class from among four malware classes, namely adware, banking malware, SMS malware, and riskware.

To compare our work with other studies, which have the limitation of concentrating on malware detection without considering the malware classification problem, we also trained our model and performed all other experiments on another variation of the dataset, where we merged the malware classes into one “Malware” class, to have only two target classes, malware and benign, like most related works.

Using four metrics to evaluate the performance of our model, namely F1-score, recall, precision, and accuracy, all of them exceeded 96%, which is significantly high; we believe this results from combining the features used in previous studies with our newly proposed ones, the file size and fuzzy hash.

We also tested algorithms used in previous papers, namely the Random Forest, Stochastic Gradient Descent, Gaussian Naive Bayes, C-Support Vector, K-Nearest Neighbor, XGBoost, and Decision Tree classifiers. The experimental results show that our model outperforms all tested classifiers on all evaluation metrics.
