
A Survey on Attacks and Their Countermeasures in Deep Learning: Applications in Deep Neural Networks, Federated, Transfer, and Deep Reinforcement Learning


Abstract:

Deep Learning (DL) techniques are being used in various critical applications like self-driving cars. DL techniques such as Deep Neural Networks (DNN), Deep Reinforcement Learning (DRL), Federated Learning (FL), and Transfer Learning (TL) are prone to adversarial attacks, which can make the DL techniques perform poorly. Developing such attacks and their countermeasures is the prerequisite for making artificial intelligence techniques robust, secure, and deployable. Previous survey papers only focused on one or two techniques and are outdated. They do not discuss application domains, datasets, and testbeds in detail. There is also a need to discuss the commonalities and differences among DL techniques. In this paper, we comprehensively discuss the attacks and defenses in four popular DL models, including DNN, DRL, FL, and TL. We also highlight the application domains, datasets, metrics, and testbeds in these fields. One of our key contributions is to discuss the commonalities and differences among these DL techniques. Insights, lessons learned, and future research directions are also highlighted in detail.
Published in: IEEE Access (Volume: 11)
Page(s): 120095 - 120130
Date of Publication: 20 October 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

A. Motivation

For the past few years, adversarial deep learning (ADL) and robust DL (RDL) have been explored as promising research directions. The major thrust in these domains has been developing robust adversarial attacks or robust defenses against them in Deep Neural Networks (DNNs). However, more attack and defense issues have been recognized as more advanced DL models, such as Federated Learning (FL), Transfer Learning (TL), and Deep Reinforcement Learning (DRL), have been explored and extensively studied. Although FL was introduced to preserve data privacy without exchanging raw data between clients and a central server, various security and privacy vulnerabilities have been identified, such as attackers aiming to infer and reconstruct private data or disrupt FL’s learning process [1]. TL has been used to expedite a model’s learning process; however, it is known to be vulnerable to various backdoor attacks [2]. DRL has become popular in designing autonomous systems that must ensure critical safety, such as self-driving cars, or reliability under high dynamics, such as autonomous unmanned aerial vehicles [3]. Hence, there is a vital need to develop DRL that is robust against adversarial attacks that can endanger human safety.

Due to the popularity of this topic, there have been many survey papers, particularly surveying attacks and defenses in DNNs, which are the backbone of all other DL models, as summarized in Table 1. However, no prior survey paper has provided comprehensive explanations and comparisons of attacks and defenses across various DL models, including DNNs, FL, TL, and DRL. In addition, some survey papers are already outdated and only focus on a few techniques. Further, no prior work has comprehensively analyzed the overall trends of datasets, application domains, metrics, and experimental environments across the four DL models. Therefore, there is a dire need for a comprehensive survey that can help researchers who want to start working in this research domain. We believe our in-depth future research directions can also guide them toward promising research.

TABLE 1 Comparison of Our Survey Paper With the Existing Surveys of Attacks and Defenses in Deep Learning

B. Comparison With Existing Surveys

In this section, we describe the existing survey papers in DNNs, FL, TL, and DRL and identify the unique contributions of our survey paper compared to them.

1) Existing Surveys in Deep Neural Networks

Zhou et al. [12] focused on discussing adversarial perturbations and defenses in DNNs. However, their discussion is limited to traditional DNN models and does not consider other DL models, such as FL, TL, or DRL. Michel et al. [13] surveyed four types of adversarial attacks and two defense mechanisms using three benchmark datasets in DL-based systems, focusing mainly on the trade-off between the effectiveness and efficiency of the defense methods. Liang et al. [14] briefly discussed adversarial attacks (e.g., adversarial examples) in DNNs and the defenses against them, along with key challenges and prospects based on the current state of the art. However, their survey covers only a few attacks and defenses with very limited future research directions. Zhao et al. [15] only surveyed different types of adversarial training methods as defense techniques against adversarial examples.

Ozdag [6] surveyed state-of-the-art adversarial attacks and their corresponding defenses, but only those applicable to computer vision, and discussed the results of a Google Brain competition on generating adversarial examples. Wang et al. [7] presented a comprehensive survey of adversarial attacks and defenses on DNNs but limited their scope to Natural Language Processing (NLP) tasks. Gao et al. [8] mainly surveyed backdoor attacks and the corresponding defenses in DNNs and discussed the key benefits of considering backdoor attacks in securing systems in different application domains; hence, their survey is mainly limited to backdoor attacks. He et al. [9] surveyed various adversarial attacks in DNNs, including backdoor and other attacks. However, their discussions are not comprehensive; in particular, the surveyed defenses are significantly limited in scope. Li et al. [11] conducted a comprehensive survey on backdoor learning, covering backdoor attacks and defenses, but limited to traditional DNNs and not including other DL models, such as FL, TL, or DRL.

2) Existing Surveys in Federated Learning

Compared to the survey papers in DNNs discussed in Section I-B1, FL has not been surveyed in as much depth. Kairouz et al. [1] discussed common threat models as well as the proposed defenses in FL. In particular, this paper described common datasets and software environments used in FL within enterprise and research settings to provide useful resources for FL researchers. Lim et al. [16] considered only attacks and defenses developed for FL applications in mobile edge networks (MENs), focusing on the heterogeneous nature of client computing resources. Their discussions mainly concerned how to balance critical tradeoffs between preserving privacy (e.g., differential privacy) and securing aggregation or the accuracy of results at the server. However, the paper did not include discussions of the datasets, metrics, testbeds, and other experimental settings used in the surveyed works, which is often helpful for other researchers setting up their implementation and evaluation environments. Lyu et al. [17] covered poisoning attacks (on data and models) and inference attacks in FL in isolation, with a brief discussion of datasets, testbeds, insights, and future research, but all with limited scope; due to the fast growth of DL research, this paper, published in 2020, is already outdated. Gong et al. [18] discussed backdoor attacks and the corresponding defenses in FL; however, they do not explain the overall trends of attacks and defenses in FL, especially Vertical FL (VFL).

Zhang et al. [19] surveyed attacks and defenses in FL and discussed challenges and future research directions. Liu et al. [20] conducted a survey on the taxonomy of threats, attacks, and defenses in FL. The authors considered attacks and corresponding defense methods in both the training and prediction phases, providing a more comprehensive view of security concerns in FL systems. Chen et al. [21] provided a comprehensive survey on FL systems considering attacks and defenses at both the privacy and security levels. Xia et al. [22] conducted a survey focusing only on poisoning attacks in FL systems, providing a detailed analysis of various poisoning attacks and the corresponding defense methods. Sagar et al. [23] presented a short survey of different attacks and defense techniques in Horizontal FL (HFL) with a focus on experimental comparison. However, the depth of their survey is significantly limited, and they did not discuss VFL, which is one of the promising research directions in FL. Sikandar et al. [24] briefly discussed different types of FL systems and presented some general attacks and their defenses in FL. However, they did not discuss attacks and defenses specific to a particular type of FL, such as VFL. Rodríguez-Barroso et al. [25] developed a taxonomy of attacks and defenses in FL and considered both HFL and VFL. However, the authors did not provide a holistic overview of common metrics and datasets, only showing results using the metric they used for their attack implementation. Nair et al. [26] explained poisoning and inference attacks in FL without discussing defenses, testbeds, and application domains, and did not consider other types of FL inference-phase attacks.

Although FL has been surveyed by several existing papers as discussed above, their attack and defense discussions are limited to common categories. In addition, they mainly limited their discussions to FL without involving any other DL models. Further, they did not discuss the overall trends of metrics, datasets, and testbeds, while only partly discussing limitations and future work directions. In our paper, we fill this gap and provide more comprehensive discussions across various DL models. Further, we identify commonalities and differences between different DL models and the characteristics of the attacks and defenses in each model.

3) Existing Surveys in Transfer Learning

Transfer learning (TL) has been recognized as one of the emerging technologies to expedite the training of DL models [29]. Due to its recent emergence, TL has not been surveyed as extensively as DNNs or FL. Wang et al. [27] mainly focused on listing defenses against attacks without providing a complete overview of common datasets and corresponding application domains. Wang et al. [28] focused on surveying attacks and defenses in TL, but their review was limited to medical applications. Compared to [27] and [28], Shafahi et al. [29] provided a much more in-depth survey on defenses in TL based on insights in theories and applications. However, they limited their survey to their own works on TL defenses. Liu et al. [30] surveyed neural Trojan attacks in deep learning and briefly discussed such attacks in TL.

4) Existing Surveys in Deep Reinforcement Learning

Compared to the DL models discussed in the previous sections, DRL has been substantially less surveyed. Behzadan and Munir [33] presented the first survey summarizing the few papers on existing adversarial attacks and defenses in DRL. However, it does not provide comparisons in terms of testbeds and application domains, and since it was published in 2018, it does not include the many DRL-based approaches published afterward. Ilahi et al. [3] presented a comprehensive survey on adversarial attacks and defenses in DRL. However, this survey did not cover the defenses specifically proposed for DRL settings, nor did it consider the latest explainable-AI-based attacks and defenses or multi-agent DRL attacks. Other survey papers include [34] and [35], whose focus is not DRL-based attacks. Although Olowononi et al. [34] surveyed resilient machine learning for cyber-physical systems focusing on DNN attacks and defenses, they did not discuss DRL attacks and defenses in depth, nor did they provide in-depth comparisons based on application domains, insights, limitations, and testbeds considered. Wang et al. [35] focused on attacks and defenses in user authentication systems and only partially addressed attacks and defenses in DNNs and DRL. Hallaji et al. [32] surveyed attacks and defenses in TL and FL and discussed their performance in terms of model robustness, security, and privacy. However, they did not establish commonalities and differences between the two fields.

Table 1 summarizes which key aspects of attacks and defenses in DNNs, FL, TL, and DRL are covered by the existing survey papers discussed above, so that readers can easily capture the overall trends observed in existing surveys of attacks and defenses in DL models.

C. Key Contributions

This paper makes the following key contributions:

  1. Prior survey papers mostly focused on only attacks and/or defenses in one deep learning model, particularly, DNNs, FL, or TL. Based on Table 1, we notice that significantly fewer works have been explored in DRL. This paper is the only work that covers all four DL models and allows their comparisons in terms of the characteristics of attacks and defenses in each DL model.

  2. Since our survey paper covers four DL models, including DNNs, FL, TL, and DRL, we provide unique perspectives in discussing their commonalities and differences. This allows readers to capture how each DL model is related to others and how multiple DL models can be considered synergistically in developing attack or defense algorithms.

  3. We also provide the overall picture of attack algorithms in each DL model in terms of attack goal, method, type, defense deployed against given attacks, and datasets. This can provide the general trends of the attack algorithms studied in each model. Similarly, we provide the overall picture of defense algorithms in each DL model in terms of defense goal, method, type, application (e.g., services or tasks), attacks considered, and datasets. We also summarize these from Table 2 to Table 9.

  4. We discuss the metrics, datasets, and application domains considered in the state-of-the-art attacks and defenses in each DL model, including DNNs, FL, TL, and DRL. No prior work has provided such a survey demonstrating the overall trends of experimental settings.

  5. By extensively surveying existing attacks and defenses in the four DL models, we provide the key findings, lessons learned, and insights. In addition, based on the lessons learned and limitations identified from the state-of-the-art attacks and defenses in DL models, we suggest a set of promising future research directions in this research domain.

TABLE 2 Summary of Adversarial Attacks in Deep Neural Networks and Their Key Components (All Works are Proposed for Classification Applications)
TABLE 3 Summary of Defenses in Deep Neural Networks and Their Key Components
TABLE 4 Summary of Adversarial Attacks Unique to Federated Learning and Their Key Components (All Works are Proposed for Classification Applications)
TABLE 5 Summary of Defenses in Federated Learning and Their Key Components
TABLE 6 Summary of Adversarial Attacks in Transfer Learning and Their Key Components (All Works are Proposed for Classification Applications)
TABLE 7 Summary of Defenses in Transfer Learning and Their Key Components (All Works are Proposed for Classification Applications)
TABLE 8 Summary of Adversarial Attacks in Deep Reinforcement Learning and Their Key Components
TABLE 9 Summary of Defenses in Deep Reinforcement Learning and Their Key Components

D. Structure of the Paper

Figure 1 shows the key structure of our paper in terms of the main topics covered in this research area. We first explain the attacks and defenses separately for the four DL models (DNNs, FL, TL, and DRL), where limitations, opportunities, and summaries are given at the conclusion of each model’s discussion. In later sections, we discuss the experimental setups, including metrics, datasets, and application domains. To tie all the models together, we then discuss the similarities and differences in the attack and defense approaches across the models. Finally, we present multiple future directions and challenges, followed by the conclusion.

FIGURE 1. Structure of our paper.

SECTION II.

Adversarial Attacks and Defenses in Deep Neural Networks

This section discusses the basic concept of deep neural networks (DNNs). Then, we discuss the types of adversarial attacks and defenses in DNNs as root mechanisms in deep learning (DL). We also provide insights and limitations of the existing attack and defense techniques in DNNs.

A. Background in Deep Neural Networks

DL is motivated by biological nervous systems, which consist of many neurons that transfer information. The DL process has two phases. The first phase is model training using training datasets [9]; it includes data pre-processing, which removes irrelevant data and transforms the data into a certain shape or dimension. The second phase is model prediction, in which the model makes predictions on new data. DL models are widely used to make predictions in different fields, such as face recognition [36], spam detection [37], speech processing [38], and machine translation [39]. All DL models are based on DNNs, which perform approximate computation through a layered structure. The first layer is the input layer, which receives the preprocessed training data. The last layer is the output layer, and the layers in between are hidden layers.

Neurons in one layer are connected to neurons in the next layer through weights. Each neuron in a layer after the input layer receives a weighted sum of the outputs of the neurons in the previous layer and applies an activation function to transform this input into an output. DL uses a loss function with backpropagation to adjust the weights and improve the learning process [40]. Figure 2(a) describes how a DNN works.

FIGURE 2. Key concepts of DNNs and backdoor attacks on DNNs.

In DNNs, a backdoor attack aims to poison the training dataset to make an AI model misclassify certain labels. The backdoor attack inserts a trigger into the image dataset and changes the labels of these images to a different label that the attacker can exploit. A backdoor attack can cause catastrophic consequences, especially in the medical field or in self-driving cars, where the impact might be fatal. Figure 2(b) shows the key steps of how an attacker performs a backdoor attack, where the clean image is classified correctly while the triggered image is misclassified.

A loss function is used to evaluate the performance of models. The idea is to transform learning problems into optimization problems, where the DNN aims to minimize a loss function. One of the most common loss functions used in DNNs for classification problems is the Cross-Entropy (CE) loss, which the DL model aims to minimize during the training stage. Murphy [41] introduced the CE loss function from Information Theory [42] for multi-class classification problems as:\begin{equation*} J(\boldsymbol {\theta }) = -\sum _{n}\sum _{k}y_{n,k} \log (\hat {y}_{n,k}(\boldsymbol {\theta })), \tag{1}\end{equation*} where \boldsymbol {\theta } denotes the model parameters, and n and k index the inputs and classes, respectively. The term y_{n,k} indicates whether class label k is correct for observation n , and \hat {y}_{n,k} is the predicted probability of observation n for class k .
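As a concrete illustration of Eq. (1), the following is a minimal NumPy sketch of the CE loss; the function and variable names are illustrative and not taken from the paper:

import numpy as np

def cross_entropy_loss(probs, labels):
    """Cross-entropy of Eq. (1).

    probs:  (N, K) predicted class probabilities (y-hat in Eq. (1))
    labels: (N, K) one-hot ground-truth indicators (y in Eq. (1))
    """
    eps = 1e-12                                    # avoid log(0)
    per_sample = -np.sum(labels * np.log(probs + eps), axis=1)
    return float(np.sum(per_sample))               # Eq. (1) sums over samples;
                                                   # implementations often average

# Two samples, three classes
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([[1, 0, 0],
                   [0, 1, 0]])
print(cross_entropy_loss(probs, labels))           # ~0.58 (= -(ln 0.7 + ln 0.8))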

B. Attacks in DNNs

This section discusses various adversarial attacks designed to disrupt the operations in DNNs.

1) Adversarial Examples (AEs) [43]

AEs have been commonly discussed based on different attack times, such as training and testing phases in DNNs, as follows:

  • Data poisoning attack (DPA) injects noise or false information during the training time [43]. This attack has commonly been studied under a white-box setting, meaning that the attacker is assumed to know the data and the target model, and it can be performed with or without a specific target [43]:

    • Targeted attack: This attack injects a hidden trigger during the training process to misclassify particular samples to the attacker’s desired label. It is also called backdoor attacks [44].

    • Non-targeted attack: This attack injects noise or false information for a target model to misclassify any samples except the correct labels.

  • Evasion attack (EVA) perturbs data during the testing time. This attack is mainly considered under gray- or black-box settings, in which the attacker has at most partial knowledge of the target model and no knowledge of the training data [45]. A minimal white-box, gradient-based example sketch follows this list.
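To make the notion of an adversarial example concrete, below is a minimal white-box, FGSM-style evasion sketch in PyTorch; the specific attack is our illustrative choice rather than one prescribed by [43] or [45], and the model, eps, and valid pixel range [0, 1] are assumptions:

import torch
import torch.nn.functional as F

def fgsm_evasion(model, x, y, eps=0.03):
    """Non-targeted, one-step evasion example (FGSM-style).

    model: differentiable classifier returning logits
    x, y:  clean input batch (values in [0, 1]) and true labels
    eps:   L-infinity perturbation budget
    """
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)   # increase loss on the true label
    loss.backward()
    x_adv = x_adv + eps * x_adv.grad.sign()   # one signed-gradient step
    return x_adv.clamp(0.0, 1.0).detach()     # stay in the valid input range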

2) Attribute-Based Attacks [11]

This attack type is categorized as:

  • Knowledge [46]: This attribute indicates how much knowledge an attacker has about a target system, regarding how the target system’s model works and what datasets are used to train it. This knowledge-based attack has the following three subtypes:

    • White-box attack assumes that an attacker has knowledge of the model and data.

    • Gray-box attack assumes that an attacker only knows the model.

    • Black-box attack assumes that an attacker does not have knowledge of the model and data.

  • Perturbation methods [10]: This attribute describes how the attack perturbation is implemented:

    • One-time adversarial example uses a single data point or one specific image as an adversarial example to mislead a target model into misclassification.

    • Universal adversarial perturbation is applicable to multiple samples across the dataset to mislead a target model into misclassification.

    • Noise is random or Gaussian noise an attacker adds to data points to mislead a target model into misclassification.

  • Time [47]: An attack can occur at various points in the model training pipeline. This attribute is tied to the ‘knowledge’ attribute and to how much control the attacker possesses. It corresponds to the attacks discussed in Section II-B1, where the DPA is a training-time attack and the EVA is a testing-time attack.

  • Specificity [46]: This is a binary attribute about whether an attack is targeted or non-targeted based on an attacker’s aim where each attack type is already explained under the DPA in Section II-B1.

  • Trigger appearance [11] describes where the trigger will be placed in the dataset and how it interacts with the other benign images with the two types of triggers as follows:

    • Semantic trigger is part of the semantic content of benign images (e.g., a particular object or attribute), so the attacker does not need to modify the dataset to embed it.

    • Non-semantic trigger is completely separate from the benign images such that an attacker must modify the dataset to activate a backdoor.

  • Attack space [11]: This attribute describes the medium in which an attack is applied, which can be:

    • Digital image undergoes manipulation through either pixel adjustment or discoloration.

    • Physical object is introduced to deceive a target model, such as glasses or a post-it note, in the case of facial recognition.

  • Trigger visibility [11]: This attribute describes how perceptible a trigger placed on a target image is; early backdoor attacks used visible triggers, while more recent ones conceal them. The two types of triggers are:

    • Visible trigger can be easily discerned by the human eye and has been considered in early examples of backdoor attacks.

    • Invisible trigger is a backdoor attack to avoid detection by concealing the trigger’s existence.

  • Trigger placement [11]: This attribute indicates how the trigger is placed onto the image and gives two types of backdoor attacks:

    • Hidden backdoor trigger can be embedded in an image and affect multiple images in the dataset.

    • Clean label backdoor trigger can be embedded into images an attacker aims to perturb while the images are still consistent with their labels but difficult to classify, leading the model to rely on the backdoor trigger.

  • Backdoor attack surfaces [8]: This attribute categorizes backdoor attacks based on how the backdoor is injected and is used to classify backdoor attacks by:

    • Code poisoning employs the vulnerabilities found in public DL frameworks and exploits them to compromise the ML models built upon them.

    • Outsourcing plants a backdoor during the training phase, with the attacker masquerading as a benign Machine Learning as a Service (MLaaS) provider.

    • Collaborative learning is used to perform attacks where the learning is distributed, such as FL and split learning. This attack abuses the trust factor by compromising one of the participants in the model. We discuss this attack more in Section III.

    • Pretrained attack occurs when an infected parent/teacher model containing a backdoor is adopted. TL is a common method to train a student model, detailed in Section IV.

    • Data collection involves releasing an infected dataset containing the necessary trigger (a simplified trigger-poisoning sketch is given after this list).

    • Post-deployment attack assumes the attacker can access the model after training. It can manipulate the model weights or outright replace them with its own backdoored versions.
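To illustrate the data collection and outsourcing surfaces above, the following is a simplified NumPy sketch that stamps a small visible trigger patch on a fraction of training images and flips their labels to an attacker-chosen class; the poisoning rate, patch size, and array layout are illustrative assumptions, not parameters from the surveyed attacks:

import numpy as np

def poison_with_trigger(images, labels, target_label, rate=0.05, patch=3, seed=0):
    """Stamp a white square in the bottom-right corner of a random subset of
    images and relabel them to the attacker's target class (targeted backdoor).

    images: (N, H, W, C) float array in [0, 1]; labels: (N,) integer labels.
    """
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), int(rate * len(images)), replace=False)
    images[idx, -patch:, -patch:, :] = 1.0       # visible trigger patch
    labels[idx] = target_label                   # attacker-chosen label
    return images, labels, idx                   # idx: which samples were poisoned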

3) Model Inversion Attacks [48]

This attack aims to recover the private datasets used to train the model by exploiting confidence information instead of impacting model behavior. Many classifiers output, for each class, a confidence score (i.e., a class probability) reflecting the likelihood that a given feature vector belongs to that class. Attackers use this confidence information to infer sensitive information about the private datasets used to train the model. Inversion algorithms can reconstruct complete target feature vectors, including the sensitive features, using a maximum a posteriori (MAP) estimator to identify the feature values that maximize the probability of having observed the known values.

4) Code Poisoning Attack [49]

This attack targets ML systems built on well-known frameworks. These frameworks are often built using third-party libraries that are not well tested or audited, and attackers can exploit such vulnerabilities to launch different attacks, including backdoor attacks. This is not a targeted attack because the attacker cannot access the training data; however, it impacts the model’s behavior and accuracy.

5) Model Extraction Attack

This attack aims to extract knowledge from a target model by training a substitute model by either understanding the hints of the target model or collecting the input and output pairs [54], [55].

6) Outsourcing Attack [40]

A user outsources model training to a third party: the user defines the model architecture and provides the training data. The attacker might be this third party with access to the training data. Hence, it can insert triggers into the training samples and train the model on the poisoned dataset to launch the attack when the model is used. This is a targeted attack leveraging the attacker’s full control over the data.

7) Pretrained Attack [44]

An attacker exploits the fact that a user has limited resources and therefore uses a small public dataset and/or a pre-trained model developed by a third party. Exploiting this limitation, the attacker can modify the data and the model to launch its attack.

8) Data Collection [51]

This attack is performed by poisoning the dataset, which will cause the model to be backdoored. The attacker does not have access to the training process.

9) Post Deployment [53]

An attacker accesses the model to launch attacks in the post-deployment phase. This attack affects the model’s behavior after deployment by accessing the memory locations storing the model and altering them.

10) 3D Attacks

Zhang et al. [56] addressed the susceptibility of 3D deep neural networks, specifically those dealing with point cloud data. While adversarial examples are well studied for 2D images, their impact on 3D data, such as point clouds, has been neglected, although it is crucial for safety-focused applications such as autonomous driving. The authors developed adversarial point clouds to deceive PointNet, a widely used neural network for point cloud processing. Their methods involve subtly altering existing points (i.e., point perturbation) or generating new points (i.e., point generation) to undermine the model’s accuracy. They also devised six specialized metrics to measure perturbations in point cloud attacks. Their experimental results on the ModelNet40 3D shape classification dataset demonstrated the effectiveness of the proposed attack strategies, achieving success rates above 99% for all targeted attacks.

Zhang et al. [57] discussed the vulnerability of 3D deep learning models to adversarial attacks, similar to their 2D counterparts. Most advanced 3D adversarial attacks modify 3D point clouds, but converting these attacks to physical scenarios by reconstructing them into meshes significantly reduces their effectiveness. The authors proposed a strong 3D adversarial attack, called a mesh attack, that directly perturbs the mesh of a 3D object, and they leveraged a differentiable sampling module that transfers gradients from point clouds to meshes to improve the attack. The proposed mesh attack outperformed state-of-the-art 3D attacks by a large margin under various defense mechanisms.

Zhang et al. [58] discussed the vulnerabilities of DNNs to adversarial attacks in the image domain and highlighted the increasing interest in 3D adversarial attacks, particularly on point clouds. Existing techniques for generating adversarial point clouds do not show strong transferability and can be easily defended. To address this issue, the authors proposed a robust point cloud attack that focuses on the low-frequency components of point clouds, combining losses from both the point cloud and its low-frequency components to create adversarial samples. The proposed attack significantly enhanced transferability compared to other state-of-the-art attacks and proved more resilient against advanced 3D defense methods.

Figure 3 summarizes two different methods of categorizing attack types based on adversarial examples and attribute-based attacks, respectively. Table 2 summarizes the key components of attacks in DNNs. Noticeably, the most common attack goal is for the model to misclassify labels in DNN models where different attack methods impact the performance of the DNN models.

FIGURE 3. Classifications of attack types: (a) adversarial examples under different attack times; and (b) attribute-based attacks in DNNs, categorized into general adversarial attacks and backdoor attacks.

C. Defenses in DNNs

In DNNs, defense mechanisms are classified by:

  • Modification-based defenses [5]: This defense type counters attacks by modifying the training data or inputs during the learning or testing phase, or by modifying the network itself, which requires adding layers or changing loss or activation functions.

  • Defenses against adversarial examples (AEs) [10]: This defense type can combat AEs by (1) gradient masking (or obfuscation), which introduces perturbations to the models to confuse attackers; (2) robust optimization, which uses additional model training approaches (e.g., adversarial training, certified defenses, or regularization); and (3) detection in the input dataset, which introduces auxiliary models, statistical methods, or consistency checking.

  • Defenses against backdoor attacks [8]: (1) blind backdoor removal without checking if the network is backdoored; (2) offline data inspection assuming that the poisoned data are available during the inspection; (3) offline model inspection assuming no access to the poisoned data; (4) online input inspection performing anomaly detection to check if the input has a trigger; and (5) online model inspection using anomaly detection techniques for backdoor attacks.

Here we discuss defenses against backdoor attacks in DNNs [8] as follows.

1) Blind Backdoor Removal (BBR) [8]

This defense aims to reduce the effect of backdoored inputs while sustaining the clean data accuracy (CDA) on clean inputs. BBR has been performed by the following approaches:

  • Finetuning & Pruning [59]: When backdoor attacks are successfully performed on DNNs, the DNNs learn from the backdoored data and form “backdoor neurons.” Pruning iteratively removes neurons that stay dormant on clean inputs from a validation dataset. This method combines pruning and finetuning to constrain the number of neurons attackers can backdoor and to ensure the robustness of outsourced training (a minimal pruning sketch is given after this list).

  • Februus [60]: When Trojan attackers implant a backdoor in a DNN, some information is leaked through a detectable “bias.” Februus is designed to identify these biased (Trojan) regions, remove them, and restore the clean inputs so as not to hurt CDA.

  • Suppression [61]: This method minimizes the effect of backdoors in DNNs by neutralizing the triggers. Randomly generated triggers can compute a wrapper based on the given datasets. The generated triggers are deployed to reduce the effect of any other triggers in the datasets. It is practical in real-world systems that have no backdoor detection mechanism.

  • ConFoc [62]: This defense restricts the models to focus only on the content of inputs. This method assumes the defenders can access a certain amount of clean data in the same distribution as the datasets used in the training stage. The model generated from the datasets can reduce the effect of backdoored triggers in the Trojan model.

  • Robustness Against Backdoor (RAB) [63]: This provides the first robust training process against backdoor attacks, especially data poisoning attacks. For classification tasks, RAB creates a smoothed classifier by adding Gaussian noise to the original classifier during testing.
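As referenced in the Finetuning & Pruning item above, the sketch below shows the pruning half of that idea in PyTorch: estimate per-channel activations of a convolutional layer on clean validation data and zero out the least-active channels, which are the likeliest backdoor neurons. The layer choice, pruning fraction, and data loader are illustrative assumptions, and the finetuning step is omitted:

import torch

@torch.no_grad()
def prune_dormant_channels(model, conv_layer, clean_loader, frac=0.2, device="cpu"):
    """Zero out the `frac` least-activated output channels of `conv_layer`,
    estimated from clean validation batches (the pruning step of fine-pruning)."""
    acts = []
    hook = conv_layer.register_forward_hook(
        lambda mod, inp, out: acts.append(out.mean(dim=(0, 2, 3)).cpu()))
    model.eval()
    for x, _ in clean_loader:                    # forward passes on clean data
        model(x.to(device))
    hook.remove()
    mean_act = torch.stack(acts).mean(dim=0)     # average activation per channel
    dormant = mean_act.argsort()[: int(frac * mean_act.numel())]
    conv_layer.weight[dormant] = 0.0             # prune candidate backdoor neurons
    if conv_layer.bias is not None:
        conv_layer.bias[dormant] = 0.0
    return dormant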

2) Offline Data Inspection (OFDI)

This inspection assumes that poisoned data are available during the inspection process [8]. This approach performs defense operations on the input data and labels to detect triggers by using the following techniques:

  • Activation Clustering (AC) [64]: This method detects poisoned data in the training inputs by clustering their activations. To repair a compromised model, the detected poisoned data are removed or relabeled with their source classes. The effectiveness and robustness of the AC method were validated via simulations with multimodal classes and complicated poisoned inputs (a simplified clustering sketch is given after this list).

  • Spectral Signature [65]: When neural networks learn feature representations, poisoned inputs leave a detectable mark called a spectral signature. These signatures allow a defender to identify and eliminate poisoned inputs. The rationale is that the signals in the learned representation are critical to the classifier’s performance: when backdoor attackers aim to mislead the classification results, they must strengthen these signals in the representations to make the classification deviate from the correct outcomes.

  • Gradient Clustering [66]: This approach assumes that a defender can access poisoned datasets only during data collection phases. Gradient clustering extracts and distinguishes the poisoned signals from the input gradient, which has high similarity to those signals from whole inputs.

  • Deep k-NN [67]: This defense can be applied to clean-label poisoning attacks to detect and remove poisoned inputs without impacting model performance. Like k-nearest neighbors, deep k-NN picks the k neighbors around a target input and checks whether they share its class label; otherwise, the input is removed from the training samples. Deep k-NN with an optimal k can perform as well as baseline data poisoning defenses.

  • SCAn (Statistical Contamination Analyzer) [68]: The targeted contamination attack (TaCT) is a data poisoning attack that poisons samples of specific classes without touching data of other classes, which is hard to address with general defenses against source-specific backdoor attacks. SCAn therefore analyzes distributions to detect inconsistencies in the representation space of the backdoored model and identify the poisoned class in the inputs.

  • Differential Privacy (DP) [69]: In DP, a shared dataset can be described by the patterns of groups without disclosing information about individual datasets. Existing outlier detection algorithms can enhance their performance using DP for anomaly detection in training samples [69].
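As referenced in the Activation Clustering item above, the following is a much-simplified sketch of the 2-means clustering step of AC [64]: for each class, reduce the penultimate-layer activations with PCA, split them into two clusters, and flag the smaller cluster as likely poisoned. The component count and the assumption that defenders can extract these activations are illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def flag_suspicious_samples(activations, labels, n_components=10):
    """activations: (N, D) penultimate-layer features; labels: (N,) class ids.
    Returns a boolean mask marking samples in the smaller per-class cluster."""
    suspicious = np.zeros(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        dims = min(n_components, len(idx) - 1, activations.shape[1])
        feats = PCA(n_components=dims).fit_transform(activations[idx])
        assign = KMeans(n_clusters=2, n_init=10).fit_predict(feats)
        minority = np.argmin(np.bincount(assign))      # poisoned samples tend to
        suspicious[idx[assign == minority]] = True     # form the smaller cluster
    return suspicious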

3) Offline Model Inspection (OFMI)

This inspection assumes no access to the poisoned data. OFMI can be achieved by the following techniques [8]:

  • Trigger Reverse Engineering [70]: Attackers aim to mislead DNNs into inaccurate classification results by inserting triggers. Without directly analyzing infected inputs, this defense aims to detect such backdoor triggers: it determines the minimum trigger requirements for a trigger to succeed and identifies the neurons infected by the trigger. It can then remove compromised neurons or weights without modifying the learning results on clean inputs.

  • NeuronInspect [71]: This defense aims to detect anomalies or outliers in the output explanation of a neural network (NN). Assuming no knowledge of backdoor samples and no requirement to restore the triggers, this defense uses a heatmap of the output layers of the NN. Once the heatmap is generated from the output, the triggers in the map are expected to be the least sparse, most smooth, and persistent.

  • DeepInspect (DI) [72]: DI detects backdoor attacks in three stages: model inversion, trigger generation, and anomaly detection. Since it is impractical for defenders to have the original clean inputs of models, DI first recovers a substitute training dataset via model inversion [48]. This dataset is used to train a conditional generator that computes a set of triggers for each output class, and anomalies and outliers among them are then detected as evidence of a backdoor.

  • AEGIS [73]: This defense, named after the shield of the Greek god Zeus, has two steps: robust model training and anomaly detection in the model. For robust training, a robust optimization process is performed for each input x within a trust region so that the output matches the true value of the associated label y . A trust region is a region around the current iterate within which the model is trusted to be an accurate representation of the objective function [74]. Soremekun et al. [73] showed that if the model’s inputs follow a mixture of distributions, the corresponding outputs (i.e., class labels) also follow a mixture of distributions. The detection process builds on this observation using feature clustering: when the feature representations of inputs are clustered, the clusters of clean and poisoned datasets can be distinguished.

  • Meta Classifier [75]: A jumbo learning method is proposed to train many shadow models covering different kinds of Trojan attacks. During meta-training, a meta-classifier, implemented as a fully connected NN, is learned on the feature representation vectors generated from the shadow models to minimize the proposed loss function. The meta-classifier’s output indicates whether a model contains a backdoor.

  • Universal Litmus Patterns [76]: This defense requires neither knowledge of the training datasets nor analysis of clean datasets. The key idea is to take a set of models, some backdoored and some not, and learn a classifier over them that detects whether a given neural network is backdoored using Universal Litmus Patterns (ULPs). ULPs distinguish models trained on poisoned datasets from those trained on clean datasets to detect backdoor attacks.

4) Online Input Inspection (OII)

This method is to detect triggers in inputs and reject any adversary data detected [8].

  • STRIP (STRong Intentional Perturbation) [77]: This defense turns the Trojan attack’s own strategy against it by intentionally perturbing incoming inputs to identify those carrying triggers: under such perturbation, poisoned inputs yield lower prediction entropy than clean inputs. This allows OII to effectively detect and reject infected inputs with no constraint on trigger size (an entropy-based sketch is given after this list).

  • SentiNet [78]: This method detects backdoor attacks in NNs by taking the following approaches:

    • Adversarial object location is performed to determine the input regions that most influence prediction outcomes, which represent the most likely poisoned regions.

    • Adversarial regions are detected in these locations. SentiNet applies these regions to a clean dataset and analyzes how much of the data is affected: a larger amount of affected data means a higher probability that a region is adversarial.

  • Neo [79]: This defense performs a backdoor attack detection and mitigation agnostic to what machine learning models are used for solving image classification tasks. Neo always assumes backdoored triggers exist at the same but unknown locations on the poisoned data, such as a certain corner of an image. Leveraging this knowledge, Neo applies trigger blocks to detect the backdoor on the given datasets, determines potential inputs backdoored, and blocks the backdoor triggers on them.
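As referenced in the STRIP item above, the sketch below illustrates its core entropy test: blend a suspect input with randomly drawn clean images and measure the entropy of the model’s predictions, since trigger-carrying inputs keep being pushed toward the target class and therefore show abnormally low entropy. The blend ratio, number of overlays, and threshold calibration are illustrative assumptions:

import torch
import torch.nn.functional as F

@torch.no_grad()
def strip_entropy(model, x, clean_images, n_overlays=32, alpha=0.5):
    """Average prediction entropy of input `x` (C, H, W) superimposed with
    random clean images (M, C, H, W); low entropy suggests a trigger."""
    idx = torch.randint(0, clean_images.shape[0], (n_overlays,))
    blended = alpha * x.unsqueeze(0) + (1 - alpha) * clean_images[idx]
    probs = F.softmax(model(blended), dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
    return entropy.mean().item()

# Usage sketch: reject an input whose entropy falls below a threshold
# calibrated on clean data, e.g.
#   if strip_entropy(model, x, clean_batch) < threshold: reject(x)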

5) Online Model Inspection (OMI)

This defense method detects the misbehavior of target models caused by backdoor attacks [8].

  • Artificial Brain Stimulation (ABS) [80]: This defense uses a neuron stimulation function (NSF) to identify potentially infected neurons: a compromised neuron is driven to an activation value that elevates the attack’s effect while the activation values of other, healthy neurons are kept unchanged.

  • Nic [81]: This defense examines how an input’s content activates neurons in NNs. It uses the activated value distribution (AVD), i.e., the distribution of activated neurons, and assumes that the AVDs of poisoned inputs differ from those of clean inputs, which can be exploited to detect backdoor attacks. The AVD of a given layer can then be extracted and used to train a model with adversarial inputs to predict whether given inputs are poisoned.

6) 3D Defenses

Xiang et al. [82] were motivated to study the impact of adversarial point clouds on current deep 3D models. They proposed a set of novel approaches, adversarial point perturbation and generation, to create adversarial point clouds against PointNet, a well-known DNN for point cloud processing. They validated their approaches using the ModelNet40 3D shape classification dataset, demonstrating an attack success rate higher than 99% for all targeted attacks. Liu et al. [83] addressed challenges in 3D point cloud analysis, such as limited dataset size and network generalization issues, by developing a data augmentation technique called PointCutMix. This method optimally matches points between two point clouds and generates new training data by replacing points in one sample with their matched counterparts. The authors proposed two strategies: one randomly replaces all points, while the other replaces the k nearest neighbors of a randomly chosen point. Both strategies consistently enhanced the performance of diverse models in point cloud classification tasks. PointCutMix also boosted model robustness against point-based attacks, notably surpassing state-of-the-art defense algorithms.
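The snippet below is a much-simplified sketch of the k-nearest-neighbor variant of PointCutMix described above; it skips the optimal point matching of the original method and simply swaps the local neighborhood of a random anchor in one cloud for randomly drawn points from another. Array shapes, the replacement ratio, and the mixing weight are illustrative assumptions:

import numpy as np

def pointcutmix_knn(cloud_a, cloud_b, ratio=0.3, rng=None):
    """Replace the points of cloud_a nearest to a random anchor with points
    drawn from cloud_b; returns the mixed cloud and a label-mixing weight.

    cloud_a, cloud_b: (N, 3) arrays of xyz coordinates.
    """
    rng = rng or np.random.default_rng()
    n_replace = int(ratio * len(cloud_a))
    anchor = cloud_a[rng.integers(len(cloud_a))]
    nearest = np.argsort(np.linalg.norm(cloud_a - anchor, axis=1))[:n_replace]
    donors = cloud_b[rng.choice(len(cloud_b), n_replace, replace=False)]
    mixed = cloud_a.copy()
    mixed[nearest] = donors                     # cut a local region, mix in cloud_b
    lam = 1.0 - n_replace / len(cloud_a)        # weight for label interpolation
    return mixed, lam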

Table 3 summarizes the key components of the defense techniques in DNNs. These techniques mainly target backdoor detection and removal, with both offline and online methods for identifying and eliminating backdoored triggers in the dataset.

D. Discussions: Insights and Limitations of Existing Attacks and Defenses in DNNs

1) Discussions on Existing Attacks in DNNs

The existing attacks in DNNs are robust and challenging for defenders under different attack surfaces. Some attacks, such as code poisoning, can even be performed without prior knowledge of or data from the target model and still achieve a high attack success rate (ASR). However, not all existing attacks are powerful enough to affect different types of DNN applications. For example, a Trojan attack becomes ineffective under transfer learning because the attacker has no access to the label classes of the new model. In addition, to perform data collection attacks, attackers need to learn more about the target model’s architecture, which is not always available in real-world scenarios.

2) Discussions on Existing Defenses in DNNs

Despite the diverse defense mechanisms developed in DNNs, we found some limitations in their capabilities under different circumstances and classes. First, in Blind Backdoor Removal (BBR), one of the promising techniques is to remove the backdoor effects from the model by utilizing the content of inputs, such as ConFoc. To realize this, it must be assumed that the triggers do not overlap the content; however, this assumption restricts the defense’s applicability to some attack surfaces. Second, in offline model inspection (OFMI), defense efficiency is a major issue: the computational cost increases in proportion to the number of classification classes and the trigger size. Moreover, the effectiveness of backdoor removal depends on the consistency between triggers; it degrades when the reversed trigger is not aligned with the original one. Third, another limitation of OFMI is latency. When applied to time-sensitive applications, the defense time should be carefully considered in designing such defense methods to best mitigate the damage introduced by adversarial attacks. Lastly, most backdoor defenses rely on post-backdoor removal, which has a domain limitation: techniques such as NeuralCleanse [70] can only be applied to the image domain, not to the text or audio domains.

SECTION III.

Adversarial Attacks and Defenses in Federated Learning

This section discusses the basic background of the FL technologies and principles, the existing attacks and defenses in FL, and their insights and limitations.

A. Background in Federated Learning

FL [84] has become one of the popular distributed machine learning (ML) approaches, with applications in health research [85] and consumer products already deployed [86], [87]. Motivated by growing data privacy concerns across consumers and international regulators, FL is a protocol for fitting arbitrary ML models without direct access to training data [1], [17], [88].

In the standard FL approach, a central server communicates with various client devices in a hub-and-spoke manner. As described in Figure 4, a single FL training cycle consists of the following key steps: ① the server distributes the global model to the devices (the server randomly initializes the model state at the beginning of the training process); ② each client generates a Stochastic Gradient Descent (SGD) update by minimizing a loss function on its local dataset; ③ each client sends its updated local model parameters (its SGD update, also called the updated local model) to the server; ④ the server aggregates the SGD vectors; and ⑤ the server updates the global model and sends the updated global model parameters to the clients. Training continues for several rounds until the model either achieves the desired performance level (e.g., a specific global model prediction accuracy) or the pre-defined number of training rounds is completed [1], [84].

FIGURE 4. Key steps of the FL training cycle.

In an FL-based system, let f(x; \boldsymbol {w})=\hat {y} be a generic ML model with weights \boldsymbol {w} that predicts output \hat {y} given input sample x . Let \ell (\hat {y}, y) be a loss function between the prediction \hat {y} and ground truth label y . Let \mathcal {D}= \bigcup _{i=1}^{n} {\mathcal {D}}^{i} \cup \mathcal {D}^{T} be the global dataset composed of the testing set \mathcal {D}^{T} and each of the n nodes’ local datasets {\mathcal {D}}^{i}, \forall i\in [1, n] . The distributed FL protocol’s goal is then:\begin{equation*} \min _{ \boldsymbol {w}^{G}_{T}}\ell (f(x; \boldsymbol {w}^{G}_{T}), y),\quad \forall (x, y)\in \mathcal {D}^{T}, \tag{2}\end{equation*} where \boldsymbol {w}^{G}_{T} refers to the vector of converged global model weights at the end of training time T [84]. The system converges to \boldsymbol {w}_{T}^{G} over training rounds t\in [1,T] , where the global model weights at time (t+1) , denoted by \boldsymbol {w}^{G}_{t+1} , are given by:\begin{equation*} \boldsymbol {w}^{G}_{t+1} = \mathcal {A}(\Delta \boldsymbol {w}_{t}^{i}) + \boldsymbol {w}_{t}^{G},\quad \forall i\in [1, n], \tag{3}\end{equation*} where \Delta \boldsymbol {w}^{i}_{t} is the SGD update proposed by node i at training step t , and \mathcal {A} is an aggregation function over the nodes’ SGD updates [84]. Each client i ’s SGD update, \Delta \boldsymbol {w}^{i} , is estimated by:\begin{equation*} \Delta \boldsymbol {w}_{t+1}^{i}= \boldsymbol {w}_{t}^{G}-\min _{ \boldsymbol {w}^{G}_{t}}\ell (f(x^{i}; \boldsymbol {w}^{G}_{t}), y^{i}),\quad \forall (x^{i},y^{i})\in {\mathcal {D}}^{i}. \tag{4}\end{equation*}
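A minimal sketch of one training round following Eqs. (2)-(4), using a simple weighted average as the aggregation function (FedAvg-style); the client-update callback and the representation of model weights as flat NumPy arrays are assumptions for illustration:

import numpy as np

def aggregate(updates, sizes=None):
    """Aggregation of Eq. (3): (weighted) average of the clients' updates."""
    sizes = np.ones(len(updates)) if sizes is None else np.asarray(sizes, float)
    weights = sizes / sizes.sum()
    return sum(w * u for w, u in zip(weights, updates))

def fl_round(global_w, client_datasets, local_update_fn):
    """One FL cycle (steps 1-5 in Figure 4): broadcast the global model,
    collect each client's SGD update, aggregate, and update per Eq. (3)."""
    updates = [local_update_fn(global_w.copy(), data)   # steps 2-3
               for data in client_datasets]
    return global_w + aggregate(updates)                # steps 4-5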

B. Attacks in Federated Learning

In FL, adversarial attackers aim to achieve one of two objectives: (1) Undermine or disrupt the normal process of the learning task; and/or (2) Infer or recreate part of a client’s local dataset without direct access to it. We address each of them below.

For the convenience of our discussions, let the global model’s weights at time t be \boldsymbol {w}_{t} , a node’s proposed update be \delta ^{i}_{t} , and the collection of all benign nodes’ samples be \mathcal {D}_{G} = \{(x, y) \mid (x, y) \in \mathcal {D}_{{\mathcal {N}}_{i}},\,\,\forall i \} , where x is a sample and y is its true label. Let the adversary’s data be \mathcal {D}_{A}=\{(x^{A}, y^{A})\}_{|\mathcal {D}_{A}|} , and the model’s test set be \mathcal {D}_{T} .

1) Disruption of Normal Protocol Operations

This attack type attempts to undermine or disrupt the expected FL learning task by taking the following attacks:

  1. Byzantine attacks: This attack type entails one or more coordinating clients attempting to disrupt the model training process as envisioned by the FL system designer. These clients may undermine the model’s accuracy by submitting arbitrary matrices as SGD updates. This attack type is usually non-targeted as the clients aim to downgrade performance across all model tasks. In the worst case, adversarial clients showing this behavior can cause model divergence [89]. The clients may maximize each training cycle’s runtime by exploiting implementation-dependent variables, such as permitted latency between communication rounds.

  2. Backdoor attacks: This attack type involves deliberately submitting precisely perturbed model updates to the central party to cause the global model to wrongly classify samples \{(x, y) \mid y\in Y^{\ast}\} , where Y^{\ast} is the set of classes the adversary targets. Note that the adversary typically only targets a single class when introducing a backdoor, so Y^{\ast} is typically a singleton set. As opposed to a Byzantine attack, the adversary’s goal includes model convergence; therefore, their poisonous perturbations must maintain the backdoor, avoid detection, and permit the model to still perform well on the test set [49], [52], [90]. Let 1[b] be an indicator function on the condition b . The adversary’s objective function can be represented by:\begin{align*} &\max _{\delta ^{A}_{t}}\sum _{i=1}^{|\mathcal {D}_{T}|}1\left [{f(\boldsymbol {w}_{T}, x_{i}) = y^{A}_{i} \mid y_{i} \in Y^{\ast} }\right] \\ &\qquad +\, 1\left [{f(\boldsymbol {w}_{T}, x_{i}) = y_{i} \mid y_{i} \notin Y^{\ast} }\right]. \tag{5}\end{align*} Note that this formulation captures neither defense mechanisms deployed by the FL protocol designers nor the adversary’s complementary attempts to circumvent such defenses.

2) Extraction of Information

This attack may try to infer or recreate part of an FL system’s construction without direct access to it. This may be data from a given client, or a trade secret pertaining to the architecture of the ML model. This attack type can be realized by the following attacks:

  1. Honest but curious (HBC) attacks: HBC participants cooperate with the FL protocol while trying to learn the structure of clients’ local datasets. HBC participants can easily infer features (i.e., a feature inference attack) when receiving the plaintext version of either a client’s SGD model update [91], [92], [93] or even the global model after a training iteration [1], [94]. HBC attacks can also be performed by the central server in most FL system designs and by individual clients in some FL systems. Prediction-phase feature inference [95], [96] and label inference attacks [97] are also possible in Vertical FL (VFL), where confidence scores and shared gradients, respectively, are used to reconstruct the features of other clients.

  2. Model stealing attacks: This attack type aims to derive knowledge about the model’s architecture and design from the executable sent to the clients. While research papers [98], [99], [100] detailing the benefits of certain model architectures are published and well-known among academics and engineers, some details or techniques may be excluded for business purposes. Sophisticated clients can infer (or backward-engineer) these details with local, black-box access to the global model. While the novelty of architecture extraction is implementation-specific, many ML engineers will want the design architecture of the global model to remain proprietary and protected.

3) Collaborative Attack

This attack is performed in FL, where clients use a global model without access to other users’ data. The attacker can be one of the clients, i.e., a compromised client, aiming to infect the global model at the server [52].

Table 4 summarizes adversarial attacks unique to FL and their key components in terms of attack goal, method, type, defenses in place, and datasets used for attack evaluation. All FL models used classification tasks as their application problem. The main attack goal was data reconstruction or targeted misclassification. In addition, model or data poisoning, membership, and property inference were observed as the main attack methods. The threat model mainly considered the HBC model where each client can be honest but curious, but not necessarily malicious. Further, no prior work considered a malicious central server.

C. Defenses in Federated Learning

As the use of FL becomes more popular, a rich volume of defense mechanisms in FL has been developed to counter or entirely prevent adversarial attacks. We discuss common tools developed to provide greater privacy to client data as follows.

1) Differential Privacy (DP)

DP introduces an approach to quantifying and limiting the information disclosure present in any released data [101]. DP is parameterized by (\epsilon, \delta) , where lower values for each correspond to higher privacy guarantees. Formally, a randomized function f is (\epsilon, \delta) -differentially private if, for all output states S\subseteq \text {Range}(f) and adjacent datasets \mathcal {D} and \mathcal {D}' :\begin{equation*} P(f(\mathcal {D})\in S)\leq e^{\epsilon} P(f(\mathcal {D}')\in S)+\delta, \tag{6}\end{equation*} where \mathcal {D} and \mathcal {D}' differ by some semantically atomic amount. Traditionally, this is a single sample [102], although in FL it usually means differing by one client’s data [1], [103].

DP is traditionally implemented through random masks, where all dimensions of a data vector are slightly perturbed to obscure its original values. In FL, this random mask is applied to the client’s SGD update before it is returned to the server. Several methods of creating and applying the mask have been proposed, such as local, central, shuffled, aggregated, and hybrid DP.
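As a minimal sketch of the local (client-side) variant, the update below is clipped and then perturbed with Gaussian noise before being sent to the server; the clipping norm and noise multiplier are hypothetical parameters that would have to be calibrated to a target (\epsilon, \delta) in practice.

import numpy as np

def dp_mask_update(delta_w, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    # Clip the client's SGD update to bound its sensitivity, then add
    # Gaussian noise so that the released update satisfies DP.
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(delta_w)
    clipped = delta_w * min(1.0, clip_norm / (norm + 1e-12))
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=delta_w.shape)
    return clipped + noise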

2) Secure Multi-Party Computation (SMPC)

A collection of parties leverages cryptography to simulate the presence of a trusted third party that calculates the result of a function on inputs provided by the parties. This is accomplished without revealing the inputs to all parties. The SMPC protocol can also reveal the result of the function to only a subset of the involved parties. SMPC is highly useful in adversarial environments, and FL protocols have successfully adopted it to prevent aggregating servers or the necessary edge computing environments from observing client data [16], [104], [105].
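The following toy sketch illustrates the idea behind SMPC-style secure aggregation using additive secret sharing: each client splits its update into random shares so that no single aggregator sees a raw update, yet the shares sum back to the true total. This is an illustration only, not a production protocol; it omits dropouts, authentication, and arithmetic over a finite field.

import numpy as np

def share_update(update, n_parties, rng):
    # Split an update into n additive shares that sum back to the update.
    shares = [rng.normal(size=update.shape) for _ in range(n_parties - 1)]
    shares.append(update - sum(shares))  # the last share fixes the total
    return shares

def secure_sum(all_client_shares):
    # Each aggregator sums one share per client; combining the partial sums
    # yields the total update without exposing any individual update.
    partial_sums = [sum(shares) for shares in zip(*all_client_shares)]
    return sum(partial_sums)

rng = np.random.default_rng(0)
updates = [rng.normal(size=4) for _ in range(3)]              # three clients
shared = [share_update(u, n_parties=3, rng=rng) for u in updates]
assert np.allclose(secure_sum(shared), sum(updates))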

3) Trusted Execution Environments (TEEs)

TEEs permit the execution of arbitrary code on remote machines without trusting the machine’s operator or administrator. Specifically, TEEs ensure the confidentiality, integrity, and authenticity of executed code by limiting the permissions of any party that can interact with the hardware. In secure execution environments, confidentiality means that no code or data can be disclosed to unauthorized applications, integrity means that runtime states cannot be manipulated, and authenticity means that no code can be modified during execution [106]. As an extra precaution, TEEs are implemented broadly across most applied FL use cases today [86], [87], [103].

4) Homomorphic Encryption (HE)

HE enables the computation of a function on unknown inputs by permitting mathematical operations on ciphertexts to mirror the corresponding operations on the plaintext. Given an operation f , an encryption scheme with encryption function e and decryption function d , plaintext p , and ciphertext c = e(p) , HE allows d\left ({f(c)}\right) = f(p) , where f(c) denotes the homomorphic evaluation of f on the ciphertext. While simple operations require minimal overhead, arbitrarily complex functions may be calculated at significant computational cost [107].
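As a toy illustration of the additive variant of this property, the sketch below implements a miniature Paillier cryptosystem with deliberately tiny primes (insecure, for exposition only): multiplying two ciphertexts decrypts to the sum of the underlying plaintexts, which is exactly the operation needed to aggregate encrypted FL updates.

import math
import random

# Toy Paillier keys with tiny primes -- for illustration only, NOT secure.
p, q = 61, 53
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n2)), -1, n)  # precomputed decryption constant

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return (L(pow(c, lam, n2)) * mu) % n

c1, c2 = encrypt(7), encrypt(35)
# Multiplying ciphertexts corresponds to adding the underlying plaintexts.
assert decrypt((c1 * c2) % n2) == 42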

5) Verifiable Computation

When providing the result of a computation, one party may produce a mathematical zero-knowledge (zk) proof to demonstrate they have executed the computation correctly. This is performed without revealing the data on which the computation was performed because of the nature of zk-proofs. While of great use to FL systems, the technique is most commonly leveraged when a low-resource client outsources computation to another party [16], [108]. While this technique is useful to ensure clients’ normal behavior, it is often infeasible to obligate individual users to provide zk-proofs at each training step due to the expense of their generation.

6) Gradient Filtering

One of the common strategies is filtering the gradient updates submitted to the central FL server. A simple solution from distributed ML is to replace the traditional Federated Averaging step [84] with simply taking the median of all received gradients [109], [110]. More sophisticated approaches have been proposed in the literature to further strengthen the system’s resilience to adversarial gradient submissions (a minimal sketch of median- and Krum-style aggregation follows this list):

  • Trust Bootstrapping [104]: This method involves the server maintaining its own local dataset and comparing its locally calculated SGD updates to those submitted by the clients. For each training round, the server calculates its own SGD update \Delta w^{0} . It then normalizes the magnitudes of all incoming client SGD updates such that \left |{\Delta w^{i}}\right | = \left |{\Delta w^{0}}\right | and filters them based on the degree to which their updates differ from \Delta w^{0} .

  • Krum Filtering [89], [111]: Intuitively, this approach chooses the SGD update that most resembles its nearby neighbors. For each submitted SGD update, the server computes the sum of distances to its k nearest neighboring updates and selects the update with the smallest score [89], [111]. This is extended to Multi-Krum filtering, where the server takes the k updates that are most similar to each other. While demonstrably effective, the O(n^{2}) computation is expensive and thus not often used in production FL systems.

  • Redundant Calculations [112]: Chen et al. [112] considered an adversarial threat from an FL-based system’s compute nodes, i.e., the compute clusters to which a central server offloads the bulk of gradient processing; compromised compute nodes can vitiate the global model with malicious updates. Chen et al. [112] proposed an approach to filtering out adversarial gradients by obligating each compute node to calculate an aggregate of multiple gradients, such that each individual gradient is aggregated by multiple compute nodes in a redundant fashion. This redundancy can filter out anomalous submissions from compromised nodes.
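The following is a minimal sketch, under simplified assumptions (dense NumPy update vectors, synchronous arrival), of the two simplest filters above: coordinate-wise median aggregation and a basic Krum selection.

import numpy as np

def median_aggregate(updates):
    # Coordinate-wise median of client updates; robust to a minority of
    # arbitrarily bad (Byzantine) submissions.
    return np.median(np.stack(updates), axis=0)

def krum_select(updates, n_byzantine):
    # Return the update whose summed squared distance to its closest
    # n - n_byzantine - 2 neighbors is smallest (basic Krum).
    stacked = np.stack(updates)
    n = len(stacked)
    k = n - n_byzantine - 2
    dists = np.linalg.norm(stacked[:, None, :] - stacked[None, :, :], axis=-1) ** 2
    scores = [np.sort(np.delete(dists[i], i))[:k].sum() for i in range(n)]
    return stacked[int(np.argmin(scores))]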

The technical sophistication of these approaches does not come without drawbacks. The additional computational overhead translates into further performance losses in the FL system’s overall runtime.

7) Blockchains

In an alternative approach to the aforementioned cryptographic and computational solutions, existing work leverages blockchains to tackle the grand FL challenge of protecting against a malicious server. Substituting the coordinating server with an append-only blockchain offers nuanced security trade-offs for the clients, their data, and the global model. Optimizing the use of blockchains in FL is an active area of research that focuses on minimizing the overhead cost of maintaining a distributed ledger. Nonetheless, many proposed designs have market fit and offer better security guarantees than more centralized alternatives [16], [113], [114], [115], [116].

8) Prediction Phase Defenses in VFL

Unlike Horizontal FL (HFL), VFL-based systems are also exposed to feature inference attacks in the prediction phase, which exploit the confidence scores shared by a learning coordinator (e.g., the central server) with the active clients. To combat these attacks, rounding the confidence scores [95], [96], adding noise to the scores [96], or purifying the scores with autoencoders [96], [117] have been proposed.

Table 5 summarizes the key characteristics of adversarial attacks for FL-based systems in terms of attack goal, method, type, defenses in place, and datasets used for validation. We found that attackers mainly aimed to perform dataset reconstruction and targeted misclassification attacks to perturb the classification performance. Their type is mainly targeted backdoor attacks or considered HBC clients, which can obtain other clients’ private information. Since all FL-based systems aim to solve classification problems, MNIST and CIFAR were the most common datasets used for validating the robustness of the proposed attacks.

D. Discussion: Insights and Limitations of Existing Attacks and Defenses in FL

Based on the limitations we found from this survey, we suggest the following future research directions:

  • Existing attacks mostly consider HBC clients and servers in their attack models. However, in real-world applications, damages introduced by malicious clients or a server are more serious. Hence, we should consider malicious attackers in FL settings to develop more security-hardened defenses.

  • Application settings that require the same set of samples with different features are becoming more common than ever, and in such settings VFL is more applicable than HFL. We should consider attacks and defenses based on the vulnerabilities identified in VFL-based systems.

  • Most existing attacks or defenses mainly aim to hurt or improve (or maintain) model performance in terms of prediction accuracy. However, in real-world contexts, the strategies available to both attackers and defenders are limited by their respective resources, so each side aims to act as efficiently as possible. For example, existing defenses fail to preserve clients’ privacy efficiently under strong data reconstruction attacks. Some encryption-based defenses, such as HE or SMPC, are effective but introduce high costs. Hence, we need to examine how to fine-tune the critical tradeoff between prediction performance and attack or defense efficiency.

  • Although many defenses focus on mitigating HBC clients, there is a lack of techniques for detecting them. Since the misbehavior of HBC clients is stealthy, they are hard to detect. Moreover, because mitigation is reactive, some damage from such attacks is unavoidable. Hence, it is critical to detect such attacks quickly before the damage escalates into more serious consequences.

SECTION IV.

Adversarial Attacks and Defenses in Transfer Learning

A. Background in Transfer Learning

Transfer Learning (TL) addresses the issues of data becoming outdated during training and of extended periods of computation by leveraging knowledge learned from previously trained models. For example, TL has been used to adapt a general image recognition model to facial recognition: a user takes a publicly available Teacher Model, removes its original classification layer, and replaces it with a facial recognition layer. We explain the workflow of TL in Figure 5.

FIGURE 5. Transfer learning workflow.

In TL, we have tasks deriving from a source domain and a target domain. We want to learn the target tasks such that the knowledge gained by learning the source tasks can be applied. For example, this happens when the source and target tasks are not the same but share the same labels [119]. Depending on the types of tasks a user aims to learn, the objective function may vary across the domains available to the user. Generally, the user aims to minimize the loss function using the weights from the Teacher Model. Given a source domain S with n^{S} training instances of features x and labels y , we can model the objective as:\begin{equation*} \min _{f}\frac {1}{n^{S}}\sum ^{n^{S}}_{i=1}\beta _{i} \mathcal {L}\bigg (f(x^{S}_{i}),y^{S}_{i}\bigg)+\Omega (f), \tag{7}\end{equation*} where \beta _{i} are the weights, \mathcal {L} is the loss function, f is the decision function that assigns the input x_{i} to output y_{i} (e.g., classification), and \Omega is the structural risk preventing the model from overfitting to the dataset and assessing the decision function f itself [2].

Other objective functions may incorporate kernel functions that map the data into a different space or apply additional constraints for specific requirements. However, most TL objectives ultimately amount to minimizing a loss function.
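As a minimal PyTorch-style sketch of the workflow in Figure 5, assuming torchvision is available and a hypothetical 10-class facial recognition task as the target, the Teacher's feature extractor is frozen and only a newly attached classification layer is trained to minimize the loss in Eq. (7):

import torch
import torch.nn as nn
from torchvision import models

# Publicly available Teacher Model used as the pretrained feature extractor.
teacher = models.resnet18(weights="IMAGENET1K_V1")

# Freeze the transferred layers so only the new head is trained.
for param in teacher.parameters():
    param.requires_grad = False

# Replace the original classification layer with one for the target task.
num_target_classes = 10  # hypothetical size of the facial recognition task
teacher.fc = nn.Linear(teacher.fc.in_features, num_target_classes)

optimizer = torch.optim.SGD(teacher.fc.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(x_batch, y_batch):
    # One minimization step of the loss over the new classification layer.
    optimizer.zero_grad()
    loss = criterion(teacher(x_batch), y_batch)
    loss.backward()
    optimizer.step()
    return loss.item()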

B. Attacks in Transfer Learning

Given the distributed nature of the public models, numerous potential adversarial attacks can exist to disrupt the normal operation of TL. This section describes the various subtypes classified under our attack attribute hierarchy. The core goal for each attack subtype is misclassifying task predictions without lowering the overall model accuracy. There are two ways for the subtypes to achieve this attack goal. First, the attacker can perform model reconfiguration such as modifying the classification layer of the infected model. Second, it can perform data manipulation by training the infected model on data samples containing manipulated samples.

1) Trojan Attack

This is the classic implementation, where the public model already has a backdoor in place and the attacker assumes that the user will employ it unmodified. This implementation is considered black-box because the attacker is assumed not to tailor this public model to a specific target [40], [120]. Liu et al. [120] showed a primary example of a Trojan attack, while Liu et al. [40] cover a wide range of this specific subset, including numeric-level attacks and binary-level attacks.

2) Model Reuse Attack

This attack focuses on the feature extractor of the model and assumes no knowledge of the dense layers [50]. However, the attacker needs to know the downstream task and a small dataset used by the user. Because the attacker has knowledge of the dataset, this implementation is considered a white-box attack.

3) Programmable Neural Network

This attack assumes that attackers have knowledge of the specific task being trained [121]. This is because the feature extractor of the infected model is trained along with a trigger generator. This creates a trigger pattern that classifies any non-target input stamped with that pattern as the target class. This implementation is considered a white-box attack because the attacker needs data for the trigger generator to train the infected model.

4) Latent Backdoor

This attack type is considered an incomplete backdoor because the attacker’s target label does not exist in the Teacher Model yet; however, the model is trained in anticipation of its inclusion [122]. The backdoor only works if the user ends up including the target label. A primary example is using presidential candidates as the trigger for a facial recognition model, hoping that it will be used for a high-profile security program. This implementation can be considered either a black-box or a white-box attack depending on whether the attacker has a specific use case or simply makes assumptions about the data being used.

5) Appending Backdoor

This attack merges a separate, backdoored (small) neural network with the target model. It does not tamper with the parameters of the target model [123], but it changes the model architecture. Hence, while it is effective and easy to implement, if the victim analyzes the infected model, it will be visually apparent that it was compromised. Since the backdoored network must be appended to the existing model without knowledge of the data itself, we consider this approach a gray-box implementation.

6) Graph Neural Network (GNN) Backdoor

This attack type does not tamper with the parameters of the target model but changes the model architecture. This attack mixes a subgraph of the overall GNN with the trigger [124], [125]. This method is considered a gray-box implementation because the attacker needs access to the original GNN but not the data itself.

7) Membership Inference Attacks

In a membership inference attack [126], [127], [128], the attacker aims to know whether a certain data sample or feature was used in the training or not, which leaks sensitive information. For example, in a hospital dataset, an attacker would want to know whether this model was trained on certain patient data.

Table 6 summarizes the key components of adversarial attacks in TL. We found that the major attack approach is model reconfiguration or data manipulation, even when a target system is equipped with various defenses, such as fine-pruning or Neural Cleanse.

C. Defenses in Transfer Learning

Since TL itself is one of the emerging technologies, few formal defenses have been developed against backdoor attacks in TL. The existing defenses for TL borrow from current defenses for DNNs in the sense that the victim examines the neurons that were interacted with or the inputs post-mortem. Prevention and mitigation are difficult due to the tradeoff between changing the original Teacher Model and decreasing overall performance. In this section, we review some of the common defenses in TL.

1) Neural Cleanse

It detects backdoors by scanning model output labels and reverse-engineering any potential hidden triggers. Once a backdoor has been detected, mitigation efforts are mainly made by filtering out triggered images and patching the affected neurons by unlearning images with the trigger [70].

2) Fine-Pruning

This defense takes an iterative approach to remove backdoor triggers by first pruning redundant neurons that are the least useful for classification. After that, it fine-tunes the model using clean training data to restore model performance [59]. This particular defense has further variations. For example, Catchbackdoor [129] uses fuzzing to achieve this. Shapley Prune [130] employs Shapley values for their pruning. Although their pruning approaches are different, their goals are the same in analyzing each neuron.

3) Activation Clustering

This defense feeds various malicious inputs into the model and analyzes the activated neurons. Poisoned inputs carrying triggers should yield activation patterns that differ from those of benign inputs [64].
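A minimal sketch of this idea, assuming that penultimate-layer activations have already been extracted per predicted label (the extraction helper and the cluster-size threshold are hypothetical), clusters each class's activations and flags classes with a suspiciously small cluster:

import numpy as np
from sklearn.cluster import KMeans

def flag_poisoned_classes(activations_by_class, small_fraction=0.35):
    # activations_by_class: dict mapping label -> (n_samples, n_features)
    # array of penultimate-layer activations for samples predicted as that
    # label. A class whose activations split into one large and one small
    # cluster is flagged as potentially backdoored.
    suspicious = []
    for label, acts in activations_by_class.items():
        clusters = KMeans(n_clusters=2, n_init=10).fit_predict(acts)
        minority = min(np.mean(clusters == 0), np.mean(clusters == 1))
        if minority < small_fraction:
            suspicious.append(label)
    return suspicious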

4) Input Blurring

STRong Intentional Perturbation (STRIP) [77] employs some blurring of the input to decrease the attack success rate (ASR). The key idea is that preprocessing can damage the trigger itself or deter how the trigger is read by the backdoor.

5) Proto2Proto

Proto2Proto aims to expose any attack presence during the TL process by comparing a benign Teacher model and a Student model that may be infected. They analyze the loss and local representations of the Student model and enforce them to be similar to the Teacher model [131].

Table 7 provides the summary of the key components of defense techniques in TL. We observe that the major defense goal is to detect or remove a Trojan attack or prevent a backdoor attack.

D. Discussion: Insights and Limitations of Existing Attacks and Defenses in TL

We learned the following lessons from reviewing the existing attacks and defenses in TL as follows:

  • The effectiveness of attacks in TL is quickly outpacing that of defenses. Since most defenses in TL are borrowed from defenses in DNNs, attackers have been able to easily craft attacks specific to TL itself.

  • A main difficulty of these attacks in TL is the assumption that an attacker is able to access the architecture of a publicly distributed model or can reliably deceive a victim into using their infected Teacher model. While this can be said for all backdoor attacks, it is especially true in TL, where the knowledge transfer is performed from a more generic Teacher model to a specific Student model.

  • Moving forward, the open source community must maintain some form of a trusted authority to ensure public resources such as large datasets and models are secure against exploitation.

  • In terms of expanding the attacker side, branching into domains outside of image classification, such as multimedia data, would prove important because TL increasingly incorporates multimodal approaches that draw from multiple sources of text and images.

SECTION V.

Adversarial Attacks and Defenses in Deep Reinforcement Learning

A. Background in DRL

Basic reinforcement learning (RL) can be explained by a Markov Decision Process (MDP) in which, at each time step t , an agent interacts with an environment by performing an action a_{t} based on an optimal policy \pi ^{\ast} , given a state s_{t} . After this action is performed, the agent obtains a reward r_{t} and the next state s_{t+1} . The goal is to take actions that maximize the accumulated reward. We describe the key behavior of the MDP in Figure 6. When a neural network is used as the policy function, this basic RL is called deep RL (DRL). The basic definitions of RL include the following (a minimal interaction loop is sketched after the definitions):

  • Environment is a simulator or real-world system in which the agent interacts and learns, such as self-driving cars, Atari games, or MuJoCo.

  • Policy can be considered as a function \pi ^{\ast} that gives an action a_{t} , given a state s_{t} at a current time step t , a_{t} = \pi ^{\ast} (s_{t}) . This function can be deterministic or probabilistic.

  • State is an observation at a current time step, such as an image seen by a self-driving car.

  • Action is a stimulus used by an agent to interact with the environment.

  • Reward can be represented as a numerical incentive received by an agent from the environment.

  • Model is used to mimic the behavior of the environment for making inferences. There can be model-free RL methods in which we do not use a model and model-based RL in which we use a model. Model-free methods learn from real-time interactions with an environment, whereas model-based methods can learn offline by using a simulated model.
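As a minimal sketch of this interaction loop, the snippet below runs a random placeholder policy for one episode in a Gym-style environment; it assumes the gymnasium package and the CartPole environment are installed.

import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)
total_reward, done = 0.0, False

while not done:
    action = env.action_space.sample()  # placeholder for a learned policy pi*(s_t)
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # accumulated reward the agent tries to maximize
    done = terminated or truncated

env.close()
print("episode return:", total_reward)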

FIGURE 6. Reinforcement learning setting.

B. Attacks in DRL

In DRL, an attacker mainly aims to reduce the average reward of the DRL agent while remaining undetectable. This objective can be divided into two questions: (1) What is the right time to attack during the episode? (When to Attack); and (2) How to craft a perturbation? (How to Attack). In the DNN environment, there is only a question of how to attack. We also classify attacks based on four factors that are usually perturbed by the attackers: reward, policy, observations, and environment. The fifth classifying factor could be based on an action but the action is usually related to reward because non-optimal actions usually lead to low rewards. Figure 7 describes these classifications. Following are the detailed explanations and discussions of the existing DRL attacks under each classification:

  • Environment-based attacks add noise or objects to the environment. The environment is the medium where actions are taken based on observations.

FIGURE 7. Classification of DRL attacks.

FIGURE 8. Classification of DRL defenses.

1) Common Dominant Adversarial Examples Generation Method (CDG) [135]

This method utilized the generation of high-confidence adversarial examples by leveraging obstacles present within the environment. The researchers demonstrated an impressive Attack Success Rate (ASR) of 99.9% when applying this method to a pathfinding problem, which was solved using the A3C algorithm in a white-box scenario. The success criterion for this attack is defined as the ability to either delay the progress of the DRL agent or prevent it from reaching its intended destination.

2) Attacks on Zero Sum Games [136]

This research established a zero-sum game framework by introducing an adversarial agent into the same environment as the legitimate agent. This setup allowed for creating natural adversarial observations, enabling the adversarial agent to influence the behavior of the actual DRL agent to follow the adversarial policy. The Proximal Policy Optimization (PPO) [137] algorithm was employed to demonstrate the capabilities of these agents.

To mitigate the detrimental effects of such adversaries, frozen deployed models were utilized. Rather than relying on average reward as a metric, the effectiveness of the attacks was assessed based on the win rate achieved in these game scenarios. This alternative measure provides valuable insights into the attack success within this framework.

3) Model Querying Attacks [138]

This study introduced two novel online sequential attacks that target the environment of a DRL agent. Unlike the traditional FGSM, which relies on back-propagation, these attacks employ model querying techniques. Model querying refers to a setting in which users/attackers do not have access to the internal knowledge of an online model, such as its structure and parameters; instead, they can query the outputs of the targeted model by providing input samples [139]. Specifically, they utilized two methods of model querying: the adaptive dimension sampling based finite difference method (SFD) and the optimal frame selection method (OFSM). Leveraging the advantages of model querying, these attacks demonstrated even greater speed compared to FGSM-based approaches. Moreover, they exploited the temporal consistency of states during the attack process.

In addition to attacking the environment, the researchers also present alternative attack strategies that target observations and actions instead. To evaluate the effectiveness of their attacks, they conducted experiments using TORCS with both DDPG and DQN agents. These experiments encompassed both white-box and black-box settings, providing comprehensive insights into the attack’s efficacy.

  • Observation-based attacks manipulate the states either by sensor manipulation or directly perturbing the states.

4) Fast Gradient Sign Method-Based Uniform Attack [140]

This attack expands upon the established Fast Gradient Sign Method (FGSM) and represents the inaugural attempt to target DRL. It is called a uniform attack because it uniquely operates across all time steps within an episode. Such an attack is regarded as a preliminary form of adversarial attack since it neglects the correlation between observations, resulting in easy detection of the adversary [141]. Many researchers selected Atari Games as their experimental domain and aimed to compromise the policy by employing adversarial states.
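A minimal PyTorch sketch of the FGSM perturbation applied uniformly to every observation in an episode; the policy network and the perturbation budget epsilon are hypothetical placeholders, and states are assumed to be batched tensors.

import torch
import torch.nn.functional as F

def fgsm_state(policy_net, state, epsilon):
    # Perturb a batch of states so the agent's currently preferred actions
    # become less likely: one signed-gradient step on the policy's loss.
    state = state.clone().detach().requires_grad_(True)
    logits = policy_net(state)                       # (batch, num_actions)
    target = logits.argmax(dim=-1)                   # actions the clean policy takes
    loss = F.cross_entropy(logits, target)
    grad = torch.autograd.grad(loss, state)[0]       # gradient w.r.t. the observation only
    return (state + epsilon * grad.sign()).detach()  # uniform attack: applied at every step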

5) Policy Induction Attacks [142]

This method adopts the concept of the uniform attack [140], employing both FGSM and the Jacobian-based Saliency Map Approach (JSMA) as perturbation techniques. The ability of adversarial samples generated for a target network to also cause misclassification in other networks trained on the same task domain is called transferability. This research demonstrates the transferability of adversarial examples across various Deep Q-Network (DQN) models, serving as a foundation for policy induction attacks.

6) FGSM and Random Noise [143]

This method explored the impact of adversarial examples and random noise on DRL policies. Through empirical experiments, the researchers established that adversarial examples generated using FGSM are more effective than random noise. By leveraging the value function, they showed that the proposed approach achieves comparable performance at a lower attack rate; essentially, not every moment within an episode needs to be attacked to reduce the reward. This was exemplified through various controlled attacks conducted on the Asynchronous Advantage Actor-Critic (A3C) algorithm playing Atari Games.

7) Attack on Robotic Vehicle [144]

This attack involved manipulating the sensory data to intentionally misguide a dynamic autonomous robot, causing it to follow an incorrect path. They successfully demonstrated that once the tampering ceased, the robot promptly returned to its intended trajectory. This experiment serves as evidence that a hidden attack can be devised, leaving no trace behind. The DQN algorithm was employed to control an autonomous robot emulator (JAV) system within a white-box setting where access to the trained policy was necessary.

8) CopyCAT, Realistic Read-Only Attack [145]

This study argued that previous attack methods often lack realism or are computationally intensive. They operated within a read-only environment setup, limiting their ability to interact directly with the system. To address this challenge, they proposed two attack strategies: the per-observation attack and the universal mask attack. The per-observation attack introduces perturbations to each observation made by the agent within the environment. Conversely, the universal mask attack applies a single perturbation that is generated at the onset of the attack and applied to all observations.

In investigating non-targeted attacks, the authors demonstrated the efficiency and effectiveness of the FGSM. However, when targeting specific objectives, they discovered that FGSM struggled to generate perturbations that were simultaneously effective and imperceptible. To showcase the efficacy of their approach, they conducted experiments using the DQN and Rainbow algorithms playing Atari Games.

9) How to Attack in DRL (ACADIA) [146]

Previous research has primarily focused on addressing the when-to-attack problem in DRL [141], [147] while neglecting the how-to-attack aspect and relying solely on perturbations designed for DNNs. Ali et al. [146] filled this gap by proposing three new adversarial perturbations in DRL, collectively referred to as ACADIA (Attacks Against Deep reinforcement learning). Their main objective is to develop effective attacks despite time constraints and defenses. To achieve this, they employed innovative combinations of momentum, the ADAM optimizer (specifically Root Mean Square Propagation, or RMSProp), and initial randomization in their attacks. They evaluated the performance and efficiency of their state perturbation attacks by using the DQN and PPO algorithms in Atari Games and MuJoCo environments against the RADIAL and ATLA defenses. The results demonstrated that ACADIA outperforms other state perturbation attacks, providing state-of-the-art performance. Notably, ACADIA is also nine times faster than the Carlini & Wagner (CW) method [148] while exhibiting superior performance against DRL defenses.

  • Policy-based attacks include the observation-based attacks as observations are the inputs to policies. Other types of attacks include model or policy extraction by observing the inputs and outputs of the policy.

10) Model Extraction and Imitation Learning [149]

Chen et al. [149] argued that in the context of DRL, alternative techniques for model extraction and imitation learning are necessary rather than relying solely on DNN approaches. To address this, they employed a Recurrent Neural Network (RNN) to infer the training algorithm of a DRL agent in a black-box setting based on predicted actions. Once the model was identified, they utilized imitation learning to obtain a replica of the target model. This process of model extraction enhanced the effectiveness of adversarial examples, particularly within black-box settings.

The experiments in [149] encompassed multiple algorithms such as DQN, PPO, and A2C, targeting the Atari Pong and Cart-Pole environments. Through these experiments, they demonstrated the power of their approach in generating potent adversarial attacks in diverse DRL scenarios.

11) Attacks on Multi-Agent RL (MARL) [150]

Figura et al. [150] introduced novel adversarial attacks specifically targeting consensus-based multi-agent RL networks. The authors demonstrated that an adversarial agent has the capability to influence all participating agents within the consensus network to adopt its desired adversarial policy. Furthermore, they established the theoretical asymptotic convergence of their algorithm, indicating that the consensus ultimately favors the adversary.

In their experimental setup, a network of Deep RL agents employs a decentralized Actor-Critic methodology within a white-box setting. This attack differs from mainstream research in adversarial DRL, which primarily focuses on compromising the state, environment, or reward through adversarial perturbations. The proposed approach instead directly manipulates the consensus-based decision-making process among the participating agents.

  • Reward-based attacks include reward flipping, adding perturbation, and perturbing action space such that the reward is compromised.

12) Gradient-Based Attack [151]

This particular attack strategy encompasses three distinct variants that vary based on the employed perturbation method and the incorporation of a loss function. The first approach utilizes random noise, while the second is a gradient-based (GB) attack focusing on maximizing the loss function for the worst possible discrete action. The third variant builds upon the second one by introducing SGD as an enhancement. The study showcases that GB attacks outperform FGSM-based attacks when applied to the Deep Deterministic Policy Gradient (DDPG) [152] and Double Deep Q-Network (DDQN) [153] algorithms within the Cart Pole, Mountain Car, and MuJoCo environments.

13) Strategically Timed Attack and Enchanting Attack [141]

This study primarily revolves around determining the optimal timing for attacks within an episode, as attacking at every time step would be readily detectable. To address this, the researchers introduced two approaches: strategically timed attacks and enchanting attacks. They employed the Carlini & Wagner (CW) perturbation method to disrupt these strategically chosen time steps. Remarkably, by utilizing a mere 25% attack rate, they successfully minimized the rewards while achieving a high attack success rate (ASR). The effectiveness of their attacks was demonstrated through experiments conducted on the A3C and DQN algorithms playing Atari Games.

14) Adversarial Transformer Network (ATN) [154]

ATN leveraged transformer models to generate a sequence of adversarial inputs to minimize the reward obtained by a DRL agent in a white-box scenario. They also placed emphasis on the timing of attacks, inspired by the concept of strategically timed attacks [141]. They proved the effectiveness of their attack strategy via experiments on the DQN algorithm while playing the Atari Pong game.

15) TrojDRL - Backdoor Attacks [155]

This attack constitutes a backdoor or Trojan attack executed during the training phase within a DRL environment. By modifying a mere 0.025% of the training data, a Trojan is subtly introduced into the dataset. The backdoor is designed so that the DRL agent’s performance deteriorates only when the trigger is activated. This study represents one of the limited efforts aimed at harnessing backdoor attacks within the context of DRL. Additionally, the researchers demonstrated the remarkable effectiveness of their attack even when existing defenses within the DRL system were employed.

16) Critical Point Attack and Antagonist Attack [147]

This study can be seen as a natural extension of the strategically timed attack [141]. The primary objective remains unchanged: perturbing the agent at key moments while minimizing detection, thereby causing significant harm. The first attack approach is known as the critical point attack. In this method, the adversary constructs a predictive model of the agent’s future actions and the corresponding environmental states, evaluates the potential impact of various attack strategies, and selects the most optimal one. The second technique, referred to as the antagonist attack, involves the adversary training a domain-agnostic model to identify crucial moments for launching attacks within an episode.

Through their experiments, the authors of [147] successfully demonstrated a significant reduction in reward by employing Carlini & Wagner perturbations [148] in fewer than five critical steps across episodes in various environments, including The Open Racing Car Simulator (TORCS), Atari Games, and MuJoCo. These findings highlight the effectiveness of their attack techniques in causing substantial disruption to the agent’s performance.

17) Action Poisoning Attack (LCB-H) [156]

Liu and Lai [156] discussed the concept of action poisoning attacks in DRL, focusing on manipulating the actions taken by the agent. They proposed an attack scheme called LCB-H, which effectively forces efficient agents to frequently select adversarial actions. The attacks are applied to model-free DRL algorithms in a periodic 1-dimensional grid world, considering both white-box and black-box settings. The computational complexity of the attacks is analyzed and found to be either sub-linear or logarithmic. This work expanded the existing research by introducing a new type of attack and providing insights into its effectiveness and computational implications.

18) Tentative Frame Attack [157]

Qiaoben et al. [157] addressed the timing of attacks in DRL continuous environments. The authors introduced a theoretical framework called Strategically-timed State-adversarial MDP (SS-MDP) to determine the optimal frames for launching attacks. They trained a frame attack strategy using these optimal frames and named the approach as Tentative Frame Attack. The efficacy of their attacks is demonstrated using Proximal Policy Optimization (PPO) with MuJoCo environments in a white-box setting. The study contributed to the field by providing a theoretical foundation and demonstrating the effectiveness of their strategically timed attacks in DRL continuous environments.

19) Frame-Correlation Economical Attacks [158]

In previous research, attacks on individual frames in an MDP were conducted separately, disregarding the correlation between neighboring states. This approach was computationally expensive and not practical for real-world scenarios with time constraints. To address this issue, Qu et al. [158] proposed the concept of transferability between frames to optimize attack efficiency and enable realistic attacks within limited time frames. They introduced three frame-correlation transfer (FCT) techniques: anterior case transfer, random projection-based transfer, and principal components-based transfer. These FCT methods utilize genetic algorithms with varying levels of computational complexity to generate adversarial attacks. The authors demonstrated the effectiveness of their real-time attacks in realistic settings by applying them to four state-of-the-art DRL algorithms and conducting experiments on Atari Games in a black-box setting.

Table 8 summarizes the key components of adversarial attacks in DRL. Unlike other DL models, attacks in DRL aim to lower the DRL agent’s reward or to change its state or policy and ultimately lead the agent to choose a poor or suboptimal action. Different from other DL models, defense mechanisms are not considered much when attacks are tested. In addition to the attack success rate, the accumulated reward is used as a main metric to measure the attack performance (i.e., lower is better).

C. Defenses in DRL

In DRL settings, a defender aims to detect an attack and efficiently mitigate it to maximize the reward. The types of defense can be categorized as follows:

  • Robust Reinforcement Learning [160], [161]: These techniques are based on robust MDP formulations, such as the noisy action robust MDP, in which uncertainties and noise are handled so that the resulting policies work both in the presence and in the absence of adversarial perturbations. Their major goal is to be robust to environmental changes and to some attacks.

  • Adversarial Training (AT) [135], [151], [162], [163]: This is similar to the conventional adversarial training being used as a defense in DNNs. AT is to train on a wide variety of adversarial states such that the DRL algorithms can act normally against such adversarial states. This training can include training on noise-based and gradient-based adversarial states. Another interesting way is to train a defense parallel to a learned adversary.

  • Detection-Only Defenses [164], [165]: These defenses only detect attacks and do not necessarily mitigate them. For example, dimensionality reduction techniques, such as Principal Component Analysis (PCA), are used to detect the presence of adversarial states.

  • Game-Theoretic Approach: This approach leverages a game environment between an attacker and defender where the attacker and defender want to minimize or maximize the reward, respectively. Based on these objectives, both parties take action against each other based on game theory.

  • Imitation Learning [166]: This indicates that a separate agent, in addition to the original DRL agent, is learned simultaneously to correct the original agent’s mistakes under attack.

  • Regularization [167], [168]: This method makes policies smooth so that they do not perform worse under uncertain and noisy perturbations.

  • Benchmarking and Watermarking [169], [170]: The watermarking technique integrates a unique response into a specific sequence of states which defends against unauthorized policy replications. Benchmarking is to standardize certain techniques such that worst-case scenarios can be handled and the agent does not go to unsafe states.

  • Other Defenses [171], [172], [173]: Other types of defense include Bounded Adversarial Loss (RADIAL) and GAN-based RL (RL-VAEGAN). RADIAL theoretically bounds the loss so that it performs well against adversarial states or noise. RL-VAEGAN learns the styles of adversarial and original states. These methods are known to be more effective than other techniques under certain conditions.

Now, we discuss DRL defense techniques in chronological order.

1) Frame Prediction [164]

This method predicts the next frame and uses the predicted frame to detect whether the next frame is adversarial or not, making it a detection-only defense. This approach leverages an action-conditioned frame prediction module to predict the frame. FGSM [140], Carlini & Wagner method [148], and Basic Iterative Method (BIM) [174] are used as adversarial perturbations. They can detect 60% to 100% of adversarial perturbations when DQN plays Atari Games.

2) Adversarial Training (AT) [151]

AT has been used in DNNs and is considered a very well-known defense. In [151], AT particularly defends against gradient-based attacks or adversarial perturbations, as gradient-based perturbations were used to train the DRL models, such as DQN and DDPG in the MuJoCo, Cart Pole, and Mountain Car environments. AT was tested against FGSM and SGD perturbations.
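A minimal sketch of this style of adversarial training, reusing the hypothetical fgsm_state perturbation from the earlier attack sketch: a DQN-style temporal-difference loss is computed on perturbed observations so the agent learns to act well even when its inputs are adversarially masked.

import torch
import torch.nn.functional as F

def adversarial_td_loss(q_net, target_net, batch, epsilon, gamma=0.99):
    # DQN-style TD loss computed on FGSM-perturbed states, so training
    # covers the adversarial observations the agent may face at test time.
    states, actions, rewards, next_states, dones = batch
    adv_states = fgsm_state(q_net, states, epsilon)   # perturb the observations
    q_values = q_net(adv_states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    return F.smooth_l1_loss(q_values, targets)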

3) Minimax Iterative Dynamic Game [175]

Ogunmolu et al. [175] proposed an iterative dynamic game (iDG) framework that follows a minimax approach for generating resilient policies in situations with disturbances and uncertainties. They assessed the effectiveness of the iDG framework using a mecanum-wheeled robot tasked with achieving a specific goal. The objective is to develop a locally robust optimal multistage policy within the given goal-reaching task. The algorithm can be easily customized to design meta-learning or deep policies that can withstand disturbances, model discrepancies, or uncertainties, as long as they remain within a predetermined disturbance limit.

4) Gradient-Based Adversarial Training [135]

This defense leverages the well-known AT defense in the DRL setting. It also targets gradient-based attacks or adversarial perturbations, as gradient-based perturbations are used to train the DRL models. The major difference from [151] lies in that this defense uses its own adversarial perturbation generation method, called CDG [135], to train the DRL models. This technique achieved 93.89% precision in detecting adversarial perturbations in the A3C pathfinding problem.

5) Robust Q-Learning Using PCA [165]

Xiang et al. [165] proposed a robust Q-learning method using principal component analysis (PCA) to detect adversarial inputs in a robot pathfinding application. They leveraged PCA to compute the weights of the different factors of the pathfinding problem and showed that this defense achieves a precision of 70% in detecting adversarial inputs.

6) Robust Student DQN (RS-DQN) [166]

Fischer et al. [166] proposed RS-DQN by splitting the standard DQN architecture into a policy (Student) network S and a Q network. The Student network S is robustly trained and helps in exploration, while the standard Q network simultaneously preserves normal training. This can be seen as a type of imitation learning [166] used to defend against attacks. Even in the absence of attacks, this defense achieves the same performance as vanilla DQN. They also showed that the performance of RS-DQN can be enhanced, as it is compatible with AT and provably robust training. They showed the effectiveness of their defense against Projected Gradient Descent (PGD) attacks [174] in Atari Games.

7) Collision Avoidance Using Benchmarking [170]

Behzadan and Munir [170] provided a novel framework for benchmarking the collision avoidance mechanisms in DRL settings. This framework deals with worst-case scenarios where an adversarial agent drives the system into unsafe states. They showed this technique’s applicability and reliability by comparing it with other collision avoidance systems and intentional collisions.

8) Watermarking [169]

This approach aims to defend against model extraction attacks in DRL. The watermarking technique integrates unique responses into a specific sequence of states to defend against unauthorized policy replications. The objective of this work is to maintain performance while introducing minimal impact from the watermarking. The authors showed the robustness of watermarked and non-watermarked states by using DQN in a Cart Pole environment.

9) Regularization Using Partially Observable MDP (POMDP) [167]

Russo and Proutiere [167] measured the potential impact of attacks and established a relationship between attack vulnerability and the smoothness of the targeted policy. Smoother policies are inherently more resistant to attacks, which explains why Lipschitz policies, specifically those that are smooth with respect to the state, show higher resilience. From the perspective of the primary agent, the uncertainties within the system and the attacker can be effectively modeled as a POMDP. This work showed that using RL methods for POMDPs, such as those utilizing RNNs, leads to more robust policies. To evaluate the efficacy of this defense approach, this work conducted experiments using DQN, Deep Recurrent Q-learning Network (DRQN) [176], and DDPG trained in Gym environments such as Atari Games, Mountain Car, and LunarLander, subjecting them to gradient-based attacks.

10) Robust RL Using PR-MDP and NR-MDP [160]

Tessler et al. [160] proposed new MDP formulations to enhance robustness: the Probabilistic Action Robust MDP (PR-MDP) and the Noisy Action Robust MDP (NR-MDP). Using these MDPs, they also extended DDPG to Action Robust DDPG (AR-DDPG). They showed that these algorithms work well with or without adversarial perturbations and demonstrated the effectiveness of action-robust policies in MuJoCo environments.

11) Robust RL Using Distributionally Robust Policy Iteration (RL-DRPI) [161]

Smirnova et al. [161] proposed a distributionally robust policy iteration scheme, especially for high-dimensional state or action spaces. In high-dimensional spaces, there is a high chance of learning a sub-optimal policy, so this technique builds a robust policy using robust Bellman operators that provide a lower-bound guarantee on policy or state values. This work also leveraged mixed exploration, where conservative actions are taken in the short term and optimistic actions in the long term to form an overall optimal policy. The resulting policy is called a distributionally robust soft actor-critic. The authors demonstrated the effectiveness of RL-DRPI in MuJoCo environments. The key idea of this approach is that a policy’s higher robustness to environmental changes also implies higher robustness to adversarial perturbations.

12) State Adversarial DRL (SA-DRL) [168]

Zhang et al. [168] proposed the state-adversarial MDP (SA-MDP) to investigate the core characteristics of defense in DRL. Additionally, they created a theoretically sound method for regularizing policies that can be utilized in various DRL algorithms, including PPO, DDPG, and DQN, regardless of whether the action control problems are discrete or continuous. This approach significantly strengthens the resilience of PPO, DDPG, and DQN agents against powerful white-box adversarial attacks. In their experiments, a robust policy substantially improved the performance of DRL, even in the absence of an adversary, across diverse environments such as MuJoCo and Atari Games, subjecting the agents to Critic, Random, Maximal Action Difference (MAD), and Robust Sarsa (RS) attacks.

13) Robust ADversarIAl Loss (RADIAL) [173]

Oikarinen et al. [173] proposed a framework called RADIAL-RL to train different DRL agents designed based on bounded adversarial loss. They evaluated RADIAL-DQN, RADIAL-A3C, and RADIAL-PPO to defend against PGD attacks in Atari Games, MuJoCo, and ProcGen environments. RADIAL is considered one of the state-of-the-art algorithms in 2023.

14) Certified Adversarial Robustness (CARRL) [172]

This defense relies on the concept of certified adversarial robustness by calculating a lower bound on state-action pairs to find an optimal action. It tackles adversarial perturbations or noise by putting a lower bound on worst deviations. They have shown its effectiveness in collision avoidance and control tasks using DQN.

15) Adversarial Training With SR2L [162]

This defense is called Smooth Regularized Reinforcement Learning (SR2L), in which smoothness-inducing regularization is used to train the policy. This approach was inspired by the continuous space environments where there are smooth transitions between states. Search space is constrained using this regularization to increase smoothness in the policy, making the policy robust. Trust Region Policy Optimization (TRPO) [177] and Deep Deterministic Policy Gradient (DDPG) [178] are used in the experiments, where authors proved the efficacy and robustness of their technique.

16) Alternate Training With Learned Adversary (ATLA) [163]

This is currently the state-of-the-art DRL defense; it first builds the strongest DRL-based learned adversary. Using this static learned adversary, the authors train a DRL algorithm against the perturbed states generated by the learned adversary. Previously, AT had only been applied to combat gradient-based perturbations, such as FGSM. The experiments in [163] showed that DRL algorithms adversarially trained using learned adversaries can perform well under multiple state-of-the-art attacks, including Robust Sarsa, Maximal Action Difference, Critic-based, and Snooping attacks. They used PPO with MuJoCo environments to show the effectiveness of their defense.

17) RL-Variational Autoencoders and Generative Adversarial Networks (RL-VAEGAN) [171]

RL-VAEGAN is designed based on style transfer: it learns the styles of unperturbed and adversarial states in order to convert perturbed states into unperturbed ones. This method works against state-of-the-art DRL attacks under both white-box and black-box settings.

Table 9 shows the summary of the key components describing defense techniques in DRL. We found that AT is the most common defense under various perturbation-based attacks.

D. Discussion: Insights and Limitations of Existing Attacks and Defenses in DRL

We found the following insights and limitations after reviewing attacks and defenses in DRL:

  • Most state-of-the-art attacks mainly focus on identifying an optimal time to attack (i.e., when-to-attack) during a DRL episode [141], [147]. We found that only [146] discussed how-to-attack, emphasizing the importance of efficient and robust perturbation methods in DRL-based settings. However, there is still room for improvement in finding the right perturbation for dynamic and continuous-control environments. Many existing perturbation methods are slow and do not work under time-constrained, safety-critical environments.

  • The existing attacks mainly focused on minimizing a DRL agent’s reward. However, this type of attack can be easily detected by the system due to a significant reduction in the reward. Therefore, there should be more efforts to develop stealthy and undetectable attacks, which can slightly lower rewards while the attackers can still achieve the success of their targeted or non-targeted attacks.

  • Considering the limited resources and capabilities of both attackers and defenders, we should look into solutions to achieve efficient and robust attack and defense strategies, particularly in multi-agent DRL environments.

  • Although defenses like ATLA have shown high robustness against certain attacks, they cannot deal with sophisticated and robust attacks that have been developed recently. Hence, when defense techniques are validated, we should evaluate them under a wide range of attack types to test their comprehensive defense robustness. In addition, unlike other DL models, DRL-based settings are often characterized by highly dynamic environments. The defense mechanisms in DRL should also consider environmental dynamics and adversarial attacks under such environments, which will be more challenging but must be considered.

  • The existing defenses only consider static adversaries. Considering the nature of dynamic environments considered in DRL settings, we should consider dynamic adversaries when developing defense techniques, which is more challenging. As adversarial training (AT) is known as the most popular defense technique, we can devise AT that is robust against constantly evolving adversaries.

  • There is also a need to benchmark or standardize strong attacks so that defenses can be evaluated against those standard attacks. Many defenses use different sets of attacks to show their efficacy, which makes it hard for readers to determine which defense is better.

  • Most defense techniques in DL models can be combined with other defenses. Hence, an ensemble of defenses is necessary to understand their interplay. In addition, it is critical for users of DRL to employ defenses together in real-time industrial applications to combat state-of-the-art adversarial attacks.

SECTION VI.

Evaluation Metrics, Datasets, and Application Domains

This section discusses the overall trends of metrics, datasets, and application domains considered in developing attacks and defenses in each DL model.

A. Metrics

Table 10 shows the popularity of the metrics used for attacks and defenses developed in DNNs, FL, TL, and DRL, in terms of the number of papers considered in this survey that use each metric. For DNNs, FL, and TL, we found that the attack success ratio (ASR) and prediction accuracy are most commonly considered to show attack and defense performance. For example, effective attacks aim to maximize ASR while lowering prediction accuracy. Effective defenses aim to maintain high prediction accuracy in the presence of attacks or to maximize the detection ratio of attacks. Specifically, attackers who disrupt the operation of DNNs aim to lower the model’s prediction accuracy by injecting backdoor triggers into the input datasets, whereas defenders in DNNs aim to maximize the model’s prediction accuracy efficiently and effectively. In DRL contexts, attackers and defenders aim to minimize or maximize the average reward, respectively. In FL-based systems, attackers aim to lower the model’s prediction accuracy or to reconstruct other clients’ data with the minimum mean square error (MSE) between the reconstructed and original data, while defenders aim to maintain the model’s prediction accuracy and preserve clients’ privacy by maximizing the MSE. Unlike the metrics used in DNNs, FL, and DRL, attacks and defenses in TL are evaluated by their respective success ratios, as most TL models solve classification problems.

TABLE 10 Metrics Used for Existing Attacks and Defenses in DNNs, FL, TL, and DRL

B. Datasets

In Tables 2–7, we summarized the datasets used for existing attacks and defenses proposed in DNNs, FL, and TL. Since DRL is mainly considered in game applications, we did not discuss datasets in Tables 8 and 9. Table 11 summarizes the datasets commonly used across the four DL models. For TL, image classification is the most commonly attacked application because it is easy to visually identify the trigger. Hence, the image datasets MNIST and CIFAR are the most prevalent in the literature. By a small margin, MNIST was seen more frequently for early proof-of-concept attacks, while CIFAR is used for complex attacks aiming to conceal the trigger. The ‘Other’ category consists of image datasets that were uncommon or used in tandem with the more popular image datasets. Atari and MuJoCo are the most common game applications used to test the performance of attacks and defenses developed for DRL-based applications.

TABLE 11 Datasets (Applications) Used for Existing Attacks and Defenses in DNN, FL, TL, and DRL

C. Application Domains

The existing attacks and defenses in DNNs, FL, and TL mainly target text or image classification tasks. However, since DRL is mostly used to solve sequential decision problems, attacks and defenses in DRL are mainly proposed in game contexts, such as Atari, MuJoCo, or Cart Pole.

SECTION VII.

Discussions: Commonalities & Differences

This section discusses the commonalities and differences between adversarial attacks and defense techniques developed for the four DL models studied in this work.

A. Commonalities and Differences of Attacks in the Four DL Techniques

We find the following common trends in existing attacks proposed in DNNs, FL, TL, and DRL:

  • Since FL, TL, and DRL use DNNs in their architectures, they mostly inherit the attacks developed for DNNs, which is the root cause of the similarity among these techniques. For example, FGSM is a basic perturbation method introduced for DNNs that has been used or extended by TL and DRL researchers (its standard formulation is given after this list). Hence, the perturbation methods used across these DL models are very similar.

  • FL, TL, and DRL are also prone to backdoor attacks, which were first discovered in DNNs and in which certain data samples are used as backdoor triggers to attack the models.

  • Attacks in FL primarily concern users’ privacy, where attackers can infer other users’ datasets in various ways. As FL-based systems also use DNNs in their global and local models, attackers can mount the model inversion attacks studied in DNNs for privacy disclosure by exploiting class probabilities (i.e., confidence information).

  • Since all DL models considered here, including FL, TL, and DRL, have their roots in DNNs, attacks on DNNs and their success conditions largely carry over to FL, TL, and DRL models. For example, non-targeted attacks are commonly defined as increasing the model’s misclassification on manipulated inputs without hurting the overall accuracy on clean inputs, so that they are not detected by the defense system.
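For reference, the FGSM perturbation mentioned in the first item above is commonly written as
\[
x_{\mathrm{adv}} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_{x} J(\theta, x, y)\right),
\]
where \(x\) is a clean input with label \(y\), \(J(\theta, x, y)\) is the model’s training loss, and \(\epsilon\) bounds the perturbation magnitude.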

We also observe the differences between the attacks in the four DL models as follows:

  • Unlike other DL models, an attacker in DRL experiences a continuous sequence of states during an episode and must identify the critical states to attack (when-to-attack) so that the attack remains undetectable and stealthy. An attacker in DRL also faces a continuous action space in most environments, which makes it challenging to determine which adversarial state will lead to which action in such a space.

  • Unlike other DL models, which may not require sharing data, FL can encounter a wider variety of attacks that break privacy preservation because of its distributed learning nature. Therefore, various FL algorithms have been proposed to protect clients’ private data from the central server or other clients. In addition, compromised clients in FL can cause model divergence, for example through Byzantine attacks that send random vectors during the SGD update stage, resulting in an inaccurate global model. In contrast, adversarial attacks in other DL models aim to make the model converge, but to a wrong solution.

  • FL is unique among DL methodologies in its engineering complexity. Coordinating shared state across multiple users (i.e., clients) in a distributed system brings performance benefits but also exposes critical security issues, because involving more users widens the attack surface.

  • TL also has its unique characteristics. In TL, the attacker loses control over how the Student model is trained once their infected model is published. The attacker must therefore predict the victim’s choices or influence how the victim trains the Student model for the trigger to activate.

  • We observe that attacks occur at different times across DL models. In DNNs and DRL, attacks can happen during the training and/or testing phases. In contrast, adversarial attacks in FL are performed mainly during the training phase.

B. Commonalities and Differences in Defenses in the Four DL Techniques

We found the commonalities in the existing defenses in the four DL models as follows:

  • Adversarial Training (AT) is a defense applicable to any DL model, where the model is made robust by training it on adversarial images/states. However, the method by which these adversarial images/states are generated may differ across models.

  • Imitation Learning is another defense applicable to any DL model, as one only needs to train a similar model and compare the attack detection performance across the models.

  • Game-theoretic defenses can also be applied across DL models: the attacker aims to minimize model accuracy, the defender aims to maximize it, and the two play a game to achieve their conflicting goals.

  • Dimensionality reduction techniques, such as PCA, are used to detect adversarial attacks or backdoor attacks across different models.

  • Fine pruning is another defense widely used in DNNs and TL. Since DNNs are the core of every DL model considered here, this defense can also be applied to other models, such as DRL and FL.

  • Since all DL models are prone to noise and perturbations, regularization can be used across different models to make them more robust against such noise and perturbations.

  • There are many similar defenses in TL and DNNs because DNN-based defenses are mostly inherited by TL. For example, fine pruning, neural cleanse, and activation clustering have been carried over from DNNs to TL.

  • Offline inspection defense can be applied to DNNs and TL to detect and remove backdoors in pre-trained and Teacher models in TL. This defense can also be deployed to the DNN part of DRL models.

  • Gradient-filtering defenses in FL remove outliers and anomalies from the gradient updates submitted by clients (a minimal sketch follows this list). A similar idea in DNNs, called gradient clustering, compares the gradients of deceptive and clean inputs to identify backdoor attacks.
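As a concrete illustration of the gradient-filtering idea in the last item, the sketch below drops client updates whose L2 norm deviates strongly from the median before averaging; this is only one simple outlier rule among many possible ones, and all names and thresholds are illustrative.

# Norm-based gradient filtering for FL aggregation; names and the threshold rule are illustrative.
import numpy as np

def filter_and_aggregate(client_updates, deviation_factor=2.0):
    # Drop client updates whose L2 norm deviates too far from the median norm, then average the rest.
    updates = [np.asarray(u, dtype=float) for u in client_updates]
    norms = np.array([np.linalg.norm(u) for u in updates])
    median_norm = np.median(norms)
    kept = [u for u, n in zip(updates, norms) if n <= deviation_factor * median_norm]
    if not kept:                   # fall back to all updates if the rule rejects everything
        kept = updates
    return np.mean(kept, axis=0)   # simple FedAvg-style mean over the surviving updates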

We identified the following differences in the existing defenses in the four DL models:

  • Since DRL is based on an MDP, a defender can make DRL robust by changing the theoretical treatment of the MDP in the presence of an attacker. We call this defense type Robust RL, where RL is made robust by modifying the MDP so that it is prone to neither noise nor adversarial perturbations (one common formalization is sketched after this list). This is not the case in other DL models because they are not based on MDPs.

  • In FL, we frequently see cryptographic primitives leveraged to ensure computational integrity or confidentiality. For instance, multi-party computation can transparently aggregate values, and blockchain-based state management can remove trusted intermediaries. These are rarely seen in other DL models, where multi-party privacy is less of a concern.

  • In TL, comparing a benign Teacher model with an infected Student model is an effective way to determine if there was tampering during the TL process. Analyzing various aspects such as architecture, loss values, and local representations with a frame of reference is an advantage not observed in other DL models.

  • We observe that FL models are mainly threatened by privacy attacks when clients and a central server communicate. In contrast, privacy attacks in DNNs mainly aim to recover training data from trained models.

  • Various defenses in DNNs and DRL have been proposed to detect adversarial examples. Defenses in DNNs mainly aim to identify backdoor triggers in samples, whereas defenses in DRL identify adversarial inputs by utilizing frame-prediction models based on the actions of the NN-based policy.
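One common way to formalize the Robust RL idea in the first item above, sketched here following state-adversarial formulations in the literature (the symbols are illustrative), is to optimize the policy against the worst-case bounded state perturbation \(\nu\):
\[
\pi^{*} = \arg\max_{\pi} \; \min_{\nu \in \mathcal{B}} \; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_{t}, a_{t}) \;\middle|\; a_{t} \sim \pi\left(\cdot \mid \nu(s_{t})\right)\right],
\]
where \(\mathcal{B}\) restricts the perturbation, e.g., \(\lVert \nu(s) - s \rVert \le \epsilon\) for every state \(s\).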

SECTION VIII.

Conclusion and Future Work Directions

This section summarizes the key findings obtained from this study and suggests promising future research directions.

A. Key Findings

Our key findings from this comprehensive survey are:

  • Most attacks in DNNs are applied to disturb image classification tasks. Among the six categorized attack surfaces (see Section II-B), backdoor attacks are less explored under the code poisoning and post-deployment attack surfaces than under the others. Outsourcing backdoor attacks achieve the highest ASR of all attack surfaces because the defender has the least capability in that setting.

  • We found no existing defense technique that can defend against backdoor attacks under all attack surfaces in DNNs. The main reasons are that the defender has limited resources (e.g., computational resources) and lacks knowledge about the attacks. Although the existing defenses protecting DNNs are comprehensive and relatively capable against various backdoor attacks, some require high computational resources and ML expertise, as well as prior knowledge about the attacks, such as the trigger size. Given these inherent resource limits and uncertainties, overcoming such challenges is non-trivial.

  • Federated Learning (FL)-based systems offer stronger data privacy guarantees at the cost of slower training execution and increased engineering complexity. The critical tradeoff between privacy preservation and model performance (e.g., a model’s prediction accuracy) must be balanced to achieve both conflicting goals.

  • In FL-based systems, only limited threats are considered, such as data reconstruction or feature inference attacks when gradients are shared with a central server or other clients. Most existing FL-based systems take privacy-preserving approaches, leveraging Differential Privacy (DP) and Homomorphic Encryption (HE) to defend against such attacks (a minimal DP-style sketch is given after this list). However, malicious clients or a malicious central server have rarely been considered, while the honest-but-curious (HBC) threat model is more commonly assumed in FL-based systems.

  • In FL-based systems, defenses mainly focus on privacy preservation or security. However, they pay little attention to efficient execution.

  • Vertical FL (VFL) is less studied than horizontal FL (HFL), even though VFL is in high demand for applications where parties hold the same samples but different features. In VFL, the entity alignment step and the sharing of confidence scores in both the training and inference phases can expose vulnerabilities to novel attacks aiming to obtain extra information.

  • Transfer Learning (TL) leverages unique relationships between different task domains and can lower the upfront cost of training. However, these relationships can be abused through deceptive means. Hence, greater scrutiny and caution should be applied during the TL process through model inspection. In addition, the data itself should be protected so that unsuspecting users are not victimized.

  • Attacks and defenses in DRL have been less explored than those in DNNs, although the existing ones are fairly strong and robust. Existing attacks in DRL have focused on identifying a critical point to attack based on the when-to-attack design principle; how-to-attack methods should be further explored in future work.

  • ATLA [163] and RADIAL [173] are known as the strongest and most comprehensive state-of-the-art defenses in DRL. However, they assume static attack strategies. There is a dire need to tackle new attacks in real dynamic settings where an attacker continuously changes its strategy.
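As a concrete instance of the DP-based FL defenses mentioned above (in the finding on FL threat models), a client might clip and noise its update before sharing it with the server. The sketch below is a minimal DP-SGD-style illustration with hypothetical parameter names; it omits the privacy accounting a real deployment would need.

# DP-style perturbation of a client update before sharing; no privacy accounting is shown.
import numpy as np

def privatize_update(update, clip_norm=1.0, noise_multiplier=1.0, rng=None):
    # Clip the update to a fixed L2 norm, then add Gaussian noise scaled to that norm.
    if rng is None:
        rng = np.random.default_rng()
    update = np.asarray(update, dtype=float)
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))   # L2 clipping
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=update.shape)
    return clipped + noise                                    # what the client actually shares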

B. Future Research Directions

We also identified the following promising future research directions in this research domain:

  • Developing attacks in DNNs under the code poisoning and post-deployment attack surfaces is promising: code poisoning attacks can reach the widest range of victims through low-level ML code repositories and can achieve high ASR without any knowledge of the model or training data, and post-deployment attacks can be developed with highly flexible triggers and high ASR. However, there is still a lack of research on both. Hence, more research effort should be directed at attacks based on code poisoning and post-deployment.

  • When developing defenses in DNNs, we should design lightweight solutions that account for resource constraints, i.e., cost-effective defense operations that still maximize defense effectiveness. For example, online inspection can effectively distinguish triggered inputs from clean inputs even with large triggers; however, it may introduce latency when applied to real-time applications.

  • The state-of-the-art attacks on FL-based systems have assumed a benign central server with only HBC clients, which do not exhibit actual malicious behaviors that adversely impact system performance. Further, in VFL settings, inference-based attacks can be highly attractive to attackers trying to exploit the model outputs shared by other clients. Hence, future research can analyze the impact of malicious servers and clients and develop inference-based privacy attacks in VFL.

  • The existing FL defenses mainly focus on privacy preservation. In addition, we should consider the efficiency of defense mechanisms, i.e., guaranteeing model accuracy at minimum cost. Further, beyond HBC clients in FL settings, FL defense mechanisms should be robust against a malicious server as well as malicious clients that actually breach data privacy.

  • The existing attacks in TL focus on maximizing attack effectiveness by inflicting targeted or non-targeted damage on system performance. To model more realistic attackers with limited resources, an efficient attack strategy is needed to ensure persistence of the injected triggers: when a trigger is rewritten, the existing attack is rendered ineffective. Therefore, new attack strategies should focus on covert techniques that conceal or minimize changes to the model architecture to ensure persistent attacks.

  • We can harden TL models by introducing offline detection mechanisms before training. In addition, we can leverage a certification or verification system for vendors and open-source products to ensure that publicly distributed models can be trusted.

  • DRL combines DL and RL, so most attacks on DNNs are also applicable in DRL settings. However, DRL has unique characteristics in terms of non-stationarity and autonomous decisions in continuous-control environments. Hence, we need to investigate how to identify the right perturbations to address issues arising from such dynamic, non-stationary environments. Further, we need to ensure the robustness of an attack against DRL systems equipped with defenses; it is therefore important to make attacks stealthy and hidden so that they are not easily detected while still effectively degrading the agent’s decision performance, resulting in poor decisions.

  • DRL solves many sequential problems. Given this nature, adversaries also perform a series of attacks until the DRL agent makes poor decisions, leading to task failure according to the attacker’s targeted goal. To deal with such highly intelligent, time-evolving adversaries, we should design defenses that effectively and efficiently counter these sophisticated attacks by learning about adversarial behaviors. One option is an ensemble defense that combines multiple defense techniques; another is a game-theoretic approach in which a defender identifies and takes an optimal defense action to maximize defense effectiveness against dynamic adversaries (one common zero-sum formulation is sketched below).
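A zero-sum view of the game-theoretic defense mentioned above can be sketched as
\[
d^{*} = \arg\max_{d \in \mathcal{D}} \; \min_{a \in \mathcal{A}} \; U(d, a),
\]
where \(d\) and \(a\) range over defender and attacker strategies and \(U\) is a utility such as the agent’s expected reward or the model’s accuracy under attack; the notation here is illustrative rather than taken from a specific paper.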
