
Properties that allow or prohibit transferability of adversarial attacks among quantized networks


Abstract:

Deep Neural Networks (DNNs) are known to be vulnerable to adversarial examples. Further, these adversarial examples are found to be transferable from the source network in which they are crafted to a black-box target network. As the trend of using deep learning on embedded devices grows, it becomes relevant to study the transferability properties of adversarial examples among compressed networks. In this paper, we consider quantization as a network compression technique and evaluate the performance of transfer-based attacks when the source and target networks are quantized at different bitwidths. We explore how algorithm-specific properties affect transferability by considering various adversarial example generation algorithms. Furthermore, we examine transferability in a more realistic scenario where the source and target networks may differ in bitwidth and other model-related properties like capacity and architecture. We find that although quantization reduces transferability, certain attack types demonstrate an ability to enhance it. Additionally, the average transferability of adversarial examples among quantized versions of a network can be used to estimate the transferability to quantized target networks with varying capacity and architecture.

CCS Concepts: • Security and privacy → Software and application security.
Date of Conference: 15-16 April 2024
Date Added to IEEE Xplore: 18 June 2024
Conference Location: Lisbon, Portugal

SECTION 1

Introduction

The last decade has seen deep learning become increasingly popular in a multitude of domains such as healthcare [34], speech recognition [25], self-driving cars [21], and collision avoidance systems [17, 24]. One of the prominent domains where deep learning is increasingly being used is embedded systems. For instance, IoT devices use AI for data collection and processing [20]; mobile devices use it for functionalities like face recognition [12], voice assistants [31], hardware emulation in cameras [11], and so forth.

However, the evolution of AI into a powerful technology for solving complicated tasks has, in part, been made possible by the rise of powerful hardware capable of running large DNN models [2]. Given that embedded devices are inherently limited in resources, implementing deep learning on the device itself often involves optimization of the base model. Model optimization is a topic of active research and several methods have been developed for this purpose [8, 9, 14]. One such method is to compress a model via quantization [7], which reduces the computational complexity and size of a model by lowering the precision of the network parameters from the default float32 to smaller bitwidths.

DNNs are found to be vulnerable to data samples with deliberately added and often imperceptible perturbations [29]. Benign images that are otherwise classified correctly by a network can, when subjected to these perturbation vectors, be misclassified at a high rate [6, 30]. These perturbed images, or adversarial examples, are a severe threat to the usability of DNNs in safety-critical domains as they can effectively fool a network into making wrong decisions chosen by an adversary.

Moreover, adversarial examples are observed to be transferable. Examples generated on one classifier are found to be effective against other classifiers trained to perform the same task [6, 16, 30, 32]. This enables an adversary to mount a black-box attack on a target network with adversarial images crafted on another network. Here, the network on which the adversarial examples are created is termed the source network and the network to which the adversarial attacks are applied is called the target network.

In the context of embedded systems, adversarial transferability has even more severe consequences due to the growing ubiquity of these devices. For instance, an attacker can gain access to a model deployed on an edge device and use it to craft adversarial examples; the transferability property of adversarial examples would then enable the attacker to perform transfer-based black-box attacks on a secure base model. Moreover, the attack can be transferred to other compressed models derived from the same base model [35]. Conversely, a manufacturer can use a publicly available base model and roll out compressed versions of it on its devices. Attackers can then use the base model to craft attacks against the private devices. Thus, it becomes crucial to understand the factors that affect the transferability of adversarial attacks across compressed networks.

The main objective of this paper is to analyse the conditions and properties that facilitate or hinder the transferability of adversarial attacks among networks of different bitwidths. This involves investigating how various adversarial attack generation algorithms and model-specific properties affect the transferability of adversarial examples when the source and target networks differ in quantization level. More specifically, the goal is to answer the following research questions:

  • RQ1: What causes some algorithms to have better or worse transferability across variably quantized systems?

  • RQ2: How do model-related properties like model architecture and capacity affect the transferability of attacks between networks of different bitwidths?

To answer these questions we perform a comprehensive study of transferability among variable bitwidth networks. The main contributions of this paper are:

  • By considering a broad range of algorithms when creating adversarial examples, we evaluate the performance of quantized and full-precision networks when the attacks are transferred from another network having a different bitwidth. Our analysis of the transferability properties of the Universal Adversarial Perturbation (UAP) attack [23] reveals that it allows higher magnitudes of perturbation than the internal algorithm used to craft the UAP while still keeping the adversarial image recognizable to human observers. This increases the effectiveness of UAPs at higher distortions, which suggests that the performance of an attack may be improved by aggregating the individual perturbations from multiple images to create a UAP. Further, the observed poor transferability of the Jacobian Saliency Map Based Attack (JSMA) [27] indicates that attacks that depend on individual feature modifications may have poor transferability. Moreover, search-based attacks like the Boundary Attack may have poor transferability due to the direction of perturbation introduced by the algorithm.

  • We consider a more realistic scenario where the target model may differ from the source in terms of model capacity, architecture, or both. In such scenarios, we show that the overall success rate of a transfer-based attack can be estimated by observing the performance of the attack when it is transferred among different bitwidth versions of the source model.

The rest of the paper is organised as follows: Section 2 mentions prior studies on transferability of adversarial attacks. Section 3 provides a short background on network quantization, adversarial examples, and attack generation methods used in this work. Section 4 presents the experiments together with the evaluation of the results. Section 5 reflects on the results. Finally, we conclude in Section 6.

SECTION 2

Related Works

There is already a considerable body of work postulating how different factors might affect the transferability of adversarial examples. Findings in [19] indicate that the attack algorithm used to craft the adversarial examples can affect their transferability. The authors find that some fast gradient-based attacks like the Fast Gradient Sign Method (FGSM) [6] are less transferable than search-based attacks. Further, observations in [32] show that factors such as model capacity, accuracy, and architecture of both source and target models affect transferability. The authors additionally note that a smooth loss surface on the source network results in a better chance that the source and target gradients align, and thus in better transferability. In [13], the authors offer a new perspective on transferability, arguing that it is a property of the data distribution itself and that highly accurate models trained on non-robust or brittle features are more vulnerable to transfer-based attacks.

Thus, it is clear that there are multiple factors at play when considering the transferability of adversarial attacks. Nevertheless, the dynamics of these factors change when considering transferability among networks with varying bitwidths. For instance, [32] and [30] state that high-accuracy models have similar decision boundaries, and thus transferability is high among such networks. However, the transferability of attacks between a full-precision network and its quantized version is observed to be considerably low even when both networks have similar accuracy [2].

Prior works on the transferability of adversarial examples with regard to quantized networks address how various algorithms can affect transferability [2, 22, 35]; however, the focus is mainly on attacks that either use loss gradients or estimate them. Moreover, the current state of the art mostly considers cases where both source and target networks are different bitwidth versions of the same network; in real-world settings, an attacker may not know the quantization level and other model-specific properties of the target network. Our objective with this study is to address these gaps.

SECTION 3

Background

3.1 Deep Neural Networks (DNNs)

Given an input $\mathbf{x} \in \mathbb{R}^{m}$ and the corresponding output vector $\mathbf{y} \in \mathbb{R}^{n}$, a DNN is a classification function $f: \mathbf{x} \mapsto \mathbf{y}$, where $m$ is high-dimensional. In addition to the input $\mathbf{x}$, the output $\mathbf{y}$ also depends on the network parameters $\theta$ which the network learns during the training process.

For the $n$-class classifiers considered in this work, the output vector $\mathbf{y}$ is a probability distribution, that is, $y_{1}+y_{2}+y_{3}+\ldots+y_{n}=1$ and $0 \leq y_{i} \leq 1$ for $i=1,\ldots,n$. Each element $y_{i}$ in $\mathbf{y}$ represents the probability that the input $\mathbf{x}$ has class $i$. Thus, for a network $\mathbf{y}=f(\mathbf{x})$, the label $y$ assigned by the classifier to input $\mathbf{x}$ is: \begin{equation*}\underset{i}{\operatorname{argmax}} f_{i}\left(\mathbf{x}\right)=y \tag{1}\end{equation*}

Here, $y_{i}=f_{i}(\mathbf{x})$ is the $i^{\text {th }}$ output of the network.

3.2 Adversarial Examples

Adversarial examples are created by adding computed perturbations to a clean image. The resulting distorted samples look similar to their original counterparts and are still classified correctly by human oracles, but the perturbations are enough for a classifier to change the output class probabilities, possibly leading to a misclassification [6, 10, 16, 29].

Let $y^{\text{true}}$ be the true label of a clean sample $\mathbf{x}$; then from Equation 1 we have $\operatorname{argmax}_{i} f_{i}(\mathbf{x})=y^{\text{true}}$. If $\mathbf{x}$ is subjected to a perturbation vector $\eta \in \mathbb{R}^{m}$, resulting in a perturbed example $\mathbf{x}^{adv}$ that causes misclassification, then: \begin{equation*}\underset{i}{\operatorname{argmax}} f_{i}\left(\mathbf{x}^{adv}\right) \neq y^{\text{true}} \tag{2}\end{equation*}

3.3 Crafting Algorithms

The perturbation introduced to create an adversarial example is not random. In fact, it is much harder to cause misclassification by adding perturbations in random directions [30]. An adversarial example generation algorithm adds perturbations to the original image in a specific direction, computed within the steps of the algorithm.

Fast Gradient Sign Method (FGSM) [6] is one of the simplest methods for crafting adversarial examples. The single-step algorithm adds a max-norm-constrained perturbation in the direction of the loss gradient of the network to create adversarial samples. \begin{equation*}\mathbf{x}^{adv}=\mathbf{x}+\varepsilon \operatorname{sign}\left(\nabla J_{\mathbf{x}}\left(\mathbf{x}, y, \theta\right)\right) \tag{3}\end{equation*}

Here, $\nabla J_{\mathbf{x}}$ is the gradient of the loss function $J(\mathbf{x}, y, \theta)$, computed with respect to the input $\mathbf{x}$. Since the loss gradient gives the direction of the largest increase in loss, adding a small perturbation aligned with this direction can cause a significant increase in loss in high dimensions. $\varepsilon$ is the $L_{\infty}$ norm of the perturbation, which also gives the distance between $\mathbf{x}$ and $\mathbf{x}^{adv}$.
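The FGSM update in Equation 3 is simple enough to sketch directly. The following is a minimal, illustrative PyTorch implementation (the models in this paper are Tensorpack/TensorFlow networks attacked through ART, so this sketch only assumes a generic `model` that returns logits):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    """Single-step FGSM (Equation 3): x_adv = x + eps * sign(grad_x J(x, y, theta))."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)    # J(x, y, theta) for inputs x with labels y
    loss.backward()                        # gradient of the loss w.r.t. the input
    x_adv = x + eps * x.grad.sign()        # step in the direction of the loss gradient
    return x_adv.clamp(0.0, 1.0).detach()  # keep pixels in the valid [0, 1] range
```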

Jacobian Saliency Map based Attack (JSMA) [27] computes adversarial examples by creating a direct mapping between input and output variations and using this map to isolate the features that are most effective in causing a classification change. The algorithm uses the forward derivative of the function learned by the network (given by Equation 4) to construct a saliency map which filters the features that are most important according to a given criterion. Equation 5 provides a basic filter criterion as defined in [27]. \begin{equation*}\nabla f\left(\mathbf{x}\right)=\frac{\partial f\left(\mathbf{x}\right)}{\partial \mathbf{x}}=\left[\frac{\partial f_{j}\left(\mathbf{x}\right)}{\partial x_{i}}\right]_{i \in 1, \ldots, m,\; j \in 1, \ldots, n} \tag{4}\end{equation*} \begin{equation*}S\left(\nabla f\left(\mathbf{x}\right), t\right)\left[i\right]=\begin{cases}0 & \text{if } \frac{\partial f_{t}\left(\mathbf{x}\right)}{\partial x_{i}}\lt 0 \text{ or } \sum_{j \neq t} \frac{\partial f_{j}\left(\mathbf{x}\right)}{\partial x_{i}}\gt 0 \\ \left(\frac{\partial f_{t}\left(\mathbf{x}\right)}{\partial x_{i}}\right)\left|\sum_{j \neq t} \frac{\partial f_{j}\left(\mathbf{x}\right)}{\partial x_{i}}\right| & \text{otherwise}\end{cases} \tag{5}\end{equation*}

In Equation 5, $t$ is the target class to which the input is to be misclassified, that is, $t \neq y^{\text{true}}$. $S(\nabla f(\mathbf{x}), t)[i]$ is the saliency map value computed for the $i^{\text{th}}$ feature.

Thus, as per Equation 5, features from the Jacobian matrix (given by Equation 4) that increase the target class probability while decreasing the probabilities of all other classes are weighted, and the feature with the highest value is selected. In each iteration, the selected feature is perturbed by a defined amount. This continues until the adversarial goal of misclassification, that is, $\operatorname{argmax}_{i} f_{i}(\mathbf{x})=t$, is reached.
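As a rough sketch of Equations 4 and 5, the saliency scoring can be written as follows, assuming the Jacobian of Equation 4 has already been computed as an array (a simplified, illustrative version; the experiments use ART's implementation of JSMA):

```python
import numpy as np

def saliency_map(jacobian, t):
    """Per-feature saliency scores as in Equation 5.

    jacobian: array of shape (n_classes, n_features), jacobian[j, i] = df_j/dx_i (Equation 4).
    t: target class. Features violating the sign conditions receive a score of 0.
    """
    dt = jacobian[t]                    # df_t/dx_i for every feature i
    others = jacobian.sum(axis=0) - dt  # sum over j != t of df_j/dx_i
    valid = (dt >= 0) & (others <= 0)   # must increase class t and decrease the other classes
    return np.where(valid, dt * np.abs(others), 0.0)

# In each JSMA iteration the highest-scoring feature is perturbed by theta; this repeats
# until argmax_i f_i(x) == t or the gamma budget of modified features is exhausted.
```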

The Universal Adversarial Perturbation (UAP) attack [23] is quite different from the other algorithms because it does not create individual adversarial samples but instead produces a single perturbation vector that can be added to any point in the dataset to create adversarial examples. Such a perturbation vector is called a Universal Adversarial Perturbation, or UAP.

Let $X=\left\{\mathbf{x}_{1}, \mathbf{x}_{2}, \ldots, \mathbf{x}_{n}\right\}$ be a dataset sampled from a data distribution $\mu$. The basic idea is to sequentially iterate through each image in $X$ and compute a perturbation vector $\Delta \boldsymbol{v}_{k}$ that sends the current perturbed image $\mathbf{x}_{k}+\boldsymbol{v}$ across the decision boundary of the network. Every iteration also updates the overall perturbation $\boldsymbol{v}$ with $\Delta \boldsymbol{v}_{k}$. The algorithm stops when the computed perturbation is able to fool a pre-defined number of images in $X$. The perturbation vector $\Delta \boldsymbol{v}_{k}$ can be computed using any algorithm; in this paper, we use FGSM.
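A minimal sketch of this outer loop is shown below, assuming a `predict` function returning labels and an `fgsm_perturbation` helper that computes the per-image perturbation $\Delta \boldsymbol{v}_{k}$ (both hypothetical; the actual experiments use ART's UAP implementation with FGSM as the inner attack):

```python
import numpy as np

def universal_perturbation(predict, fgsm_perturbation, X, y, xi,
                           target_fool_rate=0.8, max_passes=10):
    """Accumulate a single perturbation v that fools a desired fraction of X,
    keeping ||v||_inf <= xi (a simplified sketch of the UAP algorithm [23])."""
    v = np.zeros_like(X[0])
    for _ in range(max_passes):
        for x, label in zip(X, y):
            if predict(x + v) == label:               # x + v is still classified correctly
                dv = fgsm_perturbation(x + v, label)  # push this point across the boundary
                v = np.clip(v + dv, -xi, xi)          # project back into the L_inf ball
        fool_rate = np.mean([predict(x + v) != label for x, label in zip(X, y)])
        if fool_rate >= target_fool_rate:             # stop once enough images are fooled
            break
    return v
```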

The Boundary Attack (BA) [3] uses the model's decisions on input points to craft adversarial examples. The attack is conceptually simple. Given a benign image, the algorithm initializes a random adversarial image. This random image is then subjected to multiple perturbations iteratively. Each perturbation decreases the $L_{2}$ distance between the benign and the adversarial image. When adding these perturbations, the algorithm queries the model with the perturbed image as input to make sure that the perturbation still keeps the image outside the decision boundary of the true class. Thus, the initial random image is moved along the decision boundary between the adversarial and non-adversarial regions until a minimal $L_{2}$ distance between the benign and the adversarial image is reached. At the end of the run, the image remains adversarial but looks like the benign image to human observers.

The Boundary Attack does not require access to the model parameters and architecture, and thus can operate in black-box scenarios. Further, it does not use gradients to create adversarial examples.
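A heavily simplified sketch of this decision-based loop is given below; it assumes only a `predict` function and omits the adaptive step-size schedule of the original algorithm [3], so it is illustrative rather than faithful:

```python
import numpy as np

def boundary_attack(predict, x, y_true, n_iter=1000, delta=0.1, eps=0.1, seed=0):
    """Walk a random adversarial image toward the benign image x while staying
    misclassified (untargeted Boundary Attack, simplified)."""
    rng = np.random.default_rng(seed)
    # 1. Start from a random image that is already misclassified.
    x_adv = rng.uniform(0.0, 1.0, size=x.shape)
    while predict(x_adv) == y_true:
        x_adv = rng.uniform(0.0, 1.0, size=x.shape)

    for _ in range(n_iter):
        dist = np.linalg.norm(x - x_adv)
        # 2. Random step on the sphere of radius dist around x ("orthogonal" step) ...
        candidate = x_adv + rng.normal(size=x.shape) * delta * dist
        candidate = x + (candidate - x) * dist / (np.linalg.norm(candidate - x) + 1e-12)
        # ... followed by a small contraction toward x, shrinking the L2 distance.
        candidate = np.clip(candidate + eps * (x - candidate), 0.0, 1.0)
        # 3. Keep the step only if the image is still adversarial.
        if predict(candidate) != y_true:
            x_adv = candidate
    return x_adv
```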

The Carlini-Wagner (CW) attack [4] is one of the most powerful adversarial attacks. It uses an optimization function that iteratively performs gradient descent towards the target adversarial class while keeping the distance between the original and the adversarial example minimal. Equation 6 defines the optimization problem the attack solves. \begin{align*}\text{minimize } \left\|\mathbf{x}^{adv}-\mathbf{x}\right\|_{2}^{2}+c \cdot l\left(\mathbf{x}^{adv}\right); \text{ where } \\ l\left(\mathbf{x}^{adv}\right)=\max \left(Z_{i}\left(\mathbf{x}^{adv}\right)-\max \left(Z_{t}\left(\mathbf{x}^{adv}\right): t \neq i\right)+\kappa,\; 0\right) \tag{6}\end{align*}

In the above equation, $Z_{i}\left(\mathbf{x}^{adv}\right)$ is the logit of the true class $i$, while $Z_{t}\left(\mathbf{x}^{adv}\right)$ is the logit of any other class $t \neq i$; the loss function $l$ thus ensures misclassification. Further, the attack uses the logits $Z(\mathbf{x})$ instead of the output of the softmax layer. This enables the attack to bypass defensive techniques such as defensive distillation [28], which work by smoothing the model's output. Similarly, $c$ is a non-negative constant that balances misclassification confidence against the distance between the adversarial and original sample; it is determined at run time using binary search. The attack also allows the misclassification confidence $\kappa$ to be specified. Increasing $\kappa$ makes the attack more likely to transfer, but at the cost of higher distortion.

As can be seen, the equation uses only the $L_{2}$ metric when creating attacks. In their paper, the authors also define $L_{0}$ and $L_{\infty}$ variants; however, our analysis only includes the $L_{2}$ norm, as it is the strongest variant and the attack was originally formulated for this norm, with the other norms adapted from $L_{2}$.
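For a fixed trade-off constant $c$, the inner optimization of Equation 6 can be sketched as plain gradient descent on the input, as below (illustrative PyTorch; the binary search over $c$ and the change of variables used in [4] are omitted, and `logits_fn` is an assumed function returning the 1-D logit vector $Z(\mathbf{x})$):

```python
import torch

def cw_l2_inner(logits_fn, x, true_class, c, kappa, lr=0.01, steps=100):
    """Minimise ||x_adv - x||_2^2 + c * l(x_adv) for a fixed c (Equation 6)."""
    x_adv = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_adv], lr=lr)
    for _ in range(steps):
        z = logits_fn(x_adv)                              # logits Z(x_adv), shape (n_classes,)
        z_true = z[true_class]
        z_other = torch.cat([z[:true_class], z[true_class + 1:]]).max()
        l = torch.clamp(z_true - z_other + kappa, min=0)  # 0 once misclassified with margin kappa
        loss = torch.sum((x_adv - x) ** 2) + c * l
        opt.zero_grad()
        loss.backward()
        opt.step()
        x_adv.data.clamp_(0.0, 1.0)                       # stay in the valid pixel range
    return x_adv.detach()
```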

3.4 Quantization as a model optimization technique

Quantization reduces the bitwidth of network values so that the computational complexity during training and inference is reduced [14]. By using lower-bitwidth numbers rather than the default float32 values, floating-point multiplications during convolution operations can be replaced by much faster bitwise operations.

In this paper, network quantization is performed during training using a method called DoReFa-Net [36]. For any real number $r_{i} \in[0,1]$, the corresponding $n$-bit quantized output value $r_{o}$ is computed during the forward pass using the quantization function $r_{o}=\frac{1}{2^{n}-1} \operatorname{round}\left(\left(2^{n}-1\right) r_{i}\right)$. However, since a function with a small, finite output range has zero gradient with respect to its input almost everywhere [36], the quantization function itself is not differentiable. This creates a problem during back-propagation. For instance, if $c$ is the cost function, then during the backward pass the computation in Equation 7 requires $\frac{\partial r_{o}}{\partial r_{i}}$, which is not defined. \begin{equation*}\frac{\partial c}{\partial r_{i}}=\frac{\partial c}{\partial r_{o}} \cdot \frac{\partial r_{o}}{\partial r_{i}} \tag{7}\end{equation*}

One solution to this problem is to estimate the value of $\frac{\partial r_{o}}{\partial r_{i}}$, given that $\frac{\partial c}{\partial r_{o}}$ is properly defined. Estimators that allow defining a custom $\frac{\partial r_{o}}{\partial r_{i}}$ are called Straight-Through Estimators (STEs) [1]. DoReFa-Net uses $\frac{\partial c}{\partial r_{i}}=\frac{\partial c}{\partial r_{o}}$ as the STE for activation and weight quantization.
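In an automatic-differentiation framework this STE is commonly realised with a "detach trick", as in the following illustrative PyTorch sketch (the paper's networks use Tensorpack's DoReFa implementation; this only mirrors the quantizer and identity STE described above):

```python
import torch

def dorefa_quantize(r_i, n_bits):
    """n-bit DoReFa quantizer for r_i in [0, 1]:
    forward:  r_o = round((2^n - 1) * r_i) / (2^n - 1)
    backward: d r_o / d r_i = 1 (straight-through estimator)."""
    scale = 2 ** n_bits - 1
    r_o = torch.round(scale * r_i) / scale
    # The returned value equals r_o in the forward pass, but gradients flow
    # through r_i unchanged, i.e. dc/dr_i = dc/dr_o.
    return r_i + (r_o - r_i).detach()
```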

SECTION 4

Experiments

We use the CIFAR10 [15] and MNIST [18] datasets in all experiments. For the MNIST dataset we trained a simple 8-layer CNN (called Mnist A for easier reference), while for CIFAR10 a ResNet20 network was trained. The architectures of both the CIFAR10 and MNIST models are defined using the examples from the Tensorpack repository [33]. For both the MNIST and CIFAR10 models, six quantization bitwidths are considered: 1, 2, 4, 8, 12, and 16 bits. Table 1 shows the accuracies of all quantized and full-precision models. Training and architecture details are included in Appendix A.

Table 1: Test set accuracy of the full-precision (FP) and quantized versions of the MNIST and CIFAR10 models

The transferability properties of the five attack algorithms described in Section 3.3 were studied. For each attack type, different sets of hyperparameter values were used for the MNIST and CIFAR10 models; these values are listed in Table 2. As seen in the table, $\varepsilon$ for FGSM is the magnitude of distortion added to the original image. For JSMA, $\gamma$ and $\theta$ represent the total percentage of pixels allowed to be distorted and the distortion added per pixel (per iteration), respectively. For the UAP attack, FGSM was used internally to generate the UAP (see Section 3.3); $\varepsilon$ thus represents the perturbation magnitude for FGSM, while $\xi$ represents the maximum magnitude of distortion allowed, measured in the $L_{\infty}$ norm as recommended in [22]. For BA, $i$ represents the number of iterations. Similarly, for the CW attack, $\kappa \geq 0$ is the confidence parameter, $i$ is the maximum number of iterations (of gradient descent) per image, $b_{s}$ is the number of binary search steps used to determine a proper $c$, and $c_{i}$ is the initial value of $c$.

Table 2: Attack hyperparameter values for the full-precision (FP) and quantized versions of the MNIST and CIFAR10 models

These hyperparameter values were selected such that the resulting adversarial images are distorted yet still recognizable to human observers. Thus, for all attack types, we consider the upper limit of transferability by creating adversarial samples that are as distorted as possible. Different hyperparameter values were selected for the same attack type on MNIST and CIFAR10 because, for natural datasets like CIFAR10, a relatively small perturbation is enough to visibly distort the image [22].

Each of these attacks was implemented using the Adversarial Robustness Toolbox (ART) library [26]. Regarding quantization, we quantize both weights and activations (but not gradients).
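For illustration, generating adversarial examples with ART typically looks like the following sketch (ART 1.x class names; the tiny linear `model` and random `x_test` are placeholders for the actual trained networks and data):

```python
import numpy as np
import torch
from art.estimators.classification import PyTorchClassifier
from art.attacks.evasion import FastGradientMethod

# Placeholder model and data; in the experiments these are the trained
# (quantized or full-precision) networks and CIFAR10/MNIST test samples.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, 10))
x_test = np.random.rand(8, 3, 32, 32).astype(np.float32)

classifier = PyTorchClassifier(
    model=model,
    loss=torch.nn.CrossEntropyLoss(),
    input_shape=(3, 32, 32),
    nb_classes=10,
    clip_values=(0.0, 1.0),
)
attack = FastGradientMethod(estimator=classifier, eps=0.1)  # eps as in Table 2
x_adv = attack.generate(x=x_test)                           # numpy array of adversarial samples
```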

Akin to [2], we measure the effectiveness of an attack on a network in terms of adversarial accuracy. This metric gives the network’s accuracy against adversarial examples and is computed as the ratio of the number of adversarial samples classified correctly by the network to the total number of samples. Thus, a higher adversarial accuracy means the network is more robust against adversarial samples.
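Concretely, the metric can be computed as in this small sketch, where `predict` is an assumed function returning the target network's label for a single input:

```python
import numpy as np

def adversarial_accuracy(predict, x_adv, y_true):
    """Fraction of adversarial samples that the target network still classifies
    correctly; higher values mean the target is more robust (less transferability)."""
    preds = np.array([predict(x) for x in x_adv])
    return float(np.mean(preds == np.asarray(y_true)))
```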

4.1 Transferability of adversarial examples

The objective of this experiment was to evaluate how algorithm-specific properties affect transferability. For both the CIFAR10 and MNIST models, 1000 randomly selected clean samples that were classified correctly by both the source and target models were used to create adversarial samples. These were then transferred among different bitwidth versions of the same model and the resulting adversarial accuracy of the target model was measured. This process was repeated 3 times and the average adversarial accuracy was recorded.

Figure 1 reports the transferability results for Resnet20 models. Results for MNIST models are similar and are excluded for brevity.

From the figure, the following observations can be made:

Observation 1: Transferability of loss-gradient-based attacks is poor. As can be seen, the transferability of FGSM, CW, and UAP is poor. The CW attack, which has a 100% success rate when applied on the same network, transfers very poorly. One of the reasons for the observed poor transferability of these attacks is that the gradients of the loss function (with respect to the input) of networks at different bitwidths do not align with each other [2]. Loss-gradient-based attacks create adversarial examples by adding perturbations in the direction of the loss gradient. Thus, a misalignment of gradients between source and target means that the perturbation direction transferred from the source may not cause the same effect at the target. The lower the cosine similarity between the loss gradients, the lower the transferability [2, 5, 32]. By comparison, UAP has better transferability, as noted in Observation 4.

Another reason for the poor transferability could be the quantization shift phenomenon [2, 35]. When networks are quantized, weight and activation values are grouped into the same bucket; this affects how an attack performs when the bitwidth is changed. For instance, when adversarial examples are crafted on a quantized network, the algorithm may assume that the weight values associated with certain nodes are similar, but when the attack is moved to a full-precision network, one weight value may be larger than the other. This may end up activating unintended nodes, making the sample unsuccessful. This also explains why a powerful attack like CW fails to transfer: since the attack relies on creating a differential activation between the logits, when it is transferred to another network it might not achieve the same activation difference between the true class and the adversarial class. Hence, the change in bitwidth between source and target hampers the effectiveness of the attack.

We also noticed that the transferability of the CW attack can be improved by increasing the value of $\kappa$. This was observed in the original paper [4] as well as in other studies [2]. However, in our experiments, when the value was increased beyond $\kappa=5$, most of the images became unrecognizable. Since one of the requirements for a sample to be adversarial is that it remains recognizable to humans, we did not increase $\kappa$ beyond this value.

Observation 2: JSMA has poor transferability. This can again be explained in terms of quantization shift. JSMA crafts adversarial examples by manipulating individual features in an image based on how effective each feature is in producing misclassification. The effectiveness of each feature depends on the individual parameter and activation values of that network. Due to quantization shift, the same features may no longer be sensitive on a network with a different bitwidth. This explains why JSMA performs better when the attack is applied on the same network: it is able to find features that are sensitive enough to cause misclassification for that network. However, when the created sample is transferred to a network with a different bitwidth, the same perturbations are no longer able to produce a similar change in activations, because the target network is in full precision or has different discrete levels of activation and weight values.

Observation 3: The Boundary Attack has poor transferability. As can be seen, the transferability of the Boundary Attack is poor. The target networks are highly robust against the attack even though the same examples work with an almost 100% success rate at the source. This can be explained by how the Boundary Attack crafts adversarial examples and the resulting direction of the perturbations.

Adversarial examples exist in broad, contiguous regions of the input space; thus, rather than the exact example, it is the direction of perturbation that matters for transferability [6]. The perturbation directions that lead to the adversarial sub-space are called adversarial directions, and examples created on the source are transferable if their adversarial directions align with the adversarial directions of the black-box target model [30]. Thus, aligning the perturbation with the adversarial direction is a reliable way of fooling not only the source but also the target model. These adversarial directions point away from the original image, as perturbations are normally added to the original image. In the case of the Boundary Attack, however, perturbations are added to a random image, moving it towards the original image. The perturbed image thus has a perturbation direction pointing towards the original class boundary, which is opposite to the adversarial direction. This explains the poor transferability of the attack.

Observation 4: UAP has better transferability than FGSM when the allowed perturbation is high. Although the UAP algorithm internally uses FGSM to compute perturbations for individual images (Section 3.3), it performs better than FGSM itself when the allowed magnitude of perturbation $\xi$ is high.

UAP allows the maximum allowed distortion to be increased to higher values than the internal algorithm while still keeping the adversarial image recognizable. Figure 2 compares adversarial images generated using UAP and FGSM on the FP Mnist A model with the same value of allowed distortion ($\varepsilon$ for FGSM and $\xi$ for UAP). As can be seen, the images remain recognizable in the case of UAP, whereas they are completely distorted for FGSM.

Summary

  • Quantization reduces the transferability of attacks. Quantization shift and misalignment of loss gradients between variable-bitwidth networks result in poor transferability. Similar observations were made in [2, 35]. The experiments in this section extend this finding by showing that quantization shift also affects the transferability of algorithms like JSMA that leverage individual features to produce misclassification. Further, the experiments on the Boundary Attack show that search-based algorithms might not produce transferable examples due to the direction of perturbation.

  • At higher distortions, UAPs are more transferable than the attack used to craft them. It was found that UAP allows more distortion to be added than the algorithm used internally to craft it. Thus, the performance of a UAP may be enhanced to a greater extent than that of the internal algorithm.

4.2 Transferability of adversarial examples when considering model-related properties

To explore how model-related properties affect transferability among networks of different bitwidths, adversarial examples created at the source were transferred to target networks which not only had different bitwidths but also a different model capacity and architecture than the source model. Model capacity here is quantified in terms of the number of parameters in the model.

Figure 1: Transferability of adversarial attacks among different bitwidth versions of the Resnet20 model. In each matrix, rows indicate the source networks, while columns indicate the corresponding target networks. Row and column headers specify the bitwidth of the source and target models, respectively. The source and target model IDs are labelled alongside the corresponding headers. Cell values correspond to the adversarial accuracy of the target. Higher values (darker colours) indicate less transferability, while lower values (lighter colours) indicate more transferability. The diagonal values correspond to attack performance on the source (the source and target are the same model). Each value in the “Average” column indicates the average adversarial accuracy of all target models (one complete row) against a single attack source.

Figure 2: Adversarial examples generated on the FP Mnist A model using: (a) UAP with $\xi=0.6$; (b) FGSM with $\varepsilon=0.6$. Both sets show the first 10 images from the MNIST dataset.

Table 3: Test set accuracy of the full-precision (FP) and quantized versions of higher-capacity variants of the Mnist A model. Mnist A has 414K parameters, while Mnist B and Mnist C have 836K and 1.7M parameters, respectively

Additional CIFAR10 and MNIST models of different capacities were trained. More specifically, Resnet32, Resnet44, and their quantized versions were trained for CIFAR10, and Mnist B and Mnist C were trained for MNIST by increasing the number of channels in all convolution layers of Mnist A by 100% and 300%, respectively. Tables 3 and 4 show the test accuracies for the quantized and FP versions of the MNIST and CIFAR10 models, respectively. Using FP Resnet44 and its quantized versions as targets, attacks were transferred from the FP and quantized versions of Resnet20 and Resnet32. Similarly, Mnist C and its quantized versions were used as target models and attacks were transferred from the FP and quantized versions of Mnist A and Mnist B. Figure 3 shows the results for the CIFAR10 models. Similar results were obtained for MNIST and are not shown for the sake of conciseness.

Table 4: Test set accuracy of the full-precision (FP) and quantized versions of the various ResNet and CNN models used. Resnet20 has 269K parameters, while Resnet32, Resnet44, and Cifar A have 464K, 658K, and 4.5M parameters, respectively

Similarly, to study how model architecture affects transferability, an 11-layer CNN (named Cifar A for easier reference) was trained on the CIFAR10 dataset. Table 4 shows the test accuracy of the model and its quantized versions. Adversarial examples crafted on the FP and quantized versions of Resnet20 were then transferred to the FP and quantized versions of Cifar A. Figure 4 shows the results.

Figure 3: Transferability of adversarial attacks when the source and target networks differ in capacity. The five matrices in the left column depict the transferability of each of the five attacks when the source networks are different bitwidth versions of Resnet20. The matrices in the right column depict the transferability of the same attacks when the source networks are different bitwidth versions of Resnet32. The target networks are FP Resnet44 and its quantized versions in all cases.

Figure 4: Transferability of adversarial attacks when the source and target networks differ in architecture. The source networks are different bitwidth versions of Resnet20 and the target networks are different bitwidth versions of Cifar A.

Comparing the transferability results in Figure 1 with those in Figures 3 and 4, it can be seen that, for the corresponding attack types, the average transferability when the source and targets differ in capacity and architecture follows the same pattern as when the source and targets are similar in architecture and capacity. For instance, for the UAP attack in Figure 1, where both source and target networks are different bitwidth versions of Resnet20, the average transferability (for $\xi=0.1$) is highest when the attack source is the 1-bit Resnet20 and lowest when the attack source is the FP Resnet20. The same pattern can be seen for the UAP attack in Figure 3, where the attack sources are Resnet20 and its quantized versions but the target models are Resnet44 and its quantized versions. Similar observations can be made in Figure 4. The pattern holds for all attack sources, not just for the sources with the highest or lowest overall adversarial transferability, and is true for all attack types. There are very few exceptions, for instance in the case of JSMA; however, these are rare for both the MNIST and CIFAR10 models.

Further, it can be noted from Figures 3 and 4 that transferability is no better when the source and target networks share the same architecture than when their architectures differ. This is in line with the observation in [19] but contrasts with the one made in [32].

Summary

The average transferability of an attack among the different bitwidth versions of the same model can serve as an indication of transferability when the attack is transferred to another model that differs not only in bitwidth but also in architecture and model capacity. In other words, an attack source with better overall transferability when the source and target share architecture and capacity also tends to have better transferability when they differ in architecture and capacity.
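As a small illustration of how this observation might be used, the sketch below ranks attack sources by the row-wise average of an intra-model transfer matrix such as those in Figure 1 (the matrix itself is assumed to be given as a NumPy array):

```python
import numpy as np

def rank_attack_sources(intra_model_matrix):
    """intra_model_matrix[s, t]: adversarial accuracy when attacks crafted on bitwidth
    version s of the source model are transferred to bitwidth version t of the same model.
    Returns source indices ordered from most to least transferable; per the observation
    above, this ordering tends to carry over to targets of different capacity/architecture."""
    row_avg = intra_model_matrix.mean(axis=1)  # average adversarial accuracy per attack source
    return np.argsort(row_avg)                 # lower average accuracy = more transferable source
```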

SECTION 5

Discussion

Based on the results from the experiments, the research questions put forward at the beginning of the study (Section 1) can be answered as follows:

  1. RQ1: What causes some algorithms to have better or worse transferability across variably quantized systems? It is known from prior works that the transferability of gradient-based attacks among networks of different bitwidths is poor [2, 22, 35]. The transferability experiments with FGSM and UAP showed similar behaviour, with both observed to have poor transferability. However, it was found that the adversarial examples generated using the UAP algorithm remained recognizable at higher magnitudes of distortion compared to FGSM. This resulted in UAP having better transferability than FGSM, as the algorithm allowed setting the distortion norm to higher values. This suggests that the efficiency of an attack may be improved by enclosing it within the UAP algorithm and increasing the magnitude of distortion. Furthermore, it was found that the quantization shift phenomenon also hinders the transferability of attacks like JSMA that modify individual features to produce adversarial samples.

  2. RQ2: How do model-related properties like architecture and capacity affect transferability among quantized networks? Although the transferability is poor, it was observed that the average transferability of different attack sources followed a pattern similar to the average transferability when the attacks were transferred among different bitwidth versions of the same network. This indicates that the transferability of an attack to a black-box model of unknown architecture and capacity can be approximated from how the attack performs when it is transferred among different bitwidth versions of the source model.

Additionally, algorithms like the Boundary Attack that depend on a heuristic search for adversarial examples may be highly effective on the source network but may have poor transferability because the direction of the perturbation does not align with the adversarial direction.

SECTION 6

Conclusion

This paper explores the conditions and properties that can influence the transferability of adversarial examples among networks of varying bitwidths. To this end, a transferability analysis of quantized and full-precision networks trained on the MNIST and CIFAR10 datasets was performed, considering various algorithms for crafting the adversarial examples and varying model-related properties like the architecture and capacity between the source and target networks. Based on a literature review of the current state of the art, this is the first study to consider model-related properties during attack transfers in the context of quantized networks.

Optimization enables the deployment of DNN models on embedded devices; however, the security implications must be taken into consideration when applying an optimization technique. Quantization can provide some robustness against transfer-based attacks; nonetheless, being able to estimate the performance of an attack can be considered a vulnerability.

SECTION A

Model Details

The model architectures are based on MNIST and CIFAR models in Tensorpack repository [33]. Specifically, the models are as defined in the following scripts:

Some adjustments were made to the image augmentations, optimizer, and learning rates. The training hyperparameters are as follows:

A.1 Mnist A

Training hyperparameters are as below:

  • Normalization: Features are normalized to [0, 1]

  • Optimizer: Adam optimizer with learning rate $(\alpha)=0.001, \beta_{1}=0.9$, $\beta_{2}=0.999, \varepsilon=1 \mathrm{e}-8$

  • Regularization: Weight decay on all FC layers; regularization hyperparameter ($\lambda$): 1e-5

  • Loss: Cross-entropy

A.2 ResNet20

Training details are as below (a PyTorch sketch of this configuration follows the list):

  • Normalization: Features are normalized to [0, 1]

  • Optimizer: Momentum optimizer with $\gamma=0.9$

  • Regularization: Weight decay on all layers; regularization hyperparameter ($\lambda$): 2e-4

  • Loss: Cross-entropy

  • Learning Rate: Scheduled to change as (epoch, value): (1, 0.1), (32, 0.01), (48, 0.001), (72, 0.0002), (82, 0.00002)
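For reference, a rough PyTorch equivalent of this configuration is sketched below (the actual training used Tensorpack/TensorFlow, and `model` is a placeholder, so treat the snippet as illustrative only):

```python
import torch

model = torch.nn.Linear(10, 10)  # placeholder for the actual ResNet20
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=2e-4)
schedule = {1: 0.1, 32: 0.01, 48: 0.001, 72: 0.0002, 82: 0.00002}  # epoch -> learning rate

def set_learning_rate(epoch):
    """Apply the scheduled learning rate at the start of the given epoch, if listed."""
    if epoch in schedule:
        for group in optimizer.param_groups:
            group["lr"] = schedule[epoch]
```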

SECTION B

Adversarial Examples From Various Models

During the experiments, the hyperparameters listed in Table 2 were used to craft adversarial samples. These attack hyperparameters were selected so that they produced recognizable (although distorted) images.

Adversarial examples were created by selecting 1000 random samples that were classified correctly by both the source and target networks. The performance of the networks was then measured against the corresponding adversarial samples. The top-5 adversarial samples from these 1000 samples, generated using each hyperparameter setting for all attack types, are presented in the following figures.

Figures 5 and 6 depict adversarial examples crafted on different bitwidths of the Mnist A and Resnet20 models, respectively. The labels to the left of each set of 5 images indicate the bitwidth of the network on which the samples were created.

Figure 5: Adversarial samples created using the FGSM, JSMA, BA, UAP, and CW attacks on various bitwidths of the Mnist A model.

Figure 6: Adversarial samples created using the FGSM, JSMA, BA, UAP, and CW attacks on various bitwidths of the Resnet20 model.

ACKNOWLEDGMENTS

This work has been funded by https://ki-lok.itpower.de/ (German Federal Ministry for Economic Affairs and Climate Action, project no. 19121007A) and https://iml4e.org (German Federal Ministry for Education and Research, project no. 01IS21021C).
