Introduction
A. Preamble
Over the last decade, next-generation cellular networks (i.e., 5G and beyond) have been undergoing a major revolution, driven by advanced telecommunication technologies for high-speed data transmission, high cell capacity, and low latency. Each generation has its own focus: 5G delivers multi-Gbps peak data rates and ultra-low latency, while 6G embeds artificial intelligence into the network. NextG networks require substantial investment and research to meet their infrastructure, computing, security, and privacy requirements. These technologies will enable the next era of data communications and networking by connecting everyone to a world in which everything is connected. Their main goal is to support a wide range of new applications, such as augmented reality (AR), virtual reality (VR), the metaverse, telehealth, education, autonomous and flying vehicles, smart cities, smart grids, and advanced manufacturing. They will create new opportunities for industry to improve visibility, enhance operational efficiency, and accelerate automation [1]. Next-generation networks are expected to simultaneously provide high data rates, ultra-low latency, and high reliability to support services for these applications [2]. Artificial Intelligence (AI) plays a crucial role in achieving these requirements by being integrated into applications at all levels of the network. AI is one of the key drivers for next-generation wireless networks to improve the efficiency, latency, and reliability of network applications [3]. AI is also applied to channel estimation, one of the fundamental prerequisites in wireless networks. Traditional channel estimation methods are highly complex and often inaccurate due to the multi-dimensional data structure and the nonlinear characteristics of the channel. Therefore, DL-based channel estimation models have been adopted in next-generation networks to address these limitations. However, DL-based channel estimation models can be vulnerable to adversarial machine learning (ML) attacks, so a secure scheme is crucial for DL-based channel estimation models used in next-generation networks. DL-based models in next-generation wireless communication systems should be evaluated in terms of vulnerability, risk assessment, and security threats before they are deployed to production environments.
B. Related Works
The main goal of NextG networks is to provide very high data rates (Tbps) and extremely low latency (sub-millisecond) with a high cell capacity (10 million devices per square kilometer) [4], [5]. The key to next-generation networks is the use of new technologies, such as millimeter wave (mmWave), massive multiple-input multiple-output (massive MIMO), and AI. mmWave is essential for these networks, providing high capacity, high throughput, and very low latency in frequency bands above 24 GHz. Massive MIMO is an advanced version of MIMO that uses large groups of antennas at both the transmitter and receiver sides, providing better throughput and spectral efficiency in wireless communication. AI-based algorithms have been used to improve network performance and efficiency. This study focuses on DL-based channel estimation models in next-generation wireless networks and their vulnerabilities. In the literature, these topics have been studied both with and without vulnerability concerns [6], [7], [8], [9], [10], [11]. The authors in [6] reviewed AI-empowered wireless networks and the role of AI in deploying and optimizing next-generation architectures. They indicated that AI-based models have already been used to train the transmitter, receiver, and channel as an auto-encoder, which allows the transmitter and receiver to be optimized jointly. The study also indicated that next-generation networks will differ from current ones in network infrastructure, wireless access technologies, computing, application types, etc. The authors in [12] reviewed DL-based solutions in next-generation networks, focusing on physical layer applications of cellular networks, from massive MIMO and reconfigurable intelligent surfaces (RIS) to multi-carrier (MC) waveforms, and emphasized the contribution of AI-based solutions to improving network performance. The authors in [13] and [14] proposed robust channel estimation frameworks using the fast and flexible denoising convolutional neural network (FFDNet) and deep convolutional neural networks (CNNs) for mmWave MIMO. Both methods can handle a wide range of signal-to-noise ratio (SNR) levels with a flexible noise level map and offer better channel estimation accuracy. DL-based algorithms significantly improve the overall system performance of next-generation wireless networks. Several research groups in the wireless research community study the main potential security issue of AI-based algorithms, i.e., model poisoning [15], [16]. The authors in [17] and [18] provided a comprehensive review of NextG wireless networks in terms of opportunities, security and privacy challenges, and proposed solutions. Several studies also present robust frameworks focusing on accurately detecting adversarial attacks. The authors in [19] proposed DeSVig, a decentralized swift vigilance framework, to detect adversarial attacks against industrial artificial intelligence systems (IAISs). According to the results, the proposed framework can detect adversarial attacks, such as DeepFool and FGSM, with high accuracy and low delay, and it outperforms current state-of-the-art defense approaches in robustness, efficiency, and scalability.
C. Purpose and Contributions
Channel estimation is one of the most challenging problems in 5G and beyond networks because existing techniques have difficulty capturing the correlations among many resources, system parameters, and dynamic communication channel characteristics. Therefore, sophisticated AI-based algorithms can help model these highly nonlinear correlations and estimate the channel characteristics [20]. In our recent papers [21] and [22], adversarial attacks and mitigation methods were investigated, along with a proposed framework, for mmWave beamforming prediction models in next-generation networks. This study provides a comprehensive vulnerability analysis of deep learning (DL)-based channel estimation models, trained with a dataset obtained from MATLAB's 5G Toolbox, against adversarial attacks and a defensive distillation-based mitigation method. It implements widely used adversarial attacks, from the Fast Gradient Sign Method (FGSM), Basic Iterative Method (BIM), Projected Gradient Descent (PGD), and Momentum Iterative Method (MIM) to Carlini & Wagner (C&W), as well as a defensive distillation-based mitigation method for DL-based models. The results show that the DL-based models used in these networks are vulnerable to adversarial attacks, while the proposed mitigation method makes the models more robust against such attacks. The source code is available from GitHub.1
The scope of this study is limited to one of the 5G physical layer applications, i.e., DL-based channel estimation, its vulnerability analysis under the selected adversarial attacks, and the proposed defensive distillation mitigation method. Other attack types also exist; the C&W attack, for example, is compute-intensive and requires more iterations than traditional methods. In this study, we use less compute-intensive and more efficient approaches to create adversarial examples.
Preliminaries
This section presents a brief overview of the channel estimation and the adversarial ML attacks, such as FGSM, BIM, PGD, MIM, and C&W, along with defensive distillation-based mitigation. Dataset description and scenarios are also given with the selected performance metrics to evaluate the models’ performance under normal and attack conditions.
A. Channel Estimation for Communication System
In a wireless communication system, the channel characteristics describe the properties of the communication link between the transmitter and the receiver; they are also known as channel state information (CSI). When the signal is transmitted through a communication channel, i.e., the medium, the received signal contains distortion and added noise. To decode the received signal, the unwanted components introduced by the channel, i.e., distortion and noise, must be removed. Identifying the channel characteristics is the first step in achieving this, and it is called the channel estimation process. In the simplest case, the received signal is an attenuated and delayed copy of the transmitted signal, with attenuation factor $h_{0}$ and delay $\tau_{0}$:\begin{equation*} y(t) = h_{0} \, x(t - \tau _{0})\tag{1}\end{equation*}
However, the received signal generally comprises several reflected and scattered paths, i.e., multiple paths, each with a different attenuation and delay. The composite received signal is:\begin{equation*} y(t)=\sum _{l = 0}^{L}h_{l} \, x(t - \tau _{l})\tag{2}\end{equation*} where $L$ is the number of paths.
Mobility causes a Doppler frequency shift, i.e., a change in the wavelength or frequency of the waves due to the observer being in motion with respect to the wave source. The Doppler effect plays an important role in telecommunications and in the computation of signal path loss and fading due to multi-path propagation. In addition, the channel characteristics, i.e., $h_{l}$ and $\tau_{l}$, vary with time, so the received signal becomes:\begin{equation*} y(t)=\sum _{l = 0}^{L}h_{l}^{t} \, x(t - \tau _{l}^{t})\tag{3}\end{equation*} where the superscript $t$ denotes the time dependence of the channel gains and delays.
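To make the multi-path model in (2) concrete, the short Python sketch below applies hypothetical tap gains and integer sample delays to a discrete-time signal; all values (taps, delays, and the test signal) are illustrative assumptions rather than parameters used elsewhere in this paper.

import numpy as np

# Hypothetical multi-path channel: complex tap gains h_l and integer sample delays tau_l.
h = np.array([1.0 + 0.0j, 0.6 - 0.2j, 0.3 + 0.1j])
tau = np.array([0, 3, 7])

def multipath_channel(x, h, tau):
    # Apply y[n] = sum_l h_l * x[n - tau_l] to a discrete-time signal x.
    y = np.zeros(len(x), dtype=complex)
    for h_l, t_l in zip(h, tau):
        y[t_l:] += h_l * x[:len(x) - t_l]
    return y

x = np.exp(1j * 2 * np.pi * 0.05 * np.arange(64))  # example transmitted signal
y = multipath_channel(x, h, tau)                   # received signal (noise-free)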
Channel estimation plays an important part in wireless communications for increasing the capacity and the overall system performance. There is a high demand for new wireless networks, higher data rates, better quality of service, and higher network capacity, so new promising technologies are needed to meet these requirements. A migration from Single Input Single Output (SISO) to Multiple Input Multiple Output (MIMO) antenna technology has started with NextG networks. Channel estimation is at the core of next-generation communication systems, i.e., 5G and beyond, and is performed in different ways for SISO and MIMO approaches at the receiver side. Channel estimation algorithms can be classified into three main categories, i.e., blind channel estimation, semi-blind channel estimation, and training-based estimation [23]. Among them, training-based estimation is widely used in communication systems. The general approach is to insert known reference symbols, i.e., pilots, into the transmitted signal and then interpolate the channel response based on these known pilot symbols. The process works in the following steps: (1) develop a mathematical model to correlate the transmitted and received signals using the channel characteristics, (2) embed a predefined signal, i.e., a pilot signal, into the transmitted signal, (3) transmit the signal through the channel, (4) receive the signal, distorted and/or with added noise, through the channel, (5) decode the pilot signal from the received signal, (6) compare the transmitted and received pilot signals, and (7) find the correlation between the transmitted and the received signals.
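As a concrete illustration of steps (2)-(7), the Python sketch below performs a simple least-squares (LS) estimate at known pilot subcarriers and interpolates it over the remaining subcarriers; the pilot positions, grid size, and channel values are hypothetical and do not correspond to the 5G numerology used later in this paper.

import numpy as np

num_subcarriers = 64
pilot_idx = np.arange(0, num_subcarriers, 8)        # hypothetical pilot positions
pilots_tx = np.ones(len(pilot_idx), dtype=complex)  # known pilot symbols

# Hypothetical "true" frequency-domain channel and noisy received pilots.
h_true = np.exp(-1j * 2 * np.pi * np.arange(num_subcarriers) / 32)
noise = 0.05 * (np.random.randn(len(pilot_idx)) + 1j * np.random.randn(len(pilot_idx)))
rx_pilots = h_true[pilot_idx] * pilots_tx + noise

# Steps (6)-(7): LS estimate at the pilots, then interpolate over all subcarriers.
h_ls = rx_pilots / pilots_tx
h_est = np.interp(np.arange(num_subcarriers), pilot_idx, h_ls.real) \
        + 1j * np.interp(np.arange(num_subcarriers), pilot_idx, h_ls.imag)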
There have been many efforts on channel estimation algorithms using different approaches in the literature. However, it remains a challenging problem due to the computational complexity of the algorithms, the enormous number of mathematical operations involved, and low channel estimation accuracy. An equalization method is typically used to reduce the complexity and recover the frequency response at the receiver side [24]. With the introduction of machine learning methods into 5G and beyond communication systems, channel estimation algorithms have improved in terms of lower computational complexity and higher estimation accuracy compared to conventional channel estimation algorithms [25]. In addition, the nature of deep learning-based algorithms can save significant computational power for the complex analysis needed in channel estimation [26]. However, the feasibility of using machine learning methods in channel estimation can still be questioned. The study in [27] presented several deep learning-based channel estimation algorithms, i.e., a fully-connected deep neural network (FDNN), a convolutional neural network (CNN), and bidirectional long short-term memory (bi-LSTM), with different scenarios of fading multi-path channel models for 5G networks. According to the results, all three deep learning-based algorithms reduced the channel estimation error and bit error ratio and were robust to changes in the Doppler frequency; among them, bi-LSTM provided the most significant reduction in channel estimation error. The authors in [28] also proposed a CNN combined with a projected gradient descent algorithm to demonstrate the feasibility of using machine learning methods in channel estimation.
A channel model is a representation of the channel through which a transmitted signal travels to the receiver. In the simulation environment, channel models are typically classified into two categories, i.e., the clustered delay line (CDL) model and the tapped delay line (TDL) model. A CDL model is used when the received signal consists of multiple delayed clusters, where each cluster contains multipath components with the same delay but slight variations in the angles of departure and arrival, i.e., MIMO. On the other hand, a TDL model is defined for simplified evaluations of CDL, i.e., non-MIMO or SISO evaluations. These channel models are well defined in the technical report released by 3GPP, i.e., the 3rd Generation Partnership Project [29]. According to this report, CDL/TDL models are defined for the frequency range from 0.5 GHz to 100 GHz with a maximum bandwidth of 2 GHz. For both CDL and TDL, five channel profiles are defined: A, B, and C for non-line-of-sight (NLOS) propagation, and D and E for line-of-sight (LOS) propagation. Power, delay, and angular information are used to define CDL models, while power, delay, and Doppler spectrum information are used for TDL models.
B. Convolutional Neural Networks
The convolutional neural network (CNN) is a neural network that has been shown to be very successful for image recognition [30], [31], [32]. Compared to a fully-connected neural network, a CNN can extract the relevant information with a smaller number of parameters. The main idea of the CNN is that the structure of an image can be captured by the convolution operation. Suppose the input image is $\mathbf{x}$ of width $W$ and height $H$, and $\mathbf{W}$ is a convolution filter. The convolution output is \begin{equation*} \mathbf {y} = \mathbf {W} \ast \mathbf {x} = \sum _{i=1}^{W} \sum _{j=1}^{H} \mathbf {W}_{i,j}\, \mathbf {x}_{i-s,j-s},\tag{4}\end{equation*} where $s$ denotes the offset of the filter over the image.
The CNN is composed of several types of layers. The convolution layer is the most critical layer of the CNN, consisting of several filters. Each filter extracts a particular type of feature from an input image. The pooling layer is a down-sampling layer, which reduces the size of the convolution output. Each pooling operation replaces several adjacent values with the maximal value or the mean value. The fully-connected layer is a standard neural network layer that combines all the features extracted by the convolution layer. The softmax layer is a classification layer to classify the input data.
The input image is a two-dimensional matrix. The filter in the convolution layer extracts a particular type of feature from the input image. For example, the leftmost filter extracts horizontal lines, and the middle filter extracts diagonal lines. The output of the convolution layer is then sent to the pooling layer, which reduces the size of the data. The output of the pooling layer is then sent to the fully-connected layer, which combines all the features extracted by the convolution layer. The output of the fully-connected layer is then sent to the softmax layer, which classifies the data.
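The minimal Keras sketch below mirrors the layer sequence described above (convolution, pooling, fully-connected, softmax); the input shape, filter counts, and number of classes are illustrative assumptions and do not correspond to the channel estimation architecture used later in this paper.

import tensorflow as tf
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(16, (3, 3), activation="relu", padding="same"),  # feature extraction
    layers.MaxPooling2D((2, 2)),                                   # down-sampling
    layers.Flatten(),
    layers.Dense(64, activation="relu"),                           # combine extracted features
    layers.Dense(10, activation="softmax"),                        # classification layer
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()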
C. Adversarial Attacks
ML-based models are trained to automatically learn the underlying patterns and correlations in data. Once an ML-based model is trained, it can be used to predict the patterns in new data. The accuracy of the trained model on unseen data, also called generalization, is essential for high performance. However, a trained model can be manipulated by adding noise to the data, i.e., through targeted and non-targeted adversarial ML attacks. Adversarial ML attacks are generated by adding a perturbation to a legitimate data point, i.e., a craftily generated adversarial example with only a slight difference from the original input, to fool the ML-based model. In such attacks, the attacker does not change the training instances; instead, small perturbations are applied to the input instances during the model's inference period. Existing attacks and defenses developed for images can be applied to attack and defend models in other fields [33], [34], [35]. Cleverly designed adversarial examples can fool deep neural networks with high success rates on test images, and adversarial examples can also be transferred from one model to another. There are various kinds of adversarial ML attacks, such as evasion attacks, data poisoning attacks, and model inversion attacks [36]. An evasion attack aims to cause an ML-based model to improperly classify adversarial examples as legitimate data points and can be targeted or non-targeted: targeted attacks force the model to classify the adversarial example as a specific target class, whereas non-targeted attacks push the model to classify the adversarial example as any class other than the ground truth. Data poisoning aims to inject malicious data points into the training data so that the ML-based model produces the desired outcome. Model inversion aims to generate new data points close to the original data points in order to extract sensitive information about specific data points. In this study, we focus on evasion attacks. Taking the channel estimation CNN model as an example, let $\mathbf{x}$ denote the input, $\mathbf{y}$ the corresponding label, $\omega$ the model parameters, and $\ell(\omega, \mathbf{x}, \mathbf{y})$ the loss function. The adversarial perturbation $\sigma$ is obtained by solving \begin{equation*} \sigma ^{*} = \underset {\|\sigma \|_{p} \leq \epsilon }{\arg \max }\,\,\ell (\omega,\mathbf {x}+\sigma,\mathbf {y})\tag{5}\end{equation*} where $\epsilon$ is the attack budget that bounds the $p$-norm of the perturbation.
Figure 1 shows a typical adversarial sample generation procedure.
These adversarial attack types are given as follows.
1) FGSM
Fast Gradient Sign Method (FGSM): FGSM is one of the most popular and simplest approaches to constructing adversarial examples. It is a one-step gradient-based attack that computes the gradient of the loss function with respect to the input and perturbs the input along the sign of that gradient (see the sketch below):
1) Compute the gradient of the loss function, $\nabla _{\mathbf {x}}\ell (\mathbf {x},\mathbf {y})$.
2) Add the scaled sign of the gradient to the input data, $\mathbf {x}_{adv} = \mathbf {x} + \epsilon \times sign(\nabla _{\mathbf {x}}\ell)$.
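The sketch below shows a minimal TensorFlow implementation of this one-step attack against a regression-style model such as the channel estimator; the model, the mean-squared-error loss, and the value of $\epsilon$ are placeholders rather than the exact configuration used in our experiments.

import tensorflow as tf

def fgsm_attack(model, x, y, epsilon=0.1):
    # One-step FGSM: x_adv = x + epsilon * sign(grad_x loss(x, y)).
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.MeanSquaredError()(y, model(x))
    grad = tape.gradient(loss, x)
    return x + epsilon * tf.sign(grad)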
2) BIM
Basic Iterative Method (BIM): BIM is one of the most popular attacks and is an iterative gradient-based attack derived from FGSM [39]. It repeatedly computes the gradient of the loss function with respect to the current adversarial example and takes steps along its sign (see the sketch below):
1) Initialize the adversarial example as $\mathbf {x}_{adv} = \mathbf {x}$.
2) Iterate $N$ times, for $i=0, 1, 2, 3,\ldots, N$:
3) Compute the gradient of the loss function, $\nabla _{\mathbf {x}}\ell (\mathbf {x}_{adv},\mathbf {y})$.
4) Add the scaled sign of the gradient to the adversarial example, $\mathbf {x}_{adv} = \mathbf {x}_{adv} + \epsilon \times sign(\nabla _{\mathbf {x}}\ell)$.
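A minimal TensorFlow sketch of the iterative attack is shown below. It follows the common BIM formulation with a small step size and clipping of the accumulated perturbation to the $\epsilon$-ball; the step size, iteration count, and loss function are assumptions rather than the exact settings used in our experiments.

import tensorflow as tf

def bim_attack(model, x, y, epsilon=0.1, alpha=0.01, num_iter=10):
    # Iterative FGSM: repeat small signed-gradient steps, keeping the total
    # perturbation within the epsilon ball around the original input.
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    x_adv = tf.identity(x)
    mse = tf.keras.losses.MeanSquaredError()
    for _ in range(num_iter):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = mse(y, model(x_adv))
        grad = tape.gradient(loss, x_adv)
        x_adv = x_adv + alpha * tf.sign(grad)
        x_adv = tf.clip_by_value(x_adv, x - epsilon, x + epsilon)  # stay within the eps-ball
    return x_adv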
3) PGD
PGD is one of the most popular and powerful gradient-based attacks [40], [41]. Like BIM, it iteratively computes the gradient of the loss function with respect to the input, but it adds randomness to the iterations (see the sketch below):
1) Initialize the adversarial example as $\mathbf {x}_{adv} = \mathbf {x}$.
2) Iterate $N$ times, for $i=0, 1, 2, 3,\ldots, N$:
3) Compute the gradient of the loss function, $\nabla _{\mathbf {x}}\ell (\mathbf {x}_{adv},\mathbf {y})$.
4) Add random noise to the gradient, $\hat {\nabla }_{\mathbf {x}}\ell (\mathbf {x}_{adv},\mathbf {y}) = \nabla _{\mathbf {x}}\ell (\mathbf {x}_{adv},\mathbf {y}) + \mathcal {U}(\epsilon)$.
5) Add the scaled sign of the perturbed gradient to the adversarial example, $\mathbf {x}_{adv} = \mathbf {x}_{adv} + \alpha \times sign(\hat {\nabla }_{\mathbf {x}}\ell)$.
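The sketch below follows the common PGD formulation with a random starting point inside the $\epsilon$-ball and projection back onto the ball after every step; the step size $\alpha$, iteration count, and loss function are assumptions, and practical implementations differ in where randomness is injected.

import tensorflow as tf

def pgd_attack(model, x, y, epsilon=0.1, alpha=0.01, num_iter=10):
    # PGD: random start inside the epsilon ball, signed-gradient steps,
    # and projection back onto the ball after every step.
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    x_adv = x + tf.random.uniform(tf.shape(x), -epsilon, epsilon)
    mse = tf.keras.losses.MeanSquaredError()
    for _ in range(num_iter):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = mse(y, model(x_adv))
        grad = tape.gradient(loss, x_adv)
        x_adv = x_adv + alpha * tf.sign(grad)
        x_adv = tf.clip_by_value(x_adv, x - epsilon, x + epsilon)  # projection step
    return x_adv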
4) MIM
Momentum Iterative Method (MIM): MIM is a variant of the BIM attack that introduces a momentum term and integrates it into the iterative attack [42]. It accumulates the gradients of the loss function with respect to the input across iterations (see the sketch below):
1) Initialize the adversarial example $\mathbf {x}_{adv} = \mathbf {x}$ and the momentum $\mu = 0$.
2) Iterate $N$ times, for $i=0, 1, 2, 3,\ldots, N$:
3) Compute the gradient of the loss function, $\nabla _{\mathbf {x}}\ell (\mathbf {x}_{adv},\mathbf {y})$.
4) Update the momentum, $\mu = \mu + \frac {\eta }{\epsilon } \times \nabla _{\mathbf {x}}\ell (\mathbf {x}_{adv},\mathbf {y})$.
5) Add random noise to the gradient, $\hat {\nabla }_{\mathbf {x}}\ell (\mathbf {x}_{adv},\mathbf {y}) = \nabla _{\mathbf {x}}\ell (\mathbf {x}_{adv},\mathbf {y}) + \mathcal {U}(\epsilon)$.
6) Add the scaled sign of the gradient to the adversarial example, $\mathbf {x}_{adv} = \mathbf {x}_{adv} + \alpha \times sign(\hat {\nabla }_{\mathbf {x}}\ell)$.
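The sketch below follows the standard momentum formulation (accumulating normalized gradients in a momentum term and stepping along its sign); the decay factor, step size, and iteration count are assumptions, and the random-noise term listed above is omitted for simplicity.

import tensorflow as tf

def mim_attack(model, x, y, epsilon=0.1, alpha=0.01, decay=1.0, num_iter=10):
    # Momentum Iterative Method: accumulate normalized gradients in a momentum
    # term and step along its sign, staying within the epsilon ball.
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    x_adv = tf.identity(x)
    momentum = tf.zeros_like(x)
    mse = tf.keras.losses.MeanSquaredError()
    for _ in range(num_iter):
        with tf.GradientTape() as tape:
            tape.watch(x_adv)
            loss = mse(y, model(x_adv))
        grad = tape.gradient(loss, x_adv)
        grad = grad / (tf.reduce_mean(tf.abs(grad)) + 1e-12)   # normalize the gradient
        momentum = decay * momentum + grad                     # update the momentum term
        x_adv = x_adv + alpha * tf.sign(momentum)
        x_adv = tf.clip_by_value(x_adv, x - epsilon, x + epsilon)
    return x_adv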
5) C&W
The C&W attack was proposed as a targeted evasion attack by Carlini and Wagner [43]. It can be viewed as a zero-sum game between the attacker and the model: the total amount of value in the game is fixed, so whatever the attacker gains in prediction error, the model loses. The C&W method is an iterative attack that constructs adversarial examples by approximately solving a minimization problem of the form \begin{equation*} \min _{x \in \mathcal {X}} \mathbb {E}_{y \in \mathcal {Y}} \left [f(x) - y\right]^{2}\end{equation*} The most important difference between C&W and other adversarial ML attacks is that C&W does not require an attack budget $\epsilon$ to be specified; instead, it minimizes the size of the perturbation as part of its objective.
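The heavily simplified sketch below illustrates a C&W-style unconstrained optimization adapted to a regression model: the perturbation norm is minimized while the prediction error is maximized, with no explicit $\epsilon$ budget. The trade-off constant c, learning rate, and number of steps are assumptions, and the original attack on classifiers uses a margin-based objective and a change of variables that are omitted here.

import tensorflow as tf

def cw_style_attack(model, x, y, c=1.0, lr=0.01, num_steps=100):
    # C&W-style sketch for regression: minimize ||delta||^2 - c * MSE(model(x + delta), y).
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    delta = tf.Variable(tf.zeros_like(x))
    opt = tf.keras.optimizers.Adam(learning_rate=lr)
    mse = tf.keras.losses.MeanSquaredError()
    for _ in range(num_steps):
        with tf.GradientTape() as tape:
            objective = tf.reduce_sum(tf.square(delta)) - c * mse(y, model(x + delta))
        grad = tape.gradient(objective, delta)
        opt.apply_gradients([(grad, delta)])
    return x + delta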
D. Defensive Distillation
Knowledge distillation was introduced by Hinton et al. [44] to compress the knowledge of a large, densely connected neural network (the teacher) into a smaller, sparsely connected neural network (the student); it was shown that the student can reach a performance similar to that of the teacher [44]. In the initial work, knowledge distillation was used to solve a classification problem, and the approach is also called the teacher-student framework. Papernot et al. [45] proposed this technique as an adversarial ML defense and demonstrated that it can make models more robust against adversarial examples; the main contribution of that work was to bring knowledge distillation into adversarial ML defense. Defensive distillation is an ML framework that can enhance the robustness of a model for classification problems. The first step is to train the teacher model with a high temperature ($T$) in the softmax function, which converts the logits $z$ into softened probabilities:\begin{equation*} p_{softmax}(z, T) = \frac {e^{z/T}}{\sum _{i=1}^{n}e^{z_{(i)}/T}}\tag{6}\end{equation*}
The student model is then trained at the same temperature $T$ by minimizing the cross-entropy between the teacher's softened outputs and its own softened predictions:\begin{align*} \mathcal {L}_{student}(T)=&-\frac {1}{N} \sum _{i=1}^{N} \sum _{j=1}^{n} \mathbf {y}_{ij} \cdot \log p_{softmax}(z_{ij}, T) \\=&-\frac {1}{N} \sum _{i=1}^{N} \sum _{j=1}^{n} \mathbf {y}_{ij} \cdot \log \frac {e^{z_{ij}/T}}{\sum _{k=1}^{n}e^{z_{ik}/T}}\tag{7}\end{align*} where $N$ is the number of training samples, $n$ is the number of classes, $\mathbf{y}_{ij}$ is the (soft) label of sample $i$ for class $j$, and $z_{ij}$ is the corresponding logit of the student model.
The teacher model is trained with the analogous temperature-scaled cross-entropy loss:\begin{equation*} \mathcal {L}_{teacher}(T) = -\frac {1}{N} \sum _{i=1}^{N} \sum _{j=1}^{n} \mathbf {y}_{ij} \cdot \log \frac {e^{z_{ij}/T}}{\sum _{k=1}^{n}e^{z_{ik}/T}}\tag{8}\end{equation*}
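The softened softmax in (6) can be written directly in a few lines; the NumPy sketch below uses an arbitrary logit vector purely for illustration.

import numpy as np

def softmax_with_temperature(z, T):
    # p_i = exp(z_i / T) / sum_j exp(z_j / T)
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])               # hypothetical logits
print(softmax_with_temperature(logits, T=1))     # sharp distribution
print(softmax_with_temperature(logits, T=20))    # softened distribution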
Deep learning approaches have been shown to perform exceptionally well on a wide range of computer vision tasks (e.g., image classification, object and action detection, scene segmentation, image generation, etc.). However, deep neural networks (DNNs) require large amounts of training data, which are not always available for new tasks or domains. Several knowledge distillation methods have been proposed to address this issue by training a smaller student network to mimic the predictions of a larger and more accurate teacher network.
Distillation has also been applied in intelligent systems, such as knowledge-based and rule-based systems, to reduce the system's size and improve its performance by improving the quality of its knowledge. The differences between the teacher and student models can be considered a form of regularization, which is crucial to prevent overfitting. Algorithm 1 shows the pseudocode of distillation; a compact Keras sketch is given after the algorithm.
Algorithm 1 Pseudocode of Distillation
Input: Dataset $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}$, trained teacher model, temperature $T$, number of epochs $E$, batch size $B$
Output: Trained student model
Initialize the weights of the student model
for each epoch $e = 1, \ldots, E$ do
  Randomly shuffle the dataset $\mathcal{D}$
  for each mini-batch of $B$ samples do
    Extract the next mini-batch from $\mathcal{D}$
    Forward propagate the samples through the teacher (at temperature $T$) to obtain soft labels
    Forward propagate the samples through the student model
    Compute the loss between the student's softened predictions and the soft labels
    Backpropagate the loss through the student model
    Update the weights of the student model
  end for
end for
return Trained student model
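A compact Keras sketch of the training loop in Algorithm 1 is given below. It assumes a pre-trained teacher and a compatible student model that both output logits; the temperature, optimizer, epoch count, and batch size are illustrative choices, not the exact settings used in our experiments.

import tensorflow as tf

def distill(teacher, student, x_train, temperature=20.0, epochs=10, batch_size=256):
    # Train the student on the teacher's temperature-softened outputs.
    opt = tf.keras.optimizers.Adam()
    ds = tf.data.Dataset.from_tensor_slices(x_train).shuffle(1024).batch(batch_size)
    for _ in range(epochs):
        for x_batch in ds:
            soft_labels = tf.nn.softmax(teacher(x_batch, training=False) / temperature)
            with tf.GradientTape() as tape:
                student_probs = tf.nn.softmax(student(x_batch, training=True) / temperature)
                loss = tf.reduce_mean(
                    tf.keras.losses.categorical_crossentropy(soft_labels, student_probs))
            grads = tape.gradient(loss, student.trainable_variables)
            opt.apply_gradients(zip(grads, student.trainable_variables))
    return student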
In a typical wireless communication system, channel estimation is performed by the base station with the help of pilot signals sent by the user equipment (UE) during the uplink; the base station, in turn, sends pilot signals toward the UE, and the UE reports the estimated channel information for the downlink transmission. Network operators and service providers are responsible for running their operations properly and meeting their obligations to customers and the public regarding privacy and data confidentiality. However, network operations can be vulnerable to adversarial machine learning attacks, especially in 5G and beyond, due to the use of machine learning-based applications. Figure 2 shows all stages of training the channel estimation prediction model (i.e., the student model) protected against adversarial ML attacks and its use in base stations.
Dataset Description and Scenario
MATLAB 5G Toolbox provides a wide range of reference examples for next-generation network communication systems, such as 5G [46]. It also allows users to customize and generate several types of waveforms, antennas, and channel models to obtain datasets for DL-based models. In this study, the dataset used to train the DL-based channel estimation models is generated through a reference example in MATLAB 5G Toolbox, i.e., "Deep Learning Data Synthesis for 5G Channel Estimation". In the example, a convolutional neural network (CNN) is used for channel estimation. A single-input single-output (SISO) antenna configuration is used, utilizing the physical downlink shared channel (PDSCH) and demodulation reference signal (DM-RS) to create the channel estimation model.
The reference example in the toolbox generates 256 training datasets, i.e., it transmits/receives the signal 256 times, for the DL-based channel estimation model. Each dataset consists of 8568 data points, i.e., 612 subcarriers, 14 OFDM symbols, and 1 antenna. Each data point of the training dataset is converted from a complex (real and imaginary) 612-by-14 matrix into a real-valued 612-by-14-by-2 matrix so that the real and imaginary parts are provided separately as inputs to the neural network during training. This is because the resource grids consist of complex data points with real and imaginary parts in the channel estimation scenario, whereas the CNN model processes the resource grids as 2-D images of real numbers. In this example, the training dataset is converted into 4-D arrays of size 612-by-14-by-1-by-2N, where N is the number of training examples, i.e., 256.
Complex numbers are used throughout wireless communication technologies; digital wireless communication uses the complex number system to modulate and demodulate wireless signals. The most significant distinction between the real and complex number systems is that the complex number system contains more than one dimension. Adversarial ML attacks, on the other hand, operate on real numbers to cross the decision boundaries of the victim DL models, and the final malicious inputs are in the real number domain. To address this, the complex numbers are split into their real and imaginary parts. Table 1 shows an example of the dataset.
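The short NumPy sketch below shows this real/imaginary split for one resource grid; the grid follows the 612 x 14 dimensions described above, with random complex values standing in for an actual channel realization.

import numpy as np

num_subcarriers, num_symbols = 612, 14

# Hypothetical complex resource grid (one training example).
grid = (np.random.randn(num_subcarriers, num_symbols)
        + 1j * np.random.randn(num_subcarriers, num_symbols))

# Split into real and imaginary planes so the CNN sees a real-valued 612 x 14 x 2 input.
grid_real = np.stack([grid.real, grid.imag], axis=-1)
print(grid_real.shape)  # (612, 14, 2)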
For each set of the training dataset, a new channel characteristic is generated based on various channel parameters, such as the delay profile (TDL-A, TDL-B, TDL-C, TDL-D, TDL-E), delay spread (1-300 ns), Doppler shift (5-400 Hz), and signal-to-noise ratio (SNR) between 0 and 10 dB. Each transmitted waveform with the DM-RS symbols is stored in the training dataset, and the perfect channel values are stored in the training labels. The CNN-based channel estimation model is trained with the generated dataset. The MATLAB 5G Toolbox also allows several communication channel parameters to be tuned, such as the frequency, subcarrier spacing, number of subcarriers, cyclic prefix type, antennas, channel paths, bandwidth, code rate, and modulation. The channel estimation scenario parameters and their values are given in Table 2.
The dataset is split into training and validation sets to avoid overfitting the training data. The training set is used to train and fit the model, while the validation set is used to monitor the performance of the trained neural network at certain intervals, i.e., 5 times per epoch. Training is expected to stop when the validation loss stops decreasing, i.e., when the model stops improving. In this study, most of the dataset is used for training, i.e., 80% for training and 20% for testing.
Simulation Model, Settings and Performance Metric
A. Simulation Model
Figure 3 shows the CNN-based DL model used in this paper for the channel estimation. The input to the model is the pilot signals with different subcarriers and OFDM symbols. The input is first passed through a convolutional layer, followed by a max-pooling layer. The output of the max-pooling layer is then passed through a fully connected layer, followed by a softmax layer. The final output of the model is the channel estimation.
We use the channel estimation dataset described in Section III to train the model. We use five different attacks (i.e., FGSM, BIM, MIM, PGD, and C&W) to evaluate the proposed mitigation methods. The deep learning-based channel estimation model is trained in the TensorFlow environment. The proposed mitigation methods are implemented in the Keras environment. The MSE performance metric is used to evaluate the accuracy of the channel estimation model.
B. Simulation Settings
The teacher and student models are DNNs with 3 convolutional layers. They are trained using stochastic gradient descent with a momentum of 0.9 and a learning rate of 0.001 for 100 epochs. The batch size is set to 256. Table 3 shows the DL model parameters.
Figure 3 shows the architecture of the teacher and student models.
The models are supervised regression models trained to predict the channel parameters defined at the receiver. The input and output size is 612 × 14, matching the subcarriers and OFDM symbols of the resource grid described in Section III.
Figure 4 shows the training history of all three models.
C. Performance Metric
The performance metric, MSE (mean squared error), is used to evaluate and compare the CNN-based models, and the MSE scores are used for further analyses of the models. MSE measures the average squared difference between the actual and predicted values; it equals zero when a model makes no error, and it grows as the model error increases. The MSE is given by\begin{equation*} MSE = \frac {1}{n}\sum _{t=1}^{n}{(Y_{t} - {\hat {Y}}_{t})}^{2}\tag{9}\end{equation*}
Evaluation and Performance Results
This section provides the experimental results used to evaluate the proposed defensive distillation-based mitigation method for DL-based channel estimation models in next-generation networks. We use the attack success ratio (ASR) as the performance metric. ASR measures the relative increase in prediction error caused by the attack over the test samples; a higher ASR indicates a more effective attack. The following equation is used to calculate ASR:\begin{equation*} \text {ASR} = \frac {1}{m}\sum _{i=1}^{m}{\frac {MSE(\mathbf {x}^{adv}_{(i)},\mathbf {y}_{(i)})-MSE(\mathbf {x}_{(i)},\mathbf {y}_{(i)})}{MSE(\mathbf {x}^{adv}_{(i)},\mathbf {y}_{(i)})}}\tag{10}\end{equation*} where $m$ is the number of test samples.
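The ASR in (10) can be computed directly from per-sample MSE values; the NumPy sketch below uses hypothetical arrays of clean and adversarial MSEs for illustration.

import numpy as np

def attack_success_ratio(mse_clean, mse_adv):
    # ASR = mean over samples of (MSE_adv - MSE_clean) / MSE_adv.
    mse_clean = np.asarray(mse_clean, dtype=float)
    mse_adv = np.asarray(mse_adv, dtype=float)
    return np.mean((mse_adv - mse_clean) / mse_adv)

# Hypothetical per-sample MSE values for clean and adversarial inputs.
mse_clean = np.array([0.010, 0.020, 0.015])
mse_adv = np.array([0.100, 0.080, 0.120])
print(attack_success_ratio(mse_clean, mse_adv))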
Table 5 shows the initial prediction performance results of all models with the test dataset.
The first experiment is to perform attacks on the undefended model, as shown in Table 6.
The results of the first experiment show that the initial (undefended) DL model is vulnerable to adversarial ML attacks. As expected, the ASR value is positively correlated with the attack power $\epsilon$: the larger the perturbation budget, the higher the attack success ratio.
Experimental results for the proposed defensive distillation-based mitigation method are shown in Table 7.
The experimental results show that the proposed method improves the accuracy of the channel estimation model under attack and provides better results against all the attacks considered (i.e., FGSM, BIM, MIM, PGD, and C&W).
Figure 5 shows the MSE results with 6 different $\epsilon$ values for each attack. These experimental results for the proposed defensive distillation-based mitigation method again show that it improves the accuracy of the channel estimation model under attack.
Figure 6 shows the MSE change with different $\epsilon$ values.
Discussion
This study provides a comprehensive vulnerability analysis of the DL-based channel estimation model. The model's vulnerabilities are studied under various adversarial attacks, including FGSM, BIM, PGD, MIM, and C&W, together with the mitigation method, i.e., defensive distillation. The results show that CNN-based channel estimation models are vulnerable to these adversarial attacks, and the attack success ratio is quite high, i.e., around 0.9, under higher attack powers (larger $\epsilon$).
Observation 1:
The DL-based channel estimation models are vulnerable to adversarial attacks, especially BIM, MIM, and PGD.
Observation 2:
BIM, MIM, and PGD attacks achieve the highest attack success rates.
Observation 3:
The DL-based channel estimation models are more robust against C&W attacks.
Observation 4:
A strong negative correlation exists between the attack power $\epsilon$ and the performance of the channel estimation models.
Observation 5:
The proposed mitigation method, i.e., defensive distillation, offers a better performance against adversarial attacks.
Conclusion and Future Work
Mobile wireless communication networks are developing rapidly, driven by high demand and advances in communication and computing technologies. The last few years have seen remarkable growth in the wireless industry, especially for NextG networks. This paper provides a comprehensive vulnerability analysis of deep learning (DL)-based channel estimation models against adversarial attacks (i.e., FGSM, BIM, PGD, MIM, and C&W) and a defensive distillation-based mitigation method in NextG networks. The results confirm that the original DL-based channel estimation model is significantly vulnerable to adversarial attacks, especially BIM, MIM, and PGD, and that the attack success rate increases under heavier adversarial attacks, i.e., with a larger attack power $\epsilon$, while the proposed defensive distillation-based mitigation method makes the model considerably more robust against these attacks.