
Countering Evasion Attacks for Smart Grid Reinforcement Learning-Based Detectors



Abstract:

Fraudulent customers in smart power grids employ cyber-attacks by manipulating their smart meters and reporting false consumption readings to reduce their bills. To combat these attacks and mitigate financial losses, various machine learning-based electricity theft detectors have been proposed. Unfortunately, these detectors are vulnerable to serious cyber-attacks, specifically evasion attacks. The objective of this paper is to investigate the robustness of deep reinforcement learning (DRL)-based detectors against our proposed evasion attacks through a series of experiments. Firstly, we introduce DRL-based electricity theft detectors implemented using the double deep Q networks (DDQN) algorithm. Secondly, we propose a DRL-based attack model to generate adversarial evasion attacks in a black box attack scenario. These evasion samples are generated by modifying malicious reading samples to deceive the detectors and make them appear as benign samples. We leverage the attractive features of reinforcement learning (RL) to determine optimal actions for modifying the malicious samples. Our DRL-based evasion attack model is compared with an FGSM-based evasion attack model. The experimental results reveal a significant degradation in detector performance due to the DRL-based evasion attack, achieving an attack success rate (ASR) ranging from 92.92% to 99.96%. Thirdly, to counter these attacks and enhance detection robustness, we propose hardened DRL-based defense detectors using an adversarial training process. This process involves retraining the DRL-based detectors on the generated evasion samples. The proposed defense model achieves outstanding detection performance, with a degradation in ASR ranging from 1.80% to 9.20%. Finally, we address the challenge of whether the DRL-based hardened defense model, which has been adversarially trained on DRL-based evasion samples, is capable of defending against FGSM-based evasion samples, and vice versa. We conduct extensive experime...
Published in: IEEE Access ( Volume: 11)
Page(s): 97373 - 97390
Date of Publication: 06 September 2023
Electronic ISSN: 2169-3536



SECTION I.

Introduction

With the rapid advancements in communication and power control systems, power grids have undergone significant improvements and developments. These changes have transformed conventional power grids into the smart grid (SG), which plays important roles in enhancing the reliability, efficiency, resiliency, and sustainability of the power system [1], [2]. The SG facilitates the reliable delivery of electricity, optimizes and regulates grid operation, and monitors system performance. Its core structure consists of several main components, including the system operator (SO), electricity production stations, and the advanced metering infrastructure (AMI), along with the transmission and distribution systems [1], [3]. The AMI network is a major component of the SG, as it provides bidirectional communication between the SO, installed at the electric utility side, and the smart meters (SMs) installed at customers’ residences. SMs regularly monitor electricity usage and periodically transmit detailed consumption readings to the SO through the AMI network. As a result, the SO can utilize these periodic consumption readings for various purposes, including load monitoring, implementing dynamic pricing for calculating consumption bills, and efficiently managing power resources [2], [4].

Despite the progress achieved in smart grid technology, electricity theft continues to pose a significant challenge. Dishonest individuals employ various fraudulent practices, such as tampering with mechanical meters in traditional power grids. Similarly, in smart power grids, the vulnerability lies in SMs, which are software-driven embedded systems. These SMs are targets of cyber-attacks orchestrated by malicious consumers aiming to manipulate electricity consumption readings and unlawfully reduce their bills [5], [6]. In the SG, electricity theft through cyber-attacks is an even greater concern than in traditional power grids, primarily because of the potential for significant financial losses and disruptions to the smooth operation of the power grid [7], [8]. This heightened concern stems from the crucial role that the consumption data reported by SMs play in enabling efficient grid management [9], [10].

Consequently, experts in the domains of cyber-security and artificial intelligence (AI) are increasingly directing their attention towards the detection of electricity theft [11], [12]. The existing literature showcases a range of supervised and unsupervised machine learning (ML) models, including deep learning (DL) and shallow models, proposed to detect electricity theft cyber-attacks [4], [10], [13], [14]. Current shallow ML detectors have limitations in effectively detecting electricity theft, primarily due to their inability to capture the intricate patterns and temporal dynamics present in electricity consumption readings. Hence, the primary emphasis in the literature is placed on DL-based models due to their ability to achieve higher detection accuracy when compared to shallow classifiers.

The current ML-based detectors in the literature can be divided into two main categories: global models and customized models. Global models are trained on diverse data patterns from various customers and serve to detect electricity theft universally [1], [11], [15], [16]. On the other hand, customized models are trained using specific data from individual customers. However, the practicality of customized models is constrained due to their need for a substantial amount of historical electricity consumption data for training [17]. Additionally, customized detection models are vulnerable to data contamination attacks, where a malicious customer can introduce false readings at the beginning, allowing future electricity theft to remain undetected, as the detector is trained on these false readings [17]. Moreover, developing separate detectors for each customer would impose a significant computational burden on the power utility. As a result, the prevailing literature predominantly favors the construction of global electricity theft detectors over the use of customized ones [1], [11], [12], [15], [16], [18].

However, the existing models in the literature are not without limitations. They often rely on fixed datasets, making them susceptible to overfitting and learning specific patterns and features rather than more generalized ones. Furthermore, these models exhibit limited adaptability to changes in consumption patterns and emerging cyber-attacks, necessitating the time-consuming and computationally intensive process of retraining the models using both existing and new data, especially when dealing with large datasets.

Reinforcement learning (RL) has emerged as a significant subfield of ML within the cyber-security domain. Its growing popularity can be attributed to its capability to interact with and adapt to the surrounding environment, enabling it to tackle dynamic decision-making challenges [19], [20]. RL is specifically designed to address such issues and has the capacity to learn optimal decision-making even with limited initial knowledge of the environment. Furthermore, RL models excel in finding the right balance between exploration and exploitation, which is a crucial aspect in cyber-security where attackers are constantly evolving and changing their strategies [21], [22], [23]. Additionally, RL enables the integration of human expertise into the decision-making process [24]. Experts can provide feedback and guidance to the RL agent, enhancing its performance. This human-in-the-loop approach enhances the accuracy and effectiveness of cyber-security attack detection, leveraging the strengths of both human expertise and machine learning [25], [26], [27]. In conclusion, RL offers unique advantages in handling dynamic decision-making challenges in the context of cyber-security attack detection, making it a promising approach to complement deep learning in this field.

In the research presented in [28], we conducted the first exploration of utilizing deep reinforcement learning (DRL) for the purpose of electricity theft detection. While this study yielded promising outcomes, its primary focus was on detecting false readings arising from electricity theft attacks, and it did not encompass the investigation of adversarial evasion attacks. In contrast, this paper specifically focuses on addressing more advanced cyberattacks, particularly adversarial evasion attacks, which occur during the testing phase. These sophisticated attacks are purposefully crafted to evade detection and deceive the detector, leading to a degradation in overall detection performance. Consequently, effectively countering such attacks poses a significant challenge for existing electricity theft detection models. Specifically, this paper presents a defense model that utilizes DRL to counter adversarial evasion attacks. The proposed approach involves three primary phases. Initially, we present a global DRL-based detection model that utilizes the Double Deep Q-Network (DDQN) algorithm, employing various neural network architectures, including convolutional neural network (CNN), gated recurrent unit neural network (GRU), and feedforward neural network (FFNN). Moving to the second phase, an attack model is developed to generate adversarial evasion samples by using malicious electricity consumption readings. This is done under the assumption of a black-box attack scenario, where the attacker lacks knowledge about the detection model. Two techniques are developed for generating evasion samples: a DRL-based DDQN model that incorporates a substitute model on the attacker’s side, and the Fast Gradient Sign Method (FGSM). Lastly, in the third phase, a defense model is introduced, aiming to strengthen the detection model through an adversarial training process using the generated evasion samples.

Despite RL’s inherent adaptability and attractive properties, it has not received adequate attention in the field of cyberattacks pertaining to electricity theft. Therefore, in comparison to existing literature, our study takes the lead by introducing the utilization of an RL-based attack model for generating evasion attacks against another RL-based detection model. Additionally, we are the first to evaluate the ability of the hardened defense model, which is trained adversarially on evasion samples generated by one attack, to defend against evasion samples generated by a different attack method. In essence, this paper’s significant contributions can be outlined as:

  • Developing DRL-based DDQN and FGSM-based attack models for generating adversarial evasion samples of SMs’ electricity consumption readings in a black-box attack context.

  • Investigating the effectiveness of a DRL-based adversarial training defense model in defending against both DRL and FGSM-based adversarial evasion samples.

  • Addressing the challenge of whether the DRL-based hardened defense model, which has been adversarially trained on DRL-based evasion samples, is capable of defending against FGSM-based evasion samples and vice versa.

The paper’s following sections are organized as follows: In Section II, we provide an overview of the pertinent literature relevant to adversarial evasion attacks within the smart grid domain. Section III introduces the fundamental concept of RL. In Section IV, we describe the dataset preparation process for training our attack and defense models. The proposed DRL-based and FGSM-based attack models, as well as the proposed adversarial training-based defense mechanism, are discussed in Section V. In Section VI, we assess and analyze our proposed attack and defense models. Finally, in Section VII, we draw our conclusions.

SECTION II.

Related Works

In this section, we begin by examining the prior research conducted on the detection of electricity theft. Subsequently, our focus shifts towards an overview of the existing approaches used for launching evasion attacks, as well as the countermeasures proposed to address them. Lastly, we delve into the limitations present in the literature and identify the areas that require further research.

A. Electricity Theft Detection Methods

In the realm of electricity theft detection, various methodologies have been devised to tackle and mitigate its ramifications. These methodologies can be broadly classified into three primary categories: hardware-based approaches, statistical and analytical methods, and machine learning-based techniques. In this section, we provide a concise overview of these methodologies, outlining their key characteristics and functionalities.

1) Hardware-Based Methods

One method to address electricity theft attacks involves the integration of hardware tamper-proof modules into smart meters, which act as a deterrent against unauthorized modifications and the transmission of falsified data [29]. However, it is important to recognize the limitations associated with this method. The implementation of such modules can be costly, and their effectiveness depends on a level of trust that may not always be guaranteed in real-world situations. Consequently, within the existing literature, there is an increasing preference for statistical-based and ML-based methods as they have the potential to overcome these limitations and offer more effective countermeasures against electricity theft [11].

2) Statistical and Analytical-Based Methods

Various statistical and analytical techniques have been proposed as countermeasures against electricity theft attacks. These methodologies utilize approaches such as metaheuristic methods [30], [31], game theory [32], [33], [34], data mining, state estimation [35], clustering, principal component analysis (PCA), and the local outlier factor (LOF). For instance, Singh et al. [36] propose an innovative approach that employs PCA to detect anomalies by calculating an anomaly score and comparing it to a predefined threshold. Furthermore, Peng et al. [37] employ the robust k-means clustering algorithm to group customers based on their electricity consumption data. This clustering aids in the identification of potential outliers: individuals whose consumption readings significantly deviate from the centroids of their respective clusters. To further enhance anomaly detection, the LOF technique is utilized to compute an anomaly score for each identified outlier candidate, providing a comprehensive assessment of their anomalous behavior. However, it should be noted that statistical and analytical methods often suffer from limitations in capturing the temporal dynamics and intricate patterns present in the data, which may impact their accuracy [38].

3) Machine Learning-Based Methods

In the literature, researchers have introduced various machine learning-based detectors aimed at identifying erroneous power consumption data submitted by fraudulent SMs. These detectors can be divided into two principal groups. The first group, referred to as “shallow detectors,” encompasses methods that use basic machine learning detection algorithms such as support vector machines (SVM), logistic regression (LR), Naïve Bayes, decision trees (DTs), and autoregressive integrated moving average (ARIMA) [11], [12], [15], [31], [39], [40]. The second group consists of detectors that leverage DL detection algorithms [16], [18], [29]. DL algorithms, unlike shallow ML algorithms, possess the advantage of automatic feature extraction without the need for explicit feature engineering. Numerous studies [16], [18], [29], [31], [41], [42], [43], [44], [45] demonstrate the promising potential of DL, consistently showcasing its superiority over shallow ML algorithms. This superiority is reflected in the higher detection accuracy achieved by DL models, making them a crucial and effective approach for detecting malicious power consumption readings.

Jokar et al. [11] developed a method for electricity theft detection. They trained custom detectors using real benign data from the Irish dataset and created malicious samples by manipulating this data. Two experiments were conducted: one with a single-class SVM detector using benign data and another with multi-class SVM detectors using both benign and malicious samples. The results favored the SVM detector trained with both data types. Furthermore, Buzau et al. [15] introduced a detection approach utilizing XGBoost as a global detector. Their approach considered consumers’ electricity data, geographical locations, and SM technical characteristics. Experimental results demonstrated the superiority of their detector in terms of accuracy compared to detectors based on K-nearest neighbors (KNNs), SVMs, and logistic regression.

Bhat et al. [16] introduced multiple DL-based global detectors employing various architectures such as CNN, LSTM, and stacked autoencoder. Through a comprehensive comparison with shallow-based detectors, the results demonstrate the superior accuracy of DL-based detectors. Furthermore, Li et al. [46] proposed a hybrid CNN-RF detector that combines the CNN’s ability to capture consumption reading features with the RF’s classification power for identifying electricity theft. The experimental results validate the superiority of the hybrid model over other shallow detection models, including GBDT, RF, LR, and SVM.

Similarly, in the same context of DL-based detectors, Zheng et al. [29] devised a DL-based electricity theft detector that utilized CNN and MLP to analyze weekly consumption readings and identify fraudulent behaviors. Experimental outcomes underscored the superior performance of this model over others, such as LR, RF, SVM, and CNN. Additionally, a hybrid CNN-LSTM model introduced in [47] exhibited a promising accuracy of 89%. Nonetheless, ML-based models in the literature have certain shortcomings. They struggle to adapt to changes in consumption patterns and cyber-attacks, necessitating retraining on new datasets, a process that can be time-consuming and computationally demanding, especially with extensive datasets. Moreover, ML models typically lack inherent exploration mechanisms, constraining their adaptability [28].

Therefore, RL offers a flexible solution for handling the dynamic nature of electricity theft attacks and consumption patterns. Our recent study in [28] made the first attempt to explore the application of RL in detecting such attacks. We proposed a DRL approach that encompassed four different scenarios. In the first scenario, we developed a global detection model using DQN and DDQN algorithms, employing architectures such as FFNN, CNN, GRU, and a hybrid CNN-GRU model. The second scenario involved constructing customized detection models for new customers based on the global detector, aiming to achieve high accuracy and prevent zero-day attacks. In the third scenario, we addressed changes in consumption patterns among existing customers, ensuring adaptability to evolving scenarios. Lastly, the fourth scenario tackled the challenges of defending against newly launched cyber-attacks. The experimental results showcased the ability of the proposed detectors to enhance the detection of electricity theft cyber-attacks. Moreover, our approach demonstrated efficient learning of new consumption patterns, adaptability to changes in existing customers’ consumption patterns, and effective defense against newly launched cyber-attacks.

B. Adversarial Evasion Attacks and Countermeasures

The previous subsection discussed works that primarily focused on training accurate models for detecting electricity theft attacks. However, these works did not address the security of the models against adversarial evasion attacks. Evasion attacks employ advanced techniques to make minimal alterations to malicious samples, causing the detector to incorrectly classify them as benign. In this subsection, our attention shifts to these specific attacks and the countermeasures proposed to mitigate them.

Numerous research studies have explored the generation of adversarial evasion samples and their impact on machine learning-based detectors. The pioneering work by Szegedy et al. [48] delved into the effects of evasion attacks on neural networks. Subsequently, Moosavi-Dezfooli et al. [49], Goodfellow et al. [50], and Rozsa et al. [51] introduced various techniques, namely DeepFool, the fast gradient sign method (FGSM), and fast gradient value (FGV), respectively, to generate evasion samples capable of deceiving detection models. In the domain of electrical power, Badr et al. [1] innovatively introduced evasion attacks targeting global electricity theft detectors. They employed a generative adversarial network (GAN) trained on real data to produce deceptive low-consumption readings effective at evading the global detector.

In addition, Li et al. [46] demonstrated the susceptibility of DL-based electricity theft detectors to evasion attacks. They introduced the SearchFromFree algorithm, which leverages gradients to create evasion samples, enabling malicious samples to evade DL-based detectors while yielding financial benefits. To address and mitigate adversarial evasion attacks, the technique of adversarial training defense has emerged as a promising approach to enhance detector resilience. Adversarial training, initially proposed by Szegedy et al. [48], entails exposing a trained detector to evasion attacks. This process generates adversarial samples capable of bypassing detection. Subsequently, the detector undergoes retraining using these adversarial samples to enhance its robustness.

C. Limitations and Research Gaps

In this subsection, we discuss the main limitations present in the existing literature, which constitute the motivations for this work, as follows.

  • Lack of generalization. Many existing evasion attacks exploit specific vulnerabilities or weaknesses in a particular machine learning model. These attacks often rely on knowledge of the target model’s architecture, parameters, or training data. As a result, they may not generalize well to different models or datasets. This limitation restricts their applicability in different applications where the attacker may not have complete knowledge of the target model, limiting the effectiveness of the attacks.

  • High computational cost. Some evasion attack methods utilize computationally expensive techniques such as optimization algorithms or brute-force search. These approaches explore a large search space to find the optimal perturbations that can fool the target model. However, these techniques are time-consuming and resource-intensive, especially for large datasets. The high computational cost makes these evasion attacks impractical for large-scale applications where efficiency is crucial.

  • Limited transferability. Many existing works assume that evasion samples generated by a substitute model will also evade the target model, but this assumption is not guaranteed practically. The underlying reasons for limited transferability include differences in model architectures or decision boundaries.

To address these limitations and fill the existing research gap, we offer the following rationale for investigating RL as a viable approach to generate evasion samples:

  • Model-agnostic attacks. RL-based evasion attacks have the potential to generate evasion samples that are less dependent on the architecture and type of the target model or the training dataset. By learning from interactions with the target model, an RL attack agent can adapt its attack strategy, making it more versatile and transferable. This adaptability allows for more successful evasion attacks in scenarios where the attacker has limited knowledge of the target model, compared to traditional approaches.

  • Optimization efficiency. RL techniques leverage exploration and exploitation techniques to compute evasion samples more efficiently. Instead of relying on exhaustive search or computationally expensive optimization algorithms, RL agents can learn to navigate the evasion sample space in an efficient manner. This improves the practicality and scalability of the evasion attack.

  • Stealth and imperceptibility. RL agents can be trained to generate evasion samples that are indistinguishable from the benign samples. By incorporating appropriate reward values, RL agents can learn to minimize the perturbations in a way that makes it hard to detect the attack. This ability to generate stealthy and imperceptible evasion samples increases the effectiveness of the attacks in different applications or scenarios where the goal is to evade the detection model without raising suspicion.

SECTION III.

Preliminaries

A. Reinforcement Learning (RL)

RL distinguishes itself as a unique branch of machine learning, setting it apart from popular methods like supervised learning. Unlike those methods, RL empowers autonomous agents to shape their own learning experiences through direct interaction with the environment and feedback. The fundamental structure of an RL model comprises two key components: the environment and the agent, as depicted in FIGURE 1. Initially, the agent possesses limited or no prior knowledge of the environment. To address RL problems, a Markov decision process incorporates four distinct components: state (s), action (a), reward (r), and policy (\pi). RL follows a trial-and-error approach, where the agent takes action a_{t} at each time step t, causing a transition from the current state s_{t} to a new state s_{t+1}, and receiving a reward or penalty from the environment. The reward function provides the agent with insights into the desirability of an action within a given state. Over time, the agent learns to make better decisions and avoids suboptimal choices based on the accumulated rewards. The primary objective of RL is to optimize the overall accumulated reward and formulate a policy that links states to actions. This cumulative reward represents the aggregation of rewards obtained by the agent over time and is expressed by Eq. 1.\begin{equation*} R_{t}= r_{t+1}+ \gamma r_{t+2}+ \gamma^{2} r_{t+3}+\ldots =\sum_{l=0}^{\infty} \gamma^{l} r_{t+l+1}, \tag{1}\end{equation*}


FIGURE 1. The main structure of an RL scheme.
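As a simple illustration of Eq. 1, the following minimal sketch computes the discounted return for a finite reward sequence; the reward values and the discount factor are arbitrary examples, not values used in this paper.

# Minimal sketch of Eq. 1: the discounted return R_t for a finite reward
# sequence. The reward values and gamma below are arbitrary examples.
def discounted_return(rewards, gamma=0.95):
    # R_t = sum_{l>=0} gamma^l * r_{t+l+1}
    return sum((gamma ** l) * r for l, r in enumerate(rewards))

print(discounted_return([1, 0, 1, 1]))  # 1 + 0 + 0.95**2 + 0.95**3 = 2.759875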

In order to deepen our understanding of the RL model’s capacity to determine the optimal policy, it is crucial to explore and comprehend the fundamental concepts of exploration and exploitation. Exploration involves evaluating and investigating various predefined actions to identify the most suitable course of action for the upcoming states, while exploitation focuses on utilizing current knowledge to adjust the action selection policy and maximize overall rewards. These concepts are mathematically examined through the \epsilon-greedy policy. At each state, the agent can either explore an action drawn randomly from a predefined set of actions with an exploration rate \epsilon, or exploit the action with the maximum Q-value with an exploitation rate of 1-\epsilon. The exploration rate, \epsilon \in [0, 1], is initially set to its maximum value of 1 and gradually decreases as the learning process progresses. Ultimately, as the training model evolves, the agent relies solely on the exploitation mechanism and its accumulated knowledge to determine the optimal action to execute. The Q-learning algorithm, introduced in [52], empowers the agent to learn and make optimal decisions through sequential exploration of different actions. The goal of this approach is to maximize the overall accumulated reward by leveraging the Bellman equation, as expressed in Eq. 2.\begin{align*} Q^{new}\left({s_{t}, a_{t}}\right) &\leftarrow (1-\alpha) Q\left({s_{t}, a_{t}}\right) \\ &\quad +\alpha \left({r_{t}+\gamma \max_{a} Q\left({s_{t+1}, a}\right)}\right), \tag{2}\end{align*}

where \alpha \in [0,1] is the learning rate, which determines the degree to which the updated Q-value dominates the previous one. When \alpha equals 0, the agent relies solely on prior knowledge, disregarding new information gained from recent interactions. Conversely, when \alpha equals 1, the agent abandons prior knowledge and focuses on exploring available actions for new insights.

The Q-learning algorithm utilizes a Q function to calculate the Q-values for a given state, with the primary objective of maximizing rewards. To store the Q-values for state-action pairs, a Q-table is employed, where rows represent states and columns represent available actions. However, as the number of states and actions increases, the state-action space grows exponentially, making the use of a Q-table impractical. To address this challenge, DL plays a crucial role in the integration with RL, leading to the development of the deep Q network (DQN). This integration leverages the remarkable DL capabilities of DQN to overcome the exponential growth in the state-action space. Thanks to DL, the DQN enables efficient and effective representation of Q-values without the need for an exhaustive Q-table.
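To make the \epsilon-greedy selection and the update of Eq. 2 concrete, the toy tabular sketch below shows one Q-learning step; the state/action counts and the values of \alpha, \gamma, and \epsilon are assumed examples, not the tuned parameters used in this paper.

import random
import numpy as np

# Toy tabular Q-learning step illustrating the epsilon-greedy policy and the
# Bellman update of Eq. 2. State/action counts, alpha, gamma, and epsilon are
# assumed example values only.
n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.2

def select_action(state):
    # Explore a random action with probability epsilon,
    # otherwise exploit the action with the maximum Q-value.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return int(np.argmax(Q[state]))

def q_update(s, a, r, s_next):
    # Eq. 2: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(r + gamma*max_a' Q(s',a')).
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))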

B. Deep Q Network (DQN)

The Q-learning algorithm utilizes a Q function to calculate the Q-value for a given state s_{t} and action a_{t}, which helps in formulating the policy for action selection. The optimal policy \pi^{\ast} is determined by selecting the action with the maximum Q-value for each state-action pair. In the training of the DQN model, individual samples are presented in a triple format (s_{t}, a_{t}, s_{t+1}), where s_{t}, a_{t}, and s_{t+1} correspond to the current state, the actual label, and the subsequent state, as illustrated in FIGURE 2. Throughout the training, the reward r_{t} for a specific state s_{t} is contingent on whether the predicted label \hat{a}_{t} aligns with the true label a_{t}. A reward of one is earned if the prediction is correct, while a reward of zero is given if it is incorrect, as illustrated in Eq. 3.\begin{align*} r_{t} = \begin{cases}\displaystyle 1 & \hat{a}_{t}= a_{t},\\ \displaystyle 0 & \hat{a}_{t}\neq a_{t}.\end{cases} \tag{3}\end{align*}


FIGURE 2. The training strategy of the DQN algorithm [28].

Furthermore, in the prediction phase of the DQN model illustrated in FIGURE 2, a set of Q-value combinations Q(s_{t},A) = [Q(s_{t}, a_{0}), Q(s_{t}, a_{1}), \ldots, Q(s_{t}, a_{b})] is computed for each given current state s_{t}, taking into account the available set of labels A. Here, b represents the total number of available labels. Subsequently, a selection process using the \epsilon-greedy algorithm is performed to determine an action from the computed set of combinations. This selection process employs either the exploration concept with a probability of \epsilon or the exploitation concept with a probability of 1-\epsilon. Likewise, the Q-value of the next state s_{t+1} is computed using the \arg\max_{a}(\cdot) policy, where \hat{Q}_{t+1} = \max_{a} Q(s_{t+1}, A). Upon successful completion of the training phase of the DQN model, the model is utilized to predict actions by selecting the action associated with the highest Q function value.

C. Double Deep Q Network (DDQN)

The double deep Q-network (DDQN) is a variation of the DQN, sharing its fundamental structure but diverging in the approach to predicting the next state. DDQN utilizes two deep neural networks, known as the current network and the target network [28], [53]. The former predicts the Q-value for the current state \hat{Q}_{t}, while the latter predicts the Q-value for the next state \hat{Q}_{t+1}. While both networks possess a similar architecture, the target network undergoes updates with a time-delayed synchronization method. This is done to mitigate the issue of a ‘moving target’ during gradient descent calculations for the (\hat{Q}_{t}-Q_{ref})^{2} term. Periodically, the target network’s parameters are synchronized with those of the current network. Other than this distinction, the training and prediction phases of the DDQN model closely resemble those of the DQN model. Upon the successful conclusion of the DDQN model’s training phase, it is deployed for action prediction by selecting the action associated with the highest value in the Q-function. The standard architecture and training procedure for the DDQN scheme can be visualized in FIGURE 3.
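For illustration only, the sketch below outlines how the two networks of the DDQN interact; the layer sizes, the 48-reading input dimension, and the discount factor are assumed example values, and the actual architectures and hyperparameters are those reported in TABLE 3.

import tensorflow as tf

# Simplified DDQN skeleton: a current network and a time-delayed target
# network. Layer sizes and gamma are assumed example values.
def build_q_network(n_inputs=48, n_actions=2):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(n_inputs,)),
        tf.keras.layers.Dense(n_actions),
    ])

current_net = build_q_network()
target_net = build_q_network()
target_net.set_weights(current_net.get_weights())  # initial synchronization

def q_reference(rewards, next_states, gamma=0.9):
    # The target network predicts Q for the next state: Q_ref = r + gamma * max_a Q(s', a).
    q_next = target_net.predict(next_states, verbose=0)
    return rewards + gamma * q_next.max(axis=1)

def sync_target_network():
    # Periodic, time-delayed copy of the current network's weights into the
    # target network, mitigating the "moving target" issue.
    target_net.set_weights(current_net.get_weights())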

FIGURE 3. The training strategy of the DDQN algorithm [28].

SECTION IV.

Dataset Preparation

This section outlines the process of crafting datasets that contain real electricity consumption records collected from consumers’ SMs. These datasets are utilized to train, evaluate, and analyze ML-based models for electricity theft detection, as well as to construct adversarial attack models.

A. Benign Samples

In this research paper, we employ a real electricity consumption dataset obtained from the Irish Smart Energy Trials [54] to create two distinct datasets. The first dataset is used for training and evaluating the performance of electricity theft detectors, while the second dataset is utilized to construct an attack model for generating adversarial evasion samples. The Irish dataset, made publicly available in January 2012 by Electric Ireland and Sustainable Energy, consists of half-hourly electricity consumption readings from 3,639 residential consumers. The dataset spans a duration of 536 days, covering readings collected between 2009 and 2010. For our study, we randomly select readings from the smart meters of 130 customers, resulting in a total of 69,680 benign samples. The electricity theft detection process focuses on a one-day period, where the detector analyzes a set of electricity consumption readings (48 readings) for a specific consumer to determine if electricity theft is occurring. Our attack model is designed to generate adversarial evasion samples that mimic the characteristics of real consumption patterns.

B. Malicious Samples

To ensure the effectiveness of an electricity theft detector in accurately distinguishing between benign and malicious electricity consumption samples, it is crucial to train the detector using both types of samples. However, in the case of the Irish dataset, only benign samples are available, and there is a lack of publicly available malicious samples. To address this limitation, we adopt an electricity theft attack methodology proposed in a previous study [11]. This attack involves generating malicious samples by modifying the benign samples. The attack model is functionally represented as f(x_{i}(t)) = \beta x_{i}(t) , where x_{i}(t) stands for the actual electricity consumption readings of a specific consumer i at a given time step t . This function f(x_{i}(t)) diminishes the actual consumption value by a stochastic reduction factor \beta , where 0 < \beta < 1 . The attack’s aim is to emulate malicious behavior by artificially lowering the actual consumption value using \beta .

C. Dataset Preprocessing

To generate malicious samples, we need to determine the parameter \beta for the attack function mentioned earlier. This parameter follows a uniform distribution ranging from 0.1 to 0.4 within the attack function f(.). Subsequently, we apply the attack function to the benign samples, transforming them into malicious ones. Each of the 130 customers contributes 536 benign daily samples and 536 corresponding malicious daily samples, i.e., 1,072 samples per customer. Given the dataset’s time span of 536 days and its coverage of 130 customers, this accumulates to a total of 139,360 samples. We then divide this dataset into two distinct subsets, each containing 69,680 samples. The first subset is exclusively reserved for training and evaluating the performance of the electricity theft detectors, while the second subset serves the purpose of constructing the attack model used for generating adversarial evasion samples. To further structure our dataset, we partition both of these subsets into training and testing subsets, maintaining a balanced 2:1 ratio. The training subset consists of 46,453 samples, and the testing subset contains 23,227 samples. The entire dataset preprocessing is illustrated and annotated through steps (1 to 4) within the proposed framework, as depicted in FIGURE 4. Additionally, Algorithm 1 provides a detailed explanation of the preprocessing steps.

FIGURE 4. The proposed attack and defense models framework.

Algorithm 1 Data Preprocessing Algorithm

Input: SMs’ benign readings of C = 130 randomly selected customers from the Irish dataset, number of benign reading slots per customer T = 48, and the reduction factor \beta \in [0.1, 0.4].

Output: Two subsets including benign and malicious reading samples.

1: Initiate the attack function f(x_{i}(t)) = \beta x_{i}(t) to generate the malicious samples.
2: for i = 0, 1, 2, \ldots, C do
3:   for each benign reading x at time slot t in T do
4:     Input the benign reading x_{i}(t) into the attack function to obtain the malicious reading of the sample.
5:     Concatenate the benign and malicious samples.
6:   end for
7:   Repeat until reaching epoch C.
8: end for
9: Divide the concatenated samples into two subsets, each consisting of benign and malicious reading samples.
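The following sketch illustrates, under simplifying assumptions, how Section IV and Algorithm 1 could be realized in code. The random placeholder readings and the seed stand in for the actual Irish data; only the reported sample counts, the \beta range, and the 2:1 split are taken from the text.

import numpy as np

# Sketch of the dataset preparation (Section IV / Algorithm 1). `benign` is a
# placeholder for the 69,680 daily benign samples (48 half-hourly readings
# each); the reduction factor beta follows the uniform range quoted above.
rng = np.random.default_rng(0)
benign = rng.random((69_680, 48))                 # placeholder for Irish data

def make_malicious(benign_samples, low=0.1, high=0.4):
    # f(x_i(t)) = beta * x_i(t): scale each daily sample by a random beta.
    beta = rng.uniform(low, high, size=(benign_samples.shape[0], 1))
    return beta * benign_samples

malicious = make_malicious(benign)
X = np.vstack([benign, malicious])                # 139,360 samples in total
y = np.concatenate([np.zeros(len(benign)), np.ones(len(malicious))])

# Shuffle, then split into two equal subsets: one for the detectors and one
# for the attack model; finally split each subset 2:1 into train/test.
perm = rng.permutation(len(X))
X, y = X[perm], y[perm]
(det_X, atk_X), (det_y, atk_y) = np.array_split(X, 2), np.array_split(y, 2)
cut = int(len(det_X) * 2 / 3)                     # 46,453 train / 23,227 test
X_train, X_test = det_X[:cut], det_X[cut:]
y_train, y_test = det_y[:cut], det_y[cut:]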

SECTION V.

Attack and Defense Models

In this section, we first discuss evasion attacks that are used to attack electricity theft detectors, and then we discuss a countermeasure.

A. Attack Model

In this section, we introduce two attack models designed to generate adversarial evasion samples using malicious electricity consumption readings. The first model is the DRL-based DDQN model, and the second model is the FGSM-based model. TABLE 1 presents a comprehensive comparison between the DRL-based and FGSM-based evasion models, shedding light on their differences, advantages, and limitations.

TABLE 1. Comparison Between the DRL-Based and FGSM-Based Evasion Attack Models.

1) Overview

Previously, electricity theft attackers resorted to using simple attack functions, as proposed in [11], to engage in electricity theft. These approaches involved manipulating and reporting fraudulent electricity consumption values to the utility. While these approaches were once effective in facilitating successful electricity theft and inflicting financial harm on the utility, recent advancements in detection technology have rendered them detectable by electricity theft detectors. However, attackers have actively been exploring new techniques to evade these detection systems. In this study, we propose an evasion attack model that empowers attackers to steal electricity by generating artificially low consumption readings. This evasion attack model operates under the assumption of a black-box attack scenario, where the attacker has no knowledge about the RL-based electricity theft detector employed by the utility. Additionally, the attack model and the electricity theft detector employ different neural network architectures and are trained on different datasets. Specifically, the RL-based electricity theft detector is trained on a combination of benign and malicious reading samples from the first subset of data presented in Section IV-B. On the other side, the attacker utilizes the malicious samples from the second subset of data, as outlined in Section IV-B, to generate low-consumption adversarial evasion samples. The objective of these evasion samples is to deceive the electricity theft detector and be classified as benign, thereby stealing electricity while evading detection.

2) DRL-Based Attack Model

The proposed approach employs DRL to develop a generation agent that can autonomously generate adversarial evasion samples to bypass the electricity theft detector. The attacker creates a substitute model to verify if the generated evasion samples can avoid detection. The attacker assumes that the generated samples that evade the substitute model are also able to pass the electricity theft detector of the utility.

To train the generation agent, the environment provides states to the agent through a sample provider that extracts malicious samples from the second subset of data. The agent employs a trial-and-error approach, leveraging the aforementioned DDQN algorithm and its mechanism for selecting the optimal action, as explained earlier in Section III-C. This action space consists of perturbation values that the generation agent can apply to a malicious sample through a multiplication process, thereby modifying it and generating an adversarial evasion sample. Subsequently, the generated evasion sample undergoes testing to determine its capability to evade the substitute model. The substitute model is trained on the second subset of the dataset, containing both benign and malicious samples. The agent’s reward is contingent upon the output of the substitute model when evaluating the generated sample. If the generated evasion sample successfully evades the substitute model, the generation agent receives a reward of 1; otherwise, the reward is 0.

In this setting, the substitute model is implemented using the DRL-based DDQN algorithm, employing three distinct neural network architectures: CNN, GRU, and FFNN. On the other hand, the generation agent of the attack model is implemented using the CNN architecture within the DRL-based DDQN algorithm. The DRL-based DDQN attack model is visually depicted and annotated in steps (7 [a, b, c], 8) in FIGURE 4. The training phase and the training accuracy of the DRL-based DDQN attack model are both outlined in Algorithm 2 and FIGURE 5, respectively. This figure visually represents the training convergence as accuracy improves with the progressive increase in the number of training batches.

FIGURE 5. Training accuracy of DRL-based attack model.

Algorithm 2 DRL-Based DDQN Attack Model Training Algorithm

Input: Exploration rate \epsilon, learning rate \alpha, discount factor \gamma, batch size H, and training epochs G.

Output: The optimal action a^{\ast} and the adversarial evasion sample.

1: Initiate the action-value function Q(s,a) arbitrarily.
2: Initiate the state s using the state generator in a format recognizable by the agent.
3: for i = 0, 1, 2, \ldots, G do
4:   for each state s in i do
5:     Input the state s_{t} and the action set A into the current network in order to predict Q(s,A) for all actions.
6:     Use the \epsilon-greedy policy to select the action \hat{a}_{t}.
7:     Given s_{t} and \hat{a}_{t}, obtain Q(s_{t},\hat{a}_{t}).
8:     Generate the adversarial evasion sample by multiplying the state sample s_{t} with \hat{a}_{t}.
9:     Check whether the generated adversarial evasion sample can evade the substitute model.
10:    Calculate the reward r_{t}.
11:    Input the next state s_{t+1} and the action set A into the target network in order to predict Q(s_{t+1},A) for all actions.
12:    Use the \arg\max_{a} Q(s_{t+1},A) policy to select \hat{a}_{t+1}.
13:    Given s_{t+1} and \hat{a}_{t+1}, obtain \hat{Q}_{t+1}(s_{t+1},\hat{a}_{t+1}).
14:    Using \hat{Q}_{t+1}, r_{t}, and \gamma, obtain Q_{ref}.
15:    Calculate the loss function.
16:    Update the Q-value Q(s_{t},a_{t}).
17:    Repeat until s_{t+1} is terminal.
18:  end for
19:  Repeat until reaching epoch G.
20: end for
21: Compute the optimal policy \pi^{\ast} and optimal action a^{\ast}.
22: Execute the optimal action a^{\ast}_{t} at the current time slot t and get the adversarial evasion sample.
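As a hypothetical illustration of steps 8-10 of Algorithm 2, the sketch below shows a single interaction of the generation agent with the substitute model. The multiplicative action set, the substitute_model interface (returning 0 for benign and 1 for malicious), and the function names are assumptions for illustration, not the paper’s configuration.

import numpy as np

# Hypothetical single interaction of the DRL-based generation agent.
# ACTIONS is an assumed set of multiplicative perturbation values.
ACTIONS = np.linspace(0.5, 1.5, 11)

def attack_step(malicious_sample, action_index, substitute_model):
    # Step 8: generate the evasion sample by multiplying the state by the action.
    evasion_sample = malicious_sample * ACTIONS[action_index]
    # Step 9: test the sample against the substitute model
    # (assumed to return 0 for "benign" and 1 for "malicious").
    predicted_label = substitute_model(evasion_sample)
    # Step 10: reward 1 if the substitute model is deceived, 0 otherwise.
    reward = 1 if predicted_label == 0 else 0
    return evasion_sample, reward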

3) FGSM-Based Attack Model

Additionally, alongside the DRL-based DDQN attack model, we leverage an FGSM-based attack model to generate adversarial evasion samples. The FGSM technique is widely employed in the literature to create such samples by introducing carefully calculated perturbations to the input samples, aiming to cause misclassification of these samples [50], [55], [56], [57]. FGSM presents several distinct advantages. It operates by employing gradients of the loss function, often yielding impactful perturbations that lead to misclassification. Furthermore, it can achieve misclassification with minimal modifications to the input samples. To ensure that these added perturbations remain undetectable to the electricity theft detector, their magnitude must be within an acceptable limit. Thus, mathematically, these added perturbations can be described as follows:\begin{align*} &\min_{\delta \vec{x}} \|\delta \vec{x}\| \\ &\text{s.t. } \hat{f}(\vec{x}+\delta \vec{x}) \neq \hat{f}(\vec{x}), \tag{4}\end{align*}

where \delta \vec{x} represents the adversarial perturbation applied to the input malicious sample \vec{x}. The FGSM method utilizes the sign of the gradient of the cost function C_{\hat{f}}(\theta, \vec{x}, y) of the model \hat{f} with respect to the input. This gradient is evaluated for the input malicious sample, and its sign is employed to generate the adversarial perturbations described in Eq. 5. The objective is to maximize the value of the cost function to the greatest extent.\begin{equation*} \delta_{\vec{x}}=\lambda \,\mathrm{sign}\left({\nabla_{\vec{x}} C_{\hat{f}} (\theta,\vec{x},y)}\right), \tag{5}\end{equation*}
Here, \theta, \vec{x}, and y represent the model weights, the input malicious sample, and the true label corresponding to \vec{x}, respectively. Meanwhile, \lambda is a parameter that is tuned so that the label produced by the model for the perturbed input, i.e., (\vec{x}+\delta \vec{x}), changes from the malicious label to the benign label and deceives the detector.
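A minimal sketch of the FGSM perturbation of Eq. 5, using TensorFlow’s automatic differentiation; the classifier model, the batch x of malicious readings with true labels y, and the step size lam are assumptions for illustration, not the paper’s implementation.

import tensorflow as tf

# Sketch of Eq. 5: delta_x = lambda * sign(grad_x C(theta, x, y)).
def fgsm_perturb(model, x, y, lam=0.01):
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        loss = tf.keras.losses.sparse_categorical_crossentropy(y, model(x))
    grad = tape.gradient(loss, x)
    # Add the signed-gradient perturbation to obtain the evasion sample.
    return x + lam * tf.sign(grad)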

B. Defense Model

This section focuses on defense mechanisms against the attacks discussed earlier. These mechanisms are divided into two stages.

In the first stage, a DRL-based DDQN detector is employed. The detector is constructed using various neural network architectures, including FFNN, CNN, and GRU. It is trained using the initial subset of the dataset discussed in Section IV-C, allowing it to learn and identify patterns indicative of electricity theft cyber-attacks. It serves as a defense mechanism against evasion attacks, considering that the attacker might exploit adversarial evasion samples generated by either DRL-based or FGSM-based attack models to launch evasion attacks against the detector. The first stage of defense is described in steps 5 and 6 in FIGURE 4.

In the second stage of defense, evasion samples play a crucial role in the subsequent process known as adversarial training [58], [59]. This training significantly boosts a model’s resilience against evasion attacks by enhancing its ability to detect alterations in input data, resulting in improved differentiation between benign and malicious samples. This heightened sensitivity empowers models to make accurate judgments even when facing adversarial variations, thereby reducing the effectiveness of evasion attacks. The primary goal of this training process is to ‘harden’ and reinforce the detector’s defenses, making it more robust and resilient against future attacks. Leveraging the recorded evasion samples, the detector undergoes retraining, reinforcing its ability to identify and respond effectively to adversarial evasion attempts. Within the context of reinforcement learning (RL), adversarial training drives the RL agent to explore and adapt its policy to accommodate unexpected consumption patterns or attack tactics. Through intentional exposure to adversarial influences during training, the agent acquires the ability to make decisions that extend beyond routine situations, displaying resilience against variations and disruptions. Consequently, the agent enhances its capability to consider a wide range of scenarios and actions, enabling well-informed decisions even when confronting perturbed or adversarial observations.

The parameters guiding the adversarial training of the DRL-based DDQN model are outlined in TABLE 2. To gain a comprehensive understanding of the defense procedures, this defense is visually represented by steps 11 and 12 in FIGURE 4. These steps significantly contribute to safeguarding against adversarial evasion samples and enhancing the detector’s capabilities. Additionally, Algorithm 3 provides a detailed explanation of the procedures of this defense, offering insights into their implementation and functionality.

TABLE 2. Parameters of DRL-DDQN Based Attack and Defense Models.

Algorithm 3 Defense Model Algorithm

Input: The first subset of the dataset, and the adversarial evasion samples.

Output: The hardened DRL-based DDQN detector.

The first stage of defense
1: Use the first subset of the dataset to train the DRL-based DDQN model using different neural network architectures.
2: Obtain the trained DRL-based DDQN detector.
3: Utilize the generated adversarial evasion samples to attack the proposed detector.

The second stage of defense
4: Record the evasion samples.
5: Use these recorded evasion samples to conduct the adversarial training process for the proposed detector.
6: Obtain the hardened DRL-based DDQN detector.
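The sketch below illustrates the second stage of Algorithm 3 with a generic Keras-style classifier standing in for the DRL-based DDQN detector; the fit settings and the label convention (1 = malicious) are assumptions for illustration only.

import numpy as np

# Simplified adversarial (re)training step: augment the training data with
# the recorded evasion samples, labeled as malicious, and retrain the detector.
def adversarial_training(detector, X_train, y_train, evasion_samples):
    X_aug = np.vstack([X_train, evasion_samples])
    y_aug = np.concatenate([y_train, np.ones(len(evasion_samples))])
    detector.fit(X_aug, y_aug, epochs=5, batch_size=128, verbose=0)
    return detector  # the "hardened" detector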

SECTION VI.

Evaluations

In this section, we first discuss the experimental setup and the evaluation metrics used to assess the performance and effectiveness of our proposals. Subsequently, we present the experimental results of four conducted experiments to evaluate the severity of the attack models and the effectiveness of the defense models. In the first experiment, we train a global DRL detector that utilizes DDQN to detect electricity theft cyber-attacks. This detector, which serves as the first stage of defense, is constructed using diverse neural network architectures, including FFNN, CNN, and GRU. In the second experiment, we train DRL-based DDQN and FGSM-based attack models to generate adversarial evasion samples and evaluate their effectiveness in attacking the global electricity theft detector. Meanwhile, the third experiment focuses on evaluating the effectiveness of adversarial training in strengthening the DRL-based DDQN detector against both DRL and FGSM-based adversarial evasion samples. This hardened detector serves as the second stage of defense, aiming to enhance the resilience against potential attacks. Lastly, in the fourth experiment, we investigate whether the DRL evasion samples-based defense model can defend against FGSM-based adversarial evasion samples, and vice versa.

The configuration parameters of the proposed DRL-based DDQN based attack and defense models are specified and listed in TABLE 2. Additionally, the hyperparameters of the neural network architectures utilized in these models can be found in TABLE 3. These parameters have been fine-tuned through iterative experimentation. For our experimental work, we utilized a variety of Python 3 libraries, namely Scikit-learn, Pandas, Keras, Numpy, TensorFlow, and Matplotlib. It’s worth highlighting that all our experiments were conducted on the Google Colab platform, a web-based environment that allows for seamless Python code writing and execution within a web browser.

TABLE 3. The Hyperparameters for the Neural Network Architectures Utilized in the Proposed Models.

A. Metrics

The evaluations of the proposed attack and defense models include the analysis of multiple metrics, such as accuracy, precision, recall, false alarm, false negative rate, highest difference, F-1 score, evasion rate (EVR), attack success rate (ASR), and transferability rate (TR). These metrics rely on the values of true positive (TP), true negative (TN), false positive (FP), and false negative (FN). TP and TN represent correctly classified malicious and benign samples, respectively, while FP and FN denote misclassified benign and malicious samples, respectively. The computation of these evaluation metrics is as follows:

1) Accuracy (ACC)

It represents the proportion of correctly classified test samples by the detector out of the total number of samples in the test dataset, which includes both benign and malicious samples. Mathematically, it is computed using the following equation:\begin{equation*} ACC (\%) = \frac{TP+TN}{TP+TN+FP+FN}\times 100. \tag{6}\end{equation*}


2) Adversarial Accuracy (ACC_{ADV})

It represents the proportion of correctly classified test samples by the detector out of the total number of samples in the test dataset, which includes both benign and evasion samples.

3) Overall Accuracy (ACC_{ALL})

It represents the proportion of correctly classified test samples by the detector out of the total number of samples in the test dataset, which includes benign, malicious, and evasion samples.

4) Precision

It represents the proportion of true positive samples to the total number of samples classified as positive by the detector. Mathematically, it is computed using the following equation:\begin{equation*} Precision (\%) = \frac {TP}{TP+FP}\times 100. \tag{7}\end{equation*}


5) Recall

It represents the proportion of correctly identified positive samples to the total number of positive samples in the test dataset. Mathematically, it is computed using the following equation:\begin{equation*} Recall (\%) = \frac {TP}{TP+FN}\times 100. \tag{8}\end{equation*}


6) False Alarm (FA)

It represents the proportion of false positive samples to the total number of negative samples in the test dataset. Mathematically, it is computed using the following equation:\begin{equation*} FA (\%) = \frac {FP}{FP+TN}\times 100. \tag{9}\end{equation*}


7) Highest Difference (HD)

It is the difference between recall and false alarm (FA) . Mathematically, it is computed using the following equation:\begin{equation*} HD (\%) = Recall(\%) - FA(\%). \tag{10}\end{equation*}


8) F-1 Score (F1)

It is the harmonic mean between precision and recall. Mathematically, it is computed using the following equation:\begin{equation*} F1 (\%) = \frac {2* Precision*Recall}{Precision+Recall}\times 100. \tag{11}\end{equation*}


9) Evasion Rate (EVR)

It represents the proportion of evasion samples that are misclassified as benign by the substitute model.

10) Attack Success Rate (ASR)

It represents the proportion of evasion samples that are misclassified as benign by the utility detector. Note that the attacker sends only the evasion samples that pass the substitute model to the utility, but the ASR metric is computed over all the evasion samples.

11) Transferability Rate (TR)

It quantifies the proportion of evasion samples that successfully bypass both the substitute model and the utility detector compared to the total number of evasion samples that manage to bypass the substitute model. This metric, which is derived from the previous two metrics, provides an indication of the probability that a given sample, which evades the substitute model, will also successfully evade the utility detector. It is computed using the following equation:\begin{equation*} TR (\%) = \frac{ASR}{EVR}\times 100. \tag{12}\end{equation*}

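To make the metric definitions above concrete, the following is a minimal Python sketch of how they can be computed; the label convention (1 = malicious/positive, 0 = benign/negative) and the function names are illustrative assumptions, not part of the paper's implementation.

import numpy as np

def detection_metrics(y_true, y_pred):
    """Compute Eqs. (6)-(11) from predicted and true labels (1 = malicious, 0 = benign)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    acc = (tp + tn) / (tp + tn + fp + fn)                 # Eq. (6)
    precision = tp / (tp + fp)                            # Eq. (7)
    recall = tp / (tp + fn)                               # Eq. (8)
    fa = fp / (fp + tn)                                   # Eq. (9)
    hd = recall - fa                                      # Eq. (10)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (11), on fractional values
    # report every metric as a percentage
    return {k: 100.0 * v for k, v in
            dict(ACC=acc, Precision=precision, Recall=recall, FA=fa, HD=hd, F1=f1).items()}

def attack_metrics(n_evasion, n_pass_substitute, n_pass_utility):
    """EVR/ASR/TR from the number of evasion samples that bypass each model."""
    evr = 100.0 * n_pass_substitute / n_evasion           # evade the substitute model
    asr = 100.0 * n_pass_utility / n_evasion              # evade the utility detector
    tr = 100.0 * asr / evr if evr else 0.0                # Eq. (12)
    return dict(EVR=evr, ASR=asr, TR=tr)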

B. Experiment #1

In this experiment, we focused on training DRL-based DDQN global electricity theft detectors using the first subset of electricity consumption readings. We generated distinct training and testing datasets following the data preprocessing procedures detailed in Section IV-C. We then used the training dataset to train three distinct global DRL-based DDQN detectors, marking the initial phase of our defense strategy. These detectors employed different neural network architectures: FFNN, CNN, and GRU. These DL-based architectures were selected because of their demonstrated superior performance over shallow architectures and their widespread use in the literature. Finally, we evaluated the performance of the three DRL-based DDQN detectors on the testing dataset. TABLE 4 provides a comprehensive comparison of the detectors in terms of ACC, Precision, Recall, FA, HD, and F1. From the table, it is evident that the FFNN detector exhibited the lowest performance, which can be attributed to its simpler architecture compared to CNN and GRU. The CNN detector achieved the highest performance, leveraging its convolutional layers to extract crucial features from the electricity consumption data, leading to superior detection accuracy. Likewise, the GRU detector delivered strong performance by effectively capturing temporal patterns within the input electricity consumption data.

TABLE 4 Comparison Between the Performance of the Different Architectures of DRL-Based DDQN Detectors
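For illustration, the following is a simplified Python/PyTorch sketch of how electricity theft detection can be framed as a DDQN problem along the lines described above: each state is one customer's reading vector, the two actions are the class labels (0 = benign, 1 = malicious), the reward is +1 for a correct label and -1 otherwise, and the next state is simply the next reading in the shuffled training stream. The FFNN body, layer sizes, and hyperparameters are our own illustrative assumptions, not the paper's exact algorithm or parameter settings.

import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):                      # FFNN variant; CNN/GRU bodies swap in analogously
    def __init__(self, n_features, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(),
                                 nn.Linear(64, 32), nn.ReLU(),
                                 nn.Linear(32, n_actions))
    def forward(self, x):
        return self.net(x)

def train_ddqn_detector(X, y, epochs=10, gamma=0.9, lr=1e-3, batch=64, eps=0.1):
    """X: (n_samples, n_features) consumption readings; y: 0 = benign, 1 = malicious."""
    online = QNet(X.shape[1])
    target = copy.deepcopy(online)
    opt = torch.optim.Adam(online.parameters(), lr=lr)
    X_t = torch.tensor(X, dtype=torch.float32)
    y_t = torch.tensor(y, dtype=torch.long)
    for _ in range(epochs):
        perm = torch.randperm(len(X_t) - 1)
        for s in range(0, len(perm), batch):
            b = perm[s:s + batch]
            state, next_state, label = X_t[b], X_t[b + 1], y_t[b]
            q = online(state)
            # epsilon-greedy choice of the classification action
            action = torch.where(torch.rand(len(b)) < eps,
                                 torch.randint(0, 2, (len(b),)),
                                 q.argmax(dim=1))
            reward = (action == label).float() * 2 - 1                        # +1 correct, -1 wrong
            with torch.no_grad():
                best_next = online(next_state).argmax(dim=1, keepdim=True)    # select with online net
                next_q = target(next_state).gather(1, best_next).squeeze(1)   # evaluate with target net
            td_target = reward + gamma * next_q                               # double-Q target
            q_taken = q.gather(1, action.unsqueeze(1)).squeeze(1)
            loss = F.mse_loss(q_taken, td_target)
            opt.zero_grad()
            loss.backward()
            opt.step()
        target.load_state_dict(online.state_dict())                           # periodic target sync
    return online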

C. Experiment #2

In this experiment, we conducted training for DRL-based DDQN and FGSM-based attack models to generate adversarial evasion attack samples, as explained in detail in Section V-A. During this experiment, we simulated a realistic black-box attack scenario, in which the attackers had no knowledge of the utility’s detector. Furthermore, there could be variations in the neural network architectures employed by the attack model and the electricity theft detector, as well as differences in the training datasets used. Additionally, the attacker lacked direct access to the utility’s detector. Therefore, the attacker attempted to create a substitute model to evaluate whether the generated evasion samples could successfully evade detection. The attacker’s assumption was that if the evasion samples were capable of deceiving the substitute model, they would likely also be able to bypass the actual electricity theft detector implemented by the utility. The parameter configurations of the substitute and attack models are provided in TABLE 2 and TABLE 3.
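As a rough illustration of this attacker-side workflow, the sketch below crafts candidate evasion samples from malicious readings and forwards to the utility only those that the substitute model labels as benign; craft_evasion and substitute are placeholder names (e.g., the DRL-based attack agent or the FGSM routine, and the attacker's trained substitute classifier), not the paper's code.

import torch

def select_evasion_samples(malicious_readings, craft_evasion, substitute):
    """Keep only the crafted samples that the substitute model classifies as benign (class 0)."""
    candidates = torch.stack([craft_evasion(x) for x in malicious_readings])
    with torch.no_grad():
        pred = substitute(candidates).argmax(dim=1)      # 0 = benign, 1 = malicious
    passed = candidates[pred == 0]                       # samples the attacker actually submits
    evr = 100.0 * len(passed) / len(candidates)          # evasion rate against the substitute
    return passed, evr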

Once the attack models were trained, we used the test malicious samples from the second subset, as outlined in Section IV-B, to generate the evasion attack samples. These generated samples were then employed to target the detectors developed in Experiment #1. The outcomes of this experiment, evaluated in terms of EVR, ASR, TR, and ACC, are presented in TABLE 5. They demonstrate the superior performance of the proposed DRL-based DDQN attack model over the FGSM-based attack model in terms of EVR, ASR, and ACC. The DRL-based DDQN attack model achieves outstanding success, with EVR ranging from 97.921% to 99.982%, ASR ranging from 92.929% to 99.965%, and ACC ranging from 41.210% to 46.563%. This indicates a significant reduction of 41.611% to 46.795%, within a 95% confidence interval (CI) of (44.66 ± 1.14)%, compared to the corresponding values in TABLE 4. In the same context, the FGSM-based attack model exhibits EVR values between 72.911% and 94.214%, ASR values ranging from 48.479% to 95.484%, and ACC ranging from 44.898% to 65.585%. This indicates a significant reduction of 26.768% to 42.768%, within a 95% CI of (35.022 ± 3.27)%, compared to the corresponding values in TABLE 4. These outcomes highlight the severity of the evasion attacks and the effectiveness of our proposed DRL-DDQN-based attack model in crafting evasion samples that deceive and bypass the electricity theft detector, particularly in the challenging black-box scenario where the attack model and the electricity theft detector may employ different neural network architectures.

TABLE 5 Comparison Between the Performance of DRL-Based DDQN and FGSM-Based Attack Models Against the Detector
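For reference, a generic FGSM perturbation step is sketched below; it assumes a differentiable substitute classifier with two output logits (e.g., the QNet above) and an illustrative step size and clipping rule, so it approximates, rather than reproduces, the FGSM-based attack model described in Section V-A.

import torch
import torch.nn.functional as F

def fgsm_evasion(model, x_malicious, eps=0.05):
    """Craft one evasion sample from a malicious reading vector (1-D tensor)."""
    x = x_malicious.clone().detach().requires_grad_(True)
    logits = model(x.unsqueeze(0))                       # shape (1, 2): [benign, malicious]
    loss = F.cross_entropy(logits, torch.tensor([1]))    # loss w.r.t. the true "malicious" label
    loss.backward()
    x_adv = x + eps * x.grad.sign()                      # single gradient-sign step away from "malicious"
    return x_adv.clamp(min=0.0).detach()                 # keep readings non-negative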

D. Experiment #3

The previous experiment revealed the detrimental impact of evasion attacks on the performance of electricity theft detectors. Our goal in this experiment, which represents the second stage of defense, is to propose a robust, hardened detector capable of maintaining consistent detection performance against evasion attacks. To accomplish this, we apply an adversarial training process to the electricity theft detector obtained in Experiment #1, leveraging the evasion samples captured in the first defense stage. The procedures of this experiment are illustrated by annotated steps 11 and 12 in FIGURE 4 and detailed in Algorithm 3. The results of this experiment, evaluated in terms of ASR, ACC, ACC_{adv}, and ACC_{all}, are presented in TABLE 6.

TABLE 6 Comparison Between the Defense Performance of the Hardened Detector against DRL-Based DDQN and FGSM-Based Evasion Samples

These results validate the effectiveness of our proposed defense mechanism, achieved through adversarial training, which enhances the resilience and robustness of the electricity theft detection model against evasion attacks. Specifically, for DRL-based evasion samples, and compared to the corresponding values in TABLE 5, which represent the detector's performance without adversarial training, the ASR drops to between 1.80% and 9.20%, indicating a significant reduction of 83.72% to 98.16% within a 95% CI of (94.09 ± 3.05)%. Additionally, there is a significant increase in ACC, which ranges from 85.05% to 93.26%, representing an improvement of 39.81% to 46.75% within a 95% CI of (34.85 ± 1.6)%.

Furthermore, notable enhancements are observed in ACC_{adv} and ACC_{all}, with values ranging from 92.31% to 97% and from 88.85% to 95.18%, respectively. Similarly, for FGSM-based evasion samples, and compared to the corresponding values in TABLE 5, the ASR drops to between 0.67% and 14.4%, a reduction of 46.34% to 76.49% within a 95% CI of (63.05 ± 7.58)%. Moreover, there is a significant increase in ACC, which ranges from 84.03% to 92.76%, indicating an improvement of 25.06% to 42.29% within a 95% CI of (33.50 ± 3.31)%. Additionally, significant improvements are observed in ACC_{adv} and ACC_{all}, with values ranging from 79.98% to 96.5% and from 81.99% to 93.59%, respectively. These improvements in the detector's effectiveness stem from the adversarial training, which drives the RL agent to explore and update its policy to accommodate unexpected shifts in consumption behaviors or attack methods. This enables the agent to make optimal decisions across a broader range of scenarios and actions, even when confronting adversarial perturbations or attacks.
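The core of this defense stage can be summarized by the following sketch: the captured evasion samples are labeled as malicious, appended to the original training set, and the detector is retrained. Algorithm 3 in the paper remains the authoritative procedure; train_fn here is a placeholder for any of the DDQN trainers of Experiment #1, such as the illustrative one sketched earlier.

import numpy as np

def adversarial_retrain(X_train, y_train, X_evasion, train_fn):
    """Append captured evasion samples (labeled malicious) and retrain the detector."""
    X_aug = np.vstack([X_train, X_evasion])
    y_aug = np.concatenate([y_train, np.ones(len(X_evasion), dtype=int)])  # evasion -> malicious
    return train_fn(X_aug, y_aug)                        # returns the hardened detector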

E. Experiment #4

Following the promising results of the previous experiment, we now investigate whether the defense model hardened on DRL-based evasion samples can defend against FGSM-based adversarial evasion samples, and vice versa. Our objective is to examine the model's ability to defend against evasion attacks originating from different attack methods. In the first phase of the experiment, we subject the adversarially trained hardened detector, which was initially trained exclusively on DRL-based evasion samples, to attacks using FGSM-based evasion samples. The outcomes, presented in TABLE 7, reveal the detector's vulnerability, with an ASR ranging from 62.008% to 91.568%. To strengthen the defense mechanism, we then apply adversarial training to the detector using both DRL-based and FGSM-based evasion samples. This comprehensive training approach yields remarkable performance, significantly reducing the ASR to a range of 0.604% to 10.715%. Additionally, notable improvements are observed in the key evaluation metrics ACC, ACC_{adv}, and ACC_{all}, with values ranging from 84.774% to 93.883%, from 84.801% to 96.328%, and from 84.751% to 95.161%, respectively.

TABLE 7 The Defense Performance of the Adversarially Trained Hardened Detector Using DRL-Based Evasion Samples Against FGSM-Based Evasion Samples

Furthermore, in the second phase of the experiment, we examine the impact of using an adversarially trained hardened detector, initially trained exclusively on FGSM-based evasion samples, to defend against DRL-based evasion samples. The outcomes, presented in TABLE 8, also reveal the detector's vulnerability to this evasion attack, with an ASR ranging from 97.462% to 97.750%. However, when the hardened detector is adversarially trained on both DRL-based and FGSM-based evasion samples, the ASR is reduced to a range of 2.197% to 3.832%. Moreover, significant improvements are observed in ACC, ACC_{adv}, and ACC_{all}, with values ranging from 86.005% to 93.007%, from 93.947% to 95.526%, and from 90.167% to 94.308%, respectively.

TABLE 8 The Defense Performance of the Adversarially Trained Hardened Detector Using FGSM-Based Evasion Samples Against DRL-Based Evasion Samples
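The cross-method evaluation carried out in this experiment can be summarized by the following sketch, which measures the ASR of a hardened detector against the other method's evasion samples and, when the detector proves vulnerable, retrains it on the union of both evasion sets. All names are placeholders for the paper's actual artifacts, and the detector is assumed to follow the QNet-style interface sketched earlier.

import numpy as np
import torch

def attack_success_rate(detector, X_evasion):
    """Share of evasion samples the detector labels as benign (class 0), in percent."""
    with torch.no_grad():
        pred = detector(torch.tensor(X_evasion, dtype=torch.float32)).argmax(dim=1)
    return 100.0 * float((pred == 0).float().mean())

def harden_on_both(X_train, y_train, X_drl_evasion, X_fgsm_evasion, train_fn):
    """Retrain the detector on the original data plus both attack methods' evasion samples."""
    X_both = np.vstack([X_drl_evasion, X_fgsm_evasion])
    X_aug = np.vstack([X_train, X_both])
    y_aug = np.concatenate([y_train, np.ones(len(X_both), dtype=int)])  # evasion -> malicious
    return train_fn(X_aug, y_aug)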

SECTION VII.

Conclusion

This paper investigates the vulnerability of RL-based electricity theft detectors to adversarial evasion attacks. We propose a DRL-based DDQN attack model to generate adversarial evasion samples, leveraging the exploration and exploitation mechanisms of RL to determine the optimal perturbation actions. By perturbing malicious samples, evasion samples are crafted so that the detectors misclassify them as benign. The evasion attack is conducted in a black-box scenario, which is practical and challenging because the attacker has no knowledge of the defense model. Our experiments demonstrate the effectiveness of the proposed attacks compared to FGSM-based attacks. The results indicate that our attack model can significantly degrade the detector's performance, achieving an ASR ranging from 92.92% to 99.96%. Additionally, there is a notable decrease in ACC to between 41.21% and 46.56%, representing a significant reduction of 39.81% to 46.75%.

To counter evasion attacks, we train a defense model that applies adversarial training to a DRL-based detector to obtain a hardened detector. The experimental results showcase the robustness of the defense model against evasion attacks, reducing the ASR to between 1.80% and 9.20%, which corresponds to a significant reduction of 83.72% to 98.16%. Moreover, there is a substantial increase in ACC, to between 85.04% and 93.26%, an improvement of 39.81% to 46.75%. Finally, we evaluate the ability of the hardened defense model, which is adversarially trained on evasion samples, to defend against evasion attack samples from different attack methods. The results suggest that the hardened defense model should be retrained on additional evasion samples originating from different evasion attack methods.

In summary, we emphasize the significance and benefits of developing a DRL-based defense model to counter electricity theft attacks. These attacks, initiated by fraudulent customers who manipulate their consumption readings in smart power grids, have practical implications for grid security. The proposed approach yields multifaceted advantages for smart power grids, such as bolstering their security and reliability by detecting and preventing fraudulent actions. By doing so, it safeguards the grid's functionality and diminishes the risk of cascading failures that might disrupt services for legitimate users. Minimizing losses due to fraudulent activities allows for improved resource allocation towards maintenance and upgrades. Mitigating electricity theft enhances customer trust by ensuring fair billing and promoting positive customer relationships. The insights gained from the RL-based detector offer valuable information on consumption patterns and vulnerabilities, guiding informed decisions for grid management, load monitoring and forecasting, and security enhancements. A dependable and secure power grid environment further stimulates innovation in smart grid technologies, empowering utility providers to confidently invest in advanced solutions such as renewable energy integration, smart metering, and demand response systems, ultimately enhancing operational efficiency. The reduction in losses attributed to electricity theft also contributes to increased revenue for utility companies.

ACKNOWLEDGMENT

This work was supported by the Researchers Supporting Project number (RSPD2023R636), King Saud University, Riyadh, Saudi Arabia, and by Idaho State University funding for the Center for Advanced Energy Studies (CAES) of the USA.
