With the massive growth of mobile applications (e.g., online gaming, image and video processing), tremendous computation demands have been imposed on resource-constrained wireless devices. Fortunately, Mobile Edge Computing (MEC) has emerged as a powerful technology to improve computation efficiency [1], in which a mobile device can offload tasks to nearby servers (e.g., access points or base stations) deployed at the edge of the network. Nevertheless, offloading quality often fails to reach desirable levels in obstructed regions. Recently, the reconfigurable intelligent surface (RIS) has been introduced in [2] to improve offloading performance. In particular, reflected signals can be constructively combined by dynamically configuring the phase shifts and reflection amplitudes of the RIS elements.
Cognitive radio (CR) has been recognized as a promising technique to remarkably improve the spectral efficiency (SE), in which spectrum sharing is categorized into three paradigms: interweave [3], overlay [4], and underlay [5]. In cognitive radio networks (CRNs), the interference temperature at the primary users (PUs) must be kept below a predefined threshold regardless of which paradigm is used. This requirement can be relaxed in the interweave and overlay paradigms if the channel state information (CSI) of the PU is perfectly known by the secondary user (SU) transmitters. In the underlay paradigm, however, the interference temperature must be carefully considered when designing optimization mechanisms for CRNs since it significantly affects the achievable data rate. Meanwhile, computational applications on Internet of Things (IoT) devices tend to require energy-autonomous operation, since battery maintenance is impractical owing to difficult accessibility or high cost in harsh network scenarios. As a remedy, energy harvesting techniques have recently emerged to address the energy-constrained operation and energy-autonomy problems [6]. Especially for energy-constrained MEC systems, addressing critical challenges such as finite battery capacity, sustainable operation, and limited spectrum resources has received more and more attention. Motivated by the aforementioned considerations, these challenges can be alleviated by integrating three technologies into MEC systems: 1) RIS; 2) cognitive radio; and 3) energy harvesting.
A. Literature Review
This subsection reviews related works on MEC systems, RIS-assisted MEC systems, RIS-assisted CRN systems, and deep reinforcement learning (DRL) frameworks in RIS-assisted systems, which are respectively discussed as follows:
1) MEC Systems
The authors in [7] investigated a CR-MEC system considering the interference temperature of the primary network to maximize the overall utility. A three-layer architecture of a MEC system in which CR technology was applied was studied in [8]. Both the cooperation [9] and the non-cooperation [10] scenarios between the PU and the mobile device were studied in CR-MEC networks using wireless power transfer (WPT). In particular, the authors in [9] maximized the computation efficiency, while the optimal number of computation bits was the objective of the study in [10]. The WPT-CR-MEC system in [9] was extended to multi-antenna transmission contexts [11], where the authors considered the security of the primary transmitter. Iterative algorithms were proposed to minimize the offloading delay in a two-user MEC system using non-orthogonal multiple access (NOMA) [12]. Besides, the work in [13] investigated a multi-user NOMA-MEC framework to minimize delay under resource constraints. Although the references [12], [13] verified the effectiveness of the proposed schemes, they might not be applicable to CR-MEC due to the interference constraints imposed on the primary system, and thus the complexity of the offloading policy should be carefully considered. Moreover, these works ignored the local computing capability that is generally supported in IoT devices. It is noted that the above research efforts in [7], [8], [9], [10], [11], [12], and [13] did not use an RIS to improve the offloading performance in MEC systems.
2) RIS-Assisted MEC Systems
By leveraging the benefits of RIS, the performance of MEC systems can be improved [14], [15]. An RIS-aided MEC system was studied in [14], in which users can partially offload their tasks to a MEC server with the aid of the RIS. The simulation results indicated that the overall latency is reduced considerably. The authors in [15] considered a MEC system where an RIS was used to aid both the WPT and data offloading processes. However, the works in [14] and [15] only considered the employment of RIS in traditional MEC networks without considering various multiple access schemes. To cope with this drawback, [16] presented an RIS-aided MISO-MEC system serving multiple users in the same frequency and time block by using NOMA technology, and conventional orthogonal multiple access (OMA) methods were compared in the simulations to evaluate the effectiveness of the proposed scheme.
3) RIS-Assisted CRN Systems
Even though CR can boost the SE, the benefits for the PUs and the SUs are conflicting. For instance, increasing the transmission power at secondary transmitters to raise the received signal strength also increases the interference temperature towards the primary receivers. This bottleneck can be addressed by utilizing an RIS, whose reconfiguration capability enhances the received signal strength at the SUs while mitigating interference to the PUs via jointly optimizing transmit precoding and passive beamforming [17], [18]. However, it is challenging to realize these works in practice, since the channel information between SUs and PUs is difficult to acquire owing to their uncooperative relationship. Several contributions designed transmission beamforming with imperfect PU channels for RIS-CR systems [19], [20]. On the other hand, they only targeted maximizing the network capacity, which may not ensure the quality of service (QoS) at the SU for applications such as video conferencing, autonomous vehicles, and so on. To cope with this fact, the authors in [21] designed a robust beamforming scheme aiming to minimize the transmission power of the SUs subject to the SUs' QoS requirements and the PUs' interference threshold.
4) DRL-Based Frameworks in RIS-Assisted Systems
In recent years, artificial intelligence (AI) has been viewed as one of the most appealing technologies for MIMO systems, in which optimization problems are non-trivial owing to the large dimensions involved. As a branch of AI research, DRL has been recognized as a powerful tool for decision-making optimization in wireless networks without prior knowledge of the environment, especially in RIS-aided communication contexts. In [22], a DRL algorithm was developed to select optimal RIS phase shifts that enhance the received signal-to-noise ratio (SNR). Reference [23] presented a joint transmit beamforming and RIS phase-shift allocation scheme using a DRL method to maximize the sum rate. However, the network settings in these studies were mostly assumed to work under ideal conditions, which is impractical. In [24], the authors cast the cell vectorization problem as beamforming matrix selection to attain optimal network coverage using a DRL-based approach. Besides, a DRL scheme jointly allocating beamforming, power control, and interference coordination was designed to maximize network performance in [25]. The authors in [26] proposed both deep learning (DL) and DRL algorithms to solve the effective throughput maximization problem and evaluated their pros and cons. Nevertheless, the above DRL-based schemes using RIS did not take the MEC context into consideration. Only a few studies [27], [28], [29], [30] focused on problems in RIS-assisted MEC systems and solved them by adopting DRL techniques. More specifically, the work in [27] jointly optimized the allocated WPT time, RIS phase shifts, and IoT devices' offloading actions to enhance the overall computational capability by adopting a double deep Q-learning algorithm. Meanwhile, a constrained DRL method was developed in [28] to select the optimal RIS phase shifts for the downlink under latency constraints. The problem of joint offloading, communication, and computation resource allocation for a NOMA-RIS-MEC system was investigated in [29] and solved by a Lyapunov-based DRL method with both continuous and integer mappings. To maximize the weighted sum secrecy computation efficiency of the system, the authors in [30] proposed a deep deterministic policy gradient (DDPG) scheme under eavesdropping attacks.
B. Motivations and Contributions
To the best of the authors' knowledge, this is the first work investigating a solar-powered RIS-assisted CR-MEC system. In contrast to most of the existing literature, we consider an imperfect spectrum sensing context in the interweave paradigm, in which the offloading performance of the CR system is degraded by sensing errors, i.e., false alarms and misdetections. Indeed, the CR-MEC network investigated in this work is closer to practice because we consider the dynamic nature of the wireless environment (e.g., the uncertainty of solar energy arrivals and the dynamic activity of the primary network) together with the inevitable sensing errors of the cognitive system. In this scenario, the solar-powered battery of the cognitive user can easily be drained due to imperfect spectrum sensing and improper utilization of offloading energy. Therefore, it is challenging for non-learning conventional approaches to directly obtain an optimal solution to the optimization problem owing to time-varying channel conditions, the stochastic properties of the harvested energy, and the dynamic behavior of the primary channel. Besides, designing a scheme using mathematical tools might impose high computational complexity for implementation. Furthermore, some conventional schemes (e.g., the partially observable Markov decision process (POMDP)) require prior knowledge of the harvested-energy probability to determine the state transition probability, making the problem more challenging. Fortunately, DRL approaches can be adopted for local computing and offloading in the CR-MEC network by interacting with the environment in a trial-and-error fashion to gain knowledge and select the optimal action.
The motivation for using DRL in system optimization lies in its flexibility and adaptability. Unlike traditional methods such as Lyapunov optimization [31], Dynamic Programming (DP) [32], Linear Programming (LP) [33], Convex Optimization [34], or Optimal Control [35], DRL does not require a precise mathematical model or rigid structural constraints, making it more suitable for complex and uncertain environments. DRL handles high-dimensional problems effectively, overcoming the curse of dimensionality, and excels in nonlinear and stochastic settings by adapting to real-time feedback. While methods such as Lyapunov optimization and Optimal Control can address some nonlinearity, they involve significant mathematical complexity and pre-defined constraints, which restrict their adaptability. In contrast to LP and Convex Optimization methods, which focus on finding an optimal solution for immediate rewards, DRL optimizes long-term rewards by balancing exploration and exploitation. This makes DRL highly effective for sequential decision-making tasks such as queuing and resource allocation. Additionally, DRL can handle complex, multi-objective goals, such as balancing latency and energy, through custom reward functions.
Since the system has large state and action spaces, it is difficult for traditional learning algorithms to optimize the computation rate. Besides, the existing works did not intensively investigate on-policy and off-policy methods with their pros and cons. We aim to transform the resource allocation problem into a Markov decision process (MDP) framework, to which DRL frameworks can be applied to solve the computation rate optimization. Thereby, investigating the details of mapping a state to an optimal action using various DRL frameworks is crucial, taking into account the trade-offs between their effectiveness and computational complexity. On the other hand, the convergence properties, performance, and computational complexity of these DRL schemes have not been compared thoroughly. To fill this gap, we adopt separate state-of-the-art DRL frameworks, including policy-based (i.e., on-policy), value-based (i.e., off-policy), and policy-and-value methods, to find the solution to a high-dimensional optimization problem. In particular, unlike other mathematical optimization mechanisms, our proposed DRL designs do not require sophisticated mathematical formulations and calculations. The contributions of this article are listed as follows:
We consider a solar-powered RIS-assisted CR-MEC system, where an SU equipped with a solar energy harvester attempts to optimize its data computation. Imperfect spectrum sensing in determining the status of the licensed channel is assumed. In addition, the limitations of battery capacity, computational capability, and harvested energy are taken into account.
The MDP problem of allocating either the CPU-cycle frequency, or jointly the offloading energy and RIS phase shifts, for computation rate maximization is defined. Subsequently, a policy-based approach, namely deep policy gradient (DPG), is first proposed to solve the MDP problem. Thereby, the SU learns environment knowledge by directly following the action distribution. Though a high computation rate can be obtained, this scheme is easily affected by the stochastic nature of generating actions, leading to large oscillations or even divergence.
In order to avoid divergence as well as to reduce the critical effect of stochastic actions on the value function of the DPG scheme, we propose to use a value-based method, namely deep Q-learning (DQL), to generate the optimal CPU frequency, offloading energy, and RIS phase shifts. In this scheme, the state-action value function is used to optimize the selected action rather than directly following the action distribution. With this scheme, the fluctuation during training is reduced, yet the system frequently converges to a locally optimal policy.
By inheriting the advantages of the policy-based principle of DPG and the value-based principle of DQL, we further employ a policy-and-value method, namely deep actor-critic (DAC), including actor and critic components, to address the MDP problem. Thereby, the actor determines actions following a parameterized policy, while the critic is in charge of evaluating the quality of the taken actions through the rewards returned by the environment. Consequently, good convergence and stability are achieved with proper parameters.
The remainder of this work is organized as follows. In Section II, the system model is presented. Next, we elaborate on the proposed DRL frameworks: DPG, DQL, and DAC in Section III. We conduct numerical simulations and discuss the results in Section IV, and then the conclusion of the paper is given in Section V.
SECTION II.
System Model
As shown in Fig. 1(a), we consider an RIS-assisted CR-MEC system consisting of an SU, a RIS, a secondary base station (SBS), and a primary network including a primary transmitter (PT) and a primary receiver (PR). The PT and PR are licensed to use a primary channel, while the single-antenna SU can opportunistically offload its data when the primary channel is sensed to be free. The SBS has M antennas, while the RIS is equipped with K reflecting elements and a micro-controller that adjusts their phase shifts to achieve better channel conditions between the SU and the SBS. The considered network reflects the following realistic scenario. Nowadays, many smart buildings are evolving with built-in technologies that usually require or generate local data. However, most Internet of Things (IoT) devices cannot quickly process data due to poor computing power and low battery capacity. Therefore, by exploiting energy harvesting (EH), CR, RIS, and MEC techniques, IoT devices can efficiently process large amounts of data, such as climate and temperature control, smart signaling, asset tracking, or health monitoring data, with the aid of the energy autonomy of EH, the spectrum sharing of CR, the channel-condition adjustment of the RIS, and the cloud server center (i.e., the MEC server) for sustained operation. Thus, the multi-technique network investigated in this paper offers great benefits for low-end devices, enabling rapid data processing with energy autonomy.
We consider a three-dimensional (3D) Cartesian coordinate system, where the coordinates of the SU, the RIS, and the SBS are respectively denoted by ${\mathbf{u}} = \left(x_{u}, y_{u}, h_{u}\right)^{T}$, ${\mathbf{r}} = \left(x_{r}, y_{r}, h_{r}\right)^{T}$, and ${\mathbf{b}} = \left(x_{b}, y_{b}, h_{b}\right)^{T}$. The SU needs to execute computational tasks of various applications, such as data fusion, climate and temperature control, smart signaling, and so on. Due to its low computational capability, the SU can enhance its computation rate by cooperating with the MEC server collocated at the SBS through offloading. To this end, the SU offloads the task to the SBS so that the data are remotely processed by the MEC server, and the computed result is sent back to the SU once the computation is completed. We assume that the MEC server has sufficient dedicated computing resources and energy to assist the computation of the SU.
During offloading, the SU has to share the primary channel (i.e., the licensed channel) with the primary network. We assume that the SU always has a computational task to process. Generally, there are three models of computation offloading: (a) offloading only; (b) all-or-nothing offloading, in which the entire data is either offloaded or processed locally; and (c) partial offloading, where some parts are offloaded while the remaining parts are performed locally. This article focuses on model (b), since the SU adaptively executes tasks to maximize the computation performance using CR based on its energy condition, the offloading channel, and the primary channel activity. The MEC system operates in a time-slotted fashion, where $\tau$ represents the length of each time slot. The operation of the SU in a time slot is illustrated in Fig. 1(b) and proceeds as follows. At the beginning of each time slot, the SU performs spectrum sensing with sensing duration $t_{s}$ to determine the status of the primary channel. Based on the sensing result, the SU decides on either the local computing mode or the offloading mode. In particular, the SU carries out local computing within a duration of $\tau - t_{s}$ if the sensing result is "busy". On the other hand, when the spectrum sensing result shows that the primary channel is "free", the SU offloads its data to the SBS for collaborative computation at the MEC server within a duration of $t_{o}$. Subsequently, the MEC server computes the data and feeds the computed result back to the SU within the periods $t_{m}$ and $t_{d}$, respectively.
A. Local Computing Model
In this article, the number of CPU cycles required to process one bit of raw data at the SU is denoted by $\theta$ ($\theta > 0$). Let $f(t)$ ($0 \le f(t) \le f_{\max}$) represent the scheduled CPU-cycle frequency (cycles/second) of the SU in time slot $t$, where $f_{\max}$ is the maximal computational capability of the SU. Accordingly, the local computing rate of the SU in time slot $t$ is
\begin{equation*} r_{l}(t) = \frac {\tau - {t_{s}}}{\tau }{f}(t)\theta ^{ - 1}. \tag {1}\end{equation*}
The energy consumption of the local computation at the SU is given as
\begin{equation*} e_{l}(t) = \left (\tau - {t_{s}}\right )\gamma f^{3}(t), \tag {2}\end{equation*}
where $\gamma$ is the effective switched-capacitance constant, which depends on the chip of the SU.
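As a concrete illustration of Eqs. (1)-(2), the short Python sketch below evaluates the local computing rate and energy for a given CPU-cycle frequency; the numerical values of $\tau$, $t_{s}$, $\theta$, and $\gamma$ are placeholders rather than the simulation parameters used later in this paper.

```python
def local_computing(f, tau=1.0, t_s=0.1, theta=1e3, gamma=1e-28):
    """Local computing rate (Eq. 1) and local energy consumption (Eq. 2).

    f     : scheduled CPU-cycle frequency in cycles/s
    tau   : slot length in seconds
    t_s   : sensing duration in seconds
    theta : CPU cycles required per bit
    gamma : effective switched-capacitance constant of the SU's chip
    """
    r_l = (tau - t_s) / tau * f / theta        # bits/s averaged over the slot, Eq. (1)
    e_l = (tau - t_s) * gamma * f ** 3         # Joules consumed in the slot, Eq. (2)
    return r_l, e_l

# example: a 0.5 GHz CPU-cycle frequency
print(local_computing(f=0.5e9))
```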
B. Offloading Model
The SBS receives the signal transmitted by the SU via two paths: the direct path and the reflected RIS path. We assume the direct path is a non-line-of-sight (NLOS) channel. Let ${\mathbf{h}}_{\text{ub}} = g_{\text{ub}} \left[h_{1}^{\text{u}}, \ldots, h_{M}^{\text{u}}\right]^{T} \in \mathbb{C}^{M \times 1}$ denote the direct-path channel, where $g_{\text{ub}} = d_{\text{ub}}^{-\alpha_{L}/2}$ represents the path loss from the SU to the SBS with distance $d_{\text{ub}}$ and path-loss exponent $\alpha_{L}$, and $h_{m}^{\text{u}}$ represents the small-scale fading between the SU and antenna $m$ of the SBS. The reflected RIS path includes two links: an NLOS SU-RIS link and a line-of-sight (LOS) RIS-SBS link. The NLOS SU-RIS channel is denoted by ${\mathbf{h}}_{\text{ur}} = g_{\text{ur}} \left[h_{1}^{\text{r}}, \ldots, h_{K}^{\text{r}}\right]^{T} \in \mathbb{C}^{K \times 1}$, where $g_{\text{ur}} = d_{\text{ur}}^{-\alpha_{L}/2}$ is the SU-RIS path loss, and $h_{k}^{\text{r}}$ is a complex Gaussian random variable, $\mathcal{CN}\left(0,1\right)$. We denote the channel between the RIS and the SBS as ${\mathbf{H}}_{\text{rb}} = g_{\text{rb}} \left\{h_{m,k}\right\}_{m \in [1,M],\, k \in [1,K]} \in \mathbb{C}^{M \times K}$, where $g_{\text{rb}} = 1/\sqrt{4\pi d_{\text{rb}}^{2}}$ is the free-space RIS-SBS path-loss attenuation with distance $d_{\text{rb}}$, and $h_{m,k}$ represents the LOS channel between antenna $m$ of the SBS and element $k$ of the RIS. It is assumed that ${\mathbf{h}}_{\text{ub}}$, ${\mathbf{h}}_{\text{ur}}$, and ${\mathbf{H}}_{\text{rb}}$ are quasi-static block-fading channels. Channel estimation is beyond the scope of this paper. For passive RIS-aided networks, various methods, such as the dual-link pilot transmission scheme [36] and the anchor-based two-phase method [37], have been proposed to estimate the CSI of the individual RIS-aided channels. Therefore, we assume that the CSI is perfectly known to the SU at the start of each time slot.
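For clarity, the sketch below draws one realization of the three channels under the stated statistics; the LOS matrix $\mathbf{H}_{\text{rb}}$ is filled with unit-modulus entries of random phase as a simple placeholder, since the exact LOS array-response model is not reproduced here, and the distances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def path_loss_nlos(d, alpha_L):
    """NLOS path-loss amplitude d^{-alpha_L / 2}."""
    return d ** (-alpha_L / 2)

def draw_channels(M, K, d_ub=50.0, d_ur=10.0, d_rb=40.0, alpha_L=3.0):
    """One realization of h_ub, h_ur (NLOS, CN(0,1) small-scale fading) and H_rb (free-space LOS)."""
    # direct SU-SBS channel: M x 1
    h_ub = path_loss_nlos(d_ub, alpha_L) * (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
    # SU-RIS channel: K x 1
    h_ur = path_loss_nlos(d_ur, alpha_L) * (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
    # RIS-SBS LOS channel: M x K, free-space attenuation and unit-modulus placeholder phases
    g_rb = 1.0 / np.sqrt(4.0 * np.pi * d_rb ** 2)
    H_rb = g_rb * np.exp(1j * 2.0 * np.pi * rng.random((M, K)))
    return h_ub, h_ur, H_rb
```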
Upon receiving the signal transmitted by the SU, the RIS reflects it to the SBS according to a matrix of phase shifts $\boldsymbol{\Phi} = \text{diag}\left(e^{j\phi_{1}}, e^{j\phi_{2}}, \ldots, e^{j\phi_{K}}\right)$, where $\phi_{k} \in {\textsf{F}}_{\text{ris}} = \left\{\frac{2\pi l}{2^{b}}\right\}_{l=0}^{2^{b}-1}$ is the phase shift of RIS element $k$, and $b$ is the resolution of the phase shifters [38]. The received signal at the SBS is
\begin{equation*} y(t) = \sqrt {p_{o}(t)} \left ({\mathbf {H}}_{\text {rb}}(t){\boldsymbol {\Phi }}(t){\mathbf {h}}_{\text {ur}}(t) + {\mathbf {h}}_{\text {ub}}(t)\right )x(t) + {\mathbf {n}}(t), \tag {3}\end{equation*}
where $p_{o}(t) = \frac{e_{o}(t)}{t_{o}}$ is the offloading power obtained from the offloading energy $e_{o}(t)$ consumed by the SU, $x(t)$ is the task information of the SU, and $\mathbf{n}(t) = \left[n_{1}(t), n_{2}(t), \ldots, n_{M}(t)\right]^{T}$ is a vector of additive white Gaussian noise such that $\mathbf{n}(t) \sim \mathcal{CN}\left(0, \sigma^{2}\mathbf{I}_{M}\right)$. The received SNR at the SBS is
\begin{equation*} {\gamma _{o}}(t) = \frac {p_{o}(t)\left \|{\mathbf {H}}_{\text {rb}}(t){\boldsymbol {\Phi }}(t){\mathbf {h}}_{\text {ur}}(t) + {\mathbf {h}}_{\text {ub}}(t)\right \|^{2}}{\sigma ^{2}}. \tag {4}\end{equation*}
Thus, the offloading data rate is
\begin{equation*} {r_{o}}(t) = \frac {t_{o}}{\tau }B\log _{2}\left (1 + {\gamma _{o}}(t)\right ), \tag {5}\end{equation*}
where $B$ denotes the bandwidth for data offloading. The SU, with a limited battery capacity $E_{Ca}$, constantly harvests solar energy from the ambient environment. The energy arriving during time slot $t$, $e_{h}(t)$, is assumed to be limited, where $0 < e_{h}(t) < E_{Ca}$, and it follows a Poisson distribution with mean harvested energy $\varepsilon$. We model the status of the primary channel by a two-state discrete-time Markov chain. Particularly, the state of the primary channel in a time slot is represented by "B" (busy) or "F" (free), and the transition probability of its state between two adjacent time slots is denoted by $P_{ij}$, $i,j \in \{F,B\}$. During the sensing duration, the SU carries out spectrum sensing, and the sensing result, $H(t) \in \{F,B\}$, indicates the "free" or "busy" state of the primary channel. Because sensing errors inevitably happen, sensing performance is evaluated by two crucial metrics: the false alarm probability, $P_{f} = \Pr\left(H(t) = B \mid F\right)$, and the detection probability, $P_{d} = \Pr\left(H(t) = B \mid B\right)$. The former is the probability that the primary channel is sensed as "busy" while it is actually "free", while the latter is the probability that the SU correctly detects the "busy" state. A misdetection is the circumstance in which the channel is actually "busy" but is sensed as "free". In essence, the performance of the network can be significantly lowered by a large number of false alarms and misdetections. Specifically, on the one hand, a false alarm may cause the SU to miss an opportunity to use the primary channel. On the other hand, a misdetection causes a transmission collision between the SU and the PT. In this study, the probability that the primary channel is free is updated at the end of each time slot to reduce collisions. To protect the transmissions of the PT in the primary network, given the maximally allowable collision probability between the SU and the PT, the detection probability $P_{d}$ can be kept greater than a threshold $\varsigma$ by adjusting the sensing parameters.
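Given a channel realization, the offloading SNR in Eq. (4) and the slot-averaged rate in Eq. (5) follow directly, as in the sketch below; the b-bit codebook mirrors the definition of ${\textsf{F}}_{\text{ris}}$, while the channel draws and the numerical values of $e_{o}$, $t_{o}$, $B$, and $\sigma^{2}$ are illustrative placeholders.

```python
import numpy as np

def offloading_rate(h_ub, h_ur, H_rb, phi, e_o, t_o, tau, B, sigma2):
    """Received SNR (Eq. 4) and slot-averaged offloading rate (Eq. 5) for RIS phases phi."""
    Phi = np.diag(np.exp(1j * phi))                 # RIS reflection matrix from the phase-shift vector
    h_eff = H_rb @ Phi @ h_ur + h_ub                # effective channel: reflected path + direct path
    p_o = e_o / t_o                                 # offloading power p_o = e_o / t_o
    gamma_o = p_o * np.linalg.norm(h_eff) ** 2 / sigma2
    r_o = (t_o / tau) * B * np.log2(1.0 + gamma_o)
    return gamma_o, r_o

# b-bit discrete phase codebook F_ris = {2*pi*l / 2^b, l = 0, ..., 2^b - 1}
b, K, M = 2, 16, 4
F_ris = 2.0 * np.pi * np.arange(2 ** b) / (2 ** b)
rng = np.random.default_rng(1)
phi = rng.choice(F_ris, size=K)
h_ub = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
h_ur = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
H_rb = np.exp(1j * 2.0 * np.pi * rng.random((M, K)))
print(offloading_rate(h_ub, h_ur, H_rb, phi, e_o=1e-3, t_o=0.8, tau=1.0, B=1e6, sigma2=1e-10))
```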
C. Problem Formulation
In this paper, offloading actions are influenced by the availability of the primary network, as the SU must share the licensed channel with the primary network, and the activity of the licensed network can significantly impact the SU's operations. Therefore, instead of optimizing the offloading time, our focus is on maximizing the long-term computation rate, encompassing both the local computation and offloading rates. More specifically, this paper considers the problem of determining either the CPU-cycle frequency or, jointly, the offloading energy and the RIS phase shifts. We aim to develop schemes that attain the maximal computation rate under the constraints of limited resources. The problem is formulated as
\begin{align*} & \max \limits _{\beta (t),\,f(t),\,e_{o}(t),\,{\boldsymbol {\Phi }}(t)} \sum \limits _{t = 1}^{\infty }\left (\left (1 - \beta (t)\right ){r_{l}}(t) + \beta (t){r_{o}}(t)\right ) \\ & \;\;{\text {s.t.}}\quad \beta (t) \in \{0,1\}, & \text {(a)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } 0 \le f(t) \le f_{\max }, & \text {(b)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } 0 \le e_{o}(t) \le e_{\max }, & \text {(c)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } e_{o}(t) + e_{s} \le e_{r}(t), & \text {(d)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } e_{l}(t) + e_{s} \le e_{r}(t), & \text {(e)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } r_{o}(t) \ge r_{o,\min }, & \text {(f)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } r_{l}(t) \ge r_{l,\min }, & \text {(g)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } \phi _{k}(t) \in {\textsf {F}}_{\text {ris}}, & \text {(h)} \tag {6}\end{align*}
where $r_{l}(t)$ and $r_{o}(t)$ are the local computing rate and the offloading rate of the SU, obtained from Eq. (1) and Eq. (5), respectively; $f_{\max}$ and $e_{\max}$ are the maximal computing capability and the maximal offloading energy of the SU, respectively; and $\beta(t) \in \{0,1\}$ is the offloading indicator of the SU. Constraint (a) indicates that the SU performs either local computing or data offloading in time slot $t$, where $\beta(t) = 1$ means that the task is offloaded and $\beta(t) = 0$ otherwise. Constraints (b) and (c) restrict the selection of the CPU frequency levels and the offloading energy levels, respectively. Next, the total energy consumed for either offloading and sensing, constraint (d), or local computing and sensing, constraint (e), must not exceed the current energy level $e_{r}(t)$ of the SU. The system guarantees a minimum offloading rate and a minimum local computing rate through constraints (f) and (g), respectively. It is noted that the remaining energy in the battery of the SU is affected by the dynamics of energy harvesting and energy consumption. Therefore, constraints (f) and (g) are only applied when the SU selects an offloading energy $e_{o}(t) > 0$ or a local computing energy $e_{l}(t) > 0$, respectively. Constraint (h) ensures that the RIS phase shifts are selected from the predefined set of phase-shift candidates, ${\textsf{F}}_{\text{ris}}$.
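The per-slot constraints of problem (6) can be checked with a small helper like the one below; it is a sketch under assumed argument names, where constraint (h) is handled separately by drawing each phase shift from ${\textsf{F}}_{\text{ris}}$, and constraints (f)/(g) are applied only when the corresponding energy is positive, as discussed above.

```python
def feasible(beta, f, e_o, e_l, e_s, e_r, r_o, r_l,
             f_max, e_max, r_o_min, r_l_min):
    """Check constraints (a)-(g) of problem (6) for one time slot."""
    if beta not in (0, 1):                          # (a) binary offloading indicator
        return False
    if not 0 <= f <= f_max:                         # (b) CPU-cycle frequency bound
        return False
    if not 0 <= e_o <= e_max:                       # (c) offloading-energy bound
        return False
    if beta == 1:
        if e_o + e_s > e_r:                         # (d) energy budget for offloading + sensing
            return False
        if e_o > 0 and r_o < r_o_min:               # (f) minimum offloading rate when offloading
            return False
    else:
        if e_l + e_s > e_r:                         # (e) energy budget for local computing + sensing
            return False
        if e_l > 0 and r_l < r_l_min:               # (g) minimum local rate when computing locally
            return False
    return True
```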
Problem (6) is non-trivial, and it is challenging to directly obtain an optimal solution owing to time-varying channel conditions, the stochastic properties of the harvested energy, and the dynamic behavior of the primary channel. Although optimal RIS phase shifts can be derived by mathematical solutions in the literature, decomposing problem (6) into multiple sub-problems is time consuming and may increase the computational burden. Besides, designing a scheme using mathematical tools might impose high computational complexity for implementation. Furthermore, the energy uncertainty of the battery due to stochastic energy arrivals makes the problem more challenging. Thus, we first formulate an MDP based on problem (6). Subsequently, DRL approaches are separately proposed to solve the MDP problem. In formulating the MDP problem for the DRL solutions, we incorporate the MEC environment, in which the SU operates as an edge node, processing tasks by leveraging the remote computational resources provided by the MEC server at the SBS. The SU's actions are defined as either performing local computation or offloading data to the SBS, thereby accessing MEC services. The proposed DRL schemes are designed to determine the optimal policy for the SU to maximize the computation rate within this MEC framework.
SECTION III.
Deep Reinforcement Learning Frameworks
In this section, the resource allocation problem (6) is reformulated as an MDP following the decision-making process. The solution to an MDP problem can generally be obtained by a POMDP scheme adopting a value-iteration-based method under the assumption that the state transition probabilities of the system are known a priori. Unfortunately, this assumption might be unrealistic due to the challenge of acquiring such information in practical scenarios. On the other hand, conventional reinforcement learning (RL) approaches can be adopted to address the problem of unknown environment knowledge. However, they cannot work effectively on high-dimensional optimization problems because of the prohibitive storage memory required to maintain a Q-table in large-scale systems. Therefore, inspired by the model-free property of RL and the powerful approximation ability of a deep neural network (DNN), we develop DRL frameworks for the MDP problem without knowing the energy harvesting distribution in advance.
A. Markov Decision Process and Observations
We describe the state space, action space, and reward function to formulate the MDP problem in this subsection. Let $\mathbb{S}$ denote the state space of the system, where each state comprises the belief that the primary channel is free, $p(t)$, the remaining energy of the SU, $e_{r}(t)$, and the channel condition, $\tilde{\mathbf{h}}(t)$, i.e., $s(t) = \left(p(t), e_{r}(t), \tilde{\mathbf{h}}(t)\right) \in \mathbb{S}$, where $\tilde{\mathbf{h}}(t) = \mathbf{H}_{\text{rb}}(t)\boldsymbol{\Phi}(t)\mathbf{h}_{\text{ur}}(t) + \mathbf{h}_{\text{ub}}(t)$. According to state $s(t)$, the agent selects action $a(t)$. The action space is expressed as $\mathbb{A} = \left\{a_{1}, a_{2}, \ldots, a_{|\mathbb{A}|}\right\}$, where $a(t) = \left(f(t), e_{o}(t), \boldsymbol{\Phi}(t)\right) \in \mathbb{A}$ is the action composed of the allocated CPU frequency, offloading energy, and phase shifts. It is noted that the action is generated according to the state $s(t)$ at the beginning of the time slot, before sensing. After obtaining the sensing result, the agent either offloads data to the SBS using the assigned $e_{o}(t)$ and $\boldsymbol{\Phi}(t)$ if the result shows state "F", or performs local computing using the assigned $f(t)$ if the result is "B". Intuitively, $\beta(t)$ is omitted from the action formulation since the sensing result determines the value of $\beta(t)$. Given the state $s(t)$ and action $a(t)$, the immediate reward, defined as the computation rate of the SU at time slot $t$, is calculated by
\begin{equation*} r(t) = \left (1 - \beta (t)\right ){r_{l}}(t) + \beta (t){r_{o}}(t). \tag {7}\end{equation*}
In the MDP framework, a transition function, $\mathsf{T}: \mathbb{S} \times \mathbb{A} \times \mathbb{S} \to [0,1]$, indicates the conditional state transition probability between two adjacent time slots. However, the state transition probability of the system considered in this work cannot be identified due to the continual variations of the environment in terms of the channel gains and the harvested energy of the SU. Therefore, the transition function $\mathsf{T}$ is unknown. The operation of the SU in each time slot is described as follows. First, the SU observes the network state $s(t)$. Subsequently, it generates the action $a(t)$ based on $s(t)$. After that, spectrum sensing is conducted to obtain the result $H(t)$. The SU then performs action $a(t)$: it either offloads data to the SBS with the assigned offloading energy if $H(t) = F$, or executes local computation with the allocated CPU frequency if $H(t) = B$. As a result, the SU receives an immediate reward and then observes the next state $s(t+1)$. We assume that the data offloaded by the SU is unsuccessfully computed by the MEC server if a transmission collision between the SU and the PT happens. Thus, the SU dynamically adjusts its actions according to the observations regarding states, actions, and obtained rewards during training. This per-slot operation is summarized in the sketch below, and the belief-updating process for each possible observation is presented afterwards.
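The following minimal sketch assumes the notation introduced above and uses illustrative values for $P_{d}$, $P_{f}$, and the harvested-energy arrivals; it is a simplified simulation of one time slot, not the environment implementation used in our experiments, and it defers the belief update to the observation-specific rules given next.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(state, action, channel_free, params):
    """One MDP transition sketch: sense, act, then collect the reward of Eq. (7).

    state        = (p, e_r, h_eff): belief, remaining energy, effective channel
    action       = (f, e_o): CPU-cycle frequency and offloading energy
                   (the chosen phase shifts are already folded into h_eff)
    channel_free = true primary-channel state, hidden from the agent
    """
    p, e_r, h_eff = state
    f, e_o = action
    tau, t_s, t_o, theta, gamma_sw, B, sigma2, e_s = params
    P_d, P_f = 0.9, 0.1                          # illustrative sensing probabilities

    # spectrum sensing: "busy" is reported with prob. P_d if busy, P_f if free
    sensed_busy = rng.random() < (P_d if not channel_free else P_f)

    if sensed_busy:                              # local computing (beta = 0), reward r_l(t)
        reward = (tau - t_s) / tau * f / theta
        e_used = (tau - t_s) * gamma_sw * f ** 3
    elif not channel_free:                       # misdetection: collision with the PT, no reward
        reward, e_used = 0.0, e_o
    else:                                        # successful offloading (beta = 1), reward r_o(t)
        snr = (e_o / t_o) * np.linalg.norm(h_eff) ** 2 / sigma2
        reward = (t_o / tau) * B * np.log2(1.0 + snr)
        e_used = e_o

    # battery update: subtract sensing + computing/offloading energy, add harvested energy
    e_r_next = e_r - e_s - e_used + rng.poisson(3) * 1e-4
    return reward, e_r_next, sensed_busy
```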
Observation 1 ($O_{1}$): The sensing result indicates state "B" of the primary channel, and the SU performs local computation. The system reward is then $r\left(s(t), a(t) \mid O_{1}\right) = r_{l}(t)$. We update the belief by
\begin{equation*} p(t + 1) = \frac {p(t){P_{f}}{P_{FF}} + \left (1 - p(t)\right ){P_{d}}{P_{BF}}}{p(t){P_{f}} + \left (1 - p(t)\right ){P_{d}}}, \tag {8}\end{equation*}
where $P_{FF}$ and $P_{BF}$ represent the transition probabilities of the primary channel from state F to state F and from state B to state F between two consecutive time slots, respectively. The remaining energy of the SU is computed by
\begin{equation*} {e_{r}}(t + 1) = {e_{r}}(t) - {e_{s}}(t) - {e_{l}}(t) + {e_{h}}(t), \tag {9}\end{equation*}
where $e_{s}(t)$ is the energy required for spectrum sensing at the SU.
Observation 2 ($O_{2}$): The state "F" of the primary channel is identified; the SU then offloads data to the SBS and successfully receives the computed result sent back by the SBS. The reward is $r\left(s(t), a(t) \mid O_{2}\right) = r_{o}(t)$. In this case, the updated belief for the next time slot is
\begin{align*} p(t + 1) = \begin{cases} \dfrac {p(t)\left (1 - {P_{f}}\right ){P_{FF}} + \left (1 - p(t)\right )\left (1 - {P_{d}}\right ){P_{BF}}}{p(t)\left (1 - {P_{f}}\right ) + \left (1 - p(t)\right )\left (1 - {P_{d}}\right )}, & {e_{o}}(t) = 0 \\ {P_{FF}}, & \text {otherwise}. \end{cases} \tag {10}\end{align*}
The first case occurs when the SU's energy is insufficient for task offloading, so the SU has to stay silent. On the other hand, if the SU offloads data to the SBS (i.e., $e_{o}(t) > 0$), it learns the actual state of the primary channel through the reception of the computed data from the MEC server and thus updates the belief using the second case. The remaining energy is
\begin{equation*} {e_{r}}(t + 1) = {e_{r}}(t) - {e_{s}}(t) - {e_{o}}(t) + {e_{h}}(t). \tag {11}\end{equation*}
Observation 3 ($O_{3}$): The sensing result indicates state "F" of the primary channel, and the SU offloads data to the SBS, but the SBS cannot successfully receive the data. This implies that there is a collision between the SU's offloading and the PT's transmission. Thus, there is no reward in this circumstance, i.e., $r\left(s(t), a(t) \mid O_{3}\right) = 0$. We update the belief by the following equation:
\begin{align*} p(t + 1) = \begin{cases} \dfrac {p(t)\left (1 - {P_{f}}\right ){P_{FF}} + \left (1 - p(t)\right )\left (1 - {P_{d}}\right ){P_{BF}}}{p(t)\left (1 - {P_{f}}\right ) + \left (1 - p(t)\right )\left (1 - {P_{d}}\right )}, & {e_{o}}(t) = 0 \\ {P_{BF}}, & \text {otherwise}. \end{cases} \tag {12}\end{align*}
The remaining energy is calculated in the same way as in Eq. (11). The process of performing an action at the SU is described in Algorithm 1, and a belief-update sketch is given after it. The DPG scheme is proposed in the next subsection.
Algorithm 1 The Process of Performing an Action at the SU
1: if sensing result is B then
2:   SU performs local computing and receives reward $r(t) = r_{l}(t)$.
3:   Update belief and remaining energy using Eqs. (8) and (9).
4: else
5:   SU offloads data to the SBS.
6:   if SU successfully receives the computed data then
7:     Receive reward $r(t) = r_{o}(t)$.
8:     Update belief and remaining energy using Eqs. (10) and (11).
9:   else
10:     Receive reward $r(t) = 0$.
11:     Update belief and remaining energy using Eqs. (12) and (11).
12:   end if
13: end if
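For completeness, the belief updates in Eqs. (8), (10), and (12) reduce to the small helper below; it is a sketch under assumed argument names, where the observation label ($O_{1}$, $O_{2}$, or $O_{3}$) selects the appropriate rule and a positive offloading energy means the true channel state is revealed by the offloading outcome.

```python
def update_belief(p, observation, e_o, P_f, P_d, P_FF, P_BF):
    """Belief that the primary channel will be free in the next slot (Eqs. 8, 10, 12).

    p           : current belief that the channel is free
    observation : "O1" (sensed busy), "O2" (offloading succeeded), "O3" (collision)
    e_o         : offloading energy used in the current slot
    """
    if observation == "O1":                      # Eq. (8): channel sensed as busy
        num = p * P_f * P_FF + (1 - p) * P_d * P_BF
        den = p * P_f + (1 - p) * P_d
        return num / den
    if e_o == 0:                                 # Eqs. (10)/(12), first case: SU stayed silent
        num = p * (1 - P_f) * P_FF + (1 - p) * (1 - P_d) * P_BF
        den = p * (1 - P_f) + (1 - p) * (1 - P_d)
        return num / den
    # offloading reveals the true state: success means the channel was free, collision means busy
    return P_FF if observation == "O2" else P_BF
```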
B. The Proposed Deep Policy Gradient Scheme
Let us define the discounted return (i.e., the accumulated reward) as $G = \sum_{t=1}^{\infty} \alpha^{t-1} r(t)$, where $\alpha$ is the discount factor. The target of the MDP problem is to determine a good policy, $\Omega(s)$, which is a function that maps a given state $s$ to an action. The state value function under policy $\Omega$, defined as the expected return starting from state $s$, is given by
\begin{align*} {V^{\Omega }}(s) = \mathbb {E}\left [G \mid s,\Omega \right ] = \mathbb {E}\left [\sum \limits _{t = 1}^{\infty }{\alpha ^{t - 1}}r(t) \mid s(1) = s,\Omega \right ], \tag {13}\end{align*}
where $\mathbb{E}[\cdot]$ denotes the expectation operator, and $\Omega$ is the stochastic policy giving the probability of choosing action $a$ in state $s$, i.e., $\Omega(a \mid s) = \Pr\left[a(t) = a \mid s(t) = s\right]$. In addition, we further define the state-action value function (i.e., the Q-value function) as the expected return of taking action $a$ in state $s$ and then following policy $\Omega$, which is expressed as
\begin{align*} {Q^{\Omega }}(s,a) & = \mathbb {E}\left [G \mid s,a,\Omega \right ] \\ & = \mathbb {E}\left [\sum \limits _{t = 1}^{\infty }{\alpha ^{t - 1}}r(t) \mid s(1) = s,a(1) = a,\Omega \right ]. \tag {14}\end{align*}
Unlike the Q-learning principle, the agent of the policy gradient method directly selects an action based on a probability distribution rather than the state-action value function. In this subsection, we propose a deep policy gradient scheme to solve the formulated problem, in which a DNN, namely the policy network, is employed to model the probability function of the policy, $\Omega\left(a \mid s, \boldsymbol{\theta}\right)$. It includes one input layer, $H_{p}$ hidden layers, each of which contains $N_{p}$ neurons, and one output layer. The system state with $M+2$ elements is the input of the policy network, while the output has the size of the action space, $|\mathbb{A}|$. The action with the highest preference in each state is given the highest probability of being chosen according to the soft-max activation function, $\Omega\left(a \mid s, \boldsymbol{\theta}\right) = \frac{e^{h\left(s, a, \boldsymbol{\theta}\right)}}{\sum_{a_{i} \in \mathbb{A}} e^{h\left(s, a_{i}, \boldsymbol{\theta}\right)}}, \forall a \in \mathbb{A}$. In the proposed DPG scheme, the weights are adjusted based on the performance measure function, $J(\boldsymbol{\theta}) \triangleq V^{\Omega_{\boldsymbol{\theta}}}(s)$, defined as the expected return at the start state $s$ of an episode. According to the policy gradient theorem, we can derive the gradient of $J(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ as
\begin{equation*} \nabla J\left (\boldsymbol {\theta }\right ) = \sum \limits _{s \in \mathbb {S}} {d^{\Omega }}(s)\sum \limits _{a \in \mathbb {A}} {Q^{\Omega }}\left (s,a\right ) \nabla _{\boldsymbol {\theta }} {\Omega }\left (a \mid s,\boldsymbol {\theta }\right ), \tag {15}\end{equation*}
where $d^{\Omega}(s)$ represents the state distribution. The right-hand side of Eq. (15) is a sum over states weighted by how frequently the states occur under the target policy $\Omega$. Therefore, to remove the random variables of states and actions, we continue from the derivation of the performance measure function in Eq. (15) by introducing state $s(t)$, action $a(t)$, and an expectation under the policy $\Omega$. Thereby, we have
\begin{align*} \nabla J\left (\boldsymbol {\theta }\right ) & = {\mathbb {E}_{\Omega _{\boldsymbol {\theta }} }}\left [\sum \limits _{a \in \mathbb {A}} {Q^{\Omega }}\left (s(t),a\right )\nabla _{\boldsymbol {\theta }} {\Omega }\left (a \mid s(t),\boldsymbol {\theta }\right )\right ] \\ & = {\mathbb {E}_{\Omega _{\boldsymbol {\theta }} }}\left [G(t)\frac {\nabla _{\boldsymbol {\theta }} {\Omega }\left (a(t) \mid s(t),\boldsymbol {\theta }\right )}{{\Omega }\left (a(t) \mid s(t),\boldsymbol {\theta }\right )}\right ] \\ & = {\mathbb {E}_{\Omega _{\boldsymbol {\theta }} }}\left [G(t) \nabla _{\boldsymbol {\theta }} \ln {\Omega }\left (a(t) \mid s(t),\boldsymbol {\theta }\right )\right ]. \tag {16}\end{align*}
The measure function is optimized by adopting gradient ascent for the weight update:
\begin{equation*} \boldsymbol {\theta }(t + 1) = \boldsymbol {\theta }(t) + {\lambda _{p}}G(t) \nabla _{\boldsymbol {\theta }} \ln {\Omega }\left (a(t) \mid s(t),\boldsymbol {\theta }(t)\right ), \tag {17}\end{equation*}
where $\lambda_{p}$ is the learning rate of the policy network. The details of the proposed DPG scheme are given in Algorithm 2. Note that the policy parameterization is designed for the episodic case following the Monte Carlo method [39], where all updates are performed after the episode is completed. Fig. 2 illustrates the flow diagram of the proposed DPG. Although the DPG can yield a nearly optimal policy, it experiences a large variance in the estimated return, resulting in unstable convergence or even divergence during training.
Algorithm 2 Training Process of the Deep Policy Gradient Scheme
1: Initialize weight $\boldsymbol{\theta}$.
2: for each training episode $1, 2, \ldots, N$ do
3:   Observe the system state and generate an action $a(t)$ following $\Omega\left(\cdot \mid \cdot, \boldsymbol{\theta}\right)$.
4:   Execute the action, receive the reward $r(t)$, and observe the next state.
5:   Store the data samples $s(1), a(1), r(1), \ldots, s(T), a(T), r(T)$ of the episode.
6:   for each step $t = 1, 2, \ldots, T$ do
7:     Estimate the return $G(t) = \sum_{i = t}^{T} \alpha^{i - t} r(i)$.
8:     Update weight $\boldsymbol{\theta}$ by using Eq. (17).
9:   end for
10: end for
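A compact PyTorch realization of Algorithm 2 might look as follows; this is a sketch rather than the implementation used in our experiments, and the layer sizes, the discount factor, and the batched application of Eq. (17) over a whole episode (instead of a per-step update) are illustrative choices.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Softmax policy network: a state of size M + 2 is mapped to probabilities over |A| actions."""
    def __init__(self, state_dim, n_actions, hidden=128, n_hidden=2):
        super().__init__()
        layers, dim = [], state_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers += [nn.Linear(dim, n_actions)]
        self.net = nn.Sequential(*layers)

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)      # soft-max over the discrete action space

def reinforce_update(policy, optimizer, states, actions, rewards, alpha=0.99):
    """Monte Carlo policy-gradient update over one finished episode (Eq. (17), Algorithm 2)."""
    returns, G = [], 0.0
    for r in reversed(rewards):                        # discounted returns computed backwards
        G = r + alpha * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)
    probs = policy(torch.stack(states))                # (T, |A|) action probabilities
    log_probs = torch.log(probs.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1))
    loss = -(returns * log_probs).mean()               # ascent on J(theta) = descent on -J(theta)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# example wiring (dimensions are placeholders):
# policy = PolicyNet(state_dim=6, n_actions=64)
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
```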
In DRL schemes, computational complexity is shaped by various factors, including the dimensions of the state and action spaces, the architecture of the model, the choice of algorithm, the training parameters, and the reward function. Collectively, these elements define the training overhead and the time required to reach the optimal solution. However, the DNN model structure typically has the most significant impact on computational complexity, as it dictates the number of layers, neurons, and parameters, directly influencing the computational load and memory demands during training. Consequently, we focus on analyzing the DNN structure used in our proposed DRL methods to understand their computational complexity. Specifically, the computational complexity of the proposed DPG scheme is determined by the input and output dimensions and the number of hidden layers. It has $M + 2$ neurons in the input layer, $H_{p}$ hidden layers, each of which contains $N_{p}$ neurons, and $|\mathbb{A}|$ neurons in the output layer. Consequently, the computational complexity of the proposed DPG scheme is $\mathcal{O}\left(\left(M + 2 + |\mathbb{A}|\right)N_{p} + \left(H_{p} - 1\right)N_{p}^{2}\right)$. Next, we propose a deep Q-learning scheme, with which the agent can obtain stability while maximizing the expected reward.
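For concreteness, the weight count behind the DPG complexity expression above can be reproduced with the small helper below; bias terms are ignored, and the numbers in the example call are placeholders rather than our simulation settings.

```python
def dpg_parameter_count(M, A_size, N_p, H_p):
    """Weights of the DPG policy network, matching O((M + 2 + |A|) * N_p + (H_p - 1) * N_p^2)."""
    input_to_hidden = (M + 2) * N_p            # input layer (M + 2 neurons) to first hidden layer
    hidden_to_hidden = (H_p - 1) * N_p ** 2    # connections between consecutive hidden layers
    hidden_to_output = N_p * A_size            # last hidden layer to the |A| output neurons
    return input_to_hidden + hidden_to_hidden + hidden_to_output

print(dpg_parameter_count(M=4, A_size=64, N_p=128, H_p=2))
```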
C. The Proposed Deep Q-Learning
We now elaborate the proposed deep Q-learning scheme, in which an action is mainly selected based on the state-action value function rather than a policy network alone. In other words, the state-action value function is the key metric for evaluating the impact of the chosen action on the immediate and future rewards, generating the appropriate policy for the MDP problem. Generally, the original Q-learning is used to find the optimal policy $\Omega^{*}$ that yields the maximum Q-value function:
\begin{equation*} {Q^{*}}\left (s,a\right ) = {Q^{\Omega ^{*}}}\left (s,a\right ) = \max \limits _{\Omega }{Q^{\Omega }}\left (s,a\right ). \tag {18}\end{equation*}
Consequently, once the optimal Q-value function, $Q^{*}(s,a)$, of each state-action pair is obtained, the optimal policy is directly found by
\begin{equation*} {\Omega ^{*}}\left (a \mid s\right ) = \mathop {\arg \max }\limits _{a \in \mathbb {A}} {Q^{*}}\left (s,a\right ). \tag {19}\end{equation*}
From Eq. (19), it is found that determining the optimal policy \Omega ^{*}
is equivalent to find optimal value of {Q^{*}}\left ({{s,a}}\right)
. Based on Bellman equation, the optimal state-action value function is\begin{equation*} {Q^{*}}\left ({{s,a}}\right ) = r\left ({{s,a,{\Omega ^{*}}}}\right ) + \alpha \sum \limits _{s' \in \mathbb {S}} {{P_{s'}}\left ({{a\left |{{ s }}\right .}}\right )\max \limits _{a' \in \mathbb {A}} } {Q^{*}}\left ({{s',a'}}\right ), \tag {20}\end{equation*}
View Source
\begin{equation*} {Q^{*}}\left ({{s,a}}\right ) = r\left ({{s,a,{\Omega ^{*}}}}\right ) + \alpha \sum \limits _{s' \in \mathbb {S}} {{P_{s'}}\left ({{a\left |{{ s }}\right .}}\right )\max \limits _{a' \in \mathbb {A}} } {Q^{*}}\left ({{s',a'}}\right ), \tag {20}\end{equation*}
where r\left ({{s,a,{\Omega ^{*}}}}\right)
is an immediate reward achieved by taking action a in state s and {P_{s'}}\left ({{a\left |{{ s }}\right .}}\right)
denotes the probability that the system transitions from state s to s' when executing action a, while s' and a' denote the next state and action, respectively. In RL, the optimal state-action value function can be obtained without knowing the exact state transition model in advance by iteratively updating the Q-value function as follows:\begin{align*} & {Q^{\Omega }}\left ({{s,a}}\right ) \\ & = \left ({{1 - {\lambda _{q}}}}\right ){Q^{\Omega }}\left ({{s,a}}\right ) + {\lambda _{q}}\left ({{r\left ({{s,a}}\right ) + \alpha \max \limits _{a' \in \mathbb {A}} {Q^{\Omega }}\left ({{s',a'}}\right )}}\right ), \tag {21}\end{align*}
where \lambda _{q}
stands for the learning rate of the Q-learning scheme. Conventional Q-learning can acquire the optimal policy for small-scale optimization problems. However, it does not work effectively in systems with large state and action spaces, where the high variance of the value estimates may lead to convergence to a locally optimal policy. Thus, we propose the DQL scheme, in which a DNN, namely the Q-network, is employed to approximate the Q-value function, denoted by {Q^{\Omega } _{\boldsymbol {w}} }\left ({{s,a}}\right) \approx {Q^{\Omega } }\left ({{s,a}}\right)
, instead of generating the Q-table used in traditional Q-learning approaches. The Q-network comprises an input layer, H_{q}
hidden layers, and an output layer. A system state s with M+2
neuron units is used as the input of the Q-network. Each hidden layer is a fully connected layer containing N_{q}
neuron units and uses the rectified linear activation function. The output layer with a size of \left |{{ \mathbb {A} }}\right |
estimates the Q-values of the possible state-action pairs and employs a linear activation function. In the proposed DQL scheme, perturbations of the discretized action values can cause a large variance in the estimated Q-value function; hence, the policy might still be unstable and the training might not converge. For that reason, we adopt a fixed target network and an experience replay mechanism to relieve the impact of such perturbations, as well as of the data correlation in the discrete action space, on the stability of the training phase. For the fixed target network, another neural network with the same structure as the Q-network, parameterized by weight \hat { \boldsymbol {w}}, is generated to estimate the target Q-value. Meanwhile, the experience replay mechanism trains the network on a finite number of data samples stored in the replay buffer \mathcal {B}, in which each sample, \left ({{s,a,r,s'}}\right), is a tuple recording the outcome of a specific action taken in a given state. Rather than selecting consecutive data samples, which are highly correlated, random mini-batches are drawn to reduce the correlation during learning. By applying the fixed target network together with experience replay, the loss function in Eq. (26) (given below) is rewritten as\begin{equation*} L_{\boldsymbol {w}} = {\mathbb {E}_{\mathcal {B}}}\left [{{{\left ({{Q_{{\text {target}}}^{\Omega }\left ({{s,a}}\right ) - {Q^{\Omega }_{\boldsymbol {w}} }\left ({{s,a}}\right )}}\right )}^{2}}}\right ], \tag {22}\end{equation*}
where Q_{{\text {target}}\;}^{\Omega }
is the target value function obtained by the target Q-network and is calculated as\begin{equation*} Q_{{\text {target}}}^{\Omega }\left ({{s,a}}\right ) = r\left ({{s,a}}\right ) + \alpha \max \limits _{a' \in \mathbb {A}} {Q^{\Omega }_{\hat {\boldsymbol {w}}} }\left ({{s',a'}}\right ). \tag {23}\end{equation*}
The target Q-network is periodically updated by copying the weights of the Q-network during the training process, and its weights remain unchanged for a finite number of training time slots between updates. Using the target Q-value, the temporal difference (TD) error is calculated as\begin{equation*} \delta = Q_{{\text {target}}}^{\Omega }\left ({{s,a}}\right ) - Q_{\boldsymbol {w}}^{\Omega }\left ({{s,a}}\right ). \tag {24}\end{equation*}
With the proposed DQL scheme, the training performance is improved by adjusting weight \boldsymbol {w}
in the direction of increasing the cumulative reward and stability. We update the weight of the Q-network to minimize the loss function using the stochastic gradient descent method as follows\begin{equation*} \boldsymbol {w} = \boldsymbol {w} + {\lambda _{q}}\delta {\nabla _{\boldsymbol {w}}}{Q^{\Omega }_{\boldsymbol {w}} }\left ({{s,a}}\right ). \tag {25}\end{equation*}
Note that, without the fixed target network and experience replay, the Q-network would be trained by tuning the DNN weights to minimize the loss function defined as the mean square error between the target and current Q-values:\begin{equation*} L_{\boldsymbol {w}} = \mathbb {E}\left [{{{\left ({{r\left ({{s,a}}\right ) + \alpha \max \limits _{a' \in \mathbb {A}} {Q^{\Omega }_{\boldsymbol {w}} }\left ({{s',a'}}\right ) - {Q^{\Omega }_{\boldsymbol {w}} }\left ({{s,a}}\right )}}\right )}^{2}}}\right ]. \tag {26}\end{equation*}
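To make the above training procedure concrete, the following minimal Python/TensorFlow sketch illustrates how the Q-network, the fixed target network, and the mini-batch loss in Eqs. (22)-(23) could be realized. The layer sizes, the discount factor value, and all function names are our own illustrative assumptions, and the sketch uses plain SGD to mirror Eq. (25), whereas the simulations in Section IV employ the Adam optimizer:

import tensorflow as tf

def build_q_network(state_dim, num_actions, hidden_layers=2, hidden_units=24):
    # Fully connected Q-network: ReLU hidden layers and a linear output over the action set.
    model = tf.keras.Sequential([tf.keras.Input(shape=(state_dim,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(hidden_units, activation='relu'))
    model.add(tf.keras.layers.Dense(num_actions))   # linear activation for the Q-values
    return model

STATE_DIM, NUM_ACTIONS = 6, 16                      # hypothetical sizes: M + 2 = 6 and |A| = 16
q_net = build_q_network(STATE_DIM, NUM_ACTIONS)
target_net = build_q_network(STATE_DIM, NUM_ACTIONS)
target_net.set_weights(q_net.get_weights())         # fixed target network
optimizer = tf.keras.optimizers.SGD(learning_rate=0.005)   # lambda_q
alpha = 0.9                                                # discount factor (assumed value)

def dql_train_step(states, actions, rewards, next_states):
    # One mini-batch update; "actions" holds integer action indices.
    rewards = tf.cast(rewards, tf.float32)
    # Target Q-value from the target network, Eq. (23).
    q_target = rewards + alpha * tf.reduce_max(target_net(next_states), axis=1)
    with tf.GradientTape() as tape:
        q_all = q_net(states)
        q_sa = tf.reduce_sum(q_all * tf.one_hot(actions, NUM_ACTIONS), axis=1)
        loss = tf.reduce_mean(tf.square(q_target - q_sa))      # mini-batch loss, Eq. (22)
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))  # gradient step, Eqs. (24)-(25)
    return float(loss)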
The agent selects an action in each step by using an \epsilon -greedy policy: an action is randomly chosen with probability \epsilon _{q}
. Otherwise, the action with the maximal state-action value function is selected according to the output of the Q-network with probability 1-\epsilon _{q}
. The value of \epsilon _{q}
decays with a rate \upsilon _{q}
and is updated in each step. The training process of the proposed DQL is presented in Algorithm 3, which repeats until the system converges. In addition, Fig. 3 illustrates a flow diagram of the proposed DQL scheme. Although the proposed DQL scheme can optimize the expected reward with stable behavior, it may converge to a locally optimal policy in systems with such large state and action spaces. The computational complexity of the DQL scheme is determined as follows. The Q-network of the proposed DQL has M + 2
neurons in the input layer, H_{q}
hidden layers, each of which contains N_{q}
neurons, and \left |{{ \mathbb {A} }}\right |
neurons in the output layer. Besides, the target Q-network has the same structure as the Q-network. As a result, the computational complexity of the proposed DQL scheme is calculated by \mathcal {O}\left ({{2\left ({{\left ({{M + 2 + \left |{{ \mathbb {A} }}\right |}}\right){N_{q}} + \left ({{{H_{q}} - 1}}\right)N_{q}^{2}}}\right)}}\right)
. In the following, we investigate a combined scheme, namely deep actor-critic, which inherits the advantages of both policy-based and value-based methods.
Algorithm 3 Training Process of the Deep Q-Learning Scheme
1:Initialize the weights \boldsymbol {w}
, \hat {\boldsymbol {w}}
, and set \epsilon _{q}=1
.
2:for each training episode 1,2,\ldots ,N
do
3:for each step t = 1,2,\ldots ,T
do
4:Define \epsilon _{q} = \max \left ({{\epsilon _{q} \upsilon _{q},{\epsilon _{q}^{min}}}}\right)
.
5:Observe system state s(t) and select action a(t)
based on \epsilon
-greedy policy as\begin{aligned} a\left ({ t }\right) = {\begin{cases} {\mathop {\arg \max }\limits _{a\left ({ t }\right) \in \mathbb {A}} Q_{\boldsymbol {w}}^{\Omega }\left ({s\left ({ t }\right),a\left ({ t }\right)}\right)}& {{\text {w}}.{\text {p}}.\;1 - \epsilon _{q} } \\ {{\text {random}}\;{\text {action}}\;a\left ({ t }\right) \in \mathbb {A}}& {\text {otherwise}} \end{cases}} \end{aligned}
6:Execute action a(t) and receive reward r(t).
7:Observe next state s(t+1).
8:Store sample \left ({{s\left ({{ t }}\right),a\left ({{ t }}\right),r\left ({{ t }}\right),s\left ({{t + 1}}\right)}}\right)
in buffer \mathcal {B}, and then select a mini-batch.
9:for each sample i in the mini-batch do
10:Compute Q^{\Omega } _{\boldsymbol {w}}\left ({{{s_{i}},a_{i}}}\right)
and estimate the target Q-value by\begin{aligned} {Q_{{\text {target}}}^{\Omega } = } {\begin{cases} {{r_{i}}}& {{\text {if}}\;{s_{i + 1}}\;{\text {is}}\;{\text {terminal}}} \\ {{r_{i}} + \alpha \max \limits _{a' \in \mathbb {A}} Q_{\hat {\boldsymbol {w}}}^{\Omega } \left ({{{s_{i + 1}},a'}}\right)}& {\text {otherwise}} \end{cases}} \end{aligned}
12:Update weight \boldsymbol {w}
.
14:Update the target Q-network by copying the weights of the Q-network.
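As a complement to Algorithm 3, the following short Python sketch illustrates the replay buffer used in step 8 and the \epsilon -greedy selection used in step 5. The buffer capacity of 50 and the mini-batch size of 20 follow the settings reported in Section IV; the class and function names are our own:

import random
from collections import deque
import numpy as np

class ReplayBuffer:
    # Finite replay buffer storing (s, a, r, s') tuples, as in step 8 of Algorithm 3.
    def __init__(self, capacity=50):                 # buffer size of 50 as set in Section IV
        self.buffer = deque(maxlen=capacity)         # the oldest sample is dropped when full

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=20):                 # mini-batch size of 20 as in Section IV
        # Random mini-batches break the correlation between consecutive samples.
        batch = random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
        states, actions, rewards, next_states = zip(*batch)
        return np.array(states), np.array(actions), np.array(rewards), np.array(next_states)

def epsilon_greedy(q_values, epsilon_q):
    # Step 5 of Algorithm 3: random action w.p. epsilon_q, otherwise greedy w.r.t. the Q-network output.
    if random.random() < epsilon_q:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))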
D. The Proposed Deep Actor-Critic Scheme
In the AC method, the system operates based on both policy improvement (i.e., the actor) and policy evaluation (i.e., the critic), following the direction of maximizing the long-term return. More particularly, the actor determines the parameterized policy, while the critic evaluates the quality of the taken actions through the returned rewards. Inspired by the above analysis, we employ the DAC framework to solve the considered MDP problem. In the proposed DAC scheme, two main DNNs are used to approximate the policy and value functions. Specifically, the actor and critic neural networks are parameterized by weights \boldsymbol {\mu }_{a}
and \boldsymbol {\mu }_{c}
, respectively. These weights are randomly initialized at the beginning of training and continuously updated during the training process. Unlike the proposed DQL scheme, where an action is selected based on the Q-value function, the proposed DAC scheme selects an action based on the policy network (i.e., the actor network) while maximizing the state value function evaluated by the critic network. The goal is to determine the optimal policy, \Omega ^{*}
, to maximize the discounted state value function following the Bellman equation as follows:\begin{equation*} {\Omega ^{*}}\left ({{ s }}\right ) = \mathop {\arg \max }\limits _{\Omega }\left \{{{{V^{\Omega }}(s)}}\right \}. \tag {27}\end{equation*}
Critic: In our model, the critic network comprises an input layer, H_{c}
hidden layers, and an output layer. The system state with M+2
neuron units represents the input of the critic network. The hidden layers are fully connected layers, each of which contains N_{c}
neuron units and uses the rectified linear unit activation function. The output layer has one neuron with a linear activation function to generate the estimate of the state value function, {V^{\Omega } }\left ({{ s }}\right). To enhance training stability and avoid divergence, this scheme also adopts experience replay and a fixed target network. Accordingly, after each step, a data sample is stored in the replay buffer \mathcal {D} of limited size. Once \mathcal {D}
is full, the oldest data will be replaced by a new one. A target critic network with the duplicated structure of the critic network, parameterized by weight \hat {\boldsymbol {\mu }}_{c}
, is used to estimate the target state value function. In each time slot, the agent performs an action selected by the actor. Subsequently, the critic receives the instant reward, r, and observes the next state s'
. After that, it calculates a TD error, which represents the deviation between the target value and the estimated value of the state value function, by using the following equation:\begin{equation*} {\delta _{c}} = V_{\text {target}}^{\Omega }\left ({{ s }}\right ) - V_{\boldsymbol {\mu } _{c}}^{\Omega }\left ({{ s }}\right ), \tag {28}\end{equation*}
where V_{\text {target}}^{\Omega } \left ({{ s }}\right) = r + \alpha V_{{\hat {\boldsymbol {\mu }} }_{c}}^{\Omega } \left ({{s'}}\right)
represents the target state value function obtained by the target critic network, with {V_{{\hat {\boldsymbol {\mu }} }_{c}}}\left ({{s'}}\right) denoting the value estimated by the target critic for the next state s'. With the fixed target network and replay memory, the loss function of the critic network is defined as the mean-squared error between V_{\text {target}}^{\Omega }\left ({{ s }}\right) and V_{\boldsymbol {\mu } _{c}}^{\Omega }\left ({{ s }}\right) as follows:\begin{equation*} {L_{\boldsymbol {\mu } _{c}}} = {\mathbb {E}_{\mathcal { D}}}\left [{{{\left ({{r + \alpha V_{{\hat {\boldsymbol {\mu }} }_{c}}^{\Omega }\left ({{s'}}\right ) - V_{\boldsymbol {\mu } _{c}}^{\Omega }\left ({{ s }}\right )}}\right )}^{2}}}\right ]. \tag {29}\end{equation*}
During the training, we use stochastic gradient descent with the back-propagation algorithm to minimize the loss function, such that the weight \boldsymbol {\mu _{c} }
is optimized, as follows\begin{equation*} {{\boldsymbol {\mu } _{c}} = {\boldsymbol {\mu } _{c}} + {\lambda _{c}}{\delta _{c}}{\nabla _{\boldsymbol {\mu } _{c}}}{V^{\Omega }_{\boldsymbol {\mu } _{c}}}\left ({{s}}\right )}, \tag {30}\end{equation*}
where {\lambda _{c}} \gt 0
denotes the learning rate of the critic.
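A minimal Python/TensorFlow sketch of the critic update in Eqs. (28)-(30) is given below. The layer sizes and the discount factor are illustrative assumptions, and plain SGD with {\lambda _{c}} = 0.05 is used to mirror Eq. (30), although the simulations in Section IV rely on the Adam optimizer:

import tensorflow as tf

def build_critic(state_dim, hidden_layers=2, hidden_units=24):
    # Critic network: ReLU hidden layers and a single linear output estimating V(s).
    model = tf.keras.Sequential([tf.keras.Input(shape=(state_dim,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(hidden_units, activation='relu'))
    model.add(tf.keras.layers.Dense(1))
    return model

STATE_DIM = 6                                        # hypothetical M + 2
critic = build_critic(STATE_DIM)
target_critic = build_critic(STATE_DIM)
target_critic.set_weights(critic.get_weights())      # fixed target critic
critic_opt = tf.keras.optimizers.SGD(learning_rate=0.05)    # lambda_c
alpha = 0.9                                                 # discount factor (assumed value)

def critic_step(states, rewards, next_states):
    # Target value from the target critic, then TD error and MSE loss, Eqs. (28)-(29).
    rewards = tf.cast(rewards, tf.float32)
    v_target = rewards + alpha * tf.squeeze(target_critic(next_states), axis=1)
    with tf.GradientTape() as tape:
        v_s = tf.squeeze(critic(states), axis=1)
        loss = tf.reduce_mean(tf.square(v_target - v_s))
    grads = tape.gradient(loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))   # gradient step, Eq. (30)
    return v_target - v_s              # TD error delta_c, reused by the actor in Eq. (33)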
Actor: Similar to the critic network, the actor network (i.e., policy network) is designed with an input layer, H_{a}
hidden layers, and an output layer. The input layer of the actor network represents the system state. Each hidden layer includes N_{a} neuron units and uses the rectified linear unit activation function. The actor network is in charge of determining \Omega
, such that state s \in \mathbb {S}
can be mapped to action a \in \mathbb {A}
. Hence, the mapping policy \Omega \approx {\Omega _{\boldsymbol {\mu } _{a}}}\left ({{ \mathbb {S} }}\right)
is considered a function of \mathbb {S}
and is parameterized by weight \boldsymbol {\mu } _{a}
. The policy network outputs a distribution over the possible actions for a given state. Accordingly, the number of neuron units in the output layer of the actor network is set to the number of possible actions, and each output indicates the probability of choosing the corresponding action. The output layer utilizes a soft-max activation function to estimate the probability of each action as follows:\begin{equation*} \Omega \left ({{a\left |{{s,{\boldsymbol {\mu } _{a}}}}\right .}}\right ) = \frac {e^{h\left ({{s,a,{\boldsymbol {\mu } _{a}}}}\right )}}{\sum \nolimits _{{a_{i}} \in \mathbb {A}} {e^{h\left ({{s,{a_{i}},{\boldsymbol {\mu } _{a}}}}\right )}} },\forall a \in \mathbb {A}. \tag {31}\end{equation*}
The objective function in the actor is defined by\begin{align*} J\left ({{\boldsymbol {\mu } _{a}}}\right ) & = \sum \limits _{s \in \mathbb {S}} {{d^{\Omega }}\left ({{ s }}\right ){V^{\Omega }}\left ({{ s }}\right )} \\ & = \sum \limits _{s \in \mathbb {S}} {{d^{\Omega }}\left ({{ s }}\right )\sum \limits _{a \in \mathbb {A}} {Q^{\Omega }} \left ({{s,a}}\right )\Omega \left ({{a\left |{{s,{\boldsymbol {\mu } _{a}}}}\right .}}\right )}. \tag {32}\end{align*}
We optimize the objective function by using policy gradient with TD error obtained in the critic as follows:\begin{equation*} {{\boldsymbol {\mu } _{a}} = {\boldsymbol {\mu } _{a}} + {\lambda _{a}}{\mathbb {E}_{\Omega _{\boldsymbol {\mu } _{a}}}}\left [{{{\nabla _{\boldsymbol {\mu } _{a}}}\ln \Omega \left ({{a\left |{{s,{\boldsymbol {\mu } _{a}}}}\right .}}\right ){\delta _{c}}}}\right ]}, \tag {33}\end{equation*}
where \lambda _{a}
is the learning rate of the actor. We optimize \boldsymbol {\mu } _{a}
by applying the gradient ascent method to maximize the objective function of the actor during learning. The actor adopts the \epsilon -
greedy method for selecting actions, in which an action is randomly chosen with probability \epsilon _{a}
, or based on the policy network with probability 1-\epsilon _{a}
. The value of \epsilon _{a}
decays with a rate \upsilon _{a}
after every training step. The proposed DAC algorithm and its flow diagram are respectively given in Algorithm 4 and Fig. 4.
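For illustration, the soft-max policy of Eq. (31) and the TD-error-weighted policy-gradient update of Eq. (33) can be sketched in Python/TensorFlow as follows. The network sizes and names are our own assumptions, and the TD error is the quantity produced by the critic update sketched earlier:

import tensorflow as tf

def build_actor(state_dim, num_actions, hidden_layers=2, hidden_units=24):
    # Actor network: ReLU hidden layers and a soft-max output giving the policy of Eq. (31).
    model = tf.keras.Sequential([tf.keras.Input(shape=(state_dim,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(hidden_units, activation='relu'))
    model.add(tf.keras.layers.Dense(num_actions, activation='softmax'))
    return model

STATE_DIM, NUM_ACTIONS = 6, 16                       # hypothetical M + 2 and |A|
actor = build_actor(STATE_DIM, NUM_ACTIONS)
actor_opt = tf.keras.optimizers.SGD(learning_rate=0.005)    # lambda_a

def actor_step(states, actions, td_errors):
    # Policy-gradient ascent on J(mu_a), weighted by the critic's TD error, Eq. (33).
    td_errors = tf.stop_gradient(tf.cast(td_errors, tf.float32))
    with tf.GradientTape() as tape:
        probs = actor(states)                                        # Omega(a | s, mu_a)
        chosen = tf.reduce_sum(probs * tf.one_hot(actions, NUM_ACTIONS), axis=1)
        loss = -tf.reduce_mean(tf.math.log(chosen + 1e-8) * td_errors)
    grads = tape.gradient(loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))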
Algorithm 4 Training Algorithm of the Deep Actor-Critic Scheme
1:Initialize the weights \boldsymbol {\mu } _{a}
, \boldsymbol {\mu } _{c}
, \hat {\boldsymbol {\mu }} _{c}
, and set \epsilon _{a} = 1
.
2:for each training episode 1,2,\ldots ,N
do
3:for each step t = 1,2,\ldots ,T
do
4:Define \epsilon _{a} = \max \left ({{\epsilon _{a} \upsilon _{a},{\epsilon _{a}^{min} }}}\right)
.
5:Identify state s(t)
and select action a(t)
based on \epsilon
-greedy policy as\begin{aligned} {a\left ({{ t }}\right) = } {\begin{cases} {\mathop {\arg \max }\limits _{a\left ({{ t }}\right) \in \mathbb {A}} \left \{{{\Omega \left ({{a\left ({{ t }}\right)\left |{{s\left ({{ t }}\right),{\boldsymbol {\mu } _{a}}}}\right .}}\right)}}\right \}}& {{\text {w}}.{\text {p}}.\;1 - \epsilon _{a} } \\ {{\text {random}}\;{\text {action}}\;a\left ({{ t }}\right) \in \mathbb {A}}& {\text {otherwise}} \end{cases}} \end{aligned}
6:Execute action a(t) and receive reward r(t).
7:Observe next state s(t+1).
8:Store the sample \left ({{s\left ({{ t }}\right),a\left ({{ t }}\right),r\left ({{ t }}\right),s\left ({{t + 1}}\right)}}\right)
in buffer \mathcal {D}, and then select a mini-batch.
9:for each sample i in the mini-batch do
10:Compute V_{\boldsymbol {\mu }_{c}} ^{\Omega } \left ({{s_{i}}}\right)
and estimate the target state value function by\begin{aligned} {V_{{\text {target}}}^{\Omega } = } {\begin{cases} {{r_{i}}}& {{\text {if}}\;{s_{i + 1}}\;{\text {is}}\;{\text {terminal}}} \\ {{r_{i}} + \alpha V_{{\hat {\boldsymbol {\mu }} }_{c}}^{\Omega } \left ({{s_{i + 1}}}\right)}& {\text {otherwise}} \end{cases}} \end{aligned}
11:Compute the TD error \delta _{c}
.
13:Update weight \boldsymbol {\mu } _{c}
and \boldsymbol {\mu }_{a}
.
15:Update weight \hat {\boldsymbol {\mu }} _{c}
by copying the weight of the critic network.
The computational complexity of the DAC scheme is described as follows. The critic network of the proposed DAC has M + 2
neurons in the input layer, H_{c}
hidden layers, each of which contains N_{c}
neurons, and one neuron in the output layer. The target critic network has the same structure as the critic network. In the actor network, there are M + 2
neurons in the input layer, H_{a}
hidden layers, each of which contains N_{a}
neurons, and \left |{{ \mathbb {A} }}\right |
neurons in the output layer. Consequently, the computational complexity of the proposed DAC scheme is calculated by \mathcal {O}\left ({\left ({M + 2 + \left |{ \mathbb {A} }\right |}\right){N_{a}} + \left ({{H_{a}} - 1}\right)N_{a}^{2} + 2\left ({\left ({M + 3}\right){N_{c}} + \left ({{H_{c}} - 1}\right)N_{c}^{2}}\right)}\right)
. Compared to the DPG and DQL schemes, the DAC scheme imposes a higher computational complexity. However, stability and good convergence are gained by using both value-based and policy-based mechanisms.
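To illustrate how these complexity expressions compare, the short Python sketch below evaluates the dominant multiply-accumulate counts of the three schemes for a set of example sizes (the value of \left |{ \mathbb {A} }\right | is purely illustrative):

def dnn_complexity(input_dim, hidden_layers, hidden_units, output_dim):
    # Dominant multiply-accumulate count of a fully connected network with equal-width
    # hidden layers: (input + output) * N + (H - 1) * N^2, matching the expressions above.
    return (input_dim + output_dim) * hidden_units + (hidden_layers - 1) * hidden_units ** 2

M, A, H, N = 4, 16, 2, 24                            # |A| = 16 is only an illustrative value
c_dpg = dnn_complexity(M + 2, H, N, A)                                  # policy network
c_dql = 2 * dnn_complexity(M + 2, H, N, A)                              # Q-network plus target Q-network
c_dac = dnn_complexity(M + 2, H, N, A) + 2 * dnn_complexity(M + 2, H, N, 1)  # actor, critic, target critic
print(c_dpg, c_dql, c_dac)   # 1104 2208 2592 for these sizes, i.e., DAC is the most complex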
SECTION IV.
Numerical Results and Analysis
This section presents the performance evaluation of the proposed DRL schemes. Simulation results were obtained using Python 3.8 and Anaconda 2021 integrated with the TensorFlow deep learning library. We set the locations of the SU, the RIS, and the SBS to {\mathbf {u}} = {\left ({0, 0, 1.5}\right)^{T}}, {\mathbf {r}} = {\left ({30, 30, 5}\right)^{T}}, and {\mathbf {b}} = {\left ({80, 0, 20}\right)^{T}}
, respectively. The SBS has M=4
antennas and the number of RIS elements is K=4
. The battery capacity of SU is {E_{Ca}} = 30\mu J
. We set \varepsilon = 7\mu J
, {e_{\max }} = 13 \mu J
, {P_{FF}} = 0.8
, and {P_{BF}} = 0.2
. For spectrum sensing, {P_{f}} = 0.1
, and {P_{d}} = 0.9
. We design four layers for each DNN used in the proposed schemes: an input layer, two 24-node hidden layers, and an output layer, i.e., {H_{p}} = {H_{q}} = {H_{a}} = {H_{c}} = 2
and {N_{p}} = {N_{q}} = {N_{a}} = {N_{c}} = 24
. The ReLU activation function is used for the hidden layers of the DNNs in the proposed schemes. In the DPG scheme, the softmax activation function is employed for the output layer, while in the DQL scheme we adopt the linear activation function for the output layers of both the Q-network and the target Q-network. As for the DAC scheme, we utilize the softmax activation function for the output of the actor network, whereas the critic and target critic networks employ linear activation functions for their output layers. Furthermore, the Adam optimization algorithm is used to periodically update the DNN weights. For both the DQL and DAC algorithms, the replay memory size and the minibatch size are set to 50 and 20, respectively. It is assumed that the MEC server has sufficient computational capability and that the computation result is a short message. Hence, for simplicity, the computation time at the MEC server, t_{m}
, and the feedback delay (i.e., downloading time of the SU t_{d}
) are negligible when compared with offloading time. Therefore, we omit t_{m}
and t_{d}
in the simulation. Unless otherwise stated, \tau = 0.2\;{\text {s}}
, {t_{s}} = 0.002\;{\text {s}}
, {\alpha _{L}} = 3.7
, B = 1\;{\text {MHz}}
, {\sigma ^{2}} = - 80\;{\text {dBm}}
, {f_{\max }} = 2.3 \times {10^{8}}\;{\text {Hz}}
, \gamma = 0.5 \times {10^{-29}}
, \theta = 737.5\;{\text {cycles/bit}}
. The proposed algorithms are trained over 2000 episodes, each of which includes 103 time slots. We realized the algorithms through 104 time slots to obtain the average results. The table of simulation parameters is given in table 1. To evaluate the performance, we compare our schemes with four traditional methods as follows:
Double deep Q-learning (DDQL): adopts DRL scheme using double deep Q-network to derive the optimal solution [40].
Myopic: optimizes the immediate reward following a greedy method that requires neither the state transition probabilities nor the statistics of the harvested energy [41].
Local computing: greedily optimizes the performance by only executing local computing.
Random: randomly selects the actions for both local and remote computing.
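Before turning to the results, the main simulation settings listed above can be collected in a single Python configuration dictionary as a convenience for reproducing the setup; the variable names are our own and the energy-related values are expressed in joules:

SIM_PARAMS = {
    'u': (0.0, 0.0, 1.5),        # SU location (m)
    'r': (30.0, 30.0, 5.0),      # RIS location (m)
    'b': (80.0, 0.0, 20.0),      # SBS location (m)
    'M': 4,                      # number of SBS antennas
    'K': 4,                      # number of RIS elements
    'E_Ca': 30e-6,               # battery capacity of the SU (J)
    'epsilon_mean': 7e-6,        # mean harvested energy (J)
    'e_max': 13e-6,              # maximum offloading energy (J)
    'P_FF': 0.8, 'P_BF': 0.2,    # as set in Section IV
    'P_f': 0.1, 'P_d': 0.9,      # false-alarm and detection probabilities
    'tau': 0.2, 't_s': 0.002,    # slot duration and sensing time (s)
    'alpha_L': 3.7,              # as set in Section IV
    'B': 1e6,                    # bandwidth (Hz)
    'sigma2_dBm': -80.0,         # noise power (dBm)
    'f_max': 2.3e8,              # f_max (Hz)
    'gamma': 0.5e-29, 'theta': 737.5,          # gamma and theta (cycles/bit)
    'hidden_layers': 2, 'hidden_units': 24,    # H and N of every DNN
    'replay_size': 50, 'minibatch_size': 20,
    'lr_dpg': 0.005, 'lr_dql': 0.005, 'lr_actor': 0.005, 'lr_critic': 0.05,
}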
We first examine the impact of the maximum offloading energy on the network performance of the proposed schemes in Fig. 5, where Fig. 5(a), Fig. 5(b), and Fig. 5(c) respectively illustrate the convergence properties of the DPG, DQL, and DAC schemes for two different values of e_{\max } (i.e., 10\,\mu {\text {J}} and 13\,\mu {\text {J}}). In particular, each sub-figure of Fig. 5 plots the episodic average reward and the instant reward versus the training episodes. We compute the episodic instant reward in episode i and the episodic average reward in episode k by r_{i}^{\text {ins} }= \frac {1}{T}\sum \limits _{t = 1}^{T} {r\left ({{ t }}\right)}
and r_{k}^{\text {avg}} = \frac {1}{k}\sum \limits _{i = 1}^{k} {r_{i}^{\text {ins}}}
. It is observed that although the schemes attain higher rewards with a larger maximum offloading energy, i.e., {e_{\max }} = 13\;{\mu \text {J}}, the convergence speeds are degraded. This is because the action space grows with the number of possible actions, leading to poorer convergence. As seen from Fig. 5(a), the DPG scheme yields a high variance in the instant reward obtained during the training phase. This is because the DPG gradient update relies on a gradient estimate generated from a series of data samples collected over the whole episode; this estimate is often noisy, and a poor gradient estimate can adversely affect the stability of the training process. Furthermore, the larger the value of e_{\max }, the more instability the system suffers during training. In contrast to the DPG scheme, stability is achieved by both the DQL and DAC schemes, as shown in Fig. 5(b) and Fig. 5(c). The reason is that DQL and DAC optimize the system return by using the value-based method, a fixed target network, and experience replay, which significantly reduce data correlation and instability while guiding the agent to select appropriate actions.
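For clarity, the episodic instant reward and the episodic average reward plotted in Fig. 5 can be computed from the per-slot rewards as in the following short Python sketch (the function and variable names are our own):

import numpy as np

def episodic_rewards(reward_history):
    # reward_history[i][t] holds the instant reward r(t) of episode i; the function returns
    # r_i^ins = (1/T) * sum_t r(t) and the running average r_k^avg = (1/k) * sum_i r_i^ins.
    r_ins = np.array([np.mean(episode) for episode in reward_history])
    r_avg = np.cumsum(r_ins) / np.arange(1, len(r_ins) + 1)
    return r_ins, r_avg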
In Fig. 6, we assess the convergence properties of the proposed schemes according to various learning rates. Fig. 6(a) unveils that the performance of the DPG method can be easily affected by \lambda _{p}
. For example, with \lambda _{p}=0.05
, the system can initially obtain a high computation rate at the beginning of the training process, but it gradually converges to a locally optimal policy or even diverges as time goes on. This is because the DPG scheme directly selects actions following the on-policy technique, which may worsen convergence under an improper learning rate. On the other hand, the DQL and DAC schemes usually offer stable convergence during training when the learning rate changes. It can also be seen that although the system using the DAC scheme usually obtains a higher reward than the DPG and DQL schemes, it is nontrivial to properly select both \lambda _{a}
and \lambda _{c}
. Besides, the computational complexity of the DAC scheme is higher than that of the DPG and DQL schemes, as described in Section III. From the simulation results in Fig. 6, we conclude that the learning rate of each proposed DRL-based scheme should be chosen appropriately, being neither too large nor too small. Therefore, with the current setting, the learning rates of the DPG, DQL, and DAC methods are chosen as {\lambda _{p}} = {\lambda _{q}} = {\lambda _{a}} = 0.005
, {\lambda _{c}} = 0.05
.
Next, we compare the performance of the proposed schemes with the benchmarks, i.e., the DDQL, myopic, local computing, and random schemes, in Fig. 7. Under the current setting, it is observed that although the DPG algorithm can attain a slightly higher computation rate than the DQL algorithm, it requires more effort in tuning the parameters to achieve a sub-optimal policy. The reason can be explained as follows. The DQL approach is expected to satisfy the Bellman equation, which maintains the stability of convergence, but it usually does not guarantee finding nearly optimal behavior under the \epsilon -
greedy policy, where the agent frequently selects the stochastic action with the probability of \epsilon _{q}
. Meanwhile, although DPG can learn a good policy, since it operates directly on a policy that estimates the maximal total expected reward, its convergence is quite sensitive to improper parameters. We can also recognize that the DPG and DQL schemes achieve a computation rate almost as high as that of the DDQL. Generally, DDQL effectively manages overestimation during training by separating action selection from value evaluation, which enhances its robustness and reliability in uncertain environments. While the DDQL approach is likely to yield more effective policies than the DPG and DQL schemes, it may incur higher overhead. On the other hand, DAC possesses the advantages of both policy-based and value-based methods, resulting in good convergence to the highest computation rate, because it not only selects actions generated by the actor but also criticizes the performed actions. The performance improvement statistics are shown in Table 2. Columns 2 and 3 indicate the percentage improvement of each proposed scheme compared to the myopic scheme and the local computing scheme, respectively. It is evident that the proposed DAC scheme achieves the highest improvement among the methods, with a 21.32% increase over the myopic scheme and a 161.38% increase over the local computing scheme, highlighting the substantial advantage of the DAC approach. Additionally, these results demonstrate the benefits of using MEC techniques to significantly enhance computational efficiency.
We further study the effect of the mean harvested energy on the computation rate in Fig. 8, where \varepsilon
varies from 7\;{\mu \text {J}}
to 15\;{\mu \text {J}}
in both the with-RIS and without-RIS scenarios. As expected, the system rewards improve greatly as the harvested energy increases, since the SU collects and preserves more renewable energy in its battery to process computational tasks. The results show the effectiveness of the proposed schemes in dealing with the stochastically harvested energy at the SU in each time slot. Furthermore, it is worth noting that the schemes using RIS outperform those without RIS owing to their ability to adjust the channel conditions. To verify the energy-efficient operation of the proposed algorithms, we define energy efficiency as the total computation rate divided by the energy consumption at the SU and plot the results in Fig. 9. The curves indicate that the energy efficiency degrades as the harvested energy increases. This implies that increasing the number of computed bits requires more and more energy, including both the computing energy and the overflowed energy. In particular, overflow occurs more frequently when the SU harvests more energy, since the battery capacity is limited, which degrades the energy efficiency.
Fig. 10 validates the robustness of the proposed algorithms against changes in the false alarm probability, i.e., {P_{f}} \in \left [{{0.1,0.2}}\right]
and battery capacity, i.e., {E_{Ca}} = \left \{{{25,30}}\right \}\mu {\text {J}}
. In this regard, the performance goes down as P_{f}
increases since the SU misses more chances to offload computational data to the SBS. For the case of {E_{Ca}} = 30\;\mu {\text {J}}
, the SU can preserve more energy for its local computing and offloading and thus raises the overall computation rate in the long run, as compared to the case of {E_{Ca}} = 25\;\mu {\text {J}}
.
The performance comparison of our schemes and the myopic scheme with various numbers of RIS elements is shown in Fig. 11. Intuitively, increasing K helps improve the computation rate because a higher channel gain is obtained on the RIS-assisted reflecting link. Finally, we plot the reward versus the number of antennas at the SBS and the detection probability in Fig. 12. It is obvious that increasing M remarkably helps the SU compute its tasks faster. Besides, with a high detection probability, the system detects the actual “free” state of the primary channel frequently and correctly, reducing the number of transmission collisions between the SU and the PT. Hence, the system reward improves when P_{d} increases. From the above simulation results, we can see that the proposed DRL schemes are superior to the myopic scheme because they not only consider the current reward but also estimate the impact of the immediate action on future rewards. Although the DPG and DQL schemes yield slightly poorer performance than the DAC and DDQL schemes, they require lower computational complexity and outperform the other traditional benchmarks. Thus, the trade-off between computational complexity and network performance should be carefully considered when designing DRL frameworks to solve optimization problems with large state and action spaces. Furthermore, compared with other mathematical optimization tools, DRL frameworks remarkably relieve the complexity of the mathematical formulation required to derive optimal solutions.