With the massive growth of mobile applications (e.g., online gaming, image and video processing), tremendous computation demands have been imposed on resource-constrained wireless devices. Fortunately, Mobile Edge Computing (MEC) has emerged as a powerful technology to improve computation efficiency [1], in which a mobile device can offload tasks to nearby servers (e.g., access points or base stations) deployed at the edge of the network. Nevertheless, offloading quality often fails to reach desirable levels in obstructed regions. Recently, the reconfigurable intelligent surface (RIS) has been introduced in [2] to improve offloading performance. In particular, reflected signals can be constructively combined by dynamically configuring the phase shifts and reflection amplitudes of the RIS elements.
Cognitive radio (CR) has been recognized as a promising technique to remarkably improve the spectral efficiency (SE), in which spectrum sharing is categorized into three paradigms: interweave [3], overlay [4], and underlay [5]. In cognitive radio networks (CRNs), the interference temperature at the primary users (PUs) must be kept below a predefined threshold regardless of which paradigm is used. This requirement can be relaxed in the interweave and overlay paradigms if the channel state information (CSI) of the PU is perfectly known by the secondary user (SU) transmitters. In the underlay paradigm, however, the interference temperature must be carefully considered when designing optimization mechanisms for CRNs since it significantly affects the achievable data rate. Meanwhile, computational applications on Internet of Things (IoT) devices tend to require energy-autonomous operation, since battery maintenance is impractical owing to difficult accessibility or high cost in harsh network scenarios. As a remedy, energy harvesting techniques have recently emerged to address the energy-constrained operation and energy-autonomy problems [6]. Especially for energy-constrained MEC systems, addressing critical challenges such as finite battery capacity, sustainable operation, and limited spectrum resources has received more and more attention. Motivated by the aforementioned considerations, these challenges can be alleviated by integrating three technologies into MEC systems: 1) RIS; 2) cognitive radio; and 3) energy harvesting.
A. Literature Review
This subsection reviews related works on MEC systems, RIS-assisted MEC systems, RIS-assisted CRN systems, and deep reinforcement learning (DRL) frameworks in RIS-assisted systems, which are respectively discussed as follows:
1) MEC Systems
The authors in [7] investigated a CR-MEC system considering the interference temperature of the primary network to maximize the overall utility. A three-layer architecture of a MEC system in which CR technology was applied was studied in [8]. Both the cooperation [9] and the non-cooperation [10] scenarios between the PU and the mobile device were studied in CR-MEC networks using wireless power transfer (WPT). In particular, the authors in [9] maximized the computation efficiency, while the optimal number of computation bits was the objective of the study in [10]. The WPT-CR-MEC system in [9] was extended to multi-antenna transmission contexts [11], where the authors considered the security of the primary transmitter. Iterative algorithms were proposed to minimize the offloading delay in a two-user MEC system using non-orthogonal multiple access (NOMA) [12]. Besides, the work in [13] investigated a multi-user NOMA-MEC framework to minimize delay under resource constraints. Although the references [12], [13] verified the effectiveness of the proposed schemes, they might not be applicable to CR-MEC due to the interference constraints imposed on the primary system, and thus the complexity of the offloading policy should be carefully considered. Moreover, these works ignored the local computing capability that is generally supported in IoT devices. It is noted that the above research efforts in [7], [8], [9], [10], [11], [12], and [13] did not use an RIS to improve the offloading performance in MEC systems.
2) RIS-Assisted MEC Systems
By leveraging the benefits of RIS, the performance of MEC systems can be improved [14], [15]. An RIS-aided MEC system was studied in [14], in which users can partially offload their tasks to a MEC server with the aid of the RIS. The simulation results indicated that the overall latency is reduced considerably. The authors in [15] considered a MEC system where an RIS was used to aid both the WPT and data offloading processes. However, the works in [14] and [15] only considered the employment of RIS in traditional MEC networks without considering various multiple access schemes. To cope with this drawback, [16] presented an RIS-aided MISO-MEC system serving multiple users in the same frequency and time block by using NOMA technology, and conventional orthogonal multiple access (OMA) methods were compared in the simulations to evaluate the effectiveness of the proposed scheme.
3) RIS-Assisted CRN Systems
Even though CR can boost the SE, the benefits for the PUs and the SUs are conflicting. For instance, increasing the transmission power at secondary transmitters to raise the received signal strength also increases the interference temperature towards the primary receivers. This bottleneck can be addressed by utilizing an RIS, whose reconfiguration capability enhances the received signal strength at the SUs while mitigating interference to the PUs via jointly optimizing transmit precoding and passive beamforming [17], [18]. However, it is challenging to realize these works in practice, since the channel information between SUs and PUs is difficult to acquire owing to their uncooperative relationship. Several contributions designed transmission beamforming with imperfect PU channels for RIS-CR systems [19], [20]. On the other hand, they only targeted maximizing the network capacity, which may not ensure the quality of service (QoS) at the SU for applications such as video conferencing, autonomous vehicles, and so on. To cope with this fact, the authors in [21] designed a robust beamforming scheme aiming to minimize the transmission power of the SUs subject to the SUs' QoS requirements and the PUs' interference threshold.
4) DRL-Based Frameworks in RIS-Assisted Systems
In recent years, artificial intelligence (AI) has been viewed as one of the most appealing technologies for MIMO systems, in which optimization problems are non-trivial owing to the large dimensions involved. As a branch of AI research, DRL has been recognized as a powerful tool for decision-making optimization in wireless networks without prior knowledge of the environment, especially in RIS-aided communication contexts. In [22], a DRL algorithm was developed to select optimal RIS phase shifts that enhance the received signal-to-noise ratio (SNR). Reference [23] presented a joint transmit beamforming and RIS phase-shift allocation scheme using a DRL method to maximize the sum rate. However, the network settings in these studies were mostly assumed to work under ideal conditions, which is impractical. In [24], the authors cast the cell vectorization problem as beamforming matrix selection to attain optimal network coverage using a DRL-based approach. Besides, a DRL scheme jointly allocating beamforming, power control, and interference coordination was designed to maximize network performance in [25]. The authors in [26] proposed both deep learning (DL) and DRL algorithms to solve the effective throughput maximization problem and evaluated their pros and cons. Nevertheless, the above DRL-based schemes using RIS did not take the MEC context into consideration. Only a few studies [27], [28], [29], [30] focused on problems in RIS-assisted MEC systems and solved them by adopting DRL techniques. More specifically, the work in [27] jointly optimized the allocated WPT time, RIS phase shifts, and IoT devices' offloading actions to enhance the overall computational capability by adopting a double deep Q-learning algorithm. Meanwhile, a constrained DRL method was developed in [28] to select the optimal RIS phase shifts for the downlink under latency constraints. The problem of joint offloading, communication, and computation resource allocation for a NOMA-RIS-MEC system was investigated in [29] and solved by a Lyapunov-based DRL method with both continuous and integer mappings. To maximize the weighted sum secrecy computation efficiency of the system, the authors in [30] proposed a deep deterministic policy gradient (DDPG) scheme under eavesdropping attacks.
B. Motivations and Contributions
To the best of the authors' knowledge, this is the first work investigating a solar-powered RIS-assisted CR-MEC system. In contrast to most of the existing literature, we consider an imperfect spectrum sensing context in the interweave paradigm, in which the offloading performance of the CR system is degraded by sensing errors, i.e., false alarms and misdetections. Indeed, the CR-MEC network investigated in this work is closer to practice because we consider the dynamic nature of the wireless environment (e.g., the uncertainty of solar energy arrivals and the dynamic activity of the primary network) together with the inevitable sensing errors of the cognitive system. In this scenario, the solar-powered battery of the cognitive user can easily be drained due to imperfect spectrum sensing and improper utilization of offloading energy. Therefore, it is challenging for non-learning conventional approaches to directly obtain an optimal solution to the optimization problem owing to time-varying channel conditions, the stochastic properties of the harvested energy, and the dynamic behavior of the primary channel. Besides, designing a scheme using mathematical tools might impose high computational complexity for implementation. Furthermore, some conventional schemes (e.g., the partially observable Markov decision process (POMDP)) require prior knowledge of the harvested-energy probability to determine the state transition probability, making the problem more challenging. Fortunately, DRL approaches can be adopted for local computing and offloading in the CR-MEC network by interacting with the environment in a trial-and-error fashion to gain knowledge and select the optimal action.
The motivation for using DRL in system optimization lies in its flexibility and adaptability. Unlike traditional methods such as Lyapunov optimization [31], Dynamic Programming (DP) [32], Linear Programming (LP) [33], Convex Optimization [34], or Optimal Control [35], DRL does not require a precise mathematical model or rigid structural constraints, making it more suitable for complex and uncertain environments. DRL handles high-dimensional problems effectively, overcoming the curse of dimensionality, and excels in nonlinear and stochastic settings by adapting to real-time feedback. While methods such as Lyapunov optimization and Optimal Control can address some nonlinearity, they involve significant mathematical complexity and pre-defined constraints, which restrict their adaptability. In contrast to LP and Convex Optimization methods, which focus on finding an optimal solution for immediate rewards, DRL optimizes long-term rewards by balancing exploration and exploitation. This makes DRL highly effective for sequential decision-making tasks such as queuing and resource allocation. Additionally, DRL can handle complex, multi-objective goals, such as balancing latency and energy, through custom reward functions.
Since the system has large state and action spaces, it is difficult for traditional learning algorithms to optimize the computation rate. Besides, the existing works did not intensively investigate on-policy and off-policy methods with their pros and cons. We aim to transform the resource allocation problem into a Markov decision process (MDP) framework, to which DRL frameworks can be applied to solve the computation rate optimization. Thereby, investigating the details of mapping a state to an optimal action using various DRL frameworks is crucial, taking into account the trade-offs between their effectiveness and computational complexity. On the other hand, the convergence properties, performance, and computational complexity of these DRL schemes have not been compared thoroughly. To fill this gap, we adopt separate state-of-the-art DRL frameworks, including policy-based (i.e., on-policy), value-based (i.e., off-policy), and policy-and-value methods, to find the solution to a high-dimensional optimization problem. In particular, unlike other mathematical optimization mechanisms, our proposed DRL designs do not require sophisticated mathematical formulations and calculations. The contributions of this article are listed as follows:
We consider a solar-powered RIS-assisted CR-MEC system, where an SU equipped with a solar energy harvester attempts to optimize its data computation. Imperfect spectrum sensing in determining the status of the licensed channel is assumed. In addition, the limitations of battery capacity, computational capability, and harvested energy are taken into account.
The MDP problem of allocating either the CPU-cycle frequency, or jointly the offloading energy and RIS phase shifts, for computation rate maximization is defined. Subsequently, a policy-based approach, namely deep policy gradient (DPG), is first proposed to solve the MDP problem. Thereby, the SU learns environment knowledge by directly following the action distribution. Though a high computation rate can be obtained, this scheme is easily affected by the stochastic nature of generating actions, leading to large oscillations or even divergence.
In order to avoid divergence as well as to reduce the critical effect of stochastic actions on the value function of the DPG scheme, we propose to use a value-based method, namely deep Q-learning (DQL), to generate the optimal CPU frequency, offloading energy, and RIS phase shifts. In this scheme, the state-action value function is used to optimize the selected action rather than directly following the action distribution. With this scheme, the fluctuation during training is reduced, yet the system frequently converges to a locally optimal policy.
By inheriting the advantages of the policy-based principle of DPG and the value-based principle of DQL, we further employ a policy-and-value method, namely deep actor-critic (DAC), including actor and critic components, to address the MDP problem. Thereby, the actor determines actions following a parameterized policy, while the critic is in charge of evaluating the quality of the taken actions through the rewards returned by the environment. Consequently, good convergence and stability are achieved with proper parameters.
The remainder of this work is organized as follows. In Section II, the system model is presented. Next, we elaborate on the proposed DRL frameworks: DPG, DQL, and DAC in Section III. We conduct numerical simulations and discuss the results in Section IV, and then the conclusion of the paper is given in Section V.
SECTION II.
System Model
As shown in Fig. 1(a), we consider an RIS-assisted CR-MEC system consisting of an SU, a RIS, a secondary base station (SBS), and a primary network including a primary transmitter (PT) and a primary receiver (PR). The PT and PR are licensed to use a primary channel, while the single-antenna SU can opportunistically offload its data when the primary channel is sensed to be free. The SBS has M antennas, while the RIS is equipped with K reflecting elements and a micro-controller that adjusts their phase shifts to achieve better channel conditions between the SU and the SBS. The considered network reflects the following realistic scenario. Nowadays, many smart buildings are evolving with built-in technologies that usually require or generate local data. However, most Internet of Things (IoT) devices cannot quickly process data due to poor computing power and low battery capacity. Therefore, by exploiting energy harvesting (EH), CR, RIS, and MEC techniques, IoT devices can efficiently process large amounts of data, such as climate and temperature control, smart signaling, asset tracking, or health monitoring data, with the aid of the energy autonomy of EH, the spectrum sharing of CR, the channel-condition adjustment of the RIS, and the cloud server center (i.e., the MEC server) for sustained operation. Thus, the multi-technique network investigated in this paper offers great benefits for low-end devices, enabling rapid data processing with energy autonomy.
We consider a three-dimensional (3D) Cartesian coordinate system, where the coordinates of the SU, the RIS, and the SBS are respectively denoted by ${\mathbf{u}} = \left(x_{u}, y_{u}, h_{u}\right)^{T}$, ${\mathbf{r}} = \left(x_{r}, y_{r}, h_{r}\right)^{T}$, and ${\mathbf{b}} = \left(x_{b}, y_{b}, h_{b}\right)^{T}$. The SU needs to execute computational tasks of various applications, such as data fusion, climate and temperature control, smart signaling, and so on. Due to its low computational capability, the SU can enhance its computation rate by cooperating with the MEC server collocated at the SBS through offloading. To this end, the SU offloads the task to the SBS so that the data are remotely processed by the MEC server, and the computed result is sent back to the SU once the computation is completed. We assume that the MEC server has sufficient dedicated computing resources and energy to assist the computation of the SU.
During offloading, the SU has to share the primary channel (i.e., the licensed channel) with the primary network. We assume that the SU always has a computational task to process. Generally, there are three models of computation offloading: (a) offloading only; (b) all-or-nothing offloading, in which the entire data is either offloaded or processed locally; and (c) partial offloading, where some parts are offloaded while the remaining parts are performed locally. This article focuses on model (b), since the SU adaptively executes tasks to maximize the computation performance using CR based on its energy condition, the offloading channel, and the primary channel activity. The MEC system operates in a time-slotted fashion, where $\tau$ represents the length of each time slot. The operation of the SU in a time slot is illustrated in Fig. 1(b) and proceeds as follows. At the beginning of each time slot, the SU performs spectrum sensing with sensing duration $t_{s}$ to determine the status of the primary channel. Based on the sensing result, the SU decides on either the local computing mode or the offloading mode. In particular, the SU carries out local computing within a duration of $\tau - t_{s}$ if the sensing result is "busy". On the other hand, when the spectrum sensing result shows that the primary channel is "free", the SU offloads its data to the SBS for collaborative computation at the MEC server within a duration of $t_{o}$. Subsequently, the MEC server computes the data and feeds the computed result back to the SU within the periods $t_{m}$ and $t_{d}$, respectively.
A. Local Computing Model
In this article, the number of CPU cycles required to process one bit of raw data at the SU is denoted by $\theta$ ($\theta > 0$). Let $f(t)$ ($0 \le f(t) \le f_{\max}$) represent the scheduled CPU-cycle frequency (cycles/second) of the SU in time slot $t$, where $f_{\max}$ is the maximal computational capability of the SU. Accordingly, the local computing rate of the SU in time slot $t$ is
\begin{equation*} r_{l}(t) = \frac {\tau - {t_{s}}}{\tau }{f}(t)\theta ^{ - 1}. \tag {1}\end{equation*}
The energy consumption of the local computation at the SU is given as
\begin{equation*} e_{l}(t) = \left (\tau - {t_{s}}\right )\gamma f^{3}(t), \tag {2}\end{equation*}
where $\gamma$ is the effective switched-capacitance constant, which depends on the chip of the SU.
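As a concrete illustration of Eqs. (1)-(2), the short Python sketch below evaluates the local computing rate and energy for a given CPU-cycle frequency; the numerical values of $\tau$, $t_{s}$, $\theta$, and $\gamma$ are placeholders rather than the simulation parameters used later in this paper.

```python
def local_computing(f, tau=1.0, t_s=0.1, theta=1e3, gamma=1e-28):
    """Local computing rate (Eq. 1) and local energy consumption (Eq. 2).

    f     : scheduled CPU-cycle frequency in cycles/s
    tau   : slot length in seconds
    t_s   : sensing duration in seconds
    theta : CPU cycles required per bit
    gamma : effective switched-capacitance constant of the SU's chip
    """
    r_l = (tau - t_s) / tau * f / theta        # bits/s averaged over the slot, Eq. (1)
    e_l = (tau - t_s) * gamma * f ** 3         # Joules consumed in the slot, Eq. (2)
    return r_l, e_l

# example: a 0.5 GHz CPU-cycle frequency
print(local_computing(f=0.5e9))
```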
B. Offloading Model
The SBS receives the signal transmitted by the SU via two paths: the direct path and the reflected RIS path. We assume the direct path is a non-line-of-sight (NLOS) channel. Let ${\mathbf{h}}_{\text{ub}} = g_{\text{ub}} \left[h_{1}^{\text{u}}, \ldots, h_{M}^{\text{u}}\right]^{T} \in \mathbb{C}^{M \times 1}$ denote the direct-path channel, where $g_{\text{ub}} = d_{\text{ub}}^{-\alpha_{L}/2}$ represents the path loss from the SU to the SBS with distance $d_{\text{ub}}$ and path-loss exponent $\alpha_{L}$, and $h_{m}^{\text{u}}$ represents the small-scale fading between the SU and antenna $m$ of the SBS. The reflected RIS path includes two links: an NLOS SU-RIS link and a line-of-sight (LOS) RIS-SBS link. The NLOS SU-RIS channel is denoted by ${\mathbf{h}}_{\text{ur}} = g_{\text{ur}} \left[h_{1}^{\text{r}}, \ldots, h_{K}^{\text{r}}\right]^{T} \in \mathbb{C}^{K \times 1}$, where $g_{\text{ur}} = d_{\text{ur}}^{-\alpha_{L}/2}$ is the SU-RIS path loss, and $h_{k}^{\text{r}}$ is a complex Gaussian random variable, $\mathcal{CN}\left(0,1\right)$. We denote the channel between the RIS and the SBS as ${\mathbf{H}}_{\text{rb}} = g_{\text{rb}} \left\{h_{m,k}\right\}_{m \in [1,M],\, k \in [1,K]} \in \mathbb{C}^{M \times K}$, where $g_{\text{rb}} = 1/\sqrt{4\pi d_{\text{rb}}^{2}}$ is the free-space RIS-SBS path-loss attenuation with distance $d_{\text{rb}}$, and $h_{m,k}$ represents the LOS channel between antenna $m$ of the SBS and element $k$ of the RIS. It is assumed that ${\mathbf{h}}_{\text{ub}}$, ${\mathbf{h}}_{\text{ur}}$, and ${\mathbf{H}}_{\text{rb}}$ are quasi-static block-fading channels. Channel estimation is beyond the scope of this paper. For passive RIS-aided networks, various methods, such as the dual-link pilot transmission scheme [36] and the anchor-based two-phase method [37], have been proposed to estimate the CSI of the individual RIS-aided channels. Therefore, we assume that the CSI is perfectly known to the SU at the start of each time slot.
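For clarity, the sketch below draws one realization of the three channels under the stated statistics; the LOS matrix $\mathbf{H}_{\text{rb}}$ is filled with unit-modulus entries of random phase as a simple placeholder, since the exact LOS array-response model is not reproduced here, and the distances are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def path_loss_nlos(d, alpha_L):
    """NLOS path-loss amplitude d^{-alpha_L / 2}."""
    return d ** (-alpha_L / 2)

def draw_channels(M, K, d_ub=50.0, d_ur=10.0, d_rb=40.0, alpha_L=3.0):
    """One realization of h_ub, h_ur (NLOS, CN(0,1) small-scale fading) and H_rb (free-space LOS)."""
    # direct SU-SBS channel: M x 1
    h_ub = path_loss_nlos(d_ub, alpha_L) * (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
    # SU-RIS channel: K x 1
    h_ur = path_loss_nlos(d_ur, alpha_L) * (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
    # RIS-SBS LOS channel: M x K, free-space attenuation and unit-modulus placeholder phases
    g_rb = 1.0 / np.sqrt(4.0 * np.pi * d_rb ** 2)
    H_rb = g_rb * np.exp(1j * 2.0 * np.pi * rng.random((M, K)))
    return h_ub, h_ur, H_rb
```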
Upon receiving the signal transmitted by the SU, the RIS reflects it to the SBS according to a matrix of phase shifts $\boldsymbol{\Phi} = \text{diag}\left(e^{j\phi_{1}}, e^{j\phi_{2}}, \ldots, e^{j\phi_{K}}\right)$, where $\phi_{k} \in {\textsf{F}}_{\text{ris}} = \left\{\frac{2\pi l}{2^{b}}\right\}_{l=0}^{2^{b}-1}$ is the phase shift of RIS element $k$, and $b$ is the resolution of the phase shifters [38]. The received signal at the SBS is
\begin{equation*} y(t) = \sqrt {p_{o}(t)} \left ({\mathbf {H}}_{\text {rb}}(t){\boldsymbol {\Phi }}(t){\mathbf {h}}_{\text {ur}}(t) + {\mathbf {h}}_{\text {ub}}(t)\right )x(t) + {\mathbf {n}}(t), \tag {3}\end{equation*}
where $p_{o}(t) = \frac{e_{o}(t)}{t_{o}}$ is the offloading power obtained from the offloading energy $e_{o}(t)$ consumed by the SU, $x(t)$ is the task information of the SU, and $\mathbf{n}(t) = \left[n_{1}(t), n_{2}(t), \ldots, n_{M}(t)\right]^{T}$ is a vector of additive white Gaussian noise such that $\mathbf{n}(t) \sim \mathcal{CN}\left(0, \sigma^{2}\mathbf{I}_{M}\right)$. The received SNR at the SBS is
\begin{equation*} {\gamma _{o}}(t) = \frac {p_{o}(t)\left \|{\mathbf {H}}_{\text {rb}}(t){\boldsymbol {\Phi }}(t){\mathbf {h}}_{\text {ur}}(t) + {\mathbf {h}}_{\text {ub}}(t)\right \|^{2}}{\sigma ^{2}}. \tag {4}\end{equation*}
Thus, the offloading data rate is
\begin{equation*} {r_{o}}(t) = \frac {t_{o}}{\tau }B\log _{2}\left (1 + {\gamma _{o}}(t)\right ), \tag {5}\end{equation*}
where $B$ denotes the bandwidth for data offloading. The SU, with a limited battery capacity $E_{Ca}$, constantly harvests solar energy from the ambient environment. The energy arriving during time slot $t$, $e_{h}(t)$, is assumed to be limited, where $0 < e_{h}(t) < E_{Ca}$, and it follows a Poisson distribution with mean harvested energy $\varepsilon$. We model the status of the primary channel by a two-state discrete-time Markov chain. Particularly, the state of the primary channel in a time slot is represented by "B" (busy) or "F" (free), and the transition probability of its state between two adjacent time slots is denoted by $P_{ij}$, $i,j \in \{F,B\}$. During the sensing duration, the SU carries out spectrum sensing, and the sensing result, $H(t) \in \{F,B\}$, indicates the "free" or "busy" state of the primary channel. Because sensing errors inevitably happen, sensing performance is evaluated by two crucial metrics: the false alarm probability, $P_{f} = \Pr\left(H(t) = B \mid F\right)$, and the detection probability, $P_{d} = \Pr\left(H(t) = B \mid B\right)$. The former is the probability that the primary channel is sensed as "busy" while it is actually "free", while the latter is the probability that the SU correctly detects the "busy" state. A misdetection is the circumstance in which the channel is actually "busy" but is sensed as "free". In essence, the performance of the network can be significantly lowered by a large number of false alarms and misdetections. Specifically, on the one hand, a false alarm may cause the SU to miss an opportunity to use the primary channel. On the other hand, a misdetection causes a transmission collision between the SU and the PT. In this study, the probability that the primary channel is free is updated at the end of each time slot to reduce collisions. To protect the transmissions of the PT in the primary network, given the maximally allowable collision probability between the SU and the PT, the detection probability $P_{d}$ can be kept greater than a threshold $\varsigma$ by adjusting the sensing parameters.
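Given a channel realization, the offloading SNR in Eq. (4) and the slot-averaged rate in Eq. (5) follow directly, as in the sketch below; the b-bit codebook mirrors the definition of ${\textsf{F}}_{\text{ris}}$, while the channel draws and the numerical values of $e_{o}$, $t_{o}$, $B$, and $\sigma^{2}$ are illustrative placeholders.

```python
import numpy as np

def offloading_rate(h_ub, h_ur, H_rb, phi, e_o, t_o, tau, B, sigma2):
    """Received SNR (Eq. 4) and slot-averaged offloading rate (Eq. 5) for RIS phases phi."""
    Phi = np.diag(np.exp(1j * phi))                 # RIS reflection matrix from the phase-shift vector
    h_eff = H_rb @ Phi @ h_ur + h_ub                # effective channel: reflected path + direct path
    p_o = e_o / t_o                                 # offloading power p_o = e_o / t_o
    gamma_o = p_o * np.linalg.norm(h_eff) ** 2 / sigma2
    r_o = (t_o / tau) * B * np.log2(1.0 + gamma_o)
    return gamma_o, r_o

# b-bit discrete phase codebook F_ris = {2*pi*l / 2^b, l = 0, ..., 2^b - 1}
b, K, M = 2, 16, 4
F_ris = 2.0 * np.pi * np.arange(2 ** b) / (2 ** b)
rng = np.random.default_rng(1)
phi = rng.choice(F_ris, size=K)
h_ub = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
h_ur = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
H_rb = np.exp(1j * 2.0 * np.pi * rng.random((M, K)))
print(offloading_rate(h_ub, h_ur, H_rb, phi, e_o=1e-3, t_o=0.8, tau=1.0, B=1e6, sigma2=1e-10))
```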
C. Problem Formulation
In this paper, offloading actions are influenced by the availability of the primary network, as the SU must share the licensed channel with the primary network, and the activity of the licensed network can significantly impact the SU's operations. Therefore, instead of optimizing the offloading time, our focus is on maximizing the long-term computation rate, encompassing both the local computation and offloading rates. More specifically, this paper considers the problem of determining either the CPU-cycle frequency or, jointly, the offloading energy and the RIS phase shifts. We aim to develop schemes that attain the maximal computation rate under the constraints of limited resources. The problem is formulated as
\begin{align*} & \max \limits _{\beta (t),\,f(t),\,e_{o}(t),\,{\boldsymbol {\Phi }}(t)} \sum \limits _{t = 1}^{\infty }\left (\left (1 - \beta (t)\right ){r_{l}}(t) + \beta (t){r_{o}}(t)\right ) \\ & \;\;{\text {s.t.}}\quad \beta (t) \in \{0,1\}, & \text {(a)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } 0 \le f(t) \le f_{\max }, & \text {(b)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } 0 \le e_{o}(t) \le e_{\max }, & \text {(c)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } e_{o}(t) + e_{s} \le e_{r}(t), & \text {(d)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } e_{l}(t) + e_{s} \le e_{r}(t), & \text {(e)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } r_{o}(t) \ge r_{o,\min }, & \text {(f)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } r_{l}(t) \ge r_{l,\min }, & \text {(g)} \\ & \hphantom {\;\;{\text {s.t.}}\quad } \phi _{k}(t) \in {\textsf {F}}_{\text {ris}}, & \text {(h)} \tag {6}\end{align*}
where $r_{l}(t)$ and $r_{o}(t)$ are the local computing rate and the offloading rate of the SU, obtained from Eq. (1) and Eq. (5), respectively; $f_{\max}$ and $e_{\max}$ are the maximal computing capability and the maximal offloading energy of the SU, respectively; and $\beta(t) \in \{0,1\}$ is the offloading indicator of the SU. Constraint (a) indicates that the SU performs either local computing or data offloading in time slot $t$, where $\beta(t) = 1$ means that the task is offloaded and $\beta(t) = 0$ otherwise. Constraints (b) and (c) restrict the selection of the CPU frequency levels and the offloading energy levels, respectively. Next, the total energy consumed for either offloading and sensing, constraint (d), or local computing and sensing, constraint (e), must not exceed the current energy level $e_{r}(t)$ of the SU. The system guarantees a minimum offloading rate and a minimum local computing rate through constraints (f) and (g), respectively. It is noted that the remaining energy in the battery of the SU is affected by the dynamics of energy harvesting and energy consumption. Therefore, constraints (f) and (g) are only applied when the SU selects an offloading energy $e_{o}(t) > 0$ or a local computing energy $e_{l}(t) > 0$, respectively. Constraint (h) ensures that the RIS phase shifts are selected from the predefined set of phase-shift candidates, ${\textsf{F}}_{\text{ris}}$.
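The per-slot constraints of problem (6) can be checked with a small helper like the one below; it is a sketch under assumed argument names, where constraint (h) is handled separately by drawing each phase shift from ${\textsf{F}}_{\text{ris}}$, and constraints (f)/(g) are applied only when the corresponding energy is positive, as discussed above.

```python
def feasible(beta, f, e_o, e_l, e_s, e_r, r_o, r_l,
             f_max, e_max, r_o_min, r_l_min):
    """Check constraints (a)-(g) of problem (6) for one time slot."""
    if beta not in (0, 1):                          # (a) binary offloading indicator
        return False
    if not 0 <= f <= f_max:                         # (b) CPU-cycle frequency bound
        return False
    if not 0 <= e_o <= e_max:                       # (c) offloading-energy bound
        return False
    if beta == 1:
        if e_o + e_s > e_r:                         # (d) energy budget for offloading + sensing
            return False
        if e_o > 0 and r_o < r_o_min:               # (f) minimum offloading rate when offloading
            return False
    else:
        if e_l + e_s > e_r:                         # (e) energy budget for local computing + sensing
            return False
        if e_l > 0 and r_l < r_l_min:               # (g) minimum local rate when computing locally
            return False
    return True
```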
Problem (6) is non-trivial, and it is challenging to directly obtain an optimal solution owing to time-varying channel conditions, the stochastic properties of the harvested energy, and the dynamic behavior of the primary channel. Although optimal RIS phase shifts can be derived by mathematical solutions in the literature, decomposing problem (6) into multiple sub-problems is time consuming and may increase the computational burden. Besides, designing a scheme using mathematical tools might impose high computational complexity for implementation. Furthermore, the energy uncertainty of the battery due to stochastic energy arrivals makes the problem more challenging. Thus, we first formulate an MDP based on problem (6). Subsequently, DRL approaches are separately proposed to solve the MDP problem. In formulating the MDP problem for the DRL solutions, we incorporate the MEC environment, in which the SU operates as an edge node, processing tasks by leveraging the remote computational resources provided by the MEC server at the SBS. The SU's actions are defined as either performing local computation or offloading data to the SBS, thereby accessing MEC services. The proposed DRL schemes are designed to determine the optimal policy for the SU to maximize the computation rate within this MEC framework.
SECTION III.
Deep Reinforcement Learning Frameworks
In this section, the resource allocation problem (6) is reformulated as an MDP following the decision-making process. The solution to an MDP problem can generally be obtained by a POMDP scheme adopting a value-iteration-based method under the assumption that the state transition probabilities of the system are known a priori. Unfortunately, this assumption might be unrealistic due to the challenge of acquiring such information in practical scenarios. On the other hand, conventional reinforcement learning (RL) approaches can be adopted to address the problem of unknown environment knowledge. However, they cannot work effectively on high-dimensional optimization problems because of the prohibitive storage memory required to maintain a Q-table in large-scale systems. Therefore, inspired by the model-free property of RL and the powerful approximation ability of a deep neural network (DNN), we develop DRL frameworks for the MDP problem without knowing the energy harvesting distribution in advance.
A. Markov Decision Process and Observations
We describe the state space, action space, and reward function to formulate the MDP problem in this subsection. Let $\mathbb{S}$ denote the state space of the system, where each state comprises the belief that the primary channel is free, $p(t)$, the remaining energy of the SU, $e_{r}(t)$, and the channel condition, $\tilde{\mathbf{h}}(t)$, i.e., $s(t) = \left(p(t), e_{r}(t), \tilde{\mathbf{h}}(t)\right) \in \mathbb{S}$, where $\tilde{\mathbf{h}}(t) = \mathbf{H}_{\text{rb}}(t)\boldsymbol{\Phi}(t)\mathbf{h}_{\text{ur}}(t) + \mathbf{h}_{\text{ub}}(t)$. According to state $s(t)$, the agent selects action $a(t)$. The action space is expressed as $\mathbb{A} = \left\{a_{1}, a_{2}, \ldots, a_{|\mathbb{A}|}\right\}$, where $a(t) = \left(f(t), e_{o}(t), \boldsymbol{\Phi}(t)\right) \in \mathbb{A}$ is the action composed of the allocated CPU frequency, offloading energy, and phase shifts. It is noted that the action is generated according to the state $s(t)$ at the beginning of the time slot, before sensing. After obtaining the sensing result, the agent either offloads data to the SBS using the assigned $e_{o}(t)$ and $\boldsymbol{\Phi}(t)$ if the result shows state "F", or performs local computing using the assigned $f(t)$ if the result is "B". Intuitively, $\beta(t)$ is omitted from the action formulation since the sensing result determines the value of $\beta(t)$. Given the state $s(t)$ and action $a(t)$, the immediate reward, defined as the computation rate of the SU at time slot $t$, is calculated by
\begin{equation*} r(t) = \left (1 - \beta (t)\right ){r_{l}}(t) + \beta (t){r_{o}}(t). \tag {7}\end{equation*}
In the MDP framework, a transition function, $\mathsf{T}: \mathbb{S} \times \mathbb{A} \times \mathbb{S} \to [0,1]$, indicates the conditional state transition probability between two adjacent time slots. However, the state transition probability of the system considered in this work cannot be identified due to the continual variations of the environment in terms of the channel gains and the harvested energy of the SU. Therefore, the transition function $\mathsf{T}$ is unknown. The operation of the SU in each time slot is described as follows. First, the SU observes the network state $s(t)$. Subsequently, it generates the action $a(t)$ based on $s(t)$. After that, spectrum sensing is conducted to obtain the result $H(t)$. The SU then performs action $a(t)$: it either offloads data to the SBS with the assigned offloading energy if $H(t) = F$, or executes local computation with the allocated CPU frequency if $H(t) = B$. As a result, the SU receives an immediate reward and then observes the next state $s(t+1)$. We assume that the data offloaded by the SU is unsuccessfully computed by the MEC server if a transmission collision between the SU and the PT happens. Thus, the SU dynamically adjusts its actions according to the observations regarding states, actions, and obtained rewards during training. This per-slot operation is summarized in the sketch below, and the belief-updating process for each possible observation is presented afterwards.
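The following minimal sketch assumes the notation introduced above and uses illustrative values for $P_{d}$, $P_{f}$, and the harvested-energy arrivals; it is a simplified simulation of one time slot, not the environment implementation used in our experiments, and it defers the belief update to the observation-specific rules given next.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(state, action, channel_free, params):
    """One MDP transition sketch: sense, act, then collect the reward of Eq. (7).

    state        = (p, e_r, h_eff): belief, remaining energy, effective channel
    action       = (f, e_o): CPU-cycle frequency and offloading energy
                   (the chosen phase shifts are already folded into h_eff)
    channel_free = true primary-channel state, hidden from the agent
    """
    p, e_r, h_eff = state
    f, e_o = action
    tau, t_s, t_o, theta, gamma_sw, B, sigma2, e_s = params
    P_d, P_f = 0.9, 0.1                          # illustrative sensing probabilities

    # spectrum sensing: "busy" is reported with prob. P_d if busy, P_f if free
    sensed_busy = rng.random() < (P_d if not channel_free else P_f)

    if sensed_busy:                              # local computing (beta = 0), reward r_l(t)
        reward = (tau - t_s) / tau * f / theta
        e_used = (tau - t_s) * gamma_sw * f ** 3
    elif not channel_free:                       # misdetection: collision with the PT, no reward
        reward, e_used = 0.0, e_o
    else:                                        # successful offloading (beta = 1), reward r_o(t)
        snr = (e_o / t_o) * np.linalg.norm(h_eff) ** 2 / sigma2
        reward = (t_o / tau) * B * np.log2(1.0 + snr)
        e_used = e_o

    # battery update: subtract sensing + computing/offloading energy, add harvested energy
    e_r_next = e_r - e_s - e_used + rng.poisson(3) * 1e-4
    return reward, e_r_next, sensed_busy
```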
Observation 1 ($O_{1}$): The sensing result indicates state "B" of the primary channel, and the SU performs local computation. The system reward is then $r\left(s(t), a(t) \mid O_{1}\right) = r_{l}(t)$. We update the belief by
\begin{equation*} p(t + 1) = \frac {p(t){P_{f}}{P_{FF}} + \left (1 - p(t)\right ){P_{d}}{P_{BF}}}{p(t){P_{f}} + \left (1 - p(t)\right ){P_{d}}}, \tag {8}\end{equation*}
where $P_{FF}$ and $P_{BF}$ represent the transition probabilities of the primary channel from state F to state F and from state B to state F between two consecutive time slots, respectively. The remaining energy of the SU is computed by
\begin{equation*} {e_{r}}(t + 1) = {e_{r}}(t) - {e_{s}}(t) - {e_{l}}(t) + {e_{h}}(t), \tag {9}\end{equation*}
where $e_{s}(t)$ is the energy required for spectrum sensing at the SU.
Observation 2 ($O_{2}$): The state "F" of the primary channel is identified; the SU then offloads data to the SBS and successfully receives the computed result sent back by the SBS. The reward is $r\left(s(t), a(t) \mid O_{2}\right) = r_{o}(t)$. In this case, the updated belief for the next time slot is
\begin{align*} p(t + 1) = \begin{cases} \dfrac {p(t)\left (1 - {P_{f}}\right ){P_{FF}} + \left (1 - p(t)\right )\left (1 - {P_{d}}\right ){P_{BF}}}{p(t)\left (1 - {P_{f}}\right ) + \left (1 - p(t)\right )\left (1 - {P_{d}}\right )}, & {e_{o}}(t) = 0 \\ {P_{FF}}, & \text {otherwise}. \end{cases} \tag {10}\end{align*}
The first case occurs when the SU's energy is insufficient for task offloading, so the SU has to stay silent. On the other hand, if the SU offloads data to the SBS (i.e., $e_{o}(t) > 0$), it learns the actual state of the primary channel through the reception of the computed data from the MEC server and thus updates the belief using the second case. The remaining energy is
\begin{equation*} {e_{r}}(t + 1) = {e_{r}}(t) - {e_{s}}(t) - {e_{o}}(t) + {e_{h}}(t). \tag {11}\end{equation*}
Observation 3 ($O_{3}$): The sensing result indicates state "F" of the primary channel, and the SU offloads data to the SBS, but the SBS cannot successfully receive the data. This implies that there is a collision between the SU's offloading and the PT's transmission. Thus, there is no reward in this circumstance, i.e., $r\left(s(t), a(t) \mid O_{3}\right) = 0$. We update the belief by the following equation:
\begin{align*} p(t + 1) = \begin{cases} \dfrac {p(t)\left (1 - {P_{f}}\right ){P_{FF}} + \left (1 - p(t)\right )\left (1 - {P_{d}}\right ){P_{BF}}}{p(t)\left (1 - {P_{f}}\right ) + \left (1 - p(t)\right )\left (1 - {P_{d}}\right )}, & {e_{o}}(t) = 0 \\ {P_{BF}}, & \text {otherwise}. \end{cases} \tag {12}\end{align*}
The remaining energy is calculated in the same way as in Eq. (11). The process of performing an action at the SU is described in Algorithm 1, and a belief-update sketch is given after it. The DPG scheme is proposed in the next subsection.
Algorithm 1 The Process of Performing an Action at the SU
1: if sensing result is B then
2:   SU performs local computing and receives reward $r(t) = r_{l}(t)$.
3:   Update belief and remaining energy using Eqs. (8) and (9).
4: else
5:   SU offloads data to the SBS.
6:   if SU successfully receives the computed data then
7:     Receive reward $r(t) = r_{o}(t)$.
8:     Update belief and remaining energy using Eqs. (10) and (11).
9:   else
10:     Receive reward $r(t) = 0$.
11:     Update belief and remaining energy using Eqs. (12) and (11).
12:   end if
13: end if
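For completeness, the belief updates in Eqs. (8), (10), and (12) reduce to the small helper below; it is a sketch under assumed argument names, where the observation label ($O_{1}$, $O_{2}$, or $O_{3}$) selects the appropriate rule and a positive offloading energy means the true channel state is revealed by the offloading outcome.

```python
def update_belief(p, observation, e_o, P_f, P_d, P_FF, P_BF):
    """Belief that the primary channel will be free in the next slot (Eqs. 8, 10, 12).

    p           : current belief that the channel is free
    observation : "O1" (sensed busy), "O2" (offloading succeeded), "O3" (collision)
    e_o         : offloading energy used in the current slot
    """
    if observation == "O1":                      # Eq. (8): channel sensed as busy
        num = p * P_f * P_FF + (1 - p) * P_d * P_BF
        den = p * P_f + (1 - p) * P_d
        return num / den
    if e_o == 0:                                 # Eqs. (10)/(12), first case: SU stayed silent
        num = p * (1 - P_f) * P_FF + (1 - p) * (1 - P_d) * P_BF
        den = p * (1 - P_f) + (1 - p) * (1 - P_d)
        return num / den
    # offloading reveals the true state: success means the channel was free, collision means busy
    return P_FF if observation == "O2" else P_BF
```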
B. The Proposed Deep Policy Gradient Scheme
Let us define the discounted return (i.e., the accumulated reward) as $G = \sum_{t=1}^{\infty} \alpha^{t-1} r(t)$, where $\alpha$ is the discount factor. The target of the MDP problem is to determine a good policy, $\Omega(s)$, which is a function that maps a given state $s$ to an action. The state value function under policy $\Omega$, defined as the expected return starting from state $s$, is given by
\begin{align*} {V^{\Omega }}(s) = \mathbb {E}\left [G \mid s,\Omega \right ] = \mathbb {E}\left [\sum \limits _{t = 1}^{\infty }{\alpha ^{t - 1}}r(t) \mid s(1) = s,\Omega \right ], \tag {13}\end{align*}
where $\mathbb{E}[\cdot]$ denotes the expectation operator, and $\Omega$ is the stochastic policy giving the probability of choosing action $a$ in state $s$, i.e., $\Omega(a \mid s) = \Pr\left[a(t) = a \mid s(t) = s\right]$. In addition, we further define the state-action value function (i.e., the Q-value function) as the expected return of taking action $a$ in state $s$ and then following policy $\Omega$, which is expressed as
\begin{align*} {Q^{\Omega }}(s,a) & = \mathbb {E}\left [G \mid s,a,\Omega \right ] \\ & = \mathbb {E}\left [\sum \limits _{t = 1}^{\infty }{\alpha ^{t - 1}}r(t) \mid s(1) = s,a(1) = a,\Omega \right ]. \tag {14}\end{align*}
Unlike the Q-learning principle, the agent of the policy gradient method directly selects an action based on a probability distribution rather than the state-action value function. In this subsection, we propose a deep policy gradient scheme to solve the formulated problem, in which a DNN, namely the policy network, is employed to model the probability function of the policy, $\Omega\left(a \mid s, \boldsymbol{\theta}\right)$. It includes one input layer, $H_{p}$ hidden layers, each of which contains $N_{p}$ neurons, and one output layer. The system state with $M+2$ elements is the input of the policy network, while the output has the size of the action space, $|\mathbb{A}|$. The action with the highest preference in each state is given the highest probability of being chosen according to the soft-max activation function, $\Omega\left(a \mid s, \boldsymbol{\theta}\right) = \frac{e^{h\left(s, a, \boldsymbol{\theta}\right)}}{\sum_{a_{i} \in \mathbb{A}} e^{h\left(s, a_{i}, \boldsymbol{\theta}\right)}}, \forall a \in \mathbb{A}$. In the proposed DPG scheme, the weights are adjusted based on the performance measure function, $J(\boldsymbol{\theta}) \triangleq V^{\Omega_{\boldsymbol{\theta}}}(s)$, defined as the expected return at the start state $s$ of an episode. According to the policy gradient theorem, we can derive the gradient of $J(\boldsymbol{\theta})$ with respect to $\boldsymbol{\theta}$ as
\begin{equation*} \nabla J\left (\boldsymbol {\theta }\right ) = \sum \limits _{s \in \mathbb {S}} {d^{\Omega }}(s)\sum \limits _{a \in \mathbb {A}} {Q^{\Omega }}\left (s,a\right ) \nabla _{\boldsymbol {\theta }} {\Omega }\left (a \mid s,\boldsymbol {\theta }\right ), \tag {15}\end{equation*}
where $d^{\Omega}(s)$ represents the state distribution. The right-hand side of Eq. (15) is a sum over states weighted by how frequently the states occur under the target policy $\Omega$. Therefore, to remove the random variables of states and actions, we continue from the derivation of the performance measure function in Eq. (15) by introducing state $s(t)$, action $a(t)$, and an expectation under the policy $\Omega$. Thereby, we have
\begin{align*} \nabla J\left (\boldsymbol {\theta }\right ) & = {\mathbb {E}_{\Omega _{\boldsymbol {\theta }} }}\left [\sum \limits _{a \in \mathbb {A}} {Q^{\Omega }}\left (s(t),a\right )\nabla _{\boldsymbol {\theta }} {\Omega }\left (a \mid s(t),\boldsymbol {\theta }\right )\right ] \\ & = {\mathbb {E}_{\Omega _{\boldsymbol {\theta }} }}\left [G(t)\frac {\nabla _{\boldsymbol {\theta }} {\Omega }\left (a(t) \mid s(t),\boldsymbol {\theta }\right )}{{\Omega }\left (a(t) \mid s(t),\boldsymbol {\theta }\right )}\right ] \\ & = {\mathbb {E}_{\Omega _{\boldsymbol {\theta }} }}\left [G(t) \nabla _{\boldsymbol {\theta }} \ln {\Omega }\left (a(t) \mid s(t),\boldsymbol {\theta }\right )\right ]. \tag {16}\end{align*}
The measure function is optimized by adopting gradient ascent for the weight update:
\begin{equation*} \boldsymbol {\theta }(t + 1) = \boldsymbol {\theta }(t) + {\lambda _{p}}G(t) \nabla _{\boldsymbol {\theta }} \ln {\Omega }\left (a(t) \mid s(t),\boldsymbol {\theta }(t)\right ), \tag {17}\end{equation*}
where $\lambda_{p}$ is the learning rate of the policy network. The details of the proposed DPG scheme are given in Algorithm 2. Note that the policy parameterization is designed for the episodic case following the Monte Carlo method [39], where all updates are performed after the episode is completed. Fig. 2 illustrates the flow diagram of the proposed DPG. Although the DPG can yield a nearly optimal policy, it experiences a large variance in the estimated return, resulting in unstable convergence or even divergence during training.
Algorithm 2 Training Process of the Deep Policy Gradient Scheme
1: Initialize weight $\boldsymbol{\theta}$.
2: for each training episode $1, 2, \ldots, N$ do
3:   Observe the system state and generate an action $a(t)$ following $\Omega\left(\cdot \mid \cdot, \boldsymbol{\theta}\right)$.
4:   Execute the action, receive the reward $r(t)$, and observe the next state.
5:   Store the data samples $s(1), a(1), r(1), \ldots, s(T), a(T), r(T)$ of the episode.
6:   for each step $t = 1, 2, \ldots, T$ do
7:     Estimate the return $G(t) = \sum_{i = t}^{T} \alpha^{i - t} r(i)$.
8:     Update weight $\boldsymbol{\theta}$ by using Eq. (17).
9:   end for
10: end for
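A compact PyTorch realization of Algorithm 2 might look as follows; this is a sketch rather than the implementation used in our experiments, and the layer sizes, the discount factor, and the batched application of Eq. (17) over a whole episode (instead of a per-step update) are illustrative choices.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Softmax policy network: a state of size M + 2 is mapped to probabilities over |A| actions."""
    def __init__(self, state_dim, n_actions, hidden=128, n_hidden=2):
        super().__init__()
        layers, dim = [], state_dim
        for _ in range(n_hidden):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        layers += [nn.Linear(dim, n_actions)]
        self.net = nn.Sequential(*layers)

    def forward(self, s):
        return torch.softmax(self.net(s), dim=-1)      # soft-max over the discrete action space

def reinforce_update(policy, optimizer, states, actions, rewards, alpha=0.99):
    """Monte Carlo policy-gradient update over one finished episode (Eq. (17), Algorithm 2)."""
    returns, G = [], 0.0
    for r in reversed(rewards):                        # discounted returns computed backwards
        G = r + alpha * G
        returns.insert(0, G)
    returns = torch.tensor(returns, dtype=torch.float32)
    probs = policy(torch.stack(states))                # (T, |A|) action probabilities
    log_probs = torch.log(probs.gather(1, torch.tensor(actions).unsqueeze(1)).squeeze(1))
    loss = -(returns * log_probs).mean()               # ascent on J(theta) = descent on -J(theta)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()

# example wiring (dimensions are placeholders):
# policy = PolicyNet(state_dim=6, n_actions=64)
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
```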
In DRL schemes, computational complexity is shaped by various factors, including the dimensions of the state and action spaces, the architecture of the model, the choice of algorithm, the training parameters, and the reward function. Collectively, these elements define the training overhead and the time required to reach the optimal solution. However, the DNN model structure typically has the most significant impact on computational complexity, as it dictates the number of layers, neurons, and parameters, directly influencing the computational load and memory demands during training. Consequently, we focus on analyzing the DNN structure used in our proposed DRL methods to understand their computational complexity. Specifically, the computational complexity of the proposed DPG scheme is determined by the input and output dimensions and the number of hidden layers. It has $M + 2$ neurons in the input layer, $H_{p}$ hidden layers, each of which contains $N_{p}$ neurons, and $|\mathbb{A}|$ neurons in the output layer. Consequently, the computational complexity of the proposed DPG scheme is $\mathcal{O}\left(\left(M + 2 + |\mathbb{A}|\right)N_{p} + \left(H_{p} - 1\right)N_{p}^{2}\right)$. Next, we propose a deep Q-learning scheme, with which the agent can obtain stability while maximizing the expected reward.
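For concreteness, the weight count behind the DPG complexity expression above can be reproduced with the small helper below; bias terms are ignored, and the numbers in the example call are placeholders rather than our simulation settings.

```python
def dpg_parameter_count(M, A_size, N_p, H_p):
    """Weights of the DPG policy network, matching O((M + 2 + |A|) * N_p + (H_p - 1) * N_p^2)."""
    input_to_hidden = (M + 2) * N_p            # input layer (M + 2 neurons) to first hidden layer
    hidden_to_hidden = (H_p - 1) * N_p ** 2    # connections between consecutive hidden layers
    hidden_to_output = N_p * A_size            # last hidden layer to the |A| output neurons
    return input_to_hidden + hidden_to_hidden + hidden_to_output

print(dpg_parameter_count(M=4, A_size=64, N_p=128, H_p=2))
```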
C. The Proposed Deep Q-Learning
We now elaborate the proposed deep Q-learning scheme, in which an action is mainly selected based on the state-action value function rather than a policy network alone. In other words, the state-action value function is the key metric for evaluating the impact of the chosen action on the immediate and future rewards, generating the appropriate policy for the MDP problem. Generally, the original Q-learning is used to find the optimal policy $\Omega^{*}$ that yields the maximum Q-value function:
\begin{equation*} {Q^{*}}\left (s,a\right ) = {Q^{\Omega ^{*}}}\left (s,a\right ) = \max \limits _{\Omega }{Q^{\Omega }}\left (s,a\right ). \tag {18}\end{equation*}
Consequently, once the optimal Q-value function, $Q^{*}(s,a)$, of each state-action pair is obtained, the optimal policy is directly found by
\begin{equation*} {\Omega ^{*}}\left (a \mid s\right ) = \mathop {\arg \max }\limits _{a \in \mathbb {A}} {Q^{*}}\left (s,a\right ). \tag {19}\end{equation*}
From Eq. (19), it is found that determining the optimal policy \Omega ^{*}
is equivalent to find optimal value of {Q^{*}}\left ({{s,a}}\right)
. Based on Bellman equation, the optimal state-action value function is\begin{equation*} {Q^{*}}\left ({{s,a}}\right ) = r\left ({{s,a,{\Omega ^{*}}}}\right ) + \alpha \sum \limits _{s' \in \mathbb {S}} {{P_{s'}}\left ({{a\left |{{ s }}\right .}}\right )\max \limits _{a' \in \mathbb {A}} } {Q^{*}}\left ({{s',a'}}\right ), \tag {20}\end{equation*}
View Source
\begin{equation*} {Q^{*}}\left ({{s,a}}\right ) = r\left ({{s,a,{\Omega ^{*}}}}\right ) + \alpha \sum \limits _{s' \in \mathbb {S}} {{P_{s'}}\left ({{a\left |{{ s }}\right .}}\right )\max \limits _{a' \in \mathbb {A}} } {Q^{*}}\left ({{s',a'}}\right ), \tag {20}\end{equation*}
where r\left ({{s,a,{\Omega ^{*}}}}\right)
is an immediate reward achieved by taking action a in state s and {P_{s'}}\left ({{a\left |{{ s }}\right .}}\right)
denotes the probability that the system transitions from state s to s' when executing action a, while s' and a' denote the next state and action, respectively. In RL, the optimal state-action value function can be obtained without knowing the exact state transition model in advance by iteratively updating the Q-value function as follows:\begin{align*} & {Q^{\Omega }}\left ({{s,a}}\right ) \\ & = \left ({{1 - {\lambda _{q}}}}\right ){Q^{\Omega }}\left ({{s,a}}\right ) + {\lambda _{q}}\left ({{r\left ({{s,a}}\right ) + \alpha \max \limits _{a' \in \mathbb {A}} {Q^{\Omega }}\left ({{s',a'}}\right )}}\right ), \tag {21}\end{align*}
where \lambda _{q}
stands for the learning rate of the Q-learning scheme. Conventional Q-learning can acquire the optimal policy for small-scale optimization problems. However, it does not work effectively in systems with large state and action spaces, where the high variance of the value estimates may lead to convergence to a locally optimal policy. Thus, we propose the DQL scheme, in which a DNN, namely the Q-network, is employed to approximate the Q-value function, denoted by {Q^{\Omega } _{\boldsymbol {w}} }\left ({{s,a}}\right) \approx {Q^{\Omega } }\left ({{s,a}}\right)
, instead of generating the Q-table used in traditional Q-learning approaches. The Q-network comprises an input layer, H_{q}
hidden layers, and an output layer. A system state s with M+2
neuron units is used as the input of the Q-network. Each hidden layer is a fully connected layer containing N_{q}
neuron units and uses the rectified linear activation function. The output layer with a size of \left |{{ \mathbb {A} }}\right |
estimates the Q-values of the possible state-action pairs and employs a linear activation function. In the proposed DQL scheme, perturbations of the discretized action values can cause a large variance in the estimated Q-value function; hence, the policy might still be unstable and the training might not converge. For that reason, we adopt a fixed target network and an experience replay mechanism to relieve the impact of such perturbations, as well as of the data correlation in the discrete action space, on the stability of the training phase. For the fixed target network, another neural network with the same structure as the Q-network, parameterized by weight \hat { \boldsymbol {w}}, is generated to estimate the target Q-value. Meanwhile, the experience replay mechanism trains the network on a finite number of data samples stored in the replay buffer \mathcal {B}, in which each sample, \left ({{s,a,r,s'}}\right), is a tuple recording the outcome of a specific action taken in a given state. Rather than selecting consecutive data samples, which are highly correlated, random mini-batches are drawn to reduce the correlation during learning. By applying the fixed target network together with experience replay, the loss function in Eq. (26) (given below) is rewritten as\begin{equation*} L_{\boldsymbol {w}} = {\mathbb {E}_{\mathcal {B}}}\left [{{{\left ({{Q_{{\text {target}}}^{\Omega }\left ({{s,a}}\right ) - {Q^{\Omega }_{\boldsymbol {w}} }\left ({{s,a}}\right )}}\right )}^{2}}}\right ], \tag {22}\end{equation*}
where Q_{{\text {target}}\;}^{\Omega }
is the target value function obtained by the target Q-network and is calculated as\begin{equation*} Q_{{\text {target}}}^{\Omega }\left ({{s,a}}\right ) = r\left ({{s,a}}\right ) + \alpha \max \limits _{a' \in \mathbb {A}} {Q^{\Omega }_{\hat {\boldsymbol {w}}} }\left ({{s',a'}}\right ). \tag {23}\end{equation*}
The target Q-network is periodically updated by copying the weights of the Q-network during the training process, and its weights remain unchanged for a finite number of training time slots between updates. Using the target Q-value, the temporal difference (TD) error is calculated as\begin{equation*} \delta = Q_{{\text {target}}}^{\Omega }\left ({{s,a}}\right ) - Q_{\boldsymbol {w}}^{\Omega }\left ({{s,a}}\right ). \tag {24}\end{equation*}
With the proposed DQL scheme, the training performance is improved by adjusting weight \boldsymbol {w}
in the direction of increasing the cumulative reward and stability. We update the weight of the Q-network to minimize the loss function using the stochastic gradient descent method as follows\begin{equation*} \boldsymbol {w} = \boldsymbol {w} + {\lambda _{q}}\delta {\nabla _{\boldsymbol {w}}}{Q^{\Omega }_{\boldsymbol {w}} }\left ({{s,a}}\right ). \tag {25}\end{equation*}
Note that, without the fixed target network and experience replay, the Q-network would be trained by tuning the DNN weights to minimize the loss function defined as the mean square error between the target and current Q-values:\begin{equation*} L_{\boldsymbol {w}} = \mathbb {E}\left [{{{\left ({{r\left ({{s,a}}\right ) + \alpha \max \limits _{a' \in \mathbb {A}} {Q^{\Omega }_{\boldsymbol {w}} }\left ({{s',a'}}\right ) - {Q^{\Omega }_{\boldsymbol {w}} }\left ({{s,a}}\right )}}\right )}^{2}}}\right ]. \tag {26}\end{equation*}
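To make the above training procedure concrete, the following minimal Python/TensorFlow sketch illustrates how the Q-network, the fixed target network, and the mini-batch loss in Eqs. (22)-(23) could be realized. The layer sizes, the discount factor value, and all function names are our own illustrative assumptions, and the sketch uses plain SGD to mirror Eq. (25), whereas the simulations in Section IV employ the Adam optimizer:

import tensorflow as tf

def build_q_network(state_dim, num_actions, hidden_layers=2, hidden_units=24):
    # Fully connected Q-network: ReLU hidden layers and a linear output over the action set.
    model = tf.keras.Sequential([tf.keras.Input(shape=(state_dim,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(hidden_units, activation='relu'))
    model.add(tf.keras.layers.Dense(num_actions))   # linear activation for the Q-values
    return model

STATE_DIM, NUM_ACTIONS = 6, 16                      # hypothetical sizes: M + 2 = 6 and |A| = 16
q_net = build_q_network(STATE_DIM, NUM_ACTIONS)
target_net = build_q_network(STATE_DIM, NUM_ACTIONS)
target_net.set_weights(q_net.get_weights())         # fixed target network
optimizer = tf.keras.optimizers.SGD(learning_rate=0.005)   # lambda_q
alpha = 0.9                                                # discount factor (assumed value)

def dql_train_step(states, actions, rewards, next_states):
    # One mini-batch update; "actions" holds integer action indices.
    rewards = tf.cast(rewards, tf.float32)
    # Target Q-value from the target network, Eq. (23).
    q_target = rewards + alpha * tf.reduce_max(target_net(next_states), axis=1)
    with tf.GradientTape() as tape:
        q_all = q_net(states)
        q_sa = tf.reduce_sum(q_all * tf.one_hot(actions, NUM_ACTIONS), axis=1)
        loss = tf.reduce_mean(tf.square(q_target - q_sa))      # mini-batch loss, Eq. (22)
    grads = tape.gradient(loss, q_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, q_net.trainable_variables))  # gradient step, Eqs. (24)-(25)
    return float(loss)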
The agent selects an action in each step by using an \epsilon -greedy policy: an action is randomly chosen with probability \epsilon _{q}
. Otherwise, the action with the maximal state-action value function is selected according to the output of the Q-network with probability 1-\epsilon _{q}
. The value of \epsilon _{q}
decays with a rate \upsilon _{q}
and is updated in each step. The training process of the proposed DQL is presented in Algorithm 3, which repeats until the system converges. In addition, Fig. 3 illustrates a flow diagram of the proposed DQL scheme. Although the proposed DQL scheme can optimize the expected reward with stable behavior, it may converge to a locally optimal policy in systems with such large state and action spaces. The computational complexity of the DQL scheme is determined as follows. The Q-network of the proposed DQL has M + 2
neurons in the input layer, H_{q}
hidden layers, each of which contains N_{q}
neurons, and \left |{{ \mathbb {A} }}\right |
neurons in the output layer. Besides, the target Q-network has the same structure as the Q-network. As a result, the computational complexity of the proposed DQL scheme is calculated by \mathcal {O}\left ({{2\left ({{\left ({{M + 2 + \left |{{ \mathbb {A} }}\right |}}\right){N_{q}} + \left ({{{H_{q}} - 1}}\right)N_{q}^{2}}}\right)}}\right)
. In the following, we investigate a combined scheme, namely deep actor-critic, which inherits the advantages of both policy-based and value-based methods.
Algorithm 3 Training Process of the Deep Q-Learning Scheme
1:Initialize the weights \boldsymbol {w}
, \hat {\boldsymbol {w}}
, and set \epsilon _{q}=1
.
2:for each training episode 1,2,\ldots ,N
do
3:for each step t = 1,2,\ldots ,T
do
4:Define \epsilon _{q} = \max \left ({{\epsilon _{q} \upsilon _{q},{\epsilon _{q}^{min}}}}\right)
.
5:Observe system state s(t) and select action a(t)
based on \epsilon
-greedy policy as\begin{aligned} a\left ({ t }\right) = {\begin{cases} {\mathop {\arg \max }\limits _{a\left ({ t }\right) \in \mathbb {A}} Q_{\boldsymbol {w}}^{\Omega }\left ({s\left ({ t }\right),a\left ({ t }\right)}\right)}& {{\text {w}}.{\text {p}}.\;1 - \epsilon _{q} } \\ {{\text {random}}\;{\text {action}}\;a\left ({ t }\right) \in \mathbb {A}}& {\text {otherwise}} \end{cases}} \end{aligned}
6:Execute action a(t) and receive reward r(t).
7:Observe next state s(t+1).
8:Store sample \left ({{s\left ({{ t }}\right),a\left ({{ t }}\right),r\left ({{ t }}\right),s\left ({{t + 1}}\right)}}\right)
in buffer \mathcal {B}, and then select a mini-batch.
9:for each sample i in the mini-batch do
10:Compute Q^{\Omega } _{\boldsymbol {w}}\left ({{{s_{i}},a_{i}}}\right)
and estimate the target Q-value by\begin{aligned} {Q_{{\text {target}}}^{\Omega } = } {\begin{cases} {{r_{i}}}& {{\text {if}}\;{s_{i + 1}}\;{\text {is}}\;{\text {terminal}}} \\ {{r_{i}} + \alpha \max \limits _{a' \in \mathbb {A}} Q_{\hat {\boldsymbol {w}}}^{\Omega } \left ({{{s_{i + 1}},a'}}\right)}& {\text {otherwise}} \end{cases}} \end{aligned}
12:Update weight \boldsymbol {w}
.
14:Update the target Q-network by copying the weights of the Q-network.
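As a complement to Algorithm 3, the following short Python sketch illustrates the replay buffer used in step 8 and the \epsilon -greedy selection used in step 5. The buffer capacity of 50 and the mini-batch size of 20 follow the settings reported in Section IV; the class and function names are our own:

import random
from collections import deque
import numpy as np

class ReplayBuffer:
    # Finite replay buffer storing (s, a, r, s') tuples, as in step 8 of Algorithm 3.
    def __init__(self, capacity=50):                 # buffer size of 50 as set in Section IV
        self.buffer = deque(maxlen=capacity)         # the oldest sample is dropped when full

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=20):                 # mini-batch size of 20 as in Section IV
        # Random mini-batches break the correlation between consecutive samples.
        batch = random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
        states, actions, rewards, next_states = zip(*batch)
        return np.array(states), np.array(actions), np.array(rewards), np.array(next_states)

def epsilon_greedy(q_values, epsilon_q):
    # Step 5 of Algorithm 3: random action w.p. epsilon_q, otherwise greedy w.r.t. the Q-network output.
    if random.random() < epsilon_q:
        return random.randrange(len(q_values))
    return int(np.argmax(q_values))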
D. The Proposed Deep Actor-Critic Scheme
In the AC method, the system operates based on both policy improvement (i.e., the actor) and policy evaluation (i.e., the critic), following the direction of maximizing the long-term return. More particularly, the actor determines the parameterized policy, while the critic evaluates the quality of the taken actions through the returned rewards. Inspired by the above analysis, we employ the DAC framework to solve the considered MDP problem. In the proposed DAC scheme, two main DNNs are used to approximate the policy and value functions. Specifically, the actor and critic neural networks are parameterized by weights \boldsymbol {\mu }_{a}
and \boldsymbol {\mu }_{c}
, respectively. These weights are randomly initialized at the beginning of training and continuously updated during the training process. Unlike the proposed DQL scheme, where an action is selected based on the Q-value function, the proposed DAC scheme selects an action based on the policy network (i.e., the actor network) while maximizing the state value function evaluated by the critic network. The goal is to determine the optimal policy, \Omega ^{*}
, to maximize the discounted state value function following the Bellman equation as follows:\begin{equation*} {\Omega ^{*}}\left ({{ s }}\right ) = \mathop {\arg \max }\limits _{\Omega }\left \{{{{V^{\Omega }}(s)}}\right \}. \tag {27}\end{equation*}
Critic: In our model, the critic network comprises an input layer, H_{c}
hidden layers, and an output layer. The system state with M+2
neuron units represents the input of the critic network. The hidden layers are fully connected layers, each of which contains N_{c}
neuron units and uses the rectified linear unit activation function. The output layer has one neuron with a linear activation function to generate the estimate of the state value function, {V^{\Omega } }\left ({{ s }}\right). To enhance training stability and avoid divergence, this scheme also adopts experience replay and a fixed target network. Accordingly, after each step, a data sample is stored in the replay buffer \mathcal {D} of limited size. Once \mathcal {D}
is full, the oldest data will be replaced by a new one. A target critic network with the duplicated structure of the critic network, parameterized by weight \hat {\boldsymbol {\mu }}_{c}
, is used to estimate the target state value function. In each time slot, the agent performs an action selected by the actor. Subsequently, the critic receives the instant reward, r, and observes the next state s'
. After that, it calculates a TD error, which represents the deviation between the target value and the estimated value of the state value function, by using the following equation:\begin{equation*} {\delta _{c}} = V_{\text {target}}^{\Omega }\left ({{ s }}\right ) - V_{\boldsymbol {\mu } _{c}}^{\Omega }\left ({{ s }}\right ), \tag {28}\end{equation*}
where V_{\text {target}}^{\Omega } \left ({{ s }}\right) = r + \alpha V_{{\hat {\boldsymbol {\mu }} }_{c}}^{\Omega } \left ({{s'}}\right)
represents the target state value function obtained by the target critic network, with {V_{{\hat {\boldsymbol {\mu }} }_{c}}}\left ({{s'}}\right) denoting the value estimated by the target critic for the next state s'. With the fixed target network and replay memory, the loss function of the critic network is defined as the mean-squared error between V_{\text {target}}^{\Omega }\left ({{ s }}\right) and V_{\boldsymbol {\mu } _{c}}^{\Omega }\left ({{ s }}\right) as follows:\begin{equation*} {L_{\boldsymbol {\mu } _{c}}} = {\mathbb {E}_{\mathcal { D}}}\left [{{{\left ({{r + \alpha V_{{\hat {\boldsymbol {\mu }} }_{c}}^{\Omega }\left ({{s'}}\right ) - V_{\boldsymbol {\mu } _{c}}^{\Omega }\left ({{ s }}\right )}}\right )}^{2}}}\right ]. \tag {29}\end{equation*}
During the training, we use stochastic gradient descent with the back-propagation algorithm to minimize the loss function, such that the weight \boldsymbol {\mu _{c} }
is optimized, as follows\begin{equation*} {{\boldsymbol {\mu } _{c}} = {\boldsymbol {\mu } _{c}} + {\lambda _{c}}{\delta _{c}}{\nabla _{\boldsymbol {\mu } _{c}}}{V^{\Omega }_{\boldsymbol {\mu } _{c}}}\left ({{s}}\right )}, \tag {30}\end{equation*}
where {\lambda _{c}} \gt 0
denotes the learning rate of the critic.
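A minimal Python/TensorFlow sketch of the critic update in Eqs. (28)-(30) is given below. The layer sizes and the discount factor are illustrative assumptions, and plain SGD with {\lambda _{c}} = 0.05 is used to mirror Eq. (30), although the simulations in Section IV rely on the Adam optimizer:

import tensorflow as tf

def build_critic(state_dim, hidden_layers=2, hidden_units=24):
    # Critic network: ReLU hidden layers and a single linear output estimating V(s).
    model = tf.keras.Sequential([tf.keras.Input(shape=(state_dim,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(hidden_units, activation='relu'))
    model.add(tf.keras.layers.Dense(1))
    return model

STATE_DIM = 6                                        # hypothetical M + 2
critic = build_critic(STATE_DIM)
target_critic = build_critic(STATE_DIM)
target_critic.set_weights(critic.get_weights())      # fixed target critic
critic_opt = tf.keras.optimizers.SGD(learning_rate=0.05)    # lambda_c
alpha = 0.9                                                 # discount factor (assumed value)

def critic_step(states, rewards, next_states):
    # Target value from the target critic, then TD error and MSE loss, Eqs. (28)-(29).
    rewards = tf.cast(rewards, tf.float32)
    v_target = rewards + alpha * tf.squeeze(target_critic(next_states), axis=1)
    with tf.GradientTape() as tape:
        v_s = tf.squeeze(critic(states), axis=1)
        loss = tf.reduce_mean(tf.square(v_target - v_s))
    grads = tape.gradient(loss, critic.trainable_variables)
    critic_opt.apply_gradients(zip(grads, critic.trainable_variables))   # gradient step, Eq. (30)
    return v_target - v_s              # TD error delta_c, reused by the actor in Eq. (33)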
Actor: Similar to the critic network, the actor network (i.e., policy network) is designed with an input layer, H_{a}
hidden layers, and an output layer. The input layer of the actor network represents the system state. Each hidden layer includes N_{a} neuron units and uses the rectified linear unit activation function. The actor network is in charge of determining \Omega
, such that state s \in \mathbb {S}
can be mapped to action a \in \mathbb {A}
. Hence, the mapping policy \Omega \approx {\Omega _{\boldsymbol {\mu } _{a}}}\left ({{ \mathbb {S} }}\right)
is considered a function of \mathbb {S}
and is parameterized by weight \boldsymbol {\mu } _{a}
. The policy network outputs a distribution over the possible actions for a given state. Accordingly, the number of neuron units in the output layer of the actor network is set to the number of possible actions, and each output indicates the probability of choosing the corresponding action. The output layer utilizes a soft-max activation function to estimate the probability of each action as follows:\begin{equation*} \Omega \left ({{a\left |{{s,{\boldsymbol {\mu } _{a}}}}\right .}}\right ) = \frac {e^{h\left ({{s,a,{\boldsymbol {\mu } _{a}}}}\right )}}{\sum \nolimits _{{a_{i}} \in \mathbb {A}} {e^{h\left ({{s,{a_{i}},{\boldsymbol {\mu } _{a}}}}\right )}} },\forall a \in \mathbb {A}. \tag {31}\end{equation*}
The objective function in the actor is defined by\begin{align*} J\left ({{\boldsymbol {\mu } _{a}}}\right ) & = \sum \limits _{s \in \mathbb {S}} {{d^{\Omega }}\left ({{ s }}\right ){V^{\Omega }}\left ({{ s }}\right )} \\ & = \sum \limits _{s \in \mathbb {S}} {{d^{\Omega }}\left ({{ s }}\right )\sum \limits _{a \in \mathbb {A}} {Q^{\Omega }} \left ({{s,a}}\right )\Omega \left ({{a\left |{{s,{\boldsymbol {\mu } _{a}}}}\right .}}\right )}. \tag {32}\end{align*}
We optimize the objective function by using policy gradient with TD error obtained in the critic as follows:\begin{equation*} {{\boldsymbol {\mu } _{a}} = {\boldsymbol {\mu } _{a}} + {\lambda _{a}}{\mathbb {E}_{\Omega _{\boldsymbol {\mu } _{a}}}}\left [{{{\nabla _{\boldsymbol {\mu } _{a}}}\ln \Omega \left ({{a\left |{{s,{\boldsymbol {\mu } _{a}}}}\right .}}\right ){\delta _{c}}}}\right ]}, \tag {33}\end{equation*}
where \lambda _{a}
is the learning rate of the actor. We optimize \boldsymbol {\mu } _{a}
by applying the gradient ascent method to maximize the objective function of the actor during learning. The actor adopts the \epsilon -
greedy method for selecting actions, in which an action is randomly chosen with probability \epsilon _{a}
, or based on the policy network with probability 1-\epsilon _{a}
. The value of \epsilon _{a}
decays with a rate \upsilon _{a}
after every training step. The proposed DAC algorithm and its flow diagram are respectively given in Algorithm 4 and Fig. 4.
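For illustration, the soft-max policy of Eq. (31) and the TD-error-weighted policy-gradient update of Eq. (33) can be sketched in Python/TensorFlow as follows. The network sizes and names are our own assumptions, and the TD error is the quantity produced by the critic update sketched earlier:

import tensorflow as tf

def build_actor(state_dim, num_actions, hidden_layers=2, hidden_units=24):
    # Actor network: ReLU hidden layers and a soft-max output giving the policy of Eq. (31).
    model = tf.keras.Sequential([tf.keras.Input(shape=(state_dim,))])
    for _ in range(hidden_layers):
        model.add(tf.keras.layers.Dense(hidden_units, activation='relu'))
    model.add(tf.keras.layers.Dense(num_actions, activation='softmax'))
    return model

STATE_DIM, NUM_ACTIONS = 6, 16                       # hypothetical M + 2 and |A|
actor = build_actor(STATE_DIM, NUM_ACTIONS)
actor_opt = tf.keras.optimizers.SGD(learning_rate=0.005)    # lambda_a

def actor_step(states, actions, td_errors):
    # Policy-gradient ascent on J(mu_a), weighted by the critic's TD error, Eq. (33).
    td_errors = tf.stop_gradient(tf.cast(td_errors, tf.float32))
    with tf.GradientTape() as tape:
        probs = actor(states)                                        # Omega(a | s, mu_a)
        chosen = tf.reduce_sum(probs * tf.one_hot(actions, NUM_ACTIONS), axis=1)
        loss = -tf.reduce_mean(tf.math.log(chosen + 1e-8) * td_errors)
    grads = tape.gradient(loss, actor.trainable_variables)
    actor_opt.apply_gradients(zip(grads, actor.trainable_variables))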
Algorithm 4 Training Algorithm of the Deep Actor-Critic Scheme
1:Initialize the weights \boldsymbol {\mu } _{a}
, \boldsymbol {\mu } _{c}
, \hat {\boldsymbol {\mu }} _{c}
, and set \epsilon _{a} = 1
.
2:for each training episode 1,2,\ldots ,N
do
3:for each step t = 1,2,\ldots ,T
do
4:Define \epsilon _{a} = \max \left ({{\epsilon _{a} \upsilon _{a},{\epsilon _{a}^{min} }}}\right)
.
5:Identify state s(t)
and select action a(t)
based on \epsilon
-greedy policy as\begin{aligned} {a\left ({{ t }}\right) = } {\begin{cases} {\mathop {\arg \max }\limits _{a\left ({{ t }}\right) \in \mathbb {A}} \left \{{{\Omega \left ({{a\left ({{ t }}\right)\left |{{s\left ({{ t }}\right),{\boldsymbol {\mu } _{a}}}}\right .}}\right)}}\right \}}& {{\text {w}}.{\text {p}}.\;1 - \epsilon _{a} } \\ {{\text {random}}\;{\text {action}}\;a\left ({{ t }}\right) \in \mathbb {A}}& {\text {otherwise}} \end{cases}} \end{aligned}
6:Execute action a(t) and receive reward r(t).
7:Observe next state s(t+1).
8:Store the sample \left ({{s\left ({{ t }}\right),a\left ({{ t }}\right),r\left ({{ t }}\right),s\left ({{t + 1}}\right)}}\right)
in buffer \mathcal {D}, and then select a mini-batch.
9:for each sample i in the mini-batch do
10:Compute V_{\boldsymbol {\mu }_{c}} ^{\Omega } \left ({{s_{i}}}\right)
and estimate the target state value function by\begin{aligned} {V_{{\text {target}}}^{\Omega } = } {\begin{cases} {{r_{i}}}& {{\text {if}}\;{s_{i + 1}}\;{\text {is}}\;{\text {terminal}}} \\ {{r_{i}} + \alpha V_{{\hat {\boldsymbol {\mu }} }_{c}}^{\Omega } \left ({{s_{i + 1}}}\right)}& {\text {otherwise}} \end{cases}} \end{aligned}
11:Compute the TD error \delta _{c}
.
13:Update weight \boldsymbol {\mu } _{c}
and \boldsymbol {\mu }_{a}
.
15:Update weight \hat {\boldsymbol {\mu }} _{c}
by copying the weight of the critic network.
The computational complexity of the DAC scheme is described as follows. The critic network of the proposed DAC has M + 2
neurons in the input layer, H_{c}
hidden layers, each of which contains N_{c}
neurons, and one neuron in the output layer. The target critic network has the same structure as the critic network. In the actor network, there are M + 2
neurons in the input layer, H_{a}
hidden layers, each of which contains N_{a}
neurons, and \left |{{ \mathbb {A} }}\right |
neurons in the output layer. Consequently, the computational complexity of the proposed DAC scheme is calculated by \mathcal {O}\left ({\left ({M + 2 + \left |{ \mathbb {A} }\right |}\right){N_{a}} + \left ({{H_{a}} - 1}\right)N_{a}^{2} + 2\left ({\left ({M + 3}\right){N_{c}} + \left ({{H_{c}} - 1}\right)N_{c}^{2}}\right)}\right)
. Compared to the DPG and DQL schemes, the DAC scheme imposes a higher computational complexity. However, stability and good convergence are gained by using both value-based and policy-based mechanisms.
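To illustrate how these complexity expressions compare, the short Python sketch below evaluates the dominant multiply-accumulate counts of the three schemes for a set of example sizes (the value of \left |{ \mathbb {A} }\right | is purely illustrative):

def dnn_complexity(input_dim, hidden_layers, hidden_units, output_dim):
    # Dominant multiply-accumulate count of a fully connected network with equal-width
    # hidden layers: (input + output) * N + (H - 1) * N^2, matching the expressions above.
    return (input_dim + output_dim) * hidden_units + (hidden_layers - 1) * hidden_units ** 2

M, A, H, N = 4, 16, 2, 24                            # |A| = 16 is only an illustrative value
c_dpg = dnn_complexity(M + 2, H, N, A)                                  # policy network
c_dql = 2 * dnn_complexity(M + 2, H, N, A)                              # Q-network plus target Q-network
c_dac = dnn_complexity(M + 2, H, N, A) + 2 * dnn_complexity(M + 2, H, N, 1)  # actor, critic, target critic
print(c_dpg, c_dql, c_dac)   # 1104 2208 2592 for these sizes, i.e., DAC is the most complex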
SECTION IV.
Numerical Results and Analysis
This section presents the performance evaluation of the proposed DRL schemes. Simulation results were obtained using Python 3.8 and Anaconda 2021 integrated with the TensorFlow deep learning library. We set the locations of the SU, the RIS, and the SBS to {\mathbf {u}} = {\left ({0, 0, 1.5}\right)^{T}}, {\mathbf {r}} = {\left ({30, 30, 5}\right)^{T}}, and {\mathbf {b}} = {\left ({80, 0, 20}\right)^{T}}
, respectively. The SBS has M=4
antennas and the number of RIS elements is K=4
. The battery capacity of SU is {E_{Ca}} = 30\mu J
. We set \varepsilon = 7\mu J
, {e_{\max }} = 13 \mu J
, {P_{FF}} = 0.8
, and {P_{BF}} = 0.2
. For spectrum sensing, {P_{f}} = 0.1
, and {P_{d}} = 0.9
. We design four layers for each DNN used in the proposed schemes: an input layer, two 24-node hidden layers, and an output layer, i.e., {H_{p}} = {H_{q}} = {H_{a}} = {H_{c}} = 2
and {N_{p}} = {N_{q}} = {N_{a}} = {N_{c}} = 24
. The ReLU activation function is used for the hidden layers of the DNNs in the proposed schemes. In the DPG scheme, the softmax activation function is employed for the output layer, while in the DQL scheme we adopt the linear activation function for the output layers of both the Q-network and the target Q-network. As for the DAC scheme, we utilize the softmax activation function for the output of the actor network, whereas the critic and target critic networks employ linear activation functions for their output layers. Furthermore, the Adam optimization algorithm is used to periodically update the DNN weights. For both the DQL and DAC algorithms, the replay memory size and the minibatch size are set to 50 and 20, respectively. It is assumed that the MEC server has sufficient computational capability and that the computation result is a short message. Hence, for simplicity, the computation time at the MEC server, t_{m}
, and the feedback delay (i.e., downloading time of the SU t_{d}
) are negligible when compared with offloading time. Therefore, we omit t_{m}
and t_{d}
in the simulation. Unless otherwise stated, \tau = 0.2\;{\text {s}}
, {t_{s}} = 0.002\;{\text {s}}
, {\alpha _{L}} = 3.7
, B = 1\;{\text {MHz}}
, {\sigma ^{2}} = - 80\;{\text {dBm}}
, {f_{\max }} = 2.3 \times {10^{8}}\;{\text {Hz}}
, \gamma = 0.5 \times {10^{-29}}
, \theta = 737.5\;{\text {cycles/bit}}
. The proposed algorithms are trained over 2000 episodes, each of which includes 103 time slots. We realized the algorithms through 104 time slots to obtain the average results. The table of simulation parameters is given in table 1. To evaluate the performance, we compare our schemes with four traditional methods as follows:
Double deep Q-learning (DDQL): adopts DRL scheme using double deep Q-network to derive the optimal solution [40].
Myopic: optimizes the immediate reward following a greedy method that requires neither the state transition probabilities nor the statistics of the harvested energy [41].
Local computing: greedily optimizes the performance by only executing local computing.
Random: randomly selects the actions for both local and remote computing.
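Before turning to the results, the main simulation settings listed above can be collected in a single Python configuration dictionary as a convenience for reproducing the setup; the variable names are our own and the energy-related values are expressed in joules:

SIM_PARAMS = {
    'u': (0.0, 0.0, 1.5),        # SU location (m)
    'r': (30.0, 30.0, 5.0),      # RIS location (m)
    'b': (80.0, 0.0, 20.0),      # SBS location (m)
    'M': 4,                      # number of SBS antennas
    'K': 4,                      # number of RIS elements
    'E_Ca': 30e-6,               # battery capacity of the SU (J)
    'epsilon_mean': 7e-6,        # mean harvested energy (J)
    'e_max': 13e-6,              # maximum offloading energy (J)
    'P_FF': 0.8, 'P_BF': 0.2,    # as set in Section IV
    'P_f': 0.1, 'P_d': 0.9,      # false-alarm and detection probabilities
    'tau': 0.2, 't_s': 0.002,    # slot duration and sensing time (s)
    'alpha_L': 3.7,              # as set in Section IV
    'B': 1e6,                    # bandwidth (Hz)
    'sigma2_dBm': -80.0,         # noise power (dBm)
    'f_max': 2.3e8,              # f_max (Hz)
    'gamma': 0.5e-29, 'theta': 737.5,          # gamma and theta (cycles/bit)
    'hidden_layers': 2, 'hidden_units': 24,    # H and N of every DNN
    'replay_size': 50, 'minibatch_size': 20,
    'lr_dpg': 0.005, 'lr_dql': 0.005, 'lr_actor': 0.005, 'lr_critic': 0.05,
}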
We first examine the impact of the maximum offloading energy on the network performance of the proposed schemes in Fig. 5, where Fig. 5(a), Fig. 5(b), and Fig. 5(c) respectively illustrate the convergence properties of the DPG, DQL, and DAC schemes for two different values of e_{\max } (i.e., 10\,\mu {\text {J}} and 13\,\mu {\text {J}}). In particular, each sub-figure of Fig. 5 plots the episodic average reward and the instant reward versus the training episodes. We compute the episodic instant reward in episode i and the episodic average reward in episode k by r_{i}^{\text {ins} }= \frac {1}{T}\sum \limits _{t = 1}^{T} {r\left ({{ t }}\right)}
and r_{k}^{\text {avg}} = \frac {1}{k}\sum \limits _{i = 1}^{k} {r_{i}^{\text {ins}}}
. It is observed that although the schemes attain higher rewards with a larger maximum offloading energy, i.e., {e_{\max }} = 13\;{\mu \text {J}}, the convergence speeds are degraded. This is because the action space grows with the number of possible actions, leading to poorer convergence. As seen from Fig. 5(a), the DPG scheme yields a high variance in the instant reward obtained during the training phase. This is because the DPG gradient update relies on a gradient estimate generated from a series of data samples collected over the whole episode; this estimate is often noisy, and a poor gradient estimate can adversely affect the stability of the training process. Furthermore, the larger the value of e_{\max }, the more instability the system suffers during training. In contrast to the DPG scheme, stability is achieved by both the DQL and DAC schemes, as shown in Fig. 5(b) and Fig. 5(c). The reason is that DQL and DAC optimize the system return by using the value-based method, a fixed target network, and experience replay, which significantly reduce data correlation and instability while guiding the agent to select appropriate actions.
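For clarity, the episodic instant reward and the episodic average reward plotted in Fig. 5 can be computed from the per-slot rewards as in the following short Python sketch (the function and variable names are our own):

import numpy as np

def episodic_rewards(reward_history):
    # reward_history[i][t] holds the instant reward r(t) of episode i; the function returns
    # r_i^ins = (1/T) * sum_t r(t) and the running average r_k^avg = (1/k) * sum_i r_i^ins.
    r_ins = np.array([np.mean(episode) for episode in reward_history])
    r_avg = np.cumsum(r_ins) / np.arange(1, len(r_ins) + 1)
    return r_ins, r_avg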
In Fig. 6, we assess the convergence properties of the proposed schemes according to various learning rates. Fig. 6(a) unveils that the performance of the DPG method can be easily affected by \lambda _{p}
. For example, with \lambda _{p}=0.05
, the system can initially obtain a high computation rate at the beginning of the training process, but it gradually converges to a locally optimal policy or even diverges as time goes on. This is because the DPG scheme directly selects actions following the on-policy technique, which may worsen convergence under an improper learning rate. On the other hand, the DQL and DAC schemes usually offer stable convergence during training when the learning rate changes. It can also be seen that although the system using the DAC scheme usually obtains a higher reward than the DPG and DQL schemes, it is nontrivial to properly select both \lambda _{a}
and \lambda _{c}
. Besides, the computational complexity of the DAC scheme is higher than that of the DPG and DQL schemes, as described in Section III. From the simulation results in Fig. 6, we conclude that the learning rate of each proposed DRL-based scheme should be chosen appropriately, being neither too large nor too small. Therefore, with the current setting, the learning rates of the DPG, DQL, and DAC methods are chosen as {\lambda _{p}} = {\lambda _{q}} = {\lambda _{a}} = 0.005
, {\lambda _{c}} = 0.05
.
Next, we compare the performance of the proposed schemes with the benchmarks, i.e., the DDQL, myopic, local computing, and random schemes, in Fig. 7. Under the current setting, it is observed that although the DPG algorithm can attain a slightly higher computation rate than the DQL algorithm, it requires more effort in tuning the parameters to achieve a sub-optimal policy. The reason can be explained as follows. The DQL approach is expected to satisfy the Bellman equation, which maintains the stability of convergence, but it usually does not guarantee finding nearly optimal behavior under the \epsilon -
greedy policy, where the agent frequently selects the stochastic action with the probability of \epsilon _{q}
. Meanwhile, although DPG can learn a good policy, since it operates directly on a policy that estimates the maximal total expected reward, its convergence is quite sensitive to improper parameters. We can also recognize that the DPG and DQL schemes achieve a computation rate almost as high as that of the DDQL. Generally, DDQL effectively manages overestimation during training by separating action selection from value evaluation, which enhances its robustness and reliability in uncertain environments. While the DDQL approach is likely to yield more effective policies than the DPG and DQL schemes, it may incur higher overhead. On the other hand, DAC possesses the advantages of both policy-based and value-based methods, resulting in good convergence to the highest computation rate, because it not only selects actions generated by the actor but also criticizes the performed actions. The performance improvement statistics are shown in Table 2. Columns 2 and 3 indicate the percentage improvement of each proposed scheme compared to the myopic scheme and the local computing scheme, respectively. It is evident that the proposed DAC scheme achieves the highest improvement among the methods, with a 21.32% increase over the myopic scheme and a 161.38% increase over the local computing scheme, highlighting the substantial advantage of the DAC approach. Additionally, these results demonstrate the benefits of using MEC techniques to significantly enhance computational efficiency.
We further study the effect of the mean harvested energy on the computation rate in Fig. 8, where \varepsilon
varies from 7\;{\mu \text {J}}
to 15\;{\mu \text {J}}
in both the with-RIS and without-RIS scenarios. As expected, the system rewards improve greatly as the harvested energy increases, since the SU collects and preserves more renewable energy in its battery to process computational tasks. The results show the effectiveness of the proposed schemes in dealing with the stochastically harvested energy at the SU in each time slot. Furthermore, it is worth noting that the schemes using RIS outperform those without RIS owing to their ability to adjust the channel conditions. To verify the energy-efficient operation of the proposed algorithms, we define energy efficiency as the total computation rate divided by the energy consumption at the SU and plot the results in Fig. 9. The curves indicate that the energy efficiency degrades as the harvested energy increases. This implies that increasing the number of computed bits requires more and more energy, including both the computing energy and the overflowed energy. In particular, overflow occurs more frequently when the SU harvests more energy, since the battery capacity is limited, which degrades the energy efficiency.
Fig. 10 validates the robustness of the proposed algorithms against changes in the false alarm probability, i.e., {P_{f}} \in \left [{{0.1,0.2}}\right]
and battery capacity, i.e., {E_{Ca}} = \left \{{{25,30}}\right \}\mu {\text {J}}
. In this regard, the performance goes down as P_{f}
increases since the SU misses more chances to offload computational data to the SBS. For the case of {E_{Ca}} = 30\;\mu {\text {J}}
, the SU can preserve more energy for its local computing and offloading and thus raises the overall computation rate in the long run, as compared to the case of {E_{Ca}} = 25\;\mu {\text {J}}
.
The performance comparison of our schemes and the myopic scheme with various numbers of RIS elements is shown in Fig. 11. Intuitively, increasing K helps improve the computation rate because a higher channel gain is obtained on the RIS-assisted reflecting link. Finally, we plot the reward versus the number of antennas at the SBS and the detection probability in Fig. 12. It is obvious that increasing M remarkably helps the SU compute its tasks faster. Besides, with a high detection probability, the system detects the actual “free” state of the primary channel frequently and correctly, reducing the number of transmission collisions between the SU and the PT. Hence, the system reward improves when P_{d} increases. From the above simulation results, we can see that the proposed DRL schemes are superior to the myopic scheme because they not only consider the current reward but also estimate the impact of the immediate action on future rewards. Although the DPG and DQL schemes yield slightly poorer performance than the DAC and DDQL schemes, they require lower computational complexity and outperform the other traditional benchmarks. Thus, the trade-off between computational complexity and network performance should be carefully considered when designing DRL frameworks to solve optimization problems with large state and action spaces. Furthermore, compared with other mathematical optimization tools, DRL frameworks remarkably relieve the complexity of the mathematical formulation required to derive optimal solutions.