Introduction
Non-orthogonal multiple access (NOMA) has been proposed as one of the promising multiple access (MA) techniques for next-generation wireless networks. By utilizing superposition coding (SC) at the transmitter and successive interference cancellation (SIC) at the receiver, NOMA offers better spectral and energy efficiencies as well as user fairness compared to conventional orthogonal multiple access (OMA) techniques [1], [2]. This makes NOMA a promising MA candidate for realizing massive connectivity in 6G and beyond by efficiently allocating scarce radio resources. Instead of multiplexing users in time or frequency, NOMA multiplexes them in the power domain. Therefore, NOMA systems generally require efficient and accurate power allocation algorithms to control the interference levels and enable practical implementations of SIC at the receiver.
Multiple antenna communications with their additional degrees of freedom have proven to be an effective interference-mitigation technique. Hence, multiple antenna NOMA systems have been studied extensively and demonstrated significant performance gains over OMA-based multiple antenna systems due to their combined spectral efficiency and interference-suppression capabilities [3], [4], [5]. In [6], Hanif et al. proposed an iterative algorithm for a multiple-input single-output (MISO)-NOMA system with the sum-rate maximization objective. The authors in [7] proposed a semidefinite relaxation (SDR)-based approach for the optimal beamforming design problem in MISO-NOMA systems with the transmit power minimization objective. In addition, the authors in [8] proposed a sequential convex optimization solution for MISO-NOMA systems with the aim of maximizing global energy efficiency.
More recently, intelligent reflecting surface (IRS)-assisted multiple antenna systems have received significant attention from both industry and academia, thanks to the additional link reliability they introduce to conventional wireless communication systems [9]. In IRS-assisted systems, the phase shifts of the passive IRS elements can be programmed to steer the incoming signal towards the desired direction, hence increasing the channel strength between the transmitter and the receiver(s). The IRS-assisted MISO-NOMA system model has recently been the subject of extensive studies aiming to reap the combined benefits of NOMA, multiple antennas, and IRS techniques. In particular, the work in [10] considered the multi-cluster beamforming and IRS phase shifts design for the transmit power minimization objective, while the energy efficiency objective was considered in [11]. Xie et al. proposed a solution for the max-min fairness system objective. However, combining such sophisticated techniques often leads to tractability problems. Hence, model-based approaches typically break down the joint-design optimization problem into several subproblems, each of which is then solved separately in an iterative manner. However, the downside of such approaches is that the overall computational complexity of the proposed solution is often prohibitively high, which severely limits their practical utility, especially for latency-sensitive future wireless networks [12], [13], [14], [15].
Machine learning-based methods have proved to be a viable alternative to model-based solutions for highly complex resource allocation problems in wireless communication systems. The deep learning framework has been applied to channel tracking and estimation, and to beamforming design [16], [17], [18], [19]. However, since supervised deep learning requires labelled data for training, it can only be applied to problems that have already been solved, albeit not on a large scale. Deep reinforcement learning (DRL), which combines deep learning and reinforcement learning (RL) into a single framework, addresses the shortcomings of deep learning as an optimization tool. In RL, an active agent learns how to solve the problem through trial and error without any human supervision, and therefore does not require labels for training and learning [20].
Recently, DRL has been applied to a wide variety of problems in the wireless communications domain. The work in [21] proposed a deep deterministic policy gradient (DDPG)-based design to maximize the sum-rate in cognitive-radio NOMA systems. Meng et al. also applied DDPG to solve the downlink dynamic power control problem for maximizing the system sum-rate. The application of the DRL framework has also been extended to IRS-aided NOMA systems. The work in [22] adopted the zero-forcing beamforming (ZFBF) technique while utilizing a deep Q-network (DQN) agent for optimizing phase shifts of the IRS elements. Xie et al. used DDPG to jointly optimize the beamforming vectors and IRS phase shifts for the sum-rate maximization problem [23]. The work in [24] proposed a multi-agent DRL-based design that jointly optimizes the subcarrier assignment, power allocation, and IRS phase shifts in NOMA-assisted semi-grant-free systems, while the resource allocation problem for NOMA-unmanned aerial vehicle system was considered in [25].
However, there are still practical issues facing the aforementioned works. First, all of these works assume perfect channel state information (CSI) at the base station (BS), which is extremely challenging to obtain in practice. Furthermore, imperfect CSI at the transmitter and the receiver has severe implications in NOMA systems since the receivers rely on SIC to unlock the additional gains of NOMA. In addition, providing performance guarantees under channel uncertainties leads to a more complicated optimization problem that is more challenging to solve in a reasonable time, especially for latency-sensitive applications. Therefore, the performance of DRL-based methods for clustered IRS-assisted MISO-NOMA systems with imperfect CSI and SIC remains an open issue. The second challenge is that most of the literature focuses on a simplified version of the system objective. The work in [22], while considering a cluster-based IRS-assisted MISO-NOMA system, does not take into account cluster power allocation or the quality-of-service (QoS) requirements in the proposed design, both of which have a significant impact on the agent selection and the problem environment design. Furthermore, the DQN agent utilized to solve the problem cannot be applied to problems with large continuous action spaces, as DQN is restricted to discrete action-space problems. The work in [24] uses a DQN agent to solve the discrete channel-assignment problem, while a DDPG agent is utilized to solve the power allocation problem. However, since the BS and the user equipment units (UEs) are assumed to be equipped with a single antenna, no beamforming design is considered. Additionally, while the work in [26] considered a DRL-based approach to solve the sum-rate maximization problem through joint active and passive beamforming design, the number of SIC operations required by the strongest UE grows linearly with the number of UEs in the system, leading to a practically unscalable and highly complex receiver.
Motivated by the impractical assumptions and the lack of a unified and scalable framework in the DRL literature, we propose a DRL-based joint design framework to solve an outage-constrained robust resource allocation problem in an IRS-assisted MISO-NOMA system. In particular, a correlation-based user-pairing algorithm is developed to limit the number of UEs in each cluster, leading to a more scalable implementation of SIC-based receivers. Then, the NOMA principle is applied in each cluster to increase the spectral efficiency of the system. Moreover, the proposed DRL-based design jointly optimizes the cluster and UE power allocation and the IRS phase shifts, while taking into account the outage-constrained QoS requirements. The ergodic sum-rate maximization is used as the objective function for the considered system. In addition, the statistical error model is used to describe the channel uncertainty, which leads to an outage-constrained robust design. Furthermore, the proposed DRL-based design has a much lower deployment computational complexity compared to the conventional optimization methods in the literature, while still achieving competitive performance. To the best of the authors’ knowledge, this is the first work that proposes a framework for clustering and actor-critic-based resource allocation in IRS-assisted MISO-NOMA systems. The contributions of this work are summarized as follows:
By assuming a blocked direct path between the BS and the UEs due to obstacles, the BS communicates with the UEs through the IRS. In addition, the statistical error model is used to express the channel uncertainty in the system. However, the formulated robust design problem with the ergodic sum-rate maximization objective is a mixed-integer optimization problem which is challenging to solve. The user-pairing problem is isolated and solved first to reduce the complexity of the problem. Then, the zero-forcing (ZF) principle is adopted to design the beamforming vectors.
The robust resource allocation problem is still non-convex due to the coupled optimization variables. Therefore, the problem is reformulated into an RL environment. Then, a twin-delayed deep deterministic policy gradient (TD3)-based algorithm is developed to solve the reformulated joint resource allocation problem.
By providing the complexity analysis for the proposed DRL-agent’s architecture, we show that the deployment computational complexity of the proposed algorithm is much less than existing conventional optimization algorithms, which makes the DRL-based design more attractive for latency-stringent applications in future wireless networks.
The competitive performance of the proposed algorithm is illustrated through extensive simulation results for both fixed and dynamic-channel scenarios. Furthermore, the results show that the TD3-based design outperforms existing conventional and other DRL-based benchmark schemes in the literature.
A. Organization
The rest of the paper is organized as follows. Section II presents the system and channel uncertainty models. The joint robust design problem is formulated in Section III. In addition, the user-clustering algorithm is also developed. In Section IV, the problem is reformulated into an RL environment and a TD3-based algorithm is developed to solve the reformulated problem. The simulation results are presented in Section V. Finally, Section VI concludes this work.
B. Notation
Bold lowercase and uppercase letters are used to represent vectors and matrices, respectively, while standard normal letters denote scalar quantities.
System and Channel Uncertainty Models
We consider the downlink of an IRS-assisted MISO-NOMA system in which the BS is equipped with N transmit antennas and serves 2K single-antenna UEs that are grouped into C two-UE clusters, each containing a stronger UE (from the set \mathcal {S}) and a weaker UE (from the set \mathcal {W}), through an IRS with M passive reflecting elements. Since the direct BS-UE links are assumed to be blocked, the BS communicates with the UEs only through the IRS. The received signal at UE i in cluster c is given by \begin{equation*} y_{c,i} = \mathbf {g}_{c,i}^{\text {H}}\mathbf {\Phi G}\sum _{c=1}^{C}\mathbf {w}_{c}x_{c} + z_{c,i}, \forall i\in \{\mathcal {S},\mathcal {W}\}, c\in \mathcal {C}, \tag {1}\end{equation*} where \mathbf {G} is the BS-IRS channel, \mathbf {\Phi } is the diagonal matrix of IRS phase shifts, \mathbf {g}_{c,i} is the IRS-UE channel, \mathbf {w}_{c} and x_{c} are the beamforming vector and the superposed signal of cluster c, respectively, and z_{c,i} is the additive noise.
By defining the effective cascaded channel \mathbf {h}_{c,i} \triangleq \mathbf {g}_{c,i}^{\text {H}}\mathbf {\Phi G}, the received signal in (1) can be written compactly as \begin{equation*} y_{c,i} = \mathbf {h}_{c,i}\sum _{c=1}^{C}\mathbf {w}_{c}x_{c} + z_{c,i}, \forall i\in \{\mathcal {S},\mathcal {W}\}, c\in \mathcal {C}. \tag {2}\end{equation*}
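To make the cascaded-channel structure in (1) and (2) concrete, the following Python sketch computes the effective channel \mathbf {h} = \mathbf {g}^{\text {H}}\mathbf {\Phi G} for a given set of IRS phase shifts; the dimensions and the random channel draws are purely illustrative.

```python
import numpy as np

def effective_channel(g, G, theta):
    """Effective BS-UE channel h = g^H diag(exp(j*theta)) G, as in (1)-(2).
    g : (M,) IRS-UE channel, G : (M, N) BS-IRS channel, theta : (M,) IRS phase shifts."""
    Phi = np.diag(np.exp(1j * theta))
    return g.conj() @ Phi @ G          # length-N effective channel (row vector)

# Illustrative dimensions: M = 32 IRS elements, N = 4 BS antennas.
M, N = 32, 4
g = (np.random.randn(M) + 1j * np.random.randn(M)) / np.sqrt(2)
G = (np.random.randn(M, N) + 1j * np.random.randn(M, N)) / np.sqrt(2)
h = effective_channel(g, G, np.random.uniform(0, 2 * np.pi, M))
```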
A. Channel Uncertainty Model
Due to the random nature of the wireless transmissions, uncertainties in the wireless channel estimation are inevitable. Furthermore, with the introduction of the IRS, accurate channel estimation becomes even more challenging due to the passive elements in the IRS [30], [31]. Channel estimation and quantization errors are two of the main contributors to imperfect channel estimation in wireless communication systems [14], [32]. However, the two are often modelled differently, with quantization errors considered to belong to a norm-bounded region, while channel estimation errors are modelled statistically using unbounded error models [31], [33]. On the other hand, multiple antenna communication systems make use of the beamforming principle to enhance the system performance by exploiting the CSI at the transmitter. However, to achieve the optimal beamforming gains, perfect CSI is required at the transmitter. Unfortunately, perfect CSI is extremely challenging to obtain at the transmitter in practical settings due to the aforementioned channel uncertainties. Therefore, robust design algorithms that take channel imperfections into account are more suitable for studying and analysing the system performance under practical conditions. In this paper, we assume that the channel uncertainties are the result of imperfect channel estimation. Note that in NOMA systems, channel imperfections at the receiver lead to SIC degradation, which is also taken into account. In particular, this paper aims to propose a robust resource allocation strategy that takes into account the imperfect CSI in the system.
The statistical error model has been extensively used to describe distortions in the acquired channel due to thermal noise, estimation errors, and insufficient pilot samples [34], [35], [36]. If the channel statistics are not known, the least square estimator is typically used to estimate the channel coefficients at the receiver. Alternatively, when the channel statistics are available, the linear minimum mean square error estimator is normally used to exploit the additional information and obtain more accurate channel estimates. Therefore, if the noise is assumed to be an additive white Gaussian process, the difference between the estimated and the actual channels can be described statistically [37], [38]. Accordingly, the following error model is considered for the cascaded channel [31]:\begin{equation*} \mathbf {Q}_{c,i} = \hat {\mathbf {Q}}_{c,i} + \Delta \mathbf {Q}_{c,i},\, \forall i\in \{\mathcal {S},\mathcal {W}\}, c\in \mathcal {C}, \tag {3}\end{equation*}
where \hat {\mathbf {Q}}_{c,i} is the estimated cascaded channel and \Delta \mathbf {Q}_{c,i} is the corresponding CSI estimation error, whose variance is given by \begin{equation*} \beta _{c,i}^{2} = \lambda ^{2}\left \|{{\hat {\mathbf {q}}_{c,i}}}\right \|_{2}^{2},\, \forall i\in \{\mathcal {S},\mathcal {W}\}, c\in \mathcal {C}, \tag {4}\end{equation*} where \lambda reflects the relative channel estimation error.
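As a concrete illustration of the error model in (3) and (4), the short sketch below draws a random CSI error for a given estimated cascaded channel; treating \beta ^{2} as a per-entry variance and the value of \lambda are assumptions made for illustration only.

```python
import numpy as np

def sample_csi_error(q_hat, lam, rng=np.random.default_rng(0)):
    """Zero-mean complex Gaussian CSI error with variance
    beta^2 = lam^2 * ||q_hat||_2^2 as in (4) (treated here as a per-entry variance)."""
    beta2 = (lam ** 2) * np.linalg.norm(q_hat) ** 2
    std = np.sqrt(beta2 / 2.0)          # split power between real and imaginary parts
    return rng.normal(0.0, std, q_hat.shape) + 1j * rng.normal(0.0, std, q_hat.shape)

# Actual channel = estimated channel + random error, as in (3).
q_hat = (np.random.randn(4) + 1j * np.random.randn(4)) / np.sqrt(2)
q_true = q_hat + sample_csi_error(q_hat, lam=0.05)
```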
B. SINR and Achievable Rates
The signal-to-interference-plus-noise ratio (SINR) is one of the most widely used metrics for measuring the performance of wireless communication systems. For the considered cluster-based design, the SINR of the stronger UE in the c-th cluster can be defined as \begin{align*} \gamma _{c,s}=& \frac {\left |{{\mathbf {h}_{c,s}\mathbf {w}_{c}}}\right |^{2}P_{c}\alpha _{c,s}}{\left |{{\left ({{\mathbf {v}^{\text {H}}\Delta \mathbf {Q}_{c,s}}}\right )\mathbf {w}_{c}}}\right |^{2}P_{c}\alpha _{c,w}+\sum _{\substack {k=1 \\ k\neq c}}^{C} \left |{{\mathbf {h}_{c,s}\mathbf {w}_{k}}}\right |^{2}P_{k} + \sigma _{c,s}^{2}}, \\& \qquad \forall s\in \{\mathcal {S}\}, c\in \mathcal {C}, \tag {5}\end{align*}
Following the NOMA decoding order, the SINR at the weaker UE when decoding its own message is \begin{align*} \gamma _{c,w}^{c,w}=& \frac {\left |{{\mathbf {h}_{c,w}\mathbf {w}_{c}}}\right |^{2}P_{c}\alpha _{c,w}}{\left |{{\mathbf {h}_{c,w}\mathbf {w}_{c}}}\right |^{2}P_{c}\alpha _{c,s}+\sum _{\substack {k=1 \\ k\neq c}}^{C} \left |{{\mathbf {h}_{c,w}\mathbf {w}_{k}}}\right |^{2}P_{k} + \sigma _{c,w}^{2}}, \\& \qquad \forall w\in \{\mathcal {W}\}, c\in \mathcal {C}. \tag {6}\end{align*}
Similarly, the SINR of the weaker UE's message when it is decoded at the stronger UE during SIC is \begin{align*} \gamma _{c,w}^{c,s}=& \frac {\left |{{\mathbf {h}_{c,s}\mathbf {w}_{c}}}\right |^{2}P_{c}\alpha _{c,w}}{\left |{{\mathbf {h}_{c,s}\mathbf {w}_{c}}}\right |^{2}P_{c}\alpha _{c,s}+\sum _{\substack {k=1 \\ k\neq c}}^{C} \left |{{\mathbf {h}_{c,s}\mathbf {w}_{k}}}\right |^{2}P_{k} + \sigma _{c,s}^{2}}, \\& c\in \mathcal {C}. \tag {7}\end{align*}
Since the weaker UE's message must be decodable at both UEs in the cluster, its effective SINR is \begin{equation*} \gamma _{c,w} = \min \left ({\gamma _{c,w}^{c,s},\gamma _{c,w}^{c,w}}\right ), \, c\in \mathcal {C}. \tag {8}\end{equation*}
Accordingly, the achievable rates of the stronger and the weaker UEs in the c-th cluster are \begin{align*} R_{c,s}=& \log _{2}\left ({{1+\gamma _{c,s}}}\right ), \\ R_{c,w}=& \log _{2}\left ({{1+\gamma _{c,w}}}\right ),\forall s\in \{\mathcal {S}\}, w\in \{\mathcal {W}\}, c\in \mathcal {C}. \tag {9}\end{align*}
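For clarity, a minimal Python sketch of the per-cluster rate computation in (5)-(9) is given below; the variable names are illustrative, and the residual SIC error term of (5) is passed in as a single precomputed scalar.

```python
import numpy as np

def cluster_rates(h_s, h_w, W, c, P, alpha_w, sic_residual, noise=1.0):
    """Achievable rates of the stronger/weaker UE in cluster c per (5)-(9).
    h_s, h_w : effective channels (length-N arrays) of the stronger and weaker UE
    W        : N x C beamforming matrix (one column per cluster), P : cluster powers
    sic_residual : the |v^H dQ w_c|^2 term of (5) modelling imperfect SIC."""
    C = W.shape[1]
    alpha_s = 1.0 - alpha_w
    inter = lambda h: sum(abs(h @ W[:, k]) ** 2 * P[k] for k in range(C) if k != c)

    sig_s = abs(h_s @ W[:, c]) ** 2 * P[c]
    sig_w = abs(h_w @ W[:, c]) ** 2 * P[c]

    gamma_s = sig_s * alpha_s / (sic_residual * P[c] * alpha_w + inter(h_s) + noise)  # (5)
    gamma_w_own = sig_w * alpha_w / (sig_w * alpha_s + inter(h_w) + noise)            # (6)
    gamma_w_sic = sig_s * alpha_w / (sig_s * alpha_s + inter(h_s) + noise)            # (7)
    gamma_w = min(gamma_w_sic, gamma_w_own)                                           # (8)
    return np.log2(1 + gamma_s), np.log2(1 + gamma_w)                                 # (9)
```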
In the next section, the problem formulation of the robust design for the considered system is provided with details.
Problem Formulation
The aim of this work is to propose a joint robust design framework for long-term, performance-based resource allocation in IRS-assisted MISO-NOMA systems. In particular, we consider the objective of maximizing the ergodic system sum-rate under channel uncertainties while taking into account the dynamics of the system over multiple time-slots [3], [21], [39]. Therefore, the long-term outage-constrained joint robust design problem with the sum-rate maximization objective can be formulated as \begin{align*}&\max _{\mathbf {w}_{c},\mathbf {v},P_{c},\alpha _{c,i},b_{s,w}} \mathbb {E}\left \{\sum _{t=1}^{\infty }\delta ^{t-1}\sum _{c=1}^{C} \left [R_{c,s}^{t}+R_{c,w}^{t}\right ]b_{s,w}^{t}\right \} \tag {10a}\\&\;\text {s.t.}\; p_{i}\triangleq \Pr \left \{\gamma _{c,i}\geq 2^{R_{c,i}^{min}}-1\right \}\geq \Gamma , \forall i\in \{\mathcal {S,W}\}, c\in \mathcal {C}, \tag {10b}\\&\qquad ||\mathbf {w}_{c}||_{2}^{2}=1,\,c\in \mathcal {C}, \tag {10c}\\&\qquad \sum _{c=1}^{C}P_{c}\leq P_{max}, \tag {10d}\\&\qquad \alpha _{c,s}^{t}+\alpha _{c,w}^{t}=1,\,c\in \mathcal {C},\,s\in \mathcal {S},\,w\in \mathcal {W}, \tag {10e}\\&\qquad \sum _{c=1}^{C}b_{s,w}^{t}\leq 1,\,b_{s,w}^{t}\in \{0,1\},\,c\in \mathcal {C}, \tag {10f}\\&\qquad |\mathbf {v}_{m}|^{2}=1,\,0\leq \theta _{m}\leq 2\pi ,\, m=1,\ldots , M, \tag {10g}\end{align*} where \delta is the discount factor, \Gamma is the minimum required non-outage probability, R_{c,i}^{min} is the minimum rate requirement of UE i in cluster c, P_{max} is the total transmit power budget at the BS, and b_{s,w}^{t} is the binary variable that pairs the stronger UE s with the weaker UE w.
The joint design problem in (10) is challenging to solve due to the following reasons:
The objective function is not jointly convex in terms of the optimization variables.
The expectation operator prevents defining a closed-form expression for the objective function in (10a) since approximation methods cannot be directly applied.
The outage constraints in (10b) do not admit closed-form solutions [34].
The UE pairing variable in (10f) is restricted to a binary set, resulting in a mixed-integer optimization problem.
A. User Pairing
UE pairing is considered one of the enabling techniques in multi-user NOMA systems for future wireless networks [27], [28], [43]. In addition, it has been shown that pairing a stronger UE with a weaker UE leads to enhanced overall performance in NOMA systems [44], [45]. Hence, there are two design criteria for UE pair selection that directly affect the system sum-rate performance in NOMA networks: the correlation and the channel-gain difference between the paired UEs in a cluster [22], [46]. Since each cluster is served with a single beam, a higher UE correlation within the cluster translates to a lower level of intra-cluster interference experienced by the weaker UE, while a sufficient channel-gain difference ensures smooth SIC operation at the stronger UE. However, since the IRS phase shifts are designed at the BS, the phase shifts can be tuned to adjust the channel-gain differences after the cluster design. Therefore, the proposed algorithm is solely based on the initial correlation between the UEs.
The basic premise of the proposed successive UE pairing algorithm (SUPA) is to pair each UE in the stronger set \mathcal {S} with the UE in the weaker set \mathcal {W} that exhibits the highest channel correlation with it, where the correlation coefficient between UE i and UE j is defined as \begin{equation*} \epsilon _{i,j}=\frac {\left |{{\hat {\mathbf {h}}_{i}\hat {\mathbf {h}}_{j}^{\text {H}}}}\right |}{\left \|{{\hat {\mathbf {h}}_{i}}}\right \|_{2}\left \|{{\hat {\mathbf {h}}_{j}}}\right \|_{2}}, \forall i\in \mathcal {S}, \forall j\in \mathcal {W}, \tag {11}\end{equation*} where \hat {\mathbf {h}}_{i} and \hat {\mathbf {h}}_{j} denote the estimated effective channels of UE i and UE j, respectively. The SUPA steps are summarized in Algorithm 1 and illustrated by the code sketch that follows it.
Algorithm 1 Successive User Pairing Algorithm
Initialise: the UE sets \mathcal {S} (stronger UEs) and \mathcal {W} (weaker UEs)
Calculate the final estimated channels at the BS using the initial IRS phase shifts
Sort all UEs in descending order of their estimated channel gains
for each UE i \in \mathcal {S} do
for each UE j \in \mathcal {W} do
Calculate the correlation coefficient between UE i and UE j according to (11)
end for
Find the UE j^{*} \in \mathcal {W} with the highest correlation coefficient \epsilon _{i,j}
Assign UE j^{*} to the same cluster as UE i
Set \mathcal {W} = \mathcal {W}\setminus \{j^{*}\}
end for
Output: {the C UE pairs (clusters)}
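A Python sketch of the SUPA procedure in Algorithm 1 is shown below, assuming the estimated effective channels are available as the rows of a matrix; the helper names and dimensions are illustrative.

```python
import numpy as np

def supa(H_hat):
    """Successive user pairing based on the correlation coefficient (11).
    H_hat : 2K x N matrix of estimated effective channels (one row per UE).
    Returns a list of (stronger_UE, weaker_UE) index pairs."""
    gains = np.linalg.norm(H_hat, axis=1)
    order = np.argsort(gains)[::-1]                   # sort UEs by channel gain
    half = H_hat.shape[0] // 2
    strong, weak = list(order[:half]), list(order[half:])

    pairs = []
    for i in strong:                                  # pair each stronger UE in turn
        corr = [abs(H_hat[i] @ H_hat[j].conj()) / (gains[i] * gains[j]) for j in weak]
        j_best = weak.pop(int(np.argmax(corr)))       # most correlated weaker UE
        pairs.append((int(i), int(j_best)))
    return pairs

# Example with 4 UEs and 8 BS antennas (illustrative values).
H_hat = (np.random.randn(4, 8) + 1j * np.random.randn(4, 8)) / np.sqrt(2)
print(supa(H_hat))
```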
RL Framework for Robust Resource Allocation
With given UE pairs using Algorithm 1, the remaining resource allocation problem is expressed as \begin{align*}&\max _{\mathbf {w}_{c},\mathbf {v},P_{c},\alpha _{c,i}} \mathbb {E}\left \{\sum _{t=1}^{\infty }\delta ^{t-1}\sum _{c=1}^{C} \left [R_{c,s}^{t}+R_{c,w}^{t}\right ]\right \} \tag {12a}\\&\;\text {s.t.}\; (10{\mathrm {b}}), (10{\mathrm {c}}), (10{\mathrm {d}}), (10{\mathrm {e}}), (10{\mathrm {g}}). \tag {12b}\end{align*}
A. The Zero-Forcing Beamforming
The ZFBF is a low-complexity technique in which the channel knowledge at the transmitter is exploited to design the beamforming vectors. More importantly, under the perfect CSI assumption, the ZFBF provides a closed-form solution to the beamforming design problem with a reasonable trade-off between complexity and performance [48]. In addition, the ZFBF has been extensively used in the literature as one of the beamforming designs for sum-rate maximization [46], [48], [49]. The basic principle behind the ZFBF is to design the beamforming vectors such that each beam nulls the interference towards the unintended UEs, i.e., \begin{align*} \frac {\mathbf {h}_{i}}{||\mathbf {h}_{i}||_{2}}\mathbf {w}_{k}=\begin{cases} 1 & \text {if {$k=i$}} \\ 0 & \text {if {$k\neq i$}}. \end{cases} \tag {13}\end{align*}
Under imperfect CSI, however, only the estimated channels are available at the BS and the ZF condition becomes \begin{align*} \frac {\hat {\mathbf {h}}_{i}}{||\hat {\mathbf {h}}_{i}||_{2}}\mathbf {w}_{k}=\begin{cases} 1 & \text {if {$k=i$}} \\ \gt 0 & \text {if {$k\neq i$}}, \end{cases} \tag {14}\end{align*} so that perfect interference nulling can no longer be guaranteed.
Accordingly, the ZF beamforming matrix is obtained as \begin{equation*} \mathbf {W} = \left ({{\hat {\mathbf {H}}}}\right )^{\dagger }, \tag {15}\end{equation*} where (\cdot )^{\dagger } denotes the pseudo-inverse operation and \hat {\mathbf {H}} is the matrix that stacks the estimated effective channels.
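A minimal sketch of the ZFBF computation in (15) is given below; it assumes that one estimated channel per cluster is stacked row-wise in \hat {\mathbf {H}}, and the columns are normalized afterwards so that the unit-norm constraint (10c) holds.

```python
import numpy as np

def zf_beamforming(H_hat):
    """ZF beamforming per (15): W = pinv(H_hat), with unit-norm columns.
    H_hat : C x N matrix of estimated per-cluster channels."""
    W = np.linalg.pinv(H_hat)                              # N x C pseudo-inverse
    return W / np.linalg.norm(W, axis=0, keepdims=True)    # enforce ||w_c||_2 = 1
```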
Therefore, in this work, the robust resource allocation is realized through the accurate and joint optimization of the IRS phase shifts and the cluster and UE power allocation, as explained in the next section.
B. Problem Reformulation
By tackling the UE pairing and beamforming design problems, the robust resource allocation problem is reduced to the following optimization problem \begin{align*}&\max _{\mathbf {v},P_{c},\alpha _{c,i}} \mathbb {E}\left \{\sum _{t=1}^{\infty }\delta ^{t-1}\sum _{c=1}^{C} \left [R_{c,s}^{t}+R_{c,w}^{t}\right ]\right \} \tag {16a}\\&\;\text {s.t.}\; (10{\mathrm {b}}), (10{\mathrm {d}}), (10{\mathrm {e}}), (10{\mathrm {g}}). \tag {16b}\end{align*}
It is well-known that optimizing a system objective under an uncertain or stochastic environment can be modelled as a Markov decision process (MDP) [50]. The RL framework is one of the most effective methods to solve the control problem in MDPs, especially in model-free systems where the transition probability between the states is unknown [51]. The RL framework consists of two entities: the agent, which is the active entity that takes actions, and the environment, which encloses everything else except the agent. At time-step t, given a state \mathbf {s}^{t}, the agent takes an action \mathbf {a}^{t}, receives a reward r^{t} from the environment, and observes the next state \mathbf {s}^{t+1}.
To reformulate the robust design problem into an RL environment, the state, action and reward entities must be clearly defined.
The action space \mathbf {a}^{t}: Since the value of the objective is a function of the optimization variables, they are intuitively selected as the action space of the RL environment. In particular, the action vector at time-step t is expressed as \begin{equation*} \mathbf {a}^{t}=\left [P_{1}^{t},\ldots , P_{C}^{t},\alpha _{1,w}^{t},\ldots , \alpha _{C,w}^{t},\mathbf {v}^{t}\right ]^{\text {T}}. \tag {17}\end{equation*} Note that since \alpha _{c,s}^{t}=1-\alpha _{c,w}^{t},\forall c\in \mathcal {C}, only the power allocation coefficients of the weaker UEs are included in the action vector. Furthermore, since we will be using a deep neural network (DNN) architecture that is only compatible with real numbers, complex vectors are represented using real values in this paper. In particular, and without loss of generality, the IRS vector \mathbf {v}\in \mathbb {C}^{M\times 1} is represented as \mathbf {v}\in \mathbb {R}^{2M\times 1}, where Re\{\mathbf {v}\}\in \mathbb {R}^{M\times 1} and Im\{\mathbf {v}\}\in \mathbb {R}^{M\times 1} are its real and imaginary parts, respectively [19]. Therefore, \mathbf {a}^{t}\in \mathbb {R}^{(2K+2M)\times 1} is a vector with only real values.
The state space \mathbf {s}^{t}: To ensure that the state space of the environment includes the necessary information from the original robust design problem, we include the previous action as part of the state vector. Furthermore, since the correlation coefficient between the paired UEs is affected by the IRS phase shifts as highlighted by (12), the correlation coefficients vector is also included in the state space. Additionally, the channel gain difference between each UE pair, defined as the dB ratio between the two channel gains, is included in the state vector and can be expressed as \begin{equation*} \rho _{i,j} = 10\,\log _{10}\left (\frac {||\hat {\mathbf {h}}_{i}||_{2}}{||\hat {\mathbf {h}}_{j}||_{2}}\right ),\forall i\in \mathcal {S},j\in \mathcal {W}. \tag {18}\end{equation*} Finally, to help the agent evaluate itself during training, the achieved rates of the previous time-step are also taken into account as part of the state space. Therefore, the state vector is expressed as \begin{align*} \mathbf {s}^{t}=& \left [\mathbf {a}^{t-1},\epsilon _{1}^{t-1},\ldots , \epsilon _{C}^{t-1},\rho _{1}^{t-1},\ldots , \rho _{C}^{t-1},\right . \\& \qquad \left .R_{1,s}^{t-1},\ldots , R_{C,w}^{t-1}\right ]^{\text {T}}, \tag {19}\end{align*} where \mathbf {s}^{t}\in \mathbb {R}^{(6K+2M)\times 1}. Furthermore, when training for the dynamic-channels environment, the variances of the estimated channels are also included as part of the state space. Therefore, the state vector for the dynamic-channels case is expressed as \begin{align*} \mathbf {s}^{t}_{\text {dyn}}=& \left [\beta ^{2}_{1,s},\ldots , \beta ^{2}_{C,w},\mathbf {a}^{t-1},\epsilon _{1}^{t-1},\ldots , \epsilon _{C}^{t-1},\right . \\& \qquad \left .\rho _{1}^{t-1},\ldots , \rho _{C}^{t-1},R_{1,s}^{t-1},\ldots , R_{C,w}^{t-1}\right ]^{\text {T}}, \tag {20}\end{align*} where \mathbf {s}^{t}_{\text {dyn}}\in \mathbb {R}^{(8K+2M)\times 1}. Note that since the variance of the estimated channel is closely related to the estimation error according to (4), including this information in the state space helps the agent in forming a more robust policy under the dynamic-channels environment.
The reward function r^{t}: Defining an appropriate reward function is crucial in the RL framework as it is the only feedback that indicates the utility of the actions taken by the agent at any time-step t during training. Since the objective in the original robust design problem (10a) is to maximize the long-term system sum-rate, the system sum-rate at time-step t is selected as the reward. In addition, the sum of the correlation coefficients and the channel gain ratios are added to the system sum-rate to incentivise the agent to increase the correlation and the channel gain difference between the stronger and the weaker UEs in each cluster. Therefore, the reward function is expressed as \begin{equation*} r^{t} = \sum _{c=1}^{C}\left (R_{c,s}^{t}+R_{c,w}^{t}\right )+\sum _{c=1}^{C}\epsilon _{c}^{t}+\sum _{c=1}^{C}\rho _{c}^{t}. \tag {21}\end{equation*} Furthermore, to discourage the agent from taking actions that do not satisfy the QoS constraints, the following reward function is used to punish the agent:\begin{equation*} r^{t} = \sum _{k=1}^{2K}\min \left (R_{k}^{t}-R_{k}^{min},0\right ), \tag {22}\end{equation*} where R_{k}^{min} is the minimum required rate of UE k, so that r^{t}\lt 0 always holds in (22) whenever a QoS constraint is violated. Therefore, after each action taken by the agent, the environment uses the positive reward function in (21) if the action satisfies the QoS constraints; otherwise, it uses the negative reward function in (22). The details of how the reward function is utilized by the agent during training are discussed in the agent's architecture section.
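Putting the above definitions together, a minimal sketch of how an environment step could evaluate an action is given below: the real-valued action of (17) is split back into powers, power-allocation coefficients and a unit-modulus IRS vector, and the switching reward of (21)-(22) is applied. The rate computation itself is assumed to be provided elsewhere (e.g., by a routine such as the cluster_rates sketch given earlier).

```python
import numpy as np

def unpack_action(a, C, M):
    """Split the real action vector (17) into cluster powers, weak-UE power
    coefficients and the IRS vector, projected onto |v_m| = 1 as in (10g)."""
    P, alpha_w = a[:C], a[C:2 * C]
    v = a[2 * C:2 * C + M] + 1j * a[2 * C + M:]
    return P, alpha_w, v / np.maximum(np.abs(v), 1e-12)   # illustrative projection

def reward(rates, min_rates, corr, gain_ratio_db):
    """Positive reward (21) when every UE meets its rate target, otherwise the
    penalty (22) based on the per-UE rate shortfalls."""
    if np.all(rates >= min_rates):
        return float(rates.sum() + corr.sum() + gain_ratio_db.sum())   # (21)
    return float(np.minimum(rates - min_rates, 0.0).sum())             # (22)
```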


C. The Robust TD3-Based Algorithm
RL agents such as Q-learning and state-action-reward-state-action (SARSA) are called tabular methods because they use tables to keep track of the Q-values for each state-action pair [52], [53]. However, since these agents can only handle discrete state and action spaces, their practical utility is severely limited, as most practical problems have continuous state and action spaces.
Actor-critic agents which are state-of-the-art in DRL can handle continuous action and state spaces, and therefore, eliminate the tabular requirement which restricted the earlier RL agents. Consequently, actor-critic DRL agents have been applied to a much wider set of problems in the wireless communications domain [20].
In this paper, the proposed robust resource allocation framework is developed based on the TD3 agent [54]. The TD3 agent is an off-policy actor-critic DRL agent which optimizes a deterministic policy. To address the policy break issue in the baseline DDPG agent [55], the TD3 agent uses two critics instead of one, among other enhancements. Furthermore, off-policy agents are more sample-efficient than their on-policy counterparts thanks to the replay buffer, which stores and reuses past experiences.
The TD3 agent consists of two main parts: the actor (policy) DNN and the critic DNNs. As the name implies, the actor DNN, denoted by \mu (\cdot ) and parameterized by \psi , maps a given state to the action that is expected to maximize the long-term reward, i.e., \begin{equation*} \mu \left (\mathbf {s}\right ) = \mathbf {a}^{*}, \tag {25}\end{equation*}
while each of the two critic DNNs, parameterized by \phi _{i}, estimates the optimal action-value of a given state-action pair, i.e., \begin{equation*} \phi _{i}\left (\mathbf {s},\mathbf {a}\right ) = Q^{*},\,i=1,2. \tag {26}\end{equation*}
Similar to the DQN and the DDPG agents, the TD3 agent uses target networks to generate the training targets. Target networks are delayed copies of the actor's and the critics' DNNs. Furthermore, the TD3 agent also utilizes a replay buffer which stores past experiences to further stabilise the learning process. In particular, the training target for the critics is computed as \begin{equation*} \zeta \left (r^{t},\mathbf {s}^{t}\right )=r^{t}+\delta \min _{i=1,2}\phi ^{\prime }_{i}\left (\mathbf {s}^{t+1},\mu ^{\prime }\left (\mathbf {s}^{t+1}\right )\right ), \tag {27}\end{equation*} where \mu ^{\prime } and \phi ^{\prime }_{i} denote the target actor and target critic networks, respectively, and \delta is the discount factor.
The critics are then trained by minimizing the loss \begin{align*} L\left (\phi _{i},\mathcal {B}\right )=& \underset {\left \{\mathbf {s}^{t},\mathbf {a}^{t},r^{t},\mathbf {s}^{t+1}\right \}\sim \mathcal {B}}{\mathbb {E}}\left [\left (Q\left (\mathbf {s}^{t},\mathbf {a}^{t};\phi _{i}\right )-\zeta \left (r^{t},\mathbf {s}^{t}\right )\right )^{2}\right ], \\& \qquad i=1,2, \tag {28}\end{align*} over mini-batches of experiences sampled from the replay buffer \mathcal {B}.
The actor's parameters \psi are then updated, with a delay relative to the critics, by maximizing the critic's estimate of the value of the actor's actions, i.e., \begin{equation*} \underset {\psi }{\max } \underset {\mathbf {s}^{t}\sim \mathcal {B}}{\mathbb {E}}\left [{\mathcal {Q}}_{\phi }\left ({{s,\mu (s)}}\right )\right ]. \tag {29}\end{equation*}
Finally, the target networks are softly updated as \begin{align*} \phi ^{\prime }_{i}=& \kappa \phi _{i}+\left (1-\kappa \right )\phi ^{\prime }_{i}, \, i=1,2, \\ \psi ^{\prime }=& \kappa \psi +\left (1-\kappa \right )\psi ^{\prime }, \tag {30}\end{align*} where \kappa is the target-update (smoothing) factor.
During training, exploration noise \mathbf {n} is added to the actor's output and the resulting action is clipped to the valid action range, i.e., \begin{equation*} \mathbf {a}^{t}=\text {clip}\left ({{\mu \left ({{\mathbf {s}^{t}}}\right )+\mathbf {n},a_{high},a_{low}}}\right ), \tag {31}\end{equation*} where a_{low} and a_{high} denote the lower and upper bounds of the action space.
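For completeness, a compact PyTorch-style sketch of the TD3 updates in (27)-(30) is given below; the actor/critic networks, their target copies, the optimizers and the sampled batch are assumed to exist, the target-policy smoothing noise is omitted for brevity, and all hyperparameter names are illustrative.

```python
import torch
import torch.nn.functional as F

def td3_update(batch, actor, actor_t, critics, critics_t,
               opt_actor, opt_critics, delta=0.99, kappa=0.005,
               policy_delay=2, step=0):
    """One TD3 training step: critic target (27), critic loss (28),
    delayed actor update (29) and soft target updates (30)."""
    s, a, r, s_next = batch

    with torch.no_grad():
        a_next = actor_t(s_next)
        q_next = torch.min(critics_t[0](s_next, a_next), critics_t[1](s_next, a_next))
        target = r + delta * q_next                                   # (27)

    critic_loss = sum(F.mse_loss(c(s, a), target) for c in critics)   # (28)
    opt_critics.zero_grad(); critic_loss.backward(); opt_critics.step()

    if step % policy_delay == 0:                                      # delayed update
        actor_loss = -critics[0](s, actor(s)).mean()                  # maximize Q, cf. (29)
        opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()

        for net, net_t in zip([*critics, actor], [*critics_t, actor_t]):
            for p, p_t in zip(net.parameters(), net_t.parameters()):
                p_t.data.mul_(1 - kappa).add_(kappa * p.data)         # (30)
```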
So far, we have discussed the problem reformulation into an RL environment and explained the inner workings of the TD3 agent. Hence, the developed TD3-based algorithm for robust resource allocation is explained in Algorithm 2.
Algorithm 2 TD3-Based Robust Resource Allocation
Initialise: agent’s hyperparameters
Set
while
Obtain the estimated channels for all UEs,
Execute algorithm 1 to obtain the UE pairs.
Calculate the ZFBF matrix W according to (15)
Obtain the channel error samples
while the maximum number of time-steps per episode has not been reached do
Get the actions vector from the agent according to (31)
Add the random channel error terms according to (3) to create the final true channels
Calculate the achieved rates for all UEs according to (9)
if the minimum-rate (QoS) requirements are satisfied for all UEs then
Use the reward function in (21)
else
Use the reward function in (22)
end if
Obtain the next state and store the experience tuple in the replay buffer
Sample a batch of experiences from the replay buffer
Calculate the targets for the sampled experiences according to (27)
Train the two critics using (28)
if it is time for the delayed actor update then
Train the actor network using (29)
end if
Update the target networks using (30)
Set t = t + 1
end while
end while
Output: the trained TD3 agent (actor network)
To show how the proposed algorithm is implemented after training, Figure 3 illustrates the integration of the trained TD3 model into the BS of the considered IRS-assisted MISO-NOMA system.
The implementation of the proposed algorithm within the BS of the considered IRS-assisted MISO-NOMA system.
Note that, unlike conventional optimization algorithms, we do not explicitly consider the outage probability during the training and learning stage of the TD3-based robust design; however, it is included implicitly through the random errors, as explained in Algorithm 2. The first motivation for the proposed approach is that, since the TD3 agent is initialized with a random policy, basing the reward function on the non-outage probability leads to extremely sparse rewards in the initial training steps, which eventually leads to divergence. The other motivation is that, by basing the reward function on the true achieved rates, the agent always aims for a non-outage probability of 1, which leads to an inherently robust policy. Therefore, the implications of the outage constraints are included implicitly in Algorithm 2. Hence, the non-outage probability of the agent's policy is hyperparameterized in the proposed design. Consequently, the robustness of the agent's policy is a function of the hyperparameters of the TD3 agent.
Note that even though the agent is rewarded by the achieved true sum-rates, this does not imply that the agent has access to the true channels. In particular, since the reward is determined by the environment in the RL framework and the UEs are part of the environment, the true channels are still unknown to the agent.
D. Complexity Analysis
In this section, we provide the computational complexity for the developed TD3-based algorithm. In particular, since DRL agents are only trained once, we assume that the offline training complexity can be afforded [19]. Hence, we focus on analysing the online or inference complexity during deployment.
The big O notation is used to express the deployment complexity of the proposed design. Since deployment only requires a single forward pass through the trained actor network per decision, the inference complexity is dominated by the matrix-vector products between consecutive layers, and therefore scales with the number of layers and the number of neurons in each layer.
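To make the scaling explicit, the short sketch below counts the multiply-accumulate (MAC) operations of a single forward pass through a fully connected actor network; the layer widths used in the example are illustrative values only.

```python
def forward_pass_macs(layer_widths):
    """MAC operations of one inference pass through a fully connected network."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_widths[:-1], layer_widths[1:]))

# Example: input of size 6K + 2M, two hidden layers of 256 neurons, output of size
# 2K + 2M, with K = 2 clusters and M = 32 IRS elements (illustrative values).
print(forward_pass_macs([6 * 2 + 2 * 32, 256, 256, 2 * 2 + 2 * 32]))
```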
In order to compare the analytical complexity of the proposed TD3-based algorithm to existing convex optimization algorithms, we briefly review three widely adopted conventional optimization approaches for solving the static version of the considered optimization problem. In [10], a SOCP-ADMM-based algorithm was developed to iteratively solve the transmit power minimization problem. The derived algorithm has a worst-case complexity of
While both algorithms provide solid performance and interesting results, it is obvious that they do not scale well in practical scenarios, let alone latency-sensitive applications. Furthermore, the aforementioned algorithms are derived under the assumption that the global CSI is available system-wide, and therefore, cannot be directly extended to the robust design case. On the other hand, the proposed TD3-based algorithm can be utilized to generate competitive and robust joint solutions while keeping the complexity to a minimum. Note that in this paper, we assume that the SUPA is executed in the higher layers which are more latency-tolerant compared to the physical layer. Nevertheless, it is straightforward to conclude that the worst-case run-time for the SUPA is
Training, Simulation and Numerical Results
In this section, we provide the details of the TD3 agent’s structure, hyperparameters and training. In addition, the system parameters and the simulation results for both the fixed and the dynamic-channel cases are presented.
A. Agent Structure and Hyperparameters
The developed TD3 agent consists of one actor and two critic networks. Note that the two critic networks are identical in terms of architecture but are initialized randomly. The DNN structures for both the actor and the critic networks are illustrated in Figures 4 and 5, respectively. For the actor's DNN, the rectified linear unit (ReLU) activation function is used in the hidden layers.
Since the number of neurons is the dominant factor that determines the learning capability of a DNN with a fixed number of layers, and consequently, the developed TD3 agent [57], we use two different neuron values for each channel case. In particular, for the fixed-channels case, we generate one set of simulation results for a TD3 agent configured with 128 neurons in each hidden layer, and another set for the same agent configured with 256 neurons in each hidden layer. Similarly, the same process is replicated for the dynamic-channels case with 256 and 512 neurons for each set of simulation results.
B. System Parameters
We consider a downlink transmission for a clustered and IRS-assisted MISO-NOMA system that is identical to the one illustrated in Figure 1. In addition, the channel between the BS and the IRS is assumed to have both line-of-sight (LoS) and non-LoS components, and is therefore modelled using Rician fading. In particular, the BS-IRS link is expressed as \begin{equation*} \mathbf {G}=\frac {1}{\sqrt {d_{irs}^{\iota _{b\rightarrow irs}}}}\left ({{\sqrt {\frac {L}{1+L}}\mathbf {G}_{LoS}+\sqrt {\frac {1}{1+L}}\mathbf {G}_{nLoS}}}\right ), \tag {32}\end{equation*} where L is the Rician factor, d_{irs} is the BS-IRS distance, \iota _{b\rightarrow irs} is the corresponding path-loss exponent, and \mathbf {G}_{LoS} and \mathbf {G}_{nLoS} denote the LoS and non-LoS components, respectively.
Similarly, the channel between the IRS and the k-th UE is expressed as \begin{equation*} \mathbf {g}_{k}=\frac {\tilde {g}}{\sqrt {d_{k}^{\iota _{irs\rightarrow u}}}}, k=1,\ldots , 2K, \tag {33}\end{equation*} where \tilde {g} denotes the small-scale fading component, d_{k} is the IRS-UE distance, and \iota _{irs\rightarrow u} is the corresponding path-loss exponent.
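As an illustration of how the fading models above could be generated in simulation, the sketch below draws one realization of the Rician BS-IRS channel in (32); the all-ones LoS component and the parameter values are simplifying assumptions made for illustration.

```python
import numpy as np

def bs_irs_channel(M, N, d, L, exp_bs_irs, rng=np.random.default_rng(0)):
    """One realization of the Rician BS-IRS channel per (32)."""
    G_los = np.ones((M, N), dtype=complex)            # simplified LoS component
    G_nlos = (rng.standard_normal((M, N)) + 1j * rng.standard_normal((M, N))) / np.sqrt(2)
    path = 1.0 / np.sqrt(d ** exp_bs_irs)
    return path * (np.sqrt(L / (1 + L)) * G_los + np.sqrt(1 / (1 + L)) * G_nlos)

G = bs_irs_channel(M=32, N=4, d=50.0, L=3.0, exp_bs_irs=2.2)   # illustrative parameters
```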
To compare the performance of the proposed algorithm to existing algorithms in the literature, we use the following benchmark schemes:
Baseline 1: a DDPG agent which has been one of the most widely adopted DRL agents in the literature. This benchmark scheme is included to provide a baseline for convergence and policy robustness testing.
Baseline 2: a convex optimization-based scheme representing the conventional optimization approach, where the IRS optimization subproblem is solved using SDP [42] and the non-robust ZFBF with fixed power allocation is used for the beamforming design.
Baseline 3: a random algorithm which has an almost negligible complexity is used to benchmark the quality of the policy derived by the proposed agent. In this benchmark, all of the design variables are randomly selected.
C. Fixed-Channels Case
To evaluate the performance of the proposed algorithm against channel errors, we first consider the case where the channels are fixed throughout the training process. However, a new set of errors is introduced in each training episode. Furthermore, the UEs are assumed to be uniformly distributed in the fixed-channels case.
The convergence plot is a useful measure that indicates the quality of the derived policy by the agent. Figure 6 illustrates the convergence of the TD3 and DDPG agents. With two clusters (i.e.,
In the two extreme cases, however, the average reward sustained by the TD3 agent is significantly higher than that of the baseline DDPG agent. Moreover, the TD3-256 agent shows more stable and consistent convergence compared to both TD3-128 and DDPG. In order to show the implications of converging to a higher-reward policy, the achieved system sum-rates for the trained TD3 agent are shown in Figure 7. The rates provided represent the average system sum-rate over 1000 testing episodes.
The average system sum-rates for the fixed-channels case with various number of UEs.
The TD3 agent outperforms the benchmark schemes for both
Note that Figure 7 only shows partial information about the agent's performance. To gain better insight, Figure 7 should be interpreted in the context of the outage performance of the agent illustrated in Figures 9 and 10. However, since the outage performance of the agent is related to the weakest UE's achieved rate, Figure 8 depicts the probability density function (PDF) of the achieved rates of the weakest UEs in the system.
Based on the weakest UE rate for each setting, we can infer that the TD3 agent has formed an outage-aware policy which results in the least outage across the three different system settings. Note that since the PDFs in Figure 8 are for the weakest UEs in each category, this represents the worst-case performance of the agent.
To assess the outage performance of the proposed agent against the relative channel estimation quality
Another important observation is the impact of the number of neurons on the outage probability of the TD3 agent. The simulation results suggest that the TD3-256 outperforms the TD3-128 in the more challenging cases with a higher number of UEs. This further proves our claim that since the outage constraint is hyperparameterized in the proposed robust design, it is impacted by the selected learning parameters of the TD3 agent.
D. Dynamic-Channels Case
The fixed-channels case is useful for rigorous analysis of the agent’s developed policy as the channels are considered static. In practice, however, the channel is frequently changing especially when the UEs are moving. Therefore, we extend the developed algorithm to the dynamic-channels case in this subsection. Unlike the fixed-channels case, the users are assumed to be randomly distributed within the cell radius to make the design more practical. In this case, new channels are introduced in each new training episode. Furthermore, the channels are assumed to be quasi-static, i.e., the channels remain constant during each training episode and change afterwards. Moreover, 24 different channel sets are used for training. The aim of the dynamic-channels case is to train the agent to develop a comprehensive robust policy that can be generalized to never-seen-before channels. Hence, after training the agent once, it could be deployed to any channel condition afterwards.
Figure 11 illustrates the superior performance of the TD3 agent over the DDPG baseline in developing a highly rewarding policy.
In order to generate statistically meaningful results, a set of 100 channels and 10 error samples per channel are used for testing to generate the average performance results.
The average system sum-rates achieved by the proposed agent are shown in Figure 12.
The average sum-rates figure shows that baseline 1 achieves the highest rate, which is explained by the worse outage performance illustrated in Figure 13. The two figures suggest that there is a trade-off between achieving a higher system sum-rate and a higher non-outage probability. The TD3 agents, for example, achieve an average sum-rate of around
The average outage probability of the TD3 agent versus the target rate for the dynamic-channels case,
Furthermore, the PDFs of the average rate achieved by the weakest UE in the system are illustrated in Figure 14.
The PDFs figure shows that the TD3 agents achieve the highest mean of around
Overall, the TD3 agent outperforms all benchmark algorithms in terms of outage performance. In particular, the TD3 agent shows more adaptive and robust behaviour by trading off higher sum-rates for better outage performance when it is challenging to maximize both. This shows that the proposed TD3-based algorithm is capable of converging to adaptive policies that suit the problem requirements.
Conclusion
The resource allocation problem for an IRS-assisted MISO-NOMA system was considered in this paper. In particular, by taking the imperfect channel estimation at the BS and the UEs into account, the outage-constrained robust design with an ergodic sum-rate maximization objective was formulated. A correlation-based UE clustering algorithm was proposed to pair the UEs into clusters. Then, the challenging robust design problem was reformulated into an RL environment since it cannot be solved directly using conventional optimization techniques. Subsequently, a DRL-based framework was developed to solve the reformulated problem using the TD3 agent. The simulation results demonstrated that the TD3 agent outperforms conventional and other DRL algorithms in terms of generating robust resource allocation strategies for the considered system model under different system parameters. In addition, the performance of the developed TD3-based algorithm in the dynamic-channels case showed that the proposed framework can be implemented in practical scenarios. Furthermore, the proposed TD3-based algorithm achieves competitive performance with much lower computational complexity than conventional optimization algorithms, making it a more sensible option for latency-stringent applications.