Journals & Magazines >IEEE Open Journal of Vehicula... >Volume: 5

Multi-Agent Deep Reinforcement Learning Based Optimizing Joint 3D Trajectories and Phase Shifts in RIS-Assisted UAV-Enabled Wireless Communications

Abstract:

Unmanned aerial vehicles (UAVs) serve as airborne access points or base stations, delivering network services to the Internet of Things devices (IoTDs) in areas with comp...Show More

Metadata

Abstract:

Unmanned aerial vehicles (UAVs) serve as airborne access points or base stations, delivering network services to the Internet of Things devices (IoTDs) in areas with compromised or absent infrastructure. However, urban obstacles like trees and high buildings can obstruct the connection between UAVs and IoTDs, leading to degraded communication performance. High altitudes can also result in significant path losses. To address these challenges, this paper introduces the deployment of reconfigurable intelligent surfaces (RISs) that smartly reflect signals to improve communication quality. It proposes a method to jointly optimize the 3D trajectory of the UAV and the phase shifts of the RIS to maximize communication coverage and ensure satisfactory average achievable data rates for RIS-assisted UAV-enabled wireless communications by considering mobile multi-user scenarios. In this paper, a multi-agent double-deep Q-network (MADDQN) algorithm is presented, which each agent dynamically adjusts either the positioning of the UAV or the phase shifts of the RIS. Agents learn to collaborate with each other by sharing the same reward to achieve a common goal. In the simulation, results demonstrate that the proposed method significantly outperforms baseline strategies in terms of improving communication coverage and average achievable data rates. The proposed method achieves 98.6% of a communication coverage score, while IoTDs are guaranteed to have acceptable achievable data rates.

Published in: IEEE Open Journal of Vehicular Technology ( Volume: 5)

Page(s): 1712 - 1726

Date of Publication: 24 October 2024

Electronic ISSN: 2644-1330

DOI: 10.1109/OJVT.2024.3486197

Funding Agency:

Contents

SECTION I.

Introduction

Currently unmanned aerial vehicles (UAVs) are used as temporary airborne access points or base stations due to their flexible deployment and dynamic mobility in aerial wireless communication systems [1]. In remote areas where conventional base stations are not available, UAVs can offer wireless connectivity as well as emergency communication services. Furthermore, UAVs can serve to offer wireless communication services within densely populated settings like sports stadiums, concerts, or festivals. These areas often experience overwhelming data traffic, surpassing the capabilities of the current cellular network infrastructure [2], [3]. In recent years, UAV-assisted wireless networks have attracted enormous research interest, and numerous works have been proposed [2], [3], [4], [5]. However, due to the limited power capabilities and significant signal weakening that occurs over long distances when deploying UAVs, ensuring both high data rates and sufficient coverage becomes challenging [6]. In addition to that, aerial-ground communication channels suffer from blocking in dense urban areas due to high-rise buildings. When multiple ground users are involved, reducing series interference caused by LoS-dominated air-to-ground is also challenging [3]. In order to address this issue, reconfigurable intelligent surfaces (RIS) have been integrated with unmanned aerial vehicles (UAVs) [7], [8], [9].

The RIS technique improves the performance of wireless communication systems by manipulating electromagnetic waves. With RIS, electromagnetic wave propagation is controlled electronically via a large number of reflective elements on a planar surface controlled by a controller to regulate their propagation [8]. The RIS enhances the data rate by adjusting the phase of each element to maximize signal convergence toward the target direction [9]. In theory, RIS can manipulate the UAV-ground wireless communication channel by placing it in an environment in UAV-enabled wireless communications [10]. Researchers [11], [12], [13] have recently explored the integration of UAV and RIS in wireless communication.

Recently, researchers have used traditional optimization techniques, such as block coordinate descent (BCD), successive convex approximation (SCA), stochastic programming, and shallow machine learning algorithms to optimize UAV trajectory and RIS phase shifting to enhance the total data throughput of the system by leveraging RIS [14], [15], [16], [17], [18], [19], [20], [21], [22]. However, traditional optimization techniques lead to high computational costs and the difficulty of implementation in online and real-time environments, where the state of the UAV system continuously changes dynamically and the complexity of changing the RIS phase shifts. Due to the numerous variables involved and unpredictable system behavior, performing sequential decision-making mathematically in real-time applications is challenging, resulting in inaccurate modeling [22], [23].

In unpredictable and time-varying network environments, reinforcement learning (RL) has proven to be an effective tool for solving real-time dynamic decision-making problems [24], [25], [26], [27]. In comparison to traditional optimization-based techniques, reinforcement learning specifically the DRL algorithm provides the following advantages: (1) In the case of large-scale problems with high-dimensional environmental variables (i.e., state and action spaces), the DRL is more efficient [24]. However, classical optimization methods cannot handle continuous state and action spaces; (2) DRL-based solutions effectively solve real-time dynamic decision-making problems in a time-varying and unpredictable network environment. However, sequential decision-making problems in real-time applications are difficult to solve mathematically [22], [23]. Additionally, classical optimization algorithms have low solution efficiency and cannot adapt to the environment's dynamic nature for solving RIS-assisted UAV-enabled wireless networks; and 3) The DRL can discover optimal policies by interacting directly with a target environment to adapt to an unknown wireless environment. However, traditional optimization methods require predefined models to construct trajectory control policies. Furthermore, it often involves solving complex mathematical models that are computationally intensive and time-consuming, making them less suitable for real-time applications [23]. On the other hand, DRL-based solutions can automatically learn optimal strategies from past experiences, which makes them more generalizable and adaptable [24].

Furthermore, most prior works [26], [27] have considered a single agent for RIS-assisted UAV-enabled communication, which presents a significant challenge because of the complexity of the search space and the increased computational load. However, instead of a single agent, extending multi-agents is substantial due to the following reasons: 1) adapt to changes in real-time and enable decentralized decision-making, which is useful in large and complex environments; 2) multi-agents are particularly effective in solving complex problems because they can divide each problem into smaller subproblems, allowing each agent to focus on one part, which will lead to a more efficient and effective solution; 3) by distributing tasks among multiple agents, the computational load on any single agent is reduced. This leads to faster processing times and the ability to handle more complex tasks [28].

The distinctions between this paper and the existing works are summarized as follows: 1) in terms of system scenarios, some researchers [27], have considered a mobile RIS scenario, but this work considers a fixed ground RIS scenario due to UAV energy constraints. Prior papers [22], [27], [29] considered a single RIS, but we considered two RIS to improve communication performance; 2) to address the optimization problem, some works [19], [20], [21] employed traditional optimization methods, in the work [27] utilized single-agent reinforcement learning algorithms, and some works [29], [30] combined reinforcement learning and traditional algorithms. In this work, multi-agent DDQN is used to solve the joint problem; 3) as far as we know, this work is the first study to maximize the combined communication coverage and achievable data rate of Internet of Things devices (IoTDs) in RIS-assisted UAV-enabled wireless communication.

In this paper, we present a multi-agent DDQN-based 3D trajectory control and phase shift of the RIS (TCPS) scheme for optimizing UAV trajectory and the phase shift in an online manner to maximize communication coverage score while maintaining acceptable data rates for IoTDs in the RIS-assisted UAV-enabled wireless communication (RIS-UAVWC). The proposed approach involves executing an ongoing control operation, where the UAV monitors IoTDs and moves by their locations to ensure wireless connectivity. The main contributions are as follows:

We considered RIS-assisted UAV-enabled wireless communications to mitigate multipath effects on signal propagation between the UAV and IoTDs and to optimize the joint 3D UAV trajectory and the phase shifts of the RIS elements by considering two RISs for mobile multi-IoTDs rather than UAV-enabled wireless communications without RIS assistance examined in previous studies [31], we formulate an optimization problem for RIS-UAVWC to maximize the communication coverage score and ensure satisfactory average achievable data rates of the IoTDs in a target area by set a pre-defined threshold value while considering UAV movement constraints, and phase shift constraints. In real-time, the IoTD's location information is typically unavailable to network operators because of privacy concerns. Due to this, the proposed method operates in continuous control of the UAV trajectory and phase shift of RIS element tasks based on signals received from IoTDs.
Solving the formulated problem using traditional optimization techniques [14], [15], [16], [17], [18], [19], [20], [21], [22] is challenging due to the coupling between optimization variables and the nonlinearity of the underlying models. Furthermore, using a single-agent DRL [27] presents a significant challenge as the search space is more complex and the computational load on the single agent is increased. To address this issue, we propose a multi-agent DDQN algorithm, designed to generate trajectories and phase shifts ensuring long-term reliable communication coverage and better achievable data rates. By allowing the UAV and RIS controllers to act as agents, we turn the transformed problem into a multi-agent extension of the Markov decision process (MDP) and develop a multi-agent DDQN algorithm to solve it. While UAV mobility and RIS phase shift are optimized simultaneously through a multi-agent DDQN strategy.
We present comprehensive numerical results to demonstrate the significance of the proposed joint optimization scheme and to compare the results with baseline methods in terms of communication coverage and IoTD data rate

The rest of this paper is organized as follows: Section III introduces the system model and problem formulation. Section IV details the MADDQN-based TCPS scheme within the RIS-UAVWC framework. Section V validates the proposed method through simulation results. Finally, Section VI offers concluding remarks and discusses future work.

SECTION II.

Related Work

In recent years, UAV-assisted wireless networks have attracted considerable research attention, and many studies have been conducted on optimizing UAV trajectories to improve communication performance [2], [3], [4], [5], [31], [32], [33]. Numerous studies have explored the assistance of RIS in UAVs-enabled wireless communication systems for various aspects like power management [29], security [30], spectral efficiency [31], and resource allocation [34]. Researchers have also investigated RIS-assisted UAVs to maximize data rate using a traditional approach through optimizing UAV trajectory and phase shift. For example, the authors in [18] studied UAV-RIS-aided NOMA networks with multiple user groups, the BCD algorithm maximizes the total throughput of the transport network by changing the UAV location, NOMA decoding, and RIS phase shift. In [19], the authors examined a simple NOMA system powered by UAV-RIS used to communicate with two users on the ground, designed to maximize data rates for nearby users. An iterative SDR algorithm optimizes the horizontal UAV trajectory, beam at the base station, and RIS phase shift to achieve a better target data rate. The authors in [12] investigated a straightforward UAV-RIS system that integrated an optimized UAV flight path with RIS passive beamforming, utilizing an SCA algorithm to improve the average achievable data rate. The authors in [20] studied RIS-assisted UAV systems with LoS channels that experience blocking have aimed to optimize both the UAV trajectory and the RIS phase shift to maximize user sum rates. Two effective algorithms have been introduced: conjugate gradient and alternating optimization algorithms. In [21], the authors studied a system using a RIS-assisted UAV with OFDMA communication between a base station and a user. The objective is to enhance the system's overall data rate by jointly optimizing the UAV trajectory, the RIS phase shift, and resource allocation. To tackle the non-convex nature of these optimization problems, they proposed an alternative optimization method aimed at boosting the sum rate. In [16], the authors studied UAVs with RIS-assisted wireless communication networks to maximize ground users' minimum throughput. By optimizing both the UAV's horizontal position and the user's transmission power, this was accomplished. The study proposed using SCA and SDR methods to optimize the UAV horizontal position and the RIS passive beamforming vector. However, these traditional techniques have several drawbacks: they are time-consuming computationally, produce static solutions that may become outdated or less effective, focus on optimizing a single objective function, and cannot learn from data or adjust their strategies based on experience, ultimately leading to diminished performance.

A successful strategy has been shown using deep reinforcement learning [35] for optimizing a combined UAV trajectory control and RIS phase shift design, surpassing the constraints of conventional optimization methods. The UAV can develop an optimal trajectory strategy through ongoing interaction with the environment using DRL algorithms, without relying on an explicit model. Additionally, it incorporates RIS phase shift design in RIS-assisted UAV-enabled networks to improve system throughput. Previous studies have explored DRL-based UAV-aided communication as a means to boost system throughput [23], [24], [31], [35], [36]. For example, in [29], the deployment of RIS-assisted UAVs by jointly optimizing the 3D position of the UAV and the phase shift of the RIS to maximize data transfer rate and minimize energy propulsion and propose DRL to find an optimal solution. The authors in [24] studied how RIS-assisted UAV multicast networks can boost total data transmission rates. Explored optimizing UAV trajectories, RIS reflection matrices, and beamforming strategies using a Multi-Pass Deep Q-Network. The authors in [37] discuss a RIS-equipped UAV capable of offering aerial LoS and enhancing mobile Internet-of-Vehicle communication within densely populated urban regions supported by 5G/6G networks. To address this, the authors employed deep reinforcement learning to optimize and streamline the navigation process intelligently. The authors in [27] studied the downlink of UAV networks employing RIS in non-orthogonal multiple access scenarios. Focused on optimizing the UAV position and adjusting RIS phase shifts to enhance system capacity, all while considering UAV energy consumption, using a DDQN approach.

SECTION III.

System Model and Problem Formulation

In this section, we present the system model and formulate the problem of joint UAV trajectory control and RIS phase shift.

A. System Model

This paper considers a downlink multiuser transmission system utilizing UAV and RIS. The setup involves a single UAV U, multiple IoT devices, and RISs, as depicted in Fig. 1. The UAV acts as an aerial base station to provide continuous wireless services to K mobile IoT devices spread across a designated area. To support multiple IoT devices that can be served by UAV at each time slot, we consider a non-orthogonal multiple access scheme. The UAV connects to the core network through a ground gateway node using a backhaul link. It is assumed that the UAV determines its location via a satellite positioning system. We consider that the direct links from UAV to IoTDs are blocked, thus the RIS is used to reflect the signals. Hence, multiple RISs are deployed on buildings (in our case, two RISs are mounted on two buildings) to enhance the communication quality of IoTDs. The UAV and IoTDs are assumed to use an omnidirectional antenna, whereas, each RIS contains N reflecting elements. In addition, UAV can be located in the air, ${{L}^{UAV}}[t] = ({{x}^U}[t],{{y}^U}[t],{{h}^U}[t])$ , where $\ ({{x}^U}[t],{{y}^U}[t])$ is the UAV latitude and longitude coordinates and ${{h}^U}[t]$ is its flight dynamic altitude. RIS can be located in the $L_r^{RIS} = (x_r^R,y_r^R,h_r^R)$ where $(x_r^R,y_r^R)$ is latitude and longitude coordinates of r-th RIS_, and $h_r^R$ is a fixed height for the RISs. Each IoTD is located in the coordinates of $L_i^{IoTD}[t] = (x_i^I[t],y_i^I[t])$ and each IoTD is mobile. The distance between the UAV and the IoTDs at timeslot t is $\begin{equation*} D_{u,i}^{U,I}[t] = \sqrt {{{{({{x}^U}[t] - x_i^I[t])}}^2} + {{{({{y}^U}[t] - y_i^I[t])}}^2} + {{{({{h}^U}[t])}}^2}} . \tag{1} \end{equation*}$ View SourceThe distance between the UAV and the r-th RIS at timeslot t is $\begin{equation*} D_{u,r}^{U,R}[t] = \sqrt {{{{({{x}^U}[t] - x_r^R)}}^2} + {{{({{y}^U}[t] - y_r^R)}}^2} + {{{({{h}^U}[t] - h_r^R)}}^2}} . \tag{2} \end{equation*}$ View SourceThe distance between the RIS and the i-th IoTD at timeslot t is $\begin{equation*} D_{r,i}^{R,I}[t] = \sqrt {{{{(x_r^R - x_i^I[t])}}^2} + {{{(y_r^R - y_i^I[t])}}^2} + {{{(h_r^R)}}^2}} . \tag{3} \end{equation*}$ View Source

Figure 1.

The system model of RIS-assisted UAV-enabled wireless communication for multi-IoTDs.

Show All

This paper system model presents a UAV that flies in the air, with the RISs mounted on the surface of the buildings. As shown in Fig. 1, the channels in the system model include a direct link (i.e., the UAV-IoTD link), and if the direct link is blocked due to blocking, then IoTD will use the common reflecting link via RIS signal reflection (i.e., the UAV-RIS link, RIS-IoTD signal reflection). Thus, in the RIS-UAVWC system, the channel between each IoT device and the UAV is comprised of two connections: the direct UAV- i-th IoTD link and the reflected UAV-RIS-IoTD link.

The channel gain of a direct link (UAV- i-th IoTD link) at timeslot t is given by: $\begin{equation*} g_i^{UI}[t] \!=\! \sqrt {\frac{\alpha }{{{{{(D_{u,i}^{U,I})}}^{\beta 1}}}}} \left(\sqrt {\frac{{{{K}_1}}}{{{{K}_1} \!+\! 1}}} {{\bar{g}}_{UI}}[t]l + \sqrt {\frac{1}{{{{K}_1} \!+\! 1}}} {{\tilde{g}}_{UI}}[t]nl\right), \tag{4} \end{equation*}$ View Sourcewhere α represents the path loss at a reference distance of 1 m, β1 denotes the path loss exponent, K₁ indicates the Rician factor, ${{\bar{g}}_{UI}}[t]l$ , ${{\tilde{g}}_{UI}}[t]nl$ and the terms refer to the Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS) components of the channels respectively.

The UAV-RIS-IoTD link at timeslot t comprises two sub-links: the UAV-RIS sub-link, connecting the UAV and the r-th RIS, and the RIS-IoTD sub-link, linking the RIS and the IoTD. The channel gain of the UAV and r-th RIS at timeslot t is given by: $\begin{equation*} g_r^{UR}[t] = \sqrt {\frac{\alpha }{{{{{(D_{u,r}^{U,R})}}^2}}}} \left(\sqrt {\frac{{{{K}_1}}}{{{{K}_1} + 1}}} {{\bar{g}}_{UR}} + \sqrt {\frac{1}{{{{K}_1} + 1}}} {{\tilde{g}}_{UR}}\right), \tag{5} \end{equation*}$ View Sourcewhere ${{\tilde{g}}_{UR}}$ represents the scattering NLoS component, which is i.i.d complex Gaussian distributed, i.e., CN (0, I). According to [26], a uniform planar array RIS is considered, taking into account the corresponding antenna array response. Hence, the ${{\bar{g}}_{UR}}$ is: $\begin{equation*} {{\bar{g}}_{UR}} = {{\left[ {1,\ldots,{{e}^{ - j\frac{{2\pi {{d}_a}}}{\lambda }\phi _{ur}^{UR}[t]}},\ldots,{{e}^{ - j\frac{{2\pi {{d}_a}}}{\lambda }(N - 1)\phi _{ur}^{UR}[t]}}} \right]}^{\bm{T}}}, \tag{6} \end{equation*}$ View Sourcewhere λ is the carrier wavelength, ${{d}_a}$ is the antenna separation distance, $\phi _{ur}^{UR}[t] = \frac{{{{x}^U}[t] - x_r^R}}{{\| {{{L}^{UAV}}[t] - L_r^{RIS}} \|}}$ is the cosine of the angle of arrival from the UAV to the RIS.

The channel gain between the r-th RIS and i-th IoTD at timeslot t is given by: $\begin{equation*} g_i^{RI}[t] = \sqrt {\frac{\alpha }{{{{{(D_{r,i}^{R,I})}}^{\beta 2}}}}} \left(\sqrt {\frac{{{{K}_1}}}{{{{K}_1} + 1}}} {{\bar{g}}_{RI}} + \sqrt {\frac{1}{{{{K}_1} + 1}}} {{\tilde{g}}_{RI}}\right), \tag{7} \end{equation*}$ View Sourcewhere $\beta 2$ is path exponent of RIS- IoTD, ${{\bar{g}}_{RI}}$ given by: $\begin{equation*} {{\bar{g}}_{RI}} = {{\left[ {1,\ldots,{{e}^{ - j\frac{{2\pi {{d}_a}}}{\lambda }\phi _{ri}^{RI}[t]}},\ldots,{{e}^{ - j\frac{{2\pi {{d}_a}}}{\lambda }(N - 1)\phi _{ri}^{RI}[t]}}} \right]}^{\bm{T}}}, \tag{8} \end{equation*}$ View Sourcewhere $\phi _{ri}^{RI}[t] = \frac{{x_i^I[t]] - x_r^R}}{{\| {L_i^{IoTD}[t] - L_r^{RIS}} \|}}$ is the cosine of the angles of departure from r-th RIS to i-th IoTD, ${{\tilde{g}}_{RI}}$ represent the scattering NLoS component Consequently, the channel gain of the UAV-RIS-IoTD link at timeslot t can be written as: $\begin{equation*} g_{ri}^{URI}[t] = g_r^{UR}[t]{{\Theta }_r}[t]g_i^{RI}[t], \tag{9} \end{equation*}$ View Sourcewhere ${{\Theta }_r}[t] = diag\{ {{\theta }_1},{{\theta }_2},{{\theta }_n},\ldots.,{{\theta }_N}\}$ represents the reflection coefficient matrix at the RIS, ${{\theta }_n} = {{\beta }_n}{{e}^{j{{\varphi }_n}}}$ represents ${{\varphi }_n}$ ∈ [-2π, 2π) denotes the phase shift coefficient, and ${{\beta }_n}$ ∈ [0, [1] denotes the amplitude reflection coefficient factors of the n reflecting element in the r-th RIS, n ∈ {1, 2 $, \ldots,$ N} at timeslot t.

However, Due to the limitations of hardware and to reduce computational time, utilizing discrete phase shift values allows for a more practical and efficient design and control of RIS elements. This discrete phase shift makes managing and operating the elements easier and more effective [38]. This article used quantization techniques [36] to discretize continuous values into a finite set of discrete values. Based on a uniform quantization of the phase shift for every RIS element, 3 bits are used to determine all possible phases of reflectivity; thus, the feasible phase is ${{\varphi }_{n,m}} \in \{ 0,m\vartriangle \vartheta,\ldots\ldots({{2}^3} - 1)\vartriangle \vartheta \}$ where m ∈ {1, 2 $, \ldots,$ $({{2}^3} - 1)$ } in which $\vartriangle \vartheta = {\hbox{${2\pi }$} \!\mathord{/ {\vphantom {{2\pi } {{{2}^3}}}}} \!\hbox{${{{2}^3}}$}}$ .

Based on (4) and (9), the achievable data rate $R_{r,i}^{URI}[t]$ (in bps) for each i-th IoTD at timeslot t is given by: $\begin{equation*} R_{r,i}^{URI}[t] = B{{\log }_2} = \left( {1 + {{{\frac{{p\left| {g_i^{UI}[t] + g_{ri}^{URI}[t]} \right|}}{{{{N}_o}}}}}^2}} \right), \tag{10} \end{equation*}$ View Sourcewhere B represents the channel bandwidth, P denotes transmitted power, and N_o denotes noise power. To facilitate easy reading for the readers, the notations and their definitions used in this paper are listed in Table 1.

B. Problem Formulation

This paper aims to maximize communication coverage while assuring a satisfactory achievable data rate for IoTDs. To accomplish the objective, we define the TCPS problem by choosing the optimal trajectory of the UAV $L[t] = ({{x}^U}[t],{{y}^U}[t],{{h}^U}[t])$ and phase shift of the r-th RIS ${{\Phi }_r}[t] = [{{\theta }_{r,1}},{{\theta }_{r,2}},{{\theta }_{r,n}},\ldots,{{\theta }_{r,N}}]$ for each timeslot t. The TCPS problem is dual. First, we defined the communication coverage of the deployed UAV concerning the K IoTD in the target area. In RIS-UAVWC systems, communication coverage is determined by considering the association between the UAV and the IoT devices in the target area. The association between them can be determined based on the received signal quality. When the IoTD received data rate exceeds a defined threshold value, there is a connection between the UAV and the ground IoT devices, which means the IoT device is covered by the UAV [31]. Let $\chi _{ri}^{URI} \in \{ 0,1\}$ denote the association between UAV and the i-th IoTD directly and/or via a r-th RIS. When $\chi _{ri}^{URI} = 1$ , the i-th IoTD is served by UAV either directly or via r-th RIS; otherwise, i-th IoTD is not served by UAV. At that point, the communication coverage scores $C[t]$ of the RIS-UAVWC at timeslot t can be expressed as follows: $\begin{equation*} C[t] = \left( {\frac{\rm Z}{K}} \right) \times 100\%,{\rm Z} = \sum\limits_{i = 1}^K {\chi _{ri}^{URI},} \chi _{ri}^{URI} \in 1, \tag{11} \end{equation*}$ View Sourcewhere ${\rm Z}$ represents the number of connected IoTD in the deployed UAV at each timeslot t. Second, we have defined the average data rate achievable by the i-th IoTD in the RIS-UAVWC setup. The TCPS scheme aims to improve this average data rate to a desirable level by optimizing both the trajectory of the UAV and the phase shift of the RIS in a specified environment during timeslot t. The average achievable data rate of the i-th IoTD at timeslot t $\bar{R}_{r,i}^{URI}[t]$ can be expressed as follows: $\begin{equation*} \bar{R}_{r,i}^{URI}[t] = \frac{1}{{\bm{K}}}\left( {\sum\limits_{i = 1}^{\bm{K}} {R_{r,i}^{URI}[t]} } \right),\chi _{ri}^{URI} \in 1. \tag{12} \end{equation*}$ View Source

Then, the TCPS issue can be mathematically described as the TCPS problem: $\begin{equation*} \mathop {\arg \max }\limits_{L[t],{{\Phi }_r}[t]} \sum\limits_{t = 1}^T {C[t]} \times \bar{R}_{r,i}^{URI}[t] \tag{13} \end{equation*}$ View Source

s.t.

$\bar{R}_{r,i}^{URI}[t] \geq {{R}_{th}}$
${{\theta }_{r,n}} \in [ - 2\pi,2\pi ),\forall n \in N$
$h_{UAV}^{\min } \leq {{h}^{_U}} \leq h_{UAV}^{\max },\forall {\bm{t}}$
$0 \leq {{i}_{speed}} \leq {{V}_{\max }},\forall {\bm{t}}$

Here, C1 imposes the minimum average achievable data rate of the IoTD to guarantee satisfactory service quality; C2 signifies the phase-shift of the r-th RIS; C3 and C4 indicate the UAV boundaries of UAV flying altitude and i-th IoTD speed constraints respectively. Then we convert the transformed problem into a learning task by defining the multi-agent DDQN components that learn how to interact with the environment. After that, the MADDQN-based TCPS scheme training process is presented to maximize the expected.

SECTION IV.

Proposed Solution

In this section, we start by providing some preliminary information regarding reinforcement learning (RL) and DRL. Next, it describes how to convert the TCPS problems into Markov decision processes (MDP) and learning tasks. Lastly, the MADDQN-based TCPS scheme training process is presented to maximize expected long-term rewards.

A. Preliminaries

Reinforcement learning (RL) [35] employs dynamic learning to tackle decision-making challenges by learning from the current context and adjusting to environmental changes. This approach formulates the problem as an MDP, enabling the use of DRL algorithms. The MDP comprises four key components: the state space (S), action space(A), reward function (R), and state transition (P). At each time slot t, the agent perceives the current system state ( ${{s}_t}$ ), selects an action ( ${{a}_t}$ ) according to a policy π ( ${{s}_t}$ , ${{a}_t}$ ), receives a reward (r_t), and transitions to a new state ${{s}_{t + 1}}$ based on P( ${{s}_{t + 1}}$ | ${{s}_t}$ , ${{a}_t}$ ).

The RL agent aims to find the optimal policy ${{\pi }^ * } = \arg {{\max }_{a \in A}}Q({{s}_t},{{a}_t})$ that maximizes the accumulated discounted reward: $\begin{equation*} {{Q}_\pi } = \sum\limits_{t = 1}^T {{{\Upsilon }^{t - 1}}} r({{s}_t},\pi ({{a}_t})), \tag{14} \end{equation*}$ View Sourcewhere $\Upsilon \in (0,1)$ is the discount factor for reward delivery, $r({{s}_t},\pi ({{a}_t})$ is obtained reward by selecting policy π at the state ${{s}_t}$ .

In specific scenarios, we can utilize the standard Q-learning update to train a parameterized value function $Q({{s}_t},{{a}_t};{{\theta }_t})$ . This process entails updating the parameters ${{\theta }_t}$ after executing the action ${{a}_t}$ and observing ${{s}_{t + 1}}$ the reward ${{r}_{t + 1}}$ and transitioning to the state ${{s}_{t + 1}}$ is then $\begin{equation*} {{\theta }_{t + 1}} = {{\theta }_t} + \psi (Y_t^q - Q({{s}_t},{{a}_t};{{\theta }_t}))\nabla {{\theta }_t}Q({{s}_t},{{a}_t};{{\theta }_t}), \tag{15} \end{equation*}$ View Sourcewhere ψ represents the scalar step size and the target network $Y_t^q$ is defined as: $\begin{equation*} Y_t^q = {{r}_{t + 1}} + \gamma \mathop {\max }\limits_a Q({{s}_{t + 1}},{{a}_t};{{\theta }_t}), \tag{16} \end{equation*}$ View Source

In Q-learning, a Q-table, also known as a look-up table, contains Q values for every (s, a) pair. The goal is to choose the best possible action for a specific situation, adhering to policy π*, to maximize the total rewards accumulated over a period of time. By iteratively updating Q-values using the Bellman equation [31] with the agent experiences, the algorithm aims to discover the best policy. By consulting the Q-table, the agent makes decisions accurately and creates an effective decision-making strategy from any state.

However, the traditional Q-learning algorithms face limitations stemming from the curse of dimensionality [39]. In large-scale environments, achieving an optimal policy with Q-learning becomes difficult for two main reasons: 1) the RL agent struggles to explore numerous (state, action) pairs due to the vast lookup table, leading to diminished learning performance and algorithmic efficiency, and 2) The impractical storage demands of the Q table are being tackled through the implementation of deep neural networks (DNNs), which enhance traditional Q-learning techniques by incorporating artificial intelligence. As a result, in DRL, the Deep Q-Learning Network (DQN) algorithm employs a DNN to approximate values instead of relying on a Q-table. In the DQN methodology, the agent's interactions with its environment (specifically, the state, action, reward, and subsequent state, denoted as $({{s}_t},{{a}_t},{{r}_t},{{s}_{t + 1}})$ are stored in a buffer ℜ_b. The neural network undergoes training by sampling random minibatches of these transitions from the buffer ℜ_b. These transitions may originate from various episodes stored in the neural networks. This approach offers the advantage of reusing records from the buffer multiple times, which enhances data diversity and diminishes the mini-batch's relevance. Consequently, this improves the network's ability to generalize and leads to better training outcomes.

The target network $Y_t^{DQN}$ used by DQN is summarized as follows: $\begin{equation*} Y_t^{DQN} = {{r}_{t + 1}} + \gamma \mathop {\max }\limits_a Q({{s}_{t + 1}},{{a}_t};{{\theta }^ - }). \tag{17} \end{equation*}$ View Source

Due to the higher value chosen by DQL, the value calculated by DQN is overestimated.

Inspired by the findings in reference [27], This study introduces a DDQN-driven approach designed to enhance UAV trajectory and phase shift optimization with RIS, as depicted in Fig. 2. The essence of DDQN lies in mitigating DQN's tendency to overestimate by segregating the highest activity in the objective into distinct phases of action selection and evaluation. the target network $Y_t^{DDQN}$ is summarized as follows: $\begin{equation*} Y_t^{DDQN} = {{r}_{t + 1}} + \gamma \mathop {Q({{s}_{t + 1}},\arg \max }\limits_a Q({{s}_{t + 1}},{{a}_t};{{\theta }_t}),{{\theta }^ - }). \tag{18} \end{equation*}$ View Source

Figure 2.

The overall architecture of the MADDQN-based TCPS algorithm for RIS-UAVWC.

Show All

A single-agent DDQN approach focuses on learning optimal policies in an isolated, straightforward environment, whereas a multi-agent approach focuses on handling multi-agent interactions, which typically require more sophisticated strategies to manage dynamic and non-stationary environments.

B. Problem Transformation

The TCPS problem can be defined as a multi-agent due to the UAV and RIS involved. Using the multi-agent DDQN concept, the proposed multi-agent TCPS policy uses multiple agents to train and control UAV movement and RIS phase shift to maximize system reward by simultaneously interacting within a shared environment and executing joint actions. The trajectory of UAV with RIS phase shift can be dynamically adjusted to offer better communications service to IoTDs. Under uncertain and stochastic environments, the challenges related to decision-making can be represented using MDP.

To solve the TCPS problem, we model the time sequential-decision process of the TCPS problem as MDP and follow the multi-agent DDQN method. Therefore, in this paper, we considered an ergodic MDP to converge the proposed MADDQN-based TCPS strategy. A multi-agent MDP is typically defined as a quintuple (F, S_uri, A_uri, P, R_uri), where F is the set of all agents, which are UAV and RIS controller; S_uri represents the state space; A_uri = {A1, A2} denotes the set of action spaces for all agents; P represents the probability of state transition of the system and R_uri is the joint reward function that for two agents. We aim to find an optimal policy π* that maximizes both communication coverage and the average data rate of IoTDs in the TCPS problem defined by (13). The MDP of a UAV trajectory control and RIS phase shift agent consists of three elements: The state space, action space, and the function that determines rewards.

1) State Space

The state in the MADDQL model in the RIS-UAVWC framework is denoted by S_uri. The current state ${{s}_{uri}}[t]$ ∈S_uri at timeslot t and it can be expressed as ${{s}_{uri}}[t] = \{ {{L}^{UAV}},R_{r,i}^{URI},L_i^{IoTD},{{\theta }_n}\}$ where ${{L}^{UAV}}$ denotes the coordination of the UAV, $R_{r,i}^{URI}$ symbolizes achievable data rate of the i-th IoTD, $L_i^{IoTD}$ denotes the coordination of the i-th IoTD and ${{\theta }_n}$ represent the phase shift information of r-th RIS at each timeslot t.

2) Action Space

The action of the MADDQL model is denoted by ${{{\bm{A}}}_{{\bm{uri}}}}$ . In the context of a MADDQL system, there are two agents (i.e., UAV and RIS controller). Based on the observed state information ${{{\bm{s}}}_{{\bm{uri}}}}$ the UAV agent chooses the flying direction of the UAV ${{a}_u}[t]$ and the RIS controller agent chooses the phase shift ${{a}_r}[t]$ . In this paper, the UAV can fly in discrete directions, with nine predefined actions, and the RIS controller can phase shift in discrete phase shift between -2π and 2π, with a predefined phase shift, which was defined to optimize the objective function given by (13). Therefore, ${{a}_u}[t]$ ∈{left, right, upwards, downwards, left-up, right-up, left-down, right-down, no movement} with a step length of one meter each time, ${{a}_r}[t]$ ∈ ${{\theta }_{r,n}}$ [t], ${{\theta }_{r,n}}$ [t] ∈ {-π/4,-π/2,-3π/4,-2π/2,-5π/2,-3π/2,-7π/40, π/4, π/2, 3π/4, 2π/2, 5π/2, 3π/2, 7π/4} based on a 3-bit uniformly quantized method for phase shifts of each RIS element. Therefore, the action selected at each step is given by: ${{{\bm{A}}}_{{\bm{uri}}}} = \{ {{a}_u}[t],{{a}_r}[t]\}$ . The agents act according to what they've learned and their policy π during timeslot t.

3) Reward Function

The agents' objective in the designated setting is to maximize the reward obtained by the system. With this in mind, we have established a reward function to maximize both the communication coverage (r1 = (11)) and the average achievable data rate of the IoTDs (r1 = (12)), as described. The reward function is as follows: $\begin{equation*} R[t] = {{r}_1} \times {{r}_2} - \sum\limits_{i = 1}^2 {\ell [t]} {{\rho }_i}, \tag{19} \end{equation*}$ View Sourcewhere ${{\ell }_1}[t] \in \{ 0,1\}$ is the penalty coefficient of regardless of whether the reward function is satisfied; ${{\ell }_2}[t] \in \{ 0,1\}$ is the penalty coefficient of ${{r}_2}$ regardless of whether the reward function is satisfied; ${{\rho }_1}$ and ${{\rho }_2}$ are defined as the penalty ${{r}_1}$ and ${{r}_2}$ , respectively, which are negative rewards.

C. The Proposed MADDQN -Based TCPS Algorithm

Fig. 2 illustrates the training process of the MADDQN-based TCPS scheme. A UAV and RIS controller are treated as DDQN agents in the context of the TCPS problem. The agent heavily depends on the online network to acquire knowledge and assess the effectiveness of the TCPS policy. It involves an online network, a target network, and an experience replay buffer. The network parameters are continuously adjusted throughout the learning process by estimating Q-values, ensuring precise decision-making according to the current system state. However, these parameter updates can change the correlation between the current state and the target value, which may result in oscillations or divergences. To address this issue, methods like target networks and experience replay buffers are utilized to improve the convergence of learning. The target networks possess an identical structure to the online network and retain a duplicate of the online network parameters. This duplicate is gradually updated to avoid any instability or inconsistency during the learning process. The experience replays buffer stores past experiences generated during the learning process, ensuring that training data is not correlated and increasing their independence. Subsequently, the MADDQL algorithm uses a mini-batch to randomly select samples from the replay buffer for training DNNs. This approach improves sample efficiency and enhances learning stability. In Algorithm 1, we provided the pseudocode of the proposed MADDQN-based TCPS scheme.

Algorithm 1, summarizes the pseudocode of the proposed MADDQN-based TCPS algorithm. The algorithm combines centralized training with decentralized execution. We first initialize the entire RIS-UAVWC environment and the DDQL network parameters of the TCPS strategy (line 1). In each training episode, initialize the state. Then at each timeslot t, two agents observe their state from the target environment. Each agent then sends its private observations to the central server, which collects and aggregates all state information. Then, the centralized DDQN algorithm is trained on the central server to generate the policy based on the joint action taken by two agents in the system.

Algorithm 1: MADDQN-Based TCPS Algorithm.

The network takes in the joint observation of all agents and outputs a Q-value for each possible joint action ${{A}_{uri}}[t] = \{ {{a}_u}[t],{{a}_r}[t]\}$ . During training, the agents use an epsilon (ε)-greedy exploration strategy to select their appropriate joint actions ${{A}_{uri}}[t]$ based on the current state of the environment and the output of the centralized DDQN network (lines 5-10). When the learning process begins, the DDQL agents do not completely understand the environment, making the Q-function estimate uncertain. Therefore, the agent must explore the environment to some degree to address this uncertainty and ensure an optimal policy [35]. The agents choose an action based on the Q-table, with the probability of action selection for the DDQN agent being represented as: $\begin{equation*} {{A}_{uri}}[t] = \mathop {\arg \max }\limits_{a \in A} Q({{s}_{t + 1}},{{A}_{uri}},{{\theta }_{eva}},{{\theta }^ - }) \tag{20} \end{equation*}$ View Sourcewith probability 1 and chooses a random ${{A}_{uri}}$ with probability exploration rate ε. The ε parameter is often set to a high value (ε close to 1) at the start of the learning process and gradually decreases over time as the agent gains more experience and becomes more confident in its policy. After executing ${{A}_{uri}}$ , the reward $R[t]$ and the next system state ${{s}_{uri}}[t + 1]$ were obtained (lines 9-10) and the experiences $({{s}_{uri}}[t],{{A}_{uri}},R[t],{{s}_{uri}}[t + 1])$ are stored in replay buffer ℜ_b with the capacity (line 11). Then, the DDQN network parameters are optimized by minimizing the loss function with the stochastic gradient descent method (line 14). the target network parameter copied from the evaluation network parameter θ_eva (i.e., θ_tar =θ_eva,) in every τ training iteration of the DDQN, which smooths out fluctuations (lines 15-17).

D. Complexity of the Proposed MADDQN Algorithm

A computational complexity analysis of the MADDQN algorithm can be analyzed depending on the number of agents Ñ, the neural network with layers L and neurons in layer n_l, the forward and backward pass complexity is ${\mathrm O}(\sum\nolimits_{l = 1}^{L - 1} {{{n}_l}.{{n}_{l + 1}}} ),$ the DNN initialization run consumes a significant amount of time t_ini, the time to execute the DDQN. In each episode, the time consumed by the initial state and buffer run is t_initepos. Each step of computation consumes t_step, where t_step is the sum of all computations performed t_step_. The DDQN algorithm execution time is t_ini + (t_initepos + t_step × T) × I, so the complexity of multi-agent DDQN is ${\mathrm O}(\bar{N} \times T \times I \times (\sum\nolimits_{l = 1}^{L - 1} {{{n}_l}.{{n}_{l + 1}}} )).$

SECTION V.

Simulation Result

In this section, we present numerical results that evaluate the proposed scheme performance. Several simulations were conducted to evaluate the performance of the proposed algorithm. The simulations were performed on a PC with an NVIDIA GeForce GTX 1080 Ti GPU, using Python 3.7 and TensorFlow 1.15.0. Firstly, we outline the parameters for the simulation and the network. In the following section, we present the results and discuss them.

A. Simulation Setup

The simulation environment is similar to that of the real-world environment. In the simulations, we considered a (1000 m × 1000 m × 50 m) 3D space simulation environment as illustrated in Fig. 1. There are two RIS with fixed locations (100 m, 75 m, 10 m) and (120 m, 60 m, 10 m) respectively. The location of UAV is initially fixed at (20 m, 80 m, 50 m). The maximum flight altitude of a UAV and the minimum flight altitude of a UAV are $h_{UAV}^{\max }$ = 50 m and $h_{UAV}^{\min }$ = 10 m respectively. During the simulation, we take into account 20 IoTDs. The fast speed of IoTDs speed V_max = 2.5 m/s. The transmit power of the UAV is set as P=20 dBm. The quantity of reflective elements of RIS is 81. The path loss exponents β1 and β2 are 2.25 and 2.0 respectively. The noise power is set as No = −80 dB. In DDQN, both the original and target networks consist of two layers, each containing 100 neurons. ReLU is used as the activation function, and the Adam optimizer is employed for training the deep neural networks. The model training involved 500 episodes, each comprising 2000 execution steps. A replay buffer of size (ℜ_d) 32000 was used, with a mini-batch size of 128. Tables 2 and 3 present the detailed simulation and network parameters of the proposed algorithm, respectively.

TABLE 1 List of Notations

TABLE 2 Simulation Parameters

TABLE 3 Network Parameters

B. Results and Analysis

We illustrate the proposed MADDQN-based TCPS for RIS-UAVWC method performance and evaluate its effectiveness against the baseline schemes:

DRL-based 3D trajectory UAV deployment without RIS (DRL-UAV/RIS): In this scheme, we show the comparison of UAV deployment using a 3D trajectory strategy without RIS assistance to maximize user communication coverage and system throughput [31] With the proposed algorithm for RIS-UAVWC, RIS assistance can maximize the achievable data and communication coverage of IoTD.
DDQN-based RIS-assisted UAV with random RIS phase shift (DDQN RIS-UAV/PS): In this scheme, the phase shifts of RIS are randomly configured and we investigate the impact of optimizing phase shift in the communication performance in the RIS-UAVWC. The proposed method shows the contribution of the optimized RIS phase shift.
DDQL-based UAV trajectory and traditional phase shift algorithm (PSO) for RIS-assisted UAV (DDQN+PSO RIS-UAV): In this scheme, researchers used DDQN [28] for UAV trajectory optimization and traditional PSO for a phase shift of RIS. Based on this paper framework, we compare the performance of the proposed approach to the DDQN + PSO RIS-UAV algorithm.

The proposed MADDQN-based TCPS scheme was evaluated using two metrics: i) communication coverage score ([t])- which is the communication coverage score at the end of the test period. The communication coverage of the proposed TCPS strategy can be evaluated using the coverage score, based on finding how many IoTDs are connected to the deployed UAV (i.e., number of IoT devices they are received acceptable data rates) and the total number of IoT devices are in the target area during each time slot in testing phase by according to (11); ii) average achievable rate of IoTDs ( $\bar{R}_{r,i}^{URI}[t]$ )- which is the average achievable data rate of IoTD at the end of the test period and (12) can be used to determine the average IoTD achievable data rate. The MADDQN-based TCPS method is also compared with the DRL-UAV/RIS, DDQN RIS-UAV/PS, and DDQN+PSO RIS-UAV respectively in terms of cumulative rewards, coverage scores, and average achievable rate; in the proposed system, the average achievable rate is influenced by the number of RIS elements; the performance of communication coverage is influenced by the number of IoTDs.

Fig. 3 illustrates the cumulative reward of the number of episodes for the proposed method across three different algorithms. The learning curves for all algorithms show a gradual increase in accumulated rewards as training advances, eventually stabilizing, indicating steady convergence. This demonstrates that cumulative rewards rise with an increasing number of training episodes. During the initial learning iteration, the cumulative rewards obtained by all learning methods tend to be lower due to two reasons: i) the agents randomly explore an unknown environment. Initially, the agents have limited knowledge of the environment and must explore different actions to maximize the objective function; ii) the DNN parameters in the DRL algorithm are initialized with random weights, which can result in the agent exploring suboptimal actions and receiving negative rewards. With increasing training episodes, agents gain a deeper understanding of the environment to develop an optimal TCPS scheme. Fig. 3 shows that the proposed MADDQN algorithm achieves rapid convergence within 500 episodes and demonstrates the effectiveness of the reward structures and state spaces when compared with DQN and Q-Learning algorithms.

Figure 3.

Convergence performance of the multi-agent QL, multi-agent DQL, and MADDQL algorithms.

Show All

Fig. 4 demonstrates the learning iteration performance comparison of the proposed MADDQN-based TCPS method rewards with three baseline approaches, including DRL-UAV/RIS, DDQN RIS-UAV/PS, and DDQN+PSO RIS-UAV. The proposed method, which optimizes phase shift and trajectory simultaneously, shows strong convergence in reward and achieves greater performance compared to other approaches in terms of learning iteration performance. In addition, due to the advantage that RIS can learn and optimize its configuration based on feedback from UAV, the MADDQN-based TCPS method can achieve significant performance improvements. A DRL-UAV/RIS performs significantly lower than others because RIS allows for more diverse training scenarios, resulting in a better-trained system, as well as dynamic adaptation by adjusting the phase shifts of RIS at a more granular level.

Figure 4.

Learning iteration performance of the MADDQL-based TCPS, DRL-UAV/RIS, DDQN RIS-UAV/PS, and DDQN+PSO RIS-UAV.

Show All

Fig. 5 illustrates IoTD's data rate performance using the MADDQL-based TCPS scheme compared to baseline schemes across various time slots. It demonstrates how the data rate fluctuates due to the dynamic wireless environment, influenced by the balance between coverage and communication quality. This variability arises from suboptimal UAV positions and RIS phase shifts. This improvement is because the proposed method allows for learning from mistakes by taking joint actions to enhance each IoTD data rate. Initially, the UAV was assumed to be placed in the center of the target area, resulting in a slightly lower overall average data rate. However, as IoTDs move and are distributed throughout the target area, the data rate in some timeslots decreases, leading to suboptimal positioning of the UAV and phase shifts in the RIS, affecting the balance between communication coverage and signal quality. The proposed method repositions the UAV location based on signals and changes the phase shift value of RIS until the IoTD achievable data rate exceeds the threshold value or acceptable achievable data rate R_th=1000 kbps. As a result, MADDQL algorithms enhance communication services in dynamic and complex environments. In addition, the proposed method achieves a higher average data rate than other methods. Due to learning instability, the estimation of expected rewards using a MADDQL-based TCPS approach has significantly improved communication service quality.

Figure 5.

Average achievable data rate performance of the MADDQL-based TCPS and compared with other baseline schemes.

Show All

Fig. 6 shows the proposed approach is compared to other baseline approaches for average achievable data rates. For 75% of IoTDs in the MADDQN proposed approach, the average achievable data rates occurred in the range of 1.6 Mbps and 2 Mbps, whereas 75% of IoTDs in DRL-UAV/RIS experienced average achievable data rates occurring in the range of 1 Mbps and 1.2 Mbps, whereas for 75% of IoTDs in the DDQN+PSO RIS-UAV, average achievable data rates occurred in the range of 1.3 Mbps and 1.7 Mbps, whereas for 75% of IoTDs in the DDQN RIS-UAV/PS approach, average achievable data rates occurred in the range of 1.2 Mbps and 1.6 Mbps. This value indicates that the proposed scheme can significantly increase the average achievable data rate over the baseline methods. By deploying RIS with phase shift, the system gains the ability to enhance communication links through strategic consideration of RIS phase shift based on environmental conditions and the location of the UAV. As a result of this approach, the UAV can maximize its data rate for IoTDs with the assistance of RIS, thereby creating a more efficient communication method.

Figure 6.

The distribution of average achievable data rate performance in different methods within 500 time slots.

Show All

Fig. 7 shows a comparison of the communication coverage score obtained by the proposed MADDQL-based TCPS scheme with basslines schemes. Based on the simulation findings, as the learning iterations increase, the coverage score also increases across all approaches we analyzed. Specifically, when the time slots are sufficiently expanded, the proposed method achieves 98.6% coverage. In comparison, the DDQN+PSO RIS-UAV methods achieved a coverage score of 95%, the DDQN RIS-UAV/PS method reached 93%, and the DRL-based UAV/ RIS approach achieved a coverage score of 91.5%. The UAV system without RIS communication coverage is lower than others due to the environment, which has a blockage. Due to the blockage, some IoTDs received unsatisfactory data rates or some IoTDs did not connect to the UAV. This paper approach is more effective than others in maximizing communication coverage scores due to the following reasons: 1) With the assistance of RIS, blocked IoTDs can connect to the UAV and receive an acceptable achievable data rate; 2) considering the optimization of the UAV trajectory and RIS phase shift, agents taking joint action to achieve acceptable achievable data rate, it helps IoTDs receive a strong signal; 3) the proposed approach also ensures the optimal UAV position and RIS phase shift to improve the communication quality of IoTDs. Thus, the proposed approach achieves a better communication score than others.

Figure 7.

Communication coverage performance of the DDQL model and comparison with other methods.

Show All

Fig. 8 presents the performance of the proposed method with different RIS numbers of elements. The average achievable data rate, as discussed here, represents the average of the single-IoTD achievable data rates across all stages in a full episode. With an increased number of elements in RIS, the average achievable data rate for IoTDs also increases. However, the transition from N = 25 to N = 49 shows only a slight gain from adding more RIS elements. Yet, as the number of elements continues to increase, notably at N = 81 and N = 100, there is a significant rise in the system average achievable data rate. This improvement is attributed to the increased opportunities for agents to learn and optimize system navigation effectively.

Figure 8.

The comparison of the proposed approach average achievable data rate with different numbers of RIS elements.

Show All

However, as shown in Table 4, if the number of RIS elements increases, the computational time also increases. Based on our analysis, N = 81 is the optimal number of RIS elements in this proposed method.

TABLE 4 Proposed Model Computational Time of Different RIS Elements

In Fig. 9, we analyze the communication coverage score of the proposed method with different RIS element numbers. The simulation demonstrates that the MADDQN-based TCPS consistently surpasses N = 81 and N = 100 RIS elements. Furthermore, increasing the number of RIS elements typically enhances the system coverage. More RIS elements allow for better manipulation of the radio environment, which can extend the coverage area. This means that UAV can maintain a reliable communication link over larger distances or in areas with obstacles that would otherwise block the signal. As indicated in Table 4, the more RIS elements are present, the greater the time takes to compute. This analysis suggests that N = 81 is the most suitable number of RIS elements for this proposed method.

Figure 9.

Communication coverage scores of the proposed approach with different numbers of RIS elements.

Show All

In Fig. 10, the UAV trajectories are represented by the green line, the IoTDs (20 in this simulation) are represented by the black line, and the RIS1 and RIS2 are represented by the red and blue rectangles, respectively. UAV continuously fly between 30 m and 50 m altitude in the target area. We observed that UAV fly at lower altitudes when IoT devices are concentrated; while when IoT devices are spread out across the target area. UAV fly at higher altitudes to maintain communication links. It can be seen from 10(a) that the UAV flight trajectory is longer to get the acceptable achievable rate. In Fig. 10(b), the UAV flies towards the RIS and then hovers near it to significantly improve the acceptable rate by exploiting the reflection of the RIS element rather than flying towards the ground IoTDs. Therefore, with the assistance of the RIS, the flight trajectory of the UAV is shorter. Additionally, in the proposed TCPS strategy, UAV change their flight trajectories based on the received state information to maximize system rewards.

Figure 10.

Trajectories of UAV according to the proposed algorithm with 20 IoTDs in different timeslot. (a) Without RIS. (b) With RIS.

Show All

Fig. 11 analyzes the average communication coverage across various methods as the number of IoTDs varies. The simulation demonstrates that MADDQN-based TCPS consistently surpasses all baseline methods and nearly achieves maximum coverage. Furthermore, as the number of IoTDs rises, the average communication coverage declines steadily. This trend occurs because as IoTDs increase, the UAV ability to provide satisfactory services to each IoTD diminishes, resulting in a decline in signal quality.

Figure 11.

Impact of the number of IoTDs on communication coverage.

Show All

In Fig. 12, the averages of the achievable rates for different RIS elements are compared for a variety of IoTD numbers according to the proposed MADDQN-based TCPS method. It can be seen that as the number of IoTDs increases, the average achievable data rate decreases after optimization. Both RIS-100 and RIS-81 significantly outperform RIS-89 and RIS-25 in terms of achievable rate.

Figure 12.

Average achievable rate comparison under different RIS elements for different numbers of IoTDs.

Show All

Based on the simulation findings, the new system demonstrates strong communication capabilities in terms of communication coverage and ensures a satisfactory achievable data rate. UAV positions are adjusted with the right RIS phase shift until all IoTDs are served and their data rates surpass a set limit. Consequently, this approach enhances wireless communication efficiency, benefiting IoTDs in terms of coverage and data rates.

SECTION VI.

Conclusion

This paper has explored the 3D-UAV trajectory and RIS phase shift optimization in the context of RIS-assisted UAV-enabled wireless communications systems with mobile Internet of Things devices. It has proposed a multi-agent DDQN-based 3D trajectory for UAV and RIS phase shift control schemes. This scheme is modeled as a multi-agent Markov decision process aiming at maximizing communication coverage and ensuring satisfactory achievable data rates in dynamic settings. Specifically, the UAV and RIS controllers operate as learning agents. Each agent trains its own DDQN model based on locally observed state data. The simulations indicate that the multi-agent DDQN-based TCPS approach substantially improves communication coverage and average achievable data rates for IoTDs compared to baseline approaches. For further enhancement of network performance, future research will develop an RIS-assisted multi-UAV wireless network framework by considering UAV energy consumption using federated learning by considering interference.

References is not available for this document.

Multi-Agent Deep Reinforcement Learning Based Optimizing Joint 3D Trajectories and Phase Shifts in RIS-Assisted UAV-Enabled Wireless Communications

Abstract:

Metadata

Abstract:

Funding Agency:

Introduction

Related Work