Introduction
The rapid progress of Beyond 5G and 6G technologies in wireless communication systems is paving the way for fresh opportunities in optimization [1]. These upcoming systems introduce demanding requirements such as higher data rates, improved bandwidth, lower latency, expanded coverage, enhanced spectral efficiency, and reduced power and energy consumption. Innovative methodologies have been introduced to meet these requirements, including NOMA, massive MIMO, and low-power wide area networks (LPWAN). Nevertheless, implementing these advancements in practice faces significant obstacles, primarily due to heightened complexity and substantial hardware costs, particularly at millimeter-wave (mmWave) frequencies.
Smart wireless communication environments offer superior adaptability and mobility compared to terrestrial systems. However, maintaining line-of-sight communication in dynamic and dense urban areas poses challenges for mobile and IoT devices connecting to base stations. Reconfigurable Intelligent Surfaces (RIS) provide a promising solution by manipulating wireless signal propagation through reflection or refraction [2]. Recent advancements in intelligent radio systems have sparked research into RIS applications in Beyond-5G and 6G wireless networks.
The security of wireless communications during public events is vital for critical services like emergency response, security surveillance, and public health management. Adversarial jamming poses significant threats, undermining link quality and necessitating anti-jamming measures like RIS mounted on UAVs (Aerial RIS). This strategy is crucial for robust communication, especially in remote areas or densely populated urban hubs. Disruptions affect sectors like industrial automation, healthcare, military, emergency response, and connected vehicles, all reliant on wireless networks. In healthcare, wireless communication supports services like remote surgery and telehealth but is vulnerable to attacks and disruptions. Emergency communications are susceptible to jamming, hindering rescue efforts. Smart grid networks face threats from cyber attackers and environmental factors, leading to power outages and equipment failures.
Different methodologies have been explored for leveraging RIS affixed to buildings or UAVs deployed as relays to optimize communication metrics in non-hostile environments [3], [4], [5], [6], [7]; however, limited attention has been paid to hostile scenarios. Recent studies have integrated RIS with UAVs to fine-tune communication parameters, considering challenges posed by single jammers in dynamic environments [8], [9], [10], [11], [12], [13], [14], [15]. Among the UAV-borne RIS anti-jamming solutions, the work by Hou et al. [8] stands out for utilizing a Reinforcement Learning (RL) technique known as the Dueling Double Deep Q-Network (D3QN), while the two other studies presented in [13] and [14] employ conventional Alternating Optimization (AO) techniques. The authors of [8] and [13] add a further layer of complexity by considering the scalability and mobility of mobile devices. One of the existing works, [7], presents a multi-clustered IoT scenario served by multiple fixed RIS without considering device-to-RIS association, clustering, involvement of UAVs, or presence of jammers. Similarly, [12] employs multiple UAV-mounted RIS relays to serve multiple device clusters, but it carries out fixed association without any swarm optimization or consideration of jammers.
An alternative approach to anti-jamming in multi-user scenarios is outlined in [15], which employs the Win or Learn Fast-Policy Hill-Climbing (WoLF-PHC) algorithm, utilizing an RIS fixed on a building to enhance the system rate and protect transmission against a single jammer. Similarly, the study in [16] addresses multiple jammers by employing a single UAV transmitter with a fixed RIS installed on a building to assist communication with a ground user. Some anti-jamming works utilize multiple UAV relays without RIS, such as [17], which focuses on securing a single ground-based target from aerial jammers, and [18], which targets multiple jammers affecting ground-based IoT devices. The solution by [13] stands out by deploying multiple UAV-borne RIS relays to counter a single jammer interrupting a single mobile device.
We observe that existing swarm UAV-borne RIS-based solutions either fail to adapt to the dynamic nature of dense multi-user wireless scenarios or do not consider this aspect at all, particularly regarding real-time changes in the distribution and number of devices and jammers, as well as fluctuations in UAV resources. Current approaches that employ fixed strategies do not handle the dynamics and scalability of the system effectively: a low number of UAVs may not offer sufficient coverage in some settings, while a high number can increase energy consumption in others. They also fail to ensure energy efficiency and optimal coverage, as continuous operations are interrupted by battery outages or failures. Moreover, none of these solutions address the threats of multiple jammers to multiple devices.
Our work addresses these gaps by proposing an RL-based adaptive swarm formation and clustering mechanism for anti-jamming to secure 6G wireless communications in densely crowded urban environments, such as public events, stadiums, and shopping malls, which are vulnerable to unknown jammers threatening critical public services like crowd management, emergency response, and security. This approach creates a flexible UAV-borne RIS swarm that dynamically adjusts the number of UAVs and the UAV-to-device clustering to real-time environmental changes, ensuring optimal coverage and energy efficiency. By utilizing real-time data to adapt the UAV swarm, this solution mitigates the impact of multiple unknown jammers through a multi-objective optimization strategy that maximizes the sum rate and minimizes energy consumption. It does so by optimizing the UAV-to-device association, the trajectories of the UAV-borne RIS, the RIS passive beamforming through phase shifts, and the base station transmit power, accommodating variations in the distribution and number of jammers and mobile devices, while ensuring continuous operations with a UAV recharging and swapping facility.
Therefore, the first significant unique aspect is scalable and dynamic swarm formation and clustering under changing conditions in a dense multi-user environment, such as variation in the distribution or quantity of jammers and the number of devices, where a single UAV or a fixed swarm of UAVs is insufficient to provide effective coverage. The RL-driven dynamic and adaptive UAV swarm formation and device clustering ensures that a sufficient number of UAVs are deployed to mitigate the effects of jammers on the devices while conserving energy. As the negative effect of jammers increases, more UAVs are dynamically added to the service while maintaining energy efficiency.
The second point of distinction is incorporating the Multi-Objective Optimization approach. Existing RL-based anti-jamming solutions, assisted by RIS, in single-user or multi-user scenarios have predominantly focused on a single objective, such as achievable rate maximization [8], transmission rate optimization [14], SINR improvement [13], or minimization of energy consumption [16]. Although a few solutions have tried to achieve multiple objectives concurrently, such as optimizing both the sum rate and the system protection level [15], they do so by combining conventional techniques with RL. Another anti-jamming solution, presented in [19], also solves a multi-objective optimization, but only to counter the jamming effect on a single device using a single UAV-mounted RIS relay.
The third aspect that sets our solution apart from the existing ones is the concept of UAVs recharging at docking stations, with swapping when their batteries expire or fail, which is one of the possible options presented by [20]. The only work in the literature that provides the UAV recharging concept is presented by [21]; however, that solution applies RL to a UAV-assisted Intrusion Detection scenario, not to the wireless anti-jamming scenario.
In our proposed 6G scenario, we address the challenges of maximizing the sum rate and minimizing the energy consumption of UAV-borne RIS relays by using a centralized RL agent at the base station with Proximal Policy Optimization (PPO) to optimize device-to-UAV associations, UAV trajectories, RIS beamforming, and base station transmit power. Adaptive UAV swarm formation and dynamic clustering mitigate jammer impact, balancing coverage and energy conservation, while the UAV recharging and swapping facility ensures continuous and stable operations.
The contributions of our work are described as follows:
We introduce a novel anti-jamming solution to protect critical 5G Advanced and/or 6G communications at densely crowded venues like stadiums, airports, and sporting arenas, with numerous vulnerable mobile devices targeted by multiple independent and unknown jammers that operate autonomously, while the system lacks complete knowledge of these disruptive sources. Our approach employs multiple UAV-mounted RIS relay platforms for adaptive swarm formation and device clustering.
Our goal is to maximize the sum rate while minimizing the UAV usage for energy efficiency by dynamically optimizing the swarm formation of UAVs and device clustering by association with adaptive in-flight UAV-borne RIS relays, trajectories of UAVs, passive beamforming via RIS phase shifting, and adjusting base station transmission power to counter multiple jammers.
The optimization problem poses significant computational complexity due to the highly dynamic and scalable nature of the system, in which the distribution and number of jammers and devices may change. This requires a strategy that adapts to variations and guarantees self-optimization of the different parameters. Therefore, we propose a Reinforcement Learning (RL) technique called Proximal Policy Optimization (PPO) for multi-objective optimization.
The proposed system model incorporates UAV swapping with a charging dock facility for recharging Swarm UAV batteries when they expire. This novel addition to the existing scenario guarantees practical, continuous, and seamless operations for UAVs, ensuring uninterrupted service and robust communications for mobile devices within a 6G wireless cellular region.
To demonstrate the superiority of our RL-based approach, we conduct comprehensive simulations against configurations from related works and baselines, which include fixed RIS installations, fixed device-to-UAV associations, random device-to-UAV association, single-objective sum rate optimization, and UAV-mounted RIS with random phase shifting. These simulations cover diverse system configurations with variations in the distribution areas and numbers of jammers and mobile devices.
The remaining paper is structured as follows. In Section II, we provide a concise review of the applications of RL techniques to address jamming threats in wireless communication systems. Additionally, we explore recent studies that have employed single and swarm UAV-mounted RIS platforms to enhance the communication efficiency in dense and dynamic adversarial environments. Section III presents the system model with problem formulation, providing detailed insights into the structure of Reinforcement Learning-based solution. In Section IV, we present the proposed implementation, covering the simulation setup, procedures, and results. Lastly, Section V offers a conclusion of this work.
Related Works
The challenge of mitigating the impact of jammers in wireless communication networks has been a focal point of research for several years. Traditional approaches to combat jamming, including adaptive rate/power control, cognitive radio, and spread spectrum, have effectively reduced jamming threats [22]. However, against highly adaptive jamming threats, their resilience remains limited in the face of the increasing complexity of 5G Advanced and 6G networks. The classical Spread Spectrum technique faces challenges due to the strict spectral efficiency requirements of 5G cellular networks. A more recent MIMO-based anti-jamming approach effectively addresses jamming effects; however, it necessitates channel state information about the jammers [23].
Reinforcement Learning (RL) operates within a Markov Decision Process (MDP), allowing agents to adapt their policies based on environmental feedback and thereby overcome the constraints of static schemes [24]. RL-based anti-jamming methods excel in dynamic environments. RL techniques have also been used to optimize wireless communications with Reconfigurable Intelligent Surfaces (RIS), but they often neglect jamming threats. In [3] and [25], Deep Reinforcement Learning (DRL) optimizes UAV-borne RIS-assisted systems using techniques like Proximal Policy Optimization (PPO) and Decaying Deep Q-Network (DQN). Similarly, [5] applies DRL to NOMA communications, yet fails to address jamming threats.
Recent research explores leveraging UAV-borne RIS for wireless communication optimization. For example, Xu et al. [26] optimize the UAV-borne RIS trajectory and transmit power with DDPG to maximize total throughput in UAV-powered IoT networks. Samir et al. [9] minimize the expected sum Age-of-Information (AoI) using PPO by optimizing the RIS phase shift, UAV altitude, and scheduling. Khalili et al. [10] employ DDQN in UAV-assisted RIS HetNets to minimize the total transmit power. However, all of these solutions overlook jamming threats.
There are several techniques that aim to counter the impacts of jammers and eavesdroppers in wireless networks. Liu et al. [27] introduced HIF-DRL for frequency channel selection. In [28], ADRLA with RCNN addressed multiple jammer scenarios. Xiao et al. [29] proposed Hot-booting Q-learning for power allocation in MIMO NOMA systems. Additionally, Xiao et al. [30] presented a 2-D frequency-space anti-jamming scheme for multiple jammers.
Anti-Jamming challenges also extend to emerging vehicular communication systems like the UAV-assisted VANET [31]. Li et al. [32] introduced a DRL-based anti-jamming technique, while Yao and Jia [33] proposed the Collaborative Multi-agent Anti-jamming Algorithm (CMAA). Slimeni et al. [34] suggested On-Policy Synchronous Q-learning (OPSQ-learning) for real-time avoidance of jammed channels. Peng et al. [35] employed Multi-Dimensional Anti-Jamming Reinforcement Learning (MDAJRL) for UAV communication systems. Abuzainab et al. [36] introduced a QoS-aware routing protocol based on Actor-Critic DQN to navigate around communication holes due to jamming. Ye et al. [37] presented an anti-jamming solution based on PDDQN. However, none of these utilize RIS devices for anti-jamming assistance.
Reconfigurable Intelligent Surface (RIS) uses software-controlled planar surfaces with passive elements to alter wireless propagation, enhancing communication performance with low energy consumption and enabling scalable deployment [2] in smart wireless environments.
In a study by Yu et al. [38], a fixed RIS is integrated into a NOMA framework using LMIDDPG, enhancing energy efficiency in MEC systems [6]. The latter involves a UAV-borne base station communicating with a mobile device through an RIS, optimizing data rate and energy efficiency using DDPG and DQN [15]. H. Zhao et al. [16] address multiple jammers by deploying a fixed RIS on stationary objects, aiding UAV-mounted transmitters in communication while mitigating jammer effects. However, these solutions lack flexibility in RIS movement and focus solely on maximizing energy efficiency, which limits them in catering to the dynamic and scalable nature of the environment, such as varying numbers of mobile devices or jammers.
Therefore, UAV-mounted RIS deployment has been explored for anti-jamming in dynamic environments, particularly against single jammers [8], [13]. Since dynamic RIS installation on UAVs has been shown to outperform fixed deployment [15], the work in [14] deploys a UAV-borne RIS relay between ground base stations (GBSs) and ground users (GUs), employing Alternating Optimization (AO) and Manifold Optimization (MO) to optimize parameters against a single jammer. Similarly, the authors in [8] and [19] also optimize parameters to maximize communication rates against a single jammer, using the Dueling Double Deep Q-Network (D3QN) and Deep Deterministic Policy Gradient (DDPG) algorithms, respectively. However, these techniques lack the dynamism and scalability to handle variations in devices, jammers, and mobility in dense and dynamic wireless environments.
Recent works also explore the use of Swarm UAVs for performance optimization, including anti-jamming in dense, dynamic scenarios. Studies in [17] and [18] deploy Swarm UAVs without RIS for performance optimization, using the conventional Alternating Optimization (AO) technique. The authors in [13] introduce a system wherein multiple UAV-mounted RIS relays are employed to counter the effects of a single jammer, using an AO-based Relax-and-Retract algorithm to maximize the SINR at a mobile device. Meanwhile, the works in [11] and [12] employ Swarm UAV-borne RIS platforms to improve communication parameters without considering jammers, using conventional optimization techniques; the latter [12] also involves static clustering through device association with the UAV-RIS platforms. Nevertheless, these Swarm UAV-mounted RIS-based solutions overlook dynamic swarm formation and clustering, hindering system dynamism and scalability. Moreover, these solutions do not leverage Reinforcement Learning (RL) or consider multiple objectives in optimization. A summary of closely related works is presented in Table 1.
In contrast to existing approaches, our proposed solution utilizes RL-based swarm UAV-borne RIS relays with dynamic swarm formation and clustering through device-to-UAV association, UAV swapping with a battery recharge facility, which addresses the dynamism and scalability challenges in wireless systems due to variations in distribution or quantity of jammers or devices. Leveraging Proximal Policy Optimization (PPO), our solution enables joint optimization of multiple parameters of Swarm UAV-mounted RIS, achieving multiple objectives across multiple mobile devices amidst multiple jammers. This includes maximizing the sum rate and minimizing energy consumption by clustering optimization through dynamic device-to-UAV association, UAV swapping, and battery recharging. This ensures optimal performance while maintaining uninterrupted operations and conserving energy by minimal deployment of UAVs.
Adaptive Swarm Formation and Clustering through Aerial-RIS in Multi-Jammers Environments
In a dynamic and dense urban environment, we consider a 6G wireless communication system incorporating MIMO technology (Figure 1). This system is a crucial communication infrastructure for public events in densely populated smart cities. It facilitates connectivity for multiple clusters of mobile devices, including those used by logistics, emergency response, law enforcement, security, health safety, crowd management, and citizens. The base station B operates on 6G technology, ensuring efficient communication and data services. Despite these advantages, the system faces disruptions from multiple jammers $J_{k}$ ($k \in [1,K]$).
System Model: Swarm UAV-borne RIS-assisted 6G Wireless Communications in a Dynamic and Dense Environment with Multiple Jammers.
A. System Model
We define a rectangular coordinate system representing a portion of a 6G micro-cell with the Base Station B installed at the origin, establishing maximum limits for the axes to define the mobility constraints for the Swarm Aerial RIS platforms.
Mobile devices have the freedom of movement within predefined boundaries of the cell. The Swarm UAVs serve I mobile devices, forming clusters to ensure Quality of Service (QoS). The RL control unit of the Swarm is stationed at Base Station B. This anti-jamming communications system operates in a dynamic and unknown environment, which implies continuous and uncertain changes in the environmental conditions and variables. The agent, or observer, does not possess full information regarding the conditions of the environment or its future evolution.
The base station B, multiple mobile devices $D_{i}$ ($i \in [1,I]$), Swarm UAV-borne RIS platforms $U_{c}$ ($c \in [1,C]$), and multiple jammers $J_{k}$ ($k \in [1,K]$) constitute the entities of the considered system.
We consider frequency-selective fading models for all communication channels in our scenario. Ground-based communications are modeled using the Rayleigh fading model, incorporating non-line-of-sight components arising from multi-path effects caused by obstacles [39]. Moreover, we also assume that the communication channels between the Base Station B and the Swarm UAV-mounted RIS platforms, between the platforms and the mobile devices, and between the jammers and the platforms follow the Rician fading model with dominant line-of-sight (LoS) components.
1) Distance and Channel Model
The distance model is first defined as follows. We define the distances between the ground base station B and mobile device $D_{i}$, between B and UAV-borne RIS platform $U_{c}$, between $U_{c}$ and $D_{i}$, between jammer $J_{k}$ and $D_{i}$, and between $J_{k}$ and $U_{c}$ as \begin{align*} d_{BD_{i}} & = \sqrt {(x_{B} - x_{D_{i}})^{2} + (y_{B} - y_{D_{i}})^{2} + (0)^{2}} \tag {1a}\\ d_{BU_{c}} & = \sqrt {(x_{B} - x_{c})^{2} + (y_{B} - y_{c})^{2} + (z_{c})^{2}} \tag {1b}\\ d_{U_{c}D_{i}}& = \sqrt {(x_{c} - x_{D_{i}})^{2} + (y_{c} - y_{D_{i}})^{2} +(z_{c})^{2}} \tag {1c}\\ d_{J_{k}D_{i}}& =\sqrt {(x_{D_{i}} - x_{J_{k}})^{2} + (y_{D_{i}} - y_{J_{k}})^{2} + (0)^{2}} \tag {1d}\\ d_{J_{k}U_{c}}& =\sqrt {(x_{c} - x_{J_{k}})^{2} + (y_{c} - y_{J_{k}})^{2} + (z_{c})^{2}} \\ & \quad \forall c \in [1,C]; \forall k\in [1,K]; \forall i\in [1,I] \tag {1e}\end{align*}
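As an illustration, the distances of equations (1a)-(1e) reduce to Euclidean norms over 3-D coordinates (ground nodes at altitude zero). The sketch below uses hypothetical node positions, not the paper's simulation values:

```python
import numpy as np

def distance(p: np.ndarray, q: np.ndarray) -> float:
    """Euclidean distance between two 3-D points (ground nodes have z = 0)."""
    return float(np.linalg.norm(p - q))

# Hypothetical coordinates (metres): base station at the origin,
# one mobile device D_i, one UAV U_c at altitude z_c, one jammer J_k.
B   = np.array([0.0,   0.0,  0.0])
D_i = np.array([120.0, 80.0, 0.0])
U_c = np.array([60.0,  40.0, 50.0])
J_k = np.array([150.0, 60.0, 0.0])

d_BD = distance(B, D_i)    # Eq. (1a)
d_BU = distance(B, U_c)    # Eq. (1b)
d_UD = distance(U_c, D_i)  # Eq. (1c)
d_JD = distance(J_k, D_i)  # Eq. (1d)
d_JU = distance(J_k, U_c)  # Eq. (1e)
```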
We use various parameters to denote channel gains within the communication system. The channel gain of each link at time slot n is the product of its large-scale fading component $\Delta$ and its small-scale fading component $\hat {h}$:\begin{align*} h_{BD_{i}}[n]& = \Delta _{BD_{i}}[n] \hat {h}_{BD_{i}}[n] \tag {2a}\\ h_{BU_{c}}[n]& = \Delta _{BU_{c}}[n] \hat {h}_{BU_{c}} [n] \tag {2b}\\ h_{U_{c}D_{i}}[n]& = \Delta _{U_{c}D_{i}}[n] \hat {h}_{U_{c}D_{i}}[n] \tag {2c}\\ h_{J_{k}D_{i}}[n]& = \Delta _{J_{k}D_{i}}[n] \hat {h}_{J_{k}D_{i}}[n] \tag {2d}\\ h_{J_{k}U_{c}}[n]& = \Delta _{J_{k}U_{c}}[n] \hat {h}_{J_{k}U_{c}}[n] \tag {2e}\end{align*}
\begin{align*} \Delta _{BD_{i}}[n]& = \sqrt {L_{0}d_{BD_{i}}^{-\eta }[n]} \tag {3a}\\ \Delta _{BU_{c}}[n] & = \sqrt {L_{0}d_{BU_{c}}^{-\alpha }[n]} \tag {3b}\\ \Delta _{U_{c}D_{i}}[n] & = \sqrt {L_{0}d_{U_{c}D_{i}}^{-\alpha }[n]} \tag {3c}\\ \Delta _{J_{k}D_{i}}[n]& = \sqrt {L_{0}d_{J_{k}D_{i}}^{-\eta }[n]}; \tag {3d}\\ \Delta _{J_{k}U_{c}}[n]& = \sqrt {L_{0}d_{J_{k}U_{c}}^{-\alpha } [n]} \tag {3e}\end{align*}
In equations (3a - 3e), $L_{0}$ denotes the path loss at the reference distance, while $\eta$ and $\alpha$ denote the path-loss exponents of the ground and air-ground links, respectively.
The small-scale fading vectors with Line-of-Sight (LoS) components, modeled as Rician fading with Rician factor $K_{1}$, are given as:\begin{align*} \hat {h}_{BU_{c}}[n] =\sqrt {\frac {K_{1}}{K_{1}+1}}\bar {h}_{BU_{c}}[n]; \tag {4a}\\ \hat {h}_{U_{c}D_{i}}[n] = \sqrt {\frac {K_{1}}{K_{1}+1}}\bar {h}_{U_{c}D_{i}}[n]; \tag {4b}\\ \hat {h}_{J_{k}U_{c}}[n] = \sqrt {\frac {K_{1}}{K_{1}+1}}\bar {h}_{J_{k}U_{c}}[n]; \tag {4c}\end{align*}
\begin{align*} \bar {h}_{BU_{c}}[n] = [e^{j{\psi _{1}^{c}}}, e^{j{\psi _{2}^{c}}}, e^{j{\psi _{3}^{c}}}, e^{j{\psi _{4}^{c}}},\ldots,e^{j{\psi _{M}^{c}}}] \tag {5a}\\ \bar {h}_{U_{c}D_{i}}[n] = [e^{j{\omega _{1,i}^{c}}}, e^{j{\omega _{2,i}^{c}}}, e^{j{\omega _{3,i}^{c}}},\ldots,e^{j{\omega _{M,i}^{c}}}] \tag {5b}\\ \bar {h}_{J_{k}U_{c}}[n] = [e^{j{\phi _{k,1}^{c}}}, e^{j{\phi _{k,2}^{c}}}, e^{j{\phi _{k,3}^{c}}},\ldots,e^{j{\phi _{k,M}^{c}}}] \tag {5c}\end{align*}
Rayleigh fading, i.e., small-scale fading with non-LoS components, applies to the ground links and is given as:\begin{align*} \hat {h}_{BD_{i}}[n] = \sqrt {\frac {1}{K_{1}+1}}\bar {h}_{BD_{i}}[n]; \tag {6a}\\ \hat {h}_{J_{k}D_{i}}[n] = \sqrt {\frac {1}{K_{1}+1}}\bar {h}_{J_{k}D_{i}}[n] \tag {6b}\end{align*}
Combining the large-scale and small-scale components, the channel gains can be expressed as follows:\begin{align*} h_{BD_{i}}[n]& =\sqrt {L_{0}d_{BD_{i}}^{-\eta }[n]} \sqrt {\frac {1}{K_{1}+1}} CN(0,1); \tag {7a}\\ h_{BU_{c}}[n]& =\sqrt {L_{0}d_{BU_{c}}^{-\alpha }[n]} \sqrt {\frac {K_{1}}{K_{1}+1}} \bar {h}_{BU_{c}}[n]; \tag {7b}\\ h_{U_{c}D_{i}}[n]& =\sqrt {L_{0}d_{U_{c}D_{i}}^{-\alpha }[n]} \sqrt {\frac {K_{1}}{K_{1}+1}}\bar {h}_{U_{c}D_{i}}[n]; \tag {7c}\\ h_{J_{k}D_{i}}[n]& =\sqrt {L_{0}d_{J_{k}D_{i}}^{-\eta }[n]} \sqrt {\frac {1}{K_{1}+1}} CN(0,1) \tag {7d}\\ h_{J_{k}U_{c}}[n]& =\sqrt {L_{0}d_{J_{k}U_{c}}^{-\alpha }[n]} \sqrt {\frac {K_{1}}{K_{1}+1}} \bar {h}_{J_{k}U_{c}}[n] \tag {7e}\end{align*}
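The composite gains of equations (7a)-(7e) can be sampled as the product of the large-scale term and the Rician (LoS) or Rayleigh (NLoS) small-scale term. The sketch below is a minimal illustration; the values of $L_{0}$, the path-loss exponents, the Rician factor $K_{1}$, and the distances are assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)

L0, eta, alpha, K1 = 1e-3, 3.0, 2.2, 10.0  # illustrative values

def large_scale(d: float, exponent: float) -> float:
    """Large-scale fading Delta = sqrt(L0 * d^-exponent), Eqs. (3a)-(3e)."""
    return np.sqrt(L0 * d ** (-exponent))

def rician_los(M: int) -> np.ndarray:
    """LoS small-scale component: unit-modulus phase vector scaled by the
    Rician factor, Eqs. (4a)-(5c)."""
    psi = rng.uniform(0, 2 * np.pi, M)
    return np.sqrt(K1 / (K1 + 1)) * np.exp(1j * psi)

def rayleigh_nlos() -> complex:
    """NLoS small-scale component: scaled CN(0,1) sample, Eqs. (6a)-(6b)."""
    z = (rng.standard_normal() + 1j * rng.standard_normal()) / np.sqrt(2)
    return np.sqrt(1 / (K1 + 1)) * z

M = 16                                               # RIS reflecting elements
h_BU = large_scale(78.1, alpha) * rician_los(M)      # Eq. (7b): vector of length M
h_BD = large_scale(144.2, eta) * rayleigh_nlos()     # Eq. (7a): scalar gain
```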
2) Sum Rate
In the given channel model, the Sum Rate represents the total rate across all mobile devices within the cluster c associated with UAV-borne RIS $U_{c}$ at time slot n:\begin{equation*} R^{c}_{sum}[n] = B\sum _{i=1}^{I}\log _{2}(1+SINR^{c}_{i}[n]) \tag {8}\end{equation*}
The SINR of mobile device $D_{i}$ in cluster c is given as \begin{align*} SINR^{c}_{i} & = \left ({{\frac {P_{B} \left |{{h_{BD_{i}} + h_{BU_{c}} \Phi _{c} h_{U_{c}D_{i}} }}\right |^{2}} {\sum _{k=1}^{K} P_{J_{k}} |h_{J_{k}D_{i}} + h_{J_{k}U_{c}} \Phi _{c} h_{U_{c}D_{i}}|^{2} + \sigma ^{2}} }}\right) \\ & \quad \forall i \in [1,I], \forall c \in [1,C], \forall k \in [1,K] \tag {9}\end{align*}
Substituting the SINR of equation (9) into equation (8) over the communication bandwidth B, the sum rate of cluster c becomes \begin{align*} R^{c}_{sum}[n] & = B \sum _{i=1}^{I}\log _{2} \left ({{1 + }}\right. \\ & \left.{{\frac {P_{B} \left |{{h_{BD_{i}} + h_{BU_{c}} \Phi _{c} h_{U_{c}D_{i}} }}\right |^{2}} {\sum _{k=1}^{K} P_{J_{k}} |h_{J_{k}D_{i}} + h_{J_{k}U_{c}} \Phi _{c} h_{U_{c}D_{i}}|^{2} + \sigma ^{2}} }}\right) \tag {10}\end{align*}
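A minimal numerical sketch of equations (9) and (10), with randomly drawn channel vectors and hypothetical power, noise, and bandwidth values (none taken from the paper's setup); the jamming terms enter the interference part of the SINR:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical dimensions and powers (illustrative only)
M, I, K = 16, 4, 2                 # RIS elements, devices in cluster c, jammers
B_Hz    = 1e6                      # bandwidth (Hz)
P_B     = 1.0                      # base station transmit power (W)
P_J     = np.full(K, 0.5)          # jammer powers (W)
sigma2  = 1e-9                     # noise power

Phi  = np.diag(np.exp(1j * rng.uniform(0, 2 * np.pi, M)))  # RIS phases, Eq. (16)
h_BU = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) * 1e-3
h_UD = (rng.standard_normal((I, M)) + 1j * rng.standard_normal((I, M))) * 1e-3
h_BD = (rng.standard_normal(I) + 1j * rng.standard_normal(I)) * 1e-4
h_JD = (rng.standard_normal((K, I)) + 1j * rng.standard_normal((K, I))) * 1e-4
h_JU = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) * 1e-3

def sum_rate() -> float:
    """Eq. (10): Shannon sum rate over the cluster, jammers in the denominator."""
    rate = 0.0
    for i in range(I):
        sig = P_B * abs(h_BD[i] + h_BU @ Phi @ h_UD[i]) ** 2              # Eq. (9) numerator
        jam = sum(P_J[k] * abs(h_JD[k, i] + h_JU[k] @ Phi @ h_UD[i]) ** 2
                  for k in range(K))                                      # Eq. (9) denominator
        rate += B_Hz * np.log2(1 + sig / (jam + sigma2))
    return rate

R = sum_rate()
```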
3) Energy Consumption Model
The total energy consumption $E_{U}$ of the Swarm UAVs is calculated as the sum of the RIS communication energy and the UAV propulsion energy:\begin{equation*} E_{U}= \sum _{c=1}^{C}(E^{c}_{RIS}+ E^{c}_{U}); \forall c \in [1,C] \tag {11}\end{equation*}
Communication energy of RIS phase shifting is given as:\begin{equation*} E^{c}_{RIS}= \sum _{n=1}^{N}\sum _{k=1}^{K} S_{k} [n] MP^{RIS_{c}} \tag {12}\end{equation*}
Given the negligible energy consumption of RIS phase shifting compared to UAV propulsion, the total energy consumption simplifies to:\begin{equation*} E_{U}= \sum _{c=1}^{C}E^{c}_{U}; \forall c \in [1,C] \tag {13}\end{equation*}
The energy consumed for propulsion by a single rotary wing UAV $U_{c}$ is given as \begin{align*} E^{c}_{U} & = \sum _{n=1}^{N}\delta _{t} \left ({{ \underset {Blade \: Profile}{\underbrace {P_{0} \left ({{1 + \frac {3v_{c}[n]^{2}}{U_{tip}^{2}}}}\right)}} + \underset {Parasite}{\underbrace {c_{0} v_{c}[n]^{3}}} }}\right) \\ & \quad + \sum _{n=1}^{N}\delta _{t} \underset {Induced \: Power} {\underbrace {P_{1} \left ({{ \sqrt {\sqrt {1 + \frac {v_{c}[n]^{4}}{4v_{0}^{2}}} - \frac {v_{c}[n]^{2}}{2v_{0}^{2}}} }}\right) }} \tag {14}\end{align*}
In equation (14), $P_{0}$ and $P_{1}$ denote the blade profile power and the induced power in hovering, $U_{tip}$ is the rotor blade tip speed, $v_{0}$ is the mean rotor induced velocity in hover, $c_{0}$ is the parasite power coefficient, and $\delta _{t}$ is the duration of a time slot.
The UAV velocity $v_{c}[n]$ in time slot n is computed from the UAV displacements as \begin{equation*} v_{c}[n] = \sqrt {\frac {(dx_{c}[n])^{2} + (dy_{c}[n])^{2} + (dz_{c}[n])^{2}}{\delta _{t}}} \tag {15}\end{equation*}
The speed of each UAV is bounded by its maximum achievable velocity.
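The propulsion model of equations (14) and (15) can be sketched as follows; the parameter values follow commonly cited rotary-wing settings in the UAV literature and are illustrative only, not the paper's simulation parameters:

```python
import numpy as np

# Illustrative rotary-wing parameters (assumed, not from the paper)
P0, P1  = 79.86, 88.63   # blade profile / induced power in hover (W)
U_tip   = 120.0          # rotor blade tip speed (m/s)
v0      = 4.03           # mean rotor induced velocity in hover (m/s)
c0      = 0.0118         # parasite power coefficient
delta_t = 1.0            # time-slot duration (s)

def propulsion_power(v: float) -> float:
    """Instantaneous propulsion power at speed v: blade profile, parasite,
    and induced terms, mirroring the three terms of Eq. (14)."""
    blade    = P0 * (1 + 3 * v**2 / U_tip**2)
    parasite = c0 * v**3
    induced  = P1 * np.sqrt(np.sqrt(1 + v**4 / (4 * v0**2)) - v**2 / (2 * v0**2))
    return blade + parasite + induced

def slot_velocity(dx: float, dy: float, dz: float) -> float:
    """Eq. (15): per-slot velocity from the UAV displacements."""
    return np.sqrt((dx**2 + dy**2 + dz**2) / delta_t)

# Energy over 10 slots at a constant displacement of (5, 3, 1) m per slot, Eq. (14)
v = slot_velocity(5.0, 3.0, 1.0)
E = sum(delta_t * propulsion_power(v) for _ in range(10))
```

At hover (v = 0) the parasite term vanishes and the power reduces to P0 + P1, a quick sanity check on the implementation.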
4) RIS Passive Beamforming
A phase shift matrix represents the reflection coefficients of all M reflecting units in the RIS mounted on UAV $U_{c}$ at time slot n:\begin{equation*} \Phi _{c}[n]=diag\{e^{j\theta _{1}^{c,n}}, e^{j\theta _{2}^{c,n}}, e^{j\theta _{3}^{c,n}}, e^{j\theta _{4}^{c,n}},\ldots, e^{j\theta _{M}^{c,n}}\} \tag {16}\end{equation*}
To address the computational complexity of phase shift design in large-scale RIS, we propose using a physical model that maps the RIS reflection coefficient to the beamforming direction, thus reducing passive beamforming calculations and synchronizing with real-time channel state changes. The phase shift $\theta _{m}$ of the m-th reflecting element is computed as \begin{align*} & \theta _{m} = \\ & \text {mod} \left ({{-\frac {2\pi }{\lambda _{w}} \left ({{ \left ({{\sin {\theta _{t}}\cos {\varphi _{t}} + \sin {\theta _{r}} \cos {\varphi _{r}} }}\right)\left ({{m - \frac {1}{2}}}\right) d_{x} }}\right. }}\right. \\ & \quad + \left.{{ \left.{{ \left ({{\sin {\theta _{t}}\sin {\varphi _{t}} + \sin {\theta _{r}}\sin {\varphi _{r}}}}\right)\left ({{m - \frac {1}{2}}}\right) d_{y} }}\right), 2\pi }}\right) \tag {17}\end{align*}
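A minimal sketch of the geometric mapping in equations (16) and (17); the wavelength, element spacings, and incident/reflection angles below are assumed illustrative values:

```python
import numpy as np

# Assumed geometry (illustrative): wavelength lam, half-wavelength spacing,
# incident angles (theta_t, phi_t) and desired reflection angles (theta_r, phi_r)
lam      = 0.01            # wavelength (m), e.g. around 30 GHz
d_x = d_y = lam / 2        # element spacings
theta_t, phi_t = np.deg2rad(30), np.deg2rad(0)    # incident elevation/azimuth
theta_r, phi_r = np.deg2rad(45), np.deg2rad(90)   # reflection elevation/azimuth

def element_phase(m: int) -> float:
    """Phase shift theta_m in [0, 2*pi) of reflecting element m, per Eq. (17)."""
    u = (np.sin(theta_t) * np.cos(phi_t) + np.sin(theta_r) * np.cos(phi_r)) * (m - 0.5) * d_x
    v = (np.sin(theta_t) * np.sin(phi_t) + np.sin(theta_r) * np.sin(phi_r)) * (m - 0.5) * d_y
    return float(np.mod(-2 * np.pi / lam * (u + v), 2 * np.pi))

# Assemble the diagonal phase shift matrix of Eq. (16)
M = 16
Phi = np.diag(np.exp(1j * np.array([element_phase(m) for m in range(1, M + 1)])))
```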
B. Problem Formulation
Our primary goal is to maximize the sum rate for mobile devices and improve energy efficiency for swarm UAVs to counter multiple jammers. This involves strategically moving UAV-borne RIS platforms, dynamically adjusting RIS phase shifts, base station transmit power, and device-to-UAV associations, while docking UAVs that are inactive for a specified duration.
The second objective of energy consumption is achieved in two ways: first, by optimizing the trajectory and movements of the Swarm UAVs, and second, by minimizing the number of UAVs in flight.
The trajectories of the UAVs are adapted dynamically in all three dimensions during each time slot n, denoted as $\omega _{c}[n] = \{x_{c}[n], y_{c}[n], z_{c}[n]\}$.
Therefore, the problem is formulated as\begin{align*} & \pmb {P1:} \quad \max _{\omega _{c}, \Phi _{c}, \lambda ^{c}_{i}, P_{B} } \sum _{n=1}^{N} \sum _{c=1}^{C} \frac {\sum _{i=1}^{I}\lambda ^{c}_{i}[n] R_{i}^{c}[n]}{E_{U}^{c}[n]} \\ & \textrm {s.t.} \quad C_{1}:\omega _{c}[n] = \\ & \qquad \{x_{c}[n], y_{c}[n], z_{c}[n]\} \cdot 1_{\left ({{\sum _{i=1}^{I} \lambda ^{c}_{i}[n]\gt 0 \& \sum _{n=1}^{h+1} E_{U}^{c} \leq B_{L}}}\right)} \\ & \qquad \quad + \{x_{d}, y_{d}, 0\} \cdot 1_{\left ({{\sum _{i=1}^{I} \lambda ^{c}_{i}[n]=0 \parallel \sum _{n=1}^{h+1} E_{U}^{c} \gt B_{L} }}\right)}; \\ & \qquad \qquad \qquad \quad \forall c \in [1,C], \{n, h\} \in [1,N] \\ & \quad C_{2}: X_{min} \leq x_{c}[n] \leq X_{max}; \forall c \in [1,C] \\ & \quad C_{3}: Y_{min} \leq y_{c}[n] \leq Y_{max}; \forall c \in [1,C] \\ & \quad C_{4}: Z_{min} \leq z_{c}[n] \leq Z_{max}; \forall c \in [1,C] \\ & \quad C_{5}: \theta _{m}^{c}[n] \in [0,2\pi ], \forall m \in [1,M], c \in [1,C] \\ & \quad C_{6}: \sum _{c=1}^{C} \lambda ^{c}_{i}[n] = 1;\forall c \in [1,C] \\ & \quad C_{7}: \lambda ^{c}_{i}[n] \in \{0,1\}; \forall c \in [1,C] \\ & \quad C_{8}: \sum _{n=j+1}^{j+p} \underset {c \neq c'}{\sum _{c=1}^{C}} \sum _{i=1}^{I} \lambda ^{c}_{i}[n] = I \cdot p - \\ & \qquad \quad \left ({{\sum _{n=j+1}^{j+p} \sum _{i=1}^{I} \lambda ^{c'}_{i}[n]}}\right)\cdot 1_{\left ({{\sum _{i=1}^{I} \lambda ^{c'}_{i}\gt 0 \parallel \sum _{n=h+1}^{j} E_{U}^{c'} \leq B_{L}}}\right)} \\ & \qquad \qquad \qquad \forall \{c, c'\} \in [1,C], \{n, j, h, p\} \in [1,N] \\ & \quad C_{9}: \parallel \omega _{c}[n] - \omega _{c'}[n]\parallel ^{2} \geq d_{min}^{2} \cdot 1_{(\{ \omega _{c}, \omega _{c'} \} \neq \{x_{d}, y_{d}, 0\})} \\ & \qquad \qquad \qquad \forall \{c, c'\} \in [1,C] \\ & \quad C_{10}: 0 \leq P_{B}[n] \leq P^{max}_{B} \tag {18}\end{align*}
Constraint $C_{1}$ governs the UAV position: a UAV remains in flight while it serves at least one device and its accumulated energy consumption stays within the battery limit $B_{L}$; otherwise it returns to the docking station at $\{x_{d}, y_{d}, 0\}$. Constraints $C_{2}$-$C_{4}$ bound the UAV movement along each axis, $C_{5}$ limits the RIS phase shifts to $[0, 2\pi]$, and $C_{6}$ and $C_{7}$ ensure that each device is associated with exactly one UAV through the binary variables $\lambda ^{c}_{i}$. Constraint $C_{8}$ reassigns the devices of a docked UAV to the remaining active UAVs, $C_{9}$ enforces a minimum inter-UAV safety distance $d_{min}$ between airborne UAVs, and $C_{10}$ bounds the base station transmit power by $P^{max}_{B}$.
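The per-slot objective of P1 and the association constraints $C_{6}$-$C_{7}$ can be illustrated with a small sketch; `objective_step` and `valid_association` are hypothetical helper names, not the paper's implementation:

```python
def objective_step(rates, energy, assoc):
    """Per-slot, per-cluster contribution to P1 in Eq. (18): the sum of the
    rates of associated devices divided by the UAV's energy consumption."""
    return sum(a * r for a, r in zip(assoc, rates)) / energy

def valid_association(assoc_matrix):
    """Constraints C6 and C7: entries are binary, and each device (column)
    is served by exactly one UAV (row)."""
    return all(
        sum(col) == 1 and all(a in (0, 1) for a in col)
        for col in zip(*assoc_matrix)
    )

# Two UAVs serving three devices: devices 0 and 2 -> UAV 0, device 1 -> UAV 1
assoc = [[1, 0, 1],
         [0, 1, 0]]
ok = valid_association(assoc)                                    # satisfies C6, C7
contribution = objective_step([2.0e6, 3.0e6, 5.0e6], 4.0e3, assoc[0])
```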
C. RL for Adaptive Swarm Aerial-RIS Anti-Jamming
Our problem involves a complex mixed-integer non-convex multi-objective optimization challenge, with continuous variables such as the UAV trajectories $\omega _{c}$, the RIS phase shifts $\Phi _{c}$, and the base station transmit power $P_{B}$, together with the binary device-to-UAV association variables $\lambda ^{c}_{i}$.
Solving a non-convex optimization problem is challenging with incomplete information about jammer locations, device trajectories, and varying numbers of devices and jammers. To address these uncertainties, we propose an RL-equipped control unit at Base Station B to manage UAV movements, RIS phase shifts, device-to-UAV associations, and transmit power, allowing swift adaptation to unforeseen challenges without restarting the learning process.
The problem is formulated as a Markov Decision Process (MDP), detailed in subsection III-C3. To discover an optimal control policy to maximize the Sum Rate for mobile users and enhance UAV energy efficiency, we employ a model-free Deep Reinforcement Learning (DRL) technique called Proximal Policy Optimization (PPO). The PPO algorithm operates without requiring prior information about the UAV-borne-RIS locations and RIS phase shift coefficients.
Compared to Deep Q-Network (DQN) [42], Proximal Policy Optimization (PPO) is better suited for continuous control stochastic problems due to its policy-based approach [43]. While DQN excels with small, discrete action spaces, our problem involves a multi-parameter continuous action space, making PPO a more appropriate choice. Additionally, PPO has higher sample efficiency, requiring fewer training samples to converge, which is crucial when data collection is resource-intensive or time-consuming.
Considering policy-based algorithms, Deep Deterministic Policy Gradient (DDPG) [43], an earlier Actor-Critic variant, is computationally inefficient and slow compared with Proximal Policy Optimization (PPO), which is better suited for high-dimensional action spaces and offers smoother learning and greater stability. While Trust Region Policy Optimization (TRPO) [44] is effective in non-stationary environments, PPO is preferred for its lower computational cost, faster convergence, and performance in high-dimensional settings.
We disregard the computational energy of the RL algorithm, as it is deemed insignificant. The computational energy associated with RL algorithms like PPO is linked to the complexity of the DNN model used. Given the low complexity of the model, the impact of computational energy on overall energy consumption is negligible.
1) Proximal Policy Optimization (PPO)
PPO can relax intricate constraints by substituting them with softer, adaptable ones. The primary objective of the PPO algorithm can be expressed as:\begin{align*} J^{PPO} = \mathbb {E}_{s,a} \left [{\min \left ({\eta A^{\pi _{old}},\ \mathrm {clip}(\eta, 1-\epsilon, 1+\epsilon)A^{\pi _{old}}}\right)}\right ] \tag {19}\end{align*}
\begin{equation*} \eta = \frac {\pi (a,s)}{\pi _{old}(a,s)} \tag {20}\end{equation*}
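For concreteness, the clipped surrogate of Equations 19 and 20 can be sketched numerically as follows; this is a minimal illustration over a batch of samples, and the function name and interface are ours, not the paper's implementation:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate of Eq. 19 for a batch of (state, action) samples.

    eta = pi(a,s) / pi_old(a,s) is computed from log-probabilities (Eq. 20);
    taking min(...) keeps each update inside the ratio band [1-eps, 1+eps].
    """
    eta = np.exp(logp_new - logp_old)                       # probability ratio
    unclipped = eta * advantages
    clipped = np.clip(eta, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))          # E_{s,a}[min(...)]
```

When the new and old policies coincide the ratio is 1 and the objective reduces to the mean advantage; when the ratio drifts outside the band, the clipped term caps the incentive to move further.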
The advantage estimation function quantifies the superiority of a specific action in a given state compared to others [45] and is given as follows:\begin{equation*} A^{\pi } = Q(s_{t}, a_{t}) - \upsilon ^{\pi }(s_{t}) \tag {21}\end{equation*}
Here, $Q(s_{t}, a_{t})$ is the action-value function and $\upsilon ^{\pi }(s_{t})$ is the state-value function under policy $\pi$, while $\eta$ in Equation 20 is the probability ratio between the updated policy $\pi$ and the old policy $\pi _{old}$.
The PPO clip range $\epsilon$ restricts the ratio $\eta$ to the interval $[1-\epsilon, 1+\epsilon]$, preventing excessively large policy updates and stabilizing training.
Algorithm 1 PPO Algorithm Deployed at Base Station B
Initialization:
Initialize the locations of the C Swarm Aerial-RIS platforms;
Randomly initialize the locations of the K jammers spread around in the vicinity of the I mobile devices;
Initialize the phase shift matrices of the RIS elements for each Aerial-RIS;
PPO Training:
Initialize random parameters of the actor and critic networks;
for each episode n from 1 to N do
    for each iteration (step) i from 1 to I do
        Obtain current state;
        Select action from Equation 22;
        Calculate the UAV-borne RIS locations from the displacements of the UAVs;
        Calculate the RIS phase shift matrices from Equation 16, based on the angles of the RIS mounted on each UAV;
        Calculate the Sum Rate of the devices associated with each UAV;
        Calculate the Energy Consumption from Equation 14;
        Normalize the values of the Sum Rate and Energy Consumption;
        Calculate the Penalty from Eq. 26;
        Observe the Reward from Eq. 25;
        Observe the next state;
        Compute the Advantage Estimates from Eq. 21;
        Update the actor network;
        Update the critic network;
        Capture the UAV locations and Energy Consumption for the iteration;
    end for
    Update the Aerial-RIS locations;
    Update the device locations;
    Update the RIS phase shift matrices;
    Update the UAV battery levels;
end for
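The inner loop of Algorithm 1 can be mirrored by an environment class whose step function applies the action, accrues penalties, and emits the scalarized reward. The sketch below uses illustrative placeholder dynamics (class, variable names, and constants are ours); in the paper, the sum rate and energy come from Equations 24a and 14, and such a class would be wrapped as a Gym environment for PPO:

```python
import numpy as np


class AerialRISEnvSketch:
    """Minimal sketch of the environment driving Algorithm 1.

    All dynamics are illustrative stand-ins for the paper's models: the real
    sum rate follows Eq. 24a and the real energy model follows Eq. 14.
    """

    def __init__(self, num_uavs=2, num_devices=4, w=0.5, cell_half_width=200.0):
        self.C, self.I, self.w = num_uavs, num_devices, w
        self.bound = cell_half_width
        self.reset()

    def reset(self):
        self.uav_pos = np.zeros((self.C, 3))      # Swarm Aerial-RIS locations
        self.battery = np.ones(self.C)            # normalized battery levels
        self.prev_rate = 0.0
        rng = np.random.default_rng(0)
        self.devices = rng.uniform(-100, 100, (self.I, 2))
        return self._state()

    def _state(self):
        # device xy, previous sum rate, UAV xyz, battery levels (Sec. III-C3)
        return np.concatenate([self.devices.ravel(), [self.prev_rate],
                               self.uav_pos.ravel(), self.battery])

    def step(self, action, rate, energy):
        """action[1:4] = UAV displacement; `rate` and `energy` stand in for
        the values Algorithm 1 computes from Eqs. 24a and 14."""
        self.uav_pos[0] += np.asarray(action)[1:4]
        self.battery -= 0.01 * energy
        penalty = float(np.any(np.abs(self.uav_pos) > self.bound))  # p_loc-style
        reward = self.w * rate - (1 - self.w) * energy - penalty    # Eq. 25
        self.prev_rate = rate
        done = bool(np.any(self.battery <= 0.0))
        return self._state(), reward, done
```

The state vector here matches the four components listed in the MDP formulation; the penalty shown covers only a boundary-style term, whereas the paper combines five penalty terms (Eq. 26).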
2) Multi-Objective Optimization Problem
The challenge involves optimizing two objectives simultaneously, i.e., maximizing the Sum Rate of the mobile devices while minimizing the energy consumption of the Swarm Aerial-RIS platforms.
Given the single-objective optimization focus of PPO, we address our multi-objective problem using Scalarization, commonly linked with Pareto Optimization. This method expresses the multiple objectives as a weighted sum to find non-dominated, Pareto-optimal solutions, balancing the trade-offs between objectives. Our approach involves multi-objective optimization, aiming to maximize the sum rate and minimize the energy consumption of the Swarm Aerial-RIS platforms, employing PPO to uncover non-dominated solutions. In the subsequent section, we frame the multi-objective optimization within the Markov Decision Process (MDP) framework.
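As a minimal illustration of the Pareto notion used here, candidate solutions can be screened for non-dominance by treating each as a pair of "higher is better" objectives, e.g. (sum rate, negative energy). The helper below is illustrative, not the paper's code:

```python
def pareto_front(points):
    """Return the non-dominated points among `points`.

    A point q dominates p if q is at least as good as p in every objective
    and strictly better in at least one; all objectives are maximized here,
    so energy should be passed in negated.
    """
    front = []
    for i, p in enumerate(points):
        dominated = any(
            all(q[k] >= p[k] for k in range(len(p))) and
            any(q[k] > p[k] for k in range(len(p)))
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            front.append(p)
    return front
```

Sweeping the scalarization weight w and filtering the resulting (rate, −energy) pairs with such a function yields the trade-off curve from which a balanced operating point can be chosen.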
3) General MDP Formulation
The Markov Decision Process (MDP) formulation consists of a state space, an action space, and a reward function aligned with the mixed-integer continuous decision space of the PPO algorithm, as shown in Figure 3. The objective involves balancing two goals: maximizing the sum rate while minimizing the UAV energy consumption.
System Model: Swarm UAV-RIS assisted 6G communications for a dynamic and dense wireless environment with multiple jammers.
The environment comprises a base station B, multiple jammers, multiple mobile devices, and a swarm of C Aerial-RIS platforms.
The state or observation space comprises the locations of the mobile devices, the sum rate of the previous time slot, the locations of the Swarm Aerial-RIS platforms, and the UAV battery levels.
The reward function encompasses the sum rate, the energy consumption of the Swarm Aerial-RIS platforms, and penalties for constraint violations.
The state S, action A, and reward R are defined as follows:
State
The state $s \in S$ is composed of:
- Selected mobile device $D_{i}$ location: $\omega _{D_{i}}[n] = \{x_{D_{i}}[n], y_{D_{i}}[n], 0 \}\ \forall i \in [1,I]$;
- Sum Rate $R^{c}_{sum}[n-1]$ of the mobile devices of the clusters associated with $U_{c}$ at the previous time slot $(n-1)$;
- Locations of the Swarm Aerial-RIS platforms $\{\omega _{1}[n], \omega _{2}[n],\ldots, \omega _{C}[n]\}$;
- Battery levels of the Swarm UAVs $\{ BL_{1}[n], BL_{2}[n],\ldots, BL_{C}[n] \}$.
Action
The action space $a \in A$ for each iteration or time step $i \in [1,I]$ comprises four sub-actions: the Device-to-UAV association parameter $a^{n}_{\lambda _{i}}$, the UAV-RIS trajectory $a^{n}_{U_{i}}$, the RIS phase shift angles $a^{n}_{R_{i}}$, and the base station transmit power $a^{n}_{P_{B}}$ at time slot n, denoted as $a[n] = \{a^{n}_{\lambda _{i}}, a^{n}_{U_{i}}, a^{n}_{R_{i}}, a^{n}_{P_{B}} \}$ and expressed in Equation 22.\begin{align*} a_{\lambda _{i}}^{n} & = \{ \lambda _{i}[n] \} \\ a_{U_{i}}^{n} & = \{D^{i}_{x}[n], D^{i}_{y}[n], D^{i}_{z}[n]\} \\ a_{R_{i}}^{n} & = \{\theta ^{i}_{t}[n], \varphi ^{i}_{t}[n], \theta ^{i}_{r}[n], \varphi ^{i}_{r}[n] \} \\ a_{P_{B}}^{n} & = \{ P_{B}[n]\} \tag {22}\end{align*}
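Since each sub-action in Equation 22 lives in a different physical range, a raw policy output in $[-1, 1]$ must be rescaled before being applied. One possible mapping is sketched below; the ranges are illustrative assumptions, apart from the 46 m displacement limit taken from the simulation setup:

```python
import numpy as np

def scale_action(a, max_disp=46.0, p_max=1.0):
    """Map a raw policy output a in [-1, 1]^9 to the physical sub-actions
    of Eq. 22 (association, displacement, RIS angles, BS transmit power).
    Ranges here are illustrative, not the paper's exact bounds."""
    a = np.asarray(a, dtype=float)
    assoc  = (a[0] + 1.0) / 2.0                # lambda_i: association in [0, 1]
    disp   = a[1:4] * max_disp                 # D_x, D_y, D_z in metres
    angles = (a[4:8] + 1.0) * np.pi / 2.0      # theta_t, phi_t, theta_r, phi_r
    power  = (a[8] + 1.0) / 2.0 * p_max        # P_B in [0, P_max]
    return assoc, disp, angles, power
```

This kind of affine rescaling is the usual companion to a tanh output layer, which is what the actor network described later in the paper produces.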
The trajectories of the Swarm Aerial-RIS platforms $U_{i}$ are given as three continuous parameters, $D^{i}_{x}$, $D^{i}_{y}$, and $D^{i}_{z}$, representing the UAV displacements along the x, y, and z axes within the rectangular coordinate system centered at the Base Station B. The beamforming direction of the RIS elements is adjusted through the phase shifts $\theta _{m}$. Nevertheless, as the number of RIS reflecting units M increases, the action space for the RIS phase shift values $\theta _{m}^{i}$ also becomes extensive. To address this challenge, the beamforming directions of the RIS elements are calculated indirectly through $\theta _{t}^{i}$, $\varphi _{t}^{i}$, $\theta _{r}^{i}$, and $\varphi _{r}^{i}$, representing the angles between the incident signal and the x axis, the incident signal and the z axis, the reflected signal and the x axis, and the reflected signal and the z axis, respectively, as illustrated in Figure 1. As per [41], the corresponding value $\theta _{m}^{i}$ can be readily derived from Equation 23 by determining the beamforming direction, where the value for element m is extracted from its respective row and column.\begin{equation*} \theta _{col,row}^{i} = \mathrm {mod}\left ({-\frac {2\pi }{\lambda _{w}} \left ({\left ({\sin {\theta _{t}^{i}}\cos {\varphi _{t}^{i}} + \sin {\theta _{r}^{i}} \cos {\varphi _{r}^{i}} }\right) \left ({row - \frac {1}{2}}\right) d_{x}^{i} + \left ({\sin {\theta _{t}^{i}}\sin {\varphi _{t}^{i}} + \sin {\theta _{r}^{i}}\sin {\varphi _{r}^{i}}}\right) \left ({col - \frac {1}{2}}\right) d_{y}^{i} }\right), 2\pi }\right) \tag {23}\end{equation*}
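Equation 23 maps directly to code; the sketch below (function name ours) computes the phase of element (row, col) given the four beam angles, the unit-cell dimensions, and the signal wavelength:

```python
import numpy as np

def ris_phase(row, col, theta_t, phi_t, theta_r, phi_r, dx, dy, wavelength):
    """Per-element RIS phase shift from Eq. 23, wrapped to [0, 2*pi)."""
    sx = np.sin(theta_t) * np.cos(phi_t) + np.sin(theta_r) * np.cos(phi_r)
    sy = np.sin(theta_t) * np.sin(phi_t) + np.sin(theta_r) * np.sin(phi_r)
    phase = -2.0 * np.pi / wavelength * (sx * (row - 0.5) * dx +
                                         sy * (col - 0.5) * dy)
    return np.mod(phase, 2.0 * np.pi)
```

Evaluating this over all rows and columns yields the full phase shift matrix from only four scalar angles, which is exactly how the action space stays compact as M grows.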
Here, $d_{x}^{i}$ and $d_{y}^{i}$ represent the length and width of each unit cell of the RIS. These dimensions range between $\frac {\lambda _{w}}{10}$ and $\frac {\lambda _{w}}{2}$, within the sub-wavelength scale, where $\lambda _{w}$ denotes the wavelength of the signal.
Reward
The evaluation of the action is given by the reward $r \in R_{w}$. It is a function of the sum rate at the mobile devices $D_{i}\ \forall i \in [1,I]$ and the energy consumed by the UAV-borne RIS platforms $E_{U_{c}}\ \forall c \in [1,C]$, as follows:\begin{align*} r[n] = \sum _{c=1}^{C} \frac {R_{sum}^{c}[n]}{E_{U_{c}}[n]} = \sum _{c=1}^{C} \frac {\log _{2}\left ({1 + SINR_{i}^{c}[n] }\right)}{E_{U_{c}}[n]} \tag {24a}\end{align*}
The reward r encapsulates multiple objectives, namely the sum rate R and the energy consumption $E_{U_{c}}$ of the Swarm Aerial-RIS platforms, thus fitting into Multi-Objective Reinforcement Learning. Scalarization is required, involving normalizing the objectives and representing them as an additive reward function akin to a weighted sum of normalized objective values. We draw guidance from a similar problem addressed in [46] to shape our reward. The scalarized reward function is formulated as follows:\begin{equation*} r = w R_{sum}^{norm} - (1 - w) E_{U_{c}}^{norm} - p \tag {25}\end{equation*}
We iteratively tune the weight w in equation 25 to find the optimal balance between sum rate and energy consumption of Swarm Aerial RIS platforms, yielding a non-dominant solution. Constraints on UAV movements, battery levels, inter-UAV collisions, and device-UAV associations at the charging dock introduce a penalty p in the reward function, represented as a linear combination of penalties for violating these constraints in equation 26:
\begin{equation*} p = p_{loc} + p_{batt} + p_{coll} + p_{R_{min}} + p_{dock} \tag {26}\end{equation*}
These penalties are assessed for breaching the constraints in the problem formulation. The location boundary penalty ($p_{loc}$) is imposed when a UAV tries to cross the wireless cell boundary. The battery expiration penalty ($p_{batt}$) occurs when a UAV's battery usage exceeds its limit. The inter-UAV collision penalty ($p_{coll}$) applies when two UAVs collide during flight, but not at rest on the docking station. The minimum data rate penalty ($p_{R_{min}}$) is imposed when any device receives a data rate below the threshold of 5 Mbps. Lastly, the docking penalty ($p_{dock}$) is charged when the RL algorithm assigns a device to a recharging UAV.
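Equations 25 and 26 combine into a single scalar reward; a minimal sketch follows, where the boolean-flag interface and the unit penalty weight are illustrative choices rather than the paper's exact parameterization:

```python
def scalarized_reward(rate_norm, energy_norm, w,
                      out_of_bounds, battery_dead, collision,
                      below_min_rate, docked_assignment,
                      penalty_weight=1.0):
    """Reward of Eq. 25 with the penalty of Eq. 26.

    rate_norm and energy_norm are the normalized objectives; each flag is
    True when the corresponding constraint is violated.
    """
    p = penalty_weight * sum(map(float, (out_of_bounds, battery_dead,
                                         collision, below_min_rate,
                                         docked_assignment)))   # Eq. 26
    return w * rate_norm - (1.0 - w) * energy_norm - p          # Eq. 25
```

A single constraint violation subtracts a full penalty unit, which under normalized objectives dominates the reward and steers the agent away from infeasible actions.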
The pseudo-code of the proposed PPO-based algorithm is given in Algorithm 1. The algorithm is designed to run for N episodes, and each episode comprises I iterations (steps).
System Evaluation
A. Simulation Setup
Utilizing the PPO implementation from the Stable Baselines package, built on the OpenAI Baselines libraries, we configure the simulation environment to train the PPO model within a customized wireless communications context. This section elaborates on the environment, encompassing its parameters and hyperparameters. Given the non-convex and computationally complex nature of our problem, the PPO algorithm emerges as a valuable tool for its resolution. PPO is an actor-critic-based, lightweight method using a Deep Neural Network (DNN). Both the actor and the critic use fully connected neural networks with two hidden layers of 32 neurons each. ReLU activation functions are used in the two hidden layers, while a linear activation function is used for the output layer of the critic network. The output layer of the actor network uses tanh as an activation function to scale the output to the range $[-1, 1]$.
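Under the Stable-Baselines3 API (the maintained successor of the Stable Baselines package mentioned above), the described network configuration might be set up roughly as follows; the environment name and all hyperparameter values other than the network shape are illustrative assumptions:

```python
import torch
from stable_baselines3 import PPO

# Custom wireless-communications Gym environment, assumed defined elsewhere.
env = AerialRISWirelessEnv()

policy_kwargs = dict(
    activation_fn=torch.nn.ReLU,               # ReLU in both hidden layers
    net_arch=dict(pi=[32, 32], vf=[32, 32]),   # two 32-neuron layers each
)

model = PPO(
    "MlpPolicy",
    env,
    policy_kwargs=policy_kwargs,
    clip_range=0.2,            # PPO clip range epsilon (illustrative value)
    verbose=1,
)
model.learn(total_timesteps=200_000)           # illustrative training budget
```

Note that `net_arch` only shapes the shared MLP; the tanh scaling of the actor output described above would be applied when mapping the policy output to the physical action ranges.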
The Base Station B, the Swarm Aerial-RIS platforms, the jammers, and the mobile devices are initialized at their respective locations within the simulated 6G cell. The mobility of the Swarm Aerial-RIS platforms and of the mobile devices is confined to the cell region, as described below.
B. Simulation Experiments
The simulations aim to find an optimal solution through multi-objective optimization.
Figure 5 shows the blue box as the boundary constraint for the mobility of mobile devices. In contrast, the red box illustrates the spatial limitations for the movements of Swarm Aerial-RIS platforms, guaranteeing their confinement within the designated 6G cellular region.
Swarm Aerial-RIS platforms are limited to a maximum displacement of 46 meters, while mobile devices can move up to 20 meters per episode. Jammers are distributed within a 120 meter radius around the initial mobile device locations. The system restricts spatial movements, UAV-to-Device associations during UAV recharging, UAV battery levels, minimum data rates for mobile devices, and inter-UAV collisions. The simulation covers approximately
During PPO agent training, numerous simulations showed convergence for cumulative reward, sum rate, and energy consumption. The optimal scalarization weight w, identified as the cliff point, is set to 0.5. This point represents a balanced, non-dominant solution that maximizes the sum rate while minimizing energy consumption.
We initialize the PPO algorithm with specific hyperparameters: a reward discount factor of
In the following subsections, we conduct simulations to evaluate the efficiency of the proposed system model, comparing it with related works and baseline scenarios while examining the effects of different parameters and configurations.
Fixed Multiple RIS Installations: Comparison to a scenario presented in a related work [7]. The authors present a multi-cluster wireless-powered Internet of Things (WP-IoT) network assisted by multiple RIS installed on stationary objects like buildings to maximize the sum throughput under fewer constraints.
Fixed Device-to-UAV Association: Comparison of the system model proposed with another related work [12] where device clusters association with the UAV-mounted RIS platforms is fixed.
Random Device-to-UAV Association: We conduct a comparative evaluation of the proposed system model with a baseline scenario where the device-to-UAV association is randomized instead of being controlled by the RL agent.
Single Objective Optimization for Sum Rate Maximization ($w=1$): We evaluate the system performance for sum rate optimization only, with the scalarization weight w set to 1. The scenario is similar to those presented in related works [11] and [12], where the authors utilize multiple UAV-mounted RIS platforms to serve multiple mobile devices toward a single objective.
Random RIS Phase Shifting: Comparing the performance of the proposed system model with a baseline employing random RIS phase shifting.
Simulations of the aforementioned scenarios are carried out to compare and verify the performance of our system model.
1) Impact of Variation in the Jammers Distribution
We begin by investigating the impact of varying the distribution areas of jammers experimentally, comparing the performance and adaptability of the swarm UAV-borne RIS system with different configurations in related works and the baselines to gain useful insights. The number of devices is set to
We observe the correlation between jammer proximity to mobile devices and the sum rate, particularly within 120 to 720 meters, where jamming negatively impacts received signals. As the jammer distribution increases with expanding cell size, the sum rate also decreases due to greater distances from the base station despite an even jammer distribution. However, Figure 6 suggests that effective optimization is still achievable even in scenarios with jammers nearby.
Results for evaluating the impact of varying the jammers distribution on the proposed Swarm UAV-borne RIS scenario (a) Cumulative Reward (b) Sum Rate (c) Energy Consumption (d) Base Station Transmit Power. Adaptive UAV Swarm Formation in (e) Proposed solution and (f) Alternate Scenarios. Comparative bar plots for (g) Average Sum Rate and (h) Average Energy Consumption.
The proposed system model achieves convergence for parameters like a cumulative reward, sum rate, energy consumption, and base station transmission power as shown in subplots 6(a) to 6(d).
However, the energy consumption trend shown in subplot 6(c) shows an increase due to the higher mobility of UAVs to counter the greater challenge of jammers spread in the cellular region. Moreover, the base station power increases with the increase in the jammer distribution region shown in subplot 6(d) to overcome the increased coverage challenge for the UAVs. This shows that UAV-mounted RIS platforms can successfully overcome the increased challenge of jammer distribution.
The subplot 6(e) shows dynamic swarm formation with the increase in UAV participation activity as jammer distribution increases along with the cell size. This is triggered by the slight decline in the sum rate due to the increased distance of devices from the Base Station and UAV-mounted RIS relays.
As depicted in subplot 6(f), the adaptive UAV activity trends show significant increases with more jammers in the 6G micro-cell. Both our proposed solution and the Random RIS Phase Shifts display this trend, unlike the baseline scenarios (Random and Fixed Device-to-UAV Associations), which show higher UAV activity with smaller jammer distributions but lower sum rates and higher energy consumption due to sub-optimal UAV-device associations. In Single Objective Optimization, UAV activity increases similarly, though it results in slightly higher energy consumption and lower data rates in scenarios with fewer jammers, emphasizing the need for balanced objective considerations.
The performance of our proposed swarm Aerial-RIS solution and other baselines remains consistent across different jammer distribution radii ranging from 120 to 720 meters, as illustrated in Figures 6(g) and 6(h). Although the Fixed Multi-RIS Installation sometimes gives a better sum rate with nearly zero energy consumption, this comes randomly without adapting to the dynamic nature of the wireless environment. Our solution outperforms the Fixed and Random Device-to-UAV Association baselines introduced in related works [12], along with other configurations, in terms of average sum rate and energy consumption. This is due to RL agent optimizing the swarm UAV formation and dynamic clustering through device-to-UAV association, ensuring that a minimum number of UAVs are deployed to achieve the maximum sum rate, while the other cases do not optimize clustering. While configurations like Single Objective Optimization achieve higher data rates, they do so at the expense of slightly increased energy consumption, especially at larger jammer distributions, as they do not optimize this aspect. The Random RIS Phase Shifts configuration shows similar trends, with very low sum rates and higher energy consumption compared to our proposed solution.
We also demonstrate in Figure 7 that our proposed system model maximizes the sum rate and minimizes energy consumption while keeping the combined satisfaction level across all system constraints above 75% in most cases. This shows that the proposed RL-based solution adapts well to varying spatial circumstances in wireless cellular environments, providing flexibility and adaptability while respecting all system constraints. Constraint satisfaction is likewise maintained in all the subsequent cases discussed in this section.
Comparative evaluation of the impact of varying the jammers distribution areas on the respect for multiple constraints in Swarm UAV-borne RIS scenario.
2) Impact of Variation in the Number of Jammers
Another set of evaluations examines the costs of jamming within our proposed system model. Figure 8 shows the impact of varying the number of jammers
Results for evaluating the impact of varying number of jammers on the proposed Swarm UAV-borne RIS scenario (a) Cumulative Reward (b) Sum Rate (c) Energy Consumption (d) Base Station Transmit Power. Adaptive UAV Swarm Formation in (e) Proposed solution and (f) Alternate Scenarios. Comparative bar plots for (g) Average Sum Rate (h) Average Energy Consumption.
Conversely, the energy consumption shows slight fluctuations across different quantities of jammers from
The adaptability of the RL agent is a key highlight of our proposed system model, demonstrated when compared to the Fixed Device-to-UAV Association and Random Device-to-UAV Association baselines, which show maximum engagement of at least four UAVs, as evident in subplot 8(f). All related works and baseline configurations show increased adaptive UAV swarm formation as the number of jammers grows. However, these configurations are inefficient, achieving lower sum rates with higher energy consumption due to the lack of optimization of swarm formation and clustering.
The comparative bar plots of two scenarios from the related works [7], [12], and different baselines are shown in Figures 8(g) and 8(h). RL-based optimization with swarm UAV-mounted RIS platforms leads to superior performance compared to the baselines and related works, except for one specific scenario such as Single Objective Optimization. The proposed solution demonstrates partial superiority with the higher sum rate with lower energy consumption at jammer quantity from
However, beyond
3) Impact of Variation in the Number of Mobile Devices
Finally, we study the impact of varying the number of mobile devices through simulations comparing our approach to related works and baselines while keeping the number of UAVs in the swarm as
Figure 9 illustrates the convergence of cumulative reward, sum rate, energy consumption, and base station transmit power. As the number of devices increases from
Results for evaluating the impact of varying the number of mobile devices on the proposed Swarm UAV-borne RIS scenario (a) Cumulative Reward (b) Sum Rate (c) Energy Consumption (d) Base Station Transmit Power. Adaptive UAV Swarm Formation in (e) Proposed solution and (f) Alternate Scenarios. Comparative bar plots for (g) Average Sum Rate and (h) Average Energy Consumption.
In our simulations, we compare UAV activity participation in the swarm, as depicted in subplot 9(e), against a baseline scenario of Random Device-to-UAV Association and related work on Fixed Device-to-UAV Association [12], along with other configurations in subplot 9(f). The proposed approach shows increased UAV activity with more devices, indicating a higher sum rate and coverage needs. In contrast, Fixed and Random Device-to-UAV Association scenarios exhibit greater UAV involvement even with fewer devices. At the same time, other configurations show a gradual increase in UAV activity as device numbers rise.
Our swarm UAV-borne RIS system model also performs better than other baselines and configurations, as shown in subplots 9(g) and 9(h). The proposed solution adequately covers more devices, such as
Our proposed solution also outperforms the scenarios with Random and Fixed Device-to-UAV Association [12] due to increased mobility from UAV-mounted RIS platforms, enabling effective spatial exploration and adaptability to environmental changes for maximizing sum rates while minimizing energy consumption. However, in Random and Fixed Device-to-UAV association scenarios, despite higher UAV activity, we observe lower sum rates with very high energy consumption as shown in subplot 9(h).
We observed that the sum rate for the Single Objective Optimization configuration slightly outperforms the proposed model at higher device quantities
4) Real-Time Adaptive UAV Swarm Formation
We also demonstrate the proposed real-time adaptation solution to rapidly changing wireless environment parameters during the operations using UAV swarm formation and clustering. We observe that the RL agent adapts very well when the jammer distribution radius is switched from
Plots showing UAV Swarm Formation adapting to the rapid changes in jammer distribution area switching from
Conclusion
This study proposes a robust strategy against jamming attacks in dense urban wireless networks using Deep Reinforcement Learning (DRL). The Proximal Policy Optimization (PPO) algorithm optimizes UAV trajectories, swarm formation, clustering, RIS phase shifts, and base station transmission power to maximize sum rates while minimizing the UAV energy consumption in the presence of unknown jammers. This approach effectively addresses the complexities of dense urban environments, ensuring optimal outcomes across various scenarios.
Future research aims to delve into multi-agent DRL methodologies within complex wireless communication frameworks. This may encompass elements such as mobile jamming entities and establishing collaborative Swarm UAVs comprising RIS-supported airborne platforms, thus introducing further complexities to the optimization problem.
ACKNOWLEDGMENT
Open Access funding is provided by Qatar National Library.