Introduction
6G-based ground-air-underwater networks [1] represent a groundbreaking paradigm shift in wireless communication. These networks aim to provide seamless connectivity across terrestrial, aerial, and aquatic domains, enabling unprecedented data exchange and interaction between sensors, devices, and vehicles operating in these diverse environments. Concerning the aquatic domain, the Internet of Underwater Things (IoUT) [2] can be defined as a worldwide network of smart interconnected underwater objects with a digital entity. These devices sense, interpret, and react to the environment thanks to the combination of the Internet, tracking technologies, and embedded sensors. Data is sent to the surface for computation and processing.

Underwater devices use different communication techniques to transmit information [3]: radio-frequency (RF), optical, and acoustic communication. RF communication offers high data throughput over short distances and is only mildly affected by the Doppler effect. It performs well in shallow waters but suffers from high signal attenuation in deep water. Optical transmission, preferably at blue-green wavelengths, requires line-of-sight positioning. Acoustic communication is the most widely used method because sound travels efficiently over long distances with relatively low energy loss. It enables the longest communication range but suffers from low throughput, significant Doppler-effect impairment, and high delay spread, which causes severe inter-symbol interference [4].

Data transmission directly from IoUT devices (e.g., USNs, robots, or cameras) to the surface sink is very energy-intensive. Autonomous Underwater Vehicles (AUVs) can help reduce communication distances between IoUT devices and sink nodes by collecting data from IoUT devices [5]. Sink nodes transmit the collected information to ships or Autonomous Aerial Vehicles (AAVs) in the aerial domain using radio-frequency waves [6] or satellite communications. Finally, the data is sent to ground stations in the terrestrial domain, where it is stored on servers (cloud computing). At this stage, the data is processed, and depending on the application, the results must be promptly sent back to the underwater devices. In this process, latency is the most detrimental factor for applications with real-time or mission-critical constraints, such as large-scale and real-time sensing data fusion and navigation systems [7]. To cope with this requirement, Multi-Access Edge Computing (MEC) has been developed, bringing cloud-like computational services closer to local devices [8]. These nearby devices are equipped with cloud-like resources, ensuring high reliability, scalability, and low latency in underwater networks.
Most prior studies overlook the computing capabilities provided by AUVs in the underwater environment, with only a few considering AUVs as edge computing nodes capable of executing tasks [8], [9]. We distinguish between “local AUVs” and a “MEC AUV”. “Local AUVs” typically have limited processing capabilities. The ‘MEC AUV’, on the other hand, is specialized to perform computational tasks efficiently. AUV-enabled MEC systems involving IoUT devices, cluster-heads, local AUVs, and MEC AUVs remain an unexplored area of research.
In this paper, we propose an innovative AUV-enabled MEC system where cluster-heads, which collect data from IoUT devices, offload their associated computing tasks to local AUVs. These AUVs are strategically positioned to (1) execute tasks entirely locally, (2) execute tasks partially and offload the remaining portion, or (3) fully offload tasks to a more resourceful MEC AUV.
Despite the advantages of AUV-assisted MEC, several challenges in network deployment and operation must be addressed. First, because of the limited onboard resources, it is difficult to determine the optimal amount of computation to allocate to each task offloaded from a local AUV to the MEC-enabled AUV. Second, it is challenging to control each AUV's trajectory (diving direction and speed), since every local AUV must serve cluster-heads along its route and the MEC-enabled AUV has to serve different local AUVs at different collection points. Finally, determining the optimal route for the AUVs is difficult because ocean currents affect their trajectories.
Inspired by the challenges mentioned above, we propose to minimize energy consumption and task delay by jointly optimizing the task offloading strategy, resource allocation, and AUV trajectories. We formulate an efficient model for trajectory optimization, task offloading, and resource allocation as a non-convex optimization problem. This model aims to minimize the weighted sum of service delays for all local AUVs (task offloading and computation delays) and the energy consumption of AUVs (transmission energy and computation energy) [10].
The main contributions of this paper are:
We propose an AUV-enabled MEC system comprising IoUT devices, cluster-heads, local AUVs, and an MEC AUV. This system aims to distribute the workload between local AUVs and the MEC AUV to optimally reduce underwater task execution time.
We formulate the task offloading of local AUVs, resource allocation of the MEC AUV, and path selection of both local and MEC AUVs as a joint optimization problem. The objective is to minimize the underwater task execution delay while reducing the energy consumption of the entire system.
Since the problem formulated is NP-hard, we transform it into a Markov Decision Process (MDP) and solve it using a deep reinforcement learning-based algorithm, Deep Deterministic Policy Gradient (DDPG). Two deep neural networks, the actor and the critic, are employed. The actor network is responsible for deciding the speed of the AUVs, the task offloading strategy, and the resource allocation for the MEC AUV, while the critic network evaluates the actions generated by the actor network.
We have conducted extensive simulations to evaluate the effectiveness of our proposed communication system. The simulation results show that our proposed algorithm outperforms the Total Offloading (Offloading), Local Execution (Locally), and Actor-Critic (AC) algorithms and achieves a lower average delay and energy consumption. Our findings confirm that our proposal can be effectively implemented and is well-suited for mission-critical applications.
To the best of our knowledge, this is the first paper that proposes an AUV-enabled MEC system involving IoUT devices, cluster-heads, local AUVs, and MEC AUVs. The joint optimization of AUVs’ trajectories, task offloading strategy, and resource allocation in an AUV-assisted MEC, with a focus on energy efficiency and delay minimization, has not been explored before.
The remainder of the paper is organized as follows. Section II reviews related work on AUVs, MEC, and reinforcement learning. The system model is presented in Section III. Reinforcement learning-based methods are explained in Section IV. Section V presents the DDPG algorithm methodology and problem solution. Section VI describes the experiments carried out and discusses the results. Finally, the paper is concluded in Section VII.
Related Work
Different strategies and methods have been implemented for data collection from USNs. On the one hand, in basic approaches, such as traditional multi-hop data collection, data is relayed from one sensor node to another until it reaches the sink node at the sea surface. This strategy has several drawbacks such as the time needed for the information to reach the control center and the high energy consumption. On the other hand, more complex strategies involve additional aerial or underwater devices (e.g., AAVs and AUVs) in data collection. These approaches often leverage multi-access edge computing and use reinforcement learning techniques to intelligently ensure the optimal collection of information from the underwater environment. Next, we summarize them.
A. AAV-Assisted Data Acquisition
Several research works have proposed Autonomous Aerial Vehicle (AAV)-assisted data acquisition schemes for the IoUT. Acoustic links are commonly used for underwater communication due to the efficient propagation of sound in this medium, which allows longer ranges. In contrast, electromagnetic waves, such as radio waves, are significantly absorbed and attenuated in water, especially in saltwater, making them impractical for most types of long-range communication. Some low-frequency electromagnetic waves can penetrate water to a limited extent, but are not widely used due to their large wavelength and the technical challenges associated with generating and receiving them. In [6], the authors propose an underwater data acquisition scheme assisted by an AAV. In this scheme, underwater sensor data are first transmitted via an acoustic signal link to a floating sink node, which forwards the data to an AAV using an electromagnetic link. In [11], the authors present an energy-efficient data collection scheme for AAV-assisted ocean monitoring networks, which aims to jointly maximize the energy efficiency of aerial as well as aquatic communications. In [12], the authors propose an AAV-assisted ocean monitoring network architecture designed to ensure timely transmission and extend network lifetime. In these schemes ([6], [11], [12]), during the data acquisition stage in the marine environment, information is sent directly from the USNs to the sink nodes through acoustic signals. Afterwards, in the aerial environment, the data is collected by an AAV through electromagnetic wave propagation.
B. AUV-Assisted Data Collection
Other research works focus on AUV-assisted data collection schemes for the IoUT. In [13], the authors present a Hybrid Data Collection Scheme (HDCS) that considers both real-time data collection and energy-efficiency (EE) issues. In [14], motivated by the energy limitations of underwater devices and the high demand for data collection, the authors present an AUV-assisted underwater acoustic sensor network to reduce energy consumption and improve network performance. In [15], the authors present a heterogeneous underwater information collection scheme to optimize the peak Age of Information (AoI) and to improve the energy efficiency of the aquatic device nodes. To minimize the energy consumption of resource-constrained devices, aquatic mobile devices (AUVs) are introduced, reducing the transmission distance between the USNs and the AUV, as well as between the AUV and the sink nodes. However, since the main objective is only to collect data, there may be delays until the information reaches the cloud-based servers and is processed.
Furthermore, several studies explore frameworks for enabling edge computing in underwater environments using AUVs.
C. Multi-Access Edge Computing (MEC)
In [9], the authors propose a data collection scheme based on an underwater mobile edge element (an AUV) and design a target selection algorithm to compute the mobility path of the AUV for data collection in a stable 3D environment. In this approach, they deploy an AUV to visit all target nodes and collect data by Magnetic Induction (MI) communication. Here the AUV (mobile edge platform) processes and stores a large amount of data to be sent to the sink node, and then the sink node sends the data to the cloud. In [16], the authors present a service-driven intelligent ocean convergence platform using software-defined networking and edge computing. Similar to the studies discussed above, this approach focuses on data acquisition. However, instead of sending the data to the cloud directly, it is sent to an edge server; in the first case ([9]), the AUV is used as an MEC server; in the second case ([16]), an intelligent ocean convergence platform (that contains ship and buoy nodes) is used for edge computing purposes. Nevertheless, these works ([9], [16]) only deal with data collection; they do not analyze computation offloading, that is, whether it is preferable to execute the computation tasks locally or to offload them (partially or completely) to the AUV-enabled MEC server. In contrast, our proposal explores task offloading between a local AUV that performs data collection and an AUV-enabled MEC server. Our proposal aims to minimize the sum of the total execution delay and energy consumption during the whole process of executing a task by solving the offloading strategy, AUV path optimization, and resource allocation.
In [17], the authors highlight the role of edge computing in optimizing big data processing for underwater applications. Their work demonstrates how MEC enhances AUV-assisted data processing, reducing latency and improving efficiency. Building on this, our approach optimizes task offloading to minimize delay and energy consumption in underwater networks.
D. Reinforcement Learning
Reinforcement learning (RL) is a powerful framework in artificial intelligence, enabling agents to learn optimal behaviors through interactions with dynamic environments. By leveraging RL, agents can make intelligent decisions and adapt to changing conditions. In underwater applications, RL techniques have been utilized for tasks such as node repair, energy management, trajectory optimization, and data collection, demonstrating their potential to optimize performance and address complex challenges in aquatic environments.
In [18], a strategy for repairing nodes in the IoUT using AUVs and a multiagent reinforcement learning framework is discussed. The paper tackles the challenges of node failure caused by environmental conditions and energy limitations by proposing a novel node repair scheme. This scheme allows AUVs to autonomously identify and replace faulty nodes, ensuring continuous network operation. The use of multiple underwater mobile chargers to enhance the charging process in underwater sensor networks through a multi-agent reinforcement learning approach is explored in [19]. The unique challenges of the underwater environment, such as variable energy consumption due to movement characteristics of underwater mobile chargers and the necessity for coordination among them, are addressed. In [20], the network topology in UWSNs is optimized to enhance transmission reliability, minimize delay, and extend network lifetime. The challenges posed by dynamic ocean currents and complex communication environments are tackled by employing a centralized topology control strategy, which utilizes deep reinforcement learning to manage the network effectively. The papers [18], [19], [20] focus on leveraging multi-agent reinforcement learning to address operational challenges in underwater environments: optimizing topology control for transmission reliability, assisting node repair in the IoUT, and managing energy charging for underwater rechargeable sensor networks. In contrast, our paper takes a broader approach by integrating MEC with AUVs to address challenges in a next-generation (6G) underwater network. AUVs act not just as data collectors or node repairers but as mobile computing nodes. This integration reduces latency and enhances data processing capabilities at the edge of the network.
In [21], a multi-AUV data collection system is proposed. This system is designed to optimize the trajectory and task allocation of AUVs based on the urgency of data uploads from IoUT devices and the dynamic underwater environment. A Markov decision process and a multiagent independent soft actor-critic algorithm are employed to maximize data collection rates and throughput while minimizing energy consumption. An advanced multi-tier underwater computing framework to optimize both the trajectory and resource management of AUVs in support of the IoUT is explored in [22]. An environment-aware system that integrates communication, computing, and storage resources across AUVs, surface stations, and IoUT devices to improve overall system efficiency is introduced. The complex, high-dimensional optimization problem is addressed using an asynchronous advantage actor-critic (A3C) algorithm, with simulations demonstrating that the approach enhances system profits by efficiently managing AUV trajectories and resource allocation in dynamic underwater environments. A novel approach to optimize both data throughput and energy harvesting in underwater sensor networks using AUVs and simultaneous wireless information and power transfer (SWIPT) is discussed in [23]. The proposed system employs a model-free reinforcement learning solution to manage the AUV’s trajectory for efficient data collection and energy distribution to sensor nodes. The papers [21], [22], [23] predominantly focus on optimizing AUV operations within the IoUT through various approaches like energy-aware data collection, trajectory design considering environmental factors, and integrating simultaneous wireless information and power transfer (SWIPT) for sustainability. In contrast, our paper specifically introduces the integration of 6G technologies with AUV-based MEC systems, focusing on minimizing latency and energy consumption through advanced task offloading and resource management strategies.
System Model
In this section, we define the scenario for data processing (real-time response for mission-critical applications) and data collection (cloud storage) tasks. Afterwards, we analyze the reinforcement learning algorithm used for the intelligent offloading of tasks to the edge device. We summarize all the following notations and their definitions in Table 1.
A. Network Model
Figure 1 shows the network model. Along the seabed, several IoUT devices (USN fixed on the seafloor) are randomly deployed in a 3D (
This assumption works well in environments where the number of sensors is limited and they are reasonably spaced apart from each other. Therefore, it is assumed that 10–50 sensors are deployed per square kilometer in underwater environments. This lower density accounts for the challenges of underwater communication, such as signal attenuation and the need for longer-range transmissions. Sensors are spaced 200 to 500 meters apart on average to minimize mutual interference and to allow for efficient data transmission using acoustic communication. This density achieves a balance between effective monitoring and ensuring minimal communication interference in an underwater setting, where signals face unique propagation challenges.
We propose to bring computing resources closer to IoUT devices. Therefore, we consider that the
B. Communication Model
We consider two communication interfaces: cluster-head-to-
1) Underwater Acoustic Channel
The narrow-band Signal-to-noise Ratio (SNR) of an emitted underwater signal at the receiver can be expressed by the passive sonar equation [25]:\begin{equation*} SNR\left ({{ l,f }}\right )=SL\left ({{ f }}\right )-A\left ({{ l,f }}\right )-N\left ({{ f }}\right )+DI\ge DT \tag {1}\end{equation*}
The attenuation, transmission loss, or path loss over a transmission range l for a frequency f can be given by [26].\begin{equation*} 10\log {A\left ({{ l,f }}\right )}=k\cdot 10 \log l+l\cdot 10\log {\alpha \left ({{ f }}\right )} \tag {2}\end{equation*}
The ambient noise can be modeled by four basic sources [27]\begin{equation*} N\left ({{ f }}\right )=N_{t}\left ({{ f }}\right )+N_{s}\left ({{ f }}\right )+N_{w}\left ({{ f }}\right )+N_{th}\left ({{ f }}\right ) \tag {3}\end{equation*}
\begin{align*} 10\log {N_{t}\left ({{ f }}\right )}& =17-30\log \left ({{ f }}\right ) \tag {4}\\ 10\log {N_{s}\left ({{ f }}\right )}& =40+20\left ({{ s-0.5 }}\right )+26\log \left ({{ f }}\right ) \\ & \quad -60\log \left ({{ f+0.03 }}\right ) \tag {5}\end{align*}
\begin{align*} 10\log {N_{w}\left ({{ f }}\right )}& =50+7.5w^{\frac {1}{2}}+20\log \left ({{ f }}\right ) \\ & \quad -40\log \left ({{ f+0.4 }}\right ) \tag {6}\end{align*}
\begin{equation*} 10\log {N_{th}\left ({{ f }}\right )}=-15+20\log \left ({{ f }}\right ) \tag {7}\end{equation*}
\begin{equation*} I_{T}^{SW}=\frac {P_{T}}{2\pi \cdot z},{I}_{T}^{DW}=\frac {P_{T}}{4\pi } \tag {8}\end{equation*}
for shallow and deep water, respectively, where
If we consider the frequency-dependent part of the narrow-band SNR \begin{align*} r_{tx}& =\sum \nolimits _{i} {\Delta f\log _{2}\left ({{ 1+\frac {SL\left ({{ l,f }}\right )}{A\left ({{ l,f }}\right )N\left ({{ f }}\right )} }}\right )} \\ & =B\log _{2}\left [{{ 1+\frac {I_{T}\gamma \left ({{ l,f }}\right )}{1\mu Pa} }}\right ] \tag {9}\end{align*}
Therefore, the data rate of the cluster-head-to-
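To make the channel model of equations (2)-(9) concrete, the following Python sketch evaluates the path loss, ambient noise, and narrow-band SNR for an example link. It is an illustration rather than the simulation code used in this work: Thorp's empirical formula is assumed as the absorption model behind α(f), and the spreading factor k, shipping activity s, wind speed w, source level SL, and directivity index DI are placeholder values.

```python
import numpy as np

def thorp_absorption_db_per_km(f_khz):
    """Thorp's empirical absorption coefficient in dB/km (f in kHz).
    Assumed here as the model behind alpha(f) in Eq. (2); the paper cites [26]."""
    f2 = f_khz ** 2
    return 0.11 * f2 / (1 + f2) + 44 * f2 / (4100 + f2) + 2.75e-4 * f2 + 0.003

def path_loss_db(l_m, f_khz, k=1.5):
    """10 log A(l,f) = k*10*log(l) + l*alpha_dB(f), the dB form of Eq. (2).
    k is the spreading factor (1 cylindrical, 2 spherical; 1.5 is a common practical value)."""
    return k * 10 * np.log10(l_m) + (l_m / 1000.0) * thorp_absorption_db_per_km(f_khz)

def ambient_noise_db(f_khz, s=0.5, w=0.0):
    """Ambient noise PSD N(f) in dB re uPa/Hz from the four sources of Eqs. (3)-(7).
    s: shipping activity factor in [0, 1]; w: wind speed in m/s (assumed inputs)."""
    nt = 17 - 30 * np.log10(f_khz)
    ns = 40 + 20 * (s - 0.5) + 26 * np.log10(f_khz) - 60 * np.log10(f_khz + 0.03)
    nw = 50 + 7.5 * np.sqrt(w) + 20 * np.log10(f_khz) - 40 * np.log10(f_khz + 0.4)
    nth = -15 + 20 * np.log10(f_khz)
    # Sum the four sources in the linear power domain, then convert back to dB.
    return 10 * np.log10(sum(10 ** (n / 10.0) for n in (nt, ns, nw, nth)))

# Example: narrow-band SNR of Eq. (1) at l = 1 km, f = 20 kHz, for an assumed
# source level SL = 160 dB re uPa and directivity index DI = 0 dB.
l, f = 1000.0, 20.0
snr_db = 160.0 - path_loss_db(l, f) - ambient_noise_db(f, s=0.5, w=5.0) + 0.0
print(f"SNR(l={l} m, f={f} kHz) ~ {snr_db:.1f} dB")
```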
C. Computing Model
1) The Velocity Synthesis Approach
Accounting for the effect of ocean currents in the underwater environment, AUVs (
The lower left point in the figure represents the AUV and the higher point represents the target point. The vector \begin{equation*} V\cdot \sin \left ({{ \theta ^{h}-ai }}\right )=Uc\cdot \sin \left ({{ ai-ai1 }}\right ) \tag {10}\end{equation*}
From the above equation, it can be calculated\begin{equation*} \theta ^{h}=\arcsin \left ({{ Uc\cdot \frac {\sin \left ({{ ai-ai1 }}\right )}{V} }}\right )+ai \tag {11}\end{equation*}
The speed synthesis algorithm implementation is based on the precondition:\begin{align*} \left |{{ Uc }}\right |& \lt \left |{{ V }}\right | \tag {12}\\ V_{L}& =V_{d}+{Uc}_{d} \tag {13}\end{align*}
Summarizing, by combining equations (10) to (13),
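As a simple illustration of equations (10)-(13), the sketch below computes the compensated heading θ^h for given AUV and current speeds. The angle convention (ai as the bearing to the target, ai1 as the current direction, both in radians) and the numeric values are assumptions of this example.

```python
import math

def heading_with_current(v_auv, u_current, ai, ai1):
    """Heading theta_h the AUV must steer so that its resultant velocity points
    along the desired bearing ai despite a current of speed u_current flowing
    in direction ai1 (radians), following Eq. (11).
    Requires |u_current| < |v_auv|, the precondition of Eq. (12)."""
    if abs(u_current) >= abs(v_auv):
        raise ValueError("Current speed must be lower than AUV speed (Eq. 12)")
    return math.asin(u_current * math.sin(ai - ai1) / v_auv) + ai

# Example: AUV at 2 m/s, a 0.5 m/s current flowing westward, target due north.
ai = math.pi / 2           # desired bearing (north), assumed example value
ai1 = math.pi              # current direction (west), assumed example value
theta_h = heading_with_current(2.0, 0.5, ai, ai1)
print(f"steer heading = {math.degrees(theta_h):.1f} deg")
```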
2) Auvs’ Trajectory
As mentioned above, we assume that all
The horizontal distance from the \begin{equation*} d_{mec,i}^{h}\left ({{ t }}\right )=\sqrt {\left ({{ X\left ({{ t }}\right )-x_{i}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ Y\left ({{ t }}\right )-y_{i}\left ({{ t }}\right ) }}\right )^{2}} \tag {14}\end{equation*}
\begin{equation*} d_{k,j}^{h}\left ({{ t }}\right )=\sqrt {\left ({{ x_{k}\left ({{ t }}\right )-x_{j}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ y_{k}\left ({{ t }}\right )-y_{j}\left ({{ t }}\right ) }}\right )^{2}} \tag {15}\end{equation*}
For security reasons, each AUV can only move within a rectangular area to avoid possible collisions. The maximum bounds for the \begin{align*} 0& \le X\left ({{ t }}\right )\le X^{max}, \forall t\in \mathcal {T} \tag {16}\\ 0& \le Y\left ({{ t }}\right )\le Y^{max}, \forall t\in \mathcal {T} \tag {17}\end{align*}
\begin{align*} x_{k}^{min}& \le x_{k}\left ({{ t }}\right )\le x_{k}^{max}, \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {18}\\ y_{k}^{min}& \le y_{k}\left ({{ t }}\right )\le y_{k}^{max}, \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {19}\end{align*}
Therefore, the
To get the trajectory of the
Similarly, for the trajectory of the
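In a simulation, the bound constraints (16)-(19) can be enforced by clipping each AUV position after every movement step, as in the following minimal sketch; the rectangle limits used in the example are illustrative values, not the ones used in our experiments.

```python
import numpy as np

def clip_position(x, y, x_min, x_max, y_min, y_max):
    """Keep an AUV inside its rectangular safety area, enforcing the bound
    constraints of Eqs. (16)-(19) after each movement step."""
    return float(np.clip(x, x_min, x_max)), float(np.clip(y, y_min, y_max))

# Example with assumed bounds: a local AUV restricted to [100, 900] x [100, 900] m.
x_next, y_next = clip_position(950.0, 420.0, 100.0, 900.0, 100.0, 900.0)
print(x_next, y_next)   # -> 900.0 420.0
```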
3) Cluster-Head Data Collection
We assume that the
Next, each cluster-head \begin{equation*} D_{tx}^{ch,auvloc}\left ({{ t }}\right )=\frac {L_{k, j}\left ({{ t }}\right )}{r_{ch, auvloc}} \tag {20}\end{equation*}
To get the propagation delay \begin{align*} & d_{k,j}^{e}\left ({{ t }}\right ) \\ & =\sqrt {\left ({{ x_{k}\left ({{ t }}\right )-x_{j}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ y_{k}\left ({{ t }}\right )-y_{j}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ z_{k}\left ({{ t }}\right )-z_{j}\left ({{ t }}\right ) }}\right )^{2}} \tag {21}\\ & D_{prop}^{ch,auvloc}\left ({{ t }}\right )=\frac {d_{k,j}^{e}\left ({{ t }}\right )}{v_{prop}} \tag {22}\end{align*}
Afterwards, if the task
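The delays of equations (20)-(22) can be computed as in the sketch below, which assumes a typical sound speed of about 1500 m/s for v_prop; the task size, data rate, and node coordinates are example values, not the parameters of our simulations.

```python
import math

SOUND_SPEED = 1500.0  # m/s, a typical value assumed for v_prop in seawater

def acoustic_delays(task_bits, rate_bps, src, dst):
    """Transmission delay D_tx = L / r (Eq. 20) and propagation delay
    D_prop = d / v_prop (Eq. 22) over the Euclidean distance of Eq. (21)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(src, dst)))  # 3D distance in m
    return task_bits / rate_bps, d / SOUND_SPEED

# Example with assumed values: a 2 Mbit task sent at 10 kbps over roughly 540 m.
d_tx, d_prop = acoustic_delays(2e6, 1e4, (0, 0, -1000), (300, 400, -800))
print(f"D_tx = {d_tx:.1f} s, D_prop = {d_prop:.2f} s")
```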
4) Local or Partially Local Computing Model
When the \begin{equation*} D_{{AUV}_{L}}\left ({{ t }}\right )=\frac {\left ({{ 1-\alpha _{k, j} }}\right ){L}_{k, j}\left ({{ t }}\right ){C}_{k, j}\left ({{ t }}\right )}{f_{{AUV}_{L}}} \tag {23}\end{equation*}
\begin{equation*} E_{k,j}^{L}\left ({{ t }}\right )=\mu \left ({{ f_{{AUV}_{L}} }}\right )^{\lambda }D_{{AUV}_{L}}\left ({{ t }}\right ) \tag {24}\end{equation*}
5) Offloading or Partially Offloading Computing Model
We consider that the
Data upload: The ${\mathrm {AUV}}_{LOCAL}\,k$ uploads the required input data (i.e., program codes and parameters) to the ${\mathrm {AUV}}_{MEC}$ in the underwater medium.
Task execution: The ${\mathrm {AUV}}_{MEC}$ allocates part of its computational resources and executes the computing task.
Result retrieval: The ${\mathrm {AUV}}_{MEC}$ returns the execution results to ${\mathrm {AUV}}_{LOCAL}\,k$.
Based on these steps, the time required for the first step of offloading computing is the transmission delay and the propagation delay. The transmission delay \begin{equation*} D_{tx}^{auvloc, auvmec}\left ({{ t }}\right )=\frac {\alpha _{k, j} L_{k, j}\left ({{ t }}\right )}{r_{auvloc,auvmec}} \tag {25}\end{equation*}
The Euclidean distance between the \begin{align*} & d_{mec,k}^{e}\left ({{ t }}\right ) \\ & \quad =\sqrt {\left ({{ X\left ({{ t }}\right )\!-\!x_{k}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ Y\left ({{ t }}\right )\!-\!y_{k}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ Z\left ({{ t }}\right )\!-\!z_{k}\left ({{ t }}\right ) }}\right )^{2}} \tag {26}\end{align*}
The propagation delay is given by\begin{equation*} D_{prop}^{auvloc, auvmec}\left ({{ t }}\right )=\frac {d_{mec,k}^{e}\left ({{ t }}\right )}{v_{prop}} \tag {27}\end{equation*}
The overall energy required by the \begin{equation*} E_{k,j}^{O}\left ({{ t }}\right )=P_{tx}^{auvloc, auvmec}D_{tx}^{auvloc, auvmec}\left ({{ t }}\right ) \tag {28}\end{equation*}
\begin{equation*} D_{{AUV}_{M}}\left ({{ t }}\right )=\frac {\alpha _{k, j} L_{k, j}\left ({{ t }}\right ) C_{k, j}\left ({{ t }}\right )}{f_{kj}^{m} F} \tag {29}\end{equation*}
\begin{equation*} E_{k,j}^{M}\left ({{ t }}\right )=\mu \left ({{ f_{kj}^{m}F }}\right )^{\lambda }D_{{AUV}_{M}}\left ({{ t }}\right ) \tag {30}\end{equation*}
For the last step of offloading computing, the time delay for receiving the processed result can be expressed as follows:\begin{equation*} D_{rx}\left ({{ t }}\right )=\frac {L_{m}\left ({{ j }}\right )}{r_{rx}} \tag {31}\end{equation*}
6) Task Completion Time and Energy Consumption
The total time to complete a task \begin{equation*} D_{k,j}^{L}\left ({{ t }}\right )=D_{tx}^{ch,auvloc}\left ({{ t }}\right )+D_{prop}^{ch,auvloc}\left ({{ t }}\right )+D_{{AUV}_{L}}\left ({{ t }}\right ) \tag {32}\end{equation*}
The total time to complete a task \begin{align*} D_{k,j}^{O}\left ({{ t }}\right )& =D_{tx}^{ch,auvloc}\left ({{ t }}\right )+D_{prop}^{ch,auvloc}\left ({{ t }}\right )+D_{tx}^{auvloc, auvmec}\left ({{ t }}\right ) \\ & \quad +D_{prop}^{auvloc, auvmec}\left ({{ t }}\right )+D_{{AUV}_{M}}\left ({{ t }}\right ) \tag {33}\end{align*}
To summarize, the total time to complete the task \begin{align*} & D_{k,j}\left ({{ t }}\right ) \\ & =\begin{cases} \displaystyle D_{k,j}^{L}\left ({{ t }}\right ), & {\alpha }_{k, j}=0; local~execution \\ \displaystyle D_{k,j}^{O}\left ({{ t }}\right ), & {\alpha }_{k, j}=1; offloading \\ \displaystyle max \left ({{ D_{k,j}^{L}\left ({{ t }}\right ), D_{k,j}^{O}\left ({{ t }}\right ) }}\right ),& {0\lt \alpha }_{k, j}\lt 1; \\ \displaystyle & partial ~Offloading \end{cases} \tag {34}\end{align*}
And the overall energy consumption \begin{align*} & E_{k,j}\left ({{ t }}\right ) \\ & =\begin{cases} \displaystyle E_{k,j}^{L}\left ({{ t }}\right ), & {\alpha }_{k, j}=0; local~execution \\ \displaystyle E_{k,j}^{O}\left ({{ t }}\right ), & {\alpha }_{k, j}=1; offloading \\ \displaystyle E_{k,j}^{L}\left ({{ t }}\right )+E_{k,j}^{O}\left ({{ t }}\right )+E_{k,j}^{M}\left ({{ t }}\right );& { 0\lt \alpha }_{k, j}\lt 1; \\ \displaystyle & partial~Offloading \end{cases} \tag {35}\end{align*}
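A compact way to evaluate equations (23)-(35) for a given offloading ratio α is sketched below. The CPU power-model coefficients μ and λ and all numeric inputs in the example call are illustrative assumptions, not the parameter values used in our simulations.

```python
def task_cost(alpha, L_bits, C_cpb, f_local, f_share, F_mec,
              d_tx_ch, d_prop_ch, r_off, d_prop_off, p_tx_off,
              mu=1e-26, lam=3):
    """Delay (Eqs. 32-34) and energy (Eqs. 24, 28, 30, 35) of one task for an
    offloading ratio alpha in [0, 1]."""
    # Local branch: the (1 - alpha) share processed on the local AUV (Eq. 23).
    d_cpu_loc = (1 - alpha) * L_bits * C_cpb / f_local
    d_local = d_tx_ch + d_prop_ch + d_cpu_loc                          # Eq. (32)
    e_local = mu * f_local ** lam * d_cpu_loc                          # Eq. (24)
    # Offloading branch: the alpha share uploaded to and run on the MEC AUV.
    d_tx_off = alpha * L_bits / r_off                                  # Eq. (25)
    d_cpu_mec = alpha * L_bits * C_cpb / (f_share * F_mec)             # Eq. (29)
    d_off = d_tx_ch + d_prop_ch + d_tx_off + d_prop_off + d_cpu_mec    # Eq. (33)
    e_off = p_tx_off * d_tx_off                                        # Eq. (28)
    e_mec = mu * (f_share * F_mec) ** lam * d_cpu_mec                  # Eq. (30)
    if alpha == 0:
        return d_local, e_local
    if alpha == 1:
        return d_off, e_off
    return max(d_local, d_off), e_local + e_off + e_mec                # Eqs. (34)-(35)

# Example with assumed numbers: 1 Mbit task, 500 cycles/bit, 0.5 GHz local CPU,
# 30% of a 2 GHz MEC CPU, and half of the task offloaded.
print(task_cost(0.5, 1e6, 500, 5e8, 0.3, 2e9,
                d_tx_ch=10.0, d_prop_ch=0.3, r_off=2e4, d_prop_off=0.2,
                p_tx_off=2.0))
```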
D. Problem Formulation
In this work, we consider the joint optimization of the trajectory of \begin{align*} & \min \limits _{U,A,F}\sum \limits _{t\in {T}} \left [{{ \left ({{ 1-\omega }}\right )\sum \limits _{k\in {K}} \sum \limits _{j\in R} {D_{k,j}\left ({{ t }}\right ) +\omega \sum \limits _{k\in {K}} \sum \limits _{j\in R} {E_{k,j}\left ({{ t }}\right )}} }}\right ] \tag {36}\\ & ~s.t.~0 \le \theta _{mec}^{h}\left ({{ t }}\right )\le 2\pi , \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {36a}\\ & \hphantom {~s.t.~}0 \le \theta _{k}^{h}\left ({{ t }}\right )\le 2\pi , \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {36b}\\ & \hphantom {~s.t.~}0 \le X_{mec}\left ({{ t }}\right )\le X_{mec}^{max}, \forall t\in \mathcal {T} \tag {36c}\\ & \hphantom {~s.t.~}0 \le Y_{mec}\left ({{ t }}\right )\le Y_{mec}^{max}, \forall t\in \mathcal {T} \tag {36d}\\ & \hphantom {~s.t.~}x_{k}^{min}\le x_{k}\left ({{ t }}\right )\le x_{k}^{max}, \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {36e}\\ & \hphantom {~s.t.~}y_{k}^{min}\le y_{k}\left ({{ t }}\right )\le y_{k}^{max}, \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {36f}\\ & \hphantom {~s.t.~}f_{kj}^{m}=0,~\text {if}~\alpha _{k,j}\left ({{ t }}\right )=0 \tag {36g}\\ & \hphantom {~s.t.~}\sum \nolimits _{1}^{k} f_{kj}^{m} \le 1,~\text {if}~\alpha _{k,j}\left ({{ t }}\right )\ne 0 \tag {36h}\end{align*}
The constraints (36a), (36b) ensure the horizontal direction of motion for the
An efficient trajectory optimization, task offloading, and resource allocation model is formulated as a nonlinear problem. Its objective is to minimize the delay and energy consumption of
Consequently, instead of using traditional optimization approaches, reinforcement learning-based methods are employed to obtain the near-optimal solutions for the variable parameters. For this purpose, we employ a deep deterministic policy gradient algorithm to address the problem.
Reinforcement Learning-Based Methods
In this section, we propose a computational offloading algorithm based on DDPG for underwater MEC networks. The algorithm automatically selects tasks to be offloaded to an MEC server mounted on an
A. Q-Learning
Q-learning is a method proposed by Watkins [30], [31] to solve Markov Decision Processes (MDPs) with incomplete information. In essence, Q-learning is a reinforcement learning technique where an agent in a state s transitions to a new state
The “Q” in Q-learning stands for quality: the model chooses its next action by improving the quality of each state-action pair. For this purpose, the Q-learning algorithm uses the Bellman equation to update and learn the Q-values.\begin{equation*} Q\left ({{ s,a }}\right )=r+\left [{{ \gamma {max}_{a^{\prime }}Q\left ({{ s^{\prime }, a^{\prime } }}\right ) }}\right ] \tag {37}\end{equation*}
The model stores all Q-values for each state-action pair in a table, known as the Q-table. This table is initialized to zero, since it does not include any prior knowledge.
From Algorithm 1, the key Q-learning update rule is:\begin{align*} Q\left ({{ s,a }}\right )\leftarrow Q\left ({{ s,a }}\right )+\alpha \left [{{ r+\gamma {max}_{a}Q\left ({{ s^{\prime }, a }}\right )-Q\left ({{ s,a }}\right ) }}\right ] \tag {38}\end{align*}
Algorithm 1 Q-Learning Algorithm
Initialize $Q\left ({{ s,a }}\right )$ arbitrarily (e.g., to zero)
Repeat (for each episode):
Initialize s
Repeat (for each step of the episode):
Choose a from s using policy derived from Q (e.g., $\epsilon $-greedy)
Take action a, observe r, $s^{\prime }$
$Q\left ({{ s,a }}\right )\leftarrow Q\left ({{ s,a }}\right )+\alpha \left [{{ r+\gamma {max}_{a^{\prime }}Q\left ({{ s^{\prime }, a^{\prime } }}\right )-Q\left ({{ s,a }}\right ) }}\right ]$
$s\leftarrow s^{\prime }$
until s is terminal
end for
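For reference, a minimal tabular implementation of Algorithm 1 and the update rule of equation (38) could look as follows; the environment interface (reset/step) is an assumption of this sketch and is unrelated to the offloading problem itself.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning following Algorithm 1 and the update rule of Eq. (38).
    `env` is assumed to expose reset() -> s and step(a) -> (s_next, r, done)."""
    Q = np.zeros((n_states, n_actions))          # Q-table initialized to zero
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection derived from Q
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Bellman update, Eq. (38)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```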
One limitation of Q-learning is its suitability primarily for problems with small state spaces. As the state space grows, the memory and computation required to maintain and update the Q-table increase significantly, making it impractical for larger environments.
B. DQN
For larger state spaces, deep Q-learning helps the model estimate the Q-values efficiently and perform tasks more effectively. Deep Q-learning enables the use of the Q-Learning strategy by integrating artificial neural networks: Neural Networks (NN), Deep Neural Networks (DNN), and Convolutional Neural Networks (CNN). A neural network enables the agent to choose actions by processing inputs, which represent the states of the environment [33]. After receiving the input, the neural network estimates the Q values. The agent makes decisions based on these Q values. To achieve this, deep Q-learning employs the Bellman equation adapted for DQN:\begin{equation*} Q\left ({{ s,a;\theta }}\right )=r+\left [{{ \gamma {max}_{a^{\prime }}Q\left ({{ s', a'; \theta ' }}\right ) }}\right ] \tag {39}\end{equation*}
The neural network is trained by computing the loss or cost function, which compares the target value
Algorithm 2 Deep Q-Learning With Experience Replay
Initialize replay memory $D$
Initialize action-value function Q with random weights $\theta $
for episode =1, M do
Initialize sequence $s_{1}$
for t =1, T do With probability $\epsilon $ select a random action $a_{t}$
otherwise, select $a_{t}={argmax}_{a}Q\left ({{ s_{t},a;\theta }}\right )$
Execute action $a_{t}$, observe reward $r_{t}$ and next state $s_{t+1}$
Set $s_{t+1}$ as the current state
Store transition $\left ({{ s_{t},a_{t},r_{t},s_{t+1} }}\right )$ in $D$
Sample random minibatch of transitions $\left ({{ s_{j},a_{j},r_{j},s_{j+1} }}\right )$ from $D$
Set $y_{j}=r_{j}$ if the episode terminates at step $j+1$; otherwise $y_{j}=r_{j}+\gamma {max}_{a^{\prime }}Q\left ({{ s_{j+1},a^{\prime };\theta ^{\prime } }}\right )$
Perform a gradient descent step on $\left ({{ y_{j}-Q\left ({{ s_{j},a_{j};\theta }}\right ) }}\right )^{2}$ with respect to $\theta $
end for
end for
The loss function is then expressed as\begin{equation*} L(\theta )=\mathbb {E}\left [{{ {(r+\gamma ~max_{a^{\prime }}{Q\left ({{ s', a';\theta ' }}\right )-}Q\left ({{ s,a;\theta }}\right ))}^{2} }}\right ] \tag {40}\end{equation*}
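A minimal sketch of the temporal-difference target of equation (39) and the loss of equation (40) over a replay minibatch is shown below; the online and target networks are represented abstractly as callables, so this is an illustration rather than our training code.

```python
import numpy as np

def dqn_loss(batch, q_online, q_target, gamma=0.99):
    """Mean squared TD error of Eq. (40) over a replay minibatch.
    q_online(s) / q_target(s) return Q-value vectors over all actions;
    batch entries are (s, a, r, s_next, done) tuples."""
    errors = []
    for s, a, r, s_next, done in batch:
        # Target y = r + gamma * max_a' Q(s', a'; theta'), Eq. (39);
        # no bootstrapping on terminal transitions.
        y = r if done else r + gamma * np.max(q_target(s_next))
        errors.append((y - q_online(s)[a]) ** 2)
    return float(np.mean(errors))
```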
C. DDPG
Q-learning and DQN perform well with discrete action spaces. However, discretizing continuous action spaces can result in an excessively large number of possible actions, making convergence difficult to achieve. The deep deterministic policy gradient algorithm extends the Deep Q-learning algorithm to handle continuous action spaces [34]. DDPG is a model-free, off-policy, actor-critic algorithm that combines the Deterministic Policy Gradient (DPG) [35] with DQN.
DDPG uses two networks: the actor network and the critic network (see Figure 3). The actor network is represented by
Similarly, a target network is defined for both the actor and critic networks, maintaining the same structure as their primary counterparts. The target actor network is represented as
The algorithm (Algorithm 3) first selects an action produced by the actor network, with exploration noise
Algorithm 3 DDPG Algorithm
Randomly initialize critic network $Q\left ({{ s,a\vert \theta ^{Q} }}\right )$ and actor $\mu \left ({{ s\vert \theta ^{\mu } }}\right )$ with weights $\theta ^{Q}$ and $\theta ^{\mu }$
Initialize target network $Q^{\prime }$ and $\mu ^{\prime }$ with weights $\theta ^{Q^{\prime }}\leftarrow \theta ^{Q}$, $\theta ^{\mu ^{\prime }}\leftarrow \theta ^{\mu }$
Initialize replay buffer R
for episode =1, M do
Initialize a random process $\mathcal {N}$ for action exploration
Receive initial observation state $s_{1}$
for $t=1,T$ do
Select action $a_{t}=\mu \left ({{ s_{t}\vert \theta ^{\mu } }}\right )+\mathcal {N}_{t}$ according to the current policy and exploration noise
Execute action $a_{t}$, observe reward $r_{t}$ and new state $s_{t+1}$
Store transition $\left ({{ s_{t},a_{t},r_{t},s_{t+1} }}\right )$ in R
Sample a random minibatch of N transitions $\left ({{ s_{i},a_{i},r_{i},s_{i+1} }}\right )$ from R
Set $y_{i}=r_{i}+\gamma Q^{\prime }\left ({{ s_{i+1},\mu ^{\prime }\left ({{ s_{i+1}\vert \theta ^{\mu ^{\prime }} }}\right )\vert \theta ^{Q^{\prime }} }}\right )$
Update critic by minimizing the loss: $L=\frac {1}{N}\sum \nolimits _{i} {\left ({{ y_{i}-Q\left ({{ s_{i},a_{i}\vert \theta ^{Q} }}\right ) }}\right )}^{2}$
Update the actor policy using the sampled policy gradient:\begin{equation*}\nabla _{\theta \mu }J\approx \frac {1}{N}\sum \limits _{i} {\nabla _{a}Q\left ({{ s,a\vert \theta ^{Q} }}\right )\vert _{s=s_{i},a=\mu \left ({{ s_{i} }}\right )}\nabla _{\theta \mu }\mu \left ({{ s\vert \theta ^{\mu } }}\right )} \vert _{s_{i}}\end{equation*}
Update the target networks:\begin{align*}\theta ^{Q^{\prime }}& \leftarrow \tau \theta ^{Q}+(1-\tau )\theta ^{Q^{\prime }} \\ \theta ^{\mu ^{\prime }}& \leftarrow \tau \theta ^{\mu }+(1-\tau )\theta ^{\mu ^{\prime }}\end{align*}
end for
end for
The action
After several iterations, a mini-batch of N transitions \begin{equation*} L=\frac {1}{N}\sum \limits _{i} {(y_{i}-Q\left ({{ s_{i},a_{i}\vert \theta ^{Q} }}\right ))}^{2} \tag {41}\end{equation*}
Similarly, the actor network policy is updated using the gradient of the sampled policy:\begin{align*} \nabla _{\theta \mu } J\approx \frac {1}{N}\sum \limits _{i} {\nabla _{a}Q\left ({{ s,a\vert \theta ^{Q} }}\right )\vert _{s=s_{i},a=\mu \left ({{ s_{i} }}\right )}\nabla _{\theta \mu }\mu \left ({{ s\vert \theta ^{\mu } }}\right )} \vert _{s_{i}} \tag {42}\end{align*}
Next, the weights of the actor and critic networks in the target network are updated slowly to promote greater stability. This process, referred to as a soft replacement, is expressed as:\begin{align*} \theta ^{Q^{\prime }}\leftarrow \tau \theta ^{Q}+(1-\tau )\theta ^{Q^{\prime }} \tag {43}\\ \theta ^{\mu ^{\prime }}\leftarrow \tau \theta ^{\mu }+(1-\tau )\theta ^{\mu ^{\prime }} \tag {44}\end{align*}
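The critic loss (41), the policy gradient step (42), and the soft target updates (43)-(44) can be combined into a single training step, sketched below in TensorFlow 2-style code. The Keras model interfaces (actor(s) → a, critic([s, a]) → Q) and the hyperparameter defaults are assumptions of this illustration, not our exact implementation.

```python
import tensorflow as tf

def soft_update(target_vars, online_vars, tau=0.001):
    """Polyak averaging of Eqs. (43)-(44): theta' <- tau*theta + (1 - tau)*theta'."""
    for t, o in zip(target_vars, online_vars):
        t.assign(tau * o + (1.0 - tau) * t)

def ddpg_step(actor, critic, target_actor, target_critic,
              opt_actor, opt_critic, batch, gamma=0.99, tau=0.001):
    """One DDPG update (Eqs. 41-44) on a minibatch (s, a, r, s2) of tensors."""
    s, a, r, s2 = batch
    # Critic target y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), computed
    # outside the tape so no gradient flows through the target networks.
    y = r + gamma * target_critic([s2, target_actor(s2)])
    with tf.GradientTape() as tape:
        critic_loss = tf.reduce_mean(tf.square(y - critic([s, a])))   # Eq. (41)
    opt_critic.apply_gradients(
        zip(tape.gradient(critic_loss, critic.trainable_variables),
            critic.trainable_variables))
    # Actor update via the deterministic policy gradient, Eq. (42)
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    opt_actor.apply_gradients(
        zip(tape.gradient(actor_loss, actor.trainable_variables),
            actor.trainable_variables))
    # Slowly track the online networks, Eqs. (43)-(44)
    soft_update(target_critic.trainable_variables, critic.trainable_variables, tau)
    soft_update(target_actor.trainable_variables, actor.trainable_variables, tau)
```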
DDPG Algorithm: Methodology and Problem Solution
In this section, we propose a DDPG algorithm for MEC-based underwater networks that enables the joint optimization of the AUV trajectory, task offloading strategy, and computational resource allocation in a continuous action space. The goal is to minimize both the total delay task computation and the overall energy consumption in the system.
There are three key elements in the reinforcement learning method, namely, state, action, and reward. These elements are detailed below.
State $s\left ({{ t }}\right )$: represents the set of coordinates of all AUVs and the task size.\begin{equation*} s\left ({{ t }}\right )=\left \{{{\begin{matrix} \left [{{ X\left ({{ t }}\right ), Y\left ({{ t }}\right ), Z }}\right ], \\ \left [{{ x_{k}\left ({{ t }}\right ), y_{k}\left ({{ t }}\right ),z_{k} }}\right ], \\ L_{k, j}\forall k\in K \\ \end{matrix}}}\right \}\end{equation*}
Action $c\left ({{ t }}\right )$: represents the set of actions for all AUVs. It includes the velocity $v_{mec}\left ({{ t }}\right )$ for the ${\mathrm {AUV}}_{MEC}$, the velocity $v_{local}\left ({{ t }}\right )$ for the ${\mathrm {AUVs}}_{LOCAL}$, the offloading strategy $\alpha _{k,j}\left ({{ t }}\right )$, and the computing resource allocation vector $f_{kj}^{m}$. Hence, the action set can be defined as\begin{align*} c\left ({{ t }}\right )=\left [{{\begin{matrix} v_{mec}\left ({{ t }}\right ),v_{local\left ({{ 1 }}\right )}\left ({{ t }}\right ),\ldots ,v_{local\left ({{ k }}\right )}\left ({{ t }}\right ), \\ \alpha _{1j}\left ({{ t }}\right ),\ldots ,\alpha _{kj}\left ({{ t }}\right ), \\ f_{1j}^{m}\left ({{ t }}\right ),\ldots ,f_{kj}^{m}\left ({{ t }}\right ) \\ \end{matrix}}}\right ]\end{align*}
Reward $z\left ({{ t }}\right )$: is defined as the overall minimum delay and energy consumption for the entire process in each time slot. It is expressed as:\begin{equation*} z\left ({{ t }}\right )=-\sum \limits _{t\in T} \left [{{ \left ({{ 1-\omega }}\right )\sum \limits _{k\in K} \sum \limits _{j\in R} {D_{k,j}\left ({{ t }}\right )+ \omega \sum \limits _{k\in K} \sum \limits _{j\in R} {E_{k,j}\left ({{ t }}\right )}} }}\right ] \end{equation*}
The formula considers the weight ($\omega $) assigned to each parameter (latency and energy). For this reason, our joint optimization algorithm balances latency and energy consumption, aiming to enhance the overall system performance.
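For illustration, the state vector s(t) and the reward z(t) defined above can be assembled as in the following sketch; the array shapes, the number of AUVs, and the numeric values are assumptions of the example.

```python
import numpy as np

def build_state(mec_pos, local_pos, task_sizes):
    """Flatten the MDP state s(t): MEC AUV position, local AUV positions,
    and the current task sizes L_{k,j} (shapes are assumptions of this sketch)."""
    return np.concatenate([np.ravel(mec_pos), np.ravel(local_pos),
                           np.ravel(task_sizes)]).astype(np.float32)

def reward(delays, energies, omega=0.5):
    """Reward z(t) = -[(1 - omega) * sum D_{k,j} + omega * sum E_{k,j}] for one slot."""
    return -((1.0 - omega) * np.sum(delays) + omega * np.sum(energies))

# Example with assumed dimensions: one MEC AUV, three local AUVs, one task each.
s_t = build_state(np.array([500.0, 500.0, -200.0]),
                  np.random.uniform(0, 1000, size=(3, 3)),
                  np.array([2e6, 1e6, 3e6]))
print(s_t.shape, reward(np.array([12.0, 8.0, 15.0]), np.array([30.0, 22.0, 41.0])))
```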
Experiments and Results
In this section, simulations are conducted to verify the effectiveness and evaluate the design of the proposed algorithm. First, we describe the environment and simulation parameters used during the experiments. Then, we discuss the obtained results, comparing the proposed algorithm with baseline schemes.
All algorithms are evaluated using simulations implemented in several Jupyter Notebooks (version 6.0.3) installed via the Anaconda software suite and developed in Python 3.7. The experiments were performed on a Lenovo computer equipped with an Intel Xeon 2.9 GHz processor and 72 GB of RAM; furthermore, the NumPy, Matplotlib, and TensorFlow libraries are used to develop the RL algorithms. Reinforcement learning is a type of machine learning that is based on learning through interactions with the environment and the feedback obtained from these interactions. Unlike supervised or unsupervised learning, which rely on a static data set for training, reinforcement learning obtains its training data dynamically from the agent’s experience while interacting with the environment. This experience consists of observations, actions, rewards, and new observations, forming a cycle known as the feedback loop of reinforcement learning. Through this cycle, the agent collects real-time training data, learning how its actions affect the environment and how rewards are related to those actions. This dynamic learning process allows the agent to continuously improve its decision-making policy, thereby maximizing long-term rewards.
A. Simulation Setting
In the proposed scheme, we consider a total coverage area of
There are
To evaluate the performance of the proposed algorithm and for comparison purposes, we describe the following benchmark approaches below:
Offloading of all tasks to the ${\mathrm {AUV}}_{MEC}$ (Offloading): The ${\mathrm {AUV}}_{MEC}$ provides computing resources to the ${\mathrm {AUVs}}_{LOCAL}$ at a designated location. Each ${\mathrm {AUV}}_{LOCAL}$ offloads all its computing tasks to be processed remotely.
Execution of all tasks locally (Locally): All computing tasks of the ${\mathrm {AUVs}}_{LOCAL}$ are executed locally without offloading to the ${\mathrm {AUV}}_{MEC}$.
Deep Deterministic Policy Gradient (DDPG): The parameters for the proposed DDPG algorithm are configured to achieve optimal system performance.
Actor-Critic (AC): A continuous action space-based RL algorithm is implemented for the computational offloading problem to compare and evaluate the performance of the proposed DDPG algorithm.
B. Simulation Results
We trained the deep neural networks of the proposed models over a total of 2000 iterations/episodes. The configuration parameters of the DDPG algorithm are as follows: a network architecture of two hidden layers with 400 and 300 fully connected neurons for both the actor and the critic network, soft update coefficient
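A sketch of actor and critic networks with this 400/300 hidden-layer structure, written with the Keras functional API, is given below; the activation functions and output scaling are assumptions of the illustration rather than a statement of our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_actor(state_dim, action_dim):
    """Actor mu(s|theta_mu): two fully connected hidden layers of 400 and 300
    units; tanh output to be rescaled to the action ranges elsewhere."""
    s = layers.Input(shape=(state_dim,))
    x = layers.Dense(400, activation="relu")(s)
    x = layers.Dense(300, activation="relu")(x)
    a = layers.Dense(action_dim, activation="tanh")(x)
    return tf.keras.Model(s, a)

def build_critic(state_dim, action_dim):
    """Critic Q(s,a|theta_Q) with the same 400/300 hidden-layer structure."""
    s = layers.Input(shape=(state_dim,))
    a = layers.Input(shape=(action_dim,))
    x = layers.Concatenate()([s, a])
    x = layers.Dense(400, activation="relu")(x)
    x = layers.Dense(300, activation="relu")(x)
    q = layers.Dense(1)(x)
    return tf.keras.Model([s, a], q)
```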
Figure 4 shows the performance comparison between AC and DDPG reinforcement learning algorithms. The abscissa denotes the number of iterations of the main loop. The ordinate represents the episodic reward, which is the total reward obtained by the system in an episode. The final convergence of the DDPG algorithm shows that it outperforms the AC joint optimization algorithm in terms of cumulative reward. Although both algorithms perform favorably, DDPG reaches convergence after approximately 250 iterations and remains more stable. This implies that as the number of iterations increases, the total system delay and energy consumption decrease. Compared to its counterpart, this result highlights the efficiency of our proposed RL algorithm.
In reinforcement learning, convergence refers to the agent learning an optimal or suboptimal policy that maximizes its expected reward over time. The performance of the agent improves with increasing episodes of interaction with the environment.
In Figure 5 and Figure 6 we compare the influence of hyperparameters for DDPG. Figure 5 shows the convergence performance of the proposed algorithm (DDPG) with different batch sizes. The figure shows an enlarged picture of the convergence performance for each batch size. We observe that the DDPG algorithm becomes more stable during the training process only for batch size 64. When the batch size is 512, the algorithm does not converge and the reward deteriorates as the number of epochs increases. With a batch size of 256, the algorithm appears to converge optimally, but it lacks stability throughout training. When the batch size is 128, the algorithm converges nicely initially, but does not remain stable as training progresses. In contrast, with a batch size of 64, an optimal convergence is obtained around 250 epochs, and the algorithm remains significantly more stable than the other configurations as training progresses.
Figure 6. Convergence performance of the DDPG algorithm with different values of learning rates.
Figure 6 illustrates the convergence performance of the proposed algorithm with different learning rate values for both the actor network
After the training and convergence of the algorithms, we analyze the impact of the data size on the total delay and energy consumption by comparing the performance of the algorithms with different data sizes.
Figure 7 shows the total cumulative reward for DDPG, AC, Locally, and Offloading strategies.
Figure 7. Comparison of total cumulative reward benefit and task data size (workload of the AUV).
We can see that the total cumulative reward of the proposed algorithms can reach a near-optimal result, which means that an effective computational offloading policy can substantially reduce the total overhead of the AUVs when tasks are partially executed. Furthermore, the performance of the AC algorithm is close to that of the DDPG algorithm, because both explore a continuous action space and take precise actions, ultimately leading to the optimal offloading strategy while significantly reducing latency and energy consumption. However, when
Figure 8 and Figure 9 show the optimal trajectories of the AUVs for data collection from the CHs as well as for offloading from the
In Figure 8, the ocean current speed is
Conclusion
In this paper, a novel AUV-enabled MEC system has been introduced. A joint optimization algorithm to solve the offloading strategy, resource allocation, and trajectory selection of both
We have described the training process of the DDPG and AC algorithms. We have compared DDPG with other strategies (Locally, Offloading, and AC). Simulation results show that DDPG and AC converge well, but DDPG outperforms AC in terms of cumulative reward and stability. The convergence performance of the DDPG algorithm with different batch sizes and learning rates has been analyzed, and a batch size of 64 is found to be the optimal choice for convergence and stability. Simulations have been conducted to evaluate the proposed system’s performance, demonstrating that DDPG achieves significantly lower energy consumption and reduced average delay compared to the Total Offloading, Local Execution, and Actor-Critic algorithms. These results underscore the system’s efficiency in minimizing energy usage through its optimized task offloading strategies, resource allocation, and trajectory planning, making it particularly suitable for mission-critical applications that require real-time responsiveness and energy efficiency. Some challenges remain in AUV-enabled MEC systems. As future work, we plan to investigate MEC systems assisted by multiple AUVs functioning as a swarm of multi-access edge computing servers. This approach aims to extend coverage while addressing interference management and optimal offloading selection among AUVs. Furthermore, in this study each
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.