Introduction
6G-based ground-air-underwater networks [1] represent a groundbreaking paradigm shift in wireless communication. These networks aim to provide seamless connectivity across terrestrial, aerial, and aquatic domains, enabling unprecedented data exchange and interaction between sensors, devices, and vehicles operating in these diverse environments. Concerning the aquatic domain, the Internet of Underwater Things (IoUT) [2] can be defined as a worldwide network of smart interconnected underwater objects with a digital entity. These devices sense, interpret, and react to the environment thanks to the combination of the Internet, tracking technologies, and embedded sensors. Data is sent to the surface for computation and processing.

Underwater devices use different communication techniques to transmit information [3]: radio-frequency (RF), optical, and acoustic communication. RF communication offers high data throughput over short distances and is only mildly affected by the Doppler effect. It performs well in shallow waters but suffers from high signal attenuation in deep water. Optical transmission, preferably at blue-green wavelengths, requires line-of-sight positioning. Acoustic communication is the most widely used method because sound travels efficiently over long distances with relatively low energy loss. It enables the longest communication range but suffers from low throughput, significant Doppler-effect impairment, and high delay spread, which causes severe inter-symbol interference [4].

Data transmission directly from IoUT devices (e.g., USNs, robots, or cameras) to the surface sink is very energy-intensive. Autonomous Underwater Vehicles (AUVs) can help reduce communication distances between IoUT devices and sink nodes by collecting data from IoUT devices [5]. Sink nodes transmit the collected information to ships or Autonomous Aerial Vehicles (AAVs) in the aerial domain using radio-frequency waves [6] or satellite communications. Finally, the data is sent to ground stations in the terrestrial domain, where it is stored on servers (cloud computing). At this stage, the data is processed, and depending on the application, the results must be promptly sent back to the underwater devices. In this process, latency is the most detrimental factor for applications with real-time or mission-critical constraints, such as large-scale and real-time sensing data fusion and navigation systems [7]. To cope with this requirement, Multi-Access Edge Computing (MEC) has been developed, bringing cloud-like computational services closer to local devices [8]. These nearby devices are equipped with cloud-like resources, ensuring high reliability, scalability, and low latency in underwater networks.
Most prior studies overlook the computing capabilities provided by AUVs in the underwater environment, with only a few considering AUVs as edge computing nodes capable of executing tasks [8], [9]. We distinguish between “local AUVs” and a “MEC AUV”. “Local AUVs” typically have limited processing capabilities. The ‘MEC AUV’, on the other hand, is specialized to perform computational tasks efficiently. AUV-enabled MEC systems involving IoUT devices, cluster-heads, local AUVs, and MEC AUVs remain an unexplored area of research.
In this paper, we propose an innovative AUV-enabled MEC system where cluster-heads, which collect data from IoUT devices, offload their associated computing tasks to local AUVs. These AUVs are strategically positioned to (1) execute tasks entirely locally, (2) execute tasks partially and offload the remaining portion, or (3) fully offload tasks to a more resourceful MEC AUV.
Despite the advantages of AUV-assisted MEC, several challenges in network deployment and operation must be addressed. First, because of the limited onboard resources, it is difficult to determine the optimal amount of computation to allocate to each task offloaded from a local AUV to the MEC-enabled AUV. Second, it is challenging to control each AUV's trajectory (diving direction and speed), since every local AUV must serve cluster-heads along its route and the MEC-enabled AUV has to serve different local AUVs at different collection points. Finally, determining the optimal route for the AUVs is difficult because ocean currents affect their trajectories.
Inspired by the challenges mentioned above, we propose to minimize energy consumption and task delay by jointly optimizing the task offloading strategy, resource allocation, and AUV trajectories. We formulate an efficient model for trajectory optimization, task offloading, and resource allocation as a non-convex optimization problem. This model aims to minimize the weighted sum of service delays for all local AUVs (task offloading and computation delays) and the energy consumption of AUVs (transmission energy and computation energy) [10].
The main contributions of this paper are:
We propose an AUV-enabled MEC system comprising IoUT devices, cluster-heads, local AUVs, and an MEC AUV. This system aims to distribute the workload between local AUVs and the MEC AUV to optimally reduce underwater task execution time.
We formulate the task offloading of local AUVs, resource allocation of the MEC AUV, and path selection of both local and MEC AUVs as a joint optimization problem. The objective is to minimize the underwater task execution delay while reducing the energy consumption of the entire system.
Since the problem formulated is NP-hard, we transform it into a Markov Decision Process (MDP) and solve it using a deep reinforcement learning-based algorithm, Deep Deterministic Policy Gradient (DDPG). Two deep neural networks, the actor and the critic, are employed. The actor network is responsible for deciding the speed of the AUVs, the task offloading strategy, and the resource allocation for the MEC AUV, while the critic network evaluates the actions generated by the actor network.
We have conducted extensive simulations to evaluate the effectiveness of our proposed communication system. The simulation results show that our proposed algorithm outperforms the Total Offloading (Offloading), Local Execution (Locally), and Actor-Critic (AC) algorithms and achieves a lower average delay and energy consumption. Our findings confirm that our proposal can be effectively implemented and is well-suited for mission-critical applications.
To the best of our knowledge, this is the first paper that proposes an AUV-enabled MEC system involving IoUT devices, cluster-heads, local AUVs, and MEC AUVs. The joint optimization of AUVs’ trajectories, task offloading strategy, and resource allocation in an AUV-assisted MEC, with a focus on energy efficiency and delay minimization, has not been explored before.
The remainder of the paper is organized as follows. Section II reviews related work on AUVs, MEC, and reinforcement learning. The system model is presented in Section III. Reinforcement learning-based methods are explained in Section IV. Section V presents the DDPG algorithm methodology and problem solution. Section VI describes the experiments carried out and discusses the results. Finally, the paper is concluded in Section VII.
Related Work
Different strategies and methods have been implemented for data collection from USNs. On the one hand, in basic approaches, such as traditional multi-hop data collection, data is relayed from one sensor node to another until it reaches the sink node at the sea surface. This strategy has several drawbacks such as the time needed for the information to reach the control center and the high energy consumption. On the other hand, more complex strategies involve additional aerial or underwater devices (e.g., AAVs and AUVs) in data collection. These approaches often leverage multi-access edge computing and use reinforcement learning techniques to intelligently ensure the optimal collection of information from the underwater environment. Next, we summarize them.
A. AAV-Assisted Data Acquisition
Several research works have proposed Autonomous Aerial Vehicle (AAV)-assisted data acquisition schemes for the IoUT. Acoustic links are commonly used for underwater communication due to the efficient propagation of sound in this medium, which allows longer ranges. In contrast, electromagnetic waves, such as radio waves, are significantly absorbed and attenuated in water, especially in saltwater, making them impractical for most types of long-range communication. Some low-frequency electromagnetic waves can penetrate water to a limited extent, but are not widely used due to their large wavelength and the technical challenges associated with generating and receiving them. In [6], the authors propose an underwater data acquisition scheme assisted by an AAV. In this scheme, underwater sensor data are first transmitted via an acoustic signal link to a floating sink node, which forwards the data to an AAV using an electromagnetic link. In [11], the authors present an energy-efficient data collection scheme for AAV-assisted ocean monitoring networks, which aims to jointly maximize the energy efficiency of aerial as well as aquatic communications. In [12], the authors propose an AAV-assisted ocean monitoring network architecture designed to ensure timely transmission and extend network lifetime. In these schemes ([6], [11], [12]), during the data acquisition stage in the marine environment, information is sent directly from the USNs to the sink nodes through acoustic signals. Afterwards, in the aerial environment, the data is collected by an AAV through electromagnetic wave propagation.
B. AUV-Assisted Data Collection
Other research works focus on AUV-assisted data collection schemes for the IoUT. In [13], the authors present a Hybrid Data Collection Scheme (HDCS) that considers both real-time data collection and energy-efficiency (EE) issues. In [14], motivated by the energy limitations of underwater devices and the high demand for data collection, the authors present an AUV-assisted underwater acoustic sensor network to reduce energy consumption and improve network performance. In [15], the authors present a heterogeneous underwater information collection scheme to optimize the peak Age of Information (AoI) and to improve the energy efficiency of the aquatic device nodes. To minimize the energy consumption of resource-constrained devices, aquatic mobile devices (AUVs) are introduced, reducing the transmission distance between the USNs and the AUV, as well as between the AUV and the sink nodes. However, since the main objective is only to collect data, there may be delays until the information reaches the cloud-based servers and is processed.
Furthermore, several studies explore frameworks for enabling edge computing in underwater environments using AUVs.
C. Multi-Access Edge Computing (MEC)
In [9], the authors propose a data collection scheme based on an underwater mobile edge element (an AUV) and design a target selection algorithm to compute the mobility path of the AUV for data collection in a stable 3D environment. In this approach, they deploy an AUV to visit all target nodes and collect data by Magnetic Induction (MI) communication. Here the AUV (mobile edge platform) processes and stores a large amount of data to be sent to the sink node, and then the sink node sends the data to the cloud. In [16], the authors present a service-driven intelligent ocean convergence platform using software-defined networking and edge computing. Similar to the studies discussed above, this approach focuses on data acquisition. However, instead of sending the data to the cloud directly, it is sent to an edge server; in the first case ([9]), the AUV is used as an MEC server; in the second case ([16]), an intelligent ocean convergence platform (that contains ship and buoy nodes) is used for edge computing purposes. Nevertheless, these works ([9], [16]) only deal with data collection; they do not analyze computation offloading, that is, whether it is preferable to execute the computation tasks locally or to offload them (partially or completely) to the AUV-enabled MEC server. In contrast, our proposal explores task offloading between a local AUV that performs data collection and an AUV-enabled MEC server. Our proposal aims to minimize the sum of the total execution delay and energy consumption during the whole process of executing a task by solving the offloading strategy, AUV path optimization, and resource allocation.
In [17], the authors highlight the role of edge computing in optimizing big data processing for underwater applications. Their work demonstrates how MEC enhances AUV-assisted data processing, reducing latency and improving efficiency. Building on this, our approach optimizes task offloading to minimize delay and energy consumption in underwater networks.
D. Reinforcement Learning
Reinforcement learning (RL) is a powerful framework in artificial intelligence, enabling agents to learn optimal behaviors through interactions with dynamic environments. By leveraging RL, agents can make intelligent decisions and adapt to changing conditions. In underwater applications, RL techniques have been utilized for tasks such as node repair, energy management, trajectory optimization, and data collection, demonstrating their potential to optimize performance and address complex challenges in aquatic environments.
In [18], a strategy for repairing nodes in the IoUT using AUVs and a multiagent reinforcement learning framework is discussed. The paper tackles the challenges of node failure caused by environmental conditions and energy limitations by proposing a novel node repair scheme. This scheme allows AUVs to autonomously identify and replace faulty nodes, ensuring continuous network operation. The use of multiple underwater mobile chargers to enhance the charging process in underwater sensor networks through a multi-agent reinforcement learning approach is explored in [19]. The unique challenges of the underwater environment, such as variable energy consumption due to movement characteristics of underwater mobile chargers and the necessity for coordination among them, are addressed. In [20], the network topology in UWSNs is optimized to enhance transmission reliability, minimize delay, and extend network lifetime. The challenges posed by dynamic ocean currents and complex communication environments are tackled by employing a centralized topology control strategy, which utilizes deep reinforcement learning to manage the network effectively. The papers [18], [19], [20] focus on leveraging multi-agent reinforcement learning to address operational challenges in underwater environments: optimizing topology control for transmission reliability, assisting node repair in the IoUT, and managing energy charging for underwater rechargeable sensor networks. In contrast, our paper takes a broader approach by integrating MEC with AUVs to address challenges in a next-generation (6G) underwater network. AUVs act not just as data collectors or node repairers but as mobile computing nodes. This integration reduces latency and enhances data processing capabilities at the edge of the network.
In [21], a multi-AUV data collection system is proposed. This system is designed to optimize the trajectory and task allocation of AUVs based on the urgency of data uploads from IoUT devices and the dynamic underwater environment. A Markov decision process and a multiagent independent soft actor-critic algorithm are employed to maximize data collection rates and throughput while minimizing energy consumption. An advanced multi-tier underwater computing framework to optimize both the trajectory and resource management of AUVs in support of the IoUT is explored in [22]. An environment-aware system that integrates communication, computing, and storage resources across AUVs, surface stations, and IoUT devices to improve overall system efficiency is introduced. The complex, high-dimensional optimization problem is addressed using an asynchronous advantage actor-critic (A3C) algorithm, with simulations demonstrating that the approach enhances system profits by efficiently managing AUV trajectories and resource allocation in dynamic underwater environments. A novel approach to optimize both data throughput and energy harvesting in underwater sensor networks using AUVs and simultaneous wireless information and power transfer (SWIPT) is discussed in [23]. The proposed system employs a model-free reinforcement learning solution to manage the AUV’s trajectory for efficient data collection and energy distribution to sensor nodes. The papers [21], [22], [23] predominantly focus on optimizing AUV operations within the IoUT through various approaches like energy-aware data collection, trajectory design considering environmental factors, and integrating simultaneous wireless information and power transfer (SWIPT) for sustainability. In contrast, our paper specifically introduces the integration of 6G technologies with AUV-based MEC systems, focusing on minimizing latency and energy consumption through advanced task offloading and resource management strategies.
System Model
In this section, we define the scenario for data processing (real-time response for mission-critical applications) and data collection (cloud storage) tasks. Afterwards, we analyze the reinforcement learning algorithm used for the intelligent offloading of tasks to the edge device. We summarize all the following notations and their definitions in Table 1.
A. Network Model
Figure 1 shows the network model. Along the seabed, several IoUT devices (USN fixed on the seafloor) are randomly deployed in a 3D (
This assumption works well in environments where the number of sensors is limited and they are reasonably spaced apart from each other. Therefore, it is assumed that 10–50 sensors are deployed per square kilometer in underwater environments. This lower density accounts for the challenges of underwater communication, such as signal attenuation and the need for longer-range transmissions. Sensors are spaced 200 to 500 meters apart on average to minimize mutual interference and to allow for efficient data transmission using acoustic communication. This density achieves a balance between effective monitoring and ensuring minimal communication interference in an underwater setting, where signals face unique propagation challenges.
We propose to bring computing resources closer to IoUT devices. Therefore, we consider that the
B. Communication Model
We consider two communication interfaces: cluster-head-to-
1) Underwater Acoustic Channel
The narrow-band Signal-to-noise Ratio (SNR) of an emitted underwater signal at the receiver can be expressed by the passive sonar equation [25]:\begin{equation*} SNR\left ({{ l,f }}\right )=SL\left ({{ f }}\right )-A\left ({{ l,f }}\right )-N\left ({{ f }}\right )+DI\ge DT \tag {1}\end{equation*}
The attenuation, transmission loss, or path loss over a transmission range l for a frequency f can be given by [26].\begin{equation*} 10\log {A\left ({{ l,f }}\right )}=k\cdot 10 \log l+l\cdot 10\log {\alpha \left ({{ f }}\right )} \tag {2}\end{equation*}
The ambient noise can be modeled by four basic sources [27]\begin{equation*} N\left ({{ f }}\right )=N_{t}\left ({{ f }}\right )+N_{s}\left ({{ f }}\right )+N_{w}\left ({{ f }}\right )+N_{th}\left ({{ f }}\right ) \tag {3}\end{equation*}
\begin{align*} 10\log {N_{t}\left ({{ f }}\right )}& =17-30\log \left ({{ f }}\right ) \tag {4}\\ 10\log {N_{s}\left ({{ f }}\right )}& =40+20\left ({{ s-0.5 }}\right )+26\log \left ({{ f }}\right ) \\ & \quad -60\log \left ({{ f+0.03 }}\right ) \tag {5}\end{align*}
\begin{align*} 10\log {N_{w}\left ({{ f }}\right )}& =50+7.5w^{\frac {1}{2}}+20\log \left ({{ f }}\right ) \\ & \quad -40\log \left ({{ f+0.4 }}\right ) \tag {6}\end{align*}
\begin{equation*} 10\log {N_{th}\left ({{ f }}\right )}=-15+20\log \left ({{ f }}\right ) \tag {7}\end{equation*}
\begin{equation*} I_{T}^{SW}=\frac {P_{T}}{2\pi \cdot z},{I}_{T}^{DW}=\frac {P_{T}}{4\pi } \tag {8}\end{equation*}
for shallow and deep water, respectively, where
If we consider the frequency-dependent part of the narrow-band SNR \begin{align*} r_{tx}& =\sum \nolimits _{i} {\Delta f\log _{2}\left ({{ 1+\frac {SL\left ({{ l,f }}\right )}{A\left ({{ l,f }}\right )N\left ({{ f }}\right )} }}\right )} \\ & =B\log _{2}\left [{{ 1+\frac {I_{T}\gamma \left ({{ l,f }}\right )}{1\mu Pa} }}\right ] \tag {9}\end{align*}
Therefore, the data rate of the cluster-head-to-
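To make the channel model of equations (2)-(9) concrete, the following Python sketch evaluates the path loss, ambient noise, and narrow-band SNR for an example link. It is an illustration rather than the simulation code used in this work: Thorp's empirical formula is assumed as the absorption model behind α(f), and the spreading factor k, shipping activity s, wind speed w, source level SL, and directivity index DI are placeholder values.

```python
import numpy as np

def thorp_absorption_db_per_km(f_khz):
    """Thorp's empirical absorption coefficient in dB/km (f in kHz).
    Assumed here as the model behind alpha(f) in Eq. (2); the paper cites [26]."""
    f2 = f_khz ** 2
    return 0.11 * f2 / (1 + f2) + 44 * f2 / (4100 + f2) + 2.75e-4 * f2 + 0.003

def path_loss_db(l_m, f_khz, k=1.5):
    """10 log A(l,f) = k*10*log(l) + l*alpha_dB(f), the dB form of Eq. (2).
    k is the spreading factor (1 cylindrical, 2 spherical; 1.5 is a common practical value)."""
    return k * 10 * np.log10(l_m) + (l_m / 1000.0) * thorp_absorption_db_per_km(f_khz)

def ambient_noise_db(f_khz, s=0.5, w=0.0):
    """Ambient noise PSD N(f) in dB re uPa/Hz from the four sources of Eqs. (3)-(7).
    s: shipping activity factor in [0, 1]; w: wind speed in m/s (assumed inputs)."""
    nt = 17 - 30 * np.log10(f_khz)
    ns = 40 + 20 * (s - 0.5) + 26 * np.log10(f_khz) - 60 * np.log10(f_khz + 0.03)
    nw = 50 + 7.5 * np.sqrt(w) + 20 * np.log10(f_khz) - 40 * np.log10(f_khz + 0.4)
    nth = -15 + 20 * np.log10(f_khz)
    # Sum the four sources in the linear power domain, then convert back to dB.
    return 10 * np.log10(sum(10 ** (n / 10.0) for n in (nt, ns, nw, nth)))

# Example: narrow-band SNR of Eq. (1) at l = 1 km, f = 20 kHz, for an assumed
# source level SL = 160 dB re uPa and directivity index DI = 0 dB.
l, f = 1000.0, 20.0
snr_db = 160.0 - path_loss_db(l, f) - ambient_noise_db(f, s=0.5, w=5.0) + 0.0
print(f"SNR(l={l} m, f={f} kHz) ~ {snr_db:.1f} dB")
```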
C. Computing Model
1) The Velocity Synthesis Approach
Accounting for the effect of ocean currents in the underwater environment, AUVs (
The lower left point in the figure represents the AUV and the higher point represents the target point. The vector \begin{equation*} V\cdot \sin \left ({{ \theta ^{h}-ai }}\right )=Uc\cdot \sin \left ({{ ai-ai1 }}\right ) \tag {10}\end{equation*}
From the above equation, it can be calculated\begin{equation*} \theta ^{h}=\arcsin \left ({{ Uc\cdot \frac {\sin \left ({{ ai-ai1 }}\right )}{V} }}\right )+ai \tag {11}\end{equation*}
The speed synthesis algorithm implementation is based on the precondition:\begin{align*} \left |{{ Uc }}\right |& \lt \left |{{ V }}\right | \tag {12}\\ V_{L}& =V_{d}+{Uc}_{d} \tag {13}\end{align*}
Summarizing, by combining equations (10) to (13),
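As a simple illustration of equations (10)-(13), the sketch below computes the compensated heading θ^h for given AUV and current speeds. The angle convention (ai as the bearing to the target, ai1 as the current direction, both in radians) and the numeric values are assumptions of this example.

```python
import math

def heading_with_current(v_auv, u_current, ai, ai1):
    """Heading theta_h the AUV must steer so that its resultant velocity points
    along the desired bearing ai despite a current of speed u_current flowing
    in direction ai1 (radians), following Eq. (11).
    Requires |u_current| < |v_auv|, the precondition of Eq. (12)."""
    if abs(u_current) >= abs(v_auv):
        raise ValueError("Current speed must be lower than AUV speed (Eq. 12)")
    return math.asin(u_current * math.sin(ai - ai1) / v_auv) + ai

# Example: AUV at 2 m/s, a 0.5 m/s current flowing westward, target due north.
ai = math.pi / 2           # desired bearing (north), assumed example value
ai1 = math.pi              # current direction (west), assumed example value
theta_h = heading_with_current(2.0, 0.5, ai, ai1)
print(f"steer heading = {math.degrees(theta_h):.1f} deg")
```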
2) Auvs’ Trajectory
As mentioned above, we assume that all
The horizontal distance from the \begin{equation*} d_{mec,i}^{h}\left ({{ t }}\right )=\sqrt {\left ({{ X\left ({{ t }}\right )-x_{i}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ Y\left ({{ t }}\right )-y_{i}\left ({{ t }}\right ) }}\right )^{2}} \tag {14}\end{equation*}
\begin{equation*} d_{k,j}^{h}\left ({{ t }}\right )=\sqrt {\left ({{ x_{k}\left ({{ t }}\right )-x_{j}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ y_{k}\left ({{ t }}\right )-y_{j}\left ({{ t }}\right ) }}\right )^{2}} \tag {15}\end{equation*}
For security reasons, each AUV can only move within a rectangular area to avoid possible collisions. The maximum bounds for the \begin{align*} 0& \le X\left ({{ t }}\right )\le X^{max}, \forall t\in \mathcal {T} \tag {16}\\ 0& \le Y\left ({{ t }}\right )\le Y^{max}, \forall t\in \mathcal {T} \tag {17}\end{align*}
\begin{align*} x_{k}^{min}& \le x_{k}\left ({{ t }}\right )\le x_{k}^{max}, \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {18}\\ y_{k}^{min}& \le y_{k}\left ({{ t }}\right )\le y_{k}^{max}, \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {19}\end{align*}
Therefore, the
To get the trajectory of the
Similarly, for the trajectory of the
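In a simulation, the bound constraints (16)-(19) can be enforced by clipping each AUV position after every movement step, as in the following minimal sketch; the rectangle limits used in the example are illustrative values, not the ones used in our experiments.

```python
import numpy as np

def clip_position(x, y, x_min, x_max, y_min, y_max):
    """Keep an AUV inside its rectangular safety area, enforcing the bound
    constraints of Eqs. (16)-(19) after each movement step."""
    return float(np.clip(x, x_min, x_max)), float(np.clip(y, y_min, y_max))

# Example with assumed bounds: a local AUV restricted to [100, 900] x [100, 900] m.
x_next, y_next = clip_position(950.0, 420.0, 100.0, 900.0, 100.0, 900.0)
print(x_next, y_next)   # -> 900.0 420.0
```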
3) Cluster-Head Data Collection
We assume that the
Next, each cluster-head \begin{equation*} D_{tx}^{ch,auvloc}\left ({{ t }}\right )=\frac {L_{k, j}\left ({{ t }}\right )}{r_{ch, auvloc}} \tag {20}\end{equation*}
To get the propagation delay \begin{align*} & d_{k,j}^{e}\left ({{ t }}\right ) \\ & =\sqrt {\left ({{ x_{k}\left ({{ t }}\right )-x_{j}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ y_{k}\left ({{ t }}\right )-y_{j}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ z_{k}\left ({{ t }}\right )-z_{j}\left ({{ t }}\right ) }}\right )^{2}} \tag {21}\\ & D_{prop}^{ch,auvloc}\left ({{ t }}\right )=\frac {d_{k,j}^{e}\left ({{ t }}\right )}{v_{prop}} \tag {22}\end{align*}
Afterwards, if the task
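The delays of equations (20)-(22) can be computed as in the sketch below, which assumes a typical sound speed of about 1500 m/s for v_prop; the task size, data rate, and node coordinates are example values, not the parameters of our simulations.

```python
import math

SOUND_SPEED = 1500.0  # m/s, a typical value assumed for v_prop in seawater

def acoustic_delays(task_bits, rate_bps, src, dst):
    """Transmission delay D_tx = L / r (Eq. 20) and propagation delay
    D_prop = d / v_prop (Eq. 22) over the Euclidean distance of Eq. (21)."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(src, dst)))  # 3D distance in m
    return task_bits / rate_bps, d / SOUND_SPEED

# Example with assumed values: a 2 Mbit task sent at 10 kbps over roughly 540 m.
d_tx, d_prop = acoustic_delays(2e6, 1e4, (0, 0, -1000), (300, 400, -800))
print(f"D_tx = {d_tx:.1f} s, D_prop = {d_prop:.2f} s")
```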
4) Local or Partially Local Computing Model
When the \begin{equation*} D_{{AUV}_{L}}\left ({{ t }}\right )=\frac {\left ({{ 1-\alpha _{k, j} }}\right ){L}_{k, j}\left ({{ t }}\right ){C}_{k, j}\left ({{ t }}\right )}{f_{{AUV}_{L}}} \tag {23}\end{equation*}
\begin{equation*} E_{k,j}^{L}\left ({{ t }}\right )=\mu \left ({{ f_{{AUV}_{L}} }}\right )^{\lambda }D_{{AUV}_{L}}\left ({{ t }}\right ) \tag {24}\end{equation*}
5) Offloading or Partially Offloading Computing Model
We consider that the
Data upload: The ${\mathrm {AUV}}_{LOCAL}\,k$ uploads the required input data (i.e., program codes and parameters) to the ${\mathrm {AUV}}_{MEC}$ in the underwater medium.
Task execution: The ${\mathrm {AUV}}_{MEC}$ allocates part of its computational resources and executes the computing task.
Result retrieval: The ${\mathrm {AUV}}_{MEC}$ returns the execution results to ${\mathrm {AUV}}_{LOCAL}\,k$.
Based on these steps, the time required for the first step of offloading computing is the transmission delay and the propagation delay. The transmission delay \begin{equation*} D_{tx}^{auvloc, auvmec}\left ({{ t }}\right )=\frac {\alpha _{k, j} L_{k, j}\left ({{ t }}\right )}{r_{auvloc,auvmec}} \tag {25}\end{equation*}
The Euclidean distance between the \begin{align*} & d_{mec,k}^{e}\left ({{ t }}\right ) \\ & \quad =\sqrt {\left ({{ X\left ({{ t }}\right )\!-\!x_{k}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ Y\left ({{ t }}\right )\!-\!y_{k}\left ({{ t }}\right ) }}\right )^{2}+\left ({{ Z\left ({{ t }}\right )\!-\!z_{k}\left ({{ t }}\right ) }}\right )^{2}} \tag {26}\end{align*}
The propagation delay is given by\begin{equation*} D_{prop}^{auvloc, auvmec}\left ({{ t }}\right )=\frac {d_{mec,k}^{e}\left ({{ t }}\right )}{v_{prop}} \tag {27}\end{equation*}
The overall energy required by the \begin{equation*} E_{k,j}^{O}\left ({{ t }}\right )=P_{tx}^{auvloc, auvmec}D_{tx}^{auvloc, auvmec}\left ({{ t }}\right ) \tag {28}\end{equation*}
\begin{equation*} D_{{AUV}_{M}}\left ({{ t }}\right )=\frac {\alpha _{k, j} L_{k, j}\left ({{ t }}\right ) C_{k, j}\left ({{ t }}\right )}{f_{kj}^{m} F} \tag {29}\end{equation*}
\begin{equation*} E_{k,j}^{M}\left ({{ t }}\right )=\mu \left ({{ f_{kj}^{m}F }}\right )^{\lambda }D_{{AUV}_{M}}\left ({{ t }}\right ) \tag {30}\end{equation*}
For the last step of offloading computing, the time delay for receiving the processed result can be expressed as follows:\begin{equation*} D_{rx}\left ({{ t }}\right )=\frac {L_{m}\left ({{ j }}\right )}{r_{rx}} \tag {31}\end{equation*}
6) Task Completion Time and Energy Consumption
The total time to complete a task \begin{equation*} D_{k,j}^{L}\left ({{ t }}\right )=D_{tx}^{ch,auvloc}\left ({{ t }}\right )+D_{prop}^{ch,auvloc}\left ({{ t }}\right )+D_{{AUV}_{L}}\left ({{ t }}\right ) \tag {32}\end{equation*}
The total time to complete a task \begin{align*} D_{k,j}^{O}\left ({{ t }}\right )& =D_{tx}^{ch,auvloc}\left ({{ t }}\right )+D_{prop}^{ch,auvloc}\left ({{ t }}\right )+D_{tx}^{auvloc, auvmec}\left ({{ t }}\right ) \\ & \quad +D_{prop}^{auvloc, auvmec}\left ({{ t }}\right )+D_{{AUV}_{M}}\left ({{ t }}\right ) \tag {33}\end{align*}
To summarize, the total time to complete the task \begin{align*} & D_{k,j}\left ({{ t }}\right ) \\ & =\begin{cases} \displaystyle D_{k,j}^{L}\left ({{ t }}\right ), & {\alpha }_{k, j}=0; local~execution \\ \displaystyle D_{k,j}^{O}\left ({{ t }}\right ), & {\alpha }_{k, j}=1; offloading \\ \displaystyle max \left ({{ D_{k,j}^{L}\left ({{ t }}\right ), D_{k,j}^{O}\left ({{ t }}\right ) }}\right ),& {0\lt \alpha }_{k, j}\lt 1; \\ \displaystyle & partial ~Offloading \end{cases} \tag {34}\end{align*}
And the overall energy consumption \begin{align*} & E_{k,j}\left ({{ t }}\right ) \\ & =\begin{cases} \displaystyle E_{k,j}^{L}\left ({{ t }}\right ), & {\alpha }_{k, j}=0; local~execution \\ \displaystyle E_{k,j}^{O}\left ({{ t }}\right ), & {\alpha }_{k, j}=1; offloading \\ \displaystyle E_{k,j}^{L}\left ({{ t }}\right )+E_{k,j}^{O}\left ({{ t }}\right )+E_{k,j}^{M}\left ({{ t }}\right );& { 0\lt \alpha }_{k, j}\lt 1; \\ \displaystyle & partial~Offloading \end{cases} \tag {35}\end{align*}
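A compact way to evaluate equations (23)-(35) for a given offloading ratio α is sketched below. The CPU power-model coefficients μ and λ and all numeric inputs in the example call are illustrative assumptions, not the parameter values used in our simulations.

```python
def task_cost(alpha, L_bits, C_cpb, f_local, f_share, F_mec,
              d_tx_ch, d_prop_ch, r_off, d_prop_off, p_tx_off,
              mu=1e-26, lam=3):
    """Delay (Eqs. 32-34) and energy (Eqs. 24, 28, 30, 35) of one task for an
    offloading ratio alpha in [0, 1]."""
    # Local branch: the (1 - alpha) share processed on the local AUV (Eq. 23).
    d_cpu_loc = (1 - alpha) * L_bits * C_cpb / f_local
    d_local = d_tx_ch + d_prop_ch + d_cpu_loc                          # Eq. (32)
    e_local = mu * f_local ** lam * d_cpu_loc                          # Eq. (24)
    # Offloading branch: the alpha share uploaded to and run on the MEC AUV.
    d_tx_off = alpha * L_bits / r_off                                  # Eq. (25)
    d_cpu_mec = alpha * L_bits * C_cpb / (f_share * F_mec)             # Eq. (29)
    d_off = d_tx_ch + d_prop_ch + d_tx_off + d_prop_off + d_cpu_mec    # Eq. (33)
    e_off = p_tx_off * d_tx_off                                        # Eq. (28)
    e_mec = mu * (f_share * F_mec) ** lam * d_cpu_mec                  # Eq. (30)
    if alpha == 0:
        return d_local, e_local
    if alpha == 1:
        return d_off, e_off
    return max(d_local, d_off), e_local + e_off + e_mec                # Eqs. (34)-(35)

# Example with assumed numbers: 1 Mbit task, 500 cycles/bit, 0.5 GHz local CPU,
# 30% of a 2 GHz MEC CPU, and half of the task offloaded.
print(task_cost(0.5, 1e6, 500, 5e8, 0.3, 2e9,
                d_tx_ch=10.0, d_prop_ch=0.3, r_off=2e4, d_prop_off=0.2,
                p_tx_off=2.0))
```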
D. Problem Formulation
In this work, we consider the joint optimization of the trajectory of \begin{align*} & \min \limits _{U,A,F}\sum \limits _{t\in {T}} \left [{{ \left ({{ 1-\omega }}\right )\sum \limits _{k\in {K}} \sum \limits _{j\in R} {D_{k,j}\left ({{ t }}\right ) +\omega \sum \limits _{k\in {K}} \sum \limits _{j\in R} {E_{k,j}\left ({{ t }}\right )}} }}\right ] \tag {36}\\ & ~s.t.~0 \le \theta _{mec}^{h}\left ({{ t }}\right )\le 2\pi , \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {36a}\\ & \hphantom {~s.t.~}0 \le \theta _{k}^{h}\left ({{ t }}\right )\le 2\pi , \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {36b}\\ & \hphantom {~s.t.~}0 \le X_{mec}\left ({{ t }}\right )\le X_{mec}^{max}, \forall t\in \mathcal {T} \tag {36c}\\ & \hphantom {~s.t.~}0 \le Y_{mec}\left ({{ t }}\right )\le Y_{mec}^{max}, \forall t\in \mathcal {T} \tag {36d}\\ & \hphantom {~s.t.~}x_{k}^{min}\le x_{k}\left ({{ t }}\right )\le x_{k}^{max}, \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {36e}\\ & \hphantom {~s.t.~}y_{k}^{min}\le y_{k}\left ({{ t }}\right )\le y_{k}^{max}, \forall k\in \mathcal {K}, t\in \mathcal {T} \tag {36f}\\ & \hphantom {~s.t.~}f_{kj}^{m}=0,~\text {if}~\alpha _{k,j}\left ({{ t }}\right )=0 \tag {36g}\\ & \hphantom {~s.t.~}\sum \nolimits _{1}^{k} f_{kj}^{m} \le 1,~\text {if}~\alpha _{k,j}\left ({{ t }}\right )\ne 0 \tag {36h}\end{align*}
The constraints (36a), (36b) ensure the horizontal direction of motion for the
An efficient trajectory optimization, task offloading, and resource allocation model is formulated as a nonlinear problem. Its objective is to minimize the delay and energy consumption of
Consequently, instead of using traditional optimization approaches, reinforcement learning-based methods are employed to obtain the near-optimal solutions for the variable parameters. For this purpose, we employ a deep deterministic policy gradient algorithm to address the problem.
Reinforcement Learning-Based Methods
In this section, we propose a computational offloading algorithm based on DDPG for underwater MEC networks. The algorithm automatically selects tasks to be offloaded to an MEC server mounted on an
A. Q-Learning
Q-learning is a method proposed by Watkins [30], [31] to solve Markov Decision Processes (MDPs) with incomplete information. In essence, Q-learning is a reinforcement learning technique where an agent in a state s transitions to a new state
The “Q” in Q-learning stands for quality: the model chooses its next action by improving the quality of each state-action pair. For this purpose, the Q-learning algorithm uses the Bellman equation to update and learn the Q-values.\begin{equation*} Q\left ({{ s,a }}\right )=r+\left [{{ \gamma {max}_{a^{\prime }}Q\left ({{ s^{\prime }, a^{\prime } }}\right ) }}\right ] \tag {37}\end{equation*}
The model stores all Q-values for each state-action pair in a table, known as the Q-table. This table is initialized to zero, since it does not include any prior knowledge.
From Algorithm 1, the key Q-learning update rule is:\begin{align*} Q\left ({{ s,a }}\right )\leftarrow Q\left ({{ s,a }}\right )+\alpha \left [{{ r+\gamma {max}_{a}Q\left ({{ s^{\prime }, a }}\right )-Q\left ({{ s,a }}\right ) }}\right ] \tag {38}\end{align*}
Algorithm 1 Q-Learning Algorithm
Initialize $Q\left ({{ s,a }}\right )$ arbitrarily (e.g., to zero)
Repeat (for each episode):
Initialize s
Repeat (for each step of the episode):
Choose a from s using policy derived from Q (e.g., $\epsilon $-greedy)
Take action a, observe r, $s^{\prime }$
$Q\left ({{ s,a }}\right )\leftarrow Q\left ({{ s,a }}\right )+\alpha \left [{{ r+\gamma {max}_{a^{\prime }}Q\left ({{ s^{\prime }, a^{\prime } }}\right )-Q\left ({{ s,a }}\right ) }}\right ]$
$s\leftarrow s^{\prime }$
until s is terminal
end for
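For reference, a minimal tabular implementation of Algorithm 1 and the update rule of equation (38) could look as follows; the environment interface (reset/step) is an assumption of this sketch and is unrelated to the offloading problem itself.

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning following Algorithm 1 and the update rule of Eq. (38).
    `env` is assumed to expose reset() -> s and step(a) -> (s_next, r, done)."""
    Q = np.zeros((n_states, n_actions))          # Q-table initialized to zero
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection derived from Q
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # Bellman update, Eq. (38)
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```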
One limitation of Q-learning is its suitability primarily for problems with small state spaces. As the state space grows, the memory and computation required to maintain and update the Q-table increase significantly, making it impractical for larger environments.
B. DQN
For larger state spaces, deep Q-learning helps the model estimate the Q-values efficiently and perform tasks more effectively. Deep Q-learning enables the use of the Q-Learning strategy by integrating artificial neural networks: Neural Networks (NN), Deep Neural Networks (DNN), and Convolutional Neural Networks (CNN). A neural network enables the agent to choose actions by processing inputs, which represent the states of the environment [33]. After receiving the input, the neural network estimates the Q values. The agent makes decisions based on these Q values. To achieve this, deep Q-learning employs the Bellman equation adapted for DQN:\begin{equation*} Q\left ({{ s,a;\theta }}\right )=r+\left [{{ \gamma {max}_{a^{\prime }}Q\left ({{ s', a'; \theta ' }}\right ) }}\right ] \tag {39}\end{equation*}
The neural network is trained by computing the loss or cost function, which compares the target value
Algorithm 2 Deep Q-Learning With Experience Replay
Initialize replay memory $D$
Initialize action-value function Q with random weights $\theta $
for episode =1, M do
Initialize sequence $s_{1}$
for t =1, T do With probability $\epsilon $ select a random action $a_{t}$
otherwise, select $a_{t}={argmax}_{a}Q\left ({{ s_{t},a;\theta }}\right )$
Execute action $a_{t}$, observe reward $r_{t}$ and next state $s_{t+1}$
Set $s_{t+1}$ as the current state
Store transition $\left ({{ s_{t},a_{t},r_{t},s_{t+1} }}\right )$ in $D$
Sample random minibatch of transitions $\left ({{ s_{j},a_{j},r_{j},s_{j+1} }}\right )$ from $D$
Set $y_{j}=r_{j}$ if the episode terminates at step $j+1$; otherwise $y_{j}=r_{j}+\gamma {max}_{a^{\prime }}Q\left ({{ s_{j+1},a^{\prime };\theta ^{\prime } }}\right )$
Perform a gradient descent step on $\left ({{ y_{j}-Q\left ({{ s_{j},a_{j};\theta }}\right ) }}\right )^{2}$ with respect to $\theta $
end for
end for
The loss function is then expressed as\begin{equation*} L(\theta )=\mathbb {E}\left [{{ {(r+\gamma ~max_{a^{\prime }}{Q\left ({{ s', a';\theta ' }}\right )-}Q\left ({{ s,a;\theta }}\right ))}^{2} }}\right ] \tag {40}\end{equation*}
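A minimal sketch of the temporal-difference target of equation (39) and the loss of equation (40) over a replay minibatch is shown below; the online and target networks are represented abstractly as callables, so this is an illustration rather than our training code.

```python
import numpy as np

def dqn_loss(batch, q_online, q_target, gamma=0.99):
    """Mean squared TD error of Eq. (40) over a replay minibatch.
    q_online(s) / q_target(s) return Q-value vectors over all actions;
    batch entries are (s, a, r, s_next, done) tuples."""
    errors = []
    for s, a, r, s_next, done in batch:
        # Target y = r + gamma * max_a' Q(s', a'; theta'), Eq. (39);
        # no bootstrapping on terminal transitions.
        y = r if done else r + gamma * np.max(q_target(s_next))
        errors.append((y - q_online(s)[a]) ** 2)
    return float(np.mean(errors))
```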
C. DDPG
Q-learning and DQN perform well with discrete action spaces. However, discretizing continuous action spaces can result in an excessively large number of possible actions, making convergence difficult to achieve. The deep deterministic policy gradient algorithm extends the Deep Q-learning algorithm to handle continuous action spaces [34]. DDPG is a model-free, off-policy, actor-critic algorithm that combines the Deterministic Policy Gradient (DPG) [35] with DQN.
DDPG uses two networks: the actor network and the critic network (see Figure 3). The actor network is represented by
Similarly, a target network is defined for both the actor and critic networks, maintaining the same structure as their primary counterparts. The target actor network is represented as
The algorithm (Algorithm 3) first selects an action produced by the actor network, with exploration noise
Algorithm 3 DDPG Algorithm
Randomly initialize critic network $Q\left ({{ s,a\vert \theta ^{Q} }}\right )$ and actor $\mu \left ({{ s\vert \theta ^{\mu } }}\right )$ with weights $\theta ^{Q}$ and $\theta ^{\mu }$
Initialize target network $Q^{\prime }$ and $\mu ^{\prime }$ with weights $\theta ^{Q^{\prime }}\leftarrow \theta ^{Q}$, $\theta ^{\mu ^{\prime }}\leftarrow \theta ^{\mu }$
Initialize replay buffer R
for episode =1, M do
Initialize a random process $\mathcal {N}$ for action exploration
Receive initial observation state $s_{1}$
for $t=1,T$ do
Select action $a_{t}=\mu \left ({{ s_{t}\vert \theta ^{\mu } }}\right )+\mathcal {N}_{t}$ according to the current policy and exploration noise
Execute action $a_{t}$, observe reward $r_{t}$ and new state $s_{t+1}$
Store transition $\left ({{ s_{t},a_{t},r_{t},s_{t+1} }}\right )$ in R
Sample a random minibatch of N transitions $\left ({{ s_{i},a_{i},r_{i},s_{i+1} }}\right )$ from R
Set $y_{i}=r_{i}+\gamma Q^{\prime }\left ({{ s_{i+1},\mu ^{\prime }\left ({{ s_{i+1}\vert \theta ^{\mu ^{\prime }} }}\right )\vert \theta ^{Q^{\prime }} }}\right )$
Update critic by minimizing the loss: $L=\frac {1}{N}\sum \nolimits _{i} {\left ({{ y_{i}-Q\left ({{ s_{i},a_{i}\vert \theta ^{Q} }}\right ) }}\right )}^{2}$
Update the actor policy using the sampled policy gradient:\begin{equation*}\nabla _{\theta \mu }J\approx \frac {1}{N}\sum \limits _{i} {\nabla _{a}Q\left ({{ s,a\vert \theta ^{Q} }}\right )\vert _{s=s_{i},a=\mu \left ({{ s_{i} }}\right )}\nabla _{\theta \mu }\mu \left ({{ s\vert \theta ^{\mu } }}\right )} \vert _{s_{i}}\end{equation*}
Update the target networks:\begin{align*}\theta ^{Q^{\prime }}& \leftarrow \tau \theta ^{Q}+(1-\tau )\theta ^{Q^{\prime }} \\ \theta ^{\mu ^{\prime }}& \leftarrow \tau \theta ^{\mu }+(1-\tau )\theta ^{\mu ^{\prime }}\end{align*}
end for
end for
The action
After several iterations, a mini-batch of N transitions \begin{equation*} L=\frac {1}{N}\sum \limits _{i} {(y_{i}-Q\left ({{ s_{i},a_{i}\vert \theta ^{Q} }}\right ))}^{2} \tag {41}\end{equation*}
Similarly, the actor network policy is updated using the gradient of the sampled policy:\begin{align*} \nabla _{\theta \mu } J\approx \frac {1}{N}\sum \limits _{i} {\nabla _{a}Q\left ({{ s,a\vert \theta ^{Q} }}\right )\vert _{s=s_{i},a=\mu \left ({{ s_{i} }}\right )}\nabla _{\theta \mu }\mu \left ({{ s\vert \theta ^{\mu } }}\right )} \vert _{s_{i}} \tag {42}\end{align*}
Next, the weights of the actor and critic networks in the target network are updated slowly to promote greater stability. This process, referred to as a soft replacement, is expressed as:\begin{align*} \theta ^{Q^{\prime }}\leftarrow \tau \theta ^{Q}+(1-\tau )\theta ^{Q^{\prime }} \tag {43}\\ \theta ^{\mu ^{\prime }}\leftarrow \tau \theta ^{\mu }+(1-\tau )\theta ^{\mu ^{\prime }} \tag {44}\end{align*}
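The critic loss (41), the policy gradient step (42), and the soft target updates (43)-(44) can be combined into a single training step, sketched below in TensorFlow 2-style code. The Keras model interfaces (actor(s) → a, critic([s, a]) → Q) and the hyperparameter defaults are assumptions of this illustration, not our exact implementation.

```python
import tensorflow as tf

def soft_update(target_vars, online_vars, tau=0.001):
    """Polyak averaging of Eqs. (43)-(44): theta' <- tau*theta + (1 - tau)*theta'."""
    for t, o in zip(target_vars, online_vars):
        t.assign(tau * o + (1.0 - tau) * t)

def ddpg_step(actor, critic, target_actor, target_critic,
              opt_actor, opt_critic, batch, gamma=0.99, tau=0.001):
    """One DDPG update (Eqs. 41-44) on a minibatch (s, a, r, s2) of tensors."""
    s, a, r, s2 = batch
    # Critic target y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1})), computed
    # outside the tape so no gradient flows through the target networks.
    y = r + gamma * target_critic([s2, target_actor(s2)])
    with tf.GradientTape() as tape:
        critic_loss = tf.reduce_mean(tf.square(y - critic([s, a])))   # Eq. (41)
    opt_critic.apply_gradients(
        zip(tape.gradient(critic_loss, critic.trainable_variables),
            critic.trainable_variables))
    # Actor update via the deterministic policy gradient, Eq. (42)
    with tf.GradientTape() as tape:
        actor_loss = -tf.reduce_mean(critic([s, actor(s)]))
    opt_actor.apply_gradients(
        zip(tape.gradient(actor_loss, actor.trainable_variables),
            actor.trainable_variables))
    # Slowly track the online networks, Eqs. (43)-(44)
    soft_update(target_critic.trainable_variables, critic.trainable_variables, tau)
    soft_update(target_actor.trainable_variables, actor.trainable_variables, tau)
```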
DDPG Algorithm: Methodology and Problem Solution
In this section, we propose a DDPG algorithm for MEC-based underwater networks that enables the joint optimization of the AUV trajectory, task offloading strategy, and computational resource allocation in a continuous action space. The goal is to minimize both the total delay task computation and the overall energy consumption in the system.
There are three key elements in the reinforcement learning method, namely, state, action, and reward. These elements are detailed below.
State $s\left ({{ t }}\right )$: represents the set of coordinates of all AUVs and the task size.\begin{equation*} s\left ({{ t }}\right )=\left \{{{\begin{matrix} \left [{{ X\left ({{ t }}\right ), Y\left ({{ t }}\right ), Z }}\right ], \\ \left [{{ x_{k}\left ({{ t }}\right ), y_{k}\left ({{ t }}\right ),z_{k} }}\right ], \\ L_{k, j}\forall k\in K \\ \end{matrix}}}\right \}\end{equation*}
Action $c\left ({{ t }}\right )$: represents the set of actions for all AUVs. It includes the velocity $v_{mec}\left ({{ t }}\right )$ for the ${\mathrm {AUV}}_{MEC}$, the velocity $v_{local}\left ({{ t }}\right )$ for the ${\mathrm {AUVs}}_{LOCAL}$, the offloading strategy $\alpha _{k,j}\left ({{ t }}\right )$, and the computing resource allocation vector $f_{kj}^{m}$. Hence, the action set can be defined as\begin{align*} c\left ({{ t }}\right )=\left [{{\begin{matrix} v_{mec}\left ({{ t }}\right ),v_{local\left ({{ 1 }}\right )}\left ({{ t }}\right ),\ldots ,v_{local\left ({{ k }}\right )}\left ({{ t }}\right ), \\ \alpha _{1j}\left ({{ t }}\right ),\ldots ,\alpha _{kj}\left ({{ t }}\right ), \\ f_{1j}^{m}\left ({{ t }}\right ),\ldots ,f_{kj}^{m}\left ({{ t }}\right ) \\ \end{matrix}}}\right ]\end{align*}
Reward $z\left ({{ t }}\right )$: is defined as the overall minimum delay and energy consumption for the entire process in each time slot. It is expressed as:\begin{equation*} z\left ({{ t }}\right )=-\sum \limits _{t\in T} \left [{{ \left ({{ 1-\omega }}\right )\sum \limits _{k\in K} \sum \limits _{j\in R} {D_{k,j}\left ({{ t }}\right )+ \omega \sum \limits _{k\in K} \sum \limits _{j\in R} {E_{k,j}\left ({{ t }}\right )}} }}\right ] \end{equation*}
The formula considers the weight ($\omega $) assigned to each parameter (latency and energy). For this reason, our joint optimization algorithm balances latency and energy consumption, aiming to enhance the overall system performance.
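For illustration, the state vector s(t) and the reward z(t) defined above can be assembled as in the following sketch; the array shapes, the number of AUVs, and the numeric values are assumptions of the example.

```python
import numpy as np

def build_state(mec_pos, local_pos, task_sizes):
    """Flatten the MDP state s(t): MEC AUV position, local AUV positions,
    and the current task sizes L_{k,j} (shapes are assumptions of this sketch)."""
    return np.concatenate([np.ravel(mec_pos), np.ravel(local_pos),
                           np.ravel(task_sizes)]).astype(np.float32)

def reward(delays, energies, omega=0.5):
    """Reward z(t) = -[(1 - omega) * sum D_{k,j} + omega * sum E_{k,j}] for one slot."""
    return -((1.0 - omega) * np.sum(delays) + omega * np.sum(energies))

# Example with assumed dimensions: one MEC AUV, three local AUVs, one task each.
s_t = build_state(np.array([500.0, 500.0, -200.0]),
                  np.random.uniform(0, 1000, size=(3, 3)),
                  np.array([2e6, 1e6, 3e6]))
print(s_t.shape, reward(np.array([12.0, 8.0, 15.0]), np.array([30.0, 22.0, 41.0])))
```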
Experiments and Results
In this section, simulations are conducted to verify the effectiveness and evaluate the design of the proposed algorithm. First, we describe the environment and simulation parameters used during the experiments. Then, we discuss the obtained results, comparing the proposed algorithm with baseline schemes.
All algorithms are evaluated using simulations implemented in several Jupyter Notebooks (version 6.0.3) installed via the Anaconda software suite and developed in Python 3.7. The experiments were performed on a Lenovo computer equipped with an Intel Xeon 2.9 GHz processor and 72 GB of RAM; furthermore, the NumPy, Matplotlib, and TensorFlow libraries are used to develop the RL algorithms. Reinforcement learning is a type of machine learning that is based on learning through interactions with the environment and the feedback obtained from these interactions. Unlike supervised or unsupervised learning, which rely on a static data set for training, reinforcement learning obtains its training data dynamically from the agent’s experience while interacting with the environment. This experience consists of observations, actions, rewards, and new observations, forming a cycle known as the feedback loop of reinforcement learning. Through this cycle, the agent collects real-time training data, learning how its actions affect the environment and how rewards are related to those actions. This dynamic learning process allows the agent to continuously improve its decision-making policy, thereby maximizing long-term rewards.
A. Simulation Setting
In the proposed scheme, we consider a total coverage area of
There are
To evaluate the performance of the proposed algorithm and for comparison purposes, we describe the following benchmark approaches below:
Offloading of all tasks to the ${\mathrm {AUV}}_{MEC}$ (Offloading): The ${\mathrm {AUV}}_{MEC}$ provides computing resources to the ${\mathrm {AUVs}}_{LOCAL}$ at a designated location. Each ${\mathrm {AUV}}_{LOCAL}$ offloads all its computing tasks to be processed remotely.
Execution of all tasks locally (Locally): All computing tasks of the ${\mathrm {AUVs}}_{LOCAL}$ are executed locally without offloading to the ${\mathrm {AUV}}_{MEC}$.
Deep Deterministic Policy Gradient (DDPG): The parameters for the proposed DDPG algorithm are configured to achieve optimal system performance.
Actor-Critic (AC): A continuous action space-based RL algorithm is implemented for the computational offloading problem to compare and evaluate the performance of the proposed DDPG algorithm.
B. Simulation Results
We trained the deep neural networks of the proposed models over a total of 2000 iterations/episodes. The configuration parameters of the DDPG algorithm are as follows: a network architecture of two hidden layers with 400 and 300 fully connected neurons for both the actor and the critic network, soft update coefficient
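A sketch of actor and critic networks with this 400/300 hidden-layer structure, written with the Keras functional API, is given below; the activation functions and output scaling are assumptions of the illustration rather than a statement of our exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_actor(state_dim, action_dim):
    """Actor mu(s|theta_mu): two fully connected hidden layers of 400 and 300
    units; tanh output to be rescaled to the action ranges elsewhere."""
    s = layers.Input(shape=(state_dim,))
    x = layers.Dense(400, activation="relu")(s)
    x = layers.Dense(300, activation="relu")(x)
    a = layers.Dense(action_dim, activation="tanh")(x)
    return tf.keras.Model(s, a)

def build_critic(state_dim, action_dim):
    """Critic Q(s,a|theta_Q) with the same 400/300 hidden-layer structure."""
    s = layers.Input(shape=(state_dim,))
    a = layers.Input(shape=(action_dim,))
    x = layers.Concatenate()([s, a])
    x = layers.Dense(400, activation="relu")(x)
    x = layers.Dense(300, activation="relu")(x)
    q = layers.Dense(1)(x)
    return tf.keras.Model([s, a], q)
```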
Figure 4 shows the performance comparison between AC and DDPG reinforcement learning algorithms. The abscissa denotes the number of iterations of the main loop. The ordinate represents the episodic reward, which is the total reward obtained by the system in an episode. The final convergence of the DDPG algorithm shows that it outperforms the AC joint optimization algorithm in terms of cumulative reward. Although both algorithms perform favorably, DDPG reaches convergence after approximately 250 iterations and remains more stable. This implies that as the number of iterations increases, the total system delay and energy consumption decrease. Compared to its counterpart, this result highlights the efficiency of our proposed RL algorithm.
In reinforcement learning, convergence refers to the agent learning an optimal or suboptimal policy that maximizes its expected reward over time. The performance of the agent improves with increasing episodes of interaction with the environment.
In Figure 5 and Figure 6 we compare the influence of hyperparameters for DDPG. Figure 5 shows the convergence performance of the proposed algorithm (DDPG) with different batch sizes. The figure shows an enlarged picture of the convergence performance for each batch size. We observe that the DDPG algorithm becomes more stable during the training process only for batch size 64. When the batch size is 512, the algorithm does not converge and the reward deteriorates as the number of epochs increases. With a batch size of 256, the algorithm appears to converge optimally, but it lacks stability throughout training. When the batch size is 128, the algorithm converges nicely initially, but does not remain stable as training progresses. In contrast, with a batch size of 64, an optimal convergence is obtained around 250 epochs, and the algorithm remains significantly more stable than the other configurations as training progresses.
Figure 6. Convergence performance of the DDPG algorithm with different values of learning rates.
Figure 6 illustrates the convergence performance of the proposed algorithm with different learning rate values for both the actor network
After the training and convergence of the algorithms, we analyze the impact of the data size on the total delay and energy consumption by comparing the performance of the algorithms with different data sizes.
Figure 7 shows the total cumulative reward for DDPG, AC, Locally, and Offloading strategies.
Figure 7. Comparison of total cumulative reward benefit and task data size (workload of the AUV).
We can see that the total cumulative reward of the proposed algorithms can reach a near-optimal result, which means that an effective computational offloading policy can substantially reduce the total overhead of the AUVs when tasks are partially executed. Furthermore, the performance of the AC algorithm is close to that of the DDPG algorithm, because both explore a continuous action space and take precise actions, ultimately leading to the optimal offloading strategy while significantly reducing latency and energy consumption. However, when
Figure 8 and Figure 9 show the optimal trajectories of the AUVs for data collection from the CHs as well as for offloading from the
In Figure 8, the ocean current speed is
Conclusion
In this paper, a novel AUV-enabled MEC system has been introduced. A joint optimization algorithm to solve the offloading strategy, resource allocation, and trajectory selection of both
We have described the training process of the DDPG and AC algorithms. We have compared DDPG with other strategies (Locally, Offloading, and AC). Simulation results show that DDPG and AC converge well, but DDPG outperforms AC in terms of cumulative reward and stability. The convergence performance of the DDPG algorithm with different batch sizes and learning rates has been analyzed, and a batch size of 64 is found to be the optimal choice for convergence and stability. Simulations have been conducted to evaluate the proposed system’s performance, demonstrating that DDPG achieves significantly lower energy consumption and reduced average delay compared to the Total Offloading, Local Execution, and Actor-Critic algorithms. These results underscore the system’s efficiency in minimizing energy usage through its optimized task offloading strategies, resource allocation, and trajectory planning, making it particularly suitable for mission-critical applications that require real-time responsiveness and energy efficiency. Some challenges remain in AUV-enabled MEC systems. As future work, we plan to investigate MEC systems assisted by multiple AUVs functioning as a swarm of multi-access edge computing servers. This approach aims to extend coverage while addressing interference management and optimal offloading selection among AUVs. Furthermore, in this study each
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this article.