Introduction
With the development of communication technologies, diverse applications have emerged in various domains, e.g., healthcare, smart cities, and manufacturing. These applications are generally offloaded to and processed in cloud servers because of the limited computation resources of end devices (EDs), e.g., personal computers, smartphones, Internet of Things devices, and cars. This process is called cloud computing (CC). The tasks of such offloaded applications consist of computing and communication demands with various characteristics, e.g., traffic-heavy, computing-heavy, or latency-sensitive. Since cloud servers are generally located far from EDs, offloading tasks to the cloud inevitably adds transmission latency. Therefore, CC degrades the performance of latency-sensitive tasks.
To address this issue, edge computing (EC) has been proposed to deploy computing resources at the edge servers close to EDs. Combining CC and EC provides multiple offloading choices, improving the efficiency of offloading. For example, offloading computing-heavy tasks to the cloud is effective because the cloud usually has sufficient computing resources. Similarly, offloading traffic-heavy and latency-sensitive tasks to the edge effectively shortens the transmission latency. Thus, CC and EC must be combined to improve the efficiency of various offloading tasks.
Several studies have addressed task-offloading problems for CC and EC [1], [2], [3], [4], [5], [6], [7], [8]. Current methods based on mathematical optimization still incur high computational cost. For example, Wang et al. [1] accelerated computation by more than 100 times using a parallel optimization framework. However, due to the computational cost, they could only evaluate it in a small network with one cloud and a few base stations (BSs) (i.e., a few edge servers). To solve this problem, reinforcement learning (RL) [9] has been gaining attention as a solution [4], [5], [6], [7], [8]. RL can immediately output the preferred task offloading by learning the relationship between input network patterns and output task offloading in advance.
Although RL can solve the problem of computation time, two issues remain. One is that these methods [4], [5], [6], [7] either do not take CC into account or target only networks with a single cloud. As mentioned above, CC and EC need to be combined to improve task-offloading efficiency. Moreover, a typical network has multiple clouds. The other issue is that these methods [4], [5], [6], [7], [8] do not take into account the bandwidth capacity and backbone-network topology. Many studies attempted to minimize task latency by shortening the path that offloaded tasks take. When a control system does not take bandwidth capacity into account, some links may become congested due to the concentration of tasks. Therefore, the system should obtain the task offloading that minimizes latency while satisfying all link-bandwidth constraints. Some of the above studies also assumed a topology that is not practical as a backbone network, e.g., a full-mesh network.
With RL-based task-offloading methods, multi-cloud and network topology are challenging to consider because route optimization requires many variables. Even a minimal network with five nodes requires hundreds of variables to optimize all routes between nodes. The performance of RL drastically worsens as the number of variables increases [10]. Therefore, computing offloading and route optimization are difficult to jointly solve by using RL-based methods. Since the joint problem is a non-deterministic polynomial time (NP)-hard mixed-integer linear programming problem, it is also difficult to solve by mathematical optimization.
The key to this problem is that mathematical optimization can quickly solve the route-optimization problem. Therefore, we develop a method for combining RL and mathematical optimization. This method calculates the optimal computing offloading by RL and the optimal routing by mathematical optimization. However, not all constraints may be satisfied when the two methods are computed independently. Therefore, we propose a combining method that uses the solution of mathematical optimization for the RL reward, enabling all objectives and constraints of both methods to be considered. This method is inspired by the extendable integrated control architecture [11] that can coordinate multiple control algorithms prepared for each control metric.
Improving RL is another critical issue. Multi-agent RL (MARL) is effective in solving more complex problems. It is a system of multiple agents interacting within a common environment. Each agent works with the other agent(s) to achieve a team reward. The learning cost of each agent can be reduced by assigning a sub-task to each agent. However, when training decentralized and independent agents to optimize the team reward, each agent faces a non-stationary learning problem [12]. For example, when all agents act independently at the same time, all tasks may be allocated to the same lightly loaded server, resulting in server overload. To avoid the non-stationarity problem, Zhan et al. [4] developed an algorithm combining MARL and game theory. However, the actions of decentralized agents based on game theory fall into a sub-optimal solution of the Nash equilibrium.
We formulate an optimal task-offloading problem for multi-cloud and multi-edge networks considering network topology and bandwidth constraints. We handle the task offloading of edge-to-edge and edge-to-cloud. Each task arriving at the nearest edge is either processed at that edge or offloaded to a neighboring edge or cloud. We define optimal offloading as a solution that maximizes server- and link-resource efficiency and minimizes task latency while satisfying the constraints of server and link capacity and task latency. The decision variables are the computing-resource allocation of tasks and the routing between the ED and the allocated server. We also propose a task-offloading method that is based on cooperative multi-agent deep RL (Coop-MADRL). We assign to each edge an agent that learns the optimal task offloading. We introduce a cooperative multi-agent technique in which several agents jointly optimize a single reward in a centralized training and decentralized execution manner. This can prevent the non-stationary problem and improve task-offloading efficiency. The proposed method combines Coop-MADRL and mathematical optimization to take into account network topology and bandwidth constraints in the task-offloading problem. We use the solution of mathematical optimization in the learning process of Coop-MADRL. The proposed method calculates the optimal computing offloading by RL and the optimal routing by mathematical optimization. Therefore, our method can handle numerous routing variables without approximating or reducing them.
The main contributions can be summarized as follows:
We formulate an optimal task-offloading problem for multi-cloud and multi-edge networks considering network topology and bandwidth constraints to find the optimal solution that maximizes server-resource, link-resource, and task-latency efficiency while satisfying the constraints of server capacity, link capacity, and task latency.
We propose a task-offloading method that is based on Coop-MADRL. We assign task-offloading control of each edge to each agent. Agents jointly optimize a single reward in centralized training and decentralized execution. This can prevent the non-stationary problem and improve task-offloading efficiency. The method combines Coop-MADRL and mathematical optimization to take into account network topology and bandwidth constraints in the task-offloading problem.
We introduce a generalized task model representing various task types, e.g., traffic-heavy, computing-heavy, and latency-sensitive tasks. This model can evaluate diverse offloading methods under mixed task conditions generated by various applications, such as healthcare, smart cities, and manufacturing.
We evaluated the effectiveness of the proposed method through simulations in terms of performance, network-topology dependency, computation time, and scalability regarding the number of tasks. The simulation results indicate that our method can find a solution that minimizes network utilization and task latency while minimizing constraint violations in less than one millisecond in various network topologies.
We also evaluated the generalization performance of our method for unknown task patterns. The simulation results indicate that our method has generalization performance for various task types by pre-training with many resource-consuming tasks.
This paper is an extension of the conference version [13]. The extension can be summarized as follows:
The primary extension is a comprehensive evaluation of the proposed method. We evaluated its performance, network-topology dependency, computation time, scalability, and generalization performance for unknown task patterns.
We prepared three new comparison methods for evaluation, including exhaustive search and heuristic-based methods. We also prepared more diverse tasks for evaluation. In the conference version, we selected the task-computation request size from several discrete values. In this study, however, we extended the computing size to continuous values.
We improved the implementation to reduce the training time and improve performance. The main improvement is parallelization, which can speed up the training by several times.
We simplified the efficiency function in the reward calculation of RL to improve the clarity of the learning strategy.
The rest of the paper is organized as follows. Section II describes related work. Section III describes the formulation of the task-offloading problem. Section IV briefly reviews RL. Section V describes the proposed Coop-MADRL-based task-offloading method. Section VI describes the evaluation of its performance, and Section VII concludes the paper.
Related Work
Several studies have addressed task-offloading problems for CC and EC [1], [2], [3], [4], [5], [6], [7], [8]. In general, CC has sufficient computing resources but inevitably increases network latency. On the other hand, EC can reduce network latency but lacks abundant computing resources. These studies aim to optimize task offloading by considering the characteristics of heterogeneous computing resources. Some studies refer to fog computing instead of CC or EC, or assume multiple layers in the edge network. When an edge-fog or two-tier edge network accounts for heterogeneous resource characteristics, we regard such studies as addressing the same problem.
Table I summarizes the characteristics of these studies and ours. “Cooperation” in method properties means whether multiple control algorithms are coordinated when a method contains multiple algorithms. It is not the case when a single algorithm determines all task offloads or when multiple algorithms independently determine them. “Comprehensiveness” in evaluation properties means whether the methods evaluate the performance of the algorithm under practical conditions, e.g., a sufficient number of edges and clouds, and under various metrics, e.g., computation time and scalability.
Wang et al. [1] considered a cooperative three-tier computing network by leveraging vertical cooperation among devices, edge nodes, and cloud servers and horizontal cooperation between edge nodes. They also presented a parallel optimization framework by using the alternating direction method of multipliers (ADMM) method. However, they did not impose network bandwidth constraints and did not assume multi-cloud networks. They evaluated minimal conditions with four edges and 40 devices. Yuan and Zhou [2] designed a profit-maximizing collaborative computation offloading and resource-allocation algorithm to maximize system profit and guarantee task-response-time constraints. They also developed a migratory-bird optimization method that is based on simulated annealing (SA) to obtain a close-to-optimal solution. However, they did not assume multi-cloud networks and backbone-network topology. They evaluated their method under very light conditions, such as one task every 20 seconds. Kai et al. [3] developed a collaborative computing framework to process mobile devices’ tasks at terminals, edge nodes, and cloud centers. They presented a pipeline-based offloading scheme, in which both mobile devices and edge nodes can offload computation-intensive tasks to an edge node and cloud center in accordance with their computation and communication capacities, respectively. However, they did not assume multi-cloud networks and backbone network topology and evaluated minimal conditions with 40 devices.
MADRL has gained attention as a solution to the problem of computation time [4], [5], [6], [7], [8]. Zhan et al. [4] designed a decentralized algorithm for computation offloading so that users can independently choose their offloading decisions. They developed their algorithm by combining MADRL and game theory. However, the actions of decentralized agents based on game theory fall into a sub-optimal solution of the Nash equilibrium. They only considered EC and did not consider CC. Nguyen et al. [5] presented a new collaborative offloading framework that is based on MADRL in heterogeneous edge networks where each ED acts as an intelligent agent to make offloading decisions collaboratively and achieve optimal system utility. However, they only considered EC and did not consider CC. Hou et al. [6] introduced a hierarchical task-offloading and resource-allocation method that is based on MADRL for the Cybertwin-based network. It can promote the flexible collaboration of EDs, EC servers, or CC servers to improve system processing efficiency and security. Although they described the formulation in detail, their paper remains only a concept, and they evaluated minimal conditions with a single cloud, three edges, and 100–300 devices. Ding and Lin [7] considered cooperative task offloading, where all edge servers cooperate to achieve good performance for the entire edge-CC system, such as low latency and energy costs. They addressed MADRL-based task offloading that takes into account cooperation among agents. However, they evaluated minimal conditions with a single cloud and five edges. They evaluated only one snapshot and did not evaluate statistical performance. They also did not conduct any practical evaluation, e.g., computation time or scalability. Zhang et al. [8] considered a three-layer distributed multi-access edge computing network where there are multiple clouds, EC servers, and EDs. They developed a distributed scheme that is based on MADRL. Each cloud jointly determines the offloading task and resource-allocation strategy on the basis of its inference of other cloud decisions. However, they evaluated minimal conditions with 1–3 clouds, 1–3 edges, and 3–9 devices.
In summary, previous studies did not consider CC or targeted only networks with a single cloud. Zhang et al. [8] assumed multi-cloud networks, but their evaluation was under the limited condition of three edge nodes, and the effectiveness of their method under practical conditions remains unclear. Previous studies also did not consider the link capacity and topology of the backbone network. They assumed that the backbone network between clouds and edges behaved as if it were a single link. To the best of our knowledge, our study is the first to focus on optimal task offloading for multi-cloud and multi-edge networks considering network topology and bandwidth constraints. Several studies have addressed MADRL-based task offloading by considering cooperation among agents similarly to ours but do not meet the above network requirements. Furthermore, we introduce a generalized task model representing various task types. Previous studies evaluated the performance of their methods under conditions where they generated only one type of task uniformly. We evaluated the effectiveness of the proposed method in terms of performance, network-topology dependency, computation time, scalability regarding the number of tasks, and generalization performance for unknown task patterns. Such a comprehensive evaluation had not been conducted before.
Problem Formulation
A. Overview
We describe an overview of the task-offloading problem. We assume a network consisting of EDs, edges, and clouds. EDs generate various tasks for diverse applications. Each task consists of the required computing demand, traffic demand, and maximum permissible latency to accomplish the task. Each ED can compute its tasks locally or offload tasks to the neighboring edge or cloud. We collectively refer to edges and clouds as nodes. All nodes have computing resources, called edge or cloud servers, that execute tasks on behalf of EDs. All nodes also function as routers that forward traffic to other nodes. Only edges have the function of determining the optimal node to which each task is offloaded. Clouds cannot offload tasks to other nodes; they only execute them. We assume that the ED determines whether to offload its tasks; optimizing the ED's decision is outside the scope of this paper. We aim to optimize the offloaded server and the route between the ED and server for each accepted task.
We describe the procedure of our task-offloading method. We consider a discrete time-step
B. Network Model
Table II summarizes the notation definitions of the physical-network model. We consider the physical-network graph
We denote the edge as
C. Task Model
Table III summarizes the notation definitions of the task model. We describe a task model for uniformly representing various types of tasks of EDs. We define the task set as
D. Optimization Problem
We formulate the optimal task-offloading problem, which aims to minimize Eq. (1) while satisfying the constraints in Eqs. (2)-(14). Table IV summarizes the notation definitions of the task-offloading problem. We aim to find optimal task-allocation variables
We introduce the objective function:\begin{equation*} \min: \sum _{t \in T} \left ({U^{N}_{t} + U^{L}_{t} }\right) + \lambda \sum _{k \in {{\mathcal { K}}}} \left ({\frac {\tau ^{N}_{k} + \tau ^{\mathrm {RTT}}_{k}}{\tau ^{\mathrm {max}}_{k}} }\right) / | {{\mathcal { K}}}|, \tag{1}\end{equation*}
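For illustration, the following Python sketch evaluates the objective in Eq. (1) for a given allocation; the argument names (U_node, U_link, tau_node, tau_rtt, tau_max, lam) are placeholders introduced here, not notation from the paper.

```python
# Minimal sketch: evaluating the objective in Eq. (1) for a candidate solution.
# All names (U_node, U_link, tau_node, tau_rtt, tau_max, lam) are illustrative.

def objective(U_node, U_link, tau_node, tau_rtt, tau_max, lam):
    """U_node[t], U_link[t]: maximum node/link utilization at time step t.
    tau_node[k], tau_rtt[k], tau_max[k]: node latency, round-trip transmission
    latency, and maximum permissible latency of task k.
    lam: weighting parameter between the utilization and latency terms."""
    utilization_term = sum(U_node[t] + U_link[t] for t in range(len(U_node)))
    latency_term = sum((tau_node[k] + tau_rtt[k]) / tau_max[k]
                       for k in range(len(tau_max))) / len(tau_max)
    return utilization_term + lam * latency_term
```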
We impose three types of constraints: node capacity, link capacity, and task latency. We first define the binary variable \begin{align*} \mathbb {I}_{k,t} \mathrel {\mathrel {\mathop:}\hspace {-0.0672em}=}\begin{cases} 1 & \left ({t_{k} \leq t \leq t_{k} + \tau ^{N}_{k} + \tau ^{\mathrm {RTT}}_{k}}\right) \\ 0 & \left ({\mathrm{ otherwise}}\right), \end{cases} \tag{2}\end{align*}
1) Node-Capacity Constraints:
A task-allocation variable \begin{align*}&{\mathrm {s.t.}}: \sum _{n \in \boldsymbol {N}} y_{kn} = 1\; \left ({\forall k \in {{\mathcal { K}}}}\right) \tag{3}\\&\; \sum _{k \in {{\mathcal { K}}}} \mathbb {I}_{k,t} \, y_{kn} \leq w^{N}_{n} U^{N}_{t} \; \left ({\forall n \in \boldsymbol {N}}\right) \tag{4}\\&\; y_{kn} \in \left \{{ 0,1 }\right \} \left ({\forall k \in {{\mathcal { K}}}, \forall n \in \boldsymbol {N}}\right) \tag{5}\\&\; 0 \leq U^{N}_{t} \leq 1. \tag{6}\end{align*}
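The node-capacity constraints can be checked directly for a candidate allocation. The following sketch follows Eqs. (3)-(6) literally, counting each active task as one unit of load against the node capacity as in Eq. (4); the function and variable names are illustrative.

```python
# Minimal sketch: checking the node-capacity constraints (3)-(6) for a
# candidate allocation y at one time step t. Names are illustrative; the
# per-task load follows Eq. (4), i.e., each active task counts as one unit
# against the node capacity w_N[n].

def check_node_capacity(y, active, w_N):
    """y[k][n] in {0, 1}: task k allocated to node n.
    active[k] in {0, 1}: indicator I_{k,t} that task k occupies resources at t.
    w_N[n]: capacity of node n.
    Returns the minimal feasible U_N_t, or None if Eq. (3) or (6) is violated."""
    K, N = len(y), len(w_N)
    # Eq. (3): every task must be allocated to exactly one node.
    if any(sum(y[k]) != 1 for k in range(K)):
        return None
    # Eq. (4): node load normalized by capacity defines the utilization U_N_t.
    U_N_t = max(sum(active[k] * y[k][n] for k in range(K)) / w_N[n]
                for n in range(N))
    # Eq. (6): utilization must not exceed 1.
    return U_N_t if U_N_t <= 1 else None
```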
2) Link-Capacity Constraints:
We formulate the link-capacity constraints as a multi-commodity flow problem, which is a network flow problem with multiple commodities (i.e., traffic flow demands) between source and destination nodes. These equations give the capacity and flow-conservation constraints that the routing variables must satisfy. If we only needed routing variables along the shortest path, we could use a shortest-path algorithm such as Dijkstra's or Bellman-Ford. However, these algorithms do not take capacity constraints into account. If all traffic demands pass through the shortest path, traffic may concentrate on a single link, causing congestion.
A routing variable \begin{align*} {\mathrm {s.t.}}:&\sum _{j:\left ({i,j}\right) \in L} x_{ij,t}^{pq} - \sum _{j:\left ({j,i}\right) \in L} x_{ji,t}^{pq} = 0 \tag{7}\\&\qquad \left ({\forall p,q \in \boldsymbol {N}, i \neq p, i \neq q}\right) \\&\sum _{j:\left ({i,j}\right) \in L} x_{ij,t}^{pq} - \sum _{j:\left ({j,i}\right) \in L} x_{ji,t}^{pq} = 1 \tag{8}\\&\qquad \left ({\forall p,q \in \boldsymbol {N}, i = p}\right) \\&\sum _{p,q \in \boldsymbol {N}} m^{pq}_{t} x_{ij,t}^{pq} \leq w^{L}_{ij} U^{L}_{t} \tag{9}\\&\qquad \left ({\forall \left ({i,j}\right) \in L, \forall p,q \in \boldsymbol {N}}\right) \\&0 \leq x_{ij,t}^{pq} \leq 1 \; \left ({\forall \left ({i,j}\right) \in L, \forall p,q \in \boldsymbol {N}}\right) \tag{10}\\&0 \leq U^{L}_{t} \leq 1. \tag{11}\end{align*}
\begin{align*} m^{pq}_{t}=&\sum _{k \in {{\mathcal { K}}}} B^{\mathrm {up}}_{k} \, \mathbb {I}_{k,t} \, z_{kp} \, y_{kq} \tag{12}\\ m^{qp}_{t}=&\sum _{k \in {{\mathcal { K}}}} B^{\mathrm {down}}_{k} \, \mathbb {I}_{k,t} \, z_{kp} \, y_{kq} \\&\left ({\forall p \in \boldsymbol {E}, \forall q \in \boldsymbol {N}, p \neq q}\right).\tag{13}\end{align*}
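Because the route optimization is a linear program, it can be solved with an off-the-shelf LP solver. The following sketch models Eqs. (7)-(11) with PuLP and solves them with GLPK, as the evaluation in Section VI does; using the minimization of the maximum link utilization U^L_t as the routing objective is an assumption made for this sketch, and the demands m[(p, q)] are assumed to be pre-aggregated according to Eqs. (12)-(13).

```python
# Minimal sketch: the route-optimization LP of Eqs. (7)-(11) as a
# multi-commodity flow problem solved with PuLP/GLPK. Minimizing the maximum
# link utilization U_L is an assumed objective for this sketch.
import pulp

def optimize_routes(nodes, links, w_L, m):
    """links: list of directed (i, j) pairs; w_L[(i, j)]: link capacity;
    m[(p, q)]: traffic demand from node p to node q (only nonzero pairs)."""
    prob = pulp.LpProblem("route_optimization", pulp.LpMinimize)
    U_L = pulp.LpVariable("U_L", lowBound=0, upBound=1)
    x = {(i, j, p, q): pulp.LpVariable(f"x_{i}_{j}_{p}_{q}", 0, 1)
         for (i, j) in links for (p, q) in m}
    prob += U_L  # assumed objective: minimize the maximum link utilization
    for (p, q) in m:
        for i in nodes:
            out_flow = pulp.lpSum(x[i, j, p, q] for (a, j) in links if a == i)
            in_flow = pulp.lpSum(x[j, i, p, q] for (j, b) in links if b == i)
            if i == p:            # Eq. (8): the source emits one unit of flow
                prob += out_flow - in_flow == 1
            elif i != q:          # Eq. (7): flow conservation at transit nodes
                prob += out_flow - in_flow == 0
    for (i, j) in links:          # Eq. (9): link capacity scaled by U_L
        prob += (pulp.lpSum(m[p, q] * x[i, j, p, q] for (p, q) in m)
                 <= w_L[i, j] * U_L)
    prob.solve(pulp.GLPK_CMD(msg=False))
    return pulp.value(U_L), {key: pulp.value(var) for key, var in x.items()}
```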
3) Latency Constraints:
Task latency is the sum of all possible delays a task experiences during offloading, including processing, transmission, propagation, and queuing latency. Processing latency is the delay it takes to process the task at the edge or cloud server. We refer to this latency as node latency
Latency constraints of \begin{equation*} {\mathrm {s.t.}}: \tau ^{N}_{k} + \tau ^{\mathrm {RTT}}_{k} \leq \tau ^{\mathrm {max}}_{k} \left ({\forall k \in {{\mathcal { K}}}}\right). \tag{14}\end{equation*}
\begin{align*} \tau ^{N}_{k}\mathrel {\mathrel {\mathop:}\hspace {-0.0672em}=}&\sum _{n \in \boldsymbol {N}} y_{kn} \frac {C_{k}}{v^{N}_{n}} \tag{15}\\ \tau ^{\mathrm {RTT}}_{k}\mathrel {\mathrel {\mathop:}\hspace {-0.0672em}=}&\max _{p \in {{\mathcal{ P}}}^{\mathrm {up}}_{k}} \left ({\tau _{p}\left ({B^{\mathrm {up}}_{k}}\right) }\right) + \max _{p \in {{\mathcal{ P}}}^{\mathrm {down}}_{k}} \left ({\tau _{p}\left ({B^{\mathrm {up}}_{k}}\right) }\right). \tag{16}\end{align*}
\begin{equation*} \tau _{p}(b) \mathrel {\mathrel {\mathop:}\hspace {-0.0672em}=} r_{p} \cdot \frac {b}{\min _{\left ({i, j}\right) \in {{\mathcal{ L}}}_{p}} \left ({w^{L}_{ij}}\right)} + \sum _{\left ({i, j}\right) \in {{\mathcal{ L}}}_{p}} \alpha ^{L}_{ij}. \tag{17}\end{equation*}
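As a reference, the latency terms can be computed as follows. The sketch implements Eq. (15) for node latency and Eq. (17) for the latency of a single path; r_p is the path-dependent factor of Eq. (17) and alpha_L the per-link propagation delay, both passed in as given values, and all names are illustrative.

```python
# Minimal sketch of the latency terms in Eqs. (15) and (17).

def node_latency(y_k, C_k, v_N):
    """Eq. (15): processing latency of task k on its allocated node.
    y_k[n]: allocation variable, C_k: computing demand, v_N[n]: node speed."""
    return sum(y_k[n] * C_k / v_N[n] for n in range(len(v_N)))

def path_latency(b, path_links, w_L, alpha_L, r_p):
    """Eq. (17): transmission plus propagation latency of demand b on a path.
    path_links: list of directed links (i, j) forming path p."""
    bottleneck = min(w_L[i, j] for (i, j) in path_links)   # smallest link capacity
    transmission = r_p * b / bottleneck                    # demand over the bottleneck
    propagation = sum(alpha_L[i, j] for (i, j) in path_links)
    return transmission + propagation
```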
Reinforcement Learning
We briefly review single-agent RL and MARL. Recent surveys describe MARL in detail [15], [16], [17].
A. Single-Agent Reinforcement Learning
Single-agent RL solves the decision problem in which an agent interacts with an environment. The agent observes state \begin{equation*} Q \left ({s, a }\right) \gets \left ({1-\alpha }\right) Q \left ({s, a }\right) + \alpha \left [{ r + \gamma \max _{a^{\prime } \in {{\mathcal{ A}}}} Q \left ({{s^{\prime }}, {a^{\prime }} }\right) }\right],\tag{18}\end{equation*}
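A minimal implementation of the tabular update in Eq. (18) looks as follows; Q is a plain dictionary from (state, action) pairs to values, and unseen entries default to zero.

```python
# Minimal sketch of the tabular Q-learning update in Eq. (18).
# alpha is the learning rate and gamma the discount factor.

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + gamma * best_next)
    return Q
```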
Deep RL (DRL) [10], [18] dramatically improves the generalization and scalability of traditional RL algorithms and can handle continuous and high-dimensional state space by approximating the
DQN uses a replay memory to store the transition tuple \begin{align*} {{\mathcal{ L}}}(\theta)=&\sum _{i=1}^{b}\left [{\left ({y_{i}^{\mathrm {DQN}}-Q\left ({s,a;\theta }\right)}\right)^{2}}\right], \tag{19}\\ y^{\mathrm {DQN}}=&r+\gamma \max _{a^{\prime } \in {{\mathcal{ A}}}} Q\left ({s^{\prime },a^{\prime };\theta ^{-}}\right), \tag{20}\end{align*}
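For reference, the DQN loss of Eqs. (19)-(20) over a sampled mini-batch can be written in PyTorch as follows; terminal-state masking is omitted for brevity, and q_net and target_net are assumed to share the same architecture, with target_net holding the periodically copied parameters.

```python
# Minimal PyTorch sketch of the DQN loss in Eqs. (19)-(20) over a mini-batch
# sampled from replay memory.
import torch

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    s, a, r, s_next = batch  # tensors: states, actions, rewards, next states
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; theta)
    with torch.no_grad():                                       # Eq. (20)
        y = r + gamma * target_net(s_next).max(dim=1).values
    return ((y - q_sa) ** 2).sum()                              # Eq. (19)
```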
DRL performs better when historical information is used for learning. DQN defines the last four frames as the current state in the classic game environment to consider historical information. This approach was successful enough for games where reflexes are critical, but it became clear that DQN needed more than four frames for some games. To address these shortcomings, Deep Recurrent Q-Network (DRQN) [19] adds recurrency to DQN by replacing the first post-convolutional fully connected layer with a recurrent long short-term memory (LSTM). DRQN successfully integrates historical information and improves the performance of DQN in some games, even though it only sees a single frame at each time step. In addition, DRQN with the recurrent network can better adapt during an evaluation when the quality of the observations changes. Therefore, our method introduces DRQN for the DNN layer.
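A DRQN-style agent network of this kind can be sketched as below: a fully connected layer, a recurrent cell that carries the agent's hidden state across time steps, and an output layer producing one Q-value per action. The layer sizes and the use of a GRU cell here are illustrative; the original DRQN [19] uses an LSTM.

```python
# Minimal PyTorch sketch of a DRQN-style agent network with a recurrent layer.
import torch
import torch.nn as nn

class DRQNAgent(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(obs_dim, hidden_dim)
        self.rnn = nn.GRUCell(hidden_dim, hidden_dim)  # recurrent layer (LSTM in [19])
        self.fc2 = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs, h=None):
        x = torch.relu(self.fc1(obs))
        h_next = self.rnn(x, h)          # history is carried in the hidden state
        q = self.fc2(h_next)             # one Q-value per action
        return q, h_next
```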
B. Multi-Agent Reinforcement Learning
MARL is a system of multiple agents interacting within a shared environment. Each agent works with the other agent(s) to achieve a given goal. It is used for learning in a complex environment by dividing a single job into multiple sub-jobs. We can reduce the learning cost of each agent by assigning a sub-job to each agent. Due to the complexity of the environments and the combinatorial nature of the problem, most MARL problems are categorized as NP-hard [20].
A cooperative multi-agent environment can be described as a decentralized partially observable Markov decision process (Dec-POMDP) [21] consisting of \begin{equation*} Q^{\pi }\left ({\boldsymbol {h}_{t}, \boldsymbol {a}_{t}}\right) = \mathbb {E} \left [{ \sum ^{\infty }_{j=0}\gamma ^{j} r_{t+j} \mid \boldsymbol {h}_{t}, \boldsymbol {a}_{t} }\right].\tag{21}\end{equation*}
1) Independent Q-Learning:
The most naive approach to solve the MARL problem is to treat each agent independently. This idea is formalized using an independent Q-learning (IQL) algorithm [22], [23], which decomposes a multi-agent problem into a collection of simultaneous single-agent problems that share the same environment. Each agent runs Q-learning [9] or DQN [18]. IQL is scalable from the viewpoint of implementation while increasing the number of agents, and each agent only needs its local history of observations during the training. In this study, IQL was trained to minimize the loss:\begin{equation*} {{\mathcal{ L}}}(\theta) = \sum _{i=1}^{b} \sum _{k=1}^{K} \left [{ \left ({y_{i}^{k} - Q_{k} \left ({h^{k}, a^{k}; \theta ^{k}}\right) }\right)^{2} }\right], \tag{22}\end{equation*}
2) Value Decomposition Networks:
Value decomposition networks (VDNs) [24] are used to learn a joint action-value function \begin{equation*} Q_{tot}\left ({\boldsymbol {h}, \boldsymbol {a}}\right) = \sum _{i=1}^{N} Q_{i} \left ({h^{i}, a^{i}}\right). \tag{23}\end{equation*}
\begin{equation*} {{\mathcal{ L}}}(\theta) = \sum _{i=1}^{b} \left [{ \left ({y_{i}^{tot} - Q_{tot}\left ({\boldsymbol {h}, \boldsymbol {a}; \theta }\right) }\right)^{2} }\right], \tag{24}\end{equation*}
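The VDN mixing step is a simple summation, which makes the joint loss straightforward to implement. The following PyTorch sketch computes Eq. (23) and the loss in Eq. (24) from the per-agent Q-values of the chosen actions; agent_qs and y_tot are assumed tensors of shape [batch, n_agents] and [batch], respectively.

```python
# Minimal PyTorch sketch of VDN: the joint action-value in Eq. (23) is the sum
# of per-agent Q-values, and all agents are trained with the joint loss in Eq. (24).
import torch

def vdn_q_tot(agent_qs):
    return agent_qs.sum(dim=1)                      # Eq. (23)

def vdn_loss(agent_qs, y_tot):
    q_tot = vdn_q_tot(agent_qs)
    return ((y_tot - q_tot) ** 2).sum()             # Eq. (24)
```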
Proposed Method
A. Overview
We give an overview of the proposed task-offloading method, which is based on Coop-MADRL. We aim to find the optimal task offloading that minimizes Eq. (1) while satisfying the constraints in Eqs. (2)-(14). The decision variables are the task-allocation variables
Table V summarizes the notation definitions of our method. We introduce
B. Modeling
We first define the variables that represent a subset of tasks as follows:\begin{align*} {{\mathcal { K}}}_{t}\mathrel {\mathrel {\mathop:}\hspace {-0.0672em}=}&\left \{{ k \in {{\mathcal { K}}} \mid \mathbb {I}_{k,t} = 1 }\right \} \tag{25}\\ {{\mathcal { K}}}_{e}\mathrel {\mathrel {\mathop:}\hspace {-0.0672em}=}&\left \{{ k \in {{\mathcal { K}}} \mid z_{ke} = 1 }\right \} \tag{26}\\ {{\mathcal { D}}}_{t}\mathrel {\mathrel {\mathop:}\hspace {-0.0672em}=}&\left \{{ \boldsymbol {D}_{k} \in {{\mathcal { D}}} \mid t_{k} = t }\right \} \tag{27}\\ {{\mathcal { D}}}_{e,t}\mathrel {\mathrel {\mathop:}\hspace {-0.0672em}=}&\left \{{ \boldsymbol {D}_{k} \in {{\mathcal { D}}} \mid t_{k} = t, k \in {{\mathcal { K}}}_{e} }\right \}.\tag{28}\end{align*}
A state is defined as
C. Formulation
Algorithm 1 shows the centralized training using Coop-MADRL. Line 1 shows the initialization of agent parameters. A series of procedures (lines 2–18) is repeatedly executed until learning is complete. Lines 3–4 show the generation of tasks and initialization of environment parameters. A series of actions is called an episode, and each episode (lines 5–16) is repeatedly executed. In each episode, agents collect learning samples that are combinations of
Algorithm 1 Centralized Training of Coop-MADRL
initialize: agent parameters,
while
generate tasks
initialize: environment parameters
for
for each
if
terminate episode:
if all
store episodic transition
train all agents
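To make the training flow concrete, the following heavily simplified Python skeleton mirrors the structure described for Algorithm 1 (task generation, per-episode agent actions, reward feedback that reflects the route optimization, and joint training). All helper names (env.generate_tasks, env.reset, env.step, agent.act, replay.store, trainer.train) are hypothetical placeholders, not the paper's implementation.

```python
# Hedged skeleton of the centralized training flow described for Algorithm 1.
def centralized_training(agents, env, trainer, replay, n_iterations, episode_len):
    for _ in range(n_iterations):
        tasks = env.generate_tasks()            # lines 3-4: new tasks, reset environment
        obs, hidden = env.reset(tasks), [None] * len(agents)
        episode = []
        for t in range(episode_len):            # one episode (lines 5-16)
            actions = []
            for i, agent in enumerate(agents):  # each edge agent picks an offload node
                a, hidden[i] = agent.act(obs[i], hidden[i])
                actions.append(a)
            # Routes for the chosen allocation come from the optimization of
            # Eqs. (7)-(11); the reward then reflects both controls (Section V-E).
            next_obs, reward, done = env.step(actions)
            episode.append((obs, actions, reward, next_obs, done))
            obs = next_obs
            if done:
                break
        replay.store(episode)                   # store the episodic transition
        trainer.train(replay.sample())          # update all agents jointly
```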
Algorithm 2 Decentralized Execution of Coop-MADRL
while all tasks are allocated do
for each
if all
Algorithm 3 Update Environment
return
Algorithm 2 shows the Coop-MADRL algorithm of the proposed task-offloading method. Line 1 shows the pre-training of
D. Update Environment
Algorithm 3 shows the procedure of the update environment. The task-allocation variables
E. Reward Calculation
We design the reward function on the basis of the objective function in Eq. (1). Algorithm 4 shows the procedure of the reward calculation for \begin{align*} {\mathrm {Eff}} (x)=\begin{cases} -x + 1 & \left ({x \leq 1}\right) \\ -2x & \left ({1 < x}\right). \\ \end{cases}\tag{29}\end{align*}
Algorithm 4 Reward Calculation
if
return -3
return
We design this function so that the efficiency decreases as \begin{equation*} \tau ^{\mathrm {ave}}_{t} = \sum _{k \in {{\mathcal { K}}}_{t} } \left ({\frac {\tau ^{N}_{k} + \tau ^{\mathrm {RTT}}_{k}}{\tau ^{\mathrm {max}}_{k}} }\right) / | {{\mathcal { K}}}_{t}|.\tag{30}\end{equation*}
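For illustration, the following sketch implements the efficiency function of Eq. (29) and the average normalized latency of Eq. (30); the -3 penalty on constraint violation follows Algorithm 4, but the exact way the efficiency and latency terms are combined into the reward is an assumption made here, and all names are illustrative.

```python
# Hedged sketch of the reward calculation around Eqs. (29)-(30).

def eff(x):
    """Eq. (29): efficiency decreases with utilization, with a steeper
    slope once utilization exceeds 1 (i.e., the capacity is exceeded)."""
    return -x + 1 if x <= 1 else -2 * x

def reward(U_node, U_link, tau_node, tau_rtt, tau_max, lam, latency_violated):
    if latency_violated:                     # Algorithm 4: constraint violation penalty
        return -3
    tau_ave = sum((tau_node[k] + tau_rtt[k]) / tau_max[k]
                  for k in range(len(tau_max))) / len(tau_max)   # Eq. (30)
    return eff(U_node) + eff(U_link) - lam * tau_ave             # assumed combination
```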
Evaluation
We evaluated the effectiveness of the proposed method through simulations in terms of performance, network-topology dependency, computation time, and scalability regarding the number of tasks. We also evaluated its generalization performance for unknown task patterns. We prepared five comparison methods and four network topologies for the evaluation. For each agent's DNN layer, we introduced DRQN [19]. We used a three-layer DNN consisting of two fully connected layers and a gated recurrent unit (GRU) layer [27]. We used Adam [28] to optimize all DNNs and set the number of neurons in the hidden layers to 64 for all DNNs. We used Double-DQN [10] as the DRL algorithm. We implemented the DRL algorithm on the basis of PyTorch [29], PyMARL [30], and PyMARL2 [31]. We solved the route optimization using the GNU Linear Programming Kit (GLPK) [32]. For the hyperparameters of DRL, the learning rate was set to
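Since the evaluation uses Double-DQN [10], the target differs from the plain DQN target in Eq. (20): the online network selects the next action and the target network evaluates it. A minimal PyTorch sketch of this target follows; variable names are illustrative.

```python
# Minimal PyTorch sketch of the Double-DQN target used in the evaluation.
import torch

def double_dqn_target(q_net, target_net, r, s_next, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # action selection
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # action evaluation
        return r + gamma * q_eval
```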
A. Evaluation Conditions
We set the number of tasks to
For task generation, we prepare four types of tasks assuming various use cases. We use a generalized task model to reveal that the proposed method can effectively allocate tasks even when many types of tasks are mixed. Table VI summarizes the parameters for task generation. The task model is parameterized by upload and download traffic demand,
For the physical network, we prepare four network topologies. We first prepare the 12-node topology on the basis of Internet2 [33], which consists of nine edges and three clouds, as shown in Fig. 1. We connect each cloud to one randomly selected edge and fix the cloud placements for all evaluations. The values in this figure show the pair of node-processing capability and node capacity
We perform training and evaluation steps on the prepared tasks and physical networks. We run Algorithm 1 for training and Algorithm 2 for evaluation. Tasks in training and evaluation are generated under the same conditions. Note that the tasks generated during training and evaluation are different because some parameters are randomly generated. In the evaluation, the MADRL-based methods select the best action (i.e., the offloading server) on the basis of the trained agents in line 6 of Algorithm 2. The non-DRL-based methods select the offloading server on the basis of their respective algorithms, rather than selecting actions by agents.
B. Comparison Methods
Table VIII summarizes our method and the comparison methods. VDN indicates the proposed Coop-MADRL-based method. IQL indicates the MADRL-based method without cooperation, i.e., each agent learns on the basis of its own reward. As mentioned in Section II, no existing method considers multi-cloud networks and network topology, but IQL corresponds to such a method [4] if it were to consider them. Note that we did not evaluate single-agent DRL because learning would clearly fail owing to the huge number of training iterations required until the Q-values of all actions are sufficiently close to optimal.
We also compared our method with three algorithms that may be able to offload tasks in a computation time close to RL-based methods. The random algorithm (RA) has a policy that allocates each task to a node randomly chosen from all candidate nodes at each
We also evaluated exhaustive search (ES), which has a policy that finds the best offloading node by exhaustively calculating each reward for all candidate nodes at
C. Training Curve
Figure 2 shows the training curves tracking the agent’s total return of VDN and IQL under the Internet2 topology. The total return is defined as the sum of rewards at each time until the end of the episode. For this evaluation, the total time step was set to
The results indicate that the average total return of the two MADRL methods increased as the training progressed and converged around
D. Performance
1) Results:
Figure 3 shows the average performance of each method. We carried out 20 calculations with random initial conditions and set the same random seeds for all trials. The width of each bar indicates the standard deviation
Figure 3 shows that the proposed method (VDN) performed comparably to ES and outperformed the other methods. These results indicate that VDN and ES can maximize network-utilization efficiency and minimize task latency while reducing constraint violations. Note that constraint violations occurred for all methods because we evaluated performance under severe conditions that put heavy loads on the network. Although ES performed the best, it is problematic in terms of computation time (see Section VI-F for details).
Next, we discuss the performance comparison between VDN and other methods. We first compared VDN and IQL. Figure 3 shows that VDN performed better than IQL for all metrics. This indicates that the coordination among agents with VDN improves the performance of task offloading and can prevent constraint violations. Since each agent of IQL acts independently, task offloading is concentrated on lightly loaded resources in some cases, causing constraint violations as described in the Introduction.
We next compared VDN and the three non-RL-based computationally lightweight methods: RA, GA, and HA. The GA threshold was set to 0.2 because the best performance was obtained in the preliminary experiment. Figure 3 shows that VDN outperformed the three methods in terms of average reward. However, VDN had a higher maximum node utilization than GA and HA and higher maximum link utilization than RA. The reason is that agents choose the action that maximizes the average reward even if the node and link utilization are increased. As a characteristic of each method, GA performed comparably to or better than VDN in terms of maximum node utilization, maximum link utilization, and task latency efficiency by setting the best threshold in the preliminary experiment. However, GA had more constraint violations because it could not dynamically select allocations on the basis of the situation, resulting in a lower average reward than VDN. In addition, the maximum node utilization and task latency efficiency of HA were competitive with those of the other methods because HA designs the appropriate offloading node depending on the characteristics of each task. However, the other indicators of HA showed the worst performance because HA does not take into account the link constraints.
2) Analysis:
We discuss the reasons for the performance shown in Fig. 3 by analyzing the latency and allocation ratio of each task type. Figures 4 and 5 show the latency and allocation ratio of each task type in the evaluation in Fig. 3. Here, the allocation ratio indicates the proportion of tasks allocated to the nearest edge, neighboring edges, or neighboring clouds. The allocation ratios for VDN, IQL, and ES are determined by the task and network conditions. In contrast, the allocation ratios for RA and HA are fixed, regardless of the task and network conditions, and are determined by the algorithm design. The allocation ratio for GA is determined by the GA threshold value, which was set to 0.2 in this evaluation.
Figure 4 shows that VDN, IQL, GA, and ES primarily allocated tasks to the neighboring cloud. The result shows that the neighboring cloud is the preferred node for allocating tasks. This is because of the sufficient cloud-processing capacity and the short transmission delay between the edge and cloud, making assigning tasks to the cloud preferable in this evaluation condition. However, this result is only an example of an evaluation and does not deny the effectiveness of EC. The allocation to the edge may become dominant depending on the evaluation conditions.
Figure 4 shows that VDN changed the allocation ratio in accordance with the task type. VDN mainly allocated all types of tasks to the neighboring cloud. Looking at the ratio using the edges, VDN allocated most download-heavy tasks with high transmission costs (Task 2) to the edges. However, it allocated more computing-heavy tasks (Task 3) to neighboring clouds, even though the average traffic volumes of Tasks 2 and 3 were equal. We conclude that VDN learns a superior policy through cooperative learning. However, the latency-sensitive tasks (Task 4) were primarily allocated to neighboring clouds, even though they should be allocated to the nearest edge to minimize latency. We assume the reason is that VDN attempts to maximize the overall network efficiency by allocating lightweight tasks to the cloud because of limited resources at the nearest edge.
Figure 4 shows that IQL changed the allocation ratio in accordance with the task type, although it is not associated with satisfactory performance, as shown in Figure 3. RA and GA did not change the allocation ratio in accordance with the task type, which is assumed to be one reason for the low performance of these methods. Figures 4 and 5 show that HA was suitable for latency for Tasks 3 and 4 because it changes the allocation node depending on the task type. However, as mentioned in Section VI-D.1, HA performed poorly because it does not take into account the link constraints.
Figure 4 shows that ES uses the nearest edges more than VDN does, which is assumed to be one reason that ES performed better than VDN. Even though ES selects the node that maximizes the reward among all candidate nodes, the performance difference between VDN and ES is slight. This is because ES only maximizes the immediate reward obtained in the present. That is, selecting the node that maximizes the immediate reward in the present may not maximize the expectation of the sum of rewards obtained in the future.
E. Network-Topology Dependency
Table IX shows the average reward of each method for various network topologies. We carried out 20 calculations with random initial conditions and calculated the mean value and standard deviation of rewards. We did not evaluate IQL because it is evident that learning agents is more difficult in larger topologies. The results indicate that VDN outperformed the other methods except for ES in all network topologies. Thus, the proposed method can be applied to larger network topologies.
We first discuss the average performance with each topology. Table IX shows that the average performance of all methods in Abilene decreased compared with that in the other topologies. Comparing the conditions of Internet2 and the other topologies in Table VII, only the network size differs; the other parameters are the same. We therefore consider that the graph structure of the topology determines the performance.
We next discuss the performance of each method. Table IX shows that VDN performed comparably to ES in all topologies, whereas the performance difference between VDN and ES increased in Abilene. Since all methods performed worse in Abilene, the performance difference between VDN and ES could be significant in environments where agent learning is complex. Table IX also shows that the performance trends of GA and HA remain unchanged. On the other hand, the performance of RA improved as the topology size increased. This is because a larger topology provides more control options, enables load balancing, and improves performance regardless of the control method.
Figures 6–8 show the allocation ratio of each task type for larger topologies. We omit the results of RA, GA, and HA because these methods do not change the allocation ratio depending on evaluation conditions. These figures show that VDN mainly allocated all types of tasks to the neighboring cloud, similar to that of Internet2. On the other hand, the allocation ratio of ES depended on network topology.
Figure 6 shows that the allocation ratio of VDN in Abilene was about the same as that in Internet2, while it increased the task allocation to the nearest edge. This means that VDN changed the allocation ratio in accordance with Abilene and learned the superior policy in accordance with the environment. Figures 7–8 show that the allocation ratio in larger topologies was similar to that of Internet2. However, the difference in the allocation ratio for each task type became smaller as the topology size increased. This means that learning in accordance with task type becomes more difficult as the topology size increases.
In summary, VDN outperformed the other methods except for ES in all network topologies. In addition, VDN can allocate tasks in accordance with the task type when the topology is small, but this becomes more difficult as the topology size increases. Although VDN cannot learn by task type in larger topologies, it achieves superior performance competitive with that of ES by learning the appropriate allocation to edges and clouds depending on the situation.
F. Computation Time
Table X shows the average execution time of each method for various network topologies. The execution time denotes the computation time required to determine the allocated node for a single task, which corresponds to line 4 in Algorithm 2 for VDN. We used an Apple M1 Ultra for the evaluation. The execution time of all methods except for ES was less than 1 ms for all topologies. The execution time tended to increase as the topology size increased, but the increase is almost negligible for task latency. This means that VDN performs sufficiently well in terms of computation time. However, the execution time of ES was about 1 s even in the fastest case. The execution time of these task-offloading methods would be added to the task latency in real-world scenarios. When the execution time is 1 s, the task latency cannot be shorter than 1 s. Therefore, it is critical for many tasks to keep latency to less than a few tens of milliseconds. The execution time of ES drastically increased as the topology size increased because the computational complexity of the optimization calculation also drastically increased. We conclude that VDN is the best task-offloading method, considering the allocation performance shown in Table IX and the execution times shown in Table X.
Table XI shows the training time of VDN for various network topologies. The training time denotes the computation time required for the agents to complete the training of all time-steps
G. Latency-Weighting-Parameter Dependency
Table XII shows the weighting-parameter dependency of the objective function shown in Eq. (1). The weighting parameter
H. Scalability Regarding Number of Tasks
We evaluated the scalability of VDN regarding the number of tasks
Table XIV shows the scalability of VDN regarding the number of tasks
I. Generalization Performance for Task Types
We evaluated the generalization performance of VDN for various task types, i.e., the average reward when the ratio of each task type differed from the training condition. Table XV shows various cases with different ratios of task types. We set the conditions in Table VI as the default. We prepared cases that increase the ratio of each task type and decrease the ratio of basic tasks. We also prepared cases that change the ratio of the three task types by 5% each and the ratio of basic tasks by 15%.
Table XVI shows the average reward and other indicators for various cases changing the ratio of task types. We carried out 20 calculations with random initial conditions and set the same random seeds for all calculations. The average reward decreased as the ratio of Tasks 2–4 increased from the training condition. For Task 3, VDN had generalization performance until the task ratio increased by 2%. The average reward notably decreased for increases of 5%. This is because tasks with high average computing demand occupy the computing resources over a long period and reduce node utilization. For Tasks 2 and 4, VDN had generalization performance even when the task ratio increased by 20%. This is because Tasks 2 and 4 occupy computing resources for only a short time. Similarly, when we simultaneously increased the ratio of the three task types by 5%, VDN showed no generalization performance. In other words, the proposed method has generalization performance when the prediction error in the proportion of computing-heavy tasks is within 2%. Conversely, for the case in which we simultaneously decreased the ratio of the three task types, the performance of VDN improved. Therefore, we conclude that the proposed method can have generalization performance for various task types by pre-training with the predicted maximum percentage of resource-consuming tasks.
J. Discussion
We discuss the future work of the proposed method. We have demonstrated the effectiveness of the proposed method. However, there are still challenges that we should explore for applying our method in a commercial environment or a future network.
We discuss the challenge of the applicability of the proposed method. We assume that the routes in the access network take the shortest path. Current wired access networks with optical fiber automatically select the shortest path from a few routing options. Current wireless access networks for cellular automatically select the best BS for each ED on the basis of the received signal power. Therefore, the proposed method is applicable in real-world scenarios of the current network. Nevertheless, several challenges remain before the method can be applied in a commercial environment or a future network.
The first challenge is scalability for edges. The proposed method may only be applicable to networks with EC servers at dozens of edges. We target commercial networks that cover a large area, such as networks in multiple regions or parts of a country. Such an access network includes more than 10 000 edges. Similar to other methods [1], [2], [3], [4], [5], [6], [7], [8] that can only handle a few dozen BSs at most, as described in Section II, our method also cannot control more than 10 000 edges. We deploy each agent at each edge (i.e., at each BS), and each agent learns cooperative control. Therefore, the agents would have to learn cooperative control among many agents in such an environment. However, this is difficult to achieve with current technology and is left for future work. As a possible solution, a method combining mean-field game (MFG) theory and DRL may handle a large-scale task-offloading system with many BSs. MFG theory [35] studies strategic decision-making by small interacting agents in large populations, inspired by mean-field theory in physics. Each agent minimizes or maximizes the problem objective, taking into account the decisions of the other agents. Several studies have addressed network control methods that combine MFG theory and DRL, such as unmanned aerial vehicle (UAV) control [36], [37] and computation offloading in EC [38], [39], [40].
The next challenge is scalability for tasks. The processing time of the proposed method for determining the offload server is about 0.2 ms per task. Also, because the proposed method processes tasks sequentially, one at a time, it cannot handle the thousands or millions of tasks per second required in real-world environments. Therefore, achieving sufficient scalability for tasks is another direction for future work. One possible solution is parallelization. Assigning many trained agents to each edge makes it possible to process many tasks simultaneously. However, many trained agents working independently may introduce new coordination problems.
The other challenge is to optimize task offloading for an entire end-to-end network. Since task latency can be expressed as the sum of the backbone and access networks, we can find near-optimal task offloading and routes between EDs and servers by optimizing the backbone and access networks independently. However, future networks will require more efficient task offloading that considers the entire end-to-end multi-domain path between EDs and servers, including the backbone network, access network, edge servers, and cloud servers. For this purpose, the proposed method should consider the latency and link quality of the wireless network between EDs and BSs. The difficulty in the wireless network is that the observed information is incomplete or uncertain. Several studies have addressed task offloading in the wireless network under imperfect channel state information (CSI) [41], [42], [43], [44], [45], [46]. In particular, security is one of the major concerns for mobile EC with incomplete CSI. Several studies have been conducted on security-aware task allocation under incomplete CSI [47], [48], [49]. In addition, several studies have been conducted on end-to-end network slicing or multi-domain network slicing to integrate multiple controls [50], [51], [52]. Although much research has been done, optimizing task offloading in an end-to-end network remains an unsolved problem and one of the major challenges. By combining our method with the above studies, this problem should be addressed in the future.
Conclusion
We formulated an optimal task-offloading problem for multi-cloud and multi-edge networks considering network topology and bandwidth constraints. We proposed a cooperative task-offloading method that is based on cooperative multi-agent deep reinforcement learning (Coop-MADRL). This method can quickly achieve efficient task offloading by learning the relationship between network-demand patterns and optimal task offloading by using deep reinforcement learning in advance. This method also introduces a cooperative multi-agent technique, improving the efficiency of task offloading. Evaluations revealed that the proposed method can minimize network utilization and task latency while minimizing constraint violations in less than 1 ms in various network topologies. They also revealed that cooperative learning improves the efficiency of task offloading. We demonstrated that the proposed method can have generalization performance for various task types by pre-training with many resource-consuming tasks. We plan to evaluate the performance of the proposed method and conduct further detailed analysis in more complicated use cases or real-world applications. We also plan to improve the scalability and interpretability of the proposed method.