Introduction
A. Background and motivation
Beyond 5G (B5G) networks are rapidly expanding to connect billions of machines and Internet of Things (IoT) devices, and promise to support a variety of unprecedented services, including smart cities, smart industries, connected and autonomous systems, and telemedicine [1], [2], [3]. Emerging application scenarios impose various new requirements on B5G networks, such as high resource efficiency, ultra-low latency, high data rates, and high reliability [3], [4]. In massive IoT, network resource efficiency is improved through dense deployment of devices (i.e., forming a dense network), which increases network throughput and provides better quality of service (QoS) for more users [5]. Resource multiplexing has become fundamental in massive IoT networks due to the large-scale dense connectivity of terminal devices (TDs). However, the ongoing densification of the network induces severe resource conflicts, leading to large-scale network conflict that reduces network throughput. Therefore, dynamically providing and orchestrating network resource management (NRM) tailored to such emerging services is a unique challenge, one that calls for artificial intelligence (AI) technology to convert traditional wireless communication systems into intelligent wireless communication systems in B5G massive IoT [6].
An NRM system manages massive IoT by utilizing the available network resources efficiently to ensure QoS and resource efficiency [7]. Resources can be fully utilized through effective design techniques, equitable resource allocation, and efficient packet scheduling. However, ensuring high network resource efficiency in wireless communication networks is challenging, as the underlying optimization problem is a nonconvex combinatorial optimization (CO) problem in massive IoT scenarios [8]. Recently, intelligence-enhanced massive IoT has been built with collaborative reinforcement learning (CRL), a form of distributed collaborative machine learning. Because multiple agents learn and perform tasks simultaneously, CRL can better handle large-scale problems and complex environments for NRM systems [9]. For instance, NRM can leverage data analytics and AI techniques to analyze large volumes of data and make informed decisions, enabling better resource management decisions and thus improved network performance and user experience [10]. As a result, an AI-assisted IoT system is a promising solution for enhancing resource efficiency in B5G massive IoT [11].
B. Related Work
There are various approaches for NRM in IoT systems, mainly comprising optimization-based methods and heuristic methods [12]. However, multi-user NRM is usually modeled as an NP-hard problem, which is challenging to solve with typical optimization methods [13], [14], [15]. Ghanem et al. [16] use a branch-and-bound approach based on discrete monotonic optimization theory to develop a globally optimal solution for the NRM problem, reformulating the optimization problem in the canonical form of difference-of-convex programming. Although convex-optimization-based approaches can solve NRM problems, the primal problem must first be converted into a solvable one. The optimum of the converted problem, however, usually differs from that of the primal one, and solving the converted problem is computationally intensive [12]. To tackle this issue, machine learning has emerged as a promising technology for NRM in IoT systems and is considered effective in improving resource efficiency [17], [18], [19], [20], [21]. Despite a mild loss of optimality, reinforcement learning (RL) approaches can still perform well [12]. For instance, an RL-based scheme was adopted to address dynamic network resource management in IoT systems with cognitive radio capabilities, aiming to enhance data rates and minimize routing delays [17]. An Actor-Critic based radio resource management scheme was proposed to handle the radio resource management challenge [18]. Zhu et al. [19] adopted deep reinforcement learning (DRL) and Q-learning methods, focusing mainly on resource management policies and offloading in vehicular edge computing networks. In the context of edge-IoT systems, resource management for maximizing users' QoS is investigated in [20], which formulates the problem as a Markov decision process (MDP) and proposes a Q-value approximation approach that improves QoS, latency, and application task success ratio. Furthermore, transmission latency and computation offloading can be addressed with an MDP and model-free RL approach in dynamic mobile edge computing-aided IoT. In digital twin applications, a double deep Q-network based resource management scheme that optimizes resource efficiency is proposed in [21] for multiple IoT devices, achieving low computational complexity and optimal processing time.
In traditional RL, all data is often sent to a central server for training, leading to significant communication and computation overhead [22]. Since the training of AI-driven models is an essential part [23], several recent works have considered CRL schemes to reduce the training overhead [24], [25], [26]. CRL is a collaborative machine learning method that trains a shared model across multiple decentralized and potentially non-identical agents or devices [27]. CRL reduces the communication burden by allowing devices to train locally and transmit only model updates, and such systems can be more fault-tolerant because the shared model can adapt to changes, failures, or loss of individual agents without compromising the entire learning process [28]. In addition, CRL leverages the computational resources available on individual devices or agents, distributing the training workload and potentially reducing the need for centralized high-performance servers [29], [30], [31]. For instance, a collaborative learning scheme called adaptive federated averaging (FedAvg) was proposed in [29] for communication efficiency, which dramatically reduces the number of rounds to converge by taking the form of a distributed Adam optimization. A FedAvg method based on model segmentation is introduced in [30], which uses a gossip protocol for client sampling in each round of model aggregation. Collaborative learning models were proposed in [31] to improve resource utilization for multidomain networks by executing horizontal and vertical auto-scaling. Chen et al. [32] proposed a collaborative learning framework that jointly considers network resource management and user selection to minimize the loss value of the collaborative learning model in the wireless network. Existing works focus on optimizing resource management and rarely take large-scale network conflict into account. Dense deployment of IoT devices leads to large-scale network conflict, which poses a great challenge to resource management in massive IoT networks [12]. Hence, how to adopt distributed collaborative machine learning technology to avoid large-scale network conflict and achieve conflict-free resource management remains an unresolved issue.
C. Contributions
To tackle the challenge mentioned above, we propose a conflict hypergraph based CRL resource management framework for B5G massive IoT system management and applications, which enables B5G massive IoT to maximize network throughput and resource efficiency without large-scale network conflict. Relative to the existing works, the contributions of this work are summarized as follows:
To avoid large-scale network conflict and achieve conflict-free resource management, we analyze the direct and indirect conflicts of the B5G massive IoT network and establish a conflict graph model that clearly shows the conflict relationships between links. In addition, based on the theory of maximal cliques and hypergraphs, the conflict graph model is transformed into a conflict hypergraph model, which greatly reduces the difficulty of conflict-avoiding resource management.
Since conflict hypergraph-based resource management is an NP-hard CO problem that is computationally intensive to solve, we formulate an MDP model for NRM with sequential decision-making characteristics and propose a resource-efficient RL solution. In particular, the reward function is designed according to the high resource-efficiency requirement under the conflict-free condition, which enables the RL agent to obtain a resource management scheme that satisfies the constraints of the CO problem.
To reduce the computational load by distributing the computational workload throughout the entire network and to achieve distributed CRL, the federated averaging advantage Actor-Critic (FedAvg-A2C) is proposed to handle the network conflict-free resource management problem in B5G massive IoT scenarios and to accelerate the training process. Specifically, a FedAvg-based collaborative training framework is formulated, which consists of multiple local A2C networks and a global network.
The rest of this paper is organized as follows: Section II describes the system model and analyzes the resource conflict. Section III introduces the conflict hypergraph model and the conflict-free resource management problem. The proposed scheme is presented in Section IV. Section V presents the simulation results of the proposed methods. Finally, Section VI concludes the paper.
System Model
This section introduces resource management methods for TDs in B5G massive IoT architecture. It combines graph theory and CRL technology to support the scheduling of multidimensional resources in the form of transactions.
A. Resource Management Model Based on Collaborative Framework
As shown in Fig. 1, the B5G massive IoT is decentralized, and all transactions and related operations are recorded at the local data center. The B5G massive IoT includes a device set
B. Conflict Analyzed Based on Graph
For the B5G massive IoT communication structure, it is recorded by graph
The communication links and the relationships between the nodes can be represented with the incidence matrix \begin{align*} \mathbf{G}_{TI} = \left[ \begin{array}{ccc} \left( v_{t1}, e_{t1} \right) & \cdots & \left( v_{t1}, e_{tm} \right) \\ \vdots & \ddots & \vdots \\ \left( v_{tn}, e_{t1} \right) & \cdots & \left( v_{tn}, e_{tm} \right) \end{array} \right], \tag{1}\end{align*}
\begin{align*} \left( v_{ti}, e_{tj} \right) = \begin{cases} 1, & v_{ti} \in e_{tj} \\ 0, & v_{ti} \notin e_{tj} \end{cases}. \tag{2}\end{align*}
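To make the construction of (1) and (2) concrete, the following Python sketch builds the incidence matrix for a small, hypothetical topology; the device and link names are illustrative assumptions rather than values from the paper.

import numpy as np

# Hypothetical toy topology: devices v_t1..v_t4 and links e_t1..e_t3,
# where each link is the set of devices it connects.
devices = ["v_t1", "v_t2", "v_t3", "v_t4"]
links = {"e_t1": {"v_t1", "v_t2"},
         "e_t2": {"v_t2", "v_t3"},
         "e_t3": {"v_t3", "v_t4"}}

# Incidence matrix G_TI of eq. (1): entry (i, j) is 1 iff device i
# belongs to link j, following the indicator rule of eq. (2).
G_TI = np.array([[1 if v in members else 0 for members in links.values()]
                 for v in devices])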
An example is presented in Fig. 2, which includes 13 TDs and 16 communication links (CLs), denoted as
To promote network resource management for resource efficiency in B5G massive IoT scenarios, the conflict conditions between TDs are classified as direct conflict and indirect conflict as follows:
Direct conflict: Two TD pairs share a channel and have a TD in common, i.e., {\text {CL}}_{1} and {\text {CL}}_{2} share a channel in Fig. 3(a).
Indirect conflict: Two TD pairs share a channel and a TD of one pair is within the communication range of the other pair, i.e., {\text {CL}}_{1} and {\text {CL}}_{3} share a channel in Fig. 3(b).
To avoid TD conflicts in the communication network topology, direct conflicts can be resolved by a typical edge-coloring algorithm. However, the indirect conflict caused by hidden TDs remains inevitable, since the indirect conflict problem diverges from the core of the typical edge-coloring problem. Therefore, it is necessary to further analyze the potential conflicts of CLs between the TDs.
Resource Management Design based on Conflict Hypergraphs
In this section, the conflict graph is built to clearly show the resource conflict relationships. In addition, based on the theory of cliques and hypergraphs, the conflict graph is transformed into a hypergraph, which reduces the difficulty of resolving resource conflicts. Finally, the resource conflict problem is generalized as a node coloring problem on the hypergraph.
A. Conflict Graph Model
To address the resource management conflict problem in B5G massive IoT, the conflict graph model
The conflicting relationships between nodes can be represented by the adjacency matrix \begin{align*} \mathbf{G}_{CA} = \left[ \begin{array}{ccc} \left( e_{t1}, e_{t1} \right) & \cdots & \left( e_{t1}, e_{tm} \right) \\ \vdots & \ddots & \vdots \\ \left( e_{tm}, e_{t1} \right) & \cdots & \left( e_{tm}, e_{tm} \right) \end{array} \right], \tag{3}\end{align*}
\begin{align*} \left( e_{ti}, e_{tj} \right) = \begin{cases} 1, & e_{ti} \text{ conflicts with } e_{tj} \\ 0, & e_{ti} \text{ does not conflict with } e_{tj} \end{cases}. \tag{4}\end{align*}
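As a minimal sketch of how the adjacency matrix of (3) and (4) could be assembled from the two conflict conditions of Fig. 3: the shared-device test captures direct conflicts, while in_range is an assumed predicate standing in for the communication-range check that produces indirect conflicts.

import numpy as np

def conflict_matrix(links, in_range):
    # links: dict {name: (tx, rx)} of TD pairs sharing one channel.
    # in_range(td, link): assumed predicate, True when device td lies
    # in the communication range of link (hidden-terminal case).
    names = list(links)
    m = len(names)
    G_CA = np.zeros((m, m), dtype=int)
    for i, li in enumerate(names):
        for j, lj in enumerate(names):
            if i == j:
                continue
            # Direct conflict: the two TD pairs share a terminal device.
            direct = bool(set(links[li]) & set(links[lj]))
            # Indirect conflict: a TD of one pair is inside the other
            # pair's communication range.
            indirect = any(in_range(td, lj) for td in links[li])
            G_CA[i, j] = int(direct or indirect)
    return G_CA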
Then, following the principles of Fig. 3, the conflict graph can be constructed as shown in Fig. 4. For clarity, an example illustrates the construction of the conflict graph: the node
B. Conflict Hypergraph Model
To reduce the difficulty of avoiding resource conflicts, we simplify the conflict graph based on the theory of cliques and hypergraphs. As a fully connected subgraph, a clique can be expressed as a hyperedge, which quickly reduces the dimension of the conflict graph's matrix. The definitions of clique and maximal clique are as follows:
Clique: a sub-graph in the conflict graph, where any two nodes are connected.
Maximal clique: a clique which is not a sub-graph of other cliques.
The hypergraph can be expressed as \begin{align*} h\left ({ {v,e} }\right ) = \begin{cases} {1},\; & {v \in {e}} \\ {0},\; & \text {otherwise}. \end{cases}\tag{5}\end{align*}
According to the definition of the maximal clique, the maximal cliques in the conflict graph are listed in Table 1. The nodes in a clique are connected to each other, which can be verified through the conflict relationships between the nodes in Fig. 4.
According to the theory of hypergraphs and cliques, all nodes in a clique are connected with each other; thus any clique can form a hyperedge that preserves the conflict information without loss, since any two nodes in the clique conflict with each other. A maximal clique can contain more nodes (i.e., its hyperedge contains multiple nodes). The set of all maximal cliques transforms the conflict graph into a conflict hypergraph, simplifying the matrix and reducing the difficulty of conflict avoidance while keeping the conflict relationships between nodes unchanged. The conflict avoidance problem on the conflict hypergraph is essentially a node coloring problem on the hypergraph.
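As a sketch of this graph-to-hypergraph transformation, the maximal cliques can be enumerated with networkx, and each clique becomes one hyperedge, i.e., one column of the incidence matrix in (5); the function name and array layout are implementation choices.

import numpy as np
import networkx as nx

def to_conflict_hypergraph(G_CA):
    # Turn the conflict adjacency matrix of eq. (3) into the hypergraph
    # incidence matrix of eq. (5): one column per maximal clique.
    G = nx.from_numpy_array(np.asarray(G_CA))
    G.remove_edges_from(nx.selfloop_edges(G))  # cliques need a simple graph
    cliques = list(nx.find_cliques(G))         # maximal cliques = hyperedges
    H = np.zeros((G.number_of_nodes(), len(cliques)), dtype=int)
    for j, clique in enumerate(cliques):
        for v in clique:
            H[v, j] = 1                        # h(v, e) = 1 iff v is in e
    return H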
C. Problem Formulation
In this section, we formulate the CO problem (i.e., the node coloring of the hypergraph) for resource-efficient network management in the B5G massive IoT scenario. To avoid resource allocation conflicts, we define the conflict degree of each node, i.e., its SINR, denoted as \begin{equation*} {\nu _{i}} = \frac {{{P_{i}}{h_{i}}}}{{{\sigma ^{2}} + \sum _{j \in {\mathcal {N}_{i}}} {{P_{j}}{h_{j,i}}} }},\quad i \in \left \{{ {1,2,\ldots , {N_{{\textrm {TD}}}}} }\right \},\tag{6}\end{equation*}
\begin{equation*} R_{i}^{t} = B \cdot \log \left ({ {1 + \nu _{i}^{t}} }\right ), \tag{7}\end{equation*}
\begin{align*} \max ~&{\lambda _{1}}\sum _{i} {R_{i}^{t}} + {\lambda _{2}}\frac {{{N_{{\textrm {TD}}}} - N_{{\mathbf {k}}}^{t}}}{{{N_{{\textrm {TD}}}}}}, \tag{8a}\\ {\textrm {s.t.}}~&\varphi = 0,\tag{8b}\\&\nu _{i}^{t} \ge \nu _{i}^{\min },\tag{8c}\\&i \in \left \{{ {1,2,\ldots , {N_{{\textrm {TD}}}}} }\right \},\tag{8d}\end{align*}
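A hedged numerical reading of (6)-(8a): given transmit powers, channel gains, and an assignment vector k, the snippet below evaluates each TD's SINR, its rate (assuming a base-2 logarithm in (7)), and the weighted objective of (8a), where the number of resources in use stands in for N_k; all variable names and the interference construction are assumptions for illustration.

import numpy as np

def objective(P, h, H_cross, k, B=1.0, sigma2=1e-9, lam1=1.0, lam2=1.0):
    # P[i]: transmit power of TD i; h[i]: direct channel gain;
    # H_cross[j, i]: cross gain from TD j to TD i; k[i]: assigned resource.
    P, h, H_cross, k = map(np.asarray, (P, h, H_cross, k))
    n = len(P)
    # Interference sum of eq. (6): only co-channel TDs j != i interfere.
    co_channel = (k[:, None] == k[None, :]) & ~np.eye(n, dtype=bool)
    interference = (co_channel * (P[:, None] * H_cross)).sum(axis=0)
    sinr = P * h / (sigma2 + interference)              # eq. (6)
    rate = B * np.log2(1.0 + sinr)                      # eq. (7)
    n_used = len(np.unique(k))                          # assumed N_k
    return lam1 * rate.sum() + lam2 * (n - n_used) / n  # eq. (8a)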
Resource Management based on CRL method
To solve the complicated CO problem in (8), a CRL-based method in B5G massive IoT is proposed to achieve long-term resource efficiency. Hence, the conflict-free resource management MDP problem needs to be defined carefully for implementation in B5G massive IoT.
A. Network Conflict-free Resource Management MDP Problem Formulation
The optimization problem can be modeled as an MDP by designing a reasonable reward, where the reward function design reflects the optimization objective and constraints. Therefore, the reward should involve throughput, resource efficiency, conflict, and SINR requirements. Generally, RL-based network resource management problems can be regarded as learning resource management actions in the B5G massive IoT environment by sequentially allocating resources to all nodes over a sequence of time steps. Hence, resource management of the B5G massive IoT network is modeled as an MDP, which has the Markov property and accesses all the relevant information needed to make decisions.
In the MDP, the agent aims to maximize the cumulative discounted reward from time t, \begin{equation*} G_{t}^{\gamma } = \sum _{i = t}^{T} {{\gamma ^{i - t}}{r_{i + 1}}},\tag{9}\end{equation*}
\begin{equation*} \underset {\pi }{\max }\,J\left ({ \pi }\right )={{\mathbb {E}}_{\pi }}\left [{ G_{t}^{\gamma } }\right ].\tag{10}\end{equation*}
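The return in (9) can be computed with a simple backward recursion; this helper is a sketch for a finite episode.

def discounted_return(rewards, gamma=0.99):
    # G_t = r_{t+1} + gamma * G_{t+1}, evaluated backwards over the episode,
    # which reproduces the sum in eq. (9) for every time step.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]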
Solving the MDP problem of maximizing the cumulative discounted reward depends on the action-value function
B. RL Agent Design
The B5G massive IoT network state is formed by the following parameters, which are observed by the RL agent at time t:
{{\mathbf {m}}_{\nu }^{t}}: The set of all TDs' SINR values {\nu } at time {t}.
{\varphi ^{t}}: The network conflict of B5G massive IoT at time {t}.
{{\mathbf {c}}_{\min }^{t}}: The set of minimum rate requirements at time {t}.
{{\mathbf {H}}}: The hypergraph incidence matrix of B5G massive IoT.
{{\mathbf {k}}^{t}}: The set of assigned network resources for all TDs at time {t}.
At time t, the state is expressed as \begin{equation*} {s_{t}} = \left \{{ {{\mathbf {m}}_{\nu }^{t},{\varphi ^{t}},{\mathbf {c}}_{\min }^{t},{\mathbf {H}},{{\mathbf {k}}^{t}}} }\right \}. \tag{11}\end{equation*}
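In an implementation, the observation tuple of (11) would typically be flattened into a single feature vector before being fed to the networks; the concatenation order below is an implementation choice, not prescribed by the paper.

import numpy as np

def build_state(m_nu, phi, c_min, H, k):
    # Flatten the five components of eq. (11) into one float vector.
    return np.concatenate([np.ravel(m_nu), [phi], np.ravel(c_min),
                           np.ravel(H), np.ravel(k)]).astype(np.float32)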
At each time
Maximizing the network throughput and network resource efficiency while avoiding conflict and meeting the minimum SINR requirement are the implicit optimization goals in (8). According to (8), the reward function consists of four parts: network throughput, resource efficiency, the SINR requirement, and the conflict-free condition. Hence, when the agent maximizes the cumulative discounted reward, long-term maximization of network throughput and resource efficiency is achieved through resource allocation subject to the constraints. The network conflict-free condition is represented as a penalty incurred when the RL agent adopts network resource allocation actions that generate network conflict. Therefore, the B5G massive IoT environment returns a reward \begin{align*} {r_{t}}=&{\lambda _{1}}\sum _{i} {R_{i}^{t}} + {\lambda _{2}}\frac {{{N_{{\textrm {TD}}}} - N_{\mathbf {k}}^{t}}}{{{N_{{\textrm {TD}}}}}} \\&{}+ {\lambda _{3}}\sum _{i} {{{\left |{ {\nu _{i}^{t} - \nu _{i}^{\min }} }\right |}^{2}}} - {\lambda _{4}}\varphi , \tag{12}\end{align*}
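A direct transcription of (12) as a reward function is sketched below; the sign of the SINR term follows the printed equation, though reading it as a penalty (a negative lambda_3) would be an equally plausible interpretation.

def reward(rates, n_td, n_used, sinr, sinr_min, phi,
           lam=(1.0, 1.0, 1.0, 1.0)):
    # Four parts of eq. (12): throughput, resource efficiency,
    # SINR deviation, and the conflict penalty phi.
    l1, l2, l3, l4 = lam
    return (l1 * sum(rates)
            + l2 * (n_td - n_used) / n_td
            + l3 * sum(abs(s - s_min) ** 2
                       for s, s_min in zip(sinr, sinr_min))
            - l4 * phi)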
The value functions are defined to quantify the expected return under B5G massive IoT network resource management policy \begin{align*}&{V^{\pi } }\left ({ s }\right ) = {\mathbb {E}_{a \sim \pi \left ({ {\left.{ \cdot }\right |s} }\right ),s^{\prime } \sim p\left ({ {\left.{ \cdot }\right |s,a} }\right )}}\left [{ {\left.{ {\sum _{t = 0}^{T} {{\gamma ^{t}}{r_{t}}} } }\right |{s_{t}} = s} }\right ],\tag{13}\\&{Q^{\pi } }\left ({ {s,a} }\right ) \\&\;={\mathbb {E}_{a \sim \pi \left ({ {\left.{ \cdot }\right |s} }\right ),s^{\prime } \sim p\left ({ {\left.{ \cdot }\right |s,a} }\right )}}\left [{ {\left.{ {\sum _{t = 0}^{T} {{\gamma ^{t}}{r_{t}}} } }\right |{s_{t}} = s,{a_{t}} = a} }\right ],\tag{14}\end{align*}
C. FedAvg-A2C based Resource Management Method
The actor is a policy network that takes the state as input and outputs the action that approximates the policy model \begin{equation*} {J_{\pi } }\left ({ \theta }\right ) = {\mathbb {E}_{\tau \sim \pi \left ({{a |s;\theta } }\right )}}\left [{ {r\left ({ \tau }\right )} }\right ],\tag{15}\end{equation*}
\begin{align*}&{\nabla _{\theta }}{J_{\pi }}\left ({ \theta }\right ) = {{\mathbb {E}}_{\tau \sim \pi \left ({ \left.{ a }\right |s;\theta }\right )}}\left [{ \sum _{t=0}^{T}{{\nabla _{\theta }}\left ({ \log \pi \left ({ \left.{ {a_{t}} }\right |{s_{t}};\theta }\right ) }\right ){A^{{\pi _{\theta }}}}\left ({ {s_{t}},{a_{t}} }\right )} }\right ]. \tag{16}\end{align*}
We can measure the advantage of taking action \begin{align*} A^{{\pi _{\theta } }}\left ({ {{s_{t}},{a_{t}}} }\right )=&{Q^{{\pi _{\theta } }}}\left ({ {{s_{t}},{a_{t}};w} }\right ) \\&{}-\sum _{a \in \mathcal {A}} {\pi \left ({ {\left.{ a }\right |{s_{t}};\theta } }\right ){Q^{{\pi _{\theta } }}}\left ({ {{s_{t}},a;w} }\right )}, \tag{17}\end{align*}
\begin{equation*} \theta \leftarrow \theta - \eta \nabla {J_{\pi } }\left ({ \theta }\right ).\tag{18}\end{equation*}
Substituting (17) into (16) yields \begin{equation*} {\nabla _{\theta }}{J_{\pi }}\left ({ \theta }\right ) = {\mathbb {E}_{\tau \sim \pi \left ({ \left.{ a }\right |s;\theta }\right )}}\left [{ \sum _{t = 0}^{T} {\nabla _{\theta }}\left ({ \log \pi \left ({ \left.{ {a_{t}} }\right |{s_{t}};\theta }\right ) }\right )\left ({ {Q^{{\pi _{\theta }}}}\left ({ {s_{t}},{a_{t}};w }\right ) - \sum _{a \in \mathcal {A}} \pi \left ({ \left.{ a }\right |{s_{t}};\theta }\right ){Q^{{\pi _{\theta }}}}\left ({ {s_{t}},a;w }\right ) }\right ) }\right ]. \tag{19}\end{equation*}
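A compact PyTorch sketch of one actor update implied by (16)-(19): the advantage is the critic's Q-value minus the policy-weighted baseline of (17). The discrete-action layout, batch shapes, and function names are assumptions.

import torch

def actor_step(policy_net, q_net, states, actions, optimizer):
    # One A2C policy-gradient step following eqs. (16)-(18).
    with torch.no_grad():
        q = q_net(states)                                # Q(s, a; w), all a
    log_probs = torch.log_softmax(policy_net(states), dim=-1)
    probs = log_probs.exp().detach()
    baseline = (probs * q).sum(dim=-1, keepdim=True)     # eq. (17) baseline
    adv = q.gather(1, actions.unsqueeze(1)) - baseline   # advantage A(s, a)
    loss = -(log_probs.gather(1, actions.unsqueeze(1)) * adv).mean()
    optimizer.zero_grad()
    loss.backward()                                      # gradient of eq. (19)
    optimizer.step()                                     # update of eq. (18)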
The critic can provide an action-value function to measure the loss of the resource management strategy network. The Q-value is estimated by a deep neural network (DNN), that is, using the parameter \begin{equation*} w \leftarrow w - \eta \nabla {J_{Q}}\left ({ w }\right ),\tag{20}\end{equation*}
\begin{equation*} {J_{Q}}\left ({ w }\right ) = \frac {1}{2}\left ({ {{r_{t}} + \gamma \sum _{a \in \mathcal {A}} {\pi \left ({ {\left.{ a }\right |{s_{t + 1}};\theta } }\right ){Q^{{\pi _{\theta } }}}\left ({ {{s_{t + 1}},a;w} }\right )} - {Q^{{\pi _{\theta } }}}\left ({ {{s_{t}},{a_{t}};w} }\right )} }\right )^{2}, \tag{21}\end{equation*}
\begin{equation*} \nabla {J_{Q}}\left ({ w }\right ) = -\left ({ {r_{t}} + \gamma \sum _{a \in \mathcal {A}} \pi \left ({ {\left.{ a }\right |{s_{t + 1}};\theta } }\right ){Q^{{\pi _{\theta } }}}\left ({ {{s_{t + 1}},a;w} }\right ) - {Q^{{\pi _{\theta } }}}\left ({ {{s_{t}},{a_{t}};w} }\right ) }\right ){\nabla _{w}}{Q^{{\pi _{\theta } }}}\left ({ {{s_{t}},{a_{t}};w} }\right ). \tag{22}\end{equation*}
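The corresponding critic update of (20)-(22) can be sketched as a one-step TD regression, where the next-state value is the expectation of Q under the current policy; shapes and names are again assumptions.

import torch

def critic_step(q_net, policy_net, batch, optimizer, gamma=0.99):
    # One critic step minimizing the TD error of eqs. (21)-(22).
    s, a, r, s_next = batch
    with torch.no_grad():
        probs_next = torch.softmax(policy_net(s_next), dim=-1)
        v_next = (probs_next * q_net(s_next)).sum(dim=-1)
        target = r + gamma * v_next                      # TD target
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = 0.5 * (target - q_sa).pow(2).mean()           # eq. (21)
    optimizer.zero_grad()
    loss.backward()                                      # gradient of eq. (22)
    optimizer.step()                                     # update of eq. (20)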
To handle the huge data volume of B5G massive IoT, this paper proposes the FedAvg-A2C method to update the parameters of the value network and the policy network. In the considered B5G massive IoT, the global A2C network is maintained by the FedAvg-A2C server, and each RL agent obtains the global model from the FedAvg-A2C server to constitute its local A2C network. In each round of the global model training process, each RL agent updates its own local A2C model with a randomly sampled mini-batch of data \begin{align*} \min J\left ({{w^{t}}}\right )=&\sum _{k = 1}^{K} {{p_{k}}} J\left ({w_{k}^{t}}\right ),\tag{23a}\\ \min J\left ({{\theta ^{t}}}\right )=&\sum _{k = 1}^{K} {{p_{k}}} J\left ({\theta _{k}^{t}}\right ),\tag{23b}\end{align*}
\begin{align*} \theta _{k}^{t}=&\underbrace {{\theta ^{t - 1}}}_{{\textrm {global}}} - \eta \cdot \underbrace {\nabla J\left ({\theta _{k}^{t - 1}}\right )}_{{\textrm {local}}},\tag{24a}\\ w_{k}^{t}=&\underbrace {{w^{t - 1}}}_{{\textrm {global}}} - \eta \cdot \underbrace {\nabla J\left ({w_{k}^{t - 1}}\right )}_{{\textrm {local}}},\tag{24b}\\ {\theta ^{t}}=&\sum _{k = 1}^{K} {{p_{k}}} \theta _{k}^{t},\tag{24c}\\ {w^{t}}=&\sum _{k = 1}^{K} {{p_{k}}} w_{k}^{t}.\tag{24d}\end{align*}
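The server-side aggregation of (24c)-(24d) is a weighted parameter average; a minimal PyTorch sketch, assuming the weights p sum to one and all local networks share the global architecture, is given below.

import torch

@torch.no_grad()
def fedavg_aggregate(global_net, local_nets, p):
    # Replace each global parameter with the p_k-weighted average of the
    # local copies, as in eqs. (24c) and (24d).
    new_state = {}
    for name, param in global_net.state_dict().items():
        new_state[name] = sum(pk * net.state_dict()[name].float()
                              for pk, net in zip(p, local_nets)).to(param.dtype)
    global_net.load_state_dict(new_state)
    return global_net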
D. Algorithm Complexity Analysis
The computational complexity of FedAvg-A2C accounts for the local model training at the A2C agents and the local model aggregation at the server. Since a single A2C network model is trained with random samples from its own local buffer, the complexity of the RL local update is
Simulation
In this section, the proposed scheme is validated through numerical simulations. First, the simulation setup is outlined, followed by a comprehensive presentation and analysis of the numerical results. The primary goal is to showcase the superiority of the proposed schemes compared to existing works. We run the simulations on a DELL server with an Intel Xeon Gold 6242R CPU running at 3.1 GHz, 64 GB of RAM, and two GPUs (NVIDIA GeForce RTX 3080 Ti) under Ubuntu 18.04 LTS, using Python 3.9.13. The FedAvg-A2C algorithm was implemented in PyTorch 2.0.0. The hyperparameters of the proposed FedAvg-A2C are shown in Table 2.
To verify the efficiency of the proposed algorithm, the following methods are simulated for performance comparison: PPO-based resource management (Comparison Algorithm 1), D3QN-based resource management (Comparison Algorithm 2), and random resource management (Comparison Algorithm 3).
A. Convergence of the Proposed Algorithm
Fig. 6 shows the convergence of the proposed algorithm under different learning rates, with the number of TDs set to 20. The horizontal and vertical axes represent the number of training iterations and the received reward, respectively. As the learning rate increases, the proposed method converges faster. Fig. 6 shows that the FedAvg-A2C model achieves a better reward when
The convergence under different discount factors is shown in Fig. 7. When
B. Performance of the Proposed Algorithm
Fig. 8 highlights the advantages of the proposed algorithm by comparing its maximum network throughput with the three comparison algorithms for different numbers of TDs. As the number of TDs increases, network resource conflicts within the communication system intensify, and all four algorithms experience an overall increase in maximum network throughput. The proposed algorithm outperforms comparison algorithms 1, 2, and 3, exhibiting significantly higher network throughput. The results in Fig. 8 validate the capability of the proposed algorithm to effectively enhance network throughput and push the upper limit of the system's capacity.
Fig. 9 compares the average network throughput of the proposed algorithm and the three comparison algorithms for varying numbers of TDs. As the number of TDs increases, all four algorithms show a notable upward trend in network throughput. The proposed algorithm clearly outperforms comparison algorithms 1, 2, and 3, highlighting its effectiveness in enhancing the average network throughput. The results in Fig. 9 validate the capability of the proposed algorithm to improve system performance.
Fig. 10 compares the maximal resource efficiency of the proposed algorithm and the three comparison algorithms for varying numbers of TDs. As the figure shows, increasing the number of TDs causes the network resource efficiency to drop. The proposed method performs much better, effectively mitigating this decline and enhancing the maximal network resource efficiency of the system.
Fig. 11 compares the average resource efficiency of the proposed algorithm and the three comparison algorithms for varying numbers of TDs. Increasing the number of TDs reduces system stability, which causes the average network resource efficiency to drop in Fig. 11. The proposed method performs much better, effectively mitigating this decline and enhancing the average network resource efficiency of the system.
Conclusion
In this paper, the conflict-free, resource-efficient network management problem in the B5G massive IoT scenario was investigated, involving densely deployed IoT devices and a resource management system. Dense deployment of IoT devices generates large-scale network conflict in B5G massive IoT systems and degrades the resource efficiency of the resource management system. A hypergraph theory-based network conflict model was proposed to quantify the conflict of the whole B5G massive IoT. Under the conflict hypergraph model constraint, this paper formulated the CO problem of maximizing network throughput and resource efficiency. Since conflict hypergraph-based resource management is an NP-hard optimization problem that is computationally intensive to solve, we formulated an MDP for the NRM system with sequential decision-making characteristics and proposed a resource-efficient CRL solution. Then, a FedAvg-A2C based resource management algorithm was proposed to handle the network conflict-free resource management problem in B5G massive IoT scenarios and to accelerate the training process. Finally, simulation results demonstrate the effectiveness of FedAvg-A2C and validate its superiority over the comparison algorithms.