Introduction
A. Background and motivation
Beyond 5G (B5G) networks are rapidly expanding to connect billions of machines and Internet of Things (IoT) devices, and promise to support a variety of unprecedented services, including smart cities, smart industries, connected and autonomous systems, and telemedicine [1], [2], [3]. Emerging application scenarios impose various new requirements on B5G networks, such as high resource efficiency, ultra-low latency, high data rates, and high reliability [3], [4]. In massive IoT, network resource efficiency is improved through dense deployment of devices (i.e., forming a dense network), which increases network throughput and provides better quality of service (QoS) for more users [5]. Resource multiplexing has become fundamental in massive IoT networks due to the large-scale dense connectivity of terminal devices (TDs). However, the ongoing densification of the network induces severe resource conflicts, leading to large-scale network conflict that reduces network throughput. Therefore, dynamically providing and orchestrating network resource management (NRM) tailored to such emerging services is a unique challenge, one that calls for artificial intelligence (AI) technology to convert traditional wireless communication systems into intelligent wireless communication systems in B5G massive IoT [6].
An NRM system manages massive IoT by utilizing the available network resources efficiently to ensure QoS and resource efficiency [7]. Resources can be fully utilized through effective design techniques, equitable resource allocation, and efficient packet scheduling. However, ensuring high network resource efficiency in wireless communication networks is challenging, as the underlying optimization problem is a nonconvex combinatorial optimization (CO) problem in massive IoT scenarios [8]. Recently, intelligence-enhanced massive IoT has been built with collaborative reinforcement learning (CRL), a form of distributed collaborative machine learning. Because multiple agents learn and perform tasks simultaneously, CRL can better handle large-scale problems and complex environments for NRM systems [9]. For instance, NRM can leverage data analytics and AI techniques to analyze large volumes of data and make informed decisions, enabling better resource management decisions and thus improved network performance and user experience [10]. As a result, an AI-assisted IoT system is a promising solution for enhancing resource efficiency in B5G massive IoT [11].
B. Related Work
There are various approaches for NRM in IoT systems, mainly comprising optimization-based methods and heuristic methods [12]. However, multi-user NRM is usually modeled as an NP-hard problem, which is challenging to solve with typical optimization methods [13], [14], [15]. Ghanem et al. [16] use a branch-and-bound approach based on discrete monotonic optimization theory to develop a globally optimal solution for the NRM problem, reformulating the optimization problem in the canonical form of difference-of-convex programming. Although convex-optimization-based approaches can solve NRM problems, the primal problem must first be converted into a solvable one. The optimum of the converted problem, however, usually differs from that of the primal one, and solving the converted problem is computationally intensive [12]. To tackle this issue, machine learning has emerged as a promising technology for NRM in IoT systems and is considered effective in improving resource efficiency [17], [18], [19], [20], [21]. Despite a mild loss of optimality, reinforcement learning (RL) approaches can still perform well [12]. For instance, an RL-based scheme was adopted to address dynamic network resource management in IoT systems with cognitive radio capabilities, aiming to enhance data rates and minimize routing delays [17]. An Actor-Critic based radio resource management scheme was proposed to handle the radio resource management challenge [18]. Zhu et al. [19] adopted deep reinforcement learning (DRL) and Q-learning methods, focusing mainly on resource management policies and offloading in vehicular edge computing networks. In the context of edge-IoT systems, resource management for maximizing users' QoS is investigated in [20], which formulates the problem as a Markov decision process (MDP) and proposes a Q-value approximation approach that improves QoS, latency, and application task success ratio. Furthermore, transmission latency and computation offloading can be addressed with an MDP and model-free RL approach in dynamic mobile edge computing-aided IoT. In digital twin applications, a double deep Q-network based resource management scheme that optimizes resource efficiency is proposed in [21] for multiple IoT devices, achieving low computational complexity and optimal processing time.
In traditional RL, all data is often sent to a central server for training, leading to significant communication and computation overhead [22]. Since the training of AI-driven models is an essential part [23], several recent works have considered CRL schemes to reduce the training overhead [24], [25], [26]. CRL is a collaborative machine learning method that trains a shared model across multiple decentralized and potentially non-identical agents or devices [27]. CRL reduces the communication burden by allowing devices to train locally and transmit only model updates, and such systems can be more fault-tolerant because the shared model can adapt to changes, failures, or loss of individual agents without compromising the entire learning process [28]. In addition, CRL leverages the computational resources available on individual devices or agents, distributing the training workload and potentially reducing the need for centralized high-performance servers [29], [30], [31]. For instance, a collaborative learning scheme called adaptive federated averaging (FedAvg) was proposed in [29] for communication efficiency, which dramatically reduces the number of rounds to converge by taking the form of a distributed Adam optimization. A FedAvg method based on model segmentation is introduced in [30], which uses a gossip protocol for client sampling in each round of model aggregation. Collaborative learning models were proposed in [31] to improve resource utilization for multidomain networks by executing horizontal and vertical auto-scaling. Chen et al. [32] proposed a collaborative learning framework that jointly considers network resource management and user selection to minimize the loss value of the collaborative learning model in the wireless network. Existing works focus on optimizing resource management and rarely take large-scale network conflict into account. Dense deployment of IoT devices leads to large-scale network conflict, which poses a great challenge to resource management in massive IoT networks [12]. Hence, how to adopt distributed collaborative machine learning technology to avoid large-scale network conflict and achieve conflict-free resource management remains an unresolved issue.
C. Contributions
To tackle the challenge mentioned above, we propose a conflict hypergraph based CRL resource management framework for B5G massive IoT system management and applications, which enables B5G massive IoT to maximize network throughput and resource efficiency without large-scale network conflict. Relative to the existing works, the contributions of this work are summarized as follows:
To avoid large-scale network conflict and achieve conflict-free resource management, we analyze the direct and indirect conflicts of the B5G massive IoT network and establish a conflict graph model that clearly shows the conflict relationships between links. In addition, based on the theory of maximal cliques and hypergraphs, the conflict graph model is transformed into a conflict hypergraph model, which greatly reduces the difficulty of conflict-avoiding resource management.
Since conflict hypergraph-based resource management is an NP-hard CO problem that is computationally intensive to solve, we formulate an MDP model for NRM with sequential decision-making characteristics and propose a resource-efficient RL solution. In particular, the reward function is designed according to the high resource-efficiency requirement under the conflict-free condition, which enables the RL agent to obtain a resource management scheme that satisfies the constraints of the CO problem.
To reduce the computational load by distributing the computational workload throughout the entire network and to achieve distributed CRL, the federated averaging advantage Actor-Critic (FedAvg-A2C) is proposed to handle the network conflict-free resource management problem in B5G massive IoT scenarios and to accelerate the training process. Specifically, a FedAvg-based collaborative training framework is formulated, which consists of multiple local A2C networks and a global network.
The rest of this paper is organized as follows: Section II describes the system model and analyzes the resource conflict. Section III introduces the conflict hypergraph model and the conflict-free resource management problem. The proposed scheme is presented in Section IV. Section V presents the simulation results of the proposed methods. Finally, Section VI concludes the paper.
System Model
This section introduces resource management methods for TDs in B5G massive IoT architecture. It combines graph theory and CRL technology to support the scheduling of multidimensional resources in the form of transactions.
A. Resource Management Model Based on Collaborative Framework
As shown in Fig. 1, the B5G massive IoT is decentralized, and all transactions and related operations are recorded at the local data center. The B5G massive IoT includes a device set
B. Conflict Analyzed Based on Graph
For the B5G massive IoT communication structure, it is recorded by graph
The communication links and the relationships between the nodes can be represented with the incidence matrix \begin{align*} \mathbf{G}_{TI} = \left[ \begin{array}{ccc} \left( v_{t1}, e_{t1} \right) & \cdots & \left( v_{t1}, e_{tm} \right) \\ \vdots & \ddots & \vdots \\ \left( v_{tn}, e_{t1} \right) & \cdots & \left( v_{tn}, e_{tm} \right) \end{array} \right], \tag{1}\end{align*}
\begin{align*} \left( v_{ti}, e_{tj} \right) = \begin{cases} 1, & v_{ti} \in e_{tj} \\ 0, & v_{ti} \notin e_{tj} \end{cases}. \tag{2}\end{align*}
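To make the construction of (1) and (2) concrete, the following Python sketch builds the incidence matrix for a small, hypothetical topology; the device and link names are illustrative assumptions rather than values from the paper.

import numpy as np

# Hypothetical toy topology: devices v_t1..v_t4 and links e_t1..e_t3,
# where each link is the set of devices it connects.
devices = ["v_t1", "v_t2", "v_t3", "v_t4"]
links = {"e_t1": {"v_t1", "v_t2"},
         "e_t2": {"v_t2", "v_t3"},
         "e_t3": {"v_t3", "v_t4"}}

# Incidence matrix G_TI of eq. (1): entry (i, j) is 1 iff device i
# belongs to link j, following the indicator rule of eq. (2).
G_TI = np.array([[1 if v in members else 0 for members in links.values()]
                 for v in devices])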
An example is presented in Fig. 2, which includes 13 TDs and 16 communication links (CLs), denoted as
To promote network resource management for resource efficiency in B5G massive IoT scenarios, the conflict conditions between TDs are classified as direct conflict and indirect conflict as follows:
Direct conflict: Two TD pairs share a channel and have a TD in common, i.e., {\text {CL}}_{1} and {\text {CL}}_{2} share a channel in Fig. 3(a).
Indirect conflict: Two TD pairs share a channel and a TD of one pair is within the communication range of the other pair, i.e., {\text {CL}}_{1} and {\text {CL}}_{3} share a channel in Fig. 3(b).
To avoid TD conflicts in the communication network topology, direct conflicts can be resolved by a typical edge-coloring algorithm. However, the indirect conflict caused by hidden TDs remains inevitable, since the indirect conflict problem diverges from the core of the typical edge-coloring problem. Therefore, it is necessary to further analyze the potential conflicts of CLs between the TDs.
Resource Management Design based on Conflict Hypergraphs
In this section, the conflict graph is built to clearly show the resource conflict relationships. In addition, based on the theory of cliques and hypergraphs, the conflict graph is transformed into a hypergraph, which reduces the difficulty of resolving resource conflicts. Finally, the resource conflict problem is generalized as a node coloring problem on the hypergraph.
A. Conflict Graph Model
To address the resource management conflict problem in B5G massive IoT, the conflict graph model
The conflicting relationships between nodes can be represented by the adjacency matrix \begin{align*} \mathbf{G}_{CA} = \left[ \begin{array}{ccc} \left( e_{t1}, e_{t1} \right) & \cdots & \left( e_{t1}, e_{tm} \right) \\ \vdots & \ddots & \vdots \\ \left( e_{tm}, e_{t1} \right) & \cdots & \left( e_{tm}, e_{tm} \right) \end{array} \right], \tag{3}\end{align*}
\begin{align*} \left( e_{ti}, e_{tj} \right) = \begin{cases} 1, & e_{ti} \text{ conflicts with } e_{tj} \\ 0, & e_{ti} \text{ does not conflict with } e_{tj} \end{cases}. \tag{4}\end{align*}
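As a minimal sketch of how the adjacency matrix of (3) and (4) could be assembled from the two conflict conditions of Fig. 3: the shared-device test captures direct conflicts, while in_range is an assumed predicate standing in for the communication-range check that produces indirect conflicts.

import numpy as np

def conflict_matrix(links, in_range):
    # links: dict {name: (tx, rx)} of TD pairs sharing one channel.
    # in_range(td, link): assumed predicate, True when device td lies
    # in the communication range of link (hidden-terminal case).
    names = list(links)
    m = len(names)
    G_CA = np.zeros((m, m), dtype=int)
    for i, li in enumerate(names):
        for j, lj in enumerate(names):
            if i == j:
                continue
            # Direct conflict: the two TD pairs share a terminal device.
            direct = bool(set(links[li]) & set(links[lj]))
            # Indirect conflict: a TD of one pair is inside the other
            # pair's communication range.
            indirect = any(in_range(td, lj) for td in links[li])
            G_CA[i, j] = int(direct or indirect)
    return G_CA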
Then, following the principles of Fig. 3, the conflict graph can be constructed as shown in Fig. 4. For clarity, an example illustrates the construction of the conflict graph: the node
B. Conflict Hypergraph Model
To reduce the difficulty of avoiding resource conflicts, we simplify the conflict graph based on the theory of cliques and hypergraphs. As a fully connected subgraph, a clique can be expressed as a hyperedge, which quickly reduces the dimension of the conflict graph's matrix. The definitions of clique and maximal clique are as follows:
Clique: a sub-graph in the conflict graph, where any two nodes are connected.
Maximal clique: a clique which is not a sub-graph of other cliques.
The hypergraph can be expressed as \begin{align*} h\left ({ {v,e} }\right ) = \begin{cases} {1},\; & {v \in {e}} \\ {0},\; & \text {otherwise}. \end{cases}\tag{5}\end{align*}
According to the definition of the maximal clique, the maximal cliques in the conflict graph are listed in Table 1. The nodes in a clique are connected to each other, which can be verified through the conflict relationships between the nodes in Fig. 4.
According to the theory of hypergraphs and cliques, all nodes in a clique are connected with each other; thus any clique can form a hyperedge that preserves the conflict information without loss, since any two nodes in the clique conflict with each other. A maximal clique can contain more nodes (i.e., its hyperedge contains multiple nodes). The set of all maximal cliques transforms the conflict graph into a conflict hypergraph, simplifying the matrix and reducing the difficulty of conflict avoidance while keeping the conflict relationships between nodes unchanged. The conflict avoidance problem on the conflict hypergraph is essentially a node coloring problem on the hypergraph.
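As a sketch of this graph-to-hypergraph transformation, the maximal cliques can be enumerated with networkx, and each clique becomes one hyperedge, i.e., one column of the incidence matrix in (5); the function name and array layout are implementation choices.

import numpy as np
import networkx as nx

def to_conflict_hypergraph(G_CA):
    # Turn the conflict adjacency matrix of eq. (3) into the hypergraph
    # incidence matrix of eq. (5): one column per maximal clique.
    G = nx.from_numpy_array(np.asarray(G_CA))
    G.remove_edges_from(nx.selfloop_edges(G))  # cliques need a simple graph
    cliques = list(nx.find_cliques(G))         # maximal cliques = hyperedges
    H = np.zeros((G.number_of_nodes(), len(cliques)), dtype=int)
    for j, clique in enumerate(cliques):
        for v in clique:
            H[v, j] = 1                        # h(v, e) = 1 iff v is in e
    return H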
C. Problem Formulation
In this section, we formulate the CO problem (i.e., the node coloring of the hypergraph) for resource-efficient network management in the B5G massive IoT scenario. To avoid resource allocation conflicts, we define the conflict degree of each node, i.e., its SINR, denoted as \begin{equation*} {\nu _{i}} = \frac {{{P_{i}}{h_{i}}}}{{{\sigma ^{2}} + \sum _{j \in {\mathcal {N}_{i}}} {{P_{j}}{h_{j,i}}} }},\quad i \in \left \{{ {1,2,\ldots , {N_{{\textrm {TD}}}}} }\right \},\tag{6}\end{equation*}
\begin{equation*} R_{i}^{t} = B \cdot \log \left ({ {1 + \nu _{i}^{t}} }\right ), \tag{7}\end{equation*}
\begin{align*} \max ~&{\lambda _{1}}\sum _{i} {R_{i}^{t}} + {\lambda _{2}}\frac {{{N_{{\textrm {TD}}}} - N_{{\mathbf {k}}}^{t}}}{{{N_{{\textrm {TD}}}}}}, \tag{8a}\\ {\textrm {s.t.}}~&\varphi = 0,\tag{8b}\\&\nu _{i}^{t} \ge \nu _{i}^{\min },\tag{8c}\\&i \in \left \{{ {1,2,\ldots , {N_{{\textrm {TD}}}}} }\right \},\tag{8d}\end{align*}
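A hedged numerical reading of (6)-(8a): given transmit powers, channel gains, and an assignment vector k, the snippet below evaluates each TD's SINR, its rate (assuming a base-2 logarithm in (7)), and the weighted objective of (8a), where the number of resources in use stands in for N_k; all variable names and the interference construction are assumptions for illustration.

import numpy as np

def objective(P, h, H_cross, k, B=1.0, sigma2=1e-9, lam1=1.0, lam2=1.0):
    # P[i]: transmit power of TD i; h[i]: direct channel gain;
    # H_cross[j, i]: cross gain from TD j to TD i; k[i]: assigned resource.
    P, h, H_cross, k = map(np.asarray, (P, h, H_cross, k))
    n = len(P)
    # Interference sum of eq. (6): only co-channel TDs j != i interfere.
    co_channel = (k[:, None] == k[None, :]) & ~np.eye(n, dtype=bool)
    interference = (co_channel * (P[:, None] * H_cross)).sum(axis=0)
    sinr = P * h / (sigma2 + interference)              # eq. (6)
    rate = B * np.log2(1.0 + sinr)                      # eq. (7)
    n_used = len(np.unique(k))                          # assumed N_k
    return lam1 * rate.sum() + lam2 * (n - n_used) / n  # eq. (8a)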
Resource Management based on CRL method
To solve the complicated CO problem in (8), a CRL-based method in B5G massive IoT is proposed to achieve long-term resource efficiency. Hence, the conflict-free resource management MDP problem needs to be defined carefully for implementation in B5G massive IoT.
A. Network Conflict-free Resource Management MDP Problem Formulation
The optimization problem can be modeled as an MDP by designing a reasonable reward, where the reward function design reflects the optimization objective and constraints. Therefore, the reward should involve throughput, resource efficiency, conflict, and SINR requirements. Generally, RL-based network resource management problems can be regarded as learning resource management actions in the B5G massive IoT environment by sequentially allocating resources to all nodes over a sequence of time steps. Hence, resource management of the B5G massive IoT network is modeled as an MDP, which has the Markov property and accesses all the relevant information needed to make decisions.
In the MDP, the agent aims to maximize the cumulative discounted reward from time t, \begin{equation*} G_{t}^{\gamma } = \sum _{i = t}^{T} {{\gamma ^{i - t}}{r_{i + 1}}},\tag{9}\end{equation*}
\begin{equation*} \underset {\pi }{\max }\,J\left ({ \pi }\right )={{\mathbb {E}}_{\pi }}\left [{ G_{t}^{\gamma } }\right ].\tag{10}\end{equation*}
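The return in (9) can be computed with a simple backward recursion; this helper is a sketch for a finite episode.

def discounted_return(rewards, gamma=0.99):
    # G_t = r_{t+1} + gamma * G_{t+1}, evaluated backwards over the episode,
    # which reproduces the sum in eq. (9) for every time step.
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]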
Solving the MDP problem of maximizing the cumulative discounted reward depends on the action-value function
B. RL Agent Design
The B5G massive IoT network state is formed by the following parameters, which are observed by the RL agent at time t:
{{\mathbf {m}}_{\nu }^{t}}: The set of all TDs' SINR values {\nu } at time {t}.
{\varphi ^{t}}: The network conflict of B5G massive IoT at time {t}.
{{\mathbf {c}}_{\min }^{t}}: The set of minimum rate requirements at time {t}.
{{\mathbf {H}}}: The hypergraph incidence matrix of B5G massive IoT.
{{\mathbf {k}}^{t}}: The set of assigned network resources for all TDs at time {t}.
At time t, the state is expressed as \begin{equation*} {s_{t}} = \left \{{ {{\mathbf {m}}_{\nu }^{t},{\varphi ^{t}},{\mathbf {c}}_{\min }^{t},{\mathbf {H}},{{\mathbf {k}}^{t}}} }\right \}. \tag{11}\end{equation*}
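In an implementation, the observation tuple of (11) would typically be flattened into a single feature vector before being fed to the networks; the concatenation order below is an implementation choice, not prescribed by the paper.

import numpy as np

def build_state(m_nu, phi, c_min, H, k):
    # Flatten the five components of eq. (11) into one float vector.
    return np.concatenate([np.ravel(m_nu), [phi], np.ravel(c_min),
                           np.ravel(H), np.ravel(k)]).astype(np.float32)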
At each time
Maximizing the network throughput and network resource efficiency while avoiding conflict and meeting the minimum SINR requirement are the implicit optimization goals in (8). According to (8), the reward function consists of four parts: network throughput, resource efficiency, the SINR requirement, and the conflict-free condition. Hence, when the agent maximizes the cumulative discounted reward, long-term maximization of network throughput and resource efficiency is achieved through resource allocation subject to the constraints. The network conflict-free condition is represented as a penalty incurred when the RL agent adopts network resource allocation actions that generate network conflict. Therefore, the B5G massive IoT environment returns a reward \begin{align*} {r_{t}}=&{\lambda _{1}}\sum _{i} {R_{i}^{t}} + {\lambda _{2}}\frac {{{N_{{\textrm {TD}}}} - N_{\mathbf {k}}^{t}}}{{{N_{{\textrm {TD}}}}}} \\&{}+ {\lambda _{3}}\sum _{i} {{{\left |{ {\nu _{i}^{t} - \nu _{i}^{\min }} }\right |}^{2}}} - {\lambda _{4}}\varphi , \tag{12}\end{align*}
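A direct transcription of (12) as a reward function is sketched below; the sign of the SINR term follows the printed equation, though reading it as a penalty (a negative lambda_3) would be an equally plausible interpretation.

def reward(rates, n_td, n_used, sinr, sinr_min, phi,
           lam=(1.0, 1.0, 1.0, 1.0)):
    # Four parts of eq. (12): throughput, resource efficiency,
    # SINR deviation, and the conflict penalty phi.
    l1, l2, l3, l4 = lam
    return (l1 * sum(rates)
            + l2 * (n_td - n_used) / n_td
            + l3 * sum(abs(s - s_min) ** 2
                       for s, s_min in zip(sinr, sinr_min))
            - l4 * phi)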
The value functions are defined to quantify the expected return under B5G massive IoT network resource management policy \begin{align*}&{V^{\pi } }\left ({ s }\right ) = {\mathbb {E}_{a \sim \pi \left ({ {\left.{ \cdot }\right |s} }\right ),s^{\prime } \sim p\left ({ {\left.{ \cdot }\right |s,a} }\right )}}\left [{ {\left.{ {\sum _{t = 0}^{T} {{\gamma ^{t}}{r_{t}}} } }\right |{s_{t}} = s} }\right ],\tag{13}\\&{Q^{\pi } }\left ({ {s,a} }\right ) \\&\;={\mathbb {E}_{a \sim \pi \left ({ {\left.{ \cdot }\right |s} }\right ),s^{\prime } \sim p\left ({ {\left.{ \cdot }\right |s,a} }\right )}}\left [{ {\left.{ {\sum _{t = 0}^{T} {{\gamma ^{t}}{r_{t}}} } }\right |{s_{t}} = s,{a_{t}} = a} }\right ],\tag{14}\end{align*}
C. FedAvg-A2C based Resource Management Method
The actor is a policy network that takes the state as input and outputs the action that approximates the policy model \begin{equation*} {J_{\pi } }\left ({ \theta }\right ) = {\mathbb {E}_{\tau \sim \pi \left ({{a |s;\theta } }\right )}}\left [{ {r\left ({ \tau }\right )} }\right ],\tag{15}\end{equation*}
\begin{align*}&{\nabla _{\theta }}{J_{\pi }}\left ({ \theta }\right ) = {{\mathbb {E}}_{\tau \sim \pi \left ({ \left.{ a }\right |s;\theta }\right )}}\left [{ \sum _{t=0}^{T}{{\nabla _{\theta }}\left ({ \log \pi \left ({ \left.{ {a_{t}} }\right |{s_{t}};\theta }\right ) }\right ){A^{{\pi _{\theta }}}}\left ({ {s_{t}},{a_{t}} }\right )} }\right ]. \tag{16}\end{align*}
We can measure the advantage of taking action \begin{align*} A^{{\pi _{\theta } }}\left ({ {{s_{t}},{a_{t}}} }\right )=&{Q^{{\pi _{\theta } }}}\left ({ {{s_{t}},{a_{t}};w} }\right ) \\&{}-\sum _{a \in \mathcal {A}} {\pi \left ({ {\left.{ a }\right |{s_{t}};\theta } }\right ){Q^{{\pi _{\theta } }}}\left ({ {{s_{t}},a;w} }\right )}, \tag{17}\end{align*}
\begin{equation*} \theta \leftarrow \theta - \eta \nabla {J_{\pi } }\left ({ \theta }\right ).\tag{18}\end{equation*}
Substituting (17) into (16) yields \begin{equation*} {\nabla _{\theta }}{J_{\pi }}\left ({ \theta }\right ) = {\mathbb {E}_{\tau \sim \pi \left ({ \left.{ a }\right |s;\theta }\right )}}\left [{ \sum _{t = 0}^{T} {\nabla _{\theta }}\left ({ \log \pi \left ({ \left.{ {a_{t}} }\right |{s_{t}};\theta }\right ) }\right )\left ({ {Q^{{\pi _{\theta }}}}\left ({ {s_{t}},{a_{t}};w }\right ) - \sum _{a \in \mathcal {A}} \pi \left ({ \left.{ a }\right |{s_{t}};\theta }\right ){Q^{{\pi _{\theta }}}}\left ({ {s_{t}},a;w }\right ) }\right ) }\right ]. \tag{19}\end{equation*}
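A compact PyTorch sketch of one actor update implied by (16)-(19): the advantage is the critic's Q-value minus the policy-weighted baseline of (17). The discrete-action layout, batch shapes, and function names are assumptions.

import torch

def actor_step(policy_net, q_net, states, actions, optimizer):
    # One A2C policy-gradient step following eqs. (16)-(18).
    with torch.no_grad():
        q = q_net(states)                                # Q(s, a; w), all a
    log_probs = torch.log_softmax(policy_net(states), dim=-1)
    probs = log_probs.exp().detach()
    baseline = (probs * q).sum(dim=-1, keepdim=True)     # eq. (17) baseline
    adv = q.gather(1, actions.unsqueeze(1)) - baseline   # advantage A(s, a)
    loss = -(log_probs.gather(1, actions.unsqueeze(1)) * adv).mean()
    optimizer.zero_grad()
    loss.backward()                                      # gradient of eq. (19)
    optimizer.step()                                     # update of eq. (18)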
The critic can provide an action-value function to measure the loss of the resource management strategy network. The Q-value is estimated by a deep neural network (DNN), that is, using the parameter \begin{equation*} w \leftarrow w - \eta \nabla {J_{Q}}\left ({ w }\right ),\tag{20}\end{equation*}
\begin{equation*} {J_{Q}}\left ({ w }\right ) = \frac {1}{2}\left ({ {{r_{t}} + \gamma \sum _{a \in \mathcal {A}} {\pi \left ({ {\left.{ a }\right |{s_{t + 1}};\theta } }\right ){Q^{{\pi _{\theta } }}}\left ({ {{s_{t + 1}},a;w} }\right )} - {Q^{{\pi _{\theta } }}}\left ({ {{s_{t}},{a_{t}};w} }\right )} }\right )^{2}, \tag{21}\end{equation*}
\begin{equation*} \nabla {J_{Q}}\left ({ w }\right ) = -\left ({ {r_{t}} + \gamma \sum _{a \in \mathcal {A}} \pi \left ({ {\left.{ a }\right |{s_{t + 1}};\theta } }\right ){Q^{{\pi _{\theta } }}}\left ({ {{s_{t + 1}},a;w} }\right ) - {Q^{{\pi _{\theta } }}}\left ({ {{s_{t}},{a_{t}};w} }\right ) }\right ){\nabla _{w}}{Q^{{\pi _{\theta } }}}\left ({ {{s_{t}},{a_{t}};w} }\right ). \tag{22}\end{equation*}
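The corresponding critic update of (20)-(22) can be sketched as a one-step TD regression, where the next-state value is the expectation of Q under the current policy; shapes and names are again assumptions.

import torch

def critic_step(q_net, policy_net, batch, optimizer, gamma=0.99):
    # One critic step minimizing the TD error of eqs. (21)-(22).
    s, a, r, s_next = batch
    with torch.no_grad():
        probs_next = torch.softmax(policy_net(s_next), dim=-1)
        v_next = (probs_next * q_net(s_next)).sum(dim=-1)
        target = r + gamma * v_next                      # TD target
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = 0.5 * (target - q_sa).pow(2).mean()           # eq. (21)
    optimizer.zero_grad()
    loss.backward()                                      # gradient of eq. (22)
    optimizer.step()                                     # update of eq. (20)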
To handle the huge data volume of B5G massive IoT, this paper proposes the FedAvg-A2C method to update the parameters of the value network and the policy network. In the considered B5G massive IoT, the global A2C network is maintained by the FedAvg-A2C server, and each RL agent obtains the global model from the FedAvg-A2C server to constitute its local A2C network. In each round of the global model training process, each RL agent updates its own local A2C model with a randomly sampled mini-batch of data \begin{align*} \min J\left ({{w^{t}}}\right )=&\sum _{k = 1}^{K} {{p_{k}}} J\left ({w_{k}^{t}}\right ),\tag{23a}\\ \min J\left ({{\theta ^{t}}}\right )=&\sum _{k = 1}^{K} {{p_{k}}} J\left ({\theta _{k}^{t}}\right ),\tag{23b}\end{align*}
\begin{align*} \theta _{k}^{t}=&\underbrace {{\theta ^{t - 1}}}_{{\textrm {global}}} - \eta \cdot \underbrace {\nabla J\left ({\theta _{k}^{t - 1}}\right )}_{{\textrm {local}}},\tag{24a}\\ w_{k}^{t}=&\underbrace {{w^{t - 1}}}_{{\textrm {global}}} - \eta \cdot \underbrace {\nabla J\left ({w_{k}^{t - 1}}\right )}_{{\textrm {local}}},\tag{24b}\\ {\theta ^{t}}=&\sum _{k = 1}^{K} {{p_{k}}} \theta _{k}^{t},\tag{24c}\\ {w^{t}}=&\sum _{k = 1}^{K} {{p_{k}}} w_{k}^{t}.\tag{24d}\end{align*}
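The server-side aggregation of (24c)-(24d) is a weighted parameter average; a minimal PyTorch sketch, assuming the weights p sum to one and all local networks share the global architecture, is given below.

import torch

@torch.no_grad()
def fedavg_aggregate(global_net, local_nets, p):
    # Replace each global parameter with the p_k-weighted average of the
    # local copies, as in eqs. (24c) and (24d).
    new_state = {}
    for name, param in global_net.state_dict().items():
        new_state[name] = sum(pk * net.state_dict()[name].float()
                              for pk, net in zip(p, local_nets)).to(param.dtype)
    global_net.load_state_dict(new_state)
    return global_net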
D. Algorithm Complexity Analysis
The computational complexity of FedAvg-A2C accounts for the local model training at the A2C agents and the local model aggregation at the server. Since a single A2C network model is trained with random samples from its own local buffer, the complexity of the RL local update is
Simulation
In this section, the proposed scheme is validated through numerical simulations. First, the simulation setup is outlined, followed by a comprehensive presentation and analysis of the numerical results. The primary goal is to showcase the superiority of the proposed schemes compared to existing works. We run the simulations on a DELL server with an Intel Xeon Gold 6242R CPU running at 3.1 GHz, 64 GB of RAM, and two GPUs (NVIDIA GeForce RTX 3080 Ti) under Ubuntu 18.04 LTS, using Python 3.9.13. The FedAvg-A2C algorithm was implemented in PyTorch 2.0.0. The hyperparameters of the proposed FedAvg-A2C are shown in Table 2.
To verify the efficiency of the proposed algorithm, the following methods are simulated for performance comparison: PPO-based resource management (Comparison Algorithm 1), D3QN-based resource management (Comparison Algorithm 2), and random resource management (Comparison Algorithm 3).
A. Convergence of the Proposed Algorithm
Fig. 6 shows the convergence of the proposed algorithm under different learning rates, with the number of TDs set to 20. The horizontal and vertical axes represent the number of training iterations and the received reward, respectively. As the learning rate increases, the proposed method converges faster. Fig. 6 shows that the FedAvg-A2C model achieves a better reward when
The convergence under different discount factors is shown in Fig. 7. When
B. Performance of the Proposed Algorithm
Fig. 8 highlights the advantages of the proposed algorithm by comparing its maximum network throughput with the three comparison algorithms for different numbers of TDs. As the number of TDs increases, network resource conflicts within the communication system intensify, and all four algorithms experience an overall increase in maximum network throughput. The proposed algorithm outperforms comparison algorithms 1, 2, and 3, exhibiting significantly higher network throughput. The results in Fig. 8 validate the capability of the proposed algorithm to effectively enhance network throughput and push the upper limit of the system's capacity.
Fig. 9 compares the average network throughput of the proposed algorithm and the three comparison algorithms for varying numbers of TDs. As the number of TDs increases, all four algorithms show a notable upward trend in network throughput. The proposed algorithm clearly outperforms comparison algorithms 1, 2, and 3, highlighting its effectiveness in enhancing the average network throughput. The results in Fig. 9 validate the capability of the proposed algorithm to improve system performance.
Fig. 10 compares the maximal resource efficiency of the proposed algorithm and the three comparison algorithms for varying numbers of TDs. As the figure shows, increasing the number of TDs causes the network resource efficiency to drop. The proposed method performs much better, effectively mitigating this decline and enhancing the maximal network resource efficiency of the system.
Fig. 11 compares the average resource efficiency of the proposed algorithm and the three comparison algorithms for varying numbers of TDs. Increasing the number of TDs reduces system stability, which causes the average network resource efficiency to drop in Fig. 11. The proposed method performs much better, effectively mitigating this decline and enhancing the average network resource efficiency of the system.
Conclusion
In this paper, the conflict-free, resource-efficient network management problem in the B5G massive IoT scenario was investigated, involving densely deployed IoT devices and a resource management system. Dense deployment of IoT devices generates large-scale network conflict in B5G massive IoT systems and degrades the resource efficiency of the resource management system. A hypergraph theory-based network conflict model was proposed to quantify the conflict of the whole B5G massive IoT. Under the conflict hypergraph model constraint, this paper formulated the CO problem of maximizing network throughput and resource efficiency. Since conflict hypergraph-based resource management is an NP-hard optimization problem that is computationally intensive to solve, we formulated an MDP for the NRM system with sequential decision-making characteristics and proposed a resource-efficient CRL solution. Then, a FedAvg-A2C based resource management algorithm was proposed to handle the network conflict-free resource management problem in B5G massive IoT scenarios and to accelerate the training process. Finally, simulation results demonstrate the effectiveness of FedAvg-A2C and validate its superiority over the comparison algorithms.