Introduction
Multi-agent systems have garnered significant attention from researchers due to their widespread presence and crucial roles in various domains. In nature, they appear in ecosystems and food chains [3], [13]. In industrial applications, they are indispensable in automated manufacturing and robotics [12], [21], intelligent transportation [14], coordinated patrol [2], formation control [17], and cooperative navigation [7]. Multi-agent reinforcement learning is a crucial approach for exploring multi-agent collaboration strategies [15], [19]. However, as multi-agent systems are applied to increasingly complex tasks with growing numbers of agents, inter-agent interactions become more frequent and the environment more dynamic. Modeling at this scale inevitably leads to the curse of dimensionality, significantly increasing the difficulty of training.
Multi-agent interactions can be naturally modeled as a graph, where nodes represent agents and edges represent their interactions [10]. Through the connections and message passing between nodes, agents can share crucial knowledge, experience, and environmental information to achieve mutual learning and strategy optimization. Graph convolution is an effective way to represent agent communication and analyze the importance distribution among agents [8], [16], [18]. However, as task scale increases, the amount and complexity of information exchanged between agents also grow. Agents must extract meaningful information from massive, dynamically changing environments and determine the state dependencies between agents in order to collaborate more effectively, learn more "advanced" strategies, and build more efficient and intelligent systems.
Motivated by the above discussion, we propose a multi-agent hierarchical graph attention actor-critic reinforcement learning method (MAHGAC). The contributions of our method are as follows: 1) We model multi-agent interactions as a graph, where agents are represented as nodes and the connections between them as edges for information exchange. Graph neural networks encode each agent's observations into a fixed-dimensional node embedding vector, ensuring flexibility and scalability regardless of the number of agents. 2) To handle the complex information interaction among agents, we propose a hierarchical graph attention mechanism, which updates the node embedding vectors into an information-condensed and contextualized state representation that aggregates both "inter-agent" individual relationships and "inter-group" hierarchical relationships. Agents can thus learn the importance weights of other agents, dynamically select teammates to cooperate with (or opponents to act against), and explore more advanced strategies. 3) Finally, we conduct experiments on multiple multi-agent tasks, including fully cooperative and mixed cooperative-competitive scenarios, to validate the effectiveness, stability, and scalability of MAHGAC.
Methods
The overall structure of MAHGAC is shown in Figure 1. MAHGAC employs an actor-critic multi-agent reinforcement learning network, in which agents interact with the environment to learn strategies through trial and error. We model the multi-agent interaction as a graph G = (V, E): the entities in the environment (agents and landmarks) are abstracted as nodes n ∈ V, and edges e ∈ E connect nodes that can communicate with each other. The local observation oi of each agent i is encoded as a node embedding vector.
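For illustration, the following minimal PyTorch sketch shows how a local observation could be encoded into a fixed-dimensional node embedding. The `ObsEncoder` name, the two-layer MLP, and the embedding size are assumptions for readability, not the exact architecture used in MAHGAC.

```python
import torch
import torch.nn as nn

class ObsEncoder(nn.Module):
    """Encode a local observation o_i into a fixed-size node embedding.

    Illustrative sketch: the two-layer MLP and the embedding size are
    assumptions, not the paper's exact architecture.
    """
    def __init__(self, obs_dim: int, embed_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim) -> node embedding: (batch, embed_dim)
        return self.mlp(obs)
```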
Figure 1. The overall structure of MAHGAC. MAHGAC adopts the centralized training and decentralized execution (CTDE) paradigm. During training, with a centralized critic, agent i can obtain information from all agents; through a shared HGAT mechanism, the agent learns the importance weights of the other agents in its vicinity. During testing, each agent executes actions based only on its own observations.
A. Hierarchical Graph Attention Network (HGAT)
The hierarchical graph attention network (HGAT) updates the observations oi of the agents into an information-condensed and contextualized state representation, as shown in Figure 1 (HGAT).
Step 1. Entity Clustering
We use prior knowledge or data to classify all entities in the environment (agents, landmarks, etc.) into groups Cg. For an entirely cooperative task such as formation control (Figure 2(b)(c)), we place all agents in a single group. For cooperative navigation (Figure 2(a)), we cluster the agents into one group and the landmarks into another. For a mixed environment such as the pursuit task (Figure 2(d)), we divide the pursuers into one group, the prey into another, and the obstacles into a third.
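As a concrete illustration of this grouping step, the sketch below clusters entities by a type label supplied as prior knowledge; the `cluster_entities` helper and the type names are hypothetical, chosen only to mirror the pursuit example.

```python
from collections import defaultdict

def cluster_entities(entity_types):
    """Map each entity index to a group based on prior knowledge of its type.

    entity_types: list of type strings, one per entity,
        e.g. ["pursuer", "pursuer", "prey", "obstacle"]
    returns: dict group_name -> list of entity indices
    """
    groups = defaultdict(list)
    for idx, etype in enumerate(entity_types):
        groups[etype].append(idx)
    return dict(groups)

# Example: a pursuit task with 3 pursuers, 2 prey, and 2 obstacles.
print(cluster_entities(["pursuer"] * 3 + ["prey"] * 2 + ["obstacle"] * 2))
```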
Step 2. "Inter-agent" Attention
Initially, using the node embedding vectors, HGAT computes "inter-agent" attention within each group: agent i attends to the entities of a group, learns their importance weights, and aggregates their embeddings into a group-level node embedding vector.
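The following minimal sketch illustrates one standard way to realize such within-group attention with query/key/value projections; the class name `InterAgentAttention` and the single-head, scaled dot-product form are assumptions rather than the paper's exact parameterization.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterAgentAttention(nn.Module):
    """Attention over the node embeddings of the entities within one group."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, h_i: torch.Tensor, h_group: torch.Tensor) -> torch.Tensor:
        # h_i: (embed_dim,) embedding of agent i
        # h_group: (n_members, embed_dim) embeddings of the entities in one group
        q = self.q(h_i)                              # (embed_dim,)
        k = self.k(h_group)                          # (n_members, embed_dim)
        v = self.v(h_group)                          # (n_members, embed_dim)
        scores = k @ q / math.sqrt(q.shape[-1])      # (n_members,)
        alpha = F.softmax(scores, dim=0)             # importance weights of group members
        return alpha @ v                             # group-level embedding for agent i
```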
Step 3. "Inter-group" Attention
Then, HGAT computes the "inter-group" relationships, aggregating the group-level node embedding vectors across groups into agent i's final state representation.
By utilizing HGAT, we obtain for each agent an updated embedding feature vector that condenses both individual-level and group-level context.
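A companion sketch of the second attention level is given below. It reuses the same query/key/value form as the inter-agent level, and the final concatenation of the agent's own embedding with the cross-group context is one common way to form the updated representation; both choices are assumptions, not the paper's exact design.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterGroupAttention(nn.Module):
    """Second attention level: weigh the group-level embeddings of agent i."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, h_i: torch.Tensor, group_embeds: torch.Tensor) -> torch.Tensor:
        # h_i: (embed_dim,) original node embedding of agent i
        # group_embeds: (n_groups, embed_dim) per-group embeddings from the inter-agent level
        q = self.q(h_i)
        k, v = self.k(group_embeds), self.v(group_embeds)
        beta = F.softmax(k @ q / math.sqrt(q.shape[-1]), dim=0)  # group importance weights
        context = beta @ v                                       # condensed cross-group context
        # Combine the agent's own embedding with the cross-group context (assumed design).
        return torch.cat([h_i, context], dim=-1)
```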
B. Multi-Agent Actor-Critic
The node-embedded feature vectors produced by HGAT are fed into the actor-critic networks. The policy of each agent i is updated by ascending the following gradient: \begin{align*} \nabla_{\theta_i} J(\pi_\theta) = E_{s \sim D,\, a \sim \pi}\Big[ \nabla_{\theta_i} \log\big(\pi_{\theta_i}(a_i \mid o_i)\big) \\ \big( -\alpha \log\big(\pi_{\theta_i}(a_i \mid o_i)\big) + Q_i^\psi(o,a) \big) \Big] \tag{1}\end{align*}
Update all critics by minimizing a joint regression loss function through parameter sharing:
\begin{align*} & \mathcal{L}_Q(\psi) = \sum_{i=1}^{N} E_{(o,a,r,o') \sim D}\Big[ \big( Q_i^\psi(o,a) - y_i \big)^2 \Big] \tag{2} \\ & y_i = r_i + \gamma\, E_{a' \sim \pi_{\bar\theta}(o')}\Big[ Q_i^{\bar\psi}(o',a') - \alpha \log\big( \pi_{\bar\theta_i}(a_i' \mid o_i') \big) \Big] \tag{3}\end{align*}
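The sketch below shows how Eqs. (1)-(3) could be turned into concrete training losses. The network interfaces (`critic(obs, acts)`, `actor.sample(obs)`) and the batch layout are illustrative assumptions, not the paper's implementation.

```python
import torch

def critic_loss(critics, target_critics, target_actors, batch, alpha, gamma):
    """Joint critic regression loss of Eqs. (2)-(3), summed over agents."""
    obs, acts, rews, next_obs = batch  # per-agent lists of tensors of shape (B, ...)
    loss = 0.0
    for i, critic in enumerate(critics):
        with torch.no_grad():
            # Sample next actions and their log-probs from the target policies.
            next_acts, next_logps = zip(
                *[pi.sample(o) for pi, o in zip(target_actors, next_obs)]
            )
            target_q = target_critics[i](next_obs, list(next_acts))     # Q_i^{bar psi}(o', a')
            y_i = rews[i] + gamma * (target_q - alpha * next_logps[i])  # Eq. (3)
        loss = loss + ((critic(obs, acts) - y_i) ** 2).mean()           # Eq. (2)
    return loss

def actor_loss(actor_i, critic_i, obs, acts, i, alpha):
    """Score-function form of Eq. (1) for agent i, written as a loss to minimise."""
    a_i, logp_i = actor_i.sample(obs[i])
    acts_i = list(acts)
    acts_i[i] = a_i                       # substitute agent i's freshly sampled action
    with torch.no_grad():
        q_i = critic_i(obs, acts_i)       # Q_i^psi(o, a)
    # Gradient of this loss equals the negative of Eq. (1)'s gradient estimator.
    return (logp_i * (alpha * logp_i.detach() - q_i)).mean()
```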
Experiments
A. Experimental Settings
We evaluate the effectiveness of MAHGAC on multi-agent tasks (Figure 2) in the multi-agent particle environment (MPE), where agents move in a 2×2 square-unit 2D space. The action space of each agent is discretized, allowing agents to apply unit acceleration or deceleration in the X and Y directions.
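A hypothetical encoding of this discretized action space is sketched below; the specific action indices and the integration step are assumptions for illustration, not the exact MPE implementation.

```python
# Hypothetical mapping from discrete action ids to unit accelerations along X and Y.
DISCRETE_ACTIONS = {
    0: (0.0, 0.0),    # no-op
    1: (1.0, 0.0),    # accelerate +X
    2: (-1.0, 0.0),   # accelerate -X
    3: (0.0, 1.0),    # accelerate +Y
    4: (0.0, -1.0),   # accelerate -Y
}

def apply_action(velocity, action_id, dt=0.1):
    """Integrate one step of the chosen acceleration into the agent's velocity."""
    ax, ay = DISCRETE_ACTIONS[action_id]
    return (velocity[0] + ax * dt, velocity[1] + ay * dt)
```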
Figure 2(a) Cooperative Navigation: agents must reach different landmarks while avoiding obstacles. At each step, an agent receives a reward of -d, where d is its distance to the nearest landmark, and a penalty of -1 if it collides with another agent (a minimal reward sketch is given after the Figure 2 caption below).
Figure 2(b) Linear Formation: there are M agents and 2 landmarks; the objective is for the agents to position themselves evenly spaced along the line between the two landmarks.
Figure 2(c) Regular Polygon Formation: there are M agents and 1 landmark; the agents must arrange themselves into an M-sided regular polygon with the landmark at its center.
Figure 2(d) Confronting Pursuit: pursuers collaborate to chase two prey; the task succeeds when both prey are caught.
Figure 2. Experimental environments: (a) Cooperative Navigation. (b) Linear Formation. (c) Regular Polygon Formation. (d) Confronting Pursuit.
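For concreteness, the per-step reward of the cooperative navigation task described in Figure 2(a) can be sketched as follows; the collision radius and the NumPy interface are assumptions.

```python
import numpy as np

def cooperative_navigation_reward(agent_pos, landmark_pos, other_agent_pos,
                                  collision_radius=0.1):
    """Per-step reward for one agent: -d to the nearest landmark,
    plus a penalty of -1 for each collision with another agent."""
    dists = [np.linalg.norm(agent_pos - lm) for lm in landmark_pos]
    reward = -min(dists)
    for other in other_agent_pos:
        if np.linalg.norm(agent_pos - other) < collision_radius:
            reward -= 1.0
    return reward
```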
We conducted a series of comparative experiments to evaluate the performance of different multi-agent reinforcement learning methods across these tasks. We selected four baseline methods: MADDPG [9] without attention, G2ANet [8] with attention, DGN [6] with single-layer graph attention, and MAAC [5] with actor-attention-critic. To comprehensively evaluate performance, we use two main metrics: Success Rate (S%), the percentage of tasks completed during evaluation episodes (higher is better), and Mean Episode Length (MEL), the average length of successful episodes during evaluation (lower is better).
B. Results and Discussion
As shown in Figure 3, in both fully cooperative and mixed cooperative-competitive tasks, the mean episode reward curves of MAHGAC converge to higher levels, demonstrating superior performance compared to the other methods. Furthermore, MAHGAC outperforms the methods employing single-layer graph attention in mixed cooperative-competitive tasks. This is attributable to the increased complexity of relationships among agents in these tasks, which demands more interaction and more sophisticated information selection. MAHGAC adaptively extracts state-dependent relationships among agents, enhancing information selection and strategy learning.
Figure 3. The mean episode reward curves: (a) 3 agents in the cooperative navigation task. (b) 5 agents in the linear formation task. (c) 4 agents in the regular polygon formation task. (d) 3 pursuers cooperating to pursue 2 prey.
Table I presents the success rate (S%) and mean episode length (MEL) of each method in both fully cooperative and mixed cooperative-competitive tasks. Compared to MADDPG without attention, MAHGAC achieves significantly higher success rates across all tasks, with comparable MEL. In fully cooperative tasks, compared to DGN with single-layer graph attention, MAHGAC improves the success rate by 8.054% on average while decreasing MEL by an average of 0.4; compared to MAAC with actor-attention-critic, MAHGAC improves the success rate by 0.98% on average while decreasing MEL by an average of 0.316. In mixed cooperative-competitive tasks, MAHGAC improves the success rate by 19.942% on average over DGN, with MEL decreasing by 0.58 on average, and by 7.961% on average over MAAC, with MEL decreasing by 1.09 on average.
The average episode rewards of different methods in the cooperative navigation task with different numbers of agents.
In Table II, we compare the success rates on the cooperative navigation task with different numbers of agents to verify the scalability of MAHGAC. As the number of agents and task complexity increase, the interaction information between agents becomes more intricate, making it difficult for an agent to explore an advanced strategy and causing a rapid decrease in the success rate of the baseline methods. In contrast, from N = 3 to N = 15, MAHGAC exhibits a success-rate standard deviation of only 0.093, and the trend of the boxplots in Figure 4 further confirms its superior stability. Overall, MAHGAC maintains its performance as the number of agents increases, demonstrating robust scalability and stability, and opening new opportunities for tackling collaborative multi-agent challenges in complex real-world tasks.
Conclusion
We propose an innovative multi-agent hierarchical graph attention actor-critic reinforcement learning method, MAHGAC. The method models multi-agent interactions as a graph and encodes each agent's observation into a single node embedding vector, improving scalability. Through a hierarchical graph attention mechanism (HGAT), it models the relationships between agents at both the individual and group levels, updates the agents' observations into an information-condensed and contextualized state representation, and adaptively extracts state dependencies among agents, enabling each agent to focus on interactions with the most relevant agents and thereby learn more advanced strategies. Finally, a series of experiments shows that MAHGAC sustains its performance as the scale increases and exhibits superior stability and scalability, offering new possibilities for addressing larger-scale tasks in practice.