Introduction
Multi-agent systems have garnered significant attention from researchers due to their widespread presence and crucial roles in various domains. In nature, they appear in ecosystems and food chains [3], [13]. In industrial applications, they are indispensable in automated manufacturing and robotics [12], [21], intelligent transportation [14], coordinated patrol [2], formation control [17], and cooperative navigation [7]. Multi-agent reinforcement learning is a crucial approach for exploring multi-agent collaboration strategies [15], [19]. However, as multi-agent systems are applied to increasingly complex tasks with growing numbers of agents, inter-agent interactions become more frequent and the environment more dynamic. Modeling at this scale inevitably leads to the curse of dimensionality, significantly increasing the difficulty of training.
Multi-agent interactions can be naturally modeled as a graph, where nodes represent agents and edges represent their interactions [10]. Through the connections and message passing between nodes, agents can share crucial knowledge, experience, and environmental information to achieve mutual learning and strategy optimization. Graph convolution is an effective way to represent agent communication and analyze the importance distribution among agents [8], [16], [18]. However, as task scale increases, the amount and complexity of information exchanged between agents also grow. Agents must extract meaningful information from massive, dynamically changing environments and determine the state dependencies between agents in order to collaborate more effectively, learn more "advanced" strategies, and build more efficient and intelligent systems.
Motivated by the above discussion, we propose a multi-agent hierarchical graph attention actor-critic reinforcement learning method (MAHGAC). The contributions of our method are as follows: 1) We model multi-agent interactions as a graph, where agents are represented as nodes and the connections between them as edges for information exchange. Graph neural networks encode each agent's observations into a fixed-dimensional node embedding vector, ensuring flexibility and scalability regardless of the number of agents. 2) To handle the complex information interaction among agents, we propose a hierarchical graph attention mechanism, which updates the node embedding vectors into an information-condensed and contextualized state representation that aggregates both "inter-agent" individual relationships and "inter-group" hierarchical relationships. Agents can thus learn the importance weights of other agents, dynamically select teammates to cooperate with (or opponents to act against), and explore more advanced strategies. 3) Finally, we conduct experiments on multiple multi-agent tasks, including fully cooperative and mixed cooperative-competitive scenarios, to validate the effectiveness, stability, and scalability of MAHGAC.
Methods
The overall structure of MAHGAC is shown in Figure 1. MAHGAC employs an actor-critic multi-agent reinforcement learning network, in which agents interact with the environment to learn strategies through trial and error. We model the multi-agent interaction as a graph G = (V, E): the entities in the environment (agents and landmarks) are abstracted as nodes n ∈ V, and edges e ∈ E connect nodes that can communicate with each other. The local observation oi of each agent i is encoded as a node embedding vector.
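For illustration, the following minimal PyTorch sketch shows how a local observation could be encoded into a fixed-dimensional node embedding. The `ObsEncoder` name, the two-layer MLP, and the embedding size are assumptions for readability, not the exact architecture used in MAHGAC.

```python
import torch
import torch.nn as nn

class ObsEncoder(nn.Module):
    """Encode a local observation o_i into a fixed-size node embedding.

    Illustrative sketch: the two-layer MLP and the embedding size are
    assumptions, not the paper's exact architecture.
    """
    def __init__(self, obs_dim: int, embed_dim: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(obs_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim) -> node embedding: (batch, embed_dim)
        return self.mlp(obs)
```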
Figure 1. The overall structure of MAHGAC. MAHGAC adopts the centralized training and decentralized execution (CTDE) paradigm. During training, with a centralized critic, agent i can obtain information from all agents; through a shared HGAT mechanism, the agent learns the importance weights of the other agents in its vicinity. During testing, each agent executes actions based only on its own observations.
A. Hierarchical Graph Attention Network (HGAT)
The hierarchical graph attention network (HGAT) updates the observations oi of the agents into an information-condensed and contextualized state representation, as shown in Figure 1 (HGAT).
Step 1. Entity Clustering
We use prior knowledge or data to classify all entities in the environment (agents, landmarks, etc.) into groups Cg. For an entirely cooperative task such as formation control (Figure 2(b)(c)), we place all agents in a single group. For cooperative navigation (Figure 2(a)), we cluster the agents into one group and the landmarks into another. For a mixed environment such as the pursuit task (Figure 2(d)), we divide the pursuers into one group, the prey into another, and the obstacles into a third.
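As a concrete illustration of this grouping step, the sketch below clusters entities by a type label supplied as prior knowledge; the `cluster_entities` helper and the type names are hypothetical, chosen only to mirror the pursuit example.

```python
from collections import defaultdict

def cluster_entities(entity_types):
    """Map each entity index to a group based on prior knowledge of its type.

    entity_types: list of type strings, one per entity,
        e.g. ["pursuer", "pursuer", "prey", "obstacle"]
    returns: dict group_name -> list of entity indices
    """
    groups = defaultdict(list)
    for idx, etype in enumerate(entity_types):
        groups[etype].append(idx)
    return dict(groups)

# Example: a pursuit task with 3 pursuers, 2 prey, and 2 obstacles.
print(cluster_entities(["pursuer"] * 3 + ["prey"] * 2 + ["obstacle"] * 2))
```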
Step 2. "Inter-agent" Attention
Initially, using the node embedding vectors, HGAT computes "inter-agent" attention within each group: agent i attends to the entities of a group, learns their importance weights, and aggregates their embeddings into a group-level node embedding vector.
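The following minimal sketch illustrates one standard way to realize such within-group attention with query/key/value projections; the class name `InterAgentAttention` and the single-head, scaled dot-product form are assumptions rather than the paper's exact parameterization.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterAgentAttention(nn.Module):
    """Attention over the node embeddings of the entities within one group."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, h_i: torch.Tensor, h_group: torch.Tensor) -> torch.Tensor:
        # h_i: (embed_dim,) embedding of agent i
        # h_group: (n_members, embed_dim) embeddings of the entities in one group
        q = self.q(h_i)                              # (embed_dim,)
        k = self.k(h_group)                          # (n_members, embed_dim)
        v = self.v(h_group)                          # (n_members, embed_dim)
        scores = k @ q / math.sqrt(q.shape[-1])      # (n_members,)
        alpha = F.softmax(scores, dim=0)             # importance weights of group members
        return alpha @ v                             # group-level embedding for agent i
```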
Step 3. "Inter-group" Attention
Then, HGAT computes the "inter-group" relationships, aggregating the group-level node embedding vectors across groups into agent i's final state representation.
By utilizing HGAT, we obtain for each agent an updated embedding feature vector that condenses both individual-level and group-level context.
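A companion sketch of the second attention level is given below. It reuses the same query/key/value form as the inter-agent level, and the final concatenation of the agent's own embedding with the cross-group context is one common way to form the updated representation; both choices are assumptions, not the paper's exact design.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterGroupAttention(nn.Module):
    """Second attention level: weigh the group-level embeddings of agent i."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v = nn.Linear(embed_dim, embed_dim, bias=False)

    def forward(self, h_i: torch.Tensor, group_embeds: torch.Tensor) -> torch.Tensor:
        # h_i: (embed_dim,) original node embedding of agent i
        # group_embeds: (n_groups, embed_dim) per-group embeddings from the inter-agent level
        q = self.q(h_i)
        k, v = self.k(group_embeds), self.v(group_embeds)
        beta = F.softmax(k @ q / math.sqrt(q.shape[-1]), dim=0)  # group importance weights
        context = beta @ v                                       # condensed cross-group context
        # Combine the agent's own embedding with the cross-group context (assumed design).
        return torch.cat([h_i, context], dim=-1)
```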
B. Multi-Agent Actor-Critic
The node-embedded feature vectors produced by HGAT are fed into the actor-critic networks. The policy of each agent i is updated by ascending the following gradient: \begin{align*} \nabla_{\theta_i} J(\pi_\theta) = E_{s \sim D,\, a \sim \pi}\Big[ \nabla_{\theta_i} \log\big(\pi_{\theta_i}(a_i \mid o_i)\big) \\ \big( -\alpha \log\big(\pi_{\theta_i}(a_i \mid o_i)\big) + Q_i^\psi(o,a) \big) \Big] \tag{1}\end{align*}
Update all critics by minimizing a joint regression loss function through parameter sharing:
\begin{align*} & \mathcal{L}_Q(\psi) = \sum_{i=1}^{N} E_{(o,a,r,o') \sim D}\Big[ \big( Q_i^\psi(o,a) - y_i \big)^2 \Big] \tag{2} \\ & y_i = r_i + \gamma\, E_{a' \sim \pi_{\bar\theta}(o')}\Big[ Q_i^{\bar\psi}(o',a') - \alpha \log\big( \pi_{\bar\theta_i}(a_i' \mid o_i') \big) \Big] \tag{3}\end{align*}
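The sketch below shows how Eqs. (1)-(3) could be turned into concrete training losses. The network interfaces (`critic(obs, acts)`, `actor.sample(obs)`) and the batch layout are illustrative assumptions, not the paper's implementation.

```python
import torch

def critic_loss(critics, target_critics, target_actors, batch, alpha, gamma):
    """Joint critic regression loss of Eqs. (2)-(3), summed over agents."""
    obs, acts, rews, next_obs = batch  # per-agent lists of tensors of shape (B, ...)
    loss = 0.0
    for i, critic in enumerate(critics):
        with torch.no_grad():
            # Sample next actions and their log-probs from the target policies.
            next_acts, next_logps = zip(
                *[pi.sample(o) for pi, o in zip(target_actors, next_obs)]
            )
            target_q = target_critics[i](next_obs, list(next_acts))     # Q_i^{bar psi}(o', a')
            y_i = rews[i] + gamma * (target_q - alpha * next_logps[i])  # Eq. (3)
        loss = loss + ((critic(obs, acts) - y_i) ** 2).mean()           # Eq. (2)
    return loss

def actor_loss(actor_i, critic_i, obs, acts, i, alpha):
    """Score-function form of Eq. (1) for agent i, written as a loss to minimise."""
    a_i, logp_i = actor_i.sample(obs[i])
    acts_i = list(acts)
    acts_i[i] = a_i                       # substitute agent i's freshly sampled action
    with torch.no_grad():
        q_i = critic_i(obs, acts_i)       # Q_i^psi(o, a)
    # Gradient of this loss equals the negative of Eq. (1)'s gradient estimator.
    return (logp_i * (alpha * logp_i.detach() - q_i)).mean()
```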
Experiments
A. Experimental Settings
We evaluate the effectiveness of MAHGAC on multi-agent tasks (Figure 2) in the multi-agent particle environment (MPE), where agents move in a 2×2 square-unit 2D space. The action space of each agent is discretized, allowing agents to apply unit acceleration or deceleration in the X and Y directions.
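A hypothetical encoding of this discretized action space is sketched below; the specific action indices and the integration step are assumptions for illustration, not the exact MPE implementation.

```python
# Hypothetical mapping from discrete action ids to unit accelerations along X and Y.
DISCRETE_ACTIONS = {
    0: (0.0, 0.0),    # no-op
    1: (1.0, 0.0),    # accelerate +X
    2: (-1.0, 0.0),   # accelerate -X
    3: (0.0, 1.0),    # accelerate +Y
    4: (0.0, -1.0),   # accelerate -Y
}

def apply_action(velocity, action_id, dt=0.1):
    """Integrate one step of the chosen acceleration into the agent's velocity."""
    ax, ay = DISCRETE_ACTIONS[action_id]
    return (velocity[0] + ax * dt, velocity[1] + ay * dt)
```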
Figure 2(a) Cooperative Navigation: agents must reach different landmarks while avoiding obstacles. At each step, an agent receives a reward of -d, where d is its distance to the nearest landmark, and a penalty of -1 if it collides with another agent (a minimal reward sketch is given after the Figure 2 caption below).
Figure 2(b) Linear Formation: there are M agents and 2 landmarks; the objective is for the agents to position themselves evenly spaced along the line between the two landmarks.
Figure 2(c) Regular Polygon Formation: there are M agents and 1 landmark; the agents must arrange themselves into an M-sided regular polygon with the landmark at its center.
Figure 2(d) Confronting Pursuit: pursuers collaborate to chase two prey; the task succeeds when both prey are caught.
Figure 2. Experimental environments: (a) Cooperative Navigation. (b) Linear Formation. (c) Regular Polygon Formation. (d) Confronting Pursuit.
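For concreteness, the per-step reward of the cooperative navigation task described in Figure 2(a) can be sketched as follows; the collision radius and the NumPy interface are assumptions.

```python
import numpy as np

def cooperative_navigation_reward(agent_pos, landmark_pos, other_agent_pos,
                                  collision_radius=0.1):
    """Per-step reward for one agent: -d to the nearest landmark,
    plus a penalty of -1 for each collision with another agent."""
    dists = [np.linalg.norm(agent_pos - lm) for lm in landmark_pos]
    reward = -min(dists)
    for other in other_agent_pos:
        if np.linalg.norm(agent_pos - other) < collision_radius:
            reward -= 1.0
    return reward
```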
We conducted a series of comparative experiments to evaluate the performance of different multi-agent reinforcement learning methods across these tasks. We selected four baseline methods: MADDPG [9] without attention, G2ANet [8] with attention, DGN [6] with single-layer graph attention, and MAAC [5] with actor-attention-critic. To comprehensively evaluate performance, we use two main metrics: Success Rate (S%), the percentage of tasks completed during evaluation episodes (higher is better), and Mean Episode Length (MEL), the average length of successful episodes during evaluation (lower is better).
B. Results and Discussion
As shown in Figure 3, in both fully cooperative and mixed cooperative-competitive tasks, the mean episode reward curves of MAHGAC converge to higher levels, demonstrating superior performance compared to the other methods. Furthermore, MAHGAC outperforms the methods employing single-layer graph attention in mixed cooperative-competitive tasks. This is attributable to the increased complexity of relationships among agents in these tasks, which demands more interaction and more sophisticated information selection. MAHGAC adaptively extracts state-dependent relationships among agents, enhancing information selection and strategy learning.
Figure 3. The mean episode reward curves: (a) 3 agents in the cooperative navigation task. (b) 5 agents in the linear formation task. (c) 4 agents in the regular polygon formation task. (d) 3 pursuers cooperating to pursue 2 prey.
Table I presents the success rate (S%) and mean episode length (MEL) of each method in both fully cooperative and mixed cooperative-competitive tasks. Compared to MADDPG without attention, MAHGAC achieves significantly higher success rates across all tasks, with comparable MEL. In fully cooperative tasks, compared to DGN with single-layer graph attention, MAHGAC improves the success rate by 8.054% on average while decreasing MEL by an average of 0.4; compared to MAAC with actor-attention-critic, MAHGAC improves the success rate by 0.98% on average while decreasing MEL by an average of 0.316. In mixed cooperative-competitive tasks, MAHGAC improves the success rate by 19.942% on average over DGN, with MEL decreasing by 0.58 on average, and by 7.961% on average over MAAC, with MEL decreasing by 1.09 on average.
The average episode rewards of different methods in the cooperative navigation task with different numbers of agents.
In Table II, we compare the success rates on the cooperative navigation task with different numbers of agents to verify the scalability of MAHGAC. As the number of agents and task complexity increase, the interaction information between agents becomes more intricate, making it difficult for an agent to explore an advanced strategy and causing a rapid decrease in the success rate of the baseline methods. In contrast, from N = 3 to N = 15, MAHGAC exhibits a success-rate standard deviation of only 0.093, and the trend of the boxplots in Figure 4 further confirms its superior stability. Overall, MAHGAC maintains its performance as the number of agents increases, demonstrating robust scalability and stability, and opening new opportunities for tackling collaborative multi-agent challenges in complex real-world tasks.
Conclusion
We propose an innovative multi-agent hierarchical graph attention actor-critic reinforcement learning method, MAHGAC. The method models multi-agent interactions as a graph and encodes each agent's observation into a single node embedding vector, improving scalability. Through a hierarchical graph attention mechanism (HGAT), it models the relationships between agents at both the individual and group levels, updates the agents' observations into an information-condensed and contextualized state representation, and adaptively extracts state dependencies among agents, enabling each agent to focus on interactions with the most relevant agents and thereby learn more advanced strategies. Finally, a series of experiments shows that MAHGAC sustains its performance as the scale increases and exhibits superior stability and scalability, offering new possibilities for addressing larger-scale tasks in practice.