Introduction
The number of commercial and personal vehicles grows steadily, putting pressure on the existing transportation networks. One way to mitigate this issue, at least to a certain extent, is to increase the autonomy of the vehicles. The expectation is that autonomous vehicles inter-connected into the Internet of Vehicles will use the shared resources (transportation networks) in a smarter way [1], [2], reducing the risk of accidents and traffic congestion and increasing the safety and comfort of their passengers. The Internet of Vehicles (IoV) in a smart city is already beginning to be implemented in practice through the introduction of V2X technologies [3]. In addition to the questions of what data to collect and how to organize the communication for building the Internet of Vehicles, the tasks related to how to use this data to improve the efficiency of each autonomous vehicle operating in connection with others are becoming increasingly important [4], [5].
According to the Society of Automotive Engineers (SAE), there are 6 levels of vehicle autonomy, with level 0 being a conventional vehicle lacking any systems that might control any aspect of the vehicle's motion and level 5 corresponding to a fully autonomous vehicle capable of operating without any guidance from a human driver. In this work, we focus on vehicles of the higher degrees of autonomy.
One of the challenging scenarios for such vehicles is moving in a heterogeneous environment among numerous other moving entities such as pedestrians, autonomous vehicles, non-autonomous vehicles, etc. A characteristic example is navigation in a parking lot. This setup is challenging for the following reasons. First, conventional means of organizing traffic, such as traffic lights, road lanes, and road signs, are absent, and vehicles have to navigate safely to their goals without them. Second, the connection to the IoV (as well as peer-to-peer connections) may degrade significantly, so centralized or semi-centralized (as in [6]) coordination of vehicle movements is limited. Consequently, in this setting, vehicles have to predict the behavior of the other vehicles in the environment to avoid conflicts and deadlocks. An example of such a scenario is presented in Figure 1.
Example of cooperative behavior in a multi-agent partially observable environment. a) The agents move to their goals relying on partial observations of the grid world (vertices shaded with light gray are not visible to the agents at the current time step). b) The agents see each other, and one of them lets the other pass by stepping aside. c) After resolving the conflict, the agents continue moving towards their goals.
The described problem can be abstracted to so-called Multi-agent Pathfinding (MAPF) [7], a well-studied problem with a range of different algorithms tailored to solve it. These algorithms differ in the assumptions they make on how mobile agents may move through the workspace, how collisions are defined, etc. Still, most of these algorithms assume full knowledge of the environment, i.e. they accept as input not only the map of the environment but also the locations of all the agents and their goals. Under such assumptions, MAPF algorithms are capable of producing collision-free plans for all the agents. This approach is hard to implement in the considered scenario, in which numerous vehicles constantly enter and leave the parking lot, perfect localization and mapping are burdensome, and the connection to the Internet of Vehicles may be unstable. In the presence of such disturbances, it is reasonable for each vehicle (agent) to operate individually, relying on local observations. Approaches to such partially observable multi-agent navigation are, indeed, known, see the seminal work on ORCA [8] for example. However, they often suffer from deadlocks as they do not take the cooperative aspect of the problem into account (i.e. some agents have to stop progressing towards their goals in certain cases in order to let the other agents go). To overcome this problem, in this work we suggest utilizing machine learning, and more specifically reinforcement learning (RL) [9]. We are inspired by the successful experience of colleagues who have already used machine learning methods for IoV-related problems [10], [11].
In this paper, we propose to use reinforcement learning [9], [12] to generate the behavior of each autonomous agent in a multi-agent partially observable environment, which in our case is an abstraction of the Internet of Vehicles. Model-free reinforcement learning methods have shown excellent results in behavior generation tasks for single agents [13]–[15] and in cooperative environments [16], [17]. Modern deep reinforcement learning algorithms cope well with complex observation spaces (visual environments) [18] and stochastic environmental conditions [19]. Nevertheless, some cases in multi-agent environments, in which long-term prediction of the purposeful behavior of other agents is essential, are hardly tackled by most model-free approaches. To account for the purposeful actions of the other agents, so-called model-based approaches play an important role. They have shown impressive results in environments such as chess and Go [20], [21], where planning ahead is much needed. On the other hand, model-based approaches are very demanding on the quality of the model of the environmental dynamics and often require unlimited access to the environment simulator.
In this paper, we suggest utilizing both model-free and model-based RL in the following fashion. We decompose partially observable MAPF into two subtasks: reaching the goal and avoiding collisions with other agents and deadlocks. The first subtask is handled by a model-free RL module which, provided with the current observation of the agent, outputs a suggestion on which action to take (i.e. a probability distribution over actions). The second subtask is handled by either a model-based RL module or another model-free RL module. All modules are trained separately and then integrated into a resultant hybrid policy (either model-free + model-based or model-free + model-free) via a mixing mechanism. Such an approach leads to very promising results, as our experimental studies suggest.
The main contributions of our work can be summarized as follows:
we present a variant of the MAPF problem which is an abstraction of the challenging IoV setting in which communication and observation are limited,
we decompose the pathfinding problem into two subproblems: avoiding static obstacles and resolving inter-agent conflicts in the local neighborhood,
we suggest an original hybrid model-free + model-based/model-free RL method that handles both problems in an integrated way,
we develop an original simulator, POGEMA (Partially Observable Grid Environment for Multi-Agent scenarios), that allows generating MAPF problems of different levels of complexity for further empirical evaluation,
we show empirically that the suggested hybrid is an efficient tool for solving the considered tasks and that it outperforms pure model-free/model-based RL approaches.
The rest of the paper is organized as follows. Section II provides an overview of the related works from the MAPF domain. The problem statement is presented in section III followed by the description of the suggested method in section IV. We describe the developed simulator in section V. Empirical evaluation and results are presented in section VI. Section VII concludes.
Related Work
Multi-Agent Pathfinding (MAPF) is a challenging problem that is being actively investigated in AI and robotics communities.
The first methods for solving the problem were planning algorithms searching for the minimum-cost path in the joint action space.
Currently used classical approaches suffer less from the curse of dimensionality but are still time-consuming. Planning algorithms can be divided into two types: centralized and decentralized. Centralized ones rely on a control center that accumulates information on all agents' positions. This group of algorithms includes the conflict-based search algorithm [23] and its enhanced versions [24], [25], which guarantee optimality, and prioritized planning algorithms [26], which are capable of building near-optimal solutions under certain conditions. Decentralized planning considers the problem in a setting where access to information about other agents is limited. One of the most cited decentralized algorithms, ORCA [27], modifies agents' velocities on the fly to prevent collisions. Combinations of ORCA with centralized planning algorithms are also known [28].
Planning via Monte Carlo Tree Search (MCTS), which we utilize, is not new for MAPF. It was recently adapted to solve Numberlink puzzles [29]. Another MCTS-based approach [30] showed great scalability on the benchmark SysAdmin domain.
Many reinforcement learning approaches address MAPF by learning separate policies or Q-functions, meaning that each of them considers only a single agent. Such methods require additional techniques to get agents to interact in the desired way. In actor-critic approaches [31], the critic network is usually trained in a centralized fashion [32], [33] to enhance cooperation. Several off-policy learning algorithms [34]–[36] suggest decomposing the optimized joint Q-function into action-values of single agents or pairs of agents. Heuristics can also be helpful in collaboration tasks, e.g. [37] introduces new loss functions to prevent agents from blocking each other. The advantage of our approach is the absence of heuristic rules built into training. There are also approaches that additionally rely on exact modeling of the agents' behavior [32], [37].
Probably the most natural way of teaching agents to take other agents into account is to design an observation containing the necessary information about other agents, as was done in [37]. Our simulator POGEMA returns a composite observation that combines the locations of static obstacles, the positions of agents, and their targets in the area of visibility.
Problem Statement
Consider a set of agents operating on a 4-connected grid composed of free cells and cells blocked by static obstacles. Each agent is assigned a start cell and a goal cell, and at every time step it either waits in its current cell or moves to an adjacent free cell. A plan for an agent is a sequence of such actions that transfers it from its start to its goal.
Two plans are said to be conflict-free if the agents following them never collide. We consider two common types of collisions: vertex collisions and edge collisions. A vertex collision occurs when two agents occupy the same grid cell at the same time step (e.g. the first agent waits in the cell and the second agent moves into it). An edge collision occurs when two agents swap cells using the same edge at the same time step. In general, a few more collision types can be defined, see [38], but for the sake of clarity, we limit ourselves to vertex and edge collisions in this work. Still, the method we are presenting is agnostic to the collision definition and can support other collision types provided that the corresponding collision detection mechanism is implemented.
Unlike numerous works on multi-agent pathfinding, we assume that the global state of the environment, i.e. the global grid-map, the positions of all agents, their goals, etc., is not known. Instead, each agent possesses only a local observation at each time step. This observation includes:
the local map of size $l \times l$, i.e. the patch of the global grid-map centered at the agent's position, where $l$ is a predefined parameter,
the positions of the other agents if they fall within the local map (the positions of the agents outside it are unknown),
global information about the agent's goal, either in the form of its position $(x_{g}, y_{g})$, if it falls within the local map, or in the form of its direction (a projection onto the edge of the local map).
The problem now is to design a common individual policy $\pi(o) \rightarrow a$, i.e. a policy shared by all agents that maps the current local observation $o$ to an action $a$, such that each agent reaches its goal while the resulting plans remain conflict-free.
In contrast to the standard MAPF problem statement, we suggest that it is enough for an agent to learn a policy through interaction with the environment over a certain number of episodes. Furthermore, we allow the agent to simulate its actions with a copy of the environment, giving it access to an ideal model of transitions in the observation space without violating the conditions of partial observability.
Formally, the interaction of an agent with the environment is described as a Markov decision process (MDP) $\langle S, A, P, r, \gamma \rangle$, where $S$ is the set of states, $A$ is the set of actions, $P(s' \mid s, a)$ is the transition function, $r(s, a)$ is the reward function, and $\gamma \in (0, 1]$ is the discount factor. The agent's objective is to find a policy that maximizes the expected discounted cumulative reward.
In a partially observable Markov decision process (POMDP), it is assumed that the state of the environment is not fully known to the agent. Instead, at each time step the agent receives only a partial observation $o$ of the current state $s$ and has to act based on it (and, possibly, on the history of previous observations).
In model-based reinforcement learning, it is assumed that the agent explicitly has access to the transition function of the environment $P(s' \mid s, a)$ (or to its learned approximation) and can use it to plan its actions ahead.
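As noted above, in our setting the agent may simulate its actions with a copy of the environment. A minimal sketch of this idea, assuming a Gym-style `step` interface; the helper names are illustrative and not part of POGEMA:

```python
import copy

def snapshot(env):
    """Return an independent copy of the environment that can be rolled
    forward without affecting the real episode."""
    return copy.deepcopy(env)

def rollout_return(env, actions):
    """Evaluate a candidate action sequence against the copied model."""
    sim = snapshot(env)
    total = 0.0
    for a in actions:
        obs, reward, done, info = sim.step(a)  # Gym-style transition
        total += reward
        if done:
            break
    return total
```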
Method
To solve the stated problem we suggest decomposing partially observable MAPF into two sub-problems: progressive movement of each agent towards the goal while avoiding static obstacles, and avoiding conflicts with other agents. Accordingly, we develop two learnable policies for the agent: a policy for reaching the goal and a policy for resolving conflicts.
We chose Proximal Policy Optimization (PPO) to learn the goal-reaching policy, as it is one of the most versatile methods for solving a wide range of RL problems [14], [39], including procedurally generated navigation tasks [40]. To learn the collision-avoidance policy we tried both model-free and model-based approaches, QMIX and MCTS respectively. QMIX is one of the widely used off-policy algorithms for multi-agent learning [35]. MCTS is known to show great results in highly complex antagonistic games [41] (chess, Go, etc.), and we chose to adapt it to our challenging problem, i.e. to multi-agent cooperative pathfinding with limited observations. As our experimental results show, this choice was worthwhile.
Both modules are learned in parallel. After learning, at the inference phase, we apply a mixing mechanism to end up with a hybrid policy that combines both types of behavior (goal-reaching and collision-avoidance). The general scheme of the proposed approach is shown in Figure 2. Note that the MCTS conflict-avoidance module is depicted, but it is interchangeable with the QMIX one. Overall, the suggested architecture of the hybrid policy is flexible, and one can plug in any other RL policy for goal reaching or collision avoidance with only minor changes.
The hybrid policy approach. Three main modules are identified: Goal Reaching Module, Conflict Resolution Module, and the Hybrid Policy module. The latter receives the distributions over the actions and combines them so the agent demonstrates both types of behavior: reaching the goal and avoiding conflicts with other agents.
Next, we describe both learnable policies and the mixing mechanism in more detail.
A. Goal Reaching Module
The Proximal Policy Optimization (PPO) [42] approach has proven effective in many RL problems, especially in large-scale learning setups [39]. PPO is a variant of advantage actor-critic, which relies on specialized clipping in the objective function to penalize the new policy for getting far from the old one. PPO uses Generalized Advantage Estimation [43], a well-known approach for variance reduction.
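For reference, the clipped surrogate objective maximized by PPO [42] can be written as follows, where $r_{t}(\theta)$ is the probability ratio between the new and the old policy, $\hat{A}_{t}$ is the (GAE) advantage estimate, and $\epsilon$ is the clipping parameter:\begin{equation*} L^{CLIP}(\theta) = \hat {\mathbb {E}}_{t}\left [\min \left (r_{t}(\theta)\hat {A}_{t},\; \mathrm {clip}\left (r_{t}(\theta),\, 1-\epsilon,\, 1+\epsilon \right)\hat {A}_{t}\right)\right], \quad r_{t}(\theta) = \frac {\pi _{\theta }(a_{t} \mid s_{t})}{\pi _{\theta _{old}}(a_{t} \mid s_{t})}\end{equation*}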
The pathfinding problem in a partially observable environment is quite complex even for a single agent. The agent needs to master several skills: avoiding obstacles (which are sometimes dynamic), exploring the environment, and moving towards the target. The reward signal in such an environment is sparse; the agent receives it only at the end of the episode. If one trains an agent directly on hard scenarios, even reaching the target by accident becomes almost impossible. To overcome this problem, we suggest using curriculum learning.
Curriculum learning in RL aims to order tasks to provide curricula for training an agent, improving its performance on the target task. We can design training curricula for POGEMA using parameters of the environment (e.g. max distance to the target, obstacle density). In Figure 3, we show an example of the progressively increasing complexity of the tasks. In this work, we use human expert knowledge to provide such a curriculum plan. We denote PPO trained using that plan as cPPO.
Example of progressively increasing difficulty of the scenario. a) The target is in the agent's field of vision, and the path to it is open. b) The target is out of the agent's field of vision, but the path to it is quite simple and lies in the same direction. c) The target is out of the agent's observation and is difficult to reach. The agent must properly explore the environment to find it.
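A curriculum plan of this kind can be expressed as a short list of environment configurations of increasing difficulty. The sketch below uses the POGEMA parameters from Section V, but the concrete stage values are illustrative, not the exact ones used in our experiments:

```python
# Hand-designed curriculum: each stage uses a larger grid, a farther goal
# and a longer horizon than the previous one (values are illustrative).
CURRICULUM = [
    {"size": 8,  "dist": 4,  "horizon": 32},   # goal nearby, path is open
    {"size": 16, "dist": 8,  "horizon": 64},   # goal outside the field of view
    {"size": 32, "dist": 16, "horizon": 128},  # long-range navigation
]

def stage_config(stage_idx, density=0.3, obs_radius=5, num_agents=1):
    """Environment parameters for one curriculum stage."""
    return {"density": density, "obs_radius": obs_radius,
            "num_agents": num_agents, **CURRICULUM[stage_idx]}

# Training skeleton: the same PPO policy is trained on the stages in order.
# for i in range(len(CURRICULUM)):
#     env = make_env(**stage_config(i))   # make_env is assumed to build POGEMA
#     model.learn(total_timesteps=...)    # continue training the same weights
```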
B. Conflict Resolution Module
1) Model-Based Approach
The central mechanism of the conflict resolution module is Monte-Carlo Tree Search. Monte-Carlo Tree Search (MCTS) is a powerful technique widely used to solve decision-making problems in which an agent has to decide which action to take next, observing the current state of the world. Most often, MCTS-based methods are used in the context of adversarial two-player games, and they show great success in solving this type of problem [44], [45]. MCTS gained even greater popularity in 2016 when AlphaGo [20] showed super-human performance in Go (a very challenging game for computers). AlphaGo was a mixture of human-designed game heuristics, MCTS, and deep learning. Its enhanced successor, AlphaZero [41], did not require any human knowledge and mastered not only Go but a range of other board games such as chess and shogi. Inspired by the success of these model-based approaches, we adopted the idea of mixing planning (search) and deep learning for the considered conflict resolution problem. Next, we describe the necessary background on MCTS and its combination with deep learning, and finally propose our variant of a model-based learning algorithm to address partially observable multi-agent pathfinding.
Basic MCTS: MCTS solves the problem of choosing an action (game move) provided with the current state of the world/game (vanilla MCTS assumes a fully observable environment, where states and observations are equivalent). It does so by simulating the most promising variants of how the game may develop in the future. This process can be thought of as planning (i.e. MCTS reasons about actions and their consequences). Technically, this planning is realized as a tree-search process. A node in the tree corresponds to a game state; the outgoing edges correspond to the actions applicable in that state. At each iteration of the search the following four steps are made: 1) descending the tree to a leaf along the most promising path, 2) adding a successor, 3) simulating the game from the successor using a default policy (usually random) to get the reward (e.g. 1 if we win, 0 if we lose), 4) propagating the reward backward. The notion of promising states/actions is formalized via a range of numeric variables assigned to each node/edge. For each node $s$, we store the average reward $V_{s}$ achieved from the corresponding game state and the visit counter $N_{s}$. The node to descend to is the one maximizing the UCT value:\begin{equation*} UCT = V_{s} + c \sqrt { \ln N_{p(s)} / N_{s}} \tag{1}\end{equation*}
Here $V_{s}$ is the average reward accumulated in node $s$, $N_{s}$ is the number of visits of node $s$, $N_{p(s)}$ is the number of visits of its parent $p(s)$, and $c$ is the exploration constant that balances exploration and exploitation.
After a dedicated number of search iterations (usually defined by the allotted time budget) MCTS stops. The principal result of MCTS is the partially built game tree that contains the collected statistics (node/edge visit counters, average rewards $V_{s}$, etc.). Based on these statistics, the action to be executed in the environment is chosen, typically the one leading to the most visited child of the root.
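The four steps above can be summarized in a generic UCT sketch. It assumes a state object exposing `legal_actions()`, `step(action)` (returning the successor state), `is_terminal()` and `reward()`; these names are illustrative and do not correspond to a specific library:

```python
import math
import random

class Node:
    """A search-tree node: stores the visit counter N and the reward sum W."""
    def __init__(self, state, parent=None, action=None):
        self.state, self.parent, self.action = state, parent, action
        self.children = []
        self.untried = list(state.legal_actions())  # actions not expanded yet
        self.N, self.W = 0, 0.0

    def value(self):  # average reward V_s
        return self.W / self.N if self.N else 0.0

def uct_child(node, c=1.4):
    # Step 1: pick the child maximizing the UCT value (Eq. 1).
    return max(node.children,
               key=lambda ch: ch.value() + c * math.sqrt(math.log(node.N) / ch.N))

def rollout(state):
    # Step 3: play a random default policy until the episode ends.
    while not state.is_terminal():
        state = state.step(random.choice(state.legal_actions()))
    return state.reward()  # e.g. 1 on success, 0 otherwise

def mcts(root_state, iterations=1000, c=1.4):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        while not node.untried and node.children:          # 1) selection
            node = uct_child(node, c)
        if node.untried:                                    # 2) expansion
            action = node.untried.pop()
            child = Node(node.state.step(action), parent=node, action=action)
            node.children.append(child)
            node = child
        value = rollout(node.state)                         # 3) simulation
        while node is not None:                             # 4) backpropagation
            node.N += 1
            node.W += value
            node = node.parent
    # Execute the action leading to the most visited child of the root.
    return max(root.children, key=lambda ch: ch.N).action
```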
MCTS With Deep Learning: Applying MCTS every time an agent has to take an action in the environment (e.g. make a move in a game) might be very time-consuming, as coming up with a consistent policy requires many thousands of MCTS simulations. To mitigate this issue, one may wish to construct a computationally efficient yet powerful (i.e. able to select good actions) approximator of MCTS via deep learning. In [41] such an approximator (in the form of a deep neural network) was proposed for a large class of adversarial two-player games and got the name AlphaZero.
AlphaZero is designed to provide both the policy $\boldsymbol{\pi}$ (a distribution over action probabilities) and the expected reward $v$ given a state $s$. The network is trained by minimizing the loss \begin{equation*} \mathcal {L} = - {\boldsymbol \pi }^{{*}^{T}} \log {\boldsymbol \pi } + (v^{*} - v)^{2} \tag{2}\end{equation*}
The training examples, i.e. the target policies $\boldsymbol{\pi}^{*}$ and values $v^{*}$, are collected by running the search itself: $\boldsymbol{\pi}^{*}$ is derived from the visit counts at the root of the MCTS tree, and $v^{*}$ is the actual outcome of the played-out episode.
The last ingredient to be discussed is the adapted MCTS that is used to build the tree from which the policy sample is extracted. The major difference is that when a node is expanded and its child is added to the tree, no (random) rollout is performed to get the reward. Instead, the value $v$ predicted by the neural network is propagated backward. Moreover, the predicted action probabilities $P_{sa}$ serve as priors during the descent: the action chosen in node $s$ is the one maximizing \begin{equation*} U = Q_{sa} + C_{s}P_{sa} \cdot \frac {\sqrt {N_{s}}}{1 + N_{sa}}, \tag{3}\end{equation*} where $Q_{sa}$ is the average value of taking action $a$ in node $s$, $N_{s}$ and $N_{sa}$ are the node and edge visit counters, and $C_{s}$ is the exploration coefficient.
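During the descent, Equation (3) thus replaces the UCT rule of Equation (1): the network prior $P_{sa}$ biases the search towards actions the policy head considers likely, while the value head replaces random rollouts. A minimal selection helper (attribute names are illustrative):

```python
import math

def puct_child(node, c_s):
    """Select the child maximizing U = Q_sa + c_s * P_sa * sqrt(N_s) / (1 + N_sa),
    where each child stores Q (mean value), P (network prior) and N (visit count)."""
    return max(node.children,
               key=lambda ch: ch.Q + c_s * ch.P * math.sqrt(node.N) / (1 + ch.N))
```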
Deep MCTS for Conflict Resolution: In this work, we suggest applying the deep MCTS approach to the problem of resolving conflicts in partially observable multi-agent path planning. In our setup, we adopt an egocentric paradigm, i.e. we consider the other agents as dynamic obstacles between which the principal agent has to maneuver. Although not all multi-agent interactions can be reduced to this setting, the policy learned by following this approach scales well and can solve numerous non-trivial problems, as will be shown in Section VI.
The input to our MCTS approximator is the agent's local observation described in Section III, and its outputs are the action probabilities and the value estimate used during the search.
To train the approximator, we follow the AlphaZero-style scheme described above: the agent repeatedly runs the search in simulated conflict situations, and the resulting policy and value targets are used to update the network.
Figure 4 illustrates a subtree grown using the proposed method.
Example of a subtree built by our Deep MCTS algorithm for conflict resolution in a toy environment.
2) Model-Free Approach
As another way for the conflict resolution module to work, we considered a model-free approach based on a coordinated change in the Q-functions of different agents. This algorithm is based on deep Q-learning (DQN), a method that optimizes the Q-function by minimizing the following loss over a batch of transitions:\begin{align*} \mathcal {L}=\sum _{i=1}^{b} \left [\left (\left (r^{i}+\gamma \max _{a^{i+1}}Q(s^{i+1}, a^{i+1}\mid \theta)\right) - Q(s^{i}, a^{i} \mid \theta)\right)^{2}\right] \tag{4}\end{align*}
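A compact PyTorch sketch of the temporal-difference loss in Equation (4); following common practice (and the target parameters $\theta^{-}$ that appear in Equation (6) below), a separate target network is used for the bootstrapped term, and the tensor shapes are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """TD loss over a batch of b transitions: obs (b, ...), actions (b,),
    rewards (b,), next_obs (b, ...), dones (b,) as float flags."""
    obs, actions, rewards, next_obs, dones = batch
    q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(s, a | theta)
    with torch.no_grad():
        q_next = target_net(next_obs).max(dim=1).values         # max_a' Q(s', a')
        target = rewards + gamma * (1.0 - dones) * q_next
    return F.mse_loss(q, target)
```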
For multi-agent problems, DQN can be modified into the QMIX algorithm. QMIX uses the DQN-style loss function to update the agents' weights, but instead of using individual Q-functions for each agent, it learns a joint Q-function $Q_{tot}$ that mixes the per-agent values and is trained by minimizing \begin{align*} \mathcal {L}=&\sum _{i=1}^{b} \left [\left ({y^{i}_{tot}} - Q_{tot}(\tau ^{i}, u^{i}, s^{i}\mid \theta)\right)^{2}\right] \tag{5}\\ y_{tot}=&r+\gamma \max _{u^{i+1}}Q_{tot}(\tau ^{i+1}, u^{i+1}, s^{i+1}\mid \theta ^{-})\tag{6}\end{align*}
Here $\tau$ denotes the joint action-observation history, $u$ the joint action, $s$ the global state, $b$ the size of the batch, $\theta$ the parameters of the online network, and $\theta^{-}$ the parameters of the target network.
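The QMIX loss of Equations (5)-(6) differs from the DQN loss above mainly in that the per-agent values are combined by a mixing network conditioned on the global state before the squared error is computed. A sketch under the assumption that `mixer` is such a (monotonic) module; the tensor layout is illustrative:

```python
import torch
import torch.nn.functional as F

def qmix_loss(agent_nets, mixer, target_agent_nets, target_mixer, batch, gamma=0.99):
    """Loss of Eqs. (5)-(6); obs/actions are indexed as (batch, agent, ...)."""
    obs, actions, rewards, next_obs, state, next_state, dones = batch
    # Per-agent Q-values at the executed actions, stacked into (b, n_agents).
    q_agents = torch.stack(
        [net(obs[:, i]).gather(1, actions[:, i:i + 1]).squeeze(1)
         for i, net in enumerate(agent_nets)], dim=1)
    q_tot = mixer(q_agents, state)                       # Q_tot(tau, u, s | theta)
    with torch.no_grad():
        # Monotonic mixing lets each agent maximize its own Q to maximize Q_tot.
        q_next = torch.stack(
            [net(next_obs[:, i]).max(dim=1).values
             for i, net in enumerate(target_agent_nets)], dim=1)
        y_tot = rewards + gamma * (1.0 - dones) * target_mixer(q_next, next_state)
    return F.mse_loss(q_tot, y_tot)
```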
C. Combination
The designed conflict resolution module is focused on learning a policy capable of avoiding collisions with static obstacles and other agents. The designed goal-reaching module learns to choose actions that lead to the goal position. Indeed, to solve partially observable multi-agent pathfinding problems efficiently we need a combination of both of these ingredients. To this end, we suggest a non-trainable mixer of the policies.
The mixer is initialized with several policies $\pi_{1}, \ldots, \pi_{N}$ and combines the action distributions they produce for the current observation $o$:\begin{equation*} \pi (o) = \sum _{i=1}^{N} \pi _{i}(o)\tag{7}\end{equation*}
Note that the mixing policy does not contain a switching rule that chooses which of the policies, goal-reaching or collision-avoiding, to apply. Instead, we mix the action probabilities suggested by both policies and sample an action from the resultant distribution. This simple yet effective way to combine the reach-the-goal policy and the avoid-the-conflicts policy allowed us to notably improve the performance compared to the case when only one policy (either goal-reaching or conflict-resolving) is used (see Section VI for details). Indeed, more involved switching mechanisms can be designed; we leave this for future work.
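A minimal sketch of this mixing mechanism: the distributions of Equation (7) are summed, renormalized so that they form a valid probability distribution (our reading of "the resultant distribution"), and an action is sampled from the result:

```python
import numpy as np

def mix_and_sample(distributions, rng):
    """Combine per-policy action distributions (Eq. 7) and sample one action.
    `distributions` is a list of 1-D arrays over the 5 POGEMA actions."""
    mixed = np.sum(distributions, axis=0)
    mixed = mixed / mixed.sum()          # renormalize to a valid distribution
    return rng.choice(len(mixed), p=mixed)

# Example: the goal-reaching policy prefers moving up, the conflict module
# prefers waiting; the mixed policy trades the two behaviors off.
rng = np.random.default_rng(0)
pi_goal = np.array([0.05, 0.70, 0.05, 0.10, 0.10])
pi_conf = np.array([0.60, 0.10, 0.10, 0.10, 0.10])
action = mix_and_sample([pi_goal, pi_conf], rng)
```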
POGEMA Simulator
We designed and implemented an environment simulator that takes all the specifics of partially observable multi-agent pathfinding into account. We call this simulator POGEMA (Partially Observable Grid Environment for Multi-Agent Scenarios).
The following parameters define the environment in POGEMA:
grid size $Size \geq 2$,
obstacle density $Density \in [0, 1)$,
number of agents $Agents \geq 1$,
observation radius $1 \leq R \leq Size$: agents see $R$ cells in each direction,
the maximum number of steps in the environment before the episode ends, $Horizon \geq 1$,
the distance to the goal for each agent, $Dist$ (an optional parameter; if it is not set, the distance to the goal for each agent is generated randomly).
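The parameters above can be gathered into a single configuration object. The sketch below mirrors the listed parameters with hypothetical names; the actual POGEMA interface may differ:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GridSettings:
    """Illustrative container for the POGEMA parameters listed above."""
    size: int = 32              # grid size, Size >= 2
    density: float = 0.3        # obstacle density in [0, 1)
    num_agents: int = 2         # Agents >= 1
    obs_radius: int = 5         # R: cells observed in each direction
    horizon: int = 64           # max steps before the episode ends
    dist: Optional[int] = None  # distance to the goal; random if not set

    def __post_init__(self):
        assert self.size >= 2 and 0.0 <= self.density < 1.0
        assert self.num_agents >= 1 and 1 <= self.obs_radius <= self.size
        assert self.horizon >= 1
```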
The observation space of each agent consists of three matrices of size $(2R+1) \times (2R+1)$, described below:
Obstacle matrix: 1 encodes an obstacle, and 0 encodes its absence. If any cell of the agent's field of view lies outside the environment, then it is encoded as 1.
Agents' positions matrix: 1 encodes another agent in the cell, and 0 encodes its absence.
Agent's target matrix: if the agent's goal is inside the observation field, then the cell where it is located contains 1 and all other cells contain 0. If the target does not fall into the field of view, then it is projected onto the nearest cell of the observation field: a cell on the border of the visibility area is selected that shares a coordinate along one of the axes with the target cell, or, if there is no such cell, the nearest corner cell of the visibility area is selected.
Figure 5 shows an example of the observation for an agent (the one highlighted in red). Overall, the agent observes only a fraction of the environment and has no access to the global map or to the states of the other agents (unless they fall within the agent's visibility range).
An example of the agent’s observation for the agent is shown in red. The observation is comprised of three matrices: obstacles, other agents’ positions, agent’s goal (or its projection). If the observation spans beyond the boundaries of the grid, then the corresponding cells in the observation’s obstacles matrix are considered as obstacles (see top right part of the figure), while the other matrices are filled with zeroes in these positions.
At any time step each agent has 5 actions available: stay in place, move vertically (up or down), or move horizontally (right or left). An agent can move to any free cell that is not occupied by an obstacle or another agent. If an agent moves to the cell containing its own goal, it is removed from the map and the episode is over for it.
A. Reward
One of the natural (and general) ways to encode the reward is to assign each agent either 1 or 0 depending on whether it was able to reach its goal before the episode ended. We, however, chose to design a more informative continuous reward that gives a hint on how well the agent copes with its task. This reward is computed as follows:\begin{equation*} v^{*}=\frac {c_{best\_{}path}}{c_{current\_{}path} + c_{path\_{}goal}}\tag{8}\end{equation*}
Here $c_{best\_path}$ is the cost of the shortest path from the agent's start to its goal computed on the static map (ignoring the other agents), $c_{current\_path}$ is the cost of the path the agent has actually traversed so far, and $c_{path\_goal}$ is the cost of the shortest path from the agent's current position to the goal. The reward equals 1 while the agent follows a shortest path and decreases with every detour.
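Under this interpretation of the cost terms, the reward of Equation (8) can be computed as in the sketch below (the shortest-path costs are assumed to come from, e.g., A* runs on the static map):

```python
def shaped_reward(best_path_cost, steps_taken, remaining_path_cost):
    """Reward of Eq. (8): equals 1 while the agent stays on a shortest path
    and shrinks with every detour or wasted step."""
    return best_path_cost / (steps_taken + remaining_path_cost)

# A 10-step optimal path, followed exactly (6 steps done, 4 to go) gives 1.0;
# a 2-step detour (8 steps done, still 4 to go) lowers it to 10 / 12.
print(shaped_reward(10, 6, 4), shaped_reward(10, 8, 4))
```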
We compared the designed reward with other variants (the conventional 0-or-1 reward and another one known from the MAPF literature). This comparison was in favor of our reward; the details are presented in the next section.
Experimental Evaluation
We implemented, trained, and evaluated the performance of the suggested hybrid policies (MCTS + cPPO, QMIX + cPPO) and compared them to the standalone goal-reaching and conflict resolution policies (cPPO and MCTS, QMIX, QMIX + MCTS, respectively) on a wide range of setups. We used the PPO implementation from StableBaselines3 [47], the QMIX implementation from PyMARL [48], and our own implementation of deep MCTS. The hyperparameters of the neural networks used within cPPO, QMIX, and Deep MCTS are presented in Table 1.
A. Training Setup
Each policy (cPPO, QMIX, MCTS) was trained on a range of multi-agent pathfinding problems using our POGEMA simulator. The obstacle density was set to 0.3 and the observation radius to 5 for training. The training configurations are presented in Table 2. As one can note, we used environments of different sizes during training. The number of agents never exceeded 2. The maximum number of time steps before the training episode ended is shown in the 4th column of the table (named horizon). The 5th column, distance, shows how far the goal was located from the start, i.e. for each problem instance we generated the start location of an agent randomly and then picked a goal location randomly, subject to the condition that the cost of the individual optimal path from start to goal does not exceed the given threshold. We also report the number of environment steps for each algorithm on a particular configuration in the last column of the table.
We utilized curriculum learning for PPO (thus cPPO), starting from smaller environments and closer targets and then gradually increasing both the environment size and the distance to the goal. A similar approach was adopted for MCTS training. QMIX, however, was trained on a single configuration.
B. Test Setup and Evaluation Metrics
After training the policies on the setups described above, we tested and evaluated them (and their combinations) on a range of specifically created, previously unseen instances. The agents' visibility range and the obstacle density were the same as in training (5 and 0.3 respectively).
The first portion of the test scenarios was created randomly. These problems vary in the environment's size and in the number of agents. To quantify how much cooperation a problem instance requires, we use the following hardness indicator:\begin{equation*} hardness=c_{MAPF\_{}solution} - \sum _{i=1}^{n}{c_{path}^{(i)}}\tag{9}\end{equation*}
Examples of random configs from the id2-rnd dataset.
Here $c_{MAPF\_solution}$ is the cost of the joint solution produced by a centralized MAPF solver that has full knowledge of the instance, and $c_{path}^{(i)}$ is the cost of the individual shortest path of the $i$-th agent computed as if it were alone on the map. The larger the difference, the more the agents have to deviate from their individual optimal paths, i.e. the more cooperation the instance requires.
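With these quantities at hand, the indicator of Equation (9) is a one-liner; the sketch assumes the solver and shortest-path costs are computed externally (e.g. by a centralized MAPF solver and per-agent A* searches):

```python
def hardness(mapf_solution_cost, individual_path_costs):
    """Hardness of Eq. (9): the gap between the cost of the joint conflict-free
    solution and the sum of the agents' individual shortest paths."""
    return mapf_solution_cost - sum(individual_path_costs)

# Two agents with individual shortest paths of cost 6 and 7 whose joint
# conflict-free solution costs 16 yield a hardness of 16 - 13 = 3.
print(hardness(16, [6, 7]))
```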
Besides random problem instances, we evaluated the trained policies on challenging problems where a high degree of cooperation is expected from the agents. To this end, we created a separate dataset, called cooperative, in the following way. For each specification of the random dataset, we generated 10,000 different problems and computed the hardness indicator for each of them. We then selected the top-100 hardest instances to form the cooperative dataset. Examples are shown in Figure 7. Indeed, these problems are much harder to solve than the random ones.
Examples of cooperative configs from the id2-coop dataset.
We used the following metrics for evaluating the policies on the aforementioned datasets:
Individual success rate (ISR), which is the ratio of agents that have reached their goals before the imposed step limit. E.g. 0.8 means that 80% of the agents managed to arrive at their target locations (while 20% did not) before the episode ended.
Cooperative success rate (CSR), which is the ratio of the successfully accomplished episodes, i.e. the episodes with ISR of 1. In other words, it is the percentage of problem instances that were successfully solved in a strong sense (all agents successfully reached their goals within the allotted time limit).
Indeed, CSR is a more demanding indicator and is expected to be lower compared to ISR (which is confirmed by our experimental results).
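Both metrics are straightforward to compute from per-agent success flags; a small sketch for clarity:

```python
def isr(reached_flags):
    """Individual success rate: the share of agents that reached their goals."""
    return sum(reached_flags) / len(reached_flags)

def csr(episodes):
    """Cooperative success rate: the share of episodes where every agent
    succeeded, i.e. episodes whose ISR equals 1."""
    return sum(all(flags) for flags in episodes) / len(episodes)

# Two episodes with two agents each: one fully solved, one with a failure.
episodes = [[True, True], [True, False]]
print(isr([f for ep in episodes for f in ep]))  # 0.75
print(csr(episodes))                            # 0.5
```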
C. Results
The results for each metric are grouped by module. First, we report the performance of the Conflict Resolution Module (MCTS, QMIX, MCTS + QMIX). Second, we show the results of the Goal Reaching Module, represented by cPPO. Third, we report the results of the hybrid policies (QMIX + cPPO, MCTS + cPPO), which combine the modules. During testing, we used the stochastic versions of the algorithms, so for each configuration we ran several experiments and averaged the results.
We report the results for the ISR metric in Table 4. The hybrid policies show significantly better results than the other algorithms in all cooperative cases. The best of the single algorithms is cPPO, but compared, for example, to MCTS + cPPO, it outperforms the hybrid policy on only one of the eight configurations. Moreover, cPPO outperforms the other approaches on id5-rnd8x8-2, since this configuration does not require cooperative interaction at all.
The results for the CSR metric are presented in Table 5. Both variants of the hybrid policy significantly outperform the other approaches on the cooperative configurations, such as id1-coop.
Figure 10 shows the impact of the obstacle density, the horizon, and the number of agents on the performance of the algorithms. We used the id8-rnd configuration as the base setup, varying one parameter at a time.
Comparison of reward functions with the PPO algorithm trained on random configurations.
The performance of the algorithms for different vehicle velocities. The number of slow-moving agents indicates the number of agents that perform actions two times slower. The hybrid policy outperforms the other approaches only for a low number of slow-moving agents on the id8-rnd configuration.
Evaluation results for different parameters of the id8-rnd configuration.
Increasing the horizon parameter, as expected, allows the agents to solve more environments. Increasing the number of agents has an insignificant effect on the ISR metric but significantly reduces the CSR results. Adding a new agent, on the one hand, reduces the chance of completing the whole task, but on the other hand, it increases the chance of having an agent that starts close to its target, which improves the ISR.
We present a comparison of reward function designs in Figure 8. The proposed reward (8) compares favorably with the alternatives.
Additionally, we report the results of experiments for agents with different speeds in Figure 9. We modified the environment so that some of the agents perform actions twice as slowly: a slow-moving agent needs two steps to move to an adjacent cell. The ISR metric decreases slightly as the number of such agents increases on the random configurations. In contrast, adding slow-moving agents to the cooperative configurations does not affect the results. This emphasizes the reactive nature of RL algorithms, which allows RL agents to maintain their performance even in environments with new dynamics.
Conclusion and Discussion
In this work, we considered the challenging navigation problem arising in the context of autonomous vehicles that lack a stable connection to the Internet of Vehicles and have to rely on local observations to arrive at their goals. We suggested a novel method (a hybrid policy) based on the combination of two learnable modules: one in charge of generating target-driven behavior and one that generates cooperative behavior. These modules are based on state-of-the-art model-based and model-free reinforcement learning algorithms. They complement each other, and the hybrid policy exploiting them indeed outperforms the individual policies, as the results of our comparative empirical evaluation show.
There are several avenues for future work. One of them is going beyond the 4-connected grid setup. Indeed, 4-connected grids are the most widely used graph models in the context of multi-agent pathfinding; however, it is known that allowing diagonal or any-angle moves results in much shorter and faster trips for the agents. Thus, it will be beneficial to study partially observable multi-agent pathfinding on general graphs. A further step along this line of research is taking the kinematic constraints of the agents (vehicles, robots, etc.) into account. Classical approaches that rely on an explicit model of the agents' dynamics are computationally burdensome, while learnable ones may prove to be more efficient and to scale well to a large number of agents.