Journals & Magazines >IEEE Open Journal of Control ... >Volume: 2

Constrained Environment Optimization for Prioritized Multi-Agent Navigation

Abstract:

Traditional approaches for multi-agent navigation consider the environment as a fixed constraint, despite the obvious influence of spatial constraints on agents' performa...Show More

Topic: Intersection of Machine Learning with Control

Metadata

Abstract:

Traditional approaches for multi-agent navigation consider the environment as a fixed constraint, despite the obvious influence of spatial constraints on agents' performance. Yet hand-designing conducive environments is inefficient and potentially expensive. The goal of this article is to consider the obstacle layout of the environment as a decision variable in a system-level optimization problem. In other words, we aim to find an automated solution that optimizes the obstacle layout to improve the performance of multi-agent navigation, under a variety of realistic constraints. Towards this end, we propose novel problems of unprioritized and prioritized environment optimization, where the former considers agents unbiasedly and the latter incorporates agent priorities into optimization. We show, through formal proofs, under which conditions the environment can change to guarantee completeness (i.e., all agents reach goals), and analyze the role of agent priorities in the environment optimization. We proceed to impose constraints on the environment optimization that correspond to real-world restrictions on obstacle changes, and formulate it mathematically as a constrained stochastic optimization problem. Since the relationship between agents, environment and performance is challenging to model, we leverage reinforcement learning to develop a model-free solution and a primal-dual mechanism to handle constraints. Distinct information processing architectures are integrated for various implementation scenarios, including online/offline optimization and discrete/continuous environment. Numerical results corroborate the theory and demonstrate the validity and adaptability of our approach.

Topic: Intersection of Machine Learning with Control

Published in: IEEE Open Journal of Control Systems ( Volume: 2)

Page(s): 337 - 355

Date of Publication: 15 September 2023

Electronic ISSN: 2694-085X

DOI: 10.1109/OJCSYS.2023.3316090

Funding Agency:

Contents

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Multi-agent systems present an attractive solution to spatially distributed tasks, wherein motion planning among moving agents and obstacles is one of the central problems. To date, the primal focus in multi-agent motion planning has been on developing effective, safe, and near-optimal navigation algorithms [2], [3], [4], [5], [6], [7]. These algorithms consider the agents' environment as a fixed constraint, where structures and obstacles must be circumnavigated. In this process, mobile agents engage in negotiations with one another for right-of-way, driven by local incentives to minimize individual delays. However, spatial constraints of the environment may result in dead-locks, live-locks and prioritization conflicts, even for state-of-the-art algorithms [8]. These insights highlight the impact of the environment on multi-agent navigation.

Not all environments elicit the same kinds of agent behaviors and individual navigation algorithms are susceptible to environmental artefacts; undesirable environments can lead to irresolution in path planning [9]. To deal with such bottlenecks, spatial structures (e.g., intersections, roundabouts) and markings (e.g., lanes) are developed to facilitate path de-confliction [10] but these concepts are based on legacy mobility paradigms, which ignore inter-agent communication, cooperation, and systems-level optimization. While it is possible to deal with the circumvention of dead-locks and live-locks through hand-designed environment templates, such hand-designing process is inefficient [11].

Reconfigurable and automated environments are emerging as a new trend in next generation buildings [1], [12], [13], [14], which incorporate mechatronic devices to establish interactive relations with agents. This makes it feasible to re-configure the obstacle layout of the environment in a way that improves some objective measure of the multi-agent system, especially in structured settings where agents are expected to solve repetitive tasks (like warehouses, factories, restaurants, etc.). In tandem with that enabling technology, the goal of this article is to consider the obstacle layout of the environment as a decision variable in pursuit of the agents' incentives. We assume agents have a fixed navigation algorithm, and propose the problem of systematically optimizing the obstacle layout of the environment to improve the navigation performance of the given multi-agent system. Applications of environment optimization include warehouse logistics (e.g., finding the optimal positions of the shelves for cargo transportation [15]), search and rescue (e.g., clearing the best passages for trapped victims [16]), city planning (e.g., generating the optimal configuration of one-way routes [17]), and digital entertainment (e.g., building the best gaming scene for video games [18]). Environment optimization is independent of traditional approaches that focus on designing the agents' trajectory planner to interact with the environment, and can be applied as an add-on module to any trajectory planner in the same manner. Moreover, this allows us to facilitate multi-agent navigation even if the agents' actions cannot be directly controlled, e.g., when agents are human with inherent preferences and unknown decision-making processes, or physical robots with hardware and computation limitations. More in detail, our contributions are as follows:

We define novel problems of unprioritized and prioritized environment optimization, where the former considers agents unbiasedly and the latter accounts for agent priorities. We develop two solution variants, i.e., offline and online environment optimization, that adapt to different implementation scenarios.
We conduct the completeness analysis for multi-agent navigation with environment optimization, which identifies the conditions under which all agents are guaranteed to reach their goals. We also analyze the effects of agent priorities on environment optimization, and show how environment “resources” can be negotitated to improve performance in an ordered scheme.
We impose practical constraints on the environment optimization, and formulate it as a constrained stochastic optimization problem. We leverage reinforcement learning and a primal-dual mechanism to develop a model-free method, which allows us to integrate different information processing architectures (e.g., CNNs, GNNs) as a function of the problem setting.
We evaluate the proposed framework in both discrete and continuous environment settings. The results corroborate theoretical findings and show the proposed approach can generalize beyond training instances, adapts to various optimization objectives and constraints, and allows for decentralized implementation.

Related work: The problem of environment optimization is reminiscent of robot co-design [19], [20], [21], wherein sensing, computation, and control are jointly optimized. While this domain of robotics is robot-centric (i.e., does not consider the environment as an optimization criteria), there are a few works that, similarly to our approach, use reinforcement learning to co-optimize objectives [22], [23], [24]. A more closely related idea is exploited in [9], wherein the environment is adversarially optimized to fail the navigation tasks of state-of-the-art solutions. It conducts a worst-case analysis to shed light on directions in which more robust systems can be developed. On the contrary, we optimize the environment to facilitate multi-agent navigation tasks.

The role of the environment on motion planning has been previously explored. Specifically, [25], [26] emphasize the existence of congestion and deadlocks in undesirable environments and develop trajectory planning methods to escape from potential deadlocks. The work in [27] defines the concept of “well-formed” environments, in which the navigation tasks of all agents can be carried out successfully without collisions. In [28], Wu et al. show that the shape of the environment leads to distinct path prospects for different agents, and that this information can be shared among the agents to optimize and coordinate their mobility. Gur et al. in [29] generate adversarial environments and account for the latter information to develop resilient navigation algorithms. However, none of these works consider optimizing the environment to improve system-wide navigation performance.

Our problem is also related to the problem of environment design. The work in [30] design environments to influence a agent's decision and [31], [32] focus on boosting the interpretability of a robot's behavior for human, while both are formulated for a single-agent scenario. The works of Hauser [33], [34] consider removing obstacles from the environment to improve navigation performance of one agent. Bellusci et al. [35] extends a similar idea to the multi-agent setting, which searches for solutions by considering all possible environment configurations. However, this is limited in discrete environments and it may not be practical to remove obstacles in real-world applications. The problem of multi-agent pick-up and delivery also considers re-configuring the environment [36], [37], [38], [39]. However, these environment changes are pre-defined tasks for agents and the goal is to develop algorithms to complete these tasks, which are different problem settings. Moreover, these works consider discrete environment configurations that mismatch many practical scenarios.

SECTION II.

Problem Formulation

Let ${\mathcal {E}}$ be a 2-D environment described by a starting region ${\mathcal {S}}$, a destination region ${\mathcal {D}}$ and an obstacle region $\Delta$ without overlap, i.e., ${\mathcal {S}}\bigcap {\mathcal {D}}= {\mathcal {S}}\bigcap \Delta = {\mathcal {D}}\bigcap \Delta = \emptyset$. Consider a multi-agent system with $n$ agents ${\mathcal {A}}= \lbrace A_{i}\rbrace _{i=1}^{n}$ distributed in ${\mathcal {E}}$. The agent bodies are contained within circles of radii $\lbrace r_{i}\rbrace _{i=1}^{n}$. Agents are initialized at positions ${\mathbf {S}}= [{\mathbf {s}}_{1},\ldots,{\mathbf {s}}_{n}]$ in ${\mathcal {S}}$ and deploy a given trajectory planner $\pi _{a}$ to move towards destinations ${\mathbf {D}}= [{\mathbf {d}}_{1},\ldots,{\mathbf {d}}_{n}]$ in ${\mathcal {D}}$. Let $\boldsymbol{\rho }= [\rho _{1},\ldots,\rho _{n}]^\top$ be a set of priorities that represent the importance of agents w.r.t. the navigation tasks, i.e., the larger the priority, the more important the agent; in other words, for $\rho _{1} \geq \cdots \geq \rho _{n}$, agent $A_{1}$ has the highest priority.

Denote by $\partial {\mathcal {S}}$, $\partial {\mathcal {D}}$, $\partial \Delta$ the boundaries of the regions ${\mathcal {S}}$, ${\mathcal {D}}$, $\Delta$ and $B({\mathbf {s}},r)$ a closed disk centered at position ${\mathbf {s}}$ with a radius $r$. This allows us to represent each agent $A_{i}$ as $B({\mathbf {s}}_{i}, r_{i})$ where ${\mathbf {s}}_{i}$ is the agent position and $r_{i}$ is the agent radius. Define $d({\mathbf {s}}_{i}, {\mathbf {s}}_{j})$ as the closest distance between agents $A_{i}$, $A_{j}$, i.e., $d({\mathbf {s}}_{i}, {\mathbf {s}}_{j}) = \min \Vert {\mathbf {z}}_{i} - {\mathbf {z}}_{j}\Vert _{2}$ for any ${\mathbf {z}}_{i} \in B({\mathbf {s}}_{i}, r_{i})$ and ${\mathbf {z}}_{j} \in B({\mathbf {s}}_{j}, r_{j})$, and $d({\mathbf {s}}_{i}, \partial {\mathcal {S}})$ the closest distance between agent $A_{i}$ and the starting region boundary $\partial {\mathcal {S}}$, i.e., $d({\mathbf {s}}_{i}, \partial {\mathcal {S}}) = \min \Vert {\mathbf {z}}_{i} - {\mathbf {z}}_{\mathcal {S}}\Vert _{2}$ for any ${\mathbf {z}}_{i} \in B({\mathbf {s}}_{i}, r_{i})$ and ${\mathbf {z}}_{\mathcal {S}}\in \partial {\mathcal {S}}$. Analogous definitions apply for $\partial {\mathcal {D}}$ and $\partial \Delta$. Let ${\mathbf {p}}(t): [0,\infty) \to \mathbb {R}^{2}$ be the trajectory representing the central position movement of an agent. The trajectory ${\mathbf {p}}_{i}$ of agent $A_{i}$ is collision-free w.r.t. the obstacle region $\Delta$ if $d({\mathbf {p}}_{i}(t), \partial \Delta) \geq 0$ for all $t\geq 0$. The trajectories ${\mathbf {p}}_{i}$, ${\mathbf {p}}_{j}$ of two agents $A_{i}$, $A_{j}$ are collision-free with each other if $d({\mathbf {p}}_{i}(t), {\mathbf {p}}_{j}(t)) \geq 0$ for all $t \geq 0$ – see Fig. 1 for demonstration. A valid solution for the trajectory planner $\pi _{a}$ is defined as follows.

$Figure 1. - Example environment with 4 starting positions $\lbrace {\mathbf {s}}_{i}\rbrace _{i=1}^{4}$ and 4 destinations $\lbrace {\mathbf {d}}_{i}\rbrace _{i=1}^{4}$. There are $C_{{\mathcal {S}}}=3$ self-connected starting components ${\mathcal {S}}_{1}, {\mathcal {S}}_{2}, {\mathcal {S}}_{3}$ in the starting region ${\mathcal {S}}= \bigcup _{l=1}^{3} {\mathcal {S}}_{l}$ (blue), $C_{{\mathcal {D}}}=2$ self-connected destination components ${\mathcal {D}}_{1}, {\mathcal {D}}_{2}$ in the destination region ${\mathcal {D}}= \bigcup _{k=1}^{2} {\mathcal {D}}_{k}$ (green), and the obstacle region $\Delta$ (grey). Agent $A_{1}$ is initialized at ${\mathbf {s}}_{1}$ in the starting component ${\mathcal {S}}_{A_{1}} = {\mathcal {S}}_{2}$ and moves towards its goal ${\mathbf {d}}_{1}$ in the destination component ${\mathcal {D}}_{A_{1}} = {\mathcal {D}}_{1}$. Analogous notation applies for the other agents $A_{2}$, $A_{3}$ and $A_{4}$. (a) Initial environment before optimization. It demonstrates notations used in this work for theoretical analysis. (b) Environment after prioritized optimization. It optimizes the obstacle region from $\Delta$ to $\Delta ^*$, which guarantees the existence of a valid solution and improves the solution performance based on agent priorities.$

Figure 1.

Example environment with 4 starting positions $\lbrace {\mathbf {s}}_{i}\rbrace _{i=1}^{4}$ and 4 destinations $\lbrace {\mathbf {d}}_{i}\rbrace _{i=1}^{4}$. There are $C_{{\mathcal {S}}}=3$ self-connected starting components ${\mathcal {S}}_{1}, {\mathcal {S}}_{2}, {\mathcal {S}}_{3}$ in the starting region ${\mathcal {S}}= \bigcup _{l=1}^{3} {\mathcal {S}}_{l}$ (blue), $C_{{\mathcal {D}}}=2$ self-connected destination components ${\mathcal {D}}_{1}, {\mathcal {D}}_{2}$ in the destination region ${\mathcal {D}}= \bigcup _{k=1}^{2} {\mathcal {D}}_{k}$ (green), and the obstacle region $\Delta$ (grey). Agent $A_{1}$ is initialized at ${\mathbf {s}}_{1}$ in the starting component ${\mathcal {S}}_{A_{1}} = {\mathcal {S}}_{2}$ and moves towards its goal ${\mathbf {d}}_{1}$ in the destination component ${\mathcal {D}}_{A_{1}} = {\mathcal {D}}_{1}$. Analogous notation applies for the other agents $A_{2}$, $A_{3}$ and $A_{4}$. (a) Initial environment before optimization. It demonstrates notations used in this work for theoretical analysis. (b) Environment after prioritized optimization. It optimizes the obstacle region from $\Delta$ to $\Delta ^*$, which guarantees the existence of a valid solution and improves the solution performance based on agent priorities.

Show All

Definition 1:

A valid solution of the trajectory planner $\pi _{a}$ is a set of trajectories $\lbrace {\mathbf {p}}_{i}(t)\rbrace _{i=1}^{n}$ that satisfy \begin{align*} &{\mathbf {p}}_{i}(0)={\mathbf {s}}_{i},\,{\mathbf {p}}_{i}(T) = {\mathbf {d}}_{i}\,\text{ for} \,i=1,\ldots,n, \tag{1} \end{align*} View Sourceand \begin{align*} d\left({\mathbf {p}}_{i}(t), {\mathbf {p}}_{j}(t)\right) \geq 0,\,d\left({\mathbf {p}}_{i}(t), \partial \Delta \right) \geq 0 \tag{2} \end{align*} View Sourcefor any $i,j = 1,\ldots,n$ and $t \in [0,T]$, where $T$ is the maximal operation time for agents.

Definition 1 satisfies the navigation convergence with condition (1) and the collision avoidance with condition (2).

The performance of multi-agent navigation solutions depends not only on the agents' trajectory planner but also on their surrounding environment. A “well-formed” environment with an appropriate obstacle region yields good performance for a simple planner, while a “poorly-formed” environment may result in poor performance for an advanced planner. This insight implies an important role played by the environment in multi-agent navigation, and motivates the definition of the problem of environment optimization. The overarching goal is to optimize the obstacle region $\Delta$ in the environment to improve the performance of multi-agent navigation. The principle of environment optimization is that the obstacle region $\Delta$ can be controlled (i.e., changed) while its area $|\Delta |$ remains the same, i.e., $|\Delta | = |\widehat{\Delta }|$ where $\widehat{\Delta }$ is the changed obstacle region. That is, we do not remove or add any obstacle from our environment. We consider two variants: (1) unprioritized environment optimization and (2) prioritized environment optimization.

Problem 1 (Unprioritized Environment Optimization):

Given an environment with an initial obstacle region $\Delta$ and a multi-agent system of $n$ agents $\lbrace A_{i}\rbrace _{i=1}^{n}$ that are initialized at positions ${\mathbf {S}}$ and targeted towards destinations ${\mathbf {D}}$ with the trajectory planner $\pi _{a}$, optimize the obstacle region from $\Delta$ to $\Delta ^*$ such that in the optimized environment, (i) there exists a valid solution $\lbrace {\mathbf {p}}_{i}(t)\rbrace _{i=1}^{n}$ for $\pi _{a}$ [Def. 1] and (ii) the solution performance is maximized.

Problem 2 (Prioritized Environment Optimization):

Given an environment with an initial obstacle regions $\Delta$, a multi-agent system of $n$ agents $\lbrace A_{i}\rbrace _{i=1}^{n}$ that are initialized at positions ${\mathbf {S}}$ and targeted towards destinations ${\mathbf {D}}$ with the trajectory planner $\pi _{a}$, agent priorities $\boldsymbol{\rho }$ that are ordered by the index $\rho _{1}\geq \rho _{2} \geq \cdots \geq \rho _{n}$, and a metric of interest $M(\cdot) : {\mathbf {p}}(t) \to \mathbb {R}$ that measures the navigation performance of agent trajectories,¹ optimize the obstacle region from $\Delta$ to $\Delta ^*$ such that in the optimized environment, (i) there exists a valid solution $\lbrace {\mathbf {p}}_{i}(t)\rbrace _{i=1}^{n}$ for $\pi _{a}$ [Def. 1] and (ii) the solution performance is improved according to agent priorities $\boldsymbol{\rho }$, i.e., the metric values of agents' trajectories are ordered by agent priorities $\boldsymbol{\rho }$ as \begin{align*} M({\mathbf {p}}_{1}) \leq M({\mathbf {p}}_{2}) \leq \cdots \leq M({\mathbf {p}}_{n}). \tag{3} \end{align*} View Source

Problem 1 considers agents with equal priority and optimizes the environment to improve their performance unbiasedly. Problem 2 accounts for agent priorities, which not only guarantees the existence of a valid solution in the optimized environment but also orders the solution performance based on agent priorities $\boldsymbol{\rho }$ [cf. (3)]. It reduces to Problem 1 when agent priorities are equal to each other, i.e., $\rho _{1} = \cdots = \rho _{n}$. Problems 1 and 2 focus on optimizing the obstacle region of the environment to provide solution guarantees and to improve the solution performance. An underlying trajectory planner is assumed to exist, yet we do not prescribe how the trajectories should be computed – see Assumption 2 in Section III.

In Section III, we analyze the completeness for Problem 1 to show the effectiveness of environment optimization, i.e., all agents are able to carry out navigation tasks without collision with environment optimization. In Section IV, we conduct the completeness analysis for Problem 2 and characterize the effects of agent priorities on environment optimization. In Section V, we impose external constraints on the environment optimization corresponding to real-world restrictions on obstacle changes and formulate the problem mathematically as a constrained stochastic optimization problem. We transform the latter into the dual domain to tackle constraints and combine reinforcement learning with a primal-dual mechanism for solutions. Lastly, we evaluate the proposed approach numerically and corroborate theoretical findings in Section VI.

Remark 1:

Problem 2 optimizes the obstacle region of the environment to guarantee that there exists a valid solution and the solution performance is ordered based on agent priorities [cf. (3)]. We remark that it is possible that an agent $A_{j}$ finds a different trajectory $\widehat{{\mathbf {p}}}_{j} \ne {\mathbf {p}}_{j}$, which does not collide with the other trajectories $\lbrace {\mathbf {p}}_{i}\rbrace _{i \ne j}$ and satisfies $M(\widehat{{\mathbf {p}}}_{j}) < M({\mathbf {p}}_{j})$. However, the latter does not compromise the prioritized environment optimization proposed in Problem 2.

SECTION III.

Completeness of Unprioritized System

In this section, we provide the completeness analysis of multi-agent navigation for unprioritized environment optimization [Problem 1]. Specifically, the multi-agent system may fail to find collision-free trajectories in an environment with an unsatisfactory obstacle region. Environment optimization overcomes this issue by modifying the obstacle region to guarantee the navigation success for all agents. In the following, we consider both offline environment optimization and online environment optimization.

A. Offline Environment Optimization

Offline environment optimization optimizes the obstacle region $\Delta$ based on the starting region ${\mathcal {S}}$ and the destination region ${\mathcal {D}}$, and completes the optimization before the agents start moving towards destinations. The optimized environment remains static during agent movement. The goal is to find an optimal obstacle region $\Delta ^*$ that maximizes the navigation performance of the multi-agent system. Before proceeding, we need the following assumptions for completeness analysis.

Assumption 1:

The initial positions $\lbrace {\mathbf {s}}_{i}\rbrace _{i=1}^{n}$ in ${\mathcal {S}}$ and the destinations $\lbrace {\mathbf {d}}_{i}\rbrace _{i=1}^{n}$ in ${\mathcal {D}}$ are distributed in a way such that the distance between either two agents or the agent and the region boundary is larger than the maximal agent size, i.e., $d({\mathbf {s}}_{i},{\mathbf {s}}_{j}) \geq 2 \hat{r}$, $d({\mathbf {s}}_{i},\partial {\mathcal {S}}) \geq 2 \hat{r}$ and $d({\mathbf {d}}_{i},{\mathbf {d}}_{j}) \geq 2 \hat{r}$, $d({\mathbf {d}}_{i},\partial {\mathcal {D}}) \geq 2 \hat{r}$ for $i,j\!=\!1,{\ldots },n$ with $\hat{r} \!=\! \max _{i=1,{\ldots },n} r_{i}$.

Assumption 2:

Using the trajectory planner $\pi _{a}$, each agent can compute its own optimal trajectory, leading it from start to goal, in the presence of the obstacle region $\Delta$, and can pass that trajectory information on to other agents.

Assumption 1 indicates that the starting positions $\lbrace {\mathbf {s}}_{i}\rbrace _{i=1}^{n}$ (or goal positions $\lbrace {\mathbf {d}}_{i}\rbrace _{i=1}^{n}$) are not too close to each other, which is commonly satisfied in real-world navigation tasks. Assumption 2 allows each agent to find its own trajectory, if the latter exists, and disseminate the trajectory information among agents. For decentralized systems that require completeness, this is a necessary assumption (e.g., see [11], [27]). Information passing can be achieved through a variety of mechanisms, e.g., token-based [27], or communication protocol based [11]. With these preliminaries, we show the completeness of the offline environment optimization.

Theorem 1:

Consider the multi-agent system in the environment ${\mathcal {E}}$ with starting, destination and obstacle regions ${\mathcal {S}}$, ${\mathcal {D}}$ and $\Delta$ satisfying Assumptions 1 and 2. Let $d_{\max }$ be the maximal distance between ${\mathcal {S}}$ and ${\mathcal {D}}$, i.e., $d_{\max } = \max _{{\mathbf {z}}_{\mathcal {S}},{\mathbf {z}}_{\mathcal {D}}} \Vert {\mathbf {z}}_{\mathcal {S}}- {\mathbf {z}}_{\mathcal {D}}\Vert _{2}$ for any ${\mathbf {z}}_{\mathcal {S}}\in {\mathcal {S}},{\mathbf {z}}_{\mathcal {D}}\in {\mathcal {D}}$. Then, if ${\mathcal {E}}$ satisfies \begin{align*} |{\mathcal {E}}\setminus (\Delta \cup {\mathcal {S}}\cup {\mathcal {D}})| \geq 2 \text{nd}_{\max } \hat{r} \tag{4} \end{align*} View Sourcewhere $\hat{r} = \max _{i=1,\ldots,n} r_{i}$ is the maximal radius of $n$ agents and $|\cdot |$ represents the region area, the offline environment optimization guarantees that the navigation tasks of all agents can be carried out successfully without collision.

Proof:

See Appendix A.$\blacksquare$

Theorem 1 states that the offline environment optimization can guarantee the success of all navigation tasks, if the area of the environment except the starting, destination and obstacle regions is larger than $2 n d_{\max } \hat{r}$. This is a mild condition because it does not require any initial “well-formed” environment but only an obstacle-free area of minimal size [cf. (4)], which is common in real-world scenarios. The offline environment optimization depends on the starting region ${\mathcal {S}}$ and the destination region ${\mathcal {D}}$, and completes optimizing the obstacle region $\Delta$ before the agents start to move. This requires a computational overhead before each new navigation task. Moreover, we are interested in generalizing the problem s.t. ${\mathcal {S}}$ and ${\mathcal {D}}$ are allowed to be time-varying during deployments.

B. Online Environment Optimization

Online environment optimization changes the obstacle region $\Delta$ based on instantaneous states of the agents, e.g., positions, velocities and accelerations, after the agents start moving towards destinations. In other words, the environment is being optimized concurrently with agent movement. The goal is to find the optimal obstacle policy $\pi _{o}$ that changes the obstacle region to maximize the navigation performance of the multi-agent system.

Specifically, define the starting region as the union of the starting positions ${\mathcal {S}}= \bigcup _{i=1,{\ldots },n}B({\mathbf {s}}_{i}, r_{i})$ and the destination region as that of the destinations ${\mathcal {D}}= \bigcup _{i=1,{\ldots },n}B({\mathbf {d}}_{i}, r_{i})$ s.t. ${\mathcal {S}}\bigcap {\mathcal {D}}= \emptyset$. Since the obstacle region $\Delta$ now changes continuously, we define the capacity of the online environment optimization as the maximal changing rate of the obstacle area $\dot{\Delta }$, i.e., the maximal obstacle area that can be changed per time step. To proceed, we require an assumption for the environment with an empty obstacle region $\Delta = \emptyset$.

Assumption 3:

For an environment ${\mathcal {E}}$ with no obstacle region $\Delta = \emptyset$, all navigation tasks can be carried out successfully without collision and the corresponding agent trajectories $\lbrace {\mathbf {p}}_{i}\rbrace _{i=1}^{n}$ are non-overlap.

Assumption 3 is mild because an environment with no obstacle region is the best scenario for the multi-agent navigation. For an environment with a non-empty obstacle region $\Delta$, the following theorem shows the completeness of the online environment optimization.

Theorem 2:

Consider the multi-agent system of $n$ agents $\lbrace A_{i}\rbrace _{i=1}^{n}$ satisfying Assumption 3 with $n$ collision-free trajectories $\lbrace {\mathbf {p}}_{i}(t)\rbrace _{i=1}^{n}$ for the environment with an empty obstacle region. Let $\lbrace {\mathbf {v}}_{i}(t)\rbrace _{i=1}^{n}$ be the velocities along $\lbrace {\mathbf {p}}_{i}(t)\rbrace _{i=1}^{n}$, respectively. For the environment ${\mathcal {E}}$ with a non-empty obstacle region $\Delta \subset {\mathcal {E}}\setminus ({\mathcal {S}}\cup {\mathcal {D}})$ and ${\mathcal {E}}\setminus (\Delta \cup {\mathcal {S}}\cup {\mathcal {D}}) \ne \emptyset$, if the capacity of the online environment optimization satisfies \begin{align*} \dot{\Delta } \geq 2 n \hat{r} \Vert \hat{{\mathbf {v}}} \Vert _{2} \tag{5} \end{align*} View Sourcewhere $\Vert \hat{{\mathbf {v}}}\Vert _{2} = \max _{t \in [0,T]}\max _{i=1,{\ldots },n}\Vert {\mathbf {v}}_{i}(t)\Vert _{2}$ is the maximal norm of the velocities and $\hat{r} = \max _{i=1,\ldots,n} r_{i}$ is the maximal agent radius, the navigation tasks of all agents can be carried out successfully without collision.

Proof:

See Appendix B.$\blacksquare$

Theorem 2 states that the online environment optimization can guarantee the success of all navigation tasks as well as its offline counterpart, if the changing rate of the obstacle region is larger than $2 n \hat{r} \Vert \hat{{\mathbf {v}}} \Vert _{2}$. The result is established under a mild condition on the obstacle changing rate [cf. (5)] rather than the initial obstacle-free area [cf. (4)]. This is due to the fact that the online environment optimization changes the obstacle region concurrently with agent movement. Hence, it improves navigation performance only if timely actions can be taken based on instantaneous system states. The completeness analysis in Theorems 1 and 2 demonstrates theoretically the effectiveness of the proposed environment optimization, in improving the performance of the multi-agent navigation.

SECTION IV.

Completeness of Prioritized System

Unprioritized environment optimization guarantees the success of navigation tasks in scenarios with sufficient “resources”, i.e., a sufficiently large obstacle-free area [Thm. 1] and a sufficiently large obstacle changing rate [Thm. 2]. However, this may not be the case for scenarios with reduced resources. In the latter circumstances, the environment needs to be optimized with respect to the conflicts of interest among different agents, and allocates resources according to priorities in order to negotiate these conflicts.

With the formulation of prioritized environment optimization [Problem 2], we overcome this issue by incorporating agent priorities into environment optimization. That is, we put more emphasis on agents with higher priorities to guide the negotiation. In the sequel, we show that prioritized environment optimization is capable of guaranteeing the success of all navigation tasks with reduced resources by sacrificing the navigation performance of agents with lower priorities. Analogous to Section III, we analyze the completeness of offline and online prioritized environment optimization, respectively, which characterizes the explicit effects of agent priorities on the navigation performance.

A. Offline Prioritized Environment Optimization

Offline environment optimization optimizes the obstacle region $\Delta$ before navigation and guarantees the success of all navigation tasks if $|{\mathcal {E}}\setminus (\Delta \cup {\mathcal {S}}\cup {\mathcal {D}})| \geq 2 \text{nd}_{\max } \hat{r}$ [Thm. 1]. We consider the reduced-resource scenario, where the initial obstacle-free area is smaller than $2 \text{nd}_{\max } \hat{r}$, i.e., $|{\mathcal {E}}\setminus (\Delta \cup {\mathcal {S}}\cup {\mathcal {D}})| < 2 \text{nd}_{\max } \hat{r}$. In this circumstance, offline prioritized environment optimization is able to maintain the success of all navigation tasks, at the cost of lower-priority agent performance.

We introduce additional notation as follows. Define the starting region ${\mathcal {S}}$ as disconnected if there exist two points in ${\mathcal {S}}$ that cannot be connected by a path inside ${\mathcal {S}}$, and self-connected otherwise. For a disconnected ${\mathcal {S}}$, let $\lbrace {\mathcal {S}}_{l}\rbrace _{l=1}^{C_{\mathcal {S}}}$ be the least number of self-connected components in ${\mathcal {S}}$ s.t. their union covers ${\mathcal {S}}$, i.e., $\bigcup _{l=1}^{C_{\mathcal {S}}} {\mathcal {S}}_{l} = {\mathcal {S}}$.² Assume that agent $A_{i}$ is initialized in one of these component ${\mathcal {S}}_{A_{i}}\in \lbrace {\mathcal {S}}_{l}\rbrace _{l=1}^{C_{\mathcal {S}}}$ for $i=1,\ldots,n$. Analogous definitions apply for the destination region ${\mathcal {D}}$. Let $d_{\max }({\mathcal {S}}_{l_{1}}, {\mathcal {S}}_{l_{2}})$ be the maximal distance between the components ${\mathcal {S}}_{l_{1}}$ and ${\mathcal {S}}_{l_{2}}$ for $l_{1} \ne l_{2}\in \lbrace 1,\ldots,C_{\mathcal {S}}\rbrace$, i.e., $d_{\max }({\mathcal {S}}_{l_{1}}, {\mathcal {S}}_{l_{2}}) = \max \Vert {\mathbf {z}}_{{\mathcal {S}}_{l_{1}}} - {\mathbf {z}}_{{\mathcal {S}}_{l_{2}}}\Vert _{2}$ for any ${\mathbf {z}}_{{\mathcal {S}}_{l_{1}}} \in {\mathcal {S}}_{l_{1}}$ and ${\mathbf {z}}_{{\mathcal {S}}_{l_{2}}} \in {\mathcal {S}}_{l_{2}}$. Denote by ${\mathbf {D}}_{\mathcal {S}}\in \mathbb {R}^{C_{\mathcal {S}}\times C_{\mathcal {S}}}$ the matrix that collects the distances between the components $\lbrace {\mathcal {S}}_{l}\rbrace _{l=1}^{C_{\mathcal {S}}}$ with the $(l_{1},l_{2})$th entry $[{\mathbf {D}}_{{\mathcal {S}}}]_{l_{1} l_{2}} = d_{\max }({\mathcal {S}}_{l_{1}}, {\mathcal {S}}_{l_{2}})$ for $l_{1},l_{2}=1,\ldots,C_{\mathcal {S}}$ and $[{\mathbf {D}}_{{\mathcal {S}}}]_{ll} = 0$ for $l=1,\ldots,C_{\mathcal {S}}$, ${\mathbf {D}}_{\mathcal {D}}\in \mathbb {R}^{C_{\mathcal {D}}\times C_{\mathcal {D}}}$ the matrix that collects the distances between the components $\lbrace {\mathcal {D}}_{k}\rbrace _{k=1}^{C_{\mathcal {D}}}$, and ${\mathbf {D}}_{{\mathcal {S}},{\mathcal {D}}} \in \mathbb {R}^{C_{\mathcal {S}}\times C_{\mathcal {D}}}$ the matrix that collects the distances between $\lbrace {\mathcal {S}}_{l}\rbrace _{l=1}^{C_{\mathcal {S}}}$ and $\lbrace {\mathcal {D}}_{k}\rbrace _{k=1}^{C_{\mathcal {D}}}$ – see Fig. 1 for demonstration. For completeness analysis, we assume the following.

Assumption 4:

The distance between the starting components $\lbrace {\mathcal {S}}_{l}\rbrace _{l=1}^{C_{\mathcal {S}}}$ (or destination components $\lbrace {\mathcal {D}}_{k}\rbrace _{k=1}^{C_{\mathcal {D}}}$) is significantly smaller than the distance between the starting components $\lbrace {\mathcal {S}}_{l}\rbrace _{l=1}^{C_{\mathcal {S}}}$ and the destination components $\lbrace {\mathcal {D}}_{k}\rbrace _{k=1}^{C_{\mathcal {D}}}$, i.e., for any entry $d_{{\mathcal {S}}}$ of ${\mathbf {D}}_{{\mathcal {S}}}$, entry $d_{{\mathcal {D}}}$ of ${\mathbf {D}}_{{\mathcal {D}}}$ and entry $d_{{\mathcal {S}},{\mathcal {D}}}$ of ${\mathbf {D}}_{{\mathcal {S}},{\mathcal {D}}}$, it holds that $d_{{\mathcal {S}}} + d_{{\mathcal {D}}} \leq d_{{\mathcal {S}},{\mathcal {D}}}$.

Assumption 4 is reasonable because the starting positions are typically close to each other but far away from the destinations in real-world applications. With these in place, we characterize the completeness with the offline prioritized environment optimization in the following theorem.

Theorem 3:

Consider the multi-agent system of $n$ agents $\lbrace A_{i}\rbrace _{i=1}^{n}$ in the environment ${\mathcal {E}}$ with the same settings as Theorem 1 and satisfying Assumption 4. Let $\boldsymbol{\rho }$ be the agent priorities ordered by the index $\rho _{1}\geq \rho _{2} \geq \cdots \geq \rho _{n}$, agent $A_{i}$ be in the starting component ${\mathcal {S}}_{A_{i}} \in \lbrace {\mathcal {S}}_{l}\rbrace _{l=1}^{C_{\mathcal {S}}}$ and assigned to the destination component ${\mathcal {D}}_{A_{i}} \in \lbrace {\mathcal {D}}_{k}\rbrace _{k=1}^{C_{\mathcal {D}}}$ for $i\!=\!1,{\ldots },n$. Then, if ${\mathcal {E}}$ satisfies \begin{align*} |{\mathcal {E}}\!\setminus \! (\Delta \!\cup \! {\mathcal {S}}\!\cup \! {\mathcal {D}})| \!\geq \! 2 d_{A_{1}} \hat{r} \!+\! \sum _{i=2}^{n} \!2 (d_{A_{i},{\mathcal {S}}}\!+\!d_{A_{i},{\mathcal {D}}})\hat{r} \tag{6} \end{align*} View Sourcewhere $d_{A_{1}}$ is the distance between ${\mathcal {S}}_{A_{1}}$ and ${\mathcal {D}}_{A_{1}}$, i.e., $d_{A_{1}} = [{\mathbf {D}}_{{\mathcal {S}}, {\mathcal {D}}}]_{A_{1} A_{1}}$, $d_{A_{i},{\mathcal {S}}}$ the minimal distance between ${\mathcal {S}}_{A_{i}}$ and $\lbrace {\mathcal {S}}_{A_{j}}\rbrace _{j=1}^{i-1}$, i.e., $d_{A_{i},{\mathcal {S}}} = \min \lbrace [{\mathbf {D}}_{\mathcal {S}}]_{A_{i} A_{1}}, \ldots, [{\mathbf {D}}_{\mathcal {S}}]_{A_{i} A_{i-1}}\rbrace$, $d_{A_{i},{\mathcal {D}}}$ the minimal distance between ${\mathcal {D}}_{A_{i}}$ and $\lbrace {\mathcal {D}}_{A_{j}}\rbrace _{j=1}^{i-1}$, i.e., $d_{A_{i},{\mathcal {D}}} = \min \lbrace [{\mathbf {D}}_{\mathcal {D}}]_{A_{i} A_{1}}, \ldots, [{\mathbf {D}}_{\mathcal {D}}]_{A_{i} A_{i-1}} \rbrace$, the offline prioritized environment optimization guarantees that the navigation tasks of all agents can be carried out successfully without collision. Moreover, the traveled distance of agent $A_{i}$ outside ${\mathcal {S}}$ and ${\mathcal {D}}$ is bounded by \begin{align*} d_{A_{1}} + \sum _{j=2}^{i} \left(d_{A_{j},{\mathcal {S}}}+d_{A_{j},{\mathcal {D}}}\right) = M_{i} \tag{7} \end{align*} View Sourcefor $i=1,{\ldots },n$, which increases with the decreasing of priority, i.e., $M_{1} \!\leq \! \cdots \!\leq \! M_{n}$ with the priorities $\rho _{1} \!\geq \! \cdots \!\geq \! \rho _{n}$.

Proof:

See Appendix C.$\blacksquare$

Theorem 3 states that the offline prioritized environment optimization can guarantee the success of all navigation tasks and requires less obstacle-free area compared to its unprioritized counterpart, i.e., the lower bound in (6) is smaller than that in (4) [Asm. 4]. However, this improvement is obtained by sacrificing the navigation performance of the agents with lower priorities. That is, these agents are no longer moving along the shortest path and the traveled distance increases with the decreasing of agent priority [cf. (7)]. It corresponds to Problem 2 of prioritized environment optimization, where the metric $M(\cdot)$ is the traveled distance. The latter shows a trade-off between the navigation completeness of all agents and the individual performance of lower-priority agents.

Next, we consider scenarios with further reduced resources, i.e., we show the partial completeness when the obstacle-free area is smaller than that required in (6).

Corollary 1:

Consider the same setting as in Theorem 3. If the environment ${\mathcal {E}}$ satisfies \begin{align*} &2 d_{A_{1}} \hat{r} + \sum _{i=2}^{b} 2 \left(d_{A_{i},{\mathcal {S}}}+d_{A_{i},{\mathcal {D}}}\right)\hat{r} \\ &\quad\leq |{\mathcal {E}}\setminus (\Delta \cup {\mathcal {S}}\cup {\mathcal {D}})| < 2 d_{A_{1}} \hat{r} + \sum _{i=2}^{b+1} 2 \left(d_{A_{i},{\mathcal {S}}}+d_{A_{i},{\mathcal {D}}}\right)\hat{r} \tag{8} \end{align*} View Sourcewhere $b$ is an integer with $1 \leq b < n$, the offline prioritized environment optimization guarantees that the navigation tasks of the agents with highest $b$ priorities $\lbrace A_{i}\rbrace _{i=1}^{b}$ can be carried out successfully without collision.

For the rest of the agents $\lbrace A_{j}\rbrace _{j=b+1}^{n}$, if the starting position and the destination of agent $A_{j}$ is within the same connected components in ${\mathcal {S}}$ and ${\mathcal {D}}$ as one of the agents $\lbrace A_{i}\rbrace _{i=1}^{b}$, i.e., ${\mathbf {s}}_{j} \in {\mathcal {S}}_{A_{i}}$ and ${\mathbf {d}}_{j} \in {\mathcal {D}}_{A_{i}}$ for any $i \in \lbrace 1,\ldots,b\rbrace$, the navigation task of agent $A_{j}$ can be carried out successfully without collision as well.

Proof:

See Appendix D.$\blacksquare$

Corollary 1 demonstrates that the success of all navigation tasks may not be guaranteed if the obstacle-free area is further smaller than the lower bound in (6). In this case, the prioritized environment optimization guarantees preferentially the navigation tasks of the agents with higher priorities $\lbrace A_{i}\rbrace _{i=1}^{b}$, but overlooks the ones with lower priorities $\lbrace A_{j}\rbrace _{j=b+1}^{n}$. Moreover, the navigation tasks of lower-priority agents $\lbrace A_{j}\rbrace _{j=b+1}^{n}$ can be guaranteed only if their starting and goal positions are within the same connected components as one of the higher-priority agents $\lbrace A_{i}\rbrace _{i=1}^{b}$. This implies that lower-priority agents can benefit from higher-priority agents if they have “similar” navigation tasks.

B. Online Prioritized Environment Optimization

Online environment optimization changes the obstacle region $\Delta$ during navigation and guarantees the success of all navigation tasks if $\dot{\Delta } \geq 2 n \hat{r} \Vert \hat{{\mathbf {v}}} \Vert _{2}$ [Thm. 2]. We similarly consider the reduced-resource scenario, where the capacity of the obstacle region is smaller than $2 n \hat{r} \Vert \hat{{\mathbf {v}}} \Vert _{2}$, i.e., $\dot{\Delta } < 2 n \hat{r} \Vert \hat{{\mathbf {v}}} \Vert _{2}$. In this circumstance, online prioritized environment optimization can guarantee the success of all navigation tasks, with a performance loss of lower-priority agents.

Theorem 4:

Consider the multi-agent system of $n$ agents $\lbrace A_{i}\rbrace _{i=1}^{n}$ in an environment ${\mathcal {E}}$ with the same settings as Theorem 2. Let $\boldsymbol{\rho }$ be the agent priorities ordered by the index $\rho _{1} \geq \rho _{2} \geq \cdots \geq \rho _{n}$ and the reduced capacity of environment optimization be \begin{align*} 2 b \hat{r} \Vert \hat{{\mathbf {v}}} \Vert _{2} \leq \dot{\Delta } < 2 (b+1) \hat{r} \Vert \hat{{\mathbf {v}}} \Vert _{2} \tag{9} \end{align*} View Sourcewhere $b$ is an integer with $1 \leq b < n$. Then, online prioritized environment optimization guarantees that the navigation tasks of all agents can be carried out successfully without collision. Moreover, the navigation time of agent $A_{i}$ can be bounded by \begin{align*} \frac{C_{T}}{\rho _{i}},\,\text{ for} \,i=1,\ldots,n \tag{10} \end{align*} View Sourcethat is proportional to the inverse of its priority, where $C_{T}$ is a time constant determined by $b$.

Proof:

See Appendix E.$\blacksquare$

Theorem 4 states that the online prioritized environment optimization can guarantee the success of all navigation tasks and requires a smaller obstacle changing rate compared to its unprioritized counterpart, i.e., the lower bound in (9) is less than that in (5). This improvement comes with the performance loss of lower-priority agents, i.e., the agents with higher priorities reach the destinations faster than the ones with lower priorities [cf. (10)]. We attribute this behavior to the fact that the online prioritized environment optimization changes the obstacle region concurrently to agent movement and is capable of continuously tuning the changing strategy based on the agent priorities during navigation. This corresponds to Problem 2 of prioritized environment optimization, where the metric $M(\cdot)$ is the navigation time.

We follow to consider scenarios where all agents are required to reach their destinations within a required time, and analyze the partial completeness of the online prioritized environment optimization in this setting.

Corollary 2:

Consider the same setting as in Theorem 4. If all agents are required to reach their destinations within time $T_{\max }$, i.e., $T_{i} \leq T_{\max }$ for $i=1,\ldots,n$, the online prioritized environment optimization guarantees the success of $n_{p} \leq n$ agents $\lbrace A_{i}\rbrace _{i=1}^{n_{p}}$ with highest priorities $\lbrace \rho _{i}\rbrace _{i=1}^{n_{p}}$, where $n_{p}$ is determined by the changing rate of the obstacle region, i.e., $b$ in (9), and the required time $T_{\max }$ [cf. (54)].

Proof:

See Appendix F.$\blacksquare$

Corollary 2 demonstrates that the success of all navigation tasks may not be guaranteed if agents are required to reach destinations within a finite time $T_{\max }$. Online prioritized environment optimization allows preferentially a set of agents $\lbrace A_{i}\rbrace _{i=1}^{n_{p}}$ with higher priorities to reach destinations, while the rest of agents $\lbrace A_{i}\rbrace _{i=n_{p}+1}^{n}$ may fail navigation tasks. The number of successful agents $n_{p}$ depends on the changing capacity of the obstacle region and the required time $T_{\max }$, i.e., the higher the changing rate and the required time, the larger the number of successful agents. Corollary 2 recovers Theorem 4 when the required time is larger than the maximal time in (10), i.e., $T_{\max } \geq C_{T}/\rho _{n}$.

Theorems 3 and 4 show the role played by agent priorities in the offline and online environment optimization with less resources, i.e., the obstacle-free area and the obstacle changing rate, than that required by Theorems 1 and 2. The prioritized environment optimization guides the resource allocation by emphasizing higher-priority agents. By doing so, it guarantees the success of all navigation tasks with reduced resources while sacrificing the navigation performance of lower-priority agents. Moreover, Corollaries 1 and 2 demonstrate the partial completeness of multi-agent navigation for scenarios with further reduced resources and added requirements (e.g., maximal time allowed). The prioritized environment optimization guarantees the success of higher-priority agents, but lower-priority agents may fail in these circumstances.

Remark 2:

Theoretical analysis of offline and online environment optimization assumes that the obstacle region $\Delta$ can be controlled and re-shaped continuously as long as its area remains the same. The goal is to demonstrate the effectiveness of environment optimization in improving the performance of multi-agent navigation. However, it is worth mentioning that there may exist practical restrictions either on the obstacle region or on the way it can be controlled, in real-world applications. We account for the latter by imposing constraints when mathematically formulating the environment optimization problem in the next section.

SECTION V.

Methodology

In this section, we formulate the prioritized environment optimization problem mathematically as a constrained stochastic optimization problem, where the imposed constraints correspond to resource limitations and physical restrictions on the environment optimization in real-world applications, and solve the latter by leveraging reinforcement learning with the primal-dual mechanism.

Specifically, consider the obstacle region $\Delta$ comprising $m$ obstacles ${\mathcal {O}}= \lbrace O_{1},\ldots,O_{m}\rbrace$, which can be of any shape and located at positions ${\mathbf {O}}= [{\mathbf {o}}_{1},\ldots,{\mathbf {o}}_{m}]$. The obstacles can change positions to improve the navigation performance of the agents. Denote by ${\mathbf {X}}_{o} = [{\mathbf {x}}_{o1},\ldots,{\mathbf {x}}_{om}]$ the obstacle states, ${\mathbf {U}}_{o} = [{\mathbf {u}}_{o1},\ldots,{\mathbf {u}}_{om}]$ the obstacle actions, ${\mathbf {X}}_{a} = [{\mathbf {x}}_{a1},\ldots,{\mathbf {x}}_{an}]$ the agent states and ${\mathbf {U}}_{a} = [{\mathbf {u}}_{a1},\ldots,{\mathbf {u}}_{am}]$ the agent actions. For example, the states can be positions or velocities and the actions can be accelerations. Let $\pi _{o}({\mathbf {U}}_{o} | {\mathbf {X}}_{o}, {\mathbf {X}}_{a})$ be an optimization policy that controls the obstacles, a distribution over ${\mathbf {U}}_{o}$ conditioned on ${\mathbf {X}}_{o}$ and ${\mathbf {X}}_{a}$. The objective function $f({\mathbf {S}}, {\mathbf {D}}, \pi _{a}, \boldsymbol{\rho }, {\mathbf {O}}, \pi _{o})$ measures the performance of the multi-agent navigation task $({\mathbf {S}}, {\mathbf {D}})$, given the trajectory planner $\pi _{a}$, the agent priorities $\boldsymbol{\rho }$, the obstacle positions ${\mathbf {O}}$ and the optimization policy $\pi _{o}$, while the cost functions $\lbrace g_{j}({\mathbf {S}}, {\mathbf {D}}, \pi _{a}, \boldsymbol{\rho }, {\mathbf {O}}, \pi _{o})\rbrace _{j=1}^{m}$ indicate the penalties of obstacle position changes, e.g., collision penalties of obstacle movements w.r.t. the other obstacles and agents. Moreover, a set of $Q$ constraints are imposed on the obstacles corresponding to resource limitations and physical restrictions in real-world applications, which are represented by the constraint functions $\lbrace h_{q}({\mathbf {X}}_{o}, {\mathbf {U}}_{o})\rbrace _{q=1}^{Q}$. For example, the deviation distances of the obstacles from their initial positions or the moving velocities of the obstacles are bounded by some maximal constant. Since the multi-agent system is with random initialization, the objective, cost, constraint functions are random functions and an expected performance would be a more meaningful metric for performance evaluation.

The goal is, thus, to find an optimal obstacle policy $\pi _{o}$ that maximizes the expected performance $\mathbb {E}[f({\mathbf {S}}, {\mathbf {D}}, \pi _{a}, \boldsymbol{\rho }, {\mathbf {O}}, \pi _{o})]$ regularized by the obstacle costs $\lbrace \mathbb {E}[g_{j}({\mathbf {S}}, {\mathbf {D}}, \pi _{a}, \boldsymbol{\rho }, {\mathbf {O}}, \pi _{o})]\rbrace _{j=1}^{m}$, while satisfying the imposed constraints. By introducing ${\mathcal {U}}_{o}$ as the action space of the obstacles, we formulate the prioritized environment optimization problem as a constrained stochastic optimization problem \begin{align*} &\mathop{\text{argmax}}\limits_{\pi _{o}}\,\!\! \mathbb{E}\left[f({\mathbf {S}},\! {\mathbf {D}},\! \pi _{a},\! \boldsymbol{\rho },\! {\mathbf {O}},\! \pi _{o}\!)\!\right] \!-\!\!\! \sum _{j=1}^{m}\! \beta _{j} \mathbb {E}\left[g_{j}({\mathbf {S}},\! {\mathbf {D}},\! \pi _{a},\! \boldsymbol{\rho },\! {\mathbf {O}},\! \pi _{o}\!)\!\right] \\ &\,\,\,\,\text{s.t.}\,\,\,\, h_{q}({\mathbf {X}}_{o}, {\mathbf {U}}_{o}) \leq 0,\,\text{ for all} \,q\!=\!1,\ldots,Q, \\ &\,\,\,\,\,\,\,\,\,\,\,\, \pi _{o}({\mathbf {U}}_{o}|{\mathbf {X}}_{o},{\mathbf {X}}_{a}) \in {\mathcal {U}}_{o}, \tag{11} \end{align*} View Sourcewhere $\lbrace \beta _{j}\rbrace _{j=1}^{m}$ are regularization parameters. The objective function $f({\mathbf {S}}, {\mathbf {D}}, \pi _{a}, \boldsymbol{\rho }, {\mathbf {O}}, \pi _{o})$, the cost functions $\lbrace g_{j}({\mathbf {S}}, {\mathbf {D}}, \pi _{a}, \boldsymbol{\rho }, {\mathbf {O}}, \pi _{o})\rbrace _{j=1}^{m}$, the constraint functions $\lbrace h_{q}({\mathbf {X}}_{o}, {\mathbf {U}}_{o})\rbrace _{q=1}^{Q}$ and the action space ${\mathcal {U}}_{o}$ are not necessarily convex/non-convex depending on specific application scenarios. The problem is challenging due to four main reasons:

The closed-form expression of the objective function $f({\mathbf {S}}, {\mathbf {D}}, \pi _{a}, \boldsymbol{\rho }, {\mathbf {O}}, \pi _{o})$ is difficult to derive because the explicit relationship between agents, obstacles and navigation performance is difficult to model, precluding the application of conventional model-based methods and leading to poor performance of heuristic methods.
The imposed constraints $\lbrace h_{q}({\mathbf {X}}_{o}, {\mathbf {U}}_{o})\rbrace _{q=1}^{Q}$ restrict the space of feasible solutions and are difficult to address, leading to the failure of conventional unconstrained optimization algorithms.
The policy $\pi _{o}({\mathbf {U}}_{o} | {\mathbf {X}}_{o}, {\mathbf {X}}_{a})$ is an arbitrary mapping from the state space to the action space, which can take any function form and is infinitely dimensional.
The obstacle actions can be discrete or continuous and the action space ${\mathcal {U}}_{o}$ can be non-convex, resulting in further constraints on the feasible solution.

Due to the aforementioned challenges, we propose to solve problem (11) by leveraging reinforcement learning (RL) and the primal-dual mechanism. The former parameterizes the optimization policy with information processing architectures and allows us to train the architecture parameters in a model-free manner. The latter penalizes the constraints with dual variables, and updates the primal and dual variables alternatively while searching for feasible solutions.

A. Reinforcement Learning

We reformulate problem (11) in the RL domain and start by defining a Markov decision process. At each step $t$, the obstacles ${\mathcal {O}}$ and the agents ${\mathcal {A}}$ observe the states ${\mathbf {X}}_{o}^{(t)}$, ${\mathbf {X}}_{a}^{(t)}$ and take the actions ${\mathbf {U}}_{o}^{(t)}$, ${\mathbf {U}}_{a}^{(t)}$ with the obstacle policy $\pi _{o}$ and the trajectory planner $\pi _{a}$, respectively. The actions ${\mathbf {U}}_{o}^{(t)}$, ${\mathbf {U}}_{a}^{(t)}$ amend the states ${\mathbf {X}}_{o}^{(t)}$, ${\mathbf {X}}_{a}^{(t)}$ based on the transition probability function $P({\mathbf {X}}_{o}^{(t+1)},{\mathbf {X}}_{a}^{(t+1)}|{\mathbf {X}}_{o}^{(t)},{\mathbf {X}}_{a}^{(t)},{\mathbf {U}}_{o}^{(t)},{\mathbf {U}}_{a}^{(t)})$, which is a distribution over the states conditioned on the previous states and the actions. Let $r_{ai}({\mathbf {X}}_{o}^{(t)},{\mathbf {X}}_{a}^{(t)}, {\mathbf {U}}_{o}^{(t)},{\mathbf {U}}_{a}^{(t)})$ be the reward function of agent $A_{i}$, which represents its navigation performance at step $t$. The reward function of obstacle $O_{j}$ comprises two components: (i) the global system reward and (ii) the local obstacle reward, i.e., \begin{align*} &r_{oj} = \frac{1}{n}\sum _{i=1}^{n} \rho _{i} r_{ai} + \beta _{j} r_{j, local},\,\text{ for all} \,j=1,{\ldots },m \tag{12} \end{align*} View Sourcewhere $\rho _{i}$ is the priority of agent $A_{i}$, $\beta _{j}$ is the regularization parameter of obstacle $O_{j}$, and $r_{ai}$, $r_{j, local}$ are concise notations of $r_{ai}({\mathbf {X}}_{o}^{(t)},{\mathbf {X}}_{a}^{(t)}, {\mathbf {U}}_{o}^{(t)},{\mathbf {U}}_{a}^{(t)})$, $r_{j, local} ({\mathbf {X}}_{o}^{(t)},{\mathbf {X}}_{a}^{(t)}, {\mathbf {U}}_{o}^{(t)},{\mathbf {U}}_{a}^{(t)})$. The first term is the team reward that represents the multi-agent system performance, which is shared over all obstacles. The second term is the individual reward that corresponds to the position change penalty of obstacle $O_{j}$, which may be different among obstacles, e.g., the collision penalty. This reward definition is novel that differs from common RL scenarios, which is a combination of a global reward with a local reward. The former is the main goal of all obstacles, while the latter is the individual cost of an obstacle. The priorities $\boldsymbol{\rho }$ weigh the agents' performance to put more emphasis on the agents with higher priorities, while the regularization parameters $\lbrace \beta _{j}\rbrace _{j=1}^{m} \in [0, 1]^{m}$ weigh the environment optimization penalty on the navigation performance. The total reward of the obstacles is defined as \begin{align*} r_{o}= \sum _{j=1}^{m} r_{oj} = \frac{m}{n}\sum _{i=1}^{n} \rho _{i} r_{ai} + \sum _{j=1}^{m}\beta _{j} r_{j, local}. \tag{13} \end{align*} View Source

Let $\gamma \in [0,1]$ be the discount factor accounting for the future rewards and the expected discounted reward can be represented as \begin{align*} &R({\mathbf {S}}, {\mathbf {D}}, \pi _{a}, \boldsymbol{\rho }, {\mathbf {O}}, \pi _{o}) = \mathbb {E} \left[\sum _{t = 0}^\infty \gamma ^{t} r_{o}^{(t)}\right] \\ &\quad = \mathbb {E}\left[\sum _{t=0}^\infty \gamma ^{t} \sum _{j=1}^{m}\!\sum _{i=1}^{n}\! \frac{\rho _{i} r_{ai}^{(t)}}{n}\right] \!+\! \sum _{j=1}^{m} \beta _{j} \mathbb {E}\left[\sum _{t=0}^\infty \gamma ^{t} r_{j, local}^{(t)}\right] \tag{14} \end{align*} View Sourcewhere ${\mathbf {O}}$, ${\mathbf {S}}$ and ${\mathbf {D}}$ are the initial positions and destinations that determine the initial states ${\mathbf {X}}_{o}^{(0)}$ and ${\mathbf {X}}_{a}^{(0)}$, and $\mathbb {E}[\cdot ]$ is w.r.t. the action policy and the state transition probability. The obstacles are imposed with $Q$ constraints at each step $t$, which are functions of the obstacle states ${\mathbf {X}}_{o}^{(t)}$ and actions ${\mathbf {U}}_{o}^{(t)}$ as $h_{q}({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}) \leq 0$ for $q=1,\ldots,Q$. By parameterizing the obstacle policy $\pi _{o}$ with an information processing architecture $\boldsymbol{\Phi }({\mathbf {X}}_{o},{\mathbf {X}}_{a},\boldsymbol{\theta })$ of parameters $\boldsymbol{\theta }$, we formulate the constrained reinforcement learning problem as \begin{align*} &\text{argmax}_{\boldsymbol{\theta }}\, R({\mathbf {S}}, {\mathbf {D}}, \pi _{a}, \boldsymbol{\rho }, {\mathbf {O}}, \boldsymbol{\theta })\\ &\quad \text{s.t.}\,\,h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) \leq 0,\,\text{ for} \,q\!=\!1,{\ldots },Q,\,t\!=\!0,1,{\ldots },\infty, \\ &\quad \quad\boldsymbol{\Phi }\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {X}}_{a}^{(t)}, \boldsymbol{\theta }\right) \in {\mathcal {U}}_{o},\,\text{ for} \,t=0,1,\ldots,\infty.\tag{15} \end{align*} View SourceBy comparing problem (11) with problem (15), we see equivalent representations of the objective, cost and constraint functions in the RL domain. The goal is to find the optimal parameters $\boldsymbol{\theta }^*$ that maximize the reward while satisfying the constraints at each step $t$.

B. Primal-Dual Policy Gradient

Since there is no straightforward way to optimize $\boldsymbol{\theta }$ w.r.t. per-step hard constraints in problem (15), we define the reward of the constraint $h_{q}({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}) \leq 0$ with an indicator function as \begin{align*} r_{q}^{(t)} = \mathbb {1}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) \leq 0\right],\,\text{ for} \,q=1,\ldots,Q \tag{16} \end{align*} View Sourcewhere $\mathbb {1}[h_{q}({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}) \leq 0]$ is 1 if $h_{q}({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}) \leq 0$ and 0 otherwise. It quantifies the constraint by rewarding success and penalizing failure at each step $t$. The cumulative reward of the constraint $h_{q}({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}) \leq 0$ is \begin{align*} R_{q}({\mathbf {O}}, \boldsymbol{\theta }) = \mathbb {E}\left[\sum _{t=0}^\infty \gamma ^{t} r_{q}^{(t)}\right],\,\text{ for} \,q=1,\ldots,Q \tag{17} \end{align*} View Sourcewhere $\gamma$ is the discount factor. The preceding expression allows us to measure how well the constraints are satisfied in expectation and takes the same form as the expected discounted reward of the obstacles [cf. (14)]. We can then transform the constraints as \begin{align*} R_{q}({\mathbf {O}}, \boldsymbol{\theta }) \geq {\mathcal {C}}_{q},\,\text{ for} \,q=1,\ldots,Q \tag{18} \end{align*} View Sourcewhere $\lbrace {\mathcal {C}}_{q}\rbrace _{q=1}^{Q}$ are constants that lower-bound the constraint rewards and control the constraint guarantees – see details in Section V-C. This yields an alternative of problem (15) as \begin{align*} &\mathbb {P}:=\mathop{\text{argmax}}_{\boldsymbol{\theta }}\, R({\mathbf {S}}, {\mathbf {D}}, \pi _{a}, \boldsymbol{\rho }, {\mathbf {O}}, \boldsymbol{\theta })\\ &\quad \text{s.t.}\,\,R_{q}({\mathbf {O}}, \boldsymbol{\theta }) \geq {\mathcal {C}}_{q},\,\text{ for} \,q=1,\ldots,Q, \\ &\quad\quad \boldsymbol{\Phi }\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {X}}_{a}^{(t)}, \boldsymbol{\theta }\right) \in {\mathcal {U}}_{o},\,\text{ for} \,t=0,1,\ldots,\infty.\tag{19} \end{align*} View Source

By introducing the dual variables $\boldsymbol{\lambda }= [\lambda _{1},\ldots,\lambda _{Q}]^\top \in \mathbb {R}_+^{Q}$ that correspond to $Q$ constraints, we formulate the Lagrangian of problem (19) as \begin{align*} &{\mathcal {L}}(\boldsymbol{\theta }, \boldsymbol{\lambda }) = R({\mathbf {S}},\! {\mathbf {D}},\! \pi _{a},\! \boldsymbol{\rho },\! {\mathbf {O}},\! \boldsymbol{\theta }) \!+\!\! \sum _{q=1}^{Q} \!\lambda _{q} \left(R_{q}({\mathbf {O}}, \boldsymbol{\theta }) \!-\! {\mathcal {C}}_{q}\right) \tag{20} \end{align*} View Sourcewhich penalizes the objective with the constraint violation weighted by the dual variables $\boldsymbol{\lambda }$. The dual function is defined as the maximum of the Lagrangian ${\mathcal {D}}(\boldsymbol{\lambda }):=\max _{\boldsymbol{\theta }} {\mathcal {L}}(\boldsymbol{\theta }, \boldsymbol{\lambda })$. Since ${\mathcal {D}}(\boldsymbol{\lambda }) \geq \mathbb {P}$ for any dual variables $\boldsymbol{\lambda }$, we define the dual problem as \begin{align*} \mathbb {D}:= \min _{\boldsymbol{\lambda }\geq 0} {\mathcal {D}}(\boldsymbol{\lambda }) = \min _{\boldsymbol{\lambda }\geq 0} \max _{\boldsymbol{\theta }} {\mathcal {L}}(\boldsymbol{\theta }, \boldsymbol{\lambda }) \tag{21} \end{align*} View Sourcewhich computes the optimal dual variables that minimizes the dual function ${\mathcal {D}}(\boldsymbol{\lambda })$. The dual solution $\mathbb {D}$ in (21) can be considered as the best approximation of the primal solution $\mathbb {P}$ in (19). This translates a constrained maximization problem to an unconstrained min-max problem, which searches for a saddle point solution $(\boldsymbol{\theta }^*, \boldsymbol{\lambda }^*)$ that is maximal w.r.t. the primal variables $\boldsymbol{\theta }$ and minimal w.r.t. the dual variables $\boldsymbol{\lambda }$.

We propose to solve the dual problem (21) with the primal-dual policy gradient method, which updates $\boldsymbol{\theta }$ with policy gradient ascent and $\boldsymbol{\lambda }$ with gradient descent in an alternative manner. Specifically, it trains the model iteratively and each iteration consists of primal and dual steps.

Algorithm 1: Primal-Dual Policy Gradient Method.

Input: initial primal variables $\boldsymbol{\theta }^{(0)}$, initial dual variables $\boldsymbol{\lambda }^{(0)}$, initial value function parameters $\boldsymbol{\omega }^{(0)}$, trajectory planner $\pi _{a}$ and transition probability function $P$

for $k=0,1,\dots$ do

Compute value functions $V({\mathbf {X}}_{o}^{(k)},{\mathbf {X}}_{a}^{(k)}, \boldsymbol{\omega }^{(k)})$ and $V({\mathbf {X}}_{o}^{(k+1)},{\mathbf {X}}_{a}^{(k+1)}, \boldsymbol{\omega }^{(k)})$ to estimate the Lagrangian values of (20)

Compute the estimation error as in (22)

Update the primal variables $\boldsymbol{\theta }^{(k)}$ with policy gradient ascent as in (23) $\psi$ times and set $\boldsymbol{\theta }^{(k+1)} := \boldsymbol{\theta }^{(k)}_\psi$ as the updated primal variables

Compute the expected constraint rewards as in (25)

Update the dual variables $\boldsymbol{\lambda }^{(k)}$ with gradient descent as in (24)

end for

Primal step: At iteration $k$, let $\boldsymbol{\theta }^{(k)}$ be the primal variables, $\boldsymbol{\lambda }^{(k)}$ the dual variables, ${\mathbf {X}}_{o}^{(k)}$ the obstacle states, and ${\mathbf {X}}_{a}^{(k)}$ the agent states. Given these system states, the obstacle policy $\boldsymbol{\Phi }({\mathbf {X}}_{o}^{(k)},{\mathbf {X}}_{a}^{(k)},\boldsymbol{\theta }^{(k)})$ generates the obstacle actions ${\mathbf {U}}_{o}^{(k)}$ and the given trajectory planner $\pi _{a}$ generates the agent actions ${\mathbf {U}}_{a}^{(k)}$. These actions ${\mathbf {U}}_{o}^{(k)}$, ${\mathbf {U}}_{a}^{(k)}$ change the states from ${\mathbf {X}}_{o}^{(k)}$, ${\mathbf {X}}_{a}^{(k)}$ to ${\mathbf {X}}_{o,1}^{(k)}$, ${\mathbf {X}}_{a,1}^{(k)}$ based on the transition probability function $P({\mathbf {X}}_{o,1}^{(k)},{\mathbf {X}}_{a,1}^{(k)}|{\mathbf {X}}_{o}^{(k)},{\mathbf {X}}_{a}^{(k)},{\mathbf {U}}_{o}^{(k)},{\mathbf {U}}_{a}^{(k)})$, which feeds back the corresponding obstacle reward $r_{o}^{(k)}$ [cf. (13)] and the constraint rewards $\lbrace r_{q}^{(k)}\rbrace _{q=1}^{Q}$ [cf. (16)].

We follow the actor-critic method to define a value function $V({\mathbf {X}}_{o},{\mathbf {X}}_{a}, \boldsymbol{\omega })$ of parameters $\boldsymbol{\omega }$ that estimates the Lagrangian (20) initialized at the states ${\mathbf {X}}_{o},{\mathbf {X}}_{a}$. Let $\boldsymbol{\omega }^{(k)}$ be the parameters of the value function at iteration $k$, and $V({\mathbf {X}}_{o}^{(k)},{\mathbf {X}}_{a}^{(k)}, \boldsymbol{\omega }^{(k)})$, $V({\mathbf {X}}_{o,1}^{(k)},{\mathbf {X}}_{a,1}^{(k)}, \boldsymbol{\omega }^{(k)})$ be the estimated values at ${\mathbf {X}}_{o}^{(k)},{\mathbf {X}}_{a}^{(k)}$ and ${\mathbf {X}}_{o,1}^{(k)},{\mathbf {X}}_{a,1}^{(k)}$. This allows us to compute the estimation error as \begin{align*} \delta ^{(k)}_{1} \!&=\! r_{o}^{(k)} \!\!+\! \sum _{q=1}^{Q} \lambda _{q}^{(k)}\left(r_{q}^{(k)} - (1-\gamma){\mathcal {C}}_{q}\right) \\ &\quad+ \gamma V\left({\mathbf {X}}_{o,1}^{(k)},{\mathbf {X}}_{a,1}^{(k)}, \boldsymbol{\omega }^{(k)}\right) \!-\! V\left({\mathbf {X}}_{o}^{(k)},{\mathbf {X}}_{a}^{(k)}, \boldsymbol{\omega }^{(k)}\right) \tag{22} \end{align*} View Sourcewhich is used to update the parameters $\boldsymbol{\omega }^{(k)}$ of the value function. Then, we can update the primal variables $\boldsymbol{\theta }^{(k)}$ with policy gradient ascent as \begin{align*} \boldsymbol{\theta }^{(k)}_{1} = \boldsymbol{\theta }^{(k)} + \alpha \nabla _{\boldsymbol{\theta }} \log \pi _{o}\left({\mathbf {U}}_{o}^{(k)} | {\mathbf {X}}_{a}^{(k)}, {\mathbf {X}}_{o}^{(k)}\right) \delta ^{(k)}_{1} \tag{23} \end{align*} View Sourcewhere $\alpha$ is the step-size and $\pi _{o}({\mathbf {U}}_{o}^{(k)} | {\mathbf {X}}_{a}^{(k)}, {\mathbf {X}}_{o}^{(k)})$ is the policy distribution of the obstacle action specified by the parameters $\boldsymbol{\theta }^{(k)}$. The primal update is model-free because (23) requires only the error value $\delta ^{(k)}_{1}$ and the gradient of the policy distribution, but not the objective, cost and constraint function models. By performing the above procedure recursively $\psi \in \mathbb{Z}_+$ times, we obtain a sequence of primal variables $\lbrace \boldsymbol{\theta }^{(k)}_{1}, \boldsymbol{\theta }^{(k)}_{2},\ldots,\boldsymbol{\theta }^{(k)}_{\psi }\rbrace$. Define $\boldsymbol{\theta }^{(k+1)} := \boldsymbol{\theta }^{(k)}_{\psi }$ as the new primal variables and step into the dual step.

Dual step: With the updated primal variables $\boldsymbol{\theta }^{(k+1)}$, we update the dual variables $\boldsymbol{\lambda }^{(k)}$ with gradient descent as \begin{align*} &\lambda _{q}^{(k\!+\!1)} \!\!=\!\! \left[\!\lambda _{q}^{(k)} \!\!-\!\! \beta \left(\!R_{q}\left({\mathbf {O}}, \boldsymbol{\theta }^{(k\!+\!1)}\right) \!-\! {\mathcal {C}}_{q}\right)\!\right]_+\text{ for} \,q\!=\!1,{\ldots },Q \tag{24} \end{align*} View Sourcewhere $\beta$ is the step-size and $[\cdot ]_+$ is the non-negative operator corresponding to the positivity of dual variables. Since the constraint function values $\lbrace h_{q}({\mathbf {X}}_{o}^{(k)}, {\mathbf {U}}_{o}^{(k)})\rbrace _{q=1}^{Q}$ are given, the cumulative constraint rewards $\lbrace R_{q}({\mathbf {O}}, \boldsymbol{\theta }^{(k)})\rbrace _{q=1}^{Q}$ can be estimated by sampling ${\mathcal {T}}$ trajectories $\lbrace {\mathbf {X}}_{o}^{(t),1}, {\mathbf {U}}_{o}^{(t), 1}\rbrace _{t},\ldots,\lbrace {\mathbf {X}}_{o}^{(t),{\mathcal {T}}}, {\mathbf {U}}_{o}^{(t), {\mathcal {T}}}\rbrace _{t}$, computing the respective cumulative constraint rewards, and taking the average, i.e., \begin{align*} R_{q}\left({\mathbf {O}}, \boldsymbol{\theta }^{(k\!+\!1)}\right) \!\approx \! \frac{1}{{\mathcal {T}}}\sum _{\tau =1}^{{\mathcal {T}}}\sum _{t=0}^{T_\tau } \!\gamma ^{t} \mathbb {1}\left[h_{q}\left({\mathbf {X}}_{o}^{(t),\tau }\!, {\mathbf {U}}_{o}^{(t), \tau }\right) \!\leq \! 0\right] \tag{25} \end{align*} View Sourcewhere $T_\tau$ is the operation time of the $\tau$th trajectory. The dual update is model-free because (24) and (25) require only the constraint values $\lbrace h_{q}({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)})\rbrace _{t}$ instead of the constraint function models. Algorithm 1 summarizes the approach and Fig. 2 demonstrates the methodology framework.

Figure 2.

Framework of primal-dual reinforcement learning.

Show All

C. Constraint Guarantees

While the cumulative constraint rewards in the alternative problem (19) is a relaxation of the per-step hard constraints in the original problem (15), we provide constraint guarantees for the alternative problem (19) in the sequel. Specifically, the expectation of the indicator function in (17) is equivalent to the probability of satisfying the constraint, i.e., \begin{align*} \mathbb {E}\left[\mathbb {1}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) \leq 0\right]\right] = \mathbb {P}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) \leq 0\right] \tag{26} \end{align*} View Sourcewhere $\mathbb {P}[\cdot ]$ is the probability measure. This allows to rewrite the cumulative constraint reward as \begin{align*} R_{q}({\mathbf {O}}, \boldsymbol{\theta }) = \sum _{t=0}^\infty \gamma ^{t} \mathbb {P}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) \leq 0\right] \geq {\mathcal {C}}_{q} \tag{27} \end{align*} View Sourcefor $q=1,\ldots,Q$ [cf. (18)], which is the discounted sum of the constraint satisfaction probability over all steps $t$.

However, (27) are not sufficient conditions to satisfy the constraints at each step $t$, as required by problem (15). We then establish the relation between (27) and the per-step constraints in problem (15). Before proceeding, we define the $(1-\delta)$-constrained solution as follows.

Definition 2 ($(1-\delta)$-constrained solution):

A solution of problem (19) is $(1-\delta)$-constrained w.r.t. the constraints $\lbrace h_{q}\rbrace _{q=1}^{Q}$ if for each step $t \geq 0$, it holds that \begin{align*} \mathbb {P}\left[\!\cap _{0 \leq \tau \leq t}\! \left\lbrace h_{q}\left({\mathbf {X}}_{o}^{(\tau)}\!\!,\! {\mathbf {U}}_{o}^{(\tau)}\right) \!\leq \! 0\right\rbrace \!\right] \!\!\geq \!\! 1\!-\!\delta \,\text{ for} \,q\!=\!\!1,{\ldots },Q. \tag{28} \end{align*} View Source

The $(1-\delta)$-constrained solution satisfies the constraints at each step $t$ with a probability $1 - \delta \in (0,1]$, which is a feasible solution of problem (15) if $\delta = 0$. With these preliminaries, we analyze the constraint guarantees of the alternative problem (19) in the following theorem.

Theorem 5:

Consider problem (19) with the constraint constant taking the form of ${\mathcal {C}}_{q} = (1-\delta + \epsilon)/(1-\gamma)$ for $q=1,\ldots,Q$ [cf. (18)], where $\delta$ and $\epsilon$ are constraint parameters with $0 \leq \epsilon \leq \delta$. For any $\delta$, there exists $\epsilon$ such that the solution of problem (19) is $(1 - \delta)$-constrained.

Proof:

See Appendix G.$\blacksquare$

Theorem 5 states that the feasible solution of the alternative problem (19) is a $(1-\delta)$-constrained solution of problem (15). The result provides the per-step guarantees for the environment constraints in probability, which demonstrates the applicability of the alternative problem (19). In essence, we have not lost the constraint guarantees by relaxing the hard constraints to the cumulative constraint rewards.

D. Information Processing Architecture

The proposed approach is a general framework that covers various environment optimization scenarios and different information processing architectures can be integrated to solve different variants. We illustrate this fact by analyzing two representative scenarios: (i) offline optimization in discrete environments and (ii) online optimization in continuous environments, for which we use convolutional neural networks (CNNs) and graph neural networks (GNNs), respectively.

CNNs for offline discrete settings: In this setting, we first optimize the obstacles' positions and then navigate the agents. Since computations are performed offline, we collect the states of all obstacles ${\mathbf {X}}_{o}$ and agents ${\mathbf {X}}_{a}$ apriori, which allows for centralized information processing solutions (e.g., CNNs). CNNs leverage convolutional filters to extract features from image signals and have found wide applications in computer vision [40], [41], [42]. In the discrete environment, the system states can be represented by matrices and the latter are equivalent to image signals. This motivates to consider CNNs for policy parameterization.

GNNs for online continuous settings: In this setting, obstacle positions change while agents move, i.e., the obstacles take actions instantaneously. In large-scale systems with real-time communication and computation constraints, obstacles may not be able to collect the states of all other obstacles/agents and centralized solutions may not be applicable. This requires a decentralized architecture that can be implemented with local neighborhood information. GNNs are inherently decentralizable and are, thus, a suitable candidate.

GNNs are layered architectures that leverage a message passing mechanism to extract features from graph signals [43], [44], [45]. At each layer $\ell$, let ${\mathbf {X}}_{\ell -1}$ be the input signal. The output signal is generated with the message aggregation function ${\mathcal {F}}_{\ell m}$ and the feature update function ${\mathcal {F}}_{\ell u}$ as \begin{align*} [{\mathbf {X}}_{\ell }]_{i} \!=\! {\mathcal {F}}_{\ell u}\left([{\mathbf {X}}_{\ell \!-\!1}]_{i},\!\! \sum _{j \in {\mathcal {N}}_{i}}\!\! {\mathcal {F}}_{\ell m}\left([{\mathbf {X}}_{\ell \!-\!1}]_{i}, [{\mathbf {X}}_{\ell \!-\!1}]_{j}, [{\mathbf {E}}]_{ij} \right)\! \right) \end{align*} View Sourcewhere $[{\mathbf {X}}_{\ell }]_{i}$ is the signal value at node $i$, ${\mathcal {N}}_{i}$ are the neighboring nodes within the communication radius, ${\mathbf {E}}$ is the adjacency matrix, and ${\mathcal {F}}_{\ell m}$, ${\mathcal {F}}_{\ell u}$ have learnable parameters $\boldsymbol{\theta }_{\ell m}$, $\boldsymbol{\theta }_{\ell u}$. The output signal is computed with local neighborhood information only, and each node can make decisions based on its own output value; hence, allowing for decentralized implementation [46], [47], [48], [49].

SECTION VI.

Experiments

We evaluate the proposed approach in this section. First, we consider unprioritized environment optimization without constraints [Problem 1]. Then, we consider prioritized environment optimization with constraints [Problem 2]. For both cases, we consider two navigation scenarios, one in which we perform offline optimization with discrete obstacle motion, and another in which we consider online optimization with continuous obstacle motion. The obstacles have rectangular bodies and the agents have circular bodies. The given agent trajectory planner $\pi _{a}$ is the Reciprocal Velocity Obstacles (RVO) method [3], which can be applied in continuous space and allows for an efficient computation with decentralized implementation.³ Two metrics are used: Success weighted by Path Length (SPL) [50] and the percentage of the maximal speed (PCTSpeed). The former is a stringent measure combining the success rate and the path length, which has been widely used as a primary quantifier in comparing the navigation performance [50], [51], [52]. The latter is the ratio of the average speed to the maximal one, which provides complementary information regarding the moving speed along trajectories. Moreover, SPL and PCTSpeed normalize metric values to [0, 1] with a higher value representing a better performance, which provides a unified exposition for performance metrics. Results are averaged over 20 trials with random initial configurations.

A. Unprioritized Environment Optimization

We start with unprioritized environment optimization.

Offline discrete setting: We consider a grid environment of size 8 m × 8 m with 10 obstacles and 4 agents, which are distributed randomly without overlap. The maximal agent / obstacle velocity is 0.05 m per time step $\Delta t$ and the maximal time step is 500.

Setup: The environment is modeled as an occupancy grid map. An agent's location is modeled by a one-hot in a matrix of the same dimension. At each step, the policy considers one of the obstacles and moves it to one of the adjacent grid cells. This repeats for $m$ obstacles, referred to as a round, and an episode ends if the maximal round has been reached.

Training: The objective is to make agents reach destinations quickly while avoiding collision. The team reward is the sum of the PCTSpeed and the ratio of the shortest distance to the traveled distance, while the local reward is the collision penalty of individual obstacle [cf. (12)]. We parameterize the policy with a CNN of 4 layers, each containing 25 features with kernel size $2 \times 2$, and conduct training with PPO [53].

Baseline: Since exhaustive search methods are intractable for our problem, we develop a strong heuristic method to act as a baseline: At each step, one of the obstacles computes the shortest paths of all agents, checks whether it blocks any of these paths, and moves away randomly if so. This repeats for $m$ obstacles, referred to as a round, and the method ends if the maximal round is reached.

Performance: We train our model on 10 obstacles and test on 10 to 18 obstacles, which varies obstacle density from 10% to 30%. Fig. 3(a) and (b) show the results. The proposed approach consistently outperforms baselines with the highest SPL/PCTSpeed and the lowest variance. The heuristic method takes the second place and the original scenario (without any environment modification) performs worst. As we generalize to higher obstacle densities, all methods degrade as expected. However, our approach maintains a satisfactory performance due to the CNN's cabability for generalization. Fig. 3(c) and (d) show an example of how the proposed approach circumvents the dead-locks by optimizing the obstacle layout. Moreover, it improves the path efficiency such that all agents find collision-free trajectories close to their shortest paths.

Figure 3.

(a) And (b) performance of offline environment optimization compared to the baselines. Results are averaged over 20 trials and the shaded area shows the std. dev. The RL system is trained on 10 obstacles and tested on 10 to 18 obstacles. (a) SPL (1 is best). (b) PCTSpeed. (c) And (d) example of offline environment optimization. Red circles are initial positions, yellow squares are destinations, and obstacles are numbered for exposition. Color lines from red to yellow are agent trajectories and the color bar represents the time scale. Obstacle layout and agent trajectories (c) before environment optimization and (d) after environment optimization.

Show All

Online continuous setting: We proceed to a continuous environment. The agents are initialized randomly in an arbitrarily defined starting region and towards destinations in an arbitrarily defined goal region.

Setup: The agents and obstacles are modeled by positions $\lbrace {\mathbf {p}}_{a,i}\rbrace _{i=1}^{n}$, $\lbrace {\mathbf {p}}_{o,j}\rbrace _{j=1}^{m}$ and velocities $\lbrace {\mathbf {v}}_{a,i}\rbrace _{i=1}^{n}$, $\lbrace {\mathbf {v}}_{o,j}\rbrace _{j=1}^{m}$. At each step, each obstacle has a local policy that generates the desired velocity with neighborhood information and we integrate an acceleration-constrained mechanism for position changes. An episode ends if all agents reach destinations or the episode times out. The maximal acceleration is 0.05 m/$\Delta t^{2}$, the communication radius is 2 m and the episode time is 500 time steps.

Training: The team reward in (12) guides the agents to their destinations as quickly as possible and is defined as \begin{align*} r_{a,i}^{(t)} = \left(\frac{{\mathbf {p}}_{i}^{(t)}-{\mathbf {d}}_{i}}{\Vert {\mathbf {p}}_{i}^{(t)}-{\mathbf {d}}_{i}\Vert _{2}} \cdot \frac{{\mathbf {v}}_{i}^{(t)}}{\Vert {\mathbf {v}}_{i}^{(t)}\Vert _{2}} \right) \Vert {\mathbf {v}}_{i}^{(t)}\Vert _{2} \tag{29} \end{align*} View Sourceat time step $t$, which rewards fast movements towards the destination and penalizes moving away from it. The local reward is the collision penalty. We parameterize the policy with a single-layer GNN. The message aggregation function and feature update function are multi-layer perceptrons (MLPs), and we train the model using PPO.

Performance: The results are shown in Fig. 4(a) and (b). The proposed approach exhibits the best performance for both metrics and maintains a good performance for large scenarios. We attribute the latter to the fact that GNNs exploit topological information for feature extraction and are scalable to larger graphs. The heuristic method performs worse but comparably for small number of obstacles, while degrading with the increasing of obstacles. It is note-worthy that the heuristic method is centralized because it requires computing shortest paths of all agents, and hence is not applicable for online optimization and considered here as a benchmark value only for reference. Fig. 4(c) shows the moving trajectories of agents and example obstacles. We see that obstacles make way for agents to facilitate navigation s.t. agents find trajectories close to their shortest paths.

Figure 4.

(a) And (b) performance of online environment optimization compared to the baselines. Results are averaged over 20 trails and the shaded area shows the std. dev. The RL system is trained on 10 obstacles and tested on 10 to 18 obstacles. (a) SPL (1 is best). (b) PCTSpeed. (c) Example of online environment optimization. Red circles are initial positions and yellow squares are destinations. Grey and brown rectangles are the obstacles before and after environment optimization, and are numbered for exposition. Red-to-yellow lines are trajectories of agents and 5 example obstacles, and the color bar represents the time scale, showing that no agent-agent nor agent-obstacle collision occurs.

Show All

B. Prioritized Environment Optimization

We proceed to prioritized environment optimization with real-world constraints, where the agent priorities are set as $\boldsymbol{\rho }= [2, 1, 0.5, 0.1]^\top$. The environment settings and implementation details are the same as the unprioritized environment optimization in Section VI-A.

Offline discrete setting: We consider the constraint as the maximal round that can be performed by the policy $\pi _{o}$, i.e., the maximal number of grids each obstacle can move, which restricts the capacity of offline environment optimization. This constraint can be satisfied by setting the maximal length of an episode in reinforcement learning and thus, the primal-dual mechanism is not needed in this setting. We consider the maximal round as 8 and measure the performance with the PCTSpeed and the distance ratio, where the latter is the ratio of the shortest distance to the agent's traveled distance.

Performance: We evaluate the proposed approach on 14 obstacles and the results are shown in Fig. 5(a). We see that agent $A_{1}$ with the highest priority exhibits the best performance with the highest PCTSpeed and distance ratio. The performance degrades as the agent priority decreases (from $A_{1}$ to $A_{4}$), which corresponds to theoretical findings in Theorem 3. Fig. 5(c) and (d) display an example of how offline prioritized environment optimization optimizes the obstacle layout. It guarantees the success of all navigation tasks and improves agents' path efficiencies based on their priorities, i.e., it emphasizes the higher-priority agents ($A_{1}$ and $A_{2}$) over the lower-priority agents ($A_{3}$ and $A_{4}$).

$Figure 5. - (a) And (b) performance of offline prioritized environment optimization with agent priorities $\rho _{1} \geq \rho _{2} \geq \rho _{3} \geq \rho _{4}$. (a) PCTSpeeds and distance ratios of different agents on 14 obstacles (1 is best). (b) Success rates of different agents on 24 obstacles. (c) And (d) example of offline prioritized environment optimization. Red circles are initial positions, yellow squares are destinations, and obstacles are numbered for exposition. Color lines from red to yellow are agent trajectories and the color bar represents the time scale. Agent priorities $\lbrace \rho _{i}\rbrace _{i=1}^{4}$ reduce from $A_{1}$ to $A_{4}$, i.e., $\rho _{1} \geq \cdots \geq \rho _{4}$. (c) Navigation failures in original environment. (d) Completeness with offline prioritized environment optimization. PCTSpeeds of agents are $0.98, 0.94, 0.88, 0.86$ and distance ratios are $0.99, 0.98, 0.95, 0.93$, i.e., agent performance increases with the priority.$

Figure 5.

(a) And (b) performance of offline prioritized environment optimization with agent priorities $\rho _{1} \geq \rho _{2} \geq \rho _{3} \geq \rho _{4}$. (a) PCTSpeeds and distance ratios of different agents on 14 obstacles (1 is best). (b) Success rates of different agents on 24 obstacles. (c) And (d) example of offline prioritized environment optimization. Red circles are initial positions, yellow squares are destinations, and obstacles are numbered for exposition. Color lines from red to yellow are agent trajectories and the color bar represents the time scale. Agent priorities $\lbrace \rho _{i}\rbrace _{i=1}^{4}$ reduce from $A_{1}$ to $A_{4}$, i.e., $\rho _{1} \geq \cdots \geq \rho _{4}$. (c) Navigation failures in original environment. (d) Completeness with offline prioritized environment optimization. PCTSpeeds of agents are $0.98, 0.94, 0.88, 0.86$ and distance ratios are $0.99, 0.98, 0.95, 0.93$, i.e., agent performance increases with the priority.

Show All

Partial completeness: We test the trained model on 24 obstacles, where the obstacle region is dense and the obstacle-free area is limited. The success rates of all agents are zero without environment optimization. Fig. 5(b) shows the results with offline prioritized environment optimization. Agent $A_{1}$ with the highest priority achieves the best success rate and the success rate decreases with the agent priority, which corroborates the partial completeness in Corollary 1. We note that the success rate of $A_{1}$ is not one ($100 {\%}$) because (i) we train our model on 14 obstacles but test it on 24 obstacles and (ii) the proposed approach obtains a local not global solution, leading to inevitable performance degradation compared with theoretical analysis.

Online continuous setting: We consider the constraint as the total deviation distance of the obstacles away from their initial positions, i.e., $\sum _{j=1}^{m} \Vert {\mathbf {o}}_{j}^{(t)} - {\mathbf {o}}_{j}^{(0)}\Vert _{2} \leq C_{d}$ for all $t$, where ${\mathbf {o}}_{j}^{(t)}$ is the position of obstacle $j$ at time step $t$ and ${\mathbf {o}}_{j}^{(0)}$ is the initial position. It corresponds to practical scenarios where the obstacles are not allowed moving too far from their original positions.

Performance: We set $C_{d} = 10$ and show the results in Fig. 6(a). Agent $A_{1}$ with the highest priority performs best, and the performance degrades from $A_{1}$ to $A_{4}$ with the decreasing of agent priority, corroborating our analysis in Theorem 4. Fig. 8(b) shows the constraint value as a function of time steps. The total deviation distance of the obstacles is smaller than the deviation bound throughout the navigation procedure, which validates the proposed primal-dual method. Fig. 7(b) shows the moving trajectories of agents and example obstacles in online prioritized environment optimization. The obstacles change positions to improve agents' performance and the higher-priority agents (e.g., $A_{1}$) are emphasized over the lower-priority agents (e.g., $A_{4}$). For example, given the deviation distance constraint, the obstacles create an (almost) shortest path for $A_{1}$ but not for $A_{4}$.

$Figure 6. - Performance of online prioritized environment optimization with agent priorities $\rho _{1} \geq \rho _{2} \geq \rho _{3} \geq \rho _{4}$. (a) PCTSpeeds and distance ratios of different agents on 14 obstacles (1 is best). (b) Success rates of different agents within 150 time steps.$

Figure 6.

Performance of online prioritized environment optimization with agent priorities $\rho _{1} \geq \rho _{2} \geq \rho _{3} \geq \rho _{4}$. (a) PCTSpeeds and distance ratios of different agents on 14 obstacles (1 is best). (b) Success rates of different agents within 150 time steps.

Show All

$Figure 7. - Examples of online prioritized environment optimization. Red circles are initial positions and yellow squares are destinations. Grey and brown rectangles are obstacles before and after environment optimization, and are numbered for exposition. Color lines from red to yellow are agent trajectories and the color bar represents the time scale. Agent priorities $\lbrace \rho _{i}\rbrace _{i=1}^{4}$ reduce from $A_{1}$ to $A_{4}$, i.e., $\rho _{1} \geq \cdots \geq \rho _{4}$. (a) Completeness with deviation distance constraint of $C_{d}=14$. PCTSpeeds of agents are $0.98, 0.97, 0.94, 0.91$ and distance ratios are $0.99, 0.99, 0.98, 0.97$. (b) Completeness with deviation distance constraint of $C_{d}=10$. PCTSpeeds of agents are $0.97, 0.95, 0.88, 0.84$ and distance ratios are $0.98, 0.98, 0.96, 0.94$. (c) Completeness with deviation distance constraint of $C_{d}=4$. PCTSpeeds of agents are $0.86, 0.82, 0.76, 0.63$ and distance ratios are $0.93, 0.92, 0.85, 0.82$. Agent performance increases with the priority.$

Figure 7.

Examples of online prioritized environment optimization. Red circles are initial positions and yellow squares are destinations. Grey and brown rectangles are obstacles before and after environment optimization, and are numbered for exposition. Color lines from red to yellow are agent trajectories and the color bar represents the time scale. Agent priorities $\lbrace \rho _{i}\rbrace _{i=1}^{4}$ reduce from $A_{1}$ to $A_{4}$, i.e., $\rho _{1} \geq \cdots \geq \rho _{4}$. (a) Completeness with deviation distance constraint of $C_{d}=14$. PCTSpeeds of agents are $0.98, 0.97, 0.94, 0.91$ and distance ratios are $0.99, 0.99, 0.98, 0.97$. (b) Completeness with deviation distance constraint of $C_{d}=10$. PCTSpeeds of agents are $0.97, 0.95, 0.88, 0.84$ and distance ratios are $0.98, 0.98, 0.96, 0.94$. (c) Completeness with deviation distance constraint of $C_{d}=4$. PCTSpeeds of agents are $0.86, 0.82, 0.76, 0.63$ and distance ratios are $0.93, 0.92, 0.85, 0.82$. Agent performance increases with the priority.

Show All

Figure 8.

(a) Performance of online prioritized environment optimization with different constraints. (b) Deviation distance constraint of obstacles throughout the navigation procedure. (c) Speed constraint of obstacles throughout the navigation procedure.

Show All

Partial completeness: We corroborate the partial completeness in Corollary 2 by requiring that all agents arrive at destinations within $T_{\max }=150$ time steps. Fig. 6(b) shows that the success rate decreases from higher-priority agents to lower-priority ones. The success rate of $A_{1}$ is not one ($100 {\%}$) because (i) the obtained solution is local not global and (ii) we impose the constraint of deviation distance on the environment optimization.

Constraint bound: We test different constraint bounds $C_{d}=14, 10$ and 4, the decreasing of which corresponds to the increasing of environment restrictions. Fig. 8(b) shows that all variants satisfy the constraints, and Fig. 7 displays three examples of agent and obstacle trajectories. We see that all variants guarantee the success of navigation tasks. The variant with the largest bound performs best with agent trajectories close to the shortest paths, where agent priorities do not play an important role given sufficient resources. The variant with the lowest bound suffers from performance degradation because of the strongest constraint. It puts more emphasis on the higher-priority agents ($A_{1}$ and $A_{2}$) while overlooking the lower-priority agents ($A_{3}$ and $A_{4}$), which corroborates theoretical analysis in Section IV.

Velocity constraint: We show that the proposed approach can handle various constraints. Here, we test a constraint on the total speed of the obstacles, i.e., $\sum _{j=1}^{m} \Vert {\mathbf {v}}_{j}^{(t)}\Vert _{2} \leq C_{v}$ for all $t$. We set $C_{v} = 3.5$ and Fig. 8(c) shows that the speed constraint is satisfied throughout the navigation procedure. The obstacle speed first increases to make way for the agents and then slows down to satisfy the constraint. Fig. 8(a) compares the performance of environment optimization with different constraints to the unconstrained counterpart. The constrained variants achieve comparable performance (slightly worse), but saves the traveled distance and the velocity energy of the obstacles.

SECTION VII.

Conclusion

We proposed novel problems of unprioritized and prioritized environment optimization for multi-agent navigation, each of which contains offline and online variants. By conducting the completeness analysis, we provided conditions under which all navigation tasks are guaranteed success and identified the role played by agent priorities in environment optimization. We imposed constraints on the environment optimization corresponding to real-world restrictions, and formulated the latter as a constrained stochastic optimization problem. We leveraged model-free reinforcement learning together with a primal-dual mechanism to solve the problem. The former overcomes the challenge of explicitly modeling the relation between agents, environment and navigation performance, while the latter handles the constraints. By integrating different information processing architectures (e.g., CNNs and GNNs) for policy parameterization, the proposed approach can adapt to different implementation requirements. Numerical results corroborate theoretical findings, and show adaptability to various objectives and constraints.

Future work will consider the following aspects. First, we can consider agents and obstacles as a heterogeneous system and re-formulate the proposed problem from this perspective. The latter allows us to develop methods that explore the role of heterogeneity in environment optimization. Second, we only consider 2-dimensional environments. Future works will focus on higher-dimensional environments, e.g., multi-drone navigation in 3-dimensional environments or manipulator motion planning in high-dimensional configuration spaces. Third, we plan to evaluate our method in real-world experiments.

Appendix A
Proof of Theorem 1

We prove the theorem as follows. First, we optimize $\Delta$ such that the environment is “well-formed”, i.e., any initial position in ${\mathcal {S}}$ and destination in ${\mathcal {D}}$ can be connected by a collision-free path. Then, we show the optimized environment guarantees the success of all navigation tasks.

Obstacle region optimization: We first optimize $\Delta$ based on ${\mathcal {S}}$ and ${\mathcal {D}}$ to make the environment “well-formed”. To do so, we handle ${\mathcal {S}}$, ${\mathcal {D}}$ and the other space ${\mathcal {E}}\setminus ({\mathcal {S}}\bigcup {\mathcal {D}})$ separately.

From Assumption 1, the initial positions $\lbrace {\mathbf {s}}_{i}\rbrace _{i=1}^{n}$ in ${\mathcal {S}}$ are distributed such that $d({\mathbf {s}}_{i},{\mathbf {s}}_{j}) \geq 2 \hat{r}$ and $d({\mathbf {s}}_{i},\partial {\mathcal {S}}) \geq 2 \hat{r}$. Thus, for any ${\mathbf {s}}_{i}$, there exists a boundary point $\partial {\mathbf {s}}_{i} \in \partial {\mathcal {S}}$ and a path ${\mathbf {p}}_{{\mathbf {s}}_{i}}^{\partial {\mathbf {s}}}$ connecting ${\mathbf {s}}_{i}$ and $\partial {\mathbf {s}}_{i}$ that is collision-free with respect to the other initial positions.
Similar result applies to the destinations $\lbrace {\mathbf {d}}_{i}\rbrace _{i=1}^{n}$ in ${\mathcal {D}}$, i.e., for any ${\mathbf {d}}_{i}$, there exists a boundary point $\partial {\mathbf {d}}_{i} \in \partial {\mathcal {D}}$ and a path ${\mathbf {p}}_{\partial {\mathbf {d}}_{i}}^{{\mathbf {d}}_{i}}$ connecting $\partial {\mathbf {d}}_{i}$ and ${\mathbf {d}}_{i}$ that is collision-free with respect to the other destinations.
Consider $\partial {\mathbf {s}}_{i}$ and $\partial {\mathbf {d}}_{i}$ for agent $A_{i}$. The shortest path ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$ that connects them is the straight path, the area of which is bounded as \begin{align*} \left|{\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}\right| \leq 2 d_{\max }\hat{r} \tag{30} \end{align*} View Sourcebecause $d_{\max }$ is the maximal distance between ${\mathcal {S}}$ and ${\mathcal {D}}$. From $|{\mathcal {E}}\setminus (\Delta \cup {\mathcal {S}}\cup {\mathcal {D}})| \geq 2 n d_{\max } \hat{r}$ in (4), the area of the obstacle-free space in ${\mathcal {E}}\setminus ({\mathcal {S}}\bigcup {\mathcal {D}})$ is larger than $2 n d_{\max } \hat{r}$. Thus, we can always optimize $\Delta$ to $\Delta ^*$ such that the path ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$ is obstacle-free. If ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$ dose not overlap with ${\mathcal {S}}$ and ${\mathcal {D}}$, we can connect $\partial {\mathbf {s}}_{i}$ and $\partial {\mathbf {d}}_{i}$ with ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$ directly. If ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$ passes through ${\mathcal {S}}$ for $K$ times, let ${\mathbf {s}}_{i,e}^{(k)}$ and ${\mathbf {s}}_{i,l}^{(k)}$ be the entering and leaving positions of ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$ on ${\mathcal {S}}$ at $k$th pass for $k=1,{\ldots },K$ with ${\mathbf {s}}_{i,l}^{(0)} = \partial {\mathbf {s}}_{i}$ the initial leaving position. First, we can connect ${\mathbf {s}}_{i,l}^{(k-1)}$ and ${\mathbf {s}}_{i,e}^{(k)}$ by ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$ because ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$ is obstacle-free. Then, there exists a collision-free path ${\mathbf {p}}_{i}^{(k)}$ inside ${\mathcal {S}}$ that connects ${\mathbf {s}}_{i,e}^{(k)}$ and ${\mathbf {s}}_{i,l}^{(k)}$ as described in (i). Same result applies to that ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$ passes through ${\mathcal {D}}$. Therefore, we can connect $\partial {\mathbf {s}}_{i}$ and $\partial {\mathbf {d}}_{i}$ with ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$ and $\lbrace {\mathbf {p}}_{i}^{(k)}\rbrace _{k=1}^{K}$.

By concatenating ${\mathbf {p}}_{{\mathbf {s}}_{i}}^{\partial {\mathbf {s}}_{i}}$, ${\mathbf {p}}_{\partial {\mathbf {s}}_{i}}^{\partial {\mathbf {d}}_{i}}$, $\lbrace {\mathbf {p}}_{i}^{(k)}\rbrace _{k=1}^{K}$ and ${\mathbf {p}}_{\partial {\mathbf {d}}_{i}}^{{\mathbf {d}}_{i}}$, we can establish the path ${\mathbf {p}}_{{\mathbf {s}}_{i}}^{{\mathbf {d}}_{i}}$ connecting ${\mathbf {s}}_{i}$ to ${\mathbf {d}}_{i}$ that is collision-free w.r.t. the other initial positions, destinations and the optimized obstacle region $\Delta ^*$ for $i=1,{\ldots },n$, i.e., the optimized environment is “well-formed”.

Completeness: From Assumption 2 and the fact that the optimized environment is “well-formed”, Theorem 4 in [27] shows that all navigation tasks will be carried out successfully without collision. Therefore, there exists an offline environment optimization scheme that guarantees the success of all navigation tasks completing the proof.

Appendix B
Proof of Theorem 2

We prove the theorem as follows. First, we separate the navigation procedure into $H$ time slices. Then, we optimize the obstacle region based on the agent positions at each time slice and show the completeness of individual time-sliced multi-agent navigation. Lastly, we show the completeness of the entire multi-agent navigation by concatenating individual time slices and complete the proof by limiting the number of time slices to the infinity, i.e., $H \to \infty$.

Navigation procedure separation: Let $T$ be the maximal operation time of trajectories $\lbrace {\mathbf {p}}_{i}\rbrace _{i=1}^{n}$ and $\lbrace [(h-1)T/H, hT/H]\rbrace _{h=1}^{H}$ the separated time slices. This yields intermediate positions $\lbrace {\mathbf {p}}_{i}(hT/H)\rbrace _{h=0}^{H}$ with ${\mathbf {p}}_{i}(0)={\mathbf {s}}_{i}$ and ${\mathbf {p}}_{i}(T)={\mathbf {d}}_{i}$ for $i=1,{\ldots },n$. We can re-formulate the navigation task into $H$ sub-navigation tasks, where the $h$th sub-navigation task of agent $A_{i}$ is from ${\mathbf {p}}_{i}((h-1)T/H)$ to ${\mathbf {p}}_{i}(hT/H)$ and the operation time of the sub-navigation task is $\delta t = T/H$. At each time slice, we first change the obstacle region based on the corresponding sub-navigation task and then navigate the agents until the next time slice.

Obstacle region optimization: We consider each sub-navigation task separately and start from the 1st one. Assume the obstacle region $\Delta$ satisfies \begin{align*} |{\mathcal {E}}\setminus (\Delta \cup {\mathcal {S}}\cup {\mathcal {D}})| > 2 n \hat{r} \Vert \hat{{\mathbf {v}}}\Vert _{2} \delta t. \tag{31} \end{align*} View SourceFor the 1st sub-navigation task, the starting region is ${\mathcal {S}}^{(1)} = \bigcup _{i=1,\ldots,n} B({\mathbf {p}}_{i}(0), r_{i}) = {\mathcal {S}}$ and the destination region is ${\mathcal {D}}^{(1)} = \bigcup _{i=1,\ldots,n} B({\mathbf {p}}_{i}(T/H), r_{i})$. We optimize $\Delta$ based on ${\mathcal {S}}^{(1)}$, ${\mathcal {D}}^{(1)}$ and show the completeness of the 1st sub-navigation task, which consists of two steps. First, we change $\Delta$ to $\widetilde{\Delta }$ such that $|\Delta | = |\widetilde{\Delta }|$ and $\widetilde{\Delta } \subset {\mathcal {E}}\setminus ({\mathcal {S}}^{(1)} \cup {\mathcal {D}}^{(1)})$. This can be completed as follows. From the condition $\Delta \subset {\mathcal {E}}\setminus ({\mathcal {S}}\cup {\mathcal {D}})$ and ${\mathcal {S}}^{(1)} = {\mathcal {S}}$, there is no overlap between $\Delta$ and ${\mathcal {S}}^{(1)}$. For any overlap region $\Delta \bigcap {\mathcal {D}}^{(1)}$, we can change it to the obstacle-free region in ${\mathcal {D}}$ because $\Delta \subset {\mathcal {E}}\setminus ({\mathcal {S}}\cup {\mathcal {D}})$ and $|{\mathcal {D}}^{(1)}| = |{\mathcal {D}}|$, and keep the other region in $\Delta$ unchanged. The resulting $\widetilde{\Delta }$ satisfies $|\widetilde{\Delta }| = |\Delta |$ and $\widetilde{\Delta } \subset {\mathcal {E}}\setminus ({\mathcal {S}}^{(1)} \cup {\mathcal {D}}^{(1)})$. The changed area from $\Delta$ to $\widetilde{\Delta }$ is bounded by $|{\mathcal {D}}^{(1)}|\leq n \pi \hat{r}^{2}$. Second, we change $\widetilde{\Delta }$ to $\Delta ^{(1)}$ such that the environment is “well-formed” w.r.t. the 1st sub-navigation task. The initial position ${\mathbf {p}}_{i}(0)$ and the destination ${\mathbf {p}}_{i}(H/T)$ can be connected by a path ${\mathbf {p}}_{i}^{(1)}$ that follows the trajectory ${\mathbf {p}}_{i}$. Since $\Vert \hat{{\mathbf {v}}}\Vert _{2}$ is the maximal speed and $\delta t$ is the operation time, the area of ${\mathbf {p}}_{i}^{(1)}$ is bounded by $2 \hat{r} \Vert \hat{{\mathbf {v}}} \Vert _{2} \delta t$. Since this holds for all $i=1,{\ldots },n$, we have $\sum _{i=1}^{n} |{\mathbf {p}}_{i}^{(1)}| \leq 2 n \hat{r} \Vert \hat{{\mathbf {v}}} \Vert _{2} \delta t$. From (31), $|{\mathcal {S}}^{(1)}| = |{\mathcal {S}}|$, $|{\mathcal {D}}^{(1)}| = |{\mathcal {D}}|$ and $|\widetilde{\Delta }| = |\Delta |$, we have \begin{align*} |{\mathcal {E}}\setminus \left(\widetilde{\Delta } \cup {\mathcal {S}}^{(1)} \cup {\mathcal {D}}^{(1)}\right)| > 2 n \hat{r} \Vert \hat{{\mathbf {v}}}\Vert _{2} \delta t. \tag{32} \end{align*} View SourceThis implies that the obstacle-free area in ${\mathcal {E}}$ is larger than the area of $n$ paths $\lbrace {\mathbf {p}}_{i}^{(1)}\rbrace _{i=1}^{n}$. Following the proof of Theorem 1, we can optimize $\widetilde{\Delta }$ to $\Delta ^{(1)}$ to guarantee the success of the 1st sub-navigation task. The changed area from $\widetilde{\Delta }$ to $\Delta ^{(1)}$ is bounded by $2 n \hat{r} \Vert \hat{{\mathbf {v}}}\Vert _{2} \delta t - n \pi \hat{r}^{2}$ because the initial positions $\lbrace {\mathbf {p}}_{i}(0)\rbrace _{i=1}^{n}$ and destinations $\lbrace {\mathbf {p}}_{i}(T/H)\rbrace _{i=1}^{n}$ in $\lbrace {\mathbf {p}}_{i}^{(1)}\rbrace _{i=1}^{n}$ are collision-free from the first step, which dose not require any further change of the obstacle region. The total changed area from $\Delta$ to $\Delta ^{(1)}$ can be bounded as \begin{align*} \frac{\big |\left(\Delta \bigcup \Delta ^{(1)}\right) \setminus \left(\Delta \bigcap \Delta ^{(1)}\right)\big |}{2} \leq 2 n \hat{r} \Vert \hat{{\mathbf {v}}}\Vert _{2} \delta t. \tag{33} \end{align*} View Source

From (32), $|\Delta ^{(1)}| = |\widetilde{\Delta }|$ and $\Delta ^{(1)} \subset {\mathcal {E}}\setminus ({\mathcal {S}}^{(1)} \cup {\mathcal {D}}^{(1)})$, the optimized $\Delta ^{(1)}$ satisfies $|{\mathcal {E}}\setminus (\Delta ^{(1)} \cup {\mathcal {S}}^{(1)} \cup {\mathcal {D}}^{(1)})| \geq 2 n \hat{r} \Vert \hat{{\mathbf {v}}}\Vert _{2} \delta t$, which recovers the assumption in (31). Therefore, we can repeat the above process and guarantee the success of $H$ sub-navigation tasks. The entire navigation task is guaranteed success by concatenating these sub-tasks.

Completeness: When $H \to \infty$, we have $\delta t \to 0$. Since the environment optimization time is same as the agent operation time at each sub-navigation task, the obstacle region and the agents can be considered taking actions simultaneously when $\delta t \to 0$. The initial environment condition in (31) becomes \begin{align*} \lim _{\delta t \to 0}| {\mathcal {E}}\setminus (\Delta \cup {\mathcal {S}}\cup {\mathcal {D}})| > 2 n \hat{r} \Vert \hat{{\mathbf {v}}}\Vert _{2} \delta t \to 0. \tag{34} \end{align*} View Sourcewhich is satisfied from the condition ${\mathcal {W}}\setminus (\Delta \cup {\mathcal {S}}\cup {\mathcal {D}}) \ne \emptyset$. The changed area of the obstacle region in (33) becomes \begin{align*} \!\lim _{\delta t \to 0}\!\frac{\big |\left(\Delta ^{(h)} \!\bigcup \! \Delta ^{(h+1)}\right) \!\setminus \! \left(\Delta ^{(h)} \!\bigcap \! \Delta ^{(h+1)}\right)\big |}{2 \delta t} \!\leq \! 2 n \hat{r} \Vert \hat{{\mathbf {v}}}\Vert _{2}. \tag{35} \end{align*} View SourceThat is, if the capacity of the online environment optimization is stronger than $2 n \hat{r} \Vert \hat{{\mathbf {v}}}\Vert _{2}$, i.e., $\dot{\Delta } \geq 2 n \hat{r} \Vert \hat{{\mathbf {v}}} \Vert _{2}$, the navigation task can be carried out successfully without collision. Therefore, there exists an online environment optimization scheme that guarantees the success of all navigation tasks.

Appendix C
Proof of Theorem 3

We start by considering agent $A_{1}$ with the highest priority. From the condition (6), we have \begin{align*} |{\mathcal {E}}\!\setminus \! (\Delta \!\cup \! {\mathcal {S}}\!\cup \! {\mathcal {D}})| \!\geq \! 2 d_{A_{1}} \hat{r}. \tag{36} \end{align*} View SourceFollowing the proof of Theorem 1, we can optimize the obstacle region $\Delta$ to $\Delta ^{(1)}$ such that for the initial position ${\mathbf {s}}_{1}$ and destination ${\mathbf {d}}_{1}$ of agent $A_{1}$, there exists a collision-free path ${\mathbf {p}}_{1}$ that connects ${\mathbf {s}}_{1}$ and ${\mathbf {d}}_{1}$. The area of ${\mathbf {p}}_{1}$ outside ${\mathcal {S}}$ and ${\mathcal {D}}$ is bounded as \begin{align*} |{\mathbf {p}}_{1}| \leq 2 d_{A_{1}} \hat{r}, \tag{37} \end{align*} View Sourcei.e., the traveled distance of agent $A_{1}$ is bounded by $d_{A_{1}}$.

We then consider the second agent $A_{2}$. From Assumption 1, the initial positions $\lbrace {\mathbf {s}}_{i}\rbrace _{i=1}^{n}$ in ${\mathcal {S}}$ are distributed such that $d({\mathbf {s}}_{i},{\mathbf {s}}_{j}) \geq 2 \hat{r}$, $d({\mathbf {s}}_{i},\partial {\mathcal {S}}) \geq 2 \hat{r}$ and ${\mathbf {s}}_{1}$, ${\mathbf {s}}_{2}$ are distributed in the starting components ${\mathcal {S}}_{A_{1}}$, ${\mathcal {S}}_{A_{2}}$. If ${\mathcal {S}}_{A_{1}} = {\mathcal {S}}_{A_{2}}$, there exists a path ${\mathbf {p}}_{{\mathbf {s}}_{2}}^{{\mathbf {s}}_{1}}$ in ${\mathcal {S}}_{A_{1}}$ without changing the obstacle region. If ${\mathcal {S}}_{A_{1}} \ne {\mathcal {S}}_{A_{2}}$, there exist boundary points $\partial {\mathbf {s}}_{1} \in \partial {\mathcal {S}}_{A_{1}}$, $\partial {\mathbf {s}}_{2} \in \partial {\mathcal {S}}_{A_{2}}$ and paths ${\mathbf {p}}_{{\mathbf {s}}_{1}}^{\partial {\mathbf {s}}_{1}}$, ${\mathbf {p}}_{{\mathbf {s}}_{2}}^{\partial {\mathbf {s}}_{2}}$ that connect ${\mathbf {s}}_{1}$, $\partial {\mathbf {s}}_{1}$ and ${\mathbf {s}}_{2}$, $\partial {\mathbf {s}}_{2}$, respectively, and are collision-free with respect to the other initial positions. For the boundary points $\partial {\mathbf {s}}_{2}$ and $\partial {\mathbf {s}}_{1}$, the shortest path ${\mathbf {p}}_{\partial {\mathbf {s}}_{2}}^{\partial {\mathbf {s}}_{1}}$ that connects them is the straight path, the area of which is bounded as \begin{align*} \left|{\mathbf {p}}_{\partial {\mathbf {s}}_{2}}^{\partial {\mathbf {s}}_{1}}\right| \leq 2 d_{\max }({\mathcal {S}}_{A_{2}}, {\mathcal {S}}_{A_{1}})\hat{r}\tag{38} \end{align*} View Sourcewhere $d_{\max }({\mathcal {S}}_{A_{2}}, {\mathcal {S}}_{A_{1}})$ is the distance between ${\mathcal {S}}_{A_{1}}$, ${\mathcal {S}}_{A_{2}}$ and $d_{A_{2}, {\mathcal {S}}} = d_{\max }({\mathcal {S}}_{A_{2}}, {\mathcal {S}}_{A_{1}})$ by definition. Similar result applies to the destinations ${\mathbf {d}}_{1}$, ${\mathbf {d}}_{2}$ in ${\mathcal {D}}_{A_{1}}$, ${\mathcal {D}}_{A_{2}}$. That is, if ${\mathcal {D}}_{A_{1}} = {\mathcal {D}}_{A_{2}}$, there exists a path ${\mathbf {p}}_{{\mathbf {d}}_{1}}^{{\mathbf {d}}_{2}}$ in ${\mathcal {D}}_{A_{1}}$ without changing the obstacle region. If ${\mathcal {D}}_{A_{1}} \ne {\mathcal {D}}_{A_{2}}$, there exist boundary points $\partial {\mathbf {d}}_{1} \in \partial {\mathcal {D}}_{A_{1}}$, $\partial {\mathbf {d}}_{2} \in \partial {\mathcal {D}}_{A_{2}}$ and paths ${\mathbf {p}}_{\partial {\mathbf {d}}_{1}}^{{\mathbf {d}}_{1}}$, ${\mathbf {p}}_{\partial {\mathbf {d}}_{2}}^{{\mathbf {d}}_{2}}$ that connect $\partial {\mathbf {d}}_{1}$, ${\mathbf {d}}_{1}$ and $\partial {\mathbf {d}}_{2}$, ${\mathbf {d}}_{2}$, respectively, and are collision-free with respect to the other destinations. The area of the shortest path ${\mathbf {p}}_{\partial {\mathbf {d}}_{1}}^{\partial {\mathbf {d}}_{2}}$ that connects the boundary points $\partial {\mathbf {d}}_{1}$ and $\partial {\mathbf {d}}_{2}$ is bounded as \begin{align*} \left|{\mathbf {p}}_{\partial {\mathbf {d}}_{1}}^{\partial {\mathbf {d}}_{2}}\right| \leq 2 d_{\max }({\mathcal {D}}_{A_{2}}, {\mathcal {D}}_{A_{1}})\hat{r}\tag{39} \end{align*} View Sourcewhere $d_{\max }({\mathcal {D}}_{A_{2}}, {\mathcal {D}}_{A_{1}})$ is the distance between ${\mathcal {D}}_{A_{1}}$, ${\mathcal {D}}_{A_{2}}$ and $d_{A_{2}, {\mathcal {D}}} = d_{\max }({\mathcal {D}}_{A_{2}}, {\mathcal {D}}_{A_{1}})$ by definition. From the condition (6), we have \begin{align*} |{\mathcal {E}}\!\setminus \! \left(\Delta \!\cup \! {\mathcal {S}}\!\cup \! {\mathcal {D}}\right)| \!\geq \! 2 d_{A_{1}} \hat{r} + 2 \left(d_{A_{2},{\mathcal {S}}}\!+\!d_{A_{2},{\mathcal {D}}}\right)\hat{r}. \tag{40} \end{align*} View SourceThus, we can optimize $\Delta ^{(1)}$ to $\Delta ^{(2)}$ such that the paths ${\mathbf {p}}_{\partial {\mathbf {s}}_{2}}^{\partial {\mathbf {s}}_{1}}$ and ${\mathbf {p}}_{\partial {\mathbf {d}}_{1}}^{\partial {\mathbf {d}}_{2}}$ are obstacle-free. Following the proof of Theorem 1, we can establish the paths ${\mathbf {p}}_{{\mathbf {s}}_{2}}^{{\mathbf {s}}_{1}}$ and ${\mathbf {p}}_{{\mathbf {d}}_{1}}^{{\mathbf {d}}_{2}}$ that connect ${\mathbf {s}}_{2}$, ${\mathbf {s}}_{1}$ and ${\mathbf {d}}_{1}$, ${\mathbf {d}}_{2}$, respectively, and are collision-free w.r.t. the other initial positions, destinations and the optimized obstacle region $\Delta ^{(2)}$. By concatenating ${\mathbf {p}}_{{\mathbf {s}}_{2}}^{{\mathbf {s}}_{1}}$, ${\mathbf {p}}_{1}$ and ${\mathbf {p}}_{{\mathbf {d}}_{1}}^{{\mathbf {d}}_{2}}$, we can establish the collision-free path ${\mathbf {p}}_{2}$ for agent $A_{2}$. The area of ${\mathbf {p}}_{2}$ outside ${\mathcal {S}}$ and ${\mathcal {D}}$ is bounded as \begin{align*} |{\mathbf {p}}_{2}| \leq 2 d_{A_{1}} \hat{r} + 2 \left(d_{A_{2},{\mathcal {S}}}\!+\!d_{A_{2},{\mathcal {D}}}\right)\hat{r}, \tag{41} \end{align*} View Sourcei.e., the traveled distance of agent $A_{2}$ is bounded by $d_{A_{1}} + d_{A_{2},{\mathcal {S}}}+d_{A_{2},{\mathcal {D}}}$.

Lastly, we follow the above procedure to optimize the obstacle region $\Delta ^{(2)}$ to $\Delta ^{(n)}$ for agents $A_{3}, \ldots, A_{n}$, recursively, and establish the paths $\lbrace {\mathbf {p}}_{i}\rbrace _{i=1}^{n}$ that are collision-free w.r.t. the other initial positions, destinations and the optimized obstacle region. The area of ${\mathbf {p}}_{i}$ outside ${\mathcal {S}}$ and ${\mathcal {D}}$ is bounded as⁴ \begin{align*} |{\mathbf {p}}_{i}| \leq 2 d_{A_{1}} \hat{r} + \sum _{j=2}^{i} 2 \left(d_{A_{j},{\mathcal {S}}}+d_{A_{j},{\mathcal {D}}}\right)\hat{r}, \tag{42} \end{align*} View Sourcei.e., the traveled distance of agent $A_{i}$ is bounded by $d_{A_{1}} + \sum _{j=2}^{i} (d_{A_{j},{\mathcal {S}}}+d_{A_{j},{\mathcal {D}}})$ for $i=1,\ldots,n$. This shows the optimized environment is “well-formed” for all agents $\lbrace A_{i}\rbrace _{i=1}^{n}$ and completes the proof by using Assumption 2 and Theorem 4 in [27], i.e., the navigation tasks of all agents can be carried out successfully without collision and the traveled distance of agent $A_{i}$ is bounded by $d_{A_{1}} + \sum _{j=2}^{i} (d_{A_{j},{\mathcal {S}}}+d_{A_{j},{\mathcal {D}}})$ for $i=1,\ldots,n$.

Appendix D
Proof of Corollary 1

Consider a sub-system of $b$ agents $\lbrace A_{i}\rbrace _{i=1}^{b}$ with highest priorities. Following the proof of Theorem 3, we can optimize the obstacle region $\Delta$ to $\Delta ^*$ such that for the initial position ${\mathbf {s}}_{i}$ and destination ${\mathbf {d}}_{i}$ of agent $A_{i}$, there exists a collision-free path ${\mathbf {p}}_{i}$ that connects ${\mathbf {s}}_{i}$ and ${\mathbf {d}}_{i}$ for $i=1,\ldots,b$. Therefore, $\Delta ^*$ is “well-formed” w.r.t. the considered sub-system and the navigation tasks of $b$ agents $\lbrace A_{i}\rbrace _{i=1}^{b}$ can be carried out successfully without collision.

For the rest of the agents $A_{j}$ for $j=b+1,\ldots,n$, assume that the initial position ${\mathbf {s}}_{j}$ of agent $A_{j}$ is within the same starting component as ${\mathbf {s}}_{i}$ of agent $A_{i}$ with $i\in \lbrace 1,\ldots,b\rbrace$, i.e., ${\mathcal {S}}_{A_{j}} = {\mathcal {S}}_{A_{i}}$. Denote by $\partial {\mathbf {s}}_{i}$ the boundary point where the path ${\mathbf {p}}_{i}$ intersects with the boundary of the starting component ${\mathcal {S}}_{A_{i}}$. Since the initial positions $\lbrace {\mathbf {s}}_{i}\rbrace _{i=1}^{n}$ in ${\mathcal {S}}$ are distributed in a way such that $d({\mathbf {s}}_{i_{1}},{\mathbf {s}}_{i_{2}}) \geq 2 \hat{r}$ and $d({\mathbf {s}}_{i},\partial {\mathcal {S}}) \geq 2 \hat{r}$ from Assumption 1 and ${\mathbf {s}}_{j}$ is in the same starting component as ${\mathbf {s}}_{i}$, there exists a path ${\mathbf {p}}_{{\mathbf {s}}_{j}}^{\partial {\mathbf {s}}_{i}}$ in ${\mathcal {S}}_{A_{i}}$ that connects ${\mathbf {s}}_{j}$, $\partial {\mathbf {s}}_{i}$ and is collision-free w.r.t. the other initial positions. Similar result applies to the destinations ${\mathbf {d}}_{j}$ and ${\mathbf {d}}_{i}$. That is, if ${\mathbf {d}}_{j}$ is within the same destination component as ${\mathbf {d}}_{i}$, i.e., ${\mathcal {D}}_{A_{j}} = {\mathcal {D}}_{A_{i}}$, there exists a path ${\mathbf {p}}_{{\mathbf {d}}_{j}}^{\partial {\mathbf {d}}_{i}}$ in ${\mathcal {D}}_{A_{i}}$ that connects ${\mathbf {d}}_{j}$, $\partial {\mathbf {d}}_{i}$ and is collision-free w.r.t. the other destinations. Since the boundary points $\partial {\mathbf {s}}_{i}$ and $\partial {\mathbf {d}}_{i}$ can be connected by the collision-free path ${\mathbf {p}}_{i}$, we can establish the collision-free path ${\mathbf {p}}_{j}$ that connects ${\mathbf {s}}_{j}$ and ${\mathbf {d}}_{j}$ by concatenating ${\mathbf {p}}_{{\mathbf {s}}_{j}}^{\partial {\mathbf {s}}_{i}}$, ${\mathbf {p}}_{i}$ and ${\mathbf {p}}_{{\mathbf {d}}_{j}}^{\partial {\mathbf {d}}_{i}}$. Therefore, the optimized obstacle region $\Delta ^*$ is “well-formed” w.r.t. agent $A_{j}$ as well, the navigation task of which can be carried out successfully without collision. The same result holds for all agents $A_{j} \in \lbrace A_{b+1}, \ldots, A_{n}\rbrace$ satisfying the above conditions, completing the proof.

Appendix E
Proof of Theorem 4

We prove the theorem following Theorem 2. First, we re-formulate the navigation task as $H$ sub-navigation tasks for each agent. Then, we optimize the obstacle region to guarantee the completeness of sub-navigation tasks successively. Lastly, we show the completeness of the entire navigation by concatenating these sub-navigation tasks and complete the proof by limiting $H \to \infty$.

Let $T$ be the maximal operation time required by trajectories $\lbrace {\mathbf {p}}_{i}(t)\rbrace _{i=1}^{n}$ with velocities $\lbrace {\mathbf {v}}_{i}(t)\rbrace _{i=1}^{n}$ and $\lbrace {\mathbf {p}}_{i}(hT/H)\rbrace _{h=0}^{H}$ be the intermediate positions with ${\mathbf {p}}_{i}(0)={\mathbf {s}}_{i}$ and ${\mathbf {p}}_{i}(T)={\mathbf {d}}_{i}$ for $i=1,{\ldots },n$. The goal of the $h$th sub-navigation task for agent $A_{i}$ is from ${\mathbf {p}}_{i}((h-1)T/H)$ to ${\mathbf {p}}_{i}(hT/H)$ and the operation time required by each sub-navigation task is $\delta t = T/H$. In this context, we separate the procedure of online environment optimization as a number of time slices with duration $2\delta t$. At each time slice, we change the obstacle region for sub-navigation tasks with duration $\delta t$ and navigate the agents with duration $\delta t$ in an alternative manner. From the condition (9), the area of the obstacle region that can be changed at each time slice is \begin{align*} 2 b \hat{r} \Vert \hat{{\mathbf {v}}}\Vert _{2} \delta t \!\leq \!\!\frac{\big |\left(\!\Delta \!\bigcup \! \widetilde{\Delta }\right) \!\setminus \! \left(\Delta \bigcap \widetilde{\Delta }\right)\big |}{2} \!\! < \! 2 (b\!+\!1) \hat{r} \Vert \hat{{\mathbf {v}}}\Vert _{2} \delta t \tag{43} \end{align*} View Sourcewhere $\Delta$ is the original obstacle region and $\widetilde{\Delta }$ is the changed obstacle region. From (43) and the proof of Theorem 2, at each time slice, we can change the obstacle region to make the environment “well-formed” w.r.t. the sub-navigation tasks of only $b$ agents instead of all $n$ agents, i.e., it only guarantees the success of $b$ sub-navigation tasks. For the agents whose sub-navigation tasks are not selected for environment optimization, we keep them static until the next time slice when their sub-navigation tasks are selected. By tuning the selections of $b$ sub-navigation tasks across time slices, we prove there exists a selection scheme such that sub-navigation tasks of all agents can be carried out successfully without collision and the navigation time of the agents can be bounded inverse proportionally to their priorities.

Specifically, let $T_{1} = H_{1} (2 \delta t)$ with $H_{1} \geq H$ be the maximal navigation time required by agent $A_{1}$ with the highest priority $\rho _{1}$, i.e., agent $A_{1}$ requires completing its $H$ sub-navigation tasks within $H_{1}$ time slices, and $T_{i} = H_{i} (2 \delta t) = \lceil H_{1} \rho _{1} / \rho _{i} \rceil (2 \delta t)$ be the maximal navigation time required by agent $A_{i}$ with the priority $\rho _{i}$ for $i=2,\ldots,n$. For agent $A_{1}$, there exists a scheme that selects the sub-navigation task of $A_{1}$ for environment optimization $H$ times within $H_{1}$ time slices because $H_{1} \geq H$ and thus, its $H$ sub-navigation tasks can be carried out successfully without collision. By concatenating these sub-tasks, the navigation task of $A_{1}$ can be carried out successfully without collision within the required time.

For agents $A_{1}$ and $A_{2}$, if it holds that \begin{align*} \min \lbrace 1, b\rbrace (H_{2}-H_{1}) + \min \lbrace 2, b\rbrace H_{1} \geq 2 H \tag{44} \end{align*} View Sourcethere exists a scheme that selects the sub-navigation tasks of $A_{1}$ and $A_{2}$ for environment optimization $H$ times during $H_{2}$ time slices, respectively. The minimal operations $\min \lbrace 1, b\rbrace$ and $\min \lbrace 2, b\rbrace$ in (44) represent the facts that single agent can be selected at most once at each time slice, and each time slice can select at most $b$ agents. Thus, the navigation tasks of $A_{1}$ and $A_{2}$ can be carried out successfully without collision within the required time. Analogously for agents $\lbrace A_{1},\ldots,A_{i}\rbrace$, if it holds that \begin{align*} \sum _{j=1}^{i} \min \lbrace j, b\rbrace \left(H_{i+1-j} - H_{i-j}\right) \geq i H \tag{45} \end{align*} View Sourcewith $H_{0} = 0$ by default, there exists a scheme that selects the sub-navigation tasks of $\lbrace A_{1},\ldots,A_{i}\rbrace$ for environment optimization $H$ times during $H_{i}$ time slices, and their navigation tasks can be carried out successfully without collision within the required time. Therefore, we conclude that if \begin{align*} \sum _{j=1}^{n} \min \lbrace j, b\rbrace \left(H_{n+1-j} - H_{n-j}\right) \geq n H \tag{46} \end{align*} View Sourcethere exists a scheme that selects the sub-navigation tasks of all $n$ agents for environment optimization $H$ times during $H_{n}$ time slices, and all navigation tasks can be carried out successfully without collision within the required time. Since $H_{i} = \lceil H_{1} \rho _{1} / \rho _{2} \rceil$, i.e., $H_{i}$ can be represented by $H_{1}$ and the associated priorities for $i=1,\ldots,n$, we can rewrite (46) as \begin{align*} \!\sum _{j=1}^{n}\! \min \lbrace j,\! b\rbrace \left(\left \lceil \frac{H_{1} \rho _{1}}{\rho _{n+1-j}} \right \rceil \!-\! \left \lceil \frac{H_{1} \rho _{1}}{\rho _{n-j}} \right \rceil \right) \!\geq \! n H. \tag{47} \end{align*} View SourceSince $\lbrace \rho _{i}\rbrace _{i=1}^{n}$, $H$ are given and the left-hand side of (47) increases with $H_{1}$, there exists a large enough $H_{1}$ such that (47) holds. Therefore, the navigation tasks of all agents can be carried out successfully and the navigation time of agent $A_{i}$ is bounded by $T_{i} = H_{i} (2 \delta t) = \lceil H_{1} \rho _{1} / \rho _{i} \rceil (2 \delta t)$.

When $H$ is sufficiently large, i.e., $\delta t$ is sufficiently small, the agents and the obstacles can be considered moving simultaneously. Since $H_{1} \geq H$ is sufficiently large as well, we have $ \lceil H_{1} \rho _{1} / \rho _{i} \rceil (2\delta t) \approx 2 H_{1} \rho _{1} \delta t / \rho _{i}$. With this observation and the conclusion obtained from (47), we complete the proof, i.e., the navigation tasks of all agents can be carried out successfully without collision and the navigation time of agent $A_{i}$ is bounded by $T_{i} = C_{T} / \rho _{i}$ for $i=1,\ldots,n$ where $C_{T} = 2 H_{1} \rho _{1} \delta t$ is the time constant.

Appendix F
Proof of Corollary 2

From the proof of Theorem 4, for any integer $1 \leq \eta \leq n$, if it holds that \begin{align*} \sum _{j=1}^{\eta }\! \min \lbrace j, b\rbrace \left(\left \lceil \! \frac{H_{1} \rho _{1}}{\rho _{\eta +1-j}} \right \rceil \!-\! \left \lceil \frac{ H_{1} \rho _{1}}{\rho _{\eta -j}} \right \rceil \right) \!\geq \! \eta H \tag{48} \end{align*} View Sourcethere exists a scheme with a sufficiently large $H_{1}$ that selects the sub-navigation tasks of $\eta$ agents for environment optimization $H$ times during $H_{\eta } = \lceil H_{1} \rho _{1} / \rho _{\eta +1-j} \rceil$ time slices and the navigation tasks of agents $\lbrace A_{i}\rbrace _{i=1}^{\eta }$ can be carried out successfully without collision within time $\lbrace T_{i}\rbrace _{i=1}^\eta$, respectively. Since the agents are required to reach their destinations within time $T_{\max }$, i.e., the maximal navigation time is $T_{\max }$, the maximal (allowed) number of time slices is $H_{\max } = \lfloor T_{\max } / 2 \delta t \rfloor$. This is equivalent to requiring \begin{align*} H_{\eta } = \left \lceil \frac{H_{1} \rho _{1}}{\rho _{\eta }} \right \rceil \leq \left \lfloor \frac{T_{\max }}{2 \delta t} \right \rfloor \tag{49} \end{align*} View Sourcebecause $H_{\eta }$ is the maximal number of time slices required by the first $\eta$ agents. Thus, we have \begin{align*} H_{1} \leq \frac{\big \lfloor \frac{T_{\max }}{2 \delta t} \big \rfloor \rho _\eta }{\rho _{1}}. \tag{50} \end{align*} View SourceSince the left-hand side of (48) increases with $H_{1}$, substituting (50) into (48) yields \begin{align*} &\sum _{j=1}^{\eta }\! \min \lbrace j, b\rbrace \left(\left \lceil \! \frac{H_{1} \rho _{1}}{\rho _{\eta +1-j}} \right \rceil \!-\! \left \lceil \frac{ H_{1} \rho _{1}}{\rho _{\eta -j}} \right \rceil \right) \\ & \quad\leq \sum _{j=1}^{\eta }\! \min \lbrace j, b\rbrace \left(\left \lceil \frac{\big \lfloor \frac{T_{\max }}{2 \delta t} \big \rfloor \rho _\eta }{\rho _{\eta +1-j}} \right \rceil \!-\! \left \lceil \frac{\big \lfloor \frac{T_{\max }}{2 \delta t} \big \rfloor \rho _\eta }{\rho _{\eta -j}} \right \rceil \right). \tag{51} \end{align*} View SourceBy substituting (51) into (48), we have \begin{align*} \!\sum _{j=1}^{\eta }\! \min \lbrace j, b\rbrace \left(\left \lceil \frac{\left \lfloor \frac{T_{\max }}{2 \delta t} \right \rfloor \rho _\eta }{\rho _{\eta +1-j}} \right \rceil \!-\! \left \lceil \frac{\left \lfloor \frac{T_{\max }}{2 \delta t} \right \rfloor \rho _\eta }{\rho _{\eta -j}} \right \rceil \right) \!\geq \! \eta H. \tag{52} \end{align*} View SourceBy using the fact $T = H \delta t$ where $T$ is the maximal operation time of trajectories $\lbrace {\mathbf {p}}_{i}(t)\rbrace _{i=1}^{n}$, we get \begin{align*} \!\sum _{j=1}^{\eta }\! \min \lbrace j, b\rbrace \left(\left \lceil \frac{\left \lfloor \frac{HT_{\max }}{\text{2}\,T} \right \rfloor \rho _\eta }{\rho _{\eta +1-j}} \right \rceil \!-\! \left \lceil \frac{\left \lfloor \frac{H T_{\max }}{\text{2}\,T} \right \rfloor \rho _\eta }{\rho _{\eta -j}} \right \rceil \right) \!\!\geq \! \eta H. \tag{53} \end{align*} View SourceWhen $H$ is sufficiently large, we have $\lceil \lfloor HT_{\max }/(\text{2}\,T) \rfloor \rho _\eta /\rho _{\eta +1-j}\rceil \approx HT_{\max }\rho _\eta /(\text{2}\,T \rho _{\eta +1-j})$ and $\lceil \lfloor HT_{\max }/(\text{2}\,T) \rfloor \rho _\eta /\rho _{\eta -j}\rceil \approx HT_{\max }\rho _\eta /(\text{2}\,T \rho _{\eta -j})$, and (53) is equivalent as \begin{align*} \!\sum _{j=1}^{\eta }\! \min \lbrace j, b\rbrace \left(\frac{T_{\max } \rho _\eta }{\text{2}\,T \rho _{\eta +1-j}} \!-\! \frac{T_{\max } \rho _\eta }{\text{2}\,T \rho _{\eta -j}} \right) \!\geq \! \eta. \tag{54} \end{align*} View SourceBy setting $n_{p}$ as the maximal $\eta$ that satisfies (54) and following the proof of Theorem 4, the navigation tasks of $\lbrace A_{i}\rbrace _{i=1}^{n_{p}}$ can be carried out successfully without collision within the required time $T_{\max }$, completing the proof.

Appendix G
Proof of Theorem 5

For a feasible solution of problem (19) with the constraint constant ${\mathcal {C}}_{q} = (1-\delta + \epsilon)/(1-\gamma)$ [cf. (18)], it satisfies that \begin{align*} &\mathbb {E}\left[\!\sum _{t=0}^\infty \!\! \gamma ^{t}\! \mathbb {1}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}\!,\! {\mathbf {U}}_{o}^{(t)}\right) \!\leq \! 0\right]\right] \!\!=\!\! \sum _{t=0}^\infty \! \gamma ^{t} \mathbb {P}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) \!\le\! 0\right] \\ &\quad \geq \frac{1-\delta + \epsilon }{1-\gamma } = \frac{1}{1-\gamma } - \frac{\delta - \epsilon }{1-\gamma },\,\text{ for} \, q=1,\ldots,Q \tag{55} \end{align*} View Sourcewhere the linearity of the expectation and (26) are used. For the term $\sum _{t=0}^\infty \gamma ^{t} \mathbb {P}[h_{q}({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}) \leq 0]$, we have \begin{align*} &\sum _{t=0}^\infty \!\! \gamma ^{t} \mathbb {P}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}\!,\! {\mathbf {U}}_{o}^{(t)}\!\right) \!\leq \! 0\right] \!\!=\!\! \sum _{t=0}^\infty \!\! \gamma ^{t} \!\!-\!\! \sum _{t=0}^\infty \!\! \gamma ^{t} \mathbb {P}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}\!\!,\! {\mathbf {U}}_{o}^{(t)}\!\right) \!>\! 0\right] \\ &\quad = \frac{1}{1-\gamma } - \sum _{t=0}^\infty \gamma ^{t} \mathbb {P}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) > 0\right]. \tag{56} \end{align*} View SourceBy comparing (55) and (56), we get \begin{align*} &\sum _{t=0}^\infty \gamma ^{t} \mathbb {P}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) > 0\right] \leq \frac{\delta - \epsilon }{1 - \gamma }. \tag{57} \end{align*} View SourceFor the maximal time horizon $T$, we have \begin{align*} &\sum _{t=0}^{T} \gamma ^{t} \mathbb {P}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) > 0\right] \\ &\quad\leq \sum _{t=0}^\infty \gamma ^{t} \mathbb {P}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) > 0\right] \leq \frac{\delta - \epsilon }{1 - \gamma } \tag{58} \end{align*} View Sourcebecause each term in the summation is non-negative. Then, by setting $\epsilon = \delta (1-\gamma ^{T} (1 - \gamma)) < \delta$ and substituting the latter into (58), we get $\sum _{t=0}^{T} \gamma ^{t} \mathbb {P}[h_{q}({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}) > 0] \leq \gamma ^{T} \delta$. Since $\gamma ^{T} \leq \gamma ^{t}$ for all $t \leq T$, we have $\sum _{t=0}^{T} \gamma ^{T} \mathbb {P}[h_{q}({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}) > 0] \leq \gamma ^{T} \delta$ and thus \begin{align*} &\sum _{t=0}^{T} \mathbb {P}\left[h_{q}\left({\mathbf {X}}_{o}^{(t)}, {\mathbf {U}}_{o}^{(t)}\right) > 0\right] \leq \delta. \tag{59} \end{align*} View SourceBy leveraging the Boole-Frechet-Bonferroni inequality, we complete the proof that for each step $0 \!\leq \! t \!\leq \! T$, it holds that \begin{align*} \mathbb {P}\left[\cap _{0 \leq \tau \leq t} \left\lbrace h_{q}\left({\mathbf {X}}_{o}^{(\tau)}, {\mathbf {U}}_{o}^{(\tau)}\right) \leq 0\right\rbrace \right] \geq 1-\delta. \tag{60} \end{align*} View Source

References is not available for this document.

Constrained Environment Optimization for Prioritized Multi-Agent Navigation

Alerts

Abstract:

Metadata

Abstract:

Funding Agency:

Introduction

Problem Formulation

Definition 1:

Problem 1 (Unprioritized Environment Optimization):

Problem 2 (Prioritized Environment Optimization):

Remark 1:

Completeness of Unprioritized System

A. Offline Environment Optimization

Assumption 1:

Assumption 2:

Theorem 1:

Proof:

B. Online Environment Optimization

Assumption 3:

Theorem 2:

Proof:

Completeness of Prioritized System

A. Offline Prioritized Environment Optimization

Assumption 4:

Theorem 3:

Proof:

Corollary 1:

Proof:

B. Online Prioritized Environment Optimization

Theorem 4:

Proof:

Corollary 2:

Proof:

Remark 2:

Methodology

A. Reinforcement Learning

B. Primal-Dual Policy Gradient

Algorithm 1: Primal-Dual Policy Gradient Method.

C. Constraint Guarantees

Definition 2 ($(1-\delta)$-constrained solution):

Theorem 5:

Proof:

D. Information Processing Architecture

Experiments

A. Unprioritized Environment Optimization

B. Prioritized Environment Optimization

Conclusion

Appendix AProof of Theorem 1

Proof of Theorem 1

Appendix BProof of Theorem 2

Proof of Theorem 2

Appendix CProof of Theorem 3

Proof of Theorem 3

Appendix DProof of Corollary 1

Proof of Corollary 1

Appendix EProof of Theorem 4

Proof of Theorem 4

Appendix FProof of Corollary 2

Proof of Corollary 2

Appendix GProof of Theorem 5

Proof of Theorem 5

Authors

Figures

References

Citations

Keywords

Metrics

Footnotes

References

IEEE Account

Purchase Details

Profile Information

Need Help?

Appendix A
Proof of Theorem 1

Appendix B
Proof of Theorem 2

Appendix C
Proof of Theorem 3

Appendix D
Proof of Corollary 1

Appendix E
Proof of Theorem 4

Appendix F
Proof of Corollary 2

Appendix G
Proof of Theorem 5