I. Introduction
Reinforcement learning (RL) has achieved remarkable success in many areas [1], [2], [3], [4], yet safety remains a central concern when deploying RL algorithms in real-world applications [2], [5]. For example, a real-world robot should avoid crashing into unsafe areas or harming humans while exploring its environment, and a recommender system should not present false information to customers. Although safe RL has received substantial attention in recent years [6] and numerous safe RL methods have been proposed to ensure safety [7], safety in multiagent RL (MARL) remains an open problem [8]. Owing to the instability of multiagent systems, safe MARL is more complicated and challenging than safe single-agent RL [2]. In safe MARL settings, each agent must optimize its own reward and safety while accounting for the rewards and safety of the other agents.

To date, only a few safe MARL methods have been proposed, e.g., MACPO [9], MAPPO-L [9], CMIX [10], and safe MARL via shielding [11], and their stability and convergence analyses are still lacking. More importantly, most safe RL methods that rely on hard constrained policy optimization, e.g., CPO [12], FOCOPS [13], PCPO [14], MACPO [9], and MAPPO-L [9], require manually fine-tuned safety bounds. This fine-tuning imposes an additional research burden, and to date there is no principle guiding the choice of safety bounds beyond empirical experimentation. Moreover, the policy updates of these algorithms must satisfy the safety bounds at every iteration, which can cause oscillation during policy exploration and degrade both reward and safety performance.

In this study, we propose a safe learning framework for MARL based on soft constrained policy optimization, which serves as a plug-and-play module for existing RL algorithms without fine-tuning safety bounds. From this framework, we derive two algorithms: safe multiagent trust region policy optimization (SM-TRPO) and safe multiagent proximal policy optimization (SM-PPO). Experimental results demonstrate that, without fine-tuning safety bounds, our algorithms achieve performance comparable to that of state-of-the-art baselines that do not consider safety, while ensuring agent safety.