I. Introduction
Over the past decade, we have witnessed tremendous progress in Autonomous Vehicles (AVs) [1]–[14] and intelligent transportation systems [15]–[17]. Soon, these autonomous systems will be deployed on roads at scale, opening up opportunities for cooperation among them. Previous works [18]–[23] have demonstrated that, by leveraging Vehicle-to-Everything (V2X) communication technology, AVs and infrastructure can perform cooperative perception with shared sensing information and thus significantly enhance perception performance [24]–[28]. Despite this remarkable improvement, these works evaluate the proposed systems on datasets of natural scenarios that contain few safety-critical scenes. In such challenging scenes, these systems may perform poorly, so identifying them is crucial for fully understanding the robustness of existing cooperative perception systems.

A straightforward solution is to collect a wide range of testing scenes in the real world and identify the critical ones. However, compared to single-agent systems, gathering and labeling data for multi-agent systems is far more costly and time-consuming. A more cost-effective alternative is to generate large-scale realistic scenes [29]–[31] in high-fidelity simulators. Yet these approaches only consider common scenes and lack the capability to stress-test target systems on corner cases.