I. Introduction
In recent years, deep-learning-based neural network algorithms have made significant progress in tasks such as LiDAR 3D detection and tracking for individual vehicles [1], [2]. Several mature detection algorithms have been deployed on a growing number of autonomous vehicles [3], [4], [5], [6]. Despite these achievements, single-agent perception systems still face substantial limitations in complex environments, owing to a restricted field of view and limited capability to handle intricate urban scenarios. Among such environments, intersections are one of the most challenging. An intersection typically hosts a diverse array of road users, including vehicles, pedestrians, cyclists, and scooters, each with distinct movement patterns and safety needs. A key challenge in enhancing intersection safety is therefore obtaining accurate, detailed, and real-time data that captures road users' classification and movement.