Abstract:
Real-time multi-agent motion forecasting is essential for autonomous vehicles to navigate urban roads safely. Recent advances in this field demonstrate the superior performance of Query-Centric Networks on large-scale data and long-term prediction. However, these methods often suffer from high space and time complexity, which limits their practicality and impedes further research. For example, training QCNet requires 8 RTX 3090 GPUs, and its inference takes more than 100 ms for a single busy scene. This is primarily due to the heavy spatial attention layers in the encoder and the redundant queries for each modal prediction in the decoder. This letter proposes an Efficient Query-Centric Network, EQNet, for multi-agent motion forecasting. EQNet employs a state-space model (SSM) for temporal motion encoding and introduces a cascade decoder structure that shares the queries among all motion modalities for spatial attention. Our research yields three key findings. First, placing spatial attention layers in the decoder is more efficient than placing them in the encoder. Second, SSMs serve as highly efficient temporal motion encoders. Finally, cascade decoders are efficient multi-modal motion decoders. These innovations significantly reduce virtual memory usage (approximately 4× reduction) and improve processing speed (up to 2× speedup) without a significant sacrifice in performance on the Argoverse 2 benchmark.
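
To illustrate why a state-space model can serve as an efficient temporal encoder, the following is a minimal sketch of a generic diagonal linear SSM recurrence, h_t = a * h_{t-1} + b * x_t with readout y_t = c · h_t. It is not EQNet's actual encoder: the function name `ssm_encode`, the scalar input, and the parameters `a`, `b`, `c` are assumptions chosen for illustration only.

```python
from typing import List, Sequence, Tuple


def ssm_encode(
    xs: Sequence[float],
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Run a diagonal linear state-space recurrence over a scalar sequence.

    Hypothetical illustration, not the method from the letter:
        h_t = a * h_{t-1} + b * x_t   (elementwise over the state)
        y_t = sum(c * h_t)            (linear readout)
    Returns the per-step outputs and the final hidden state.
    """
    h = [0.0] * len(a)  # hidden state starts at zero
    ys: List[float] = []
    for x in xs:
        # One state update per time step: cost depends only on the state size.
        h = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys, h


# Usage: a one-dimensional state with decay 0.5 driven by a unit impulse.
outputs, final_state = ssm_encode([1.0, 0.0, 0.0], a=[0.5], b=[1.0], c=[2.0])
```

Each step touches only the fixed-size state, so the total cost grows linearly with the number of time steps, whereas full temporal self-attention compares every pair of steps and grows quadratically.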