I. Introduction
In recent years, Bird's Eye View (BEV) perception has emerged as a crucial component in autonomous driving and robotic systems [1], [2], [3], [4]. Its ability to aggregate multi-view data and transform the surrounding environment into a unified top-down representation makes it highly effective and versatile for tasks such as object detection, segmentation, trajectory prediction, and planning [5], [6]. A typical BEV perception model comprises an image backbone, a BEV encoder, and task-specific heads [7]. While many prior efforts have focused on optimizing the design of BEV encoders and task heads to improve performance [8], [9], much less attention has been paid to enhancing BEV perception from a representation learning perspective. We argue that learned representations are central to a model's performance, and that improving them can yield consistent gains across diverse BEV architectures, offering broad benefits complementary to task-specific designs.