I. Introduction
Monocular Depth Estimation (MDE) is the task of estimating depth from a single image, typically with a deep neural network (DNN). By predicting the distance of each pixel in a 2D image, it enables 2D-to-3D projection [2], which makes it a cost-effective alternative to expensive Lidar sensors. MDE has a wide range of applications, including autonomous driving [3], visual SLAM [4], and visual relocalization [5]. Self-supervised MDE in particular has become increasingly popular in industry (e.g., Tesla Autopilot [3]) because it achieves comparable accuracy without requiring ground-truth depth data from Lidar during training. However, owing to the known vulnerabilities of DNNs [6], [7], several digital-world [8], [9] and physical-world [10] adversarial attacks have been developed against MDE. These attacks primarily use optimization-based methods to craft adversarial inputs that deceive MDE models. Given the importance and widespread use of MDE models, such attacks pose a substantial security risk to applications such as autonomous driving. As a result, there is an urgent need to harden MDE models against these threats.