I. INTRODUCTION
Accurate depth information is essential for autonomous vehicles and robots to perceive and interact with their environments. Deep learning has enabled networks that infer depth directly from RGB images, and, motivated by the abundance of unlabeled real-world data, self-supervised methods that learn from monocular videos have attracted growing interest [1], [2], [3].