I. Introduction
In recent years, a growing number of underwater mobile observation devices, such as remotely operated vehicles (ROVs) and autonomous underwater vehicles (AUVs), have been equipped with high-performance vision sensors, and vision-based underwater remote sensing technology is receiving increasing attention [1], [2], [3], [4], [5]. Depth estimation from a single underwater image is one of the fundamental tasks of underwater visual perception. It is important for the analysis, understanding, and even reconstruction of underwater scenarios. Thanks to the rapid development of deep learning techniques, monocular depth estimation on land has been extensively studied and has made great progress [6], [7], [8], [9], [10]. However, owing to the particular characteristics of the underwater environment, it is difficult and costly to use time-of-flight (ToF) [11] strategies and structured light sensors [12] to collect real-world dense depth information, and thus to build large, general-purpose underwater depth datasets. Therefore, supervised learning, such as deep regression with convolutional neural networks (CNNs), cannot be applied directly to underwater scenarios as it is in the land-based depth estimation task.