
Advancing Underwater Vision: A Survey of Deep Learning Models for Underwater Object Recognition and Tracking



Abstract:

Underwater computer vision plays a vital role in ocean research, enabling autonomous navigation, infrastructure inspections, and marine life monitoring. However, the underwater environment presents unique challenges, including color distortion, limited visibility, and dynamic light conditions, which hinder the performance of traditional image processing methods. Recent advancements in deep learning (DL) have demonstrated remarkable success in overcoming these challenges by enabling robust feature extraction, image enhancement, and object recognition. This review provides a comprehensive analysis of cutting-edge deep learning architectures designed for underwater object detection, segmentation, and tracking. State-of-the-art (SOTA) models, including AGW-YOLOv8, Feature-Adaptive FPN, and Dual-SAM, have shown substantial improvements in addressing occlusions, camouflaging, and small underwater object detection. For tracking tasks, transformer-based models like SiamFCA and FishTrack leverage hierarchical attention mechanisms and convolutional neural networks (CNNs) to achieve high accuracy and robustness in dynamic underwater environments. Beyond optical imaging, this review explores alternative modalities such as sonar, hyperspectral imaging, and event-based vision, which provide complementary data to enhance underwater vision systems. These approaches improve performance under challenging conditions, enabling richer and more informative scene interpretation. Promising future directions are also discussed, emphasizing the need for domain adaptation techniques to improve generalizability, lightweight architectures for real-time performance, and multi-modal data fusion to enhance interpretability and robustness. By critically evaluating current methodologies and highlighting gaps, this review provides insights for advancing underwater computer vision systems to support ocean exploration, ecological conservation, and disaster management.
Published in: IEEE Access (Volume: 13)
Page(s): 17830 - 17867
Date of Publication: 24 January 2025
Electronic ISSN: 2169-3536


SECTION I.

Introduction

Underwater environments present unique challenges for computer vision tasks due to color distortion, image blur, and light scattering. Traditional computer vision methods are often not robust enough to deal with the domain shifts in such conditions, leading to unreliable outcomes [1], [2]. These limitations severely hinder the utilization of object recognition and tracking algorithms in ocean applications, including the autonomous monitoring of marine life and underwater infrastructures. Deep learning (DL) offers an effective solution to underwater object recognition and tracking challenges. These algorithms can extract meaningful features from distorted, low-visibility images, allowing for more accurate identification and tracking of objects [3].

Case studies demonstrate the scientific and practical applications of DL models in marine biology, oceanography, and underwater exploration. In aquaculture, for example, DL models for fish classification have achieved remarkable performance, with MLR-VGGNet [4] and mResNet [5] achieving accuracies of 97.09% and 96.89% respectively on the Fish4Knowledge dataset [6]. For the more complex task of multi-fish tracking under occlusions, models such as Fish-Track [7], CMFTNet [8], and GN-YOLOv5 [9] have reached a Multiple Object Tracking Accuracy (MOTA) of up to 94.8%. Other notable marine life monitoring applications include the segmentation and detection of corals [10], tiny benthic organisms [11], and camouflaged marine animals [12].

For ocean exploration, DL models support the navigation of Autonomous Underwater Vehicles (AUVs). Models like UNDERWATER-CUT [13] and joint FPN models [14] achieve mAP50 scores of 85% and 80.12% for underwater obstacle detection and general object detection, respectively. Additionally, models such as YOLOTrashCan [15] and MLDet [16] have demonstrated strong performance in marine debris detection, achieving mAP50 values of up to 65.01%. Recent advancements also emphasize the integration of multi-sensor fusion approaches, enabling simultaneous object detection and localization in challenging underwater conditions [17]. These advancements highlight the versatility of DL in addressing key challenges across diverse underwater applications.

In this review, we discuss the inherent challenges in underwater computer vision tasks in ocean applications and examine various DL architectures developed to address them, including Convolutional Neural Networks (CNNs), Generative Adversarial Networks (GANs), and Transformer Networks. We also explore promising future directions, such as developing benchmark datasets for standardized performance evaluation [18], dealing with challenging camouflaging and occlusion conditions [2], [19], introducing multi-modal data [20], and improving real-time performance [21]. Figure 1 illustrates the pipeline of a typical underwater computer vision system. The underwater environment is captured using an optical camera or other imaging modalities. The captured image is then processed through a dehazing and enhancement model to reduce blurring and correct color distortions caused by water conditions. Once the image has been improved, various tasks such as classification, detection, segmentation, and tracking are performed to analyze the content.

FIGURE 1. Pipeline and taxonomy of deep learning models for underwater object recognition and tracking.
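As a rough illustration of this pipeline, the sketch below chains placeholder enhancement, detection, and tracking stages over a video stream. The callable names are hypothetical stand-ins for the models surveyed in the following sections, not a specific published system.

```python
import cv2  # OpenCV is assumed available for frame capture

def process_stream(video_path, enhance, detect, track):
    """Capture -> enhancement/dehazing -> recognition -> tracking (Figure 1)."""
    cap = cv2.VideoCapture(video_path)
    tracks = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        restored = enhance(frame)           # e.g., a CNN/GAN dehazing model (Section II)
        detections = detect(restored)       # class labels, boxes, or masks (Section III)
        tracks = track(tracks, detections)  # associate detections across frames (Section IV)
    cap.release()
    return tracks
```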

A. Traditional Approaches and Modern Deep Learning

Traditional computer vision methods rely on mathematical algorithms and hand-crafted features for tasks such as underwater image preprocessing, object recognition, and tracking. Image formation models in conjunction with histogram equalization and channel priors are used to enhance underwater images by restoring color and reducing scattering effects [3]. Traditional object recognition methods employ hand-crafted feature extractors like SIFT and HOG [22], while object tracking uses methods such as optical flow, Kalman filters, and background subtraction to follow objects across video frames [23].

Although computationally efficient, these methods often struggle in underwater environments due to challenges like poor visibility, uneven illumination, color distortion, light absorption, and scattering. These factors degrade hand-crafted features, limiting the robustness and generalization of classical approaches [24].

Deep learning (DL), on the other hand, leverages deep neural networks to automatically learn patterns and features directly from data, eliminating the reliance on hand-crafted features. By training models on large datasets that include varying lighting conditions, noise, and underwater distortions, DL methods achieve greater robustness in underwater computer vision tasks [1]. These models excel in recognition and tracking by adapting to the complexities of underwater environments, such as dynamic lighting changes and diverse noise patterns, making them more reliable for real-world applications. This adaptability has made DL the preferred choice for modern underwater computer vision solutions [24].

However, DL models are typically computationally expensive and require significant resources for both training and inference. Such approaches depend heavily on large datasets, which can be costly and challenging to acquire in underwater scenarios. Additionally, DL models often function as black boxes, making it difficult to interpret their decision-making processes. These limitations highlight the need for future research to focus on improving real-time performance, reducing data requirements, and enhancing interpretability without compromising accuracy [25].

B. Motivation

In recent years, numerous surveys have focused on underwater computer vision algorithms. Shuang et al. [28] and Saad Saoud et al. [1] provide comprehensive reviews of underwater image dehazing and quality enhancement algorithms, offering insights into image formation models, classical approaches, and deep learning (DL) models. However, these reviews do not address underwater object recognition or tracking. Surveys that encompass various underwater computer vision algorithms include the work by González-Sabbagh et al. [3], which briefly touches on underwater object recognition but does not cover object tracking.

Similarly, reviews by Wang et al. [26] and Naveen et al. [27] discuss object recognition but exclude tracking. Furthermore, the classification frameworks in these two reviews are simplified with respect to DL model architectures, placing greater emphasis on applications. The review by Xu et al. [24] explores the relationship between object detection accuracy and underwater image dehazing techniques, but does not address object classification, segmentation, or tracking. Moreover, these surveys primarily focus on algorithms and architectures based on optical imaging, with some extending to sonar images. However, other imaging modalities, such as hyperspectral imaging, neuromorphic vision, and laser imaging, remain largely unexplored.

This review addresses the aforementioned gaps by providing a comprehensive classification framework for DL models covering multiple imaging modalities, including optical, acoustic, hyperspectral, and neuromorphic imaging. It also highlights the integration of multi-modal systems. It discusses advancements in underwater object recognition, segmentation, and tracking, with an emphasis on recent trends such as transformer-based architectures and lightweight AI models for embedded systems. Furthermore, this survey provides an evaluation of benchmarking datasets, offering critical insights into dataset limitations and guiding future research directions. A summary of the existing reviews is presented in Table 1.

TABLE 1 Overview of Existing Reviews of Underwater Computer Vision Algorithms and Models

To gain a comprehensive understanding of the research landscape, we focused on articles that included the keywords “Underwater Vision” or “Underwater Image” in conjunction with terms relevant to our research interests, such as “Deep Learning”, “tracking”, “detection”, “segmentation”, “enhancement”, and “dehazing”. This survey utilized the Scopus database to analyze publications from 2014 to 2024. The results, depicted in Figure 2, illustrate the annual number of publications within the specified area, as defined by the selected keywords. The top publication venues for underwater perception-related research during the same time period are presented in Figure 3. The annual increase in publications demonstrates the growing global interest in this field. This rapid rise reflects an expanding research effort to address the complexities of underwater image enhancement (UIE) and recognition and to develop innovative methods for their study. A key factor driving this heightened interest is the progress in DL, which has greatly enhanced researchers’ ability to study and interpret underwater scenes.

FIGURE 2. Number of annual publications over the last decade in the Scopus Database when searching for the keywords “Underwater Vision” or “Underwater Image” with terms relevant to our survey, such as “Deep Learning”, “tracking”, “detection”, “segmentation”, “enhancement”, and “dehazing”.

FIGURE 3. Journals/Conferences with the highest number of publications in the Scopus Database when searching for the keywords “Underwater Vision” or “Underwater Image” with terms relevant to our survey, such as “Deep Learning”, “tracking”, “detection”, “segmentation”, “enhancement”, and “dehazing”.

C. Contributions

This paper serves as a review for researchers interested in applying DL to ocean-related applications, providing critical insights and practical applications that can guide future work in this evolving field. The key contributions of this review are:

  • Comprehensive Review: The literature on DL models for underwater computer vision is reviewed, highlighting the latest developments in image enhancement and preprocessing, as well as object recognition. Architectures such as CNNs, GANs, and Transformers are analyzed for their effectiveness in underwater scenarios, offering insights into their applications for object detection and tracking.

  • Application Showcase: The review highlights the practical applications of DL in ocean research, demonstrating its versatility across various domains. Key applications include coral reef monitoring and segmentation, the identification and tracking of marine life such as fish, vision-based obstacle avoidance and navigation for underwater unmanned vehicles, pollution monitoring through marine debris detection, and finally, facilitating the autonomous inspection of underwater structures such as marine turbines and dams.

  • Comparative Analysis: The discussed DL models in this study are analyzed and compared based on their reported metrics, giving insights into the best-performing architectures for each application.

  • Multi-modal Analysis: Beyond optical imaging, various alternative modalities are explored, including Sonar, LiDAR, stereo cameras, hyperspectral imaging, event cameras, magnetic imaging, and polarized light imaging. These modalities are evaluated, highlighting their potential to enhance underwater computer vision tasks when utilized either individually or in combination with optical techniques.

  • Insights into Future Research Directions: The review identifies promising avenues for future work, such as the development of benchmark datasets for standardized performance evaluation, real-time performance improvement, generalizability, and addressing challenging scenarios in object detection and tracking such as occlusions and camouflaging.

The remainder of this paper is structured as shown in Figure 4. Section II reviews recent advancements in DL-based image preprocessing and enhancement techniques aimed at minimizing color distortions and blurring in underwater images. In Section III, datasets and DL models used for object recognition tasks are examined, including classification, detection, and segmentation, with comparative analyses to identify benchmarks. Section IV explores state-of-the-art underwater tracking models, such as Siamese networks and joint models. Section V investigates alternative imaging modalities beyond optical imaging and their potential to improve underwater vision tasks. Section VI discusses the challenges and limitations of deep learning models in underwater vision. Section VII discusses future directions to overcome current challenges and finally, Section VIII draws the conclusions.

FIGURE 4. Organization of the survey’s sections and subsections.

SECTION II.

Underwater Image Enhancement

Underwater imaging is challenging due to the effects of the underwater environment on light propagation, as shown in Figure 5. Different wavelengths of light are absorbed at varying rates as they travel through water, where red light is absorbed quickly, while blue and green penetrate deeper. Light is also scattered by suspended particles, which changes its direction and reduces its intensity. These effects alter underwater visual properties and hinder underwater image analysis.

FIGURE 5. Effects of the underwater environment on light propagation.

This section addresses underwater image enhancement and preprocessing techniques aimed at mitigating challenges such as image blurring, color distortions, and scattering. It begins by exploring evaluation metrics for image enhancement, including both objective and no-reference approaches, in Subsection II-A. Next, it reviews the comprehensive datasets employed to train enhancement deep learning (DL) models in Subsection II-B. Finally, Subsection II-C examines DL models such as Convolutional Neural Networks (CNNs), transformers, Generative Adversarial Networks (GANs), and hybrid GAN-transformers, offering a comparative analysis of their performance across different datasets.

A. Evaluation Metrics for Underwater Image Enhancement

Assessing the effectiveness of UIE algorithms necessitates the development of robust evaluation metrics. These metrics can be broadly categorized into two main groups: objective and no-reference metrics.

Objective metrics provide quantitative measures to compare enhanced images with a reference image, typically captured under ideal lighting conditions. No-reference metrics, however, utilize image statistics to measure the naturalness of the enhanced image without the need for a reference.

1) Objective Metrics

These metrics aim to quantify the visual quality of enhanced images, aligning with human perception. Common examples include:

Peak Signal-to-Noise Ratio (PSNR): PSNR measures the peak error between the enhanced image E(i,j) and the reference image F(i,j) based on the mean squared error (MSE):
\begin{equation*} \text{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left[ F(i, j) - E(i, j) \right]^{2} \tag{1} \end{equation*}
where M \times N is the image size. PSNR is derived from the MSE and quantifies the peak error:
\begin{equation*} \text{PSNR} = 20 \log_{10} \left( \frac{\text{MAX}_{F}}{\sqrt{\text{MSE}}} \right) \tag{2} \end{equation*}
where \text{MAX}_{F} is the maximum possible pixel value of the image (e.g., 255 for 8-bit images).

Structural Similarity Index Measure (SSIM): SSIM goes beyond pixel-level differences by evaluating structural similarities between the enhanced and reference images, considering luminance, contrast, and structure (Equation 3):
\begin{equation*} \text{SSIM}(F, E) = \frac{(2\mu_{F}\mu_{E} + C_{1})(2\sigma_{FE} + C_{2})}{(\mu_{F}^{2} + \mu_{E}^{2} + C_{1})(\sigma_{F}^{2} + \sigma_{E}^{2} + C_{2})} \tag{3} \end{equation*}
where \mu_{F} and \mu_{E} are the mean intensities, \sigma_{F}^{2} and \sigma_{E}^{2} are the variances, and \sigma_{FE} is the covariance of F and E. Constants C_{1} and C_{2} stabilize the division [29].
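As a minimal illustration, the sketch below computes PSNR directly from Equations (1)-(2) with NumPy and relies on scikit-image's existing SSIM implementation for Equation (3); 8-bit image arrays of identical shape are assumed.

```python
import numpy as np
from skimage.metrics import structural_similarity  # SSIM as in Eq. (3)

def psnr(reference, enhanced, max_val=255.0):
    """PSNR from Eqs. (1)-(2) for two images of identical shape."""
    mse = np.mean((reference.astype(np.float64) - enhanced.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 20.0 * np.log10(max_val / np.sqrt(mse))

# For color images with a recent scikit-image version:
# ssim_value = structural_similarity(reference, enhanced, channel_axis=-1)
```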

2) No-Reference Metrics

No-reference metrics aim to evaluate image quality without a reference image. They are particularly useful in scenarios where acquiring ground truth data is challenging or impractical. Examples include:

Underwater Image Quality Measure (UIQM): UIQM combines multiple metrics to assess various aspects of underwater image quality, including color, sharpness, and contrast [30].

Natural Image Quality Evaluator (NIQE): NIQE [31] indicates the naturalness of an image by measuring its statistical deviation from natural scene statistics, where a lower score indicates a more natural-looking image.

Underwater Color Image Quality Evaluation (UCIQE): UCIQE is a linear combination of chroma, saturation, and contrast, designed to measure the nonuniform color cast, blurring, and low contrast that often characterize underwater images [32].
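As a rough, hedged sketch of how such a no-reference score can be assembled, the snippet below combines the chroma standard deviation, a luminance-contrast term, and the mean saturation using coefficient values commonly quoted for UCIQE. Implementations differ in normalization details, so this should be treated as an approximation of [32] rather than its reference code.

```python
import numpy as np
import cv2

def uciqe_approx(bgr, c1=0.4680, c2=0.2745, c3=0.2576):
    """Approximate UCIQE: c1*std(chroma) + c2*contrast(luminance) + c3*mean(saturation).
    Coefficients are those commonly quoted for [32]; normalization here is simplified."""
    lab = cv2.cvtColor(bgr, cv2.COLOR_BGR2LAB).astype(np.float64)
    luminance = lab[..., 0] / 255.0
    a, b = lab[..., 1] - 128.0, lab[..., 2] - 128.0
    sigma_c = np.sqrt(a ** 2 + b ** 2).std() / 255.0           # spread of chroma
    con_l = np.percentile(luminance, 99) - np.percentile(luminance, 1)  # luminance contrast
    saturation = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)[..., 1] / 255.0
    return c1 * sigma_c + c2 * con_l + c3 * saturation.mean()
```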

Although other evaluation metrics have been developed to assess UIE algorithms, the aforementioned metrics are the most widely used in recent research. While both objective and no-reference metrics play a vital role in UIE evaluation, it is important to acknowledge their limitations. Subjective human evaluation remains a valuable tool, particularly for capturing aspects not fully addressed by current metrics.

B. Underwater Dehazing Datasets

The scarcity of UIE datasets for training and evaluating DL-based underwater enhancement models remains a fundamental challenge. In this section, several pioneering datasets that address this issue are explored. The first comprehensive perceptual study and analysis of UIE using real-world images was conducted by creating the Underwater Image Enhancement Benchmark (UIEB) [33], which includes 950 marine images. Of these, 890 have corresponding reference images and are divided into 800 for training and 90 for validation (V-90). The remaining 60 images, lacking satisfactory reference images, are considered challenging data (C-60). Similarly, the EUVP Dataset [34] includes 12k paired and 8k unpaired underwater images taken under different visibility conditions during oceanic explorations.

Other datasets include the U45 dataset [35] which offers a standardized benchmark to evaluate algorithms against three types of degradation: green cast, blue cast, and haze. The RUIE Dataset [36] is another benchmark, divided into three subsets: UIQS, UCCS, and UHTS, each targeting different aspects of image quality including visibility, color cast, and object detection/classification. UID2021 Dataset [37] is also a large-scale dataset with multiple degraded underwater images to evaluate underwater image quality assessment algorithms.

The Heron Island Coral Reef Dataset (HICRD) [38] contains 2000 paired and 6003 unpaired images for the purpose of benchmarking existing image restoration methods. LSUI Dataset [39] is a large-scale dataset with 4279 underwater real-world image groups, each containing a raw image, its corresponding clear reference image, a semantic segmentation map, and a medium transmission map.

Beyond traditional image enhancement dataset creation methods, the DRUVA Dataset [40] is the first underwater video dataset for monocular self-supervision in underwater vision tasks. The SQUID Dataset [41] also introduces a novel approach with 57 stereo image pairs, rather than just utilizing monocular images for restoration algorithms.

C. Deep Learning Approaches for Underwater Image Enhancement

The inherent limitations of underwater imaging require the development of robust image enhancement models. Underwater environmental factors significantly degrade image quality and impede object recognition. Recent research efforts have focused on innovative approaches to address these challenges and enhance underwater imagery. In this section, various pioneering models and recent advancements in DL-based UIE are discussed.

1) Convolutional Neural Networks

CNNs utilize the convolution operation to extract meaningful information from input images. This information can be utilized for input image reconstruction to achieve image dehazing. For instance, PhISH-Net [42] integrates a physics-based underwater image formation model with a CNN to remove backscatter and enhance images through a retinex-based module. GCCF [43] is a CNN that utilizes a grouped color compensation encoder to extract information from the red, green, and blue channels independently. The encoder then compensates for the red and blue channels using information from the green channel through a learnable module.

Recent approaches focus on semi-supervised learning due to the lack of labeled data. As an example, Semi-UIR [44] is a CNN that utilizes a traditional mean-teacher approach but overcomes its limitations by introducing a bank for storing the highest performing outputs as pseudo-ground truth.
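To make the CNN-based image-to-image idea concrete, the following minimal PyTorch sketch maps a raw frame to an enhanced one through a small encoder-decoder; it is an illustrative toy model, not a reproduction of PhISH-Net, GCCF, or Semi-UIR.

```python
import torch
import torch.nn as nn

class TinyEnhancer(nn.Module):
    """Minimal encoder-decoder CNN for underwater image enhancement (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Supervised training would minimize a reconstruction loss against reference images, e.g.:
# loss = nn.L1Loss()(TinyEnhancer()(raw_batch), reference_batch)
```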

2) Attention Mechanism and Transformer Networks

UIE methods based on Transformer networks and attention mechanisms are gaining more popularity over purely CNN approaches. This is due to the ability of attention mechanisms to focus on relevant parts of the input images. The majority of these models incorporate CNN for feature extraction, followed by attention mechanisms for handling long-range dependencies and contextual relationships more effectively, such as UIE-MCNN [45], TANet [46], TCTL-Net [47], UGIF-Net [48], and SARSDN [49].

Examples of Transformer-based approaches include CEWformer [50], which utilizes channel self-attention and mixed self-attention transformers to address color attenuation and image quality degradation, respectively. UIE-Convformer [51] is another Transformer-based network that utilizes a multi-scale U-Net structure to extract both local and global features from the images. Similarly, Ghost-UNet [52] uses a modified attention-based U-Net, but with a conditional diffusion model. To address the limitations of existing methods in terms of model size and computational efficiency, a Swin Transformer is combined with Adaptive Group Attention (AGA) [53]. The AGA module selects visually complementary channels, minimizing the network’s parameters and improving the computational efficiency.

Several Transformer-based UIE methods incorporate cascaded architectures. For instance, Spectroformer [54] is a multi-domain query cascaded transformer network that combines localized transmission and global illumination features with fusion-based attention blocks, enhancing feature propagation and resolution. Similarly, CURE-Net [55] is a cascaded network that comprises three subnetworks, with the first two employing an attention mechanism for multi-scale contextual information and the third on preserving spatial details. CNMS [56] also employs a cascade mechanism for improved feature extraction and information exchange. A triple attention module further enhances feature extraction, while the multi-level sub-networks module amplifies contextual information. Moreover, Joint-ID [57] employs a multi-modal objective function to tackle issues like image blur, image distortion, and lack of depth estimates. An integrated end-to-end training pipeline supports joint enhancement and depth estimation by leveraging a common feature space.

With recent advancements in contrastive learning, approaches like UIESC [58] combine self-attention and contrastive learning to enhance image quality. UIESC uses spatial and channel dual attention to capture local features and global dependencies, while criss-cross attention reduces the computational complexity of self-attention.

3) Generative Adversarial Networks

GANs for UIE use a generator to produce enhanced images from raw underwater images and a discriminator to distinguish between real and generated images. This adversarial process improves the visual quality by reducing haze, correcting colors, and enhancing details. For example, FUnIE-GAN [34] uses five encoder-decoder pairs with mirrored skip connections as a generator, and a Markovian Patch GAN as a discriminator to generate dehazed underwater images. Moreover, the DifSG2-CCL [59] model incorporates the Underwater Cycle Consistency Loss (U-CCL) into the generator loss to generate realistic underwater images while preserving content consistency with real images. MuLA-GAN [60] applies multi-level attention mechanisms for enhanced underwater visibility, achieving improved feature refinement and color correction.
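A minimal sketch of one adversarial training step is given below, assuming placeholder gen and disc modules and paired raw/reference images; specific models such as FUnIE-GAN add their own architectures and additional loss terms on top of this basic scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

bce = nn.BCEWithLogitsLoss()

def gan_step(gen, disc, opt_g, opt_d, raw, reference):
    """One adversarial update: gen maps raw -> enhanced; disc scores realism."""
    fake = gen(raw)

    # Discriminator update: reference images labeled 1, generated images labeled 0
    opt_d.zero_grad()
    real_logits = disc(reference)
    fake_logits = disc(fake.detach())
    d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
             bce(fake_logits, torch.zeros_like(fake_logits))
    d_loss.backward()
    opt_d.step()

    # Generator update: fool the discriminator while staying close to the reference
    opt_g.zero_grad()
    adv_logits = disc(fake)
    g_loss = bce(adv_logits, torch.ones_like(adv_logits)) + F.l1_loss(fake, reference)
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```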

Recent models incorporate underwater imaging physical models, such as PUGAN [61], depth estimation techniques, such as UW-GAN [62], and fusion models, such as HIFI-Net [63], for improved GAN performance.

With the lack of image pair datasets for underwater image dehazing, unsupervised methods have gained a lot of interest lately. UMRD [64] is an unsupervised GAN network for multiple representation disentanglement for domain adaptation. Furthermore, NPT-UL [65] is a GAN that combines nonphysical transformation-based data augmentation with an unsupervised learning model based on contrastive learning for enhanced performance. Also leveraging contrastive learning, UCL-Dehaze [66] addresses the domain shift problem caused by training models on synthetic data by formulating a self-contrastive perceptual loss function.

4) Hybrid GAN-Transformer Networks

Recent research incorporates Transformers and attention mechanisms within GANs for enhanced image dehazing performance. As an example, HAAM-GAN [67] incorporates hierarchical attention aggregation and multi-resolution feature learning to reduce color bias in GAN-based image restoration. UIENet [68], on the other hand, includes multi-resolution counterparts to enhance feature representation and utilizes spatial and channel attention modules in a GAN to refine global-local connections in the image. UwTGAN [69] employs Vision Transformers (ViT) with Window-based self-attention blocks to mitigate backscattering and attenuation in UIE.

5) Comparative Analysis of Underwater Image Enhancement Models

A comparison of the underwater image enhancement models in terms of methodology, utilized datasets, and evaluation metrics is presented in Table 2 and Table 3.

TABLE 2 Underwater Dehazing Models Comparison
TABLE 3 Underwater Dehazing Models Comparison (Continued)

On the EUVP dataset, UwTGAN [69] achieves the highest PSNR score of 28.85, indicating superior performance in image reconstruction quality. PhISH-net [42] has the highest SSIM with a score of 0.85, reflecting its strength in maintaining structural similarity. For color quality, Joint-ID [57] outperforms others with a UCIQE score of 0.63. Finally, Bidirectional Disentangling [66] excels in UIQM, scoring 4.88, highlighting its effectiveness in overall image quality improvement. Bidirectional disentangling also outperforms both Spectroformer [54] and UGIF-net [48] in terms of UIQM (4.46) in the UCCS dataset.

In the RUIE dataset, Semi-UIR [44] sets the benchmark with a UIQM of 4.67. Furthermore, for the SQUID dataset, GCCF [43] outperforms other models across all metrics, achieving a UIQM of 3.91, UCIQE of 0.57, and an NIQE of 3.94. On the U45 dataset, Spectroformer records the lowest NIQE (3.84), whereas UIESC [58] achieves the highest UIQM and UCIQE values of 5.08 and 0.63, respectively.

For the UIEB dataset, Spectroformer excels in objective metrics, recording the highest PSNR (24.96) and SSIM (0.91). However, UIE-MCNN [45] leads in UIQM with a score of 4.88, while PhISH-Net and NPT-UL both achieve the highest UCIQE value of 0.64. On the more challenging UIEB-C60 subset, the performance drops significantly, with GCCF achieving the highest UIQM (3.16) and UGIF-Net the highest UCIQE (0.61).

SECTION III.

Underwater Object Recognition

Underwater object recognition includes classification, detection, and segmentation of objects within underwater environments, which presents a challenge due to the inherent limitations of underwater imaging. This section starts by covering the evaluation metrics in Subsection III-A, followed by a discussion of the commonly used underwater object recognition datasets in Subsection III-B. Finally, the DL models for object classification, detection, and segmentation are discussed in Subsections III-C, III-D, and III-E respectively.

A. Evaluation Metrics for Object Classification, Detection, and Segmentation

This subsection covers the metrics that are essential for evaluating the performance of underwater object detection, classification, and segmentation tasks:

1) Definitions

  • True Positives (TP): Correctly classified positive samples.

  • True Negatives (TN): Correctly classified negative samples.

  • False Positives (FP): Negative samples incorrectly classified as positive (e.g., identifying a coral as a fish).

  • False Negatives (FN): Positive samples incorrectly classified as negative (e.g., failing to detect a fish in the image).

2) Accuracy

This metric represents the overall effectiveness of the classification model. It is calculated as the ratio of correctly classified samples to the total number of samples in the dataset.
\begin{equation*} \text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Samples}} \tag{4} \end{equation*}

3) Precision (P)

Precision measures the ratio of correctly classified positive samples to the total number of samples predicted as positive, quantifying the selectivity of the model.
\begin{equation*} \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} \tag{5} \end{equation*}

4) Recall (R)

Recall indicates the ratio of correctly classified positive samples to the total actual positive samples in the dataset, quantifying the inclusivity of the model.
\begin{equation*} \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{6} \end{equation*}

5) F1-Score

F1-score is the harmonic mean of precision and recall, providing a balanced view of model performance.
\begin{equation*} \text{F1-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \tag{7} \end{equation*}
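The four metrics in Equations (4)-(7) follow directly from the TP/TN/FP/FN counts, as in this short sketch with a made-up example in the trailing comment.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, and F1-score from Equations (4)-(7)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Hypothetical example: 80 fish detected correctly, 5 corals mistaken for fish,
# 15 fish missed, 100 true negatives -> accuracy 0.90, precision ~0.94, recall ~0.84, F1 ~0.89
```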

6) Average Precision (AP)

AP represents the precision-recall curve’s average value. It captures the model’s ability to rank relevant objects (objects of a specific class) higher than irrelevant ones.

7) Intersection Over Union (IoU)

IoU is a measure used to evaluate the overlap between a predicted bounding box (BB) and the ground truth BB for an object. Higher IoU values indicate better overlap.

8) Mean Average Precision (mAP)

This metric is commonly used for object detection tasks and represents the average precision (AP) across all classes in the dataset. It is calculated by taking the mean of the AP values obtained at multiple Intersection over Union (IoU) thresholds (typically between 0.5 and 0.95).

9) mAP@0.50 or mAP50

This is a common variant where the mAP is calculated specifically at an IoU threshold of 0.5.
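A minimal IoU computation for axis-aligned boxes, which underpins both mAP and mAP50, is sketched below; boxes in (x1, y1, x2, y2) format are assumed.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

# A detection is typically counted as a true positive for mAP50 when
# box_iou(prediction, ground_truth) >= 0.5 and the class labels match.
```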

10) Mean Intersection Over Union (mIoU)

This metric is commonly used for semantic segmentation tasks and represents the average IoU across all classes in the dataset. IoU is a measure that calculates the overlap between the predicted segmentation mask and the ground truth mask for a particular class. Higher mIoU indicates better segmentation accuracy.

11) Mean Dice Coefficient (mDice)

Similar to mIoU, mDice quantifies the overlap between predicted and ground truth segmentation masks. It is computed as twice the intersection area divided by the sum of the predicted and ground truth areas, averaged across all classes. A higher mDice indicates better segmentation performance.
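For segmentation outputs, per-class IoU and Dice can be computed from the masks and averaged to obtain mIoU and mDice, as in this NumPy sketch; integer label masks of identical shape are assumed.

```python
import numpy as np

def iou_and_dice(pred, gt, num_classes):
    """Per-class IoU and Dice from integer label masks; returns their means (mIoU, mDice)."""
    ious, dices = [], []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        denom = p.sum() + g.sum()
        ious.append(inter / union if union else np.nan)     # skip classes absent from both masks
        dices.append(2 * inter / denom if denom else np.nan)
    return np.nanmean(ious), np.nanmean(dices)
```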

12) Mean Pixel Accuracy (mPA)

This metric represents the overall accuracy of the segmentation model, calculated as the ratio of correctly classified pixels to the total number of pixels in the image (averaged across all classes).

13) Mean Absolute Error (MAE)

MAE measures the average absolute difference between the predicted and ground truth pixel values for segmentation tasks. Lower MAE indicates better performance, particularly for tasks where precise pixel-level segmentation is important.

14) Weighted F-Measure (F^{\omega}_{\beta})

F^{\omega}_{\beta} [71] is a variant of the traditional F-measure which is commonly used in segmentation tasks. The metric considers the importance of different regions in an image according to the following equation:
\begin{equation*} F^{\omega}_{\beta} = \frac{(1 + \beta^{2}) \cdot P_{w} \cdot R_{w}}{\beta^{2} \cdot P_{w} + R_{w}} \tag{8} \end{equation*}
where P_{w} is the weighted precision, R_{w} is the weighted recall, and \beta is a parameter that determines the relative importance of precision versus recall.

15) Structural Similarity Measure (S_{\alpha})

S_{\alpha } [72] is designed to capture the structural similarities between the predicted segmentation and the ground truth maps, similar to the SSIM metric in image enhancement.

16) Mean E-Measure (mE_{\phi})

mE_{\phi} [73] evaluates the alignment between the predicted segmentation map and the ground truth by combining local pixel-level and image-level evaluation based on the following equation:
\begin{equation*} E_{\phi} = \frac{1}{N} \sum_{i=1}^{N} \left( 1 - \frac{|p_{i} + g_{i}|}{|p_{i} - g_{i}|} \right) \tag{9} \end{equation*}
where p_{i} is the predicted segmentation value at pixel i, g_{i} is the ground truth segmentation value at pixel i, and N is the total number of pixels in the image. mE_{\phi} is obtained by averaging this value across all images of the dataset.

B. Underwater Object Recognition Datasets

The scarcity of large-scale datasets encompassing diverse underwater applications such as the classification, detection, and segmentation of marine organisms, human divers, marine debris, and subsea structures has long been a hurdle for achieving robust underwater recognition. This section explores various datasets developed to address this challenge. A summary of the underwater object recognition datasets is shown in Table 4.

TABLE 4 Summary of Underwater Object Recognition Datasets

The introduction of WildFish [74] marked a significant advancement in marine research, providing a comprehensive benchmark dataset featuring 1,000 fish categories and over 54,000 images. Building upon these foundations, WildFish++ [75] presents an even larger dataset, encompassing 2,348 fish categories with over 100,000 images and 3,817 descriptions. WildFish++ focuses on addressing key challenges such as distinguishing visually similar species, identifying unknown categories through open-set classification, and retrieving information across different modalities. Similarly, FishNet [76] comprises 94,532 images of 17,357 marine species, organized based on aquatic biological taxonomy. FishNet offers three distinct benchmarks: fish classification, detection, and functional trait prediction. Furthermore, the Fish4Knowledge dataset [6] contains 3678 images of 32 different fish species obtained from high-quality underwater cameras.

The focus extends beyond fish recognition, with advancements being made in generic Underwater Object Detection (UOD). USOD10K [77] tackles the challenge of underwater salient object detection by introducing the first large-scale dataset in this field. USOD10K encompasses 10,255 underwater images of 70 object categories and 12 diverse scenes, along with their associated salient object boundaries and depth maps. DUO [78] was created for the same purpose, but with more rational bounding box (BB) annotations. The dataset was created by combining images from 4 years of the Underwater Robot Professional Contest (URPC) from 2017 to 2020. Similarly, RUOD [79] is a new real-world dataset comprising 14,000 images with 74,903 labeled objects across 10 common aquatic classes.

Further advancements include the Underwater Higher-level Task-driven Set (UHTS) dataset [36] which contains 300 BB-labeled images embodying several types of sea life including scallops, sea cucumbers, and sea urchins. The UDD dataset [80] addresses the specific challenge of object grabbing in open-sea farming. The UDD dataset comprises 2,227 images of three common sea creatures taken in an open-sea farm utilizing 4K cameras. The Brackish Dataset [81] addresses robust automated marine monitoring in brackish water with BB annotations. DFUI [82] consists of 5,265 images that specifically focus on the task of UOD in challenging underwater conditions. More recently, DUFish [83] was presented as a unique dataset that includes various high-quality underwater videos with manual annotations for individual frames. The dataset includes challenging conditions, unlike existing datasets that primarily feature simple underwater environments and easily detectable targets.

Segmentation datasets include the Underwater Image Instance Segmentation (UIIS) dataset [84], the enhanced Large Scale Underwater Image (LSUI) dataset with 4279 marine scene images [39], and the Segmentation of Underwater IMagery (SUIM) [85] dataset, consisting of more than 1500 annotated images of various classes such as fish, reefs, aquatic plants, ruins, human divers, robots, and sea-floor. More challenging scenarios, including camouflaged underwater marine organisms as shown in Figure 6, are presented in the Marine Animal Segmentation (MAS) dataset with over 3000 annotated images (MAS3K) [19].

FIGURE 6. Sample images of camouflaged underwater animals and their corresponding segmentation masks from the MAS3K Dataset [19].

Several datasets were developed to detect plastic marine debris in underwater videos, including the Deep Sea Debris (DSD) dataset [86], the TrashCan dataset [91], and the in-Water Plastic Bags Bottles (WPBB) dataset [87].

Other unique datasets include the diver detection dataset [92], Cognitive Autonomous Diving Buddy (CADDY) dataset [90] for diver intent classification, the underwater ship Lifecycle Inspection, Analysis, and Condition Information (LIACI) dataset [88], and the DUSIA dataset [89] containing timestamps for substrate transitions and species counts, with some frames also containing BB locations for invertebrates.

C. Classification Models

Image classification is the task of categorizing input data into predefined classes without localization. Underwater environments pose significant challenges for image classification tasks due to factors such as complex backgrounds, varying illumination, and diverse marine life.

1) Specialized Architectures for Enhanced Underwater Image Classifications

DAMNet [93] addresses the general complexities of underwater imaging with a dual attention mechanism, achieving a strong classification accuracy of 96.93% on underwater biological images. Building upon this foundation, MCANet [94] focuses on specific degradation types found in underwater images, such as color shifts and haze. By leveraging multi-color space encoding and multi-channel attention, MCANet achieves an even higher classification accuracy of 98.74%. Moreover, Token-selective Vision Transformers [95] have shown improved performance for fine-grained classification of marine organisms.

Residual neural networks have also demonstrated improved object classification performance in degraded underwater videos [4], [5]. MLR-VGGNet [4], for instance, improves fish classification by incorporating a Multi-Level Residual (MLR) strategy, effectively combining low-level and high-level features within the network.

GAN-based models have recently been introduced for underwater classification tasks. For instance, GANomaly [96] is a GAN utilized for classifying and monitoring the condition of marine turbine blades, achieving an accuracy of 91.23%.

2) Applications of Underwater Image Classification

In diving applications, effective human-robot communication is crucial. The DARE system [97] enhances communication by enabling divers to control AUVs using gestures. Similarly, RRCommNet [98] allows AUVs to interpret human gestures, facilitating underwater interaction between divers and robots.

Fish monitoring has attracted considerable research attention in recent years. WildFishNet [99] addresses the challenge of fish recognition in natural settings with its open-set fine-grained recognition capabilities. Similarly, Fish-TViT [100] leverages transfer learning and visual transformers for fish classification.

Other classification tasks include the automatic identification of Crown of Thorns Starfish for coral reef conservation, which was addressed by utilizing an attention-enhanced CNN [101]. Furthermore, the task of deep-sea debris identification is addressed with a Shuffle-Xception network [102], outperforming traditional methods.

A summary of the methods, applications, utilized datasets, and performance of the discussed underwater classification models is presented in Table 5.

TABLE 5 Underwater Classification Models Comparison

D. Detection Models

Underwater Object Detection (UOD) is essential in fields like marine conservation and underwater robotics, where models localize objects using Bounding Boxes (BB). Recent advancements in UOD include region-based CNNs (R-CNNs) [103], [104], [105], YOLO-based algorithms [15], [106], [107], [108], [109], [110], [111], [112], [113], transformer-based architectures [114], [115], [116], and feature pyramid architectures [14], [16], [117], [118], [119], [120], [121].

1) R-CNN-Based UOD

R-CNNs first generate region proposals that potentially contain objects; each proposal is then warped and fed into a CNN for feature extraction. These features are classified using a separate classifier to identify objects. Finally, bounding boxes are refined to improve localization accuracy. The overall two-stage architecture of R-CNN models is shown in Figure 7. Such architectures have been effectively used in UOD tasks such as fish monitoring [103] and marine litter detection [104].

FIGURE 7. Overall architecture of a typical R-CNN model.

Recent advances in R-CNNs for UOD include CAM-RCNN [103], which introduced class imbalance mitigation using a compound Dice and cross-entropy loss, and enhanced image quality via multi-scale retinex and color restoration techniques. Similarly, Boosting-RCNN [105] introduced another innovation by focusing on uncertainty modeling to boost detection accuracy. Despite these advancements, the two-stage architecture of R-CNNs causes computational inefficiencies and longer processing time.
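As an illustration of the two-stage flow in Figure 7, the snippet below runs a generic pretrained Faster R-CNN from torchvision (a recent torchvision version and a placeholder input tensor are assumed); the surveyed UOD models modify backbones, losses, and training data, but the inference pattern is similar.

```python
import torch
import torchvision

# Generic two-stage detector: region proposals, per-region classification, box refinement.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)        # placeholder for an enhanced underwater frame
with torch.no_grad():
    prediction = model([image])[0]     # dict with 'boxes', 'labels', 'scores'
keep = prediction["scores"] > 0.5      # simple confidence filtering
print(prediction["boxes"][keep], prediction["labels"][keep])
```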

2) YOLO-Based UOD

To address the limitations of two-stage architectures, researchers have turned to YOLO, which is a real-time object detection system that addresses object detection as a single regression problem. YOLO works by partitioning the input image into a grid and computing BB coordinates and class probabilities for every grid cell directly as shown in Figure 8.

FIGURE 8. Overall architecture of a typical YOLO model.

YOLO models have been adapted to UOD by combining image enhancement with models like YOLOv5 [13], [111] and YOLOv8 [108], [110] with modifications focusing on improving feature reconstruction in low visibility conditions and incorporating information-theoretic learning for enhanced detection performance. These models have shown success in various tasks such as the detection of marine debris [15], [113], marine life [106], [112], underwater concrete damage [111], and methane plumes [122].
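For reference, single-stage inference with a generic pretrained YOLO model via the Ultralytics package looks as follows; the weight file and image path are placeholders, and the surveyed models fine-tune such detectors on underwater data rather than using stock weights.

```python
from ultralytics import YOLO  # pip install ultralytics (assumed available)

# Single-stage detection as in Figure 8: one forward pass yields boxes, classes, and scores.
model = YOLO("yolov8n.pt")                         # generic pretrained weights (placeholder)
results = model("enhanced_underwater_frame.jpg")   # hypothetical image path
for box in results[0].boxes:
    print(box.xyxy, box.cls, box.conf)             # coordinates, class id, confidence
```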

3) Feature Pyramid-Based UOD

Feature pyramid-based architectures contain a neck for feature aggregation from different layers of the backbone, followed by detection heads to output the classes and BB coordinates as shown in Figure 9. Feature Pyramid Networks (FPNs) are mostly employed in the neck for multi-scale feature maps allowing the detection heads to detect objects of various sizes more effectively.

FIGURE 9. Overall architecture of a typical FPN model.

Feature pyramid-based UOD methods have demonstrated notable advancements in underwater object detection. For instance, MLDet [16] employs a ResNet50 backbone combined with an FPN neck to detect marine litter, while another model integrates a VGG-16 backbone with an FPN neck and a color conversion network to improve denoised UOD [14].

Enhancements to FPN structures further boost detection performance. UDMDet [121], a novel framework, introduces distraction-aware FPN (DAFPN) that refines coarse features by reducing discrepancies between objects and backgrounds, thereby improving detection accuracy. FocusDet [118], designed for small-size UOD, leverages a Bottom Focus Path Aggregation Network (BF-PAN) for feature fusion in the neck, effectively capturing lower-level features and small object locations. Similarly, augmented weighted bidirectional FPN (AWBiFPN) [120] has demonstrated improved performance for small-object UOD tasks.

Building upon these advancements, multi-scale adaptive architectures, such as those proposed by Saad Saoud et al. [123], further enhance robustness in domain generalization, enabling efficient and accurate underwater object detection across diverse environments. Similarly, ADOD [124] extends these approaches by integrating residual attention modules within the YOLOv3 framework. It focuses on domain generalization by learning domain-invariant features, making it particularly effective in handling visual variations in underwater environments and improving detection robustness across diverse domains.

4) Transformer-Based UOD

Detection Transformers (DETRs) convert object detection into a direct set prediction problem using transformers. DETR extracts features with a CNN and then processes them through a transformer encoder-decoder. The decoder predicts a fixed set of objects with class labels and bounding boxes.

Recent advancements in transformer-based models have shown promise in improving UOD. An enhanced DETR model [115] tailored for underwater environments incorporates a learnable query recall mechanism, a lightweight adapter for multi-scale features, and an optimized bounding box regression mechanism as shown in Figure 10. Another example, the path-augmented transformer detection framework [114], improves the semantic representation of small objects by facilitating interactions between high-level and low-level features. Furthermore, GCC-Net [116] leverages complementary information from both raw and enhanced images, utilizing a transformer-based module for cross-domain feature interaction and a gated feature fusion module for adaptive control of cross-domain information.

FIGURE 10. Overall architecture of DETR with learnable query recall as proposed by [115] for enhanced UOD.
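A baseline DETR inference example using the Hugging Face transformers implementation is sketched below (the model name and image path are placeholders); the underwater variants discussed above modify the queries, multi-scale features, and fusion rather than this basic set-prediction flow.

```python
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

# Plain DETR: a CNN backbone plus a transformer encoder-decoder predicting a fixed set of objects.
processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

image = Image.open("underwater_scene.jpg")   # hypothetical path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
detections = processor.post_process_object_detection(
    outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.7)[0]
print(detections["boxes"], detections["labels"], detections["scores"])
```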

5) Comparative Analysis of UOD Models

A comparison of the aforementioned architectures, their applications, and their results is shown in Table 6. It is evident that FocusDet [118] (mAP50=84.8%, mAP75=64.3%) outperforms Improved DETR [115] (mAP50=82.6%, mAP75=57.9%), DJL-Net [119] (mAP50=83.7%, mAP75=62.5%), AWBiFPN [120] (mAP50=83.62%), and the two-terminal attention mechanism YOLO [109] (mAP50=72.81%) when evaluated on the RUOD dataset.

TABLE 6 Underwater Object Detection Models Comparison

Both Boosting R-CNN [105] (mAP50:95=82.0%, mAP50=97.4%) and GCC-Net [116] (mAP50:95=80.5%, mAP50=98.3%) achieve much higher precision on the Brackish dataset when compared to the FPN-based network [14] (mAP50=80.12%). However, the FPN-based network is lightweight, requiring only 5.06 GFLOPs.

For the URPC2020 dataset, the highest mAP50:95 is reported by MAD-YOLO [106] (mAP50:95=53.4%), while the highest mAP50 value is reported by AGW-YOLOv8 [11] (mAP50=82.9%). For the task of marine litter detection using the TrashCan Dataset, YOLOTrashCan [15] achieves the highest mAP50 value of 65.01%.

E. Segmentation Models

Significant improvements in segmentation models have advanced underwater image analysis, addressing challenges such as noise, varying visibility, and complex backgrounds in marine environments. These models perform pixel-wise classification to segment objects in images accurately.

1) Fully Convolutional Network Architecture

Fully Convolutional Network (FCN) architectures are neural networks for pixel-wise image prediction, relying solely on convolutional layers to maintain spatial hierarchies. For instance, RMP-Net [127] tackles segmentation in low-light and camouflaged marine scenes using subpixel super-resolution and a re-parameterized backbone. A-LCFCN [128] focuses on efficient fish body measurement in aquaculture, utilizing point-level supervision and an affinity matrix for improved segmentation.

For object segmentation in videos, LSTMs can be used in conjunction with FCNs to maintain temporal information. For instance, Hansen et al. [129] utilized convolutional LSTM cells to enhance obstacle segmentation in unmanned surface vehicles. However, FCNs struggle to capture long-range dependencies and contextual information. By introducing multi-level attention adaptive segmentation, MA2Net [130] shows improved segmentation of fine-grained seabed targets.

2) Encoder-Decoder Architectures

Encoder-Decoder architectures utilize an encoder to downsample the input image, capturing abstract features, and a decoder to upsample and reconstruct the segmented output map. Encoder-decoder architectures can be divided into single [131], [132], [133], stacked [134], and parallel [12], [135], [136], [137] encoder-decoders as shown in Figure 11.

FIGURE 11. Examples of different types of encoder-decoder architectures for underwater object segmentation including (a) Single encoder-decoder (b) Stacked encoder-decoder (c) Parallel encoder-decoder.

For instance, MAFEM [131] employs a single encoder-decoder CNN with a Multi-scale Attentional Feature Extraction Module and fish image preprocessing for underwater fish image analysis. MTHI-Net [134] uses a 2-stacked encoder-decoder design with spatial-channel attention, residual refinement, and feature fusion to enhance underwater hull inspection accuracy. BCMNet [135] employs a 2-branch encoder-decoder architecture with a Bidirectional Collaborative Mentoring Network for integrating texture and contextual data, featuring Adjacent Feature Fusion blocks and a two-stage decoding pipeline. Similarly, MASNet [19] uses a Siamese architecture for MAS.

3) DeepLab-Based Segmentation

Google’s DeepLab [138] is a DL model developed for semantic image segmentation. Building upon it, a point label-aware method [139] enhances coral reef monitoring by propagating labels within superpixel regions for accurate segmentation. Also building on DeepLabV3+, TagLab [140] accelerates semantic segmentation of orthoimages through a human-centered AI approach.

4) GAN-Based Segmentation

MA-AttUNet [141] addresses the challenge of underwater crack segmentation in dams using attention mechanisms in the skip connections of U-Net along with multi-level adversarial transfer learning. Moreover, motion saliency-based GANs [142] can enhance underwater moving object segmentation by incorporating motion saliency estimation.

5) Segment Anything Models (SAM)

Meta AI’s Segment Anything Model (SAM) [143] can identify and isolate objects within a scene with minimal user input. Specifically, Dual SAM [144] integrated with a multi-level coupled prompt was proposed to enhance the multi-level features by utilizing comprehensive underwater prior information. Moreover, CoralSCOP [10] utilizes SAM for the task of coral segmentation. CoralSCOP has been trained using a novel dataset called CoralMask with 41,297 coral images and 330,144 coral masks.
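A minimal, hedged example of prompt-based segmentation with the publicly released SAM interface is shown below (the checkpoint path, input image, and point prompt are placeholders); Dual-SAM and CoralSCOP build specialized prompting and decoding on top of such a base model rather than using it unchanged.

```python
import numpy as np
from segment_anything import SamPredictor, sam_model_registry  # pip install segment-anything

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")   # placeholder checkpoint path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for an RGB underwater frame
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),          # a single foreground point prompt
    point_labels=np.array([1]),
    multimask_output=True,
)
print(masks.shape, scores)                        # candidate masks and their quality scores
```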

6) Comparative Analysis of Underwater Segmentation Models

The findings of the aforementioned research are summarized in Table 7. As illustrated in the table, on the SUIM dataset, RMP-Net [127] achieves a higher mIoU (0.845) compared to A-LCFCN (0.749) [128].

TABLE 7. Underwater Segmentation Models Comparison

When evaluated on the MAS3K dataset, PSS-net [137] outperforms other segmentation models in terms of mIoU (0.816) and $S_{\alpha}$ (0.966). However, the network has an MAE of 0.044, which is relatively high compared to BCMNet [135]’s MAE of 0.019. The Dual-SAM [144] model achieves the highest $F^{\omega}_{\beta}$ (0.865) and $mE_{\phi}$ (0.945), indicating superior performance in balancing recall and precision, as well as a strong ability to delineate the edges of the segmented object.

SECTION IV.

Underwater Object Tracking

Underwater Object Tracking (UOT) faces multiple challenges that significantly impede tracking performance. These include image attenuation, distortion, highly dynamic object motion with rotations and scale changes, and occlusion [18], [147], [148]. Additionally, low visibility, poor video quality with distortions in sharpness and contrast, reflections from suspended particles, and non-uniform lighting further complicate the tracking process [18]. These challenges necessitate the development of specialized models and the utilization of underwater-specific datasets for effective UOT research.

This section systematically reviews recent advancements in UOT, first discussing the evaluation metrics in Subsection IV-A, followed by an overview of recently developed UOT datasets in Subsection IV-B. The DL models for UOT, including tracking-by-detection, Siamese, and joint tracking models, are then discussed in Subsection IV-C along with a comparative analysis.

A. Evaluation Metrics for Object Tracking

This subsection covers the metrics that are essential for evaluating the performance of underwater object-tracking tasks:

1) Area Under the Curve (AUC)

This metric is commonly reported in tracking benchmarks as the area under the success plot, i.e., the average success rate computed over all overlap thresholds between the predicted and ground-truth bounding boxes; higher values indicate better tracking.

2) Percentage of Normalized Overlap (PNORM)

PNORM measures the overlap between the predicted bounding box (BB) and the ground-truth BB, normalized by the size of the ground-truth BB.

3) Identification F1-Score (IDF1), Precision (IDP), and Recall (IDR)

IDP, IDR, and IDF1 measure the precision, recall, and F1-score for identity matching in object tracking.

4) Identification Switches (IDS)

IDS quantifies the number of times a tracked object’s identity changes incorrectly during the tracking process. Lower IDS values indicate better performance.

5) Multiple Object Tracking Accuracy (MOTA)

MOTA [149] is a widely used metric that combines several tracking performance measures, including false positives, false negatives, and identity switches. It provides an overall assessment of the tracking system’s accuracy, with higher values indicating better performance.
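As a worked example of the standard MOTA definition from [149], the short snippet below combines hypothetical counts of false negatives, false positives, and identity switches over a sequence; the numbers are invented purely for illustration.

```python
def mota(false_negatives: int, false_positives: int, id_switches: int, gt_objects: int) -> float:
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    return 1.0 - (false_negatives + false_positives + id_switches) / gt_objects

# Hypothetical sequence: 1,000 ground-truth object instances across all frames,
# 60 missed detections, 40 false alarms, and 5 identity switches.
print(round(mota(60, 40, 5, 1000), 3))  # 0.895
```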

6) Higher Order Tracking Accuracy (HOTA)

HOTA [150] is a metric that assesses three main objectives: object localization accuracy, detection accuracy, and association accuracy across frames.

B. Underwater Object Tracking Datasets

The development of comprehensive benchmark datasets is essential for advancing UOT algorithms. These datasets provide standardized platforms for testing and comparing different models.

UOT100 [151] is a pioneering dataset offering a diverse collection of underwater video sequences with challenging distortions, facilitating the development and evaluation of various tracking algorithms. Subsequent work [152] introduced a dataset focused on multi-robot convoying in unstructured underwater environments, emphasizing target tracking and enabling comparisons of different tracker architectures, including CNNs and frequency-based methods.

Building on these efforts, UTB180 [18] provides an expanded dataset with increased sequence diversity and video-level attributes. It highlights the challenges of underwater tracking and serves as a benchmark for evaluating state-of-the-art trackers. More recently, the FishTrack23 dataset [153] offers a large-scale collection of fish tracking data, encompassing various backgrounds, locations, and conditions.

C. Underwater Object Tracking Models

Object tracking builds upon object detection by incorporating spatiotemporal information, enabling more precise predictions in challenging situations such as occlusions or dynamic environments. DL tracking architectures achieve this by leveraging models like Recurrent Neural Networks (RNNs), attention mechanisms, and advanced feature extraction techniques to process temporal data.

For instance, in maritime environments, object tracking is particularly crucial for ensuring safety, enhancing situational awareness, and managing operational risks in dynamic conditions [154]. A proposed framework by Chen et al. [155] extracts multiple trajectories by integrating YOLOX-CenterNet for detection, a deep snake module for contour extraction, and Bytetrack for multi-object tracking. This combination enhances accuracy in complex scenarios by addressing both spatial and temporal challenges. Similarly, Chen et al. [156] employ a dark channel prior model for dehazing, Scale-Adaptive Kernel Correlation Filtering (SAMF) for object tracking, and trajectory refinement methods to deliver reliable analyses in diverse maritime scenarios. These approaches exemplify how DL-based tracking models can leverage spatiotemporal data to overcome the challenges posed by real-world environments.

1) Tracking-by-Detection

The tracking-by-detection paradigm involves detecting objects in individual frames using an object detection model and then linking these detections across frames to form object trajectories, as shown in Figure 12. Tracking algorithms, such as Deep OC-SORT, ByteTrack, StrongSORT, and OC-SORT [157], are employed to predict object positions and associate them with their corresponding detections. These algorithms often utilize Kalman filters and association methods like the Hungarian Algorithm, with advanced variants incorporating occlusion and re-identification modules [157]. For instance, Marine$\mathcal{X}$ integrates a vision-based object tracking algorithm, demonstrating the effectiveness of tracking-by-detection in real-world maritime applications [158].

FIGURE 12. Pipeline of a typical tracking-by-detection architecture.
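The core association step of this paradigm can be sketched as follows, assuming axis-aligned bounding boxes and SciPy's Hungarian solver; the box coordinates and the 0.3 IoU threshold are illustrative placeholders rather than values from any cited tracker.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(track_boxes, det_boxes, iou_threshold=0.3):
    """Match predicted track boxes to new detections with the Hungarian algorithm."""
    cost = np.array([[1.0 - iou(t, d) for d in det_boxes] for t in track_boxes])
    rows, cols = linear_sum_assignment(cost)  # minimizes total (1 - IoU)
    # Keep only pairs whose overlap exceeds the threshold.
    return [(int(r), int(c)) for r, c in zip(rows, cols) if 1.0 - cost[r, c] >= iou_threshold]

tracks = [[0, 0, 10, 10], [20, 20, 30, 30]]        # e.g., Kalman-predicted positions
detections = [[21, 19, 31, 29], [1, 1, 11, 11]]    # detector output in the new frame
print(associate(tracks, detections))                # [(0, 1), (1, 0)]
```

In a full tracker, unmatched detections typically spawn new tracks, while unmatched tracks are kept alive for a few frames to handle short occlusions.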

UOSTrack [159] is a state-of-the-art Single-Object Tracking (SOT) algorithm tailored for underwater environments. It employs hybrid training of both underwater and open-air images to mitigate sample imbalance and improve feature representation. Additionally, motion-based post-processing is utilized to exclude similar objects and relocate lost targets.

Accurate object detectors are crucial for Multi-Object Tracking (MOT) systems. However, when accurate detectors are unavailable, Robust Confidence Tracking (RCT) [160] can be employed. RCT leverages detection confidence values to enhance tracking performance. Similarly, FSTA [161] proposes a tracking-by-detection approach for underwater fish-school tracking, incorporating an amendment detection module and a Resnet50-IBN re-identification network.

A DL-based tracking algorithm for AUV control [162] combines a multi-object detector with a 3D stereo tracker for MOT and SOT. Another MOT algorithm [9] utilizes GhostNetv2-integrated YOLOv5 with Coordinate Attention (CA) for object detection and StrongSORT with GIoU for tracking, incorporating a fish re-identification model for improved accuracy.

2) Siamese Architectures

Siamese networks employ identical neural networks with shared weights to process the target object’s initial appearance and subsequent frames. By formulating tracking as a convolutional feature cross-correlation between the target template and search region, Siamese networks learn a similarity function for robust target identification across varying conditions and frames [163].
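The cross-correlation formulation can be illustrated in a few lines of PyTorch, assuming a shared backbone has already produced template and search-region feature maps; the 256-channel, 6×6 and 22×22 sizes below are placeholder values, not those of any cited tracker.

```python
import torch
import torch.nn.functional as F

# Assume a shared backbone has already produced feature maps for the target
# template and for the larger search region (sizes are illustrative).
template_feat = torch.randn(1, 256, 6, 6)    # exemplar branch output
search_feat = torch.randn(1, 256, 22, 22)    # search branch output

# Cross-correlation: slide the template over the search features by treating
# the template as a convolution kernel. Peaks in the response map indicate
# likely target locations.
response = F.conv2d(search_feat, template_feat)   # shape: [1, 1, 17, 17]
peak = response.flatten().argmax()
print(response.shape, peak.item())
```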

LightFC [164] is a lightweight Siamese single object tracker that incorporates a cross-correlation module with a rep-center head for enhanced feature extraction. This design balances tracking accuracy and frame rate (FPS). To address similarity interference, occlusion, and scale variation in fish tracking, SiamFCA [165] was proposed. SiamFCA utilizes a modified AlexNet with a Contrast-Limited Adaptive Histogram Equalization (CLAHE) module and a Coordinate Attention (CA) mechanism to improve feature extraction and object relationship learning. A Region Proposal Network (RPN) is employed for classification and regression, as shown in Figure 13. Similarly, NewNet-62 [166] adopts the SiamRPN++ algorithm with an inverted residual bottleneck block for underwater SOT. The model employs depth-wise separable convolutions to simplify computations while maintaining accurate tracking performance.

FIGURE 13. SiamFCA [165] architecture.

3) Joint Tracking Models

Joint models excel at concurrently performing multiple tracking tasks such as detection, embedding, and tracking. Figure 14 provides an overview of different joint model architectures.

FIGURE 14. Different types of joint models including (a) joint detection and embedding, (b) joint detection and tracking, and (c) fully joint models.

Joint Detection and Embedding (JDE) networks integrate detection and feature extraction into a shared backbone. CMFTNet [8] employs a JDE paradigm with an anchor-free method to address mutual occlusion in fish school tracking. Deformable convolutions are incorporated to enhance feature extraction in complex underwater environments.

Transformer-based models have emerged as state-of-the-art for joint MOT. For instance, TFMFT [167] addresses instance loss in aquaculture ponds by introducing a multiple association module to improve fault tolerance. Similarly, FishTrack [7] adopts a fully-joint 3-branch architecture with a Pyramid Vision Transformer backbone for object detection, trajectory prediction, and re-identification. To enhance memory utilization, LSMAM [168] integrates a Long Short-Term Memory (LSTM) module into the transformer structure for feature aggregation across multiple frames.

4) Comparative Analysis

Table 8 summarizes the methods, applications, datasets, and findings of the discussed underwater tracking architectures. Due to the lack of diverse public underwater datasets, many researchers have created their own datasets. However, with the development of the UTB180 and UOT100 datasets in 2022, these have been used as benchmarks to compare state-of-the-art methods [159], [164]. The results in Table 8 show that UMOTMA [168] outperforms Resnet50-IBN [161] in IDF1 but falls behind in MOTA on fish video frames. Additionally, LightFC-Vit [164] demonstrates superior performance compared to UOSTrack [159] in AUC, PNORM, and P on both datasets.

TABLE 8. Underwater Tracking Models Comparison

SECTION V.

Imaging Modalities Beyond Optical Imaging

In previous sections, we discussed image enhancement, object recognition, and object tracking algorithms, focusing primarily on optical imaging techniques, predominantly RGB imaging. In this section, we provide a brief overview of imaging modalities that can be utilized independently or in conjunction with optical imaging techniques to enhance the accuracy of underwater computer vision. Emerging imaging technologies include sonar, stereo cameras, LiDAR, hyperspectral cameras, polarized imaging, magnetic imaging, and event cameras, discussed in Subsections V-A through V-G, respectively. The working principles of these technologies are presented in Figure 15.

FIGURE 15. Working principle of (a) sonar imaging: generating acoustic waves and calculating the time delay for the reflected signals to return after reflecting off surrounding objects; (b) LiDAR: emitting laser pulses and measuring the time it takes for the reflections to return after striking objects, creating detailed 3D maps of the environment; (c) hyperspectral cameras: capturing images across multiple wavelengths of light beyond the visible spectrum; (d) stereo cameras: using two lenses to capture images from slightly different perspectives, allowing them to create depth maps and perceive 3D structures by mimicking human binocular vision; (e) ACFM: using electromagnetic fields to detect and size surface-breaking cracks in metals; (f) polarization imaging: capturing the polarization state of light waves, providing additional information patterns; (g) event cameras: detecting changes in the scene at each pixel asynchronously, capturing only the dynamic parts of a scene with high temporal resolution.

A. Acoustic Imaging and Sonars

Different types of sonars include Forward-Looking Sonar (FLS), Side-Scan Sonar (SSS), and Multibeam Echo Sounders (MBES), each designed for specific tasks [174]. FLS is used for real-time navigation and obstacle avoidance, emitting sonar beams forward to produce a 2D image of the area ahead, making it ideal for localization in low-visibility conditions with sonar keyframes [175]. SSS, on the other hand, is optimized for imaging the seafloor and detecting underwater features. By emitting sonar signals sideways, it creates high-resolution 2D images, commonly used for mapping underwater environments, locating shipwrecks, or surveying pipelines. MBES focuses on detailed bathymetric surveys, using fan-shaped sonar beams to measure depths and generate 3D seafloor maps. Its precision and ability to provide high-resolution topographic data make it essential for hydrographic mapping and seabed analysis.

Acoustic imaging and sonar technologies have advanced significantly in object recognition, segmentation, and tracking, offering faster processing speeds compared to optical imaging. For instance, MGFGNet [176] employs spectrogram feature fusion to improve recognition speed and accuracy, while lightweight networks [177] achieve real-time segmentation of marine debris in FLS images using multi-scale attention as shown in Figure 16. Methods like GD-DAGO [178] and R2CNN [179] enhance garbage segmentation and SSS image processing, facilitating tasks such as pipeline detection [180] and autonomous underwater navigation. YOLO-based models have also been applied to sonar data, with adaptations like YOLOv5m for fish detection [181] and YOLOv7 for small object detection in SSS images [182], demonstrating the utility of deep learning (DL) in underwater applications.

FIGURE 16. Marine trash segmentation results as reported by [177]. From left to right: the forward-looking sonar images, the ground truth segmentation masks, and the segmentation masks obtained using LMA-net [177].

Recognizing novel objects in sonar images presents challenges, particularly for zero-shot and few-shot scenarios. Approaches like MFSANet [20] and GCEANet [183] tackle this by synthesizing pseudo-SSS images from optical-acoustic pairs, achieving promising results in zero-shot classification. Few-shot learning has also shown potential for underwater image classification with limited samples [184]. For 3D reconstruction, conditional GANs [185] are used to add elevation characteristics to acoustic images, enabling better depth estimation and enhancing spatial understanding for autonomous systems.

Motion estimation and localization tasks benefit from DL-based methods like UAT-SSIC [186], which improves SLAM performance through efficient sonar image compression. Similarly, self-supervised learning [187] supports stable elevation estimation, and hierarchical matching strategies [188] enhance terrain-based localization by effectively matching real-time and reference terrain images. These advancements address challenges like limited bandwidth and data diversity, enabling robust underwater navigation and mapping with autonomous vehicles.

While this subsection provides an overview of recent progress in sonar and acoustic imaging, more comprehensive insights can be found in the reviews by Chai et al. [189] and Aubard et al. [190].

B. Stereo Vision and Depth Cameras

Stereo vision and depth cameras are critical for providing 3D information in underwater environments, but challenges such as light attenuation, scattering, and refraction limit their effectiveness. Recent advances leverage deep learning (DL) and innovative techniques to overcome these limitations. For instance, unsupervised deep networks have been developed to generate dense depth maps and restore color-corrected imagery from raw underwater stereo data by modeling image formation and geometric constraints [191]. Other models address light refraction through enhanced calibration methods for camera parameters, integrating CNN-based multi-scale feature extraction for accurate object detection and trajectory estimation using multi-camera motion fusion [192].

These advancements extend to practical applications like fish population monitoring, where stereo vision systems predict underwater depth and estimate fish pose for accurate weight and length measurements. A stereo-vision-based monitoring system for Oplegnathus punctatus achieves 94% accuracy, outperforming traditional optical methods [193]. Similarly, DL-based object matching in aquaculture cages reduces metric estimation errors to under 5%, showcasing the potential of these methods for precise measurements in challenging underwater conditions [194]. Recent surveys provide a detailed overview of stereo-vision applications in underwater aquaculture [195], [196].
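Underlying such systems is the classical rectified-stereo relation between disparity and depth; the sketch below applies it with hypothetical focal length, baseline, and disparity values chosen purely for illustration.

```python
def depth_from_disparity(disparity_px: float, focal_length_px: float, baseline_m: float) -> float:
    """Classic pinhole stereo relation Z = f * B / d (assumes rectified image pairs)."""
    return focal_length_px * baseline_m / disparity_px

# Hypothetical calibrated rig: 1,400 px focal length and a 12 cm baseline.
# A feature matched with a 35 px disparity lies roughly 4.8 m from the cameras.
print(round(depth_from_disparity(35.0, 1400.0, 0.12), 2))  # 4.8
```

In practice, underwater refraction through the camera housing alters the effective calibration, which is why the refraction-aware calibration methods discussed above are needed.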

C. Laser Imaging and LiDAR

Laser imaging and Light Detection and Ranging (LiDAR) technologies are emerging as powerful tools for underwater exploration, offering advanced capabilities for remote sensing in challenging environments. Although widely utilized in terrestrial applications such as robotics and autonomous vehicles [197], LiDAR is still relatively new to underwater applications. Recent innovations have also focused on enhancing underwater object detection (UOD), with a detailed review by Adeoluwa et al. [198] highlighting laser-based image enhancement techniques.

Accurate depth estimation and 3D reconstruction have seen significant improvements with methods like full-waveform LiDAR combined with advanced data processing techniques. For instance, NLEB employs a non-local encoder block with spatial dilated convolution to enhance feature extraction and depth measurement accuracy [199]. Similarly, binocular-laser fusion integrates structured light and laser scanning to achieve joint image denoising and precise 3D point cloud generation, enabling better underwater active vision systems [200] as shown in Figure 17.

FIGURE 17. Image of a starfish (left), the depth ground truth (middle), and the 3D reconstruction result (right) obtained by applying the binocular-laser fusion method proposed by [200].

LiDAR is also utilized in ecological and archaeological studies. Bathymetric LiDAR data and contour mapping have facilitated the identification of submerged archaeological features in shallow waters [201]. A scanning multispectral confocal LiDAR system provides high-resolution, non-intrusive sensing of plankton abundance for ecosystem monitoring [202]. Furthermore, range-gated laser imaging combined with deep learning offers improved fishing net detection [203], while neural networks like PointNet++ have demonstrated remarkable accuracy in human pose recognition for safety and surveillance, even in challenging underwater environments [204]. These advancements highlight LiDAR’s growing versatility in underwater exploration and monitoring.

D. Hyperspectral Imaging

Hyperspectral and multispectral imaging technologies have become essential tools for underwater target detection and resource exploration, addressing the challenges of the complex underwater environment. Traditional methods for Hyperspectral Underwater Target Detection (HUTD) often overlook spatial features, a limitation addressed by frameworks such as the spectral-spatial depth-based model in [205], which employs 3D convolutions and a data-transferring network to reduce noise. Similarly, UTD-Net [206] integrates hyperspectral unmixing and bathymetric models, combining anomaly detection and autoencoder-based techniques to enhance target detection. The Self-Improving Underwater Target Detection Framework (SUTDF) [207] incorporates depth estimation with an iterative process to improve detection accuracy further.

Hyperspectral imaging surveys are often limited by flat surface assumptions and navigation drift, resulting in low-quality photo-mosaics. To address this, [208] combines RGB, inertial, and hyperspectral data to create accurate 3D reconstructions with hyperspectral textures, while [209] introduces the True2 Orthoimage Map (T2OM), integrating 3D coordinates and façade textures with high accuracy. The adoption of multi-modal sensor fusion further enhances spatial mapping by addressing alignment errors caused by navigation drift. These methods significantly improve 3D spatial modeling for diverse applications.

Unsupervised methods are also advancing underwater detection by reducing dependency on extensive training data. For example, a feature-band-based approach [210] identifies optimal bands for detection near coastlines without requiring large datasets, while the Transfer-Based Underwater Target Detection Framework (TUTDF) [211] uses synthetic data and spatial-spectral processing to achieve robust results. These advancements showcase the potential of combining spectral and spatial features with deep learning to overcome limitations in traditional underwater target detection. For a broader perspective on underwater hyperspectral imaging, including its applications in seafloor mapping and detection, readers are referred to the review by Lie et al. [212].

E. Polarized Light Imaging

Polarized light imaging provides a novel approach to address the challenges posed by scattering in underwater environments and complex remote sensing scenarios. Recent advances integrate polarization information with DL techniques for target detection, image reconstruction, and underwater vision enhancement. A modified UNet-based DL network [216] incorporates polarization information into underwater restoration, while a physical polarimetric model combined with a deep neural network [217] improves image quality in high turbidity conditions by transforming polarization-dependent parameters and introducing a polarization perceptual loss. Learning-based methods using dense networks with residual learning [218] and unsupervised restoration methods that integrate depth and polarization information [219] further enhance underwater image recovery and noise reduction.

Polarimetric imaging has also improved target detection in turbid underwater environments, especially for camouflaged objects. DL-based polarization imaging techniques [220] demonstrate superior performance, even for detecting camouflaged targets. Stokes vector-based parameter images are introduced in [213] to visualize polarization specificity, combining Otsu segmentation and morphological operations to extract camouflaged target signatures effectively, as shown in Figure 18.

FIGURE 18. Detection results reported by Shen et al. [213] under low-illumination conditions: (g1–h1) low-light $S_{0}$ Stokes images, (g2–h2) enhanced $S_{0}$ Stokes images, (g3–h3) polarization parameter $I_{S}$ images generated by the proposed method, (g4–h4) IE images showing the extracted target polarization signatures, (g5–h5) detection results obtained by Tyo et al. [214], (g6–h6) detection results obtained using the method by Wang et al. [215], and (g7–h7) detection results produced by Shen et al. [213].

Polarized light imaging also enables accurate 3D shape recovery in turbid underwater scenes. A 3-stage neural network [221] is proposed for 3D reconstruction of underwater objects, incorporating local feature extraction, encoder, and decoder blocks. The method estimates the Degree of Polarization (DoP) and Angle of Polarization (AoP) to reconstruct the 3D shape of submerged objects effectively. These advancements demonstrate the potential of integrating polarization information with DL to address the inherent challenges of underwater imaging.
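For reference, the linear Stokes parameters and the derived Degree of Linear Polarization and Angle of Polarization can be computed from four polarizer-filtered intensity measurements as in the sketch below; the pixel intensities are hypothetical and the snippet is only a reminder of the standard definitions, not the processing pipeline of any cited work.

```python
import numpy as np

# Intensities measured through linear polarizers at 0, 45, 90, and 135 degrees
# (hypothetical values for a single pixel).
I0, I45, I90, I135 = 0.80, 0.55, 0.30, 0.45

# Linear Stokes parameters.
S0 = I0 + I90
S1 = I0 - I90
S2 = I45 - I135

dolp = np.sqrt(S1**2 + S2**2) / S0     # Degree of Linear Polarization
aop = 0.5 * np.arctan2(S2, S1)         # Angle of Polarization (radians)
print(round(dolp, 3), round(np.degrees(aop), 1))
```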

F. Magnetic Imaging

Magnetic imaging techniques are crucial in the non-destructive testing of underwater metallic structures [222]. Conventional Alternating Current Field Measurement (ACFM) methods, while effective, are often costly and time-intensive, limiting their practicality for frequent monitoring. Recent advancements leverage deep learning (DL) and magnetic sensor technologies to overcome these challenges.

A flexible ACFM magnetic sensor array combined with a crack quantification method based on multi-physical features fusion using a CNN has been proposed to enhance efficiency and accuracy in defect detection [223]. Another approach employs a DL-based identification method that utilizes current perturbation theory to analyze distorted currents and magnetic fields caused by cracks, linking characteristic signals to defect morphology [224]. Figure 19 illustrates the developed gradient imaging algorithm, which allows a CNN to detect visual defects.

FIGURE 19. Gradient image of an irregular underwater crack captured by an ACFM [224].

However, magnetic images are limited to defect detection on metallic surfaces. Future work could explore the utilization of DL models with interfacial waves to detect defects in concrete surfaces as well [225].

G. Event-Based Imaging

Recent advancements in the field of neuromorphic vision have focused on leveraging event cameras, inspired by biological vision systems, to overcome the limitations associated with frame-based cameras, including degraded low-visibility performance, limited dynamic ranges, and high storage requirements [226]. Event cameras address these limitations, where each pixel works independently and asynchronously, detecting and reporting changes in brightness as they happen while remaining inactive when there are no changes [226]. Event-based imaging presents a novel and efficient approach to addressing computer vision challenges in underwater environments.

For instance, a novel event-RGB multi-scale image fusion framework presented in [227] has shown improved capability in addressing the color distortion, reduced visibility, and uneven illumination of underwater images. Moreover, event cameras can capture details that optical cameras cannot, such as transparently camouflaged organisms, whose near-invisibility and indistinct boundaries make them difficult to detect. Aqua-Eye [228] was developed as the first large-scale dataset of underwater transparently camouflaged objects captured using event-based cameras. The researchers also introduce TransCODNet, a detection network specifically tailored for this task, which achieves a detection accuracy of 84.7%, surpassing optical imaging-based object detection methods. For event-based object detection, a circular object detection method leveraging spiking neural network frameworks is proposed in [229] to improve the accuracy of AUV docking tasks.

AUV and robotic systems utilize event cameras for their improved performance and power efficiency. Traditional computer vision methods pose computational challenges for mobile robots, limiting their image analysis capabilities on board. Event-based cameras demonstrated improved swimming trajectories and processing speed for biomimetic robotics [230].
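To illustrate the asynchronous data format, the snippet below accumulates a hypothetical stream of (x, y, timestamp, polarity) events into a dense frame over a short time window, which is one simple way such data can be fed to a conventional CNN detector; the event values and window length are invented for illustration.

```python
import numpy as np

# Hypothetical event stream: each event is (x, y, timestamp_us, polarity),
# emitted asynchronously only where brightness changes.
events = [(5, 3, 1000, +1), (5, 3, 1400, +1), (7, 2, 1600, -1), (1, 8, 2100, +1)]

height, width = 10, 10
frame = np.zeros((height, width), dtype=np.int32)

# Accumulate polarities over a time window to build a sparse "event frame".
for x, y, t, p in events:
    if t < 2000:                 # keep only events inside the window
        frame[y, x] += p

print(frame[3, 5], frame[2, 7])  # 2 -1
```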

A comparison of the aforementioned imaging modalities is shown in Table 9 in terms of working principle, advantages, limitations, and applications.

TABLE 9. Comparison of Underwater Imaging Technologies

The comparison highlights the trade-offs between cost, accuracy, and usability among the various imaging technologies. Sonar and acoustic imaging excel in turbid and dark environments but suffer from lower resolution. Optical and hyperspectral imaging deliver high detail but are more expensive and limited by water clarity. LiDAR offers rapid and accurate measurements in clear waters but struggles in turbid conditions. These trade-offs emphasize the importance of choosing the appropriate imaging technology based on specific underwater application requirements.

SECTION VI.

Challenges and Limitations

The inherent challenges of underwater optical imaging hinder vision-based data acquisition systems due to image blur and color distortions. Moreover, numerous marine species present detection challenges due to their diminutive size, camouflage capabilities, and movement within swarms, which results in occlusion. Furthermore, DL models are complex and computationally expensive, making real-time onboard processing performance more challenging.

A. Data Scarcity and Generalizability

Data collection in underwater environments is a complex and expensive task requiring the use of underwater vehicles and professional divers [1]. As a result, underwater datasets are often scarce, leading to overfitting and limited generalizability of DL models trained on specific datasets. The distribution of underwater images varies significantly depending on factors like the scattering properties of suspended particles and the non-uniform color absorption of water at different depths. These variations can lead to significant domain shifts, where models trained on data from one environment perform poorly in others [28].

Future mitigation strategies include leveraging domain adaptation techniques to minimize domain gaps and augmenting datasets using Generative Adversarial Networks (GANs) and synthetic data generation. These approaches aim to improve generalization and model robustness across diverse underwater environments.

B. Model Complexity and Real-Time Performance

AUVs are becoming increasingly important tools for underwater exploration and monitoring [231]. However, their effective operation requires efficient DL models that can run in real time with minimal power consumption. Unfortunately, many recent advancements in underwater computer vision rely on computationally expensive architectures like Siamese trackers, RCNNs, GANs, and Transformers. These models are typically not well-suited for the resource-constrained environments of AUVs [232].

For instance, Table 10 highlights the limitations of existing UOT models on the UTB180 dataset. SiamRPN [233] and SiamCAR [234] exhibit high parameter counts and FLOPs, while SiamFC [235], though significantly lighter with only 2.33 million parameters, suffers from poor tracking performance in terms of AUC and precision. Stark-ST50 [236] and TransT [237] achieve a more balanced trade-off, with optimized parameter sizes (26.2M and 18.54M, respectively) and improved tracking performance. However, these models remain unsuitable for deployment on resource-constrained processors [164].

TABLE 10. Comparison of the Performance of UOT Models on the UTB180 Dataset, Reported by [164]

Hardware accelerations, such as edge TPU processors and neuromorphic computing, offer promising solutions to reduce computational costs while maintaining high performance. Additionally, model pruning, quantization, and knowledge distillation are being explored to create lightweight and efficient DL architectures for underwater applications.

C. Challenging Scenarios in Object Detection and Tracking

Underwater object detection and tracking present unique challenges including occlusions, camouflaging, and small object detection.

Occlusion, where one object partially or fully hides another, poses a significant problem in UOT, particularly when dealing with mobile organisms like fish [161]. In fish schools, individual fish frequently occlude each other, making it difficult for tracking systems to maintain accurate identification. When occlusion occurs, the system might lose track of the target fish and follow an incorrect one, compromising the entire tracking process. Effective strategies for handling occlusion are crucial for robust UOT, especially in dynamic and cluttered environments [2], [18].

Many marine animals, such as coral, pipefish, and shrimp, have evolved elaborate camouflage techniques to blend in with their surroundings (seabeds, coral reefs, beaches) [19]. These camouflage strategies significantly complicate the task of accurately segmenting and identifying marine animals in underwater images. Researchers are actively exploring methods to improve the segmentation of camouflaged objects [144].

Small object detection is also a common challenge in computer vision, and it is particularly acute in underwater environments. The ocean is home to a vast array of small-scale marine benthos (bottom-dwelling organisms) such as sea urchins and scallops. Their diminutive size and the complexities of the underwater environment make them difficult to detect using traditional computer vision techniques.

D. AI Interpretability

Even though AI-powered systems have brought competitive advantages across various fields, their black-box nature poses significant challenges. This lack of transparency raises concerns about reliability, accountability, and trustworthiness, which are crucial for applications in underwater environments where high-stakes decisions are often made. For example, tasks such as identifying marine species or monitoring structural integrity rely heavily on accurate and explainable predictions. The inherent variability of underwater data, influenced by factors like light scattering, turbidity, and color absorption, exacerbates the problem, making it even harder to interpret how AI systems adapt to such conditions. Without clear insights into the decision-making process, it becomes difficult for researchers and engineers to trust AI models, optimize their performance, or deploy them confidently in real-world underwater scenarios. This challenge has driven the need for explainable artificial intelligence (XAI) in underwater computer vision, where models must not only perform well but also provide justifications for their predictions [238].

Emerging frameworks, such as SHAP (Shapley Additive Explanations) and Grad-CAM [239], have been proposed to improve AI interpretability by visualizing feature importance and decision pathways. Incorporating these frameworks into underwater DL models can enhance transparency, enabling better debugging and validation of predictions.
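A minimal Grad-CAM sketch is shown below, hooking the last convolutional stage of a torchvision ResNet-18 as a stand-in backbone; an underwater detector would hook its own feature extractor instead, and the randomly initialized weights and random input are placeholders.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

# Stand-in backbone (random weights); a real study would load a trained model.
model = resnet18(weights=None).eval()
activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["feat"] = output          # feature maps of the hooked layer

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0]       # gradients w.r.t. those feature maps

layer = model.layer4                       # last convolutional stage
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

x = torch.randn(1, 3, 224, 224)
scores = model(x)
scores[0, scores.argmax()].backward()      # gradient of the top class score

weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)   # GAP of gradients
cam = F.relu((weights * activations["feat"]).sum(dim=1))     # weighted combination
cam = F.interpolate(cam.unsqueeze(1), size=(224, 224), mode="bilinear", align_corners=False)
print(cam.shape)  # torch.Size([1, 1, 224, 224]), a coarse saliency map
```

The resulting map can be overlaid on the input image to show which regions most influenced the prediction.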

SECTION VII.

Future Directions

This section overviews the potential future directions and provides a roadmap to overcome the performance challenges of DL models in underwater environments.

A. Synthetic Underwater Image Data Generation

Addressing the data scarcity challenge, recent approaches aim to reduce the cost of data collection by utilizing synthetic underwater image data generation through data augmentation, simulated data, and part-insertion methods.

1) Data Augmentation

The performance of DL-based vision models varies significantly when the lighting conditions and color distortions of the underwater environment are varied [3]. Therefore, data augmentation techniques have been employed to enhance underwater datasets by incorporating diverse noise effects, improving model robustness, and reducing domain specificity.

For underwater image classification, Xu et al. [240] propose two data augmentation methods: GAN-based augmentation and optic transformations. Similarly, Chavez et al. [241] introduce a data augmentation technique for diver gesture classification, applying pixel-based disturbances (e.g., Gaussian blur, brightness shifts, white balance, underwater alpha blending), as well as geometry-contextual disturbances.
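The following sketch approximates this style of pixel-level augmentation with a brightness shift, Gaussian blur, and a blue-green alpha-blended color cast; it is an illustrative approximation under assumed parameter ranges, not the exact pipelines of [240] or [241].

```python
import cv2
import numpy as np

def underwater_augment(image_bgr: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Apply a few pixel-level disturbances that loosely mimic underwater degradation
    (illustrative approximation; parameter ranges are assumptions)."""
    out = image_bgr.astype(np.float32)

    # Random brightness shift.
    out += rng.uniform(-30, 30)

    # Gaussian blur to imitate scattering-induced softness.
    k = int(rng.choice([3, 5, 7]))
    out = cv2.GaussianBlur(out, (k, k), 0)

    # Blue-green color cast via alpha blending with a water-like tint (BGR order).
    tint = np.array([120.0, 90.0, 20.0], dtype=np.float32)
    alpha = rng.uniform(0.1, 0.4)
    out = (1 - alpha) * out + alpha * tint

    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
sample = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
print(underwater_augment(sample, rng).shape)  # (64, 64, 3)
```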

In the context of domain-agnostic underwater object detection (UOD), Lu et al. [242] propose a visual restoration method combined with non-reference assessments to improve detection continuity and stability, while temporal performance is enhanced through an online tracklet refinement method. Flexibility between detection and tracking is achieved using small-overlap suppression. The framework’s effectiveness is demonstrated through experiments on the ImageNet VID dataset and real-world tasks. Researchers have also utilized image interpolation to detect fish under varying color distortions [243]. For underwater crack detection, Huang et al. [244] employ image-to-image translation, such as CycleGAN, to generate synthetic underwater crack images from above-water counterparts.

2) Simulated Data

3D simulation platforms offer a low-cost alternative for underwater image generation. For instance, to address the challenge of estimating the 6D pose of Autonomous Underwater Vehicles (AUVs) from single images, Joshi et al. [17] utilized rendered images from Unreal Engine for training. They applied image-to-image translation networks to make the simulated 3D images more realistic. A similar approach, ShipGAN [245], utilizes GAN networks to translate unmanned surface vehicle simulation images from Unity into realistic scenarios.

3) Part Insertion

Part insertion involves inserting patches of domain-specific objects in marine backgrounds. For instance, to address the challenges of partial visibility or occlusion in detecting lobsters, Mahmood et al. [246] propose synthesizing lobster parts and embedding them into various marine backgrounds, thereby creating a robust dataset for lobster detection. Similarly, in addressing limited data availability for marine debris identification, researchers introduce a two-stage variational autoencoder framework for part insertion of litter into marine backgrounds [247].

In summary, efforts must be directed toward curating large-scale datasets with broader environmental variability, including multi-location and multi-season data. Leveraging synthetic data augmentation and simulation-based approaches can also bridge the data scarcity gap, enabling models to generalize across environments.

B. Lightweight Models for Real-Time Performance

To address the limitations of model computational complexity and slow processing time, recent research has focused on developing decentralized approaches, as well as simpler and more lightweight models that can still achieve good performance on underwater vision tasks while meeting the operational requirements of AUVs [87], [248]. In particular, decentralized frameworks using blockchain technology ensure secure data sharing and scalable collaboration among multiple agents, making them suitable for underwater swarm robotics applications [249]. Moreover, in decentralized AUV systems, federated learning enables collaborative model training across multiple AUVs without requiring the transfer of raw data, thereby preserving privacy and reducing communication costs [250], [251]. Future studies should focus on dynamic model updates and adaptive learning rates to further optimize federated systems for underwater autonomous platforms.

1) Mobile Models

Mobile models employ lightweight architectures such as MobileNet, ShuffleNet, or tiny versions of YOLO (e.g., YOLOv8-tiny) for enhanced real-time performance. Such methods use techniques like depthwise separable convolutions to reduce the number of parameters and computations. For instance, DU-MobileYOLO [112] employs a lightweight backbone called Mobile-bone with deformable upsampling, using only 4.7M parameters. Similarly, a combined lightweight color correction and object detection model was deployed by Yeh et al. [14] on the Raspberry Pi platform, showcasing its potential for real-time processing on AUVs. Moreover, for underwater object tracking, LightFC [164] utilizes a novel efficient cross-correlation module, achieving good tracking performance on the UTB180 dataset with only 3.16M parameters.
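The parameter savings from depthwise separable convolutions can be checked directly in PyTorch; the 64-to-128-channel, 3×3 configuration below is an arbitrary example, not a layer from any cited model.

```python
import torch.nn as nn

# Parameter count of a standard 3x3 convolution vs. a depthwise separable
# one (depthwise 3x3 followed by pointwise 1x1), mapping 64 to 128 channels.
standard = nn.Conv2d(64, 128, kernel_size=3, padding=1)
depthwise_separable = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64),  # depthwise
    nn.Conv2d(64, 128, kernel_size=1),                        # pointwise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard), count(depthwise_separable))  # 73856 vs 8960 (~8x fewer parameters)
```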

2) Model Pruning

Pruning is the process of removing redundant or less significant parameters from a neural network, such as individual weights, neurons, or entire layers. By eliminating these elements, pruning reduces the size of the model and its computational complexity, resulting in faster inference and lower resource usage. Pruning techniques have been recently utilized for underwater computer vision tasks such as image dehazing [252] and UOD [21].
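As a minimal sketch of magnitude-based pruning using PyTorch's built-in utilities (the layer shape and the 50% pruning ratio are arbitrary illustrative choices):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Magnitude-based pruning: zero out the 50% smallest-magnitude weights of a layer.
layer = nn.Conv2d(16, 32, kernel_size=3)
prune.l1_unstructured(layer, name="weight", amount=0.5)

sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity after pruning: {sparsity:.2f}")   # ~0.50

# Make the pruning permanent (removes the reparameterization mask).
prune.remove(layer, "weight")
```

Unstructured pruning such as this mainly reduces model size; structured pruning of whole filters is usually needed to realize actual speed-ups on embedded hardware.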

3) Knowledge Distillation

Knowledge distillation is a model compression technique where a simpler model, called a student model, learns to replicate the behavior of a more complex model, called a teacher model. Instead of relying solely on the original dataset labels, the student model is trained on soft labels provided by the teacher model, which convey richer information about the output distribution. This process allows the student model to approximate the accuracy of the teacher model while being significantly faster and more efficient. For instance, Qiao et al. [253] propose a semi-supervised feature distillation method combined with an unsupervised domain adversarial distillation approach for enhancing underwater images. Onlin_XKD [254] is another knowledge distillation approach, utilizing a mutual knowledge transfer structure for online distillation in UOD.
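A common Hinton-style distillation loss blends a soft-label KL-divergence term with the usual hard-label cross-entropy; the sketch below uses assumed temperature and weighting values and random logits purely for illustration, and is not the specific loss of [253] or [254].

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T: float = 4.0, alpha: float = 0.5):
    """Blend hard-label cross-entropy with soft-label KL divergence (Hinton-style KD)."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student = torch.randn(8, 10, requires_grad=True)   # e.g., lightweight model outputs
teacher = torch.randn(8, 10)                       # e.g., large pretrained model outputs
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels).item())
```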

Future research should focus on optimizing neural networks for low-latency performance and developing hardware-aware models. Techniques such as pruning and knowledge distillation can significantly reduce model complexity without sacrificing accuracy. Moreover, real-time testing frameworks should be integrated to benchmark performance and energy consumption under realistic operational conditions.

C. Zero-Shot and Few-Shot Learning

Few-shot learning approaches are utilized to improve the model’s performance with limited training data, making them ideal for underwater environments where data collection is challenging due to poor visibility, high costs, and dynamic conditions. Such approaches adapt to new scenarios by leveraging knowledge from related tasks or pre-trained models. Zero-shot learning, on the other hand, enables models to predict unseen classes by using relationships between known and unknown tasks, often through semantic information like descriptions or attributes.

1) Siamese Models

Siamese networks are widely used in few-shot learning for their ability to compare pairs of inputs and determine their similarity [255]. By learning a shared embedding space, they can generalize to unseen classes by measuring distances between embeddings, making them ideal for tasks like image recognition or verification with limited data. For instance, Siamese networks have been utilized for few-shot underwater image classification tasks including debris classification in sonar images [184], and fish re-identification in optical images [256].

2) Vision-Language Models

Recent advancements in AI-based language models have seen the emergence of vision-language models that combine visual and textual information for various tasks. For instance, the Vision-Guided Semantic-Group Network (VGSG) [257] aligns visual and textual features through two modules: Semantic-Group Textual Learning (SGTL), which groups textual features by semantic cues, and Vision-Guided Knowledge Transfer (VGKT), which uses vision-guided attention and relational knowledge transfer to refine alignment. These modules have proven effective in underwater object classification, enabling multi-modal learning frameworks that integrate visual and descriptive cues.

Leveraging human language as a new modality, these models have shown promise in few-shot learning approaches, where models can learn to classify new categories of objects even with limited training data [258]. This capability can be particularly advantageous in underwater computer vision, where data collection can be expensive and time-consuming. Promising models like CLIP [259] can potentially be used for tasks suffering from data scarcity in underwater environments, facilitating the use of few-shot and zero-shot learning techniques as presented in [260].
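As a hedged sketch of zero-shot classification with a publicly available CLIP checkpoint via Hugging Face Transformers (the prompt texts, image path, and checkpoint choice are illustrative assumptions, not a recipe from the cited works):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical underwater categories expressed as natural-language prompts.
labels = ["a photo of a sea urchin", "a photo of a scallop", "a photo of marine debris"]
image = Image.open("underwater_frame.jpg")   # hypothetical frame from an AUV camera

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, converted to per-label probabilities.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs.squeeze().tolist())))
```

No underwater training data is required for this step, although domain-specific fine-tuning or prompt engineering would likely be needed for reliable performance on degraded imagery.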

3) SAM Models

Another promising approach for improving zero-shot learning is the Segment Anything Model (SAM) [143]. Introduced in 2023, this model has revolutionized image segmentation by achieving strong performance on diverse tasks. The network’s ability to generalize across various domains makes it particularly attractive for underwater image analysis. For instance, Zheng et al. [10] developed the Segment any COral Image on this Planet (CoralSCOP) model based on a SAM architecture for accurate coral segmentation. Similarly, other researchers have explored dual-structured SAMs for MAS tasks [144]. Zero-shot segmentation based on SAM and interpretable Contrastive Language-Image Pre-training (CLIP) [260] also showed efficient performance for the task of marine litter detection.

In conclusion, few-shot and zero-shot learning approaches provide a potential solution to the challenges of underwater data scarcity by leveraging Siamese, SAM, and vision-language models. However, it is important to note that the accuracy of few-shot models remains considerably lower than that of DL models trained on comprehensive datasets, owing to their limited generalizability.

D. Addressing Camouflaging, Occlusions, and Small Object Detection

1) Specialized Architectures

Specialized DL architectures are crucial for detecting and tracking objects in challenging underwater scenarios, where factors like camouflage, occlusions, and tiny objects complicate the task. For example, MASNet [12] addresses camouflaged marine object detection by introducing a novel data augmentation strategy that modifies object degradation and camouflage attributes. This approach enables a Siamese-style fusion network with an attention mechanism to learn shared semantic representations effectively. Similarly, BCMNet [135] improves detection by integrating texture and contextual clues through bidirectional interactions during the encoding and decoding stages.

Addressing tiny object detection, current methods emphasize enhancing feature extraction and fusion to address challenges like scale variation and cluttered backgrounds. Advanced backbones, such as STCF-EANet and VOVDarkNet, are designed to capture fine-grained details and multi-scale features, while modules like Bottom Focus-PAN, AFC-PAN, and Path-Aggregated Feature Pyramid Networks (PAFPN) ensure effective integration of shallow and deep features for better representation of small objects [106], [108], [118]. Attention mechanisms, such as Locally Enhanced Position Encoding and Path Enhancement Modules (PEM), further improve contextual focus, enabling models to prioritize key regions and reduce detection errors in noisy environments. Transformer-based architectures, like PE-Transformer [114], establish rich dependency relationships between features, leveraging global attention and path enhancement for superior semantic representation.

Recent advancements in object tracking under occlusions emphasize unified frameworks that combine detection and tracking while leveraging historical and contextual information for robust performance. Techniques such as LSTM-based memory for storing feature vectors and predicting trajectories during occlusions [2], deformable convolutions for enhancing context features in dense environments [8], and autoregressive encoders for spatiotemporal updates [7] are key to maintaining continuity in challenging scenarios. Fault tolerance is improved through methods like multiple association strategies to mitigate ID switching [167] and dual-decoder architectures to decouple motion and appearance cues for better identity recovery after occlusions [7]. Transformer-based models strengthen tracking by capturing spatiotemporal dependencies using query-key mechanisms, enabling precise tracking even under severe occlusion and deformation [7], [167]. Furthermore, reinforcement learning has recently emerged as a powerful tool for multi-object tracking under occlusions [249]. Together, these innovations enhance resilience and accuracy in occlusion-heavy scenarios, demonstrating their effectiveness in dynamic environments like underwater habitats and aquaculture systems.

2) Detection and Tracking by Utilizing Multi-Modal Data

Beyond traditional optical imaging modalities, event cameras offer a unique approach to image capture. Unlike traditional cameras that capture entire image frames at fixed intervals, event cameras record only changes in pixel intensity over time [226]. This makes them well-suited for detecting camouflaged and tiny marine organisms [228] as shown in Figure 20. While event cameras have been successfully used in some underwater applications [227], [229], [230], their use in underwater computer vision is still in its early stages compared to aerial and ground applications where they have shown promise for several tasks including SLAM, navigation, and object recognition [261], [262].

FIGURE 20. Sample images of camouflaged marine animals from the Aqua-Eye dataset [228]: underwater event images (left), corresponding RGB frames (middle), and fusion results (right).

Moreover, depth cameras can enhance object tracking under occlusions by capturing 3D spatial information, allowing for better separation of foreground objects from the background. By analyzing depth data, tracking systems can estimate the position of partially occluded objects, predict their movements, and maintain continuity in tracking [162].

In summary, leveraging multi-modal data (including depth and event information), alongside the development of specialized architectures (such as multi-scale feature pyramid architectures and transformer-based attention mechanisms), holds significant potential for advancing detection and tracking in challenging scenarios.

E. Explainable AI

Interpretable models are a cornerstone of Explainable Artificial Intelligence (XAI), offering transparency, simplicity, and human-readable decision-making processes. These models, such as decision trees and rule-based systems, are naturally intelligible and provide both global and local interpretability. Key characteristics of interpretable models include their straightforward structure, transparency in decision-making, and human-readable representations, enabling users to trust and understand predictions easily [238]. Future work should focus on interpretable models in ocean applications, where understanding the logic behind AI decisions is critical.

SECTION VIII.

Conclusion

The field of underwater computer vision has made significant progress in recent years, largely driven by the integration of deep learning techniques. These advancements have enhanced capabilities in underwater image restoration, object recognition, object tracking, and the integration of non-RGB imaging technologies, enabling applications in marine exploration, biodiversity monitoring, and infrastructure inspection. Despite these achievements, several challenges and opportunities remain for further improvement.

One of the key findings of this survey highlights the critical role of image quality enhancement in underwater vision tasks. Deep learning models, including Generative Adversarial Networks (GANs) and attention-based architectures like UwTGAN, have shown substantial improvements in dehazing, color correction, and noise reduction. These approaches have set new benchmarks for image restoration, providing clearer visual inputs for downstream tasks. Nevertheless, addressing issues related to variable lighting, dynamic water conditions, and marine snow artifacts remains a priority. Future research should focus on hybrid approaches that combine data-driven deep learning techniques with physics-based models to improve robustness and generalization.

Object recognition and tracking have also benefited from deep learning, particularly through advanced architectures such as YOLOv8 variants and Feature-Adaptive FPN with Multiscale Context Integration (FA-FPN-MCI). These methods have demonstrated remarkable performance in detecting and tracking underwater objects, even in challenging conditions characterized by occlusions and low visibility. However, further work is needed to enhance the detection of small or camouflaged objects, optimize computational efficiency, and improve real-time processing capabilities for deployment in autonomous systems.

The integration of multi-modal imaging technologies, including sonar, laser, and hyperspectral sensors, has expanded the scope of underwater computer vision. These technologies, when combined with deep learning, provide richer and more diverse data representations that improve perception in non-optical environments. Future developments should focus on designing algorithms capable of effectively fusing multi-modal data and leveraging self-supervised learning techniques to address the scarcity of annotated datasets.

While current methods have delivered substantial improvements, critical gaps remain, particularly in developing adaptive models that can dynamically adjust to varying underwater conditions without retraining. Techniques such as domain adaptation, transfer learning, few-shot learning, and explainable AI can further strengthen model interpretability and reliability, enabling broader deployment in real-world scenarios. Additionally, prioritizing dataset diversity and extending benchmarks to cover diverse underwater conditions will enhance generalization and scalability.

In conclusion, this survey has provided a detailed review of state-of-the-art methodologies in underwater computer vision, highlighting deep learning’s transformative role in addressing key challenges. By leveraging advances in hybrid modeling, multi-modal integration, and real-time processing, future research can drive the development of more efficient, adaptable, and robust underwater vision systems. We aim for this work to serve as a comprehensive resource for researchers and practitioners, guiding future innovations in underwater imaging and analysis.
