Over the last 40 years, remote sensing technology has evolved significantly, advancing ocean observation and propelling its data into the big data era. How to efficiently and accurately process and analyze ocean big data, and how to solve practical problems based on it, constitutes a great challenge. Artificial intelligence (AI) technology has developed rapidly in recent years, and numerous deep learning (DL) models have emerged, becoming prevalent in big data analysis and practical problem solving. Among these, convolutional neural networks (CNNs) stand as a representative class of DL models and have established themselves as one of the premier solutions in various research areas, including computer vision and remote sensing applications. In this study, we first discuss the model architectures of CNNs and some of their variants as well as how they can be applied to the processing and analysis of ocean remote sensing data. Then, we demonstrate that CNNs can fulfill most of the requirements for ocean remote sensing applications across the following six categories: reconstruction of the 3D ocean field, information extraction, image superresolution, ocean phenomena forecasting, transfer learning methods, and CNN model interpretability methods. Finally, we discuss the technical challenges facing the application of CNNs to ocean remote sensing big data and summarize future research directions.
Introduction
AI is a multidisciplinary field encompassing various subfields and techniques within computer science and beyond [1]. In the early stages, AI primarily focused on areas such as natural language processing and image processing. During this phase, the concept of the “perceptron” was introduced, which endowed models with the logic to solve practical problems [2]. However, early models were limited, and AI research remained at a low ebb for several decades. It was not until the late 1980s that neural networks experienced a resurgence with the introduction of backpropagation algorithms, enabling models to achieve richer expressiveness and improved training efficiency. Nevertheless, deep neural networks still faced challenges in expressiveness because of limited computational resources and data availability. With the advancement of technology, numerous practical applications of AI emerged [3], [4], [5]. Since the 21st century, there has been rapid development in computer hardware and software, leading AI into a new era of DL [6]. DL techniques are progressing rapidly, driving technological advancements in various fields, such as biology, economics, medicine, and meteorology [7], [8], [9] (Figure 1).
AI development history. AI in the early stages was mainly focused on areas such as natural language processing and image processing. However, the early models did not show superiority over traditional methods, and progress remained slow. It was not until the late 1980s that neural networks made a resurgence with the introduction of backpropagation algorithms, allowing models to automatically mine features from data with richer expressive capabilities. However, deep neural networks still faced challenges in expressiveness because of limited computational resources and data availability. Since the 21st century, computer hardware and software have developed rapidly, leading AI into a new era of DL. Various AI models have appeared in large numbers, driving technological progress in fields such as biology, economics, medicine, and meteorology. ANN: artificial neural network.
The ocean is a continuous and indefinite spatiotemporal information container closely related to human life. The launch of the first ocean remote sensing satellite, Seasat, in 1978 marked a turning point in the human exploration of the ocean. Over the last 40 years, ocean observation data have expanded significantly in both amount and scope [10], [11]. This surge in remote sensing images and data marks a transition to the era of big data in ocean exploration [12].
Currently, most ocean data processing focuses on data-to-data transformations, while the transformation from data to knowledge is inadequate, leading to low utilization of the vast ocean data. Data-to-data processing involves basic transformations of raw data into a more organized or standardized format, while data-to-knowledge processing goes further by extracting meaningful insights and knowledge from the transformed data. Hence, data mining and analysis techniques are urgently needed to realize mining services for various ocean remote sensing data types and discover potential knowledge [13], [14], [15], [16].
Since 2006, the emergence of big data has led to a renewed interest in DL models, resulting in a new round of rapid development. Among these models, CNNs have provided new insights for analyzing and mining elemental information from ocean remote sensing big data [17], [18], [19], [20]. In the field of computer vision, the image features extracted by the CNN structure are very effective for tasks such as image recognition [21], [22], [23], object detection [24], [25], and semantic segmentation [26], [27]. Researchers have found that the features of ocean remote sensing data resemble those of the image and video data used in computer vision, highlighting the potential of CNN models in ocean remote sensing. Recent years have witnessed deep exploration of CNN models by ocean remote sensing researchers, driving AI advancements across six domains: the retrieval of the 3D ocean field, information extraction from ocean remote sensing images, the superresolution of such imagery, ocean phenomena forecasting, transfer learning with ocean remote sensing data, and CNN model interpretability in this context. Brief introductions to each category follow.
Retrieval of 3D Ocean Field
CNNs have gained significant attention in the retrieval of 3D ocean fields from ocean remote sensing because they can extract features by learning from extensive datasets, leading to more accurate parameter inversion than traditional methods. Researchers [28] have employed CNNs to estimate ice concentration in the melting season using synthetic aperture radar (SAR) images and tested them on dual-polarization RADARSAT-2 data. The study reveals that a CNN produces more detailed results than operational products. To improve the performance of general object detection, Cheng et al. [29] proposed an effective method for learning a rotation-invariant CNN, which enhances invariance to target rotation. CNN-based retrieval models enable significant improvements in the accuracy of 3D retrieval of marine parameters.
Information Extraction
Information extraction based on ocean remote sensing images is a critical aspect of Earth science research, and the application of CNNs has yielded promising results. In particular, Zhou et al. [30] used a CNN to classify radar images, with the covariance matrix as input. Duan et al. [31] replaced the traditional pooling layer in a CNN with a wavelet-constrained pooling layer and integrated convolutional-wavelet neural networks with superpixels and Markov random fields to produce segmented type maps. Some researchers have also utilized pretrained networks, such as OverFeat and GoogLeNet, for high-resolution satellite image classification, achieving improved results [32], [33], [34], [35], [36]. These approaches leverage the CNN as a local feature extractor and combine it with feature encoding techniques to produce the final image semantic segmentation output.
Superresolution Reconstruction
CNNs also perform excellently in the superresolution reconstruction of ocean information. Yatheendradas et al. successfully reconstructed 5-km MODIS data to 1 km using a three-layer conventional superresolution CNN (SRCNN) [37]. Kumar et al. used three SRCNN-based methods, namely, the SRCNN, stacked SRCNN, and DeepSD algorithms, to enhance the resolution of summer monsoon season data from the Indian Meteorological Department and Tropical Rainfall Measuring Mission datasets by a factor of 4, providing better results than linear interpolation [38]. This technique provides higher resolution data support for ocean ecosystem conservation and ocean meteorology research.
Ocean Phenomena Forecasting
CNNs also play an important role in ocean phenomena forecasting. For example, a network called ConvLSTM, which integrates a CNN and long short-term memory (LSTM), can process ocean parameters with spatiotemporal correlations. Researchers have used ocean remote sensing data that describe various ocean phenomena as inputs to ConvLSTM models for forecasting. The forecast accuracy for multiple ocean phenomena surpasses that of traditional numerical models, including the forecasting of ocean waves [39], [40], winds, sea surface temperature (SST) [41], and sea ice concentration (SIC) [42], [43]. For instance, Tong et al. predicted tropical cyclone intensity and track with ConvLSTM [44], and Petrou et al. predicted sea ice movement a few days ahead with ConvLSTM [45]. Gupta et al. used the ConvLSTM network to predict the monthly mean Nino3.4 index one year in advance and were also able to predict a strong El Niño [46]. Sinha et al. used ConvLSTM to predict daily sea level pressure (SLP) anomalies in central India and the Bay of Bengal seven days in advance, thereby predicting the strength of the Indian summer monsoon [47]. The predictions were compared with those of traditional numerical weather prediction models, and the comparison showed that the DL models better capture weather-scale SLP fluctuations. These approaches have improved the accuracy and timeliness of ocean phenomena forecasting.
Transfer Learning
CNN models’ transfer learning methods also command significant attention. CNN model transferability refers to the capacity of a model trained on one task or domain to be transplanted to different tasks or domains while retaining robust performance. Transfer learning mainly includes four methods: feature extraction, fine-tuning, domain adaptation, and knowledge distillation. Feature extraction involves using a pretrained model as a fixed feature extractor, keeping early layers frozen and modifying only task-specific layers for the target task. Fine-tuning extends this by adjusting earlier layers, enhancing adaptability to the new task while avoiding overfitting. Domain adaptation addresses shifts in data distribution between source and target tasks due to different environments, aligning feature distributions for improved applicability. Knowledge distillation transfers insights from a well-trained teacher model to a smaller student model, which is beneficial for resource-constrained deployments. These methods collectively enhance the versatility and efficiency of CNN transfer learning. For instance, Xu et al. employed CNN-based transfer learning to classify sea ice and open water areas in SAR images. Despite data scarcity, the method achieved an overall classification accuracy of 92.36% [48]. Lima et al. applied transfer learning methods to coastal identification. The proposed method was compared with a conventional handcrafted descriptor with bag-of-visual-words (BOVW), an original CNN, and last-layer fine-tuned CNN models; Lima’s method showed significantly higher accuracy than the other methods on both datasets [49]. Lumini and Nanni investigated fine-tuning and transfer learning among diverse DL models to design classifiers that exploit their diversity. The method achieved high classification accuracy and F-measure on three large publicly available datasets [50]. Jeon et al. utilized CNN transfer learning to identify sea fog from Geostationary Ocean Color Imager images. The prediction results of CNN-transfer learning (CNNTL) showed a high accuracy of 96.3% [51]. Panwar et al. harnessed deep transfer learning to automatically detect waste in water bodies. The proposed model detects and classifies the different pollutants and harmful waste items floating in the oceans and on seashores with a mean average precision of 0.81 [52]. Overall, the transferability of CNNs offers an effective and efficient solution for the ocean domain. By harnessing models trained in other domains, it becomes possible to curtail annotation costs and training durations for ocean datasets, thus facilitating wider adoption and practical implementation of DL applications in the ocean sector.
Interpretability
DL models have often been labeled as “black boxes,” meaning that their inputs and outputs are observable but their inner workings and decision-making processes remain opaque. Compared to other DL models, CNN models exhibit a relatively higher level of interpretability. This interpretability has found applications in the field of ocean remote sensing [53], [54], [55], [56], [57], [58]. For example, Wang et al. developed an interpretable DL model for El Niño/Southern Oscillation (ENSO) prediction using gradient-based backpropagation [53]. This model quantifies the contributions of distinct input regions to the outcomes, effectively illustrating the spatial features relevant to long-term ENSO forecasts. Roussillon et al. leveraged multimodal CNN and intrinsic interpretability methods to enhance comprehensibility [54]. Liu et al. introduced an interpretable DL approach, XDL, based on saliency maps, extracting explicable predictive information from global SST data and revealing SST-related regions and dependency structures associated with river flow rates [55]. Yu et al. explored interpretability in their improved STR-UNet model using the Deep Learning Important FeaTures (DeepLIFT) algorithm [56]. They identified satellite cloud-related features as the most crucial, followed by satellite water layer-related and temperature-related features. CNN interpretability is vital in ocean remote sensing, particularly in model validation, environmental monitoring, ecological conservation, and resource management. It boosts model reliability and efficiency and provides deeper insights for ocean data analysis and decision-making support for scientists, policy makers, and environmental agencies.
From a technical standpoint, the accomplishments in numerous successful scientific instances [57], [58], [59], [60], [61], [62], [63], [64], [65], [66], [67] involving CNN-based models can be credited to a crucial component within the CNN framework—the convolutional layer. This layer distinguishes these models from other neural network architectures and serves as their primary means of feature extraction: it performs convolution operations on the input data with convolution kernels, extracting locally relevant features. Compared with fully connected neuron connections, the convolutional layer improves the efficiency and capability of spatial feature extraction [68]. Because of its unique “parameter sharing” mechanism, the CNN can quickly and accurately extract data channel information. As a result, CNNs are highly advantageous in various fields, including remote sensing image information extraction, ocean data superresolution reconstruction, and ocean phenomena prediction [69], [70], [71].
In Figure 2, we use multiple concentric circles to represent the relationship between CNNs and ocean remote sensing. At its core, the support comes from the wealth of ocean big data brought by ocean remote sensing technology. When combined with various CNN models and CNN methods in the outer circles, it can propel advancements in different domains within ocean remote sensing (the outermost circle). For example, CNNs have been used for temperature [71], [72], salinity, wind field, significant wave height (SWH), and sea ice prediction [73], [74], [75], [76]. They have also been utilized for ship identification [51], [77], oil spill detection [60], and eddy detection [78], among other applications.
CNNs in ocean remote sensing. The core is the ocean remote sensing data that train various CNN models. The surrounding gray circles represent variants of various CNN models. When these CNN models are combined with different ocean remote sensing data, they can solve various kinds of problems in the field of ocean remote sensing, which we categorize into six domains: ocean information estimation, ocean remote sensing information extraction, ocean phenomena forecasting, ocean superresolution, transfer learning in ocean remote sensing, and interpretability methods in ocean remote sensing.
The arrangement of this article is as follows. The next section outlines various data processing techniques for ocean remote sensing data within CNN models. The section “Introduction to Classic CNN Models” introduces the structures of various CNN-based models applied in ocean remote sensing. The section “Introduction to CNN Methods in Ocean Remote Sensing Applications” presents the various AI methods applied to these CNN models. The section “Application of CNNs in Ocean Remote Sensing” presents the applications of these models and methods in the field of ocean remote sensing, and the section “Discussion” offers the conclusions of this work on CNN models and makes suggestions for future developments.
Introduction to Ocean Remote Sensing Data and AI-Related Processing
The data processing module is pivotal when employing CNN-based AI models in ocean remote sensing, directly influencing model performance and accuracy [79]. In this context, data processing encompasses collection, preprocessing, augmentation, labeling, cropping and resizing, partitioning, and postprocessing. This section offers a concise summary of data processing methods from pertinent articles, presenting a detailed guide to preparing ocean remote sensing data for CNN-based AI model applications.
Data Collection and Preparation
In ocean remote sensing, data processing begins with collecting and preparing data from various satellite observations. Common sources include variables like sea surface height (SSH), SST, sea surface salinity (SSS), and surface vector wind as well as images for shoreline detection [80] and seaweed monitoring [81]. For example, MODIS tracks sea temperature and ocean color [82], while Sentinel-3 records SSH [83], Aquarius/SAC-D gauges SSS [84], and WindSat captures sea surface wind [85]. When selecting images, it is vital to account for factors like cloud cover, tides, and vegetation. Ground-truth data below the ocean surface, such as temperature and salinity, are crucial for subsurface studies.
Data Preprocessing and Normalization
Data preprocessing is pivotal for maintaining data quality and consistency. It involves radiometric calibration, atmospheric correction, and image registration, which collectively remove noise and discrepancies from images. By ensuring that data are comparable on a physical level, the integrity of the data is preserved. Additionally, data normalization techniques, like standard deviation scaling, are implemented to standardize the data onto a uniform scale. This not only enhances model convergence speed but also bolsters overall training performance.
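As a minimal illustration of such standard deviation scaling, the sketch below standardizes a gridded field with NumPy; the NaN masking of land pixels and the grid size are illustrative assumptions rather than a prescription from any cited study.

```python
import numpy as np

def standardize(field, eps=1e-8):
    """Z-score normalization of a 2D remote sensing field (e.g., an SST grid).

    Land/missing pixels are assumed to be encoded as NaN and are ignored
    when computing the statistics.
    """
    mean = np.nanmean(field)
    std = np.nanstd(field)
    return (field - mean) / (std + eps)

# Example: standardize a synthetic 720 x 1440 SST grid before training.
sst = 15.0 + 10.0 * np.random.rand(720, 1440)  # placeholder values (degrees C)
sst_norm = standardize(sst)
```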
Data Augmentation
Ocean remote sensing data are often limited, so data augmentation is essential to enhance the model’s generalization capability. Techniques such as random contrast and brightness adjustments, image rotation, and horizontal/vertical flipping generate diversified training samples, increasing the model’s ability to handle different conditions robustly.
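A sketch of how such augmentations might be composed is shown below, assuming single-channel patches stored as NumPy arrays; the jitter ranges are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(patch):
    """Randomly flip, rotate, and jitter the brightness/contrast of a patch.

    `patch` is a 2D array (one channel); for multichannel data, the same
    geometric transform should be applied to every channel.
    """
    if rng.random() < 0.5:
        patch = np.fliplr(patch)                 # horizontal flip
    if rng.random() < 0.5:
        patch = np.flipud(patch)                 # vertical flip
    patch = np.rot90(patch, k=rng.integers(4))   # rotate by 0/90/180/270 degrees
    gain = rng.uniform(0.9, 1.1)                 # random contrast
    bias = rng.uniform(-0.05, 0.05)              # random brightness
    return gain * patch + bias
```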
Ground-Truth Labeling (for Information Extraction Tasks in Ocean Remote Sensing)
In supervised learning tasks, ground-truth labeling is necessary for training CNN models. However, the lack of corresponding ground-truth products might require manual extraction or labeling of ground-truth data in ocean remote sensing. For instance, one might draw waterline positions manually and use transparent layers with white lines as ground-truth data. Another example is generating paired input and output images, where the input images consist of original SAR images, and the output images contain manually extracted ground-truth data.
Data Cropping and Resizing
CNN models generally take fixed-size images (e.g., 128 × 128, 256 × 256, 512 × 512) as input. Therefore, ocean remote sensing data must be cropped into appropriately sized subimages and resized to meet the model’s input requirements. These subimages are typically used as samples for training and testing.
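The tiling step might look like the following sketch, assuming the scene fits in memory as a NumPy array; the patch size and stride are illustrative parameters.

```python
import numpy as np

def crop_patches(image, size=128, stride=128):
    """Tile a large remote sensing scene into fixed-size subimages.

    Returns an array of shape (n_patches, size, size); a non-overlapping
    grid is produced when stride == size.
    """
    h, w = image.shape[:2]
    patches = [
        image[i:i + size, j:j + size]
        for i in range(0, h - size + 1, stride)
        for j in range(0, w - size + 1, stride)
    ]
    return np.stack(patches)

scene = np.random.rand(1024, 2048)       # placeholder scene
samples = crop_patches(scene, size=256)  # 32 subimages of 256 x 256
```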
Data Partitioning
Datasets are divided into training, validation, and testing sets to evaluate model performance and prevent overfitting. Typically, 80% of the data is used for training, and 20% is used for validation and testing. The validation set is used to adjust model hyperparameters and monitor the training process, while the testing set is used to assess the model’s generalization ability on unseen data. It should be noted that, for time-dependent tasks, the data partitioning is done not randomly but by time to ensure independence between the training and evaluation sets. For example, in ocean remote sensing 3D structure reconstruction tasks, data from January 2004 to December 2017 (168 months) are used as the training set for AI model training, while data from 2018 (12 months) are used as the testing set to validate the model’s inversion capability [86].
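The time-based split in the example above can be expressed as the following sketch, in which the data cube and its monthly time axis are placeholders:

```python
import numpy as np

# Monthly fields stacked along the time axis: Jan 2004 - Dec 2018 (180 months).
months = np.arange(np.datetime64("2004-01"), np.datetime64("2019-01"))
data = np.random.rand(len(months), 64, 64)  # placeholder (time, lat, lon) cube

# Split by time, not at random: the first 168 months (2004-2017) train the
# model, and the 12 months of 2018 are held out for testing.
train_mask = months < np.datetime64("2018-01")
x_train, x_test = data[train_mask], data[~train_mask]
assert x_train.shape[0] == 168 and x_test.shape[0] == 12
```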
Postprocessing
In the postprocessing stage, filtering and smoothing techniques can be applied to remove noise from the predicted images and improve the spatial continuity of the results. Additionally, when applying the model to specific tasks in ocean remote sensing, result interpretation and analysis are required to utilize the model outputs effectively.
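As one example, a small median filter (shown here with SciPy purely as an illustration; the window size is an arbitrary choice) can suppress isolated noisy pixels in a predicted field:

```python
import numpy as np
from scipy.ndimage import median_filter

# Remove isolated noisy pixels from a predicted field (e.g., a segmentation
# map or a retrieved SST grid); larger windows smooth more aggressively at
# the cost of blurring fine structures.
prediction = np.random.rand(256, 256)        # placeholder model output
smoothed = median_filter(prediction, size=3)
```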
In summary, the data processing module is vital for utilizing CNN-based AI models in ocean remote sensing. Implementing these steps can boost model performance, extract valuable insights from ocean remote sensing data, and provide robust support for ocean research and resource management.
Introduction to Classic CNN Models
This section summarizes four fundamental CNN models in different domains of ocean remote sensing. In chronological order, we have illustrated four classic CNN architectures and methods in Figure 3. This section will explain the fundamental theories and application domains of these models.
The structure of various CNN models. (a) Traditional CNN model. (b) ConvLSTM. (c) U-Net. (d) SRCNN. BN: batch normalization.
Traditional CNN-Based Model
The traditional CNN model, introduced in the late 1980s by Yann LeCun and colleagues [87], gained popularity in computer vision, but the limited computing power of the era hindered its initial performance. As computing capabilities advanced in the 21st century, CNNs experienced considerable enhancements. This growth was epitomized in 2012 when AlexNet [21], a CNN model, won the ImageNet competition, amplifying the focus on CNN technology. Unlike traditional image processing methods that depend on manually crafted feature extractors—often specialized for particular applications and lacking in adaptability—CNN-based AI models use a data-driven approach for feature extraction. By training on extensive datasets, they derive deep and precise feature representations, ensuring more robust and adaptable models with enhanced generalization capabilities.
Since then, various CNN-based models, such as residual networks [88] and deep separable convolution [89], have emerged, making CNN a widely used DL model across various domains and disciplines.
A comprehensive traditional CNN model consists of input, convolutional, pooling, and output layers, as illustrated in Figure 3(a). The input layer passes the image matrix into the model; in ocean remote sensing, for example, this is the pixel matrix of a remote sensing image. The convolutional layer is the most crucial component of the CNN: it extracts locally correlated features of ocean remote sensing images through convolutional operations. The pooling layer decreases the feature map size while preserving the important features to the greatest extent possible; downsampling through the pooling layer reduces the data size, improving the computation speed. Convolutional and pooling layers can be stacked, and with multiple layers of convolution and pooling operations, the model can learn higher order abstract semantic features contained in the data. Finally, the learned feature representations are transformed into the target task through a fully connected or output layer [57].
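To make the structure of Figure 3(a) concrete, the following PyTorch sketch stacks two convolution-plus-pooling blocks ahead of a fully connected output layer; the channel counts, input size, and class count are illustrative assumptions, not a reference implementation from the cited works.

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """Minimal CNN in the spirit of Figure 3(a): stacked convolution +
    pooling blocks followed by a fully connected output layer."""

    def __init__(self, in_channels=1, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),  # local features
            nn.ReLU(),
            nn.MaxPool2d(2),                                       # downsample 2x
            nn.Conv2d(32, 64, kernel_size=3, padding=1),           # deeper features
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 32 * 32, n_classes),  # assumes 128 x 128 inputs
        )

    def forward(self, x):
        return self.head(self.features(x))

model = SimpleCNN(in_channels=1, n_classes=2)
out = model(torch.randn(8, 1, 128, 128))  # batch of 8 single-channel patches
```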
CNN models have been widely employed in diverse domains, such as computer vision, natural language processing, and speech recognition. As emphasized previously, CNNs have shown great potential in multiple fields related to ocean remote sensing, and they are a key technology for extracting valuable information from ocean remote sensing big data [90], [91], [92], [93].
ConvLSTM-Based Model
The ConvLSTM network, proposed by Shi et al. in 2015, combines a CNN with LSTM to effectively capture spatiotemporal features in image time series, particularly for precipitation forecasting [94]. Prior to the introduction of the ConvLSTM model, precipitation forecasting primarily focused on temporal aspects, and LSTM models were commonly employed for this purpose. LSTM, a specialized recurrent neural network (RNN), excels at capturing sequential features in time series data through its unique architecture, involving forget and memory gates. However, it is limited by its design and cannot simultaneously capture spatial features along with temporal sequences. This limitation created a technical challenge for tasks requiring spatiotemporal forecasting, like precipitation forecasting. The ConvLSTM model addresses this challenge by enabling the establishment of LSTM-like temporal relationships while incorporating CNN-like spatial feature extraction capabilities. As a result, the ConvLSTM model outperforms other models developed during the same period.
ConvLSTM is a structure that uses convolutional operations to replace the matrix multiplications in traditional LSTM networks, giving it a stronger ability to capture spatial structure features in temporal data. Its neuron structure and information transfer are shown in Figure 3(b). The main components of ConvLSTM include the following (a minimal cell sketch in code follows this list):
Convolutional layer: Applies convolution operations to the inputs and hidden states so that spatial features are preserved through the recurrence
LSTM layer: Uses the gate mechanism of LSTM to control the information transfer of the memory units so that the LSTM layer can capture the temporal features of the input data
Output layer: Maps the spatiotemporal features captured by the convolutional and LSTM layers to the target results.
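The following PyTorch sketch implements a minimal ConvLSTM cell in the spirit of [94], computing the four gates with a single convolution over the concatenated input and hidden state; it omits the peephole (cell state) terms of the original formulation, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Minimal ConvLSTM cell: the matrix multiplications of a standard LSTM
    are replaced by convolutions, so the hidden and cell states keep their
    2D spatial layout (simplified after Shi et al. [94])."""

    def __init__(self, in_ch, hid_ch, kernel=3):
        super().__init__()
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch,
                               kernel_size=kernel, padding=kernel // 2)

    def forward(self, x, h, c):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c_next = f * c + i * torch.tanh(g)  # update the memory (cell) state
        h_next = o * torch.tanh(c_next)     # hidden/output state
        return h_next, c_next

# One time step on a batch of 64 x 64 fields, 1 input and 16 hidden channels.
cell = ConvLSTMCell(in_ch=1, hid_ch=16)
x = torch.randn(4, 1, 64, 64)
h = c = torch.zeros(4, 16, 64, 64)
h, c = cell(x, h, c)
```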
The ConvLSTM has become an important model for extracting spatiotemporal features from image time series data. In recent years, ConvLSTM has played a significant role in the environmental forecasting problem of ocean remote sensing [61], [62].
U-Shaped Structure Fully Convolutional Network
The U-shaped structure fully convolutional network (U-Net) is a fully convolutional network developed by Olaf Ronneberger et al. in 2015 for computer vision applications [95]. U-Net was developed by adjusting the traditional CNN architecture: it employs skip connections between the encoder and decoder to restore lost spatial information and enhance the model’s accuracy and precision. The U-Net model’s structure is remarkably uncomplicated compared to those of other DL models, with relatively few parameters and a simpler training process.
The U-Net architecture is renowned for its symmetric U-shaped structure [refer to Figure 3(c)], which follows the encoder–decoder paradigm. U-Net captures background information through a contracting path (downsampling) and precisely locates features through a symmetric expansive path (upsampling). This design is particularly advantageous in ocean remote sensing applications, where capturing both global context and fine details is crucial for accurate target identification. Additionally, U-Net employs skip connections, concatenating feature maps from the contracting path to the corresponding layers in the expansive path. This feature fusion mechanism helps preserve fine-grained details, enabling the network to effectively combine low-level and high-level features. In the realm of ocean remote sensing, this is vital for retaining information about small and discontinuous targets throughout the network. Unlike other network architectures, like fully convolutional networks (FCNs) [26], U-Net does not require supervision and loss computation for high-level features. By merging low-level and high-level features, the generated feature maps encompass a broad range of features, significantly enhancing the accuracy of model predictions.
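The encoder-decoder-with-skip pattern just described can be sketched in PyTorch as follows; a real U-Net uses four or five levels, and the two-level depth and channel widths here are illustrative simplifications.

```python
import torch
import torch.nn as nn

def block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
    )

class TinyUNet(nn.Module):
    """Two-level U-Net: a contracting path, an expansive path, and a skip
    connection that concatenates encoder features into the decoder."""

    def __init__(self, in_ch=1, n_classes=2):
        super().__init__()
        self.enc1 = block(in_ch, 32)
        self.enc2 = block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec1 = block(64, 32)               # 64 = 32 (skip) + 32 (upsampled)
        self.out = nn.Conv2d(32, n_classes, 1)  # per-pixel class scores

    def forward(self, x):
        e1 = self.enc1(x)                       # high-resolution features
        e2 = self.enc2(self.pool(e1))           # coarser context
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))  # skip connection
        return self.out(d1)

logits = TinyUNet()(torch.randn(2, 1, 128, 128))  # -> (2, 2, 128, 128)
```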
U-Net networks first played an important role in medical image segmentation. With the development of technology, U-Net networks have also been used in many other computer vision applications, such as autonomous driving, target detection, image classification, and semantic segmentation. Notably, U-Net is the most commonly used model in ocean remote sensing image segmentation [96], [97], [98].
SRCNN
SRCNN is a popular CNN-based superresolution model, proposed by Dong et al. in 2014 [99]. It operates on single-image superresolution and follows the architecture depicted in Figure 3(d), achieving end-to-end learning of the mapping from low-resolution to high-resolution images. The SRCNN model first applies bicubic interpolation to upscale the low-resolution input to the desired size and then reconstructs it with three convolutional layers: the first layer extracts feature maps from the interpolated image, the second layer nonlinearly maps them to high-resolution image representations, and the third layer generates the final high-resolution image by aggregating predictions from the spatial neighborhood [22], [23].
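The three-layer pipeline can be sketched in PyTorch as follows, using the 9-1-5 kernel sizes and 64/32 filter counts reported in the original paper [99]; the upscaling factor and single-channel input are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRCNN(nn.Module):
    """Three-layer SRCNN (after Dong et al. [99]): patch extraction,
    nonlinear mapping, and reconstruction on a bicubically upscaled input."""

    def __init__(self, channels=1):
        super().__init__()
        self.extract = nn.Conv2d(channels, 64, kernel_size=9, padding=4)
        self.map = nn.Conv2d(64, 32, kernel_size=1)
        self.reconstruct = nn.Conv2d(32, channels, kernel_size=5, padding=2)

    def forward(self, lr, scale=4):
        # Upscale first, then let the network refine the interpolated image.
        x = F.interpolate(lr, scale_factor=scale, mode="bicubic",
                          align_corners=False)
        x = torch.relu(self.extract(x))
        x = torch.relu(self.map(x))
        return self.reconstruct(x)

sr = SRCNN()(torch.randn(1, 1, 32, 32))  # 32 x 32 field -> 128 x 128
```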
Compared with traditional superresolution reconstruction methods, SRCNN is a simple, structured end-to-end learning method that performs better in terms of both reconstruction quality and inference speed. With the emergence of SRCNN, the field of image superresolution officially entered the DL era. In ocean remote sensing, the SRCNN model is also important for the superresolution reconstruction of ocean temperature and salinity fields [100], [101].
Introduction to CNN Methods in Ocean Remote Sensing Applications
This section summarizes the transfer learning and interpretability methods based on CNN models in ocean remote sensing. In chronological order, we showcase transfer learning and four commonly used interpretability methods in Figure 4. This section will explain the fundamental theories and application scenarios of the interpretability and transfer learning methods.
Various CNN methods. (a) CNN transfer learning methods. (b) CNN interpretability methods. AnB: intersection convolutional layer of A and B; Grad-CAM: gradient-weighted class activation mapping; LIME: local interpretable model-agnostic explanations.
CNN Transfer Learning Method
Transfer learning in CNN models is a crucial technique that involves applying a DL model, originally trained for one task or domain, to another related task or domain without the need for complete retraining from scratch. Its success is primarily attributed to several key factors: the model’s universal feature representations enable learned features to be shared across various tasks, while architectural design and data augmentation techniques enhance the model’s generalization capacity, as illustrated in Figure 4(a). By effectively reusing knowledge and features acquired from one task in other tasks, transfer learning accelerates model training, reduces resource consumption, and enhances the model’s performance on new and diverse tasks [101], [102], [103].
Transfer learning finds extensive application in CNN models within ocean remote sensing. CNN models leverage transfer learning methods by pretraining on natural image data and transferring the learned features to ocean data processing and analysis tasks. This approach enables efficient exploration of ocean remote sensing data. Through transfer learning, CNN models can substantially save training time and computational resources, enhancing the efficiency and accuracy of ocean science research and applications. The transfer learning method lends robust support to ocean resource management, environmental conservation, and oceanic research, contributing to a deeper understanding of ocean environments and propelling sustainable development.
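As a concrete sketch of the feature extraction and fine-tuning recipes described earlier, the following PyTorch/torchvision snippet freezes an ImageNet-pretrained ResNet-18 and retrains only a new task head; the two-class sea ice versus open water setup is an illustrative assumption, not the configuration of any specific cited study.

```python
import torch
import torch.nn as nn
from torchvision import models

# Start from a ResNet-18 pretrained on natural images (ImageNet).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Feature extraction: freeze every pretrained layer ...
for param in model.parameters():
    param.requires_grad = False

# ... and replace the classification head for the ocean task, e.g.,
# distinguishing sea ice from open water in image patches (2 classes).
model.fc = nn.Linear(model.fc.in_features, 2)

# Only the new head is trained; for fine-tuning instead, switch selected
# earlier layers back to requires_grad=True.
optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```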
CNN Interpretability Methods
In DL, particularly within intricate neural network models, comprehending how a model arrives at specific outputs can often be challenging. Like other DL models, CNNs face interpretability challenges due to their complex structures and black-box nature. However, because of CNNs’ distinct receptive field mechanisms and visualization procedures, their interpretability is relatively favorable compared to other models [104]. Interpretability refers to the capacity to understand the decision-making process and derivation of predictive outcomes. CNN model interpretability is commonly facilitated through various methods [Figure 4(b)].
Visualizing feature maps [105]: Convolutional layers in CNNs extract different features from images, such as edges and textures. Visualizing these feature maps offers a tangible understanding of the rationale behind the model’s decisions on input images.
Gradient-weighted class activation mapping (Grad-CAM) [106]: Grad-CAM is a gradient-based interpretability method that aids in comprehending the basis of model classifications for specific categories. It calculates gradients of the target class score with respect to convolutional layer output feature maps to determine the importance of the corresponding regions, generating class activation maps (a code sketch follows this list).
Guided backpropagation [107]: This method visualizes neural activation in CNN models. Guided backpropagation modifies the gradient backpropagation by allowing only the positive gradients to pass through each rectified linear unit (ReLU), zeroing out the portions less than zero. The gradient that finally reaches the first convolutional layer therefore reflects only the inputs that positively activate the network, and visualizing these gradients reveals the regions that affect the network.
Local interpretable model-agnostic explanations (LIME) [108]: LIME is a versatile interpretability method applicable to explaining the predictions of any model. It approximates global model decisions by generating a series of local samples and observing their prediction outcomes.
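As an illustration of Grad-CAM, the sketch below hooks the last convolutional stage of an ImageNet-pretrained ResNet-18, weights its feature maps by the spatially averaged gradients of the top class score, and upsamples the resulting map; the choice of model and layer is illustrative.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Grad-CAM sketch: weight the last convolutional feature maps by the
# spatially averaged gradients of the target class score.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

feats, grads = {}, {}
layer = model.layer4  # last convolutional stage of ResNet-18
layer.register_forward_hook(lambda m, i, o: feats.update(a=o))
layer.register_full_backward_hook(lambda m, gi, go: grads.update(a=go[0]))

x = torch.randn(1, 3, 224, 224)  # placeholder input image
score = model(x)[0].max()        # score of the predicted class
score.backward()

weights = grads["a"].mean(dim=(2, 3), keepdim=True)  # global-average gradients
cam = F.relu((weights * feats["a"]).sum(dim=1))      # class activation map
cam = F.interpolate(cam[None], size=x.shape[-2:],
                    mode="bilinear", align_corners=False)[0]
```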
The ability to interpret CNN models is vital for making sense of ocean remote sensing data. Given the complexity and dynamic nature of ocean ecosystems, the interpretability of these DL models offers invaluable insights for ocean science, resource management, and environmental conservation. Such interpretability enhances our understanding of ocean settings, supporting sustainable use and preservation efforts.
Application of CNNs in Ocean Remote Sensing
In recent years, the four CNN models mentioned earlier have found application in various facets of ocean remote sensing, encompassing tasks such as reconstructing 3D ocean fields, extracting information from ocean remote sensing images, enhancing the resolution of such imagery, and forecasting ocean phenomena. Furthermore, the two techniques related to transfer learning and CNN model interpretability have garnered substantial attention and exploration. The forthcoming section will provide a comprehensive examination of these aspects.
Reconstruction of 3D Ocean Field from Ocean Remote Sensing Data by CNN-Based Models
Reconstructing the 3D ocean field is a critical endeavor in ocean remote sensing, enabling a comprehensive understanding of the intricate oceanic structure and variations, including phenomena like ocean currents, subsurface temperature (ST), and salinity. Recent satellite remote sensing technology advances have provided extensive, high-precision sea surface data with excellent spatial resolution and consistent temporal continuity. However, these data only cover the ocean’s surface, leaving out the critical dynamic processes and features below. Effectively leveraging remote sensing observations of the sea surface to infer subsurface information is a pressing concern in global ocean research. This section explores CNN applications in ST, ST anomalies [109], subsurface salinity anomalies, and chlorophyll-a concentration inversion.
In 2019, Han et al. leveraged CNN models using satellite remote sensing data, including SST, SSH, and SSS, as input to estimate the ST in the Pacific Ocean successfully [110]. Their predictive results exhibited an average coefficient of determination exceeding 0.95 compared to Argo data, indicating the model’s high accuracy. These estimations are depicted in Figure 5(A) [111].
(A) ST of Argo data with CNN inversion for different depth layers in April 2015. (a) Argo 300 m, (b) CNN 300 m, (c) Argo 600 m, (d) CNN 600 m. (See [110].) (B) Distributions of estimated higher resolution (HR) temperature. The distributions of (a) ground truth, (b) 1° estimated, (c) HR (1/4°) estimated temperature (°C), and (d) temperature from ISAS Argo gridded data at 200 m depth in January 2012. (See [112].) (C) The spatial distribution of monthly chlorophyll concentration in March 2016 shown for the following inversion methods: (a) CNN, (b) OC-CCI. (See [113].) (D) Results for the inverse chlorophyll-a using the CNN model: the left panel [(a), (c), and (e)] shows the inverse chlorophyll-a values. The right panel [(b), (d), and (f)] shows the satellite chlorophyll-a values corresponding to the left panel. The RMSE values for these three cases are 0.055, 0.204, and 0.775, respectively. (See [114].) (B) is a reproduction from original Figure 5 in [112], used with authorization from AGU.
Subsequently, Meng et al. employed CNNs to construct a model for estimating ocean subsurface information [112]. Utilizing high-resolution (1/4°) and ultrahigh-resolution (1/12°) sea surface data, they successfully inverted high-resolution ST anomalies and subsurface salinity anomalies in the Pacific Ocean. These results closely resemble Argo buoy observations both globally and in fine detail, as shown in Figure 5(B). Moreover, their model-generated high-resolution underwater temperature also exhibits characteristics consistent with actual buoy observations, offering smoother and clearer features due to the advantages of higher resolution.
Additionally, in global chlorophyll-a estimation, Yu et al. adopted the CNN approach to retrieve global chlorophyll-a concentrations from MODIS data and compared it with support vector regression (SVR). Compared with the SVR, the CNN performs better, with the mean log root-mean-square error (RMSE) and R2 being 0.129 and 0.901, respectively, indicating that using the MODIS images alone, the CNN approach can achieve results that are close to the Ocean Color Climate Change Initiative (OC-CCI) Chla concentration images [113]. Figure 5(C) illustrates the inversion results for three specific regions.
Furthermore, Jin et al., by integrating satellite ocean color and hydrodynamic model data, successfully inferred the spatiotemporal distribution of chlorophyll-a in bays using CNN models [114]. Their inversion results displayed low RMSE and high R2 values while highlighting colored dissolved organic matter as a crucial variable influencing chlorophyll-a spatiotemporal distribution. These studies provide robust support for applying CNNs in remote sensing oceanography, with specific inversion results shown in Figure 5(D).
While CNN-based estimation models have demonstrated significant value in estimating ocean wind, temperature, salinity, and chlorophyll-a concentration [113], [114], they may suffer from issues like model overfitting and overall smoothing caused by the CNN’s translation invariance and the information loss in the pooling layer. Moreover, CNNs are not ideal for extracting temporal features from input variables, which limits their ability to handle time series problems. Future research could explore integrating other time series models (e.g., LSTM) to address these limitations and improve performance through data augmentation and the inclusion of attention blocks.
Ocean Phenomena Forecasting Based on Ocean Remote Sensing by ConvLSTM-Based Models
Ocean environment forecasting is a pressing concern in the field, holding both scientific and practical significance [115]. As computer technology advances, DL models have demonstrated their prowess in tackling long-time series prediction problems. Researchers have applied models like RNNs [116], LSTM, and others to address sequence features in ocean environment forecasting. However, ocean data exhibit spatiotemporal correlations that extend beyond mere sequences, and these models are limited in their ability to capture the spatial intricacies present in ocean data. In contrast, through its convolutional layer, a CNN excels at extracting spatially correlated features [117], [118]. Combining the strengths of the CNN and LSTM methods, the ConvLSTM model efficiently extracts spatiotemporal features from data, significantly reducing computational time and enhancing prediction performance.
Currently, the ConvLSTM model is widely used in marine environment forecasting, utilizing its unique structure to extract reasonable spatiotemporal features from marine remote sensing data. For instance, Zhou et al. employed the ConvLSTM model with WaveWatch III (WW3) reanalysis data to successfully establish a 2D SWH prediction model for the South China Sea and East China Sea [39]. This model exhibited high precision and efficiency, with correlation coefficients exceeding 0.92 in 6-h and 12-h forecasts under normal and extreme conditions, such as typhoons [Figure 6(A)].
(A) Comparison of SWH data of WW3 with prediction results of ConvLSTM. (a), (d) are based on the SWH data at 5 a.m. and 6 a.m. on 3 October 2019, for 1-h and 3-h forecasts. (b), (e) are the prediction results of the WW3 wave field at the corresponding times. (c) and (f) are the errors between the WW3 data and the predicted results. (See [39].) (B) The anomaly comparison of SIC results from different sources on 15 December 2018, including (a) monthly average from the NSIDC, (b) daily prediction from CNNs, and (c) daily prediction from ConvLSTM. The gray in the figure represents land, and the blue/red represents the underestimation/overestimation of SIC. The area marked by the blue dotted line is the NSIDC SIC near the Eastern Siberia Sea area. (See [120].) (C) Comparison of SLP patterns predicted by DWT-ConvLSTM with actual values from 10–11 January 2019. The first row shows the observed SLP data, and the second row shows the prediction results of DWT-ConvLSTM. (See [121].) (D) Spatial distribution of RMSE in the study area and predicted results of SST fields for selected dates in different seasons. (See [122].) NSIDC: National Snow and Ice Data Center; DWT: discrete wavelet transform. (D) is a reproduction from original Figure 6 in [122], used with authorization from Elsevier.
Additionally, Liu et al. introduced a SIC prediction model based on the ConvLSTM algorithm [120]. By comparing it with a CNN model regarding spatiotemporal scale computations, they found that ConvLSTM outperformed CNN in single-time predictions across the entire test dataset [Figure 6(B)].
Mu et al. utilized LSTM and ConvLSTM to forecast the North Atlantic Oscillation index and SLP. The results indicated that ConvLSTM effectively captured the temporal and spatial dependencies in the SLP field [121]. They also incorporated preprocessing techniques, such as the discrete wavelet transform (DWT), to enhance the predictive performance for extreme events. The forecast results for the North Atlantic region from 10–16 January 2019 are depicted in Figure 6(C).
Lastly, Xiao et al. constructed a spatiotemporal DL model by stacking ConvLSTM layers to predict 36 years of National Oceanic and Atmospheric Administration SST data in the East China Sea [122]. Experimental findings demonstrated that this model outperformed other models in medium-term to short-term daily sea temperature forecasts, offering superior accuracy and convenience [Figure 6(D)]. These studies collectively underscore the significance and potential of ConvLSTM in ocean environmental forecasting.
ConvLSTM networks have found wide-ranging applications in ocean environmental forecasting, encompassing ocean temperature, salinity, and wave height prediction [123], [124], [125], [126]. These models, trained on historical ocean data, leverage a combination of CNN and LSTM to extract spatiotemporal features, enabling accurate predictions about the ocean environment [127]. Despite its successes, ConvLSTM still faces challenges, such as the local smoothing problem in long-term forecasting for complex sea areas, which needs further research. It is also worth noting that the ConvLSTM structure is not the only choice for ocean phenomena forecasting; ConvLSTM can be combined with other CNN-based forecasting models [128], [129], [130] to further enhance forecasting capability.
Information Extraction Based on Ocean Remote Sensing Images by U-Net Models
Ocean remote sensing information extraction involves using remote sensing technology to collect, process, and analyze ocean-related data. The U-Net network is a widely employed model in this field, significantly contributing to its development [131], [132]. In this article, we emphasize the pivotal role of U-Net in various aspects of ocean remote sensing information extraction, including ship semantic segmentation, coastline extraction, sea ice classification, and green algae detection. These tasks are exceptionally challenging, especially when dealing with small and discontinuous targets.
Semantic segmentation of ships in ocean remote sensing images presents a formidable challenge in ocean remote sensing image analysis, especially when dealing with small-scale and multiscale ship detection. The challenges primarily revolve around two key aspects. First, the diverse scales of vessels make the segmentation of small targets challenging; second, insufficient local decoding capacity in the decoder part leads to the loss of positional information. Mi et al. employed U-Net with multiscale convolution to fuse multiscale feature maps and enhance contextual information to address these challenges [133]. They also improved the model’s decoding section by replacing convolution operations with deconvolutions, enhancing the positional accuracy. This approach effectively detected ships in remote sensing images, achieving the best performance with an average intersection over union (IOU) of 86.98% and a ship-specific IOU of 86.98%, outperforming all other models at the time [Figure 7(A)].
(A) The results of Mi et al. [133], showing the HRSC2016-SS results: the input images, ground truth, and FCN, SegNet, U-Net, and U-Net-MSPF-Deconv results. (See [133].) (B) Two examples of Sentinel-1 SAR images overlaid with corresponding waterlines (yellow and blue lines) extracted using the WENet model and ground truth waterlines (yellow and red lines): (a) Sentinel-1B image from 29 July 2019 at 9:54; (b) Sentinel-1A image from 17 June 2019 at 9:55. (C) Classification results of SAR images [vertical-vertical (VV) channel] using DAU-Net. (a) and (b) Complete sea ice with an open water area inside; (c) and (d) Many small floating ice areas; (e) and (f) Complex sea ice boundary area. (See [75].) (D) Under three conditions, test images and detection results were randomly selected for GA-Net and previous advanced models. (a) and (b) Larger views and corresponding detection results. The green box shows a low-aggregation area of green algae. (c) and (d) Detailed information on the images and detection results. The yellow box and (e) and (f) show the images and detection results of high-aggregation areas of green algae. (See [81].) DAU-Net: dual-attention U-Net.
Furthermore, in the domain of coastline extraction, Zhang et al. proposed an automatic waterline extraction model for high spatial resolution SAR images based on an improved U-Net model [80]. This model was used to extract tidal flat waterlines and map terrain automatically. Under complex imaging conditions, the model achieved an average recall rate of 0.9 and a precision of 0.8 when extracting waterlines from SAR images collected along the coast of the Yellow Sea in China from 2015 to 2020. The waterline extraction results are depicted in Figure 7(B).
In the context of sea ice classification, continuous enhancements have been made to the U-Net model to improve accuracy. For instance, Ren et al. integrated dual attention mechanisms, a position attention module and a channel attention module, into the U-Net model for the classification of sea ice and open water in SAR images, resulting in the dual-attention U-Net (DAU-Net) model [75]. Figure 7(C) showcases the model’s performance on three additional SAR images. DAU-Net increased the IOU values on the three test images by 7.48%, 0.96%, and 0.83%, respectively, compared to the original U-Net. Compared to the recently released DenseNet FCN model [134], DAU-Net improved the IOU values by 3.04%, 2.53%, and 2.26%.
Guo et al. developed a texture-enhanced DL model named GA-Net based on the U-Net framework for seaweed detection [81]. Four specific modifications were made in this model, including a texture fusion input dataset, texture concatenation, weighted loss functions, and attention mechanism modules, to enhance classification performance. Experimental results demonstrated that this classification method achieved a mean IOU of 86.31%, surpassing previous deep classification methods [Figure 7(D)]. These studies have contributed valuable methods and techniques to the advancement of ocean remote sensing image processing.
DL-based semantic segmentation models for remote sensing images significantly improve the segmentation of ocean remote sensing images compared to traditional methods. They solve the problem of pinpointing object boundaries, which is completely ignored by most traditional pixel-level segmentation methods, and they are also robust to noise [75], [81], [133], [135], [136], [137].
Superresolution Reconstruction of Ocean Remote Sensing Data by SRCNN
High-resolution ocean data play a critical role in studying ocean features, and the disparity between observed and numerical model resolutions significantly impacts the predictive accuracy [137]. In ocean remote sensing, higher resolution images offer finer details and critical edge information, enhancing the quality of data for subsequent processing [138], [139], [140].
Since the initial proposition of multiframe low-resolution image reconstruction into high-resolution images by Tsai and Huang in 1984 [141], superresolution reconstruction techniques have garnered significant attention and research within the academic community. In ocean remote sensing, conventional resolution enhancement methods based on empirical orthogonal function (EOF) or modeling approaches are gradually being replaced by DL. Convolutional DL models, notably exemplified by SRCNN, have substantially improved the accuracy of oceanic superresolution data. For instance, López-Radcenco et al. refined SRCNN and introduced a remote sensing image superresolution reconstruction method that combines SRCNN with local adaptive CNNs. What sets this method apart is its integration of input data from low-resolution images and a secondary image source, introducing various constraints (orthogonality, nonnegativity, and sparsity) to confine the model [142]. The improved SRCNN model successfully tackles the challenge of reconstructing high-resolution information under irregular sampling conditions [Figure 8(A)].
(A) Reconstruction results of the high-resolution SSH image Y on 20 April 2012. (a) Real high-resolution SSH image Y, (b) low-resolution SSH image YLR (denoted as SSHLR), (c) reconstruction of the high-resolution SSH image Y using a global convolution model, with (d) PCA, (e) k-singular value decomposition (KSVD), and (f) NN. (See [142].) (B) The spatial distribution of Argo ST1 [(a), (d), and (g)], the predicted ST0.25 by CNN [(b), (e), and (h)], and the scatterplot of predicted ST0.25 by CNN and EN4 grid ST0.25 [(c), (f), and (i)] at different depths in December 2015. (See [143].) (C) SST data for the same region and date for (a) low-resolution data, (b) high-resolution data, (c) linear transform, (d) SRCNN, (e) RRDBNet, and (f) ESRGAN high-resolution reconstructed result maps. (See [145].) (D) (a) Original 0.5-m WV2; (b) downsampled WV2-1-m; (c) downsampled WV2-1.5-m; (d) downsampled WV2-2-m images. (e), (f), and (g) FSRCNN’s visual effect of superresolution reconstruction of WV2 images. (See [147].) PCA: principal component analysis; NN: nonnegative decomposition; OISST: optimum interpolation SST; RRDBNet: residual-in-residual dense block network; ESRGAN: enhanced superresolution generative adversarial network; FSRCNN: fast superresolution CNN.
Moreover, in the context of high-resolution reconstruction of ST and subsurface salinity fields, Su et al. achieved successful reconstruction of ST in the upper 1,000 m of the ocean at a resolution of 0.25° using CNN and light gradient-boosting machines [143]. They utilized satellite sea surface parameters, Argo buoys, and EN4 profile data [144] to enhance the spatial resolution of ST from 1° to 0.25° and established a single-time model applicable across seasons and time series [Figure 8(B)]. The EN4 profile data are quality-controlled ocean profile datasets from the U.K. Met Office Hadley Centre, utilized for researching and monitoring the physical and chemical properties of the global ocean. This dataset encompasses ocean parameters, including temperature, salinity, depth, and more, obtained through observations from buoys and vessels across global ocean regions.
On a different note, Izumi et al. employed various superresolution processing methods, including the enhanced superresolution generative adversarial network (ESRGAN), SRCNN, and the residual-in-residual dense block network (RRDBNet), to handle SST data [145]. They compared the images generated by these methods with high-resolution optimum interpolation SST (OISST) [146] data and found that RRDBNet outperformed SRCNN and ESRGAN in terms of performance, while ESRGAN excelled in reflecting ocean current distributions compared to CNN-based methods [Figure 8(C)].
Chen et al. utilized SRCNN and fast superresolution CNNs (FSRCNNs) to detect invasive plant patches in the Yangtze River Delta region and enhance images to submeter (0.5-m) resolution [147]. In contrast to traditional bicubic interpolation methods, FSRCNN displayed significant advantages in preserving spectral and structural details when constructing 0.5-m resolution images from 1-m/1.5-m/2-m resolution images. It successfully detected most of the on-site measured patches; the model-reconstructed high-resolution images are illustrated in Figure 8(D). Collectively, these studies have been instrumental in advancing ocean superresolution reconstruction work.
These studies have significantly advanced ocean superresolution reconstruction technology, and as the demand for superresolution ocean data continues to grow, convolution-based DL methods offer a promising avenue for resolution enhancement. While traditional superresolution methods have their merits, combining traditional techniques with neural networks remains an avenue worth exploring in the future.
Transfer Learning Method in Solving Ocean Remote Sensing Tasks
Transfer learning using CNN models is a crucial technique in ocean remote sensing, offering solutions to data scarcity, model acceleration, knowledge sharing, task migration, and limited sample challenges. It expedites model development and deployment, improving the outcomes and efficacy of ocean remote sensing technology. CNN models typically demand substantial training data, and the abundance of ocean remote sensing data makes transferring models trained in other domains feasible. This section discusses applying transfer learning techniques in various tasks, including coastal identification, extending ENSO prediction horizons, accurate internal wave amplitude inversion, and capturing spatial evolution features of SST anomalies (SSTAs) and SSH anomalies (SSHAs) [49], [53], [148], [149].
Deep CNNs face the challenge of limited training data in coastline recognition. Lima et al. [49] employed a transfer learning approach, utilizing the AlexNet model for training to address this issue. They replaced the final softmax layer with a support vector machine (SVM) for classification tasks. Subsequently, they fine-tuned all model layers through transfer learning steps, allowing the CNN model to gain deeper insights from large-scale datasets and apply this knowledge to the coastline recognition task [Figure 9(a)]. This approach resulted in higher accuracy compared to models that did not utilize transfer learning.
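A minimal sketch of that two-step strategy, using a torchvision AlexNet pretrained on ImageNet and a scikit-learn SVM; the image tensors and labels here are placeholders, not the dataset of [49].

```python
import torch
import torchvision.models as models
from sklearn.svm import SVC

# Step 1: use a CNN pretrained on a large generic dataset as a feature
# extractor for the (much smaller) coastline dataset.
backbone = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
backbone.classifier = backbone.classifier[:-1]  # drop the final FC/softmax layer
backbone.eval()

def extract_features(images: torch.Tensor) -> torch.Tensor:
    """Return the 4,096-D penultimate-layer features for a batch of images."""
    with torch.no_grad():
        return backbone(images)

# Step 2: replace softmax with an SVM trained on the extracted features.
train_images = torch.randn(32, 3, 224, 224)  # placeholder coastline patches
train_labels = [0, 1] * 16                   # placeholder binary labels
svm = SVC(kernel="linear")
svm.fit(extract_features(train_images).numpy(), train_labels)

# Step 3 (fine-tuning): unfreeze all layers and continue training the
# backbone on the target data with a small learning rate.
optimizer = torch.optim.SGD(backbone.parameters(), lr=1e-4, momentum=0.9)
```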
(a) Comparison of transfer learning with different baseline methods. Deep blue signifies replacing the final softmax layer with an SVM for classification, followed by fine-tuning all layers. Blue indicates no transfer learning, with training exclusively on remote sensing images. Green represents the fine-tuning of higher layers. Yellow corresponds to the traditional BOVW model. (See [49].) (b) The predictive performance of the STIEF model before (blue) and after (red) the incorporation of the transfer learning module. The confidence intervals of the model's predictions were calculated by extracting 128 samples from the test set for each 24-month lead time correlation calculation, repeated 64 times; this process generated a 95% confidence interval for each of the 64 correlation coefficients calculated monthly. (See [53].) (c) Model performance comparison of different fusion strategies combining laboratory and in situ data. (See [148].) (d) Different models' RMSE predictions for SSHA, measured in meters. The CNNTL model (green solid line) is compared with the transformer model (blue dashed line) and the ConvLSTM model (orange dashed line). (See [149].) STIEF: spatiotemporal information extraction and fusion; CMIP: Coupled Model Intercomparison Project; IW: internal wave; SODA: simple ocean data assimilation; PP: positive and negative peaks.
Wang et al. likewise employed transfer learning to overcome the challenge of limited observational data [53]. They used historical simulated data to train the model parameters and then improved model performance by fine-tuning with observational records. Using these models for prediction, they significantly extended the effective prediction horizon for the Niño3.4 index from 11 to 22 months [Figure 9(b)].
Zhang et al. applied transfer learning to invert ocean internal wave (IW) amplitudes in satellite images [148]. They first pretrained the model with laboratory data and then fine-tuned its parameters using matched satellite/in situ data, successfully addressing the challenge of a small in situ dataset and improving the model's predictive performance. As shown in Figure 9(c), their transfer learning-based model, TLIAR, outperformed seven other models with different fusion strategies on the test dataset.
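The studies in [53] and [148] share a common two-stage pattern: pretrain on abundant simulated or laboratory data, then fine-tune on scarce observations at a lower learning rate. A minimal sketch of that pattern with a placeholder regression network and random stand-in datasets:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def run_stage(model, loader, lr, epochs):
    """One generic training stage, reused for pretraining and fine-tuning."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()

# Placeholder regressor mapping a 32 x 32 field to a scalar (e.g., IW amplitude).
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Flatten(), nn.Linear(16 * 32 * 32, 1))

# Stage 1: pretrain on a large simulated/laboratory dataset.
simulated = TensorDataset(torch.randn(1024, 1, 32, 32), torch.randn(1024, 1))
run_stage(model, DataLoader(simulated, batch_size=64), lr=1e-3, epochs=5)

# Stage 2: fine-tune on a small observational dataset with a lower learning
# rate so that the pretrained weights are only gently adjusted.
observed = TensorDataset(torch.randn(64, 1, 32, 32), torch.randn(64, 1))
run_stage(model, DataLoader(observed, batch_size=8), lr=1e-4, epochs=20)
```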
Finally, Miao et al. designed a hybrid model combining a CNN with transfer learning to predict monthly scale SSTAs and SSHAs [149]. Through pretraining and transfer learning, the model effectively captures the spatial evolution characteristics of SSTAs and SSHAs, reducing prediction errors and performing better than the comparison models [Figure 9(d)]. Together, these studies highlight the practical value of transfer learning with CNN models in oceanography.
However, direct transfer may not always be fully effective because of the differences between source and target tasks. Adjustments and optimizations are often necessary during transfer learning to suit the target task’s requirements. Additionally, specific tasks may require more domain adaptation and model optimization to achieve optimal performance.
CNN Model Interpretability in Ocean Remote Sensing Applications
The interpretability of DL is pivotal in ocean remote sensing, influencing various domains like ocean science, environmental protection, resource management, and decision support. Despite the abundant ocean remote sensing data available for DL model training, the intrinsic processes by which these models extract, transform, and process features from input data are often inscrutable during training. This lack of transparency hampers a deeper comprehension of these models, making interpretability a burgeoning research focus. In this section, we explore interpretability using CNN models to analyze significant spatial regions for river flow prediction, discern the importance of various oceanic regions for ENSO predictions, and evaluate feature importance variations across different underlying surfaces [53], [55], [56].
Liu et al. utilized saliency maps to identify the spatial regions most important to a network's river flow predictions [55]. Through these interpretability methods, they discovered that the flow prediction capability of the Earth system models (ESMs) is mainly influenced by the ENSO and Indian Ocean Dipole (IOD) regions [Figure 10(A)].
(A) Interpretable DL saliency maps showing (a) the saliency of ESM inputs for Amazon River flow prediction and (b) the corresponding saliency maps for Congo River flow prediction. (See [55].) (B) Standard deviations of 1982–2020 CNN model activation maps for (a) predicting September EIOD SST anomalies using May–July inputs and (b) predicting October EIOD SST anomalies using June–August inputs. Standard deviations are normalized to a maximum value of 1. Note that the activation plots do not distinguish between the contributions of SST and heat content anomalies. (See [150].) (C) ENSO predictability sources extracted by the STIEF model: (a) the distribution of ENSO SST predictability sources with a forecast lead time of three months and (b) the same as (a) except for the OHC variable. (See [53].) (D) The importance of each type of band (cloud, water, and temperature) over different underlying surfaces (land, coastline, and nearshore). (See [56].) EIOD: East Indian Ocean Dipole; OHC: ocean heat content.
Feng et al. trained a CNN model to predict the IOD and used gradient-weighted class activation maps to visualize the model [150]. The visualization indicated that strong positive IOD events [cold East Indian Ocean Dipole (EIOD) SST anomalies] can arise from different processes: internal dynamics within the Indian Ocean were associated with the 1994 positive IOD, remote correlations with the Equatorial Pacific were significant in 1997, and cooling off the southeastern coast of the Indian Ocean contributed to the 2019 positive IOD [Figure 10(B)].
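For reference, a minimal PyTorch sketch of the gradient-weighted class activation map (Grad-CAM) computation; the backbone and target layer are placeholders rather than the network used in [150].

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=None).eval()  # placeholder CNN backbone
feats = {}

def save_activation(module, inputs, output):
    output.retain_grad()   # keep gradients on the feature maps
    feats["maps"] = output

model.layer4.register_forward_hook(save_activation)  # last conv block

x = torch.randn(1, 3, 224, 224)  # placeholder input field/image
score = model(x)[0].max()        # score of the predicted class
score.backward()                 # gradients flow back to the feature maps

# Grad-CAM: weight each feature map by its spatially averaged gradient,
# sum over channels, and keep only positive evidence.
weights = feats["maps"].grad.mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * feats["maps"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = cam / (cam.max() + 1e-8)   # normalized saliency over the input grid
```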
Wang et al. developed an interpretable DL model called spatiotemporal information extraction and fusion (STIEF) for ENSO prediction [53]. They employed error-based backpropagation to derive gradients for each input variable point in the model. These gradients effectively trace ENSO signals’ sources, locations, and lead times, highlighting the crucial role of interactions between ocean basins and the troposphere for ENSO prediction [Figure 10(C)].
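The underlying idea can be sketched in a few lines: backpropagate the forecast to the input fields and read the gradient magnitudes as a sensitivity map over variables and grid points (the toy network below is a placeholder, not the STIEF architecture).

```python
import torch
import torch.nn as nn

# Placeholder predictor mapping two gridded fields (standing in for SST
# and OHC) to a scalar forecast such as an ENSO index.
model = nn.Sequential(nn.Conv2d(2, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(16, 1))

fields = torch.randn(1, 2, 64, 128, requires_grad=True)  # (batch, var, lat, lon)
prediction = model(fields)
prediction.backward()  # d(prediction)/d(input) at every variable and grid point

# Large |gradient| marks the regions and variables to which the forecast is
# most sensitive, i.e., candidate predictability sources.
saliency = fields.grad.abs().squeeze(0)  # shape: (variable, lat, lon)
```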
Yu et al. designed the STRU-Net model, based on the U-Net architecture, specifically for reconstruction over different underlying surfaces (land, coastline, nearshore, and ocean) [56]. They used 17 satellite-derived features as inputs and employed the DeepLIFT algorithm for model interpretability. DeepLIFT compares each neuron's activation with its "reference" activation and propagates importance signals based on the differences, allowing the analysis of how feature importance changes across underlying surfaces. Through this method, they discovered that the importance of satellite water-related features gradually increased as the model's domain shifted from land to the coastline and then to the nearshore, while the importance of satellite cloud-related and temperature-related features decreased [Figure 10(D)]. These studies demonstrate that CNN models possess good interpretability and can help us understand their internal learning processes to some extent.
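DeepLIFT attributions of this kind can be computed with, for example, the Captum library; the sketch below assumes a placeholder CNN with 17 input channels and an all-zero reference input, which may differ from the reference choice in [56].

```python
import torch
import torch.nn as nn
from captum.attr import DeepLift

# Placeholder network taking a stack of 17 satellite-derived feature maps.
model = nn.Sequential(nn.Conv2d(17, 32, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                      nn.Linear(32, 1))

inputs = torch.randn(4, 17, 32, 32)  # placeholder satellite feature stack
baseline = torch.zeros_like(inputs)  # the "reference" activation input

dl = DeepLift(model)
# Attribute the output to every input value relative to the reference;
# averaging |attribution| per channel ranks the 17 features by importance.
attributions = dl.attribute(inputs, baselines=baseline, target=0)
feature_importance = attributions.abs().mean(dim=(0, 2, 3))  # one score per feature
```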
Despite the relatively enhanced interpretability of CNN models, challenges remain for more intricate networks and deep architectures. Advancing the capacity for interpretability in DL models will further solidify their reliability and trustworthiness in broader application scenarios, which remains a pivotal research direction.
Discussion
This article shows the powerful capabilities of CNNs in six areas: the reconstruction of the 3D ocean field from ocean remote sensing, information extraction based on ocean remote sensing images, superresolution of ocean remote sensing imagery, ocean phenomena forecasting based on ocean remote sensing, a transfer learning method based on ocean remote sensing, and a CNN model interpretability method based on ocean remote sensing.
At present, ocean remote sensing is entering the era of big data, and we need powerful data mining technology that can accurately and effectively extract information from large amounts of ocean remote sensing data with as little human intervention as possible. In addition, such technology needs strong universality and generalization so that it can be applied to the wide variety of problems contained in ocean remote sensing data. In response to these requirements, the various model structures based on CNNs provide a promising approach for analyzing and processing large amounts of ocean remote sensing data.
Although CNNs have made great progress in mining ocean remote sensing information, some key issues still need to be considered in future development.
First, CNN-based models require a large amount of data for training. In ocean information extraction tasks, CNN-based models require many high-precision labels. Most current ground-truth labels are produced manually, and their accuracy is affected by human experience and errors; CNN models trained with these labels will inevitably propagate such errors into their outputs. In addition, ground-truth labels in ocean variable estimation tasks rely on in situ measurement data, which require worldwide collaboration to collect. The establishment of standardized datasets will drive CNN-based ocean remote sensing information mining: if big data is the door to AI ocean remote sensing, then CNN models are the key. For some studies, we still need expert knowledge to provide ground-truth data, and it is important to combine the knowledge of different expert groups to eliminate human bias. One possible solution is to develop unsupervised DL methods that avoid the limitations of human knowledge.
Second, most CNN models for mining information from ocean remote sensing images come from computer vision, where they were originally developed to extract spatial and temporal patterns. These models can be combined with physical ocean knowledge to tailor model structures and loss functions specifically for ocean science tasks.
Third, CNN models are sensitive to data from different sensors, and models with the generalization capability to handle data from different sensors simultaneously are lacking. To save computational and labor costs, we must study practical methods for converting models from one sensor to another and improve model generalization across sensors.
Last but not least, AI-driven ocean remote sensing science is an interdisciplinary field, and it is only through close collaboration between experts in ocean remote sensing and machine learning that DL techniques in ocean remote sensing will continue to advance. Moreover, there are commonalities among remote sensing data in different domains, so CNN models developed for ocean remote sensing can also be applied to remote sensing data in other domains. We should therefore encourage more collaborative initiatives, working together to advance research in AI-powered, interpretable, and efficient ocean remote sensing data mining technologies.
Acknowledgment
This study was funded by the National Natural Science Foundation of China under Grant U2006211, Grant 42221005, and Grant 42090044 and by the Strategic Priority Research Program of the Chinese Academy of Sciences under Grant XDB42000000. The authors thank the researchers whose work is reviewed here for the contributions that underpin this article. The corresponding author is Xiaofeng Li.