Introduction
Ships are vital sea transit vehicles and key components of military and defense activities, so their movements must be closely observed. Ship identification is of utmost importance in the fight against illicit fishing, in port trade, and for marine traffic safety. At present, radar, optical and infrared reflectance, thermal infrared sensors, satellite remote sensing, and hyperspectral imaging are the main data sources used for monitoring small ships in the ocean [1], [2]. The capacity of satellite remote sensing technology to provide high-resolution, low-noise images has attracted considerable interest.
Small-target detection is an open problem in remotely sensed imagery for a wide range of applications, including large-scale monitoring of marine ships, intelligent marine traffic, and ship position-based services. Traditional ship detection methods are mainly based on constant false alarm rate (CFAR) detection [3]. These methods first perform land–ocean segmentation, which suppresses land pixels and prevents interference with the CFAR step but limits the speed at which small-target ships can be acquired. After CFAR prescreening, discriminators must then be designed to suppress noise. In addition, these methods usually rely on the statistical distribution of sea clutter, resulting in poor robustness on new remotely sensed images [4].
With the development of deep learning-based target detection algorithms in computer vision (CV), researchers in remote sensing also began to explore deep learning methods for detecting small-target ships [5]. Because remote sensing images were difficult to acquire and no specialized remote sensing dataset yet existed, deep learning-based detection methods could not initially be applied to remote sensing ship detection. The release of the public SSDD dataset [6] set off a wave of deep learning-based remote sensing ship detection: SSDD provided researchers with a large amount of remote sensing data and evaluation criteria, which solved the data shortage faced by deep learning algorithms. By now, more and more researchers have adopted deep learning-based methods in this field. Sun et al. [53] proposed an anchor-free detection method addressing the complex environments and diverse ship scales of SAR images, providing a better approach to SAR ship detection. Wang et al. [54] proposed the MFFN multifeature fusion network, which can extract ship texture information from the background of remote sensing images. Zhou et al. [55] proposed a network specifically for detecting small ships in remote sensing images to address the difficulty of this task. Gong et al. [56] proposed the SSPNet network together with a small-ship enhancement strategy, which contributed to SAR ship detection. Li et al. [7] used an improved Faster RCNN, a two-stage target detection method, to improve the accuracy of small-target ship detection, but its detection time is long and it cannot detect ships in real time. Subsequently, the SSD [8] and you only look once (YOLO) [9] single-stage detection algorithms were introduced, which unify target localization and classification in a single neural network model, achieving both high accuracy and fast detection. However, these generic target detectors exhibit problems when applied directly to small-target ship detection in remote sensing datasets.
First, few datasets exist for small-target ship detection in remote sensing imagery. In recent datasets, such as SSDD, MSDS, and AIR-SAR Ship, the ship objects are multiscale and low in pixel count, and the environments in which the ships appear are simple, so these datasets cannot support the ship detection task in complex environments. Therefore, a small-target ship dataset of remote sensing images in complex marine environments needs to be established.
Second, current mainstream detection models often miss ships that are small or ambiguous. Our detection task faces challenges from farmed nets, trailing trajectories, cloud cover, and ships in close proximity (as shown in Fig. 1). Since smaller ships usually occupy only a few pixels of the image [10], repeated feature extraction causes the network to lose small-target features and spatial layering information.
Finally, remote sensing datasets suffer from background noise, such as mariculture nets in near-shore harbors, lighthouses on the sea surface, and islands, which usually lead to false alarms in ship detection. Traditional algorithms use handcrafted features to distinguish small-target ships from such disturbances, but they usually lack accuracy and effectiveness.
Based on the above analysis, this article constructs a dataset of remote sensing images of small-target ships. The images are mainly from the Hainan-1 satellite and were obtained after cropping, data augmentation, and other operations. In total, 3831 selected high-quality remote sensing images were obtained, in which we labeled 8418 instances of one category (ships) with horizontal bounding boxes. Second, YOLOv7, as a single-stage detector, retains the advantages of high detection accuracy and speed on small-target objects. In this article, we apply YOLOv7 to small-target ship detection in remote sensing images and further optimize its accuracy and speed so that it can identify small, fuzzy ship targets more accurately and efficiently. To enhance the detection performance on small-target ships, we propose the CSDP-YOLO algorithm, which makes up for the shortcomings of YOLOv7 [11] in detecting small-target ships. The main contributions of this work are as follows.
A dataset of small-target ships from satellite remote sensing images in complex sea areas was constructed. The dataset contains 3831 remotely sensed images with 8418 labeled instances. This dataset does not require land–sea segmentation, which is helpful for the dynamic monitoring of ships at sea and in harbors.
Remote sensing small targets occupy few pixels, so during feature extraction, tiny pixel offsets in the feature map lead to a decrease in detection accuracy. To address this, we propose a CSDP structure based on depthwise convolution. The CSDP module consists of a fully convolutional block composed of large-kernel depthwise convolution and pointwise convolution, which replaces part of the extended efficient layer aggregation network (ELAN) of the original YOLOv7 and enhances the model's feature extraction for small targets. The large-kernel depthwise convolution captures a wider range of contextual information and helps to identify complex patterns of ocean–land interaction in the image. The pointwise convolution provides interchannel interaction while keeping the computational cost low.
To address the serious class imbalance between small-target ships and background, we introduce MPDIoU as the penalty term of the loss function, so that the bounding box loss can make full use of the geometric properties of bounding box regression, speeding up model convergence and improving detection accuracy by minimizing the distances between the top-left and bottom-right vertices of the predicted and groundtruth bounding boxes.
The rest of this article is organized as follows. Section II describes the proposed dataset and the current methods used for ship detection in remote sensing images. Section III presents the detailed structure of the proposed CSDP-YOLO. Section IV describes the dataset and the experimental setup, verifies the improvements of the proposed algorithm through comparative experiments, analyzes the experimental results, and verifies the generalization of the model on a publicly available dataset. Finally, Section V concludes this article.
Related Work
A. Remote Sensing Datasets
In the past few years, many target detection datasets, such as SSDD, OpenSARship [13], SAR-Ship-Dataset, AIR-SAR Ship, HRSID, LS-SSDDv1.0, FUSARv1.0, and MSDS, have been proposed to advance research on ship detection in remote sensing imagery. However, the objects in these datasets span multiple scales, so they are better suited to evaluating detectors designed for multiscale object detection than for small object detection. Although some work on small-target ship detection uses mainstream remote sensing data for training, the ocean is complex, and most existing datasets do not contain small-target ships in varied, complex situations, such as cloud cover, undersea nets, harbor nearshore areas, and islands. This study, therefore, constructs a small-target ship dataset that reflects the complexity of the marine environments in which small-target ships appear.
The construction strategy of the ship dataset in this study is as follows. We obtained 70 original remote sensing images from "Hainan No.1 Star 01" at a resolution of 28 000 × 28 000 pixels and cropped them into tiles of 1024 × 1024 pixels; edge remnants of the original images smaller than 1024 pixels were saved as-is, without padding. To facilitate network training, the cropped remote sensing images were manually filtered, finally yielding 3831 high-definition remote sensing images. Our proposed dataset contains only one object category (ship), and 97% of the small-target ship objects are smaller than 40 pixels, which pushes the difficulty of small-target detection to the extreme and meets the needs of practical application scenarios. Fig. 2 shows some representative samples, including cloud cover, harbor docking, and marine aquaculture nets. Finally, the whole dataset is randomly split into a training set (70%), a validation set (20%), and a test set (10%).
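As an illustration of this cropping strategy, the following minimal Python sketch tiles one scene into 1024 × 1024 crops and keeps undersized edge remnants unpadded. The use of PIL and the output file naming are our assumptions for illustration, not the authors' actual pipeline.

```python
import os
from PIL import Image

TILE = 1024  # tile resolution described in the text

def tile_scene(scene_path, out_dir):
    """Cut one ~28000 x 28000 px scene into 1024 x 1024 tiles; edge remnants
    smaller than 1024 px are saved as-is, without padding."""
    os.makedirs(out_dir, exist_ok=True)
    Image.MAX_IMAGE_PIXELS = None          # allow very large scenes
    img = Image.open(scene_path)
    w, h = img.size
    idx = 0
    for top in range(0, h, TILE):
        for left in range(0, w, TILE):
            # Edge tiles may be narrower or shorter than 1024 px.
            box = (left, top, min(left + TILE, w), min(top + TILE, h))
            img.crop(box).save(os.path.join(out_dir, f"tile_{idx:05d}.png"))
            idx += 1
```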
A number of representative remote sensing images from different backgrounds (the ship target areas have been magnified).
B. Deep Learning Object Detection
Since deep neural networks can automatically learn threshold and shape features of the target, they are of great research value for ship detection in remote sensing images. Deep learning-based object detection algorithms can be divided into two categories: two-stage detectors and single-stage detectors. A single-stage detector uses a fully convolutional network to perform classification and regression on the anchor boxes only once to obtain detection results [14]. A two-stage detector uses a deep neural network to perform classification and regression on the anchor boxes twice to obtain detection results.
So far, common two-stage detectors include R-CNN, Faster R-CNN, the feature pyramid network [15], Mask RCNN, etc. Most later two-stage detectors are designed as improvements on earlier networks, mostly targeting the backbone network, the region proposal network, and similar components. Although two-stage detectors are more accurate, their detection time is longer because candidate bounding boxes must first be extracted with convolutional neural networks. Common single-stage detectors include SSD [8], YOLO [9], and RetinaNet [17]. In addition, Zhang et al. [52] proposed a multiscale global scattering feature association network for remote sensing ship target identification, which gave us important inspiration for ship identification in remote sensing images. Kang et al. [57] proposed a multilayer fusion convolutional neural network to address the difficulty of detecting small-scale ships in SAR images. Sun et al. [58] proposed a bidirectional fusion module for YOLO targeting the multiscale, densely arranged ships in high-resolution SAR images, giving the model better robustness and generalization.
The YOLO series is widely used in ship detection from remote sensing images. Deng et al. [18] used YOLOv2 to detect ships in remote sensing images and proposed YOLOv2-reduced, which mainly prunes part of the YOLOv2 network; it achieves greater efficiency than YOLOv2 with little loss of accuracy (an AP of 89.76% for YOLOv2-reduced versus 90.05% for YOLOv2). Zhang et al. [6], [52] used the DarkNet-19 network to replace the original YOLOv3 backbone, and the new network greatly improved the efficiency of ship detection in remote sensing images. Subsequently, Wang et al. [20] proposed the SSS-YOLO network, whose carefully designed feature extraction layers enhance the semantic information of small-target ships. Zhou et al. [27] proposed a multiscale ship detection network based on the YOLOv5 model, achieving a good balance between model complexity and inference time. Tang et al. [28] proposed a convolutional block attention mechanism with a multiscale receptive field based on YOLOv7, making full use of the information in feature maps to accurately capture their useful regions. Ship detection methods for remote sensing images with single-stage and two-stage detectors are shown in Fig. 3. Nowadays, the high accuracy and real-time performance of YOLOv7 are particularly prominent in target detection. YOLOv7 adopts a multilevel pyramid structure, which means that it can simultaneously predict target locations and categories on feature maps of different resolutions. To better train the model, it allows auxiliary heads to be attached to the middle layers of the pyramid during training. The advantage of this training strategy is that it helps compensate for information that might be lost in the next level of pyramid prediction. In other words, by making predictions on the middle layers of the pyramid, the model can better capture features at various resolution levels, thus improving detection accuracy.
However, the performance of YOLOv7 on remote sensing image datasets is not ideal. To improve it, this article uses CSDP as part of the backbone network of YOLOv7. Depthwise grouped convolution is used in the CSDP module to divide the input channels into multiple groups, and the channels in each group are convolved only with the convolution kernels of the corresponding group. The CSDP module processes channel features and can be regarded as a channel attention mechanism. This helps the model better capture important information in the input features and improves small-target ship identification. At the same time, the MPDIoU loss function is introduced to address the imbalance between foreground and background classes, further improving the efficiency and accuracy of the model's bounding box regression.
Ship detection methods for remote sensing images with one-stage and two-stage detectors.
Proposed Methods
YOLO is an advanced single-stage target detection algorithm that has undergone multiple iterations [9]. In addition to the original version, many derivative algorithms based on the YOLO architecture have been optimized and improved to meet the needs of different application scenarios. YOLOv7 is one such optimized version; it adopts the extended efficient layer aggregation network (ELAN) strategy [40]. By using cardinality to combine different features and controlling the shortest gradient path, the neural network can learn and converge more effectively, enhancing its learning ability without destroying the original gradient path. The excellent learning ability of YOLOv7 makes it well suited to the detection of small-target objects. Based on YOLOv7, this study proposes CSDP-YOLO, a method for detecting small-target ships in remote sensing images.
A. Proposed CSDP-YOLO Framework
The CSDP-YOLO network architecture is shown in Fig. 4. First, two efficient layer aggregation network (ELAN) modules in the backbone of YOLOv7 are removed, and the CSDP layer is introduced into the backbone to enhance the extraction of low-level feature maps. Low-level feature maps have higher resolution and contain information on the tail and shape of small-target ships, as well as the locations and details of islands, which helps improve the discrimination and detection accuracy of the model. Second, given the imbalance between the background and foreground of small-target ships, MPDIoU is introduced as the loss function to solve the problem of small losses and slow gradient convergence during training. The proposed architecture comprises CBS, MP, ELAN, and CSDP modules. CBS is a basic convolution module consisting of convolution blocks of variable length. Cat is a concatenation module that concatenates the outputs of other convolutional layers to improve network accuracy. CSDP is the low-level feature extraction layer proposed in this article. Specifically, its DP module is a fully convolutional block composed of large-kernel depthwise convolution and pointwise convolution: the depthwise convolution mixes spatial locations, and the pointwise convolution mixes channel information. Therefore, during low-level feature extraction, a larger receptive field attends to finer details, such as ship ends and shapes, so that the model can better account for global information while suppressing interference from islands, clouds, and other factors [48]. MP is a downsampling module that gradually reduces the size of the feature map, allowing the network to detect targets at different resolutions and improving detection across target sizes. SPPCSPC is an improved spatial pyramid pooling structure that helps handle objects at different scales, making the model more robust.
In summary, small-target recognition from remote sensing images is achieved using the trained CSDP-YOLO model. The training process is summarized in Algorithm 1.
Algorithm 1: Training Strategy of CSDP-YOLO.
Input: Training samples with groundtruth ship annotations.
Output: A trained CSDP-YOLO model (architecture shown in Fig. 4).
Initialize the model parameters.
repeat
Randomly select a batch of training instances.
Pass the training samples forward through the CSDP-YOLO model.
Compute the training loss, including the MPDIoU bounding box loss.
Propagate the loss backward and update the model parameters.
until the model converges.
B. CSDP Feature Extraction Architecture
The extended efficient layer aggregation network (ELAN) of YOLOv7 learns ship features using different layer weights and enhances the learning capability of the network by introducing expand, shuffle, and merge-cardinality operations while keeping the gradient paths continuous, improving performance and generalization [48]. However, these operations also increase the computational complexity of the network, and the repeated convolutions can cause the neural network to lose information about small-target ships in remote sensing datasets.
To better integrate the detailed information of small-target ships, islands, and other objects in shallow feature maps, and considering that remote sensing small targets are fuzzy, irregularly shaped, and poorly imaged [41], we construct the CSDP feature extraction layer. With fewer computational parameters than the ELAN network, the module applies grouped convolution to each input channel, setting the number of groups equal to the number of input channels, and then performs a pointwise convolution to mix the features of each output channel. This improves the neural network's ability to acquire small-target ship information in shallow feature maps and yields better performance on the small-target ship detection task. As shown in Fig. 5, our proposed CSDP feature fusion structure consists of three feature-transforming convolutional layers, a depthwise convolution module, and a pointwise convolution module. The CSDP layer first accepts an input feature map and performs feature transformations through two 1 × 1 convolutional layers (CV1 and CV2) to adjust its dimensionality, passing the results to the subsequent DP module and the CV3 convolutional layer, respectively. Let X denote the input feature map; the outputs z1 and z2 of CV1 and CV2 are computed as
\begin{align*}
z_{1}&= \operatorname{BN}\left(\sigma \left\lbrace \operatorname{Conv1}_{c_{\text{in}} \rightarrow h} \left(X,\ s=1,\ k\_\text{size}=1\right)\right\rbrace \right) \tag{1}\\
z_{2}&= \operatorname{BN}\left(\sigma \left\lbrace \operatorname{Conv2}_{c_{\text{in}} \rightarrow h} \left(X,\ s=1,\ k\_\text{size}=1\right)\right\rbrace \right) \tag{2}
\end{align*}
Proposed CSDP network architecture, where CV1, CV2, and CV3 are 1×1 convolutional networks.
The DP block is composed of a large-kernel depthwise convolution (the number of groups equals the number of input channels) followed by a pointwise convolution, computed as
\begin{align*}
z_{l}^{\prime }&= \operatorname{BN}\left(\sigma \left\lbrace \operatorname{ConvDepthwise}\left(z_{1},\ s=1,\ k\_\text{size}=9\right)\right\rbrace \right)+z_{1} \tag{3}\\
z_{l+1}&= \operatorname{BN}\left(\sigma \left\lbrace \operatorname{ConvPointwise}\left(z_{l}^{\prime },\ s=1,\ k\_\text{size}=1\right)\right\rbrace \right) \tag{4}
\end{align*}
\begin{equation*}
\text{CSDP}=\text{BN}\left(\sigma \left\lbrace \operatorname{concat}\left(z_{l+1}, z_{2}\right)\right\rbrace \right) \tag{5}
\end{equation*}
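To make the data flow of (1)-(5) concrete, the following PyTorch sketch assembles the CSDP block. The activation σ (taken here as SiLU, matching YOLOv7's CBS blocks), the hidden width h, and the omission of the CV3 projection shown in Fig. 5 (which does not appear in (1)-(5)) are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class ConvActBN(nn.Module):
    """Conv -> sigma -> BN, matching the BN(sigma{Conv(.)}) order of (1)-(4)."""
    def __init__(self, c_in, c_out, k=1, groups=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, stride=1, padding=k // 2,
                              groups=groups, bias=False)
        self.act = nn.SiLU()   # assumed activation sigma
        self.bn = nn.BatchNorm2d(c_out)

    def forward(self, x):
        return self.bn(self.act(self.conv(x)))

class CSDP(nn.Module):
    def __init__(self, c_in, h):
        super().__init__()
        self.cv1 = ConvActBN(c_in, h, k=1)        # Eq. (1): z1
        self.cv2 = ConvActBN(c_in, h, k=1)        # Eq. (2): z2
        self.dw = ConvActBN(h, h, k=9, groups=h)  # Eq. (3): depthwise, groups == channels
        self.pw = ConvActBN(h, h, k=1)            # Eq. (4): pointwise channel mixing
        self.act = nn.SiLU()
        self.bn = nn.BatchNorm2d(2 * h)           # Eq. (5): sigma and BN over the concat

    def forward(self, x):
        z1 = self.cv1(x)
        z2 = self.cv2(x)
        z = self.dw(z1) + z1                      # residual connection in Eq. (3)
        z = self.pw(z)                            # z_{l+1}
        return self.bn(self.act(torch.cat([z, z2], dim=1)))  # Eq. (5)
```

Note that a feature map of shape (N, c_in, H, W) keeps its spatial size throughout, since every convolution uses stride 1 with same-size padding; this is the no-downsampling property discussed below.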
Compared with the ELAN aggregation network, our proposed CSDP module has the following advantages. First, it uses large-kernel depthwise convolution to mix spatial locations and pointwise convolution to mix channel locations, fully utilizing the feature information of small-target ships across convolution groups and improving the network's ability to extract representations of small-target ships in complex environments. Second, all network layers of the module keep the input and output at the same resolution, with no downsampling of the feature maps across consecutive layers, which prevents the loss of small-target ship information. Finally, because depthwise and pointwise convolutions are used for feature extraction, the number of network parameters is reduced, so CSDP-YOLO requires fewer computational resources.
C. MPDIoU Loss Function
In small-target detection from remote sensing images, the regions of small-target ships and the background are severely imbalanced, and the CIoU loss function degenerates when the predicted box and the groundtruth box have the same aspect ratio, producing small losses and slow gradient convergence during training. We, therefore, introduce MPDIoU as the penalty term of the loss function, which mitigates the class-imbalance problem. CIoU is used as the penalty term in YOLOv7 as follows:
\begin{align*}
\text{CIoU}&= \text{IoU}-\frac{\rho ^{2}\left(\mathcal {B}_{\text{gt}}, \mathcal {B}_{\text{prd}}\right)}{C^{2}}-\alpha V \tag{6}\\
V&= \frac{4}{\pi ^{2}}\left(\arctan \frac{w^{\text{gt}}}{h^{\text{gt}}}-\arctan \frac{w^{\text{prd}}}{h^{\text{prd}}}\right)^{2} \tag{7}
\end{align*}
where $\rho(\cdot,\cdot)$ is the Euclidean distance between the centers of the groundtruth box $\mathcal{B}_{\text{gt}}$ and the predicted box $\mathcal{B}_{\text{prd}}$, $C$ is the diagonal length of their smallest enclosing box, $V$ measures aspect-ratio consistency, and $\alpha$ is a positive trade-off parameter.
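For later comparison with MPDIoU, the following PyTorch sketch computes the CIoU of (6) and (7) for corner-format boxes. The weighting α = V / ((1 − IoU) + V) follows the original CIoU formulation, since this article does not spell it out; the eps guard is our own choice.

```python
import math
import torch

def ciou(box_prd, box_gt, eps=1e-7):
    """box_prd, box_gt: (N, 4) tensors of (x1, y1, x2, y2). Returns (N,) CIoU."""
    # IoU term.
    ix1 = torch.max(box_prd[:, 0], box_gt[:, 0])
    iy1 = torch.max(box_prd[:, 1], box_gt[:, 1])
    ix2 = torch.min(box_prd[:, 2], box_gt[:, 2])
    iy2 = torch.min(box_prd[:, 3], box_gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    w_prd, h_prd = box_prd[:, 2] - box_prd[:, 0], box_prd[:, 3] - box_prd[:, 1]
    w_gt, h_gt = box_gt[:, 2] - box_gt[:, 0], box_gt[:, 3] - box_gt[:, 1]
    iou = inter / (w_prd * h_prd + w_gt * h_gt - inter + eps)

    # rho^2: squared center distance; C^2: squared diagonal of the
    # smallest enclosing box.
    rho2 = ((box_prd[:, 0] + box_prd[:, 2] - box_gt[:, 0] - box_gt[:, 2]) ** 2
            + (box_prd[:, 1] + box_prd[:, 3] - box_gt[:, 1] - box_gt[:, 3]) ** 2) / 4
    cw = torch.max(box_prd[:, 2], box_gt[:, 2]) - torch.min(box_prd[:, 0], box_gt[:, 0])
    ch = torch.max(box_prd[:, 3], box_gt[:, 3]) - torch.min(box_prd[:, 1], box_gt[:, 1])
    c2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term V of Eq. (7); V vanishes whenever the two
    # boxes share an aspect ratio, which is the failure case shown in Fig. 6.
    v = (4 / math.pi ** 2) * (torch.atan(w_gt / (h_gt + eps))
                              - torch.atan(w_prd / (h_prd + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return iou - rho2 / c2 - alpha * v
```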
Algorithm 2: Intersection Over Union With Minimum Points Distance.
Input: Two arbitrary convex shapes A, B and the width w and height h of the input image.
Output: MPDIoU.
For A and B, determine the top-left and bottom-right vertices $(x_{1}^{A}, y_{1}^{A})$, $(x_{2}^{A}, y_{2}^{A})$ and $(x_{1}^{B}, y_{1}^{B})$, $(x_{2}^{B}, y_{2}^{B})$.
Compute $d_{1}^{2}=(x_{1}^{B}-x_{1}^{A})^{2}+(y_{1}^{B}-y_{1}^{A})^{2}$ and $d_{2}^{2}=(x_{2}^{B}-x_{2}^{A})^{2}+(y_{2}^{B}-y_{2}^{A})^{2}$.
MPDIoU = $\frac{|A \cap B|}{|A \cup B|}-\frac{d_{1}^{2}}{w^{2}+h^{2}}-\frac{d_{2}^{2}}{w^{2}+h^{2}}$.
CIoU failure when the predicted box (red) has the same aspect ratio as the real box (yellow).
The MPDIoU metric simplifies the similarity comparison between predicted and groundtruth bounding boxes and allows regression whether or not the bounding boxes overlap. During training, the model's predicted bounding box $\mathcal{B}_{\text{prd}}$ is forced to approach its groundtruth box $\mathcal{B}_{\text{gt}}$ by minimizing
\begin{equation*}
\mathcal {L}=\min _\theta \sum _{\mathcal {B}_{\text{gt}} \in \mathbb {B}_{\text{gt}}} \mathcal {L}\left(\mathcal {B}_{\text{gt}}, \mathcal {B}_{\text{prd}} \mid \theta \right) \tag{8}
\end{equation*}
where $\theta$ denotes the parameters of the regression model. The corresponding bounding box loss is
\begin{equation*}
{{\mathcal {L}}_{\text{MPDIoU}}}=1-\text{MPDIoU}. \tag{9}
\end{equation*}
Meanwhile, the parameters of bounding box regression can be determined from four vertex coordinates; the regression factors are shown in Fig. 7, and the regression parameters are calculated in the following equations:
\begin{align*}
\left| C \right| = & \left({\text{max}\left({x_{2}^{\text{gt}},x_{2}^{\text{prd}}} \right) - \text{min}\left({x_{1}^{\text{gt}},x_{1}^{\text{prd}}} \right)} \right) \\
&\times\left({\text{max}\left({y_{2}^{\text{gt}},y_{2}^{\text{prd}}} \right) - \text{min}\left({y_{1}^{\text{gt}},y_{1}^{\text{prd}}} \right)} \right) \tag{10}\\
x_{c}^{\text{gt}}=& \frac{x_{1}^{\text{gt}}+x_{2}^{\text{gt}}}{2},\quad y_{c}^{\text{gt}}=\frac{y_{1}^{\text{gt}}+y_{2}^{\text{gt}}}{2} \\
x_{c}^{\text{prd}}=& \frac{x_{1}^{\text{prd}}+x_{2}^{\text{prd}}}{2},\quad y_{c}^{\text{prd}}=\frac{y_{1}^{\text{prd}}+y_{2}^{\text{prd}}}{2} \tag{11}\\
{{w}_{\text{gt}}}=& x_{2}^{\text{gt}}-x_{1}^{\text{gt}},{{h}_{\text{gt}}}=y_{2}^{\text{gt}}-y_{1}^{\text{gt}} \\
{{w}_{\text{prd}}}=& x_{2}^{\text{prd}}-x_{1}^{\text{prd}},{{h}_{\text{prd}}}=y_{2}^{\text{prd}}-y_{1}^{\text{prd}} \tag{12}
\end{align*}
All factors of the existing bounding box regression losses can thus be determined from the four vertex coordinates: the enclosing-box area $|C|$ in (10), the center coordinates in (11), and the widths and heights of the groundtruth and predicted boxes in (12).
Algorithm 3: IoU and MPDIoU as Bounding Box Losses.
Input: Predicted bounding box $\mathcal{B}_{\text{prd}}$, groundtruth bounding box $\mathcal{B}_{\text{gt}}$, and input image width $w$ and height $h$.
Output: $\mathcal{L}_{\text{IoU}}$ and $\mathcal{L}_{\text{MPDIoU}}$.
Determine the predicted bounding box $(x_{1}^{\text{prd}}, y_{1}^{\text{prd}}, x_{2}^{\text{prd}}, y_{2}^{\text{prd}})$ and the groundtruth bounding box $(x_{1}^{\text{gt}}, y_{1}^{\text{gt}}, x_{2}^{\text{gt}}, y_{2}^{\text{gt}})$.
Calculate the areas of the groundtruth bounding box and the predicted bounding box.
Calculate the intersecting area $I$ of the groundtruth bounding box and the predicted bounding box.
IoU = $I$ divided by the union area of the two boxes; MPDIoU then follows Algorithm 2.
In Algorithm 3, the losses are obtained as $\mathcal{L}_{\text{IoU}} = 1-\text{IoU}$ and $\mathcal{L}_{\text{MPDIoU}} = 1-\text{MPDIoU}$.
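The following PyTorch sketch implements Algorithms 2 and 3 as a batched MPDIoU loss for corner-format boxes; the tensor shapes and the eps guard against division by zero are our own choices, not the authors' code.

```python
import torch

def mpdiou_loss(box_prd, box_gt, img_w, img_h, eps=1e-7):
    """box_prd, box_gt: (N, 4) tensors of (x1, y1, x2, y2). Returns (N,) loss."""
    # Intersection and union areas (Algorithm 3).
    ix1 = torch.max(box_prd[:, 0], box_gt[:, 0])
    iy1 = torch.max(box_prd[:, 1], box_gt[:, 1])
    ix2 = torch.min(box_prd[:, 2], box_gt[:, 2])
    iy2 = torch.min(box_prd[:, 3], box_gt[:, 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_prd = (box_prd[:, 2] - box_prd[:, 0]) * (box_prd[:, 3] - box_prd[:, 1])
    area_gt = (box_gt[:, 2] - box_gt[:, 0]) * (box_gt[:, 3] - box_gt[:, 1])
    iou = inter / (area_prd + area_gt - inter + eps)

    # Squared distances between corresponding top-left and bottom-right
    # vertices (Algorithm 2), normalized by the squared image diagonal.
    d1 = (box_prd[:, 0] - box_gt[:, 0]) ** 2 + (box_prd[:, 1] - box_gt[:, 1]) ** 2
    d2 = (box_prd[:, 2] - box_gt[:, 2]) ** 2 + (box_prd[:, 3] - box_gt[:, 3]) ** 2
    mpdiou = iou - (d1 + d2) / (img_w ** 2 + img_h ** 2)

    return 1.0 - mpdiou  # Eq. (9)
```

Because the two vertex distances are nonzero whenever the boxes differ, the loss keeps a useful gradient even when the predicted and groundtruth boxes share an aspect ratio, which is exactly the CIoU failure case of Fig. 6.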
Experimental Results and Analysis
A. Datasets
The remote sensing dataset used in this study is derived from the Hainan-1 satellite, originating from Hainan province, China. This satellite has provided a wealth of remote sensing imagery for various maritime applications and ocean management, enabling dynamic ship detection. We sliced the original remote sensing images into frames of 1024 × 1024 pixels. Because small-target ships can be difficult to observe accurately with the naked eye, we applied color-depth changes to the original images. Subsequently, based on AIS information, we used labeling software to annotate the small-target ships, ultimately yielding 3831 images. We found that many small-target ships were hidden in complex backgrounds, exhibiting conditions such as trailing ship trajectories, crowding, cloud cover, and proximity to the shore (as shown in Fig. 2). Concurrently, approximately 97% of the targets occupy no more than 0.15% of the image area, so detecting such small targets requires augmented inference over both shallow and deep features. This dataset pushes the difficulty of small-target detection to the extreme, fulfilling the requirements of practical application scenarios. Fig. 2 illustrates several representative images. In the experiment, the training, validation, and test sets were divided at a ratio of 7:2:1, encompassing 5795, 1786, and 837 instances, respectively. The image size input to the network is uniformly set to 640 × 640 pixels.
B. Experimental Setup
We implemented CSDP-YOLO on PyTorch 2.0.1 and trained and tested it using the integrated Matrox G200eW3 Graphics Controller. The operating system for training and testing was Ubuntu. Specific details are given in Table I.
To ensure adequate neural network training, the batch size was uniformly set to 16 with an initial learning rate of 0.01, over 300 rounds of training. The stochastic gradient descent (SGD) optimizer was selected to minimize the MPDIoU loss. During the training phase, pretrained YOLOv7 weights were not used: although CSDP-YOLO and YOLOv7 share part of the network architecture, reusing weights for the detection of small-target ships in remote sensing images may result in negative transfer, that is, a decline in performance. Training from scratch avoids this, as it prevents the model from being constrained by previous training experiences.
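The configuration above can be summarized in a short, illustrative PyTorch loop. The stand-in model and synthetic batches are placeholders so the sketch runs end to end, and the momentum and weight-decay values are YOLOv7's usual defaults rather than figures reported in this article.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 1)  # stand-in for the CSDP-YOLO network
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,       # initial LR 0.01
                            momentum=0.937, weight_decay=5e-4)  # assumed defaults

for epoch in range(300):                       # 300 rounds of training
    for _ in range(4):                         # stand-in for the dataloader
        images = torch.randn(16, 3, 640, 640)  # batch size 16, 640 x 640 input
        preds = model(images)
        loss = preds.square().mean()           # placeholder for the MPDIoU-based loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```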
C. Experimental Metrics
To demonstrate the advantages of CSDP-YOLO, we use precision (P), recall (R), F1 score, and mean average precision (mAP), reported as mAP at an IoU threshold of 0.5 (mAP@0.5) and mAP@0.5:0.95, with the evaluation equations as follows:
\begin{align*}
\text{precision}(P)=&\frac{{\text{TP}}}{{\text{TP}+\text{FP}}} \tag{13}\\
\text{Recall}(R)=&\frac{{\text{TP}}}{{\text{TP}+\text{FN}}} \tag{14}\\
\text{AP}_{i}=&\int _{0}^{1} P_{i}(R_{i})d(R_{i}) \tag{15}\\
\text{mAP} =& \frac{1}{n}\sum _{i=1}^{n}\text{AP}_{i}. \tag{16}
\end{align*}
In the above equations, TP denotes samples correctly identified as the positive class, FP denotes negative samples incorrectly identified as positive, FN denotes positive samples incorrectly identified as negative, and TN denotes samples correctly identified as negative. Typically, precision (P) is the proportion of true positives among all samples predicted as positive, whereas recall (R) is the proportion of true positives among all actual positives (real targets) [51]. F1 assesses the balance between the model's precision and recall. mAP is the average of the AP values and measures the overall detection accuracy of object detection algorithms; mAP@0.5 is computed at an IoU threshold of 0.5, and mAP@0.5:0.95 quantifies the model's average performance across IoU thresholds from 0.5 to 0.95 [49].
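As a concrete reading of (13)-(16), the following sketch computes AP for one class from score-ranked detections, assuming each detection has already been matched to a groundtruth box at the chosen IoU threshold; the all-point interpolation used here is one common convention, not necessarily the exact protocol of this article.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """scores: (N,) detection confidences; is_tp: (N,) 1/0 match flags."""
    scores, is_tp = np.asarray(scores, float), np.asarray(is_tp, float)
    order = np.argsort(-scores)            # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(1.0 - is_tp[order])
    precision = tp / (tp + fp)             # Eq. (13) at every rank
    recall = tp / num_gt                   # Eq. (14) at every rank
    # Eq. (15): integrate P over R using a monotone precision envelope.
    p = np.concatenate(([1.0], precision, [0.0]))
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

# With a single class (ships), the mAP of Eq. (16) reduces to this one AP.
```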
D. Experimental Results and Analysis
We evaluated CSDP-YOLO on the integrated Matrox G200eW3 Graphics Controller. To thoroughly validate the effectiveness of our proposed methodology, we compared our network with the original YOLOv7, YOLOv7x, and versions of YOLOv7 enhanced with attention mechanisms.
The attention mechanism is a deep learning technique inspired by the human visual system, analogous to our inclination to focus on specific areas while observing the world. It allows computational models to focus on the most critical parts of the task at hand by allocating weights to different inputs, assigning higher weights to task-relevant information and downweighting irrelevant information. Such a technique permits the model to better handle complicated data and has brought substantial advances across various applications. Wang et al. [42] incorporated a CBAM module into YOLOv5, enhancing object detection performance on remote sensing images. CBAM is a convolutional attention module that introduces spatial and channel attention mechanisms, helping CNN models better understand and utilize the features of the input data. Zhu et al. [43] introduced the BiFormer attention module, a dynamically sparse attention mechanism that filters redundant information and retains only the parts of interest, greatly boosting the identification of small targets. Both methods are highly representative attention mechanisms [49]. We introduced them into YOLOv7 to explore their performance in detecting small-target ships in remote sensing data [50] and compared them with the improvements proposed in this article. The entire network is illustrated in Fig. 4.
The detection results of the various algorithms on the remote sensing small-target ship dataset are shown in Tables II and III, and Figs. 9 and 10 display comparison results of the different methods. Fig. 8 illustrates the mAP and recall curves of YOLOv7 and CSDP-YOLO. Fig. 10(a) and (b) shows the detection results in near-shore ports and dense ship conditions, respectively, while Fig. 10(c) and (d) shows the detection results under cloud cover and ocean waves, respectively. Since ships are the sole detection targets in the remote sensing dataset, precise localization of ship targets has more value than target classification. Compared with the original YOLOv7 network, the ELAN aggregation layer of YOLOv7X is wider than the original ELAN layer, slightly enhancing detection accuracy and recall. However, this introduces more computation, which significantly affects the detection speed (the FPS changes from 5.1 to 7.7, as given in Table III), and the model's parameters increase from 141.8 to 270.1 M, making it harder to deploy on terminal devices. BiFormer [43], a dynamically sparse attention mechanism, filters most irrelevant key–value pairs in coarse area features. This retains only a small portion of the routing area, highlighting crucial information on small targets and substantially improving ship feature extraction. However, due to its excessive focus on global information, many ships are omitted (up to 167) and the FPS is notably affected [e.g., ships are not detected under cloud and fog obstruction or in close proximity, as shown in Fig. 10(c) and (b), respectively]. The CBAM model, which emphasizes important spatial information features, still does not reduce false detections and omissions in complex environments [as in Fig. 10(d), where ship omission occurs in a crowded port].
MPDIoU and CIoU loss function training process. (a) Bounding box loss. (b) Overall loss.
Detection results of different methods for different scenarios in our dataset. (a) Near-shore harbors. (b) Dense and closely spaced ships. (c) Cloud cover. (d) Ocean waves.
On the other hand, CSDP-YOLO significantly reduces ship omissions (as given in Table III), and the parameter quantity of CSDP-YOLO is lower than that of the original YOLOv7 network, reaching 140.4 M, effectively reducing the required computational resources. Moreover, the class imbalance between the background and foreground of small-target ships is addressed through the CSDP layer in the backbone network and the MPDIoU loss function, maintaining stable detection performance under adverse weather and complex maritime conditions (as seen in Fig. 10). In comparison to YOLOv7, accuracy is improved by 8.6% and recall by 11%. In the tests on the test set (comprising 837 instances), CSDP-YOLO has the fewest omissions with no significant damage to the FPS, which reaches 5.5.
Furthermore, Fig. 8 illustrates the mAP and recall curves of YOLOv7 and CSDP-YOLO. The comparison shows that the proposed CSDP-YOLO outperforms YOLOv7 in terms of average detection accuracy and recall, indicating that the overall ability of CSDP-YOLO to recognize small-target ships in complex environments surpasses that of YOLOv7. Fig. 9 presents a comparison of the boundary loss and overall loss between the proposed model and YOLOv7. Our proposed model uses MPDIoU, whose boundary loss and overall loss are lower than those of YOLOv7's CIoU; the model using the MPDIoU loss function predicts boxes more accurately.
E. Ablation Experiments
In this research, some ELAN high-aggregation network layers in YOLOv7 are replaced with the proposed CSDP network layer (as shown in Fig. 4). To assess the effectiveness of the proposed CSDP network and the introduced MPDIoU loss function, we independently examined each replacement of the CSDP module in the backbone network of YOLOv7, focusing on AP values as well as recall. The results of the ablation experiments are given in Table IV. The ELAN aggregation layer of the original YOLOv7 did not deliver ideal detection results for small-target ships; our proposed CSDP network layer addresses the loss of information during the repeated feature extraction of small targets by using a mix of large-kernel depthwise convolution and pointwise convolution to blend channel and spatial positions, which more accurately captures the details of small-target ships. From the ablation experiments, we found that the model's performance was optimal when two of YOLOv7's ELAN aggregation layers in the backbone were replaced with the proposed CSDP layers. Consequently, the model's average detection accuracy reached 91.4%, and the recall reached 86.6%. This suggests that the CSDP module enhances the feature extraction capability of the neural network and reduces false detections and omissions. Lastly, the MPDIoU loss function significantly improved the recall of the neural network, reaching 83% even without the CSDP module. Because MPDIoU adequately considers the geometric properties of bounding box regression, minimizing the distances between the top-left and bottom-right vertices of the predicted and actual bounding boxes, it ultimately outperforms the original YOLOv7's CIoU loss function.
F. Generalizability Verification
To evaluate the generalizability of the proposed CSDP-YOLO model, we used the publicly available SSDD remote sensing ship detection dataset. We compared mainstream remote sensing ship detection models, namely, Faster RCNN, SSD, FCOS, YOLOv3, and YOLOv7, with the proposed CSDP-YOLO model under the same environment and conditions. The experimental results are given in Table V and demonstrate that CSDP-YOLO improves detection accuracy and recall to 93.6% and 93.7%, respectively, securing the best detection performance. The CSDP network layer mixes the spatial information of small-target ships by applying convolution to each input channel in groups, with the number of groups equal to the number of input channels; pointwise convolution then mixes the features of each output channel, allowing the network to pay more attention to the finer details of small-target ships and reducing interference from irrelevant information in remote sensing images, such as islands and lighthouses.
G. Discussions
Tables II and III show that, on our proposed dataset, the proposed CSDP-YOLO approach outperforms the other existing algorithms, including plain YOLOv7; as can be seen from the detection results in Fig. 10, CSDP-YOLO achieves accurate detection of ships in close proximity and under cloud occlusion, performing better than the other algorithms. Table V shows that CSDP-YOLO also outperforms the other existing algorithms on the public SSDD dataset. Nevertheless, Tables II and III also show that the proposed network can still miss or misdetect small targets. On the one hand, the dataset may contain fewer remote sensing images for certain complex backgrounds (e.g., mariculture nets, ship trailing), so the model learns less about these backgrounds during training. On the other hand, some small-target ships in our dataset occupy very few pixels, pushing the difficulty of small-target detection to the extreme, which means that the model inevitably loses feature information during training.
Considering that ship traffic in the South China Sea region varies across seasons, bad weather may render the remote sensing images taken by the satellite insufficiently clear. The experimental images were selected from harbors as well as the marine areas around Hainan's main landmass. Harbors see large numbers of ships entering and leaving, and effective detection of ships in close proximity can help prevent ship-to-ship transfers used for smuggling activities; see Fig. 10(b). For marine territory far from the land of Hainan province, see Fig. 10(c) and (d), sea conditions are sudden and changeable, so detecting ships under poor conditions can effectively ensure ship safety and, when accidents occur, support accurate localization and rapid alerting.
Using deep learning technology to detect small-target ships can, on the one hand, greatly improve detection accuracy under adverse conditions and reduce the false alarms of marine safety systems. On the other hand, introducing deep learning is conducive to constructing a new model of modern intelligent ship detection. From the point of view of maintaining marine safety, when many densely packed ships are entering and leaving a port, efficient and scientific analysis of the situation can reduce the investment of marine traffic management resources. At the same time, it can reduce subjective human judgment and the resulting false alarm rate of marine safety systems. This innovation plays an important role in promoting efficient ship detection, rational allocation of marine traffic management resources, and intelligent management of marine traffic, and can provide favorable technical support for marine ship management policy.
Conclusion
In this article, we construct a new dataset of remotely sensed ship images containing 3831 images in complex situations, such as harbors, cloud occlusion, ships in close proximity, and farmed nets, with at least one small-target ship instance in each image; the dataset contains 8418 instances in total. In addition, to address the unsatisfactory feature extraction of the ELAN high-aggregation network layer for small-target ships, especially in harbors and under dense, close-proximity, and cloud-covered conditions, we propose the CSDP network layer, which uses a large-kernel depthwise convolution to mix the spatial and channel positions of small-target ships and capture a wider range of contextual information without downsampling across consecutive layers. The proposed model remains effective for small-target ships even in complex environments. On our proposed dataset, we compare multiple advanced detection models (YOLOv7, YOLOv7-X, BiFormer-YOLOv7, and CBAM-YOLOv7). The experimental results show that the proposed CSDP-YOLO algorithm effectively improves the average precision, recall, and F1 score.
Small-target ships occupy few pixels in complex environments, and the foreground and background categories of small-target ships are imbalanced. The MPDIoU loss function can still be optimized when the predicted box has the same aspect ratio as the groundtruth bounding box, and it achieves more efficient and precise bounding box regression by minimizing the distances between the corresponding top-left and bottom-right points. We compare the bounding box losses of CIoU and MPDIoU, and the experimental results show that the trained MPDIoU bounding box loss is lower than that of CIoU, indicating that MPDIoU drives the model to predict bounding boxes more accurately.
Finally, to verify the generalization of our proposed model, we performed 300 rounds of training on the SSDD remote sensing dataset under the same conditions as before, comparing a variety of models commonly used for remote sensing ship detection (Faster R-CNN, SSD, FCOS, YOLOv3, YOLOv5s, and YOLOv7). The experimental results show that our proposed model achieves the best values of the evaluation indicators, including average precision, recall, and F1 score.
ACKNOWLEDGMENT
The authors would like to thank the referees for their constructive suggestions. For datasets and codes related to this article, please contact the corresponding author.