Introduction
In recent years, the success of deep learning [44], [46], [47] has led researchers to introduce deep learning techniques into ship surveillance research. As one of the most important tasks in ship surveillance, ship classification has important practical value for many organizations. However, most existing research [57], [58] on ship classification with deep learning is specifically designed for hyperspectral [37], [38], infrared thermal [39]–[41], or synthetic aperture radar images [42], which makes it unsuitable for real-world scenarios where only natural images of ships are available. The main reasons for the lack of research on ship classification based on natural images are: 1) The structures of most ships are visually simple, so ships can only be accurately classified by finding discriminative local details; 2) The data distribution of different categories of ships in real-world scenarios is imbalanced; 3) The acquisition of natural images of ships is limited by many factors such as scenario, hardware, and weather.
Accurately recognizing the class of a ship is among the most significant tasks in ship surveillance. Existing deep learning networks [30], [36] perform well on classification tasks because of their superior capability to extract features from images. For ship classification, discriminative local features should be emphasized, as there are only minor differences between different types of ships. To this end, some works [48]–[50] obtain the discriminative local features in the image by adding manual annotations, while others [14], [51], [52] design various weakly supervised learning mechanisms to capture the discriminative local features of objects when only image-level labels are available. Although these models can achieve human-level classification accuracy on balanced datasets, in which the numbers of samples from different classes are roughly the same, their performance drops severely when the dataset becomes imbalanced.
In this paper, we propose a new framework for real-world scenarios, named the Adaptive Selecting and Learning Network. Without additional manual bounding-box annotations, we adopt a multi-structure cooperative learning scheme to effectively learn fine-grained expert knowledge of ships under the imbalanced distribution of ship categories. Specifically, we use a navigation network to ensure that the model can locate discriminative object parts in the image. The more discriminative details of the object a local region contains, the higher the probability that the region is predicted to belong to the corresponding ground-truth class. By locating these local regions accurately, our model can learn more sophisticated expert knowledge. To accurately classify ships under the imbalanced data distribution of the real world, we design a memory network equipped with an adaptive selecting strategy and an inference network equipped with an attention mechanism. According to the learning performance of the model, the memory network adaptively determines which samples of each class of ships contain knowledge that will help improve the performance of the model, and stores these samples. We use this memory mechanism to adaptively re-balance the distribution of different classes during learning. After obtaining the hard samples that have a structure similar to the input, the inference network uses the attention mechanism to accurately capture the relationships between the new sample and the hard samples, and then improves the learning of the hard samples while learning the knowledge of the new sample.
Since there are few real-world ship datasets of natural images available for research, we propose a new Dachan Island Ship (DIS) dataset, which is imbalanced and collected entirely in a real-world scenario. The DIS dataset differs from the existing ship classification datasets in three aspects: 1) There are great differences between samples that belong to the same class due to the shooting angle and illumination conditions, while the appearances of different sub-classes are very similar except for some minor differences; 2) DIS contains a larger number of images and types of ships than the existing datasets, and these are expensive and difficult to annotate; 3) The images are all captured in real-world scenarios and the distribution of different classes is imbalanced. To the best of our knowledge, DIS is the first imbalanced fine-grained ship dataset of natural images.
Our contributions are threefold: 1. Delving into the problem of imbalanced fine-grained classification, we propose an Adaptive Selecting and Learning Network, which adaptively re-balances the distribution of different categories of data according to the learning performance and uses the features of the hard samples to optimize the feature of the new sample with an attention mechanism. 2. We propose a real-world ship benchmark dataset, DIS, to advance the frontiers of real-world ship classification research. 3. We conduct extensive experiments on the proposed DIS, and the results show that our ASL model performs favorably against the existing fine-grained classification methods.
Related Works
A. Fine-Grained Classification
Fine-grained classification aims to perform sub-classification of images belonging to the same basic class. It is more challenging than ordinary image classification tasks due to the subtle inter-class differences and large intra-class differences among those sub-classes. Previous works on fine-grained classification mainly employ complex mathematical methods to extract image features [1]–[4], which are then sent to the multi-stage models to get the label of an image.
In recent years, deep learning has achieved great success in large-scale image classification [30], [36]. Therefore, an increasing number of studies have introduced deep learning techniques to solve fine-grained classification problems [5], [8]. Although discriminative local features can be extracted, these models rely heavily on annotations of local features or attributes in images, which are expensive and difficult to obtain. To reduce the reliance on annotations of local features and attributes, Lin et al. propose bilinear pooling [9] and improve it in [10] to obtain more precise feature maps of objects. Jaderberg et al. [7] design the Spatial Transformer Network, which performs geometric transformations during training and considers local parts of objects to improve the accuracy of fine-grained classification. Fu et al. [13] design the Recurrent Attention CNN, which recursively predicts the location of an object through an attention mechanism and combines multi-scale features to predict the class of the object. Zheng et al. [14] propose MA-CNN, which enables the model to locate multiple parts of an object at the same time. Some recent works, such as [11], introduce meta-learning to facilitate fine-grained classification. Dubey et al. [12] combine a pairwise confusion loss with the cross-entropy loss to learn more precise features and preserve the generalization ability of the model.
Different from general fine-grained classification, ship classification must also cope with heavily imbalanced training samples across categories. Therefore, we design a selective learning mechanism, integrated with a fine-grained feature extractor borrowed from fine-grained classification networks, to address the problem of imbalanced learning.
B. Imbalanced Learning
Previous approaches to tackle the problem of imbalanced learning can be categorized into two groups: data resampling and cost-sensitive learning. Data resampling aims to mitigate the imbalanced distribution between classes from the data level. It rebalances the data distribution of different classes in the data preprocessing procedure by resampling the data, which is simple and efficient compared with designing complex algorithms. The typical methods of data resampling include over-sampling minority classes [18], [19], down-sampling majority classes [17], [28], [29] or the combination of over-sampling and down-sampling [15], [16]. Notwithstanding their simplicity, these sampling methods suffer from severe limitations, e.g., over-sampling minority classes will easily cause over-fitting and down-sampling majority classes will lead to the loss of discriminative information in majority classes.
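The two resampling strategies described above can be sketched in a few lines of Python. This is a minimal illustration rather than any of the cited methods: the function name `resample`, the `target` count per class, and the toy labels are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def resample(labels, target, seed=0):
    """Return sample indices so that every class appears exactly `target` times.

    Classes with more than `target` samples are down-sampled without
    replacement; classes with fewer are over-sampled with replacement.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for indices in by_class.values():
        if len(indices) >= target:
            # Down-sampling: some majority-class samples are discarded.
            chosen.extend(rng.sample(indices, target))
        else:
            # Over-sampling: minority-class samples are duplicated.
            extra = rng.choices(indices, k=target - len(indices))
            chosen.extend(indices + extra)
    return chosen


# Hypothetical toy labels: 8 majority samples and 2 minority samples.
labels = ["bulk"] * 8 + ["tug"] * 2
balanced = resample(labels, target=4)
class_counts = Counter(labels[i] for i in balanced)
```

In this toy run the minority class is filled up with exact duplicates, which illustrates the over-fitting risk, while half of the majority samples are dropped, which illustrates the information loss.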
Cost-sensitive learning deals with imbalanced learning by designing cost-sensitive models at the algorithm level. These methods impose a higher penalty on the misclassification of minority classes so that the models pay more attention to learning minority classes. However, how to design a good cost representation is still a challenging problem. A typical way is to leverage inverse class frequency or pre-defined misclassification costs, which are combined with SVM [20] or decision trees [21]. Boosting is an efficient method proposed by Ting [22]. Considering that boosting is sensitive to noise, Huang et al. [23] combine cost-sensitivity with bagging to obtain a cost-sensitive random forest algorithm. There are also some works combining data sampling or cost-sensitive learning with deep learning technologies to achieve better performance on imbalanced learning [24]–[27]. Different from these methods, we address imbalanced learning by re-weighting the importance of samples with feature selection. The importance of each sample is learned by the proposed ASAL module, so that our framework can automatically re-balance sample importance by category and by difficulty at the same time.
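As a small illustration of the inverse class frequency idea mentioned above, the sketch below computes one weight per class so that rarer classes incur a larger misclassification cost. The normalization N / (C · N_c) is one common convention and is an assumption here, as are the function name and the toy labels; it is not the weighting used by the proposed method.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each class to N / (C * N_c).

    N is the number of samples, C the number of classes, and N_c the number
    of samples in class c, so a perfectly balanced dataset gives weight 1.0.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical toy labels: 8 majority samples and 2 minority samples.
labels = ["bulk"] * 8 + ["tug"] * 2
weights = inverse_frequency_weights(labels)
# A misclassified "tug" sample now costs four times as much as a "bulk" one.
```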
C. Ship Surveillance
Current ship surveillance research mainly relies on radar images and thermal infrared images to detect and classify ships. The dataset of synthetic aperture radar (SAR) images proposed by Wang et al. [31] is a representative radar image dataset. Ships in this dataset appear against complex backgrounds, which makes ship detection very challenging. Lang et al. [32] propose an improved multi-class adaptive support vector machine combined with image features to achieve transfer learning between automatic identification system data and SAR image data. Sharifzadeh et al. [33] design a hybrid CNN-MLP model for SAR ship classification, which processes the image pixel by pixel and uses the statistical information of neighboring pixels to detect the target pixels. Wu et al. [34] propose a BDA-KELM model that selects the optimal parameters by combining convolutional neural networks, a kernel extreme learning machine, and the dragonfly algorithm in binary space for the classification of high-resolution SAR images. Huang et al. [35] design a ship detection model based on multi-scale heterogeneity, which constructs heterogeneous maps by extracting multi-scale feature maps. The above deep learning methods are mainly designed for SAR images of ships, and there are few studies on ship classification or detection using natural images, which is a barrier to practical application. To address this issue, this paper focuses on the classification of ships based on natural images and proposes a new dataset derived from real scenarios to facilitate research on real-world ship classification.
Scenario and Dataset
A. Data Acquisition
In this paper, we choose Dachan Island in Shenzhen as the sampling scenario. Dachan Island, located in the west of the Nanshan district of Shenzhen, China, is the second-largest island of Shenzhen. Geographically, Dachan Island is close to the estuary of the Pearl River, at the intersection of the National Waterway of Guangdong Province and the Guangdong-Hong Kong Waterway. Since the island is located at the intersection of shipping channels, the waters near Dachan Island have not only a large flow of shipping but also a large variety of ships, which makes ship surveillance a huge challenge for customs officers.
Considering that the quality of images taken by cameras after sunset is usually unsatisfactory due to weak illumination, we collect data from 9 a.m. to 5 p.m. every day. Images of ships in the dataset were collected by surveillance cameras mounted on hills in the central part of the island. The shooting angle of the camera can vary from 30° to 45°, and the exposure is set automatically by the camera. To ensure that more details of ships can be photographed, we only shoot the ships in the two channels nearest to Dachan Island. For each ship, we take several images and keep only the one with the best quality.
B. Dachan Island Ship Dataset
DIS contains 13 classes of ships with a total of 2,500 natural images. The distribution of samples across classes is extremely imbalanced, as shown in Table 1. Compared with the existing ship datasets, DIS has the following properties (see Table 2): 1) DIS is the first fine-grained ship dataset collected in real-world scenarios, and it contains images with different clarity and brightness. 2) DIS has a larger number of images, a higher resolution, and more fine-grained image-level labels of ships. 3) The data distribution of different classes of ships in DIS is imbalanced, which is more in line with the real-world situation.
Among all the ships, bulk cargo ships, partial container ships, full container ships, and sand carriers account for about 70% of the total number of images, while the other nine classes account for about 30%. This imbalanced distribution presents a new and significant challenge for fine-grained classification. We exclude tankers and cruise ships from our fine-grained classification task, since the number of images in these two classes is too small, and use the remaining 11 classes.
Model
As shown in Figure 2, the proposed Adaptive Selecting and Learning Network (ASL) consists of three components. The feature extractor aims to extract the features of input images. The navigation network captures the discriminative local details of an object and locates these details in the image. The feature vectors of these details and the input image will then be sent into the ASL module. Finally, a probability distribution over all the classes can be obtained. We will discuss each component in detail.
Figure 2. The architecture of the Adaptive Selecting and Learning Network. The navigation network produces the feature vectors of discriminative local details. The ASL module selects and memorizes the hard samples that are difficult to classify and adaptively rebalances the data distribution of different classes according to the learning performance. “C” denotes the concatenation operation.
A. Navigation Network
The aim of the navigation network is to ensure that the model captures the discriminative local details of the object in the image, which helps the model “see better”. Existing fine-grained methods mainly improve classification with two kinds of local features: the features of multiple different details of the object, and the multi-scale features of a single detail of the object. Our ASL module can be combined with both kinds of methods. In this paper, we use the Navigator-Teacher-Scrutinizer Network (NTS) [56] and the Recurrent Attention Convolutional Neural Network (RA-CNN) [13], respectively, as the navigation network to extract local features. Both networks are designed to find local discriminative features that enhance feature learning for fine-grained classification. Our method can also use general features extracted by ResNet alone; please refer to the experiments for details.
B. Adaptive Selecting and Learning Module
The proposed ASL module consists of a memory network and an inference network. The memory network adaptively chooses the samples that are difficult to classify and memorizes them during learning. By comparing the similarities between new samples and the hard ones in memory, the inference network chooses several hard samples from the memory network and learns from them together with the new samples, which enables the network to obtain new knowledge while improving the learning of hard samples. Different from traditional methods of imbalanced learning, the ASL module adaptively adjusts the data distribution between minority and majority classes according to the learning performance of the model. We detail the memory network and the inference network in turn.
1) Memory Network
Deep learning models with memory mechanisms are mainly applied to long sequence learning tasks in natural language processing, e.g., RNNs. The hidden state of an RNN works as a memory of the contextual information learned at the previous time step. However, the memory size of these models is fixed and quite small, which limits the diversity of the knowledge they can retain. Inspired by the memory network proposed in [53], we propose a memory network with an adaptive selecting strategy that enables the model to automatically choose and memorize hard samples during training.
2) Memory Mechanism
During the training phase, the memory network stores the hard samples and improves the learning from them along with new samples. The hard samples are stored in the form of key-value pairs, which can be formulated as:
\begin{equation*} M=\left\{\left(m_{i}^{k},m_{i}^{v}\right)\right\}_{i=1}^{N} \tag{1}\end{equation*}
where each key is the concatenation of the feature vectors of a sample and of its local regions:
\begin{equation*} m_{i}^{k}=\mathrm{concatenate}\left(fv\left(x_{i}\right),fv\left(P_{1}^{i}\right),\ldots,fv\left(P_{K}^{i}\right)\right) \tag{2}\end{equation*}
The memory network uses the Euclidean distance $d(\cdot,\cdot)$ between the new sample $x_{n}$ and each hard sample $x_{h}^{i}$ to compute a similarity weight:
\begin{equation*} w_{i}=\frac{e^{-d(x_{n},x_{h}^{i})}}{\sum\nolimits_{j=1}^{k} e^{-d(x_{n},x_{h}^{j})}} \tag{3}\end{equation*}
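The weighting in Eq. (3) is a softmax over negative Euclidean distances. The sketch below illustrates it under one assumption that is not stated explicitly in the text: the k hard samples are taken to be the k memory keys nearest to the new sample. The function names and the toy vectors are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieval_weights(query, memory_keys, k):
    """Pick the k memory keys nearest to `query` and weight them as in Eq. (3).

    Returns (indices, weights), where weights[i] is proportional to exp(-d)
    for the distance d between the query and the i-th selected key.
    """
    nearest = sorted(
        (euclidean(query, key), idx) for idx, key in enumerate(memory_keys)
    )[:k]
    exps = [math.exp(-dist) for dist, _ in nearest]
    total = sum(exps)
    return [idx for _, idx in nearest], [e / total for e in exps]


# Hypothetical toy memory with three stored keys.
memory_keys = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.0]]
indices, weights = retrieval_weights([0.0, 0.0], memory_keys, k=2)
```

The weights always sum to one, and a hard sample that lies closer to the new sample receives a larger weight.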
3) Adaptive Selecting Strategy
Whether a sample is hard or not mainly depends on the prediction made by the model. Given a sample as input, if the model can correctly predict its label, the sample is treated as a simple one; otherwise, it is a hard one. We determine whether the sample is hard according to the confidence that the ASL module assigns to the ground-truth class. If the confidence is low enough that its negative logarithm exceeds a threshold, the sample is considered hard and its feature vector is stored in the memory network:
\begin{align*} U_{i}=\begin{cases} 1 & \text{if } -\ln\left(C_{ASAL}\left(x_{i},P_{1}^{i},\ldots,P_{K}^{i}\right)\right) > D\left(t\right) \\ 0 & \text{if } -\ln\left(C_{ASAL}\left(x_{i},P_{1}^{i},\ldots,P_{K}^{i}\right)\right) \le D\left(t\right) \end{cases} \tag{4}\end{align*}
We consider four selecting functions for the threshold $D(t)$, namely linear, constant, square root, and quadratic:
\begin{align*} D_{line}\left(t\right)=&\frac{t\varphi}{T}, \tag{5}\\ D_{constant}\left(t\right)=&0.5\varphi, \tag{6}\\ D_{root}\left(t\right)=&\sqrt{\frac{t}{T}}\,\varphi, \tag{7}\\ D_{quadratic}\left(t\right)=&\left(\frac{t}{T}\right)^{2}\varphi \tag{8}\end{align*}
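The selecting rule of Eq. (4) and the four threshold schedules of Eqs. (5)–(8) can be written down directly. In the sketch below, `t` is assumed to be the current training step (or epoch), `T` the total number of steps, and `phi` a scale hyperparameter; these meanings are not defined in the surrounding text and are assumptions for illustration, as are the function names.

```python
import math


def threshold(kind, t, T, phi):
    """Threshold D(t) for the schedules in Eqs. (5)-(8)."""
    ratio = t / T
    if kind == "line":
        return ratio * phi
    if kind == "constant":
        return 0.5 * phi
    if kind == "root":
        return math.sqrt(ratio) * phi
    if kind == "quadratic":
        return ratio ** 2 * phi
    raise ValueError(f"unknown schedule: {kind}")


def is_hard(confidence, kind, t, T, phi):
    """Eq. (4): a sample is hard if -ln(confidence) exceeds D(t).

    `confidence` is the predicted probability of the ground-truth class
    and must lie in (0, 1].
    """
    return -math.log(confidence) > threshold(kind, t, T, phi)


# Halfway through training with a hypothetical scale phi = 2.0.
halfway = {kind: threshold(kind, t=5, T=10, phi=2.0)
           for kind in ("line", "constant", "root", "quadratic")}
```

For 0 < t < T, the quadratic threshold is lower than the linear one, which is in turn lower than the square-root one. Under the same confidence, the quadratic schedule therefore marks more samples as hard.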
4) Inference Network
Once obtaining the feature vector of the new sample, the memory network will choose its
The architecture of the ASL module. The weights are calculated by the distance between the query sample and its neighbors. “C” means concatenation operation.
Specifically, the multi-head dot product attention (MHDPA) module consists of a multi-head dot product attention layer and a multi-layer perceptron. We store the output of the memory network in a matrix $M_{K}$ and compute attention over it as:
\begin{align*} M_{n}\left(Q,K,V\right)=\mathrm{softmax}\left(\frac{M_{K}W^{Q}\left(M_{K}W^{K}\right)^{T}}{\sqrt{d_{k}}}\right)M_{K}W^{V} \tag{9}\end{align*}
A single attention function with keys, values, and queries cannot obtain accurate relationships between memories. By using a separate set of projection weights for each head to linearly project the memories in $M_{K}$, several attention heads are computed and combined:
\begin{align*} MultiHead=&\mathrm{concat}\left({head}_{1},\ldots,{head}_{h}\right)W, \tag{10}\\ {head}_{i}=&M_{n}\left(Q_{i},K_{i},V_{i}\right) \tag{11}\end{align*}
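To make Eqs. (9)–(11) concrete, the following pure-Python sketch applies scaled dot-product self-attention to a small memory matrix and concatenates several heads. It is only an illustration: the matrix sizes, the random weights, and all helper names are assumptions, and the multi-layer perceptron of the MHDPA module is omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to every row of a matrix."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(mem, w_q, w_k, w_v):
    """Eq. (9): softmax(M W^Q (M W^K)^T / sqrt(d_k)) M W^V."""
    q, k, v = matmul(mem, w_q), matmul(mem, w_k), matmul(mem, w_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    return matmul(softmax_rows(scores), v)


def multi_head(mem, heads, w_out):
    """Eqs. (10)-(11): concatenate the heads row-wise, then project by W."""
    outputs = [attention_head(mem, *h) for h in heads]
    concat = [sum((o[r] for o in outputs), []) for r in range(len(mem))]
    return matmul(concat, w_out)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 5 memory slots, feature width 8, 2 heads of width 4.
rng = random.Random(0)
n_mem, d_model, d_k, n_heads = 5, 8, 4, 2
memory = random_matrix(n_mem, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3))
         for _ in range(n_heads)]
w_out = random_matrix(n_heads * d_k, d_model, rng)
result = multi_head(memory, heads, w_out)  # 5 rows, each of width 8
```

Each head has its own query, key, and value projections, so different heads can attend to different relationships among the stored memories before the output projection mixes them.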
We use a weight network to optimize the similarity weights between the new sample and the hard samples during inference:
\begin{equation*} w_{l}^{i}=\frac{e^{g_{l}\left({MH}_{l}^{i},w_{l-1}^{i},\theta_{l}\right)}}{\sum\nolimits_{j=1}^{k} e^{g_{l}\left({MH}_{l}^{j},w_{l-1}^{j},\theta_{l}\right)}} \tag{12}\end{equation*}
C. Loss Function
The ASL Network is optimized with two supervision signals: the classification loss of the ASL module and the loss of the navigation network. Our multi-task loss function can be formulated as follows:
\begin{equation*} L=L_{ASL}+\alpha L_{N} \tag{13}\end{equation*}
Results
A. Dataset
To demonstrate the effectiveness of our model, we conduct extensive experiments on imbalanced fine-grained classification using the proposed DIS. We use 11 categories of ships from the dataset with a total of 2,477 images, of which 1,500 are used as the training set, 477 as the validation set, and 500 as the test set. All of the images that we use have only image-level annotations, i.e., the class labels of ships.
B. Definition of Minority and Majority Classes
We define the minority classes of a dataset as:
\begin{equation*} \sum\limits_{i\in\mathrm{Class}_{\min}} N_{i} \le \rho N \tag{14}\end{equation*}
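One way to read Eq. (14) is sketched below. It assumes that N_i is the number of samples in class i, N is the total number of samples, ρ is a ratio between 0 and 1, and that the minority set is built by adding classes from smallest to largest while the running total stays within ρN. These readings, the function name, and the toy counts are assumptions; the text does not spell out the selection procedure.

```python
def minority_classes(counts, rho):
    """Return the smallest classes whose combined size is at most rho * N."""
    total = sum(counts.values())
    selected, running = [], 0
    # Visit classes from the smallest to the largest sample count.
    for label, n in sorted(counts.items(), key=lambda item: item[1]):
        if running + n > rho * total:
            break
        selected.append(label)
        running += n
    return selected


# Hypothetical class counts, not taken from DIS.
counts = {"a": 50, "b": 30, "c": 12, "d": 8}
minority = minority_classes(counts, rho=0.3)
```

With these toy counts, classes "d" and "c" together hold 20 of the 100 samples and form the minority set, while adding "b" would exceed the 30-sample budget.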
C. Implementation Details
We employ ResNet-50 as our feature extractor. The navigation network extracts proposals from the output of the last residual blocks of conv3, conv4, and conv5 of ResNet-50 and selects the four local features with the highest confidence, which are fed into the ASL module together with the features of the input image. We train the model using stochastic gradient descent with a momentum of 0.9 on two GTX 1080 Ti GPUs. The initial learning rate is set to 0.001 and is multiplied by 0.1 after 50 epochs. The batch size is set to 16. All the input images used in our experiments are resized to
D. Comparison with Other Models
We conduct experiments to demonstrate the superiority of our framework on the proposed DIS dataset. The classification accuracy of our model on DIS is shown in Table 3. Existing methods for fine-grained classification, including two attention-based methods, RA-CNN and MA-CNN, are presented for comparison. H6-L8 denotes a model consisting of 8 multi-head attention modules with 6 heads in each module. The experimental results show that classification accuracy improves significantly when the ASL module is combined with either of the two fine-grained methods. In particular, ASL + NTS outperforms the existing methods and achieves 100% classification accuracy in six of the seven minority classes. ASL + NTS performs better than ASL + RA-CNN, which indicates that for ship classification, it is more beneficial to extract features of multiple different details than to extract multi-scale features of a single detail.
Table 4 summarizes the experimental results of imbalanced learning methods on the DIS dataset. All these methods use ResNet-50 as the feature extractor, and our ASL module is combined with ResNet-50 only. SMOTE and Down-sampling balance the distribution of classes by applying different sampling strategies to majority and minority classes. CB-FL and LDAM-DRW optimize the learning of different classes by designing the loss function. Compared with these methods, ASL + ResNet performs better. This result demonstrates that the ASL module can improve classification accuracy by adaptively re-balancing the distribution of classes according to the performance of the model and by jointly learning from hard samples and new samples.
E. Ablation Experiments
Here we analyze the efficacy of the main components of the ASL framework. We implement different variants of our framework and compare their performance on ship classification. The results are shown in Table 5. When we use only ResNet-50 to extract global features for classification, the overall accuracy is only 84.4%. Adding the navigation network alone or the ASAL module alone improves the accuracy of our model on DIS by 8% and 3.8%, respectively. When the two modules are adopted together, the average accuracy of the model is 12.2% higher than that of ResNet-50, which is a significant improvement on our dataset. The experimental results show that our ASL module improves the accuracy of the model more significantly when richer features of the targets are available.
To analyze how the update speed of the threshold used to determine hard samples affects the performance of the model during training, we evaluate the ASL framework (H = 6, L = 8) with different selecting functions on the DIS dataset. Figure 4 shows the trend of the number of hard samples stored in the memory network for each selecting function. Apart from the constant function, the square root function has the fastest update speed before the model converges, which leads to worse performance than the other selecting functions. The quadratic function has the slowest update speed before the model converges, and the corresponding model achieves the highest accuracy. A slower threshold update and a lower threshold allow the model to observe hard examples more frequently during training and therefore boost the performance of the model.
Figure 4. The number of samples stored in the memory network. (a), (b), (c) and (d) correspond to the ASL Networks (H6-L8) with the constant, square root, linear, and quadratic selecting functions, respectively.
We implement different variants of the ASAL module and analyze their performance on DIS. The classification accuracy of these variants is shown in Table 6. When the value of H is fixed (H = 2 or H = 4) and L increases from 2 to 5, the accuracy of the model on both minority and majority classes improves. However, when L increases from 5 to 8, the performance of the model declines slightly. The results also show that when we fix the value of L and change the value of H, within a certain range, the larger the value of H, the higher the accuracy. The model achieves the highest accuracy when H = 6 and L = 8. Compared with the ASAL model (H4-L5), the ASAL model (H6-L8) is better at classifying majority classes.
To further verify the effectiveness of the ASL framework, we conduct experiments on the public datasets CUB, CIFAR-10, and CIFAR-100. The results are summarized in Table 7 (CUB) and Table 8 (CIFAR-10 and CIFAR-100). The classification accuracy achieved by the ASL model on CUB is 0.9% higher than that of the NTS model, which is the best of the existing fine-grained image classification models compared. We conduct the experiments on CIFAR-10 and CIFAR-100, which are widely used benchmarks for imbalanced learning, following the settings in [58]. The top-1 validation errors of various methods on imbalanced CIFAR-10 and CIFAR-100 are reported in Table 8. Our ASAL model performs better than the existing models on these two datasets.
Conclusion
In this paper, we propose an Adaptive Selecting and Learning Network (ASL) for adaptive learning from imbalanced ship data, which aims to fill the gap and spark progress in real-world ship classification based on natural images. Moreover, we present a new Dachan Island Ship (DIS) dataset with a significantly imbalanced distribution between classes. The ASL Network not only locates the discriminative local details of ships to classify them more accurately, but also adaptively re-balances the data distribution during training and enhances the learning of hard samples while learning new knowledge, which yields better learning performance on imbalanced data. Comprehensive experimental results on the proposed DIS demonstrate the superiority of ASL over existing fine-grained classification methods. In the future, we shall extend the proposed method to the open set problem to better recognize objects that do not exist in the training data, and integrate the proposed network with a detection network so that each ship can be localized when two or more ships appear in a single image.
ACKNOWLEDGMENT
The authors are grateful for the technical support of customs officers of Dachan Island. (Yujie Xu and Minghao Yan contributed equally to this work.)