Introduction
Remote sensing (RS) technology has been widely used in many practical applications, such as RS scene classification [1]–[4], RS object detection [5], [6], RS semantic segmentation [7], [8], and RS change detection [9]. Among these applications, RS scene classification, which aims to classify RS scene images into different categories, is a particularly active research topic.
Recent years have witnessed significant progress in various computer vision tasks using deep convolutional neural networks (DCNNs) [10]–[16]. In image classification tasks such as scene classification [17], [18], object classification [19], [20], and medical image classification [21], DCNNs have shown strong performance by extracting multilevel features with hierarchical architectures [10], [11], [22]–[27]. Basically, a DCNN efficiently encodes each image into a classification probability vector, which is derived from a global feature. Furthermore, some well-designed DCNNs [28]–[31] combine global features with local features for better representation. However, directly applying DCNNs to the RS scene classification task raises two main problems. The first is the large intraclass variance caused by the resolution variance of RS images (e.g., the image resolution of the AID dataset ranges from 0.5 to 8 m [1]), illustrated by the cases in Fig. 1(a). The second is that RS images often contain confusing information because they cover large geographic areas. As shown in Fig. 1(b), confusing information reduces the interclass distance: an inshore "resort" looks similar to a "beach," and a "railwaystation" built around residential areas closely resembles "denseresidential." The motivation of this article is to solve the large intraclass variance problem and to overcome the confusion caused by complex geographic elements.
Intuitive cases to explain the two main problems in RS scene classification. (a) Intraclass variance in RS images. (b) Redundant and confusing information in RS images.
To address these two problems, we propose two intuitive assumptions as the theoretical basis of our method. First, besides global features, fine-grained features are also helpful for RS scene classification. For example, we can easily recognize an "airport" if we see planes in an RS image. Second, RS images contain latent semantic structural information, which can be explored without detailed annotations such as bounding boxes or pixel-level labels. As shown in the third row of Fig. 1(b), to distinguish "church" from "storagetanks," we cannot focus only on the central white tower; we need structural information such as "tower + surroundings" to make the judgment.
Based on the above assumptions, we propose a novel multigranularity multilevel feature ensemble network (MGML-FENet) to tackle the RS scene classification task. Specifically, we design a multigranularity multilevel feature fusion branch (MGML-FFB) to explore fine-grained features by forcing the network to focus on a cluster of local feature patches at each level of the network. This branch mainly extracts aggregated features containing structural information. Furthermore, we propose a multigranularity multilevel feature ensemble module (MGML-FEM) to fuse high-level multigranularity features that share similar receptive fields but different resolutions. The overview of MGML-FENet is shown in Fig. 2.
Overview of MGML-FENet architecture. “MGML-FFB” denotes the multigranularity multilevel feature fusion branch. “MGML-FEM” denotes the multigranularity multilevel feature ensemble module, and “fc” denotes fully connected layers.
In MGML-FENet, MGML-FFB explores multigranularity multilevel features and utilizes fine-grained features to reduce the adverse effects of large intraclass variance. Specifically, we use a channel-separate feature generator (CS-FG) to reconstruct feature maps: the original feature map is first cropped into several patches, each feature patch keeps a small group of channels split from the original feature map, and all feature patches are then concatenated to form a new feature map. MGML-FEM utilizes high-level features with structural information to overcome confusion. In this module, we propose a full-channel feature generator (FC-FG) to generate predictions. Its cropping operation on the original feature map is the same as in CS-FG; then, through global average pooling and concatenation, a new feature vector is created and fed into the classifier at the end of the network. Finally, all four predictions are fused to vote for the final prediction. In this article, "vote" does not mean majority voting; it is closer to soft voting on the posterior probability of each category. Specifically, MGML-FENet generates four predictions, each a global feature vector indicating the posterior probability of each category. As Fig. 2 shows, we take the sum of the four predictions to obtain a more convincing final prediction.
Previous methods [32], [33] typically crop and zoom in on images to extract multigranularity features and then apply transformations to learn transformation-invariant features. "Crop and zoom in" is effective for extracting fine-grained features with high-quality detailed information. However, it overlooks important information from other local patches and does not fully use fine-grained features at different levels of the network. Compared to previous multigranularity feature extraction-based methods, MGML-FENets extract multigranularity features from different local feature patches to obtain structural information. MGML-FENets also extract and fuse multigranularity features at different levels of the network, making full use of more abundant feature information.
To verify the effectiveness of the proposed network, we conduct extensive experiments using VGG16 [22], ResNet34 [11], and DenseNet121 [34] as baseline models on multiple benchmark datasets (AID [1], NWPU-RESISC45 [3], UC-Merced [2], and VGoogle [35]). Compared to previous methods, MGML-FENets perform better and achieve new state-of-the-art (SOTA) results.
Our main contributions are listed as follows.
We propose MGML-FFB to extract and fuse useful fine-grained features from different local patches at different levels of the network. This module efficiently alleviates the large intraclass variance problem.
We propose MGML-FEM to generate global features containing high-level structural information. This module addresses the confusing information problem.
We derive two novel feature generators, CS-FG and FC-FG, to extract and fuse multigranularity multilevel features.
We integrate all predictions using a soft voting strategy and construct an end-to-end ensemble network that achieves better classification results than previous networks on different benchmark datasets.
Related Works
A. Remote Sensing Scene Classification
In recent years, researchers have introduced many notable methods for RS scene classification. These methods can generally be divided into two types: traditional handcrafted feature-based methods and DCNN-based methods.
Handcrafted feature-based methods usually rely on notable feature descriptors. Yang and Newsam [2], Zhao et al. [36], Zhu et al. [37], and Bahmanyar et al. [38] investigated bag-of-visual-words (BoVW) approaches for the high-resolution land-use image classification task. As two classical feature descriptors, the scale-invariant feature transform (SIFT) [39] and the histogram of oriented gradients (HOG) are also widely applied in the RS scene classification field [40]–[42].
Compared to traditional handcrafted feature-based methods, DCNNs have better feature representation ability and have recently achieved great success in the RS scene classification task. Marmanis et al. [43] and Nogueira et al. [44] applied DCNNs to extract features of RS images and further explored their generalization potential to obtain better performance. In addition, some methods integrate attention mechanisms into DCNNs to capture more subordinate-level features with only global-level annotations as guidance [45], [46]. When DCNNs are applied to very high-resolution (VHR) RS images, feature extraction is often partial and incomplete. To address this issue, the method in [47] applies a "prediction module" with atrous convolution blocks to extract more abundant global features and a "residual refinement module" to correct the residual between the predicted and real results. Shao et al. [48] adopted atrous convolution blocks and a PSP pooling module to integrate multiscale features with large receptive fields. Moreover, to tackle the interclass similarity and large intraclass variance issues, second-order information has been efficiently applied to RS scene classification [32], [49] with excellent performance. More recently, Li et al. [50] proposed a notable architecture, KFBNet, which extracts more compact global features under the guidance of key local regions and is currently the SOTA method. In this article, we mainly compare our results with [32], [49], and [50].
B. Multigranularity Feature Extraction Methods
In some classification tasks [51], [52], large interclass similarity results in a rapid performance decline. To address this problem, many fine-grained feature extraction methods have been proposed [53], [54]. However, in most cases, only global annotations are provided, which makes finding fine-grained features difficult owing to the lack of semantic-level annotations. Therefore, multigranularity feature extraction methods are applied to enhance the region-based feature representation ability of DCNNs [32], [55]–[58]. Inspired by these methods, we adopt multigranularity feature extraction in our method for RS images.
Ensemble learning-based methods offer another perspective for extracting multigranularity features by designing multiple subnetworks with different structures. Ge et al. [59] directly used several CNNs to produce different classification results, which are then fused via occupation probability. Ge et al. [60] introduced a learning system that learns specific features from specific deep sub-CNNs. Ye et al. [61] adopted an ensemble extreme learning machine (EELM) classifier for RS scene classification with superior generalization and low computational cost. Learning from these methods, we adopt an ensemble learning strategy in our architecture to integrate multigranularity multilevel features.
C. Feature Fusion Methods in RS Scene Classification
To reduce the harm from the resolution variance of images, many researchers employ feature fusion methods and obtain better performance. Liu et al. [62] proposed a multiscale CNN (MCNN) framework containing a fixed-scale net and a varied-scale net to handle the scale variation of objects in RS imagery. Zeng et al. [31] designed a two-branch architecture to integrate global-context and local-object features. Li et al. [63] presented a method to fuse multilayer features from pretrained CNN models for RS scene classification. Shao and Cai [64] and Shao et al. [65] explored the capability of fusion strategies, innovatively using DCNNs to fuse features from multisource images. Similar to our method, Shao et al. [66] proposed MF-CNN to extract and integrate multiscale features at different levels of the network. MF-CNN shows the power of fusing low-level spatial information and high-level semantic information in DCNNs. In this article, we also focus on feature fusion to handle features of different granularities, localizations, and scales.
Proposed Method
In the RS scene classification task, extracting only the global feature of RS images works well in most cases. Besides global features, we use multigranularity features at different levels of the network to further improve performance on hard cases. Therefore, we propose MGML-FENet to tackle the RS scene image classification task. As shown in Fig. 3, input images are first fed into "conv pool" ("Conv1" and "Pool1" in Table I). The output feature map then passes through the four "conv layers" ("Layer1–4" in Table I), and the main branch finally generates its classification probability vector. At each level of the main branch, the feature map is reconstructed by CS-FG and fused with the preceding feature map in MGML-FFB, which yields another classification probability vector. The "conv layers" in MGML-FFB and the main branch use the same structure but do not share parameters, which introduces additional parameters and computation. The CS-FG module of MGML-FFB extracts local feature patches and constructs a new feature map; compared to the original feature map, the new one has the same number of channels but a smaller spatial scale, which limits the computation increase. The output feature maps of the last two main-branch layers serve as inputs to MGML-FEM, whose two output classification probability vectors come from different levels of the network. Unlike MGML-FFB, MGML-FEM introduces few extra parameters and little extra computation.
Structure of MGML-FENet. MGML-FENet consists of three parts: the main branch (blue), the MGML feature fusion branch (orange), and the MGML feature ensemble module (green). The main branch is an ImageNet-pretrained baseline structure (ResNet34/VGG16/DenseNet121). MGML-FFB consists of CS-FG and "conv layers;" MGML-FEM consists of FC-FG. Each "conv layer" is a basic convolutional block. The detailed structure is shown in Table I.
During training, each branch is trained using a cross-entropy loss with a different weight. During validation, the classification probability vectors of all branches are fused to vote for the final classification result.
A. Main Branch
In RS images, the global feature carries important high-level information; to extract it, we employ the main branch of MGML-FENet. As shown in Fig. 3 and Table I, the main branch has the same structure as the baseline models (VGG16, ResNet34, and DenseNet121). We denote "conv1 pool1" and the four "conv layers" as $f_i$ ($i = 0, \ldots, 4$), their output feature maps as $\boldsymbol{F_i}$, and the input image as $\boldsymbol{X}$:
\begin{equation*} \boldsymbol {F_{i}} = f_{i}\left ({ \boldsymbol {F}_{ \boldsymbol {i}-\mathbf {1}}}\right ), \quad \boldsymbol {F}_{-\mathbf {1}} = \textbf {X} \tag{1}\end{equation*}
In addition, we denote the fully connected layer as $f_{mb}$ and the prediction of the main branch as $\boldsymbol{P}_{\mathbf{mb}}$:
\begin{equation*} {\boldsymbol{P}_{\mathbf {mb}}} = f_{mb}\left ({ \boldsymbol {F}_{\mathbf {4}}}\right ). \tag{2}\end{equation*}
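As a concrete illustration, here is a minimal PyTorch sketch of the main branch in (1) and (2), assuming a ResNet34 backbone as in Table I; the paper provides no reference code, so all class and attribute names here are hypothetical. The intermediate feature maps $\boldsymbol{F_0}$–$\boldsymbol{F_4}$ are returned so that MGML-FFB and MGML-FEM can reuse them.

```python
import torch.nn as nn
import torchvision.models as models

class MainBranch(nn.Module):
    """Hypothetical sketch: ImageNet-pretrained ResNet34 as the main branch."""
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = models.resnet34(pretrained=True)
        # f_0: "conv1 pool1" ("Conv1" and "Pool1" in Table I)
        self.f0 = nn.Sequential(backbone.conv1, backbone.bn1,
                                backbone.relu, backbone.maxpool)
        # f_1..f_4: the four "conv layers" ("Layer1-4" in Table I)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.f_mb = nn.Linear(512, num_classes)   # fully connected classifier

    def forward(self, x):
        feats = []                    # F_0 .. F_4, reused by the other branches
        f = self.f0(x)                # F_0 = f_0(X)
        feats.append(f)
        for stage in self.stages:     # F_i = f_i(F_{i-1})
            f = stage(f)
            feats.append(f)
        p_mb = self.f_mb(self.avgpool(f).flatten(1))  # P_mb = f_mb(F_4), eq. (2)
        return p_mb, feats
```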
B. MGML Feature Fusion Branch
1) Overview of MGML-FFB:
To alleviate the large intraclass variance problem, we design MGML-FFB to utilize fine-grained features at different levels of the network. In MGML-FFB, we first propose a novel module, CS-FG, to extract features from local patches. This module forces the network to pay more attention to the multigranularity features of local patches at different levels of the network. The structure of MGML-FFB is shown in Fig. 3. A feature map output from a specific "conv layer" of the main branch is first fed into CS-FG to generate a channel-separate feature map. A "conv layer" of MGML-FFB then processes the channel-separate feature map, and the output feature map is fused with the channel-separate feature map of the next stage.
Denoting "CS-FG" and the "conv layer" of MGML-FFB at each level as $h_i$ and $g_i$, respectively, and the fused feature map as $\boldsymbol{G_i}$, we have
\begin{align*} \boldsymbol {G}_{ \boldsymbol {i}+\mathbf {1}}=&h_{i+1}\left ({ \boldsymbol {F}_{ \boldsymbol {i}+\mathbf {1}}}\right ) + g_{i}\left ({ \boldsymbol {G_{i}}}\right ), \quad i=0, 1, 2 \tag{3}\\ \boldsymbol {G}_{\mathbf {0}}=&h_{0}\left ({ \boldsymbol {F}_{\mathbf {0}}}\right ) = h_{0}\left ({f_{0}\left ({ \boldsymbol {X}}\right )}\right ), \quad \boldsymbol {G}_{\mathbf {4}} = g_{3}\left ({ \boldsymbol {G}_{\mathbf {3}}}\right ). \tag{4}\end{align*}
The final prediction of MGML-FFB is calculated through another fully connected layer, as shown in (5), where the "fc" layer and the prediction are denoted as $f_{ffb}$ and $\boldsymbol{P}_{\mathbf{ffb}}$, respectively
\begin{equation*} {\boldsymbol{P}_{\mathbf {ffb}}} = f_{ffb}\left ({ \boldsymbol {G}_{\mathbf {4}}}\right ). \tag{5}\end{equation*}
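The recursion in (3)–(5) can be sketched as follows, assuming the per-level CS-FG modules ($h_i$) and branch "conv layers" ($g_i$) are supplied and produce spatially compatible outputs; this is a hypothetical module, not the authors' code.

```python
import torch.nn as nn

class MGMLFFB(nn.Module):
    """Hypothetical sketch of the feature fusion branch, eqs. (3)-(5)."""
    def __init__(self, cs_fg: nn.ModuleList, conv_layers: nn.ModuleList,
                 feat_dim: int, num_classes: int):
        super().__init__()
        self.cs_fg = cs_fg            # h_0..h_3: CS-FG at each level
        self.convs = conv_layers      # g_0..g_3: branch "conv layers"
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.f_ffb = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):         # feats = [F_0, F_1, F_2, F_3]
        g = self.cs_fg[0](feats[0])   # G_0 = h_0(F_0)
        for i in range(3):            # G_{i+1} = h_{i+1}(F_{i+1}) + g_i(G_i)
            g = self.cs_fg[i + 1](feats[i + 1]) + self.convs[i](g)
        g = self.convs[3](g)          # G_4 = g_3(G_3)
        return self.f_ffb(self.pool(g).flatten(1))  # P_ffb = f_ffb(G_4)
```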
2) Channel-Separate Feature Generator (CS-FG):
To utilize fine-grained features and explore the structural information of multigranularity features, we design CS-FG in MGML-FFB. At each level of MGML-FFB, CS-FG reconstructs the original feature map by extracting several local feature patches and combining them. Compared to feature maps in the main branch, feature maps in MGML-FFB focus more on local features rather than global features. Moreover, CS-FG increases the diversity of feature representation, which helps considerably in representing RS images. CS-FG is the core module of MGML-FFB; its structure is shown in Fig. 4. CS-FG consists of a region proposal module (RPM) and a channel-separate extractor (CS-E).
Structure of the CS-FG and FC-FG modules. The two modules use the same region proposal module (RPM) to generate several feature patches. After RPM, CS-FG and FC-FG use the channel-separate extractor (CS-E) and the full-channel extractor (FC-E), respectively, to reconstruct the feature patches. Both modules take the output feature maps of the main branch as input.
RPM is used to crop the original feature map and generate feature patches. In this article, we mainly introduce two approaches: seven-crop and nine-crop (sliding windows). As shown in Fig. 4, the seven-crop approach extracts seven fixed-position patches (left-top, left-bottom, right-top, right-bottom, center, band in the middle rows, and band in the middle columns) from the feature map, while the nine-crop approach extracts nine fixed-position patches using a sliding-window strategy. In addition, the nine-crop approach can be extended to larger sliding-window grids.
Algorithm 1 Seven-Crop and Nine-Crop Region Proposal Algorithm
Input: A feature map $\boldsymbol{F_i}$ of spatial size $H \times W$
Output: An anchor list $A_i$
if RPM type is 7-crop then
    append the seven fixed-position anchors (left-top, left-bottom, right-top, right-bottom, center, middle-row band, and middle-column band) to $A_i$
end if
if RPM type is 9-crop then
    for each sliding-window row position do
        for each sliding-window column position do
            append the corresponding window anchor to $A_i$
        end for
    end for
end if
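The anchor generation can be sketched in Python as below. The paper does not state the exact window sizes, so the half-size windows and quarter-size strides here are assumptions consistent with Fig. 4; anchors are (row-start, row-end, column-start, column-end) tuples, and the function name is hypothetical.

```python
def region_proposal(h: int, w: int, mode: str = "7-crop"):
    """Hypothetical RPM sketch: return crop anchors (y0, y1, x0, x1) on an h x w map."""
    hh, hw = h // 2, w // 2            # assumed window size: half of the map
    qh, qw = h // 4, w // 4            # assumed stride / band offset: a quarter
    if mode == "7-crop":
        return [
            (0, hh, 0, hw),                 # left-top
            (hh, h, 0, hw),                 # left-bottom
            (0, hh, hw, w),                 # right-top
            (hh, h, hw, w),                 # right-bottom
            (qh, qh + hh, qw, qw + hw),     # center
            (qh, qh + hh, 0, w),            # band in the middle rows
            (0, h, qw, qw + hw),            # band in the middle columns
        ]
    if mode == "9-crop":                    # 3 x 3 sliding windows
        return [(i * qh, i * qh + hh, j * qw, j * qw + hw)
                for i in range(3) for j in range(3)]
    raise ValueError(f"unknown RPM type: {mode}")
```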
CS-E extracts feature patches from the original feature map using the anchors generated by RPM and assigns each patch a separate group of channels; the detailed procedure is given in Algorithm 2.
In summary, CS-FG consists of RPM and CS-E. In (3), CS-FG is denoted as $h_i$; denoting RPM and CS-E as $h_i^0$ and $h_i^1$, respectively, and the anchor list as $A_i$, we have
\begin{align*} A_{i}=&h_{i}^{0}\left ({ \boldsymbol {F_{i}}}\right ) \tag{6}\\[-2pt] \boldsymbol {H_{i}}=&h_{i}\left ({ \boldsymbol {F_{i}}}\right ) = h_{i}^{1}\left ({ \boldsymbol {F_{i}}; A_{i}}\right ) = h_{i}^{1}\left ({ \boldsymbol {F_{i}}; h_{i}^{0}\left ({ \boldsymbol {F_{i}}}\right )}\right ). \tag{7}\end{align*}
C. MGML Feature Ensemble Module
1) Overview of MGML-FEM:
To overcome the confusing information problem, we propose the MGML feature ensemble module. This module utilizes high-level features with structural information, making the whole network more robust. Moreover, it provides diverse predictions, based on ensemble learning theory, to vote for the final classification result. To generate more convincing predictions and train the network in a reasonable manner, we apply MGML-FEM only at deeper levels of the network because features in shallow layers mainly contain low-level basic information. In MGML-FEM, we propose FC-FG to extract global information from each local patch of high-level features and then fuse them to construct a feature vector containing structural information. Fig. 3 shows the structure of MGML-FEM.
Algorithm 2 Channel-Separate Extractor Algorithm
Input: A feature map $\boldsymbol{F_i}$ with $C$ channels; an anchor list $A_i$ with $n$ anchors
Output: A reconstructed feature map $\boldsymbol{H_i}$
Separate the channels of the input feature map into $n$ groups
for each anchor in $A_i$ do
    if the anchor is not the last one then
        extract the feature patch of the next channel group at the anchor position
    else
        extract the feature patch of the remaining channels at the anchor position
    end if
    Downsample the feature patch using adaptive pooling (the output size is half of the input size)
end for
Concatenate the feature patches along the channel dimension to obtain $\boldsymbol{H_i}$
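Below is a runnable sketch of CS-E under two explicit assumptions of ours: the pooling is adaptive *average* pooling (the algorithm only says "adaptive pooling"), and any channel remainder when $C$ is not divisible by the number of anchors goes to the last group. Neither choice is confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def channel_separate_extractor(feat: torch.Tensor, anchors) -> torch.Tensor:
    """Hypothetical CS-E sketch: one channel group per anchor, pooled to half size."""
    b, c, h, w = feat.shape
    n = len(anchors)
    base, rem = divmod(c, n)            # channel group size; remainder assumed last
    out, start = [], 0
    for k, (y0, y1, x0, x1) in enumerate(anchors):
        size = base + (rem if k == n - 1 else 0)
        patch = feat[:, start:start + size, y0:y1, x0:x1]
        start += size
        # adaptive pooling: every patch is resized to half of the ORIGINAL map
        out.append(F.adaptive_avg_pool2d(patch, (h // 2, w // 2)))
    # same channel count as the input, half the spatial scale
    return torch.cat(out, dim=1)
```

Composing this extractor with the RPM gives CS-FG as in (6) and (7), e.g., `channel_separate_extractor(feat, region_proposal(h, w))`.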
Mathematically, we denote the operation of MGML-FEM at level $i$ as $l_i$, its output feature vector as $\boldsymbol{v_i}$, and the fully connected classifiers of the two ensemble modules as $f_{fem3}$ and $f_{fem4}$:
\begin{align*} \boldsymbol {v_{i}}=&l_{i}\left ({ \boldsymbol {F_{i}}}\right ),\quad i = 3, 4 \tag{8}\\ {\boldsymbol{P}_{\mathbf {fem3}}}=&f_{fem3}\left ({ \boldsymbol {v}_{\mathbf {3}}}\right ), \quad {\boldsymbol{P}_{\mathbf {fem4}}} = f_{fem4}\left ({ \boldsymbol {v}_{\mathbf {4}}}\right ) \tag{9}\end{align*}
2) Full-Channel Feature Generator (FC-FG):
FC-FG is the main part of MGML-FEM. This module extracts high-level features that contribute to the final prediction. As shown in Fig. 4, FC-FG is formed by RPM and the full-channel extractor (FC-E). The RPM in FC-FG is the same as the one in CS-FG. FC-E keeps full-channel information for each feature patch, instead of using the channel-separate strategy, because high-level features need sufficient channel-wise representation. In addition, FC-E directly uses global average pooling to generate feature vectors because neurons at every pixel of a high-level feature map have large receptive fields and contain decoupled information. Algorithm 3 describes the FC-E procedure.
Algorithm 3 Full-Channel Extractor Algorithm
Input: A feature map $\boldsymbol{F_i}$; an anchor list $A_i$
Output: A feature vector $\boldsymbol{v_i}$
for each anchor in $A_i$ do
    Extract the full-channel feature patch at the anchor position
    Downsample the feature patch using global average pooling
end for
Concatenate the pooled feature patches to obtain $\boldsymbol{v_i}$
To express FC-FG mathematically, we denote FC-E as $l_i'$; with the anchors $A_i$ produced by RPM as in (6), we have
\begin{equation*} \boldsymbol {v_{i}} = l_{i}\left ({ \boldsymbol {F_{i}}}\right ) = l_{i}'\left ({ \boldsymbol {F_{i}}; A_{i}}\right ) = l_{i}'\left ({ \boldsymbol {F_{i}}; h_{i}^{0}\left ({ \boldsymbol {F_{i}}}\right )}\right ). \tag{10}\end{equation*}
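FC-E admits an even shorter sketch; as above, the function name is hypothetical and the anchors come from the same RPM.

```python
import torch
import torch.nn.functional as F

def full_channel_extractor(feat: torch.Tensor, anchors) -> torch.Tensor:
    """Hypothetical FC-E sketch: full-channel patches reduced by global average pooling."""
    pooled = []
    for (y0, y1, x0, x1) in anchors:
        patch = feat[:, :, y0:y1, x0:x1]                           # keep all channels
        pooled.append(F.adaptive_avg_pool2d(patch, 1).flatten(1))  # GAP -> (B, C)
    return torch.cat(pooled, dim=1)     # v_i: a (B, n_anchors * C) feature vector
```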
D. Optimizing MGML-FENet
MGML-FENet applies a conventional cross-entropy loss to every branch during training. To make the network converge well, we allocate each loss a reasonable weight. As shown in Fig. 3, the whole objective function consists of four cross-entropy losses. We optimize MGML-FENet by minimizing
\begin{align*}&\hspace {-.5pc} L_{obj}\left ({ \boldsymbol {X}| \boldsymbol {Y}}\right ) = \lambda _{1} * L_{cn}\left ({ {\boldsymbol{P}_{\mathbf {mb}}}| \boldsymbol {Y}}\right ) + \lambda _{2} * L_{cn}\left ({ {\boldsymbol{P}_{\mathbf {ffb}}}| \boldsymbol {Y}}\right ) \\&+ \, \lambda _{3} * L_{cn}\left ({ {\boldsymbol{P}_{\mathbf {fem3}}}| \boldsymbol {Y}}\right ) + \lambda _{4} * L_{cn}\left ({ {\boldsymbol{P}_{\mathbf {fem4}}}| \boldsymbol {Y}}\right ) \tag{11}\end{align*}
where $L_{cn}$ is the cross-entropy loss, $\boldsymbol{Y}$ denotes the ground-truth labels, and $\lambda_{1}$–$\lambda_{4}$ are the loss weights.
During validation, MGML-FENet employs an ensemble learning method that integrates all predictions to vote for the final result. The predictions contain diverse information, including global information, multigranularity multilevel information, and high-level structural information. Equation (12) gives the final prediction $\boldsymbol{P}$:
\begin{equation*} \boldsymbol {P} = {\boldsymbol{P}_{\mathbf {mb}}} + {\boldsymbol{P}_{\mathbf {ffb}}} + {\boldsymbol{P}_{\mathbf {fem3}}} + {\boldsymbol{P}_{\mathbf {fem4}}}. \tag{12}\end{equation*}
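The training objective (11) and the soft vote (12) translate directly into code. A sketch follows; note that only the main-branch weight $\lambda_1 = 1.0$ is fixed by the paper (the other values below are placeholders), and since the paper does not specify whether (12) sums logits or softmax probabilities, we simply sum the raw branch outputs.

```python
import torch
import torch.nn.functional as F

def objective(p_mb, p_ffb, p_fem3, p_fem4, y,
              lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Sketch of eq. (11); only lambda_1 = 1.0 is fixed by the paper,
    the remaining weights here are placeholders."""
    preds = (p_mb, p_ffb, p_fem3, p_fem4)
    return sum(l * F.cross_entropy(p, y) for l, p in zip(lambdas, preds))

@torch.no_grad()
def soft_vote(p_mb, p_ffb, p_fem3, p_fem4):
    """Eq. (12): sum the four prediction vectors, then take the arg max."""
    return (p_mb + p_ffb + p_fem3 + p_fem4).argmax(dim=1)
```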
Experiments and Analysis
A. Datasets
In this article, we mainly evaluate our method on four benchmark datasets: UC-Merced [2], AID [1], NWPU-RESISC45 [3], and VGoogle [35]. The UC-Merced dataset contains 21 scene categories and a total of 2100 RGB images of 256 × 256 pixels (100 images per category). The AID dataset contains 30 categories and 10 000 images of 600 × 600 pixels. The NWPU-RESISC45 dataset contains 45 categories and 31 500 images of 256 × 256 pixels. VGoogle is a newer and larger dataset, described further in the Appendix.
B. Implementation Details
In this article, we use ResNet34 [11], [23], VGG16 [22], and DenseNet121 [34] as baseline models to make fair comparisons with previous methods; their detailed structures are shown in Table I. We select VGG16 because many previous methods use it to extract features. Compared to VGG16, ResNet34 performs better in image classification with fewer trainable parameters and FLOPs, so we select it as well. As for DenseNet121, the work in [50] mainly uses it as a baseline model, so we also choose it for a fair comparison.
During training, all input images are resized to a fixed size before being fed into the networks.
During the experiments, we apply fixed training settings to the baseline models and our proposed models. We use stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. The initial learning rate is set to 0.005, and the mini-batch size is 64. The total number of training epochs is 200, and the learning rate is divided by 10 at epochs 90 and 150. For all models, we adopt the ImageNet [19] pretraining strategy and fine-tune the models on the RS image datasets. All models are implemented in PyTorch on an NVIDIA GTX 1080Ti. Our code will soon be available online.
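For reproducibility, the reported schedule maps onto standard PyTorch components as below; `model` and `train_loader` are assumed to exist, `model` is assumed to return the four branch predictions, and `objective` is the loss sketch from Section III-D.

```python
import torch

# Sketch of the reported schedule: SGD (momentum 0.9, weight decay 5e-4),
# initial LR 0.005, batch size 64, 200 epochs, LR / 10 at epochs 90 and 150.
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[90, 150], gamma=0.1)
for epoch in range(200):
    for images, labels in train_loader:            # mini-batch size 64
        optimizer.zero_grad()
        p_mb, p_ffb, p_fem3, p_fem4 = model(images)
        loss = objective(p_mb, p_ffb, p_fem3, p_fem4, labels)  # eq. (11)
        loss.backward()
        optimizer.step()
    scheduler.step()
```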
C. Experimental Results
We conduct extensive experiments to show the performance of MGML-FENet. To evaluate our models, we use overall accuracy (OA) as the criterion, a commonly used metric in classification tasks. Previous methods use different networks as backbones; therefore, we apply the same backbones to make fair comparisons. To make the results more convincing, we compare performance against both previous models and baseline models.
In the RPM of MGML-FENet, we mainly adopt the seven-crop strategy because, by intuitive observation, RS images often carry important information in the middle "band" patches. We also compare "nine-crop" with "seven-crop" in the ablation study.
1) Classification on AID Dataset:
Following the settings of previous methods on the AID dataset, we randomly select 20% or 50% of the data for training and use the rest for testing. We run every experiment five times and report the mean and standard deviation of OA. The comparison results are shown in Table II. Taking VGG16 as the backbone [see Table II(a)], MGML-FENet performs better than the SOTA method, KFBNet [50]. In particular, at a 50% training ratio, MGML-FENet achieves 97.89% OA, surpassing KFBNet by 0.7%. With DenseNet121 as the backbone [see Table II(b)], MGML-FENet performs even more strongly: it achieves 96.45% and 98.60% OA, improving the SOTA accuracy by 0.82% and 1.2% at T.R. = 20% and 50%, respectively. We introduce ResNet34 as one of the backbones because it is proven better than VGG16 in image classification with far fewer trainable parameters and lower computation cost. The results in Table II clearly show that MGML-FENet (ResNet34) outperforms MGML-FENet (VGG16) and other previous VGG16-based methods; surprisingly, it even achieves better results than other DenseNet121-based models [see Table II(b)].
2) Classification on NWPU-RESISC45 Dataset:
NWPU-RESISC45 contains more images and categories than AID, so previous methods use 10% and 20% of the images for training. From Table II(a), MGML-FENet with a VGG16 backbone achieves SOTA results (from 92.95% to 93.36%) at a 20% training ratio. With DenseNet121 as the backbone [see Table II(b)], MGML-FENet nearly obtains the best accuracy of 95.39% (only 0.01% below FDPResNet [69]) at T.R. = 20%. Although MGML-FENet does not obtain SOTA results with VGG16 and DenseNet121 at a 10% training ratio, the gaps are small (0.14% and 0.17%).
3) Classification on UC-Merced:
The UC-Merced dataset has only 2100 images in 21 categories. The training ratio is 80%, which means that only 420 images serve as validation data. Table II shows that KFBNet achieves 99.88% and 99.76% classification accuracy using VGG16 and DenseNet121 as backbones, respectively; these results are close to 100%. MGML-FENet also achieves accuracy close to this ceiling. As mentioned before, we run every experiment five times and report the mean and standard deviation of OA. With the VGG16-based model, the five classification results are 99.76%, 99.76%, 99.76%, 100%, and 99.76%. With the ResNet34-based model, we likewise obtain 100% once and 99.76% four times. With the DenseNet121-based model, we obtain 100% one more time than with the above two models. Note that 99.76% accuracy means that only one image is assigned to the wrong category. From these comparisons, we observe that our method and KFBNet both approach the upper limit on the UC-Merced dataset and hold an obvious advantage over other previous methods.
All in all, MGML-FENets are stronger than previous methods on most datasets. Most importantly, MGML-FENets work extremely well with different backbones, which means that a much smaller network can be designed flexibly by choosing a smaller backbone.
D. Ablation Study
In our proposed models, we adopt different modules according to different motivations. To show the effectiveness of each module separately, we conduct additional ablation experiments. In this section, all experiments are run on the AID and NWPU-RESISC45 datasets.
1) Comparison With Baseline Models:
In the RS scene classification task, some notable baseline models already work well on their own. Besides comparing with previous SOTA models to show the superiority of our method, we also compare with the baseline models. In this article, we use VGG16, ResNet34, and DenseNet121 as baselines; Table I shows their detailed structures.
The comparison results between the baseline models and MGML-FENets are shown in Fig. 5 and Table III. On the AID and NWPU datasets, MGML-FENets clearly achieve better results. Taking VGG16 as the baseline, MGML-FENet improves by a large margin: on AID, it outperforms VGG16 by 0.98% and 0.82%, and on NWPU-RESISC45 it achieves 1.16% and 0.57% higher accuracy. Based on ResNet34, MGML-FENet still shows a large improvement; on NWPU-RESISC45 at a 10% training ratio, our model gains 1.04% (90.35% to 91.39%). With DenseNet121 as the baseline, the classification results are already at a high level, yet MGML-FENet gains further improvement: on NWPU-RESISC45, the margin is 0.83% and 0.65%. Moreover, with smaller sets of training samples, MGML-FENets perform comparatively better, which shows the robustness and effectiveness of our method.
Comparison between MGML-FENets and baseline models. The curves show the OA of our proposed models and the baseline models on the AID and NWPU-RESISC45 datasets at different training ratios.
To further verify the effectiveness of our proposed method, we add experiments with fewer training samples. As shown in Table IV, MGML-FENets outperform the baseline models (VGG16, ResNet34, and DenseNet121) by 1.14%, 0.66%, and 0.53%, respectively, on the AID dataset at a 10% training ratio. On NWPU-RESISC45 with T.R. = 5%, MGML-FENets outperform the baselines by 1.69%, 1.34%, and 0.58%, respectively. In most cases, our models surpass the baseline models by larger margins when fewer training samples are available, which means they are more competitive at low training ratios.
2) Effect of MGML-FFB and MGML-FEM:
To show the separate effect of MGML-FFB, we apply only the main branch and MGML-FFB to form the network. Fig. 3 shows that this network has only two predictions, $\boldsymbol{P}_{\mathbf{mb}}$ and $\boldsymbol{P}_{\mathbf{ffb}}$, to fuse during validation; the corresponding results are listed in Table III.
MGML-FEM is designed to extract high-level structural features. To show its effect, we directly add MGML-FEM to the baseline model and evaluate the classification performance. As shown in Table III, compared to the baseline models, networks with only MGML-FEM added show strong and stable performance, with higher mean OA and lower standard deviation. Compared to MGML-FFB, MGML-FEM is more stable and more efficient, with almost no extra computation cost.
3) Effect of Feature Ensemble Network:
Our proposed MGML-FENet is constructed by integrating the main branch (baseline model), MGML-FFB, and MGML-FEM. Table III clearly shows that integrating MGML-FFB and MGML-FEM gains better OA than applying either of them alone. With the ensemble learning strategy, the whole network utilizes four predictions to vote for the final result, and different branches provide predictions containing different features: the main branch focuses on extracting global features, MGML-FFB extracts multigranularity features at different levels of the network, and MGML-FEM utilizes the structural information of high-level features. With this feature ensemble strategy, MGML-FENets perform much more strongly and stably.
4) Seven-Crop Versus Nine-Crop:
In this article, we mainly adopt seven-crop in the RPM of both CS-FG and FC-FG because we observe that the typical features of RS images often appear in the "band" areas (the bands in the middle rows and middle columns). Compared to seven-crop, nine-crop is a more flexible region proposal method; according to Algorithm 1, it can easily be expanded to larger sliding-window grids.
To compare the performance of seven-crop and nine-crop, we apply the two region proposal approaches to MGML-FENets and keep all other settings unchanged. The comparison results on the AID and NWPU-RESISC45 datasets are shown in Table V. Although nine-crop is slightly weaker than seven-crop, it retains an advantage in flexibility and extensibility.
E. Visualization and Analysis
1) Convergence Analysis:
Training MGML-FENets amounts to minimizing the objective function defined in (11).
2) Feature Map Visualization and Analysis:
To intuitively interpret our proposed method, we visualize feature maps at different levels of the network. We select MGML-FENet (ResNet34) and run experiments on NWPU-RESISC45 with T.R. = 20%. After the model converges, we visualize feature maps to observe the attention regions. From Fig. 7, we analyze our method in the following five points.
Feature map visualization of MGML-FENet (ResNet34) on NWPU-RESISC45. The two images are randomly selected during testing and used to generate feature maps. The feature maps are selected from different levels of the network. In each feature map pair, the left map is the global feature map of the main branch and the right map is the corresponding reconstructed feature map in MGML-FFB.
First, CS-FG can extract multigranularity features that help reduce the negative influence of large intraclass variance. Following the explanation in [32], the global feature map tends to concentrate on a single dominant region, whereas the feature patches produced by CS-FG attend to multiple local regions at different granularities, making the representation less sensitive to scale and appearance changes within a class.
Second, our proposed networks integrate feature maps at different levels, which improves generalization ability. As shown in Fig. 7, feature maps at different levels of the network contain different information. In MGML-FENets, MGML-FFB and MGML-FEM both extract and fuse feature maps from different levels.
Third, MGML-FENets can obtain abundant fine-grained features through CS-FG, which helps the network learn the distinct characteristics of each category. For example, in the "Airplane" image, some feature patches (left top, right top, and so on) attend to the planes, the most distinctive element of the category "Airplane." Besides planes, some feature patches (right bottom, middle band in row, and so on) focus on the runway, which is also a significant characteristic of the category. In RS images, the planes in "Airplane" images are sometimes very small; in such cases, other fine-grained features such as the runway make a big difference for classification.
Fourth, RS images have high resolution and wide coverage, so extracting local patches helps the network filter redundant and confusing information. In Fig. 7, it is apparent that the attention regions in some feature patches become clearer (the color becomes warmer) than in the global feature map. For example, in the "Intersection" image, the feature maps usually spread equal attention intensity over the road edges and corners, which lowers the contrast. Using local feature patches enhances the attention intensity in different local regions: the "right bottom" patch focuses only on the edge information of the right-bottom road corner, and the "middle band in column" patch focuses on the edge information of the horizontal road. All in all, extracting local patches enhances the attention intensity and yields enhanced fine-grained features through adaptive pooling on smaller local patches with less interference.
Last but not least, the channel-separate strategy guides the global feature maps to develop different focuses, which makes the networks compact and efficient. Specifically, the channel-separate strategy forces the network to recognize scenes through a group of local feature patches, with only a few channels allotted to each local patch. Through experiments and visualization (see Fig. 7), we find that the global feature maps tend to share attention regions and patterns with the corresponding feature patches. This is positive because abundant feature representation improves network performance.
3) Predictions Visualization Analysis With T-SNE:
Inspired by the ensemble learning method, we assume that the final voting accuracy becomes higher if the four predictions provide diverse and accurate results. To intuitively show the distribution patterns of the four predictions, we apply the T-SNE [73] method to visualize and analyze the prediction vectors of each branch.
Visualization results of the predictions of MGML-FENet. We reduce the 45-D prediction vectors to 2-D by T-SNE. We use MGML-FENet (ResNet34) on NWPU-RESISC45 and randomly select 512 testing samples to visualize. (a)–(e) represent the visualization results of the five predictions ($\boldsymbol{P}_{\mathbf{mb}}$, $\boldsymbol{P}_{\mathbf{ffb}}$, $\boldsymbol{P}_{\mathbf{fem3}}$, $\boldsymbol{P}_{\mathbf{fem4}}$, and the final prediction $\boldsymbol{P}$).
From Fig. 8, we analyze the following three points. First, all four predictions give reasonable classification results on the 45 categories: even though some samples remain confusing and hard to classify, the category clusters are clear. Second, the cluster maps of the four predictions have diverse patterns, which helps the network deal with confusing samples. Third, the final prediction ($\boldsymbol{P}$) shows the clearest cluster structure, which intuitively confirms the benefit of fusing the four predictions.
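The visualization itself is straightforward to reproduce. Below is a sketch using scikit-learn's T-SNE, assuming `preds` is an (N, 45) array of one branch's prediction vectors for N testing samples and `labels` holds their ground-truth categories; both names are hypothetical.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Reduce the 45-D prediction vectors to 2-D, as in Fig. 8.
emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(preds)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=8)
plt.title("T-SNE of branch predictions (45 categories)")
plt.show()
```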
4) Computation Cost Analysis:
Compared to baseline models, MGML-FENets have a higher computation cost at inference time. MGML-FFB introduces more "conv layers," which causes more convolution operations. However, in MGML-FFB, feature maps at each level of the network are cropped into several feature patches and recombined by CS-FG; the new feature maps have the same number of channels but a smaller spatial scale than the original ones, so the computation increment is restrained. We list the computation cost comparison in Table VI.
MGML-FENets have a higher computation cost than the baseline models. As Table II shows, MGML-FENets achieve accuracy improvements by a large margin (more than 1% in some cases), even though some extra inference computation is introduced. In practical applications, the computation cost often needs to be controlled; "baseline + MGML-FEM" networks are then the more efficient choice. From Tables II and VI, "baseline + MGML-FEM" networks gain an average OA improvement of about 0.4% with almost no extra computation cost.
5) Hyperparameter Setting:
Our method involves four hyperparameters, the loss weights $\lambda_{1}$–$\lambda_{4}$ in (11). The weight of the main branch, $\lambda_{1}$, is fixed to 1.0, and we investigate the other three experimentally.
Comparison results of different hyperparameter settings. We use MGML-FENet (ResNet34) on NWPU-RESISC45 with T.R. = 10%. The loss weight of the main branch ($\lambda_{1}$) is set to 1.0, so we conduct experiments to investigate the other three hyperparameters.
6) Future Work Analysis:
In this article, we focus on exploring the representation ability of multigranularity features and provide a new multilevel feature fusion method. Our method overcomes the large intraclass variance and confusing information of high-resolution RS scene images covering large geographic areas. In the future, we will extend our method to more complex RS tasks, such as RS image object detection [5], [6] and RS image segmentation [7], [8]. Recently, RS scene tasks based on high-resolution UAV images [74], [75] have become popular. High-resolution UAV images are distributed with great heterogeneity and high density, which challenges existing algorithms. Our method preserves detailed information and suppresses confusing information, which gives it an advantage in handling UAV images. All in all, extending our method to detection or segmentation tasks on UAV data will be our next research focus.
Conclusion
In this article, we design MGML-FENet to tackle the RS scene classification task. In MGML-FENet, the main branch maintains useful global features, MGML-FFB extracts multigranularity features and explores fine-grained features at different levels of the network, and MGML-FEM utilizes high-level features with structural information. Specifically, we propose two important modules, CS-FG and FC-FG, to extract feature patches and recombine them. Extensive experiments show that the proposed networks outperform previous models and achieve SOTA results on notable benchmark datasets for the RS scene classification task. In addition, visualization results show that our proposed networks are reasonable and interpretable. In the future, we will focus on extending our method to more complex RS scene tasks (e.g., UAV image object detection) to further improve its generalization ability.
Appendix
Classification Results on VGoogle Dataset
Compared to AID, NWPU-RESISC45, and UC-Merced, VGoogle is a newer RS dataset containing more samples. We evaluate our method on VGoogle to further demonstrate the general performance of MGML-FENets. We select ResNet34 as the baseline model and run experiments with 5% and 10% training ratios; the comparison results are reported in Table VII. At the low training ratio (5%), MGML-FENet (ResNet34) performs clearly better, with a 0.77% OA improvement. At a 10% training ratio, MGML-FENet (ResNet34) also achieves better OA. The results on VGoogle confirm that our proposed MGML-FENets generalize well.
To intuitively show the superiority of MGML-FENet on the VGoogle dataset, we compare the confusion matrices of the baseline model (ResNet34) and MGML-FENet (ResNet34). As shown in Fig. 10, with T.R. = 5%, we use the two models to generate confusion matrices on the validation data. The comparison makes clear that MGML-FENet (ResNet34) makes fewer errors than the baseline model, which intuitively demonstrates the effectiveness of our method. In particular, the classification accuracy of "cemetery" improves from 91% to 97%, that of "dense_residential" improves from 93% to 95%, and that of "sparse_residential" improves by a large margin (79% to 87%). These categories often contain confusing information (e.g., "sparse_residential" images have geographic elements similar to "coastal_mansion" or "nursing_home" images). Compared to the baseline model, MGML-FENet shows a strong ability to overcome such confusion.
Confusion matrices obtained by different models on the VGoogle dataset with T.R. = 5%. The left matrix is obtained by the baseline model (ResNet34) and the right one by MGML-FENet (ResNet34).