Introduction
Remote sensing (RS) technology has been widely used in many practical applications, such as RS scene classification [1]–[4], RS object detection [5], [6], RS semantic segmentation [7], [8], and RS change detection [9]. Among these applications, RS scene classification, which aims to classify RS scene images into different categories, is a particularly active research topic.
Recent years have witnessed significant progress in various computer vision tasks using deep convolutional neural networks (DCNNs) [10]–[16]. In image classification tasks such as scene classification [17], [18], object classification [19], [20], and medical image classification [21], DCNNs have shown strong performance by extracting multilevel features with hierarchical architectures [10], [11], [22]–[27]. Basically, a DCNN efficiently encodes each image into a classification probability vector, which is derived from a global feature. Furthermore, some well-designed DCNNs [28]–[31] combine global features with local features for better representation. However, directly applying DCNNs to the RS scene classification task raises two main problems. The first is the large intraclass variance caused by the resolution variance of RS images (e.g., the image resolution of the AID dataset ranges from 0.5 to 8 m [1]), illustrated by the cases in Fig. 1(a). The second is that RS images often contain confusing information because they cover large geographic areas. As shown in Fig. 1(b), confusing information reduces the interclass distance: an inshore "resort" looks similar to a "beach," and a "railwaystation" built around residential areas closely resembles "denseresidential." The motivation of this article is to solve the large intraclass variance problem and to overcome the confusion caused by complex geographic elements.
Intuitive cases to explain the two main problems in RS scene classification. (a) Intraclass variance in RS images. (b) Redundant and confusing information in RS images.
To address these two problems, we propose two intuitive assumptions as the theoretical basis of our method. First, besides global features, fine-grained features are also helpful for RS scene classification. For example, we can easily recognize an "airport" if we see planes in an RS image. Second, RS images contain latent semantic structural information, which can be explored without detailed annotations such as bounding boxes or pixel-level labels. As shown in the third row of Fig. 1(b), to distinguish "church" from "storagetanks," we cannot focus only on the central white tower; we need structural information such as "tower + surroundings" to make the judgment.
Based on the above assumptions, we propose a novel multigranularity multilevel feature ensemble network (MGML-FENet) to tackle the RS scene classification task. Specifically, we design a multigranularity multilevel feature fusion branch (MGML-FFB) to explore fine-grained features by forcing the network to focus on a cluster of local feature patches at each level of the network. This branch mainly extracts aggregated features containing structural information. Furthermore, we propose a multigranularity multilevel feature ensemble module (MGML-FEM) to fuse high-level multigranularity features that share similar receptive fields but different resolutions. The overview of MGML-FENet is shown in Fig. 2.
Overview of MGML-FENet architecture. “MGML-FFB” denotes the multigranularity multilevel feature fusion branch. “MGML-FEM” denotes the multigranularity multilevel feature ensemble module, and “fc” denotes fully connected layers.
In MGML-FENet, MGML-FFB explores multigranularity multilevel features and utilizes fine-grained features to reduce the adverse effects of large intraclass variance. Specifically, we use a channel-separate feature generator (CS-FG) to reconstruct feature maps: the original feature map is first cropped into several patches, each feature patch keeps a small group of channels split from the original feature map, and all feature patches are then concatenated to form a new feature map. MGML-FEM utilizes high-level features with structural information to overcome confusion. In this module, we propose a full-channel feature generator (FC-FG) to generate predictions. Its cropping operation on the original feature map is the same as in CS-FG; then, through global average pooling and concatenation, a new feature vector is created and fed into the classifier at the end of the network. Finally, all four predictions are fused to vote for the final prediction. In this article, "vote" does not mean majority voting; it is closer to soft voting on the posterior probability of each category. Specifically, MGML-FENet generates four predictions, each a global feature vector indicating the posterior probability of each category. As Fig. 2 shows, we take the sum of the four predictions to obtain a more convincing final prediction.
Previous methods [32], [33] typically crop and zoom in on images to extract multigranularity features and then apply transformations to learn transformation-invariant features. "Crop and zoom in" is effective for extracting fine-grained features with high-quality detailed information. However, it overlooks important information from other local patches and does not fully use fine-grained features at different levels of the network. Compared to previous multigranularity feature extraction-based methods, MGML-FENets extract multigranularity features from different local feature patches to obtain structural information. MGML-FENets also extract and fuse multigranularity features at different levels of the network, making full use of more abundant feature information.
To verify the effectiveness of the proposed network, we conduct extensive experiments using VGG16 [22], ResNet34 [11], and DenseNet121 [34] as baseline models on multiple benchmark datasets (AID [1], NWPU-RESISC45 [3], UC-Merced [2], and VGoogle [35]). Compared to previous methods, MGML-FENets perform better and achieve new state-of-the-art (SOTA) results.
Our main contributions are listed as follows.
We propose MGML-FFB to extract and fuse useful fine-grained features from different local patches at different levels of the network. This module efficiently alleviates the large intraclass variance problem.
We propose MGML-FEM to generate global features containing high-level structural information. This module addresses the confusing information problem.
We derive two novel feature generators, CS-FG and FC-FG, to extract and fuse multigranularity multilevel features.
We integrate all predictions using a soft voting strategy and construct an end-to-end ensemble network that achieves better classification results than previous networks on different benchmark datasets.
Related Works
A. Remote Sensing Scene Classification
In recent years, researchers have introduced many notable methods for RS scene classification. These methods can generally be divided into two types: traditional handcrafted feature-based methods and DCNN-based methods.
Handcrafted feature-based methods usually rely on notable feature descriptors. Yang and Newsam [2], Zhao et al. [36], Zhu et al. [37], and Bahmanyar et al. [38] investigated bag-of-visual-words (BoVW) approaches for the high-resolution land-use image classification task. As two classical feature descriptors, the scale-invariant feature transform (SIFT) [39] and the histogram of oriented gradients (HOG) are also widely applied in the RS scene classification field [40]–[42].
Compared to traditional handcrafted feature-based methods, DCNNs have better feature representation ability and have recently achieved great success in the RS scene classification task. Marmanis et al. [43] and Nogueira et al. [44] applied DCNNs to extract features of RS images and further explored their generalization potential to obtain better performance. In addition, some methods integrate attention mechanisms into DCNNs to capture more subordinate-level features with only global-level annotations as guidance [45], [46]. When DCNNs are applied to very high-resolution (VHR) RS images, feature extraction is often partial and incomplete. To address this issue, the method in [47] applies a "prediction module" with atrous convolution blocks to extract more abundant global features and a "residual refinement module" to correct the residual between the predicted and real results. Shao et al. [48] adopted atrous convolution blocks and a PSP pooling module to integrate multiscale features with large receptive fields. Moreover, to tackle the interclass similarity and large intraclass variance issues, second-order information has been efficiently applied to RS scene classification [32], [49] with excellent performance. More recently, Li et al. [50] proposed a notable architecture, KFBNet, which extracts more compact global features under the guidance of key local regions and is currently the SOTA method. In this article, we mainly compare our results with [32], [49], and [50].
B. Multigranularity Feature Extraction Methods
In some classification tasks [51], [52], large interclass similarity results in a rapid performance decline. To address this problem, many fine-grained feature extraction methods have been proposed [53], [54]. However, in most cases, only global annotations are provided, which makes finding fine-grained features difficult owing to the lack of semantic-level annotations. Therefore, multigranularity feature extraction methods are applied to enhance the region-based feature representation ability of DCNNs [32], [55]–[58]. Inspired by these methods, we adopt multigranularity feature extraction in our method for RS images.
Ensemble learning-based methods offer another perspective for extracting multigranularity features by designing multiple subnetworks with different structures. Ge et al. [59] directly used several CNNs to produce different classification results, which are then fused via occupation probability. Ge et al. [60] introduced a learning system that learns specific features from specific deep sub-CNNs. Ye et al. [61] adopted an ensemble extreme learning machine (EELM) classifier for RS scene classification with superior generalization and low computational cost. Learning from these methods, we adopt an ensemble learning strategy in our architecture to integrate multigranularity multilevel features.
C. Feature Fusion Methods in RS Scene Classification
To reduce the harm from the resolution variance of images, many researchers employ feature fusion methods and obtain better performance. Liu et al. [62] proposed a multiscale CNN (MCNN) framework containing a fixed-scale net and a varied-scale net to handle the scale variation of objects in RS imagery. Zeng et al. [31] designed a two-branch architecture to integrate global-context and local-object features. Li et al. [63] presented a method to fuse multilayer features from pretrained CNN models for RS scene classification. Shao and Cai [64] and Shao et al. [65] explored the capability of fusion strategies, innovatively using DCNNs to fuse features from multisource images. Similar to our method, Shao et al. [66] proposed MF-CNN to extract and integrate multiscale features at different levels of the network. MF-CNN shows the power of fusing low-level spatial information and high-level semantic information in DCNNs. In this article, we also focus on feature fusion to handle features of different granularities, localizations, and scales.
Proposed Method
In the RS scene classification task, extracting only the global feature of RS images works well in most cases. Besides global features, we use multigranularity features at different levels of the network to further improve performance on hard cases. Therefore, we propose MGML-FENet to tackle the RS scene image classification task. As shown in Fig. 3, input images are first fed into "conv pool" ("Conv1" and "Pool1" in Table I). The output feature map then passes through the four "conv layers" ("Layer1–4" in Table I), and the main branch finally generates its classification probability vector. At each level of the main branch, the feature map is reconstructed by CS-FG and fused with the preceding feature map in MGML-FFB, which yields another classification probability vector. The "conv layers" in MGML-FFB and the main branch use the same structure but do not share parameters, which introduces additional parameters and computation. The CS-FG module of MGML-FFB extracts local feature patches and constructs a new feature map; compared to the original feature map, the new one has the same number of channels but a smaller spatial scale, which limits the computation increase. The output feature maps of the last two main-branch layers serve as inputs to MGML-FEM, whose two output classification probability vectors come from different levels of the network. Unlike MGML-FFB, MGML-FEM introduces few extra parameters and little extra computation.
Structure of MGML-FENet. MGML-FENet consists of three parts: the main branch (blue), the MGML feature fusion branch (orange), and the MGML feature ensemble module (green). The main branch is an ImageNet-pretrained baseline structure (ResNet34/VGG16/DenseNet121). MGML-FFB consists of CS-FG and "conv layers;" MGML-FEM consists of FC-FG. Each "conv layer" is a basic convolutional block. The detailed structure is shown in Table I.
During training, each branch is trained using a cross-entropy loss with a different weight. During validation, the classification probability vectors of all branches are fused to vote for the final classification result.
A. Main Branch
In RS images, the global feature carries important high-level information; to extract it, we employ the main branch of MGML-FENet. As shown in Fig. 3 and Table I, the main branch has the same structure as the baseline models (VGG16, ResNet34, and DenseNet121). We denote "conv1 pool1" and the four "conv layers" as $f_i$ ($i = 0, \ldots, 4$), their output feature maps as $\boldsymbol{F_i}$, and the input image as $\boldsymbol{X}$:
\begin{equation*} \boldsymbol {F_{i}} = f_{i}\left ({ \boldsymbol {F}_{ \boldsymbol {i}-\mathbf {1}}}\right ), \quad \boldsymbol {F}_{-\mathbf {1}} = \textbf {X} \tag{1}\end{equation*}
In addition, we denote the fully connected layer as $f_{mb}$ and the prediction of the main branch as $\boldsymbol{P}_{\mathbf{mb}}$:
\begin{equation*} {\boldsymbol{P}_{\mathbf {mb}}} = f_{mb}\left ({ \boldsymbol {F}_{\mathbf {4}}}\right ). \tag{2}\end{equation*}
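As a concrete illustration, here is a minimal PyTorch sketch of the main branch in (1) and (2), assuming a ResNet34 backbone as in Table I; the paper provides no reference code, so all class and attribute names here are hypothetical. The intermediate feature maps $\boldsymbol{F_0}$–$\boldsymbol{F_4}$ are returned so that MGML-FFB and MGML-FEM can reuse them.

```python
import torch.nn as nn
import torchvision.models as models

class MainBranch(nn.Module):
    """Hypothetical sketch: ImageNet-pretrained ResNet34 as the main branch."""
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = models.resnet34(pretrained=True)
        # f_0: "conv1 pool1" ("Conv1" and "Pool1" in Table I)
        self.f0 = nn.Sequential(backbone.conv1, backbone.bn1,
                                backbone.relu, backbone.maxpool)
        # f_1..f_4: the four "conv layers" ("Layer1-4" in Table I)
        self.stages = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.avgpool = nn.AdaptiveAvgPool2d(1)
        self.f_mb = nn.Linear(512, num_classes)   # fully connected classifier

    def forward(self, x):
        feats = []                    # F_0 .. F_4, reused by the other branches
        f = self.f0(x)                # F_0 = f_0(X)
        feats.append(f)
        for stage in self.stages:     # F_i = f_i(F_{i-1})
            f = stage(f)
            feats.append(f)
        p_mb = self.f_mb(self.avgpool(f).flatten(1))  # P_mb = f_mb(F_4), eq. (2)
        return p_mb, feats
```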
B. MGML Feature Fusion Branch
1) Overview of MGML-FFB:
To alleviate the large intraclass variance problem, we design MGML-FFB to utilize fine-grained features at different levels of the network. In MGML-FFB, we first propose a novel module, CS-FG, to extract features from local patches. This module forces the network to pay more attention to the multigranularity features of local patches at different levels of the network. The structure of MGML-FFB is shown in Fig. 3. A feature map output from a specific "conv layer" of the main branch is first fed into CS-FG to generate a channel-separate feature map. A "conv layer" of MGML-FFB then processes the channel-separate feature map, and the output feature map is fused with the channel-separate feature map of the next stage.
Denoting "CS-FG" and the "conv layer" of MGML-FFB at each level as $h_i$ and $g_i$, respectively, and the fused feature map as $\boldsymbol{G_i}$, we have
\begin{align*} \boldsymbol {G}_{ \boldsymbol {i}+\mathbf {1}}=&h_{i+1}\left ({ \boldsymbol {F}_{ \boldsymbol {i}+\mathbf {1}}}\right ) + g_{i}\left ({ \boldsymbol {G_{i}}}\right ), \quad i=0, 1, 2 \tag{3}\\ \boldsymbol {G}_{\mathbf {0}}=&h_{0}\left ({ \boldsymbol {F}_{\mathbf {0}}}\right ) = h_{0}\left ({f_{0}\left ({ \boldsymbol {X}}\right )}\right ), \quad \boldsymbol {G}_{\mathbf {4}} = g_{3}\left ({ \boldsymbol {G}_{\mathbf {3}}}\right ). \tag{4}\end{align*}
The final prediction of MGML-FFB is calculated through another fully connected layer, as shown in (5), where the "fc" layer and the prediction are denoted as $f_{ffb}$ and $\boldsymbol{P}_{\mathbf{ffb}}$, respectively
\begin{equation*} {\boldsymbol{P}_{\mathbf {ffb}}} = f_{ffb}\left ({ \boldsymbol {G}_{\mathbf {4}}}\right ). \tag{5}\end{equation*}
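The recursion in (3)–(5) can be sketched as follows, assuming the per-level CS-FG modules ($h_i$) and branch "conv layers" ($g_i$) are supplied and produce spatially compatible outputs; this is a hypothetical module, not the authors' code.

```python
import torch.nn as nn

class MGMLFFB(nn.Module):
    """Hypothetical sketch of the feature fusion branch, eqs. (3)-(5)."""
    def __init__(self, cs_fg: nn.ModuleList, conv_layers: nn.ModuleList,
                 feat_dim: int, num_classes: int):
        super().__init__()
        self.cs_fg = cs_fg            # h_0..h_3: CS-FG at each level
        self.convs = conv_layers      # g_0..g_3: branch "conv layers"
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.f_ffb = nn.Linear(feat_dim, num_classes)

    def forward(self, feats):         # feats = [F_0, F_1, F_2, F_3]
        g = self.cs_fg[0](feats[0])   # G_0 = h_0(F_0)
        for i in range(3):            # G_{i+1} = h_{i+1}(F_{i+1}) + g_i(G_i)
            g = self.cs_fg[i + 1](feats[i + 1]) + self.convs[i](g)
        g = self.convs[3](g)          # G_4 = g_3(G_3)
        return self.f_ffb(self.pool(g).flatten(1))  # P_ffb = f_ffb(G_4)
```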
2) Channel-Separate Feature Generator (CS-FG):
To utilize fine-grained features and explore the structural information of multigranularity features, we design CS-FG in MGML-FFB. At each level of MGML-FFB, CS-FG reconstructs the original feature map by extracting several local feature patches and combining them. Compared to feature maps in the main branch, feature maps in MGML-FFB focus more on local features rather than global features. Moreover, CS-FG increases the diversity of feature representation, which helps considerably in representing RS images. CS-FG is the core module of MGML-FFB; its structure is shown in Fig. 4. CS-FG consists of a region proposal module (RPM) and a channel-separate extractor (CS-E).
Structure of the CS-FG and FC-FG modules. The two modules use the same region proposal module (RPM) to generate several feature patches. After RPM, CS-FG and FC-FG use the channel-separate extractor (CS-E) and the full-channel extractor (FC-E), respectively, to reconstruct the feature patches. Both modules take the output feature maps of the main branch as input.
RPM is used to crop the original feature map and generate feature patches. In this article, we mainly introduce two approaches: seven-crop and nine-crop (sliding windows). As shown in Fig. 4, the seven-crop approach extracts seven fixed-position patches (left-top, left-bottom, right-top, right-bottom, center, band in the middle rows, and band in the middle columns) from the feature map, while the nine-crop approach extracts nine fixed-position patches using a sliding-window strategy. In addition, the nine-crop approach can be extended to larger sliding-window grids.
Algorithm 1 Seven-Crop and Nine-Crop Region Proposal Algorithm
Input: A feature map $\boldsymbol{F_i}$ of spatial size $H \times W$
Output: An anchor list $A_i$
if RPM type is 7-crop then
    append the seven fixed-position anchors (left-top, left-bottom, right-top, right-bottom, center, middle-row band, and middle-column band) to $A_i$
end if
if RPM type is 9-crop then
    for each sliding-window row position do
        for each sliding-window column position do
            append the corresponding window anchor to $A_i$
        end for
    end for
end if
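The anchor generation can be sketched in Python as below. The paper does not state the exact window sizes, so the half-size windows and quarter-size strides here are assumptions consistent with Fig. 4; anchors are (row-start, row-end, column-start, column-end) tuples, and the function name is hypothetical.

```python
def region_proposal(h: int, w: int, mode: str = "7-crop"):
    """Hypothetical RPM sketch: return crop anchors (y0, y1, x0, x1) on an h x w map."""
    hh, hw = h // 2, w // 2            # assumed window size: half of the map
    qh, qw = h // 4, w // 4            # assumed stride / band offset: a quarter
    if mode == "7-crop":
        return [
            (0, hh, 0, hw),                 # left-top
            (hh, h, 0, hw),                 # left-bottom
            (0, hh, hw, w),                 # right-top
            (hh, h, hw, w),                 # right-bottom
            (qh, qh + hh, qw, qw + hw),     # center
            (qh, qh + hh, 0, w),            # band in the middle rows
            (0, h, qw, qw + hw),            # band in the middle columns
        ]
    if mode == "9-crop":                    # 3 x 3 sliding windows
        return [(i * qh, i * qh + hh, j * qw, j * qw + hw)
                for i in range(3) for j in range(3)]
    raise ValueError(f"unknown RPM type: {mode}")
```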
CS-E extracts feature patches from the original feature map using the anchors generated by RPM and assigns each patch a separate group of channels; the detailed procedure is given in Algorithm 2.
In summary, CS-FG consists of RPM and CS-E. In (3), CS-FG is denoted as $h_i$; denoting RPM and CS-E as $h_i^0$ and $h_i^1$, respectively, and the anchor list as $A_i$, we have
\begin{align*} A_{i}=&h_{i}^{0}\left ({ \boldsymbol {F_{i}}}\right ) \tag{6}\\[-2pt] \boldsymbol {H_{i}}=&h_{i}\left ({ \boldsymbol {F_{i}}}\right ) = h_{i}^{1}\left ({ \boldsymbol {F_{i}}; A_{i}}\right ) = h_{i}^{1}\left ({ \boldsymbol {F_{i}}; h_{i}^{0}\left ({ \boldsymbol {F_{i}}}\right )}\right ). \tag{7}\end{align*}
C. MGML Feature Ensemble Module
1) Overview of MGML-FEM:
To overcome the confusing information problem, we propose the MGML feature ensemble module. This module utilizes high-level features with structural information, making the whole network more robust. Moreover, it provides diverse predictions, based on ensemble learning theory, to vote for the final classification result. To generate more convincing predictions and train the network in a reasonable manner, we apply MGML-FEM only at deeper levels of the network because features in shallow layers mainly contain low-level basic information. In MGML-FEM, we propose FC-FG to extract global information from each local patch of high-level features and then fuse them to construct a feature vector containing structural information. Fig. 3 shows the structure of MGML-FEM.
Algorithm 2 Channel-Separate Extractor Algorithm
Input: A feature map $\boldsymbol{F_i}$ with $C$ channels; an anchor list $A_i$ with $n$ anchors
Output: A reconstructed feature map $\boldsymbol{H_i}$
Separate the channels of the input feature map into $n$ groups
for each anchor in $A_i$ do
    if the anchor is not the last one then
        extract the feature patch of the next channel group at the anchor position
    else
        extract the feature patch of the remaining channels at the anchor position
    end if
    Downsample the feature patch using adaptive pooling (the output size is half of the input size)
end for
Concatenate the feature patches along the channel dimension to obtain $\boldsymbol{H_i}$
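Below is a runnable sketch of CS-E under two explicit assumptions of ours: the pooling is adaptive *average* pooling (the algorithm only says "adaptive pooling"), and any channel remainder when $C$ is not divisible by the number of anchors goes to the last group. Neither choice is confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def channel_separate_extractor(feat: torch.Tensor, anchors) -> torch.Tensor:
    """Hypothetical CS-E sketch: one channel group per anchor, pooled to half size."""
    b, c, h, w = feat.shape
    n = len(anchors)
    base, rem = divmod(c, n)            # channel group size; remainder assumed last
    out, start = [], 0
    for k, (y0, y1, x0, x1) in enumerate(anchors):
        size = base + (rem if k == n - 1 else 0)
        patch = feat[:, start:start + size, y0:y1, x0:x1]
        start += size
        # adaptive pooling: every patch is resized to half of the ORIGINAL map
        out.append(F.adaptive_avg_pool2d(patch, (h // 2, w // 2)))
    # same channel count as the input, half the spatial scale
    return torch.cat(out, dim=1)
```

Composing this extractor with the RPM gives CS-FG as in (6) and (7), e.g., `channel_separate_extractor(feat, region_proposal(h, w))`.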
Mathematically, we denote the operation of MGML-FEM at level $i$ as $l_i$, its output feature vector as $\boldsymbol{v_i}$, and the fully connected classifiers of the two ensemble modules as $f_{fem3}$ and $f_{fem4}$:
\begin{align*} \boldsymbol {v_{i}}=&l_{i}\left ({ \boldsymbol {F_{i}}}\right ),\quad i = 3, 4 \tag{8}\\ {\boldsymbol{P}_{\mathbf {fem3}}}=&f_{fem3}\left ({ \boldsymbol {v}_{\mathbf {3}}}\right ), \quad {\boldsymbol{P}_{\mathbf {fem4}}} = f_{fem4}\left ({ \boldsymbol {v}_{\mathbf {4}}}\right ) \tag{9}\end{align*}
2) Full-Channel Feature Generator (FC-FG):
FC-FG is the main part of MGML-FEM. This module extracts high-level features that contribute to the final prediction. As shown in Fig. 4, FC-FG is formed by RPM and the full-channel extractor (FC-E). The RPM in FC-FG is the same as the one in CS-FG. FC-E keeps full-channel information for each feature patch, instead of using the channel-separate strategy, because high-level features need sufficient channel-wise representation. In addition, FC-E directly uses global average pooling to generate feature vectors because neurons at every pixel of a high-level feature map have large receptive fields and contain decoupled information. Algorithm 3 describes the FC-E procedure.
Algorithm 3 Full-Channel Extractor Algorithm
Input: A feature map $\boldsymbol{F_i}$; an anchor list $A_i$
Output: A feature vector $\boldsymbol{v_i}$
for each anchor in $A_i$ do
    Extract the full-channel feature patch at the anchor position
    Downsample the feature patch using global average pooling
end for
Concatenate the pooled feature patches to obtain $\boldsymbol{v_i}$
To express FC-FG mathematically, we denote FC-E as $l_i'$; with the anchors $A_i$ produced by RPM as in (6), we have
\begin{equation*} \boldsymbol {v_{i}} = l_{i}\left ({ \boldsymbol {F_{i}}}\right ) = l_{i}'\left ({ \boldsymbol {F_{i}}; A_{i}}\right ) = l_{i}'\left ({ \boldsymbol {F_{i}}; h_{i}^{0}\left ({ \boldsymbol {F_{i}}}\right )}\right ). \tag{10}\end{equation*}
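FC-E admits an even shorter sketch; as above, the function name is hypothetical and the anchors come from the same RPM.

```python
import torch
import torch.nn.functional as F

def full_channel_extractor(feat: torch.Tensor, anchors) -> torch.Tensor:
    """Hypothetical FC-E sketch: full-channel patches reduced by global average pooling."""
    pooled = []
    for (y0, y1, x0, x1) in anchors:
        patch = feat[:, :, y0:y1, x0:x1]                           # keep all channels
        pooled.append(F.adaptive_avg_pool2d(patch, 1).flatten(1))  # GAP -> (B, C)
    return torch.cat(pooled, dim=1)     # v_i: a (B, n_anchors * C) feature vector
```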
D. Optimizing MGML-FENet
MGML-FENet applies a conventional cross-entropy loss to every branch during training. To make the network converge well, we allocate each loss a reasonable weight. As shown in Fig. 3, the whole objective function consists of four cross-entropy losses. We optimize MGML-FENet by minimizing
\begin{align*}&\hspace {-.5pc} L_{obj}\left ({ \boldsymbol {X}| \boldsymbol {Y}}\right ) = \lambda _{1} * L_{cn}\left ({ {\boldsymbol{P}_{\mathbf {mb}}}| \boldsymbol {Y}}\right ) + \lambda _{2} * L_{cn}\left ({ {\boldsymbol{P}_{\mathbf {ffb}}}| \boldsymbol {Y}}\right ) \\&+ \, \lambda _{3} * L_{cn}\left ({ {\boldsymbol{P}_{\mathbf {fem3}}}| \boldsymbol {Y}}\right ) + \lambda _{4} * L_{cn}\left ({ {\boldsymbol{P}_{\mathbf {fem4}}}| \boldsymbol {Y}}\right ) \tag{11}\end{align*}
where $L_{cn}$ is the cross-entropy loss, $\boldsymbol{Y}$ denotes the ground-truth labels, and $\lambda_{1}$–$\lambda_{4}$ are the loss weights.
During validation, MGML-FENet employs an ensemble learning method that integrates all predictions to vote for the final result. The predictions contain diverse information, including global information, multigranularity multilevel information, and high-level structural information. Equation (12) gives the final prediction $\boldsymbol{P}$:
\begin{equation*} \boldsymbol {P} = {\boldsymbol{P}_{\mathbf {mb}}} + {\boldsymbol{P}_{\mathbf {ffb}}} + {\boldsymbol{P}_{\mathbf {fem3}}} + {\boldsymbol{P}_{\mathbf {fem4}}}. \tag{12}\end{equation*}
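The training objective (11) and the soft vote (12) translate directly into code. A sketch follows; note that only the main-branch weight $\lambda_1 = 1.0$ is fixed by the paper (the other values below are placeholders), and since the paper does not specify whether (12) sums logits or softmax probabilities, we simply sum the raw branch outputs.

```python
import torch
import torch.nn.functional as F

def objective(p_mb, p_ffb, p_fem3, p_fem4, y,
              lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Sketch of eq. (11); only lambda_1 = 1.0 is fixed by the paper,
    the remaining weights here are placeholders."""
    preds = (p_mb, p_ffb, p_fem3, p_fem4)
    return sum(l * F.cross_entropy(p, y) for l, p in zip(lambdas, preds))

@torch.no_grad()
def soft_vote(p_mb, p_ffb, p_fem3, p_fem4):
    """Eq. (12): sum the four prediction vectors, then take the arg max."""
    return (p_mb + p_ffb + p_fem3 + p_fem4).argmax(dim=1)
```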
Experiments and Analysis
A. Datasets
In this article, we mainly evaluate our method on four benchmark datasets: UC-Merced [2], AID [1], NWPU-RESISC45 [3], and VGoogle [35]. The UC-Merced dataset contains 21 scene categories and a total of 2100 RGB images of 256 × 256 pixels (100 images per category). The AID dataset contains 30 categories and 10 000 images of 600 × 600 pixels. The NWPU-RESISC45 dataset contains 45 categories and 31 500 images of 256 × 256 pixels. VGoogle is a newer and larger dataset, described further in the Appendix.
B. Implementation Details
In this article, we use ResNet34 [11], [23], VGG16 [22], and DenseNet121 [34] as baseline models to make fair comparisons with previous methods; their detailed structures are shown in Table I. We select VGG16 because many previous methods use it to extract features. Compared to VGG16, ResNet34 performs better in image classification with fewer trainable parameters and FLOPs, so we select it as well. As for DenseNet121, the work in [50] mainly uses it as a baseline model, so we also choose it for a fair comparison.
During training, all input images are resized to a fixed size before being fed into the networks.
During the experiments, we apply fixed training settings to the baseline models and our proposed models. We use stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0005. The initial learning rate is set to 0.005, and the mini-batch size is 64. The total number of training epochs is 200, and the learning rate is divided by 10 at epochs 90 and 150. For all models, we adopt the ImageNet [19] pretraining strategy and fine-tune the models on the RS image datasets. All models are implemented in PyTorch on an NVIDIA GTX 1080Ti. Our code will soon be available online.
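For reproducibility, the reported schedule maps onto standard PyTorch components as below; `model` and `train_loader` are assumed to exist, `model` is assumed to return the four branch predictions, and `objective` is the loss sketch from Section III-D.

```python
import torch

# Sketch of the reported schedule: SGD (momentum 0.9, weight decay 5e-4),
# initial LR 0.005, batch size 64, 200 epochs, LR / 10 at epochs 90 and 150.
optimizer = torch.optim.SGD(model.parameters(), lr=0.005,
                            momentum=0.9, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[90, 150], gamma=0.1)
for epoch in range(200):
    for images, labels in train_loader:            # mini-batch size 64
        optimizer.zero_grad()
        p_mb, p_ffb, p_fem3, p_fem4 = model(images)
        loss = objective(p_mb, p_ffb, p_fem3, p_fem4, labels)  # eq. (11)
        loss.backward()
        optimizer.step()
    scheduler.step()
```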
C. Experimental Results
We conduct extensive experiments to show the performance of MGML-FENet. To evaluate our models, we use overall accuracy (OA) as the criterion, a commonly used metric in classification tasks. Previous methods use different networks as backbones; therefore, we apply the same backbones to make fair comparisons. To make the results more convincing, we compare performance against both previous models and baseline models.
In the RPM of MGML-FENet, we mainly adopt the seven-crop strategy because, by intuitive observation, RS images often carry important information in the middle "band" patches. We also compare "nine-crop" with "seven-crop" in the ablation study.
1) Classification on AID Dataset:
Following the settings of previous methods on the AID dataset, we randomly select 20% or 50% of the data for training and use the rest for testing. We run every experiment five times and report the mean and standard deviation of OA. The comparison results are shown in Table II. Taking VGG16 as the backbone [see Table II(a)], MGML-FENet performs better than the SOTA method, KFBNet [50]. In particular, at a 50% training ratio, MGML-FENet achieves 97.89% OA, surpassing KFBNet by 0.7%. With DenseNet121 as the backbone [see Table II(b)], MGML-FENet performs even more strongly: it achieves 96.45% and 98.60% OA, improving the SOTA accuracy by 0.82% and 1.2% at T.R. = 20% and 50%, respectively. We introduce ResNet34 as one of the backbones because it is proven better than VGG16 in image classification with far fewer trainable parameters and lower computation cost. The results in Table II clearly show that MGML-FENet (ResNet34) outperforms MGML-FENet (VGG16) and other previous VGG16-based methods; surprisingly, it even achieves better results than other DenseNet121-based models [see Table II(b)].
2) Classification on NWPU-RESISC45 Dataset:
NWPU-RESISC45 contains more images and categories than AID, so previous methods use 10% and 20% of the images for training. From Table II(a), MGML-FENet with a VGG16 backbone achieves SOTA results (from 92.95% to 93.36%) at a 20% training ratio. With DenseNet121 as the backbone [see Table II(b)], MGML-FENet nearly obtains the best accuracy of 95.39% (only 0.01% below FDPResNet [69]) at T.R. = 20%. Although MGML-FENet does not obtain SOTA results with VGG16 and DenseNet121 at a 10% training ratio, the gaps are small (0.14% and 0.17%).
3) Classification on UC-Merced:
The UC-Merced dataset has only 2100 images in 21 categories. The training ratio is 80%, which means that only 420 images serve as validation data. Table II shows that KFBNet achieves 99.88% and 99.76% classification accuracy using VGG16 and DenseNet121 as backbones, respectively; these results are close to 100%. MGML-FENet also achieves accuracy close to this ceiling. As mentioned before, we run every experiment five times and report the mean and standard deviation of OA. With the VGG16-based model, the five classification results are 99.76%, 99.76%, 99.76%, 100%, and 99.76%. With the ResNet34-based model, we likewise obtain 100% once and 99.76% four times. With the DenseNet121-based model, we obtain 100% one more time than with the above two models. Note that 99.76% accuracy means that only one image is assigned to the wrong category. From these comparisons, we observe that our method and KFBNet both approach the upper limit on the UC-Merced dataset and hold an obvious advantage over other previous methods.
All in all, MGML-FENets are stronger than previous methods on most datasets. Most importantly, MGML-FENets work extremely well with different backbones, which means that a much smaller network can be designed flexibly by choosing a smaller backbone.
D. Ablation Study
In our proposed models, we adopt different modules according to different motivations. To show the effectiveness of each module separately, we conduct additional ablation experiments. In this section, all experiments are run on the AID and NWPU-RESISC45 datasets.
1) Comparison With Baseline Models:
In the RS scene classification task, some notable baseline models already work well on their own. Besides comparing with previous SOTA models to show the superiority of our method, we also compare with the baseline models. In this article, we use VGG16, ResNet34, and DenseNet121 as baselines; Table I shows their detailed structures.
The comparison results between the baseline models and MGML-FENets are shown in Fig. 5 and Table III. On the AID and NWPU datasets, MGML-FENets clearly achieve better results. Taking VGG16 as the baseline, MGML-FENet improves by a large margin: on AID, it outperforms VGG16 by 0.98% and 0.82%, and on NWPU-RESISC45 it achieves 1.16% and 0.57% higher accuracy. Based on ResNet34, MGML-FENet still shows a large improvement; on NWPU-RESISC45 at a 10% training ratio, our model gains 1.04% (90.35% to 91.39%). With DenseNet121 as the baseline, the classification results are already at a high level, yet MGML-FENet gains further improvement: on NWPU-RESISC45, the margin is 0.83% and 0.65%. Moreover, with smaller sets of training samples, MGML-FENets perform comparatively better, which shows the robustness and effectiveness of our method.
Comparison between MGML-FENets and baseline models. The curves show the OA of our proposed models and the baseline models on the AID and NWPU-RESISC45 datasets at different training ratios.
To further verify the effectiveness of our proposed method, we add experiments with fewer training samples. As shown in Table IV, MGML-FENets outperform the baseline models (VGG16, ResNet34, and DenseNet121) by 1.14%, 0.66%, and 0.53%, respectively, on the AID dataset at a 10% training ratio. On NWPU-RESISC45 with T.R. = 5%, MGML-FENets outperform the baselines by 1.69%, 1.34%, and 0.58%, respectively. In most cases, our models surpass the baseline models by larger margins when fewer training samples are available, which means they are more competitive at low training ratios.
2) Effect of MGML-FFB and MGML-FEM:
To show the separate effect of MGML-FFB, we apply only the main branch and MGML-FFB to form the network. Fig. 3 shows that this network has only two predictions, $\boldsymbol{P}_{\mathbf{mb}}$ and $\boldsymbol{P}_{\mathbf{ffb}}$, to fuse during validation; the corresponding results are listed in Table III.
MGML-FEM is designed to extract high-level structural features. To show its effect, we directly add MGML-FEM to the baseline model and evaluate the classification performance. As shown in Table III, compared to the baseline models, networks with only MGML-FEM added show strong and stable performance, with higher mean OA and lower standard deviation. Compared to MGML-FFB, MGML-FEM is more stable and more efficient, with almost no extra computation cost.
3) Effect of Feature Ensemble Network:
Our proposed MGML-FENet is constructed by integrating the main branch (baseline model), MGML-FFB, and MGML-FEM. Table III clearly shows that integrating MGML-FFB and MGML-FEM gains better OA than applying either of them alone. With the ensemble learning strategy, the whole network utilizes four predictions to vote for the final result, and different branches provide predictions containing different features: the main branch focuses on extracting global features, MGML-FFB extracts multigranularity features at different levels of the network, and MGML-FEM utilizes the structural information of high-level features. With this feature ensemble strategy, MGML-FENets perform much more strongly and stably.
4) Seven-Crop Versus Nine-Crop:
In this article, we mainly adopt seven-crop in the RPM of both CS-FG and FC-FG because we observe that the typical features of RS images often appear in the "band" areas (the bands in the middle rows and middle columns). Compared to seven-crop, nine-crop is a more flexible region proposal method; according to Algorithm 1, it can easily be expanded to larger sliding-window grids.
To compare the performance of seven-crop and nine-crop, we apply the two region proposal approaches to MGML-FENets and keep all other settings unchanged. The comparison results on the AID and NWPU-RESISC45 datasets are shown in Table V. Although nine-crop is slightly weaker than seven-crop, it retains an advantage in flexibility and extensibility.
E. Visualization and Analysis
1) Convergence Analysis:
Training MGML-FENets amounts to minimizing the objective function defined in (11).
2) Feature Map Visualization and Analysis:
To intuitively interpret our proposed method, we visualize feature maps at different levels of the network. We select MGML-FENet (ResNet34) and run experiments on NWPU-RESISC45 with T.R. = 20%. After the model converges, we visualize feature maps to observe the attention regions. From Fig. 7, we analyze our method in the following five points.
Feature map visualization of MGML-FENet (ResNet34) on NWPU-RESISC45. The two images are randomly selected during testing and used to generate feature maps. The feature maps are selected from different levels of the network. In each feature map pair, the left map is the global feature map of the main branch and the right map is the corresponding reconstructed feature map in MGML-FFB.
First, CS-FG can extract multigranularity features that help reduce the negative influence of large intraclass variance. Following the explanation in [32], the global feature map tends to concentrate on a single dominant region, whereas the feature patches produced by CS-FG attend to multiple local regions at different granularities, making the representation less sensitive to scale and appearance changes within a class.
Second, our proposed networks integrate feature maps at different levels, which improves generalization ability. As shown in Fig. 7, feature maps at different levels of the network contain different information. In MGML-FENets, MGML-FFB and MGML-FEM both extract and fuse feature maps from different levels.
Third, MGML-FENets can obtain abundant fine-grained features through CS-FG, which helps the network learn the distinct characteristics of each category. For example, in the "Airplane" image, some feature patches (left top, right top, and so on) attend to the planes, the most distinctive element of the category "Airplane." Besides planes, some feature patches (right bottom, middle band in row, and so on) focus on the runway, which is also a significant characteristic of the category. In RS images, the planes in "Airplane" images are sometimes very small; in such cases, other fine-grained features such as the runway make a big difference for classification.
Fourth, RS images have high resolution and wide coverage, so extracting local patches helps the network filter redundant and confusing information. In Fig. 7, it is apparent that the attention regions in some feature patches become clearer (the color becomes warmer) than in the global feature map. For example, in the "Intersection" image, the feature maps usually spread equal attention intensity over the road edges and corners, which lowers the contrast. Using local feature patches enhances the attention intensity in different local regions: the "right bottom" patch focuses only on the edge information of the right-bottom road corner, and the "middle band in column" patch focuses on the edge information of the horizontal road. All in all, extracting local patches enhances the attention intensity and yields enhanced fine-grained features through adaptive pooling on smaller local patches with less interference.
Last but not least, the channel-separate strategy guides the global feature maps to develop different focuses, which makes the networks compact and efficient. Specifically, the channel-separate strategy forces the network to recognize scenes through a group of local feature patches, with only a few channels allotted to each local patch. Through experiments and visualization (see Fig. 7), we find that the global feature maps tend to share attention regions and patterns with the corresponding feature patches. This is positive because abundant feature representation improves network performance.
3) Predictions Visualization Analysis With T-SNE:
Inspired by the ensemble learning method, we assume that the final voting accuracy becomes higher if the four predictions provide diverse and accurate results. To intuitively show the distribution patterns of the four predictions, we apply the T-SNE [73] method to visualize and analyze the prediction vectors of each branch.
Visualization results of the predictions of MGML-FENet. We reduce the 45-D prediction vectors to 2-D by T-SNE. We use MGML-FENet (ResNet34) on NWPU-RESISC45 and randomly select 512 testing samples to visualize. (a)–(e) represent the visualization results of the five predictions ($\boldsymbol{P}_{\mathbf{mb}}$, $\boldsymbol{P}_{\mathbf{ffb}}$, $\boldsymbol{P}_{\mathbf{fem3}}$, $\boldsymbol{P}_{\mathbf{fem4}}$, and the final prediction $\boldsymbol{P}$).
From Fig. 8, we analyze the following three points. First, all four predictions give reasonable classification results on the 45 categories: even though some samples remain confusing and hard to classify, the category clusters are clear. Second, the cluster maps of the four predictions have diverse patterns, which helps the network deal with confusing samples. Third, the final prediction ($\boldsymbol{P}$) shows the clearest cluster structure, which intuitively confirms the benefit of fusing the four predictions.
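The visualization itself is straightforward to reproduce. Below is a sketch using scikit-learn's T-SNE, assuming `preds` is an (N, 45) array of one branch's prediction vectors for N testing samples and `labels` holds their ground-truth categories; both names are hypothetical.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Reduce the 45-D prediction vectors to 2-D, as in Fig. 8.
emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(preds)
plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab20", s=8)
plt.title("T-SNE of branch predictions (45 categories)")
plt.show()
```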
4) Computation Cost Analysis:
Compared to baseline models, MGML-FENets have a higher computation cost at inference time. MGML-FFB introduces more "conv layers," which causes more convolution operations. However, in MGML-FFB, feature maps at each level of the network are cropped into several feature patches and recombined by CS-FG; the new feature maps have the same number of channels but a smaller spatial scale than the original ones, so the computation increment is restrained. We list the computation cost comparison in Table VI.
MGML-FENets have a higher computation cost than the baseline models. As Table II shows, MGML-FENets achieve accuracy improvements by a large margin (more than 1% in some cases), even though some extra inference computation is introduced. In practical applications, the computation cost often needs to be controlled; "baseline + MGML-FEM" networks are then the more efficient choice. From Tables II and VI, "baseline + MGML-FEM" networks gain an average OA improvement of about 0.4% with almost no extra computation cost.
5) Hyperparameter Setting:
Our method involves four hyperparameters, the loss weights $\lambda_{1}$–$\lambda_{4}$ in (11). The weight of the main branch, $\lambda_{1}$, is fixed to 1.0, and we investigate the other three experimentally.
Comparison results of different hyperparameter settings. We use MGML-FENet (ResNet34) on NWPU-RESISC45 with T.R. = 10%. The loss weight of the main branch ($\lambda_{1}$) is set to 1.0, so we conduct experiments to investigate the other three hyperparameters.
6) Future Work Analysis:
In this article, we focus on exploring the representation ability of multigranularity features and provide a new multilevel feature fusion method. Our method overcomes the large intraclass variance and confusing information of high-resolution RS scene images covering large geographic areas. In the future, we will extend our method to more complex RS tasks, such as RS image object detection [5], [6] and RS image segmentation [7], [8]. Recently, RS scene tasks based on high-resolution UAV images [74], [75] have become popular. High-resolution UAV images are distributed with great heterogeneity and high density, which challenges existing algorithms. Our method preserves detailed information and suppresses confusing information, which gives it an advantage in handling UAV images. All in all, extending our method to detection or segmentation tasks on UAV data will be our next research focus.
Conclusion
In this article, we design MGML-FENet to tackle the RS scene classification task. In MGML-FENet, the main branch maintains useful global features, MGML-FFB extracts multigranularity features and explores fine-grained features at different levels of the network, and MGML-FEM utilizes high-level features with structural information. Specifically, we propose two important modules, CS-FG and FC-FG, to extract feature patches and recombine them. Extensive experiments show that the proposed networks outperform previous models and achieve SOTA results on notable benchmark datasets for the RS scene classification task. In addition, visualization results show that our proposed networks are reasonable and interpretable. In the future, we will focus on extending our method to more complex RS scene tasks (e.g., UAV image object detection) to further improve its generalization ability.
Appendix
Classification Results on VGoogle Dataset
Compared to AID, NWPU-RESISC45, and UC-Merced, VGoogle is a newer RS dataset containing more samples. We evaluate our method on VGoogle to further demonstrate the general performance of MGML-FENets. We select ResNet34 as the baseline model and run experiments with 5% and 10% training ratios; the comparison results are reported in Table VII. At the low training ratio (5%), MGML-FENet (ResNet34) performs clearly better, with a 0.77% OA improvement. At a 10% training ratio, MGML-FENet (ResNet34) also achieves better OA. The results on VGoogle confirm that our proposed MGML-FENets generalize well.
To intuitively show the superiority of MGML-FENet on the VGoogle dataset, we compare the confusion matrices of the baseline model (ResNet34) and MGML-FENet (ResNet34). As shown in Fig. 10, with T.R. = 5%, we use the two models to generate confusion matrices on the validation data. The comparison makes clear that MGML-FENet (ResNet34) makes fewer errors than the baseline model, which intuitively demonstrates the effectiveness of our method. In particular, the classification accuracy of "cemetery" improves from 91% to 97%, that of "dense_residential" improves from 93% to 95%, and that of "sparse_residential" improves by a large margin (79% to 87%). These categories often contain confusing information (e.g., "sparse_residential" images have geographic elements similar to "coastal_mansion" or "nursing_home" images). Compared to the baseline model, MGML-FENet shows a strong ability to overcome such confusion.
Confusion matrices obtained by different models on the VGoogle dataset with T.R. = 5%. The left matrix is obtained by the baseline model (ResNet34) and the right one by MGML-FENet (ResNet34).