
Dynamic Informative Proposal-Based Iterative Training Procedure for Weakly Supervised Object Detection in Remote Sensing Images


Abstract:

Weakly supervised object detection (WSOD) is an increasingly important task in remote sensing images. However, mainstream WSOD methods often rely on low-quality proposals due to the complex backgrounds of remote sensing images. Moreover, applying strong data augmentations directly in WSOD methods can introduce significant noise, which can hinder training procedures that rely only on image-level ground truth labels. To address these issues, we propose a dynamic informative proposal-based iterative WSOD training procedure. Specifically, we implement an informative proposal reconstruction (IPR) method to generate more informative proposals dynamically. We also use a proposal-based contrastive learning (PBCL) technique to steadily improve the quality of generated proposals. In addition, we employ a pseudolabel learning-based multistage (PLMS) training procedure to progressively improve the quality of new informative proposals while alleviating the noise catastrophe caused by strong data augmentations. Extensive experiments demonstrate the effectiveness of our proposed method in generating higher quality proposals and enhancing model generalization. Our method achieves state-of-the-art results on the DIOR, Northwestern Polytechnical University (NWPU) VHR-10.v2, and HRSC2016 remote sensing datasets.
Page(s): 6614 - 6626
Date of Publication: 14 July 2023


SECTION I.

Introduction

Object detection in remote sensing images has experienced significant advancements with the introduction of specially designed models. However, unlike natural images, remote sensing images often contain a large number of densely clustered objects, making it difficult to obtain the accurate object-level annotations required for fully supervised object detection tasks. As a result, weakly supervised object detection (WSOD) methods have emerged as an alternative solution that utilizes only image-level annotations, reducing labor costs, and gaining increasing attention in recent years.

Currently, mainstream WSOD methods [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18] follow a standardized training process. The backbone convolutional neural network (CNN) of the WSOD framework extracts proposal features, which are then scored according to their class and objectness. The proposals themselves are obtained using proposal extraction methods [19], [20], [21] and cover almost all instances in the image. However, the complex background of remote sensing images often leads to poor-quality proposals, as shown in Fig. 1(b). In addition, these inferior proposals remain unchanged during the entire training process, negatively affecting the parameter updates of the neural network.

Fig. 1. - (a) Remote sensing image with complex background. (b) Low-quality original proposals generated by proposal extraction algorithm. (c) More informative proposals generated by our method. (d) Detection result of our method.

Moreover, in fully supervised object detectors such as Faster RCNN [22], strong data augmentations can be used to enhance the generalization ability of models by increasing the diversity of the training data. However, because image-level annotations contain less information than object-level annotations, strong data augmentations can introduce additional noise that hinders the detection performance of neural networks on complex remote sensing images. Thus, resolving this dilemma between noise and data diversity is an urgent task.

To deal with the challenge of generating informative proposals dynamically and the dilemma of whether or not to employ strong data augmentations, a dynamic informative proposal-based iterative training procedure is proposed. Specifically, an informative proposal reconstruction (IPR) module is designed to reconstruct new proposals from original inferior ones generated by a proposal extraction algorithm in every training iteration. The comprehensive similarities between the original proposals are calculated in consideration of proposal feature similarity and spatial correlation. Then, a similarity-proposal-cluster generation algorithm (SPC) is proposed to leverage these comprehensive similarities, along with the coarse predictions obtained by a multiple instance learning (MIL) branch in a WSOD framework, to divide original proposals into a group of clusters. Finally, more informative proposals are reconstructed from these clusters dynamically using the feature map of the backbone neural network. To progressively improve the quality of generated proposals during training, we propose a proposal-based contrastive learning (PBCL) module to increase the feature difference between objects and background noise. After constructing positive and negative sample sets for each cluster, a proposal-based contrastive loss function is employed to guide the neural network to discriminate objects and background noise. To employ strong data augmentations without destroying the training and continuously explore the potential of the model, we implement a pseudolabel learning-based multistage (PLMS) training procedure. Strong data augmentations are available except in the first stage, and the predictions in the previous stage serve as pseudolabels to supervise the model in the next stage for output consistency.

We perform extensive experiments on three remote sensing image datasets, DIOR, NWPU VHR-10.v2, and HRSC2016, which verify the effectiveness of our method. Our main contributions are summarized as follows.

  1. A dynamic informative proposal-based iterative training procedure is proposed to improve the quality of proposals and address the training dilemma caused by strong data augmentations.

  2. We propose two modules, IPR and PBCL, which dynamically generate informative proposals during training to enhance the training effect. This approach leads to more accurate box predictions during testing.

  3. A PLMS training procedure is proposed to leverage strong data augmentations to progressively increase the generalization of the model without noise domination.

  4. Our proposed method achieves state-of-the-art detection performance on three datasets, including DIOR, NWPU VHR-10.v2, and HRSC2016.

The rest of this article is organized as follows. Section II reviews related work. In Section III, we introduce the proposed method in detail. Section IV provides an evaluation of our method on three remote sensing image datasets, along with an analysis of its effectiveness. Finally, Section V concludes this article.

SECTION II.

Related Work

A. Weakly Supervised Object Detection

WSOD models trained under the supervision of image-level labels are capable of predicting object-level results. In recent years, deep MIL-based WSOD methods have been studied intensively. Bilen and Vedaldi [1] first proposed a two-stream weakly supervised deep detection network (WSDDN) that performs region selection and classification simultaneously. Tang et al. [2], [3] proposed refinement classifiers to refine predictions recurrently. Zeng et al. [4] considered bottom-up and top-down objectness jointly to design a tailored training mechanism. Cheng et al. [5] combined selective search [19] with Grad-CAM [23] to generate more proposals with higher IoU than greedy search methods. Ren et al. [6] proposed a learnable concrete DropBlock together with an instance-aware self-training algorithm. Recently, several studies [14], [15], [16] have utilized transfer learning techniques by leveraging an external fully annotated source dataset to enhance the detection performance of WSOD.

To address the challenges posed by small, abundant, and densely clustered targets in remote sensing images with complex and noisy backgrounds, numerous approaches have been proposed to enhance performance. Feng et al. [7] focused the network on potential instances by combining local and global context information. Feng et al. [8] proposed a triple context-aware network to learn complementary and discriminative visual patterns for WSOD. Yao et al. [9] designed a dynamic curriculum learning strategy to learn the detector step by step. Feng et al. [10] proposed a self-supervised adversarial and equivariant network to learn complementary visual patterns. Feng et al. [11] proposed a rotation-invariant network to improve the generalization of models. Guo et al. [12] introduced self-attention and cross-attention layers into the WSOD framework to establish the context relation between proposals. Tan et al. [13] designed an oriented detector to predict oriented bounding boxes for aerial images. Moreover, Liu et al. [17] proposed a decoupled classification localization network to mitigate conflicts between the classification and localization branches. Zhang et al. [18] proposed a domain adaptive approach named hierarchical similarity alignment (HSA) to address ship targets in SAR images. However, all the above methods use the original, unchanged proposals generated by proposal extraction methods, which are of low quality because of the complex background in remote sensing images and limit the ultimate performance of WSOD. In contrast, the approach proposed in this article utilizes informative proposals generated dynamically during training and implements a multistage training procedure to enhance generalization through strong data augmentations together with pseudolabel learning.

B. Contrastive Learning

Contrastive learning is an important part of self-supervised representation learning, seeking to embed similar instances nearby in the latent space while embedding dissimilar ones far apart. Instancewise contrastive methods achieve this goal by maximizing the mutual information between instances [24], [25], [26] or by designing predictive pretext tasks free from negative sampling [27], [28], [29]. Prototypical contrastive methods contrast either between correlated and uncorrelated prototype representations of image clusters [30], [31] or between associated and unassociated instance-prototype pairs [32], [33]. In our method, contrastive learning is introduced to improve the representation ability of the feature extraction part of the WSOD framework, thus improving the quality of the newly generated informative proposals.

C. Pseudolabel Learning

Pseudolabel learning utilizes different approaches to generate pseudolabels, which are regarded as new supervision to train the models with various versions of the input. This method is widely used in semisupervised tasks, especially in the semisupervised object detection (SSOD) task. The authors in [34], [35], and [36] produced pseudolabels by ensembling the predictions from different data augmentations. Yang et al. [37] used strong augmentations on unlabeled data, while weak augmentations were used to produce stable pseudolabels. Liu et al. [38] employed the exponential moving average (EMA) teacher [39] to produce more accurate pseudolabels. Yang et al. [40] utilized multiple detection heads to improve the accuracy of pseudolabels. Humble-Teacher [41] utilized soft pseudolabels for unlabeled data. In [42], certainty-aware pseudolabels were tailored to increase the quality of supervision. Chen et al. [43] proposed an adaptive filtering (AF) strategy to assign fine-grained pseudolabels to each pixel, thus improving the generalization performance of the model. In this article, we introduce pseudolabel learning to iteratively train the model under the supervision of predictions from the previous stage. This approach effectively mitigates the impact of noise caused by strong data augmentations during training.

SECTION III.

Method

A. Overview of the Proposed Method

The overall architecture of the proposed framework is depicted in Fig. 2. It consists of three modules designed to generate high-quality and informative proposals while improving the generalization performance of the model: the IPR module, the PBCL module, and the PLMS training procedure. Our method is built on the mainstream WSOD framework.

Fig. 2. - Overall architecture of our method, including three parts named IPR, PBCL, and PLMS. Blue arrows are used in every stage. Green arrows are available only in the first stage. Yellow arrows are available in every stage except the first. Blue dashed arrows are offline operations performed either before or after the training procedure of a stage. Note that the strong data augmentations (SDA) of proposals and images are applied in a pairwise way, and that the connection between the IPR and MIL parts is omitted for figure clarity.

As in the mainstream WSOD framework, each input image I is processed by a backbone network to obtain feature maps. Before training, about 2000 proposals B = \lbrace b_{i}\rbrace are generated for each image by proposal extraction methods. Region of interest (RoI) pooling [44] and fully connected (Fc) layers are applied to the proposals and feature maps to obtain RoI-features. These proposal features are then sent into two streams, named the MIL branch and the refinement branch. The MIL branch performs classification for each proposal to obtain a score vector, while the refinement branch performs detection. To train the MIL branch, the standard multiclass cross entropy loss is employed
\begin{equation*} L_{MIL} = -\sum _{c=1}^{C}\left\lbrace y_{c}\log \phi _{c} + (1 - y_{c})\log (1-\phi _{c})\right\rbrace \tag{1} \end{equation*}
where \phi _{c} is the image-level score for category c, obtained by weighted sum pooling of the score vectors over all proposals so that it lies in the range (0, 1), and y_{c} = 1 or 0 indicates whether or not the image contains an object of category c.
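As a rough illustration of (1), the following minimal sketch evaluates the MIL loss from image-level scores and labels. The function name `mil_loss` and the clamping constant `eps` are assumptions made for this example and are not part of the described method; how \phi _{c} is pooled from the proposal scores is left outside the sketch.

```python
import math


def mil_loss(phi, y, eps=1e-7):
    """Cross-entropy of (1): phi[c] is the image-level score in (0, 1), y[c] is 0 or 1."""
    loss = 0.0
    for phi_c, y_c in zip(phi, y):
        # Clamp away from 0 and 1 so both logarithms stay finite
        # (a numerical safeguard added for this sketch).
        phi_c = min(max(phi_c, eps), 1.0 - eps)
        loss -= y_c * math.log(phi_c) + (1 - y_c) * math.log(1.0 - phi_c)
    return loss
```

The loss decreases as the scores of the present categories approach 1 and those of the absent categories approach 0.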

For the refinement branch, we adopt the refinement loss
\begin{equation*} L_{REF}^{t} = -\frac{1}{|B|}\sum _{r=1}^{|B|}\sum _{c=1}^{C+1}w_{r}^{t}y_{cr}^{t}\log x_{cr}^{R_{t}} \tag{2} \end{equation*}
where t denotes the index of the tth refinement classifier, w_{r}^{t} denotes the loss weight, which is equal to the confidence score of proposal b_{r} obtained from the MIL branch or the (t-1)th refinement classifier, y_{cr}^{t} is the pseudoground truth for proposal b_{r} on class c, and x_{cr}^{R_{t}} represents the prediction score for proposal b_{r} on class c in the tth refinement classifier. Moreover, |B| and T denote the number of proposals in the input image and the number of classifiers in the refinement branch, respectively.

In our method, the proposal features obtained after RoI pooling are leveraged by IPR to generate more informative proposals by introducing a similarity that considers both feature and location. These newly generated proposals, named B_{new}, are sent back to the WSOD framework in the same way as the original proposals B. This enables the neural network to focus on more informative regions with fewer small boxes and partial biases, leading to improved object detection performance. The new loss functions for both B and B_{new} are defined as follows:
\begin{align*} L_{MIL}^{*} &= L_{MIL}(B) + \epsilon L_{MIL}(B_{new}) \tag{3} \\ L_{REF}^{*} &= \sum _{t=1}^{T}\left[L_{REF}^{t}(B) + \epsilon L_{REF}^{t}(B_{new})\right] \tag{4} \end{align*}
where \epsilon is the parameter that balances the contributions of the original and newly generated proposals.

The PBCL module is applied after IPR to enlarge the feature distances between background noise and potential objects in the semantic feature space. To train this module, a proposal-based contrastive loss L_{C}^{j} is defined for each cluster as in (13), and the total contrastive loss over the K clusters generated by IPR is defined as follows:
\begin{equation*} L_{CON} = \frac{1}{K}\sum _{j=1}^{K}L_{C}^{j}. \tag{5} \end{equation*}

PLMS is utilized to improve the generalization ability of the model by resolving the tradeoff between strong augmentations and the noise they introduce. The training of our proposed model is multistage, and only the model from the final stage is saved and utilized for object detection during testing. The loss function in the first stage is formulated as
\begin{equation*} L_{1} = L_{MIL}^{*}(\theta _{1}) + L_{REF}^{*}(\theta _{1}) + \lambda L_{CON}(\theta _{1}) \tag{6} \end{equation*}
where L_{MIL}^{*} and L_{REF}^{*} are formulated as in (3) and (4), \lambda is the loss weight of the contrastive part, and \theta _{1} denotes the parameters of the neural network in the first stage.

Similarly, the loss in stage i (i > 1) is defined as
\begin{equation*} L_{i} = L_{MIL}^{*}(\theta _{i}) + L_{REF}^{*}(\theta _{i}) + \lambda L_{CON}(\theta _{i}) + \omega L_{PSE}(\theta _{i}) \tag{7} \end{equation*}
where \omega is the loss weight of the pseudolabel supervision part. The total pseudolabel refinement loss L_{PSE} is calculated as in (16).
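To make the composition of the objectives in (3), (4), (6), and (7) concrete, the sketch below combines already-computed scalar loss terms. The function names are hypothetical; the default weights follow the values reported in the implementation details (\lambda = 0.1, \epsilon = 1, \omega = 1).

```python
def combined_loss(loss_original, loss_new, epsilon=1.0):
    """(3) and (4): loss on the original proposals B plus epsilon times the loss on B_new."""
    return loss_original + epsilon * loss_new


def stage_loss(l_mil, l_ref, l_con, l_pse=None, lam=0.1, omega=1.0):
    """(6) when l_pse is None (first stage), otherwise (7) for the later stages."""
    total = l_mil + l_ref + lam * l_con
    if l_pse is not None:
        # The pseudolabel term only exists from the second stage onward.
        total += omega * l_pse
    return total
```

For instance, `stage_loss(1.0, 2.0, 0.5)` evaluates the first-stage objective, while passing `l_pse` adds the pseudolabel supervision used in the later stages.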

B. Informative Proposal Reconstruction

The proposals generated by the proposal extraction algorithm are of low quality and remain unchanged for each image during training. These substandard proposals consist of numerous inconsequential small boxes and background noise, which impede the network from learning the correct feature representation of objects. This section proposes an IPR method that utilizes the original proposals to dynamically produce more informative proposals with less background noise during training. The following steps explain this process in detail.

For each original proposal b_{i}, the corresponding region of interest (RoI) feature \boldsymbol{p}_{i} is cropped from the feature map M generated by the backbone neural network. To determine whether these RoI-features come from the same object, a simple approach is to calculate their cosine similarity
\begin{equation*} fea(i, j) = \frac{\boldsymbol{p}_{i}^{T} \boldsymbol{p}_{j}}{\left\Vert \boldsymbol{p}_{i} \right\Vert \cdot \left\Vert \boldsymbol{p}_{j} \right\Vert } \tag{8} \end{equation*}
where i and j are the indexes of two proposals. The cosine similarity between features reflects the level of semantic consistency and is commonly utilized in contrastive learning approaches such as [24]. However, when relying only on feature similarity, the model fails to differentiate instances in an image that belong to the same category but are distinct from one another.

To address this problem, the spatial correlation is defined by calculating the distance intersection over union (DIoU) [45]
\begin{align*} spa(i, j) &=IoU\left(b_{i},b_{j}\right) - \frac{d(i,j)^{2}}{e(i,j)^{2}} \tag{9} \\ d(i,j) &= \left(c_{xi} - c_{xj}\right)^{2} + \left(c_{yi}-c_{yj}\right)^{2} \tag{10} \end{align*}
where c_{xi} and c_{yi} \in [0, 1] denote the normalized coordinates of the central point of proposal b_{i}, and e(i,j) is the diagonal length of the minimum circumscribed rectangle of the two proposals b_{i} and b_{j}.

The comprehensive similarity of two proposals is defined as
\begin{equation*} sim(i, j) = fea(i, j) + \beta \cdot spa(i, j) \tag{11} \end{equation*}
where \beta is the weight of the spatial correlation part.
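The comprehensive similarity of (8)-(11) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: boxes are given as (x1, y1, x2, y2) in normalized coordinates with nonzero area, the function names are hypothetical, and the spatial term follows the standard DIoU form (squared center distance divided by the squared diagonal of the enclosing rectangle), which is an assumption about how (9) and (10) are meant to combine.

```python
import math


def cosine_similarity(p_i, p_j):
    """fea(i, j) in (8) for two RoI-feature vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(p_i, p_j))
    norm_i = math.sqrt(sum(a * a for a in p_i))
    norm_j = math.sqrt(sum(b * b for b in p_j))
    return dot / (norm_i * norm_j)


def spatial_correlation(box_i, box_j):
    """spa(i, j): IoU minus the normalized squared center distance (standard DIoU)."""
    inter_w = max(0.0, min(box_i[2], box_j[2]) - max(box_i[0], box_j[0]))
    inter_h = max(0.0, min(box_i[3], box_j[3]) - max(box_i[1], box_j[1]))
    inter = inter_w * inter_h
    area_i = (box_i[2] - box_i[0]) * (box_i[3] - box_i[1])
    area_j = (box_j[2] - box_j[0]) * (box_j[3] - box_j[1])
    iou = inter / (area_i + area_j - inter)
    # Squared distance between the two box centers.
    dx = (box_i[0] + box_i[2]) / 2 - (box_j[0] + box_j[2]) / 2
    dy = (box_i[1] + box_i[3]) / 2 - (box_j[1] + box_j[3]) / 2
    center_dist_sq = dx * dx + dy * dy
    # Squared diagonal of the minimum rectangle enclosing both boxes.
    enc_w = max(box_i[2], box_j[2]) - min(box_i[0], box_j[0])
    enc_h = max(box_i[3], box_j[3]) - min(box_i[1], box_j[1])
    diag_sq = enc_w * enc_w + enc_h * enc_h
    return iou - center_dist_sq / diag_sq


def comprehensive_similarity(p_i, p_j, box_i, box_j, beta=0.1):
    """sim(i, j) in (11); beta = 0.1 is the value reported in the implementation details."""
    return cosine_similarity(p_i, p_j) + beta * spatial_correlation(box_i, box_j)
```

Two proposals of the same category that lie far apart receive a negative spatial term, which lowers their comprehensive similarity even when their features are nearly identical.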

Once the comprehensive similarity is obtained, the original proposals can be clustered into multiple sets. To incorporate the ground truth label information (which is only available at the image level), we propose the similarity-proposal-cluster generation algorithm (SPC), which is presented in Algorithm 1. Specifically, the MIL branch produces a score vector \boldsymbol{q}_{i} \in \mathbb {R}^{c} from the RoI-feature \boldsymbol{p}_{i}, where c is the total number of categories and q_{i}^{m} is the probability that proposal b_{i} belongs to category m. For each category m that exists in the ground truth labels, the following steps are performed. First, the proposal with the highest probability value q_{i}^{m} is found and regarded as a cluster center with index Center_{j}. Then, the proposals whose comprehensive similarities with this cluster center are higher than \tau are assigned to the same cluster. Assigned proposals are not considered in subsequent steps, and the search for new cluster centers continues until all proposals have been assigned or the remaining ones have probability values lower than \psi. We repeat these steps to obtain new clusters \lbrace C_{k}\rbrace until all ground truth classes have been processed.

Algorithm 1: SPC.

Input: Proposals \lbrace b_{i}\rbrace _{i=1}^{|B|}, comprehensive similarity set \lbrace sim(i,j)\rbrace _{i,j\in [1,|B|]}, score vector set \lbrace \boldsymbol{q}_{i}\rbrace _{i\in [1,|B|]}, thresholds \tau and \psi, iteration index j.

Output: Similarity proposal cluster set C=\lbrace C_{k}\rbrace _{k=1}^{K}

1: initialize j \leftarrow 1
2: for each ground truth class m do
3:     while not all proposals have been assigned do
4:         Center_{j} \leftarrow \arg \max _{i} q_{i}^{m} over the unassigned proposals
5:         if \max _{i} q_{i}^{m} < \psi then
6:             break
7:         C_{j} \leftarrow \lbrace Center_{j}\rbrace
8:         for each unassigned proposal i do
9:             if sim(i, Center_{j}) > \tau then C_{j} \leftarrow APPEND(C_{j}, i)
10:        for ind in C_{j} do
11:            exclude b_{ind} from the rest of this while loop
12:        j \leftarrow j+1
13: return C = \lbrace C_{1}, C_{2},{\ldots }, C_{K}\rbrace
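Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, assuming that `sim` is a full |B| x |B| similarity matrix stored as nested lists, that `scores[i][m]` holds q_{i}^{m}, and that the set of assigned proposals is reset for every ground truth class; the function name `spc` is hypothetical.

```python
def spc(sim, scores, gt_classes, tau=0.9, psi=1e-5):
    """Group proposal indexes into clusters around high-scoring centers."""
    clusters = []
    for m in gt_classes:
        unassigned = set(range(len(scores)))
        while unassigned:
            # The unassigned proposal with the highest score for class m becomes the center.
            center = max(sorted(unassigned), key=lambda i: scores[i][m])
            if scores[center][m] < psi:
                break
            # Every unassigned proposal similar enough to the center joins its cluster.
            cluster = [center] + [
                i for i in sorted(unassigned)
                if i != center and sim[i][center] > tau
            ]
            clusters.append(cluster)
            unassigned -= set(cluster)
    return clusters
```

With the thresholds reported in the implementation details (\tau = 0.9 and \psi = 0.00001), only proposals that are very similar to a center are merged, and the loop stops once the remaining proposals have negligible scores for the class.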

After SPC is run, the generated clusters contain proposals that have high comprehensive similarity with each other. Our objective is to generate more informative new proposals dynamically using these clusters during training. As depicted in Fig. 3, for a cluster named C_{j}, we calculate the minimum circumscribed rectangle that encloses all the proposals in C_{j}. We define this rectangle as the proposal generation area A_{j}. Then, we follow steps similar to those in WSODet [13]. The region corresponding to A_{j} is cropped from the feature map M \in {\mathbb {R}^{w \times h \times c}} and converted into a matrix M_{j}^{*} by summing the values over all channels. We calculate an adaptive threshold [46] to convert this matrix into a binary matrix. Then, all the contours in this binary matrix are found and enclosed with minimum circumscribed rectangular boxes. These boxes are the informative proposals B_{new}, which take into account the MIL prediction, the feature map attention, and the comprehensive similarity.
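The reconstruction step can be approximated with the sketch below. It makes several simplifying assumptions that are not taken from the method description: the feature map is indexed as `feature_map[row][col][channel]`, proposal boxes are integer (x1, y1, x2, y2) coordinates on the feature map with exclusive right and bottom edges, the mean of the cropped matrix stands in for the adaptive threshold of [46], and 4-connected components stand in for the contour search. The function names are hypothetical.

```python
def enclosing_box(boxes):
    """Minimum circumscribed rectangle A_j of all proposals in one cluster."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def reconstruct_proposals(feature_map, boxes):
    """Return new boxes found inside the area spanned by one cluster."""
    x1, y1, x2, y2 = enclosing_box(boxes)
    # Crop the area and sum over channels to obtain the matrix M_j^*.
    area = [[sum(feature_map[y][x]) for x in range(x1, x2)] for y in range(y1, y2)]
    values = [v for row in area for v in row]
    threshold = sum(values) / len(values)  # stand-in for the adaptive threshold
    mask = [[v > threshold for v in row] for row in area]
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    new_boxes = []
    for sy in range(height):
        for sx in range(width):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Flood-fill one connected region and track its bounding rectangle.
            stack = [(sy, sx)]
            seen[sy][sx] = True
            min_x = max_x = sx
            min_y = max_y = sy
            while stack:
                y, x = stack.pop()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < height and 0 <= nx < width and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Shift back from crop coordinates to feature-map coordinates.
            new_boxes.append((x1 + min_x, y1 + min_y, x1 + max_x + 1, y1 + max_y + 1))
    return new_boxes
```

Because the threshold is recomputed from the current feature map in every iteration, the boxes returned for the same cluster change as the backbone is updated, which is what makes the new proposals dynamic.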

Fig. 3. - Procedure of generating an informative proposal from a group of proposals in the same cluster. MCR means minimum circumscribed rectangular box.

C. Proposal-Based Contrastive Learning

Due to the dynamic parameter updates of neural networks during training, both the feature map M and generated B_{new} undergo changes. In order to progressively improve the quality of generated proposals, we have designed a PBCL module, which is incorporated into the IPR module, as depicted in Fig. 2.

In this module, we construct a contrastive proposal sample set for each cluster C_{j} = \lbrace b_{Center_{j}}, b_{j}^{2},{\ldots }, b_{j}^{N_{j}}\rbrace obtained in Algorithm 1 through the following steps. First, for a specific cluster C_{j} with a center proposal b_{Center_{j}}, we randomly select another proposal b_{j}^{*} from C_{j}. As b_{Center_{j}} and b_{j}^{*} exhibit high comprehensive similarity, they can be considered as two different feature patterns of one target and are thus defined as a positive sample pair, denoted by \lbrace pos_{j}^{i}\rbrace _{i=1}^{2}. Next, we identify negative proposals as those proposals among all clusters whose comprehensive similarities with b_{Center_{j}} are lower than a negative threshold \delta. We then randomly select N_{neg} proposals from these negative proposals and define them as a negative sample set, denoted by \lbrace neg_{j}^{i}\rbrace _{i=1}^{N_{neg}}, where
\begin{equation*} N_{neg}=\left\lbrace \begin{array}{cc}N_{neg} & N_{j} \geq N_{neg}\\ N_{j} & N_{j} < N_{neg} \end{array}\right. \tag{12} \end{equation*}
and N_{j} is the number of proposals in cluster C_{j}. The proposal-based contrastive loss function is formulated based on InfoNCE [47] to compute the contrastive loss for a cluster
\begin{equation*} L_{C}^{j} = -\log \frac{e^{sim\left(pos_{j}^{1},pos_{j}^{2}\right) / \kappa }}{e^{sim\left(pos_{j}^{1},pos_{j}^{2}\right)/ \kappa } + \sum _{n=1}^{N_{neg}}e^{sim\left(pos_{j}^{1}, neg_{j}^{n}\right)/ \kappa }} \tag{13} \end{equation*}
where \kappa is the temperature hyperparameter as in [26].
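The loss in (13) and its average over clusters in (5) can be sketched as below, assuming the similarities between the positive pair and between the center and each negative sample have already been computed. The function names and the default temperature `kappa=0.1` are illustrative assumptions; the value of \kappa is not stated in this section.

```python
import math


def proposal_contrastive_loss(sim_pos, sim_negs, kappa=0.1):
    """L_C^j in (13) for one cluster.

    sim_pos: similarity between the two positive samples.
    sim_negs: similarities between the center and each negative sample.
    """
    pos = math.exp(sim_pos / kappa)
    neg = sum(math.exp(s / kappa) for s in sim_negs)
    return -math.log(pos / (pos + neg))


def total_contrastive_loss(cluster_losses):
    """L_CON in (5): the mean of the per-cluster losses."""
    return sum(cluster_losses) / len(cluster_losses)
```

The loss shrinks as the positive pair becomes more similar than the negatives, which is the behavior that pulls features of the same object together and pushes background features away.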

Under the guidance of contrastive loss, the representation ability of the backbone neural network is progressively improved. This is because the features of the same objects are brought closer together in feature space, while background noise is pushed further away. As a result, the effectiveness of the SPC generation algorithm is improved, resulting in a more accurate contrastive proposal sample set.

D. Pseudolabel Learning-Based Multistage Training Procedure

Weak data augmentations, such as flipping and size scaling, are widely used in existing WSOD methods to increase data diversity. However, due to the weak supervision provided in WSOD, strong data augmentations (such as cropping, affine transformation, and mosaic), which improve data diversity more effectively, introduce new noise that prevents the WSOD model from learning better representations. To address this tradeoff, we propose a multistage training procedure based on pseudolabel learning, as illustrated in Fig. 2. Specifically, the model is trained in stages, where the neural network parameters of the previous stage serve as the initial parameters for the next stage. During the first stage of training, only weak data augmentations are applied to the input images. This reduces the noise of the input images and improves the stability of training. After the end of the first stage, all the images in the training set are processed by the first model to obtain predictions. Nonmaximum suppression (NMS) is employed on these predictions to remove overlapping boxes. Results whose predicted categories are not in the image-level ground truth are then deleted, and the remaining predictions with confidence scores higher than a threshold \gamma are used as the pseudolabels of the first stage (pse_{1}=\lbrace (box_{pse}, con_{pse})\rbrace). Each pseudolabel contains the box information box_{pse} and the corresponding category confidence score con_{pse}.
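The pseudolabel generation described above (NMS, removal of categories absent from the image-level labels, and confidence thresholding) can be sketched as follows. The detection format `(box, label, score)`, the function names, the per-class NMS, and the NMS IoU threshold of 0.3 used here are assumptions for illustration; \gamma = 0.7 follows the implementation details.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def make_pseudolabels(detections, image_labels, gamma=0.7, nms_iou=0.3):
    """Turn the detections of one image into pseudolabels.

    detections: list of (box, label, score) tuples.
    image_labels: set of categories present in the image-level ground truth.
    """
    kept = []
    # Greedy per-class NMS: visit boxes from the highest to the lowest score.
    for det in sorted(detections, key=lambda d: d[2], reverse=True):
        box, label, _ = det
        if all(label != k[1] or iou(box, k[0]) <= nms_iou for k in kept):
            kept.append(det)
    # Drop categories not in the image labels, then keep confident predictions only.
    return [d for d in kept if d[1] in image_labels and d[2] > gamma]
```

Filtering by the image-level labels uses the only ground truth available in WSOD to discard predictions that are certainly wrong before they are reused as supervision.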

In stage i, where i \ne 1, both weak and strong augmentations are applied to the input images to increase their diversity. In this case, the pseudolabels from stage i-1 serve as supervision in the refinement branch to force the neural network to predict the same results after strong augmentations. After the training of stage i, pse_{i} is obtained in the same way as in the first stage. Compared with the ground truth in the fully supervised detection task, these pseudolabels contain many more false positives and inaccurate samples. To mitigate this, the confidence scores of the pseudolabels are taken into account in the pseudolabel loss function. For each proposal b_{r}, we calculate its IoU with all pseudoboxes and select the pseudobox with the highest IoU, denoted by index h and IoU I_{h}. Then, pse_{ir}^{t} is defined as
\begin{equation*} pse_{ir}^{t}=\left\lbrace \begin{array}{cc}con_{h} & I_{h} \geq 0.5\\ 0 & I_{h} < 0.5. \end{array}\right. \tag{14} \end{equation*}

The pseudolabel refinement loss is similar to the refinement loss in WSOD defined in (2), except that y_{cr}^{t} is replaced with pse_{ir}^{t} (the class index is written as i here)
\begin{equation*} L_{PSE}^{t} = -\frac{1}{|B|}\sum _{r=1}^{|B|}\sum _{i=1}^{c+1}w_{r}^{t}pse_{ir}^{t}\log x_{ir}^{R_{t}}. \tag{15} \end{equation*}

The total pseudolabel refinement loss is defined as
\begin{equation*} L_{PSE} = \sum _{t=1}^{T}L_{PSE}^{t} \tag{16} \end{equation*}
where T is the number of classifiers in the refinement branch. In the experiments, we set T to 3 as in [2]. This pseudolabel refinement loss leverages the knowledge learned from previous stages to resist the effect of the newly introduced noise.
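The matching rule in (14) can be illustrated with the sketch below, which assigns to each proposal the confidence of its best-overlapping pseudobox when the IoU reaches 0.5 and zero otherwise. The function names and the `(box_pse, con_pse)` tuple format are assumptions, and the class dimension of pse_{ir}^{t} is omitted for brevity.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pseudo_targets(proposals, pseudolabels, iou_thr=0.5):
    """Return one soft target per proposal following (14).

    pseudolabels: list of (box_pse, con_pse) tuples from the previous stage.
    """
    targets = []
    for b_r in proposals:
        best_iou, best_con = 0.0, 0.0
        for box_pse, con_pse in pseudolabels:
            overlap = iou(b_r, box_pse)
            if overlap > best_iou:
                best_iou, best_con = overlap, con_pse
        # Use the pseudolabel confidence as a soft target only for good matches.
        targets.append(best_con if best_iou >= iou_thr else 0.0)
    return targets
```

Using the confidence score rather than a hard label of 1 reduces the weight of uncertain pseudoboxes in (15).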

SECTION IV.

Experiments and Results

A. Datasets and Evaluation Metrics

Comprehensive experiments are conducted on three publicly available challenging datasets, DIOR [48], NWPU VHR-10.v2 [49], and HRSC2016 [50], to evaluate the effectiveness of the proposed method. The DIOR dataset is a recent challenging dataset that contains 23 463 images of size 800 × 800 from 20 object categories. The train and test sets consist of 11 725 and 11 738 images, respectively. The NWPU VHR-10.v2 dataset contains 1172 images of size 400 × 400 from 10 categories. In the experiments, this dataset is divided into 879 images for training and 293 images for testing. The HRSC2016 dataset is a ship detection dataset widely used in the oriented object detection task. It contains 1061 labeled remote sensing images with sizes ranging from 300 × 300 to 1500 × 900. In the experiments, the 436 images of the train set and the 181 images of the val set are merged into a new train set, while the 293 images of the test set are used for testing.

For evaluation, we adopt the conventional metric mAP on the test set and the popular metric CorLoc [51], widely used for remote sensing images, on the train set. mAP evaluates the detection accuracy by treating a detection result as a positive sample if the IoU between the ground truth and the detection bounding box is greater than 0.5. CorLoc evaluates the localization accuracy of models by measuring the percentage of positive training images in which the method correctly localizes an object of the target class in accordance with the visual object classes (VOC) criterion.
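As a rough sketch of the CorLoc computation for a single class, the function below counts the positive images in which a predicted box overlaps a ground truth box of that class with IoU greater than 0.5. Taking the top-scoring box per image as the prediction is an assumption based on common practice, and the function names and input format are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def corloc(samples, iou_thr=0.5):
    """CorLoc (%) for one class.

    samples: one (top_box, gt_boxes) pair per positive training image, where
    top_box is the highest-scoring predicted box for the class and gt_boxes
    are the ground truth boxes of that class in the image.
    """
    hits = sum(
        1 for top_box, gt_boxes in samples
        if any(iou(top_box, gt) > iou_thr for gt in gt_boxes)
    )
    return 100.0 * hits / len(samples)
```

The dataset-level CorLoc is then the mean of the per-class values.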

B. Implementation Details

The backbone network in our method is a variant of VGG16 [52], obtained by removing the penultimate max-pooling layer of the original VGG16 and substituting the subsequent conv layers with dilated conv layers. Most of the hyperparameters are set as in [2]. The weight of the spatial correlation part \beta is set to 0.1. The thresholds \tau and \psi in the SPC algorithm are set to 0.9 and 0.00001, respectively. The negative threshold \delta in PBCL is set to 0.3 and the number of negative samples N_{neg} is set to 10. The keeping threshold \gamma in PLMS is set to 0.7. The loss weights \lambda, \epsilon, and \omega are set to 0.1, 1, and 1, respectively. The number of iterations of the multistage procedure is set to 3. During training, the mini-batch size for SGD is set to 1. The total training iterations of the first stage are 100 k, 75 k, and 75 k for DIOR, NWPU VHR-10.v2, and HRSC2016, respectively. In the other stages, they are set to 30 k, 20 k, and 20 k. The learning rate of the first stage is set to 0.001 for DIOR and NWPU VHR-10.v2 and to 0.0005 for HRSC2016. In the other stages, the learning rate is halved to 0.0005 and 0.00025, respectively, to make training more stable. The momentum and weight decay are set to 0.9 and 0.0005, respectively. The weak augmentations used in our method consist of flipping and scaling. The scale sizes are \lbrace480, 576, 688, 864, 1200\rbrace, meaning that the shortest side of the input is resized to one of those values. The strong augmentations consist of random rotation (90^{\circ }, 180^{\circ }, and 270^{\circ }), random Mosaic, and color jittering as in [37]. During testing, nonmaximum suppression (NMS) is applied with an IoU threshold of 0.3 to remove duplicated bounding boxes. Meanwhile, test time augmentations, including horizontal flipping and scaling (\lbrace576, 688, 864, 1200\rbrace), are implemented.

Our experiments are implemented with the PyTorch [53] deep learning framework, using Python and C++. All experiments are run on an NVIDIA GeForce RTX 2080Ti GPU and an Intel(R) Xeon(R) E5-2630 v4 CPU (2.20 GHz).

C. Comparison With State-of-the-Arts

In this section, we report the detection performance for each class and compare our method with other advanced weakly and fully supervised object detection methods on three remote sensing datasets.

1) DIOR

Table I lists the object categories of the DIOR dataset. Tables II and III quantitatively compare the detection performance of our proposed method in terms of mAP and CorLoc. Our method achieves 28.0% mAP, outperforming WSDDN [1], OICR [2], PCL [3], MELM [54], PICR [7], TCANet [8], SAENet [10], and WSODet [13] by 14.7, 11.5, 9.8, 9.3, 3.1, 2.8, 0.9, and 0.7 percentage points, respectively. Note that boldface results indicate the highest values among all the compared WSOD methods. In comparison to other methods, our approach demonstrates a more balanced and generalized detection performance across categories. Moreover, the results for two categories (airplane and storage tank) are the highest among the compared methods, demonstrating the strong ability of our method to handle clustered objects. Meanwhile, the CorLoc result of our method is 50.1%, better than all the other compared methods. This result indicates that our proposed method mines more information from the training set through the new informative proposals. However, the performance of our method is still far lower than that of fully supervised methods such as Faster RCNN [22], which reflects both the difficulty of detecting objects in complex remote sensing images and the room that remains for improvement.

TABLE I Object Categories of DIOR Dataset
TABLE II mAP (%) Results of Different Methods on DIOR Test Set
TABLE III CorLoc (%) Results of Different Methods on DIOR Trainval Set

From Tables II and III, we can see that in some categories, such as C7, C8, C12, and C17, the results are worse than those of other WSOD methods. This may be due to the uncertainty of strong data augmentations, which improve the diversity of the data but also introduce unpredictable noise. The balance between these benefits and drawbacks differs from category to category. Nonetheless, our model achieves the best overall mAP and CorLoc, which demonstrates its superior generalization ability and the effectiveness of our proposed components.

2) NWPU VHR-10.v2

Tables IV and V report the detection and localization results of our approach compared with other approaches on the NWPU VHR-10.v2 dataset. Our method reaches 63.1% mAP and 73.7% CorLoc, both of which are state-of-the-art results. All the categories with densely packed objects can be processed effectively by our method. The CorLoc results in three categories (ship, baseball, and ground track field) even reach 100%, which we attribute to the new informative proposals: they compensate for cases in which the inferior original proposals miss the ground truth. However, bridges remain intractable for our method (0.2% mAP and 0.1% CorLoc). This may be caused by their elongated aspect ratio and their frequent co-occurrence with water, which mislead our method into generating wrong proposals.

TABLE IV mAP (%) Results of Different Methods on the NWPU VHR-10.v2 Test Set
TABLE V CorLoc (%) Results of Different Methods on the NWPU VHR-10.v2 Trainval Set

3) HRSC2016

We also conduct experiments on the challenging ship dataset HRSC2016, which suffers from the tiny box domination problem caused by the combination of massive numbers of tiny proposals and large ship targets. As shown in Table VI, our method achieves 55.1% mAP and 67.8% CorLoc, outperforming other methods. Our informative proposals alleviate the tiny box domination problem because they have a more balanced size distribution and contain less meaningless noise, which reduces the contribution of small, uninformative proposals to the loss.

TABLE VI mAP and CorLoc (%) Results of Different Methods on the HRSC2016 Dataset

D. Ablation Studies

To better understand how the proposed method works, we conduct a series of ablation studies on DIOR, the largest of the three datasets, to reduce result fluctuation due to insufficient data.

1) Effectiveness of Each Component

The contributions of the different components of the proposed method are listed in Table VII. The OICR [2] method is chosen as the baseline of our method, and we add different components to the original framework to demonstrate their contributions. From this table, it can be seen that using IPR to produce more informative proposals significantly improves the performance, from 16.2% to 21.5% mAP and from 34.5% to 44.2% CorLoc. Adding only PBCL improves the results slightly (16.9% mAP and 35.5% CorLoc). However, adding IPR and PBCL together boosts the results to 24.1% mAP and 46.3% CorLoc, suggesting that PBCL can further improve the quality of the informative proposals generated by IPR. Applying only PLMS achieves 20.6% mAP (+4.4 points) and 43.6% CorLoc (+9.1 points), indicating its effectiveness. Applying PLMS together with IPR improves the performance of our method to 25.8% mAP and 47.4% CorLoc. In addition, implementing PBCL and PLMS without IPR also achieves better performance (24.6% mAP and 46.3% CorLoc) than implementing any single part. Employing all three modules reaches the highest performance (28.0% mAP and 50.1% CorLoc). These results indicate that each component of our method is important and that removing any of them causes a noticeable drop in the metrics.

TABLE VII Effectiveness of Each Component of the Proposed Method in the DIOR Dataset (%)
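The gains over the baseline quoted in the text above can be recomputed directly. The short Python sketch below uses only the mAP and CorLoc values stated in the component ablation paragraph; the dictionary layout and the function name `gains_over_baseline` are illustrative choices.

```python
# (mAP %, CorLoc %) on DIOR, transcribed from the ablation paragraph.
BASELINE = (16.2, 34.5)  # OICR baseline
ABLATION = {
    "IPR": (21.5, 44.2),
    "PBCL": (16.9, 35.5),
    "IPR + PBCL": (24.1, 46.3),
    "PLMS": (20.6, 43.6),
    "IPR + PLMS": (25.8, 47.4),
    "PBCL + PLMS": (24.6, 46.3),
    "IPR + PBCL + PLMS": (28.0, 50.1),
}


def gains_over_baseline(results, baseline=BASELINE):
    """Absolute gains in percentage points, rounded to one decimal."""
    return {
        name: (round(m - baseline[0], 1), round(c - baseline[1], 1))
        for name, (m, c) in results.items()
    }


if __name__ == "__main__":
    for name, (d_map, d_corloc) in gains_over_baseline(ABLATION).items():
        print(f"{name:<18} mAP +{d_map:.1f}  CorLoc +{d_corloc:.1f}")
```

Read this way, the full model gains 11.8 mAP points and 15.6 CorLoc points over the baseline, and the PLMS-only row reproduces the +4.4 and +9.1 figures given in the text.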

2) Ablation Studies on IPR

Table VIII provides the ablation studies on the composition of the comprehensive similarity in IPR. The way this similarity is calculated determines how proposals with similar characteristics are clustered, and thus affects the generation of new proposals. It can be seen that using either feature similarity or spatial correlation alone to generate new proposals improves on the baseline method. Combining the two as in (11) gives the best results, suggesting that the quality of the generated informative proposals depends on both the semantic context and the location information of proposals.

TABLE VIII Ablation Studies on the Composition of the Comprehensive Similarity in IPR (%)
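To make the idea of combining the two cues concrete, here is a hedged Python sketch of one plausible form of such a similarity. Equation (11) is not reproduced in this section, so everything below is an assumption: feature similarity is taken to be cosine similarity, spatial correlation is taken to be box IoU, and the two are combined as a weighted sum with the weight \beta = 0.1 stated in the implementation details. The authors' actual formulation may differ.

```python
import math
from typing import Sequence

Box = Sequence[float]  # assumed format: (x1, y1, x2, y2)


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors (assumed feature similarity)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes (assumed spatial correlation)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def comprehensive_similarity(feat_i, box_i, feat_j, box_j, beta: float = 0.1) -> float:
    """Hypothetical weighted sum: feature similarity + beta * spatial correlation."""
    return cosine_similarity(feat_i, feat_j) + beta * box_iou(box_i, box_j)


if __name__ == "__main__":
    same = comprehensive_similarity([1.0, 0.0], (0, 0, 10, 10),
                                    [1.0, 0.0], (0, 0, 10, 10))
    far = comprehensive_similarity([1.0, 0.0], (0, 0, 10, 10),
                                   [0.0, 1.0], (20, 20, 30, 30))
    print(same, far)
```

Under this assumed form, two proposals are grouped only when they look alike and overlap, which matches the observation that both semantic context and location information matter.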

3) Ablation Studies on PBCL

From Table IX, one can see that the hyperparameter settings in PBCL influence the performance of the proposed approach. When the negative threshold \delta is set to 0.1 and the number of selected negative samples N_{neg} is set to 10, the mAP and CorLoc results reach their highest values.

TABLE IX Ablation Studies on the Hyperparameter Setting in PBCL Module in the DIOR Dataset (%)

4) Ablation Studies on PLMS

As demonstrated in Table X, the performance of our method improves steadily as the iteration number N_{ms} of the multistage procedure increases, until N_{ms} reaches 3, indicating that the quality of the newly generated proposals and the pseudolabels can be improved progressively. When N_{ms} > 3, the performance tends to saturate, which may be caused by the interaction between the unavoidable noise in pseudolabels and the data diversity introduced by strong augmentations. Note that N_{ms} = 1 means that the training procedure is the same as that in mainstream WSOD methods.

TABLE X Ablation Studies on the Iteration Number of the Multistage Procedure in the DIOR Dataset (%)
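The role of N_{ms} can be illustrated with a schematic training loop. The Python below is a sketch under stated assumptions, not the authors' procedure: `train_stage` and `predict` are hypothetical callables supplied by the user, the keeping threshold \gamma = 0.7 is assumed to act on detection confidence scores, and pseudolabels are assumed to be restricted to classes present in the image-level labels. Whether each stage continues from the previous weights is not stated in this section, so the previous model is simply passed through.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Box = Tuple[float, float, float, float]
Detection = Tuple[str, float, Box]  # (class name, confidence score, box)


def keep_pseudolabels(detections: Iterable[Detection],
                      image_labels: Set[str],
                      gamma: float = 0.7) -> List[Detection]:
    """Keep detections of labeled classes whose score is at least gamma."""
    return [d for d in detections if d[0] in image_labels and d[1] >= gamma]


def multistage_training(train_stage: Callable,
                        predict: Callable,
                        images: List[str],
                        image_labels: Dict[str, Set[str]],
                        n_ms: int = 3,
                        gamma: float = 0.7):
    """Run n_ms training stages, refreshing pseudolabels between stages."""
    model = None
    pseudolabels: Dict[str, List[Detection]] = {}  # empty for stage 1
    for stage in range(1, n_ms + 1):
        # Stage 1 sees only image-level labels; later stages also see pseudolabels.
        model = train_stage(stage, model, pseudolabels)
        if stage < n_ms:
            pseudolabels = {
                img: keep_pseudolabels(predict(model, img), image_labels[img], gamma)
                for img in images
            }
    return model, pseudolabels
```

With `n_ms=1` the loop runs a single stage without pseudolabels, which corresponds to the ordinary single-pass training noted in the text.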

E. Discussion

In this section, some additional experiments are conducted on the NWPU VHR-10.v2 dataset to verify the effectiveness of the newly generated informative proposals and the iterative training procedure.

1) Quality of Informative Proposals

The size distributions of the original and the new informative proposals in the NWPU VHR-10.v2 test set are illustrated in Fig. 4, where 100^{2} denotes all proposals whose size ranges from 50 \times 50 to 100 \times 100. It can be seen that tiny boxes smaller than 50 \times 50 account for 59.5% of the original proposals, preventing the model from learning diverse object representations. In comparison, the size distribution of the new informative proposals generated by IPR is much more balanced, with fewer useless tiny ones. By processing the new proposals together with the original ones, our model focuses on more informative areas and mitigates the tiny box domination and part bias problems, which are common in remote sensing images due to their complex backgrounds.

Fig. 4. Size distribution of original proposals and the informative ones generated by IPR in NWPU test set.
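A size histogram of this kind can be computed with a few lines of Python. This is a minimal sketch under assumptions: "size" is interpreted as box area, a label such as `100^2` covers areas above 50 \times 50 and up to 100 \times 100, and the bin edges beyond 100 (here 200 and 400) are hypothetical because they are not stated in the text.

```python
from bisect import bisect_left
from collections import Counter
from typing import Dict, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def size_distribution(boxes: Sequence[Box],
                      edges: Sequence[int] = (50, 100, 200, 400)) -> Dict[str, float]:
    """Percentage of boxes per size bin.

    A box goes to the first edge e with area <= e * e; anything larger
    goes to a final overflow bin. Edges above 100 are illustrative only.
    """
    area_edges = [e * e for e in edges]
    labels = [f"{e}^2" for e in edges] + [f">{edges[-1]}^2"]
    counts = Counter()
    for x1, y1, x2, y2 in boxes:
        area = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        counts[labels[bisect_left(area_edges, area)]] += 1
    total = len(boxes)
    return {label: (100.0 * counts[label] / total if total else 0.0)
            for label in labels}


if __name__ == "__main__":
    proposals = [(0, 0, 10, 10), (0, 0, 40, 40), (0, 0, 80, 80), (0, 0, 500, 500)]
    print(size_distribution(proposals))
```

Running the same function on the original and the IPR-generated proposals gives two distributions that can be compared bin by bin, for example to check how large the share of the smallest bin is.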

In Table XI, we apply IPR at test time to generate informative proposals and use only these proposals to predict detection results, without any test time augmentation. The hyperparameter \psi in the SPC algorithm controls the number of generated proposals. One can see from this table that as \psi increases, the average number of newly generated proposals decreases, but the mAP remains competitive. This result indicates that IPR can be regarded as a proposal refinement module whose outputs contain more information with less background noise. Furthermore, the performance of the model using the generated proposals improves as the iteration number (N_{ms}) increases, suggesting that our dynamic proposals are progressively improving.

TABLE XI The Average Number of Original and New Generated Proposals and the Performance Only Using One of Them on NWPU VHR-10.v2

During training, we use the newly generated proposals to guide the network to concentrate on more informative areas. To obtain better performance and reduce testing time, we use only the original proposals at test time.

2) Steady Improvement by PLMS on Different WSOD Methods

The proposed PLMS is a general strategy to improve the generalization of WSOD models. To verify this, we add PLMS to other WSOD methods on the NWPU dataset and evaluate the contribution of this training procedure, as shown in Table XII. We reproduce these methods using the same hyperparameter settings as described in Section IV-B, without test time augmentations. One can see from Table XII that PLMS steadily boosts the prediction performance by a large margin, demonstrating the generality of this training procedure. Moreover, PLMS is much more effective on NWPU than on DIOR. This may be because the strong data augmentations dramatically improve the data diversity of the NWPU dataset, which contains only 879 training images.

TABLE XII Improvement by PLMS on Different WSOD Methods on NWPU VHR-10.v2 (%)

3) Efficacy of Introducing Strong Data Augmentations

To assess the efficacy of strong data augmentations, we conducted several supplementary experiments, which are presented in Table XIII, where SDA refers to strong data augmentations. From the table, we observe that applying SDA directly to the baseline framework (OICR) impedes model performance, because the extra noise it introduces hinders the model from finding accurate box information under image-level annotations. This issue is not observed in fully supervised object detection methods such as [22]. On the other hand, applying PLMS without SDA leads to only marginal improvements in performance because the data diversity is limited. However, when PLMS is implemented alongside strong data augmentations, the improvement in mAP is significant. These findings affirm the effectiveness of adopting strong data augmentations to raise the performance ceiling of the model and of PLMS to mitigate the impact of the introduced noise.

TABLE XIII Effectiveness of Introducing Strong Data Augmentations on NWPU VHR-10.v2 (%)

F. Qualitative Results

The detection results on the three datasets are presented in Fig. 5 to demonstrate the effectiveness of our proposed method. It is evident from the first three rows of Fig. 5 that our method can accurately and tightly enclose objects from different categories, even if they are clustered and densely packed. In addition, our approach addresses the problem of tiny box domination on the HRSC dataset, as it uses more informative proposals that help the model ignore small proposals with excessive noise and minimal information. Consequently, our model can predict more precise bounding boxes instead of focusing solely on the details inside larger objects. Furthermore, the introduction of strong data augmentations has significantly improved the generalization of our method, especially for objects with significant orientation variation, such as airplanes, ships, and harbors. Nevertheless, there are still some failure cases, shown in the fourth row of Fig. 5, where our method struggles to process tiny, clustered objects. This motivates us to generate more tiny but informative proposals without merging them into larger ones in the future.

Fig. 5. (a) Some successful results on DIOR dataset. (b) Some successful results on NWPU VHR-10.v2 dataset. (c) Some successful results on HRSC2016 dataset (these images are cropped into squares for visualization). (d) Some failed detection results including false detection and missed detection.

Furthermore, Fig. 6 illustrates the visual outcomes of our approach in comparison with OICR [2], PCL + trans [12], and WSODet [13] (horizontal bounding box output). The results indicate that our method is capable of generating more precise bounding boxes with fewer missing and misdetected instances, particularly when dealing with small and densely packed objects. This can be attributed to the fact that our newly generated proposals are of higher quality with more accurate box information than the original coarse proposals. In addition, the strong data augmentation techniques in our model enhance its generalization ability, resulting in improved performance on images containing numerous targets with complex backgrounds.

Fig. 6. Qualitative comparisons of OICR [2], PCL + trans [12], WSODet [13] and our method. Our models can generate more precise bounding boxes with fewer missing and misdetected instances. (a) OICR. (b) PCL + trans. (c) WSODet. (d) Ours.

SECTION V.

Conclusion

In this article, we propose a dynamic informative proposal-based iterative training procedure for WSOD in remote sensing images to deal with the inferior quality of proposals and the performance dilemma caused by strong data augmentations. Two modules, IPR and PBCL, are designed to generate high-quality informative proposals dynamically during training, which guide the model to focus on more informative areas in images. Meanwhile, a new training procedure named PLMS is proposed, which leverages pseudolabel learning to improve the generalization of neural networks. Extensive experiments have been conducted to evaluate the effectiveness of the proposed method on the DIOR, NWPU VHR-10.v2, and HRSC2016 datasets. However, compared with fully supervised object detection, the prediction performance in some categories with tiny clustered objects is still far behind and has considerable room for improvement, which motivates us to explore other ways to narrow this gap in the future.
