
Structured Adversarial Self-Supervised Learning for Robust Object Detection in Remote Sensing Images



Abstract:

Object detection plays a crucial role in scene understanding and has extensive practical applications. In the field of remote sensing object detection, both detection accuracy and robustness are of significant concern. Existing methods rely heavily on sophisticated adversarial training strategies that tend to improve robustness at the expense of accuracy; moreover, improved robustness does not necessarily translate into improved accuracy. Therefore, in this article, we investigate how to enhance robustness while preserving high accuracy, or even improve both simultaneously, using only simple vanilla adversarial training or none at all. In pursuit of a solution, we first conduct an exploratory investigation by shifting our attention from adversarial training, referred to as adversarial fine-tuning, to adversarial pretraining. Specifically, we propose a novel pretraining paradigm, namely, structured adversarial self-supervised (SASS) pretraining, to strengthen both clean accuracy and adversarial robustness for object detection in remote sensing images. At a high level, SASS pretraining aims to unify adversarial learning and self-supervised learning into pretraining and encode structured knowledge into pretrained representations for powerful transferability to downstream detection. Moreover, to fully explore the inherent robustness of vision Transformers and facilitate their pretraining efficiency, by leveraging the recent masked image modeling (MIM) as the pretext task, we further instantiate SASS pretraining into a concise end-to-end framework, named structured adversarial MIM (SA-MIM). SA-MIM consists of two pivotal components: structured adversarial attack and structured MIM (S-MIM). The former establishes structured adversaries for the context of adversarial pretraining, while the latter introduces a structured local-sampling global-masking strategy to adapt to hierarchical encoder architectures. Comprehensive experiments on three different datasets have dem...
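To make the high-level recipe in the abstract concrete, the following is a minimal, generic sketch of one adversarial MIM pretraining step: an inner PGD-style perturbation maximizes the masked-patch reconstruction loss, and the model is then updated on the adversarial view. This is an illustration of the general paradigm only, not the authors' SA-MIM implementation; the plain random masking stands in for the paper's structured local-sampling global-masking strategy, the PGD loop stands in for its structured adversarial attack, and all names and hyperparameters (`encoder`, `decoder`, `mask_ratio`, `epsilon`, etc.) are assumptions.

```python
import torch
import torch.nn.functional as F


def patchify(x, p):
    """Rearrange an image batch (B, C, H, W) into flattened non-overlapping patches."""
    b, c, h, w = x.shape
    x = x.reshape(b, c, h // p, p, w // p, p)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(b, (h // p) * (w // p), p * p * c)


def adversarial_mim_step(encoder, decoder, images, mask_ratio=0.6,
                         epsilon=2 / 255, alpha=1 / 255, attack_steps=3,
                         patch_size=16):
    """One generic adversarial masked-image-modeling step (illustrative sketch).

    Assumes `encoder(x, mask)` encodes the visible patches and
    `decoder(latent, mask)` predicts pixel values for the masked patches;
    these interfaces are hypothetical, not the paper's API.
    """
    b, c, h, w = images.shape
    num_patches = (h // patch_size) * (w // patch_size)
    # Plain random patch masking (the paper instead uses a structured
    # local-sampling global-masking scheme for hierarchical encoders).
    mask = torch.rand(b, num_patches, device=images.device) < mask_ratio

    def recon_loss(x):
        latent = encoder(x, mask)                  # encode visible patches
        pred = decoder(latent, mask)               # reconstruct masked patches
        target = patchify(x, patch_size)           # ground-truth pixel patches
        return F.mse_loss(pred[mask], target[mask])

    # Inner maximization: PGD-style perturbation that increases the MIM loss.
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(attack_steps):
        loss = recon_loss(images + delta)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-epsilon, epsilon).detach()
        delta.requires_grad_(True)

    # Outer minimization: the returned loss is backpropagated by the caller
    # to update the encoder/decoder on the adversarial view.
    return recon_loss(images + delta.detach())
```

In a full pretraining loop, the returned loss would simply be followed by `loss.backward()` and an optimizer step; after pretraining, the encoder would be transferred to a downstream detector, mirroring the pretrain-then-fine-tune pipeline the abstract describes.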
Article Sequence Number: 5613720
Date of Publication: 14 March 2024



I. Introduction

Object detection is a fundamental task in the field of remote sensing scene understanding, with a wide range of real-world applications, including environmental monitoring, intelligent transportation, and military deployment [1], [2], [3], [4], [5], [6]. Thanks to the ready availability of large-scale remote sensing data, deep learning has recently come to dominate the field [7], [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20]. Specifically, convolutional neural networks (CNNs) and vision Transformers (ViTs) [21], [22], [23] have successively demonstrated exceptional representational ability; the latter, as a promising alternative to CNNs, have gained increasing popularity, reflecting their significant potential for future applications in remote sensing object detection [24], [25], [26].
