Large Kernel Sparse ConvNet Weighted by Multi-Frequency Attention for Remote Sensing Scene Understanding | IEEE Journals & Magazine | IEEE Xplore




Abstract:

Remote sensing scene understanding is a highly challenging task that has gradually emerged as a research hotspot in the field of intelligent interpretation of remote sensing data. Recently, convolutional neural networks (CNNs) have proven to be a fruitful advancement. However, with the emergence of vision transformers (ViTs), the limitation of traditional small convolutional kernels in directly capturing a large receptive field has posed a significant challenge to their dominant role. In addition, the fixed neuron connections between different convolutional layers weaken the practicality and adaptability of the models, and global average pooling (GAP) discards effective information in the acquired features. In this work, a large kernel sparse ConvNet (LSCNet) weighted by multi-frequency attention (MFA) is proposed. First, unlike traditional CNNs, it uses two parallel rectangular convolutional kernels to approximate a large kernel, achieving results comparable to or even better than those of ViT-based methods. Second, an adaptive sparse optimization strategy dynamically optimizes the fixed neuron connections between convolutional layers, yielding a connectivity pattern favorable for capturing abstract features more accurately. Finally, a novel MFA module replaces GAP so as to preserve more useful information while weighting the recognition features, thereby enhancing the discriminative and learning abilities of the model. In the conducted experiments, LSCNet achieves the best recognition results on three well-known remote sensing aerial datasets compared with state-of-the-art methods, including ViT-based methods.
Article Sequence Number: 5626112
Date of Publication: 16 November 2023



I. Introduction

Remote sensing scene understanding is a vital yet difficult task in the field of intelligent interpretation of remote sensing data. It aims to capture high-level semantic information from images and precisely assign corresponding class labels to them. It has applications in various military and civilian domains, including natural disaster detection, weapon guidance, traffic supervision, and land cover monitoring [1], [2], [3], [4], [5]. In recent years, advances in remote sensing technology have raised the level of data abstraction from pixels to objects and ultimately to scenes [6], [7], [8]. To keep pace with these advances, numerous researchers have dedicated their efforts over the past few decades to addressing the challenges of scene-level image understanding [9], [10], [11]. In this task, effective feature extraction plays a crucial role, and based on the means of feature extraction, existing scene understanding works can be roughly divided into three directions: methods using low-level visual features, methods relying on mid-level visual representations, and methods based on high-level visual information [12], [13].
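The abstract's core architectural idea — approximating a large square kernel with two parallel rectangular (strip) kernels — can be illustrated with a minimal NumPy sketch. LSCNet's actual kernel sizes, channel structure, and fusion scheme are not specified here, so the kernel size `K`, the uniform strip weights, and the function names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def conv2d_same(x, k):
    """Naive 'same'-padded 2-D cross-correlation (single channel, stride 1)."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def parallel_rect_branch(x, K=13):
    """Sum of two parallel rectangular kernels (1xK and Kx1).

    Together their responses cover the rows and columns of a KxK
    neighborhood, mimicking a large-kernel receptive field at a
    fraction of the parameter cost (2K weights instead of K*K).
    """
    k_h = np.ones((1, K)) / K  # horizontal strip kernel
    k_v = np.ones((K, 1)) / K  # vertical strip kernel
    return conv2d_same(x, k_h) + conv2d_same(x, k_v)

# A unit impulse reveals the effective receptive field: the response
# is a cross spanning K pixels in each direction, centered on the impulse.
x = np.zeros((21, 21))
x[10, 10] = 1.0
y = parallel_rect_branch(x, K=13)
print(np.count_nonzero(y))  # 25 cells: a 13-wide cross (2*13 - 1)
```

The parameter saving is the usual motivation for strip kernels: a full 13x13 kernel needs 169 weights per channel pair, while the two strips need only 26, and stacking such layers lets their cross-shaped fields compose into a dense large receptive field.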

