Introduction
Synthetic aperture radar (SAR) [1], [2], [3], [4], developed from electromagnetic scattering in microwave bands, plays a crucial role in active Earth observations, functioning effectively across diverse weather conditions and lighting environments. With the rapid development of SAR imaging techniques, high-resolution SAR images can be accessed more easily than before, enabling even more research opportunities for intelligent interpretation of SAR images. SAR automatic target recognition (ATR), aiming at automatically localizing and classifying objects of interest (e.g., vehicles, ships, airplanes, or buildings) in SAR images, is a longstanding, important yet challenging problem in SAR image intelligent interpretation [5], [6], [7], [8]. SAR ATR plays an essential role in civil and national defense applications such as modern airport management, disaster management, urban planning and infrastructure monitoring, military reconnaissance, and maritime surveillance. Therefore, it has become an active research area for several decades [9], [10], [11], [12], [13], [14]. In the past decade, deep learning has brought tremendous success for SAR ATR [15], [16], [17]. Despite the significant progress, the following fundamental challenges will need to be addressed to advance the field of SAR ATR.
First, the task-specific property. One fundamental limitation of current ATR methods [18], [19], [20], [21], [22] is that one model is trained and evaluated on one specific task. Detecting and classifying each coarse category (e.g., vehicles, ships, airplanes, or buildings in Fig. 1) requires its own deep model. The task-specific nature of these models poses significant challenges for training new tasks or developing a comprehensive SAR ATR system, since each task must be learned independently from the ground up with vast amounts of labeled data. This results in computational inefficiency, lower accuracy, and inconsistent results between different models. Second, the heavy reliance on supervised learning. Recent progress in SAR ATR [23], [24], [25], while substantial, has been limited to supervised learning, which depends heavily on massive amounts of accurately annotated target samples that must be expensively labeled by expert SAR analysts, limiting generalization capability and scalability. The scarcity of expert SAR analysts cannot meet such an exhaustive requirement, leaving vast amounts of SAR images unlabelled and unexploited. Third, the neglect of SAR image characteristics in model design. The imaging characteristics of SAR imagery differ significantly from those of optical imagery, creating a significant domain gap between natural and SAR images. This raises challenges when transferring prior knowledge from the natural image domain. Strong priors of SAR imagery, including speckle noise, discrete target appearances, and the lack of geometry, texture, and contour cues, need special consideration when designing backbone architectures and learning strategies. Most mainstream backbones and methods designed for natural images do not account for these characteristics. Finally, the underdeveloped open-source ecosystem. Due to data sensitivity, the open-source ecosystem across the field is underdeveloped, making it challenging to share code and data publicly. Currently, there are no large and representative benchmark datasets for SAR ATR. This limits the potential of recent deep learning techniques for SAR ATR and significantly slows the development of the field.
Various specialized SAR ATR datasets and tasks. SAR ATR spans various imaging conditions (i.e., operating conditions), including different targets, scenes, and sensors. However, datasets are often collected in specific settings for certain tasks due to high costs. For example, MSTAR [36] is a ten-type vehicle target classification dataset collected in the X-band over grassland scenes, and SAR-Aircraft is a seven-type aircraft detection dataset collected from three airports with a C-band satellite. Specialized algorithms have been proposed for these datasets, but differing target characteristics, scene information, and sensor parameters complicate the generalization of existing algorithms. As such, this paper aims to develop a SAR ATR foundation model, a generalized method for conducting various tasks.
Recently, the remarkable success of foundation models (FMs) [26], [27], [28], [29] has led to a learning paradigm shift in artificial intelligence. Foundation models [30], pretrained on extensive data in a task-agnostic manner (generally via self-supervised learning), can be flexibly adapted to a wide range of downstream tasks. Self-supervised learning (SSL) [31], [32], [33], [34], [35] mitigates label inefficiency by deriving supervision directly from the data, thereby reducing the reliance on expensive expert labeling while efficiently scaling data and models. FMs shine in a broad range of areas, including natural language processing, computer vision, speech recognition, and medical image analysis. As summarized in Table I, FMs have also been explored for remote sensing image understanding, but they have mostly been evaluated on optical data. To our knowledge, the huge potential of FMs for SAR image interpretation remains locked.
In our preliminary work, a novel SSL method for SAR imagery named SAR Joint-Embedding Predictive Architecture (SAR-JEPA) [37] was proposed and demonstrated promising results. Given the above discussion, to fully unlock the potential of FMs for SAR image interpretation, we present the first attempt toward building a foundation model for SAR ATR and propose SARATR-X, which learns generalizable representations via SSL and provides a basis for label-efficient model adaptation to generic SAR target detection and classification tasks. While conceptually simple, existing methods for building FMs in other domains cannot simply be applied directly to SAR ATR. Significant challenges remain, which are discussed and addressed below.
Pre-training datasets must include diverse target categories and imaging conditions to accommodate various downstream tasks. However, SAR ATR lacks a large-scale dataset such as ImageNet [38], and the most widely used MSTAR dataset only includes fine-grained vehicle categories, which are unsuitable for larger-scale pre-training (see Table II). SARDet-100K [39] took a step forward by incorporating 9 SAR target detection datasets. As the number of open-source SAR ATR datasets grows, we integrated most of them into this study for pre-training. A total of 14 classification and detection datasets with different target categories and imaging conditions were combined into a new pre-training dataset, SARDet-180K, to explore the potential of FMs, as seen in Table III.
Model backbones aim to achieve better spatial resolution representations in remote sensing images, especially for small targets in large imagery. Transformers and convolutional neural networks are among the most common architectures for these tasks. As shown in Table I, the Transformer, the most common choice in recent studies, preserves spatial resolution without downsampling. As such, HiViT [40] was selected because it combines the high-resolution hierarchical input of the Swin Transformer with the ability to drop masked patches in masked image modeling (MIM).
Self-supervised learning is complicated by SAR image quality, which is degraded by speckle noise inherent to coherent imaging. The visual features produced by coherent imaging are also not as distinct or rich as those in natural RGB images. Contrastive learning [41], [42] uses data augmentation and preprocessing to reduce noise, while MIM [37], [39], [43] applies various target features as guide signals to suppress noise (see Table II). As such, the primary task of SAR SSL is to enhance the quality of feature learning and guide signals. For example, PGIL [6] leveraged sub-frequency features of complex SAR images to learn physics information, while our SAR-JEPA [37] applied multi-scale gradient ratios to suppress multiplicative speckle noise and capture target shapes. Furthermore, multi-stage training [39] from ImageNet to SAR diminished the effects of noise on model diversity, as seen in Fig. 5. Thus, we applied two-step pre-training from ImageNet to SAR to increase model diversity during pre-training with SAR images, and used multi-scale gradient features as high-quality guide signals for MIM on SAR images.
Results on classification and detection tasks. SARATR-X performed well across 5 datasets with 8 settings. It outperformed existing SSL methods (e.g., BIDFC [41]) for few-shot target classification on the fine-grained vehicle MSTAR dataset [36]. In addition, it performed well under extended operating conditions (EOCs) [44] (i.e., imaging conditions with variable depression angle (EOCs-Depression), target configuration (EOCs-Config), and version (EOCs-Version)). SARATR-X also demonstrated detection performance competitive with existing supervised methods on various categories (SARDet-100K [39] and OGSOD [45]) as well as specific categories, namely ships (SSDD [46]) and aircraft (SAR-Aircraft [47]). Our study shows the potential of a foundation model for SAR ATR.
Two-step pre-training process. The first step performs MIM on ImageNet data to obtain better initialization weights for model diversity, as shown in Fig. 5 (c). The second step performs MIM on SAR images with high-quality guide signals, namely multi-scale gradient features that suppress speckle noise and extract target edges.
Discussion of single- and multi-scale kernel settings for MGF. Here, scales 1/2/3 correspond to r equal to 9/13/17, and the multi-scale setting concatenates all scales. The multi-scale approach suits the various target sizes in remote sensing images better than any single scale.
Averaged attention distances of the attention heads (the x-axis is the attention head index w.r.t. the layer number, and point colors denote different layers for better visualization) in the SSL models. Attention distance represents the range of the receptive field. We focus on model architectures ((a) vs. (b)), initialization weights ((a) vs. (c)), and SSL signals ((d) vs. (e)) to ensure diverse attention ranges for SAR target recognition, adopting the HiViT architecture, ImageNet weights, and SAR target features.
Evaluation tasks must comprehensively assess a foundation model across different tasks and settings. Three open-source target datasets were first combined into a fine-grained classification dataset, SAR-VSA, with 25 categories to evaluate the effectiveness of the proposed improvements. A comprehensive comparison was then performed between the proposed SARATR-X and existing methods on public classification and detection tasks.
SARATR-X achieved superior performance on 5 datasets across 8 task settings, as shown in Fig. 2, and is competitive with prior methods on various SAR ATR tasks (few-shot and robust classification, and detection of specific or various categories). We hope this work advances the intersection of general SAR target recognition and foundation models.
The primary contributions of this study can be summarized as follows:
We present the first foundation model called SARATR-X, which learns generalizable representations via SSL from large-scale unlabelled data and provides a cornerstone for generic SAR target detection and classification tasks.
We systematically investigate a foundation model framework for SAR ATR. We build the largest publicly available pre-training dataset, SARDet-180K, and fully discuss the model architecture and the proposed SSL method with extensive comparisons.
SARATR-X is evaluated comprehensively with various SAR ATR tasks, such as few-shot classification, robust classification, ship detection, aircraft detection, and detection with various categories.
The remainder of this paper is organized as follows. Sec. II introduces related work in remote sensing and SAR ATR. Sec. III discusses the proposed foundation model (SARATR-X). Secs. IV and V conduct extensive experiments to demonstrate the superiority of the proposed method. Sec. VI concludes the paper and discusses future work.
Related Work
Visual foundation models are actively being used for remote sensing applications, and different algorithms have recently been proposed for various modalities and tasks. This study focuses on a foundation model used for SAR ATR (i.e., SAR image-based target classification and object detection). In the following sections, we introduce recent developments in remote sensing and SAR foundation models.
A. Foundation Models for Remote Sensing
Remote sensing foundation models [61], [62] have received widespread attention in recent years and have achieved effective learning across various modalities and tasks. Many of these studies use existing large-scale pre-training datasets or collect large quantities of samples from different sources. Model backbones have been improved in attention mechanisms, positional encoding, and other aspects to enhance the perception of complex spatial information. MIM has also been used to learn spatial-temporal contextual information, while contrastive learning has been applied to multi-modal learning.
SatMAE [48] introduces a novel masking strategy with temporal and spectral positional encoding for multi-spectral and temporal images. SatMAE has also achieved excellent performance in scene classification and semantic segmentation tasks using a new dataset (fMoW Sentinel) that includes 13 frequency bands. RVSA [49] improves a pre-trained ViT backbone using rotated variable-size window attention for arbitrarily oriented objects. This work demonstrates the importance of learning complex spatial contextual relationships for targets in remote sensing images. RingMo [50] utilizes an incomplete patch masking strategy for dense or small objects and a self-constructed set of 2 million images, proving effective in a variety of tasks and showing the potential of large-scale pre-training. RingMo-Sense [51] offers a three-branch network and a masking strategy for modeling spatio-temporal interactions in temporal images. CMID [52] combines contrastive learning and masked image modeling to learn global semantic information and local spatial information, showing the importance of learning diverse contextual relationships in remote sensing images. GFM [53] focuses on the differences between natural and remote sensing images and employs a multi-objective continual pre-training approach to leverage information from both, showing that detailed information in natural images can complement remote sensing images well. DiffusionSat [54] is the first remote sensing generative model to embed geographic information in stable diffusion. Scale-MAE [55] reconstructs images at different frequencies with improved positional encoding for ViT. FG-MAE [43] employs various hand-designed features to replace original pixels in MIM, thereby improving feature quality. Similarly, SMLFR [56] uses a low-pass filter to remove high-frequency information from image pixels. These studies show the necessity of high-quality guide signals for SSL. As such, our work focuses on SAR image quality while considering the dynamic range and spatial resolution of remote sensing imagery.
A variety of multi-modal remote sensing FMs have also been developed, including SkySense [57] and OFA-Net [58]. SkySense [57] adopts a multi-granularity contrastive learning method to learn representations for different modalities and applies a Geo-Context Prototype to embed geographical contextual information. OFA-Net [58] employs a shared transformer backbone for multiple modalities. In addition, vision-language models [63], such as EarthGPT [64], SkyEyeGPT [65], and LHRS-Bot [66], incorporate large language models into various remote sensing image modalities. However, due to the difficulty of annotating SAR images, the collection of public datasets used by EarthGPT contains only 10,554 SAR ship images, far fewer than its 84,838 infrared or 907,945 optical images. As a result, this study explores visual FMs based on unlabeled SAR images to improve the scalability of SAR ATR models with large-scale samples.
B. Related SSL in SAR
SSL for SAR ATR has been investigated in multiple studies, as detailed in Table II. Early SSL was often used as an auxiliary task for classification, as discussed below. RotANet [59] predicts the rotational patterns of MSTAR vehicle targets, capturing azimuthal features for classification. UACL [60] combines data augmentation and adversarial samples in contrastive learning to improve model robustness to various adversarial attacks. PGIL [6] employs contrastive learning between complex SAR image sub-frequency features and deep amplitude image features, incorporating physical knowledge into classification. SSL has also recently been used in pre-training and fine-tuning frameworks. BIDFC [41] proposes weakly contrastive learning for pre-training on a fine-grained vehicle dataset (MSTAR) and applies Gaussian noise data augmentation to simulate SAR image noise. TSCL [42] applies SAR image pre-processing before data augmentation in contrastive learning. FG-MAE [43] discusses different hand-crafted features for multi-spectral and SAR images and applies HOG features to SAR. In our previous studies, SAR-JEPA [37] applies local reconstruction and multi-scale gradient features to better capture target spatial signatures, and MSFA [39] proposes a multi-stage pre-training framework with filter augmentation for large-scale RGB and SAR data detection. However, these studies have only explored classification or detection tasks on a small number of datasets. Inspired by these previous works, this study aims to systematically investigate the construction of a SAR ATR foundation model for various ATR datasets via SSL.
Our Insights - These studies have demonstrated that SSL can achieve performance improvements across multiple categories [37] and tasks [37], [39], sometimes comparable to specially designed supervised methods [39], [41]. However, the field still lacks a foundation model for various SAR ATR applications, as well as a pre-training and evaluation benchmark that contains images from both classification and detection scenes. This inspired us to conduct systematic research into foundation models for general SAR target recognition, specifically with big data. We first extend the pre-training dataset with different classification and detection tasks and scenarios (such as global inland areas, oceans, harbors, cities, and airports) based on existing research. We then discuss a suitable model backbone for the small-target characteristics of remote sensing images. Since SSL requires high-quality guide signals from SAR images under the influence of noise, we apply two-step pre-training. Finally, we comprehensively evaluate the performance of the foundation model.
Approach
We aim to construct a foundation model for general ATR from large-scale SAR images via SSL. As described above, the growing number of SAR datasets and SSL studies inspired us to develop a foundation model for SAR ATR. We focus on pre-training datasets, model backbones, SSL methods, and evaluation tasks to provide a systematic benchmark for SAR ATR foundation models.
First, we establish a diverse pre-training dataset by integrating 14 SAR target datasets, ensuring that most SAR target samples are included and that samples are balanced across categories. Based on target size and image resolution, we also slice images larger than 1,000 pixels to increase the number of samples and extract small target slices. Second, we compare different model architectures to select one suitable for remote sensing recognition. Third, we propose a two-step pre-training strategy that achieves diversity in attention distance for different target sizes and effective training and scaling under SAR image noise interference: natural image pre-training weights initialize SAR pre-training to enhance attention diversity, and multi-scale gradient features (MGFs) suppress speckle noise. Finally, we perform a linear probing evaluation of the proposed SSL strategies on a self-constructed SAR classification dataset and compare SARATR-X with existing methods on public SAR target recognition datasets.
A. Pre-Training Dataset
Previous research primarily employed MSTAR [36] as a pre-training dataset. While MSTAR provides high-quality vehicle targets, it contains only a few thousand commonly used samples, and its images suffer from background bias caused by a single imaging scene [67]. In contrast, the ImageNet-1K pre-training set contains 1.4 million images with diverse categories and scenes. Since diverse target, scene, and sensor conditions constitute a large data sampling space in real-world scenarios, constructing a large pre-training dataset is essential for the foundation model.
The increasing availability of SAR target datasets is a primary motivation for achieving this goal. Although SAR images are expensive and no single dataset contains all popular target categories or imaging conditions, collecting target samples from various open-source datasets can still provide a pre-training set with distinct categories, scenes, and sensors. Moreover, since MSTAR's background bias stems from the strong correlation between target classes and acquisition locations, we need to integrate different target datasets to increase the diversity of scene and sensor conditions while learning target contextual information. As such, we constructed a new pre-training dataset, SARDet-180K, consisting of 186,600 SAR target samples drawn from all images of 14 open-source SAR target datasets, as described in Table III. This set aims, to the extent possible, to include common target categories (terrestrial and maritime targets such as vehicles, ships, aircraft, oil tanks, and bridges), scenes (typical scenes such as cities, harbors, airports, and oceans), and sensors (satellite, airborne, and simulation platforms of varying resolutions and bands).
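To make the slicing rule concrete, the sketch below cuts oversized images into overlapping tiles while keeping small images whole. The tile size, overlap, and function name are illustrative assumptions, not the exact parameters used to build SARDet-180K.

```python
import numpy as np

def slice_image(image: np.ndarray, tile: int = 512, overlap: int = 64) -> list:
    """Cut a large SAR image into overlapping square tiles.

    A minimal sketch of the slicing step: images whose longest side is at
    most 1,000 pixels are kept whole; larger ones are tiled so that small
    targets survive at full resolution.
    """
    h, w = image.shape[:2]
    if max(h, w) <= 1000:
        return [image]
    stride = tile - overlap
    tiles = []
    for y in range(0, max(h - tile, 0) + 1, stride):
        for x in range(0, max(w - tile, 0) + 1, stride):
            tiles.append(image[y:y + tile, x:x + tile])
    return tiles
```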
B. Model Architecture
Two model backbone types were considered for SAR target recognition. The first is the vision transformer (ViT) [85], commonly used in SSL and offering good scalability of model parameters. The second is ConvNeXt-V2 [86], which offers the same scalability as ViT while retaining the efficiency of a convolutional neural network. Beyond parameter scalability, remote sensing tasks must also consider image properties; for example, SAR targets typically exhibit a small foreground and a dynamic context range. Swin Transformers can outperform ViTs thanks to their hierarchical structure, yet they cannot drop patches in MIM to conserve computing resources. Therefore, we finally adopted a ViT variant, the hierarchical vision transformer (HiViT) [40], which improves the input spatial resolution while retaining the ViT properties needed for MIM.
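A toy experiment illustrates why repeated downsampling hurts small SAR targets; the numbers are illustrative and do not correspond to any specific ConvNeXt-V2 configuration.

```python
import torch
from torch import nn

# A bright 3x3 "vehicle" in a 224x224 scene: after each stride-2 stage its
# peak response is diluted into its neighborhood, so a backbone with many
# downsampling stages can lose the target entirely, whereas a plain ViT
# keeps one spatial resolution across all layers.
img = torch.zeros(1, 1, 224, 224)
img[..., 100:103, 100:103] = 1.0   # 3x3 point-like target
pool = nn.AvgPool2d(2)             # stand-in for a stride-2 stage
x = img
for stage in range(1, 6):
    x = pool(x)
    print(f"stage {stage}: map {tuple(x.shape[-2:])}, target peak {x.max():.3f}")
```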
C. Proposed Pre-Training Method
MIM was used as the pretext task for SSL pre-training, and masked autoencoders (MAE) [87] were employed to drop patches and conserve computational resources. MIM can help foundation models achieve SAR image interpretation by recognizing contextual relationships among objects (a key point when applying MIM). However, SAR is a coherent imaging modality whose speckle noise can interfere with the pretext task. As such, SARATR-X employs two pre-training steps to construct a foundation model, as shown in Fig. 3. The first step provides ImageNet weights to increase attention diversity and avoid interference from SAR speckle noise in the early stages of the second step. The second step uses multi-scale gradient features to suppress speckle noise throughout SAR pre-training.
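The patch-dropping mechanism shared by both steps follows MAE [87]; a minimal sketch is shown below, with shapes and names chosen for illustration.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """MAE-style random masking [87]: keep a random subset of patch tokens.

    tokens: (B, N, D) patch embeddings. Only the visible subset enters the
    encoder, cutting compute roughly by the mask ratio; SARATR-X applies
    the same idea with a HiViT encoder and MGF reconstruction targets.
    """
    B, N, D = tokens.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                     # one random score per patch
    keep_idx = noise.argsort(dim=1)[:, :n_keep]  # lowest scores are kept
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)
    mask.scatter_(1, keep_idx, 0.0)              # 1 = masked, 0 = visible
    return visible, mask, keep_idx
```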
The first step involves performing MIM with ImageNet data to obtain better initialization weights, since visible-light images contain more and richer signatures than SAR images. We simplified the multi-stage pre-training of MSFA, which performs SSL on ImageNet with a backbone, detection pre-training on DOTA with the whole framework, and detection fine-tuning on SAR images. SARATR-X instead uses the ImageNet pre-training weights directly as initialization for the SAR pre-training step. This approach enhances the diversity of attention during SAR pre-training, as shown in Fig. 5, whereas random initialization leads attention to converge toward the same pattern during SAR pre-training with MAE. The natural image weights also compensate for the lack of diversity in the bottom layers of HiViT during SAR image pre-training and the decreased diversity in the top layers due to MGFs. The ImageNet pre-training backbone weights were obtained from an open-source release to reduce pre-training time. We refer to this process of using pre-trained ImageNet weights as SSL-ImageNet & SAR.
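In code, the hand-off between the two steps reduces to initializing the SAR run from an open-source ImageNet MIM checkpoint. The sketch below uses a stand-in encoder and a hypothetical checkpoint file name; only the loading pattern is the point.

```python
import torch
from torch import nn

class TinyEncoder(nn.Module):
    """Stand-in for the HiViT encoder (illustrative only)."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

model = TinyEncoder()
# Step-one output: an MIM checkpoint pre-trained on ImageNet (the file name
# is hypothetical; any open-source MAE/HiViT ImageNet checkpoint plays this role).
ckpt = torch.load("hivit_base_imagenet_mim.pth", map_location="cpu")
state = ckpt.get("model", ckpt)
# strict=False: heads trained from scratch in the SAR step have no matching
# keys, so only overlapping encoder weights are loaded.
missing, unexpected = model.load_state_dict(state, strict=False)
print(f"{len(missing)} missing / {len(unexpected)} unexpected keys")
# Step two then continues MIM on SAR images with MGF guide signals (Sec. III-C).
```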
The second step involves performing MIM with SAR images. As mentioned previously, SAR image noise is a challenging problem investigated in FG-MAE, SAR-JEPA, and MSFA, which have discussed features such as Canny edges [88], HOG [89], Haar-like [90], SAR-HOG [91], and SAR-SIFT [92]. Although combining different features can achieve the best results [39], we follow our previous SAR-JEPA approach and use the simplest gradient features to avoid the excessive runtime of complex feature selection. As such, MGFs [37] were used to suppress speckle noise and extract target shapes.
Multi-scale gradient feature - Multiplicative speckle noise distorts strong scattering regions, interfering with SSL of target features. We therefore want to suppress the noise and use target edge information for learning. However, differential gradients produce false points in strong target regions due to the multiplicative speckle noise in SAR images [93], [94], because multiplicative noise yields scattering points with large amplitude variations within strong scattering regions, especially for strongly scattering metallic targets. Using a gradient ratio within a region improves stability by reducing the interference of speckle-induced strong and weak points inside target regions. In this study, MGF employs gradient by ratio [91], [92] to obtain the gradient features \begin{align*} R_{i} & = \frac {M_{1}(i)}{M_{2}(i)}, \tag {1}\\ G_{\text {H}} & = \log (R_{1}), \tag {2}\\ G_{\text {V}} & = \log (R_{3}), \tag {3}\\ G_{\text {m}} & = \sqrt {G_{\text {H}}^{2}+G_{\text {V}}^{2}}, \tag {4}\\ \text {MGF} & = \text {concat}(G_{\text {m1}}, G_{\text {m2}}, G_{\text {m3}}), \tag {5}\end{align*}
where $M_{1}(i)$ and $M_{2}(i)$ are the local means on the two opposite sides of the current pixel along direction $i$ (Eqs. (2) and (3) use directions 1 and 3 for the horizontal and vertical gradients $G_{\text{H}}$ and $G_{\text{V}}$), and $G_{\text{m}}$ is the gradient magnitude. Due to the dynamic range required for various targets in remote sensing [95], MGF is constructed with convolutional kernels of different sizes: we set the kernel scale $r$ to 9, 13, and 17 and concatenate the resulting magnitudes $G_{\text{m1}}$, $G_{\text{m2}}$, and $G_{\text{m3}}$ as in Eq. (5).
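A minimal PyTorch sketch of Eqs. (1)-(5) follows. One-sided box means stand in for the side averages $M_{1}$ and $M_{2}$ (SAR-SIFT-style exponentially weighted means would be a drop-in replacement), so the kernel weighting and function names are illustrative assumptions rather than the exact SARATR-X implementation.

```python
import torch
import torch.nn.functional as F

def gradient_by_ratio(amp: torch.Tensor, r: int) -> torch.Tensor:
    """Eqs. (1)-(4) at a single kernel scale r.

    amp: (B, 1, H, W) non-negative SAR amplitude image. One-sided r x r box
    means approximate M1/M2 on the two sides of each pixel (an assumption).
    """
    eps, half = 1e-6, r // 2
    k_right = torch.zeros(1, 1, r, r)
    k_right[..., :, half + 1:] = 1.0
    k_left = torch.zeros(1, 1, r, r)
    k_left[..., :, :half] = 1.0
    k_down = torch.zeros(1, 1, r, r)
    k_down[..., half + 1:, :] = 1.0
    k_up = torch.zeros(1, 1, r, r)
    k_up[..., :half, :] = 1.0
    kernels = torch.cat([k_right, k_left, k_down, k_up], dim=0)  # (4, 1, r, r)
    kernels = kernels / kernels.sum(dim=(2, 3), keepdim=True)    # box means
    m = F.conv2d(amp + eps, kernels, padding=half)               # (B, 4, H, W)
    g_h = torch.log(m[:, 0:1] / m[:, 1:2])   # Eq. (2): horizontal ratio
    g_v = torch.log(m[:, 2:3] / m[:, 3:4])   # Eq. (3): vertical ratio
    return torch.sqrt(g_h ** 2 + g_v ** 2)   # Eq. (4): gradient magnitude

def mgf(amp: torch.Tensor, scales=(9, 13, 17)) -> torch.Tensor:
    """Eq. (5): concatenate gradient magnitudes over the three scales."""
    return torch.cat([gradient_by_ratio(amp, r) for r in scales], dim=1)
```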
D. Evaluation With Recognition Tasks
Fine-grained classification datasets comprising vehicles, ships, and aircraft were merged to form a new SAR classification dataset called SAR-VSA (Vehicles, Ships, and Aircraft) in Table III. We aim to increase the number of classes and assess the proposed improvements in feature quality. SAR-VSA is used to compare SSL model performance under few-shot settings in Sec. IV. We then report SARATR-X results on existing classification and detection settings, datasets, and algorithms in Sec. V to show the powerful potential of a foundation model.
SARATR-X Experiments
We first performed SSL on the pre-training dataset without label information and then fine-tuned the pre-trained model on a classification dataset using few-shot classification tasks and linear probing settings to analyze the improvements made to SARATR-X. We also discuss the scalability of the proposed technique. Pre-training was performed on eight NVIDIA RTX 3090 GPUs. The SAR pre-training dataset, SARDet-180K, consists of 14 SAR datasets (see Table III). The few-shot SAR classification dataset, SAR-VSA, includes 25 fine-grained targets from three SAR datasets (MSTAR [36], FUSAR-Ship [22], and SAR-ACD [22]). Since it is difficult to ensure training convergence when fine-tuning all model parameters with small samples, we used linear probing [87], which includes a batch normalization layer, to adjust for differences in the statistical data properties and reduce the number of fine-tuned parameters. Detailed settings can be found in Appendix A in the Supplementary Material.
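For reference, a linear probing head of this kind can be sketched as below, following the MAE linear-probing recipe [87] with an affine-free batch normalization before the classifier; the feature dimension and class count are illustrative.

```python
import torch
from torch import nn

feat_dim, num_classes = 768, 25   # illustrative: base-size features, SAR-VSA classes
probe = nn.Sequential(
    # BatchNorm without affine parameters normalizes the frozen features'
    # statistics before the single trainable linear layer.
    nn.BatchNorm1d(feat_dim, affine=False, eps=1e-6),
    nn.Linear(feat_dim, num_classes),
)
# The pre-trained backbone stays frozen; only `probe` parameters are optimized.
features = torch.randn(8, feat_dim)   # stand-in for pooled encoder outputs
logits = probe(features)              # (8, num_classes)
```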
A. Comparison of Model Backbones
Table IV compares different model backbones used for SAR ATR, including ConvNeXt-V2 [86], ViT [85], and HiViT [40]. The results indicated that ViT outperformed ConvNeXt-V2. On the one hand, ViT is more flexible than ConvNeXt-V2 for learning contextual information in SAR images. On the other hand, multiple downsampling steps in ConvNeXt-V2 resulted in the loss of small targets, while ViT maintained the same spatial resolution in different layers. HiViT outperformed ViT by employing smaller patches in its early stages, preserving higher spatial resolution for small targets.
B. Strategy of Two-Step Pre-Training
Here, we discuss the two-step pre-training strategy, included to make full use of available model weights and SAR datasets. Table IV includes four pre-training settings: SL-ImageNet, SSL-ImageNet, SSL-SAR, and SSL-ImageNet & SAR. SL-ImageNet was pre-trained on ImageNet using supervised learning; SSL-ImageNet was pre-trained on ImageNet from scratch using SSL; SSL-SAR was pre-trained from scratch on our SAR pre-training dataset; and SSL-ImageNet & SAR pre-trained the model on the SAR dataset using initialization weights from SSL-ImageNet.
Notice that the additional supervised information introduced by SL-ImageNet did not necessarily improve SAR ATR performance (e.g., the linear probing performance of SL-ImageNet for HiViT was lower than that of SSL-ImageNet). SSL-SAR achieved better results than SSL-ImageNet using less data (12%), reflecting the large differences in target features between the two image types. However, ImageNet pre-training weights do provide good initialization of low-level features, such as shape and texture, for SSL on visible-spectrum remote sensing [98] and medical images [99]. Our experiments also confirmed that using SSL-ImageNet as initialization weights improved the pre-training performance on SAR images (see Table IV) and attention diversity (see Fig. 5). As such, SARATR-X employs the SSL-ImageNet & SAR setting to complement the richness of pre-training.
C. Design of Target Signals for SAR Images
After investigating the model backbone and learning strategy, we focused on target features for SSL methods used with SAR images. One key point for MIM is designing high-quality guide signals in the presence of the unique multiplicative speckle noise in SAR images; that is, we need to suppress noise while enhancing target features. As seen in Table V, we considered five target features in addition to raw pixel values [87]: a low-pass filter [96], HOG features [43], deep features [97], SAR-HOG [91], and gradient by ratio [91], [92]. All SSL methods use the SSL-SAR setting and the base model version. We first considered whether existing ViT-based methods were suitable for SAR image classification. PixMIM [96] applies a low-pass filter to remove high-frequency components, driving the model to focus on shape information. However, PixMIM did not outperform MAE because the noise in SAR is multiplicative and the filter parameters require a trade-off between target and noise. FG-MAE [43] uses HOG to capture SSL features in SAR scene-level tasks, though we found that HOG does not ensure accurate SAR target features: target regions typically exhibit strong scattering values, and the speckle noise they contain often causes gradient computations to produce strong false points in these regions. In addition, I-JEPA [97] proposed deep networks as target feature encoders to capture deep semantic features, but this can lead to training that overfits the noise and fails to learn effective features.
As such, we chose SAR features as target features to enhance HiViT. SAR-HOG changes the gradient calculation of HOG and uses gradient by ratio to handle the speckle noise, thereby outperforming pixel values and HOG. Inspired by PixMIM, we prefer to use the target shape (i.e., gradient features) directly as the target feature. In addition, multi-scale methods can improve the feature representations of the various small targets common in remote sensing. Discussions of the kernel settings used for computing gradients are illustrated in Fig. 4. Scale also affects feature quality: a smaller scale is finer for extracting small target edges, while a larger scale is better suited to large targets and noise suppression. Therefore, combining features of different scales (see Fig. 4) offers improvements across various image target sizes.
D. Analysis
As stated above, the primary contributions of SARATR-X can be summarized as follows: the HiViT architecture avoids the loss of small-target information; SSL-ImageNet & SAR uses ImageNet pre-training weights to provide good initialization for diverse perceptual capabilities; and MGF ensures high-quality target features while suppressing speckle noise during SSL with SAR images. By taking advantage of these insights, SARATR-X can learn high-quality target features from noisy SAR remote sensing images, as seen in Table IV. In the next section, we analyze the diversity and scalability of SARATR-X.
Visualization - Prior research [100] has shown that supervised pre-training and contrastive learning only model global information in higher layers, while MIM can model both local and global information. However, we observed that this effect relates not only to the chosen method but also to the data properties. Fig. 5 (a) shows that ViT with MAE focuses on global information due to large SAR image scenes, which differs from typical MIM modeling properties. HiViT, with its high spatial resolution and hierarchical structure, exhibits attention distances ranging from 40 to 140 across layers but shows decreased diversity in the bottom layers, as seen in Fig. 5 (b). Using ImageNet weights for initialization solves this problem, as seen in Fig. 5 (c). Similarly, Fig. 5 (d) shows that HOG features amplify noise interference, which limits feature diversity. MGF effectively extracts target shape information, focusing the model on diverse edge information in the lower layers; however, because it removes textures while preserving edges, higher layers rely less on texture details, diminishing the attention range shown in Fig. 5 (e). Thus, we combined SSL-ImageNet & SAR with MGF for two-step pre-training in Fig. 5 (f). Still, the benefits that other modalities, especially abundant visible images, bring to SAR data require further exploration. In the future, we plan to collect satellite and airborne multimodal paired data and use various features, including filtered pixels, handcrafted features, and learned features, to transfer both low-level feature extraction and higher-level semantic information. Continual learning [53], [101] is also needed to prevent catastrophic forgetting.
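The averaged attention distance plotted in Fig. 5 can be computed as below; this is a sketch of the standard metric under the simplifying assumptions of a square patch grid and no class token.

```python
import torch

def mean_attention_distance(attn: torch.Tensor, grid: int, patch_px: int) -> torch.Tensor:
    """Per-head averaged attention distance, as visualized in Fig. 5.

    attn: (heads, N, N) post-softmax attention over N = grid * grid patches.
    Returns the attention-weighted mean distance (in pixels) per head.
    """
    ys, xs = torch.meshgrid(torch.arange(grid), torch.arange(grid), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (N, 2)
    dist = torch.cdist(coords, coords) * patch_px                       # (N, N) pixels
    # Rows of attn sum to 1, so this is a weighted mean over key positions,
    # then averaged over query positions.
    return (attn * dist).sum(dim=-1).mean(dim=-1)                       # (heads,)
```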
Scaling experiment - Although MIM learns effectively and scales with data and model resources [121], a question arises as to whether our method ensures such scalability for MIM on noisy data such as SAR. Fig. 6 presents the results of a scaling experiment from three perspectives: dataset size, parameters, and training epochs. Despite our pre-training set comprising 186,660 images, smaller than ImageNet-1K, we observed a clearly rising curve of downstream task performance with increasing data and parameter quantities in Figs. 6 (a) and (b). This result indicates that the foundation model can realize its potential on SAR images by extracting high-quality features as guide signals. However, as observed in [121] with ImageNet subsets of about 100,000 images, the model tended to overfit during extended training epochs, and SAR image noise and low resolution further aggravated this overfitting. Even so, SARATR-X outperformed our previous study (SAR-JEPA), which overfitted at 400 epochs with 94,776 SAR images. Thus, there is a need to keep investigating ways to ensure high-quality feature representations when scaling SAR foundation models.
Scalability of SARATR-X for dataset size, parameters, and training epochs with linear probing performance. While our method benefited from these three attributes, it is important to note that excessive training epochs often led to overfitting due to the dataset size.
Leveraging SARATR-X for Recognition
We have discussed different aspects of SARATR-X, but many datasets and specialized models are available for SAR ATR. Therefore, we compared our proposed SARATR-X with other state-of-the-art techniques, including supervised and self-supervised methods, on public classification and detection benchmarks.
Classification task - Table VI details the performance of SARATR-X on the MSTAR [36] dataset under standard operating conditions (SOCs) and extended operating conditions (EOCs). Notice that SSL models (BIDFC and ours) and semi-supervised models (EUAPS) significantly outperformed other methods for small samples thanks to additional unlabeled data. Our results also surpassed the previous best by large margins: for example, SOC 1-shot accuracy increased by 4.5% and EOC 1-shot accuracy increased by 15.1% on average, demonstrating the value of FMs in an era of rapidly growing SAR data. In particular, SARATR-X exhibited robustness to the EOC settings with variable imaging conditions, indicating that the foundation model can learn stable features and relationships from diverse imaging conditions across a large number of samples.
Detection task - As illustrated in Table VII, we report box mAP for SAR target detection with horizontal bounding boxes for multi-category (SARDet-100K and OGSOD), ship (SSDD), and aircraft (SAR-Aircraft) detection. SARATR-X outperformed our previous MSFA by 0.8 points on SARDet-100K. Note that MSFA involves a more complex training process and target features, employing multi-stage training between RGB and SAR images with three different target features (HOG [89], Haar-like [90], and WSTG [122]) and a detection pre-training step; SARATR-X is thus simpler yet more effective for SAR images. Detection visualizations are shown in Fig. 7, where our method yields fewer missed detections and false alarms. Notably, SARATR-X outperformed or matched several specifically designed detection methods across multiple datasets, as shown in Table VII.
Visualization of detection on SARDet-100K. False alarms and missed detections are common in SAR images, especially when similar targets overlap and scenes are complex. While our method effectively improves detection by learning contextual information in the image, detecting targets in complex scenes and low-quality images remains very difficult.
Of course, our study is only a preliminary exploration of SSL for SAR image interpretation. More effective target features could be achieved in a data-knowledge dual-driven manner by further mining information on SAR imaging mechanisms and properties. Furthermore, given a larger dataset and computing power, the path of FMs will hopefully lead to generalized SAR interpretation, including target recognition, scene classification, semantic segmentation, and change detection, but this will require additional research.
Conclusion and Future Perspectives
This study proposed SARATR-X, a foundation model for SAR ATR. First, a pre-training dataset, SARDet-180K, was constructed from 14 open-source datasets covering various targets, scenes, and sensors. The foundation model's pre-training backbone, SSL method, and downstream tasks were then discussed in detail. Importantly, SARATR-X demonstrated superior performance on different target recognition datasets, highlighting the potential of FMs in this field. We believe that further research on SAR foundation models, including SARATR-X, can generalize feature representations of SAR images and benefit all-day, all-weather target recognition in Earth observation. However, FM research requires large amounts of data, while SAR images are expensive and require specific imaging equipment and algorithms, and privacy and security concerns often prevent data from becoming open source. We are therefore particularly grateful to the publishers of open-source SAR target datasets. By making SARATR-X publicly available, we aim to accelerate FMs for SAR target recognition by enabling researchers to use our code and weights to design better methods and explore downstream applications.
Although this work performs systematic investigations, several limitations and challenges will require exploration in future studies. The SAR images were derived from open-source SAR datasets, and the targets primarily include vehicles, ships, aircraft, and oil tanks; collecting target samples from increasingly available unlabeled imagery could further expand the data toward the million level and broaden the range of downstream applications. The pre-training image size [123] and automatic target slicing will become important issues to investigate, given the larger image size and contextual range of remote sensing images compared to natural images. In terms of model architecture, learning dynamic spatio-temporal contextual information is also important. Beyond target shape features, various scene and target signatures for SAR self-supervised learning need to be explored. Finally, investigating expert knowledge expressed as text for multimodal interactions and describing the relationships between targets and scenes could enhance the representation capabilities of FMs for SAR ATR.
In conclusion, we have demonstrated the ability of SARATR-X to adapt to diverse SAR target datasets, achieving high performance and generalizability in classification and detection tasks. By taking full advantage of the rapid growth of SAR images, this SSL-based foundation model opens the door to generalized SAR target recognition.