Introduction
With the advantages of high-resolution, day-and-night, and weather-independent imaging, synthetic aperture radar (SAR) has been widely used in both military and civilian fields [1]. Achieving SAR automatic target recognition (ATR) has been a long-standing goal for researchers in the remote sensing area. The past several years have witnessed the blossoming of deep learning-based SAR ATR methods [2], [3]. For the single-view recognition branch, a key challenge is that the feature used for classification must be highly robust to different observing azimuths, under which the target exhibits large intraclass variation. This obstacle becomes especially hard to overcome when the training samples are scarce. Many previous works show that, compared to single-view SAR ATR, using multiple SAR images from different viewing azimuths enables better and more robust recognition performance [4], [5]. In essence, images from different views not only give complementary descriptions of the target but also bring the discriminative inner correlation among views to target recognition [6]. Accordingly, leveraging multiple images from different aspects can greatly improve recognition accuracy.
However, traditional multiview recognition methods assume that the target is observed in a fixed pattern, that is, the viewing azimuth angles are randomly distributed or uniformly increasing, ignoring the possibility of active data collection. In contrast, active SAR ATR methods can autonomously seek more discriminative target images to achieve high-performance recognition. As shown in Fig. 1, when performing a multiview SAR target recognition task, the impact of a new observation on the recognition result based on previous observations can be positive or negative. Therefore, teaching the agent to actively observe the target based on past observations is very promising for raising recognition accuracy. In addition, SAR imaging costs considerable time, storage, and energy on the SAR-equipped platform. Active data collection can focus these resources on obtaining high-quality samples, thus improving the efficiency of performing the task.
Impact of observations under different azimuth angles on multiview SAR target recognition. For a classifier trained on a small dataset, a few target images under certain azimuths are conducive to recognizing the target, while the others would confuse its judgment based on past observations.
In this article, active target recognition (AcTR) in SAR imagery is defined as follows. The SAR-equipped platform serves as an agent, and it is able to autonomously determine the azimuth angle of the subsequent observation based on the observed images in the past (the observation at the initial moment is given), and all previous observations are combined to recognize the target at last, thereby improving classification accuracy using the same imaging time and storage resources. Furthermore, it should be noted that all the target images mentioned in this article, unless stated otherwise, refer to the SAR image slices containing the target.
The pioneering work on the AcTR problem in the remote sensing area appeared in [7], which proposed to improve the quality of data input to the ATR algorithm by optimizing sensor movement, sensor settings, or collaboration between sensing platforms, thus improving the recognition rate. Unfortunately, few follow-up studies were published after this work. Pei et al. [8] proposed a multiview SAR ATR method based on unmanned aerial vehicle (UAV) path planning, which uses sufficient measured data to approximate the optimization function and transforms the SAR AcTR problem into a constrained optimization problem. However, the premise of sufficient measured data is somewhat unrealistic, and under this condition the necessity of active data collection would be greatly reduced, because multiview recognition algorithms with passively observed data can already reach a satisfactory level.
According to the definition above, the AcTR task in SAR imagery requires both scene understanding and decision-making capabilities from the sensing platform. Deep reinforcement learning (DRL), which combines the powerful approximation ability of deep neural networks (DNNs) with the decision-making ability of reinforcement learning (RL), has made great progress in fields like robot control [9], adversarial games [10], and foundation model training [11]. This learning paradigm also suits the AcTR task very well, as validated by extensive studies and applications [12], [13]. The features of the observed images can be effectively extracted by a DNN to construct the state in a Markov decision process (MDP), and the agent can learn an active data acquisition strategy by interacting with the environment through an RL algorithm. However, although prevailing in the optical imagery area, DRL-based AcTR in SAR imagery remains underexplored. Hence, we tackle the SAR AcTR problem from the DRL perspective for the first time.
To cope with the three main difficulties in the AcTR in SAR imagery, this article proposes an azimuth aware DRL framework, referred to as AaDRL. First, a complete training environment is needed to provide plenty of interactions between the agent and the environment, so as to facilitate the policy learning. However, unlike the convenience of collecting optical images, obtaining measured SAR target images is costly and time-consuming, so its amount is usually insufficient to construct a complete training environment. To address this, we use the synthetic SAR images and a small number of measured samples to build a relatively complete training environment for the agent's policy learning, alleviating the difficulty of lacking measured samples.
Second, since SAR images are very sensitive to imaging settings, the classifier's recognition performance at various azimuth angles fluctuates across environments [14]. If the reward function is designed only based on the recognition result, its mapping can be very vulnerable to the environment's variation, and the agent's policy would easily fail when transferred from the training environment to the test one. Given the scarcity of training samples, the classifier is prone to overfitting and can only recognize the small portion of target images sharing the same or very close azimuths with the training samples. Hence, we design a simple view-matching task, where the RL algorithm of proximal policy optimization (PPO) [15] is utilized to help the agent learn how to search for target images as similar as possible to the training samples. With this design, the mapping of the reward function stays sufficiently robust to the environment's variation.
In addition, in the scenario of the sim-to-real active SAR target recognition problem, there are two premises for policy learning and transfer. First, the image features of different targets under various azimuths must be distinguishable by the policy network; second, among the features of training and test images, those with the same target class and azimuth should be matched, i.e., be the most alike. In the existing AcTR frameworks [12], [13], [14], [15], [16], the single-view image representations are directly borrowed from those used for class identification, blurring the differences among individual target images holding different azimuths within a single category. Besides, there is usually a distinct distribution gap between the measured and the simulated data [17]. Hence, massive state representation mismatches would occur when transferring the agent's policy to the test environment, which can also make the policy fail. In our AaDRL framework, contrastive learning is leveraged to train the single-view feature extraction module. In this way, an effective state representation is generated for the policy network, helping distinguish the characteristics of different targets at various azimuths. This practice can not only raise the training sample efficiency but also enhance the policy generalization capability in the test phase. Lastly, we conduct extensive experiments on the SAMPLE dataset [18] to demonstrate the effectiveness of the proposed framework.
The main contributions of this article are summarized as follows.
The DRL framework is employed to solve the AcTR problem in SAR imagery for the first time. Experimental results show a significant improvement in recognition rate by using the policy derived under the AaDRL framework compared to state-of-the-art policies.
A simple view-matching task is designed and modeled as an MDP, where the agent is guided with the PPO algorithm to learn how to find images that are easy to be recognized. With this design, the reward function in the MDP could remain robust to the environment's variation.
In our framework, the contrastive learning method is utilized to learn an effective representation that helps the agent distinguish the target images at various azimuths. This practice can not only raise the training sample efficiency but also enhance the policy generalization capability in the test phase.
Related Work
In this section, we introduce the related research on multiview SAR ATR and AcTR, respectively. Since the conventional target azimuth angle estimation task is close to the view-matching task in this article, we also illustrate the similarities and differences between the two.
A. Multiview SAR ATR
A large number of existing works [4], [5] suggest that, compared to single-view SAR ATR, using multiple SAR images from different viewing angles enables better and more robust recognition performance. Based on the means of information fusion, previous multiview SAR ATR methods can be generally divided into two categories: feature-level fusion and decision-level fusion. The former merges different image features after the feature extraction step, generating a new output that includes not only the targets' identity information under different aspects but also the correlation among them. Pei et al. [4] proposed a deep learning-based multiview SAR ATR framework, whose main idea is to extract features from the input images with a convolutional neural network (CNN) and concatenate all intermediate features layer by layer; the final classification result is derived from the feature absorbing the information of all images. Bai et al. [19] utilized the bidirectional long short-term memory (Bi-LSTM) network to merge feature vectors extracted by a CNN and trained the whole CNN-LSTM model in an end-to-end manner, allowing the model to extract features from single-view images while mining the correlation in the image sequence. Similar to [19], Li et al. [20] proposed a convolutional-transformer network, in which a convolutional auto-encoder is first pretrained to serve as the feature extractor, and the encoder part of the Transformer is used to explore the intrinsic correlation among feature vectors.
In contrast, the decision-level fusion method focuses on merging the classification results of the individual target images and can be further grouped into two kinds [6]. One is the parallel decision fusion method, which assumes that the target images from different viewpoints are independent and directly fuses all classification results. In this regard, Huan et al. [21] applied principal component analysis (PCA)-based and ranking-based parallel decision fusion methods. The other is the joint decision fusion method, which utilizes the intrinsic correlation between different images when calculating the classification results of the images under each azimuth angle and then fuses all the classification results. A representative work is [5], where Zhang et al. applied joint sparse representation (JSR) to multiview SAR ATR. For simplicity, the multiview classification in this article adopts the parallel decision-level fusion method, where the classification results of all individual target images are fused by summing, as sketched below.
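For concreteness, a minimal sketch of this summation-based parallel decision fusion is given below. It is our illustration rather than code from this article, and the choice of summing softmax posteriors (rather than raw classifier scores) is an assumption.

```python
import numpy as np

def fuse_decisions(view_logits):
    """Parallel decision-level fusion: convert each view's logits into a
    class posterior, sum the posteriors across views, and pick the class
    with the largest total score."""
    # Numerically stable softmax per view (rows of shape (V, C)).
    e = np.exp(view_logits - view_logits.max(axis=1, keepdims=True))
    posteriors = e / e.sum(axis=1, keepdims=True)
    fused = posteriors.sum(axis=0)        # elementwise sum over the V views
    return int(np.argmax(fused))          # fused class prediction

# Example: fuse three views of a four-class problem.
rng = np.random.default_rng(0)
print(fuse_decisions(rng.normal(size=(3, 4))))
```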
B. Active Target Recognition
AcTR, or active object recognition (AOR), an important branch of active vision, is a continuous decision-making process during which the observation platform reduces the uncertainty of target recognition by adjusting its position or observation angle to obtain more favorable information. Meanwhile, actively observing targets can focus imaging resources on the images with more discriminative features, thus improving the efficiency of the observation platform in performing the recognition task.
Since the introduction of the pioneering work [22], AcTR has received extensive attention, and researchers have leveraged tools such as attention mechanism [23], information theory [24], [25], and RL [12] to preferentially select target observation perspectives to reduce the uncertainty of the target identity. Among them, the basic idea of the information theory method is to select the target image that can bring the maximum information gain. Methods such as Monte Carlo sampling [25], Gaussian process regression [26] can be used to estimate the information gain of different viewpoints. Paletta and Pinz [12] modeled the observation viewpoint selection problem as an MDP, where the reward function is designed based on the reduction of Shannon's entropy, and the state is defined as the fusion of the previously observed images' features. The Q-learning algorithm is used to help learn the viewpoint selection strategy, which guides the agent to search for the observation viewpoints that can bring out the most discriminative information.
The methods mentioned above are based on handcrafted features. With the revival of deep learning, DRL started to shine in the active vision field. Malmir et al. [27] first used DRL to solve the AOR problem by enabling the agent to learn viewpoint selection strategies directly from raw image sequences. However, their feature extraction module was obtained by pretraining on the ImageNet dataset instead of being trained end-to-end along with the overall model. In contrast, Jayaraman and Grauman [16] proposed an end-to-end AcTR framework, where the single-view processing, information fusion, and decision-making modules are trained simultaneously; however, to improve the sample efficiency of training, the authors also used a pretrained feature extraction module in the actual experiments. By reviewing [12], [13], [14], [15], and [16], we find that all their single-view image representations are directly borrowed from those used for class identification, blurring the differences among individual target images within a single category. A well-trained agent selects its action based on the observed state, and the state is constructed from multiple single-view features. This representation method may succeed in AOR tasks on optical images, since the training and testing data usually share the same distribution. In the sim-to-real context, however, there is always a distinct distribution gap between these two kinds of data (see [17] for graphic proof), so it is much harder to match the feature representations; consequently, an agent trained in the training environment would easily mistake the state representation and take the wrong action in the test environment.
The key to this feature mismatch problem is to distinguish each individual target image, which is similar to the instance discrimination task in computer vision. To achieve this, a contrastive unsupervised learning method is utilized in this article.
C. Azimuth Estimation for Targets in SAR Imagery
Target azimuth estimation can provide important information for SAR image target recognition or other interpretation tasks [28]. For example, template-based recognition algorithms need to match the samples with all possible templates, which imposes a huge computational burden. However, if the azimuths of the test samples are known, the amount of algorithmic computation required can be greatly reduced [29]. Wen et al. [30] used target azimuth information as a self-supervised signal to help CNN learn the representation with better generalization ability for target recognition. Common target azimuth estimation methods for SAR images include sparse representation [29], corner point estimation [31], and so on.
The view-matching task designed in this article is similar to but also different from the target azimuth estimation task. Both must be trained to distinguish the target images under different azimuth angles to make viewpoint selections or label them. If the total number of decision steps in the task equals 1, the two tasks are essentially the same. However, when this number is greater than 1, the former becomes a sequential decision-making problem, which cannot be well solved by azimuth estimation alone. Instead, the agent needs to learn an effective policy to maximize the expectation of reward summation after multiple decisions, i.e., return.
Methodology
This section focuses on the proposed AaDRL framework. First, in Section III-A, the scenario of active SAR target recognition and the details concerning MDP modeling are described. Section III-B provides a general introduction to the proposed AaDRL framework. Sections III-C and III-D introduce the feature extraction module for single-view processing and the training process of the agent using the RL algorithm, respectively.
A. Problem Formulation
The assumed practical scenario is shown in Fig. 2. Considering the UAV's advantages of mobility and flexibility, the UAV-borne SAR imaging platform is adopted as the autonomous decision-making agent in the RL framework, and the action selection corresponds to the change of the UAV's azimuth angle when observing the target. The agent mainly consists of two functional modules: the classifier and the decision-maker. The former is fine-tuned on a small number of labeled measured samples, and the latter is trained in an environment built from a mixture of measured and synthetic SAR images.
Scene of AcTR using the SAR-equipped UAV platform. By sequentially altering the observing azimuths, the well-trained agent can obtain multiple target images, based on which it continues to make the next decision. The final recognition result is derived from all the newly collected images.
Suppose that the number of target classes is $N$, and $M$ labeled measured samples are available for each class. The measured training set is then defined as
\begin{equation*}
{\mathbf {D}_{\text{train}}} = \lbrace {\bm {x}}_\theta ^{n}|n \in \lbrace 0,1,{\ldots },N - 1\rbrace, \theta \in \lbrace \tilde{\theta } _{1}^{n},{\ldots },\tilde{\theta } _{M}^{n}\rbrace \rbrace \tag{1}
\end{equation*}
The dataset used for the agent's training includes both the measured and simulated data, which is defined by
\begin{align*}
\mathbf {D}_{\text{train}}^{\text{RL}} = \lbrace {\hat{\bm{x}}}_\theta ^{n}|n \in \lbrace 0,1,{\ldots },N - 1\rbrace, \\
\theta \notin \lbrace \tilde{\theta } _{1}^{n},{\ldots },\tilde{\theta } _{M}^{n}\rbrace, \theta \in \mathbf {\Pi } \rbrace \cup {\mathbf {D}_{\text{train}}}.\tag{2}
\end{align*}
Correspondingly, the dataset used for the agent's testing, composed of measured samples, is defined by
\begin{equation*}
\mathbf {D}_{\text{test}}^{\text{RL}} = \lbrace {\bm {x}}_\theta ^{n}|n \in \lbrace 0,1,{\ldots },N - 1\rbrace, \theta \in \mathbf {\Pi } \rbrace. \tag{3}
\end{equation*}
Flowchart of the interaction between the agent and the environment. At each time step, the agent selects the next observing azimuth based on the state aggregated from past observations, and the environment feeds back the new target image and the reward.
The purpose of changing the observing angle of the UAV is to provide the classifier with more recognizable target images, based on which we can design the reward function. To this end, we first analyze the classification performance of the classifier with the variation of the azimuth angle. Fig. 4 gives several instances for the qualitative analysis, presenting the classification results of the four types of targets (2S1 rocket launcher, BMP2 infantry fighting vehicle, BTR70 armored personnel carrier, and T72 tank) in the test set $\mathbf{D}_{\text{test}}^{\text{RL}}$.
Classification performance versus azimuth angle for four types of targets in the test environment. The green bars denote whether an image is correctly recognized, with 1 for successful recognition and 0 for the opposite. The yellow solid line indicates how far the classification results deviate from the correct labels. The number of training samples per class is $M$.
Based on the findings above, we design the view-matching task, through which the agent is guided to search for target images that are closer to or even equivalent to the training samples, and the reward function for the target of the $n$th class is designed as
\begin{align*}
r({s_{t}},{a_{t + 1}}) =& \frac{1}{{\left({\mathop {\min }\limits _{m} \left| {\theta _{t + 1}} - \tilde{\theta } _{m}^{n} \right| + a} \right)}^{2}} - b\\
&+ \operatorname{check\_redundance}({\theta _{t + 1}},({\theta _{1}},{\ldots },{\theta _{t}}))\tag{4}
\end{align*}
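A minimal sketch of this reward is given below. The constants $a$ and $b$ and the exact behavior of $\operatorname{check\_redundance}$ are not restated numerically here, so the values and the duplicate-azimuth penalty in the code are illustrative assumptions, as is the use of circular angular distance.

```python
import math

def angular_dist(x, y):
    # Smallest absolute azimuth difference, wrapping around 2*pi.
    d = abs(x - y) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def reward(theta_next, train_azimuths, visited, a=0.1, b=1.0, penalty=-1.0):
    """Reward of (4): large when the newly selected azimuth is close to a
    training-sample azimuth; a redundancy term penalizes revisits.
    a, b, and penalty are placeholder values for illustration."""
    d_min = min(angular_dist(theta_next, t) for t in train_azimuths)
    r = 1.0 / (d_min + a) ** 2 - b
    # check_redundance: discourage observing an already-visited azimuth.
    if any(angular_dist(theta_next, v) < 1e-6 for v in visited):
        r += penalty
    return r
```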
Next, we define the state in the MDP. According to the Markovian property, the state $s_{t}$ should aggregate the information of all the images observed up to time step $t$, i.e.,
\begin{equation*}
{s_{t}} = aggregator(f({{\bm {o}}_{0}}),f({{\bm {o}}_{1}}),{\ldots },f({{\bm {o}}_{t}})) \tag{5}
\end{equation*}
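The exact form of the aggregator is not restated here; for illustration, the sketch below instantiates it as mean pooling over the single-view features, one simple permutation-invariant choice that accepts a variable number of views.

```python
import torch

def aggregator(features):
    """Fuse a variable-length list of single-view feature vectors into a
    fixed-size state s_t; mean pooling is an assumed instantiation."""
    return torch.stack(features, dim=0).mean(dim=0)

# Example: three 128-D view features yield one 128-D state.
state = aggregator([torch.randn(128) for _ in range(3)])
```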
In the scenario shown in Fig. 2, the agent's action is defined as the UAV platform selecting the azimuth change for the next observation from a discretized set, i.e.,
\begin{equation*}
a \triangleq i\Delta \theta, \quad i = 0,1,{\ldots }, \left\lfloor \frac{2\pi }{\Delta \theta } \right\rfloor. \tag{6}
\end{equation*}
\begin{equation*}
\theta _{t + 1} = \left(\theta _{t} + a_{t + 1} \right)\bmod 2\pi, \quad \theta _{t}, \theta _{t + 1} \in \mathbf {\Pi }. \tag{7}
\end{equation*}
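The following snippet illustrates (6) and (7). The step size $\Delta\theta$ is not fixed by the equations; the 4° value below is only an assumption borrowed from the sequential-sampling interval used later in the experiments.

```python
import math

DELTA = math.radians(4)                     # azimuth step (assumed value)
I_MAX = math.floor(2 * math.pi / DELTA)     # (6): i = 0, 1, ..., floor(2*pi/DELTA)

def next_azimuth(theta_t, i):
    """(6)-(7): the action is the increment a = i*DELTA, and the new
    azimuth wraps around modulo 2*pi."""
    assert 0 <= i <= I_MAX
    return (theta_t + i * DELTA) % (2 * math.pi)
```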
Considering that the target images collected at all time steps contribute equally to the final recognition result, the discount factor $\gamma$ is set to 1. The agent aims to learn the optimal policy
\begin{equation*}
{\pi ^ * } = \arg \mathop {\max }\limits _\pi {E_{(s,a) \sim {\rho _\pi }}}\left[\sum \limits _{t = 0}^{T - 1} {{\gamma ^{t}}r({s_{t}},{a_{t + 1}})} \right]. \tag{8}
\end{equation*}
B. Overview of the AaDRL Framework
Based on the MDP modeled above, an AcTR framework for SAR imagery is proposed, which mainly includes four modules: a single-view feature extractor, a feature aggregator, a classifier, and a policy network. The forward inference flowchart of AaDRL is shown in Fig. 5. Assume there are $T$ decision steps in one episode.
Inference flowchart of the proposed AaDRL framework. The top row depicts the high-level inference workflow of the AaDRL framework, which is also another demonstration of Fig. 2. The bottom part describes the details of imaging, feature extraction, aggregation, and taking action within a single time step. The single-view feature extractor, pretrained with the contrastive learning method, is used to process the newly and historically acquired target images. According to the feature fusion result, namely the state $s_{t}$ in (5), the policy network selects the next action.
In the proposed framework, the single-view feature extractor is purely convolutional, constructed by modifying the last layer of A-ConvNet [2], which has proved efficient in SAR image feature extraction. As illustrated in Fig. 5, at the $t$th time step, the newly acquired image ${\bm{o}}_{t}$ is mapped by the extractor into the feature vector $f({\bm{o}}_{t})$, which is then aggregated with the historical features to form the state.
During the training phase, if we rely only on the backpropagation from the RL algorithm to update the feature extractor, the sample efficiency would be too low to derive an effective representation. To this end, we adopt a pretraining approach. In the scenario of the sim-to-real active SAR target recognition problem, there are two premises for policy learning and transfer. First, the image features of different targets under various azimuths must be distinguishable by the policy network; second, among the features of training and test images, those with the same target class and azimuth should be matched, i.e., be the most alike. In the existing AOR frameworks [16], [27], the single-view feature extraction module is usually pretrained by performing a category classification task. However, this kind of practice may not fit the sim-to-real active SAR target recognition problem in this article well. With pretraining through category classification, the distinction among target images sharing the same class but holding different azimuths is blurred. In addition, there is usually a distinct distribution gap between the measured and the simulated data [17]. Hence, the features of the images sharing the same class and azimuth in the training and test environments can hardly match each other, which contradicts the second premise. In contrast, we adopt the contrastive learning method to pretrain the single-view feature extractor. By enlarging the distances among all the image features in the training environment, the policy network can better match the features corresponding to the same target class and azimuth in the training and test environments.
The pseudocode for the forward inference of the AcTR algorithm derived under the AaDRL framework is given by Algorithm 1.
Algorithm 1: The Active Target Recognition Algorithm Derived Under the AaDRL Framework.
Input: the initial observation ${\bm{o}}_{0}$ with azimuth $\theta_{0}$, and the number of decision steps $T$
Output: the predicted target label
state $s_{0} = aggregator(f({\bm{o}}_{0}))$
loop: for $t = 0,1,{\ldots },T - 1$, select the action $a_{t+1}$ with the policy network according to $s_{t}$, observe ${\bm{o}}_{t+1}$ at the azimuth $\theta _{t+1} = (\theta _{t} + a_{t+1})\bmod 2\pi$, and update the state $s_{t+1} = aggregator(f({\bm{o}}_{0}),{\ldots },f({\bm{o}}_{t+1}))$
return the label predicted by fusing the classifier's results on ${\bm{o}}_{0},{\ldots },{\bm{o}}_{T}$
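The sketch below translates Algorithm 1 into runnable Python. The environment interface (`env.reset`, `env.observe`), the greedy action selection at test time, and the mean-pooling aggregator are our assumptions.

```python
import torch

@torch.no_grad()
def active_recognition(env, extractor, policy, classifier, T):
    """Forward inference of Algorithm 1: observe, update the state,
    act, and finally fuse the per-view decisions."""
    obs = [env.reset()]                           # initial observation o_0
    for t in range(T):
        feats = [extractor(o) for o in obs]
        state = torch.stack(feats).mean(dim=0)    # aggregator of (5)
        action = policy(state).argmax()           # greedy action at test time
        obs.append(env.observe(action))           # image at the new azimuth
    # Parallel decision-level fusion: sum the per-view class posteriors.
    posteriors = torch.stack([classifier(o).softmax(-1) for o in obs])
    return posteriors.sum(dim=0).argmax().item()
```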
C. Model Pretraining Based on Contrastive Learning
In this article, we pretrain the single-view feature extractor under the contrastive learning framework of SimCLR [33]. During the training, each sample corresponds to a target image of a certain class at a certain azimuth, and the batch size is $Q$. Two augmented views are generated for every sample, yielding $2Q$ views per batch, and the loss for a positive pair $({\bm {z}}_{i},{\bm {z}}_{j})$ is
\begin{equation*}
{l_{i,j}} = - \log \frac{{\exp (\text{sim}({{\bm {z}}_{i}},{{\bm {z}}_{j}})/\tau) }}{{\sum \nolimits _{k = 1}^{2Q} {{{\bm {1}}_{[k \ne i]}}\exp (\text{sim}({{\bm {z}}_{i}},{{\bm {z}}_{k}})/\tau) } }} \tag{9}
\end{equation*}
and the overall contrastive loss is averaged over all positive pairs in the batch
\begin{equation*}
{\mathcal {L}}_{cl} = \frac{1}{{2Q}}\sum \limits _{q = 1}^{Q} {({l_{2q - 1,2q}} + {l_{2q,2q - 1}})}. \tag{10}
\end{equation*}
By minimizing the loss functions above, the single-view feature extractor could learn a proper representation, so as to recognize the targets' characteristics under different azimuths.
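A compact PyTorch sketch of the loss in (9) and (10) follows; the pairing convention (rows $2q$ and $2q+1$ holding the two augmented views of the $q$th sample, 0-indexed) and the temperature value are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z, tau=0.5):
    """Contrastive loss of (9)-(10) for a batch of 2Q projected features,
    where rows (2q, 2q+1) are the two views of the q-th sample."""
    z = F.normalize(z, dim=1)              # cosine similarity via dot products
    sim = z @ z.t() / tau                  # (2Q, 2Q) similarity matrix
    sim.fill_diagonal_(float('-inf'))      # drop the k == i terms in (9)
    # The positive of row 2q is row 2q+1 and vice versa (index XOR 1).
    targets = torch.arange(z.size(0), device=z.device) ^ 1
    return F.cross_entropy(sim, targets)   # averages l_{i,j} over the batch

loss = nt_xent_loss(torch.randn(8, 128))   # Q = 4 samples, 2Q = 8 views
```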
The commonly used data augmentation operations for SAR images include random cropping, noise adding, occlusion, rotation, and so on. Within the AaDRL framework, the learned representation is also supposed to overcome the cross-domain generalization problem. Since the main difference between the measured and synthetic SAR images lies in the distribution and strength of scatterers within the target and background, we adopt noise adding as the augmentation operation.
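As an illustration of this augmentation, the snippet below perturbs a normalized amplitude image with additive Gaussian noise; the noise model and standard deviation are assumptions, and Section IV-D examines how the noise strength affects the learned policy.

```python
import torch

def add_noise(img, sigma=0.1):
    """Noise-adding augmentation for contrastive pretraining: perturb the
    normalized SAR amplitude image; sigma is a placeholder value."""
    return (img + sigma * torch.randn_like(img)).clamp(0.0, 1.0)
```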
It should be noted that during the upcoming policy learning process, we keep the parameters of the pretrained feature extractor constant to guarantee its stable capability of differentiating among all kinds of targets at different azimuths.
D. Agent's Policy Learning Based on PPO-Clip Algorithm
The state in the MDP is formed by extracting and aggregating the features of the observed image sequence, based on which the policy network selects the action. To confer the ability of autonomous decision-making on the agent, we need to train the policy network based on the interactive experiences between the agent and the environment. In this article, the PPO-clip algorithm [15] is adopted for policy learning, which improves the sample efficiency and training stability of the policy gradient algorithm through importance sampling and restricting the interval for network parameter updating. Because of its superior performance and wide applicability to various sequential decision-making tasks, the PPO-clip algorithm often serves as a baseline in RL research. Its optimization objective is written as
\begin{align*}
\mathop {\max }\limits _\xi {E_{(s,a) \sim \rho _{\pi _{\xi ^{\prime }}}}}\bigg[\min \bigg(&\frac{\pi _\xi (a|s)}{\pi _{\xi ^{\prime }}(a|s)}{A^{\pi _{\xi ^{\prime }}}}(s,a), \\
&\text{clip}\left(\frac{\pi _\xi (a|s)}{\pi _{\xi ^{\prime }}(a|s)},1 - \varepsilon, 1 + \varepsilon \right){A^{\pi _{\xi ^{\prime }}}}(s,a)\bigg)\bigg]\tag{11}
\end{align*}
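Written as a loss for gradient descent, the clipped surrogate objective in (11) takes the following form; the clipping parameter value below is the common default rather than a setting reported here.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantage, eps=0.2):
    """Negated clipped surrogate of (11). logp_old and advantage come
    from the behavior policy's rollout and are treated as constants."""
    ratio = torch.exp(logp_new - logp_old.detach())
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()
```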
Experiments
In this section, extensive experiments are conducted to verify the effectiveness of the proposed framework in improving the target recognition accuracy. First, we introduce the paired synthetic and measured SAR image dataset named SAMPLE. Next, experiment settings are introduced, and the effectiveness of the proposed framework is verified by comparing it with other active data acquisition policies under different testing conditions. Lastly, the ablation study on the AaDRL framework is conducted to analyze the effect of the contrastive learning method on policy learning and transfer.
A. Dataset and Experimental Settings
1) Dataset Description
The SAMPLE dataset, released by the US Air Force Research Laboratory in 2019, mainly includes synthetic SAR images of various vehicle targets under different observation conditions. Except for the background, the target configuration, sensor parameters, observation depression angle, azimuth angle, etc., are kept consistent with those of the measured SAR images in the MSTAR dataset [34] collected by Sandia National Laboratory. Therefore, the SAMPLE dataset provides a good benchmark for studying the differences between simulated and measured SAR images and the transfer of recognition algorithms. The publicly available part of the SAMPLE dataset contains the synthetic SAR image slices of ten ground military vehicle targets (2S1 self-propelled rocket launcher, BMP-2 and BTR-70 armored vehicles, M35 and M548 trucks, M1, M2, M60, and T-72 armored fighting vehicles, and ZSU-234 air defense unit), whose azimuth angles range from 10° to 80° and depression angles from 15° to 17°. The SAR sensor works at the X-band while imaging, with a resolution of 0.3 m. The optical images, measured SAR images, and corresponding synthetic SAR image slices of these ten types of targets are shown in Fig. 6. To reduce the interference caused by the target background clutter, all the slices in the dataset are center-cropped to a size of 60 × 60 pixels.
Optical and SAR image slices concerning vehicle targets of ten classes in the SAMPLE dataset. By matching target configurations and sensor parameters, details in the measured and synthetic data, including shadows, orientation, scatter distribution, and magnitudes, are in good agreement [18]. Since the producer of this dataset did not align the ground planes of the synthetic and the measured images, the backgrounds of the former are somewhat darker.
2) Experimental Settings
Since we focus on SAR target recognition with a few training samples, only $M$ measured samples per target class are used to construct the measured training set ${\mathbf{D}_{\text{train}}}$.
For each episode in the training phase, a target image with a random azimuth is given to the agent at the beginning, rewards are fed back from the environment according to the states and the agent's actions, and the training loss in the RL algorithm is computed accordingly. During the test phase, the agent samples the target images in $\mathbf{D}_{\text{test}}^{\text{RL}}$.
B. Implementation Details
1) Hyperparameter Settings
In the PPO-clip algorithm, the main hyperparameters are the clipping parameter $\varepsilon$ in (11) and the discount factor $\gamma$, with $\gamma$ set to 1 as discussed in Section III-A.
During the training of the single-view feature extractor based on the SimCLR framework, the projection layer used is a two-layer fully connected network, and the temperature coefficient $\tau$ in (9) is chosen according to the analysis in Section IV-D.
2) Compared Policies
Since this article is a preliminary exploratory work in the field of AcTR in SAR images, very few studies are available for comparison. The policies added to the comparison consist of two kinds: heuristic policies, e.g., random sampling [35] and sequential sampling [19], and policies derived under other AcTR frameworks [16], [27] proposed for optical images. Below is a detailed introduction to these baselines.
Random sampling: With this policy, the agent randomly selects an angle from the available set $\mathbf{\Pi}$ as the azimuth of the next observation. To ensure the fairness of comparison, we add a constraint to the random sampling process so that the agent will not get duplicate target images in each episode.
Sequential sampling: With this policy, the azimuths of the selected target images increase uniformly. We set the azimuth interval for sequential sampling to 4°.
Framework A: In the framework proposed by [27], the state is represented by the accumulated belief at each time step, i.e., the elementwise product of the single-view posterior beliefs over the target identity. Framework A represents the altered AaDRL framework whose single-view feature extractor is replaced by the corresponding part used in [27].
Framework B: Similar to [16], Framework B denotes the altered AaDRL framework whose single-view feature extractor is the backbone of the ResNet18 network, first pretrained on ImageNet and then fine-tuned on $\mathbf{D}_{\text{train}}^{\text{RL}}$.
C. Policy Comparison
This subsection first compares the performance of various policies under different conditions. Subsequently, we present and analyze the training processes of the agent's policy under different AcTR frameworks.
Table I demonstrates the target recognition performance comparison among the various approaches under different test conditions. Since measured SAR target images are very hard to obtain in reality, the measured sample amount $M$ for each class is kept small in the experiments.
1) Comparison With Heuristic Policies
We visualize in Fig. 7 the performance comparison among the first three active data acquisition policies under the first measured-sample setting in Table I.
Performance comparison among different active data acquisition policies.
Similarly, Fig. 8 presents the comparison result among the first three active data acquisition policies under the other measured-sample setting.
2) Comparison With Other AcTR Frameworks
In the following, we present and analyze the training processes of the agent's policy under different AcTR frameworks. Let the number of measured samples for each target type be $M$.
Return curves for the agent during the training and test phases under different AcTR frameworks. In the training phase, the agents can learn effective control policy under all three frameworks, while in the test phase, for frameworks A and B, the agent's policies fail to transfer to the test environment. In contrast, although it is influenced by the generalization gap, the policy learned under the AaDRL can successfully improve the return value.
Average minimum distance curves for the agent in the training and test phases under different AcTR frameworks. Corresponding to Fig. 9, all three curves of average minimum distance gradually decrease with the ongoing iterations at training time. However, the policies under frameworks A and B fail to perform the view-matching task in the test environment, whereas the policy under the AaDRL framework can still help the newly acquired target images approximate the training samples.
Recognition accuracy curves for the agent in the test environment under different AcTR frameworks. With the AaDRL framework, the policy learned brings a recognition rate gain of about 10% at convergence, while under the other two frameworks, the recognition rate can barely be improved by the agent's policy.
Fig. 9 demonstrates the return curves for the agent during the training and test phases under different AcTR frameworks. In the training environment, as the number of episodes increases, the return values achieved by the policies under all three frameworks gradually increase, although in terms of the extent of improvement, the proposed AaDRL framework is superior to the other two counterparts. In the test phase, for frameworks A and B, the return remains nearly unchanged over the iterations, reflecting that the policies learned under these frameworks fail to transfer to the test environment. For the AaDRL framework, although its return improvement is weakened compared to that in the training environment, i.e., there is an evident generalization gap in the policy transfer, its policy can still successfully improve recognition accuracy in the test environment.
From another perspective, recalling (4), it can be inferred that as the return value rises, the azimuths of the target images newly obtained after decision-making get progressively closer to the azimuths of the training samples. This hypothesis is validated by Fig. 10, where all three curves of average minimum distance gradually go down with the increasing episodes in the training environment, showing that the agent can perform the view-matching task well under all three frameworks. In the test phase, although influenced by the generalization gap in policy transfer, the policy under the AaDRL framework can still help the newly acquired target images approximate the training samples. However, corresponding to the flat curves in Fig. 9, the policies under frameworks A and B fail to reduce the average minimum distance in the test environment.
Fig. 11 shows the recognition rate curves of the classifier using different active data acquisition policies in the test environment. Based on the analysis in the MDP modeling, the value of the designed reward function is positively correlated with the final recognition performance; hence, we expect the policies derived under frameworks A and B to do little to improve the target recognition rate, which is validated by the two flat curves in the figure. By contrast, with the policy under the AaDRL framework, the target recognition rate gains about 10% at convergence.
From Figs. 9–11, with the increasing training episodes, all the frameworks successfully raise the agent's return and lower the average minimum distance in the training phase, indicating that the agent can learn a useful policy in the training environment under each framework, although in terms of the extent of indicator improvement, the proposed AaDRL framework is superior to the other two counterparts. In the test phase, for frameworks A and B, all three indicators remain nearly unchanged as the number of training episodes increases, which means the policies learned under these frameworks fail to transfer to the test environment. In contrast, although influenced by the generalization gap, the policy learned under AaDRL successfully improves recognition accuracy in the test environment. The key to this discrepancy lies in the representation capability of the single-view extractor, as explained in the related work. With a single-view extractor directly borrowed from that used for class identification, the differences among individual target images holding different azimuths within a single category are blurred, and it is nearly impossible to match the feature representations of training and test data sharing the same azimuth. In this way, the agent trained in the training environment can easily mistake the state representation and take the wrong action in the test environment, thus causing the failed policy transfer.
To further illustrate the advancement of the AaDRL framework, we visualize the numerical results in Table I, making the comparison more intuitive. Except for the improvement brought by inputting multiple observations, the policies obtained under frameworks A and B contribute little to raising the recognition performance because of the failure in policy transfer. From both Figs. 12 and 13, we can tell that the policy derived under framework A performs slightly better than that of framework B under some test conditions.
Performance comparison among policies derived under three kinds of AcTR frameworks.
D. Analysis
1) Ablation Study
First, we experimentally verify the necessity of unsupervised pretraining of the single-view feature extractor in the AaDRL framework. The framework used for comparison is generally the same as the proposed framework, except that its feature extraction module is trained from scratch along with the policy network. Both of them are trained using the same dataset $\mathbf{D}_{\text{train}}^{\text{RL}}$.
Fig. 14 demonstrates the impact of contrastive unsupervised pretraining on the AaDRL framework. When trained from scratch, the network parameters in the single-view feature extractor are updated by backpropagation from the PPO-clip algorithm. It can be noticed from Fig. 14 that the agent fails to learn an effective policy in this way. Two reasons may account for this failure: first, the dataset used is not abundant enough; second, the distinction between neighboring SAR image slices is slight. Hence, it is fairly hard to extract an effective representation from the observed images purely through end-to-end learning. In contrast, with the help of contrastive unsupervised pretraining, the feature extractor can offer the policy network an effective representation, enabling the agent to perform the view-matching task as expected.
Impact of the contrastive unsupervised pretraining on the agent's learning under the framework of AaDRL. The comparison above reflects the necessity of using the contrastive learning method to help the agent learn an effective policy from the interactions.
Since the learned representation is closely related to the loss function in the SimCLR framework, which is governed by the temperature coefficient $\tau$, we further analyze the impact of $\tau$ on policy learning and transfer.
(a) and (b) Return and average minimum distance curves for the agent during the training and test phases under various temperature coefficients, respectively. (c) Recognition rate curve for the agent in the test environment under different temperature coefficients. In the training phase, the agent's policy can achieve higher performance under a properly chosen temperature parameter $\tau$.
To further explore the impact of the single-view feature extractor on the AaDRL framework, we gradually adjust the standard deviation of the added noise in the augmentation operation and examine the resulting performance.
2) Impact of Feature Overlapping
Next, we use the feature visualization tool UMAP [36] to observe the distribution of image features extracted by the single-view feature extractor, and by analyzing the visualization result, we try to figure out the factors constraining the performance of the AaDRL framework. The training set $\mathbf{D}_{\text{train}}^{\text{RL}}$ is used for this visualization.
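A minimal sketch of this visualization with the umap-learn package is shown below; the feature array and azimuth labels are placeholders standing in for the extractor's outputs.

```python
import numpy as np
import umap                      # umap-learn package
import matplotlib.pyplot as plt

features = np.random.rand(360, 128)   # placeholder single-view features
azimuths = np.arange(360)             # placeholder azimuth labels (deg)

embedding = umap.UMAP(n_components=2, random_state=0).fit_transform(features)
plt.scatter(embedding[:, 0], embedding[:, 1], c=azimuths, cmap='hsv', s=8)
plt.colorbar(label='azimuth (deg)')
plt.show()
```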
Feature distributions of the training set $\mathbf{D}_{\text{train}}^{\text{RL}}$.
Specifically, we use Fig. 17 for illustration, where the red box on the right shows a local zoom of the feature distribution map on the left. Within the red box, the orange oval frame contains two image features of a certain target type, while the blue oval frame contains two image features of another type of target. Suppose the two features in the former frame form the state $s_{t}$ in one episode while the two in the latter form the state in another episode; since these features mutually overlap, the two states would be very similar, which is exactly the situation analyzed in Fig. 17.
Impact of feature overlapping on policy learning. The states formed by the mutually overlapped features would be very similar. If the agent takes the same action facing these similar states but obtains hugely different rewards, in the PPO-clip algorithm, the critic network's estimates of these states' values would be influenced, thus lowering the performance of the policy learned.
3) Impact of Feature Mismatch
In addition to feature overlapping, the feature mismatch problem in the testing phase also challenges the learned policy. Fig. 18 visualizes the image features of the simulated and measured T72 tank SAR images extracted by the single-view feature extraction module, where the squares represent the features of simulated images and the triangles correspond to those of measured images. Meanwhile, the azimuth angles of the target images are reflected by different colors. From the subfigures, it can be seen that, with the help of the contrastive learning approach, for the same target the image feature changes orderly with increasing azimuth, be it for the simulated or the measured data. However, the key to the successful generalization of the learned policy to the test environment is that the extracted features are robust enough to the environment's change, which means that the representations of the simulated and measured images at the same azimuth angle should be as close as possible, i.e., the distance between squares and triangles of the same color is supposed to be the smallest. From Fig. 18, it can be seen that the actual situation is not that ideal: a square of one color can be the closest to a triangle of another color, i.e., there is a certain mismatch between the measured and simulated image features. This phenomenon implies that when the agent faces the test environment, it may misunderstand the azimuth of the target image and consequently take the wrong action. Therefore, the more severe the feature mismatch, the worse the agent's policy performs in the test environment. Referring to the four subfigures in Fig. 18, we can identify the setting under which the feature mismatch phenomenon is the most severe, which corresponds to the worst test performance of the learned policy.
Feature distributions of T72's measured and synthetic images through different single-view feature extractors. The squares represent the features of simulated images, and the triangles correspond to those of measured images. Meanwhile, the azimuth angles of target images are reflected by different colors. Ideally, the squares and triangles of the same color are supposed to be the closest to each other. However, there is a certain mismatch between them in some cases. This phenomenon reflects that when the agent is deployed in the test environment, it may misunderstand its state and consequently take the wrong action. Therefore, a severer feature mismatch results in worse test performance of the agent's policy.
4) Model Complexity Analysis
To perform the AcTR task, an active view planning module is designed in this article, which unavoidably increases the model complexity compared to the original target recognition model with static observations. However, according to the statistics, this extra complexity is acceptable considering the benefit of active vision. In terms of space complexity, the applied ResNet18 classifier contains 11.2 M parameters, while the parameter number of the policy network is around 0.6 M. The computational burden for making one single action prediction is 34.7 M FLOPs, while the classifier's time complexity is 286.3 M FLOPs. After about 300 K interactions between the agent and the environment constructed from the training set $\mathbf{D}_{\text{train}}^{\text{RL}}$, the agent's policy converges.
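The parameter count above can be reproduced with a one-line tally over a torchvision ResNet18, as sketched below; the ten-class output head matches the SAMPLE setting, and FLOPs would additionally require a profiling tool.

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)              # classifier backbone
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f} M parameters")   # about 11.2 M, as reported
```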
Conclusion
This article proposes an AcTR framework for SAR images based on DRL for the first time, which effectively improves the target recognition accuracy from the perspective of active vision. We use synthetic SAR images and a small number of measured samples to build a relatively complete and close-to-reality training environment for agent training. Second, a view-matching task is designed in the proposed framework to guide the agent to learn how to seek target images that are as similar as possible to the training samples based on historical observations, and this design helps avoid policy failure in the test environment. In addition, the contrastive learning method helps the agent learn an effective state representation that enables it to recognize the characteristics of different targets under different azimuths, laying a foundation for the agent's policy learning and transfer. Finally, the experimental results demonstrate that the proposed framework can greatly improve the target recognition accuracy under the small sample condition. In future work, we will seek a more appropriate way to improve the state representation for the policy network, so that the problems of feature overlapping and mismatch can be alleviated and the gain for SAR target recognition enlarged.
ACKNOWLEDGMENT
The authors would like to thank the anonymous reviewers for their dedicated review. We also thank W. Li, a Ph.D. candidate in our group, for carefully revising the manuscript.