Introduction
Synthetic aperture radar (SAR), a representative microwave remote sensing technology, is an active coherent imaging system carried on satellites or spacecraft that captures high-resolution images by processing radar echoes acquired from ground targets. SAR has the unique advantage of all-weather, day-and-night observation, making it widely used in urban planning, natural disaster assessment, mapping, and military applications [1]. However, owing to its coherent imaging mechanism, speckle degrades SAR images and seriously affects subsequent processing. Therefore, speckle suppression is an essential step before interpreting and analyzing SAR images.
Existing SAR image speckle suppression algorithms date back to the early 1980s. Spatial-domain despeckling methods estimate the target's actual backscatter coefficient from pixel values within a local window; examples include the Lee filter [2], Frost filter [3], and Kuan filter [4]. This type of denoising algorithm is simple to operate, but the filter window size strongly affects the denoising effect, and the denoised images suffer from blocking effects and artifacts. Subsequently, promising despeckling results in preserving image structure and texture were achieved by exploiting the image's nonlocal similarity [5], [6]. However, searching for similar patches requires a large amount of computation and is therefore time consuming, and some parameters must be set manually. Transform-domain denoising algorithms [7], [8], [9] can effectively separate image signal coefficients from noise coefficients, improve speckle suppression, and outperform spatial-domain filtering in preserving image edge details. However, artifacts occur frequently during the inverse transform, and converting from the spatial domain to the transform domain increases the computational workload.
Deep learning has been widely studied in image processing over the last decade and has recently been successfully applied to SAR image denoising. Chierchia et al. [10] first applied a convolutional neural network (CNN) to the SAR image denoising task, providing new insight for applying deep learning to SAR by converting the multiplicative speckle model into an additive noise model for training. Subsequently, a division residual CNN was proposed for SAR image denoising [11]. Some works [12], [13] trained networks by skillfully designing and combining different loss functions to obtain a better speckle suppression effect. In addition, many scholars have explored various CNN-architecture-based speckle suppression algorithms, e.g., convolutional autoencoders [14], generative adversarial networks [15], dilated residual networks [16], attention mechanisms [17], [18], etc. All the above deep-learning-based algorithms achieve decent speckle suppression performance. However, the positions of the convolution kernels on the feature map are fixed. Thus, these networks cannot adapt to texture and edge information, which inevitably leads to excessive smoothing or loss of image details during denoising.
Since noise-free reference SAR images do not exist in real scenes, most current networks are trained on simulated noisy-clean image pairs. The trained models show unsatisfactory despeckling results when tested on real SAR images. One line of work [19], [20] achieved reference-free image denoising by using multitemporal phase information of SAR images as a complement, combined with Bayesian priors or fine-tuning. Another line comprises self-supervised speckle suppression algorithms that extend the Noise2Noise denoising idea, such as enhanced Noise2Noise [21] and SAR2SAR [22]. However, the scarcity of real SAR datasets limits unsupervised despeckling to some extent.
The above studies show that CNNs have great potential for SAR image speckle suppression, with better performance and efficiency than traditional models. However, it remains challenging to design a network whose despeckled SAR images preserve more edge information and texture structure. Multiscale architectures exploit information at multiple scales to improve the performance of different image processing approaches and have also been applied to various image restoration tasks. The most direct multiscale architecture feeds multiresolution or single-resolution images (or features) into one or more subnetworks to obtain and fuse features at different scales. In addition, some works use the encoder–decoder architecture, in which layers in the encoder are connected or cascaded with layers in the decoder, so that the fine-grained details learned in the encoder are used to restore and reconstruct images in the decoder. Lattari et al. [23] constructed a residual denoising network based on a symmetric encoder–decoder, using skip connections to enhance information delivery. Chang et al. [24] proposed a spatially adaptive multiscale network, SADNet, which uses downsampling operations and parallel convolution kernels to acquire multiscale information and, based on deformable convolution, can adaptively represent spatial texture and structure. Similarly, Wu et al. [25] used deformable convolution in an encoder-like structure to adaptively extend the receptive field; this structure also addresses the missing links between downsampling and upsampling features. Cheng et al. [26] proposed a U-Net-based network structure, NBNet, which maximizes the preservation of low-frequency texture information.
Inspired by the above network structures, we propose a multiscale feature adaptive enhancement network (MFAENet) to effectively suppress speckle while preserving the details and spatial structure of SAR images. The constructed network uses an encoder–decoder network as the backbone: features of different resolutions are gradually extracted through multiple downsampling and upsampling operations to remove noise from coarse to fine. MFAENet uses a splitting-and-cascading (SC) scheme in the bottom layer to extract richer information and capture multiscale features, and an attention mechanism is introduced to learn the relationships between channels and pixels. In addition, MFAENet expands the receptive field of the network by adaptively changing the convolution's shape, which allows it to efficiently extract detail and noise features. MFAENet abandons plain skip connections and instead adopts adaptive fusion with learnable weights for the connections between the encoder and decoder. This enhances information transfer and retains more detailed structure information in the denoised image.
In general, the main contributions of this article are as follows.
An MFAENet is proposed for speckle suppression. The network can effectively suppress speckles while retaining more spatial structure information of the image. The denoised results on synthetic and real images show that MFAENet can improve objective metrics and achieve good visual effects.
A multiscale feature adaptive enhancement (MFAE) module is proposed to fuse the multiscale features obtained by the encoder and decoder. It further enriches feature scale information to maintain the consistency of the features' coarse-grained context, which makes the network focus more on noise information. It also adaptively changes the shape and range of the receptive field for feature enhancement, which facilitates speckle suppression.
MFAENet uses an adaptive fusion strategy to aggregate features within scales, further enhancing the feature transform capability of the network and enabling reconstructed SAR images to preserve more details and spatial structure.
The rest of this article is organized as follows. Section II introduces the proposed SAR image denoising network framework in detail. Section III presents an experimental analysis of the proposed algorithm's denoising effect. Section IV is an ablation study. Finally, Section V concludes this article.
Methodology
This section introduces the proposed SAR image speckle suppression network, MFAENet, in detail. The network extracts rich multiscale feature information from SAR images and adaptively fuses features within scales to achieve high-quality denoising. The overall structure of MFAENet is described first, followed by the specific structure of each module.
A. Network Architecture
MFAENet is an end-to-end CNN model with a powerful learning and fitting capability. Thus, the speckle suppression process can be approximately represented as
\begin{equation*}
Y = f\left( {X,\phi } \right) \tag{1}
\end{equation*}
where $X$ is the input noisy SAR image, $Y$ is the despeckled output, $f( \cdot )$ is the mapping learned by the network, and $\phi$ denotes the network parameters.
The encoder in MFAENet mainly uses stride convolution and residual dense convolution blocks (RDCBs) for downsampling operations to extract features at different resolutions of SAR images. In order to preserve the image's spatial structure, the constructed encoder module controls the number of downsampling operations and only extracts three scale features. Note that in the initial part of the encoder, a convolutional layer with kernel size 3 × 3 and ReLU activation, followed by two RDCBs, extracts the initial feature ${f}_{{E}_0}$ \begin{equation*}
{f}_{{E}_0} = {H}_{\text{RDCB}}\left( {{H}_{\text{RDCB}}\left( {\text{Conv}R\left( X \right)} \right)} \right) \tag{2}
\end{equation*}
where ${H}_{\text{RDCB}}( \cdot )$ denotes the RDCB operation and $\text{Conv}R( \cdot )$ denotes the 3 × 3 convolution followed by the ReLU activation.
The initial feature ${f}_{{E}_0}$ is downsampled by the stride convolution (denoted $\downarrow$) and fed into two RDCBs to obtain the second-scale feature ${f}_{{E}_1}$ \begin{equation*}
{f}_{{E}_1} = {H}_{\text{RDCB}}\left( {{H}_{\text{RDCB}}\left( {{f}_{{E}_0} \downarrow } \right)} \right). \tag{3}
\end{equation*}
Then, the feature ${f}_{{E}_1}$ is downsampled and processed in the same way to obtain the bottom-scale feature ${f}_{{E}_2}$ \begin{equation*}
{f}_{{E}_2} = {H}_{\text{RDCB}}\left( {{H}_{\text{RDCB}}\left( {{f}_{{E}_1} \downarrow } \right)} \right). \tag{4}
\end{equation*}
RDCB cascades the outputs of the three convolution blocks by channel, which can be regarded as extracting spatial features from different scales. This is mainly because the receptive field of two successive 3 × 3 convolutions is equivalent to that of a single 5 × 5 convolution, so each successive block's output corresponds to a progressively larger effective scale.
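To make the block structure concrete, the following is a minimal PyTorch sketch of an RDCB consistent with the description above; it is a sketch under stated assumptions, not the exact implementation. The growth rate, the dense connections, and the 1 × 1 fusion convolution are assumptions, since the text only specifies that the outputs of three convolution blocks are cascaded by channel with a residual connection.
\begin{verbatim}
import torch
import torch.nn as nn

class RDCB(nn.Module):
    """Residual dense convolution block (sketch).

    Three conv blocks whose outputs are cascaded along the channel
    dimension, fused back to the input width by a 1x1 conv, plus a
    residual connection. Growth rate and 1x1 fusion are assumptions.
    """
    def __init__(self, channels, growth=32):
        super().__init__()
        def block(in_ch):
            return nn.Sequential(nn.Conv2d(in_ch, growth, 3, padding=1),
                                 nn.ReLU(inplace=True))
        self.conv1 = block(channels)
        self.conv2 = block(channels + growth)
        self.conv3 = block(channels + 2 * growth)
        self.fuse = nn.Conv2d(channels + 3 * growth, channels, 1)

    def forward(self, x):
        c1 = self.conv1(x)
        c2 = self.conv2(torch.cat([x, c1], dim=1))
        c3 = self.conv3(torch.cat([x, c1, c2], dim=1))
        return self.fuse(torch.cat([x, c1, c2, c3], dim=1)) + x
\end{verbatim}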
The encoder and decoder are connected with an MFAE module, whose structure is shown in Fig. 1. MFAENet's encoder–decoder module performs two downsampling–upsampling operations to obtain three features with different resolutions and information. Consequently, MFAENet can also be considered to consist of three branches connecting features of different resolutions in the encoder–decoder module. For the bottom-resolution features, the MFAE module stacks four multiscale residual attention blocks (MRABs) and one adaptive enhancement block (AEB). MFAE expands the receptive field structurally and obtains the features' contextual information. This process is formulated as
\begin{equation*}
{f}_{\text{sub}} = {H}_{\text{AEB}}\left( {{H}_{\text{MRAB}}\left( {{H}_{\text{MRAB}}\left( {{H}_{\text{MRAB}}\left( {{H}_{\text{MRAB}}\left( {{f}_{{E}_2}} \right)} \right)} \right)} \right)} \right) \tag{5}
\end{equation*}
The scale features from the encoder and decoder other than those of the bottom branch are linked through the feature adaptive mixup (FAM) block. These scale features are adaptively fused and fed into the decoder for further image restoration. The fused features are
\begin{align*}
{f}_{\text{FA}{\mathrm{M}}_{0}} =& {H}_{\text{FA}{\mathrm{M}}_{0}}({f}_{{E}_0},{f}_{{D}_0}) \tag{6}\\
{f}_{\text{FA}{\mathrm{M}}_{1}} =& {H}_{\text{FA}{\mathrm{M}}_{1}}({f}_{{E}_1},{f}_{{D}_1}). \tag{7}
\end{align*}
The scale feature layers of the decoder have a one-to-one correspondence with the scale feature layers of the encoder. In the decoder, we first use two RDCB blocks to increase the receptive field of the network for low-resolution features. Second, we use channel attention (CA) to make the network focus more on the relationships between feature channels. Finally, we use transposed convolution (denoted $\uparrow$) for the upsampling operations. The decoder features are then computed as
\begin{align*}
{f}_{{D}_0} =& {H}_{\text{CA}}\left( {{H}_{\text{RDCB}}\left( {{H}_{\text{RDCB}}\left( {{f}_{\text{FA}{\mathrm{M}}_{1}}} \right)} \right)} \right) \uparrow \tag{8}\\
{f}_{{D}_1} =& {H}_{\text{CA}}\left( {{H}_{\text{RDCB}}\left( {{H}_{\text{RDCB}}\left( {{f}_{\text{sub}}} \right)} \right)} \right) \uparrow . \tag{9}
\end{align*}
The input to each upsampling stage of the decoder is not simply the output of the previous upsampling. Instead, the output of the FAM module, which adaptively fuses same-scale features from the encoder and decoder, is used as the input of the next upsampling. After two upsamplings, the decoder recovers features with the same resolution as the original image. Then, the reconstructed features are fed into the last convolution, and the denoised image $Y$ is obtained through a global residual connection \begin{equation*}
Y = \text{Conv}\left( {{H}_{\text{CA}}\left( {{H}_{\text{RDCB}}\left( {{H}_{\text{RDCB}}\left( {{f}_{\text{FA}{\mathrm{M}}_{0}}} \right)} \right)} \right)} \right) + X. \tag{10}
\end{equation*}
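The forward pass of (2)–(10) can be summarized by the following structural sketch, assuming the RDCB, MRAB, AEB, FAM, and channel attention (CA) blocks sketched elsewhere in this section; the kernel sizes and strides of the downsampling and transposed convolutions are assumptions, and the channel widths (64, 96, 128) follow Section III-B. Because of the two stride-2 stages, input sizes are assumed divisible by 4, which relates to the size restriction noted in the conclusion.
\begin{verbatim}
import torch.nn as nn

class MFAENet(nn.Module):
    """Structural sketch of the forward pass in (2)-(10).

    RDCB, MRAB, AEB, FAM, and ChannelAttention denote the blocks
    described in this section (sketched separately); stride/kernel
    choices for the sampling layers are assumptions.
    """
    def __init__(self):
        super().__init__()
        self.head = nn.Sequential(nn.Conv2d(1, 64, 3, padding=1),
                                  nn.ReLU(inplace=True),
                                  RDCB(64), RDCB(64))                   # (2)
        self.down1 = nn.Conv2d(64, 96, 3, stride=2, padding=1)
        self.enc1 = nn.Sequential(RDCB(96), RDCB(96))                   # (3)
        self.down2 = nn.Conv2d(96, 128, 3, stride=2, padding=1)
        self.enc2 = nn.Sequential(RDCB(128), RDCB(128))                 # (4)
        self.bottom = nn.Sequential(*[MRAB(128) for _ in range(4)],
                                    AEB(128))                           # (5)
        self.dec1 = nn.Sequential(RDCB(128), RDCB(128), ChannelAttention(128))
        self.up1 = nn.ConvTranspose2d(128, 96, 2, stride=2)             # (9)
        self.fam1 = FAM(96)                                             # (7)
        self.dec0 = nn.Sequential(RDCB(96), RDCB(96), ChannelAttention(96))
        self.up0 = nn.ConvTranspose2d(96, 64, 2, stride=2)              # (8)
        self.fam0 = FAM(64)                                             # (6)
        self.tail = nn.Sequential(RDCB(64), RDCB(64), ChannelAttention(64),
                                  nn.Conv2d(64, 1, 3, padding=1))       # (10)

    def forward(self, x):
        f_e0 = self.head(x)
        f_e1 = self.enc1(self.down1(f_e0))
        f_e2 = self.enc2(self.down2(f_e1))
        f_sub = self.bottom(f_e2)
        f_d1 = self.up1(self.dec1(f_sub))       # (9)
        f = self.fam1(f_e1, f_d1)               # (7)
        f_d0 = self.up0(self.dec0(f))           # (8)
        f = self.fam0(f_e0, f_d0)               # (6)
        return self.tail(f) + x                 # (10), global residual
\end{verbatim}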
B. Adaptive Enhancement of Multiscale Features
The MFAE module aims to fuse the multiscale features obtained by the encoder and decoder, and it establishes the connection between shallow and deep features. The features extracted by the bottom subnetwork have low resolution and relatively limited contextual information, although they contain some geometric information. If the resolution is too low, contextual information and spatial structure are destroyed, because low-resolution features do not maintain consistency in the coarse-grained context during image denoising [18]. Therefore, the MFAE module is designed to perform adaptive enhancement of multiscale features. MFAE consists of MRAB, AEB, and FAM. MRAB enriches scale information to maintain consistency in the coarse-grained context of the features. AEB adaptively enlarges the network's receptive field. FAM enhances the conversion capability of the network: by fusing the shallow features in the encoder with the deep features in the decoder, it effectively maintains more texture and spatial features and thus recovers more detailed information. The structure of each submodule of the MFAE module is displayed in Fig. 3. The following paragraphs describe each submodule in detail.
Structure of each submodule in the MFAE module. (a) Multiscale residual attention block (MRAB). (b) Adaptive enhancement block. (c) Residual convolution block. (d) Feature adaptive mixup.
C. Multiscale Residual Attention Block
Multiscale information is beneficial for image denoising tasks. Previous works often used downsampling operations to obtain scale features. However, downsampling reduces the image resolution so much that reconstruction information is inevitably lost. We expand the receptive field without decreasing the images' resolution by proposing an MRAB, shown in Fig. 3(a). MRAB contains a multiscale layer, a convolutional layer, an activation function, CA, and pixel attention (PA). MRAB extracts multiscale features through the multiscale layer and then integrates them in the residual attention module. This design captures feature information across channels and enables the network to focus more accurately on noise information.
MRAB adopts an SC way [27] to extract multiscale features. In each module, the input feature map ${f}_{\text{in}}$ is split into groups along the channel dimension and convolved with kernels of different sizes; for a convolution kernel of size $K \times K$, the group size $G$ of the group convolution is \begin{equation*}
G = {2}^{\frac{{K - 1}}{2}}. \tag{11}
\end{equation*}
Note that when the convolution kernel size is 3 × 3, the group size defaults to 1. The experiments are conducted with four groups of convolutional kernels whose sizes are 3 × 3, 5 × 5, 7 × 7, and 9 × 9, and the corresponding group sizes are 1, 4, 8, and 16. The multiscale feature map's generating function can be expressed as
\begin{equation*}
{f^{\prime}}_i = \text{Conv}_{{k}_i,{k}_i,{G}_i}\left( {{f}_i} \right), \quad i = 0,1,2,3 \tag{12}
\end{equation*}
\begin{equation*}
f^{\prime\prime} = \text{RELU}\left( {\text{Cat}\left( {\left[ {{{f^{\prime}}}_0,{{f^{\prime}}}_1,{{f^{\prime}}}_2,{{f^{\prime}}}_3} \right]} \right)} \right) + {f}_{\text{in}} \tag{13}
\end{equation*}
\begin{equation*}
{f}_{\text{out}} = {H}_{\text{PA}}\left( {{H}_{\text{CA}}\left( {\text{Conv}\left( {f^{\prime\prime}} \right)} \right)} \right) + {f}_{\text{in}} \tag{14}
\end{equation*}
where ${f}_i$ denotes the $i$th channel group of ${f}_{\text{in}}$, $\text{Cat}( \cdot )$ denotes channel-wise concatenation, and ${H}_{\text{CA}}$ and ${H}_{\text{PA}}$ denote the channel attention and pixel attention operations, respectively.
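A minimal PyTorch sketch of the MRAB implementing (12)–(14) follows. The split-and-cascade branches use the (kernel, group) pairs stated above; the channel attention and pixel attention designs are assumptions (an SE-style squeeze-and-excitation block and a single-channel sigmoid map, respectively), since their exact structures are not specified here. The block is written for the 128-channel bottom features, where each of the four channel groups is divisible by the largest group size.
\begin{verbatim}
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention (assumed design)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class PixelAttention(nn.Module):
    """Single-channel sigmoid attention map (assumed design)."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.conv(x)

class MRAB(nn.Module):
    """Multiscale residual attention block, per (12)-(14).
    Intended for the 128-channel bottom features."""
    def __init__(self, channels=128):
        super().__init__()
        c = channels // 4
        specs = [(3, 1), (5, 4), (7, 8), (9, 16)]  # (kernel, groups), per (11)
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2, groups=g) for k, g in specs)
        self.relu = nn.ReLU(inplace=True)
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.ca = ChannelAttention(channels)
        self.pa = PixelAttention(channels)

    def forward(self, f_in):
        parts = torch.chunk(f_in, 4, dim=1)            # split (SC)
        multi = torch.cat([b(p) for b, p in zip(self.branches, parts)],
                          dim=1)                       # (12)
        f2 = self.relu(multi) + f_in                   # (13)
        return self.pa(self.ca(self.conv(f2))) + f_in  # (14)
\end{verbatim}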
D. Adaptive Enhancement Block
The popular CNN-based SAR image denoising algorithms generally use a regular grid to sample input feature maps. This limits the size of the receptive field to some extent, preventing CNNs from fully exploiting the structured information in the feature space. In addition, some SAR image denoising algorithms introduce dilated convolution to expand the receptive field, but this usually leads to grid artifacts in denoised images. Moreover, the shape of the receptive field is also significant. To extend the receptive field adaptively, AEB introduces deformable convolution with a kernel size of 3 × 3 to enhance feature representation. The structure is shown in Fig. 3(b). The deformable convolution's kernel is flexible, allowing it to capture more critical image information [28]. Let x and y be the input and output mappings of the deformable convolution, respectively, and let R be the regular grid of the two-dimensional domain. Regular convolution samples the input features x within the range of the regular grid R and computes a weighted sum of the sampled values with weights w as the output, which can be expressed as
\begin{equation*}
y\left( {{k}_0} \right) = \sum\limits_{{k}_n \in R} {w\left( {{k}_n} \right)} \cdot x\left( {{k}_0 + {k}_n} \right) \tag{15}
\end{equation*}
Deformable convolution augments the regular grid $R$ with a learned offset $\Delta {k}_n$ at each sampling position, so that (15) becomes
\begin{equation*}
y\left( {{k}_0} \right) = \sum\limits_{{k}_n \in R} {w\left( {{k}_n} \right)} \cdot x\left( {{k}_0 + {k}_n + \Delta {k}_n} \right). \tag{16}
\end{equation*}
At this point, the sampling grid of the deformable convolution becomes irregular, and the value at each fractional sampling position is obtained by bilinear interpolation. As shown in Fig. 3(b), the constructed AEB uses this offset-based sampling to adaptively adjust the shape and range of the receptive field.
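The following sketch shows how the AEB's deformable sampling in (16) can be realized with torchvision's DeformConv2d; the offset-prediction convolution and the residual connection are assumptions about details not stated in the text.
\begin{verbatim}
import torch.nn as nn
from torchvision.ops import DeformConv2d

class AEB(nn.Module):
    """Adaptive enhancement block (sketch): a 3x3 deformable convolution
    per (16). A plain conv predicts the 2-D offsets (Delta k_n) for the
    3x3 = 9 sampling positions; the residual connection is an
    assumption."""
    def __init__(self, channels):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * 3 * 3, 3, padding=1)
        self.deform = DeformConv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.deform(x, self.offset(x))) + x
\end{verbatim}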
E. Feature Adaptive Mixup Block
Generally, CNNs can effectively capture shallow features of images (i.e., edges and contours), but shallow features degrade with increasing network depth. To solve this problem, some scholars combine the extracted shallow and deep features by summation or skip connection to generate new features. However, these approaches prevent the network from distinguishing which of the shallow and deep features is more important, and they risk losing image details. We construct the feature adaptive fusion module using a residual convolution block (RCB) to strengthen the information transfer between shallow and deep features. The RCB and adaptive fusion module structures are shown in Fig. 3(c) and (d). The adaptive fusion module fuses low- and high-level features with learnable weights, thus preserving more detailed structural information during denoising.
In the constructed SAR image speckle suppression network, the FAM module effectively avoids the loss of shallow features caused by the lack of connection between the encoder's and decoder's features. In MFAENet, we consider the integration of features between the corresponding downsampling and upsampling layers of the encoder and decoder. Let ${f}_E$ and ${f}_D$ denote the encoder and decoder features at the same scale, respectively. Then
\begin{align*}
{f^{\prime}}_E =& {H}_{\text{RC}{\mathrm{B}}_i}\left( {{f}_E} \right), \qquad i = 0,1, \ldots,n \tag{17}\\
{f}_{\text{FAM}} =& {H}_{\text{FAM}}({f}_E,{f}_D) = \sigma \left( \theta \right) * {f^{\prime}}_E + \left( {1 - \sigma \left( \theta \right)} \right) * {f}_D \tag{18}
\end{align*}
where ${H}_{\text{RC}{\mathrm{B}}_i}$ denotes the $i$th RCB, $\sigma ( \cdot )$ is the sigmoid function, and $\theta$ is a learnable parameter.
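A minimal sketch of the RCB and FAM per (17) and (18) follows, assuming sigma is the sigmoid function applied to a learnable scalar theta; the internal RCB design (two 3 × 3 convolutions with a residual connection) and the number of RCBs per scale are assumptions.
\begin{verbatim}
import torch
import torch.nn as nn

class RCB(nn.Module):
    """Residual convolution block (assumed: two 3x3 convs + residual)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))

    def forward(self, x):
        return self.body(x) + x

class FAM(nn.Module):
    """Feature adaptive mixup per (17)-(18): encoder features refined by
    RCBs are mixed with decoder features using a learnable weight
    sigmoid(theta). The number of RCBs per scale is an assumption."""
    def __init__(self, channels, n_rcb=2):
        super().__init__()
        self.rcbs = nn.Sequential(*[RCB(channels) for _ in range(n_rcb)])
        self.theta = nn.Parameter(torch.zeros(1))  # sigmoid(0) = 0.5 at init

    def forward(self, f_e, f_d):
        w = torch.sigmoid(self.theta)
        return w * self.rcbs(f_e) + (1 - w) * f_d
\end{verbatim}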
F. Loss Function
In deep-learning-based denoising methods, the denoising model converges by minimizing a loss function that reduces the error of the predicted values. Therefore, the choice of loss function has a significant impact on the denoising model. The goal of training the proposed MFAENet is to make the estimated images ever closer to the clean reference images. The L1 loss is more robust than L2 because it handles outliers in the data, avoiding oversensitivity to specific samples. So we use the L1 loss for network training and optimization of the network parameters. Given a training set $\{ ({X}_i,{Y^{\prime}}_i)\} _{i = 1}^P$ of $P$ noisy-clean image pairs, the loss function is \begin{equation*}
\ell = \frac{1}{P}\sum\limits_{i = 1}^P {{{\left\| {{{Y^{\prime}}}_i - \text{MFAENet}\left( {{X}_i,\phi } \right)} \right\|}}_1} \tag{19}
\end{equation*}
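In PyTorch, (19) corresponds (up to a constant normalization) to the built-in L1 loss; a minimal training step might look as follows.
\begin{verbatim}
import torch.nn as nn

l1 = nn.L1Loss()  # mean absolute error; matches (19) up to normalization

def train_step(model, optimizer, noisy, clean):
    """One optimization step minimizing the L1 loss in (19)."""
    optimizer.zero_grad()
    loss = l1(model(noisy), clean)
    loss.backward()
    optimizer.step()
    return loss.item()
\end{verbatim}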
Experiment Results and Analysis
We tested MFAENet on images with simulated speckle and on real SAR images, and compared it with other SAR image denoising methods in terms of objective metrics and subjective visual quality to demonstrate MFAENet's effectiveness.
A. Datasets
In this article, the UC Merced Land Use dataset [29] is adopted as the training dataset; the size of all images is 256 × 256. When producing noisy-clean training image pairs, 400 images are randomly chosen as reference images, and 40 images are selected from the remaining images as the test dataset. Since real SAR images can usually be regarded as grayscale images, the 400 reference images are first transformed into grayscale. Then, the simulated speckled images are obtained by corrupting them with multiplicative speckle. The reference images and the noisy images serve as the training data for the network. During training, the images are cropped into blocks of size 64 × 64, and data augmentation operations, such as rotation and flipping, are also performed. The experiments are carried out with simulated and real images, respectively.
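As an illustration, one common convention for generating such pairs, assumed here since the exact noise model is not stated, multiplies the clean intensity image by unit-mean Gamma-distributed speckle with shape parameter equal to the number of looks L.
\begin{verbatim}
import numpy as np

def add_speckle(clean, looks=1, rng=np.random.default_rng(0)):
    """Simulate multiplicative speckle (assumed Gamma model):
    Y = X * F, with F ~ Gamma(shape=L, scale=1/L), so E[F] = 1."""
    noise = rng.gamma(shape=looks, scale=1.0 / looks, size=clean.shape)
    return clean * noise
\end{verbatim}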
The publicly available virtual SAR dataset [30] is used for the simulated experiments. The virtual SAR dataset covers a full spectrum of multiplicative noise variance values and is intended to provide training data for speckle noise reduction. Twenty clean-noisy image pairs are randomly selected from this dataset as benchmarks for the simulation tests and ablation studies. The denoising effect is demonstrated on four real SAR images of different scenes, and the proposed algorithm is further validated on 100 SAR images from Sentinel-1 ground range detected scenes with dual polarization [31].
B. Parameter Settings
Our proposed network is implemented with PyTorch 1.1 for training and testing. The operating system is Windows 10, and an NVIDIA GeForce GTX 1660 SUPER GPU with CUDA 11.1 and cuDNN 8.0.5 is used to accelerate training and computation. The exponential decay rates are beta1 = 0.9 and beta2 = 0.999, and the initial learning rate is set to 0.0003 and is halved every 100 epochs. MFAENet is trained for 400 epochs, which takes about 6 h, with a batch size of 2. The original input and final output of MFAENet have one channel; through the downsampling and upsampling operations, the channel number varies as 64, 96, 128, 96, and 64.
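The stated beta1/beta2 values suggest the Adam optimizer (an assumption, as the optimizer is not named explicitly); the schedule above then corresponds to the following configuration.
\begin{verbatim}
import torch
import torch.nn as nn

model = nn.Conv2d(1, 1, 3, padding=1)  # stand-in for the MFAENet instance
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4,
                             betas=(0.9, 0.999))
# Halve the learning rate every 100 epochs over 400 training epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100,
                                            gamma=0.5)

for epoch in range(400):
    # ... one pass over the 64 x 64 training patches with batch size 2 ...
    scheduler.step()
\end{verbatim}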
C. Comparison Algorithms
MFAENet is compared with existing SAR image despeckling algorithms based on traditional models and on deep learning to verify its stability and efficiency. The comparison algorithms based on traditional models are: 1) Lee filtering (Lee); 2) the nonlocal SAR image despeckling method with LLMMSE wavelet shrinkage (SAR-BM3D) proposed in [5]; and 3) the Bayesian threshold shrinkage SAR image despeckling method based on sparse representation in the shearlet domain (BSSR) proposed in [9]. The comparison algorithms based on deep learning are: 1) the fast and flexible network-based image denoising algorithm (FFDNet) proposed in [32]; 2) the speckle suppression algorithm based on a CNN and continuous cyclic translation (CCSNet) proposed in [33]; 3) SAR image speckle suppression based on a deep CNN denoising prior (IRCNN) proposed in [34]; and 4) the SAR image denoising algorithm based on a multiscale dense residual CNN (MRDDANet) proposed in [17]. In this article, the code released by the corresponding papers' authors is used to perform denoising on the test dataset alongside the proposed MFAENet.
D. Evaluation Metrics
For simulated datasets, the despeckling effect of different algorithms can be evaluated objectively because clean references exist. We choose the peak signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) to evaluate the despeckled image's quality and edge preservation ability. The test time (TIME) reflects the inference efficiency of the despeckling algorithm.
For real SAR images, several reference-free metrics are chosen to evaluate the despeckling performance because no clean reference image exists. The equivalent number of looks (ENL) [35] measures the smoothness of homogeneous regions. ENL is the ratio of the squared mean to the variance of a selected homogeneous region of the image, defined as
\begin{equation*}
\text{ENL} = \frac{{{{({\mu }_x)}}^2}}{{{{({\sigma }_x)}}^2}} \tag{20}
\end{equation*}
where ${\mu }_x$ and ${\sigma }_x$ denote the mean and standard deviation of the selected homogeneous region, respectively.
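ENL per (20) is straightforward to compute over a manually selected homogeneous region, for example:
\begin{verbatim}
import numpy as np

def enl(region):
    """Equivalent number of looks of a homogeneous region, per (20)."""
    region = np.asarray(region, dtype=np.float64)
    return region.mean() ** 2 / region.var()
\end{verbatim}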
The edge preservation degree based on the ratio of average (EPD–ROA) [36] measures the ability to preserve detail in despeckled images with an ideal value of 1. The process of calculating EPD–ROA can be expressed as follows:
\begin{equation*}
\text{EPD-ROA} = \frac{{\sum\nolimits_{i = 1}^m {\left| {{H}_0(i)/{V}_0(i)} \right|} }}{{\sum\nolimits_{i = 1}^m {\left| {{H}_{gt}(i)/{V}_{gt}(i)} \right|} }} \tag{21}
\end{equation*}
where ${H}_0(i)$ and ${V}_0(i)$ denote adjacent pixel values of the despeckled image along the evaluated (horizontal or vertical) direction, ${H}_{gt}(i)$ and ${V}_{gt}(i)$ are the corresponding values of the original image, and the metric is reported separately for the two directions.
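Under the adjacent-pixel-ratio reading given above (an assumption about the notation), EPD–ROA for one direction can be sketched as follows.
\begin{verbatim}
import numpy as np

def epd_roa(despeckled, original, axis=1):
    """EPD-ROA per (21), assuming H and V index adjacent pixel pairs
    along the horizontal (axis=1) or vertical (axis=0) direction.
    Values closer to 1 indicate better edge preservation."""
    def ratios(img):
        img = np.asarray(img, dtype=np.float64)
        a, b = ((img[:, 1:], img[:, :-1]) if axis == 1
                else (img[1:, :], img[:-1, :]))
        return np.abs(a / (b + 1e-12))
    return ratios(despeckled).sum() / ratios(original).sum()
\end{verbatim}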
The independent quantitative evaluation index (M score) [37] evaluates the despeckling ability over the whole image. It is a recently proposed reference-free metric, expressed as
\begin{equation*}
\text{UM} = {r}_{{\rm{E\hat{N}L}},\hat{\mu }} + \delta h \tag{22}
\end{equation*}
\begin{equation*}
{r}_{{\rm{E\hat{N}L}},\hat{\mu }} = \frac{1}{2}\sum\limits_{j = 1}^k {\left( {{r}_{{\rm{E\hat{N}L}}}(j) + {r}_{\hat{\mu }}(j)} \right)} \tag{23}
\end{equation*}
\begin{equation*}
\delta h = 100\left| {{h}_0 - {{\bar{h}}}_g} \right| / {h}_0. \tag{24}
\end{equation*}
where ${r}_{{\rm{E\hat{N}L}}}(j)$ and ${r}_{\hat{\mu }}(j)$ measure the relative deviations of the ENL and the mean over the $j$th of $k$ selected homogeneous regions, and ${h}_0$ and ${{\bar{h}}}_g$ denote the homogeneity of the original and despeckled images, respectively.
The homogeneity $h$ is computed from the normalized gray-level co-occurrence matrix $p(i,j)$ as \begin{equation*}
h = \sum\limits_i {\sum\limits_j {\frac{{p(i,j)}}{{1 + {{(i - j)}}^2}}}} \tag{25}
\end{equation*}
A lower M score indicates better overall despeckling performance.
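Given a normalized co-occurrence matrix p, the homogeneity term in (25), as reconstructed here, can be computed as follows; the (i - j)^2 weighting is an assumption based on the standard co-occurrence homogeneity measure.
\begin{verbatim}
import numpy as np

def homogeneity(p):
    """Homogeneity h of a normalized co-occurrence matrix p, per (25)
    as reconstructed here (the (i - j)^2 term is an assumption)."""
    i, j = np.indices(p.shape)
    return (p / (1.0 + (i - j) ** 2)).sum()
\end{verbatim}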
E. Simulated SAR Image Despeckling
In this article, 20 images randomly selected from the virtual SAR dataset are used to test the despeckling effectiveness of each algorithm; Table I shows the quantitative results. In short, the PSNR and SSIM values obtained by our proposed algorithm are almost always the best, which shows that MFAENet can effectively remove speckle. The traditional SAR image denoising algorithms Lee and BSSR have some speckle suppression effect, but their PSNR and SSIM values are much lower than those of MFAENet. Compared with the classical SAR image denoising algorithm SAR-BM3D, the PSNR value of MFAENet is about 1 dB higher. The CNN-based SAR image denoising algorithms (FFDNet, IRCNN, CCSNet, and MRDDANet) do not perform as well as the traditional methods in despeckling, and MFAENet achieves a substantial gain over them. This shows that MFAENet works well on simulated speckle: it can fully extract feature information from different scales of the image and enhance the features by adaptively expanding the receptive field.
On the other hand, the constructed network adopts an adaptive fusion strategy to enhance the network's feature conversion capability for in-scale features. It can fuse the shallow features extracted by the encoder and the deep features extracted by the decoder. So the proposed method can effectively retain more texture and spatial features of the original image, restore more detailed information about the image, and greatly improve the ability of speckle removal.
Moreover, to verify our algorithm's effectiveness more intuitively, a single image is randomly selected to show the visual effects, and the PSNR and SSIM values are given. Fig. 4 shows the despeckling results for a single image from the virtual SAR dataset. It is clear that the compared algorithms, except SAR-BM3D, do not fully remove the speckle, and their despeckled images remain heavily corrupted. Moreover, the images of Lee and BSSR appear blurred after despeckling. The despeckled images of CCSNet and IRCNN introduce texture that does not exist in the clean image, as seen by comparing Fig. 4(f) and (g) to (a). FFDNet has a despeckling effect only in some areas; speckle remains, for instance, in the white buildings. The newest compared method, MRDDANet, removes only a small amount of speckle. Although SAR-BM3D removes the speckle, its result is overall darker than the clean image. In contrast, the proposed algorithm completely removes the noise while recovering more complete and accurate target edges than SAR-BM3D.
Visual comparison of single denoised images from virtual SAR datasets. (a) Clean. (b) Speckle. (c) Lee. (d) BSSR. (e) SAR-BM3D. (f) FFDNet. (g) CCSNet. (h) IRCNN. (i) MRDDANet. (j) Proposed.
F. Real SAR Image Despeckling
To more fully verify each algorithm's despeckling and detail-preserving ability, we test the above denoising algorithms on real SAR images, which are preprocessed by multilook processing. We randomly select four real SAR images of different scenes to demonstrate the denoising effect. The despeckled images are shown in Figs. 5–8.
Visual effect comparison of denoised SAR1. (a) SAR1. (b) Lee. (c) BSSR. (d) SAR-BM3D. (e) FFDNet. (f) CCSNet. (g) IRCNN. (h) MRDDANet. (i) Proposed.
Visual effect comparison of denoised SAR2. (a) SAR2. (b) Lee. (c) BSSR. (d) SAR-BM3D. (e) FFDNet. (f) CCSNet. (g) IRCNN. (h) MRDDANet. (i) Proposed.
Visual effect comparison of denoised SAR3. (a) SAR3. (b) Lee. (c) BSSR. (d) SAR-BM3D. (e) FFDNet. (f) CCSNet. (g) IRCNN. (h) MRDDANet. (i) Proposed.
Enlarged region in the green box of the denoised images of SAR3. (a) SAR3. (b) Lee. (c) BSSR. (d) SAR-BM3D. (e) FFDNet. (f) CCSNet. (g) IRCNN. (h) MRDDANet. (i) Proposed.
Fig. 5(a) shows a real SAR image of size 256 × 256, named SAR1. SAR1 is an X-band TerraSAR-X image with an equivalent number of looks of 2, obtained by full polarization. The resolution of SAR1 is 1.10 m × 1.04 m. Fig. 5(b)–(i) shows the denoised images produced by each denoising algorithm. The details in the green box of the visualized image are enlarged and displayed in the red box. In Fig. 5, the denoised images of Lee filtering and BSSR are blurred, especially the detailed part of the enlarged area. SAR-BM3D has a remarkable speckle suppression effect, but its denoised image exhibits oversmoothing, with a massive loss of edge information and detailed texture in the enlarged region. After denoising by FFDNet, a large amount of noise remains in the image. CCSNet and IRCNN provide better speckle suppression; however, they introduce artificial texture into the denoised image, which corrupts its structural information. MRDDANet has excellent denoising ability and strong edge retention, but some regions, such as the enlarged and bottom-left parts of the image, are oversmoothed. The denoised image of our algorithm achieves an admirable tradeoff between speckle suppression and the preservation of details and structure, obtaining the best despeckling effect. Table II gives the objective evaluation index values of SAR1 after denoising by each algorithm. From Table II, we notice that although MFAENet's ENL value is slightly lower than those of BSSR and MRDDANet, it is higher than those of the other algorithms, indicating that MFAENet can effectively suppress speckle. The M score of the proposed method is only 1.0960, far lower than those of the other algorithms, meaning that the comprehensive denoising performance of our algorithm is the best among all the denoising algorithms. MFAENet has the highest EPD–ROA values in the horizontal and vertical directions, which are 0.0342 and 0.0420 higher, respectively, than those of the second-best MRDDANet. This means that MFAENet has a strong texture detail maintenance capability and preserves the image's details and spatial structure well. Finally, our algorithm runs faster than the others, making noise removal very efficient.
Fig. 6(a) shows a real SAR image with L = 1 and size 512 × 512, named SAR2. SAR2 is an X-band TerraSAR-X image obtained by dual polarization over Rosenheim. Fig. 6(b)–(i) shows the denoised images produced by each denoising algorithm. The details in the green box of the visualized image are enlarged and displayed in the red box. In Fig. 6, the Lee filter and BSSR give somewhat blurred denoising results, with especially severe edge blurring. The SAR-BM3D denoised image appears oversmoothed. Speckle still remains in the images denoised by FFDNet, CCSNet, and IRCNN. MRDDANet has considerable denoising ability, but oversmoothing occurs in some regions. The denoised image of our method reduces the speckle while retaining the image's details and structure.
Table III gives the objective evaluation index values of SAR2 after denoising by each algorithm. From Table III, the proposed algorithm is lower only in the ENL value and achieves the best values in all other objective metrics. In particular, its M score and EPD–ROA values are much improved over the second-best values. In addition, our algorithm takes the shortest time for single-image testing. These results sufficiently verify MFAENet's effectiveness in speckle suppression.
Fig. 7(a) presents a real SAR amplitude image with L = 2 and size 256 × 256, named SAR3. SAR3 was acquired with full polarization by the U.K. Defence Research Agency airborne X-band radar over the Bedfordshire farmland area. Fig. 7(b)–(i) presents the denoised images produced by each denoising algorithm. Fig. 8 shows an enlarged detail view of each denoised image at the position of the green box in Fig. 7(a).
From Figs. 7 and 8, the Lee filter's denoised image remains polluted, which seriously degrades the image quality. The image after BSSR denoising is severely blurred and loses the original spatial structure. The denoised image of SAR-BM3D loses plenty of the original image's texture and structure information. The structure and edges of the enlarged region for FFDNet and our algorithm are close to those of the original image; however, the speckle suppression of FFDNet is not thorough enough, and a large amount of speckle remains in its denoised image. The denoised images of CCSNet and IRCNN have residual noise and introduce severe artifacts. MRDDANet has good denoising ability, but oversmoothing occurs in some regions. MFAENet's denoised image contains rich details and structure with no residual speckle. The objective evaluation index values of the denoised SAR3 images for each denoising algorithm are given in Table IV. As shown in Table IV, the comprehensive denoising performance of the proposed algorithm is the best.
Fig. 9(a) shows a real SAR amplitude image with a size of 256 × 256 and L = 3, named SAR4. SAR4 is from the Ku-band Lynx airborne radar designed by Sandia National Laboratories, acquired with vertical polarization over the research area of China Lake Air Force Base, California, USA. Fig. 9(b)–(i) presents the denoised images produced by each algorithm. Fig. 10 shows an enlarged detail view of each denoised image at the position of the green box in Fig. 9(a). From Figs. 9 and 10, after denoising by the Lee filter, BSSR, SAR-BM3D, and IRCNN, the images appear blurry or oversmoothed and fail to preserve the texture information well. The denoised images obtained from FFDNet, CCSNet, and MRDDANet are too smooth in the background but retain a relatively complete texture and structure. Our method has the finest and most realistic visual appearance, sufficiently removing speckle and maximizing the preservation of texture information in the original image, such as the antennas in the enlarged area.
Visual effect comparison of denoised SAR4. (a) SAR4. (b) Lee. (c) BSSR. (d) SAR-BM3D. (e) FFDNet. (f) CCSNet. (g) IRCNN. (h) MRDDANet. (i) Proposed.
Enlarged region in the green box of the denoised images of SAR4. (a) SAR4. (b) Lee. (c) BSSR. (d) SAR-BM3D. (e) FFDNet. (f) CCSNet. (g) IRCNN. (h) MRDDANet. (i) Proposed.
The objective evaluation index values of SAR4 after denoising by each algorithm are given in Table V. The ENL value obtained by our algorithm is lower than those of SAR-BM3D and MRDDANet; however, the visual effect of our algorithm is better. As shown in Fig. 10, SAR-BM3D and MRDDANet may have obtained higher ENL values by sacrificing a large amount of detailed information. Our algorithm has the lowest M score, indicating that it has the most robust comprehensive denoising performance. In addition, the EPD–ROA values obtained by the proposed algorithm and MRDDANet are very close to 1, indicating that both methods are relatively strong in edge retention, but it is also apparent that the visual effect of MRDDANet is worse than that of our algorithm. Moreover, our method still has the shortest TIME.
G. Denoising Experiment on Public SAR Dataset
To further clarify the effectiveness and stability of the proposed algorithm for speckle suppression in SAR images, 100 images of size 256 × 256 were randomly selected from the Sentinel-1 ground range detected scenes to evaluate the despeckling performance of the above eight algorithms. The average values of the four objective evaluation indexes were obtained and plotted as strip charts, shown in Fig. 11.
Objective evaluation metrics for multiple denoised images. (a)–(d) show the EPD–ROA, ENL, M score, and time values, respectively.
Fig. 11 shows that the proposed algorithm achieves the best performance in two objective metrics, EPD–ROA and M score: our algorithm has the best comprehensive denoising performance and better edge and texture preservation. The conventional algorithm SAR-BM3D has the highest ENL value, consistent with the oversmoothing observed in its denoised images. Our method has slightly better ENL values than MRDDANet, the state-of-the-art CNN-based despeckling method; moreover, it preserves detailed information and spatial structure better and has a significant advantage in inference time.
Ablation Study
In this section, we present ablation studies to investigate the impact of each component of MFAENet.
A. Ablation Study of Major Components in the Networks
To better illustrate the validity of MFAENet, each major component of the network is studied. The proposed network MFAENet serves as the baseline, and individual modules are removed from this network as controls. PSNR is used as the evaluation index. Fig. 12 presents the ablation experiments' results. “no-CA” indicates removing the CA layers in the decoder only. “no-AEB” means that only the AEBs in MFAENet are removed. “no-MSRA” means that the multiscale layer in MRAB is replaced with a plain convolutional layer.
Furthermore, considering the effect of dilated convolution in expanding the receptive field, ablation experiments with dilated convolutions of different dilation rates are added in Table VI to further validate the contribution of deformable convolution to the model. Here, the deformable convolutions in the adaptive enhancement module are replaced with dilated convolutions of different dilation rates.
As the dilation rate increases, the receptive field obtained by dilated convolution grows, but the denoising performance shows an overall trend of first increasing and then decreasing. This trend shows that blindly enlarging the dilation rate does not yield a positive gain. Deformable convolution outperforms all the dilated convolutions and achieves the highest denoising performance. This ablation study shows that deformable convolution is better suited to the proposed algorithm.
B. Validity of Global Residuals
A global residual strategy is used at the output of MFAENet, which helps the network learn the residual image instead of directly estimating the despeckled image. Fig. 13 gives the effect of the global residual structure. The training curve in Fig. 13 illustrates that the training process of MFAENet with global residuals is more stable and smoother. In contrast, the training curve of the network without global residuals oscillates sharply.
Function of global residual structure in MFAENet. (a) Noisy image. (b) Output before global residuals. (c) PSNR curve with and without global residual for training.
C. Split–Cascade Way Along the Channel Direction
The SC way can extract multiscale information along the channel direction and obtain richer contextual information. In theory, parallel multikernel receptive-field modules (with convolution kernels of 1, 3, 5, and 7, respectively) can perform the same function. Table VII shows the performance of MFAENet with parallel multikernel receptive-field blocks in place of the SC blocks. While achieving nearly equivalent image restoration results, MFAENet using SC is more competitive in terms of FLOPs, parameters, and time.
Conclusion
In this article, we propose MFAENet for speckle suppression in SAR images. The network effectively extracts rich multiscale feature information from SAR images and adaptively fuses intrascale features to enhance the feature transform capability, achieving speckle removal from coarse to fine. The constructed MRABs and AEBs further expand the network's receptive field and enrich the contextual information of the extracted features, which helps preserve image texture details and spatial structure during denoising. We have conducted denoising tests on synthetic and real SAR datasets; MFAENet's denoised images show good visual quality, better objective index values, and the most robust comprehensive denoising performance. However, MFAENet still has some limitations; for example, the weighted fusion of FAM prevents it from performing inference on images of arbitrary size. Our future work will address these issues while effectively suppressing SAR image speckle and improving the visual quality of denoised images. We will also study whether the polarization state impacts the performance of the proposed method.
List of Abbreviations
To better read this article, we have provided a list of the full names of each abbreviation, as shown in Table VIII.