Introduction
Most existing computer vision systems are designed for disturbance-free scenarios, so rain streaks in an image degrade visibility and prevent many computer vision algorithms from working properly. Addressing this visibility problem is challenging due to the random distribution of rain streaks. Early works [2], [16], [17] treat it as a signal separation problem using low-rank decomposition or Gaussian mixture models (GMM), or resolve it in a denoising manner with a non-local mean smoothing algorithm [13]. Recently, deep learning based models [4], [25], [27] learn from synthesized data and achieve superior performance owing to their powerful feature representation ability.
Notwithstanding the demonstrated success, these deep models suffer from two main issues. First, most state-of-the-art models [4], [15], [25], [27] focus on predicting rain streaks only. While this is reasonable because rain streaks are sparse and contain simple texture information, it forces the network to focus only on local feature representations. As can be seen in Fig. 1b, the feature responses learned by a residual prediction network highlight rain streaks rather than background regions. On the other hand, a dense rain streak scenario is visually similar to mist or fog, which makes predicting the rain streaks easy but recovering the original image content difficult. A network that predicts a rain-free background shows a different learning focus (see Fig. 1c), and these two objectives may complement each other.
Fig. 1. Deep networks that estimate the rain streak residual (b) or the rain-free background (c) show substantially different feature responses (neuron activation values). Relying on residual prediction alone cannot handle dense rain streak scenarios.
Second, attention to the rain streak distribution is not fully explored in de-raining models. Although a spatial visual attention map is incorporated as one of the network inputs in raindrop removal [19], the attention module should instead be injected into the feature levels of the entire network. Attention not only filters out redundant information but also improves the feature representation. In this sense, traditional spatial attention is not enough, as it applies the same weights to all channels of the feature maps. However, learning an element-wise attention module with the same size as the feature maps hugely increases the computational overhead.
In this paper, we address the above two problems by proposing a coupled rain streak and background estimation network with Separable Element-wise Attention. The proposed network produces task-dependent features so that the intrinsic relationship between the two tasks can be explored during training. Furthermore, we implement element-wise attention using a sequence of channel and spatial attention modules. This combination achieves element-wise attention with negligible computation, so it can be applied to all convolutional blocks. Extensive experiments show that the proposed method outperforms state-of-the-art de-raining methods on five benchmarks as well as real-world scenarios. More importantly, this superior performance is obtained without additional multi-scale or recurrent structures.
To summarize, our contributions are three-fold:
We propose to jointly estimate rain streaks and background in the same network with task-dependent features. This simple approach shows a significant improvement over predicting either task individually.
We present a Separable Element-wise Attention module, which focuses on important feature elements while suppressing redundant ones. Our separable implementation introduces element-wise attention with negligible computational effort. It is a general component and can be applied to other deep models.
Extensive experiments conducted on five challenging benchmarks and real-world data demonstrate the effectiveness of the proposed approach over state-of-the-art methods.
Related Work
Rain streak removal is challenging, and therefore early works leverage additional temporal information from multiple frames. Garg and Nayar [5] propose to detect and remove rain streaks based on the dynamics and photometry of rain. Besides temporal information, other cues such as the chromatic properties and shape characteristics of rain are utilized in [29] and [1], respectively. Recently, video rain removal has been addressed using low-rank matrices [14], optical flow on local phase information [21], and matrix decomposition [20].
Different from video-based de-raining with temporal information, single image rain removal is an ill-posed and therefore much more challenging problem. Many traditional methods introduce additional prior information and regard it as a signal separation problem. Kang et al. [12] and Sun et al. [22] separate images into high- and low-frequency parts by analyzing the morphological and structural information of rain images. Luo et al. [17] separate rain streaks and the background scene with a discriminative sparse coding method. In addition, Gaussian mixture models (GMM) [9], [16] are used to decompose the rainy image into background and rain streak layers. Low-rank models are also used to separate the input image into different layers in [2], [3], [26]. Zhang and Patel [13] instead recover the rain-free image with a non-local means filter. Although these methods can detect and remove rain streaks, their main limitation is over-smoothing image details, since much of the texture and fine structure information belongs to the high-frequency part.
Recent approaches adopt deep learning and achieve notable success in single image de-raining. Fu et al. [4] introduce a model to predict the residual rain streaks using the decomposed high frequency part as input. Yang et al. [25] present a deep recurrent model with a dilated network to detect and remove rain streaks iteratively. Zhang et al. [27] propose a density classifier and combine the predicted label with the features of a multi-stream network for de-raining. Li et al. [15] integrate deep convolutional and recurrent neural networks to remove rain streaks in a multi-stage manner.
As mentioned above, all of these methods predict the residual rain streaks and neglect semantic background information. Additionally, none of them incorporates attention within the network.
Approach
A rain image $O$ is commonly modeled as the superposition of a rain-free background $B$ and a rain streak layer $R$: \begin{equation*} O = B + R.\tag{1}\end{equation*}
A. Network Design
The pipeline of the proposed method is shown in Fig. 2. Given an input rainy image $O_{in}$, the network jointly estimates the rain streak residual $R_{out}$ and the rain-free background $B_{out}$.
Fig. 2. The pipeline of the proposed method and the detailed structure of the Separable Element-wise Attention block.
Our network has a plain encoder-decoder architecture. In each resolution level of the encoder and decoder except the outermost layer, we replace the single convolution with the proposed Separable Element-wise Attention (SEA) block to enrich feature representations. Average pooling is used for downsampling and bilinear interpolation for upsampling. Skip connections concatenate the encoder feature maps with the decoder feature maps of the same resolution before feeding them to the next block.
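For concreteness, the sketch below shows such an encoder-decoder layout in PyTorch, with average-pooling downsampling, bilinear upsampling, and skip concatenation; the depth, channel widths, and the plain convolutional block used here are illustrative placeholders (the paper uses SEA blocks at each level), not the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    # Placeholder for the SEA block described in Section B.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class EncoderDecoder(nn.Module):
    """Two-level encoder-decoder skeleton: average-pool downsampling,
    bilinear upsampling, and skip concatenation (illustrative widths)."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, 32), conv_block(32, 64)
        self.bottom = conv_block(64, 64)
        self.dec2 = conv_block(64 + 64, 32)   # upsampled bottom + skip from enc2
        self.dec1 = conv_block(32 + 32, 32)   # upsampled dec2 + skip from enc1

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(F.avg_pool2d(e1, 2))            # downsample by average pooling
        b = self.bottom(F.avg_pool2d(e2, 2))
        up = lambda t: F.interpolate(t, scale_factor=2, mode='bilinear',
                                     align_corners=False)  # bilinear upsampling
        d2 = self.dec2(torch.cat([up(b), e2], dim=1))  # skip concatenation
        return self.dec1(torch.cat([up(d2), e1], dim=1))
```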
To cope with the joint estimation of rain streaks and background, we output task-dependent features in the last layer. In particular, the last feature maps are split into two parts: the first part predicts the rain streak residual, while the other generates a rain-free background image. Unlike traditional multi-task learning, which shares all the features and uses them to produce the final results simultaneously, we explicitly assign the corresponding features to each task. This avoids an imbalance of feature maps between the two outputs and makes each part responsible for its own task, reducing information interference at the final prediction. Although all features are shared except in the last layer that generates the two outputs, the entire network is driven to produce two independent sets of features. This one-to-many supervision encourages interaction between two substantially different tasks within the network, leading to diverse and rich feature representations.
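A minimal sketch of how this task-dependent split at the last layer might look is given below, assuming the decoder produces a feature tensor that is divided channel-wise into two halves; the channel count and the 3x3 prediction convolutions are hypothetical.

```python
import torch
import torch.nn as nn

class TaskDependentHead(nn.Module):
    """Split the last feature maps into two halves: one half predicts the
    rain-streak residual, the other predicts the rain-free background."""
    def __init__(self, in_channels=32):
        super().__init__()
        half = in_channels // 2
        self.to_residual = nn.Conv2d(half, 3, kernel_size=3, padding=1)
        self.to_background = nn.Conv2d(half, 3, kernel_size=3, padding=1)

    def forward(self, features, rainy_input):
        f_res, f_bg = torch.chunk(features, 2, dim=1)   # task-dependent features
        r_out = self.to_residual(f_res)                  # rain-streak residual
        b_out = self.to_background(f_bg)                 # direct rain-free estimate
        b_sub = rainy_input - r_out                      # background via subtraction
        return r_out, b_out, b_sub
```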
B. Separable Element-Wise Attention
The rain streak distribution is of great importance to both removing rain streaks and estimating the background. Intuitively, this information could be modeled as a spatial attention map that guides network training. However, the individual maps of the high-dimensional features differ substantially from one another and may correspond to different objectives that cannot be unified by a single spatial attention map. Directly computing element-wise attention for all convolutional blocks leads to high computational costs. Inspired by the separable bilateral filter [18] from signal processing, we propose Separable Element-wise Attention for the network.
As shown in the bottom part of Fig. 2, the proposed Separable Element-wise Attention block is mainly composed of two parts. The first part is a dense connection module [10], which propagates the output of each convolutional layer to all subsequent convolutional layers within the block, promoting information and gradient flow.
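The dense connection part could be sketched as below, following the dense connectivity of [10]; the default growth rate of 32 matches the setting reported in the experiments, while the layer count and kernel sizes here are illustrative.

```python
import torch
import torch.nn as nn

class DenseConnection(nn.Module):
    """Densely connected convolutions: each layer receives the concatenation
    of the block input and all previous layer outputs."""
    def __init__(self, in_channels, growth_rate=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth_rate, 3, padding=1),
                nn.ReLU(inplace=True)))
            channels += growth_rate
        self.out_channels = channels

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # reuse all earlier features
        return torch.cat(feats, dim=1)
```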
The second part of the SEA block is the proposed element-wise attention module. This module calculates the channel attention and the spatial attention sequentially and applies them to the feature maps, approximating element-wise attention at negligible extra cost.
1) Channel Attention
Our channel attention focuses on the relation between different channels, aiming to assign higher weights to the important feature maps. To reduce the computational complexity, we aggregate the spatial information by global average pooling and global max pooling, softly encoding the input feature maps into two vectors from which the per-channel attention weights are computed.
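A CBAM-style sketch of such a channel attention module is given below, assuming the two pooled vectors are passed through a shared two-layer MLP with a reduction ratio and merged by a sigmoid; the exact fusion design and the ratio value are assumptions beyond what is stated in this excerpt.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention from globally pooled descriptors (CBAM-style sketch)."""
    def __init__(self, channels, reduction=16):   # reduction ratio is hypothetical
        super().__init__()
        self.mlp = nn.Sequential(                  # shared MLP is an assumption
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg_vec = x.mean(dim=(2, 3))               # global average pooling -> (B, C)
        max_vec = x.amax(dim=(2, 3))               # global max pooling     -> (B, C)
        weights = torch.sigmoid(self.mlp(avg_vec) + self.mlp(max_vec))
        return x * weights.view(b, c, 1, 1)        # reweight channels
```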
2) Spatial Attention
Different from the channel attention module, the spatial attention focuses on the relation between different locations in the feature maps, aiming to emphasize spatially discriminative information. Similar to the channel attention, we first apply average pooling and max pooling to the input feature maps along the channel axis, obtaining two maps from which the spatial attention map is computed.
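A corresponding sketch of the spatial attention is shown below, assuming the two channel-pooled maps are concatenated and fused by a single convolution followed by a sigmoid; the 7x7 kernel size is an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention from channel-wise pooled maps (sketch)."""
    def __init__(self, kernel_size=7):             # kernel size is hypothetical
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)       # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)       # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                             # reweight spatial locations
```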
At the end of the SEA block, we add a residual connection directly from input to output. If the number of channels differs, we use a $1\times1$ convolution to match the dimensions.
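Putting the pieces together, a possible SEA block could look as follows, reusing the DenseConnection, ChannelAttention, and SpatialAttention sketches above; the 1x1 fusion convolution after the dense part is an assumption made here to keep channel counts manageable.

```python
import torch.nn as nn

class SEABlock(nn.Module):
    """Sketch of an SEA block: dense connections, then channel and spatial
    attention applied in sequence, plus a residual connection (a 1x1
    convolution is used when the channel counts differ)."""
    def __init__(self, in_channels, out_channels, growth_rate=32, n_layers=4):
        super().__init__()
        self.dense = DenseConnection(in_channels, growth_rate, n_layers)
        self.fuse = nn.Conv2d(self.dense.out_channels, out_channels, 1)  # assumed
        self.ca = ChannelAttention(out_channels)
        self.sa = SpatialAttention()
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1))

    def forward(self, x):
        y = self.fuse(self.dense(x))
        y = self.sa(self.ca(y))          # channel attention, then spatial attention
        return y + self.shortcut(x)      # residual connection
```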
C. Training Objectives and Details
We use four loss functions to optimize the proposed network.
1) Pixel Loss
Given the ground-truth rain-free image $B_{gt}$, the pixel loss measures the $L_1$ distance of both outputs to the ground truth: \begin{align*} \mathfrak {L}_{p}=&\frac { \left \|{ B_{sub}-B_{gt} }\right \|_{1}}{N_{gt}}+\frac { \left \|{ B_{out}-B_{gt} }\right \|_{1}}{N_{gt}} \\=&\frac { \left \|{ O_{in}-R_{out}-B_{gt} }\right \|_{1}}{N_{gt}}+\frac { \left \|{ B_{out}-B_{gt} }\right \|_{1}}{N_{gt}},\tag{2}\end{align*} where $B_{sub}=O_{in}-R_{out}$ is the background recovered by subtracting the predicted residual from the input, $B_{out}$ is the directly predicted background, and $N_{gt}$ is the number of pixels.
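Eq. (2) maps directly to a few lines of PyTorch; the sketch below assumes image tensors of identical shape, with the default mean reduction of the L1 loss playing the role of the division by $N_{gt}$.

```python
import torch.nn.functional as F

def pixel_loss(o_in, r_out, b_out, b_gt):
    """Eq. (2): L1 loss on the subtraction-based background O_in - R_out
    and on the directly predicted background B_out."""
    b_sub = o_in - r_out
    return F.l1_loss(b_sub, b_gt) + F.l1_loss(b_out, b_gt)
```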
2) Perceptual and Style Losses
We introduce perceptual and style losses [6] into the network to measure the content and style differences between two images. The reconstructed image should be close to the ground truth not only at the pixel level but also at higher, semantic levels. We first define the perceptual loss:\begin{equation*} \mathfrak {L}_{perc}\!=\!\sum _{p}\frac { \left \|{ \Phi ^{p}_{B_{sub}}\!-\!\Phi ^{p}_{B_{gt}} }\right \|_{1}}{N_{\Phi ^{p}_{B_{gt}}}}+\sum _{p}\frac { \left \|{ \Phi ^{p}_{B_{out}}-\Phi ^{p}_{B_{gt}} }\right \|_{1}}{N_{\Phi ^{p}_{B_{gt}}}},\tag{3}\end{equation*} where $\Phi^{p}$ denotes the feature maps of the $p$-th selected layer of a pre-trained VGG network and $N_{\Phi^{p}_{B_{gt}}}$ is the number of elements in the corresponding feature map.
The style loss is also calculated on the VGG feature maps, but it measures the $L_1$ distance between their Gram matrices:\begin{align*} \mathfrak {L}_{style_{B'}}=&\sum _{p}\frac {\left \|{ K_{p}((\Phi ^{p}_{B_{sub}})^{T}(\Phi ^{p}_{B_{sub}})\!-\!(\Phi ^{p}_{B_{gt}})^{T}(\Phi ^{p}_{B_{gt}})) }\right \|_{1}}{C_{p}C_{p}}, \qquad \tag{4}\\ \mathfrak {L}_{style_{B}}=&\sum _{p}\frac {\left \|{ K_{p}((\Phi ^{p}_{B_{out}})^{T}(\Phi ^{p}_{B_{out}})\!-\!(\Phi ^{p}_{B_{gt}})^{T}(\Phi ^{p}_{B_{gt}})) }\right \|_{1}}{C_{p}C_{p}}.\tag{5}\end{align*} The total style loss $\mathfrak {L}_{style}$ is the sum of these two terms.
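A sketch of both losses is shown below; the choice of VGG-16 layers and the Gram normalization by C·H·W are assumptions made for this sketch, since the paper's constants $K_{p}$ and $C_{p}$ in Eqs. (4)-(5) are not spelled out in this excerpt.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Pre-trained VGG-16 as feature extractor; the layer indices are a hypothetical choice.
_vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)
_LAYERS = {3, 8, 15}   # relu1_2, relu2_2, relu3_3

def _features(x):
    feats, h = [], x
    for i, layer in enumerate(_vgg):
        h = layer(h)
        if i in _LAYERS:
            feats.append(h)
    return feats

def _gram(f):
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    # Normalizing by c*h*w is a common choice, not the paper's exact constants.
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def perceptual_and_style(pred, gt):
    """Sum of L1 distances between VGG features (perceptual) and between
    their Gram matrices (style) over the selected layers."""
    fp, fg = _features(pred), _features(gt)
    perc = sum(F.l1_loss(a, b) for a, b in zip(fp, fg))
    style = sum(F.l1_loss(_gram(a), _gram(b)) for a, b in zip(fp, fg))
    return perc, style
```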
3) Edge Loss
Due to the influence of rain streaks, the edges of the background become discontinuous or blurred, and the pixel loss alone cannot guarantee edge correctness. To this end, we extract edges from the outputs and the ground truth using the Sobel operator and compute their $L_1$ distances to enforce correct edges:\begin{equation*} \mathfrak {L}_{edge}\!=\!\frac { \left \|{ f_{s}(B_{sub})\!-\!f_{s}(B_{gt}) }\right \|_{1}}{N_{gt}}+\frac { \left \|{ f_{s}(B_{out})-f_{s}(B_{gt}) }\right \|_{1}}{N_{gt}},\tag{6}\end{equation*} where $f_{s}(\cdot)$ denotes the Sobel edge extraction operator.
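Eq. (6) can be sketched with a fixed Sobel filter applied per channel; the gradient-magnitude formulation below is one common reading of $f_{s}$, used here as an assumption.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Per-channel Sobel gradient magnitude, a minimal sketch of f_s(.)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = img.shape[1]
    gx = F.conv2d(img, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def edge_loss(b_sub, b_out, b_gt):
    """Eq. (6): L1 distance between Sobel edges of each output and the ground truth."""
    edges_gt = sobel_edges(b_gt)
    return (F.l1_loss(sobel_edges(b_sub), edges_gt)
            + F.l1_loss(sobel_edges(b_out), edges_gt))
```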
The total loss is then the weighted sum of the above losses:\begin{equation*} \mathfrak {L}_{total}=\lambda _{r}\mathfrak {L}_{p}+\lambda _{p}\mathfrak {L}_{perc}+\lambda _{s}\mathfrak {L}_{style}+\lambda _{e}\mathfrak {L}_{edge},\tag{7}\end{equation*} where $\lambda_{r}$, $\lambda_{p}$, $\lambda_{s}$, and $\lambda_{e}$ are weights balancing the corresponding terms.
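The combination in Eq. (7) is then a one-liner; the weight values below are placeholders, not the values used in the paper.

```python
def total_loss(l_p, l_perc, l_style, l_edge,
               lam_r=1.0, lam_p=0.05, lam_s=100.0, lam_e=1.0):
    # Placeholder weights; the paper's values are not given in this excerpt.
    return lam_r * l_p + lam_p * l_perc + lam_s * l_style + lam_e * l_edge
```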
Experiments
In this section, we evaluate our proposed method on both synthetic and real-world rainy data and compare it with other state-of-the-art methods on these datasets.
A. Experiment Settings
1) Training Settings
We first describe the hyper-parameters used in our model. For each SEA block, the growth rate, i.e., the number of feature maps produced by each sub-convolutional layer [10] in the dense connection part, is set to 32, and the number of sub-convolutional layers in the dense part is 8 for the innermost 9 blocks and 4 for the remaining blocks. This setting follows the resolution and number of features at each level. Furthermore, the reduction ratio
2) Datasets
To evaluate the de-raining ability of our method, we utilize three synthetic datasets. The first is the Rain800 dataset [28], which includes 700 training images and 100 testing images. The second is the Rain200 dataset [25] (extended from Rain100), which contains two subsets: 1) a heavy rain set (Rain200H) synthesized with five types of streaks, and 2) a light rain set (Rain200L) synthesized with only one streak type. Each set contains 1,800 training images and 200 testing images. In the experiments, we train a model on the training set of Rain200H and evaluate it on the testing sets of both Rain200H and Rain200L. We exclude Rain200L from training because its rain streak patterns are included in Rain200H, which allows us to evaluate the generalization ability of the methods. The third dataset is the DID-MDN dataset, with one training set and two testing sets. The training set consists of 12,000 images synthesized by adding rain streaks of three densities (light, medium, heavy) to 4,000 rain-free images. The first testing set, denoted DID-Test1, is constructed in the same way as the training set and contains 1,200 images. The second, denoted DID-Test2, is obtained by randomly sampling 1,000 images from the synthetic dataset provided by Fu et al. [4] and is also used to test generalization capability. Since the proposed model predicts the rain-free image as one of its outputs, and to avoid overfitting caused by predicting the same rain-free image multiple times, we select the same number of training images with different backgrounds from the three density levels, building a new training set of 4,000 images for our experiments.
For the real-world evaluation, we use the real rainy images provided by Yang et al. [25] and Zhang et al. [28]. We also collect photos from the web, most of which are captured in street and city scenes and are therefore more consistent with the application scenario of the de-raining task.
3) Measurement and Comparison
We evaluate the de-raining methods with the commonly used peak signal-to-noise ratio (PSNR) [11] and Structural Similarity Index (SSIM) [24] metrics. For real images, we mainly present qualitative comparisons and a user study (see our supplementary material), due to the absence of ground truth. We compare our method with several state-of-the-art CNN-based methods, including DDN [4], JORDER [25], DID-MDN [27], and SCAN together with its recurrent version (RESCAN) [15].
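For reference, a minimal evaluation helper using scikit-image is sketched below; whether the metrics are computed on RGB or on the luminance channel is not specified in this excerpt, so the RGB variant shown here is an assumption (the channel_axis argument requires a recent scikit-image version).

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, gt):
    """PSNR/SSIM for one de-rained image against its ground truth.
    Assumes uint8 RGB arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=255)
    return psnr, ssim
```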
B. Evaluation on Synthetic Dataset
For a comprehensive evaluation, we train one model on each of the training sets mentioned above and test it on the corresponding testing sets. For a fair comparison, we fine-tune the competing models on the corresponding training sets for the same number of epochs as ours, except for JORDER, which only provides a model trained on Rain200H and no training details.
The quantitative PSNR and SSIM results are shown in Table 1. Our method performs better than all the other deep learning based methods. Although the latest RESCAN achieves the best results among previous methods on almost all synthesized datasets, it performs worse than DID-MDN in restoring structural information (SSIM) on DID-Test2, implying that it does not generalize well to unseen rain streaks. In contrast, our method outperforms previous methods on both DID-Test1 and DID-Test2. In addition, each rainy image is processed only once, without a recurrent structure (as used in RESCAN) or extra refinement networks (as used in DID-MDN).
Fig. 4 shows the visual results of all methods. The first image is chosen from the Rain200H testing set, the most difficult dataset since the original images are heavily corrupted. Both DID-MDN and RESCAN remove the rain streaks well and restore the color of the original image; however, their results contain distortions and unsmooth regions in the background and object details (best viewed zoomed in on the digital version). In contrast, our method performs well in both detail recovery and background smoothing. The second image is chosen from DID-Test2 to show the generalization of each model. On this image, most methods, including the recent RESCAN, cannot completely remove the rain streaks. Although no obvious white rain streaks remain in the result of DID-MDN, there are many misplacements and distortions around the rain streak shapes, which results in low PSNR and SSIM. Compared with the other methods, our model removes the unseen rain streaks and restores the structure and intensity of the original image well.
Fig. 4. Results and evaluations of each method on synthetic images. The first image is chosen from the Rain200H testing set; the second is chosen from DID-Test2.
C. Evaluation on Real-World Dataset
The ultimate goal of de-raining is application to real-world scenes, so we perform another evaluation on rainy images captured in the real world. For a fair comparison, for each method we select the model trained on Rain200H, since training on Rain200H further enhances the robustness of the network, as noted in [25]. Example results on real-world de-raining are shown in Fig. 5. Our method removes the rain streaks well without breaking the original structure. The result on the second image shows that our method even handles raindrop-like and watermark-like rain streaks, which the other methods fail to remove. To further evaluate the proposed method on real-world data, we conduct a user study in the supplementary material.
D. Ablation Study
In this section, we study the effectiveness of each term/module in our model. To better test the fitting and generalization ability of each module, we train and test on the DID-MDN dataset.
First, we validate the effectiveness of our main strategy of simultaneously estimating the rain-free image and the residual rain streak image. In this ablation study, we train three additional models as shown in Table 2. "Rain-Streak Only" refers to the model that only predicts the residual rain streak image (which is subtracted from the rainy image to obtain the rain-free background). "Rain-Free Only" refers to the model that only predicts the rain-free background. "w/o Task-dependent" refers to the model that predicts both outputs from the last feature maps without separating them into task-dependent features. In addition, we use the notation
From the results in Table 2, we can see that jointly predicting the two outputs without task-dependent features decreases performance compared with predicting only one output. This implies that simply adding an extra prediction task to the network does not benefit removal performance. However, when predicting the results with task-dependent feature maps, we obtain a significant improvement over the single-output models, even though the number of feature maps for each output is halved. This confirms our motivation that using the rain-free background as one of the outputs provides more information and enables better interaction between the different kinds of features.
Next, we compare the effectiveness of the element-wise attention, the perceptual and style losses, and the edge loss. The results are shown in Table 3. Each module and loss has a positive effect on the removal performance and generalization ability of the model. Notably, the proposed Separable Element-wise Attention (SEA) block significantly boosts performance.
E. User Study on Real Rainy Images
To further evaluate the effectiveness of the proposed method, we conduct a user study on 30 real rainy images. These images are collected to simulate the actual usage of a de-raining system: close-up shots, pedestrians, and buildings in heavy rain, as well as images with a black background and a strong white light source to simulate rainy scenes at night. We compare our method with DDN [4], JORDER [25], DID-MDN [27], and RESCAN [15]. We invite 30 participants and ask each to choose the de-rained result that is the most rain-free and natural. Results are shown in Table 4, where "Voted" is the total number of votes for the corresponding method and "Selected" is the number of images for which that method received the most votes. Our method obtains the most votes, with DID-MDN ranking second. This reveals that although RESCAN performs well on the training set, DID-MDN generalizes better to real-world scenes. In contrast, the proposed method performs the best on both synthetic and real-world scenes.
Conclusion
We propose a coupled rain streak and background estimation network with Separable Element-wise Attention modules, which addresses rain streak removal from two aspects. First, we delve into the joint estimation of the rain streaks and the rain-free background, bridging the two tasks with task-dependent features. Second, we present a Separable Element-wise Attention module to exploit the rain streak distribution in all layers of the network, realized by two sequential modules: channel attention and spatial attention. Existing convolutional blocks can incorporate such element-wise attention on the fly. Extensive experiments demonstrate that the proposed method achieves superior performance over state-of-the-art methods, both quantitatively and qualitatively. The proposed Separable Element-wise Attention is a general mechanism, which we believe to be effective in other vision tasks.