Introduction
Most existing computer vision systems are designed for disturbance-free scenarios, so rain streaks in an image degrade visibility and prevent many computer vision algorithms from working properly. Addressing this visibility problem is challenging due to the random distribution of rain streaks. Early works [2], [16], [17] treat it as a signal separation problem using low-rank decomposition or Gaussian mixture models (GMM), or resolve it in a denoising manner with a non-local mean smoothing algorithm [13]. Recently, deep learning based models [4], [25], [27] learn from synthesized data and achieve superior performance owing to their powerful feature representation ability.
Notwithstanding the demonstrated success, these deep models suffer from two main issues. First, most state-of-the-art models [4], [15], [25], [27] focus on predicting rain streaks only. While this is reasonable because rain streaks are sparse and contain simple texture information, it forces the network to focus only on local feature representations. As can be seen in Fig. 1b, the feature responses learned by a residual prediction network highlight rain streaks rather than background regions. On the other hand, a dense rain streak scenario is visually similar to mist or fog, which makes predicting the rain streaks easy but recovering the original image content difficult. A network that predicts a rain-free background shows a different learning focus (see Fig. 1c), and these two objectives may complement each other.
Fig. 1. Deep networks that estimate the rain streak residual (b) or the rain-free background (c) show substantially different feature responses (neuron activation values). Relying on residual prediction alone cannot handle dense rain streak scenarios.
Second, attention to the rain streak distribution is not fully explored in de-raining models. Although a spatial visual attention map is incorporated as one of the network inputs in raindrop removal [19], the attention module should instead be injected into the feature levels of the entire network. Attention not only filters out redundant information but also improves the feature representation. In this sense, traditional spatial attention is not enough, as it applies the same weights to all channels of the feature maps. However, learning an element-wise attention module with the same size as the feature maps hugely increases the computational overhead.
In this paper, we address the above two problems by proposing a coupled rain streak and background estimation network with Separable Element-wise Attention. The proposed network produces task-dependent features so that the intrinsic relationship between the two tasks can be explored during training. Furthermore, we implement element-wise attention using a sequence of channel and spatial attention modules. This combination achieves element-wise attention with negligible computation, so it can be applied to all convolutional blocks. Extensive experiments show that the proposed method outperforms state-of-the-art de-raining methods on five benchmarks as well as real-world scenarios. More importantly, this superior performance is obtained without additional multi-scale or recurrent structures.
To summarize, our contributions are three-fold:
We propose to jointly estimate rain streaks and background in the same network with task-dependent features. This simple approach shows a significant improvement over predicting either task individually.
We present a Separable Element-wise Attention module, which focuses on important feature elements while suppressing redundant ones. Our separable implementation introduces element-wise attention with negligible computational effort. It is a general component and can be applied to other deep models.
Extensive experiments conducted on five challenging benchmarks and real-world data demonstrate the effectiveness of the proposed approach over state-of-the-art methods.
Related Work
Rain streak removal is challenging, and therefore early works leverage additional temporal information from multiple frames. Garg and Nayar [5] propose to detect and remove rain streaks based on the dynamics and photometry of rain. Besides temporal information, other cues such as the chromatic properties and shape characteristics of rain are utilized in [29] and [1], respectively. Recently, video rain removal has been addressed using low-rank matrices [14], optical flow on local phase information [21], and matrix decomposition [20].
Different from video-based de-raining with temporal information, single image rain removal is an ill-posed and therefore much more challenging problem. Many traditional methods introduce additional prior information and regard it as a signal separation problem. Kang et al. [12] and Sun et al. [22] separate images into high- and low-frequency parts by analyzing the morphological and structural information of rain images. Luo et al. [17] separate rain streaks and the background scene with a discriminative sparse coding method. In addition, Gaussian mixture models (GMM) [9], [16] are used to decompose the rainy image into background and rain streak layers. Low-rank models are also used to separate the input image into different layers in [2], [3], [26]. Zhang and Patel [13] instead recover the rain-free image with a non-local means filter. Although these methods can detect and remove rain streaks, their main limitation is over-smoothing image details, since much of the texture and fine structure information belongs to the high-frequency part.
Recent approaches adopt deep learning and achieve notable success in single image de-raining. Fu et al. [4] introduce a model to predict the residual rain streaks using the decomposed high frequency part as input. Yang et al. [25] present a deep recurrent model with a dilated network to detect and remove rain streaks iteratively. Zhang et al. [27] propose a density classifier and combine the predicted label with the features of a multi-stream network for de-raining. Li et al. [15] integrate deep convolutional and recurrent neural networks to remove rain streaks in a multi-stage manner.
As mentioned above, all of these methods predict the residual rain streaks and neglect semantic background information. Additionally, none of them incorporates attention within the network.
Approach
A rain image $O$ is commonly modeled as the superposition of a rain-free background $B$ and a rain streak layer $R$: \begin{equation*} O = B + R.\tag{1}\end{equation*}
A. Network Design
The pipeline of the proposed method is shown in Fig. 2. Given an input rainy image $O_{in}$, the network jointly estimates the rain streak residual $R_{out}$ and the rain-free background $B_{out}$.
Fig. 2. The pipeline of the proposed method and the detailed structure of the Separable Element-wise Attention block.
Our network has a plain encoder-decoder architecture. In each resolution level of the encoder and decoder except the outermost layer, we replace the single convolution with the proposed Separable Element-wise Attention (SEA) block to enrich feature representations. Average pooling is used for downsampling and bilinear interpolation for upsampling. Skip connections concatenate the encoder feature maps with the decoder feature maps of the same resolution before feeding them to the next block.
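For concreteness, the sketch below shows such an encoder-decoder layout in PyTorch, with average-pooling downsampling, bilinear upsampling, and skip concatenation; the depth, channel widths, and the plain convolutional block used here are illustrative placeholders (the paper uses SEA blocks at each level), not the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    # Placeholder for the SEA block described in Section B.
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True))

class EncoderDecoder(nn.Module):
    """Two-level encoder-decoder skeleton: average-pool downsampling,
    bilinear upsampling, and skip concatenation (illustrative widths)."""
    def __init__(self):
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, 32), conv_block(32, 64)
        self.bottom = conv_block(64, 64)
        self.dec2 = conv_block(64 + 64, 32)   # upsampled bottom + skip from enc2
        self.dec1 = conv_block(32 + 32, 32)   # upsampled dec2 + skip from enc1

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(F.avg_pool2d(e1, 2))            # downsample by average pooling
        b = self.bottom(F.avg_pool2d(e2, 2))
        up = lambda t: F.interpolate(t, scale_factor=2, mode='bilinear',
                                     align_corners=False)  # bilinear upsampling
        d2 = self.dec2(torch.cat([up(b), e2], dim=1))  # skip concatenation
        return self.dec1(torch.cat([up(d2), e1], dim=1))
```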
To cope with the joint estimation of rain streaks and background, we output task-dependent features in the last layer. In particular, the last feature maps are split into two parts: the first part predicts the rain streak residual, while the other generates a rain-free background image. Unlike traditional multi-task learning, which shares all the features and uses them to produce the final results simultaneously, we explicitly assign the corresponding features to each task. This avoids an imbalance of feature maps between the two outputs and makes each part responsible for its own task, reducing information interference at the final prediction. Although all features are shared except in the last layer that generates the two outputs, the entire network is driven to produce two independent sets of features. This one-to-many supervision encourages interaction between two substantially different tasks within the network, leading to diverse and rich feature representations.
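A minimal sketch of how this task-dependent split at the last layer might look is given below, assuming the decoder produces a feature tensor that is divided channel-wise into two halves; the channel count and the 3x3 prediction convolutions are hypothetical.

```python
import torch
import torch.nn as nn

class TaskDependentHead(nn.Module):
    """Split the last feature maps into two halves: one half predicts the
    rain-streak residual, the other predicts the rain-free background."""
    def __init__(self, in_channels=32):
        super().__init__()
        half = in_channels // 2
        self.to_residual = nn.Conv2d(half, 3, kernel_size=3, padding=1)
        self.to_background = nn.Conv2d(half, 3, kernel_size=3, padding=1)

    def forward(self, features, rainy_input):
        f_res, f_bg = torch.chunk(features, 2, dim=1)   # task-dependent features
        r_out = self.to_residual(f_res)                  # rain-streak residual
        b_out = self.to_background(f_bg)                 # direct rain-free estimate
        b_sub = rainy_input - r_out                      # background via subtraction
        return r_out, b_out, b_sub
```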
B. Separable Element-Wise Attention
The rain streak distribution is of great importance to both removing rain streaks and estimating the background. Intuitively, this information could be modeled as a spatial attention map that guides network training. However, the individual maps of the high-dimensional features differ substantially from one another and may correspond to different objectives that cannot be unified by a single spatial attention map. Directly computing element-wise attention for all convolutional blocks leads to high computational costs. Inspired by the separable bilateral filter [18] from signal processing, we propose Separable Element-wise Attention for the network.
As shown in the bottom part of Fig. 2, the proposed Separable Element-wise Attention block is mainly composed of two parts. The first part is a dense connection module [10], which propagates the output of each convolutional layer to all subsequent convolutional layers within the block, promoting information and gradient flow.
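The dense connection part could be sketched as below, following the dense connectivity of [10]; the default growth rate of 32 matches the setting reported in the experiments, while the layer count and kernel sizes here are illustrative.

```python
import torch
import torch.nn as nn

class DenseConnection(nn.Module):
    """Densely connected convolutions: each layer receives the concatenation
    of the block input and all previous layer outputs."""
    def __init__(self, in_channels, growth_rate=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(channels, growth_rate, 3, padding=1),
                nn.ReLU(inplace=True)))
            channels += growth_rate
        self.out_channels = channels

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))  # reuse all earlier features
        return torch.cat(feats, dim=1)
```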
The second part of the SEA block is the proposed element-wise attention module. This module calculates the channel attention and the spatial attention sequentially and applies them to the feature maps, approximating element-wise attention at negligible extra cost.
1) Channel Attention
Our channel attention focuses on the relation between different channels, aiming to assign higher weights to the important feature maps. To reduce the computational complexity, we aggregate the spatial information by global average pooling and global max pooling, softly encoding the input feature maps into two vectors from which the per-channel attention weights are computed.
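A CBAM-style sketch of such a channel attention module is given below, assuming the two pooled vectors are passed through a shared two-layer MLP with a reduction ratio and merged by a sigmoid; the exact fusion design and the ratio value are assumptions beyond what is stated in this excerpt.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention from globally pooled descriptors (CBAM-style sketch)."""
    def __init__(self, channels, reduction=16):   # reduction ratio is hypothetical
        super().__init__()
        self.mlp = nn.Sequential(                  # shared MLP is an assumption
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))

    def forward(self, x):
        b, c, _, _ = x.shape
        avg_vec = x.mean(dim=(2, 3))               # global average pooling -> (B, C)
        max_vec = x.amax(dim=(2, 3))               # global max pooling     -> (B, C)
        weights = torch.sigmoid(self.mlp(avg_vec) + self.mlp(max_vec))
        return x * weights.view(b, c, 1, 1)        # reweight channels
```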
2) Spatial Attention
Different from the channel attention module, the spatial attention focuses on the relation between different locations in the feature maps, aiming to emphasize spatially discriminative information. Similar to the channel attention, we first apply average pooling and max pooling to the input feature maps along the channel axis, obtaining two maps from which the spatial attention map is computed.
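A corresponding sketch of the spatial attention is shown below, assuming the two channel-pooled maps are concatenated and fused by a single convolution followed by a sigmoid; the 7x7 kernel size is an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Spatial attention from channel-wise pooled maps (sketch)."""
    def __init__(self, kernel_size=7):             # kernel size is hypothetical
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)       # (B, 1, H, W)
        max_map = x.amax(dim=1, keepdim=True)       # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                             # reweight spatial locations
```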
At the end of the SEA block, we add a residual connection directly from input to output. If the number of channels differs, we use a $1\times1$ convolution to match the dimensions.
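Putting the pieces together, a possible SEA block could look as follows, reusing the DenseConnection, ChannelAttention, and SpatialAttention sketches above; the 1x1 fusion convolution after the dense part is an assumption made here to keep channel counts manageable.

```python
import torch.nn as nn

class SEABlock(nn.Module):
    """Sketch of an SEA block: dense connections, then channel and spatial
    attention applied in sequence, plus a residual connection (a 1x1
    convolution is used when the channel counts differ)."""
    def __init__(self, in_channels, out_channels, growth_rate=32, n_layers=4):
        super().__init__()
        self.dense = DenseConnection(in_channels, growth_rate, n_layers)
        self.fuse = nn.Conv2d(self.dense.out_channels, out_channels, 1)  # assumed
        self.ca = ChannelAttention(out_channels)
        self.sa = SpatialAttention()
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1))

    def forward(self, x):
        y = self.fuse(self.dense(x))
        y = self.sa(self.ca(y))          # channel attention, then spatial attention
        return y + self.shortcut(x)      # residual connection
```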
C. Training Objectives and Details
We use four loss functions to optimize the proposed network.
1) Pixel Loss
Given the ground-truth rain-free image $B_{gt}$, the pixel loss measures the $L_1$ distance of both outputs to the ground truth: \begin{align*} \mathfrak {L}_{p}=&\frac { \left \|{ B_{sub}-B_{gt} }\right \|_{1}}{N_{gt}}+\frac { \left \|{ B_{out}-B_{gt} }\right \|_{1}}{N_{gt}} \\=&\frac { \left \|{ O_{in}-R_{out}-B_{gt} }\right \|_{1}}{N_{gt}}+\frac { \left \|{ B_{out}-B_{gt} }\right \|_{1}}{N_{gt}},\tag{2}\end{align*} where $B_{sub}=O_{in}-R_{out}$ is the background recovered by subtracting the predicted residual from the input, $B_{out}$ is the directly predicted background, and $N_{gt}$ is the number of pixels.
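Eq. (2) maps directly to a few lines of PyTorch; the sketch below assumes image tensors of identical shape, with the default mean reduction of the L1 loss playing the role of the division by $N_{gt}$.

```python
import torch.nn.functional as F

def pixel_loss(o_in, r_out, b_out, b_gt):
    """Eq. (2): L1 loss on the subtraction-based background O_in - R_out
    and on the directly predicted background B_out."""
    b_sub = o_in - r_out
    return F.l1_loss(b_sub, b_gt) + F.l1_loss(b_out, b_gt)
```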
2) Perceptual and Style Losses
We introduce perceptual and style losses [6] into the network to measure the content and style differences between two images. The reconstructed image should be close to the ground truth not only at the pixel level but also at higher, semantic levels. We first define the perceptual loss:\begin{equation*} \mathfrak {L}_{perc}\!=\!\sum _{p}\frac { \left \|{ \Phi ^{p}_{B_{sub}}\!-\!\Phi ^{p}_{B_{gt}} }\right \|_{1}}{N_{\Phi ^{p}_{B_{gt}}}}+\sum _{p}\frac { \left \|{ \Phi ^{p}_{B_{out}}-\Phi ^{p}_{B_{gt}} }\right \|_{1}}{N_{\Phi ^{p}_{B_{gt}}}},\tag{3}\end{equation*} where $\Phi^{p}$ denotes the feature maps of the $p$-th selected layer of a pre-trained VGG network and $N_{\Phi^{p}_{B_{gt}}}$ is the number of elements in the corresponding feature map.
The style loss is also calculated on the VGG feature maps, but it measures the $L_1$ distance between their Gram matrices:\begin{align*} \mathfrak {L}_{style_{B'}}=&\sum _{p}\frac {\left \|{ K_{p}((\Phi ^{p}_{B_{sub}})^{T}(\Phi ^{p}_{B_{sub}})\!-\!(\Phi ^{p}_{B_{gt}})^{T}(\Phi ^{p}_{B_{gt}})) }\right \|_{1}}{C_{p}C_{p}}, \qquad \tag{4}\\ \mathfrak {L}_{style_{B}}=&\sum _{p}\frac {\left \|{ K_{p}((\Phi ^{p}_{B_{out}})^{T}(\Phi ^{p}_{B_{out}})\!-\!(\Phi ^{p}_{B_{gt}})^{T}(\Phi ^{p}_{B_{gt}})) }\right \|_{1}}{C_{p}C_{p}}.\tag{5}\end{align*} The total style loss $\mathfrak {L}_{style}$ is the sum of these two terms.
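A sketch of both losses is shown below; the choice of VGG-16 layers and the Gram normalization by C·H·W are assumptions made for this sketch, since the paper's constants $K_{p}$ and $C_{p}$ in Eqs. (4)-(5) are not spelled out in this excerpt.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16

# Pre-trained VGG-16 as feature extractor; the layer indices are a hypothetical choice.
_vgg = vgg16(weights="IMAGENET1K_V1").features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)
_LAYERS = {3, 8, 15}   # relu1_2, relu2_2, relu3_3

def _features(x):
    feats, h = [], x
    for i, layer in enumerate(_vgg):
        h = layer(h)
        if i in _LAYERS:
            feats.append(h)
    return feats

def _gram(f):
    b, c, h, w = f.shape
    f = f.reshape(b, c, h * w)
    # Normalizing by c*h*w is a common choice, not the paper's exact constants.
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def perceptual_and_style(pred, gt):
    """Sum of L1 distances between VGG features (perceptual) and between
    their Gram matrices (style) over the selected layers."""
    fp, fg = _features(pred), _features(gt)
    perc = sum(F.l1_loss(a, b) for a, b in zip(fp, fg))
    style = sum(F.l1_loss(_gram(a), _gram(b)) for a, b in zip(fp, fg))
    return perc, style
```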
3) Edge Loss
Due to the influence of rain streaks, the edges of the background become discontinuous or blurred, and the pixel loss alone cannot guarantee edge correctness. To this end, we extract edges from the outputs and the ground truth using the Sobel operator and compute their $L_1$ distances to enforce correct edges:\begin{equation*} \mathfrak {L}_{edge}\!=\!\frac { \left \|{ f_{s}(B_{sub})\!-\!f_{s}(B_{gt}) }\right \|_{1}}{N_{gt}}+\frac { \left \|{ f_{s}(B_{out})-f_{s}(B_{gt}) }\right \|_{1}}{N_{gt}},\tag{6}\end{equation*} where $f_{s}(\cdot)$ denotes the Sobel edge extraction operator.
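Eq. (6) can be sketched with a fixed Sobel filter applied per channel; the gradient-magnitude formulation below is one common reading of $f_{s}$, used here as an assumption.

```python
import torch
import torch.nn.functional as F

def sobel_edges(img):
    """Per-channel Sobel gradient magnitude, a minimal sketch of f_s(.)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=img.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    c = img.shape[1]
    gx = F.conv2d(img, kx.repeat(c, 1, 1, 1), padding=1, groups=c)
    gy = F.conv2d(img, ky.repeat(c, 1, 1, 1), padding=1, groups=c)
    return torch.sqrt(gx ** 2 + gy ** 2 + 1e-6)

def edge_loss(b_sub, b_out, b_gt):
    """Eq. (6): L1 distance between Sobel edges of each output and the ground truth."""
    edges_gt = sobel_edges(b_gt)
    return (F.l1_loss(sobel_edges(b_sub), edges_gt)
            + F.l1_loss(sobel_edges(b_out), edges_gt))
```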
The total loss is then the weighted sum of the above losses:\begin{equation*} \mathfrak {L}_{total}=\lambda _{r}\mathfrak {L}_{p}+\lambda _{p}\mathfrak {L}_{perc}+\lambda _{s}\mathfrak {L}_{style}+\lambda _{e}\mathfrak {L}_{edge},\tag{7}\end{equation*} where $\lambda_{r}$, $\lambda_{p}$, $\lambda_{s}$, and $\lambda_{e}$ are weights balancing the corresponding terms.
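The combination in Eq. (7) is then a one-liner; the weight values below are placeholders, not the values used in the paper.

```python
def total_loss(l_p, l_perc, l_style, l_edge,
               lam_r=1.0, lam_p=0.05, lam_s=100.0, lam_e=1.0):
    # Placeholder weights; the paper's values are not given in this excerpt.
    return lam_r * l_p + lam_p * l_perc + lam_s * l_style + lam_e * l_edge
```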
Experiments
In this section, we evaluate our proposed method on both synthetic and real-world rainy data and compare it with other state-of-the-art methods on these datasets.
A. Experiment Settings
1) Training Settings
We first describe the hyper-parameters used in our model. For each SEA block, the growth rate, i.e., the number of feature maps produced by each sub-convolutional layer [10] in the dense connection part, is set to 32, and the number of sub-convolutional layers in the dense part is 8 for the innermost 9 blocks and 4 for the remaining blocks. This setting follows the resolution and number of features at each level. Furthermore, the reduction ratio
2) Datasets
To evaluate the de-raining ability of our method, we utilize three synthetic datasets. The first is the Rain800 dataset [28], which includes 700 training images and 100 testing images. The second is the Rain200 dataset [25] (extended from Rain100), which contains two subsets: 1) a heavy rain set (Rain200H) synthesized with five types of streaks, and 2) a light rain set (Rain200L) synthesized with only one streak type. Each set contains 1,800 training images and 200 testing images. In the experiments, we train a model on the training set of Rain200H and evaluate it on the testing sets of both Rain200H and Rain200L. We exclude Rain200L from training because its rain streak patterns are included in Rain200H, which allows us to evaluate the generalization ability of the methods. The third dataset is the DID-MDN dataset, with one training set and two testing sets. The training set consists of 12,000 images synthesized by adding rain streaks of three densities (light, medium, heavy) to 4,000 rain-free images. The first testing set, denoted DID-Test1, is constructed in the same way as the training set and contains 1,200 images. The second, denoted DID-Test2, is obtained by randomly sampling 1,000 images from the synthetic dataset provided by Fu et al. [4] and is also used to test generalization capability. Since the proposed model predicts the rain-free image as one of its outputs, and to avoid overfitting caused by predicting the same rain-free image multiple times, we select the same number of training images with different backgrounds from the three density levels, building a new training set of 4,000 images for our experiments.
For the real-world evaluation, we use the real rainy images provided by Yang et al. [25] and Zhang et al. [28]. We also collect photos from the web, most of which are captured in street and city scenes and are therefore more consistent with the application scenario of the de-raining task.
3) Measurement and Comparison
We evaluate the de-raining methods with the commonly used peak signal-to-noise ratio (PSNR) [11] and Structural Similarity Index (SSIM) [24] metrics. For real images, we mainly present qualitative comparisons and a user study (see our supplementary material), due to the absence of ground truth. We compare our method with several state-of-the-art CNN-based methods, including DDN [4], JORDER [25], DID-MDN [27], and SCAN together with its recurrent version (RESCAN) [15].
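For reference, a minimal evaluation helper using scikit-image is sketched below; whether the metrics are computed on RGB or on the luminance channel is not specified in this excerpt, so the RGB variant shown here is an assumption (the channel_axis argument requires a recent scikit-image version).

```python
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate_pair(pred, gt):
    """PSNR/SSIM for one de-rained image against its ground truth.
    Assumes uint8 RGB arrays of shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=255)
    return psnr, ssim
```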
B. Evaluation on Synthetic Dataset
For a comprehensive evaluation, we train one model on each of the training sets mentioned above and test it on the corresponding testing sets. For a fair comparison, we fine-tune the competing models on the corresponding training sets for the same number of epochs as ours, except for JORDER, which only provides a model trained on Rain200H and no training details.
The quantitative PSNR and SSIM results are shown in Table 1. Our method performs better than all the other deep learning based methods. Although the latest RESCAN achieves the best results among previous methods on almost all synthesized datasets, it performs worse than DID-MDN in restoring structural information (SSIM) on DID-Test2, implying that it does not generalize well to unseen rain streaks. In contrast, our method outperforms previous methods on both DID-Test1 and DID-Test2. In addition, each rainy image is processed only once, without a recurrent structure (as used in RESCAN) or extra refinement networks (as used in DID-MDN).
Fig. 4 shows the visual results of all methods. The first image is chosen from the Rain200H testing set, the most difficult dataset since the original images are heavily corrupted. Both DID-MDN and RESCAN remove the rain streaks well and restore the color of the original image; however, their results contain distortions and unsmooth regions in the background and object details (best viewed zoomed in on the digital version). In contrast, our method performs well in both detail recovery and background smoothing. The second image is chosen from DID-Test2 to show the generalization of each model. On this image, most methods, including the recent RESCAN, cannot completely remove the rain streaks. Although no obvious white rain streaks remain in the result of DID-MDN, there are many misplacements and distortions around the rain streak shapes, which results in low PSNR and SSIM. Compared with the other methods, our model removes the unseen rain streaks and restores the structure and intensity of the original image well.
Fig. 4. Results and evaluations of each method on synthetic images. The first image is chosen from the Rain200H testing set; the second is chosen from DID-Test2.
C. Evaluation on Real-World Dataset
The ultimate goal of de-raining is application to real-world scenes, so we perform another evaluation on rainy images captured in the real world. For a fair comparison, for each method we select the model trained on Rain200H, since training on Rain200H further enhances the robustness of the network, as noted in [25]. Example results on real-world de-raining are shown in Fig. 5. Our method removes the rain streaks well without breaking the original structure. The result on the second image shows that our method even handles raindrop-like and watermark-like rain streaks, which the other methods fail to remove. To further evaluate the proposed method on real-world data, we conduct a user study in the supplementary material.
D. Ablation Study
In this section, we study the effectiveness of each term/module in our model. To better test the fitting and generalization ability of each module, we train and test on the DID-MDN dataset.
First, we validate the effectiveness of our main strategy of simultaneously estimating the rain-free image and the residual rain streak image. In this ablation study, we train three additional models as shown in Table 2. "Rain-Streak Only" refers to the model that only predicts the residual rain streak image (which is subtracted from the rainy image to obtain the rain-free background). "Rain-Free Only" refers to the model that only predicts the rain-free background. "w/o Task-dependent" refers to the model that predicts both outputs from the last feature maps without separating them into task-dependent features. In addition, we use the notation
From the results in Table 2, we can see that jointly predicting the two outputs without task-dependent features decreases performance compared with predicting only one output. This implies that simply adding an extra prediction task to the network does not benefit removal performance. However, when predicting the results with task-dependent feature maps, we obtain a significant improvement over the single-output models, even though the number of feature maps for each output is halved. This confirms our motivation that using the rain-free background as one of the outputs provides more information and enables better interaction between the different kinds of features.
Next, we compare the effectiveness of the element-wise attention, the perceptual and style losses, and the edge loss. The results are shown in Table 3. Each module and loss has a positive effect on the removal performance and generalization ability of the model. Notably, the proposed Separable Element-wise Attention (SEA) block significantly boosts performance.
E. User Study on Real Rainy Images
To further evaluate the effectiveness of the proposed method, we conduct a user study on 30 real rainy images. These images are collected to simulate the actual usage of a de-raining system: close-up shots, pedestrians, and buildings in heavy rain, as well as images with a black background and a strong white light source to simulate rainy scenes at night. We compare our method with DDN [4], JORDER [25], DID-MDN [27], and RESCAN [15]. We invite 30 participants and ask each to choose the de-rained result that is the most rain-free and natural. Results are shown in Table 4, where "Voted" is the total number of votes for the corresponding method and "Selected" is the number of images for which that method received the most votes. Our method obtains the most votes, with DID-MDN ranking second. This reveals that although RESCAN performs well on the training set, DID-MDN generalizes better to real-world scenes. In contrast, the proposed method performs the best on both synthetic and real-world scenes.
Conclusion
We propose a coupled rain streak and background estimation network with Separable Element-wise Attention modules, which addresses rain streak removal from two aspects. First, we delve into the joint estimation of the rain streaks and the rain-free background, bridging the two tasks with task-dependent features. Second, we present a Separable Element-wise Attention module to exploit the rain streak distribution in all layers of the network, realized by two sequential modules: channel attention and spatial attention. Existing convolutional blocks can incorporate such element-wise attention on the fly. Extensive experiments demonstrate that the proposed method achieves superior performance over state-of-the-art methods, both quantitatively and qualitatively. The proposed Separable Element-wise Attention is a general mechanism, which we believe to be effective in other vision tasks.