Introduction
Most digital cameras employ a single CCD/CMOS sensor to capture natural scenes. Since the sensor is a 2D array, a color filter array (CFA), most famously the Bayer pattern, is employed, which filters out two-thirds of the RGB information. Besides, due to the limitations of the imaging circuit, the remaining one-third of the information, i.e., the mosaic image, is always corrupted by various noise. Thus, recovering a high-quality color image is a highly ill-posed problem. Image denoising (DN) and demosaicing (DM) are the first crucial steps of the image signal processing (ISP) pipeline in most digital cameras, and their performance has a vital influence on the visual appearance and downstream applications of the final result [1].
Due to the modular design of the traditional ISP, DN and DM are handled independently and sequentially. However, this leads to error accumulation and sub-optimal recovery: either DN needs to handle the non-linear and diverse noise introduced by DM, or DM suffers from unreliable samples caused by DN. To solve this problem, researchers have recently focused on joint DN and DM (JDD) image restoration and shown its advantages [2], e.g., high performance and low computational complexity.
Since this restoration task is underdetermined, diverse image priors are required to assist the reconstruction. Traditional methods usually solve an optimization problem in an iterative manner with embedded hand-crafted priors, e.g., total variation [3] or nonlocal self-similarity [4], [5]. However, complex real-world data cannot be sufficiently characterized by hand-crafted priors, and a number of visually disturbing artifacts, e.g., checkerboard and moire patterns, still appear in challenging high-frequency regions [1].
Recently, instead of hand-crafted priors, deep learning methods [1], [6]–[8] automatically learn the desired prior with convolutional neural networks (CNNs). Most approaches [1], [6], [7] directly learn a mapping network between the noisy image, i.e., the mosaic image or the decomposed four-channel RGGB image, and the clean RGB image to exploit intra- and inter-channel correlations and complete the missing information. Considering the high sampling rate of the green channel, Liu et al. [8] additionally introduced a green channel recovery branch to guide RGB image restoration. Nevertheless, all these methods are trained on synthetic data [1], [9]–[12], where a CFA and Gaussian noise are utilized to synthesize the mosaic image. Due to the domain gap between synthetic and real mosaic images, these methods cannot generalize well to real raw data with complex noise.
Besides, most existing methods mainly consider the raw data properties for DM [8], i.e., the higher sampling rate of green information, but rarely consider the raw data characteristics for DN. As human eyes perceive green more sensitively than red and blue [13], the camera spectral sensitivity of green is designed to be larger than that of red and blue, which leads to higher intensity and signal-to-noise ratio (SNR) of the green channel, as shown in Figure 1. This means that, due to its high sampling rate and high SNR, the green channel is easier to recover not only for DM but also for DN.
In this work, we present a deep guided attention network (DGAN) for real image JDD, which respectively exploits the high SNR and high sampling rate of green information for DN and DM, as shown in Figure 2. The network architecture of DGAN is based on UNet [14], and involves a green channel guidance branch with multiple guided attention modules and a decomposition-and-combination learning strategy. Inspired by the guided filter, we design a guided attention module that operates in a local manner, adaptively generating attentive kernel weights for different spatial positions by modeling the interdependencies of the more complete green channel feature in the neighborhood. To ease the learning of the JDD network and fully exploit the data properties of the green channel, we decompose the JDD network into two sub-networks, where the former focuses on DN with high-SNR green channel guidance and the latter takes charge of DM with high-sampling-rate green channel guidance. The two sub-networks are first trained sequentially, and are then combined into a whole network for joint training to reduce error accumulation. Besides, to support JDD in the real world, we utilize an advanced pixel shift camera to collect a real raw dataset with paired clean full color RGB, noisy mosaic and clean mosaic images. The experimental results on the real JDD dataset show that the presented approach outperforms state-of-the-art methods, in terms of both quantitative metrics and qualitative visualization.
High SNR and sampling rate of the green channel. The green channel has twice the sampling rate of the red and blue channels. The camera spectral sensitivity of green is higher than that of red and blue, which leads to higher intensity and SNR. The second row shows the higher intensity and the third row shows the higher SNR of the green channel, respectively. Note that the Gb channel is similar to Gr.
The architecture of the proposed guided attention network, consisting of two sub-networks: DN and DM. Each sub-network employs UNet as the fundamental architecture and the residual block (RB) as the basic module. We utilize the green channel, with its high SNR and sampling rate, to respectively guide the denoising and demosaicing through multiple guided attention modules (GAMs), which employ the green channel feature at the corresponding depth as guidance information. The network architecture of the green channel branch is the same as that of the main branch, except with half the feature maps. Compared with the DN sub-network, the DM sub-network additionally utilizes pixel-shuffle layers to upsample the resolution.
Our main contributions are as follows:
We present a deep guided attention network for real image JDD that effectively exploits the green channel characteristics of high SNR and high sampling rate in raw data.
We propose a guided attention module to adaptively guide RGB image restoration with the information in the green channel recovery branch.
We collect a real raw JDD dataset with paired noisy mosaic, clean mosaic and clean full color RGB images, and utilize a decomposition-and-combination training strategy to make the trained network more practical for real data.
Related Work
The most closely related research on joint image denoising and demosaicing, and on guided image recovery, is reviewed in this section.
1. Joint Denoising and Demosaicing
The aim of image DM is to recover an RGB image from a mosaic image that is missing two-thirds of the color information. Various traditional methods [15]–[23] and deep learning methods [24] have been presented. Besides, since noise commonly exists in the real world, DM methods usually require the collaboration of DN methods [25], [26] to process the noisy mosaic image sequentially. Because of error accumulation, this leads to sub-optimal recovery. To solve this problem, researchers have recently paid more attention to JDD image restoration and shown its benefits, achieving higher performance and lower computational complexity [2].
Since image JDD is an extremely underdetermined problem, diverse image priors are required to assist the recovery. The conventional optimization methods [3]–[5] integrate hand-crafted priors into iterative optimization algorithms to restore the clean RGB image from the noisy mosaic image. Condat et al. [3] integrated the total variation prior into a primal-dual optimization algorithm. Heide et al. [4] presented an optimization method with a nonlocal prior to recover the color image. Tan et al. [5] employed the alternating direction method of multipliers (ADMM) with nonlocal and total variation priors to recover the color image.
Alternatively, deep learning methods [1], [6]–[8], [27] employ advanced convolutional neural networks to learn the desired prior for the JDD task. Gharbi et al. [1] recovered the color image from a noisy mosaic image with a deep convolutional neural network. Tan et al. [6] employed a convolutional neural network to refine a color image initialized via bilinear interpolation. Kokkinos et al. [7] integrated a residual DN network into an unrolled majorization-minimization method for color image recovery. Xing et al. [27] discussed the effect of the DN and DM processing order, and presented an end-to-end network for image JDD. Liu et al. [8] introduced additional green channel and density map guidance to design a self-guidance network for color image recovery.
The conventional methods utilize hand-crafted priors that are often limited to linear characteristics and cannot sufficiently exploit image nonlinearity. The deep learning methods directly learn the implicit mapping function from noisy mosaic to clean RGB images, but do not well consider the high SNR and high sampling rate of the green channel. In this work, we present a deep learning method that exploits a deep prior with attentive green channel guidance for image JDD.
2. Guided Image Recovery
Guided image recovery utilizes auxiliary priors to assist image restoration. The guided filter [28], [29] is a well-known method that employs an additional image as guidance to generate filter weights, and has been successful in many image recovery tasks, e.g., image demosaicing [30]. Recently, deep learning methods [31]–[40] have employed various kinds of auxiliary information to guide image recovery, particularly for super-resolution. Some methods [31], [32], [35] utilized the RGB image as auxiliary knowledge to guide the super-resolution of depth or hyperspectral images. Wang et al. [36] super-resolved images with semantic information guidance. Zou et al. [34] super-resolved images with cross-scale stereo information guidance.
Besides, a self-guidance network [41] was presented for image DN, which employed low resolution features to enhance the high resolution features. Liu et al. [8] utilized the high sampling rate of the green channel and designed a green channel sub-network to guide image JDD, where the guidance information is fused with the main information at the end of the branches. Inspired by the guided filter, we propose a guided attention module to adaptively fuse the main information with the guidance information. In addition, to fully exploit the guidance information in the green channel, we insert the guided attention module into the network at different depths.
Guided Attention Network
Firstly, we formulate the problem of joint image DN and DM, and introduce the motivation of DGAN. Then, we describe the guided attention module, which adaptively guides RGB image recovery with information from the green channel recovery branch. Finally, the architecture of DGAN and the corresponding decomposition-and-combination training strategy are described. The decomposed sub-networks can effectively exploit the high SNR and high sampling rate properties of the raw data for DN and DM, respectively.
1. Formulation and Motivation
JDD aims to recover a clean full color image $X$ from a noisy mosaic image \begin{equation*}Z=Y+n=\mathcal{M}(X)+n,\tag{1}\end{equation*} where $\mathcal{M}(\cdot)$ denotes the CFA sampling (mosaicing) operator, $Y=\mathcal{M}(X)$ is the clean mosaic image, and $n$ is the noise.
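As a concrete illustration, the image formation model in Eq. (1) can be simulated as below. This is a minimal sketch that assumes an RGGB Bayer pattern and simple additive Gaussian noise; real raw noise is considerably more complex, which is precisely the motivation for the real dataset introduced later.

```python
import numpy as np

def bayer_mosaic(rgb):
    """Apply an RGGB Bayer CFA to an H x W x 3 image: keep one color
    sample per pixel, i.e., Y = M(X) from Eq. (1)."""
    h, w, _ = rgb.shape
    mosaic = np.zeros((h, w), dtype=rgb.dtype)
    mosaic[0::2, 0::2] = rgb[0::2, 0::2, 0]  # R
    mosaic[0::2, 1::2] = rgb[0::2, 1::2, 1]  # G (Gr)
    mosaic[1::2, 0::2] = rgb[1::2, 0::2, 1]  # G (Gb)
    mosaic[1::2, 1::2] = rgb[1::2, 1::2, 2]  # B
    return mosaic

# Z = M(X) + n, with Gaussian n as a crude stand-in for real noise.
rng = np.random.default_rng(0)
X = rng.random((4, 4, 3)).astype(np.float32)
Y = bayer_mosaic(X)
Z = Y + rng.normal(0, 0.01, Y.shape).astype(np.float32)
```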
Numerous studies show that human eyes are more sensitive to green than to red and blue [13]. Therefore, the CFA in modern digital cameras, e.g., the Bayer pattern, is designed with a higher sampling rate and spectral sensitivity for green than for the other channels, as shown in Figure 1. The higher spectral sensitivity causes a higher intensity of the captured green channel; since most noise in acquisition is signal-independent, this leads to a higher SNR of the green channel. Accordingly, the green channel is easier to recover not only for DM but also for DN.
In this paper, we first employ two networks with green channel guidance to respectively deal with DN and DM, and then fine-tune them in a joint manner. Concretely, we present a guided attention module, in which the attention map is generated from the guidance information of the green channel and is adaptive for each spatial position. We plug the guided attention module into the network to recover color information with progressive green channel guidance.
2. Guided Attention Module
Before introducing the guided attention module, we first review the guided filter [28], which has been widely used in image DN [42] and DM [30]. The guided filter is a translation-variant filter involving a guidance image $G$, an input image $I$ and an output image $O$: \begin{equation*}O_{i}= \sum\limits_{j\in \mathcal{N}(i)}W_{ij}(G)I_{j},\tag{2}\end{equation*} where $\mathcal{N}(i)$ denotes the neighborhood of position $i$ and the filter kernel $W_{ij}$ is computed from the guidance image $G$.
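For reference, the local linear kernel $W_{ij}(G)$ of Eq. (2) can be realized as in He et al.'s guided filter [28]. The following is our own minimal NumPy sketch (the `box_mean` helper and function names are illustrative, not the original implementation):

```python
import numpy as np

def box_mean(x, r):
    """Mean over a (2r+1)^2 window, with edge replication at borders."""
    pad = np.pad(x, r, mode='edge')
    out = np.zeros_like(x, dtype=np.float64)
    k = 2 * r + 1
    for dy in range(k):
        for dx in range(k):
            out += pad[dy:dy + x.shape[0], dx:dx + x.shape[1]]
    return out / (k * k)

def guided_filter(I, G, r=1, eps=1e-4):
    """Guided filter: fit a local linear model O = a*G + b per window,
    then average the coefficients over overlapping windows."""
    mG, mI = box_mean(G, r), box_mean(I, r)
    cov = box_mean(G * I, r) - mG * mI   # local covariance of G and I
    var = box_mean(G * G, r) - mG * mG   # local variance of G
    a = cov / (var + eps)
    b = mI - a * mG
    return box_mean(a, r) * G + box_mean(b, r)
```

Note the edge-preserving behavior comes entirely from the guidance image $G$: where $G$ has strong local variance, the filter follows $G$; where $G$ is flat, it reduces to a box average of $I$.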
Inspired by the guided filter, we present a guided attention module, in which the attention map is generated from the correlation between the features of the guidance information at the current position and in its neighborhood, as shown in Figure 3. We first employ two convolutional layers to respectively embed the guidance feature $G$ and the input feature $I$ into $G^{\prime}$ and $I^{\prime}$. The raw attention score between position $i$ and its neighbor $j$ is the inner product \begin{equation*}A_{ij}^{\prime}=G_{i}^{\prime T}G_{j}^{\prime}.\tag{3}\end{equation*}
The scores are normalized over the neighborhood with a softmax function $\sigma$, \begin{equation*}A_{i}=\sigma(A_{i}^{\prime}),\tag{4}\end{equation*} i.e., \begin{equation*}A_{ij}= \frac{e^{A_{ij}^{\prime}}}{\sum\nolimits_{k\in \mathcal{N}(i)}e^{A_{ik}^{\prime}}}.\tag{5}\end{equation*}
The guided attention module.
The guided attention map $A$ is then utilized to aggregate the embedded input features within the neighborhood: \begin{equation*}O_{i}^{\prime}=\sum\limits_{j\in \mathcal{N}(i)}A_{ij}I_{j}^{\prime}.\tag{6}\end{equation*}
Finally, we employ a convolutional layer $f_{O}$ to project the aggregated feature, with a residual connection from the input: \begin{equation*}O=f_{O}(O^{\prime})+I.\tag{7}\end{equation*}
Compared with the popular self-attention [43], the guided attention map is calculated from green channel information. Due to its high SNR and high sampling rate, the information of the green channel is more complete than that of the other channels, which leads to a more accurate attention map. Besides, restricting the attention to a local neighborhood keeps the computation tractable compared with global self-attention.
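The computation of Eqs. (3)–(7) can be sketched as below. For brevity, this sketch assumes the embeddings $G^{\prime}$ and $I^{\prime}$ have already been produced by the convolutional layers, and replaces the output projection $f_{O}$ with the identity:

```python
import numpy as np

def guided_attention(Gp, Ip, I, r=1):
    """Local guided attention over a (2r+1)^2 neighborhood.
    Gp: embedded guidance G' (C_g, H, W); Ip: embedded input I' (C, H, W);
    I: input feature for the residual connection (C, H, W)."""
    _, H, W = Gp.shape
    Gpad = np.pad(Gp, ((0, 0), (r, r), (r, r)), mode='edge')
    Ipad = np.pad(Ip, ((0, 0), (r, r), (r, r)), mode='edge')
    k = 2 * r + 1
    # Eq. (3): A'_ij = G'_i . G'_j for every neighbor offset j
    scores = np.stack([(Gp * Gpad[:, dy:dy + H, dx:dx + W]).sum(0)
                       for dy in range(k) for dx in range(k)])   # (k*k, H, W)
    scores -= scores.max(0, keepdims=True)                       # numerical stability
    # Eqs. (4)-(5): softmax over the neighborhood
    A = np.exp(scores) / np.exp(scores).sum(0, keepdims=True)
    # Eq. (6): O'_i = sum_j A_ij * I'_j
    neigh = np.stack([Ipad[:, dy:dy + H, dx:dx + W]
                      for dy in range(k) for dx in range(k)])     # (k*k, C, H, W)
    Op = (A[:, None] * neigh).sum(0)
    # Eq. (7), with f_O taken as identity in this sketch
    return Op + I
```

With a uniform guidance feature, the attention collapses to a plain box average of $I^{\prime}$, which makes the role of $G^{\prime}$ explicit: it is the guidance correlations alone that shape the spatially varying kernel.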
3. Network Architecture
To ease the network training, we decompose the JDD network into two sub-networks, DN and DM, as shown in Figure 2.
These two sub-networks have almost the same architecture; the main difference is that the DN sub-network employs a convolutional layer to output the denoised data, while the DM sub-network additionally employs a pixel-shuffle layer to upsample the spatial resolution. As numerous studies [12], [27] show that applying DN first and DM later outperforms the opposite order, we first employ the DN sub-network and feed its output through the DM sub-network to obtain the clean full color RGB image.
Each sub-network consists of a green channel guidance branch and a main branch. Both branches are based on the same representative UNet [14] architecture, and the number of feature maps in the green channel guidance branch is half of that in the main branch. Each branch has 4 encoder steps and 4 corresponding decoder steps. After each encoder step, a convolutional layer downsamples the feature maps.
The existing method [8] only employs the output feature of the green channel branch once, to guide information recovery at the end of the main branch. To fully exploit the guidance information, we utilize green channel features to guide the main branch restoration multiple times. Specifically, we insert the guided attention module after each residual block, so that the main branch is guided by the features at the corresponding depth.
4. Learning Strategy
As we decompose the JDD network into two sub-networks, DN and DM, we present a decomposition-and-combination learning strategy to train the networks and obtain higher recovery accuracy. Our learning strategy can be divided into three steps: decomposed DN training, decomposed DM training, and combined DN and DM training.
Firstly, we train the DN sub-network with paired clean and noisy mosaic images. Following previous works [8], we decompose the mosaic image into four RGGB channels, which we denote by $\tau(\cdot)$. The $\ell_{1}$ loss is employed: \begin{equation*}\mathcal{L}_{DN}(\theta_{DN})=\Vert \tau(Y)-f_{DN}(\tau(Z);\theta_{DN})\Vert_{1},\tag{8}\end{equation*} where $f_{DN}$ denotes the DN sub-network with parameters $\theta_{DN}$.
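The decomposition $\tau(\cdot)$ and the $\ell_{1}$ loss of Eq. (8) are straightforward; a sketch, assuming an RGGB Bayer layout:

```python
import numpy as np

def tau(mosaic):
    """Decompose an H x W RGGB Bayer mosaic into a 4 x H/2 x W/2
    tensor with one channel per CFA position (R, Gr, Gb, B)."""
    return np.stack([mosaic[0::2, 0::2],   # R
                     mosaic[0::2, 1::2],   # Gr
                     mosaic[1::2, 0::2],   # Gb
                     mosaic[1::2, 1::2]])  # B

def l1_loss(pred, target):
    """The l1 recovery loss used in Eqs. (8)-(10)."""
    return np.abs(pred - target).mean()
```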
Secondly, we fix the parameters $\theta_{DN}$ and train the DM sub-network with the $\ell_{1}$ loss \begin{equation*}\mathcal{L}_{DM}(\theta_{DM})=\Vert X-f_{DM}(\tau(\hat{Y});\theta_{DM})\Vert_{1},\tag{9}\end{equation*} where $\hat{Y}$ denotes the denoised mosaic image produced by the fixed DN sub-network, and $f_{DM}$ is the DM sub-network with parameters $\theta_{DM}$.
Thirdly, we combine the DN and DM sub-networks, and train them in a joint manner. Given the networks trained in previous steps, we fine-tune them jointly, which can be represented as
\begin{equation*}\mathcal{L}_{J}(\theta_{DN},\theta_{DM})=\Vert X-f_{DM}(f_{DN}(\tau(Z);\theta_{DN});\theta_{DM})\Vert_{1}\tag{10}\end{equation*}
Apart from the RGB recovery loss for the main branch, we add a corresponding green channel recovery loss to equations (8)–(10) with a balance parameter $\lambda$: \begin{equation*}\mathcal{L}=\mathcal{L}_{M}+\lambda \mathcal{L}_{G},\tag{11}\end{equation*} where $\mathcal{L}_{M}$ denotes the main branch loss and $\mathcal{L}_{G}$ the green channel branch loss.
Paired Real Raw JDD Dataset
Existing deep learning JDD methods need to be trained on datasets [1], [9]–[12], which have several problems. The sRGB datasets [1], [9], [10] are nonlinearly processed and demosaiced by existing DM algorithms, which mismatches the linear working space of DM approaches and introduces undesirable artifacts. The linear RGB datasets [11] are generated from raw mosaic images, which might alter the characteristics of the signal. Recently, Qian et al. [12] captured linear full color RGB images with an advanced pixel shift device. However, these datasets only include clean RGB images, and still synthesize the noisy mosaic image with a CFA and Gaussian noise. This introduces a domain gap between the synthetic images and real images with complex noise, which limits the application of the learned JDD algorithms to real data.
To support this study, we utilize a camera with pixel shift technique to collect a paired real dataset, including noisy mosaic, clean mosaic and clean RGB images. To capture a full color RGB image, the pixel shift camera takes four mosaic images, as shown in Figure 4. Each mosaic image is captured with a horizontal and/or vertical sensor movement; after the four captures, the camera has fully sampled the color information at each pixel. Thus, in each full color image capture, we obtain four pixel-shifted mosaic images and one full color RGB image. After capturing the clean full color RGB image, the noisy mosaic image still needs to be captured. Following the work in [44], we fix the imaging settings and reduce the exposure time to collect the noisy mosaic image. In this way, noisy/clean mosaic and noisy/clean full color RGB images can be captured in a paired manner.
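The four-shot acquisition can be simulated as below. This is a sketch where the sensor shift is modeled as shifting the CFA pattern over a static scene (the function names and the simulation itself are our own illustration, not the camera firmware):

```python
import numpy as np

def shifted_mosaics(rgb):
    """Simulate pixel shift capturing: four RGGB mosaics of the same
    static scene with the CFA shifted by (0,0), (0,1), (1,0), (1,1),
    so every pixel is eventually sampled through R, G (twice) and B."""
    h, w, _ = rgb.shape
    cfa = np.array([[0, 1], [1, 2]])  # RGGB: 0=R, 1=G, 2=B
    shots = []
    for sy, sx in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        chan = cfa[(np.arange(h)[:, None] + sy) % 2,
                   (np.arange(w)[None, :] + sx) % 2]
        shots.append((chan, np.take_along_axis(rgb, chan[..., None], 2)[..., 0]))
    return shots

def combine(shots):
    """Merge the four shots into a full color RGB image; the two green
    samples per pixel are averaged."""
    h, w = shots[0][1].shape
    acc, cnt = np.zeros((h, w, 3)), np.zeros((h, w, 3))
    for chan, mosaic in shots:
        for c in range(3):
            mask = chan == c
            acc[..., c] += mosaic * mask
            cnt[..., c] += mask
    return acc / cnt
```

Because each pixel passes under all four CFA cells across the four shots, the merge requires no demosaicing interpolation at all, which is exactly why pixel shift yields artifact-free ground-truth RGB.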
For dataset capturing, we utilize an advanced pixel shift camera, the Sony A7R4. We mount the camera on a sturdy tripod and remotely control it via software. We first adjust the focus, aperture, exposure time and other camera settings to improve the definition of the clean mosaic and full color RGB images. Then, the exposure time is reduced by a factor to collect the noisy images. Due to the multiple acquisitions of the same scene, we strictly ensure that the scenes in the dataset are static. After capturing, there are 100 outdoor and indoor scenes in our dataset.
The working principle of the pixel shift camera. In each capture, the camera sensor takes four shots while physically moving in the horizontal and vertical dimensions. Then, these mosaic images are integrated to obtain a full color image.
Visual quality comparison on two representative scenes from the real JDD dataset. The input noisy image and the restored results of FlexISP, ADMM, DeepJoint and DeepUnfold are shown in the first row, and the recovered results of SGNet, JDDS, DGAN and the ground truth are shown in the second row.
Experiments
In this section, we first describe the experimental settings, such as implementation details and metrics for quantitative investigation. Then, the proposed approach is compared with several advanced approaches on the collected real raw JDD dataset. Finally, we discuss the effectiveness of different network modules and learning strategies.
1. Settings
The window size
Two evaluation metrics, i.e., the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM), are utilized to investigate the performance of all algorithms. Higher PSNR and SSIM values indicate higher image quality.
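PSNR, for instance, reduces to a few lines; a sketch for images normalized to $[0, 1]$:

```python
import numpy as np

def psnr(x, y, peak=1.0):
    """Peak Signal-to-Noise Ratio in dB between a restored image x
    and the reference y, for intensities in [0, peak]."""
    mse = np.mean((x - y) ** 2)
    return 10 * np.log10(peak ** 2 / mse)
```

For example, a uniform error of 0.1 on a [0, 1] image gives an MSE of 0.01 and hence a PSNR of 20 dB.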
2. Evaluation on Real JDD Dataset
Quantitative Results
Table 1 provides the averaged recovery results for different situations on the real JDD dataset, quantitatively comparing the performance of FlexISP, ADMM, DeepJoint, DeepUnfold, SGNet, JDDS and DGAN. We highlight the best results for each metric in bold. The proposed approach performs better than the previous algorithms under both metrics. Specifically, the deep learning methods exhibit remarkably higher accuracy than the traditional methods based on hand-crafted priors, which demonstrates the superior prior modeling capability of deep networks. Compared with the other deep learning methods, the proposed method fully exploits the data properties of the green channel, i.e., its high sampling rate and high SNR, and achieves better performance. This reveals the effectiveness of our deep guided attention network.
Perceptual Quality
For visualization, we show two typical recovered scenes in Figure 5. The input noisy image and the restored results of FlexISP, ADMM, DeepJoint and DeepUnfold are shown in the first row, and the recovered results of SGNet, JDDS, DGAN and the ground truth are shown in the second row. Noise is clearly visible in the results of FlexISP and ADMM, which indicates that hand-crafted priors are insufficient for real image JDD. Our method produces visually pleasant results with fewer artifacts and sharper edges than the other methods, which is consistent with the quantitative results.
Computational Complexity
The efficiency of all deep learning methods is also quantitatively evaluated with two metrics, i.e., the number of parameters and floating-point operations (FLOPs). We show the related results in Table 2. Note that FLOPs are calculated for restoring a single image at a fixed resolution.
3. Discussion
Here, we discuss the effect of different guidance modules, different learning strategies, and different upsampling layers.
The Effect of Different Guidance Modules
To verify the effectiveness of the green channel guidance with multiple guided attention modules, we compare it with a single attention guidance, multiple concatenation guidances, a single concatenation guidance, and no guidance. The results are provided in Table 3, with the best results highlighted in bold. Specifically, the network with green channel guidance outperforms that without guidance, which verifies the effectiveness of the green channel guidance. Further, the gains of our method with multiple guidances over that with a single guidance demonstrate that multiple guidances can fully exploit the data properties of the green channel. Last but not least, our method with attention guidance is considerably better than that with concatenation guidance, which reveals the effectiveness of our guided attention module that adaptively fuses the guidance information into the main branch.
The Effect of Different Learning Strategies
To evaluate the effectiveness of the decomposition-and-combination learning strategy, we compare it with directly training the combined network from scratch.
The Effect of Different Upsampling Layers
To upsample the spatial resolution, there are mainly three operations for deep neural networks: interpolation, deconvolution and pixel-shuffle. Inspired by advanced image super-resolution methods [47], we employ a pixel-shuffle layer to upsample the spatial resolution. To evaluate the effectiveness of the pixel-shuffle upsampling layer, we compare it with interpolation and deconvolution. We provide the results in Table 5, with the best results highlighted. Pixel-shuffle significantly outperforms the other upsampling layers, especially interpolation. This verifies the superiority of the pixel-shuffle layer in the DM sub-network.
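For clarity, the pixel-shuffle (depth-to-space) operation can be sketched in NumPy as below, following the common convention in which a $(C\,r^{2}, H, W)$ tensor is rearranged into $(C, Hr, Wr)$:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Depth-to-space: rearrange a (C*r^2, H, W) tensor into
    (C, H*r, W*r). Channel c*r^2 + i*r + j at spatial (h, w) lands
    at output position (c, h*r + i, w*r + j)."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)  # interleave: (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)
```

Unlike interpolation, every output pixel is produced from learned channels rather than from a fixed resampling kernel, which is the usual explanation for its advantage in super-resolution-style upsampling.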
Conclusion
In this paper, we presented a novel guided attention network for real image JDD, which exploits the high SNR and high sampling rate of green information to guide DN and DM, respectively. The designed guided attention module can adaptively guide full color RGB image recovery, and fully exploits the green channel guidance by being applied multiple times in the DGAN. To ease the training, we employ a decomposition-and-combination learning strategy. Besides, we utilize a pixel shift camera to collect a paired real JDD dataset containing clean RGB, clean mosaic and noisy mosaic images, giving the learned network better generalization to real data. Comprehensive experimental results indicate that our method performs better than existing state-of-the-art algorithms in terms of both quantitative metrics and visual quality. In the future, we will collect more suitable data and expand the paired real JDD dataset.
ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (Grant Nos. 62171038, 61827901, and 62088101).