Introduction
Perceptual image enhancement (PIE) is a task that restores low-quality images captured by low-end devices with the guidance of high-quality DSLR images [7]. It extends the conventional image enhancement task by pursuing not only more pleasant visual effects but also more realistic details. Perceptual image enhancement is of broad interest for many applications, such as mobile photography, robot sensing, and autonomous driving.
DPED [7] addresses this issue by collecting a novel dataset that uses DSLR-quality images as targets for more precise learning of elaborate image details, and it incorporates a ResNet as the backbone of its enhancement module. Still, DPED suffers from several problems. The first is misalignment. Since the high-quality reference images are not perfectly aligned with the low-quality inputs, the misalignment of input image pairs can confuse PIE models when a pixel-wise MSE loss is used [15], [25]. On the other hand, although the absolute image positions are misaligned, the high- and low-quality images of a pair are still relatively aligned, which leaves room for generating more realistic details through more effective representation learning. DPED relieves the misalignment problem by combining a series of loss functions with adversarial learning. Nevertheless, typical discriminative networks tend to focus on the similarity of style characteristics rather than region- and pixel-level similarity, and the relationship between low- and high-quality images has not been fully explored. Hence, there is still large room for improvement over generic image enhancement pipelines [7], [22].
Various CNN architectures have been used in image processing tasks [7], [15], [21], [25]. MemNet [25] uses a dense network with a memory mechanism for image restoration. EDSR [15] employs a large ResNet to achieve superior performance, and DPED [7] also uses a ResNet for PIE. However, the receptive field of a ResNet is insufficient when processing large photos (e.g., 2K or 4K). Another alternative backbone is U-Net [21]. MWCNN [16] adopts a wavelet-like U-Net and demonstrates superior performance with less computational cost. Nevertheless, typical U-Net structures only pass features of the same resolution between the encoder and the decoder, which limits the information flow within the network and the utilization of coarse-level features. In addition, the convolutional layers and downsampling operations obstruct the flow of information. Thus, the generalization ability of the generative module could benefit from a more flexible network structure that allows information to flow from low-level encoder features to high-level decoder features.
Motivated by the above observations, we present a framework that freely circulates coarse-level information across different resolutions and concentrates on the similarity of the overall style characteristics as well as local region statistics. Our approach relies on two networks: a multi-level connected generator that makes full use of cross-scale feature representations, and a pair-wise relativistic discriminator that fully explores contextual information at both holistic and region levels. In addition, we incorporate auxiliary losses to generate more realistic details and sharper textures.
The main contributions of this paper can be summarized as follows.
We propose a novel multi-level generator for image enhancement, which fully utilizes the features from all resolutions and has different abstraction levels in the expansive path.
We present a pair-wise relativistic discriminator (PRaD) for adversarial training. The triplet relationship not only makes the discriminator aware of the degree of enhancement but also incorporates pixel-level contextual information into discriminative learning.
Experiments demonstrate that our model achieves superior empirical performance with lower computational complexity.
Related Work
Perceptual image enhancement is a new problem that aims at restoring a high-quality image from a low-quality observation. Recently, CNN-based approaches have achieved great success in image processing tasks such as image restoration and enhancement. MemNet [25] uses a dense network to establish the relationship between corrupted and clear image pairs. Furthermore, EDSR [26] advances the network capacity by incorporating more efficient residual blocks. However, these pipelines can only handle fully aligned image pairs (e.g., corrupted images obtained from clear images by some downsampling operator). In addition, many researchers have attempted to introduce adversarial learning into image enhancement. Yan et al. [31] realize automatic photo adjustment with a deep neural network. Wang et al. [28] incorporate auxiliary semantic information to help generate realistic details. EnhanceNet [22] employs adversarial training and a texture loss to produce realistic images. However, none of them study image enhancement for image pairs without full alignment. DPED [7] combines a series of loss functions to achieve impressive quality enhancement between relatively aligned image pairs. RaGAN [9] proposes an efficient learning strategy by introducing a relativistic discriminator. Image enhancement for image pairs without full alignment using accurate learning metrics still remains an open challenge, with great potential to facilitate the development of image restoration and enhancement.
Various feature aggregation methods have been explored in semantic segmentation. U-Net [21] proposes an efficient structure that concatenates features of the same resolution between the encoder and the corresponding decoder. Hypercolumns [5] shows that leveraging multi-level features jointly can boost performance; however, it extracts and merges features at different levels directly, so the interactions between high-level and low-level features cannot be exploited efficiently. RCF [17] develops an effective edge detection framework by exploiting a hierarchical feature fusion strategy. More recently, RCAN [34] addresses feature attention across channels in image super-resolution by adding attention layers that select useful channels, and LapSRN [12] adopts a Laplacian pyramid feature representation for image super-resolution. However, such multi-scale feature aggregation methods have been neglected in image enhancement. For image enhancement, colors and brightness in local regions and in the global image are not uniform, which makes automatic adjustment difficult. Moreover, the under-aligned training pairs place a high demand on the analysis of local and non-local contextual information. To this end, aggregating features of different resolutions is meaningful for image enhancement.
Although cross-level feature representations have been exploited and have significantly facilitated many computer vision fields, there is still large room for improvement over general CNN architectures, especially for image enhancement. To investigate this strategy in image enhancement, we propose a light-weight model, which is described in the next section.
Methodology
A. Framework Overview
Let $I_{s}$ denote a low-quality source image and $I_{t}$ the corresponding high-quality DSLR target. Our enhancement network $F_{\mathbf{W}}$ with parameters $\mathbf{W}$ maps the source image to an enhanced image \begin{equation*} I_{e} = F_{\mathbf {W}}(I_{s}),\tag{1}\end{equation*} which is expected to be as close to $I_{t}$ as possible.
B. Multi-Level Connected Generator
As shown in Fig. 2, our generator inherits the structure of U-Net [21], which consists of a contracting path (encoder) and an expansive path (decoder). U-Net concatenates features of the same scale between the encoder and decoder, but this strategy tends to limit the enhancement ability of the network because it ignores the richness of features at other scales. Therefore, in our generator, we propose a cross-scale feature aggregation (CFA) layer that fully exploits cross-scale feature learning for perceptual image enhancement, aggregating the information flow of different scales from the contracting path into the expansive path.
A demonstration of perceptual image enhancement. Owing to the limitations of devices and illumination, captured images often show poor quality. Perceptual image enhancement provides a feasible solution to this issue.
Overview of the generative network. Green, blue, and orange mark feature representations of different resolutions. A series of cross-scale connections (e.g., downsampling and upsampling) is established to pass comprehensive messages between the encoder and the decoder. Hence, each scale of the decoder receives contextual cross-scale features and abstracts them into more discriminative representations.
Specifically, our generator processes features at multiple resolution levels. In the encoder, the feature map at level $n$ is computed as \begin{equation*} y_{enc}^{n} = \Downarrow C(C(y_{enc}^{n - 1})),\tag{2}\end{equation*} where $C(\cdot)$ denotes a convolutional layer and $\Downarrow$ denotes downsampling.
In the decoder, we gradually up-sample the feature maps as \begin{equation*} y_{dec}^{n} = C({CFA_{n}}(\Uparrow y_{dec}^{n + 1})),\tag{3}\end{equation*} where $\Uparrow$ denotes upsampling and $CFA_{n}$ is the cross-scale feature aggregation layer at level $n$.
Next we elaborate the details of the proposed CFA layer. The CFA layer introduces the cross-scale features of the encoder into the decoder as \begin{equation*} {CFA_{n}(\cdot)} = C(Concat(R(\Uparrow y_{dec}^{n + 1}), F_{n})),\tag{4}\end{equation*} where $Concat(\cdot)$ denotes channel-wise concatenation, $R(\cdot)$ is the feature refinement operation, and $F_{n}$ denotes the aggregated cross-scale features from the encoder at level $n$.
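For concreteness, the following is a minimal PyTorch-style sketch of how such a CFA layer could be realized; the channel widths, the bilinear resizing, and the composition of the refinement block are illustrative assumptions rather than the exact configuration of our network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CFALayer(nn.Module):
    """Cross-scale feature aggregation, cf. Eq. (4) (illustrative sketch).

    Encoder features from every scale are resized to the current decoder
    resolution, concatenated with the refined upsampled decoder feature,
    and fused by a convolution. Channel sizes are placeholders.
    """

    def __init__(self, dec_channels, enc_channels_list, out_channels):
        super().__init__()
        # R(.): a light refinement block for the upsampled decoder feature
        self.refine = nn.Sequential(
            nn.Conv2d(dec_channels, dec_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        fused_in = dec_channels + sum(enc_channels_list)
        # C(.): fuse the concatenated cross-scale features
        self.fuse = nn.Conv2d(fused_in, out_channels, 3, padding=1)

    def forward(self, dec_up, enc_feats):
        # dec_up: decoder feature already upsampled to this level
        # enc_feats: list of encoder features from all scales
        h, w = dec_up.shape[-2:]
        resized = [F.interpolate(f, size=(h, w), mode='bilinear',
                                 align_corners=False) for f in enc_feats]
        x = torch.cat([self.refine(dec_up)] + resized, dim=1)
        return self.fuse(x)
```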
C. Discriminator and Loss Functions
1) Pair-Wise Relativistic Discriminator
In this section, we first describe the pair-wise relativistic discriminator, termed PRaD. PRaD is motivated by the relativistic discriminator (RaD) [9], which strengthens the generalization ability of the generator. Our PRaD is designed to address the following two problems: 1) whereas a general GAN [4] focuses on overall similarity, PRaD enforces similarity at the pixel level; 2) whereas other discriminators simply classify an input image as fake or real, PRaD estimates the probability that the given real data is more realistic than the fake data on average.
Hence, the real/fake pair for PRaD is:\begin{align*} I_{r}=&Concat[I_{s}, I_{t}, \varphi (I_{t})] \\ I_{f}=&Concat[I_{s}, F_{\mathbf {W}}(I_{s}), \varphi (F_{\mathbf {W}}(I_{s}))]\tag{5}\end{align*}
\begin{equation*} D_{PRaD}(I_{r},I_{f}) = \sigma (C(I_{r}) - E_{I_{f}}[C(I_{f})]),\tag{6}\end{equation*} where $\sigma(\cdot)$ is the sigmoid function and $C(\cdot)$ denotes the non-transformed discriminator output.
\begin{align*} L_{PRaD}=&-E_{I_{r}}[log(D_{PRaD}(I_{r},I_{f}))] \\&- E_{I_{f}}[log(1-D_{PRaD}(I_{f},I_{r}))].\tag{7}\end{align*}
Our discriminator consists of five convolutional layers each followed by a LeakyReLU activation and batch normalization. The first convolutional layer has
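A minimal sketch of such a discriminator is given below; the channel widths, kernel sizes, strides, and the unnormalized final scoring layer are placeholders chosen for illustration, not the exact configuration used in our experiments.

```python
import torch.nn as nn


def conv_block(in_ch, out_ch, stride):
    # Conv -> BN -> LeakyReLU, as described in the text
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )


class Critic(nn.Module):
    """Five-layer convolutional critic producing raw scores C(x).

    The input is the concatenated triplet from Eq. (5); channel widths
    (64..512) and the plain final layer are assumptions.
    """

    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_channels, 64, 2),
            conv_block(64, 128, 2),
            conv_block(128, 256, 2),
            conv_block(256, 512, 1),
        )
        self.score = nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1)

    def forward(self, x):
        return self.score(self.features(x))
```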
2) Loss Functions:
a) Adversarial Loss
Following the PRaD, we first introduce the adversarial loss. As illustrated in Sec. III-C.1, the adversarial loss for our generator is defined in a symmetrical form:\begin{align*} L_{GAN}(F_{\mathbf {W}}(I_{s}),I_{t})=&L_{G}^{PRaD} \\=&-E_{I_{r}}[log(1 - D_{PRaD}(I_{r},I_{f}))] \\&- E_{I_{f}}[log(D_{PRaD}(I_{f},I_{r}))].\tag{8}\end{align*}
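As a concrete illustration, the objectives of Eqs. (7) and (8) can be computed as in the following PyTorch-style sketch, where `critic` stands for the discriminator up to (but not including) the sigmoid and `real`/`fake` are the triplets of Eq. (5); the function names are ours, and the binary-cross-entropy-with-logits form is a mathematically equivalent way of writing the log terms.

```python
import torch
import torch.nn.functional as F


def prad_logits(critic, real, fake):
    """Relativistic average logits following Eq. (6); critic(x) returns C(x)."""
    c_real = critic(real)
    c_fake = critic(fake)
    d_rf = c_real - c_fake.mean()   # D(I_r, I_f) before the sigmoid
    d_fr = c_fake - c_real.mean()   # D(I_f, I_r) before the sigmoid
    return d_rf, d_fr


def discriminator_loss(critic, real, fake):
    # Eq. (7): real should look "more real than fake on average".
    d_rf, d_fr = prad_logits(critic, real, fake.detach())
    return (F.binary_cross_entropy_with_logits(d_rf, torch.ones_like(d_rf)) +
            F.binary_cross_entropy_with_logits(d_fr, torch.zeros_like(d_fr)))


def generator_loss(critic, real, fake):
    # Eq. (8): the symmetric objective for the generator.
    d_rf, d_fr = prad_logits(critic, real, fake)
    return (F.binary_cross_entropy_with_logits(d_rf, torch.zeros_like(d_rf)) +
            F.binary_cross_entropy_with_logits(d_fr, torch.ones_like(d_fr)))
```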
b) Color Loss
We use the MSE function to measure the color difference between the enhanced and target images. The color loss can be written as:\begin{equation*} L_{color}(F_{\mathbf {W}}(I_{s}),I_{t}) = \left \|{ F_{\mathbf {W}}(I_{s})-I_{t} }\right \|_{2}^{2}.\tag{9}\end{equation*}
c) Texture Loss
Our texture loss consists of two terms: 1) a local contrast normalization (LCN) loss [30] and 2) a gradient loss. Local contrast normalization has a decorrelating effect in spatial image analysis, and such an operation can be used to model the contrast-gain masking process in the human perceptual system [10]. The local normalization can be formulated as follows:\begin{equation*} \widetilde {I}(i,j) = \frac {I(i,j) - \mu (i,j)}{\sigma (i,j) + C}, \tag{10}\end{equation*} where $C$ is a small constant for numerical stability, and the local mean $\mu(i,j)$ and variance $\sigma^{2}(i,j)$ are computed over a $(2P+1)\times(2Q+1)$ window:
\begin{equation*} \mu (i,j) = \frac {1}{(2P\!+\!1)(2Q\!+\!1)}\sum _{p=-P}^{p=P}\sum _{q=-Q}^{q=Q}I(i\!+\!p,j\!+\!q),\qquad \tag{11}\end{equation*}
\begin{equation*} \sigma ^{2}(i,j) = \sum _{p=-P}^{p=P}\sum _{q=-Q}^{q=Q}(I(i+p, j+q) - \mu (i,j))^{2}. \tag{12}\end{equation*}
Based on Eqs. (10), (11), and (12), we build the LCN loss as:\begin{equation*} L_{lcn}(\widetilde {F}_{\mathbf {W}}(I_{s}),\widetilde {I}_{t}) =\frac {1}{CHW}\left \|{ \varphi _{j}(\widetilde {F}_{\mathbf {W}}(I_{s})) - \varphi _{j}(\widetilde {I}_{t}) }\right \|\tag{13}\end{equation*} where $\varphi _{j}(\cdot)$ denotes the feature map extracted from the $j$-th layer of a pre-trained network, and $C$, $H$, $W$ are its dimensions.
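To make Eqs. (10)-(12) concrete, the sketch below computes the locally normalized image with average pooling; the window size and the value of the constant `eps` (the $C$ in Eq. (10)) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def local_contrast_normalize(img, window=7, eps=1e-2):
    """Local contrast normalization following Eqs. (10)-(12).

    img: (B, C, H, W) tensor. `window` corresponds to (2P+1) = (2Q+1)
    and `eps` to the constant C in Eq. (10); both values are assumptions.
    """
    pad = window // 2
    n = window * window
    # Local mean over the (2P+1)x(2Q+1) window, Eq. (11).
    mu = F.avg_pool2d(F.pad(img, [pad] * 4, mode='reflect'),
                      window, stride=1)
    # Sum of squared deviations over the window, Eq. (12):
    # sum(I^2) - n * mu^2.
    sum_sq = n * F.avg_pool2d(F.pad(img * img, [pad] * 4, mode='reflect'),
                              window, stride=1)
    var = (sum_sq - n * mu * mu).clamp(min=0.0)
    sigma = var.sqrt()
    return (img - mu) / (sigma + eps)
```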
Our gradient loss is robust to illumination variations and focuses on texture:\begin{align*} L_{grad}(F_{\mathbf {W}}(I_{s}),I_{t})=&\left \|{ \bigtriangledown _{x}F_{\mathbf {W}}(I_{s}) - \bigtriangledown _{x}I_{t} }\right \| \\&+ \left \|{ \bigtriangledown _{y}F_{\mathbf {W}}(I_{s}) - \bigtriangledown _{y}I_{t} }\right \|\tag{14}\end{align*}
Our final texture loss is defined as the sum of the LCN loss and the gradient loss:\begin{align*} L_{texture}(F_{\mathbf {W}}(I_{s}),I_{t})=&L_{lcn}(\widetilde {F}_{\mathbf {W}}(I_{s}),\widetilde {I_{t}}) \\&+ L_{grad}(F_{\mathbf {W}}(I_{s}),I_{t})\tag{15}\end{align*}
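A compact sketch of the gradient and texture terms in Eqs. (14) and (15) follows; it reuses `local_contrast_normalize` from above, and the use of an L1 norm and of a generic feature extractor `phi_j` (e.g., a VGG layer) are assumptions on our part.

```python
def image_gradients(img):
    # Forward differences along x and y.
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx, dy


def gradient_loss(pred, target):
    # Eq. (14): match horizontal and vertical gradients.
    px, py = image_gradients(pred)
    tx, ty = image_gradients(target)
    return (px - tx).abs().mean() + (py - ty).abs().mean()


def texture_loss(pred, target, phi_j):
    # Eq. (15): LCN loss (Eq. (13)) plus gradient loss.
    # phi_j: assumed feature extractor (e.g., one VGG layer).
    f_pred = phi_j(local_contrast_normalize(pred))
    f_target = phi_j(local_contrast_normalize(target))
    lcn = (f_pred - f_target).abs().mean()   # 1/(CHW) * L1 norm, assumed
    return lcn + gradient_loss(pred, target)
```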
d) Content Loss
Inspired by [8], [13], a content loss is utilized to maintain the content similarity between the enhanced image and the target DSLR image. The content loss is defined as:\begin{equation*} L_{content}(F_{\mathbf {W}}(I_{s}),I_{t}) =\frac {\left \|{ \varphi (F_{\mathbf {W}}(I_{s}))-\varphi (I_{t}) }\right \|}{C'H'W'},\tag{16}\end{equation*} where $\varphi(\cdot)$ denotes the feature map of a pre-trained VGG network and $C'$, $H'$, $W'$ are its dimensions.
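A minimal sketch of such a VGG-based content loss is shown below; the particular VGG-19 layer cut-off and the L1 norm are illustrative assumptions rather than the exact choices used in our method.

```python
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights


class ContentLoss(nn.Module):
    """VGG-feature content loss in the spirit of Eq. (16).

    The chosen depth (first 26 layers of VGG-19) and the L1 norm are
    assumptions for illustration.
    """

    def __init__(self, num_layers=26):
        super().__init__()
        features = vgg19(weights=VGG19_Weights.DEFAULT).features[:num_layers]
        for p in features.parameters():
            p.requires_grad_(False)
        self.phi = features.eval()

    def forward(self, pred, target):
        f_pred, f_target = self.phi(pred), self.phi(target)
        # Mean over C' * H' * W', matching the normalization in Eq. (16).
        return (f_pred - f_target).abs().mean()
```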
e) Contextual Loss
We define our contextual loss based on the contextual similarity [20]. The contextual loss can be viewed as a statistical measure of the distance between feature distributions. As shown in [20], using the contextual loss during training helps the enhancement network generate more realistic images. The contextual loss can be written as:\begin{equation*} L_{cx}(F_{\mathbf {W}}(I_{s}),I_{t}) = -log(CX(\varphi _{j}(F_{\mathbf {W}}(I_{s})), \varphi _{j}(I_{t}))).\tag{17}\end{equation*}
f) Total Variation Loss
Total variation (TV) loss [1] is often used in digital image processing for denoising. It is based on the principle that images with a lot of spurious detail have high total variation, i.e., the sum of the absolute gradients of the image is high. According to this principle, reducing the total variation removes unwanted noise while enforcing spatial smoothness and preserving important high-frequency components such as edges. Thus, we use the total variation (TV) loss:\begin{equation*} L_{tv}(F_{\mathbf {W}}(I_{s})) =\frac {1}{CHW}\left \|{ \bigtriangledown _{x}F_{\mathbf {W}}(I_{s}) + \bigtriangledown _{y}F_{\mathbf {W}}(I_{s}) }\right \|.\tag{18}\end{equation*}
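The sketch below implements a TV penalty in the common anisotropic form, summing the absolute horizontal and vertical gradients (Eq. (18) writes the two gradient terms inside a single norm); cropping the two difference maps to a common size and normalizing by the mean are our own simplifications.

```python
def total_variation_loss(img):
    """Anisotropic total variation, cf. Eq. (18).

    Gradients are cropped to a common size so the horizontal and vertical
    terms can be combined; the mean provides the 1/(CHW) normalization.
    """
    dx = img[..., :-1, 1:] - img[..., :-1, :-1]
    dy = img[..., 1:, :-1] - img[..., :-1, :-1]
    return (dx.abs() + dy.abs()).mean()
```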
g) Total Loss
Our final loss is defined as a weighted sum of the previous losses with the following coefficients:\begin{align*}&L_{total}(F_{\mathbf {W}}(I_{s}),I_{t}) \\=&L_{content}(F_{\mathbf {W}}(I_{s}),I_{t}) \\&+ 0.4 \cdot L_{GAN}(F_{\mathbf {W}}(I_{s}),I_{t}) + 0.4 \cdot L_{texture}(F_{\mathbf {W}}(I_{s}),I_{t}) \\&+ 0.1 \cdot L_{color}(F_{\mathbf {W}}(I_{s}),I_{t}) + 0.4 \cdot L_{cx}(F_{\mathbf {W}}(I_{s}),I_{t}) \\&+ 400 \cdot L_{tv}(F_{\mathbf {W}}(I_{s}))\tag{19}\end{align*}
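Putting the pieces together, the total objective of Eq. (19) can be assembled as below; the sketch reuses `texture_loss` and `total_variation_loss` from the earlier snippets, and the `adv_loss` scalar, the `content_loss.phi` attribute, and the `contextual_loss` callable are assumptions tied to those sketches rather than an official implementation.

```python
def total_loss(pred, target, adv_loss, content_loss, contextual_loss):
    """Weighted sum of Eq. (19).

    `adv_loss` is the scalar from Eq. (8); `content_loss` and
    `contextual_loss` are callables for Eqs. (16) and (17).
    """
    color = ((pred - target) ** 2).mean()              # Eq. (9)
    return (content_loss(pred, target)
            + 0.4 * adv_loss
            + 0.4 * texture_loss(pred, target, phi_j=content_loss.phi)
            + 0.1 * color
            + 0.4 * contextual_loss(pred, target)
            + 400.0 * total_variation_loss(pred))
```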
D. Model Training
In the training phase, we first adopt the enhancement network to forward a corrupted image $I_{s}$ and produce its enhanced counterpart $F_{\mathbf{W}}(I_{s})$; the discriminator and the enhancement network are then updated alternately.
The whole training algorithm of our model is illustrated in Algorithm 1, which follows the pipeline of our proposed framework shown in Fig. 2.
Algorithm 1 Training Algorithm of Our Network
Input: Training low-quality images $I_{s}$ and their DSLR counterparts $I_{t}$
Output: Enhanced images $F_{\mathbf{W}}(I_{s})$
for number of training iterations do
Obtain enhanced images $F_{\mathbf{W}}(I_{s})$
Obtain the real/fake pairs $I_{r}$ and $I_{f}$ by Eq. (5)
Update the discriminator by descending its stochastic gradient:\begin{equation*} \nabla _{\theta _{d}}{\frac {1}{\left |{ I_{s} }\right |}}\sum \{-E_{I_{r}}[log(D_{PRaD}(I_{r},I_{f}))] - E_{I_{f}}[log(1-D_{PRaD}(I_{f},I_{r}))]\} \end{equation*}
Obtain enhanced images $F_{\mathbf{W}}(I_{s})$ with the updated discriminator fixed
Obtain the real/fake pairs $I_{r}$ and $I_{f}$ by Eq. (5)
Obtain the adversarial loss by Eq. (8) and the total loss $L_{total}$ by Eq. (19)
Update the enhancement network by descending its stochastic gradient:\begin{equation*} \nabla _{\theta _{g}}{\frac {1}{\left |{ I_{s} }\right |}}\sum \{ L_{total} \}.\end{equation*}
end for
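For readers who prefer code, the alternating updates of Algorithm 1 can be sketched as follows; `build_pairs` constructs the triplets of Eq. (5), `discriminator_loss`/`generator_loss` are the earlier PRaD sketches, `objective` is assumed to wrap the `total_loss` sketch for Eq. (19), and all names are ours rather than an official implementation.

```python
import torch


def build_pairs(source, enhanced, target, phi):
    # Eq. (5): triplets of (source, image, feature maps of that image).
    real = torch.cat([source, target, phi(target)], dim=1)
    fake = torch.cat([source, enhanced, phi(enhanced)], dim=1)
    return real, fake


def train_step(generator, critic, phi, objective, opt_g, opt_d, source, target):
    """One iteration of Algorithm 1 (sketch).

    `objective(pred, target, adv)` is assumed to evaluate Eq. (19).
    """
    # --- Discriminator update (Eq. (7)) ---
    with torch.no_grad():
        enhanced = generator(source)
    real, fake = build_pairs(source, enhanced, target, phi)
    d_loss = discriminator_loss(critic, real, fake)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- Enhancement network update (Eqs. (8) and (19)) ---
    enhanced = generator(source)
    real, fake = build_pairs(source, enhanced, target, phi)
    adv = generator_loss(critic, real, fake)
    g_loss = objective(enhanced, target, adv)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```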
Experiment
Dataset and implementation details. We use DPED [7] as our training and testing set. DPED contains 22K photos, including 4549 photos from a Sony mobile phone, 5727 from an iPhone, and 6015 from BlackBerry devices. Each photo has a corresponding high-quality image taken with a Canon DSLR camera. The photos are cropped into
Competing methods and evaluation metrics. We compare our model with state-of-the-art methods, including the Robust Retinex Model (RRM) [14], BIMEF [32], Neural Art [8], DPED [7], PMN [30], and PPCN [6]. To fully analyze our method, we use Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [29] as evaluation metrics. Moreover, we also adopt the learned perceptual image patch similarity (LPIPS) [33] and the mean opinion score (MOS) for perceptual evaluation. For MOS, we adopt subjective voting to assess the enhanced image quality.
A. Main Results
We conduct a quantitative and qualitative comparison of our method with state-of-the-art methods. As shown in Table 1, our model achieves a 0.68 dB improvement over PMN on iPhone. At the same time, the proposed method obtains 1.12 dB and 0.99 dB gains over PPCN on BlackBerry and Sony, respectively. Compared with traditional image enhancement methods, our model shows an even clearer superiority: as shown in Table 1, it outperforms RRM by a clear margin, which verifies the effectiveness of the proposed model. We also conduct a comprehensive qualitative comparison in Figs. 5 and 6. Since the full-size ground-truth images are not released, we use our own SIFT alignment to obtain the 'Canon' reference. Although the full-size images cannot be perfectly aligned, they are still adequate for visual comparison. In Fig. 5, our result is close to DSLR quality, while other state-of-the-art methods exhibit color distortion or brightness problems; for example, the enhanced image of PPCN is too gloomy and that of DPED is overly bright-colored. In contrast, our model presents a natural DSLR quality. In Fig. 6, our result demonstrates natural color and clear structure compared with other state-of-the-art methods. In sum, our results demonstrate stable color and brightness improvements over other methods.
Illustration of several representative feature maps produced by the
Qualitative comparison on ‘BlackBerry’ of the DPED dataset. Instead of enhancing the image to be overly bright-colored, our result is close to natural DSLR quality.
Qualitative comparison on the DPED patch test dataset. Our model produces high-quality and natural image enhancement. Best viewed by zooming in on the electronic version.
1) Perceptual Evaluation:
Since Euclidean distance and structural similarity cannot intuitively reflect subjective judgement, we further use LPIPS [33] and MOS to evaluate our method. LPIPS is a perceptual evaluation metric where a lower score means the image is more realistic. As shown in Table 4, our model surpasses other methods by a clear margin. In addition, our model also achieves an obvious advantage under the subjective voting metric. These results justify the superiority of our model in terms of perceptual evaluation.
2) Deeper Inference:
To fully verify the effectiveness of the proposed model, we also evaluate several larger models for comparison. As shown in Table 2, we replace the enhancement network with GridNet, ResNet-20, and ResNet-32. GridNet is a flexible framework that can emulate U-Net, FCN [18], and ResNet. Notably, the proposed model outperforms GridNet on iPhone with 10 times fewer parameters, which justifies the effectiveness of our model in image enhancement.
3) Real-World Cases
We also investigate our model on real-world images. We collect images from Instagram and apply our generative model to them. Real-world images have complicated brightness and contrast, which makes enhancement difficult. We use the generative model trained on the 'iPhone' dataset to handle these real-world images. As shown in Fig. 7, our model demonstrates a significant quality improvement across different scenes.
Results of real-world cases. We collect images from Instagram and apply our trained generative model to them.
B. Ablation Studies
1) Cross-Scale Feature Aggregation and Refinement
We first verify the effectiveness of the cross-scale feature aggregation and the feature refinement. We alternately disable each component and conduct a comparison on the iPhone dataset. As illustrated in Table 4, the feature refinement brings a 0.07 dB improvement, while the explicit cross-scale feature aggregation contributes a 0.12 dB gain. The model equipped with both components outperforms the plain model by 0.17 dB. Hence, the two proposed components significantly improve the image quality.
2) Discriminator
To verify the proposed PRaD, we disable PRaD and the VGG content for comparison. The 'w/o VGG content' variant uses the enhanced image as the input to the discriminator, and 'w/o PRaD' disables adversarial learning. As shown in Table 3, the full model obtains a 0.08 dB improvement over the discriminator without VGG content. Although the model without PRaD achieves a similar PSNR score, our full model with PRaD obtains an obvious improvement in SSIM.
3) Loss Functions
We analyze the loss functions on 'iPhone'. In Table 7, we investigate our model with different loss functions. We first adopt
4) The Depth of Cross-Scale Feature
To verify the effect of the depth of the cross-scale features, we compare features of the same resolution taken from different depths. In our experiment, the model that aggregates shallow features achieves 22.92 dB, whereas the model that aggregates deeper features achieves 22.95 dB. Therefore, we incorporate the deeper feature at each resolution to perform cross-scale feature aggregation.
5) Normalization
DPED adopts batch normalization (BN) in its enhancement module, whereas our model removes all normalization. Therefore, we conduct a study on different normalization strategies for image enhancement. As shown in Table 5, we incorporate BN, instance normalization (IN) [27], and weight normalization (WN) [23] into our model. We find that the models with BN and WN show poor performance. Although IN shows superior performance on PSNR and SSIM, the model with IN may cause color shifts. In Fig. 6, the 'GridNet' result uses instance normalization and renders the enhanced image overly blue. For these reasons, we remove all normalization from our enhancement network.
C. Efficiency
We perform a comprehensive efficiency analysis to show the practicality of the proposed model. As illustrated in Table 8, we conduct the efficiency comparison on both CPU and GPU platforms with inputs of different resolutions, using an Intel E5-2620 v4 CPU and an Nvidia GTX 1080Ti GPU. Compared with DPED, our model achieves
Conclusion
We have presented a novel framework for automatic image quality enhancement. In our framework, information at different scales can flow freely from the encoder to the decoder. Moreover, we incorporate a novel PRaD to help the enhancement network improve color rendition. Extensive qualitative and quantitative results validate the effectiveness of the proposed model.