Introduction
Perceptual image enhancement (PIE) is a task that restores low-quality images captured by low-end devices with the guidance of high-quality DSLR images [7]. It extends the conventional image enhancement task by pursuing not only more pleasant visual effects but also more realistic details. Perceptual image enhancement is of broad interest for many applications, such as mobile photography, robot sensing, and autonomous driving.
DPED [7] addresses this issue by collecting a novel dataset that uses DSLR-quality images as targets for more precise learning of elaborate image details, and it incorporates a ResNet as the backbone of its enhancement module. Still, DPED suffers from several problems. The first is misalignment. Since the high-quality reference images are not perfectly aligned with the low-quality inputs, the misalignment of input image pairs can confuse PIE models when a pixel-wise MSE loss is used [15], [25]. On the other hand, although the absolute image positions are misaligned, the high- and low-quality images of a pair are still relatively aligned, which leaves room for generating more realistic details through more effective representation learning. DPED relieves the misalignment problem by combining a series of loss functions with adversarial learning. Nevertheless, typical discriminative networks tend to focus on the similarity of style characteristics rather than region- and pixel-level similarity, and the relationship between low- and high-quality images has not been fully explored. Hence, there is still large room for improvement over generic image enhancement pipelines [7], [22].
Various CNN architectures have been used in image processing tasks [7], [15], [21], [25]. MemNet [25] uses a dense network with a memory mechanism for image restoration. EDSR [15] employs a large ResNet to achieve superior performance, and DPED [7] also uses a ResNet for PIE. However, the receptive field of a ResNet is insufficient when processing large photos (e.g., 2K or 4K). Another alternative backbone is U-Net [21]. MWCNN [16] adopts a wavelet-like U-Net and demonstrates superior performance with less computational cost. Nevertheless, typical U-Net structures only pass features of the same resolution between the encoder and the decoder, which limits the information flow within the network and the utilization of coarse-level features. In addition, the convolutional layers and downsampling operations obstruct the flow of information. Thus, the generalization ability of the generative module could benefit from a more flexible network structure that allows information to flow from low-level encoder features to high-level decoder features.
Motivated by the above observations, we present a framework that freely circulates coarse-level information across different resolutions and concentrates on the similarity of the overall style characteristics as well as local region statistics. Our approach relies on two networks: a multi-level connected generator that makes full use of cross-scale feature representations, and a pair-wise relativistic discriminator that fully explores contextual information at both holistic and region levels. In addition, we incorporate auxiliary losses to generate more realistic details and sharper textures.
The main contributions of this paper can be summarized as follows.
We propose a novel multi-level generator for image enhancement, which fully utilizes the features from all resolutions and has different abstraction levels in the expansive path.
We present a pair-wise relativistic discriminator (PRaD) for adversarial training. The triplet relationship not only makes the discriminator aware of the degree of enhancement but also incorporates pixel-level contextual information into discriminative learning.
Experiments demonstrate that our model achieves superior empirical performance with lower computational complexity.
Related Work
Perceptual image enhancement is a new problem that aims at restoring a high-quality image from a low-quality observation. Recently, CNN-based approaches have achieved great success in image processing tasks such as image restoration and enhancement. MemNet [25] uses a dense network to establish the relationship between corrupted and clear image pairs. Furthermore, EDSR [26] advances the network capacity by incorporating more efficient residual blocks. However, these pipelines can only handle fully aligned image pairs (e.g., corrupted images obtained from clear images by some downsampling operator). In addition, many researchers have attempted to introduce adversarial learning into image enhancement. Yan et al. [31] realize automatic photo adjustment with a deep neural network. Wang et al. [28] incorporate auxiliary semantic information to help generate realistic details. EnhanceNet [22] employs adversarial training and a texture loss to produce realistic images. However, none of them study image enhancement for image pairs without full alignment. DPED [7] combines a series of loss functions to achieve impressive quality enhancement between relatively aligned image pairs. RaGAN [9] proposes an efficient learning strategy by introducing a relativistic discriminator. Image enhancement for image pairs without full alignment using accurate learning metrics still remains an open challenge, with great potential to facilitate the development of image restoration and enhancement.
Various feature aggregation methods have been explored in semantic segmentation. U-Net [21] proposes an efficient structure that concatenates features of the same resolution between the encoder and the corresponding decoder. Hypercolumns [5] shows that leveraging multi-level features jointly can boost performance; however, it extracts and merges features at different levels directly, so the interactions between high-level and low-level features cannot be exploited efficiently. RCF [17] develops an effective edge detection framework by exploiting a hierarchical feature fusion strategy. More recently, RCAN [34] addresses feature attention across channels in image super-resolution by adding attention layers that select useful channels, and LapSRN [12] adopts a Laplacian pyramid feature representation for image super-resolution. However, such multi-scale feature aggregation methods have been neglected in image enhancement. For image enhancement, colors and brightness in local regions and in the global image are not uniform, which makes automatic adjustment difficult. Moreover, the under-aligned training pairs place a high demand on the analysis of local and non-local contextual information. To this end, aggregating features of different resolutions is meaningful for image enhancement.
Although cross-level feature representations have been exploited and have significantly facilitated many computer vision fields, there is still large room for improvement over general CNN architectures, especially for image enhancement. To investigate this strategy in image enhancement, we propose a light-weight model, which is described in the next section.
Methodology
A. Framework Overview
Let $I_{s}$ denote a low-quality source image and $I_{t}$ the corresponding high-quality DSLR target. Our enhancement network $F_{\mathbf{W}}$ with parameters $\mathbf{W}$ maps the source image to an enhanced image \begin{equation*} I_{e} = F_{\mathbf {W}}(I_{s}),\tag{1}\end{equation*} which is expected to be as close to $I_{t}$ as possible.
B. Multi-Level Connected Generator
As shown in Fig. 2, our generator inherits the structure of U-Net [21], which consists of a contracting path (encoder) and an expansive path (decoder). U-Net concatenates features of the same scale between the encoder and decoder, but this strategy tends to limit the enhancement ability of the network because it ignores the richness of features at other scales. Therefore, in our generator, we propose a cross-scale feature aggregation (CFA) layer that fully exploits cross-scale feature learning for perceptual image enhancement, aggregating the information flow of different scales from the contracting path into the expansive path.
A demonstration of perceptual image enhancement. Owing to the limitations of devices and illumination, captured images often show poor quality. Perceptual image enhancement provides a feasible solution to this issue.
Overview of the generative network. Green, blue, and orange mark feature representations of different resolutions. A series of cross-scale connections (e.g., downsampling and upsampling) is established to pass comprehensive messages between the encoder and the decoder. Hence, each scale of the decoder receives contextual cross-scale features and abstracts them into more discriminative representations.
Specifically, our generator processes features at multiple resolution levels. In the encoder, the feature map at level $n$ is computed as \begin{equation*} y_{enc}^{n} = \Downarrow C(C(y_{enc}^{n - 1})),\tag{2}\end{equation*} where $C(\cdot)$ denotes a convolutional layer and $\Downarrow$ denotes downsampling.
In the decoder, we gradually up-sample the feature maps as \begin{equation*} y_{dec}^{n} = C({CFA_{n}}(\Uparrow y_{dec}^{n + 1})),\tag{3}\end{equation*} where $\Uparrow$ denotes upsampling and $CFA_{n}$ is the cross-scale feature aggregation layer at level $n$.
Next we elaborate the details of the proposed CFA layer. The CFA layer introduces the cross-scale features of the encoder into the decoder as \begin{equation*} {CFA_{n}(\cdot)} = C(Concat(R(\Uparrow y_{dec}^{n + 1}), F_{n})),\tag{4}\end{equation*} where $Concat(\cdot)$ denotes channel-wise concatenation, $R(\cdot)$ is the feature refinement operation, and $F_{n}$ denotes the aggregated cross-scale features from the encoder at level $n$.
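For concreteness, the following is a minimal PyTorch-style sketch of how such a CFA layer could be realized; the channel widths, the bilinear resizing, and the composition of the refinement block are illustrative assumptions rather than the exact configuration of our network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CFALayer(nn.Module):
    """Cross-scale feature aggregation, cf. Eq. (4) (illustrative sketch).

    Encoder features from every scale are resized to the current decoder
    resolution, concatenated with the refined upsampled decoder feature,
    and fused by a convolution. Channel sizes are placeholders.
    """

    def __init__(self, dec_channels, enc_channels_list, out_channels):
        super().__init__()
        # R(.): a light refinement block for the upsampled decoder feature
        self.refine = nn.Sequential(
            nn.Conv2d(dec_channels, dec_channels, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        fused_in = dec_channels + sum(enc_channels_list)
        # C(.): fuse the concatenated cross-scale features
        self.fuse = nn.Conv2d(fused_in, out_channels, 3, padding=1)

    def forward(self, dec_up, enc_feats):
        # dec_up: decoder feature already upsampled to this level
        # enc_feats: list of encoder features from all scales
        h, w = dec_up.shape[-2:]
        resized = [F.interpolate(f, size=(h, w), mode='bilinear',
                                 align_corners=False) for f in enc_feats]
        x = torch.cat([self.refine(dec_up)] + resized, dim=1)
        return self.fuse(x)
```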
C. Discriminator and Loss Functions
1) Pair-Wise Relativistic Discriminator
In this section, we first describe the pair-wise relativistic discriminator, termed PRaD. PRaD is motivated by the relativistic discriminator (RaD) [9], which strengthens the generalization ability of the generator. Our PRaD is designed to address the following two problems: 1) whereas a general GAN [4] focuses on overall similarity, PRaD enforces similarity at the pixel level; 2) whereas other discriminators simply classify an input image as fake or real, PRaD estimates the probability that the given real data is more realistic than the fake data on average.
Hence, the real/fake pair for PRaD is:\begin{align*} I_{r}=&Concat[I_{s}, I_{t}, \varphi (I_{t})] \\ I_{f}=&Concat[I_{s}, F_{\mathbf {W}}(I_{s}), \varphi (F_{\mathbf {W}}(I_{s}))]\tag{5}\end{align*}
\begin{equation*} D_{PRaD}(I_{r},I_{f}) = \sigma (C(I_{r}) - E_{I_{f}}[C(I_{f})]),\tag{6}\end{equation*} where $\sigma(\cdot)$ is the sigmoid function and $C(\cdot)$ denotes the non-transformed discriminator output.
\begin{align*} L_{PRaD}=&-E_{I_{r}}[log(D_{PRaD}(I_{r},I_{f}))] \\&- E_{I_{f}}[log(1-D_{PRaD}(I_{f},I_{r}))].\tag{7}\end{align*}
Our discriminator consists of five convolutional layers each followed by a LeakyReLU activation and batch normalization. The first convolutional layer has
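A minimal sketch of such a discriminator is given below; the channel widths, kernel sizes, strides, and the unnormalized final scoring layer are placeholders chosen for illustration, not the exact configuration used in our experiments.

```python
import torch.nn as nn


def conv_block(in_ch, out_ch, stride):
    # Conv -> BN -> LeakyReLU, as described in the text
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )


class Critic(nn.Module):
    """Five-layer convolutional critic producing raw scores C(x).

    The input is the concatenated triplet from Eq. (5); channel widths
    (64..512) and the plain final layer are assumptions.
    """

    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            conv_block(in_channels, 64, 2),
            conv_block(64, 128, 2),
            conv_block(128, 256, 2),
            conv_block(256, 512, 1),
        )
        self.score = nn.Conv2d(512, 1, kernel_size=4, stride=1, padding=1)

    def forward(self, x):
        return self.score(self.features(x))
```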
2) Loss Functions:
a) Adversarial Loss
Following the PRaD, we first introduce the adversarial loss. As illustrated in Sec. III-C.1, the adversarial loss for our generator is defined in a symmetrical form:\begin{align*} L_{GAN}(F_{\mathbf {W}}(I_{s}),I_{t})=&L_{G}^{PRaD} \\=&-E_{I_{r}}[log(1 - D_{PRaD}(I_{r},I_{f}))] \\&- E_{I_{f}}[log(D_{PRaD}(I_{f},I_{r}))].\tag{8}\end{align*}
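As a concrete illustration, the objectives of Eqs. (7) and (8) can be computed as in the following PyTorch-style sketch, where `critic` stands for the discriminator up to (but not including) the sigmoid and `real`/`fake` are the triplets of Eq. (5); the function names are ours, and the binary-cross-entropy-with-logits form is a mathematically equivalent way of writing the log terms.

```python
import torch
import torch.nn.functional as F


def prad_logits(critic, real, fake):
    """Relativistic average logits following Eq. (6); critic(x) returns C(x)."""
    c_real = critic(real)
    c_fake = critic(fake)
    d_rf = c_real - c_fake.mean()   # D(I_r, I_f) before the sigmoid
    d_fr = c_fake - c_real.mean()   # D(I_f, I_r) before the sigmoid
    return d_rf, d_fr


def discriminator_loss(critic, real, fake):
    # Eq. (7): real should look "more real than fake on average".
    d_rf, d_fr = prad_logits(critic, real, fake.detach())
    return (F.binary_cross_entropy_with_logits(d_rf, torch.ones_like(d_rf)) +
            F.binary_cross_entropy_with_logits(d_fr, torch.zeros_like(d_fr)))


def generator_loss(critic, real, fake):
    # Eq. (8): the symmetric objective for the generator.
    d_rf, d_fr = prad_logits(critic, real, fake)
    return (F.binary_cross_entropy_with_logits(d_rf, torch.zeros_like(d_rf)) +
            F.binary_cross_entropy_with_logits(d_fr, torch.ones_like(d_fr)))
```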
b) Color Loss
We use the MSE function to measure the color difference between the enhanced and target images. The color loss can be written as:\begin{equation*} L_{color}(F_{\mathbf {W}}(I_{s}),I_{t}) = \left \|{ F_{\mathbf {W}}(I_{s})-I_{t} }\right \|_{2}^{2}.\tag{9}\end{equation*}
c) Texture Loss
Our texture loss consists of two terms: 1) a local contrast normalization (LCN) loss [30] and 2) a gradient loss. Local contrast normalization has a decorrelating effect in spatial image analysis, and such an operation can be used to model the contrast-gain masking process in the human perceptual system [10]. The local normalization can be formulated as follows:\begin{equation*} \widetilde {I}(i,j) = \frac {I(i,j) - \mu (i,j)}{\sigma (i,j) + C}, \tag{10}\end{equation*} where $C$ is a small constant for numerical stability, and the local mean $\mu(i,j)$ and variance $\sigma^{2}(i,j)$ are computed over a $(2P+1)\times(2Q+1)$ window:
\begin{equation*} \mu (i,j) = \frac {1}{(2P\!+\!1)(2Q\!+\!1)}\sum _{p=-P}^{p=P}\sum _{q=-Q}^{q=Q}I(i\!+\!p,j\!+\!q),\qquad \tag{11}\end{equation*}
\begin{equation*} \sigma ^{2}(i,j) = \sum _{p=-P}^{p=P}\sum _{q=-Q}^{q=Q}(I(i+p, j+q) - \mu (i,j))^{2}. \tag{12}\end{equation*}
Based on Eqs. (10), (11), and (12), we build the LCN loss as:\begin{equation*} L_{lcn}(\widetilde {F}_{\mathbf {W}}(I_{s}),\widetilde {I}_{t}) =\frac {1}{CHW}\left \|{ \varphi _{j}(\widetilde {F}_{\mathbf {W}}(I_{s})) - \varphi _{j}(\widetilde {I}_{t}) }\right \|\tag{13}\end{equation*} where $\varphi _{j}(\cdot)$ denotes the feature map extracted from the $j$-th layer of a pre-trained network, and $C$, $H$, $W$ are its dimensions.
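To make Eqs. (10)-(12) concrete, the sketch below computes the locally normalized image with average pooling; the window size and the value of the constant `eps` (the $C$ in Eq. (10)) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def local_contrast_normalize(img, window=7, eps=1e-2):
    """Local contrast normalization following Eqs. (10)-(12).

    img: (B, C, H, W) tensor. `window` corresponds to (2P+1) = (2Q+1)
    and `eps` to the constant C in Eq. (10); both values are assumptions.
    """
    pad = window // 2
    n = window * window
    # Local mean over the (2P+1)x(2Q+1) window, Eq. (11).
    mu = F.avg_pool2d(F.pad(img, [pad] * 4, mode='reflect'),
                      window, stride=1)
    # Sum of squared deviations over the window, Eq. (12):
    # sum(I^2) - n * mu^2.
    sum_sq = n * F.avg_pool2d(F.pad(img * img, [pad] * 4, mode='reflect'),
                              window, stride=1)
    var = (sum_sq - n * mu * mu).clamp(min=0.0)
    sigma = var.sqrt()
    return (img - mu) / (sigma + eps)
```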
Our gradient loss is robust to illumination variations and focuses on texture:\begin{align*} L_{grad}(F_{\mathbf {W}}(I_{s}),I_{t})=&\left \|{ \bigtriangledown _{x}F_{\mathbf {W}}(I_{s}) - \bigtriangledown _{x}I_{t} }\right \| \\&+ \left \|{ \bigtriangledown _{y}F_{\mathbf {W}}(I_{s}) - \bigtriangledown _{y}I_{t} }\right \|\tag{14}\end{align*}
Our final texture loss is defined as the sum of the LCN loss and the gradient loss:\begin{align*} L_{texture}(F_{\mathbf {W}}(I_{s}),I_{t})=&L_{lcn}(\widetilde {F}_{\mathbf {W}}(I_{s}),\widetilde {I_{t}}) \\&+ L_{grad}(F_{\mathbf {W}}(I_{s}),I_{t})\tag{15}\end{align*}
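A compact sketch of the gradient and texture terms in Eqs. (14) and (15) follows; it reuses `local_contrast_normalize` from above, and the use of an L1 norm and of a generic feature extractor `phi_j` (e.g., a VGG layer) are assumptions on our part.

```python
def image_gradients(img):
    # Forward differences along x and y.
    dx = img[..., :, 1:] - img[..., :, :-1]
    dy = img[..., 1:, :] - img[..., :-1, :]
    return dx, dy


def gradient_loss(pred, target):
    # Eq. (14): match horizontal and vertical gradients.
    px, py = image_gradients(pred)
    tx, ty = image_gradients(target)
    return (px - tx).abs().mean() + (py - ty).abs().mean()


def texture_loss(pred, target, phi_j):
    # Eq. (15): LCN loss (Eq. (13)) plus gradient loss.
    # phi_j: assumed feature extractor (e.g., one VGG layer).
    f_pred = phi_j(local_contrast_normalize(pred))
    f_target = phi_j(local_contrast_normalize(target))
    lcn = (f_pred - f_target).abs().mean()   # 1/(CHW) * L1 norm, assumed
    return lcn + gradient_loss(pred, target)
```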
d) Content Loss
Inspired by [8], [13], a content loss is utilized to maintain the content similarity between the enhanced image and the target DSLR image. The content loss is defined as:\begin{equation*} L_{content}(F_{\mathbf {W}}(I_{s}),I_{t}) =\frac {\left \|{ \varphi (F_{\mathbf {W}}(I_{s}))-\varphi (I_{t}) }\right \|}{C'H'W'},\tag{16}\end{equation*} where $\varphi(\cdot)$ denotes the feature map of a pre-trained VGG network and $C'$, $H'$, $W'$ are its dimensions.
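A minimal sketch of such a VGG-based content loss is shown below; the particular VGG-19 layer cut-off and the L1 norm are illustrative assumptions rather than the exact choices used in our method.

```python
import torch.nn as nn
from torchvision.models import vgg19, VGG19_Weights


class ContentLoss(nn.Module):
    """VGG-feature content loss in the spirit of Eq. (16).

    The chosen depth (first 26 layers of VGG-19) and the L1 norm are
    assumptions for illustration.
    """

    def __init__(self, num_layers=26):
        super().__init__()
        features = vgg19(weights=VGG19_Weights.DEFAULT).features[:num_layers]
        for p in features.parameters():
            p.requires_grad_(False)
        self.phi = features.eval()

    def forward(self, pred, target):
        f_pred, f_target = self.phi(pred), self.phi(target)
        # Mean over C' * H' * W', matching the normalization in Eq. (16).
        return (f_pred - f_target).abs().mean()
```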
e) Contextual Loss
We define our contextual loss based on the contextual similarity [20]. The contextual loss can be viewed as a statistical measure of the distance between feature distributions. As shown in [20], using the contextual loss during training helps the enhancement network generate more realistic images. The contextual loss can be written as:\begin{equation*} L_{cx}(F_{\mathbf {W}}(I_{s}),I_{t}) = -log(CX(\varphi _{j}(F_{\mathbf {W}}(I_{s})), \varphi _{j}(I_{t}))).\tag{17}\end{equation*}
f) Total Variation Loss
Total variation (TV) loss [1] is often used in digital image processing for denoising. It is based on the principle that images with a lot of spurious detail have high total variation, i.e., the sum of the absolute gradients of the image is high. According to this principle, reducing the total variation removes unwanted noise while enforcing spatial smoothness and preserving important high-frequency components such as edges. Thus, we use the total variation (TV) loss:\begin{equation*} L_{tv}(F_{\mathbf {W}}(I_{s})) =\frac {1}{CHW}\left \|{ \bigtriangledown _{x}F_{\mathbf {W}}(I_{s}) + \bigtriangledown _{y}F_{\mathbf {W}}(I_{s}) }\right \|.\tag{18}\end{equation*}
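The sketch below implements a TV penalty in the common anisotropic form, summing the absolute horizontal and vertical gradients (Eq. (18) writes the two gradient terms inside a single norm); cropping the two difference maps to a common size and normalizing by the mean are our own simplifications.

```python
def total_variation_loss(img):
    """Anisotropic total variation, cf. Eq. (18).

    Gradients are cropped to a common size so the horizontal and vertical
    terms can be combined; the mean provides the 1/(CHW) normalization.
    """
    dx = img[..., :-1, 1:] - img[..., :-1, :-1]
    dy = img[..., 1:, :-1] - img[..., :-1, :-1]
    return (dx.abs() + dy.abs()).mean()
```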
g) Total Loss
Our final loss is defined as a weighted sum of the previous losses with the following coefficients:\begin{align*}&L_{total}(F_{\mathbf {W}}(I_{s}),I_{t}) \\=&L_{content}(F_{\mathbf {W}}(I_{s}),I_{t}) \\&+ 0.4 \cdot L_{GAN}(F_{\mathbf {W}}(I_{s}),I_{t}) + 0.4 \cdot L_{texture}(F_{\mathbf {W}}(I_{s}),I_{t}) \\&+ 0.1 \cdot L_{color}(F_{\mathbf {W}}(I_{s}),I_{t}) + 0.4 \cdot L_{cx}(F_{\mathbf {W}}(I_{s}),I_{t}) \\&+ 400 \cdot L_{tv}(F_{\mathbf {W}}(I_{s}))\tag{19}\end{align*}
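Putting the pieces together, the total objective of Eq. (19) can be assembled as below; the sketch reuses `texture_loss` and `total_variation_loss` from the earlier snippets, and the `adv_loss` scalar, the `content_loss.phi` attribute, and the `contextual_loss` callable are assumptions tied to those sketches rather than an official implementation.

```python
def total_loss(pred, target, adv_loss, content_loss, contextual_loss):
    """Weighted sum of Eq. (19).

    `adv_loss` is the scalar from Eq. (8); `content_loss` and
    `contextual_loss` are callables for Eqs. (16) and (17).
    """
    color = ((pred - target) ** 2).mean()              # Eq. (9)
    return (content_loss(pred, target)
            + 0.4 * adv_loss
            + 0.4 * texture_loss(pred, target, phi_j=content_loss.phi)
            + 0.1 * color
            + 0.4 * contextual_loss(pred, target)
            + 400.0 * total_variation_loss(pred))
```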
D. Model Training
In the training phase, we first adopt the enhancement network to forward a corrupted image $I_{s}$ and produce its enhanced counterpart $F_{\mathbf{W}}(I_{s})$; the discriminator and the enhancement network are then updated alternately.
The whole training algorithm of our model is illustrated in Algorithm 1, which follows the pipeline of our proposed framework shown in Fig. 2.
Algorithm 1 Training Algorithm of Our Network
Input: Training low-quality images $I_{s}$ and their DSLR counterparts $I_{t}$
Output: Enhanced images $F_{\mathbf{W}}(I_{s})$
for number of training iterations do
Obtain enhanced images $F_{\mathbf{W}}(I_{s})$
Obtain the real/fake pairs $I_{r}$ and $I_{f}$ by Eq. (5)
Update the discriminator by descending its stochastic gradient:\begin{equation*} \nabla _{\theta _{d}}{\frac {1}{\left |{ I_{s} }\right |}}\sum \{-E_{I_{r}}[log(D_{PRaD}(I_{r},I_{f}))] - E_{I_{f}}[log(1-D_{PRaD}(I_{f},I_{r}))]\} \end{equation*}
Obtain enhanced images $F_{\mathbf{W}}(I_{s})$ with the updated discriminator fixed
Obtain the real/fake pairs $I_{r}$ and $I_{f}$ by Eq. (5)
Obtain the adversarial loss by Eq. (8) and the total loss $L_{total}$ by Eq. (19)
Update the enhancement network by descending its stochastic gradient:\begin{equation*} \nabla _{\theta _{g}}{\frac {1}{\left |{ I_{s} }\right |}}\sum \{ L_{total} \}.\end{equation*}
end for
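For readers who prefer code, the alternating updates of Algorithm 1 can be sketched as follows; `build_pairs` constructs the triplets of Eq. (5), `discriminator_loss`/`generator_loss` are the earlier PRaD sketches, `objective` is assumed to wrap the `total_loss` sketch for Eq. (19), and all names are ours rather than an official implementation.

```python
import torch


def build_pairs(source, enhanced, target, phi):
    # Eq. (5): triplets of (source, image, feature maps of that image).
    real = torch.cat([source, target, phi(target)], dim=1)
    fake = torch.cat([source, enhanced, phi(enhanced)], dim=1)
    return real, fake


def train_step(generator, critic, phi, objective, opt_g, opt_d, source, target):
    """One iteration of Algorithm 1 (sketch).

    `objective(pred, target, adv)` is assumed to evaluate Eq. (19).
    """
    # --- Discriminator update (Eq. (7)) ---
    with torch.no_grad():
        enhanced = generator(source)
    real, fake = build_pairs(source, enhanced, target, phi)
    d_loss = discriminator_loss(critic, real, fake)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # --- Enhancement network update (Eqs. (8) and (19)) ---
    enhanced = generator(source)
    real, fake = build_pairs(source, enhanced, target, phi)
    adv = generator_loss(critic, real, fake)
    g_loss = objective(enhanced, target, adv)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```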
Experiment
Dataset and implementation details. We use DPED [7] as our training and testing set. DPED contains 22K photos, including 4549 photos from a Sony mobile phone, 5727 from an iPhone, and 6015 from BlackBerry devices. Each photo has a corresponding high-quality image taken with a Canon DSLR camera. The photos are cropped into
Competing methods and evaluation metrics. We compare our model with state-of-the-art methods, including the Robust Retinex Model (RRM) [14], BIMEF [32], Neural Art [8], DPED [7], PMN [30], and PPCN [6]. To fully analyze our method, we use Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [29] as evaluation metrics. Moreover, we also adopt the learned perceptual image patch similarity (LPIPS) [33] and the mean opinion score (MOS) for perceptual evaluation. For MOS, we adopt subjective voting to assess the enhanced image quality.
A. Main Results
We conduct a quantitative and qualitative comparison of our method with state-of-the-art methods. As shown in Table 1, our model achieves a 0.68 dB improvement over PMN on iPhone. At the same time, the proposed method obtains 1.12 dB and 0.99 dB gains over PPCN on BlackBerry and Sony, respectively. Compared with traditional image enhancement methods, our model shows an even clearer superiority: as shown in Table 1, it outperforms RRM by a clear margin, which verifies the effectiveness of the proposed model. We also conduct a comprehensive qualitative comparison in Figs. 5 and 6. Since the full-size ground-truth images are not released, we use our own SIFT alignment to obtain the 'Canon' reference. Although the full-size images cannot be perfectly aligned, they are still adequate for visual comparison. In Fig. 5, our result is close to DSLR quality, while other state-of-the-art methods exhibit color distortion or brightness problems; for example, the enhanced image of PPCN is too gloomy and that of DPED is overly bright-colored. In contrast, our model presents a natural DSLR quality. In Fig. 6, our result demonstrates natural color and clear structure compared with other state-of-the-art methods. In sum, our results demonstrate stable color and brightness improvements over other methods.
Illustration of several representative feature maps produced by the
Qualitative comparison on ‘BlackBerry’ of the DPED dataset. Instead of enhancing the image to be overly bright-colored, our result is close to natural DSLR quality.
Qualitative comparison on the DPED patch test dataset. Our model produces high-quality and natural image enhancement. Best viewed by zooming in on the electronic version.
1) Perceptual Evaluation:
Since Euclidean distance and structural similarity cannot intuitively reflect subjective judgement, we further use LPIPS [33] and MOS to evaluate our method. LPIPS is a perceptual evaluation metric where a lower score means the image is more realistic. As shown in Table 4, our model surpasses other methods by a clear margin. In addition, our model also achieves an obvious advantage under the subjective voting metric. These results justify the superiority of our model in terms of perceptual evaluation.
2) Deeper Inference:
To fully verify the effectiveness of the proposed model, we also evaluate several larger models for comparison. As shown in Table 2, we replace the enhancement network with GridNet, ResNet-20, and ResNet-32. GridNet is a flexible framework that can emulate U-Net, FCN [18], and ResNet. Notably, the proposed model outperforms GridNet on iPhone with 10 times fewer parameters, which justifies the effectiveness of our model in image enhancement.
3) Real-World Cases
We also investigate our model on real-world images. We collect images from Instagram and apply our generative model to them. Real-world images have complicated brightness and contrast, which makes enhancement difficult. We use the generative model trained on the 'iPhone' dataset to handle these real-world images. As shown in Fig. 7, our model demonstrates a significant quality improvement across different scenes.
Results of real-world cases. We collect images from Instagram and apply our trained generative model to them.
B. Ablation Studies
1) Cross-Scale Feature Aggregation and Refinement
We first verify the effectiveness of the cross-scale feature aggregation and the feature refinement. We alternately disable each component and conduct a comparison on the iPhone dataset. As illustrated in Table 4, the feature refinement brings a 0.07 dB improvement, while the explicit cross-scale feature aggregation contributes a 0.12 dB gain. The model equipped with both components outperforms the plain model by 0.17 dB. Hence, the two proposed components significantly improve the image quality.
2) Discriminator
To verify the proposed PRaD, we disable PRaD and the VGG content for comparison. The 'w/o VGG content' variant uses the enhanced image as the input to the discriminator, and 'w/o PRaD' disables adversarial learning. As shown in Table 3, the full model obtains a 0.08 dB improvement over the discriminator without VGG content. Although the model without PRaD achieves a similar PSNR score, our full model with PRaD obtains an obvious improvement in SSIM.
3) Loss Functions
We analyze the loss functions on 'iPhone'. In Table 7, we investigate our model with different loss functions. We first adopt
4) The Depth of Cross-Scale Feature
To verify the effect of the depth of the cross-scale features, we compare features of the same resolution taken from different depths. In our experiment, the model that aggregates shallow features achieves 22.92 dB, whereas the model that aggregates deeper features achieves 22.95 dB. Therefore, we incorporate the deeper feature at each resolution to perform cross-scale feature aggregation.
5) Normalization
DPED adopts batch normalization (BN) in its enhancement module, whereas our model removes all normalization. Therefore, we conduct a study on different normalization strategies for image enhancement. As shown in Table 5, we incorporate BN, instance normalization (IN) [27], and weight normalization (WN) [23] into our model. We find that the models with BN and WN show poor performance. Although IN shows superior performance on PSNR and SSIM, the model with IN may cause color shifts. In Fig. 6, the 'GridNet' result uses instance normalization and renders the enhanced image overly blue. For these reasons, we remove all normalization from our enhancement network.
C. Efficiency
We perform a comprehensive efficiency analysis to show the practicality of the proposed model. As illustrated in Table 8, we conduct the efficiency comparison on both CPU and GPU platforms with inputs of different resolutions, using an Intel E5-2620 v4 CPU and an Nvidia GTX 1080Ti GPU. Compared with DPED, our model achieves
Conclusion
We have presented a novel framework for automatic image quality enhancement. In our framework, information at different scales can flow freely from the encoder to the decoder. Moreover, we incorporate a novel PRaD to help the enhancement network improve color rendition. Extensive qualitative and quantitative results validate the effectiveness of the proposed model.