Introduction
High dynamic range (HDR) images offer a far wider dynamic range than low dynamic range (LDR) images and carry more detailed information, so they hold great promise for improving the viewing experience. With the development of display technology, more and more devices support HDR features. However, direct acquisition with HDR cameras is still not widespread, and a large amount of legacy LDR media cannot be re-shot. It is therefore necessary to apply inverse tone mapping (ITM) techniques to convert LDR content to HDR content. Conventional ITM methods [3]–[7] mainly aim to make LDR images display more pleasingly on HDR monitors, but they cannot restore the information in over/under-exposed regions. Recently, several learning-based ITM methods have been proposed [1], [8]–[12], which show large performance improvements over the conventional methods. However, these algorithms mainly target high-quality LDR images without compression loss. Unfortunately, in practice, many legacy media sources and web images are stored in lossy compression formats, which introduces compression artifacts such as blocking and ringing. It is therefore necessary to remove these artifacts in the ITM scenario [13].

In this paper, we propose an end-to-end deep network that reconstructs an HDR image from a single compressed LDR image; in other words, we simultaneously reduce the compression artifacts and reconstruct high-quality HDR information. Since there is already plenty of research on compression artifacts removal [2], [14]–[16], a simple idea is serial processing: first remove the artifacts and then perform the other task, or vice versa. However, if artifacts removal is taken as a pre-operation, the result loses a large amount of texture information, which makes the final output too smooth; if it is taken as a post-operation, the artifacts become more difficult to remove because the preceding processing amplifies them. An example is shown in Fig. 1.
As the figure shows, (a) is the ground-truth HDR image and (b) is the corresponding compressed LDR image, which has an over-exposed region with compression artifacts. If an existing ITM network [1] trained on a compressed dataset is applied directly, as in (c), it is hard to reduce the compression artifacts and reconstruct the over-exposed region. If compression artifacts removal [2] is taken as a pre-operation, as in (d), the result loses much texture information and is too smooth. If it is taken as a post-operation, as in (e), the artifacts are amplified by the preceding operations and become much harder to reduce. As (f) shows, our method simultaneously removes the artifacts and recovers the over/under-exposed regions, in line with our expectations. (Hdrcnn1 and Hdrcnn2 denote the network of [1] trained without and with a compressed dataset, respectively; Arcnn [2] is a state-of-the-art compression artifacts removal method.)
Further considering that the artifacts mainly reside in the high-frequency region (detailed in Sections II and III), we first decompose the input LDR image into low- and high-frequency components, called the base and detail layers respectively, via an edge-preserving process [17], [18]. Specifically, the base layer is the filtered image, while the detail layer is the difference between the input and the base layer. Based on this decomposition, we design three subnetworks to complete the whole ITM task: a detail-layer recovery subnetwork, a base-layer recovery subnetwork, and a merge subnetwork. The architecture of the network is shown in Fig. 2. Because the detail layer is relatively sparse, we use a residual network without up/down-sampling to preserve its structural integrity. The base layer, in contrast, carries rich information, so we need multi-scale information to restore over/under-exposure and perform tone expansion correctly; a U-Net-like structure [19] is therefore adopted for this subnetwork. An initial result is obtained by summing the outputs of these two subnetworks. To reduce color shift and further restore severely over-exposed regions, a merge subnetwork is added; it adopts a structure similar to the detail-layer recovery subnetwork. Since the three subnetworks reconstruct the detail layer, the base layer, and the final result respectively, we correspondingly introduce a multi-stage training strategy: first, the detail and base recovery subnetworks are trained separately; then the merge subnetwork is trained; finally, the whole network is fine-tuned end-to-end. For further enhancement, we introduce a perceptual loss and an adversarial loss: the perceptual loss [20] encourages the output to approximate the ground truth at the semantic level, while the adversarial loss [21] pushes the output to recover more detail and look more realistic. Our contributions can be summarized as:
The overview of our network. First, the input LDR image is decomposed into a detail layer and a base layer by the guided filter [17]; then reconstruction is conducted by the two recovery subnetworks separately; finally the outputs are fused by the merge subnetwork. The blue dotted lines indicate the different training stages for compressed images.
Focusing on the real-world compressed-image HDR problem, the proposed decomposition-based method removes compression artifacts and restores high-quality HDR information simultaneously.
We propose an effective multi-stage training strategy that pre-trains each subnetwork before training the whole network, which benefits performance.
Experimental results show that our method produces convincing results and outperforms other state-of-the-art methods.
Related Work
A. Inverse Tone Mapping
Generally, ITM can be regarded as an image restoration problem [11], which can be modeled as:\begin{equation*} I_{LDR} = f_{TM}\left ({I_{HDR}}\right),\tag{1}\end{equation*} where $f_{TM}$ denotes the tone mapping process. For compressed inputs, the lossy compression must also be modeled:\begin{equation*} I_{LDR} = f_{Compress}\left ({f_{TM}\left ({I_{HDR}}\right)}\right) + \epsilon,\tag{2}\end{equation*} where $f_{Compress}$ denotes the compression process and $\epsilon$ the resulting compression error.
The conventional ITM methods can be divided into two categories: global and local model methods. Landis et al. [3] propose a power-function-based method to expand the luminance of LDR images, primarily for image-based lighting applications. Akyüz et al. [4] find that a simple linear expansion can provide the most favored viewing experience on HDR displays. Meylan et al. [5] first detect the diffuse and specular regions of the input image, and then use different expansion functions for the different regions. Didyk et al. [22] put forward a classification-based method, which first classifies the content into lights, reflections, and diffuse parts and then enhances the different components with different curves. Rempel et al. [6] utilize an expand map to guide the expansion of LDR images. Wang et al. [7] propose an interactive method, which can recover over/under-exposed regions while boosting the luminance. However, these methods are model-driven: their various parameters are not friendly to non-expert users, and they can hardly reconstruct the over/under-exposed regions. Recently, several learning-based methods attempt to solve the ITM problem. These methods can be further divided into two categories: direct and indirect.
The direct methods predict the HDR image from a single LDR image directly. Zhang and Lalonde [23] design a network to produce HDR images from LDR images, specifically for daytime outdoor panoramas; however, limited by the input resolution (64×128), it can hardly restore realistic details. Eilertsen et al. [1] demonstrate that the most important part of ITM is the restoration of saturated regions; they apply an auto-encoder architecture with skip connections to recover the saturated pixels in LDR images. However, they do not consider the restoration of under-exposed regions, since the network is trained only on saturated pixels. Marnerides et al. [8] argue that up/down-sampling structures introduce artifacts into the results, so they propose a multi-branch network without sampling to convert LDR content to HDR; this method lacks the ability to recover over-exposed regions. Ning et al. [10] introduce a generative adversarial regularizer to improve the quality of the results. Jang et al. [24] maintain that color is vital for ITM and hence adopt a network architecture that learns the dynamic range and the color difference respectively. As opposed to the direct methods, the indirect methods do not generate HDR images directly: they first generate multi-exposed LDR images and then merge them with conventional methods. The main difference among these methods is how the multi-exposure images are obtained. Endo et al. [9] combine 2-D convolution with 3-D convolution to produce a series of differently exposed images from a single LDR input; this method is time-consuming because of the 3-D convolutions. Lee et al. [11] propose a method based on a convolutional neural network composed of dilated convolutional layers that infers LDR images with various exposures from a single LDR image; however, this architecture is complex and redundant. Lee et al. [12] then improve the method with a recursive structure to reduce the scale of the network and a conditional generative adversarial network to promote quality. Still, the indirect methods encounter some problems. One is that the exposure values are fixed, such as ±1, ±2, ±3: if the input is over/under-exposed, the generated LDR images will be too bright or too dark, which affects the quality of the final HDR results. Another is that the conventional merging methods are not robust enough.
Moreover, all of the methods mentioned above mainly aim at recovering high-quality HDR information while ignoring compression artifacts removal, which is also a vital part of the compressed-image ITM scenario.
B. Compression Artifacts Removal
LDR images are usually stored in lossy compression formats, and JPEG is one of the most common standards. Since the human visual system is not good at perceiving variations in high-frequency components, the quantization intervals of the high-frequency components are much larger than those of the low-frequency components. Therefore, several compression artifacts are introduced into LDR images, such as blocking, ringing, and banding; that is to say, the main loss is high-frequency information. Many methods have been proposed to remove these compression artifacts. Foi et al. [15] treat the problem as a denoising problem and design a filter-based method to reduce the artifacts. Chang et al. [14] utilize sparse coding to restore the information lost during compression. Dong et al. [2] first apply deep learning to solve this problem with a shallow convolutional network. Zhang et al. [16] design a network with larger receptive fields and take full advantage of redundancies in both the pixel and DCT domains to improve performance.
Why and How to Solve ITM and Compression Artifacts Removal Simultaneously?
As illustrated above, ITM of compressed images is a severely ill-posed problem: we must handle compression artifacts removal while recovering high-quality HDR information. Current ITM methods mainly pay attention to the latter while ignoring the reduction of compression artifacts, which limits their practical application. A simple idea for learning-based methods is to augment the dataset with compressed inputs. In previous research, Eilertsen et al. [1] showed that the quality of the recovered HDR results drops substantially if the network is trained on compressed data, which means this problem can hardly be solved only by augmenting the training data. This may be because a one-stage network can hardly reduce compression artifacts and recover HDR information simultaneously. Another direct thought is cascading artifacts removal with the other subproblems, i.e., treating artifacts removal as a pre/post-operation. However, in our experiments we find that if the artifacts are reduced first, the result image is too smooth and loses important texture; on the contrary, if artifacts removal is treated as a post-procedure, the other recovery operations boost the artifacts, increasing the difficulty of artifacts reduction. Fig. 1 shows an intuitive visualization. We select Arcnn [2], a state-of-the-art compression artifacts removal method, as the artifact reduction method and Hdrcnn [1] as the ITM method. To be more specific, Hdrcnn1 indicates the model trained without a compressed dataset and Hdrcnn2 the model trained with one.
As discussed in Section II.B, the artifacts are mainly embedded in high-frequency regions, so we attempt to solve ITM for compressed images by decomposing the image into high- and low-frequency components and restoring the information via different subnetworks respectively. We utilize the guided filter [17] to perform the decomposition. Generally, the high-frequency and low-frequency components are called the detail and base layer respectively.
Proposed Method
Fig.2 shows the overview of our network architecture. The pipeline of our method is as follows:
Decompose the input LDR image and the HDR label into detail and base layers with the guided filter [17];
Map the input detail and base layers toward the HDR labels via the two recovery subnetworks respectively, and sum the outputs to obtain a coarse result;
Estimate the final result from the output of stage (2) via the merge subnetwork.
In addition, to make the processing more stable and improve performance, we adopt a multi-stage training strategy and introduce perceptual and adversarial losses.
A. Decomposition Method
Image decomposition is widely used in image processing, such as smoothing, low-light image enhancement, and tone mapping. Through decomposition, we can process the different components more effectively. Generally, after decomposition an image can be expressed as \begin{equation*} I= I_{base} + I_{detail}.\tag{3}\end{equation*}
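As a minimal sketch of this decomposition, the base layer can be obtained by self-guided filtering and the detail layer as the residual. The snippet below uses OpenCV's `guidedFilter` (from opencv-contrib-python); the radius and regularization values are illustrative assumptions, not the paper's settings:

```python
import cv2
import numpy as np

def decompose(img, radius=8, eps=1e-2):
    """Split an image into base (low-frequency) and detail (high-frequency) layers."""
    img = img.astype(np.float32)
    # Self-guided filtering: the image serves as its own guide, making the
    # smoothing edge-preserving, as in He et al. [17].
    base = cv2.ximgproc.guidedFilter(img, img, radius, eps)
    detail = img - base  # Eq. (3): I = I_base + I_detail
    return base, detail

# Usage: decompose a normalized LDR input before feeding the subnetworks.
ldr = cv2.imread("input.jpg").astype(np.float32) / 255.0
base, detail = decompose(ldr)
```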
An example of the base and detail layers. The base layer contains the low-frequency components, such as structural and large-object information. The detail layer contains the high-frequency information, mainly edges, boundaries, and compression artifacts. Zooming into the detail layer, we can clearly see that the artifacts, i.e., blocking and ringing, are embedded in this layer.
B. Network Structure
1) Detail Layer Recovery Subnetwork
Considering that the detail layer mainly contains high-frequency components and is relatively sparse, we use a residual network structure without up/down-sampling to guarantee structural integrity and reduce information loss. This subnetwork consists of 2 convolution layers with kernel
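A hedged sketch of such a subnetwork is given below. The exact kernel sizes, channel widths, and number of residual blocks are specified in Fig. 2 rather than recoverable from the text, so the values here are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels):
    y = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(channels, 3, padding="same")(y)
    return layers.Add()([x, y])  # identity skip preserves fine structure

def build_detail_subnet(num_blocks=4, channels=64):
    # No pooling or striding anywhere: feature maps keep the input resolution,
    # which protects the sparse high-frequency content.
    inp = layers.Input(shape=(None, None, 3))
    x = layers.Conv2D(channels, 3, padding="same", activation="relu")(inp)
    for _ in range(num_blocks):
        x = residual_block(x, channels)
    out = layers.Conv2D(3, 3, padding="same")(x)  # predicted HDR detail layer
    return tf.keras.Model(inp, out)
```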
2) Base Layer Recovery Subnetwork
In contrast, the base layer primarily contains rich color and semantic information, which is significant for recovering high-quality HDR information. Therefore, we want to restore these regions using more abundant global semantic information. In order to extract sufficient features, we utilize a U-Net structure, which provides larger receptive fields. The U-Net can be divided into an encoder and a decoder: the encoder maps the image into a feature-space representation, and the decoder transfers these high-dimensional features back to image space. There are four convolution blocks in the encoder, and each convolution block contains two convolution layers. The first convolution layer is implemented with kernel
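The following is a minimal U-Net sketch matching the description above (four encoder blocks of two convolutions each, with a mirrored decoder and skip connections); kernel sizes and channel widths are assumptions, since the paper's exact configuration appears in Fig. 2:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_base_subnet(base_channels=64):
    inp = layers.Input(shape=(None, None, 3))
    skips, x = [], inp
    # Encoder: four blocks of two convolutions each; downsampling enlarges
    # the receptive field to gather global semantic context.
    for i in range(4):
        ch = base_channels * (2 ** i)
        x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPool2D(2)(x)
    # Decoder: upsample and fuse encoder features via skip connections.
    for i in reversed(range(4)):
        ch = base_channels * (2 ** i)
        x = layers.Conv2DTranspose(ch, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skips[i]])
        x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
    out = layers.Conv2D(3, 3, padding="same")(x)  # predicted HDR base layer
    return tf.keras.Model(inp, out)
```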
3) Merge Subnetwork
The outputs of the previous two subnetworks form an initial reconstructed HDR image. However, it still cannot properly recover severely over-exposed regions, especially the regions around light sources. So a merge subnetwork is designed to make the output more robust and accurate. This subnetwork adopts an architecture similar to the detail recovery subnetwork, but with 8 residual blocks.
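Given that similarity, the merge subnetwork can simply reuse the residual builder sketched earlier; the channel width remains an assumption:

```python
# Same residual design as the detail subnetwork, but with 8 residual
# blocks as stated above; the channel count is illustrative.
merge_subnet = build_detail_subnet(num_blocks=8, channels=64)
```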
None of these subnetworks contains Batch Normalization: since the training data is collected from various sources, we find that such normalization is not appropriate.
C. Multi-Stage Training Strategy and Loss Function
In order to train the whole network stably and measure the discrepancy between the output and the ground truth comprehensively, we introduce a perceptual loss and an adversarial loss together with a multi-stage training strategy. In particular, we divide the whole training process into three stages; the perceptual and adversarial losses are used in stages two and three.
1) Stage One
In this stage, only the detail and base layer recovery subnetworks are trained. As analyzed above, the input detail and base layers are mapped toward their HDR counterparts: \begin{equation*} \begin{cases} D_{LDR} \rightarrow D_{HDR},\\ B_{LDR} \rightarrow B_{HDR}. \end{cases}\end{equation*}
\begin{align*} L_{1}\left({D_{LDR}, D_{HDR}}\right)&=\frac{1}{n}\sum_{i}^{n}\left\| D_{HDR}^{i} - H_{D}\left(D_{LDR}^{i}\right)\right\|_{1},\tag{4}\\ L_{2}\left({B_{LDR}, B_{HDR}}\right)&=\frac{1}{n}\sum_{i}^{n}\left\| B_{HDR}^{i} - H_{B}\left(B_{LDR}^{i}\right)\right\|_{2},\tag{5}\end{align*} where $H_{D}$ and $H_{B}$ denote the detail and base recovery subnetworks respectively, and $n$ is the number of training samples.
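A direct sketch of these two stage-one objectives, with $H_D$ and $H_B$ passed in as the subnetworks built above:

```python
import tensorflow as tf

def detail_loss(d_hdr, d_ldr, H_D):
    # Eq. (4): L1 loss suits the sparse detail layer.
    return tf.reduce_mean(tf.abs(d_hdr - H_D(d_ldr)))

def base_loss(b_hdr, b_ldr, H_B):
    # Eq. (5): L2 loss on the smooth base layer.
    return tf.reduce_mean(tf.square(b_hdr - H_B(b_ldr)))
```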
2) Stage Two
The result from stage one provides an initial recovered HDR image; however, the severely over-exposed regions need further reconstruction, and the compression artifacts need further removal. In this stage we optimize only the merge subnetwork while fixing the parameters of the detail and base recovery subnetworks. To improve performance and measure the discrepancy comprehensively, we introduce perceptual and adversarial losses combined with the reconstruction loss.
a: Reconstruction Loss
MAE is selected as the reconstruction loss in this stage; it mainly evaluates pixel differences and encourages the output to match the ground truth at the pixel level. Moreover, we need to pay more attention to the over-exposed regions. To this end, we design a simple mask to separate the over-exposed regions from the others. First, pixels whose values are no less than 0.97 are set to 1 while the rest are set to 0, yielding an initial binary mask. Then we use a Gaussian filter to smooth the mask, similar to the expand map in conventional ITM methods [6], [22]. This can be expressed as \begin{align*} M_{binary}\left({x, y}\right)&=\begin{cases} 1, & I\left({x,y}\right)\geq 0.97,\\ 0, & \text{otherwise}, \end{cases}\tag{6}\\ M\left({x, y}\right)&=M_{binary} \ast G\left({x, y}\right),\tag{7}\\ G\left({x,y}\right)&=Ce^{-\frac{x^{2}+y^{2}}{2\delta^{2}}},\tag{8}\end{align*}
where $\ast$ denotes convolution, $G$ is a Gaussian kernel with bandwidth $\delta$, and the constant $C$ normalizes the kernel such that \begin{equation*} \iint G\left({x, y}\right)dx\,dy=1.\tag{9}\end{equation*}
The reconstruction loss is then \begin{align*} L_{rec}&=L_{pixels} + \lambda \cdot L_{over},\tag{10}\\ L_{pixels}\left({\hat{I}_{HDR}, I_{HDR}}\right)&=\frac{1}{n}\sum_{i}^{n}\left\| I_{HDR}^{i} - H_{M}\left(\hat{I}_{HDR}^{i}\right)\right\|_{2},\tag{11}\\ L_{over}&=\textbf{M} \cdot L_{pixels},\tag{12}\end{align*} where $H_{M}$ denotes the merge subnetwork, $\hat{I}_{HDR}$ is the coarse result from stage one, $\textbf{M}$ is the smoothed mask, and $\lambda$ balances the two terms.
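A hedged sketch of the mask (Eqs. 6–9) and the masked reconstruction loss (Eqs. 10–12); the Gaussian width, kernel size, and weight lambda are illustrative assumptions:

```python
import tensorflow as tf

def overexposure_mask(ldr, sigma=5.0, ksize=21):
    # Eq. (6): binarize at the saturation threshold 0.97.
    binary = tf.cast(ldr >= 0.97, tf.float32)
    mask = tf.reduce_max(binary, axis=-1, keepdims=True)  # any saturated channel
    # Eqs. (7)-(9): smooth with a normalized Gaussian kernel (expand-map style).
    ax = tf.range(-(ksize // 2), ksize // 2 + 1, dtype=tf.float32)
    g = tf.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = tf.tensordot(g, g, axes=0)
    kernel /= tf.reduce_sum(kernel)  # Eq. (9): kernel sums to 1
    kernel = tf.reshape(kernel, [ksize, ksize, 1, 1])
    return tf.nn.conv2d(mask, kernel, strides=1, padding="SAME")

def reconstruction_loss(pred, gt, mask, lam=0.5):
    # The prose says MAE while Eq. (11) writes a 2-norm; we follow the equation.
    per_pixel = tf.square(gt - pred)           # Eq. (11)
    l_pixels = tf.reduce_mean(per_pixel)
    l_over = tf.reduce_mean(mask * per_pixel)  # Eq. (12): mask-weighted
    return l_pixels + lam * l_over             # Eq. (10)
```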
b: Perceptual Loss
The perceptual loss was first introduced in the super-resolution task with great success [20]. It effectively measures the difference between the output and the ground truth at the semantic level. In general, the perceptual loss is implemented through a pre-trained network, such as VGG19 or Resnet50: the output and the ground truth are fed into the pre-trained network, and several specific feature maps are extracted to compute the cost. Briefly, denoting the feature map of the $l$th selected layer as $\phi_{l}$, the loss is \begin{equation*} L_{vgg}\left({\hat{I}_{HDR}, I_{HDR}}\right)= \frac{1}{N}\sum_{l}^{N}\left\| \phi_{l}\left({\hat{I}_{HDR}}\right) - \phi_{l}\left({I_{HDR}}\right)\right\|_{1},\tag{13}\end{equation*} where $N$ is the number of selected layers.
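A sketch of Eq. (13) using a frozen VGG19; which layers are tapped is an assumption, since the paper does not list them here:

```python
import tensorflow as tf

_vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
_taps = ["block2_conv2", "block3_conv4", "block4_conv4"]  # assumed tap layers
_feat = tf.keras.Model(_vgg.input, [_vgg.get_layer(n).output for n in _taps])
_feat.trainable = False

def perceptual_loss(pred, gt):
    # Eq. (13): mean L1 distance between VGG19 feature maps. Inputs are
    # assumed to be scaled to [0, 1] before VGG preprocessing.
    fp = _feat(tf.keras.applications.vgg19.preprocess_input(pred * 255.0))
    fg = _feat(tf.keras.applications.vgg19.preprocess_input(gt * 255.0))
    return tf.add_n([tf.reduce_mean(tf.abs(a - b))
                     for a, b in zip(fp, fg)]) / len(_taps)
```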
c: Adversarial Loss
As discussed in [28], using only a pixel-level loss function leads to blurry outputs, while an adversarial loss can make the results more realistic by recovering more high-frequency information. So we introduce an adversarial loss into our network. The generator loss is denoted as \begin{equation*} L_{GAN} = \frac{1}{M}\sum_{i=1}^{M}\left\| D\left({\hat{I}_{HDR}^{i}}\right) - 1\right\|_{2},\tag{14}\end{equation*}
and the discriminator is optimized with the corresponding least-squares objective \begin{equation*} L_{D} = \frac{1}{2}\left({\frac{1}{M}\sum_{i=1}^{M}\left\| D\left({I_{HDR}^{i}}\right) - 1\right\|_{2} + \frac{1}{M}\sum_{i=1}^{M}\left\| D\left({\hat{I}_{HDR}^{i}}\right)\right\|_{2}}\right),\tag{15}\end{equation*} where $D$ denotes the discriminator and $M$ the batch size.
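These are least-squares GAN objectives, which translate directly to code; `disc` stands for the discriminator network:

```python
import tensorflow as tf

def generator_gan_loss(disc, fake):
    # Eq. (14): the generator pushes D(fake) toward 1.
    return tf.reduce_mean(tf.square(disc(fake) - 1.0))

def discriminator_loss(disc, real, fake):
    # Eq. (15): D(real) toward 1, D(fake) toward 0.
    return 0.5 * (tf.reduce_mean(tf.square(disc(real) - 1.0))
                  + tf.reduce_mean(tf.square(disc(fake))))
```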
The structure of the discriminator. "K
Finally, we combine the three losses as \begin{equation*} L_{all} = \alpha \cdot L_{rec} + \beta \cdot L_{vgg} + \gamma \cdot L_{GAN},\tag{16}\end{equation*} where $\alpha$, $\beta$, and $\gamma$ weight the respective terms.
3) Stage Three
For the purpose of promoting perceptual quality, we fine-tune the parameters of all three subnetworks. In this stage, we remove the loss functions used in the first stage and jointly fine-tune the whole system with the loss function of stage two. In other words, the whole network is fine-tuned end-to-end, using the pre-trained subnetworks as initialization. Additionally, since we predict in the log domain, the final result is:\begin{equation*} I_{HDR_{final}} = e^{I_{HDR}}.\tag{17}\end{equation*}
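Putting the three stages together, the schedule can be sketched as below (reusing the builders from Section IV.B; the freezing mechanism is one plausible TensorFlow realization, not necessarily the authors' exact implementation):

```python
import tensorflow as tf

detail_net = build_detail_subnet()             # trained in stage one
base_net = build_base_subnet()                 # trained in stage one
merge_net = build_detail_subnet(num_blocks=8)  # trained in stage two

# Stage two: freeze the recovery subnetworks, optimize only the merge net.
detail_net.trainable = False
base_net.trainable = False
# ... train merge_net with L_all (Eq. 16) ...

# Stage three: unfreeze everything and fine-tune end-to-end with L_all.
detail_net.trainable = True
base_net.trainable = True

def predict_hdr(ldr_base, ldr_detail):
    coarse = base_net(ldr_base) + detail_net(ldr_detail)  # coarse HDR estimate
    log_hdr = merge_net(coarse)
    return tf.exp(log_hdr)  # Eq. (17): map the log-domain prediction back
```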
Experiment
A. Dataset
We collect our dataset from [32]–[39]. About 3000 HDR images are gathered; we select 300 images as the test set and the rest as the training set. Since we cannot obtain the corresponding LDR images directly, we design a degradation model to simulate the camera imaging process. More precisely, we utilize four camera response curves (CRFs) clustered from [40] to compress the dynamic range; to imitate abnormal exposure conditions, the highest 5%~0% of pixels are randomly set to be saturated. Then we save these images in JPEG format with different quality factors (the QF varies from 10 to 70). After cropping and flipping, about 30,000 LDR/HDR training patch pairs are created. Fig. 6 shows several typical examples from our dataset. We implemented our method in TensorFlow and train the network from scratch with batch size 8 for 200,000 iterations. Adam [41] is adopted as the optimizer with an initial learning rate of
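A hedged sketch of this degradation pipeline is shown below; the gamma curves are placeholders for the four CRFs clustered from [40], and the exact sampling choices are assumptions:

```python
import cv2
import numpy as np

def synthesize_ldr(hdr, rng):
    # Placeholder gamma curves stand in for the four clustered CRFs [40].
    gamma = rng.choice([1.8, 2.0, 2.2, 2.4])
    ldr = np.clip(hdr / max(hdr.max(), 1e-6), 0.0, 1.0) ** (1.0 / gamma)
    # Randomly saturate the brightest 0-5% of pixels (abnormal exposure).
    pct = rng.uniform(0.0, 0.05)
    if pct > 0:
        thr = np.quantile(ldr, 1.0 - pct)
        ldr = np.minimum(ldr / max(thr, 1e-6), 1.0)
    # JPEG round-trip with a random quality factor in [10, 70].
    qf = int(rng.integers(10, 71))
    _, buf = cv2.imencode(".jpg", (ldr * 255).astype(np.uint8),
                          [cv2.IMWRITE_JPEG_QUALITY, qf])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR).astype(np.float32) / 255.0

rng = np.random.default_rng(0)
```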
B. Comparisons With Other State of the Art Methods
We select three other state-of-the-art learning-based methods [1], [8], [9] as baselines. Briefly, we denote [1] as Hdrcnn1, [8] as Expand-net, and [9] as DrTMO. In addition, to demonstrate that our method removes compression artifacts and recovers high-quality HDR information simultaneously, we combine a state-of-the-art compression artifacts removal method, Arcnn [2], as a pre/post-operation with Hdrcnn2. Furthermore, we also provide the results of the different training stages.
1) Quantitative Evaluation
For quantitative evaluation, five metrics are selected: HDR-VDP-2 [42], PU-PSNR, PU-MS_SSIM [43], log_PSNR, and Weber_MSE [44]. HDR-VDP-2 is the most common metric in ITM; its values are those of the VDP-Q quality score, which represents the degradation of the ITM result with respect to the ground truth. Since HDR images have a much larger dynamic range than LDR images, for a fair comparison we apply perceptual uniformity (PU) encoding [45] to the prediction and the ground truth when computing PSNR and MS_SSIM. log_PSNR computes the PSNR in the log domain [44], which is closer to the human visual system (HVS). Weber_MSE calculates the mean squared error between the reference and the prediction using Weber ratios. A larger HDR-VDP-2 value indicates less degradation between the recovered HDR image and the ground truth; a higher score on the PSNR-type metrics means less pixel-level loss, as does a lower Weber_MSE; a larger MS_SSIM score means higher structural fidelity.
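As an illustration of the log-domain idea, one plausible log_PSNR implementation is sketched below; the exact encoding and peak definition in [44] may differ, so this is an assumption rather than the reference implementation:

```python
import numpy as np

def log_psnr(pred, gt, eps=1e-6):
    # PSNR computed on log-encoded HDR values, roughly matching HVS response.
    lp, lg = np.log(pred + eps), np.log(gt + eps)
    peak = lg.max() - lg.min()  # assumed peak definition
    mse = np.mean((lp - lg) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```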
The experimental results are shown in Table 1. Our method shows superiority over the other state-of-the-art methods on all metrics for compressed images. A typical visualization of the HDR-VDP-2 result is shown in Fig. 9, where regions closer to blue indicate that the recovered HDR image has less degradation from the ground truth. In addition, Table 2 reports the quantitative results of the different loss functions, demonstrating the effect of the hybrid loss.
Some normal and under-exposed scenes with compression artifacts; our method recovers the missing information artifact-free compared with the other methods.
Some severely over-exposed scenes. Our method properly restores these saturated regions artifact-free.
A typical visualization of the HDR-VDP-2 result; images closer to blue indicate that the reconstructed HDR image is more similar to the ground truth. (a) Hdrcnn1. (b) DrTMO. (c) Expand-net. (d) Hdrcnn2 + Arcnn. (e) Arcnn + Hdrcnn2. (f) Ours.
According to Table 1, our method outperforms the other state-of-the-art methods on all metrics: we achieve the highest HDR-VDP-2, PU-PSNR, PU-MS_SSIM, and log_PSNR scores and the lowest Weber_MSE, which means our results have high fidelity.
2) Subjective Evaluation
Some typical visualization results are shown in Fig. 7 and Fig. 8: Fig. 7 shows normal and under-exposed scenes, and Fig. 8 shows severely over-exposed scenes. For convenience of display, all images are tone-mapped with the Reinhard tone mapping algorithm [27]. We zoom in on the relevant regions for detailed comparison.
These results show the clear superiority of our method over the others. More precisely, our method restores the missing information in large, severely over/under-exposed regions (see Fig. 8). More importantly, the compression artifacts are removed simultaneously, which is in line with our expectations.
As for Hdrcnn, it can recover saturated pixels but fails in some severely over-exposed areas. Besides, it cannot properly restore under-exposed regions, since those pixels are not involved in the training process. What is more, even when we use the model trained on the compressed dataset, its artifact removal ability is limited: blocking and ringing artifacts still exist in the results. For Hdrcnn2+Arcnn, the noise and artifacts are boosted by the ITM operation, which makes them harder to reduce; a specific example is the pillar in Fig. 7. The problem with Arcnn+Hdrcnn2 is that taking Arcnn as a pre-operation makes the result smoother and loses much texture information. Expand-net can generate convincing results for some scenes; however, it cannot restore the details in the under/over-exposed regions well, such as the electric fan and the sun in Fig. 8. DrTMO usually produces over-enhanced results and cannot reconstruct the over-exposed regions; the reason may be that it is hard to create correct multi-exposure LDR images from a single LDR image, and the result is easily affected by the conventional merging method. Moreover, none of these methods can eliminate the compression artifacts.
Conclusion and Discussion
In this paper we propose a decomposition-based inverse tone mapping (ITM) network for compressed LDR images. Since images and videos are usually compressed in many real-world ITM applications, our method attempts to simultaneously remove compression artifacts and reconstruct high-quality HDR information. More precisely, we use a guided filter to decompose the LDR image into base and detail layers, then recover them through the corresponding subnetworks respectively.
In the future, we believe it is promising to combine the super-resolution task with inverse tone mapping to further improve the viewing experience.