Introduction
High dynamic range (HDR) images offer a far wider dynamic range than low dynamic range (LDR) images and carry more detailed information, so they hold great promise for improving the viewing experience. With the development of display technology, more and more devices support HDR features. However, direct acquisition with HDR cameras is still not widespread, and a large amount of legacy LDR media cannot be re-shot. It is therefore necessary to apply inverse tone mapping (ITM) techniques to convert LDR content to HDR content. Conventional ITM methods [3]–[7] mainly aim to make LDR images display more pleasingly on HDR monitors, but they cannot restore the information in over/under-exposed regions. Recently, several learning-based ITM methods have been proposed [1], [8]–[12], which show large performance improvements over the conventional methods. However, these algorithms mainly target high-quality LDR images without compression loss. Unfortunately, in practice, many legacy media sources and web images are stored in lossy compression formats, which introduces compression artifacts such as blocking and ringing. It is therefore necessary to remove these artifacts in the ITM scenario [13].

In this paper, we propose an end-to-end deep network that reconstructs an HDR image from a single compressed LDR image; in other words, we simultaneously reduce the compression artifacts and reconstruct high-quality HDR information. Since there is already plenty of research on compression artifacts removal [2], [14]–[16], a simple idea is serial processing: first remove the artifacts and then perform the other task, or vice versa. However, if artifacts removal is taken as a pre-operation, the result loses a large amount of texture information, which makes the final output too smooth; if it is taken as a post-operation, the artifacts become more difficult to remove because the preceding processing amplifies them. An example is shown in Fig. 1.
As the figure shows, (a) is the ground-truth HDR image and (b) is the corresponding compressed LDR image, which has an over-exposed region with compression artifacts. If an existing ITM network [1] trained on a compressed dataset is applied directly, as in (c), it is hard to reduce the compression artifacts and reconstruct the over-exposed region. If compression artifacts removal [2] is taken as a pre-operation, as in (d), the result loses much texture information and is too smooth. If it is taken as a post-operation, as in (e), the artifacts are amplified by the preceding operations and become much harder to reduce. As (f) shows, our method simultaneously removes the artifacts and recovers the over/under-exposed regions, in line with our expectations. (Hdrcnn1 and Hdrcnn2 denote the network of [1] trained without and with a compressed dataset, respectively; Arcnn [2] is a state-of-the-art compression artifacts removal method.)
Further considering that the artifacts mainly reside in the high-frequency region (detailed in Sections II and III), we first decompose the input LDR image into low- and high-frequency components, called the base and detail layers respectively, via an edge-preserving process [17], [18]. Specifically, the base layer is the filtered image, while the detail layer is the difference between the input and the base layer. Based on this decomposition, we design three subnetworks to complete the whole ITM task: a detail-layer recovery subnetwork, a base-layer recovery subnetwork, and a merge subnetwork. The architecture of the network is shown in Fig. 2. Because the detail layer is relatively sparse, we use a residual network without up/down-sampling to preserve its structural integrity. The base layer, in contrast, carries rich information, so we need multi-scale information to restore over/under-exposure and perform tone expansion correctly; a U-Net-like structure [19] is therefore adopted for this subnetwork. An initial result is obtained by summing the outputs of these two subnetworks. To reduce color shift and further restore severely over-exposed regions, a merge subnetwork is added; it adopts a structure similar to the detail-layer recovery subnetwork. Since the three subnetworks reconstruct the detail layer, the base layer, and the final result respectively, we correspondingly introduce a multi-stage training strategy: first, the detail and base recovery subnetworks are trained separately; then the merge subnetwork is trained; finally, the whole network is fine-tuned end-to-end. For further enhancement, we introduce a perceptual loss and an adversarial loss: the perceptual loss [20] encourages the output to approximate the ground truth at the semantic level, while the adversarial loss [21] pushes the output to recover more detail and look more realistic. Our contributions can be summarized as:
The overview of our network. First, the input LDR image is decomposed into a detail layer and a base layer by the guided filter [17]; then reconstruction is conducted by the two recovery subnetworks separately; finally the outputs are fused by the merge subnetwork. The blue dotted lines indicate the different training stages for compressed images.
Focusing on the real-world compressed-image HDR problem, the proposed decomposition-based method removes compression artifacts and restores high-quality HDR information simultaneously.
We propose an effective multi-stage training strategy that pre-trains each subnetwork before training the whole network, which benefits performance.
Experimental results show that our method produces convincing results and outperforms other state-of-the-art methods.
Related Work
A. Inverse Tone Mapping
Generally, ITM can be regarded as an image restoration problem [11], which can be modeled as:\begin{equation*} I_{LDR} = f_{TM}\left ({I_{HDR}}\right),\tag{1}\end{equation*} where $f_{TM}$ denotes the tone mapping process. For compressed inputs, the lossy compression must also be modeled:\begin{equation*} I_{LDR} = f_{Compress}\left ({f_{TM}\left ({I_{HDR}}\right)}\right) + \epsilon,\tag{2}\end{equation*} where $f_{Compress}$ denotes the compression process and $\epsilon$ the resulting compression error.
The conventional ITM methods can be divided into two categories: global and local model methods. Landis et al. [3] propose a power-function-based method to expand the luminance of LDR images, primarily for image-based lighting applications. Akyüz et al. [4] find that a simple linear expansion can provide the most favored viewing experience on HDR displays. Meylan et al. [5] first detect the diffuse and specular regions of the input image, and then use different expansion functions for the different regions. Didyk et al. [22] put forward a classification-based method, which first classifies the content into lights, reflections, and diffuse parts and then enhances the different components with different curves. Rempel et al. [6] utilize an expand map to guide the expansion of LDR images. Wang et al. [7] propose an interactive method, which can recover over/under-exposed regions while boosting the luminance. However, these methods are model-driven: their various parameters are not friendly to non-expert users, and they can hardly reconstruct the over/under-exposed regions. Recently, several learning-based methods attempt to solve the ITM problem. These methods can be further divided into two categories: direct and indirect.
The direct methods predict the HDR image from a single LDR image directly. Zhang and Lalonde [23] design a network to produce HDR images from LDR images, specifically for daytime outdoor panoramas; however, limited by the input resolution (64×128), it can hardly restore realistic details. Eilertsen et al. [1] demonstrate that the most important part of ITM is the restoration of saturated regions; they apply an auto-encoder architecture with skip connections to recover the saturated pixels in LDR images. However, they do not consider the restoration of under-exposed regions, since the network is trained only on saturated pixels. Marnerides et al. [8] argue that up/down-sampling structures introduce artifacts into the results, so they propose a multi-branch network without sampling to convert LDR content to HDR; this method lacks the ability to recover over-exposed regions. Ning et al. [10] introduce a generative adversarial regularizer to improve the quality of the results. Jang et al. [24] maintain that color is vital for ITM and hence adopt a network architecture that learns the dynamic range and the color difference respectively. As opposed to the direct methods, the indirect methods do not generate HDR images directly: they first generate multi-exposed LDR images and then merge them with conventional methods. The main difference among these methods is how the multi-exposure images are obtained. Endo et al. [9] combine 2-D convolution with 3-D convolution to produce a series of differently exposed images from a single LDR input; this method is time-consuming because of the 3-D convolutions. Lee et al. [11] propose a method based on a convolutional neural network composed of dilated convolutional layers that infers LDR images with various exposures from a single LDR image; however, this architecture is complex and redundant. Lee et al. [12] then improve the method with a recursive structure to reduce the scale of the network and a conditional generative adversarial network to promote quality. Still, the indirect methods encounter some problems. One is that the exposure values are fixed, such as ±1, ±2, ±3: if the input is over/under-exposed, the generated LDR images will be too bright or too dark, which affects the quality of the final HDR results. Another is that the conventional merging methods are not robust enough.
Moreover, all of the methods mentioned above mainly aim at recovering high-quality HDR information while ignoring compression artifacts removal, which is also a vital part of the compressed-image ITM scenario.
B. Compression Artifacts Removal
LDR images are usually stored in lossy compression formats, and JPEG is one of the most common standards. Since the human visual system is not good at perceiving variations in high-frequency components, the quantization intervals of the high-frequency components are much larger than those of the low-frequency components. Therefore, several compression artifacts are introduced into LDR images, such as blocking, ringing, and banding; that is to say, the main loss is high-frequency information. Many methods have been proposed to remove these compression artifacts. Foi et al. [15] treat the problem as a denoising problem and design a filter-based method to reduce the artifacts. Chang et al. [14] utilize sparse coding to restore the information lost during compression. Dong et al. [2] first apply deep learning to solve this problem with a shallow convolutional network. Zhang et al. [16] design a network with larger receptive fields and take full advantage of redundancies in both the pixel and DCT domains to improve performance.
Why and How to Solve ITM and Compression Artifacts Removal Simultaneously?
As illustrated above, ITM of compressed images is a severely ill-posed problem: we must handle compression artifacts removal while recovering high-quality HDR information. Current ITM methods mainly pay attention to the latter while ignoring the reduction of compression artifacts, which limits their practical application. A simple idea for learning-based methods is to augment the dataset with compressed inputs. In previous research, Eilertsen et al. [1] showed that the quality of the recovered HDR results drops substantially if the network is trained on compressed data, which means this problem can hardly be solved only by augmenting the training data. This may be because a one-stage network can hardly reduce compression artifacts and recover HDR information simultaneously. Another direct thought is cascading artifacts removal with the other subproblems, i.e., treating artifacts removal as a pre/post-operation. However, in our experiments we find that if the artifacts are reduced first, the result image is too smooth and loses important texture; on the contrary, if artifacts removal is treated as a post-procedure, the other recovery operations boost the artifacts, increasing the difficulty of artifacts reduction. Fig. 1 shows an intuitive visualization. We select Arcnn [2], a state-of-the-art compression artifacts removal method, as the artifact reduction method and Hdrcnn [1] as the ITM method. To be more specific, Hdrcnn1 indicates the model trained without a compressed dataset and Hdrcnn2 the model trained with one.
As discussed in Section II.B, the artifacts are mainly embedded in high-frequency regions, so we attempt to solve ITM for compressed images by decomposing the image into high- and low-frequency components and restoring the information via different subnetworks respectively. We utilize the guided filter [17] to perform the decomposition. Generally, the high-frequency and low-frequency components are called the detail and base layer respectively.
Proposed Method
Fig.2 shows the overview of our network architecture. The pipeline of our method is as follows:
Decompose the input LDR image and the HDR label into detail and base layers with the guided filter [17];
Map the input detail and base layers toward the HDR labels via the two recovery subnetworks respectively, and sum the outputs to obtain a coarse result;
Estimate the final result from the output of stage (2) via the merge subnetwork.
In addition, to make the processing more stable and improve performance, we adopt a multi-stage training strategy and introduce perceptual and adversarial losses.
A. Decomposition Method
Image decomposition is widely used in image processing, such as smoothing, low-light image enhancement, and tone mapping. Through decomposition, we can process the different components more effectively. Generally, after decomposition an image can be expressed as \begin{equation*} I= I_{base} + I_{detail}.\tag{3}\end{equation*}
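As a minimal sketch of this decomposition, the base layer can be obtained by self-guided filtering and the detail layer as the residual. The snippet below uses OpenCV's `guidedFilter` (from opencv-contrib-python); the radius and regularization values are illustrative assumptions, not the paper's settings:

```python
import cv2
import numpy as np

def decompose(img, radius=8, eps=1e-2):
    """Split an image into base (low-frequency) and detail (high-frequency) layers."""
    img = img.astype(np.float32)
    # Self-guided filtering: the image serves as its own guide, making the
    # smoothing edge-preserving, as in He et al. [17].
    base = cv2.ximgproc.guidedFilter(img, img, radius, eps)
    detail = img - base  # Eq. (3): I = I_base + I_detail
    return base, detail

# Usage: decompose a normalized LDR input before feeding the subnetworks.
ldr = cv2.imread("input.jpg").astype(np.float32) / 255.0
base, detail = decompose(ldr)
```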
An example of the base and detail layers. The base layer contains the low-frequency components, such as structural and large-object information. The detail layer contains the high-frequency information, mainly edges, boundaries, and compression artifacts. Zooming into the detail layer, we can clearly see that the artifacts, i.e., blocking and ringing, are embedded in this layer.
B. Network Structure
1) Detail Layer Recovery Subnetwork
Considering that the detail layer mainly contains high-frequency components and is relatively sparse, we use a residual network structure without up/down-sampling to guarantee structural integrity and reduce information loss. This subnetwork consists of 2 convolution layers with kernel
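A hedged sketch of such a subnetwork is given below. The exact kernel sizes, channel widths, and number of residual blocks are specified in Fig. 2 rather than recoverable from the text, so the values here are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels):
    y = layers.Conv2D(channels, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(channels, 3, padding="same")(y)
    return layers.Add()([x, y])  # identity skip preserves fine structure

def build_detail_subnet(num_blocks=4, channels=64):
    # No pooling or striding anywhere: feature maps keep the input resolution,
    # which protects the sparse high-frequency content.
    inp = layers.Input(shape=(None, None, 3))
    x = layers.Conv2D(channels, 3, padding="same", activation="relu")(inp)
    for _ in range(num_blocks):
        x = residual_block(x, channels)
    out = layers.Conv2D(3, 3, padding="same")(x)  # predicted HDR detail layer
    return tf.keras.Model(inp, out)
```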
2) Base Layer Recovery Subnetwork
In contrast, the base layer primarily contains rich color and semantic information, which is significant for recovering high-quality HDR information. Therefore, we want to restore these regions using more abundant global semantic information. In order to extract sufficient features, we utilize a U-Net structure, which provides larger receptive fields. The U-Net can be divided into an encoder and a decoder: the encoder maps the image into a feature-space representation, and the decoder transfers these high-dimensional features back to image space. There are four convolution blocks in the encoder, and each convolution block contains two convolution layers. The first convolution layer is implemented with kernel
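The following is a minimal U-Net sketch matching the description above (four encoder blocks of two convolutions each, with a mirrored decoder and skip connections); kernel sizes and channel widths are assumptions, since the paper's exact configuration appears in Fig. 2:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_base_subnet(base_channels=64):
    inp = layers.Input(shape=(None, None, 3))
    skips, x = [], inp
    # Encoder: four blocks of two convolutions each; downsampling enlarges
    # the receptive field to gather global semantic context.
    for i in range(4):
        ch = base_channels * (2 ** i)
        x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPool2D(2)(x)
    # Decoder: upsample and fuse encoder features via skip connections.
    for i in reversed(range(4)):
        ch = base_channels * (2 ** i)
        x = layers.Conv2DTranspose(ch, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skips[i]])
        x = layers.Conv2D(ch, 3, padding="same", activation="relu")(x)
    out = layers.Conv2D(3, 3, padding="same")(x)  # predicted HDR base layer
    return tf.keras.Model(inp, out)
```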
3) Merge Subnetwork
The outputs of the previous two subnetworks form an initial reconstructed HDR image. However, it still cannot properly recover severely over-exposed regions, especially the regions around light sources. So a merge subnetwork is designed to make the output more robust and accurate. This subnetwork adopts an architecture similar to the detail recovery subnetwork, but with 8 residual blocks.
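Given that similarity, the merge subnetwork can simply reuse the residual builder sketched earlier; the channel width remains an assumption:

```python
# Same residual design as the detail subnetwork, but with 8 residual
# blocks as stated above; the channel count is illustrative.
merge_subnet = build_detail_subnet(num_blocks=8, channels=64)
```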
None of these subnetworks contains Batch Normalization: since the training data is collected from various sources, we find that such normalization is not appropriate.
C. Multi-Stage Training Strategy and Loss Function
In order to train the whole network stably and measure the discrepancy between the output and the ground truth comprehensively, we introduce a perceptual loss and an adversarial loss together with a multi-stage training strategy. In particular, we divide the whole training process into three stages; the perceptual and adversarial losses are used in stages two and three.
1) Stage One
In this stage, only the detail and base layer recovery subnetworks are trained. As analyzed above, the input detail and base layers are mapped toward their HDR counterparts: \begin{equation*} \begin{cases} D_{LDR} \rightarrow D_{HDR},\\ B_{LDR} \rightarrow B_{HDR}. \end{cases}\end{equation*}
\begin{align*} L_{1}\left({D_{LDR}, D_{HDR}}\right)&=\frac{1}{n}\sum_{i}^{n}\left\| D_{HDR}^{i} - H_{D}\left(D_{LDR}^{i}\right)\right\|_{1},\tag{4}\\ L_{2}\left({B_{LDR}, B_{HDR}}\right)&=\frac{1}{n}\sum_{i}^{n}\left\| B_{HDR}^{i} - H_{B}\left(B_{LDR}^{i}\right)\right\|_{2},\tag{5}\end{align*} where $H_{D}$ and $H_{B}$ denote the detail and base recovery subnetworks respectively, and $n$ is the number of training samples.
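A direct sketch of these two stage-one objectives, with $H_D$ and $H_B$ passed in as the subnetworks built above:

```python
import tensorflow as tf

def detail_loss(d_hdr, d_ldr, H_D):
    # Eq. (4): L1 loss suits the sparse detail layer.
    return tf.reduce_mean(tf.abs(d_hdr - H_D(d_ldr)))

def base_loss(b_hdr, b_ldr, H_B):
    # Eq. (5): L2 loss on the smooth base layer.
    return tf.reduce_mean(tf.square(b_hdr - H_B(b_ldr)))
```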
2) Stage Two
The result from stage one provides an initial recovered HDR image; however, the severely over-exposed regions need further reconstruction, and the compression artifacts need further removal. In this stage we optimize only the merge subnetwork while fixing the parameters of the detail and base recovery subnetworks. To improve performance and measure the discrepancy comprehensively, we introduce perceptual and adversarial losses combined with the reconstruction loss.
a: Reconstruction Loss
MAE is selected as the reconstruction loss in this stage; it mainly evaluates pixel differences and encourages the output to match the ground truth at the pixel level. Moreover, we need to pay more attention to the over-exposed regions. To this end, we design a simple mask to separate the over-exposed regions from the others. First, pixels whose values are no less than 0.97 are set to 1 while the rest are set to 0, yielding an initial binary mask. Then we use a Gaussian filter to smooth the mask, similar to the expand map in conventional ITM methods [6], [22]. This can be expressed as \begin{align*} M_{binary}\left({x, y}\right)&=\begin{cases} 1, & I\left({x,y}\right)\geq 0.97,\\ 0, & \text{otherwise}, \end{cases}\tag{6}\\ M\left({x, y}\right)&=M_{binary} \ast G\left({x, y}\right),\tag{7}\\ G\left({x,y}\right)&=Ce^{-\frac{x^{2}+y^{2}}{2\delta^{2}}},\tag{8}\end{align*}
where $\ast$ denotes convolution, $G$ is a Gaussian kernel with bandwidth $\delta$, and the constant $C$ normalizes the kernel such that \begin{equation*} \iint G\left({x, y}\right)dx\,dy=1.\tag{9}\end{equation*}
The reconstruction loss is then \begin{align*} L_{rec}&=L_{pixels} + \lambda \cdot L_{over},\tag{10}\\ L_{pixels}\left({\hat{I}_{HDR}, I_{HDR}}\right)&=\frac{1}{n}\sum_{i}^{n}\left\| I_{HDR}^{i} - H_{M}\left(\hat{I}_{HDR}^{i}\right)\right\|_{2},\tag{11}\\ L_{over}&=\textbf{M} \cdot L_{pixels},\tag{12}\end{align*} where $H_{M}$ denotes the merge subnetwork, $\hat{I}_{HDR}$ is the coarse result from stage one, $\textbf{M}$ is the smoothed mask, and $\lambda$ balances the two terms.
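A hedged sketch of the mask (Eqs. 6–9) and the masked reconstruction loss (Eqs. 10–12); the Gaussian width, kernel size, and weight lambda are illustrative assumptions:

```python
import tensorflow as tf

def overexposure_mask(ldr, sigma=5.0, ksize=21):
    # Eq. (6): binarize at the saturation threshold 0.97.
    binary = tf.cast(ldr >= 0.97, tf.float32)
    mask = tf.reduce_max(binary, axis=-1, keepdims=True)  # any saturated channel
    # Eqs. (7)-(9): smooth with a normalized Gaussian kernel (expand-map style).
    ax = tf.range(-(ksize // 2), ksize // 2 + 1, dtype=tf.float32)
    g = tf.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = tf.tensordot(g, g, axes=0)
    kernel /= tf.reduce_sum(kernel)  # Eq. (9): kernel sums to 1
    kernel = tf.reshape(kernel, [ksize, ksize, 1, 1])
    return tf.nn.conv2d(mask, kernel, strides=1, padding="SAME")

def reconstruction_loss(pred, gt, mask, lam=0.5):
    # The prose says MAE while Eq. (11) writes a 2-norm; we follow the equation.
    per_pixel = tf.square(gt - pred)           # Eq. (11)
    l_pixels = tf.reduce_mean(per_pixel)
    l_over = tf.reduce_mean(mask * per_pixel)  # Eq. (12): mask-weighted
    return l_pixels + lam * l_over             # Eq. (10)
```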
b: Perceptual Loss
The perceptual loss was first introduced in the super-resolution task with great success [20]. It effectively measures the difference between the output and the ground truth at the semantic level. In general, the perceptual loss is implemented through a pre-trained network, such as VGG19 or Resnet50: the output and the ground truth are fed into the pre-trained network, and several specific feature maps are extracted to compute the cost. Briefly, denoting the feature map of the $l$th selected layer as $\phi_{l}$, the loss is \begin{equation*} L_{vgg}\left({\hat{I}_{HDR}, I_{HDR}}\right)= \frac{1}{N}\sum_{l}^{N}\left\| \phi_{l}\left({\hat{I}_{HDR}}\right) - \phi_{l}\left({I_{HDR}}\right)\right\|_{1},\tag{13}\end{equation*} where $N$ is the number of selected layers.
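A sketch of Eq. (13) using a frozen VGG19; which layers are tapped is an assumption, since the paper does not list them here:

```python
import tensorflow as tf

_vgg = tf.keras.applications.VGG19(include_top=False, weights="imagenet")
_taps = ["block2_conv2", "block3_conv4", "block4_conv4"]  # assumed tap layers
_feat = tf.keras.Model(_vgg.input, [_vgg.get_layer(n).output for n in _taps])
_feat.trainable = False

def perceptual_loss(pred, gt):
    # Eq. (13): mean L1 distance between VGG19 feature maps. Inputs are
    # assumed to be scaled to [0, 1] before VGG preprocessing.
    fp = _feat(tf.keras.applications.vgg19.preprocess_input(pred * 255.0))
    fg = _feat(tf.keras.applications.vgg19.preprocess_input(gt * 255.0))
    return tf.add_n([tf.reduce_mean(tf.abs(a - b))
                     for a, b in zip(fp, fg)]) / len(_taps)
```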
c: Adversarial Loss
As discussed in [28], using only a pixel-level loss function leads to blurry outputs, while an adversarial loss can make the results more realistic by recovering more high-frequency information. So we introduce an adversarial loss into our network. The generator loss is denoted as \begin{equation*} L_{GAN} = \frac{1}{M}\sum_{i=1}^{M}\left\| D\left({\hat{I}_{HDR}^{i}}\right) - 1\right\|_{2},\tag{14}\end{equation*}
and the discriminator is optimized with the corresponding least-squares objective \begin{equation*} L_{D} = \frac{1}{2}\left({\frac{1}{M}\sum_{i=1}^{M}\left\| D\left({I_{HDR}^{i}}\right) - 1\right\|_{2} + \frac{1}{M}\sum_{i=1}^{M}\left\| D\left({\hat{I}_{HDR}^{i}}\right)\right\|_{2}}\right),\tag{15}\end{equation*} where $D$ denotes the discriminator and $M$ the batch size.
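These are least-squares GAN objectives, which translate directly to code; `disc` stands for the discriminator network:

```python
import tensorflow as tf

def generator_gan_loss(disc, fake):
    # Eq. (14): the generator pushes D(fake) toward 1.
    return tf.reduce_mean(tf.square(disc(fake) - 1.0))

def discriminator_loss(disc, real, fake):
    # Eq. (15): D(real) toward 1, D(fake) toward 0.
    return 0.5 * (tf.reduce_mean(tf.square(disc(real) - 1.0))
                  + tf.reduce_mean(tf.square(disc(fake))))
```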
The structure of the discriminator. "K
Finally, we combine the three losses as \begin{equation*} L_{all} = \alpha \cdot L_{rec} + \beta \cdot L_{vgg} + \gamma \cdot L_{GAN},\tag{16}\end{equation*} where $\alpha$, $\beta$, and $\gamma$ weight the respective terms.
3) Stage Three
For the purpose of promoting perceptual quality, we fine-tune the parameters of all three subnetworks. In this stage, we remove the loss functions used in the first stage and jointly fine-tune the whole system with the loss function of stage two. In other words, the whole network is fine-tuned end-to-end, using the pre-trained subnetworks as initialization. Additionally, since we predict in the log domain, the final result is:\begin{equation*} I_{HDR_{final}} = e^{I_{HDR}}.\tag{17}\end{equation*}
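Putting the three stages together, the schedule can be sketched as below (reusing the builders from Section IV.B; the freezing mechanism is one plausible TensorFlow realization, not necessarily the authors' exact implementation):

```python
import tensorflow as tf

detail_net = build_detail_subnet()             # trained in stage one
base_net = build_base_subnet()                 # trained in stage one
merge_net = build_detail_subnet(num_blocks=8)  # trained in stage two

# Stage two: freeze the recovery subnetworks, optimize only the merge net.
detail_net.trainable = False
base_net.trainable = False
# ... train merge_net with L_all (Eq. 16) ...

# Stage three: unfreeze everything and fine-tune end-to-end with L_all.
detail_net.trainable = True
base_net.trainable = True

def predict_hdr(ldr_base, ldr_detail):
    coarse = base_net(ldr_base) + detail_net(ldr_detail)  # coarse HDR estimate
    log_hdr = merge_net(coarse)
    return tf.exp(log_hdr)  # Eq. (17): map the log-domain prediction back
```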
Experiment
A. Dataset
We collect our dataset from [32]–[39]. About 3000 HDR images are gathered; we select 300 images as the test set and the rest as the training set. Since we cannot obtain the corresponding LDR images directly, we design a degradation model to simulate the camera imaging process. More precisely, we utilize four camera response curves (CRFs) clustered from [40] to compress the dynamic range; to imitate abnormal exposure conditions, the highest 5%~0% of pixels are randomly set to be saturated. Then we save these images in JPEG format with different quality factors (the QF varies from 10 to 70). After cropping and flipping, about 30,000 LDR/HDR training patch pairs are created. Fig. 6 shows several typical examples from our dataset. We implemented our method in TensorFlow and train the network from scratch with batch size 8 for 200,000 iterations. Adam [41] is adopted as the optimizer with an initial learning rate of
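A hedged sketch of this degradation pipeline is shown below; the gamma curves are placeholders for the four CRFs clustered from [40], and the exact sampling choices are assumptions:

```python
import cv2
import numpy as np

def synthesize_ldr(hdr, rng):
    # Placeholder gamma curves stand in for the four clustered CRFs [40].
    gamma = rng.choice([1.8, 2.0, 2.2, 2.4])
    ldr = np.clip(hdr / max(hdr.max(), 1e-6), 0.0, 1.0) ** (1.0 / gamma)
    # Randomly saturate the brightest 0-5% of pixels (abnormal exposure).
    pct = rng.uniform(0.0, 0.05)
    if pct > 0:
        thr = np.quantile(ldr, 1.0 - pct)
        ldr = np.minimum(ldr / max(thr, 1e-6), 1.0)
    # JPEG round-trip with a random quality factor in [10, 70].
    qf = int(rng.integers(10, 71))
    _, buf = cv2.imencode(".jpg", (ldr * 255).astype(np.uint8),
                          [cv2.IMWRITE_JPEG_QUALITY, qf])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR).astype(np.float32) / 255.0

rng = np.random.default_rng(0)
```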
B. Comparisons With Other State of the Art Methods
We select three other state-of-the-art learning-based methods [1], [8], [9] as baselines. Briefly, we denote [1] as Hdrcnn1, [8] as Expand-net, and [9] as DrTMO. In addition, to demonstrate that our method removes compression artifacts and recovers high-quality HDR information simultaneously, we combine a state-of-the-art compression artifacts removal method, Arcnn [2], as a pre/post-operation with Hdrcnn2. Furthermore, we also provide the results of the different training stages.
1) Quantitative Evaluation
For quantitative evaluation, five metrics are selected: HDR-VDP-2 [42], PU-PSNR, PU-MS_SSIM [43], log_PSNR, and Weber_MSE [44]. HDR-VDP-2 is the most common metric in ITM; its values are those of the VDP-Q quality score, which represents the degradation of the ITM result with respect to the ground truth. Since HDR images have a much larger dynamic range than LDR images, for a fair comparison we apply perceptual uniformity (PU) encoding [45] to the prediction and the ground truth when computing PSNR and MS_SSIM. log_PSNR computes the PSNR in the log domain [44], which is closer to the human visual system (HVS). Weber_MSE calculates the mean squared error between the reference and the prediction using Weber ratios. A larger HDR-VDP-2 value indicates less degradation between the recovered HDR image and the ground truth; a higher score on the PSNR-type metrics means less pixel-level loss, as does a lower Weber_MSE; a larger MS_SSIM score means higher structural fidelity.
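As an illustration of the log-domain idea, one plausible log_PSNR implementation is sketched below; the exact encoding and peak definition in [44] may differ, so this is an assumption rather than the reference implementation:

```python
import numpy as np

def log_psnr(pred, gt, eps=1e-6):
    # PSNR computed on log-encoded HDR values, roughly matching HVS response.
    lp, lg = np.log(pred + eps), np.log(gt + eps)
    peak = lg.max() - lg.min()  # assumed peak definition
    mse = np.mean((lp - lg) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```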
The experimental results are shown in Table 1. Our method shows superiority over the other state-of-the-art methods on all metrics for compressed images. A typical visualization of the HDR-VDP-2 result is shown in Fig. 9, where regions closer to blue indicate that the recovered HDR image has less degradation from the ground truth. In addition, Table 2 reports the quantitative results of the different loss functions, demonstrating the effect of the hybrid loss.
Some normal and under-exposed scenes with compression artifacts; our method recovers the missing information artifact-free compared with the other methods.
Some severely over-exposed scenes. Our method properly restores these saturated regions artifact-free.
A typical visualization of the HDR-VDP-2 result; images closer to blue indicate that the reconstructed HDR image is more similar to the ground truth. (a) Hdrcnn1. (b) DrTMO. (c) Expand-net. (d) Hdrcnn2 + Arcnn. (e) Arcnn + Hdrcnn2. (f) Ours.
According to Table 1, our method outperforms the other state-of-the-art methods on all metrics: we achieve the highest HDR-VDP-2, PU-PSNR, PU-MS_SSIM, and log_PSNR scores and the lowest Weber_MSE, which means our results have high fidelity.
2) Subjective Evaluation
Some typical visualization results are shown in Fig. 7 and Fig. 8: Fig. 7 shows normal and under-exposed scenes, and Fig. 8 shows severely over-exposed scenes. For convenience of display, all images are tone-mapped with the Reinhard tone mapping algorithm [27]. We zoom in on the relevant regions for detailed comparison.
These results show the clear superiority of our method over the others. More precisely, our method restores the missing information in large, severely over/under-exposed regions (see Fig. 8). More importantly, the compression artifacts are removed simultaneously, which is in line with our expectations.
As for Hdrcnn, it can recover saturated pixels but fails in some severely over-exposed areas. Besides, it cannot properly restore under-exposed regions, since those pixels are not involved in the training process. What is more, even when we use the model trained on the compressed dataset, its artifact removal ability is limited: blocking and ringing artifacts still exist in the results. For Hdrcnn2+Arcnn, the noise and artifacts are boosted by the ITM operation, which makes them harder to reduce; a specific example is the pillar in Fig. 7. The problem with Arcnn+Hdrcnn2 is that taking Arcnn as a pre-operation makes the result smoother and loses much texture information. Expand-net can generate convincing results for some scenes; however, it cannot restore the details in the under/over-exposed regions well, such as the electric fan and the sun in Fig. 8. DrTMO usually produces over-enhanced results and cannot reconstruct the over-exposed regions; the reason may be that it is hard to create correct multi-exposure LDR images from a single LDR image, and the result is easily affected by the conventional merging method. Moreover, none of these methods can eliminate the compression artifacts.
Conclusion and Discussion
In this paper we propose a decomposition-based inverse tone mapping (ITM) network for compressed LDR images. Since images and videos are usually compressed in many real-world ITM applications, our method attempts to simultaneously remove compression artifacts and reconstruct high-quality HDR information. More precisely, we use a guided filter to decompose the LDR image into base and detail layers, then recover them through the corresponding subnetworks respectively.
In the future, we believe it is promising to combine the super-resolution task with inverse tone mapping to further improve the viewing experience.