Introduction
An HDR image contains rich detail and color information and largely preserves the luminance of the real scene, so it can provide a more realistic visual experience. Real-world scenes often contain extremely bright and dark areas at the same time, which cannot be recorded faithfully because of the limited dynamic range of common sensors and quantization error. To record such scenes more accurately, researchers have explored hardware, software, and hybrid solutions, from which HDR imaging technology was born. Today, a native HDR image can be captured with an expensive HDR camera, or multiple images with different exposures can be shot with an LDR camera and fused into a synthetic HDR image [1]. Although HDR technology and related hardware have developed rapidly in recent years, obtaining high-quality HDR images still comes at a considerable cost. Therefore, HDR image reconstruction from a single-exposure image is worth exploring.
In this article, we propose a practical two-branch network structure for reconstructing an HDR image from a single LDR image. The two branches process over-exposed and under-exposed areas respectively. For HDR images without a corresponding LDR version, the LDR counterpart is generated by degrading the HDR image, simulating the formation of over- and under-exposed pixels. The LDR image is normalized and its gray image is computed; from the gray image, the masks of the over- and under-exposed areas used in the integration step are calculated. The normalized LDR image is then linearized and fed into the network, where linearization removes the non-linearity introduced by the photographic pipeline of digital cameras. The two branches restore and enhance the over- and under-exposed areas respectively, and the color is corrected so that the generated HDR image and the ground truth have more consistent color saturation. Because over- and under-exposed areas of LDR images have distinct characteristics, the two branches enhance and reveal their details with different methods. Finally, the outputs of the two branches are weighted and integrated with the linearized LDR image to produce the reconstructed HDR image.
The main contributions of this work can be summarized as follows: (1) a novel end-to-end dual-branch network structure is designed to simultaneously reconstruct the information of over- and under-exposed regions, which can not only reveal the hidden details in over-exposed regions, but also suppress the hidden noise in under-exposed regions. (2) a hue loss based on the hue value in HSV (Hue, Saturation, Value) space is included in the loss function so that the network learns the pixel color distribution more accurately, which makes the HDR images output by the network more faithful and natural in color. (3) the under-exposed region information enhanced by the dark branch is taken into consideration when integrating the restored over-exposed areas with the linearized LDR image to generate the HDR image, which balances the enhancement of details in different regions. Comprehensive experiments show that, compared to state-of-the-art methods, the proposed method reconstructs dark and bright regions well and obtains HDR images close to the ground truth.
The remainder of this article is organized as follows. Section II introduces previous related works. Section III details the proposed model structure and the loss functions used. Section IV describes the dataset we used and summarizes the parameters and implementation details of the experiments. Section V presents the ablation analysis, as well as subjective and objective comparisons with existing iTMOs (inverse Tone-Mapping Operators) and neural-network-based methods. Finally, the conclusions of our work are presented in Section VI.
Related Works
A. Traditional Inverse Tone-Mapping
Inverse tone-mapping is a technology for converting LDR images into HDR images so that LDR resources can be displayed by HDR applications. To a certain extent, it "restores" HDR content and provides upward compatibility with existing LDR resources. Existing iTMOs can be divided into global and local methods. Global iTMOs expand all pixels of the image with the same conversion function, which may be a linear scaling [2] or a non-linear function such as a gamma curve [3]. Global iTMOs are more suitable for scenes whose dynamic range is close to the dynamic range supported by the display device. However, global transformations may excessively compress the tonal range, resulting in an unavoidable loss of contrast and visual detail.
Local iTMOs, by contrast, use different conversion functions in different regions of the image. In this case, regions with the same color before mapping may have different colors after mapping, depending on the pixel location and the surrounding pixel values. Local algorithms can increase local contrast, which improves the visibility of image detail. In general, local methods first expand the image to a moderate dynamic range, then handle the improperly exposed areas with a specially designed function. Banterle et al. [4] achieved HDR image reconstruction by inverting Reinhard et al.'s tone-mapping operator [5] combined with an expand-map. Meylan et al. [6] used a piece-wise linear function to expand the dynamic range of the image so that highly exposed areas have a more natural appearance. Wang et al. [7] selected appropriate under-exposed and over-exposed areas in the image and used interpolation to improve the brightness and texture of the selected areas. Rempel et al. [8] used an expand-map computed from a Gaussian filter and an edge-stopping function to process the over-exposed regions. Kovaleski and Oliveira [9] extended the work of Rempel et al. by replacing the Gaussian filter with a cross bilateral filter, and their method achieves good results over a wide range of exposures. Subsequently, inspired by the characteristics of the human visual system, Huo et al. [10] proposed a novel physiological approach that avoids the artifacts occurring in most existing algorithms. Kim and Kim [11] applied a guided filter to divide the original image into a base layer and a detail layer, extended the dynamic range of the base layer with non-linear functions, and that of the detail layer with a linear mapping function obtained by a learning-based algorithm.
B. HDR Reconstruction Based on Convolutional Neural Networks
Due to the excellent performance of deep learning in various analytical learning tasks, CNNs have been extensively used in computer vision problems, including image-to-image translation. CNNs have also been successfully applied to reconstructing HDR images from multiple images captured with different exposure times [12]. On this basis, in order to solve the artifact problem in dynamic scenes with large-scale foreground motion, Wu et al. [13] proposed the first non-flow-based deep framework for high dynamic range imaging. For reconstructing an HDR image from a single-exposure LDR image, Eilertsen et al. [14] used an encoder-decoder network structure and proposed a very general solution that considers any type of saturated region. This method reduces the generation of artifacts in similar structures because its modified U-net structure only predicts the values of saturated pixels. It is worth mentioning that Endo et al. [15] used an autoencoder architecture to predict a set of LDR images with different exposure levels from a single input image and synthesized them using Mertens et al.'s method [16] to obtain the HDR image. Marnerides et al. [17] proposed a multiscale architecture which avoids the use of upsampling layers to improve image quality, with branches for global, semi-local, and local feature extraction. Jang et al. [18] designed a two-stage cascade network to learn HDR image generation and HDR image color refinement. Kinoshita and Kiya [19] proposed a loss function based on tone-mapped images to address the inaccuracy caused by the non-linear relationship between LDR and HDR images. Wang et al. [20] decomposed the image into high-frequency and low-frequency components and designed two sub-networks to process them separately. Jang et al. [21] designed a network to learn the cumulative histogram of HDR images, used the result for histogram matching of LDR images, and finally cascaded a color learning network to refine the image color. Recently, Generative Adversarial Networks (GAN) [22] have made significant progress in tasks such as image generation and image restoration, and have already been used for HDR image reconstruction by Lee et al. [23] and Ning et al. [24]. However, images generated by GANs can contain reconstruction errors and unrealistic artifacts. Also combining with a GAN, Moriwaki et al. [25] observed that if the reconstruction error is the only loss function, the recovered image is easily blurred, so they further introduced a perceptual loss and a reconstruction loss optimized for HDR, which improve image quality.
Dual-Branch Network Based Single Exposure HDR Image Reconstruction
A. Problem Statement
When a normal camera is used in an inappropriate exposure environment, such as facing a dazzling sun or shooting a dimly lit room, an excessively large or small dynamic range forces the camera's photosensitive element to operate under abnormal conditions. In this case, scene details are easily blurred or even lost. It is worth noting, however, that the causes of information loss are quite different in these two situations.
In an environment with high light intensity, the camera sensor receives too much light, causing overexposure. In the over-exposed areas, one or all channels of a pixel saturate, which results in loss of image detail. To reconstruct the information of these regions in the HDR domain, a better strategy is to estimate the value of a saturated pixel from its unsaturated channels or the adjacent unsaturated pixels [14], as in equation (1).\begin{equation*} \hat {Y}^{L}_{i,c}=f_{L}(X^{L}_{i\_{}adj,c}, X^{L}_{i,c}) \tag{1}\end{equation*}
Correspondingly, in a low-light environment, the camera's photosensitive elements cannot capture very small changes, or the quantization levels are insufficient to record such subtle details, which results in texture blur and color drift. In addition, the noise caused by the longer shutter time in dim conditions can also significantly affect image quality. From this point of view, the reasons for the decline of image quality in under-exposed areas differ from those in over-exposed areas. The reconstruction of under-exposed areas in the HDR domain can be expressed in a form similar to equation (1), as equation (2).\begin{equation*} \hat {Y}^{D}_{i,c}=f_{D}(X^{D}_{i\_{}adj,c}, X^{D}_{i,c}) \tag{2}\end{equation*}
B. Model Structure
Fig. 1 exhibits the pipeline of our method. Firstly, the input image is mapped to the linear domain by the inverse function of equation (3).\begin{equation*} f(Y)=(1+\sigma)\frac {Y^{n}}{Y^{n}+\sigma } \tag{3}\end{equation*}
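As a concrete illustration (not the authors' code), equation (3) and its inverse can be written as the following minimal sketch; the values of sigma and n used here are placeholders, since they are not stated in this passage.

```python
import numpy as np

# Hypothetical parameter values for illustration only.
SIGMA, N = 0.6, 0.9

def camera_curve(y, sigma=SIGMA, n=N):
    """Forward non-linearity of equation (3): f(Y) = (1+sigma) * Y^n / (Y^n + sigma)."""
    yn = np.power(np.clip(y, 0.0, 1.0), n)
    return (1.0 + sigma) * yn / (yn + sigma)

def linearize(z, sigma=SIGMA, n=N):
    """Inverse of equation (3), mapping the normalized LDR image to the linear domain."""
    z = np.clip(z, 0.0, 1.0)
    yn = sigma * z / (1.0 + sigma - z)   # solve z = (1+sigma)*Y^n / (Y^n + sigma) for Y^n
    return np.power(yn, 1.0 / n)
```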
BranchNet: fully convolutional autoencoder network with two branches.
1) Light Branch
Limited by the narrow dynamic range, LDR images cannot simultaneously record objects with extremely high brightness (such as the sun and other light sources) and dark corners without direct light. In this case, the highlight areas of the image are prone to texture loss and color distortion; for example, object outlines are covered by dazzling flare, or the blue sky becomes brilliant white. If the dynamic range of the image is extended directly, the distortion caused by overexposure will seriously affect the visual quality. To avoid such problems and obtain a good visual experience in over-exposed areas, we build a light branch that specifically repairs the highlight areas. The light branch is an autoencoder network [28]: its encoder maps the input image to a non-linear space to obtain low-dimensional abstract feature maps. By processing and extracting features from these maps, the network can perform operations that are difficult to implement in the original feature space. The decoder is trained to reconstruct the full-resolution output from the feature maps produced by the encoder, thus converting the image from the LDR domain to the HDR domain.
The structure of the light branch is shown in the upper half of the network in Fig. 1. The encoder and decoder have the same number of convolutional blocks, five each. In the encoder, an instance normalization [29] (IN) layer is placed after the last convolutional layer of each convolution block. A batch normalization layer [30] computes the mean and variance over all samples in a batch; for HDR images, the wide dynamic range makes the statistics of different images vastly different, so batch-level mean and variance do not serve individual images well. The IN layer computes the mean and variance per sample and is therefore more suitable for HDR image reconstruction. Except for the fifth convolutional block, each block ends with a max-pooling layer for downsampling. In the decoder, each convolution block receives the feature map from the encoder through a skip-connection to compensate for the information loss caused by downsampling. Since these feature maps are twice the size of the block input, we use bilinear interpolation to enlarge the input, which also serves as the upsampling operation of the decoder. Except for the last convolutional block in the decoder, the last convolutional layer of each block is followed by an instance normalization layer. The convolution kernel used in the convolution layer of the entire light branch has a shape of
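For concreteness, a minimal PyTorch sketch of an encoder-decoder of this kind is given below. It is not the authors' implementation: the channel widths, the 3x3 kernel size, the activations, the output non-linearity, and the simplification to four decoder stages plus an output convolution are all assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EncBlock(nn.Module):
    """Conv block ending with instance normalization, as described for the encoder."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)  # kernel size assumed
        self.norm = nn.InstanceNorm2d(c_out)
    def forward(self, x):
        return F.relu(self.norm(self.conv(x)))

class DecBlock(nn.Module):
    """Decoder block: bilinear upsampling, concatenation with the encoder skip, conv + IN."""
    def __init__(self, c_in, c_skip, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in + c_skip, c_out, kernel_size=3, padding=1)
        self.norm = nn.InstanceNorm2d(c_out)
    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
        x = torch.cat([x, skip], dim=1)   # skip-connection from the encoder
        return F.relu(self.norm(self.conv(x)))

class LightBranch(nn.Module):
    def __init__(self, widths=(32, 64, 128, 256, 512)):   # hypothetical channel widths
        super().__init__()
        self.enc = nn.ModuleList()
        c = 3
        for w in widths:
            self.enc.append(EncBlock(c, w))
            c = w
        self.pool = nn.MaxPool2d(2)
        self.dec = nn.ModuleList()
        for w in reversed(widths[:-1]):
            self.dec.append(DecBlock(c, w, w))
            c = w
        self.out = nn.Conv2d(c, 3, kernel_size=3, padding=1)
    def forward(self, x):
        skips = []
        for i, blk in enumerate(self.enc):
            x = blk(x)
            if i < len(self.enc) - 1:      # no pooling after the fifth (bottleneck) block
                skips.append(x)
                x = self.pool(x)
        for blk, s in zip(self.dec, reversed(skips)):
            x = blk(x, s)
        return torch.sigmoid(self.out(x))  # output range assumed to be [0, 1]
```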
2) Dark Branch
As Reibel et al. [32] concluded, in low-light situations, in addition to photo-response non-uniformity, CCD and CMOS sensors are subject to interference from a variety of noise sources, such as read noise, photon shot noise, dark current, and fixed-pattern noise. The noise depends not only on the exposure setting and camera model, but also on the scene captured. For digital cameras, darker areas appear to contain more noise than bright areas, as shown in Fig. 2. Note that noise becomes less noticeable as the image becomes brighter: brighter areas receive more light and therefore have stronger signals and a higher signal-to-noise ratio. This means that under-exposed areas will exhibit more noticeable noise when their brightness is raised to a natural level; if not suppressed, this noise appears even more abrupt in HDR images. Traditional inverse tone-mapping methods mainly classify image regions according to brightness level and then expand them dynamically with different expansion operators [33], which can to some extent reduce the noise amplification and artifacts caused by improper expansion of dark areas. In existing deep-learning-based methods, the under-exposed area is rarely treated specially during HDR reconstruction, which degrades the visual quality of these areas in the HDR image.
The performance of different brightness areas under the same intensity of additive Gaussian noise. The brighter the area, the harder it is to notice the reduction in visual effects due to noise.
In addition to the noise caused by the camera hardware itself, noise is also introduced during image compression. Pixel values in under-exposed areas are often low, and changes of texture and color are not obvious. During JPEG compression, the pixels in under-exposed areas are treated as unimportant, so most of the useful information there is discarded. Furthermore, because pixels in under-exposed areas have small gradients, information is easily lost during quantization, causing unsmooth transitions and obvious banding artifacts. When the hardware noise and the quantization loss are added together, they seriously affect HDR reconstruction in the under-exposed area. Fig. 3 shows an HDR image reconstructed without noise suppression in the under-exposed area: the left is the input LDR image, and the right is the HDR image reconstructed without any processing of the under-exposed area. Although no obvious noise is visible in the LDR image, in the HDR image the color noise and banding artifacts in the sky become very conspicuous and must be dealt with during HDR imaging. Here we use a network branch to perform high dynamic range reconstruction on the under-exposed area of the image and to make dark pixels darker, thus improving the viewing quality of the entire image. We define the network that implements this function as the dark branch, and its model structure is the lower half of the network in Fig. 1. It can be seen that the
The display effect in the HDR image when the noise of the under-exposed area is not suppressed. (a) LDR image; (b) HDR image generated by the network without processing to under-exposed area.
C. Loss Function
In a variety of image generation tasks, researchers usually design an appropriate loss function according to the actual demand to ensure that the network converges in the desired direction. For deep-learning-based HDR reconstruction methods, in addition to directly computing the mean squared error or the mean absolute error between the output HDR image and the ground truth, loss functions include: computing the mean squared error of the tone-mapped image to ensure the quality of the tone-mapped result [12]; computing the difference of image gradient information to facilitate the repair of texture information [24]; and computing the cosine similarity between the output image and the ground truth to make the colors more accurate [17].
The goal of our method is to reconstruct the high-brightness and low-luminance regions of the image, so we extract the regions of interest as masks so that the network can focus on predicting the pixel values inside the mask regions. The light branch extracts the regions with high pixel values from the gray image by equation (4) [14], using the threshold $t_{l}$:\begin{equation*} M_{i}^{L}=\frac {max(I_{i}-t_{l},0)}{1-t_{l}} \tag{4}\end{equation*}
The light mask acts on the light branch's loss function as in equation (5), guiding the light branch to focus on repairing over-exposed areas.\begin{align*} L_{light}(\hat {Y},Y)=&\frac {\alpha ^{L}}{wh}\sum _{i}{\left |{M^{L}_{i}(\hat {H}_{i}\!-\!H_{i})}\right |^{2}} \\&+\frac {1}{3wh}\sum _{i,c}{\left |{ M^{L}_{i}(log(\hat {Y}_{i,c}\!+\!\epsilon)\!-\!log(Y_{i,c}\!+\!\epsilon))}\right |}^{2} \\ \tag{5}\end{align*}Here $\hat {H}_{i}$ and $H_{i}$ are the hue values of the predicted image and the ground truth at pixel $i$, computed in HSV space as in equation (6), $w$ and $h$ are the image width and height, and $\alpha ^{L}$ weights the hue term.
\begin{align*} H_{i}= \begin{cases} \displaystyle \frac {Y_{i,g}-Y_{i,b}}{6\Delta _{i}}, Y_{i,r}=max_{c}(Y_{i,c}) \& Y_{i,g} \ge Y_{i,b}\\[5pt] \displaystyle \frac {Y_{i,g}-Y_{i,b}}{6\Delta _{i}}+1, Y_{i,r}=max_{c}(Y_{i,c}) \& Y_{i,g} < Y_{i,b}\\[5pt] \displaystyle \frac {Y_{i,b}-Y_{i,r}}{6\Delta _{i}}+\frac {1}{3}, Y_{i,g}=max_{c}(Y_{i,c}) \\[5pt] \displaystyle \frac {Y_{i,r}-Y_{i,g}}{6\Delta _{i}}+\frac {2}{3}, Y_{i,b}=max_{c}(Y_{i,c}) \end{cases} \tag{6}\end{align*}
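For concreteness, a minimal PyTorch sketch of the hue term in equation (6) is given below (not the authors' implementation). Here $\Delta _{i}$ is assumed to be $max_{c}(Y_{i,c})-min_{c}(Y_{i,c})$, as in the standard RGB-to-HSV conversion, and a small epsilon avoids division by zero on gray pixels.

```python
import torch

def hue(img, eps=1e-8):
    """Hue of equation (6) for tensors of shape (B, 3, H, W), scaled to [0, 1]."""
    r, g, b = img[:, 0], img[:, 1], img[:, 2]
    mx, _ = img.max(dim=1)
    mn, _ = img.min(dim=1)
    delta = mx - mn + eps                     # Delta_i assumed to be max - min over channels
    h = torch.zeros_like(mx)
    is_r = mx == r
    is_g = (mx == g) & ~is_r
    is_b = ~(is_r | is_g)
    h = torch.where(is_r, (g - b) / (6.0 * delta), h)
    h = torch.where(is_r & (g < b), h + 1.0, h)                        # the "+1" case of eq. (6)
    h = torch.where(is_g, (b - r) / (6.0 * delta) + 1.0 / 3.0, h)
    h = torch.where(is_b, (r - g) / (6.0 * delta) + 2.0 / 3.0, h)
    return h
```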
The loss function used by dark branch is slightly different from that in the light branch. Since the area of interest in the dark branch is mainly the under-exposed portion of the image, the pixel value here is very small and is not suitable for operation in the logarithmic domain. So we use the L2 distance directly to calculate the loss of pixels in the dark region as equation (7).\begin{align*}&\hspace {-.5pc} L_{dark}(\hat {Y},Y)=\frac {\alpha ^{D}}{wh}\sum _{i}{\left |{M^{D}_{i}(\hat {H}_{i}-H_{i})}\right |^{2}} \\&+\frac {1}{3wh}\sum _{i,c}{\left |{ M^{D}_{i}(\hat {Y}_{i,c}-Y_{i,c})}\right |}^{2} \tag{7}\end{align*}
The dark mask $M^{D}_{i}$ is computed from the gray image with the threshold $t_{d}$ and smoothed by the guided filter $G(\cdot)$, as in equation (8).\begin{equation*} M_{i}^{D}=G\left({\frac {max(t_{d}-I_{i},0)}{t_{d}}}\right) \tag{8}\end{equation*}
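As an illustration only, the two masks of equations (4) and (8) might be computed as in the following sketch. The thresholds t_l and t_d are hypothetical values, and guided_filter stands for the smoothing G(.) of equation (8); a simple guided-filter sketch appears in the ablation subsection below.

```python
import numpy as np

def light_mask(gray, t_l=0.95):
    """Equation (4): linear ramp above the (hypothetical) highlight threshold t_l."""
    return np.maximum(gray - t_l, 0.0) / (1.0 - t_l)

def dark_mask(gray, guide, guided_filter, t_d=0.15):
    """Equation (8): ramp below the (hypothetical) dark threshold t_d, smoothed by G(.)."""
    m = np.maximum(t_d - gray, 0.0) / t_d
    return guided_filter(guide, m)
```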
We can train both the light branch and the dark branch at the same time, so the final loss function is shown in equation (9):\begin{equation*} L_{final}(\hat {Y},Y)=L_{light}(\hat {Y},Y)+L_{dark}(\hat {Y},Y) \tag{9}\end{equation*}
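A minimal PyTorch sketch of equations (5), (7) and (9) is shown below, assuming the hue() function sketched earlier and hypothetical weights alpha_l and alpha_d for the hue terms; it is not the authors' implementation.

```python
import torch

# y_hat_* and y have shape (B, 3, H, W); masks m_l and m_d have shape (B, H, W).
def light_loss(y_hat, y, m_l, alpha_l=0.5, eps=1e-6):
    hue_term = (m_l * (hue(y_hat) - hue(y))) ** 2                                # first term of eq. (5)
    log_term = (m_l.unsqueeze(1) * (torch.log(y_hat + eps) - torch.log(y + eps))) ** 2
    return alpha_l * hue_term.mean() + log_term.mean()

def dark_loss(y_hat, y, m_d, alpha_d=0.5):
    hue_term = (m_d * (hue(y_hat) - hue(y))) ** 2                                # first term of eq. (7)
    l2_term = (m_d.unsqueeze(1) * (y_hat - y)) ** 2                              # plain L2 in the dark region
    return alpha_d * hue_term.mean() + l2_term.mean()

def final_loss(y_hat_light, y_hat_dark, y, m_l, m_d):                           # equation (9)
    return light_loss(y_hat_light, y, m_l) + dark_loss(y_hat_dark, y, m_d)
```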
The reconstructed HDR image $\hat {Y}$ is obtained by blending the outputs of the two branches with the linearized LDR image, weighted by the masks, as in equation (10).\begin{align*}&\hspace {-.5pc} \hat {Y}=\sum _{i,c}(M^{L}_{i}*\hat {Y}^{L}_{i,c}+M^{D}_{i}*\hat {Y}^{D}_{i,c} \\&+\,(1-M^{L}_{i}-M^{D}_{i})*f^{-1}(X_{i,c})) \tag{10}\end{align*}
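Read per pixel, the blending of equation (10) is a mask-weighted combination of the two branch outputs and the linearized input; a small NumPy sketch (not the authors' code) is:

```python
import numpy as np

def blend(y_light, y_dark, x_lin, m_l, m_d):
    """Combine the branch outputs with the linearized LDR image f^{-1}(X), per pixel."""
    m_l = m_l[..., None]   # broadcast the single-channel masks over the color axis
    m_d = m_d[..., None]
    return m_l * y_light + m_d * y_dark + (1.0 - m_l - m_d) * x_lin
```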
Experiments
A. Dataset
The dataset largely determines the upper limit of a network, and a large set of well-structured training data is needed so that the network can learn abundant useful information. For the task of HDR image reconstruction, native HDR images taken directly with a professional HDR camera are most suitable as ground truth. However, because such cameras are expensive, existing resources are scarce. Therefore, HDR images synthesized from multi-frame images with different exposure times can also be used as training labels. We have collected a total of 1,304 HDR images of the above categories with resolutions ranging from
With the HDR image as the ground truth, we also need the corresponding LDR image to form a data pair for training. Generally, HDR images have extremely high resolution. We crop random positions of the original HDR image, randomly flip the extracted areas, and resample them to the size of
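As a rough sketch of this crop/flip/resize step (not the authors' code), with a placeholder patch size since the exact resolutions are not reproduced here:

```python
import cv2
import numpy as np

PATCH = 256   # placeholder output size

def random_patch(hdr, rng=np.random.default_rng()):
    """Random crop, random horizontal flip, and resize of an HDR image (assumes min(h, w) >= PATCH)."""
    h, w, _ = hdr.shape
    crop = int(rng.integers(PATCH, min(h, w) + 1))           # random crop size
    y = int(rng.integers(0, h - crop + 1))                   # random crop position
    x = int(rng.integers(0, w - crop + 1))
    patch = hdr[y:y + crop, x:x + crop]
    if rng.random() < 0.5:
        patch = np.ascontiguousarray(patch[:, ::-1])         # random horizontal flip
    return cv2.resize(patch, (PATCH, PATCH), interpolation=cv2.INTER_AREA)
```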
B. Training
We use dedicated loss functions to guide the two branches of the network to learn different mappings. The two branches are relatively independent, so they can be trained separately or simultaneously. In fact, these two training schemes require different training parameters. To avoid the difficulties of parameter tuning, we first train the two branches simultaneously, then freeze the parameters of the light branch, reduce the learning rate, and fine-tune the dark branch. We therefore use the loss function in equation (9) directly to optimize the parameters of every layer in the network; equation (9) consists of the light branch's loss and the dark branch's loss. The hyper-parameter
The loss minimization is performed with the ADAM optimizer [36] and the learning rate of ADAM is
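The two-stage schedule described above could be set up as in the following hedged sketch; the learning-rate values are placeholders rather than the paper's settings, and light_branch / dark_branch stand for the two sub-modules of the network (the names are hypothetical).

```python
import torch

def make_stage1_optimizer(light_branch, dark_branch, lr=1e-4):
    """Stage 1: jointly optimize both branches with ADAM under the loss of equation (9)."""
    params = list(light_branch.parameters()) + list(dark_branch.parameters())
    return torch.optim.Adam(params, lr=lr)

def make_stage2_optimizer(light_branch, dark_branch, lr=1e-5):
    """Stage 2: freeze the light branch and fine-tune the dark branch at a reduced learning rate."""
    for p in light_branch.parameters():
        p.requires_grad = False
    return torch.optim.Adam(dark_branch.parameters(), lr=lr)
```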
Results
In this section, we compare our method with existing methods mainly through subjective perception and objective indicators, and perform ablation studies to prove the effectiveness of the modules used.
Image quality metrics are generally categorized into three classes: full-reference (FR), reduced-reference (RR), and no-reference (NR) metrics. Because HDR images have a wider dynamic range than LDR images, we use several FR metrics based on the Visible Differences Predictor (VDP) [37], Peak Signal to Noise Ratio (PSNR) [38], and Structural Similarity Index Measure (SSIM) [39] to evaluate the difference between predicted and reference HDR images. Researchers have adapted PSNR and SSIM for HDR: before computing the scores, Perceptual Uniformity (PU) coding [40] is applied to the predicted and reference images to make them suitable for HDR comparison. Because distortions in darker image areas are less visible, metrics with PU coding are generally more accurate for these areas than luminance-independent metrics. HDR-VDP-2.2 [41] is a calibrated visual metric for visibility and quality prediction under all luminance conditions. It provides a PMAP (Probability MAP) to visualize the per-pixel probability of detection and a VDP-Q quality score to measure the overall quality of the predicted image.
A. Comparisons With Existing Methods
1) Qualitative Comparison
We compare our method with three conventional iTMOs, i.e. Akyüz et al. [2], Masia et al. [42] and Huo et al. [43], and three deep-learning-based HDR reconstruction methods, i.e. the multiscale reconstruction model ExpandNet [17], the multiple-exposure reconstruction model [15], and the high-exposure repair model [14]. The implementations of the three iTMOs are integrated in the toolkit supplied by Banterle et al. [44].
Fig. 4 to Fig. 8 show the results of the qualitative comparison. The upper image in Fig. 4 is a close-up view of Horseshoe Lake. The sun reflected in the lake causes significant overexposure, resulting in the loss of plant stem and leaf details in this region (red box). At the same time, there is a noticeable shadow area in the lower right corner of the image (blue box). In Fig. 4, the methods of Akyüz et al. [2], Masia et al. [42] and Endo et al. [15] ((c),(d),(g) and (l),(m),(p)) raised the overall brightness of the image but failed to suppress the overexposure in the bright region (red box), and the contrast and visibility of their output images are poor. Huo et al.'s method [43] ((e),(n)) suppressed the overexposure too much, which distorts the information in the bright area. Eilertsen et al.'s approach [14] ((f),(o)) has a certain effect on highlight suppression, and the output image looks more natural. ExpandNet [17] ((h),(q)) and the proposed method ((i),(r)) performed well, restored more details in the over-exposed region, and produced results closer to the ground truth. Furthermore, the proposed algorithm also revealed more texture details and retained more color information in the dark region (blue box). The lower image of the small town mainly shows the effectiveness of our method for highlight suppression in over-exposed scenes. The instance normalization structure in the network regularizes each image individually, which helps the network learn a personalized way of repairing over-exposed information. As a result, our method can effectively address the detail weakening and color whitening caused by overexposure without special brightness adjustment.
Result images of all compared methods. (a),(j) input LDR images corresponding to test images Horseshoe Lake and Small Town; (b),(k) Ground Truth; (c)-(h),(l)-(q) outputs of the methods for comparison; (i),(r) our results.
Result images of all compared methods. (a),(j) input LDR images corresponding to test images Seashore and Sea Surface; (b),(k) Ground Truth; (c)-(h),(l)-(q) outputs of the methods for comparison; (i),(r) our results. In this set of images we use zoom-in windows to highlight some regions.
Result images of all compared methods. (a),(b),(c),(d) input LDR images corresponding to test images HDR008_1800, HDR007_1800, HDR006_1800, and HDR_110_Tunnel; (a1)-(d1) Ground Truth; (a2)-(d7) outputs of the methods for comparison; (a8)-(d8) our results.
The performance compared images of all methods for overexposure areas. (d),(l) input LDR images. (h),(p) our results; others are outputs of the methods for comparison.
The performance compared images of all methods for overexposure areas. (d),(l) input LDR images. (h),(p) our results; others are outputs of the methods for comparison.
The images shown in Fig. 5 are Seashore and Sea Surface; they show the performance of the compared algorithms on over-exposed and under-exposed areas respectively. In these two sets of images, we use zoom-in windows to highlight some regions so that the differences between the methods can be distinguished more clearly. Fig. 5 shows results similar to Fig. 4. The methods of Akyüz et al. [2], Masia et al. [42] and Endo et al. [15] ((c),(d),(g) and (l),(m),(p)) enhanced the brightness of over-exposed and under-exposed regions, but did not suppress the overexposure and quantization noise, which causes obvious artifacts. Huo et al.'s algorithm [43] ((e),(n)) lowered the brightness of the entire image and caused artifacts in the over-exposed area. Eilertsen et al.'s approach [14] ((f),(o)) suppressed the highlights but introduced color distortion and banding artifacts. ExpandNet [17] ((h),(q)) enhanced the brightness and suppressed the quantization noise, achieving natural visual effects in the over-exposed area, but introduced color shift in the under-exposed area. The proposed method ((i),(r)) produced output that is natural, pleasing, and closer to the ground truth. This is due to the hue loss used in training, which more precisely guides the network to learn the color distribution of the image.
It is worth noting that, in Fig. 4, our network makes a seemingly opposite decision, i.e., it enhances the brightness of strongly textured regions. This is because our network adapts its behavior to different image areas: it does not increase the brightness of smooth areas, and it does not produce artifacts when brightening strongly textured areas. In strongly textured regions, even though some pixels have low values, raising their brightness is unlikely to produce artifacts; in this case, our network chooses to brighten the dark pixels, which leads to better visual effects.
Fig. 6 shows the performance of the compared algorithms on images taken from a dark room on a sunny day. The methods of Akyüz et al. [2], Masia et al. [42] and Endo et al. [15] ((a2)-(d2), (a3)-(d3) and (a6)-(d6)) raised the brightness of the whole image too much, resulting in loss of contrast and blurred images. Huo et al.'s algorithm [43] ((a4)-(d4)) lowered the brightness and contrast of the entire image. Eilertsen et al.'s approach [14] ((a5)-(d5)) performs better than the above four algorithms, but the contrast in bright areas is lower than that of ExpandNet [17] ((a7)-(d7)) and the proposed method ((a8)-(d8)). Furthermore, the proposed network restored more detail in bright areas.
Fig. 7 and Fig. 8 mainly show the performance of each method on reconstruction of over-exposed areas. The test images are at a low exposure level overall, but each of them contains extremely over-exposed areas. To better display the details of the over-exposed areas, we use zoom-in windows in the figures to highlight the difference between the over-exposed areas before and after HDR reconstruction. Compared with the existing methods, the proposed network structure shows excellent performance and effectively reconstructs the detail information in over-exposed areas. In general, our model has learned a more personalized way of processing different scene images.
2) Quantitative Comparison
In addition to qualitative comparisons, we also made quantitative comparisons with the six methods mentioned above to verify the effectiveness of our method. We randomly selected 210 original HDR images from the dataset as the test set. In order to ensure that each method can obtain results in a reasonable time and to guarantee data quality, we randomly crop the high-resolution images and adjust the size to a shape of
The range of pixel values of the HDR images output by different reconstruction methods differs: traditional HDR generation methods tend to output absolute luminance values, while CNN-based methods tend to output data directly in the range [0, 1]. Although the perceptual-uniformity-encoding-based HDR metrics dependent on absolute luminance values in
Our method achieves the best scores on six of the seven objective indicators in Table 2: VDP-Q, PMAP95, PMAP75, PU2-SSIM, PU2-MS-SSIM, and LOE. The VDP-Q score represents the quality of the generated image; the higher the score, the closer the predicted image is to the ground truth under the observation of the human visual system. For a target image and a reference image, PMAP95 is the percentage of pixels at which a difference can be detected with probability greater than 95%, and PMAP75 the percentage with probability greater than 75%; the closer the target image looks to the reference HDR image, the fewer perceptibly different pixels it has. The PU2-SSIM and PU2-MS-SSIM scores indicate the structural similarity of two images; the higher the score, the more similar the predicted image is to the ground truth. It is worth mentioning that VDP-Q does not evaluate well the structural features that strongly affect human visual perception, so the PU2-SSIM and PU2-MS-SSIM scores are a good complement. LOE measures the naturalness of the output image compared to the ground truth; the smaller the LOE value, the better the lightness order is preserved. In addition, our PU-PSNR score ranks second among the seven algorithms, only slightly behind the first. Table 2 indicates that our method has an obvious advantage over all the methods considered in the comparison. Fig. 9 and Fig. 10 show the HDR-VDP-2.2 PMAPs calculated from the predictions of all the methods. The HDR-VDP-2.2 visibility PMAPs describe the probability that an observer perceives a difference between two images at each pixel: red pixels indicate high probability, and blue pixels indicate low probability. Benefiting from the separate processing of over-exposed and under-exposed areas by the light and dark branches, it can be clearly seen from the figures that, for the over- and under-exposed areas that are prone to visual differences, our results have a lower probability of a perceptible difference. This means that the HDR images generated by our algorithm are closer to the ground truth than those generated by the compared methods, so it can be inferred from Fig. 9 and Fig. 10 that our method performs better than the other methods.
Visibility PMAPs of HDR-VDP-2.2. Blue indicates a difference that is imperceptible to the human visual system, and red indicates a difference that is perceptible to the human visual system.
Visibility PMAPs of HDR-VDP-2.2. Blue indicates a difference that is imperceptible to the human visual system, and red indicates a difference that is perceptible to the human visual system.
B. Ablation Studies
Different network structures and loss functions often lead to different results. To verify the validity of the structure and loss function described above, we conducted experiments with different modules in the network and analyzed the statistical results.
1) Branches
The proposed network is mainly composed of the light and dark branches. As mentioned in the previous section, the light branch focuses on repairing over-exposed areas and restoring the contrast and blurred texture that are weakened in the LDR image due to compression. However, in low-light conditions, sensor noise becomes more prominent due to the decrease in incident photons and the increase in sensor sensitivity; the image may show blurred details, unclear texture, and color shift [46], [47]. Therefore, we introduce a dark branch to suppress the image quality degradation caused by stretching the dynamic range in low-illumination areas. Similar to the light branch, we compute a mask so that the network pays attention to the areas that need to be repaired. However, if the mask is calculated directly on the gray image and expanded linearly, the impact of image compression on the under-exposed area makes the mask prone to obvious banding artifacts, as shown in Fig. 11. After guided-filter processing, the uneven weights caused by noise in the input image are smoothed out while ensuring that the mask area is not incorrectly extended, as shown in Fig. 11 (d).
Smoothing results of the guided filtering on the extracted mask. (a) input LDR image; (b) HDR image after the pixel values in (a) are squared; (c) the mask calculated from (a); (d) the mask after guided filtering.
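For reference, a simple box-filter implementation of a guided filter for smoothing the dark mask might look as follows (a sketch under assumed radius and regularization settings, not the authors' code); the gray image serves as the guide.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(guide, src, radius=8, eps=1e-3):
    """Single-channel guided filter (box-filter form); radius and eps are hypothetical settings."""
    size = 2 * radius + 1
    mean_i = uniform_filter(guide, size)
    mean_p = uniform_filter(src, size)
    corr_i = uniform_filter(guide * guide, size)
    corr_ip = uniform_filter(guide * src, size)
    var_i = corr_i - mean_i * mean_i
    cov_ip = corr_ip - mean_i * mean_p
    a = cov_ip / (var_i + eps)                 # local linear coefficients
    b = mean_p - a * mean_i
    return uniform_filter(a, size) * guide + uniform_filter(b, size)
```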
2) Color Correction Loss
The loss function is very important for reconstruction tasks. Because the usual L2 distance cannot accurately measure the color difference between vectors, we constrain the color of the network output by adding a new loss term to bring it closer to the ground truth. The cosine similarity loss is small whenever the channel ratios are similar, regardless of the pixel magnitudes, so the search space has many minima, which is not conducive to network convergence. Here, based on the fact that the hue loss reflects color differences more accurately, we use the hue loss instead of the cosine similarity loss to guide the network toward the correct color distribution. Table 3 shows the results, where LB and DB denote the loss functions of the light branch and the dark branch respectively, and "proposed" denotes the final loss function of the proposed method. The results of our method with the hue loss are better than those using the cosine similarity loss (CS-Loss) or the L2 loss only. Furthermore, from the results of Fig. 4 and Fig. 5, we can also see that our method reconstructs HDR images with sufficient color accuracy.
Conclusion
In this article, a deep-learning-based single-frame HDR image generation method is proposed. In the process of HDR image reconstruction, two key issues need to be solved: highlight suppression in over-exposed areas and noise elimination in under-exposed areas. Most existing algorithms focus on highlight suppression in over-exposed areas. We consider that the LDR images to be processed generally do not have excessively severe over-exposure problems; therefore, in the design of our experiments, we prefer to let the network learn to repair pixel information at the edges of saturated values. These repaired pixels are neither generated out of thin air by the network nor obtained through over-fitting, and they can be traced. In practice, texture information that we cannot see clearly in the LDR image may merely be "hidden" in the image, which does not mean it does not exist; thus, the main goal of our algorithm is to find and enhance these "hidden" details according to the existing information in the image. Similarly, the under-exposed areas may also hide a lot of noise that is difficult to perceive in the LDR image. When the image is converted to the HDR domain, this hidden noise becomes apparent, and if not suppressed, it seriously degrades the quality of the generated HDR image. We therefore propose a novel dual-branch network structure, which can not only reveal the hidden information in the over-exposed areas, but also suppress the noise hidden in the under-exposed areas, so as to ensure that the reconstructed HDR image has excellent quality. In addition, we introduce a hue loss to enable the network to more accurately learn the pixel color distribution of the image, making the HDR image generated by the network more accurate and natural in color. The experimental results show that our model has obvious advantages in both subjective analysis and objective score comparison.
HDR image generation contains many different sub-tasks, and it is difficult to solve all of them directly with a single network. If the entire pipeline can be divided into different sub-tasks, with dedicated modules designed to guide the model to learn the corresponding functions, better results may be obtained. In future work, we will attempt to further divide the over-exposed area into light-source and reflection areas, or the under-exposed area into strong-noise and under-brightness areas, etc., and design networks to perform targeted processing.