Introduction and Related Works
A. Introduction
Modern displays support high dynamic range (HDR) content. Despite this capability, current services, including digital TV and Internet Protocol TV (IPTV), still deliver standard dynamic range (SDR) videos to customers [1], [2], [3]. Given the limited amount of available HDR content, technology that converts SDR content to HDR content is in high demand in the media industry.
Many works have addressed HDR conversion [1], [2], [3], [4], [5], [6], [7]. Among the previous HDR conversion methods, the authors of HDRTVNet [3] pointed out the absence of a unified standard for HDR displays and of a widely used, public, large-scale dataset as the main reasons the field remains underexplored. In our work, we follow the dataset introduced in HDRTVNet [3], with a slight modification, to extend this emerging research area called SDRTV-to-HDRTV. To date, there have been only a few prior works [1], [2], [3] on the SDRTV-to-HDRTV conversion problem.
Similar to other low-level vision problems, including super-resolution [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], denoising [23], [24], [25], [26], and colorization [27], [28], [29], SDRTV-to-HDRTV is an ill-posed problem. SDRTV-to-HDRTV is a derivative of inverse tone mapping (ITM), whose goal is to reconstruct HDR images from SDR images. ITM comprises two sub-problems: LDR-to-HDR and SDRTV-to-HDRTV. The task of LDR-to-HDR [30], [31], [32], [33], [34], [35], [36], [37], [38], [39], [40] is to predict the target luminance of HDR in the linear domain. In contrast, SDRTV-to-HDRTV aims to convert SDR content derived from raw data to its HDR counterpart formed from the same raw data. Additionally, the SDR and HDR content in the SDRTV-to-HDRTV problem follow specific standards [41] and [42], respectively. Fig. 1 illustrates the main functional difference between SDRTV-to-HDRTV and LDR-to-HDR. To the best of our knowledge, our work is the first to address the SDRTV-to-HDRTV problem with a Vision Transformer. In addition, we introduce an architecturally advanced Vision Transformer called DenSE-SwinHDR.
Fig. 1. (a) HDR formation pipeline and (b) SDRTV and HDRTV formation pipeline. HDR formation differs from the HDRTV formation pipeline, which explains why the SDRTV-to-HDRTV conversion problem should be distinguished from LDR-to-HDR.
In the deep learning community, Convolutional Neural Networks (CNNs) have long been dominant for solving vision problems. The Transformer [43], which relies on self-attention mechanisms, has emerged as an alternative to CNNs. Initially, it was proposed for Natural Language Processing (NLP) problems. As it achieved strong performance in the NLP domain, there have been many efforts to transfer techniques between the NLP domain and the vision domain [44], [45], [46], [47], [48]. Because the self-attention mechanism of the Transformer [43] captures global interactions between contexts, it has shown promising performance in vision tasks [44], [45], [46], [47], [48].
With the success of Vision Transformers in high-level vision problems, including classification, detection, and segmentation [44], [45], [47], several studies have applied Vision Transformers to low-level vision problems [46], [48]. ViT [44] replaced convolution filters entirely with Transformer blocks and showed state-of-the-art performance in high-level vision problems. As an extension of ViT [44], IPT [48] performed well on low-level vision problems including super-resolution, reference super-resolution, and colorization. Another extension of ViT [44], the Swin Transformer [45], introduced a more efficient window-based attention operation and enabled the extraction of hierarchical features with a Transformer. SwinIR [46] applied the Swin Transformer [45] to low-level problems and achieved competitive performance. In our work, we adopt the architecture of SwinIR [46] as our baseline to take advantage of Vision Transformers. Further, we combine the baseline architecture with architectural strategies previously introduced for CNNs, such as the residual connection [49], the dense connection [50], and the squeeze-and-excitation module [51].
Combining CNNs and Transformers yields insightful findings. The Transformer behaves like a low-pass filter, whereas the CNN behaves like a high-pass filter [52]. Because the Transformer considers global information of the image, it is well suited to the global mapping required by SDRTV-to-HDRTV. We therefore integrate the Transformer and the CNN so that both local and global information are taken into account.
As has repeatedly been shown, improvements in high-level vision problems have led to successes in low-level vision problems [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22] in most cases. For example, several architectural solutions, including the residual connection [49], the dense connection [50], and the SE module [51], succeeded first in high-level vision problems and then in low-level vision tasks [10], [16], [21], [22], [24]. Likewise, [46] proposed the residual Swin Transformer block (RSTB), which is composed of plain Swin Transformer blocks with residual connections, and showed that the RSTB is promising for low-level vision tasks. In this light, we hypothesized that such network design strategies previously proposed for CNNs can also be applied to Transformer architectures to boost performance. To test this hypothesis, we built a new architecture that includes dense connections [50] and the SE module [51] inside the RSTB [46]. We conducted experiments on SDRTV-to-HDRTV and showed that our new architecture achieves state-of-the-art performance, surpassing HDRTVNet [3] by margins of 0.79 dB PSNR and 0.93 dB PU-PSNR. These results indicate that our proposed model has higher fidelity than others from the PSNR perspective. PU-SSIM and HDR-VDP3, in turn, are built on the assumptions that structural information and the visibility of the converted pixels, respectively, are what matter. These assumptions may improve the prediction of human judgements of the converted HDR content, but they cannot reflect exact subjective quality assessment results. To verify the effectiveness of the HDR conversion methods, we therefore conducted a subjective quality assessment in accordance with ITU-R BT.500-11 [3].
The main contributions of this paper are threefold:
Our work is the first to apply a Vision Transformer to the SDRTV-to-HDRTV conversion problem, and our method outperforms state-of-the-art methods on this problem by up to 0.79 dB in PSNR and 0.93 dB in PU-PSNR.
This paper shows the advantages of applying network design strategies [49], [50], [51] previously proposed for CNNs to the Swin Transformer [45], [46]. We introduce a new Vision Transformer architecture named the dense Swin Transformer block with SE module (DenSE-STB).
We show that Vision Transformers [46], [48] outperform CNN-based methods on the SDRTV-to-HDRTV conversion problem, supported by quantitative and visualization results.
B. SDRTV-to-HDRTV
Deep-SRITM [1] directly estimates the pixel values of the target HDRTV display from SDRTV inputs. It jointly addresses super-resolution and SDRTV-to-HDRTV conversion in an end-to-end manner: the input image is decomposed into a base layer and a detail layer using the guided filter, a multi-path architecture processes each decomposed layer, and the outputs of the paths are fused to produce the final result. JSI-GAN [2] also jointly learns super-resolution and SDRTV-to-HDRTV conversion; it introduced a GAN-based method whose generator includes a detail restoration sub-network, an image reconstruction sub-network, and a local contrast enhancement sub-network. HDRTVNet [3] simplified the SDRTV and HDRTV content formation pipeline and constructed the inverse pipeline to solve the SDRTV-to-HDRTV conversion problem. HDRTVNet [3] reconstructs HDRTV in three steps, global color mapping, local enhancement, and highlight generation, with an individual CNN for each step.
C. Vision Transformer
1) Vision Transformer in High-Level Vision Tasks
Vision Transformers [44], [45], [47] learn where to focus on input images via their self-attention layers. As described in ViT [44], the self-attention layers learn the global relations between patches. Reference [44] applied a Transformer architecture directly to non-overlapping image patches for high-level vision tasks and achieved impressive performance. Due to the lack of inductive biases in the Transformer, training [44] requires large-scale datasets such as JFT-300M [53]. To overcome this limitation, [54] introduced training strategies for smaller datasets. Reference [45] pointed out that [44] is unsuitable as a general-purpose backbone for dense prediction because its complexity grows quadratically with the input image resolution. Reference [45] therefore designed a new Transformer architecture with shifted window-based self-attention, which performs local self-attention within windows. By varying the window sizes, [45] can extract hierarchical features of input images, as a CNN does, and it achieved a better speed-accuracy trade-off than prior works [44], [54].
2) Vision Transformer in Low-Level Vision Tasks
Reference [48] developed a pre-trained model for image processing named the image processing transformer (IPT) and showed high performance in super-resolution, de-raining, de-noising, and de-blurring. Building on the Swin Transformer [45], SwinIR [46] proposed a new architecture called the residual Swin Transformer block (RSTB), which includes a residual connection inside each block, and recorded state-of-the-art performance on super-resolution, de-noising, and JPEG artifact reduction [46].
D. Architectural Strategy
1) Residual Connection
ResNet [49] proposed learning residuals instead of a direct mapping, mitigating the gradient vanishing problem caused by deeper network architectures. It brought remarkable improvements in high-level vision problems [44], [45], [47]. For low-level vision tasks, the emergence of ResNet [49] reduced learning complexity and model size, because the inputs and targets of low-level vision problems are not very different [22]. Among Vision Transformers, SwinIR [46] proposed an image restoration network with residual connections and showed strong performance.
2) Dense Connection
For each layer in a dense block of DenseNet [50], the feature maps of all preceding layers are reused as the current inputs [50]. The dense connections help alleviate the gradient vanishing problem, enhance signal propagation, and encourage feature reuse [50]. By preserving richer information that could otherwise be distorted during forward propagation, they brought gains in both high-level and low-level vision tasks. Following the success of dense connections in CNNs, T2T [47] proposed ViT [44] with dense connections (ViT-Dense) but failed to show positive results. Despite the failure reported in T2T [47], the re-usability of preceding features might still be helpful in a Swin Transformer [45] based architecture because of its ability to extract hierarchical feature maps similar to a CNN.
3) Squeeze and Excitation Module
The Squeeze-and-Excitation (SE) module [51] was proposed to model the interdependence and interaction between different channels of the feature maps. In the SE module [51], the input feature maps are squeezed into channel-wise scalars using global average pooling (GAP), and these values are fed into two dense layers to compute channel-wise scaling factors for the input feature maps. Re-scaling the feature maps improves the representation ability of CNNs. In our work, the SE module is expected to be effective for handling the increased number of channels produced by dense connections.
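As a concrete illustration, the following is a minimal PyTorch sketch of an SE module; the class name `SEModule` and the reduction ratio `r` are our own illustrative choices, not values taken from the paper.

```python
# A minimal sketch of a squeeze-and-excitation (SE) module [51].
import torch
import torch.nn as nn

class SEModule(nn.Module):
    def __init__(self, channels: int, r: int = 4):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)     # global average pooling
        self.excite = nn.Sequential(               # two dense layers
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),                          # channel-wise factors in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = self.squeeze(x).view(b, c)             # squeeze: (B, C)
        w = self.excite(s).view(b, c, 1, 1)        # excitation: per-channel weights
        return x * w                               # re-scale the input feature maps

# usage: out = SEModule(60, r=4)(torch.randn(1, 60, 64, 64))
```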
Preliminary
The term SDRTV-to-HDRTV was first coined in HDRTVNet [3]. Following HDRTVNet [3], we denote any content, including images and videos, under SDR or HDR TV standards as SDRTV and HDRTV, respectively. While LDR-to-HDR aims to compute the target luminance values in the linear space, SDRTV-to-HDRTV directly predicts the pixel values of HDRTV encoded under HDR TV standards such as HDR10, HLG, and Dolby Vision [3]. The process by which SDRTV and HDRTV are generated from a common source is illustrated in Fig. 1 (b). HDRTVNet [3] simplified the formation of SDRTV and HDRTV into four key operations: tone mapping, gamut mapping, the opto-electronic transfer function, and quantization.
A. Tone-Mapping
Tone mapping can be global or local. Global tone mapping maps all pixel values with the same transformation function, whereas local tone mapping incurs high computational cost and artifacts when computing local features and contexts, so HDRTVNet [3] considers only global tone mapping. The parameters are mostly derived from global image statistics [55], human-designed S-shaped curves [39], or lookup tables [56]. Tone mapping can be formulated as:\begin{equation*} I_{tS} = T_{S}(I_{S}\vert \theta _{S}), \quad I_{tH} = T_{H}(I_{H}\vert \theta _{H}),\tag{1}\end{equation*} where $T_{S}$ and $T_{H}$ are the global tone-mapping functions of the SDRTV and HDRTV branches and $\theta _{S}$, $\theta _{H}$ are their parameters.
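To make the "global" property concrete, the toy sketch below applies one parametric curve identically to every pixel. The specific Reinhard-style curve and the `exposure` parameter are illustrative assumptions and are not the curves used in [3], [39], or [55].

```python
# A toy illustration of *global* tone mapping: a single compressive curve
# controlled by one global parameter, applied to every pixel (cf. Eq. (1)).
import numpy as np

def global_tone_map(img: np.ndarray, exposure: float = 1.0) -> np.ndarray:
    """Map linear raw intensities in [0, inf) into [0, 1) with one global curve."""
    x = exposure * img
    return x / (1.0 + x)      # the same transformation for all pixels

# usage: raw = np.random.rand(4, 4, 3) * 10.0; sdr_like = global_tone_map(raw, 0.8)
```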
B. Gamut Mapping
Gamut mapping is a color transformation process; a wider color gamut can represent more vivid colors. Following the ITU-R standards [41] and [42], the process can be written as:\begin{equation*} \begin{bmatrix} 0.6370 & 0.1446 & 0.1689\\ 0.2627 & 0.6780 & 0.0593\\ 0 & 0.0281 & 1.0610 \end{bmatrix} \begin{bmatrix} R_{2020} \\ G_{2020} \\ B_{2020} \end{bmatrix} = \begin{bmatrix} 0.4124 & 0.3576 & 0.1805 \\ 0.2126 & 0.7152 & 0.0722 \\ 0.0193 & 0.1192 & 0.9505 \end{bmatrix} \begin{bmatrix} R_{709} \\ G_{709} \\ B_{709} \end{bmatrix}\tag{2}\end{equation*}
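Since both matrices in Eq. (2) map linear RGB to CIE XYZ, the Rec.709-to-Rec.2020 conversion amounts to multiplying each pixel by $M_{2020}^{-1} M_{709}$. The sketch below implements this using the matrix entries from Eq. (2); the function and variable names are ours.

```python
# A sketch of the gamut mapping step in Eq. (2).
import numpy as np

M2020 = np.array([[0.6370, 0.1446, 0.1689],
                  [0.2627, 0.6780, 0.0593],
                  [0.0000, 0.0281, 1.0610]])   # Rec.2020 RGB -> XYZ
M709 = np.array([[0.4124, 0.3576, 0.1805],
                 [0.2126, 0.7152, 0.0722],
                 [0.0193, 0.1192, 0.9505]])    # Rec.709 RGB -> XYZ

def rec709_to_rec2020(rgb709: np.ndarray) -> np.ndarray:
    """Convert linear Rec.709 RGB (..., 3) to linear Rec.2020 RGB."""
    M = np.linalg.solve(M2020, M709)           # M2020^{-1} @ M709
    return rgb709 @ M.T

# usage: print(rec709_to_rec2020(np.array([1.0, 0.0, 0.0])))  # pure Rec.709 red
```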
C. Opto-Electronic Transfer Function
The opto-electronic transfer function (OETF) converts linear optical signals into non-linear electronic signals. We use the gamma-inverse function [41] for SDRTV and the PQ OETF [42] of the HDR10 standard for HDRTV:\begin{equation*} f_{S}(I) = I^{1/2.2}, \quad f_{H}(I) = \left({\frac {c_{1} + c_{2}I^{m_{1}}}{1 + c_{3}I^{m_{1}}}}\right)^{m_{2}}\tag{3}\end{equation*}
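The sketch below implements the two OETFs of Eq. (3). The gamma exponent 1/2.2 is from the text; the PQ constants are the standard SMPTE ST 2084 values, which we assume here since the paper does not list them explicitly.

```python
# A sketch of the two OETFs in Eq. (3).
import numpy as np

def oetf_sdr(I: np.ndarray) -> np.ndarray:
    """Gamma-inverse OETF for SDRTV: f_S(I) = I^(1/2.2), I in [0, 1]."""
    return np.power(I, 1.0 / 2.2)

def oetf_pq(I: np.ndarray) -> np.ndarray:
    """PQ OETF (HDR10) for HDRTV: f_H(I), I as normalized linear luminance in [0, 1]."""
    m1, m2 = 2610 / 16384, 2523 / 4096 * 128            # ST 2084 exponents
    c1, c2, c3 = 3424 / 4096, 2413 / 4096 * 32, 2392 / 4096 * 32
    Im1 = np.power(I, m1)
    return np.power((c1 + c2 * Im1) / (1.0 + c3 * Im1), m2)

# usage: I = np.linspace(0.0, 1.0, 5); print(oetf_sdr(I)); print(oetf_pq(I))
```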
D. Quantization
In the final step, the pixel values are quantized to the desired bit depth $n$ as:\begin{equation*} Q(I, n) = \frac {\lfloor (2^{n} - 1)\,I + 0.5\rfloor }{2^{n} - 1},\tag{4}\end{equation*}
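A minimal sketch of Eq. (4): multiplying by $2^{n}-1$, adding 0.5, and taking the floor rounds each value to the nearest $n$-bit code, which is then mapped back to [0, 1].

```python
# A sketch of the quantization step in Eq. (4).
import numpy as np

def quantize(I: np.ndarray, n: int) -> np.ndarray:
    levels = 2 ** n - 1
    return np.floor(levels * I + 0.5) / levels

# usage: quantize(np.array([0.1234, 0.9]), 8)  # -> array([0.12156863, 0.90196078])
```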
The SDRTV and HDRTV formation pipelines can be simplified as:\begin{align*} I_{S}&=Q_{S}\circ f_{S} \circ M_{S} \circ T_{S} \circ (I_{RAW}), \tag{5}\\ I_{H}&=Q_{H}\circ f_{H} \circ M_{H} \circ T_{H} \circ (I_{RAW}),\tag{6}\end{align*}
Accordingly, the ideal SDRTV-to-HDRTV conversion corresponds to inverting the SDRTV formation pipeline and re-applying the HDRTV formation pipeline:\begin{align*} I_{H} = Q_{H}\circ f_{H} \circ M_{H} \circ T_{H} \circ T^{-1}_{S}\circ M^{-1}_{S} \circ f^{-1}_{S} \circ Q^{-1}_{S} \circ (I_{S}), \tag{7}\end{align*}
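Conceptually, Eqs. (5)-(7) are compositions of the four operations above. The sketch below expresses this composition with placeholder callables (identity stubs standing in for T, M, f, and Q; see the earlier sketches for concrete versions); learning the inverse composition on the right-hand side of Eq. (7) is what an SDRTV-to-HDRTV network approximates end to end.

```python
# A conceptual sketch of the pipeline compositions in Eqs. (5)-(7).
from typing import Callable
import numpy as np

Array = np.ndarray
Op = Callable[[Array], Array]

def compose(*ops: Op) -> Op:
    """Right-to-left composition: compose(Q, f, M, T)(x) == Q(f(M(T(x))))."""
    def piped(x: Array) -> Array:
        for op in reversed(ops):
            x = op(x)
        return x
    return piped

# Placeholder operators (identity stubs; illustrative only).
T_s = M_s = f_s = Q_s = lambda x: x      # SDRTV branch, Eq. (5)
T_h = M_h = f_h = Q_h = lambda x: x      # HDRTV branch, Eq. (6)

sdr_formation = compose(Q_s, f_s, M_s, T_s)   # I_S = Q_S ∘ f_S ∘ M_S ∘ T_S (I_RAW)
hdr_formation = compose(Q_h, f_h, M_h, T_h)   # I_H = Q_H ∘ f_H ∘ M_H ∘ T_H (I_RAW)
```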
Proposed Method
A. Network Architecture
Based on SwinIR [46], our proposed architecture consists of three sub-modules: shallow feature extraction, deep feature extraction, and image reconstruction. The overall network architecture is shown in Fig. 2.
1) Shallow Feature Extraction
Vision Transformers often adopt hybrid architectures [44], [45], [46], [47], [48]. Rather than using raw image patches as inputs, input sequences formed by an early convolution lead to more stable optimization [58]. The early convolution also increases non-linearity and maps the input to a higher-dimensional space [46]. We set a single convolutional layer as the shallow feature extraction module.
2) Deep Feature Extraction
Following the line of work that replaces fully convolutional networks with Vision Transformer architectures, a stack of Swin Transformer blocks performs deep feature extraction instead of convolutional filters [46]. This module contains a convolutional layer at its end, which injects the inductive biases of convolution, in particular locality and translational equivariance, since self-attention can be viewed as a spatially varying operation [46].
3) Image Reconstruction Layer
Through the global residual connection around the deep feature extraction module, the low-frequency features from the shallow feature extraction module are combined with the output features of the deep feature extraction module. This allows the deep feature extraction module to focus on recovering lost high-frequency details. Finally, the image reconstruction layer takes the combined features as input. We placed a convolutional layer at the end of the network to map the combined features to the HDRTV output.
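Putting the three modules together, the following is a minimal sketch of the overall layout under our reading of the text; the 3x3 kernel sizes, the class name, and the identity placeholders standing in for the dense Swin Transformer blocks described in the next subsection are illustrative assumptions.

```python
# A minimal sketch of the three-module layout: shallow feature extraction,
# deep feature extraction with a global residual, and image reconstruction.
import torch
import torch.nn as nn

class DenseSwinHDRSketch(nn.Module):
    def __init__(self, in_ch: int = 3, dim: int = 60, num_blocks: int = 4):
        super().__init__()
        self.shallow = nn.Conv2d(in_ch, dim, 3, padding=1)       # early convolution
        self.deep = nn.Sequential(
            *[nn.Identity() for _ in range(num_blocks)],         # DSE-STBs would go here
            nn.Conv2d(dim, dim, 3, padding=1),                   # conv at the end of the deep module
        )
        self.reconstruct = nn.Conv2d(dim, in_ch, 3, padding=1)   # reconstruction layer

    def forward(self, sdr: torch.Tensor) -> torch.Tensor:
        shallow = self.shallow(sdr)
        deep = self.deep(shallow) + shallow                      # global residual connection
        return self.reconstruct(deep)

# usage: out = DenseSwinHDRSketch()(torch.randn(1, 3, 64, 64))  # -> (1, 3, 64, 64)
```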
B. Dense Swin Transformer Block
Motivated by [46], [47], [49], [50], and [51], we propose a densely connected Swin Transformer block with a squeeze-and-excitation module (DSE-STB). We take the residual Swin Transformer block (RSTB) introduced in SwinIR [46] as our baseline. A comparison of the architectures is shown in Fig. 3.
1) Residual Swin Transformer Block
An RSTB consists of Swin Transformer layers (STL), a convolution layer, and a residual connection in sequence. The residual connections in the preceding blocks allow deeper blocks to progressively focus on higher-frequency information [49].
2) Dense Swin Transformer Block
We add dense connections inside the RSTB and name the result the dense Swin Transformer block (DSTB). The dense connections propagate richer information by passing the feature maps of all preceding Vision Transformer layers forward unchanged [50]. We expect the feature re-usability of dense connections to help preserve global and local information that could otherwise be distorted by successive self-attention operations.
3) SE Swin Transformer Block
To account for the interdependence between channels of the feature maps, we place a squeeze-and-excitation module [51] after the self-attention operations inside each block. We denote the RSTB with the SE module as SE-STB and the DSTB with the SE module as DSE-STB, respectively. We expect the SE module [51] to be more effective with the larger number of channels generated by the dense connections [50].
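To summarize the block design, the following is a minimal PyTorch sketch of a DSE-STB as we read it: Swin Transformer layers (replaced here by a trivial placeholder `STL`) are densely connected, a 1x1 convolution fuses the concatenated features back to the embedding dimension, an SE module re-weights the channels, and a residual connection closes the block. The fusion layer, the placement of the SE module, and all names are our assumptions rather than a verbatim reproduction of the implementation.

```python
# A sketch of the proposed DSE-STB (dense connections + SE + residual).
import torch
import torch.nn as nn

class STL(nn.Module):
    """Placeholder for a Swin Transformer layer producing `out_dim` channels."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(in_dim, out_dim, 1)   # stand-in for window attention + MLP
    def forward(self, x):
        return self.proj(x)

class SEModule(nn.Module):
    """Same idea as the SE sketch above, written with 1x1 convolutions."""
    def __init__(self, ch: int, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch // r, 1),
                                nn.ReLU(inplace=True), nn.Conv2d(ch // r, ch, 1), nn.Sigmoid())
    def forward(self, x):
        return x * self.fc(x)

class DSESTB(nn.Module):
    def __init__(self, dim: int = 60, growth: int = 30, num_layers: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            [STL(dim + i * growth, growth) for i in range(num_layers)])  # densely connected inputs
        self.fuse = nn.Conv2d(dim + num_layers * growth, dim, 1)         # fuse back to `dim` channels
        self.se = SEModule(dim)
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))   # reuse all preceding feature maps
        out = self.conv(self.se(self.fuse(torch.cat(feats, dim=1))))
        return out + x                                     # residual connection

# usage: y = DSESTB()(torch.randn(1, 60, 64, 64))  # shape preserved: (1, 60, 64, 64)
```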
Experiments
HDRTVNet [3] introduced a dataset for the SDRTV-to-HDRTV conversion problem. It consists of 22 pairs of SDRTV and HDRTV videos collected from YouTube. More details can be found in Table 1. In our work, we down-scaled the 4K (3840 x 2160) frames before training and evaluation.
A. Experimental Setup
1) Training Details
We randomly crop the input images into fixed-size patches for training.
2) Architecture Details
Each Swin Transformer block contains 6 layers. We set the number of heads to 6 for multi-head self-attention, and the embedding dimension to 60. For the shifted-window self-attention operation, we set the window size to 8. For the dense connections, the growth rate is 30.
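For reference, these hyperparameters can be gathered into a single configuration; the key names below are our own, and only the values come from the text.

```python
# Architecture hyperparameters stated above, collected for reference.
ARCH_CONFIG = {
    "layers_per_block": 6,   # Swin Transformer layers in each block
    "num_heads": 6,          # heads for multi-head self-attention
    "embed_dim": 60,         # embedding dimension
    "window_size": 8,        # shifted-window self-attention window size
    "growth_rate": 30,       # channels added by each densely connected layer
}
```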
B. Evaluation of SDRTV-to-HDRTV
1) Evaluation Metrics
We evaluated our results with five metrics: peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), HDR-VDP3 [60], PU-PSNR, and PU-SSIM [61]. PSNR is the most widely used score in image restoration; it measures the pixel-wise error between the restored and ground-truth images. SSIM measures the objective quality of a distorted image given the reference image based on the structural similarity between the two, combining luminance, contrast, and structure scores. HDR-VDP3 is a newer version of HDR-VDP2 [60] that supports the Rec.2020 [42] color gamut. To compute HDR-VDP3 scores, we set the task to "side-by-side", the color encoding to "rgb-bt.2020", the pixels per degree to 50, and the "rgb-display" option to "led-lcd-wcg" [3]. Regarding PU-PSNR and PU-SSIM, the linear color values of HDR images must not be fed directly into metrics designed for standard dynamic range (SDR) images (such as PSNR, SSIM, and MS-SSIM), since they are not perceptually uniform. The absolute linear RGB values are therefore encoded with PU21 to make them more perceptually uniform and usable with the SDR metrics.
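For completeness, a minimal sketch of the PSNR computation is given below, assuming images scaled to [0, 1]; PU-PSNR applies the PU21 encoding [61] to the absolute linear values first (not reproduced here) and then computes the same formula on the encoded values.

```python
# A minimal PSNR sketch for images normalized to [0, 1].
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, peak: float = 1.0) -> float:
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# usage: gt = np.random.rand(64, 64, 3); print(psnr(np.clip(gt + 0.01, 0, 1), gt))  # ~40 dB
```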
2) Visualization
Since the original HDRTV images are encoded as 16-bit PNG files and are decoded with a gamma electro-optical transfer function (EOTF) on SDR displays, they look comparatively dimmer than on HDR displays. Because tone-mapped HDRTV images may lead to unfair comparison [3], we do not attach any tone-mapped renderings for comparison.
C. Experimental Results
While Vision Transformers [44], [45], [46], [47], [48] extract features with self-attention mechanisms, CNNs extract features with convolution operations. To compare their feature extraction power, we first focused on the base architectures. The Vision Transformer methods, SwinIR [46] and ours, recorded higher PSNR and SSIM scores than the CNN-based methods. In terms of the HDR-VDP3 score, we found that methods including the ResNet [49] architecture recorded higher scores than the others, and that the GAN [62] architecture used in the highlight generation phase amplified the distortion. The PSNR, SSIM, and HDR-VDP3 results are shown in Table 2. Our method also recorded a higher PU-PSNR than AGCM [3] but a lower PU-SSIM; the PU-PSNR and PU-SSIM results are shown in Table 3. We showed that Vision Transformer methods are better than CNN-based methods on the SDRTV-to-HDRTV conversion problem and that Swin Transformer [45] based methods can be boosted by traditional architecture modification strategies [49], [50], [51].
D. Visualization Results
To verify that our method delivers better quality, we evaluated our visualization results in the three aspects introduced in [3].
1) Global Tone Mapping
While the AGCM module of [3] is guided to learn the global statistics of the input images through a global feature extraction module, the Vision Transformer methods [46] were trained only on cropped parts of images. Despite the absence of global information during training, our method showed better visual quality in terms of global tone mapping than [3], [49], [56]. This could be crucial evidence that our training strategy, which uses randomly cropped images at training time, builds up better global statistics than the global features extracted by the convolutional networks in [49], [56], and [3]. Visualization results for global tone mapping are illustrated in Fig. 4, where we find that the CNN-based methods in particular have difficulty properly mapping bright regions.
2) Local Enhancement
To evaluate the ability to enhance local features, we focused on sharpened edges and enhanced contrast. As in the single-image super-resolution problem [8], [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21], [22], HDRTV content should deliver better quality, thanks to the wider dynamic range, than SDRTV content. While [3] produced blurred output, our method restored more local details. Visualization results for the local enhancement comparison are illustrated in Fig. 5.
3) Highlight Generation
Most of the details in saturated areas are lost in SDRTV content. To restore the details in saturated regions, previous works [1], [2], [3] often relied on masks or GAN architectures [62]. However, computing masks at runtime adds complexity, and GAN architectures may generate unnatural artifacts. We believed that Vision Transformer methods can generate the lost details using additional information such as positional encoding and the global inter-relationships between patches. As a result, our method mostly outperformed the others in highlight generation. Although [3] used a GAN architecture for reconstruction, it could not recover enough details. The notable improvements can be seen in Fig. 6.
E. Color Transition Test
We conducted a color transition test to observe whether the color gamut mapping from Rec.709 [41] to Rec.2020 [42] is performed properly. Comparing color histograms or scatter plots of pixel values could be helpful, but it is not easy to analyze the one-to-one relationship between the pixels of SDRTV and HDRTV. Most CNN-based methods except ResNet [49] showed smooth transition results, as shown in Fig. 7. The output of ResNet [49] was severely damaged in the blue color region. Reference [3] attributed this phenomenon to blue colors losing information more easily during compression. Violet colors produced by the CNN-based methods often appeared brighter than expected, whereas those of the Vision Transformers showed much more natural transitions. However, the Vision Transformer methods suffered the most unnatural transitions in orange colors, which tended to be converted to dark yellow.
Fig. 7. Result of the color transition test. The input image is artificially generated by code. Methods that capture global features ([3], [46], [56], and ours) showed natural transition results, whereas ResNet [49], which uses only local features, failed to map the colors properly. Vision Transformer methods ([46] and ours) commonly have difficulty restoring blue or yellow colors.
F. Ablation Study
To verify the effect of each component [49], [50], [51], we performed an ablation study at the layer level. RSTL scored higher than the CNN-based methods, as shown in Table 2. We first expected the SE module [51] to bring a performance boost, as described in [47], which introduced a new Vision Transformer architecture that adds an SE module to the basic layer of ViT [44]. SE-STL scored higher than the CNN-based methods but did not improve over RSTL. In [47], dense connections hindered the training of the Vision Transformer. Since the SE-STL experiment showed the opposite result to [47], we expected dense connections to be beneficial for low-level problems. Indeed, DSTL recorded a higher score than RSTL; the results are shown in Table 4. To use the richer information of the dense connections effectively, we added the SE module to DSTL (DSE-STL). As expected, DSE-STL scored the highest in PSNR and SSIM, by margins of 0.79 dB and 0.0014, respectively.
G. User Study
To further confirm the perceptual quality, we conducted a user study comparing our approach to AGCM [3] and SwinIR [46]. Fifteen people (5 women and 10 men) took part in the study. Twenty images were chosen at random from the synthesized HDR images. We showed participants a ground-truth image and a synthesized image and asked them to rate their preference on a scale of -3, -2, -1, 0, 1, 2, 3, with 3 indicating that our result is significantly better and -3 indicating the opposite. As shown in Table 5, our approach outperformed the others in terms of visual quality. We attribute this to the fact that our proposed method utilizes both the Vision Transformer and the CNN architecture, so it handles both global and local mapping with ease.
Discussion
We used images from the dataset of HDRTVNet [3], which consists of frames converted from 22 pairs of SDRTV and HDRTV videos; these images were used for both training and testing. When extending the image results to video, we found that our method works well on videos with negligible temporal distortion, verifying that the proposed method can be applied to the video domain as well.
Conclusion
We introduced DSE-STL as a new Vision Transformer architecture for low-level vision problems. We found that Vision Transformer methods, including DSE-STL, address the SDRTV-to-HDRTV conversion problem better than CNN-based methods [3], [49], [56]. We also showed that architectural strategies proposed for CNNs, such as the residual connection [49], the dense connection [50], and the SE module [51], can boost the performance of Vision Transformers. Through quantitative experiments and visualization results, we verified that our method outperforms prior approaches in terms of both metrics and visual quality.
ACKNOWLEDGMENT
(Joon-ki Bae and Subin Yang contributed equally to this work.)