Introduction
Image quality is one of the fundamental concepts in computer vision and image processing. People want to capture their daily lives in visually appealing photographs, and the performance of computer vision systems such as object detection [1] and segmentation [2], [3] relies heavily on the quality of input images. Unfortunately, the visual quality of images is easily degraded by inadequate shooting environments, such as low-light conditions, and by camera sensor limitations. Image enhancement techniques, which automatically retouch input images to improve their visual quality, have therefore become increasingly popular.
Recently, image enhancement studies have primarily focused on learning-based approaches that train models on pairs of low-quality and high-quality images to learn the mapping between them. In particular, with advances in datasets [4], [5], [6] and deep learning [7], most state-of-the-art techniques [8], [9], [10], [11], [12], [13], [14], [15] train deep neural networks to learn a complex pixel-wise mapping between low-quality and high-quality images and yield promising enhancement results. However, their networks incur substantial computational costs. Since image enhancement algorithms often serve as a pre-processing step in various computer vision systems, it is crucial to make them fast and lightweight.
To address this issue, many attempts [16], [17], [18], [19], [20], [21] have been made to learn a color transformation, which defines a color mapping between input and output images, rather than a pixel-wise mapping between them. This approach models a color transformation controlled by only a few parameters. Instead of outputting enhanced images directly, neural networks predict these parameters, greatly reducing the computational cost of learning-based image enhancement. However, despite these advantages, developing an effective color transformation model remains a challenge. For instance, some methods [18], [19], [20], [21] design the color mapping as a weighted combination of predefined 3-dimensional lookup tables (LUTs) [18], data structures that describe a color mapping. Consequently, their networks only need to predict a small number of LUT weights, enabling real-time processing. Nevertheless, these LUT-based methods may struggle to accurately emulate complex color mappings due to the limited flexibility of the predefined LUTs.
In this paper, we propose an efficient image enhancement algorithm called the improved representative color transformation (RCT++), which extends our previous work [22]. The RCT++ algorithm estimates the representative features associated with the representative colors found in the input image, as well as the transformed colors that indicate the enhanced representative colors in the output image. It then interpolates the output image from the transformed colors based on the similarities between the representative features and per-pixel features. Compared to the original RCT [22], our method has three major improvements. First, we clarify the concept of representative colors by incorporating a reconstruction loss into the RCT framework, which encourages the representative features to correspond to the significant colors of the input image. Second, we introduce an entropy term that measures the diversity of representative features; by maximizing this term, our scheme estimates more diverse representative features, which are useful for enhancing minority colors in the input image. Third, we design an efficient enhancement network that performs the RCT++ algorithm for fast and lightweight image enhancement. Specifically, we implement the RCT++ algorithm with depth-wise separable convolutions [23] to lighten our network. Experimental results on three datasets [4], [5], [6] with different characteristics demonstrate that our RCT++ outperforms existing efficient image enhancement algorithms with comparable computational costs and parameters. The main contributions are summarized as follows:
We propose a novel color transform, RCT++, for image enhancement. RCT++ improves the original RCT [22] by incorporating a reconstruction term and an entropy term.
We develop an efficient image enhancement network. Our network contains about 131K parameters and takes 3 ms to process an image of $480\times 720$ resolution.
RCT++ outperforms state-of-the-art methods in efficient image enhancement on three datasets collected in different shooting environments.
The rest of this paper is organized as follows. Section II reviews related works. Section III describes the proposed algorithm, and Section IV discusses experimental results. Finally, Section V concludes this paper.
Related Work
Image enhancement is a long-standing problem with wide applications, and many attempts have been made to improve enhancement performance. In this section, we briefly review studies on learning-based image enhancement and efficient image enhancement, which are closely related to our work.
A. Learning-Based Image Enhancement
Early learning-based image enhancement methods [24], [25], [26] mainly depend on hand-crafted features, such as intensity, brightness, and the amount of highlight, or on predefined mappings. Dale et al. [24] introduced visual context based on the scale-invariant feature transform (SIFT) [27]. Wang et al. [25] defined tone and color adjustments given a set of examples. Hwang et al. [26] searched for contextually similar images among examples and adjusted the input image via the corresponding transformation functions. However, due to the limited representation power of hand-crafted features, it is difficult for these methods to reliably enhance various input images.
Deep neural networks allow image enhancement methods to learn complex mappings between low-quality and high-quality images. Therefore, over the past decade, many deep learning-based image enhancement methods [8], [10], [11], [14], [15], [28], [29], [30], [31] have been proposed. Yan et al. [8] learned a feature descriptor for each pixel of the input image to incorporate semantic information in retouching. Lore et al. [10] employed a stacked sparse denoising autoencoder, which learns to adaptively enhance and denoise images from synthetically darkened and noise-added training examples. However, these methods [8], [10] often fail to exploit high-level context for image enhancement due to the small receptive fields of their networks.
Chen et al. [28] employed the encoder-decoder structure [32] for image enhancement, in which the encoder gradually performs down-sampling to enlarge the receptive field, and the decoder restores the original resolution while enhancing images. Yang et al. [11] developed two encoder-decoder structures for low-light image enhancement. Based on the retinex theory [33], Wang et al. [29] enhanced an input image by decomposing it into reflectance and illumination and then improving the illumination. Xu et al. [30] proposed a frequency-based decomposition to enhance low-light images. Kim et al. [31] adopted the encoder-decoder structure to perform personalized image enhancement. Tu et al. [14] proposed a general image processing network based on a multi-axis multilayer perceptron (MLP). Cai et al. [15] designed a transformer-based network to leverage the large receptive field of the attention mechanism [34]. These methods [11], [14], [15], [28], [29], [30], [31] yield promising enhancement results. However, they require many parameters and high computational costs, making them difficult to apply to various applications.
B. Efficient Image Enhancement
Efficient image enhancement aims to minimize computational burdens while maintaining the high visual quality of output images. There are two main lines of work. The first focuses on efficient color transformation modeling, which defines color mappings between input and output images rather than pixel-wise mappings between them. Deng et al. [16] defined a piece-wise intensity curve controlled by only a few parameters predicted by a neural network; the small output space reduces the complexity of the learning-based algorithm. Similarly, Kim et al. [17] presented a learnable non-monotonic intensity transformation for both paired and unpaired image enhancement.
Look-up tables (LUTs) are another popular way to model color transformations; they are efficient data structures that enable real-time enhancement [18], [19], [20], [21]. Zeng et al. [18] stored non-linear color transformations in a 3D lattice, which can be accessed through simple indexing and interpolation. Wang et al. [20] integrated 1D-LUTs and 3D-LUTs, considering image-level and pixel-wise transforms, respectively. Yang et al. [21] designed an effective sampling strategy to improve the quality of the output image. Despite these efforts to improve enhancement quality, such methods show limited capability to estimate highly non-linear retouching mappings due to the predefined transformations of the LUTs. In contrast, the proposed method estimates adaptive representative colors according to the input image and predicts a color transformation for each representative color.
The second line of work designs lightweight architectures. Gharbi et al. [35] used a low-resolution image to predict bilateral-grid coefficients and then applied affine transforms at the original resolution. Ma et al. [36] relieved the computational burden of cascaded blocks by sharing weights. Following this line, we compose our network with depth-wise separable convolution layers, resulting in a small model.
Proposed Algorithm
In this section, we first describe the original RCT algorithm [22] and highlight its limitations. We then propose the improved RCT (RCT++) algorithm and develop its network architecture for efficient image enhancement. Finally, we present the loss functions used to train our network.
A. Representative Color Transform
Given an RGB input image $X$, RCT first extracts an image feature map $Z$ using an encoder network $f_{\theta}$,\begin{equation*} Z=f_{\theta }(X). \tag {1}\end{equation*}
From the image feature $Z$, RCT estimates $n$ representative features $F$ and their transformed colors $C_{\mathrm {t}}$ through the networks $g_{\phi}$ and $g_{\psi}$, respectively,\begin{align*} F & = g_{\phi }(Z), \tag {2}\\ C_{\mathrm {t}} & = g_{\psi }(Z). \tag {3}\end{align*}
Note that the transformed colors describe the color mapping only for the $n$ representative colors. To determine the enhanced colors for all input colors, we interpolate them through the scaled dot-product attention mechanism [34]. Specifically, we regard each enhanced color as a weighted sum of the transformed colors, in which the weights are proportional to the feature similarities between the input colors and representative colors. The weight matrix $A$ is computed as\begin{equation*} A = \mathrm {softmax}\left ({{ \frac {ZF^{T}}{\tau } }}\right), \tag {4}\end{equation*} where $\tau$ is a scaling factor. Then, the enhanced image $\widehat {Y}$ is obtained by\begin{equation*} \widehat {Y} = AC_{\mathrm {t}}. \tag {5}\end{equation*}
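To make the attention-based interpolation in (4) and (5) concrete, the following is a minimal PyTorch-style sketch; the tensor shapes, variable names, and omission of the batch dimension are assumptions made for illustration, not the paper's implementation.

```python
import torch

def rct_interpolate(Z, F_rep, C_t, tau=1.0):
    """Sketch of Eqs. (4)-(5): interpolate enhanced colors via scaled
    dot-product attention between per-pixel features and representative
    features.

    Z:     (HW, d)  per-pixel image features
    F_rep: (n, d)   representative features
    C_t:   (n, 3)   transformed (enhanced) representative colors
    """
    # Eq. (4): attention weights between pixels and representative features.
    A = torch.softmax(Z @ F_rep.T / tau, dim=-1)   # (HW, n)
    # Eq. (5): enhanced colors as weighted sums of the transformed colors.
    Y_hat = A @ C_t                                # (HW, 3)
    return Y_hat, A
```

Because each pixel attends only to the $n$ representative features, the interpolation cost scales with $n$ rather than with the number of distinct colors in the input image.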
Since the RCT is fully described by the representative features $F$, transformed colors $C_{\mathrm {t}}$, and attention weights $A$, its effectiveness depends on how these components are estimated. However, the original RCT has two limitations. First, there is no explicit constraint tying the representative features to the actual colors of the input image. Second, the estimated representative features may become similar to one another, which limits their ability to cover minority colors in the input image.
B. Improved Representative Color Transform
To overcome the first limitation, we explicitly define the representative colors of the input image. We introduce a function, denoted as $g_{\omega}$, that estimates the representative colors $C_{\mathrm {r}}$ from the image feature $Z$,\begin{equation*} C_{\mathrm {r}} = g_{\omega }(Z). \tag {6}\end{equation*} We then reconstruct the input image by combining the representative colors with the same attention weights $A$ as in (4),\begin{equation*} \widehat {X} = AC_{\mathrm {r}}. \tag {7}\end{equation*} By minimizing the reconstruction error between $X$ and $\widehat {X}$, we encourage the representative features to correspond to the significant colors of the input image.
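Continuing the sketch above, the reconstruction path in (6) and (7) reuses the same attention weights; again, the names and shapes are illustrative assumptions.

```python
def rct_reconstruct(A, C_r):
    """Sketch of Eq. (7): reconstruct the input colors from the
    representative colors C_r (n, 3) using the attention weights A (HW, n)
    computed in Eq. (4). The result is compared with the input X by the
    reconstruction loss, which is used during training only."""
    return A @ C_r    # (HW, 3)
```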
Next, we address the second limitation of the original RCT by introducing an entropy term that quantifies the diversity among representative features. To this end, we compute a similarity matrix $S$, whose element $s_{ij}$ is the cosine similarity between the representative features $\mathbf {f}_{i}$ and $\mathbf {f}_{j}$,\begin{equation*} s_{ij} = \frac {\mathbf {f}_{i}\mathbf {f}_{j}^{T}}{||\mathbf {f}_{i}|| \, ||\mathbf {f}_{j}||}. \tag {8}\end{equation*}
We then convert the similarities into probabilities through the row-wise softmax function,\begin{equation*} P=\mathrm {softmax}(\tilde {S}), \tag {9}\end{equation*} and define the entropy term as the average entropy over the rows of $P$,\begin{equation*} \mathrm {entropy} = -\frac {1}{n}\sum _{i=1}^{n}\sum _{j=1}^{n} p_{ij} \, \log \, p_{ij}. \tag {10}\end{equation*}
Algorithm 1 An Entropy of Representative Features
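Algorithm 1 summarizes this computation. The snippet below is a minimal PyTorch-style sketch of the same steps; it assumes that the matrix fed to the softmax in (9) is the raw cosine-similarity matrix, and the variable names are placeholders.

```python
import torch

def representative_feature_entropy(F_rep, eps=1e-8):
    """Sketch of Eqs. (8)-(10): average entropy of the pairwise-similarity
    distribution of the n representative features F_rep (n, d)."""
    # Eq. (8): cosine similarity between every pair of representative features.
    F_norm = F_rep / (F_rep.norm(dim=1, keepdim=True) + eps)
    S = F_norm @ F_norm.T                          # (n, n)
    # Eq. (9): row-wise softmax turns similarities into probabilities.
    P = torch.softmax(S, dim=-1)
    # Eq. (10): mean entropy over the n rows.
    return -(P * (P + eps).log()).sum(dim=-1).mean()
```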
Note that the entropy term is maximized when the representative features are dissimilar to each other. By incorporating this entropy term into the RCT method, we effectively promote the diversity of representative features, enabling the method to handle minority colors in the input image. In Section IV, we provide experimental results demonstrating the effectiveness of the entropy term.
C. Enhancement Network
We develop an enhancement network to implement the proposed RCT++ method, aiming for efficient image enhancement. Figure 1 illustrates the overall architecture of our enhancement network, which comprises four modules: an encoder, a representative feature module, a transformed color module, and a representative color module. These modules correspond to $f_{\theta}$, $g_{\phi}$, $g_{\psi}$, and $g_{\omega}$ in (1), (2), (3), and (6), respectively.
An overview of the proposed RCT++ process. The blue dotted line, which reconstructs the input image, is used only during training.
1) Encoder
The encoder embeds the input image $X$ into the image feature $Z$ for the RCT++ process. As shown in Figure 2, the encoder is a residual block [37] with two branches: the first branch is a single convolution layer, while the second branch consists of convolution layers that aggregate information from neighboring pixels.
(a) The detailed structure of the encoder module. (b) The detailed structure of depth-wise separable convolution block.
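As a rough illustration of the two-branch encoder in Figure 2a, the sketch below uses hypothetical kernel sizes and channel widths; only the overall structure (a single-convolution branch plus a branch that considers neighboring pixels, combined residually) follows the description above.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Rough sketch of the two-branch residual encoder (Figure 2a).
    Kernel sizes and channel widths are assumptions, not the paper's values."""
    def __init__(self, in_ch=3, feat_ch=16):
        super().__init__()
        # Branch 1: a single convolution layer.
        self.branch1 = nn.Conv2d(in_ch, feat_ch, kernel_size=1)
        # Branch 2: convolutions that aggregate neighboring-pixel information.
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1),
        )

    def forward(self, x):
        # Residual combination of the two branches yields the image feature Z.
        return self.branch1(x) + self.branch2(x)
```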
2) Representative Feature Module
The representative feature module takes the image feature map $Z$ and estimates the representative features $F$. Specifically, the module first reduces the spatial resolution of the feature map to a fixed, low resolution.
Next, the module feeds the resized feature map into four depth-wise separable convolution (Dwconv) blocks. As depicted in Figure 2, each Dwconv block consists of a depth-wise convolution and a point-wise convolution, which leads to fewer parameters and operations than standard convolutions. These computational and memory savings become more significant in the later stages of the network, where the number of feature channels increases. Subsequently, the module performs global average pooling and then estimates the $n$ representative features through an MLP block.
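The following sketch illustrates a Dwconv block and the overall flow of the representative feature module (resize, four Dwconv blocks, global average pooling, MLP head). The target resolution, channel widths, activation, and feature dimension are assumptions, not the values specified in Table 1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DwConvBlock(nn.Module):
    """Depth-wise separable convolution block (Figure 2b): a depth-wise
    convolution followed by a point-wise (1x1) convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3,
                                   padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class RepresentativeFeatureModule(nn.Module):
    """Sketch of the representative feature module: resize -> 4 Dwconv
    blocks -> global average pooling -> MLP head predicting n
    representative features of dimension d."""
    def __init__(self, feat_ch=16, n=16, d=16, pooled_size=32):
        super().__init__()
        self.pooled_size = pooled_size
        self.blocks = nn.Sequential(*[DwConvBlock(feat_ch, feat_ch)
                                      for _ in range(4)])
        self.mlp = nn.Sequential(nn.Linear(feat_ch, feat_ch),
                                 nn.ReLU(inplace=True),
                                 nn.Linear(feat_ch, n * d))
        self.n, self.d = n, d

    def forward(self, Z):
        x = F.interpolate(Z, size=(self.pooled_size, self.pooled_size),
                          mode='bilinear', align_corners=False)
        x = self.blocks(x)
        x = x.mean(dim=(2, 3))                        # global average pooling
        return self.mlp(x).view(-1, self.n, self.d)  # (B, n, d)
```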
3) Transformed/Representative Color Module
Our design involves the prediction of two color sets: one for enhancing the input images and another for reconstructing them. To achieve this, we design the transformed and representative color modules with identical architectures. As specified in Table 1, these architectures are the same as that of the representative feature module, except that they predict $n$ colors by adjusting the number of neurons in the second linear layer of the MLP block. Note that the representative color module operates only during the training phase and thus does not increase computational or memory load at test time.
D. Loss Function
We train the enhancement network by minimizing the total loss, which comprises four components: a color loss ${\mathcal {L}}_{\mathrm {col}}$, a reconstruction loss ${\mathcal {L}}_{\mathrm {rec}}$, an entropy loss ${\mathcal {L}}_{\mathrm {ent}}$, and a grid frequency loss ${\mathcal {L}}_{\mathrm {freq}}$,\begin{equation*} {\mathcal {L}}= {\mathcal {L}}_{\mathrm {col}} + {\mathcal {L}}_{\mathrm {rec}} + {\mathcal {L}}_{\mathrm {ent}} + {\mathcal {L}}_{\mathrm {freq}}. \tag {11}\end{equation*}
1) Color Loss
The color loss is the mean absolute error between the ground-truth image $Y$ and the predicted image $\widehat {Y}$,\begin{equation*} {\mathcal {L}}_{\mathrm {col}} = ||Y-\widehat {Y}||_{1}. \tag {12}\end{equation*}
2) Reconstruction Loss
The reconstruction loss is the mean absolute error between the input image $X$ and the reconstructed image $\widehat {X}$,\begin{equation*} {\mathcal {L}}_{\mathrm {rec}} = ||X-\widehat {X}||_{1}. \tag {13}\end{equation*}
3) Entropy Loss
We set the entropy loss to the inverse of the entropy term,\begin{equation*} {\mathcal {L}}_{\mathrm {ent}} = \frac {1}{\mathrm {entropy} + \epsilon }, \tag {14}\end{equation*} where $\epsilon$ is a small constant that prevents division by zero.
4) Grid Frequency Loss
To compute the grid frequency loss, we decompose images into $m$ non-overlapping grids. Let $Y_{i}$ and $\widehat {Y_{i}}$ denote the $i$-th grids of the ground-truth and predicted images, respectively, and let $\mathcal {F}$ denote the Fourier transform. The grid frequency loss is then defined as\begin{equation*} {\mathcal {L}}_{\mathrm {freq}} = \sum _{i=1}^{m}{ ||\mathcal {F}(Y_{i})-\mathcal {F}(\widehat {Y_{i}})||_{1}}. \tag {15}\end{equation*}
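A minimal sketch of the grid frequency loss in (15) is given below; the number of grid cells and the exact L1 reduction over the complex spectra are assumptions made for illustration.

```python
import torch

def grid_frequency_loss(Y, Y_hat, grid=4):
    """Sketch of Eq. (15): split images into non-overlapping grid cells and
    compare their 2D Fourier spectra with an L1 penalty.

    Y, Y_hat: (B, 3, H, W), with H and W divisible by `grid`."""
    B, C, H, W = Y.shape
    gh, gw = H // grid, W // grid
    loss = 0.0
    for i in range(grid):
        for j in range(grid):
            y  = Y[...,     i*gh:(i+1)*gh, j*gw:(j+1)*gw]
            yh = Y_hat[..., i*gh:(i+1)*gh, j*gw:(j+1)*gw]
            # L1 difference between the 2D FFTs of each grid cell.
            loss = loss + (torch.fft.fft2(y) - torch.fft.fft2(yh)).abs().mean()
    return loss
```

The total loss in (11) is then the sum of this term and the color, reconstruction, and entropy losses.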
Experiments
In this section, we present a comprehensive evaluation of the proposed method. We compare our method with state-of-the-art methods on three datasets: MIT-Adobe 5K (Adobe5K) [4], Low-Light (LoL) [5], and Underwater Image Enhancement Benchmark (UIEB) [6]. These datasets are collected in a diverse range of shooting environments, providing a faithful evaluation of our method. Furthermore, we study the impact of the proposed components. For quantitative comparison, we use three performance metrics: peak signal-to-noise ratio (PSNR), structural similarity index measure (SSIM), and learned perceptual image patch similarity (LPIPS) [43]. These metrics measure the color, structural, and perceptual similarity between the enhanced image and the ground-truth image, respectively. Also, we evaluate the efficiency of our method compared to existing methods in terms of the number of parameters and runtime.
A. Datasets
1) MIT-Adobe 5K
The Adobe5K [4] dataset contains 5,000 image pairs. Each pair consists of a low-quality image and manually retouched versions by five experts (A/B/C/D/E). For experiments, we use the enhanced images improved by expert C as the ground truth, following the experiment setting of recent image enhancement methods [18], [19], [21]. We split the dataset into 4,500 pairs for training and 500 pairs for validation. As done in [18], we resize each image to have 480 pixels on the short side, while maintaining its aspect ratio.
2) Low Light
The LoL [5] dataset is developed for low-light image enhancement. It comprises 500 pairs of low-light and normal-light images, all taken from real-world scenes. The dataset is divided into 485 pairs for training and 15 pairs for testing. Compared to the Adobe5K dataset, the LoL dataset contains a considerable amount of noise generated during the image capture process. All images in this dataset have a resolution of $400\times 600$.
3) Underwater Image Enhancement Benchmark
The UIEB [6] dataset consists of 950 real-world underwater images, 890 of which have reference images. Thus, we divide the 890 reference pairs into 800 pairs for training and 90 pairs for evaluation, following the previous work [22].
B. Implementation Details
For training, we use the AdamW optimizer [44] with an initial learning rate of 0.0005 and a weight decay of 0.05. We set the batch size to 16. We train for 600 epochs on Adobe5K and 5000 epochs on the other datasets, and schedule the learning rate via cosine annealing. For data augmentation, we randomly crop images and then resize them to a fixed training resolution.
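For reference, the stated optimization setup corresponds to a configuration like the following; the stand-in model and the choice of T_max are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the stated optimization setup: AdamW with an initial
# learning rate of 5e-4 and weight decay 0.05, a cosine-annealing schedule,
# and batch size 16. The model below is a stand-in for the RCT++ network.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 3, 1))
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.05)
# T_max set to the number of training epochs (600 on Adobe5K).
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=600)
batch_size = 16
```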
C. Main Results
We compare the proposed RCT++ with recent state-of-the-art methods. For evaluation, we obtain the results of existing methods by executing their published code. If the code is unavailable, we use the results reported in the corresponding paper.
1) Results on MIT-Adobe 5K
Table 2 compares the proposed RCT++ with recent efficient image enhancement algorithms on the Adobe5K dataset [4]: HDRNet [35], 3D-LUT [18], Sep-LUT [19], and AdaInt [21]. In Table 2, our method establishes the best results on all three metrics. Compared to AdaInt [21], the LUT-based efficient image enhancement algorithm that ranks second in Table 2, our method provides better scores by 0.11, 0.008, and 0.003 in terms of PSNR, SSIM, and LPIPS, respectively, despite having five times fewer parameters. In addition, our network has a size comparable to Sep-LUT [19] but demonstrates superior performance. For a comprehensive evaluation, Figure 3 shows the enhanced images of ours and AdaInt. AdaInt fails to estimate accurate color mappings for the three examples due to the limitation of its predefined lookup tables. In contrast, our method employs a more flexible color transformation model, producing images more similar to the manually retouched ground truth.
2) Results on Low Light
We evaluate the performance of our method under extremely low-light shooting conditions. Table 3 lists quantitative comparisons with existing methods [12], [13], [48], [49], [50], [51] on the LoL dataset [5]. Our method achieves the highest PSNR score, indicating the best color enhancement results. Although Zero-DCE [49] and RUAS [50] are efficient low-light enhancement networks with the lowest and second-lowest numbers of parameters, respectively, they show poor performance. In contrast, our method uses a small number of parameters yet achieves superior performance across various metrics. Our method shows relatively lower performance on SSIM and LPIPS, which are more sensitive to the sensor noise caused by low-light shooting conditions. This is because RCT++ models a global color transformation and is thus less effective at denoising than spatial filtering-based methods. However, this problem can be mitigated by simple denoising techniques as a post-processing step. As shown in Table 3, a simple bilateral filter improves our method on all performance metrics: PSNR, SSIM, and LPIPS scores increase to 22.47 dB, 0.825, and 0.156, respectively.
Figure 4 qualitatively compares the enhancement results of our method with those of KIND++ [13], the second best algorithm in Table 3. For all input images, KIND++ [13] fails to faithfully restore the ground-truth images, whereas the proposed RCT++ produces enhanced images with color tones more similar to the ground truth. Our results contain slightly more noise than those of KIND++, but this noise can be suppressed through simple post-processing with a bilateral filter.
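The bilateral-filter post-processing mentioned above can be applied with an off-the-shelf implementation such as OpenCV; the file names and filter parameters below are illustrative placeholders, not the values used in our experiments.

```python
import cv2

# Optional post-processing: a simple bilateral filter suppresses the sensor
# noise that remains after global color enhancement.
enhanced = cv2.imread("enhanced.png")                       # RCT++ output
denoised = cv2.bilateralFilter(enhanced, d=9, sigmaColor=75, sigmaSpace=75)
cv2.imwrite("enhanced_denoised.png", denoised)
```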
3) Results on Underwater Image Benchmark
Finally, we assess the enhancement results of the proposed RCT++ on underwater images to validate its scalability to various shooting environments. Table 4 summarizes the quantitative results of RCT++ and the existing algorithms [6], [52], [53], [54] on the UIEB dataset [6]. Our method exceeds the conventional algorithms on all performance metrics with significantly fewer parameters. Figure 5 illustrates the enhanced results on the UIEB dataset. RCT++ effectively corrects the biased hue caused by underwater shooting environments. In the first and second rows, in which the input images are predominantly biased toward a blue tone, our method successfully balances the overall hue and restores the original colors of the shark and coral. In the third row, our method enhances the input image to a color tone similar to the ground truth.
D. Ablation Study
We perform ablation studies on the Adobe5K [4] dataset to analyze our design. For all ablation studies, we use the same experiment settings as in Section IV-B.
1) Loss Functions
Table 5 reports the performance of the proposed method with different combinations of loss terms. The model trained with only the color loss ${\mathcal {L}}_{\mathrm {col}}$ serves as the baseline.
Figure 6 visualizes the enhanced images and error maps obtained with different loss combinations. Figure 6b shows the output image obtained with only the color loss ${\mathcal {L}}_{\mathrm {col}}$.
2) Encoder Design
Next, we study the effectiveness of the encoder structure. To this end, we detach either the first or the second branch from the encoder module. In Table 6, the second branch improves the performance on all three metrics, which means that considering neighboring pixels helps generate better image features for the RCT++ process: it allows RCT++ to enhance identical input colors to different transformed colors and helps mitigate the negative effects of noise in the input colors. Note that the improvement is much more significant in SSIM (5.0%) and LPIPS (10.1%), which are more sensitive to noise levels, than in PSNR (1.4%). In Table 6, the best result is obtained when we use the outputs of both branches.
Conclusion
We presented a novel algorithm, called the improved representative color transform (RCT++), for efficient image enhancement. The algorithm predicts image-adaptive representative colors, their features, and their transformed colors. It then interpolates output colors for all pixels based on the similarities between representative features and image features. Compared to our conference version [22], this work clarifies the role of representative colors and diversifies the representative features by introducing the reconstruction and entropy losses, respectively. We also developed a fast and lightweight enhancement network for efficient processing. We validated the effectiveness and efficiency of our method through extensive experiments on three different image enhancement datasets [4], [5], [6]. Notably, our method outperforms existing efficient image enhancement methods with comparable memory and computation costs.