Introduction
Atmospheric scattering occurs when sunlight enters the atmosphere and is diffused in all directions by collisions with suspended particles. Scattering, coupled with absorption, degrades digital images, resulting in various types of degradation, such as faded colors, low contrast, and loss of detail. Although the degree of degradation depends on the size of the atmospheric particles, which varies with weather conditions [1], this phenomenon is widely referred to as haze. On the one hand, haze obscures distant objects and reduces the visibility perceived by human visual systems. On the other hand, it also affects high-level computer vision applications that assume clean input image/video data, as pointed out in [2]. Hence, image dehazing is a research branch focusing on alleviating the adverse effects of haze. Fig. 1 demonstrates the dehazing results of a real-world hazy image using a deep learning model [3] and the proposed method. Our result is more favorable to human visual systems because haze has been removed efficiently while fine details have been satisfactorily recovered (in the blue-cropped region).
A real-world hazy image and its corresponding dehazing results by Ren et al. [3] and the proposed method. MS-CNN stands for the multi-scale convolutional neural network.
A. Degradation Model
Reference [4] formalizes the haze-induced degradation by a model comprising the direct attenuation and the airlight, denoted as blue and red in Fig. 2, respectively. The former describes how the reflected light at a particular wavelength $\lambda$ is attenuated along its path to the camera: the fractional change in irradiance $E_{a}(r_{1},\lambda)$ over an infinitesimal distance $\mathrm{d}r_{1}$ is proportional to the scattering coefficient $\beta_{sc}$, \begin{equation*} \frac {\mathrm {d}E_{a}(r_{1},\lambda)}{E_{a}(r_{1},\lambda)} = -\beta _{sc}\mathrm {d}r_{1}, \tag{1}\end{equation*}
Integrating both sides of (1) over the path from the scene point ($r_{1}=0$) to the camera ($r_{1}=d$) gives \begin{equation*} \int _{E_{a}(0,\lambda)}^{E_{a}(d,\lambda)} \frac {\mathrm {d}E_{a}(r_{1},\lambda)}{E_{a}(r_{1},\lambda)} = \int _{0}^{d} -\beta _{sc}\mathrm {d}r_{1}. \tag{2}\end{equation*}
Solving (2) and noting that the irradiance at the scene point is $E_{a}(0,\lambda)=\Omega_{\lambda} S_{0} F_{\lambda}$, the attenuated irradiance reaching the camera at $r_{1}=d$ becomes \begin{equation*} E_{a}(d,\lambda) = \Omega _{\lambda} S_{0} F_{\lambda} \, \mathrm {exp}(-\beta _{sc}d). \tag{3}\end{equation*}
The other part, the airlight, indicates the portion of light reflected from the terrain surface or scattered directly toward the camera. The irradiance $\mathrm{d}E_{s}(r_{2},\lambda)$ contributed by an infinitesimal atmospheric section of thickness $\mathrm{d}r_{2}$ located at distance $r_{2}$ from the camera is \begin{equation*} \mathrm {d}E_{s}(r_{2},\lambda) = \Omega _{\lambda} S_{0} \beta _{sc} \, \mathrm {exp}(-\beta _{sc}r_{2}) \mathrm {d}r_{2}. \tag{4}\end{equation*}
Integrating over the entire path from the camera ($r_{2}=0$) to the scene point ($r_{2}=d$) gives \begin{align*} E_{s}(d,\lambda)=&\Omega _{\lambda} S_{0} \beta _{sc} \int _{0}^{d} \mathrm {exp}(-\beta _{sc}r_{2}) \mathrm {d}r_{2} \\=&\Omega _{\lambda} S_{0}[1-\mathrm {exp}(-\beta _{sc}d)]. \tag{5}\end{align*}
The total irradiance $E_{t}(d,\lambda)$ received by the camera is the sum of the direct attenuation (3) and the airlight (5):\begin{align*} E_{t}(d,\lambda) = \Omega _{\lambda} S_{0} F_{\lambda} \, \mathrm {exp}(-\beta _{sc}d) + \Omega _{\lambda} S_{0}[1-\mathrm {exp}(-\beta _{sc}d)]. \\{}\tag{6}\end{align*}
For ease of representation, it is convenient to substitute the image-domain quantities: the observed intensity $\mathbf{I}(x)$ for $E_{t}$, the scene radiance $\mathbf{J}(x)$ for $\Omega_{\lambda} S_{0} F_{\lambda}$, the atmospheric light $\mathbf{A}$ for $\Omega_{\lambda} S_{0}$, and the transmittance $t(x)$ for $\mathrm{exp}(-\beta_{sc}d)$, which yields the widely used degradation model \begin{equation*} \mathbf {I}(x) = \mathbf {J}(x)t(x) + \mathbf {A}[1-t(x)]. \tag{7}\end{equation*}
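For illustration, the degradation model in (7) can be simulated directly. The following Python sketch synthesizes a hazy image from a clean one; the depth map, scattering coefficient, and atmospheric light used here are illustrative assumptions rather than values used in this paper.

```python
import numpy as np

def synthesize_haze(J, depth, A=(0.9, 0.9, 0.92), beta_sc=1.2):
    """J: clean RGB image in [0, 1] (H x W x 3); depth: per-pixel scene depth (H x W)."""
    A = np.asarray(A, dtype=float)
    t = np.exp(-beta_sc * depth)[..., None]   # transmittance t(x) = exp(-beta_sc d(x))
    return J * t + A * (1.0 - t)              # direct attenuation + airlight, as in (7)
```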
B. Ill-Posedness
The concept of ill-posedness dates back to [5], and a mathematical problem is called ill-posed (or incorrectly posed) if at least one of the following conditions for its solution fails:
The existence of a solution
The uniqueness of the solution
The stability of the solution, i.e., its continuous dependence on the data
In (7), the hazy intensities $\mathbf{I}(x)$ are the only known quantities, whereas the scene radiance $\mathbf{J}(x)$, the atmospheric light $\mathbf{A}$, and the transmittance $t(x)$ are all unknown. A single equation thus admits infinitely many solutions, violating the uniqueness condition and rendering single image dehazing an ill-posed problem.
Furthermore, for a low-level vision task such as image dehazing, deep neural networks (DNNs) are often overkill, as discussed in [13] regarding deep learning versus traditional computer vision techniques. DNNs are best suited to high-level cognitive tasks, such as object classification, recognition, and localization. Their data-driven nature can also be more of a hindrance than a help because the abstract features they learn are specific to the training dataset, whose construction is highly cumbersome if statistical reliability is to be ensured. Thus, the learned features may not generalize to images that differ from those in the training set, lowering performance in practice.
Related Works
This section briefly reviews influential works in the literature based on the categorization in [14], where algorithms have been divided into three categories according to their data exploitation. The first two, image processing and machine learning, were typified by low-level hand-engineered image features discovered through statistical analysis of real-world images. The last category, deep learning, exploited the powerful representation capability of DNNs to learn high-level data-driven image features. This categorization could give useful insights into (i) the complexity of dehazing algorithms and (ii) subjective/objective preferences for dehazed images. Generally, image processing and machine learning-based methods possess low complexity and favor human perception. Deep learning-based methods, on the contrary, are computationally costly and favor image quality assessment metrics.
A. Image Processing
The dark channel prior [15], one of the most influential works in image dehazing, is a prime example of the first category. He et al. [15] observed natural haze-free images and discovered that the dark channel, calculated as the local minimum of the per-pixel minimum color channel, tended to approach zero at non-sky image patches. This finding, coupled with the degradation model, offered an efficient means to estimate the transmittance, which then required soft matting [16] for edge-aware refinement. He et al. [15] also proposed locating the atmospheric light as the brightest pixel (in the red-green-blue (RGB) color space) among the top 0.1% of pixels with the highest intensities in the dark channel. Given the estimated transmittance and atmospheric light, they reversed (7) to obtain the haze-free image.
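As a concrete illustration of this pipeline, the following Python sketch computes the dark channel, a dark-channel-based atmospheric light, and an unrefined transmittance estimate; the patch size, the top-pixel fraction, and the constant omega are illustrative parameters, and the soft-matting refinement is omitted.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(I, patch=15):
    """Local minimum of the per-pixel minimum channel; I is H x W x 3 in [0, 1]."""
    return minimum_filter(I.min(axis=2), size=patch)

def estimate_atmospheric_light(I, top_frac=0.001):
    """Brightest RGB pixel among the top fraction of dark-channel intensities."""
    dark = dark_channel(I)
    n = max(1, int(top_frac * dark.size))
    idx = np.argsort(dark.ravel())[-n:]                 # haziest candidate pixels
    candidates = I.reshape(-1, 3)[idx]
    return candidates[candidates.sum(axis=1).argmax()]

def estimate_transmittance_dcp(I, A, omega=0.95, patch=15):
    """Unrefined transmittance from the dark channel of the normalized image."""
    return 1.0 - omega * dark_channel(I / A, patch)
```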
Another influential work is the linear-time algorithm by Tarel and Hautiere [17]. They first white-balanced the input image to support the assumption that the atmospheric light was pure white. After that, they inferred the airlight, the term $\mathbf{A}[1-t(x)]$ in (7), through median filtering, which smooths the veil while preserving large edges, and finally recovered the scene radiance by inverting the degradation model.
B. Machine Learning
As image dehazing based on in situ information is highly challenging, the distilled knowledge from relevant image datasets may improve the dehazing performance. Zhu et al. [19] developed the color attenuation prior based on extensive observations of natural outdoor images. This prior stated that the scene depth correlated with the difference between image saturation and brightness. Zhu et al. [19] then utilized a linear function to model that correlation and estimated the function’s parameters by applying supervised learning on a synthetic dataset. Thus, the distilled knowledge was parameters used to estimate the scene depth from image saturation and brightness.
From a more general perspective, Tang et al. [20] investigated four haze-relevant features, including the dark channel, hue disparity, locally maximum contrast, and locally maximum saturation, at multiple scales and found the following. Although the dark channel was the most informative feature (as discovered by He et al. [15]), other features also contributed in a complementary manner. Hence, Tang et al. [20] devised a framework for inferring the transmittance from different haze-relevant features. In [20], they employed a random forest regressor for ease of analysis and demonstration, albeit with slow inference time. They also discussed the importance of post-processing and presented two post-processing options: adaptive atmospheric light estimation and adaptive exposure scaling.
C. Deep Learning
The aforementioned approaches require significant efforts in seeking (i) a good feature (or a set of features) and (ii) an efficient inference scheme. However, there is no guarantee that they will always perform as intended in all circumstances. As a result, deep learning has been applied to image dehazing to improve flexibility. Given a reliable training dataset, DNNs can estimate the transmittance and atmospheric light with high accuracy because they allow learning and augmenting image features from low to high levels of abstraction. For example, Cai et al. [21] designed a convolutional neural network (CNN) to perform the following: low-level feature extraction, multi-scale mapping, augmentation (for spatial invariance), and non-linear transmittance inference.
The powerful learning ability of DNNs, or deep CNNs in particular, also allows them to infer the dehazed image directly from the hazy input. In this direction, the encoder-decoder network has proved highly effective for end-to-end learning [22], [23]. In addition, some well-known image processing schemes can be transferred to deep learning to improve performance, as witnessed by multi-scale image fusion [22] and domain adaptation [23]. Also, inspired by the observation that knowledge the human brain acquires from one activity may benefit another, joint learning is a promising direction, typified by [24], where image dehazing benefits object detection.
Some state-of-the-art deep dehazing networks developed recently include GridDehazeNet (GDN) [25], multi-scale boosted dehazing network (MSBD) [26], you only look yourself (YOLY) [27], and self-augmented unpaired image dehazing (D4) [28]. GDN is a supervised network and comprises three modules. The pre-processing module applies different data-driven enhancement processes to the input image. The backbone module then fuses the results based on the grid network, where a channel-wise attention mechanism is adopted to facilitate the cross-scale circulation of information. Finally, the post-processing module remedies residual artifacts to improve the dehazing quality.
MSBD is also a supervised network designed with boosting and error feedback mechanisms. The former successively refines the intermediate dehazing result to reduce the portion of haze (PoH defined by Dong et al. [26]), and the latter successively recovers spatial details obscured by haze. YOLY, on the contrary, is an unsupervised and untrained network. Based on the layer disentanglement in [29], YOLY is designed with three sub-networks that decompose the hazy image into three latent layers corresponding to scene radiance, transmittance, and atmospheric light. Thus, YOLY supervises itself to jointly optimize three sub-networks and reconstruct the hazy image from a single input.
Yang et al. [28] argued that YOLY lacked knowledge from the clean image domain and developed D4 as an alternative solution. Unlike other unpaired networks, D4 takes account of the scattering coefficient and scene depth when carrying out dehazing and rehazing cycles. Consequently, D4 can benefit from physical-model-based haze removal and generation to improve the performance of unpaired learning.
D. Motivations
Image dehazing has undergone approximately five decades of development since the pioneering work in 1972 [30]. It is currently in a mature stage, and the focus has shifted toward computational efficiency so that dehazing can be integrated into low-cost edge devices, which are prevalent in the Industry 4.0 era.
As discussed thus far, although DNNs offer some definite advantages, such as accuracy and flexibility, they are not necessarily the preferable option. In contrast, traditional computer vision techniques are well suited to image dehazing because a hand-engineered method can deliver comparable performance at a much lower computational cost. This paper, therefore, proposes an efficient single image dehazing method built on traditional computer vision techniques that runs in linear time.
Proposed Method
Fig. 3 illustrates three major steps constituting the proposed method. The first is unsharp masking for enhancing sharpness, wherein the enhancement is locally adapted to the variance of image intensities lest the out-of-range problem occur. The second performs image dehazing based on the improved color attenuation prior [31], which estimates the transmittance from saturation and brightness (as done by Zhu et al. [19]) and applies two no-black-pixel (NBP) constraints. The third performs color gamut expansion by enhancing the luminance and then expanding the color gamut proportionally to avoid color distortion. The following describes these three steps in more detail.
Illustration of the proposed method. NBP stands for no-black-pixel, and CDF stands for the cumulative distribution function.
A. Pre-Processing
In the beginning, it is worth recalling that the input RGB image comprises three color channels, denoted $I^{R}$, $I^{G}$, and $I^{B}$ in (8).
As haze is depth-dependent, it is generally smooth except at depth discontinuities. Hence, it can be viewed as a low-frequency component that obscures fine details in the captured image. This pre-processing step then enhances these obscured details by adding the scaled Laplacian image to the original, as Fig. 4 shows. Because the sharpness enhancement only applies to the luminance channel, it is necessary to convert between the RGB and YCbCr color spaces using (8) and (9) from [32]. In (8), $Y$, $\mathrm{Cb}$, and $\mathrm{Cr}$ denote the luminance and two chrominance channels computed from the RGB channels, and (9) performs the inverse conversion, mapping the enhanced luminance $Y_{e}$ and the original chrominance back to the enhanced RGB image $\{I_{e}^{R}, I_{e}^{G}, I_{e}^{B}\}$:\begin{align*} \begin{bmatrix} Y\\ \mathrm {Cb}\\ \mathrm {Cr} \end{bmatrix}=&\begin{bmatrix} 0.183 &\quad 0.614 &\quad 0.062\\ -0.101 &\quad -0.338 &\quad 0.439\\ 0.439 &\quad -0.399 &\quad -0.040 \end{bmatrix} \! \begin{bmatrix} I^{R}\\ I^{G}\\ I^{B} \end{bmatrix} \\&+ \begin{bmatrix} 16\\ 128\\ 128 \end{bmatrix}, \tag{8}\\ \begin{bmatrix} I_{e}^{R}\\ I_{e}^{G}\\ I_{e}^{B} \end{bmatrix}=&\begin{bmatrix} 1.164 &\quad 0 &\quad 1.793\\ 1.164 &\quad -0.213 &\quad -0.534\\ 1.164 &\quad 2.115 &\quad 0 \end{bmatrix} \! \begin{bmatrix} Y_{e} - 16\\ \mathrm {Cb} - 128\\ \mathrm {Cr} - 128 \end{bmatrix}. \tag{9}\end{align*}
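A minimal sketch of these two conversions, using the matrices quoted in (8) and (9) and assuming 8-bit pixel values stored as floats, is given below.

```python
import numpy as np

M_FWD = np.array([[ 0.183,  0.614,  0.062],
                  [-0.101, -0.338,  0.439],
                  [ 0.439, -0.399, -0.040]])
M_INV = np.array([[1.164,  0.000,  1.793],
                  [1.164, -0.213, -0.534],
                  [1.164,  2.115,  0.000]])
OFFSET = np.array([16.0, 128.0, 128.0])

def rgb_to_ycbcr(I):
    """Per-pixel matrix product of (8); I is H x W x 3 with 8-bit RGB values."""
    return I @ M_FWD.T + OFFSET

def ycbcr_to_rgb(YCC):
    """Inverse conversion of (9); YCC holds the (Y, Cb, Cr) channels."""
    return (YCC - OFFSET) @ M_INV.T
```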
Next, the Laplacian image is obtained by convolving the input luminance $Y$ with the Laplacian kernel $\nabla^{2}$ defined in (10). In addition, the local variance $v$ is computed by box filtering as in (11), where $U_{k}$ denotes an all-ones kernel of size $k \times k$ and $\circledast$ denotes convolution:\begin{align*} \nabla ^{2}\triangleq&\begin{bmatrix} 0 &\quad 1 &\quad 0\\ 1 &\quad -4 &\quad 1\\ 0 &\quad 1 &\quad 0 \end{bmatrix}, \tag{10}\\ v=&Y^{2} \circledast \left ({\frac {U_{k}}{k^{2}}}\right) - \left [{Y \circledast \left ({\frac {U_{k}}{k^{2}}}\right)}\right]^{2}. \tag{11}\end{align*}
As demonstrated at the bottom-left of Fig. 4, the scaling factor $\alpha$ is adapted to the local variance through the piecewise-linear function in (12), where $(v_{1}, \alpha_{1})$ and $(v_{2}, \alpha_{2})$ are the two breakpoints, and the enhanced luminance $Y_{e}$ then follows from (13):\begin{align*} \alpha=&\begin{cases} \alpha _{1} & v < v_{1}\\ \left ({\dfrac {\alpha _{2}-\alpha _{1}}{v_{2}-v_{1}}}\right)v + \dfrac {\alpha _{1} v_{2} - \alpha _{2} v_{1}}{v_{2}-v_{1}} & v_{1} \leq v \leq v_{2}\\ \alpha _{2} & v > v_{2}, \end{cases} \qquad \tag{12}\\ Y_{e}=&Y + \alpha \cdot (\nabla ^{2} \circledast Y). \tag{13}\end{align*}
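The following Python sketch ties (10)-(13) together; the window size $k$ and the breakpoints $(v_{1}, \alpha_{1})$ and $(v_{2}, \alpha_{2})$ are illustrative assumptions, with negative gains chosen so that (13), written with the kernel in (10), increases sharpness.

```python
import numpy as np
from scipy.ndimage import convolve, uniform_filter

LAPLACIAN = np.array([[0.0,  1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0,  1.0, 0.0]])                 # kernel of (10)

def local_variance(Y, k=7):
    """Box-filter variance of (11): E[Y^2] - (E[Y])^2 over a k x k window."""
    mean = uniform_filter(Y, size=k)
    return uniform_filter(Y * Y, size=k) - mean * mean

def enhance_sharpness(Y, k=7, v1=20.0, v2=600.0, a1=-0.8, a2=-0.2):
    """Adaptive unsharp masking of the luminance channel Y (float array)."""
    v = local_variance(Y, k)
    # Piecewise-linear scaling factor of (12): weaker gain where the variance is high.
    alpha = np.where(v < v1, a1,
                     np.where(v > v2, a2,
                              a1 + (a2 - a1) * (v - v1) / (v2 - v1)))
    return Y + alpha * convolve(Y, LAPLACIAN)            # enhanced luminance Y_e of (13)
```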
Unsharp masking can be loosely viewed as a "mild dehazing" step because it partially relieves the impact of haze on image sharpness. The following, conversely, is a dedicated haze-removal step developed from the improved color attenuation prior [31].
B. Dehazing
Two important parts of this step are (i) scene depth estimation and (ii) NBP constraint derivation. The former is based on the color attenuation prior [19] with several improvements in the learning scheme and the dataset preparation. Meanwhile, the latter is inspired by [33] to constrain the transmittance lest black pixels occur. Fig. 5 shows the overall block diagram, where the input image is the sharpness-enhanced image $\mathbf{I}_{e}$ produced by the pre-processing step.
1) Scene Depth Estimation
The scene depth $d$ is modeled as a linear function of the image saturation $S$ and brightness $V$, as in (14), where $\theta_{0}$, $\theta_{1}$, and $\theta_{2}$ are learnable parameters and $\varepsilon$ is a random error term:\begin{equation*} d = \theta _{0} + \theta _{1} S + \theta _{2} V + \varepsilon, \tag{14}\end{equation*}
Zhu et al. [19] utilized the standard uniform distribution (SUD) to generate a synthetic training dataset. After that, they adopted the stochastic gradient ascent (SGA) to find the parameters that maximized the log-likelihood function. In the proposed method, the enhanced equidistribution [31] supersedes SUD to improve the statistical reliability of the synthetic dataset. Additionally, the mini-batch gradient ascent with an adaptive learning rate [34] replaces SGA to reduce the convergence time.
The scene depth obtained from (14) is then converted to the transmittance via $t = \mathrm{exp}(-\beta_{sc}d)$, consistent with the degradation model in Section I-A.
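A hedged sketch of this depth-to-transmittance estimation is shown below; the parameter values $\theta$, the scattering coefficient, and the lower clipping bound are illustrative rather than the values actually learned in [19] or [31].

```python
import numpy as np

def estimate_transmittance(I_e, theta=(0.1, -0.8, 1.0), beta_sc=1.0, t_min=0.05):
    """I_e: pre-processed RGB image in [0, 1]; returns a clipped transmittance map."""
    V = I_e.max(axis=2)                                  # HSV brightness (value)
    S = 1.0 - I_e.min(axis=2) / np.maximum(V, 1e-6)      # HSV saturation
    d = theta[0] + theta[1] * S + theta[2] * V           # linear depth model of (14)
    return np.clip(np.exp(-beta_sc * d), t_min, 1.0)     # t = exp(-beta_sc d)
```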
2) NBP Constraints
From (7), the dehazed image (or, equivalently, the scene radiance) $\mathbf{J}$ is recovered by inverting the degradation model, with the pre-processed image $\mathbf{I}_{e}$ as input:\begin{equation*} \mathbf {J} = \frac {\mathbf {I}_{e} - \mathbf {A}}{t} + \mathbf {A}. \tag{15}\end{equation*}
The first NBP constraint, denoted $t_{\mathrm{NBP}_{1}}$, follows from requiring every color channel of $\mathbf{J}$ in (15) to be non-negative, which yields \begin{equation*} t \geq 1 - \min _{c\in \{R,G,B\}}\left ({\frac {I_{e}^{c}}{A^{c}}}\right), \tag{16}\end{equation*} where $A^{c}$ and $I_{e}^{c}$ denote the $c$-th color channels of the atmospheric light and the pre-processed image, respectively.
The second NBP constraint is inspired by the observation in [33] that the local mean intensity of the dehazed luminance $Y_{p}$ should dominate its local standard deviation, that is, \begin{equation*} \mathop{\mathrm{mean}}\limits _{\forall y\in \Omega (x)}[Y_{p}(y)] \geq q\cdot \mathop{\mathrm{std}}\limits _{\forall y\in \Omega (x)}[Y_{p}(y)], \tag{17}\end{equation*} where $\Omega(x)$ denotes a local patch centered at pixel $x$ and $q$ is a positive constant.
Approximating these local statistics by box filtering and substituting the dehazed luminance from (15) give \begin{align*} \mathop{\mathrm{mean}}\limits _{\forall y\in \Omega (x)}[Y_{p}(y)]\approx&\frac {1}{t}\left [{Y_{e} \circledast \left ({\frac {U_{k}}{k^{2}}}\right) - \bar {A}}\right] + \bar {A}, \tag{18}\\ \mathop{\mathrm{std}}\limits _{\forall y\in \Omega (x)}[Y_{p}(y)]\approx&\frac {1}{t}\sqrt {Y_{e}^{2}\circledast \left ({\frac {U_{k}}{k^{2}}}\right) \!-\! \left [{Y_{e}\circledast \left ({\frac {U_{k}}{k^{2}}}\right)}\right]^{2}}, \tag{19}\end{align*} where $\bar{A}$ denotes the mean of the atmospheric light channels.
Substituting (18) and (19) into (17) and solving for $t$ yields the second constraint:\begin{align*}&\hspace {-0.5pc} t \geq 1 - \left ({{\bar {A}}}\right)^{-1}\Bigg \{Y_{e} \circledast \left ({\frac {U_{k}}{k^{2}}}\right) \\&- q\sqrt {Y_{e}^{2}\circledast \left ({\frac {U_{k}}{k^{2}}}\right) - \left [{Y_{e}\circledast \left ({\frac {U_{k}}{k^{2}}}\right)}\right]^{2}}\Bigg \}. \tag{20}\end{align*}
Let $t_{\mathrm{NBP}_{1}}$ and $t_{\mathrm{NBP}_{2}}$ denote the lower bounds in (16) and (20), respectively. The overall NBP constraint is then \begin{equation*} t_{\mathrm {NBP}} = \max \left ({t_{\mathrm {NBP}_{1}}, t_{\mathrm {NBP}_{2}}}\right), \tag{21}\end{equation*}
and the transmittance used in (15) is accordingly constrained to \begin{equation*} t_{\mathrm {NBP}} \leq t \leq 1. \tag{22}\end{equation*}
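The sketch below combines (15)-(22): it computes both NBP lower bounds by box filtering, clamps the transmittance, and inverts the degradation model. The window size $k$ and the constant $q$ are illustrative assumptions, and all inputs are assumed to share the same intensity range.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def recover_radiance(I_e, Y_e, t, A, k=15, q=1.0):
    """I_e: pre-processed RGB image; Y_e: its luminance; t: estimated transmittance."""
    A = np.asarray(A, dtype=float)
    A_bar = A.mean()

    # First NBP constraint (16): keep every recovered color channel non-negative.
    t_nbp1 = 1.0 - (I_e / A).min(axis=2)

    # Second NBP constraint (20): local mean must dominate q times the local std.
    mean = uniform_filter(Y_e, size=k)
    var = np.maximum(uniform_filter(Y_e * Y_e, size=k) - mean * mean, 0.0)
    t_nbp2 = 1.0 - (mean - q * np.sqrt(var)) / A_bar

    # Combine the bounds (21) and clip t to [t_NBP, 1] as in (22),
    # guarding against non-positive lower bounds.
    t_nbp = np.clip(np.maximum(t_nbp1, t_nbp2), 1e-3, 1.0)
    t = np.clip(t, t_nbp, 1.0)

    # Invert the degradation model (15) channel-wise.
    return (I_e - A) / t[..., None] + A
```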
As underflows and overflows are inevitable in digital computations, the recovered image suffers from color gamut reduction, rendering a post-processing step highly relevant. The following describes an efficient method for luminance enhancement and color gamut expansion [36]. This method also produces a positive ramification that eases the atmospheric light estimation. More precisely, it can be observed from (15) that an inaccurate estimate of $\mathbf{A}$ mainly shifts the luminance and shrinks the color gamut of the recovered image; because these are precisely the effects that the post-processing step compensates for, a simple estimate of $\mathbf{A}$ suffices.
C. Post-Processing
Fig. 6 shows the overall block diagram, where the input image is the recovered scene radiance $\mathbf{J}$ from the dehazing step.
1) Luminance Enhancement
Existing enhancement methods generally operate on the entire luminance range, which may result in over-enhancement. Accordingly, the method in [36] adopted an adaptive limit point (ALP) to constrain the range on a scene-by-scene basis. Given the luminance channel $Y_{p}$ of the recovered image, ALP is computed from its cumulative distribution function (CDF) as in (23), where $L_{\mathrm{CDF}_{p}}$ denotes the luminance value at which the CDF reaches $p$ and $\bar{Y_{p}}$ denotes the mean luminance:\begin{align*} \mathrm {ALP} = \begin{cases} 0.04 + \dfrac {0.02}{255}\left ({L_{\mathrm {CDF}_{0.9}} - L_{\mathrm {CDF}_{0.1}}}\right) & \bar {Y_{p}} > 128\\[8pt] 0.04 - \dfrac {0.02}{255}\left ({L_{\mathrm {CDF}_{0.9}} - L_{\mathrm {CDF}_{0.1}}}\right) & \bar {Y_{p}} \leq 128, \end{cases} \\{}\tag{23}\end{align*}
It is worth noting that over-enhancement is avoidable by assigning higher gains to smaller luminance values, and ALP can be exploited for that purpose, as (24) shows:\begin{align*} g_{1}(Y_{p})=&\frac {Y_{p}}{2^{21}}\left [{255\left ({1 - \frac {Y_{p}-\mathrm {ALP}}{255}}\right)^{\theta} \left ({\frac {255-Y_{p}}{255}}\right)}\right]^{2}, \\{}\tag{24}\\ \theta=&\frac {1.5\left ({L_{\mathrm {CDF}_{0.4}}-L_{\mathrm {CDF}_{0.1}}}\right)}{\bar {Y_{p}}-L_{\mathrm {CDF}_{0.1}}} - 0.55, \tag{25}\end{align*}
The gain $g_{1}$ is then modulated by a linear function $g_{2}$, whose slope $\mathrm{SL}$ and intercept $\mathrm{IN}$ are determined as in [36], and the enhanced luminance $Y_{f}$ follows from (27):\begin{align*} g_{2}(Y_{p})=&\frac {\mathrm {SL}}{255}Y_{p} + \mathrm {IN}, \tag{26}\\ Y_{f}=&Y_{p} + g_{1}(Y_{p}) \cdot g_{2}(Y_{p}). \tag{27}\end{align*}
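The following Python sketch mirrors (23)-(27); because the exact slope $\mathrm{SL}$ and intercept $\mathrm{IN}$ of $g_{2}$ come from [36], they are exposed here as free parameters with illustrative defaults.

```python
import numpy as np

def enhance_luminance(Y_p, SL=30.0, IN=2.0):
    """Y_p: luminance channel of the dehazed image, values in [0, 255]."""
    L10, L40, L90 = np.percentile(Y_p, [10, 40, 90])     # CDF landmarks L_CDF_p
    mean_Y = Y_p.mean()

    # Adaptive limit point of (23): shifted up or down by the overall brightness.
    spread = 0.02 / 255.0 * (L90 - L10)
    alp = 0.04 + spread if mean_Y > 128 else 0.04 - spread

    # Exponent of (25) and luminance-dependent gain of (24).
    theta = 1.5 * (L40 - L10) / max(mean_Y - L10, 1e-6) - 0.55
    g1 = (Y_p / 2.0**21) * (255.0 * (1.0 - (Y_p - alp) / 255.0) ** theta
                            * ((255.0 - Y_p) / 255.0)) ** 2

    # Linear gain of (26) and final enhanced luminance of (27).
    g2 = SL / 255.0 * Y_p + IN
    return np.clip(Y_p + g1 * g2, 0.0, 255.0)
```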
2) Color Gamut Expansion
The first block of color space conversion in Fig. 6 produces the chrominance channels $\mathrm{Cb}_{p}$ and $\mathrm{Cr}_{p}$, which are decimated by the filter $D_{\mathrm{dec}}$ and interleaved column-wise into a single channel $\mathrm{Ch}_{p}$, as (28)-(30) show:\begin{align*} \mathrm {Cb}_{d}=&\mathrm {Cb}_{p} \circledast D_{\mathrm {dec}} = \left \{{\mathrm {cb}_{ij}\in \mathbb {R}}\right \}, \tag{28}\\ \mathrm {Cr}_{d}=&\mathrm {Cr}_{p} \circledast D_{\mathrm {dec}} = \left \{{\mathrm {cr}_{ij}\in \mathbb {R}}\right \}, \tag{29}\\ \mathrm {Ch}_{p}=&\left \{{\mathrm {ch}_{ij}\in \mathbb {R} \, \big |\, \mathrm {ch}_{ij}=\mathrm {cb}_{ij}, }\right. \\&\left.{ \forall i,j \;\mathrm {s.t.}\; j=\{2n+1 \, \big |\, n\in \mathbb {Z}_{0}^{+}\}, }\right. \\&\left.{ \mathrm {otherwise}\; \mathrm {ch}_{ij}=\mathrm {cr}_{ij}}\right \}. \tag{30}\end{align*}
According to the Helmholtz-Kohlrausch effect [39], the luminance perceived by human visual systems increases with chromatic saturation. Accordingly, the chrominance is scaled in proportion to the luminance enhancement ratio:\begin{equation*} g_{3}(Y_{p},\mathrm {Ch}_{p}) = \frac {Y_{f}}{Y_{p}}\mathrm {Ch}_{p}. \tag{31}\end{equation*}
Moreover, an additional weight $g_{4}$, dependent on the luminance, damps the expansion in bright regions to prevent over-saturation, as (32) shows, where $\mathrm{TH}_{1}$ and $\mathrm{TH}_{2}$ are two luminance thresholds:\begin{align*} g_{4}(Y_{p}) \!=\!\! \begin{cases} 0.7 & Y_{p} < \mathrm {TH}_{1}\\ 0.7 - 0.26\dfrac {Y_{p} - \mathrm {TH}_{1}}{\mathrm {TH}_{2} - \mathrm {TH}_{1}} & \mathrm {TH}_{1} \leq Y_{p} \leq \mathrm {TH}_{2}\\ 0.44 & Y_{p} > \mathrm {TH}_{2}, \end{cases}\qquad \tag{32}\end{align*}
and the expanded chrominance is then obtained as \begin{equation*} \mathrm {Ch}_{f} = \mathrm {Ch}_{p} + g_{3}(Y_{p},\mathrm {Ch}_{p}) \cdot g_{4}(Y_{p}). \tag{33}\end{equation*}
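A minimal sketch of this expansion, implementing (31)-(33) and assuming the interleaved chrominance is centered at zero (i.e., with the 128 offset removed), is given below; the thresholds $\mathrm{TH}_{1}$ and $\mathrm{TH}_{2}$ are illustrative assumptions.

```python
import numpy as np

def expand_gamut(Ch_p, Y_p, Y_f, TH1=64.0, TH2=192.0):
    """Ch_p: zero-centered interleaved chrominance; Y_p / Y_f: luminance before/after enhancement."""
    # g3 of (31): scale the chrominance by the luminance enhancement ratio.
    g3 = (Y_f / np.maximum(Y_p, 1e-6)) * Ch_p

    # g4 of (32): piecewise-linear weight that damps the expansion in bright regions.
    g4 = np.where(Y_p < TH1, 0.7,
                  np.where(Y_p > TH2, 0.44,
                           0.7 - 0.26 * (Y_p - TH1) / (TH2 - TH1)))

    # Expanded chrominance of (33).
    return Ch_p + g3 * g4
```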
Next, the chrominance interpolation block separates $\mathrm{Ch}_{f}$ into two temporary variables, $\mathrm{Cb}_{t}$ and $\mathrm{Cr}_{t}$, for the final block of color space conversion. To describe chrominance interpolation, we reuse the element-wise notation of (30):\begin{align*} \mathrm {Cb}_{t}=&\left \{{\mathrm {cb}_{ij}\in \mathbb {R} \, \big |\, \mathrm {cb}_{ij}=\mathrm {ch}_{ij}, }\right. \\&\left.{ \forall i,j \;\mathrm {s.t.}\; j=\{2n+1 \, \big |\, n\in \mathbb {Z}_{0}^{+}\} }\right. \\&\left.{ \mathrm {otherwise}\; \mathrm {cb}_{ij}=0}\right \}, \tag{34}\\ \mathrm {Cr}_{t}=&\left \{{\mathrm {cr}_{ij}\in \mathbb {R} \, \big |\, \mathrm {cr}_{ij}=\mathrm {ch}_{ij}, }\right. \\&\left.{ \forall i,j \;\mathrm {s.t.}\; j=\{2n \, \big |\, n\in \mathbb {Z}^{+}\} }\right. \\&\left.{ \mathrm {otherwise}\; \mathrm {cr}_{ij}=0}\right \}. \tag{35}\end{align*}
After that, $\mathrm{Cb}_{t}$ and $\mathrm{Cr}_{t}$ are convolved with the interpolation filter to reconstruct the full-resolution chrominance channels, which, together with the enhanced luminance $Y_{f}$, are converted back to the RGB color space to form the final output image.
Results and Discussions
This section presents a comparative evaluation of the proposed method against nine state-of-the-art benchmarks selected from the three image dehazing categories discussed in Section II. These nine are proposed by He et al. [15], Tarel and Hautiere [17], Zhu et al. [19], Cai et al. [21], Ren et al. [3], Liu et al. [25], Dong et al. [26], Li et al. [27], and Yang et al. [28], respectively.
A. Qualitative Evaluation on Real-World Hazy Images
Fig. 7 demonstrates a qualitative comparison of ten methods on a real-world hazy image from the IVC dataset [40]. Results by He et al. [15], Cai et al. [21], Liu et al. [25], Dong et al. [26], Li et al. [27], and Yang et al. [28] exhibit a good dehazing performance as the scene radiance has been recovered without any unpleasant artifacts. The result by Zhu et al. [19] is slightly over-dehazed, losing dark and distant details. Results by Tarel and Hautiere [17] and Ren et al. [3] are less favorable than others because a portion of haze persists.
Dehazing results of ten methods on a real-world hazy image. From left to right: input image and results by He et al. [15], Tarel and Hautiere [17], Zhu et al. [19], Cai et al. [21], Ren et al. [3], Liu et al. [25], Dong et al. [26], Li et al. [27], Yang et al. [28], and the proposed method. The input image was duplicated for ease of comparison.
Above all, it can be observed that the nine benchmark methods are ineffective in recovering image details, as witnessed by the traffic light and the man's face in the red-cropped and blue-cropped regions. This common drawback can be explained as follows. Dehazing is fundamentally the subtraction of haze from the input image, and the subtraction degree depends on the transmittance. However, estimating a transmittance with rich details is challenging because spatial filtering usually attenuates high-frequency information. Although the well-regarded guided filter [41] is commonly adopted to refine the transmittance estimate, the only guidance image available in single image dehazing is the hazy input itself. Accordingly, the lack of an informative guidance image constrains the refinement.
The proposed method, in contrast, effectively removes haze while enhancing the sharpness and the color gamut, as witnessed by the man’s face and the facial skin color in the blue-cropped region. This definite advantage is attributed to the pre-processing (unsharp masking) and post-processing (color gamut expansion) steps. The intermediate results in Fig. 8 show that the former has improved image details to such an extent that the contours of distant objects have become noticeable. Meanwhile, the latter, as claimed, has successfully remedied the post-dehazing problem of color gamut reduction.
Intermediate results of the proposed method on a real-world hazy image. From left to right: input image and results after pre-processing, dehazing, and post-processing steps.
Fig. 9 shows more qualitative comparison results on real-world hazy images. It can be observed from the first row that the result by He et al. [15] is satisfactory, albeit with the post-dehazing false enlargement of the train's headlight. This problem occurs when the atmospheric light is less than the pixel intensities around the headlight, as discussed in [42]: such brighter-than-$\mathbf{A}$ pixels are further amplified by the division by the transmittance in (15), visually enlarging the light source.
Qualitative evaluation of ten methods on real-world hazy images. From left to right: input images and results by He et al. [15], Tarel and Hautiere [17], Zhu et al. [19], Cai et al. [21], Ren et al. [3], Liu et al. [25], Dong et al. [26], Li et al. [27], Yang et al. [28], and the proposed method. We abbreviated the method of Tarel and Hautiere [17] to T & H and the proposed method to PM.
Similar observations also emerge from the second to the fourth rows of Fig. 9. The dark channel assumption of the method of He et al. [15] does not hold for the sky region, causing severe color distortion in the second row. As with the results in the first row, the method of Tarel and Hautiere [17] suffers from halo artifacts, and the method of Zhu et al. [19] suffers from a loss of dark details. Results by deep-learning-based methods, on the contrary, do not exhibit any unpleasant artifacts, which is attributed to the powerful representation capability and flexibility of CNNs. Compared with these benchmarks, the proposed method exhibits comparable or even better performance.
B. Quantitative Evaluation on Public Datasets
This section presents an objective assessment of the proposed method against nine benchmarks on public image datasets. It is worth noting that there are numerous metrics for image quality assessment, such as the conventional peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) [43], the feature similarity index extended to color images (FSIMc) [44], and the TMQI [37]. The first is pixel-based and thus does not correlate well with subjective ratings. The second, in contrast, is structure-based and can better quantify the perceived image quality. However, it has a drawback in that it utilizes a uniform weight to pool a single quality score. Accordingly, the third improves the SSIM by adopting an adaptive pooling weight and taking account of chrominance information. The fourth improves the SSIM in another direction by considering multi-scale structural similarity and naturalness. Therefore, we selected the FSIMc and the TMQI for our quantitative evaluation due to their high correlation with subjective assessment. These two metrics vary from zero to unity, and higher scores are more favorable in image dehazing. Also, as they are full-reference, their computation requires datasets comprising pairs of hazy and haze-free images.
Table 1 summarizes five public datasets utilized in this evaluation, including FRIDA2 [33], D-HAZY [45], O-HAZE [46], I-HAZE [47], and DENSE-HAZE [48]. FRIDA2 consists of 66 graphics-generated images depicting road scenes, from which four corresponding sets of hazy images are generated, for a total of 66 haze-free and 264 hazy images. D-HAZY is generated from the Middlebury [49] and NYU Depth [50] datasets according to the degradation model in (7), with scene depths captured by a Microsoft Kinect camera. It is composed of 1472 pairs of synthetic indoor images, but this evaluation only utilizes the 23 pairs from Middlebury due to their substantial variation in image scenes; the other 1449 pairs from NYU Depth portray relatively similar scenes and thus may bias the evaluation results. In contrast, O-HAZE, I-HAZE, and DENSE-HAZE comprise 45, 30, and 55 pairs depicting real-world outdoor scenes, indoor scenes, and both, respectively.
Tables 2 and 3 demonstrate average evaluation scores in FSIMc and TMQI on five public datasets. It can be observed from Table 2 that the proposed method is ranked fourth overall, below the deep dehazing models of Yang et al. [28], Ren et al. [3], and Dong et al. [26]. However, it is worth noting that the difference between its FSIMc score and the best is subtle. Specifically, it outperforms other benchmarks on FRIDA2 and is within the top four methods for dehazing real-world images in O-HAZE and I-HAZE. Nevertheless, its performance is slightly under-par on D-HAZY and DENSE-HAZE. This observation can be related to the fact that FSIMc quantifies the degradation rather than the enhancement. Thus, it is unsurprising that two models of Ren et al. [3] and Dong et al. [26], trained on fully annotated datasets to minimize the difference between their predictions and corresponding ground-truth references, have achieved the top scores. This interpretation can be further supported by the unimpressive score of the unsupervised model of Li et al. [27], which does not require ground-truth references. However, this and the best-performing model of Yang et al. [28] have shown the great potential of unsupervised and unpaired learning in computer vision.
According to Table 3, the proposed method is ranked second overall, and its TMQI score only differs from the best at the fourth decimal place. More specifically, the proposed method exhibits comparable performance on FRIDA2 and under-par performance on D-HAZY. In contrast, it outperforms the benchmark methods on real-world datasets, such as O-HAZE and I-HAZE, as witnessed by a significant difference in TMQI scores. Hence, unsharp masking and color gamut expansion appear to benefit real-world images. However, the benefits these two steps offer do not suffice for densely hazy images, owing to the under-performance of the dehazing step.
As a result, it can be concluded that the proposed method performs comparably to state-of-the-art benchmarks, notably the deep learning models of Yang et al. [28], Ren et al. [3], and Dong et al. [26].
C. Processing Time Comparison
Notwithstanding its comparable performance, the proposed method possesses linear time complexity in the number of pixels because all of its operations are either pixel-wise or based on spatial filtering with fixed-size kernels.
Table 4 summarizes the processing time of ten methods on different image resolutions, ranging from VGA ($640 \times 480$) to 8K UHD ($7680 \times 4320$).
It emerges that the method of He et al. [15] is the least efficient in terms of time and memory. This finding is consistent with its widely known drawback rooted in soft-matting. Notably, RAM was exhausted when invoking this algorithm on an 8K UHD image, and the processing time, in this case, was denoted as not available. Similarly, the unsupervised model of Li et al. [27] could not process DCI 4K and 8K UHD images owing to memory exhaustion. This model progressively refines the dehazing result, and its default configuration is to run through 800 iterations. Therefore, its processing time is significantly larger than those of other methods.
Next on the list are two models of Cai et al. [21] and Ren et al. [3]. As discussed earlier, they are deemed to be overkill due to the high computational cost inherent in them. Measurements in Table 4 then verified that claim. However, recent models of Liu et al. [25], Dong et al. [26], and Yang et al. [28] benefited from batch processing and parallel computing. Under these mechanisms, PyTorch needs to initialize the GPU, for example, making replicas of the model on each GPU worker. Accordingly, the execution time of the first image was prolonged, whereas those of the remaining images were substantially shortened. Notably, the model of Liu et al. [25] utilized four GPU workers and thus consumed the least processing time for SVGA, HD, FHD, and DCI 4K resolutions. It is also worth noting that batch processing and parallel computing generally cause a jump in memory consumption, proportional to the number of GPU workers. Thus, the model of Liu et al. [25] suffered from memory exhaustion when handling an 8K UHD image. Conversely, the model of Dong et al. [26] was free of that problem but returned a run-time error when processing an SVGA image.
The two methods of Tarel and Hautiere [17] and Zhu et al. [19] are well recognized for their computational efficiency, which accounts for the fast processing speeds recorded in Table 4. Additionally, it is noteworthy that Zhu et al. [19] adopted the fast implementation of the guided filter, which downscales the input image to ease the computational burden. Had they utilized the standard guided filter, their method would have been slower than that of Tarel and Hautiere [17].
The proposed method is ranked second overall, notably without batch processing or parallel computing but simply sequential computing. Although it is slower than the fastest model of Liu et al. [25], Tables 2 and 3 demonstrate that it outperforms this model under FSIMc and TMQI. Also, compared with the fast sequential method of Zhu et al. [19], it achieved approximately
D. Ablation Study
The proposed method consists of three steps that operate in a complementary manner. To verify the individual contribution of each step, we conduct an ablation study with three variants of the algorithm, created by dropping the pre-processing step, the post-processing step, and both, respectively. Table 5 summarizes the evaluation results on five public datasets under FSIMc and TMQI. It can be observed that the pre-processing step (unsharp masking) contributes more to the structural information, while the post-processing step (color gamut expansion) contributes more to the naturalness. Hence, each of these two steps plays an essential role, justifying the three-step design of the proposed method.
Moreover, the contributions of the pre-processing and post-processing steps can also be verified by qualitative results in Fig. 10. Excluding the former causes a loss of image details, and excluding the latter gives rise to color gamut reduction. Image quality further worsens when neither of them is included. Accordingly, these observations verify the whole algorithm with the pre-processing, dehazing, and post-processing steps.
Comparisons on real-world hazy images for different variants of the proposed method.
Conclusion
This paper presented an efficient method for single image dehazing in linear time. It began with a literature review arguing that deep learning models are often overkill for this task and that dehazing by traditional computer vision techniques can achieve comparable performance at a much lower computational cost. A detailed description of the proposed method then followed: the pre-processing step enhances image sharpness, the dehazing step recovers the scene radiance according to the improved color attenuation prior, and the post-processing step compensates for the color gamut reduction. Subjective and objective evaluations against nine benchmarks demonstrated that the proposed method is substantially faster while achieving comparable performance.
Nonetheless, under-performance was observed on densely hazy images. This drawback is common for methods developed from hand-engineered features, which are not abstract enough to reflect how human visual systems recognize dense haze. In this case, image inpainting and conditional image generation may be viable alternatives, but it appears challenging to implement them computationally efficiently. Thus, this task is left for future research.