Introduction
Low-illumination environments are closely related to our daily life: we spend almost half of every day in them. They bring many inconveniences, especially in the security field, where captured low-illumination images often contain valuable information. In a low-illumination environment, due to dim light or insufficient exposure, images suffer from low brightness, low contrast and noise. One of the most straightforward solutions is to improve hardware, for example by using infrared monitoring or increasing the camera aperture. However, such hardware improvements are costly. Therefore, much research still focuses on software algorithmic solutions.
Most current work on low illumination focuses on image enhancement, aiming to improve the low-level visual quality of images. However, high-level tasks, such as object detection in low-illumination environments, have so far not received enough attention. Works on object detection under low illumination, as shown in Figure 1, are still scarce.
An example of low-illumination object detection. Our detector achieves strong results in some common low-illumination scenes.
Object detection in low-illumination environments is very challenging. Due to insufficient reflected light, the captured image contains many dark areas and much noise. Although many successful object detection algorithms have been proposed with the development of deep learning, most state-of-the-art object detectors, such as Faster-RCNN [1] and SSD [2], cannot reach their best performance under low-illumination conditions. Even with an additional light source, the uneven distribution of brightness makes it difficult to distinguish object details. We believe the underlying reason is that current mainstream detectors are designed for normal-illumination data. So far, there is no dedicated solution for vision tasks in low-illumination environments.
A natural idea is to take image enhancement as a pre-processing step before low-illumination detection. Therefore, in this paper, we pursue this idea by applying different low-illumination enhancement algorithms to ExDARK, a real low-illuminance scene dataset recently developed by Loh and Chan [3]. We train state-of-the-art object detection models on the illumination-enhanced data. However, we are disappointed to find that the detection performance of models trained on the enhanced data is no better than that of models trained directly on the original dark data. The experiments frustratingly imply that most current image enhancement algorithms, though achieving visually pleasing results, cannot meet the needs of low-illumination computer vision tasks. Therefore, we look into the causes and give some explanations by visualizing the impact of illumination on feature preservation at different convolution levels.
Going deeper, and without loss of generality, we develop a night vision detector based on a representative state-of-the-art detector, RFB-Net [4]. We introduce a specifically designed feature pyramid network into the backbone layers of RFB-Net for hierarchical feature fusion. Context information is fused to compensate for the loss of low-level textural/contour information in deeper layers under low-illumination conditions. Experiments on the ExDARK dataset demonstrate that our night vision detector achieves satisfying improvements in average detection precision.
The contributions of this paper are summarized as follows:
We visualize the impact of illumination on feature preservation at different convolution levels and explain the underlying reasons why illumination greatly affects detector performance.
We evaluate a two-step strategy that performs detection on illumination-enhanced images. The experimental results show, frustratingly, that most current low-illumination enhancement algorithms cannot meet real-time detection requirements and do not bring substantial performance improvements to detectors.
We propose a Night Vision Detector (NVD) with a specifically designed feature pyramid network and context fusion network on the basis of RFB-Net. We train and test the proposed detector directly on ExDARK, a recently published real low-light scene dataset. The experiments show that our scheme substantially improves the mean average detection precision in low-illumination scenes.
We further experiment on and discuss the effects of data augmentation and parameter initialization. We reach an experimental conclusion that no matter which illumination bias is specified for the initial model, the converged model trained on illumination-balanced data consistently biases towards low-illuminance scenes. Fine-tuning from pre-trained parameters of a normal-illumination model can benefit low-illumination performance.
In summary, our investigations in this paper can offer a good starting point for further studies on low-illumination object detection.
In the following parts, we briefly survey related work in Section II. Then we discuss the importance of illumination in Section III. In Section IV, we analyze the validity of current pre-enhancement strategies for low-illumination detection tasks. After that, in Section V, we describe the experimental settings of our developed night vision detector. In Section VI, we perform experiments and discussions on the effects of illumination-balanced data and model parameter training. At last, we give our conclusions in Section VII.
Related Work
A. Low-Illumination Image Enhancement
1) Traditional Pipeline
The most classic technique for enhancing the contrast of low-light images is histogram equalization [5]. It is very simple and has low computational complexity. However, gray levels that characterize image details are easily lost due to excessive gray-level merging.
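As a hedged illustration (function and parameter names are ours, not from any cited implementation), this classic technique can be sketched in a few lines of NumPy for an 8-bit grayscale image:

```python
import numpy as np

def histogram_equalization(gray: np.ndarray) -> np.ndarray:
    """Classic histogram equalization for an 8-bit grayscale image:
    remap intensities through the normalized cumulative histogram."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum().astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    lut = np.round(cdf * 255).astype(np.uint8)         # intensity lookup table
    return lut[gray]
```

The excessive gray-level merging noted above shows up here as distinct input levels collapsing onto the same entry of `lut`.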
Another classic technique is the
Inspired by the dark-channel-prior-based defogging algorithm [6], several low-illumination video enhancement algorithms were proposed [7], [8]. Łoza et al. [9] proposed a statistical modeling method based on image wavelet coefficients for images with low and uneven illumination.
2) Retinex Theory
The Retinex theory [10], derived from the human vision system (HVS), holds that the observed brightness of objects is composed of an illumination component and a reflection component. It is formulated as Equation 1:\begin{equation*} S = I \circ R\tag{1}\end{equation*} where $S$ is the observed image, $I$ is the illumination component, $R$ is the reflectance component, and $\circ$ denotes element-wise multiplication.
Inspired by Retinex theory, single-scale retinex (SSR) [11], multi-scale retinex (MSR) [12] and multiscale retinex with color restoration (MSRCR) [13] were successively proposed. Recently, illumination map estimation methods like LIME [14] were proposed to construct the illumination map by finding the maximum intensity across image channels. Robust-Retinex [15] was similarly proposed based on Retinex theory [10]. However, the decomposition of observed brightness is an ill-posed problem that has not yet been solved well.
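For concreteness, a minimal sketch of single-scale retinex in the log domain is given below, assuming the common Gaussian-surround estimate of the illumination component for a grayscale image; the sigma value and function name are illustrative, not taken from [11]:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def single_scale_retinex(image: np.ndarray, sigma: float = 80.0) -> np.ndarray:
    """SSR: estimate the illumination I with a Gaussian surround, then take
    log(S) - log(I) as the reflectance estimate, following S = I o R."""
    s = image.astype(np.float64) + 1.0            # offset avoids log(0)
    illumination = gaussian_filter(s, sigma=sigma)
    return np.log(s) - np.log(illumination)       # log-domain reflectance
```

MSR [12] essentially averages such SSR outputs over several sigma values.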
3) Deep-Learning Based Model
Benefiting from the popularity of deep learning methods, low-level image restoration research such as deblurring, denoising, and super-resolution has made great breakthroughs. However, for low-illumination enhancement it is very difficult to obtain corresponding ground truth, so most current low-illumination enhancement methods are trained on synthetic data. LLNet [16] was the first deep auto-encoder-based approach to identify signals from low-light images and adaptively brighten them without over-amplifying the lighter parts. Tao et al. [17] proposed a two-step strategy to enhance low-light images based on the atmospheric scattering lighting model.
Since Retinex theory matches human vision characteristics well, it has become popular to combine deep learning with Retinex theory, as in LightenNet [18]. Shen et al. [19] proved that multi-scale retinex is equivalent to a feedforward convolutional neural network with different Gaussian convolution kernels, and proposed MSR-Net for directly learning the end-to-end mapping between dark and bright images. Wei et al. [20] proposed the Retinex-Net model, which contains DecomNet for decomposition and EnhanceNet for lighting adjustment. Chen et al. [21] developed the first real low-illumination RAW dataset and an enhancement network, SID, for processing low-light images. Jiang et al. [22] proposed an unsupervised generative adversarial network, EnlightenGAN, to enable training without low/normal illumination image pairs. Most recently, a maximum-entropy-based Retinex model [23] trained by self-supervised learning was proposed. Hong et al. [24] proposed to address spectral variability by applying a data-driven learning strategy to inverse problems of hyperspectral unmixing.
B. Object Detection
Following R-CNN [25], which uses convolutional neural networks [26] for object detection, object detection schemes are divided into two categories: (1) two-stage detectors such as R-CNN [25], Faster-RCNN [1], MS-CNN [27] and Mask-RCNN [28], and (2) one-stage detectors such as YOLO [29], SSD [2], DSSD [30], RFB-Net [4] and RetinaNet [31].
The two-stage detection algorithm decomposes the detection process into two steps: candidate regions (region proposals) are first generated, then the candidate regions are classified and object locations are refined. Two-stage detection algorithms can achieve a low recognition error rate, but typically cannot meet real-time detection requirements. One-stage detection algorithms do not require a region proposal step; they directly generate the category labels and coordinate positions of objects of interest in a single pass, and are thus much faster than most two-stage algorithms. However, the detection accuracy of one-stage algorithms is mostly inferior to that of two-stage algorithms.
With the development of computer vision, both detection strategies have been greatly improved. The improvements are widely attributed to public datasets such as PASCAL VOC and COCO. However, all these famous public datasets contain less than 1% low-illuminance images. As a result, current detectors cannot fully exert their optimal performance in low-illuminance scenes.
In some application areas, such as optical remote sensing imagery, specific auxiliary information is utilized. Wu et al. [32] proposed an optical remote sensing imagery detector that jointly considers rotation-invariant channel features constructed in the frequency domain and the original spatial channel features. They further proposed a Fourier-based rotation-invariant feature boosting framework [33] to address the object deformation problem.
As we know, low-illumination scenes are closely related to our daily life, yet the development of corresponding intelligent vision systems remains under-studied. In this paper, we concentrate on providing some feasible solutions for improving object detection performance under low illumination.
Importance of Illumination
In order to study the impact of illumination on feature extraction at different convolution layers, in this section we experiment on datasets of different illumination and visualize the impact.
Low-illumination data come from ExDARK [3], with a total of 7363 images covering 12 categories ('bicycle', 'boat', 'bottle', 'bus', 'car', 'cat', 'chair', 'cup', 'dog', 'motorbike', 'people', 'table'). Among these images, 4800 are used for training and 2563 for testing.
To construct a normal-illumination counterpart, we randomly select 600 images for each ExDARK-defined category from the corresponding COCO category. As a result, a total of 7200 images are selected, of which 4800 are used for training and 2400 for testing. We denote the newly collected dataset COCO* to distinguish it from the original COCO dataset.
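A hedged sketch of this sampling step with the pycocotools API follows; the annotation file path and the ExDARK-to-COCO name mapping (e.g. 'people' to 'person', 'motorbike' to 'motorcycle', 'table' to 'dining table') are our assumptions rather than details given in the paper:

```python
import random
from pycocotools.coco import COCO

coco = COCO("instances_train2017.json")  # hypothetical annotation file
# COCO category names assumed to correspond to the 12 ExDARK classes.
class_names = ["bicycle", "boat", "bottle", "bus", "car", "cat",
               "chair", "cup", "dog", "motorcycle", "person", "dining table"]

subset = {}
for name in class_names:
    cat_ids = coco.getCatIds(catNms=[name])     # category id for this name
    img_ids = coco.getImgIds(catIds=cat_ids)    # all images containing it
    subset[name] = random.sample(img_ids, 600)  # 600 images per category
```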
Without loss of generality, we employ a lightweight one-stage object detection algorithm, representatively RFB-Net [4], for the experiment. One reason for selecting a one-stage strategy is that we can easily visualize the intra-layer feature maps of the whole network from bottom to top, which simplifies our analysis. We train RFB-Net on ExDARK and COCO* respectively with the same experimental settings and obtain two variants: RFB-Dark and RFB-Normal.
The detection performances are evaluated using the standard COCO evaluation API. Specifically, COCO evaluates AP over 10 IoU thresholds, from 0.50 to 0.95 in steps of 0.05, and reports the mean over these thresholds together with fixed-threshold (AP50, AP75) and scale-specific (APs, APm, APl) metrics.
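For reference, a typical invocation of this API looks as follows (the two file paths are placeholders):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

coco_gt = COCO("exdark_test_annotations.json")  # ground truth (placeholder)
coco_dt = coco_gt.loadRes("detections.json")    # detector outputs (placeholder)

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP, AP50, AP75, APs, APm, APl
```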
To observe the impact of different illumination data on model robustness, we perform cross-dataset testing: we train and test the model on data of different illumination. The cross-testing results are shown in Table 1. As expected, the detector's performance severely degrades when it is tested on a dataset whose illumination differs from that of its training data.
For deeper insight into the influence of illumination on feature learning, we visualize feature maps at different convolution levels. Comparisons between the RFB-Dark and RFB-Normal models are shown in Fig 2. Obviously, when dealing with a low-illumination image, RFB-Dark and RFB-Normal extract very different features at the same convolution layers. Features from RFB-Dark are much richer and more semantically complete than those from RFB-Normal. Especially at high-level layers, unexpectedly little information is extracted by RFB-Normal. That is why the low-illumination model RFB-Dark can successfully detect objects in a low-illumination environment while the normal-illumination model RFB-Normal cannot. Therefore, illumination, though low-level information, deserves specific attention for model robustness.
Shallow illumination information affects high-level feature representation learning, which in turn affects the final detection results. 'Backbone-3' denotes the output of the third layer of the backbone network. (Red: undetected groundtruth, Green: detected groundtruth, Yellow: proposed box).
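As a hedged illustration of how such intermediate feature maps can be captured in PyTorch, the sketch below registers forward hooks on named layers; the concrete layer names depend on the particular RFB-Net implementation and are assumptions here:

```python
import torch

def capture_feature_maps(model, image, layer_names):
    """Grab intermediate activations from the named layers of a model
    during one forward pass, e.g. for heatmap visualization."""
    feats, hooks = {}, []
    for name, module in model.named_modules():
        if name in layer_names:
            hooks.append(module.register_forward_hook(
                lambda mod, inp, out, n=name: feats.update({n: out.detach()})))
    with torch.no_grad():
        model(image)  # image: preprocessed tensor of shape (1, 3, H, W)
    for h in hooks:
        h.remove()    # clean up so later passes are unaffected
    return feats      # e.g. plot feats[n].mean(dim=1)[0] as a heatmap
```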
Pre-Enhancement May Be Invalid
Existing object detectors designed for normal conditions are powerless in the face of extremely adverse environments such as fog, rain, and night. A natural idea is to employ image enhancement as a preprocessing stage before proceeding to the high-level vision task; this seems in line with the requirements of human vision. Therefore, we use the traditional methods LIME [14] and Robust-Retinex [15] and the deep-learning-based method Retinex-Net [20] to enhance the original low-illumination data, obtaining three enhanced datasets. For simplicity, we name them after their respective constructors: LIME, Retinex-Rob and Retinex-Net.
We train the detection model on each of these newly constructed illumination-enhanced datasets. The detection performances are shown in Table 2. We are frustrated to find that, compared with training and testing directly on the original ExDARK data, none of the pre-enhanced datasets improves detection performance; instead, they decrease it. Some visualization results are shown in Fig 3, and more examples are given in the Appendix.
Comparison of object detection results before and after using image enhancement. (Red: undetected groundtruth, Green: detected groundtruth, Yellow: predicted bounding-box).
From these visual samples, we can easily see that the enhancement algorithms do improve image brightness visually. However, noise is inevitably introduced, especially by the Retinex-Net scheme; Robust-Retinex and LIME perform relatively better. In spite of that, their final detection performances are still worse than the detection performance on the original ExDARK. Furthermore, the enhancement process of Robust-Retinex takes too much time (e.g., processing a 1080P high-resolution image takes more than one minute), making it impossible to serve as a preprocessing step for real-time detection. Therefore, according to our experimental results, we reach a frustrating conclusion: beyond improving visual quality, current image enhancement algorithms seem unhelpful for the low-illumination detection task. Pre-enhancement may be invalid.
Making of Night Vision
In this section, we describe and evaluate our model developed specifically for the low-illumination detection task, built on the basis of the state-of-the-art RFB-Net model. We name the proposed detector the Night Vision Detector (NVD). As demonstrated in Section III, valuable information is easily lost in deeper layers; especially under low illumination, parts of an object easily merge into the dark background during convolution. Therefore, to improve detection performance under low illumination, we introduce a feature pyramid fusion network into the detection layers. Context information is fused into the detection backbone to compensate for the loss of low-level textural and contour features. The architecture details are shown in Fig 4.
The architecture of our developed Night Vision Detector (NVD). The specific structure is based on RFB-Net. A feature pyramid net is introduced to improve small-object detection. Contextual information fusion is introduced to maximally retain the limited and weak object information in low-illumination images.
A. Night Vision Detector
1) Feature Pyramid Fusion Networks
The Feature Pyramid Network (FPN) [34] was introduced to better represent objects at multiple scales and was later adopted in detectors such as Mask R-CNN [28]. FPN improves the standard feature extraction pyramid by adding a second pyramid that takes the high-level features from the first pyramid and passes them down to lower layers. It is a general strategy combining top-down fusion via skip connections with pyramid predictions at multiple scales.
The motivation of FPN makes it well suited to the low-illumination object detection task, since it generates feature pyramids with strong semantic information at all scales without obvious extra computational cost. Different from the original FPN [34], we modify the structure of the pyramid feature fusion process. The structural differences are shown in Fig 5.
The structures of our proposed pyramid feature fusion module and the original module [34]. 'BN+ReLU' denotes ReLU after BatchNormalization.
For the pyramid feature fusion modules illustrated by dashed red rectangles in the Feature Pyramid Net of Figure 4, from top to bottom, the lower spatial resolution feature map (e.g. the "conv11" layer) is interpolated to the spatial size of the higher spatial resolution feature map (e.g. the "conv10" layer) and concatenated with it, then fed into a convolution followed by BatchNormalization and ReLU (the 'BN+ReLU' block in Fig 5) to produce the fused feature map.
The motivation of our pyramid feature fusion process is to maximally utilize pre-trained channel information. Through concatenation, the semantic information in the channels of the lower spatial resolution feature maps and the textural/contour information in the channels of the higher spatial resolution maps are both redundantly preserved.
In contrast, the 'add' operation in the original FPN has the limitation that the channel size of the higher spatial resolution feature map must be reduced to match that of the lower spatial resolution feature map. We also tried the original FPN during our many experimental attempts; frustratingly, the training process did not converge with the original FPN structure.
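A minimal PyTorch sketch of this concatenation-based fusion is given below; the 1×1 kernel size and bilinear interpolation are our assumptions, since the exact settings are not pinned down here:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConcatFusion(nn.Module):
    """Sketch of the concatenation-based pyramid fusion described above.
    Kernel size and output channels are illustrative assumptions."""
    def __init__(self, top_channels: int, lateral_channels: int, out_channels: int):
        super().__init__()
        self.fuse = nn.Sequential(  # Conv -> BN -> ReLU, as in Fig. 5
            nn.Conv2d(top_channels + lateral_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, top: torch.Tensor, lateral: torch.Tensor) -> torch.Tensor:
        # Upsample the coarser (top) map to the lateral map's spatial size,
        # then concatenate along channels instead of element-wise addition.
        top = F.interpolate(top, size=lateral.shape[-2:], mode="bilinear",
                            align_corners=False)
        return self.fuse(torch.cat([top, lateral], dim=1))
```

Because the inputs are concatenated rather than added, neither map's channels need to be compressed beforehand, which matches the motivation stated above.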
In our later ablation experiments, we validate that the proposed feature pyramid fusion network improves performance, especially on small-object detection (by 2.2%), compared with the basic RFB-Net.
2) Context Fusion (CF) Net
We observe that images captured under low illumination contain many dark areas. Object information in such images is often covered by dark areas or merged with the dark background. Affected by uneven light sources, the sensor often captures only limited parts of objects' contours, and in most cases the captured contours are weak. Conventional hierarchical convolutions inevitably lose what little valuable information there is, such as informative texture/contour details. Therefore, we introduce a context fusion net, in a bottom-up fashion, into the backbone network for feature compensation during the lower-level to higher-level convolution process. The fusion process has a structure similar to the pyramid feature fusion process; a hedged sketch is given below.
Specifically, we select the ‘
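A hedged PyTorch sketch of such a bottom-up context fusion block, mirroring the top-down module above (the selected layers, kernel size and channel counts are illustrative assumptions), could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottomUpContextFusion(nn.Module):
    """Illustrative bottom-up context fusion: a shallow, texture-rich feature
    map is resized to a deeper map's resolution and concatenated with it,
    so weak contour cues survive into the deeper detection layers."""
    def __init__(self, low_channels: int, high_channels: int, out_channels: int):
        super().__init__()
        self.fuse = nn.Sequential(  # Conv -> BN -> ReLU, as in Fig. 5
            nn.Conv2d(low_channels + high_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, low: torch.Tensor, high: torch.Tensor) -> torch.Tensor:
        low = F.interpolate(low, size=high.shape[-2:], mode="bilinear",
                            align_corners=False)  # resize shallow map down
        return self.fuse(torch.cat([low, high], dim=1))
```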
B. Experiment
Since there is currently no specialized solution for low-illuminance detection, the experiments in this paper are compared against our base model, RFB-Net. We train the model directly on the ExDARK dataset and study the contribution of each proposed component. The experimental results are shown in Table 3.
Compared with the basic RFB-Net, our proposed FPN component improves the detection performance.
It should be pointed out that both proposed components are general and can easily be applied to any multi-scale detection network. Our proposed scheme effectively improves object detection performance under low illumination.
The Effects of Illumination-Balanced Data and Parameter Initialization
As we know, the performance of a data-driven deep model generally benefits from rich data and well-initialized parameters. Therefore, in this section, we discuss the effects of data augmentation and parameter initialization. Since the discussed factors are not tied to a specific model, we run these experiments on the basic RFB-Net for simplicity.
We augment the training data by pooling the training sets of the two compared datasets, COCO* and ExDARK, for a total of 9,600 training images.
We first train models on COCO*, ExDARK and COCO*+ExDARK respectively with randomly initialized parameters. The testing performances are shown in the first three rows of Table 4. It is worth noting that in this experiment we make no model setting that favors any kind of illumination data. Augmented training data boosts performance; however, compared with the improvement on the normal-illumination data COCO*, the detection performance on low-illumination scenes improves far more. We surmise that the model tends to learn hard samples during training, and that the knowledge learned from normal-illumination data benefits the low-illumination case more.
Afterwards, we try to specify the model's bias through a weighted combination of parameters pre-trained on different types of lighting data. We interpolate between parameters pre-trained on normal-illumination data and parameters pre-trained on low-illumination data, following Equation 2:\begin{equation*} \theta_{init} = (1 - \alpha)\,\theta_{dark} + \alpha\,\theta_{coco}, \qquad \alpha \in [0, 1] \tag{2}\end{equation*}
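A hedged sketch of this interpolation over PyTorch state dicts follows; the checkpoint paths and function name are placeholders, and the checkpoints are assumed to store plain floating-point state dicts:

```python
import torch

def interpolate_params(theta_dark: dict, theta_coco: dict, alpha: float) -> dict:
    """Blend two pre-trained parameter sets as in Eq. (2):
    theta_init = (1 - alpha) * theta_dark + alpha * theta_coco.
    Assumes every entry is a floating-point tensor."""
    assert theta_dark.keys() == theta_coco.keys()
    return {k: (1.0 - alpha) * theta_dark[k] + alpha * theta_coco[k]
            for k in theta_dark}

# Usage (checkpoint paths are placeholders):
theta_dark = torch.load("rfb_dark.pth")
theta_coco = torch.load("rfb_coco.pth")
theta_init = interpolate_params(theta_dark, theta_coco, alpha=0.5)
# model.load_state_dict(theta_init)
```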
We discretely take three values of α for initialization.
Conclusion
In this work, we investigated several important issues in object detection under low illumination and, without loss of generality, proposed a Night Vision Detector (NVD) based on RFB-Net for low-illumination environments. We find that illumination information has a great impact on feature learning; data of different illumination should be modeled separately, since they interfere with each other during training. We further suggest that current image-quality-driven enhancement methods can improve visual quality but are unhelpful for high-level real-time object detection tasks. Our proposed NVD framework therefore introduces a feature pyramid fusion net and a context fusion net; the two components comprehensively consider the information that affects low-illumination detection. The experiments demonstrate that the proposed NVD achieves low-illumination detection performance 0.5%~2.8% higher than the basic RFB-Net on all standard COCO evaluation criteria. Our work provides baseline strategies and sheds light on future studies of low-illumination detection.
Appendix
More results are shown in Fig. 6, Fig. 7 and Fig. 8. Ten different low-illumination scene types (Low, Ambient, Object, Single, Weak, Strong, Screen, Window, Shadow, Twilight; please refer to [3] for their specific definitions) and the corresponding detection results after low-illumination enhancement are listed. The results suggest that the visual improvements after enhancement bring little benefit for high-level vision tasks.
Comparisons of detection results for images of the 10 listed low-illumination types. From left to right: original image, and images enhanced by Retinex-Net, Robust-Retinex and LIME.
Comparisons of detection results for images of the 10 listed low-illumination types. From left to right: original image, and images enhanced by Retinex-Net, Robust-Retinex and LIME.
Comparisons of detection results for images of the 10 listed low-illumination types. From left to right: original image, and images enhanced by Retinex-Net, Robust-Retinex and LIME.