Introduction
Human vision can effectively select relevant information out of irrelevant noise and locate the most relevant subjects in a scene. As a fundamental issue in computer vision, saliency detection has been applied as a pre-processing procedure to a wide range of computer vision tasks, such as object segmentation [1], image compression [2], object detection [3], and image retrieval [4]. As saliency detection is capable of finding the most important and distinctive region in an image, we apply it to infrared pedestrian detection, an essential task for driver assistance and intelligent transportation systems. However, constrained by the characteristics of infrared imaging, it is still challenging to accurately detect saliency in infrared pedestrian images.
The development of saliency detection methods can be roughly divided into two stages. The first stage focuses on exploring low-level cues of salient objects, such as color [5], orientation [6], and texture [7]. Because of the uniqueness and rareness of salient objects, the contrast prior has been widely used as a computational mechanism to measure the difference between foreground and background. Contrast can be investigated from both local and global perspectives according to the scale of pixel neighborhoods. Local contrast [8]–[10] assumes that the more distinctive an object is compared with its neighborhood, the more salient it will be. However, contrast with only local cues often results in wrongly suppressed internal regions of salient objects. To alleviate this problem, global contrast [11] was proposed, which assigns higher saliency scores to objects with more unique features in the whole image. Global contrast is useful for highlighting the whole object, but it may fail to thoroughly suppress the background. Earlier contrast mechanisms usually take pixels as processing units, which may suffer from boundary blurring. To obtain saliency maps with well-defined boundaries, contrast based on segments is exploited; it can suppress background noise and reduce the computational load, with segments produced by methods such as simple linear iterative clustering (SLIC) [12], mean shift [13], and the Gaussian mixture model [14].
The second stage is propagation based saliency detection. Recently, propagation algorithms have attracted increasing attention in saliency detection and achieved state-of-the-art performance. Markov chains [15], random walks [16], and manifold ranking [17] are the most frequently used propagation methods, all of which are based on graphs. Harel et al. [18] first put forward graph based visual saliency, which employs an ergodic Markov chain to produce feature maps. Li et al. [19] propose a novel regularized random walk, which introduces a fitting constraint to take into account the local image data and the prior estimation. Later, Zhang et al. [17] infer the saliency score of each region via graph-based manifold ranking, which ranks the similarity of superpixels with foreground or background seeds. In addition to these classic methods, various new patterns of saliency propagation have been proposed. Li et al. [20] define the saliency value using a co-transduction algorithm, which fuses both boundary and objectness labels through an inter-propagation scheme. Qin et al. [21] present a cellular automata based saliency propagation method exploiting the intrinsic relevance between neighboring cells to improve the saliency performance. Qin et al. [22] further propose the Cuboid Cellular Automata to integrate multiple saliency maps in a Bayesian framework, which incorporates low-level image features as well as high-level semantic information. Nevertheless, these saliency propagation methods still cannot perform well on challenging images, especially when the salient objects are similar to the background.
Even though various saliency models have been proposed recently, most of them are designed for visible images. Some works directly apply these state-of-the-art models to infrared images as pre-processing to locate salient objects [23], [24], but they obtain only coarse results or even fail in saliency detection. Compared with visible images, infrared images have unique advantages. They are less sensitive to lighting conditions, which makes it possible to eliminate the influence of illumination variations, so they can be used day and night and in other difficult situations. Additionally, benefiting from the insensitivity to color, texture, and other appearance features, infrared images can be used to separate objects with similar appearances by their thermal radiation differences. With infrared pedestrian images, more challenges exist. Firstly, due to the limitations of infrared thermal imaging, infrared pedestrian images have low clarity, low SNR, and low contrast. Secondly, there is no color information and little texture information in infrared images, which makes it difficult to extract saliency features of objects; this is also the primary reason why most existing saliency models fail with infrared images. Thirdly, high image intensity is a crucial characteristic of pedestrians in infrared images, but non-human objects, such as light poles, vehicles, and tree trunks, may also produce additional bright areas. These interferences increase the difficulty of saliency detection in infrared pedestrian images.
To apply saliency detection to infrared pedestrian images, some research has been carried out. Ko et al. [25] calculate the luminance saliency map by estimating the luminance contrast using a center-surround scheme. Zhang et al. [26] propose an associative saliency generated from both region and edge contrasts. Li et al. [27] apply the gradient information of pedestrians to enhance the uniqueness of intensity, and combine it with multi-scale contrasts to obtain the final saliency. Wang et al. [28] exploit a mutual consistency guided fusion strategy to adaptively combine the luminance contrast saliency map and the contour saliency map for infrared images. Li et al. [1] first calculate the background likelihood with a background prior, and then use a Bayesian model to obtain the object prior based saliency; the final saliency of this method is an integration of the background prior and the object prior.
However, previous saliency models designed for infrared images mainly use low-level features, such as gradient and intensity, to describe salient objects, and employ weighted summation or multiplication to integrate these features. Thus, these models only fit simple images and perform poorly for complex infrared scenes with diverse background compositions, including trees, buildings, roads, skies, street lamps, brushwood, and other objects. Taking these problems into consideration, our work proposes two unique saliency features, derived from both thermal and appearance characteristics, to describe pedestrians in infrared images. These two features have a better ability to represent the saliency of pedestrians in infrared images. Our algorithm also introduces saliency propagation to integrate the features and optimize the saliency performance simultaneously. The proposed method consists of three parts: firstly, the thermal analysis based saliency (TAS) is proposed based on the thermal characteristics of pedestrians and radiation models; secondly, taking into account the appearance features, the appearance analysis-weighted saliency (AAS) is introduced as a complement; finally, a mutual guidance based saliency propagation method is proposed to mutually facilitate the two features and improve the final saliency.
Thus, the main contributions of this paper are as follows:
A novel propagation based saliency model is proposed to adaptively detect pedestrians in complex infrared images. The proposed method outperforms state-of-the-art saliency detection methods on both public datasets and a more complex dataset constructed in this work.
Two features are explored from both an infrared imaging mechanism and the actual performance to describe the saliency of pedestrians in infrared images, including TAS and AAS. These features are able to distinguish pedestrians from complex backgrounds.
A mutual guidance based saliency detection method is developed in this paper, which puts forward the concepts of intra-scale and inter-scale neighborhoods. This propagation method can not only integrate the two saliency features but also correct any mistakes in initial saliency maps to improve the final saliency.
Two datasets, IMS and DIP, are constructed, together comprising 600 infrared pedestrian images covering more than 33 scenes. We publish the datasets and the source code of this work at <https://github.com/zhxtu/SP_IR>.
Proposed Method
Fig. 1 shows the diagram of the proposed saliency detection method for infrared pedestrian images. Firstly, SLIC [29] is used to segment the input infrared image into homogeneous superpixels. Secondly, the maximally stable extremal region (MSER) [30] is extracted to measure the stableness of pedestrians, which is further improved by an intensity filter to obtain the thermal analysis based saliency (TAS). Thirdly, the intensity contrast is calculated and further enhanced by the vertical edge weight and intensity weight to obtain the appearance analysis-weighted saliency (AAS). Finally, a mutual guidance based propagation method, which combines the intra-scale and inter-scale neighborhoods, is introduced to integrate the two features and improve the final saliency.
A. Thermal Analysis Based Saliency (TAS)
Infrared images are generated by translating thermal radiation through thermographic cameras. Thus, infrared images are the products of complex interactions among factors such as temperature, emissivity, and atmospheric effects. Besides, the intensity of each object is determined not only by the thermal radiation of the object itself, but also by the reflection of other objects and the atmosphere [31]. Calculating the saliency of pedestrians thus amounts to suppressing the radiation from the background and extracting the radiation of the pedestrians themselves.
Based on the thermal analysis, we first introduce the MSER-based local stableness, which is further improved by the intensity filter to obtain the TAS.
1) MSER-Based Local Stableness
Fig. 2 shows an infrared image with a pedestrian and its corresponding 3D intensity plot. Obviously, the intensity on the pedestrian differs greatly from that of its surrounding regions. This phenomenon results from the thermal imaging principle [32] that stronger thermal radiation generates higher intensities. As temperature increases, atomic and molecular activity is enhanced, producing more heat and stronger thermal radiation. Thus, pedestrians, with higher temperatures, are usually brighter than the background.
An example of a local region in an infrared pedestrian image, and the corresponding 3D intensity plot.
Besides, object emissivity, serving as a decisive factor of infrared radiation, is closely related to the material property of the object [31]. Thus, regions composed of different materials differ in intensity accordingly. Hence, pedestrian regions differ from their surroundings and are completely surrounded by regions with lower intensities.
Following the principle that areas surrounded by others tend to be more salient, infrared pedestrian regions can be described by the capacity of the MSER for detecting surrounded regions with a homogeneous intensity. Thus, the MSER-based local stableness is proposed. Although MSER is an existing approach, it is mostly applied to text localization and has not yet been used to measure saliency accurately. MSER is defined by an extremal property of its intensity function in the region and on its outer boundary. To calculate MSER in an image $\boldsymbol{I}_{m}$, an extremal region $\boldsymbol{R}_{l}$ is a connected region satisfying \begin{equation*} \forall p \in \boldsymbol {R}_{l}, \quad \forall q \in boundary(\boldsymbol {R}_{l})\rightarrow I_{m}(p)\geq I_{m}(q),\tag{1}\end{equation*}
Candidate extremal regions are the connected components of the binary images obtained by thresholding $\boldsymbol{I}_{m}$ at each intensity level $g$:\begin{equation*} \boldsymbol {I}^{g}_{bim}= \begin{cases} 1 & \boldsymbol {I}_{m}\geq g\\ 0 & \text {otherwise}\\ \end{cases} g \in [\min (\boldsymbol {I}_{m}), \max (\boldsymbol {I}_{m})],\tag{2}\end{equation*}
The stability of an extremal region $\boldsymbol{R}_{l}^{g}$ is then measured by its relative area change over a threshold step $\delta$:\begin{equation*} \Psi (\boldsymbol {R}_{l}^{g})=(|\boldsymbol {R}_{l}^{g+\delta }-\boldsymbol {R}_{l}^{g-\delta }|)/|\boldsymbol {R}_{l}^{g}|,\tag{3}\end{equation*} and regions at which $\Psi$ attains a local minimum across levels are selected as maximally stable.
To measure the stableness $F(p)$ of each pixel $p$, the number of maximally stable regions $\boldsymbol{sr}_{k}$ $(k=1,\ldots,K)$ that cover $p$ is counted:\begin{equation*} F(p)=\sum \limits _{k=1}^{K} e_{k}(p)\quad e_{k}(p)=\begin{cases} 1 & p \in \boldsymbol {sr}_{k}\\ 0 & \text {otherwise},\\ \end{cases}\tag{4}\end{equation*}
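To make Eq. (4) concrete, the following is a minimal sketch of the pixel-level stableness map using OpenCV's MSER implementation; it is an illustration under assumptions, not the released code, and the detector parameters are left at their defaults.

```python
import cv2
import numpy as np

def pixel_stableness(gray):
    """Count how many maximally stable extremal regions cover each pixel (Eq. 4)."""
    mser = cv2.MSER_create()                # Eqs. (1)-(3) are handled internally
    regions, _ = mser.detectRegions(gray)   # gray: uint8 single-channel image
    F = np.zeros(gray.shape, dtype=np.float32)
    for pts in regions:                     # pts: (n, 2) array of (x, y) coordinates
        F[pts[:, 1], pts[:, 0]] += 1.0      # accumulate the region memberships e_k(p)
    return F
```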
(a) Pixel based stableness; (b) superpixel based stableness.
As thermal radiation from parts of a human body is blocked by clothes, there is generally noise inside pedestrian regions. In order to smooth the internal distribution of intensities inside pedestrian regions, the image is segmented into $N$ superpixels $\boldsymbol{sp}_{i}$ with SLIC, and the stableness of each superpixel is computed as the average stableness of its pixels:\begin{equation*} F_{s}(i)=\frac {\sum \limits _{p\in \boldsymbol {sp}_{i}} F(p)}{|\boldsymbol {sp}_{i}|}.\tag{5}\end{equation*}
With the accumulation over each superpixel, stable regions are enhanced and the background is suppressed, while accurate contour information is preserved. Fig. 3(b) shows that the superpixel based stableness reduces the inhomogeneous saliency distribution inside human body regions and partly reduces noise in the background.
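A sketch of the superpixel averaging in Eq. (5), using scikit-image's SLIC; the `channel_axis=None` argument assumes scikit-image >= 0.19, and the segment count and compactness are placeholder values.

```python
import numpy as np
from skimage.segmentation import slic

def superpixel_stableness(gray, F, n_segments=300):
    """Average the pixel stableness F over each SLIC superpixel (Eq. 5)."""
    labels = slic(gray.astype(float), n_segments=n_segments,
                  compactness=0.1, channel_axis=None)  # single-channel SLIC
    Fs = np.zeros_like(F, dtype=np.float32)
    for lab in np.unique(labels):
        mask = labels == lab
        Fs[mask] = F[mask].mean()          # Eq. (5): mean stableness per superpixel
    return Fs, labels
```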
2) Intensity Filter-Enhanced Saliency
With only TAS, some objects, such as street lamps and tree trunks, may be wrongly assigned high saliency values. To distinguish pedestrians from other objects, the principle that pedestrians always produce stronger thermal radiation is exploited. For pedestrians in the scene, the thermal radiation received by an infrared camera comes not only from the pedestrians themselves, but also from the radiation reflected from other objects onto pedestrians and the thermal radiation of the atmosphere. According to the physics of radiation [33], emissivity and reflectivity are inversely proportional, and the reflectivity of a pedestrian is usually much lower than its emissivity because of the rough surface. Thus, radiation reflected from other objects can be ignored. As the radiation of the atmosphere is directly received by the thermal sensor, the influence of the atmosphere is significant. As a result, the total radiation $E$ received by the camera can be modeled as:\begin{equation*} E=E_{o}+E_{A},\tag{6}\end{equation*} where $E_{o}$ is the radiation of the object itself and $E_{A}$ is the radiation of the atmosphere.
Since the value of $E_{A}$ is approximately uniform over the whole image, it can be estimated by the average intensity $I_{\mu}$ of the image. The intensity filter $IF(i)$ of superpixel $i$ is then defined as:\begin{equation*} IF(i)=\left |{\frac {\sum \limits _{p\in \boldsymbol {sp}_{i}} I_{m}(p)}{|\boldsymbol {sp}_{i}|}-I_{\mu }}\right |^{2},\tag{7}\end{equation*}
The TAS is then obtained by enhancing the superpixel stableness with the intensity filter:\begin{equation*} S_{TAS}(i)=F_{s}(i)\cdot IF(i).\tag{8}\end{equation*}
By subtracting $I_{\mu}$, the influence of the atmospheric radiation is removed, so the filter responds to the radiation of objects themselves and emphasizes superpixels whose radiation deviates strongly from the average background level.
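Combining Eqs. (7) and (8), a minimal sketch of the TAS computation follows; the normalization to [0, 1] at the end is an assumption added for display purposes.

```python
import numpy as np

def thermal_saliency(gray, Fs, labels):
    """Combine superpixel stableness with the intensity filter (Eqs. 7-8)."""
    I_mu = gray.mean()                            # mean image intensity, estimate of E_A
    S = np.zeros(gray.shape, dtype=np.float32)
    for lab in np.unique(labels):
        mask = labels == lab
        IF = abs(gray[mask].mean() - I_mu) ** 2   # Eq. (7): intensity filter
        S[mask] = Fs[mask].mean() * IF            # Eq. (8): TAS per superpixel
    return S / (S.max() + 1e-12)                  # normalize for visualization
```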
B. Appearance Analysis-Weighted Saliency (AAS)
Although TAS has a good ability to make pedestrian regions prominent, some targets that are too small or too similar to the background may be wrongly suppressed by TAS. As a supplement, the AAS is introduced. Contrast is a commonly used feature in saliency detection, which often measures the color difference. As observed from infrared pedestrian images, the intensity distribution of a pedestrian is obviously different from the background. Therefore, contrast can also be applied to infrared images to highlight pedestrians, defined as:\begin{equation*} con(i)=\sum \limits _{j=1}^{N}|v_{i}-v_{j}|\cdot \exp \left ({-|d_{i}-d_{j}|}\right),\tag{9}\end{equation*} where $v_{i}$ and $d_{i}$ denote the mean intensity and the spatial position of superpixel $i$, respectively, so that spatially closer superpixels contribute more to the contrast.
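A sketch of Eq. (9) follows; the exponential decay over the centroid distance |d_i - d_j| reflects our reading of the (partially garbled) formula above, and the decay scale 0.25 is an assumed constant.

```python
import numpy as np

def intensity_contrast(gray, labels):
    """Spatially weighted intensity contrast per superpixel (Eq. 9)."""
    labs = np.unique(labels)
    v = np.array([gray[labels == lab].mean() for lab in labs])   # mean intensities v_i
    d = np.array([[c.mean() for c in np.nonzero(labels == lab)] for lab in labs])
    d = d / np.array(gray.shape)                                 # normalized centroids d_i
    con = np.zeros(len(labs))
    for i in range(len(labs)):
        dist = np.linalg.norm(d - d[i], axis=1)
        con[i] = np.sum(np.abs(v - v[i]) * np.exp(-dist / 0.25)) # Eq. (9)
    return con, labs
```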
(a) Contrast without weight; (b) Contrast with vertical edge weight; (c) Contrast with both vertical edge weight and intensity weight.
However, the contrast feature has two shortcomings. Firstly, low contrast is an inherent characteristic of infrared images, which makes it difficult to separate pedestrians from the background with the contrast feature alone. Secondly, trees, lamps, and other objects with high intensities may also have high values in the contrast map, which interferes with the saliency detection of pedestrians. To handle these problems, the AAS is introduced, which employs the appearance information of pedestrians to enhance the contrast:\begin{equation*} S_{AAS}(i)=w_{i}\cdot con(i),\tag{10}\end{equation*}
Vertical shape is a distinct feature of pedestrians, which is widely used in pedestrian detection and recognition. Aspect ratio [34] is commonly used to describe the vertical feature of pedestrians, yet it is inaccurate and difficult to extract. In this paper, the vertical edge weight is used to describe the vertical feature of pedestrians. As objects usually contain more edge information than background, superpixels with more edge information are more likely to belong to the salient object. Also, the vertical edges of a pedestrian are much stronger than the horizontal edges and can better represent a pedestrian as shown in Fig. 5.
An example of a pedestrian in an infrared image, and its vertical edges and horizontal edges obtained by the Canny edge detection method.
To calculate the vertical edge weight $w_{i}^{ve}$, the vertical edge strengths $g_{v}(p)$ extracted from the boundary map are averaged over the pixels $\boldsymbol{b}_{i}$ of superpixel $i$:\begin{equation*} w_{i}^{ve}=\frac {1}{|\boldsymbol {b}_{i}|}\sum \limits _{p \in \boldsymbol {b}_{i}} g_{v}(p),\tag{11}\end{equation*}
However, background regions along edges are wrongly highlighted, while regions inside the pedestrian containing only a few edges are mistakenly suppressed by the vertical edge weight. As pedestrians have higher intensities than their surrounding regions, the mean intensity of each superpixel is applied as the intensity weight $w_{i}^{in}$:\begin{equation*} w_{i}^{in}=\frac {1}{|\boldsymbol {sp}_{i}|}\sum \limits _{p\in \boldsymbol {sp}_{i}} I_{m}(p).\tag{12}\end{equation*}
Fig. 4(c) shows that the intensity weight is an effective complement to the vertical edge weight. It not only fills the holes caused by the edge weight, but also suppresses the surrounding regions of pedestrians. At last, by integrating the vertical edge weight and the intensity weight, the appearance weight $\boldsymbol{w}$ is obtained as:\begin{equation*} \boldsymbol {w}=\boldsymbol {w}^{ve}+\boldsymbol {w}^{in}.\tag{13}\end{equation*}
This equation formulates the rule that superpixels with more vertical edges and higher intensity values have higher probabilities of belonging to pedestrians. The effectiveness of the appearance weight is demonstrated by Fig. 4. The vertical edge weight performs well to suppress backgrounds, and the intensity weight can better highlight foregrounds. Thus, the appearance weight improves the intensity contrast to achieve a better saliency detection performance.
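The appearance weighting in Eqs. (10)-(13) can be sketched as below; a Sobel filter stands in for the PB boundary map the paper uses, and averaging the vertical edge response over all pixels of a superpixel, as well as the normalizations, are our simplifications.

```python
import cv2
import numpy as np

def appearance_saliency(gray, labels, con, labs):
    """Weight the intensity contrast by vertical edge and intensity cues (Eqs. 10-13)."""
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)  # horizontal derivative responds
    gv = np.abs(gx)                                  # to vertical structures
    S = np.zeros(gray.shape, dtype=np.float32)
    for k, lab in enumerate(labs):
        mask = labels == lab
        w_ve = gv[mask].mean() / (gv.max() + 1e-12)  # Eq. (11), normalized
        w_in = gray[mask].mean() / 255.0             # Eq. (12), normalized
        S[mask] = (w_ve + w_in) * con[k]             # Eqs. (13) and (10)
    return S / (S.max() + 1e-12)
```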
C. Mutual Guidance Based Saliency Propagation
Previous propagation based saliency models commonly integrate saliency features into an initial saliency map before propagation via summation or multiplication [16], [21]. These integration methods often result in information loss and incorrect saliency distributions. Thus, the proposed method introduces mutual guidance based propagation to integrate saliency features and optimize saliency performance simultaneously, with no integration needed before propagation. The propagation method is inspired by cellular automata, which consist of three factors: cells, neighborhoods, and updating rules.
In this paper, each superpixel is taken as a cell. Previous propagation models only use surrounding superpixels to smooth and amend the initial saliency. Different from this, the proposed method propagates saliency scores between not only the neighboring superpixels (intra-scale neighborhood), but also the TAS and AAS feature maps (inter-scale neighborhood).
1) Intra-Scale Neighborhood
Based on the intuition that neighboring cells are likely to share similar saliency values, the saliency of each cell should be determined by its neighborhood. As shown in Fig. 6, the intra-scale neighborhood of a cell (red dot) is defined as its direct neighboring cells (green dots connected by solid lines) and the direct neighbors of those cells (green dots connected by dotted lines). Also, neighbors that have intensities similar to the central cell should be assigned larger weights. Thus, the intensity similarity matrix $\boldsymbol{M}=[m_{ij}]$ is defined as:\begin{equation*} m_{ij}= \begin{cases} \exp (-|v_{i}-v_{j}|/\sigma ^{2})& j \in \boldsymbol {NB}(i)\\ 0 & i=j~\text {or otherwise},\\ \end{cases}\tag{14}\end{equation*} where $v_{i}$ is the mean intensity of cell $i$ and $\boldsymbol{NB}(i)$ denotes its intra-scale neighborhood.
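A sketch of Eq. (14); `neighbors[i]` is an assumed precomputed list containing the direct neighbors of superpixel i together with the neighbors of those neighbors, and the value of sigma2 is a placeholder.

```python
import numpy as np

def similarity_matrix(v, neighbors, sigma2=0.1):
    """Intensity similarity between each cell and its intra-scale neighborhood (Eq. 14)."""
    n = len(v)
    M = np.zeros((n, n))
    for i in range(n):
        for j in neighbors[i]:                       # j in NB(i)
            M[i, j] = np.exp(-abs(v[i] - v[j]) / sigma2)
    return M                                         # zero for i = j and non-neighbors
```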
Demonstration of the proposed mutual guidance based propagation method: intra-scale neighborhood, inter-scale neighborhood and update rules.
Effects of the proposed saliency propagation method. Top: the original results without propagation; Middle: results of propagation with only the intra-scale neighborhood; Bottom: results of the full mutual guidance based propagation.
2) Inter-Scale Neighborhood
As intra-scale neighborhoods can assimilate neighboring cells, small targets may be wrongly suppressed, as shown in the second column of Fig. 7. With the smoothing effect of intra-scale neighborhoods, pedestrians with small sizes are likely to be assimilated by their surrounding backgrounds because of the low contrast between them. To solve this issue, an inter-scale neighborhood is proposed.
The principle is that the final saliency value of each cell should be approximately consistent with its corresponding values in TAS and AAS. TAS, based on MSER, can locate the salient regions and suppress the background, but it sometimes unduly suppresses salient regions. AAS, based on contrast, can enhance the difference between foreground and background, but it cannot strongly suppress the background. Therefore, these two features complement each other. Then, cells with the same coordinates in the TAS and AAS maps are defined as the inter-scale neighborhood of each other (red dots linked by blue arrows in Fig. 6 form a pair of neighbors), so they can amend each other as an aid to the intra-scale neighborhoods. Therefore, the state of cell $i$ in one feature map at iteration $t$ is determined by its own previous state, the previous state of its inter-scale neighbor, and the previous states of its intra-scale neighbors:\begin{equation*} sf_{1}^{t}(i)\Leftarrow \{sf_{1}^{t-1}(i),sf_{2}^{t-1}(i),sf_{1}^{t-1}(\boldsymbol {NB}(i))\},\tag{15}\end{equation*} where $sf_{1}$ and $sf_{2}$ denote the two saliency feature maps.
3) Updating Rules
To balance the impact strengths of intra-scale and inter-scale neighborhoods on the propagation, a coherence matrix $\boldsymbol{C}$ is introduced, and the updating rules are defined as:\begin{equation*} \begin{cases} \boldsymbol {F}_{T}^{t}=\boldsymbol {F}_{T}^{t-1}+\underbrace {(\boldsymbol {I}-\boldsymbol {C})\cdot \boldsymbol {M} \cdot \boldsymbol {F}_{T}^{t-1}}_{\text {Intra-scale}}+\underbrace {\boldsymbol {C} \cdot \boldsymbol {F}_{A}^{t-1}}_{\text {Inter-scale}}\\ \boldsymbol {F}_{A}^{t}=\boldsymbol {F}_{A}^{t-1}+\underbrace {(\boldsymbol {I}-\boldsymbol {C})\cdot \boldsymbol {M} \cdot \boldsymbol {F}_{A}^{t-1}}_{\text {Intra-scale}}+\underbrace {\boldsymbol {C} \cdot \boldsymbol {F}_{T}^{t-1}}_{\text {Inter-scale}}\\ \boldsymbol {F}_{T}^{t}=\frac {\boldsymbol {F}_{T}^{t}}{\|\boldsymbol {F}_{T}^{t}\|}, \quad \boldsymbol {F}_{A}^{t}=\frac {\boldsymbol {F}_{A}^{t}}{\|\boldsymbol {F}_{A}^{t}\|}\\ \boldsymbol {S}^{t}=\boldsymbol {F}_{T}^{t} \cdot \boldsymbol {F}_{A}^{t},\\ \end{cases}\tag{16}\end{equation*} where $\boldsymbol{F}_{T}^{t}$ and $\boldsymbol{F}_{A}^{t}$ are the saliency vectors of TAS and AAS at iteration $t$, $\boldsymbol{I}$ is the identity matrix, and $\boldsymbol{S}^{t}$ is their element-wise product.
Algorithm 1 Mutual Guidance Based Saliency Propagation
Input: The TAS and AAS; the intra-scale neighborhood similarity matrix $\boldsymbol{M}$; the coherence matrix $\boldsymbol{C}$; the convergence threshold $T_{C}$; the maximum iteration number $T_{max}$.
Initialize: $\boldsymbol{F}_{T}^{0} \leftarrow$ TAS, $\boldsymbol{F}_{A}^{0} \leftarrow$ AAS, $t \leftarrow 1$.
While $t \leq T_{max}$:
1. Update $\boldsymbol{F}_{T}^{t}$ and $\boldsymbol{F}_{A}^{t}$ and normalize them by Eq. (16);
2. Compute $\boldsymbol{S}^{t}=\boldsymbol{F}_{T}^{t} \cdot \boldsymbol{F}_{A}^{t}$;
3. If $t>3$ and $check$ in Eq. (17) is below $T_{C}$, break; otherwise $t \leftarrow t+1$.
End while
Output: The final saliency scores $\boldsymbol{S}=\boldsymbol{S}^{t}$.
It is also important to decide when to stop the iteration. If there are not enough iterations, the propagation cannot achieve an ideal result; if it iterates too many times, the computational load increases unnecessarily, and excessive iterations may even make the saliency result worse. Qin et al. [21] set the maximum iteration to a fixed value, which is simple but not always suitable for all images. The complexity of images and the performances of TAS and AAS all affect the propagation, so an adaptive termination condition is necessary. In this work, the termination of the iteration is decided by checking the average variance among the current state and its previous 3 iterations:\begin{equation*} check=\text {var}(\boldsymbol {S}^{t-3},\boldsymbol {S}^{t-2},\boldsymbol {S}^{t-1},\boldsymbol {S}^{t}).\tag{17}\end{equation*}
Considering the propagation mechanism, the propagation develops a steady local environment of results and comes to convergence. Thus, when $check$ becomes small enough, the iteration can be terminated.
In summary, the stopping criterion of saliency propagation is defined by the following rules:
When $check$ has a value below the threshold $T_{C}=10^{-5}$, the iteration will stop.
When the number of iterations reaches $T_{max}$, the iteration will stop, regardless of whether $check$ has reached $T_{C}$.
After the iteration stops, the final saliency of the proposed method is $\boldsymbol{S}^{t}$ in Eq. (16).
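Putting Eqs. (16)-(17) and the two stopping rules together, a minimal sketch of the propagation loop follows; a constant diagonal coherence matrix C = c·I is assumed here for brevity, whereas the paper's C may be data-dependent.

```python
import numpy as np

def mutual_guidance_propagation(F_T, F_A, M, c=0.3, T_C=1e-5, T_max=30):
    """Iterate the mutual guidance update rules until convergence (Eqs. 16-17)."""
    n = len(F_T)
    C = c * np.eye(n)                           # coherence matrix (assumed constant)
    IC = np.eye(n) - C
    states = []
    for t in range(T_max):                      # second stopping rule: T_max cap
        F_T_new = F_T + IC @ M @ F_T + C @ F_A  # intra-scale + inter-scale terms
        F_A_new = F_A + IC @ M @ F_A + C @ F_T
        F_T = F_T_new / np.linalg.norm(F_T_new) # per-step normalization
        F_A = F_A_new / np.linalg.norm(F_A_new)
        S = F_T * F_A                           # element-wise fusion
        states.append(S)
        if len(states) >= 4:
            check = np.var(np.stack(states[-4:]), axis=0).mean()  # Eq. (17)
            if check < T_C:                     # first stopping rule
                break
    return S
```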
4) Convergence Analysis
Since the saliency score is propagated with the similarity estimation, salient parts with a similar appearance in the image naturally merge and enhance each other owing to the connectivity and compactness of the object. Moreover, the boundary between an object and the background becomes more explicit according to the contrast between different components. Thus, saliency maps no longer change once the system achieves stability, and the propagation gradually converges.
To intuitively demonstrate the convergence, experiments are also conducted. As shown in Fig. 8(a), the variances of all images in the dataset DIP are recorded from the 5th iteration to the 17th iteration, and the variances are averaged at each iteration. The trend is declining and gradually flattens, which indicates that the propagation converges.
(a) Average variance trend in the propagation; (b) performance evaluation for different numbers of iterations.
To further illustrate the convergence of the propagation, we fix the iteration number at several values and evaluate the corresponding saliency performance, as shown in Fig. 8(b).
Furthermore, we use mathematical inference to prove the convergence of the proposed propagation method. Rewriting the update rules in Eq. (16) into matrix form, we have \begin{align*} \left [{{\begin{array}{c} \boldsymbol {F}_{T}^{t}\\ \boldsymbol {F}_{A}^{t}\\ \end{array}}}\right]=\left [{{\begin{array}{cc} \boldsymbol {I}+(\boldsymbol {I-C})\cdot \boldsymbol {M} & \boldsymbol {C}\\ \boldsymbol {C} & \boldsymbol {I}+(\boldsymbol {I-C})\cdot \boldsymbol {M}\\ \end{array}}}\right]\left [{{\begin{array}{c} \boldsymbol {F}_{T}^{t-1}\\ \boldsymbol {F}_{A}^{t-1}\\ \end{array}}}\right].\tag{18}\end{align*}
The update rules can then be written as a linear recursive sequence:\begin{equation*} \boldsymbol {u}_{t}=\boldsymbol {Au}_{t-1} \quad (t=1,2,\ldots),\tag{19}\end{equation*} where $\boldsymbol{u}_{t}=[\boldsymbol{F}_{T}^{t};\boldsymbol{F}_{A}^{t}]$ and $\boldsymbol{A}$ is the transition matrix in Eq. (18). Unrolling the recursion yields
\begin{equation*} \boldsymbol {u}_{t}=\boldsymbol {Au}_{t-1}=\boldsymbol {A}^{2}\boldsymbol {u}_{t-2}=\cdot \cdot \cdot =\boldsymbol {A}^{t}\boldsymbol {u}_{0}.\tag{20}\end{equation*}
As the coherence matrix $\boldsymbol{C}$ and the similarity matrix $\boldsymbol{M}$ are fixed during propagation, $\boldsymbol{A}$ is a constant matrix. Let $\lambda _{1},\lambda _{2},\ldots,\lambda _{n}$ be its eigenvalues, ordered by modulus. Two cases need to be considered: either the dominant eigenvalue is unique,\begin{equation*} |\lambda _{1}|>|\lambda _{2}|\geq |\lambda _{3}|\geq \ldots \geq |\lambda _{n}|,\tag{21}\end{equation*} or the largest modulus is shared by $r$ eigenvalues,\begin{align*} |\lambda _{1}|=|\lambda _{2}|=\ldots =|\lambda _{r}|>|\lambda _{r+1}|\geq |\lambda _{r+2}|\geq \ldots \geq |\lambda _{n}|.\tag{22}\end{align*}
As the corresponding eigenvectors $\boldsymbol{x}_{1},\boldsymbol{x}_{2},\ldots,\boldsymbol{x}_{n}$ span the space, the initial state $\boldsymbol{u}_{0}$ can be expanded as:\begin{equation*} \boldsymbol {u}_{0}=\alpha _{1}\boldsymbol {x}_{1}+\alpha _{2}\boldsymbol {x}_{2}+\ldots +\alpha _{n}\boldsymbol {x}_{n}.\tag{23}\end{equation*}
To avoid the modulus of $\boldsymbol{u}_{t}$ growing or shrinking without bound during the iteration, $\boldsymbol{u}_{t}$ is normalized at every step:\begin{equation*} \begin{cases} \boldsymbol {y}_{t}=\dfrac {\boldsymbol {u}_{t}}{\|\boldsymbol {u}_{t}\|}\\ \boldsymbol {u}_{t}=\boldsymbol {y}_{t}\\ \end{cases}(t=1,2,\ldots),\tag{24}\end{equation*}
Combining Eqs. (20), (23), and (24), we have \begin{align*} \boldsymbol {y}_{t}=&\frac {\boldsymbol {A}^{t}\boldsymbol {u}_{0}}{\|\boldsymbol {A}^{t}\boldsymbol {u}_{0}\|}=\frac {\alpha _{1}\boldsymbol {A}^{t}\boldsymbol {x}_{1}+\alpha _{2}\boldsymbol {A}^{t}\boldsymbol {x}_{2}+\cdots +\alpha _{n}\boldsymbol {A}^{t}\boldsymbol {x}_{n}}{\|\alpha _{1}\boldsymbol {A}^{t}\boldsymbol {x}_{1}+\alpha _{2}\boldsymbol {A}^{t}\boldsymbol {x}_{2}+\cdots +\alpha _{n}\boldsymbol {A}^{t}\boldsymbol {x}_{n}\|} \\=&\frac {\alpha _{1}\lambda _{1}^{t}\boldsymbol {x}_{1}+\alpha _{2}\lambda _{2}^{t}\boldsymbol {x}_{2}+\cdots +\alpha _{n}\lambda _{n}^{t}\boldsymbol {x}_{n}}{\|\alpha _{1}\lambda _{1}^{t}\boldsymbol {x}_{1}+\alpha _{2}\lambda _{2}^{t}\boldsymbol {x}_{2}+\cdots +\alpha _{n}\lambda _{n}^{t}\boldsymbol {x}_{n}\|} \\=&\left ({\frac {\lambda _{1}}{|\lambda _{1}|}}\right)^{t}\frac {\alpha _{1}\boldsymbol {x}_{1}+\alpha _{2}\left ({\frac {\lambda _{2}}{\lambda _{1}}}\right)^{t}\boldsymbol {x}_{2}+\cdots +\alpha _{n}\left ({\frac {\lambda _{n}}{\lambda _{1}}}\right)^{t}\boldsymbol {x}_{n}}{\left \|{\alpha _{1}\boldsymbol {x}_{1}+\alpha _{2}\left ({\frac {\lambda _{2}}{\lambda _{1}}}\right)^{t}\boldsymbol {x}_{2}+\cdots +\alpha _{n}\left ({\frac {\lambda _{n}}{\lambda _{1}}}\right)^{t}\boldsymbol {x}_{n}}\right \|}.\tag{25}\end{align*}
Considering Eq. (21), when $t\rightarrow \infty$, the terms $({\lambda _{j}}/{\lambda _{1}})^{t}$ vanish for $j\geq 2$, so \begin{align*} {\boldsymbol {y}}_{t} =\left [{ {\begin{array}{l} {\boldsymbol {F}}_{T}^{t} \\ {\boldsymbol {F}}_{A}^{t} \\ \end{array}} }\right]\rightarrow \begin{cases} \dfrac {\alpha _{1}\boldsymbol {x}_{1}}{\|\alpha _{1}\boldsymbol {x}_{1}\|} & \lambda _{1}>0\\[0.7pc] \pm \dfrac {\alpha _{1}\boldsymbol {x}_{1}}{\|\alpha _{1}\boldsymbol {x}_{1}\|} & \lambda _{1}< 0.\\ \end{cases}\tag{26}\end{align*}
For the case of Eq. (22), letting $\lambda _{1}=\lambda _{2}=\ldots =\lambda _{r}$ denote the $r$ dominant eigenvalues with eigenvectors $\boldsymbol {x}_{1},\ldots,\boldsymbol {x}_{r}$, and with $|\lambda _{j}/\lambda _{1}|<1$ for $j>r$, the remaining terms in Eq. (25) vanish as $t$ increases. Considering Eq. (22), when $t\rightarrow \infty$,\begin{align*} {\boldsymbol {y}}_{t} =\left [{ {\begin{array}{l} {\boldsymbol {F}}_{T}^{t} \\ {\boldsymbol {F}}_{A}^{t} \\ \end{array}} }\right]\rightarrow \begin{cases} \dfrac {\alpha _{1}\boldsymbol {x}_{1}+\alpha _{2}\boldsymbol {x}_{2}+\ldots +\alpha _{r}\boldsymbol {x}_{r}}{\|\alpha _{1}\boldsymbol {x}_{1}+\alpha _{2}\boldsymbol {x}_{2}+\ldots +\alpha _{r}\boldsymbol {x}_{r}\|} & \lambda _{1}>0\\[0.7pc] \pm \dfrac {\alpha _{1}\boldsymbol {x}_{1}+\alpha _{2}\boldsymbol {x}_{2}+\ldots +\alpha _{r}\boldsymbol {x}_{r}}{\|\alpha _{1}\boldsymbol {x}_{1}+\alpha _{2}\boldsymbol {x}_{2}+\ldots +\alpha _{r}\boldsymbol {x}_{r}\|} & \lambda _{1}< 0.\\ \end{cases}\tag{27}\end{align*}
As a linear combination of eigenvectors corresponding to the same eigenvalue is still an eigenvector of that eigenvalue, the limit in Eq. (27) is again an eigenvector of $\lambda _{1}$, and $\boldsymbol {y}_{t}$ converges just as in the first case.
Consequently, the saliency score always converges to a certain limit, which proves the convergence of the proposed propagation method.
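As a quick numerical check of this argument, the normalized power iteration of Eqs. (18)-(24) can be reproduced on a random symmetric similarity matrix; all values below are arbitrary stand-ins, not the paper's data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.random((n, n)); M = (M + M.T) / 2     # symmetric similarity stand-in
C = 0.3 * np.eye(n)                           # constant coherence matrix
B = np.eye(n) + (np.eye(n) - C) @ M
A = np.block([[B, C], [C, B]])                # transition matrix of Eq. (18)

y = rng.random(2 * n)
y /= np.linalg.norm(y)
for t in range(100):
    y_new = A @ y
    y_new /= np.linalg.norm(y_new)            # Eq. (24): normalized iteration
    diff = np.linalg.norm(y_new - y)
    y = y_new
print(diff)  # shrinks toward 0: y_t settles on the dominant eigenvector (Eq. 26)
```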
Experiments
A. Dataset and Analysis
To evaluate the effectiveness of the proposed method, experiments are carried out on three datasets.
1) OSU
Sequences irw01 and irw06, from the public Terravic Motion IR Database in the OTCBVS Benchmark Dataset Collection [36], are used. There are 400 images in total, and each image contains two pedestrians. This dataset mainly focuses on the changing postures of pedestrians and is relatively simple to handle because of its high contrast and flat backgrounds.
2) IMS
This dataset, which is provided by our collaborator, consists of 200 images. There are 39 images containing one pedestrian, and the other images contain two pedestrians. In this dataset, pedestrians either walk towards or away from the camera, so the sizes of pedestrians change greatly. With this dataset, the robustness of the proposed algorithm to pedestrians of different sizes can be verified. Moreover, the images in dataset IMS have a lower contrast than those in dataset OSU.
3) DIP
As the datasets above are relatively simple and cover only a small number of scenes, we construct a more comprehensive dataset to verify the effectiveness of our method. It contains 400 infrared images with human-segmented ground truth (GT), captured with a Tau 2 LWIR camera. The complexity of the dataset DIP can be illustrated in the following aspects:
Complex Objects: multiple pedestrians with diverse postures and sizes. There are 634 pedestrians in total; 220 images contain a single pedestrian and 180 images contain multiple pedestrians. These pedestrians differ greatly from each other in clothing, somatotype, posture, and size.
Complex Backgrounds: diverse background compositions in multiple scenes. There are 31 scenes in dataset DIP that differ greatly from each other, with background compositions including roads, sky, buildings, street lamps, trees, brushwood, and other objects.
Based on its complexity and comprehensiveness, dataset DIP is closer to actual scenes and can be better used to examine the robustness of saliency models for infrared pedestrian images.
B. Evaluation Metrics
In order to evaluate the saliency models, the widely used PR curves [15], F-measure [8], and mean absolute error (MAE) [8] are employed to measure the correctly/wrongly assigned pixels between each image and its corresponding GT over the whole dataset. A good saliency map should achieve a higher PR curve and a larger F-measure while maintaining a low MAE value.
Firstly, to measure the similarity between saliency maps and the GT, precision and recall are defined as:\begin{align*} Precision(h)=&\frac {|\boldsymbol {BM}(h)\cap \boldsymbol {GT}|}{|\boldsymbol {BM}(h)|},\tag{28}\\ Recall(h)=&\frac {|\boldsymbol {BM}(h)\cap \boldsymbol {GT}|}{|\boldsymbol {GT}|},\tag{29}\end{align*} where $\boldsymbol{BM}(h)$ is the binary mask obtained by thresholding the saliency map at level $h$.
Secondly, as precision and recall measure the saliency performance from different points of view, the F-measure is used to combine them:\begin{equation*} F\text{-}measure=\frac {(1+\beta ^{2})\times Precision \times Recall}{\beta ^{2}\times Precision + Recall}.\tag{30}\end{equation*}
Lastly, MAE is used as a complement to PR curves and F-measure to measure the pixel-wise error between the saliency map and the GT:\begin{equation*} MAE=\frac {1}{|\boldsymbol {S}|}\sum \limits _{p \in \boldsymbol {S}}\left |{S(p)-GT(p)}\right |,\tag{31}\end{equation*} where $|\boldsymbol{S}|$ is the number of pixels in the saliency map.
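The three metrics in Eqs. (28)-(31) can be computed as sketched below; the 256-level threshold sweep and beta2 = 0.3 (a value commonly used in the saliency literature to emphasize precision) are assumptions, since the text does not state them here.

```python
import numpy as np

def evaluate(S, GT, beta2=0.3):
    """PR curve, best F-measure, and MAE for one saliency map (Eqs. 28-31).
    S: saliency map in [0, 1]; GT: binary ground-truth mask."""
    mae = np.abs(S - GT).mean()                          # Eq. (31)
    gt = GT > 0.5
    precisions, recalls = [], []
    for h in np.linspace(0.0, 1.0, 256):                 # threshold sweep
        BM = S >= h
        tp = np.logical_and(BM, gt).sum()
        precisions.append(tp / max(BM.sum(), 1))         # Eq. (28)
        recalls.append(tp / max(gt.sum(), 1))            # Eq. (29)
    p, r = np.array(precisions), np.array(recalls)
    f = (1 + beta2) * p * r / np.maximum(beta2 * p + r, 1e-12)  # Eq. (30)
    return p, r, f.max(), mae
```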
C. Parameter Analysis
To choose appropriate parameters for our model, we use a subset (20%) of dataset DIP as the validation set to tune the parameters $N$, $T_{M}$, $\sigma ^{2}$, and $T_{C}$.
1) Parameter $N$ of SLIC
$N$ controls the number of superpixels. If $N$ is too small, the superpixels cannot adhere well to object boundaries; if $N$ is too large, the computational load of the subsequent steps increases.
Saliency comparisons for parameter analysis with PR curves.
2) Parameter $T_{M}$ of TAS
3) Parameters $\sigma ^{2}$ and $T_{C}$ of the Mutual Guidance Based Saliency Propagation
D. Evaluations of Model Components
In this section, a series of experiments are presented to investigate the influence of various factors on the proposed saliency model.
1) Intensity Filter
To eliminate the atmospheric component from the radiation of pedestrians, the intensity filter is introduced, which subtracts the average intensity of the infrared image from each superpixel. To explore the best functional form of the intensity filter, we define it as:\begin{equation*} IF(i)=\Psi \left ({\frac {\sum \limits _{p\in \boldsymbol {sp}_{i}} I_{m}(p)}{|\boldsymbol {sp}_{i}|}-I_{\mu }}\right),\tag{32}\end{equation*} where $\Psi (\cdot)$ denotes the candidate functional form; the squared absolute form in Eq. (7) is adopted.
Performance evaluations of model components with MAE, PR curves and F-measure. Top: the effects of different intensity filters. Middle: the effectiveness of appearance and each part of it. Bottom: the mutual guidance based saliency propagation.
2) Appearance Weight
To measure the contribution of each part of the appearance weight, the experiment compares their corresponding saliency performances. In the middle row of Fig. 10, the results obtained with the vertical edge weight, the intensity weight, and the combined appearance weight are compared.
3) Effectiveness of Propagation
This experiment is designed to verify the effectiveness of the saliency propagation model and the contribution of the inter-scale neighborhood to the saliency result.
Firstly, with the proposed propagation, the saliency results are compared with the original results without propagation, as shown in Fig. 7.
To show the contribution of the inter-scale neighborhood through quantitative analysis, the performance of saliency propagation with only the intra-scale neighborhood is compared with that of the full propagation using both neighborhoods.
Moreover, as can be seen from Fig. 10, the final saliency is better than the results of both TAS and AAS after propagation.
E. Comparison With State-of-the-Art Saliency Models
Following previous saliency models for infrared images [24]–[28], the proposed saliency detection method is first compared with 10 state-of-the-art saliency models: FT [8], CA [11], GS [38], BD [39], BSCA [21], MAP [40], MB+ [41], RS [17] and HCA [22]. The experiments are carried out on three datasets, OSU, IMS, and DIP.
1) Subjective Comparisons
Some saliency maps of the proposed method and the state-of-the-art methods are shown in Fig. 11, which directly presents the visual comparisons. For dataset OSU, we can see that most saliency models handle the second image effectively, because it has a relatively simple background and a high contrast between pedestrians and background. However, all the methods except FT and ours fail on the first image, because most saliency detectors assume by default that salient objects lie close to the center of an image, which makes it difficult to detect pedestrians near the border. Moreover, the proposed method suppresses the background better than FT.
Visual comparison on the three datasets OSU, IMS and DIP among the proposed method and 10 state-of-the-art saliency detection methods.
For dataset IMS, most of the state-of-the-art methods perform badly, and pedestrians cannot even be recognized. That is because the ground and trees along the road have high intensities similar to pedestrians. HCA can accurately highlight the pedestrians, but parts of the background are also wrongly highlighted. CA can highlight the contours of pedestrians and partly suppress the background, but its blurred contours and wrongly highlighted background limit its performance. The proposed method highlights the whole region of pedestrians and suppresses the background at the same time.
For dataset DIP, saliency detection is more difficult than on the other two datasets. Almost all these methods can separate pedestrians from the background in the third image of Fig. 11, where the pedestrians have relatively larger sizes. But they fail in the other images, where pedestrians are small, because the state-of-the-art saliency models are designed and tested on datasets with larger salient objects. BD and HCA can separate pedestrians from the background in most images, but the noise in the background cannot be suppressed efficiently. Note that the proposed method can highlight pedestrians regardless of their size and has better saliency value distributions.
Therefore, the proposed method achieves good saliency detection performance for infrared pedestrian images superior to the other state-of-the-art methods.
2) Objective Comparisons
We further objectively compare different saliency models using PR curves, F-measure, and MAE. For dataset OSU, it is obvious in Fig. 12 that the proposed method achieves performances similar to the BD method on PR curves and F-measure. However, it is noteworthy that our saliency model has the lowest MAE value, 0.01, smaller than all the other methods, which demonstrates the effectiveness of our method in background suppression.
Objective comparison among the proposed method and 10 state-of-the-art methods with MAE, PR curves, and F-measure. From top to bottom: datasets OSU, IMS, and DIP.
For dataset IMS, the proposed method is superior to the other state-of-the-art methods in all the evaluation metrics. It achieves the highest precision, up to 0.95, over almost the entire recall range [0, 1], while the precisions of all the other methods are lower than 0.2. The proposed method also performs best in MAE and F-measure. This is in accord with the visual observation that, for the compared methods, the dark sky region tends to be taken as a salient region while the pedestrian regions are totally suppressed.
For dataset DIP, the proposed method achieves the best performance among all methods, attaining the highest precision, up to 0.91, over almost the entire recall range [0, 1], while all the other methods stay below 0.7. HCA only obtains a high precision value when its recall is small, because the background noise cannot be well suppressed by HCA. The F-measure value of the proposed method is also higher than the others, and it achieves the lowest MAE value, close to 0.03. These results demonstrate the ability of the proposed method to highlight pedestrians and suppress backgrounds.
All the experiments illustrate the superiority of the proposed method and also its robustness to the complexity of images.
F. Comparison With Saliency Models of Infrared Pedestrian Images
Actually, the above state-of-the-art saliency models are designed for visible images, and saliency detection in infrared images has not been extensively studied. To comprehensively demonstrate the superiority of the proposed saliency model, five saliency models designed for infrared pedestrian images are used for comparison: LSM [25], AS [26], CD [27], MCS [28], and BO [1], which have been introduced in Section I. The experiments are also conducted on all three datasets, OSU, IMS, and DIP.
1) Subjective Comparisons
Fig. 13 shows saliency maps of the proposed method and the other infrared saliency models mentioned above. It is obvious that the proposed method achieves the best performance. We can find that LSM, CD, and MCS attempt to obtain the edges of pedestrians, while AS, BO, and the proposed method try to highlight whole pedestrians.
Visual comparison on the three datasets OSU, IMS and DIP between the proposed method and 5 saliency models of infrared pedestrian images.
For dataset OSU, the background of the images consists of bushes, whose abundant textures lead to the failure of LSM in separating pedestrians from the background. MCS has a better ability to suppress the background noise, whereas CD performs better than LSM and MCS in highlighting pedestrians. Different from the above methods, AS, BO, and the proposed method can highlight pedestrians as complete regions, and the proposed method suppresses the background more effectively.
For dataset IMS, the background is more homogeneous than in OSU, so edges in the background are better suppressed in the saliency maps of LSM, CD, and MCS. But the wrongly highlighted edges in the background and the incomplete regions inside pedestrians still indicate the poor performance of these methods. AS can separate pedestrians from the background and highlight each pedestrian as a whole region; however, high intensity corners are wrongly assigned high saliency values by AS. BO performs poorly on the second image, because BO first needs an object detection method to locate pedestrians and then calculates the saliency of pedestrians within the detected regions marked by rectangular boxes, so its saliency performance heavily depends on the accuracy of the detection method. The proposed method needs no pre-detection but accurately locates pedestrians via saliency.
For dataset DIP, the complex background of this dataset still results in poor performance of LSM and MCS. CD and BO perform much better than LSM and MCS; CD even suppresses the background much better than AS, while it cannot highlight the inner parts of pedestrians. BO performs well on some images, but wrong saliency distributions and missed detections still exist due to the inaccuracy of pre-detection. Compared with the five methods above, the proposed method performs much better in suppressing the background and highlighting pedestrians.
2) Objective Comparisons
PR curves, F-measure, and MAE are also used to compare the objective performance of the proposed method and the other models designed for infrared pedestrian images. As shown in Fig. 14, the proposed method is superior to the other methods on all datasets and evaluation metrics. For dataset OSU, AS and BO achieve performance comparable to the proposed method, which is consistent with their subjective performances and also verifies the effectiveness of the evaluation metrics. The proposed method achieves the highest precision, close to 0.9, while the precisions of the methods other than AS and BO are lower than 0.4. Meanwhile, the proposed method obtains the lowest MAE, close to 0.01, and the highest F-measure, up to 0.7.
For dataset IMS, we can see from Fig. 14 that the proposed method achieves the lowest MAE value, 0.01. The F-measure value of the proposed method is close to 0.7, while the F-measures of all the other models are lower than 0.2. It is noteworthy that the highest precision in the PR curve of the proposed method is up to 0.95, while the PR curves of the others stay below 0.3.
Objective comparison among the proposed method and 5 saliency models of infrared images with MAE, PR curves, and F-measure. From top to bottom: datasets OSU, IMS, and DIP.
For dataset DIP, the proposed method achieves the highest precision, 0.9, the lowest MAE, 0.027, and the highest F-measure, 0.72. We can find that AS and CD perform much better than LSM and MCS in PR curves and F-measure, while they perform worse in MAE. This is because AS and CD highlight not only pedestrians but also the background. The superior performance of the proposed method in both PR curves and MAE therefore verifies its ability to highlight pedestrians and suppress the background.
In summary, the proposed method performs much better than previous saliency models designed for infrared pedestrian images.
G. Run Time Comparisons
The run time experiment is performed with MATLAB 2015b on an Intel i5-3450 (3.10 GHz) CPU with 8 GB RAM. The proposed method is compared with both the state-of-the-art saliency models and the saliency models for infrared pedestrian images in Table 1. Our method is slower than most of the others because of the calculation of the vertical edge weight, which consists of PB algorithm-based boundary map extraction and the edge weight calculation. The PB algorithm takes more than half of the total time, and the calculation of the edge weight needs to traverse all the superpixels, which is also time-consuming. However, as the above performance comparisons show, better results are obtained at the cost of more computational time.
Conclusion
In this paper, by analyzing the thermal and appearance characteristics of infrared pedestrian images, a novel saliency detection method for infrared pedestrian images is proposed. Two features, the thermal analysis based saliency and the appearance analysis-weighted saliency, are first proposed. Then, a mutual guidance based saliency propagation method is introduced to mutually facilitate the two features and improve the final saliency. We have also built two datasets, DIP and IMS, with 600 infrared pedestrian images, and have made them publicly available. All the experiments on three infrared pedestrian datasets demonstrate the effectiveness of the proposed method.