Introduction
Pedestrian detection is a fundamental task in computer vision applications such as surveillance, advanced driver assistance systems (ADASs), robotics, entertainment and human-computer interfaces. Although it has been studied for many decades, accurate pedestrian detection remains an open problem: detectors must cope with different pedestrian postures, occlusion of pedestrians by objects or by other pedestrians, non-rigid motion and variation in pedestrians' appearance caused by illumination changes. Among the various problems related to pedestrian detection in surveillance videos, occlusion and frequent pedestrian interactions in crowded scenes are the most challenging [1], and we focus on these problems in this paper.
In conventional pedestrian detection, the input image is densely up- and down-sampled according to predefined ratios so that pedestrians of varying sizes can be considered. Hand-crafted features are then extracted from candidate pedestrian regions in each scaled image using a scanning-window method, and trained pedestrian detectors based on a support vector machine (SVM) or AdaBoost classifier decide whether each candidate region belongs to the pedestrian or the background class. Non-maximum suppression (NMS) is a post-processing algorithm that merges all the detections belonging to the same object. Although conventional approaches require less computing power and memory than deep learning-based approaches, the feature extraction algorithms and the classifiers must be designed by hand and cannot be jointly optimized to improve performance [2].
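For illustration, a minimal greedy NMS routine of the kind used in this post-processing step is sketched below in Python; the IoU threshold of 0.5 and the function name are our own choices, not values taken from the paper.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) detection scores.
    Returns the indices of the boxes that survive suppression.
    """
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # Intersection of the chosen box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Keep only boxes that overlap the chosen box less than the threshold
        order = order[1:][iou < iou_thresh]
    return keep
```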
In contrast, deep learning-based pedestrian detection has recently exhibited state-of-the-art performance. This approach performs end-to-end learning and significantly reduces the dependence of detection on hand-crafted features and other preprocessing techniques. In particular, the convolutional neural network (CNN) has shown impressive accuracy compared with conventional approaches because of its capability to learn discriminative features from raw pixels [3]. In CNN-based pedestrian detection, however, fixed-size kernels are convolved over the entire image at every layer, which incurs a heavy computational cost.
Surveillance videos tend to include a variety of perspective views, because the cameras are usually installed at an elevated location. Consequently, CNN-based detectors are not well suited to detecting pedestrians in low-resolution video when the surveillance camera is mounted high and the video therefore contains many small pedestrians. Another problem is that a CNN-based detector requires large datasets for training and testing, yet it is difficult to collect enough surveillance-camera training data under sufficiently varied conditions. Moreover, to process multiple video channels simultaneously, a CNN-based detector requires far more powerful computing hardware than conventional detectors.
Therefore, in this study, we developed a new fast pedestrian detection algorithm for surveillance cameras that can run on a low-power computing device by applying the teacher-student framework to the conventional random forest (RF).
The remainder of this paper is organized as follows. In Section II, we describe pedestrian detection in videos captured by an elevated surveillance camera and the major contributions of this paper. In Section III, we introduce pedestrian detection using shallow RF (S-RF) based on a teacher-student learning framework. In Section IV, we present experiments demonstrating the accuracy and applicability of our proposed pedestrian detection method. Finally, our conclusions and scope for future work are presented in Section V.
Related Work
Because this paper presents a study on pedestrian detection in videos captured by an elevated surveillance camera, we introduce related research on various approaches for detecting pedestrians in surveillance camera videos.
The histogram of oriented gradients (HOG) [5] is the most widely used feature descriptor for pedestrian detection. Although a dense overlapping HOG grid yields good detection results with a lower false positive rate than traditional Haar-like descriptors, it still produces false positives when a pedestrian is similar in color and/or pattern to the background, misses pedestrians positioned in a crowd and has a heavy computational demand [6].
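As a brief illustration, a HOG descriptor can be extracted with scikit-image as follows; the cell and block sizes are common defaults and the image path is a placeholder, not the configuration used in [5] or in this paper.

```python
from skimage.feature import hog
from skimage import io, color

# Load a candidate window and convert it to grayscale (the path is illustrative).
window = color.rgb2gray(io.imread("candidate_window.png"))

# 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks: a common HOG configuration.
descriptor = hog(window,
                 orientations=9,
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),
                 block_norm="L2-Hys")
# `descriptor` is a 1-D feature vector that an SVM or AdaBoost classifier can consume.
```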
To address the missed detections and false positives of global feature descriptors such as HOG and local binary patterns (LBP) [7], the deformable part model (DPM) [8] was proposed for pedestrian detection based on mixtures of multiscale deformable parts and a latent SVM. The DPM is characterized by a coarse root filter that approximately covers an entire object and higher-resolution part filters that cover smaller parts of the object. However, the DPM still cannot easily detect partially occluded pedestrians in surveillance videos, because it includes the scores of the occluded parts in the final decision score. To solve this problem, Dehghan et al. [9] inferred occlusion information from the part scores and used only the parts with high confidence, finding the most reliable set of parts that maximizes the probability of detection.
The performance of conventional approaches is in general limited by the representation power of low-level hand-crafted features [1]. Therefore, CNN-based pedestrian detectors for surveillance systems have been attracting attention. Ouyang and Wang [10] proposed a deep model that jointly learns four components for pedestrian detection in surveillance video: feature extraction, deformation handling, occlusion handling and classification. In this unified deep model, the components interact with each other during learning, and each component can maximize its strength when cooperating with the others. Chen et al. [1] converted pedestrian detection into head-shoulder detection to handle severely occluded pedestrians in surveillance videos, proposing a three-stage CNN cascade that captures the most discriminative information of the head-shoulder parts. Zhao et al. [3] used the Edge Boxes algorithm [11] to obtain low-redundancy, high-quality candidate windows with the Fast R-CNN [12] architecture, which can extract thousands of region proposals and classify pedestrians at those locations using a CNN. To reduce the run time of region proposal generation, Faster R-CNN [12] introduced a region proposal network (RPN) that shares full-image convolutional features with the detection network. However, Faster R-CNN [12], like other CNN-based approaches, is still not suitable for real-time pedestrian detection in surveillance systems. To reduce the processing time and improve the detection performance, you only look once (YOLO) [13] and YOLO 9000 [14] were proposed; these methods use a single neural network to predict bounding boxes and class probabilities directly from full images in a single evaluation.
In recent years, research on small deep neural network architectures has been actively conducted for object detection on embedded devices. For example, SqueezeNet [15], MobileNets [16], ShuffleNet [17] and TinySSD [18] are designed specifically to minimize model size while retaining object detection performance. Although these tiny CNN architectures perform well, several problems remain, as mentioned in the Introduction. In the case of TinySSD [18], network optimization makes the model 26 times smaller than Tiny YOLO [14] (60.5 MB); however, the model still exceeds 2.3 MB and requires 571.09 million operations. These limitations make it difficult to build real-time applications and constitute an obstacle to processing multiple video channels simultaneously.
In addition to the feature extraction and classification algorithms, variation in the camera's perspective affects the accuracy of pedestrian detection, because the range of image-scaling levels and the multi-scale scanning area vary with the camera's altitude, and these two factors are closely related to the detection accuracy and run-time speed. To handle pedestrian detection in videos from surveillance cameras at different altitudes, Bae et al. [19] proposed scale-of-interest (SOI) and region-of-interest (ROI) estimation to minimize unnecessary computation in practical multiscale pedestrian detection: the SOI determines the image-scaling levels by estimating the perspective of the image, and the ROI restricts the search area of each scaled image. Ko et al. [6] proposed Hough windows maps (HWMs) for determining the image-scaling levels with a divide-and-conquer algorithm to reduce the computational complexity of processing surveillance video sequences; an adaptive ROI for each image scale further improves the detection accuracy and reduces the detection time.
Hattori et al. [20] proposed a spatially varying pedestrian appearance CNN model that takes the perspective geometry of the scene into account, because when a surveillance system is installed at a new location, a scene-specific pedestrian detector must first be trained. To compensate for the insufficient data resulting from frequently changing camera positions, this method uses geometric scene data and a customizable database of virtual simulations of pedestrian motion instead of changing the ROI or the image-scaling level. Cai et al. [21] proposed a multi-scale CNN consisting of receptive fields of different scales and scale-specific detectors to produce a strong, fast multi-scale pedestrian detector. Jiang et al. [22] proposed pedestrian detection based on sharing features across a group of CNNs corresponding to pedestrian models of different sizes; this method detects pedestrians of several scales in a single layer of an image pyramid, sharing features to reduce the computational burden of extracting features from an image pyramid.
A. Contributions of This Work
To design a fast pedestrian detection scheme that is well suited for surveillance systems having a limited memory and processing unit, we introduce an algorithm for compressing deep and wide classification architectures into shallower ones. In this study, the proposed compression algorithm was applied to an RF classifier, which is an ensemble of decision trees, instead of to a CNN, because a CNN still demands a large amount of memory and processing resources, even when the depth of the layers is decreased by the proposed algorithm.
The major contributions of this paper are as follows.
We describe the adoption of HWMs for determining the levels of image scaling and an adaptive ROI algorithm to reduce the amount of image scaling and sliding windows in surveillance camera videos.
We explore a new type of model compression algorithm, realized by transferring the teacher-student framework to an RF model instead of using computationally heavy deep learning.
We propose a model compression method in which a teacher-student compression framework is applied to an RF to allow training of a student shallow RF (S-RF), which is shallower than the teacher RF, using a softened version of the teacher's output.
We prove that an S-RF trained by soft target training is a reasonable method for mimicking the classification ability of a teacher classifier. In addition, it is also an efficient method for detecting small-sized and closely positioned pedestrians in high-perspective surveillance video.
We prove that the proposed S-RF efficiently and considerably decreases the processing time without sacrificing accuracy.
We describe the successful application of the proposed method to a benchmark dataset, and we confirm that its detection accuracy is similar to or higher than that of other CNN-based related methods with a shorter processing time.
Teacher–Student Model Compression
A. Estimation of Image-Scaling Level and Adaptive Region of Interest
The amount of image scaling and the number of search regions impose a significant burden in pedestrian detection, because a multi-scale image pyramid requires frequent image scaling and sliding windows must be applied at each scale for feature matching.
To reduce the amount of image scaling and the number of sliding windows required to process surveillance camera videos, we adopted HWMs for determining the image-scaling levels and an adaptive ROI algorithm [6] that provides a different search ROI for each image scale. The image-scaling levels and the corresponding ROIs change according to the perspective angle of the surveillance camera. As the feature, we use the oriented center-symmetric local binary pattern (OCS-LBP) [23], because it encodes the gradient magnitude and pixel orientation simultaneously. To create a feature model that is robust to pedestrian occlusion, we compute the OCS-LBP from 4 sub-regions of each candidate window and concatenate the resulting histograms.
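The OCS-LBP descriptor itself is defined in [23]; as a rough illustration, the sketch below computes only the plain center-symmetric LBP codes on which it is based, omitting the gradient-magnitude and orientation weighting that distinguishes OCS-LBP.

```python
import numpy as np

def cs_lbp_histogram(patch, threshold=0.01):
    """Center-symmetric LBP histogram of a grayscale patch (values in [0, 1]).

    Each pixel is encoded with 4 bits by comparing the 4 center-symmetric
    neighbor pairs of its 3x3 neighborhood, giving a 16-bin histogram.
    OCS-LBP additionally weights the bins by gradient magnitude and
    orientation, which is omitted in this simplified sketch.
    """
    p = patch.astype(np.float64)
    interior = p[1:-1, 1:-1]                   # used only for the output shape
    # The four center-symmetric neighbor pairs around each interior pixel.
    pairs = [
        (p[:-2, :-2], p[2:, 2:]),              # top-left   vs bottom-right
        (p[:-2, 1:-1], p[2:, 1:-1]),           # top        vs bottom
        (p[:-2, 2:], p[2:, :-2]),              # top-right  vs bottom-left
        (p[1:-1, 2:], p[1:-1, :-2]),           # right      vs left
    ]
    code = np.zeros_like(interior, dtype=np.int32)
    for bit, (a, b) in enumerate(pairs):
        code |= ((a - b) > threshold).astype(np.int32) << bit
    hist, _ = np.histogram(code, bins=np.arange(17))
    return hist / max(hist.sum(), 1)           # normalized 16-bin histogram
```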
As the pedestrian classification algorithm, we introduce an S-RF classifier trained using the proposed teacher-student training framework to separate candidate windows into pedestrian and non-pedestrian classes. The S-RF training procedure is described in detail in Section III-C.
B. Review of Teacher-Student Frameworks
Although the performance of deep neural networks improves as the layers become deeper, deeper networks suffer from growing memory requirements for millions of parameters and the computational complexity of millions of filter multiplications. For these reasons, as mentioned above, a high-performance wide and deep network is not suitable for memory- and time-constrained applications [4], [24]. To reduce the memory required for the numerous parameters and the computational burden at inference time, several model compression frameworks have been proposed, such as parameter pruning and sharing [25], low-rank factorization [26], transferred/compact convolutional filters [27] and the teacher-student framework [4], [28]–[30]. Among these four categories, the teacher-student framework is known to match or exceed the performance of the teacher network while requiring considerably fewer parameters and multiplications [24].
The teacher-student framework first constructs a deep and wide teacher network with high performance based on a large amount of training data, and then constructs a shallower student network of comparable performance based on the teacher network [4], [28]–[30]. As shown in Fig. 1, the student network is trained using the probability values produced by the softmax of the teacher network instead of the class labels of the training data. Using probability values (soft targets) rather than class labels (hard targets) allows the correlation between classes to be taken into account. The student network is trained with a cross-entropy loss that reduces the difference between the outputs of the teacher and the student networks.
Teacher-student learning framework for compressing a teacher network into a student network. The softened outputs of the teacher network are used for training the target student network on other unlabeled training data by comparing the loss function defined in (1).
However, as mentioned in the Introduction, a compressed CNN model still demands a large memory for its considerable number of parameters and substantial processing resources for multiplication. For example, FitNet [4], a representative teacher-student framework, still requires 2.5 million parameters and 382 million multiplications, even though the teacher network is reduced at a compression ratio of 3.6. Therefore, top-performing CNN-based deep networks are not well suited to applications with memory or time limitations, even when model compression algorithms are applied.
In this study, we explored a new type of model compression algorithm obtained by transferring the CNN-based teacher-student framework to the RF model, that is, an ensemble of decision trees. An RF is a decision tree ensemble classifier in which each tree is grown using a certain type of randomization. RFs can process very large amounts of data with high training speed, and they have proved effective in a large variety of high-dimensional problems, offering high computational performance and accuracy compared with conventional SVM or AdaBoost classifiers. The structure of each tree in the RF is binary and is created in a top-down manner [31], [32]. An RF is a structure to which a teacher-student framework can easily be applied, because the size of the forest can be reduced by pruning the number of decision trees.
C. Training of Shallow Random Forest
In this study, we applied the teacher-student compression framework to RF to allow training of a student S-RF that is shallower than the teacher RF, using the softened version of the teacher’s output.
Hinton et al. [28] trained a student network using the real class labels and the softened output of an ensemble of teacher networks. The student network is trained to optimize the loss function \begin{equation*} \mathcal {L}\left ({\mathbf {W}_{S} }\right)=\mathcal {H}\left ({\boldsymbol {y}_{true},\boldsymbol {P}_{S} }\right)+\lambda \mathcal {H}\left ({\boldsymbol {P}_{T}^{\tau },\boldsymbol {P}_{S}^{\tau } }\right)\tag{1}\end{equation*} where $\mathcal {H}$ denotes the cross-entropy, $\boldsymbol {y}_{true}$ the true labels, $\boldsymbol {P}_{S}$ the output of the student network, $\boldsymbol {P}_{T}^{\tau }$ and $\boldsymbol {P}_{S}^{\tau }$ the outputs of the teacher and student networks softened with temperature $\tau $, and $\lambda $ a parameter that weights the two terms.
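For concreteness, the sketch below evaluates the loss of Eq. (1) on toy probability vectors; the temperature-softened softmax applies to the logits of the neural-network formulation [28], and all numbers and the value of λ are illustrative.

```python
import numpy as np

def soften(logits, tau):
    """Temperature-softened softmax, as in the neural-network formulation [28]."""
    z = np.asarray(logits, dtype=np.float64) / tau
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(p, q, eps=1e-12):
    return -np.sum(np.asarray(p) * np.log(np.asarray(q) + eps))

def distillation_loss(y_true, p_student, p_teacher_soft, p_student_soft, lam=0.5):
    """Eq. (1): hard-target term plus lambda-weighted soft-target term."""
    return cross_entropy(y_true, p_student) + lam * cross_entropy(p_teacher_soft, p_student_soft)

# Toy example with assumed numbers (pedestrian vs. background).
y_true = [1.0, 0.0]                                  # one-hot hard target
teacher_logits, student_logits = [2.0, 0.5], [1.5, 0.8]
loss = distillation_loss(y_true,
                         soften(student_logits, tau=1.0),
                         soften(teacher_logits, tau=3.0),
                         soften(student_logits, tau=3.0),
                         lam=0.5)
```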
According to the experimental results presented in [29], soft target data (a class probability vector) capture more information than the original hard target (0/1 label) data by retaining the relationship between the different classes and the input that has been internalized by the teacher. Moreover, unlabeled data help student networks learn to better approximate the outputs of the fully trained teacher networks. Therefore, in this study a new dataset B* of soft targets produced by the teacher RF is used to train the student S-RF.
First, the training pedestrian dataset is divided into a dataset A for learning the teacher and a larger dataset B for learning the student. For training the teacher RF model, the training set A is given as the base set for training the component decision trees, \begin{equation*} \mathrm {A}=\left \{{\left ({\mathbf {x}_{i},y_{i} }\right)\vert i=1,2,\ldots,N }\right \}\end{equation*} where $\mathbf {x}_{i}$ is the feature vector of the $i$-th sample and $y_{i}$ is its hard class label.
In the training procedure of the teacher RF, each tree first chooses a random subset $\mathrm {A}'$ of A by bootstrap sampling, and each internal node selects a random subset of feature variables. Among the candidate splits, the one that maximizes the information gain \begin{equation*} \Delta E=E\left ({\mathrm {A}_{O}^{\prime } }\right)-\frac {\left |{ \mathrm {A}_{l}^{\prime } }\right |}{\left |{ \mathrm {A}_{O}^{\prime } }\right |}E\left ({\mathrm {A}_{l}^{\prime } }\right)-\frac {\left |{ \mathrm {A}_{r}^{\prime } }\right |}{\left |{ \mathrm {A}_{O}^{\prime } }\right |}E(\mathrm {A}_{r}^{\prime })\tag{2}\end{equation*} is selected, where $\mathrm {A}_{O}^{\prime }$ is the set of samples arriving at the node, $\mathrm {A}_{l}^{\prime }$ and $\mathrm {A}_{r}^{\prime }$ are the sample sets sent to the left and right child nodes, and $E(\cdot)$ denotes the entropy.
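A direct transcription of Eq. (2) in Python is shown below; the entropy here is the hard-label Shannon entropy used for the teacher RF (the soft-target version appears later in Eq. (4)).

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a set of hard class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(parent_labels, left_labels, right_labels):
    """Delta E of Eq. (2): parent entropy minus the weighted child entropies."""
    n = len(parent_labels)
    return (entropy(parent_labels)
            - len(left_labels) / n * entropy(left_labels)
            - len(right_labels) / n * entropy(right_labels))
```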
After the decision tree has been trained, each leaf node stores statistical information containing the probability of each class among the training samples that reached that leaf. At test time, the class probability of the forest is obtained by averaging the posteriors of the $T$ trees, \begin{equation*} p\left ({c_{i}\thinspace \vert \thinspace \mathbf {x}}\right) =\frac {1}{T}\sum \nolimits _{t=1}^{T} {p_{t}(c_{i}\vert \mathbf {x})}\tag{3}\end{equation*} where $p_{t}(c_{i}\vert \mathbf {x})$ is the posterior probability of class $c_{i}$ estimated by tree $t$ for the input $\mathbf {x}$.
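Eq. (3) is the standard soft voting of an RF. The sketch below makes the averaging explicit and checks it against scikit-learn's RandomForestClassifier.predict_proba, which performs the same computation; the synthetic toy data stand in for the OCS-LBP features of dataset A.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for dataset A (the real features are OCS-LBP descriptors).
X_a, y_a = make_classification(n_samples=2000, n_features=64, random_state=0)
teacher_rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_a, y_a)

X_query = X_a[:5]
# Eq. (3): average the per-tree posteriors p_t(c_i | x) over the T trees.
per_tree = np.stack([tree.predict_proba(X_query) for tree in teacher_rf.estimators_])
p_forest = per_tree.mean(axis=0)
# scikit-learn's predict_proba performs the same soft voting.
assert np.allclose(p_forest, teacher_rf.predict_proba(X_query))
```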
Training dataset B is then input to the teacher RF, which was trained using the corresponding hard targets of dataset A. Unlike the samples of training dataset A, each input sample of dataset B is assigned a soft label, a vector of class posterior probabilities summing to 1, so that \begin{equation*} \mathrm {B}=\left \{{\left ({\mathbf {x}_{i},\mathbf {p}_{i} }\right)\vert i=1,2,\ldots,N }\right \}\end{equation*}
To obtain the soft target training set \begin{equation*} \mathbf {B}^{\ast }=\left \{{\left ({\mathbf {x}_{i}^{\ast },\mathbf {p}_{i}^{\ast } }\right)\vert i=1,2,\ldots,N^{\ast } }\right \}\end{equation*} each sample of B is passed through the teacher RF, and the averaged class posterior of Eq. (3) is stored as its soft label $\mathbf {p}_{i}^{\ast }$.
The new dataset B* then serves as the training data for the student S-RF.
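Under the assumption that the teacher RF behaves like an ordinary probability-outputting forest, B* can be built by pushing every sample of B through the teacher and storing the averaged posterior of Eq. (3) as its soft label, as in the short continuation of the previous sketch below (the stand-in data are again synthetic).

```python
from sklearn.datasets import make_classification

# `teacher_rf` is the forest fitted in the previous sketch.
X_b, _ = make_classification(n_samples=5000, n_features=64, random_state=1)  # stand-in for dataset B
soft_targets = teacher_rf.predict_proba(X_b)       # p_i*: posterior vectors that sum to 1
B_star = list(zip(X_b, soft_targets))              # B* = {(x_i*, p_i*) | i = 1..N*}
```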
For training a decision tree of the student RF, the entropy used to evaluate the split function is calculated from the output class distributions of the soft targets, \begin{equation*} \mathrm {E}\left ({\mathrm {B}_{O}^{\prime } }\right)=-\frac {1}{N}\sum \nolimits _{i=1}^{N} \sum \nolimits _{j=1}^{C} {p_{ij}^{\ast }\mathrm {log}(p_{ij}^{\ast })}\tag{4}\end{equation*} where $\mathrm {B}_{O}^{\prime }$ is the set of soft-target samples arriving at the node, $N$ is the number of samples in $\mathrm {B}_{O}^{\prime }$, $C$ is the number of classes and $p_{ij}^{\ast }$ is the probability of the $j$-th class in the soft label of the $i$-th sample.
After a decision tree $S_{t}$ of the student has been grown, its similarity to the teacher is evaluated by the cross-entropy between the teacher's soft targets and the class probabilities predicted by the tree, \begin{equation*} {\mathrm {Tr}\left ({\mathrm {T,S} }\right)}_{t}=-\sum \nolimits _{i=1}^{N} \sum \nolimits _{j=1}^{C} {p_{ij}^{\ast }(T)\mathrm {log}(p_{ij}^{\ast }(S_{t}))}\tag{5}\end{equation*} where $p_{ij}^{\ast }(T)$ is the soft target produced by the teacher RF and $p_{ij}^{\ast }(S_{t})$ is the probability predicted by student tree $S_{t}$. A tree is retained in the S-RF only when this cross-entropy is below a threshold.
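Eqs. (4) and (5) can be transcribed directly for arrays of probability vectors of shape N × C; the small epsilon guard below is an implementation detail, not part of the formulation.

```python
import numpy as np

def soft_entropy(p_star, eps=1e-12):
    """Eq. (4): mean entropy of the soft-target distributions reaching a node."""
    p = np.asarray(p_star, dtype=np.float64)         # shape (N, C)
    return -np.mean(np.sum(p * np.log(p + eps), axis=1))

def teacher_student_cross_entropy(p_teacher, p_student, eps=1e-12):
    """Eq. (5): cross-entropy between teacher soft targets and a student tree's predictions."""
    pt = np.asarray(p_teacher, dtype=np.float64)     # p*_ij(T), shape (N, C)
    ps = np.asarray(p_student, dtype=np.float64)     # p*_ij(S_t), shape (N, C)
    return -np.sum(pt * np.log(ps + eps))
```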
Algorithm 1 details the student RF training process using a soft target dataset.
Algorithm 1 Procedure of Training Student Shallow Random Forest
Input: soft target dataset B*
N, C: number of samples and classes of B*
S-RF: set of student S-RF trees
n(S-RF): number of trees in S-RF
While n(S-RF) is smaller than the desired number of trees
Select a subset B' from B* by bootstrap sampling.
Grow an unpruned tree using the subset B'.
Step 1: Tree growing with B'
For each node of the growing tree
Each internal node randomly selects a subset of feature variables v_p.
Loop: Using different split candidates (v_p, t), divide the node samples into \begin{align*} \mathrm {B'}_{l}=&\left \{{x\in \mathrm {B'}\thinspace \vert \thinspace f\left ({v_{p} }\right) < t}\right \},\\ \mathrm {B'}_{r}=&\left \{{x\in \mathrm {B'}\thinspace \vert \thinspace f\left ({v_{p} }\right)\ge t}\right \}.\end{align*}
Compute the entropies E(B'_l) and E(B'_r) with Eq. (4).
Calculate the information gain ΔE with Eq. (2).
If (ΔE is the maximum among the candidate splits) store the split and proceed.
Else go to loop.
Store the tree structure S_t.
End of For
Step 2: Tree evaluation
Compute the class probability p_t(c_i|x) of tree S_t.
Compute the cross-entropy Tr(T, S)_t between the teacher soft targets and S_t with Eq. (5).
If Tr(T, S)_t satisfies the minimization criterion of the cross-entropy (it is below the threshold), add S_t to the S-RF.
Else remove S_t.
End While // End of student random forest growing
Step 3: Output of student S-RF
The S-RF consists of the trees whose cross-entropy with the teacher soft targets is below the threshold.
The cross-entropy threshold is determined experimentally, as described in Section IV.
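As a compact approximation of the Algorithm 1 loop, the sketch below fits each candidate student tree to the soft-target vectors with scikit-learn's DecisionTreeRegressor, so a variance-reduction split criterion stands in for the soft-target entropy of Eq. (4); the cross-entropy threshold, depth and function names are hypothetical, and the code is meant only to show the grow-evaluate-keep/discard structure, not to reproduce the authors' implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def train_student_srf(X, P_star, n_trees=50, max_depth=10, ce_threshold=0.4, seed=0):
    """Grow candidate trees on bootstrap subsets of B* = (X, P_star) and keep only
    those whose cross-entropy against the teacher soft targets is small enough.
    The regressor's variance-reduction split criterion approximates Eq. (4);
    `ce_threshold` is a hypothetical, per-sample-normalized value."""
    rng = np.random.default_rng(seed)
    n = len(X)
    forest, attempts = [], 0
    while len(forest) < n_trees and attempts < 10 * n_trees:
        attempts += 1
        idx = rng.integers(0, n, size=n)                      # bootstrap subset B'
        tree = DecisionTreeRegressor(max_depth=max_depth,
                                     max_features="sqrt").fit(X[idx], P_star[idx])
        pred = np.clip(tree.predict(X), 1e-12, 1.0)           # class distributions of S_t
        ce = -np.sum(P_star * np.log(pred)) / n               # Eq. (5), normalized per sample
        if ce <= ce_threshold:                                # Step 2: keep or discard the tree
            forest.append(tree)
    return forest

def srf_predict_proba(forest, X):
    """Eq. (3) applied to the student forest: average the per-tree distributions."""
    p = np.mean([t.predict(X) for t in forest], axis=0)
    return p / p.sum(axis=1, keepdims=True)                   # renormalize for safety
```

In a faithful implementation, the split criterion of Eq. (4) would replace the regressor's default criterion inside the tree-growing routine.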
Figure 2 shows the overall S-RF training procedure based on the soft-target training data B*.
Shallow random forest (S-RF) training procedure based on the soft target training data B*.
To train the teacher RF, we collected 4,250 images from the Caltech dataset [33] and YouTube, consisting of 5,502 positive and 7,566 negative pedestrian samples, as dataset A. In this study, we set the maximum size (number of trees) of the teacher RF at 300, because the accuracy no longer improves once the number of trees exceeds 300. Dataset B was then generated from 1,700 images, consisting of 2,2001 positive and 3,000 negative samples, using the rest of the dataset. After the teacher RF was constructed using dataset A, dataset B was applied to the teacher RF to produce the soft target training data B*.
Experimental Results
From among the many datasets for evaluating pedestrian detection in video sequences, we chose two, Town Centre [34] and Performance Evaluation of Tracking and Surveillance (PETS) 2006 [35], because these two datasets were originally designed for evaluating pedestrian detection in videos captured by elevated surveillance cameras, which was the focus of this study. The resolution of the Town Centre video is high, whereas that of PETS 2006 is comparatively low.
In addition, we compared the performance with state-of-the-art methods on the Caltech dataset to measure pedestrian detection performance with a low-angle moving camera.
In this study, we fixed the size of the pedestrian model used by the scanning window, so that pedestrians of different sizes are handled by image scaling rather than by resizing the model.
To evaluate the pedestrian detection performance, we measured the precision and recall, which are commonly used for this purpose. A detection was counted as correct if the overlap ratio between the detected bounding box and the ground truth bounding box exceeded 50%.
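Assuming the usual intersection-over-union reading of the 50% overlap rule, the per-frame matching used to count true positives, false positives and misses can be sketched as follows (the greedy matching order and function names are ours).

```python
def evaluate_frame(detections, ground_truths, iou_thresh=0.5):
    """Greedy matching of detections to ground-truth boxes for one frame.
    Boxes are [x1, y1, x2, y2]; returns (true positives, false positives, misses).
    Precision = tp / (tp + fp); recall = tp / (tp + misses)."""
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1]) +
                 (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    unmatched = list(range(len(ground_truths)))
    tp = fp = 0
    for det in detections:                       # ideally sorted by confidence
        best, best_iou = None, iou_thresh
        for g in unmatched:
            o = iou(det, ground_truths[g])
            if o > best_iou:
                best, best_iou = g, o
        if best is None:
            fp += 1
        else:
            tp += 1
            unmatched.remove(best)
    return tp, fp, len(unmatched)                # misses = unmatched ground truths
```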
All the experiments were conducted on an Intel Core i7 processor with 8 GB of RAM running Microsoft Windows 10. All the RF approaches, including the teacher RF and the S-RF, were executed on the CPU, whereas the CNN-based state-of-the-art approaches were executed on a single Titan X GPU.
A. Optimal Number of Decision Trees for the Shallow Random Forest
To determine the optimal number of trees for the S-RF, we compared the detection performance on the Town Centre dataset while changing the number of trees, as shown in Table 1. Starting from the teacher RF consisting of 300 trees, we sequentially decreased the desired number of trees of the student S-RF and measured the resulting accuracy.
B. Detection Comparison on the PETS 2006 Dataset
To verify the effectiveness of the soft target training scheme, we compared the performance of six methods: (1) DPM [8]; (2) Faster R-CNN, which shares full-image convolutional features with the detection network [12]; (3) the scene pose CNN network (SPN), which generates a scene-specific pedestrian detector and pose estimator [36]; (4) YOLO 9000, a real-time CNN-based object detection system covering over 9000 object categories [14]; (5) the teacher RF consisting of 300 trees (teacher RF); and (6) the proposed S-RF consisting of 50 trees (proposed S-RF). Faster R-CNN and YOLO 9000 used pre-trained model parameters without fine-tuning. SPN used a synthetic pedestrian dataset covering a wider range of human poses for training.
In the first experiment, using the PETS 2006 dataset, the two CNN-based methods Faster R-CNN [12] and YOLO9000 [14] (but not SPN [36]) produced a worse detection performance than the other methods, as shown in Fig. 3. The main reason is that they are more likely than conventional approaches to fail to locate each individual in a perspective view, because of the small and blurred boundaries between closely positioned pedestrians.
The SPN approach showed a higher performance than Faster R-CNN, but its precision fluctuated as the recall varied. In contrast, the DPM and RF-based approaches outperformed the CNN-based approaches. These results confirm that a methodology using an ROI with a scanning window is effective for pedestrian detection in surveillance videos. Moreover, the best performance of the proposed S-RF is very similar to that of the teacher RF and better than that of the other state-of-the-art approaches.
C. Detection Comparison on Town Centre Dataset
To achieve a fair comparison of the same six algorithms on an additional dataset, we also performed pedestrian detection on the Town Centre dataset. Figure 4 shows the precision and recall curves for the six methods. When the recall value was 0.5, the highest precision rates of the teacher RF, DPM [8], YOLO9000 [14] and the proposed S-RF were similar, about 0.97, and the patterns were similar to those shown in Fig. 3. Although DPM [8] showed precision results similar to those of the proposed method up to a recall of 0.5, its precision decreased rapidly when the recall reached 0.7. YOLO9000 [14] yielded a higher precision than the other CNN-based approaches (Faster R-CNN [12] and SPN [36]) at recall values around 0.8, and a somewhat higher precision than the proposed method when the recall exceeded 0.8. These experiments on the additional Town Centre dataset verified that the proposed method generalizes well compared with the CNN-based approaches, even though the number of trees was decreased considerably.
In summary, the proposed method showed a higher performance than the other state-of-the-art methods, with a similar pattern on the precision-recall curve. In particular, the proposed method outperformed the CNN-based methods for small-sized or occluded pedestrians located in the upper part of the image (see Fig. 6). These results indicate that an S-RF trained by soft target training is a reasonable way to mimic the classification ability of a teacher classifier and an efficient method for detecting small-sized and closely positioned pedestrians in high-perspective surveillance video.
Five possible pairs of experimental results for determining the cross-entropy threshold.
Pedestrian detection results in a surveillance camera using the proposed method, (a) the pedestrian detection results for PETS 2006, (b) the pedestrian detection results for Town Centre, (c) sample false detection results caused by a complicated or similar background and occlusion. The green rectangles indicate correctly detected pedestrians and the red arrows indicate false or missing detections.
D. Detection Comparison on Caltech Dataset
An additional experiment was performed on the Caltech dataset to test whether the proposed algorithm effectively detects pedestrians in videos containing pedestrians of various sizes, because approximately 70% of the pedestrians in the Caltech dataset are less than 100 pixels in height, including extremely small pedestrians of less than 50 pixels. For this experiment, we divided the dataset into two categories, following the typical protocol in the literature [39], [40]: the far subset consists of non-occluded or partially occluded pedestrians less than 45 pixels in height, and the middle subset contains pedestrians between 45 and 115 pixels in height. As the evaluation metric, we used the log-average miss rate over false positives per image (FPPI), denoted as MR.
For the evaluation, we used the standard test set of 4,024 Caltech images under the two performance protocols. To validate the detection performance of the proposed algorithm, we compared the quantitative results with those of six state-of-the-art methods focusing on multi-scale pedestrian detection: (1) UDN+SS [10]; (2) Faster R-CNN [12]; (3) SA-Fast-RCNN, which uses a multi-scale CNN [21]; (4) F-DNN+SS, which uses a derivation of Faster R-CNN; (5) TLL(MRF)+FGFA, which uses lightweight flow networks [38]; (6) TLL(MRF)+LSTM, which uses temporal feature aggregation [40]; and (7) the proposed S-RF. For training, the six comparison methods used dense sampling of the Caltech training data (every 3rd frame, resulting in 42,782 images), whereas the proposed method used the soft target training data B* described in Section III.
As shown in Table 2, the proposed S-RF method achieved a lower MR than the state-of-the-art methods for small-sized pedestrians; in particular, it improved on the TLL(MRF)+LSTM [40] method by 13.19%. In contrast, the general CNN-based methods [10], [12], [21] showed a good detection performance for middle-sized pedestrians. However, as the pedestrians become smaller, the image regions used for detection also become smaller, which increases the MR.
From these results, we conclude that the proposed student RF with adaptive image scaling is a suitable algorithm for detecting small-sized pedestrians. However, for middle-sized pedestrians, the MR is 4.68% worse than that of TLL(MRF)+LSTM [40]. This is because the camera perspective angle of the Caltech dataset is low, so pedestrians of various sizes overlap in the same ROI. Therefore, for images taken at a low camera perspective angle, our method needs to adjust the ROI and the image-scaling level to reduce the MR.
E. Time Complexity
The purpose of soft target training based on a teacher-student framework is to reduce the computational cost of the classifier. The time complexity of the proposed S-RF was compared with that of four other methods, as shown in Table 3: DPM [8], the teacher RF and the S-RF were tested on an Intel Core i7 CPU, and the two CNN-based methods (YOLO 9000 [14] and SPN [36]) were tested on a Titan X GPU using PETS 2006. As shown in Table 3, the processing time per frame of YOLO9000 [14] is the fastest, 19.9 ms, faster than the similar CNN-based SPN [36]. The second fastest method is the proposed S-RF at 20.8 ms, which is faster than the CPU-based DPM [8] and the teacher RF. Although the processing time of the proposed S-RF is 1.1 ms longer than that of YOLO9000 [14], the proposed S-RF nevertheless decreases the processing time efficiently and considerably without losing accuracy, because the YOLO9000 [14] time is obtained on a GPU and that of the proposed S-RF on a CPU. The overall experimental results confirm that the proposed S-RF is more suitable than CNN-based or other computationally heavy classifiers for low-specification embedded surveillance systems.
F. Determination of Cross-Entropy Threshold
The cross-entropy threshold used for tree selection in Algorithm 1 was determined experimentally by comparing the detection performance obtained with several candidate threshold values, as shown in Fig. 5.
Figure 6 shows the pedestrian detection results of the proposed S-RF on PETS 2006 (Fig. 6(a)) and Town Centre (Fig. 6(b) and (c)), including cases of occlusion. As shown in Fig. 6(a), the proposed approach detects pedestrians correctly even when they are very small, occluded by other pedestrians or in different poses. In Fig. 6(b), although more pedestrians appear on the street in the Town Centre dataset and the perspective angle is larger than in PETS 2006, the proposed S-RF also correctly detected small-sized and closely positioned pedestrians in the high-perspective surveillance video. However, the proposed S-RF still yields a few incorrect detections: it missed a pedestrian pushing a stroller (Fig. 6(c), first column), falsely detected a pedestrian on a background that was similar in color and complicated (Fig. 6(c), second and fourth columns) and missed pedestrians occluded by other objects or standing against a background with a similar pattern (Fig. 6(c), third and fifth columns).
Demo videos have been uploaded to our webpage, http://cvpr.kmu.ac.kr/SoftTarget.htm.
Conclusion
In this paper, we proposed a pedestrian detection algorithm that can detect small-sized and closely positioned pedestrians in surveillance video when the camera is installed at a high location. Although deep learning-based approaches, especially CNNs, are known to achieve top performance in pedestrian detection, they require very wide and deep networks with numerous parameters, a large-scale dataset and massive computing power for kernel multiplication. Moreover, CNN-based detectors have difficulty both detecting small-sized pedestrians in low-resolution video and operating on low-specification embedded surveillance systems. To detect small-sized pedestrians and determine the image-scaling levels, we adopted HWMs and an adaptive ROI algorithm instead of predicting the ROI with Faster R-CNN or YOLO9000. This study focused on a new type of model compression algorithm realized by transferring the teacher-student framework to an RF model instead of using heavy deep learning, because an RF is a structure to which a teacher-student framework can easily be applied by reducing the size of the forest through pruning the number of decision trees. The proposed teacher-student compression model, the S-RF, is a shallower version of the original teacher RF trained using a softened version of the teacher's output. The experimental results showed that the proposed S-RF trained by soft target training is a reasonable method to mimic the classification ability of a teacher classifier. In addition, on high-perspective surveillance datasets it efficiently detected small-sized and closely positioned pedestrians and decreased the processing time considerably without losing accuracy.
In future work, we plan to improve our algorithm by considering additional features in order to reduce the false and missed detections that occur when a pedestrian is similar to the background or is occluded by other objects. Moreover, although the proposed method achieved a reasonable computational speed without degrading accuracy on a PC, a more compact version of the S-RF is needed so that it can be applied to a low-specification embedded board. In our opinion, it is reasonable to combine two or three model compression algorithms to maximize the compression and speed-up rates, for example, compressing the RF with the teacher-student framework and additionally pruning each decision tree.