Introduction
Vehicle and pedestrian detection (VaPD) using a monocular camera is one of the crucial computer vision problems in the automotive safety and driver assistance domains [1], [2]. VaPD is a typical object detection problem in digital image processing that has attracted intense interest from many other areas, including surveillance [3], robotics [4] and care for the elderly or disabled [5]. However, VaPD in self-driving cars usually demands higher performance, namely high detection precision and fast processing speed.
Indeed, although a cloud scheme can be adopted to perform VaPD, in which an image is sent to a cloud server and the response is sent back once the cloud computation is complete, this approach requires a low-latency network. Even then, such a system cannot in general guarantee the real-time detection required by Advanced Driver Assistance Systems (ADAS) for safety purposes. Thus, at present, embedded VaPD appears to be the most suitable solution to meet this unavoidable requirement. Following this motivation, this paper focuses on VaPD algorithms for embedded systems with reduced computational and memory resources.
Typically, the detection process requires localizing a candidate region in which the object is then recognized by a classification algorithm. Thus a VaPD method based on a monocular camera includes two steps: region of interest (ROI) generation and classification. For at least two decades a sliding-window approach has been used for ROI or proposal generation [6], [7], [8]. This approach searches for the object over the entire image using search windows of various sizes, so a large number of ROIs is typically generated.
More recently several methods that do not use sliding windows have been proposed [9], and they can be divided into two main classes: grouping methods and window scoring methods.
Grouping methods aim to generate segments that are likely to correspond to objects. The simplest way to do this is to directly apply an image segmentation algorithm. Grouping methods can be classified into three types according to how they generate proposals: by grouping superpixels (SP) [10], [11], [12], [13], [14], by solving multiple graph cuts (GS) [15], [16], [17], [18], [19], or directly from edge contours (EC) [20], [21].
In window scoring methods a score is assigned to each candidate window according to how likely it is to contain an object. Typically, this approach generates proposals with low localization accuracy. In this context the concept of objectness [22] is used to generate initial proposals from a saliency map and these proposals are then scored according to multiple cues, such as colour, edges, and localization.
Other methods differ in how the initial proposals are generated and ranked: ranking support vector machines (SVMs) with a cascade architecture [23], Edge Boxes, in which edge groups are constructed from an edge map [24], and binarized normed gradients (BING), which employs a linear SVM trained on a normed gradient (NG) feature [25].
With the rapid development of deep learning, object detection has experienced remarkable progress, thanks to a large variety of convolutional architectures able to achieve ever better performance. Currently, convolutional neural network (CNN)-based detectors can be classified into two main categories: two-stage and one-stage detectors.
The basic architecture of two-stage detectors consists of a region proposal stage followed by a network that extracts a fixed-length feature vector from each proposal and feeds it into a classifier and a regressor. Region-based CNN (R-CNN), a two-stage detector first proposed in [26], generally reports accurate detection performance. However, due to the heavy computational cost of region proposal generation, R-CNN is slow at test time. To overcome this problem, several extensions, Fast R-CNN [27], Faster R-CNN [28] and Mask R-CNN [29], were successfully proposed, but the slow inference speed remains a bottleneck for real-time applications.
One-stage detectors remove the region proposal step and merge all computation into a single stage. The YOLO architecture, proposed in [30], is a unified object detection method that predicts class probabilities and bounding box offsets with a single network. To improve accuracy and inference time, several new versions based on the YOLO architecture, YOLO v2, YOLO9000 and YOLO v3, were successfully proposed in [31] and [32]. YOLO v3 is a hybrid between the network used in YOLO v2, Darknet-19, and the residual structure of ResNet, and it currently offers among the best combinations of detection speed and accuracy. However, YOLO v3 requires a large number of parameters and computations, so this architecture is not suitable for implementation on embedded hardware.
The main challenge for real-time pedestrian detection on low-power embedded platforms with reduced computational and memory capability is to achieve a trade-off between accuracy and computational cost. At present, although a wide variety of solutions have been proposed for high-performance hardware platforms, such as NVIDIA Titan GPUs, only marginal attention has been devoted to VaPD for embedded systems. In fact, most papers in the literature refer to complex and expensive computing architectures [33], [34], [35], [36], [37], [38], while few works explore the feasibility of deployment on embedded devices with limited resources, low power consumption and small memory footprint [39], [40], [41], [42], [43], [44], [45], [46].
A typical approach for machine learning applications streams data acquired from IoT sensors and actuators to external computing systems (such as cloud servers) for further processing, but this worsens latency, increases communication costs and raises privacy issues. In recent years, the trend has been to address this problem by running AI algorithms directly on the device where the data are generated (the edge), focusing on operational aspects such as compression techniques, dimensionality reduction, tensor decomposition and parallel computing, thereby reducing latency, communication costs and privacy issues. To meet these specifications and fill this gap, this work investigates the performance of a custom tiny network for pedestrian and vehicle detection.
The aim of this paper is to propose a compressed Tiny YOLO v3 architecture that meets the requirements of real-time VaPD on embedded systems.
The main contributions of the paper are summarized as follows.
A CNN compression technique based on Tucker tensor decomposition has been analyzed and adopted to reduce the complexity of each convolutional layer in the network.
A Low-Rank (LR) Tiny YOLO v3 Darknet that reduces the computational complexity has been developed starting from Tiny YOLO v3 Darknet architecture.
To validate the derived LR architecture, extensive experimentation has been carried out to evaluate the performance of the proposed network on two embedded platforms, the Raspberry Pi 4 and the NVIDIA Jetson Nano 2 GB, equipped respectively with an embedded CPU and an embedded GPU. Furthermore, the proposed network was trained and tested on two datasets commonly used for object and traffic scene detection, namely the PASCAL VOC [47] and KITTI [48] datasets.
The rest of the paper is organized as follows: Section II summarizes the related work, while Section III describes the proposed method. Section IV analyses and applies the Tucker tensor decomposition to a convolutional layer. Section V derives the LR Tiny YOLO v3 Darknet architecture. Finally, Section VI reports results from experiments carried out on two different hardware platforms: the Raspberry Pi 4 and the NVIDIA Jetson Nano 2 GB.
Related Work
In this section, an overview of the state-of-the-art methods for pedestrian detection is provided, together with a detailed description of one of the best-known methods in the literature, the YOLO architecture (and its variants), in order to better frame the proposed method.
A. Classical Methods for Pedestrian Detection
Before the advent of deep learning, hand-crafted features were widely used to capture localization cues in images [49], [50], [51], [52]. For example, the Histogram of Oriented Gradients (HOG) feature descriptor, which combines gradient direction and edge orientation over small spatial regions of an image, was used for pedestrian detection [49]. Other traditional and popular methods for this task were: the Scale Invariant Feature Transform (SIFT) [53], Shape Contexts [54], Haar-like features [55], HOG [49] and Local Binary Patterns (LBP) [56]. Significant progress was made by using gradient-based features. Building on the idea of SIFT [57], Dalal and Triggs [49] popularized HOG for pedestrian detection. Zhu et al. [58] successfully sped up the computation of HOG features using integral histograms [59].
B. Deep Learning-Based Methods for Pedestrian Detection
Deep learning (DL) algorithms [30], [60], [61], [62] have revolutionized pedestrian detection, establishing themselves as the leading approach to this problem. Pedestrian detection methods based on Faster R-CNN [63], [64], [65] have been proposed, utilizing a Region Proposal Network (RPN) to generate pedestrian candidates; this results in both improved detection performance and runtime efficiency compared to earlier methods [26], [58]. Nevertheless, the computational cost of Faster R-CNN remains high, making it challenging for real-time applications such as autonomous driving. One way to achieve better runtime efficiency is to use single-stage pedestrian detectors, which leverage CNNs to combine feature extraction, location regression and region classification into a single process; however, they often suffer from lower accuracy [66]. In addition to the aforementioned models that rely on a multi-stage pipeline, single-stage object detectors like SSD [51] and YOLO [52] have also been proposed. These methods utilize a single convolutional network to simultaneously predict class labels and bounding boxes, resulting in significantly faster inference. More details about the YOLO architecture and its variants are provided in the following section.
C. YOLO Architecture
1) YOLO:
YOLO (You Only Look Once) [30] was developed with the idea of extracting classification and location predictions at the same time, in a single pass over the input image.
The image is divided into a grid of cells; each cell predicts a fixed number of bounding boxes together with their confidence scores and class probabilities.
2) YOLO v3 Darknet:
The basic architecture of the standard YOLO has been generalized over time by providing multiple predictions at different scales, thus obtaining the YOLO v2 and YOLO v3 versions. Moreover, several backbones for feature extraction have been used, such as MobileNet [67], ResNet [68], Xception [69] and Inception v4 [70].
One of the most effective backbones for object detection has proven to be Darknet-53 [71], which achieves classification results on the ImageNet dataset similar to those of heavier backbones such as ResNet, with faster inference on GPUs.
Darknet-53 based YOLO v3 architecture provides outputs at three different scales and it consists of several groups of residual blocks. Each group has a different number of filters and residual block repetitions and it is responsible for a down-sampling operation.
YOLO v3 Darknet computes every prediction across the bounding boxes, including location, shape, confidence and class probabilities. The dimension of the output feature map at each scale is therefore N × N × [B × (5 + C)], where N × N is the grid size at that scale, B is the number of anchor boxes per cell and C is the number of classes.
YOLO v3 Darknet obtained better detection and inference time performance than state-of-the-art high-accuracy object detection networks based on the Single Shot Detector [72], RetinaNet [73] and Faster R-CNN [28].
3) Tiny YOLO v3 Darknet:
YOLO v3 Darknet, however, is not very suitable for real-time object detection scenarios because of its high computational cost, memory occupation and inference time on hardware with limited resources.
To address this problem, a much lighter version of this architecture, called Tiny YOLO v3 [74], was proposed. The network was derived by replacing the computationally expensive groups of residual blocks with single convolutions and by using only 2 output prediction features.
A very large acceleration of inference is obtained, making the model much more practical in real-time contexts at the expense of a reduction in detection quality.
4) YOLO v4 Darknet:
YOLO v4 [75] is an evolution of the previous YOLO versions that significantly improves accuracy through two main changes. First, it uses a CSPDarknet53 [76] backbone, built from a particular type of residual block repeated several times as in Darknet-53; an initial feature split is performed before feeding each block sequence, and at the end the features are recombined by concatenation. Moreover, the better-performing Mish activation is used instead of LeakyReLU.
Second, SPP [77] and PAN [78] modules follow the CSPDarknet53 backbone before providing the output for the final prediction head, which is the same as in YOLO v3.
5) Tiny YOLO v4 Darknet:
A tiny version of YOLO v4, suitable for low-end GPUs, was proposed in [79]. The number of computations of each block, which is repeated only once, is reduced by using the CSPOSANet [80] blocks shown in Figure 1. The architecture of Tiny YOLO v4 Darknet is reported in Table I.
6) YOLOX Darknet:
The main novelty of YOLOX [81] is a decoupled head, as opposed to the coupled head of the other YOLO architectures; in addition, an anchor-free approach is adopted for training.
7) Tiny YOLOX Darknet:
A tiny version of this network is also proposed in [81]. Classification, confidence and bounding box predictions are produced by separate convolutions receiving the output of several convolution blocks, as shown in Figure 2. In practice, the number of filters of the head convolutions should be limited by a factor.
The decoupled head and anchor-free approach can be easily extended to any YOLO or Tiny YOLO architecture to produce its YOLOX counterpart. In this work we applied them to Tiny YOLO v3, which is the network adopted as baseline.
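As an illustration, a minimal Keras sketch of such a decoupled head is reported below. This is not the exact YOLOX implementation: the branch width, the number of stacked convolutions and the output ordering are assumptions made for clarity.

```python
import tensorflow as tf
from tensorflow.keras import layers

def decoupled_head(feature_map, num_classes, width=96):
    """Illustrative decoupled head: separate branches for classification and
    for box regression / objectness, in the style of YOLOX-like detectors."""
    stem = layers.Conv2D(width, 1, padding='same', activation='relu')(feature_map)

    # Classification branch
    cls = layers.Conv2D(width, 3, padding='same', activation='relu')(stem)
    cls = layers.Conv2D(width, 3, padding='same', activation='relu')(cls)
    cls_out = layers.Conv2D(num_classes, 1, padding='same')(cls)

    # Regression branch: box coordinates plus objectness score
    reg = layers.Conv2D(width, 3, padding='same', activation='relu')(stem)
    reg = layers.Conv2D(width, 3, padding='same', activation='relu')(reg)
    box_out = layers.Conv2D(4, 1, padding='same')(reg)
    obj_out = layers.Conv2D(1, 1, padding='same')(reg)

    return layers.Concatenate(axis=-1)([box_out, obj_out, cls_out])
```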
8) YOLO Lite:
A very fast architecture optimized for non-GPU environments is YOLO Lite [82]. It was derived by truncating Tiny YOLO v2 [31], selecting its first 4 convolution blocks and appending 3 convolutions with a low number of filters before the final single output feature.
The main drawback of this architecture is that it is essentially focused on inference speed, and its detection quality is considerably lower than that of Tiny YOLO v2 and v3.
9) YOLO Fastest:
An architecture offering a much better trade-off between speed and detection accuracy, called YOLO Fastest, was proposed in [83]. The model was optimized for ARM terminals with the NCNN framework, but it achieves good performance on GPUs as well.
The architecture was designed using several groups of residual blocks made up of depthwise separable convolutions with the inverted bottleneck structure shown in Figure 3, where the number of channels is first expanded from S to E by a pointwise convolution and then projected back after the depthwise convolution.
The detailed architecture of YOLO Fastest is described in Table II where depthwise-blocks consist of the main path of the residual blocks without the final dropout layer.
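For reference, the sketch below shows how such an inverted-bottleneck depthwise block could be written in Keras. It is a generic MobileNetV2-style block assumed for illustration, not the exact YOLO Fastest block (e.g. the activation and normalization choices are assumptions).

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_bottleneck(x, expand_channels, out_channels, stride=1):
    """Sketch of an inverted-bottleneck residual block: expand S -> E channels
    with a pointwise conv, apply a depthwise 3x3 conv, then project back."""
    in_channels = x.shape[-1]
    y = layers.Conv2D(expand_channels, 1, padding='same', use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.DepthwiseConv2D(3, strides=stride, padding='same', use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(out_channels, 1, padding='same', use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Residual connection only when input and output shapes match
    if stride == 1 and in_channels == out_channels:
        y = layers.Add()([x, y])
    return y
```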
Proposed Method
The networks mentioned above each focus on a specific aspect of object detection; therefore none of them satisfies both the low inference time and the high precision required for real-time pedestrian detection. The main purpose of this work is to demonstrate that a compressed Tiny YOLO v3 architecture can be derived that achieves an optimal trade-off between accuracy and inference time. To achieve this goal the following steps have been carried out.
The Tucker decomposition (TD) technique has been applied to the kernel of a generic convolution layer. Following this method, the convolution layer is decomposed into two 1×1 convolutions and a convolution with a core tensor. As a result, a significant reduction in parameters and mathematical operations is achieved. The main advantages of the TD method with respect to other tensor decomposition techniques are that i) only standard convolutions, and no depthwise convolutions, are used, and ii) decomposition and fine-tuning can be jointly carried out in a single step without any stability problem.
A Low-Rank Tiny YOLO v3 Darknet has been developed starting from an initial training of Tiny YOLO v3 Darknet and applying the Tucker decomposition to its 3×3 convolution layers. Pointwise (1×1) layers are not decomposed, since they cannot be decomposed further. A one-step fine-tuning is performed after decomposition to recover the accuracy lost due to compression. Several configurations with different compression factors were implemented before reaching the optimal final network.
The source code to train and test the Low-Rank Tiny YOLO v3 Darknet is available at https://github.com/ManoniLo/Tiny-YOLOv3-Tucker-compression.
Tucker Decomposition for CNN Compression
Satisfying the demand for CNN accuracy requires a high computational complexity, thus it is essential to investigate CNN compression techniques to reduce the inference time. Several different methods, such as pruning [84], [85], [86], [87], [88], quantization [89] and tensor decomposition are available for this purpose.
Tensor decomposition is a multilinear factorization that approximates a full-rank tensor with one or more low-rank tensors. This kind of decomposition is particularly suitable for CNNs, since convolution operations involve filter banks, which are naturally represented by tensors. In this context a large variety of algorithms exists; the most common are the CANDECOMP/PARAFAC (CP) decomposition [90], TD [90] and tSVD [91], [92].
In this paper we adopt the TD method, that has proven to be particularly useful in deriving a low complexity CNN architecture for pedestrian tracking.
A. Tucker Decomposition
The TD method approximates an nth-order tensor $\mathcal {X} \in \mathbb {R}^{I_{1} \times \ldots \times I_{n}}$ as \begin{align*} {\mathcal {X}}_{Tuck} = \displaystyle { \sum _{r_{1}=1}^{R_{1}} \ldots \sum _{r_{n}=1}^{R_{n}}} g_{r_{1}, \ldots, r_{n}} \, U^{(1)}(:, r_{1}) \circ \ldots \circ U^{(n)}(:, r_{n}), \tag {1}\end{align*}
where $\mathcal {G}$ is the core tensor with entries $g_{r_{1},\ldots,r_{n}}$, $U^{(i)} \in \mathbb {R}^{I_{i} \times R_{i}}$ are the factor matrices, $(R_{1},\ldots,R_{n})$ are the Tucker ranks and $\circ$ denotes the outer product. Using the mode-$i$ product $\times _{i}$, the decomposition can be written in the compact form \begin{equation*} {\mathcal {X}}_{Tuck} = \mathcal {G} \times _{1} U^{(1)} \ldots \times _{n} U^{(n)}. \tag {2}\end{equation*}
The core tensor and the factor matrices are obtained by minimizing the approximation error \begin{equation*} \min _{{\mathcal {X}}_{Tuck}} \lVert {\mathcal {X}} - {\mathcal {X}}_{Tuck} \rVert _{2}^{2}. \tag {3}\end{equation*}
For factor matrices with orthonormal columns, the core tensor is given by \begin{equation*} \mathcal {G} = \mathcal {X} \times _{1} (U^{(1)})^{T} \ldots \times _{n} (U^{(n)})^{T}. \tag {4}\end{equation*}
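A minimal NumPy sketch of a truncated higher-order SVD, one common way of obtaining the factor matrices and the core tensor of Eqs. (1)-(4), is shown below. The ranks and the example tensor are illustrative, and the paper does not prescribe this specific algorithm.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-n unfolding: move axis `mode` to the front and flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_dot(tensor, matrix, mode):
    """Mode-n product: multiply `matrix` with the mode-n unfolding, then refold."""
    new_shape = [matrix.shape[0]] + [s for i, s in enumerate(tensor.shape) if i != mode]
    res = (matrix @ unfold(tensor, mode)).reshape(new_shape)
    return np.moveaxis(res, 0, mode)

def truncated_hosvd(X, ranks):
    """Truncated HOSVD: the leading left singular vectors of each unfolding give
    the U^(i); the core is G = X x_1 U1^T ... x_n Un^T, as in Eq. (4)."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(X, mode), full_matrices=False)
        factors.append(U[:, :r])
    G = X
    for mode, U in enumerate(factors):
        G = mode_dot(G, U.T, mode)
    return G, factors

# Example: approximate a random 4th-order tensor and reconstruct it as in Eq. (2)
X = np.random.randn(3, 3, 64, 128)
G, Us = truncated_hosvd(X, ranks=(3, 3, 16, 32))
X_tuck = G
for mode, U in enumerate(Us):
    X_tuck = mode_dot(X_tuck, U, mode)
print(np.linalg.norm(X - X_tuck) / np.linalg.norm(X))  # relative approximation error
```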
B. CNN Compression
In a CNN the convolutional layers are the most time-consuming part, thus tensor decomposition applied to these layers is beneficial to reduce the network complexity. The kernel of a convolutional layer is a 4th-order tensor $\mathcal {K} \in \mathbb {R}^{D \times D \times S \times T}$, where $D$ is the spatial kernel size and $S$ and $T$ are the numbers of input and output channels. Since the spatial dimensions are small, only the two channel modes are decomposed, giving \begin{equation*} \mathcal {K}_{i,j,s,t}=\displaystyle {\sum _{r_{3}=1}^{R_{3}}}\displaystyle {\sum _{r_{4}=1}^{R_{4}}}\mathcal {G}_{i,j,r_{3},r_{4}} \, U_{s,r_{3}}^{(3)} \, U_{r_{4},t}^{(4)}, \tag {5}\end{equation*}
or, in compact form, \begin{equation*} \mathcal {K}= \mathcal {G} \times _{3} U^{(3)} \times _{4} U^{(4)}. \tag {6}\end{equation*}
Substituting (5) into the convolution, the output feature map $\mathcal {Y}$ becomes \begin{align*} \mathcal {Y}_{h',w',t}&=\displaystyle {\sum _{i,j} \sum _{s=1}^{S}} \, \mathcal {K}_{i,j,s,t} \, \mathcal {X}_{h_{i},w_{j},s} \\ &= \displaystyle {\sum _{i,j} \sum _{s=1}^{S} \sum _{r_{3}=1}^{R_{3}} \sum _{r_{4}=1}^{R_{4}}} \mathcal {G}_{i,j,r_{3},r_{4}} \, U_{s,r_{3}}^{(3)} \, U_{r_{4},t}^{(4)} \, \mathcal {X}_{h_{i},w_{j},s}. \tag {7}\end{align*}
Figure 4 shows the effect of TD on the convolutional layer.
The convolution with the kernel $\mathcal {K}$ is therefore replaced by a sequence of three convolutions: a 1×1 convolution with $U^{(3)}$, which reduces the number of channels from S to R3; a D×D convolution with the core tensor $\mathcal {G}$; and a final 1×1 convolution with $U^{(4)}$, which maps the R4 channels back to the T output channels.
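As a minimal Keras sketch (an illustration, not the paper's actual code; the 'same' padding, the omission of biases and the random factors in the usage example are assumptions), the decomposed layer can be built as follows:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

def tucker2_conv_block(x, G, U3, U4, strides=1):
    """Replace a single DxD convolution (S -> T channels) with three convolutions.
    Shapes follow Eq. (5): G is (D, D, R3, R4), U3 is (S, R3), U4 is (R4, T)."""
    D, _, R3, R4 = G.shape
    S, T = U3.shape[0], U4.shape[1]

    first = layers.Conv2D(R3, 1, use_bias=False, padding='same')     # 1x1, S -> R3
    core = layers.Conv2D(R4, D, strides=strides, use_bias=False,
                         padding='same')                             # DxD core, R3 -> R4
    last = layers.Conv2D(T, 1, use_bias=False, padding='same')       # 1x1, R4 -> T
    y = last(core(first(x)))

    # Copy the Tucker factors into the kernels (Keras layout: kh, kw, in, out)
    first.set_weights([np.asarray(U3, dtype='float32').reshape(1, 1, S, R3)])
    core.set_weights([np.asarray(G, dtype='float32')])
    last.set_weights([np.asarray(U4, dtype='float32').reshape(1, 1, R4, T)])
    return y

# Usage sketch with random factors (in practice G, U3, U4 would come from the
# Tucker decomposition of a trained kernel):
inputs = tf.keras.Input(shape=(None, None, 256))
outputs = tucker2_conv_block(inputs,
                             G=np.random.randn(3, 3, 64, 128),
                             U3=np.random.randn(256, 64),
                             U4=np.random.randn(128, 512))
decomposed = tf.keras.Model(inputs, outputs)
```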
The compression factor c, which quantifies the reduction in the number of parameters of the convolutional layer, is given by:\begin{equation*} c= \dfrac {S R_{3} + D^{2} R_{3} R_{4} + T R_{4}}{D^{2} S T}. \tag {8}\end{equation*}
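For instance, a tiny Python helper (hypothetical, but matching the parameter counts of the three convolutions above) can be used to evaluate c for candidate ranks:

```python
def tucker_compression_factor(D, S, T, R3, R4):
    """Parameter ratio of the decomposed layer (Eq. 8) versus the original DxD conv."""
    original = D * D * S * T                        # full-rank kernel
    compressed = S * R3 + D * D * R3 * R4 + R4 * T  # 1x1 + core + 1x1 kernels
    return compressed / original

# e.g. a 3x3 layer with 256 -> 512 channels compressed to ranks (64, 128)
print(tucker_compression_factor(D=3, S=256, T=512, R3=64, R4=128))  # ~0.13
```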
C. Decomposition Methods Comparison
Table III shows a comparison between the Tucker decomposition and other common decomposition algorithms, specifically the CP and tSVD decompositions, in order to highlight the motivation of our choice. A detailed comparative study using the above-mentioned decomposition methods was carried out in [93], from the point of view of the impact of the compression factor applied to data tensors on accuracy and approximation error. For completeness, the tSVD method has also been reported in Table III; however, to the best of our knowledge, it can only be used for the decomposition of data tensors and not of layers, as is done in this work. In fact, the compression factor for the tSVD decomposition strictly depends on the size of the tensor, whereas for the Tucker and CP decompositions the compression factor can be estimated in terms of the convolution layer parameters.
With respect to the CP technique, TD uses only standard convolutions and no depthwise convolutions. Although, for an equal compression factor c, the TD method gives rise to a larger approximation error than the CP method, the Tucker decomposition is more advantageous in practice. In fact, the decomposition process with CP requires several steps, decomposing a few layers (or just one) at each step and successively retraining the network, due to the instability of the depthwise convolution gradient, as shown in [94] and [95]. Instead, TD and fine-tuning can be jointly carried out in a single step without any stability problem.
Design of TensorFlow Low-Rank Architecture
A. Full-Rank Network
In order to develop the proposed network we started from Tiny YOLO v3 Darknet, whose architecture is described in Table V. As briefly explained before, the backbone was obtained by replacing the residual blocks of YOLO v3 Darknet shown in Figure 3, where the residual part is repeated numerous times, with a single convolution with the same number of filters and stride.
Similar convolution blocks, together with upsampling and concatenation operations as in the YOLO v3 Darknet architecture (described in Table IV), are used to combine outputs at different scales. Finally, pointwise convolutions provide the 2 final prediction features.
B. Proposed LR Tiny YOLO v3 Darknet
In order to develop the proposed LR network, ensuring the desired speed and accuracy once ported to embedded platforms, Tiny YOLO v3 Darknet was initially trained and subsequently compressed using the Tucker decomposition, following the approach described in Sec. IV.
The Tucker decomposition was applied only to the 3×3 convolutional layers of the network, while the pointwise (1×1) convolutions were left unchanged, since they cannot be decomposed further.
Experimental Results
A. Datasets
The capability of the proposed method has been validated by experiments conducted on the Pattern Analysis, Statistical Modelling and Computational Learning (PASCAL) Visual Object Classes (VOC) 2007 & 2012 dataset [47] and the 2D Object Detection KITTI dataset [48].
1) PASCAL VOC 2007 & 2012:
PASCAL VOC [47] is a collection of publicly available datasets of annotated images for visual object recognition challenges. The datasets of the PASCAL VOC collection differ in the number of images and classes. Since objects are captured in different scenarios in terms of lighting conditions, visibility and size, the VOC datasets provide complete annotations of all objects. Thus these datasets contain not only images but also the corresponding labels, annotations, dimensions and bounding box coordinates for multi-class object detection.
In this work, the PASCAL VOC 2007 [97], [98] and PASCAL VOC 2012 [99], [100] datasets have been chosen. The model has been trained on the PASCAL VOC 2007 & 2012 training dataset and tested on the PASCAL VOC 2007 test dataset. The PASCAL VOC 2007 & 2012 dataset includes a total of 21,503 images for 20 classes: 16,551 are used for training/validation with a 90%/10% split and the remaining 4,952 are devoted to testing. The 20 original categories are: ‘aeroplane’, ‘bicycle’, ‘bird’, ‘boat’, ‘bottle’, ‘bus’, ‘car’, ‘cat’, ‘chair’, ‘cow’, ‘dining table’, ‘dog’, ‘horse’, ‘motorbike’, ‘person’, ‘potted plant’, ‘sheep’, ‘sofa’, ‘train’, ‘TV monitor’.
For our purposes, only the 5 classes related to a pedestrian tracking scenario, ‘bicycle’, ‘bus’, ‘car’, ‘motorbike’ and ‘person’, have been selected, and a custom dataset containing the labels related to these 5 classes only has been created.
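A minimal sketch of this filtering step is given below. The annotation path is a placeholder and the helper name is hypothetical; only the standard PASCAL VOC XML layout (object/name, object/bndbox) is assumed.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

KEEP = {'bicycle', 'bus', 'car', 'motorbike', 'person'}  # the 5 selected classes

def filter_voc_annotation(xml_path):
    """Return (image filename, [(class, xmin, ymin, xmax, ymax), ...]) keeping
    only the objects belonging to the selected classes."""
    root = ET.parse(xml_path).getroot()
    boxes = []
    for obj in root.findall('object'):
        name = obj.find('name').text
        if name not in KEEP:
            continue
        bb = obj.find('bndbox')
        boxes.append((name,
                      int(float(bb.find('xmin').text)), int(float(bb.find('ymin').text)),
                      int(float(bb.find('xmax').text)), int(float(bb.find('ymax').text))))
    return root.find('filename').text, boxes

# Keep only the images that still contain at least one object of interest
annotations = Path('VOCdevkit/VOC2007/Annotations')          # placeholder path
dataset = [filter_voc_annotation(p) for p in annotations.glob('*.xml')]
dataset = [(fname, boxes) for fname, boxes in dataset if boxes]
```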
Microsoft COCO [101] is a large-scale dataset for object detection, instance segmentation, image captioning and person keypoint localization. It consists of images of complex everyday scenes containing common objects in their natural context. Objects are labelled using per-instance segmentations to aid precise object localization. The dataset comprises 91 object categories with a total of 2.5 million labelled instances in 328k images.
In this work all the models started from weights pretrained on COCO, which provides a significant improvement of detection performance on PASCAL VOC.
2) KITTI:
The KITTI dataset [48] is one of the most widely used computer vision datasets for autonomous driving. Images are taken from rural, urban and highway environments and are labelled for 3D object localization and tracking as well as 2D object detection. The KITTI dataset consists of 7,481 training/validation and 7,518 testing images, which contain objects belonging to 3 categories: ‘car’, ‘cyclist’ and ‘person’. Since labels for the testing set are not available, in this work we followed the literature and used a 90%/10% train/val split, evaluating accuracy on the validation set. As for PASCAL VOC, all models started from weights pretrained on COCO.
B. Evaluation Metrics
In this subsection the main features of the metrics used to evaluate the performance of the object detector are summarized. To measure detection accuracy, the traditional Average Precision (AP) metric, also used in other common object detectors such as SSD, has been adopted. The AP is based on the concepts of Precision and Recall.
The Precision is given by the following relationship\begin{equation*} Precision = \frac {True\,Positive}{True\,Positive + False\,Positive}, \tag {9}\end{equation*}
The Recall is defined by\begin{equation*} Recall = \frac {True\,Positive}{True\,Positive + False\,Negative}, \tag {10}\end{equation*}
In the Object Detection both these two parameters depend on the Intersection over Union (IOU) defined as\begin{equation*} IOU = \frac {Area\,of\,Overlap}{Area\,of\,Union}, \tag {11}\end{equation*}
The Average Precision is computed from the Precision-Recall curve $p(r)$ as \begin{equation*} AP = \int _{0}^{1} p(r) dr. \tag {12}\end{equation*}
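For clarity, a small NumPy sketch of Eqs. (11) and (12) is shown below; it uses the common "monotonically decreasing precision" interpolation of the PR curve, which is one standard way to approximate the integral (box format and function names are illustrative).

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (xmin, ymin, xmax, ymax), Eq. (11)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def average_precision(precision, recall):
    """Area under the precision-recall curve, Eq. (12), with interpolated precision."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):    # make precision non-increasing
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]     # points where recall changes
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))
```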
C. Training & Compression
The initial training of Tiny YOLO v3 Darknet was performed starting from the COCO-pretrained weights available at https://pjreddie.com/darknet/tiny-darknet/.
We used a traditional loss resulting from the combination of a cross-entropy loss for classification and a mean square error loss for box localization:\begin{align*} L& =L_{\text {class}}+\alpha L_{\text {loc}}, \\ L_{\text {class}}& =-\displaystyle {\sum _{i=1}^{S^{2}}\sum _{j=1}^{B}} \theta _{ij} \left ({{\log y_{ij}^{\text {conf}} + \displaystyle {\sum _{k=1}^{C}} t_{ik} \log y_{ijk} }}\right), \\ L_{\text {loc}}& = \displaystyle {\sum _{c \in \text { coord}} \sum _{j=1}^{B} \sum _{i=1}^{S^{2}}} \theta _{ij}\left ({{c_{ij}-\widehat {c}_{ij}}}\right)^{2}, \\ \theta _{ij}& = \begin{cases} \displaystyle 1 & \begin{array}{c}\text { if an object appears in cell} \ i\\\text { and bounding box } j \text{ is responsible}\\\text {for the prediction, }\end{array} \\ \displaystyle 0 & \text {otherwise}, \end{cases} \tag {13}\end{align*}
where $S^{2}$ is the number of grid cells, $B$ the number of bounding boxes per cell, $C$ the number of classes and $\alpha$ a weighting factor for the localization term.
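A simplified TensorFlow sketch of Eq. (13) is reported below. The per-box tensor layout (coordinates, objectness, one-hot classes) and the function name are assumptions made for illustration; the actual implementation also handles anchors and coordinate encoding, which are omitted here.

```python
import tensorflow as tf

def simplified_yolo_loss(y_true, y_pred, alpha=1.0, eps=1e-7):
    """Hedged sketch of Eq. (13). Assumed layout per grid cell and box:
    [x, y, w, h, objectness, class_0, ..., class_{C-1}], with the ground-truth
    objectness acting as the responsibility mask theta_ij."""
    theta = y_true[..., 4:5]                       # 1 if box j in cell i is responsible
    true_box, pred_box = y_true[..., :4], y_pred[..., :4]
    true_cls, pred_cls = y_true[..., 5:], y_pred[..., 5:]
    pred_conf = y_pred[..., 4:5]

    # Classification / confidence term: -theta * (log conf + sum_k t_k log y_k)
    l_class = -tf.reduce_sum(
        theta * (tf.math.log(pred_conf + eps)
                 + tf.reduce_sum(true_cls * tf.math.log(pred_cls + eps),
                                 axis=-1, keepdims=True)))
    # Localization term: squared error on responsible boxes only
    l_loc = tf.reduce_sum(theta * tf.square(true_box - pred_box))
    return l_class + alpha * l_loc
```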
The training was done by initially freezing all but the final layers, which were then unfrozen for a subsequent fine-tuning of the whole network.
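A hypothetical two-phase schedule of this kind can be sketched in Keras as follows (the frozen layer range, learning rates and epoch counts are illustrative assumptions, not the values used in the paper):

```python
import tensorflow as tf

def two_phase_training(model, train_ds, val_ds, loss_fn,
                       warmup_epochs=20, finetune_epochs=30):
    """Train only the final layers first, then unfreeze the whole network
    and fine-tune it end to end at a lower learning rate."""
    for layer in model.layers[:-2]:     # freeze everything except the last layers
        layer.trainable = False
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss=loss_fn)
    model.fit(train_ds, validation_data=val_ds, epochs=warmup_epochs)

    for layer in model.layers:          # unfreeze and fine-tune end to end
        layer.trainable = True
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss=loss_fn)
    model.fit(train_ds, validation_data=val_ds, epochs=finetune_epochs)
    return model
```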
1) PASCAL VOC Configuration:
The full-rank model was trained on the 5-class PASCAL VOC dataset.
Compression and fine-tuning were performed in a single step with the Adam optimizer, using a decay of 0.0 and a batch size of 32. In order to reach the final compressed network, five configurations with different compression factors for each layer were implemented; they are reported in Table VII.
The optimal configuration was found by pushing down the compression factors of layers with low accuracy sensitivity, such as the final convolutions. At the same time, compression of the initial layers was limited, since their decomposition results in a large accuracy drop while their contribution to the computational cost is almost the same as that of the last layers.
The first two convolutions were not decomposed at all, since doing so would have caused an unacceptable and unrecoverable loss of detection accuracy.
Table VIII reports the fine-tuning learning rates, number of epochs, number of parameters and the mean Average-Precision (mAP) metric among the 5 classes of PASCAL VOC obtained by the five configurations.
Despite not having the maximum number of parameters, the final configuration ensures the highest Average Precision for all classes, as shown in Table IX.
2) KITTI Configuration:
Training on the KITTI dataset was performed by resizing the images to the same size used for PASCAL VOC.
Similarly to the PASCAL VOC case, several configurations were tested; their compression factors and training hyperparameters are reported in Tables X and XI. Since the accuracy was very sensitive to the compression of layer conv #11, and since this layer does not add an excessive computational load, we chose not to decompose it starting from configuration #3. The Average Precision for the 3 classes of the KITTI dataset for each model configuration is reported in Table XII.
D. Testing
As stated in the Introduction, the main objective of the paper is to demonstrate the suitability of the LR Tiny YOLO v3 architecture described above for real-time VaPD on embedded systems. To this end, an extensive experimentation on two low-power, low-cost embedded platforms, the Raspberry Pi 4 and the NVIDIA Jetson Nano 2 GB, has been conducted.
The results in terms of detection quality and inference speed were compared with those obtained by the state-of-the-art networks described above, to show the effectiveness of the proposed architecture once ported to embedded platforms. The main goal of the experimentation is to demonstrate that the proposed network outperforms the baseline network.
The LR architecture for real-time VaPD has first been implemented on the Raspberry Pi 4 platform equipped with a USB camera. The main hardware specifications of the Raspberry Pi 4 Model B are: SoC Cortex-A72 (ARM v8) 64-bit quad-core @ 1.5 GHz, 4 GB LPDDR4 SDRAM. This board can be considered representative of a low-power system, as it consumes 3 W when idle and 6 W at full load. In addition, a Coral USB Accelerator has been added to the hardware: it provides an Edge TPU coprocessor for the embedded system, enabling high-speed machine learning inference simply by connecting it to a USB port.
For a VaPD system, low latency is the main performance requirement, due to the high speed of a car. For this reason, the NVIDIA Jetson Nano 2 GB, equipped with an embedded NVIDIA GPU, has been chosen as a compromise to satisfy the requirements of high speed and low cost. The main hardware specifications of the NVIDIA Jetson Nano 2 GB are: SoC Cortex-A57 (ARM v8) 64-bit quad-core @ 1.43 GHz, 128-core NVIDIA Maxwell GPU, 2 GB LPDDR4 SDRAM.
The implemented models were trained using Google Colaboratory: the code was executed on a Google cloud server with Google hardware, such as graphics processing units (GPUs) and tensor processing units (TPUs); specifically, a runtime with a GPU hardware accelerator was selected. As for the software, the Python TensorFlow (v. 2.8.0) and TensorFlow Keras (v. 2.8.0) libraries were used.
To perform inference on the Raspberry Pi 4, we used TensorFlow/Keras v. 2.8.0 with Python v. 3.7.3 on the Debian Buster (Raspbian 10) operating system. The proposed model, first developed in 32-bit floating point precision in .h5 format, was then converted to .tflite with int8 quantization in order to be run on the Raspberry Pi 4. The USB Coral Accelerator requires a further conversion to the edgetpu .tflite format with a dedicated TPU compiler, which performs specific layer optimizations for the Edge TPU coprocessor. A model compiled for the Edge TPU must be executed with the corresponding version of the Edge TPU runtime; in this case, we used Edge TPU compiler version 16.0. However, the optimization of Leaky ReLU layers is not fully supported by the TPU compiler, resulting in very high inference times during testing with the Coral Accelerator. Leaky ReLU layers were therefore replaced with standard ReLU, which is fully supported by the converter.
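A sketch of the float-to-int8 conversion step is shown below. File names and the calibration set are placeholders, not the exact pipeline used in the paper; only the standard TensorFlow Lite converter API is assumed.

```python
import tensorflow as tf

model = tf.keras.models.load_model('lr_tiny_yolov3.h5', compile=False)  # placeholder path

calibration_images = []          # fill with a few hundred preprocessed training images

def representative_dataset():
    """Yield calibration samples so the converter can estimate activation ranges."""
    for image in calibration_images:
        yield [image[None, ...].astype('float32')]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
open('lr_tiny_yolov3_int8.tflite', 'wb').write(converter.convert())
# The Edge TPU model is then produced with: edgetpu_compiler lr_tiny_yolov3_int8.tflite
```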
To run the evaluation code on the Jetson Nano 2 GB, we used NVIDIA TensorRT, a software development kit (SDK) for high-performance deep learning inference on NVIDIA hardware. This SDK supports models in the Open Neural Network Exchange (ONNX) format; thus, to compute inference, the original Keras .h5 models were first converted to ONNX models with 32-bit floating point precision and opset 13. In detail, NVIDIA JetPack v. 4.6, including TensorRT v. 8.0.1 and CUDA v. 10.2, has been used, with the Jetson module running at its maximum performance mode, i.e. 10 W for the Jetson Nano.
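The Keras-to-ONNX step can be sketched as follows (the input size, tensor name and file paths are illustrative assumptions; only the tf2onnx Keras converter and the standard trtexec tool are assumed):

```python
import tensorflow as tf
import tf2onnx

model = tf.keras.models.load_model('lr_tiny_yolov3.h5', compile=False)   # placeholder path
spec = (tf.TensorSpec((1, 416, 416, 3), tf.float32, name='input'),)      # assumed input size
tf2onnx.convert.from_keras(model, input_signature=spec, opset=13,
                           output_path='lr_tiny_yolov3.onnx')
# On the Jetson, a TensorRT engine can then be built and benchmarked with e.g.:
#   trtexec --onnx=lr_tiny_yolov3.onnx
```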
Table XIII compares the different configurations of the LR network in terms of storage cost, number of parameters, floating point operations (FLOPs), mean average precision (mAP) and frames per second (FPS), obtained by running inference on the embedded platforms Raspberry Pi 4 with the USB Coral Accelerator and NVIDIA Jetson Nano 2 GB. The last configuration provides the best accuracy/inference-time trade-off: its accuracy is the highest and its FPS value on both embedded platforms is outperformed only by the first two configurations, which however achieve a much lower mAP. The last configuration is therefore the most suitable model for our purpose.
For the aforementioned reasons, this final configuration has been used to conduct a comparison with the state-of-the-art object detection networks.
Tables XIV and XV report a comparison between the final LR Tiny YOLO v3 Darknet and the state-of-the-art networks implemented on the Raspberry Pi 4 equipped with the USB Coral Accelerator, trained on the PASCAL VOC and KITTI datasets respectively. As can be seen, the proposed network provides an FPS value close to that achieved by YOLO Lite, but with much higher accuracy and lower MFLOPs on both datasets. In particular, the proposed architecture requires only between 22% and 32% of the memory required by the baseline architecture, while still providing a better inference time on all the tested datasets (2.46 versus 2.19 FPS for PASCAL VOC and 2.74 versus 2.40 FPS for the KITTI dataset), with only a marginal decrease in accuracy (about 2%).
The comparison of the final LR Tiny YOLO v3 Darknet with the state-of-the-art networks implemented on the NVIDIA Jetson Nano 2 GB board and trained on the PASCAL VOC and KITTI datasets is shown in Tables XVI and XVII, respectively. The proposed network provides the best balance of detection accuracy and FPS on the Jetson Nano GPU among all the tested models. YOLO v3 Darknet offers the highest mAP but only 5.14/5.15 FPS on the Jetson Nano 2 GB, for the PASCAL VOC and KITTI datasets respectively, which is far too low for a real-time application. YOLO Lite achieves the best inference time, but its accuracy is below an acceptable level. YOLO Fastest reaches an accuracy similar to that of the proposed low-rank model, but it is clearly outperformed in terms of FPS on the NVIDIA Jetson Nano 2 GB. The same considerations hold for both the PASCAL VOC and KITTI datasets. As for the comparison with the Tiny YOLO v3 Darknet baseline, even in this case the proposed architecture requires a reduced memory occupancy and provides a better inference time (about 15 FPS more for PASCAL VOC and 7 FPS more for the KITTI dataset), with only a 2% decrease in accuracy.
An estimation of the failure cases in the detection results was conducted, obtaining the following values on the KITTI dataset: 4.2% errors for the class ‘car’, 21.11% for the class ‘cyclist’ and 14.88% for the class ‘person’. From a careful visual inspection of the failure cases, it can be observed that the errors impacting performance are mainly due to two causes: the distance of the object from the camera and its occlusion. Figure 7a shows an example of a detection error for the class ‘car’ due to the distance from the camera, and Figure 7b an example related to the occlusion of the class ‘person’. Future work will focus on studying and solving these problems.
Other recent studies, based on the same datasets used in this work, i.e. PASCAL VOC and KITTI, report the performance on the VaPD task of neural networks other than YOLO-based ones. In particular, Chen et al. [33] provide an exhaustive study of the performance of several mainstream object detection architectures applied to VaPD, including Faster R-CNN, R-FCN and SSD with ResNet50, ResNet101, MobileNet_V1, MobileNet_V2, Inception_V2 and Inception_ResNet_V2 backbones, pretrained on the COCO dataset and fine-tuned on the KITTI dataset using transfer learning. This analysis shows that Faster R-CNN ResNet50 obtains the best AP (58%) for vehicle and pedestrian detection, ResNet101 consumes the most memory (9907 MB), Inception_ResNet_V2 is the slowest model (3.05 FPS), SSD MobileNet_V2 is the fastest model (70 FPS) and SSD MobileNet_V1 is the lightest model in terms of memory usage (875 MB). The proposed method based on the YOLO v3 architecture outperforms the results presented by Chen et al. in terms of storage cost (10.719 MB vs. 875 MB). Regarding the inference time, our model achieves 34.72 FPS while their fastest model achieves 70 FPS; however, it must be taken into consideration that their results were computed on a PC with an Intel Core i7-7820X CPU and an NVIDIA GeForce GTX GPU, while ours were obtained on the embedded NVIDIA Jetson Nano 2 GB platform.
Conclusion
Most existing VaPD methods focus on improving detection accuracy while ignoring efficiency. However, VaPD applications require real-time detection speed with limited computational resources. Consequently, the study of lightweight, real-time VaPD methods for embedded devices is of great importance.
This paper shows that, by applying a CNN compression technique based on the Tucker tensor decomposition to the Tiny YOLO v3 Darknet architecture, a new architecture named Low-Rank Tiny YOLO v3 Darknet, suitable for real-time VaPD on embedded systems, can be derived. Considering Tiny YOLO v3 Darknet as the baseline, the proposed architecture uses only between 22% and 32% of the memory required by the baseline, always provides a better inference time on all the tested datasets, and incurs only a marginal decrease in accuracy. Furthermore, the proposed network provides the best balance of inference time, complexity and accuracy with respect to the state-of-the-art networks, demonstrating its suitability for VaPD on embedded systems.
A limitation of our method concerns the choice of the fine-tuning hyperparameters, which must be carefully set in order to maximize the accuracy of the compressed model. In particular, the learning rate must be accurately chosen [94], [102] when compressing with tensor decompositions.
To demonstrate the suitability and superiority of the proposed approach in obtaining the best compromise between inference time, accuracy and memory occupancy with respect to the state-of-the-art, a large experimentation on two low-power, low-cost embedded platforms, the Raspberry Pi 4 and the NVIDIA Jetson Nano 2 GB, and on two datasets commonly used for object and traffic scene detection, PASCAL VOC and KITTI, has been conducted.