I. Introduction
The success of deep learning has been greatly beneficial for various fields such as natural language processing [1], [2], [3], robotics [4], [5], [6], computer vision [7], [8], [9], etc. This is especially evident in the case of computer vision, where majority of the progress can be largely attributed to the advancements in deep convolutional neural networks (DCNN) [7]. Owing to their learning capacity, DCNN models have achieved state-of-the-art performance in many vision tasks such as object classification ([9], [10], [11]), semantic segmentation ([12], [13], [14]), and object detection ([15], [16], [17]). This has led to DCNN's increased popularity in several real world applications as compared to the classical computer vision techniques. Specifically, deep learning based object detection has become an integral part of many real-world applications ranging from video security/surveillance, augmented reality, autonomous navigation, human computer interface, self-checkout convenience stores. Major advancements like Faster-RCNN [15], You Only Look Once (YOLO) [16] and Single Shot Multi-box Detector (SSD) [17] have resulted in significant improvements of detection performance and speed.