I. Introduction
In recent years, visual machine learning has become widespread in the Internet of Things (IoT), driving a wide range of intelligent video analytics applications such as traffic control, environment monitoring, and autonomous driving. Central to these applications is detecting objects in continuously captured video frames. An ideal object detection engine for these deep vision applications should be both accurate and real-time. However, most object detection models are based on deep neural networks (DNNs) and place a heavy computational burden on portable devices, which leads to task timeouts, energy exhaustion, and other serious issues [1].