I. Introduction
With the development of deep neural network (DNN)-based computer vision (CV) models and an increasing number of camera deployments, recent years have witnessed explosive growth in video analytics (VA) tasks, such as object detection and semantic segmentation [1], [2], [3]. Currently, most front-end devices stream their captured videos to resource-rich cloud servers to achieve high analysis accuracy. However, cloud servers typically reside far away from these front-end devices, so transmitting large-volume video incurs significant network bandwidth consumption. Streaming filtered frames to the cloud server lowers the transmitted volume, but it can degrade inference accuracy owing to the loss of image details. On the other hand, analyzing videos directly on front-end devices avoids the network cost; however, the limited computational resources of most front-end devices cannot support high-accuracy VA when it is needed. Therefore, a solution that achieves highly accurate VA at an affordable network cost is urgently needed.