I. Introduction
In recent years, deep learning (DL) has achieved great success in fields such as computer vision [1], natural language processing [2], and speech recognition [3]. DL workloads have traditionally been served by cloud servers, but latency constraints, privacy concerns, and dependence on network connectivity make deploying models directly on edge devices increasingly urgent [4]. DL inference is computationally intensive, whereas edge devices are constrained by power consumption and form factor [5]: they are typically not equipped with accelerators to improve computing efficiency, and their memory resources are strictly limited. At the same time, inference is latency-sensitive; users expect tasks to be executed efficiently and accurate predictions to be delivered within a short period of time.