I. Introduction
Despite deep learning's tremendous success across a variety of tasks [2], [11], [23], [38], it remains difficult to deploy deep neural networks in real-world applications due to computational and memory constraints. To address this problem, many attempts [12], [26], [48], [49] have been made to reduce the computational cost of deep learning models, with Knowledge Distillation (KD) [16] being one of them. KD is a network training strategy that transfers knowledge from a high-capacity teacher model to a low-capacity student model during training, yielding a student with a better accuracy-efficiency trade-off at inference time.
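As a concrete illustration, the vanilla KD objective of [16] combines a temperature-softened KL term between teacher and student logits with the usual cross-entropy on ground-truth labels. The following is a minimal PyTorch-style sketch, not the exact implementation used in this work; the function name and the default values of the temperature T and weighting alpha are illustrative.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic KD loss [16]: soft-target KL term plus hard-label cross-entropy.

    T (temperature) and alpha (soft/hard weighting) are illustrative defaults.
    """
    # Soften both distributions with temperature T; the KL term is scaled by
    # T^2 so its gradient magnitude stays comparable to the cross-entropy term.
    soft_term = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard_term = F.cross_entropy(student_logits, labels)
    return alpha * soft_term + (1.0 - alpha) * hard_term
```

The teacher is run in inference mode to produce `teacher_logits`, so only the student's parameters are updated by this loss.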