1. Introduction
For computer vision applications on resource-constrained devices, learning portable neural networks that still deliver satisfactory prediction accuracy is a key problem. Knowledge distillation (KD) [2], [12], [17], [22], [25], [33], which leverages a pre-trained large teacher network to guide the training of a smaller target student network on the same training data, has become a mainstream solution. Conventional KD methods assume that the original training data is always available. In practice, however, access to the source dataset on which the teacher network was trained is often infeasible, owing to privacy, security, proprietary, or sheer-size concerns. To relax this constraint on training data, knowledge distillation under a data-free regime has recently attracted increasing attention.
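To make the conventional, data-available KD setting concrete, the following is a minimal sketch of a typical distillation objective: a temperature-softened KL-divergence term between teacher and student logits blended with the usual hard-label cross-entropy. It assumes PyTorch and hypothetical pre-built `teacher` and `student` models, and is illustrative only rather than the method considered in this paper.

```python
# Minimal sketch of conventional (data-available) knowledge distillation,
# assuming PyTorch and hypothetical `teacher` / `student` models.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target KL term (scaled by T^2) blended with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def distill_step(student, teacher, images, labels, optimizer):
    """One optimization step on a batch drawn from the original training data."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(images)   # teacher predictions, no gradients
    s_logits = student(images)       # student predictions
    loss = kd_loss(s_logits, t_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that every step above consumes `images` and `labels` drawn from the original training set; the data-free regime discussed next removes exactly this dependency.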