
Highlight Every Step: Knowledge Distillation via Collaborative Teaching



Abstract:

High storage and computational costs hinder the deployment of deep neural networks on resource-constrained devices. Knowledge distillation (KD) aims to train a compact student network by transferring knowledge from a larger pretrained teacher model. However, most existing KD methods ignore the valuable information generated during the training process and rely only on the final training results. In this article, we propose a new collaborative teaching KD (CTKD) strategy that employs two special teachers. Specifically, one teacher trained from scratch (i.e., the scratch teacher) assists the student step by step using its temporary outputs, forcing the student to follow a near-optimal path toward the final logits with high accuracy. The other, pretrained teacher (i.e., the expert teacher) guides the student to focus on a critical region that is more useful for the task. Combining the knowledge from these two special teachers can significantly improve the performance of the student network in KD. Experimental results on the CIFAR-10, CIFAR-100, SVHN, Tiny ImageNet, and ImageNet datasets verify that the proposed KD method is efficient and achieves state-of-the-art performance.
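Although the implementation details appear later in the article, the abstract's description of CTKD suggests a three-term training objective: the usual cross-entropy loss, a soft-target distillation term computed against the scratch teacher's temporary logits at the current training step, and an attention-matching term that steers the student toward the regions highlighted by the expert teacher. The following is a minimal PyTorch-style sketch of such an objective, given under stated assumptions; the function names (kd_loss, at_loss, ctkd_step), the attention-map formulation, and the weights alpha, beta, and temperature T are illustrative, not the authors' implementation.

    import torch
    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, T=4.0):
        # Soft-target distillation loss: KL divergence between
        # temperature-softened teacher and student distributions.
        p_s = F.log_softmax(student_logits / T, dim=1)
        p_t = F.softmax(teacher_logits / T, dim=1)
        return F.kl_div(p_s, p_t, reduction="batchmean") * (T * T)

    def attention_map(feat):
        # Spatial attention map of a feature tensor (N, C, H, W):
        # channel-wise mean of squared activations, L2-normalized per sample.
        a = feat.pow(2).mean(dim=1).flatten(1)   # (N, H*W)
        return F.normalize(a, dim=1)

    def at_loss(student_feat, expert_feat):
        # Attention-matching term: align the student's attention map
        # with the expert teacher's attention map.
        return (attention_map(student_feat) - attention_map(expert_feat)).pow(2).mean()

    def ctkd_step(student_out, scratch_teacher_out, student_feat, expert_feat,
                  labels, alpha=0.5, beta=1e3, T=4.0):
        # One training objective combining:
        #  - cross-entropy with the ground-truth labels,
        #  - KD from the scratch teacher's current (temporary) logits,
        #  - attention matching against the pretrained expert teacher.
        # The weights alpha and beta are placeholder values.
        ce = F.cross_entropy(student_out, labels)
        kd = kd_loss(student_out, scratch_teacher_out.detach(), T)
        at = at_loss(student_feat, expert_feat.detach())
        return ce + alpha * kd + beta * at

In this sketch, the scratch teacher would be trained alongside the student so that scratch_teacher_out changes from step to step, whereas expert_feat comes from a frozen, fully pretrained network.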
Published in: IEEE Transactions on Cybernetics ( Volume: 52, Issue: 4, April 2022)
Page(s): 2070 - 2081
Date of Publication: 28 July 2020

PubMed ID: 32721909



I. Introduction

Recently, deep neural networks have achieved superior performance in a variety of applications, such as computer vision [1]–[4] and natural language processing [5], [6]. However, along with this high performance, deep neural network architectures have become much deeper and wider, which incurs high computation and memory costs at inference time. Deploying such models on edge-computing systems, such as embedded devices and mobile phones, is therefore a great burden. Consequently, many methods [7]–[11] have been proposed to reduce the computational complexity and storage requirements of deep neural networks. Lightweight networks, such as Inception [12], MobileNet [13], ShuffleNet [14], SqueezeNet [15], and CondenseNet [16], have been proposed to reduce the network size as much as possible while maintaining high recognition accuracy. All of the above-mentioned methods focus on physically reducing the internal redundancy of the model to obtain a shallow and thin architecture. Nevertheless, how to train the reduced network to high performance remains an unresolved issue.
