I. Introduction
Recently, deep neural networks have achieved superior performance in a variety of applications, such as computer vision [1]–[4] and natural language processing [5], [6]. However, this high performance has come with architectures that grow ever deeper and wider, incurring a high cost in computation and memory at inference time. This makes it a great burden to deploy such models on edge-computing systems, such as embedded devices and mobile phones. Therefore, many methods [7]–[11] have been proposed to reduce the computational complexity and storage requirements of deep neural networks. Lightweight networks, such as Inception [12], MobileNet [13], ShuffleNet [14], SqueezeNet [15], and CondenseNet [16], have been proposed to reduce the network size as much as possible while maintaining high recognition accuracy. All of the above-mentioned methods focus on physically reducing the internal redundancy of the model to obtain a shallow and thin architecture. Nevertheless, how to train such a reduced network to achieve high performance remains an unresolved issue.