I. Introduction
Federated learning (FL) has emerged as a promising distributed deep learning (DL) paradigm that enables efficient and privacy-preserving collaborative training across many devices [1]. Meanwhile, with their ever-growing computing capabilities, mobile devices can now train increasingly complex deep neural network (DNN) models on device. FL over mobile devices has enabled numerous applications, such as keyboard prediction [2], physical hazard detection in smart homes [3], and health event detection [4].

A key obstacle to unleashing the full potential of FL over mobile devices is the long training delay, which is especially problematic for delay-sensitive mobile applications. A primary cause is device heterogeneity among FL clients, i.e., their diverse computing capabilities and communication conditions [5]. Among these devices, slow-performing ones become stragglers in FL training: the server and the remaining devices may need to wait for all stragglers to finish their local updates before aggregation can proceed. The FL training delay is therefore bottlenecked by the slowest devices.
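As a minimal illustration of this bottleneck (the notation here is ours, not taken from the cited works), the per-round latency of synchronous FL is governed by the slowest participating client:
\[
T_{\text{round}} \;=\; \max_{k \in \mathcal{S}} \left( T_k^{\text{cmp}} + T_k^{\text{com}} \right),
\]
where $\mathcal{S}$ denotes the set of clients selected in the round, $T_k^{\text{cmp}}$ is client $k$'s local computation time, and $T_k^{\text{com}}$ is its time to upload the local model update. Under this model, a single straggler with a large $T_k^{\text{cmp}} + T_k^{\text{com}}$ dominates the round duration, no matter how fast the remaining clients are.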