I. Introduction
Cross-device Federated Learning (FL) is an emerging distributed machine learning approach that enables end devices such as mobile phones to train models cooperatively by sharing models rather than raw, privacy-sensitive data. It has been widely deployed in production for personalization services such as on-device item ranking, next-word prediction, content suggestions for on-device keyboards, and real-time e-commerce recommendations [1]–[5]. FL models are typically trained iteratively: in each round, a set of dynamically selected end devices (EDs) first download the current model from the central FL server (FLS) to perform on-device training, and then upload their local gradients back to the FLS, which aggregates the received gradients to produce the new model [1]. As the number of EDs grows, the central FLS inevitably becomes the bottleneck of the entire FL system. Optimizing the performance of FL systems, and in particular removing the bottleneck effect of the FLS, is therefore key to supporting very large-scale federated learning tasks [3].
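To make the iterative training protocol above concrete, the following is a minimal sketch of one FL round in the FedAvg style, assuming a simple linear model trained with local gradient steps on synthetic data; the function names (`local_update`, `fl_round`) and the sample-size-weighted aggregation rule are illustrative assumptions, not the specific design of the system described in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_update(global_w, X, y, lr=0.1, epochs=1):
    """On-device step: start from the downloaded global model, run a few
    local gradient steps, and return the model delta to upload."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return w - global_w                          # update sent to the FLS

def fl_round(global_w, selected_clients):
    """Server-side step: aggregate the uploaded updates, weighted by the
    number of local samples, and apply them to the global model."""
    deltas, sizes = [], []
    for X, y in selected_clients:
        deltas.append(local_update(global_w, X, y))
        sizes.append(len(y))
    weights = np.array(sizes) / sum(sizes)
    return global_w + sum(w_i * d for w_i, d in zip(weights, deltas))

# Synthetic example: five end devices, each holding its own local data slice.
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(50, 2))
    y = X @ true_w + 0.1 * rng.normal(size=50)
    clients.append((X, y))

w = np.zeros(2)
for _ in range(20):
    w = fl_round(w, clients)
print("learned weights:", w)   # converges toward [2, -1]
```

Even in this toy sketch, every selected device communicates with a single server each round, which illustrates why the central FLS becomes the scalability bottleneck as the device population grows.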