I. Introduction
Recently, Big Data analytics has seen a paradigm shift towards Edge AI (i.e., pushing intelligence towards the edge). This shift is mainly driven by the need to move computation toward data sources in an effort to reduce communication costs as well as enhance privacy and security [1], [2]. These efforts led to the formation of the Federated Learning (FL) paradigm, which transformed traditional distributed machine learning (ML) training methods. Many service providers, such as Google, Facebook, and Apple, use FL to train global models for natural language processing (NLP) and computer vision (CV) tasks to serve applications such as virtual keyboards, object detection, image classification, and recommendation systems [3]–[7]. FL is also commonly used with distributed data such as medical images [8] and smart camera images [9].

In FL, the central server managing the models ships them to the clients' end-devices, on which training is performed locally to preserve the privacy and security of user data. Due to the lack of control over client devices, FL environments are highly heterogeneous, which presents a variety of challenges. The process is participatory and relies on the availability of the clients and their data: clients produce and store the application data used to locally train the model, and they contribute their model updates to the central server for incorporation into the global model.

Time-to-accuracy is a vital performance measure for training quality and is the focus of much work in this area [2], [10]–[13]. Generally, the objective is to reduce the time-to-accuracy by improving statistical efficiency (fewer training rounds to reach a target accuracy) and reducing the training time per round. Reducing the training time requires time-efficient training algorithms, hardware acceleration, and bandwidth-efficient communication methods on the devices [2], [13], [14].
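To make the FL training loop described above concrete, the following is a minimal, self-contained Python sketch of one communication round: the server ships the global model to clients, each client trains locally on its private data, and the server aggregates the returned updates. It is illustrative only; the weighted-averaging aggregation follows the widely used FedAvg scheme rather than any specific system from the cited work, and the function names (`local_train`, `federated_round`) and the toy least-squares objective are hypothetical.

```python
import numpy as np

def local_train(global_weights, client_data, lr=0.1, epochs=1):
    """Hypothetical local update: full-batch gradient steps on the
    client's private data for a toy least-squares objective."""
    w = global_weights.copy()
    X, y = client_data
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of MSE loss
        w -= lr * grad
    return w  # only the model update leaves the device, not the data

def federated_round(global_weights, clients):
    """One FL round: ship the model, train locally on each client,
    then aggregate by data-size-weighted averaging (FedAvg-style)."""
    updates, sizes = [], []
    for data in clients:
        updates.append(local_train(global_weights, data))
        sizes.append(len(data[1]))
    coeffs = np.array(sizes) / sum(sizes)
    return sum(c * u for c, u in zip(coeffs, updates))

# Toy simulation: three clients with heterogeneous amounts of
# private linear-regression data drawn around the same true model.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n in (50, 80, 30):
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=n)))

w = np.zeros(2)
for _ in range(20):  # communication rounds
    w = federated_round(w, clients)
print(w)  # approaches true_w without pooling raw client data
```

In this framing, time-to-accuracy is roughly the number of such rounds needed to reach a target accuracy multiplied by the wall-clock time per round, which is why both statistical efficiency and per-round training/communication time matter.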