Introduction
Three major developments are currently transforming the ways in which data are created and processed. First, with the advent of the Internet of Things (IoT), the number of intelligent devices in the world has grown rapidly in recent years. Many of these devices are equipped with various sensors and increasingly potent hardware that allow them to collect and process data at unprecedented scales [1]–[3].
In a concurrent development, deep learning has revolutionized the ways that information can be extracted from data resources, with groundbreaking successes in areas such as computer vision, natural language processing, and voice recognition, among many others [4]–[9]. Deep learning scales well with growing amounts of data, and its astounding successes in recent times can be at least partly attributed to the availability of very large data sets for training. There lies, therefore, huge potential in harnessing the rich data provided by IoT devices for training and improving deep learning models [10].
At the same time, data privacy has become a growing concern for many users. Multiple cases of data leakage and misuse in recent times have demonstrated that the centralized processing of data puts the end users' privacy at high risk. As IoT devices usually collect data in private environments, often even without explicit awareness of the users, these concerns apply with particular force. It is, therefore, generally not an option to share this data with a centralized entity that could conduct training of a deep learning model. In other situations, local processing of the data might be desirable for reasons such as increased autonomy of the local agent.
This leaves us facing the following dilemma: How are we going to make use of the rich combined data of millions of IoT devices for training deep learning models if this data cannot be stored at a centralized location?
Federated learning resolves this issue as it allows multiple parties to jointly train a deep learning model on their combined data, without any of the participants having to reveal their data to a centralized server [10]. This form of privacy-preserving collaborative learning is achieved by following a simple three-step protocol illustrated in Fig. 1. In the first step, all participating clients download the latest master model from the server. In the second step, every client improves the downloaded model based on its local training data by running stochastic gradient descent (SGD) and computes a local weight update. In the third step, all clients upload their weight updates to the server, where they are aggregated by averaging to produce the new master model, and the next communication round begins.
Fig. 1. Federated learning with a parameter server. Illustrated is one communication round of distributed SGD. (a) Clients synchronize with the server. (b) Clients compute a weight update independently based on their local data. (c) Clients upload their local weight updates to the server, where they are averaged to produce the new master model.
A major issue in federated learning is the massive communication overhead that arises from sending around the model updates. When naively following the protocol described earlier, every participating client has to communicate a full model update during every training iteration. Every such update is of the same size as the trained model, which can be in the range of gigabytes for modern architectures with millions of parameters [15], [16]. Over the course of several hundred thousand training iterations on big data sets, the total communication for every client can easily grow to more than a petabyte [17]. Consequently, if communication bandwidth is limited or communication is costly, naive federated learning can become unproductive or even completely unfeasible.
The total amount of bits that have to be uploaded and downloaded by every client during training is given by \begin{align*} \texttt {b}^{\mathrm{ up/down}} \in \mathcal {O}(\underbrace {N_{\mathrm{ iter}} \times f}_{{\#\text { updates}}} \times \underbrace {| \mathcal {W}|\times (H(\Delta \mathcal {W} ^{\mathrm{ up/down}})+\eta)}_{\text {update size}}) \\\tag{1}\end{align*} where $N_{\mathrm{iter}}$ is the total number of training iterations, $f$ is the communication frequency, $|\mathcal{W}|$ is the number of model parameters, $H(\Delta\mathcal{W}^{\mathrm{up/down}})$ is the entropy of the communicated weight updates, and $\eta$ is the inefficiency of the encoding relative to this entropy.
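As a rough illustration of (1) (the numbers are assumed for the example and are not measurements from this article): a model with $10^{9}$ parameters occupies about 4 GB at 32-bit precision; training it for $2.5\times 10^{5}$ iterations with communication frequency $f=1$ and uncompressed updates ($H=32$ bit per parameter, $\eta=0$) requires \begin{equation*} \texttt{b}^{\mathrm{up}} \approx 2.5\times 10^{5} \times 10^{9} \times 32~\text{bit} = 8\times 10^{15}~\text{bit} \approx 1~\text{PB}\end{equation*} of upload traffic per client, which is consistent with the order of magnitude quoted above.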
Challenges of the Federated Learning Environment
Before we can consider ways to reduce the amount of communication, we first have to take into account the unique characteristics that distinguish federated learning from other distributed training settings such as parallel training (compare also with [10]). In federated learning, the distribution of both training data and computational resources is a fundamental and fixed property of the learning environment. This entails the following challenges.
Unbalanced and non-i.i.d. data: As the training data present on the individual clients is collected by the clients themselves based on their local environment and usage pattern, both the size and the distribution of the local data sets will typically vary heavily between different clients.
Large number of clients: Federated learning environments may comprise many millions of participants [18]. Furthermore, as the quality of the collaboratively learned model is determined by the combined available data of all clients, collaborative learning environments have a natural tendency to grow.
Parameter server: Once the number of clients grows beyond a certain threshold, direct communication of weight updates becomes unfeasible because the workload for both communication and aggregation of updates grows linearly with the number of clients. In federated learning, it is, therefore, unavoidable to communicate via an intermediate parameter server. This reduces the communication per client and communication round to a single upload of a local weight update to and a single download of the aggregated update from the server and moves the workload of aggregation away from the clients. Communicating via a parameter server, however, introduces an additional challenge to communication-efficient distributed training, as now both the upload to the server and the download from the server need to be compressed in order to reduce communication time and energy consumption.
Partial participation: In the federated learning setting for IoT, it can generally not be guaranteed that all clients participate in every communication round. Devices might lose their connection, run out of battery, or cease to contribute to the collaborative training for other reasons.
Limited battery and memory: Mobile and embedded devices often are not connected to a power grid. Instead, their capacity to run computations is limited by a finite battery. Performing iterations of SGD is notoriously expensive for deep neural networks. It is, therefore, necessary to keep the number of gradient evaluations per client as small as possible. Mobile and embedded devices also typically have only very limited memory. As the memory footprint of SGD grows linearly with the batch size, this might force the devices to train on very small batch sizes.
Based on the above-mentioned characterization of the federated learning environment, we conclude that a communication-efficient distributed training algorithm for federated learning needs to fulfil the following requirements.
(R1) It should compress both upstream and downstream communications.
(R2) It should be robust to non-i.i.d., small batch sizes, and unbalanced data.
(R3) It should be robust to large numbers of clients and partial client participation.
Contribution
In this article, we will demonstrate that none of the existing methods proposed for communication-efficient federated learning satisfies all of these requirements (see Table I). More concretely, we will show that the methods that are able to compress both upstream and downstream communications are very sensitive to non-i.i.d. data distributions, while the methods that are more robust to this type of data do not compress the downstream (see Section V). We will then proceed to construct a new efficient communication protocol for federated learning that resolves these issues and meets all requirements (R1)–(R3). We provide a convergence analysis of our method as well as extensive empirical results on four different neural network architectures and data sets that demonstrate that the sparse ternary compression (STC) protocol is superior to the existing compression schemes in that it requires both fewer gradient evaluations and communicated bits to converge to a given target accuracy (see Section IX). These results also extend to the i.i.d. regime.
Related Work
In the broader realm of communication-efficient distributed deep learning, a wide variety of methods has been proposed to reduce the amount of communication during the training process. Using (1) as a reference, we can organize the substantial existing research body on communication-efficient distributed deep learning into three different groups.
Communication delay methods reduce the communication frequency $f$. McMahan et al. [10] propose federated averaging, where instead of communicating after every iteration, every client performs multiple iterations of SGD to compute a weight update. McMahan et al. observe that on different convolutional and recurrent neural network architectures, communication can be delayed for up to 100 iterations without significantly affecting the convergence speed as long as the data are distributed among the clients in an i.i.d. manner. The amount of communication can be reduced even further with longer delay periods; however, this comes at the cost of an increased number of gradient evaluations. In a follow-up work, Konečnỳ et al. [27] combine this communication delay with random sparsification and probabilistic quantization. They restrict the clients to learn random sparse weight updates or force random sparsity on them afterward ("structured" versus "sketched" updates) and combine this sparsification with probabilistic quantization. Their method, however, significantly slows down convergence speed in terms of SGD iterations. Communication delay methods automatically reduce both upstream and downstream communication and are proven to work with large numbers of clients and partial client participation.
Sparsification methods reduce the entropy $H(\Delta \mathcal {W})$ of the updates by restricting changes to only a small subset of the parameters. Strom [24] presents an approach (later modified by [26]) in which only gradients with a magnitude greater than a certain predefined threshold are sent to the server. All other gradients are accumulated in a residual. This method is shown to achieve upstream compression rates of up to three orders of magnitude on an acoustic modeling task. In practice, however, it is hard to choose appropriate values for the threshold, as it may vary a lot for different architectures and even different layers. To overcome this issue, Aji and Heafield [23] instead fix the sparsity rate and only communicate the fraction $p$ of entries with the biggest magnitude of each gradient while also collecting all other gradients in a residual. At a sparsity rate of $p=0.001$, their method only slightly degrades the convergence speed and final accuracy of the trained model. Lin et al. [25] present minor modifications to the work of Aji and Heafield [23] that even close this small performance gap. Sparsification methods have been proposed primarily with the intention to speed up parallel training in the data center. Their convergence properties in the much more challenging federated learning environments have not yet been investigated. Sparsification methods (in their existing form) primarily compress the upstream communication, as the sparsity patterns on the updates from different clients will generally differ. If the number of participating clients is greater than the inverse sparsity rate, which can easily be the case in federated learning, the downstream update will not even be compressed at all.
Dense quantization methods reduce the entropy of the weight updates by restricting all updates to a reduced set of values. Bernstein et al. [22] propose signSGD, a compression method with theoretical convergence guarantees on i.i.d. data that quantizes every gradient update to its binary sign, thus reducing the bit size per update by a factor of $\times 32$. signSGD also incorporates download compression by aggregating the binary updates from all clients by means of a majority vote. Other authors propose to stochastically quantize the gradients during upload in an unbiased way (TernGrad [19], quantized stochastic gradient descent (QSGD) [20], ATOMO [21]). These methods are theoretically appealing, as they inherit the convergence properties of regular SGD under relatively mild assumptions. However, their empirical performance and compression rates do not match those of sparsification methods.
Out of all the above-listed methods, only federated averaging and signSGD compress both the upstream and downstream communications. All other methods are of limited utility in the federated learning setting defined in Section II, as they leave the communication from the server to the clients uncompressed.
Notation: In the following, calligraphic
Limitations of Existing Compression Methods
The related work on efficient distributed deep learning almost exclusively considers i.i.d. data distributions among the clients, i.e., they assume unbiasedness of the local gradients with respect to the full-batch gradient according to \begin{equation*} \mathbb {E}_{x\sim p_{i}}[\nabla _{\mathcal {W}} l(x, \mathcal {W})] = \nabla _ {\mathcal {W}} R(\mathcal {W})\quad \forall i=1,\ldots,n \tag{2}\end{equation*} where $l(x, \mathcal{W})$ denotes the loss of model $\mathcal{W}$ on data point $x$, $p_{i}$ the data-generating distribution of client $i$, and $R(\mathcal{W})$ the risk over the combined data of all clients.
While this assumption is reasonable for parallel training where the distribution of data among the clients is chosen by the practitioner, it is typically not valid in the federated learning setting where we can generally only hope for unbiasedness in the mean \begin{equation*} \frac {1}{n}\sum _{i=1}^{n}\mathbb {E}_{x^{i}\sim p_{i}}[\nabla _ {\mathcal {W}} l(x^{i}, \mathcal {W})] = \nabla _ {\mathcal {W}} R(\mathcal {W})\tag{3}\end{equation*}
Instead, every local gradient is only an unbiased estimator of the gradient of the client's local risk $R_{i}$, which generally differs from the gradient of the global risk \begin{align*} \mathbb {E}_{x\sim p_{i}}[\nabla _ {\mathcal {W}} l(x, \mathcal {W})] = \nabla _ {\mathcal {W}} R_{i}(\mathcal {W}) \neq \nabla _ {\mathcal {W}} R(\mathcal {W})\quad \forall i=1,\ldots,n. \\\tag{4}\end{align*}
As it violates assumption (2), a non-i.i.d. distribution of the local data renders existing convergence guarantees, as formulated in [19]–[21] and [29], inapplicable and has dramatic effects on the practical performance of communication-efficient distributed training algorithms, as we will demonstrate in the following experiments.
A. Preliminary Experiments
We run preliminary experiments with a simplified version of the well-studied 11-layer VGG11 network [28], which we train on the CIFAR-10 [30] data set in a federated learning setup using ten clients. For the i.i.d. setting, we split the training data randomly into equally sized shards and assign one shard to every one of the clients. For the "non-i.i.d." settings, we instead split the data among the clients such that every client only holds examples from exactly two, respectively one, of the ten classes. We additionally train a logistic regression classifier on MNIST and Fashion-MNIST under the same data splits.
B. Results
Fig. 2 shows the convergence speed in terms of gradient evaluations for the two models when trained using different methods for communication-efficient federated learning. We observe that while all compression methods achieve comparably fast convergence in terms of gradient evaluations on i.i.d. data, closely matching the uncompressed baseline (black line), they suffer considerably in the non-i.i.d. training settings. As this trend can be observed also for the logistic regression model, we can conclude that the underlying phenomenon is not unique to deep neural networks and also carries over to convex objectives. We will now analyze these results in detail for the different compression methods.
Fig. 2. Convergence speed when using different compression methods during the training of VGG11* on CIFAR-10 and logistic regression on MNIST and Fashion-MNIST in a distributed setting with ten clients for i.i.d. and non-i.i.d. data. In the non-i.i.d. cases, every client only holds examples from exactly two, respectively one, of the ten classes in the data set. All compression methods suffer from degraded convergence speed in the non-i.i.d. situation, but sparse top-$k$ sparsification is affected the least.
1) Federated Averaging:
Most noticeably, federated averaging [10] (see orange line in Fig. 2), although specifically proposed for the federated learning setting, suffers considerably from non-i.i.d. data. This observation is consistent with Zhao et al. [32] who demonstrated that model accuracy can drop by up to 55% in non-i.i.d. learning environments compared to the i.i.d. ones. They attribute the loss in accuracy to the increased weight divergence between the clients and propose to side-step the problem by assigning a shared public i.i.d. data set to all clients. While this approach can indeed create more accurate models, it also has multiple shortcomings, the most crucial one being that we generally cannot assume the availability of such a public data set. If a public data set were to exist, one could use it to pretrain a model at the server, which is not consistent with the assumptions typically made in federated learning. Furthermore, if all clients share (part of) the same public data set, overfitting to this shared data can become a serious issue. This effect will be particularly severe in highly distributed settings where the number of data points on every client is small. Finally, even when sharing a relatively large data set between the clients, the original accuracy achieved in the i.i.d. situation cannot be fully restored. For these reasons, we believe that the data-sharing strategy proposed by [32] is an insufficient workaround to the fundamental problem of federated averaging having convergence issues on non-i.i.d. data.
2) SignSGD:
The quantization method signSGD [29] (see green line in Fig. 2) suffers from even worse stability issues in the non-i.i.d. learning environment. The method completely fails to converge on the CIFAR benchmark, and even for the convex logistic regression objective, the training plateaus at a substantially degraded accuracy.
To understand the reasons for these convergence issues, we have to investigate how likely it is for a single mini-batch gradient to have the "correct" sign. Let \begin{equation*} g^{k}_{w}=\frac {1}{k}\sum _{i=1}^{k}\nabla _{w} l(x_{i}, \mathcal {W})\tag{5}\end{equation*} denote the mini-batch gradient of parameter $w$, computed on a batch of $k$ data points $x_{i}$, and let $g_{w}$ denote the corresponding full-batch gradient. We can then define the probability that the mini-batch gradient has the same sign as the full-batch gradient \begin{equation*} \alpha _{w}(k)=\mathbb {P}\big [\text {sign}\big (g^{k}_{w}\big)=\text {sign}(g_{w})\big]\tag{6}\end{equation*} and its average over all parameters of the network \begin{equation*} \alpha (k) = \frac {1}{| \mathcal {W}|}\sum _{w\in \mathcal {W} }\alpha _{w}(k).\tag{7}\end{equation*}
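To make the definitions (5)–(7) concrete, the following sketch (not part of the original article) estimates $\alpha_{w}(k)$ and $\alpha(k)$ by Monte Carlo sampling from a matrix of per-example gradients; the array `per_example_grads` is a hypothetical stand-in for gradients computed on a client's local data.

```python
import numpy as np

def estimate_alpha(per_example_grads, k, n_trials=1000, seed=None):
    """Monte Carlo estimate of alpha_w(k) and alpha(k) = mean_w P[sign(g_w^k) == sign(g_w)].

    per_example_grads: array of shape (N, d) holding one flattened gradient per
    data point; the full-batch gradient is their mean."""
    rng = np.random.default_rng(seed)
    N, d = per_example_grads.shape
    full_sign = np.sign(per_example_grads.mean(axis=0))      # sign(g_w)
    matches = np.zeros(d)
    for _ in range(n_trials):
        idx = rng.choice(N, size=k, replace=False)           # draw a mini-batch of size k
        batch_sign = np.sign(per_example_grads[idx].mean(axis=0))
        matches += (batch_sign == full_sign)
    alpha_w = matches / n_trials                             # per-parameter alpha_w(k)
    return alpha_w, alpha_w.mean()                           # (alpha_w(k), alpha(k))

# toy usage with synthetic per-example gradients carrying a weak common signal
grads = np.random.randn(512, 1000) + 0.05
alpha_w, alpha = estimate_alpha(grads, k=8)
print(f"alpha(8) ~ {alpha:.3f}")
```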
Fig. 3 (left) shows, as an example, the distribution of the values $\alpha_{w}(k)$ over the parameters of the network.
Fig. 3. Left: distribution of values for $\alpha_{w}(k)$ over the parameters of the network.
3) Top-$k$ Sparsification:
Out of all existing compression methods, top-$k$ sparsification is affected the least by the non-i.i.d. distribution of the client data, both for the VGG11* model and for the convex logistic regression objective (see Fig. 2). In its existing form, however, it only compresses the upstream communication.
Table I summarizes our findings. None of the existing compression methods both supports download compression and works properly with non-i.i.d. data.
Sparse Ternary Compression
Top-$k$ sparsification proved to be the most robust of the existing compression methods in our preliminary experiments. We, therefore, adopt it as the basis of our new compression protocol and extend it in the following ways.
We will further increase the efficiency of our method by employing quantization and optimal lossless coding of the weight updates.
We will incorporate downstream compression into the method to allow for efficient communication from server to clients.
We will implement a caching mechanism to keep the clients synchronized in case of partial client participation.
A. Ternarizing Weight Updates
Regular top-$k$ sparsification communicates the values of the $k$ largest-magnitude elements of the weight update at full 32-bit precision, together with their positions. We adapt the method described in [17] to the federated learning setting and additionally quantize the remaining top-$k$ elements: every transmitted element is set to the mean magnitude $\mu$ of all top-$k$ elements, multiplied by its sign, so that the compressed update only contains the ternary values $\{-\mu, 0, \mu\}$ (see Algorithm 1).
Algorithm 1 STC
input: flattened tensor $T\in\mathbb{R}^{d}$, sparsity $k$
output: sparse ternary tensor $\tilde{T}\in\{-\mu,0,\mu\}^{d}$
$\Omega \leftarrow$ indices of the $k$ elements of $T$ with the largest magnitude
$\mu \leftarrow \frac{1}{k}\sum_{j\in\Omega}|T_{j}|$
$\tilde{T}_{j} \leftarrow \mu\,\mathrm{sign}(T_{j})$ for $j\in\Omega$; $\tilde{T}_{j} \leftarrow 0$ otherwise
return $\tilde{T}$
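The following is a minimal NumPy sketch of the compression operator described above (top-$k$ selection, common magnitude $\mu$, signs), together with the client-side error accumulation; it is an illustration under our reading of Algorithm 1, not the authors' reference implementation.

```python
import numpy as np

def stc(tensor, k):
    """Sparse ternary compression: keep the k highest-magnitude entries of the
    flattened tensor, replace them by mu * sign, zero out everything else."""
    x = tensor.ravel()
    if k >= x.size:
        return tensor.copy()
    topk_idx = np.argpartition(np.abs(x), -k)[-k:]   # indices of the k largest magnitudes
    mu = np.abs(x[topk_idx]).mean()                  # common magnitude of all kept entries
    out = np.zeros_like(x)
    out[topk_idx] = mu * np.sign(x[topk_idx])        # ternary values {-mu, 0, +mu}
    return out.reshape(tensor.shape)

# client-side usage with error accumulation (residual feedback)
residual = np.zeros((256, 128))
update = np.random.randn(256, 128) * 0.01            # stand-in for a local weight update
compressed = stc(residual + update, k=int(0.01 * update.size))
residual = residual + update - compressed             # keep what was not transmitted
```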
This ternarization step reduces the entropy of the update from \begin{equation*} H_{\mathrm{ sparse}} = -p\log _{2}(p)-(1-p)\log _{2}(1-p)+32p\tag{8}\end{equation*} bits per element, for sparse updates with 32-bit values, to \begin{equation*} H_{\mathrm{ STC}} = -p\log _{2}(p)-(1-p)\log _{2}(1-p)+p\tag{9}\end{equation*} bits per element, since every nonzero element now only carries one additional bit of sign information.
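To get a feeling for these numbers (the sparsity rate is chosen for illustration only): at $p=0.01$, the binary entropy term amounts to roughly $0.08$ bit per element, so that \begin{equation*} H_{\mathrm{sparse}} \approx 0.08 + 0.32 = 0.40~\text{bit}, \qquad H_{\mathrm{STC}} \approx 0.08 + 0.01 = 0.09~\text{bit} \end{equation*} per element, corresponding to compression factors of roughly $\times 80$ and $\times 350$ relative to dense 32-bit updates.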
Using a theoretical framework developed by Stich et al. [33], we can prove the convergence of STC under standard assumptions on the loss function. The proof relies on bounding the impact of the perturbation caused by the compression operator. This is formalized in the following definition.
Definition 1 (k-Contraction)[33]:
For a parameter $0 < k \leq d$, a (possibly randomized) operator $\text{comp}: \mathbb{R}^{d}\to\mathbb{R}^{d}$ is called a $k$-contraction if \begin{equation*} \mathbb {E}\|x-\text {comp}(x)\|^{2}\leq \left ({1-\frac {k}{d}}\right)\|x\|^{2} \quad \forall x\in \mathbb {R}^{d}.\tag{10}\end{equation*}
We can show that STC indeed is a $\tilde{k}$-contraction in the sense of Definition 1.
Lemma 2:
STC$_{k}$ as defined in Algorithm 1 is a $\tilde{k}$-contraction with \begin{equation*} 0 < \tilde {k} = \frac {\|\text {top}_{k}(x)\|_{1}^{2}}{k\|x\|_{2}^{2}}d \leq d.\tag{11}\end{equation*}
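As a numerical sanity check (not part of the original proof), the following sketch verifies the contraction property (10) with $\tilde{k}$ from (11) for a random vector, using the ternarization operator as sketched above:

```python
import numpy as np

def stc(x, k):
    """Top-k ternarization: keep the k largest-magnitude entries at their mean magnitude."""
    idx = np.argpartition(np.abs(x), -k)[-k:]
    mu = np.abs(x[idx]).mean()
    out = np.zeros_like(x)
    out[idx] = mu * np.sign(x[idx])
    return out

rng = np.random.default_rng(0)
d, k = 10_000, 100
x = rng.standard_normal(d)

lhs = np.sum((x - stc(x, k)) ** 2)                              # ||x - comp(x)||^2
k_tilde = np.sort(np.abs(x))[-k:].sum() ** 2 / (k * np.sum(x ** 2)) * d   # eq. (11)
rhs = (1 - k_tilde / d) * np.sum(x ** 2)                        # contraction bound (10)

print(lhs <= rhs + 1e-6 * abs(rhs), round(k_tilde))             # bound holds (up to float error)
```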
This allows us to apply the framework of [33] to the following compressed optimization scheme with error accumulation, where $i_{t}$ denotes the client sampled in iteration $t$ and $\mathcal{A}^{(t)}$ the accumulated compression error \begin{align*} \mathcal {W}^{(t+1)}:=&\mathcal {W}^{(t)}-\text {STC}_{k}\big (\mathcal {A}^{(t)}+\eta \Delta \mathcal {W}_{i_{t}}^{(t)}\big) \tag{12}\\ \mathcal {A}^{(t+1)}:=&\mathcal {A}^{(t)}+\Delta \mathcal {W} _{i_{t}}^{(t+1)}-\text {STC}_{k}\big ({\Delta \mathcal {W} _{i_{t}}}^{(t+1)}\big)\tag{13}\end{align*}
Under standard assumptions on the loss function ($\mu$-strong convexity, $L$-smoothness, and gradients bounded by $G$), the averaged iterate $\overline{\mathcal{W}_{T}}$ after $T$ iterations then satisfies \begin{align*} \mathbb {E}[f(\overline { \mathcal {W}_{T}})]-f^{*}\leq \mathcal {O}\left ({\frac {G^{2}}{\mu T}}\right)+\mathcal {O} \left ({\frac {\frac {d^{2}}{\tilde {k}^{2}}G^{2}\frac {L}{\mu }}{\mu T^{2}}}\right)+ \mathcal {O}\left ({\frac {\frac {d^{3}}{\tilde {k}^{3}}G^{2}}{\mu T^{3}}}\right). \\\tag{14}\end{align*}
Preliminary experiments are in line with our theoretical findings. Fig. 4 shows the final accuracy of the VGG11* model when trained at different sparsity levels with and without ternarization. As we can see, the additional ternarization has only a negligible effect on the convergence speed and sometimes even increases the final accuracy of the trained model. It seems evident that a combination of sparsity and quantization makes more efficient use of the communication budget than pure sparsification.
Fig. 4. Effects of ternarization at different levels of upload and download sparsity. Displayed is the difference in final accuracy in % between a model trained with sparse updates and a model trained with sparse binarized updates. Positive numbers indicate better performance of the model trained with pure sparsity. VGG11* trained on CIFAR-10 for 16 000 iterations with five clients holding i.i.d. and non-i.i.d. data.
B. Extending to Downstream Compression
Existing compression frameworks that were proposed for distributed training (see [19], [20], [23], [25]) only compress the communication from clients to the server, which is sufficient for applications where aggregation can be achieved via an all-reduce operation. However, in the federated learning setting, where the clients have to download the aggregated weight-updates from the server, this approach is not feasible, as it will lead to a communication bottleneck.
To illustrate this point, let $\Delta\mathcal{W}_{i}^{(t+1)}$ denote the weight update computed by client $i$ in communication round $t+1$ and $A_{i}^{(t)}$ its local error accumulator. With upstream compression only, the server aggregates \begin{align*} \Delta \mathcal {W} ^{(t+1)}=&\frac {1}{n}\sum _{i=1}^{n}\underbrace {\text {STC}_{k}\big (\Delta \mathcal {W} _{i}^{(t+1)}+A_{i}^{(t)}\big)}_{\tilde {\Delta \mathcal {W} _{i}}^{(t+1)}} \tag{15}\\ A_{i}^{(t+1)}=&A_{i}^{(t)}+\Delta \mathcal {W} _{i}^{(t+1)}-\tilde {\Delta \mathcal {W} }_{i}^{(t+1)}\tag{16}\end{align*} While every individual client update $\tilde{\Delta\mathcal{W}}_{i}^{(t+1)}$ is sparse, their average $\Delta\mathcal{W}^{(t+1)}$ generally is not, because the sparsity patterns of different clients rarely coincide; the downstream communication to the clients, therefore, remains essentially uncompressed.
To resolve this issue, we propose to apply the same compression mechanism that is used on the clients also at the server side to compress the downstream communication. This modifies the update rule to \begin{align*} \tilde {\Delta \mathcal {W} ^{(t+1)}} = \text {STC}_{k}\left ({\frac {1}{n}\sum _{i=1}^{n}\underbrace {\text {STC}_{k}\big (\Delta \mathcal {W} _{i}^{(t+1)}+A_{i}^{(t)}\big)}_{\tilde {\Delta \mathcal {W} _{i}}^{(t+1)}}+A^{(t)}}\right) \\\tag{17}\end{align*}
where the residuals on the clients and the server are now updated according to \begin{align*} A_{i}^{(t+1)}=&A_{i}^{(t)}+\Delta \mathcal {W} _{i}^{(t+1)}-\tilde {\Delta \mathcal {W} _{i}}^{(t+1)}\tag{18}\\ A^{(t+1)}=&A^{(t)}+\Delta \mathcal {W} ^{(t+1)}-\tilde {\Delta \mathcal {W} }^{(t+1)}.\tag{19}\end{align*}
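A compact sketch of one communication round with compression on both ends and error accumulation as in (15)–(19); the function `local_update` is a hypothetical placeholder for the local training on a client, and the weight update is applied additively for simplicity.

```python
import numpy as np

def stc(x, k):
    """Sparse ternary compression of a flat vector (top-k, common magnitude, signs)."""
    idx = np.argpartition(np.abs(x), -k)[-k:]
    mu = np.abs(x[idx]).mean()
    out = np.zeros_like(x)
    out[idx] = mu * np.sign(x[idx])
    return out

def communication_round(W, client_residuals, server_residual, local_update, k):
    """One round with STC on both upstream and downstream, following (15)-(19)."""
    uploads = []
    for i, A_i in enumerate(client_residuals):
        dW_i = local_update(i, W)                        # placeholder: local SGD on client i's data
        upload = stc(A_i + dW_i, k)                      # compressed upstream update, cf. (15)
        client_residuals[i] = A_i + dW_i - upload        # client error accumulation (18)
        uploads.append(upload)
    dW = np.mean(uploads, axis=0)                        # server-side aggregation
    broadcast = stc(server_residual + dW, k)             # compressed downstream update (17)
    server_residual = server_residual + dW - broadcast   # server error accumulation (19)
    return W + broadcast, client_residuals, server_residual

# toy usage with a dummy local update
d, n, k = 1000, 5, 10
W = np.zeros(d)
dummy = lambda i, W: np.random.randn(d) * 0.01
W, residuals, server_res = communication_round(
    W, [np.zeros(d) for _ in range(n)], np.zeros(d), dummy, k)
```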
We can express this new update rule for both upload and download compression (17) as a special case of pure upload compression (15) with generalized filter masks. Let
Fig. 5 shows the accuracies achieved by VGG11 on CIFAR10, when trained in a federated learning environment with five clients for 10 000 iterations at different rates of upload and download compression. As we can see, for as long as download and upload sparsity are of the same order, sparsifying the download is not very harmful to the convergence and decreases the accuracy by at most 2% in both the i.i.d. and the non-i.i.d. case.
Fig. 5. Accuracy achieved by VGG11* when trained on CIFAR in a distributed setting with five clients for 16 000 iterations at different levels of upload and download sparsity. Sparsifying the updates for downstream communication reduces the final accuracy by at most 3% when compared to using only upload sparsity.
C. Weight Update Caching for Partial Client Participation
So far, we have only been looking at scenarios in which all of the clients participate throughout the entire training process. However, as elaborated in Section II, in federated learning, typically only a fraction of the entire client population will participate in any particular communication round. As clients do not download the full master model in every round but only compressed model updates, a client that has skipped one or more communication rounds is no longer synchronized with the server.
To solve the synchronization problem and reduce the workload for the clients, we propose to use a caching mechanism on the server. Assume that the server keeps the last $\tau$ global updates $\tilde{\Delta\mathcal{W}}^{(t)}$, $t=T-\tau,\ldots,T-1$, and caches their partial sums $P^{(\tau)}=\sum_{t=T-\tau}^{T-1}\tilde{\Delta\mathcal{W}}^{(t)}$. A client that has been absent for $\tau$ rounds can then resynchronize by downloading the single partial sum $P^{(\tau)}$ instead of all missed updates. The entropy of these partial sums grows at most linearly with $\tau$ \begin{equation*} H(P^{(\tau)}) \leq \tau H(P^{(1) })=\tau H(\tilde {\Delta \mathcal {W} }^{(T-1)})\tag{20}\end{equation*}
For signSGD, where every communicated update is binary, the entries of the corresponding partial sums can take at most $2\tau+1$ distinct values, such that \begin{equation*} H(P_{signSGD}^{(\tau)})\leq \log _{2}(2\tau +1).\tag{21}\end{equation*}
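A minimal sketch of how such a cache could look on the server (class and method names are our own); here the partial sums $P^{(\tau)}$ are computed on request rather than precomputed.

```python
from collections import deque
import numpy as np

class UpdateCache:
    """Server-side cache of the last `maxlen` global updates and their partial sums."""
    def __init__(self, maxlen=100):
        self.updates = deque(maxlen=maxlen)          # most recent compressed global updates

    def push(self, global_update):
        self.updates.append(global_update.copy())

    def partial_sum(self, tau):
        """P^(tau): sum of the last tau global updates, cf. eq. (20)."""
        tau = min(tau, len(self.updates))
        return np.sum(list(self.updates)[-tau:], axis=0)

# a client that last synchronized tau rounds ago downloads only P^(tau) and
# applies it to its stale copy of the model: W_client += cache.partial_sum(tau)
cache = UpdateCache()
for _ in range(5):
    cache.push(np.random.randn(100) * 0.01)          # stand-ins for compressed global updates
stale_client_delta = cache.partial_sum(tau=3)
```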
Partial client participation also has effects on the convergence speed of federated training, both with delayed and sparsified updates. We will investigate these effects in detail in Section VII-C.
D. Lossless Encoding
To communicate a set of sparse ternary tensors produced by STC, we only need to transfer the positions of the nonzero elements in the flattened tensors, along with one bit per nonzero element for its sign (the common magnitude $\mu$ needs to be communicated only once per tensor). Instead of absolute positions, it is more efficient to communicate the distances between subsequent nonzero elements. At a sparsity rate of $p$, these distances are approximately geometrically distributed, which makes the Golomb code with parameter $2^{\mathbf{b}^{*}}$ well suited for encoding them. The average number of bits required per position then amounts to \begin{equation*} \bar {\texttt {b}}_{\mathrm{ pos}} = \mathbf {b}^{*}+\frac {1}{1-(1-p)^{2^{\mathbf {b}^{*}}}}\tag{22}\end{equation*}
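A small sketch that evaluates (22) and selects the Golomb parameter $\mathbf{b}^{*}$ minimizing the expected position-encoding cost for a given sparsity rate (the rate $p=1/400$ is chosen for illustration):

```python
def golomb_bits_per_position(b_star, p):
    """Average bits per nonzero position for a Golomb code with parameter 2**b_star, eq. (22)."""
    return b_star + 1.0 / (1.0 - (1.0 - p) ** (2 ** b_star))

def best_golomb_parameter(p, b_max=32):
    """Choose b* minimizing the expected position-encoding cost at sparsity rate p."""
    return min(range(1, b_max + 1), key=lambda b: golomb_bits_per_position(b, p))

p = 1.0 / 400                                   # example sparsity rate (assumed for illustration)
b_star = best_golomb_parameter(p)
print(b_star, golomb_bits_per_position(b_star, p))   # optimal b* and the resulting bits/position
```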
The complete compression framework that features upstream and downstream compression via sparsification, ternarization, and optimal encoding of the updates is described in Algorithm 2.
Experiments
We evaluate our proposed communication protocol on four different learning tasks and compare its performance to federated averaging and signSGD in a wide variety of different federated learning environments.
Models and Data Sets: To cover a broad spectrum of learning problems, we evaluate on differently sized convolutional and recurrent neural networks for the relevant federated learning tasks of image classification and speech recognition:
VGG11* on CIFAR: We train a modified version of the popular 11-layer VGG11 network [28] on the CIFAR [30] data set. We simplify the VGG11 architecture by reducing the number of convolutional filters to [32, 64, 128, 128, 128, 128, 128, 128] in the respective convolutional layers and reducing the size of the hidden fully-connected layers to 128. We also remove all dropout layers and batch-normalization layers as the regularization is no longer required. Batch normalization has been observed to perform very poorly with both small batch sizes and non-i.i.d. data [35], and we do not want this effect to obscure the investigated behavior. The resulting VGG11* network still achieves 85.46% accuracy on the validation set after 20 000 iterations of training with a constant learning rate of 0.16 and contains 865 482 parameters.
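A hedged PyTorch sketch of this architecture: the filter counts and fully connected sizes are taken from the description above, while the placement of the pooling layers is assumed to follow the standard VGG11 configuration; under these assumptions, the parameter count matches the stated 865 482.

```python
import torch.nn as nn

# 'M' marks a 2x2 max-pooling layer; filter counts as described in the text,
# pooling positions as in the standard VGG11 configuration (an assumption).
VGG11_STAR_CFG = [32, 'M', 64, 'M', 128, 128, 'M', 128, 128, 'M', 128, 128, 'M']

def make_vgg11_star(num_classes=10):
    layers, in_ch = [], 3
    for v in VGG11_STAR_CFG:
        if v == 'M':
            layers.append(nn.MaxPool2d(2, 2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
            in_ch = v
    features = nn.Sequential(*layers)                  # 3x32x32 input -> 128x1x1 after 5 pools
    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Linear(128, 128), nn.ReLU(inplace=True),
        nn.Linear(128, 128), nn.ReLU(inplace=True),
        nn.Linear(128, num_classes),
    )
    return nn.Sequential(features, classifier)

model = make_vgg11_star()
print(sum(p.numel() for p in model.parameters()))      # 865482
```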
CNN on KWS: We train the four-layer convolutional neural network (CNN) from [27] on the speech commands data set [36]. The speech commands data set consists of 51 088 different speech samples of specific keywords. There are 30 different keywords in total, and every speech sample is of 1-s duration. Like [32], we restrict ourselves to the subset of the ten most common keywords. For every speech command, we extract the Mel spectrogram from the short-time Fourier transform, which results in a two-dimensional time-frequency representation that the CNN processes like a grayscale image.
LSTM on Fashion-MNIST: We also train a Long Short-Term Memory (LSTM) network with two hidden layers of size 128 on the Fashion-MNIST data set [37]. The Fashion-MNIST data set contains 60 000 training and 10 000 validation greyscale images of ten different fashion items. Every image is of size 28 × 28 pixels.
Logistic Regression on MNIST: Finally, we also train a simple logistic regression classifier on the MNIST [31] data set. The MNIST data set contains 60 000 training and 10 000 test greyscale images of handwritten digits of size 28 × 28 pixels.
The different learning tasks are summarized in Table II. In the following, we will primarily discuss the results for VGG11* trained on CIFAR; however, the described phenomena carry over to all other benchmarks and the supporting experimental results can be found in the Appendix in the Supplementary Material.
Compression Methods: We compare our proposed STC method at a fixed sparsity rate with federated averaging at different communication delay periods and with signSGD.
Learning Environment: The federated learning environment described in Algorithm 2 can be fully characterized by five parameters. For the base configuration, we set the number of clients to 100, the participation ratio to 10%, and the local batch size to 20, and we assign every client an equally sized subset of the training data containing samples from ten different classes. In the following experiments, if not explicitly signified otherwise, all hyperparameters will default to this base configuration summarized in Table III. We will use short notations such as "Clients: $n$" to indicate deviations from this base configuration.
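For reference, the base configuration of Table III expressed as a small Python dictionary (the key names are our own shorthand):

```python
# base federated learning environment (Table III); key names are our own shorthand
BASE_CONFIG = {
    "num_clients": 100,          # total number of clients
    "participation": 0.10,       # fraction of clients active per communication round
    "batch_size": 20,            # local mini-batch size
    "classes_per_client": 10,    # i.i.d. base case: every client holds all ten classes
    "balanced_split": True,      # equally sized local data sets, cf. (23)
}
```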
A. Momentum in Federated Optimization
We start out by investigating the effects of momentum optimization on the convergence behavior of the different compression methods. Figs. 6–9 show the final accuracy achieved by federated averaging, signSGD, and STC, each with and without momentum, in a variety of learning environments.
Fig. 6. Robustness of different compression methods to the non-i.i.d.-ness of client data on four different benchmarks. VGG11* trained on CIFAR. STC distinctively outperforms federated averaging on non-i.i.d. data. The learning environment is configured as described in Table III. Dashed lines signify that momentum was used during training.
Fig. 7. Maximum accuracy achieved by the different compression methods when training VGG11* on CIFAR for 20 000 iterations at varying batch sizes in a federated learning environment with ten clients and full participation. Left: Every client holds data from exactly two different classes. Right: Every client holds an i.i.d. subset of data.
Fig. 8. Validation accuracy achieved by VGG11* on CIFAR after 20 000 iterations of communication-efficient federated training with different compression methods. The relative client participation fraction is varied between 100% (5/5) and 5% (5/100). Left: Every client holds data from exactly two different classes. Right: Every client holds an i.i.d. subset of data.
Fig. 9. Validation accuracy achieved by VGG11* on CIFAR after 20 000 iterations of communication-efficient federated training with different compression methods. The training data are split among the clients at different degrees of unbalancedness according to (23).
When we compare federated averaging, signSGD and STC in the following, we will ignore whichever version of these methods (momentum “on” or “off”) performs worse.
B. Non-i.i.d.-ness of the Data
Our preliminary experiments in Section V have already demonstrated that the convergence behavior of both federated averaging and signSGD is very sensitive to the degree of i.i.d.-ness of the local client data, whereas sparse communication seems to be more robust. We will now investigate this behavior in some more detail. Fig. 6 shows the maximum achieved generalization accuracy after a fixed number of iterations for VGG11* trained on CIFAR at different levels of non-i.i.d.-ness. Additional results on all other benchmarks can be found in Fig. A2 in the Appendix in the Supplementary Material. Both at full (left plot) and partial (right plot) client participations, STC outperforms federated averaging across all levels of i.i.d.-ness. The most distinct difference can be observed in the non-i.i.d. regime, where the individual clients hold less than five different classes. Here, STC (without momentum) outperforms both federated averaging and signSGD by a wide margin. In the extreme case where every client only holds data from exactly one class, STC still achieves 79.5% and 53.2% accuracy at full and partial client participations, respectively, while both federated averaging and signSGD fail to converge at all.
C. Robustness to Other Parameters of the Learning Environment
We will now proceed to investigate the effects of other parameters of the learning environment on the convergence behavior of the different compression methods. Figs. 7–9 show the maximum achieved accuracy after training VGG11* on CIFAR for 20 000 iterations in different federated learning environments. Additional results on the three other benchmarks can be found in Section D in the Appendix in the Supplementary Material.
We observe that STC (without momentum) consistently dominates federated averaging on all benchmarks and learning environments.
1) Local Batch Size:
The memory capacity of mobile and IoT devices is typically very limited. As the memory footprint of SGD is proportional to the batch size used during training, clients might be restricted to train on small minibatches only. Fig. 7 shows the influence of the local batch size on the performance of different communication-efficient federated learning techniques, using VGG11* trained on CIFAR as an example. First of all, we notice that using momentum significantly slows down the convergence speed of both STC and federated averaging at batch sizes smaller than 20, independent of the distribution of data among the clients. As we can see, even if the training data is distributed among the clients in an i.i.d. manner (see Fig. 7 right) and all clients participate in every training iteration, federated averaging suffers considerably from small batch sizes. STC, on the other hand, proves to be far more robust to this type of constraint. At an extreme batch size of one, the model trained with STC still achieves an accuracy of 63.8%, while the federated averaging model only reaches 39.2% after 20 000 training iterations.
2) Client Participation Fraction:
Fig. 8 shows the convergence speed of VGG11* trained on CIFAR10 in a federated learning environment with different degrees of client participation. To isolate the effects of reduced participation, we keep the absolute number of participating clients and the local batch sizes at constant values of 5 and 40, respectively, throughout all experiments and vary only the total number of clients (and thus the relative client participation fraction).
3) Unbalancedness:
Up until now, all experiments were performed with a balanced split of data in which every client was assigned the same number of data points. In practice, however, the data sets on different clients will typically vary heavily in size. To simulate different degrees of unbalancedness, we split the data among the clients in a way such that client $i$ receives a fraction \begin{equation*} \varphi _{i}(\alpha, \gamma)=\frac {\alpha }{n}+(1-\alpha)\frac {\gamma ^{i}}{\sum _{j=1}^{n}\gamma ^{j}}\tag{23}\end{equation*} of the total training data. Here, $\alpha = 1$ corresponds to a perfectly balanced split, while smaller values of $\alpha$ concentrate an increasing share of the data on few clients according to the geometric weights $\gamma^{i}$.
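A short sketch that evaluates the split fractions $\varphi_{i}(\alpha,\gamma)$ from (23); the values of $\alpha$ and $\gamma$ are chosen for illustration only:

```python
import numpy as np

def split_fractions(n, alpha, gamma):
    """Fraction of the training data assigned to each client i = 1..n, eq. (23)."""
    i = np.arange(1, n + 1)
    geometric = gamma ** i / np.sum(gamma ** i)
    return alpha / n + (1 - alpha) * geometric

# example: 10 clients, increasingly unbalanced splits
for alpha in (1.0, 0.5, 0.0):
    phi = split_fractions(10, alpha, gamma=0.7)
    print(alpha, np.round(phi, 3), phi.sum())        # fractions always sum to 1
```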
D. Communication Efficiency
Finally, we compare the different compression methods with respect to the number of iterations and communicated bits they require to achieve a certain target accuracy on a federated learning task. As we saw in Section V, both federated averaging and signSGD perform considerably worse if clients hold non-i.i.d. data or use small batch sizes. To still have a meaningful comparison, we, therefore, choose to evaluate this time on an i.i.d. environment where every client holds ten different classes and uses a moderate batch size of 20 during training. This setup favors federated averaging and signSGD to the maximum degree possible! All other parameters of the learning environment are set to the base configuration given in Table III. We train until the target accuracy is achieved or a maximum number of iterations is exceeded and measure the amount of communicated bits both for upload and download. Fig. 10 shows the results for VGG11* trained on CIFAR, CNN trained on keyword spotting (KWS), and the LSTM model trained on Fashion-MNIST. We can see that even if all clients hold i.i.d. data, STC still manages to achieve the desired target accuracy within the smallest communication budget out of all methods. STC also converges faster in terms of training iterations than the versions of federated averaging with comparable compression rate. Unsurprisingly, we see that both for federated averaging and STC, we face a tradeoff between the number of training iterations ("computation") and the number of communicated bits ("communication"). On all investigated benchmarks, however, STC is Pareto-superior to federated averaging in the sense that for any fixed iteration complexity, it achieves a lower (upload) communication complexity.
Fig. 10. Convergence speed of federated learning with compressed communication in terms of training iterations (left) and uploaded bits (right) on three different benchmarks (top to bottom) in an i.i.d. federated learning environment with 100 clients and 10% participation fraction. For better readability, the validation error curves are average-smoothed with a step size of five. On all benchmarks, STC requires the least amount of bits to converge to the target accuracy.
Table IV shows the amount of upstream and downstream communications required to achieve the target accuracy for the different methods in megabytes. On the CIFAR learning task, STC at a sparsity rate of
Lessons Learned
We will now summarize the findings of this article and give general suggestions on how to approach communication-constrained federated learning problems (see our summarizing Fig. 11).
If clients hold non-i.i.d. data, sparse communication protocols such as STC distinctively outperform federated averaging across all federated learning environments [see Figs. 6, 7 (left), and 8 (left)].
The same holds true if clients are forced to train on small minibatches (e.g., because the hardware is memory constrained). In these situations, STC outperforms federated averaging even if the client’s data are i.i.d. [see Fig. 7 (right)].
STC should also be preferred over federated averaging if the client participation rate is expected to be low, as it converges more stably and quickly in both the i.i.d. and non-i.i.d. regimes [see Fig. 8 (right)].
STC is generally most advantageous in situations where the communication is bandwidth-constrained or costly (metered network, limited battery), as it achieves a certain target accuracy within the minimum amount of communicated bits even on i.i.d. data (see Fig. 10 and Table IV).
Federated averaging, in turn, should be used if the communication is latency-constrained or if the client participation is expected to be very low (and 1–3 do not hold).
Momentum optimization should be avoided in federated learning whenever either clients are training with small batch sizes or the client data are non-i.i.d. and the participation rate is low (see Figs. 6–8).
Fig. 11. Left: accuracy achieved by VGG11* on CIFAR after 20 000 iterations of federated training with federated averaging and STC for three different configurations of the learning environment. Right: upstream and downstream communication necessary to achieve a validation accuracy of 84% with federated averaging and STC on the CIFAR benchmark under i.i.d. data and a moderate batch size.
Conclusion
Federated learning for mobile and IoT applications is a challenging task, as generally little to no control can be exerted over the properties of the learning environment.
In this article, we demonstrated that the convergence behavior of current methods for communication-efficient federated learning is very sensitive to these properties. On a variety of different data sets and model architectures, we observe that the convergence speed of federated averaging drastically decreases in learning environments where the clients either hold non-i.i.d. subsets of data, are forced to train on small minibatches, or where only a small fraction of clients participates in every communication round. To address these issues, we propose STC, a communication protocol that compresses both the upstream and downstream communications via sparsification, ternarization, error accumulation, and optimal Golomb encoding. Our experiments show that STC is far more robust to the above-mentioned peculiarities of the learning environment than federated averaging. Moreover, STC converges faster than federated averaging both with respect to the number of training iterations and the amount of communicated bits, even if the clients hold i.i.d. data and use moderate batch sizes during training.
Our approach can be understood as an alternative paradigm for communication-efficient federated optimization that relies on high-frequency low-volume instead of low-frequency high-volume communication. As such, it is particularly well suited for federated learning environments that are characterized by low-latency and low-bandwidth channels between clients and server.