Robust and Communication-Efficient Federated Learning From Non-i.i.d. Data



Abstract:

Federated learning allows multiple parties to jointly train a deep learning model on their combined data, without any of the participants having to reveal their local data to a centralized server. This form of privacy-preserving collaborative learning, however, comes at the cost of a significant communication overhead during training. To address this problem, several compression methods have been proposed in the distributed training literature that can reduce the amount of required communication by up to three orders of magnitude. These existing methods, however, are only of limited utility in the federated learning setting, as they either only compress the upstream communication from the clients to the server (leaving the downstream communication uncompressed) or only perform well under idealized conditions, such as i.i.d. distribution of the client data, which typically cannot be found in federated learning. In this article, we propose sparse ternary compression (STC), a new compression framework that is specifically designed to meet the requirements of the federated learning environment. STC extends the existing compression technique of top-$k$ gradient sparsification with a novel mechanism to enable downstream compression as well as ternarization and optimal Golomb encoding of the weight updates. Our experiments on four different learning tasks demonstrate that STC distinctively outperforms federated averaging in common federated learning scenarios. These results advocate for a paradigm shift in federated optimization toward high-frequency low-bitwidth communication, in particular in bandwidth-constrained learning environments.
Published in: IEEE Transactions on Neural Networks and Learning Systems ( Volume: 31, Issue: 9, September 2020)
Page(s): 3400 - 3413
Date of Publication: 01 November 2019

PubMed ID: 31689214

License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). IEEE is not the copyright holder of this material.
SECTION I.

Introduction

Three major developments are currently transforming the way data are created and processed: First, with the advent of the Internet of Things (IoT), the number of intelligent devices in the world has grown rapidly in the last couple of years. Many of these devices are equipped with various sensors and increasingly potent hardware that allow them to collect and process data at unprecedented scales [1]–[3].

In a concurrent development, deep learning has revolutionized the ways in which information can be extracted from data resources, with groundbreaking successes in areas such as computer vision, natural language processing, and voice recognition, among many others [4]–[9]. Deep learning scales well with growing amounts of data, and its astounding successes in recent times can be at least partly attributed to the availability of very large data sets for training. There is, therefore, huge potential in harnessing the rich data provided by IoT devices for training and improving deep learning models [10].

At the same time, data privacy has become a growing concern for many users. Multiple cases of data leakage and misuse in recent times have demonstrated that the centralized processing of data comes at a high risk for the end users' privacy. As IoT devices usually collect data in private environments, often even without the explicit awareness of the users, these concerns are particularly strong. It is, therefore, generally not an option to share this data with a centralized entity that could conduct the training of a deep learning model. In other situations, local processing of the data might be desirable for other reasons, such as increased autonomy of the local agent.

This leaves us facing the following dilemma: How are we going to make use of the rich combined data of millions of IoT devices for training deep learning models if this data cannot be stored at a centralized location?

Federated learning resolves this issue as it allows multiple parties to jointly train a deep learning model on their combined data, without any of the participants having to reveal their data to a centralized server [10]. This form of privacy-preserving collaborative learning is achieved by following a simple three-step protocol illustrated in Fig. 1. In the first step, all participating clients download the latest master model $\mathcal {W}$ from the server. Next, the clients improve the downloaded model, based on their local training data using stochastic gradient descent (SGD). Finally, all participating clients upload their locally improved models $\mathcal {W}_{i}$ back to the server, where they are gathered and aggregated to form a new master model (in practice, weight updates $\Delta \mathcal {W} = \mathcal {W}^{\mathrm{ new}}- \mathcal {W}^{\mathrm{ old}}$ can be communicated instead of full models $\mathcal {W}$ , which is equivalent as long as all clients remain synchronized). These steps are repeated until a certain convergence criterion is satisfied. Observe that when following this protocol, training data never leave the local devices as only model updates are communicated. Although it has been shown that in adversarial settings information about the training data can still be inferred from these updates [11], additional mechanisms, such as homomorphic encryption of the updates [12], [13] or differentially private training [14], can be applied to fully conceal any information about the local data.
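To make the three-step protocol concrete, the following minimal sketch simulates one communication round in NumPy. The helper names (`local_sgd`, `federated_round`) and the toy quadratic surrogate for local training are our own illustrative assumptions, not part of the original article; they only serve to keep the example self-contained and runnable.

```python
# Minimal sketch of one federated learning round (weight-update variant).
# Local SGD is mocked with a toy quadratic so the example is self-contained.
import numpy as np

def local_sgd(weights, data, lr=0.1, iterations=1):
    """Placeholder for local training: returns locally improved weights."""
    w = weights.copy()
    for _ in range(iterations):
        mock_gradient = w - data.mean()        # stands in for the gradient of the local loss
        w -= lr * mock_gradient
    return w

def federated_round(server_weights, client_datasets):
    updates = []
    for data in client_datasets:               # (a) every client downloads W
        w_i = local_sgd(server_weights, data)  # (b) local improvement via SGD
        updates.append(w_i - server_weights)   # (c) upload weight update dW_i
    # the server averages the updates to form the new master model
    return server_weights + np.mean(updates, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clients = [rng.normal(loc=c, size=100) for c in range(3)]  # toy local data sets
    w = np.zeros(10)
    for _ in range(5):
        w = federated_round(w, clients)
    print(w[:3])
```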

Fig. 1. Federated learning with a parameter server. Illustrated is one communication round of distributed SGD. (a) Clients synchronize with the server. (b) Clients compute a weight update independently based on their local data. (c) Clients upload their local weight updates to the server, where they are averaged to produce the new master model.

A major issue in federated learning is the massive communication overhead that arises from sending around the model updates. When naively following the protocol described earlier, every participating client has to communicate a full model update during every training iteration. Every such update is of the same size as the trained model, which can be in the range of gigabytes for modern architectures with millions of parameters [15], [16]. Over the course of multiple hundred thousands of training iterations on big data sets, the total communication for every client can easily grow to more than a petabyte [17]. Consequently, if communication bandwidth is limited or communication is costly, naive federated learning can become unproductive or even completely unfeasible.

The total amount of bits that have to be uploaded and downloaded by every client during training is given by \begin{align*} \texttt{b}^{\mathrm{up/down}} \in \mathcal{O}\big(\underbrace{N_{\mathrm{iter}} \times f}_{\#\text{ updates}} \times \underbrace{|\mathcal{W}|\times (H(\Delta\mathcal{W}^{\mathrm{up/down}})+\eta)}_{\text{update size}}\big) \tag{1}\end{align*} where $N_{\mathrm{iter}}$ is the total number of training iterations (forward–backward passes) performed by every client, $f$ is the communication frequency, $|\mathcal{W}|$ is the size of the model, $H(\Delta\mathcal{W}^{\mathrm{up/down}})$ is the entropy of the weight updates exchanged during upload and download, respectively, and $\eta$ is the inefficiency of the encoding, i.e., the difference between the true update size and the minimal update size (which is given by the entropy). If we assume the size of the model and the number of training iterations to be fixed (e.g., because we want to achieve a certain accuracy on a given task), this leaves us with three options to reduce communication: 1) we can reduce the communication frequency $f$; 2) reduce the entropy of the weight updates $H(\Delta\mathcal{W}^{\mathrm{up/down}})$ via lossy compression schemes; and/or 3) use more efficient encodings to communicate the weight updates, thus reducing $\eta$.
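As a rough illustration of (1), the snippet below plugs in assumed numbers (a model of roughly the size of the VGG11* network used later, 20 000 iterations, communication after every iteration) and shows how the three options enter the product. The figures are ballpark estimates under these assumptions, not results from the article.

```python
# Back-of-the-envelope evaluation of (1) for an assumed setup.
model_params   = 865_482        # roughly the size of the VGG11* model used later
bits_per_entry = 32             # uncompressed float32 update, eta = 0
n_iter         = 20_000         # training iterations per client
freq           = 1.0            # communicate after every iteration

bits_naive = n_iter * freq * model_params * bits_per_entry
print(f"naive upload per client: {bits_naive / 8 / 1e9:.1f} GB")

# option 1: lower communication frequency; options 2/3: lower update entropy / better encoding
bits_delayed = bits_naive / 400                          # communicate every 400 iterations
bits_sparse = n_iter * freq * model_params * 0.01 * 18   # ~1% nonzeros at ~18 bits (sign + position)
print(f"delayed: {bits_delayed / 8 / 1e9:.2f} GB, sparse: {bits_sparse / 8 / 1e9:.2f} GB")
```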

SECTION II.

Challenges of the Federated Learning Environment

Before we can consider ways to reduce the amount of communication, we first have to take into account the unique characteristics, which distinguish federated learning from other distributed training settings such as parallel training (compare also with [10]). In federated learning, the distribution of both training data and computational resources is a fundamental and fixed property of the learning environment. This entails the following challenges.

  1. Unbalanced and non-i.i.d. data: As the training data present on the individual clients is collected by the clients themselves based on their local environment and usage pattern, both the size and the distribution of the local data sets will typically vary heavily between different clients.

  2. Large number of clients: Federated learning environments may consist of many millions of participants [18]. Furthermore, as the quality of the collaboratively learned model is determined by the combined available data of all clients, collaborative learning environments have a natural tendency to grow.

  3. Parameter server: Once the number of clients grows beyond a certain threshold, direct communication of weight updates becomes unfeasible because the workload for both communication and aggregation of updates grows linearly with the number of clients. In federated learning, it is, therefore, unavoidable to communicate via an intermediate parameter server. This reduces the amount of communication per client and communication rounds to one single upload of a local weight update to and one download of the aggregated update from the server and moves the workload of aggregation away from the clients. Communicating via a parameter server, however, introduces an additional challenge to communication-efficient distributed training, as now both the upload to the server and the download from the server need to be compressed in order to reduce communication time and energy consumption.

  4. Partial participation: In the general federated learning setting for IoT, it cannot be guaranteed that all clients participate in every communication round. Devices might lose their connection, run out of battery, or cease to contribute to the collaborative training for other reasons.

  5. Limited battery and memory: Mobile and embedded devices often are not connected to a power grid. Instead, their capacity to run computations is limited by a finite battery. Performing iterations of SGD is notoriously expensive for deep neural networks. It is, therefore, necessary to keep the number of gradient evaluations per client as small as possible. Mobile and embedded devices also typically have only very limited memory. As the memory footprint of SGD grows linearly with the batch size, this might force the devices to train on very small batch sizes.

Based on the above-mentioned characterization of the federated learning environment, we conclude that a communication-efficient distributed training algorithm for federated learning needs to fulfil the following requirements.

  1. It should compress both upstream and downstream communications.

  2. It should be robust to non-i.i.d. and unbalanced data as well as to small batch sizes.

  3. It should be robust to large numbers of clients and partial client participation.

SECTION III.

Contribution

In this article, we will demonstrate that none of the existing methods proposed for communication-efficient federated learning satisfies all of these requirements (see Table I). More concretely, we will show that the methods that are able to compress both upstream and downstream communications are very sensitive to non-i.i.d. data distributions, while the methods that are more robust to this type of data do not compress the downstream (see Section V). We will then proceed to construct a new efficient communication protocol for federated learning that resolves these issues and meets all requirements (R1)–(R3). We provide a convergence analysis of our method as well as extensive empirical results on four different neural network architectures and data sets that demonstrate that the sparse ternary compression (STC) protocol is superior to the existing compression schemes in that it requires both fewer gradient evaluations and fewer communicated bits to converge to a given target accuracy (see Section IX). These results also extend to the i.i.d. regime.

TABLE I Different Methods for Communication-Efficient Distributed Deep Learning Proposed in the Literature. None of the Existing Methods Satisfies All Requirements (R1)–(R3) of the Federated Learning Environment. We Call a Method “Robust to Non-i.i.d. Data” if the Federated Training Converges Independent of the Local Distribution of Client Data. We Call Compression Rates Greater Than $\times32$ “Strong” and Those Smaller or Equal to $\times32$ “Weak”

SECTION IV.

Related Work

In the broader realm of communication-efficient distributed deep learning, a wide variety of methods has been proposed to reduce the amount of communication during the training process. Using (1) as a reference, we can organize the substantial existing research body on communication-efficient distributed deep learning into three different groups.

  1. Communication delay methods reduce the communication frequency $f$ . McMahan et al. [10] propose federated averaging where instead of communicating after every iteration, every client performs multiple iterations of SGD to compute a weight update. McMahan et al. observe that on different convolutional and recurrent neural network architectures, communication can be delayed for up to 100 iterations without significantly affecting the convergence speed as long as the data are distributed among the clients in an i.i.d. manner. The amount of communication can be reduced even further with longer delay periods; however, this comes at the cost of an increased number of gradient evaluations. In a follow-up work, Konečnỳ et al. [27] combine this communication delay with random sparsification and probabilistic quantization. They restrict the clients to learn random sparse weight updates or force random sparsity on them afterward (“structured” versus “sketched” updates) and combine this sparsification with probabilistic quantization. Their method, however, significantly slows down convergence speed in terms of SGD iterations. Communication delay methods automatically reduce both upstream and downstream communication and are proven to work with large numbers of clients and partial client participation.

  2. Sparsification methods reduce the entropy $H(\Delta \mathcal {W})$ of the updates by restricting changes to only a small subset of the parameters. Strom [24] presents an approach (later modified by [26]) in which only gradients with a magnitude greater than a certain predefined threshold are sent to the server. All other gradients are accumulated in a residual. This method is shown to achieve upstream compression rates of up to three orders of magnitude on an acoustic modeling task. In practice, however, it is hard to choose appropriate values for the threshold, as it may vary a lot between different architectures and even different layers. To overcome this issue, Aji and Heafield [23] instead fix the sparsity rate and only communicate the fraction $p$ of entries with the largest magnitude of each gradient, while again collecting all other gradients in a residual (a minimal sketch of this scheme is given after this list). At a sparsity rate of $p=0.001$, their method only slightly degrades the convergence speed and final accuracy of the trained model. Lin et al. [25] present minor modifications to the work of Aji and Heafield [23] that even close this small performance gap. Sparsification methods have been proposed primarily with the intention of speeding up parallel training in the data center. Their convergence properties in the much more challenging federated learning environment have not yet been investigated. Sparsification methods (in their existing form) primarily compress the upstream communication, as the sparsity patterns of the updates from different clients will generally differ. If the number of participating clients is greater than the inverse sparsity rate, which can easily be the case in federated learning, the downstream update will not be compressed at all.

  3. Dense quantization methods reduce the entropy of the weight updates by restricting all updates to a reduced set of values. Bernstein et al. [22] propose signSGD, a compression method with theoretical convergence guarantees on i.i.d. data that quantizes every gradient update to its binary sign, thus reducing the bit size per update by a factor of $\times 32$ . signSGD also incorporates download compression by aggregating the binary updates from all clients by means of a majority vote. Other authors propose to stochastically quantize the gradients during upload in an unbiased way (TernGrad [19], quantized stochastic gradient descent (QSGD) [20], ATOMO [21]). These methods are theoretically appealing, as they inherit the convergence properties of regular SGD under relatively mild assumptions. However, their empirical performance and compression rates do not match those of sparsification methods.
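The following is the minimal sketch referenced in item 2: top-$k$ sparsification with residual accumulation in the spirit of Aji and Heafield [23]. Function and variable names are our own, and the gradient is mocked with random numbers; the sketch only illustrates the mechanics of carrying the unsent mass over in a residual.

```python
# Top-k sparsification with residual accumulation (error feedback), sketched.
import numpy as np

def topk_sparsify(update, residual, p=0.001):
    """Keep only the fraction p of largest-magnitude entries of
    (update + residual); everything else is carried over as the new residual."""
    accumulated = update + residual
    k = max(1, int(p * accumulated.size))
    threshold = np.sort(np.abs(accumulated))[-k]       # k-th largest magnitude
    mask = np.abs(accumulated) >= threshold
    sparse_update = np.where(mask, accumulated, 0.0)   # communicated upstream
    new_residual = accumulated - sparse_update         # kept locally
    return sparse_update, new_residual

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    residual = np.zeros(10_000)
    grad = rng.normal(size=10_000)                     # mock local gradient
    sparse, residual = topk_sparsify(grad, residual, p=0.001)
    print((sparse != 0).sum(), "of", sparse.size, "entries communicated")
```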

Out of all the above-listed methods, only federated averaging and signSGD compress both the upstream and downstream communications. All other methods are of limited utility in the federated learning setting defined in Section II, as they leave the communication from the server to the clients uncompressed.

Notation: In the following, calligraphic $\mathcal {W}$ will refer to the entirety of parameters of a neural network, while regular uppercase $W$ refers to one specific tensor of parameters within $\mathcal {W}$ and lowercase $w$ refers to one single scalar parameter of the network. Arithmetic operations between the neural network parameters are to be understood elementwise.

SECTION V.

Limitations of Existing Compression Methods

The related work on efficient distributed deep learning almost exclusively considers i.i.d. data distributions among the clients, i.e., it assumes unbiasedness of the local gradients with respect to the full-batch gradient according to \begin{equation*} \mathbb{E}_{x\sim p_{i}}[\nabla_{\mathcal{W}} l(x, \mathcal{W})] = \nabla_{\mathcal{W}} R(\mathcal{W})\quad \forall i=1,\ldots,n \tag{2}\end{equation*} where $p_{i}$ is the distribution of data on the $i$th client and $R(\mathcal{W})$ is the empirical risk function over the combined training data.

While this assumption is reasonable for parallel training, where the distribution of data among the clients is chosen by the practitioner, it is typically not valid in the federated learning setting, where we can generally only hope for unbiasedness in the mean \begin{equation*} \frac{1}{n}\sum_{i=1}^{n}\mathbb{E}_{x^{i}\sim p_{i}}[\nabla_{\mathcal{W}} l(x^{i}, \mathcal{W})] = \nabla_{\mathcal{W}} R(\mathcal{W})\tag{3}\end{equation*} while the individual clients' gradients will be biased toward the local data set according to \begin{equation*} \mathbb{E}_{x\sim p_{i}}[\nabla_{\mathcal{W}} l(x, \mathcal{W})] = \nabla_{\mathcal{W}} R_{i}(\mathcal{W}) \neq \nabla_{\mathcal{W}} R(\mathcal{W})\quad \forall i=1,\ldots,n. \tag{4}\end{equation*}

As it violates assumption (2), a non-i.i.d. distribution of the local data renders existing convergence guarantees, as formulated in [19]–[21] and [29], inapplicable and has dramatic effects on the practical performance of communication-efficient distributed training algorithms, as we will demonstrate in the following experiments.

A. Preliminary Experiments

We run preliminary experiments with a simplified version of the well-studied 11-layer VGG11 network [28], which we train on the CIFAR-10 [30] data set in a federated learning setup using ten clients. For the i.i.d. setting, we split the training data randomly into equally sized shards and assign one shard to every one of the clients. For the “non-i.i.d. ($m$ )” setting, we assign every client samples from exactly $m$ classes of the data set. The data splits are nonoverlapping and balanced, such that every client ends up with the same number of data points. The detailed procedure that generates the split of data is described in Section B of the Appendix in the Supplementary Material. We also perform experiments with a simple logistic regression classifier, which we train on the MNIST data set [31] under the same setup of the federated learning environment. Both models are trained using momentum SGD. To make the results comparable, all compression methods use the same learning rate and batch size.
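For illustration, the sketch below shows one simple way to produce such a balanced, nonoverlapping "non-i.i.d. ($m$)" split. It is our own reconstruction under the stated constraints; the exact procedure used in the experiments is the one described in Section B of the Appendix.

```python
# Sketch of a balanced, nonoverlapping "non-i.i.d.(m)" split: every client
# holds samples from exactly m classes. Our own reconstruction, not the
# article's exact procedure (see Appendix B there).
import numpy as np

def split_noniid(labels, n_clients, m, n_classes=10, seed=0):
    assert (n_clients * m) % n_classes == 0
    rng = np.random.default_rng(seed)
    # assign exactly m distinct classes to every client, round-robin
    client_classes = np.tile(np.arange(n_classes),
                             n_clients * m // n_classes).reshape(n_clients, m)
    # split the samples of every class evenly among the clients that hold it
    chunks_per_class = n_clients * m // n_classes
    per_class = {c: np.array_split(rng.permutation(np.where(labels == c)[0]),
                                   chunks_per_class) for c in range(n_classes)}
    counters = {c: 0 for c in range(n_classes)}
    clients = []
    for classes in client_classes:
        idx = []
        for c in classes:
            idx.append(per_class[c][counters[c]])
            counters[c] += 1
        clients.append(np.concatenate(idx))
    return clients

if __name__ == "__main__":
    labels = np.random.default_rng(1).integers(0, 10, size=50_000)  # stand-in labels
    for i, idx in enumerate(split_noniid(labels, n_clients=10, m=2)[:3]):
        print(f"client {i}: {len(idx)} samples, classes {sorted(set(labels[idx].tolist()))}")
```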

B. Results

Fig. 2 shows the convergence speed in terms of gradient evaluations for the two models when trained using different methods for communication-efficient federated learning. We observe that while all compression methods achieve comparably fast convergence in terms of gradient evaluations on i.i.d. data, closely matching the uncompressed baseline (black line), they suffer considerably in the non-i.i.d. training settings. As this trend can be observed also for the logistic regression model, we can conclude that the underlying phenomenon is not unique to deep neural networks and also carries over to convex objectives. We will now analyze these results in detail for the different compression methods.

Fig. 2. Convergence speed when using different compression methods during the training of VGG11* on CIFAR-10 and logistic regression on MNIST and Fashion-MNIST in a distributed setting with ten clients for i.i.d. and non-i.i.d. data. In the non-i.i.d. cases, every client only holds examples from exactly two, respectively one, of the ten classes in the data set. All compression methods suffer from degraded convergence speed in the non-i.i.d. situation, but sparse top-$k$ is affected by far the least.

1) Federated Averaging:

Most noticeably, federated averaging [10] (see orange line in Fig. 2), although specifically proposed for the federated learning setting, suffers considerably from non-i.i.d. data. This observation is consistent with Zhao et al. [32] who demonstrated that model accuracy can drop by up to 55% in non-i.i.d. learning environments compared to the i.i.d. ones. They attribute the loss in accuracy to the increased weight divergence between the clients and propose to side-step the problem by assigning a shared public i.i.d. data set to all clients. While this approach can indeed create more accurate models, it also has multiple shortcomings, the most crucial one being that we generally cannot assume the availability of such a public data set. If a public data set were to exist, one could use it to pretrain a model at the server, which is not consistent with the assumptions typically made in federated learning. Furthermore, if all clients share (part of) the same public data set, overfitting to this shared data can become a serious issue. This effect will be particularly severe in highly distributed settings where the number of data points on every client is small. Finally, even when sharing a relatively large data set between the clients, the original accuracy achieved in the i.i.d. situation cannot be fully restored. For these reasons, we believe that the data-sharing strategy proposed by [32] is an insufficient workaround to the fundamental problem of federated averaging having convergence issues on non-i.i.d. data.

2) SignSGD:

The quantization method signSGD [29] (see green line in Fig. 2) suffers from even worse stability issues in the non-i.i.d. learning environment. The method completely fails to converge on the CIFAR benchmark, and even for the convex logistic regression objective, the training plateaus at a substantially degraded accuracy.

To understand the reasons for these convergence issues, we have to investigate how likely it is for a single batch gradient to have the "correct" sign. Let \begin{equation*} g^{k}_{w}=\frac{1}{k}\sum_{i=1}^{k}\nabla_{w} l(x_{i}, \mathcal{W})\tag{5}\end{equation*} be the batch gradient over a specific minibatch of data $D^{k}=\{x_{1},\ldots,x_{k}\}\subset D$ of size $k$ at parameter $w$. Let, further, $g_{w}$ be the gradient over the entire training data $D$. Then, we can define this probability by \begin{equation*} \alpha_{w}(k)=\mathbb{P}\big[\text{sign}\big(g^{k}_{w}\big)=\text{sign}(g_{w})\big].\tag{6}\end{equation*} We can also compute the mean statistic \begin{equation*} \alpha(k) = \frac{1}{|\mathcal{W}|}\sum_{w\in\mathcal{W}}\alpha_{w}(k)\tag{7}\end{equation*} to estimate the average congruence over all parameters of the network.
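The statistic $\alpha(k)$ can be estimated by straightforward Monte Carlo sampling. The sketch below does so for a synthetic least-squares problem with class-dependent input distributions (our own toy setup, not the logistic regression on MNIST used in the article), comparing randomly drawn batches with single-class batches.

```python
# Monte Carlo sketch of the sign-congruence statistic alpha(k) from (5)-(7)
# on a synthetic least-squares problem with "classes" (toy setup).
import numpy as np

rng = np.random.default_rng(0)
n_classes, per_class, d = 10, 500, 50
means = rng.normal(scale=3.0, size=(n_classes, d))        # class-specific input shift
X = np.concatenate([m + rng.normal(size=(per_class, d)) for m in means])
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=X.shape[0])
labels = np.repeat(np.arange(n_classes), per_class)
w = rng.normal(size=d)

def grad(idx):
    """Gradient of the squared loss over the samples in idx."""
    r = X[idx] @ w - y[idx]
    return X[idx].T @ r / len(idx)

g_full = grad(np.arange(len(y)))                           # full-batch gradient g_w

def alpha(k, iid, trials=300):
    hits = 0.0
    for _ in range(trials):
        if iid:
            idx = rng.choice(len(y), size=k, replace=False)
        else:
            c = rng.integers(n_classes)                    # batch drawn from one class only
            idx = rng.choice(np.where(labels == c)[0], size=k, replace=False)
        hits += np.mean(np.sign(grad(idx)) == np.sign(g_full))
    return hits / trials

for k in (1, 16, 256):
    print(f"k={k}: alpha_iid={alpha(k, True):.2f}, alpha_noniid={alpha(k, False):.2f}")
```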

Fig. 3 (left) shows, as an example, the distribution of values of $\alpha _{w}(1)$ within the weights of logistic regression on MNIST at the beginning of training. As we can see, at a batch size of 1, $g^{1}_{w}$ is a very bad predictor of the true gradient sign, with a very high variance and an average congruence of $\alpha (1) = 0.51$ just slightly higher than random. The sensitivity of signSGD to non-i.i.d. data becomes apparent once we inspect the development of the gradient sign congruence for increasing batch sizes. Fig. 3 (right) shows this development for batches of increasing size sampled from an i.i.d. and a non-i.i.d. distribution. For the latter, every sampled batch only contains data from exactly one class. As we can see, for i.i.d. data, $\alpha$ quickly grows with increasing batch size, resulting in increasingly accurate updates. For non-i.i.d. data, however, the congruence stays low, independent of the size of the batch. This means that if clients hold highly non-i.i.d. subsets of data, signSGD updates will only weakly correlate with the direction of steepest descent, no matter how large a batch size is chosen for training.

Fig. 3. Left: distribution of values of $\alpha_{w}(1)$ for the weight layer of logistic regression over the MNIST data set. Right: development of $\alpha(k)$ for increasing batch sizes. In the i.i.d. case, the batches are sampled randomly from the training data, while in the non-i.i.d. case, every batch contains samples from only exactly one class. For i.i.d. batches, the gradient sign becomes increasingly accurate with growing batch sizes. For non-i.i.d. batches of data, this is not the case. The gradient signs remain highly incongruent with the full-batch gradient, no matter how large the size of the batch.

3) Top-$k$ Sparsification:

Out of all existing compression methods, top-$k$ sparsification (see blue line in Fig. 2) suffers least from non-i.i.d. data. For VGG11 on CIFAR the training still converges reliably even if every client only holds data from exactly one class, and for the logistic regression classifier trained on MNIST, the convergence does not slow down at all. We hypothesize that this robustness to non-i.i.d. data is due to mainly two reasons. First of all, the frequent communication of weight updates between the clients prevents them from diverging too far from one another, and hence, top-$k$ sparsification does not suffer from weight divergence [32] as it is the case for federated averaging. Second, sparsification does not destabilize the training nearly as much as signSGD does since the noise in the stochastic gradients is not amplified by quantization. Although top-$k$ sparsification shows promising performance on non-i.i.d. data, its utility is limited in the federated learning setting as it only directly compresses the upstream communication.

Table I summarizes our findings. None of the existing compression methods supports both download compression and properly works with non-i.i.d. data.

SECTION VI.

Sparse Ternary Compression

Top-$k$ sparsification shows the most promising performance in distributed learning environments with non-i.i.d. client data. We will use this observation as a starting point to construct an efficient communication protocol for federated learning. To arrive at this protocol, we will solve three open problems that prevent the direct application of top-$k$ sparsification to federated learning.

  1. We will further increase the efficiency of our method by employing quantization and optimal lossless coding of the weight updates.

  2. We will incorporate downstream compression into the method to allow for efficient communication from server to clients.

  3. We will implement a caching mechanism to keep the clients synchronized in case of partial client participation.

A. Ternarizing Weight Updates

Regular top-$k$ sparsification, as proposed in [23] and [25], communicates the fraction of largest elements at full precision, while all other elements are not communicated at all. In our previous work (Sattler et al. [17]), we already demonstrated that this imbalance in update precision is wasteful in the distributed training setting and that higher compression gains can be achieved when sparsification is combined with quantization of the nonzero elements.

We adapt the method described in [17] to the federated learning setting and quantize the remaining top-$k$ elements of the sparsified updates to the mean population magnitude, leaving us with a ternary tensor containing the values $\{-\mu,0,\mu \}$. The quantization method is formalized in Algorithm 1.

Algorithm 1 STC

1: input: flattened tensor $T\in \mathbb{R}^{n}$, sparsity $p$
2: output: sparse ternary tensor $T^{*}\in \{-\mu,0,\mu\}^{n}$
3: $k \leftarrow \max(np,1)$
4: $v \leftarrow \text{top}_{k}(|T|)$
5: $\text{mask} \leftarrow (|T|\geq v) \in \{0,1\}^{n}$
6: $T^{\mathrm{masked}}\leftarrow \text{mask}\odot T$
7: $\mu \leftarrow \frac{1}{k}\sum_{i=1}^{n}|T^{\mathrm{masked}}_{i}|$
8: return $T^{*} \leftarrow \mu \times \text{sign}(T^{\mathrm{masked}})$
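A direct NumPy translation of Algorithm 1 might look as follows; apart from flattening and reshaping the input tensor, it follows the listed steps one to one.

```python
# NumPy translation of Algorithm 1 (STC).
import numpy as np

def stc(tensor, p):
    """Sparse ternary compression of a tensor with sparsity rate p."""
    t = tensor.ravel()
    k = max(int(p * t.size), 1)
    v = np.sort(np.abs(t))[-k]                  # magnitude of the k-th largest entry
    mask = (np.abs(t) >= v).astype(t.dtype)     # keep only the top-k magnitudes
    t_masked = mask * t
    mu = np.abs(t_masked).sum() / k             # mean magnitude of the surviving entries
    return (mu * np.sign(t_masked)).reshape(tensor.shape)   # values in {-mu, 0, mu}

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    delta_w = rng.normal(size=(128, 64))
    compressed = stc(delta_w, p=0.01)
    print(np.unique(compressed).size, "distinct values,",
          (compressed != 0).mean(), "fraction nonzero")
```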

This ternarization step reduces the entropy of the update from \begin{equation*} H_{\mathrm{sparse}} = -p\log_{2}(p)-(1-p)\log_{2}(1-p)+32p\tag{8}\end{equation*} to \begin{equation*} H_{\mathrm{STC}} = -p\log_{2}(p)-(1-p)\log_{2}(1-p)+p\tag{9}\end{equation*} when compared to regular sparsification. At a sparsity rate of $p=0.01$, the additional compression achieved by ternarization is $H_{\mathrm{sparse}}/H_{\mathrm{STC}} = 4.414$. In order to achieve the same compression gains by pure sparsification, one would have to increase the sparsity rate by approximately the same factor.
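A quick sanity check of (8) and (9), evaluating the expected bits per parameter and the resulting gain at $p=0.01$ (the ratio comes out at roughly 4.4):

```python
# Expected bits per parameter: sparse float updates versus sparse ternary updates.
from math import log2

def h_sparse(p):   # binary sparsity pattern + 32-bit value per nonzero entry
    return -p * log2(p) - (1 - p) * log2(1 - p) + 32 * p

def h_stc(p):      # binary sparsity pattern + 1 sign bit per nonzero entry
    return -p * log2(p) - (1 - p) * log2(1 - p) + p

p = 0.01
print(h_sparse(p), h_stc(p), h_sparse(p) / h_stc(p))
```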

Using a theoretical framework developed by Stich et al. [33], we can prove the convergence of STC under standard assumptions on the loss function. The proof relies on bounding the impact of the perturbation caused by the compression operator. This is formalized in the following definition.

Definition 1 (k-Contraction)[33]:

For a parameter $0 < k\leq d$, a $k$-contraction is an operator $\text{comp}: \mathbb{R}^{d}\rightarrow \mathbb{R}^{d}$ that satisfies the contraction property \begin{equation*} \mathbb{E}\|x-\text{comp}(x)\|^{2}\leq \left({1-\frac{k}{d}}\right)\|x\|^{2} \quad \forall x\in \mathbb{R}^{d}.\tag{10}\end{equation*}

We can show that STC indeed is a $k$ -contraction.

Lemma 2:

STC$_{k}$ as defined in Algorithm 1 is a $\tilde{k}$-contraction, with \begin{equation*} 0 < \tilde{k} = \frac{\|\text{top}_{k}(x)\|_{1}^{2}}{k\|x\|_{2}^{2}}d \leq d.\tag{11}\end{equation*} The proof can be found in Appendix E in the Supplementary Material. It then directly follows from [33, Th. 2.4] that for any $L$-smooth, $\mu$-strongly convex objective function $f$ with bounded gradients $\mathbb{E}\|\Delta\mathcal{W}\|^{2}\leq G^{2}$, the update rule \begin{align*} \mathcal{W}^{(t+1)}&:=\mathcal{W}^{(t)}-\text{STC}_{k}\big(\mathcal{A}^{(t)}+\eta \Delta\mathcal{W}_{i_{t}}^{(t)}\big) \tag{12}\\ \mathcal{A}^{(t+1)}&:=\mathcal{A}^{(t)}+\Delta\mathcal{W}_{i_{t}}^{(t+1)}-\text{STC}_{k}\big(\Delta\mathcal{W}_{i_{t}}^{(t+1)}\big)\tag{13}\end{align*} converges according to \begin{equation*} \mathbb{E}[f(\overline{\mathcal{W}_{T}})]-f^{*}\leq \mathcal{O}\left(\frac{G^{2}}{\mu T}\right)+\mathcal{O}\left(\frac{d^{2}G^{2}L}{\tilde{k}^{2}\mu^{2}T^{2}}\right)+\mathcal{O}\left(\frac{d^{3}G^{2}}{\tilde{k}^{3}\mu T^{3}}\right).\tag{14}\end{equation*} This means that for $T\in \mathcal{O}\big((d/\tilde{k})(L/\mu)^{1/2}\big)$, STC converges at rate $\mathcal{O}(G^{2}/(\mu T))$, which is the same as for regular SGD!

Preliminary experiments are in line with our theoretical findings. Fig. 4 shows the final accuracy of the VGG11* model when trained at different sparsity levels with and without ternarization. As we can see, the additional ternarization has only a negligible effect on the convergence speed and sometimes even increases the final accuracy of the trained model. It seems evident that a combination of sparsity and quantization makes more efficient use of the communication budget than pure sparsification.

Fig. 4. Effects of ternarization at different levels of upload and download sparsity. Displayed is the difference in final accuracy (in %) between a model trained with sparse updates and a model trained with sparse binarized updates. Positive numbers indicate better performance of the model trained with pure sparsity. VGG11 trained on CIFAR10 for 16 000 iterations with five clients holding i.i.d. and non-i.i.d. data.

B. Extending to Downstream Compression

Existing compression frameworks that were proposed for distributed training (see [19], [20], [23], [25]) only compress the communication from clients to the server, which is sufficient for applications where aggregation can be achieved via an all-reduce operation. However, in the federated learning setting, where the clients have to download the aggregated weight-updates from the server, this approach is not feasible, as it will lead to a communication bottleneck.

To illustrate this point, let $\text{STC}_{k}: \mathbb{R}^{n}\rightarrow \mathbb{R}^{n},\ \Delta\mathcal{W} \mapsto \tilde{\Delta\mathcal{W}}$ be the compression operator that maps a (flattened) weight update $\Delta\mathcal{W}$ to a sparsified and ternarized weight update $\tilde{\Delta\mathcal{W}}$ according to Algorithm 1. For local weight updates $\Delta\mathcal{W}_{i}^{(t)}$, the update rule for STC can then be written as \begin{align*} \Delta\mathcal{W}^{(t+1)}&=\frac{1}{n}\sum_{i=1}^{n}\underbrace{\text{STC}_{k}\big(\Delta\mathcal{W}_{i}^{(t+1)}+A_{i}^{(t)}\big)}_{\tilde{\Delta\mathcal{W}}_{i}^{(t+1)}} \tag{15}\\ A_{i}^{(t+1)}&=A_{i}^{(t)}+\Delta\mathcal{W}_{i}^{(t+1)}-\tilde{\Delta\mathcal{W}}_{i}^{(t+1)}\tag{16}\end{align*} starting with an empty residual $A_{i}^{(0)}=0\in \mathbb{R}^{n}$ on all clients. While the updates $\tilde{\Delta\mathcal{W}}_{i}^{(t+1)}$ that are sent from the clients to the server are always sparse, the number of nonzero elements in the update $\Delta\mathcal{W}^{(t+1)}$ that is sent downstream grows, in the worst case, linearly with the number of participating clients. If the participation rate exceeds the inverse sparsity $1/p$, the update $\Delta\mathcal{W}^{(t+1)}$ essentially becomes dense.

To resolve this issue, we propose to apply the same compression mechanism that is used on the clients also at the server side to compress the downstream communication. This modifies the update rule to \begin{align*} \tilde{\Delta\mathcal{W}}^{(t+1)} = \text{STC}_{k}\left(\frac{1}{n}\sum_{i=1}^{n}\underbrace{\text{STC}_{k}\big(\Delta\mathcal{W}_{i}^{(t+1)}+A_{i}^{(t)}\big)}_{\tilde{\Delta\mathcal{W}}_{i}^{(t+1)}}+A^{(t)}\right) \tag{17}\end{align*} with client-side and server-side residual updates \begin{align*} A_{i}^{(t+1)}&=A_{i}^{(t)}+\Delta\mathcal{W}_{i}^{(t+1)}-\tilde{\Delta\mathcal{W}}_{i}^{(t+1)}\tag{18}\\ A^{(t+1)}&=A^{(t)}+\Delta\mathcal{W}^{(t+1)}-\tilde{\Delta\mathcal{W}}^{(t+1)}.\tag{19}\end{align*}
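The sketch below wires these pieces together for one communication round with upstream and downstream compression and error accumulation, following (17)–(19). The `stc` helper is the operator from Algorithm 1 (repeated here so the snippet is self-contained), and the state handling is an illustrative simplification rather than a full implementation of Algorithm 2.

```python
# One communication round with bidirectional STC and residual accumulation.
import numpy as np

def stc(t, p):
    k = max(int(p * t.size), 1)
    v = np.sort(np.abs(t))[-k]
    masked = np.where(np.abs(t) >= v, t, 0.0)
    mu = np.abs(masked).sum() / k
    return mu * np.sign(masked)

def stc_round(client_updates, client_residuals, server_residual, p):
    uploads = []
    for i, dw in enumerate(client_updates):
        up = stc(dw + client_residuals[i], p)          # inner compression of (17)
        client_residuals[i] += dw - up                 # client residual update (18)
        uploads.append(up)
    aggregate = np.mean(uploads, axis=0) + server_residual
    downstream = stc(aggregate, p)                     # outer compression of (17)
    server_residual[...] = aggregate - downstream      # server residual update (19)
    return downstream                                  # broadcast to all clients

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_clients, dim, p = 10, 10_000, 0.01
    residuals = [np.zeros(dim) for _ in range(n_clients)]
    server_residual = np.zeros(dim)
    updates = [rng.normal(size=dim) for _ in range(n_clients)]
    dw = stc_round(updates, residuals, server_residual, p)
    print("downstream nonzeros:", int((dw != 0).sum()), "of", dim)
```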

We can express this new update rule for both upload and download compression (17) as a special case of pure upload compression (15) with generalized filter masks. Let $M_{i}$, $i=1,\ldots,n$, be the sparsifying filter masks used by the respective clients during the upload and $M$ be the one used by the server during the download. Then, we would arrive at the same sparse update $\tilde{\Delta\mathcal{W}}^{(t+1)}$ if all clients used filter masks $\tilde{M}_{i}=M_{i}\odot M$, where $\odot$ is the Hadamard product. We, thus, predict that training models using this new update rule should behave similarly to regular upstream-only sparsification, but with a slightly increased sparsity rate. We verify this prediction experimentally.

Fig. 5 shows the accuracies achieved by VGG11 on CIFAR10, when trained in a federated learning environment with five clients for 10 000 iterations at different rates of upload and download compression. As we can see, for as long as download and upload sparsity are of the same order, sparsifying the download is not very harmful to the convergence and decreases the accuracy by at most 2% in both the i.i.d. and the non-i.i.d. case.

Fig. 5. Accuracy achieved by VGG11* when trained on CIFAR in a distributed setting with five clients for 16 000 iterations at different levels of upload and download sparsity. Sparsifying the updates for downstream communication reduces the final accuracy by at most 3% when compared to using only upload sparsity.

C. Weight Update Caching for Partial Client Participation

Thus far, we have only considered scenarios in which all clients participate throughout the entire training process. However, as elaborated in Section II, in federated learning, typically only a fraction of the entire client population participates in any particular communication round. As clients do not download the full model $\mathcal{W}^{(t)}$, but only compressed model updates $\Delta\tilde{\mathcal{W}}^{(t)}$, this introduces new challenges when it comes to keeping all clients synchronized.

To solve the synchronization problem and reduce the workload for the clients, we propose to use a caching mechanism on the server. Assume that the last $\tau$ communication rounds have produced the updates $\{\tilde{\Delta\mathcal{W}}^{(t)}\,|\,t=T-1,\ldots,T-\tau\}$. The server can cache all partial sums of these updates up until a certain point, $\{P^{(s)}=\sum_{t=1}^{s}\tilde{\Delta\mathcal{W}}^{(T-t)}\,|\,s=1,\ldots,\tau\}$, together with the global model $\mathcal{W}^{(T)}=\mathcal{W}^{(T-\tau-1)}+\sum_{t=1}^{\tau}\tilde{\Delta\mathcal{W}}^{(T-t)}$. Every client that wants to participate in the next communication round first has to synchronize itself with the server by downloading either $P^{(s)}$ or $\mathcal{W}^{(T)}$, depending on how many previous communication rounds it has skipped. For general sparse updates, the bound on the entropy \begin{equation*} H(P^{(\tau)}) \leq \tau H(P^{(1)})=\tau H(\tilde{\Delta\mathcal{W}}^{(T-1)})\tag{20}\end{equation*} can be attained. This means that the size of the download grows linearly with the number of rounds a client has skipped. The average number of skipped rounds is equal to the inverse participation fraction $1/\eta$. This is usually tolerable, as the downlink typically is cheaper and has far higher bandwidth than the uplink, as already noted in [10] and [19]. Essentially, all compression methods that communicate only parameter updates instead of full models suffer from this same problem. This is also the case for signSGD, although here the size of the downstream update only grows logarithmically with the delay period according to \begin{equation*} H\big(P_{\mathrm{signSGD}}^{(\tau)}\big)\leq \log_{2}(2\tau+1).\tag{21}\end{equation*}
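A minimal sketch of such a server-side cache is given below; the class name, the deque-based bookkeeping, and the fallback to the full model are our own illustrative choices under the description above, not a prescription from the article.

```python
# Server-side cache of recent compressed updates and their partial sums.
from collections import deque
import numpy as np

class UpdateCache:
    def __init__(self, dim, tau):
        self.updates = deque(maxlen=tau)      # the tau most recent compressed global updates
        self.global_model = np.zeros(dim)     # W^(T)

    def push(self, compressed_update):
        self.updates.appendleft(compressed_update)
        self.global_model += compressed_update

    def sync(self, rounds_skipped):
        """Return what a returning client has to download to get back in sync."""
        if rounds_skipped <= 0:
            return np.zeros_like(self.global_model)    # already up to date
        if rounds_skipped > len(self.updates):
            return self.global_model                   # fall back to the full model W^(T)
        # partial sum P^(s): sum of the s most recent updates
        return np.sum(list(self.updates)[:rounds_skipped], axis=0)

if __name__ == "__main__":
    cache = UpdateCache(dim=1000, tau=8)
    rng = np.random.default_rng(0)
    for _ in range(10):
        cache.push(rng.normal(size=1000) * (rng.random(1000) < 0.01))  # sparse mock updates
    print(cache.sync(3).shape, cache.sync(50).shape)
```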

Partial client participation also has effects on the convergence speed of federated training, both with delayed and sparsified updates. We will investigate these effects in detail in Section VII-C.

D. Lossless Encoding

To communicate a set of sparse ternary tensors produced by STC, we only need to transfer the positions of the nonzero elements in the flattened tensors, along with one bit per nonzero element to indicate whether its value is $\mu$ or $-\mu$. Instead of communicating the absolute positions of the nonzero elements, it is favorable to communicate the distances between them. Assuming a random sparsity pattern, we know that for large values of $|\mathcal{W}|$ and $k=p|\mathcal{W}|$, the distances are approximately geometrically distributed with success probability equal to the sparsity rate $p$. Therefore, we can optimally encode the distances using the Golomb code [34]. The Golomb encoding reduces the average number of position bits to \begin{equation*} \bar{\texttt{b}}_{\mathrm{pos}} = \mathbf{b}^{*}+\frac{1}{1-(1-p)^{2^{\mathbf{b}^{*}}}}\tag{22}\end{equation*} with $\mathbf{b}^{*}=1+\lfloor \log_{2}(\log(\phi-1)/\log(1-p))\rfloor$ and $\phi = (\sqrt{5}+1)/2$ the golden ratio. For a sparsity rate of, e.g., $p=0.01$, we get $\bar{\texttt{b}}_{\mathrm{pos}}=8.38$, which translates to a $\times 1.9$ compression compared to a naive distance encoding with 16 fixed bits. Both the encoding and the decoding scheme can be found in Section A of the Appendix (Algorithms A1 and A2) in the Supplementary Material. The updates are encoded both before upload and before download.
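For illustration, the following is a minimal Golomb–Rice style encoder and decoder for the distances between nonzero entries, using the parameter $\mathbf{b}^{*}$ from (22) as the number of remainder bits. It is a simplified stand-in for Algorithms A1 and A2 of the Appendix, not a reproduction of them.

```python
# Minimal Golomb-Rice sketch for geometric distances (success probability p).
from math import floor, log, log2, sqrt

def golomb_parameter(p):
    phi = (sqrt(5) + 1) / 2
    return 1 + floor(log2(log(phi - 1) / log(1 - p)))      # b* from (22)

def encode_distances(distances, b):
    bits = []
    for d in distances:
        q, r = d >> b, d & ((1 << b) - 1)
        bits.extend([1] * q + [0])                          # quotient in unary
        bits.extend((r >> i) & 1 for i in reversed(range(b)))  # remainder in b fixed bits
    return bits

def decode_distances(bits, b):
    out, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == 1:                                 # read the unary quotient
            q, i = q + 1, i + 1
        i += 1                                              # skip the terminating 0
        r = 0
        for _ in range(b):                                  # read the b remainder bits
            r, i = (r << 1) | bits[i], i + 1
        out.append((q << b) | r)
    return out

if __name__ == "__main__":
    p = 0.01
    b = golomb_parameter(p)
    dists = [3, 120, 57, 980, 12]
    encoded = encode_distances(dists, b)
    assert decode_distances(encoded, b) == dists
    print("b* =", b, "| bits used:", len(encoded))
```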

The complete compression framework that features upstream and downstream compression via sparsification, ternarization, and optimal encoding of the updates is described in Algorithm 2.

Algorithm 2 Efficient Federated Learning With Parameter Server via STC (full listing in the original article).

SECTION VII.

Experiments

We evaluate our proposed communication protocol on four different learning tasks and compare its performance to federated averaging and signSGD in a wide variety of federated learning environments.

Models and Data Sets: To cover a broad spectrum of learning problems, we evaluate on differently sized convolutional and recurrent neural networks for the relevant federated learning tasks of image classification and speech recognition:

VGG11* on CIFAR: We train a modified version of the popular 11-layer VGG11 network [28] on the CIFAR [30] data set. We simplify the VGG11 architecture by reducing the number of convolutional filters to [32, 64, 128, 128, 128, 128, 128, 128] in the respective convolutional layers and reducing the size of the hidden fully-connected layers to 128. We also remove all dropout layers and batch-normalization layers as the regularization is no longer required. Batch normalization has been observed to perform very poorly with both small batch sizes and non-i.i.d. data [35], and we do not want this effect to obscure the investigated behavior. The resulting VGG11* network still achieves 85.46% accuracy on the validation set after 20 000 iterations of training with a constant learning rate of 0.16 and contains 865 482 parameters.

CNN on KWS: We train the four-layer convolutional neural network (CNN) from [27] on the speech commands data set [36]. The speech commands data set consists of 51 088 different speech samples of specific keywords. There are 30 different keywords in total, and every speech sample has a duration of 1 s. Like [32], we restrict ourselves to the subset of the ten most common keywords. For every speech command, we extract the Mel spectrogram from the short-time Fourier transform, which results in a $32 \times 32$ feature map. The CNN architecture achieves 89.12% accuracy after 10 000 training iterations and has 876 938 parameters in total.

LSTM on Fashion-MNIST: We also train a Long Short-Term Memory (LSTM) network with two hidden layers of size 128 on the Fashion-MNIST data set [37]. The Fashion-MNIST data set contains 60 000 training and 10 000 validation greyscale images of ten different fashion items. Every $28 \times 28$ image is treated as a sequence of 28 features of dimensionality 28 and fed as such into the many-to-one LSTM network. After 20 000 training iterations with a learning rate of 0.04, the LSTM model achieves 90.21% accuracy on the validation set. The model contains 216 330 parameters.

Logistic Regression on MNIST: Finally, we also train a simple logistic regression classifier on the MNIST [31] data set. The MNIST data set contains 60 000 training and 10 000 test greyscale images of handwritten digits of size $28\times28$ . The trained logistic regression classifier achieves 92.31% accuracy on the test set and contains 7850 parameters.

The different learning tasks are summarized in Table II. In the following, we will primarily discuss the results for VGG11* trained on CIFAR; however, the described phenomena carry over to all other benchmarks and the supporting experimental results can be found in the Appendix in the Supplementary Material.

TABLE II Models and Hyperparameters. The Learning Rate is Kept Constant Throughout Training

Compression Methods: We compare our proposed STC method at a sparsity rate of $p=1/400$ with federated averaging at an "equivalent" delay period of $n=400$ iterations and signSGD with a coordinatewise step size of $\delta =0.0002$. At a sparsity rate of $p=1/400$, STC compresses updates both during upload and download by roughly a factor of $\times 1050$. A delay period of $n=400$ iterations for federated averaging results in a slightly smaller compression rate of $\times 400$. Further analysis of the effects of the sparsity rate $p$ and delay period $n$ on the convergence speed of STC and federated averaging can be found in Section C of the Appendix in the Supplementary Material. During our experiments, we keep all training-related hyperparameters constant for the different compression methods. To compare the different methods in a fair way, all methods are given the same budget of training iterations in the following experiments (one communication round of federated averaging uses up $n$ iterations, where $n$ is the number of local iterations).

Learning Environment: The federated learning environment described in Algorithm 2 can be fully characterized by five parameters. For the base configuration, we set the number of clients to 100, the participation ratio to 10%, and the local batch size to 20 and assign every client an equally sized subset of the training data containing samples from ten different classes. In the following experiments, if not explicitly signified otherwise, all hyperparameters will default to this base configuration summarized in Table III. We will use the short notations “Clients: $\eta N/N$ ” and “Classes: $c$ ” to refer to a setup of the federated learning environment in which a random subset of $\eta N$ out of a total of $N$ clients participates in every communication round and every client is holding data from exactly $c$ different classes.

TABLE III Base Configuration of the Federated Learning Environment in Our Experiments

A. Momentum in Federated Optimization

We start out by investigating the effects of momentum optimization on the convergence behavior of the different compression methods. Figs. 6–9 show the final accuracy achieved by federated averaging ($n=400$), STC ($p=1/400$), and signSGD after 20 000 training iterations in a variety of different federated learning environments. In Figs. 6–9, dashed lines refer to experiments where a momentum of $m=0.9$ was used during training, while solid lines signify that classical SGD was used. As we can see, momentum has a significant influence on the convergence behavior of the different methods. While signSGD always performs distinctively better if momentum is turned on during the optimization, the picture is less clear for STC and federated averaging. We can make out three different parameters of the learning environment that determine whether momentum is beneficial or harmful to the performance of STC. If the participation rate is high and the batch size used during training is sufficiently large (see Fig. 7 left), momentum improves the performance of STC. Conversely, momentum will deteriorate the training performance in situations where training is carried out on small batches and with low client participation. The latter effect is increasingly strong if clients hold non-i.i.d. subsets of data [see Fig. 6 (right)]. These results are not surprising, as the issues with stale momentum described in [25] are enhanced in these situations. Similar relationships can be observed for federated averaging, where again the size (see Fig. 7) and the heterogeneity (see Fig. 6) of the local minibatches determine whether the momentum will have a positive effect on the training performance or not.

Fig. 6. Robustness of different compression methods to the non-i.i.d.-ness of client data on four different benchmarks; shown here: VGG11* trained on CIFAR. STC distinctively outperforms federated averaging on non-i.i.d. data. The learning environment is configured as described in Table III. Dashed lines signify that a momentum of $m=0.9$ was used.

Fig. 7. Maximum accuracy achieved by the different compression methods when training VGG11* on CIFAR for 20 000 iterations at varying batch sizes in a federated learning environment with ten clients and full participation. Left: every client holds data from exactly two different classes. Right: every client holds an i.i.d. subset of data.

Fig. 8. Validation accuracy achieved by VGG11* on CIFAR after 20 000 iterations of communication-efficient federated training with different compression methods. The relative client participation fraction is varied between 100% (5/5) and 5% (5/100). Left: every client holds data from exactly two different classes. Right: every client holds an i.i.d. subset of data.

Fig. 9. Validation accuracy achieved by VGG11* on CIFAR after 20 000 iterations of communication-efficient federated training with different compression methods. The training data are split among the clients at different degrees of unbalancedness, with $\gamma$ varying between 0.9 and 1.0.

When we compare federated averaging, signSGD and STC in the following, we will ignore whichever version of these methods (momentum “on” or “off”) performs worse.

B. Non-i.i.d.-ness of the Data

Our preliminary experiments in Section V have already demonstrated that the convergence behavior of both federated averaging and signSGD is very sensitive to the degree of i.i.d.-ness of the local client data, whereas sparse communication seems to be more robust. We will now investigate this behavior in some more detail. Fig. 6 shows the maximum achieved generalization accuracy after a fixed number of iterations for VGG11* trained on CIFAR at different levels of non-i.i.d.-ness. Additional results on all other benchmarks can be found in Fig. A2 in the Appendix in the Supplementary Material. Both at full (left plot) and partial (right plot) client participation, STC outperforms federated averaging across all levels of i.i.d.-ness. The most distinct difference can be observed in the non-i.i.d. regime, where the individual clients hold less than five different classes. Here, STC (without momentum) outperforms both federated averaging and signSGD by a wide margin. In the extreme case where every client only holds data from exactly one class, STC still achieves 79.5% and 53.2% accuracy at full and partial client participation, respectively, while both federated averaging and signSGD fail to converge at all.

C. Robustness to Other Parameters of the Learning Environment

We will now proceed to investigate the effects of other parameters of the learning environment on the convergence behavior of the different compression methods. Figs. 7–​9 show the maximum achieved accuracy after training VGG11* on CIFAR for 20 000 iterations in different federated learning environments. Additional results on the three other benchmarks can be found in Section D in the Appendix in the Supplementary Material.

We observe that STC (without momentum) consistently dominates federated averaging across all benchmarks and learning environments.

1) Local Batch Size:

The memory capacity of mobile and IoT devices is typically very limited. As the memory footprint of SGD is proportional to the batch size used during training, clients might be restricted to training on small minibatches only. Fig. 7 shows the influence of the local batch size on the performance of the different communication-efficient federated learning techniques, exemplified by VGG11* trained on CIFAR. First of all, we notice that using momentum significantly slows down the convergence of both STC and federated averaging at batch sizes smaller than 20, independent of the distribution of data among the clients. As we can see, even if the training data are distributed among the clients in an i.i.d. manner (see Fig. 7 right) and all clients participate in every training iteration, federated averaging suffers considerably from small batch sizes. STC, on the other hand, proves far more robust to this type of constraint. At an extreme batch size of one, the model trained with STC still achieves an accuracy of 63.8%, while the federated averaging model only reaches 39.2% after 20 000 training iterations.

2) Client Participation Fraction:

Fig. 8 shows the convergence speed of VGG11* trained on CIFAR10 in a federated learning environment with different degrees of client participation. To isolate the effects of reduced participation, we keep the absolute number of participating clients and the local batch size at constant values of 5 and 40, respectively, throughout all experiments and vary only the total number of clients (and thus the relative participation $\eta$). As we can see, reducing the participation rate has negative effects on both federated averaging and STC. The causes for these negative effects, however, are different. In federated averaging, the participation rate is proportional to the effective amount of data that the training is conducted on in any individual communication round. If a nonrepresentative subset of clients is selected to participate in a particular communication round of federated averaging, this can steer the optimization process away from the minimum and might even cause catastrophic forgetting [38] of previously learned concepts. Partial participation, on the other hand, reduces the convergence speed of STC by causing the clients' residuals to go out of sync and increasing gradient staleness [25]. The more rounds a client has to wait before it is selected to participate in training again, the more outdated its accumulated gradients become. We observe this behavior for STC most strongly in the non-i.i.d. situation (see Fig. 8 left), where the accuracy steadily decreases with the participation rate. However, even in the extreme case where only 5 out of 400 clients participate in every round of training, STC still achieves higher accuracy than federated averaging and signSGD. If the clients hold i.i.d. data (see Fig. 8 right), STC suffers much less from a reduced participation rate than federated averaging. If only 5 out of 400 clients participate in every round, STC (without momentum) still manages to achieve an accuracy of 68.2%, while federated averaging stagnates at 42.3% accuracy. signSGD is affected the least by reduced participation, which is unsurprising, as only the absolute number of participating clients has a direct influence on its performance. Similar behavior can be observed on all other benchmarks, and the results can be found in Fig. A3 in the Appendix in the Supplementary Material. It is noteworthy that in federated learning, the server can usually exercise some control over the rate of client participation. For instance, it is typically possible to increase the participation ratio at the cost of longer waiting times for all clients to finish.

3) Unbalancedness:

Up until now, all experiments were performed with a balanced split of data in which every client was assigned the same number of data points. In practice, however, the data sets on different clients will typically vary heavily in size. To simulate different degrees of unbalancedness, we split the data among the clients such that the $i$th out of $n$ clients is assigned a fraction \begin{equation*} \varphi _{i}(\alpha, \gamma)=\frac {\alpha }{n}+(1-\alpha)\frac {\gamma ^{i}}{\sum _{j=1}^{n}\gamma ^{j}}\tag{23}\end{equation*} of the total data. The parameter $\alpha$ controls the minimum amount of data on every client, while the parameter $\gamma$ controls the concentration of data. We fix $\alpha =0.1$ and vary $\gamma$ between 0.9 and 1.0 in our experiments. To amplify the effects of unbalanced client data, we also set the client participation to a low value of only 5 out of 200 clients. Fig. 9 shows the final accuracy achieved after 20 000 iterations for different values of $\gamma$. Interestingly, the unbalancedness of the data does not seem to have a significant effect on the performance of any of the compression methods. Even if the data are highly concentrated on a few clients (as is the case for $\gamma =0.9$), all methods converge reliably, and for federated averaging, the accuracy even goes down slightly with increased balancedness. Apparently, the rare participation of large clients can balance out several communication rounds with much smaller clients. These results also carry over to all other benchmarks (see Fig. A5 in the Appendix in the Supplementary Material).
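The fractions in (23) sum to one by construction: the $\alpha$-term contributes $\alpha$ in total and the geometric term contributes $1-\alpha$. A minimal NumPy sketch of this assignment (the function name is illustrative, not from the article) reads as follows.

import numpy as np

def unbalanced_fractions(n_clients, alpha=0.1, gamma=0.95):
    """Fraction of the total data assigned to client i according to (23):
    phi_i = alpha / n + (1 - alpha) * gamma**i / sum_j gamma**j."""
    i = np.arange(1, n_clients + 1)
    geo = gamma ** i
    return alpha / n_clients + (1 - alpha) * geo / geo.sum()

# With alpha = 0.1 and gamma = 0.9, most of the data is concentrated on the
# first few clients, while every client still keeps at least alpha/n of it.
phi = unbalanced_fractions(200, alpha=0.1, gamma=0.9)
assert np.isclose(phi.sum(), 1.0)  # the fractions sum to one by construction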

D. Communication Efficiency

Finally, we compare the different compression methods with respect to the number of iterations and communicated bits they require to achieve a certain target accuracy on a federated learning task. As we saw in Section V, both federated averaging and signSGD perform considerably worse if clients hold non-i.i.d. data or use small batch sizes. To still allow a meaningful comparison, we therefore evaluate this time on an i.i.d. environment where every client holds ten different classes and uses a moderate batch size of 20 during training. This setup favors federated averaging and signSGD to the maximum degree possible! All other parameters of the learning environment are set to the base configuration given in Table III. We train until the target accuracy is achieved or a maximum number of iterations is exceeded and measure the number of communicated bits for both upload and download. Fig. 10 shows the results for VGG11* trained on CIFAR, the CNN trained on keyword spotting (KWS), and the LSTM model trained on Fashion-MNIST. We can see that even if all clients hold i.i.d. data, STC still achieves the desired target accuracy within the smallest communication budget of all methods. STC also converges faster in terms of training iterations than the versions of federated averaging with a comparable compression rate. Unsurprisingly, we see that both federated averaging and STC face a tradeoff between the number of training iterations ("computation") and the number of communicated bits ("communication"). On all investigated benchmarks, however, STC is Pareto-superior to federated averaging in the sense that for any fixed iteration complexity, it achieves a lower (upload) communication complexity.

Fig. 10. Convergence speed of federated learning with compressed communication in terms of training iterations (left) and uploaded bits (right) on three different benchmarks (top to bottom) in an i.i.d. federated learning environment with 100 clients and 10% participation fraction. For better readability, the validation error curves are average-smoothed with a step size of five. On all benchmarks, STC requires the least amount of bits to converge to the target accuracy.

Table IV shows the amount of upstream and downstream communication required to achieve the target accuracy with the different methods, in megabytes. On the CIFAR learning task, STC at a sparsity rate of $p=0.0025$ only communicates 183.9 MB worth of data, which is a reduction in communication by a factor of $\times 199.5$ compared to the baseline, which requires 36 696 MB; federated averaging ($n=100$) still requires 1606 MB. Federated averaging with a delay period of 1000 steps does not achieve the target accuracy within the given iteration budget.
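For intuition on where these savings come from, the following back-of-the-envelope sketch compares the per-round upload volume of an uncompressed float32 update with that of a sparse ternary update. The assumed cost of roughly log2(1/p) + 2.5 bits per nonzero element (Golomb-coded position plus sign) and the parameter count are rough placeholders of ours, not the exact expressions or model sizes used in this article.

import math

def upload_bits_per_round(n_weights, p=None):
    """Rough per-round upload estimate (sketch with assumed constants)."""
    if p is None:                              # uncompressed float32 update
        return 32 * n_weights
    bits_per_nonzero = math.log2(1 / p) + 2.5  # assumed: Golomb position + sign
    return int(p * n_weights * bits_per_nonzero)

n_weights = 10_000_000                         # illustrative parameter count only
dense = upload_bits_per_round(n_weights)
sparse = upload_bits_per_round(n_weights, p=0.0025)
print(f"per-round upload compression ~ x{dense / sparse:.0f}")

The end-to-end factor of ×199.5 reported in Table IV is smaller than such a per-round figure because it also accounts for the number of communication rounds needed to reach the target accuracy and for the actual encoding overhead.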

TABLE IV Bits Required for Upload and/or Download to Achieve a Certain Target Accuracy on Different Learning Tasks in an i.i.d. Learning Environment. A Value of "n.a." in the Table Signifies That the Method Has Not Achieved the Target Accuracy Within the Iteration Budget. The Learning Environment Is Configured as Described in Table III

SECTION VIII.

Lessons Learned

We will now summarize the findings of this article and give general suggestions on how to approach communication-constrained federated learning problems (see the summary in Fig. 11).

  1. If clients hold non-i.i.d. data, sparse communication protocols such as STC distinctively outperform federated averaging across all federated learning environments [see Figs. 6, 7 (left), and 8 (left)].

  2. The same holds true if clients are forced to train on small minibatches (e.g., because the hardware is memory-constrained). In these situations, STC outperforms federated averaging even if the clients' data are i.i.d. [see Fig. 7 (right)].

  3. STC should also be preferred over federated averaging if the client participation rate is expected to be low, as it converges more stably and quickly in both the i.i.d. and non-i.i.d. regimes [see Fig. 8 (right)].

  4. STC is generally most advantageous in situations where the communication is bandwidth-constrained or costly (metered network, limited battery), as it achieves a given target accuracy with the minimum amount of communicated bits, even on i.i.d. data (see Fig. 10 and Table IV).

  5. Federated averaging, in turn, should be used if the communication is latency-constrained or if the client participation is expected to be very low (and 1–3 do not hold).

  6. Momentum optimization should be avoided in federated learning whenever either clients are training with small batch sizes or the client data are non-i.i.d. and the participation rate is low (see Figs. 6–​8).

Fig. 11. Left: accuracy achieved by VGG11* on CIFAR after 20 000 iterations of federated training with federated averaging and STC for three different configurations of the learning environment. Right: upstream and downstream communication necessary to achieve a validation accuracy of 84% with federated averaging and STC on the CIFAR benchmark under i.i.d. data and a moderate batch size.

SECTION IX.

Conclusion

Federated learning for mobile and IoT applications is a challenging task, as generally little to no control can be exerted over the properties of the learning environment.

In this article, we demonstrated that the convergence behavior of current methods for communication-efficient federated learning is very sensitive to these properties. On a variety of different data sets and model architectures, we observe that the convergence speed of federated averaging drastically decreases in learning environments where the clients hold non-i.i.d. subsets of data, are forced to train on small minibatches, or where only a small fraction of clients participates in every communication round. To address these issues, we propose STC, a communication protocol that compresses both the upstream and downstream communications via sparsification, ternarization, error accumulation, and optimal Golomb encoding. Our experiments show that STC is far more robust to the above-mentioned peculiarities of the learning environment than federated averaging. Moreover, STC converges faster than federated averaging both with respect to the number of training iterations and the amount of communicated bits, even if the clients hold i.i.d. data and use moderate batch sizes during training.
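As a reference point, the following minimal NumPy sketch illustrates the client-side compression step built from these ingredients (top-$k$ sparsification, ternarization to the mean magnitude of the retained entries, and error accumulation); the Golomb encoding of the nonzero positions is omitted, and the function name is illustrative rather than taken from a reference implementation.

import numpy as np

def stc_compress(update, residual, p=0.0025):
    """Sparse ternary compression of a flattened weight update (sketch).
    Returns the compressed update and the new error-accumulation residual."""
    accumulated = update + residual                      # error accumulation
    k = max(1, int(p * accumulated.size))
    idx = np.argpartition(np.abs(accumulated), -k)[-k:]  # top-k by magnitude
    mu = np.mean(np.abs(accumulated[idx]))               # mean magnitude of top-k
    compressed = np.zeros_like(accumulated)
    compressed[idx] = mu * np.sign(accumulated[idx])     # ternary {-mu, 0, +mu}
    residual = accumulated - compressed                  # keep the untransmitted part
    return compressed, residual

In the full protocol, the server compresses the averaged update in the same way before broadcasting it, which yields the downstream compression described above.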

Our approach can be understood as an alternative paradigm for communication-efficient federated optimization that relies on high-frequency low-volume instead of low-frequency high-volume communication. As such, it is particularly well suited for federated learning environments that are characterized by low-latency, low-bandwidth channels between clients and server.
