In a centralized machine learning architecture, participants need to upload their sensitive data to a central data server. However, centralized data is more susceptible to leakage, and even the central server itself may not be entirely trustworthy. To enable safe cooperation among participants, Google proposed federated learning (FL) and the Federated Averaging algorithm (FedAvg), which builds on the stochastic gradient descent (SGD) optimization algorithm [1]. The main idea of FL is that clients train machine learning models locally and then upload the trained gradients or updated model parameters to the central server for aggregation, which yields a global model for the next round of training. As a form of distributed machine learning, FL never collects raw data from clients, but instead trains models by exchanging training gradients or model parameters. Hence, FL improves data utilization efficiency and reduces the risk of client data privacy leakage. However, the information exchanged during training can still leak privacy. Both internal participants (participating clients and the central server) and external participants (model consumers and eavesdroppers) are potential attackers, and the main attack method is the inference attack, including inference of class representatives [2], [3], memberships [4], properties of the training data [5], and training samples and labels [6]. As a result, privacy-preserving FL (PPFL) has been heralded as a solution that further protects client privacy by encrypting or randomizing the intermediate information exchanged between clients and the server during FL [7]. Differential privacy (DP) [8] is the main privacy protection method in FL. Existing works on DP-based PPFL include central DP (CDP) [9], [10] and local DP (LDP) [11]. CDP relies on a trusted server to add noise to the aggregated model.
On the other hand, LDP adds noise to the training gradients or model parameters locally at the clients, without relying on a trusted aggregation server. However, because the noise is added without collaborative optimization, a larger amount of noise must be introduced, which significantly degrades the performance of the trained model. The privacy-utility trade-off in DP therefore remains a major challenge.
The essence of DP-based privacy preservation is data obfuscation through random perturbation, which can be characterized by a probabilistic mapping closely related to methods in information theory (IT). IT has therefore become an effective tool for measuring privacy in DP research. Specifically, [12] proposed applying quantitative information flow to measure the uncertainty of private information in DP and abstracted the DP noise mechanism as channel noise in IT, laying the foundation for IT-based research on DP. Subsequently, results from IT have been applied to DP to measure privacy leakage and to analyze the trade-off between privacy and utility [13], [15], [17].
In this paper, we study the privacy-utility trade-off in DP-based PPFL by considering a Gaussian DP mechanism applied at either the clients or the central server. For a given privacy budget, we establish lower bounds on the variance of the Gaussian noise added at the clients or at the central server of the FL model, and show that the utility of the global model is the same in both cases. Furthermore, we propose a privacy-preserving scheme that maximizes the utility of the global model while preserving the different privacy requirements of all clients.
The rest of this paper is organized as follows. Section II briefly reviews related works. Section III introduces FL and the definition of MI-DP. Sections IV and V present the problem formulation and the main results. Section VI reports the experimental results, and Section VII concludes the paper.
The concept of federated learning was initially introduced with the aim of training models while protecting user privacy. However, the local model parameters or gradients, the aggregated model parameters or gradients, and the final model all contain users’ private information, so privacy-preserving measures for the intermediate model parameters or gradients are essential [7]. Compared with cryptographic methods such as secure multiparty computation and homomorphic encryption, DP-based PPFL sacrifices some data utility to achieve data privacy, requires no complex encryption and decryption algorithms, and has lower communication overhead [24]. Recently, [9] was the first to apply the DP-SGD algorithm within the framework of federated learning, effectively ensuring user-level DP protection. Similarly, [10] builds on user-level DP and achieves privacy protection by adding noise at the server; it also provides guiding principles for parameter tuning when using the DP technique to train a complex model. Based on LDP, [11] proposed a privacy protection scheme for federated learning and introduced a method for selectively sharing model parameter updates during different rounds of the iterative training process. In [25], an LDP additive-increase and multiplicative-decrease multi-resource allocation algorithm was developed for federated settings, eliminating the need for inter-agent communication. Note that the aforementioned studies are all based on the classical DP technique and do not study DP from an IT perspective.
For IT-based research on DP, a privacy risk-distortion function was proposed in [13], which suggests that privacy query mechanisms should aim to minimize the mutual information between the query results and the database subject to a constraint on the expected distortion, or to minimize the conditional entropy of the query results given the database. Subsequently, in the scenario of non-interactive data publishing, [14] quantified the trade-off between data privacy requirements and the utility of the encoded data based on rate-distortion theory. In [15], a unified privacy-distortion model was formulated to establish fundamental connections among identifiability, DP, and mutual information privacy. It was proven that under certain conditions, optimal mutual information mechanisms can satisfy $\epsilon $-DP. Reference [16] investigated the optimal channel mechanism that balances privacy and data utility under Hamming distortion for finite discrete data sources. Moreover, [17] proposed the definition of mutual information differential privacy (MI-DP) and further established the relationship between the privacy protection strength of DP and MI-DP. Note that the aforementioned studies [13], [15], [17] focus only on traditional data analysis query scenarios without considering federated learning, whereas this paper applies the IT-based DP technique to the practical scenario of federated learning.
SECTION III.
Preliminaries
A. Federated Learning
A general FL system consists of one central server and $N$
clients. The $N$
clients with private data learn an AI model with the help of the server which does not need to know the clients’ database. They find a parameter vector $\mathbf {w}\in \mathbb {R}^{d}$
of the AI model to minimize certain loss functions together. Such an optimization problem can be formulated as:\begin{equation*} {\mathbf {w}^{*}}=\underset {\mathbf {w}}{\mathop {\arg \min }}\underset {k=1}{\overset {N}{\mathop \sum }}\,{p_{k}}{F_{k}}\left ({\mathbf {w} }\right), \tag{3.1}\end{equation*}
where $F_{k}(\cdot)$
is the local loss function of the $k$
-th client, $p_{k}=\frac {\left |{ {\mathcal {D}_{k}} }\right |}{\sum \limits _{i=1}^{N}{\left |{ {\mathcal {D}_{i}} }\right |}}$
, and $\left |{ {\mathcal {D}_{k}} }\right |$
is the cardinality of $k$
-th client database $\mathcal {D}_{k}$
.
In the FL process, all clients and the server initialize a training model with the same structure and the same initial parameters ${{\bar {\mathbf {w}}}^{0}}$. In each global iteration round $t$, where $t\in \{1,2,\ldots,T\}$, the training process consists of the following steps:
Broadcasting: The server broadcasts the global aggregated parameter vector ${{\bar {\mathbf {w}}}^{t-1}}$ to the $N$ clients. In the first round, the server broadcasts ${{\bar {\mathbf {w}}}^{0}}$, which is the initialized rather than an aggregated parameter vector.
Local training: Each client loads the global parameter vector ${{\bar {\mathbf {w}}}^{t-1}}$ into its local model, trains on its local database ${\mathcal {D}_{k}}$, and sends the locally trained parameter vector $\mathbf {w}_{k}^{t}$ to the server ($k\in \{1,\ldots,N\}$).
Model aggregating: The server performs secure aggregation over the uploaded parameters from $N$
clients to obtain an updated parameter vector. Formally, the server aggregates the parameters sent from the $N$
clients as:\begin{equation*} {{\bar {\mathbf {w}}}^{t}}=\underset {k=1}{\overset {N}{\mathop \sum }}\,{p_{k}}\mathbf {w}_{k}^{t}. \tag{3.2}\end{equation*}
With enough communication rounds $T$
, the optimization problem (3.1) can be solved.
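The three steps above can be sketched as a short numpy loop; `fedavg` and its `local_train(w, k)` callback are hypothetical names introduced here as a sketch, with the local training pass left as a placeholder:

```python
import numpy as np

def fedavg(local_train, dataset_sizes, w0, rounds):
    """Minimal FedAvg round structure: broadcast, local training, aggregation (3.2).

    `local_train(w, k)` is a placeholder for client k's local training pass."""
    sizes = np.asarray(dataset_sizes, dtype=float)
    p = sizes / sizes.sum()              # aggregation weights p_k = |D_k| / sum_i |D_i|
    w_bar = np.asarray(w0, dtype=float)  # shared initialization w^0
    for _ in range(rounds):
        # Broadcasting + local training: each client starts from the global parameters
        local_ws = [local_train(w_bar.copy(), k) for k in range(len(p))]
        # Model aggregating per (3.2): weighted average of the uploaded parameters
        w_bar = sum(pk * wk for pk, wk in zip(p, local_ws))
    return w_bar
```

With equal dataset sizes the aggregation reduces to a plain average, matching the $p_{k}=1/N$ case used repeatedly later in the paper.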
B. Mutual Information Differential Privacy
FL provides a degree of privacy since the clients’ data is never shared publicly. However, a certain amount of private client information can still be inferred from the shared information of the local models. DP provides a rigorous criterion for privacy preservation, which aims to introduce a level of uncertainty into the released model sufficient to mask the contribution of any individual. DP is quantified by the privacy loss parameters ($\epsilon,\delta $), where smaller ($\epsilon,\delta $) corresponds to stronger privacy.
Definition 1 (($\epsilon,\delta $
)-DP[8]):
A randomized mechanism $\mathcal {M}$
: $\mathcal {D}\rightarrow {\mathcal {R}}$
with domain $\mathcal {D}$
and range $\mathcal {R}$
satisfies ($\epsilon,\delta $
)-DP, if for all measurable sets $\mathcal {S}\subseteq \mathrm {Range}(\mathcal {M})$
and any two adjacent input datasets $D,D^{\prime }\in {\mathcal {D}}$
, \begin{equation*} \Pr \left [{ \mathcal {M}\left ({D }\right)\in \mathcal {S} }\right]\le {e^{\epsilon }}\Pr \left [{ \mathcal {M}\left ({{D^{\prime }} }\right)\in \mathcal {S} }\right]+\delta. \tag{3.3}\end{equation*}
DP is intended to prevent individual information from leaking to an adversary. It is a property of the query mechanism and does not assume a specific prior distribution on the database. MI-DP makes this property explicit by conditioning on the remainder of the database and taking a supremum of the conditional mutual information over all possible distributions on the database.
Suppose that the database ${X^{N}}=\{{X_{1}},\ldots,{X_{N}}\}$
, where $X_{i}$
represents the $i$
-th data in the database. $X^{-i}$
represents the set of data other than $X_{i}$
in the database, and the random variable $Y$
represents the query result.
Definition 2 (Mutual Information Differential Privacy[17]):
A randomized mechanism $P_{Y|X^{N}}$
satisfies $\epsilon $
-mutual information differential privacy ($\epsilon $
-MI-DP) if \begin{equation*} \underset {i,{P_{{X^{N}}}}}{\sup }\,I({X_{i}};Y|{X^{-i}})\le \epsilon \,\,\text {nats}, \tag{3.4}\end{equation*}
where nats are the information units resulting from using the natural logarithm instead of the logarithm base two. The main property of MI-DP is that it is sandwiched between $\epsilon $-DP (i.e., $(\epsilon,\delta =0)$-DP) and $(\epsilon,\delta)$-DP: if a randomized mechanism satisfies $\epsilon $-DP, it also satisfies $\epsilon $-MI-DP, and $\epsilon $-MI-DP in turn implies $(\epsilon,\delta)$-DP, although it may not imply pure $\epsilon $-DP. In this paper, we use $\epsilon $-MI-DP to characterize the privacy level in FL.
SECTION IV.
Federated Learning Under $\epsilon$
-MI-DP
A. $\epsilon$
-MI-DP Guarantee for Local Parameters
We consider a FL system with $N$
clients and $T$
rounds of model training, see Figures 1 and 2. After $t$
rounds of training iterations, where $t\in \{1,2,\ldots,T\}$
, the privacy model parameters of all clients include all local model parameters from the beginning of training up to round $t$
, that is $\{\mathbf {w}_{k}^{1},\ldots,\mathbf {w}_{k}^{t}\}$
($k\in \{1,\ldots,N\}$
), while the aggregated model parameters held by the central server at this point are ${{\bar {\mathbf {w}}}^{t}}$
, which will be broadcast to all clients at the next iteration and exposed to semi-honest clients or hidden adversaries. Since the model parameters are related to client data, we consider the local model parameters $\{\mathbf {w}_{k}^{1},\ldots,\mathbf {w}_{k}^{t}\}$
as the object to be protected and the aggregated model parameter ${{\bar {\mathbf {w}}}^{t}}$
as the result obtained by the attacker. Based on the definition of $\epsilon $
-MI-DP, we propose a client-level $\epsilon $
-MI-DP constraint in FL.
Definition 3 (Client-Level $\epsilon $
-MI-DP Constraint in FL):
For an FL process with $T$ rounds of model training, if in round $t\in \{1,2,\ldots,T\}$ the local model parameters uploaded by the clients and the model parameters aggregated by the server satisfy \begin{equation*} \underset {i,{P_{{\mathbf {w}^{t\times N}}}}}{\sup }\,I\left ({\mathbf {w}_{i}^{1},\ldots,\mathbf {w}_{i}^{t};{{{\bar {\mathbf {w}}}}^{t}}|{\mathbf {w}^{1,-i}},\ldots,{\mathbf {w}^{t,-i}} }\right)\le \epsilon, \tag{4.5}\end{equation*}
then this FL process satisfies client-level $\epsilon $
-MI-DP, where $\mathbf {w}^{t,-i}=\{\mathbf {w}_{1}^{t},\ldots,\mathbf {w}_{i-1}^{t},\mathbf {w}_{i+1}^{t},\ldots,\mathbf {w}_{N}^{t}\}$
.
In traditional DP, both client-level and data-level DP measure clients’ privacy loss in terms of data privacy loss. However, since the model parameters are related to the clients’ private data, the client-level $\epsilon $-MI-DP defined above is not limited to the clients’ data but rather focuses on the privacy leakage of the clients’ local model parameters.
B. Minimum Variance of Gaussian Noise Satisfying $\epsilon$
-MI-DP
In differential privacy-based PPFL, privacy protection for clients is generally achieved by adding Gaussian noise to the model parameters or gradients at the server or client sides, and the noise variance is determined using the definition of the Gaussian mechanism. Typically, clients need to clip the uploaded model parameters or gradients with a fixed value to limit the global sensitivity locally. For the parameter-averaging-based FedAvg, the model parameters can be appropriately modified based on the relationship between the clipping threshold $C$
and the $\ell _{2}$
-norm of model parameters. When $\lVert \mathbf {w}_{k}^{t} \rVert \le C$
, $\mathbf {w}_{k}^{t}$
remains unchanged, and when $\lVert \mathbf {w}_{k}^{t} \rVert \ge C$
, it is proportionally shrunk. Therefore, the uploaded model parameter vector can be represented as:\begin{equation*} \mathbf {w}_{k}^{t}=\frac {\mathbf {w}_{k}^{t}}{\max \left ({1,\frac {\lVert \mathbf {w}_{k}^{t} \rVert }{C} }\right)}. \tag{4.6}\end{equation*}
This clipping operation ensures that the $\ell _{2}$
-norm of the model parameters does not exceed the threshold, i.e., for all $k\in \{1,2,\ldots,N\}$ and all $t\in \{1,2,\ldots,T\}$, $\lVert \mathbf {w}_{k}^{t} \rVert \le C$
. Although this form of clipping may alter some of the uploaded local parameters, it can prevent the aggregation result from overfitting to a single client. As there is no prior distribution of the model parameters, we adopt this method to constrain the model parameters locally at the client side. Next, we will prove that adding Gaussian noise at the server or client sides can satisfy this $\epsilon $
-MI-DP guarantee and give the minimum standard deviation of the noise at a fixed privacy level.
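The clipping rule (4.6) amounts to a one-line operation; the following is a minimal numpy sketch (illustrative, not the authors' code), where `clip_parameters` is a name introduced here:

```python
import numpy as np

def clip_parameters(w, C):
    """l2-norm clipping per (4.6): parameter vectors with norm <= C pass
    through unchanged; longer vectors are proportionally shrunk to norm C."""
    return w / max(1.0, np.linalg.norm(w) / C)
```

After clipping, $\lVert \mathbf {w}_{k}^{t} \rVert \le C$ holds for every client and round, which is exactly the sensitivity bound used to calibrate the Gaussian noise.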
1) Adding Gaussian Noise at the Server Side
For differential privacy-based PPFL that adds Gaussian noise to the model parameters at the server side, at each global iteration round $t$
, the server adds noise ${{\mathbf {Z}}^{t}}\sim \mathcal {N}(\mathbf {0},\sigma _{s}^{2}{{\mathbf {I}}_{d}})$
to the aggregated model parameters, where ${{\mathbf {I}}_{d}}$
represents the $d$
-dimensional identity matrix and $d$
is the number of parameters in the model. The perturbed aggregated model is broadcasted to all clients. We assume that the noises added at different iteration rounds are mutually independent and the noise ${\mathbf {Z}}^{t}$
is independent from any local model parameters uploaded by any client at current and previous iteration rounds $\{\mathbf {w}_{k}^{1},\ldots,\mathbf {w}_{k}^{t}\},\forall k$
.
Theorem 1:
For each global epoch $t$
, to ensure $\epsilon $-MI-DP after $t$ rounds of aggregation of the clients’ parameters, the minimum standard deviation $\sigma _{s}$
of Gaussian noise that is added to the aggregated parameters by the server is given by \begin{equation*} {\sigma _{s}}=\frac {C\mathop{\max }\limits_{i}{p_{i}}}{\sqrt {d\left ({{e^{\frac {2\epsilon }{d}}}-1 }\right)}}. \tag{4.7}\end{equation*}
According to (4.7), the required noise standard deviation is minimized when all clients train on local databases of the same size, i.e., for all $i\in \left \{{ 1,2,\ldots,N }\right \}$, ${p_{i}}=1/N$. Under this condition, the minimum noise standard deviation is \begin{equation*} {\sigma _{s}}=\frac {C}{N\sqrt {d\left ({{e^{\frac {2\epsilon }{d}}}-1 }\right)}}. \tag{4.8}\end{equation*}
Theorem 1 shows that to satisfy $\epsilon $
-MI-DP, the standard deviation ${\sigma _{s}}$
of the Gaussian noise added by the server to the model parameters is related not only to the privacy budget $\epsilon $
but also to the aggregation weight $p$
, the dimensionality of the model parameter vector $d$
and the clipping threshold $C$
.
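To make (4.7) and (4.8) concrete, the following sketch computes the required server-side noise level; the function name and the constants $C$, $d$, $\epsilon $, $N$ are illustrative assumptions, not values from the paper:

```python
import numpy as np

def sigma_server(C, p, d, eps):
    """Minimum std of server-side Gaussian noise for eps-MI-DP, per (4.7)."""
    p = np.asarray(p, dtype=float)
    return C * p.max() / np.sqrt(d * (np.exp(2.0 * eps / d) - 1.0))

# Illustrative values: clipping threshold, model dimension, privacy budget.
C, d, eps, N = 1.0, 10, 0.5, 4
balanced = sigma_server(C, np.full(N, 1.0 / N), d, eps)  # recovers (4.8)
skewed = sigma_server(C, [0.7, 0.1, 0.1, 0.1], d, eps)
```

Since only $\max _{i}p_{i}$ enters (4.7), any imbalance in the aggregation weights raises the required noise, consistent with the discussion after Theorem 1.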
Complexity Analysis: Note that the complexity of Algorithm 1 mainly consists of the time complexity of aggregation at the server, which is given by $\mathcal {O}(N)$
, where $\mathcal {O}(\cdot)$ denotes the big-O notation. Since Algorithm 1 is executed $T$
represents the big-O notation. Since Algorithm 1 is executed $T$
times for convergence, the overall complexity of Algorithm 1 is $\mathcal {O}(TN)$
. It is approximately the same as the FedAvg [1].
2) Adding Gaussian Noise at the Client Side
For differential privacy-based PPFL adding Gaussian noise at the client side, at each global iteration round $t$
, the $k$
-th ($k\in \{1,\ldots,N\}$
) client adds Gaussian noise ${{\mathbf {Z}}^{t}_{k}}\sim \mathcal {N}(\mathbf {0},\sigma _{c}^{2}{{\mathbf {I}}_{d}})$
to its trained parameters and sends this randomized version to the server. We assume that the noises added by different clients and at different global epochs are independent and identically distributed, and that each noise is independent of all parameters before it is added.
Corollary 1:
For each global epoch $t$
, to ensure $\epsilon $-MI-DP after $t$ rounds of aggregation of the clients’ parameters, the minimum standard deviation $\sigma _{c}$
of Gaussian noise that is added to the local parameters by the client can be given as:\begin{equation*} {\sigma _{c}}=\frac {C\mathop{\max }\limits_{i}{p_{i}}}{\sqrt {d\left ({{e^{\frac {2\epsilon }{d}}}-1 }\right)\mathop{\mathop{\mathop \sum }\limits^{N}}\limits_{k=1}p_{k}^{2}}}. \tag{4.9}\end{equation*}
Since \begin{equation*} \frac {\mathop{\max }\limits_{i}{p_{i}}}{\sqrt {\mathop{\mathop{\mathop \sum }\limits^{N}}\limits_{k=1}p_{k}^{2}}}\ge \frac {\mathop{\max }\limits_{i}{p_{i}}}{\sqrt {N{{\left ({\mathop{\max }\limits_{i}{p_{i}} }\right)}^{2}}}}=\frac {1}{\sqrt {N}}, \tag{4.10}\end{equation*}
equality holds only when ${p_{i}}=1/N$
, $\forall i\in \left \{{ 1,2,\ldots,N }\right \}$
. Therefore, for this privacy-preserving scheme, when the size of the training database used by all clients is the same, the added noise is minimal. In this case, the lower bound of the standard deviation of the noise ${\sigma _{c}}$
is given by \begin{equation*} {\sigma _{c}}=\frac {C}{\sqrt {dN\left ({{e^{\frac {2\epsilon }{d}}}-1 }\right)}}. \tag{4.11}\end{equation*}
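Similarly, (4.9)-(4.11) can be checked numerically; in the sketch below (illustrative values, hypothetical function name), uniform weights attain the lower bound (4.11):

```python
import numpy as np

def sigma_client(C, p, d, eps):
    """Minimum std of client-side Gaussian noise for eps-MI-DP, per (4.9)."""
    p = np.asarray(p, dtype=float)
    return C * p.max() / np.sqrt(d * (np.exp(2.0 * eps / d) - 1.0) * np.sum(p ** 2))

# Illustrative values; uniform weights p_i = 1/N attain the bound (4.11).
C, d, eps, N = 1.0, 10, 0.5, 4
uniform = sigma_client(C, np.full(N, 1.0 / N), d, eps)
skewed = sigma_client(C, [0.7, 0.1, 0.1, 0.1], d, eps)
```

The extra factor $\sqrt {\sum _{k}p_{k}^{2}}$ in the denominator of (4.9) reflects that the $N$ independent client noises partially average out during aggregation.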
C. Utility Analysis
The most intuitive way to quantify utility is to measure the deviation of the result. To characterize compression loss, information theory introduces the distortion function $d(\cdot,\cdot)$ to quantify the degree of distortion between random variables [18], and it has been widely used in DP research to quantify data utility [15]. Therefore, for differential privacy-based PPFL, we define the utility $U^{t}$
of the global model at any global iteration round $t$
as the multiplicative inverse of the expected distortion between the true global model parameters and the global model parameters with DP mechanisms, i.e., we have \begin{equation*} {U^{t}}=\frac {1}{\mathbb {E}\left [{ d\left ({{\bar {\mathbf {w}}}^{t},{\bar {\mathbf {w}}}_{dp}^{t} }\right) }\right]}, \tag{4.12}\end{equation*}
where ${\bar {\mathbf {w}}}^{t}$
represents the true global model parameters without using DP mechanisms, which can be expressed as (3.2). ${\bar {\mathbf {w}}}_{dp}^{t}$
represents the global model parameters after using DP mechanism. Considering the model parameters as a continuous random vector, we use the commonly used squared error distortion as the distortion function:\begin{equation*} d({\bar {\mathbf {w}}}^{t},{\bar {\mathbf {w}}}_{dp}^{t})={\lVert {\bar {\mathbf {w}}}^{t}-{\bar {\mathbf {w}}}_{dp}^{t} \rVert }^{2}. \tag{4.13}\end{equation*}
Since the global model parameters corresponding to these two privacy-preserving schemes are different, we use $U_{ldp}^{t}$
to denote the global model utility when Gaussian noise is added at the client side, and the global model parameters can be represented as:\begin{equation*} \bar {\mathbf {w}}_{ldp}^{t}=\sum \limits _{k=1}^{N}{{p_{k}}\left ({\mathbf {w}_{k}^{t}+\mathcal {N}\left ({\mathbf {0},\sigma _{c}^{2}{\mathbf {I}_{d}} }\right) }\right)}. \tag{4.14}\end{equation*}
We use $U_{cdp}^{t}$
to denote the global model utility when Gaussian noise is added on the server side, and the global model parameters can be represented as:\begin{equation*} \bar {\mathbf {w}}_{cdp}^{t}=\sum \limits _{k=1}^{N}{{p_{k}}\mathbf {w}_{k}^{t}}+\mathcal {N}(\mathbf {0},\sigma _{s}^{2}{\mathbf {I}_{d}}). \tag{4.15}\end{equation*}
Theorem 2:
Under the same privacy budget, when adding the minimum Gaussian noise that satisfies the client-level $\epsilon $
-MI-DP in FL, the utility of the global model parameters is the same for both the client-side noise addition and the server-side noise addition privacy-preserving schemes, i.e., \begin{equation*} U_{ldp}^{t}=U_{cdp}^{t},\,\,t=1,2,\ldots,T. \tag{4.16}\end{equation*}
According to Theorem 2, it can be seen that in these two privacy-preserving schemes, the utility of the global model increases as the noise standard deviation $\sigma $
decreases and the privacy budget $\epsilon $
increases. Although where the noise is added does not affect the utility of the global model, when the server is untrusted, adding noise at the client side can effectively prevent the server from obtaining the client’s original training parameters.
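Theorem 2 follows because the expected squared distortion is $d\sigma _{s}^{2}$ under server-side noise and $d\sigma _{c}^{2}\sum _{k}p_{k}^{2}$ under client-side noise, and substituting the minimal noise levels (4.7) and (4.9) makes the two coincide. A numeric sketch with illustrative values (the constants are assumptions, not from the paper):

```python
import numpy as np

# Illustrative setup: clipping threshold, model dimension, budget, weights.
C, d, eps = 1.0, 20, 0.8
p = np.array([0.4, 0.3, 0.2, 0.1])

common = d * (np.exp(2.0 * eps / d) - 1.0)
sigma_s = C * p.max() / np.sqrt(common)                   # server-side noise, (4.7)
sigma_c = C * p.max() / np.sqrt(common * np.sum(p ** 2))  # client-side noise, (4.9)

dist_cdp = d * sigma_s ** 2                   # E||Z^t||^2 under (4.15)
dist_ldp = d * sigma_c ** 2 * np.sum(p ** 2)  # E||sum_k p_k Z_k^t||^2 under (4.14)
```

The two distortions agree for any weight vector, so $U_{ldp}^{t}=U_{cdp}^{t}$ as stated in (4.16).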
SECTION V.
Personalized $\epsilon$
-MI-DP Based FL
Although the two basic privacy-preserving schemes in Section IV satisfy the $\epsilon $-MI-DP constraint, they assume that all clients have the same privacy budget and that the local model parameters are uniformly constrained by a single clipping threshold, which ignores the variability among clients. Based on the theoretical analysis in Section IV, in this section we further account for the differences in clients’ model parameters and privacy requirements.
A. Problem Definition
1) Adaptive Clipping Threshold
Firstly, the selection of a clipping threshold has a significant impact on the model performance. Setting the threshold too small or too large will decrease the accuracy of the model, but it is difficult to determine an appropriate threshold before training. Secondly, in FL, each client has a different database, and each client’s data may follow a completely different distribution. In this case, using the same global model to train will result in non-i.i.d. local model parameters. Using a unified clipping threshold to clip the local model parameters will destroy the unbiasedness of the parameter estimation. Therefore, we propose a dynamic clipping threshold method, similar to the gradient clipping method used in the stochastic gradient descent algorithm in [19]. The server only needs to give an initial clipping threshold in the first round of training, and the clipping threshold is dynamically adjusted based on the local model parameter information of each client. We can define the loss function of the clipping threshold as:\begin{equation*} L(C)=\frac {1}{2}{{\left ({C-\lVert \mathbf {w} \rVert }\right)}^{2}}. \tag{5.17}\end{equation*}
From this loss function, the gradient can be derived as $\nabla L=C-\lVert \mathbf {w} \rVert $
. Then, at each global iteration $t$
, each client updates its clipping threshold $C_{k,t}$
for the next iteration round by \begin{align*} {C_{k,t+1}}={C_{k,t}}-{\eta _{c}}\nabla L,t=1,2,\ldots,T,k=1,2,\ldots,N, \tag{5.18}\end{align*}
where ${\eta }_{c}$
is the learning rate of the clipping threshold. At this point, the constraint for the local model parameters of each client is:\begin{equation*} \lVert \mathbf {w}_{k}^{t} \rVert \le {C_{k,t}},t=1,2,\ldots,T,k=1,2,\ldots,N. \tag{5.19}\end{equation*}
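The adaptive rule (5.17)-(5.19) simply moves each client's threshold toward the current parameter norm; a minimal sketch (the function name is introduced here for illustration):

```python
def update_threshold(C, w_norm, eta_c):
    """One step of (5.18): with L(C) = (C - ||w||)^2 / 2 the gradient is
    C - ||w||, so the threshold moves toward the current parameter norm."""
    return C - eta_c * (C - w_norm)
```

With a learning rate $0 < \eta _{c} < 1$, repeated updates shrink the gap $C_{k,t}-\lVert \mathbf {w}_{k}^{t} \rVert $ geometrically, so the clip in (4.6) stays tight without a hand-tuned global threshold.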
2) Personalized Privacy Budgets
In the real world, each client may have different privacy requirements. To satisfy these different requirements, we introduce the concept of personalized DP [20], which was originally proposed in the field of databases. We extend Definition 3 to obtain a client-level personalized $\epsilon $
-MI-DP constraints in FL.
Definition 4 (Personalized $\epsilon $
-MI-DP Constraint in FL):
For an FL process that goes through $T$
rounds, if the $k$
-th client has a privacy budget of $\epsilon _{k}$
, where $k\in \{1,2,\ldots,N\}$
, and if the local model parameters uploaded by clients and the aggregated model parameters at the server satisfy \begin{align*} \underset {{P_{{{\mathbf {w}}^{t\times N}}}}}{\sup }I\left ({\mathbf {w}_{k}^{1},\ldots,\mathbf {w}_{k}^{t};{{{\bar {\mathbf {w}}}}^{t}}|{{\mathbf {w}}^{1,-k}},\ldots,{{\mathbf {w}}^{t,-k}} }\right)\le {\epsilon _{k}},\forall k,\forall t, \tag{5.20}\end{align*}
then this FL process satisfies personalized client-level $\epsilon $
-MI-DP.
B. Optimization Problem of Privacy-Utility Trade-Off
In this section, we assume that the databases of all clients have the same size, i.e., the aggregation weights of each client in the two basic schemes are the same. In the case where there are differences in the constraints on local model parameters and privacy budgets among the clients, to achieve personalized client-level $\epsilon $
-MI-DP with privacy budgets $\{\epsilon _{1},\ldots,\epsilon _{N}\}$
in the client-side noise addition privacy-preserving scheme, at the global iteration round $t$
, the minimum standard deviation ${\sigma _{c,t}}$
of the identically distributed Gaussian noise $\mathbf {Z}_{k}^{t}\sim \mathcal {N}(\mathbf {0},\sigma _{c,t}^{2}{\mathbf {I}_{d}})$
added to the local model parameters can be obtained as follows:\begin{equation*} {\sigma _{c,t}}=\underset {k}{\max }\,\left ({\frac {{C_{k,t}}}{\sqrt {dN\left ({{e^{\frac {2{\epsilon _{k}}}{d}}}-1 }\right)}} }\right),t=1,2,\ldots,T. \tag{5.21}\end{equation*}
For the server-side noise addition privacy-preserving scheme, the minimum standard deviation of the Gaussian noise is:\begin{equation*} {\sigma _{s,t}}=\underset {k}{\max }\,\left ({\frac {{C_{k,t}}}{N\sqrt {d\left ({{e^{\frac {2{\epsilon _{k}}}{d}}}-1 }\right)}} }\right),t=1,2,\ldots,T. \tag{5.22}\end{equation*}
At this point, the minimum noise standard deviation of these two basic schemes is determined by the client with the highest noise demand: when the clipping thresholds are equal, it is determined by the client with the smallest privacy budget, and when the privacy budgets are equal, by the client with the largest clipping threshold. This harms the performance of the model, so the noise distribution and the corresponding aggregation weights of the clients need to be adjusted in each round of training, in order to maximize the utility of the global model while satisfying the personalized privacy requirements of the different clients. Combining the conditions that the noise standard deviations and aggregation weights must satisfy, we obtain the following privacy-utility trade-off optimization problem:\begin{align*} & \underset {\{{\sigma _{k,t}},{p_{k,t}}\}_{k=1}^{N}}{\max }{U^{t}} \\ &\quad \,\,\text {s.t.}\underset {{P_{{{\mathbf {w}}^{t\times N}}}}}{\sup }I\left ({\mathbf {w}_{k}^{1},\ldots,\mathbf {w}_{k}^{t};{{{\bar {\mathbf {w}}}}^{t}}|{{\mathbf {w}}^{1,-k}},\ldots,{{\mathbf {w}}^{t,-k}} }\right)\le {\epsilon _{k}},\forall k, \\ &\hphantom {\quad \,\,\text {s.t.}} \sum \limits _{k=1}^{N}{{p_{k,t}}}=1, \\ &\hphantom {\quad \,\,\text {s.t.}} {p_{k,t}}\ge 0,k=1,2,\ldots,N, \\ &\hphantom {\quad \,\,\text {s.t.}} {\sigma _{k,t}}\ge 0,k=1,2,\ldots,N. \tag{5.23}\end{align*}
For this optimization problem, we use the definition of utility in (4.12). Specifically, the objective function can be expressed as:\begin{equation*} {U^{t}}=\frac {1}{\mathbb {E}\left [{ d\left ({{{{\bar {\mathbf {w}}}}^{t}},\bar {\mathbf {w}}_{dp}^{t} }\right) }\right]}=\frac {1}{d\sum \limits _{i=1}^{N}p_{i,t}^{2}\sigma _{i,t}^{2}}. \tag{5.24}\end{equation*}
Meanwhile, we can obtain the supremum of the conditional mutual information in the constraint of (5.20):\begin{align*} &\hspace {-1pc}\underset {{P_{{{\mathbf {w}}^{t\times N}}}}}{\sup }I\left ({\mathbf {w}_{k}^{1},\ldots,\mathbf {w}_{k}^{t};{{{\bar {\mathbf {w}}}}^{t}}|{{\mathbf {w}}^{1,-k}},\ldots,{{\mathbf {w}}^{t,-k}} }\right) \\[-2pt] &=\frac {d}{2}\ln \left ({\frac {p_{k,t}^{2}C_{k,t}^{2}}{d\sum \limits _{i=1}^{N}p_{i,t}^{2}\sigma _{i,t}^{2}}+1 }\right),\forall k,\forall t. \tag{5.25}\end{align*}
Substituting (5.24) and (5.25) into (5.23), the original optimization problem is equivalent to:\begin{align*} & \min ~~f({\sigma _{1,t}},\ldots,{\sigma _{N,t}},{p_{1,t}},\ldots,{p_{N,t}})=\sum \limits _{i=1}^{N}p_{i,t}^{2}\sigma _{i,t}^{2} \\[-2pt] &\,\text {s}\text {.t}{. } \sum \limits _{i=1}^{N}p_{i,t}^{2}\sigma _{i,t}^{2}-\alpha _{k,t}^{2}p_{k,t}^{2}\ge 0,\forall k, \\[-2pt] &\hphantom {\,\text {s}\text {.t}{. } } {p_{k,t}}\ge 0,\forall k, \\[-2pt] &\hphantom {\,\text {s}\text {.t}{. } } {\sigma _{k,t}}\ge 0,\forall k, \\[-2pt] &\hphantom {\,\text {s}\text {.t}{. } } \sum \limits _{k=1}^{N}{{p_{k,t}}}=1, \tag{5.26}\end{align*}
where \begin{equation*} {\alpha _{k,t}}=\frac {{C_{k,t}}}{\sqrt {d\left ({{e^{\frac {2{\epsilon _{k}}}{d}}}-1 }\right)}}>0. \tag{5.27}\end{equation*}
By analyzing the conditions that the optimal solution must satisfy, we can obtain Lemma 1.
Lemma 1:
If $\{\sigma _{1,t}^{*},\ldots,\sigma _{N,t}^{*},p_{1,t}^{*},\ldots,p_{N,t}^{*}\}$
is the optimal solution of (5.26), then for $\forall i,j\in \{1,2,\ldots,N\}$
, we have \begin{equation*} {\alpha _{i,t}}p_{i,t}^{*}={\alpha _{j,t}}p_{j,t}^{*}. \tag{5.28}\end{equation*}
Theorem 3:
For the optimization problem in (5.23), the upper bound of global model utility at global iteration $t$
when satisfying all clients’ privacy budgets is:\begin{equation*} \underset {\{{\sigma _{k,t}},{p_{k,t}}\}_{k=1}^{N}}{\max }{U^{t}}=\frac {1}{d}{{\left ({\frac {\sum \limits _{i=1}^{N}{\left ({\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne i}{{\alpha _{j,t}}} }\right)}}{\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \}}{{\alpha _{j,t}}}} }\right)}^{2}}. \tag{5.29}\end{equation*}
The optimal set of noise variances and aggregation weights $\{\sigma _{1,t}^{*},\ldots,\sigma _{N,t}^{*},p_{1,t}^{*},\ldots,p_{N,t}^{*}\}$
that can achieve this maximum utility can be represented as:\begin{align*} \begin{cases} &\! \left ({\sigma _{1,t}^{*},\ldots,\sigma _{N,t}^{*} }\right)\!\in \! \left \{{\! \left ({{\sigma _{1,t}},\ldots,{\sigma _{N,t}} }\right)\left |{ \sum \limits _{k=1}^{N}{\frac {\sigma _{k,t}^{2}}{\alpha _{k,t}^{2}}\!=\!1},{\sigma _{k,t}}\!\ge \!0 }\right.\!}\right \} \\[-2pt] & p_{k,t}^{*}=\frac {\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne k}{{\alpha _{j,t}}}}{\sum \limits _{i=1}^{N}{\left ({\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne i}{{\alpha _{j,t}}} }\right)}}, {}k=1,2,\ldots,N \\ \end{cases} \tag{5.30}\end{align*}
where \begin{equation*} {\alpha _{k,t}}=\frac {{C_{k,t}}}{\sqrt {d\left({{e^{\frac {2{\epsilon _{k}}}{d}}}-1}\right)}},k=1,2,\ldots,N. \tag{5.31}\end{equation*}
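To make Theorem 3 concrete, the following Python sketch (with hypothetical values for $d$, the clipping thresholds $C_{k,t}$, and the budgets $\epsilon _{k}$; not part of the original experiments) evaluates $\alpha _{k,t}$ from (5.31), the optimal weights from (5.30), and the utility bound (5.29). Since $p_{k,t}^{*}\propto 1/\alpha _{k,t}$, the products $\alpha _{k,t}p_{k,t}^{*}$ coincide across clients, as Lemma 1 requires.

```python
import math

def alpha(C, eps, d):
    # alpha_{k,t} = C_{k,t} / sqrt(d * (e^{2*eps_k/d} - 1)), per (5.31)
    return C / math.sqrt(d * (math.exp(2.0 * eps / d) - 1.0))

def optimal_weights(alphas):
    # (5.30): p*_k = prod_{j != k} alpha_j / sum_i prod_{j != i} alpha_j,
    # which simplifies algebraically to p*_k proportional to 1 / alpha_k
    inv = [1.0 / a for a in alphas]
    total = sum(inv)
    return [v / total for v in inv]

def max_utility(alphas, d):
    # utility upper bound (5.29), rewritten as (sum_i 1/alpha_i)^2 / d
    return sum(1.0 / a for a in alphas) ** 2 / d

# hypothetical setup: 3 clients with different thresholds and budgets
d = 10.0
C = [1.0, 2.0, 4.0]
eps = [2.0, 5.0, 10.0]
alphas = [alpha(c, e, d) for c, e in zip(C, eps)]
p_star = optimal_weights(alphas)
# one feasible noise choice from (5.30): sigma_k = alpha_k / sqrt(N)
sigmas = [a / math.sqrt(len(alphas)) for a in alphas]
```

Plugging these $\sigma _{k,t}$ and $p_{k,t}^{*}$ into (5.24) reproduces the bound (5.29) exactly, which is a quick consistency check of the theorem's algebra.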
C. PMIDP-FL Scheduling Policy
Theoretically, the calculation of the optimal noise variances and aggregation weights requires the server to collect each client’s current clipping threshold before each round of aggregation and then return the noise variance to each client, which undoubtedly increases communication costs. However, we can choose a suitable solution from the optimal solution set so that the noise variance in each round can be calculated by each client locally. This solution is shown in Corollary 2.
Corollary 2:
In an FL process, suppose the local model parameters of each client satisfy $\lVert \mathbf {w}_{k}^{t} \rVert \le {C_{k,t}}$
, and its privacy budget is $\epsilon _{k}$
($k\in \{1,\ldots,N\}$
). At each global round $t$
($t\in \{1,\ldots,T\}$
), each client adds noise locally as follows: $\mathbf {Z}_{k}^{t}\sim \mathcal {N}\left ({\mathbf {0},\sigma _{k,t}^{2}{{\mathbf {I}}_{d}} }\right)$
, where \begin{equation*} {\sigma _{k,t}}=\frac {{C_{k,t}}}{\sqrt {dN\left({{e^{\frac {2{\epsilon _{k}}}{d}}}-1}\right)}},k=1,2,\ldots,N. \tag{5.32}\end{equation*}
The server aggregates the uploaded model parameters from clients with the following aggregation weights:\begin{equation*} {p_{k,t}}=\frac {\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne k}{{\sigma _{j,t}}}}{\sum \limits _{i=1}^{N}{\left ({\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne i}{{\sigma _{j,t}}} }\right)}},k=1,2,\ldots,N. \tag{5.33}\end{equation*}
Then the FL process can satisfy personalized client-level $\epsilon $
-MI-DP and maximize the global model’s utility.
In this scheme, clients can independently calculate the distribution of the required noise without relying on a trusted central server. The central server only needs to receive the variance of each client’s noise to calculate the aggregation weights, without collecting clients’ privacy budgets and local clipping thresholds. Based on this theory, we propose a personalized $\epsilon $
-MI-DP based FL algorithm (PMIDP-FL) as shown in Algorithm 2. This algorithm not only satisfies the different privacy requirements of all clients but also maximizes the model utility. In each global iteration round of this algorithm, the client iterates on the clipping threshold based on its model parameters, calculates the required noise variance locally based on its privacy budget and clipping threshold, adds the corresponding noise to the local model parameters, and then sends both the noise variance and the noisy local model parameters to the server. The server then calculates the aggregation weights based on the noise variances of all clients.
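As an illustration, one global round of the procedure just described might look as follows (a minimal Python sketch with hypothetical dimensions, budgets, and thresholds, not the authors' implementation). Note that the product form of the weights in (5.33) simplifies algebraically to $p_{k,t}\propto 1/\sigma _{k,t}$, which is what the server computes here.

```python
import numpy as np

rng = np.random.default_rng(0)

def clip(w, C):
    # project the local parameter vector onto the L2 ball of radius C
    norm = np.linalg.norm(w)
    return w if norm <= C else w * (C / norm)

def local_noise_std(C, eps, d, N):
    # sigma_{k,t} from (5.32), computable by each client locally
    return C / np.sqrt(d * N * (np.exp(2.0 * eps / d) - 1.0))

def aggregation_weights(sigmas):
    # (5.33): p_k = prod_{j != k} sigma_j / sum_i prod_{j != i} sigma_j,
    # equivalently p_k proportional to 1 / sigma_k
    inv = 1.0 / np.asarray(sigmas)
    return inv / inv.sum()

# hypothetical round: N clients with d-dimensional local models
N, d = 4, 16
eps = [2.0, 5.0, 5.0, 10.0]   # personalized privacy budgets
C = [1.0, 1.5, 2.0, 1.0]      # local clipping thresholds
models = [rng.normal(size=d) for _ in range(N)]

noisy_uploads, sigmas = [], []
for k in range(N):
    w = clip(models[k], C[k])
    s = local_noise_std(C[k], eps[k], d, N)
    sigmas.append(s)
    # each client uploads only its noisy parameters and its noise std
    noisy_uploads.append(w + rng.normal(scale=s, size=d))

p = aggregation_weights(sigmas)
global_model = sum(pk * wk for pk, wk in zip(p, noisy_uploads))
```

The inverse-proportionality form also shows why larger noise (a tighter budget or larger threshold) automatically downweights a client in the aggregation.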
Complexity Analysis: The complexity of Algorithm 2 is determined by the time complexity of computing the aggregation weights $p_{k,t}$ (the most expensive loop in Algorithm 2), which is given by $\mathcal {O}(N^{2})$, where $\mathcal {O}(\cdot)$ denotes the big-O notation. In addition, since Algorithm 1 is executed $T$ times for convergence, the overall complexity of Algorithm 2 is given by $\mathcal {O}(TN^{2})$.
SECTION VI.
Experiment Results and Analysis
In this section, we use a multi-layer perceptron (MLP) and conduct experiments on the MNIST handwritten digit recognition dataset, which consists of 60,000 training samples and 10,000 test samples [21]. Furthermore, we train a convolutional neural network (CNN) on the CIFAR-10 dataset [26], which consists of 50,000 color images for training and 10,000 color images for testing. To simulate the FL environment, each client is allocated a subset of uniformly sampled examples. We use the SGD optimizer and set the learning rate to 0.01. To reduce the number of communication rounds and speed up learning, we set the local epoch $\tau =5$
and batch size $B=64$
. The number of aggregation rounds $T$ is set to 100.
A. Basic Privacy-Preserving Schemes
We analyze the model performance of the privacy-preserving schemes with Gaussian noise added at the client and the server separately under the $\epsilon $
-MI-DP constraint. On the one hand, we analyze the impact of different parameter settings on the performance of the two schemes, and we mainly consider the parameters that will have an impact on the noise size, i.e., the privacy budget $\epsilon $
, the number of clients $N$
, and the clipping threshold $C$
; on the other hand, we compare the performance differences between the two schemes under the same experimental settings to verify the conclusions of utility.
In Fig. 3 and Fig. 4, we choose various privacy levels $\epsilon $
=5, $\epsilon $
=10 and $\epsilon $
=20 to show the change of the loss function value during training under $\epsilon $
-MI-DP constraint. In these experiments, we set $N$
=100, $C$
=10. Furthermore, we also include a non-private approach for comparison. In Fig. 3, the noise is added on the client side, while in Fig. 4, the noise is added on the server side. From the results on the two datasets, it can be seen that without noise perturbing the model parameters, the model has the smallest final loss function value. For FL with privacy-preserving mechanisms, there is almost no difference from the benchmark in the early stage of training. However, due to the noise disturbance, the loss function value fluctuates slightly as training proceeds in the later stage, and it decreases more slowly. As the privacy budget $\epsilon $
increases, the added noise becomes smaller, and the final loss function value decreases accordingly, eventually approaching the benchmark.
Fig. 5 and Fig. 6 respectively show the influence of the number of clients $N$
and clipping threshold $C$
on the model performance. The vertical coordinates in the figures are all the test accuracy of the model on the test set after the training is completed. Fig. 5 shows the change in model accuracy with the number of clients $N$
. In these two experiments, the number of clients was set to 10 values uniformly between 20 and 200. Fig. 5 shows that as the number of clients increases, the model accuracy gradually increases. This is because more clients not only provide a larger global database for training but also reduce the variance of noise added to satisfy the $\epsilon $
-MI-DP constraint. This suggests that the approach with privacy mechanisms is more suitable for large-scale FL.
Fig. 6 shows the change in model accuracy with the clipping threshold $C$
. In these two experiments, the clipping threshold was set to 8 values uniformly between 2.5 and 20. The clipping threshold not only controls the degree of parameter clipping during model training but also determines the variance of the noise. When the clipping threshold is large, only a few clients will change their original local model parameters, but the variance of the noise increases, which harms the model performance. Conversely, when the clipping threshold is small, the negative impact of the noise term on the model performance is small, but more clients will change their original model parameters, which also affects the model performance.
Finally, it can be seen from the results that under the same experimental parameter settings, the accuracy of the models trained under the two privacy-preserving schemes on the test set is the same, and the change patterns of accuracy with different parameters are consistent, which validates Theorem 2.
B. Comparison of $\epsilon$
-MI-DP and Privacy-Preserving Schemes
To analyze the impact of client privacy budgets on the performance of PMIDP-FL and compare its performance with that of the basic client-side noise addition scheme under the same conditions, we randomly sample the privacy budgets of all clients $\{{\epsilon _{1}},\ldots,{\epsilon _{N}}\}$
from a Gaussian distribution $N({\mu _{\epsilon }},\sigma _{\epsilon }^{2})$
. Since the privacy budget must be positive, but direct sampling from a Gaussian distribution may result in negative values, we set the minimum value of ${\epsilon _{k}}$
to be 1.
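This sampling step can be sketched as follows (the parameter values are illustrative placeholders matching the experimental ranges, not the exact seeds used in the paper):

```python
import numpy as np

def sample_budgets(N, mu_eps, sigma_eps, eps_min=1.0, seed=0):
    # draw personalized privacy budgets from N(mu_eps, sigma_eps^2),
    # then floor them at eps_min, since a budget must be positive
    rng = np.random.default_rng(seed)
    eps = rng.normal(mu_eps, sigma_eps, size=N)
    return np.maximum(eps, eps_min)

budgets = sample_budgets(N=100, mu_eps=20.0, sigma_eps=10.0)
```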
Fig. 7 and Fig. 8 show the variation of model accuracy with PMIDP-FL under different privacy budget distributions. In Fig. 7, the mean of the privacy budget distribution is set to ${\mu }_{\epsilon }$
=20, and the standard deviations $\sigma _{\epsilon }$ are set to 0, 5, 10, and 15, respectively. In Fig. 8, the standard deviation of the privacy budget distribution is set to $\sigma _{\epsilon }$
=10 and the means ${\mu }_{\epsilon }$
are set to 5, 10, and 20, respectively.
The results on the two datasets show that the model’s accuracy is not significantly affected by variation in the standard deviation $\sigma _{\epsilon }$ of the privacy budget distribution, and is essentially the same as when all clients have identical privacy budgets. This indicates that our proposed scheme can ensure high utility of the model even when clients have very different privacy requirements. In contrast, when the mean of the privacy budget distribution ${\mu }_{\epsilon }$
varies, the model’s accuracy changes significantly, and the lower the mean value of the privacy budget, the lower the model’s accuracy. The maximum utility of the model does not depend on the client with the highest privacy requirements but on the overall privacy budget size of all clients.
The basic client-side noise addition scheme adds noise with the same distribution and the same aggregation weight for each client, and the variance of the noise is given by Corollary 1. Theoretically, its model utility is lower than that of PMIDP-FL. To verify whether the performance of the two schemes in actual FL is consistent with the theory, we applied both schemes with the same parameter settings and obtained the accuracy of the models on the test sets of the two datasets, as shown in Fig. 9 and Fig. 10. Fig. 9 shows the relationship between the test accuracy of the two schemes and the mean ${\mu }_{\epsilon }$ of the privacy budget distribution when its standard deviation $\sigma _{\epsilon }$ is fixed. Fig. 10 shows the relationship between the test accuracy of the two schemes and the standard deviation $\sigma _{\epsilon }$ of the privacy budget distribution when its mean ${\mu }_{\epsilon }$ is fixed.
According to the results, when the mean of the privacy budget distribution is low or its standard deviation is large, the model accuracy of the PMIDP-FL scheme is higher than that of the baseline approach. In particular, when the standard deviation is large, clients with extremely low privacy budgets may appear, so the noise added in the baseline approach may be dictated by the client with the lowest privacy budget, resulting in a significant difference in model accuracy compared with PMIDP-FL. Although the experiment cannot verify the optimality of the PMIDP-FL scheme’s utility when clients differ in their model parameter constraints and privacy requirements, it shows a significant performance improvement.
SECTION VII.
Conclusion
In this paper, we focus on the trade-off between privacy and utility in $\epsilon $
-MI-DP-based PPFL. We define a client-level $\epsilon $
-MI-DP requirement and establish lower bounds on the variances of the artificial Gaussian noise added at the client and server sides. We prove that, at the same privacy level, where the noise is added does not affect the model utility. Considering the differences in model parameters and privacy requirements across clients, we further refine the privacy-preserving FL model and formulate the privacy-utility trade-off as an optimization problem. The results of this paper are further validated by experiments.
ACKNOWLEDGMENT
An earlier version of this paper was presented in part at IEEE ICCC 2022 [DOI: 10.1109/ICCC56324.2022.10065616].
Appendix A
Proof of Theorem 1
To satisfy the $\epsilon $
-MI-DP in Definition 3, the aggregated parameters and the client local model parameters must satisfy the constraint in (4.5). In this privacy-preserving scheme, for $i\in \left \{{ 1,2,\ldots,N }\right \}$
and $t\in \left \{{ 1,2,\ldots,T }\right \}$
, the conditional mutual information satisfies:\begin{align*} &\hspace {-1pc} I\left ({\mathbf {w}_{i}^{1},\ldots,\mathbf {w}_{i}^{t};{{{\bar {\mathbf {w}}}}^{t}}|{{\mathbf {w}}^{1,-i}},\ldots,{{\mathbf {w}}^{t,-i}} }\right) \\ &\mathop {=}h\left ({{{{\bar {\mathbf {w}}}}^{t}}|{{\mathbf {w}}^{1,-i}},\ldots,{{\mathbf {w}}^{t,-i}} }\right)-h\left ({{{{\bar {\mathbf {w}}}}^{t}}|{{\mathbf {w}}^{1,N}},\ldots,{{\mathbf {w}}^{t,N}} }\right) \\ & \overset {(a)}{\mathop {=}}h\left ({{{{\bar {\mathbf {w}}}}^{t}}|{{\mathbf {w}}^{1,-i}},\ldots,{{\mathbf {w}}^{t,-i}} }\right)-h\left ({{{{\bar {\mathbf {w}}}}^{t}}|{{\mathbf {w}}^{t,N}} }\right) \\ & \overset {(b)}{\mathop {=}}h\left ({{p_{i}}\mathbf {w}_{i}^{t}+{{\mathbf {Z}}^{t}}|{{\mathbf {w}}^{1,-i}},\ldots,{{\mathbf {w}}^{t,-i}} }\right)-h\left ({{{\mathbf {Z}}^{t}} }\right) \\ & \overset {(c)}{\mathop {\le }}h\left ({{p_{i}}\mathbf {w}_{i}^{t}+{{\mathbf {Z}}^{t}} }\right)-h\left ({{{\mathbf {Z}}^{t}} }\right), \tag{1.34}\end{align*}
where (a) follows from the Markov chain property: in this FL process, for any global iteration round $t$
, from $\{{\mathbf {w}^{1,N}},\ldots,{\mathbf {w}^{t-1,N}}\}$
to get $\mathbf {w}^{t,N}$
depends on the client’s local database and the noise added before the iteration round $t$
, while from $\mathbf {w}^{t,N}$
to get ${\bar {\mathbf {w}}}^{t}$
depends only on the noise $\mathbf {Z}^{t}$ added at round $t$
, which is independent of the previously added noise and model parameters, so $\{{\mathbf {w}^{1,N}},\ldots,{\mathbf {w}^{t-1,N}}\}\to \mathbf {w}^{t,N} \to {\bar {\mathbf {w}}}^{t}$
forms a Markov chain; (b) follows since the noise and the local model parameters are independent of each other; (c) follows because conditioning reduces entropy.
For $h\left ({{p_{i}}\mathbf {w}_{i}^{t}+{{\mathbf {Z}}^{t}} }\right)$
, by the maximum differential entropy theorem, we have \begin{equation*} h\left ({{p_{i}}\mathbf {w}_{i}^{t}+{{\mathbf {Z}}^{t}} }\right)\le \frac {1}{2}\ln {\left ({2\pi e }\right)^{d}}\det \left ({\boldsymbol{\Sigma }}\right), \tag{1.35}\end{equation*}
where $\boldsymbol{\Sigma }$
denotes the covariance matrix of ${p_{i}}\mathbf {w}_{i}^{t}+{{\mathbf {Z}}^{t}}$
and $\det \left ({{\boldsymbol{\Sigma }}}\right)$
denotes the determinant of the matrix $\boldsymbol{\Sigma }$
. We denote the element of $\boldsymbol{\Sigma }$
in $j$
-th row and $k$
-th column as ${\Sigma }_{j,k}$
; denote the $j$
-th element of $\mathbf {w}_{i}^{t}$
as $w_{i,j}^{t}$
; denote the $j$
-th element of $\mathbf {Z}^{t}$
as $Z_{j}^{t}$
, then $\det \left ({{\boldsymbol{\Sigma }}}\right)$
can be bounded as:\begin{align*} \det \left ({{\boldsymbol{\Sigma }}}\right)&\overset {(a)}{\mathop {\le }}\underset {j=1}{\overset {d}{\mathop \prod }}{\Sigma _{j,j}} \\ & \mathop =\underset {j=1}{\overset {d}{\mathop \prod }}Var\left ({{p_{i}}w_{i,j}^{t}+Z_{j}^{t} }\right) \\ & \overset {(b)}{\mathop {=}}\underset {j=1}{\overset {d}{\mathop \prod }}\left ({{p_{i}}^{2}Var\left ({w_{i,j}^{t} }\right)+ \sigma _{s}^{2} }\right) \\ & \le \underset {j=1}{\overset {d}{\mathop \prod }}\left ({{p_{i}}^{2}\mathbb {E}\left [{ {{\left ({w_{i,j}^{t} }\right)}^{2}} }\right]+\sigma _{s}^{2} }\right) \\ & \overset {(c)}{\mathop {\le }}{{\left ({\frac {1}{d}\underset {j=1}{\overset {d}{\mathop \sum }}({p_{i}}^{2}\mathbb {E}\left [{ {{\left ({w_{i,j}^{t} }\right)}^{2}} }\right]+\sigma _{s}^{2}) }\right)}^{d}} \\ & ={{\left ({\sigma _{s}^{2}+\frac {{p_{i}}^{2}}{d}\sum \limits _{j=1}^{d}{\mathbb {E}\left [{ {{\left ({w_{i,j}^{t} }\right)}^{2}} }\right]} }\right)}^{d}}, \tag{1.36}\end{align*}
where (a) follows from Hadamard’s inequality since a covariance matrix is positive semidefinite; (b) follows since the noise and the local model parameters are independent of each other; (c) follows from the arithmetic-geometric mean inequality. We can further bound $\det \left ({{\boldsymbol{\Sigma }}}\right)$
by using the clipping threshold on the $\ell _{2}$
-norm of the clipped parameter vector.\begin{align*} \underset {j=1}{\overset {d}{\mathop \sum }}\mathbb {E}\left [{ {{\left ({w_{i,j}^{t} }\right)}^{2}} }\right] =\mathbb {E}\left [{ \underset {j=1}{\overset {d}{\mathop \sum }}{{\left ({w_{i,j}^{t} }\right)}^{2}} }\right]\le \mathbb {E}\left ({{C^{2}} }\right)={C^{2}}. \tag{1.37}\end{align*}
Combining (1.35)–(1.37), we obtain:\begin{equation*} h\left ({{p_{i}\mathbf {w}_{i}^{t} + {\mathbf {Z}^{t}}} }\right) \le \frac {1}{2}\ln {\left ({{2\pi e\left ({{\frac {{p_{i}^{2}{C^{2}}}}{d} + \sigma _{s}^{2}} }\right)} }\right)^{d}}. \tag{1.38}\end{equation*}
Since ${\mathbf {Z}}^{t}$
is a $d$
-dimensional Gaussian random vector, it is straightforward to obtain \begin{equation*} h\left ({{{\mathbf {Z}}^{t}} }\right)=\frac {1}{2}\ln {{\left ({2\pi e\sigma _{s}^{2} }\right)}^{d}}. \tag{1.39}\end{equation*}
Combining (1.34), (1.38) and (1.39), for $\forall i\in \left \{{ 1,2,\ldots,N }\right \}$
and $\forall t\in \left \{{ 1,2,\ldots,T }\right \}$
, with no prior distribution on the local model parameters, the supremum of the conditional mutual information in (4.5) is \begin{align*} &\hspace {-1pc} \underset {{P_{{\mathbf {w}^{t\times N}}}}}{\sup }\,I\left ({\mathbf {w}_{i}^{1},\ldots,\mathbf {w}_{i}^{t};{{{\bar {\mathbf {w}}}}^{t}}|{\mathbf {w}^{1,-i}},\ldots,{\mathbf {w}^{t,-i}} }\right) \\ & =\frac {1}{2}\ln {{\left ({2\pi e\left ({\frac {{p_{i}}^{2}{C^{2}}}{d}+\sigma _{s}^{2} }\right) }\right)}^{d}}-\frac {1}{2}\ln {{\left ({2\pi e\sigma _{s}^{2} }\right)}^{d}} \\ & =\frac {d}{2}\ln \left ({\frac {{p_{i}}^{2}{C^{2}}}{d\sigma _{s}^{2}}+1 }\right). \tag{1.40}\end{align*}
This supremum increases with $p_{i}$
, so we can obtain:\begin{align*} &\hspace {-1pc}\underset {i,{P_{{{\mathbf {w}}^{t\times N}}}}}{\sup }I(\mathbf {w}_{i}^{1},\ldots,\mathbf {w}_{i}^{t};{{\bar {\mathbf {w}}}^{t}}|{{\mathbf {w}}^{1,-i}},\ldots,{{\mathbf {w}}^{t,-i}}) \\ &=\frac {d}{2}\ln \left ({\frac {\mathop{\max }\limits_{i}{p_{i}}^{2}{C^{2}}}{d\sigma _{s}^{2}}+1 }\right),\forall t. \tag{1.41}\end{align*}
Hence, when the noise standard deviation ${\sigma _{s}}$
satisfies \begin{equation*} {\sigma _{s}}\ge \frac {\mathop{\max }\limits_{i}{p_{i}}C}{\sqrt {d\left ({{e^{\frac {2\epsilon }{d}}}-1 }\right)}}, \tag{1.42}\end{equation*}
this FL process satisfies $\epsilon $
-MI-DP.
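As a numerical sanity check of this bound (a sketch with hypothetical values, not taken from the paper's experiments), setting $\sigma _{s}$ exactly to the right-hand side of (1.42) makes the supremum in (1.41) equal the budget $\epsilon $, and any larger noise drives it below $\epsilon $:

```python
import math

def min_server_sigma(C, p_max, eps, d):
    # smallest sigma_s allowed by (1.42)
    return p_max * C / math.sqrt(d * (math.exp(2.0 * eps / d) - 1.0))

def mi_sup(C, p_max, sigma, d):
    # supremum of the conditional mutual information, per (1.41)
    return 0.5 * d * math.log((p_max * C) ** 2 / (d * sigma ** 2) + 1.0)

# hypothetical setting: uniform weights over N = 100 clients
d, C, eps = 10.0, 5.0, 2.0
p_max = 1.0 / 100
sigma_s = min_server_sigma(C, p_max, eps, d)
```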
Similar to the proof of Theorem 1, we need to find the supremum of the conditional mutual information in (4.5). In this situation, for $\forall i\in \left \{{ 1,2,\ldots,N }\right \}$
and $\forall t\in \left \{{ 1,2,\ldots,T }\right \}$
, we have:\begin{align*} &I\left ({\mathbf {w}_{i}^{1},\ldots,\mathbf {w}_{i}^{t};{{{\bar {\mathbf {w}}}}^{t}}|{{\mathbf {w}}^{1,-i}},\ldots,{{\mathbf {w}}^{t,-i}} }\right) \\ &\le h\left ({{p_{i}}\mathbf {w}_{i}^{t}+\underset {k=1}{\overset {N}{\mathop \sum }}{p_{k}}\mathbf {Z}_{k}^{t} }\right)-h\left ({\underset {k=1}{\overset {N}{\mathop \sum }}{p_{k}}\mathbf {Z}_{k}^{t} }\right) \tag{2.43}\end{align*}
Since the noises added by different clients are independent of each other, the distribution of the aggregated noise satisfies:\begin{equation*} \underset {k=1}{\overset {N}{\mathop \sum }}{p_{k}}\mathbf {Z}_{k}^{t}\sim \mathcal {N}\left ({\mathbf {0},\sigma _{c}^{2}\underset {k=1}{\overset {N}{\mathop \sum }}p_{k}^{2}{\mathbf {I}_{d}} }\right). \tag{2.44}\end{equation*}
Thus its differential entropy can be obtained as:\begin{equation*} h\left ({\underset {k=1}{\overset {N}{\mathop \sum }}{p_{k}}\mathbf {Z}_{k}^{t} }\right)= \frac {1}{2}\ln {{\left ({2\pi e\sigma _{c}^{2}\underset {k=1}{\overset {N}{\mathop \sum }}p_{k}^{2} }\right)}^{d}}. \tag{2.45}\end{equation*}
Similarly, \begin{align*} &\hspace {-1pc}h\left ({{p_{i}}\mathbf {w}_{i}^{t}+\underset {k=1}{\overset {N}{\mathop \sum }}{p_{k}}\mathbf {Z}_{k}^{t} }\right) \\ &\le \frac {1}{2}\ln {{\left ({2\pi e\left ({\frac {{p_{i}}^{2}{C^{2}}}{d}+\sigma _{c}^{2}\underset {k=1}{\overset {N}{\mathop \sum }}p_{k}^{2} }\right) }\right)}^{d}}. \tag{2.46}\end{align*}
Combining (2.43), (2.45) and (2.46), we can obtain:\begin{align*} &\hspace {-1pc}\underset {{P_{{{\mathbf {w}}^{t\times N}}}}}{\sup }I\left ({\mathbf {w}_{i}^{1},\ldots,\mathbf {w}_{i}^{t};{{{\bar {\mathbf {w}}}}^{t}}|{{\mathbf {w}}^{1,-i}},\ldots,{{\mathbf {w}}^{t,-i}} }\right) \\ &=\frac {d}{2}\ln \left ({\frac {p_{i}^{2}{C^{2}}}{d\sigma _{c}^{2}\mathop{\mathop{\mathop \sum }\limits^{N}}\limits_{k=1}p_{k}^{2}}+1 }\right). \tag{2.47}\end{align*}
Hence, when the noise standard deviation $\sigma _{c}$
satisfies \begin{equation*} {\sigma _{c}}\ge \frac {C\mathop{\max }\limits_{i}{p_{i}}}{\sqrt {d\left ({{e^{\frac {2\epsilon }{d}}}-1 }\right)\mathop{\mathop{\mathop \sum }\limits^{N}}\limits_{k=1}p_{k}^{2}}}, \tag{2.48}\end{equation*}
this FL process satisfies $\epsilon $
-MI-DP.
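The condition (2.48) is tight: plugging the minimal noise level back into (2.47) should return exactly the privacy budget $\epsilon$. A minimal numeric sketch (the dimension, clipping bound, budget, and weights below are hypothetical values chosen for illustration):

```python
import math

# Hypothetical parameters for a numeric sanity check of (2.47)-(2.48).
d, C, eps = 100, 1.0, 1.0
p = [0.5, 0.3, 0.2]                      # aggregation weights, sum to 1
sum_p2 = sum(pk**2 for pk in p)

# Minimal per-client noise scale from (2.48).
sigma_c = C * max(p) / math.sqrt(d * (math.exp(2 * eps / d) - 1) * sum_p2)

# Worst-case per-round mutual information from (2.47), for the client
# with the largest weight.
mi = 0.5 * d * math.log(max(p)**2 * C**2 / (d * sigma_c**2 * sum_p2) + 1)

assert abs(mi - eps) < 1e-9              # the bound holds with equality
```
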
Appendix C
Proof of Theorem 2
Substituting the global model parameters from (4.14) and (4.15) into the expression for utility, for the privacy-preserving scheme in which noise is added at the server side (CDP), the utility $U_{cdp}^{t}$
satisfies:\begin{align*} & U_{cdp}^{t}=\frac {1}{\mathbb {E}\left [{ d\left ({\bar {\mathbf {w}}_{}^{t},\bar {\mathbf {w}}_{cdp}^{t} }\right) }\right]} \\ & =\frac {1}{\mathbb {E}\left [{ {{\lVert \sum \limits _{k=1}^{N}{{p_{k}}\mathbf {w}_{k}^{t}-}\sum \limits _{k=1}^{N}{{p_{k}}\mathbf {w}_{k}^{t}}-\mathcal {N}\left ({\mathbf {0},\sigma _{s}^{2}{\mathbf {I}_{d}} }\right) \rVert }^{2}} }\right]} \\ & =\frac {1}{\mathbb {E}\left [{ {{\lVert \mathcal {N}\left ({\mathbf {0},\sigma _{s}^{2}{\mathbf {I}_{d}} }\right) \rVert }^{2}} }\right]} \\ & =\frac {1}{d\sigma _{s}^{2}},\forall t, \tag{3.49}\end{align*}
and for the privacy-preserving scheme in which noise is added locally at the client side (LDP), the utility $U_{ldp}^{t}$
satisfies:\begin{align*} U_{ldp}^{t}&=\frac {1}{\mathbb {E}\left [{ d\left ({\bar {\mathbf {w}}_{}^{t},\bar {\mathbf {w}}_{ldp}^{t} }\right) }\right]} \\[-1pt] & =\frac {1}{\mathbb {E}\left [{ {{\lVert \sum \limits _{k=1}^{N}{{p_{k}}\mathbf {w}_{k}^{t}-}\sum \limits _{k=1}^{N}{{p_{k}}\left ({\mathbf {w}_{k}^{t}+\mathcal {N}\left ({\mathbf {0},\sigma _{c}^{2}{\mathbf {I}_{d}} }\right) }\right)} \rVert }^{2}} }\right]} \\[-1pt] & =\frac {1}{\mathbb {E}\left [{ {{\lVert \sum \limits _{k=1}^{N}{{p_{k}}\mathcal {N}\left ({\mathbf {0},\sigma _{c}^{2}{\mathbf {I}_{d}} }\right)} \rVert }^{2}} }\right]} \\[-1pt] & =\frac {1}{d\sigma _{c}^{2}\mathop{\mathop{\mathop \sum }\limits^{N}}\limits_{k=1}p_{k}^{2}},\forall t. \tag{3.50}\end{align*}
From Theorem 1 and Corollary 1, the minimum standard deviations of the Gaussian noise to be added by the two schemes under the same privacy budget, $\sigma _{s}$
and $\sigma _{c}$
satisfy (4.7) and (4.9). Substituting them into (3.49) and (3.50), we can obtain that for $\forall t\in \left \{{ 1,2,\ldots,T }\right \}$
the utility of the two schemes satisfies:\begin{align*} U_{ldp}^{t}=\frac {1}{d\sigma _{c}^{2}\mathop{\mathop{\mathop \sum }\limits^{N}}\limits_{k=1}p_{k}^{2}}=\frac {{e^{\frac {2\epsilon }{d}}}-1}{{C^{2}}{{\left ({\mathop{\max }\limits_{i}{p_{i}} }\right)}^{2}}}=\frac {1}{d\sigma _{s}^{2}}=U_{cdp}^{t}. \tag{3.51}\end{align*}
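The equality (3.51) can be checked numerically. The excerpt does not show (4.7) and (4.9), so the sketch below assumes the minimal noise levels implied by the MI-DP analysis: $\sigma_c$ from (2.48), and a server-side level $\sigma_s = C\max_i p_i/\sqrt{d(e^{2\epsilon/d}-1)}$, which is the form consistent with (3.51); all numeric parameters are hypothetical:

```python
import math

# Hypothetical setup checking the utility equality (3.51).
d, C, eps = 50, 2.0, 0.5
p = [0.4, 0.35, 0.25]
sum_p2 = sum(pk**2 for pk in p)

# Assumed minimal noise scales: sigma_s for CDP (server-side),
# sigma_c for LDP (client-side, per (2.48)).
sigma_s = C * max(p) / math.sqrt(d * (math.exp(2 * eps / d) - 1))
sigma_c = sigma_s / math.sqrt(sum_p2)

U_cdp = 1.0 / (d * sigma_s**2)                 # (3.49)
U_ldp = 1.0 / (d * sigma_c**2 * sum_p2)        # (3.50)

assert math.isclose(U_cdp, U_ldp)              # the two utilities coincide
assert math.isclose(U_cdp, (math.exp(2 * eps / d) - 1) / (C**2 * max(p)**2))
```
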
Appendix D
Proof of Lemma 1
We assume that $\{\sigma _{1,t}^{*},\ldots,\sigma _{N,t}^{*},p_{1,t}^{*},\ldots,p_{N,t}^{*}\}$
is the optimal solution of (5.26). We can simplify the constraints in (5.26) by expressing the first set of inequalities as one inequality, that is, \begin{equation*} \underset {i=1}{\overset {N}{\mathop \sum }}p_{i,t}^{2}\sigma _{i,t}^{2}\ge \underset {k}{\max }\alpha _{k,t}^{2}p_{k,t}^{2} \tag{4.52}\end{equation*}
If we do not consider this constraint, the minimum value of the objective function of the relaxed problem is 0, while $\underset {k}{\max }\alpha _{k,t}^{2}p_{k,t}^{2}>0$
, so the optimal solution must lie on the boundary of this constraint. And since $\alpha _{k,t}$
and $p_{k,t}$
are both non-negative real numbers, the optimal solution must satisfy:\begin{equation*} \underset {i=1}{\overset {N}{\mathop \sum }}{{\left ({p_{i,t}^{*}\sigma _{i,t}^{*} }\right)}^{2}}=\underset {k}{\max }{{\left ({{\alpha _{k,t}}p_{k,t}^{*} }\right)}^{2}}={{\left ({\underset {k}{\max }{\alpha _{k,t}}p_{k,t}^{*} }\right)}^{2}} \tag{4.53}\end{equation*}
Then we can use proof by contradiction, assuming that $\{\sigma _{1,t}^{*},\ldots,\sigma _{N,t}^{*},p_{1,t}^{*},\ldots,p_{N,t}^{*}\}$
does not satisfy (5.28), that is, \begin{equation*} \underset {k}{\max }{\alpha _{k,t}}p_{k,t}^{*}>\underset {k}{\min }{\alpha _{k,t}}p_{k,t}^{*} \tag{4.54}\end{equation*}
Suppose that the coordinates of the maximum and minimum values are $k_{0}$
and $k_{1}$
respectively, i.e., \begin{align*} {k_{0}}=\underset {k}{\mathop {\arg \max }}{\alpha _{k,t}}p_{k,t}^{*} \tag{4.55}\\ {k_{1}}=\underset {k}{\mathop {\arg \min }}{\alpha _{k,t}}p_{k,t}^{*} \tag{4.56}\end{align*}
In addition, we assume that the coordinate attaining the maximum value other than ${\alpha _{{k_{0}},t}}p_{{k_{0}},t}^{*}$
is $k_{2}$
, i.e., \begin{equation*} {k_{2}}=\underset {k\ne {k_{0}}}{\mathop {\arg \max }}{\alpha _{k,t}}p_{k,t}^{*} \tag{4.57}\end{equation*}
From (4.53), the value of the objective function $f^{*}$
is given by:\begin{align*} {f^{*}}=\underset {i=1}{\overset {N}{\mathop \sum }}{{\left ({p_{i,t}^{*}\sigma _{i,t}^{*} }\right)}^{2}}={{\left ({\underset {k}{\max }{\alpha _{k,t}}p_{k,t}^{*} }\right)}^{2}}={{\left ({{\alpha _{{k_{0}},t}}p_{{k_{0}},t}^{*} }\right)}^{2}} \tag{4.58}\end{align*}
For $k_{0}$
and $k_{1}$
, there exists $\delta >0$
such that:\begin{equation*} {\alpha _{{k_{0}},t}}p_{{k_{0}},t}^{*}-{\alpha _{{k_{1}},t}}p_{{k_{1}},t}^{*}>\left ({{\alpha _{{k_{0}},t}}+{\alpha _{{k_{1}},t}} }\right)\delta \tag{4.59}\end{equation*}
We can construct a new set of aggregation weights $\{{{\tilde {p}}_{1,t}},\ldots,{{\tilde {p}}_{N,t}}\}$
based on $\{p_{1,t}^{*},\ldots,p_{N,t}^{*}\}$
as follows:\begin{align*} \begin{cases} \displaystyle {{{\tilde {p}}}_{i,t}}=p_{i,t}^{*},i\in \{1,2,\ldots,N\},i\ne {k_{0}},{k_{1}} \\ \displaystyle {{{\tilde {p}}}_{{k_{0}},t}}=p_{{k_{0}},t}^{*}-\delta \\ \displaystyle {{{\tilde {p}}}_{{k_{1}},t}}=p_{{k_{1}},t}^{*}+\delta \end{cases} \tag{4.60}\end{align*}
For this new set of aggregation weights, we have:\begin{equation*} \sum \limits _{i=1}^{N}{{{{\tilde {p}}}_{i,t}}}=\sum \limits _{i\ne {k_{0}},{k_{1}}}{p_{i,t}^{*}}+p_{{k_{0}},t}^{*}-\delta +p_{{k_{1}},t}^{*}+\delta =1 \tag{4.61}\end{equation*}
Moreover, \begin{equation*} {{\tilde {p}}_{{k_{0}},t}}=p_{{k_{0}},t}^{*}-\delta >\frac {{\alpha _{{k_{1}},t}}\left ({\delta +p_{{k_{1}},t}^{*} }\right)}{{\alpha _{{k_{0}},t}}}>0 \tag{4.62}\end{equation*}
For the constraint (4.53), we can adjust the values of $\{\sigma _{1,t}^{*},\ldots,\sigma _{N,t}^{*}\}$
to satisfy it. Therefore, $\{{{\tilde {p}}_{1,t}},\ldots,{{\tilde {p}}_{N,t}}\}$
can satisfy all the constraints in the original problem.
Next, we analyze the value of the objective function corresponding to this solution. First, the aggregation weights at coordinates ${k_{0}}$
and ${k_{1}}$
satisfy:\begin{align*} &\hspace {-1pc} {\alpha _{{k_{0}},t}}{{{\tilde {p}}}_{{k_{0}},t}}-{\alpha _{{k_{1}},t}}{{{\tilde {p}}}_{{k_{1}},t}} \\ & ={\alpha _{{k_{0}},t}}\left ({p_{{k_{0}},t}^{*}-\delta }\right)-{\alpha _{{k_{1}},t}}\left ({p_{{k_{1}},t}^{*}+\delta }\right) \\ & ={\alpha _{{k_{0}},t}}p_{{k_{0}},t}^{*}-{\alpha _{{k_{1}},t}}p_{{k_{1}},t}^{*}-\left ({{\alpha _{{k_{0}},t}}+{\alpha _{{k_{1}},t}} }\right)\delta \\ & >0, \tag{4.63}\end{align*}
Therefore, for this solution, the value of the objective function $\tilde {f}$
can be divided into the following three cases based on the value at coordinate ${k_{2}}$
:
If ${\alpha _{{k_{0}},t}}{{\tilde {p}}_{{k_{0}},t}}\ge {\alpha _{{k_{2}},t}}{{\tilde {p}}_{{k_{2}},t}}$
, then \begin{align*} \tilde {f}&={{\left ({\max _{k}{\alpha _{k,t}}{{{\tilde {p}}}_{k,t}} }\right)}^{2}} \\ &={{\left ({{\alpha _{{k_{0}},t}}{{{\tilde {p}}}_{{k_{0}},t}} }\right)}^{2}} \\ &={{\left ({{\alpha _{{k_{0}},t}}\left ({p_{{k_{0}},t}^{*}-\delta }\right) }\right)}^{2}} < {f^{*}}. \tag{4.64}\end{align*}
If ${\alpha _{{k_{0}},t}}{{\tilde {p}}_{{k_{0}},t}} < {\alpha _{{k_{2}},t}}{{\tilde {p}}_{{k_{2}},t}}$
and ${\alpha _{{k_{0}},t}}p_{{k_{0}},t}^{*}>{\alpha _{{k_{2}},t}}p_{{k_{2}},t}^{*}$
, then \begin{equation*} \tilde {f}={{\left ({{\alpha _{{k_{2}},t}}{{{\tilde {p}}}_{{k_{2}},t}} }\right)}^{2}}={{\left ({{\alpha _{{k_{2}},t}}p_{{k_{2}},t}^{*} }\right)}^{2}} < {f^{*}}. \tag{4.65}\end{equation*}
If ${\alpha _{{k_{0}},t}}{{\tilde {p}}_{{k_{0}},t}} < {\alpha _{{k_{2}},t}}{{\tilde {p}}_{{k_{2}},t}}$
and ${\alpha _{{k_{0}},t}}p_{{k_{0}},t}^{*}={\alpha _{{k_{2}},t}}p_{{k_{2}},t}^{*}$
, then \begin{align*} \tilde {f}={{\left ({{\alpha _{{k_{2}},t}}{{{\tilde {p}}}_{{k_{2}},t}} }\right)}^{2}}={{\left ({{\alpha _{{k_{2}},t}}p_{{k_{2}},t}^{*} }\right)}^{2}}={{\left ({{\alpha _{{k_{0}},t}}p_{{k_{0}},t}^{*} }\right)}^{2}}={f^{*}}. \tag{4.66}\end{align*}
Continuing to construct new solutions iteratively via the method of (4.60), an objective function value smaller than ${f^{*}}$
can be obtained by iterating at most $N-1$
times. In summary, if the optimal solution satisfies (4.54), then a feasible solution satisfying all the constraints with a strictly smaller objective function value can always be found, contradicting the optimality assumption. Therefore, for any $i,j\in \{1,2,\ldots,N\}$
, the optimal solution satisfies:\begin{equation*} {\alpha _{i,t}}p_{i,t}^{*}={\alpha _{j,t}}p_{j,t}^{*} \tag{4.67}\end{equation*}
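The weight-shifting construction at the heart of this proof can be illustrated numerically. The sketch below (with hypothetical $\alpha_{k,t}$ values and uniform starting weights) applies one step of (4.60) and checks that the new weights remain a valid distribution, per (4.61), while the boundary objective $\max_k(\alpha_{k,t} p_{k,t})^2$ does not increase:

```python
# Hypothetical per-client factors alpha_k and uniform initial weights.
alpha = [2.0, 1.0, 0.5]
p = [1/3, 1/3, 1/3]                                 # weights, sum to 1

k0 = max(range(3), key=lambda k: alpha[k] * p[k])   # arg max of alpha_k * p_k
k1 = min(range(3), key=lambda k: alpha[k] * p[k])   # arg min of alpha_k * p_k
gap = alpha[k0] * p[k0] - alpha[k1] * p[k1]
delta = 0.5 * gap / (alpha[k0] + alpha[k1])         # satisfies (4.59)

# One shift step, as in (4.60).
p_new = list(p)
p_new[k0] -= delta
p_new[k1] += delta

assert abs(sum(p_new) - 1.0) < 1e-12                # (4.61): still a distribution
assert max(a * q for a, q in zip(alpha, p_new)) ** 2 \
       <= max(a * q for a, q in zip(alpha, p)) ** 2  # objective not increased
```
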
Appendix E
Proof of Theorem 3
For the optimal aggregation weights $\{p_{1,t}^{*},\ldots,p_{N,t}^{*}\}$
, from the equivalent optimization problem (5.26), using Lemma 1 and the constraint that the sum of all aggregation weights equals 1, we can obtain:\begin{align*} \begin{cases} & {\alpha _{i,t}}p_{i,t}^{*}\!=\!{\alpha _{j,t}}p_{j,t}^{*},i\in \{1,2,\ldots,N\},j\in \{1,2,\ldots,N\},i\ne j \\ & \sum \limits _{k=1}^{N}{p_{k,t}^{*}=1} \\ \end{cases} \tag{5.68}\end{align*}
After simplification, we have:\begin{align*} \left ({\begin{matrix} 1 & 1 & \cdots & 1 \\ {\alpha _{1,t}} & -{\alpha _{2,t}} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ {\alpha _{1,t}} & 0 & \cdots & -{\alpha _{N,t}} \\ \end{matrix} }\right)\left ({\begin{matrix} p_{1,t}^{*} \\ \vdots \\ p_{N,t}^{*} \\ \end{matrix} }\right)=\left ({\begin{matrix} 1 \\ 0 \\ \vdots \\ 0 \\ \end{matrix} }\right) \tag{5.69}\end{align*}
This system of linear equations has a unique solution, i.e., \begin{equation*} p_{k,t}^{*}=\frac {\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne k}{{\alpha _{j,t}}}}{\sum \limits _{i=1}^{N}{\left ({\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne i}{{\alpha _{j,t}}} }\right)}},~~k=1,2,\ldots,N. \tag{5.70}\end{equation*}
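Dividing the numerator and denominator of (5.70) by $\prod_j \alpha_{j,t}$ shows that the optimal weights are simply inversely proportional to $\alpha_{k,t}$, i.e., $p_{k,t}^{*}\propto 1/\alpha_{k,t}$. A short numeric check with hypothetical $\alpha_{k,t}$ values:

```python
import math

# Hypothetical per-client factors alpha_{k,t}.
alpha = [1.0, 2.0, 4.0]

# (5.70) rewritten: p_k proportional to 1/alpha_k, normalized to sum to 1.
inv = [1.0 / a for a in alpha]
p_star = [x / sum(inv) for x in inv]

assert math.isclose(sum(p_star), 1.0)                       # valid weights
products = [a * p for a, p in zip(alpha, p_star)]
assert all(math.isclose(products[0], v) for v in products)  # (5.68) holds
```
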
By substituting the optimal aggregation weights (5.70) into (4.58), we can obtain:\begin{align*} \underset {k=1}{\overset {N}{\mathop \sum }}&{{\left ({\frac {\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne k}{{\alpha _{j,t}}}}{\sum \limits _{i=1}^{N}{\left ({\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne i}{{\alpha _{j,t}}} }\right)}}\sigma _{k,t}^{*} }\right)}^{2}} \\ &={{\left ({\frac {\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \}}{{\alpha _{j,t}}}}{\sum \limits _{i=1}^{N}{\left ({\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne i}{{\alpha _{j,t}}} }\right)}} }\right)}^{2}} \tag{5.71}\end{align*}
i.e., \begin{equation*} \sum \limits _{k=1}^{N}{{{\left ({\frac {\sigma _{k,t}^{*}}{{\alpha _{k,t}}} }\right)}^{2}}=1} \tag{5.72}\end{equation*}
Together with the non-negative constraint on the noise variance, we obtain the optimal solution set in (5.30). Since this problem is equivalent to the original optimization problem, substituting this optimal solution set into (5.24) gives the maximum global model utility as \begin{align*} \underset {\{{\sigma _{k,t}},{p_{k,t}}\}_{k=1}^{N}}{\max }{U^{t}}&=\frac {1}{d\mathop{\mathop{\mathop \sum }\limits^{N}}\limits_{k=1}{{\left ({p_{k,t}^{*}\sigma _{k,t}^{*} }\right)}^{2}}} \\ &=\frac {1}{d}{{\left ({\frac {\sum \limits _{i=1}^{N}{\left ({\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \},j\ne i}{{\alpha _{j,t}}} }\right)}}{\prod \limits _{j\in \left \{{ 1,\ldots,N }\right \}}{{\alpha _{j,t}}}} }\right)}^{2}}. \tag{5.73}\end{align*}
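The closed form (5.73) can be verified against the direct utility expression by picking one member of the optimal set, e.g., $\sigma_{k,t}^{*}=\alpha_{k,t}/\sqrt{N}$, which satisfies (5.72). The sketch below uses hypothetical $\alpha_{k,t}$ values and an arbitrary dimension $d$:

```python
import math

# Hypothetical problem instance.
d = 10
alpha = [1.0, 3.0, 5.0]
N = len(alpha)

# Optimal weights from (5.70) and one choice of noise levels from (5.72).
inv_sum = sum(1.0 / a for a in alpha)
p_star = [(1.0 / a) / inv_sum for a in alpha]
sigma_star = [a / math.sqrt(N) for a in alpha]
assert math.isclose(sum((s / a) ** 2 for s, a in zip(sigma_star, alpha)), 1.0)

# Direct utility 1 / (d * sum (p_k sigma_k)^2) vs. closed form (5.73).
U_direct = 1.0 / (d * sum((p * s) ** 2 for p, s in zip(p_star, sigma_star)))
prod_all = math.prod(alpha)
U_closed = (1.0 / d) * (sum(prod_all / a for a in alpha) / prod_all) ** 2

assert math.isclose(U_direct, U_closed)
```
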