Introduction
Recently, machine learning has played a key role in numerous fields. For example, it has achieved significant results in medical prediction [1], [2], autonomous driving [3], [4], and image recognition [5]. In machine learning, data privacy and security are key concerns. For example, medical data often contains a large amount of sensitive personal information; if an unauthorized third party accesses such data, it can lead to a serious privacy breach that harms the interests of patients [6], [7]. Additionally, since traditional machine learning typically operates on unencrypted data, it also poses a serious risk of privacy leakage.
To address these issues, Google proposed federated learning in 2016 [8]. In federated learning (FL), users only need to share their local gradients instead of their original valuable data. This approach makes it possible to utilize sensitive information while mitigating the privacy risks that arise from collecting data from different users.
However, current research indicates that even if users only upload gradient information, their privacy can still be compromised [9], [10]. Attackers might exploit vulnerabilities in the cloud server to reveal specific attributes of training samples, or fabricate aggregation results to induce users to leak more valuable information. In extreme cases, attackers can even use the leaked data to reconstruct users' original data. On the other hand, motivated by illicit profit, a malicious cloud server might return incorrect aggregation results to users. For example, to reduce its computation costs, the cloud server may use a simplified and less accurate model to process the uploaded gradients, or even directly modify the aggregation results [11], [12].
To address these issues, we propose the Verifiable Privacy-Preserving Federated Learning (VPPFL). Our contributions are outlined as follows:
We design a verifiable federated learning mechanism to deal with a semi-malicious cloud server. The scheme enables users to independently verify the aggregation results without the intervention of a trusted third party, and effectively prevents collusion attacks initiated by the cloud server in collaboration with a small portion of users.
Our scheme employs multi-key threshold homomorphic encryption to protect users' private data, and allows a small portion of users to drop out during training without adding an additional burden to the server. Even if several users fail to upload their data, the training process is not interrupted.
We provide security analysis and simulation experiments to validate the security and efficiency of VPPFL.
The rest of this paper is organized as follows. Section II reviews the relevant research and the basic concepts and techniques involved. Section III introduces the system model and security requirements. Section IV elaborates the technical details of VPPFL. We analyze the security and performance of VPPFL in Section V. Section VI presents the experimental results. Finally, Section VII concludes the work.
Related Work and Relevant Concepts and Technologies
A. Related Work
When constructing privacy-preserving FL, two cryptographic tools are commonly used, i.e., differential privacy [13] and homomorphic encryption [14].
Differential privacy is a privacy protection technique with provable security, which protects data by adding random noise. In 2006, Dwork [15] of Microsoft first proposed differential privacy, and later in [16] a new mechanism, Propose-Test-Release (PTR), was used to achieve high-quality differential privacy results. Geyer et al. [17] used differential privacy for the first time in FL to protect participants' data by adding Gaussian noise on the server side. Wei et al. [18] employed a local differential privacy strategy during the local model updates of deep neural networks, protecting the local gradients by adding noise before uploading the local model. However, this method does not take user dropout into consideration.
Homomorphic encryption is now widely used in the construction of privacy-preserving FL. Rivest and Dertouzos [19] introduced homomorphic encryption into asynchronous stochastic gradient descent training. However, all users utilize the same private key, leading to a potential risk: if the server colludes with some users, the data privacy of the other users cannot be guaranteed. Wang et al. [20] adopted homomorphic encryption to protect users' local data and implemented access control to verify the credibility of user identities, effectively defending against threats from internal attacks. Ma et al. [21] used multi-key homomorphic encryption to encrypt the local gradients before they are uploaded. Decryption requires the collaborative participation of all users to prevent unauthorized access to participants' data. As a result, if users drop out in the middle of the training process, decryption cannot be completed, which is impractical for real-world FL.
Recently, the research community has proposed various schemes to address the data integrity challenge in FL. Xu et al. [22] introduced an innovative verifiable privacy-preserving FL architecture; by employing homomorphic hash functions and zero-knowledge proofs, they construct a verifiable and secure aggregation mechanism. Guo et al. [23] modified this framework to reduce the communication cost, while also pointing out that if a malicious cloud server colluded with users, the scheme in [22] would still face certain security vulnerabilities. Lin and Zhang [24] used differential privacy and a one-way function to allow users to verify aggregation results returned by a lazy server, but the approach does not support user dropout during training. Ren et al. [25] adopted a linear homomorphic hash function and digital signatures to achieve traceable verification of aggregation results and identification of erroneous rounds. However, this approach inevitably increases the communication cost.
B. Concepts and Technologies
We now introduce some relevant concepts and technologies. Some of the symbols used in this paper are listed in Table 1.
1) Federated Learning
Different from traditional machine learning, FL has made significant strides in protecting users' privacy. In FL, users do not upload personal data; they only share their local gradients, significantly reducing the risk of personal information leakage. As shown in Figure 1, users upload these gradients to the cloud server, which aggregates them and feeds the results back to the users. In this way, users and the server collaborate to train a globally optimized model, ensuring the security of personal data while achieving efficient model training.
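To make this workflow concrete, the following minimal Python sketch (our own illustration, not part of VPPFL, and without any encryption) shows users computing local gradients on a toy linear model and a server averaging them into a global update; all names and values are hypothetical.

import numpy as np

def local_gradient(w, X, y):
    # squared-error gradient for a toy linear model y ~ X @ w
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -1.0, 0.5])               # only used to simulate user data
users = []
for _ in range(4):                                # four users with private local data
    X = rng.normal(size=(8, 3))
    users.append((X, X @ true_w))

w = np.zeros(3)                                   # current global model
local_grads = [local_gradient(w, X, y) for X, y in users]   # computed locally; raw data stays local
global_grad = np.mean(local_grads, axis=0)        # the server only sees and averages gradients
w = w - 0.1 * global_grad                         # updated global model returned to the users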
2) Neural Network
We now introduce a classic deep neural network, the fully connected neural network (FCNN). Figure 2 shows the architecture of an FCNN. The neurons in each layer are densely connected to the neurons in the preceding and following layers by weights.
The training loss of an FCNN over a dataset $D$ with parameters $\omega$ can be represented by \begin{equation*} L_{f}(D,\omega ) = \frac {1}{|D|} \sum _{(x_{i},y_{i}) \in D} L_{f}((x_{i},y_{i}),\omega ), \tag {1}\end{equation*} where $L_{f}((x_{i},y_{i}),\omega )$ denotes the loss on a single sample.
The objective of neural network training is to find an optimal set of parameters that minimizes the loss function in Eq. (1). Stochastic gradient descent (SGD), summarized in Algorithm 1, is commonly used for this purpose.
Algorithm 1 SGD
Input: dataset $D$; learning rate $\theta$; loss function $L_{f}$
Output: the optimal model parameters $\omega$
Randomly select an initial $\omega _{0}$;
At the j-th iteration, randomly select a small batch of data $D_{j} \subseteq D$;
for each $(x_{i}, y_{i}) \in D_{j}$ do
Calculate the gradient $\nabla L_{f}((x_{i}, y_{i}), \omega _{j})$;
end
Calculate the average gradient $g_{j} = \frac {1}{|D_{j}|} \sum _{(x_{i},y_{i}) \in D_{j}} \nabla L_{f}((x_{i}, y_{i}), \omega _{j})$;
Update weight $\omega _{j+1} = \omega _{j} - \theta g_{j}$;
until convergence is satisfied;
return $\omega$.
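For concreteness, the following short Python sketch follows the steps of Algorithm 1 on a toy linear-regression loss; the dataset, batch size, learning rate, and iteration budget are our own illustrative choices rather than anything prescribed by the paper.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)      # toy dataset D

theta = 0.05                                      # learning rate
w = rng.normal(size=3)                            # randomly selected initial weights
for j in range(500):                              # fixed budget instead of a convergence test
    idx = rng.choice(len(X), size=16, replace=False)   # small batch D_j
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)     # average gradient over the batch
    w = w - theta * grad                          # weight update
print(np.round(w, 2))                             # approaches [1.5, -2.0, 0.5]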
3) Threshold Paillier Cryptosystem
In VPPFL, we use the threshold Paillier cryptosystem [26] to construct a secure framework, since it has two important features: 1) Threshold property: no single user can decrypt a ciphertext alone; at least t users need to work together to decrypt it. 2) Homomorphic additivity: multiplying ciphertexts corresponds to adding the underlying plaintexts, enabling operations on plaintexts to be carried out through computations on ciphertexts. These two features provide sufficient functionality and privacy protection for our scheme.
In the threshold Paillier cryptosystem, the public key $pk$ is known to all participants, while the private key $SK$ is split into shares that are distributed among the users, so that any t of them can jointly decrypt a ciphertext.
For a plaintext M, encrypting it using the public key pk will yield the ciphertext \begin{equation*} c = E_{pk}(M) = G^{M} x^{K}\;mod\; K^{2}, \tag {2}\end{equation*} where $x \in \mathbb{Z}_{K}^{*}$ is a random number.
This cryptosystem has homomorphic additivity, which can be described as $E_{pk}(M_{1}) \times E_{pk}(M_{2}) \;mod\;K^{2} = E_{pk}(M_{1} + M_{2})$.
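As a quick illustration of Eq. (2) and this property, the following toy Python sketch (our own, with tiny insecure primes) uses the standard single-key Paillier decryption rather than the threshold variant of VPPFL, which is sketched in Section IV; it requires Python 3.8+ for the modular inverse via pow.

import random
from math import gcd

p, q = 7, 23                          # toy primes; real systems use >1024-bit primes
K, K2 = p * q, (p * q) ** 2
G = K + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1), the decryption trapdoor

def enc(M):
    x = random.choice([v for v in range(2, K) if gcd(v, K) == 1])
    return (pow(G, M, K2) * pow(x, K, K2)) % K2          # c = G^M x^K mod K^2

def dec(c):
    L = lambda u: (u - 1) // K
    return (L(pow(c, lam, K2)) * pow(lam, -1, K)) % K    # standard Paillier decryption

c1, c2 = enc(12), enc(30)
print(dec((c1 * c2) % K2))            # 42: multiplying ciphertexts adds the plaintexts

In the threshold variant used by VPPFL, no single party holds the trapdoor; decryption instead combines partial decryption shares, as described in Section IV.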
4) One-Way Function
The one-way function is constructed based on the hardness of the discrete logarithm problem, which ensures that the semi-malicious cloud server cannot infer the users' private information from the function values of the gradients. The specific construction is as follows:
Assume a is a generator of order k, and b is a large prime number. Construct a one-way function h: \begin{equation*} h(M) = a^{M} \;mod\;b,\; M \in \mathbb {Z}. \tag {3}\end{equation*}
It satisfies homomorphic addition, i.e., $h(M_{1}) \times h(M_{2}) \;mod\;b = h(M_{1} + M_{2})$.
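A quick numeric check of Eq. (3) and this property in Python; the values of a and b are toy choices of our own, not parameters from the paper.

b = 104729                        # a prime modulus (the 10,000th prime; toy size)
a = 3                             # generator used by h
h = lambda M: pow(a, M, b)        # h(M) = a^M mod b

M1, M2 = 15, 28
print((h(M1) * h(M2)) % b == h(M1 + M2))   # True: h(M1) * h(M2) mod b = h(M1 + M2)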
System Model and Security Requirements
A. System Architecture
As shown in Figure 3, our system consists of three parts: semi-malicious cloud server (CS), trusted authority (TA) and semi-honest users (Users).
Semi-malicious cloud server (CS): The main task of CS is to aggregate the gradients uploaded by users, broadcast the function values of gradients to all users, decrypt the aggregation results, and send them to users. We require CS to only obtain ciphertexts and the final aggregation results, without knowing any other information.
Trusted authority (TA): TA is an authoritative and trustworthy entity (e.g., a government agency). It does not collude with any party. Its main task is to initialize the model parameters, generate a one-way function, and generate key pairs for all users. It then sends the one-way function and key pairs to users through a secure channel and broadcasts the public key. After that, unless there is a dispute, it goes offline.
Semi-honest users (Users): Users are the owners of the data, who participate in the training process and ultimately obtain the global model. During each round, every user sends his encrypted local gradient and the one-way function value of his gradient to CS, and cooperates with CS to decrypt the aggregation results. Finally, all users verify the aggregation results returned by CS.
B. Threat Model
Semi-malicious CS attack: CS may deceive users by omitting the gradients of one or more users from the aggregation in order to save costs.
Semi-honest user attack: A semi-honest user may try to use the information he holds to infer the private data of other users.
Collusion attack: A small number of users may collude with CS and attempt to infer the private information of other users by sharing information such as model parameters.
External malicious attack: There is an external malicious adversary, denoted as $\mathbb{A}$, who attempts to obtain users' private information from the data exchanged during training.
C. Security Objectives
We aim to propose an efficient, secure, and verifiable privacy-preserving FL scheme. Specifically, the following objectives should be achieved:
Privacy of users' data: No entity other than the user himself should be able to access the user's sensitive information, including the external adversary $\mathbb{A}$ and CS.
Verifiability of aggregation results: Every user should be able to independently verify the aggregation results. If CS returns incorrect aggregation results, users should have the right to reject them and request CS to re-aggregate the results.
Support for user dropout: The scheme should allow a small portion of users to join or drop out of the training process without interrupting the overall training of the model.
VPPFL
In this section, we provide the detailed design of VPPFL. Our scheme consists of four main stages: 1) Initialization; 2) Encryption; 3) Decryption and 4) Verification. Figure 4 illustrates the process flow of VPPFL.
Initialization
TA takes on the role of initializing the system parameters and generating key pairs. The specific process for generating the public and private keys is as follows, as shown in Algorithm 2.
Parameter Generation: The parameters that need to be initialized include the global weight $\omega$, the learning rate $\theta$, the training epoch, the security parameter $\kappa$, and the one-way function h.
Key generation and distribution: First, TA randomly generates two large prime numbers $p=2p'+1$, $q=2q'+1$, where $p', q' \lt \kappa$. Second, TA generates the RSA modulus $K=pq$, ensuring that $\gcd (K, \psi (K)) = 1$. Then, TA randomly selects $\beta \in \mathbb {Z}_{K}^{*}$ and calculates $m=p'q'$ and $\Delta = N!$. Next, TA disguises m as $\alpha = m \beta \;mod\;K$.
TA sets the public key $pk = (K, G, \alpha)$ and the private key $SK = \beta m$, where $G=K+1$. TA splits the private key SK as follows: it selects t random numbers $a_{1}, a_{2}, \ldots , a_{t} \in \{0, 1, \ldots , Km-1\}$, then generates the polynomial $f(x) = \beta m + a_{1} x + \ldots + a_{t} x^{t-1}\; mod \; Km$. Finally, TA sends $f(n)$ to each participant $P_{n}\,(1 \leq n \leq N)$ through a secure channel.
Encryption
Each user $P_{n}\,(1 \leq n \leq N)$ encrypts his own gradient: for the gradient vector $g_{n} = [g_{n1},g_{n2}, \ldots , g_{nm}]$, $P_{n}$ chooses a random number $x_{n} \in Z_{K}^{*}$, uses the public key pk to calculate the ciphertext $Enc_{pk}(g_{n}) = G^{g_{n}} x_{n}^{K} \;mod\;K^{2}$, and computes the one-way function value of the gradient $h(sum(g_{n})) = a^{sum(g_{n})} \;mod\;b$.
Each user $P_{n}$ sends the ciphertext $Enc_{pk}(g_{n})$ and the one-way function value of his gradient $h(sum(g_{n}))$ to CS. CS broadcasts the received one-way function values of gradients $\{h(sum(g_{1})), h(sum(g_{2})), \ldots , h(sum(g_{N}))\}$, and aggregates the ciphertexts to obtain the encrypted gradient ciphertext \begin{equation*} c = \prod _{n=1}^{N} Enc_{pk}(g_{n}). \tag {4}\end{equation*}
Decryption
For the ciphertext c, CS randomly selects $t\,(1\leq t \leq N)$ users to send decryption requests. Suppose the selected participants form a set S. Each selected participant computes the decryption share $s_{n} = c^{2\Delta sk_{n}} \;mod\;K^{2}$ and sends it to CS. CS can then compute the aggregation result \begin{equation*} g_{*} = L\left ({{\prod _{n \in S} s_{n}^{2\mu _{n}}\;mod\;K^{2}}}\right ) \times \frac {1}{4\Delta ^{2} \alpha } \;mod\;K, \tag {5}\end{equation*} where $\mu _{n} = \Delta \times \lambda _{0,n}^{S} \in \mathbb {Z}$, $\lambda _{0,n}^{S} = \prod _{n' \in S\backslash \{n\}} \frac {-n'}{n-n'}$, and $L(u) = \frac {u - 1}{K}$.
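To make Eqs. (2)-(5) concrete, the following self-contained Python sketch runs the whole pipeline (key splitting, encryption, ciphertext aggregation, and threshold decryption) with tiny, insecure parameters of our own choosing; it is an illustrative sketch, not the authors' MATLAB implementation, and needs Python 3.8+ for negative modular exponents.

from fractions import Fraction
from math import factorial, gcd
import random

# ---- TA setup (Initialization), following the paper's notation ----
p, q = 7, 23                    # safe primes p = 2*3+1, q = 2*11+1 (toy size)
p_, q_ = 3, 11
K = p * q                       # RSA modulus K = pq
K2 = K * K
G = K + 1
m = p_ * q_                     # m = p'q'
N, t = 3, 2                     # 3 users, threshold t = 2
Delta = factorial(N)            # Delta = N!
beta = 5                        # a random element of Z_K^*
assert gcd(beta, K) == 1 and gcd(K, 4 * m) == 1
alpha = (m * beta) % K          # alpha disguises m
SK = beta * m                   # private key SK = beta * m

# Shamir split of SK over Z_{Km}; degree t-1 so that any t shares determine f(0)
a1 = random.randrange(K * m)
f = lambda x: (SK + a1 * x) % (K * m)
shares = {n: f(n) for n in range(1, N + 1)}     # sk_n = f(n)

# ---- Users encrypt (Eq. (2)) and CS aggregates (Eq. (4)) ----
def enc(M):
    x = random.randrange(2, K)
    while gcd(x, K) != 1:
        x = random.randrange(2, K)
    return (pow(G, M, K2) * pow(x, K, K2)) % K2

gradients = [3, 5, 9]           # toy integer "gradients", one per user
c = 1
for g in gradients:
    c = (c * enc(g)) % K2       # homomorphic aggregation of the ciphertexts

# ---- Threshold decryption (Eq. (5)) ----
S = [1, 3]                      # any t users answer the decryption request
s = {n: pow(c, 2 * Delta * shares[n], K2) for n in S}   # decryption shares

def mu(n):                      # mu_n = Delta * lambda_{0,n}^S, always an integer
    lam = Fraction(1)
    for n2 in S:
        if n2 != n:
            lam *= Fraction(-n2, n - n2)
    val = Delta * lam
    assert val.denominator == 1
    return int(val)

prod = 1
for n in S:                     # negative exponents need Python 3.8+
    prod = (prod * pow(s[n], 2 * mu(n), K2)) % K2

L = lambda u: (u - 1) // K
g_star = (L(prod) * pow(4 * Delta * Delta * alpha, -1, K)) % K
print(g_star, sum(gradients))   # prints: 17 17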
Verification
Algorithm 2 KGA(TA)
Output: public key $pk$ and a private key share $f(n)$ for each user
Randomly generate two prime numbers $p = 2p'+1$, $q = 2q'+1$;
Calculate $\Delta = N!$;
Calculate RSA modulus $K = pq$ and set $G = K+1$;
Calculate decryption key $m = p'q'$;
Randomly choose a $\beta \in \mathbb{Z}_{K}^{*}$;
Calculate $\alpha = m\beta \;mod\;K$;
Set private key $SK = \beta m$ and public key $pk = (K, G, \alpha)$;
Split the private key SK: select t random numbers $a_{1}, a_{2}, \ldots , a_{t} \in \{0, 1, \ldots , Km-1\}$ and generate the polynomial $f(x) = \beta m + a_{1} x + \ldots + a_{t} x^{t-1}\;mod\;Km$;
Each user $P_{n}\,(1 \leq n \leq N)$ receives the share $sk_{n} = f(n)$ through a secure channel.
Algorithm 3 provides a detailed description of the VPPFL process.
Algorithm 3 VPPFL
Round 0 (Initialization)
TA:
Generate a set of private key shares $\{sk_{n} = f(n)\}_{n=1}^{N}$ using Algorithm 2;
Send the private key share $sk_{n}$ to each user $P_{n}$ through a secure channel;
Select a large prime number b and a generator a of order k to create a one-way function $h(M) = a^{M}\;mod\;b$;
Send the function to all users and then go offline.
Round 1 (Encryption)
Users:
Each user $P_{n}$ chooses a random number $x_{n} \in Z_{K}^{*}$;
Calculate the encryption of gradient $Enc_{pk}(g_{n}) = G^{g_{n}} x_{n}^{K}\;mod\;K^{2}$ and the one-way function value $h(sum(g_{n}))$, and send both to CS.
CS:
Receive $Enc_{pk}(g_{n})$ and $h(sum(g_{n}))$ from the online users;
Calculate the aggregated encrypted gradient $c = \prod _{n=1}^{N} Enc_{pk}(g_{n})$;
Broadcast the received $\{h(sum(g_{1})), \ldots , h(sum(g_{N}))\}$ to all users.
Round 2 (Decryption)
CS randomly selects t users and sends the decryption requests to them;
After receiving the decryption request, user $P_{n}$ computes the decryption share $s_{n} = c^{2\Delta sk_{n}}\;mod\;K^{2}$ and sends it to CS;
CS:
Receive the decryption shares $\{s_{n}\}_{n \in S}$;
Calculate $\lambda _{0,n}^{S} = \prod _{n' \in S\backslash \{n\}} \frac {-n'}{n-n'}$;
Calculate $\mu _{n} = \Delta \times \lambda _{0,n}^{S}$;
Decrypt the aggregation gradient $g_{*} = L\left ({\prod _{n \in S} s_{n}^{2\mu _{n}}\;mod\;K^{2}}\right ) \times \frac {1}{4\Delta ^{2} \alpha }\;mod\;K$;
Send $g_{*}$ to all users.
Round 3 (Verification)
Users:
Each user receives $g_{*}$ and the broadcast values $\{h(sum(g_{n}))\}_{n=1}^{N}$;
Calculate $a^{sum(g_{*})}\;mod\;b$;
Calculate $\prod _{n=1}^{N} h(sum(g_{n}))\;mod\;b$;
if $a^{sum(g_{*})}\;mod\;b = \prod _{n=1}^{N} h(sum(g_{n}))\;mod\;b$ then accept $g_{*}$ and update the local model;
else
Users request CS to recompute the aggregation results.
Next, we provide a proof of correctness for our scheme.
Theorem 1:
If CS honestly performs the aggregation operations in the VPPFL, the aggregation results will pass verification.
Proof:
The encrypted gradients uploaded by the users to CS are $Enc_{pk}(g_{n}) = G^{g_{n}} x_{n}^{K}\;mod\;K^{2}$, $n = 1, \ldots , N$.
We have \begin{equation*} c = \prod _{n=1}^{N} Enc_{pk}(g_{n}) = G^{\sum _{n=1}^{N} g_{n}} \left ({\prod _{n=1}^{N} x_{n}}\right )^{K}\;mod\;K^{2}.\end{equation*}
Therefore, c is a valid ciphertext of $\sum _{n=1}^{N} g_{n}$ under the public key pk.
Given that the threshold decryption in Eq. (5) correctly recovers the plaintext of c, participants will receive the aggregation gradient $g_{*} = \sum _{n=1}^{N} g_{n}$.
Based on the homomorphic property of the one-way function, if the following equation holds, then the aggregation gradients will pass verification: \begin{equation*} a^{sum(g_{*})}\;mod\;b = \prod _{n=1}^{N} h(sum(g_{n}))\;mod\;b.\end{equation*} Since $sum(g_{*}) = \sum _{n=1}^{N} sum(g_{n})$ and $h(M_{1}) \times h(M_{2})\;mod\;b = h(M_{1} + M_{2})$, this equation indeed holds.
Therefore, the aggregation results will pass verification.
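The verification equation in this proof can be checked numerically. The short Python sketch below, with illustrative parameters and gradient sums of our own choosing, shows that an honest aggregation passes the check while a lazy CS that drops one user's gradient is detected.

b = 104729                        # a prime modulus (toy size)
a = 5
h = lambda M: pow(a, M, b)        # the one-way function h(M) = a^M mod b

user_sums = [14, 27, 9]           # sum(g_n) for users P_1, P_2, P_3
rhs = 1
for v in user_sums:
    rhs = (rhs * h(v)) % b        # product of the broadcast values h(sum(g_n))

honest = sum(user_sums)           # sum(g_*) when CS aggregates all gradients
print(h(honest) == rhs)           # True: the honest result passes verification

lazy = sum(user_sums[:-1])        # CS silently omits the gradient of P_3
print(h(lazy) == rhs)             # False: users detect the incorrect result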
Security Analysis and Performance Evaluation
A. Security Analysis
In this section, we conduct theoretical analysis and security proof of the VPPFL, including data privacy and the verifiability of the aggregation gradients.
We first introduce some notations. Consider a server CS interacting with a set of N users, and let the security parameter be $\kappa$.
Given a subset of users that may collude with CS, their combined knowledge consists of all messages they send and receive during the execution of the protocol.
Definition 1:
If any adversary $\mathbb{A}$ wins the following game with at most a negligible advantage, the scheme is said to preserve data privacy under chosen-plaintext attacks.
Initialization stage: Enter the security parameter $\kappa$. The challenger runs the key generation algorithm and sends the public key pk to the adversary $\mathbb{A}$.
Challenge Phase: The adversary chooses two messages $M_{0}$ and $M_{1}$ of equal length and sends them to the challenger. The challenger randomly picks a bit $\sigma \in \{0,1\}$, encrypts $M_{\sigma}$, and returns the challenge ciphertext to the adversary.
Output: The adversary $\mathbb{A}$ outputs a guess $\sigma '$ of $\sigma$.
The advantage of the adversary $\mathbb{A}$ is defined as $Adv_{\mathbb{A}} = \left | \Pr [\sigma ' = \sigma ] - \frac {1}{2} \right |$.
Data privacy
Theorem 2:
VPPFL can resist collusion attacks between the CS and fewer than t users. That is, any coalition consisting of CS and fewer than t users cannot obtain the private gradient of any honest user from the information it holds.
Proof:
We assume the set of participants colluding with CS contains fewer than t users. In this case the coalition holds at most $t-1$ private key shares $f(n)$. Since reconstructing the private key $SK = \beta m$ from the polynomial $f(x)$ requires at least t shares, the coalition cannot perform threshold decryption and therefore cannot decrypt the ciphertexts $Enc_{pk}(g_{n})$ of the honest users. Hence the colluding parties learn nothing about the honest users' gradients beyond the final aggregation result, and VPPFL resists such collusion attacks.
Theorem 3:
In VPPFL, no party can obtain the private information of other users.
Proof:
The data that CS can obtain are the encrypted gradients $Enc_{pk}(g_{n})$, the one-way function values of gradients $h(sum(g_{n}))$, the decryption shares $s_{n}$, and the final aggregation result $g_{*}$.
From Theorem 2, we know that CS, even when colluding with fewer than t users, cannot extract other users' private information from the encrypted gradients. Therefore, a single user also cannot derive any useful information from the encrypted gradients. Both CS and all users can obtain the one-way function values of gradients $h(sum(g_{n}))$, but since h is built on the hardness of the discrete logarithm problem, they cannot recover $sum(g_{n})$ from these values. Hence no party can obtain the private information of other users.
Theorem 4:
Under the DDH assumption, VPPFL is IND-CPA secure. That is, the proposed scheme meets the security definition of data privacy under chosen-plaintext attacks.
Proof:
Suppose there exists an external adversary $\mathbb{A}$ that wins the game in Definition 1 with a non-negligible advantage. Then $\mathbb{A}$ could be used to distinguish ciphertexts produced by the threshold Paillier cryptosystem, contradicting the assumed hardness of the underlying problem. Therefore, the advantage of any external adversary is negligible.
At the same time, by Theorem 2 our protocol remains secure even if a small number of users collude with CS, and by Theorem 3 no party involved in the training can infer other users' private information from the inputs and intermediate results it obtains. Therefore, the probability that the external adversary $\mathbb{A}$ or any internal party recovers an individual user's private data is negligible.
From the above arguments, neither an external adversary nor an internal adversary can obtain the private information of a single user. Therefore, our protocol is IND-CPA secure.
Verifiability of the aggregation gradient
Theorem 5:
If CS returns an incorrect aggregation gradient $g_{*}' \neq \sum _{n=1}^{N} g_{n}$, it cannot pass the users' verification.
Proof:
From Theorem 1, if CS tries to reduce the amount of computation for aggregation, for example by aggregating only a subset of the uploaded ciphertexts, the aggregation gradient ciphertext becomes $c' = \prod _{n \in N'} Enc_{pk}(g_{n})$ for some proper subset $N'$ of the users, and the decrypted result $g_{*}'$ satisfies $sum(g_{*}') \neq \sum _{n=1}^{N} sum(g_{n})$. In this case $a^{sum(g_{*}')}\;mod\;b \neq \prod _{n=1}^{N} h(sum(g_{n}))\;mod\;b$ except with negligible probability, so the verification equation fails; users reject the result and request CS to re-aggregate.
B. Performance Evaluation
To highlight the advantages of VPPFL, we conducted a detailed comparison with some existing schemes, as shown in Table 2. Moreover, we also implemented the PPVerifier scheme to facilitate a more detailed comparison with our scheme.
The computational overhead of the scheme can be described as follows. For simplicity, we only consider the steps with high computational complexity, i.e., modular multiplications and modular exponentiations. Let N denote the number of users and t the decryption threshold.
The computation cost of the CS mainly comes from aggregating the gradients uploaded by users, decrypting the aggregated gradient, and computing the hash value of the aggregated gradient for verification. Aggregating the local gradients uploaded by N users requires N modular multiplications.
The users' computation cost mainly comes from encrypting the local gradients and computing the hash value of the gradient. Encrypting a local gradient involves two modular exponentiations and one modular multiplication, and computing the one-way function value of the gradient requires one further modular exponentiation.
Experiment Evaluation
In this section, we conduct all-round experiments on VPPFL to evaluate its performance.
A. Experimental Environment
We implemented VPPFL using MATLAB 2016a. The algorithm was evaluated on the MNIST dataset (https://yann.lecun.com/exdb/mnist/). The dataset includes 70,000 grayscale images of handwritten digits, each of size $28 \times 28$ pixels.
B. Classification Accuracy
We implemented the PPVerifier protocol [24] as well as the unencrypted original algorithm, denoted Baseline, to analyze the accuracy of our scheme in neural network training, keeping all other conditions the same. In practical use cases, gradient vectors typically exist in floating-point format, so our approach requires preprocessing them into integers before encryption; this is why VPPFL and PPVerifier slightly lag behind Baseline in terms of accuracy.
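A minimal fixed-point quantization sketch of this preprocessing step in Python (our own illustration, not the paper's MATLAB code; SCALE and OFFSET are assumed parameters):

SCALE = 10**6          # assumed precision factor
OFFSET = 10            # assumes every gradient value lies in (-OFFSET, OFFSET)

def quantize(g):
    # shift by OFFSET so the plaintext is a non-negative integer
    return round((g + OFFSET) * SCALE)

def dequantize(total, num_users):
    # undo the scaling and the per-user offsets after aggregation
    return total / SCALE - OFFSET * num_users

grads = [0.1234567, -0.0023419, 0.7771111]      # toy per-user gradient values
total = sum(quantize(g) for g in grads)         # what homomorphic aggregation recovers
print(round(dequantize(total, len(grads)), 6))  # ~ the sum of the original floats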
As shown in Figure 5, we set the total gradient count to 100,000 and trained for 100 rounds. After 100 training rounds, both VPPFL and Baseline achieved nearly the same accuracy of about 98%. This indicates that VPPFL maintains the model's accuracy while protecting the gradients, because only a small portion of gradient information is lost.
C. Computation Cost
In this section, we examine the total computation cost, the impact of the number of gradients on the computation cost, the computation cost on CS and on users, and the computation cost when users drop out during the training process.
1) Total Computation Cost
As shown in Figure 6, we set the total number of gradients involved in training to 100,000. Although VPPFL incurs some additional time cost compared to PPVerifier, this cost is acceptable: our scheme supports users joining and dropping out during the training process, which PPVerifier does not, and is therefore better aligned with practical applications. The extra cost is thus worthwhile.
2) Computation Cost Between CS and Users
To facilitate observation, we set the number of users participating in training to 10. As shown in Figure 7, the computation cost on the user side increases linearly with the number of gradients. When each user has 10,000 gradients, the users' computation cost surpasses that of CS. This places higher demands on the computing power of the participating users, so each user should appropriately manage the amount of data used in each training round to avoid an excessive burden. The server's computation cost does not increase significantly because CS does not participate in the users' local training, which also ensures user privacy and security.
3) The Computation Cost in Different Stages
We analyze in detail the computation cost of the different stages in one round of training, with the number of users set to 100. As shown in Table 4, the users' computation cost consists primarily of the encryption and verification stages, and this portion occupies a very low proportion of the total computation cost. The most expensive part is the decryption process, which is handled by CS, making the scheme more accessible to users with weaker computing capabilities.
4) Computation Cost on CS When Users Dropout
As shown in Figure 8, we set the number of users participating in training to 100. As the number of dropped-out users increases, the computation cost on CS does not increase. The reason is that the cloud server's computation cost is mainly concentrated in the decryption process, which involves sending decryption requests to t users. Even if some users drop out, CS only needs to send decryption requests to the remaining online users, so no extra computation is added for CS.
5) Verification Time
Schemes [23] and [25] both use a homomorphic hash function to provide verifiability for users, but its overhead is large. We denote this verification method as "LHH". To highlight the advantage of our scheme, we compared VPPFL with LHH and PPVerifier; the results are shown in Figure 9. The verification overhead of VPPFL is almost negligible compared to the homomorphic hash function used in LHH.
Conclusion
In this paper, we proposed VPPFL, a privacy-preserving FL scheme designed for a semi-malicious server. VPPFL supports user dropout and provides verifiability for each user during the training process while preserving user privacy. Furthermore, we proved the security of the scheme theoretically and validated its practical performance through simulation experiments on real data. The scheme proposed in this paper solves some problems in FL to a certain extent, but there is still room for improvement. Our scheme relies on a trusted third party to distribute the keys and assumes that this entity is strong enough not to be breached. Once the entity is compromised, the data of all participants is no longer secure, yet in a real-world deployment of FL it is difficult to find such an entity. How to achieve verifiable privacy-preserving FL without the participation of a trusted third party is an important direction for future research.
Data availability
The datasets are available online. The URL is as follows: MNIST database: https://yann.lecun.com/exdb/mnist.