Introduction
Recently, machine learning has played a key role in numerous fields. For example, it has achieved significant results in medical prediction [1], [2], autonomous driving [3], [4], and image recognition [5]. In machine learning, data privacy and security are key concerns. For example, medical data often contains a large amount of sensitive personal information; if an unauthorized third party accesses such data, it can lead to a serious privacy breach that harms the interests of patients [6], [7]. Additionally, since traditional machine learning typically operates on unencrypted data, it also poses a serious risk of privacy leakage.
To address these issues, Google proposed federated learning in 2016 [8]. In federated learning (FL), users only need to share their local gradients instead of their original valuable data. This approach makes it possible to utilize sensitive information while mitigating the privacy risks that arise from collecting data from different users.
However, current research indicates that even if users only upload gradient information, their privacy can still be compromised [9], [10]. Attackers might exploit vulnerabilities in the cloud server to reveal specific attributes of training samples, or fabricate aggregation results to induce users to leak more valuable information. In extreme cases, attackers can even use the leaked data to reconstruct users' original data. On the other hand, motivated by illicit profit, a malicious cloud server might return incorrect aggregation results to users. For example, to reduce its computation costs, the cloud server may use a simplified and less accurate model to process the uploaded gradients, or even directly modify the aggregation results [11], [12].
To address these issues, we propose the Verifiable Privacy-Preserving Federated Learning (VPPFL). Our contributions are outlined as follows:
We design a verifiable federated learning mechanism to deal with a semi-malicious cloud server. The scheme enables users to independently verify the aggregation results without the intervention of a trusted third party, and effectively prevents collusion attacks initiated by the cloud server in collaboration with a small portion of users.
Our scheme employs multi-key threshold homomorphic encryption to protect users' private data, and allows a small portion of users to drop out during training without adding an additional burden to the server. Even if several users fail to upload their data, the training process is not interrupted.
We provide security analysis and simulation experiments to validate the security and efficiency of VPPFL.
The rest of this paper is organized as follows. Section II reviews the relevant research and the basic concepts and techniques involved. Section III introduces the system model and security requirements. Section IV elaborates the technical details of VPPFL. We analyze the security and performance of VPPFL in Section V. Section VI presents the experimental results. Finally, Section VII concludes the work.
Related Work and Relevant Concepts and Technologies
A. Related Work
When constructing privacy-preserving FL, two cryptographic tools are commonly used, i.e., differential privacy [13] and homomorphic encryption [14].
Differential privacy is a privacy protection technique with provable security, which protects data by adding random noise. In 2006, Dwork [15] of Microsoft first proposed differential privacy, and later in [16] a new mechanism, Propose-Test-Release (PTR), was used to achieve high-quality differential privacy results. Geyer et al. [17] used differential privacy for the first time in FL to protect participants' data by adding Gaussian noise on the server side. Wei et al. [18] employed a local differential privacy strategy during the local model updates of deep neural networks, protecting the local gradients by adding noise before uploading the local model. However, this method does not take user dropout into consideration.
Homomorphic encryption is now widely used in the construction of privacy-preserving FL. Rivest and Dertouzos [19] introduced homomorphic encryption into asynchronous stochastic gradient descent training. However, all users utilize the same private key, leading to a potential risk: if the server colludes with some users, the data privacy of the other users cannot be guaranteed. Wang et al. [20] adopted homomorphic encryption to protect users' local data and implemented access control to verify the credibility of user identities, effectively defending against threats from internal attacks. Ma et al. [21] used multi-key homomorphic encryption to encrypt the local gradients before they are uploaded. Decryption requires the collaborative participation of all users to prevent unauthorized access to participants' data. As a result, if users drop out in the middle of the training process, decryption cannot be completed, which is impractical for real-world FL.
Recently, the research community has proposed various schemes to address the data integrity challenge in FL. Xu et al. [22] introduced an innovative verifiable privacy-preserving FL architecture; by employing homomorphic hash functions and zero-knowledge proofs, they construct a verifiable and secure aggregation mechanism. Guo et al. [23] modified this framework to reduce the communication cost, while also pointing out that if a malicious cloud server colluded with users, the scheme in [22] would still face certain security vulnerabilities. Lin and Zhang [24] used differential privacy and a one-way function to allow users to verify aggregation results returned by a lazy server, but the approach does not support user dropout during training. Ren et al. [25] adopted a linear homomorphic hash function and digital signatures to achieve traceable verification of aggregation results and identification of erroneous rounds. However, this approach inevitably increases the communication cost.
B. Concepts and Technologies
We now introduce some relevant concepts and technologies. Some of the symbols used in this paper are listed in Table 1.
1) Federated Learning
Different from traditional machine learning, FL has made significant strides in protecting users' privacy. In FL, users do not upload personal data; they only share their local gradients, significantly reducing the risk of personal information leakage. As shown in Figure 1, users upload these gradients to the cloud server, which aggregates them and feeds the results back to the users. In this way, users and the server collaborate to train a globally optimized model, ensuring the security of personal data while achieving efficient model training.
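To make this workflow concrete, the following minimal Python sketch (our own illustration, not part of VPPFL, and without any encryption) shows users computing local gradients on a toy linear model and a server averaging them into a global update; all names and values are hypothetical.

import numpy as np

def local_gradient(w, X, y):
    # squared-error gradient for a toy linear model y ~ X @ w
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
true_w = np.array([1.0, -1.0, 0.5])               # only used to simulate user data
users = []
for _ in range(4):                                # four users with private local data
    X = rng.normal(size=(8, 3))
    users.append((X, X @ true_w))

w = np.zeros(3)                                   # current global model
local_grads = [local_gradient(w, X, y) for X, y in users]   # computed locally; raw data stays local
global_grad = np.mean(local_grads, axis=0)        # the server only sees and averages gradients
w = w - 0.1 * global_grad                         # updated global model returned to the users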
2) Neural Network
We now introduce a classic deep neural network, the fully connected neural network (FCNN). Figure 2 shows the architecture of an FCNN. The neurons in each layer are densely connected to the neurons in the preceding and following layers by weights.
The training loss of an FCNN over a dataset $D$ with parameters $\omega$ can be represented by \begin{equation*} L_{f}(D,\omega ) = \frac {1}{|D|} \sum _{(x_{i},y_{i}) \in D} L_{f}((x_{i},y_{i}),\omega ), \tag {1}\end{equation*} where $L_{f}((x_{i},y_{i}),\omega )$ denotes the loss on a single sample.
The objective of neural network training is to find an optimal set of parameters that minimizes the loss function in Eq. (1). Stochastic gradient descent (SGD), summarized in Algorithm 1, is commonly used for this purpose.
Algorithm 1 SGD
Input: dataset $D$; learning rate $\theta$; loss function $L_{f}$
Output: the optimal model parameters $\omega$
Randomly select an initial $\omega _{0}$;
At the j-th iteration, randomly select a small batch of data $D_{j} \subseteq D$;
for each $(x_{i}, y_{i}) \in D_{j}$ do
Calculate the gradient $\nabla L_{f}((x_{i}, y_{i}), \omega _{j})$;
end
Calculate the average gradient $g_{j} = \frac {1}{|D_{j}|} \sum _{(x_{i},y_{i}) \in D_{j}} \nabla L_{f}((x_{i}, y_{i}), \omega _{j})$;
Update weight $\omega _{j+1} = \omega _{j} - \theta g_{j}$;
until convergence is satisfied;
return $\omega$.
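For concreteness, the following short Python sketch follows the steps of Algorithm 1 on a toy linear-regression loss; the dataset, batch size, learning rate, and iteration budget are our own illustrative choices rather than anything prescribed by the paper.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=200)      # toy dataset D

theta = 0.05                                      # learning rate
w = rng.normal(size=3)                            # randomly selected initial weights
for j in range(500):                              # fixed budget instead of a convergence test
    idx = rng.choice(len(X), size=16, replace=False)   # small batch D_j
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(yb)     # average gradient over the batch
    w = w - theta * grad                          # weight update
print(np.round(w, 2))                             # approaches [1.5, -2.0, 0.5]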
3) Threshold Paillier Cryptosystem
In VPPFL, we use the threshold Paillier cryptosystem [26] to construct a secure framework, since it has two important features: 1) Threshold property: no single user can decrypt a ciphertext alone; at least t users need to work together to decrypt it. 2) Homomorphic additivity: multiplying ciphertexts corresponds to adding the underlying plaintexts, enabling operations on plaintexts to be carried out through computations on ciphertexts. These two features provide sufficient functionality and privacy protection for our scheme.
In the threshold Paillier cryptosystem, the public key $pk$ is known to all participants, while the private key $SK$ is split into shares that are distributed among the users, so that any t of them can jointly decrypt a ciphertext.
For a plaintext M, encrypting it using the public key pk will yield the ciphertext \begin{equation*} c = E_{pk}(M) = G^{M} x^{K}\;mod\; K^{2}, \tag {2}\end{equation*} where $x \in \mathbb{Z}_{K}^{*}$ is a random number.
This cryptosystem has homomorphic additivity, which can be described as $E_{pk}(M_{1}) \times E_{pk}(M_{2}) \;mod\;K^{2} = E_{pk}(M_{1} + M_{2})$.
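As a quick illustration of Eq. (2) and this property, the following toy Python sketch (our own, with tiny insecure primes) uses the standard single-key Paillier decryption rather than the threshold variant of VPPFL, which is sketched in Section IV; it requires Python 3.8+ for the modular inverse via pow.

import random
from math import gcd

p, q = 7, 23                          # toy primes; real systems use >1024-bit primes
K, K2 = p * q, (p * q) ** 2
G = K + 1
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1), the decryption trapdoor

def enc(M):
    x = random.choice([v for v in range(2, K) if gcd(v, K) == 1])
    return (pow(G, M, K2) * pow(x, K, K2)) % K2          # c = G^M x^K mod K^2

def dec(c):
    L = lambda u: (u - 1) // K
    return (L(pow(c, lam, K2)) * pow(lam, -1, K)) % K    # standard Paillier decryption

c1, c2 = enc(12), enc(30)
print(dec((c1 * c2) % K2))            # 42: multiplying ciphertexts adds the plaintexts

In the threshold variant used by VPPFL, no single party holds the trapdoor; decryption instead combines partial decryption shares, as described in Section IV.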
4) One-Way Function
The one-way function is constructed based on the hardness of the discrete logarithm problem, which ensures that the semi-malicious cloud server cannot infer the users' private information from the function values of the gradients. The specific construction is as follows:
Assume a is a generator of order k, and b is a large prime number. Construct a one-way function h: \begin{equation*} h(M) = a^{M} \;mod\;b,\; M \in \mathbb {Z}. \tag {3}\end{equation*}
It satisfies homomorphic addition, i.e., $h(M_{1}) \times h(M_{2}) \;mod\;b = h(M_{1} + M_{2})$.
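A quick numeric check of Eq. (3) and this property in Python; the values of a and b are toy choices of our own, not parameters from the paper.

b = 104729                        # a prime modulus (the 10,000th prime; toy size)
a = 3                             # generator used by h
h = lambda M: pow(a, M, b)        # h(M) = a^M mod b

M1, M2 = 15, 28
print((h(M1) * h(M2)) % b == h(M1 + M2))   # True: h(M1) * h(M2) mod b = h(M1 + M2)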
System Model and Security Requirements
A. System Architecture
As shown in Figure 3, our system consists of three parts: semi-malicious cloud server (CS), trusted authority (TA) and semi-honest users (Users).
Semi-malicious cloud server (CS): The main task of CS is to aggregate the gradients uploaded by users, broadcast the function values of gradients to all users, decrypt the aggregation results, and send them to users. We require CS to only obtain ciphertexts and the final aggregation results, without knowing any other information.
Trusted authority (TA): TA is an authoritative and trustworthy entity (e.g., a government agency). It does not collude with any party. Its main task is to initialize the model parameters, generate a one-way function, and generate key pairs for all users. It then sends the one-way function and key pairs to users through a secure channel and broadcasts the public key. After that, unless there is a dispute, it goes offline.
Semi-honest users (Users): Users are the owners of the data, who participate in the training process and ultimately obtain the global model. During each round, every user sends his encrypted local gradient and the one-way function value of his gradient to CS, and cooperates with CS to decrypt the aggregation results. Finally, all users verify the aggregation results returned by CS.
B. Threat Model
Semi-malicious CS attack: CS may deceive users by omitting the gradients of one or more users from the aggregation in order to save costs.
Semi-honest user attack: A semi-honest user may try to use the information he holds to infer the private data of other users.
Collusion attack: A small number of users may collude with CS and attempt to infer the private information of other users by sharing information such as model parameters.
External malicious attack: There is an external malicious adversary, denoted as $\mathbb{A}$, who attempts to obtain users' private information from the data exchanged during training.
C. Security Objectives
We aim to propose an efficient, secure, and verifiable privacy-preserving FL scheme. Specifically, the following objectives should be achieved:
Privacy of users' data: No entity other than the user himself should be able to access the user's sensitive information, including the external adversary $\mathbb{A}$ and CS.
Verifiability of aggregation results: Every user should be able to independently verify the aggregation results. If CS returns incorrect aggregation results, users should have the right to reject them and request CS to re-aggregate the results.
Support for user dropout: The scheme should allow a small portion of users to join or drop out of the training process without interrupting the overall training of the model.
VPPFL
In this section, we provide the detailed design of VPPFL. Our scheme consists of four main stages: 1) Initialization; 2) Encryption; 3) Decryption and 4) Verification. Figure 4 illustrates the process flow of VPPFL.
Initialization
TA takes on the role of initializing the system parameters and generating key pairs. The specific process for generating the public and private keys is as follows, as shown in Algorithm 2.
Parameter Generation: The parameters that need to be initialized include the global weight $\omega$, the learning rate $\theta$, the training epoch, the security parameter $\kappa$, and the one-way function h.
Key generation and distribution: First, TA randomly generates two large prime numbers $p=2p'+1$, $q=2q'+1$, where $p', q' \lt \kappa$. Second, TA generates the RSA modulus $K=pq$, ensuring that $\gcd (K, \psi (K)) = 1$. Then, TA randomly selects $\beta \in \mathbb {Z}_{K}^{*}$ and calculates $m=p'q'$ and $\Delta = N!$. Next, TA disguises m as $\alpha = m \beta \;mod\;K$.
TA sets the public key $pk = (K, G, \alpha)$ and the private key $SK = \beta m$, where $G=K+1$. TA splits the private key SK as follows: it selects t random numbers $a_{1}, a_{2}, \ldots , a_{t} \in \{0, 1, \ldots , Km-1\}$, then generates the polynomial $f(x) = \beta m + a_{1} x + \ldots + a_{t} x^{t-1}\; mod \; Km$. Finally, TA sends $f(n)$ to each participant $P_{n}\,(1 \leq n \leq N)$ through a secure channel.
Encryption
Each user $P_{n}\,(1 \leq n \leq N)$ encrypts his own gradient: for the gradient vector $g_{n} = [g_{n1},g_{n2}, \ldots , g_{nm}]$, $P_{n}$ chooses a random number $x_{n} \in Z_{K}^{*}$, uses the public key pk to calculate the ciphertext $Enc_{pk}(g_{n}) = G^{g_{n}} x_{n}^{K} \;mod\;K^{2}$, and computes the one-way function value of the gradient $h(sum(g_{n})) = a^{sum(g_{n})} \;mod\;b$.
Each user $P_{n}$ sends the ciphertext $Enc_{pk}(g_{n})$ and the one-way function value of his gradient $h(sum(g_{n}))$ to CS. CS broadcasts the received one-way function values of gradients $\{h(sum(g_{1})), h(sum(g_{2})), \ldots , h(sum(g_{N}))\}$, and aggregates the ciphertexts to obtain the encrypted gradient ciphertext \begin{equation*} c = \prod _{n=1}^{N} Enc_{pk}(g_{n}). \tag {4}\end{equation*}
Decryption
For the ciphertext c, CS randomly selects $t\,(1\leq t \leq N)$ users to send decryption requests. Suppose the selected participants form a set S. Each selected participant computes the decryption share $s_{n} = c^{2\Delta sk_{n}} \;mod\;K^{2}$ and sends it to CS. CS can then compute the aggregation result \begin{equation*} g_{*} = L\left ({{\prod _{n \in S} s_{n}^{2\mu _{n}}\;mod\;K^{2}}}\right ) \times \frac {1}{4\Delta ^{2} \alpha } \;mod\;K, \tag {5}\end{equation*} where $\mu _{n} = \Delta \times \lambda _{0,n}^{S} \in \mathbb {Z}$, $\lambda _{0,n}^{S} = \prod _{n' \in S\backslash \{n\}} \frac {-n'}{n-n'}$, and $L(u) = \frac {u - 1}{K}$.
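To make Eqs. (2)-(5) concrete, the following self-contained Python sketch runs the whole pipeline (key splitting, encryption, ciphertext aggregation, and threshold decryption) with tiny, insecure parameters of our own choosing; it is an illustrative sketch, not the authors' MATLAB implementation, and needs Python 3.8+ for negative modular exponents.

from fractions import Fraction
from math import factorial, gcd
import random

# ---- TA setup (Initialization), following the paper's notation ----
p, q = 7, 23                    # safe primes p = 2*3+1, q = 2*11+1 (toy size)
p_, q_ = 3, 11
K = p * q                       # RSA modulus K = pq
K2 = K * K
G = K + 1
m = p_ * q_                     # m = p'q'
N, t = 3, 2                     # 3 users, threshold t = 2
Delta = factorial(N)            # Delta = N!
beta = 5                        # a random element of Z_K^*
assert gcd(beta, K) == 1 and gcd(K, 4 * m) == 1
alpha = (m * beta) % K          # alpha disguises m
SK = beta * m                   # private key SK = beta * m

# Shamir split of SK over Z_{Km}; degree t-1 so that any t shares determine f(0)
a1 = random.randrange(K * m)
f = lambda x: (SK + a1 * x) % (K * m)
shares = {n: f(n) for n in range(1, N + 1)}     # sk_n = f(n)

# ---- Users encrypt (Eq. (2)) and CS aggregates (Eq. (4)) ----
def enc(M):
    x = random.randrange(2, K)
    while gcd(x, K) != 1:
        x = random.randrange(2, K)
    return (pow(G, M, K2) * pow(x, K, K2)) % K2

gradients = [3, 5, 9]           # toy integer "gradients", one per user
c = 1
for g in gradients:
    c = (c * enc(g)) % K2       # homomorphic aggregation of the ciphertexts

# ---- Threshold decryption (Eq. (5)) ----
S = [1, 3]                      # any t users answer the decryption request
s = {n: pow(c, 2 * Delta * shares[n], K2) for n in S}   # decryption shares

def mu(n):                      # mu_n = Delta * lambda_{0,n}^S, always an integer
    lam = Fraction(1)
    for n2 in S:
        if n2 != n:
            lam *= Fraction(-n2, n - n2)
    val = Delta * lam
    assert val.denominator == 1
    return int(val)

prod = 1
for n in S:                     # negative exponents need Python 3.8+
    prod = (prod * pow(s[n], 2 * mu(n), K2)) % K2

L = lambda u: (u - 1) // K
g_star = (L(prod) * pow(4 * Delta * Delta * alpha, -1, K)) % K
print(g_star, sum(gradients))   # prints: 17 17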
Verification
Algorithm 2 KGA(TA)
Output: public key $pk$ and a private key share $f(n)$ for each user
Randomly generate two prime numbers $p = 2p'+1$, $q = 2q'+1$;
Calculate $\Delta = N!$;
Calculate RSA modulus $K = pq$ and set $G = K+1$;
Calculate decryption key $m = p'q'$;
Randomly choose a $\beta \in \mathbb{Z}_{K}^{*}$;
Calculate $\alpha = m\beta \;mod\;K$;
Set private key $SK = \beta m$ and public key $pk = (K, G, \alpha)$;
Split the private key SK: select t random numbers $a_{1}, a_{2}, \ldots , a_{t} \in \{0, 1, \ldots , Km-1\}$ and generate the polynomial $f(x) = \beta m + a_{1} x + \ldots + a_{t} x^{t-1}\;mod\;Km$;
Each user $P_{n}\,(1 \leq n \leq N)$ receives the share $sk_{n} = f(n)$ through a secure channel.
Algorithm 3 provides a detailed description of the VPPFL process.
Algorithm 3 VPPFL
Round 0 (Initialization)
TA:
Generate a set of private key shares $\{sk_{n} = f(n)\}_{n=1}^{N}$ using Algorithm 2;
Send the private key share $sk_{n}$ to each user $P_{n}$ through a secure channel;
Select a large prime number b and a generator a of order k to create a one-way function $h(M) = a^{M}\;mod\;b$;
Send the function to all users and then go offline.
Round 1 (Encryption)
Users:
Each user $P_{n}$ chooses a random number $x_{n} \in Z_{K}^{*}$;
Calculate the encryption of gradient $Enc_{pk}(g_{n}) = G^{g_{n}} x_{n}^{K}\;mod\;K^{2}$ and the one-way function value $h(sum(g_{n}))$, and send both to CS.
CS:
Receive $Enc_{pk}(g_{n})$ and $h(sum(g_{n}))$ from the online users;
Calculate the aggregated encrypted gradient $c = \prod _{n=1}^{N} Enc_{pk}(g_{n})$;
Broadcast the received $\{h(sum(g_{1})), \ldots , h(sum(g_{N}))\}$ to all users.
Round 2 (Decryption)
CS randomly selects t users and sends the decryption requests to them;
After receiving the decryption request, user $P_{n}$ computes the decryption share $s_{n} = c^{2\Delta sk_{n}}\;mod\;K^{2}$ and sends it to CS;
CS:
Receive the decryption shares $\{s_{n}\}_{n \in S}$;
Calculate $\lambda _{0,n}^{S} = \prod _{n' \in S\backslash \{n\}} \frac {-n'}{n-n'}$;
Calculate $\mu _{n} = \Delta \times \lambda _{0,n}^{S}$;
Decrypt the aggregation gradient $g_{*} = L\left ({\prod _{n \in S} s_{n}^{2\mu _{n}}\;mod\;K^{2}}\right ) \times \frac {1}{4\Delta ^{2} \alpha }\;mod\;K$;
Send $g_{*}$ to all users.
Round 3 (Verification)
Users:
Each user receives $g_{*}$ and the broadcast values $\{h(sum(g_{n}))\}_{n=1}^{N}$;
Calculate $a^{sum(g_{*})}\;mod\;b$;
Calculate $\prod _{n=1}^{N} h(sum(g_{n}))\;mod\;b$;
if $a^{sum(g_{*})}\;mod\;b = \prod _{n=1}^{N} h(sum(g_{n}))\;mod\;b$ then accept $g_{*}$ and update the local model;
else
Users request CS to recompute the aggregation results.
Next, we provide a proof of correctness for our scheme.
Theorem 1:
If CS honestly performs the aggregation operations in the VPPFL, the aggregation results will pass verification.
Proof:
The encrypted gradients uploaded by the users to CS are $Enc_{pk}(g_{n}) = G^{g_{n}} x_{n}^{K}\;mod\;K^{2}$, $n = 1, \ldots , N$.
We have \begin{equation*} c = \prod _{n=1}^{N} Enc_{pk}(g_{n}) = G^{\sum _{n=1}^{N} g_{n}} \left ({\prod _{n=1}^{N} x_{n}}\right )^{K}\;mod\;K^{2}.\end{equation*}
Therefore, c is a valid ciphertext of $\sum _{n=1}^{N} g_{n}$ under the public key pk.
Given that the threshold decryption in Eq. (5) correctly recovers the plaintext of c, participants will receive the aggregation gradient $g_{*} = \sum _{n=1}^{N} g_{n}$.
Based on the homomorphic property of the one-way function, if the following equation holds, then the aggregation gradients will pass verification: \begin{equation*} a^{sum(g_{*})}\;mod\;b = \prod _{n=1}^{N} h(sum(g_{n}))\;mod\;b.\end{equation*} Since $sum(g_{*}) = \sum _{n=1}^{N} sum(g_{n})$ and $h(M_{1}) \times h(M_{2})\;mod\;b = h(M_{1} + M_{2})$, this equation indeed holds.
Therefore, the aggregation results will pass verification.
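The verification equation in this proof can be checked numerically. The short Python sketch below, with illustrative parameters and gradient sums of our own choosing, shows that an honest aggregation passes the check while a lazy CS that drops one user's gradient is detected.

b = 104729                        # a prime modulus (toy size)
a = 5
h = lambda M: pow(a, M, b)        # the one-way function h(M) = a^M mod b

user_sums = [14, 27, 9]           # sum(g_n) for users P_1, P_2, P_3
rhs = 1
for v in user_sums:
    rhs = (rhs * h(v)) % b        # product of the broadcast values h(sum(g_n))

honest = sum(user_sums)           # sum(g_*) when CS aggregates all gradients
print(h(honest) == rhs)           # True: the honest result passes verification

lazy = sum(user_sums[:-1])        # CS silently omits the gradient of P_3
print(h(lazy) == rhs)             # False: users detect the incorrect result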
Security Analysis and Performance Evaluation
A. Security Analysis
In this section, we conduct theoretical analysis and security proof of the VPPFL, including data privacy and the verifiability of the aggregation gradients.
We first introduce some notations. Consider a server CS interacting with a set of N users, and let the security parameter be $\kappa$.
Given a subset of users that may collude with CS, their combined knowledge consists of all messages they send and receive during the execution of the protocol.
Definition 1:
If any adversary $\mathbb{A}$ wins the following game with at most a negligible advantage, the scheme is said to preserve data privacy under chosen-plaintext attacks.
Initialization stage: Enter the security parameter $\kappa$. The challenger runs the key generation algorithm and sends the public key pk to the adversary $\mathbb{A}$.
Challenge Phase: The adversary chooses two messages $M_{0}$ and $M_{1}$ of equal length and sends them to the challenger. The challenger randomly picks a bit $\sigma \in \{0,1\}$, encrypts $M_{\sigma}$, and returns the challenge ciphertext to the adversary.
Output: The adversary $\mathbb{A}$ outputs a guess $\sigma '$ of $\sigma$.
The advantage of the adversary $\mathbb{A}$ is defined as $Adv_{\mathbb{A}} = \left | \Pr [\sigma ' = \sigma ] - \frac {1}{2} \right |$.
Data privacy
Theorem 2:
VPPFL can resist collusion attacks between the CS and fewer than t users. That is, any coalition consisting of CS and fewer than t users cannot obtain the private gradient of any honest user from the information it holds.
Proof:
We assume the set of participants colluding with CS contains fewer than t users. In this case the coalition holds at most $t-1$ private key shares $f(n)$. Since reconstructing the private key $SK = \beta m$ from the polynomial $f(x)$ requires at least t shares, the coalition cannot perform threshold decryption and therefore cannot decrypt the ciphertexts $Enc_{pk}(g_{n})$ of the honest users. Hence the colluding parties learn nothing about the honest users' gradients beyond the final aggregation result, and VPPFL resists such collusion attacks.
Theorem 3:
In VPPFL, no party can obtain the private information of other users.
Proof:
The data that CS can obtain are the encrypted gradients $Enc_{pk}(g_{n})$, the one-way function values of gradients $h(sum(g_{n}))$, the decryption shares $s_{n}$, and the final aggregation result $g_{*}$.
From Theorem 2, we know that CS, even when colluding with fewer than t users, cannot extract other users' private information from the encrypted gradients. Therefore, a single user also cannot derive any useful information from the encrypted gradients. Both CS and all users can obtain the one-way function values of gradients $h(sum(g_{n}))$, but since h is built on the hardness of the discrete logarithm problem, they cannot recover $sum(g_{n})$ from these values. Hence no party can obtain the private information of other users.
Theorem 4:
Under the DDH assumption, VPPFL is IND-CPA secure. That is, the proposed scheme meets the security definition of data privacy under chosen-plaintext attacks.
Proof:
Suppose there exists an external adversary $\mathbb{A}$ that wins the game in Definition 1 with a non-negligible advantage. Then $\mathbb{A}$ could be used to distinguish ciphertexts produced by the threshold Paillier cryptosystem, contradicting the assumed hardness of the underlying problem. Therefore, the advantage of any external adversary is negligible.
At the same time, by Theorem 2 our protocol remains secure even if a small number of users collude with CS, and by Theorem 3 no party involved in the training can infer other users' private information from the inputs and intermediate results it obtains. Therefore, the probability that the external adversary $\mathbb{A}$ or any internal party recovers an individual user's private data is negligible.
From the above arguments, neither an external adversary nor an internal adversary can obtain the private information of a single user. Therefore, our protocol is IND-CPA secure.
Verifiability of the aggregation gradient
Theorem 5:
If CS returns an incorrect aggregation gradient $g_{*}' \neq \sum _{n=1}^{N} g_{n}$, it cannot pass the users' verification.
Proof:
From Theorem 1, if CS tries to reduce the amount of computation for aggregation, for example by aggregating only a subset of the uploaded ciphertexts, the aggregation gradient ciphertext becomes $c' = \prod _{n \in N'} Enc_{pk}(g_{n})$ for some proper subset $N'$ of the users, and the decrypted result $g_{*}'$ satisfies $sum(g_{*}') \neq \sum _{n=1}^{N} sum(g_{n})$. In this case $a^{sum(g_{*}')}\;mod\;b \neq \prod _{n=1}^{N} h(sum(g_{n}))\;mod\;b$ except with negligible probability, so the verification equation fails; users reject the result and request CS to re-aggregate.
B. Performance Evaluation
To highlight the advantages of VPPFL, we conducted a detailed comparison with some existing schemes, as shown in Table 2. Moreover, we also implemented the PPVerifier scheme to facilitate a more detailed comparison with our scheme.
The computational overhead of the scheme can be described as follows. For simplicity, we only consider the steps with high computational complexity, i.e., modular multiplications and modular exponentiations. Let N denote the number of users and t the decryption threshold.
The computation cost of the CS mainly comes from aggregating the gradients uploaded by users, decrypting the aggregated gradient, and computing the hash value of the aggregated gradient for verification. Aggregating the local gradients uploaded by N users requires N modular multiplications.
The users' computation cost mainly comes from encrypting the local gradients and computing the hash value of the gradient. Encrypting a local gradient involves two modular exponentiations and one modular multiplication, and computing the one-way function value of the gradient requires one further modular exponentiation.
Experiment Evaluation
In this section, we conduct all-round experiments on VPPFL to evaluate its performance.
A. Experimental Environment
We implemented VPPFL using MATLAB 2016a. The algorithm was evaluated on the MNIST dataset (https://yann.lecun.com/exdb/mnist/). The dataset includes 70,000 grayscale images of handwritten digits, each of size $28 \times 28$ pixels.
B. Classification Accuracy
We implemented the PPVerifier protocol [24] as well as the unencrypted original algorithm, denoted Baseline, to analyze the accuracy of our scheme in neural network training, keeping all other conditions the same. In practical use cases, gradient vectors typically exist in floating-point format, so our approach requires preprocessing them into integers before encryption; this is why VPPFL and PPVerifier slightly lag behind Baseline in terms of accuracy.
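A minimal fixed-point quantization sketch of this preprocessing step in Python (our own illustration, not the paper's MATLAB code; SCALE and OFFSET are assumed parameters):

SCALE = 10**6          # assumed precision factor
OFFSET = 10            # assumes every gradient value lies in (-OFFSET, OFFSET)

def quantize(g):
    # shift by OFFSET so the plaintext is a non-negative integer
    return round((g + OFFSET) * SCALE)

def dequantize(total, num_users):
    # undo the scaling and the per-user offsets after aggregation
    return total / SCALE - OFFSET * num_users

grads = [0.1234567, -0.0023419, 0.7771111]      # toy per-user gradient values
total = sum(quantize(g) for g in grads)         # what homomorphic aggregation recovers
print(round(dequantize(total, len(grads)), 6))  # ~ the sum of the original floats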
As shown in Figure 5, we set the total gradient count to 100,000 and trained for 100 rounds. After 100 training rounds, both VPPFL and Baseline achieved nearly the same accuracy of about 98%. This indicates that VPPFL maintains the model's accuracy while protecting the gradients, because only a small portion of gradient information is lost.
C. Computation Cost
In this section, we examine the total computation cost, the impact of the number of gradients on the computation cost, the computation cost on CS and on users, and the computation cost when users drop out during the training process.
1) Total Computation Cost
As shown in Figure 6, we set the total number of gradients involved in training to 100,000. Although VPPFL incurs some additional time cost compared to PPVerifier, this cost is acceptable: our scheme supports users joining and dropping out during the training process, which PPVerifier does not, and is therefore better aligned with practical applications. The extra cost is thus worthwhile.
2) Computation Cost Between CS and Users
To facilitate observation, we set the number of users participating in training to 10. As shown in Figure 7, the computation cost on the user side increases linearly with the number of gradients. When each user has 10,000 gradients, the users' computation cost surpasses that of CS. This places higher demands on the computing power of the participating users, so each user should appropriately manage the amount of data used in each training round to avoid an excessive burden. The server's computation cost does not increase significantly because CS does not participate in the users' local training, which also ensures user privacy and security.
3) The Computation Cost in Different Stages
We analyze in detail the computation cost of the different stages in one round of training, with the number of users set to 100. As shown in Table 4, the users' computation cost consists primarily of the encryption and verification stages, and this portion occupies a very low proportion of the total computation cost. The most expensive part is the decryption process, which is handled by CS, making the scheme more accessible to users with weaker computing capabilities.
4) Computation Cost on CS When Users Dropout
As shown in Figure 8, we set the number of users participating in training to 100. As the number of dropped-out users increases, the computation cost on CS does not increase. The reason is that the cloud server's computation cost is mainly concentrated in the decryption process, which involves sending decryption requests to t users. Even if some users drop out, CS only needs to send decryption requests to the remaining online users, so no extra computation is added for CS.
5) Verification Time
Schemes [23] and [25] both use a homomorphic hash function to provide verifiability for users, but its overhead is large. We denote this verification method as "LHH". To highlight the advantage of our scheme, we compared VPPFL with LHH and PPVerifier; the results are shown in Figure 9. The verification overhead of VPPFL is almost negligible compared to the homomorphic hash function used in LHH.
Conclusion
In this paper, we proposed VPPFL, a privacy-preserving FL scheme designed for a semi-malicious server. VPPFL supports user dropout and provides verifiability for each user during the training process while preserving user privacy. Furthermore, we proved the security of the scheme theoretically and validated its practical performance through simulation experiments on real data. The scheme proposed in this paper solves some problems in FL to a certain extent, but there is still room for improvement. Our scheme relies on a trusted third party to distribute the keys and assumes that this entity is strong enough not to be breached. Once the entity is compromised, the data of all participants is no longer secure, yet in a real-world deployment of FL it is difficult to find such an entity. How to achieve verifiable privacy-preserving FL without the participation of a trusted third party is an important direction for future research.
Data availability
The datasets are available online. The URL is as follows: MNIST database: https://yann.lecun.com/exdb/mnist.