Introduction
Differential privacy is the state-of-the-art approach to privacy preservation. It rests on a solid mathematical foundation and can quantitatively characterize privacy disclosure; these two features distinguish it from other methods. Many organizations believe that differential privacy will become the most effective technique in the field of privacy preservation. Companies such as Microsoft, Google, and Uber are devoting resources to research on differential privacy mechanisms, which may well become standard industry practice.
There is a demand for differential privacy mechanisms that serve multiple users. On one hand, data owners such as national statistics bureaus need technology to share data with many data users. With the development of data analysis techniques, an increasing number of companies have come to recognize the value of data for improving their services. On the other hand, data owners must heed public concern about personal privacy: they cannot collect data from individuals unless they can protect personal privacy, especially sensitive information such as home addresses, ID numbers, and phone numbers. In short, a differential privacy mechanism designed for multiple users offers a good solution.
In the multi-user setting, the privacy budget is exhausted so quickly that the number of answerable queries becomes insufficient. Most queries are shared among users, while the remainder are unique to individual users; this phenomenon is called the "long tail effect". As the number of users grows, the number of unique queries grows, leading to excessive privacy budget consumption. To tackle this problem, we propose a multi-user mechanism whose contributions include:
We propose a multi-user differential privacy mechanism whose privacy budget does not grow with the number of users. In the proposed mechanism, users are isolated from one another by Laplace distributions with distinct non-zero means, so the privacy budget is not exhausted as quickly.
We analyze both utility and the privacy guarantee, and give explicit formulas for accuracy and for the privacy guarantee. Accuracy is analyzed from two perspectives: the distortion of the data distribution and the absolute value of the noise.
Users cannot obtain a good estimate of a query's true answer even if they collude. We analyze such conspiracy attacks and give a formula characterizing their effect.
A. Related Work
Differential privacy was proposed by Dwork [2] in 2006. Since then, many schemes have been proposed. McSherry [7] proposed PINQ (Privacy Integrated Queries), an extensible platform for privacy-preserving data analysis; PINQ is based on the Laplace distribution with mean 0 and provides analysts with a programming interface that exposes data through a SQL-like language. Proserpio et al. [9] described wPINQ, a platform that generalizes PINQ to weighted datasets. Johnson et al. [6] applied differential privacy to real-world SQL queries with a tool called FLEX, which rewrites SQL statements into a form satisfying differential privacy. Nissim et al. [8] proposed the Sample-Aggregate Framework, a generic framework for private data analysis, and introduced the notion of local sensitivity, which can reduce the noise added to true query answers. Further papers focus on privacy protection, such as [3], and many focus on privacy-preserving data mining, such as [1], [4], [5], [8], [10], [11]. All of these works rely on noise drawn from a Laplace distribution with mean 0.
B. Organization
This section consists of three parts: an introduction to differential privacy, our contributions, and related work. The next section presents preliminary knowledge, introducing the relevant concepts, useful properties, and mathematical tools. The third section describes the proposed mechanism in detail, analyzes its accuracy and privacy guarantee, and gives the corresponding formulas. The fourth section presents experiments that verify our claims.
Preliminary Knowledge
In this section, we briefly introduce the preliminary knowledge, including the system model, basic concepts, and mathematical tools.
A. System Model
The system consists of three components, as shown in Figure 1. The database holds the sensitive information. The DPM (differential privacy mechanism) is responsible for protecting this information: it fetches the sensitive information, covers it with differentially private noise, and sends the result to users. The users are the parties who make use of the covered information. For this model, we make the following assumptions:
There are multiple users, some of whom may collude to estimate the true answers of certain queries.
There are multiple DPMs, and each one is responsible for handling a single user's queries through its own differential privacy mechanism.
Each differential privacy mechanism is implemented with the Laplace mechanism.
B. Differential Privacy
Differential privacy is a strict notion of privacy protection. It quantitatively describes privacy disclosure with probabilistic tools: concretely, the output probability of a mechanism is insensitive to any single record. The notion of a neighboring database describes databases between which there is only one differing record: databases D and D' are neighboring, denoted D ∼ D', if they differ in exactly one record.
Definition 1 (Differential Privacy):
A randomized mechanism A satisfies ε-differential privacy if, for any neighboring databases D ∼ D' and any set of outputs S,\begin{equation*} P\{A(D)\in S\} \le e^\epsilon P\{A(D')\in S\}.\end{equation*}
Here, the parameter ε is called the privacy budget; the smaller ε is, the stronger the privacy guarantee.
A differential privacy mechanism needs to cover the difference caused by the presence or absence of any single record, so the notion of global sensitivity is introduced to measure the largest such difference.
Definition 2 (Global Sensitivity):
The global sensitivity of a query f over neighboring databases D ∼ D' is\begin{equation*} \Delta D = \max \limits _{D \sim D'} | f(D)-f(D')|.\end{equation*}
Here, the maximum is taken over all pairs of neighboring databases, so ΔD bounds how much a single record can change the query answer.
There are many differential privacy mechanisms; the Laplace mechanism is the most popular one for numerical values.
Definition 3 (Laplace Mechanism):
For a given database D, query f, and privacy budget ε, the Laplace mechanism outputs\begin{equation*} M(D,f,\epsilon) = f(D) + Lap\left({\frac {\Delta D}{\epsilon }}\right),\end{equation*} where Lap(b) denotes a sample from the Laplace distribution with mean 0 and scale b.
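As a concrete illustration, the following minimal Python sketch implements the Laplace mechanism for a counting query. The function name and the toy database are our own illustrative choices, not artifacts of the paper.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Release true_answer + Lap(0, sensitivity / epsilon)."""
    scale = sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(42)
ages = np.array([23, 35, 41, 29, 52, 47])   # toy sensitive database
true_count = int(np.sum(ages > 30))         # counting query: sensitivity is 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
print(f"true = {true_count}, released = {noisy_count:.2f}")
```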
Differential privacy has several useful properties that make it possible to build complex mechanisms from basic building blocks. Two of them are the Sequential Composition Theorem and the Parallel Composition Theorem.
1) Sequential Composition Theorem
Let A_1, A_2, ..., A_n be mechanisms such that A_i satisfies ε_i-differential privacy. Then the sequence of all A_i applied to the same database satisfies (Σ_i ε_i)-differential privacy.
2) Parallel Composition Theorem
Let A_1, A_2, ..., A_n be mechanisms such that A_i satisfies ε_i-differential privacy, and let D_1, D_2, ..., D_n be disjoint subsets of the database. Then the sequence of A_i applied to the corresponding D_i satisfies (max_i ε_i)-differential privacy.
The Laplace mechanism relies on noise drawn from the Laplace distribution. We give its definition below, together with one of its properties, which serves as a useful mathematical tool in our accuracy analysis.
3) Laplace Distribution
A random variable X follows the Laplace distribution with location parameter μ and scale b, written X ∼ Lap(μ, b), if its density is\begin{equation*} F(x|\mu,b)=\frac {1}{2b}e^{-\frac {|x-\mu |}{b}}.\end{equation*}
Here, μ is the location parameter (and the mean), and b > 0 is the scale parameter.
Fact 1:
If Y ∼ Lap(0, b), then for any t ≥ 0,\begin{equation*} P\{ Y>b\cdot t \} = \frac {1}{2}e^{-t}.\end{equation*}
Proposed Mechanism
In this section, the mechanism is discussed in detail. First, we present its construction. Second, we analyze its privacy guarantee and accuracy, and compare the proposed scheme with the common practice in which noise is drawn from a Laplace distribution with location parameter 0.
A. Differential Privacy Mechanism for Multiple Users
Our goal is to design a differential privacy mechanism for multiple users using Laplace distributions with various non-zero location parameters. The intuition is that a stronger privacy guarantee and higher accuracy can be obtained if users are well isolated from one another. In the proposed mechanism, the answers for different users are covered by noise drawn from Laplace distributions with different non-zero location parameters. The mechanism is presented next.
The proposed mechanism is given as Algorithm 1.
Algorithm 1 Mechanism A: Differential Privacy Mechanism for Multiple Users
Input: the number of users n; the number of queries per user k; the query set; the interval I from which the location parameters are drawn; the database D; the privacy budget ε.
Output: the set of answers {r̂_ij}.
for each user i = 1, ..., n do
  Randomly choose a number μ_i from the interval I
  Set user i's noise distribution to Lap(μ_i, kΔD/ε)
end for
for each query j = 1, ..., k of each user i do
  The answer r̂_ij = r_ij + r̄_ij, where r̄_ij is drawn from Lap(μ_i, kΔD/ε)
end for
We use the following notation: the database is denoted by D; r_ij is the true answer to the j-th query of user i; r̄_ij ∼ Lap(μ_i, kΔD/ε) is the noise added to that answer; and r̂_ij = r_ij + r̄_ij is the released answer.
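The following Python sketch mirrors our reading of Algorithm 1: each user draws a location parameter once from the given interval, and all of that user's k answers are perturbed with Lap(μ_i, kΔD/ε) noise. Names and parameter values are illustrative only.

```python
import numpy as np

def mechanism_a(true_answers, interval, sensitivity, epsilon, rng):
    """true_answers: shape (n_users, k), each user's k true query answers.
    Returns the noisy answers and the per-user location parameters."""
    n_users, k = true_answers.shape
    scale = k * sensitivity / epsilon                          # budget epsilon split over k queries
    mus = rng.uniform(interval[0], interval[1], size=n_users)  # one mu_i per user
    noise = rng.laplace(loc=mus[:, None], scale=scale, size=(n_users, k))
    return true_answers + noise, mus

rng = np.random.default_rng(0)
answers = np.full((5, 10), 100.0)   # 5 users, 10 queries each, true answer 100
noisy, mus = mechanism_a(answers, interval=(2.0, 6.0), sensitivity=1.0, epsilon=1.0, rng=rng)
print(np.round(mus, 2))             # each user's location parameter
print(np.round(noisy[:, 0], 1))     # first released answer per user
```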
Theorem 1:
For any database D, each answer released by Mechanism A satisfies (ε/k)-differential privacy, so each user's k answers together satisfy ε-differential privacy.
Proof:
For neighboring databases D ∼ D', let r_ij and r̃_ij denote the true answers on D and D', and let r̂_ij and r̂'_ij denote the corresponding released answers. For any output value t,\begin{align*} P\{\hat {r}_{ij}=t\}=&P\{r_{ij} + \bar {r}_{ij}=t\} \\=&P\{\bar {r}_{ij}= t - r_{ij} \}.\end{align*}
Since r̄_ij ∼ Lap(μ_i, kΔD/ε),\begin{equation*} P\{\bar {r}_{ij}= t - r_{ij}\} =\frac {\epsilon }{2k\Delta D}e^{-\frac {\epsilon |t - r_{ij} -\mu _{i}|}{ k\Delta D}}.\end{equation*}
Similarly, for the neighboring database D',\begin{equation*} P\{\hat {r}'_{ij}=t\} =\frac {\epsilon }{2k\Delta D}e^{-\frac {\epsilon |t -\tilde {r}_{ij} -\mu _{i}|}{ k\Delta D}}.\end{equation*}
So, for any t,\begin{align*} \frac {P\{\hat {r}_{ij}=t\}}{P\{\hat {r}'_{ij}=t\}}=&\frac {\frac {\epsilon }{2k\Delta D}e^{-\frac {\epsilon |t-r_{ij}-\mu _{i}|}{ k\Delta D}}}{\frac {\epsilon }{2k\Delta D}e^{-\frac {\epsilon |t-\tilde {r}_{ij}-\mu _{i}|}{ k\Delta D}}} \\=&e^{\frac {\epsilon }{ k\Delta D}\left({|t-\tilde {r}_{ij}-\mu _{i}|-|t-r_{ij}-\mu _{i}|}\right)} \\ \le&e^{\frac {\epsilon }{ k\Delta D}|\tilde {r}_{ij} - r_{ij}|} \\ \le&e^{\frac {\epsilon }{k}},\end{align*} where the first inequality is the triangle inequality and the second uses |r̃_ij − r_ij| ≤ ΔD.
So each single answer satisfies (ε/k)-differential privacy, and by the Sequential Composition Theorem a user's k answers together consume a privacy budget of ε.
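The bound of Theorem 1 can be checked numerically. The sketch below evaluates the density ratio on a grid for two true answers that differ by exactly the sensitivity; all parameter values are illustrative.

```python
import numpy as np

def lap_pdf(x, mu, b):
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

eps, k, dD, mu = 1.0, 5, 1.0, 3.0
b = k * dD / eps                       # noise scale used by Mechanism A
r, r_neighbor = 100.0, 101.0           # true answers on neighboring databases
t = np.linspace(80, 120, 4001)         # grid of possible outputs
ratio = lap_pdf(t, r + mu, b) / lap_pdf(t, r_neighbor + mu, b)
print(f"max ratio = {ratio.max():.4f}, bound e^(eps/k) = {np.exp(eps / k):.4f}")
```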
Next, the accuracy of the proposed mechanism is discussed. In this paper, accuracy is examined from two perspectives: the absolute value of the noise and the distortion of the data distribution. In this subsection, the analysis concerns the absolute value of the noise, and accuracy is defined as follows.
Definition ((α, β)-Accuracy): We say that an algorithm which outputs answers in response to a set of queries is (α, β)-accurate if, for each answer, the probability that its absolute error exceeds α is less than β, i.e., P{|r̂ − r| > α} < β.
First, we introduce a lemma that is used in the proof of Theorem 2.
Lemma 1:
If Y ∼ Lap(μ, b) with μ ≥ 0, then for any t ≥ μ/b,\begin{equation*} P\{ |Y|>b\cdot t \} = \frac {1}{2}e^{-\left({t-\frac {\mu }{b}}\right)}+\frac {1}{2}e^{-\left({t+\frac {\mu }{b}}\right)}.\end{equation*}
Proof:
\begin{align*} P\{|Y|>b\cdot t\}=&P\{Y>bt \text { or } Y < -bt\} \\=&P\{Y>bt\}+P\{Y < -bt\} \\=&P\{Y-\mu >bt-\mu \}+P\{Y-\mu < -bt-\mu \} \\=&P\{Y-\mu >bt-\mu \}+P\{Y-\mu >bt+\mu \} \\=&P\{Y-\mu >b(t-\mu /b)\}+P\{Y-\mu >b(t+\mu /b)\} \\=&\frac {1}{2}e^{-\left({t-\frac {\mu }{b}}\right)}+\frac {1}{2}e^{-\left({t+\frac {\mu }{b}}\right)},\end{align*} where the fourth equality uses the symmetry of Y − μ ∼ Lap(0, b) and the last step applies Fact 1 twice.
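A quick Monte Carlo check of Lemma 1 (a sketch with arbitrary parameters; setting μ = 0 also verifies Fact 1):

```python
import numpy as np

mu, b, t = 1.5, 2.0, 1.2               # valid since t >= mu / b = 0.75
rng = np.random.default_rng(1)
y = rng.laplace(loc=mu, scale=b, size=2_000_000)
empirical = np.mean(np.abs(y) > b * t)
analytic = 0.5 * np.exp(-(t - mu / b)) + 0.5 * np.exp(-(t + mu / b))
print(f"empirical {empirical:.4f} vs analytic {analytic:.4f}")
```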
Theorem 2:
Mechanism A is (α, β)-accurate provided\begin{equation*} \alpha \ge \frac {-\Delta D}{\epsilon }\log\left({\frac {2\beta }{e^{\frac {\mu \epsilon }{\Delta D}} +e^{-\frac {\mu \epsilon }{\Delta D}} }}\right),\end{equation*} where the answer's noise is drawn from Lap(μ, b) with b = ΔD/ε.
Proof:
Consider an answer whose noise r̄_ij is drawn from Lap(μ, b) with b = ΔD/ε. For any α > 0,\begin{equation*} P\{{|\bar {r}_{ij}| > \alpha }\} = P\left\{{{|\bar {r}_{ij}| >b\cdot \frac {\alpha }{b}}}\right\}.\end{equation*}
According to Lemma 1, we have\begin{equation*} P\{{|\bar {r}_{ij}| > \alpha }\} = \frac {1}{2}e^{-\left({\frac {\alpha }{b}-\frac {\mu }{b}}\right)}+\frac {1}{2}e^{-\left({\frac {\alpha }{b}+\frac {\mu }{b}}\right)}.\end{equation*}
Requiring this probability to be less than β and solving for α with b = ΔD/ε, we obtain that whenever\begin{equation*} \alpha \ge \frac {-\Delta D}{\epsilon }\log\left({\frac {2\beta }{e^{\frac {\mu \epsilon }{\Delta D}} +e^{-\frac {\mu \epsilon }{\Delta D}} }}\right),\end{equation*} it holds that\begin{equation*} P\{|\bar {r}_{ij}| > \alpha \} < \beta.\end{equation*}
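The bound of Theorem 2 is easy to evaluate. The sketch below computes the smallest admissible α for illustrative values of β, ε, ΔD, and μ:

```python
import numpy as np

def alpha_bound(beta, epsilon, sensitivity, mu):
    """alpha >= -(dD/eps) * log(2*beta / (e^(mu*eps/dD) + e^(-mu*eps/dD)))."""
    b = sensitivity / epsilon
    return -b * np.log(2 * beta / (np.exp(mu / b) + np.exp(-mu / b)))

for mu in (0.0, 1.0, 2.0):
    print(mu, round(alpha_bound(beta=0.05, epsilon=1.0, sensitivity=1.0, mu=mu), 3))
# mu = 0 recovers the standard bound b * log(1/beta); larger |mu| enlarges alpha.
```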
B. Comparison Between Proposed Mechanism and Common Practice
In this subsection, we compare the proposed mechanism with common practice. In the proposed mechanism, the location parameter of the Laplace distribution is a non-zero number; in common practice, it is zero. We analyze the difference between the two in terms of privacy guarantee and accuracy, and design experiments to verify the analysis.
We assume that common practice performs the same operations as the proposed mechanism, except that its location parameter is zero; consequently, every user's noise distribution is identical. According to the Sequential Composition Theorem, the privacy budget consumed grows with the total number of queries across all users, so it is quickly exhausted as users are added.
Next, we design an experiment to illustrate the difference in privacy guarantee between the proposed mechanism and common practice. We assume that there are i malicious users who collude to estimate a query's true answer.
First, we analyze common practice. The i malicious users each send the same query, such as "how many people are infected with HIV". Let r denote the query's true answer and r_1, r_2, ..., r_i the noisy answers the colluders receive.
The colluders average their answers:\begin{equation*} \hat {r}=\frac {r_{1}+r_{2}+ {\dots }+r_{i}}{i} = r + \frac {r_{1r}+r_{2r}+ {\dots }+r_{ir}}{i},\end{equation*} where r_{jr} = r_j − r is the noise contained in the j-th answer.
Here, each r_{jr} is drawn from a Laplace distribution with mean 0, so by the law of large numbers\begin{equation*} \lim _{i \to \infty }{\frac {r_{1r}+r_{2r}+ {\dots }+r_{ir}}{i}} = 0.\tag{1}\end{equation*}
So\begin{align*} \lim _{i \to \infty }{\hat {r}}=&\lim _{i \to \infty }{\frac {r_{1}+r_{2}+ {\dots }+r_{i}}{i}} \\=&r + \lim _{i \to \infty }{\frac {r_{1r}+r_{2r}+ {\dots }+r_{ir}}{i}} \\=&r,\end{align*} i.e., with enough colluders, common practice reveals the true answer.
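The averaging attack against common practice is easy to reproduce in simulation; the parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
r, sensitivity, epsilon = 100.0, 1.0, 0.5   # true answer and noise parameters
b = sensitivity / epsilon

for i in (10, 100, 10_000):
    answers = r + rng.laplace(loc=0.0, scale=b, size=i)  # i colluders' noisy answers
    print(i, round(answers.mean(), 3))                   # the mean approaches r
```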
Next, we discuss how many malicious users are needed to obtain a confident estimate of the true answer. We give a formula for this number i below.
Lemma 2 (Central Limit Theorem):
If X_1, X_2, ..., X_i are independent and identically distributed random variables with mean u and variance σ², then for large i their sample mean is approximately normally distributed with mean u and variance σ²/i.
According to Lemma 2, the averaged noise R = (r_{1r} + ... + r_{ir})/i is approximately normal with mean u = 0 and standard deviation σ/√i, where σ² = 2b² is the variance of Lap(0, b).
We define a positive constant m as the error bound: the attack succeeds if R falls in (−m, m). Using the normal approximation,\begin{equation*} P\{-m < R < m\} = \int _{-m}^{m}{\frac {\sqrt {i}}{\sigma \sqrt {2\pi }}e^{-\frac {i(x-u)^{2}}{2\sigma ^{2}}}}\,dx.\end{equation*}
\begin{equation*} P\left\{{-3\frac {\sigma }{\sqrt {i}} < R < 3\frac {\sigma }{\sqrt {i}}}\right\}= 99\%\end{equation*}
So the attack succeeds with high probability once 3σ/√i < m, i.e., when\begin{align*} i=&\max \left\{{30,\frac {9\sigma ^{2}}{m^{2}}}\right\} \\=&\max \left\{{30,\frac {18{\Delta D}^{2}}{m^{2}\epsilon ^{2}} }\right\},\end{align*} where the constant 30 keeps the normal approximation reasonable and σ² = 2(ΔD/ε)² is the variance of the Laplace noise.
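Evaluating this formula is immediate; a small sketch with illustrative parameters:

```python
import numpy as np

def colluders_needed(sensitivity, epsilon, m):
    """i = max{30, 18 * dD^2 / (m^2 * eps^2)} from the three-sigma argument."""
    return max(30, int(np.ceil(18 * sensitivity**2 / (m**2 * epsilon**2))))

print(colluders_needed(sensitivity=1.0, epsilon=0.5, m=0.5))   # -> 288
```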
Second, we analyze the proposed mechanism.
Theorem 3:
The proposed mechanism can resist the conspiracy attack if the mean m̂ of the users' location parameters μ_1, μ_2, ... is greater than the error bound m.
Proof:
\begin{align*} \lim _{i \to \infty }{\hat {r}}=&\lim _{i \to \infty }{\frac {r_{1}+r_{2}+ {\dots }+r_{i}}{i}} \\=&r + \lim _{i \to \infty } \frac {r_{1r}+r_{2r}+ {\dots }+r_{ir}}{i} \\=&r + \lim _{i \to \infty }{\frac {(r_{1r}-\mu _{1})+(r_{2r}-\mu _{2})+ {\dots }+(r_{ir}-\mu _{i})}{i}} \\&+\lim _{i \to \infty }{\frac {\mu _{1}+\mu _{2}+ {\dots }+\mu _{i}}{i}} \\=&r + \hat {m} \\>&r + m.\end{align*} The colluders' estimate is therefore biased by m̂ > m, so they cannot recover the true answer within the error bound.
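Repeating the attack simulation against the proposed mechanism exhibits the residual bias m̂ predicted by Theorem 3. In this sketch each colluder's μ_i is uniform on an interval whose mean exceeds the error bound m; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
r, b, m = 100.0, 2.0, 1.0                        # true answer, noise scale, error bound
for i in (100, 10_000, 1_000_000):
    mus = rng.uniform(1.0, 3.0, size=i)          # E[mu_i] = 2 > m
    answers = r + rng.laplace(loc=mus, scale=b)  # one noisy answer per colluder
    print(i, round(answers.mean(), 3))           # converges to r + 2, not to r
```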
Discussion: In the proposed mechanism, the distribution from which the Laplace location parameters are drawn can be arbitrary, as long as its mean is slightly greater than the error bound m.
Next, we analyze accuracy. First, we compare the absolute value of the noise, the usual standard for quantifying the accuracy of a differential privacy mechanism. Second, we discuss the distortion of the data distribution. For statistical analysis, the influence of a differential privacy mechanism on the data distribution matters more than the absolute value of the noise; private machine learning and statistical analysis are instances of this. In this paper, the KL-divergence is used to quantify the distortion of the data distribution.
First, we analyze the absolute value of the noise. By Theorem 2, for a fixed β the required α grows with |μ|, because the factor e^{με/ΔD} + e^{−με/ΔD} is minimized at μ = 0, where the bound reduces to the standard one for mean-0 Laplace noise. So in terms of the absolute value of the noise, Mechanism A is somewhat worse than common practice.
Second, we analyze the noise's influence on the data distribution. Assume the raw data distribution has density p(x). We quantify the distortion by the KL-divergence between p and the noise density: for common practice this is KL_c, computed against the Lap(0, b) density, and for the proposed mechanism it is KL_p, computed against the equal-weight mixture of the users' Lap(μ_i, b) densities.
Definition (KL-Divergence):
The KL-divergence between two random variables Y and Z taking values in the same domain is defined to be:\begin{equation*} D(Y||Z)=E_{y \sim Y}\left[{\ln \frac {P(Y=y)}{P(Z=y)}}\right]\end{equation*}
So, for common practice,\begin{align*} KL_{c}=&\int _{-\infty }^{+\infty }p(x)\ln\left({\frac {p(x)}{\frac {1}{2b}e^{-\frac {|x|}{b}}}}\right)dx \\=&\int _{-\infty }^{+\infty }p(x)\ln(2bp(x))dx + \int _{-\infty }^{+\infty } \frac {p(x)}{b}(x) dx \\&+\int _{-\infty }^{0} \frac {p(x)}{b}(-2x) dx.\end{align*}
And for the proposed mechanism with n users,\begin{equation*} KL_{p} = \int _{-\infty }^{+\infty }p(x)\ln\left({\frac {p(x)}{\sum _{i=1}^{n}\left({\frac {1}{2b}e^{-\frac {|x-\mu _{i}|}{b}}\frac {1}{n}}\right)}}\right)dx.\end{equation*}
Let μ̄ denote the mean of the location parameters μ_1, ..., μ_n; approximating the mixture by the single density Lap(μ̄, b) gives\begin{equation*} KL_{p} = \int _{-\infty }^{+\infty }p(x)\ln\left({\frac {p(x)}{\frac {1}{2b}e^{-\frac {|x-\bar {\mu }|}{b}}}}\right)dx.\end{equation*}
When μ̄ = 0, this reduces to the expression for common practice, so\begin{equation*} KL_{p} = KL_{c}.\end{equation*}
When μ̄ > 0,\begin{align*} KL_{p}=&\int _{-\infty }^{+\infty }p(x)\ln(2bp(x))dx + \int _{-\infty }^{+\infty }p(x) \frac {|x-\bar {\mu }|}{b} dx \\=&\int _{-\infty }^{+\infty }p(x)\ln(2bp(x))dx \\&+\int _{-\infty }^{\bar {\mu }} \frac {p(x)}{b}(\bar {\mu }-x) dx + \int _{\bar {\mu }}^{+\infty } \frac {p(x)}{b}(x-\bar {\mu }) dx \\=&\int _{-\infty }^{+\infty }p(x)\ln(2bp(x))dx + \int _{-\infty }^{+\infty } \frac {p(x)}{b}(x) dx +\int _{-\infty }^{0} \frac {p(x)}{b}(-2x) dx \\&-\frac {2}{b}\int _{0}^{\bar {\mu }} x\,p(x) dx + \frac {\bar {\mu }}{b}\left({2\int _{-\infty }^{\bar {\mu }}p(x)dx-1}\right) \\=&KL_{c} -\frac {2}{b}\int _{0}^{\bar {\mu }} x\,p(x) dx + \frac {\bar {\mu }}{b}\left({2\int _{-\infty }^{\bar {\mu }}p(x)dx-1}\right).\end{align*} Note that KL_p − KL_c = (1/b)∫p(x)(|x − μ̄| − |x|)dx; since ∫p(x)|x − c|dx is minimized when c is the median of p, this difference is non-negative whenever the median of p is 0, so the distortion of the proposed mechanism is at least that of common practice.
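The two divergences can also be compared numerically. The sketch below evaluates KL_c and KL_p on a grid, taking a standard normal as the raw density p(x); the density and parameters are illustrative assumptions.

```python
import numpy as np

x = np.linspace(-20, 20, 200_001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)       # raw data density p(x)
b = 2.0                                          # common noise scale

def lap(x, mu, b):
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)) * dx)

mus = np.array([-0.5, 1.0, 2.5])                 # users' location parameters (mean 1.0)
mixture = np.mean([lap(x, mu, b) for mu in mus], axis=0)

print("KL_c               =", round(kl(p, lap(x, 0.0, b)), 4))
print("KL_p (mixture)     =", round(kl(p, mixture), 4))
print("KL_p (Lap at mean) =", round(kl(p, lap(x, mus.mean(), b)), 4))
```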
In summary, in terms of data-distribution distortion, the proposed mechanism matches common practice when the mean of its location parameters is zero, and is worse than common practice when that mean is non-zero. The comparison is summarized in Table 1.
Experiment
This section consists of two parts. The first part verifies that the proposed mechanism provides a stronger privacy guarantee. The second part verifies that the proposed mechanism matches common practice in accuracy when the application scenario is statistics-based machine learning.
First, we examine the privacy guarantee. As the number of malicious users i grows, colluders under common practice can estimate the true answer accurately, whereas under the proposed mechanism they cannot.
We run simulation experiments to verify that common practice, which draws noise from a Laplace distribution with mean 0, is vulnerable to the conspiracy attack: the colluders average their noisy answers, and the average approaches the true answer as i grows. The results are shown in Figure 2.
According to Figure 2, a good estimate of the true answer is obtained once the number of colluders is large enough, consistent with Equation (1).
We also run simulation experiments to verify that the proposed mechanism resists conspiracy attacks. We assume the colluders mount the same averaging attack against answers produced by Algorithm 1. The results are shown in Figure 3.
According to the data in Figure 3, the conspiracy attack's estimate stays biased away from the true answer, in line with Theorem 3.
FLEX is an open-source tool for differentially private SQL queries, and it is based on the Laplace distribution with mean 0. We ran an experiment with it. The SQL statement used is “select count(*) from orders join customers on orders.customer_id = customers.customer_id where orders.product_id = 1 and customers.address like ‘%United States%’”. The elastic sensitivity is 50, and a privacy budget ε is specified for the query.
Second, experiments are performed to verify the accuracy analysis. As claimed, the distortion of the data distribution matters more than the absolute value of the noise in statistical analysis. In these experiments, we show that the proposed mechanism matches common practice in accuracy in the scenario of statistics-based machine learning. The chosen model is logistic regression and the chosen dataset is iris. Logistic regression is chosen because it is widely used; iris is chosen because it is a famous and popular dataset in the machine learning field. The experiment is a classification task using a logistic model over the iris dataset, where the model satisfies differential privacy through perturbation of the objective function. The results are shown in Figure 6.
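The paper's experimental code is not given; the following is a minimal sketch of one way to realize objective perturbation for logistic regression with Laplace noise whose location parameter can be varied, in the spirit of the experiment above. The regularization weight, the noise calibration, the binarization of iris, and all names are our own illustrative assumptions rather than the exact setup behind Figure 6.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_iris

def perturbed_logistic(X, y, epsilon, mu, lam=0.1, seed=0):
    """Fit logistic regression with a Laplace-perturbed objective:
    mean log-loss + (b . w) / n + (lam / 2) * ||w||^2, b ~ Lap(mu, 1/epsilon) per coordinate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    b = rng.laplace(loc=mu, scale=1.0 / epsilon, size=d)

    def loss(w):
        margins = -y * (X @ w)                   # labels y are in {-1, +1}
        return np.mean(np.logaddexp(0.0, margins)) + b @ w / n + 0.5 * lam * w @ w

    return minimize(loss, np.zeros(d)).x

X, y = load_iris(return_X_y=True)
mask = y < 2                                     # binary task: setosa vs. versicolor
X = np.hstack([X[mask], np.ones((int(mask.sum()), 1))])  # append a bias column
y = np.where(y[mask] == 0, -1.0, 1.0)

for mu in (-1.0, 0.0, 1.0):                      # vary the location parameter
    w = perturbed_logistic(X, y, epsilon=1.0, mu=mu)
    print(f"mu = {mu:+.1f}: accuracy = {np.mean(np.sign(X @ w) == y):.3f}")
```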
According to Figure 6, for a given location parameter, accuracy increases as the privacy budget increases; for a given privacy budget, accuracy fluctuates only slightly as the location parameter varies from −1 to 1. The reason is that, for a given privacy budget, changing the location parameter of the Laplace distribution does not distort the data distribution much, even though the absolute value of the noise increases. The experimental results are in line with our theoretical analysis.
Conclusion
In this paper, a differential privacy mechanism is proposed to optimize the number of queries in the multi-user scenario. Users are isolated by assigning distinct non-zero means to their noise distributions. In terms of the privacy guarantee, the proposed mechanism is better; in terms of utility, the analysis covers both the distortion of the data distribution and the absolute value of the noise.