Introduction
Differential privacy is the state-of-the-art approach to privacy preservation. It rests on a solid mathematical foundation and can quantitatively characterize privacy disclosure; these two features distinguish it from other methods. Many organizations believe that differential privacy will become the most effective technique in the field of privacy preservation. Companies such as Microsoft, Google, and Uber are devoting resources to research on differential privacy mechanisms, which may well become standard industry practice.
There is a demand for differential privacy mechanisms that serve multiple users. On one hand, data owners such as national statistics bureaus need technology to share data with many data users. With the development of data analysis techniques, an increasing number of companies have come to recognize the value of data for improving their services. On the other hand, data owners must heed public concern about personal privacy: they cannot collect data from individuals unless they can protect personal privacy, especially sensitive information such as home addresses, ID numbers, and phone numbers. In short, a differential privacy mechanism designed for multiple users offers a good solution.
In the multi-user setting, the privacy budget is exhausted so quickly that the number of answerable queries becomes insufficient. Most queries are shared among users, while the remainder are unique to individual users; this phenomenon is called the "long tail effect". As the number of users grows, the number of unique queries grows, leading to excessive privacy budget consumption. To tackle this problem, we propose a multi-user mechanism whose contributions include:
We propose a multi-user differential privacy mechanism whose privacy budget does not grow with the number of users. In the proposed mechanism, users are isolated from one another by Laplace distributions with distinct non-zero means, so the privacy budget is not exhausted as quickly.
We analyze both utility and the privacy guarantee, and give explicit formulas for accuracy and for the privacy guarantee. Accuracy is analyzed from two perspectives: the distortion of the data distribution and the absolute value of the noise.
Users cannot obtain a good estimate of a query's true answer even if they collude. We analyze such conspiracy attacks and give a formula characterizing their effect.
A. Related Work
Differential privacy was proposed by Dwork [2] in 2006. Since then, many schemes have been proposed. McSherry [7] proposed PINQ (Privacy Integrated Queries), an extensible platform for privacy-preserving data analysis; PINQ is based on the Laplace distribution with mean 0 and provides analysts with a programming interface that exposes data through a SQL-like language. Proserpio et al. [9] described wPINQ, a platform that generalizes PINQ to weighted datasets. Johnson et al. [6] applied differential privacy to real-world SQL queries with a tool called FLEX, which rewrites SQL statements into a form satisfying differential privacy. Nissim et al. [8] proposed the Sample-Aggregate Framework, a generic framework for private data analysis, and introduced the notion of local sensitivity, which can reduce the noise added to true query answers. Further papers focus on privacy protection, such as [3], and many focus on privacy-preserving data mining, such as [1], [4], [5], [8], [10], [11]. All of these works rely on noise drawn from a Laplace distribution with mean 0.
B. Organization
This section consists of three parts: an introduction to differential privacy, our contributions, and related work. The next section presents preliminary knowledge, introducing the relevant concepts, useful properties, and mathematical tools. The third section describes the proposed mechanism in detail, analyzes its accuracy and privacy guarantee, and gives the corresponding formulas. The fourth section presents experiments that verify our claims.
Preliminary Knowledge
In this section, we briefly introduce the preliminary knowledge, including the system model, basic concepts, and mathematical tools.
A. System Model
The system consists of three components, as shown in Figure 1. The database holds the sensitive information. The DPM (differential privacy mechanism) is responsible for protecting this information: it fetches the sensitive information, covers it with differentially private noise, and sends the result to users. The users are the parties who make use of the covered information. For this model, we make the following assumptions:
There are multiple users, some of whom may collude to estimate the true answers of certain queries.
There are multiple DPMs, and each one is responsible for handling a single user's queries through its own differential privacy mechanism.
Each differential privacy mechanism is implemented with the Laplace mechanism.
B. Differential Privacy
Differential privacy is a strict notion of privacy protection. It quantitatively describes privacy disclosure with probabilistic tools: concretely, the output probability of a mechanism is insensitive to any single record. The notion of a neighboring database describes databases between which there is only one differing record: databases D and D' are neighboring, denoted D ∼ D', if they differ in exactly one record.
Definition 1 (Differential Privacy):
A randomized mechanism A satisfies ε-differential privacy if, for any neighboring databases D ∼ D' and any set of outputs S,\begin{equation*} P\{A(D)\in S\} \le e^\epsilon P\{A(D')\in S\}.\end{equation*}
Here, the parameter ε is called the privacy budget; the smaller ε is, the stronger the privacy guarantee.
A differential privacy mechanism needs to cover the difference caused by the presence or absence of any single record, so the notion of global sensitivity is introduced to measure the largest such difference.
Definition 2 (Global Sensitivity):
The global sensitivity of a query f over neighboring databases D ∼ D' is\begin{equation*} \Delta D = \max \limits _{D \sim D'} | f(D)-f(D')|.\end{equation*}
Here, the maximum is taken over all pairs of neighboring databases, so ΔD bounds how much a single record can change the query answer.
There are many differential privacy mechanisms; the Laplace mechanism is the most popular one for numerical values.
Definition 3 (Laplace Mechanism):
For a given database D, query f, and privacy budget ε, the Laplace mechanism outputs\begin{equation*} M(D,f,\epsilon) = f(D) + Lap\left({\frac {\Delta D}{\epsilon }}\right),\end{equation*} where Lap(b) denotes a sample from the Laplace distribution with mean 0 and scale b.
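As a concrete illustration, the following minimal Python sketch implements the Laplace mechanism for a counting query. The function name and the toy database are our own illustrative choices, not artifacts of the paper.

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Release true_answer + Lap(0, sensitivity / epsilon)."""
    scale = sensitivity / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(42)
ages = np.array([23, 35, 41, 29, 52, 47])   # toy sensitive database
true_count = int(np.sum(ages > 30))         # counting query: sensitivity is 1
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5, rng=rng)
print(f"true = {true_count}, released = {noisy_count:.2f}")
```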
Differential privacy has several useful properties that make it possible to build complex mechanisms from basic building blocks. Two of them are the Sequential Composition Theorem and the Parallel Composition Theorem.
1) Sequential Composition Theorem
Let A_1, A_2, ..., A_n be mechanisms such that A_i satisfies ε_i-differential privacy. Then the sequence of all A_i applied to the same database satisfies (Σ_i ε_i)-differential privacy.
2) Parallel Composition Theorem
Let A_1, A_2, ..., A_n be mechanisms such that A_i satisfies ε_i-differential privacy, and let D_1, D_2, ..., D_n be disjoint subsets of the database. Then the sequence of A_i applied to the corresponding D_i satisfies (max_i ε_i)-differential privacy.
The Laplace mechanism relies on noise drawn from the Laplace distribution. We give its definition below, together with one of its properties, which serves as a useful mathematical tool in our accuracy analysis.
3) Laplace Distribution
A random variable X follows the Laplace distribution with location parameter μ and scale b, written X ∼ Lap(μ, b), if its density is\begin{equation*} F(x|\mu,b)=\frac {1}{2b}e^{-\frac {|x-\mu |}{b}}.\end{equation*}
Here, μ is the location parameter (and the mean), and b > 0 is the scale parameter.
Fact 1:
If Y ∼ Lap(0, b), then for any t ≥ 0,\begin{equation*} P\{ Y>b\cdot t \} = \frac {1}{2}e^{-t}.\end{equation*}
Proposed Mechanism
In this section, the mechanism is discussed in detail. First, we present its construction. Second, we analyze its privacy guarantee and accuracy, and compare the proposed scheme with the common practice in which noise is drawn from a Laplace distribution with location parameter 0.
A. Differential Privacy Mechanism for Multiple Users
Our goal is to design a differential privacy mechanism for multiple users using Laplace distributions with various non-zero location parameters. The intuition is that a stronger privacy guarantee and higher accuracy can be obtained if users are well isolated from one another. In the proposed mechanism, the answers for different users are covered by noise drawn from Laplace distributions with different non-zero location parameters. The mechanism is presented next.
The proposed mechanism is given as Algorithm 1.
Algorithm 1 Mechanism A: Differential Privacy Mechanism for Multiple Users
Input: the number of users n; the number of queries per user k; the query set; the interval I from which the location parameters are drawn; the database D; the privacy budget ε.
Output: the set of answers {r̂_ij}.
for each user i = 1, ..., n do
  Randomly choose a number μ_i from the interval I
  Set user i's noise distribution to Lap(μ_i, kΔD/ε)
end for
for each query j = 1, ..., k of each user i do
  The answer r̂_ij = r_ij + r̄_ij, where r̄_ij is drawn from Lap(μ_i, kΔD/ε)
end for
We use the following notation: the database is denoted by D; r_ij is the true answer to the j-th query of user i; r̄_ij ∼ Lap(μ_i, kΔD/ε) is the noise added to that answer; and r̂_ij = r_ij + r̄_ij is the released answer.
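The following Python sketch mirrors our reading of Algorithm 1: each user draws a location parameter once from the given interval, and all of that user's k answers are perturbed with Lap(μ_i, kΔD/ε) noise. Names and parameter values are illustrative only.

```python
import numpy as np

def mechanism_a(true_answers, interval, sensitivity, epsilon, rng):
    """true_answers: shape (n_users, k), each user's k true query answers.
    Returns the noisy answers and the per-user location parameters."""
    n_users, k = true_answers.shape
    scale = k * sensitivity / epsilon                          # budget epsilon split over k queries
    mus = rng.uniform(interval[0], interval[1], size=n_users)  # one mu_i per user
    noise = rng.laplace(loc=mus[:, None], scale=scale, size=(n_users, k))
    return true_answers + noise, mus

rng = np.random.default_rng(0)
answers = np.full((5, 10), 100.0)   # 5 users, 10 queries each, true answer 100
noisy, mus = mechanism_a(answers, interval=(2.0, 6.0), sensitivity=1.0, epsilon=1.0, rng=rng)
print(np.round(mus, 2))             # each user's location parameter
print(np.round(noisy[:, 0], 1))     # first released answer per user
```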
Theorem 1:
For any database D, each answer released by Mechanism A satisfies (ε/k)-differential privacy, so each user's k answers together satisfy ε-differential privacy.
Proof:
For neighboring databases D ∼ D', let r_ij and r̃_ij denote the true answers on D and D', and let r̂_ij and r̂'_ij denote the corresponding released answers. For any output value t,\begin{align*} P\{\hat {r}_{ij}=t\}=&P\{r_{ij} + \bar {r}_{ij}=t\} \\=&P\{\bar {r}_{ij}= t - r_{ij} \}.\end{align*}
Since r̄_ij ∼ Lap(μ_i, kΔD/ε),\begin{equation*} P\{\bar {r}_{ij}= t - r_{ij}\} =\frac {\epsilon }{2k\Delta D}e^{-\frac {\epsilon |t - r_{ij} -\mu _{i}|}{ k\Delta D}}.\end{equation*}
Similarly, for the neighboring database D',\begin{equation*} P\{\hat {r}'_{ij}=t\} =\frac {\epsilon }{2k\Delta D}e^{-\frac {\epsilon |t -\tilde {r}_{ij} -\mu _{i}|}{ k\Delta D}}.\end{equation*}
So, for any t,\begin{align*} \frac {P\{\hat {r}_{ij}=t\}}{P\{\hat {r}'_{ij}=t\}}=&\frac {\frac {\epsilon }{2k\Delta D}e^{-\frac {\epsilon |t-r_{ij}-\mu _{i}|}{ k\Delta D}}}{\frac {\epsilon }{2k\Delta D}e^{-\frac {\epsilon |t-\tilde {r}_{ij}-\mu _{i}|}{ k\Delta D}}} \\=&e^{\frac {\epsilon }{ k\Delta D}\left({|t-\tilde {r}_{ij}-\mu _{i}|-|t-r_{ij}-\mu _{i}|}\right)} \\ \le&e^{\frac {\epsilon }{ k\Delta D}|\tilde {r}_{ij} - r_{ij}|} \\ \le&e^{\frac {\epsilon }{k}},\end{align*} where the first inequality is the triangle inequality and the second uses |r̃_ij − r_ij| ≤ ΔD.
So each single answer satisfies (ε/k)-differential privacy, and by the Sequential Composition Theorem a user's k answers together consume a privacy budget of ε.
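The bound of Theorem 1 can be checked numerically. The sketch below evaluates the density ratio on a grid for two true answers that differ by exactly the sensitivity; all parameter values are illustrative.

```python
import numpy as np

def lap_pdf(x, mu, b):
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

eps, k, dD, mu = 1.0, 5, 1.0, 3.0
b = k * dD / eps                       # noise scale used by Mechanism A
r, r_neighbor = 100.0, 101.0           # true answers on neighboring databases
t = np.linspace(80, 120, 4001)         # grid of possible outputs
ratio = lap_pdf(t, r + mu, b) / lap_pdf(t, r_neighbor + mu, b)
print(f"max ratio = {ratio.max():.4f}, bound e^(eps/k) = {np.exp(eps / k):.4f}")
```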
Next, the accuracy of the proposed mechanism is discussed. In this paper, accuracy is examined from two perspectives: the absolute value of the noise and the distortion of the data distribution. In this subsection, the analysis concerns the absolute value of the noise, and accuracy is defined as follows.
Definition ((α, β)-Accuracy): We say that an algorithm which outputs answers in response to a set of queries is (α, β)-accurate if, for each answer, the probability that its absolute error exceeds α is less than β, i.e., P{|r̂ − r| > α} < β.
First, we introduce a lemma that is used in the proof of Theorem 2.
Lemma 1:
If Y ∼ Lap(μ, b) with μ ≥ 0, then for any t ≥ μ/b,\begin{equation*} P\{ |Y|>b\cdot t \} = \frac {1}{2}e^{-\left({t-\frac {\mu }{b}}\right)}+\frac {1}{2}e^{-\left({t+\frac {\mu }{b}}\right)}.\end{equation*}
Proof:
\begin{align*} P\{|Y|>b\cdot t\}=&P\{Y>bt \text { or } Y < -bt\} \\=&P\{Y>bt\}+P\{Y < -bt\} \\=&P\{Y-\mu >bt-\mu \}+P\{Y-\mu < -bt-\mu \} \\=&P\{Y-\mu >bt-\mu \}+P\{Y-\mu >bt+\mu \} \\=&P\{Y-\mu >b(t-\mu /b)\}+P\{Y-\mu >b(t+\mu /b)\} \\=&\frac {1}{2}e^{-\left({t-\frac {\mu }{b}}\right)}+\frac {1}{2}e^{-\left({t+\frac {\mu }{b}}\right)},\end{align*} where the fourth equality uses the symmetry of Y − μ ∼ Lap(0, b) and the last step applies Fact 1 twice.
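A quick Monte Carlo check of Lemma 1 (a sketch with arbitrary parameters; setting μ = 0 also verifies Fact 1):

```python
import numpy as np

mu, b, t = 1.5, 2.0, 1.2               # valid since t >= mu / b = 0.75
rng = np.random.default_rng(1)
y = rng.laplace(loc=mu, scale=b, size=2_000_000)
empirical = np.mean(np.abs(y) > b * t)
analytic = 0.5 * np.exp(-(t - mu / b)) + 0.5 * np.exp(-(t + mu / b))
print(f"empirical {empirical:.4f} vs analytic {analytic:.4f}")
```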
Theorem 2:
Mechanism A is (α, β)-accurate provided\begin{equation*} \alpha \ge \frac {-\Delta D}{\epsilon }\log\left({\frac {2\beta }{e^{\frac {\mu \epsilon }{\Delta D}} +e^{-\frac {\mu \epsilon }{\Delta D}} }}\right),\end{equation*} where the answer's noise is drawn from Lap(μ, b) with b = ΔD/ε.
Proof:
Consider an answer whose noise r̄_ij is drawn from Lap(μ, b) with b = ΔD/ε. For any α > 0,\begin{equation*} P\{{|\bar {r}_{ij}| > \alpha }\} = P\left\{{{|\bar {r}_{ij}| >b\cdot \frac {\alpha }{b}}}\right\}.\end{equation*}
According to Lemma 1, we have\begin{equation*} P\{{|\bar {r}_{ij}| > \alpha }\} = \frac {1}{2}e^{-\left({\frac {\alpha }{b}-\frac {\mu }{b}}\right)}+\frac {1}{2}e^{-\left({\frac {\alpha }{b}+\frac {\mu }{b}}\right)}.\end{equation*}
Requiring this probability to be less than β and solving for α with b = ΔD/ε, we obtain that whenever\begin{equation*} \alpha \ge \frac {-\Delta D}{\epsilon }\log\left({\frac {2\beta }{e^{\frac {\mu \epsilon }{\Delta D}} +e^{-\frac {\mu \epsilon }{\Delta D}} }}\right),\end{equation*} it holds that\begin{equation*} P\{|\bar {r}_{ij}| > \alpha \} < \beta.\end{equation*}
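The bound of Theorem 2 is easy to evaluate. The sketch below computes the smallest admissible α for illustrative values of β, ε, ΔD, and μ:

```python
import numpy as np

def alpha_bound(beta, epsilon, sensitivity, mu):
    """alpha >= -(dD/eps) * log(2*beta / (e^(mu*eps/dD) + e^(-mu*eps/dD)))."""
    b = sensitivity / epsilon
    return -b * np.log(2 * beta / (np.exp(mu / b) + np.exp(-mu / b)))

for mu in (0.0, 1.0, 2.0):
    print(mu, round(alpha_bound(beta=0.05, epsilon=1.0, sensitivity=1.0, mu=mu), 3))
# mu = 0 recovers the standard bound b * log(1/beta); larger |mu| enlarges alpha.
```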
B. Comparison Between Proposed Mechanism and Common Practice
In this subsection, we compare the proposed mechanism with common practice. In the proposed mechanism, the location parameter of the Laplace distribution is a non-zero number; in common practice, it is zero. We analyze the difference between the two in terms of privacy guarantee and accuracy, and design experiments to verify the analysis.
We assume that common practice performs the same operations as the proposed mechanism, except that its location parameter is zero; consequently, every user's noise distribution is identical. According to the Sequential Composition Theorem, the privacy budget consumed grows with the total number of queries across all users, so it is quickly exhausted as users are added.
Next, we design an experiment to illustrate the difference in privacy guarantee between the proposed mechanism and common practice. We assume that there are i malicious users who collude to estimate a query's true answer.
First, we analyze common practice. The i malicious users each send the same query, such as "how many people are infected with HIV". Let r denote the query's true answer and r_1, r_2, ..., r_i the noisy answers the colluders receive.
The colluders average their answers:\begin{equation*} \hat {r}=\frac {r_{1}+r_{2}+ {\dots }+r_{i}}{i} = r + \frac {r_{1r}+r_{2r}+ {\dots }+r_{ir}}{i},\end{equation*} where r_{jr} = r_j − r is the noise contained in the j-th answer.
Here, each r_{jr} is drawn from a Laplace distribution with mean 0, so by the law of large numbers\begin{equation*} \lim _{i \to \infty }{\frac {r_{1r}+r_{2r}+ {\dots }+r_{ir}}{i}} = 0.\tag{1}\end{equation*}
So\begin{align*} \lim _{i \to \infty }{\hat {r}}=&\lim _{i \to \infty }{\frac {r_{1}+r_{2}+ {\dots }+r_{i}}{i}} \\=&r + \lim _{i \to \infty }{\frac {r_{1r}+r_{2r}+ {\dots }+r_{ir}}{i}} \\=&r,\end{align*} i.e., with enough colluders, common practice reveals the true answer.
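The averaging attack against common practice is easy to reproduce in simulation; the parameters below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
r, sensitivity, epsilon = 100.0, 1.0, 0.5   # true answer and noise parameters
b = sensitivity / epsilon

for i in (10, 100, 10_000):
    answers = r + rng.laplace(loc=0.0, scale=b, size=i)  # i colluders' noisy answers
    print(i, round(answers.mean(), 3))                   # the mean approaches r
```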
Next, we discuss how many malicious users are needed to obtain a confident estimate of the true answer. We give a formula for this number i below.
Lemma 2 (Central Limit Theorem):
If X_1, X_2, ..., X_i are independent and identically distributed random variables with mean u and variance σ², then for large i their sample mean is approximately normally distributed with mean u and variance σ²/i.
According to Lemma 2, the averaged noise R = (r_{1r} + ... + r_{ir})/i is approximately normal with mean u = 0 and standard deviation σ/√i, where σ² = 2b² is the variance of Lap(0, b).
We define a positive constant m as the error bound: the attack succeeds if R falls in (−m, m). Using the normal approximation,\begin{equation*} P\{-m < R < m\} = \int _{-m}^{m}{\frac {\sqrt {i}}{\sigma \sqrt {2\pi }}e^{-\frac {i(x-u)^{2}}{2\sigma ^{2}}}}\,dx.\end{equation*}
\begin{equation*} P\left\{{-3\frac {\sigma }{\sqrt {i}} < R < 3\frac {\sigma }{\sqrt {i}}}\right\}= 99\%\end{equation*}
So the attack succeeds with high probability once 3σ/√i < m, i.e., when\begin{align*} i=&\max \left\{{30,\frac {9\sigma ^{2}}{m^{2}}}\right\} \\=&\max \left\{{30,\frac {18{\Delta D}^{2}}{m^{2}\epsilon ^{2}} }\right\},\end{align*} where the constant 30 keeps the normal approximation reasonable and σ² = 2(ΔD/ε)² is the variance of the Laplace noise.
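Evaluating this formula is immediate; a small sketch with illustrative parameters:

```python
import numpy as np

def colluders_needed(sensitivity, epsilon, m):
    """i = max{30, 18 * dD^2 / (m^2 * eps^2)} from the three-sigma argument."""
    return max(30, int(np.ceil(18 * sensitivity**2 / (m**2 * epsilon**2))))

print(colluders_needed(sensitivity=1.0, epsilon=0.5, m=0.5))   # -> 288
```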
Second, we analyze the proposed mechanism.
Theorem 3:
The proposed mechanism can resist the conspiracy attack if the mean m̂ of the users' location parameters μ_1, μ_2, ... is greater than the error bound m.
Proof:
\begin{align*} \lim _{i \to \infty }{\hat {r}}=&\lim _{i \to \infty }{\frac {r_{1}+r_{2}+ {\dots }+r_{i}}{i}} \\=&r + \lim _{i \to \infty } \frac {r_{1r}+r_{2r}+ {\dots }+r_{ir}}{i} \\=&r + \lim _{i \to \infty }{\frac {(r_{1r}-\mu _{1})+(r_{2r}-\mu _{2})+ {\dots }+(r_{ir}-\mu _{i})}{i}} \\&+\lim _{i \to \infty }{\frac {\mu _{1}+\mu _{2}+ {\dots }+\mu _{i}}{i}} \\=&r + \hat {m} \\>&r + m.\end{align*} The colluders' estimate is therefore biased by m̂ > m, so they cannot recover the true answer within the error bound.
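Repeating the attack simulation against the proposed mechanism exhibits the residual bias m̂ predicted by Theorem 3. In this sketch each colluder's μ_i is uniform on an interval whose mean exceeds the error bound m; all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
r, b, m = 100.0, 2.0, 1.0                        # true answer, noise scale, error bound
for i in (100, 10_000, 1_000_000):
    mus = rng.uniform(1.0, 3.0, size=i)          # E[mu_i] = 2 > m
    answers = r + rng.laplace(loc=mus, scale=b)  # one noisy answer per colluder
    print(i, round(answers.mean(), 3))           # converges to r + 2, not to r
```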
Discussion: In the proposed mechanism, the distribution from which the Laplace location parameters are drawn can be arbitrary, as long as its mean is slightly greater than the error bound m.
Next, we analyze accuracy. First, we compare the absolute value of the noise, the usual standard for quantifying the accuracy of a differential privacy mechanism. Second, we discuss the distortion of the data distribution. For statistical analysis, the influence of a differential privacy mechanism on the data distribution matters more than the absolute value of the noise; private machine learning and statistical analysis are instances of this. In this paper, the KL-divergence is used to quantify the distortion of the data distribution.
First, we analyze the absolute value of the noise. By Theorem 2, for a fixed β the required α grows with |μ|, because the factor e^{με/ΔD} + e^{−με/ΔD} is minimized at μ = 0, where the bound reduces to the standard one for mean-0 Laplace noise. So in terms of the absolute value of the noise, Mechanism A is somewhat worse than common practice.
Second, we analyze the noise's influence on the data distribution. Assume the raw data distribution has density p(x). We quantify the distortion by the KL-divergence between p and the noise density: for common practice this is KL_c, computed against the Lap(0, b) density, and for the proposed mechanism it is KL_p, computed against the equal-weight mixture of the users' Lap(μ_i, b) densities.
Definition (KL-Divergence):
The KL-divergence between two random variables Y and Z taking values in the same domain is defined to be:\begin{equation*} D(Y||Z)=E_{y \sim Y}\left[{\ln \frac {P(Y=y)}{P(Z=y)}}\right]\end{equation*}
So, for common practice,\begin{align*} KL_{c}=&\int _{-\infty }^{+\infty }p(x)\ln\left({\frac {p(x)}{\frac {1}{2b}e^{-\frac {|x|}{b}}}}\right)dx \\=&\int _{-\infty }^{+\infty }p(x)\ln(2bp(x))dx + \int _{-\infty }^{+\infty } \frac {p(x)}{b}(x) dx \\&+\int _{-\infty }^{0} \frac {p(x)}{b}(-2x) dx.\end{align*}
And for the proposed mechanism with n users,\begin{equation*} KL_{p} = \int _{-\infty }^{+\infty }p(x)\ln\left({\frac {p(x)}{\sum _{i=1}^{n}\left({\frac {1}{2b}e^{-\frac {|x-\mu _{i}|}{b}}\frac {1}{n}}\right)}}\right)dx.\end{equation*}
Let μ̄ denote the mean of the location parameters μ_1, ..., μ_n; approximating the mixture by the single density Lap(μ̄, b) gives\begin{equation*} KL_{p} = \int _{-\infty }^{+\infty }p(x)\ln\left({\frac {p(x)}{\frac {1}{2b}e^{-\frac {|x-\bar {\mu }|}{b}}}}\right)dx.\end{equation*}
When μ̄ = 0, this reduces to the expression for common practice, so\begin{equation*} KL_{p} = KL_{c}.\end{equation*}
When μ̄ > 0,\begin{align*} KL_{p}=&\int _{-\infty }^{+\infty }p(x)\ln(2bp(x))dx + \int _{-\infty }^{+\infty }p(x) \frac {|x-\bar {\mu }|}{b} dx \\=&\int _{-\infty }^{+\infty }p(x)\ln(2bp(x))dx \\&+\int _{-\infty }^{\bar {\mu }} \frac {p(x)}{b}(\bar {\mu }-x) dx + \int _{\bar {\mu }}^{+\infty } \frac {p(x)}{b}(x-\bar {\mu }) dx \\=&\int _{-\infty }^{+\infty }p(x)\ln(2bp(x))dx + \int _{-\infty }^{+\infty } \frac {p(x)}{b}(x) dx +\int _{-\infty }^{0} \frac {p(x)}{b}(-2x) dx \\&-\frac {2}{b}\int _{0}^{\bar {\mu }} x\,p(x) dx + \frac {\bar {\mu }}{b}\left({2\int _{-\infty }^{\bar {\mu }}p(x)dx-1}\right) \\=&KL_{c} -\frac {2}{b}\int _{0}^{\bar {\mu }} x\,p(x) dx + \frac {\bar {\mu }}{b}\left({2\int _{-\infty }^{\bar {\mu }}p(x)dx-1}\right).\end{align*} Note that KL_p − KL_c = (1/b)∫p(x)(|x − μ̄| − |x|)dx; since ∫p(x)|x − c|dx is minimized when c is the median of p, this difference is non-negative whenever the median of p is 0, so the distortion of the proposed mechanism is at least that of common practice.
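The two divergences can also be compared numerically. The sketch below evaluates KL_c and KL_p on a grid, taking a standard normal as the raw density p(x); the density and parameters are illustrative assumptions.

```python
import numpy as np

x = np.linspace(-20, 20, 200_001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)       # raw data density p(x)
b = 2.0                                          # common noise scale

def lap(x, mu, b):
    return np.exp(-np.abs(x - mu) / b) / (2 * b)

def kl(p, q):
    return float(np.sum(p * np.log(p / q)) * dx)

mus = np.array([-0.5, 1.0, 2.5])                 # users' location parameters (mean 1.0)
mixture = np.mean([lap(x, mu, b) for mu in mus], axis=0)

print("KL_c               =", round(kl(p, lap(x, 0.0, b)), 4))
print("KL_p (mixture)     =", round(kl(p, mixture), 4))
print("KL_p (Lap at mean) =", round(kl(p, lap(x, mus.mean(), b)), 4))
```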
In summary, in terms of data-distribution distortion, the proposed mechanism matches common practice when the mean of its location parameters is zero, and is worse than common practice when that mean is non-zero. The comparison is summarized in Table 1.
Experiment
This section consists of two parts. The first part verifies that the proposed mechanism provides a stronger privacy guarantee. The second part verifies that the proposed mechanism matches common practice in accuracy when the application scenario is statistics-based machine learning.
First, we examine the privacy guarantee. As the number of malicious users i grows, colluders under common practice can estimate the true answer accurately, whereas under the proposed mechanism they cannot.
We run simulation experiments to verify that common practice, which draws noise from a Laplace distribution with mean 0, is vulnerable to the conspiracy attack: the colluders average their noisy answers, and the average approaches the true answer as i grows. The results are shown in Figure 2.
According to Figure 2, a good estimate of the true answer is obtained once the number of colluders is large enough, consistent with Equation (1).
We also run simulation experiments to verify that the proposed mechanism resists conspiracy attacks. We assume the colluders mount the same averaging attack against answers produced by Algorithm 1. The results are shown in Figure 3.
According to the data in Figure 3, the conspiracy attack's estimate stays biased away from the true answer, in line with Theorem 3.
FLEX is an open-source tool for differentially private SQL queries, and it is based on the Laplace distribution with mean 0. We ran an experiment with it. The SQL statement used is “select count(*) from orders join customers on orders.customer_id = customers.customer_id where orders.product_id = 1 and customers.address like ‘%United States%’”. The elastic sensitivity is 50, and a privacy budget ε is specified for the query.
Second, experiments are performed to verify the accuracy analysis. As claimed, the distortion of the data distribution matters more than the absolute value of the noise in statistical analysis. In these experiments, we show that the proposed mechanism matches common practice in accuracy in the scenario of statistics-based machine learning. The chosen model is logistic regression and the chosen dataset is iris. Logistic regression is chosen because it is widely used; iris is chosen because it is a famous and popular dataset in the machine learning field. The experiment is a classification task using a logistic model over the iris dataset, where the model satisfies differential privacy through perturbation of the objective function. The results are shown in Figure 6.
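The paper's experimental code is not given; the following is a minimal sketch of one way to realize objective perturbation for logistic regression with Laplace noise whose location parameter can be varied, in the spirit of the experiment above. The regularization weight, the noise calibration, the binarization of iris, and all names are our own illustrative assumptions rather than the exact setup behind Figure 6.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import load_iris

def perturbed_logistic(X, y, epsilon, mu, lam=0.1, seed=0):
    """Fit logistic regression with a Laplace-perturbed objective:
    mean log-loss + (b . w) / n + (lam / 2) * ||w||^2, b ~ Lap(mu, 1/epsilon) per coordinate."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    b = rng.laplace(loc=mu, scale=1.0 / epsilon, size=d)

    def loss(w):
        margins = -y * (X @ w)                   # labels y are in {-1, +1}
        return np.mean(np.logaddexp(0.0, margins)) + b @ w / n + 0.5 * lam * w @ w

    return minimize(loss, np.zeros(d)).x

X, y = load_iris(return_X_y=True)
mask = y < 2                                     # binary task: setosa vs. versicolor
X = np.hstack([X[mask], np.ones((int(mask.sum()), 1))])  # append a bias column
y = np.where(y[mask] == 0, -1.0, 1.0)

for mu in (-1.0, 0.0, 1.0):                      # vary the location parameter
    w = perturbed_logistic(X, y, epsilon=1.0, mu=mu)
    print(f"mu = {mu:+.1f}: accuracy = {np.mean(np.sign(X @ w) == y):.3f}")
```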
According to Figure 6, for a given location parameter, accuracy increases as the privacy budget increases; for a given privacy budget, accuracy fluctuates only slightly as the location parameter varies from −1 to 1. The reason is that, for a given privacy budget, changing the location parameter of the Laplace distribution does not distort the data distribution much, even though the absolute value of the noise increases. The experimental results are in line with our theoretical analysis.
Conclusion
In this paper, a differential privacy mechanism is proposed to optimize the number of queries in the multi-user scenario. Users are isolated by assigning distinct non-zero means to their noise distributions. In terms of the privacy guarantee, the proposed mechanism is better; in terms of utility, the analysis covers both the distortion of the data distribution and the absolute value of the noise.