Introduction
A moving objects database is a multiset of trajectories, each representing the movement history of a moving object during a period of time. Moving objects databases have become very important in recent years because of their applications in many domains such as location-based services, municipal transportation, and traffic management.
Moving objects databases often contain sensitive information and, therefore, their improper use may lead to privacy breaches. Moreover, for many applications, moving objects may have non-spatiotemporal sensitive attributes such as disease, job, and income. Therefore, there is a growing concern about breaching the privacy of moving objects whose locations are monitored and tracked.
Differential privacy (DP) [1], [2] has recently emerged as a de facto standard for private statistics publishing due to its strong provable privacy guarantees. It ensures that the probability that a statistical query will produce a given result is (nearly) the same as when one data record is added or removed from the database.
In recent years, several mechanisms have been proposed to answer statistical queries while satisfying differential privacy for various types of databases [3]–[8], including moving objects databases [9]–[11]. However, as far as we know, none of the mechanisms proposed for moving objects databases consider the case where a moving objects database contains non-spatiotemporal sensitive attributes, associated with trajectories, in addition to spatiotemporal attributes. Yet many real moving objects databases contain non-spatiotemporal sensitive attributes such as medical or salary information. Such moving objects databases have many applications in industry and medicine. For example, a hospital may use an RFID patient tagging system in which patients’ trajectories, personal data, and medical data are stored in a central moving objects database [12]. From now on, for convenience, we refer to non-spatiotemporal sensitive attributes simply as sensitive attributes.
Personalized privacy is an inherent notion in privacy protection, whose main objective is to guarantee the desired level of privacy protection for each individual, based on that individual’s privacy protection requirement. Under a uniform (non-personalized) privacy model, some data records would be offered insufficient privacy protection, while others would be subjected to excessive privacy control. Although excessive control poses no harm to privacy, it reduces the utility of published results [13].
Personalized differential privacy (PDP) [14] is a recently-introduced notion of differential privacy, which takes into account that different individuals may have different privacy protection requirements. However, most of the existing differentially private mechanisms for moving objects databases do not explicitly consider personalization based on the privacy protection requirements of moving objects.
As noted above, many moving objects databases contain both spatiotemporal and non-spatiotemporal attributes, and preserving privacy for such databases has been an issue of great interest and concern in recent years [12], [15], [16]. Yet, to the best of our knowledge, no differentially private mechanism has been presented so far to deal with such moving objects databases. On the other hand, many real moving objects databases need to guarantee personalized privacy for moving objects based on their privacy protection requirements. In this paper, to address these problems, we present PDP-SAG, a differentially private mechanism that combines sensitive attribute generalization with personalized privacy in a unified manner to achieve the desired level of privacy protection for the moving objects of a moving objects database. In a nutshell, we generalize the sensitive attribute values of trajectory data records based on their privacy descriptors and then construct a noisy trajectory tree to represent all possible subtrajectories with their noisy frequencies in the underlying moving objects database. We further define a so-called personalized sensitive attribute generalization tree (PSAGT) for each node of the noisy trajectory tree. By doing this, we aim to keep different frequencies for each trajectory according to the generalized sensitive attribute values of the trajectory data records passing through that trajectory. Next, we allocate personal privacy budgets to the nodes of the PSAGTs and then obtain the noisy version of each PSAGT in such a way that PDP-SAG satisfies PDP. Finally, we enforce all required consistency constraints via intra- and inter-consistency constraints enforcement steps: the intra-consistency step makes each noisy PSAGT internally consistent, while the inter-consistency step imposes consistency among different noisy PSAGTs.
We use these final consistent noisy PSAGTs to answer sensitive count queries in a personalized differentially private manner.
In the following, we list the main contributions of this paper:
To the best of our knowledge, we are the first to apply differential privacy to moving objects databases that contain sensitive attributes other than spatiotemporal attributes. Also, we are the first to combine the sensitive attribute generalization (SAG) with personalized differential privacy (PDP) in a unified manner. This allows us to provide personalized privacy with strong guarantees of differential privacy.
We propose PDP-SAG, a novel differentially private mechanism that uses a new tree structure, known as noisy personalized sensitive attribute generalization tree (PSAGT), to keep different noisy frequencies for each trajectory according to the generalized sensitive attribute values of trajectory data records. This tree structure allows us to answer sensitive count queries in a personalized differentially private manner.
We propose intra- and inter-consistency constraints enforcement steps to make each noisy PSAGT internally consistent and to impose consistency among different noisy PSAGTs, respectively.
We formally prove that PDP-SAG satisfies personalized differential privacy for moving objects. Also, through extensive experiments on both synthetic and real datasets, we show how PDP-SAG improves the utility of sensitive query answers and provides the appropriate privacy protection for each moving object, in comparison to the case when no personalization and generalization are permitted. Besides, we define a new evaluation metric to measure the privacy protection provided for each trajectory data record (moving object) by considering both personalization and generalization.
The rest of the paper is organized as follows. Section II reviews related work. Section III provides basic definitions and preliminaries. Section IV describes how the sensitive attribute generalization is used in our work. In Section V, we introduce PDP-SAG and analyze its privacy guarantee. The experimental results are reported in Section VI and, finally, a summary and discussion are given in Section VII.
Related Work
In this section, we review state-of-the-art mechanisms related to our work.
The concept of personalized privacy was first introduced by Xiao and Tao [13] for relational databases. Subsequently, this concept was extended to other areas of privacy protection [14], [17]–[20], including moving objects databases [15], [16].
In the past few years, some differentially private mechanisms have been proposed to satisfy personalized privacy for different applications. However, none of them is applicable to moving objects databases that contain sensitive attributes other than spatiotemporal attributes. For example, Niknami et al. [4] introduced the concept of PDP for spatial databases. They presented a differentially private mechanism to answer range counting queries over spatial databases by considering the privacy protection requirements of different subregions. Jorgensen et al. [14] also considered that different users may have different privacy protection requirements while the analyst would like to publish useful statistics that satisfy the individual privacy protection requirements; they introduced two differentially private mechanisms to address this problem. Liu et al. [21] showed that a large family of recommender systems, namely, those using matrix factorization, are well suited to differential privacy. Based on this observation, they described a fast matrix-factorization-based recommender system that allows the level of privacy protection to be calibrated for each user independently. Alaggan et al. [22] defined the notion of heterogeneous differential privacy (HDP), which captures the variation of privacy expectations among users as well as across different pieces of information related to the same user, thereby accommodating non-uniform privacy expectations. Li et al. [23] proposed two partitioning mechanisms, namely, privacy-aware partitioning and utility-based partitioning, to achieve PDP. The goal of privacy-aware partitioning is to minimize the waste of privacy budgets, while the goal of utility-based partitioning is to maximize the utility of target computations.
Deldar and Abadi [11] proposed PLDP-TD, a personalized-location differentially private algorithm for simple trajectory databases where sensitive attribute values are not associated with trajectories. PLDP-TD considers that different locations may have different privacy protection requirements. However, it does not explicitly consider personalization based on the privacy protection requirements of moving objects.
Furthermore, some non-differentially private mechanisms have been proposed to preserve privacy for moving objects databases that contain both spatiotemporal and non-spatiotemporal attributes. For example, Mohammed et al. [24] developed an anonymization framework that employs global suppression to achieve
Preliminaries
In this section, we give some basic definitions and preliminaries that are used throughout the paper.
A. Differential Privacy
Differential privacy (DP) has emerged as one of the strongest privacy definitions for statistical databases. The intuition is that any sequence of query answers is (almost) equally likely to occur, whether or not an individual data record is present in the database. Thus, even if an adversary observes sequences of query answers from two neighboring databases, the privacy of individual data records is still not disclosed. In the following, we define the concepts behind differential privacy.
Definition 1 (Neighboring Databases):
Two databases
Definition 2 (\varepsilon -DP):
A randomized algorithm \mathcal {A} satisfies \varepsilon -DP if, for any two neighboring databases \mathcal {D}_{1} and \mathcal {D}_{2} and any set of possible outputs O, \begin{equation*} {\Pr [\mathcal {A}(\mathcal {D}_{1})\in O]}\leq \exp (\varepsilon)\times {\Pr [\mathcal {A}(\mathcal {D}_{2})\in O]},\tag{1}\end{equation*}
From Definition 2, we conclude that the privacy guarantee provided by an
A typical mechanism for answering statistical queries under differential privacy is the Laplace mechanism, which adds random noise drawn from the Laplace distribution to the results of statistical queries. The magnitude of the noise is scaled according to the (global) sensitivity of the query function, which is a measure of the maximum possible change to query answers over any two neighboring databases.
Definition 3 (Sensitivity):
Let f be a query function that maps a database to a vector of real values. The sensitivity of f is defined as \begin{equation*} \sigma _{f}=\max _{\mathcal {D}_{1}\sim \mathcal {D}_{2}}{\|f(\mathcal {D}_{1})-f(\mathcal {D}_{2})\|_{1}},\tag{2}\end{equation*} where \mathcal {D}_{1}\sim \mathcal {D}_{2} denotes that \mathcal {D}_{1} and \mathcal {D}_{2} are neighboring databases.
Given an input database \mathcal {D} and a query function f with sensitivity \sigma _{f}, the Laplace mechanism returns \begin{equation*} \mathcal {A}(\mathcal {D})=f(\mathcal {D})+\mathrm {Lap}(\sigma _{f}/\varepsilon),\tag{3}\end{equation*} where \mathrm {Lap}(\sigma _{f}/\varepsilon) denotes a random variable drawn from the Laplace distribution with mean zero and scale \sigma _{f}/\varepsilon .
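As an illustration, the Laplace mechanism in (3) can be sketched in a few lines of Python; the inverse-CDF sampler below is a standard way to draw Laplace noise, and for a single count query the sensitivity is 1:

```python
import math
import random

def sample_laplace(scale, rng=random):
    """Draw one sample from Lap(0, scale) via inverse-CDF sampling."""
    u = rng.random() - 0.5                      # u in [-0.5, 0.5)
    u = max(min(u, 0.499999), -0.499999)        # guard against log(0)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_count(true_count, epsilon, sensitivity=1.0):
    """Eq. (3): release f(D) + Lap(sigma_f / epsilon).
    For a single count query, sigma_f = 1."""
    return true_count + sample_laplace(sensitivity / epsilon)
```

Averaged over many invocations, the released counts are unbiased: the noise has mean zero and standard deviation sqrt(2) * sensitivity / epsilon.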
One of the basic properties of differential privacy is compositionality, which means that any sequence of differential privacy computations is also differentially private. In general, there are two different types of compositionality: sequential composition and parallel composition [26].
Theorem 1 (Sequential and Parallel Compositions[26]):
Let \mathcal {A}_{1},\mathcal {A}_{2},\dots,\mathcal {A}_{k} be randomized algorithms satisfying \varepsilon _{1}-DP, \varepsilon _{2}-DP, \dots, \varepsilon _{k}-DP, respectively. The sequence of these algorithms applied to the same database satisfies (\sum _{i}\varepsilon _{i})-DP (sequential composition), while their application to disjoint subsets of the database satisfies (\max _{i}\varepsilon _{i})-DP (parallel composition).
From Theorem 1, we conclude that the privacy guarantee degrades when multiple differentially private algorithms operate on the same input. If the inputs of the differentially private algorithms are disjoint, then the total privacy guarantee will be as strong as the weakest differentially private algorithm.
B. Personalized Differential Privacy
Personalized differential privacy (PDP) [4], [11], [14] is a new notion of differential privacy that provides user-level non-uniform privacy guarantees. The main goal of PDP is to enable data owners to adjust the level of privacy protection for different users based on how much privacy guarantee they desire. To achieve this goal, PDP allocates a personal privacy budget to each user (and thus to all data records of that user) independently of the others. This is in contrast to traditional differential privacy, in which a single global privacy budget (namely, \varepsilon) is used for all data records.
Definition 4 (Personal Privacy Budget Allocation):
Let
In the real world, it may be difficult for users to directly quantify their privacy protection requirements; therefore, we assume that there is a small set of user-friendly, totally ordered privacy descriptors (e.g., Low, Medium, High, and Critical), from which each user chooses one to represent his/her privacy protection requirement. The data owner then allocates a personal privacy budget to each user and, thus, to all data records of that user, inversely proportional to his/her privacy descriptor. Formally, let \omega (p(R)) denote the weight associated with the privacy descriptor p(R) of a data record R. The personal privacy budget allocated to R is \begin{equation*} \epsilon (R)=\dfrac {\varepsilon }{\omega (p(R))},\tag{4}\end{equation*}
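For illustration, the budget allocation in (4) can be sketched as follows; the descriptor weights below are hypothetical, chosen only so that stricter descriptors carry larger weights and therefore receive smaller personal budgets:

```python
# Hypothetical descriptor weights (not from the paper): a larger weight
# yields a smaller personal privacy budget.
WEIGHTS = {"Low": 1.0, "Medium": 2.0, "High": 4.0, "Critical": 8.0}

def personal_budget(global_epsilon, descriptor, weights=WEIGHTS):
    """Eq. (4): epsilon(R) = epsilon / omega(p(R))."""
    return global_epsilon / weights[descriptor]
```

For example, with a global budget of 1.0, a "Critical" record would be answered under a personal budget of 0.125, i.e., with substantially more noise than a "Low" record.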
Definition 5 (\epsilon -PDP):
Let \mathcal {A} be a randomized algorithm. \mathcal {A} satisfies \epsilon -PDP if, for each data record R, any two databases \mathcal {D}_{1} and \mathcal {D}_{2} that are neighboring with respect to R, and any set of possible outputs O, \begin{equation*} {\Pr [\mathcal {A}(\mathcal {D}_{1})\in O]}\leq \exp (\epsilon (R))\times {\Pr [\mathcal {A}(\mathcal {D}_{2})\in O]},\tag{5}\end{equation*}
C. Personalized Moving Objects Database
A spatiotemporal database is a database that deals with real-world applications in which spatial changes occur over time. Moving objects databases are specific cases of spatiotemporal databases that represent and manage changes related to the movement of moving objects such as people, animals, and vehicles.
Given a set
Definition 6 (Subtrajectory):
A trajectory
A moving objects database may contain attributes other than spatiotemporal attributes, which are associated with trajectories. Some of these attributes, such as disease or salary, may be sensitive and need to be protected from unauthorized disclosure. Also, the data owner may assign a privacy descriptor to each trajectory of a moving objects database, proportional to the privacy protection requirement of the moving object that generated that trajectory. We refer to such a moving objects database as a personalized moving objects database.
Definition 7 (Personalized Moving Objects Database):
A personalized moving objects database \begin{align*} R=\langle X_{1},X_{2}, {\dots },X_{m}\rangle: S_{1},S_{2}, {\dots },S_{i}:A_{1},A_{2}, {\dots },A_{j}:P, \\\tag{6}\end{align*}
For convenience, we assume that each personalized moving objects database
The values of a sensitive attribute are usually divided into different categories. We can use a taxonomy tree [13] to categorize these values and, thus, to generalize them based on their categories. For example, Fig. 1 shows a simple taxonomy tree for an arbitrary sensitive attribute with five distinct values, which are represented by the leaf nodes of the taxonomy tree. Each non-leaf node of the taxonomy tree is uniquely labeled with a name showing the category of sensitive attribute values in the subtree rooted at that node.
Definition 8 (Taxonomy Tree):
Let
The sensitive attribute value of each trajectory data record is generalized with respect to its corresponding privacy descriptor. Let us assign a privacy descriptor in descending order to each level of the taxonomy tree, starting from the root. The generalized sensitive attribute value of each trajectory data record is then considered to be the label of a node with the same privacy descriptor in the taxonomy tree, covering the sensitive attribute value of that trajectory data record. For example, Table 1 shows a personalized moving objects database, where the sensitive attribute values of trajectory data records have been generalized with respect to the taxonomy tree of Fig. 1.
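The generalization step just described can be sketched as follows. The taxonomy tree, its node names, and the descriptor-to-level assignment are illustrative only (they are not the paper's Fig. 1), and only three descriptors appear because the toy tree has three levels:

```python
# A toy taxonomy tree as a child -> parent map (names are illustrative).
PARENT = {
    "Flu": "Respiratory", "Asthma": "Respiratory",
    "Ulcer": "Digestive", "Gastritis": "Digestive", "Colitis": "Digestive",
    "Respiratory": "Any", "Digestive": "Any",
}
# Level of each node; the root "Any" is at level 0, leaves at level 2.
LEVEL = {"Any": 0, "Respiratory": 1, "Digestive": 1,
         "Flu": 2, "Asthma": 2, "Ulcer": 2, "Gastritis": 2, "Colitis": 2}
# Privacy descriptors assigned to levels in descending order from the root.
DESCRIPTOR_LEVEL = {"Critical": 0, "High": 1, "Low": 2}

def generalize(value, descriptor):
    """Replace a sensitive value by its ancestor at the taxonomy level
    whose privacy descriptor matches the record's descriptor."""
    target = DESCRIPTOR_LEVEL[descriptor]
    node = value
    while LEVEL[node] > target:
        node = PARENT[node]
    return node
```

A "Low" record keeps its exact value, while a "Critical" record is generalized all the way to the root label.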
Count queries are one of the most popular queries in moving objects databases. They are the building block of many advanced data analysis tasks such as frequent pattern mining. In this paper, we define a special type of count queries, namely, sensitive count queries, which are customized for moving objects databases with sensitive attributes. With accurate answers to sensitive count queries, data recipients can obtain answers to many of their questions such as “how many trajectories with a particular sensitive attribute value share the same subtrajectory?”. In the following, we give a formal definition of a sensitive count query.
Definition 9 (Sensitive Count Query):
Let
Example 1:
Consider the personalized moving objects database of Table 1. With respect to the taxonomy tree of Fig. 1, a possible sensitive count query over this moving objects database may be
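Assuming that a subtrajectory is a contiguous subsequence of locations (our reading of Definition 6), a non-private sensitive count query can be sketched as:

```python
def is_subtrajectory(sub, traj):
    """True if `sub` occurs as a contiguous subsequence of `traj`."""
    n = len(sub)
    return any(traj[i:i + n] == sub for i in range(len(traj) - n + 1))

def sensitive_count(db, sub, sensitive_values):
    """Answer a sensitive count query: how many trajectory data records
    contain `sub` and carry a sensitive value in `sensitive_values`.
    `db` is a list of (trajectory, sensitive_value) pairs."""
    return sum(1 for traj, s in db
               if s in sensitive_values and is_subtrajectory(sub, traj))
```

PDP-SAG answers exactly this kind of query, but from the noisy, consistent PSAGTs rather than from the raw database.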
Personalized Sensitive Attribute Generalization Tree
In order to preserve the privacy of moving objects and meet the goals of personalized privacy, we define a special taxonomy tree, called a personalized sensitive attribute generalization tree (PSAGT), for each possible trajectory. Each PSAGT keeps the number of moving objects in the underlying personalized moving objects database (with generalized sensitive attribute values) that have moved along a particular trajectory. In more detail, the counting function for each node of a PSAGT is computed by considering only those trajectory data records whose generalized sensitive attribute values are covered by the label of that node.
Definition 10 (Personalized Sensitive Attribute Generalization Tree):
Let
Similar to the taxonomy tree in Subsection III-C, a privacy descriptor (in descending order) is assigned to each level of a PSAGT (and thus to all nodes at that level), starting from the root (which is assumed to be at level 0).
In a particular case that no generalization is applied to the sensitive attribute values of trajectory data records, the counting function for each node of a PSAGT will be computed by considering only those trajectory data records whose sensitive attribute values are covered by the label of that node. In this case, we refer to such a PSAGT simply as an SAGT.
Example 2:
Consider the personalized moving objects database of Table 1. Fig. 2 shows the PSAGT for the trajectory
PDP-SAG
In this section, we introduce PDP-SAG, a differentially private mechanism for personalized moving objects databases that combines sensitive attribute generalization with personalized privacy in a unified manner to provide different levels of privacy protection for different moving objects. Generally, PDP-SAG consists of three main steps: noisy trajectory tree construction, noisy PSAGTs assignment, and post-processing. In the first step, we represent a personalized moving objects database with a differentially private noisy trajectory tree. The aim of the second step is to answer sensitive count queries with personalized differential privacy guarantees; we achieve this by assigning a noisy PSAGT to each node of the noisy trajectory tree. Finally, in the third step, we enforce intra- and inter-consistency constraints on the noisy trajectory tree and its associated noisy PSAGTs to make sensitive query answers consistent with each other. Table 2 summarizes the notations used throughout the paper.
A. Noisy Trajectory Tree Construction
In this step, we construct a differentially private noisy trajectory tree, or simply a noisy trajectory tree, which is used in subsequent steps to answer all sensitive count queries under personalized differential privacy guarantees.
Definition 11 (Noisy Trajectory Tree):
Let
We construct a noisy trajectory tree similar to prior work [3], [11]. However, here we consider the global privacy budget
Example 3:
Consider the personalized moving objects database of Table 1. Fig. 3 shows a simple noisy trajectory tree (with height 2) constructed over this database. For each node, its associated trajectory and real count are placed inside, and its noisy count is placed outside the rectangle representing that node. Note that
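A minimal sketch of the construction follows. Since the paper's exact budget split is not reproduced here, we assume a uniform per-level budget of \varepsilon /h over a tree of height h, as in prior prefix-tree mechanisms [3]; the exhaustive expansion below is only practical for toy inputs:

```python
import math
import random

class TTNode:
    """A node of the (noisy) trajectory tree."""
    def __init__(self, traj):
        self.traj = traj              # the subtrajectory this node represents
        self.count = 0                # real count
        self.noisy_count = 0.0
        self.children = {}            # location -> TTNode

def _contains(traj, sub):
    """True if `sub` occurs as a contiguous subsequence of `traj`."""
    n = len(sub)
    return any(tuple(traj[i:i + n]) == sub for i in range(len(traj) - n + 1))

def _laplace(scale, rng):
    u = rng.random() - 0.5
    u = max(min(u, 0.499999), -0.499999)        # guard against log(0)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def build_noisy_trajectory_tree(db, locations, height, epsilon, rng=random):
    """Exhaustively expand all subtrajectories up to `height`, count the
    records supporting each one, and add Lap(height/epsilon) noise
    (assuming a uniform per-level budget of epsilon/height)."""
    root = TTNode(())
    scale = height / epsilon
    def expand(node, level):
        if level == height:
            return
        for loc in locations:
            child = TTNode(node.traj + (loc,))
            child.count = sum(1 for t, _ in db if _contains(t, child.traj))
            child.noisy_count = child.count + _laplace(scale, rng)
            node.children[loc] = child
            expand(child, level + 1)
    expand(root, 0)
    return root
```

Practical constructions prune nodes whose noisy counts fall below a threshold instead of materializing the full location tree.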
B. Noisy PSAGTs Assignment
In this step, we construct a PSAGT over the trajectory of each node of the noisy trajectory tree \begin{align*} \epsilon (u) \!=\! \begin{cases} \epsilon (v) & \text {if}\; u {\;\text {is the root}}, \\ \left ({\dfrac {1}{\omega (p(u))}-\dfrac {1}{\omega (p(\vartheta (u)))}}\right)\times \epsilon (v) & \text {otherwise}, \\ \end{cases}\!\!\! \\\tag{7}\end{align*}
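The per-node budget allocation in (7) can be implemented directly; the weight values used in the usage note below are illustrative:

```python
def node_budget(eps_v, weight, parent_weight, is_root):
    """Eq. (7): the personal budget of a PSAGT node, given the budget
    eps_v of its hyper-root, its own descriptor weight omega(p(u)), and
    the descriptor weight omega(p(theta(u))) of its parent."""
    if is_root:
        return eps_v
    return (1.0 / weight - 1.0 / parent_weight) * eps_v
```

With eps_v = \varepsilon /5 and weights 1, 1/4, 1/8 along a root-to-leaf path, this yields \varepsilon /5, 3\varepsilon /5, and 4\varepsilon /5, respectively.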
Theorem 2 (\epsilon -PDP):
PDP-SAG satisfies
Proof:
Let
From Definition 10, we know that when the real count of a node \begin{align*} \sum _{u\in \pi ^{*}}\epsilon (u)\!= \! \epsilon (v)\!+\!\!\sum _{u\in \pi ^{*}\setminus \{r(\tilde {\Psi }_{v})\}}\!\!\left ({\dfrac {1}{\omega (p(u))}\!-\!\dfrac {1}{\omega (p(\vartheta (u)))}}\right) \!\times \!\epsilon (v), \!\!\!\!\!\!\! \\\tag{8}\end{align*}
\begin{equation*} \sum _{u\in \pi ^{*}}\epsilon (u)=\dfrac {\epsilon (v)}{\omega (p(R))}.\tag{9}\end{equation*}
On the other hand, from Subsection V-A, we know that \begin{align*} \sum _{v\in \Pi ^{*}}\sum _{u\in \pi ^{*}}\epsilon (u)=&\sum _{v\in \Pi ^{*}}\dfrac {\epsilon (v)}{\omega (p(R))} \\=&\dfrac {1}{\omega (p(R))}\sum _{v\in \Pi ^{*}}\epsilon (v)\leq \dfrac {\varepsilon }{\omega (p(R))}.\tag{10}\end{align*}
This proof shows that PDP-SAG satisfies
Example 4:
Suppose we assign the weights \begin{align*} \epsilon (u_{1})=&\dfrac {\varepsilon }{5}, \\ \epsilon (u_{2})=&\left ({\dfrac {1}{1/4}-\dfrac {1}{1}}\right)\times \dfrac {\varepsilon }{5}=\dfrac {3\varepsilon }{5},\\ \epsilon (u_{3})=&\left ({\dfrac {1}{1/8}-\dfrac {1}{1/4}}\right)\times \dfrac {\varepsilon }{5}=\dfrac {4\varepsilon }{5}.\end{align*}
We define a number of relationships, namely, stepchild and stepparent, to relate nodes in two different noisy PSAGTs whose associated hyper-roots have a child-parent relationship in
Definition 12 (Stepchild and Stepparent):
Let
Example 5:
Consider the noisy trajectory tree of Fig. 3. Fig. 4 shows the associated noisy PSAGTs of the nodes
C. Post-Processing
To make sensitive query answers consistent with each other, the associated noisy PSAGTs of the noisy trajectory tree
Note that enforcing the inter-consistency constraints makes
We emphasize that the intra- and inter-consistency constraints enforcement steps are necessary post-processing steps whose main goal is to make sensitive query answers consistent with each other after personalization is applied and noise is added. In the following, we describe each of these steps in detail.
1) Intra-Consistency Constraints Enforcement
We perform this step in a way that the intra-consistency constraints are satisfied within each noisy PSAGT, while the initial consistent noisy node counts of the noisy PSAGT have minimum total distance from their original noisy ones. To do so, we solve a constrained optimization problem for each noisy PSAGT to obtain the initial consistent noisy counts of its nodes. Formally, let \begin{equation*} \mathop {\mathrm {minimize}} {\sum _{u\in V(\tilde {\Psi }_{v})}{\dfrac {{\left ({\bar {c}(u)-\tilde {c}(u)}\right)}^{2}}{\Omega (u)}}},\tag{11}\end{equation*}
\begin{equation*} \bar {c}(u)=\sum _{w\in \Theta (u)}{\bar {c}(w)},\tag{12}\end{equation*}
By solving the optimization problem in (11), we obtain the following recurrence relation:\begin{equation*} \bar {c}(u)=z(u)-s(u)\sum _{w\in A(u)}\dfrac {\bar {c}(w)}{\Omega (w)},\tag{13}\end{equation*}
\begin{align*} z(u)=&\begin{cases} \xi (u)\times \Omega (u) & \text {if}\; u {\;\text {is a leaf node}}, \\ \dfrac {\displaystyle \Omega (u)\times \sum \nolimits _{w\in \Theta (u)} z(w)}{\displaystyle \Omega (u)+\sum \nolimits _{w\in \Theta (u)} s(w)} & \text {otherwise}, \end{cases} \tag{14}\\ s(u)=&\begin{cases} \Omega (u) & \text {if}\; u {\;\text {is a leaf node}}, \\ \dfrac {\displaystyle \Omega (u)\times \sum \nolimits _{w\in \Theta (u)} s(w)}{\displaystyle \Omega (u)+\sum \nolimits _{w\in \Theta (u)} s(w)} & \text {otherwise}, \end{cases}\tag{15}\end{align*}
\begin{equation*} \xi (u) = \begin{cases} \dfrac {\tilde {c}(u)}{\Omega (u)} & \text {if}\; u {\;\text {is the root}},\\ \xi (\vartheta (u))+\dfrac {\tilde {c}(u)}{\Omega (u)} & \text {otherwise}. \end{cases}\tag{16}\end{equation*}
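Under our reading that A(u) in (13) denotes the set of proper ancestors of u, the recurrences (13)–(16) can be implemented with one top-down pass for \xi , one bottom-up pass for z and s, and a final top-down pass for the consistent counts. This is a sketch, not the paper's code:

```python
class PNode:
    """A PSAGT node for the intra-consistency step."""
    def __init__(self, noisy, omega, children=None):
        self.noisy = noisy            # original noisy count c~(u)
        self.omega = omega            # weight Omega(u) in objective (11)
        self.children = children or []
        self.xi = self.z = self.s = self.bar = 0.0

def intra_consistency(root):
    """Solve (11) subject to (12) via the recurrences (13)-(16)."""
    def topdown_xi(u, parent_xi):                 # Eq. (16)
        u.xi = parent_xi + u.noisy / u.omega
        for w in u.children:
            topdown_xi(w, u.xi)
    def bottomup_zs(u):                           # Eqs. (14)-(15)
        if not u.children:
            u.z = u.xi * u.omega
            u.s = u.omega
            return
        for w in u.children:
            bottomup_zs(w)
        zsum = sum(w.z for w in u.children)
        ssum = sum(w.s for w in u.children)
        u.z = u.omega * zsum / (u.omega + ssum)
        u.s = u.omega * ssum / (u.omega + ssum)
    def topdown_bar(u, anc_sum):                  # Eq. (13)
        u.bar = u.z - u.s * anc_sum
        for w in u.children:
            topdown_bar(w, anc_sum + u.bar / u.omega)
    topdown_xi(root, 0.0)
    bottomup_zs(root)
    topdown_bar(root, 0.0)
    return root
```

On a two-level tree with unit weights this reduces to the familiar least-squares tree consistency: the parent's consistent count equals the sum of its children's consistent counts.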
2) Inter-Consistency Constraints Enforcement
We perform this step in a way that the inter-consistency constraints are satisfied among the noisy PSAGT of each internal node of \begin{equation*} \bar {\bar {c}}(u)=\sum _{w\in \Phi (u)}{\bar {\bar {c}}(w)},\tag{17}\end{equation*}
Let us assume that the root of \begin{align*} \bar {\bar {c}}(u) = \begin{cases} \dfrac {\bar {c}(u)}{\displaystyle \sum \nolimits _{w\in \Phi (\varphi (u))}\bar {c}(w)}\times \bar {\bar {c}}(\varphi (u)) & \text {if}\; u {\;\text {is a leaf node}}, \\ \displaystyle \sum \nolimits _{w\in \Theta (u)}{\bar {\bar {c}}(w)} & \text {otherwise}, \end{cases} \\\tag{18}\end{align*}
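A simplified sketch of (18): leaves receive proportional shares of their stepparent's final count, and internal nodes are then recomputed as sums of their children. The stepparent/stepchild bookkeeping across PSAGTs is abstracted away here:

```python
class GNode:
    """A PSAGT node carrying its final consistent noisy count."""
    def __init__(self, children=None, final=0.0):
        self.children = children or []
        self.final = final            # set directly on leaves

def scale_leaves(leaf_bars, stepparent_final):
    """Eq. (18), leaf case: split the stepparent's final count among its
    stepchildren in proportion to their intra-consistent noisy counts."""
    total = sum(leaf_bars)
    return [b / total * stepparent_final for b in leaf_bars]

def finalize_psagt(node):
    """Eq. (18), internal case: once the leaves have their final counts,
    every internal node becomes the sum of its children's final counts."""
    if node.children:
        node.final = sum(finalize_psagt(w) for w in node.children)
    return node.final
```

The proportional split guarantees that the stepchildren's final counts sum exactly to the stepparent's, which is the inter-consistency constraint of Lemma 1.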
Algorithm 1 shows the pseudo-code of the post-processing step that takes a noisy trajectory tree
Algorithm 1 Post-Processing
for each non-root node
for each node
Compute the temporary value
end for
for each node
Compute the temporary values
end for
for each node
Compute the initial consistent noisy count
end for
end for
for each internal node
for each node
if the parent of
Set the final consistent noisy count
else
Compute the final consistent noisy count
end if
end for
end for
Example 6:
Suppose that the noisy PSAGTs in Fig. 5a are obtained by enforcing intra-consistency constraints to the noisy PSAGTs of Fig. 4. Fig. 5b shows the final consistent noisy PSAGTs, where the final consistent noisy count of each node is obtained from (18).
The consistent versions of the noisy PSAGTs of Fig. 4: (a) the intra-consistent noisy PSAGTs, (b) the final consistent noisy PSAGTs.
Lemma 1:
The final consistent noisy counts of nodes of all mature noisy PSAGTs satisfy the inter-consistency constraints.
Proof:
Suppose \begin{align*} \sum _{w\in \Phi (u)}{\bar {\bar {c}}(w)}=&\sum _{w\in \Phi (u)}\dfrac {\bar {c}(w)}{\sum _{w'\in \Phi (\varphi (w))}{\bar {c}(w')}}\times {\bar {\bar {c}}(\varphi (w))} \\=&\dfrac {\sum _{w\in \Phi (u)}{\bar {c}(w)}}{\sum _{w'\in \Phi (u)}{\bar {c}(w')}}\times \bar {\bar {c}}(u) \\=&\bar {\bar {c}}(u).\tag{19}\end{align*}
Now, let us assume that \begin{equation*} \bar {\bar {c}}(u)=\sum _{w\in \Theta (u)}{\bar {\bar {c}}(w)}.\tag{20}\end{equation*}
By the induction hypothesis, for each child \begin{equation*} \bar {\bar {c}}(w)=\sum _{w'\in \Phi (w)}{\bar {\bar {c}}(w')}.\tag{21}\end{equation*}
By substituting (21) into (20), we obtain \begin{align*} \bar {\bar {c}}(u)=&\sum _{w\in \Theta (u)}\sum _{w'\in \Phi (w)}{\bar {\bar {c}}(w')} \\=&\sum _{w\in \Phi (u)}\sum _{w'\in \Theta (w)}{\bar {\bar {c}}(w')} \\=&\sum _{w\in \Phi (u)}{\bar {\bar {c}}(w)}.\tag{22}\end{align*}
Experiments
In this section, we describe the experiments performed to evaluate the performance of PDP-SAG in terms of relative error and privacy breach. Our experiments were conducted in Python and run on a PC with an Intel Core i7 3.6 GHz CPU and 16 GB RAM.
A. Experimental Setup
We performed our experiments using the following synthetic and real moving objects datasets:
City80K dataset [12]. This dataset simulates the routes of 80,000 citizens in a metropolitan area with 26 city blocks in 24 hours.
Geolife dataset [27]. This dataset collects the GPS trajectories of 182 users in Beijing, China, during a period of over five years. We choose the trajectories whose points (latitude-longitude coordinates) are between [39.4, 40.8] latitude and [115.8, 117.4] longitude, and break a trajectory if the interval time between its subsequent and current points is larger than one minute. In this way, we obtain approximately one million trajectories. Since each recorded trajectory consists of a sequence of points in a continuous spatial domain, we discretize the spatial domain by partitioning it into a 64\times 64 grid and then consider each grid cell as a location.
Taxi dataset (http://sensor.ee.tsinghua.edu.cn). This dataset contains approximately 5.2 million trajectories of 8602 taxi cabs in Beijing, China. The trajectory data cover a region of Beijing restricted between [39.8, 40.1] latitude and [116.1, 116.6] longitude. Similar to the Geolife dataset, we discretize its spatial domain into a finite number of grid cells and then consider each grid cell as a location. However, due to a smaller region size, we consider a 16\times 16 grid.
Similar to prior work [3], [11], [28], we set the maximum trajectory length
B. Evaluation Metrics
We used two different metrics, namely, relative error and privacy breach, to evaluate how PDP-SAG makes a trade-off between utility and privacy.
1) Relative Error
This metric measures the quality of the noisy answer to each sensitive count query by considering its relative error with respect to the true answer. Formally, let c_{\mathcal {D}}(Q) and \tilde {c}_{\mathcal {D}}(Q) denote the true and noisy answers to a sensitive count query Q over a moving objects database \mathcal {D}, respectively. The relative error of Q is defined as \begin{equation*} \mathcal {E}(Q)=\dfrac {|\tilde {c}_{\mathcal {D}}(Q)-c_{\mathcal {D}}(Q)|}{\max {\left \lbrace{ c_{\mathcal {D}}(Q),\delta }\right \rbrace }}\times 100,\tag{23}\end{equation*} where \delta is a sanity bound that mitigates the effect of queries with excessively small true answers.
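The metric in (23) is straightforward to implement:

```python
def relative_error(noisy, true, delta):
    """Eq. (23): relative error in percent; the sanity bound `delta`
    keeps the denominator away from very small true counts."""
    return abs(noisy - true) / max(true, delta) * 100.0
```

For example, a noisy answer of 90 against a true answer of 100 yields a relative error of 10%, while a query with a true answer of 0 is scored against the sanity bound instead.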
2) Privacy Breach
This metric measures the privacy protection provided for each trajectory data record by considering both the personal privacy budget allocated to its corresponding moving object and the amount of generalization applied to its sensitive attribute value. The lower the personal privacy budget or the greater the generalization of the sensitive attribute value, the lower the privacy breach of a trajectory data record will be. Formally, given a personalized moving objects database with the generalization taxonomy \tilde {\Upsilon }, the privacy breach of a trajectory data record R is defined as \begin{equation*} \mathcal {B}(R\mid \tilde {\Upsilon })=\dfrac {\tanh (\epsilon (R))}{|g(R)|}\times 100,\tag{24}\end{equation*} where \epsilon (R) is the personal privacy budget allocated to R and g(R) is the set of sensitive attribute values covered by the generalized sensitive attribute value of R.
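Under the reading that |g(R)| counts the sensitive values covered by the generalized value, (24) can be sketched as:

```python
import math

def privacy_breach(personal_eps, covered_values):
    """Eq. (24): tanh(eps(R)) / |g(R)| * 100, where `covered_values` is
    taken here as the number of sensitive values covered by the
    generalized sensitive attribute value of the record."""
    return math.tanh(personal_eps) / covered_values * 100.0
```

The tanh term bounds the budget contribution in (0, 1), so the breach approaches 100% only for a record with a large personal budget and no generalization.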
C. Experimental Results
We cannot directly compare PDP-SAG with traditional differentially private mechanisms, because none of them consider moving objects databases that have sensitive attributes other than spatiotemporal attributes and, as we said before, PDP-SAG is the first mechanism that combines personalized differential privacy with sensitive attribute generalization in moving objects databases. For this reason, we performed two different sets of experiments.
In the first set of experiments, we mainly studied the impact of personalization and generalization on the average relative error and average privacy breach of PDP-SAG. To do this, we implemented a simplified version of PDP-SAG, called DP-SA, in which all moving objects are assumed to have the same privacy protection requirement and sensitive attribute generalization is not permitted. DP-SA is similar to PDP-SAG, except that it allocates the same privacy budget
To compare the average relative error of PDP-SAG with that of DP-SA, we constructed a number of sensitive count query sets on each moving objects dataset, each with a different maximum query size (namely, 1, 2, 3, 4, and 5 for the City80K dataset, and 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20 for the Geolife and Taxi datasets) and each containing 10,000 randomly-generated sensitive count queries. Each location in a sensitive count query was uniformly drawn from the set of locations of the underlying dataset. Furthermore, we randomly selected a subset of sensitive attribute values for each sensitive count query with respect to the constructed taxonomy tree. Also, similar to prior work [3], we fixed the maximum height of the noisy trajectory tree to 5. To answer queries of size larger than 5, we iteratively joined all joinable subtrajectories [3]. Figs. 6a to 6l show the obtained results (on a logarithmic scale) for different sensitive count query sets on City80K, Geolife, and Taxi under different global privacy budgets. From the figures, we observe that the average relative error of PDP-SAG is substantially smaller than that of DP-SA. This is because, by allocating more personal privacy budget to trajectory data records whose corresponding moving objects have lower privacy protection requirements, PDP-SAG can achieve a smaller relative error than DP-SA, in which all trajectory data records have the same privacy budget. Therefore, we conclude that DP-SA degrades the quality of sensitive query answers by applying excessive privacy protection to moving objects with low privacy protection requirements. Moreover, as expected, the average relative error of both PDP-SAG and DP-SA decreases as the global privacy budget increases, because less noise is added to sensitive query answers.
Also, the results show that the average relative error decreases as the maximum query size increases, because the number of queries that do not occur in the original moving objects database increases.
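As a concrete illustration, the average relative error reported above can be computed as follows. This is a minimal sketch, not the authors' code; in particular, the sanity bound of 0.1% of the database size used to bound the denominator is an assumption, following common practice for count queries over sequential databases [3].

```python
# Hypothetical sketch of the average-relative-error metric used to compare
# PDP-SAG and DP-SA. The sanity bound (0.1% of the database size) that
# bounds the denominator from below is an assumption for illustration.
def relative_error(true_count: int, noisy_count: float, sanity_bound: float) -> float:
    """Relative error of a single count query; the denominator is bounded
    below by `sanity_bound` to avoid division by very small true counts."""
    return abs(noisy_count - true_count) / max(true_count, sanity_bound)

def average_relative_error(true_counts, noisy_counts, db_size: int) -> float:
    """Average relative error over a query set, with sanity bound 0.1% of |D|."""
    s = 0.001 * db_size
    errors = [relative_error(t, n, s) for t, n in zip(true_counts, noisy_counts)]
    return sum(errors) / len(errors)
```

Under this metric, queries that do not occur in the database (true count 0) contribute an error proportional only to the noise magnitude over the sanity bound, which is consistent with the drop in average relative error for larger maximum query sizes observed above.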
Fig. 6. Comparison of the average relative error (in percent) of PDP-SAG with that of DP-SA for different sensitive count query sets on City80K, Geolife, and Taxi.
We also compared the average privacy breaches of trajectory data records (moving objects) with different privacy descriptors in PDP-SAG with the privacy breach of each trajectory data record (moving object) in DP-SA. The reason for this comparison is that all trajectory data records in DP-SA have the same privacy breach, whereas different trajectory data records in PDP-SAG may have different privacy breaches, depending on their privacy descriptors. Table 3 reports the obtained results for City80K, Geolife, and Taxi under different global privacy budgets. From the table, we observe that PDP-SAG yields a lower privacy breach than DP-SA for trajectory data records with higher privacy descriptors (namely, “High” and “Critical”) and, therefore, provides stronger privacy protection for their corresponding moving objects. This is because PDP-SAG applies a high degree of generalization to the sensitive attribute of these trajectory data records, whereas DP-SA applies no generalization. Furthermore, the average privacy breach of trajectory data records with a particular privacy descriptor in PDP-SAG is almost the same across datasets. The reason is that, as mentioned before, the distribution of privacy descriptors in all these datasets follows a Zipf distribution with the same shape parameter. Therefore, the percentage of trajectory data records having a particular privacy descriptor is almost the same in different datasets, which makes those trajectory data records have roughly the same average privacy breach.
To study how PDP-SAG reacts to increasing or decreasing the number of highly privacy-conscious moving objects, we also considered two other values of the shape parameter of the Zipf distribution used to assign privacy descriptors.
The average privacy breach of trajectory data records in City80K for different distributions of privacy descriptors under different global privacy budgets. Almost the same results hold for trajectory data records in Geolife and Taxi.
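The Zipf-based assignment of privacy descriptors described above can be sketched as follows. This is an illustrative sketch rather than the authors' code: the descriptor ranking, the “Low” and “Medium” labels, and the default shape parameter are assumptions (the text only names “High” and “Critical”), and varying `shape` mimics the experiment that changes the number of highly privacy-conscious moving objects.

```python
import random

# Four-level privacy descriptors; "Low" and "Medium" are assumed names,
# and the rank order (most common first) is an assumption for illustration.
DESCRIPTORS = ["Low", "Medium", "High", "Critical"]

def zipf_weights(n: int, shape: float) -> list:
    """Normalized Zipf weights: rank k gets weight proportional to 1 / k**shape."""
    raw = [1.0 / (k ** shape) for k in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def assign_descriptors(num_objects: int, shape: float = 1.0, seed: int = 0) -> list:
    """Draw a privacy descriptor for each moving object from a Zipf distribution.
    A larger `shape` concentrates mass on low descriptors, i.e., fewer highly
    privacy-conscious moving objects; a smaller `shape` does the opposite."""
    rng = random.Random(seed)
    weights = zipf_weights(len(DESCRIPTORS), shape)
    return rng.choices(DESCRIPTORS, weights=weights, k=num_objects)
```

Because the same shape parameter is used for all datasets, the fraction of records with each descriptor is nearly identical across City80K, Geolife, and Taxi, which explains the nearly equal average privacy breaches per descriptor reported above.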
In the second set of experiments, we considered simple count queries instead of sensitive ones. A simple count query is a sensitive count query that does not contain any sensitive attribute value. Clearly, PDP-SAG can answer such count queries through the roots of the noisy PSAGTs. Considering simple count queries allowed us to fairly compare the average relative error of PDP-SAG with that of the algorithm proposed in [3], referred to as N-gram, which is the most popular differentially private algorithm for answering simple count queries over sequential databases, including moving objects databases. Table 5 reports the obtained results for different simple count query sets on City80K, Geolife, and Taxi under different global privacy budgets. From the table, we observe that the average relative error of PDP-SAG is better than or as good as that of N-gram. Although our main goal in presenting PDP-SAG is to answer sensitive count queries issued on a personalized moving objects database in a personalized differentially private manner, this experiment shows that PDP-SAG also performs well on simple count queries, compared to traditional differentially private algorithms.
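For intuition, answering a simple count query under ε-differential privacy ultimately reduces to perturbing a true count with Laplace noise. The following is a minimal sketch of that baseline idea only, not the PSAGT machinery of PDP-SAG or the N-gram algorithm, assuming the standard sensitivity of 1 for count queries.

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) via the inverse-transform method."""
    u = rng.random() - 0.5  # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_count(true_count: int, epsilon: float, seed: int = 0) -> float:
    """Answer a count query under epsilon-DP with the Laplace mechanism.
    A count query has sensitivity 1, so the noise scale is 1 / epsilon."""
    rng = random.Random(seed)
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller ε (stronger privacy) yields a larger noise scale and hence a larger relative error, matching the trend in Table 5 where errors shrink as the global privacy budget grows.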
Conclusion and Discussion
Preserving privacy for moving objects databases that contain sensitive attributes other than spatiotemporal attributes has been an issue of great interest and concern in recent years. In this paper, we have presented a novel mechanism, called PDP-SAG, that addresses this issue in the context of personalization under the strong privacy model of differential privacy. PDP-SAG generalizes the sensitive attribute value of each trajectory data record based on the privacy protection requirement of its moving object and then answers sensitive count queries in a personalized (non-uniform) differentially private manner.
PDP-SAG addresses two main drawbacks of existing differentially private mechanisms for moving objects databases: (1) lack of attention to the privacy protection requirements of moving objects when allocating privacy budgets to them, and (2) lack of applicability to moving objects databases that contain sensitive attributes other than spatiotemporal attributes. To address these drawbacks, PDP-SAG generalizes the sensitive attribute values of trajectory data records (moving objects) based on their privacy descriptors and allocates more personal privacy budget to trajectory data records (moving objects) with lower privacy descriptors, to avoid excessive privacy protection. It then constructs a set of noisy PSAGTs that maintain different noisy frequencies for each trajectory, according to the generalized sensitive attribute values of the trajectory data records in which that trajectory occurs. Also, PDP-SAG enforces intra- and inter-consistency constraints on the noisy PSAGTs to make each noisy PSAGT internally consistent and different noisy PSAGTs externally consistent with each other.
Extensive experiments reveal that PDP-SAG improves the utility of sensitive query answers and provides stronger privacy protection for moving objects with higher privacy protection requirements, compared with a simplified version, called DP-SA, in which neither personalization nor generalization is permitted.
As future work, we plan to develop sensitive attribute generalization for other types of databases and other types of data mining tasks while achieving personalized differential privacy.