Introduction
Most organizations, such as hospitals, banks, insurance companies, and supermarkets, collect relevant customer/subscriber data to improve service quality (SQ). Apart from these physical organizations, a vast amount of user data is collected by virtual platforms such as social network (SN) service providers due to the extensive use of SNs all around the world. With the significant advancement in information and communication technologies (ICT), SNs enable people to interact with their friends, make new friends, seek information about relevant subject matter or jobs, spread reliable information at low cost, and entertain themselves by watching digital content. Meanwhile, the SNs collect and store relevant data about their users during service provisioning and at the time of account creation (i.e., joining the SNs). This collected data often contains information about users’ activities, demographics, finances, hobbies, locations, perceptions, interests, preferences, political and religious views, online communities, and opinions. Furthermore, most users readily post other valuable data, including preferences in music, viewing choices, and social problems such as an epidemic outbreak. Research has shown that analysis of this collected data with advanced data mining tools can assist organizations in improving SQ significantly. For instance, it allows them to understand social trends, people’s sentiments and behaviors, and factors causing a certain disease outbreak. Accordingly, such information can be leveraged for many scientific or business objectives, including targeted advertisement, relevant content recommendations, and effective decision making [1], [2]. Although data sharing brings innovation and enables better decision making, it may also jeopardize users’ privacy due to the presence of sensitive information in the data [3].
Before sharing users’ data with researchers or third parties, data owners must ensure that users’ private information is protected. This is typically done via data anonymization, which transforms the original data by applying certain operations on it to effectively protect users’ privacy without degrading the utility of the anonymized data [4]. Privacy preserving data publishing (PPDP) provides a set of models, tools, and methods to safeguard against the privacy threats that emerge from releasing data to data miners or analysts [5]. In recent years, PPDP has received considerable attention from the research community, and many approaches have been proposed for both SN and tabular data anonymization [6]–[10]. There are two well-known settings of PPDP, non-interactive and interactive [11]. In the former setting, the data owner publishes the complete dataset in an anonymized form after applying some modifications to the original data. In the latter setting, the data owner does not publish the whole dataset in a sanitized form; instead, the data owner provides an interface through which data miners may pose different statistical queries about the underlying data and get (possibly noisy) answers.
In recent years, SN data has also been shared with data-miners for accomplishing multiple scientific and business objectives due to the phenomenal growth in SN use around the globe [23]. SN data is mostly represented in a graph form.
The existing surveys related to PPDP cover important aspects such as anonymization techniques, anonymization operations, privacy models, data anonymity frameworks, and evaluation metrics employed by PPDP mechanisms. Fung et al. [31] systematically summarized and evaluated different approaches used in PPDP. Rajendran et al. [32] explained three prominent and widely used anonymization models in the medical field.
This paper presents a comprehensive survey on recent anonymization techniques used for both SN and relational data publishing. Specifically, our review explains anonymization approaches related to information privacy protection. The contributions of this review article in the field of PPDP are summarized as follows: (i) it presents state-of-the-art anonymization techniques used for both SN (i.e., social graph) and relational (i.e., tabular) data, along with fundamental concepts and ideas related to table and graph data anonymization; (ii) it systematically categorizes the existing anonymization techniques into relational and structural anonymization, and presents an up-to-date, thorough review of existing anonymization techniques and the metrics used for their evaluation; (iii) it describes the anonymization techniques that have been proposed to solve privacy problems in application-specific scenarios of SNs (e.g., collaborative filtering, topic and context modeling, and community clustering); (iv) it presents various methods and items that are exploited by malevolent adversaries for users’ re-identification across SNs; (v) it explains various challenges faced by researchers while devising new anonymization methods for tabular and SN data; (vi) it provides new insights into the privacy problems in the future computing paradigm that will be helpful in devising more secure anonymization methodologies; and (vii) it discusses promising future research directions in the field of PPDP that need further development and research from both academia and industry. Through this comprehensive overview, we hope to provide a solid foundation for future studies in the PPDP area.
The remainder of this review paper is organized as follows. Section II explains the background regarding privacy types, an overview of tabular and graph data, types of user attributes, operational utility levels of SN data, privacy areas in SNs, and privacy threats that occur in both SN and tabular data publishing. Section III presents the dilemma of PPDP and explains its principal concepts and phases. Section IV discusses the state-of-the-art relational anonymization techniques used for tabular data. Section V discusses the recent structural anonymization techniques used for SN data. A summary of the research presented in this article and a discussion of the privacy problems in the future computing paradigm are provided in Section VI. Section VII discusses promising open research directions/problems in the PPDP area. Finally, Section VIII concludes the paper.
Background
Privacy is all about keeping personal information away from public access. Privacy is needed for personal autonomy, individualism, and respect. There are four types of privacy: information, bodily, territorial, and communication [46]. Information privacy is about collecting, managing, analyzing, and publishing personal data. Bodily privacy is related to physical harm from any kind of invasive procedure/measure. Communication privacy refers to the privacy of any form of communication, such as phone calls or e-mails. Territorial privacy refers to placing boundaries on intrusion into a locality. This survey focuses on information privacy, which encompasses systems/infrastructures that collect, analyze, process, and publish user data. Concisely, we present the various anonymization approaches that have been proposed to anonymize users’ data, which can be in either graph or table form. An overview of user data in relational and graph form is presented in Figure 1. In relational data (Figure 1(a)), each tuple contains four types of attributes about users: direct identifiers (DI), non-sensitive attributes (NSA), quasi identifiers (QIs), and sensitive attributes (SA). In contrast, the SN data shown in Figure 1(b) represents user information via node/edge labels. The relevant background about both types of data (i.e., tabular and graph) is covered in the subsequent subsections.
Overview of the relational (i.e., tabular) and social network (i.e., graphs) users data.
A. Background About the Relational/Tabular Data
The original user’s data
Description about the types of user’s attributes and their handling in an anonymization process.
As shown in Figure 2, there exist three classes of privacy threats that can occur during published data analysis [47], which are explained below:
Identity disclosure (i.e., unique identification): It is a well-known privacy threat in PPDP. It occurs when an adversary can correctly identify an individual in a privacy-preserved published dataset. Generally, an attacker uses information gathered from external sources (e.g., voter registration lists, online repositories, and factual information) to identify an individual uniquely.
Attribute disclosure (i.e., private information disclosure): This type of privacy threat occurs when an individual is linked with information about his/her SA. For example, the information can be the person’s value for the SA (i.e., crime in Figure 1(a), or salary in Figure 1(b)). This type of threat can be easily launched on imbalanced datasets (i.e., datasets lacking heterogeneity in the SA’s values).
Membership disclosure (i.e., presence/absence disclosure): This threat occurs when an adversary can deduce that an individual’s record is present/absent in the published dataset with a very high probability. Researchers have reported many interesting scenarios in which the protection from the membership disclosure is imperative [48], [49].
Generalization: This operation transforms the original QI values into less-specific but semantically consistent values during the anonymization process. For example, the value 25 of the QI age can be generalized to an interval [25 – 30] or to < 30. This operation relies on the taxonomy of each QI. The existing generalization schemes can be classified into five types: sub-tree generalization, full-domain generalization, unrestricted sub-tree generalization, cell generalization, and multidimensional generalization.
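For illustration, the following minimal Python sketch generalizes a numeric QI (age) into fixed-width intervals; the records, the interval width, and the helper name are illustrative assumptions rather than a scheme from the surveyed literature.

```python
# Minimal sketch of generalization for a numeric QI (age).
# Records, interval width, and helper names are illustrative assumptions.
def generalize_age(age, width=5):
    """Map an exact age to a semantically consistent interval, e.g., 25 -> '[25-30)'."""
    lower = (age // width) * width
    return f"[{lower}-{lower + width})"

records = [
    {"age": 25, "zip": "47677", "disease": "flu"},
    {"age": 27, "zip": "47602", "disease": "cancer"},
    {"age": 31, "zip": "47678", "disease": "flu"},
]

anonymized = [{**r, "age": generalize_age(r["age"])} for r in records]
for row in anonymized:
    print(row)   # e.g., {'age': '[25-30)', 'zip': '47677', 'disease': 'flu'}
```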
Suppression: This operation hides an original QI value with a special value (i.e., ’*’). For example, to anonymize the value 25 of the QI age using the suppression operation, the digit 5 can be replaced with an asterisk, resulting in 2* as the suppressed value of the QI. Record, value, and cell suppression are the three most widely used suppression variants in tabular data anonymization.
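A minimal sketch of cell suppression is given below; the example values and the number of retained characters are assumptions made only for illustration.

```python
# Minimal sketch of cell suppression: hide the trailing characters of a QI
# value with '*'. The example values and `keep` parameter are assumptions.
def suppress(value, keep=1):
    """Keep the first `keep` characters and replace the rest with '*'."""
    s = str(value)
    return s[:keep] + "*" * (len(s) - keep)

print(suppress(25))               # '2*'    (age 25 -> 2*)
print(suppress("47677", keep=3))  # '476**' (ZIP code partially suppressed)
```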
Permutation: In this operation, the records are partitioned into several groups, and the values of the SA are shuffled within each group. Hence, the association between the SA and the QIs is broken within each group. This operation may yield inaccurate analysis in terms of anonymous data utility, but users’ privacy is significantly preserved.
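As a minimal sketch of the permutation operation (the grouping key, records, and the salary SA are illustrative assumptions), the following snippet shuffles the SA values within each group:

```python
# Minimal sketch of permutation: partition records into groups (here, by a
# generalized age) and shuffle the SA values within each group, breaking the
# QI-SA association. Group key and data are illustrative assumptions.
import random
from collections import defaultdict

records = [
    {"age_group": "[20-30)", "salary": 40_000},
    {"age_group": "[20-30)", "salary": 90_000},
    {"age_group": "[30-40)", "salary": 55_000},
    {"age_group": "[30-40)", "salary": 75_000},
]

groups = defaultdict(list)
for r in records:
    groups[r["age_group"]].append(r)

random.seed(0)
for members in groups.values():
    salaries = [m["salary"] for m in members]
    random.shuffle(salaries)               # shuffle SA values inside the group
    for member, s in zip(members, salaries):
        member["salary"] = s

print(records)
```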
Perturbation: In this operation, the original data values are replaced with synthetically generated values. Moreover, the synthetic values are generated in a way that the statistical information does not differ much between the two datasets (i.e., the real and the synthetically generated datasets).
Anatomization: This operation does not apply any modification to the original data values; instead, the QIs and SA are separated into two tables. By doing so, the association between the QIs and the SA is broken, and the data is released as separate QI and SA tables. In some cases, the SA table contains the SA values and their frequencies in the anonymized dataset for effective privacy preservation.
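The following minimal sketch illustrates anatomization; the toy records, the group identifier `gid`, and the attribute names are assumptions used only for illustration.

```python
# Minimal sketch of anatomization: instead of modifying values, split the data
# into a QI table and an SA table linked only by a group identifier, and
# publish the SA values together with their per-group frequencies.
from collections import Counter

records = [
    {"gid": 1, "age": 25, "zip": "47677", "disease": "flu"},
    {"gid": 1, "age": 27, "zip": "47602", "disease": "cancer"},
    {"gid": 2, "age": 31, "zip": "47678", "disease": "flu"},
    {"gid": 2, "age": 33, "zip": "47606", "disease": "flu"},
]

qi_table = [{"gid": r["gid"], "age": r["age"], "zip": r["zip"]} for r in records]
sa_table = []
for gid in {r["gid"] for r in records}:
    counts = Counter(r["disease"] for r in records if r["gid"] == gid)
    sa_table.extend({"gid": gid, "disease": d, "count": c} for d, c in counts.items())

print(qi_table)  # published QI table
print(sa_table)  # published SA table with per-group frequencies
```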
Furthermore, in some cases, more than one operation is used jointly to anonymize a user dataset. Data publication can be done either once or multiple times, depending upon the requirements and the information consumers’ needs. Typical data publication scenarios include one-time and multiple-time publication of the microdata. In the former scenario, the data is published once, and no re-publication is made even after changes to the original data. In contrast, in the multiple-release scenario, the anonymous data is re-published even after a single operation (insert, update, delete) that changes the original data. Both scenarios have their own advantages and disadvantages for data owners and information consumers, respectively. After DIs and NSA removal from
B. Background About the Social Networks/Graphs Data
The SN user data can be modeled as a graph $G$.
$l_{1}$: Exposure of the graph structure only: In this exposure level, the data owners (e.g., SN service providers) only publish the graph structure (i.e., all profiles/labels information is removed prior to data publication). Thus, the data-miners/analysts can analyze only the graph structure without any concrete information about the users’ profiles.
$l_{2}$: Exposure of the nodes’ profiles: In this exposure level, the data owner publishes the profiles of nodes/users but hides the graph structure. For instance, the nodes’ data is stored in a table/matrix, and released for analytics and data mining purposes.
$l_{3}$: Exposure of both the graph structure and the profiles of nodes (e.g., users): In this exposure level, the data owner exposes both the graph structure and the nodes’ profiles after applying some modifications to the $G$’s structure and the users’ profiles. This level offers much higher utility to the legitimate information consumers compared to the previous two levels.
Although SN data publishing is invaluable for accomplishing multiple research and business objectives, it can be confronted with several types of privacy threats. Four well-known privacy threats that can occur after publishing the SN data are explained below:
Identity disclosure (i.e., node re-identification): It occurs when an adversary can accurately associate/identify an individual from a privacy-preserved published graph. For example, the identification of Tim (the $5th$ node) by an adversary from the anonymous version (i.e., with all node labels removed) of the $G$ given in Figure 1(b) is an example of identity disclosure (i.e., node re-identification).
Edge disclosure (i.e., relationship/connection disclosure): It reveals the relationship between users. For example, a patient-doctor relationship can be highly sensitive, and it must be protected. If a doctor is known to be an expert in cancer treatment, the relationship disclosure can lead to the inference that the patient might have cancer.
Content disclosure (i.e., vertex/edge label disclosure): It occurs when a sensitive label associated with an edge or vertex is revealed from a $G'$, and this revealed label can be directly associated with a specific individual in the original graph (i.e., $G$).
Affiliation link disclosure (i.e., whether a person $v$ belongs ($\in$) or does not belong ($\notin$) to a particular affiliation group $h$): It occurs when a link between a user $v$ and an affiliation group $h$ is revealed with confidence $\geq t$, and this revealed link can be directly associated with $v$. The $h$ can be a prosecuted political group or a group centered around an unconventional user preference (i.e., sexual/drug preferences).
Graph modification: This operation changes the original graph’s structure by adding/deleting edges or vertices. The criteria for adding/deleting the graph’s elements depend on the objectives of data publishing and the privacy protection level the data owners want. Typical graph modification techniques are of two types: constrained and non-constrained graph modification.
Graph generalization: This operation does not modify the graph structure; instead, it clusters the nodes and edges into super-nodes and super-edges. This operation works in iterations, and in each iteration the connections between the super-nodes and super-edges are established.
Graph computation (i.e., privacy-aware computation of graphs): This operation computes output data in response to the data miners’ queries. The $G$ is not released to the data miners; instead, the graph properties (degree, centralities, size, and other useful information) are computed and released.
Hybrid operation/approach: This operation employs a combination of two or more anonymization operations simultaneously. For example, graph generalization and modification can be jointly used to anonymize $G$ to produce $G'$.
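To make the graph-modification operation concrete, the sketch below perturbs a toy graph with one random edge deletion and one random edge addition (non-constrained modification) using networkx; the graph and the number of changes are illustrative assumptions, not a method from the surveyed literature.

```python
# Minimal sketch of the graph-modification operation using networkx:
# produce G' from G by randomly deleting and adding a few edges
# (non-constrained modification). The toy graph and counts are assumptions.
import random
import networkx as nx

random.seed(1)
G = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (2, 5)])

G_prime = G.copy()
# delete one existing edge at random
G_prime.remove_edge(*random.choice(list(G_prime.edges())))
# add one edge between a randomly chosen non-adjacent pair
non_edges = list(nx.non_edges(G_prime))
G_prime.add_edge(*random.choice(non_edges))

print("original edges :", sorted(G.edges()))
print("perturbed edges:", sorted(G_prime.edges()))
```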
Privacy in SN data publishing has become an active area of research in recent years. Pham et al. [54] explained that SN users’ privacy can be breached in three different ways: through users’ activities on SN sites, through the users’ data stored on the SN service providers’ servers/databases, and through privacy-preserved published SN data. All three privacy areas in SNs are visually presented in Figure 3(a). Privacy in the first two areas can be preserved through guidelines, encryption, and watermarking techniques. Meanwhile, privacy in the third area can be guaranteed only by devising/implementing new anonymization mechanisms. This article covers recent concepts, methods, and solutions concerning the third privacy area in SNs (i.e., privacy-preserved graph publishing).
Overview of three privacy areas (adopted from [54]), and types of users data collected and processed in SNs.
The user data available on SN sites comprises three types of information [55]. A taxonomy of the user data present on SNs is depicted in Figure 3(b). The first type of data is related to the user’s identity, including demographics and derived data (i.e., age). The second type of information is related to the user’s social relations (i.e., friends, friends-of-friends, activities, and online communities joined by the user). The third type is related to the content users create on SN sites; it is further divided into disclosed, entrusted, and incidental data. Disclosed data is readily made available by the SN users. Entrusted data includes the data posted by a user on another user’s profile (i.e., comments). Incidental data is the information collected/posted by other users on someone’s profile/wall. All these data sources are invaluable for detailed analytics and appropriate information collection/analysis.
Aside from the fact that all necessary and private information needs protection from malevolent adversaries, de-anonymization attacks can be launched on published SN data to infer someone’s real identity or private information. The revelation of such information can result in user profiling without their knowledge, thus targeting unsuspecting users with far-reaching implications, including targeted marketing, travel visa decisions based on race or political viewpoints, deportation, or identity theft. Therefore, mining and sharing SN data must not invade users’ privacy. To safeguard users’ privacy in SN data publishing, many privacy preserving graph publishing (PPGP) methods have been proposed. We describe the most recent anonymization solutions proposed for PPGP in Section V.
Overview of Privacy Preserving Data Publishing and Its Fundamental Phases
Privacy preserving data publishing (PPDP) provides a set of tools, methods, solutions, and frameworks for sharing valuable information with analysts/researchers without jeopardizing users’ privacy. PPDP has been extensively studied in the literature for protecting different aspects of users’ private information during published data analytics. Data is shared with analysts/researchers so that they can extract the knowledge embedded in it. Meanwhile, even when anonymous data is shared after applying the anonymization operations outlined earlier (see Subsections II-A, II-B), adversaries can still infer information about a user’s identity or private information by leveraging auxiliary information gathered from external sources. Therefore, many studies have suggested protecting users’ privacy in all stages of the information processing cycle (i.e., collection, storage, processing, release, and destruction/archival) [56]. A conceptual overview of the PPDP process is presented in Figure 4(a). In this conceptual overview, we mainly present the data collected from individuals, the different actors involved in the anonymization scenario, the anonymization techniques applied to the respective data, the anonymous data to be published for analytics/mining purposes, and the privacy breaches that can occur during published data analytics. The typical PPDP scenario involves five types of actors [57]. A brief description of each actor, along with examples, is summarized in Figure 4(b). In some cases, one actor can perform multiple roles in the PPDP scenario. For instance, data holders/owners hold the collected data and perform anonymization for its release to analysts. Sometimes, due to a lack of computing resources or knowledge, the data holder outsources the collected data for anonymization/publishing. Hence, the roles of data holder and data publisher can be played by the same actor or by two distinct actors.
Privacy preserving data publishing (PPDP): (a) conceptual overview and (b) description of the actors involved in the PPDP scenario.
The typical PPDP process comprises five fundamental phases: (i) data collection from the individuals; (ii) storage, understanding, and preparation of the collected data for anonymization; (iii) users’ data anonymization; (iv) anonymous data releasing/publishing; and (v) published data analysis for extracting embedded knowledge. Brief details of each phase of the PPDP process are presented below.
A. Phase 1: Data Collection from the Individuals
In the first phase of PPDP, relevant data is collected from individuals/users. Due to significant technological development in recent years, the amount of data generated by sensor networks, SN sites, healthcare applications, the internet, online banking, SN-integrated third-party applications, and many other offline/online companies is increasing drastically. This data is collected from individuals directly or via smart devices (i.e., cell phones, laptops, and notepads). For instance, if a patient visits a hospital for diagnosis, his/her personal information is collected for better treatment. Later, the patient’s personal information, combined with the disease information, is saved in the hospital database for secondary use [58]. Similarly, the account opening process in a bank is subject to the collection of basic as well as sensitive information about the customers. Aside from visiting an organization (i.e., a hospital or bank), in some cases, service providers collect relevant data from their users via questionnaires and interviews.
Recently, due to the significant development in ICTs, the majority of service providers have launched their own websites for data collection from their respective users/customers. Furthermore, during the account creation process on SN sites, the service providers collect and store basic information about each user. In addition, many SN sites collect valuable data about users without their consent. For example, they can collect device information, temporal information about service consumption, location information, and much other useful information. Recently, with the invention of advanced tools and technologies, many infrastructures collect and process graph data. The nine well-known graph-based data collection sources are: (i) social networks, (ii) communication data, (iii) mobility traces, (iv) healthcare and epidemiological data, (v) citation networks, (vi) collaboration networks, (vii) web graphs, (viii) autonomous system graphs, and (ix) computer networks. SN data is mostly represented as a graph.
B. Phase 2: Collected Data Storage, Understanding and Preparation for the Anonymization
Once the relevant users’ data is collected, the next phase is the storage, understanding, and preparation of the collected data for further operations (i.e., anonymization). Companies use large-scale databases for storing the collected users’ data/information. The well-known SN Facebook uses the MySQL database to store/manage many petabytes of data about users’ social activities such as shares, comments, and likes. Data understanding involves the analysis of the data types, different data representations, and
C. Phase 3: Data Anonymization
Data anonymization is a practical and widely used solution for protecting users’ privacy in data publishing. In tabular data, anonymization sanitizes the QIs’ original values to make the information less specific for privacy protection and utility enhancement. In contrast, if users’ data is given in a
1) Removal of Directly Identifiable Information from the Original Data
At the beginning of the anonymization process, any information that can identify someone directly/uniquely is removed from the data. For example, the name, social security number (SSN), email address, and cell phone number can be linked with someone’s real identity. Hence, such information is removed from the data before its anonymization. It can be removed at any stage, but earlier removal can save computing power significantly.
2) Choice of the Anonymization Technique
Many anonymization techniques have been proposed and implemented for different scenarios. In this work, our focus is on relational (i.e., tabular) data and SN (i.e., graph) data anonymization. Therefore, we categorize the existing anonymization techniques into two categories: structural and relational anonymization. The former is applied to graph data, whereas the latter is used for tabular data anonymization. Although some researchers have used relational anonymization techniques for SN data, relational techniques cannot be applied directly to SN data due to the significant differences in the information contained in graphs. Therefore, the decision about the anonymization technique depends upon the original data representation (i.e., graphs or tables) and the objectives of the data publishing.
3) Selection of the Anonymization Operation
Once the appropriate anonymization technique is chosen, the next step is to employ the relevant anonymization operation to distort the original data values or the graph structure. The anonymization operations for each technique are different, as outlined earlier (Subsections II-A, II-B). The selection of the anonymization operation depends on the data type, the anonymization technique, and the privacy and utility objectives. For example, in relational anonymization, the generalization operation retains more of the semantics of the original data compared to the suppression operation. In contrast, the suppression operation is more appropriate for users’ privacy protection compared to generalization. In addition, due to the complex structure of a
4) Enforcement of the Constraints (If Any) in an Anonymization Operation
Aside from the selection of an appropriate anonymization technique and anonymization operation, some constraints can be enforced by the data owners during data anonymization. Such constraints can concern the privacy and utility thresholds, the number of users in an equivalence class, the distribution of the users’ private values in a class/cluster, and/or the number of connections between users (i.e., degree) in a
D. Phase 4: Anonymous Data Releasing/Publishing
The output of the anonymization process is the anonymous
E. Phase 5: Published Data Analysis for Extracting Embedded Knowledge
Once anonymous data has been published, the intended recipients collect it for analysis. In the case of hospital data, medical students collect the data for their research. They can perform several kinds of analysis, such as identifying the factors causing a certain disease, the common diseases among people in certain age groups, and the symptoms of a particular disease. In addition, bank data containing information about loan return ratings is valuable for insurance companies. The SN data is suitable for digital service providers and marketing firms. Recently, many companies have been mining SN data at large scale to fulfill their scientific and business objectives. In addition, many third-party applications buy SN users’ data to fulfill their intended objectives. The published data is not only useful for understanding the causes of certain problems, but it can also help in devising new rules and patterns that are useful for marketing purposes. With the help of data mining algorithms, one can analyze the published data for actionable insights. Furthermore, user clustering and analysis are beneficial for recommendations, preference mining, targeted marketing, information diffusion, information control, and information trustworthiness. Hence, data sharing has become a routine matter for many companies/organizations due to its significant advantages in terms of improved decision making, policy enhancement, trend analysis, forecasting, prediction, and innovation.
Unfortunately, data publishing can jeopardize users’ privacy, as adversaries can obtain a copy of the published data with the aim of uniquely re-identifying people by leveraging the large amount of auxiliary information obtained from external sources. This information can be gathered from many sources, including users’ profiles on SN sites, voter registration lists, and mobility traces. Generally, adversaries possess strong programming skills and knowledge of tools. The combination of auxiliary information, knowledge of tools, advanced programming skills, and an understanding of anonymization methods enables adversaries to breach users’ privacy. Due to the advancements in ICTs, the scale and scope of data breaches are expanding. In recent years, adversaries have been focusing on the theft of user groups’ information to fulfill their intended objectives. Group identity theft can lead to negative perceptions about a certain ethnic group, discrimination based on race/religion, and loan denial to a group of people whose previous loan return rating was not satisfactory. Aside from the negative consequences of privacy breaches on people’s lives, data owners also lose their users’ trust in such circumstances. Therefore, users expect that data owners protect the privacy of their sensitive information when data is released to data-miners. Hence, academics and researchers are devising new anonymization mechanisms to deal with this growing social problem (i.e., users’ privacy protection without significantly impacting data utility/quality).
Relational Anonymization Techniques Used for Tabular Data Anonymization
The general concept of relational anonymization is to produce an anonymous table
A. $k$-Anonymity Privacy Model
The $k$-anonymity model requires that every record in the released table be indistinguishable from at least $k-1$ other records with respect to the QIs. Records sharing the same QI values form an equivalence class (EC), so each EC must contain at least $k$ records, which bounds the probability of identity disclosure by $1/k$.
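For illustration, a minimal $k$-anonymity check is sketched below; the toy table and the chosen QIs are assumptions made only for this example.

```python
# Minimal sketch of a k-anonymity check: group records by their QI values
# and verify that every equivalence class contains at least k records.
# The table and the chosen QIs are illustrative assumptions.
from collections import Counter

def is_k_anonymous(records, qis, k):
    class_sizes = Counter(tuple(r[q] for q in qis) for r in records)
    return all(size >= k for size in class_sizes.values())

table = [
    {"age": "[20-30)", "zip": "476**", "disease": "flu"},
    {"age": "[20-30)", "zip": "476**", "disease": "cancer"},
    {"age": "[30-40)", "zip": "476**", "disease": "flu"},
    {"age": "[30-40)", "zip": "476**", "disease": "flu"},
]

print(is_k_anonymous(table, qis=["age", "zip"], k=2))  # True
```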
B. $\ell$-Diversity Privacy Model
The $\ell$-diversity model addresses the weakness of $k$-anonymity against attribute disclosure: it requires that each equivalence class contain at least $\ell$ "well-represented" (e.g., distinct) values of the SA, so that an adversary who locates a user’s equivalence class still cannot confidently infer the user’s SA value.
The table in Figure 6(b) is an example of a 2-diverse partition of the table shown in Figure 6(a).
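A minimal check for (distinct) $\ell$-diversity is sketched below; the toy table and attribute names are illustrative assumptions, not the data of Figure 6.

```python
# Minimal sketch of a (distinct) l-diversity check: every equivalence class,
# defined by the QI values, must contain at least ell distinct SA values.
# The table and attribute names are illustrative assumptions.
from collections import defaultdict

def is_l_diverse(records, qis, sa, ell):
    sa_values = defaultdict(set)
    for r in records:
        sa_values[tuple(r[q] for q in qis)].add(r[sa])
    return all(len(values) >= ell for values in sa_values.values())

table = [
    {"age": "[20-30)", "zip": "476**", "disease": "flu"},
    {"age": "[20-30)", "zip": "476**", "disease": "cancer"},
    {"age": "[30-40)", "zip": "476**", "disease": "flu"},
    {"age": "[30-40)", "zip": "476**", "disease": "hepatitis"},
]

print(is_l_diverse(table, qis=["age", "zip"], sa="disease", ell=2))  # True
```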
C. $t$-Closeness Privacy Model
The $t$-closeness model further strengthens $\ell$-diversity by requiring that the distribution of the SA within each equivalence class be close to the distribution of the SA in the overall table, where the distance between the two distributions (commonly measured with the Earth Mover’s Distance) must not exceed a threshold $t$.
There exist many attacks that can be easily launched on the
D. Differential Privacy Model
Differential privacy (DP) [18] is a well-known privacy protection model with a formal mathematical definition. It is mostly used for privacy protection in the interactive setting of PPDP. It protects users’ privacy by adding noise to the original users’ data, and it does not make assumptions about the intruder scenarios. DP belongs to the semantic class of privacy models, and it yields superior privacy protection in PPDP compared to the syntactic privacy models. Considering the effectiveness of the DP model, the U.S. Census Bureau planned to use DP in its 2020 census and all future data products [62]. It has been reported in the literature that DP provides a mathematically provable guarantee of privacy preservation against many privacy attacks, such as differencing, linkage, and reconstruction attacks. The DP concept can be defined as follows: a randomized function $F$ satisfies $(\epsilon, \delta)$-DP if, for any two neighboring datasets $x$ and $y$ that differ in at most one record, and for every set of outputs $S$, \begin{equation*} Pr[F(x) \in S] \leq exp(\epsilon)Pr[F(y) \in S] + \delta \tag{1}\end{equation*}
Equivalently, when $\delta = 0$, a randomized algorithm $A$ satisfies $\epsilon$-DP if, for any two neighboring datasets $D_{1}$ and $D_{2}$ and any output $R_{s}$, \begin{equation*} \frac {Pr (A(D_{1})=R_{s})}{Pr (A(D_{2})=R_{s})} \leq e^{\epsilon } \tag{2}\end{equation*}
In the DP method, noise is typically added using the Laplace mechanism. DP mathematically ensures that any analyst/data-miner seeing the result of a differentially private analysis will reach essentially the same conclusion about any individual’s private information, whether or not that individual’s private information is included in the input to the analysis. An overview of DP is presented in Figure 8. Due to its strong privacy guarantees, the DP model has been extended by many studies for further improvements.
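As a minimal sketch of the Laplace mechanism for a counting query (sensitivity 1), the snippet below adds calibrated noise to the true answer; the dataset, the query, and the value of $\epsilon$ are illustrative assumptions rather than a specific system’s implementation.

```python
# Minimal sketch of the Laplace mechanism for an epsilon-differentially
# private counting query (sensitivity 1). Dataset, query, and epsilon
# are illustrative assumptions.
import numpy as np

def dp_count(predicate, dataset, epsilon):
    true_count = sum(1 for row in dataset if predicate(row))
    scale = 1.0 / epsilon                 # sensitivity of a count query is 1
    return true_count + np.random.laplace(loc=0.0, scale=scale)

ages = [23, 35, 41, 29, 52, 60, 33]
noisy_answer = dp_count(lambda a: a > 40, ages, epsilon=0.5)
print(f"noisy count of people older than 40: {noisy_answer:.2f}")
```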
Many recent studies have adopted the DP concept for resolving the privacy issues in both the interactive and non-interactive settings of PPDP. There exist numerous anonymization techniques which are ramifications of the four privacy models explained above. We summarize the approaches that extended the concepts of these four models in Table 1. We categorize the anonymization approaches into four main categories: (i) the
We compare the existing approaches with each other based on six different parameters. The abbreviation used in Table 1 are: ID = identity disclosure, AD = attribute disclosure, MD = membership disclosure, IL = information loss, Sy. & Se. = syntactic and semantic, Bk = background knowledge, CoD = curse of dimensionality, Pr. = probability, SA = sensitive attribute, MSA = multiple sensitive attribute, LA = linkage attack, DMAR = discovering and maintaining association rules, DA = data analysis, ECs = equivalence classes, GPS = global positioning system, and QIs = quasi identifiers.
Some studies have used combinations of more than one privacy model (e.g.,
E. Metrics Used for the Evaluation of Privacy and Utility of Relational Data Anonymization Techniques
1) Privacy Evaluation Metrics
There exist multiple ways to quantify the privacy protection offered by an anonymization algorithm. The five common methods that are employed to evaluate the effectiveness of an anonymization algorithm in terms of privacy protection are: (i) linking the anonymous and original datasets and calculating the probability of successful matches based on the QI values in both datasets. The probability value indicates the amount of privacy protection an anonymization algorithm will offer during published data analysis. In this case, it is assumed that attackers may have access to an excessive amount of auxiliary information from external sources (e.g., voter lists, online repositories, and e-commerce sites) to launch identity and attribute disclosure attacks; (ii) privacy protection evaluation in the presence of background knowledge. In this privacy evaluation metric, the data owner assumes that the attacker may possess true information about some users (e.g., Bob’s age and gender) and that he/she can explore only the corresponding ECs to infer the private information of the relevant user(s). To evaluate the effectiveness of an algorithm in this regard, the data owner can pick some instances from the original data and evaluate the algorithm’s privacy protection level by matching; (iii) privacy protection evaluation with the help of privacy-sensitive (PS) rules. In this case, the data owner can construct certain rules to evaluate privacy protection, for example, how many people aged over 40 suffer from a given disease (e.g., cancer). The revelation of sensitive knowledge patterns, and the attribute and identity disclosure of multiple users through PS rules, have a wide range of negative consequences on people’s lives; (iv) prediction of users’ SAs through the existence of private and public profiles in SNs. In this case, the data owner can quantify an algorithm’s protection level assuming that either partial or full users’ original data is known to the attacker, as some users willingly publish their QIs over different SNs. The factual information that can be learned through the knowledge gained from the
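The first metric above (linking an auxiliary dataset to the anonymized release on the QI values and estimating the chance of a correct match) can be sketched as follows; all records and attribute names are illustrative assumptions.

```python
# Minimal sketch of privacy metric (i): link an external (auxiliary) dataset
# to the anonymized release on the QI values and estimate the probability of
# a correct unique match. All records and attribute names are assumptions.
from collections import Counter

anonymized = [
    {"age": "[20-30)", "zip": "476**", "disease": "flu"},
    {"age": "[20-30)", "zip": "476**", "disease": "cancer"},
    {"age": "[30-40)", "zip": "476**", "disease": "flu"},
]
auxiliary = [{"name": "Bob", "age": "[20-30)", "zip": "476**"}]

qi = ("age", "zip")
class_sizes = Counter(tuple(r[q] for q in qi) for r in anonymized)

for person in auxiliary:
    key = tuple(person[q] for q in qi)
    matches = class_sizes.get(key, 0)
    # probability of correctly singling out this person's record
    risk = 1.0 / matches if matches else 0.0
    print(person["name"], "re-identification risk:", risk)
```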
2) Utility Evaluation Metrics
During tabular data anonymization, the original QI’s values are modified to fulfill the privacy needs, hence, the data utility degrades. There exist multiple ways to quantify the anonymous data utility offered by an anonymization algorithm. We classify the metrics used for measuring anonymous data utility into two categories: special purpose and general purpose metrics. The special purpose metrics use machine learning methods to measure the anonymous data quality. The most widely used special purpose metrics are, accuracy or error rate,
F. Privacy Preserving Dynamic Data Publication
Privacy preserving dynamic data publication (PPDDP) allows organizations to publish a dataset multiple times. PPDDP enables organizations to share up-to-date data with multiple recipients, either statically or after some modifications (i.e., update, delete, or insert). In contrast, privacy preserving static data publication (PPSDP) focuses only on one-time publication of a dataset, and many approaches have been proposed for PPSDP in the literature [12], [59], [60]. Meanwhile, PPDDP opens a new era in PPDP research, and many organizations are publishing their users’ data dynamically. Kabou et al. [108] presented a comprehensive survey about PPDDP and summarized different studies used for PPDDP. Shi et al. [109] presented a method for PPDDP using distance and information entropy concepts. The proposed method ensures that individual privacy is preserved after the data has been subjected to multiple releases.
G. Challenges in the Relational Data Anonymization
The representation of tabular data is relatively simpler than that of SN data. In relational data, each row/tuple represents one real-world entity/individual. We identify the following five challenges that make relational data anonymization difficult. First, the selection of the user attributes that are regarded as QIs. In relational data, a small subset of the user attributes is chosen as QIs. However, some QIs can behave as SAs (i.e., profession) in practice. Thus, appropriate selection of QIs prior to data sanitization is imperative. Second, over-reliance on custom-made generalization taxonomies for QI values. Relational anonymization is performed with the help of pre-defined taxonomies of the QIs. Meanwhile, these taxonomies do not truly reflect the privacy and utility trade-off. Thus, devising appropriate taxonomies with in-depth analysis of the QIs’ domain values in relational anonymization is challenging. Third, tabular data anonymization in multiple-SA (MSA) scenarios. In some cases, the tabular data contains multiple SAs about individuals. Anonymizing tabular data having MSAs is harder compared to single-SA scenarios due to the high risk of users’ private information disclosure. Fourth, quantifying the impact of QIs on both users’ privacy and anonymous data utility. Each QI affects privacy and utility differently: some QIs can be highly vulnerable in terms of privacy, and some QIs have higher utility. Thus, quantifying each QI’s statistics related to privacy and utility is very challenging in PPDP. Fifth, accurate estimation of the subjects’ re-identification risk. The existing PPDP methods often assume worst-case scenarios regarding attackers and apply heavy changes to the data that impact the shared data utility and often limit data sharing on a wider scale. Thus, accurate and novel adversarial modeling methods are needed to support PPDP in the big data era. Aside from the five potential challenges explained above, in some cases, an adversary can possibly link the whole published dataset (e.g.,
Structural Anonymization Techniques Used for the Social Networks Data Anonymization
Structural anonymization refers to modification of the structural properties of the social network (SN) data (i.e., graphs) to protect against the privacy threats that emerge from SN data publishing. Generally, SN analysts represent SN data mainly via two methods: matrices and graphs [114]. The matrix representation of SN data allows the application of computer tools and mathematical models to summarize the data and extract patterns. Since finding relevant patterns in dense graphs is extremely complex, the adjacency matrix is used as a variant of the graph in social network analysis. For instance, a
Overview of the social network representation using different forms of graphs and a matrix.
Due to their inefficacy in handling complex SN data and their inability to correctly represent the SN with users’ attribute information, matrices are rarely used to represent SN data. In contrast, graphs can properly represent the users’ attribute information along with social relations. Therefore, graphs are most widely used in SN analysis and effectively assist in proving graph-based theorems. This work uses SN data modeled as a graph for further analysis and discussion. Users’ privacy preservation in SN data publishing is very challenging compared to relational data due to the many pieces of private information contained in a
Background knowledge (BK) is a fact or information known to the adversaries about an individual or group of individuals, which can be exploited to infer the SA of the individual(s) from the
The most common technique used for SN users’ privacy preservation in PPGP is anonymization. After an in-depth synthesis of the literature [114]–[116], we present the taxonomy of the PPGP approaches, along with representative anonymization methods employed for SN data, in Figure 10. These PPGP approaches can be broadly classified into five categories, namely graph modification techniques, graph generalization/clustering techniques, privacy-aware graph computation techniques, differential-privacy-based graph anonymity techniques, and hybrid anonymization techniques. A brief description, along with relevant examples, of all five anonymity techniques used in PPGP is given in the subsequent paragraphs.
Taxonomy of privacy preserving graph publishing (PPGP) approaches used for SN data.
Due to the widespread applications of SN data, many structural anonymization techniques have been proposed for PPGP. These techniques modify the structure of the SN graph by adding/deleting vertices or edges to preserve users’ privacy. Aside from adding/deleting, in some cases, edges and vertices are switched or re-arranged in clusters to preserve users’ privacy. An overview of anonymous graphs obtained by adding vertices and edges is shown in Figure 11. In Figure 11(b), two new edges have been created (
The addition/deletion of the nodes and edges can be constrained or random (e.g., non-constrained) depending upon the scenario.
An example of the original graph
Original graph anonymization by using random (e.g., non-constrained) and constrained anonymization methods.
In contrast, in constrained anonymization, the addition/deletion of nodes/edges is bounded by some constraints (e.g., degree). Figure 12(ii)(b) shows an example of the perturbed graph
There are six basic edge and vertex modification techniques for SN data anonymization [44]: (i) edge addition, (ii) edge deletion, (iii) edge add/delete, (iv) simple edge switch, (v) double edge switch, and (vi) node addition. Most of the existing SN data anonymization methods use one (or more) of these six modification techniques during graph anonymization. A detailed taxonomy of the graph modification techniques is given in [53]. The four types of graphs that are mainly used to represent SN users’ data are: simple graphs, bipartite graphs, labelled graphs, and uncertain graphs.
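For illustration, technique (v), the double edge switch, is sketched below with networkx; the toy graph and the number of switch attempts are assumptions. The design rationale is that rewiring two disjoint edges preserves every node’s degree while still perturbing the structure.

```python
# Minimal sketch of a double edge switch: pick two disjoint edges (a, b) and
# (c, d) and rewire them to (a, d) and (c, b). Node degrees are preserved.
# The toy graph and number of switch attempts are assumptions.
import random
import networkx as nx

def double_edge_switch(G, attempts=10, seed=0):
    rng = random.Random(seed)
    H = G.copy()
    for _ in range(attempts):
        (a, b), (c, d) = rng.sample(list(H.edges()), 2)
        # only switch if it neither creates self-loops nor duplicate edges
        if len({a, b, c, d}) == 4 and not H.has_edge(a, d) and not H.has_edge(c, b):
            H.remove_edges_from([(a, b), (c, d)])
            H.add_edges_from([(a, d), (c, b)])
    return H

G = nx.cycle_graph(6)
H = double_edge_switch(G)
print(dict(G.degree()) == dict(H.degree()))  # True: degrees are unchanged
```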
In SN data anonymization, the majority of approaches are derived from concepts that were originally proposed for anonymizing tabular data. For example, the
Original graph anonymization by employing clustering/generalization based method.
A network/graph with seven nodes and two QIs (age, gender) is provided as input (Figure 13(a)); the whole network is partitioned into three clusters (
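To make the generalization/clustering idea concrete, the sketch below groups nodes of a toy graph into hand-made clusters (super-nodes) and publishes only aggregate counts; the graph and the partition are illustrative assumptions, not the method of Figure 13.

```python
# Minimal sketch of graph generalization: nodes are grouped into clusters
# (super-nodes) and only aggregate information (cluster size, intra-cluster
# edge count, inter-cluster edge count) is published. The toy graph and the
# hand-made partition are assumptions.
from itertools import combinations
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4), (6, 7)])
clusters = {"C1": {1, 2, 3}, "C2": {4, 5, 6}, "C3": {7}}

super_nodes = {c: {"size": len(members),
                   "intra_edges": sum(1 for u, v in G.edges()
                                      if u in members and v in members)}
               for c, members in clusters.items()}

super_edges = {}
for (c1, m1), (c2, m2) in combinations(clusters.items(), 2):
    weight = sum(1 for u, v in G.edges()
                 if (u in m1 and v in m2) or (u in m2 and v in m1))
    if weight:
        super_edges[(c1, c2)] = weight

print(super_nodes)   # e.g., {'C1': {'size': 3, 'intra_edges': 3}, ...}
print(super_edges)   # e.g., {('C1', 'C2'): 1, ('C2', 'C3'): 1}
```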
On the contrary, the privacy-aware graph computation and DP based approaches do not release the entire
The useful analyses provided by privacy-aware graph computation and DP-based approaches include: graph density, edge counts, relationship degrees, degree distributions, network size, centralities, closeness, counts of the sub-graphs, top
Overview of the graphs’ statistics determined by privacy-aware graph computation methods.
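As a minimal sketch of privacy-aware graph computation, the snippet below answers a single statistical query (the edge count) under edge-level DP instead of releasing the graph itself; the random graph and the value of $\epsilon$ are illustrative assumptions.

```python
# Minimal sketch of privacy-aware graph computation: instead of releasing G,
# answer a statistical query (here, the edge count) under edge-level DP by
# adding Laplace noise with sensitivity 1. Graph and epsilon are assumptions.
import numpy as np
import networkx as nx

def dp_edge_count(G, epsilon):
    return G.number_of_edges() + np.random.laplace(scale=1.0 / epsilon)

G = nx.erdos_renyi_graph(n=50, p=0.1, seed=42)
print("true edge count :", G.number_of_edges())
print("noisy edge count:", round(dp_edge_count(G, epsilon=0.5), 2))
```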
Aside from these basic statistics, applying node/edge DP approaches for publishing graph cuts and pair-wise distances between nodes is handy for data-driven applications. Furthermore, many
The hybrid anonymity approaches usually employ more than one anonymity technique to yield anonymized SN data [122]. However, the complexity of the hybrid anonymity approaches is relatively higher when
A. Metrics Used for the Evaluation of Privacy and Utility of Structural Anonymization Techniques
1) Graphs Utility Evaluation Metrics
The existing well-known graph utility evaluation metrics are: (1) degree (Deg.), (2) effective diameter (ED), (3) joint degree (JD), (4) local clustering co-efficient (LCC), (5) path length (PL), (6) closeness centrality (CC), (7) eigen vector (EV), (8) betweenness centrality (BC), (9) network constraints (NC), (10) network resilience (NR), (11) infectiousness (Infe.), (12) page rank (PR), (13) hub score (HS), (14) global clustering co-efficient (GCC) or global transitivity (GT), and (15) authority score (AS). Aside from these well-known metrics, some general purpose approaches such as accuracy, classification, information loss (IL), ratio of top influential users (RRTI), query response, and graph spectral properties have also been used to measure anonymous graph utility. Furthermore, there exist seven application-specific utility evaluation methods, which are: (1) role extraction (RX), (2) reliable email (RE), (3) minimum sized influential nodes set (MINS), (4) influence maximization (IM), (5) community detection (CD), (6) source routing (SR), and (7) sybil detection (SD). We refer interested readers to the previous work [123] for more detailed definitions and descriptions of the above metrics.
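For illustration, a few of the metrics listed above can be computed on an original graph and its anonymized counterpart to gauge utility loss; the toy graphs below (the karate club graph and a copy with one edge removed) are illustrative assumptions.

```python
# Minimal sketch of computing a few utility metrics (average degree, global
# and local clustering coefficients, betweenness centrality) on an original
# graph G and an "anonymized" version G' to compare utility loss.
import networkx as nx

G = nx.karate_club_graph()          # stand-in for the original graph G
G_prime = G.copy()                  # stand-in for an anonymized graph G'
G_prime.remove_edge(0, 1)           # pretend one edge was deleted by anonymization

for name, graph in [("G", G), ("G'", G_prime)]:
    avg_deg = sum(d for _, d in graph.degree()) / graph.number_of_nodes()
    print(name,
          "avg degree:", round(avg_deg, 3),
          "| GCC:", round(nx.transitivity(graph), 3),
          "| avg LCC:", round(nx.average_clustering(graph), 3),
          "| max BC:", round(max(nx.betweenness_centrality(graph).values()), 3))
```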
2) Graphs Privacy Evaluation Metrics
The most commonly used privacy metrics for evaluating the performance of graph anonymization algorithms are: (1) the number of re-identified nodes, (2) degree of uncertainty, (3) adversary’s success rate, (4) query error, (5) information gain, (6) amount of leaked information, (7) information surprisal, (8) privacy score, (9) association rule hiding (ARH), (10) distribution leakage, (11) prior and posterior information belief, (12) entropy leakage, (13) probabilistic anonymity, (14) downgrading classifier effectiveness, (15) membership inference analysis, and (16) accurate predictions. Wagner et al. [124] summarized eighty privacy evaluation metrics used by different PPDP algorithms. The authors categorized the privacy metrics based on four common characteristics: (i) adversary models, (ii) data sources (i.e., auxiliary information), (iii) input for computation of the metrics, and (iv) output measures. The selection of a metric depends on the data type, the objectives of data publishing, and the target application/users. However, in most cases, a single metric cannot capture the entire concept of privacy; therefore, two or more metrics are jointly used to measure the level of privacy offered by an anonymization algorithm/model. Recently, Zhao et al. [125] analyzed and discussed twenty-six different metrics used for privacy evaluation in anonymized graphs. The authors suggested that no single metric is effective for evaluating privacy protection in anonymous graphs. Therefore, the authors suggest that the strengths of multiple privacy evaluation metrics can be combined to improve the overall measurement of users’ privacy in a
B. De-Anonymization Methods Employed By the Adversaries to Jeopardize Users Privacy in SN Published Data
Due to the phenomenal growth in SNs adoption around the globe, the SN data has become more reliable for conducting research and achieving multiple business/scientific objectives. The data-miners and analysts can extract the enormous amount of information embedded in the published
Description about de-anonymization approaches used by the malevolent adversaries for privacy breaches.
Aside from the de-anonymization approaches explained above, Beigi et al. [170] summarized various seed-based and seed-free de-anonymization methods employed by adversaries for privacy breaches in published SN data. The authors provided a detailed overview of the latest SN data anonymization and de-anonymization approaches, and a theoretical analysis of graph de-anonymization. In addition, various user attribute inference methods have been reported by the authors with examples. Apart from the de-anonymization approaches explained in Figure 15 and previous studies, we summarize in Table 5 the key items that enable unique identification of individuals/groups or SA disclosure from privacy-preserved graph data publishing.
These items are usually exploited by adversaries to jeopardize users’ privacy in published SN data, i.e., to infer/predict the true identities of users or their SAs. Each item, when exploited by adversaries, assists in compromising an individual’s or a community’s privacy. Various approaches based on these items/features have been proposed to compromise people’s privacy by leveraging the data of a single SN or of multiple SNs (a.k.a. across SNs). Drawing on the reviewed graph de-anonymization techniques, the majority of the techniques enable adversaries to successfully compromise users’ privacy.
The privacy items and corresponding de-anonymization methods summarized in Table 5 pose serious threats to SN users’ privacy in PPGP. Apart from the well-known methods summarized above, users’ privacy can also be breached through information acquisition about a user’s changing interests over time and through predictions about a user’s private information via side channels of various types (i.e., interests). Recently, due to the availability of an excessive amount of auxiliary information and advanced data mining tools, the scale and scope of privacy breaches are expanding from the identification of an individual or the disclosure of an SA to the theft of group identities for achieving multiple scientific and business objectives, and to identifying communities in SNs having common characteristics or common interests for accurate recommendations.
Furthermore, when a group of SN users forms an online community, the SN service providers have access to more information, including political viewpoints, preferences, relationship status, or financial status, because a user often readily shares more about him or herself in an online community. Accordingly, this goldmine of data, when shared with data-miners, can lead to privacy breaches. In addition, the prediction of a user’s social connection information with advanced data mining tools, and the identification of individuals and user groups through content analysis and activities, have become serious challenges for SN service providers [122]. Hence, sharing SN users’ data can endanger users’ privacy in unexpected ways, and researchers are devising many domain- and attack-specific, as well as general, PPGP methods to combat this growing social problem (i.e., safeguarding SN users’ privacy).
C. Structural Anonymization Approaches Used for Application-Specific Scenarios in SN Data Publishing
In Table 4, we summarized the generic structural anonymization approaches used in PPGP. In this subsection, we summarize the recent structural anonymization approaches used in application-specific scenarios of SN data publishing. These scenarios include anonymization approaches for friend recommendation, community clustering/detection, collaborative filtering, topic modelling, and SN user behavior analysis. Aside from structural modifications to graphs, the anonymization approaches used in application-specific scenarios also consider the features of the applications for which the data is being anonymized. For example, Xu et al. [29] proposed a framework for discovering privacy-preserved communities in SNs. The proposed framework enables the formation/detection of different communities without revealing sensitive link information. In this work, we summarize the state-of-the-art structural anonymization approaches used in application-specific scenarios of SNs in Table 6. The anonymization approaches summarized in Table 6 consider the features of the related application during the anonymization process. Furthermore, some approaches have been used for multiple/heterogeneous applications due to overlapping properties/characteristics between them.
Recently, due to the phenomenal growth in opportunities offered by SN data, researchers have turned their attention to devising new and realistic anonymization methods leveraging advanced artificial intelligence (AI) techniques. Li et al. [226] designed a deep learning (DL) model that combines multiple factors such as attribute information, graph structure, and behaviour characteristics, while avoiding tedious calculation procedures, to measure privacy in SNs. The proposed model considers the deep relationship between all three factors (user attribute information, graph structure, and behaviour characteristics) to accurately obtain the privacy score.
Alemany et al. [227] devised two metrics (Audience and Reachability) for the assessment of privacy in information-sharing scenarios in SNs. The authors performed rigorous simulations on different SN topologies, considering different layers, and concluded that the network topology in SNs has a direct effect on the outreach of information. Pensa et al. [228] described a knowledge-driven approach for enhancing privacy awareness in SNs. The proposed approach has the ability to measure the privacy risk of the users and inform them whenever their privacy is breached or at risk. It also helps the exposed users to customize their privacy level semi-automatically by limiting the number of manual operations. Li et al. [229] suggested that private information on SN sites is time-sensitive, which means that information held by users who are no longer in the same environment/place as the target user may no longer be reliable/true and may have lost its value. The authors combined behavioral characteristics and structural similarity to accurately filter the user groups who hold the current private information of a target user in order to measure a user’s privacy status. Ruggero et al. [230] suggested that a user’s privacy in SNs can be influenced by many external factors (e.g., the position of the user within the social graph, the relative risk of the network). To solve this problem, the authors devised a network-aware privacy score metric that accurately measures the user’s privacy risk according to the characteristics of the network (i.e.,
A semi-supervised framework based on structural embedding for account correlation was proposed by Zhou et al. [231]. It learns the latent and structural semantics for accounts correlation between networks. It correlates accounts with high accuracy by leveraging the semantic information among accounts through random walks approach. Furthermore, the user’s identity disclosure/matches problems with limited profile items have also been reported in the literature [232], [233]. In recent years, federated learning (FL) based privacy preserving approaches have been proposed for anonymous data publishing with legitimate third-parties [234]–[238]. Furthermore, many anonymization approaches have been proposed to preserve the users’ privacy in sequential publication of the SNs data [239]–[241].
D. Challenges in Social Network Data Anonymization
SN data is usually represented as a graph, and a structural anonymization approach is applied to sanitize it before releasing it to data-miners. Anonymizing SN data is much more challenging than anonymizing tabular data due to the complex structure and the variety of information embedded in graphs about the entities (i.e., users). The three well-known challenges related to SN data anonymization are: (i) modeling the background knowledge (BK) of the adversaries (in SNs, appropriate modeling of the adversaries’ BK is harder than in tabular data because adversaries can have access to multiple pieces of information about an individual, and they can re-identify the target individual from the
Summary and Discussion About the Privacy Issues in Future Computing Paradigm
In this article, we have covered most of the concepts related to the anonymization approaches used for data owned by both physical organizations (e.g., hospitals, banks, and insurance companies) and virtual platforms (e.g., Facebook, Twitter, and LinkedIn). Specifically, we described the anonymization methods employed for two types of input data: tables and graphs. We emphasized SN data anonymization more, considering the exponential adoption of SNs around the globe by adults and the unprecedented opportunities these platforms offer in terms of business intelligence. Furthermore, data-driven technologies and big data analytics are playing a vital role in extracting embedded knowledge from unstructured data to improve SQ [242].
Apart from these two types of data (i.e., tables and graphs), a wide range of data types, such as matrices (e.g., trajectory information, market basket data, and ratings data), digital traces, logs, documents (e.g., medical prescriptions and data from disaster/disease control agencies), images, videos, text documents (e.g., reviews, blogs, and opinions), and time series data, can reveal private information in the digital landscape. Hence, enterprises and organizations are constantly exploring innovative strategies and methods to remain competitive in their markets while ensuring users' privacy [243]–[247]. Nevertheless, data policies, including privacy, intellectual property, security, and liability issues, should be addressed in all phases of person-specific data handling (e.g., collection, pre-processing, anonymization, sharing, and analytics) in order to exploit the value of big data. Decentralized anonymization methods are handy solutions for truly benefiting from data publishing without significantly impacting users' privacy [248], [249]. Now that AI techniques have become mature enough to assist in data-driven decision making, users' privacy protection has become imperative for most organizations' success [250]. Considering the applications and unprecedented opportunities of data sharing, resolving the privacy and utility trade-off in PPDP remains challenging.
In the future computing paradigm, four mainstream technologies will form the centre of the information technology (IT) world: big data, cloud computing, social networks, and the Internet of Things (IoT). These technologies have the ability to process virtually any kind of data with advanced analytics tools to extract insights from the collected data. The main drivers of these technologies are expanded internet connectivity, low-cost sensors, higher mobile adoption, and large IoT investments. In Figure 16, we present the future computing paradigm's technologies, data sources, and analytics solutions that are in use to serve mankind better than in the recent past.
Despite the benefits offered by these latest technologies, there exist many barriers such as technological fragmentation, security concerns, implementation problems, and privacy concerns. These potential barriers have been a bottleneck for the wide application and development of these technologies and have thus attracted widespread concern. Among others, privacy concerns significantly hamper the development and application of these technologies, and this field has attracted significant attention from the research community in recent years. Furthermore, many recent methodologies such as homomorphic encryption, federated learning, deep learning, and blockchain have also been used for privacy protection in the future computing paradigm (year 2020 and beyond) [251]–[258]. Butpheng et al. [259] presented various research perspectives on privacy and security within IoT-cloud-based e-Health systems; the authors outlined the benefits of such systems and analyzed the corresponding security and privacy solutions.
We summarize the various privacy issues related to the future computing paradigm, distilled from an in-depth synthesis of previous studies, in Figure 17. We refer the interested reader to previous works [254], [260]–[268] for more detailed descriptions and definitions of the privacy issues related to the future computing paradigm. Therefore, future research must find new ways to make anonymization solutions more resilient to these issues, such as quantifying the risks and benefits of data publishing, selecting the appropriate privacy-preservation mechanism based on adversaries' capabilities (e.g., worst-case scenarios), verifying the effectiveness of privacy-enhancing technologies through real-world applications, and increasing users' awareness of privacy leakage through hidden routes.
Figure 17. Comprehensive overview of the privacy issues in the future computing paradigm (year 2020 and beyond).
Moreover, achieving effective privacy protection should focus more on exploiting the intrinsic characteristics of users' data, or of the application for which the data is being anonymized, through the simulation of diverse privacy approaches [269]–[271]. Furthermore, privacy-enhancing technologies (PETs) that improve privacy preservation in personalized services while satisfying both economic and ethical objectives have become more urgent than ever [272]. In recent years, the privacy-preserving machine learning (PPML) concept has been employed to extract knowledge from distributed databases while ensuring data privacy [273], [274]. With PPML, traditional machine learning algorithms can be adapted to secure users' data stored across multiple digital environments.
Recently, privacy-aware data cleaning techniques have significantly reduced the data-preparation cost of the data analysis pipeline [275], [276]. These techniques allow clients to buy clean, curated data from heterogeneous service providers and perform analytics without compromising users' privacy. Furthermore, the development of a privacy information management system (PIMS) in accordance with international standards (e.g., ISO/IEC 27701) is imperative to safeguard the privacy of individuals or small groups in the population [277]. Under such a system, if an anonymization mechanism cannot safeguard users' privacy, or if the anonymized data can be used to uniquely identify individuals or small population groups, the data cannot be released without legal advice or additional technical measures.
Promising Open Research Directions
Some promising open research directions/problems that need further research and development from both academia and industry are outlined below.
Users groups' privacy issues: In tabular data anonymization, the majority of existing approaches focus solely on an individual's privacy preservation and are therefore less resilient to user groups' privacy preservation. For instance, the $k$-anonymity model creates equivalence classes with $k$ users in each class. On the one hand, it protects individual privacy by hiding each user in a crowd of $k$ users; on the other hand, it may explicitly disclose private information about user groups (a toy illustration of this group-level disclosure appears after this list). Hence, devising new PPDP methods for user groups' privacy protection would be promising.
Excessive information loss caused by over-generalization of the QIs: In tabular data anonymization, most existing approaches anonymize every QI present in a dataset, which can lead to excessive information loss. Some QIs are not vulnerable in terms of users' privacy, and their unnecessary generalization significantly impacts data utility. Thus, quantifying each QI's impact on privacy and utility, and controlling unnecessary generalization to the extent possible while anonymizing users' data, requires further research and development from the research community.
Imbalanced datasets anonymization: In some cases, the relational dataset can be highly imbalanced (i.e., the SA's value distribution is not uniform), and its anonymization is very challenging. In such datasets, enforcing hard constraints such as making each class $\ell$-diverse or $t$-close is not possible in practice. Hence, new approaches for anonymizing imbalanced datasets are required to effectively protect users' privacy without degrading data usefulness.
Effective resolution of the privacy and utility trade-off in PPDP: In data anonymization, there exists a strong trade-off between privacy and utility. Tailoring the anonymization towards privacy objectives can adversely affect the anonymous data's utility, and vice versa. This longstanding challenge in the field of PPDP still awaits novel solutions that support privacy-preserving big data analytics.
Personalized privacy preservation in SNs: In SNs, each user has different requirements and concerns regarding his/her information privacy, which is called personalized privacy (PP). For example, in a social graph some users may want to hide only a sensitive relationship (e.g., a romantic partner), while others may want to hide all social connections (i.e., all friends). Therefore, PP involves a high level of subjectivity and is very difficult to implement. Hence, innovative solutions that can incorporate SN users' PP requirements into SN data anonymization are required.
Accurate modeling of the adversaries' background knowledge: Adversaries possess more side-channel information as background knowledge (BK) about SN users than about tabular data subjects. This BK continues to grow due to access to other publicly available SNs and the richness of the information embedded in SNs. Recently, text mining and natural language processing (NLP) techniques have been used to exploit the content posted by SN users to match their identities. Thus, accurate modeling of the BK while anonymizing SN data is very challenging, and further research on how to model the BK during the anonymization process is required to effectively protect users' privacy in the big data era.
Generic solutions for social graph anonymization: Generally, SN data is modeled with the help of graphs (a.k.a. sociograms). These graphs can be of different types, such as simple, directed, undirected, weighted, and labeled directed graphs. An anonymization mechanism proposed for one type of graph cannot be directly applied to another. For instance, the $k$-degree anonymity concept cannot be applied to directed graphs straightforwardly, as it requires detailed analysis of the in- and out-degree sequences (a minimal sketch of such a check appears after this list). Hence, devising generic anonymization methods that can work with multiple graph types will be an interesting research area in the future.
Controlling large-scale user identification issues by evading data mining and SN analysis (SNA) tools: In SNs, users establish relationships with like-minded people or people having similar interests, which results in the formation of online communities. A growing body of research has been devoted to community discovery in SNs. On the one hand, community discovery is beneficial for multiple purposes such as information spread and control. On the other hand, community discovery in privacy-preserved published SN data can jeopardize user and community privacy when the published data is analyzed with advanced data mining and SNA tools. Hence, devising new solutions that are resilient to community discovery and community-based node/user mapping in SN data publishing has become more pressing than ever.
Exploiting global and local features of SN data to safeguard against network reconciliation problems: Recently, identifying users across SNs by leveraging multiple methods, such as matching multiple SN graphs (i.e., network reconciliation), mapping display names, analyzing contents and activities, correlating accounts on heterogeneous SNs, and their combinations, has become an active area of research. Accordingly, privacy approaches need significant upgrades to protect users' privacy against network reconciliation. In this regard, approaches that perform data anonymization by exploiting local (e.g., common mapped neighbours, number of friends, and mutual friends) and global (e.g., betweenness, centralities, tie strength, and multi-hop neighbours' information) features of the social graph will be an interesting research area in the near future for PPDP.
Metric suites rather than a single metric for quantifying the level of privacy in anonymous graphs: Generally, one type of metric is employed for evaluating the level of privacy/utility offered by anonymization solutions in an anonymized graph $G'$. Moreover, in real-world cases, the privacy quantified by a single metric may not be monotonic (e.g., it may report lower privacy for stronger adversaries) or reliable from multiple viewpoints. Hence, combining multiple privacy metrics that more accurately measure the level of privacy and mitigate the weaknesses of individual metrics is a promising research direction, considering the widespread interest in publishing SN data with legitimate information consumers.
Devising privacy-friendly mechanisms for exceptional situations: During 2020, the whole world faced an unanticipated and extraordinary challenge from an unknown enemy, the coronavirus disease 2019 (COVID-19) [278], [279]. The COVID-19 pandemic has affected every profession around the globe, and governments have relied heavily on non-pharmaceutical interventions (e.g., strict lockdowns, city and facility closures, social distancing, geo-location-based analysis of users' mobility, proximity detection, and digital contact/suspect tracing) to curb the spread of COVID-19 [280]. In addition, for epidemiological investigations, some governments employed extensive measures (e.g., credit card data, mobile phone signals, Bluetooth and GPS data, and CCTV data) to find COVID-19 suspects and hotspots [281]–[283]. Due to the adoption of these digital methods, a huge amount of personal data has entered cyberspace, and privacy violations have been constantly reported around the globe. For example, in Italy from January to April 2020, privacy violations in the healthcare sector related to companies and individuals doubled [284]. Furthermore, privacy violations in the post-COVID-19 era are expected to increase, as many companies have collected multi-faceted data about people's lifestyles [285]–[287]. Considering the necessity of users' privacy preservation, the ethical aspects during and after the COVID-19 era must be carefully observed and addressed [288]–[294]. Hence, privacy-friendly solutions are required for exceptional situations such as the COVID-19 pandemic to safeguard patients' privacy and to address other pandemic-related aspects (e.g., privacy-preserving symptom tracking and reporting, collecting only relevant information from users, encrypting sensitive information, GDPR-compliant and privacy-aware contact tracing, and decentralized solutions for computing the probability of exposure). Further, PPDP mechanisms involving COVID-19 patients' data are deemed necessary to safeguard against discrimination and hatred towards certain religions, countries, sects, castes, and sexual minorities during published data analytics. Hence, analyzing and mitigating the potential privacy risks and vulnerabilities in the contact tracing apps developed by many countries to fight the pandemic is a vibrant area of research.
Adoption of Industry 4.0 techniques for PPDP: In recent years, many innovative techniques such as few-shot learning, federated learning, transfer learning, deep learning, and blockchain have revolutionized human life in many aspects. These techniques have been extensively used in areas such as healthcare, social engineering, data analytics, prediction and forecasting, knowledge extraction, and image recognition/analysis. Due to the availability of huge amounts of labeled data and their ability to work in a decentralized fashion, these techniques can be utilized for users' privacy preservation with enhanced usefulness. The heterogeneous federated transfer learning (HFTL) framework [295], privacy-preserving deep learning (PPDL) technique [296], deep transfer learning (DTL) method [297], adaptive privacy preserving federated learning (APPFL) method [298], blockchain-enabled privacy preserving (BCEPP) architectures [248], [299], secure collaborative few-shot learning (SCFSL) framework [300], searchable encryption (SE) methods leveraging ciphertext-policy attribute-based encryption (CP-ABE) [301], [302], data resource protection solutions leveraging smart contracts [303], cyber security solutions improved with AI's potential [304], and computational intelligence based methods for information security [305], to name a few, have already been used in practical applications related to PPDP. Hence, devising robust and lightweight techniques that involve fewer parameters and can work alongside traditional anonymization approaches to scale up privacy preservation with enhanced data utility is a promising area of research for the future.
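As flagged in the item on users groups' privacy above, a k-anonymous release can still leak group-level facts when an equivalence class is homogeneous in its sensitive attribute. Below is a toy sketch, assuming a simple generalization scheme (10-year age bands and 3-digit ZIP prefixes); the records, helper names, and attributes are purely illustrative.

```python
from collections import defaultdict

def generalize_age(age):
    """Toy QI generalization: map an exact age to a 10-year band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def equivalence_classes(records):
    """Group records by generalized quasi-identifiers (age band, ZIP prefix)."""
    classes = defaultdict(list)
    for age, zip_code, disease in records:
        key = (generalize_age(age), zip_code[:3] + "**")
        classes[key].append(disease)
    return classes

records = [
    (34, "47677", "flu"), (36, "47602", "flu"), (38, "47678", "flu"),
    (51, "47905", "cancer"), (55, "47909", "hepatitis"), (59, "47906", "flu"),
]

for key, diseases in equivalence_classes(records).items():
    k = len(diseases)              # class size: the "k" in k-anonymity
    distinct = len(set(diseases))  # distinct SA values: the "l" in l-diversity
    print(key, "k =", k, "distinct SA values =", distinct)
    if distinct == 1:
        print("  -> group-level disclosure: every member of this group has", diseases[0])
```

Both classes are 3-anonymous, yet the first one reveals that everyone in the (30-39, 476**) group has the flu, which is exactly the group-level disclosure the item above calls for new PPDP methods to prevent.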
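Relatedly, the item on generic graph anonymization notes that k-degree anonymity does not carry over to directed graphs without analyzing in- and out-degree sequences. The sketch below assumes one possible formulation, in which every (in-degree, out-degree) pair must be shared by at least k nodes; other formulations treat the two sequences separately, and the edge list is purely illustrative.

```python
from collections import Counter

def is_k_degree_anonymous_directed(edges, k):
    """Check whether every (in-degree, out-degree) pair occurs at least k times.

    edges: iterable of (source, target) pairs of a directed graph.
    The joint pair is treated as the adversary's background knowledge here.
    """
    in_deg, out_deg, nodes = Counter(), Counter(), set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    pair_counts = Counter((in_deg[n], out_deg[n]) for n in nodes)
    return all(count >= k for count in pair_counts.values())

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
print(is_k_degree_anonymous_directed(edges, 2))
# False: "a" (in 2, out 1) and "d" (in 0, out 1) each have a unique degree pair,
# so an adversary who knows a target's in/out degrees can single them out.
```

A generic anonymizer would have to add or rewire edges until such a check passes for every supported graph type, which is what makes one-size-fits-all solutions difficult.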
Conclusion
In this paper, we have presented the latest research proposed to release useful information while preserving users' privacy from malevolent adversaries, namely privacy preserving data publishing (PPDP). In recent years, there has been an increasing focus on the rapid development of more practical anonymization solutions due to the significant rise in privacy breaches across the globe, and this area is attracting researchers' interest rapidly. Owing to rapid technological developments in communication science and technology, a tremendous amount of user data can now be easily obtained in diverse formats, ranging from relational tables to complex social graphs. Although this increasing amount of data offers unprecedented opportunities for analytics, it also increases the chance of individuals' privacy breaches. In addition, most of the traditional anonymization algorithms that were proposed for tabular data can rarely perform well on social network (SN) data (i.e., graphs) without modifications. Hence, it is of paramount importance to provide a good perspective on the information privacy area involving both tabular and SN data, along with recent anonymization research. In this work, we have provided detailed and systematic coverage of the relational anonymization techniques used for tabular data before presenting recent structural anonymization approaches used for SN data anonymization. We have summarized and compared a substantial number of anonymization approaches used for information privacy protection involving both SN and tabular data. Furthermore, we have provided deeper insights into the privacy problems of the future computing paradigm that will be helpful in devising more secure anonymization methods, and we have discussed numerous promising open research directions/problems that need further research and development. In this survey, we specifically focused on SN data anonymization and de-anonymization techniques considering the widespread applications and use of SN data. SN data contains a treasure trove of information that needs to be protected from malevolent adversaries, which explains the ever-increasing interest of researchers in the area of SN data anonymization. Nevertheless, users' data anonymization is still undeniably complex, and it requires significant improvements in existing approaches as well as the development of new practical approaches with regard to better utility and privacy preservation. In future work, we plan to devise new anonymization methods for SN data, and we intend to explore privacy problems in Industry 4.0 technologies.
Conflict of Interest
The authors declare that they have no conflict of interest.