Processing math: 0%
Anonymization Techniques for Privacy Preserving Data Publishing: A Comprehensive Survey | IEEE Journals & Magazine | IEEE Xplore

Anonymization Techniques for Privacy Preserving Data Publishing: A Comprehensive Survey


Privacy preserving data publishing (PPDP): (a) conceptual overview and (b) description of the actors involved in the PPDP scenario.

Abstract:

Anonymization is a practical solution for preserving user’s privacy in data publishing. Data owners such as hospitals, banks, social network (SN) service providers, and i...Show More

Abstract:

Anonymization is a practical solution for preserving user’s privacy in data publishing. Data owners such as hospitals, banks, social network (SN) service providers, and insurance companies anonymize their user’s data before publishing it to protect the privacy of users whereas anonymous data remains useful for legitimate information consumers. Many anonymization models, algorithms, frameworks, and prototypes have been proposed/developed for privacy preserving data publishing (PPDP). These models/algorithms anonymize users’ data which is mainly in the form of tables or graphs depending upon the data owners. It is of paramount importance to provide good perspectives of the whole information privacy area involving both tabular and SN data, and recent anonymization researches. In this paper, we presents a comprehensive survey about SN (i.e., graphs) and relational (i.e., tabular) data anonymization techniques used in the PPDP. We systematically categorize the existing anonymization techniques into relational and structural anonymization, and present an up to date thorough review on existing anonymization techniques and metrics used for their evaluation. Our aim is to provide deeper insights about the PPDP problem involving both graphs and tabular data, possible attacks that can be launched on the sanitized published data, different actors involved in the anonymization scenario, and major differences in amount of private information contained in graphs and relational data, respectively. We present various representative anonymization methods that have been proposed to solve privacy problems in application-specific scenarios of the SNs. Furthermore, we highlight the user’s re-identification methods used by malevolent adversaries to re-identify people uniquely from the privacy preserved published data. Additionally, we discuss the challenges of anonymizing both graphs and tabular data, and elaborate promising research directions. To the best of our knowledge, this is the f...
Privacy preserving data publishing (PPDP): (a) conceptual overview and (b) description of the actors involved in the PPDP scenario.
Published in: IEEE Access ( Volume: 9)
Page(s): 8512 - 8545
Date of Publication: 18 December 2020
Electronic ISSN: 2169-3536

Funding Agency:


CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.
SECTION I.

Introduction

Most organizations such as hospitals, banks, insurance companies, and supermarkets collect relevant customers/subscribers data to improve service quality (SQ). Apart from these physical organizations, an excessive amount of user’s data is collected by the virtual platforms such as social networks (SN) service providers due to the extensive use of SN all around the world. With the significant advancement in the information and communication technologies (ICT), SNs enable people to interact with their friends, make new friends, seek information about the relevant subject matter or jobs, spread reliable information at low cost, and also to entertain themselves by watching digital contents. Meanwhile, the SNs collect and store the relevant data about their users during the service provisioning, and at the time of account creation (i.e., joining the SNs). This collected data often contains information about the user’s activities, demographics, finance, hobbies, location, perceptions, interests, preferences, political and religious views, online communities, and opinions. Furthermore, most users readily post other valuable data including preferences in music, viewing choices, and social problems such as an epidemic outbreak. Research has shown that analysis of this collected data with advanced data mining tools can assist organizations in improving SQ significantly. For instance, it allows them to understand social trends, people’s sentiments and behaviors, and factors causing a certain disease outbreak. Accordingly, such information can be leveraged for many scientific or business objectives including targeted advertisement, relevant content recommendations, and effective decision making [1], [2]. Although the data sharing brings innovation and enables better decision making, it may also jeopardize the privacy of users due to the existence of sensitive information in the data [3].

Before publishing the users’ data with the researchers or third parties, data owners ensure that the user’s private information privacy is protected. This is typically done via data anonymization, which transforms the original data by applying some operations on it to effectively protect user’s privacy without degrading the anonymous data utility [4]. Privacy preserving data publishing (PPDP) provides set of models, tools, and methods to safeguard against the privacy threats that emerge from the data releasing with data miners or analysts [5]. In recent years, PPDP has received considerable attention from the research community, and many approaches have been proposed for both SN and tabular data anonymization [6]–​[10]. There are two famous settings of PPDP, non-interactive and interactive [11]. In the former setting, the data owner publishes the complete dataset in an anonymized form after applying some modifications on the original data. However, in the later setting, the data owner does not publish the whole data set in a sanitized form like the former setting. Instead, data owner provides an interface to the data miners through which they may pose different statistical queries about the related data and get (possibly noisy) answers. The k -anonymity model [12], and its ramifications are most widely used in the non-interactive setting of PPDP [13]–​[17]. These approaches apply some modifications on the original values of quasi identifiers, and protect the user’s privacy by making information less-specific. The differential privacy (DP) [18], and DP based approaches are mostly used in an interactive setting of PPDP [19]–​[21]. Meanwhile, some studies have reported the DP based approaches for non-interactive setting [22]. Both k -anonymity and DP based anonymization approaches, and their improved versions have been extensively used in the PPDP.

In recent years, SN data is also published with the data-minders for accomplishing multiple scientific and business objectives due to the phenomenal growth in SN use around the globe [23]. SNs data is mostly in a graph G form, and it provides unprecedented opportunities for advanced data analytics. A social graph data G(U, V) , where U is the list of users and V represents the set of edges modelling the relationship between the users. The social connection among users in SNs can be of different types such as friend, sibling, and lover. Generally, each user in a SN has two types of the social connections. Among these connections, there is a set of public connection (V_{p} ), and other connection V_{s} = V \setminus V_{p} that are set private by the users, which needs privacy protection. Aside from the V_{s} privacy protection, there are many aspects in which SN users want privacy protection such as the sensitive attribute (SA), online groups affiliations, and locations. Researchers have extended the concepts used for the tabular data anonymization to protect SN’s users privacy [24]–​[26]. The two popular anonymization approaches used for the SN data are: naive and structural anonymization. In naive anonymization, only social link structure is published by removing the edges and nodes labels from the G . However, Backstrom et al. [27] suggested that naive anonymization is prone to identity disclosure because the structure of the released graph may reveal the identity of the individuals corresponding to the nodes. In contrast, the structural anonymization approaches modify the structure of G to effectively protect the user’s privacy. These approaches add new edges, vertices, and/or modify the existing G structure to fulfill the privacy and utility requirements. For instance, in the k -degree anonymity [28], the G containing users and their relationships is modified in such a way that each user U in a G has the degree k . In some cases, the sensitive information (i.e., link information) is removed from the G for some users during SN data anonymization. Zheng et al. [29] proposed a framework for preserving sensitive link information privacy in SN data anonymization. Aside from the edges and vertices addition/deletion, advanced techniques such as edges switch and rotation have also been proposed to solve the users’ privacy problems in SN data anonymization [30].

The existing surveys related to PPDP cover important aspects such as anonymization techniques, anonymization operations, privacy models, data anonymity frameworks, and evaluating metrics employed by the PPDP mechanisms. Fung et al. [31] study systematically summarized and evaluated different approaches used in the PPDP. Rajendran et al. [32] explained three prominent and most widely used anonymization models in a medical field, namely k -anonymity, \ell -diversity, and t -closeness. Gkoulalas et al. [33] presented a comprehensive survey about the privacy threats and privacy models used in the PPDP. The authors provided details about fourty-five anonymity algorithms in their study. Tran et al. [34] presented a detailed survey about the privacy-preserving big data analytics. The authors explained the related studies and provided details about various PPDP practical scenarios that needs further development from research community. In addition, researchers have presented surveys about the PPDP techniques used in the big data era [35], [36]. A few surveys address the SN data publishing problems, but only considering possible breaches and briefly mentioning the privacy problems that emerge from SN users’ data publishing. Yang et al. [37] explained about the attack models and countermeasures in SN data publishing. The authors surveyed and categorized the PPDP algorithms into two categories, namely anonymization and DP. Siddula et al. [38] provided a survey about the privacy models and methods used for the SN user’s privacy protection. The authors explained the mechanism used for the edges and nodes privacy protection in publishing G . Zho et al. [39] presented a survey about the anonymization techniques used for SN data, and discussed the challenges involved in the G anonymization compared to the relational data anonymization. Zheleva et al. [40] presented a survey about privacy in SNs and statistical inference techniques. Abawajy et al. [41] presented a review of anonymization techniques employed on SN data, and privacy attacks and risks. Furthermore, some surveys related to across SNs user identification [42], [43], edges and vertex modifications techniques [44], and SN application specific scenarios have been reported in the literature [45].

This paper presents a comprehensive survey on recent anonymization techniques used for both SN and relational data publishing. Specifically, our review explains anonymization approaches related to the information privacy protection. The contributions of this review article in the field of PPDP is summarized as: (i) it presents state-of-the-art anonymization techniques used for both SN (i.e., social graphs) and relational (i.e., tabular) data, and fundamental concepts and ideas related to tables and graph data anonymization; (ii) it systematically categorizes the existing anonymization techniques into relational and structural anonymization, and presents an up-to-date thorough review on existing anonymization techniques and metrics used for their evaluation; (iii) it describes the anonymization techniques that have been proposed to solve privacy problems in application-specific scenarios (e.g., collaborative filtering, topic and context modeling, and community clustering etc.) of the SNs; (iv) it presents various methods and items that are exploited by malevolent adversaries for user’s re-identification across SNs; (v) it explains various challenges faced by researchers while devising new anonymization methods for tabular and SN data; (vi) it provides new insights on the privacy problems in future computing paradigm that will be helpful in devising more secure anonymization methodologies; and (vii) it discusses promising future research directions in the field of the PPDP that need further development and research from both academia and industry. Through this comprehensive overview, we hope to provide a solid foundation for future studies in the PPDP area.

The remainder of this review paper is organized as follows. Section II explains the background regarding privacy types, tabular and graph data overview, types of the user’s attributes, operational utility levels of the SN data, privacy areas in the SN, and privacy threats that occur in both SN and tabular data publishing. Section III presents the dilemma of PPDP, and explains its principal concepts and phases. Section IV discusses the state-of-the-art relational anonymization techniques used for tabular data. Section V discusses the recent structural anonymization techniques used for the SN data. The summary of the research contents presented in this article, and discussion about the privacy problems in the future computing paradigm are provided in Section VI. Section VII discusses promising open research directions/problems in the PPDP area. Finally, Section VIII concludes the paper.

SECTION II.

Background

Privacy is all about keeping personal information away from the public access. Privacy is needed for the personal autonomy, individualism, and respect. There are four types of the privacy such as information, bodily, territorial, and communication [46]. Information privacy is about collecting, managing, analyzing, and publishing the personal data. Bodily privacy is related to the physical harms from any kind of invasive procedures/measures. Communication privacy refers to any form of communication such as phone calls or e-mails privacy. Territorial privacy refers to placing boundaries on irruption into a locality. This survey focuses on the information privacy, which encompasses systems/infrastructures that collect, analyse, process, and publish user’s data. Concisely, we present the various anonymization approaches that were proposed to anonymize users’ data that can be either in the form of graph or table. The overview of the user’s data in relational and graphs form is presented in Figure 1. In relational data (Figure 1 (a)), each tuple contains four types of attributes about users, direct identifiers (DI), non-sensitive attributes (NSA), quasi identifiers (QIs), and sensitive attribute (SA). In contrast, the SN data shown in Figure 1 (b) represents the users information via nodes/edges labels. The relevant background about both types of data (i.e., tabular and graphs) is covered in subsequent subsections.

FIGURE 1. - Overview of the relational (i.e., tabular) and social network (i.e., graphs) users data.
FIGURE 1.

Overview of the relational (i.e., tabular) and social network (i.e., graphs) users data.

A. Background About the Relational/Tabular Data

The original user’s data D is considered as a private table which consists of multiple records (i.e., tuples). Each record/tuple contains four types of attributes, and every record has a unique id as shown in Figure 1(a). The detailed overview of user’s attributes along with examples and their treatment in an anonymization process is illustrated in Figure 2. Based on the types of user’s attributes listed

FIGURE 2. - Description about the types of user’s attributes and their handling in an anonymization process.
FIGURE 2.

Description about the types of user’s attributes and their handling in an anonymization process.

in Figure 2, there exists three classes of privacy threats that can occur during published data analysis [47], which are explained below:

  • Identity disclosure (i.e., unique identification): It is a well-known privacy threat in the PPDP. It occurs when an adversary can correctly associate an individual in a privacy preserved published dataset. Generally, an attacker use the information gathered from external sources (i.e., voter registration list, online repositories, and factual information) to identify an individual uniquely.

  • Attribute disclosure (i.e., private information disclosure): This type of privacy threat occurs when an individual is linked with the information about his/her SA. For example, the information can be the person’s value for the SA (i.e., crime in Figure 1(a), or salary in Figure 1(b)). This type of threat can be easily launched in imbalanced datasets (i.e., the datasets lacking heterogeneity in SA’s values.).

  • Membership disclosure (i.e., presence/absence disclosure): This threat occurs when an adversary can deduce that an individual’s record is present/absent in the published dataset with a very high probability. Researchers have reported many interesting scenarios in which the protection from the membership disclosure is imperative [48], [49].

To protect the privacy of users in the published dataset, data owners can apply one of the following anonymization operations on the original user’s data given in a tabular form [50].
  • Generalization: This operation transforms the original QI’s values into less-specific but semantically-consistent values during anonymization process. For example, the value 25 of a QI age can be generalized with an interval [25 – 30] or < 30. This operation relies on the taxonomy of each QI. The existing generalization schemes can be classified to five types such as sub-tree generalization, full domain generalization, unrestricted sub-tree generalization, cell generalization, and multidimensional generalization.

  • Suppression: This operation hides an original value of a QI with a special value (i.e., ’*’). For example, to anonymize the value 25 of a QI age using suppression operation, 5 can be replaced with an asterisk, resulting 2* as the suppressed value of QI. Record, value, and cell suppression are the three most widely used suppression variants in the tabular data anonymization.

  • Permutation: In this operation, the records are partitioned into several groups, and values of the SA are shuffled within each group. Hence, the SA and QIs relationships are de-associated within each group. This operation may yield inaccurate analysis in terms of anonymous data utility, but user’s privacy is significantly preserved.

  • Perturbation: In this operation, the original data values are replaced with some synthetically generated values. Moreover, the synthetic values are generated in a way that statistical information do not differ much in both datasets (i.e., real and synthetically generated datasets).

  • Anatomization: This operation does not apply any modifications on the original data values and instead QIs and SA are separated into two tables. By doing so, the association between QIs and SA is broken, and data is released as QIs and SA tables separately. In some cases, the SA table contains the SA’s values and their frequency in the anonymized dataset for privacy preservation effectively.

Furthermore, in some cases, more than one operation are jointly used to anonymize users data set. Data publication can be done either one or multiple times depending upon the requirements and information consumer’s needs. Typical data publication scenarios include one and multiple time publication of the micro data. In the former scenario, the data is published once, and no re-publication is made even after the changes in the original data. In contrast, in the multiple releases, the anonymous data is re-published even after a single operation (insert, update, delete) that changes the original data. Both scenarios have their own advantages and disadvantages for data owners and information consumers, respectively. After DIs and NSA removal from D as a standard PPDP practice, the D contains only two types of user attributes, QIs and SA, represented as D\{Q,S\} . We can use set Q = \{q_{1}, q_{2}, {\dots }, q_{p}\} to denote the QIs, where q_{p} is one type of QI such as gender or zip code. Set S represents the SA which can be of single type (i.e., disease) or multiple types (i.e., disease and salary) depending upon the scenario. The multiple SA scenario is getting significant attention from the research community in recent years [51]. Plenty of solutions have been proposed for the relational data anonymization considering the available data, SA scenarios (single, multiple), and the PPDP settings. We explain most famous anonymization solutions and their variants proposed for the tabular data anonymization in Section IV.

B. Background About the Social Networks/Graphs Data

The SN user’s data can be modeled with a graph G , represented as G(V, E, A) . It consists of user set V , social connections (i.e., friendship links) set E , where E \subseteq V \times V , and the set A of users’ attributes. Set A , where A =\{Q,S\} contains QIs and SA of the SN users, respectively. The overview of G along with the relevant details is depicted in Figure 1(b). In literature, V , E , and A are also referred as nodes, edges, and labels (i.e., user profile), respectively. The SN data is extremely useful for many analytical purposes, the operational utility of the SN data can be classified into three levels (l_{1}, l_{2}, l_{3}) , as outlined earlier [52].

  • l_{1} : Exposure of the graph structure only: In this exposure level, the data owners (e.g., SN service providers) only publish the graph structure (i.e., all profiles/labels information is removed prior to data publication). Thus, the data-miners/analysts can analyze only the graph structure without any concrete information about the user’s profiles.

  • l_{2} : Exposure of the nodes’ profiles: In this exposure level, the data owner publishes the profiles of nodes/users but hides the graph structure. For instance, the node’s data is stored in a table/matrix, and released for the analytics and data mining purposes.

  • l_{3} : Exposure of both graph structure and profiles of nodes (e.g., users): In this exposure level, the data owner exposes both the graph structure and the nodes’ profiles after applying some modifications to the G ’s structure and users’ profiles. This level offers much higher utility to the legitimate information consumers compared to the previous two levels.

Although SN data publishing is invaluable for accomplishing multiple research and business objectives, the SN data publishing can confront with the privacy threats of several types. Four well-known privacy threats that can happen after G' publishing are summarized below.

  • Identity disclosure (i.e., node re-identification): It occurs when an adversary can accurately associate/identify an individual from a privacy preserved published graph. For example, Tim (5th node) identification by an adversary from the anonymous version (i.e., removing all nodes labels.) of a G given in Figure 1(b) is an example of identity disclosure (i.e., node re-identification).

  • Edge disclosure (i.e., relationship/connection disclosure): It reveals the relationship between users. For example, a patient-doctor relationship can be highly sensitive, and it must be protected. If a doctor is known to be an expert in cancer treatment, the relationship disclosure can occur with inference that patient might be infected with a cancer.

  • Content disclosure (i.e., vertex/edge labels disclosure): It occurs when a sensitive label associated with an edge or vertex is revealed from a G' , and this reveled label can be directly associated with a specific individual in an original graph (i.e., G ).

  • Affiliation link disclosure (i.e., whether a person v \in / \notin to a particular affiliation group h ): It occurs when a link between a user v and an affiliation group h is revealed with confidence \geq t , and this revealed link can be directly associated with a v . The h can be prosecuted political group or a group centered around an unconventional user’s preference (i.e., sexual/drugs preferences).

To protect the privacy of users in SN data publishing, data owners can apply one of the following anonymization operations on the G prior to its publication with the analysts [53].
  • Graph modification: This operation changes the original graph’s structure by adding/deleting edges or vertices. The criteria about the graph’s elements addition/deletion depends on the objectives of data publishing, and privacy protection level data owners want. Typical graph modifications techniques are of two types-constrained and non-constrained graph modifications.

  • Graph generalization: This operation does not modify the graph structure, instead, it cluster the nodes and edges into super nodes and edges. This operation works in iterations, and in each iteration the connections between the super nodes and edges are established.

  • Graph computation (i.e., privacy-aware computation of graphs): This operation computes the output data in response to the data miner queries. The G is not released with the data miners, instead, the graph properties (degree, centralities, size, and other useful information) are computed and released.

  • Hybrid operation/approach: This operation employs combination of two or more anonymization operations simultaneously. For example, the graph generalization and modifications can be jointly used to anonymize G to produce G' .

Privacy in SN data publishing has become an active area of research in recent years. Pham et. al [54] explained that SN users privacy can be breached through three different ways such as user’s activities on the SNs sites, stored users data in the SNs service providers’ servers/databases, and privacy preserved SN published data. All three privacy areas in SN are visually presented in Figure 3 (a). The privacy in the first two areas can be preserved through guidelines, encryption and watermarking techniques. Meanwhile, the privacy in the third area can be guaranteed only by devising/implementing new anonymization mechanisms. This article covers recent concepts, methods, and solutions concerning the third area of privacy in the SNs (i.e., privacy preserved graph publishing).

FIGURE 3. - Overview of three privacy areas (adopted from [54]), and types of users data collected and processed in SNs.
FIGURE 3.

Overview of three privacy areas (adopted from [54]), and types of users data collected and processed in SNs.

The user data available on the SN sites is comprised of three types of information [55]. The taxonomy of the users data present on the SNs is depicted in Figure 3(b). The first type of data is related to user identity including the demographics and derived data (i.e., age). The second type of information is related to user’s socialism (i.e., friends, friends-of-friends, activities, and online communities joined by the users etc.). The third type is related to the contents users create on the SNs sites. It has three types: disclosed, entrusted, and incidental data. Disclosed data is readily made available by the SNs users. Entrusted data includes the data posted by a user on another user’s profile (i.e., comments). The last type of data is the information collected/posted by other users on someone’s profile/wall. All these data sources are invaluable for detailed analytics and appropriate information collection/analysis.

Aside from the fact that all necessary and private information needs protection from the malevolent adversaries, some de-anonymization attacks can be launched on the SNs published data to infer someone’s real identity/private-information. The revelation of such information can result in unknown user profiling, thus targeting unsuspecting users with far-reaching implications including targeted marketing, obtaining travel visas based on race or political viewpoints, deportation, or identity theft. Therefore, mining and sharing SNs data must not invade user’s privacy. To safeguard user’s privacy in SN data publishing, many privacy preserving graph publishing (PPGP) methods have been proposed. We describe most recent anonymization solutions proposed for the PPGP in Section V.

SECTION III.

Overview of Privacy Preserving Data Publishing and Its Fundamental Phases

Privacy preserving data publishing (PPDP) provides set of tools, methods, solutions, and frameworks for sharing valuable information with analysts/researchers without jeopardizing user’s privacy. PPDP has been extensively studied in the literature for protecting different aspects of user’s private information during published data analytics. Data is shared with the analysts/researchers for extracting the embedded knowledge from it. Meanwhile, the anonymous data sharing after applying anonymization operations, as outlined earlier (see subsections II-A, II-B), adversaries can still infer the information about user’s identity or private information by leveraging the auxiliary information gathered from external sources. Therefore, many studies have suggested the users’ privacy-protection in all stages of the information processing cycle (i.e., collection, storage, processing, release, and destruction/archival) [56]. The conceptual overview of the PPDP process is presented in Figure 4 (a). In this conceptual overview, we mainly present the overview of data collected from the individuals, different actors involved in the anonymization scenario, anonymization techniques applied on respective data, anonymous data to be published for analytics/mining purposes, and privacy breaches that can occur during published data analytics. The typical PPDP scenario involves five types of actors [57]. The brief description about each actor along with examples is summarized in Figure 4 (b). In some cases, one actor can perform multiple roles in the PPDP scenario. For instance, data holders/owners hold the collected data, and perform anonymization for its releasing with analysts. Sometimes due to the lack of computing resources or knowledge the data holder outsources the collected data for anonymization/publishing. Hence, the roles of data holder and data publisher can correspond to two distinct/same actors.

FIGURE 4. - Privacy preserving data publishing (PPDP): (a) conceptual overview and (b) description of the actors involved in the PPDP scenario.
FIGURE 4.

Privacy preserving data publishing (PPDP): (a) conceptual overview and (b) description of the actors involved in the PPDP scenario.

The typical PPDP process encompassed of the five fundamental phases: (i) data collection from the individuals; (ii) collected data storage, understanding, and preparation for the anonymization; (iii) user’s data anonymization; (iv) anonymous data releasing/publishing; and (v) published data analysis for extracting embedded knowledge. Brief details of each phase of the PPDP is presented below.

A. Phase 1: Data Collection from the Individuals

In the first phase of PPDP, relevant data from the individuals/users is collected. Due to significant technological development in recent years, the amount of data generated by sensor networks, SNs sites, healthcare applications, internet, online banking, SN integrated third party applications, and many other offline/online companies is drastically increasing. This data is collected from individuals directly or via smart devices (i.e., cell phones, laptops, and notepads etc.). For instance, if a patient visits a hospital for diagnosis, his/her personal information is collected for the better treatment. Later, the patient’s personal information combined with the disease information is saved in the hospital database for the secondary use [58]. Similarly, the account opening process in a bank is subject to the collection of basic as well as sensitive information about the customers. Aside from visiting an organization (i.e., hospital or bank), in some cases, the service providers collect relevant data from their users via questionnaires and interviews.

Recently, due to the significant development in ICTs, majority of service providers have launched their own websites for the data collection from their respective users/customers. Furthermore, in an account creation process on the SNs sites, the service providers collect and store the basic information about each user. In addition, many SNs sites collect the valuable data about users without their consent. For example, they can collect the device information, service consumption temporal information, location information, and many other useful information. Recently, with the inventions of advanced tools and technologies, many infrastructures collect and process graphs data. The nine well-known graph based data collection sources are: (i) social networks, (ii) communication data, (iii) mobility traces, (iv) healthcare and epidemiological data, (v) citation networks, (vi) collaboration network, (vii) web graphs, (viii) autonomous system graphs, and (ix) computer networks. SN data is mostly represented as a G along with the entities (i.e., users) information for modeling/analysis purposes.

B. Phase 2: Collected Data Storage, Understanding and Preparation for the Anonymization

Once the relevant users’ data is collected, the next phase is collected data storage, understanding, and preparation for further operations (i.e., anonymization). Companies are using large scale databases for storing the collected users’ data/information. The well-known SN, Facebook uses MySQL database to store/manage many petabytes of data about user’s social activities such as shares, comments, and likes. Data understanding involves the analysis of the data types, different data representations, and (key, value) -pair understanding. Generally, the collected data about users can contain incorrect values, outliers, missing values for some attributes, and incomplete records. Therefore, it needs preparation before the data anonymization process. The data preparation includes: removal of the outliers present in the data which are not appropriate for the analysis and can yield inaccurate results/analysis. Furthermore, it eliminates those records with unknown (i.e., missing) values from the user’s data. In some cases, the data processing model/algorithm needs user’s information in a specified format. Therefore, the appropriate formatting of the collected data, and enrichment if required for the subsequent steps is performed in the data preparation phase. With the help of data preparation/pre-processing, cleaned data can be obtained which contains complete information about each user, and it can be directly fed into the anonymization algorithm for anonymization.

C. Phase 3: Data Anonymization

Data anonymization is a practical and most widely used solution for protecting user’s privacy in data publishing. In tabular data, data anonymization sanitizes the QI’s original values to make information less specific for privacy protection and utility enhancements. In contrast, if users’ data is given in a G form, anonymization changes the graph structure to protect privacy of users and their associated SA without significantly decreasing anonymous graph’s utility. In addition, anonymization can be tailored with the data owner’s privacy requirements, legitimate information consumer’s utility needs, and objectives of the data publishing. The typical anonymization process includes following four main steps. All four steps are complementary, and can be employed to produce the anonymous table T' from an original table T or anonymous graphs G' from an original graph G .

1) Removal of Directly Identifiable Information from the Original Data

At the beginning of the anonymization process, any information that can identify someone directly/uniquely is removed from the data. For example, the name, social security number (SSN), email address, and cell phone number can be linked with someone’s real identity. Hence, such information is removed from the data before its anonymization. It can be removed at any stage, but the earlier removal can assist in saving computing power significantly.

2) Choice of the Anonymization Technique

Many anonymization techniques have been proposed and implemented for different scenarios. In this work, our focus is on the relational (i.e., tabular) data and SN (i.e., graphs) data anonymization. Therefore, we categorize the existing anonymization techniques into two categories: structural and relational anonymization. The former technique is applied to the graph data. In contrast, the latter technique is used for tabular data anonymization. Although some researchers have used the relational anonymization techniques for SN data anonymization. But, due to the significant differences in terms of the information contained in the graphs, relational techniques cannot be directly applied to the SN data. Therefore, the decision about the anonymization technique depends upon the original data representation (i.e., graphs or tables), and objectives of the data publishing.

3) Selection of the Anonymization Operation

Once the appropriate anonymization technique is chosen, the next step is to employ the relevant anonymization operation to distort the original data values or graph structure. The anonymization operation for each technique are different, as outlined earlier (Subsections II-A, II-B). The selection about the anonymization operation is dependent on the data type, anonymization technique, and privacy and utility objectives. For example, in relational anonymization, the generalization operation retains more semantics of the original data compared to the suppression operation. In contrast, suppression operation is highly appropriate for the user’s privacy protection compared to the generalization. In addition, due to the complex structure of a G , changes in the small portion of a G can drastically decrease utility of the G' . Therefore, the selection of the appropriate anonymization operation is made to effectively resolve the privacy and utility trade-off for real-world applications.

4) Enforcement of the Constrains (If Any) in an Anonymization Operation

Aside from the selection of appropriate anonymization technique and anonymization operation, some constrains can be enforced by the data owners during data anonymization. Such constrains can be about the privacy and utility thresholds, number of users in an equivalence class, distribution of the user’s private values in a class/cluster, and/or the number of connections between users (i.e., degree) in a G' . Such constraints are enforced during data anonymization process. These constraints can be suggested by the data owners or can be derived from the data statistics. In addition, the constraints can be employed by considering the adversaries capabilities and nature of the sensitive information contained in a T or G .

D. Anonymous Data Releasing/Publishing

The output of the anonymization process is the anonymous G' or T' depending upon the representation/style of original user’s data. This anonymous G' or T' is set to be made publicly available for the analytics. The recipients of the anonymous data can be analysts, researchers, data-miners, analytics firms, and third party applications. Before making the anonymous data publicly available, the data owners perform several checks to verify the user’s privacy protection and anonymous data utility levels, respectively. After the detailed checks, the decision about the data release is made by the data owners. In some cases, the data owners do not publish full anonymous data, instead, some parts of the anonymous data are published first. Later, the complete anonymous table or graph is published. In addition, the anonymous data can be shared only with the relevant institutes via emails or posts. However, the anonymous data is generally published over the Internet so that a large number of people can access the published data for the multi-facet analytics. Furthermore, the medium of data sharing and amount of data vary with the setting (i.e., interactive and non-interactive) of the PPDP. After releasing the data, the data owner has no control over the published data use and distributions. Meanwhile, it is assumed that published data will be used only for the intended purposes, and any problem will be explicitly reported to the data owners. In some cases, the data owners and data publishers can be two different parties. Data publishers release the anonymous data via their own mediums under the publishing agreements with the actual data owners/holders.

E. Phase 5: Published Data Analysis for Extracting Embedded Knowledge

Once anonymous data has been published, the intended recipients collect it for the analysis. In case of the hospitals data, medical students collect the data for their research. They can perform several kinds of test such as factors causing certain disease, common disease in a certain age-groups people, and symptoms of a particular disease. In addition, the banks data containing information about the loan return rating is valuable for the insurance companies. The SN data is suitable for the digital service providers and marketing firms. Recently, many companies are mining the SN data at large scale for fulfilling their scientific and business objectives. In addition, many third party applications are buying SN user’s data for fulfilling their intended objectives. The published data is not only for understanding the causes of some problems, but it can help in devising new rules and patterns that can be useful for marketing purposes. With the help of data mining algorithms, one can analyze the published data for actionable insights. Furthermore, users clustering and analysis is beneficial for recommendations, preferences mining, target marketing, information diffusion, information control, and information trustworthiness. Hence, data sharing has become a routine matter for some companies/organization due to the significant advantages in terms of improved decision making, policy enhancements, trends analysis, forecasting, predictions, and innovation.

Unfortunately, data publishing can jeopardize user’s privacy as adversaries can get copy of published data with aim to re-identify people uniquely by leveraging the large amount of auxiliary information obtained from external sources. These information can be gathered from many sources including users’ profiles from the SN sites, voter registration list, and mobility traces etc. Generally, the adversaries possess strong programming skills and tools knowledge. Therefore, with the help of auxiliary information, tools understanding, advanced programming skills, and understanding of anonymization methods enable adversaries to breach user’s privacy. Due to the advancements in ICTs, the scale and scope of the data breaches is expanding. In recent years, the adversaries are focusing for the users groups information theft for fulfillment of the intended objectives. The group identity theft can lead to the negative perceptions about certain ethnic group, discrimination based on the race/religion, and loan declining to group of people whose previous loan return rating was not satisfactory. Aside from the negative consequences of privacy breaches on the people’s life, data owners also lose their users’ trust in such circumstance. Therefore, users expect that data owners protect their sensitive information’s privacy during data release with the data-miners. Hence, the academicians and researcher are suggesting/devising new anonymization mechanisms to deal with this uprising social problem (i.e., user’s privacy protection without significantly impacting data utility/quality).

SECTION IV.

Relational Anonymization Techniques Used for Tabular Data Anonymization

The general concept of relational anonymization is to produce anonymous table T' from an original Table T . Given a T containing p QIs, single/multiple SA (s), and N users, the relational anonymization employs a privacy model/algorithm to modify the original values in such a way that user’s privacy can be protected while retaining significant utility in anonymous data for the analysis. The input to the relational anonymization is a tabular data and output is also a tabular data with modified values of user’s QIs, and shuffled/equally-distributed values of the SAs. A series of privacy models and algorithms have been proposed so for to anonymize tabular data containing user’s basic (i.e., QIs) and private information (e.g., SA). Four well-known and extensively used privacy models for relational data anonymization are summarized below.

A. k -Anonymity Privacy Model

The k -anonymity [12] privacy model is a well-known syntatic privacy model, and it has been extensively used in the tabular data anonymization. Due to the conceptual simplicity, this privacy model has attracted significant attention from the research community, and many variants of this model have been proposed by the researchers for data anonymity. The k -anonymity model [12] protects user’s privacy by placing at least k users in an equivalence class (EC) with same QI’s values. Hence, the probability of re-identifying someone from T' becomes {1}/{k} . A table T' satisfies k -anonymity if for every tuple/record t of T' there exist at least (k-1) other tuples with the same QIs in an EC. The value of k is chosen by the data owners depending upon table size and privacy protection level. The k -anonymity privacy model overview is shown in Figure 5. Primarily, the k -anonymity privacy model was devised for the protection of identity disclosure. Meanwhile, the k -anonymity privacy model is insufficient to protect the sensitive information disclosure as shown in ECs, C_{2} and C_{3} in Figure 5(b). The SA’s disclosure in these two ECs based on auxiliary information is 100%. Therefore, an advanced privacy model, named \ell -diversity was proposed to solve the shortcomings of the k -anonymity privacy model. The utility of the anonymous table T' produced by the k -anonymity model is relatively higher.

FIGURE 5. - 2-anonymity applied to the user’s relational data with six records.
FIGURE 5.

2-anonymity applied to the user’s relational data with six records.

B. \ell -Diversity Privacy Model

The \ell -diversity privacy model [59] was proposed to solve the k -anonymity model’s limitations. According to this model, an EC satisfies \ell -diversity property if there are at least \ell ”well-represented” values for the SA. A table T' is said to have \ell -diversity, if every EC of the T' is \ell -diverse. The \ell -diversity privacy model overview is shown in Figure 6.

FIGURE 6. - 2-diversity applied to the user’s relational data with four records.
FIGURE 6.

2-diversity applied to the user’s relational data with four records.

The table in Figure 6(b) is an example of 2-diverse partition of the table shown in Figure 6(a). Although, \ell -diversity privacy model provides superior privacy protection compared to the k -anonymity model by considering the diversity in SA’s values, but it does not consider the distribution of the SA’s values. Hence, it is prone to the privacy breaches in ECs in which one particular SA value is dominate over others (i.e., C_{1} ). For example, an EC C_{i} with ten users can satisfy 2-diversity property with SA’s values ratio of 1: 9 . Although, C_{i} is 2-diverse, but the SA values of someone’s can be learned with 90% accuracy. The similar case is presented in Figure 6(b), where the SA’s value, ’conservative’ disclosure is 75% with accurate mapping of someone based on the QI’s values. However, many variants of \ell -diversity model have been proposed to solve these limitations. In addition, \ell -diversity cannot be applied to the highly imbalanced (i.e., the data sets in which SA’s values distributions is not uniform.) databases. In addition, \ell -diversity privacy model degrades anonymous data utility significantly by not considering QIs distributions and similarities during anonymization.

TABLE 1 Detailed Comparisons of the Different PPDP Solutions Used for Relational Data Anonymization
Table 1- 
Detailed Comparisons of the Different PPDP Solutions Used for Relational Data Anonymization

C. t -Closeness Privacy Model

The t -closeness privacy model [60] is a syntactic privacy approach used for the tabular data anonymization. It was proposed to solve the limitations of both k -anonymity and \ell -diversity models in terms of privacy protection. A table T satisfies t -closeness if its records/tuples are split into ECs such that the distribution of SA in the whole T and the ECs of a t -close table T' are within t distance units of each other. The t -closeness privacy model suggests that the SA’s values distribution in any EC of T' differs from the overall SA’s values distribution in T by at most threshold t . The value of t can be determined by considering the protection level of the sensitive information and objectives of data publishing. The table shown in Figure 7(b) (adapted from [61]) is 0.278-close w.r.t disease SA. The table shown in Figure 7(b) is also 3-diverse (i.e., it contains three different values of SA in each EC.). The t -closeness privacy model significantly improves the user’s privacy, but it severely reduces the utility of the released data. All of the above three privacy models are among the most famous syntactic privacy models, and many variants of these models have been proposed to resolve their limitations in the PPDP. In addition, many algorithms as an extension of these three privacy models have been proposed to combat with some specific types of threats that emerge from data sharing.

FIGURE 7. - 
$t$
-closeness (where 
$t=0.278$
) applied to the user’s relational data with nine records.
FIGURE 7.

t -closeness (where t=0.278 ) applied to the user’s relational data with nine records.

There exist many attacks that are possible, and can be easily launched on the T' . We can classify those attacks into two categories: (i) background knowledge attack, and (ii) flaws in an anonymization methods. In the former category of privacy attacks, the adversary has some known information about the target victim. For example, the QIs of an individual or some other information about existence of someone’s data in a T' . The adversary uses such information to cause a privacy breach. In the latter category of privacy attacks, the flaws in an anonymization methods assist adversary in causing a privacy breach. For example, the homogeneity (i.e., no heterogeneity in SA’s values in an EC) attack of the k -anonymity privacy model, skewness attack (i.e., no considerations of the SA’s values’ distribution in an EC) of the \ell -diversity privacy model, and no semantic consideration (i.e., all disease information present in an EC belong to a same part/organ of a human body) in t -closeness privacy model lead to explicit privacy breaches in presence of the auxiliary information. All these attacks are practical, and can be launched on a T' . Hence, many improved variants of these three privacy models, and adversarial modeling based methods have been proposed to resolve these problems.

D. Differential Privacy Model

Differential privacy (DP) [18] is a well-known and mathematical definition-based privacy protection model. It is mostly used for privacy protection in an interactive settings of the PPDP. It protects the privacy of user by adding noise to the original user’s data and it does not make assumptions about the intruder scenarios. The DP belongs to the semantic class of privacy models, and it yields superior privacy protection in PPDP compared to the syntactic privacy models. Considering the effectiveness of the DP model, U.S. census Bearu is planning to use the DP in their 2020 census, and all future data products [62]. It has been reported in the literature that DP provides a mathematically provable guarantee on privacy preservation against many privacy attacks such as differencing, linkage, and reconstruction attacks. DP concept can be defined as: given a dataset D_{1} , and a neighbour dataset D_{2} . Both data-sets differ in one and only one record, defined as ||D_{1}|-|D_{2}|| = 1 . In addition, a random function F with a range S , defined as S \subseteq Range(F) , satisfies DP if \begin{equation*} Pr[F(x) \in S] \leq exp(\epsilon)Pr[F(y) \in S] + \delta \tag{1}\end{equation*} View SourceRight-click on figure for MathML and additional features. In equation 1, variable x, y \in D_{1}, D_{2} ; \epsilon is a parameter; \delta indicates a degree of the relaxation, and the probability is taking over the randomness of function F . In equation 1, if \delta = 0 then F satisfies \epsilon -differentially private. In case of the query response (R_{s} ), the probability Pr of DP model will be.\begin{equation*} \frac {Pr (A(D_{1})=R_{s})}{Pr (A(D_{2})=R_{s})} \leq e^{\epsilon } \tag{2}\end{equation*} View SourceRight-click on figure for MathML and additional features. In equation 2, A represents an anonymization algorithm, \epsilon is a parameter, and its value is chosen by the data owners. Generally, the smaller value of \epsilon is suitable for better privacy preservation in the PPDP.

In DP method, noise is added using the Laplace mechanism. DP mathematically ensures that any analyst/data-miner seeing the result of a differentially private analysis will make the same conclusion about any individual’s private information, whether or not that individual’s private information is included in the input to the analysis. The overview of the DP is presented in Figure 8. Due to the strong privacy guarantees, the DP model has been extended by many studies for further improvements.

FIGURE 8. - Functional overview of the differential privacy (DP) model used in the PPDP.
FIGURE 8.

Functional overview of the differential privacy (DP) model used in the PPDP.

Many latest studies adopted DP concept for resolving the privacy issues in the both interactive and non-interactive settings of the PPDP. There exist numerous anonymization techniques which are ramifications of the four privacy models explained above. We summarize the approaches that extended the concepts of these four models in Table 1. We categorize the anonymization approaches into four main categories: (i) the k -anonymity model and its ramifications, (ii) the \ell -diversity model and its ramifications, (iii) the t -closeness model and its ramifications, and (iv) the DP model and its ramifications.

We compare the existing approaches with each other based on six different parameters. The abbreviation used in Table 1 are: ID = identity disclosure, AD = attribute disclosure, MD = membership disclosure, IL = information loss, Sy. & Se. = syntactic and semantic, Bk = background knowledge, CoD = curse of dimensionality, Pr. = probability, SA = sensitive attribute, MSA = multiple sensitive attribute, LA = linkage attack, DMAR = discovering and maintaining association rules, DA = data analysis, ECs = equivalence classes, GPS = global positioning system, and QIs = quasi identifiers.

Some studies have used the combinations of more than one privacy models (e.g., k -anonymity and \ell -diversity) to protect the user’s privacy in data publishing. Majeed et al. [104] extended the k -anonymity model to effectively resolve the privacy and utility trade-off in the PPDP. The proposed scheme performs adaptive data generalization considering both the vulnerability of the QIs and the diversity of the SA to anonymize data. Jordi et al. [105] suggested that t -closeness can be extended to the DP model when t = exp(\epsilon) . The authors proposed a method for achieving both t -closeness and DP, respectively. As a result, higher utility was retained in the anonymous data compared to the noise addition methods. Rasool et al. [106] used the k -anonymity concept in the clustering process to produce anonymous data set for publishing. The proposed SBC algorithm significantly reduces information loss and retains better semantics of the original dataset. Recently, the k -anonymity concept has been extended in combination with entropy concept to protect the users’ groups privacy in data publishing [107]. The proposed anonymization method effectively resolves the users’ groups privacy issues stemming from the low diverse ECs, and highly susceptible QIs present in a person-specific dataset.

E. Metrics Used for the Evaluation of Privacy and Utility of Relational Data Anonymization Techniques

1) Privacy Evaluation Metrics

There exist multiple ways to quantify the privacy protection offered by an anonymization algorithm. The five common methods that are employed to evaluate the effectiveness of any anonymization algorithm in terms of privacy protection are: (i) anonymous and original dataset linking and calculating the probability of successful matches based on the QI values in both these datasets. The probability value tells the amount of privacy protection an anonymization algorithm will offer during published data analysis. In this case, it is assumed that attackers may have access to the excessive amount of auxiliary information from some external sources (i.e., voter list, online repositories, e-commerce sites etc.) to launch identity and attribute disclosure attacks; (ii) privacy protection evaluation in presence of the background knowledge. In this privacy evaluation metric, the data owner assumes that attacker may possess the true information (i.e., age and gender of a Bob) about some users and he/she can explore only the particular ECs to infer private information of relevant user/users. To evaluate effectiveness of algorithm in this regard, the data owner can pick some instances from the original data and can evaluate the anonymization algorithm privacy protection level by matching; (iii) privacy protection evaluation with the help of privacy-sensitive (PS) rules. In this case, the data owner can construct certain rules to evaluate privacy protection. For example, how many people having age >40 suffer from this disease(i.e., cancer). The sensitive knowledge pattern revelation, and attribute and identity disclosure of multiple users through PS rules have a wide range of negative consequences on people’s life; (iv) prediction about the users SA through the existence of private and public profiles in SNs. In this case, the data owner can quantify the amount of protection level of an algorithm assuming that either partial or full user’s original data is known to the attacker as some users willingly publish their QIs over different SNs. The factual information that can be learned through the knowledge gained from the D to invade unknown users’ privacy can be used to evaluate an anonymization effectiveness; (v) privacy protection evaluation in the presence of malicious users’ in a dataset. In this case, the data owner can classify some of the tuples as malicious and can calculate the similarity with other (i.e., non-malicious) users to quantify the privacy protection level. In some cases, the sensitive queries and corresponding private information budget is also used for the evaluation of anonymization algorithms/models.

2) Utility Evaluation Metrics

During tabular data anonymization, the original QI’s values are modified to fulfill the privacy needs, hence, the data utility degrades. There exist multiple ways to quantify the anonymous data utility offered by an anonymization algorithm. We classify the metrics used for measuring anonymous data utility into two categories: special purpose and general purpose metrics. The special purpose metrics use machine learning methods to measure the anonymous data quality. The most widely used special purpose metrics are, accuracy or error rate, F -measures, precision, and recall. The general purpose metrics measure the information loss caused by modifying the original data. The most popular general purpose utility evaluation methods are, weighted certainty penalty, generalized information loss (GenILoss) , discernability metric, minimal distortions, average equivalence class size (C_{AVG}) , KL -divergence, granularity, query accuracy, global loss penalty (GLP), normalized mutual information (NMI), relative error (RE), and information theocratic metrics (ITM). The comprehensive details about these utility metrics can be found in the recent studies [5], [9], [35], [50].

F. Privacy Preserving Dynamic Data Publication

Privacy preserving dynamic data publication (PPDDP) allows organizations to publish a dataset multiple times. PPDDP enables organizations to share up-to-date data with multiple recipients statically or after some modifications (i.e., update, delete or insert). In contrast, privacy preserving static data publication (PPSDP) focused only on one time publication of a dataset, and many approaches have been proposed for the PPSDP in literature [12], [59], [60]. Meanwhile, PPDDP opens a new era in PPDP research, and many organizations are publishing their users data dynamically. Kabou et al. [108] presented a comprehensive survey about the PPDDP, and summarized different studies used for the PPDDP. Shi et al. [109] presented a method for PPDDP using distance and information entropy concepts. The proposed method ensures that individual privacy is preserved after the data has been subjected to multiple releases. The m -invariance privacy model [110] was proposed to limit the risk of privacy disclosure in data re-publication. The proposed approach jointly uses the m -invariance and counterfeited generalization concepts to solve the PPDDP problem. Anjum et al. [111] proposed a \tau -safety privacy model for the PPDDP. The proposed approach performs better in the presence of external and internal updates. Zhu et al. [112] proposed a \tau -safe (\ell, k) -diversity privacy model for sequential publication. The proposed privacy model guarantees that each record’s signatures/values keep consistency or have no intersection in all data releases. This model can be applied to the data in which individual has multiple records. The proposed algorithm performs better compared to the m -invariance and \tau -safety privacy models. Considering the widespread applications/uses of the privacy preserved published data, the PPDDP has become an emerging area of research in recent years.

G. Challenges in the Relational Data Anonymization

The representation of tabular data is relatively simpler than the SN data. In relational data, each row/tuple represents one real world entity/individual. We identify following five challenges that make the relational data anonymization challenging. First, selection of the user’s attributes that are regarded as QIs. In relational data, a small subset of the users’ attributes are chosen as QIs. However, some QIs can behave as SA (i.e., profession) in practice. Thus, appropriate selection of QIs prior to data sanitization is imperative. Second, over reliance on the custom made QI’s values generalization taxonomies. The relational anonymization is performed with the help of pre-defined taxonomies of the QIs. Meanwhile, these taxonomies do not truly reflect the privacy and utility trade-off. Thus, devising the appropriate taxonomies with in-depth analysis of QI’s domain values in relational anonymization is challenging. Third, tabular data anonymization in multiple SAs (MSAs) scenarios. In some cases, the tabular data contains multiple SAs about individuals. Anonymizing the tabular data having MSAs is harder compared to the single SA scenarios due to the high risk of user’s private information disclosures. Fourth, quantifying the impact of QIs on both user’s privacy and anonymous data utility. Since each QI affects the privacy and utility differently, and some QIs can be highly vulnerable in terms of privacy and some QIs have higher utility. Thus, quantifying each QI’s statistics related to privacy and utility is very challenging in the PPDP. Fifth, accurate estimation of the subjects’ re-identification risk. The existing PPDP methods often assume worst-case scenarios regarding attackers, and apply heavy changes in the data that impact the shared data utility and often limit data sharing on a wider scale. Thus, accurate and novel adversarial modeling methods are needed to support PPDP in big data era. Aside from the five potential challenge explained above, in some cases, an adversary can possibly link the whole published dataset (e.g., T' ) by leveraging the auxiliary information [113]. Hence, devising algorithms/method which can estimate the users’ privacy disclosure risk as accurate as possible during published data analytics is a vibrant area of research.

SECTION V.

Structural Anonymization Techniques Used for the Social Networks Data Anonymization

Structural anonymization refers to the modification in the structural properties of the social network (SN) data (i.e., graphs) to protect the privacy threats that emerge from SN data publishing. Generally, the SN analysts represent the SN data mainly via two methods: metrics and graphs [114]. The matrices representation of the SN data allow the application of computer tools and mathematical models to summarize and extract patterns. Since finding relevant patterns from the dense graphs is extremely complex. The adjacency matrix is used as the variant of the graphs in social network analysis. For instance, a G with n users can be modeled as an adjacency matrix M of size n \times n . In an adjacency matrix, the relationship between two users i and j can be represented by the value (< 0, 1 > or < y,n> ) in a cell i, j . To represent various forms of users’ data and to model the structural properties of SNs, a G can have their nodes and edges labeled or unlabeled, undirected or directed, weighted or unweighted as presented in Figure 9.

FIGURE 9. - Overview of the social network representation using different forms of graphs and a matrix.
FIGURE 9.

Overview of the social network representation using different forms of graphs and a matrix.

Due to the inefficacy in handling the complex SN data, and inability to correctly represent the SN with users’ attribute information, matrices are rarely used to represent the SN data. In contrast, graphs can represent the users’ attribute information properly along with social relations. Therefore, graphs are most widely used in SN analysis, and assist effectively in proving graph-based theorems. This work uses SN data modeled as a graph for further analysis and discussions. Users’ privacy preservation in SN data publishing is very challenging compared to the relational data due to the more pieces of private information contained in a G . The pieces of information concerning user privacy in a social graph data are summarized in Table 2. The SN users need privacy protection for most of the pieces of their private information shown in Table 2. In contrast, the relational data mainly contains four pieces of information about the users: (i) users’ QIs, (ii) users’ SA, (iii) micro-statistics (i.e., SA’s particular value shared by less number of people in a dataset) about users, and (iv) macro statistics (i.e., SA’s particular value shared by large number of people in a dataset) about the users. In addition, due to the availability of user’s profiles information on the SNs sites and accounts on multiple SNs, the SN users privacy can be compromised easily compared to the tabular data.

TABLE 2 Description about the Pieces of Information Related to User’s Privacy in SNs Data (Ref. [39])
Table 2- 
Description about the Pieces of Information Related to User’s Privacy in SNs Data (Ref. [39])

The background knowledge (BK) is the fact or an information known to the adversaries about an individual or group of individuals, which can be exploited to infer the SA of an individual(s) from the G' . The BK can be acquired from different sources, and its degree purely depends upon the adversaries’ capabilities and technical knowledge. In practice, it is very difficult to quantify the level of the BK possessed by the adversaries, and many existing algorithms assume certain pieces of information as BK while anonymizing user’s data. The BK types with sufficient details and examples are explained in literature [37], [39], [41]. In Table 3, we summarize the most recent types of the BK that are used by the adversaries to jeopardize SN user’s privacy during published graphs analytics.

TABLE 3 Types of the Background Knowledge (BK) Used by the Adversaries to Jeopardize User’s Privacy in the PPGP
Table 3- 
Types of the Background Knowledge (BK) Used by the Adversaries to Jeopardize User’s Privacy in the PPGP

The most common technique used for SN users privacy preservation in the PPGP is anonymization. After in-depth synthesis of the literature [114]–​[116], we present the taxonomy of the PPGP approaches along with representative anonymization methods employed for SN data in Figure 10. These PPGP approaches can be broadly classified into five categories, namely graph modification techniques, graph generalization/clustering techniques, privacy aware graph computation techniques, differential privacy based graph anonymity techniques, and hybrid anonymization techniques. Brief description along with relevant examples about all five anonymity techniques used in the PPGP is given in subsequent paragraphs.

FIGURE 10. - Taxonomy of privacy preserving graph publishing (PPGP) approaches used for SN data.
FIGURE 10.

Taxonomy of privacy preserving graph publishing (PPGP) approaches used for SN data.

Due to the widespread applications of the SNs data, many structural anonymization techniques have been proposed for the PPGP. These techniques modify the structure of the SN graph by adding/deleting vertices or edges to preserve the user’s privacy. Aside from the add/delete, in some case, edges and vertices are switched or re-arranged in clusters to preserve user’s privacy. The overview of anonymous graphs obtained by adding vertices and edges is shown in Figure 11. In Figure 11(b), two new edges have been created (\{v_{1}, v_{3}\} and \{v_{2}, v_{4}\} ). Similarly, in Figure 11(c), two new nodes (\{v_{7}, v_{8}\} ) with four edges (\{v_{7}, v_{3}\} , \{v_{7}, v_{2}\} , \{v_{8}, v_{1}\} and \{v_{8}, v_{2}\} ) have been added in the anonymous version of a G .

FIGURE 11. - Examples of anonymizing graph by adding edges and vertices.
FIGURE 11.

Examples of anonymizing graph by adding edges and vertices.

The addition/deletion of the nodes and edges can be constrained or random (e.g., non-constrained) depending upon the scenario.

An example of the original graph G anonymization through both constrained and random perturbation techniques taken from study [30] is shown in Figure 12(i), (ii). Figure 12(i)(b) shows an example of a perturbed version of the network shown in Figure 12(i)(a) by Rand add/del operation. In this example, the two edges (\{v_{1}, v_{5}\} and \{v_{2}, v_{3}\} ) have been removed and two new edges (\{v_{6}, v_{7}\} and \{v_{8}, v_{9}\} ) have been added to produce the anonymous graph G' . Meanwhile, the perturbed version of the graph shown in Figure 12(i)(c) is obtained from the Rand switch operation. In this example, two edges (\{v_{1}, v_{2}\} and \{v_{4}, v_{5}\} ) were switched to (\{v_{1}, v_{4}\} and \{v_{2}, v_{5}\} ) to produce the anonymous graph G' . In the random perturbation (e.g., non-constrained), there is no hard constraints regarding the edges addition/deletion/switching.

FIGURE 12. - Original graph anonymization by using random (e.g., non-constrained) and constrained anonymization methods.
FIGURE 12.

Original graph anonymization by using random (e.g., non-constrained) and constrained anonymization methods.

In contrast, in the constrained anonymization, the nodes/edges addition/deletion is bounded by some constrains (i.e., degree). Figure 12(ii)(b) shows an example of the perturbed graph G' obtained by applying edge modification concept on an original graph G given in Figure 12(ii)(a). The perturbed graph is k -degree anonymous, where k = 2 . The original graph G has a degree sequence d(G) = \{2, 4, 2, 1, 3, 2, 2, 2, 2\} , while the modified graph G' given in Figure 12(ii)(b) has degree sequence d(G') = \{2, 3, 2, 2, 3, 2, 2, 2, 2\} . The number of vertices and edges are same in both graphs, and the anonymous graph is 2-degree anonymous (i.e., each vertex has at least 2-edges). The single modification in a G by modifying edge (\{v_{3}, v_{2}\} to \{v_{3}, v_{4}\} ) has made the G two anonymous. The constrains value can be adjusted considering the protection level and graph structure. Another example of the 2-degree anonymous G' by adding two edges (\{v_{4}, v_{10}\} and \{v_{5}, v_{10}\} ), and one vertex (v_{10} ) is shown in Figure 12(ii)(c). The degree sequence of G' becomes d(G') = \{2, 4, 2, 2, 4, 2, 2, 2, 2, 2\} and number of vertices and edges increase by one and two, respectively. In the constrained perturbation, addition/deletion of the nodes/edges follow some criteria (i.e., degree, closeness, and clustering co-efficient etc.), and further addition/deletion of the nodes/edges stop once the defined criteria is satisfied.

There are six basic edge and vertex modifications techniques for SN data anonymization [44]. The modifications techniques are: (i) edge add, (ii) edge delete, (iii) edge add/del, (iv) simple edge switch, (v) double edge switch, and (vi) node addition. Most of the existing SN data anonymization methods use one (or more) of these six modifications techniques during graph anonymization. A detailed taxonomy of the graph modification techniques is given in study [53]. The four types of graphs that are mainly used to represent the SNs users’ data are: simple graph, bipartite graph, labelled graph, and uncertain graph.

In SN data anonymization, majority of the approaches are driven from the concepts that were proposed for anonymizing tabular data. For example, the k -anonymity model and its variants such as k -degree anonymity, k -isomorphism anonymity, k -automorphism anonymity, k -candidate anonymity, k -neighborhood anonymity, and (k,\ell) -grouping have been adapted to anonymize SN data. The generalization/clustering based approaches anonymize SN data by partitioning it into different clusters, and generalizing the clusters into super nodes/edges [117]. The concept of these approaches after clustering is analogous to the EC generalization of the relational data. Moreover, the cluster sizes and generalization degrees are determined in a way that maximal information is retained in the clustered network (i.e., G' ). The conceptual overview of the generalization/clustering based anonymization is presented in Figure 13.

FIGURE 13. - Original graph anonymization by employing clustering/generalization based method.
FIGURE 13.

Original graph anonymization by employing clustering/generalization based method.

A network/graph with seven nodes and two QIs (age, gender) is provided as an input (Figure 13 (a)), whole network is partitioned into three clusters (c_{1}, c_{2},c_{3} ) based on the QI’s similarities (Figure 13 (b)), and a corresponding generalized network with three super nodes is obtained as an output (Figure 13 (c)). We used three distinct symbols (e.g., square, circle, and triangle) to denote the users in each cluster and super nodes, respectively. The two numbers in each super node represent the cluster size (e.g., number of users) and intra-cluster edges. For example, in cluster c_{3} there are two users but there is no edge between them. Therefore, the value inside the triangle (a.k.a super node) is (2, 0). The weighted edges between super nodes represent the inter-cluster edges.

On the contrary, the privacy-aware graph computation and DP based approaches do not release the entire G' like previous two techniques (e.g., graph modification and graph generalization/clustering techniques) [30]. These approaches perform computations on the original graphs G , and yield output of an analysis computation. Compared to the previous two techniques, these approaches allow constrained analysis on the SNs data that can limit the widest range of applications for knowledge extraction and data mining. The DP concepts have been tailored with the structural properties to anonymize graphs data, and DP is regarded as one of the best privacy-aware graph computation techniques [30]. The DP based approaches in SN data anonymization have been classified into three categories, node-level DP, edge-level DP, and node and edge level DP. These techniques compute the useful statistics from an original graph in such a way that individual’s privacy is preserved, and data remains useful for the analytical purposes.

The useful analysis provided by privacy-aware graph computation and DP based approaches are: graph density, edges count, relationships degree, degree distributions, size of the network, centralities, closeness, counts of the sub graphs, top k -users with highest degree in a network, distance/similarity between users, path length, clustering coefficients, community discovery/extraction, hypergraphs, joint degree distribution, cuts, number of users with degree d , aggregation, projections, and sparse and dense segments of the graphs, to name a few. These valuable statistics can be utilized for range of applications including social network analysis, marketing, preference mining and analysis, collaborative filtering, epidemiological investigation, and information spread/contagion etc. A sample of the graphs’ statistics computed with the help of node DP techniques [118], [119] and minimum spanning tree (MST) DP [120] approaches is presented in Figure 14.

FIGURE 14. - Overview of the graphs’ statistics determined by privacy-aware graph computation methods.
FIGURE 14.

Overview of the graphs’ statistics determined by privacy-aware graph computation methods.

Aside from these basic statistics, applying node/edge DP approaches for publishing graph cuts and pair-wise distance between nodes are handy for data-driven applications. Furthermore, many \ell -diversity and t -closeness variants have been proposed by the researchers to solve the content disclosure problems in the PPGP. A comprehensive details about these variants with definition is elaborated in studies [33], [55], [121]. Despite the success of existing PPGP mechanisms, preserving privacy in dynamic SNs is still an important research direction which has received significant attention from the research community recently.

The hybrid anonymity approaches usually employ more than one anonymity technique to yield anonymized SN data [122]. However, the complexity of the hybrid anonymity approaches is relatively higher when G contains substantial number of nodes and edges. Hence, hybrid anonymity techniques are used only in specific scenarios involving SN data. We summarize the recent structural anonymization approaches used for SN data (e.g., graphs) in Table 4. The abbreviation used in Table 4 are: EM = edge modifications, PPGP = privacy preserving graph publishing, PUT = privacy-utility trade-off, CC = computational complexity, SNs = social networks, SA = sensitive attributes, Bk = background knowledge, GP = graph publishing, GA = graph anonymization, EC = equi-cardinal, and IL = information loss. Just like relational data, many privacy and utility evaluation metrics have been proposed to access the performance of SNs data anonymization mechanisms.

TABLE 4 Detailed Comparison of Recent Structural Anonymization Approaches Used for the Social Networks Data (i.e., Graphs Data) Anonymization
Table 4- 
Detailed Comparison of Recent Structural Anonymization Approaches Used for the Social Networks Data (i.e., Graphs Data) Anonymization

A. Metrics Used for the Evaluation of Privacy and Utility of Structural Anonymization Techniques

1) Graphs Utility Evaluation Metrics

The existing well-known graph utility evaluation metrics are: (1) degree (Deg.), (2) effective diameter (ED), (3) joint degree (JD), (4) local clustering co-efficient (LCC), (5) path length (PL), (6) closeness centrality (CC), (7) eigen vector (EV), (8) betweenness centrality (BC), (9) network constraints (NC), (10) network resilience (NR), (11) infectiousness (Infe.), (12) page rank (PR), (13) hub score (HS), (14) global clustering co-efficient (GCC) or global transitivity (GT), and (15) authority score (AS). Aside from these well-known metrics, some general purpose approaches such as accuracy, classification, information loss (IL), ratio of top influential users (RRTI), query response, and graph spectral properties have also been used to measure anonymous graph’s utility. Furthermore, there exist seven application-specific utility evaluation methods which are, (1) role extraction (RX), (2) reliable email (RE), (3) minimum sized influential nodes set (MINS), (4) influence maximization (IM),(5) community detection (CD), (6) source routing (SR), and (7) sybil detection (SD). We refer interested reader to the previous work [123] for more detailed definitions and descriptions of the above metrics.

2) Graphs Privacy Evaluation Metrics

The most commonly used privacy metrics in evaluating the performance of graph anonymization algorithms are, (1) the number of re-identified nodes, (2) degree of uncertainty, (3) adversary’s success rate, (4) query’s error, (5) information gain, (6) amount of leaked information, (7) information

surprisal, (8) privacy score, (9) association rule hiding (ARH), (10) distribution leakage, (11) prior information belief and posterior information belief, (12) entropy leakage, (13) probabilistic anonymity, (14) downgrading classifier effectiveness, (15) membership inference analysis, and (16) accurate predictions. Wagner et al. [124] summarized eighty privacy evaluation metrics used by different PPDP algorithms. The authors categorized the privacy metrics based on four common characteristics which are, (i) adversary models, (ii) data sources (i.e., auxiliary information), (iii) input for computation of metrics, and (iv) output measures. The selection of the metric depends on the data type, objectives of data publishing, and target application/users. However, in most cases, a single metric cannot capture the entire concept of privacy, therefore, two or more metrics are jointly used to measure the level of privacy offered by an anonymization algorithm/model. Recently, Zhao et al. [125] analyzed and discussed twenty-six different metrics used for privacy evaluation in anonymized graph. The authors suggested that no single metric is effective to evaluate privacy protection in anonymous graphs. Therefore, the authors suggest that the strengths of multiple privacy evaluation metrics can be combined to improve the overall measurement of user’s privacy in a G' . This work employs multi-criteria decision analysis to the privacy measurement in a G' , and it opens up a new research direction that may lead to significant improvements in future for privacy measurement.

B. De-Anonymization Methods Employed By the Adversaries to Jeopardize Users Privacy in SN Published Data

Due to the phenomenal growth in SNs adoption around the globe, the SN data has become more reliable for conducting research and achieving multiple business/scientific objectives. The data-miners and analysts can extract the enormous amount of information embedded in the published G' . Aside from collecting the relevant information regarding some custom rules or desired communities from the published graph, the data-miners try to reveal true identities/private-information of the users for fulfilling multiple hidden objectives such as personalized service recommendation, user’s profiling, and preference based digital contents selling. Interestingly, in some cases, the embedded knowledge extraction enables firms to be competitive in the market for long-run. Due to increase in information surges, availability of the various SNs, maturity of the machine learning and data mining tools, advancement in computing technologies, and attacker capabilities have made the personal information retrieval much easier. Due to the public access of the SNs and lack of user’s awareness about the online privacy, the protection of privacy on the SN sites is very challenging. Therefore, it has become an active area of research in recent years. Adversaries are not only able to get multiple users information but also they can re-identify people uniquely with the help of multiple auxiliary sources. There exist several de-anonymization approaches in literature that were employed on the published graphs to infer the true identity or SA of the SN users. For instance, Azizy et al. [169] summarized and classified various de-anonymization approaches used by the adversaries. We summarize the recent de-anonymization approaches with examples in Figure 15.

FIGURE 15. - Description about de-anonymization approaches used by the malevolent adversaries for privacy breaches.
FIGURE 15.

Description about de-anonymization approaches used by the malevolent adversaries for privacy breaches.

Aside from the de-anonymization approaches explained above, Beigi et al. [170] summarized various seed-based and seed-free de-anonymization methods employed by the adversaries for privacy breaches in published SN data. Authors provided a detailed overview of latest SN data anonymization and de-anonymization approaches, and theoretical analysis about the graph’s de-anonymization. In addition, various user’s attributes inference methods have been reported by the authors with examples. Apart from the de-anonymization approaches explained in Figure 15 and previous studies, we summarize the key items that enable unique identifications of an individual/groups or SA disclosures from the privacy preserved graph data publishing in Table 5.

TABLE 5 Description About the Key Items That are Exploited by the Adversaries to Breach User’s Privacy
Table 5- 
Description About the Key Items That are Exploited by the Adversaries to Breach User’s Privacy

These items are usually exploited by the adversaries to jeopardize user’s privacy in SN published data to infer/predict true identities of users or their SAs. Each item assists in compromising an individual or community privacy when exploited by the adversaries. Various approaches based on these items/features have been proposed to compromise people’s privacy by leveraging the single or multiple SNs’ data (a.k.a across SNs). Drawing on the reviewed graph de-anonymization techniques, majority of the techniques enable adversaries to compromise users’ privacy successfully.

The privacy items and corresponding de-anonymization methods summarized in Table 5 pose serious threats to the SN user’s privacy in the PPGP. Apart from the well-known methods summarized above, the user’s privacy can also be breached through the information acquisition about the changing interest of a user overtime and predictions about the user’s private information through side channels of various types (i.e., interests). Recently, due to availability of the excessive amount of auxiliary information and advanced data mining tools, the scale and the scope of privacy breaches is expanding from an individual identification or SA disclosure to groups identity theft for achieving multiple scientific and business objectives, and identifying communities in SNs having common characteristics or common interests for accurate recommendations.

Furthermore, when a group of SN users form an online community, the SN service providers have access to more information including political viewpoints, preferences, relationship status or financial status because a user often readily shares more about him or herself in an online community. Accordingly, this goldmine of data when shared with the data-miners can lead to privacy breaches. In addition, social connection information prediction about a user with advanced data mining tools, and identifying an individual and users groups by contents analysis and activities has become serious challenge for SN service providers [122]. Hence, SNs users data sharing can endanger user’s privacy in unexpected ways, and researchers are devising many domain and attacks specific, and general PPGP methods to combat with this uprising social problem (e.g., safeguarding SN users privacy).

C. Structural Anonymization Approaches Used for Application-Specific Scenarios in SN Data Publishing

In Table 4, we summarized the generic structural anonymization approaches used in the PPGP. In this subsection, we summarize the recent structural anonymization approaches used in application-specific scenarios of SN data publishing. These scenarios include, anonymization approaches for friends recommendation, community clustering/detection, collaborative filtering, topic modelling, and SN’s users behavior analysis etc. Aside from the structural modifications in graphs, the anonymization approaches used in application-specific scenarios also consider the features of the applications for which the data is being anonymized. For example, Xu et al. [29] proposed a framework for discovering the privacy-preserved communities in the SNs. The proposed framework enables the formation/detection of different communities without revealing sensitive link information. In this work, we summarize the state-of-the art structural anonymization approaches used in application-specific scenarios of the SN in Table 6. The anonymization approaches summarized in Table 6 consider the features of the related application during anonymization process. Furthermore, some approaches have been used for multiple/heterogeneous applications due to overlapped properties/characteristics between them.

TABLE 6 Description About the Anonymization Approaches Used in SN Application-Specific Scenarios
Table 6- 
Description About the Anonymization Approaches Used in SN Application-Specific Scenarios

Recently, due to the phenomenal growth in opportunities offered by SNs data, researches have turned their attention to devise new and realistic anonymization methods leveraging advanced artificial intelligence (AI) techniques. Li et al. [226] designed a deep learning (DL) model that combines multiple factors such as attribute information, graph structure, and behaviour characteristics while avoiding tedious calculation procedures to measure the privacy in SNs. The proposed model considers the deep relationship between all three factors (user attributes information, graph structure, and behaviour characteristics) to accurately obtain the privacy score.

Alemany et al. [227] devised two metrics (Audience and Reachability) for assessment of privacy in information sharing scenario in the SNs. The authors performed rigorous simulations in different SN topologies and considering different layers, and concluded that network topology in SNs has a direct effect on the outreach of the information. Pensa et al. [228] described a knowledge-driven approach for enhancing privacy awareness in SNs. The proposed approach has ability to measure the privacy risk of the users and inform them whenever their privacy is breached or at risk. Also, it helps the exposed users to customize their privacy level semi-automatically by limiting the number of manual operations. Li et al. [229] suggested that private information in the SNs sites is time-sensitive, which means that information held by the users who are no longer in the same environment/place as the target user may no longer be reliable/true and have lost its value. The authors combined behavioral characteristics and structural similarity to accurately filter the user groups who hold the current private information of a target user for measuring a user’s privacy status. Ruggero et al. [230] suggested that user’s privacy in SNs can be influenced by many external factors (e.g., the position of the user within the social graph, the relative risk of the network). To solve this problem, the authors devised a network-aware privacy score metric that accurately measure the user’s privacy risk according to the characteristics of the network (i.e., G ).

A semi-supervised framework based on structural embedding for account correlation was proposed by Zhou et al. [231]. It learns the latent and structural semantics for accounts correlation between networks. It correlates accounts with high accuracy by leveraging the semantic information among accounts through random walks approach. Furthermore, the user’s identity disclosure/matches problems with limited profile items have also been reported in the literature [232], [233]. In recent years, federated learning (FL) based privacy preserving approaches have been proposed for anonymous data publishing with legitimate third-parties [234]–​[238]. Furthermore, many anonymization approaches have been proposed to preserve the users’ privacy in sequential publication of the SNs data [239]–​[241].

D. Challenges in Social Network Data Anonymization

SN data is usually represented as a graph, and structural anonymization approach is applied to sanitize it before releasing with data-miners. Anonymizing SN data is much more challenging compared to tabular data due to complex structure and variety of information embedded in graphs about the entities (i.e., users). The three well-known challenges related to SN data anonymization are, (i) modeling background knowledge (BK) of the adversaries (in SNs, the appropriate modeling of the adversaries’ BK is harder compared to the tabular data because adversaries can have access to the multiple pieces of information about an individual, and he/she can re-identify target individual from the G' by leveraging the BK.), (ii) devising a new structural anonymization method (devising a new anonymization method for SN data is very hard compared to the tabular data due to the structural dependence of entities on each other. In SN, a slight modification in the graph structure can affect the whole network. Hence, the adhoc solutions based on ”divide and conquer” approach cannot be directly applied on the SN data), and (iii) quantifying the utility of the anonymous graph (in SN data, measuring the usefulness offered by an anonymous graph is not straightforward. The differences in G and G' properties are difficult to quantify. In addition, by adding new edges and vertices to increase privacy protection can often lead to excessive information loss in the PPGP.). Aside from the three key challenges explained above, the structural anonymization of massively large scale graphs involving many entities data is very challenging. Furthermore, in SN, amount and variety of data collected about entities increase exponentially with the passage of time. Thus, devising new structural anonymization approaches to solve these problems from both practical and theoretical perspectives have become imperative while benefiting from the SNs users data.

SECTION VI.

Summary and Discussion About the Privacy Issues in Future Computing Paradigm

In this article, we have covered most of the concepts related to the anonymization approaches used for data owned by both physical organizations (e.g., hospitals, banks, and insurance companies etc.) and virtual platforms (e.g., Facebook, Twitter, and Link-din etc.). Specifically, we described the anonymization methods employed for two types of input data, tables and graphs. We emphasized more on the SN data anonymization considering the exponential adoption of the SNs around the globe by adults, and unprecedented opportunities these platforms offer in terms of business intelligence. Furthermore, the data-driven technologies and big data analytics are playing a vital role in extracting embedded knowledge from unstructured data to improve SQ [242].

Apart from these two types of data (e.g., tables and graphs), a wide range of data types such as matrix (e.g., trajectories information, market basket data, and ratings data), digital traces, logs, documents (e.g., medical prescriptions, disaster/disease control agencies data), images, videos, text documents (e.g., reviews, blogs, and opinions), and time series data can reveal private information in digital landscape. Hence, enterprises and organizations are constantly exploring new innovative strategies and methods to remain competitive in their market while ensuring users’ privacy [243]–​[247]. Nevertheless, data policies including privacy, intellectual property, security, and liability issues, should be addressed in all phases (e.g., collecting, pre-processing, anonymizing, sharing, and analytics) of person-specific data handling in order to exploit big data value. Decentralized anonymization methods are handy solutions to truly benefit from the data publishing without significantly impacting user’s privacy [248], [249]. Nowadays, the AI techniques have become significantly mature to assist in data-driven decision making, users’ privacy protection has become imperative for most organizations’ success [250]. Considering the applications and unprecedented opportunities of data sharing, the privacy and utility trade-off resolution in PPDP remains challenging.

In future computing paradigm, four mainstream technologies will become the centre of the information technology (IT) world, big data, cloud computing, social networks, and internet of things (IoT). These technologies have ability to process any kind of data with advanced analytics tools to extract insights from collected data. The main drivers of these technologies are expanded internet connectivity, low cost sensors, higher mobile adoption, and large IoT investments. In Figure 16, we present the future computing paradigm’s technologies, data sources, and analytics solutions that are in use to serve the mankind in better way compared to the recent past.

FIGURE 16. - Detailed overview of the future computing paradigm and related concepts.
FIGURE 16.

Detailed overview of the future computing paradigm and related concepts.

Despite the benefits offered by these latest technologies, there exist many barriers such as technological fragmentation, security concerns, implementation problems, and privacy concerns. Such potential barriers have been the bottleneck for the wide applications and development of these technologies and, thus, have attracted widespread concern. Among others, privacy concerns significantly hamper the development and applications of these technologies, and this field has attracted significant attention from the research community in recent years. Furthermore, many latest methodologies such as homomorphic encryption, federated learning, deep learning, and block chain have also been used for privacy protection in future computing paradigm (year 2020 and beyond) [251]–​[258]. Butpheng et al. [259] presented various research perspectives related to privacy and security within IoT-cloud-based e-Health systems. Authors provided various benefits of IoT and cloud based e-Health systems, and analyzed security and privacy solutions in the study.

We summarize the various privacy issues related to the future computing paradigm after in-depth synthesis of the previous studies in Figure 17. We refer interested reader to previous works [254], [260]–​[268] for more detailed descriptions and definitions of the privacy issues related to the future computing paradigm. Therefore, future research must be done to find new ways to make anonymization solutions more resilience towards these issues such as quantifying the risk and benefits of data publishing, deciding the appropriate mechanism for privacy preservation considering the adversaries’ capabilities (e.g., worst case scenarios), verifying the effectiveness of privacy enhancing technologies through real-world applications or increasing users’ awareness about the privacy leakage through hidden routes.

FIGURE 17. - Comprehensive overview of the privacy issues in the future computing paradigm (Year 2020 and beyond).
FIGURE 17.

Comprehensive overview of the privacy issues in the future computing paradigm (Year 2020 and beyond).

Moreover, achieving effective privacy protection should focus more on exploiting intrinsic characteristics of users’ data or application features for which data is being anonymized by the simulation of diverse privacy approaches [269]–​[271]. Furthermore, privacy enhancing technologies (PETs) for better privacy preservation in personalized services by satisfying both economical and ethical purposes have become more emergent than ever [272]. In recent years, privacy preserving machine learning (PPML) concept has been employed to extract the knowledge from distributed databases while ensuring data privacy [273], [274]. Due to the PPML, traditional machine learning algorithms can be adapted to secure users’ data stored in multiple digital environments.

Recently, the privacy-aware data cleaning techniques have significantly reduced the data preparation cost of data analysis pipeline [275], [276]. These techniques allow the clients to buy clean, and curated data from heterogeneous service provider to perform analytics without compromising user’s privacy. Furthermore, development of privacy information management system (PIMS) in accordance with international standard (e.g., ISO/IEC 27701) is imperative to safeguard the privacy of individuals or small groups in the population [277]. According to such system, if an anonymization mechanism cannot safeguard user’s privacy or anonymized data can be used to identify individuals uniquely or small population groups, data cannot be released without legal advice or additional technical measures.

SECTION VII.

Promising Open Research Directions

Some promising open research directions/problems that need further research and developments from both academia and industry are outlined below.

  • Users groups’ privacy issues: In tabular data anonymization, majority of the existing approaches focus solely on an individual’s privacy preservation. Thus, they are less resilient towards the users groups’ privacy preservation. For instance, k -anonymity model creates equivalence classes with k -users in each class. On the one hand, it protects individual privacy by hiding each user in other k -users’ crowd. On the other hand, it explicitly discloses private information about users groups. Hence, devising new PPDP methods for users groups’ privacy protection would be promising.

  • Excessive information loss caused by over-generalization of the QIs: In tabular data anonymization, most of the existing anonymization approaches anonymize each QI present in a dataset that can lead to excessive information loss. As some QIs are not vulnerable in terms of user’s privacy, and their unnecessary generalization significantly impact data utility. Thus, quantifying each QI impact on privacy and utility, and controlling unnecessary generalization to the extent possible while anonymizing user’s data requires further research and developments from the research community.

  • Imbalanced datasets anonymization: In some cases, the relational dataset can be highly imbalanced (i.e., the SA’s values distribution is not uniform), and its anonymization is very challenging. In such datasets, enforcing hard constraints such as making each class \ell -diverse or t -close is not possible in practice. Hence, it requires development of new approaches for anonymizing imbalanced datasets to effectively protect users’ privacy without degrading data usefulness.

  • Effective resolution of privacy and utility trade-off in the PPDP: In data anonymization, there exist a strong trade-off between privacy and utility. Tailoring the anonymization with privacy objectives can adversely affect the anonymous data utility, and vice-versa. This longstanding challenge in the field of PPDP seeking novel solutions to support privacy preserving big data analytics.

  • Personalized privacy preservation in SNs: In SNs, each user has different requirements and concerns about his/her information privacy, which is called personalized privacy (PP). For example, in social graph some users may want to hide only sensitive relationship (i.e., lover) information, while some users may want to hide all social connections (i.e., all friends) information. Therefore, the PP involves high level of subjectiveness, and it is very difficult to implement. Hence, innovative solutions that can incorporate the SN users’ PP requirements in SN data anonymization are required.

  • Accurate modeling of the adversaries’ background knowledge: Adversaries poses more side channel information as background knowledge (BK) about SNs users compared to the tabular data. This BK continues to grow due to the access to other publicly available SNs, and richness of information embedded in the SNs. Recently, text mining and natural language processing (NLP) techniques utilize the contents of SN users to match their identities. Thus, accurate modeling of the Bk while anonymizing SN data is very challenging, and further research on how to model the BK during anonymization process is required to effectively protect user’s privacy in big data era.

  • Generic solutions for the social graph anonymization: Generally, the SN data is modeled with the help of graphs (a.k.a. sociograms). These graphs can be of different types such as simple, directed, undirected, weighted, and labeled directed. The anonymization mechanism proposed for one type of the graph cannot be directly applied to the other. For instance, the k -degree anonymity concept cannot be applied to directed graphs straightforwardly as it requires the detailed analysis of in-and-out degree sequences. Hence, devising generic anonymization methods that can work with multiple graph’s types will be an interesting research area in the future.

  • Controlling large scale user identification issues by evading data mining and SN analysis SNA) tools: In SNs, users establish relationships with like-minded people or people having similar interests. This results into formation of the online communities. A growing body of research has been devoted to community discovery in the SNs. On the one hand, community discovery is beneficial for multiple purposes such as information spread and control. Moreover, community discovery in privacy preserved SN published data can jeopardize users and community privacy when published data is analyzed with advanced data mining and SNA tools. Hence, devising new solutions that are resilient towards community discovery and community-based node/user mapping in SN data publishing has become more pressing than ever.

  • Exploiting global and local features of SN data to safeguard against network reconciliation problems: Recently, across SNs users identification by leveraging multiple methods such as multiple SNs graphs matching (i.e., network reconciliation), display names mapping, contents and activities analysis, accounts on the heterogeneous SNs, and their combinations has become an activate area of research. Accordingly, the privacy approaches need significant up-gradation to protect user’s privacy in network reconciliation problems. In this regard, the approaches which perform data anonymization by exploiting local (i.e., common mapped neighbours, number of friends, and mutual friends etc.) and global (i.e., betweenness, centralities, ties strength, and multi-hop neighbour’s information etc.) features of social graph will be an interesting research area in the near future for PPGP.

  • Metrics suites rather than single metric for quantifying the level of privacy in anonymous graphs: Generally, one type of metric is employed for evaluating the level of privacy/utility in a G' offered by anonymization solutions. Moreover, in real world cases, the privacy quantified by one metric may not be monotonic (e.g., show lower privacy results for stronger adversaries) or reliable from multiple viewpoints. Hence, combining multiple privacy metrics that can more accurately measure the level of privacy and can mitigate the weaknesses of individual metrics is a promising research direction in near future considering the widespread interest in SNs data publishing with legitimate information consumers.

  • Devising privacy-friendly mechanisms for exceptional situations: During 2020, the whole world is facing an unanticipated and extraordinary challenge from an unknown enemy, called corona virus disease-19 (COVID-19) [278], [279]. The COVID-19 pandemic has affected every profession around the globe, and governments are heavily relying on the non-pharmaceutical interventions (e.g., strict lock-downs, cities and facilities closures, social distancing, geo-location based users’ mobility analysis, proximity detection, and digital contact/suspect tracing etc.) to curb the spread of COVID-19 [280]. In addition, for epidemiological investigations, some governments employed extensive measures (e.g., credit card data, mobile phone signals, Bluetooth and GPS data, and CCTV data) to find the COVID-19’s suspects and hotspots [281]–​[283]. Due to the adoption of the digital methods, a huge amount of personal data has entered into the cyberspace, and privacy violations have been constantly reporting around the globe. For example, in Italy from January to April 2020, the privacy violations in healthcare sector related to companies and individuals have doubled [284]. Furthermore, the privacy violations in post COVID-19’s era are expected to increase as many companies have collected the multi-fact data about the people lifestyles [285]–​[287]. Considering the necessity of users’ privacy preservation, the ethical aspects during and after COVID-19’s era must be carefully observed and addressed [288]–​[294]. Hence, privacy-friendly solutions are required for exceptional situations such as COVID-19 pandemic to safeguard patients’ privacy and other pandemics’ related aspects (e.g., privacy preserving symptoms tracking and reporting, collecting only relevant information from the users to protect privacy, encrypting sensitive information, GDPR-compliant and privacy-aware contact tracing, and decentralized solutions for computing the probability of exposure). Further, PPDP mechanism involving COVID-19’s patients data to safeguard discrimination and hates towards certain religions, countries, sects, caste, and sexual minorities during published data analytics is deemed necessary. Hence, analyzing/solving the potential privacy risks and vulnerabilities in contact tracing apps developed by many countries to fight with the pandemic is a vibrant area of research.

  • Adoption of industry 4.0 techniques for the PPDP: In recent years, many innovative techniques such as few-shot learning, federated learning, transfer learning, deep learning, and block-chain have revolutionized the human life in many aspects. These techniques have been extensively used in many areas such as healthcare, social engineering, data analytics, predictions and forecasting, knowledge extraction, and image recognition/analysis etc. Due to the availability of huge amount of labeled data, and ability to work in a decentralized fashion, these techniques can be utilized for users’ privacy preservation with enhanced usefulness. The heterogeneous federated transfer learning (HFTL) framework [295], privacy-preserving deep learning (PPDL) technique [296], deep transfer learning (DTL) method [297], adaptive privacy preserving federated learning (APPFL) method [298], block-chain-enable privacy preserving (BCEPP) architectures [248], [299], secure collaborative few-shot learning (SCFSL) framework [300], searchable encryption (SE) methods leveraging ciphertext-policy attribute-based encryption (CP-ABE) [301], [302], data resource protection solution leveraging smart contracts [303], improving cyber security solutions utilizing AI’s potential [304], and computational intelligence based methods for information security [305], to name a few have already been used in practical applications related to the PPDP. Hence, devising robust and lightweight techniques which involve less parameters and can co-work with the traditional anonymization approaches to scale up privacy preservation with enhanced data utility is a promising area of research for the future.

SECTION VIII.

Conclusion

In this paper, we have presented the latest researches that have been proposed to release useful information while preserving user’s privacy from malevolent adversaries, namely privacy preserving data publishing (PPDP). In recent years, there is an increasing focus on the rapid development of more practical anonymization solutions due to the significant rise in the privacy breaches across the globe, and this area is attracting researchers’ interests drastically. Owing to the rapid technological developments in communication science and technology, tremendous amount of users’ data can now be easily obtained in diverse formats, ranging from relational tables to complex social graphs. Although this increasing amount of data offers unprecedented opportunities for analytics, but it increases the chance of individuals’ privacy breaches. In addition, most of the traditional anonymization algorithms that were proposed for the tabular data can rarely perform well on a social network (SN) (e.g., graphs) data without modifications. Hence, it is of paramount importance to provide good perspectives of the information privacy area involving both tabular and SN data along with recent anonymization researches. In this work, we have provided detailed and systemic coverage of the relational anonymization techniques used for the tabular data before presenting recent structural anonymization approaches used for SNs data anonymization. We have summarized and compared substantial number of anonymization approaches used for the information privacy protection involving both SNs and tabular data. Furthermore, we provide deeper insights on the privacy problems in future computing paradigm that will be helpful in devising more secure anonymization methods, and we discuss numerous promising open research directions/problems that need further research and developments. In this survey, we specifically focus on the SN data anonymization and de-anonymization techniques considering the widespread applications/use of the SNs data. In addition, SNs data contains a treasure of information that need to be protected from the malevolent adversaries. Thus, it indicates the ever increasing interests of researchers in the area of SN’s data anonymization. Nevertheless, user’s data anonymization is still irrefutably complex, and it requires significant improvements in existing approaches as well as devising new practical approaches with regard to better utility and privacy preservation. In future work, we are planning to devise new anonymization methods for SN data, and we intend to explore privacy problems in industry 4.0 technologies.

Conflict of Interest

The authors declare that they have no conflict of interest.

References

References is not available for this document.