Introduction
Most organizations, such as hospitals, banks, insurance companies, and supermarkets, collect relevant customer/subscriber data to improve service quality (SQ). Apart from these physical organizations, a vast amount of user data is collected by virtual platforms such as social network (SN) service providers due to the extensive use of SNs all around the world. With the significant advancement in information and communication technologies (ICT), SNs enable people to interact with their friends, make new friends, seek information about relevant subject matter or jobs, spread reliable information at low cost, and entertain themselves by watching digital content. Meanwhile, the SNs collect and store relevant data about their users during service provisioning and at the time of account creation (i.e., joining the SNs). This collected data often contains information about users’ activities, demographics, finances, hobbies, locations, perceptions, interests, preferences, political and religious views, online communities, and opinions. Furthermore, most users readily post other valuable data, including preferences in music, viewing choices, and social problems such as an epidemic outbreak. Research has shown that analysis of this collected data with advanced data mining tools can assist organizations in improving SQ significantly. For instance, it allows them to understand social trends, people’s sentiments and behaviors, and factors causing a certain disease outbreak. Accordingly, such information can be leveraged for many scientific or business objectives, including targeted advertisement, relevant content recommendations, and effective decision making [1], [2]. Although data sharing brings innovation and enables better decision making, it may also jeopardize users’ privacy due to the presence of sensitive information in the data [3].
Before sharing users’ data with researchers or third parties, data owners must ensure that users’ private information is protected. This is typically done via data anonymization, which transforms the original data by applying certain operations on it to effectively protect users’ privacy without degrading the utility of the anonymized data [4]. Privacy preserving data publishing (PPDP) provides a set of models, tools, and methods to safeguard against the privacy threats that emerge from releasing data to data miners or analysts [5]. In recent years, PPDP has received considerable attention from the research community, and many approaches have been proposed for both SN and tabular data anonymization [6]–[10]. There are two well-known settings of PPDP, non-interactive and interactive [11]. In the former setting, the data owner publishes the complete dataset in an anonymized form after applying some modifications to the original data. In the latter setting, the data owner does not publish the whole dataset in a sanitized form; instead, the data owner provides an interface through which data miners may pose different statistical queries about the underlying data and get (possibly noisy) answers.
In recent years, SN data has also been shared with data-miners for accomplishing multiple scientific and business objectives due to the phenomenal growth in SN use around the globe [23]. SN data is mostly represented in a graph form.
The existing surveys related to PPDP cover important aspects such as anonymization techniques, anonymization operations, privacy models, data anonymity frameworks, and evaluation metrics employed by PPDP mechanisms. Fung et al. [31] systematically summarized and evaluated different approaches used in PPDP. Rajendran et al. [32] explained three prominent and widely used anonymization models in the medical field.
This paper presents a comprehensive survey on recent anonymization techniques used for both SN and relational data publishing. Specifically, our review explains anonymization approaches related to information privacy protection. The contributions of this review article in the field of PPDP are summarized as follows: (i) it presents state-of-the-art anonymization techniques used for both SN (i.e., social graph) and relational (i.e., tabular) data, along with fundamental concepts and ideas related to table and graph data anonymization; (ii) it systematically categorizes the existing anonymization techniques into relational and structural anonymization, and presents an up-to-date, thorough review of existing anonymization techniques and the metrics used for their evaluation; (iii) it describes the anonymization techniques that have been proposed to solve privacy problems in application-specific scenarios of SNs (e.g., collaborative filtering, topic and context modeling, and community clustering); (iv) it presents various methods and items that are exploited by malevolent adversaries for users’ re-identification across SNs; (v) it explains various challenges faced by researchers while devising new anonymization methods for tabular and SN data; (vi) it provides new insights into the privacy problems in the future computing paradigm that will be helpful in devising more secure anonymization methodologies; and (vii) it discusses promising future research directions in the field of PPDP that need further development and research from both academia and industry. Through this comprehensive overview, we hope to provide a solid foundation for future studies in the PPDP area.
The remainder of this review paper is organized as follows. Section II explains the background regarding privacy types, an overview of tabular and graph data, types of user attributes, operational utility levels of SN data, privacy areas in SNs, and privacy threats that occur in both SN and tabular data publishing. Section III presents the dilemma of PPDP and explains its principal concepts and phases. Section IV discusses the state-of-the-art relational anonymization techniques used for tabular data. Section V discusses the recent structural anonymization techniques used for SN data. A summary of the research presented in this article and a discussion of the privacy problems in the future computing paradigm are provided in Section VI. Section VII discusses promising open research directions/problems in the PPDP area. Finally, Section VIII concludes the paper.
Background
Privacy is all about keeping personal information away from public access. Privacy is needed for personal autonomy, individualism, and respect. There are four types of privacy: information, bodily, territorial, and communication [46]. Information privacy is about collecting, managing, analyzing, and publishing personal data. Bodily privacy is related to physical harm from any kind of invasive procedure/measure. Communication privacy refers to the privacy of any form of communication, such as phone calls or e-mails. Territorial privacy refers to placing boundaries on intrusion into a locality. This survey focuses on information privacy, which encompasses systems/infrastructures that collect, analyze, process, and publish user data. Concisely, we present the various anonymization approaches that have been proposed to anonymize users’ data, which can be in either graph or table form. An overview of user data in relational and graph form is presented in Figure 1. In relational data (Figure 1(a)), each tuple contains four types of attributes about users: direct identifiers (DI), non-sensitive attributes (NSA), quasi identifiers (QIs), and sensitive attributes (SA). In contrast, the SN data shown in Figure 1(b) represents user information via node/edge labels. The relevant background about both types of data (i.e., tabular and graph) is covered in the subsequent subsections.
Overview of the relational (i.e., tabular) and social network (i.e., graphs) users data.
A. Background About the Relational/Tabular Data
The original user’s data
Description about the types of user’s attributes and their handling in an anonymization process.
As shown in Figure 2, there exist three classes of privacy threats that can occur during published data analysis [47], which are explained below:
Identity disclosure (i.e., unique identification): It is a well-known privacy threat in PPDP. It occurs when an adversary can correctly identify an individual in a privacy-preserved published dataset. Generally, an attacker uses information gathered from external sources (e.g., voter registration lists, online repositories, and factual information) to identify an individual uniquely.
Attribute disclosure (i.e., private information disclosure): This type of privacy threat occurs when an individual is linked with information about his/her SA. For example, the information can be the person’s value for the SA (i.e., crime in Figure 1(a), or salary in Figure 1(b)). This type of threat can be easily launched on imbalanced datasets (i.e., datasets lacking heterogeneity in the SA’s values).
Membership disclosure (i.e., presence/absence disclosure): This threat occurs when an adversary can deduce that an individual’s record is present/absent in the published dataset with a very high probability. Researchers have reported many interesting scenarios in which the protection from the membership disclosure is imperative [48], [49].
Generalization: This operation transforms the original QI values into less-specific but semantically consistent values during the anonymization process. For example, the value 25 of the QI age can be generalized to an interval [25 – 30] or to < 30. This operation relies on the taxonomy of each QI. The existing generalization schemes can be classified into five types: sub-tree generalization, full-domain generalization, unrestricted sub-tree generalization, cell generalization, and multidimensional generalization.
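For illustration, the following minimal Python sketch generalizes a numeric QI (age) into fixed-width intervals; the records, the interval width, and the helper name are illustrative assumptions rather than a scheme from the surveyed literature.

```python
# Minimal sketch of generalization for a numeric QI (age).
# Records, interval width, and helper names are illustrative assumptions.
def generalize_age(age, width=5):
    """Map an exact age to a semantically consistent interval, e.g., 25 -> '[25-30)'."""
    lower = (age // width) * width
    return f"[{lower}-{lower + width})"

records = [
    {"age": 25, "zip": "47677", "disease": "flu"},
    {"age": 27, "zip": "47602", "disease": "cancer"},
    {"age": 31, "zip": "47678", "disease": "flu"},
]

anonymized = [{**r, "age": generalize_age(r["age"])} for r in records]
for row in anonymized:
    print(row)   # e.g., {'age': '[25-30)', 'zip': '47677', 'disease': 'flu'}
```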
Suppression: This operation hides an original QI value with a special value (i.e., ’*’). For example, to anonymize the value 25 of the QI age using the suppression operation, the digit 5 can be replaced with an asterisk, resulting in 2* as the suppressed value of the QI. Record, value, and cell suppression are the three most widely used suppression variants in tabular data anonymization.
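A minimal sketch of cell suppression is given below; the example values and the number of retained characters are assumptions made only for illustration.

```python
# Minimal sketch of cell suppression: hide the trailing characters of a QI
# value with '*'. The example values and `keep` parameter are assumptions.
def suppress(value, keep=1):
    """Keep the first `keep` characters and replace the rest with '*'."""
    s = str(value)
    return s[:keep] + "*" * (len(s) - keep)

print(suppress(25))               # '2*'    (age 25 -> 2*)
print(suppress("47677", keep=3))  # '476**' (ZIP code partially suppressed)
```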
Permutation: In this operation, the records are partitioned into several groups, and the values of the SA are shuffled within each group. Hence, the association between the SA and the QIs is broken within each group. This operation may yield inaccurate analysis in terms of anonymous data utility, but users’ privacy is significantly preserved.
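As a minimal sketch of the permutation operation (the grouping key, records, and the salary SA are illustrative assumptions), the following snippet shuffles the SA values within each group:

```python
# Minimal sketch of permutation: partition records into groups (here, by a
# generalized age) and shuffle the SA values within each group, breaking the
# QI-SA association. Group key and data are illustrative assumptions.
import random
from collections import defaultdict

records = [
    {"age_group": "[20-30)", "salary": 40_000},
    {"age_group": "[20-30)", "salary": 90_000},
    {"age_group": "[30-40)", "salary": 55_000},
    {"age_group": "[30-40)", "salary": 75_000},
]

groups = defaultdict(list)
for r in records:
    groups[r["age_group"]].append(r)

random.seed(0)
for members in groups.values():
    salaries = [m["salary"] for m in members]
    random.shuffle(salaries)               # shuffle SA values inside the group
    for member, s in zip(members, salaries):
        member["salary"] = s

print(records)
```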
Perturbation: In this operation, the original data values are replaced with synthetically generated values. Moreover, the synthetic values are generated in a way that the statistical information does not differ much between the two datasets (i.e., the real and the synthetically generated datasets).
Anatomization: This operation does not apply any modification to the original data values; instead, the QIs and SA are separated into two tables. By doing so, the association between the QIs and the SA is broken, and the data is released as separate QI and SA tables. In some cases, the SA table contains the SA values and their frequencies in the anonymized dataset for effective privacy preservation.
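The following minimal sketch illustrates anatomization; the toy records, the group identifier `gid`, and the attribute names are assumptions used only for illustration.

```python
# Minimal sketch of anatomization: instead of modifying values, split the data
# into a QI table and an SA table linked only by a group identifier, and
# publish the SA values together with their per-group frequencies.
from collections import Counter

records = [
    {"gid": 1, "age": 25, "zip": "47677", "disease": "flu"},
    {"gid": 1, "age": 27, "zip": "47602", "disease": "cancer"},
    {"gid": 2, "age": 31, "zip": "47678", "disease": "flu"},
    {"gid": 2, "age": 33, "zip": "47606", "disease": "flu"},
]

qi_table = [{"gid": r["gid"], "age": r["age"], "zip": r["zip"]} for r in records]
sa_table = []
for gid in {r["gid"] for r in records}:
    counts = Counter(r["disease"] for r in records if r["gid"] == gid)
    sa_table.extend({"gid": gid, "disease": d, "count": c} for d, c in counts.items())

print(qi_table)  # published QI table
print(sa_table)  # published SA table with per-group frequencies
```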
Furthermore, in some cases, more than one operation is used jointly to anonymize a user dataset. Data publication can be done either once or multiple times, depending upon the requirements and the information consumers’ needs. Typical data publication scenarios include one-time and multiple-time publication of the microdata. In the former scenario, the data is published once, and no re-publication is made even after changes to the original data. In contrast, in the multiple-release scenario, the anonymous data is re-published even after a single operation (insert, update, delete) that changes the original data. Both scenarios have their own advantages and disadvantages for data owners and information consumers, respectively. After DIs and NSA removal from
B. Background About the Social Networks/Graphs Data
The SN user data can be modeled as a graph $G$.
$l_{1}$: Exposure of the graph structure only: In this exposure level, the data owners (e.g., SN service providers) only publish the graph structure (i.e., all profiles/labels information is removed prior to data publication). Thus, the data-miners/analysts can analyze only the graph structure without any concrete information about the users’ profiles.
$l_{2}$: Exposure of the nodes’ profiles: In this exposure level, the data owner publishes the profiles of nodes/users but hides the graph structure. For instance, the nodes’ data is stored in a table/matrix, and released for analytics and data mining purposes.
$l_{3}$: Exposure of both the graph structure and the profiles of nodes (e.g., users): In this exposure level, the data owner exposes both the graph structure and the nodes’ profiles after applying some modifications to the $G$’s structure and the users’ profiles. This level offers much higher utility to the legitimate information consumers compared to the previous two levels.
Although SN data publishing is invaluable for accomplishing multiple research and business objectives, it can be confronted with several types of privacy threats. Four well-known privacy threats that can occur after publishing the SN data are explained below:
Identity disclosure (i.e., node re-identification): It occurs when an adversary can accurately associate/identify an individual from a privacy-preserved published graph. For example, the identification of Tim (the $5th$ node) by an adversary from the anonymous version (i.e., with all node labels removed) of the $G$ given in Figure 1(b) is an example of identity disclosure (i.e., node re-identification).
Edge disclosure (i.e., relationship/connection disclosure): It reveals the relationship between users. For example, a patient-doctor relationship can be highly sensitive, and it must be protected. If a doctor is known to be an expert in cancer treatment, the relationship disclosure can lead to the inference that the patient might have cancer.
Content disclosure (i.e., vertex/edge label disclosure): It occurs when a sensitive label associated with an edge or vertex is revealed from a $G'$, and this revealed label can be directly associated with a specific individual in the original graph (i.e., $G$).
Affiliation link disclosure (i.e., whether a person $v$ belongs ($\in$) or does not belong ($\notin$) to a particular affiliation group $h$): It occurs when a link between a user $v$ and an affiliation group $h$ is revealed with confidence $\geq t$, and this revealed link can be directly associated with $v$. The $h$ can be a prosecuted political group or a group centered around an unconventional user preference (i.e., sexual/drug preferences).
Graph modification: This operation changes the original graph’s structure by adding/deleting edges or vertices. The criteria for adding/deleting the graph’s elements depend on the objectives of data publishing and the privacy protection level the data owners want. Typical graph modification techniques are of two types: constrained and non-constrained graph modification.
Graph generalization: This operation does not modify the graph structure; instead, it clusters the nodes and edges into super-nodes and super-edges. This operation works in iterations, and in each iteration the connections between the super-nodes and super-edges are established.
Graph computation (i.e., privacy-aware computation of graphs): This operation computes output data in response to the data miners’ queries. The $G$ is not released to the data miners; instead, the graph properties (degree, centralities, size, and other useful information) are computed and released.
Hybrid operation/approach: This operation employs a combination of two or more anonymization operations simultaneously. For example, graph generalization and modification can be jointly used to anonymize $G$ to produce $G'$.
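To make the graph-modification operation concrete, the sketch below perturbs a toy graph with one random edge deletion and one random edge addition (non-constrained modification) using networkx; the graph and the number of changes are illustrative assumptions, not a method from the surveyed literature.

```python
# Minimal sketch of the graph-modification operation using networkx:
# produce G' from G by randomly deleting and adding a few edges
# (non-constrained modification). The toy graph and counts are assumptions.
import random
import networkx as nx

random.seed(1)
G = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (2, 5)])

G_prime = G.copy()
# delete one existing edge at random
G_prime.remove_edge(*random.choice(list(G_prime.edges())))
# add one edge between a randomly chosen non-adjacent pair
non_edges = list(nx.non_edges(G_prime))
G_prime.add_edge(*random.choice(non_edges))

print("original edges :", sorted(G.edges()))
print("perturbed edges:", sorted(G_prime.edges()))
```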
Privacy in SN data publishing has become an active area of research in recent years. Pham et al. [54] explained that SN users’ privacy can be breached in three different ways: through users’ activities on SN sites, through the users’ data stored on the SN service providers’ servers/databases, and through privacy-preserved published SN data. All three privacy areas in SNs are visually presented in Figure 3(a). Privacy in the first two areas can be preserved through guidelines, encryption, and watermarking techniques. Meanwhile, privacy in the third area can be guaranteed only by devising/implementing new anonymization mechanisms. This article covers recent concepts, methods, and solutions concerning the third privacy area in SNs (i.e., privacy-preserved graph publishing).
Overview of three privacy areas (adopted from [54]), and types of users data collected and processed in SNs.
The user data available on SN sites comprises three types of information [55]. A taxonomy of the user data present on SNs is depicted in Figure 3(b). The first type of data is related to the user’s identity, including demographics and derived data (i.e., age). The second type of information is related to the user’s social relations (i.e., friends, friends-of-friends, activities, and online communities joined by the user). The third type is related to the content users create on SN sites; it is further divided into disclosed, entrusted, and incidental data. Disclosed data is readily made available by the SN users. Entrusted data includes the data posted by a user on another user’s profile (i.e., comments). Incidental data is the information collected/posted by other users on someone’s profile/wall. All these data sources are invaluable for detailed analytics and appropriate information collection/analysis.
Aside from the fact that all necessary and private information needs protection from malevolent adversaries, de-anonymization attacks can be launched on published SN data to infer someone’s real identity or private information. The revelation of such information can result in user profiling without their knowledge, thus targeting unsuspecting users with far-reaching implications, including targeted marketing, travel visa decisions based on race or political viewpoints, deportation, or identity theft. Therefore, mining and sharing SN data must not invade users’ privacy. To safeguard users’ privacy in SN data publishing, many privacy preserving graph publishing (PPGP) methods have been proposed. We describe the most recent anonymization solutions proposed for PPGP in Section V.
Overview of Privacy Preserving Data Publishing and Its Fundamental Phases
Privacy preserving data publishing (PPDP) provides a set of tools, methods, solutions, and frameworks for sharing valuable information with analysts/researchers without jeopardizing users’ privacy. PPDP has been extensively studied in the literature for protecting different aspects of users’ private information during published data analytics. Data is shared with analysts/researchers so that they can extract the knowledge embedded in it. Meanwhile, even when anonymous data is shared after applying the anonymization operations outlined earlier (see Subsections II-A, II-B), adversaries can still infer information about a user’s identity or private information by leveraging auxiliary information gathered from external sources. Therefore, many studies have suggested protecting users’ privacy in all stages of the information processing cycle (i.e., collection, storage, processing, release, and destruction/archival) [56]. A conceptual overview of the PPDP process is presented in Figure 4(a). In this conceptual overview, we mainly present the data collected from individuals, the different actors involved in the anonymization scenario, the anonymization techniques applied to the respective data, the anonymous data to be published for analytics/mining purposes, and the privacy breaches that can occur during published data analytics. The typical PPDP scenario involves five types of actors [57]. A brief description of each actor, along with examples, is summarized in Figure 4(b). In some cases, one actor can perform multiple roles in the PPDP scenario. For instance, data holders/owners hold the collected data and perform anonymization for its release to analysts. Sometimes, due to a lack of computing resources or knowledge, the data holder outsources the collected data for anonymization/publishing. Hence, the roles of data holder and data publisher can be played by the same actor or by two distinct actors.
Privacy preserving data publishing (PPDP): (a) conceptual overview and (b) description of the actors involved in the PPDP scenario.
The typical PPDP process comprises five fundamental phases: (i) data collection from the individuals; (ii) storage, understanding, and preparation of the collected data for anonymization; (iii) users’ data anonymization; (iv) anonymous data releasing/publishing; and (v) published data analysis for extracting embedded knowledge. Brief details of each phase of the PPDP process are presented below.
A. Phase 1: Data Collection from the Individuals
In the first phase of PPDP, relevant data is collected from individuals/users. Due to significant technological development in recent years, the amount of data generated by sensor networks, SN sites, healthcare applications, the internet, online banking, SN-integrated third-party applications, and many other offline/online companies is increasing drastically. This data is collected from individuals directly or via smart devices (i.e., cell phones, laptops, and notepads). For instance, if a patient visits a hospital for diagnosis, his/her personal information is collected for better treatment. Later, the patient’s personal information, combined with the disease information, is saved in the hospital database for secondary use [58]. Similarly, the account opening process in a bank is subject to the collection of basic as well as sensitive information about the customers. Aside from visiting an organization (i.e., a hospital or bank), in some cases, service providers collect relevant data from their users via questionnaires and interviews.
Recently, due to the significant development in ICTs, the majority of service providers have launched their own websites for data collection from their respective users/customers. Furthermore, during the account creation process on SN sites, the service providers collect and store basic information about each user. In addition, many SN sites collect valuable data about users without their consent. For example, they can collect device information, temporal information about service consumption, location information, and much other useful information. Recently, with the invention of advanced tools and technologies, many infrastructures collect and process graph data. The nine well-known graph-based data collection sources are: (i) social networks, (ii) communication data, (iii) mobility traces, (iv) healthcare and epidemiological data, (v) citation networks, (vi) collaboration networks, (vii) web graphs, (viii) autonomous system graphs, and (ix) computer networks. SN data is mostly represented as a graph.
B. Phase 2: Collected Data Storage, Understanding and Preparation for the Anonymization
Once the relevant users’ data is collected, the next phase is the storage, understanding, and preparation of the collected data for further operations (i.e., anonymization). Companies use large-scale databases for storing the collected users’ data/information. The well-known SN Facebook uses the MySQL database to store/manage many petabytes of data about users’ social activities such as shares, comments, and likes. Data understanding involves the analysis of the data types, different data representations, and
C. Phase 3: Data Anonymization
Data anonymization is a practical and widely used solution for protecting users’ privacy in data publishing. In tabular data, anonymization sanitizes the QIs’ original values to make the information less specific for privacy protection and utility enhancement. In contrast, if users’ data is given in a
1) Removal of Directly Identifiable Information from the Original Data
At the beginning of the anonymization process, any information that can identify someone directly/uniquely is removed from the data. For example, the name, social security number (SSN), email address, and cell phone number can be linked with someone’s real identity. Hence, such information is removed from the data before its anonymization. It can be removed at any stage, but earlier removal can save computing power significantly.
2) Choice of the Anonymization Technique
Many anonymization techniques have been proposed and implemented for different scenarios. In this work, our focus is on relational (i.e., tabular) data and SN (i.e., graph) data anonymization. Therefore, we categorize the existing anonymization techniques into two categories: structural and relational anonymization. The former is applied to graph data, whereas the latter is used for tabular data anonymization. Although some researchers have used relational anonymization techniques for SN data, relational techniques cannot be applied directly to SN data due to the significant differences in the information contained in graphs. Therefore, the decision about the anonymization technique depends upon the original data representation (i.e., graphs or tables) and the objectives of the data publishing.
3) Selection of the Anonymization Operation
Once the appropriate anonymization technique is chosen, the next step is to employ the relevant anonymization operation to distort the original data values or the graph structure. The anonymization operations for each technique are different, as outlined earlier (Subsections II-A, II-B). The selection of the anonymization operation depends on the data type, the anonymization technique, and the privacy and utility objectives. For example, in relational anonymization, the generalization operation retains more of the semantics of the original data compared to the suppression operation. In contrast, the suppression operation is more appropriate for users’ privacy protection compared to generalization. In addition, due to the complex structure of a
4) Enforcement of the Constraints (If Any) in an Anonymization Operation
Aside from the selection of an appropriate anonymization technique and anonymization operation, some constraints can be enforced by the data owners during data anonymization. Such constraints can concern the privacy and utility thresholds, the number of users in an equivalence class, the distribution of the users’ private values in a class/cluster, and/or the number of connections between users (i.e., degree) in a
D. Phase 4: Anonymous Data Releasing/Publishing
The output of the anonymization process is the anonymous
E. Phase 5: Published Data Analysis for Extracting Embedded Knowledge
Once anonymous data has been published, the intended recipients collect it for analysis. In the case of hospital data, medical students collect the data for their research. They can perform several kinds of analysis, such as identifying the factors causing a certain disease, the common diseases among people in certain age groups, and the symptoms of a particular disease. In addition, bank data containing information about loan return ratings is valuable for insurance companies. The SN data is suitable for digital service providers and marketing firms. Recently, many companies have been mining SN data at large scale to fulfill their scientific and business objectives. In addition, many third-party applications buy SN users’ data to fulfill their intended objectives. The published data is not only useful for understanding the causes of certain problems, but it can also help in devising new rules and patterns that are useful for marketing purposes. With the help of data mining algorithms, one can analyze the published data for actionable insights. Furthermore, user clustering and analysis are beneficial for recommendations, preference mining, targeted marketing, information diffusion, information control, and information trustworthiness. Hence, data sharing has become a routine matter for many companies/organizations due to its significant advantages in terms of improved decision making, policy enhancement, trend analysis, forecasting, prediction, and innovation.
Unfortunately, data publishing can jeopardize users’ privacy, as adversaries can obtain a copy of the published data with the aim of uniquely re-identifying people by leveraging the large amount of auxiliary information obtained from external sources. This information can be gathered from many sources, including users’ profiles on SN sites, voter registration lists, and mobility traces. Generally, adversaries possess strong programming skills and knowledge of tools. The combination of auxiliary information, knowledge of tools, advanced programming skills, and an understanding of anonymization methods enables adversaries to breach users’ privacy. Due to the advancements in ICTs, the scale and scope of data breaches are expanding. In recent years, adversaries have been focusing on the theft of user groups’ information to fulfill their intended objectives. Group identity theft can lead to negative perceptions about a certain ethnic group, discrimination based on race/religion, and loan denial to a group of people whose previous loan return rating was not satisfactory. Aside from the negative consequences of privacy breaches on people’s lives, data owners also lose their users’ trust in such circumstances. Therefore, users expect that data owners protect the privacy of their sensitive information when data is released to data-miners. Hence, academics and researchers are devising new anonymization mechanisms to deal with this growing social problem (i.e., users’ privacy protection without significantly impacting data utility/quality).
Relational Anonymization Techniques Used for Tabular Data Anonymization
The general concept of relational anonymization is to produce an anonymous table
A. $k$-Anonymity Privacy Model
The $k$-anonymity model requires that every record in the released table be indistinguishable from at least $k-1$ other records with respect to the QIs. Records sharing the same QI values form an equivalence class (EC), so each EC must contain at least $k$ records, which bounds the probability of identity disclosure by $1/k$.
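For illustration, a minimal $k$-anonymity check is sketched below; the toy table and the chosen QIs are assumptions made only for this example.

```python
# Minimal sketch of a k-anonymity check: group records by their QI values
# and verify that every equivalence class contains at least k records.
# The table and the chosen QIs are illustrative assumptions.
from collections import Counter

def is_k_anonymous(records, qis, k):
    class_sizes = Counter(tuple(r[q] for q in qis) for r in records)
    return all(size >= k for size in class_sizes.values())

table = [
    {"age": "[20-30)", "zip": "476**", "disease": "flu"},
    {"age": "[20-30)", "zip": "476**", "disease": "cancer"},
    {"age": "[30-40)", "zip": "476**", "disease": "flu"},
    {"age": "[30-40)", "zip": "476**", "disease": "flu"},
]

print(is_k_anonymous(table, qis=["age", "zip"], k=2))  # True
```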
B. $\ell$-Diversity Privacy Model
The $\ell$-diversity model addresses the weakness of $k$-anonymity against attribute disclosure: it requires that each equivalence class contain at least $\ell$ "well-represented" (e.g., distinct) values of the SA, so that an adversary who locates a user’s equivalence class still cannot confidently infer the user’s SA value.
The table in Figure 6(b) is an example of a 2-diverse partition of the table shown in Figure 6(a).
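A minimal check for (distinct) $\ell$-diversity is sketched below; the toy table and attribute names are illustrative assumptions, not the data of Figure 6.

```python
# Minimal sketch of a (distinct) l-diversity check: every equivalence class,
# defined by the QI values, must contain at least ell distinct SA values.
# The table and attribute names are illustrative assumptions.
from collections import defaultdict

def is_l_diverse(records, qis, sa, ell):
    sa_values = defaultdict(set)
    for r in records:
        sa_values[tuple(r[q] for q in qis)].add(r[sa])
    return all(len(values) >= ell for values in sa_values.values())

table = [
    {"age": "[20-30)", "zip": "476**", "disease": "flu"},
    {"age": "[20-30)", "zip": "476**", "disease": "cancer"},
    {"age": "[30-40)", "zip": "476**", "disease": "flu"},
    {"age": "[30-40)", "zip": "476**", "disease": "hepatitis"},
]

print(is_l_diverse(table, qis=["age", "zip"], sa="disease", ell=2))  # True
```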
C. $t$-Closeness Privacy Model
The $t$-closeness model further strengthens $\ell$-diversity by requiring that the distribution of the SA within each equivalence class be close to the distribution of the SA in the overall table, where the distance between the two distributions (commonly measured with the Earth Mover’s Distance) must not exceed a threshold $t$.
There exist many attacks that can be easily launched on the
D. Differential Privacy Model
Differential privacy (DP) [18] is a well-known privacy protection model with a formal mathematical definition. It is mostly used for privacy protection in the interactive setting of PPDP. It protects users’ privacy by adding noise to the original users’ data, and it does not make assumptions about the intruder scenarios. DP belongs to the semantic class of privacy models, and it yields superior privacy protection in PPDP compared to the syntactic privacy models. Considering the effectiveness of the DP model, the U.S. Census Bureau planned to use DP in its 2020 census and all future data products [62]. It has been reported in the literature that DP provides a mathematically provable guarantee of privacy preservation against many privacy attacks, such as differencing, linkage, and reconstruction attacks. The DP concept can be defined as follows: a randomized function $F$ satisfies $(\epsilon, \delta)$-DP if, for any two neighboring datasets $x$ and $y$ that differ in at most one record, and for every set of outputs $S$, \begin{equation*} Pr[F(x) \in S] \leq exp(\epsilon)Pr[F(y) \in S] + \delta \tag{1}\end{equation*}
Equivalently, when $\delta = 0$, a randomized algorithm $A$ satisfies $\epsilon$-DP if, for any two neighboring datasets $D_{1}$ and $D_{2}$ and any output $R_{s}$, \begin{equation*} \frac {Pr (A(D_{1})=R_{s})}{Pr (A(D_{2})=R_{s})} \leq e^{\epsilon } \tag{2}\end{equation*}
In the DP method, noise is typically added using the Laplace mechanism. DP mathematically ensures that any analyst/data-miner seeing the result of a differentially private analysis will reach essentially the same conclusion about any individual’s private information, whether or not that individual’s private information is included in the input to the analysis. An overview of DP is presented in Figure 8. Due to its strong privacy guarantees, the DP model has been extended by many studies for further improvements.
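As a minimal sketch of the Laplace mechanism for a counting query (sensitivity 1), the snippet below adds calibrated noise to the true answer; the dataset, the query, and the value of $\epsilon$ are illustrative assumptions rather than a specific system’s implementation.

```python
# Minimal sketch of the Laplace mechanism for an epsilon-differentially
# private counting query (sensitivity 1). Dataset, query, and epsilon
# are illustrative assumptions.
import numpy as np

def dp_count(predicate, dataset, epsilon):
    true_count = sum(1 for row in dataset if predicate(row))
    scale = 1.0 / epsilon                 # sensitivity of a count query is 1
    return true_count + np.random.laplace(loc=0.0, scale=scale)

ages = [23, 35, 41, 29, 52, 60, 33]
noisy_answer = dp_count(lambda a: a > 40, ages, epsilon=0.5)
print(f"noisy count of people older than 40: {noisy_answer:.2f}")
```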
Many recent studies have adopted the DP concept for resolving the privacy issues in both the interactive and non-interactive settings of PPDP. There exist numerous anonymization techniques which are ramifications of the four privacy models explained above. We summarize the approaches that extended the concepts of these four models in Table 1. We categorize the anonymization approaches into four main categories: (i) the
We compare the existing approaches with each other based on six different parameters. The abbreviation used in Table 1 are: ID = identity disclosure, AD = attribute disclosure, MD = membership disclosure, IL = information loss, Sy. & Se. = syntactic and semantic, Bk = background knowledge, CoD = curse of dimensionality, Pr. = probability, SA = sensitive attribute, MSA = multiple sensitive attribute, LA = linkage attack, DMAR = discovering and maintaining association rules, DA = data analysis, ECs = equivalence classes, GPS = global positioning system, and QIs = quasi identifiers.
Some studies have used combinations of more than one privacy model (e.g.,
E. Metrics Used for the Evaluation of Privacy and Utility of Relational Data Anonymization Techniques
1) Privacy Evaluation Metrics
There exist multiple ways to quantify the privacy protection offered by an anonymization algorithm. The five common methods that are employed to evaluate the effectiveness of an anonymization algorithm in terms of privacy protection are: (i) linking the anonymous and original datasets and calculating the probability of successful matches based on the QI values in both datasets. The probability value indicates the amount of privacy protection an anonymization algorithm will offer during published data analysis. In this case, it is assumed that attackers may have access to an excessive amount of auxiliary information from external sources (e.g., voter lists, online repositories, and e-commerce sites) to launch identity and attribute disclosure attacks; (ii) privacy protection evaluation in the presence of background knowledge. In this privacy evaluation metric, the data owner assumes that the attacker may possess true information about some users (e.g., Bob’s age and gender) and that he/she can explore only the corresponding ECs to infer the private information of the relevant user(s). To evaluate the effectiveness of an algorithm in this regard, the data owner can pick some instances from the original data and evaluate the algorithm’s privacy protection level by matching; (iii) privacy protection evaluation with the help of privacy-sensitive (PS) rules. In this case, the data owner can construct certain rules to evaluate privacy protection, for example, how many people aged over 40 suffer from a given disease (e.g., cancer). The revelation of sensitive knowledge patterns, and the attribute and identity disclosure of multiple users through PS rules, have a wide range of negative consequences on people’s lives; (iv) prediction of users’ SAs through the existence of private and public profiles in SNs. In this case, the data owner can quantify an algorithm’s protection level assuming that either partial or full users’ original data is known to the attacker, as some users willingly publish their QIs over different SNs. The factual information that can be learned through the knowledge gained from the
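The first metric above (linking an auxiliary dataset to the anonymized release on the QI values and estimating the chance of a correct match) can be sketched as follows; all records and attribute names are illustrative assumptions.

```python
# Minimal sketch of privacy metric (i): link an external (auxiliary) dataset
# to the anonymized release on the QI values and estimate the probability of
# a correct unique match. All records and attribute names are assumptions.
from collections import Counter

anonymized = [
    {"age": "[20-30)", "zip": "476**", "disease": "flu"},
    {"age": "[20-30)", "zip": "476**", "disease": "cancer"},
    {"age": "[30-40)", "zip": "476**", "disease": "flu"},
]
auxiliary = [{"name": "Bob", "age": "[20-30)", "zip": "476**"}]

qi = ("age", "zip")
class_sizes = Counter(tuple(r[q] for q in qi) for r in anonymized)

for person in auxiliary:
    key = tuple(person[q] for q in qi)
    matches = class_sizes.get(key, 0)
    # probability of correctly singling out this person's record
    risk = 1.0 / matches if matches else 0.0
    print(person["name"], "re-identification risk:", risk)
```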
2) Utility Evaluation Metrics
During tabular data anonymization, the original QI’s values are modified to fulfill the privacy needs, hence, the data utility degrades. There exist multiple ways to quantify the anonymous data utility offered by an anonymization algorithm. We classify the metrics used for measuring anonymous data utility into two categories: special purpose and general purpose metrics. The special purpose metrics use machine learning methods to measure the anonymous data quality. The most widely used special purpose metrics are, accuracy or error rate,
F. Privacy Preserving Dynamic Data Publication
Privacy preserving dynamic data publication (PPDDP) allows organizations to publish a dataset multiple times. PPDDP enables organizations to share up-to-date data with multiple recipients, either statically or after some modifications (i.e., update, delete, or insert). In contrast, privacy preserving static data publication (PPSDP) focuses only on one-time publication of a dataset, and many approaches have been proposed for PPSDP in the literature [12], [59], [60]. Meanwhile, PPDDP opens a new era in PPDP research, and many organizations are publishing their users’ data dynamically. Kabou et al. [108] presented a comprehensive survey about PPDDP and summarized different studies used for PPDDP. Shi et al. [109] presented a method for PPDDP using distance and information entropy concepts. The proposed method ensures that individual privacy is preserved after the data has been subjected to multiple releases.
G. Challenges in the Relational Data Anonymization
The representation of tabular data is relatively simpler than that of SN data. In relational data, each row/tuple represents one real-world entity/individual. We identify the following five challenges that make relational data anonymization difficult. First, the selection of the user attributes that are regarded as QIs. In relational data, a small subset of the user attributes is chosen as QIs. However, some QIs can behave as SAs (i.e., profession) in practice. Thus, appropriate selection of QIs prior to data sanitization is imperative. Second, over-reliance on custom-made generalization taxonomies for QI values. Relational anonymization is performed with the help of pre-defined taxonomies of the QIs. Meanwhile, these taxonomies do not truly reflect the privacy and utility trade-off. Thus, devising appropriate taxonomies with in-depth analysis of the QIs’ domain values in relational anonymization is challenging. Third, tabular data anonymization in multiple-SA (MSA) scenarios. In some cases, the tabular data contains multiple SAs about individuals. Anonymizing tabular data having MSAs is harder compared to single-SA scenarios due to the high risk of users’ private information disclosure. Fourth, quantifying the impact of QIs on both users’ privacy and anonymous data utility. Each QI affects privacy and utility differently: some QIs can be highly vulnerable in terms of privacy, and some QIs have higher utility. Thus, quantifying each QI’s statistics related to privacy and utility is very challenging in PPDP. Fifth, accurate estimation of the subjects’ re-identification risk. The existing PPDP methods often assume worst-case scenarios regarding attackers and apply heavy changes to the data that impact the shared data utility and often limit data sharing on a wider scale. Thus, accurate and novel adversarial modeling methods are needed to support PPDP in the big data era. Aside from the five potential challenges explained above, in some cases, an adversary can possibly link the whole published dataset (e.g.,
Structural Anonymization Techniques Used for the Social Networks Data Anonymization
Structural anonymization refers to modification of the structural properties of the social network (SN) data (i.e., graphs) to protect against the privacy threats that emerge from SN data publishing. Generally, SN analysts represent SN data mainly via two methods: matrices and graphs [114]. The matrix representation of SN data allows the application of computer tools and mathematical models to summarize the data and extract patterns. Since finding relevant patterns in dense graphs is extremely complex, the adjacency matrix is used as a variant of the graph in social network analysis. For instance, a
Overview of the social network representation using different forms of graphs and a matrix.
Due to their inefficacy in handling complex SN data and their inability to correctly represent the SN with users’ attribute information, matrices are rarely used to represent SN data. In contrast, graphs can properly represent the users’ attribute information along with social relations. Therefore, graphs are most widely used in SN analysis and effectively assist in proving graph-based theorems. This work uses SN data modeled as a graph for further analysis and discussion. Users’ privacy preservation in SN data publishing is very challenging compared to relational data due to the many pieces of private information contained in a
Background knowledge (BK) is a fact or information known to the adversaries about an individual or group of individuals, which can be exploited to infer the SA of the individual(s) from the
The most common technique used for SN users’ privacy preservation in PPGP is anonymization. After an in-depth synthesis of the literature [114]–[116], we present the taxonomy of the PPGP approaches, along with representative anonymization methods employed for SN data, in Figure 10. These PPGP approaches can be broadly classified into five categories, namely graph modification techniques, graph generalization/clustering techniques, privacy-aware graph computation techniques, differential-privacy-based graph anonymity techniques, and hybrid anonymization techniques. A brief description, along with relevant examples, of all five anonymity techniques used in PPGP is given in the subsequent paragraphs.
Taxonomy of privacy preserving graph publishing (PPGP) approaches used for SN data.
Due to the widespread applications of SN data, many structural anonymization techniques have been proposed for PPGP. These techniques modify the structure of the SN graph by adding/deleting vertices or edges to preserve users’ privacy. Aside from adding/deleting, in some cases, edges and vertices are switched or re-arranged in clusters to preserve users’ privacy. An overview of anonymous graphs obtained by adding vertices and edges is shown in Figure 11. In Figure 11(b), two new edges have been created (
The addition/deletion of the nodes and edges can be constrained or random (e.g., non-constrained) depending upon the scenario.
An example of the original graph
Original graph anonymization by using random (e.g., non-constrained) and constrained anonymization methods.
In contrast, in constrained anonymization, the addition/deletion of nodes/edges is bounded by some constraints (e.g., degree). Figure 12(ii)(b) shows an example of the perturbed graph
There are six basic edge and vertex modification techniques for SN data anonymization [44]: (i) edge addition, (ii) edge deletion, (iii) edge add/delete, (iv) simple edge switch, (v) double edge switch, and (vi) node addition. Most of the existing SN data anonymization methods use one (or more) of these six modification techniques during graph anonymization. A detailed taxonomy of the graph modification techniques is given in [53]. The four types of graphs that are mainly used to represent SN users’ data are: simple graphs, bipartite graphs, labelled graphs, and uncertain graphs.
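For illustration, technique (v), the double edge switch, is sketched below with networkx; the toy graph and the number of switch attempts are assumptions. The design rationale is that rewiring two disjoint edges preserves every node’s degree while still perturbing the structure.

```python
# Minimal sketch of a double edge switch: pick two disjoint edges (a, b) and
# (c, d) and rewire them to (a, d) and (c, b). Node degrees are preserved.
# The toy graph and number of switch attempts are assumptions.
import random
import networkx as nx

def double_edge_switch(G, attempts=10, seed=0):
    rng = random.Random(seed)
    H = G.copy()
    for _ in range(attempts):
        (a, b), (c, d) = rng.sample(list(H.edges()), 2)
        # only switch if it neither creates self-loops nor duplicate edges
        if len({a, b, c, d}) == 4 and not H.has_edge(a, d) and not H.has_edge(c, b):
            H.remove_edges_from([(a, b), (c, d)])
            H.add_edges_from([(a, d), (c, b)])
    return H

G = nx.cycle_graph(6)
H = double_edge_switch(G)
print(dict(G.degree()) == dict(H.degree()))  # True: degrees are unchanged
```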
In SN data anonymization, the majority of approaches are derived from concepts that were originally proposed for anonymizing tabular data. For example, the
Original graph anonymization by employing clustering/generalization based method.
A network/graph with seven nodes and two QIs (age, gender) is provided as input (Figure 13(a)); the whole network is partitioned into three clusters (
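To make the generalization/clustering idea concrete, the sketch below groups nodes of a toy graph into hand-made clusters (super-nodes) and publishes only aggregate counts; the graph and the partition are illustrative assumptions, not the method of Figure 13.

```python
# Minimal sketch of graph generalization: nodes are grouped into clusters
# (super-nodes) and only aggregate information (cluster size, intra-cluster
# edge count, inter-cluster edge count) is published. The toy graph and the
# hand-made partition are assumptions.
from itertools import combinations
import networkx as nx

G = nx.Graph([(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6), (6, 4), (6, 7)])
clusters = {"C1": {1, 2, 3}, "C2": {4, 5, 6}, "C3": {7}}

super_nodes = {c: {"size": len(members),
                   "intra_edges": sum(1 for u, v in G.edges()
                                      if u in members and v in members)}
               for c, members in clusters.items()}

super_edges = {}
for (c1, m1), (c2, m2) in combinations(clusters.items(), 2):
    weight = sum(1 for u, v in G.edges()
                 if (u in m1 and v in m2) or (u in m2 and v in m1))
    if weight:
        super_edges[(c1, c2)] = weight

print(super_nodes)   # e.g., {'C1': {'size': 3, 'intra_edges': 3}, ...}
print(super_edges)   # e.g., {('C1', 'C2'): 1, ('C2', 'C3'): 1}
```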
On the contrary, the privacy-aware graph computation and DP based approaches do not release the entire
The useful analyses provided by privacy-aware graph computation and DP-based approaches include: graph density, edge counts, relationship degrees, degree distributions, network size, centralities, closeness, counts of the sub-graphs, top
Overview of the graphs’ statistics determined by privacy-aware graph computation methods.
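As a minimal sketch of privacy-aware graph computation, the snippet below answers a single statistical query (the edge count) under edge-level DP instead of releasing the graph itself; the random graph and the value of $\epsilon$ are illustrative assumptions.

```python
# Minimal sketch of privacy-aware graph computation: instead of releasing G,
# answer a statistical query (here, the edge count) under edge-level DP by
# adding Laplace noise with sensitivity 1. Graph and epsilon are assumptions.
import numpy as np
import networkx as nx

def dp_edge_count(G, epsilon):
    return G.number_of_edges() + np.random.laplace(scale=1.0 / epsilon)

G = nx.erdos_renyi_graph(n=50, p=0.1, seed=42)
print("true edge count :", G.number_of_edges())
print("noisy edge count:", round(dp_edge_count(G, epsilon=0.5), 2))
```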
Aside from these basic statistics, applying node/edge DP approaches for publishing graph cuts and pair-wise distances between nodes is handy for data-driven applications. Furthermore, many
The hybrid anonymity approaches usually employ more than one anonymity technique to yield anonymized SN data [122]. However, the complexity of the hybrid anonymity approaches is relatively higher when
A. Metrics Used for the Evaluation of Privacy and Utility of Structural Anonymization Techniques
1) Graphs Utility Evaluation Metrics
The existing well-known graph utility evaluation metrics are: (1) degree (Deg.), (2) effective diameter (ED), (3) joint degree (JD), (4) local clustering co-efficient (LCC), (5) path length (PL), (6) closeness centrality (CC), (7) eigen vector (EV), (8) betweenness centrality (BC), (9) network constraints (NC), (10) network resilience (NR), (11) infectiousness (Infe.), (12) page rank (PR), (13) hub score (HS), (14) global clustering co-efficient (GCC) or global transitivity (GT), and (15) authority score (AS). Aside from these well-known metrics, some general purpose approaches such as accuracy, classification, information loss (IL), ratio of top influential users (RRTI), query response, and graph spectral properties have also been used to measure anonymous graph utility. Furthermore, there exist seven application-specific utility evaluation methods, which are: (1) role extraction (RX), (2) reliable email (RE), (3) minimum sized influential nodes set (MINS), (4) influence maximization (IM), (5) community detection (CD), (6) source routing (SR), and (7) sybil detection (SD). We refer interested readers to the previous work [123] for more detailed definitions and descriptions of the above metrics.
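For illustration, a few of the metrics listed above can be computed on an original graph and its anonymized counterpart to gauge utility loss; the toy graphs below (the karate club graph and a copy with one edge removed) are illustrative assumptions.

```python
# Minimal sketch of computing a few utility metrics (average degree, global
# and local clustering coefficients, betweenness centrality) on an original
# graph G and an "anonymized" version G' to compare utility loss.
import networkx as nx

G = nx.karate_club_graph()          # stand-in for the original graph G
G_prime = G.copy()                  # stand-in for an anonymized graph G'
G_prime.remove_edge(0, 1)           # pretend one edge was deleted by anonymization

for name, graph in [("G", G), ("G'", G_prime)]:
    avg_deg = sum(d for _, d in graph.degree()) / graph.number_of_nodes()
    print(name,
          "avg degree:", round(avg_deg, 3),
          "| GCC:", round(nx.transitivity(graph), 3),
          "| avg LCC:", round(nx.average_clustering(graph), 3),
          "| max BC:", round(max(nx.betweenness_centrality(graph).values()), 3))
```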
2) Graphs Privacy Evaluation Metrics
The most commonly used privacy metrics for evaluating the performance of graph anonymization algorithms are: (1) the number of re-identified nodes, (2) degree of uncertainty, (3) adversary’s success rate, (4) query error, (5) information gain, (6) amount of leaked information, (7) information surprisal, (8) privacy score, (9) association rule hiding (ARH), (10) distribution leakage, (11) prior and posterior information belief, (12) entropy leakage, (13) probabilistic anonymity, (14) downgrading classifier effectiveness, (15) membership inference analysis, and (16) accurate predictions. Wagner et al. [124] summarized eighty privacy evaluation metrics used by different PPDP algorithms. The authors categorized the privacy metrics based on four common characteristics: (i) adversary models, (ii) data sources (i.e., auxiliary information), (iii) input for computation of the metrics, and (iv) output measures. The selection of a metric depends on the data type, the objectives of data publishing, and the target application/users. However, in most cases, a single metric cannot capture the entire concept of privacy; therefore, two or more metrics are jointly used to measure the level of privacy offered by an anonymization algorithm/model. Recently, Zhao et al. [125] analyzed and discussed twenty-six different metrics used for privacy evaluation in anonymized graphs. The authors suggested that no single metric is effective for evaluating privacy protection in anonymous graphs. Therefore, the authors suggest that the strengths of multiple privacy evaluation metrics can be combined to improve the overall measurement of users’ privacy in a
B. De-Anonymization Methods Employed By the Adversaries to Jeopardize Users Privacy in SN Published Data
Due to the phenomenal growth in SNs adoption around the globe, the SN data has become more reliable for conducting research and achieving multiple business/scientific objectives. The data-miners and analysts can extract the enormous amount of information embedded in the published
Description about de-anonymization approaches used by the malevolent adversaries for privacy breaches.
Aside from the de-anonymization approaches explained above, Beigi et al. [170] summarized various seed-based and seed-free de-anonymization methods employed by adversaries for privacy breaches in published SN data. The authors provided a detailed overview of the latest SN data anonymization and de-anonymization approaches, and a theoretical analysis of graph de-anonymization. In addition, various user attribute inference methods have been reported by the authors with examples. Apart from the de-anonymization approaches explained in Figure 15 and previous studies, we summarize in Table 5 the key items that enable unique identification of individuals/groups or SA disclosure from privacy-preserved graph data publishing.
These items are usually exploited by adversaries to jeopardize users’ privacy in published SN data, i.e., to infer/predict the true identities of users or their SAs. Each item, when exploited by adversaries, assists in compromising an individual’s or a community’s privacy. Various approaches based on these items/features have been proposed to compromise people’s privacy by leveraging the data of a single SN or of multiple SNs (a.k.a. across SNs). Drawing on the reviewed graph de-anonymization techniques, the majority of the techniques enable adversaries to successfully compromise users’ privacy.
The privacy items and corresponding de-anonymization methods summarized in Table 5 pose serious threats to SN users’ privacy in PPGP. Apart from the well-known methods summarized above, users’ privacy can also be breached through information acquisition about a user’s changing interests over time and through predictions about a user’s private information via side channels of various types (i.e., interests). Recently, due to the availability of an excessive amount of auxiliary information and advanced data mining tools, the scale and scope of privacy breaches are expanding from the identification of an individual or the disclosure of an SA to the theft of group identities for achieving multiple scientific and business objectives, and to identifying communities in SNs having common characteristics or common interests for accurate recommendations.
Furthermore, when a group of SN users forms an online community, the SN service providers have access to more information, including political viewpoints, preferences, relationship status, or financial status, because a user often readily shares more about him or herself in an online community. Accordingly, this goldmine of data, when shared with data-miners, can lead to privacy breaches. In addition, the prediction of a user’s social connection information with advanced data mining tools, and the identification of individuals and user groups through content analysis and activities, have become serious challenges for SN service providers [122]. Hence, sharing SN users’ data can endanger users’ privacy in unexpected ways, and researchers are devising many domain- and attack-specific, as well as general, PPGP methods to combat this growing social problem (i.e., safeguarding SN users’ privacy).
C. Structural Anonymization Approaches Used for Application-Specific Scenarios in SN Data Publishing
In Table 4, we summarized the generic structural anonymization approaches used in PPGP. In this subsection, we summarize the recent structural anonymization approaches used in application-specific scenarios of SN data publishing. These scenarios include anonymization approaches for friend recommendation, community clustering/detection, collaborative filtering, topic modelling, and SN user behavior analysis. Aside from structural modifications to graphs, the anonymization approaches used in application-specific scenarios also consider the features of the applications for which the data is being anonymized. For example, Xu et al. [29] proposed a framework for discovering privacy-preserved communities in SNs. The proposed framework enables the formation/detection of different communities without revealing sensitive link information. In this work, we summarize the state-of-the-art structural anonymization approaches used in application-specific scenarios of SNs in Table 6. The anonymization approaches summarized in Table 6 consider the features of the related application during the anonymization process. Furthermore, some approaches have been used for multiple/heterogeneous applications due to overlapping properties/characteristics between them.
Recently, due to the phenomenal growth in opportunities offered by SN data, researchers have turned their attention to devising new and realistic anonymization methods leveraging advanced artificial intelligence (AI) techniques. Li et al. [226] designed a deep learning (DL) model that combines multiple factors such as attribute information, graph structure, and behaviour characteristics, while avoiding tedious calculation procedures, to measure privacy in SNs. The proposed model considers the deep relationship between all three factors (user attribute information, graph structure, and behaviour characteristics) to accurately obtain the privacy score.
Alemany et al. [227] devised two metrics (Audience and Reachability) for the assessment of privacy in information-sharing scenarios in SNs. The authors performed rigorous simulations on different SN topologies, considering different layers, and concluded that the network topology in SNs has a direct effect on the outreach of information. Pensa et al. [228] described a knowledge-driven approach for enhancing privacy awareness in SNs. The proposed approach has the ability to measure the privacy risk of the users and inform them whenever their privacy is breached or at risk. It also helps the exposed users to customize their privacy level semi-automatically by limiting the number of manual operations. Li et al. [229] suggested that private information on SN sites is time-sensitive, which means that information held by users who are no longer in the same environment/place as the target user may no longer be reliable/true and may have lost its value. The authors combined behavioral characteristics and structural similarity to accurately filter the user groups who hold the current private information of a target user in order to measure a user’s privacy status. Ruggero et al. [230] suggested that a user’s privacy in SNs can be influenced by many external factors (e.g., the position of the user within the social graph, the relative risk of the network). To solve this problem, the authors devised a network-aware privacy score metric that accurately measures the user’s privacy risk according to the characteristics of the network (i.e.,
A semi-supervised framework based on structural embedding for account correlation was proposed by Zhou et al. [231]. It learns the latent and structural semantics for accounts correlation between networks. It correlates accounts with high accuracy by leveraging the semantic information among accounts through random walks approach. Furthermore, the user’s identity disclosure/matches problems with limited profile items have also been reported in the literature [232], [233]. In recent years, federated learning (FL) based privacy preserving approaches have been proposed for anonymous data publishing with legitimate third-parties [234]–[238]. Furthermore, many anonymization approaches have been proposed to preserve the users’ privacy in sequential publication of the SNs data [239]–[241].
D. Challenges in Social Network Data Anonymization
SN data is usually represented as a graph, and a structural anonymization approach is applied to sanitize it before releasing it to data-miners. Anonymizing SN data is much more challenging than anonymizing tabular data due to the complex structure and the variety of information embedded in graphs about the entities (i.e., users). The three well-known challenges related to SN data anonymization are: (i) modeling the background knowledge (BK) of the adversaries (in SNs, appropriate modeling of the adversaries’ BK is harder than in tabular data because adversaries can have access to multiple pieces of information about an individual, and they can re-identify the target individual from the
Summary and Discussion About the Privacy Issues in Future Computing Paradigm
In this article, we have covered most of the concepts related to the anonymization approaches used for data owned by both physical organizations (e.g., hospitals, banks, and insurance companies) and virtual platforms (e.g., Facebook, Twitter, and LinkedIn). Specifically, we described the anonymization methods employed for two types of input data: tables and graphs. We emphasized SN data anonymization more, considering the exponential adoption of SNs around the globe by adults and the unprecedented opportunities these platforms offer in terms of business intelligence. Furthermore, data-driven technologies and big data analytics are playing a vital role in extracting embedded knowledge from unstructured data to improve SQ [242].
Apart from these two types of data (i.e., tables and graphs), a wide range of data types, such as matrices (e.g., trajectory information, market basket data, and ratings data), digital traces, logs, documents (e.g., medical prescriptions and data from disaster/disease control agencies), images, videos, text documents (e.g., reviews, blogs, and opinions), and time series data, can reveal private information in the digital landscape. Hence, enterprises and organizations are constantly exploring innovative strategies and methods to remain competitive in their markets while ensuring users' privacy [243]–[247]. Nevertheless, data policies, including privacy, intellectual property, security, and liability issues, should be addressed in all phases of person-specific data handling (e.g., collection, pre-processing, anonymization, sharing, and analytics) in order to exploit the value of big data. Decentralized anonymization methods are handy solutions for truly benefiting from data publishing without significantly impacting users' privacy [248], [249]. Now that AI techniques have become mature enough to assist in data-driven decision making, users' privacy protection has become imperative for most organizations' success [250]. Considering the applications and unprecedented opportunities of data sharing, resolving the privacy and utility trade-off in PPDP remains challenging.
In the future computing paradigm, four mainstream technologies will form the centre of the information technology (IT) world: big data, cloud computing, social networks, and the Internet of Things (IoT). These technologies have the ability to process virtually any kind of data with advanced analytics tools to extract insights from the collected data. The main drivers of these technologies are expanded internet connectivity, low-cost sensors, higher mobile adoption, and large IoT investments. In Figure 16, we present the future computing paradigm's technologies, data sources, and analytics solutions that are in use to serve mankind better than in the recent past.
Despite the benefits offered by these latest technologies, there exist many barriers such as technological fragmentation, security concerns, implementation problems, and privacy concerns. These potential barriers have been a bottleneck for the wide application and development of these technologies and have thus attracted widespread concern. Among others, privacy concerns significantly hamper the development and application of these technologies, and this field has attracted significant attention from the research community in recent years. Furthermore, many recent methodologies such as homomorphic encryption, federated learning, deep learning, and blockchain have also been used for privacy protection in the future computing paradigm (year 2020 and beyond) [251]–[258]. Butpheng et al. [259] presented various research perspectives on privacy and security within IoT-cloud-based e-Health systems; the authors outlined the benefits of such systems and analyzed the corresponding security and privacy solutions.
We summarize the various privacy issues related to the future computing paradigm, distilled from an in-depth synthesis of previous studies, in Figure 17. We refer the interested reader to previous works [254], [260]–[268] for more detailed descriptions and definitions of the privacy issues related to the future computing paradigm. Therefore, future research must find new ways to make anonymization solutions more resilient to these issues, such as quantifying the risks and benefits of data publishing, selecting the appropriate privacy-preservation mechanism based on adversaries' capabilities (e.g., worst-case scenarios), verifying the effectiveness of privacy-enhancing technologies through real-world applications, and increasing users' awareness of privacy leakage through hidden routes.
Figure 17. Comprehensive overview of the privacy issues in the future computing paradigm (year 2020 and beyond).
Moreover, achieving effective privacy protection should focus more on exploiting the intrinsic characteristics of users' data, or of the application for which the data is being anonymized, through the simulation of diverse privacy approaches [269]–[271]. Furthermore, privacy-enhancing technologies (PETs) that improve privacy preservation in personalized services while satisfying both economic and ethical objectives have become more urgent than ever [272]. In recent years, the privacy-preserving machine learning (PPML) concept has been employed to extract knowledge from distributed databases while ensuring data privacy [273], [274]. With PPML, traditional machine learning algorithms can be adapted to secure users' data stored across multiple digital environments.
Recently, privacy-aware data cleaning techniques have significantly reduced the data-preparation cost of the data analysis pipeline [275], [276]. These techniques allow clients to buy clean, curated data from heterogeneous service providers and perform analytics without compromising users' privacy. Furthermore, the development of a privacy information management system (PIMS) in accordance with international standards (e.g., ISO/IEC 27701) is imperative to safeguard the privacy of individuals or small groups in the population [277]. Under such a system, if an anonymization mechanism cannot safeguard users' privacy, or if the anonymized data can be used to uniquely identify individuals or small population groups, the data cannot be released without legal advice or additional technical measures.
Promising Open Research Directions
Some promising open research directions/problems that need further research and development from both academia and industry are outlined below.
Users groups' privacy issues: In tabular data anonymization, the majority of existing approaches focus solely on an individual's privacy preservation and are therefore less resilient to user groups' privacy preservation. For instance, the $k$-anonymity model creates equivalence classes with $k$ users in each class. On the one hand, it protects individual privacy by hiding each user in a crowd of $k$ users; on the other hand, it may explicitly disclose private information about user groups (a toy illustration of this group-level disclosure appears after this list). Hence, devising new PPDP methods for user groups' privacy protection would be promising.
Excessive information loss caused by over-generalization of the QIs: In tabular data anonymization, most existing approaches anonymize every QI present in a dataset, which can lead to excessive information loss. Some QIs are not vulnerable in terms of users' privacy, and their unnecessary generalization significantly impacts data utility. Thus, quantifying each QI's impact on privacy and utility, and controlling unnecessary generalization to the extent possible while anonymizing users' data, requires further research and development from the research community.
Imbalanced datasets anonymization: In some cases, the relational dataset can be highly imbalanced (i.e., the SA's value distribution is not uniform), and its anonymization is very challenging. In such datasets, enforcing hard constraints such as making each class $\ell$-diverse or $t$-close is not possible in practice. Hence, new approaches for anonymizing imbalanced datasets are required to effectively protect users' privacy without degrading data usefulness.
Effective resolution of the privacy and utility trade-off in PPDP: In data anonymization, there exists a strong trade-off between privacy and utility. Tailoring the anonymization towards privacy objectives can adversely affect the anonymous data's utility, and vice versa. This longstanding challenge in the field of PPDP still awaits novel solutions that support privacy-preserving big data analytics.
Personalized privacy preservation in SNs: In SNs, each user has different requirements and concerns regarding his/her information privacy, which is called personalized privacy (PP). For example, in a social graph some users may want to hide only a sensitive relationship (e.g., a romantic partner), while others may want to hide all social connections (i.e., all friends). Therefore, PP involves a high level of subjectivity and is very difficult to implement. Hence, innovative solutions that can incorporate SN users' PP requirements into SN data anonymization are required.
Accurate modeling of the adversaries' background knowledge: Adversaries possess more side-channel information as background knowledge (BK) about SN users than about tabular data subjects. This BK continues to grow due to access to other publicly available SNs and the richness of the information embedded in SNs. Recently, text mining and natural language processing (NLP) techniques have been used to exploit the content posted by SN users to match their identities. Thus, accurate modeling of the BK while anonymizing SN data is very challenging, and further research on how to model the BK during the anonymization process is required to effectively protect users' privacy in the big data era.
Generic solutions for social graph anonymization: Generally, SN data is modeled with the help of graphs (a.k.a. sociograms). These graphs can be of different types, such as simple, directed, undirected, weighted, and labeled directed graphs. An anonymization mechanism proposed for one type of graph cannot be directly applied to another. For instance, the $k$-degree anonymity concept cannot be applied to directed graphs straightforwardly, as it requires detailed analysis of the in- and out-degree sequences (a minimal sketch of such a check appears after this list). Hence, devising generic anonymization methods that can work with multiple graph types will be an interesting research area in the future.
Controlling large-scale user identification issues by evading data mining and SN analysis (SNA) tools: In SNs, users establish relationships with like-minded people or people having similar interests, which results in the formation of online communities. A growing body of research has been devoted to community discovery in SNs. On the one hand, community discovery is beneficial for multiple purposes such as information spread and control. On the other hand, community discovery in privacy-preserved published SN data can jeopardize user and community privacy when the published data is analyzed with advanced data mining and SNA tools. Hence, devising new solutions that are resilient to community discovery and community-based node/user mapping in SN data publishing has become more pressing than ever.
Exploiting global and local features of SN data to safeguard against network reconciliation problems: Recently, identifying users across SNs by leveraging multiple methods, such as matching multiple SN graphs (i.e., network reconciliation), mapping display names, analyzing contents and activities, correlating accounts on heterogeneous SNs, and their combinations, has become an active area of research. Accordingly, privacy approaches need significant upgrades to protect users' privacy against network reconciliation. In this regard, approaches that perform data anonymization by exploiting local (e.g., common mapped neighbours, number of friends, and mutual friends) and global (e.g., betweenness, centralities, tie strength, and multi-hop neighbours' information) features of the social graph will be an interesting research area in the near future for PPDP.
Metric suites rather than a single metric for quantifying the level of privacy in anonymous graphs: Generally, one type of metric is employed for evaluating the level of privacy/utility offered by anonymization solutions in an anonymized graph $G'$. Moreover, in real-world cases, the privacy quantified by a single metric may not be monotonic (e.g., it may report lower privacy for stronger adversaries) or reliable from multiple viewpoints. Hence, combining multiple privacy metrics that more accurately measure the level of privacy and mitigate the weaknesses of individual metrics is a promising research direction, considering the widespread interest in publishing SN data with legitimate information consumers.
Devising privacy-friendly mechanisms for exceptional situations: During 2020, the whole world faced an unanticipated and extraordinary challenge from an unknown enemy, the coronavirus disease 2019 (COVID-19) [278], [279]. The COVID-19 pandemic has affected every profession around the globe, and governments have relied heavily on non-pharmaceutical interventions (e.g., strict lockdowns, city and facility closures, social distancing, geo-location-based analysis of users' mobility, proximity detection, and digital contact/suspect tracing) to curb the spread of COVID-19 [280]. In addition, for epidemiological investigations, some governments employed extensive measures (e.g., credit card data, mobile phone signals, Bluetooth and GPS data, and CCTV data) to find COVID-19 suspects and hotspots [281]–[283]. Due to the adoption of these digital methods, a huge amount of personal data has entered cyberspace, and privacy violations have been constantly reported around the globe. For example, in Italy from January to April 2020, privacy violations in the healthcare sector related to companies and individuals doubled [284]. Furthermore, privacy violations in the post-COVID-19 era are expected to increase, as many companies have collected multi-faceted data about people's lifestyles [285]–[287]. Considering the necessity of users' privacy preservation, the ethical aspects during and after the COVID-19 era must be carefully observed and addressed [288]–[294]. Hence, privacy-friendly solutions are required for exceptional situations such as the COVID-19 pandemic to safeguard patients' privacy and to address other pandemic-related aspects (e.g., privacy-preserving symptom tracking and reporting, collecting only relevant information from users, encrypting sensitive information, GDPR-compliant and privacy-aware contact tracing, and decentralized solutions for computing the probability of exposure). Further, PPDP mechanisms involving COVID-19 patients' data are deemed necessary to safeguard against discrimination and hatred towards certain religions, countries, sects, castes, and sexual minorities during published data analytics. Hence, analyzing and mitigating the potential privacy risks and vulnerabilities in the contact tracing apps developed by many countries to fight the pandemic is a vibrant area of research.
Adoption of Industry 4.0 techniques for PPDP: In recent years, many innovative techniques such as few-shot learning, federated learning, transfer learning, deep learning, and blockchain have revolutionized human life in many aspects. These techniques have been extensively used in areas such as healthcare, social engineering, data analytics, prediction and forecasting, knowledge extraction, and image recognition/analysis. Due to the availability of huge amounts of labeled data and their ability to work in a decentralized fashion, these techniques can be utilized for users' privacy preservation with enhanced usefulness. The heterogeneous federated transfer learning (HFTL) framework [295], privacy-preserving deep learning (PPDL) technique [296], deep transfer learning (DTL) method [297], adaptive privacy preserving federated learning (APPFL) method [298], blockchain-enabled privacy preserving (BCEPP) architectures [248], [299], secure collaborative few-shot learning (SCFSL) framework [300], searchable encryption (SE) methods leveraging ciphertext-policy attribute-based encryption (CP-ABE) [301], [302], data resource protection solutions leveraging smart contracts [303], cyber security solutions improved with AI's potential [304], and computational intelligence based methods for information security [305], to name a few, have already been used in practical applications related to PPDP. Hence, devising robust and lightweight techniques that involve fewer parameters and can work alongside traditional anonymization approaches to scale up privacy preservation with enhanced data utility is a promising area of research for the future.
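As flagged in the item on users groups' privacy above, a k-anonymous release can still leak group-level facts when an equivalence class is homogeneous in its sensitive attribute. Below is a toy sketch, assuming a simple generalization scheme (10-year age bands and 3-digit ZIP prefixes); the records, helper names, and attributes are purely illustrative.

```python
from collections import defaultdict

def generalize_age(age):
    """Toy QI generalization: map an exact age to a 10-year band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def equivalence_classes(records):
    """Group records by generalized quasi-identifiers (age band, ZIP prefix)."""
    classes = defaultdict(list)
    for age, zip_code, disease in records:
        key = (generalize_age(age), zip_code[:3] + "**")
        classes[key].append(disease)
    return classes

records = [
    (34, "47677", "flu"), (36, "47602", "flu"), (38, "47678", "flu"),
    (51, "47905", "cancer"), (55, "47909", "hepatitis"), (59, "47906", "flu"),
]

for key, diseases in equivalence_classes(records).items():
    k = len(diseases)              # class size: the "k" in k-anonymity
    distinct = len(set(diseases))  # distinct SA values: the "l" in l-diversity
    print(key, "k =", k, "distinct SA values =", distinct)
    if distinct == 1:
        print("  -> group-level disclosure: every member of this group has", diseases[0])
```

Both classes are 3-anonymous, yet the first one reveals that everyone in the (30-39, 476**) group has the flu, which is exactly the group-level disclosure the item above calls for new PPDP methods to prevent.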
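Relatedly, the item on generic graph anonymization notes that k-degree anonymity does not carry over to directed graphs without analyzing in- and out-degree sequences. The sketch below assumes one possible formulation, in which every (in-degree, out-degree) pair must be shared by at least k nodes; other formulations treat the two sequences separately, and the edge list is purely illustrative.

```python
from collections import Counter

def is_k_degree_anonymous_directed(edges, k):
    """Check whether every (in-degree, out-degree) pair occurs at least k times.

    edges: iterable of (source, target) pairs of a directed graph.
    The joint pair is treated as the adversary's background knowledge here.
    """
    in_deg, out_deg, nodes = Counter(), Counter(), set()
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
        nodes.update((src, dst))
    pair_counts = Counter((in_deg[n], out_deg[n]) for n in nodes)
    return all(count >= k for count in pair_counts.values())

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
print(is_k_degree_anonymous_directed(edges, 2))
# False: "a" (in 2, out 1) and "d" (in 0, out 1) each have a unique degree pair,
# so an adversary who knows a target's in/out degrees can single them out.
```

A generic anonymizer would have to add or rewire edges until such a check passes for every supported graph type, which is what makes one-size-fits-all solutions difficult.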
Conclusion
In this paper, we have presented the latest research proposed to release useful information while preserving users' privacy from malevolent adversaries, namely privacy preserving data publishing (PPDP). In recent years, there has been an increasing focus on the rapid development of more practical anonymization solutions due to the significant rise in privacy breaches across the globe, and this area is attracting researchers' interest rapidly. Owing to rapid technological developments in communication science and technology, a tremendous amount of user data can now be easily obtained in diverse formats, ranging from relational tables to complex social graphs. Although this increasing amount of data offers unprecedented opportunities for analytics, it also increases the chance of individuals' privacy breaches. In addition, most of the traditional anonymization algorithms that were proposed for tabular data can rarely perform well on social network (SN) data (i.e., graphs) without modifications. Hence, it is of paramount importance to provide a good perspective on the information privacy area involving both tabular and SN data, along with recent anonymization research. In this work, we have provided detailed and systematic coverage of the relational anonymization techniques used for tabular data before presenting recent structural anonymization approaches used for SN data anonymization. We have summarized and compared a substantial number of anonymization approaches used for information privacy protection involving both SN and tabular data. Furthermore, we have provided deeper insights into the privacy problems of the future computing paradigm that will be helpful in devising more secure anonymization methods, and we have discussed numerous promising open research directions/problems that need further research and development. In this survey, we specifically focused on SN data anonymization and de-anonymization techniques considering the widespread applications and use of SN data. SN data contains a treasure trove of information that needs to be protected from malevolent adversaries, which explains the ever-increasing interest of researchers in the area of SN data anonymization. Nevertheless, users' data anonymization is still undeniably complex, and it requires significant improvements in existing approaches as well as the development of new practical approaches with regard to better utility and privacy preservation. In future work, we plan to devise new anonymization methods for SN data, and we intend to explore privacy problems in Industry 4.0 technologies.
Conflict of Interest
The authors declare that they have no conflict of interest.