1 Introduction
With the rapid pace of digitization, high-dimensional data, such as healthcare data or user behaviour data, are increasingly collected and used for a variety of purposes. More often than not, such data are held by different parties, so that the data are in effect horizontally partitioned among them. When integrated, these distributed data can be a valuable source for supporting better decision making or providing high-quality services. However, since the dataset held by each party may contain highly sensitive personal information, simply integrating the local datasets and sharing the integrated result poses serious threats to individual privacy.

The following scenario further motivates the problem. Assume that three hospitals \$H_1\$, \$H_2\$ and \$H_3\$ want to integrate their patient data and share the integrated result to facilitate more effective clinical research. Table 1 shows the patient data integrated from the three hospitals, where records 1 to 3 are from \$H_1\$, records 4 to 6 are from \$H_2\$, and records 7 to 10 are from \$H_3\$. Although these records are integrated with only pseudo IDs, many individuals could easily be re-identified by adversarial data recipients with some background knowledge. Suppose the adversary knows that the target patient is a 40-year-old Builder. Record #9, together with the patient's sensitive value (i.e., Lung cancer), can then be uniquely identified, since he is the only 40-year-old Builder in the integrated data.