Introduction
The ubiquity of mobile/IoT devices has led to a data-driven society. In a data-driven society, we can collect location data through sensors, applications, networks, and APIs. The collected location data can be stored and managed in databases and the cloud for analysis at any time. Location data consisting of people’s waypoints and trajectories is among the most essential data because it is extremely useful for applications such as urban planning, epidemiology, and disaster prevention [1], [2], [3], [4]. Although the main interest of those applications is the statistical trend of people’s movement, location data contains sensitive information such as home or workplace locations and can therefore cause privacy exposure.
To anonymize such sensitive information, privacy protection based on Local Differential Privacy (LDP) has attracted great attention [5]. In LDP, a device does not trust any third party, including the datastore and other devices, and does not expose its raw data because sharing raw data can lead to a privacy leak. Instead, the device adds small changes (noise) to all data values at the point of sensing, i.e., before it sends the data to a datastore. Figure 1 gives an overview of the LDP process. Each Device (Client) obtains perturbed data
Overview of LDP. Each Device (Client) inputs its raw data
However, LDP has been shown to be vulnerable to various poisoning attacks [6], [7] because it gives the datastore no mechanism to validate whether received data is reliable. The cause is that the datastore collects data in a way that couples data volume with the statistical characteristics of the data: the more data a device sends, the more it changes the statistics of the merged data in the datastore. An adversary’s device can tamper with the data or inject arbitrary amounts of data to intentionally change the statistical characteristics of the whole data in the datastore. Since benign devices also do not disclose raw data for fear of privacy leaks, the datastore cannot validate modified data in collaboration with a device, which may itself be an adversary. This relationship between devices and the datastore creates an environment of mutual distrust.
In this paper, we propose a location data collection method that extracts the statistical characteristics of received data irrespective of data volume in an environment of mutual distrust. Our method combines LDP with the Oblivious Transfer (OT) protocol [8] to obtain statistical characteristics only. OT protocol forces a receiver to discard messages with a certain probability. We use this mechanism to let the datastore (receiver) collect only the statistical characteristics of data from devices (senders). If a malicious device crafts data or amplifies the data volume, OT protocol drops the crafted data before it reaches the datastore. An adversary thus cannot readily distort the statistical characteristics, and the datastore can mitigate data poisoning attacks.
Although OT protocol helps decouple statistical characteristics from data volume, we additionally have to adjust the degree of perturbation and the overall data volume. The amount of perturbation is calculated on the assumption that the datastore receives all data. With OT protocol, however, the datastore receives fewer pieces of data, so the received data ends up with a stronger protection level than intended. Moreover, the sampling performed by OT protocol leaves too little data to extract statistics. Hence, we need to adjust perturbation and data volume so that statistical characteristics can be extracted while privacy is preserved. In our proposal, the devices adjust the degree of perturbation according to the target protection strength before applying LDP to raw data. After receiving the perturbed data, the datastore compensates for the perturbed data lost to OT protocol by generating synthetic data based on a pair of random devices. Generating synthetic data brings the perturbed data volume closer to the raw data volume and allows statistics to be extracted more accurately.
This work is an extension of our preliminary version in Reference [9]. The current version offers significant novelty over the preliminary version. The preliminary proposal degraded the statistics in the data because it dropped almost all data through OT protocol; in contrast, the current proposal recovers missing data by oversampling so that statistics can be extracted. The current version can extract statistics with accuracy close to that of pure LDP (see Section V). Moreover, we design the current proposal to minimize the impact of two data poisoning attacks. The preliminary version did not evaluate the impact of data poisoning attacks, whereas the current version shows through extensive experimental evaluation that our method is more secure than pure LDP.
The contributions of this study are threefold. First, we establish location data collection in an environment of mutual distrust by combining LDP and OT protocol. While merely combining these two techniques would result in the loss of statistics, we properly transfer the statistics by adjusting the perturbation of LDP and compensating for the data loss of OT protocol. Second, the proposed method mitigates data poisoning attacks, which have been pointed out as a vulnerability of LDP. Assuming that a certain percentage of the participants in the data collection are adversaries, we conduct an experiment to check how accurately the statistics can be extracted. The experimental results show that our proposal is more robust against data poisoning attacks than pure LDP. Third, we show through experiments on real and synthetic datasets that the proposed method can actually collect statistics, and we measure privacy protection, execution time, and throughput to assess whether the method can be applied to IoT and mobile environments with small memory.
The structure of this paper is as follows. We first review related work on anonymization, perturbation, and OT-based data collection in Section II. Section III presents our threat model. In Section IV, we design a novel LDP-based data collection method that decouples statistical characteristics from data volume. In Section V, we analyze our proposal from the viewpoints of privacy and overhead. In Section VI, we compare our method with related work and discuss the possibilities and limitations of using statistics transfer outside of its scope. Finally, in Section VII, we summarize this study and discuss future work.
Related Work
A. LDP-Based Location Data Collection
LDP enables every device to protect its privacy even when the device cannot trust the datastore. In LDP, each device only has to add noise to its own data to ensure indistinguishability and send the resulting noisy data. The degree of indistinguishability is determined by the privacy budget
B. OT Protocol-Based Location Data Collection
Oblivious Transfer Protocol (OT protocol) [8] is an effective technique for selecting which packets to receive while keeping the received data secret. In OT protocol, a sender (device) sends many packets encrypted with a public key, each of which carries a different piece of data, and a receiver (datastore) decrypts and obtains some of them with a predetermined probability by means of a key exchange. The protocol was originally used for secure computation, privacy preservation, etc. OT protocol is a broadly studied cryptographic primitive that involves two mutually distrustful peers who wish to transfer messages in an oblivious manner. Here it runs between a client and a datastore, and the client transfers some value to the datastore. Since OT protocol allows the datastore to unilaterally choose the data it receives, the client has no way of knowing which value the datastore receives (decrypts). Many related works have adopted OT protocol to collect and sample location data while protecting the client’s privacy [12], [13], [14]. OT protocol certainly allows sampling of received data, but it does not anonymize or perturb the sensitive data itself. For this reason, as with CDP, privacy is protected only if the client can trust the datastore to protect its privacy while managing the data.
In summary, LDP can protect privacy more securely than other methods (anonymization approaches and CDP), but it is not effective for location data collection if the datastore cannot trust the devices, because an adversary can then distort the statistics in the datastore. On the other hand, OT protocol allows the datastore to discard or receive data selectively, but the device must unconditionally trust the datastore. To solve this dilemma, we use LDP and OT protocol together to collect location data in an environment of mutual distrust.
Threat Model
In this section, we present our threat model. There are two types of attacks on LDP: those at the input stage and those at the output stage [6], [7]. Following the definitions in references [6], [7], we define the Input Poisoning Attack (IPA) and the Output Poisoning Attack (OPA) for location data collection. Figure 2 gives an overview of IPA and OPA. In the following subsections, we explain the adversary’s capabilities and motivations in each attack.
A. Input Poisoning Attack (IPA)
First, we define the adversary’s capability in IPA. According to related work [7], an adversary can easily obtain a large number of fake accounts. We thus assume that the adversary can create fake accounts and manipulate them to amplify the data volume. Specifically, an adversary accesses
The adversary’s goal is to increase the amount of data sent using fake accounts and thereby distort the statistics that the datastore would otherwise have obtained from the data. For example, an adversary can distort statistics to disrupt a service that calculates crowding at landmarks (restaurants, railway stations, amusement parks, etc.) based on location data, thereby degrading the quality of the service. For crowdsourcing services, distorted statistics can likewise spoil the system by fabricating fake real-time events (e.g., traffic congestion). This type of attack on location crowdsourcing has been pointed out as an example of practical GPS spoofing in reference [15].
B. Output Poisoning Attack (OPA)
We assume that an adversary can access a group of fake accounts by illegally registering accounts and/or purchasing them from dark markets [7]. If the adversary knows the implementation of the LDP process, they can craft the data sent to the datastore by bypassing the perturbation or by replacing the process that outputs the perturbed value with a process that outputs an arbitrary value (e.g., using tools to spoof a GPS tracking device and/or to amplify data by making
Unlike in IPA, we consider that the adversary’s goal in OPA is to distort the frequency of certain target locations using fake accounts. To achieve this objective, the adversary carefully spoofs and crafts all the location data of the fake accounts. As a result, the data collected by the datastore grows with the number of fake accounts and the amount of data they send, and only certain waypoints or trajectories have a high frequency. By manipulating the frequency of a certain location, the adversary can intentionally manipulate the service. For example, in dating applications, an adversary can measure the distance to a particular user while spoofing their own location and can estimate that user’s approximate location in real time [16]. Based on this estimation, the adversary can deliberately act in a way that facilitates matching with the target’s device in the system. Not only can location data be leaked, but cyber-stalking can also turn into real-life stalking, so that automated stalking exposes the user’s privacy even physically. As a real incident of this kind, incorrect route recommendations have led tourists into life-threatening deserts with extremely high temperatures and no water supply [15], [17]. In summary, an adversary in OPA uses fake accounts to send spoofed location data to fraudulently increase the frequency of a particular point or route, greatly distorting the statistics that the datastore would otherwise obtain. OPA is more critical than IPA because it allows statistical poisoning with regard to arbitrary locations.
Oblivious Statistic Collection
In this section, we describe the proposed Oblivious Statistic Collection. We first describe the overview of the proposed method and its limitations in Sect. IV-A. Then, from Sect. IV-B to Sect. IV-D, we describe the solution to each limitation.
A. Limitations on LDP-Processing Over OT Protocol
To realize location data collection in an environment of mutual distrust, we combine LDP and OT protocol. The combination decouples statistical characteristics from data volume, which allows the collection of statistical features without exposing any raw data outside the device. Figure 3 gives an overview of the proposed method, and we explain each process below. In this method, the device first samples its data and creates a message to be transmitted using OT protocol. When creating a message, the device adds noise to satisfy LDP and protect data privacy. All devices sample from the oldest waypoints in order to preserve the continuity of the trajectory. For instance, if the client holds the data volume
The overview of the proposed method. Devices send location data over 1-out-of-n OT protocol, and the datastore merges them to obtain statistical characteristics. Even if there is a malicious device (adversary), they cannot distort the statistics because of the limited data transfer rate.
However, implementing LDP on OT protocol causes three serious problems: the noise amount is not uniform across the messages of OT protocol (Problem 1), the noise amount increases due to OT protocol message drops (Problem 2), and statistics are lost due to OT protocol message drops (Problem 3). First, we describe Problem 1. In order to protect privacy in LDP, devices use privacy mechanisms such as the Laplace or exponential mechanism to add noise to individual waypoints. This makes the waypoints indistinguishable from each other. The added noise should be consistent to ensure that the overall privacy protection is constant. However, when using OT protocol, only a portion of the data is transferred. This can result in an inconsistent level of noise in the data received by the datastore, which can lead to either insufficient privacy protection or excessive protection that goes against the device’s intent.
Next, we describe Problem 2. Over OT protocol, the sender divides all data into pieces called messages and sends them. The datastore (receiver) receives randomly selected data over OT protocol with a predetermined probability. This mechanism, which is equivalent to sampling, produces excessive noise relative to the amount of data. This is because the device (sender) does not know in advance how much data the datastore will receive, so it adds noise on the assumption that the datastore will receive all the data. Figure 5 shows a specific example of privacy loss due to pure LDP over OT protocol. In Figure 5, LDP is applied over OT to collect location data while protecting privacy. A simple combination of LDP and OT provides overly strict privacy protection because the majority of the data is never received once the datastore samples the messages. If data is collected so that
The overview of the combination of pure LDP and OT protocol. Since the datastore (receiver) receives randomly selected data with a predetermined probability over OT protocol, if the device (sender) adds noise to all data, the amount of noise is excessive. Then, the datastore cannot correctly extract statistical characteristics from the data.
Finally, we explain Problem 3. OT protocol divides data into messages and transfers them, but because most of the messages are lost, the data volume received by the datastore is small. For example, if there are 100 participating devices and 100 records of location data are collected per device, pure LDP collects 10,000 records, whereas this method collects only 100 records. Extracting statistics from small-scale data is difficult, and the noise added to satisfy LDP makes it even harder when LDP is implemented over OT protocol. Therefore, combining LDP and OT protocol requires data engineering that brings the data volume back up to that of the original data so that statistics can be extracted. However, naively increasing the data volume generates synthetic data whose distribution differs completely from that of the original data. To extract appropriate statistics, it is necessary to adjust the data volume while taking the original data distribution into account.
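The arithmetic in this example can be reproduced with a short simulation. The sketch below is only an illustration, assuming that each device packs its records into a single 1-out-of-n OT round so that the datastore learns exactly one record per device; the function names and the uniform random choice that stands in for the cryptographic exchange are hypothetical and are not the implementation used in this paper.

```python
import random


def collect_pure_ldp(device_records):
    """Pure LDP: the datastore keeps every (perturbed) record it is sent."""
    return [record for records in device_records for record in records]


def collect_one_out_of_n(device_records, rng):
    """Stand-in for 1-out-of-n OT: exactly one record per device survives.

    The cryptographic exchange is replaced by a uniform random choice,
    which is an assumption made purely for illustration.
    """
    return [rng.choice(records) for records in device_records if records]


rng = random.Random(0)
# 100 devices, each holding 100 (already perturbed) records.
devices = [[(d, i) for i in range(100)] for d in range(100)]

print(len(collect_pure_ldp(devices)))           # 10000 records
print(len(collect_one_out_of_n(devices, rng)))  # 100 records
```

Under this assumption, a device that holds ten times more records still contributes a single record, which is why the received volume has to be restored before statistics can be extracted.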
B. Encoding Location Data
To solve Problem 1 of Sect. IV-A, we encode location data. By converting numerical location data (latitude, longitude, timestamp) into categorical location data (hash value), our method makes the noise amount uniform on a message-by-message basis, so the privacy protection strength remains constant even if OT protocol loses messages. Moreover, for a categorical (small-domain) data format, the noise amount needed to satisfy LDP is small [18]. This is because categorical data is more coarse-grained and easier to make indistinguishable than numerical data, i.e., it is easier to satisfy LDP.
As the encoding method, we use QuadKey [19]. Unlike common encodings for location data such as GeoHash, QuadKey assigns a hash value to each tile based on Mercator coordinates rather than latitude and longitude and can represent real-world distances more accurately. Many studies on location data use QuadKey to provide real-world-based services and data analysis [20], [21], [22]. While GeoHash divides the map into 32 segments at each level, QuadKey recursively divides the map into quadrants, allowing fine control over the granularity of the collected location data. In QuadKey, by setting the zoom level
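The mapping from a coordinate to a QuadKey string can be sketched as follows. This is a minimal illustration that follows the publicly documented Bing Maps tile scheme (Web Mercator projection, one base-4 digit per zoom level); the helper names `latlon_to_tile`, `tile_to_quadkey`, and `encode` are hypothetical and do not come from the implementation evaluated in this paper.

```python
import math

MAX_LATITUDE = 85.05112878  # latitude limit of the Web Mercator projection


def latlon_to_tile(lat, lon, level):
    """Map a latitude/longitude pair to tile indices at the given zoom level."""
    lat = max(-MAX_LATITUDE, min(MAX_LATITUDE, lat))
    lon = max(-180.0, min(180.0, lon))
    x = (lon + 180.0) / 360.0
    sin_lat = math.sin(math.radians(lat))
    y = 0.5 - math.log((1.0 + sin_lat) / (1.0 - sin_lat)) / (4.0 * math.pi)
    tiles = 1 << level  # number of tiles per axis at this level
    tile_x = min(tiles - 1, max(0, int(x * tiles)))
    tile_y = min(tiles - 1, max(0, int(y * tiles)))
    return tile_x, tile_y


def tile_to_quadkey(tile_x, tile_y, level):
    """Interleave the bits of the tile indices into one base-4 digit per level."""
    digits = []
    for i in range(level, 0, -1):
        mask = 1 << (i - 1)
        digit = 0
        if tile_x & mask:
            digit += 1
        if tile_y & mask:
            digit += 2
        digits.append(str(digit))
    return "".join(digits)


def encode(lat, lon, level):
    """Categorical (hash-like) key for a coordinate at the given zoom level."""
    return tile_to_quadkey(*latlon_to_tile(lat, lon, level), level)


print(tile_to_quadkey(3, 5, 3))  # "213"
```

Because each additional digit splits a tile into four children, the key at a coarse level is a prefix of the key at any finer level, which is what makes the granularity easy to control.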
C. Adjusting Perturbation
To address Problem 2 in Sect. IV-A, our proposal adjusts the total amount of noise added to each message. By adjusting the total amount of noise according to the proportion of messages that are lost, the method prevents excessive privacy protection.
We now describe the procedure over OT protocol. After encoding the data into a categorical format with QuadKey, every device perturbs its location and creates an OT protocol message for transferring data to the datastore. The devices add noise to the data (hash value) to protect privacy. As the perturbation for the hash value, we use the k-ary randomized response (k-RR) mechanism:
\begin{align*}
\mathcal{R}^{\text{kRR}}(y|v) &= \begin{cases} p = \dfrac{e^{\epsilon}}{e^{\epsilon}+d-1}, & \text{if } y = v \\[2ex] q = \dfrac{1}{e^{\epsilon}+d-1}, & \text{if } y \neq v \end{cases} \tag{1}\\
p' &= \frac{p_{1}}{p_{2}}, \quad q' = \frac{p_{2}-p_{1}}{d-1} \tag{2}
\end{align*}
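Equation (1) can be read as a sampling procedure: keep the true value with probability p, otherwise report one of the other d - 1 values uniformly at random. The following sketch is a hedged illustration of that reading; the function names and the example domain of QuadKey-like strings are assumptions, and the adjusted probabilities of Eq. (2) are not modeled here.

```python
import math
import random


def krr_probabilities(epsilon, d):
    """Return (p, q) of Eq. (1) for privacy budget epsilon and domain size d."""
    denom = math.exp(epsilon) + d - 1
    return math.exp(epsilon) / denom, 1.0 / denom


def krr_perturb(value, domain, epsilon, rng):
    """Report the true value with probability p, else another value uniformly."""
    p, _ = krr_probabilities(epsilon, len(domain))
    if rng.random() < p:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)


rng = random.Random(0)
domain = ["120", "121", "122", "123"]  # hypothetical categorical keys
p, q = krr_probabilities(1.0, len(domain))
print(p, q)
print(krr_perturb("121", domain, 1.0, rng))
```

The ratio p / q equals e^epsilon, which is the indistinguishability bound that the privacy budget is meant to enforce, and p + (d - 1) q = 1 confirms that Eq. (1) defines a valid distribution over the domain.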
However, simply combining k-RR and OT protocol causes Problem 2 of Sect. IV-A. The proposed method thus adjusts the amount of noise added to location data between the devices and the datastore. Specifically, the proposed method perturbs the data of each message with a protection strength adjusted in advance so that the final protection strength matches the target (see Figure 6). To determine the number of messages for the entire data, the method calculates the greatest common divisor (GCD) using the Euclidean algorithm, an efficient method for obtaining the GCD. We derive GCD of
Overview of the perturbation adjustment in our proposed method: to apply LDP on OT protocol, we use the Euclidean algorithm to add noise appropriate to the number of messages in OT protocol.
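For reference, the Euclidean algorithm mentioned above can be written in a few lines. The quantities whose GCD is taken are not recoverable from the text here, so the two integers in the example are purely hypothetical placeholders.

```python
def gcd(a, b):
    """Greatest common divisor via the Euclidean algorithm."""
    while b:
        a, b = b, a % b
    return abs(a)


# Hypothetical inputs, chosen only to show the call.
print(gcd(100, 36))  # 4
```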
D. Aggregation and Over-Sampling
To handle Problem 3 in Sect. IV-A, our method oversamples the received data. Compared to pure LDP, the proposed method receives less data, which makes it difficult to correctly obtain statistical characteristics from the aggregated data as a whole.
To increase the amount of data while maintaining statistical characteristics, we increase the sample size by generating synthetic data. We use the Multi-Label Synthetic Minority Over-Sampling Technique (MLSMOTE) [24] for synthetic data generation, which can be used for multi-label categorical data. In order to prevent oversampling from eliminating characteristics in the data (e.g., large cities have higher population densities, rural areas have lower population densities, etc.), we intentionally synthesize the data while preserving the imbalance in the aggregated data.
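The idea of growing the sample while keeping its imbalance can be illustrated with a much simpler stand-in than MLSMOTE. The sketch below is not MLSMOTE: it only replicates categorical keys in proportion to their observed frequencies (largest-remainder allocation), under the assumption that each received record is a single categorical key. The function name is hypothetical.

```python
from collections import Counter


def oversample_preserving_imbalance(keys, target_size):
    """Grow a list of categorical keys to target_size, keeping their proportions."""
    counts = Counter(keys)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("cannot oversample an empty sample")
    quotas = {k: c * target_size / total for k, c in counts.items()}
    alloc = {k: int(q) for k, q in quotas.items()}
    # Hand the leftover slots to the keys with the largest fractional parts.
    leftover = target_size - sum(alloc.values())
    by_fraction = sorted(counts, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_fraction[:leftover]:
        alloc[k] += 1
    oversampled = []
    for k, n in alloc.items():
        oversampled.extend([k] * n)
    return oversampled


received = ["city"] * 6 + ["suburb"] * 3 + ["rural"] * 1
grown = oversample_preserving_imbalance(received, 100)
print(Counter(grown))
```

A dense tile stays dense and a sparse tile stays sparse after the sample is enlarged, which is the property the paragraph above asks of the synthetic data. MLSMOTE additionally interpolates new samples from neighboring instances, which this stand-in does not attempt.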
Finally, the datastore estimates statistical characteristics from the oversampled location data. As an estimator, we use population density data
\begin{equation*}
f(v, t) = \frac{1}{h_{t} h_{S}} \sum_{i=1}^{I} K_{t}\left[\frac{t-t_{i}}{h_{t}}\right] K_{S}\left[\frac{v-v_{i}}{h_{S}}\right] \tag{3}
\end{equation*}
The kernel functions
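A direct transcription of Eq. (3) is sketched below. Because the description of the kernel functions is cut off in the text, the sketch assumes Gaussian kernels for both K_t and K_S and treats the location v as a single scalar coordinate; both choices, as well as the function names, are assumptions made only to obtain a runnable example.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian kernel (an assumed choice for both K_t and K_S)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def density(v, t, samples, h_s, h_t):
    """Evaluate Eq. (3) at location v and time t.

    samples is an iterable of (v_i, t_i) pairs; h_s and h_t are the
    spatial and temporal bandwidths.
    """
    total = 0.0
    for v_i, t_i in samples:
        total += gaussian_kernel((t - t_i) / h_t) * gaussian_kernel((v - v_i) / h_s)
    return total / (h_t * h_s)


# Hypothetical oversampled records: (scalar location, timestamp).
records = [(0.0, 0.0), (0.2, 0.1), (3.0, 0.0)]
print(density(0.1, 0.0, records, h_s=1.0, h_t=1.0))
```

As in Eq. (3), the sum is not divided by the number of samples I, so the value scales with the sample size and reflects an absolute density rather than a normalized probability density.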
Experimental Evaluation
The experimental evaluation measures four aspects: privacy protection, accuracy of statistics collection, robustness to poisoning attacks, and overhead (execution time and throughput).
A. Implementation
We implement all programs on an ASRockRack 3U8G+/C621E workstation with a 40-core Intel Xeon Gold 6230 processor at 2.10 GHz and 262 GB of RAM; the host OS is Ubuntu 18.04 LTS. As the implementation environment, we use Docker, an OS-level virtualization platform, to simulate data collection between 100 clients and one datastore. The 100 containers that play the role of clients (device containers) were built with the mem_limit option provided by docker-compose to limit their memory to 2 GiB. Device containers send data to the datastore container using sockets.
As the cryptographic primitive in OT protocol, we adopt Pedersen’s Commitment [25] as a public key scheme, which is a computationally secure public key exchange scheme. The client sends the message sequence
B. Privacy Evaluation
The noise amount depends on the privacy budget and the perturbation probability. For privacy evaluation, we therefore measure the perturbation probability
Figure 7 shows the measured
Box plots show the probability
Next, we describe the difference among each
C. Approximation Accuracy of Statistics Collection
We evaluate whether the location data collected by the proposed method loses its statistical characteristics due to privacy protection. Because the proposed method combines several privacy protections and is therefore more private, preserving statistics is more difficult than with pure LDP. This experiment evaluates the approximation accuracy of the proposed method against pure LDP.
For the evaluation, we need to set up a specific task using the location data. A popular use of location data in urban development and marketing is the estimation of population density. This study also assumes the use of location data for population density estimation and evaluates whether the collected data can be used for that purpose. As the real dataset, we use the Dynamic population distribution dataset for Helsinki Metropolitan Area [30]. In addition, we use Power-law and Uniform datasets as synthetic datasets. Since the real dataset contains correlations and geographic features, we also validate our proposal on synthetic datasets, which have no correlations or geographic features.
Table 1 shows the approximation accuracy on each dataset. The closer the accuracy is to that of pure LDP, the better the proposed method is at collecting statistics. The overall tendency is that the accuracy is almost linearly proportional to
D. Attack Evaluation
In this section, we evaluate the robustness of pure LDP and our proposal to poisoning attacks. For our evaluation, we use the uniform dataset in Sect. V-B.
1) Impact of IPA
Here, we evaluate the robustness of pure LDP and our proposal against IPA. By measuring the accuracy of statistic collection when the percentage of adversaries is set to [0%, 25%, 50%, 75%, 95%], we investigate how much pure LDP and the proposed method deteriorate under IPA. The experimental setup is similar to Sect. V-B, with 100 client-role containers sending data to the datastore. Out of these 100 devices, a certain percentage [0%, 25%, 50%, 75%, 95%] are adversaries that attempt to distort the statistics via IPA. Each adversary sends 10 times more data than a benign device.
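A back-of-the-envelope calculation shows why this setup stresses pure LDP more than a volume-limited collection. The sketch below computes the share of received records that originate from adversaries, assuming that pure LDP keeps every record and that a 1-out-of-n collection keeps one record per device; it is an illustration of the setup only, not a model of the measured accuracy, and the function names are hypothetical.

```python
def adversarial_share_pure_ldp(n_devices, adversary_ratio, benign_records, amplification):
    """Share of received records sent by adversaries when every record is kept."""
    n_adv = round(n_devices * adversary_ratio)
    n_benign = n_devices - n_adv
    adv_records = n_adv * benign_records * amplification
    ben_records = n_benign * benign_records
    return adv_records / (adv_records + ben_records)


def adversarial_share_one_per_device(n_devices, adversary_ratio):
    """Share of received records sent by adversaries when one record per device is kept."""
    n_adv = round(n_devices * adversary_ratio)
    return n_adv / n_devices


for ratio in (0.0, 0.25, 0.50, 0.75, 0.95):
    pure = adversarial_share_pure_ldp(100, ratio, 100, 10)
    limited = adversarial_share_one_per_device(100, ratio)
    print(f"{ratio:.2f}  pure LDP: {pure:.3f}  one per device: {limited:.3f}")
```

With 25% adversaries and a tenfold amplification, roughly 77% of the records kept by pure LDP come from adversaries, whereas the share stays at 25% when each device contributes a single record.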
Figure 8 (a) to (d) show the accuracy of population prediction in pure LDP under IPA, and Figure 8 (e) to (h) show that of the proposed method. Each percentage represents the ratio of adversaries; the darker the color, the higher the percentage of adversaries. For instance, 25% for the Uniform dataset means that 25 out of 100 users are adversaries. Since pure LDP is not equipped with any mechanism to downsample the received data, its accuracy is strongly affected by IPA, which results in severe degradation. For instance, in Figure 8 (a) to (d), the accuracy is reduced to between 0.2 and 0.4 for all
The bar plot shows the accuracy of statistic collection in pure LDP (upper columns) and the proposed method (lower columns) for each privacy budget under IPA.
2) Impact of OPA
The containers used in the experiment and the ratios of adversaries are exactly the same as in the IPA experiment. Out of these 100 device-role containers, a certain percentage [0%, 25%, 50%, 75%, 95%] are adversaries that attempt to distort the statistics via OPA. Unlike in IPA, the adversary intentionally spoofs data values to manipulate the distribution at an arbitrary location. In this experiment, the adversary sends 10 times as much data as in the IPA.
Figure 9 (a) to (d) show the accuracy of population prediction under OPA in pure LDP, and Figure 9 (e) to (h) show that of the proposed method. The higher the accuracy on the vertical axis, the more intrinsic value is preserved on the datastore side, and the higher the privacy budget, the more intrinsic value is preserved. As in Figure 8 in Sect. V-D1, each percentage represents the ratio of adversaries; the darker the color, the higher the percentage of adversaries. Pure LDP is not equipped with any mechanism to limit the data transmission rate, so its accuracy is strongly affected by OPA, which results in severe degradation. In Figure 9 (a) to (d), the accuracy is reduced to between 0.2 and 0.4 for all privacy budgets, especially when the ratio of adversaries exceeds 50%, indicating that the statistics are not preserved in the datastore. In OPA, the adversary intentionally spoofs data to a specific value, so the degradation is more severe than in IPA. In contrast to pure LDP, the proposed method limits the data sent per device. Even if an adversary spoofs and sends values, the impact is only proportional to the number of adversaries, since the data volume that can be sent is severely limited. Therefore, our method can preserve statistics as long as the ratio of attackers is not extremely high. In Figure 9 (e) to (h), even if the ratio of adversaries exceeds 50%, our method preserves half of the statistics compared to the case with no adversaries.
The bar plot shows the accuracy of statistic collection in pure LDP (upper columns) and the proposed method (lower columns) for each privacy budget under OPA.
E. Overhead Measurement
In our method, the execution time and the throughput vary greatly depending on the amount of data and the size of the OT protocol message. Depending on these overheads, the proposed method may be difficult to apply where real-time data collection and acquisition of statistical characteristics are required, or on power-constrained devices such as IoT and smart devices. To validate the overhead and discuss the performance, we measure the execution time and throughput.
1) Execution Time
To verify the practicality of the proposed method in population density estimation, we evaluate the execution time. Naturally, the execution time varies depending on the size of the message
The box plots in Figure 10 show the execution time distribution of the proposed method for each
Box plot shows the execution time of the proposed LDP for different window sizes
2) Throughput
In the normal TCP protocol, how much data a datastore can receive at one time depends on the size of the data. However, in the proposed method (OT protocol), the data volume received by the datastore is always constant no matter how much data the device sends, so the throughput is expected to be constant. As the amount of data processed per unit of time in the datastore, we also measure the receive throughput from the execution time and the amount of data (bits) communicated between the two types of Docker containers (device-role containers and the datastore-role container). The bar plot in Figure 10 shows the throughput for each
However, contrary to the intended design of the proposed method, the throughput is not constant. Figure 10 indicates that there is a slight proportional relationship between
Discussion
Finally, we discuss the comparison with related work, out-of-scope applicability of this research, and our limitations.
A. Other Application
This study decouples statistical characteristics from data volume by running LDP over OT protocol. Attacks on LDP [6], [7] are carried out in various ways, such as generating malicious raw data, modifying the LDP process or parameters, and amplifying data. Our proposal may be valid against these attacks. Furthermore, the proposed method provides a guideline showing that data collection is possible even between mutually distrustful parties. In our trust model, the device does not trust third parties, including the datastore, and thus does not expose any of its original data outside the device. Because of possible attacks on LDP, the datastore likewise does not trust the device to send correct data. In other words, the proposed method achieves data collection between mutually distrustful parties. To the best of our knowledge, no other study has achieved this. Applying LDP over OT protocol may therefore be a solution to these problems in future data collection. Also, the proposed method is not suitable for obtaining precise location information, but it is suitable for collecting landmark-based trajectories (e.g., visiting the Eiffel Tower from Charles de Gaulle airport via the Arc de Triomphe). Landmark-based data collection has been studied mainly for congestion, event, location verification, and disaster forecasting [35], [36], [37]. The proposed method, which collects categorical locations, is considered to have a high affinity with these applications.
B. Limitation
The limitations are the integrity of the privacy budget and verification of input data. If the device spoofs the privacy budget after the connection is established, the datastore cannot meet strict LDP. This is very dangerous because it can lead to unintended privacy leaks. Moreover, if the device spoofs the input data to the privacy mechanism, the datastore no longer decouples statistical characteristics from data volume. Spoof detection of input data has long been considered a difficult problem, but it is also necessary in this study.
There may also be cases other than our assumed adversary pattern. In this paper, we assumed that the proportion of adversaries is constant, but in reality it may change as participants join or leave the data collection. An increase in the proportion of adversaries could also significantly distort the statistics at any given moment. Although we design the proposed method with the 1-out-of-n OT protocol, it may be possible to handle such cases by adjusting the number of messages to be lost, for example, according to the increase or decrease in the number of adversaries. In that case, however, it would be necessary to design a new function that dynamically adjusts the noise amount as the number of lost messages changes.
Conclusion and Future Work
In this study, we designed and implemented LDP over OT protocol to decouple statistical characteristics from data volume. We showed through experimental evaluation that our proposal is robust to data poisoning attacks on LDP. Our experimental evaluation reveals three facts: (1) The proposed method can extract statistics with higher accuracy than pure LDP, even when strong privacy budgets are set. (2) Pure LDP is vulnerable to both IPA and OPA, but the proposed method is robust and can preserve statistics of location data with high accuracy. (3) The overhead (execution time/throughput) of the proposed method is acceptable for population density estimation from mobile terminals, judging from a comparison with the cited references.
Next, we summarize our method from the viewpoints of security, sustainability, and efficiency. From a security perspective, we mainly discuss whether the proposed method leaks privacy. Since the proposed method uses only sufficiently secure ciphers, there is no risk of decryption during the OT protocol process. The received data is also authenticated by Bulletproof, so no intermediary can falsify the data during the OT protocol steps. Next, we discuss sustainability, i.e., how far the method can respond to changes in devices and datastores. The proposed system can continuously collect data as long as the target device is not damaged. Although the capacity of the datastore is limited and data must be discarded when the volume exceeds a certain level, this poses no problem from the standpoint of sustainability. Finally, we summarize the efficiency. In the proposed data collection, the datastore is never idle as long as data is sent from the devices. The burden on the devices is also small, consisting only of pre-processing to convert the data into a gridded structure and down-sampling by OT protocol. However, it would also be possible to dynamically control the data volume sent by devices as they move, which leaves room for further efficiency improvements.
Finally, we describe some interesting and important directions for future work. It is known that location data does not satisfy strict LDP if data is continuously published. We can therefore consider replacing the privacy mechanism used for location data. Experiments using actual mobile devices instead of a Docker environment would also be of great value and interest. On actual mobile devices, delays occur depending on the throughput of the privacy mechanism, and these delays also affect the accuracy of analysis on the datastore. We also suspect that this proposal has other applications in privacy research beyond population density estimation that could be investigated.