
Secure Cloud-Aided Approximate Nearest Neighbor Search on High-Dimensional Data



Abstract:

As a fundamental data-mining problem, ANN (approximate nearest neighbor) search is widely used in many industries, including computer vision, information retrieval, and recommendation systems. LSH (locality-sensitive hashing) is one of the most popular hash-based approaches to ANN problems. However, the efficiency of running LSH needs to be improved, as its operations often involve resource-consuming matrix computations over high-dimensional large-scale datasets; for resource-constrained devices, this problem is even more serious. One way to handle it is to outsource the heavy computation on high-dimensional large-scale data to cloud servers. However, when the cloud server responsible for the computing tasks is untrustworthy, security issues may arise. In this study, we propose a cloud server-aided LSH scheme and an accompanying application model. The scheme performs LSH efficiently with the help of a cloud server while guaranteeing the privacy of the client’s information. In addition, to identify improper behavior of the cloud server, we provide a verification method to check the results it returns. For the deployment of the scheme on resource-constrained devices, we also propose a model for its practical application. To verify the efficiency and correctness of the proposed scheme, theoretical analysis and experiments are conducted; their results indicate that the proposed scheme is correct, verifiable, secure, and efficient.
Published in: IEEE Access ( Volume: 11)
Page(s): 109027 - 109037
Date of Publication: 02 October 2023
Electronic ISSN: 2169-3536

SECTION I.

Introduction

As illustrated in [1], ANN (approximate nearest neighbor) search on high-dimensional datasets is a basic computation in many applications. The purpose of ANN search is to find the approximate nearest neighbor vectors of a query in a given dataset. Because ANN performs better on high-dimensional large-scale data than other methods, it has been applied in various high-dimensional large-scale fields, including product recommendation [2], image retrieval [3], and clustering [4]. However, existing exact algorithms are only efficient when the data dimensionality is small [5], [6]. As the dimensionality increases, these exact algorithms may even become less efficient than a brute-force linear scan due to the curse of dimensionality [7]. Nonetheless, many ANN problems involve high-dimensional large-scale data; for example, many large multimedia retrieval applications rely on ANN search and require it to be processed efficiently [8].

Moreover, with the continuous growth of the Internet of Things (IoT) and individual devices, the implementation of ANN on resource-constrained devices has become increasingly common. However, due to the limited computational power of these devices, they cannot perform the training and large-scale ANN search processes independently. Outsourcing the ANN search tasks to cloud servers can alleviate the resource constraints of edge devices, but this raises security and verification concerns. The sensitive data collected by resource-constrained devices must be securely outsourced to the cloud server, and a verification mechanism is required to identify any potential improper behavior from the cloud server. Additionally, to ensure the efficiency of the outsourcing and verification mechanism, they should be designed to avoid complex and time-consuming operations.

In light of these challenges, the motivation behind our research is to develop a secure and efficient outsourcing mechanism, the “outsourced LSH algorithm,” specifically tailored for ANN on high-dimensional large-scale data. This mechanism aims to enable resource-constrained devices to securely and efficiently perform the ANN task while preserving the privacy of the client’s information and providing a lightweight verification mechanism with a high level of certainty. By addressing these issues, we aim to enhance the applicability and performance of ANN in various high-dimensional large-scale fields, promoting the advancement of data mining and related domains.

In the rest of this paper, we present our contributions and the proposed outsourced LSH algorithm, followed by a thorough analysis and evaluation of its correctness, security, and efficiency. We also demonstrate its potential application in real-world scenarios, concluding with future research directions.

The contributions can be summarized below:

  • First, we propose a secure outsourcing mechanism (outsourced LSH algorithm) for ANN problems on high-dimensional large-scale data. This mechanism enables resource-constrained devices to complete the ANN task securely and efficiently.

  • Second, to ensure the privacy of the resource-constrained client’s information, we use a lightweight arithmetic method to obscure the client’s information. Meanwhile, to detect improper behavior of the cloud server, we build a lightweight verification mechanism that succeeds with probability 1.

  • Third, we analyze the correctness, security, and efficiency of the proposed scheme through detailed mathematical analysis. To evaluate its efficiency and correctness empirically, we implement the proposed method with Python 3.7 on the CIFAR-10 dataset. The experimental results show that our proposed scheme is correct and efficient.

  • Fourth, to illustrate the application of the proposed scheme, we design an architecture of the application model and demonstrate the potential application fields and the method to apply the proposed scheme.

The remainder of this paper is organized as follows: Section II provides a literature review of LSH, secure outsourcing computation, and cloud/edge computing. Section III gives a brief introduction to LSH and secure outsourcing computation. Section IV presents the detailed scheme and an application model for the proposed scheme. Section V performs a detailed theoretical analysis of the correctness, security, and efficiency of the proposed scheme. Section VI evaluates the efficiency and correctness of the proposed scheme through experiments and reports the results. Section VII uses an example to show the application of the proposed scheme. Finally, Section VIII concludes the study and outlines future work.

SECTION II.

Related Work

A. Machine Learning and LSH

In this information era, with the development of edge devices and networks, high-dimensional data have exploded on the Internet, and with them the need for fast ANN search in many fields. However, such data also pose a great challenge to fast ANN search because of their volume and high dimensionality. ANN search is one of the most popular approaches to nearest neighbor search because of its great benefit in computational speed and storage reduction [9]. Among the many ANN search methods, LSH (locality-sensitive hashing) is one of the most popular for high-dimensional data [10]. Various LSH variants have been proposed to find approximate nearest neighbors in different fields [11], [12], [13], [14], [15], [16], [17]. For instance, Gan et al. [16] in 2012 presented a new LSH scheme, Collision Counting LSH (C2LSH), which constructs “dynamic” compound hash functions; with a base of m single LSH functions, this method improves the query quality of c -approximate NN search. In 2009, Kulis and Grauman [17] proposed a method to adapt LSH to arbitrary kernel functions, which allows sub-linear time approximate similarity search.

There is also a lot of research devoted to improving the accuracy and efficiency of LSH models [1], [11], [12], [13], [15]. For example, Datar et al. [18] proposed a novel LSH scheme based on p -stable distributions to improve the performance of LSH. Liu et al. [12] presented an incremental search-based c -ANN method to improve I/O performance. Tao et al. [11] developed an LSB-tree access method to improve the efficiency of ANN search.

B. Secure Outsourcing Computations

Secure outsourcing schemes for scientific computing have been studied extensively. Atallah et al. proposed the first generic architecture for outsourcing scientific computations in 2002 [19]; however, that architecture lacks a mechanism to verify the results computed by the cloud servers. To solve large-scale systems of linear equations with the help of a fully malicious cloud server, Chen et al. [20] designed a secure outsourcing method in 2015. They utilized several special sparse matrices to obscure the input and output data, and they showed experimentally that the client can detect the improper behavior of cloud servers with probability 1. In 2018, Salinas et al. proposed a secure outsourcing approach for solving large-scale sparse linear systems of equations, one of the most common and fundamental problems in big data [21]. This method focuses not only on efficiency improvement but also on memory I/O complexity; they demonstrated that the memory I/O issue is important in making an outsourcing method practical.

Much research has been done to free resource-constrained devices (UAVs, personal phones, etc.) from complex encryption operations [22], [23], [24], [25], [26]. For example, to free IoT devices from solving quadratic congruences, Zhang et al. proposed two practical and secure outsourcing algorithms [27]. Their method outsources the heavy computation to a single cloud server and guarantees that the IoT device can identify improper behavior with probability 1; they also used the Rabin algorithm as an example of applying the method in IoT applications. To address the problem of using untrusted cryptographic helpers, such as cloud servers, Hohenberger et al. provided a formal security definition for outsourcing computations to an untrusted helper [24]. They also provided a framework to quantify the efficiency and verifiability of an outsourcing method, using outsourced modular exponentiation as an example of how to securely outsource computations. To alleviate the local burden of secret-key updates, Yu et al. proposed a secure outsourced key-update method [28]. To verify the outsourced key updates from the authorized party, they used a third-party auditor (TPA) to audit the storage and resist key exposure.

C. Cloud/Edge Computing

Cloud computing brings a new way for resource-constrained devices to perform complex computations [29], [30]. However, because of the requirement for data privacy, how to guarantee privacy when conducting cloud computing has become a critical issue. Much research has been done to guarantee privacy in cloud computing, including security in storage [31], [32], [33] and security in computation [20], [34], [35].

There is a great deal of research on secure outsourcing architecture design. For example, in 2020, Lin et al. [36] focused on outsourcing bilinear pairing computation, the most time-consuming operation in pairing-based cryptographic protocols, and designed a new outsourcing scheme for blockchain devices. To solve quadratic congruence problems, a widely used operation in cryptographic protocols, on resource-constrained devices, Zhang et al. [27] presented two practical and secure outsourcing algorithms. The proposed algorithms enable IoT devices to outsource the heavy computation of solving quadratic congruences to a single cloud server without leaking the privacy of the computation, and they can identify any improper behavior of the cloud server with probability 1. In 2016, Ye et al. [37] presented a secure outsourcing algorithm for modular exponentiation based on a new mathematical division under the setting of two non-colluding cloud servers; the privacy of the outsourced data is guaranteed and the efficiency is improved.

There has also been much research on securely outsourcing scientific computations to cloud servers. To solve modular exponentiation problems on IoT devices, Li et al. [38] proposed the first distributed outsourcing scheme for modular exponentiation in 2020. With this scheme, the user can not only protect privacy during the outsourcing process but also identify invalid results from edge nodes with high probability. To solve modular inversion problems on resource-constrained clients, Tian et al. [39] presented a new unimodular matrix transformation technique in 2022. This is the first secure outsourcing method that supports arbitrary and variable moduli, and it can verify the correctness of the cloud server’s results with optimal probability 1 in a single round.

SECTION III.

Preliminaries

In this section, we give a brief introduction to the LSH algorithm and secure outsourcing computation.

A. Locality-Sensitive Hashing

LSH (locality-sensitive hashing) is an ANN (approximate nearest neighbor) search algorithm. Its advantage is that it allows us to concentrate only on similar objects in large-scale data. A naive nearest neighbor search traverses every item of the given data to find the nearest neighbors. However, with the development of technology and the continuous improvement of edge devices, high-dimensional large-scale datasets are becoming more and more common, and their volume keeps growing. If the naive search model is used on such datasets, the time consumption is unacceptable. In this situation, the LSH algorithm shows its powerful advantages.

The basic design idea of LSH is as follows. First, we build a number of different hash functions to compute multiple hash values. Each hash function is carefully designed to capture the approximate similarity of the data in one aspect: if items are similar in this aspect, they are likely to land in the same hash bucket of this hash function. This way, to search for similar items, we only need to check the candidates that fall into the same bucket as the query. Meanwhile, to capture similarity from various aspects, multiple hash functions can be generated, one per aspect.

LSH was originally designed to solve the (R, c) -NN problem; that is, given a query q , the algorithm performs the following:

If there exists o\in \mathcal {D} such that dist (o,q)\le R , then return an object o^{\prime }\in \mathcal {D} satisfying dist (o^{\prime },q)\le cR .

If dist (o,q)>R for all o\in \mathcal {D} , then return nothing.

The basic principle of LSH design is that objects that are similar (closer than a given distance) are hashed into the same bucket with high probability. Meanwhile, to support efficient (R, c)-NN search, LSH typically uses a family of hash functions H that are random and independent of each other. This LSH family is used to build multiple hash tables. To raise the similarity requirement, items are considered highly similar only if they fall into the same bucket in multiple hash tables.

For a domain \mathbb {R}^{d} of the item set, a hash function in the LSH family is a mapping from \mathbb {R}^{d} to U . The locality sensitivity of an LSH family can be measured as follows:

Definition 1:

For any constant c>1 and probabilities p_{1}>p_{2} , an LSH family \mathcal {H}=\{h:\mathbb {R}^{d}\to U\} is called (r,cr,p_{1},p_{2}) -sensitive if any u,v\in \mathbb {R}^{d} satisfy [1]

  • {\mathrm {Pr}}_{\mathcal {H}} [h\left ({u }\right)=h(v)]\ge p_{1} when dist (u,v)\le r ; and

  • {\mathrm {Pr}}_{\mathcal {H}} [h\left ({u }\right)=h(v)]\le p_{2} when dist \left ({u,v }\right)>cr .

For a locality-sensitive family to be useful under the distance measure dist (\cdot,\cdot) , the parameters must satisfy the inequalities p_{1}>p_{2} and r<cr .

In terms of measuring item distance in LSH, many LSH families with different distance metrics have been proposed [18], [40], [41]. The most classic one is the p -stable distribution LSH family \mathcal {H} proposed by Datar et al. [18] in 2004 for l_{p} metrics. Its hash function is:\begin{equation*} h_{a,b}\left ({o }\right)=\left \lfloor{ \frac {a^{T}o+b}{w} }\right \rfloor \tag{1}\end{equation*}

In eq. (1), a is a d -dimensional vector whose entries are drawn independently from a p -stable distribution, o\in \mathbb {R}^{d} is the data vector of a given item, w is the specified bucket width, and b is a random variable drawn uniformly from [0,w] .

For example, when the Euclidean distance dist \left ({u,v }\right)=s is used to measure the distance between items u and v , under a hash function h_{a,b}\in \mathcal {H} the probability that u and v fall into the same bucket can be computed as [18]:\begin{equation*}p\left ({s }\right)=\Pr \left [{ h_{a,b}\left ({u }\right)=h_{a,b}\left ({v }\right) }\right]=\int _{0}^{w} {\frac {2}{s}\varphi \left({\frac {t}{s}}\right)\left({1-\frac {t}{w}}\right)dt},\end{equation*}

where \varphi (\cdot) is the density function of the standard normal distribution \mathcal {N}(0,1) . For a fixed w , p(s) decreases monotonically in s . Thus, according to the definition, the family of hash functions h_{a,b} is (r,cr,p(r),p(cr)) -sensitive.
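As a quick numerical sanity check of this collision probability (a sketch with illustrative parameters, not the paper’s settings), the following Python snippet compares a Monte Carlo estimate of \Pr [h_{a,b}(u)=h_{a,b}(v)] against the integral above for the Gaussian ( p=2 ) case.

```python
import numpy as np

# Monte Carlo check of the collision probability p(s) for the Gaussian (p = 2)
# case of eq. (1). All parameter values here are illustrative only.
rng = np.random.default_rng(0)
d, w, s, trials = 64, 4.0, 2.0, 100_000

direction = rng.normal(size=d)
u = rng.normal(size=d)
v = u + s * direction / np.linalg.norm(direction)   # dist(u, v) = s

collisions = 0
for _ in range(trials):
    a = rng.normal(size=d)        # entries from the 2-stable (standard normal) distribution
    b = rng.uniform(0.0, w)
    collisions += np.floor((a @ u + b) / w) == np.floor((a @ v + b) / w)
print("empirical  p(s) ~", collisions / trials)

# Closed-form expression from Datar et al. [18], phi = standard normal density:
#   p(s) = integral_0^w (2/s) * phi(t/s) * (1 - t/w) dt
t = np.linspace(0.0, w, 10_001)
phi = np.exp(-0.5 * (t / s) ** 2) / np.sqrt(2.0 * np.pi)
print("analytical p(s) ~", np.trapz((2.0 / s) * phi * (1.0 - t / w), t))
```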

To make (R, c)-NN search more efficient, the LSH algorithm typically uses multiple hash functions to build multiple hash tables representing different aspects. These hash functions are randomly drawn from the (R,cR,p(R),p(cR)) -sensitive hash family. The hash value of each item is computed by every hash function, and an item is considered similar to the query item q only when it is hashed into the same bucket as q under every hash function. In this way, the selected candidates must be similar in multiple aspects, which increases the rigor and credibility of the similarity. A minimal index built along these lines is sketched below.
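As a concrete illustration of this multi-table construction, the following Python/NumPy sketch builds a small p -stable LSH index using the hash function of eq. (1). The table count, bucket width, and toy dataset are illustrative assumptions, and candidates are collected from items colliding with the query in at least one table, a common engineering choice; it is not the paper’s implementation.

```python
import numpy as np
from collections import defaultdict

class PStableLSH:
    """Minimal multi-table LSH index built on the hash family of eq. (1)."""

    def __init__(self, dim, num_tables=8, hashes_per_table=4, w=4.0, seed=0):
        rng = np.random.default_rng(seed)
        # One (hashes_per_table x dim) projection matrix per table, entries
        # drawn from the 2-stable (standard normal) distribution.
        self.A = rng.normal(size=(num_tables, hashes_per_table, dim))
        self.b = rng.uniform(0.0, w, size=(num_tables, hashes_per_table))
        self.w = w
        self.tables = [defaultdict(list) for _ in range(num_tables)]

    def _keys(self, x):
        # eq. (1): floor((a^T x + b) / w), one compound key per table
        h = np.floor((self.A @ x + self.b) / self.w).astype(int)
        return [tuple(row) for row in h]

    def index(self, data):
        self.data = np.asarray(data)
        for idx, x in enumerate(self.data):
            for table, key in zip(self.tables, self._keys(x)):
                table[key].append(idx)

    def query(self, q, k=5):
        # Candidates: items colliding with q in at least one table,
        # re-ranked by exact Euclidean distance.
        cand = {i for table, key in zip(self.tables, self._keys(q))
                for i in table.get(key, [])}
        cand = sorted(cand, key=lambda i: np.linalg.norm(self.data[i] - q))
        return cand[:k]

# Toy usage: 1,000 random 128-dimensional points
rng = np.random.default_rng(1)
points = rng.normal(size=(1000, 128))
lsh = PStableLSH(dim=128)
lsh.index(points)
print(lsh.query(points[0] + 0.01 * rng.normal(size=128)))
```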

B. Secure Outsourcing Computation

One purpose of secure outsourced computing is to enable resource-constrained devices to solve complex computing problems with the help of cloud servers, without leaking the client’s private information to the cloud server in the process. There is currently a lot of related research in this area [20], [34], [35]. Here, we adopt the properties that most secure outsourcing schemes need to satisfy, as proposed by Gao et al. [42]:

  1. Security: This property requires that when computing tasks are outsourced to the cloud server, the cloud server cannot obtain any private information from the input/output data.

  2. Verifiability: To guarantee the correctness of the cloud server’s computing results, the client or a third-party authority needs to check the correctness of the outsourced computing results returned by the cloud server.

  3. Efficiency: Because the purpose of outsourced computing is to allow resource-constrained clients to perform complex computational tasks, the client-side cost of outsourcing should be lower than the cost of performing the computation locally.

SECTION IV.

Proposed Scheme

In this section, we first propose an application model for the outsourced computation of LSH . Then the concept and basic principle of the scheme are expounded. Finally, the overall proposed scheme is explained in detail.

A. Application Model

The proposed application model of outsourced LSH is shown in Fig. 1. To outsource the LSH algorithm to the cloud server, the database server first generates a Secret Key (SK), which is used to obscure the raw data. Using SK, the database server’s data is obscured from (X,W) to (X^{\prime },W^{\prime }) , where X is the raw data and W is the projection matrix. The obscured data is then transferred to the cloud server, which calculates multiple {\mathrm {HashTable}}^{\prime} with multiple hash functions and returns them to the database server. The database server verifies the correctness of each {\mathrm {HashTable}}^{\prime} through the proposed verification mechanism; after passing verification, it recovers the final HashTable using SK, otherwise it rejects {\mathrm {HashTable}}^{\prime} . On the database server, a query item \mathrm {query\_{}x} is hashed as W\,\mathrm {query\_{}x}^{T} to generate its \mathrm {query\_{}hash\_{}value} , and a fast ANN query is then performed against the HashTable to retrieve the nearest neighbors. In the multi-party ANN search application, each client sends its query data \mathrm {query\_{}x} to the database server, which performs an outsourced fast multi-party ANN search over this database and returns the nearest neighbors to the client.

FIGURE 1. Application model.

B. Concept and Basic Principle

The basic principle of our scheme is to enable resource-constrained devices, such as personal phones, edge devices, and UAVs, to perform fast ANN search. To achieve this goal, we use a secure outsourced computing mechanism to complete the computation task of LSH. In this study, we take the classic LSH algorithm, whose hash function is shown in eq. (1), as the base algorithm. To outsource it, we first analyzed the most computationally time-consuming part of the formula and found that it lies in the a^{T}o calculation. Moreover, an efficient ANN query generally requires computing a hash family (multiple hash functions), that is, multiple a^{T}o operations. Therefore, we outsource the a^{T}o computation to the cloud server. To keep the client-side cost of outsourcing below that of local computation, the data is obscured with a series of elementary rotation matrices, chosen after considering a variety of obscuring algorithms and consulting the relevant literature. At the same time, to ensure the correctness of the cloud server’s computing results, we provide a verification mechanism.

C. Detailed Scheme

We present the detailed procedure in Algorithm 1. The client first generates the projection matrix W locally, which is used to form the hash values of the raw data. Then, to obscure the local data X and W , the client generates a series of elementary rotation matrices: l \left ({1 < l\ll \min \left ({m,d }\right) }\right) m\times m Givens matrices P_{1},P_{2},\cdots,P_{l} and l d\times d Givens matrices Q_{1},Q_{2},\cdots,Q_{l} , together with a random real number \alpha , as the secret key. The secret key is used to obscure the data, as shown in eq. (4) and eq. (5), and the obscured data is uploaded to the cloud server to compute the hash tables of the obscured data.

Algorithm 1 The Proposed Algorithm

Input:

X\in \mathbb {R}^{m\times d} , a large-scale data matrix; W\in \mathbb {R}^{t\times d} , t projection vectors that conform to p-stable distributions.

Output:

Verification result v ; \mathrm {HashTable}\in \mathbb {R}^{t\times m} , t = \mathrm {rows\times banks} .

Step 1.

\mathbf {KeyGen}(\lambda)\to (\mathrm {SK}) :

  • The client chooses a random real number \alpha and 2l elementary rotation matrices: P_{1},P_{2},\cdots P_{l},Q_{1},Q_{2},\cdots Q_{l} as the secret key SK .\begin{equation*} \mathrm {SK}=\{\alpha,P_{1},P_{2},\cdots P_{l},Q_{1},Q_{2},\cdots Q_{l}\} \tag{3}\end{equation*}


Step 2.

{\mathbf {ComDis}}_{SK}(X,W)\longrightarrow (X^{\prime },W^{\prime }) :

  • Client computes X^{\prime } as:\begin{align*} X^{\prime }&=P_{1}P_{2}\cdots P_{l}(\alpha X)Q_{1}Q_{2}\cdots Q_{l} \tag{4}\\ W^{\prime }&=(\alpha W)Q_{1}Q_{2}\cdots Q_{l} \tag{5}\end{align*}


Step 3.

\mathbf {Compute}(X^{\prime },W^{\prime })\longrightarrow ({\mathrm {HashTable}}^{\prime}) :

  • The cloud server receives the obscured matrix X^{\prime } and the obscured projection matrix W^{\prime } , which is formed from t projection vectors that conform to p-stable distributions.

  • Computes {\mathrm {HashTable}}^{\prime} as follows:\begin{equation*} \text {HashTable}^{\prime }=W^{\prime }{X^{\prime }}^{T},\quad W^{\prime }\in \mathbb {R}^{t\times d} \tag{6}\end{equation*}

Step 4.

\mathbf {Verify}({\mathrm {HashTable}}^{\prime})\to v :

  • The client randomly generates a vector r\in \mathbb {R}^{1\times t} and calculates the local V_{1} as:\begin{equation*} V_{1}=(rW^{\prime }){X^{\prime }}^{T} \tag{7}\end{equation*}

  • The client calculates V_{2} from r and {\mathrm {HashTable}}^{\prime} , as shown in eq. (8). The client then checks whether V_{1} equals V_{2} : if they are equal, it accepts {\mathrm {HashTable}}^{\prime} ; otherwise, it rejects {\mathrm {HashTable}}^{\prime} .\begin{equation*} V_{2}=r\,\text {HashTable}^{\prime } \tag{8}\end{equation*}

Step 5.

\mathbf {Recover}(SK,{\mathrm {HashTable}}^{\prime})\to (\mathrm {HashTable}) :\begin{align*} \text {HashTable} =\left \lfloor{ \frac {\frac {1}{\alpha ^{2}}(\text {HashTable}^{\prime }P_{1}P_{2}\cdots P_{l})+b}{\mathrm {num\_{}hash}} }\right \rfloor \tag{9}\end{align*}


The elementary rotation matrix (also known as the Givens matrix) is chosen to obscure the data because it is a sparse matrix with most elements equal to 0, so the computational burden it adds is much smaller than that of a general dense matrix multiplication; its form is shown in eq. (2). The elementary rotation matrix T(i,j) is constructed as follows:

  • Let c=\cos \theta and s=\sin \theta .

  • Set the i -th and j -th main diagonal elements to c .

  • Set the other main diagonal elements to 1.

  • Set the element in the i -th row and j -th column to s .

  • Set the element in the j -th row and i -th column to -s .

  • Set all other elements to 0.\begin{equation*} T\left ({i,j }\right)=\begin{bmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & c & \cdots & s & & \\ & & \vdots & \ddots & \vdots & & \\ & & -s & \cdots & c & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{bmatrix} \tag{2}\end{equation*}

    where the entries c sit at positions (i,i) and (j,j) , s at (i,j) , and -s at (j,i) .
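The sparsity argument can be checked directly: left-multiplying by T(i,j) only mixes rows i and j , so the full m\times m matrix never has to be formed. A small NumPy sketch with illustrative indices (not from the paper) verifies both the orthogonality of the matrix in eq. (2) and the equivalence of the cheap row update with the dense product.

```python
import numpy as np

def apply_givens_rows(A, i, j, theta):
    """Left-multiply A by the Givens matrix T(i, j) of eq. (2) without forming it:
    only rows i and j of A change, so the cost is O(d) per affected row."""
    c, s = np.cos(theta), np.sin(theta)
    B = A.copy()
    B[i] = c * A[i] + s * A[j]
    B[j] = -s * A[i] + c * A[j]
    return B

rng = np.random.default_rng(0)
m, d = 6, 4
A = rng.normal(size=(m, d))
i, j, theta = 1, 4, 0.7

# Explicit (dense) Givens matrix, built exactly as described above eq. (2)
T = np.eye(m)
T[i, i] = T[j, j] = np.cos(theta)
T[i, j], T[j, i] = np.sin(theta), -np.sin(theta)

assert np.allclose(T @ T.T, np.eye(m))                        # orthogonal
assert np.allclose(T @ A, apply_givens_rows(A, i, j, theta))  # same result, sparse cost
```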

The elementary rotation matrix in eq. (2) is an orthogonal matrix. Elementary matrices can perform the following three types of matrix operations:
  • Multiplication : A multiplication operation multiplies the i -th row (resp. column) of a matrix by a non-zero scalar.

  • Permutation : A permutation operation makes the two rows (resp. columns) of a matrix swap their location.

  • Addition : An addition operation makes a matrix add j -th row (resp. column) multiplied by a non-zero scalar to the i -th row (resp. column).

In the actual operation of the proposed scheme, the client first uses SK=\{\alpha,P_{1},P_{2},\cdots,P_{l},Q_{1},Q_{2},\cdots,Q_{l}\} to obscure X and W into X^{\prime } and W^{\prime } through eq. (4) and eq. (5). The obscured data is then uploaded to the cloud server, where multiple hash functions are computed and multiple {\mathrm {HashTable}}^{\prime} are obtained and passed back to the client. In this phase, the communication consumption is reduced because the dimension of the returned data is smaller. The client then decides whether to accept {\mathrm {HashTable}}^{\prime} by matching V_{1} and V_{2} according to eq. (7) and eq. (8). If V_{1} is not equal to V_{2} , the cloud server is considered to have behaved improperly and the returned {\mathrm {HashTable}}^{\prime} is rejected; otherwise, the multiple {\mathrm {HashTable}}^{\prime} are recovered to HashTable by eq. (9). Finally, the multi-party clients pass their local HashTable to the database server for fast multi-party ANN queries later. A minimal end-to-end sketch of these five steps is given after this paragraph.
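To make the five steps concrete, the following Python/NumPy sketch walks through Algorithm 1 end to end under simplifying assumptions: the cloud server is simulated in the same process, the Givens rotation indices and angles are drawn at random, and the constants b and num_hash (playing the role of the offset and bucket width in eq. (1) and eq. (9)) are set to illustrative values. It is an illustration of the protocol, not the authors’ implementation.

```python
import numpy as np

def random_givens(n, rng):
    """Random n x n elementary rotation (Givens) matrix, eq. (2)."""
    i, j = rng.choice(n, size=2, replace=False)
    theta = rng.uniform(0.0, 2.0 * np.pi)
    G = np.eye(n)
    c, s = np.cos(theta), np.sin(theta)
    G[i, i] = G[j, j] = c
    G[i, j], G[j, i] = s, -s
    return G

def keygen(m, d, l, rng):
    """Step 1 (KeyGen): SK = {alpha, P_1..P_l, Q_1..Q_l}, eq. (3)."""
    alpha = rng.uniform(0.5, 2.0)                 # random non-zero real scalar
    P = [random_givens(m, rng) for _ in range(l)]
    Q = [random_givens(d, rng) for _ in range(l)]
    return alpha, P, Q

def comdis(X, W, alpha, P, Q):
    """Step 2 (ComDis): obscure X and W, eqs. (4)-(5)."""
    X_p = np.linalg.multi_dot(P + [alpha * X] + Q)
    W_p = np.linalg.multi_dot([alpha * W] + Q)
    return X_p, W_p

def cloud_compute(X_p, W_p):
    """Step 3 (Compute), run on the cloud server: eq. (6)."""
    return W_p @ X_p.T

def verify(H_p, X_p, W_p, rng):
    """Step 4 (Verify): accept iff (r W') X'^T == r HashTable', eqs. (7)-(8).
    Exact equality in the paper; np.allclose here for floating point."""
    r = rng.normal(size=(1, W_p.shape[0]))
    return np.allclose((r @ W_p) @ X_p.T, r @ H_p)

def recover(H_p, alpha, P, b=0.0, num_hash=4.0):
    """Step 5 (Recover): eq. (9)."""
    return np.floor((np.linalg.multi_dot([H_p / alpha**2] + P) + b) / num_hash)

# Toy end-to-end run (dimensions are illustrative only)
rng = np.random.default_rng(0)
m, d, t, l = 200, 64, 16, 10
X = rng.normal(size=(m, d))
W = rng.normal(size=(t, d))                       # p-stable projection matrix

alpha, P, Q = keygen(m, d, l, rng)
X_p, W_p = comdis(X, W, alpha, P, Q)
H_p = cloud_compute(X_p, W_p)                     # outsourced step
assert verify(H_p, X_p, W_p, rng)                 # reject on improper behavior
H = recover(H_p, alpha, P)
assert np.allclose(H, np.floor((W @ X.T + 0.0) / 4.0))  # matches local LSH
```

The final assertion mirrors the correctness analysis in Section V-A: the HashTable recovered from the obscured computation matches the one computed directly from X and W .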

SECTION V.

Theoretical Performance Analysis

A. Correctness Analysis

The secret key generated on the client is \mathrm {SK}=\{\alpha,P_{1},P_{2},\cdots,P_{l},Q_{1},Q_{2},\cdots,Q_{l}\} . Let P=P_{1}P_{2}\cdots P_{l} and Q=Q_{1}Q_{2}\cdots Q_{l} . The local client uses SK to recover the HashTable from {\mathrm {HashTable}}^{\prime} as follows:\begin{align*} \mathrm {HashTable}&=\left \lfloor{ \frac {\frac {1}{\alpha ^{2}}(\mathrm {HashTable}^{\prime} P)+b}{\mathrm {num\_{}hash}} }\right \rfloor \\ &=\left \lfloor{ \frac {\frac {1}{\alpha ^{2}}\left ({W^{\prime} {X^{\prime}}^{T}P }\right)+b}{\mathrm {num\_{}hash}} }\right \rfloor \\ &=\left \lfloor{ \frac {\frac {1}{\alpha ^{2}}\left ({\left ({\alpha W }\right)Q\left ({P\left ({\alpha X }\right)Q }\right)^{T}P }\right)+b}{\mathrm {num\_{}hash}} }\right \rfloor \\ &=\left \lfloor{ \frac {\frac {1}{\alpha ^{2}}\left ({\left ({\alpha W }\right)Q{Q^{T}\left ({\alpha X }\right)^{T}P}^{T}P }\right)+b}{\mathrm {num\_{}hash}} }\right \rfloor \\ &=\left \lfloor{ \frac {\frac {1}{\alpha ^{2}}\left ({\alpha ^{2}WX^{T} }\right)+b}{\mathrm {num\_{}hash}} }\right \rfloor \\ &=\left \lfloor{ \frac {WX^{T}+b}{\mathrm {num\_{}hash}} }\right \rfloor\end{align*}
Here the simplification uses the fact that P and Q are products of Givens matrices and hence orthogonal, so Q{Q}^{T}=I and P^{T}P=I . Thus, the recovered HashTable is the correct result, consistent with the original HashTable computed as in eq. (1). At the same time, to show the correctness of the lightweight verification mechanism, we have \begin{align*} V_{1}&=(rW^{\prime} ){X^{\prime }}^{T} \\ &=rW^{\prime} {X^{\prime}}^{T} \\ &=r\,{\mathrm {HashTable}}^{\prime} \\ &=V_{2}\end{align*}
Thus, if V_{1}=r\,{\mathrm {HashTable}}^{\prime} , the results computed by the cloud server are accepted as correct.

B. Security and Verification Analysis

We analyze the security and verification performance of the proposed outsourced LSH scheme through detailed theoretical analysis.

1) Data Privacy

The proposed outsourced LSH scheme can protect the privacy of the input/output.

Proof:

To protect the input data of the local client, that is, X and W , we adopt a lightweight data-obscuring scheme. X and W are combined with the private key SK through eq. (4) and eq. (5) to hide the original data: X is multiplied by 2l elementary rotation matrices and the real number \alpha , and W is multiplied by l elementary rotation matrices and \alpha . To obtain the original X and W , a malicious cloud server or other malicious party would need to recover \alpha and the 2l randomly generated elementary rotation matrices; since \alpha is a real number with a large range and the rotation matrices are random, the search space for brute-force cracking is large and the original data is well obscured. If the probability of guessing one element of an input matrix is assumed to be \frac {1}{\delta } (each element is a real number, so the range is wide), then the overall probability of guessing X is \mathrm {negli}(md)=\frac {1}{\delta ^{md}} , the probability of guessing W is \mathrm {negli}(td)=\frac {1}{\delta ^{td}} , and the probability of guessing the HashTable is \mathrm {negli}(tm)=\frac {1}{\delta ^{tm}} ; all are negligible. Moreover, since the HashTable is already the hash value of the original data, the original client data cannot be recovered even if the HashTable were guessed. Therefore, the data privacy of the input/output of the proposed scheme is guaranteed.

2) Verification Performance

The correctness of results from the cloud server can be verified with 100% probability.

Proof:

Based on the verifiability requirement of outsourcing computation described in Section III-B, an outsourcing scheme needs to verify the computing results returned by the cloud server. In the proposed scheme, assuming the cloud server may behave improperly, the client uses the verification mechanism of checking whether V_{1} equals V_{2} to verify the correctness of the results and rejects incorrect results. If the cloud server wants to forge a result {\mathrm {HashTable}}^{\prime} that passes verification, it must forge suitable values in every column of {\mathrm {HashTable}}^{\prime} such that r\,{\mathrm {HashTable}}^{\prime}=(rW^{\prime }){X^{\prime }}^{T} for the client’s random vector r , which is generated only after the result is returned and is never revealed to the server. Since every element of {\mathrm {HashTable}}^{\prime} is a real number and r is unknown, it is infeasible to forge a suitable {\mathrm {HashTable}}^{\prime} . Thus, the correctness of results from the cloud server can be verified with 100% probability.

C. Efficiency Analysis

We analyze the efficiency of the proposed scheme through detailed theoretical analysis in terms of computation efficiency and communication consumption.

1) Computation Efficiency

The proposed algorithm is ((2md+2td+2tm+4ml+4dl)/(\mathrm {num\_{}hash\_{}table}\ast tdm)) -efficient.

Proof:

We denote a scalar multiplication operation by SM and neglect minor computations in the analysis. To generate one hash table, in the ComDis step the client performs md+td+2ml+4dl \mathrm {SM} . In the Compute phase, the cloud server performs tdm \mathrm {SM} . In the Verify phase, the client executes td+dm+tm \mathrm {SM} , and in the Recover phase tm+2ml \mathrm {SM} . In total, the client therefore executes 2md+2td+2tm+4ml+4dl \mathrm {SM} , while the cloud server performs tdm \mathrm {SM} . Without outsourcing, the client would execute tdm \mathrm {SM} , as shown in eq. (1), the same as the complexity of the Compute phase. Thus, the proposed scheme is ((2md+2td+2tm+4ml+4dl)/(tdm)) -efficient when generating one hash table. Since the LSH algorithm usually generates multiple hash tables, if \mathrm {num\_{}hash\_{}table} is the number of generated hash tables, the proposed scheme is ((2md+2td+2tm+4ml+4dl)/(\mathrm {num\_{}hash\_{}table}\ast tdm)) -efficient. The time cost of the different phases is shown in Table 1, followed by a worked numeric illustration.

TABLE 1. Time cost of phases.
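As a hedged illustration of this ratio (the parameter values are hypothetical and are not taken from the paper’s experiments), suppose m=10^{4} data items of dimension d=3072 are hashed with t=64 projection vectors and l=20 rotation matrices per side. Then \begin{align*} \frac {2md+2td+2tm+4ml+4dl}{tdm}&=\frac {2(10^{4})(3072)+2(64)(3072)+2(64)(10^{4})+4(10^{4})(20)+4(3072)(20)}{(64)(3072)(10^{4})}\\ &\approx \frac {6.42\times 10^{7}}{1.97\times 10^{9}}\approx 0.033,\end{align*}
so for a single hash table the client performs roughly 3% of the scalar multiplications required to evaluate eq. (1) locally, and the fraction shrinks by a further factor of \mathrm {num\_{}hash\_{}table} when multiple tables are generated.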

2) Communication Consumption

We show the communication consumption of the proposed scheme in Table 2. Among the five phases, only the ComDis and Compute phases incur communication. Because the KeyGen, Verify, and Recover phases are all carried out on the client side, they have no communication consumption. The total communication consumption therefore consists of two parts: in the ComDis phase, the client sends the obscured matrices (X^{\prime },W^{\prime }) to the cloud server; in the Compute phase, after the cloud server finishes computing {\mathrm {HashTable}}^{\prime} , it transfers {\mathrm {HashTable}}^{\prime} back to the client.

TABLE 2. Communication cost of phases.

SECTION VI.

Experimental Performance Evaluation

In this section, we verify the correctness and efficiency of the proposed scheme through experiments. The experimental results show our proposed scheme is efficient and correct.

A. Evaluation Methodology

In our experiments, we used a computer with Windows Server 2012, an Intel Xeon Bronze 3106 CPU at 1.7 GHz, and 512 GB of memory as the testbed. We initially deployed the original LSH algorithm on an ordinary computer, but when we tried to incorporate more image data into the calculation, memory was exhausted, and using a hard disk to store intermediate results made the I/O cost unacceptable. We therefore used a computer with 512 GB of memory as the local client. For the cloud server, we chose Alibaba Cloud, and we implemented the proposed algorithm in Python 3.7.

In terms of the efficiency metric, we measured and compared the time consumption of the different phases, the time consumption on the cloud server side and the client side, and the time consumption of outsourced and non-outsourced LSH. To verify the correctness of the proposed algorithm, we check the consistency of V_{1} and V_{2} in the experiments.

We use the CIFAR-10 benchmark dataset as the experimental dataset, as shown in Table 3. The characteristics of the dataset are summarized as follows:

  • CIFAR-10 is an image dataset that contains 60,000 color images of 10 classes. Each class contains 6,000 images (the size of each image is 32\times 32 ).

TABLE 3. Dataset in the experiments.

In the experiment, we randomly select 1000 images (10 categories, 100 images per category) as the query dataset and 1000 to 9500 images (10 categories, 100 to 950 images per category) as the HashTable generation dataset. A query is considered valid only if the category label of the query image matches that of the retrieved image.

B. Evaluation Demonstration

In Fig. 2, we briefly describe the experimental procedure of the proposed outsourced LSH algorithm. First, we select a certain amount of data from CIFAR-10 as the HashTable generation dataset and a certain amount as the query dataset. We then execute the proposed algorithm to generate the HashTable of the HashTable generation dataset. Finally, the data in the query dataset are queried one by one through the HashTable for ANN search, the top k items are selected as the query result, and the accuracy is calculated.

FIGURE 2. The experimental procedure.

C. Evaluation Results

We analyze the efficiency of the proposed scheme by comparing time consumption in Fig. 3. The size of the HashTable generation dataset (the size of X ) ranges from 1000 to 9500. Fig. 3(a) compares the time consumption of outsourced and non-outsourced LSH; the outsourced LSH algorithm clearly brings a large reduction in time consumption. Fig. 3(b) shows the client’s time consumption in the Recover, Verify, and ComDis phases; the ComDis phase is the most expensive because the input data X and W must be obscured with SK: X is multiplied by 2l elementary rotation matrices and the real number \alpha , and W by l elementary rotation matrices and \alpha . Although the elementary rotation matrices are sparse, these operations still take a certain amount of time. The Recover phase comes second: the data returned from the cloud server has a smaller dimension than the original data, so although it must still be multiplied by l elementary rotation matrices and the scalar 1/\alpha ^{2} , the overall time consumption is lower. Fig. 3(c) compares the time consumption on the client side and the cloud side of the outsourced LSH algorithm. The cloud server consumes less time, and we believe that allocating more cloud resources to assist the computation would reduce the cloud time further. This further shows that, because the proposed algorithm targets resource-constrained clients, reducing the computational complexity and storage requirements of the client is key to improving the efficiency of the algorithm.

FIGURE 3. Time consumption comparison for the proposed algorithm.

To verify the correctness of the proposed scheme, we compare V_{1} and V_{2} in the experiments, as shown in Fig. 4. In Fig. 4, v1 is the local V_{1} calculated by eq. (7), and v2 is the V_{2} calculated from {\mathrm {HashTable}}^{\prime} , as shown in eq. (8). It can be seen that they are the same, which verifies the correctness of the proposed scheme.

FIGURE 4. Verification of v.

SECTION VII.

Applications

In this section, we present some potential practical applications of the proposed scheme. Since LSH is considered one of the most promising indexing methods for approximate nearest neighbor (ANN) search [43], it has been widely used in fields such as entity resolution [44], fingerprint matching [45], and user context privacy protection [46]. For ANN search on high-dimensional large-scale data, however, the computing power and memory of many edge/individual devices are insufficient. Meanwhile, in these applications, if the client’s data contains private information, outsourced computing also raises privacy concerns about the client’s data. Therefore, to protect the privacy of clients’ data while utilizing cloud servers for outsourced computing, our proposed secure outsourced LSH scheme offers a solution.

In 2011, Cappelli et al. proposed a new LSH-based query scheme for fingerprint identification in large databases [45]; six datasets were used in their experiments, as shown in Table 4. In terms of privacy protection, a person’s fingerprint characteristics are private information, especially when fingerprint identification serves commercial purposes, and their disclosure may cause quite serious losses. For example, Twitter was fined \$150 million and subjected to new oversight in 2022 for disclosing users’ private data. Therefore, to ensure data privacy and accelerate the algorithm, institutions can use our proposed secure outsourced LSH algorithm to complete fingerprint ANN search. By outsourcing part of the LSH computation and transferring the results to the database server, the client can efficiently compute the fingerprint hash tables with the help of the remote cloud server while ensuring the privacy of the information and the verifiability of the results. The server or a third party can then perform fast ANN searches through the hash tables on the database server.

TABLE 4. Datasets in the method of Cappelli et al. [45].

SECTION VIII.

Conclusion and Future Work

In this study, we proposed a secure and verifiable scheme for outsourcing the classical LSH algorithm to solve ANN query problems on resource-constrained devices. The scheme avoids leaking the input data by obscuring the client data, and the correctness of the cloud server’s computing results can be verified with 100% probability through the proposed verification mechanism. We detailed the design principles and the proposed scheme in Section IV. To verify the efficiency and correctness of our scheme, we conducted experiments on the CIFAR-10 dataset and described the results in Section VI. The experiments show that the proposed scheme is efficient and correct.

With the rise of machine learning, more and more devices, such as personal phones and edge devices, need to run complex machine learning operations. Although the performance of personal and edge devices keeps improving, the volume and dimensionality of the data they process are also increasing. Cloud-aided machine learning is one way to solve this problem: part of the complex computation of machine learning algorithms can be outsourced so that resource-constrained devices can complete complex machine learning tasks while the privacy of the client’s data is preserved. This requires cloud servers not only to run part of the machine learning computations but also to run them on encrypted data. There is already much work on this problem, such as homomorphic encryption techniques. However, since the local computation is done on resource-constrained devices, the encryption algorithm should not be too complex, and the results returned by the cloud server should allow the client to recover the data correctly. Based on these considerations and on prior research, we designed an outsourced LSH scheme that can verify the results from the cloud server with 100% probability, helping resource-constrained devices quickly complete the LSH algorithm for fast ANN search, especially on high-dimensional large-scale datasets.

In future work, we plan to add federated learning and multi-party computation mechanisms to the proposed scheme, bring more devices into the scope of fast ANN queries, and explore a more secure, efficient, practical, and multi-party outsourced ANN scheme.

References
