Introduction
Face recognition (FR) is one of the most well-known biometric authentication tools, and its applications are becoming ubiquitous, including smartphone unlocking, e-banking, national identity systems, border control, etc. In addition to security applications, FR is also used in entertainment applications. In FR systems, features (also known as templates or embeddings) are generally extracted from each face image. The extracted templates are stored in the system's database during the enrollment stage and are later used for recognition.
Among the different types of attacks against FR systems that are studied in the literature [1], [2], [3], [4], [5], the template inversion (TI) attack can considerably jeopardize both the security and privacy of users. In a TI attack, the adversary gains access to the templates stored in the system's database and tries to invert facial templates to reconstruct the underlying face image. Then, the adversary can use the reconstructed face image to impersonate the user and enter the system (security threat). In addition, the reconstructed face image may reveal privacy-sensitive information of the enrolled user, such as age, gender, ethnicity, etc. (privacy threat). In this paper, we focus on TI attacks in FR systems and present a comprehensive vulnerability evaluation of FR systems to TI attacks using 3D face reconstruction. We propose a new method (called geometry-aware face reconstruction, or GaFaR for short) to reconstruct 3D faces from facial templates using a geometry-aware face generator network. To our knowledge, this is the first work to reconstruct 3D faces from facial templates. Fig. 1 illustrates sample face images from the FFHQ [6] dataset and their corresponding 3D reconstruction from ArcFace [7] templates using our proposed method.
Sample face images from the FFHQ dataset (first row) and frontal 2D image (second row) from our 3D reconstruction (third row) in the whitebox template inversion attack against ArcFace. The values below each image of the second row show the cosine similarity between the templates of the original and frontal reconstruction face images. The decision threshold for
In recent years, neural radiance fields (NeRF) [8] have attracted attention in the computer vision community because of their impressive results in the novel-view generation problem. Generative NeRF (GNeRF) methods such as [9], [10], [11], [12], [13], [14], [15], [16], [17], [18], [19], [20], [21] combine conditional NeRF with generative models, such as a generative adversarial network (GAN), for geometry-aware image generation tasks. In GNeRF methods, a generative model is used to embed the appearance and shape of an object into a latent space. Then, the camera parameters along with the latent code of the generative model are fed into a NeRF model for the rendering process. Among GNeRF methods, several works proposed geometry-aware 3D face generation models that can generate face images from different views [13], [14], [15], [16], [17], [18], [19], [20].
In our proposed 3D face reconstruction method, we use a geometry-aware face generator network based on GNeRF, and learn a mapping from facial templates to the intermediate latent space of the GNeRF model. We train our model with a semi-supervised approach using real and synthetic face images. For real training face images, where we do not have the corresponding GNeRF latent codes, we train our mapping within a GAN-based framework to learn the distribution of GNeRF intermediate latent space (unsupervised learning). However, for the synthetic training face images, we have the corresponding GNeRF latent codes, and directly learn the mapping from facial templates to the GNeRF intermediate latent space (supervised learning). At the inference stage, we have the 3D reconstructed face and can generate a face image from any arbitrary pose. Thus, we apply optimization on the camera parameters to generate face images with a pose that can increase the success attack rate against the FR system. Fig. 2 illustrates the general block diagram of our proposed template inversion attack.
General block diagram of the proposed method: we train a mapping network from facial templates (input) to the intermediate latent space
We introduce our face reconstruction method for whitebox and blackbox TI attacks against FR systems. In the whitebox scenario, the adversary knows the internal functioning and parameters of the feature extraction model. However, in the blackbox scenario, the adversary does not have any knowledge about the internal functioning of the feature extraction model and can only use it to extract features from an arbitrary image. We consider the scenario where the adversary uses another FR model, with known internal functioning and parameters (i.e., whitebox knowledge), and uses this FR model for training the face reconstruction network. We present a comprehensive vulnerability evaluation of state-of-the-art (SOTA) FR systems to our TI attacks in whitebox and blackbox scenarios. We evaluate the transferability of the reconstructed face images by considering the situation where the adversary tries to reconstruct face images from the templates leaked from a FR system and uses the reconstructed face images to impersonate the same users in another FR system (with a different feature extraction model) in which the users are enrolled. Indeed, the transferability of TI attacks reveals a critical threat to FR systems, since the reconstructed face images can be used to enter other FR systems that the victim is enrolled in. Considering the whitebox/blackbox scenario and the adversary's knowledge of the target FR system, we define five different TI attacks, and comprehensively evaluate the vulnerability of SOTA FR systems to TI attacks. Furthermore, we perform practical evaluations based on presentation attacks using the digital replay and printed photographs of the reconstructed face images, and evaluate the vulnerability of SOTA FR systems.
To elaborate on the contributions of our paper, we summarize them hereunder:
We present a comprehensive vulnerability evaluation of SOTA FR systems to TI attacks using 3D face reconstruction from facial templates. Considering the whitebox/blackbox scenarios and the adversary's knowledge of the target FR system, we define five different TI attacks and evaluate the vulnerability of SOTA FR systems to different TI attacks as well as the transferability of reconstructed face images in TI attacks. We also perform a practical evaluation based on presentation attacks using the digital replay and printed photographs of the reconstructed face images in TI attacks against SOTA FR systems.
We propose a new method to reconstruct 3D faces from facial templates using a geometry-aware face generator network based on GNeRF. We use the proposed 3D face reconstruction method to introduce whitebox and blackbox TI attacks against FR systems. To our knowledge, this is the first work to reconstruct 3D faces from facial templates. To use the 3D reconstructed face in TI attacks against 2D FR systems during the inference stage, we apply optimization on the camera parameters in the input of the GNeRF model and find a pose that improves the success attack rate.
We learn a mapping from facial templates to the intermediate latent space of GNeRF. We train our mapping network with a semi-supervised approach, using real and synthetic face images. For the real training face images, we train our mapping within a GAN-based framework to learn the distribution of intermediate latent space of GNeRF. For the synthetic training face images, we directly learn the mapping from facial templates to the GNeRF intermediate latent codes.
The remainder of this paper is structured as follows. First, we review the related works in Section II. Then, we describe the threat model, our five different defined attacks, and our proposed method in Section III. Next, in Section IV, we present our experiments and discuss our results. Finally, the paper is concluded in Section V.
Related Works
Methods in the literature for face reconstruction in TI attacks against FR systems can be generally categorized from different aspects, including the basis of the method (optimization/learning-based), the type of attack (whitebox/blackbox attack), and the resolution of reconstructed face images (high/low resolution). However, all previous methods generate 2D images in TI attacks against FR systems.
Several methods have been proposed for reconstructing low-resolution 2D face images from facial templates [22], [23], [24], [25], [26], [27]. In [22], authors proposed two whitebox methods to reconstruct 2D low-resolution face images from facial templates. In the first method (optimization-based), they used a gradient-descent-based approach on a guiding image or random (noise) image to find an image that minimizes the distance between the template of the reconstructed face image and the target template. In addition, they used several regularization terms to generate a smooth image, including the total variation and Laplacian pyramid gradient normalization [33] of the reconstructed face image. In their learning-based method, they trained a deconvolutional neural network with the same loss function as in their optimization-based method, to generate reconstructed face images. For the evaluation of their method, they only discussed the visual reconstruction quality and did not provide any security evaluation on a FR system.
In [23], authors trained a multi-layer perceptron (MLP), to find the facial landmark coordinates, and a convolutional neural network (CNN), to generate face texture from the given facial template. Next, they used a differentiable warping to combine the estimated landmarks (from MLP) with the generated textures (from CNN) and reconstruct low-resolution 2D face images. They used their method for whitebox and blackbox attacks. In the whitebox attack, they trained their MLP and CNN by minimizing the distance between templates of the original and reconstructed face images. However, for their blackbox attack, they trained MLP and CNN separately, and used the warping in the inference only. For the security evaluation, they only reported the histogram of scores between the templates extracted from the original and reconstructed face images and compared it with the histogram of genuine scores.
In [24], authors proposed a learning-based method to generate low-resolution 2D face images in the blackbox attacks against FR systems. They proposed two new deconvolutional networks, called NbBlock-A and NbBlock-B, and trained them with either pixel loss (
In [25] and [26], the same method based on bijection learning is used to train GAN networks with PO-GAN [35] and TransGAN [36] structures, respectively. In the whitebox attack, authors minimized the distance between target templates and templates extracted from the reconstructed face images using the FR model. To extend their method to the blackbox attack, they proposed to use the distillation of knowledge to train a student network that mimics the target FR model. However, they neither reported any detail about the training of the student network (e.g., network structure, etc.) nor published their source code. For the security evaluation, they reported the matching accuracy between the reconstructed image and another original image in each positive pair in their TI attacks. However, they did not evaluate the vulnerability of FR systems at different threshold configurations.
In [27], authors proposed a 3-step method to reconstruct low-resolution 2D face images in the blackbox attack. In the first step, they trained a general face generator network based on GAN. In the second step, they trained a MLP to map the templates to the templates of a known (i.e., whitebox knowledge) FR model. In the third step, they used an optimization on the latent space of their face generator to find a latent code that can generate a face image that maximizes two terms: the cosine similarity between the templates (mapped templates and the templates extracted by the known FR model) and the discriminator score (for being a real face image). For their security evaluation, they reported the adversary's success attack rate (SAR), but they did not specify the system's operation configuration, such as the system's recognition false match rate (FMR).
In contrast to most works in the literature, which generate low-resolution 2D face images, a few methods have recently been proposed for high-resolution 2D face reconstruction. In [28], authors proposed a learning-based method to reconstruct high-resolution 2D face images in the blackbox attack. They used a pretrained StyleGAN2 [37] to generate some face images and extracted the templates using the FR model. Then, they trained a MLP to map facial templates to the input latent codes of StyleGAN2 [37]. For the security analysis, they considered two types of attacks as defined in [24] and evaluated the vulnerability of FR systems. They also evaluated their reconstructed face images with a commercial-off-the-shelf (COTS) presentation attack detection (PAD) system, also known as face liveness detection in their paper. However, the authors did not perform a practical presentation attack scenario, in which the images should have been recaptured by a camera prior to being fed to the COTS PAD. Similarly, in [29], authors proposed a learning-based method for high-resolution 2D face reconstruction in the blackbox attack. They learned three mapping networks from the facial templates to three separate parts in the intermediate latent space of StyleGAN. Each of these mapping networks is composed of a MLP and is used to reconstruct coarse to fine information of the face image. They also proposed to find this mapping with optimization instead of learning the mapping networks. For the security analysis, they did not report the success attack rate (percentage) for any configuration. They only reported the histogram of the distance between templates of reconstructed and original face images and compared it with the histogram of templates for random pairs of images (i.e., zero-effort impostor).
In [30], authors used a learning-based method based on a conditional denoising diffusion probabilistic model to reconstruct 2D face images in blackbox attack. They used the conditional diffusion model in [38] and iteratively denoise an input Gaussian noise conditioned with facial templates to generate low resolution (i.e.,
In [31], authors proposed an optimization on the latent vector (i.e., input noise) of StyleGAN2 [37] to find latent codes that generate face images with templates similar to the target templates. They solved this optimization with a grid-search and simulated annealing [39] approach for the blackbox scenario. However, since their method is computationally expensive, they evaluated their method on only 20 face images and reported the distance between the original templates and templates of the reconstructed face images. Along the same lines, in [32] authors considered a similar optimization to [31] on the latent vector of StyleGAN2 [37], but instead of grid-search, they solved the optimization using the standard genetic algorithm [40] for the blackbox attack. For the security analysis, they also considered two types of attacks as defined in [24] and evaluated the vulnerability of FR systems. Moreover, they evaluated their reconstructed face images using three COTS PAD systems (called liveness detection in their paper). However, similar to [28], they did not perform a practical presentation attack scenario by recapturing the reconstructed face images.
Table I compares our paper with the previous works in the literature. To our knowledge, our proposed method is the first method on 3D face reconstruction from facial templates (which are extracted from 2D face recognition models). Moreover, in contrast to most works in the literature, our method generates high-resolution face images. We also propose our method for both whitebox and blackbox attacks against FR systems and evaluate the transferability of our reconstructed face images (which has not been reported before for TI attacks). Furthermore, we perform practical presentation attacks against FR systems using the reconstructed face images. Last but not least, the source code of all the experiments in this paper is publicly available to facilitate the reproducibility of our work.
Proposed Method
We describe our threat model and define different TI attacks against FR systems in Section III-A (as depicted in Fig. 3). Then, we describe our proposed method to reconstruct 3D faces from facial templates in Section III-B. In the inference stage, we apply optimization on the camera parameters to generate a face image that can improve the success attack rate, as described in Section III-C. Fig. 4 illustrates the block diagram of the proposed TI attack, including our 3D face reconstruction method and our optimization on camera parameters during the inference stage.
Block diagram of our proposed TI attack: during the training process, a semi-supervised approach is used to learn our mapping
A. Threat Model
We consider the situation where the adversary gains access to the database of a FR system (
Adversary's goal: The adversary aims to reconstruct face images from templates stored in the database of a FR system ($F_\text{template}$), and use the reconstructed face images to enter the same or a different FR system (we call it the target FR system, $F_\text{target}$).
Adversary's knowledge: The adversary has the following information:
The leaked face templates $\boldsymbol{t}_\text{leaked}$ of users, which are enrolled in the database of $F_\text{template}$.
The adversary also has the whitebox knowledge of a feature extractor model ($F_\text{proxy}$). It is worth mentioning that $F_\text{proxy}$ can be similar to or different from $F_\text{template}$ and $F_\text{target}$.
Adversary's capability: We consider two scenarios for the adversary's capability:
The adversary can perform a presentation attack using the reconstructed face images to impersonate and enter the target FR system (e.g., using digital replay attacks or printed photographs).
The adversary can inject the reconstructed face image as a query to the target FR system.
Adversary's strategy: The adversary trains a face reconstruction model to invert the leaked facial templates $\boldsymbol{t}_\text{leaked}$. Then, based on the adversary's capability, the adversary can use the reconstructed face images to either perform a presentation attack or inject the reconstructed face image as a query to the target FR system.
In our threat model, we consider three different feature extraction models, including
Attack 1: The adversary has the whitebox knowledge of the feature extractor of the FR system from which the template is leaked and aims to impersonate to the same FR system (i.e., $F_\text{template} = F_\text{proxy} = F_\text{target}$).
Attack 2: The adversary has the whitebox knowledge of the feature extractor of the FR system from which the template is leaked, but aims to impersonate to a different FR system (i.e., $F_\text{template} = F_\text{proxy} \ne F_\text{target}$).
Attack 3: The adversary aims to impersonate to the same FR system from which the template is leaked, but has only the blackbox access to the feature extractor of the FR system. Instead, the adversary has the whitebox knowledge of another FR model to use for training the face reconstruction model (i.e., $F_\text{template} = F_\text{target} \ne F_\text{proxy}$).
Attack 4: The adversary aims to impersonate to a different FR system than the one from which the template is leaked. In addition, the adversary has the whitebox knowledge of the feature extractor of the target FR system (i.e., $F_\text{template} \ne F_\text{proxy} = F_\text{target}$).
Attack 5: The adversary aims to impersonate to a different FR system than the one from which the template is leaked, and has only the blackbox knowledge of both FR systems. However, the adversary instead has the whitebox knowledge of another FR model to use for training the face reconstruction model (i.e., $F_\text{template} \ne F_\text{proxy} \ne F_\text{target}$).
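The five attack types are fully determined by which of the three feature extraction models coincide. As an illustration only, the following minimal Python sketch maps a choice of models to the attack number. The function name `classify_ti_attack` and the use of string identifiers for models are assumptions made for this example and are not part of the original method.

```python
def classify_ti_attack(f_template: str, f_proxy: str, f_target: str) -> int:
    """Return the attack number (1-5) of the threat model.

    Each argument is an identifier of a feature extraction model;
    two equal identifiers denote the same model (illustrative convention).
    """
    proxy_is_template = f_proxy == f_template    # whitebox knowledge of the leaked system
    target_is_template = f_target == f_template  # attacking the system the template leaked from
    proxy_is_target = f_proxy == f_target        # whitebox knowledge of the target system

    if proxy_is_template and target_is_template:
        return 1  # F_template = F_proxy = F_target
    if proxy_is_template:
        return 2  # F_template = F_proxy != F_target
    if target_is_template:
        return 3  # F_template = F_target != F_proxy
    if proxy_is_target:
        return 4  # F_template != F_proxy = F_target
    return 5      # all three models differ


# Example: templates leaked from ArcFace, proxy ElasticFace, target ArcFace.
attack = classify_ti_attack("ArcFace", "ElasticFace", "ArcFace")
```

In this example call, the leaked and target systems coincide while the proxy differs, which corresponds to attack 3.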
Table II summarizes different TI attack types in our threat model as well as the adversary's knowledge of different FR models in each type of attack. In all types of attacks, the leaked facial templates to be reconstructed are from
B. Proposed 3D Face Reconstruction
To reconstruct 3D faces from facial templates, we use a pretrained EG3D [18] model as a geometry-aware face generator network based on GNeRF. This model consists of two networks, a mapping network and a generator and renderer network. The mapping network
1) Unsupervised Learning Using Real Training Data
To train our mapping network
\begin{align*}
&\mathcal {L}_{C}^{\text{WGAN}} = \mathbb{E}_{\boldsymbol{w}\sim M_\text{GNeRF}(\boldsymbol{z}) }[C(\boldsymbol{w})] - \mathbb{E}_{\hat{\boldsymbol{w}}\sim M_\text{rec}([\boldsymbol{n},\boldsymbol{t}])}[C(\hat{\boldsymbol{w}})] \tag{1}
\\
&\mathcal {L}_{M_\text{rec}}^{\text{WGAN}} = \mathbb{E}_{\hat{\boldsymbol{w}}\sim M_\text{rec}([\boldsymbol{n},\boldsymbol{t}])}[C(\hat{\boldsymbol{w}})] \tag{2}
\end{align*}
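To make Eqs. (1) and (2) concrete, the sketch below evaluates both expressions on one batch, replacing each expectation with a batch mean. It is a minimal illustration under stated assumptions: the critic is passed as an arbitrary Python callable, latent codes are plain lists of floats, and the function name `wgan_losses` is hypothetical. The sketch only evaluates the two expressions; it does not prescribe how they are used in the optimization.

```python
from statistics import fmean
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def wgan_losses(
    critic: Callable[[Vector], float],
    gnerf_latents: Sequence[Vector],
    mapped_latents: Sequence[Vector],
) -> Tuple[float, float]:
    """Batch estimates of Eq. (1) and Eq. (2).

    gnerf_latents  : samples w from M_GNeRF(z)
    mapped_latents : samples w_hat from M_rec([n, t])
    """
    score_gnerf = fmean(critic(w) for w in gnerf_latents)            # E[C(w)]
    score_mapped = fmean(critic(w_hat) for w_hat in mapped_latents)  # E[C(w_hat)]
    loss_critic = score_gnerf - score_mapped   # Eq. (1)
    loss_mapping = score_mapped                # Eq. (2)
    return loss_critic, loss_mapping


# Toy usage with a stand-in critic (sum of the latent components).
toy_critic = lambda w: float(sum(w))
l_c, l_m = wgan_losses(toy_critic, [[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 1.0]])
```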
In addition to the WGAN training, we feed the generated latent code
\begin{equation*}
\mathcal {L}_\text{real}^{\text{rec}} = \mathcal {L}^\text{Pixel} + \mathcal {L}^\text{ID}, \tag{3}
\end{equation*}
\begin{align*}
\mathcal {L}^\text{Pixel}&= \mathbb{E}_{ \hat{\boldsymbol{w}}\sim M_\text{rec}([\boldsymbol{n},\boldsymbol{t}]) } [ \left\Vert \boldsymbol{I}-G(\hat{\boldsymbol{w}},\boldsymbol{c})\right\Vert _{2}^{2}] \tag{4}
\\
\mathcal {L}^\text{ID} &= \mathbb{E}_{ \hat{\boldsymbol{w}}\sim M_\text{rec}([\boldsymbol{n},\boldsymbol{t}]) } [ \left\Vert F_\text{proxy}(\boldsymbol{I})-F_\text{proxy}(G(\hat{\boldsymbol{w}},\boldsymbol{c}))\right\Vert _{2}^{2}] \tag{5}
\end{align*}
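For a single training sample, Eqs. (3)-(5) reduce to two squared Euclidean distances, one in pixel space and one in template space. The following sketch shows that computation with images and templates represented as flat lists of floats. The names `squared_l2` and `reconstruction_loss_real`, and the `extract_template` callable standing in for $F_\text{proxy}$, are assumptions for illustration.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance ||a - b||_2^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reconstruction_loss_real(
    image: Vector,
    reconstructed: Vector,
    extract_template: Callable[[Vector], Vector],
) -> float:
    """Single-sample value of Eq. (3): pixel loss plus identity loss.

    image            : original face image I (flattened)
    reconstructed    : rendered image G(w_hat, c) (flattened)
    extract_template : stand-in for the proxy feature extractor F_proxy
    """
    pixel = squared_l2(image, reconstructed)  # Eq. (4)
    identity = squared_l2(
        extract_template(image), extract_template(reconstructed)
    )  # Eq. (5)
    return pixel + identity
```

Averaging this value over a batch gives the empirical counterpart of the expectations in Eqs. (4) and (5).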
2) Supervised Learning Using Synthetic Training Data
To train our mapping network
\begin{equation*}
\mathcal {L}_\text{syn}^{\text{rec}} = \mathcal {L}^{w} + \mathcal {L}^\text{Pixel} + \mathcal {L}^\text{ID}, \tag{6}
\end{equation*}
\begin{equation*}
\mathcal {L}^{w} = \mathbb{E}_{ \boldsymbol{w}\sim M_\text{GNeRF}(\boldsymbol{z}) } [ \left\Vert \boldsymbol{w}- M_\text{rec}([\boldsymbol{n},\boldsymbol{t}])\right\Vert _{2}^{2}] \tag{7}
\end{equation*}
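The additional term in Eq. (6) is available only for synthetic data, where the GNeRF latent code of each image is known. The sketch below estimates Eq. (7) over a batch and combines it with precomputed pixel and identity terms as in Eq. (6). Function names and the list-based representation of latent codes are illustrative assumptions.

```python
from statistics import fmean
from typing import Sequence

Vector = Sequence[float]


def latent_loss(gnerf_latents: Sequence[Vector], mapped_latents: Sequence[Vector]) -> float:
    """Batch estimate of Eq. (7): mean of ||w - M_rec([n, t])||_2^2."""
    return fmean(
        sum((a - b) ** 2 for a, b in zip(w, w_hat))
        for w, w_hat in zip(gnerf_latents, mapped_latents)
    )


def reconstruction_loss_syn(latent: float, pixel: float, identity: float) -> float:
    """Eq. (6): sum of the latent, pixel, and identity terms."""
    return latent + pixel + identity


# Toy usage: two latent codes of dimension 2.
l_w = latent_loss([[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 2.0]])
total = reconstruction_loss_syn(l_w, pixel=1.0, identity=0.5)
```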
To train our networks, we use Adam [42] optimizer and optimize the parameters of our new mapping network
Algorithm 1: Training Process of Our New Mapping Network.
Require:
Require:
Require:
Require:
procedure Training
Initialize
for
for
Sample a batch from
Sample a batch from
if
end if
if
Sample a batch
end if
end for
end for
end procedure
C. Camera Parameters Optimization
After generating a 3D reconstruction of face from the facial template using our proposed method described in Section III-B, the adversary needs to select a pose to generate a 2D reconstructed face image to inject into the system or perform a presentation attack. To this end, during the inference stage we can optimize the camera parameters to find a pose that increases the success attack rate (SAR). In other words, having the 3D reconstruction of a face, we would like to find the camera parameters so that the 2D generated face image has a facial template that is more similar to the leaked templates than the templates of any other pose. Among different camera parameters
1) Grid Search (GS)
In our grid search approach, we consider pre-defined steps to change the camera pitch
\begin{equation*}
\min _{\theta,\psi } \left\Vert \hat{\boldsymbol{t}}-\boldsymbol{t}\right\Vert _{2}^{2}, \tag{8}
\end{equation*}
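The grid search of Eq. (8) can be sketched as an exhaustive loop over candidate camera angles. In the illustration below, `template_at_pose` is a hypothetical callable that renders the reconstructed face at angles $(\theta, \psi)$ and returns the template $\hat{\boldsymbol{t}}$ extracted from that image; the grid values are placeholders and not the settings used in our experiments.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def grid_search_pose(
    target_template: Vector,
    template_at_pose: Callable[[float, float], Vector],
    theta_grid: Sequence[float],
    psi_grid: Sequence[float],
) -> Tuple[Tuple[float, float], float]:
    """Return the (theta, psi) pair minimizing ||t_hat - t||_2^2 and that distance."""
    best_pose = (theta_grid[0], psi_grid[0])
    best_dist = float("inf")
    for theta, psi in product(theta_grid, psi_grid):
        t_hat = template_at_pose(theta, psi)  # template of the image rendered at this pose
        dist = sum((a - b) ** 2 for a, b in zip(t_hat, target_template))
        if dist < best_dist:
            best_pose, best_dist = (theta, psi), dist
    return best_pose, best_dist
```

Because this procedure only needs template outputs and no gradients, it applies when the feature extractor can only be queried as a blackbox.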
2) Continuous Optimization (CO)
For continuous optimization, we start from the frontal camera parameters and use the Adam [42] optimizer to solve the following minimization using the mapped latent code
\begin{equation*}
\min _{\theta,\psi } \left\Vert F_\text{template}(G(\hat{\boldsymbol{w}},\boldsymbol{c}))- \boldsymbol{t}\right\Vert _{2}^{2}, \tag{9}
\end{equation*}
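As a rough illustration of the continuous optimization in Eq. (9), the sketch below runs Adam updates on the two camera angles, starting from a frontal pose at $(0, 0)$. It makes two simplifying assumptions that differ from a real whitebox setting: the objective is passed as a generic callable standing in for $\Vert F_\text{template}(G(\hat{\boldsymbol{w}},\boldsymbol{c}))-\boldsymbol{t}\Vert_2^2$, and gradients are approximated by central finite differences instead of backpropagation through the feature extractor and generator. The learning rate and step count are placeholders.

```python
import math
from typing import Callable, Tuple


def adam_minimize_pose(
    objective: Callable[[float, float], float],
    theta0: float = 0.0,
    psi0: float = 0.0,
    lr: float = 0.05,
    steps: int = 200,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,
    h: float = 1e-5,
) -> Tuple[float, float]:
    """Minimize objective(theta, psi) with Adam, starting from (theta0, psi0)."""
    params = [theta0, psi0]
    m = [0.0, 0.0]  # first-moment estimates
    v = [0.0, 0.0]  # second-moment estimates
    for step in range(1, steps + 1):
        # Central finite differences (stand-in for automatic differentiation).
        grads = []
        for i in range(2):
            plus, minus = list(params), list(params)
            plus[i] += h
            minus[i] -= h
            grads.append((objective(*plus) - objective(*minus)) / (2.0 * h))
        # Adam update with bias-corrected moments.
        for i, g in enumerate(grads):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g
            v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
            m_hat = m[i] / (1.0 - beta1 ** step)
            v_hat = v[i] / (1.0 - beta2 ** step)
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params[0], params[1]
```

Since this procedure relies on gradients of the template with respect to the camera parameters, it presupposes whitebox access to the feature extractor.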
Experiments
In this section, we evaluate the vulnerability of SOTA FR systems to our TI attacks defined in Section III. First, in Section IV-A we describe our experimental setup. In Section IV-B, we consider the case where the adversary can inject the reconstructed face image as a query to the system to impersonate, and present our experimental results. In Section IV-C, we consider the situation where the adversary uses the reconstructed face images to perform presentation attacks and evaluate the vulnerability of SOTA FR systems. Finally, we discuss our findings in Section IV-D.
A. Experimental Setup
1) Face Recognition Models
In our experiments, we evaluate the vulnerability of different SOTA FR models to our TI attacks. We consider two SOTA models, namely ArcFace [7] and ElasticFace [43], as the models from which templates are leaked (i.e.,
2) Datasets
All the FR models used in our experiments are trained on the MS-Celeb1M dataset [49]. However, we assume that the adversary does not have knowledge about the training data of the FR network (either
To evaluate the vulnerability of FR systems to TI attacks, we consider two other different face image datasets with identity labels, including the MOBIO [50] and Labeled Faces in the Wild (LFW) [51] datasets. The MOBIO dataset includes face images captured using mobile devices from 150 people in 12 sessions (6-11 samples in each session). The LFW dataset includes 13,233 face images of 5,749 people collected from the internet, where 1,680 people have two or more images.
3) Evaluation Protocol
To implement each of the attacks described in Section III-A, we build one or two separate FR systems using the same or two different SOTA feature extractor models (based on the attack type). If the target FR system is the same as the system from which the template is leaked (i.e.,
To evaluate the vulnerability to all our TI attacks, we assume that the target FR system is configured at the threshold corresponding to a false match rate (FMR) of
Block diagram of a FR system and data flows in normal usage (gray solid arrows), TI attack by injecting the reconstructed face image (orange dashed arrows), and performing presentation attack using the reconstructed face image (red dashed arrows).
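The evaluation protocol fixes the decision threshold from impostor comparison scores at a chosen FMR and then counts how many attack attempts exceed it. The sketch below illustrates this computation under stated assumptions: scores are similarity scores (higher means more similar), a comparison is accepted when its score is strictly above the threshold, and the SAR is taken as the fraction of accepted attack attempts. The function names and the example scores are hypothetical.

```python
from typing import Sequence


def threshold_at_fmr(impostor_scores: Sequence[float], fmr: float) -> float:
    """Threshold such that the fraction of impostor scores strictly above it is at most fmr."""
    ranked = sorted(impostor_scores, reverse=True)
    allowed = int(fmr * len(ranked))  # number of impostor comparisons allowed to match
    if allowed >= len(ranked):
        return float("-inf")
    return ranked[allowed]


def success_attack_rate(attack_scores: Sequence[float], threshold: float) -> float:
    """Fraction of attack attempts whose score is strictly above the threshold."""
    accepted = sum(1 for s in attack_scores if s > threshold)
    return accepted / len(attack_scores)


# Toy usage with made-up scores.
impostors = [0.0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.9]
thr = threshold_at_fmr(impostors, fmr=0.1)
sar = success_attack_rate([0.5, 0.3, 0.7, 0.45], thr)
```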
4) Implementation Details and Source Code
To build the FR pipeline and evaluate the TI attacks against FR systems, we use the Bob [52] toolbox. We use the PyTorch package and train all the networks on a system equipped with an NVIDIA GeForce RTX 3090. For the GNeRF model, we use the pretrained model of EG3D with StyleGAN [37] backbone to generate 3D faces with
To train our 3D face reconstruction networks, we consider
In our experiments, we use the continuous optimization (in whitebox attacks only) and grid search optimization (in both whitebox and blackbox attacks) in the inference stage, as described in Section III-C, to optimize camera parameters. In the grid search approach, we consider
We should note that the source code and the captured images for our presentation attack evaluation are publicly available to help reproduce our results.
B. TI Attack by Injecting Reconstructed Face Images
In this section, we consider the situation where the adversary can inject the reconstructed face image to the feature extractor of the target FR system. We consider SOTA FR models and evaluate the vulnerability of these systems to different TI attacks described in Section III-A in the whitebox (attacks 1-2) and blackbox (attacks 3-5) scenarios.
1) Whitebox Scenario
In attacks 1-2, we assume that the adversary has the whitebox knowledge of the FR system from which the template is leaked (i.e.,
Sample face images from the FFHQ dataset (first row) and their corresponding frontal face reconstruction (second row) as well as reconstructed face images within the camera parameters sub-grid (third row) using our method in the whitebox TI attacks (i.e., attacks 1-2) against ArcFace. The values below each image show the cosine similarity between templates of original and frontal reconstructed face images.
2) Blackbox Scenario
In attacks 3-5, we assume that the adversary has the blackbox knowledge of the feature extractor of the FR system from which the template is leaked (i.e.,
Table V also shows that SOTA FR systems are vulnerable to our TI attacks in the blackbox scenario. In particular, in attack 5 which is the hardest TI attack, where
Sample face images from the FFHQ dataset (first row) and their corresponding frontal (second row) reconstructed face images using our method in the blackbox attack against ElasticFace using ArcFace as
C. Practical Presentation Attack Using Reconstructed Face Images
In this section, we consider the situation where the adversary uses the reconstructed face image to perform a presentation attack to enter the target FR system. We consider reconstructed face images from ArcFace templates using our proposed face reconstruction method and camera parameter optimizations (i.e., GaFaR, GaFaR+GS, and GaFaR+CO) in both whitebox and blackbox scenarios, and use the reconstructed face images in each case to perform presentation attacks. We perform our presentation attacks against different SOTA FR systems based on the various TI attacks described in Section III-A. Therefore, we similarly have five different presentation attacks according to the adversary's knowledge of the FR system from which the template is leaked (i.e.,
Presentation attack via digital replay (replay attack): In this type of presentation attack, the adversary presents the reconstructed face image using a digital display in front of the camera. To perform this attack, we use a tablet (Apple iPad Pro) showing the reconstructed face image and put it in front of the camera of the target FR system.
Presentation attack via printed photograph: In this type of presentation attack, the adversary prints the reconstructed face image and presents the printed photograph. To perform this attack, we print the reconstructed face images with a color laser printer (Develop Ineo+C364e) on ordinary paper and present the printed photograph in front of the camera of the target FR system.
To perform the presentation attacks (with either digital replay or printed photograph), the reconstructed image should be presented in front of the camera of the target FR system. For each of these cases, we consider three different mobile devices, including Apple iPhone 12, Xiaomi Redmi 9A, and Samsung Galaxy S9, as the camera of the target FR system and capture images from the presentations. Fig. 8 shows our evaluation setup for capturing presentation attacks from tablet and printed photographs using different mobile cameras. It is noteworthy that we used the default display scale on the digital screen (i.e., iPad), in which the reconstructed face images with
Our evaluation setup for performing different types of presentation attacks and capturing the presentations using mobile devices: (a) replay attack using Apple iPad Pro, and (b) presentation attack using printed photograph.
Fig. 9 illustrates a sample face image from the MOBIO dataset, its reconstructed face images from ArcFace templates using our different methods (GaFaR, GaFaR+GS, and GaFaR+CO) in the whitebox and blackbox (using ElasticFace as
Sample image from the MOBIO dataset, its corresponding reconstructed face images using our face reconstruction methods (i.e., GaFaR, GaFaR+GS, and GaFaR+CO) in the whitebox and blackbox scenarios, the corresponding digital replay attacks and presentation attacks using printed photographs captured with different mobile devices.
Table VI reports the result of the vulnerability evaluation against SOTA FR systems to TI attacks (by injecting the reconstructed face images in our simulation), and different presentation attacks (digital replay attack and printed photograph) in the whitebox and blackbox scenarios in terms of SAR.15 It is noteworthy that based on the presentation type, we have two types of presentation attacks (replay attack and printed photograph), and based on the adversary's knowledge of the FR system from which the template is leaked (i.e.,
We also compare the performance of our method with the two best blackbox methods in the literature from Table V (i.e., NBNetB-P [24] and Vendrow and Vendrow [31]) in presentation attacks based on TI attacks 3-5 against SOTA FR models. Table VII reports this evaluation for the digital replay presentation attack (captured by Apple iPhone 12) based on TI attacks using ArcFace templates against SOTA FR models in terms of the adversary's SAR at the system's FMR of
D. Discussion
Our experiments in Section IV-B show that our proposed method outperforms previous methods in the literature in TI attacks against FR systems. To evaluate the effect of each part of our proposed method, we perform an ablation study and train different models. To this end, we compare the semi-supervised learning approach in our method with a fully supervised learning approach (i.e., using only synthetic data, where we have the corresponding latent code for each template) and a fully unsupervised learning approach (i.e., using only real data, where we do not have the corresponding latent code for each template). For each of the fully supervised and fully unsupervised learning approaches, we also evaluate the effect of each loss function. In the case of the fully unsupervised learning approach, we also evaluate the effect of adversarial learning in our method. Table VIII reports our ablation study on the effect of each part of our proposed method in attack 1 (injection) against the ArcFace model on the MOBIO and LFW datasets in terms of SAR at the system's FMR of
As another ablation study, we evaluate the effect of hyperparameters in the camera parameter optimization for our proposed grid search (GS) and continuous optimization (CO) approaches. For the grid search optimization approach, in our experiments in Sections IV-B and IV-C, we considered
(a) Sample face image from the FFHQ dataset, (b) its frontal reconstructed face image, (c) its 3D face reconstruction, and (d) the corresponding reconstructed face images with camera parameters grid using our method in the whitebox attack against ArcFace. The cosine similarity between templates of original (a) and frontal (b) reconstructed face images is 0.679.
Ablation study on the effect of different hyperparameters in grid search for camera parameters optimization in terms of success attack rate (SAR) and average execution time for each image reconstruction for whitebox attack (i.e., attack 1) against an FR system based on ArcFace configured at FMR=
Ablation study on the effect of different hyperparameters in continuous optimization for camera parameters in terms of success attack rate (SAR) and average execution time for each image reconstruction for whitebox attack (i.e., attack 1) against an FR system based on ArcFace configured at FMR =
According to the results in Tables IV, V, and VI, our camera parameter optimization methods improve the performance of our face reconstruction network. In particular, we observe that GaFaR+GS and GaFaR+CO also improve the SAR in attacks against different target FR systems (i.e., the transferability evaluation in attacks 2, 4, and 5). This shows that our camera parameter optimization methods improve the attacks in such a way that the templates of the reconstructed face images become more similar to the templates of the original face images, even when extracted by a different FR model. Such improvements in attacks against different target FR systems demonstrate the transferability of our pose-optimized reconstructed face images.
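As an illustration of how a grid search over camera parameters can raise template similarity, the following minimal Python sketch selects the pitch and yaw that maximize the cosine similarity to a leaked template. It is not the implementation used in our experiments: `toy_template` is a hypothetical stand-in for rendering the 3D face at a given pose and extracting its template with an FR model, and the grid range and step are illustrative assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two template vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_template(pitch, yaw):
    """Hypothetical stand-in for 'render the face at (pitch, yaw), then
    extract its template'. A real attack would call the face generator
    and the FR feature extractor here."""
    return np.array([np.cos(pitch) * np.cos(yaw), np.sin(yaw), np.sin(pitch), 0.5])

def grid_search_pose(target, template_fn, pitches, yaws):
    """Return (best_similarity, best_pitch, best_yaw) over the pose grid."""
    best = (-np.inf, None, None)
    for pitch in pitches:
        for yaw in yaws:
            sim = cosine_similarity(target, template_fn(pitch, yaw))
            if sim > best[0]:
                best = (sim, float(pitch), float(yaw))
    return best

# Toy leaked template, generated at a slightly non-frontal pose.
target = toy_template(0.1, -0.2)
grid = np.linspace(-0.3, 0.3, 7)  # illustrative grid in radians (assumption)

frontal_sim = cosine_similarity(target, toy_template(0.0, 0.0))
best_sim, best_pitch, best_yaw = grid_search_pose(target, toy_template, grid, grid)
print(f"frontal: {frontal_sim:.3f}, pose-optimized: {best_sim:.3f} "
      f"at pitch={best_pitch:.2f}, yaw={best_yaw:.2f}")
```

Because the frontal pose is itself a grid point, the selected pose can never score lower than the frontal reconstruction under the model used for the search. Whether the gain transfers to a different FR model is an empirical question, which is what attacks 2, 4, and 5 evaluate.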
We further investigate the effect of our camera parameter optimization methods on our attacks. In attack 1 against ArcFace, our grid search method increases the similarity between templates of original and reconstructed face images in 89.52% and 88.70% of cases on the MOBIO and LFW datasets, respectively. Moreover, our continuous optimization method increases the similarity between templates for 99.04% and 98.66% of reconstructed face images on the MOBIO and LFW datasets, respectively. We also use the pose estimation model in [54] to find the histograms of the pose of original and reconstructed face images in attack 1 against ArcFace on the MOBIO and LFW datasets. As the histograms in this figure show, most of the pose-optimized reconstructed face images have a small variation around the frontal pose. This observation is also consistent with our ablation study in Figs. 11 and 12, where we see that the intervals of
Histogram of pitch and yaw in (a) original, (b) GaFaR+GS, and (c) GaFaR+CO for attack 1 against ArcFace on the MOBIO (first row) and LFW (second row) datasets. Note that for GaFaR without any camera parameter optimization, the reconstructed face images are frontal (i.e., pitch and yaw values are zero), and thus the histogram for GaFaR is not depicted in this figure.
Reconstruction of sample images from the MOBIO dataset in whitebox and blackbox (using ElasticFace) TI attacks against ArcFace templates using our methods.
Comparing our results in the whitebox (Table IV) and blackbox (Table V) attacks in Section IV-B, we observe that our proposed face reconstruction network, GaFaR, achieves better performance in whitebox attacks (attacks 1-2) than in blackbox attacks (attacks 3-5) when inverting ArcFace templates (i.e., ArcFace as
To conclude our discussion, our experiments in Section IV-B show the vulnerability of SOTA FR systems to TI attacks using our face reconstruction methods (GaFaR, GaFaR+GS, and GaFaR+CO). Similarly, our experiments in Section IV-C show that the face images reconstructed by our proposed methods can be used for presentation attacks against the same FR system or against different FR systems in which the corresponding user is enrolled (i.e., transferability of the reconstructed face images). In fact, our experiments show potential threats that can seriously jeopardize the security and privacy of users if the facial templates are leaked. In addition to the experiments in Sections IV-B and IV-C, we should note that our proposed method can generate 3D faces from facial templates (as shown in Figs. 1 and 10). Such 3D reconstruction can be used for more sophisticated presentation attacks (e.g., 3D face masks) against FR systems, which requires further study in future work.
Conclusion
In this article, we presented a comprehensive vulnerability evaluation of SOTA FR systems to TI attacks using 3D face reconstruction from facial templates. We proposed a new method (called GaFaR) to reconstruct 3D faces from facial templates using a geometry-aware face generation network based on GNeRF. We learned a mapping from facial templates to the intermediate latent space of the GNeRF model with a semi-supervised learning approach using real and synthetic training data. For the real data, where we do not have the correct intermediate latent code, we used GAN-based training to learn the distribution of the intermediate latent space of the GNeRF model (unsupervised learning). For the synthetic data, we have the corresponding intermediate latent code and directly learn the mapping (supervised learning). In addition, we proposed two optimization methods on the camera parameters in GNeRF to find a pose that improves the TI attack: grid search and continuous optimization. In the grid search method, we considered a grid for pitch and yaw rotations of the reconstructed face, and in continuous optimization, we used a gradient-based optimizer to optimize camera parameters.
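The continuous optimization of camera parameters can be sketched as gradient ascent on the cosine similarity between the leaked template and the template of the rendered face. The Python example below is a simplified illustration under stated assumptions, not our implementation: `toy_template` is a hypothetical stand-in for the renderer followed by the FR feature extractor, central finite differences replace backpropagation through the generator, and the learning rate and step count are arbitrary choices.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two template vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def toy_template(pose):
    """Hypothetical stand-in for rendering at pose=(pitch, yaw) and
    extracting the template of the rendered face image."""
    pitch, yaw = pose
    return np.array([np.cos(pitch) * np.cos(yaw), np.sin(yaw), np.sin(pitch), 0.5])

def numerical_gradient(objective, pose, eps=1e-5):
    """Central finite-difference gradient of objective at pose (assumption:
    used here in place of automatic differentiation)."""
    grad = np.zeros_like(pose)
    for i in range(pose.size):
        step = np.zeros_like(pose)
        step[i] = eps
        grad[i] = (objective(pose + step) - objective(pose - step)) / (2 * eps)
    return grad

def optimize_pose(target, template_fn, init_pose, lr=0.5, steps=100):
    """Gradient ascent on template similarity, starting from init_pose."""
    def objective(p):
        return cosine_similarity(target, template_fn(p))

    pose = np.array(init_pose, dtype=float)
    for _ in range(steps):
        pose = pose + lr * numerical_gradient(objective, pose)
    return pose, objective(pose)

# Toy leaked template, generated at a slightly non-frontal pose.
target = toy_template(np.array([0.1, -0.2]))
start = np.zeros(2)  # begin from the frontal pose
start_sim = cosine_similarity(target, toy_template(start))
pose, final_sim = optimize_pose(target, toy_template, start)
print(f"similarity: {start_sim:.3f} -> {final_sim:.3f}, pose = {pose}")
```

Unlike the grid search, this procedure is not restricted to a fixed set of candidate poses, but it may converge to a local optimum depending on the starting pose and step size.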
We proposed our method for whitebox and blackbox attacks against face recognition systems and comprehensively evaluated the vulnerability of SOTA FR systems to our method. Considering the whitebox and blackbox scenarios and the adversary's knowledge of the target FR system, we defined five types of TI attacks and evaluated the transferability of our reconstructed face images across other FR systems on the MOBIO and LFW datasets. We evaluated the TI attacks by injecting reconstructed face images as queries to the target FR systems. In addition, we performed practical presentation attacks against SOTA FR systems using digital screen replay and printed photographs of reconstructed frontal and pose-optimized face images. Our experiments showed the vulnerability of SOTA FR models to our TI attacks and also to presentation attacks using our reconstructed face images.
Last but not least, our proposed method can generate 3D faces from facial templates, and we used the 3D reconstruction to find a pose that improves the adversary's success attack rate. However, 3D reconstruction of users’ faces also paves the way for new types of attacks (e.g., 3D face masks), which need to be investigated in the future.
ACKNOWLEDGMENTS
The authors would like to thank Karine Vaucher (Idiap Research Institute, Switzerland) for her help in conducting data collection in the presentation attack experiments.