Introduction
Morphed face detection has attracted a surge of interest in the biometric and vision communities [24], [39], [57], [61]. Facial image morphing attacks pose a serious threat to the functionality of face recognition systems, especially those deployed at borders [32]. Using a morphed image, a criminal can share a passport with an innocent accomplice to evade identification and detection: because both the criminal's and the accomplice's faces verify against the morphed photo, the criminal can travel on a passport legitimately issued to the accomplice. A large body of research is devoted to generating morphed facial images, mostly either by manipulating the geometric characteristics of two bona fide subjects' images [22], [51], [69] or by using generative networks such as Generative Adversarial Networks (GANs) [22], [64], [88].
The mainstream methods for morph detection fall into two categories: single image morph detection [66], [80] and differential morph detection [66], [80]. In the former, the goal is to identify a probe image as either bona fide or morphed without any auxiliary information. The latter additionally takes into account a live image of the subject under investigation to classify the probe image as bona fide or morphed. State-of-the-art multi-class classifiers and object detection frameworks benefit from rich visual abstractions realized through representation learning techniques [15]. Generally speaking, face morph detection can be reformulated as learning discriminative cues that define a decision boundary separating bona fide images from morphed ones in a binary classification setting. Since artifacts in a morphed image are local, the discrepancy between a morphed image and the corresponding bona fide image can be detected using fine-grained features [93].
The 2D wavelet decomposition [13], [18] provides useful insight into the joint spatial-frequency information embedded in a 2D image. Wavelet sub-bands can be thought of as fine-grained features with variable granularity: the decomposition exposes hidden information through a spatial-frequency representation, and the resulting sub-bands can be harnessed to isolate artifacts in a given morphed image at different spatial-frequency granularities.
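As a concrete illustration of a single decomposition level, the following minimal sketch (using the PyWavelets library, with a Daubechies wavelet chosen purely for illustration) splits a grayscale image into one approximation (LL) and three detail (LH, HL, HH) sub-bands:

\begin{verbatim}
import numpy as np
import pywt

# Stand-in for a grayscale face image; any 2D array works here.
img = np.random.rand(112, 112)

# One level of 2D DWT: LL is the coarse approximation; LH/HL/HH carry the
# horizontal/vertical/diagonal high-frequency detail where local morphing
# artifacts tend to reside.
LL, (LH, HL, HH) = pywt.dwt2(img, "db2")
print(LL.shape, LH.shape)  # each sub-band is roughly half resolution
\end{verbatim}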
Feature learning plays a pivotal role in mainstream computer vision tasks such as image classification. In particular, sparse representation learning methods have proved to be powerful tools for face recognition applications. Due to the NP-hardness of sparsity-constrained optimization, a relaxed version of the sparsity condition is typically enforced using the $\ell_{1}$-norm.
Recently, visual attention mechanisms have initiated a renaissance in image recognition and classification. "Visual explanation" [34], underlying the attention mechanism, uncovers which regions a deep neural network focuses on to form its final decision for a given downstream task. In other words, an attention mechanism forces a DNN to focus on the most informative regions, i.e., those contributing the most to the learned hypothesis, leading to more accurate classification [34], [42], [84], [90], [94]. The feature refinement realized by spatial- and channel-wise attention [84] provides a rich representation that can increase inter-class separability while minimizing intra-class dispersion. Also, vision transformers [49], which have shifted the classification-accuracy paradigm without using any convolutional operations, benefit from the multi-headed self-attention mechanism, which also plays a pivotal role in this study.
This work investigates the application of group sparsity [83], a soft attention mechanism [42], and self-attention [14], [84] for morph detection. The spatial-frequency content of an image provides useful information, such as subtle discrepancies between a bona fide image and its morphed counterpart. We decompose every input image using a multilevel 2D wavelet decomposition to extract coarse-to-fine spatial-frequency wavelet sub-bands, which serve as powerful representations for training a DNN morph detector. Our group-sparsity-constrained optimization framework drives our DNN morph classifier toward a sparse solution in which wavelet sub-bands bearing minimal discriminative information are discarded. Thus, we select a subset of the most discriminative wavelet sub-bands, an implicit feature selection mechanism. In addition, attention modules customized to our model guide the network to pinpoint the most informative spatial pixels as well as the most information-bearing channels in a given intermediate feature map. As shown in Fig. 1, we incorporate three types of visual attention mechanisms into our DNN-based morph detector, guiding it to mine the spatial regions with the highest density of morphing artifacts. Namely, we employ the spatial and channel attention modules introduced in CBAM [84], which we call Att. I; the end-to-end soft attention mechanism delineated in [42], called Att. II; and the self-attention augmented feature maps, called Att. III hereafter. Through extensive experiments, we demonstrate the advantage of these three attention mechanisms for improving morph detection accuracy. To increase intra-class compactness and inter-class dispersion, we employ the additive angular margin loss function (ArcFace) to obtain highly discriminative features. We demonstrate the efficacy of our framework through extensive experiments on the morph detection datasets described in Section IV-A.
Our attention-augmented framework adopts three different attention mechanisms, i.e., Att. I, Att. II, and Att. III, to increase morph detection accuracy. The Att. I module, the convolutional block attention, uses max-pooling and average-pooling to compute channel and spatial attention maps that highlight discriminative spatial pixels in a given set of feature maps. Att. II determines informative spatial pixel locations by computing the correlation between each spatial location in a given feature map (the local feature vectors) and the output of a fully connected (FC) layer (the global feature vector) of the DNN. Finally, Att. III yields augmented feature maps by concatenating the convolutional feature maps with their corresponding self-attentional feature maps.
The organization of the paper is as follows: In Section II, we review the literature on morph generation, morph detection, sparse representation learning, and attention mechanisms. In Section III, we delineate our methodology for improving morph detection accuracy. In Section IV, we present our experiments and results. Finally, in Sections V and VI we conclude our work and discuss future work and limitations. Our contributions in this paper are outlined as follows:
Instead of the RGB domain, we leverage the wavelet domain to extract rich spatial-frequency features, i.e., the wavelet sub-bands of input images.
We employ group sparsity to select the most discriminative wavelet sub-bands as a feature selection scheme for increasing morph detection accuracy.
We integrate three different types of visual attention modules into our DNN to highlight informative spatial areas of input images, which can decrease morph detection error rates.
Related Works
A. Morph Generation
Generative methods have considerably shifted the photo-realistic image synthesis paradigm [35], [43], [47], [88]. Face morphing attacks are synthesized either through alpha blending in the spatial domain [22], [45], [51] or in the latent domain [22], [88]. Interpolation in the spatial pixel domain, known as landmark-based morphing [22], translates the facial landmarks of the two underlying subjects to common averaged locations, and the final morphed image is generated by warping and alpha-blending. Alternatively, generative networks such as GANs have shown compelling results for synthesizing morphed images. In [22], an encoder is added to a GAN architecture to map the image domain into the latent domain. Once the modified GAN architecture is fully trained, two images are mapped into the latent domain, where latent alpha-blending yields a morphed latent vector. In other words, if $z_1$ and $z_2$ denote the latent representations of the two contributing subjects, the morphed latent vector is $z_m = \alpha z_1 + (1-\alpha) z_2$, which is then decoded back into the image domain.
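The latent alpha-blending step just described can be sketched as follows; the encode and decode hooks stand in for a pretrained encoder/generator pair such as the modified GAN of [22] and are named here purely for illustration:

\begin{verbatim}
def latent_morph(img_a, img_b, encode, decode, alpha: float = 0.5):
    """Latent-domain morphing: blend two latent codes and decode the result.

    encode/decode are assumed hooks into a pretrained GAN equipped with an
    encoder; alpha = 0.5 weights both contributing subjects equally.
    """
    z_a, z_b = encode(img_a), encode(img_b)
    z_m = alpha * z_a + (1.0 - alpha) * z_b  # latent alpha-blending
    return decode(z_m)
\end{verbatim}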
B. Morph Detection
Morph detection has been addressed under two scenarios. In the first, called single image morph detection, a single image is classified as either bona fide or morphed [66], [80]. In the second, called differential morph detection, auxiliary information, namely a live image of the subject, is used to label an image as either bona fide or morphed [18], [67], [73], [74]. The scope of this work is to design a single image morph detector. The methods proposed for single image morph detection are mostly categorized into those using hand-crafted features [26], [65] and those using deep embedding features [12], [60]. State-of-the-art methods for single image morph detection are summarized in Table 1.
In [26], the authors exploited inconsistencies in the Photo Response Non-Uniformity (PRNU) of bona fide and morphed images as a proxy for detecting morphed images. Aghdaie et al. [12] employed an attention mechanism to localize morphing artifacts in the wavelet domain. Seibold et al. [68] trained binary morph classifiers based on VGG19, AlexNet, and GoogLeNet, where the train and test sets were augmented with engineered noise. The work in [25] employed three methods, i.e., a CNN, Local Binary Pattern Histograms (LBPH), and PRNU, to detect both landmark-based morphing attacks and morphed images generated by generative adversarial networks. Pixel-wise supervision was proposed in [23] to improve the generalization of a morph detector. In [57], different modalities of a single image, such as the eyes, nose, and mouth, were used to improve morph detection accuracy. In addition, a multi-scale attention-based network was developed to detect morphed images, applying the attention mechanism to images at different scales [89]. Moreover, feature-wise supervision was used in [56] to generate a prediction map for single and differential morph detection.
C. Sparse Representation Learning
Sparse signal representation is an important class of representation learning methods that provides a compressed version of a high-dimensional signal [91]. Images are naturally sparse with respect to some predefined bases, which is why sparse representation learning is beneficial for image recognition tasks. More importantly, sparse representations have led to promising performance in face recognition [31], [36], [86]. Structured group sparsity [83] has also proved compelling for learning representations that are more discriminative. Analogous to finding principal components, learning a sparse representation limits the degrees of freedom when searching for an optimal hypothesis.
D. Attention Mechanism
Visual attention mechanisms [19], [42], [84] have introduced a paradigm shift in mainstream computer vision tasks. In a typical attention mechanism, the correlation between each spatial location in a feature map and the response of the network, usually the output of the last fully-connected layer, designates that location's importance in terms of its contribution to the final prediction. The attention weights thus quantify the importance of the spatial pixels. In the self-attention mechanism [63], [85], [92], on the other hand, the long-range dependencies between a pixel location in a feature map and all other pixel locations are modeled irrespective of the network's output.
The attention mechanism can guide a classifier toward the most discriminative local patches of an input image, where subtle anomalies are captured. Morph detection can be thought of as fine-grained classification because the differences between a bona fide and a morphed image are local and subtle, which is why the attention mechanism has proved useful for morph detection [12]. The attention mechanism can be soft [42], a differentiable process trained via back-propagation, or hard, which adopts stochastic sampling to select the most discriminative pixels and is trained using the REINFORCE method [75].
Self-attention [29], [78], [92] has emerged as a powerful mechanism for boosting image recognition performance. Self-attentional networks can be implemented as stand-alone frameworks without any convolutional operations (e.g., vision transformers [29], [49]) for image recognition or object detection, which is beyond the scope of this study. Alternatively, several works have integrated self-attentional modules into the convolutional layers of a DNN [14], [20], [41] as a feature augmentation method to capture long-range dependencies that are not revealed by the local convolution operation.
Methodology
In this paper, we propose a morph detector which leverages: (1) group sparsity for capturing the most discriminative wavelet sub-bands of a given facial image (see Fig. 2), and (2) visual attention mechanisms which drive our morph detector toward the most informative spatial and channel-wise regions to facilitate detecting morphed faces (see Fig. 3). To evaluate the group sparsity and attention mechanisms in detail, we first delve into the application of group sparsity as a representation learning scheme. The effect of each attention mechanism is then investigated separately to assess the improvement in morph detection due to an attention-based network. Finally, we train our wavelet-based attention-augmented morph detector, which includes all three types of attention modules. The final objective of this paper is the joint optimization of the group sparsity and attention mechanisms.
Our morph detection methodology selects the most discriminative wavelet sub-bands of input images, which increases morph detection accuracy according to our extensive experimental evaluations.
Our morph detection framework focuses on discriminative spatial regions in the selected wavelet sub-bands through three different types of visual attention mechanisms, called (a) Att. I, (b) Att. II, and (c) Att. III, which further increases morph detection accuracy according to our extensive experimental evaluations.
From the information-theoretic perspective, an optimal DNN architecture must meet the following conditions [77]: (1) minimizing the mutual information between an intermediate feature map at a given layer and the input, while (2) maximizing the mutual information between that feature map and the target label, i.e., the information bottleneck principle.
A. Channel-Wise Feature Selection
In this study, instead of experimenting on images in the original RGB domain, we decompose all images using a 2D wavelet decomposition, which enables us to exploit fine-grained information in the spatial-frequency domain. The wavelet domain has proved to be a rich representation that provides information at different granularities. We extract the most useful spatial-frequency information, realized through the sub-band selection detailed in this subsection, which helps us localize morphing artifacts more accurately than in the RGB domain. To this end, we adopt an undecimated uniform 2D wavelet decomposition. Since the wavelet decomposition operates on single-channel inputs, the RGB images are first converted to grayscale using the OpenCV RGB-to-grayscale conversion function before being passed to the wavelet decomposition module. We apply three levels of wavelet decomposition and, from the resulting 64 wavelet sub-bands, keep the 48 sub-bands that represent the high-frequency spectra; in other words, we discard the Low-Low (LL) sub-band obtained after the first level of decomposition, along with its descendants. Our objectives are as follows: (1) channel-wise feature selection to select, from these 48 sub-bands, the most discriminative wavelet sub-bands for distinguishing bona fide images from morphed ones, and (2) spatial feature selection, employing different attention mechanisms to localize the most discriminative pixels in the selected wavelet sub-bands.
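The sub-band bookkeeping can be illustrated with PyWavelets. Note that pywt.WaveletPacket2D is a decimated packet transform, whereas the paper uses an undecimated variant, so the sketch below only shows which of the 64 level-3 packet nodes are kept (the wavelet family is an arbitrary choice):

\begin{verbatim}
import numpy as np
import pywt

def wavelet_subbands(gray_img: np.ndarray, wavelet: str = "db2") -> np.ndarray:
    """Return the 48 high-frequency level-3 wavelet-packet sub-bands.

    A full 3-level 2D wavelet-packet tree has 4**3 = 64 leaf sub-bands.
    Discarding the level-1 LL node (path 'a') removes its 16 level-3
    descendants, leaving 48 sub-bands.
    """
    wp = pywt.WaveletPacket2D(data=gray_img, wavelet=wavelet,
                              mode="symmetric", maxlevel=3)
    nodes = [n for n in wp.get_level(3, order="natural")
             if not n.path.startswith("a")]
    assert len(nodes) == 48
    return np.stack([n.data for n in nodes], axis=0)  # (48, H', W')
\end{verbatim}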
We adopt a group-sparsity feature selection scheme to select the most discriminative wavelet sub-bands, referred to above as sub-band selection, for a given input image. Our implicit feature selection scheme is realized by imposing a group sparsity constraint on the parameters of the first convolutional layer of our DNN morph detector (see Fig. 2). We select the most discriminative wavelet sub-bands by discarding those whose corresponding kernel weights in the first convolutional layer converge to zero, thanks to the enforced group sparsity constraint. Note that the input images are composed of 48 channels, one per wavelet sub-band.
As discussed above, to select the wavelet sub-bands, i.e., to perform channel-wise feature selection, we impose a group sparsity constraint on the parameters of the first convolutional layer of our DNN. Integrating a group sparsity term on the weights of the first convolutional layer into the classification loss, known as a structured sparsity regularization penalty, drives the network to sparsify the grouped weights of that layer, which implicitly discards non-discriminative wavelet sub-bands. Consequently, a subset of informative wavelet sub-bands, out of the 48 sub-bands, is selected. In other words, after training our DNN, some of the grouped weights shrink to zero, and the wavelet sub-bands associated with those groups are discarded.
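A minimal PyTorch sketch of such a group-lasso penalty, with one group per input channel (wavelet sub-band), matching the structured term formalized in Eq. 2 of the next subsection; the tensor layout follows PyTorch's Conv2d weight convention:

\begin{verbatim}
import torch

def group_lasso_penalty(conv_weight: torch.Tensor) -> torch.Tensor:
    """Group-lasso term with one group per input channel.

    conv_weight has shape (N, C, Q, V) = (out_ch, in_ch=48, kH, kW);
    the penalty is sum_c ||W[:, c, :, :]||_2, which drives entire input
    channels (wavelet sub-bands) toward zero.
    """
    return conv_weight.permute(1, 0, 2, 3).flatten(1).norm(p=2, dim=1).sum()

# Usage sketch: total loss = classification loss + weighted penalty, e.g.
# loss = arcface_loss + lam * group_lasso_penalty(model.conv1.weight)
\end{verbatim}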
B. ArcFace Loss Function
Suppose we denote the set of parameters of our network as $w$ and the weights of the first convolutional layer as $w_{l1}$, partitioned into a set of groups $\mathcal{G}_{l1}$, one group per input channel. The regularized loss is: \begin{align*} \mathcal {L_{R}}(w) = \mathcal {L}_{cl.}(w)+\lambda \|w_{l1}\|_{1,2}= \mathcal {L}_{cl.}(w) +\lambda \sum _{g \in \mathcal {G}_{l1} } \|w_{l1}^{(g)}\|_{2}, \tag{1}\end{align*} where $\mathcal{L}_{cl.}$ is the classification loss and $\lambda$ controls the strength of the group sparsity penalty. Writing the groups explicitly over the $N$ filters and the $Q \times V$ kernel support for each of the $C$ input channels, Eq. 1 becomes:
\begin{equation*} \mathcal {L_{R}}(w) = \mathcal {L}_{cl.}(w)+\lambda \sum _{c=1}^{C}\sqrt {\sum _{n=1}^{N}\sum _{q=1}^{Q}\sum _{v=1}^{V} w_{l1}^{2}(n,c,q,v)}. \tag{2}\end{equation*}
As for the classification loss $\mathcal {L}_{cl.}$ in Eq. 2, we adopt the additive angular margin loss (ArcFace) [28], which has proved to enhance intra-class compactness and inter-class separation. Thus we can write the classification loss as: \begin{align*} -\frac {1}{M}\sum _{i=1}^{M}\log \frac {\exp (s\cos (\theta _{y_{i}}+m))} {\exp (s\cos (\theta _{y_{i}}+m))+\sum _{j=1,j\neq y_{i}}^{C}\exp (s\cos (\theta _{j}))}, \tag{3}\end{align*} where $M$ is the batch size, $C$ is the number of classes, $\theta_{j}$ is the angle between the $i$-th feature vector and the weight vector of class $j$, $s$ is the scaling factor, and $m$ is the additive angular margin.
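A hedged PyTorch sketch of Eq. 3, assuming $\ell_2$-normalized features and class-center weights; this is a generic ArcFace implementation, not necessarily the paper's exact code:

\begin{verbatim}
import torch
import torch.nn.functional as F

def arcface_loss(embeddings, labels, weight, s=64.0, m=0.5):
    """Additive angular margin loss of Eq. (3).

    embeddings: (M, d) feature vectors; weight: (C, d) class centers;
    labels: (M,) class indices. s and m follow the paper's settings.
    """
    # Cosine similarities between normalized features and class centers.
    cos = F.linear(F.normalize(embeddings), F.normalize(weight))
    cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
    theta = torch.acos(cos)
    target = F.one_hot(labels, num_classes=weight.shape[0]).bool()
    # Add the angular margin m only to the ground-truth class logit.
    logits = s * torch.where(target, torch.cos(theta + m), cos)
    return F.cross_entropy(logits, labels)
\end{verbatim}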
C. Spatial Feature Selection and Refinement
To select the most discriminative pixels, we investigate the application of three attention mechanisms that suppress spatial regions which do not contribute to the final decision of our morph detector. In other words, the integrated attention modules allow our DNN morph detector to focus on discriminative spatial regions (see Fig. 3). The three attention mechanisms are as follows:
1) Attention Mechanism I: Convolutional Block Attention Module (CBAM)
In our first attention module, shown in Fig. 3(a) and called Att. I, we employ channel and spatial attention. Specifically, we employ the Convolutional Block Attention Module (CBAM) [1], [84] to refine an intermediate feature map $\mathbf{F}$. The channel attention map $\mathbf{M_{c}}$ and the spatial attention map $\mathbf{M_{s}}$ are computed as: \begin{align*} \mathbf {M_{c}(F)}&=\sigma (MLP(AvgPool(\mathbf {F}))+MLP(MaxPool(\mathbf {F}))), \tag{4}\\ \mathbf {M_{s}(F)}&=\sigma (conv2D[AvgPool(\mathbf {F}),MaxPool(\mathbf {F})]),\tag{5}\end{align*} where $\sigma$ denotes the sigmoid function and $[\cdot,\cdot]$ denotes channel-wise concatenation.
The refined feature maps are obtained sequentially as: \begin{equation*} \mathbf {F^{\prime} } = \mathbf {M_{c}} \otimes \mathbf {F}, \quad \mathbf {F^{\prime \prime }} = \mathbf {M_{s}} \otimes \mathbf {F^{\prime} },\tag{6}\end{equation*} where $\otimes$ denotes element-wise multiplication with broadcasting.
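A minimal PyTorch sketch of Att. I (Eqs. 4-6); the reduction ratio and kernel size follow common CBAM defaults and are assumptions, not values reported by the paper:

\begin{verbatim}
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of Att. I: channel attention (Eq. 4) then spatial (Eq. 5)."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(              # shared MLP of Eq. (4)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Channel attention: pooled descriptors through the shared MLP.
        avg = f.mean(dim=(2, 3), keepdim=True)
        mx = f.amax(dim=(2, 3), keepdim=True)
        mc = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        f1 = mc * f                            # F' = Mc (x) F, Eq. (6)
        # Spatial attention: channel-wise pooling, then a conv layer.
        pooled = torch.cat([f1.mean(1, keepdim=True),
                            f1.amax(1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.conv(pooled))
        return ms * f1                         # F'' = Ms (x) F', Eq. (6)
\end{verbatim}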
2) Attention Mechanism II: Learn to Pay Attention
The soft attention mechanism [2], [42], called Att. II and shown in Fig. 3(b), computes the compatibility between each local feature vector $\boldsymbol {\ell }_{i}^{\mathbf {F}}$ of a given intermediate feature map $\mathbf{F}$ and the global feature vector $\boldsymbol{g}$: \begin{equation*} {c}_{i}^{\mathbf {F}} = \langle \boldsymbol {\ell }_{i}^{\mathbf {F}}, \boldsymbol g \rangle, \quad i \in \{1,2,\ldots,n\},\tag{7}\end{equation*} where $n$ is the number of spatial locations in $\mathbf{F}$.
The compatibility scores are normalized into attention weights via a softmax: \begin{equation*} {a}_{i}^{\mathbf {F}} = \frac {\exp ({c}_{i}^{\mathbf {F}})}{\sum _{j=1}^{n} \exp ({c}_{j}^{\mathbf {F}})}, \quad i \in \{1,2,\ldots,n\}.\tag{8}\end{equation*} Finally, the attention-weighted sum of the local feature vectors yields the attended global descriptor: \begin{equation*} \boldsymbol {g}_{a}^{\mathbf {F}} = \sum _{i=1}^{n} {a}_{i}^{\mathbf {F}} \boldsymbol {\ell }_{i}^{\mathbf {F}}.\tag{9}\end{equation*}
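A minimal PyTorch sketch of Att. II (Eqs. 7-9), assuming the global vector has already been projected to the channel dimension of the feature map (the paper applies a projection when the dimensions differ):

\begin{verbatim}
import torch
import torch.nn.functional as F

def soft_attention(feature_map: torch.Tensor, g: torch.Tensor):
    """Learn-to-pay-attention sketch.

    feature_map: (B, C, H, W) intermediate features; g: (B, C) global
    feature vector already matched to C channels.
    """
    b, c, h, w = feature_map.shape
    local = feature_map.flatten(2)                  # (B, C, n), n = H*W
    scores = torch.einsum("bcn,bc->bn", local, g)   # Eq. (7): <l_i, g>
    attn = F.softmax(scores, dim=1)                 # Eq. (8)
    g_a = torch.einsum("bcn,bn->bc", local, attn)   # Eq. (9)
    return g_a, attn.view(b, h, w)                  # descriptor + attention map
\end{verbatim}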
3) Attention Mechanism III: Self-Attentional Feature Maps
To further refine the intermediate feature maps of our DNN-based morph detector, we integrate self-attentional feature maps [3], [14] into our deep architecture, as shown in Fig. 3(c). The attention-augmented convolutional network employs the multi-headed self-attention used in transformer architectures [29], [78]. In this type of attention, multi-headed self-attention is applied to a given intermediate feature map of our DNN, producing a new set of augmented feature maps. In accordance with [14], concatenating the convolutional feature maps with the self-attentional feature maps yields the best performance.
Suppose $F$ denotes a flattened intermediate feature map, with one row per spatial location. The queries, keys, and values are computed through learned linear projections $W_{q}$, $W_{k}$, and $W_{v}$: \begin{equation*} q = FW_{q}, \quad k = FW_{k}, \quad v = FW_{v}.\tag{10}\end{equation*}
In the multi-head attention setting, the output of attention head $h$ is: \begin{equation*} O_{h} = Softmax\left({\frac {(FW_{q})(FW_{k})^{T}}{\sqrt {d_{k}^{h}}}}\right)(FW_{v}),\tag{11}\end{equation*} where $d_{k}^{h}$ is the key dimension of head $h$.
The outputs of all $N_{h}$ heads are concatenated and linearly projected by $W^{O}$: \begin{equation*} MHA(F) = [O_{1} || O_{2} || \ldots || O_{N_{h}}]W^{O},\tag{12}\end{equation*} and the attention-augmented convolution concatenates the standard convolutional features with the multi-head attention output: \begin{equation*} AAConv(F) = [Conv(F)||MHA(F)].\tag{13}\end{equation*}
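A hedged PyTorch sketch of Att. III (Eqs. 10-13) built on torch.nn.MultiheadAttention; unlike [14], this sketch omits relative positional encodings, and the channel split between convolutional and attentional features is illustrative:

\begin{verbatim}
import torch
import torch.nn as nn

class AugmentedConv(nn.Module):
    """Convolutional features concatenated with self-attention features."""
    def __init__(self, in_ch: int, conv_ch: int, attn_ch: int, heads: int = 8):
        super().__init__()
        # attn_ch must be divisible by heads for nn.MultiheadAttention.
        self.conv = nn.Conv2d(in_ch, conv_ch, 3, padding=1)
        self.proj = nn.Conv2d(in_ch, attn_ch, 1)  # input to W_q/W_k/W_v
        self.mha = nn.MultiheadAttention(attn_ch, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        # Flatten spatial grid into a sequence of H*W feature vectors.
        seq = self.proj(x).flatten(2).transpose(1, 2)  # (B, H*W, attn_ch)
        attn_out, _ = self.mha(seq, seq, seq)          # Eqs. (10)-(12)
        attn_maps = attn_out.transpose(1, 2).view(b, -1, h, w)
        return torch.cat([self.conv(x), attn_maps], dim=1)  # Eq. (13)
\end{verbatim}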
D. Arrangement of Feature Selection Schemes
There are different permutations for employing group sparsity for channel-wise feature selection together with the three attention modules Att. I, Att. II, and Att. III. We follow the curriculum learning paradigm [16], [37], [38] to incrementally incorporate the channel-wise and spatial feature selection/refinement modules. The curriculum learning premise highlights the benefit of initially training a DNN on easy tasks and presenting more difficult tasks in later stages, which gradually increases the complexity of the network's parameter space. Thus, we first train our wavelet-based DNN subject to the group sparsity constraint. Subsequently, we fine-tune the trained DNN using a modified structure that incorporates the different attention modules Att. I, Att. II, and Att. III. Furthermore, we consider different numbers of attention modules in Section IV-E.
Our experiments, delineated in the following sections, demonstrate the contribution of each attention mechanism to the accuracy of our deep morph detector. In other words, our results confirm the efficacy of Att. I, Att. II, and Att. III for capturing morphing artifacts.
E. Training Schedule
To find the most discriminative wavelet sub-bands, we first fine-tune our DNN, an Inception-ResNet-v1 [76] pretrained on VGGFace2 [17], with a modified loss function that applies group-lasso weight decay to the parameters of its first convolutional layer, using input images decomposed into 48 wavelet sub-bands. To accommodate our 48-channel input data, we change the number of input channels in the first layer of the original Inception-ResNet-v1 to 48. The hyperparameter $\lambda$ in Eq. 2 balances the classification loss against the group sparsity penalty and thereby controls how many sub-bands are retained; its tuning is described in Section IV-C.
Once the set of selected sub-bands is obtained, which constitutes the easy task in the context of curriculum learning, we continue fine-tuning our DNN using only these selected sub-bands. In other words, we shrink the number of input channels in our DNN and investigate the effect of adding each individual attention module Att. I, Att. II, and Att. III. Attention modules are inserted after different convolutional layers to improve the accuracy of detecting morphed images.
Evaluations
A. Datasets
We utilize the WVU Identical Twin Face Morph dataset [4], which consists of samples generated using four techniques: (1) landmark-based face morph generation, (2) StyleGAN-based face morph generation, (3) wavelet-based face morph generation, and (4) adversarially perturbed face morph generation. Of these four morph generation methods, in this study we use the landmark-based, StyleGAN-based, and adversarially perturbed morphs, dubbed Twin-Landmark, Twin-StyleGAN, and Twin-Perturbed, respectively. FRLL-Morphs [27], [64], FERET-Morphs [55], [64], and FRGC-Morphs [54], [64] are the other datasets we employ in this work. The FRLL-Morphs dataset is built upon the Face Research London Lab dataset using four different face morphing tools: (1) OpenCV [10], (2) FaceMorpher [7], (3) StyleGAN2 [44], and (4) WebMorpher [9]. The FERET-Morphs dataset, based on the color FERET database, is generated using the (1) OpenCV, (2) FaceMorpher, and (3) StyleGAN2 morphing tools. The FRGC-Morphs dataset is constructed using the same three tools. The size of each dataset is detailed in Table 2.
In addition to the above-mentioned datasets, we utilize the datasets employed in the single image morph detection tables of the MIPGAN paper [88], which were all constructed using the FRGC-V2 [54] face database. The datasets we use from the MIPGAN paper are as follows: Landmarks-I [8], [58], Landmarks-II [33], StyleGAN [43], [81], and MIPGAN-II [44].
B. Experimental Setup and Evaluation Metrics
Our core DNN is the Inception-ResNet-v1 [76], whose layers are as follows: 1) Stem block, 2) five Inception-resnet-A blocks, 3) Reduction-A block, 4) ten Inception-resnet-B blocks, 5) Reduction-B block, 6) five Inception-resnet-C blocks, 7) average pooling, 8) dropout, and 9) softmax. Details of each block can be found in [76]. As mentioned in Section III-A, we modify the original architecture to accommodate the 48-wavelet-sub-band input: since the filters of the first convolutional layer of the original Inception-ResNet-v1 have three channels (for RGB input), we replace them with 48-channel filters. Our DNNs are trained using the Adam [46] optimizer for 150 epochs, accelerated on two 12 GB TITAN X (Pascal) GPUs. We trained our DNNs in the PyCharm 2022.3.1 environment using PyTorch libraries on a Ubuntu 20.04.3 operating system. The learning rate is initially set to 0.001 and divided by 10 every 20 epochs. As for the ArcFace loss parameters, we set the scaling factor s = 64.0 and the margin m = 0.5 [82].
We report our results using the following metrics based on ISO/IEC 30107-3 [5]: Bona fide Presentation Classification Error Rate (BPCER), the proportion of bona fide presentations incorrectly classified as attack presentations (morphs); Attack Presentation Classification Error Rate (APCER), the proportion of attack presentations (morphs) incorrectly classified as bona fide; Detection Equal Error Rate (D-EER), the operating point where APCER equals BPCER; and the Area Under the Receiver Operating Characteristic Curve (AUROC). In the context of binary classification with the morph class as the positive class, BPCER and APCER correspond to the False Positive Rate (FPR) and False Negative Rate (FNR), respectively. In particular, we are interested in the following three operating points: 1) BPCER @ APCER = 5%, 2) BPCER @ APCER = 10%, and 3) BPCER @ APCER = 30%. AUROC is a threshold-independent metric providing a fair evaluation of the learned hypotheses.
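Under the positive-class convention just stated, these metrics can be computed from raw scores as sketched below (a generic implementation using scikit-learn, not the paper's evaluation code):

\begin{verbatim}
import numpy as np
from sklearn.metrics import roc_curve

def detection_metrics(labels, scores):
    """ISO/IEC 30107-3 style metrics from raw detector scores.

    labels: 1 = morph (attack), 0 = bona fide; scores: higher = more
    morph-like. With morph as the positive class, BPCER = FPR and
    APCER = FNR.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    bpcer, apcer = fpr, 1.0 - tpr
    # D-EER: operating point where APCER and BPCER (approximately) meet.
    deer = bpcer[np.nanargmin(np.abs(apcer - bpcer))]
    # Lowest BPCER whose APCER is at or below each fixed attack error rate.
    bpcer_at = {a: bpcer[np.argmax(apcer <= a)] for a in (0.05, 0.10, 0.30)}
    return deer, bpcer_at
\end{verbatim}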
C. Channel-Wise Feature Selection Via Group Lasso Weight Decay
In order to select the most discriminative wavelet sub-bands, we first train our DNN on the 48-wavelet-sub-band data using the WVU Twin-Landmark dataset. We perform a random search to tune the hyperparameter $\lambda$ in Eq. 2, which controls the strength of the group-lasso weight decay and, consequently, the number of wavelet sub-bands retained.
Fig. 4. The 48 wavelet sub-bands of a given morphed image. Our channel-wise feature selection scheme selects the six most discriminative wavelet sub-bands, which are ticked.
D. Feature Refinement Via Attention Mechanisms
We have integrated our three attention modules after the following layers: 1) "conv2d-3b", 2) "conv2d-4b", and 3) "mixed-7a", whose feature maps differ in spatial resolution and channel depth, allowing us to probe the attention mechanisms at multiple depths of the network.
We incorporate the attention mechanism Att. II, discussed in Section III-C2, into our DNN to acquire new sets of weighted feature vectors. To this end, the correlations between spatial locations in an intermediate feature map and the 512-D fully connected (FC) vector before the logits of our DNN are computed. The normalized correlation values decide which pixel locations remain active for morph detection and which are suppressed. Please note that, from an information-theoretic perspective, this attention module looks for the spatial feature locations that have the highest mutual information with the ground truth label Y. Att. II is inserted after "conv2d-3b" or "mixed-7a", where the number of channels is 64 and 1,792, respectively. Since the number of channels in the feature map and the dimension of the FC layer's output are not consistent, we use a linear projection to match the dimensions, following [42].
We also integrate the Att. III module into our DNN. The self-attentional augmented feature maps, detailed in Section III-C3, are concatenated with the "vanilla" convolutional feature maps to diversify the learned features. We assess the effectiveness of this multi-headed self-attention scheme by inserting the module after the "conv2d-3b" and "mixed-7a" layers, which have 80 and 1,792 feature maps, respectively. The results for this attention-augmented morph detection are benchmarked in Table 6. According to these results, incorporating Att. III improves morph detection accuracy on several datasets compared to Table 3, where no attention module is used; the improved results are highlighted in Table 6. In particular, employing Att. III decreases the error rates on the FERET-FaceMorpher, FERET-StyleGAN2, FRLL-WebMorpher, FRGC-FaceMorpher, and FRGC-OpenCV datasets.
E. Comparison With The State-of-the-Art
We compare the results of our attention-based morph detector with the results benchmarked in the MIPGAN paper [88]. The methodologies used in the MIPGAN paper are Ensemble Features [79] and Hybrid Features [62], abbreviated as Ensemble and Hybrid, respectively, in Table 7. The ensemble-of-features method fuses score-level morph detection results from three different feature descriptors: LBP, HOG, and BSIF. The Hybrid Features method, on the other hand, builds Laplacian pyramids in two color spaces (YCbCr and HSV) at three scales, extracts LBP features from every sub-image, feeds them to a Spectral Regression Kernel Discriminant Analysis (SRKDA) classifier, and fuses the scores of all sub-images for morph detection. Attention-based results on the datasets used in the MIPGAN paper are summarized in Table 7. Based on the benchmarked results, our attention-augmented morph detector decreases the error rates for several Train/Test scenarios; the improved results are highlighted in Table 7. In particular, employing Att. I decreases the error rates on the Landmarks-II dataset regardless of the training set, on the StyleGAN dataset when our DNN is trained on the Landmarks-I and MIPGAN-II datasets, and on the MIPGAN-II dataset when our DNN is trained on the Landmarks-I and StyleGAN datasets. In addition, employing Att. II decreases the error rates on the Landmarks-II dataset regardless of the training set, on the StyleGAN dataset when our DNN is trained on the MIPGAN-II dataset, and on the MIPGAN-II dataset when our DNN is trained on the Landmarks-I, Landmarks-II, and StyleGAN datasets. Employing Att. III decreases the error rates on the Landmarks-II dataset when our DNN is trained on the Landmarks-II dataset, on the StyleGAN dataset when our DNN is trained on the MIPGAN-II dataset, and on the MIPGAN-II dataset when our DNN is trained on the StyleGAN dataset.
Also, it is not uncommon for a travel document issuing/authentication agency to scan a submitted hard-copy facial image. To make our evaluation more realistic and inclusive, we therefore test our morph detectors on the printed-and-scanned (re-digitized) datasets used in the MIPGAN [88] paper. The morph detection performance on the printed-and-scanned versions of the datasets is summarized in Table 8. According to the benchmarked results, our attention-augmented morph detector decreases the detection error rates in several highlighted Train/Test scenarios, which substantiates the efficacy of our wavelet-based attention-augmented morph detector. In particular, employing Att. I decreases the error rates on the Landmarks-I dataset when our DNN is trained on the Landmarks-II and MIPGAN-II datasets, on the Landmarks-II dataset when trained on the Landmarks-I and MIPGAN-II datasets, and on the MIPGAN-II dataset when trained on the Landmarks-II dataset. In addition, employing Att. II decreases the error rates on the Landmarks-II dataset when trained on the Landmarks-I and MIPGAN-II datasets, and on the MIPGAN-II dataset when trained on the Landmarks-II dataset. Employing Att. III decreases the error rates on the Landmarks-II dataset when trained on the Landmarks-I and MIPGAN-II datasets, and on the MIPGAN-II dataset when trained on the Landmarks-II dataset.
We also assess the generalization ability of our framework against the state-of-the-art [39], [66] in Table 9, which evaluates morph detection performance on FRGC-FaceMorpher and FRGC-OpenCV. To this end, we fine-tune our trained Inception-ResNet-v1, including the attention modules Att. I, Att. II, and Att. III, on the FERET-FaceMorpher and FERET-OpenCV datasets. Please note that, in the PyTorch environment, we freeze all layers' parameters by setting "requires_grad = False" except for the final linear classifier layer. Based on the benchmarked results, our wavelet-based attention-augmented morph detector surpasses the prior works by a large margin in several train/test scenarios, which are highlighted in Table 9. In particular, employing Att. I decreases the error rates on the FRGC-FaceMorpher dataset when our DNN is trained on the FERET-FaceMorpher dataset, and on the FRGC-OpenCV dataset when trained on the FERET-OpenCV dataset. In addition, employing Att. II decreases the error rates on the FRGC-FaceMorpher dataset when trained on the FERET-FaceMorpher dataset, on the FRGC-FaceMorpher dataset when trained on FERET-OpenCV, and on the FRGC-OpenCV dataset when trained on the FERET-OpenCV dataset.
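The freezing step can be sketched as follows; the attribute name of the final classifier layer is an assumption made for illustration:

\begin{verbatim}
# Freeze all parameters except the final linear classifier before
# fine-tuning on the FERET-based training sets.
for param in model.parameters():
    param.requires_grad = False
for param in model.logits.parameters():  # final linear layer; name assumed
    param.requires_grad = True
\end{verbatim}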
We also investigate using different numbers of attention modules when training our deep morph detector. We add the attention modules Att. I, Att. II, and Att. III to several convolutional layers simultaneously and benchmark the results on the FERET, FRLL, and FRGC datasets, as shown in Table 10 and Table 11. Considering the results, using two modules, mainly Att. I and Att. II, considerably improves morph detection performance on the FRGC-FaceMorpher, FRGC-OpenCV, and FRGC-StyleGAN2 datasets. In addition, we assess the performance of our morph detector when all three attention modules, Att. I, Att. II, and Att. III, are added to our deep architecture; the resulting performance is benchmarked in Table 11. Integrating all three attention modules reduces the detection error rates on the FRLL-OpenCV, FRLL-WebMorpher, and FRGC-FaceMorpher datasets. All in all, our wavelet-based attention-augmented morph detector decreases the detection error rate on several highlighted datasets.
Most importantly, we contrast our attention-augmented morph detection performance on the MIPGAN-II dataset with the latest NIST Face Recognition Vendor Test (FRVT) report [6], updated on July 14, 2022. We compare our results with the NIST report using the two criteria APCER@BPCER = 0.01 and APCER@BPCER = 0.1. We report results on the MIPGAN-II dataset while our network with Att. I is fine-tuned on a universal dataset. This universal dataset, used for training our wavelet-based attention-augmented morph detector, includes all the datasets mentioned in Section IV-A plus the AMSL dataset [53], which consists of 2,175 morph and 204 bona fide samples. The morph detection results on the MIPGAN-II dataset in the NIST report format are summarized in Table 12. The benchmarked results demonstrate the efficacy of our morph detector, as the APCER@BPCER = 10% error rate is decreased.
F. Deep Morph Detector Visualization
In this section, the interpretability of our attention-based deep morph detector is investigated through two visualization tools: (1) attention maps and (2) Gradient-weighted Class Activation Maps (Grad-CAM). Attention maps [42], [50], [84] are powerful visualization and attribution [71], [72], [87] techniques that provide a visual explanation of a DNN's decision-making by highlighting the spatial regions most relevant to its output scores. In particular, attention maps are obtained by overlaying heat maps of the attention weights onto the original RGB images to highlight the most discriminative spatial regions in the eyes of the classifier. Grad-CAM is another visualization scheme demonstrating the functionality of our DNN: given a morphed image, the logits of the morphed class are expected to fire, which is revealed in the Grad-CAM plots. We follow the protocols adopted in the literature [42], [84] for the Att. I and Att. II modules, visualizing Grad-CAMs for the CBAM-integrated deep network and plotting attention maps for the adjusted network of [42].
The Grad-CAMs pertinent to Table 4, for both the "conv2d-3b" and "conv2d-4b" convolutional layers, are shown in Fig. 5. Moreover, the estimated attention maps of Table 5 for the "mixed-7a" convolutional layer are displayed in Fig. 6. As expected, the most discriminative spatial regions in the view of the morph detector lie in the vicinity of a subject's eyes.
Fig. 5. Grad-CAM visualizations of the CBAM-integrated deep morph detector (Table 4): (a) CBAM@conv2d-3b; (b) CBAM@conv2d-4b.
Conclusion
This article addresses single image morphing attack detection, in which emphasis on discriminative regions is realized through spatial and channel attention modules. In particular, we quantitatively demonstrated the efficacy of three visual attention modules for the downstream task of morph detection in a binary classification setting. The integrated attention modules perform feature refinement as well as feature selection, a form of representation learning. Specifically, a trainable soft attention mechanism, the convolutional block attention module, and multi-headed attention-augmented feature maps were utilized to improve morph detection accuracy on several datasets. In addition, we shifted the input data domain from RGB to the wavelet domain to take advantage of the fine-grained spatial-frequency information exposed by the wavelet decomposition.
Our benchmarked results on several datasets demonstrate the effectiveness of our attention-based morph detector. Most importantly, we contrasted the generalization performance of our attention-augmented morph detection scheme with state-of-the-art results to demonstrate the efficacy of the proposed architectures. Moreover, estimated attention maps and Grad-CAM visualizations were included to demonstrate the interpretability of our morph detector: heat maps overlaid on the original images reveal the most discriminative spatial regions driving our attention-augmented morph detectors toward an accurate decision when labeling probe images as bona fide or morphed. Finally, to realize multi-attentional morph detection, we assessed performance using pairs of the attention modules Att. I, Att. II, and Att. III, and we also trained our attention-augmented morph detector using all three modules, with the corresponding results benchmarked in the tables referenced in Section IV-E.
Future Work and Limitations
As future work, we are considering deploying more diverse visual attention mechanisms to increase morph detection accuracy. In addition, the transformer architecture has inspired a great number of attention mechanisms that can be beneficial for single image morph detection. A limitation of our work stems from the many possible permutations of the attention modules in our DNN, as well as of the layers at which they are integrated; an exhaustive investigation of all placements is left for future study.