Introduction
Morphed face detection has attracted a surge of interest in the biometric and vision communities [24], [39], [57], [61]. Facial image morphing attacks pose a serious threat to the functionality of face recognition systems, especially those deployed at borders [32]. Using a morphed image, a criminal can share a passport with an innocent accomplice to evade identification and detection: because both the criminal's and the accomplice's faces verify against the morphed photo, the criminal can travel on a passport legitimately issued to the accomplice. A large body of research is devoted to generating morphed facial images, mostly either by manipulating the geometric characteristics of two bona fide subjects' images [22], [51], [69] or by using generative networks such as Generative Adversarial Networks (GANs) [22], [64], [88].
The mainstream methods for morph detection fall into two categories: single image morph detection [66], [80] and differential morph detection [66], [80]. In the former, the goal is to identify a probe image as either bona fide or morphed without any auxiliary information. The latter additionally takes into account a live image of the subject under investigation to classify the probe image as bona fide or morphed. State-of-the-art multi-class classifiers and object detection frameworks benefit from rich visual abstractions realized through representation learning techniques [15]. Generally speaking, face morph detection can be reformulated as learning discriminative cues that define a decision boundary separating bona fide images from morphed ones in a binary classification setting. Since artifacts in a morphed image are local, the discrepancy between a morphed image and the corresponding bona fide image can be detected using fine-grained features [93].
The 2D wavelet decomposition [13], [18] provides useful insight into the joint spatial-frequency information embedded in a 2D image. Wavelet sub-bands can be thought of as fine-grained features with variable granularity: the decomposition exposes hidden information through a spatial-frequency representation, and the resulting sub-bands can be harnessed to isolate artifacts in a given morphed image at different spatial-frequency granularities.
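As a concrete illustration of a single decomposition level, the following minimal sketch (using the PyWavelets library, with a Daubechies wavelet chosen purely for illustration) splits a grayscale image into one approximation (LL) and three detail (LH, HL, HH) sub-bands:

\begin{verbatim}
import numpy as np
import pywt

# Stand-in for a grayscale face image; any 2D array works here.
img = np.random.rand(112, 112)

# One level of 2D DWT: LL is the coarse approximation; LH/HL/HH carry the
# horizontal/vertical/diagonal high-frequency detail where local morphing
# artifacts tend to reside.
LL, (LH, HL, HH) = pywt.dwt2(img, "db2")
print(LL.shape, LH.shape)  # each sub-band is roughly half resolution
\end{verbatim}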
Feature learning plays a pivotal role in mainstream computer vision tasks such as image classification. In particular, sparse representation learning methods have proved to be powerful tools for face recognition applications. Due to the NP-hardness of sparsity-constrained optimization, a relaxed version of the sparsity condition is typically enforced using the $\ell_{1}$-norm.
Recently, visual attention mechanisms have initiated a renaissance in image recognition and classification. "Visual explanation" [34], underlying the attention mechanism, uncovers which regions a deep neural network focuses on to form its final decision for a given downstream task. In other words, an attention mechanism forces a DNN to focus on the most informative regions, i.e., those contributing the most to the learned hypothesis, leading to more accurate classification [34], [42], [84], [90], [94]. The feature refinement realized by spatial- and channel-wise attention [84] provides a rich representation that can increase inter-class separability while minimizing intra-class dispersion. Also, vision transformers [49], which have shifted the classification-accuracy paradigm without using any convolutional operations, benefit from the multi-headed self-attention mechanism, which also plays a pivotal role in this study.
This work investigates the application of group sparsity [83], a soft attention mechanism [42], and self-attention [14], [84] for morph detection. The spatial-frequency content of an image provides useful information, such as subtle discrepancies between a bona fide image and its morphed counterpart. We decompose every input image using a multilevel 2D wavelet decomposition to extract coarse-to-fine spatial-frequency wavelet sub-bands, which serve as powerful representations for training a DNN morph detector. Our group-sparsity-constrained optimization framework drives our DNN morph classifier toward a sparse solution in which wavelet sub-bands bearing minimal discriminative information are discarded. Thus, we select a subset of the most discriminative wavelet sub-bands, an implicit feature selection mechanism. In addition, attention modules customized to our model guide the network to pinpoint the most informative spatial pixels as well as the most information-bearing channels in a given intermediate feature map. As shown in Fig. 1, we incorporate three types of visual attention mechanisms into our DNN-based morph detector, guiding it to mine the spatial regions with the highest density of morphing artifacts. Namely, we employ the spatial and channel attention modules introduced in CBAM [84], which we call Att. I; the end-to-end soft attention mechanism delineated in [42], called Att. II; and the self-attention augmented feature maps, called Att. III hereafter. Through extensive experiments, we demonstrate the advantage of these three attention mechanisms for improving morph detection accuracy. To increase intra-class compactness and inter-class dispersion, we employ the additive angular margin loss function (ArcFace) to obtain highly discriminative features. We demonstrate the efficacy of our framework through extensive experiments on the morph detection datasets described in Section IV-A.
Our attention-augmented framework adopts three different attention mechanisms, i.e., Att. I, Att. II, and Att. III, to increase morph detection accuracy. The Att. I module, the convolutional block attention, uses max-pooling and average-pooling to compute channel and spatial attention maps that highlight discriminative spatial pixels in a given set of feature maps. Att. II determines informative spatial pixel locations by computing the correlation between each spatial location in a given feature map (the local feature vectors) and the output of a fully connected (FC) layer (the global feature vector) of the DNN. Finally, Att. III yields augmented feature maps by concatenating the convolutional feature maps with their corresponding self-attentional feature maps.
The organization of the paper is as follows: In Section II, we review the literature on morph generation, morph detection, sparse representation learning, and attention mechanisms. In Section III, we delineate our methodology for improving morph detection accuracy. In Section IV, we present our experiments and results. Finally, in Sections V and VI we conclude our work and discuss future work and limitations. Our contributions in this paper are outlined as follows:
Instead of the RGB domain, we leverage the wavelet domain to extract rich spatial-frequency features, i.e., the wavelet sub-bands of input images.
We employ group sparsity to select the most discriminative wavelet sub-bands as a feature selection scheme for increasing morph detection accuracy.
We integrate three different types of visual attention modules into our DNN to highlight informative spatial areas of input images, which can decrease morph detection error rates.
Related Works
A. Morph Generation
Generative methods have considerably shifted the photo-realistic image synthesis paradigm [35], [43], [47], [88]. Face morphing attacks are synthesized either through alpha blending in the spatial domain [22], [45], [51] or in the latent domain [22], [88]. Interpolation in the spatial pixel domain, known as landmark-based morphing [22], translates the facial landmarks of the two underlying subjects to common averaged locations, and the final morphed image is generated by warping and alpha-blending. Alternatively, generative networks such as GANs have shown compelling results for synthesizing morphed images. In [22], an encoder is added to a GAN architecture to map the image domain into the latent domain. Once the modified GAN architecture is fully trained, two images are mapped into the latent domain, where latent alpha-blending yields a morphed latent vector. In other words, if $z_1$ and $z_2$ denote the latent representations of the two contributing subjects, the morphed latent vector is $z_m = \alpha z_1 + (1-\alpha) z_2$, which is then decoded back into the image domain.
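The latent alpha-blending step just described can be sketched as follows; the encode and decode hooks stand in for a pretrained encoder/generator pair such as the modified GAN of [22] and are named here purely for illustration:

\begin{verbatim}
def latent_morph(img_a, img_b, encode, decode, alpha: float = 0.5):
    """Latent-domain morphing: blend two latent codes and decode the result.

    encode/decode are assumed hooks into a pretrained GAN equipped with an
    encoder; alpha = 0.5 weights both contributing subjects equally.
    """
    z_a, z_b = encode(img_a), encode(img_b)
    z_m = alpha * z_a + (1.0 - alpha) * z_b  # latent alpha-blending
    return decode(z_m)
\end{verbatim}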
B. Morph Detection
Morph detection has been addressed under two scenarios. In the first, called single image morph detection, a single image is classified as either bona fide or morphed [66], [80]. In the second, called differential morph detection, auxiliary information, namely a live image of the subject, is used to label an image as either bona fide or morphed [18], [67], [73], [74]. The scope of this work is to design a single image morph detector. The methods proposed for single image morph detection are mostly categorized into those using hand-crafted features [26], [65] and those using deep embedding features [12], [60]. State-of-the-art methods for single image morph detection are summarized in Table 1.
In [26], the authors exploited inconsistencies in the Photo Response Non-Uniformity (PRNU) of bona fide and morphed images as a proxy for detecting morphed images. Aghdaie et al. [12] employed an attention mechanism to localize morphing artifacts in the wavelet domain. Seibold et al. [68] trained binary morph classifiers based on VGG19, AlexNet, and GoogLeNet, where the train and test sets were augmented with engineered noise. The work in [25] employed three methods, i.e., a CNN, Local Binary Pattern Histograms (LBPH), and PRNU, to detect both landmark-based morphing attacks and morphed images generated by generative adversarial networks. Pixel-wise supervision was proposed in [23] to improve the generalization of a morph detector. In [57], different modalities of a single image, such as the eyes, nose, and mouth, were used to improve morph detection accuracy. In addition, a multi-scale attention-based network was developed to detect morphed images, applying the attention mechanism to images at different scales [89]. Moreover, feature-wise supervision was used in [56] to generate a prediction map for single and differential morph detection.
C. Sparse Representation Learning
Sparse signal representation is an important class of representation learning methods that provides a compressed version of a high-dimensional signal [91]. Images are naturally sparse with respect to some predefined bases, which is why sparse representation learning is beneficial for image recognition tasks. More importantly, sparse representations have led to promising performance in face recognition [31], [36], [86]. Structured group sparsity [83] has also proved compelling for learning representations that are more discriminative. Analogous to finding principal components, learning a sparse representation limits the degrees of freedom when searching for an optimal hypothesis.
D. Attention Mechanism
Visual attention mechanisms [19], [42], [84] have introduced a paradigm shift in mainstream computer vision tasks. In a typical attention mechanism, the correlation between each spatial location in a feature map and the response of the network, usually the output of the last fully-connected layer, designates that location's importance in terms of its contribution to the final prediction. The attention weights thus quantify the importance of the spatial pixels. In the self-attention mechanism [63], [85], [92], on the other hand, the long-range dependencies between a pixel location in a feature map and all other pixel locations are modeled irrespective of the network's output.
The attention mechanism can guide a classifier toward the most discriminative local patches of an input image, where subtle anomalies are captured. Morph detection can be thought of as fine-grained classification because the differences between a bona fide and a morphed image are local and subtle, which is why the attention mechanism has proved useful for morph detection [12]. The attention mechanism can be soft [42], a differentiable process trained via back-propagation, or hard, which adopts stochastic sampling to select the most discriminative pixels and is trained using the REINFORCE method [75].
Self-attention [29], [78], [92] has emerged as a powerful mechanism for boosting image recognition performance. Self-attentional networks can be implemented as stand-alone frameworks without any convolutional operations (e.g., vision transformers [29], [49]) for image recognition or object detection, which is beyond the scope of this study. Alternatively, several works have integrated self-attentional modules into the convolutional layers of a DNN [14], [20], [41] as a feature augmentation method to capture long-range dependencies that are not revealed by the local convolution operation.
Methodology
In this paper, we propose a morph detector which leverages: (1) group sparsity for capturing the most discriminative wavelet sub-bands of a given facial image (see Fig. 2), and (2) visual attention mechanisms which drive our morph detector toward the most informative spatial and channel-wise regions to facilitate detecting morphed faces (see Fig. 3). To evaluate the group sparsity and attention mechanisms in detail, we first delve into the application of group sparsity as a representation learning scheme. The effect of each attention mechanism is then investigated separately to assess the improvement in morph detection due to an attention-based network. Finally, we train our wavelet-based attention-augmented morph detector, which includes all three types of attention modules. The final objective of this paper is the joint optimization of the group sparsity and attention mechanisms.
Our morph detection methodology selects the most discriminative wavelet sub-bands of input images, which increases morph detection accuracy according to our extensive experimental evaluations.
Our morph detection framework focuses on discriminative spatial regions in the selected wavelet sub-bands through three different types of visual attention mechanisms, called (a) Att. I, (b) Att. II, and (c) Att. III, which further increases morph detection accuracy according to our extensive experimental evaluations.
From the information-theoretic perspective, an optimal DNN architecture must meet the following conditions [77]: (1) minimizing the mutual information between an intermediate feature map at a given layer and the input, while (2) maximizing the mutual information between that feature map and the target label, i.e., the information bottleneck principle.
A. Channel-Wise Feature Selection
In this study, instead of experimenting on images in the original RGB domain, we decompose all images using a 2D wavelet decomposition, which enables us to exploit fine-grained information in the spatial-frequency domain. The wavelet domain has proved to be a rich representation that provides information at different granularities. We extract the most useful spatial-frequency information, realized through the sub-band selection detailed in this subsection, which helps us localize morphing artifacts more accurately than in the RGB domain. To this end, we adopt an undecimated uniform 2D wavelet decomposition. Since the wavelet decomposition operates on single-channel inputs, the RGB images are first converted to grayscale using the OpenCV RGB-to-grayscale conversion function before being passed to the wavelet decomposition module. We apply three levels of wavelet decomposition and, from the resulting 64 wavelet sub-bands, keep the 48 sub-bands that represent the high-frequency spectra; in other words, we discard the Low-Low (LL) sub-band obtained after the first level of decomposition, along with its descendants. Our objectives are as follows: (1) channel-wise feature selection to select, from these 48 sub-bands, the most discriminative wavelet sub-bands for distinguishing bona fide images from morphed ones, and (2) spatial feature selection, employing different attention mechanisms to localize the most discriminative pixels in the selected wavelet sub-bands.
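The sub-band bookkeeping can be illustrated with PyWavelets. Note that pywt.WaveletPacket2D is a decimated packet transform, whereas the paper uses an undecimated variant, so the sketch below only shows which of the 64 level-3 packet nodes are kept (the wavelet family is an arbitrary choice):

\begin{verbatim}
import numpy as np
import pywt

def wavelet_subbands(gray_img: np.ndarray, wavelet: str = "db2") -> np.ndarray:
    """Return the 48 high-frequency level-3 wavelet-packet sub-bands.

    A full 3-level 2D wavelet-packet tree has 4**3 = 64 leaf sub-bands.
    Discarding the level-1 LL node (path 'a') removes its 16 level-3
    descendants, leaving 48 sub-bands.
    """
    wp = pywt.WaveletPacket2D(data=gray_img, wavelet=wavelet,
                              mode="symmetric", maxlevel=3)
    nodes = [n for n in wp.get_level(3, order="natural")
             if not n.path.startswith("a")]
    assert len(nodes) == 48
    return np.stack([n.data for n in nodes], axis=0)  # (48, H', W')
\end{verbatim}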
We adopt a group-sparsity feature selection scheme to select the most discriminative wavelet sub-bands, referred to above as sub-band selection, for a given input image. Our implicit feature selection scheme is realized by imposing a group sparsity constraint on the parameters of the first convolutional layer of our DNN morph detector (see Fig. 2). We select the most discriminative wavelet sub-bands by discarding those whose corresponding kernel weights in the first convolutional layer converge to zero, thanks to the enforced group sparsity constraint. Note that the input images are composed of 48 channels, one per wavelet sub-band.
As discussed above, to select the wavelet sub-bands, i.e., to perform channel-wise feature selection, we impose a group sparsity constraint on the parameters of the first convolutional layer of our DNN. Integrating a group sparsity term on the weights of the first convolutional layer into the classification loss, known as a structured sparsity regularization penalty, drives the network to sparsify the grouped weights of that layer, which implicitly discards non-discriminative wavelet sub-bands. Consequently, a subset of informative wavelet sub-bands, out of the 48 sub-bands, is selected. In other words, after training our DNN, some of the grouped weights shrink to zero, and the wavelet sub-bands associated with those groups are discarded.
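A minimal PyTorch sketch of such a group-lasso penalty, with one group per input channel (wavelet sub-band), matching the structured term formalized in Eq. 2 of the next subsection; the tensor layout follows PyTorch's Conv2d weight convention:

\begin{verbatim}
import torch

def group_lasso_penalty(conv_weight: torch.Tensor) -> torch.Tensor:
    """Group-lasso term with one group per input channel.

    conv_weight has shape (N, C, Q, V) = (out_ch, in_ch=48, kH, kW);
    the penalty is sum_c ||W[:, c, :, :]||_2, which drives entire input
    channels (wavelet sub-bands) toward zero.
    """
    return conv_weight.permute(1, 0, 2, 3).flatten(1).norm(p=2, dim=1).sum()

# Usage sketch: total loss = classification loss + weighted penalty, e.g.
# loss = arcface_loss + lam * group_lasso_penalty(model.conv1.weight)
\end{verbatim}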
B. ArcFace Loss Function
Suppose we denote the set of parameters of our network as $w$ and the weights of the first convolutional layer as $w_{l1}$, partitioned into a set of groups $\mathcal{G}_{l1}$, one group per input channel. The regularized loss is: \begin{align*} \mathcal {L_{R}}(w) = \mathcal {L}_{cl.}(w)+\lambda \|w_{l1}\|_{1,2}= \mathcal {L}_{cl.}(w) +\lambda \sum _{g \in \mathcal {G}_{l1} } \|w_{l1}^{(g)}\|_{2}, \tag{1}\end{align*} where $\mathcal{L}_{cl.}$ is the classification loss and $\lambda$ controls the strength of the group sparsity penalty. Writing the groups explicitly over the $N$ filters and the $Q \times V$ kernel support for each of the $C$ input channels, Eq. 1 becomes:
\begin{equation*} \mathcal {L_{R}}(w) = \mathcal {L}_{cl.}(w)+\lambda \sum _{c=1}^{C}\sqrt {\sum _{n=1}^{N}\sum _{q=1}^{Q}\sum _{v=1}^{V} w_{l1}^{2}(n,c,q,v)}. \tag{2}\end{equation*}
As for the classification loss $\mathcal {L}_{cl.}$ in Eq. 2, we adopt the additive angular margin loss (ArcFace) [28], which has proved to enhance intra-class compactness and inter-class separation. Thus we can write the classification loss as: \begin{align*} -\frac {1}{M}\sum _{i=1}^{M}\log \frac {\exp (s\cos (\theta _{y_{i}}+m))} {\exp (s\cos (\theta _{y_{i}}+m))+\sum _{j=1,j\neq y_{i}}^{C}\exp (s\cos (\theta _{j}))}, \tag{3}\end{align*} where $M$ is the batch size, $C$ is the number of classes, $\theta_{j}$ is the angle between the $i$-th feature vector and the weight vector of class $j$, $s$ is the scaling factor, and $m$ is the additive angular margin.
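A hedged PyTorch sketch of Eq. 3, assuming $\ell_2$-normalized features and class-center weights; this is a generic ArcFace implementation, not necessarily the paper's exact code:

\begin{verbatim}
import torch
import torch.nn.functional as F

def arcface_loss(embeddings, labels, weight, s=64.0, m=0.5):
    """Additive angular margin loss of Eq. (3).

    embeddings: (M, d) feature vectors; weight: (C, d) class centers;
    labels: (M,) class indices. s and m follow the paper's settings.
    """
    # Cosine similarities between normalized features and class centers.
    cos = F.linear(F.normalize(embeddings), F.normalize(weight))
    cos = cos.clamp(-1 + 1e-7, 1 - 1e-7)
    theta = torch.acos(cos)
    target = F.one_hot(labels, num_classes=weight.shape[0]).bool()
    # Add the angular margin m only to the ground-truth class logit.
    logits = s * torch.where(target, torch.cos(theta + m), cos)
    return F.cross_entropy(logits, labels)
\end{verbatim}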
C. Spatial Feature Selection and Refinement
To select the most discriminative pixels, we investigate the application of three attention mechanisms that suppress spatial regions which do not contribute to the final decision of our morph detector. In other words, the integrated attention modules allow our DNN morph detector to focus on discriminative spatial regions (see Fig. 3). The three attention mechanisms are as follows:
1) Attention Mechanism I: Convolutional Block Attention Module (CBAM)
In our first attention module, shown in Fig. 3(a) and called Att. I, we employ channel and spatial attention. Specifically, we employ the Convolutional Block Attention Module (CBAM) [1], [84] to refine an intermediate feature map $\mathbf{F}$. The channel attention map $\mathbf{M_{c}}$ and the spatial attention map $\mathbf{M_{s}}$ are computed as: \begin{align*} \mathbf {M_{c}(F)}&=\sigma (MLP(AvgPool(\mathbf {F}))+MLP(MaxPool(\mathbf {F}))), \tag{4}\\ \mathbf {M_{s}(F)}&=\sigma (conv2D[AvgPool(\mathbf {F}),MaxPool(\mathbf {F})]),\tag{5}\end{align*} where $\sigma$ denotes the sigmoid function and $[\cdot,\cdot]$ denotes channel-wise concatenation.
The refined feature maps are obtained sequentially as: \begin{equation*} \mathbf {F^{\prime} } = \mathbf {M_{c}} \otimes \mathbf {F}, \quad \mathbf {F^{\prime \prime }} = \mathbf {M_{s}} \otimes \mathbf {F^{\prime} },\tag{6}\end{equation*} where $\otimes$ denotes element-wise multiplication with broadcasting.
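A minimal PyTorch sketch of Att. I (Eqs. 4-6); the reduction ratio and kernel size follow common CBAM defaults and are assumptions, not values reported by the paper:

\begin{verbatim}
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Sketch of Att. I: channel attention (Eq. 4) then spatial (Eq. 5)."""
    def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
        super().__init__()
        self.mlp = nn.Sequential(              # shared MLP of Eq. (4)
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # Channel attention: pooled descriptors through the shared MLP.
        avg = f.mean(dim=(2, 3), keepdim=True)
        mx = f.amax(dim=(2, 3), keepdim=True)
        mc = torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        f1 = mc * f                            # F' = Mc (x) F, Eq. (6)
        # Spatial attention: channel-wise pooling, then a conv layer.
        pooled = torch.cat([f1.mean(1, keepdim=True),
                            f1.amax(1, keepdim=True)], dim=1)
        ms = torch.sigmoid(self.conv(pooled))
        return ms * f1                         # F'' = Ms (x) F', Eq. (6)
\end{verbatim}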
2) Attention Mechanism II: Learn to Pay Attention
The soft attention mechanism [2], [42], called Att. II and shown in Fig. 3(b), computes the compatibility between each local feature vector $\boldsymbol {\ell }_{i}^{\mathbf {F}}$ of a given intermediate feature map $\mathbf{F}$ and the global feature vector $\boldsymbol{g}$: \begin{equation*} {c}_{i}^{\mathbf {F}} = \langle \boldsymbol {\ell }_{i}^{\mathbf {F}}, \boldsymbol g \rangle, \quad i \in \{1,2,\ldots,n\},\tag{7}\end{equation*} where $n$ is the number of spatial locations in $\mathbf{F}$.
The compatibility scores are normalized into attention weights via a softmax: \begin{equation*} {a}_{i}^{\mathbf {F}} = \frac {\exp ({c}_{i}^{\mathbf {F}})}{\sum _{j=1}^{n} \exp ({c}_{j}^{\mathbf {F}})}, \quad i \in \{1,2,\ldots,n\}.\tag{8}\end{equation*} Finally, the attention-weighted sum of the local feature vectors yields the attended global descriptor: \begin{equation*} \boldsymbol {g}_{a}^{\mathbf {F}} = \sum _{i=1}^{n} {a}_{i}^{\mathbf {F}} \boldsymbol {\ell }_{i}^{\mathbf {F}}.\tag{9}\end{equation*}
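A minimal PyTorch sketch of Att. II (Eqs. 7-9), assuming the global vector has already been projected to the channel dimension of the feature map (the paper applies a projection when the dimensions differ):

\begin{verbatim}
import torch
import torch.nn.functional as F

def soft_attention(feature_map: torch.Tensor, g: torch.Tensor):
    """Learn-to-pay-attention sketch.

    feature_map: (B, C, H, W) intermediate features; g: (B, C) global
    feature vector already matched to C channels.
    """
    b, c, h, w = feature_map.shape
    local = feature_map.flatten(2)                  # (B, C, n), n = H*W
    scores = torch.einsum("bcn,bc->bn", local, g)   # Eq. (7): <l_i, g>
    attn = F.softmax(scores, dim=1)                 # Eq. (8)
    g_a = torch.einsum("bcn,bn->bc", local, attn)   # Eq. (9)
    return g_a, attn.view(b, h, w)                  # descriptor + attention map
\end{verbatim}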
3) Attention Mechanism III: Self-Attentional Feature Maps
To further refine the intermediate feature maps of our DNN-based morph detector, we integrate self-attentional feature maps [3], [14] into our deep architecture, as shown in Fig. 3(c). The attention-augmented convolutional network employs the multi-headed self-attention used in transformer architectures [29], [78]. In this type of attention, multi-headed self-attention is applied to a given intermediate feature map of our DNN, producing a new set of augmented feature maps. In accordance with [14], concatenating the convolutional feature maps with the self-attentional feature maps yields the best performance.
Suppose $F$ denotes a flattened intermediate feature map, with one row per spatial location. The queries, keys, and values are computed through learned linear projections $W_{q}$, $W_{k}$, and $W_{v}$: \begin{equation*} q = FW_{q}, \quad k = FW_{k}, \quad v = FW_{v}.\tag{10}\end{equation*}
In the multi-head attention setting, the output of attention head $h$ is: \begin{equation*} O_{h} = Softmax\left({\frac {(FW_{q})(FW_{k})^{T}}{\sqrt {d_{k}^{h}}}}\right)(FW_{v}),\tag{11}\end{equation*} where $d_{k}^{h}$ is the key dimension of head $h$.
The outputs of all $N_{h}$ heads are concatenated and linearly projected by $W^{O}$: \begin{equation*} MHA(F) = [O_{1} || O_{2} || \ldots || O_{N_{h}}]W^{O},\tag{12}\end{equation*} and the attention-augmented convolution concatenates the standard convolutional features with the multi-head attention output: \begin{equation*} AAConv(F) = [Conv(F)||MHA(F)].\tag{13}\end{equation*}
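A hedged PyTorch sketch of Att. III (Eqs. 10-13) built on torch.nn.MultiheadAttention; unlike [14], this sketch omits relative positional encodings, and the channel split between convolutional and attentional features is illustrative:

\begin{verbatim}
import torch
import torch.nn as nn

class AugmentedConv(nn.Module):
    """Convolutional features concatenated with self-attention features."""
    def __init__(self, in_ch: int, conv_ch: int, attn_ch: int, heads: int = 8):
        super().__init__()
        # attn_ch must be divisible by heads for nn.MultiheadAttention.
        self.conv = nn.Conv2d(in_ch, conv_ch, 3, padding=1)
        self.proj = nn.Conv2d(in_ch, attn_ch, 1)  # input to W_q/W_k/W_v
        self.mha = nn.MultiheadAttention(attn_ch, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, _, h, w = x.shape
        # Flatten spatial grid into a sequence of H*W feature vectors.
        seq = self.proj(x).flatten(2).transpose(1, 2)  # (B, H*W, attn_ch)
        attn_out, _ = self.mha(seq, seq, seq)          # Eqs. (10)-(12)
        attn_maps = attn_out.transpose(1, 2).view(b, -1, h, w)
        return torch.cat([self.conv(x), attn_maps], dim=1)  # Eq. (13)
\end{verbatim}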
D. Arrangement of Feature Selection Schemes
There are different permutations for employing group sparsity for channel-wise feature selection together with the three attention modules Att. I, Att. II, and Att. III. We follow the curriculum learning paradigm [16], [37], [38] to incrementally incorporate the channel-wise and spatial feature selection/refinement modules. The curriculum learning premise highlights the benefit of initially training a DNN on easy tasks and presenting more difficult tasks in later stages, which gradually increases the complexity of the network's parameter space. Thus, we first train our wavelet-based DNN subject to the group sparsity constraint. Subsequently, we fine-tune the trained DNN using a modified structure that incorporates the different attention modules Att. I, Att. II, and Att. III. Furthermore, we consider different numbers of attention modules in Section IV-E.
Our experiments, delineated in the following sections, demonstrate the contribution of each attention mechanism to the accuracy of our deep morph detector. In other words, our results confirm the efficacy of Att. I, Att. II, and Att. III for capturing morphing artifacts.
E. Training Schedule
To find the most discriminative wavelet sub-bands, we first fine-tune our DNN, an Inception-ResNet-v1 [76] pretrained on VGGFace2 [17], with a modified loss function that applies group-lasso weight decay to the parameters of its first convolutional layer, using input images decomposed into 48 wavelet sub-bands. To accommodate our 48-channel input data, we change the number of input channels in the first layer of the original Inception-ResNet-v1 to 48. The hyperparameter $\lambda$ in Eq. 2 balances the classification loss against the group sparsity penalty and thereby controls how many sub-bands are retained; its tuning is described in Section IV-C.
Once the set of selected sub-bands is obtained, which constitutes the easy task in the context of curriculum learning, we continue fine-tuning our DNN using only these selected sub-bands. In other words, we shrink the number of input channels in our DNN and investigate the effect of adding each individual attention module Att. I, Att. II, and Att. III. Attention modules are inserted after different convolutional layers to improve the accuracy of detecting morphed images.
Evaluations
A. Datasets
We utilize the WVU Identical Twin Face Morph dataset [4], which consists of samples generated using four techniques: (1) landmark-based face morph generation, (2) StyleGAN-based face morph generation, (3) wavelet-based face morph generation, and (4) adversarially perturbed face morph generation. Of these four morph generation methods, in this study we use the landmark-based, StyleGAN-based, and adversarially perturbed morphs, dubbed Twin-Landmark, Twin-StyleGAN, and Twin-Perturbed, respectively. FRLL-Morphs [27], [64], FERET-Morphs [55], [64], and FRGC-Morphs [54], [64] are the other datasets we employ in this work. The FRLL-Morphs dataset is built upon the Face Research London Lab dataset using four different face morphing tools: (1) OpenCV [10], (2) FaceMorpher [7], (3) StyleGAN2 [44], and (4) WebMorpher [9]. The FERET-Morphs dataset, based on the color FERET database, is generated using the (1) OpenCV, (2) FaceMorpher, and (3) StyleGAN2 morphing tools. The FRGC-Morphs dataset is constructed using the same three tools. The size of each dataset is detailed in Table 2.
In addition to the above-mentioned datasets, we utilize the datasets employed in the single image morph detection tables of the MIPGAN paper [88], which were all constructed using the FRGC-V2 [54] face database. The datasets we use from the MIPGAN paper are as follows: Landmarks-I [8], [58], Landmarks-II [33], StyleGAN [43], [81], and MIPGAN-II [44].
B. Experimental Setup and Evaluation Metrics
Our core DNN is the Inception-ResNet-v1 [76], whose layers are as follows: 1) Stem block, 2) five Inception-resnet-A blocks, 3) Reduction-A block, 4) ten Inception-resnet-B blocks, 5) Reduction-B block, 6) five Inception-resnet-C blocks, 7) average pooling, 8) dropout, and 9) softmax. Details of each block can be found in [76]. As mentioned in Section III-A, we modify the original architecture to accommodate the 48-wavelet-sub-band input: since the filters of the first convolutional layer of the original Inception-ResNet-v1 have three channels (for RGB input), we replace them with 48-channel filters. Our DNNs are trained using the Adam [46] optimizer for 150 epochs, accelerated on two 12 GB TITAN X (Pascal) GPUs. We trained our DNNs in the PyCharm 2022.3.1 environment using PyTorch libraries on a Ubuntu 20.04.3 operating system. The learning rate is initially set to 0.001 and divided by 10 every 20 epochs. As for the ArcFace loss parameters, we set the scaling factor s = 64.0 and the margin m = 0.5 [82].
We report our results using the following metrics based on ISO/IEC 30107-3 [5]: Bona fide Presentation Classification Error Rate (BPCER), the proportion of bona fide presentations incorrectly classified as attack presentations (morphs); Attack Presentation Classification Error Rate (APCER), the proportion of attack presentations (morphs) incorrectly classified as bona fide; Detection Equal Error Rate (D-EER), the operating point where APCER equals BPCER; and the Area Under the Receiver Operating Characteristic Curve (AUROC). In the context of binary classification with the morph class as the positive class, BPCER and APCER correspond to the False Positive Rate (FPR) and False Negative Rate (FNR), respectively. In particular, we are interested in the following three operating points: 1) BPCER @ APCER = 5%, 2) BPCER @ APCER = 10%, and 3) BPCER @ APCER = 30%. AUROC is a threshold-independent metric providing a fair evaluation of the learned hypotheses.
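Under the positive-class convention just stated, these metrics can be computed from raw scores as sketched below (a generic implementation using scikit-learn, not the paper's evaluation code):

\begin{verbatim}
import numpy as np
from sklearn.metrics import roc_curve

def detection_metrics(labels, scores):
    """ISO/IEC 30107-3 style metrics from raw detector scores.

    labels: 1 = morph (attack), 0 = bona fide; scores: higher = more
    morph-like. With morph as the positive class, BPCER = FPR and
    APCER = FNR.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    bpcer, apcer = fpr, 1.0 - tpr
    # D-EER: operating point where APCER and BPCER (approximately) meet.
    deer = bpcer[np.nanargmin(np.abs(apcer - bpcer))]
    # Lowest BPCER whose APCER is at or below each fixed attack error rate.
    bpcer_at = {a: bpcer[np.argmax(apcer <= a)] for a in (0.05, 0.10, 0.30)}
    return deer, bpcer_at
\end{verbatim}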
C. Channel-Wise Feature Selection Via Group Lasso Weight Decay
In order to select the most discriminative wavelet sub-bands, we first train our DNN on the 48-wavelet-sub-band data using the WVU Twin-Landmark dataset. We perform a random search to tune the hyperparameter $\lambda$ in Eq. 2, which controls the strength of the group-lasso weight decay and, consequently, the number of wavelet sub-bands retained.
Fig. 4. The 48 wavelet sub-bands of a given morphed image. Our channel-wise feature selection scheme selects the six most discriminative wavelet sub-bands, which are ticked.
D. Feature Refinement Via Attention Mechanisms
We have integrated our three attention modules after the following layers: 1) "conv2d-3b", 2) "conv2d-4b", and 3) "mixed-7a", whose feature maps differ in spatial resolution and channel depth, allowing us to probe the attention mechanisms at multiple depths of the network.
We incorporate the attention mechanism Att. II, discussed in Section III-C2, into our DNN to acquire new sets of weighted feature vectors. To this end, the correlations between spatial locations in an intermediate feature map and the 512-D fully connected (FC) vector before the logits of our DNN are computed. The normalized correlation values decide which pixel locations remain active for morph detection and which are suppressed. Please note that, from an information-theoretic perspective, this attention module looks for the spatial feature locations that have the highest mutual information with the ground truth label Y. Att. II is inserted after "conv2d-3b" or "mixed-7a", where the number of channels is 64 and 1,792, respectively. Since the number of channels in the feature map and the dimension of the FC layer's output are not consistent, we use a linear projection to match the dimensions, following [42].
We also integrate the Att. III module into our DNN. The self-attentional augmented feature maps, detailed in Section III-C3, are concatenated with the "vanilla" convolutional feature maps to diversify the learned features. We assess the effectiveness of this multi-headed self-attention scheme by inserting the module after the "conv2d-3b" and "mixed-7a" layers, which have 80 and 1,792 feature maps, respectively. The results for this attention-augmented morph detection are benchmarked in Table 6. According to these results, incorporating Att. III improves morph detection accuracy on several datasets compared to Table 3, where no attention module is used; the improved results are highlighted in Table 6. In particular, employing Att. III decreases the error rates on the FERET-FaceMorpher, FERET-StyleGAN2, FRLL-WebMorpher, FRGC-FaceMorpher, and FRGC-OpenCV datasets.
E. Comparison With The State-of-the-Art
We compare the results of our attention-based morph detector with the results benchmarked in the MIPGAN paper [88]. The methodologies used in the MIPGAN paper are Ensemble Features [79] and Hybrid Features [62], abbreviated as Ensemble and Hybrid, respectively, in Table 7. The ensemble-of-features method fuses score-level morph detection results from three different feature descriptors: LBP, HOG, and BSIF. The Hybrid Features method, on the other hand, builds Laplacian pyramids in two color spaces (YCbCr and HSV) at three scales, extracts LBP features from every sub-image, feeds them to a Spectral Regression Kernel Discriminant Analysis (SRKDA) classifier, and fuses the scores of all sub-images for morph detection. Attention-based results on the datasets used in the MIPGAN paper are summarized in Table 7. Based on the benchmarked results, our attention-augmented morph detector decreases the error rates for several Train/Test scenarios; the improved results are highlighted in Table 7. In particular, employing Att. I decreases the error rates on the Landmarks-II dataset regardless of the training set, on the StyleGAN dataset when our DNN is trained on the Landmarks-I and MIPGAN-II datasets, and on the MIPGAN-II dataset when our DNN is trained on the Landmarks-I and StyleGAN datasets. In addition, employing Att. II decreases the error rates on the Landmarks-II dataset regardless of the training set, on the StyleGAN dataset when our DNN is trained on the MIPGAN-II dataset, and on the MIPGAN-II dataset when our DNN is trained on the Landmarks-I, Landmarks-II, and StyleGAN datasets. Employing Att. III decreases the error rates on the Landmarks-II dataset when our DNN is trained on the Landmarks-II dataset, on the StyleGAN dataset when our DNN is trained on the MIPGAN-II dataset, and on the MIPGAN-II dataset when our DNN is trained on the StyleGAN dataset.
Also, it is not uncommon for a travel document issuing/authentication agency to scan a submitted hard-copy facial image. To make our evaluation more realistic and inclusive, we therefore test our morph detectors on the printed-and-scanned (re-digitized) datasets used in the MIPGAN [88] paper. The morph detection performance on the printed-and-scanned versions of the datasets is summarized in Table 8. According to the benchmarked results, our attention-augmented morph detector decreases the detection error rates in several highlighted Train/Test scenarios, which substantiates the efficacy of our wavelet-based attention-augmented morph detector. In particular, employing Att. I decreases the error rates on the Landmarks-I dataset when our DNN is trained on the Landmarks-II and MIPGAN-II datasets, on the Landmarks-II dataset when trained on the Landmarks-I and MIPGAN-II datasets, and on the MIPGAN-II dataset when trained on the Landmarks-II dataset. In addition, employing Att. II decreases the error rates on the Landmarks-II dataset when trained on the Landmarks-I and MIPGAN-II datasets, and on the MIPGAN-II dataset when trained on the Landmarks-II dataset. Employing Att. III decreases the error rates on the Landmarks-II dataset when trained on the Landmarks-I and MIPGAN-II datasets, and on the MIPGAN-II dataset when trained on the Landmarks-II dataset.
We also assess the generalization ability of our framework against the state-of-the-art [39], [66] in Table 9, which evaluates morph detection performance on FRGC-FaceMorpher and FRGC-OpenCV. To this end, we fine-tune our trained Inception-ResNet-v1, including the attention modules Att. I, Att. II, and Att. III, on the FERET-FaceMorpher and FERET-OpenCV datasets. Please note that, in the PyTorch environment, we freeze all layers' parameters by setting "requires_grad = False" except for the final linear classifier layer. Based on the benchmarked results, our wavelet-based attention-augmented morph detector surpasses the prior works by a large margin in several train/test scenarios, which are highlighted in Table 9. In particular, employing Att. I decreases the error rates on the FRGC-FaceMorpher dataset when our DNN is trained on the FERET-FaceMorpher dataset, and on the FRGC-OpenCV dataset when trained on the FERET-OpenCV dataset. In addition, employing Att. II decreases the error rates on the FRGC-FaceMorpher dataset when trained on the FERET-FaceMorpher dataset, on the FRGC-FaceMorpher dataset when trained on FERET-OpenCV, and on the FRGC-OpenCV dataset when trained on the FERET-OpenCV dataset.
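The freezing step can be sketched as follows; the attribute name of the final classifier layer is an assumption made for illustration:

\begin{verbatim}
# Freeze all parameters except the final linear classifier before
# fine-tuning on the FERET-based training sets.
for param in model.parameters():
    param.requires_grad = False
for param in model.logits.parameters():  # final linear layer; name assumed
    param.requires_grad = True
\end{verbatim}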
We also investigate using different numbers of attention modules when training our deep morph detector. We add the attention modules Att. I, Att. II, and Att. III to several convolutional layers simultaneously and benchmark the results on the FERET, FRLL, and FRGC datasets, as shown in Table 10 and Table 11. Considering the results, using two modules, mainly Att. I and Att. II, considerably improves morph detection performance on the FRGC-FaceMorpher, FRGC-OpenCV, and FRGC-StyleGAN2 datasets. In addition, we assess the performance of our morph detector when all three attention modules, Att. I, Att. II, and Att. III, are added to our deep architecture; the resulting performance is benchmarked in Table 11. Integrating all three attention modules reduces the detection error rates on the FRLL-OpenCV, FRLL-WebMorpher, and FRGC-FaceMorpher datasets. All in all, our wavelet-based attention-augmented morph detector decreases the detection error rate on several highlighted datasets.
Most importantly, we contrast our attention-augmented morph detection performance on the MIPGAN-II dataset with the latest NIST Face Recognition Vendor Test (FRVT) report [6], updated on July 14, 2022. We compare our results with the NIST report using the two criteria APCER@BPCER = 0.01 and APCER@BPCER = 0.1. We report results on the MIPGAN-II dataset while our network with Att. I is fine-tuned on a universal dataset. This universal dataset, used for training our wavelet-based attention-augmented morph detector, includes all the datasets mentioned in Section IV-A plus the AMSL dataset [53], which consists of 2,175 morph and 204 bona fide samples. The morph detection results on the MIPGAN-II dataset in the NIST report format are summarized in Table 12. The benchmarked results demonstrate the efficacy of our morph detector, as the APCER@BPCER = 10% error rate is decreased.
F. Deep Morph Detector Visualization
In this section, the interpretability of our attention-based deep morph detector is investigated through two visualization tools: (1) attention maps and (2) Gradient-weighted Class Activation Maps (Grad-CAM). Attention maps [42], [50], [84] are powerful visualization and attribution [71], [72], [87] techniques that provide a visual explanation of a DNN's decision-making by highlighting the spatial regions most relevant to its output scores. In particular, attention maps are obtained by overlaying heat maps of the attention weights onto the original RGB images to highlight the most discriminative spatial regions in the eyes of the classifier. Grad-CAM is another visualization scheme demonstrating the functionality of our DNN: given a morphed image, the logits of the morphed class are expected to fire, which is revealed in the Grad-CAM plots. We follow the protocols adopted in the literature [42], [84] for the Att. I and Att. II modules, visualizing Grad-CAMs for the CBAM-integrated deep network and plotting attention maps for the adjusted network of [42].
The Grad-CAMs pertinent to Table 4, for both the "conv2d-3b" and "conv2d-4b" convolutional layers, are shown in Fig. 5. Moreover, the estimated attention maps of Table 5 for the "mixed-7a" convolutional layer are displayed in Fig. 6. As expected, the most discriminative spatial regions in the view of the morph detector lie in the vicinity of a subject's eyes.
Fig. 5. Grad-CAM visualizations of the CBAM-integrated deep morph detector (Table 4): (a) CBAM@conv2d-3b; (b) CBAM@conv2d-4b.
Conclusion
This article addresses single image morphing attack detection, in which emphasis on discriminative regions is realized through spatial and channel attention modules. In particular, we quantitatively demonstrated the efficacy of three visual attention modules for the downstream task of morph detection in a binary classification setting. The integrated attention modules perform feature refinement as well as feature selection, a form of representation learning. Specifically, a trainable soft attention mechanism, the convolutional block attention module, and multi-headed attention-augmented feature maps were utilized to improve morph detection accuracy on several datasets. In addition, we shifted the input data domain from RGB to the wavelet domain to take advantage of the fine-grained spatial-frequency information exposed by the wavelet decomposition.
Our benchmarked results on several datasets demonstrate the effectiveness of our attention-based morph detector. Most importantly, we contrasted the generalization performance of our attention-augmented morph detection scheme with state-of-the-art results to demonstrate the efficacy of the proposed architectures. Moreover, estimated attention maps and Grad-CAM visualizations were included to demonstrate the interpretability of our morph detector: heat maps overlaid on the original images reveal the most discriminative spatial regions driving our attention-augmented morph detectors toward an accurate decision when labeling probe images as bona fide or morphed. Finally, to realize multi-attentional morph detection, we assessed performance using pairs of the attention modules Att. I, Att. II, and Att. III, and we also trained our attention-augmented morph detector using all three modules, with the corresponding results benchmarked in the tables referenced in Section IV-E.
Future Work and Limitations
As future work, we are considering deploying more diverse visual attention mechanisms to increase morph detection accuracy. In addition, the transformer architecture has inspired a great number of attention mechanisms that can be beneficial for single image morph detection. A limitation of our work stems from the many possible permutations of the attention modules in our DNN, as well as of the layers at which they are integrated; an exhaustive investigation of all placements is left for future study.