Introduction
Face recognition systems [1] have become increasingly prevalent in daily life, with applications ranging from mobile device unlocking to high-security access control. However, the reliability of these systems is threatened by various types of spoofing attacks. Face anti-spoofing (FAS) techniques aim to protect face recognition systems from such attacks by determining whether a presented face is live or spoofed [2].
Several common types of face spoofing attacks include print, replay, and 3D mask attacks [3], [4], [5]. Print attacks involve presenting a photo of an authorized user’s face to the recognition system [3]. Replay attacks display a video recording of the genuine user on a digital screen [4]. More sophisticated 3D mask attacks use wearable masks to mimic facial geometry and texture [5]. While existing FAS methods have achieved impressive performance against known attack types, they often struggle when confronted with novel and unknown attacks [6], [7]. The ever-evolving nature of face spoofing techniques poses a significant risk, as FAS systems may be vulnerable to new attack types not encountered during training.
In this paper, we propose a novel framework that combines Token-wise Asymmetric Contrastive Learning (TACL) with an angular margin loss for robust FAS against unknown attacks, as illustrated on the right side of Figure 1. The approach leverages two key strategies to improve the generalization of FAS models to unknown attacks. First, spoof face images tend to have a more diverse feature distribution than live faces, because differences in attack material introduce additional feature variation. To exploit this asymmetry, we propose an asymmetric contrastive loss that encourages a compact distribution of live face features while allowing the features of spoof faces to disperse. For partial spoofing attacks, where not every patch of the spoof face image is tampered with, we rely on a representative feature of the entire spoof image rather than on individual patch features, allowing the FAS model to learn a decision boundary that is more robust to varying attack types. The second strategy, which we call token-wise learning, is inspired by patch-based learning [8], [9], [10], which guides FAS models to focus on intrinsic live or spoof patterns rather than ID- or facial-related features. Unlike conventional patch-based learning, our method operates on the output tokens of Vision Transformers.
Figure 1. (Left) Conventional FAS models discriminate live and spoof images across different domains; unknown attacks fall outside the trained distribution, resulting in poor performance. (Right) Our method trains live features to form a compact distribution separated from spoof features by a given margin. As a result, even if unknown attacks are presented to the model, their features are highly likely to fall outside the live feature space.
The main contributions of this paper are summarized as follows:
To tackle unknown attacks in FAS, we introduce token-wise asymmetric contrastive learning, which pulls the features of live faces into a compact cluster while allowing the features of spoof faces to distribute more diversely.
We introduce a novel protocol to assess the generalization ability of FAS systems for unknown attack types in more challenging environments and establish a baseline for this protocol.
Extensive experiments on benchmark datasets demonstrate that our attack-agnostic FAS model significantly outperforms state-of-the-art FAS methods in unknown attack scenarios.
Related Work
FAS systems face the fundamental challenge of generalizing to spoofing attacks not encountered during training, and various strategies have been proposed to address this problem. One common strategy for improving the generalization of FAS models is to leverage auxiliary supervision signals that guide the model toward intrinsic spoofing patterns. These signals can be derived from various sources, such as depth maps [12], reflection maps [13], or material-based cues [14].

Depth maps have been widely used as auxiliary supervision in FAS [12], [15], [16], under the assumption that live faces have different depth characteristics than spoofing media. By incorporating depth information during training, the FAS model learns to capture more intrinsic liveness cues. For instance, Liu et al. [12] proposed a CNN-RNN architecture that uses depth maps as supervision for live faces, encouraging the model to learn depth-related features. Similarly, Wang et al. [15] introduced a deep spatial gradient learning framework that leverages depth maps to guide the model in capturing fine-grained spatial information.

Reflection maps, which capture differences in reflectance between live faces and spoofing media, have also been explored as an auxiliary signal [13], based on the observation that spoofing attempts often exhibit distinct reflection patterns compared to genuine faces. Kim et al. [13] proposed a bipartite auxiliary supervision network (BASN) that employs both depth and reflection maps to learn generalized spoof patterns; the reflection maps supervise the learning of spoof-specific features, complementing the depth-based supervision. Extending these approaches, Qin et al. [17] introduced the Meta-Teacher framework, which leverages adaptive supervision signals such as pixel-wise maps to further guide models in learning diverse spoofing cues.

Material-based cues, derived from analyzing facial skin properties, have been investigated as another source of auxiliary supervision [14], motivated by the fact that live faces and spoofing media differ in material properties such as texture and reflectance. Yu et al. [14] cast FAS as a material recognition problem, leveraging human material perception to extract discriminative and robust features, and introduced a material-based loss function to guide the model toward material-specific characteristics. While these auxiliary supervision approaches have shown promising results, their effectiveness depends on the quality and reliability of the auxiliary signals: inaccurate or noisy supervision can mislead the model and hinder its generalization ability. Moreover, generating these signals often requires additional computational resources or specialized hardware, which may limit their practicality in certain applications.
Generalization to diverse attack types and data domains is crucial for FAS models to handle various scenarios, and numerous approaches [8], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28] have been proposed to address this challenge. Some methods employ distinct feature extractors that rely on domain-invariant content features and style features for classification [22], [27]. Patch-level classification has also been proposed to better exploit local features, enforcing patch-feature invariance through non-distorting augmentation and using an asymmetric angular-margin softmax loss to address the asymmetry between live and spoof features [22]. Asymmetric optimization of live and spoof faces has likewise been studied to cover unseen data domains: the methods in [19], [25] learn a generalized feature space in which the distribution of live faces is compact [19] and the distribution of spoof faces is dispersed across attack types [25]. However, these methods focus only on extracting features that generalize to unseen domains, and they either separate spoof features between domains [19] or require the attack type as a label [25], which makes them inapplicable to anti-spoofing with unknown attacks.
Our method introduces a novel asymmetric loss framework applied to the output tokens of Vision Transformers (ViT), building on prior works that leverage asymmetric losses [19], [29] and patch-based learning [29], [30]. While our approach shares conceptual motivation with these studies, it incorporates distinct and significant innovations; Table 1 highlights the key differences. Regarding the asymmetric loss, the main difference is the treatment of spoof-spoof pairs: our method eliminates the unnecessary regularization between spoof pairs across diverse attack types, enabling significantly improved generalization to unknown attacks, whereas existing methods do not explicitly address this limitation. Furthermore, our token-wise learning redefines the patch-based paradigm. Unlike traditional patch-based methods, which explicitly treat image patches as separate samples, our method operates on the output tokens of the ViT encoder. These tokens inherently represent implicit patches, allowing seamless integration of the loss computation while preserving the structural and semantic information of the input image. This implicit, token-level processing marks a fundamental advance over conventional patch-based methods.
Methods
The proposed framework, termed TACL, integrates token-wise asymmetric contrastive learning with an angular margin loss to enhance the robustness of FAS against unknown attacks. Figure 2 presents an overview of the framework.
Figure 2. The overall framework of TACL. We use the token-wise asymmetric contrastive loss (${\mathcal{L}}_{\text{TAC}}$) together with the angular margin loss (${\mathcal{L}}_{\text{AM}}$).
A. Problem Formulation
Consider a dataset containing a set of domains $\mathbf{D} = \{D_{m}\}_{m}$ and a set of attack types $\mathbf{A} = \{A_{n}\}_{n}$. Each sample $(\mathbf{x}_{i}^{m,n}, y_{i})$ consists of a face image $\mathbf{x}_{i}^{m,n}$ drawn from domain $D_{m}$ with attack type $A_{n}$ (for spoof images) and a binary label $y_{i}$ indicating live or spoof.
The purpose of anti-spoofing is to learn a robust discriminator $h$ that minimizes the classification loss $\ell$ over the test set:\begin{equation*} \min_{h} \; \mathbb{E}_{(\mathbf{x}^{m,n},y) \in {\mathcal{D}}_{test}} \, \ell(h(\mathbf{x}^{m,n}), y). \tag{1}\end{equation*}
The main concern of our method is to train $h$ to generalize to unknown attack types. This can be viewed as an attack-type generalization problem: known attack types are available during training, while attack types never seen in training appear at inference. In the seen-domain-unknown-attack configuration, the test set is defined as follows:\begin{equation*} {\mathcal{D}}_{test} = \{ (\mathbf{x}_{i}^{m,n}, y_{i}) \mid D_{m} \in \mathbf{D},\ A_{n} \notin \mathbf{A}\}_{i}. \tag{2}\end{equation*}
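For concreteness, the following sketch shows how such a test split could be constructed from a pool of labeled samples. This is illustrative only: the `Sample` record and its field names are our own assumptions, not the datasets' actual annotation format.

```python
# Hypothetical sketch: building the seen-domain-unknown-attack split of Eq. (2).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    path: str
    label: int              # 1 = live, 0 = spoof
    domain: str             # e.g., "OULU-NPU"
    attack: Optional[str]   # attack type, or None for live samples

def seen_domain_unknown_attack_split(samples, domains, known_attacks):
    train = [s for s in samples
             if s.domain in domains
             and (s.attack is None or s.attack in known_attacks)]
    # Eq. (2): same (seen) domains, but only attack types excluded from
    # training. Live test samples would come from a held-out partition
    # in practice.
    test = [s for s in samples
            if s.domain in domains
            and s.attack is not None and s.attack not in known_attacks]
    return train, test
```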
B. Asymmetric Contrastive Loss
The main strategy of anti-spoofing algorithms in the domain generalization problem is to reduce the distribution differences between source domains [19], [23], [24], [25] in order to extract features that are invariant across domains. This approach could also be applied to training the discriminator on unknown attack types, since that problem can similarly be viewed as a generalization problem over attack types. However, while the feature discrepancy among live images is caused primarily by domain differences, the discrepancy among spoof images is influenced by both domain variations and attack types. Consequently, the distribution of spoof features is more diverse than that of live features. Given these distinct characteristics, forcibly reducing the distribution differences among spoof features could disrupt the discriminator’s training and have a negative impact.
Based on this observation, we propose an asymmetric contrastive loss that pulls live features into a compact distribution while relaxing the compactness constraint on the spoof feature distribution. The live and spoof features push each other apart to maximize the margin between them. The concept of pulling features within the same class and pushing features between different classes is similar to the supervised contrastive (SupCon) loss [32]. However, unlike the SupCon loss, which applies the pulling term to all classes (live and spoof), our asymmetric loss applies it only to the live class. The Asymmetric Contrastive (AC) loss is defined as follows:\begin{align*} {\mathcal{L}}_{\text{AC}} & = \frac{1}{|L|} \sum_{i \in L} {\mathcal{L}}^{i}_{\text{AC}}, \tag{3}\\ {\mathcal{L}}^{i}_{\text{AC}} & = -\frac{1}{|L|} \sum_{j \in L} \log \frac{\exp\left(z_{i} \cdot z_{j} / \tau\right)}{\sum_{k \in L \cup S} \exp\left(z_{i} \cdot z_{k} / \tau\right)}, \tag{4}\end{align*}
where $L$ and $S$ denote the index sets of live and spoof samples in a batch, $z$ is the normalized feature of a sample, and $\tau$ is a temperature parameter.
The numerator calculates the similarity of features between live features, and the denominator is a normalization factor that is the sum of the similarity for all examples in a batch. This loss maximizes the similarity between live features, thus forcing a compact distribution of live features. Although this loss is calculated only for live samples, it also considers the similarity between live and spoof images in the normalization factor. In contrast to live-live pairs, the loss minimizes the similarity between live and spoof features, pushing each feature in different directions. Since the spoofing features do not attract each other, features from different types of attacks can be distributed more diversely.
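To make the computation concrete, below is a minimal PyTorch sketch of Eqs. (3)-(4). The function name, tensor shapes, and default temperature are our assumptions; the paper's reference implementation may differ.

```python
import torch
import torch.nn.functional as F

def ac_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Asymmetric contrastive loss of Eqs. (3)-(4).

    z: (B, D) feature vectors; labels: (B,) with 1 = live, 0 = spoof.
    """
    z = F.normalize(z, dim=1)                  # work on the unit hypersphere
    sim = z @ z.T / tau                        # (B, B) scaled cosine similarities
    live = (labels == 1).nonzero(as_tuple=True)[0]

    # Denominator of Eq. (4): all live and spoof samples in the batch.
    log_denom = torch.logsumexp(sim[live], dim=1)          # (|L|,)
    # Numerator: only live-live pairs are pulled together.
    log_num = sim[live][:, live]                           # (|L|, |L|)

    per_anchor = -(log_num - log_denom.unsqueeze(1)).mean(dim=1)   # Eq. (4)
    return per_anchor.mean()                                       # Eq. (3)
```

Because spoof samples appear only in the denominator, spoof-spoof pairs receive no attracting force, leaving features of different attack types free to disperse.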
According to the loss, the live features of different images in a batch attract each other. Since we construct each batch by including two differently augmented views of every image, following previous work [32], live image pairs consist of (1) two augmented views of the same image, (2) two images from different domains, or (3) two images with different IDs in the same domain. Since the loss forces a compact distribution for live features, the model attempts to extract augmentation-, domain-, and ID-agnostic live features, thereby enhancing generalization to unseen domains and identities.
C. Token-Wise Asymmetric Contrastive Loss
Although the loss ${\mathcal{L}}_{\text{AC}}$ encourages a compact live distribution, it operates on image-level features, which entangle liveness cues with ID- and facial-related information. As noted above, patch-based learning [8], [9], [10] suggests that local regions carry the intrinsic live or spoof patterns needed for generalization.
From this insight, we divide the input image into patches and apply the contrastive loss to the feature of each patch rather than to a single image-level feature.
Since ViT splits images into patches and outputs a token for each patch, our method does not explicitly process each patch as previous works do. Because the asymmetric contrastive loss is applied to ViT’s output tokens, we refer to it as a token-wise loss. One issue we must consider is the loss between live and spoof tokens. Unlike live images, not all tokens in a spoof image are necessarily spoof tokens: in some spoof images, only small regions are spoofed, such as partially visible or obscured eyes. Therefore, pushing all tokens of a spoof image away from the live tokens can harm the modeling of the live feature distribution. For this reason, we use a representative token from each spoof image instead of all its spoof tokens. This yields the token-wise asymmetric contrastive (TAC) loss, defined as follows:\begin{align*} {\mathcal{L}}_{\text{TAC}} & = \frac{1}{N_{\text{T}}} \sum_{i \in L,\, t \in T} {\mathcal{L}}^{i,t}_{\text{TAC}}, \tag{5}\\ {\mathcal{L}}^{i,t}_{\text{TAC}} & = -\frac{1}{N_{\text{T}}} \sum_{j \in L,\, q \in T} \log \frac{\exp\left(z_{i}^{t} \cdot z_{j}^{q} / \tau\right)}{n_{l} + \gamma \cdot n_{f}}, \tag{6}\\ n_{l} & = \sum_{k \in L,\, o \in T} \exp\left(z_{i}^{t} \cdot z_{k}^{o} / \tau\right), \qquad n_{f} = \sum_{s \in S} \exp\left(z_{i}^{t} \cdot \psi(z_{s}^{*}) / \tau\right), \tag{7}\end{align*}
where $T$ is the set of token indices, $N_{\text{T}}$ is the number of live tokens in the batch, $z_{i}^{t}$ is the normalized feature of the $t$-th token of image $i$, $\psi(\cdot)$ is the aggregation function that extracts the representative token from the tokens $z_{s}^{*}$ of spoof image $s$, and $\gamma$ weights the spoof term in the normalization factor.
The loss is applied to all pairs of live tokens in a batch, and tokens in live images pull each other to form similar features. A pair of live tokens can come from one image or from different images. When the tokens originate from the same image, the loss helps to extract liveness cues independent of facial-specific features by encouraging similarity among token features despite variations in facial content. Conversely, when the tokens come from different images, the model learns augmentation-, ID-, and domain-irrelevant features. As a result, the loss guides the model to unify all live tokens into a compact distribution, regardless of their domain, ID, and facial characteristics, positioning this distribution far away from any spoof features.
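The token-wise variant can be sketched as follows, again in PyTorch. We assume ViT output tokens of shape (B, T, D), use global max pooling for $\psi$ (the aggregation choice compared later in Table 8), and fold the $1/N_{\text{T}}$ normalizations into means; the function and argument names are ours.

```python
import math
import torch
import torch.nn.functional as F

def tac_loss(tokens: torch.Tensor, labels: torch.Tensor,
             tau: float = 0.1, gamma: float = 1.0) -> torch.Tensor:
    """Token-wise asymmetric contrastive loss of Eqs. (5)-(7).

    tokens: (B, T, D) ViT output tokens; labels: (B,) with 1 = live, 0 = spoof.
    """
    z = F.normalize(tokens, dim=-1)
    live = labels == 1

    live_tok = z[live].reshape(-1, z.shape[-1])             # (|L|*T, D)
    # psi: one representative token per spoof image via global max pooling.
    spoof_rep = F.normalize(z[~live].amax(dim=1), dim=-1)   # (|S|, D)

    sim_ll = live_tok @ live_tok.T / tau     # live-live token pairs
    sim_ls = live_tok @ spoof_rep.T / tau    # live tokens vs. spoof reps

    # log(n_l + gamma * n_f) of Eqs. (6)-(7), computed stably in log space.
    log_denom = torch.logsumexp(
        torch.cat([sim_ll, sim_ls + math.log(gamma)], dim=1), dim=1)

    per_anchor = -(sim_ll - log_denom.unsqueeze(1)).mean(dim=1)
    return per_anchor.mean()
```

Compressing each spoof image to a single representative token keeps the loss from pushing its untampered, live-looking patches away from the live cluster.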
D. Angular Margin Loss
Regarding the token-wise loss, we use normalized features, so all features lie on a unit hypersphere and their similarity is determined solely by the angle between them. An angular margin is therefore a natural way to enlarge the separation between live and spoof features.
Our goal is to develop a robust discriminator capable of effectively generalizing to unknown attack types. To achieve this, we employ an angular margin loss to enlarge the discrepancy between live and spoof features. Among the various angular margin losses [34], [35], [36], we choose ArcFace [36], originally proposed for the open-set face recognition problem. The loss adds an angular margin to the target angle, ensuring that the decision boundary of each class maintains a specified angular distance from the other. The angular margin (AM) loss is defined as follows:\begin{equation*} {\mathcal{L}}_{\text{AM}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(s \cdot \cos(\theta_{y_{i}} + m)\right)}{\exp\left(s \cdot \cos(\theta_{y_{i}} + m)\right) + \exp\left(s \cdot \cos\theta_{(1-y_{i})}\right)}. \tag{8}\end{equation*}
The terms $s$ and $m$ denote the scale factor and the additive angular margin, respectively, and $\theta_{y_{i}}$ is the angle between the feature of the $i$-th sample and the class weight vector of its ground-truth label $y_{i}$.
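A minimal binary ArcFace-style implementation of Eq. (8) is sketched below; the default values of $s$ and $m$ follow common ArcFace practice [36] and are not the paper's tuned settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginLoss(nn.Module):
    """Binary ArcFace-style angular margin loss of Eq. (8)."""

    def __init__(self, feat_dim: int, s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.s, self.m = s, m
        self.W = nn.Parameter(torch.randn(2, feat_dim))  # live/spoof prototypes

    def forward(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        """z: (B, D) features; labels: (B,) int64 with 1 = live, 0 = spoof."""
        # Cosine of the angle between each feature and each class prototype.
        cos = F.normalize(z, dim=1) @ F.normalize(self.W, dim=1).T   # (B, 2)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Additive margin m on the ground-truth angle only.
        theta_m = theta + self.m * F.one_hot(labels, num_classes=2)
        return F.cross_entropy(self.s * torch.cos(theta_m), labels)
```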
The total loss used for training combines the token-wise asymmetric contrastive loss ${\mathcal{L}}_{\text{TAC}}$ with the angular margin loss ${\mathcal{L}}_{\text{AM}}$.
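Putting the pieces together, a single training step might look like the following, reusing the `tac_loss` and `AngularMarginLoss` sketches above. How the two terms are weighted is not specified here, so `lam` is an assumed balancing hyperparameter (`lam = 1.0` reduces to a plain sum), and mean pooling of tokens for the AM head is likewise our choice.

```python
def training_step(vit, am_loss, images, labels, lam: float = 1.0):
    tokens = vit(images)            # (B, T, D) ViT output patch tokens
    pooled = tokens.mean(dim=1)     # image-level feature for the AM loss
    return tac_loss(tokens, labels) + lam * am_loss(pooled, labels)
```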
Experiments
A. Experimental Settings
1) Datasets and Protocols
To evaluate our method, we use several public datasets: SiW-Mv2 (S) [37], MSU-MFSD (M) [38], CASIA-FASD (C) [39], Idiap Replay-Attack (I) [40], and OULU-NPU (O) [41]. The SiW-Mv2 dataset contains 14 attack types, as listed in Table 3, while the other datasets include only print and replay attacks.
2) Implementation Details
Input images are cropped with MTCNN [42] and resized to a fixed input resolution.
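As an illustration, face cropping with the facenet-pytorch implementation of MTCNN can be done as below; the 224×224 output size is our assumption for a ViT backbone and is not stated in the text.

```python
from PIL import Image
from facenet_pytorch import MTCNN

# Detector that returns a cropped, resized face tensor (assumed 224x224).
mtcnn = MTCNN(image_size=224, margin=0, post_process=False)

img = Image.open("subject_001.jpg").convert("RGB")
face = mtcnn(img)   # (3, 224, 224) tensor, or None if no face is found
```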
3) Evaluation Metrics
We evaluate our method using several metrics widely adopted in recent work: the Average Classification Error Rate (ACER), the Half Total Error Rate (HTER), and the Area Under the Curve (AUC). ACER is the mean of the Attack Presentation Classification Error Rate (APCER, spoof images accepted as live) and the Bona Fide Presentation Classification Error Rate (BPCER, live images rejected as spoof), and thus reflects the model’s balance between false acceptances and false rejections. ACER is commonly reported for intra-dataset evaluation, where the model is trained on the training set of a given dataset and tested on its corresponding test set, so the test data share the training distribution and the evaluation focuses on performance within a single dataset. In inter-dataset evaluation, the model is tested on a dataset it has never seen during training, which probes generalization across datasets; in this setting the mathematically analogous HTER, the mean of the False Acceptance Rate (FAR) and the False Rejection Rate (FRR), is reported instead. In addition, AUC measures the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at varying thresholds; a higher AUC indicates a stronger overall ability to distinguish genuine from spoofed images independent of any single threshold. Together, these metrics characterize both intra-dataset performance and inter-dataset generalization.
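The following sketch computes these metrics from model scores; the fixed threshold is a simplification (protocols typically calibrate it on a development set), and the function names are ours.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def acer(scores: np.ndarray, labels: np.ndarray, thr: float = 0.5) -> float:
    """scores: higher = more likely live; labels: 1 = live, 0 = spoof."""
    pred_live = scores >= thr
    apcer = pred_live[labels == 0].mean()      # spoof accepted as live
    bpcer = (~pred_live[labels == 1]).mean()   # live rejected as spoof
    # Same formula is reported as HTER in inter-dataset settings.
    return (apcer + bpcer) / 2

def auc(scores: np.ndarray, labels: np.ndarray) -> float:
    return roc_auc_score(labels, scores)       # threshold-free separability
```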
B. Comparison to the State-of-the-Art Methods
1) Model Complexity Comparison
To further evaluate the scalability and complexity of our method, we compared the computational demands of our proposed TACL method with those of state-of-the-art FAS methods, including SA-FAS [23], DiVT-M [25], and DGUA-FAS [6]. The comparison focuses on the number of floating-point operations (FLOPs) and model parameters, as detailed in Table 2. Our method demonstrates significantly higher computational complexity with 35.15 G FLOPs and 85.8 M parameters, attributed to the use of a Vision Transformer backbone and the additional computational requirements of the TAC and AM losses. However, this trade-off in complexity yields substantial improvements in generalization ability, as demonstrated in various evaluation protocols. This highlights the importance of balancing computational complexity with performance gains, emphasizing the effectiveness of our approach in advancing generalization capabilities in face anti-spoofing.
2) Comparison in Cross-Attack Setting
We adopt widely used protocols to measure robustness against unknown attacks. Specifically, we follow the conventional leave-one-out protocol on the SiW-Mv2 dataset, where each attack type is in turn excluded from the training data and used solely as the test set, evaluating the model’s generalization to unknown attacks. We employ the Average Classification Error Rate (ACER) as the primary evaluation metric.
The results in Table 3 compare our method with state-of-the-art (SoTA) FAS methods [6], [23], [25], [37] under the SiW-Mv2 leave-one-out protocol. The compared methods include SRENet [37], which released the SiW-Mv2 benchmark dataset; SA-FAS [23] and DiVT-M [25], which represent the current SoTA in domain generalization for FAS; and DGUA-FAS [6], a recent work that targets unknown attack types. Notably, our method achieves the lowest ACER, outperforming the other methods on all attack types except the obfuscation and transparent attacks.
The results highlight the effectiveness of our approach in handling various attack types more robustly than the comparative methods. Furthermore, it is observed that methods focusing on domain generalization typically underperform in detecting unknown attack types compared to methods specifically designed to address unknown attack types.
3) Single-Category-to-Unknown-Attacks
We introduce a novel “single-category-to-unknown-attacks” protocol to assess the generalization ability of FAS models for unknown attack types in more challenging and realistic environments. This protocol simulates real-world scenarios where a FAS model trained on a limited set of known attacks must generalize to a wide range of unknown attacks. Unlike traditional protocols that test on individual unknown attacks (e.g., leave-one-out), our approach evaluates the robustness of a FAS model against multiple categories of unknown attacks simultaneously, providing a more comprehensive assessment of generalization capabilities. To elaborate, we categorize attacks into covering, makeup, 3D, and 2D attack groups, where we train the FAS model on a single attack category and test it on the others. Average Classification Error Rate (ACER) is employed as our primary metric of evaluation.
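The protocol can be expressed as a simple loop over categories. The mapping from categories to the concrete SiW-Mv2 attack types is given by the paper's tables, so it is left as an input here; the function name is ours.

```python
CATEGORIES = ["covering", "makeup", "3d", "2d"]

def single_category_runs(category_to_attacks: dict):
    """Yield (train category, train attacks, unknown test attacks) triples."""
    for train_cat in CATEGORIES:
        test_attacks = [a for cat in CATEGORIES if cat != train_cat
                        for a in category_to_attacks[cat]]
        yield train_cat, category_to_attacks[train_cat], test_attacks
```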
Table 4 presents the results under the single-category-to-unknown-attacks protocol. Each group in Table 4 shows the performance against the three unknown attack categories after training on one category; e.g., the first group shows test performance on the makeup, 3D, and 2D categories after training on the covering category. Our method outperforms existing FAS methods by at least 4.4% in all cases, and we establish the first baseline for this protocol. This challenging setup reveals generalization limitations of previous methods that are not apparent under traditional protocols. These findings highlight the importance of developing FAS systems capable of generalizing across diverse attack categories, a crucial requirement for real-world deployment where the variety of potential attacks is vast and ever-evolving.
4) Comparison in Cross-Domain Setting
Table 5 shows the performance under the leave-one-out protocol on the M&C&I&O datasets, where each row selects one domain as the unseen domain; e.g., OCI-M denotes training on the O, C, and I domains and testing on the M domain. This is the standard experiment for measuring the generalization of FAS methods to domain shift. We use the Half Total Error Rate (HTER) and AUC as performance indicators. Our approach yields competitive results, achieving the highest AUC in two of the four protocols and the second-best average AUC. This indicates that our method competes robustly with state-of-the-art methods on unseen domains.
5) Comparison in Joint Cross-Attack and Cross-Domain Setting
So far in this paper, we have evaluated the detection performance against unknown attacks in Table 3 and Table 4, and assessed the robustness against unseen domains in Table 5. We further evaluate the generalization ability under the most challenging setting, involving both unseen domains and unknown attacks, in Table 6, where the test set is defined as follows:\begin{equation*} {\mathcal{D}}_{test} = \{ (\mathbf{x}_{i}^{m,n}, y_{i}) \mid D_{m} \notin \mathbf{D},\ A_{n} \notin \mathbf{A}\}_{i}. \end{equation*}
Our method, TACL, showcases superior performance across various attack types, achieving the lowest average error rate of 19.34% and the lowest “average error rate except for 2D” of 20.76% among the compared methods. This highlights TACL’s enhanced ability to generalize across a wide range of spoofing attacks. Considering the overall results, our TACL not only outperforms the comparison methods on unseen domains and on unknown attacks individually, but also performs best in the most challenging setting that includes both. This shows that strongly clustering live samples at the token level and enforcing a margin against spoof samples contributes significantly to robustness against both unseen domains and unknown attacks.
C. Ablation Studies
1) Sample-Wise vs. Token-Wise Loss
The difference between pulling live samples at the token level versus the sample level is evident when comparing the first and last rows of Table 7. The first row represents our proposed method, which applies the TAC loss along with the AM loss; the last row applies the sample-based asymmetric contrastive loss (Eq. 3) in combination with the AM loss. Comparing the two, our method demonstrates a lower ACER (%) across most attack types. This improvement arises because pulling live samples together at the token level enforces similarity among all tokens within the live patches, thereby suppressing irrelevant features such as identity- or facial-specific cues.
2) Symmetric vs. Asymmetric Loss
That the asymmetric contrastive loss (live only) is more helpful than the symmetric contrastive loss in detecting unknown attacks can be seen by comparing the first and second rows of Table 7. The asymmetric loss yields a better average ACER (%), indicating a distinct advantage in detecting unknown attacks by relaxing the restrictions on the distribution of spoof features.
3) Effectiveness of Angular Margin Loss
The effectiveness of the AM loss is evident when comparing the first and third rows of Table 7. Applying the AM loss achieves an average ACER (%) that is 0.4 percentage points lower, indicating that the AM loss improves generalization to unknown attacks by widening the angular distance between the decision boundaries for each class.
4) Effectiveness of Spoof Aggregation Method
Building upon our proposed TAC loss, where live samples are attracted at the token level and spoof samples are not, we investigate a critical design choice: how to aggregate spoof tokens into the representative token. Table 8 demonstrates that global max pooling offers a significant advantage, particularly in the covering (partial) category. This improvement is largely attributable to the nature of spoof images, in which not all tokens necessarily contain spoof information. The experiment underlines the critical role of feature aggregation in the performance of face anti-spoofing systems and shows the effectiveness of our TAC loss when combined with global max pooling; a sketch of the two aggregation choices follows.
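Both choices map the (T, D) tokens of one spoof image to a single representative token; the function names are ours.

```python
import torch

def psi_max(spoof_tokens: torch.Tensor) -> torch.Tensor:
    # Global max pooling: a few strongly spoofed patches dominate, which
    # suits partial attacks where most patches still look live.
    return spoof_tokens.amax(dim=-2)

def psi_mean(spoof_tokens: torch.Tensor) -> torch.Tensor:
    # Averaging dilutes localized spoof cues across live-looking patches.
    return spoof_tokens.mean(dim=-2)
```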
D. Visualizations and Analysis
1) t-SNE Visualization
For a better understanding of the effect of each component, we visualize the feature distribution using t-SNE in Figure 3. Figure 3(a) shows the result from the last row in Table 7, i.e., sample-based asymmetric contrastive loss (Eq. 3), with the addition of the AM loss. The figure shows that some degree of clustering has formed for each class, but no clear boundaries are established between the classes. Figure 3(b) shows the results of the third row, which only includes the TAC loss. The live features are more robustly clustered compared to Figure 3(a), highlighting the effectiveness of TAC loss. However, the live and spoof clusters are still not fully separated. Figure 3(c) presents the results of our method. Compared to Figure 3(b), it demonstrates a clear decision boundary between live and spoof samples, attributed to the application of AM loss, which underscores the enhanced discriminative power of our approach.
Figure 3. t-SNE visualization of features under ablation of the proposed losses. Comparing (a) and (b), the token-wise loss ${\mathcal{L}}_{\text{TAC}}$ produces a more compactly clustered live distribution; comparing (b) and (c), adding the AM loss yields a clear decision boundary between live and spoof features.
Figure 4 presents a t-SNE visualization of features from diverse attack scenarios obtained using our proposed TACL method. As anticipated, live features from test samples (yellow circle) overlap with the live cluster (green circle) of train samples, and spoof features from test samples (yellow cross) are dispersed from the live cluster. This result shows a clear separation between live and spoof features in various unknown attacks.
Figure 4. t-SNE visualization of live and spoof features from the proposed TACL for various spoof types. Green and yellow symbols represent features from the training and test sets, respectively. The features of unknown attacks (yellow crosses) lie far from the live cluster, demonstrating our model’s generalization ability for various unknown attacks. Best viewed in color.
2) Grad-CAM Comparison
Figure 5 presents a comparative visualization of Grad-CAM results for our method and the other methods trained under the leave-one-out protocol on the SiW-Mv2 dataset, as detailed in Table 3. Our method robustly detects the spoof region across various novel attack scenarios. Unlike the other methods, our approach focuses clearly on the regions indicative of spoofing, highlighting its effectiveness in identifying spoofing attempts.
Figure 5. Grad-CAM visualization comparing FAS methods across samples with varying spoofing regions. The first row shows the Grad-CAM results obtained from the model trained without the Mannequin attack. The remaining rows are likewise derived from models trained under the leave-one-out protocol on the SiW-Mv2 dataset, as detailed in Table 3. Our method captures the spoofing area more accurately than the other methods against unknown attacks.
Conclusion
In this work, we presented TACL, a novel framework for robust FAS against unknown attacks that combines token-wise asymmetric contrastive learning with an angular margin loss. TACL achieves robustness against unknown attacks by guiding the model to aggregate all live token features, regardless of augmentation, ID, and domain, into a compact distribution, while placing this distribution away from the representative spoof tokens. Extensive experiments on benchmark datasets demonstrate the robustness of TACL in tackling unknown attacks. The introduced single-category-to-unknown-attacks protocol allows us to evaluate robustness against unknown attack types in a more extreme environment, and TACL’s superior performance under this protocol further validates its robustness. In future work, we will apply the TACL framework to other biometric modalities, such as fingerprint or iris recognition, to enhance the security of a broader range of biometric systems.