Introduction
Face recognition systems [1] have become increasingly prevalent in daily life, with applications ranging from mobile device unlocking to high-security access control. However, the reliability of these systems is threatened by various types of spoofing attacks. Face anti-spoofing (FAS) techniques aim to protect face recognition systems from such attacks by determining whether a presented face is live or spoofed [2].
Several common types of face spoofing attacks include print, replay, and 3D mask attacks [3], [4], [5]. Print attacks involve presenting a photo of an authorized user’s face to the recognition system [3]. Replay attacks display a video recording of the genuine user on a digital screen [4]. More sophisticated 3D mask attacks use wearable masks to mimic facial geometry and texture [5]. While existing FAS methods have achieved impressive performance against known attack types, they often struggle when confronted with novel and unknown attacks [6], [7]. The ever-evolving nature of face spoofing techniques poses a significant risk, as FAS systems may be vulnerable to new attack types not encountered during training.
In this paper, we propose a novel framework that combines Token-wise Asymmetric Contrastive Learning (TACL) with an angular margin loss for robust FAS against unknown attacks, as illustrated on the right side of Figure 1. The approach leverages two key strategies to improve the generalization of FAS models to unknown attacks. First, spoof face images tend to have a more diverse feature distribution than live faces, because differences in attack material introduce additional feature variation. To exploit this asymmetry, we propose an asymmetric contrastive loss that encourages a compact distribution of live face features while allowing the features of spoof faces to disperse. For partial spoofing attacks, where not every patch of the spoof face image is tampered with, we rely on a representative feature of the entire spoof image rather than on individual patch features, allowing the FAS model to learn a decision boundary that is more robust to varying attack types. The second strategy, which we call token-wise learning, is inspired by patch-based learning [8], [9], [10], which guides FAS models to focus on intrinsic live or spoof patterns rather than ID- or facial-related features. Unlike conventional patch-based learning, our method operates on the output tokens of Vision Transformers.
Figure 1. (Left) Conventional FAS models discriminate live and spoof images across different domains; unknown attacks fall outside the trained distribution, resulting in poor performance. (Right) Our method trains live features to form a compact distribution separated from spoof features by a given margin. As a result, even if unknown attacks are presented to the model, their features are highly likely to fall outside the live feature space.
The main contributions of this paper are summarized as follows:
To tackle unknown attacks in FAS, we introduce token-wise asymmetric contrastive learning, which pulls the features of live faces into a compact cluster while allowing the features of spoof faces to distribute more diversely.
We introduce a novel protocol to assess the generalization ability of FAS systems for unknown attack types in more challenging environments and establish a baseline for this protocol.
Extensive experiments on benchmark datasets demonstrate that our attack-agnostic FAS model significantly outperforms state-of-the-art FAS methods in unknown attack scenarios.
Related Work
FAS systems face the fundamental challenge of generalizing to spoofing attacks not encountered during training, and various strategies have been proposed to address this problem. One common strategy for improving the generalization of FAS models is to leverage auxiliary supervision signals that guide the model toward intrinsic spoofing patterns. These signals can be derived from various sources, such as depth maps [12], reflection maps [13], or material-based cues [14].

Depth maps have been widely used as auxiliary supervision in FAS [12], [15], [16], under the assumption that live faces have different depth characteristics than spoofing media. By incorporating depth information during training, the FAS model learns to capture more intrinsic liveness cues. For instance, Liu et al. [12] proposed a CNN-RNN architecture that uses depth maps as supervision for live faces, encouraging the model to learn depth-related features. Similarly, Wang et al. [15] introduced a deep spatial gradient learning framework that leverages depth maps to guide the model in capturing fine-grained spatial information.

Reflection maps, which capture differences in reflectance between live faces and spoofing media, have also been explored as an auxiliary signal [13], based on the observation that spoofing attempts often exhibit distinct reflection patterns compared to genuine faces. Kim et al. [13] proposed a bipartite auxiliary supervision network (BASN) that employs both depth and reflection maps to learn generalized spoof patterns; the reflection maps supervise the learning of spoof-specific features, complementing the depth-based supervision. Extending these approaches, Qin et al. [17] introduced the Meta-Teacher framework, which leverages adaptive supervision signals such as pixel-wise maps to further guide models in learning diverse spoofing cues.

Material-based cues, derived from analyzing facial skin properties, have been investigated as another source of auxiliary supervision [14], motivated by the fact that live faces and spoofing media differ in material properties such as texture and reflectance. Yu et al. [14] cast FAS as a material recognition problem, leveraging human material perception to extract discriminative and robust features, and introduced a material-based loss function to guide the model toward material-specific characteristics. While these auxiliary supervision approaches have shown promising results, their effectiveness depends on the quality and reliability of the auxiliary signals: inaccurate or noisy supervision can mislead the model and hinder its generalization ability. Moreover, generating these signals often requires additional computational resources or specialized hardware, which may limit their practicality in certain applications.
Generalization to diverse attack types and data domains is crucial for FAS models to handle various scenarios, and numerous approaches [8], [18], [19], [20], [21], [22], [23], [24], [25], [26], [27], [28] have been proposed to address this challenge. Some methods employ distinct feature extractors that rely on domain-invariant content features and style features for classification [22], [27]. Patch-level classification has also been proposed to better exploit local features, enforcing patch-feature invariance through non-distorting augmentation and using an asymmetric angular-margin softmax loss to address the asymmetry between live and spoof features [22]. Asymmetric optimization of live and spoof faces has likewise been studied to cover unseen data domains: the methods in [19], [25] learn a generalized feature space in which the distribution of live faces is compact [19] and the distribution of spoof faces is dispersed across attack types [25]. However, these methods focus only on extracting features that generalize to unseen domains, and they either separate spoof features between domains [19] or require the attack type as a label [25], which makes them inapplicable to anti-spoofing with unknown attacks.
Our method introduces a novel asymmetric loss framework applied to the output tokens of Vision Transformers (ViT), building on prior works that leverage asymmetric losses [19], [29] and patch-based learning [29], [30]. While our approach shares conceptual motivation with these studies, it incorporates distinct and significant innovations; Table 1 highlights the key differences. Regarding the asymmetric loss, the main difference is the treatment of spoof-spoof pairs: our method eliminates the unnecessary regularization between spoof pairs across diverse attack types, enabling significantly improved generalization to unknown attacks, whereas existing methods do not explicitly address this limitation. Furthermore, our token-wise learning redefines the patch-based paradigm. Unlike traditional patch-based methods, which explicitly treat image patches as separate samples, our method operates on the output tokens of the ViT encoder. These tokens inherently represent implicit patches, allowing seamless integration of the loss computation while preserving the structural and semantic information of the input image. This implicit, token-level processing marks a fundamental advance over conventional patch-based methods.
Methods
The proposed framework, termed TACL, integrates token-wise asymmetric contrastive learning with an angular margin loss to enhance the robustness of FAS against unknown attacks. Figure 2 presents an overview of the framework.
Figure 2. The overall framework of TACL. We use the token-wise asymmetric contrastive loss (${\mathcal{L}}_{\text{TAC}}$) together with the angular margin loss (${\mathcal{L}}_{\text{AM}}$).
A. Problem Formulation
Consider a dataset containing a set of domains $\mathbf{D} = \{D_{m}\}_{m}$ and a set of attack types $\mathbf{A} = \{A_{n}\}_{n}$. Each sample $(\mathbf{x}_{i}^{m,n}, y_{i})$ consists of a face image $\mathbf{x}_{i}^{m,n}$ drawn from domain $D_{m}$ with attack type $A_{n}$ (for spoof images) and a binary label $y_{i}$ indicating live or spoof.
The purpose of anti-spoofing is to learn a robust discriminator $h$ that minimizes the classification loss $\ell$ over the test set:\begin{equation*} \min_{h} \; \mathbb{E}_{(\mathbf{x}^{m,n},y) \in {\mathcal{D}}_{test}} \, \ell(h(\mathbf{x}^{m,n}), y). \tag{1}\end{equation*}
The main concern of our method is to train $h$ to generalize to unknown attack types. This can be viewed as an attack-type generalization problem: known attack types are available during training, while attack types never seen in training appear at inference. In the seen-domain-unknown-attack configuration, the test set is defined as follows:\begin{equation*} {\mathcal{D}}_{test} = \{ (\mathbf{x}_{i}^{m,n}, y_{i}) \mid D_{m} \in \mathbf{D},\ A_{n} \notin \mathbf{A}\}_{i}. \tag{2}\end{equation*}
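For concreteness, the following sketch shows how such a test split could be constructed from a pool of labeled samples. This is illustrative only: the `Sample` record and its field names are our own assumptions, not the datasets' actual annotation format.

```python
# Hypothetical sketch: building the seen-domain-unknown-attack split of Eq. (2).
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    path: str
    label: int              # 1 = live, 0 = spoof
    domain: str             # e.g., "OULU-NPU"
    attack: Optional[str]   # attack type, or None for live samples

def seen_domain_unknown_attack_split(samples, domains, known_attacks):
    train = [s for s in samples
             if s.domain in domains
             and (s.attack is None or s.attack in known_attacks)]
    # Eq. (2): same (seen) domains, but only attack types excluded from
    # training. Live test samples would come from a held-out partition
    # in practice.
    test = [s for s in samples
            if s.domain in domains
            and s.attack is not None and s.attack not in known_attacks]
    return train, test
```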
B. Asymmetric Contrastive Loss
The main strategy of anti-spoofing algorithms in the domain generalization problem is to reduce the distribution differences between source domains [19], [23], [24], [25] in order to extract features that are invariant across domains. This approach could also be applied to training the discriminator on unknown attack types, since that problem can similarly be viewed as a generalization problem over attack types. However, while the feature discrepancy among live images is caused primarily by domain differences, the discrepancy among spoof images is influenced by both domain variations and attack types. Consequently, the distribution of spoof features is more diverse than that of live features. Given these distinct characteristics, forcibly reducing the distribution differences among spoof features could disrupt the discriminator’s training and have a negative impact.
Based on this observation, we propose an asymmetric contrastive loss that pulls live features into a compact distribution while relaxing the compactness constraint on the spoof feature distribution. The live and spoof features push each other apart to maximize the margin between them. The concept of pulling features within the same class and pushing features between different classes is similar to the supervised contrastive (SupCon) loss [32]. However, unlike the SupCon loss, which applies the pulling term to all classes (live and spoof), our asymmetric loss applies it only to the live class. The Asymmetric Contrastive (AC) loss is defined as follows:\begin{align*} {\mathcal{L}}_{\text{AC}} & = \frac{1}{|L|} \sum_{i \in L} {\mathcal{L}}^{i}_{\text{AC}}, \tag{3}\\ {\mathcal{L}}^{i}_{\text{AC}} & = -\frac{1}{|L|} \sum_{j \in L} \log \frac{\exp\left(z_{i} \cdot z_{j} / \tau\right)}{\sum_{k \in L \cup S} \exp\left(z_{i} \cdot z_{k} / \tau\right)}, \tag{4}\end{align*}
where $L$ and $S$ denote the index sets of live and spoof samples in a batch, $z$ is the normalized feature of a sample, and $\tau$ is a temperature parameter.
The numerator calculates the similarity of features between live features, and the denominator is a normalization factor that is the sum of the similarity for all examples in a batch. This loss maximizes the similarity between live features, thus forcing a compact distribution of live features. Although this loss is calculated only for live samples, it also considers the similarity between live and spoof images in the normalization factor. In contrast to live-live pairs, the loss minimizes the similarity between live and spoof features, pushing each feature in different directions. Since the spoofing features do not attract each other, features from different types of attacks can be distributed more diversely.
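To make the computation concrete, below is a minimal PyTorch sketch of Eqs. (3)-(4). The function name, tensor shapes, and default temperature are our assumptions; the paper's reference implementation may differ.

```python
import torch
import torch.nn.functional as F

def ac_loss(z: torch.Tensor, labels: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """Asymmetric contrastive loss of Eqs. (3)-(4).

    z: (B, D) feature vectors; labels: (B,) with 1 = live, 0 = spoof.
    """
    z = F.normalize(z, dim=1)                  # work on the unit hypersphere
    sim = z @ z.T / tau                        # (B, B) scaled cosine similarities
    live = (labels == 1).nonzero(as_tuple=True)[0]

    # Denominator of Eq. (4): all live and spoof samples in the batch.
    log_denom = torch.logsumexp(sim[live], dim=1)          # (|L|,)
    # Numerator: only live-live pairs are pulled together.
    log_num = sim[live][:, live]                           # (|L|, |L|)

    per_anchor = -(log_num - log_denom.unsqueeze(1)).mean(dim=1)   # Eq. (4)
    return per_anchor.mean()                                       # Eq. (3)
```

Because spoof samples appear only in the denominator, spoof-spoof pairs receive no attracting force, leaving features of different attack types free to disperse.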
According to the loss, the live features of different images in a batch attract each other. Since we construct each batch by including two differently augmented views of every image, following previous work [32], live image pairs consist of (1) two augmented views of the same image, (2) two images from different domains, or (3) two images with different IDs in the same domain. Since the loss forces a compact distribution for live features, the model attempts to extract augmentation-, domain-, and ID-agnostic live features, thereby enhancing generalization to unseen domains and identities.
C. Token-Wise Asymmetric Contrastive Loss
Although the loss ${\mathcal{L}}_{\text{AC}}$ encourages a compact live distribution, it operates on image-level features, which entangle liveness cues with ID- and facial-related information. As noted above, patch-based learning [8], [9], [10] suggests that local regions carry the intrinsic live or spoof patterns needed for generalization.
From this insight, we divide the input image into patches and apply the contrastive loss to the feature of each patch rather than to a single image-level feature.
Since ViT splits images into patches and outputs a token for each patch, our method does not explicitly process each patch as previous works do. Because the asymmetric contrastive loss is applied to ViT’s output tokens, we refer to it as a token-wise loss. One issue we must consider is the loss between live and spoof tokens. Unlike live images, not all tokens in a spoof image are necessarily spoof tokens: in some spoof images, only small regions are spoofed, such as partially visible or obscured eyes. Therefore, pushing all tokens of a spoof image away from the live tokens can harm the modeling of the live feature distribution. For this reason, we use a representative token from each spoof image instead of all its spoof tokens. This yields the token-wise asymmetric contrastive (TAC) loss, defined as follows:\begin{align*} {\mathcal{L}}_{\text{TAC}} & = \frac{1}{N_{\text{T}}} \sum_{i \in L,\, t \in T} {\mathcal{L}}^{i,t}_{\text{TAC}}, \tag{5}\\ {\mathcal{L}}^{i,t}_{\text{TAC}} & = -\frac{1}{N_{\text{T}}} \sum_{j \in L,\, q \in T} \log \frac{\exp\left(z_{i}^{t} \cdot z_{j}^{q} / \tau\right)}{n_{l} + \gamma \cdot n_{f}}, \tag{6}\\ n_{l} & = \sum_{k \in L,\, o \in T} \exp\left(z_{i}^{t} \cdot z_{k}^{o} / \tau\right), \qquad n_{f} = \sum_{s \in S} \exp\left(z_{i}^{t} \cdot \psi(z_{s}^{*}) / \tau\right), \tag{7}\end{align*}
where $T$ is the set of token indices, $N_{\text{T}}$ is the number of live tokens in the batch, $z_{i}^{t}$ is the normalized feature of the $t$-th token of image $i$, $\psi(\cdot)$ is the aggregation function that extracts the representative token from the tokens $z_{s}^{*}$ of spoof image $s$, and $\gamma$ weights the spoof term in the normalization factor.
The loss is applied to all pairs of live tokens in a batch, and tokens in live images pull each other to form similar features. A pair of live tokens can come from one image or from different images. When the tokens originate from the same image, the loss helps to extract liveness cues independent of facial-specific features by encouraging similarity among token features despite variations in facial content. Conversely, when the tokens come from different images, the model learns augmentation-, ID-, and domain-irrelevant features. As a result, the loss guides the model to unify all live tokens into a compact distribution, regardless of their domain, ID, and facial characteristics, positioning this distribution far away from any spoof features.
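The token-wise variant can be sketched as follows, again in PyTorch. We assume ViT output tokens of shape (B, T, D), use global max pooling for $\psi$ (the aggregation choice compared later in Table 8), and fold the $1/N_{\text{T}}$ normalizations into means; the function and argument names are ours.

```python
import math
import torch
import torch.nn.functional as F

def tac_loss(tokens: torch.Tensor, labels: torch.Tensor,
             tau: float = 0.1, gamma: float = 1.0) -> torch.Tensor:
    """Token-wise asymmetric contrastive loss of Eqs. (5)-(7).

    tokens: (B, T, D) ViT output tokens; labels: (B,) with 1 = live, 0 = spoof.
    """
    z = F.normalize(tokens, dim=-1)
    live = labels == 1

    live_tok = z[live].reshape(-1, z.shape[-1])             # (|L|*T, D)
    # psi: one representative token per spoof image via global max pooling.
    spoof_rep = F.normalize(z[~live].amax(dim=1), dim=-1)   # (|S|, D)

    sim_ll = live_tok @ live_tok.T / tau     # live-live token pairs
    sim_ls = live_tok @ spoof_rep.T / tau    # live tokens vs. spoof reps

    # log(n_l + gamma * n_f) of Eqs. (6)-(7), computed stably in log space.
    log_denom = torch.logsumexp(
        torch.cat([sim_ll, sim_ls + math.log(gamma)], dim=1), dim=1)

    per_anchor = -(sim_ll - log_denom.unsqueeze(1)).mean(dim=1)
    return per_anchor.mean()
```

Compressing each spoof image to a single representative token keeps the loss from pushing its untampered, live-looking patches away from the live cluster.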
D. Angular Margin Loss
Regarding the token-wise loss, we use normalized features, so all features lie on a unit hypersphere and their similarity is determined solely by the angle between them. An angular margin is therefore a natural way to enlarge the separation between live and spoof features.
Our goal is to develop a robust discriminator capable of effectively generalizing to unknown attack types. To achieve this, we employ an angular margin loss to enlarge the discrepancy between live and spoof features. Among the various angular margin losses [34], [35], [36], we choose ArcFace [36], originally proposed for the open-set face recognition problem. The loss adds an angular margin to the target angle, ensuring that the decision boundary of each class maintains a specified angular distance from the other. The angular margin (AM) loss is defined as follows:\begin{equation*} {\mathcal{L}}_{\text{AM}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp\left(s \cdot \cos(\theta_{y_{i}} + m)\right)}{\exp\left(s \cdot \cos(\theta_{y_{i}} + m)\right) + \exp\left(s \cdot \cos\theta_{(1-y_{i})}\right)}. \tag{8}\end{equation*}
The terms $s$ and $m$ denote the scale factor and the additive angular margin, respectively, and $\theta_{y_{i}}$ is the angle between the feature of the $i$-th sample and the class weight vector of its ground-truth label $y_{i}$.
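A minimal binary ArcFace-style implementation of Eq. (8) is sketched below; the default values of $s$ and $m$ follow common ArcFace practice [36] and are not the paper's tuned settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AngularMarginLoss(nn.Module):
    """Binary ArcFace-style angular margin loss of Eq. (8)."""

    def __init__(self, feat_dim: int, s: float = 30.0, m: float = 0.5):
        super().__init__()
        self.s, self.m = s, m
        self.W = nn.Parameter(torch.randn(2, feat_dim))  # live/spoof prototypes

    def forward(self, z: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        """z: (B, D) features; labels: (B,) int64 with 1 = live, 0 = spoof."""
        # Cosine of the angle between each feature and each class prototype.
        cos = F.normalize(z, dim=1) @ F.normalize(self.W, dim=1).T   # (B, 2)
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Additive margin m on the ground-truth angle only.
        theta_m = theta + self.m * F.one_hot(labels, num_classes=2)
        return F.cross_entropy(self.s * torch.cos(theta_m), labels)
```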
The total loss used for training combines the token-wise asymmetric contrastive loss ${\mathcal{L}}_{\text{TAC}}$ with the angular margin loss ${\mathcal{L}}_{\text{AM}}$.
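Putting the pieces together, a single training step might look like the following, reusing the `tac_loss` and `AngularMarginLoss` sketches above. How the two terms are weighted is not specified here, so `lam` is an assumed balancing hyperparameter (`lam = 1.0` reduces to a plain sum), and mean pooling of tokens for the AM head is likewise our choice.

```python
def training_step(vit, am_loss, images, labels, lam: float = 1.0):
    tokens = vit(images)            # (B, T, D) ViT output patch tokens
    pooled = tokens.mean(dim=1)     # image-level feature for the AM loss
    return tac_loss(tokens, labels) + lam * am_loss(pooled, labels)
```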
Experiments
A. Experimental Settings
1) Datasets and Protocols
To evaluate our method, we use several public datasets: SiW-Mv2 (S) [37], MSU-MFSD (M) [38], CASIA-FASD (C) [39], Idiap Replay-Attack (I) [40], and OULU-NPU (O) [41]. The SiW-Mv2 dataset contains 14 attack types, as listed in Table 3, while the other datasets include only print and replay attacks.
2) Implementation Details
Input images are cropped with MTCNN [42] and resized to a fixed input resolution.
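As an illustration, face cropping with the facenet-pytorch implementation of MTCNN can be done as below; the 224×224 output size is our assumption for a ViT backbone and is not stated in the text.

```python
from PIL import Image
from facenet_pytorch import MTCNN

# Detector that returns a cropped, resized face tensor (assumed 224x224).
mtcnn = MTCNN(image_size=224, margin=0, post_process=False)

img = Image.open("subject_001.jpg").convert("RGB")
face = mtcnn(img)   # (3, 224, 224) tensor, or None if no face is found
```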
3) Evaluation Metrics
We evaluate our method using several metrics widely adopted in recent work: the Average Classification Error Rate (ACER), the Half Total Error Rate (HTER), and the Area Under the Curve (AUC). ACER is the mean of the Attack Presentation Classification Error Rate (APCER, spoof images accepted as live) and the Bona Fide Presentation Classification Error Rate (BPCER, live images rejected as spoof), and thus reflects the model’s balance between false acceptances and false rejections. ACER is commonly reported for intra-dataset evaluation, where the model is trained on the training set of a given dataset and tested on its corresponding test set, so the test data share the training distribution and the evaluation focuses on performance within a single dataset. In inter-dataset evaluation, the model is tested on a dataset it has never seen during training, which probes generalization across datasets; in this setting the mathematically analogous HTER, the mean of the False Acceptance Rate (FAR) and the False Rejection Rate (FRR), is reported instead. In addition, AUC measures the area under the Receiver Operating Characteristic (ROC) curve, which plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at varying thresholds; a higher AUC indicates a stronger overall ability to distinguish genuine from spoofed images independent of any single threshold. Together, these metrics characterize both intra-dataset performance and inter-dataset generalization.
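The following sketch computes these metrics from model scores; the fixed threshold is a simplification (protocols typically calibrate it on a development set), and the function names are ours.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def acer(scores: np.ndarray, labels: np.ndarray, thr: float = 0.5) -> float:
    """scores: higher = more likely live; labels: 1 = live, 0 = spoof."""
    pred_live = scores >= thr
    apcer = pred_live[labels == 0].mean()      # spoof accepted as live
    bpcer = (~pred_live[labels == 1]).mean()   # live rejected as spoof
    # Same formula is reported as HTER in inter-dataset settings.
    return (apcer + bpcer) / 2

def auc(scores: np.ndarray, labels: np.ndarray) -> float:
    return roc_auc_score(labels, scores)       # threshold-free separability
```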
B. Comparison to the State-of-the-Art Methods
1) Model Complexity Comparison
To further evaluate the scalability and complexity of our method, we compared the computational demands of our proposed TACL method with those of state-of-the-art FAS methods, including SA-FAS [23], DiVT-M [25], and DGUA-FAS [6]. The comparison focuses on the number of floating-point operations (FLOPs) and model parameters, as detailed in Table 2. Our method demonstrates significantly higher computational complexity with 35.15 G FLOPs and 85.8 M parameters, attributed to the use of a Vision Transformer backbone and the additional computational requirements of the TAC and AM losses. However, this trade-off in complexity yields substantial improvements in generalization ability, as demonstrated in various evaluation protocols. This highlights the importance of balancing computational complexity with performance gains, emphasizing the effectiveness of our approach in advancing generalization capabilities in face anti-spoofing.
2) Comparison in Cross-Attack Setting
We adopt widely used protocols to measure robustness against unknown attacks. Specifically, we follow the conventional leave-one-out protocol on the SiW-Mv2 dataset, where each attack type is in turn excluded from the training data and used solely as the test set, evaluating the model’s generalization to unknown attacks. We employ the Average Classification Error Rate (ACER) as the primary evaluation metric.
The results in Table 3 compare our method with state-of-the-art (SoTA) FAS methods [6], [23], [25], [37] under the SiW-Mv2 leave-one-out protocol. The compared methods include SRENet [37], which released the SiW-Mv2 benchmark dataset; SA-FAS [23] and DiVT-M [25], which represent the current SoTA in domain generalization for FAS; and DGUA-FAS [6], a recent work that targets unknown attack types. Notably, our method achieves the lowest ACER, outperforming the other methods on all attack types except the obfuscation and transparent attacks.
The results highlight the effectiveness of our approach in handling various attack types more robustly than the comparative methods. Furthermore, it is observed that methods focusing on domain generalization typically underperform in detecting unknown attack types compared to methods specifically designed to address unknown attack types.
3) Single-Category-to-Unknown-Attacks
We introduce a novel “single-category-to-unknown-attacks” protocol to assess the generalization ability of FAS models for unknown attack types in more challenging and realistic environments. This protocol simulates real-world scenarios where a FAS model trained on a limited set of known attacks must generalize to a wide range of unknown attacks. Unlike traditional protocols that test on individual unknown attacks (e.g., leave-one-out), our approach evaluates the robustness of a FAS model against multiple categories of unknown attacks simultaneously, providing a more comprehensive assessment of generalization capabilities. To elaborate, we categorize attacks into covering, makeup, 3D, and 2D attack groups, where we train the FAS model on a single attack category and test it on the others. Average Classification Error Rate (ACER) is employed as our primary metric of evaluation.
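The protocol can be expressed as a simple loop over categories. The mapping from categories to the concrete SiW-Mv2 attack types is given by the paper's tables, so it is left as an input here; the function name is ours.

```python
CATEGORIES = ["covering", "makeup", "3d", "2d"]

def single_category_runs(category_to_attacks: dict):
    """Yield (train category, train attacks, unknown test attacks) triples."""
    for train_cat in CATEGORIES:
        test_attacks = [a for cat in CATEGORIES if cat != train_cat
                        for a in category_to_attacks[cat]]
        yield train_cat, category_to_attacks[train_cat], test_attacks
```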
Table 4 presents the results under the single-category-to-unknown-attacks protocol. Each group in Table 4 shows the performance against the three unknown attack categories after training on one category; e.g., the first group shows test performance on the makeup, 3D, and 2D categories after training on the covering category. Our method outperforms existing FAS methods by at least 4.4% in all cases, and we establish the first baseline for this protocol. This challenging setup reveals generalization limitations of previous methods that are not apparent under traditional protocols. These findings highlight the importance of developing FAS systems capable of generalizing across diverse attack categories, a crucial requirement for real-world deployment where the variety of potential attacks is vast and ever-evolving.
4) Comparison in Cross-Domain Setting
Table 5 shows the performance under the leave-one-out protocol on the M&C&I&O datasets, where each row selects one domain as the unseen domain; e.g., OCI-M denotes training on the O, C, and I domains and testing on the M domain. This is the standard experiment for measuring the generalization of FAS methods to domain shift. We use the Half Total Error Rate (HTER) and AUC as performance indicators. Our approach yields competitive results, achieving the highest AUC in two of the four protocols and the second-best average AUC. This indicates that our method competes robustly with state-of-the-art methods on unseen domains.
5) Comparison in Joint Cross-Attack and Cross-Domain Setting
So far in this paper, we have evaluated the detection performance against unknown attacks in Table 3 and Table 4, and assessed the robustness against unseen domains in Table 5. We further evaluate the generalization ability under the most challenging setting, involving both unseen domains and unknown attacks, in Table 6, where the test set is defined as follows:\begin{equation*} {\mathcal{D}}_{test} = \{ (\mathbf{x}_{i}^{m,n}, y_{i}) \mid D_{m} \notin \mathbf{D},\ A_{n} \notin \mathbf{A}\}_{i}. \end{equation*}
Our method, TACL, showcases superior performance across various attack types, achieving the lowest average error rate of 19.34% and the lowest “average error rate except for 2D” of 20.76% among the compared methods. This highlights TACL’s enhanced ability to generalize across a wide range of spoofing attacks. Considering the overall results, our TACL not only outperforms the comparison methods on unseen domains and on unknown attacks individually, but also performs best in the most challenging setting that includes both. This shows that strongly clustering live samples at the token level and enforcing a margin against spoof samples contributes significantly to robustness against both unseen domains and unknown attacks.
C. Ablation Studies
1) Sample-Wise vs. Token-Wise Loss
The difference between pulling live samples at the token level versus the sample level is evident when comparing the first and last rows of Table 7. The first row represents our proposed method, which applies the TAC loss along with the AM loss; the last row applies the sample-based asymmetric contrastive loss (Eq. 3) in combination with the AM loss. Comparing the two, our method demonstrates a lower ACER (%) across most attack types. This improvement arises because pulling live samples together at the token level enforces similarity among all tokens within the live patches, thereby suppressing irrelevant features such as identity- or facial-specific cues.
2) Symmetric vs. Asymmetric Loss
That the asymmetric contrastive loss (live only) is more helpful than the symmetric contrastive loss in detecting unknown attacks can be seen by comparing the first and second rows of Table 7. The asymmetric loss yields a better average ACER (%), indicating a distinct advantage in detecting unknown attacks by relaxing the restrictions on the distribution of spoof features.
3) Effectiveness of Angular Margin Loss
The effectiveness of the AM loss is evident when comparing the first and third rows of Table 7. Applying the AM loss achieves an average ACER (%) that is 0.4 percentage points lower, indicating that the AM loss improves generalization to unknown attacks by widening the angular distance between the decision boundaries for each class.
4) Effectiveness of Spoof Aggregation Method
Building upon our proposed TAC loss, where live samples are attracted at the token level and spoof samples are not, we investigate a critical design choice: how to aggregate spoof tokens into the representative token. Table 8 demonstrates that global max pooling offers a significant advantage, particularly in the covering (partial) category. This improvement is largely attributable to the nature of spoof images, in which not all tokens necessarily contain spoof information. The experiment underlines the critical role of feature aggregation in the performance of face anti-spoofing systems and shows the effectiveness of our TAC loss when combined with global max pooling; a sketch of the two aggregation choices follows.
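Both choices map the (T, D) tokens of one spoof image to a single representative token; the function names are ours.

```python
import torch

def psi_max(spoof_tokens: torch.Tensor) -> torch.Tensor:
    # Global max pooling: a few strongly spoofed patches dominate, which
    # suits partial attacks where most patches still look live.
    return spoof_tokens.amax(dim=-2)

def psi_mean(spoof_tokens: torch.Tensor) -> torch.Tensor:
    # Averaging dilutes localized spoof cues across live-looking patches.
    return spoof_tokens.mean(dim=-2)
```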
D. Visualizations and Analysis
1) t-SNE Visualization
For a better understanding of the effect of each component, we visualize the feature distribution using t-SNE in Figure 3. Figure 3(a) shows the result from the last row in Table 7, i.e., sample-based asymmetric contrastive loss (Eq. 3), with the addition of the AM loss. The figure shows that some degree of clustering has formed for each class, but no clear boundaries are established between the classes. Figure 3(b) shows the results of the third row, which only includes the TAC loss. The live features are more robustly clustered compared to Figure 3(a), highlighting the effectiveness of TAC loss. However, the live and spoof clusters are still not fully separated. Figure 3(c) presents the results of our method. Compared to Figure 3(b), it demonstrates a clear decision boundary between live and spoof samples, attributed to the application of AM loss, which underscores the enhanced discriminative power of our approach.
Figure 3. t-SNE visualization of features under ablation of the proposed losses. Comparing (a) and (b), the token-wise loss ${\mathcal{L}}_{\text{TAC}}$ produces a more compactly clustered live distribution; comparing (b) and (c), adding the AM loss yields a clear decision boundary between live and spoof features.
Figure 4 presents a t-SNE visualization of features from diverse attack scenarios obtained using our proposed TACL method. As anticipated, live features from test samples (yellow circle) overlap with the live cluster (green circle) of train samples, and spoof features from test samples (yellow cross) are dispersed from the live cluster. This result shows a clear separation between live and spoof features in various unknown attacks.
Figure 4. t-SNE visualization of live and spoof features from the proposed TACL for various spoof types. Green and yellow symbols represent features from the training and test sets, respectively. The features of unknown attacks (yellow crosses) lie far from the live cluster, demonstrating our model’s generalization ability for various unknown attacks. Best viewed in color.
2) Grad-CAM Comparison
Figure 5 presents a comparative visualization of Grad-CAM results for our method and the other methods trained under the leave-one-out protocol on the SiW-Mv2 dataset, as detailed in Table 3. Our method robustly detects the spoof region across various novel attack scenarios. Unlike the other methods, our approach focuses clearly on the regions indicative of spoofing, highlighting its effectiveness in identifying spoofing attempts.
Figure 5. Grad-CAM visualization comparing FAS methods across samples with varying spoofing regions. The first row shows the Grad-CAM results obtained from the model trained without the Mannequin attack. The remaining rows are likewise derived from models trained under the leave-one-out protocol on the SiW-Mv2 dataset, as detailed in Table 3. Our method captures the spoofing area more accurately than the other methods against unknown attacks.
Conclusion
In this work, we presented TACL, a novel framework for robust FAS against unknown attacks that combines token-wise asymmetric contrastive learning with an angular margin loss. TACL achieves robustness against unknown attacks by guiding the model to aggregate all live token features, regardless of augmentation, ID, and domain, into a compact distribution, while placing this distribution away from the representative spoof tokens. Extensive experiments on benchmark datasets demonstrate the robustness of TACL in tackling unknown attacks. The introduced single-category-to-unknown-attacks protocol allows us to evaluate robustness against unknown attack types in a more extreme environment, and TACL’s superior performance under this protocol further validates its robustness. In future work, we will apply the TACL framework to other biometric modalities, such as fingerprint or iris recognition, to enhance the security of a broader range of biometric systems.