Introduction
Face recognition (FR) systems have been used in a wide range of real-world applications, such as surveillance and biometric authentication. The rapid development of deep learning algorithms and the availability of large datasets have enabled remarkable advancements in the field of face recognition research [1]. However, despite these advancements, the performance of face recognition systems still degrades when they are deployed on low-resolution face images, which is a common occurrence in real-world scenarios [2]. One of the biggest issues in this setting is the difficulty of the cross-resolution comparison task, caused by the mismatch in the resolution of the gallery/enrollment and probe/test images. For instance, in a surveillance system, the low-quality face images captured by the cameras need to be compared with high-resolution images in the gallery. This mismatch in resolution can cause a significant degradation in recognition performance compared to same-resolution recognition problems. In this paper, we address this challenging cross-resolution face recognition/comparison task that, for convenience, is also illustrated in Fig. 1.
Cross-resolution face recognition. In this task, a low-resolution probe face image is compared with a set of high-resolution gallery face images.
Existing techniques for cross-resolution face recognition can, in general, be grouped into three categories: resolution-invariant methods, face-hallucination-based methods, and degradation-based methods. The latter two groups, which operate directly on the input images, offer several desirable characteristics:
Universality: They are model agnostic and can, therefore, be applied with any FR model capable of extracting face representations from the given input images for a comparison procedure. In other words, these methods operate at the image-preprocessing level and are, therefore, universally applicable with arbitrary face recognition models without the need for model fine-tuning.
Simplicity: Unlike resolution-invariant recognition models, hallucination and degradation-based methods typically require a significantly lower amount of training data to work effectively. Additionally, they can be well-described by explicit mathematical models that have fewer degrees of freedom than contemporary heavily-parameterized FR models, leading to (data) efficient learning procedures.
Interpretability: Because the techniques are typically applied at the preprocessing level and produce observable (degradation/hallucination) results that are later fed to the FR model, they allow for an easier interpretation of the recognition decisions compared to standard cross-resolution comparison procedures, where model decisions due to the different characteristics of the input images are more difficult to understand.
Complementarity: Hallucination and degradation-based techniques can be applied in conjunction with resolution-invariant models and have the capacity to further improve results by reducing the resolution-induced mismatch between the probe and gallery images. Thus, these groups of techniques are complementary to methods aiming to design resolution-invariant FR approaches.
Due to the outlined characteristics, multiple studies explored degradation and face-hallucination techniques with the goal of improving cross-resolution FR performance [2], [3]. However, ensuring consistent improvements in recognition performance with either approach (degradation or hallucination) remains a challenging (open) research problem, with effective solutions for real-world data still largely missing from the literature. In this study, we aim to address this gap and explore three distinct strategies towards cross-resolution face recognition designed around novel degradation and face hallucination techniques, as illustrated in Fig. 2, i.e.:
Degrade-to-Compare (DtC): With this strategy, we investigate the impact of degrading gallery images, so they viably mimic the characteristics of the low-resolution probes, as shown in Fig. 2(a). To be able to implement the DtC strategy, we propose a scale-wise degradation method, in which different types of degradations are applied at multiple scales. The proposed method allows us to model a wider variety of degradation types, making the generated low-resolution face images more realistic and representative of the real-world challenges faced by cross-resolution face recognition systems.
Hallucinate-to-Compare (HtC): With this strategy, we study the effect of generating a high-resolution image from the low-resolution probes that aligns better with the resolution and quality characteristics of the high-resolution gallery images, as presented in Fig. 2(b). To analyze the feasibility of hallucination techniques for cross-resolution face recognition, we propose a novel multi-scale and multi-hypothesis face super-resolution approach. The approach involves upscaling the low-resolution probe images to multiple scales, i.e., $2\times $, $4\times $, and $8\times $. Additionally, at each scale, multiple hypotheses are reconstructed from different versions of the original low-resolution image to capture potential variations in the degradations encountered during the acquisition of the low-resolution probes.
Degrade-and-Hallucinate-to-Compare (DHtC): With the last strategy, we explore the feasibility of hybrid schemes that combine both gallery degradations and probe hallucinations to bridge the gap between the distributions of low-resolution and high-resolution images. The main idea behind this scheme, shown in Fig. 2(c), is to simultaneously improve the resolution of the input probes and degrade the quality of the high-resolution gallery images in a sort of meet-in-the-middle solution. Specifically, in this paper, we propose an approach that combines the multi-scale degradation process from the DtC strategy with the multi-scale, multi-hypothesis hallucination technique from the HtC strategy into a hybrid procedure using various fusion approaches. These fusion approaches aggregate the information from the multi-scale comparisons into a single similarity score that can ultimately be used for identity inference.
In this paper, we investigate three distinct strategies (i.e., DtC, HtC and DHtC) for cross-resolution face recognition and propose new multi-scale degradation and multi-hypothesis hallucination techniques for their implementation. Additionally, we study the impact of low-resolution probe quality on the behavior of the three considered strategies.
The research presented in this paper builds on our preliminary work from [4], but extends it in multiple key aspects.
The rest of the paper is structured as follows. In Section II, we review closely related work and position our research within the existing literature. In Section III, we provide details on the three studied strategies (DtC, HtC and DHtC) and describe in depth the novel multi-scale degradation technique, the multi-hypothesis and multi-scale face hallucination method, as well as the joint hybrid scheme with the corresponding fusion approaches. We evaluate and study the behavior of all three strategies for cross-resolution face recognition on the SCFace and DroneSURF datasets in Section IV, and, finally, conclude the paper with a summary of the main findings and some directions for future work in Section V.
Related Work
In this section, we review related prior work with the goal of providing context for our research. Specifically, we first discuss existing super-resolution and face hallucination models, then elaborate on modern face recognition techniques and, finally, explore cross-resolution recognition problems.
A. Super-Resolution and Face Hallucination
Recently, there has been a surge of interest in utilizing modern deep learning techniques to tackle the problem of super-resolution. Typically, supervised learning methods involve creating a dataset of low-resolution and high-resolution image pairs, where the high-resolution images serve as targets, i.e., the ground truth. The training inputs are then derived by subjecting each image to a predetermined degradation process. Models such as convolutional neural networks (CNNs) are then trained to upscale the artificially degraded input images by minimizing a pixel reconstruction error, such as the mean squared error (MSE) or the mean absolute error (MAE) [5], [6], [7].
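To make this training paradigm concrete, the following is a minimal PyTorch-style sketch of one supervised training step; it is an illustration under our own assumptions (a bicubic-downsampling degradation, L1 loss) rather than the pipeline of any specific method cited above, and `sr_model` and `optimizer` are placeholders:

```python
import torch.nn.functional as F

def degrade(hr, scale=4):
    # Predetermined degradation: here simply bicubic downsampling;
    # real pipelines may additionally apply blur, noise, or compression.
    return F.interpolate(hr, scale_factor=1.0 / scale,
                         mode="bicubic", align_corners=False)

def train_step(sr_model, optimizer, hr_batch):
    # hr_batch: (B, 3, H, W) high-resolution targets (ground truth).
    lr_batch = degrade(hr_batch)          # derive the training inputs
    sr_batch = sr_model(lr_batch)         # upscale back to (B, 3, H, W)
    loss = F.l1_loss(sr_batch, hr_batch)  # MAE; use F.mse_loss for MSE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```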
Most of the recent advances in super-resolution have focused on using more complex loss functions that go beyond simple pixel-wise differences. For example, some methods use perceptual loss functions [8] that take into account higher-level semantics to guide the learning process. Others use adversarial learning objectives [9], [10], [11], where a discriminator is trained to distinguish between generated and real images, to further improve the realism of the generated images. These advancements have led to significant progress in the field of super-resolution.
Super-resolution techniques that are used for upscaling human face images are often referred to as face hallucination techniques. Unlike general super-resolution methods, which are restricted by the information contained in the input image, face hallucination techniques are able to achieve better reconstructions at higher magnification factors, up to $8\times $ the resolution of the input image. This is because they are specifically trained on a limited domain of objects, i.e., human faces, which acts as an additional regularizer for the hallucination process. In contrast, most of the existing general super-resolution methods are usually limited to smaller magnification factors, typically up to $4\times $.
B. Face Recognition
Recent advancements in large-scale face recognition have involved the collection of large face datasets. Typical examples include VGGFace2 [17], DeepID [18], MS-Celeb-1M [19], WebFace260M [20], Glint360k [21], and others. Modern datasets typically contain thousands of subjects and millions of images, in order to capture a large amount of inter-class and intra-class variance. Model architectures have not been the main focus of recent face-recognition research [3], with most state-of-the-art approaches using the above datasets to train a ResNet-based backbone [22]. These models are commonly trained using classification and metric-learning loss functions, which enable them to learn to extract discriminative features from face images for face identification or verification purposes. In recent years, researchers have focused on developing novel loss functions that can combine classification and metric learning objectives, such as CosFace [23] and ArcFace [24]. Furthermore, researchers have also been working on developing loss functions that explicitly account for the quality of the input image, such as AdaFace [25]. These advances have demonstrated significant potential to improve face recognition performance, especially in challenging conditions, where image quality is poor.
C. Cross-Resolution Face Recognition
Cross-resolution face recognition refers to a specific FR problem, where the resolution of the images to be compared during the comparison process differs significantly. Existing approaches to this problem can, in general, be categorized into three main groups:
Resolution-invariant methods aim to minimize the difference between the feature representation of low-resolution and high-resolution face images. One such method is the Deep Coupled ResNet (DCR) model, proposed by Lu et al. [26], which consists of one trunk network and two branch networks. The trunk network is first trained with face images of different resolutions, then the two branch networks are trained to learn coupled-mappings between low-resolution and high-resolution face images. Other knowledge distillation based models [27], [28], [29], [30] distill the information from a Teacher network, which is pre-trained with high-resolution face images, to the Student network, which is trained on images of different resolutions.
Face hallucination based methods reconstruct high-resolution face images from low-resolution ones and target face recognition in the high-resolution domain. In [12], an identity preserving face hallucination method is proposed. It utilizes a super-identity loss that penalizes the identity difference between high-resolution and super-resolved face images. A similar idea is also presented in [13], where identity priors in the form of pretrained face recognition models are used to steer the face-hallucination process. The Feature Adaptation Network (FAN), presented in [31], disentangles the features into identity and non-identity components and performs face normalization, while improving the resolution, facilitating cross-resolution recognition tasks.
In contrast to the face hallucination based methods, degradation-based methods transform high-resolution faces into low-resolution ones. In [33], it is shown that a simple resolution-matching technique that downsamples high-resolution gallery images to the resolution of the low-resolution probe images improves cross-resolution face recognition performance. Another approach, i.e., the Resolution Adaption Network from [34], employs a Generative Adversarial Network (GAN) that realistically transforms high-resolution images into the low-resolution domain and uses a feature adaption network to extract low-resolution information from the high-resolution embedding.
While hybrid schemes that combine face degradations and face hallucinations have been explored in the literature before, e.g., [4], work on this topic remains limited, and studies that try to understand the benefits and behavior of such schemes, as well as their relation to the quality of the input low-resolution images, are extremely scarce. We fill this gap with the techniques and analyses presented in this paper.
D. Feature Selection and Fusion
Feature fusion and score fusion are well-established approaches in pattern recognition. As such, there is a large body of existing work on feature selection and feature fusion approaches in machine learning in general [35], [36] and in the field of biometrics specifically. Existing works in the domain of palmprint recognition have shown that more complex approaches can also work well. In [37], the authors show that discriminative power analysis, used as a feature selection tool, can improve palmprint recognition when DCT coefficients are used as features. Furthermore, in [38], low correlation between features is used as a selection criterion. In comparison to these approaches, our proposed feature fusion method relies on simple feature concatenation and averaging, while using more capable underlying feature extraction models.
Methodology
In this section, we present our three solutions for the Degrade-to-Compare (DtC), Hallucinate-to-Compare (HtC) and Degrade-and-Hallucinate-to-Compare (DHtC) strategies towards cross-resolution face recognition. We note that all strategies start by cropping the gallery and probe images using the bounding boxes provided by a face detector, so the inputs to the various models are always cropped faces. Additionally, all strategies use the same pretrained FR model, denoted $\psi $ in the following, to extract face templates, which are compared using a similarity measure, denoted $\varphi $.
A. Degrade-to-Compare With Multi-Scale Degradations
1) Multi-Scale Degradations
Previous work [33], [39] has shown that matching the resolution of the images alone is insufficient to significantly improve the comparison capabilities. We hypothesize that state-of-the-art face recognition models are sensitive to various image quality factors, which differ greatly when comparing a high-quality gallery image to a low-quality probe image. In order to match the quality between the gallery and probe images more closely, we propose a stochastic degradation-based approach for the DtC strategy.
Specifically, we propose a multi-scale degradation method, as illustrated in Fig. 3. The proposed method involves generating multiple degraded versions of each face image $G$ in the gallery set by sequentially applying degradation functions, randomly selected from a set of $n$ available options, across $k$ scales:
\begin{align*} G^{1} & = \downarrow _{s} (d_{s_{1}}(G)), \\ G^{2} & = \downarrow _{s} (d_{s_{2}}(G^{1})), \\ & \;\;\vdots \\ G^{k} & = d_{s_{k}}(G^{k-1}). \tag {1}\end{align*}
Scale-wise degradation process overview. In the graph above, there are n different degradation options to be applied at each step. All the possible paths in the graph generate a degraded gallery image and all of them are used in the recognition pipeline. Note that a downsampling operation is applied between each degradation. The highlighted yellow lines represent a combination of three degradations. The blue and purple dashed lines show specific combinations of one and two degradations, respectively.
In the above equation, the operator represented by $\downarrow _{s}$ denotes a downsampling operation with scale factor $s$, while $d_{s_{i}}$ denotes the degradation function applied at the $i$-th scale, chosen among the $n$ available degradation options.
Sample degraded gallery images using the proposed multi-scale degradation method. Five degraded examples (in columns) are presented for four distinct gallery images (in rows).
Note that the actual set of degraded images produced from a single gallery image can be arbitrarily large, given the combinatorial space of possible degradations from Table 1. In practice, we find that generating a set of 1024 degraded images from each gallery image is sufficient to capture the range of possible degradations, with performance improvements beyond that point showing diminishing returns.
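For illustration, one random path through such a degradation cascade could be sampled as in the sketch below; the three degradation functions and their parameters are stand-ins for the options of Table 1 and are assumptions on our part:

```python
import random
import numpy as np
import cv2

def blur(img):
    return cv2.GaussianBlur(img, (5, 5), 1.0)

def noise(img):
    # Additive Gaussian noise, clipped back to the valid uint8 range.
    return np.clip(img + np.random.normal(0, 5, img.shape), 0, 255).astype(np.uint8)

def jpeg(img):
    # JPEG compression artifacts via an encode/decode round trip.
    _, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, 40])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

DEGRADATIONS = [blur, noise, jpeg]  # the n degradation options per step

def downsample(img, s=2):
    h, w = img.shape[:2]
    return cv2.resize(img, (w // s, h // s), interpolation=cv2.INTER_AREA)

def degrade_gallery(hr_img, k=3):
    """Sample one random path through the degradation graph of Eq. (1)."""
    img = hr_img
    for step in range(k):
        img = random.choice(DEGRADATIONS)(img)  # d_{s_i}
        if step < k - 1:                        # downsample between degradations
            img = downsample(img)
    return img

# Sampling a hypothesis set for a gallery image G (as in the text, e.g., 1024):
# hypotheses = [degrade_gallery(G) for _ in range(1024)]
```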
2) Resolution Matching
The multi-scale degradation process, described above, produces a set of degraded gallery images at multiple scales and quality levels. To further reduce the domain gap, the resolution of each degraded gallery image is additionally matched to the resolution of the given low-resolution probe image before feature extraction.
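A minimal sketch of this resolution-matching step, under the assumption that images are numpy arrays in OpenCV layout, is given below:

```python
import cv2

def match_resolution(degraded_gallery, probe):
    # Downsample the degraded gallery image to the probe's spatial size,
    # so both images enter the FR model with comparable information content.
    h, w = probe.shape[:2]
    return cv2.resize(degraded_gallery, (w, h), interpolation=cv2.INTER_AREA)
```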
3) Similarity-Score Calculation
The goal of the comparison procedure is to produce a scalar similarity score for each given comparison between a probe image $P$ and a given gallery image $G$. Because the proposed multi-scale degradation method produces multiple degradation hypotheses $\{G_{i}\}_{i=1}^{M}$ for each gallery image, the final similarity score $r$ is computed as the maximum over all hypothesis comparisons:
\begin{equation*} r = \max _{i}\left (\{\varphi (\psi (G_{i}),\psi (P))\}_{i=1}^{M}\right), \tag {2}\end{equation*}
where $\psi (\cdot)$ denotes template extraction with the FR model and $\varphi (\cdot,\cdot)$ the employed similarity measure.
Similarity-Score Calculation with the Degrade-to-Compare strategy. The proposed multi-scale degradation method generates several degradation hypotheses from the high-resolution gallery image. These hypotheses are then used during the comparison procedure to calculate a scalar similarity score for a given probe-gallery pair.
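In code, Eq. (2) reduces to a maximum over cosine similarities; the sketch below assumes an `embed` function (the FR model $\psi $) that returns L2-normalized feature vectors:

```python
import numpy as np

def cosine(a, b):
    # For L2-normalized embeddings, the dot product equals cosine similarity.
    return float(np.dot(a, b))

def dtc_score(gallery_hypotheses, probe, embed):
    """Eq. (2): r = max_i phi(psi(G_i), psi(P))."""
    p = embed(probe)
    return max(cosine(embed(g), p) for g in gallery_hypotheses)
```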
B. Hallucinate-to-Compare With Multi-Hypothesis Face Super-Resolution
1) Multi-Hypothesis Face Super-Resolution
In order to add high-resolution details to real-life low-resolution face images, we train a variant of the EDSR [7] super-resolution convolutional neural network (CNN) exclusively on face images. By limiting the training set to face images, as opposed to the general computer vision datasets, such as ImageNet [41] or DIV2K [42], typically used for super-resolution training, the network is able to learn to upsample human faces in more detail, which enables a higher magnification factor (up to $8\times $).
Given a super-resolution model trained in this manner, we generate multiple upscaled hypotheses from each low-resolution probe. Specifically, several versions of the original low-resolution image are first produced, e.g., by filtering out potential high-frequency acquisition artifacts, and each version is then super-resolved independently, yielding a set of hypotheses that captures possible variations in the unknown degradations of the probe.
Examples of the super-resolved hypotheses for a sample low-resolution probe image, generated using the trained face super-resolution model.
We note that the face template extraction models require a fixed input image resolution (typically $112\times 112$ pixels for the models considered in this work), so all super-resolved hypotheses are resized to this resolution before feature extraction.
2) Multi-Scale Processing
The performance of face super-resolution models typically depends on the initial resolution of (and, in turn, the information content contained in) the input probe images and the desired magnification factor. For very small probes, aggressive magnification factors such as $8\times $ must hallucinate substantially more detail than a $2\times $ upscaling and are, therefore, more prone to artifacts. For this reason, we super-resolve each probe at multiple scales, i.e., $2\times $, $4\times $ and $8\times $, and let the subsequent comparison procedure exploit the most informative scale.
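A sketch of the resulting multi-scale, multi-hypothesis generation process is given below; the specific smoothing filters used to form the input variants, the $112\times 112$ FR input size, and the `sr_models` mapping (one trained face super-resolution network per scale) are assumptions for illustration:

```python
import cv2

def input_variants(lr_face):
    # Hypothesis inputs: the raw probe plus lightly filtered versions
    # intended to suppress potential high-frequency acquisition artifacts.
    return [
        lr_face,
        cv2.GaussianBlur(lr_face, (3, 3), 0.5),
        cv2.medianBlur(lr_face, 3),
    ]

def hallucinate(lr_face, sr_models, scales=(2, 4, 8), fr_size=(112, 112)):
    """Generate the set of N upscaled probe hypotheses across all scales."""
    hypotheses = []
    for s in scales:
        for variant in input_variants(lr_face):
            sr = sr_models[s](variant)  # upscale the variant by factor s
            sr = cv2.resize(sr, fr_size, interpolation=cv2.INTER_CUBIC)
            hypotheses.append(sr)       # ready for FR template extraction
    return hypotheses
```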
3) Similarity-Score Calculation
The multi-scale multi-hypothesis face super-resolution approach, presented above, produces a set of $N$ upscaled hypotheses $\{P_{i}\}_{i=1}^{N}$ for each probe image. The final similarity score $r$ for a given probe-gallery pair is again computed as the maximum over all hypothesis comparisons:
\begin{equation*} r = \max _{i}\left (\{\varphi (\psi (G),\psi (P_{i}))\}_{i=1}^{N}\right), \tag {3}\end{equation*}
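Eq. (3) mirrors Eq. (2) with the roles of probe and gallery reversed; a corresponding sketch, again assuming L2-normalized embeddings from the FR model `embed`, is:

```python
import numpy as np

def htc_score(gallery, probe_hypotheses, embed):
    """Eq. (3): r = max_i phi(psi(G), psi(P_i))."""
    g = embed(gallery)
    return max(float(np.dot(embed(p), g)) for p in probe_hypotheses)
```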
C. Degrade-and-Hallucinate-to-Compare With Multi-Scale Degradations and Multi-Hypothesis Super-Resolution
1) Hybrid Hallucination-Degradation Scheme
We combine the multi-scale degradation method and the multi-hypothesis hallucination procedure into a hybrid scheme with the goal of compensating for their individual shortcomings and further bridging the cross-resolution domain gap. As illustrated in Fig. 8, it is likely that some of the images from the large set of degraded gallery images and super-resolved probe images will be closer in quality than the original pair, since various quality/resolution hypotheses are created with our approach for both the initial gallery and the initial probe image.
Similarity-Score Calculation with the Hallucinate-to-Compare strategy. The proposed multi-scale multi-hypothesis super-resolution method generates several versions of upsampled probe images from the provided low-resolution probe. These hypotheses are then used during the comparison procedure to calculate a scalar similarity score for the given probe-gallery pair.
Similarity-Score Calculation with the Degrade-and-Hallucinate-to-Compare strategy. The multi-scale degradation method and the multi-hypothesis super-resolution methods generate multiple versions of probe and gallery images, respectively, that are used to generate a single scalar score in the comparison procedure for a given gallery-probe pair.
The face hallucination method produces $N$ super-resolved faces $\{P_{i}\}_{i=1}^{N}$ from each probe image, while the multi-scale degradation method produces $M$ degraded versions $\{G_{i}\}_{i=1}^{M}$ of each gallery image. The resulting hypotheses are combined using either feature-level or score-level fusion, as described next.
2) Feature Fusion
We consider two types of feature-level fusion for the implementation of our hybrid scheme, namely, feature addition and feature concatenation. With feature addition, the templates extracted from the individual probe or gallery hypotheses are accumulated through an element-wise summation, i.e.,
\begin{equation*} t_{acc} = \sum _{i\in \mathcal {P}\cup \mathcal {G}}{t_{i}}, \tag {4}\end{equation*}
where $t_{i}$ denotes the template of the $i$-th hypothesis from the probe ($\mathcal {P}$) or gallery ($\mathcal {G}$) hypothesis set. With feature concatenation, the individual templates are instead stacked into a single higher-dimensional representation:
\begin{equation*} t_{acc} = \mathbin {+\!\!+}_{i}\left ({t_{i}}\right), \tag {5}\end{equation*}
where $\mathbin {+\!\!+}$ denotes the concatenation operator.
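A sketch of the two fusion variants over a list of hypothesis templates (numpy vectors) follows; the re-normalization of the fused template is our own assumption, added so that fused vectors remain directly comparable with a dot product:

```python
import numpy as np

def fuse_add(templates):
    """Eq. (4): element-wise accumulation of hypothesis templates."""
    t = np.sum(templates, axis=0)
    return t / np.linalg.norm(t)   # re-normalize (our assumption)

def fuse_concat(templates):
    """Eq. (5): stack templates into one higher-dimensional vector."""
    t = np.concatenate(templates)
    return t / np.linalg.norm(t)
```

Note that, for the concatenated representations to be comparable, the probe-side and gallery-side vectors must be built from the same number of hypotheses in a consistent order.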
3) Score Fusion
To fuse the similarity scores between probe and gallery image hypotheses, we first obtain the feature vectors from these hypotheses, as described in the previous sections. Then, we calculate every possible similarity score between the probe hypotheses' feature vectors and the gallery hypotheses' feature vectors. To fuse these similarity scores, two options are available: taking the maximum over all computed scores or summing them up.
The use of the maximal similarity score is motivated by the fact that face recognition models are trained such that false positive matches are much less likely than false negatives. Thus, degrading the gallery image is extremely unlikely to increase its similarity with any given probe image, unless that probe image depicts the same person, in which case degrading the gallery image only brings the quality of the two images closer together. The use of the sum of similarity scores, on the other hand, is motivated by the interpretation of face feature vectors as containing information (signal) related to the identity of the person in the image, and noise related to irrelevant factors such as image quality, pose, background, etc. The idea is that adding up similarity scores from a large set of images, where the noise factors differ while the identity remains the same, will cause the noise factors to average out, dampening the noise while amplifying the signal.
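Both options can be computed over the full hypothesis grid in one pass; a sketch assuming L2-normalized embeddings from the FR model `embed`:

```python
import numpy as np

def score_fusion(gallery_hypotheses, probe_hypotheses, embed, mode="max"):
    G = np.stack([embed(g) for g in gallery_hypotheses])  # (M, d)
    P = np.stack([embed(p) for p in probe_hypotheses])    # (N, d)
    scores = G @ P.T                                      # all M x N cosine scores
    return scores.max() if mode == "max" else scores.sum()
```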
D. Implementation Details
The different degradation and hallucination procedures presented in this section all rely on comparisons in the embedding space of a pretrained FR model. Specifically, we employ four off-the-shelf deep face recognition models trained on large-scale datasets, including models trained on MS1M-RetinaFace with a ResNet-50 backbone and on Glint360k with a ResNet-101 backbone.
To extract features from test images, we first crop the images using face detection coordinates provided by the dataset authors. Then, the images are subjected to the preprocessing procedure provided by the authors of the face recognition models, and passed through the models to obtain the face feature vectors.
We note that, without fine-tuning on the target domain, the performance of our proposed method is highly reliant on the face recognition models used. The models used here were selected for their performance and large training set size, which enables a degree of robustness to image quality factors such as resolution, blur, lighting, and head pose.
Experiments
A. Datasets
We select two diverse and challenging datasets with cross-resolution comparison problems for the experiments, i.e., the SCFace [43] and the DroneSURF [44] datasets. Details on the two datasets are provided below.
The SCFace Dataset: There are 130 subjects in the SCFace dataset, each having one high-quality frontal image corresponding to the gallery face images, and multiple low-resolution images corresponding to the probe face images. Probe face images are captured using five different surveillance cameras and from three different distances, d1: 4.2m, d2: 2.6m and d3: 1.0m. Sample images from the dataset are shown in Fig. 9(a) and (b). In the experiments on this dataset, we report Rank-1 identification rate (IR) results for all 130 subjects. In order to compare the obtained results with those of previous works, we also report the mean Rank-1 IR over 10 Repeated Random Sub-Sampling Validation (RRSSV) experiments on 80 subjects, which is the common benchmark on this dataset. Please note that, in most of the previous works, the remaining 50 subjects are used for training purposes; our proposed approach, however, does not require any training or fine-tuning on this dataset. In the SCFace experiments, faces are detected using MTCNN [45] and then cropped by enlarging the bounding boxes with a scale factor of 1.3, following the findings in [33].
The DroneSURF Dataset: The dataset contains 200 videos of 58 subjects captured with drones, of which 24 subjects are used for testing purposes. Videos are captured in two types of surveillance settings: active and passive. In the active scenario, subjects are actively monitored, so the camera-to-subject distance is relatively constant. In the passive scenario, a drone monitors an area or event while its position and orientation remain fixed, so the distance between the drone and the subjects changes. Both scenarios are captured at two different times of day, i.e., during the day and before sunset, and at two locations, i.e., in a park and on the terrace of a building. The dataset also contains a gallery set of frontal face images captured using smartphones in constrained environments. In Fig. 9(c), sample images captured under the active and passive settings are given in the first and second rows, respectively. As can be seen, in the passive scenario, the face resolution is very low. In the DroneSURF experiments, gallery faces are detected using MTCNN [45]. Annotations for probe face bounding boxes are provided with DroneSURF [44]; however, most of them contain a significant portion of background information. Therefore, in order to obtain tightly cropped faces, we detect probe faces using TinyFace [46], which we chose over MTCNN [45] due to its ability to detect low-resolution faces. Still, TinyFace [46] could not detect all the probe faces; in these cases, we directly use the provided bounding box annotations. The detected faces are then cropped by enlarging the bounding boxes with a scale factor of 1.3. No fine-tuning of the deep face recognition models is performed on this dataset.
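For reference, enlarging a detector bounding box by a factor of 1.3 about its center before cropping can be sketched as follows; the `(x1, y1, x2, y2)` box format and the clamping to image borders are our assumptions:

```python
def enlarge_and_crop(img, box, scale=1.3):
    """box = (x1, y1, x2, y2) from a detector such as MTCNN or TinyFace."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = (x2 - x1) * scale, (y2 - y1) * scale
    H, W = img.shape[:2]
    nx1, ny1 = max(int(cx - w / 2), 0), max(int(cy - h / 2), 0)  # clamp to image
    nx2, ny2 = min(int(cx + w / 2), W), min(int(cy + h / 2), H)
    return img[ny1:ny2, nx1:nx2]
```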
Example probe and gallery images from the SCFace and DroneSURF datasets. In (a), probe face images from the SCFace dataset are given. The probe images in (a) belong to the same subject and are captured at distances of 4.2m (bottom row), 2.6m (middle row), and 1m (top row), with five different cameras. Faces are localized with MTCNN and cropped with a scale factor of 1.3. Although the images in each row are captured at the same distance, we can see that face resolution differs across cameras. Column (b) shows the gallery image of the subject given in (a). In (c), we see example probe images from the DroneSURF dataset, where the first row consists of images captured under the active surveillance scenario and the second row of images captured under the passive surveillance scenario. In (d), a sample gallery image from the DroneSURF dataset is shown.
B. Data Analysis
We start the experimental section with an initial analysis of the impact of the degradation and hallucination procedures on the quality characteristics of the SCFace and DroneSURF data. To this end, we examine the Stochastic Embedding Robustness Face Image Quality (SER-FIQ) [47] score distributions of the face images before and after applying the degradation and hallucination processes. In the top row of Fig. 10, we show the SER-FIQ score distributions calculated for the SCFace dataset. As can be seen, the image quality distribution obtained from the degraded gallery face images in Fig. 10(c) becomes more similar to that of the original low-resolution probe images (Fig. 10(a)) due to the applied degradation process. Conversely, the SER-FIQ score distribution of the hallucinated probe face images (Fig. 10(d)) shows that the proportion of high-quality face images is increased, suggesting that the hallucination process may be beneficial for the comparison procedure on this dataset.
SER-FIQ score distributions on the SCFace and DroneSURF datasets before and after applying the proposed degradation and hallucination schemes.
The same analysis is also carried out for the active surveillance scenario of the DroneSURF dataset. As expected, the image quality of the face images decreases due to the gallery degradations, as seen in Fig. 10(c). However, no improvement in image quality is observed for the hallucinated probe face images, likely because the quality of the original DroneSURF probes is too low to recover sufficient facial detail using the proposed hallucination approach.
C. Baseline Recognition Results
In the first series of recognition experiments, we evaluate the performance of four off-the-shelf deep face recognition models, described in Section III-D, on the SCFace and DroneSURF datasets without applying any degradation or hallucination technique. The results of this experiment are presented in Table 2. Notably, on the SCFace dataset, the accuracies achieved by the face recognition models surpass 90% for the closer distances, indicating the effectiveness of deep face models when face resolutions are relatively high. Hence, we focus our analysis on the face images captured from the d1 distance, which represents the farthest range and, consequently, the lowest resolution of the face images in the SCFace dataset. The recognition accuracies on the DroneSURF dataset exhibit significant variation. In the passive scenario, all models achieve only approximately half of the performance obtained in the active scenario. It is important to note that, compared to the face resolution in the active scenario, the resolution in the passive scenario is much lower, which causes this significant accuracy difference.
D. Gallery Degradation Results
Next, using the method introduced in Section III-A for the DtC strategy, we apply degradations to the gallery face images of both the SCFace and DroneSURF datasets. Thus, in this experiment, we study how degrading high-resolution gallery face images, with the goal of matching the characteristics of the low-resolution probes, impacts cross-resolution face recognition performance. Each gallery image is degraded by applying the degradations from Table 1, using the random process in Eq. (1), so as to produce a large set of degraded images, which are then compared to the probes.
The results obtained on the SCFace dataset, given in Table 3, point to significant improvements in the Rank-1 Identification Rate (%) compared to the baseline results. Specifically, the model trained on the MS1M-RetinaFace dataset with a ResNet-50 backbone yields a 36% relative increase in accuracy. Moreover, the model trained on the Glint360k dataset with a ResNet-101 backbone achieves a recognition accuracy of 87.38%, approaching the results obtained at closer distances. Further reducing the domain gap by matching the resolution of the gallery and probe faces results in even higher accuracies, as can be seen in the lower half of Table 3. The most successful model achieves an 89.69% identification rate at the d1 distance; more importantly, all models consistently improve in performance at d1 when degradation and resolution matching are used, compared to using degradations only.
We employ the same gallery degradation strategy on the DroneSURF dataset. The results in Table 4 show similar behavior as observed in the SCFace experiments, that is, gallery degradations lead to significant improvements over the baseline recognition accuracies in both the active and passive scenarios. Resolution matching applied together with the gallery degradations further improves face recognition performance in the cross-resolution setting.
E. Face Hallucination Results
In the next series of experiments, we explore the impact of the HtC face hallucination strategy, introduced in Section III-B, on cross-resolution face recognition performance. We investigate three different strategies: (i) single-hypothesis face hallucination, (ii) multi-hypothesis face hallucination at individual scales, and (iii) the combined multi-scale, multi-hypothesis approach.
Table 5 presents the face hallucination results for the d1=4.2m distance of the SCFace dataset. The upper part of the table displays the results for single-hypothesis face hallucination, while the middle section presents the results for multi-hypothesis face hallucination. The reported results suggest that single-hypothesis face hallucination does not lead to improvements in recognition accuracies. However, multi-hypothesis face hallucination results in better performance. This suggests that naively enhancing a low-quality face image without addressing image degradations does not result in significant improvements for the cross-resolution face recognition task on this dataset. Instead, generating multiple hypotheses by removing high-frequency artifacts before applying super-resolution can aid the recognition process. In the last section of Table 5, we combine the multi-hypothesis probe images of different scales, rather than measuring their performance at separate scales. This further improves the cross-resolution comparison performance and leads to significant performance gains for all tested models.
Due to the inadequate quality of the probe face images in the passive surveillance scenario of the DroneSURF dataset, we conduct face hallucination experiments only for the active scenario. In Table 6, we present the results for the multi-scale multi-hypothesis face hallucination approach that performed best overall on SCFace. As can be seen, these results show a slight reduction in the face identification rates compared to the baseline results from Table 2, suggesting that face hallucination is not able to sufficiently improve the face image quality even in active surveillance scenarios due to the unsuitable (low) quality characteristics of the DroneSURF probes. These results are also consistent with the SER-FIQ quality score distributions from Fig. 10, which already suggested a limited quality impact of the hallucination process.
To provide a qualitative comparison and offer additional insight into the behavior of the face hallucination approach, we include hallucinated images from both the SCFace and DroneSURF datasets in Fig. 11. It is worth noting that the super-resolved DroneSURF face images have significantly more artifacts than the SCFace examples. This is the case for most of the samples in the DroneSURF dataset due to poorer image quality and lower face resolution, which appear to have a significant impact on the success of face hallucination strategies applied as preprocessing steps to cross-resolution face comparison.
Multi-hypothesis multi-scale face hallucination results. For the purpose of visualization, all the images are resized to a fixed size.
F. Combining Gallery Degradations and Probe Hallucinations
In this section, we investigate the performance of a combined approach that integrates both the gallery degradations and the probe face hallucinations into a hybrid DHtC procedure. Our goal with the combined approach is to bridge the gap between the distributions of low-resolution and high-resolution face images using a joint degradation-hallucination scheme, as illustrated in Fig. 2(c).
In Table 7, we combine the gallery degradation method with the probe face hallucinations for the d1=4.2m distance of the SCFace dataset. By merging these two schemes with the fusion strategies from Section III-C, we are able to achieve a face identification rate exceeding 90% for 130 subjects. While all of the fusion strategies lead to comparable cross-resolution recognition performance, the highest recognition accuracy of 91.07% is obtained with the feature concatenation strategy (Eq. (5)).
In the case of the DroneSURF dataset, the performance of the combined method is impacted by the poor performance of the face hallucination scheme. However, the combination still outperforms the baseline results and the hallucination-only method. The corresponding results can be found in Table 8, where the maximum similarity score strategy performs best among the considered fusion approaches.
G. Resolution Impact
In this section, we examine the impact of the probe-image resolution on face recognition performance. Table 9 presents the results of the DtC strategy, which employs the Glint360k-R101 model with gallery degradations and resolution matching (but no face hallucination), for each camera individually, together with the corresponding average face sizes per camera. The results in the table suggest that the performance of the face recognition model is directly related to the resolution of the face images. Fig. 9 presents a subject's probe face images taken by the five distinct cameras, which shows the disparities in resolution and degradations caused by each camera. It is straightforward to see the differences in probe image quality and information content across the five cameras, which are reflected in the reported recognition rates.
The resolution of the face images also varies between the different surveillance settings of the DroneSURF dataset, namely active and passive surveillance, as exemplified in Fig. 9. Table 4 reports the face recognition performance for both scenarios, and we can observe that the performance in the passive scenario, where the faces have lower resolution, is lower than in the active scenario.
H. Success and Failure Cases
In compliance with the terms of the SCFace release agreement, we display the success and failure cases for the Glint360k-R101 model with gallery degradations and resolution matching in the DtC scheme, using only the subjects whose images are cleared for publication, in Fig. 12(a). Failures typically arise with lower-quality images that contain a high degree of uncertainty about the subject's identity. However, the proposed method consistently aligns its predictions with discernible attributes, such as gender, hairstyle, and skin or hair color. Even if the top prediction is not always accurate, the correct identity is generally found within the top four predictions, suggesting that the proposed method can successfully extract and apply key facial features in its identification process.
Top-10 predictions for probe face images on SCFace and DroneSURF datasets, including success and failure cases.
Fig. 12(b) demonstrates the model's ability to tackle the face identification task in the challenging environment of the DroneSURF dataset. The probe images shown in Fig. 12(b) include many low-quality and non-frontal face images captured in an unconstrained setting. Similarly to the previous findings, the model aligns its predictions with key observable attributes, such as gender, hairstyle, hair color, and skin tone, and these attributes remain present across the top-10 predictions. In comparison to SCFace, our proposed method performs worse overall on DroneSURF in terms of the per-frame rank-1 identification rate, as the uncontrolled drone footage with very low-resolution faces demonstrates the limits of its capability. We note that this is the most challenging experimental setting for DroneSURF, compared to the per-video setting, where close-up frames can be used to identify the subjects for the entire video sequence.
I. Comparison With the State-of-the-Art
In this section, we present a comparison with the state-of-the-art on the SCFace and DroneSURF datasets.
1) SCface Results
In Table 10, we compare our results on the SCFace dataset with those of previous state-of-the-art works. To make our results comparable with those of the prior methods, we report the mean of 10 RRSSV results for 80 randomly selected subjects. Our model, Glint360k-R101, incorporates both gallery degradations and face hallucination. Existing methods that perform fine-tuning on 50 randomly selected subjects are marked in Table 10 with a check mark in the FT column. Among the models that do not perform fine-tuning, our model, Glint360k-R101, applied within the joint degradation-hallucination scheme achieves the highest performance with 95.4% accuracy at the d1=4.2m distance. The second-best result, 88.3%, is achieved by [27]. Remarkably, our approach also performs better than the competing techniques that utilize a portion of the dataset for training purposes, despite not requiring any training or fine-tuning on the target dataset.
2) DroneSURF Results
In Table 11, DroneSURF Rank-1 IR (%) results are reported for the active and passive scenarios under the frame-wise protocol. The third column indicates whether face recognition is carried out on tightly cropped probe faces, obtained using face detectors, or on the bounding boxes provided with the dataset. The table also specifies whether the models are fine-tuned on the target dataset. Unlike in the SCFace experiments, our model only utilizes the proposed degradation method along with resolution matching and does not involve face hallucination. In the active scenario, we achieve the highest accuracy of 51.55%, the best result among all the listed approaches. The second-best result belongs to the method proposed in [30], which also employs a face detector to crop probe face images and does not fine-tune on the target dataset. In the passive surveillance scenario, we obtain the highest accuracy among the approaches that do not use the target dataset for training, with an accuracy of 26.84%. In [55], 34 subjects of the dataset are used for training purposes, resulting in an accuracy of 27.81%. This shows the limitation of our training-free method: fine-tuning directly on the target domain can still somewhat improve performance at the lower end of resolution and image quality.
We compare the computational complexity of the methods evaluated on DroneSURF in Table 12. Complexity evaluation for deep learning-based methods is typically split into training and inference time constraints. Here, we note only which methods require training in the first place. Furthermore, the inference computational complexity is measured in terms of the number of forward passes through the face recognition and super-resolution networks required to compare a gallery and a probe image. The time complexity is sub-linear with regard to the number of forward passes due to the effects of GPU batch processing, as processing singleton batches is highly inefficient. The space complexity is affected only by the need to keep the expanded dataset in memory; no extra storage is required. We note that our proposed method has a far higher inference complexity, while not requiring training. This indicates utility in cases where insufficient target-domain data exists for extensive training or fine-tuning. Furthermore, we note that while increased computational complexity is obviously undesirable, methods that make efficient use of increased computational budgets tend to be better at eliciting improved performance in the long run [57], [58], which mirrors our findings.
Conclusion
In this work, we have addressed the challenge of cross-resolution face recognition and investigated two strategies for improving recognition accuracy, i.e., gallery degradation and face hallucination. We have proposed a multi-scale degradation method for the high-resolution galleries and a multi-scale and multi-hypothesis face hallucination method to improve the quality of the probe images. We have also explored the combination of these two methods using score-level and feature-level fusion techniques. Our experiments on the SCFace and DroneSURF datasets have shown that both methods can improve cross-resolution face recognition accuracy. However, face hallucination was not useful on DroneSURF due to poor image quality. Our findings emphasize the importance of considering image quality when selecting face recognition methods. The combination of gallery degradation and face hallucination is likely to provide the best results for cross-resolution face recognition with relatively high-quality probe images, while degradation alone may be more appropriate for low-quality probe images. Our proposed strategies are agnostic with respect to the deep face recognition model being used and do not require any fine-tuning on the target dataset.
However, in the worst-case scenario of face resolution and image quality, our proposed method is still marginally outperformed by previous approaches that include fine-tuning on the target domain. Thus, if extensive footage from the target domain exists, using it for training currently still yields better performance. Our proposed method, however, presents a promising approach for scenarios where this is not the case, and could serve as a basis for long-range recognition applications in those domains.
As part of our future work, we plan to explore blind super-resolution methods that are robust to unknown degradations, particularly for the case of low-quality face images. This would address the challenge of poor-quality face images and potentially improve the performance of the face hallucination method in such cases. Additionally, we intend to extend the ideas presented in this work to multi-frame super-resolution models that are capable of inferring high-frequency face information from a sequence of low-resolution frames, instead of hallucinating it from a single input face. Such strategies are expected to further address the challenges of cross-resolution face recognition.