Introduction
Face super-resolution (FSR) has attracted increasing attention and is used in various
image-based applications. FSR, also known as face hallucination, seeks to
reconstruct a high-resolution (HR) face image from a low-resolution (LR) input. Owing
to the constraints of acquiring high-quality images and the influence of imaging
conditions, face images captured in real-world scenarios often have poor perceptual
quality. Low-quality face images degrade the performance of face-image-based
applications such as face detection [1] and face recognition [2]. FSR is a particular
instance of the single image super-resolution (SISR) technique [3] and is considered
an ill-posed problem because of the ambiguity in reconstructing the pixels of face
images. An FSR model is designed to capture the unique characteristics of facial
attributes and to optimize the recovery of facial detail, whereas an SISR model
enhances a wide variety of image content without considering the features of facial
attributes [4]. The face is therefore treated as a highly structured object and used
as reliable prior knowledge for reconstructing the global face structure, after which
the local face information is recovered according to the reconstructed global face
attributes [5].
To utilize the facial prior guidance approach in FSR models, techniques such as face
parsing maps, landmark heatmaps, spatial attention maps, and three-dimensional facial
guidance [6], [7], [8], [9], [10], [11] are used to recover the facial attributes
(eyes, lips, nose, and eyebrows). Owing to this prior guidance, such FSR models excel
at large up-sampling scales.
Although facial-prior-guided models outperform non-prior approaches in recovering
global face information, they suffer from inaccurate or even wrong prior guidance
because the input face images lack sharp detail. A further challenge encountered in
FSR models is the inadequately low dimension of the input LR face images.
To improve the capability of the FSR model to recover more accurate global information from poor-quality inputs, Ma et al. [5] proposed a face hallucination model comprising two recurrent networks. These networks operate iteratively and improve the performance of both facial component recovery and landmark detection. Furthermore, to improve the restoration of local details, they strengthened the landmark guidance with an attentive module that meticulously aggregates the individual facial attributes.
Although this iterative collaboration between the attentive module and the estimated facial landmarks enables the FSR model to recover more global and local detail than other prior-guided approaches, the obstacles of poor-quality inputs, and the consequent failure to recover high-frequency details at the mid-level of the FSR model, remain.
To enhance the mid-level information recovery capabilities of the FSR model and
achieve a more realistic texture in face images, a viable approach is to leverage the
spatial information of the features. This goal is effectively addressed by
incorporating the Spatial Feature Transform (SFT) [12]. Nevertheless, the sub-optimal
quality of the low-level features extracted from unclear, degraded LR inputs hinders
the SFT module's recovery of intricate facial details. To mitigate the degradation
effects associated with the low-level extracted features, a feasible solution is to
employ a non-local (NL) module [13] in conjunction with the residual channel
attention technique [14]. The non-local module captures long-range dependencies
within the features, promoting a more comprehensive understanding of facial
structures. Simultaneously, the residual channel attention technique sharpens the
focus on critical facial details by selectively emphasizing informative channels.
This combined approach addresses the shortcomings in the quality of low-level
features, fostering a more refined and context-aware representation. By incorporating
the non-local module and the residual channel attention technique, we aim to
significantly enhance the efficacy of the SFT module in recovering facial details,
ultimately improving the capability of the attentive module to boost the landmark
guidance and the overall performance of our FSR model at large up-sampling scales.
Our research presents a novel approach to FSR by employing multi-stage refinement techniques that harness the potential of an iterative collaboration process between the recovery and landmark estimation networks. This innovative framework enhances the model’s capability to recover higher fidelity and more detailed facial attributes compared to baseline models. We introduce three key contributions to our FSR model:
We propose an NL module at the early stage of our FSR network to reduce the noise effects of low-resolution face images and produce an enhanced feature representation. This module effectively addresses the noise degradation in low-quality inputs, improving feature quality.
We employ a residual channel attention (RCA) module on the low-resolution features at the early stage of the network to capture inter-channel relationships in the low-resolution feature maps and to emphasize the most informative channels, enhancing the model's ability to capture intricate facial details.
We develop an SFT module that applies an affine transformation to spatially adjust the features according to facial characteristics derived from facial heatmaps. The SFT is applied at the mid-level of the proposed model, before the upsampling layer, and improves the effectiveness of the upscaling process by bringing relevant facial details into focus.
The illustration in Figure 1 showcases the effectiveness of our multi-stage refinement process in producing higher-fidelity and more detailed face images. In Figure 1, Sample (a) demonstrates a fidelity comparison of the generated SR image with closed eyes, highlighting its superiority over the DIC [5] model. Sample (b) further illustrates the model's capability to recover additional facial details, again surpassing the DIC [5] model.
The remaining sections of the article are structured as follows. Section II briefly reviews the relevant works. Section III presents the methodology and the proposed model's architecture. The implementation details, datasets, and experimental results are presented in Section IV. Section V provides the discussion and future work. Finally, Section VI concludes the proposed FSR research.
Related Works
A. Conventional Face Super-Resolution
The idea of improving the quality of face images, known as face hallucination (face super-resolution), became a focus of interest when researchers proposed the first general super-resolution (SR) models. In 2000, Baker et al. [17] proposed the first FSR model, which improves the resolution of face images by learning from training image datasets. This relatively simple model produces a super-resolved face image by capturing facial structure while ignoring the recovery of high-frequency information.
Liu et al. [18] proposed a two-step FSR architecture to address this limitation: it first restores a coarse face image using a linear network in the initial stage and then retrieves high-frequency details using a non-parametric Markov technique. Although this model recovers more facial detail than its predecessor [17], room remained for retrieving further facial details. Several FSR models based on traditional machine-learning algorithms were proposed to tackle this issue [19], [20], [21], [22], [23], [24]. Between 2004 and 2007, to improve the generation of crucial facial details, Chang et al. [19], Wang et al. [20], and Chakrabarti et al. [21] applied local embedding, eigen-transformation, and kernel principal component analysis, respectively. Furthermore, to boost the performance of shallow FSR architectures, Shi et al. [22], Jung et al. [23], and Jiang et al. [24] employed recursive regression, convex optimization based on a positional patch approach, and local smooth regression, respectively.
Nevertheless, these approaches prove ineffective in restoring natural facial
attributes, particularly when dealing with large up-sampling factors.
B. Deep Learning Based Face Super-Resolution
Owing to the rapid development of deep learning in computer vision, there have been significant advancements in face hallucination using deep learning methods [4]. Our review focuses on the design of diverse network structures for FSR and on techniques involving attention mechanisms, facial geometry alignment, and textural and contextual information. Table 1 presents a comparison of FSR models, highlighting the year of publication, a brief overview of the methodology, the accuracy metric, the dataset used, and the examined scale factor.
In 2015, the Bi-channel Convolutional Neural Network (CNN) [25] pioneered the use of
CNNs to super-resolve face images by adaptively combining the information of two
channels, marking a notable milestone in this area. Subsequent models then sought to
address its limitation of operating at larger scale factors.
In 2018, SuperFan [8] and FSRNet [6] utilized face geometry as prior knowledge in FSR models to improve the quality of the resultant images by effectively recovering local information. SuperFan [8] used a sub-network for face alignment and facial landmark detection and integrated this information into a generative adversarial network. FSRNet [6] used facial landmark heatmaps and parsing maps without strict alignment and integrated an adversarial loss into the model.
The Facial Attribute Capsules Network (FACN) [29] leverages an integrated representation model to comprehensively encapsulate facial information and reduce the noise effects in super-resolved face images in real-world scenarios. An integrated learning strategy generates the attribute capsules in semantic, probabilistic, and facial-attribute forms. The Spatial Attention Residual Network (SPARNet) [30] integrates a spatial attention mechanism into vanilla residual blocks to address the difficulty of recapturing finer facial details. This technique allows convolutional layers to focus specifically on essential facial structures while paying less attention to feature-scarce regions.
Ma et al. [5] introduced a deep iterative collaboration (DIC) between two recurrent
networks, one for face recovery and one for landmark estimation, to improve FSR
accuracy and quality. The iterative framework exploits prior landmark knowledge, and
the two networks progressively enhance each other's performance. The DIC [5] model
emphasizes the dynamic information exchange between the face image recovery network
and the landmark estimation network. Additionally, an innovative attentive fusion
module generates facial components and aggregates them attentively for improved
high-frequency detail restoration. While this framework has shown acceptable
capability in gradually producing results and utilizing an attentive fusion technique
to recover more facial details, its effectiveness is hindered by the degraded
low-resolution input.
To improve texture details and enhance facial structure, the Split-Attention in Split-Attention Network (SISN) [31] introduced an external-internal split attention group. This model considers the overall facial structure and fine texture details simultaneously, producing higher-fidelity face images. Extending [29], Bao et al. [11] proposed multi-attention modules (residual spatial attention and multi-scale patch embedding with spatial attention) in the SCTANet model to improve the recovery of both global and local information.
In 2023, Wang et al. [32] introduced FishFSRNet, which super-resolves face images using facial parsing information, designing a parsing-map attention fusion block (a parsing-map-guided approach) that integrates the parsing-map information with an attention mechanism. The multi-scale refine block in this model attempts to preserve spatial and contextual details and to recover high-resolution information and context from low-resolution features.
The Multi-Stage Generative Adversarial Network (MSGAN) [33] proposed an end-to-end head-pose estimation network integrated with the FSR network; a pose-aware adversarial loss and head-pose alignment feedback improve the fidelity of non-frontal face images in real-world scenarios. To improve the recovery of local and global facial detail, the CNN-Transformer Cooperation Network (CTCNet) [34] employs a multi-scale connected encoder-decoder framework as its backbone; its structure attention module, together with a Transformer block, enhances the consistency of local facial details and the global facial structure simultaneously.
To increase the accuracy of the FSR model and generate images of better perceptual quality from LR faces, FMANet [35] introduced a facial mask attention module to enhance the identity fidelity of the resolved face image. Moreover, a MaskPix loss function was introduced to selectively emphasize pixels containing dense identity features. SFMNet [36] proposed a frequency-spatial interaction block based on the Fourier transform to achieve optimal performance in recovering global and local facial dependencies; exploring the correlations between the spatial and frequency domains improves the generation of global and local signals in SR images.
Methodology
The overall MSRFSR architecture comprises three branches: a refinement network, a face recovery network, and a face alignment network (FAN), as shown in Figure 2. The LR input suffers from degradation (noise, blurring, and lack of sharpness), which calls for a multi-stage enhancement procedure [12], [13], [14]. The first convolution layer extracts the LR features from the LR input, and these are up-scaled by a factor of two. The features are then fed into the proposed multi-stage refinement network, which consists of a non-local module and residual channel attention (RCA) blocks, as depicted in Figure 2.
The face recovery network and the FAN branch interact through an iterative
collaboration approach. The face recovery branch takes the refined features as input
and generates the first super-resolved face image. The generated super-resolution (SR)
image is fed into the FAN branch to estimate the alignment and generate facial
heatmaps, which serve as prior guides fused with the attentive module in the iterative
collaborative process. The first generated SR image is computed as (1):
\begin{equation*} I_{SR(1)}=M_{SR}(M_{Ref}(I_{LR}))+U(I_{LR}) \tag{1}\end{equation*}
where $M_{Ref}$ denotes the multi-stage refinement network, $M_{SR}$ the face recovery network, and $U(\cdot)$ the up-sampling of the LR input, used as a global skip connection.
The refined features at iteration $n$ are modulated by the heatmaps estimated in the previous iteration, as in (2):
\begin{equation*} F_{n} = M_{Ref}(I_{LR}, I_{SR(n-1)}) \cdot HM_{SR(n-1)} \tag{2}\end{equation*}
Based on the iterative collaboration approach, the SR image at iteration $n$ is reconstructed as (3):
\begin{equation*} I_{SR(n)}= M_{SR}(F_{n})+M_{ST}(HM_{SR(n-1)})+U(I_{LR}) \tag{3}\end{equation*}
where $M_{ST}$ denotes the spatial feature transform applied to the estimated heatmaps.
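For concreteness, the following is a minimal sketch of how Eqs. (1)-(3) chain together, assuming PyTorch and treating `refine_net`, `recovery_net`, `fan`, and `sft` as stand-ins for $M_{Ref}$, $M_{SR}$, the FAN, and $M_{ST}$; the module interfaces, the broadcast-compatible shapes, and the default `scale=8` are assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def iterative_sr(lr, refine_net, recovery_net, fan, sft, steps=4, scale=8):
    """Chain Eqs. (1)-(3): refine, recover, align, and repeat.

    `refine_net`, `recovery_net`, `fan`, and `sft` stand in for M_Ref,
    M_SR, the FAN, and M_ST; shapes are assumed broadcast-compatible.
    """
    # Up-sampled LR input U(I_LR), used as a global skip connection.
    up = F.interpolate(lr, scale_factor=scale, mode="bicubic",
                       align_corners=False)
    sr = recovery_net(refine_net(lr)) + up                # Eq. (1)
    for _ in range(steps - 1):
        heatmaps = fan(sr)                                # HM_SR(n-1)
        feats = refine_net(lr, sr) * heatmaps             # Eq. (2)
        sr = recovery_net(feats) + sft(heatmaps) + up     # Eq. (3)
    return sr
```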
A. Multi-Stage Refinement Network
The multi-stage refinement network comprises the NL and RCA blocks, as demonstrated in Figure 2. It aims to reduce the degradation effects in the low-level features extracted at this stage of the proposed model. These enhancements optimize the recovery and alignment procedures so that the capability of the iterative collaboration technique in our FSR model is fully utilized.
The non-local attention technique is dedicated to recovering the global details in LR feature maps by exploiting the interdependence of pixels [37], [38], [39]. The proposed NL module is an attention mechanism that detects recurring patterns and textures across different regions of the feature maps [37]. Figure 3 depicts the non-local module's architecture.
Specifically, instead of attending only to neighboring pixels, a larger neighborhood
region is considered. Detecting similar intensity or texture characteristics using
weighted averages in the non-local technique is defined by (4):
\begin{equation*} z_{i}= W(t)\cdot y_{i}+I_{LR} \tag{4}\end{equation*}
where the response $y_{i}$ is computed as (5):
\begin{equation*} y_{i}= \mathrm{softmax}\left(W({\theta }) \cdot W({\phi })\right)\cdot W(g) \tag{5}\end{equation*}
where $W(\theta)$, $W(\phi)$, and $W(g)$ are learned embeddings of the input features and $W(t)$ is the output transform.
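As a reference point, here is a minimal embedded-Gaussian non-local block in PyTorch that matches the shape of Eqs. (4)-(5); the channel-reduction factor and the placement of the residual connection are assumptions rather than the exact configuration of Figure 3.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block: every position attends to all
    others, so recurring textures anywhere in the map reinforce each other."""

    def __init__(self, channels: int, reduction: int = 2):
        super().__init__()
        inter = max(channels // reduction, 1)
        self.theta = nn.Conv2d(channels, inter, 1)   # W(theta) in Eq. (5)
        self.phi = nn.Conv2d(channels, inter, 1)     # W(phi)  in Eq. (5)
        self.g = nn.Conv2d(channels, inter, 1)       # W(g)    in Eq. (5)
        self.out = nn.Conv2d(inter, channels, 1)     # W(t)    in Eq. (4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, inter)
        k = self.phi(x).flatten(2)                     # (b, inter, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, inter)
        attn = torch.softmax(q @ k, dim=-1)            # pairwise weights
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)   # Eq. (5)
        return self.out(y) + x                         # Eq. (4): residual
```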
The proposed RCA blocks implement a channel attention technique that employs convolutional layers to increase the network's capability to learn channel-specific weights [14], [40]. As shown in Figure 4, the proposed RCA module comprises convolutional layers with ReLU activation functions, global average pooling, and a sigmoid activation function. This approach helps recover and precisely extract channel-wise information from low-resolution feature maps.
The weight values act as an attention map, assigning higher values to more critical information and lower values to less important information. After each convolution layer, the activation function enables the deep-learning model to learn more complex channel dependencies. A skip connection allows the network to bypass the low-frequency information from an earlier layer and integrate it with the features from a later layer [3], [40]. In the proposed multi-stage refinement network, applying RCA blocks after the non-local module is a strategic choice to emphasize inter-channel facial detail effectively: the NL module mitigates noise degradation at the initial stage of the refinement procedure [38], [39], while the RCA blocks bolster the inter-channel facial detail representations of the low-level feature maps [14]. This combined technique addresses noise issues and ensures a more comprehensive and refined treatment of facial details, ultimately optimizing the collaboration between the face recovery and face alignment networks.
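The description above maps onto the familiar residual channel attention block pattern [14]; the following PyTorch sketch is one plausible realization, with the kernel sizes and the reduction ratio as assumptions.

```python
import torch
import torch.nn as nn

class RCABlock(nn.Module):
    """Residual channel attention: two convs extract features, a squeeze
    (global average pooling) and excitation (1x1 convs + sigmoid) branch
    reweights channels, and a skip connection bypasses low frequencies."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global avg pool
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                  # channel weights
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.body(x)
        return x + f * self.attn(f)   # weighted features + skip connection
```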
B. Face Recovery Network
The face recovery network comprises an attentive fusion module, a recurrent super-resolution module, and an SFT block, as shown in Figure 5.
The attentive fusion module is part of the face recovery branch and fuses information between the face recovery network and the FAN. Gradients can thus be back-propagated recursively through both the face recovery network and the FAN, facilitating the use of the different facial components as landmark guidance for our FSR model.
The facial landmark heatmap channels are grouped into specific facial components, including the mouth, nose, right and left eyes, and jawline. The channels corresponding to each facial component are summed, and a softmax operation is applied to create a facial attention heatmap for the respective attribute.
Guided by the facial attention heatmaps, the component-specific features are extracted by group convolution, and the weighted features from the multi-stage refinement network are added to form the attentively fused output, as sketched below.
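Below is a hedged sketch of that fusion step; the grouping of heatmap channels into the five components assumes a 68-point landmark layout, and the index ranges and convolution widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical grouping of 68-point landmark heatmap channels into the
# five components named in the text; the index ranges are assumptions.
COMPONENTS = {
    "jawline": list(range(0, 17)),
    "nose": list(range(27, 36)),
    "right_eye": list(range(36, 42)),
    "left_eye": list(range(42, 48)),
    "mouth": list(range(48, 68)),
}

class AttentiveFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        n = len(COMPONENTS)
        # One convolution branch per facial component via group convolution.
        self.group_conv = nn.Conv2d(n * channels, n * channels, 3,
                                    padding=1, groups=n)
        self.fuse = nn.Conv2d(n * channels, channels, 1)

    def forward(self, feats: torch.Tensor, heatmaps: torch.Tensor):
        # feats: (b, c, h, w); heatmaps: (b, 68, h, w), same spatial size.
        b, c, h, w = feats.shape
        attended = []
        for idx in COMPONENTS.values():
            comp = heatmaps[:, idx].sum(dim=1, keepdim=True)  # sum channels
            attn = torch.softmax(comp.flatten(2), dim=-1).view(b, 1, h, w)
            attended.append(feats * attn)       # spatially weighted features
        out = self.group_conv(torch.cat(attended, dim=1))
        return feats + self.fuse(out)           # add refined features back
```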
Following the DIC [5] model, the super-resolution feedback network [41] is employed in
our model's face recovery network. The proposed recurrent SR module is structured to
utilize feedback connections and generate robust high-level representations, taking
the attentively fused features as its input.
Preserving spatial information is essential for the generated face images, since it maintains the local attributes necessary for fidelity and for recovering fine details in the various spatial regions of a facial image [12], [42], [43], [44]. Therefore, an SFT block is employed, in which spatial conditions produced from the heatmaps modulate the recurrent SR features spatial-wise.
The architecture of the proposed SFT module is demonstrated in Figure 5.
The SFT module learns a spatial mapping function $\mathcal{M}$ that maps the heatmap-derived conditions $\psi$ to a pair of modulation parameters $(\alpha, \beta)$, as in (6):
\begin{equation*} {\mathcal {M}}: {\psi }{\rightarrow }({\alpha }, {\beta }) \tag{6}\end{equation*}
The learned $(\alpha, \beta)$ pair then modulates the features $F$ through an affine transformation, as in (7):
\begin{equation*} \hat {y}={SFT}({F} \,\vert\, {\alpha }, {\beta })= {\alpha } \odot {F} \oplus {\beta } \tag{7}\end{equation*}
where $\odot$ and $\oplus$ denote element-wise multiplication and addition.
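In code, Eqs. (6)-(7) correspond to a small conditioning network that predicts per-pixel scale and shift maps; the sketch below assumes a two-layer conditioning design and hypothetical channel counts.

```python
import torch
import torch.nn as nn

class SFT(nn.Module):
    """Spatial feature transform of Eqs. (6)-(7): a conditioning network
    maps heatmap conditions psi to (alpha, beta), which modulate the
    features as alpha * F + beta (element-wise)."""

    def __init__(self, feat_ch: int, cond_ch: int, hidden: int = 64):
        super().__init__()
        self.shared = nn.Sequential(               # M: psi -> shared code
            nn.Conv2d(cond_ch, hidden, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
        )
        self.alpha = nn.Conv2d(hidden, feat_ch, 3, padding=1)  # scale map
        self.beta = nn.Conv2d(hidden, feat_ch, 3, padding=1)   # shift map

    def forward(self, feats: torch.Tensor, cond: torch.Tensor):
        s = self.shared(cond)                           # Eq. (6)
        return self.alpha(s) * feats + self.beta(s)     # Eq. (7)
```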
C. Face Alignment Network
The face alignment process in our FSR framework uses accurate facial landmarks to
guide the face recovery network, progressively improving face image fidelity and
landmark estimation. Our model refines the landmark estimation at each step of the
iterative approach, providing more precise auxiliary information that is efficiently
combined, as prior guidance, with the spatial feature transform operation and the
attentive fusion module. In turn, the iterative collaboration and feedback between the
two network branches enhance the performance of both landmark estimation and face
recovery, ultimately improving the overall efficiency of the proposed FSR model.
The FAN in our model is based on DIC [5] and includes pre-processing and
post-processing components. The FAN takes the generated SR image as input and
estimates the facial landmark heatmaps that guide the next iteration.
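For reference, FAN-style landmark heatmaps are commonly rendered as one Gaussian per landmark; the snippet below is a generic sketch of that convention, with `sigma` and the heatmap size as hypothetical choices rather than values from the paper.

```python
import numpy as np

def render_heatmaps(landmarks: np.ndarray, size: int,
                    sigma: float = 1.5) -> np.ndarray:
    """Render one Gaussian heatmap per (x, y) landmark on a size x size
    grid, the usual FAN-style output format."""
    ys, xs = np.mgrid[0:size, 0:size].astype(np.float32)
    maps = np.empty((len(landmarks), size, size), dtype=np.float32)
    for i, (x, y) in enumerate(landmarks):
        maps[i] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps
```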
Experiments and Discussions
In this section, we first introduce the implementation settings of the MSRFSR model. Then, we explore the effectiveness of various components, and finally, we compare the performance of our model with that of other state-of-the-art models quantitatively and qualitatively.
A. Implementation Settings
We conducted experiments using the CelebA [15] and Helen [16] datasets, which are
widely used for face super-resolution models. To prepare the data, following the
standard practice of other FSR models [5], we first utilized the estimated landmarks
and extracted square sections from each image to remove the background. These sections
were then resized to the target HR resolution.
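A hedged sketch of this preprocessing step is shown below; the margin factor and the `hr_size=128` target are hypothetical values for illustration, not the paper's exact settings.

```python
import numpy as np
from PIL import Image

def crop_face(img: Image.Image, landmarks: np.ndarray,
              hr_size: int = 128) -> Image.Image:
    """Crop a square region around the estimated landmarks to remove the
    background, then resize to the target HR resolution."""
    (x0, y0), (x1, y1) = landmarks.min(axis=0), landmarks.max(axis=0)
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half = max(x1 - x0, y1 - y0) / 2 * 1.2   # small margin around the face
    box = (int(cx - half), int(cy - half), int(cx + half), int(cy + half))
    return img.crop(box).resize((hr_size, hr_size), Image.BICUBIC)
```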
The training objective function utilized in our FSR model is the pixel-wise L1 loss, defined as (8):
\begin{equation*} \mathcal {L}_{\text {L1}} = \frac {1}{N} \sum _{i=1}^{N} \left |{ y_{i} - \hat {y}_{i} }\right | \tag{8}\end{equation*}
where $y_{i}$ and $\hat{y}_{i}$ denote the $i$-th pixel of the ground-truth and reconstructed images, respectively, and $N$ is the number of pixels.
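Eq. (8) is the standard mean absolute error; as a sanity check, it reduces to the following one-liner, equivalent to `torch.nn.L1Loss()` with its default mean reduction.

```python
import torch

def l1_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Eq. (8): mean absolute difference over all N pixels."""
    return (target - pred).abs().mean()
```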
The number of iterations in our iterative super-resolution approach was set to four, the optimal step count identified in the analysis below.
The assessment of super-resolved face images involves the use of the Peak Signal-to-Noise Ratio (PSNR), Learned Perceptual Image Patch Similarity (LPIPS) [50], Structural Similarity Index (SSIM) [51], and Frechet Inception Distance (FID) [52]. These metrics are calculated on the Y channel within the transformed YCbCr color space.
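Since the metrics are computed on the Y channel, the evaluation pipeline first converts RGB outputs to BT.601 luma; the following sketch shows PSNR computed this way (the other metrics are assumed to follow their reference implementations).

```python
import numpy as np

def psnr_y(sr: np.ndarray, hr: np.ndarray) -> float:
    """PSNR on the Y channel of YCbCr for uint8 RGB arrays of shape (H, W, 3)."""
    def to_y(img: np.ndarray) -> np.ndarray:
        img = img.astype(np.float64) / 255.0
        # ITU-R BT.601 luma, in the [16, 235] range used by YCbCr.
        return (16.0 + 65.481 * img[..., 0] + 128.553 * img[..., 1]
                + 24.966 * img[..., 2])
    mse = np.mean((to_y(sr) - to_y(hr)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```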
B. Investigation of Different Iterative Steps in Training
In this section, we comprehensively evaluate the proposed iterative collaboration
framework. The evaluation metrics are PSNR, SSIM, and LPIPS. We present the results in
Table 3 and Table 4, where each table corresponds to a different aspect of our model's
performance. Specifically, the evaluation covers iterative steps ranging from step 1
to step 5, using the CelebA [15] and Helen [16] datasets at both examined scale
factors.
A comprehensive analysis of the iteration steps in Table 3 and Table 4 shows a
consistent, gradual improvement in PSNR, SSIM, and LPIPS from step 1 to step 4,
whereas step 5 yields no discernible enhancement in these metrics. According to these
results, the optimal iteration step for our model, giving the best PSNR, SSIM, and
LPIPS values, is step 4.
C. Visual Comparisons of Face Recovery in Different Iteration Steps
The previous section compared various iteration steps to determine the optimal number
of iterations for our model. Here, Figures 6-9 visually compare the face
reconstructions at three steps of the iterative collaboration approach on CelebA and
Helen images at the two examined scale factors.
Figure 6. Visual comparison of face image reconstruction in three steps of the iterative collaboration approach on a CelebA image (first scale factor).
Figure 7. Visual comparison of face image reconstruction in three steps of the iterative collaboration approach on a Helen image (first scale factor).
Figure 8. Visual comparison of face image reconstruction in three steps of the iterative collaboration approach on a CelebA image (second scale factor).
Figure 9. Visual comparison of face image reconstruction in three steps of the iterative collaboration approach on a Helen image (second scale factor).
D. Iterative Collaboration in Landmark Estimation
This section illustrates the face alignment process during the face recovery
procedure. In Figure 10, we showcase landmark estimation at four steps within our
iterative collaboration framework.
Figure 10. Landmark estimation in four steps of the iterative collaboration approach.
According to the results in Figure 10, alignment and landmark estimation for challenging face poses become more accurate with each iterative recovery of the face image. This indicates a gradual improvement in the prior knowledge, which in turn assists the proposed FSR model in recovering more accurate and faithful face images by sharpening the estimation of the prior information within the iterative collaboration framework.
E. Ablation Study
This section presents a systematic investigation of the impact of the NL, RCA, and SFT
modules. Table 5 and Table 6 showcase the performance metrics, including PSNR, SSIM,
LPIPS, and FID, of the proposed FSR model equipped with different NL, RCA, and SFT
configurations. The evaluations are conducted on the CelebA [15] and Helen [16]
datasets at both examined scale factors.
In these tables, we compare the baseline model (without any attention module) against variations of the refinement modules: the RCA module only, NL combined with RCA, the SFT module only, RCA combined with SFT, and finally the combination of all three proposed modules. Across both scales and datasets, the combined application of all three attention modules (NL, RCA, and SFT) consistently yields a noteworthy improvement in PSNR, SSIM, FID, and LPIPS over the baseline and the other combinations.
To attain a comprehensive understanding of the impact of the distinct refinement
modules within the FSR model and to discern their relative significance in shaping the
overall performance, the evaluation results are visually presented in Figure 11,
Figure 12, Figure 13, and Figure 14. These figures report the performance metrics
PSNR, SSIM, and LPIPS, and they facilitate a comparative analysis of the model's
performance across the various combinations of refinement modules at both scale
factors.
Figures 11-14. Performance comparison (PSNR/SSIM/LPIPS) of our model with various attention module configurations on the CelebA and Helen datasets at the two examined scale factors.
According to the visualizations in these figures, using the SFT module alone in our mid-level configuration to preserve spatial information contributes less to improving the accuracy of the FSR model. However, when combined with RCA, and with both NL and RCA, it exhibits the second-best and the best performance in the FSR model, respectively. The results suggest that the effectiveness of the SFT module on mid-level features is limited by issues arising from the degraded, poor quality of the LR face inputs.
In contrast, employing the NL attention module to address noise and degradation at the initial stage, coupled with the application of RCA to emphasize inter-channel dependencies at the low-level stage, significantly enhances the performance of SFT. Consequently, this combination performs best, yielding the highest PSNR and SSIM and the lowest LPIPS on both scales and datasets.
F. Visual Comparisons of Contributing Different Attention Modules
In this section, we assess the impact of each attention module on generating higher-quality face images. Figure 15 illustrates four sample images from the CelebA [15] and Helen [16] datasets. We extract face image patches and evaluate their visual quality. Specifically, we compare the visual quality of the baseline model, the model utilizing the NL module, the combination of the NL and RCA modules, and finally the combination of the NL, RCA, and SFT modules.
Figure 15. Visual comparison of the contributions of different attention modules: baseline, NL, NL+RCA, and NL+RCA+SFT.
As depicted in Figure 15, the perceptual quality of the results employing the NL, RCA, and SFT refinement modules improves significantly compared to the other methods, and the PSNR and SSIM metrics of this combination surpass those of the other configurations. In other words, the proposed model successfully recovers more facial details than the baseline and the other combined approaches.
G. Comparison With Other Methods
This section compares our quantitative and qualitative results with several other methods.
We conduct quantitative comparisons with various models [5], [6], [8], [11], [26], [27], [29], [30], [31], [32], [35], [40], [53], [54], [55], [56], [57], [58].
Table 7 presents comparisons of PSNR, SSIM, and LPIPS on the CelebA and Helen datasets
at the first examined scale factor.
Figure 16. Comparison of performance improvement (PSNR) at the first examined scale factor.
Table 8 provides comparisons of PSNR, SSIM, and LPIPS on the CelebA and Helen datasets
at the second examined scale factor.
Figure 17. Comparison of performance improvement (PSNR) at the second examined scale factor.
Figure 18 presents visual comparisons of the proposed model at the first examined
scale factor.
Figure 18. Visual comparison of our model with other state-of-the-art methods at the first examined scale factor.
Figure 19 illustrates visual comparisons of the proposed method at the second examined
scale factor.
H. Comparison of Network Complexity and Performance
Figure 20 and Figure 21 illustrate our model's network complexity and performance,
compared with other state-of-the-art models, on the CelebA and Helen datasets at the
two examined scale factors.
Figure 20. Performance and network complexity evaluated on the CelebA and Helen datasets at the first examined scale factor.
Figure 21. Performance and network complexity evaluated on the CelebA and Helen datasets at the second examined scale factor.
The comparison in Figure 20 shows that WSRnet [27] has the highest number of network parameters, with PSNR values of 26.83 dB and 36.02 dB on the CelebA and Helen datasets, respectively. Notably, our model reaches 27.69 dB and 27.18 dB on CelebA and Helen, respectively, while its network complexity is roughly 2.7 times lower than that of WSRnet [27] at this scale. Compared to the DIC [5] model, our model contains 4 million more parameters but achieves higher PSNR on both datasets.
Based on Figure 21, the network complexity of our model is lower than that of FishFSRNet [32] and SCTANet [11], while its performance surpasses both on the CelebA dataset. On the Helen dataset, our model performs within the same range as SCTANet [11] (only a 0.26 dB difference) while containing 3.20 million fewer parameters. Compared to the baseline model [5], our model demonstrates significant performance improvements on both datasets at this scale.
I. User Study
Table 9 and Table 10 present the results of our user study evaluation of the model at
the two examined scale factors.
Our evaluation criteria were designed to capture subjective preferences for visual quality, naturalness, and perceptual similarity to the HR images. Participants rated the face images on a 5-point scale ranging from “Poor” to “Excellent,” with 1 indicating the lowest score and 5 the highest. The highest scores in the two tables indicate the best perceptual quality among the compared models.
J. Quantitative and Qualitative Comparison on AFLW2000 and WFLW Datasets
We conduct quantitative comparisons with the baseline model [5] on the AFLW2000 [46]
and WFLW [47] datasets. Tables 11 and 12 present PSNR, SSIM, LPIPS, and FID at the two
examined scale factors.
Figure 22 demonstrates the visual comparison between our model and DIC [5] at the
first examined scale factor.
Figure 22. Visual comparison of our model with the baseline at the first examined scale factor.
Figure 23 compares our model and DIC [5] at the second examined scale factor.
Figure 23. Visual comparison of our model with the baseline at the second examined scale factor.
Discussion and Future Work
To enhance fidelity and detail when generating face images at large scale factors, the NL module and RCA technique at the low-level stage focus on critical facial details and effectively mitigate shortcomings in feature quality, leading to a refined and context-aware representation. Additionally, the SFT module designed into the mid-level architecture contributes significantly to recovering high-frequency facial attributes by leveraging spatial information. However, using the SFT module alone in the mid-level configuration contributes less to improving the accuracy of the FSR model, primarily because of issues arising from degraded, poor-quality LR face inputs. In contrast, employing the NL attention module to address noise and degradation at the initial stage, coupled with the application of RCA to emphasize inter-channel dependencies, significantly enhances the performance of the SFT module at the mid-level. Consequently, this combination of multi-attention techniques performs best at generating faithful and detailed face images compared to previous FSR models, yielding the best PSNR, SSIM, LPIPS, and VIF scores across large scales and four datasets.
Although the proposed model excels in generating faithful and detailed full-face images across diverse angles, its performance degrades when tasked with generating profile face images. This limitation arises because the FAN module was tuned exclusively for frontal views and the training datasets (CelebA and Helen) contain only full-face images. Consequently, the model is less accurate in reconstructing profile face images, where only partial facial features, such as one eye, one side of the nose, and a portion of the mouth, are visible. This constraint restricts the model's applicability in scenarios where profile images are prevalent or necessary.
Future research will address this limitation by tuning the FAN module to accommodate profile faces and by incorporating additional training data containing profile face images. This effort could enhance the model's robustness and broaden its utility in real-world applications, thus advancing the effectiveness and applicability of FSR techniques in diverse practical settings.
Conclusion
This research introduces a Multi-stage Refining Face Super-resolution (MSRFSR) model,
establishing a novel paradigm through iterative collaboration between landmark
estimation and an attentive recovery network. The formidable challenge posed by
degraded, low-dimensional LR inputs is mitigated by the proposed multi-stage
refinement network, which combines the NL, RCA, and SFT modules to recover more
faithful and detailed face images at large up-sampling scales.
In future work, we will integrate the proposed FSR model with real-time face recognition systems. This extension aims to explore the potential impact of our model on enhancing recognition accuracy in practical scenarios.
ACKNOWLEDGMENT
The authors would like to express their sincere gratitude to the Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University.