
MSRFSR: Multi-Stage Refining Face Super-Resolution With Iterative Collaboration Between Face Recovery and Landmark Estimation


Abstract:

Face Super-resolution (FSR) models encounter a significant challenge related to extremely low-dimensional (16\times 16 pixels) and degraded input images. This deficiency of crucial facial details at the low and intermediate levels of the FSR model presents obstacles in tasks such as face alignment and landmark detection and, consequently, difficulty in recovering high-frequency details, resulting in unfaithful and unrealistic super-resolved face images. This research proposes an innovative FSR model with strategically designed multi-attention techniques to enhance facial attribute recovery capabilities. The model incorporates a Non-local Module (NL) and a residual pixel attention technique at the low-level stage of the FSR model. Simultaneously, a Spatial Feature Transform (SFT) module refines mid-level features by leveraging spatial information through an iterative interaction process between an attentive module and a landmark estimation network. By strategically utilizing these modules within an iterative collaboration framework, our method effectively addresses challenges in facial detail recovery, demonstrating enhanced model understanding and refined representation. The proposed model is rigorously examined on the CelebA, Helen, AFLW2000, and WFLW datasets at scale factors of \times 8 and \times 16 . The results consistently demonstrate the superiority of our proposed Multi-Stage Refining Face Super-Resolution (MSRFSR) model over state-of-the-art methods through extensive quantitative and qualitative experiments on four datasets and both scales.
Published in: IEEE Access ( Volume: 12)
Page(s): 56951 - 56972
Date of Publication: 16 April 2024
Electronic ISSN: 2169-3536

License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/). IEEE is not the copyright holder of this material.
SECTION I.

Introduction

Face super-resolution (FSR) has attracted increasing attention and is used in various image-based applications. FSR, also known as face hallucination, seeks to reconstruct a high-resolution (HR) face image from a low-resolution (LR) input. Due to the constraints in acquiring high-quality images and the influence of imaging conditions, face images are often captured with poor perceptual quality in real-world scenarios. This low-quality face image issue negatively affects the performance of face-image-based applications such as face detection [1] and face recognition [2]. FSR is a particular instance of the single image super-resolution (SISR) technique [3], and it is considered an ill-posed problem because of the ambiguity in reconstructing the pixels of face images. An FSR model is designed to capture the unique characteristics of facial attributes and optimizes the recovery of facial attribute detail. In contrast, an SISR model focuses on enhancing a wide variety of image content without considering the features of face attributes [4]. Because the face is a highly structured object, its configuration is utilized as reliable prior knowledge for reconstructing the global face structure, and the local face information is then recovered based on the reconstructed global face attributes [5]. To apply the facial prior guide approach in FSR models, different techniques such as face parsing maps, landmark heatmaps, spatial attention maps, and three-dimensional facial guides [6], [7], [8], [9], [10], [11] are used in the process of recovering face attributes (eyes, lips, nose, and eyebrows). Owing to the use of prior-guide techniques, FSR models outperform SISR models at large scale factors (\times 8 and above).

Although facial prior guide models outperform non-prior guide approaches in recovering global face information, they suffer from inaccurate or even wrong prior guide information due to the lack of sharp detail in the input face images. A noticeable challenge encountered in FSR models pertains to the very small dimensions and degraded quality of the input LR face image (often 16\times 16 or 8\times 8 pixels). This issue manifests as a deficiency in valuable facial details within the intermediate (mid-level) layers of the FSR model. The constrained information in such LR images poses a significant hurdle, impacting the model's ability to perform face alignment and landmark detection and to recover facial features effectively. Additionally, the LR face images may contain artifacts, such as a lack of sharpness and other degradations, making it difficult to recover the global and local details of the human face in a super-resolved image. The adverse impact of this problem is unfaithful and unrealistic super-resolved face images across different face poses.

To improve the capability of the FSR model in recovering more accurate global information from poor-quality input images, Ma et al. [5] proposed a face hallucination model comprising two recurrent networks. These networks are designed for iterative operation and improve the performance of facial component recovery and landmark detection tasks. Furthermore, to improve the restoration of local details, they enhanced the landmark information guidance utilizing an attentive module that aggregates the individual face attributes meticulously.

Although utilizing an iterative collaboration approach between the attentive module and estimated facial landmarks enhances the FSR model's ability to recover more global and local detail compared to other prior guide approaches, the obstacles of poor-quality input images, and the consequent failure to recover high-frequency details at the mid-level of the FSR model, remain.

To enhance the mid-level information recovery capabilities of the FSR model and achieve a more realistic texture for face images, a viable approach is to leverage the spatial information of features. This goal is effectively addressed by incorporating the Spatial Feature Transform [12] (SFT) module. Nevertheless, the sub-optimal quality of the low-level features extracted from unclear and degraded LR inputs hinders the SFT module's efficient recovery of intricate facial details. To mitigate the degradation effects associated with the low-level extracted features, a feasible solution is to employ a non-local [13] module (NL) in conjunction with the residual channel attention [14] technique. Integrating the non-local module captures long-range dependencies within the features, promoting a more comprehensive understanding of facial structures. Simultaneously, the residual channel attention technique enhances the focus on critical facial details by selectively emphasizing informative channels. This combined approach addresses the shortcomings in the quality of low-level features, fostering a more refined and context-aware representation. By incorporating the non-local module and residual channel attention technique, we aim to significantly enhance the efficacy of the SFT module in recovering facial details, ultimately strengthening the attentive module's use of landmark guidance and improving our FSR model at large up-sampling scales (\times 8 and \times 16 ).

Our research presents a novel approach to FSR by employing multi-stage refinement techniques that harness the potential of an iterative collaboration process between the recovery and landmark estimation networks. This innovative framework enhances the model’s capability to recover higher fidelity and more detailed facial attributes compared to baseline models. We introduce three key contributions to our FSR model:

  1. We propose an NL module at the early stage of our FSR network to reduce the noise effects of low-resolution face images and produce an enhanced feature representation. This module effectively addresses the noise degradation in low-quality inputs, improving feature quality.

  2. We employ a residual pixel attention module on low-resolution features at the early stage of the network to capture the inter-channel relationships in low-resolution feature maps and emphasize the importance of specific channels, enhancing the model's ability to capture intricate facial details.

  3. We develop an SFT module involving an affine transformation that spatially adjusts the features according to facial characteristics derived from facial heatmaps. The SFT module is applied at the mid-level of our proposed model, before the upsampling layer, and improves the effectiveness of the upscaling process by ensuring relevant facial details are brought into focus.

The illustration in Figure 1 showcases the effectiveness of our multi-stage refinement process in producing higher-fidelity and more detailed face images. In Figure 1, sample (a) demonstrates a fidelity comparison of the generated SR image with closed eyes, highlighting its superior performance relative to the DIC [5] model. Sample (b) further illustrates the model's capability to recover additional facial details, surpassing the performance of the DIC [5] model.

FIGURE 1.

Visual Comparison: Our MSRFSR model compared to DIC [5] on CelebA [15] and Helen [16] datasets at a scale factor of \times 8 reveals distinctive performance in fidelity and yields more detailed results.

The remaining sections of the article are structured as follows. Section II briefly reviews the relevant works. Section III provides the methodology and proposes the model’s architecture. The implementation details, datasets, and experimental results are demonstrated in Section IV. Section V demonstrates discussion and future work. Finally, the proposed FSR research is concluded in Section VI.

SECTION II.

Related Works

A. Conventional Face Super-Resolution

The idea of improving the quality of face images, known as face hallucination (face super-resolution), became a focus of interest when researchers proposed the first general super-resolution (SR) models. In 2000, Baker et al. [17] proposed the first FSR model. Their basic model improves the resolution of face images by learning from training image datasets. This model is relatively simple, producing a super-resolved face image by comprehending facial structures while ignoring high-frequency information recovery.

Liu et al. [18] proposed a two-step FSR architecture to address this limitation. This model first restores the coarse face image using a linear network architecture in the initial stage and then retrieves high-frequency details using a non-parametric Markov technique. Although this model improved the recovery of face details compared to the previous model [17], there was still room for retrieving more facial details. Several FSR models based on traditional learning algorithms have been proposed to tackle this issue [19], [20], [21], [22], [23], [24]. Between 2004 and 2007, to improve the generation of crucial facial details, Chang et al. [19], Wang et al. [20], and Chakrabarti et al. [21] applied local embedding, eigen-transformation, and kernel principal component analysis, respectively. Furthermore, to boost the performance of FSR models with shallow architectures, Shi et al. [22], Jung et al. [23], and Jiang et al. [24] employed recursive regression, convex optimization based on the positional patch approach, and local smooth regression, respectively.

Nevertheless, these approaches prove ineffective in restoring natural facial attributes, particularly when dealing with large up-sampling factors such as \times 8 . When the magnification factor is larger than \times 4 , these methods struggle to produce faithful super-resolved face images.

B. Deep Learning Based Face Super-Resolution

Due to the rapid development of deep learning in computer vision, there have been significant advancements in face hallucination using deep learning methods [4]. Our review focuses on the design of diverse network structures for FSR models and on techniques for handling different attention mechanisms, facial geometry alignment, and textural and contextual information. Table 1 presents a comparison of FSR models, highlighting key aspects such as the year of publication, a brief overview of the methodology employed, the accuracy metric, the dataset used, and the examined scale factor.

TABLE 1 Comparison of Various FSR Models

In 2015, the Bi-channel Convolutional Neural Network (CNN) [25] pioneered the use of CNNs to super-resolve face images by adaptively combining the information of two channels, marking a notable milestone in this area. To address the limitation of operating at larger scale factors (beyond \times 4 ), the URDGN [26] model introduced a discriminative generative network that resolves very low-resolution face images at a scale factor of \times 8 by incorporating a pixel-wise L2 loss function and utilizing a feedback branch from the discriminative network to enhance the recovered global information of the face image. To prevent the over-smoothing effect at large scale factors (\times 8 and \times 16 ), WSRNet [27] proposed a wavelet-based FSR network that operates in the wavelet transform domain instead of the image domain and designed a multiple-scale-factor technique within a unified framework. This FSR model learns to predict the series of HR wavelet coefficients corresponding to the LR input and improves the global information and local texture details of human faces. The extension of the WSRNet [27] model to generative adversarial networks is detailed in [28].

In 2018, SuperFAN [8] and FSRNet [6] utilized face geometry as prior knowledge in FSR models to improve the quality of resultant images by effectively recovering local information. SuperFAN [8] utilized a sub-network for face alignment and facial landmark detection and integrated this information into a generative adversarial network. FSRNet [6] used facial landmark heatmaps and parsing maps without strict alignment and integrated an adversarial loss into the model.

The Facial Attribute Capsules Network [29] (FACN) leverages an integrated representation model to comprehensively encapsulate facial information and reduce the noise effect of super-resolved face images in real-world scenarios. An integrated learning strategy generates the attribute capsules in semantic, probabilistic, and facial attribute manners. The Spatial Attention Residual Network [30] (SPARNet) integrated a spatial attention mechanism into vanilla residual blocks to address the difficulty in recapturing finer facial details. This technique allows convolutional layers to focus specifically on essential facial structures while reducing attention to regions with fewer features.

Ma et al. [5] introduced an iterative collaboration between two recurrent networks (DIC), focusing on recovery and landmark estimation to improve FSR accuracy and quality. The iterative framework exploits prior knowledge of landmarks, and the two networks progressively enhance each other's performance. The DIC [5] model emphasizes the dynamic information exchange between the face image recovery network and the landmark estimation network. Additionally, an innovative attentive fusion module generates facial components and aggregates them attentively for improved high-frequency detail restoration. While this framework has shown acceptable capability in gradually producing results and utilizing an attentive fusion technique to recover more facial details, its effectiveness is hindered by the degraded low-resolution input (16\times 16 ). This limitation prevents the approach from being exploited to its full advantage.

To improve texture details and enhance facial structure, the Split-Attention in Split-Attention Network [31] (SISN) introduced an External-Internal Split Attention Group. This model simultaneously considers the overall facial structure and fine texture details and attempts to produce higher-fidelity face images. As an extension of [29], to enhance the fidelity of facial results, Bao et al. [11] proposed multi-attention modules (Residual Spatial Attention and Multi-scale Patch embedding and Spatial attention) in the SCTANet model to improve the recovery of both global and local information.

In 2023, Wang et al. [32] introduced Super-Resolving Face Images by Facial Parsing Information (FishFSRNet) by designing a parsing map attention fusion block (a parsing map-guided approach) and integrating the parsing map information with an attention mechanism. The multi-scale refine block in this model attempts to preserve spatial and contextual details and to recover high-resolution information and context from low-resolution features.

The Multi-Stage Generative Adversarial Network [33] (MSGAN) proposed an end-to-end head-pose estimation network and integrated it with the FSR network. Utilizing a pose-aware adversarial loss and head-pose alignment feedback improves the fidelity of non-frontal face images in real-world scenarios. To improve the FSR model's capability to recover local and global facial detail, the CNN-Transformer Cooperation Network [34] (CTCNet) employed a multi-scale connected encoder-decoder framework as its backbone. Its structure attention module, along with a Transformer block, attempts to enhance the consistency of local facial detail and global facial structure simultaneously.

To increase the accuracy of the FSR model and generate better perceptual quality from LR faces, the FMANet [35] model introduced a facial mask attention module and attempted to enhance the identity fidelity of the resolved face image. Moreover, the MaskPix loss function was introduced in this model to selectively emphasize pixels containing dense identity features. SFMNet [36] proposed a frequency-spatial interaction block based on the Fourier transform technique to achieve optimal performance in recovering global and local facial dependencies. Exploring correlations between the spatial and frequency domains improves global and local signal generation in SR images.

SECTION III.

Methodology

The overall MSRFSR architecture comprises three branches: a refinement network, a face recovery network, and a face alignment network (FAN), as shown in Figure 2. The LR input suffers degradation (noise, blurring, and lack of sharpness), which requires a multi-stage enhancement procedure [12], [13], [14]. The first convolution layer extracts the LR features from the LR input and up-scales them by a factor of two. The features are fed into the proposed multi-stage refinement network consisting of a non-local module and the residual pixel attention blocks, as depicted in Figure 2.

FIGURE 2.

The architecture of MSRFSR model.

The face recovery network and the FAN branch interact using an iterative collaboration approach. The face recovery branch takes the refined features as input and generates the first super-resolved face image. The generated super-resolution (SR) image is fed into the FAN branch to estimate the alignment and generate facial heatmaps, which serve as prior guides for fusion with the attentive module in an iterative collaborative process. The first generated SR image is computed as (1):\begin{equation*} I_{SR(1)}=M_{SR}(M_{Ref}(I_{LR}))+U(I_{LR}) \tag {1}\end{equation*} where I_{SR(1)} and I_{LR} denote the first reconstructed image and the LR input image, respectively. M_{SR} and M_{Ref} represent the recurrent SR module and the multi-stage refinement network, respectively, and U denotes the up-sampling module.

Utilizing I_{SR(1)} , the FAN branch predicts the initial facial landmarks (68 landmarks on the face) and produces the first facial heatmaps, denoted HM_{SR(1)} . The facial heatmaps cover the left and right eyes, lips, nose, and face shape, and are utilized as prior knowledge for integration into the attentive fusion module. In the second step, the attentive module operates on two inputs: HM_{SR(1)} and M_{Ref}(I_{LR}) .

F_{n} denotes the output of the fusion module and is determined as (2):\begin{equation*} F_{n} = M_{Ref}(I_{LR}, I_{SR(n-1)}) \cdot HM_{SR(n-1)} \tag {2}\end{equation*}

Based on the iterative collaboration approach, the SR image is reconstructed in n steps, and the final SR output I_{SR(n)} is defined as shown in (3):\begin{equation*} I_{SR(n)}= M_{SR}(F_{n})+M_{ST}(HM_{SR(n-1)})+U(I_{LR}) \tag {3}\end{equation*} In the remainder of this section, the Multi-stage Refinement Network, the Face Recovery Network, and the Face Alignment Network are explained in detail.
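To make the data flow of (1)-(3) concrete, the following is a minimal PyTorch-style sketch of the iterative collaboration loop. The module names (refinement_net, sr_net, fan, sft) are hypothetical placeholders, the shapes are assumed compatible, and the fusion in Eq. (2) is shown schematically without the per-component channel grouping used in the actual model.

```python
import torch
import torch.nn.functional as F

def iterative_collaboration(lr, refinement_net, sr_net, fan, sft, n_steps=4, scale=8):
    """lr: LR face batch, e.g. (B, 3, 16, 16) for the x8 setting."""
    up_lr = F.interpolate(lr, scale_factor=scale, mode='bicubic', align_corners=False)  # U(I_LR)
    sr = sr_net(refinement_net(lr)) + up_lr              # Eq. (1): first SR estimate
    for _ in range(n_steps - 1):
        heatmaps = fan(sr)                               # landmark heatmaps HM_SR(n-1)
        fused = refinement_net(lr) * heatmaps            # Eq. (2), schematic (previous-SR conditioning omitted)
        sr = sr_net(fused) + sft(heatmaps) + up_lr       # Eq. (3): refined SR estimate
    return sr
```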

A. Multi-Stage Refinement Network

The Multi-stage Refinement Network comprises the NL and RCA blocks, as shown in Figure 2. It aims to reduce the degradation effects in the low-level extracted features at this stage of the proposed model. These enhancements optimize the recovery and alignment procedures so that the capability of the iterative collaboration technique in our FSR model is fully utilized.

The non-local attention technique is applied to recover the global details in LR feature maps by exploiting the interdependence of pixels [37], [38], [39]. The proposed NL module is an attention approach that detects recurring patterns and textures across different regions of the feature maps [37]. Figure 3 depicts the non-local module's architecture.

FIGURE 3.

The architecture of Non-local module (NL).

Specifically, instead of attending only to the neighboring pixels, a larger neighborhood region is considered. Detecting similar intensity or texture characteristics using weighted averages in the non-local technique is defined as (4):\begin{equation*} z_{i}= W(t)\cdot y_{i}+I_{LR} \tag {4}\end{equation*} where z_{i} and I_{LR} define the output and input of the NL module, respectively, and W(t) denotes the weights applied to y_{i} :\begin{equation*} y_{i}= softmax (W({\theta }) \cdot W({\phi }))\cdot W(g) \tag {5}\end{equation*} y_{i} is obtained by the dot product of W(g) with the softmax of the dot product of W({\theta }) and W({\phi }) . W({\theta }) , W({\phi }) , and W(g) are the weight matrices to be learned; \theta , \phi , and g are 1\times 1 convolutions. The patch sizes H and T are set to 7, and the number of channels is set to 48.
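A minimal PyTorch sketch of an embedded-Gaussian non-local block consistent with (4)-(5) is given below; the 48-channel width follows the text, while the reduced embedding width and other details are assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    def __init__(self, channels=48, inter_channels=24):
        super().__init__()
        self.theta = nn.Conv2d(channels, inter_channels, 1)   # W(theta), 1x1 conv
        self.phi   = nn.Conv2d(channels, inter_channels, 1)   # W(phi),   1x1 conv
        self.g     = nn.Conv2d(channels, inter_channels, 1)   # W(g),     1x1 conv
        self.w_t   = nn.Conv2d(inter_channels, channels, 1)   # W(t), maps back to the input width

    def forward(self, x):
        b, c, h, w = x.shape
        theta = self.theta(x).view(b, -1, h * w)                     # (B, C', HW)
        phi   = self.phi(x).view(b, -1, h * w)                       # (B, C', HW)
        g     = self.g(x).view(b, -1, h * w)                         # (B, C', HW)
        attn  = torch.softmax(theta.transpose(1, 2) @ phi, dim=-1)   # Eq. (5): (B, HW, HW) similarity
        y     = (g @ attn.transpose(1, 2)).view(b, -1, h, w)         # aggregate over all positions
        return self.w_t(y) + x                                       # Eq. (4): residual output
```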

The proposed RCA blocks implement a channel attention technique that employs convolutional layers to increase the network's capability to learn channel-specific weights [14], [40]. As shown in Figure 4, the proposed RCA module comprises convolutional layers with ReLU activation functions, global average pooling, and a sigmoid activation function. This approach helps to recover and precisely extract channel-wise information from low-resolution feature maps.

FIGURE 4.

The architecture of RCA block.

The values of the weights behave as an attention map, allocating higher values to more critical information and lower values to less important information. After each convolution layer, the activation function enables the deep-learning model to learn more complex channel dependencies. A skip connection allows the network to bypass the low-frequency information from the earlier layer and integrate it with the features from the later layer [3], [40]. In the proposed multi-stage refinement network, applying RCA blocks after the non-local module is a strategic choice to emphasize inter-channel facial detail effectively. The NL module mitigates noise degradation at the initial stage of the refinement procedure [38], [39], while the RCA blocks are specifically designed to strengthen inter-channel facial detail representations in the low-level feature maps [14]. This combined technique addresses noise issues and ensures a more comprehensive and refined treatment of facial details, ultimately optimizing the collaboration between the face recovery and face alignment networks.
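The description above corresponds to a standard residual channel attention block; a minimal sketch is shown below, with the layer widths and reduction ratio as assumptions.

```python
import torch.nn as nn

class RCABlock(nn.Module):
    def __init__(self, channels=48, reduction=4):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        # Squeeze with global average pooling, excite with 1x1 convolutions and a sigmoid gate.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        res = self.body(x)
        res = res * self.attention(res)   # per-channel weights act as the attention map
        return x + res                    # skip connection carries low-frequency information forward
```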

B. Face Recovery Network

The Face Recovery Network comprises an attentive fusion module, recurrent super-resolution, and SFT block, as shown in Figure 5.

FIGURE 5.

The structure of Spatial Feature Transform (SFT).

The attentive fusion module is a part of the face recovery branch that fuses information between the face recovery network and the FAN. Thus, the gradients can be back-propagated recursively to both the face recovery network and the FAN, facilitating the utilization of different facial components as landmark guidance for our FSR model.

The landmark heatmap channels are grouped into specific facial components, including the mouth, nose, right and left eyes, and jawline. The channels corresponding to each facial component are added together, and a softmax operation is applied to create a facial attention heatmap for the respective facial attribute.

Guided by the facial attention heatmaps, the component-specific features are extracted by group convolution, and the weighted features from the multi-stage refinement network are added to form the attentively fused output.
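As an illustration of the grouping and spatial-softmax step described above, the sketch below converts 68-channel landmark heatmaps into five component attention maps; the landmark index ranges follow the common 68-point convention and are assumptions, not the paper's exact grouping. Each resulting map then weights the features of its component, extracted by a group convolution, before they are added back to the refined features.

```python
import torch

COMPONENT_GROUPS = {              # standard 68-point landmark indices (assumed)
    "jawline":   range(0, 17),
    "nose":      range(27, 36),
    "right_eye": range(36, 42),
    "left_eye":  range(42, 48),
    "mouth":     range(48, 68),
}

def component_attention(heatmaps):
    """heatmaps: (B, 68, H, W) landmark heatmaps -> (B, 5, H, W) component attention maps."""
    maps = [heatmaps[:, list(idx)].sum(dim=1, keepdim=True) for idx in COMPONENT_GROUPS.values()]
    maps = torch.cat(maps, dim=1)                              # sum the channels of each component
    b, c, h, w = maps.shape
    attn = torch.softmax(maps.view(b, c, -1), dim=-1)          # spatial softmax per component
    return attn.view(b, c, h, w)
```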

Based on the DIC [5] model, the super-resolution feedback network [41] is employed in our model’s face recovery network. The proposed recurrent SR module is structured to utilize feedback connections and generate robust high-level representations. The recurrent SR module’s input has 32\times 32 pixels and 48 channels and passes through the convolution and feedback layers.

The preservation of spatial information is essential for the generated face images since it entails the maintenance of local attributes necessary for fidelity and for recovering finer details at various spatial regions of a facial image [12], [42], [43], [44]. Therefore, an SFT block is employed, in which spatial features produced from the heatmaps modulate the recurrent SR features in a spatial-wise manner.

The architecture of the proposed SFT module is demonstrated in Figure 5.

The SFT module learns a spatial mapping function \mathcal {M} that predicts the modulation parameters \alpha and \beta based on the prior condition \psi , formulated as (6):\begin{equation*} {\mathcal {M}}: {\psi }{\rightarrow }({\alpha }, {\beta }) \tag {6}\end{equation*}

The learned \alpha and \beta parameters apply an affine transformation to the feature maps located at the intermediate layer of the super-resolution generation network and adaptively contribute to generating higher-fidelity face images. More precisely, the prior condition \psi is represented by the parameters \alpha and \beta of an affine transformation obtained through a mapping function. Given \alpha and \beta for a certain condition, the transformation is applied by scaling and shifting the feature maps as defined in (7):\begin{equation*} \hat {y}={\boldsymbol {SFT}}({\boldsymbol {F}} {\vert }{\alpha }, {\beta })= {\alpha } {\textstyle \bigodot } {\boldsymbol {F}} {\textstyle \bigoplus }{\beta } \tag {7}\end{equation*} where \boldsymbol {F} represents the feature maps with the same dimensions as \alpha and \beta , \textstyle \bigodot denotes element-wise multiplication, and \textstyle \bigoplus denotes element-wise addition. To maintain the spatial dimensions, the SFT module performs both feature-wise manipulation and spatial-wise transformation. The module takes the condition \psi generated from the heatmaps and outputs the affine transformation that scales (\alpha ) and shifts (\beta ) the feature maps. Multiplying and adding features in this way is a practical way to incrementally fuse two different levels of detail before up-sampling the face image. Shifting and scaling the feature maps is akin to regulating regional image information: essential details in every region of the face image are retained, and others are suppressed.
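A minimal sketch of such an SFT layer following (6)-(7) is shown below: two small convolutional heads map the heatmap condition \psi to per-pixel scale and shift maps. The layer widths, activation choice, and the five-channel condition are assumptions.

```python
import torch.nn as nn

class SFTLayer(nn.Module):
    def __init__(self, cond_channels=5, feat_channels=48, hidden=32):
        super().__init__()
        self.scale = nn.Sequential(                       # M: psi -> alpha
            nn.Conv2d(cond_channels, hidden, 1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden, feat_channels, 1),
        )
        self.shift = nn.Sequential(                       # M: psi -> beta
            nn.Conv2d(cond_channels, hidden, 1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(hidden, feat_channels, 1),
        )

    def forward(self, features, condition):
        # The condition is assumed to share the spatial size of the feature maps.
        alpha = self.scale(condition)
        beta = self.shift(condition)
        return alpha * features + beta                    # Eq. (7): element-wise scale and shift
```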

C. Face Alignment Network

The face alignment process in our FSR framework involves using accurate facial landmarks to guide the face recovery network, progressively improving face image fidelity and landmark estimation. Our model continuously improves the landmark estimation at each step of the iterative approach, providing more precise auxiliary information so that the prior guidance can be efficiently combined with the spatial feature transform operation and the attentive fusion module. In turn, the iterative collaboration approach and the feedback between the two network branches enhance the performance of both landmark estimation and the face recovery network, ultimately improving the overall efficiency of the proposed FSR model. The FAN in our model is based on the DIC [5]. The proposed FAN includes pre-processing and post-processing components, denoted P_{1} and P_{2} , along with four interconnected hourglasses that incorporate recurrent feedback in the FAN architecture [45]. At the first iteration, there are no face component maps or recurrent features; I_{SR(1)} is reconstructed from the LR image refined by the multi-stage refinement and recurrent SR networks. For the n th step of the iterative calibration approach, where n=1,\ldots , N , the Face Recovery Network generates the face SR image I_{SR(n)} by utilizing the FAN result and the feedback information from the (n-1) th step.

The FAN takes as input the first generated SR image (128\times 128 ). In the pre-processing stage, convolutions are applied to the SR image, which is down-sampled from 128\times 128 to 32\times 32 and then fed to the recurrent hourglass network. The hourglass network estimates the facial landmarks and sends them to the P_{2} stage. In the post-processing stage, the detected facial landmarks are merged into five types of face component maps. In our designed model, the five types of component maps are split into two paths, as demonstrated in Figure 2. One path is utilized for the attentive fusion module, and the other is used for the spatial feature transform operation.
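The pre-processing path described above can be sketched as a short stack of strided convolutions that reduce the 128\times 128 SR image to 32\times 32 feature maps before the hourglass stack; the channel widths and exact layer count here are assumptions.

```python
import torch.nn as nn

# Hypothetical FAN pre-processing: 128x128 RGB SR image -> 32x32 feature maps.
fan_preprocess = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),    # 128 -> 64
    nn.ReLU(inplace=True),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),  # 64 -> 32
    nn.ReLU(inplace=True),
)
```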

SECTION IV.

Experiment and Discussions

In this section, we first introduce the implementation settings of the MSRFSR model. Then, we explore the effectiveness of various components, and finally, we compare the performance of our model with that of other state-of-the-art models quantitatively and qualitatively.

A. Implementation Settings

We conducted experiments using the CelebA [15] and Helen [16] datasets, which are widely recognized in the field of face super-resolution. To prepare the data, following the standard practice of other FSR models [5], we first utilized the estimated landmarks and extracted square sections from each image to remove the background. These sections were then resized to 128\times 128 pixels without prior alignment and used as the ground-truth images. Subsequently, we reduced the HR face images to 16\times 16 pixels for a scale factor of \times 8 and to 8\times 8 pixels for a scale factor of \times 16 . We employed bicubic interpolation for both datasets to downscale the images for model training. For the CelebA [15] dataset, our training set comprised 168,854 images, with 1,000 images reserved for testing. For the Helen [16] dataset, our training set consisted of 2,005 images, while 50 were held out for testing. Table 2 summarizes the detailed settings used in our implementation. Additionally, we conducted experiments with our model to identify its limitations, selecting two further datasets, AFLW2000 [46] and WFLW [47], to assess the model's performance across different facial image datasets.
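A minimal sketch of this LR/HR pair preparation, assuming PIL-style bicubic resizing and a landmark-derived square crop box, is given below.

```python
from PIL import Image

def make_lr_hr_pair(image_path, crop_box, scale=8):
    """crop_box: (left, upper, right, lower) square face region estimated from landmarks."""
    hr = Image.open(image_path).convert('RGB').crop(crop_box).resize((128, 128), Image.BICUBIC)
    lr_size = 128 // scale                                   # 16 for x8, 8 for x16
    lr = hr.resize((lr_size, lr_size), Image.BICUBIC)        # bicubic degradation
    return lr, hr
```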

TABLE 2 Hyperparameter Setting

The training objective function utilized in our FSR model is \mathcal {L}_{\text {L1}} , defined as (8):\begin{equation*} \mathcal {L}_{\text {L1}} = \frac {1}{N} \sum _{i=1}^{N} \left |{{ y_{i} - \hat {y}_{i} }}\right | \tag {8}\end{equation*} where N denotes the total number of elements in the tensors, and y_{i} and \hat {y}_{i} are the ground truth (HR) image and the predicted super-resolved (SR) image, respectively.

The number of iterations in our iterative super-resolution approach was set to n=4 , and the batch size was set to 8. To mitigate over-fitting, we enhanced the variety of training samples by incorporating random rotations of 90°, 180°, and 270° and horizontal flips. For training, we utilized the ADAM optimizer [48] with \beta _{1} =0.9 and \beta _{2} =0.999 and a weighted alignment loss. The initial learning rate commenced at 1 \times 10^{-4} and was halved at 2 \times 10^{4} and 4 \times 10^{4} . Training on both datasets was stopped after 5 \times 10^{5} epochs. The experiments were conducted using PyTorch [49] on an NVIDIA RTX 2080 Ti GPU with 4352 CUDA cores and 11GB of GDDR6 memory.
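A sketch of this training configuration (L1 objective, ADAM with \beta _{1} =0.9 and \beta _{2} =0.999 , initial learning rate 1 \times 10^{-4} halved at the stated milestones, and rotation/flip augmentation) is shown below; the model argument and the milestone unit are placeholders/assumptions.

```python
import random
import torch
import torch.nn as nn

def configure_training(model):
    criterion = nn.L1Loss()                                                # Eq. (8)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[20_000, 40_000], gamma=0.5)                 # halve LR at 2e4 and 4e4
    return criterion, optimizer, scheduler

def augment(lr, hr):
    """Random 90/180/270-degree rotation and horizontal flip, applied identically to LR and HR."""
    k = random.randint(0, 3)
    lr, hr = torch.rot90(lr, k, dims=(-2, -1)), torch.rot90(hr, k, dims=(-2, -1))
    if random.random() < 0.5:
        lr, hr = torch.flip(lr, dims=[-1]), torch.flip(hr, dims=[-1])
    return lr, hr
```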

The assessment of super-resolved face images involves the use of the Peak Signal-to-Noise Ratio (PSNR), Learned Perceptual Image Patch Similarity (LPIPS) [50], Structural Similarity Index (SSIM) [51], and Frechet Inception Distance (FID) [52]. These metrics are calculated on the Y channel within the transformed YCbCr color space.
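For reference, the sketch below computes PSNR on the Y channel of the YCbCr space (an ITU-R BT.601 conversion is assumed); LPIPS, SSIM, and FID follow their cited implementations [50], [51], [52].

```python
import numpy as np

def rgb_to_y(img):
    """img: float RGB array in [0, 255] with shape (H, W, 3); returns the BT.601 luma channel."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr_y(sr, hr):
    """Peak Signal-to-Noise Ratio between two images, evaluated on the Y channel only."""
    mse = np.mean((rgb_to_y(sr.astype(np.float64)) - rgb_to_y(hr.astype(np.float64))) ** 2)
    return 10.0 * np.log10(255.0 ** 2 / mse)
```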

B. Investigation of Different Iterative Steps in Training

In this section, our focus revolves around the comprehensive evaluation and comparative analysis of the proposed iterative collaboration framework for the proposed model. The evaluation metrics employed include PSNR, SSIM, and LPIPS. We present the results in Table 3 and Table 4, where each table corresponds to different aspects of our model’s performance. Specifically, the evaluation encompasses various iterative steps, ranging from step 1 to step 5. The CelebA [15] and Helen [16] datasets are utilized for these evaluations. The evaluations are conducted for scale factors of \times 8 and \times 16 , with the corresponding outcomes represented in Table 3 and Table 4, respectively.

TABLE 3 The Quantitative Comparison Among Different Iterative Steps ( n ) on CelebA and Helen Datasets at a Scale Factor of \times 8
TABLE 4 The Quantitative Comparison Among Different Iterative Steps ( n ) on CelebA and Helen Datasets at a Scale Factor of \times 16

Following a comprehensive analysis of the different iteration steps in Table 3 and Table 4, a consistent and gradual improvement in PSNR, SSIM, and LPIPS is observed from Step 1 to Step 4. However, Step 5 shows no discernible enhancement in these metrics. According to the results in Table 3 and Table 4, the optimal iteration step for our model, yielding the best values of PSNR, SSIM, and LPIPS, is n=4 for both scale factors.

C. Visual Comparisons of Face Recovery in Different Iteration Steps

The previous section compared various iteration steps and determined the optimal value for our model to be n=4 . Here, we present visual comparisons of face recovery during the training of our FSR model at each iteration step. Figure 6, Figure 7, Figure 8, and Figure 9 depict visual comparisons of face image reconstruction against both the HR image and the bicubic upsampling method. The PSNR and SSIM of the image generated at each iteration step with respect to the HR image are displayed in the figures. Results for the CelebA dataset at a scale factor of \times 8 , the Helen dataset at a scale factor of \times 8 , the CelebA dataset at a scale factor of \times 16 , and the Helen dataset at a scale factor of \times 16 are shown in Figure 6, Figure 7, Figure 8, and Figure 9, respectively. Upon visual inspection and quantitative evaluation of each figure, it is evident that the perceptual quality and the recovery of high-frequency details in facial attributes improve gradually at each iterative calibration step during the training phase of our FSR model.

FIGURE 6.

Visual comparison of face image reconstruction in three steps of iterative collaboration approach on CelebA image at a scale factor of \times 8 .

FIGURE 7.

Visual comparison of face image reconstruction in three steps of iterative collaboration approach on Helen image at a scale factor of \times 8 .

FIGURE 8.

Visual comparison of face image reconstruction in three steps of iterative collaboration approach on CelebA image at a scale factor of \times 16 .

FIGURE 9.

Visual comparison of face image reconstruction in three steps of iterative collaboration approach on Helen image at a scale factor of \times 16 .

D. Iterative Collaboration in Landmark Estimation

This section illustrates the face alignment process during the face recovery procedure. In Figure 10, we showcase landmark estimation at four steps of our iterative calibration framework at a scale factor of \times 16 in the training phase of the FSR model. The final estimated landmarks are visualized as a heatmap. The sample face image chosen for this figure is considered challenging due to its non-frontal pose and the presence of non-face content (a hand) in the image.

FIGURE 10.

Landmark estimation in 4 steps of iterative collaboration approach at a scale factor of \times 16 , and the facial estimated landmark in heatmap shape.

According to the results in Figure 10, alignment and landmark estimation in challenging face pose images become more accurate with each iterative recovery of the face image. This indicates a gradual improvement in prior knowledge information, subsequently assisting the proposed FSR model in recovering more accurate and faithful face images by enhancing the estimation of precise prior information within the iterative calibration framework.

E. Ablation Study

This section presents a systematic investigation of the impact of the NL, RCA, and SFT modules. Table 5 and Table 6 showcase the performance metrics, including PSNR, SSIM, LPIPS, and FID, of the proposed FSR model equipped with different NL, RCA, and SFT configurations. The evaluations are conducted on the CelebA [15] and Helen [16] datasets at scales of \times 8 and \times 16 , as demonstrated in Tables 5 and 6. The best values are highlighted in bold.

TABLE 5 The Quantitative Evaluation on Different Configurations of Attention Modules at a Scale Factor of \times 8
TABLE 6 The Quantitative Evaluation on Different Configurations of Attention Modules at a Scale Factor of \times 16

In these tables, we compare against the baseline model (without any attention module). The other rows correspond to different combinations of refinement modules: the RCA module only, the combination of NL and RCA, the SFT module only, the combination of RCA and SFT, and finally, the combination of all three proposed modules. Across both scales and datasets, the combined application of all three attention modules (NL, RCA, and SFT) consistently leads to a noteworthy improvement in PSNR, SSIM, FID, and LPIPS compared to the baseline and the other combinations.

To attain a comprehensive understanding of the impact of the distinct refinement modules within the FSR model and to discern their relative significance for the overall performance, the evaluation results are visually presented in Figure 11, Figure 12, Figure 13, and Figure 14. These figures report the performance metrics PSNR, SSIM, and LPIPS. The charts facilitate a comparative analysis of the model's performance across various combinations of refinement modules at scale factors of \times 8 and \times 16 on the CelebA [15] and Helen [16] datasets. Specifically, Figure 11 demonstrates PSNR, SSIM, and LPIPS on the CelebA dataset at a scale factor of \times 8 . Figure 12, Figure 13, and Figure 14 visualize the corresponding performance at a scale factor of \times 8 for the Helen dataset, a scale factor of \times 16 for the CelebA dataset, and a scale factor of \times 16 for the Helen dataset, respectively.

FIGURE 11.

Performance comparison (PSNR/SSIM/LPIPS) of our model with various attention module configurations at a scale factor of \times 8 on the CelebA Dataset.

FIGURE 12.

Performance comparison (PSNR/SSIM/LPIPS) of our model with various attention module configurations at a scale factor of \times 8 on the Helen Dataset.

FIGURE 13.

Performance comparison (PSNR/SSIM/LPIPS) of our model with various attention module configurations at a scale factor of \times 16 on the CelebA Dataset.

FIGURE 14.

Performance comparison (PSNR/SSIM/LPIPS) of our model with various attention module configurations at a scale factor of \times 16 on the Helen Dataset.

According to the visualizations in these figures, using the SFT module alone at the mid-level to preserve spatial information contributes less to improving the accuracy of the FSR model. However, when combined with the RCA module, and with both the NL and RCA modules, it exhibits the second-best and the best performance in the FSR model, respectively. The results suggest that the effectiveness of the SFT module on mid-level features is hampered by issues arising from the degraded and poor quality of the LR face inputs.

In contrast, employing the NL attention module to address noise and degradation at the initial stage, coupled with the application of RCA to emphasize inter-channel dependencies at the low-level stage, significantly enhances the performance of the SFT module. Consequently, this combination performs best, yielding the highest PSNR and SSIM and the lowest LPIPS on both scales and datasets.

F. Visual Comparisons of Contributing Different Attention Modules

In this section, we assess the impact of each attention module on generating higher-quality face images. Figure 15 illustrates four sample images from the CelebA [15] and Helen [16] datasets. We extract face image patches and evaluate their visual quality. Specifically, we compare the visual quality of the baseline model, the model utilizing the NL module, the combination of the NL and RCA modules, and finally, the combination of the NL, RCA, and SFT modules.

FIGURE 15.

Visual comparisons of contributing different attention modules baseline, (NL), (NL+RCA), (NL+RCA+SFT).

As depicted in Figure 15, the perceptual quality of the results employing the NL, RCA, and SFT refinement modules significantly improves compared to other methods. The PSNR and SSIM metrics in this combination surpass those of other configurations. In other words, the proposed model successfully recovers more facial details than the baseline and other combined approaches.

G. Comparison With Other Methods

This section compares our quantitative and qualitative results with several other methods.

We conduct quantitative comparisons with various models [5], [6], [8], [11], [26], [27], [29], [30], [31], [32], [35], [40], [53], [54], [55], [56], [57], [58]. Table 7 presents comparisons of PSNR, SSIM, and LPIPS on the CelebA and Helen datasets at a scale factor of \times 8 . Figure 16 provides a graphical representation comparing the PSNR improvement over bicubic performance at the same scale factor. The results in Table 7 indicate that our model achieves the highest performance on the CelebA dataset and the second-best performance on the Helen dataset. Notably, the network parameters of our model total 24.69 million, which is lower than SCTANet's 27.56 million parameters. According to Figure 16, the PSNR improvement over the bicubic method at this scale is 4.8 dB and 6.2 dB for the CelebA and Helen datasets, respectively.

TABLE 7 Quantitative Benchmark Test Results at a Scale Factor of \times 8 . Red Indicates the Best Performance and Blue Indicates the Second Best
FIGURE 16.

Comparison of performance improvement (PSNR) at a scale factor of \times 8 on CelebA and Helen Datasets.

Table 8 provides comparisons of PSNR, SSIM, and LPIPS on the CelebA and Helen datasets at a scale factor of \times 16 . Our proposed model achieves a PSNR of 23.77 dB and an SSIM of 0.6903 on the CelebA dataset, along with the lowest LPIPS value of 0.2600. Furthermore, our model exhibits the best LPIPS on the Helen dataset, highlighting its superior performance compared to other methods. Additionally, our model has fewer network parameters than SCTANet [11], which has 27.98 million. Figure 17 presents a graphical visualization comparing the PSNR improvement over the bicubic method at a scale factor of \times 16 . According to the figure, the PSNR improvement over the bicubic method at this scale is 3.43 dB for the CelebA dataset and 2.61 dB for the Helen dataset.

TABLE 8 Quantitative Benchmark Test Results at a Scale Factor of \times 16 . Red Indicates the Best Performance and Blue Indicates the Second Best
FIGURE 17.

Comparison of performance improvement (PSNR) at a scale factor of \times 16 on CelebA and Helen Datasets.

Figure 18 presents visual comparisons of the proposed model at a scale factor of \times 8 with other state-of-the-art models [5], [6], [8], [26], [27], [31]. Sample images 1 and 2 are from the Helen [16] dataset, while samples 3 and 4 are from the CelebA [15] dataset. Based on the results in this figure, our model consistently produces higher-fidelity face images compared to the other methods.

FIGURE 18.

Visual comparison of our model with other state-of-the-art methods at a scale factor of \times 8 . (a): Bicubic, (b): URDGN, (c): WSRNet, (d): SuperFAN, (e): FSRNet, (f): DIC, (g): SISN, (h): Ours, (i): HR.

Figure 19 illustrates visual comparisons of the proposed method at a scale factor of \times 16 with other state-of-the-art models, specifically DIC [5] and SISN [31]. Image samples 1 and 2 belong to the CelebA [15] dataset, while samples 3 and 4 are from the Helen [16] dataset. The image labels (a) to (e) depict the LR image (8\times 8 ), the results of DIC, SISN, our model, and the HR images, respectively. Even at scale \times 16 , our model consistently outperforms other models, showcasing its superior performance in generating high-quality super-resolved images. Specifically, at this large scale factor, the proposed model demonstrates exceptional performance in reconstructing facial attributes with greater fidelity, such as eyes, lips, and nose, compared to the other state-of-the-art models depicted in this figure.

FIGURE 19.

Visual comparison with other methods at a scale factor of \times 16 .

H. Comparison of Network Complexity and Performance

Figure 20 and Figure 21 illustrate our model’s network complexity and performance, comparing them with other state-of-the-art models across the CelebA and Helen datasets at scale factors of \times 8 and \times 16 , respectively.

FIGURE 20.

Performance and network complexity evaluated on the CelebA and Helen datasets at a scale factor of \times 8 .

FIGURE 21.

Performance and network complexity evaluated on the CelebA and Helen datasets at a scale factor of \times 16 .

The comparison in Figure 20 shows that WSRNet [27] has the highest number of network parameters, with PSNR values of 26.83 dB and 36.02 dB for the CelebA and Helen datasets, respectively. Our model reaches 27.69 dB and 27.18 dB on the CelebA and Helen datasets, respectively, while its network complexity is roughly 2.7 times lower than that of WSRNet [27] at this scale. Compared to the DIC [5] model, our model contains 4 million more parameters and achieves higher PSNR on both datasets.

Based on Figure 21, the network complexity of our model is lower than that of FishFSRNet [32] and SCTANet [11], while its performance surpasses both FishFSRNet [32] and SCTANet [11] on the CelebA dataset. On the Helen dataset, our model's performance falls within the same range as SCTANet (with only a 0.26 dB difference), while its complexity is 3.20 million parameters lower than that of SCTANet [11]. Compared to the baseline model [5], our model demonstrates significant improvements in performance on both datasets at this scale.

I. User Study

Table 9 and Table 10 present the results of our user study evaluation at scale factors of \times 8 and \times 16 , respectively. Using a user study approach, we sought to gather comprehensive feedback from participants regarding their preferences and perceptions of the FSR models' outputs. We ensured variability and representation by using random images from the datasets, which contributed to the robustness and generalizability of our findings. The user study involved four strong FSR models [5], [11], [32], [34] evaluated on the CelebA and Helen datasets. We recruited ten individuals, including both experts and non-experts, and used four random images from each dataset.

TABLE 9 User Study Evaluation at a Scale Factor of \times 8
TABLE 10 User Study Evaluation at a Scale Factor of \times 16
Table 10- 
                            User Study Evaluation at a Scale Factor of 
                                    $\times 16$

Our evaluation criteria were designed to capture subjective preferences in terms of visual quality, naturalness, and perceptual similarity to the HR images. Participants rated the face images on a 5-point scale ranging from 1 (“Poor”) to 5 (“Excellent”). In Tables 9 and 10, the highest score indicates the best perceptual quality among the compared models.
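As an aside, the ratings described above reduce to a mean opinion score per model. The short sketch below, assuming a simple flat record format of our own choosing, shows the aggregation; the ratings in it are made up for illustration and do not reproduce the values in Tables 9 and 10.

from collections import defaultdict
from statistics import mean

# Each record: (participant_id, model_name, image_id, rating on the 1-5 scale).
ratings = [
    (1, "DIC", "celeba_001", 3), (1, "Ours", "celeba_001", 4),
    (2, "DIC", "helen_002", 3),  (2, "Ours", "helen_002", 5),
]

scores = defaultdict(list)
for _, model, _, rating in ratings:
    scores[model].append(rating)

for model, values in sorted(scores.items()):
    print(f"{model:>6s}: mean opinion score = {mean(values):.2f} over {len(values)} ratings")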

J. Quantitative and Qualitative Comparison on AFlW2000 and WFLW Datasets

We conduct quantitative comparisons with the baseline model [5] on the AFLW2000 [46] and WFLW [47] datasets. Tables 11 and 12 report PSNR, SSIM, LPIPS, and FID at scale factors of \times 8 and \times 16 , respectively. As indicated in the tables, the examined models are trained on either the CelebA or the Helen dataset. Under the same training conditions, our model outperforms DIC [5] in PSNR, SSIM, LPIPS, and FID on both test datasets. Furthermore, both our model and the baseline perform better when trained on CelebA than when trained on Helen.
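For completeness, the distortion metrics reported here follow their standard definitions. The sketch below, assuming 8-bit RGB images stored as NumPy arrays, computes PSNR from first principles; SSIM, LPIPS, and FID would come from their usual implementations (e.g. scikit-image, the lpips package, and pytorch-fid), and whether the metrics are computed on RGB or on the luminance channel is not restated here.

import numpy as np


def psnr(sr: np.ndarray, hr: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio between a super-resolved and an HR image."""
    sr = sr.astype(np.float64)
    hr = hr.astype(np.float64)
    mse = np.mean((sr - hr) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)


# Example with random stand-in images (128x128 RGB).
rng = np.random.default_rng(0)
hr_img = rng.integers(0, 256, size=(128, 128, 3), dtype=np.uint8)
sr_img = np.clip(hr_img.astype(int) + rng.integers(-5, 6, hr_img.shape), 0, 255)
print(f"PSNR: {psnr(sr_img, hr_img):.2f} dB")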

TABLE 11 Quantitative Evaluation on the AFLW2000 and WFLW Datasets at a Scale Factor of \times 8
TABLE 12 Quantitative Evaluation on the AFLW2000 and WFLW Datasets at a Scale Factor of \times 16

Figure 22 demonstrates the visual comparison between our model and DIC [5] at a scale factor of \times 8 . Samples 1 and 2 belong to AFLW2000 [46], while samples 3 and 4 belong to WFLW [47]. Labels (a) to (e) represent the LR image (16\times 16 ), bicubic interpolation, DIC [5], our model, and the HR image, respectively. In all samples, our model recovers more facial details and generates higher-fidelity results than the baseline.

FIGURE 22. Visual comparison of our model with the baseline at a scale factor of \times 8 on the AFLW2000 and WFLW datasets. (a): LR, (b): Bicubic, (c): DIC, (d): Ours, (e): HR.

Figure 23 compares our model and DIC [5] at a scale factor of \times 16 . Samples 1 and 2 are from AFLW2000, while samples 3 and 4 are from WFLW. Labels (a) to (e) indicate the LR image (8\times 8 ), bicubic interpolation, DIC, our model, and the HR image, respectively. Across all samples, our model outperforms the baseline, recovering more facial details and producing higher-quality results.
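The bicubic baseline in column (b) of Figures 22 and 23 can be reproduced with a simple resize round trip. The sketch below uses Pillow and assumes a plain bicubic down/up-sampling pipeline; the exact degradation used to create the LR inputs is not restated here, so treat this as an approximation.

from PIL import Image


def bicubic_baseline(hr_path: str, lr_size: int = 8, hr_size: int = 128) -> Image.Image:
    """Downsample an HR face to lr_size x lr_size, then upsample back with bicubic."""
    hr = Image.open(hr_path).convert("RGB").resize((hr_size, hr_size), Image.BICUBIC)
    lr = hr.resize((lr_size, lr_size), Image.BICUBIC)
    return lr.resize((hr_size, hr_size), Image.BICUBIC)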

FIGURE 23. Visual comparison of our model with the baseline at a scale factor of \times 16 on the AFLW2000 and WFLW datasets. (a): LR, (b): Bicubic, (c): DIC, (d): Ours, (e): HR.

SECTION V.

Discussion and Future Work

To enhance fidelity and detail when generating face images at large scale factors, the NL module and the RCA technique at the low-level stage focus on critical facial details and effectively mitigate shortcomings in feature quality, leading to a refined and context-aware representation. In addition, the SFT module designed for the mid-level architecture contributes significantly to recovering high-frequency facial attributes by leveraging spatial information. However, using the SFT module alone in the mid-level configuration contributes little to improving the accuracy of the FSR model, primarily because of the degraded, poor-quality LR face inputs. In contrast, employing the NL attention module to address noise and degradation at the initial stage, coupled with the RCA technique to emphasize inter-channel dependencies, substantially enhances the performance of the SFT module at the mid-level. Consequently, this combination of multi-attention techniques achieves the best performance in generating faithful and detailed face images compared to previous FSR models, yielding the best PSNR, SSIM, LPIPS, and VIF scores across large scale factors and the four datasets.
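To make the roles of these modules concrete, the rough PyTorch sketch below, written under our own naming assumptions rather than the exact MSRFSR layer configuration, pairs a residual channel attention block (re-weighting informative channels) with an SFT layer that scales and shifts mid-level features using spatial maps derived from landmark heatmaps.

import torch
import torch.nn as nn


class ResidualChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel attention with a residual path."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.attn = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.body(x)
        return x + feat * self.attn(feat)  # emphasize informative channels


class SFTLayer(nn.Module):
    """Spatial feature transform: scale and shift features using a prior map."""

    def __init__(self, feat_channels: int, prior_channels: int):
        super().__init__()
        self.gamma = nn.Sequential(
            nn.Conv2d(prior_channels, feat_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 1),
        )
        self.beta = nn.Sequential(
            nn.Conv2d(prior_channels, feat_channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_channels, feat_channels, 1),
        )

    def forward(self, feat: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # `prior` holds e.g. landmark heatmaps resized to the feature resolution.
        return feat * (1 + self.gamma(prior)) + self.beta(prior)


# Shape check with stand-in tensors: 64-channel features, 68 landmark heatmaps.
features = torch.randn(1, 64, 32, 32)
heatmaps = torch.randn(1, 68, 32, 32)
out = SFTLayer(64, 68)(ResidualChannelAttention(64)(features), heatmaps)
print(out.shape)  # torch.Size([1, 64, 32, 32])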

Although the proposed model excels at generating faithful and detailed full-face images across diverse angles, its performance degrades when it is tasked with generating profile face images. This limitation arises because the FAN module was tuned exclusively for frontal views of the face and the training datasets (CelebA and Helen) contain only full-face images. Consequently, the model is less accurate in reconstructing profile faces, where only partial facial features, such as one eye, one side of the nose, and a portion of the mouth, are visible. This constraint restricts the model’s applicability in scenarios where profile images are prevalent or required.

Future research will address this limitation by tuning the FAN module to accommodate profile face images and by incorporating additional training datasets that contain profile faces. This could enhance the model’s robustness and broaden its utility in real-world applications, advancing the effectiveness and applicability of FSR techniques in diverse practical settings.

SECTION VI.

Conclusion

This research introduces a Multi-stage Refining Face Super-resolution model, establishing a novel paradigm through iterative collaboration between landmark estimation and an attentive recovery network. The challenge posed by degraded, low-dimensional (16\times 16 or 8\times 8 pixels) input images prevents the iterative collaboration framework from reaching its full capability to generate detailed and accurate face images. The proposed Multi-stage Refining model utilizes an SFT module for mid-level feature refinement and incorporates the NL module and a residual pixel attention technique at the low-level stage. The NL module captures long-range dependencies within the low-level features, while the RCA technique enhances focus on critical facial details by selectively emphasizing informative channels. This approach effectively addresses shortcomings in feature quality, fostering a more refined and context-aware representation; consequently, it significantly enhances the efficacy of the SFT module at the mid-level stage, enabling the recovery of more faithful facial details. The proposed refinement approach improves both the accuracy and perceptual quality of super-resolved face images, surpassing the performance of baseline models. Empirical evaluations conducted on the CelebA and Helen datasets at scale factors of \times 8 and \times 16 demonstrate noteworthy improvements in PSNR, SSIM, and LPIPS. Visual comparisons further underscore the model’s superiority, showing significant advancements over other state-of-the-art models.

In future work, we will integrate the proposed FSR model with real-time face recognition systems. This extension aims to explore the potential impact of our model on enhancing recognition accuracy in practical scenarios.

ACKNOWLEDGMENT

The authors would like to express their sincere gratitude to the Department of Electrical Engineering, Faculty of Engineering, Chulalongkorn University.
