Introduction
Smartphones, and mobile devices in general, now play a central role in our society. We use them on a daily basis not only for communication purposes, but also to access social media and for sensitive tasks such as online banking. In order to increase the security level of those more sensitive applications, verifying the subject’s identity plays a key role. To tackle this requirement, many companies are currently working towards creating applications that verify the subject’s identity by comparing a selfie image with the reference face image stored in the embedded chip of an ID-Card/Passport, which is read using Near Field Communication (NFC) from smartphones [1]. This represents a user-friendly identity verification process, which can be easily embedded into numerous applications. However, this verification process also faces some challenges: for instance, the selfie image is captured in an uncontrolled scenario, where occlusions due to wearing a scarf in winter or a hygienic facial mask during a pandemic such as COVID-19 may hinder the performance of general face recognition algorithms. Therefore, there is a reinforced need to explore alternatives which can deal with those occluded images successfully, such as utilising the periocular region for recognition purposes.
The aforementioned reasons have increased the interest in periocular-based biometrics over the last decade in different scenarios [2]–[5]. In particular, it has been shown that periocular images captured with mobile devices for recognition purposes are mainly acquired as selfie face images. Moreover, the number of digital photos will increase every year: in 2022, 1.5 trillion images were taken, and 90% of them came from smartphones. In order to recognise individuals from a selfie in a remote verification system, the periocular region needs to be cropped, and the resulting periocular sample often has a very low resolution [6]. Moreover, subjects capture selfie images in multiple places and backgrounds, using selfie sticks, alone, or with others. This translates into a high intra-class variability of the images in terms of size, lighting conditions, and face pose.
With the aim of improving the quality of such low-resolution images, several Single Image Super-Resolution (SISR) methods have been recently proposed [7]–[10], mainly based on convolutional neural networks (CNNs). Even though some authors have enhanced such networks to obtain the super-resolution reconstruction more efficiently [11], most approaches still use deep models, which demand large resources and are thus not suitable for mobile or Internet-of-Things (IoT) devices. Furthermore, the loss function used in most techniques is based on the structural similarity (SSIM) and Peak Signal to Noise Ratio (PSNR) metrics [7]. Even though those metrics are appropriate for increasing the resolution of general purpose images (e.g., landscapes, cities, or birds), they are less suitable for increasing the quality of images in iris-based biometric applications. In contrast, the ISO/IEC 29794 standard on biometric sample quality, Part 6: Iris image data, describes sharpness based on the Laplacian of Gaussian (LoG) as one relevant quality metric.
In this work, we have a twofold goal: verify a biometric claim in a verification transaction from a smartphone selfie periocular image in the visible spectrum (VIS) and propose an efficient super-resolution approach (see Fig. 1). As already mentioned, this is a challenging task since there is limited control of the quality of the images taken: selfies can be captured from different distances, light conditions, and resolutions. Therefore, to tackle these issues, we present a SISR algorithm with a novel loss function based on the sharpness LoG metric and a light-weight CNN. This model takes into account the trade-off between the number of layers and filter sizes in order to achieve a light model suitable for mobile devices. Additionally, we explore pixel-shuffle and transposed convolutions in order to recover the fine details of the periocular eye images. To validate our approach, we use different databases for training and testing. In addition, we benchmark both handcrafted features and pre-trained deep learning models. Our method drastically reduces the number of parameters when compared with the state-of-the-art Deep CNN with Skip Connection and Network in Network (DCSCN) [12]: from 2,170,142 to 28,654 parameters when the image size is increased by a factor of 3.
Block diagram of the verification system proposed, including a super-resolution approach. Top: Traditional approach with resizing images. SR approach with deep learning embeddings (Middle) and with handcrafted features (BSIF, Bottom).
This paper is an extension of our previous work [13]. In that work, we focused on achieving an accurate Enhanced SISR (ESISR) algorithm for periocular eye images taken from selfie images, reporting results in terms of image similarity for the recovered images on a smaller Samsung dataset. In this paper, we evaluate this new ESISR architecture in more detail and benchmark it against two new state-of-the-art methods: WDSR-A [14] and SRGAN [21]. The reasons that led us to this architecture are explained in full in this work. As an additional contribution, this manuscript includes the performance evaluation of our proposed methods on periocular verification systems using three pre-trained CNNs: FaceNet [16], VGGFace [17], and ArcFace [18]. All methods have now been evaluated on the larger MobBIO [19] and NTNU [20] databases. A benchmark against traditional resizing methods, namely inter-area, inter-linear, and inter-cubic (bicubic) interpolation, has also been analysed. A handcrafted feature extractor, Binarized Statistical Image Features (BSIF), was also added to evaluate and compare the results with the deep learning approach. Detection Error Trade-off (DET) curves are included to show our proposal’s performance and efficiency. All these new experiments are benchmarked with those previously obtained in [12], [14], [21].
Therefore, the main contributions from this work can be summarised as follows:
An efficient SR architecture is proposed, using only a feature extractor and one block based on recursive learning of reconstruction.
A recursive pixel-shuffle technique is introduced over a transposed convolution to extract and keep fine details of periocular images.
A novel loss function that combines the LoG sharpness iris quality metric with the SR loss function is proposed.
A significant reduction of the number of parameters in comparison with the state-of-the-art is reported.
A novel database for selfie periocular eye images was acquired and is available for researchers upon request.
A periocular verification system based on embedding vectors from three pre-trained models, with an SR-based pre-processing of the samples (x2, x3 and x4), is tested.
A benchmark between deep learning approaches and a handcrafted method is reported.
A full analysis of the influence of SR on selfie biometrics scenarios, in comparison with traditional resizing methods, is also included.
The rest of the article is organised as follows. Sect. II summarises the related works on periocular recognition and super-resolution. The new recognition and super-resolution method is described in Sect. III. The experimental framework is then presented in Sect. IV and the results are discussed in Sect. V. We conclude the article in Sect. VI.
Related Work
A. Super-Resolution (SR)
Super-resolution (SR) is the process of recovering a high-resolution (HR) image from a low-resolution (LR) one [7], [22]. Supervised machine learning approaches learn mapping functions from LR images to HR images from a large number of examples. The mapping function learned by these models is the inverse of a downgrade function that transforms HR images into LR images. Such downgrade functions can be known or unknown.
Many state-of-the-art super-resolution models learn most of the mapping function in LR space, followed by one or more upsampling layers at the end of the network. Earlier approaches first upsampled the LR image with a pre-defined upsampling operation and then learned the mapping in HR space (pre-upsampling SR). A disadvantage of this approach is that more parameters per layer are required, because these models relied on larger convolutional layers rather than small filters, which leads to higher computational costs and limits the construction of deeper neural networks [7]. SR requires that most of the information contained in an LR image be preserved in the SR image. SR models therefore mainly learn the residuals between LR and HR images. Residual network designs are thus of high importance: identity information is conveyed via skip connections, whereas the reconstruction of high-frequency content is done on the main path of the network [7].
Dong et al. [22] noted that SISR algorithms can be categorised into four types: prediction models, edge-based methods, image statistical methods, and patch-based (or example-based) methods. Their own method uses 2 to 4 convolutional layers to show that the learned model performs well on SISR tasks. The authors concluded that using a larger filter size is better than using deeper Convolutional Neural Networks (CNNs).
Kim et al. [23] proposed an image SR method using a Deeply-Recursive Convolutional Network (DRCN), which contains deep CNNs with up to 20 layers. Consequently, the model has a huge number of parameters. However, the layers share their weights to reduce the number of parameters to be trained, which makes it possible to train the deep CNN successfully and to achieve a significant performance. The authors conclude in their work that deeper networks are better than large filters.
Yamanaka et al. [12] proposed a Deep CNN with a Residual Net, Skip Connection and Network in Network (DCSCN) model, achieving a state-of-the-art reconstruction performance while reducing the computational cost by at least 10 times. According to the existing literature, deep CNNs with residual blocks and skip connections are suitable to capture fine details in the reconstruction process. In the same context, [24] and [25] propose the pixel-shuffle and transposed convolution algorithms in order to extract the most relevant features from the images. The transposed convolutional layer can learn up-sampling kernels. However, the process is similar to the usual convolutional layer and the reconstruction ability is limited. To obtain a better reconstruction performance, the transposed convolutional layers need to be stacked, which means the whole process needs high computational resources [12]. Conversely, pixel-shuffle extracts features from the low-resolution images. The authors [12] argue that batch normalisation loses scale information of images and reduces the range flexibility of activations. Removing the batch normalisation layers not only increases SR performance but also reduces GPU memory usage by 40%. This way, significantly larger models can be trained.
Ledig et al. [21] proposed a deep residual network which is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive Mean-Opinion-Score (MOS) test shows significant gains in perceptual quality using SR based on Generative Adversarial Network (SRGAN). In addition, the authors present a new perceptual loss based on content loss and adversarial loss.
Yu et al. [14] proposed the key idea of wide activation to explore efficient ways to expand features before ReLU, since simply adding more parameters is inefficient for smartphone-based image SR scenarios. The authors present two new networks named Wide Activation for Efficient and Accurate Image Super-Resolution (WDSR). These networks (WDSR-A and WDSR-B) yielded better results on the large-scale DIV2K image super-resolution benchmark in terms of PSNR with the same or lower computational complexity. Similar results, but with a larger number of parameters, are presented by Lim et al. [15] in a model called Enhanced Deep Residual Networks for Single Image Super-Resolution (EDSR).
Specifically for biometric applications, some papers have explored the use of SR in iris recognition in the visible and near-infrared spectrum. Ribeiro et al. [26] proposed a SISR method using CNNs for iris recognition. In particular, the authors test different state-of-the-art CNN architectures and use different training databases in both the near-infrared and visible spectra. Their results are validated on a database of 1,872 near-infrared iris images and on a smartphone image database. The experiments show that using deeper architectures trained with texture databases, which provide a balance between edge preservation and smoothness, can lead to good results in the iris recognition process. Furthermore, the authors used PSNR and SSIM to measure the quality of the reconstruction. More recently, Alonso-Fernandez et al. [27] presented a comprehensive survey of iris SR approaches. They also described an Eigen-patches reconstruction method based on the principal component analysis and Eigen-transformation of local image patches. The inherent structure of the iris is reproduced by building a patch-position-dependent dictionary. The authors also used PSNR and SSIM to measure the quality of the reconstruction in the NIR spectrum and on the NTNU database in the visible spectrum [28].
1) Metrics
Deep learning-based methods for SISR significantly outperform conventional approaches in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [22]. In this section, we review these two metrics.
SSIM is an objective metric used for measuring the structural similarity between images from the perspective of the human visual system. It is based on three relatively independent properties: luminance, contrast, and structure. The SSIM metric can be seen as a weighted product of the comparison of luminance, contrast, and structure computed independently. Therefore, SSIM is defined as:\begin{equation*} \mathrm {SSIM}(x,y) = \frac {(2\mu _{x}\mu _{y} + C_{1}) (2 \sigma _{xy} + C_{2})} {(\mu _{x}^{2} + \mu _{y}^{2}+C_{1}) (\sigma _{x}^{2} + \sigma _{y}^{2}+C_{2})} \tag{1}\end{equation*} where \mu _{x} and \mu _{y} are the mean intensities of the images x and y, \sigma _{x}^{2} and \sigma _{y}^{2} their variances, \sigma _{xy} their covariance, and C_{1} and C_{2} small constants that stabilise the division.
PSNR is a common objective metric to measure the reconstruction quality of a lossy transformation. It decreases with the logarithm of the Mean Squared Error (MSE) between the ground truth image and the generated image:\begin{equation*} \mathrm {PSNR} = 10 \log _{10}\left ({\frac {\max ^{2}}{\mathrm {MSE}}}\right)\tag{2}\end{equation*} where \max denotes the maximum possible pixel value of the image.
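The two metrics can be illustrated with a minimal sketch. It assumes images given as flat lists of pixel values, uses the standard product form of SSIM evaluated over a single global window (practical implementations usually average over local windows), and takes the commonly used constants C1 = (0.01 max)^2 and C2 = (0.03 max)^2. The function names are ours and are not taken from any of the cited implementations.

```python
import math


def mse(x, y):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def psnr(x, y, max_value=255.0):
    """PSNR in dB (Eq. 2); infinite when the two images are identical."""
    err = mse(x, y)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


def ssim_global(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM computed once over the whole image (single global window)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    # Assumed stabilising constants (common defaults, not from the paper).
    c1 = (k1 * max_value) ** 2
    c2 = (k2 * max_value) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


if __name__ == "__main__":
    reference = [10.0, 50.0, 90.0, 130.0, 170.0, 210.0]
    degraded = [12.0, 47.0, 95.0, 128.0, 165.0, 214.0]
    print("PSNR:", psnr(reference, degraded))
    print("SSIM:", ssim_global(reference, degraded))
```

Both functions return their best value (infinite PSNR, SSIM of 1) when the reconstructed image equals the reference, which is a quick sanity check for any implementation.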
B. Periocular Recognition
Periocular recognition based on traditional feature extraction methods such as intensity, shape, texture, fusion, and off-the-shelf CNN features with pre-trained models has been widely studied. However, to the best of our knowledge, only a few papers have explored the use of SR methods to improve the quality of the RGB images coming from periocular selfie captures.
Padole and Proença [29] proposed a new initialisation strategy for the definition of the periocular region-of-interest and analysed the factors that degrade the performance of periocular biometrics. Their study covers the influence of Histogram of Oriented Gradients (HOG), Local Binary Pattern (LBP), and Scale-Invariant Feature Transform (SIFT) features, fusion at the score level, the effect of the reference points of the eyes, covariates, occlusion, and pigmentation level.
Raja et al. [30] explore multi-modal biometrics as a means for secure authentication. The proposed system employs face, periocular, and iris images, all captured with embedded smartphone cameras. As the face image is captured closely, one can always obtain periocular and iris information with fine details. This work also explores various score level fusion schemes of complementary information from all three modalities. In addition, the same authors used the periocular region in [20] for authentication under unconstrained acquisition. They acquired a new database named Visible Spectrum Periocular Image (VISPI), and proposed two new feature extraction techniques to achieve robust and blur-invariant biometric verification using periocular images captured by smartphones.
Ahuja et al. [31] proposed a hybrid convolution-based model for verifying pairs of periocular RGB images. They composed the hybrid model as a combination of an unsupervised and a supervised CNN, and augmented the combination with a SIFT model.
Hernandez-Diaz et al. [32] proposed a method to apply existing architectures pre-trained on the ImageNet Large Scale Visual Recognition Challenge, to the task of periocular recognition. These networks have proven to be very successful for many other computer vision tasks apart from the detection and classification tasks for which they were designed. They demonstrate that these off-the-shelf CNN features can effectively recognise individuals based on periocular images.
More recently, Kumari and Seeja [8] surveyed periocular biometrics and provided a deep insight into various aspects, including the utility of the periocular region as a stand-alone modality, its fusion with iris, its application in smartphone authentication, and its role in soft biometric classification. In their review, the authors did not mention SR approaches.
Proposed Method
As mentioned in Sect. I and depicted in Fig. 1, we focus in this work on a two-stage system. First, we improve the SR approaches for periocular images. Second, we use that improved SR method to enhance the recognition performance of periocular-based biometric systems, in contrast to traditional SR methods. We describe in Sect. III-A the proposed ESISR technique, and in Sect. III-B the feature extraction and comparison methods utilised for periocular recognition.
A. Stage-1: Super-Resolution
In this section, we present an efficient image SR network that is able to recover periocular images from selfies (ESISR). Our network includes two building blocks, as can be observed in Fig. 2: a feature extraction stage and a reconstruction stage based on DCSCN, which are described in the remainder of this section.
Since SR in general is an image-to-image translation task where the input image is highly correlated with the target image, researchers try to learn only the residuals between them (i.e. global residual learning). This process avoids learning a complicated transformation from a complete image to another. Instead, it only requires learning a residual map to restore the missing high-frequency details. Since most regions’ residuals are close to zero, the model complexity and learning difficulty are thus greatly reduced.
Local residual learning, in turn, is similar to ResNet: it alleviates the degradation problem caused by ever-increasing network depths, reduces the training difficulty, and improves the learning ability. For these reasons, we use recursive learning, which means applying the same modules multiple times, to learn higher-level features without introducing an overwhelming number of parameters.
In addition to choosing an appropriate network architecture, the definition of the perceptual loss function is critical for the performance of the proposed method based on the DCSCN network, as mentioned in Sects. I and II. While SR is commonly based on the MSE, PSNR, and SSIM metrics, we have designed a loss function that also incorporates a sharpness measure with respect to perceptually relevant features. The function thus balances the minimisation of the difference between the sharpness values against the weighted SSIM and PSNR results.
1) Pre-Processing
The original RGB images captured with a smartphone represent an additive colour space where colours are obtained by a linear combination of Red, Green, and Blue values. The three channels are thus correlated by the amount of light on the surface. In order to avoid such correlations, all the images were converted from RGB to YCbCr. The YCbCr colour space is derived from RGB, and separates the luminance and chrominance components into different channels. In particular, it has the following three components: i) Y, Luminance or Luma component obtained from RGB after gamma correction; ii)
2) Feature Extraction
As mentioned above, the Y component of the converted image is used as input for our model. Several patches of
Pixel-shuffle convolution layer that aggregates the feature maps from LR space and builds the SR image in a single step. Based on [33].
A model with transpose convolution instead of pixel-shuffle was trained to explore the quality of the reconstruction images [7]. See Fig. 4. Transpose convolution operates conversely to normal convolution, predicting the input based on feature maps sized like convolution output. It increases image resolution by expanding the image by adding zeros and performing convolution operations.
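The difference between the two upsampling strategies can be sketched in a few lines. The example below is an illustration only: it assumes feature maps stored as nested Python lists, a channel ordering in which output pixel (y, x) of channel c is read from input channel c·r² + (y mod r)·r + (x mod r), and it shows only the zero-insertion step of the transposed convolution, not the learned convolution that follows it. The function names are ours.

```python
def pixel_shuffle(features, r):
    """Rearrange a [C*r*r][H][W] stack of feature maps into [C][H*r][W*r]."""
    channels = len(features) // (r * r)
    height, width = len(features[0]), len(features[0][0])
    out = [[[0] * (width * r) for _ in range(height * r)]
           for _ in range(channels)]
    for c in range(channels):
        for y in range(height * r):
            for x in range(width * r):
                # Each output pixel is copied from one LR-space feature map.
                src = c * r * r + (y % r) * r + (x % r)
                out[c][y][x] = features[src][y // r][x // r]
    return out


def zero_insert(image, r):
    """Expand an [H][W] image by a factor r, filling the new pixels with zeros."""
    height, width = len(image), len(image[0])
    out = [[0] * (width * r) for _ in range(height * r)]
    for y in range(height):
        for x in range(width):
            out[y * r][x * r] = image[y][x]
    return out


if __name__ == "__main__":
    # Four 1x1 feature maps become one 2x2 image for a scale factor of 2.
    print(pixel_shuffle([[[1]], [[2]], [[3]], [[4]]], 2))
    # A 2x2 image is spread over a 4x4 grid before the convolution is applied.
    print(zero_insert([[1, 2], [3, 4]], 2))
```

Pixel-shuffle needs r² feature maps per output channel but keeps all convolutions in LR space, whereas the transposed convolution has to convolve over the enlarged, mostly empty grid.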
3) Reconstruction
Our reconstruction stage uses only one convolutional block with 2 layers (Conv + Relu + Conv) in a recursive path. This block includes
4) Perceptual Loss Function
The ISO/IEC 29794-6 standard on iris image quality introduced a set of quality metrics that can measure the utility of a sample. Based on the NIST IREX evaluation, a sharpness metric was identified as strongly predictive for recognition performance. We follow this finding and measure sharpness with the LoG:\begin{equation*} LoG(x,y)=-\frac {1}{\pi \sigma ^{4}} \left [{1- \frac {x^{2}+y^{2}}{2\sigma ^{2}}}\right]e^{-\frac {x^{2}+y^{2}}{2\sigma ^{2}}} \tag{3}\end{equation*}
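As an illustration of Eq. 3, the following sketch samples the LoG on a small grid and uses it to score the sharpness of a grey-level image. Several choices here are assumptions made for the example and are not taken from the standard or from our implementation: the kernel size and sigma, the subtraction of the kernel mean (so that flat regions give a zero response), and the use of the mean squared filter response as the final score.

```python
import math


def log_kernel(size, sigma):
    """Sample Eq. 3 on an odd size x size grid centred at the origin."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            r2 = (x * x + y * y) / (2.0 * sigma ** 2)
            row.append(-1.0 / (math.pi * sigma ** 4)
                       * (1.0 - r2) * math.exp(-r2))
        kernel.append(row)
    # Assumption: remove the mean so that constant regions respond with zero.
    mean = sum(sum(row) for row in kernel) / (size * size)
    return [[value - mean for value in row] for row in kernel]


def sharpness(image, kernel):
    """Mean squared LoG response over all fully covered positions."""
    k = len(kernel)
    height, width = len(image), len(image[0])
    total, count = 0.0, 0
    for y in range(height - k + 1):
        for x in range(width - k + 1):
            response = sum(kernel[j][i] * image[y + j][x + i]
                           for j in range(k) for i in range(k))
            total += response * response
            count += 1
    return total / count


if __name__ == "__main__":
    kernel = log_kernel(5, 1.0)
    flat = [[10.0] * 8 for _ in range(8)]
    step = [[0.0 if x < 4 else 255.0 for x in range(8)] for _ in range(8)]
    print("flat image:", sharpness(flat, kernel))
    print("step edge :", sharpness(step, kernel))
```

An image with strong edges yields a much larger score than a flat one, which is the behaviour a sharpness-based quality measure relies on.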
It is important to highlight that the loss function aims to improve the quality of the reconstruction. To that end, we combine the classical SSIM and PSNR SR metrics with the recommended sharpness metric for iris images, as follows:\begin{align*}&\hspace {-.5pc} L(I_{LR},I_{HR})= 0.5\cdot \mathrm {LoG}\left ({I_{LR}, I_{HR}}\right) \cdot [0.25\cdot \mathrm {SSIM}\left ({I_{LR},I_{HR}}\right) \\&\qquad\qquad\qquad\qquad\qquad\qquad+ 0.25\cdot \mathrm {PSNR}\left ({I_{LR},I_{HR}}\right)] \tag{4}\end{align*}
B. Stage-2: Periocular Recognition
Most traditional methods in the state-of-the-art are based on machine learning techniques with different feature extraction approaches such as HOG, LBP, and BSIF, or the fusion of some of them [8]. However, today we have powerful pre-trained deep learning methods based on facial images. Using transfer learning techniques, the information extracted from some layers using fine-tuning techniques or embedding approaches could be suitable to perform periocular verification. This is the approach followed in this work.
This task involves estimating an eye embedding vector for a new eye image cropped from a selfie. An eye embedding is a vector that represents the features extracted from the periocular image. The comparison of two embeddings is based on the Euclidean distance: a claim is verified if the distance is below a predefined threshold, which is often tuned for a specific dataset or application. In this paper, the VGGFace [17], FaceNet [16] and ArcFace [18] models have been used as feature extractors for periocular recognition. A comparison with BSIF handcrafted features is also included.
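The comparison step can be sketched as follows. The embeddings and the threshold value in the example are made up for illustration; in practice the vectors come from the pre-trained network and the threshold is tuned on a development set.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def verify(probe, reference, threshold):
    """Accept the identity claim when the embeddings are close enough."""
    return euclidean_distance(probe, reference) < threshold


if __name__ == "__main__":
    reference = [0.12, -0.40, 0.33, 0.85]   # hypothetical enrolled embedding
    probe = [0.10, -0.38, 0.30, 0.88]       # hypothetical probe embedding
    print(euclidean_distance(probe, reference))
    print(verify(probe, reference, threshold=0.5))  # threshold is illustrative
```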
Experimental Setup
A. Experimental Protocol
In order to assess the soundness of the proposed method, we focus on a twofold objective: i) evaluate the SR approaches, and ii) analyse selfie periocular recognition systems using those SR techniques.
1) Super-Resolution Models
First, we have trained the DCSCN, WDSR-A, and SRGAN methods as a baseline for benchmarking purposes. The main properties and default parameters of those methods are summarised in the following:
DCSCN: Number of CNN layers = 12, Number of first CNN filters = 196, Number of last CNN filters = 48, Decay Gamma = 1.5, Self Ensemble = 8, Batch images for training epoch = 24,000, Dropout rate = 0.8, Optimiser function = Adam, Image size for each Batch = 48, Epochs = 100, Early stopping = 10.
WDSR-A: Number of residual blocks = 8, Number of CNN layers in the main branch = 6, Number of expansion of residual blocks = 4, Number of filters main branch = 64, Number of filters residual blocks = 256, Activation function = ReLU, Optimisation function = Adam, Learning rate = 1e-4 and 1e-5, Beta = 1e-7, Size of batch images = 96, Number of steps = 60,000.
SRGAN: The network has two modules:
Generator: This stage is used for learning the inverse of the downsampling function, in order to generate the SR images from their corresponding LR inputs, based on a pre-trained VGG-54. The following parameters are used: Number of residual blocks = 16, Number of CNN layers with residual blocks = 2, Activation function residual block = PReLU, Kernel size residual block = 3, CNN layers = 3, Kernel size = 9, 3, and 9, Number of filters = 64, Optimisation function = Adam, Learning rate = 1e-4 and 1e-5, Batch image size = 96, Steps = 100,000, Mini-batch size = 16.
Discriminator: In order to evaluate the similarity between the images generated by the SR generator (VGG-54) and the HR images, the discriminator is trained with the following parameters: CNN layers = 8, Number of filters = 64, 64, 128, 128, 256, 256, 512 and 512, Kernel size = 3, Activation function = ReLU, Momentum batch normalisation = 0.8, Optimisation function = Adam, Learning rate = 1e-5 and 1e-6, Batch size = 16, Steps = 100,000.
Subsequently, we evaluated our ESISR method using the pixel-shuffle technique [34]. The best parameters for our approach were: Number of CNN layers = 7, Number of first CNN filters = 32, Number of last CNN filters = 8, Decay Gamma = 1.2, Self Ensemble = 8, Batch images for training epoch = 24,000, Dropout rate = 0.5, Optimiser function = Adam, Image size for each Batch = 32, Epochs= 100, Early stopping = 10. We further improved the efficiency of our proposal by using the transpose convolution instead of pixel-shuffle.
In all experiments, we assess the quality of the produced SR images using the sharpness function defined in Eq. 3, and the efficiency in terms of the number of features and parameters. It should be noted that the True Sharpness represents the sharpness of the original image (prior to downsampling), and the Output Sharpness represents the sharpness of the reconstructed high-resolution image created by ESISR. Therefore, the goal is to achieve an Output Sharpness as close as possible to the True Sharpness. From those experiments, we selected the configuration achieving the best performance. All methods were trained using the Samsung database and tested with the Set-5E dataset.
2) Periocular SR Verification
We then extract the embedded information from selfie periocular images and compare the results with a handcrafted method for the periocular verification system. Afterwards, feature extraction was applied to the best super-resolved images with x2, x3, and x4 upscaling, and compared with images of the same sizes obtained with traditional resizing methods such as inter-area, inter-linear, and inter-cubic interpolation. All the SR methods for periocular verification were tested using the MobBIO and NTNU datasets, which are different from the ones used to train the SR stage in order to ensure unbiased results.
BSIF handcrafted features were used to extract textural information. An exhaustive exploration of the 60 filters was made. The image was divided into two rows and three columns. For each patch, a histogram was estimated. The concatenation of all the histograms represents the final vector. In this case, the
In more detail, the FaceNet, VGGFace, and ArcFace pre-trained models were used to extract the embedding information. For FaceNet the feature vector has a size of 1,722 and input size image of
B. Databases
In order to analyse the performance of the SR algorithm, four databases were used. A new dataset was acquired in a collaborative effort with subjects from different countries with Samsung smartphones using an app
Example of Samsung databases. Left: closest position. Middle: half arm extended. Right: full arm extended.
A second dataset, Set-5E, was created to validate the results. This database has 100 images from different subjects acquired with different smartphones, extracted from the CSIP database in the visible spectrum [35]. The CSIP database has 2,004 images, stemming from 50 subjects over 10 different mobile setups.
A third database, MobBIO, was super-resolved with the best pre-trained super-resolution model (ESISR). It was also used to measure the performance of the periocular verification system. The MobBIO dataset comprises the biometric data from 152 volunteers. Each subject provided samples of face, iris, and voice. There are on average 8 images for each subject from a NOKIA N93i mobile. Some examples are presented in Fig. 6.
The last database is VISPI, captured by NTNU, which was used to measure the performance of the periocular verification system. The NTNU dataset comprises the biometric data from 152 volunteers and 3,139 total images. Each subject provided samples of left and right iris. There are on average 11 images for each subject from a NOKIA N93i mobile. Some examples are presented in Fig. 7.
Results and Discussion
A. Super-Resolution Models
First, we establish a baseline by testing the DCSCN, WDSR-A, and SRGAN models with their default parameters. Then, we analyse our proposal (ESISR-X) using pixel-shuffle and the new loss function including the sharpness metric (see Eqs. 3 and 4). Table I summarises the results: Rows 1–3 show the results for traditional SR methods (DCSCN with 12 layers and
Observing the results, we note that all the image enlargements (x2, x3, and x4) extract the same number of features for each method (i.e., 1,301 for DCSCN and 1,000 for ESISR). The most considerable difference lies in the number of parameters of each method. The DCSCN, WDSR-A, and SRGAN methods need a large number of parameters: 1,754,942, 597,000, and 24,864,000 for images increased by x2; 2,170,142, 603,000, and 25,131,000 for images increased by x3; and 2,087,102, 610,000, and 26,939,000 for images increased by x4. These numbers are drastically reduced by our proposed ESISR-1 method, which needs only 27,209 parameters when the image is increased by x2, 28,654 parameters when increased by x3, and 64,201 parameters when increased by x4.
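To put these figures in perspective, the short sketch below computes how many times fewer parameters ESISR-1 needs than DCSCN at each scale, using only the parameter counts quoted above. The dictionary layout and function name are ours.

```python
# Parameter counts quoted in the text for DCSCN and ESISR-1 at each scale.
PARAMS = {
    "x2": {"DCSCN": 1_754_942, "ESISR-1": 27_209},
    "x3": {"DCSCN": 2_170_142, "ESISR-1": 28_654},
    "x4": {"DCSCN": 2_087_102, "ESISR-1": 64_201},
}


def reduction_factor(baseline, proposed):
    """How many times fewer parameters the proposed model needs."""
    return baseline / proposed


for scale, counts in PARAMS.items():
    factor = reduction_factor(counts["DCSCN"], counts["ESISR-1"])
    print(f"{scale}: about {factor:.1f} times fewer parameters than DCSCN")
```

The reduction is roughly 65-fold for x2, 76-fold for x3, and 33-fold for x4.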
In addition to that gain in terms of efficiency, we may observe in Table I that the newly proposed loss function based on sharpness allows us to get a good reconstruction. The Output Sharpness for each scale value is similar to the values obtained by DCSCN (e.g. 16.85 vs. 16.70 for x2), and also close to the target True Sharpness of 17.04. Therefore, we may conclude that the proposed method keeps the sharpness quality of the images, thereby making it suitable for SR applications for mobile devices.
In addition to the baseline configuration of ESISR-1, we also evaluated two additional approaches. First, the most efficient implementation of ESISR with a big reduction of features (down to 131) and a number of parameters with pixel-shuffle and
Regarding the quality of the SR iris images, we can observe that both configurations tested in this last experiment (rows 5-6) achieve a similar sharpness for the x3 and x4 scale values (14.43, 14.38, 15.46 and 16.32), but not for x2. In the latter case, the pixel-shuffle approach clearly outperforms the transpose-convolution method (15.43 vs. 14.38). The worst reconstruction result was obtained by the SRGAN method, which has a higher number of parameters and shows a considerable difference in the value of the output sharpness.
B. Ablation Study
In order to show the effectiveness of the modified loss used to train ESISR, an ablation analysis was carried out (see Table II). We measure the performance of ESISR with different weight values for
C. Periocular SR Verification
We now evaluate the periocular verification systems including the SR methods analysed in the previous section. In order to assess the quality of the super-resolved images, the MobBIO and NTNU datasets were used to evaluate the reconstruction performance with the best SR method proposed in Sect. V-A, namely ESISR with pixel-shuffle.
Figs. 8 and 9 show the DET curves of the periocular verification system for the MobBIO and NTNU datasets at the standard resolution (Resolution x1) in comparison with SR images enlarged by x2, x3, and x4 using the ESISR method. The results cover VGGFace, FaceNet, ArcFace, and three different BSIF filters, with the equal error rate (EER) reported for each one. An essential observation from Figs. 8 and 9 is that, in this case, SR methods help maintain the recognition accuracy when selfies are captured at different distances, rather than improving the periocular recognition performance.
DET curves for the MobBIO and NTNU datasets using the SR method (x1 and x2), including periocular recognition systems based on deep learning and handcrafted features (BSIF). The EER is shown in parentheses for each technique.
DET curves for the MobBIO and NTNU datasets using the SR method (x3 and x4), including periocular recognition systems based on deep learning and handcrafted features (BSIF). The EER is shown in parentheses for each technique.
For images with a standard resolution (Resolution x1), VGGFace obtained the best results on the MobBIO dataset, with an EER of 16.12%. On the NTNU dataset, the best results were obtained by FaceNet, with an EER of 8.89%.
For SR images at x2, x3, and x4, FaceNet outperforms VGGFace and ArcFace, obtaining EERs of 8.92%, 8.86%, and 9.33%, respectively, on the NTNU dataset. Conversely, MobBIO showed the weakest performance: SR at x2, x3, and x4 yielded EERs of 17.93%, 19.45%, and 22.52%, respectively. Regarding BSIF, the three proposed filters reached a lower performance, with EERs over 20%. ArcFace obtained the worst results among the deep learning methods for both datasets.
It is essential to highlight that all three scales preserve the periocular verification quality thanks to the proposed perceptual sharpness loss. Thus, a weighted perceptual loss helps to preserve the quality of the images as measured by sharpness. This metric is more suitable for SR of periocular iris images than the traditional SNR and SSIM.
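The general shape of a sharpness-weighted reconstruction loss can be sketched as follows. This is only an illustration under stated assumptions: the paper's loss (Eq. 3) and the ISO/IEC 29794-6 sharpness computation are not reproduced here. Instead, the mean squared response of a 3x3 Laplacian is used as a stand-in sharpness proxy, and the function names and the default `weight` value are hypothetical.

```python
import numpy as np


def sharpness_proxy(img) -> float:
    """Mean squared 3x3 Laplacian response over interior pixels.

    NOTE: a simple stand-in, not the ISO/IEC 29794-6 sharpness metric.
    """
    img = np.asarray(img, dtype=float)
    # Four-neighbour Laplacian evaluated with shifted slices (valid region only).
    response = (
        img[:-2, 1:-1] + img[2:, 1:-1] + img[1:-1, :-2] + img[1:-1, 2:]
        - 4.0 * img[1:-1, 1:-1]
    )
    return float(np.mean(response ** 2))


def sharpness_weighted_loss(sr, hr, weight: float = 0.1) -> float:
    """Pixel-wise MSE plus a weighted penalty on the sharpness mismatch."""
    sr = np.asarray(sr, dtype=float)
    hr = np.asarray(hr, dtype=float)
    mse = float(np.mean((sr - hr) ** 2))
    sharpness_gap = abs(sharpness_proxy(sr) - sharpness_proxy(hr))
    return mse + weight * sharpness_gap


if __name__ == "__main__":
    hr = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)  # checkerboard target
    blurred = np.full((8, 8), 0.5)                           # fully blurred output
    print(sharpness_weighted_loss(hr, hr))       # identical images: zero loss
    print(sharpness_weighted_loss(blurred, hr))  # blur is penalised beyond MSE
```

Setting `weight` to zero reduces the loss to plain MSE, which is the kind of comparison an ablation over weight values would make.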
Table 3 (top) shows the results for the MobBIO dataset for SR images enlarged by a factor of x2, x3, and x4, benchmarked with the pre-trained FaceNet, VGGFace, and ArcFace models and the BSIF filters as feature extractors. In addition, three traditional resizing operations (Inter-linear, Inter-cubic, and Inter-area) were explored, analysed, and compared with the super-resolution technique. The results change only slightly with respect to those of the super-resolution process.9 We can observe that the features extracted by the deep learning methods (embeddings) performed better than the BSIF filters. Overall, FaceNet reached the best results for all the models at x2, x3, and x4 in comparison with VGGFace on the NTNU dataset. This result is interesting for high-security applications, since operating points are usually defined at small FMR values.
The results are related to the size of the embedding vector extracted from the pre-trained model. The features extracted by FaceNet are more representative and general-purpose than those of VGGFace and ArcFace.
Table 3 (bottom) shows the results for the NTNU dataset for SR images enlarged by a factor of x2, x3, and x4, benchmarked with the pre-trained FaceNet, VGGFace, and ArcFace models and the BSIF filters as feature extractors. In addition, a comparison with traditional resizing methods (InterArea, InterCubic, and InterLinear) was performed. Column one shows the names of all the techniques explored. Columns 4 to 6 show the results of the best BSIF filters selected. The reported results show the EER and the False Non-Match Rate (FNMR) at a False Match Rate (FMR) of 10%. Rows 7, 12, and 17 also include WDSR-A, the second-best SR method from Table I, in order to compare it with E-SISR in a verification system. It is essential to highlight that WDSR-A has at least 20 times more parameters than our proposed method for x2. The EERs obtained for WDSR-A on face verification are, on average, 2% lower than those of our E-SISR. Both methods reach competitive results and recover the verification performance when selfies are captured in the wild.
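As a reminder of how the two reported metrics relate to the comparison scores, the following pure-Python sketch computes an approximate EER and the FNMR at a target FMR from lists of mated and non-mated scores. It assumes dissimilarity scores (a comparison is accepted when the score is at or below the threshold) and uses a simple threshold sweep without interpolation; the function names are illustrative and do not come from the evaluation code used in the paper.

```python
def error_rates(mated, non_mated, threshold):
    """Return (FMR, FNMR) when scores <= threshold are accepted as matches."""
    fnmr = sum(s > threshold for s in mated) / len(mated)
    fmr = sum(s <= threshold for s in non_mated) / len(non_mated)
    return fmr, fnmr


def eer(mated, non_mated):
    """Approximate EER: the operating point where |FMR - FNMR| is smallest."""
    best_gap, best_rate = None, None
    for t in sorted(set(mated) | set(non_mated)):
        fmr, fnmr = error_rates(mated, non_mated, t)
        gap = abs(fmr - fnmr)
        if best_gap is None or gap < best_gap:
            best_gap, best_rate = gap, (fmr + fnmr) / 2
    return best_rate


def fnmr_at_fmr(mated, non_mated, target_fmr=0.10):
    """Lowest FNMR among the thresholds whose FMR does not exceed target_fmr."""
    best = 1.0  # rejecting every comparison always satisfies the FMR constraint
    for t in sorted(set(mated) | set(non_mated)):
        fmr, fnmr = error_rates(mated, non_mated, t)
        if fmr <= target_fmr:
            best = min(best, fnmr)
    return best


if __name__ == "__main__":
    mated = [0.1, 0.2, 0.3, 0.6]       # toy scores, for illustration only
    non_mated = [0.4, 0.5, 0.7, 0.8]
    print(eer(mated, non_mated), fnmr_at_fmr(mated, non_mated, 0.10))
```

With realistic numbers of comparisons, the threshold sweep would usually be vectorised and the EER interpolated between neighbouring operating points.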
Conclusion
In this paper, we have proposed an efficient and accurate image super-resolution method focused on the generation of enhanced eye images for periocular verification purposes using selfie images. To that end, we developed a two-stage approach based on a CNN with pixel-shuffle, a new loss function based on a sharpness metric (see Eq. 3) derived from the ISO/IEC 29794-6 standard for iris quality, and a selfie periocular verification proposal.
In the feature extraction stage of our method, the CNN model extracts optimised features, which are subsequently sent to the reconstruction network. In this latter network, we used only a recursive convolutional block with pixel-shuffle to obtain a better reconstruction performance with reduced computational requirements. In addition, the model is designed to be capable of processing original-size images. Using these techniques, our model can achieve state-of-the-art performance with far fewer parameters (compared with the state-of-the-art DCSCN with about 2 million parameters, we achieve a comparable quality with about 27,000 parameters).
The perceptual loss function based on image sharpness that we propose allows us to preserve the sharpness of iris images in the images reconstructed at x2, x3, and x4. This approach to improving the quality of the reconstruction and of SR in periocular recognition systems is well suited for implementation on mobile devices.
Regarding the periocular verification system, as expected, the deep learning methods yielded better results than the handcrafted methods. FaceNet achieved the best results in comparison to VGGFace and ArcFace, with an EER of 8.7% without SR, 9.2% for x2, 8.9% for x3, and 9.5% for x4. Conversely, a slightly lower performance was reached when VGGFace was used: an EER of 10.05% without SR, 9.94% for x2, 9.92% for x3, and 9.90% for x4.
Overall, there are marginal improvements for verification systems when only the size of the images is considered in combination with SR images. The information extracted as an embedding vector from the periocular area with a pre-trained model is of higher quality for verification than that of BSIF because of the huge number of filters used during the training process.
Uncontrolled conditions such as sunlight, occlusions, rotations, or the number of people in an image when a remote selfie is captured could be more challenging than the image size for RGB selfie images. Those uncontrolled conditions need to be examined to improve selfie periocular verification systems. The application of this improvement to NIR iris images must be studied in separate work.
In this research, SR helps maintain the recognition accuracy when selfies are captured at different distances, that is, in realistic scenarios as opposed to fully controlled conditions. Our system was tested on images acquired at three different distances and obtained similar results to a baseline system with a single acquisition distance, even when the selfie was resized using SR at x2, x3, and x4.
In future work, we will continue to collect images to train a specific periocular verification system based on a CNN from scratch and/or using transfer-domain techniques. Concerning the number of images, we believe that if we use state-of-the-art pre-trained models, the machine learning-based methods could be replaced by CNN models. The selection of the pre-trained models should also be taken into account.
Disclaimer
This text reflects only the author’s views, and the Commission is not liable for any use that may be made of the information contained therein.
VII. Appendix A
Fig. 10 shows examples of the proposed SR method ESISR with super-resolution at x2, x3, and x4, respectively.
Examples of MobBIO super-resolution images. Top: x2. Middle: x3. Bottom: x4. Enlarge the images to see the effect of SR.
Fig. 11 shows the probability density functions of the comparison scores between mated and non-mated feature vectors for the FaceNet, VGGFace, and ArcFace models. The VGGFace scores are spread more widely, between 0.1 and 1.0, in contrast to the FaceNet and ArcFace scores, which are concentrated between 0.1 and 0.4. All distributions show some overlap, which in turn leads to the non-perfect verification rates presented in the following.
Mated and non-mated score distributions for FaceNet (top), VGGFace (middle) and ArcFace (bottom).
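The mated and non-mated score lists behind such distributions can be built from labelled embeddings as in the minimal sketch below. The comparison function used in the paper is not stated in this section, so cosine distance is an assumption here, as are the function names and the toy two-dimensional vectors.

```python
import math
from itertools import combinations


def cosine_distance(a, b) -> float:
    """Return 1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def comparison_scores(embeddings):
    """Split all pairwise distances into mated and non-mated lists.

    `embeddings` is a list of (subject_id, vector) pairs; a pair of samples
    is mated when both samples come from the same subject.
    """
    mated, non_mated = [], []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings, 2):
        target = mated if id_a == id_b else non_mated
        target.append(cosine_distance(vec_a, vec_b))
    return mated, non_mated


if __name__ == "__main__":
    samples = [
        ("s1", [1.0, 0.0]),
        ("s1", [1.0, 0.1]),
        ("s2", [0.0, 1.0]),
    ]
    mated, non_mated = comparison_scores(samples)
    print(len(mated), len(non_mated))  # 1 mated pair, 2 non-mated pairs
```

Plotting a histogram or kernel density estimate of each list gives curves of the kind shown in Fig. 11; the more the two curves overlap, the higher the resulting error rates.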
Figs. 12, 13, and 14 show the DET curves of the periocular verification system at the standard resolution (no resizing) for the MobBIO and NTNU datasets. A comparison with Inter-area, Inter-cubic, and Inter-linear resizing by x2, x3, and x4 is depicted. The results also show the EERs for ArcFace, VGGFace, FaceNet, and BSIF.
DETs for the MobBIO and NTNU datasets including selfie recognition systems based on traditional inter-area resizing. The EER is shown in parentheses for each method.
DETs for the MobBIO and NTNU datasets including selfie recognition systems based on traditional inter-cubic resizing. The EER is shown in parentheses for each method.
DETs for the MobBIO and NTNU datasets including periocular recognition systems based on traditional inter-linear resizing. The EER is shown in parentheses for each method.