Selfie Periocular Verification Using an Efficient Super-Resolution Approach

Abstract:

Selfie-based biometrics has great potential for a wide range of applications since, e.g. periocular verification is contactless and is safe to use in pandemics such as COVID-19, when a major portion of a face is covered by a facial mask. Despite its advantages, selfie-based biometrics presents challenges since there is limited control over data acquisition at different distances. Therefore, Super-Resolution (SR) has to be used to increase the quality of the eye images and to keep or improve the recognition performance. We propose an Efficient Single Image Super-Resolution algorithm, which takes into account a trade-off between the efficiency and the size of its filters. To that end, the method implements a loss function based on the Sharpness metric used to evaluate iris image quality. Our method drastically reduces the number of parameters compared to the state-of-the-art: from 2,170,142 to 28,654. Our best results on remote verification systems with no redimensioning reached an EER of 8.89% for FaceNet, 12.14% for VGGFace, and 12.81% for ArcFace. When embedding vectors were extracted from SR images, the FaceNet-based system yielded an EER of 8.92% for a resizing of x2, 8.85% for x3, and 9.32% for x4.

Published in: IEEE Access ( Volume: 10)

Page(s): 67573 - 67589

Date of Publication: 20 June 2022

Electronic ISSN: 2169-3536

DOI: 10.1109/ACCESS.2022.3184301

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Smartphones, and mobile devices in general, nowadays play a central role in our society. We use them on a daily basis not only for communication purposes, but also to access social media and for sensitive tasks such as online banking. In order to increase the security level of those more sensitive applications, verifying the subject’s identity plays a key role. To tackle this requirement, many companies are currently working towards creating applications that verify the subject’s identity by comparing a selfie image with the reference face image stored in the embedded chip of an ID card or passport, which is read from the smartphone using Near Field Communication (NFC) [1]. This represents a user-friendly identity verification process, which can be easily embedded into numerous applications. However, this verification process also faces some challenges: for instance, the selfie image is captured in an uncontrolled scenario, where occlusions due to wearing a scarf in winter or a hygienic facial mask during a pandemic such as COVID-19 may hinder the performance of general face recognition algorithms. Therefore, there is a reinforced need to explore alternatives which can deal with those occluded images successfully, such as utilising the periocular region for recognition purposes.

The aforementioned reasons have increased the interest in periocular-based biometrics in the last decade in different scenarios [2]–[5]. In particular, it has been shown that periocular images captured with mobile devices for recognition purposes are mainly acquired as selfie face images. The number of digital photos also keeps increasing every year: in 2022, 1.5 trillion images were taken, and 90% of them came from smartphones.¹ In order to recognise individuals from a selfie in a remote verification system, the periocular region needs to be cropped, and the resulting periocular sample often has a very low resolution [6]. Moreover, the subjects capture selfie images in multiple places and backgrounds, using selfie sticks, alone, or with others. This translates into a high intra-class variability in the images, in terms of size, lighting conditions, and face pose.

With the aim of improving the quality of such low-resolution images, several Single Image Super-Resolution (SISR) methods have been recently proposed [7]–[10], mainly based on convolutional neural networks (CNNs). Even though some authors have enhanced such networks to obtain the super-resolution reconstruction more efficiently [11], most approaches still use deep models, which demand large resources and are thus not suitable for mobile or Internet-of-Things (IoT) devices. Furthermore, the loss function used in most techniques is based on the Structural Similarity (SSIM) and Peak Signal-to-Noise Ratio (PSNR) metrics [7]. Even though those metrics are appropriate for increasing the resolution of general-purpose images (e.g., landscapes, cities, or birds), they are less suitable for increasing the quality of images in iris-based biometric applications. In contrast, Part 6 (Iris image data) of the ISO/IEC 29794 standard on biometric sample quality describes sharpness based on the Laplacian of Gaussian (LoG) as one relevant quality metric.

In this work, we have a twofold goal: verify a biometric claim in a verification transaction from a smartphone selfie periocular image in the visible spectrum (VIS) and propose an efficient super-resolution approach (see Fig. 1). As already mentioned, this is a challenging task since there is limited control of the quality of the images taken: selfies can be captured from different distances, light conditions, and resolutions. Therefore, to tackle these issues, we present a SISR algorithm with a novel loss function based on the sharpness LoG metric and a light-weight CNN. This model takes into account the trade-off between the number of layers and filter sizes in order to achieve a light model suitable for mobile devices. Additionally, we explore pixel-shuffle and transposed convolutions in order to recover the fine details of the periocular eye images. To validate our approach, we use different databases for training and testing. In addition, we benchmark both handcrafted features and pre-trained deep learning models. Our method drastically reduces the number of parameters when compared with the state-of-the-art Deep CNNs with Skip Connection and Network (DCSCN) [12]: from 2,170,142 to 28,654 parameters when the image size is increased by a factor of 2.

FIGURE 1.

Block diagram of the verification system proposed, including a super-resolution approach. Top: Traditional approach with resizing images. SR approach with deep learning embeddings (Middle) and with handcrafted features (BSIF, Bottom).

This paper is an extension of our previous work [13]. In that work, we focused on achieving an accurate Enhanced SISR (ESISR) algorithm for periocular eye images taken from selfie images, reporting results in terms of image similarity for the recovered images on a smaller Samsung dataset. In this paper, we evaluate this new ESISR architecture in more detail and benchmark it with two new state-of-the-art methods: WDSR-A [14] and SRGAN [21]. The reasons that led us to such an architecture are fully explained in this work. As an additional contribution, this manuscript includes the performance evaluation of our proposed methods on periocular verification systems using three pre-trained CNNs: FaceNet [16], VGGFace [17], and ArcFace [18]. All methods have now been evaluated on the larger MobBIO [19] and NTNU [20] databases. A benchmark with traditional resizing methods such as inter-area, inter-linear, and inter-cubic (bicubic) interpolation has also been analysed. A handcrafted feature extractor, Binary Statistical Image Filter (BSIF), was also added to evaluate and compare the results with the deep learning approach. Detection Error Trade-off (DET) curves are included to show our proposal’s performance and efficiency. All these new experiments are benchmarked with those previously obtained in [12], [14], [21].

Therefore, the main contributions from this work can be summarised as follows:

An efficient SR architecture is proposed, using only a feature extractor and one block based on recursive learning of reconstruction.
A recursive pixel-shuffle technique is introduced over a transposed convolution to extract and keep fine details of periocular images.
A novel loss function that combines the LoG sharpness iris quality metric with the SR loss function is proposed.
A significant reduction of the number of parameters in comparison with the state-of-the-art is reported.
A novel database for selfie periocular eye images was acquired and is available for researchers upon request.
A periocular verification system based on an embedded vector from three pre-trained models with an SR-based pre-processing of the samples (x2, x3 and x4) was tested.
A benchmark between deep learning approaches and a handcrafted method is reported.
A full analysis of the influence of SR on selfie biometrics scenarios with traditional resizing methods was also included.

The rest of the article is organised as follows. Sect. II summarises the related works on periocular recognition and super-resolution. The new recognition and super-resolution method is described in Sect. III. The experimental framework is then presented in Sect. IV and the results are discussed in Sect. V. We conclude the article in Sect. VI.

SECTION II.

Related Work

A. Super-Resolution (SR)

Super-resolution (SR) is the process of recovering a high-resolution (HR) image from a low-resolution (LR) one [7], [22]. Supervised machine learning approaches learn mapping functions from LR images to HR images from a large number of examples. The mapping function learned by these models is the inverse of a downgrade function that transforms HR images into LR images. Such downgrade functions can be known or unknown.

Many state-of-the-art super-resolution models learn most of the mapping function in LR space, followed by one or more upsampling layers at the end of the network. Earlier approaches first upsampled the LR image with a pre-defined upsampling operation and then learned the mapping in HR space (pre-upsampling SR). A disadvantage of this approach is that more parameters per layer are required, because these models relied on more convolutional layers rather than on small filters, which leads to higher computational costs and limits the construction of deeper neural networks [7]. SR requires that most of the information contained in an LR image be preserved in the SR image. SR models therefore mainly learn the residuals between LR and HR images. Residual network designs are thus of high importance: identity information is conveyed via skip connections, whereas the reconstruction of high-frequency content is done on the main path of the network [7].

Dong et al. [22] noted that SISR algorithms can be categorised into four types: prediction models, edge-based methods, image statistical methods, and patch-based (or example-based) methods. Their own CNN-based method uses 2 to 4 convolutional layers to show that the learned model performs well on SISR tasks. The authors concluded that using a larger filter size is better than using deeper Convolutional Neural Networks (CNNs).

Kim et al. [23] proposed an image SR method using a Deeply-Recursive Convolutional Network (DRCN), which contains deep CNNs with up to 20 layers. Consequently, the model has a huge number of parameters. However, the CNNs share each other’s weights to reduce the number of parameters to be trained, thereby being able to succeed in training the deep CNN network and achieving a significant performance. The authors conclude in their work that deeper networks are better than large filters.

Yamanaka et al. [12] proposed a Deep CNN with a Residual Net, Skip Connection and Network (DCSCN) model achieving a state-of-the-art reconstruction performance while reducing the computational cost by at least 10 times. According to the existing literature, deep CNNs with residual blocks and skip connections are suitable to capture fine details in the reconstruction process. In the same context, [24] and [25] propose the pixel-shuffle and transposed convolution algorithms in order to extract the most relevant features from the images. The transposed convolutional layer can learn up-sampling kernels. However, the process is similar to the usual convolutional layer and the reconstruction ability is limited. To obtain a better reconstruction performance, the transposed convolutional layers need to be stacked, which means the whole process needs high computational resources [12]. Conversely, pixel-shuffle extracts features from the low-resolution images. The authors [12] argue that batch normalisation loses scale information of images and reduces the range flexibility of activations. Removal of batch normalisation layers not only increases SR performance but also reduces GPU memory usage by 40%. This way, significantly larger models can be trained.

Ledig et al. [21] proposed a deep residual network which is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive Mean-Opinion-Score (MOS) test shows significant gains in perceptual quality using SR based on Generative Adversarial Network (SRGAN). In addition, the authors present a new perceptual loss based on content loss and adversarial loss.

Yu et al. [14] proposed the key idea of wide activation to explore efficient ways to expand features before ReLU, since simply adding more parameters is inefficient for smartphone-based image SR scenarios. The authors present two new networks named Wide Activation for Efficient and Accurate Image Super-Resolution (WDSR). These networks (WDSR-A and WDSR-B) yielded better results on the large-scale DIV2K image super-resolution benchmark in terms of PSNR with the same or lower computational complexity. Similar results but with a larger number of parameters are presented by Lim et al. [15] in a model called Enhanced Deep Residual Networks for Single Image Super-Resolution (EDSR).

Specifically for biometric applications, some papers have explored the use of SR in iris recognition in the visible and near-infrared spectrum. Ribeiro et al. [26] proposed a SISR method using CNNs for iris recognition. In particular, the authors test different state-of-the-art CNN architectures and use different training databases in both the near-infrared and visible spectra. Their results are validated on a database of 1,872 near-infrared iris images and on a smartphone image database. The experiments show that using deeper architectures trained with texture databases that provide a balance between edge preservation and the smoothness of the method can lead to good results in the iris recognition process. Furthermore, the authors used PSNR and SSIM to measure the quality of the reconstruction. More recently, Alonso-Fernandez et al. [27] presented a comprehensive survey of iris SR approaches. They also described an Eigen-patches reconstruction method based on the principal component analysis and Eigen-transformation of local image patches. The inherent structure of the iris is reproduced by building a patch-position-dependent dictionary. The authors also used PSNR and SSIM to measure the quality of the reconstruction in the NIR spectrum and in the NTNU database in the visible spectrum [28].

1) Metrics

Deep learning-based methods for SISR significantly outperform conventional approaches in terms of Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity (SSIM) [22]. In this section, we review these two metrics.

SSIM is an objective metric used for measuring the structural similarity between images from the perspective of the human visual system. It is based on three relatively independent properties: luminance, contrast, and structure. The SSIM metric can be seen as a weighted product of the comparison of luminance, contrast, and structure computed independently. Therefore, SSIM is defined as:

$\begin{equation*} \mathrm {SSIM}(x,y) = \frac {(2\mu _{x}\mu _{y} + C_{1}) (2 \sigma _{xy} + C_{2})} {(\mu _{x}^{2} + \mu _{y}^{2}+C_{1}) (\sigma _{x}^{2} + \sigma _{y}^{2}+C_{2})} \tag{1}\end{equation*}$

where $\mu _{x}$ and $\mu _{y}$ are the averages of x and y, $\sigma _{x}^{2}$ and $\sigma _{y}^{2}$ their variances, and $\sigma _{xy}$ their covariance; $C_{1}$ and $C_{2}$ are two constants that stabilise the division when the denominator is weak.

PSNR is a common objective metric to measure the reconstruction quality of a lossy transformation. It decreases with the logarithm of the Mean Squared Error (MSE) between the ground truth image and the generated image:

$\begin{equation*} \mathrm {PSNR} = 10 \log _{10}\left ({\frac {\max ^{2}}{\mathrm {MSE}}}\right)\tag{2}\end{equation*}$

where max denotes the maximum pixel value, and MSE the mean of the squared differences between the pixel values of the reconstructed super-resolution image and the ground truth image (prior to downsampling). Therefore, this metric measures pixel differences and not the quality of the images.
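As an illustration of the two metrics, the following minimal sketch computes PSNR and a single-window SSIM with NumPy. It assumes 8-bit images (maximum value 255), uses the product form of the SSIM numerator, and sets the stabilising constants from the conventional factors 0.01 and 0.03; these choices and the function names are assumptions made for the example, not settings reported in this work.

```python
import numpy as np

def psnr(reference, reconstructed, max_value=255.0):
    """PSNR in dB: 10 * log10(max^2 / MSE)."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0.0:
        return float("inf")  # identical images have no error
    return float(10.0 * np.log10(max_value ** 2 / mse))

def global_ssim(x, y, max_value=255.0, k1=0.01, k2=0.03):
    """SSIM evaluated once over the whole image (no sliding window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    c1 = (k1 * max_value) ** 2  # assumed stabilising constants
    c2 = (k2 * max_value) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2.0 * mu_x * mu_y + c1) * (2.0 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return float(numerator / denominator)
```

Practical SSIM implementations usually average the index over local windows; the global variant is used here only to keep the example short.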

B. Periocular Recognition

Periocular recognition based on traditional feature extraction methods such as intensity, shape, texture, fusion, and off-the-shelf CNN features with pre-trained models has been widely studied. However, to the best of our knowledge, only a few papers have explored the use of SR methods to improve the quality of the RGB images coming from periocular selfie captures.

Padole and Proença [29] proposed a new initialisation strategy for the definition of the periocular region-of-interest and analysed the performance degradation factors for periocular biometrics, including the influence of Histogram of Oriented Gradient (HOG), Local Binary Pattern (LBP), and Scale-Invariant Feature Transform (SIFT) features, fusion at the score level, the reference points of the eyes, covariates, occlusion, and pigmentation level.

Raja et al. [30] explore multi-modal biometrics as a means for secure authentication. The proposed system employs face, periocular, and iris images, all captured with embedded smartphone cameras. As the face image is captured closely, one can always obtain periocular and iris information with fine details. This work also explores various score level fusion schemes of complementary information from all three modalities. The same authors also used the periocular region in [20] for authentication under unconstrained acquisition in biometrics. They acquired a new database named Visible Spectrum Periocular Image (VISPI), and proposed two new feature extraction techniques to achieve robust and blur-invariant biometric verification using periocular images captured by smartphones.

Ahuja et al. [31] proposed a hybrid convolution-based model for verifying pairs of periocular RGB images. They composed a hybrid model as a combination of an unsupervised and a supervised CNN, and augmented the combination with a SIFT model.

Hernandez-Diaz et al. [32] proposed a method to apply existing architectures pre-trained on the ImageNet Large Scale Visual Recognition Challenge, to the task of periocular recognition. These networks have proven to be very successful for many other computer vision tasks apart from the detection and classification tasks for which they were designed. They demonstrate that these off-the-shelf CNN features can effectively recognise individuals based on periocular images.

More recently, Kumari and Seeja [8] surveyed periocular biometrics and provided a deep insight into various aspects, including the utility of the periocular region as a stand-alone modality, its fusion with iris, its application in smartphone authentication, and its role in soft biometric classification. In their review, the authors did not mention SR approaches.

SECTION III.

Proposed Method

As mentioned in Sect. I and depicted in Fig. 1, we focus in this work on a two-stage system. First, we improve the SR approaches for periocular images. Second, we use that improved SR method to enhance the recognition performance of periocular-based biometric systems, in contrast to traditional SR methods. We describe in Sect. III-A the proposed ESISR technique, and in Sect. III-B the feature extraction and comparison methods utilised for periocular recognition.

A. Stage-1: Super-Resolution

In this section, we present an efficient image SR network that is able to recover periocular images from selfies (ESISR). Our network includes two building blocks, as can be observed in Fig. 2: a feature extraction stage and a reconstruction stage based on DCSCN, which are described in the remainder of this section.

FIGURE 2.

Proposed ESISR method.

Since SR in general is an image-to-image translation task where the input image is highly correlated with the target image, researchers try to learn only the residuals between them (i.e. global residual learning). This process avoids learning a complicated transformation from a complete image to another. Instead, it only requires learning a residual map to restore the missing high-frequency details. Since most regions’ residuals are close to zero, the model complexity and learning difficulty are thus greatly reduced.

Local residual learning, in turn, is similar to the residual learning in ResNet and is used to alleviate the degradation problem caused by ever-increasing network depths, reduce training difficulty, and improve the learning ability. For these reasons, we use recursive learning, which means applying the same modules multiple times, to learn higher-level features without introducing an overwhelming number of parameters.

In addition to choosing an appropriate network architecture, the definition of the perceptual loss function is critical for the performance of the proposed method based on the DCSCN network, as mentioned in Sects. I and II. While SR is commonly based on the MSE, PSNR, and SSIM metrics, we have designed a loss function that also incorporates a sharpness measure with respect to perceptually relevant features. The function thus balances reconstructing images by minimising the difference of the sharpness values against the weighted results of SSIM and PSNR.

1) Pre-Processing

The original RGB images captured with a smartphone represent an additive color-space where colors are obtained by a linear combination of Red, Green, and Blue values. The three channels are thus correlated by the amount of light on the surface. In order to avoid such correlations, all the images were converted from RGB to YCbCr. The YCbCr color space is derived from RGB, and separates the luminance and chrominance components into different channels. In particular, it has the following three components: i) Y, Luminance or Luma component obtained from RGB after gamma correction; ii) $Cr = R - Y$ , how far the red component is from Luma; and iii) $Cb = B - Y$ , how far the blue component is from Luma. We only use the $Y$ component in this work because it stores the high-resolution luminance information, whereas the Cb and Cr components carry the chrominance information. The periocular image areas were automatically cropped from faces to a size of $250\times 200$ pixels. The low-resolution version of the images is generated automatically in the training process using a resize function based on bicubic interpolation, which reduces the images to half their size for SR-X2, to one third for SR-X3, and to one quarter for SR-X4.
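The pre-processing described above can be sketched as follows. This is an illustrative example only: the luma weights are the common BT.601 coefficients, and block averaging is used as a simple stand-in for the bicubic resize so that the code depends on NumPy alone. The function names are hypothetical.

```python
import numpy as np

def rgb_to_luma(rgb):
    """Return the Y (luma) channel of an H x W x 3 RGB image (BT.601 weights)."""
    rgb = np.asarray(rgb, dtype=np.float64)
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def downscale_by_averaging(y, factor):
    """Shrink a 2-D image by an integer factor using block averaging.

    Stand-in for a bicubic resize. The image is cropped so that both
    sides are multiples of `factor` before averaging.
    """
    h, w = y.shape
    h, w = h - h % factor, w - w % factor
    blocks = y[:h, :w].reshape(h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(1, 3))

def make_training_pair(rgb, factor):
    """Build one (low-resolution, high-resolution) luma pair for SR-X{factor}."""
    y_hr = rgb_to_luma(rgb)
    y_lr = downscale_by_averaging(y_hr, factor)
    return y_lr, y_hr
```

For a $250\times 200$ crop, the low-resolution input is $125\times 100$ for SR-X2; for factors 3 and 4 the sketch crops a few border pixels so that the sizes divide evenly.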

2) Feature Extraction

As mentioned above, the Y component of the converted image is used as input for our model. Several patches of $32\times 32$ and $48\times 48$ pixels were extracted from the image and used to grasp the features efficiently. We look for the features that achieve a better trade-off between the number and size of filters of each CNN layer. Seven blocks of $5\times 5$ and $3\times 3$ have been selected after several experiments. The information is extracted using small convolutional blocks with residual connections and stride convolutions in order to preserve both the global and the fine details in periocular images. Only the final features from $3\times 3$ and $5\times 5$ pixels are concatenated, following the recursive pixel-shuffle approach (see Fig. 3). These local skip connections in residual blocks make the network easier to optimise, thereby supporting the construction of deeper networks.

FIGURE 3.

Pixel-shuffle convolution layer that aggregates the feature maps from LR space and builds the SR image in a single step. Based on [33].
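The rearrangement performed by a pixel-shuffle layer can be illustrated with a few lines of NumPy. The sketch below assumes a channels-first layout and the usual depth-to-space ordering, in which $r^{2}$ consecutive channels fill one $r\times r$ block of the output; it shows only the rearrangement, not the learned convolution that precedes it in the network.

```python
import numpy as np

def pixel_shuffle(features, r):
    """Rearrange (C*r*r, H, W) feature maps into a (C, H*r, W*r) image."""
    c_in, h, w = features.shape
    if c_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = c_in // (r * r)
    # Split the channel axis into (output channel, row offset, column offset).
    x = features.reshape(c_out, r, r, h, w)
    # Interleave the offsets with the spatial axes: (c, h, row, w, col).
    x = x.transpose(0, 3, 1, 4, 2)
    return x.reshape(c_out, h * r, w * r)
```

With an upscaling factor of 2, four LR feature maps are thus merged into a single map with twice the height and width, so the whole upscaling happens in one step at the end of the network.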

A model with transpose convolution instead of pixel-shuffle was trained to explore the quality of the reconstruction images [7]. See Fig. 4. Transpose convolution operates conversely to normal convolution, predicting the input based on feature maps sized like convolution output. It increases image resolution by expanding the image by adding zeros and performing convolution operations.

FIGURE 4.

Transpose-convolution operation representation. (a) The starting matrix represents the input image. (b) Expanding operation adds zeros to the images in order to increase the size. (c) The convolution operation is performed again in a new resolution. Based on [7].
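The expand-then-convolve view of the transpose convolution can be sketched for a single channel as follows. This is a minimal illustration with a fixed kernel and no padding or bias; in a trained network the kernel values are learned, and the helper names are hypothetical.

```python
import numpy as np

def insert_zeros(image, stride):
    """Place the input pixels on a grid `stride` apart, with zeros in between."""
    h, w = image.shape
    out = np.zeros(((h - 1) * stride + 1, (w - 1) * stride + 1))
    out[::stride, ::stride] = image
    return out

def conv2d_full(image, kernel):
    """Plain 2-D 'full' convolution (kernel flipped), written with loops."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh - 1, kh - 1), (kw - 1, kw - 1)))
    flipped = kernel[::-1, ::-1]
    out_h = padded.shape[0] - kh + 1
    out_w = padded.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * flipped)
    return out

def transpose_conv2d(image, kernel, stride=2):
    """Upsample by zero insertion followed by a full convolution."""
    return conv2d_full(insert_zeros(image, stride), kernel)
```

With a $2\times 2$ kernel of ones and a stride of 2, each input pixel is simply copied into a $2\times 2$ block, which makes the upsampling effect easy to verify by hand.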

3) Reconstruction

Our reconstruction stage uses only one convolutional block with 2 layers (Conv + Relu + Conv) in a recursive path. This block includes $3\times 3$ convolutions and pixel-shuffle algorithm (see Fig. 3) to create a high-resolution image from a low-resolution input. Batch normalisation was removed. An optimised sub-pixel convolution layer that learns a matrix of up-scaling filters to increase the final LR feature maps into the SR output was used.

4) Perceptual Loss Function

The ISO/IEC 29794-6² standard on iris image quality introduced a set of quality metrics that can measure the utility of a sample. Based on the NIST IREX evaluation,³ a sharpness metric was identified as strongly predictive for recognition performance. We follow this finding and measure:

$\begin{equation*} LoG(x,y)=-\frac {1}{\pi \sigma ^{4}} \left [{1- \frac {x^{2}+y^{2}}{2\sigma ^{2}}}\right]e^{-\frac {x^{2}+y^{2}}{2\sigma ^{2}}} \tag{3}\end{equation*}$

The Laplacian of Gaussian operator (LoG) is thus the sharpness metric used in this work. The sharpness of an image is determined by the power resulting from filtering the image with a Laplacian of Gaussian kernel. The standard deviation of the Gaussian is 1.4.
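A possible way to compute such a sharpness value is sketched below. The kernel samples the LoG expression on a discrete grid with a standard deviation of 1.4; the $9\times 9$ kernel size, the zero-sum correction, and the use of the mean squared response as the "power" are assumptions made for this example and are not taken from the standard.

```python
import numpy as np

def log_kernel(sigma=1.4, size=9):
    """Sample the Laplacian of Gaussian on a size x size grid centred at 0."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(np.float64)
    r2 = (x ** 2 + y ** 2) / (2.0 * sigma ** 2)
    kernel = -1.0 / (np.pi * sigma ** 4) * (1.0 - r2) * np.exp(-r2)
    # Assumption: force a zero sum so that flat regions give no response.
    return kernel - kernel.mean()

def filter_valid(image, kernel):
    """'Valid' 2-D correlation; the LoG kernel is symmetric, so this
    gives the same result as a convolution."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * image[i:i + out_h, j:j + out_w]
    return out

def sharpness(image, sigma=1.4, size=9):
    """Mean squared LoG response, taken here as the power of the filtered image."""
    img = np.asarray(image, dtype=np.float64)
    response = filter_valid(img, log_kernel(sigma, size))
    return float(np.mean(response ** 2))
```

A uniform image yields a sharpness close to zero, whereas an image with a strong edge yields a large value, which is the behaviour expected from a focus measure.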

Now, it is important to highlight that the loss function aims to improve the quality of the reconstruction. To that end, we combine the classical SSIM and PSNR SR metrics with the recommended sharpness metric for iris images, as follows:

$\begin{align*}&L(I_{LR},I_{HR})= 0.5\cdot \mathrm {LoG}\left ({I_{LR}, I_{HR}}\right) \cdot [0.25\cdot \mathrm {SSIM}\left ({I_{LR},I_{HR}}\right) \\&\qquad\qquad\qquad\qquad\qquad\qquad+ 0.25\cdot \mathrm {PSNR}\left ({I_{LR},I_{HR}}\right)] \tag{4}\end{align*}$

where $I_{LR}$ represents a low-resolution image, $I_{HR}$ the corresponding recovered high-resolution image, and LoG the sharpness as defined in Eq. 3. The best values of the weights ($w_{1}$, $w_{2}$, and $w_{3}$) for each specific metric (i.e., 0.25, 0.25, and 0.50) were estimated in a grid search on a training dataset.
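The weighted combination in Eq. 4 can be written as a small function. The sketch below follows the equation literally and takes the three metric values as precomputed numbers, since the way the sharpness term is obtained from the image pair is not detailed here. The helper that enumerates candidate weights is a hypothetical illustration of a grid over weights that sum to one, not the search procedure used in this work.

```python
def combined_loss(sharpness_term, ssim_value, psnr_value,
                  w_ssim=0.25, w_psnr=0.25, w_log=0.50):
    """Eq. 4 as printed: w_log * LoG * (w_ssim * SSIM + w_psnr * PSNR)."""
    return w_log * sharpness_term * (w_ssim * ssim_value + w_psnr * psnr_value)

def candidate_weights(step=0.25):
    """Yield (w_ssim, w_psnr, w_log) triples of positive multiples of `step`
    that sum to 1 (hypothetical grid for a weight search)."""
    n = round(1.0 / step)
    for i in range(1, n - 1):
        for j in range(1, n - i):
            k = n - i - j
            yield (i * step, j * step, k * step)
```

With a step of 0.25 the grid contains three candidates, one of which is the reported setting (0.25, 0.25, 0.50).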

B. Stage-2: Periocular Recognition

Most traditional methods in the state-of-the-art are based on machine learning techniques with different feature extraction approaches such as HOG, LBP, and BSIF, or the fusion of some of them [8]. However, today we have powerful pre-trained deep learning methods based on facial images. Using transfer learning techniques, the information extracted from some layers using fine-tuning techniques or embedding approaches could be suitable to perform periocular verification. This is the approach followed in this work.

This task involves estimating an eye embedding vector for a new given eye from a selfie periocular image. An eye embedding is a vector that represents the features extracted from the periocular images. Two embeddings are compared using the Euclidean distance, and the claim is accepted if the distance is below a predefined threshold, which is often tuned for a specific dataset or application. For this paper, the VGGFace [17], FaceNet [16], and ArcFace [18] models have been used as feature extractors for periocular recognition. A comparison with BSIF handcrafted features is also included.
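The comparison step can be sketched as follows, assuming that the embeddings have already been extracted by one of the pre-trained models. The function names and the threshold values are illustrative; in practice the threshold would be tuned on a development set.

```python
import numpy as np

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.linalg.norm(a - b))

def verify(probe_embedding, reference_embedding, threshold):
    """Accept the identity claim when the embeddings are closer than `threshold`."""
    return euclidean_distance(probe_embedding, reference_embedding) < threshold
```

Lowering the threshold reduces false matches at the cost of more false non-matches, which is the trade-off summarised by the EER.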

SECTION IV.

Experimental Setup

A. Experimental Protocol

In order to assess the soundness of the proposed method, we focus on a twofold objective: i) evaluate the SR approaches, and ii) analyse selfie periocular recognition systems using those SR techniques.

1) Super-Resolution Models

First, we have trained the DCSCN, WDSR-A, and SRGAN methods as a baseline for benchmarking purposes. The main properties and default parameters of those methods are summarised in the following:

DCSCN: Number of CNN layers = 12, Number of first CNN filters = 196, Number of last CNN filters = 48, Decay Gamma = 1.5, Self Ensemble = 8, Batch images for training epoch = 24,000, Dropout rate = 0.8, Optimiser function = Adam, Image size for each Batch = 48, Epochs = 100, Early stopping = 10.
WDSR-A: Number of residual blocks = 8, Number of CNN layers in the main branch = 6, Number of expansion of residual blocks = 4, Number of filters main branch = 64, Number of filters residual blocks = 256, Activation function = ReLU, Optimisation function = Adam, Learning rate = 1e-4 and 1e-5, Beta = 1e-7, Size of batch images = 96, Number of steps = 60,000.
SRGAN: The network has two modules:
- Generator: This stage is used for learning the inverse of the downsampling function and for generating the SR images from their corresponding LR images, based on a pre-trained VGG-54. The following parameters are used: Number of residual blocks = 16, Number of CNN layers with residual blocks = 2, Activation function residual block = PReLU, Kernel size residual block = 3, CNN layers = 3, Kernel sizes = 9, 3, and 9, Number of filters = 64, Optimisation function = Adam, Learning rate = 1e-4 and 1e-5, Batch image size = 96, Steps = 100,000, Mini-batch size = 16.
- Discriminator: In order to evaluate the similarity between the images generated by the SR generator (VGG-54) and the HR images, the discriminator is trained with the following parameters: CNN layers = 8, Number of filters = 64, 64, 128, 128, 256, 256, 512, and 512, Kernel size = 3, Activation function = ReLU, Momentum batch normalisation = 0.8, Optimisation function = Adam, Learning rate = $1e-5$ and $1e-6$ , Batch size = 16, Steps = 100,000.

Subsequently, we evaluated our ESISR method using the pixel-shuffle technique [34]. The best parameters for our approach were: Number of CNN layers = 7, Number of first CNN filters = 32, Number of last CNN filters = 8, Decay Gamma = 1.2, Self Ensemble = 8, Batch images for training epoch = 24,000, Dropout rate = 0.5, Optimiser function = Adam, Image size for each Batch = 32, Epochs= 100, Early stopping = 10. We further improved the efficiency of our proposal by using the transpose convolution instead of pixel-shuffle.

In all experiments, we assess the quality of the produced SR images using the sharpness function defined in Eq. 3, and the efficiency in terms of the number of features and parameters. It should be noted that the True Sharpness represents the sharpness of the original image (prior to downsampling), and the Output Sharpness represents the sharpness of the reconstructed high-resolution image created by ESISR. Therefore, the goal is to achieve an Output Sharpness as close as possible to the True Sharpness. From those experiments, we selected the configuration achieving the best performance. All methods were trained using the Samsung database and tested with the SET-5E dataset.

2) Periocular SR Verification

We then extract the embedded information from selfie periocular images and compare the results with a handcrafted method for the periocular verification system. Afterwards, feature extraction was applied to the best super-resolved images at the x2, x3, and x4 scales, and the results were compared with images enlarged by the same factors using traditional interpolation methods such as inter-area, linear, and cubic. All the SR methods for periocular verification were tested using the MobBIO and NTNU datasets, which are different from the ones used to train the SR stage, in order to ensure unbiased results.

BSIF handcrafted features were used to extract textural information. An exhaustive exploration of the 60 filters was made. The image was divided into two rows and three columns. For each patch, a histogram was estimated. The concatenation of all the histograms represents the final vector. In this case, the $5\times 5$ -5, $9\times 9$ -5 and $11\times 11$ -5 bits show the best performances.
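The grid-histogram step described above can be sketched as follows. The sketch assumes the BSIF filter responses have already been binarised into an integer code image (values in [0, 2^bits)); the BSIF filtering itself is not shown, and the function name `grid_histogram` is hypothetical.

```python
import numpy as np

def grid_histogram(code_img, n_bits=5, rows=2, cols=3):
    """Concatenate per-patch histograms of an integer code image.

    Assumes `code_img` holds non-negative integer codes below 2**n_bits.
    """
    n_bins = 2 ** n_bits
    feats = []
    for band in np.array_split(code_img, rows, axis=0):
        for patch in np.array_split(band, cols, axis=1):
            # One histogram per patch, with a fixed number of bins.
            feats.append(np.bincount(patch.ravel(), minlength=n_bins))
    return np.concatenate(feats)

# Usage: a synthetic 60x90 code image with 5-bit codes.
codes = np.arange(60 * 90).reshape(60, 90) % 32
vector = grid_histogram(codes)
print(vector.shape)
```

With 5-bit codes and a 2 x 3 grid, the final vector has 6 x 32 entries; whether the histograms are normalised before concatenation is not stated in the text, so the sketch keeps raw counts.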

In more detail, the FaceNet, VGGFace, and ArcFace pre-trained models were used to extract the embedding information. For FaceNet, the feature vector has a size of 1,722 and the input image size is $224 \times 224 \times 3$. For VGGFace, the feature vector has a size of 2,048 and the input image size is $224 \times 224 \times 3$. For ArcFace, the feature vector has a size of 512 and the input image size is $112 \times 112 \times 3$. A PC with an Intel i7 CPU, 32 GB of RAM, and a 1080 Ti GPU was used to train all the stand-alone SR models.

B. Databases

In order to analyse the performance of the SR algorithm, four databases were used. A new dataset was acquired in a collaborative effort with subjects from different countries with Samsung smartphones using an app specially designed for this purpose: visualselfie.org.⁴ This app was designed to capture different variations of selfie scenarios at three distances, as depicted in Fig. 5. More specifically, 800 images were selected to be used for training and 100 for testing.⁵ From the training dataset, 228,700 patches of $48\times 48$ px. were created for experiment 2 and of $32\times 32$ px. for experiment 3.

FIGURE 5.

Example of Samsung databases. Left: closest position. Middle: half arm extended. Right: full arm extended.
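The creation of fixed-size training patches mentioned above can be sketched with a simple sliding window. The stride and the sampling strategy actually used are not given in the text, so the `stride` parameter and the function name `extract_patches` are assumptions for illustration only.

```python
import numpy as np

def extract_patches(img, size, stride):
    """Return a stack of (size x size) patches taken every `stride` pixels."""
    h, w = img.shape[:2]
    patches = [
        img[top:top + size, left:left + size]
        for top in range(0, h - size + 1, stride)
        for left in range(0, w - size + 1, stride)
    ]
    if not patches:
        raise ValueError("image is smaller than the requested patch size")
    return np.stack(patches)

# Usage: a dummy 96x96 RGB image cut into 48x48 and 32x32 patches.
image = np.zeros((96, 96, 3), dtype=np.uint8)
patches_48 = extract_patches(image, size=48, stride=48)
patches_32 = extract_patches(image, size=32, stride=32)
print(patches_48.shape, patches_32.shape)
```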


A second dataset, Set-5E, was created to validate the results. This database has 100 images from different subjects acquired with different smartphones, extracted from the CSIP database in the visual spectrum [35]. The CSIP database has 2,004 images, stemming from 50 subjects over 10 different mobile setups.

A third database, MobBIO, was super-resolved with the best pre-trained super-resolution model (ESISR). It was also used to measure the performance of the periocular verification system. The MobBIO dataset comprises the biometric data from 152 volunteers. Each subject provided samples of face, iris, and voice. There are on average 8 images for each subject, captured with a NOKIA N93i mobile. Some examples are presented in Fig. 6.

FIGURE 6.

MOBBIO database examples.


The last database is VISPI, captured by NTNU, which was used to measure the performance of the periocular verification system.⁶ The NTNU dataset comprises the biometric data from 152 volunteers and 3,139 images in total. Each subject provided samples of the left and right iris. There are on average 11 images for each subject from a NOKIA N93i mobile. Some examples are presented in Fig. 7.

FIGURE 7.

NTNU database examples.


SECTION V.

Results and Discussion

A. Super-Resolution Models

First, we establish a baseline by testing the DCSCN, WDSR-A, and SR-GAN models with their default parameters. Then, we analyse our proposal (ESISR-X) using pixel-shuffle and the new loss function including the Sharpness metric (see Eqs. 3 and 4). Table I summarises the results: rows 1 to 3 show the results for traditional SR methods (DCSCN with 12 layers and $96\times 96$ patches, WDSR-A with 8 residual blocks and $62\times 62$ patches, SR-GAN with 16 residual blocks and $96\times 96$ patches); and rows 4 to 6 present the results of our proposed method: ESISR-1 using the pixel-shuffle algorithm with only 7 convolutional layers and $48\times 48$ patches, ESISR-2 using the pixel-shuffle algorithm with only 7 convolutional layers and $32\times 32$ patches, and ESISR-3 using the transposed convolution algorithm with only 7 convolutional layers.

TABLE 1. Summary of the Results for 3 Different Scales (x2, x3, and x4) for Our System (ESISR) with Different Configurations and the Benchmark with DCSCN, WDSR-A, and SRGAN. True Sharpness Denotes the Sharpness for the Original Image (LR), and Output Sharpness the Sharpness for Reconstructed SR Images

Observing the results, we note that the x2, x3, and x4 enlargements extract the same number of features for each method (i.e., 1,301 for DCSCN and 1,000 for ESISR). The most considerable difference lies in the number of parameters of each method: while the DCSCN, WDSR-A, and SR-GAN methods need a large number of parameters (for images increased by x2, 1,754,942, 597,000, and 24,864,000; for images increased by x3, 2,170,142, 603,000, and 25,131,000; and for images increased by x4, 2,087,102, 610,000, and 26,939,000), these numbers are drastically reduced by our proposed ESISR-1 method, which needs only 27,209 parameters when the image is increased by x2, 28,654 parameters when increased by x3, and 64,201 parameters when increased by x4.⁷

In addition to that gain in efficiency, we may observe in Table I that the newly proposed loss function based on sharpness allows us to obtain a good reconstruction. The Output Sharpness for each scale value is similar to the values obtained by DCSCN (e.g. 16.85 vs. 16.70 for x2), and also close to the target True Sharpness of 17.04. Therefore, we may conclude that the proposed method keeps the sharpness quality of the images, thereby making it suitable for SR applications on mobile devices.

In addition to the baseline configuration of ESISR-1, we also evaluated two additional approaches. First, the most efficient implementation of ESISR in terms of features (reduced to 131) and number of parameters, using pixel-shuffle and $32\times 32$ px. patches, was analysed (Table I, row 5). Then, we also tested the method using transposed convolution with the same number of 131 features (Table I, row 6). The transposed convolution layer is an inverse convolution layer that both up-samples the input and learns how to fill in details during the model training process, at the cost of increasing the number of parameters (i.e., it is less efficient than pixel-shuffle). As we may observe in Table I, the pixel-shuffle with $32\times 32$ px. uses the same number of parameters as with $48\times 48$ px. In contrast, the transposed convolution requires 100,316 parameters when the image is increased by 2 (x2), 109,564 parameters when increased by 3 (x3), and 100,318 parameters when increased by 4 (x4). In spite of this increase, ESISR is still 10 to 20 times more efficient than the traditional DCSCN.
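The efficiency comparison can be checked directly from the parameter counts quoted in this section. The short sketch below computes how many times fewer parameters each ESISR variant uses than DCSCN at each scale; the variable and function names are illustrative only.

```python
# Parameter counts as quoted in the text for each scale factor.
dcscn = {"x2": 1_754_942, "x3": 2_170_142, "x4": 2_087_102}
esisr_pixel_shuffle = {"x2": 27_209, "x3": 28_654, "x4": 64_201}
esisr_transposed = {"x2": 100_316, "x3": 109_564, "x4": 100_318}

def reduction_factor(baseline, proposed):
    """How many times fewer parameters `proposed` uses than `baseline`."""
    return {scale: baseline[scale] / proposed[scale] for scale in baseline}

ps_factor = reduction_factor(dcscn, esisr_pixel_shuffle)
tc_factor = reduction_factor(dcscn, esisr_transposed)
for scale in dcscn:
    print(f"{scale}: pixel-shuffle {ps_factor[scale]:.1f}x, "
          f"transposed convolution {tc_factor[scale]:.1f}x fewer parameters than DCSCN")
```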

Regarding the quality of the SR iris images, we can observe that both configurations tested in this last experiment (rows 5 and 6) achieve a similar sharpness for the x3 and x4 scale values (14.43, 14.38, 15.46 and 16.32), but not for x2. In the latter case, the pixel-shuffle approach clearly outperforms the transposed-convolution method (15.43 vs. 14.38). The poorest reconstruction was obtained by the SRGAN method, which has the highest number of parameters and a notably different Output Sharpness value.⁸

B. Ablation Study

In order to show the effectiveness of the modified loss used to train E-SISR, an ablation analysis was carried out (see Table II). We measured the performance of E-SISR with different weight values for $w1$, $w2$ and $w3$ in the loss function, using 200 epochs and ESISR-X2. The results show the relevance of the weights in the final results compared with the case in which $w1$, $w2$ and $w3$ are all equal to 1. The values chosen were the best trade-off between the SSIM and PSNR values.

TABLE 2. Ablation Study Over the Loss Function Exploring w1,w2 and w3 and its Influences on SSIM and PSNR
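For reference, the PSNR figure reported in the ablation follows the standard definition, 10·log10(MAX² / MSE). The sketch below is a generic implementation assuming 8-bit images (peak value 255); it is not taken from the authors' code, and the function name `psnr` is illustrative.

```python
import numpy as np

def psnr(reference, test, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    ref = np.asarray(reference, dtype=np.float64)
    tst = np.asarray(test, dtype=np.float64)
    mse = np.mean((ref - tst) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Usage: compare a black image against a uniformly offset copy.
black = np.zeros((4, 4), dtype=np.uint8)
offset = np.full((4, 4), 16, dtype=np.uint8)
print(psnr(black, offset))
```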

C. Periocular SR Verification

We now evaluate the periocular verification systems including the SR methods analysed in the previous section. In order to assess the quality of the super-resolved images, the MobBIO and NTNU datasets were used to evaluate the reconstruction performance with the best SR method proposed in Sect. V-A, namely ESISR with pixel-shuffle.

Figs. 8 and 9 show the DET curves of the periocular verification system for the MobBIO and NTNU datasets with a standard resolution (Resolution x1) in comparison with SR images resized by x2, x3, and x4 using the ESISR method. The results cover VGGFace, FaceNet, ArcFace and three different BSIF filters, with the equal error rate reported for each one. An essential observation from Figs. 8 and 9 is that, in this case, SR methods help maintain the recognition accuracy when selfies are captured at different distances rather than improving the eye recognition performance.

FIGURE 8.

DET curves for MobBIO and NTNU datasets using the SR method (x1 and x2) and including periocular recognition systems based on deep learning and handcrafted features (BSIF). The EER is shown in parentheses for each technique.


FIGURE 9.

DET curves for MobBIO and NTNU datasets using the SR method (x3 and x4) and including periocular recognition systems based on deep learning and handcrafted features (BSIF). The EER is shown in parentheses.


For images with a standard resolution (Resolution x1), VGGFace obtained the best results on the MobBIO dataset, with an EER of 16.12%. For the NTNU dataset, the best results were obtained by FaceNet, with an EER of 8.89%.

For SR images at x2, x3 and x4, FaceNet outperforms VGGFace and ArcFace, obtaining EERs of 8.92%, 8.86% and 9.33%, respectively, on the NTNU dataset. Conversely, MobBIO yielded the poorest results: the x2, x3 and x4 SR images yielded EERs of 17.93%, 19.45% and 22.52%, respectively. Regarding BSIF, the three proposed filters reached a lower performance, with EERs over 20%. ArcFace obtained the worst results among the deep learning methods for both datasets.

It is essential to highlight that the three scales preserve the periocular verification quality thanks to the proposed perceptual sharpness loss. Thus, a weighted perceptual loss helps to keep the quality of the images based on the Sharpness metric. This metric is more suitable for applying SR to periocular iris images than the traditional SNR and SSIM.

Table 3 (top) shows the results for the MobBIO dataset, presenting SR images increased by factors of x2, x3, and x4 and their benchmark with the pre-trained FaceNet, VGGFace, and ArcFace models and BSIF filters as feature extractors. In addition, three resizing operations (inter-linear, inter-cubic, and inter-area) were explored, analysed and compared with the super-resolution technique. The results change only slightly with respect to the super-resolution process.⁹ We can observe that the features extracted with deep learning methods (embeddings) performed better than BSIF filters. Overall, FaceNet reached the best results at x2, x3 and x4 in comparison with VGGFace on the NTNU dataset. This result is interesting for high-security applications since operating points are usually defined at small FMR values.

TABLE 3. MobBIO (top) and NTNU (bottom) Verification Results with No Resizing (Resolution x1), Interarea, Intercubic, and Interlineal. Both EER and FNMR are Presented in %, and FNMR is Given at FMR = 10%

The results are related to the size of the embedded vector extracted from the pre-trained model. The features extracted from FaceNet are more representative and general-purpose than for VGGFace and ArcFace.

Table 3 (bottom) shows the results for the NTNU dataset, presenting SR images increased by factors of x2, x3, and x4 and their benchmark with the pre-trained FaceNet, VGGFace, and ArcFace models and BSIF filters as feature extractors. In addition, a comparison with traditional resize methods such as InterArea, InterCubic and InterLineal was performed. Column one shows the name of all techniques explored. Columns 4 to 6 show the results of the best BSIF filters selected. The results reported show the EER and the False Non-Match Rate (FNMR) at a False Match Rate (FMR) of 10%. Also, rows 7, 12 and 17 include the WDSR-A method, the second-best SR method from Table I, in order to compare it with E-SISR in a verification system. It is essential to highlight that WDSR-A has at least 20 times more parameters than our proposed method for x2. The EERs obtained for WDSR-A on face verification are, on average, 2% lower than those of our E-SISR. Both methods can reach competitive results and recover the verification performance when selfies are captured in in-the-wild conditions.
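The two metrics reported in Table 3 can be computed from lists of mated and non-mated comparison scores, as in the minimal sketch below. It assumes the scores are distances (a comparison is accepted when the score is at or below the threshold), which is not stated explicitly in the text; the function names are hypothetical, and the EER is approximated at the threshold where FMR and FNMR are closest rather than interpolated.

```python
import numpy as np

def fmr_fnmr(mated, non_mated):
    """FMR and FNMR at every candidate threshold (accept if score <= t)."""
    mated = np.asarray(mated, dtype=np.float64)
    non_mated = np.asarray(non_mated, dtype=np.float64)
    # -inf adds the operating point that rejects everything (FMR = 0, FNMR = 1).
    scores = np.unique(np.concatenate((mated, non_mated)))
    thresholds = np.concatenate(([-np.inf], scores))
    fmr = np.array([(non_mated <= t).mean() for t in thresholds])
    fnmr = np.array([(mated > t).mean() for t in thresholds])
    return fmr, fnmr

def equal_error_rate(mated, non_mated):
    """Approximate EER: average of FMR and FNMR where they are closest."""
    fmr, fnmr = fmr_fnmr(mated, non_mated)
    i = np.argmin(np.abs(fmr - fnmr))
    return float((fmr[i] + fnmr[i]) / 2.0)

def fnmr_at_fmr(mated, non_mated, target_fmr=0.10):
    """Lowest FNMR among thresholds whose FMR does not exceed the target."""
    fmr, fnmr = fmr_fnmr(mated, non_mated)
    return float(fnmr[fmr <= target_fmr].min())

# Usage with toy distance scores (illustrative values, not from the paper).
mated_scores = [0.1, 0.2, 0.3, 0.6]
non_mated_scores = [0.4, 0.5, 0.7, 0.8, 0.9]
print(equal_error_rate(mated_scores, non_mated_scores))
print(fnmr_at_fmr(mated_scores, non_mated_scores, target_fmr=0.10))
```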

SECTION VI.

Conclusion

In this paper, we have proposed an efficient and accurate image super-resolution method focused on the generation of enhanced eye images for periocular verification purposes using selfie images. To that end, we developed a two-stage approach based on a CNN with pixel-shuffle, a new loss function based on a sharpness metric (see Eq. 3), derived from the ISO/IEC 29794–6 standard for iris quality, and a selfie periocular verification proposal.

In the feature extraction stage of our method, the structure of the CNN model extracts optimised features, which are subsequently sent to the reconstruction network. In this latter network, we only used a recursive convolutional block with pixel-shuffle to obtain a better reconstruction performance with reduced computational requirements. In addition, the model is designed to be capable of processing original size images. Using these techniques, our model can achieve state-of-the-art performance with far fewer parameters (compared with the state-of-the-art DCSCN with 2 million parameters, we achieve a comparable quality with 27,000 parameters).

The perceptual loss function based on image sharpness that we propose allows us to keep the sharpness of iris images in the reconstructed images by x2, x3, and x4. This approach to improving the quality of the reconstruction and the SR in periocular recognition systems is well suited for implementation in mobile devices.

Regarding the periocular verification system, as expected, the deep learning methods yielded better results than handcrafted methods. FaceNet achieved the best results in comparison to VGGFace and ArcFace: an EER of 8.7% without SR, and 9.2% for x2, 8.9% for x3, and 9.5% for x4. Conversely, a slightly lower performance was reached when VGGFace was used: an EER of 10.05% without SR, and 9.94% for x2, 9.92% for x3, and 9.90% for x4.

Overall, there are marginal improvements for verification systems when only the size of the images is considered in combination with SR images. The information extracted as an embedding vector from the periocular area with a pre-trained model provides higher-quality data for verification than BSIF because of the huge number of filters used during the training process.

Uncontrolled conditions such as sunlight, occlusions, rotations, or the number of people in an image when a remote selfie is captured could be more challenging than the image size for RGB selfie images. The extension of this improvement to NIR iris images must be studied in a separate work. Those uncontrolled conditions need to be examined to improve selfie periocular verification systems.

In this research, SR helps maintain the recognition accuracy when selfies are captured at different distances, that is, in realistic scenarios in contrast to fully controlled conditions. Our system was tested on images acquired at three different distances and obtained similar results to a baseline system with a unique acquisition distance, even when the selfie was resized using SR with x2, x3 and x4.

In future work, we will continue to collect images to train a specific periocular verification system based on CNN from scratch and/or using transfer-domain techniques. Concerning the number of images, we believe that if we use state-of-the-art pre-trained models, the machine learning-based methods could be replaced by the CNN models. The selection of the pre-trained models should be taken into account.

Disclaimer

This text reflects only the author’s views, and the Commission is not liable for any use that may be made of the information contained therein.

VII.
Appendix A

Fig. 10 shows examples of the proposed SR method ESISR with super-resolution x2, x3 and x4, respectively.

FIGURE 10.

Example of MobBIO super-resolution images. Top: x2. Middle: x3. Bottom: x4. Increase the size of the images to see the effect of SR.


Fig. 11 shows the probability density functions of the comparisons between mated and non-mated feature vectors for the FaceNet, VGGFace, and ArcFace models. The VGGFace feature vector is more spread between 0.1 and 1.0 in contrast to the FaceNet and ArcFace vectors, which are concentrated between 0.1 and 0.4. All distributions show some overlap, which in turn leads to the non-perfect verification rates presented in the following.

FIGURE 11.

Mated and non-mated score distributions for FaceNet (top), VGGFace (middle) and ArcFace (bottom).


Figs. 12, 13, and 14 show the DET curves of the periocular verification system with a standard resolution (no redimensioning) for the MobBIO and NTNU datasets. A comparison with inter-area, cubic and linear resizing by x2, x3 and x4 is depicted. The results also show EERs for ArcFace, VGGFace, FaceNet and BSIF.

FIGURE 12.

DETs for the MobBIO and NTNU datasets including selfie recognition systems based on traditional inter-area resizing. The EER is shown in parentheses for each method.


FIGURE 13.

DETs for the MobBIO and NTNU datasets including selfie recognition systems based on traditional inter-cubic resizing. The EER is shown in parentheses for each method.


FIGURE 14.

DETs for the MobBIO and NTNU datasets including periocular recognition systems based on traditional inter-linear resizing. The EER is shown in parentheses for each method.

