Morphing Attack Detection-Database, Evaluation Platform, and Benchmarking

Abstract:

Morphing attacks have posed a severe threat to Face Recognition System (FRS). Despite the number of advancements reported in recent works, we note serious open issues such as independent benchmarking, generalizability challenges and considerations to age, gender, ethnicity that are inadequately addressed. Morphing Attack Detection (MAD) algorithms often are prone to generalization challenges as they are database dependent. The existing databases, mostly of semi-public nature, lack in diversity in terms of ethnicity, various morphing process and post-processing pipelines. Further, they do not reflect a realistic operational scenario for Automated Border Control (ABC) and do not provide a basis to test MAD on unseen data, in order to benchmark the robustness of algorithms. In this work, we present a new sequestered dataset for facilitating the advancements of MAD where the algorithms can be tested on unseen data in an effort to better generalize. The newly constructed dataset consists of facial images from 150 subjects from various ethnicities, age-groups and both genders. In order to challenge the existing MAD algorithms, the morphed images are with careful subject pre-selection created from the contributing images, and further post-processed to remove morphing artifacts. The images are also printed and scanned to remove all digital cues and to simulate a realistic challenge for MAD algorithms. Further, we present a new online evaluation platform to test algorithms on sequestered data. With the platform we can benchmark the morph detection performance and study the generalization ability. This work also presents a detailed analysis on various subsets of sequestered data and outlines open challenges for future directions in MAD research.

Published in: IEEE Transactions on Information Forensics and Security ( Volume: 16)

Page(s): 4336 - 4351

Date of Publication: 02 November 2020

DOI: 10.1109/TIFS.2020.3035252

CCBY - IEEE is not the copyright holder of this material. Please follow the instructions via https://creativecommons.org/licenses/by/4.0/ to obtain full-text articles and stipulations in the API documentation.

SECTION I.

Introduction

Morphing attacks pose threats to Face Recognition Systems (FRS) by exploiting the tolerance towards intra-subject variations. Such attacks constitute a vulnerability in various applications like identity management, identity-verified border crossing and visa management [1]. A morphing attack consists of generating a composite image of two closely resembling subjects (for instance, of similar age and the same ethnicity) and using the composite image to verify both subjects in an access control scenario. The composite image, hereafter referred to as the Morphed Image, should be of sufficient quality to obtain a score above the threshold recommended by an FRS in an automated face comparison system. It should also be of sufficiently high quality to fool a trained border guard when inspected manually [1].

The morphed image can, for instance, be obtained by a malicious actor colluding with a person who has no criminal record, in order to mask the identity of the malicious actor and obtain a new passport. Once a malicious actor is granted a valid identity document, he/she can use it for various purposes, posing a risk to national security in the worst possible scenarios. With such an assertion, the initial work demonstrating morphing attacks illustrated that commercial-off-the-shelf (COTS) FRS could be defeated with a given set of morphed images [1]. That study further assessed whether morphing attacks would succeed when presented to border guards. Morphing attacks therefore pose a threat to FRS and a major security risk to any nation the malicious actor enters.

Initial studies have investigated various aspects of morphing attacks starting from analysing the vulnerability of FRS in detail [2]–[5] to providing measures to detect and mitigate the attacks effectively [2], [6]–[12]. Further, a number of works have focused on studying various parameters influencing the decisions of morphing attack detection subsystems, while other works have focused on providing the set of metrics to gauge the strengths of Morphing Attack Detection (MAD) mechanisms. The works have also noted the vulnerability of FRS with respect to morphing attacks, when using the digital images and re-digitized images (digitally captured image which is printed and subsequently scanned/re-digitized). In pursuit of the current State Of The Art (SOTA) in MAD, we first review the related work in the next section.

SECTION II.

Related Work in Morphing Attacks on FRS and Databases

Broadly, morphing attacks fall into two types: (i) morphing attacks using digital images and (ii) morphing attacks using re-digitized images (a.k.a. printed-and-scanned images). The former domain is inspired by the practices of various countries which allow applicants to upload a digital representation of the face image for various applications, such as passport renewal in the UK [21] and visa application in New Zealand [22]. The latter is used in many countries where the passport/visa/identity-card applicant is requested to provide a printed image, such as in India [23] and in most European countries (e.g. in The Netherlands [24]), and this leaves the opportunity for a malicious actor to morph the facial image before it is printed. The image submitted by the applicant is thereafter re-digitized for digital processing and biometric enrolment. The earlier works have considered both scenarios and studied the impact of both types of attacks [1], [3]–[5]. In this section, we review the key aspects of earlier works in both domains. While the literature of recent years is extensive, we focus in this work on the most relevant works with new databases for MAD. The reader is further referred to Scherhag et al. [6] for a detailed survey of the literature.

A. Morphing Attacks Using Digital Images

The first work illustrating morphing attacks was reported in 2014 by Ferrara et al. [1], where a set of morphed images was created using the AR Face Database [25]. To study the vulnerability of FRS, 5 pairs of images of male subjects and 5 pairs of images of female subjects were morphed [1]. Further, to supplement the study, one morphed image constituted by one male and one female subject and another morphed image constituted by 3 male subjects were employed. The studies specifically investigated the vulnerability of two commercial FRS - Neurotechnology VeriLook SDK 5.4 [26] and Luxand SDK 4.0 [27]. The initial studies asserted the success of all morphed images in reaching a match against the probe images of both constituent subjects, thereby illustrating the vulnerability of face recognition systems. In the following work by Raghavendra et al. [2], the authors investigated the vulnerability on a larger set of grey scale images with 450 morphed samples from 110 different subjects on the Neurotechnology Verilook SDK [26]. In the same work, the authors also proposed a first detection approach suitable for morphed images that are processed only in the digital domain. Further, Scherhag et al. [4] conducted a similar analysis using both a commercial SDK and the OpenFace SDK - an open source face recognition SDK. In yet another work, Raghavendra et al. [3] employed a total of 431 morphed images to evaluate MAD mechanisms using deep neural networks. In a complementary work, Gomez-Barrero et al. [5] investigated the vulnerability of FRS to morphing attacks using 840 images from the Multimodal BioSecure Database [28] in the digital domain and also investigated the vulnerability of fingerprint and iris biometric systems against biometric attacks. As an alternative to morphing approaches, Raghavendra et al. [14] presented another concept of averaging facial images and demonstrated the vulnerability of FRS for morphed and averaged images in the digital domain. The vulnerability was reported again using the Neurotechnology Verilook SDK on a newly created database of 580 morphed images and 580 averaged images. In a different paradigm, Damer et al. [7] presented an approach of generating morphed images using Generative Adversarial Networks (GAN) on a set of 1500 images to create 1000 morphed images. The authors compared the results of MAD mechanisms against traditional Landmark Aligned (LMA) morphing approaches, and the vulnerability of the generated database was reported using two open source face SDKs based on the VGG Network [29] and OpenFace [30]. The database was used to devise MAD mechanisms on digital images alone in following works [8]–[12].

B. Morphing Attacks Using Print and Scanned Images

Motivated by the threat of morphed images to FRS, a number of works have also investigated morphing attacks using re-digitized images (printed and scanned). The key assertion behind these works is that the pixel-level information originally introduced by the morphing process is lost during the subsequent printing and scanning, which are performed with devices from various vendors, and that this loss decreases the MAD capability. Further, the printing and scanning processes introduce additional noise artifacts into the re-digitized morphed images [4], [14]–[16], [31].

The works on detecting re-digitized images employ the same techniques to generate morphs and then print-and-scan them. Raghavendra et al. [14] introduced a print and scanned database of 1423 morphed images using both morphing and averaging of pixels. The images were printed using a RICOH MPC 6003 SP on high-quality photo paper with $300~g/m^{2}$ density and scanned using a HP Photosmart 5520 scanner at 300 dpi for bona fide, morphed and averaged images. The work also illustrated that the vulnerability of COTS FRS to re-digitized images is equal to that for digital domain images, while the MAD performance dropped. The same work was further extended with a database of 2518 morphed images [16]. In a similar direction, Scherhag et al. [11] introduced a printed-scanned morphed face image database generated using the FRGCv2 face dataset. The morphed images, generated using three different morphing schemes (OpenCV/dlib, FaceFusion and FaceMorpher), were printed and then scanned at 300 dpi using the Epson DS-50000 scanner [11]. Ferrara et al. [15] also introduced a printed-scanned database for MAD, specifically to study the demorphing approach where the authors subtract the re-digitized images to detect a face morphing attack. The morphed images were printed and scanned at 600 dpi using a professional quality photoprinter [15].

C. Classification of MAD

While the aforementioned works have employed various databases, most of them have also reported corresponding MAD mechanisms to mitigate the threats to FRS. The algorithms for MAD can be classified in two classes:

Differential-image MAD (D-MAD): A suspected morph image is compared against an image captured in a trusted environment (e.g., ABC gate) to determine if the suspected image is morphed.
Single-image MAD (S-MAD): A suspected morph image is investigated (e.g. in a forensic process), in order to determine if the image itself is morphed without using any prior information or another reference image (captured under a trusted acquisition scenario).

We provide a brief review of the relevant algorithms reported in the recent works for both S-MAD and D-MAD.

1) Differential-Image MAD:

The general principle behind the D-MAD algorithms is that, given a suspected morphed image $I_{s}$ and a reference image $I_{t}$ captured in a trusted environment, the difference between $I_{s}$ and $I_{t}$ is computed. The lower the difference, either in the image space or feature space, the larger the probability that the suspected image is accepted as non-morphed (or bona fide image). The first approach to D-MAD was based on inverting the morphing process in a reverse engineered manner, which was termed Demorphing [15]. In a similar manner, a number of works have been reported where the difference of feature vectors from the bona fide image and from the morph image is used to determine if the suspected image is morphed [19], [32]. The deep features from two different networks are employed to determine the difference in features in [19], and features from the 3D shape and the diffuse reflectance component estimated directly from the image were employed to detect a morphing attack in [32]. Another set of works explored the shift in landmarks between bona fide and suspected morph images in the face region to determine the morphing attack [10], [11]. For the sake of simplicity, a generic illustration of the D-MAD working principle is presented in Figure 1.

Fig. 1.

An illustration of the D-MAD pipeline.
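The D-MAD principle described above can be sketched as a distance test between two feature vectors. This is only a minimal illustration, not any of the cited algorithms: the feature extractor is assumed to exist elsewhere, the choice of cosine distance is an assumption, and the names `cosine_distance`, `dmad_decision` and the default `threshold` value are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("feature vectors must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


def dmad_decision(features_suspected, features_trusted, threshold=0.3):
    """Compare features of I_s (suspected) and I_t (trusted capture).

    A small distance supports the bona fide hypothesis; a large distance
    flags the suspected image as a possible morph. The threshold is a
    hypothetical value and would have to be tuned on development data.
    """
    distance = cosine_distance(features_suspected, features_trusted)
    return {"distance": distance, "is_morph": distance > threshold}
```

In practice the two vectors would come from the same feature extractor applied to the enrolment image and to the trusted live capture, and the threshold would be set from the desired operating point.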

2) Single-Image MAD:

S-MAD algorithms largely rely on learning a classifier to distinguish the bona fide image from a morphed image. Given a suspected morph image $I_{s}$, the texture information is extracted from the normalized and aligned face. In the earlier works, texture features such as Binarized Statistical Image Features (BSIF) and Local Binary Patterns (LBP) are used to classify the images using a pre-trained SVM classifier [4], [14], [16]. In a very similar direction, the LBP features were also explored in [11], [33]. Extending the works for MAD, another approach was proposed to exploit the colour spaces and the scale spaces jointly [16], [34]. With the intent to also address post-processed morphed images, pre-trained deep networks for extraction of texture features were employed to detect morphing attacks not only in the digital domain, but also in the re-digitized domain (print-scan) [3]. Notably, the earlier works have employed two deep neural networks, VGG19 [29] and AlexNet [35], and perform feature level fusion of the first fully connected layers from both networks [3]. In a continued effort, other deep networks have been investigated for detecting morph attacks [17]. Another approach to detecting morphing attacks was proposed by extracting the features from the "Photo Response Non-Uniformity", where the characteristics of the image sensor were employed to determine if the image was morphed or not [12]. Motivated by the effectiveness of the noise modelling, better performing algorithms have been reported where the color space has been investigated to look for residuals of the morphing process [36], including dedicated context aggregation networks to automatically model the noise [37].
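To make the texture-feature step more concrete, the following is a minimal sketch of a basic 8-neighbour LBP histogram computed on a grayscale image given as a list of rows. It illustrates only the generic LBP descriptor, not the exact variants, parameters or pre-processing used in the cited works; the function names are hypothetical, and the classifier (e.g. an SVM trained on such histograms) is left out.

```python
def lbp_code(image, row, col):
    """Basic 8-neighbour LBP code for the interior pixel at (row, col)."""
    center = image[row][col]
    # Neighbours are visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (d_row, d_col) in enumerate(offsets):
        # A neighbour at least as bright as the centre sets its bit.
        if image[row + d_row][col + d_col] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """Normalised 256-bin histogram of LBP codes over all interior pixels."""
    height, width = len(image), len(image[0])
    if height < 3 or width < 3:
        raise ValueError("image must be at least 3x3 pixels")
    histogram = [0] * 256
    for row in range(1, height - 1):
        for col in range(1, width - 1):
            histogram[lbp_code(image, row, col)] += 1
    total = (height - 2) * (width - 2)
    return [count / total for count in histogram]
```

The resulting 256-dimensional vector would serve as the input to a separately trained binary classifier that outputs a bona fide or morph decision.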

D. Limitations

As noted from the set of works listed in the previous section and Table I, there is a need for standardized and reproducible testing of MAD mechanisms. The limitations can be further divided in four main categories:

Need for cross-dataset evaluation: As different works have used in-house datasets generated using different approaches, the proposed methods are only evaluated on limited sets. Although the proposed MAD approaches perform very well on the in-house datasets, no works have studied generalizable detection performance, except for recent works [33], [37] that attempt cross-dataset evaluation. Existing studies therefore lack a validation of the proposed SOTA approaches in terms of generalizable detection performance, and they give little indication of directions for future work. To address this aspect, it is necessary to avoid the classical over-fitting problem for MAD mechanisms.
Need for sequestered database: Further, to support the reporting of generalizable detection performance in studies, there is a need for sequestered data for testing the robustness of the MAD algorithms. Thus, the need for a sequestered dataset, to which researchers do not have access for training purposes, is obvious. Sequestered data should solely be used for reproducible testing. Such tests on unknown data will establish a reliable benchmark of algorithms and will indicate whether said algorithms are robust enough to handle various factors unknown to researchers.
Need for independent evaluation: As a third factor, MAD algorithms are often tuned to perform well on known datasets owing to the nature of in-house datasets. Despite the datasets being divided into training, testing and validation sets, researchers have full access to inspect individual cases and can thereby improve the detection performance of their own MAD algorithms iteratively. While this enables continuous development and improvement of algorithms, morphing attacks in a real-life border crossing scenario can be compared to biometrics in the wild, where neither the morphing generation algorithms, nor the post-processing approaches, nor the printing and scanning mechanisms can be fully controlled. For the algorithms to be ready for operational deployment, there is a need for independent testing using morphed images which are unknown to the developers.
Need for evaluation platform: While independent testing is desired, there are not many organizations hosting such platforms, which limits the ability of researchers to devise robust algorithms. Although a similar evaluation effort is carried out by NIST [38], the NIST FRVT MORPH dataset, especially the subset containing post-processed print-scan and operational ABC gate images, is currently limited in size. Therefore, an independent evaluation platform that runs continuously is needed to facilitate algorithmic evaluation and to benchmark the detection performance against other competing algorithms, along the lines of earlier evaluation platforms from the University of Bologna, which has provided a long-standing fingerprint evaluation system [39], [40].

TABLE I State of the Art in Morphing Attack Databases and Vulnerability Reporting (* Indicates Vulnerability Demonstrated Using COTS FRS)

E. Contributions of This Work

In order to address these four key limitations, in this work we provide three major contributions followed by the benchmarking of SOTA MAD mechanisms.

A large scale sequestered database of morphed and bona fide images, collected at three different sites and amounting to 1800 photographs of 150 subjects, is released along with this article. The database covers various age groups, an equal representation of genders and varied ethnicity, making it a unique database for MAD algorithm evaluation. The morphing of images was conducted with 6 different morphing algorithms, presenting a wide variety of possible approaches. The database contains 5,748 morphed face images, with subsets consisting of: (1) morphed images without post-processing to remove digital artifacts, (2) morphed and post-processed images, where artifacts induced while morphing are removed to produce passport quality ICAO photos [41], (3) printed and scanned versions of ICAO standard passport images using different combinations of printers and scanners, including the scanners used in federal ID management offices in Europe. The database is accessible through the FVC-onGoing platform [40] to allow third parties to evaluate and benchmark their algorithms.
An unbiased and independent evaluation of 5 state of the art MAD algorithms against 5,748 morphed face images and 1,396 bona fide face images. A total of 500,200 attempts with bona fide (69,800) and morphed (430,400) face images are evaluated to report the detection performance of current SOTA MAD mechanisms.
A new and independent evaluation platform is further presented to facilitate reproducible research where any researcher, governmental agency or private entity can upload SDKs and measure the performance of their MAD algorithm. The platform provides the benchmarking of the MAD performance against all previously submitted algorithms and specifically provides the results for different subsets corresponding to age, gender or ethnicity. Such detailed analysis will enable the researchers to identify the performance limitations of MAD mechanisms and facilitate them to develop more robust algorithms.

In the remainder of this article, Section III presents the newly composed database and describes the details of the entire dataset. The new independent evaluation platform is introduced in Section IV. In Section V, we present the set of SOTA algorithms that are evaluated on the sequestered dataset. A detailed discussion of results and the analysis of MAD performance is reported in Section VI. Finally, in Section VII we draw the conclusions and list current limitations, with the intention of facilitating the development of future algorithms.

SECTION III.

SOTAMD Database

As noted in the earlier works, the existing MAD efforts by research institutions are largely based on internally created databases, which often are limited in size, diversity of image capture devices, image quality, realistic post-processing, and variability of morphing algorithms. We note that the current practice of using different databases, image acquisition set-ups and testing protocols makes it challenging to benchmark MAD algorithms, and thereby makes it next to impossible for an operator to judge the applicability of current MAD for operational deployment. In order to overcome these limitations and provide a new dataset for benchmarking (both for S-MAD and D-MAD algorithms) under realistic conditions with high quality images, we created a new dataset, which we refer to as the State of the Art Morphing Detection (SOTAMD) dataset. The dataset consists of:

Enrolment images: bona fide face images taken in a capture set-up which meets the requirements of passport application photo capture (e.g., photographer studio).
Gate images: bona fide face images captured live with a face capture system in an Automated Border Control (ABC) gate.
Chip images: compressed face images stored on an electronic Machine Readable Travel Document (e-MRTD).
Morphed face images: morphed images created from the pool of passport face images. The database contains different kinds of morphed images as listed below:
1. Digital morphed images: images obtained directly after morphing in the digital domain.
2. Digital post-processed morphed images: morphed images that are processed (automatically or manually) in the digital domain, to eliminate or hide the artifacts resulting from a morphing process.
3. Print-scanned morphed images: post-processed morphed images that are printed and scanned to simulate the passport application process.

The dataset was created as a joint effort in an EU funded project - State-Of-The-Art-Morphing-Detection (SOTAMD). A number of factors were considered in its creation, which are explained in the subsequent sections.

The number of images in the database and their sizes are given in Table II and Table III, respectively. The bona fide enrolment images have been cropped to remove the background and resized in order to follow the same inter-eye distance distribution as the morphed images, so that it is not possible to infer the image class from its size. The details of the various subsets of data, along with the details on morphing methods, print-scan pipeline, and compression, are provided in Table XIII and Table XIV of the Supplementary Material. The images from the database are used to test both S-MAD and D-MAD algorithms according to the testing protocols defined in Section IV-B.
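The resizing of bona fide images to a target inter-eye distance can be sketched as a simple scale computation. The text does not specify how the target distances were drawn or how eye positions were located, so the eye coordinates and `target_distance` below are assumed inputs (e.g. from a landmark detector and from the empirical distribution of the morphed images), and the function names are hypothetical.

```python
import math


def inter_eye_distance(left_eye, right_eye):
    """Euclidean distance in pixels between two (x, y) eye centres."""
    return math.hypot(right_eye[0] - left_eye[0], right_eye[1] - left_eye[1])


def resized_dimensions(width, height, left_eye, right_eye, target_distance):
    """Return (new_width, new_height, scale) giving the target inter-eye distance.

    The aspect ratio is preserved; only a uniform scale is applied.
    """
    current = inter_eye_distance(left_eye, right_eye)
    if current == 0.0:
        raise ValueError("eye centres must not coincide")
    scale = target_distance / current
    return round(width * scale), round(height * scale), scale
```

The returned dimensions could then be passed to any image resizing routine, so that image size alone no longer separates bona fide from morphed samples.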

TABLE II Number of Images in the Database

TABLE III Minimum and Maximum Image Size

A. Subject Pre-Selection

An important aspect of creating a successful morph attack is subject selection, such that closely resembling pairs of faces are chosen [4]. Following the guidelines of earlier works, the SOTAMD database was created by selecting the morph pairing candidates with high similarity with careful considerations to age, gender and ethnicity. As an additional measure, the selected morph pairing candidates were also validated by observing the comparison scores from two specific commercial-off-the-shelf (COTS) FRS - Neurotechnology Verilook SDK [26] and Cognitec FaceVacs SDK [50]. All the morphed images that did not verify against probe images from both contributing subjects were classified as low quality morph set in the final database. This labeling makes the SOTAMD database highly relevant to investigate low quality and high quality morph detection capability. Such elimination and careful selection has led to 75 unique pairs of candidates for morphing from a total of 150 individuals of various ethnicity and age group. The subjects were selected amongst university staff and student corpus, and a casting agency website. Table IV presents the gender, age and ethnicity demographics of the selected subjects for the final SOTAMD database.
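The quality labelling rule described above can be sketched as follows: a morph counts as high quality only if it verifies against probe images of both contributing subjects. The sketch assumes similarity scores where higher means more similar and a single decision threshold per FRS; the function name, label strings and threshold are hypothetical, and how the decisions of the two COTS FRS were combined is not specified in the text.

```python
def label_morph_quality(score_subject_a, score_subject_b, threshold):
    """Label a morph from its comparison scores against both contributors.

    score_subject_a / score_subject_b: similarity scores of the morphed image
    against a probe image of each contributing subject (higher = more similar).
    threshold: verification threshold of the FRS (hypothetical value).
    """
    verified_a = score_subject_a >= threshold
    verified_b = score_subject_b >= threshold
    return "high_quality" if (verified_a and verified_b) else "low_quality"
```

For example, a morph that matches only one of its two contributors would be placed in the low quality set under this rule.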

TABLE IV Demographics of the SOTAMD Database

B. Bona Fide Enrolment Images

For each of the 150 subjects in the SOTAMD database, two enrolment images were captured in high quality studio acquisition set-up reflecting the real-life passport photo capture process. Further, the enrolment images are also printed and scanned to have both digital and correspondingly printed and scanned subsets. The print and scan processes are conducted using various printers and scanners to increase the diversity of the dataset.

Given that this work reflects an operational border control scenario, we have exercised care to make sure the images are ICAO compliant [41]. Thus, each of the images in the enrolment set was processed with professional software to comply with ICAO standards for eMRTD images. The processed images were further used for printing and scanning to closely follow the actual production scenario of passports based on the regulations in the Netherlands and Germany under EU member state regulations.

The number of bona fide enrolment images in the new SOTAMD database is 300 in digital format, and 1096 printed and scanned.

C. Morphed Enrolment Images

To simulate the criminal attack, we generated a number of morphed images to be used for enrolment, i.e. to be hypothetically presented during the passport application process. The morphed images have been created starting from the bona fide enrolment images (one for each subject).

Unlike the previous works noted in Table I, the newly created morphed set in the SOTAMD database has a wide variation of employed morphing processes. Specifically, the morphing set consists of an unprocessed image set and a fully-processed image set. To make the dataset more challenging and to simulate realistic data, the post-processed images are printed and scanned using different pipelines. To further increase the diversity, each image pair was morphed using contribution factors (referred to as the alpha factor) of 0.3 and 0.5 for each of the two contributing faces. Examples of two morphed face images are shown in Figure 2.

Fig. 2.

Impact of morphing factors ( $\alpha$ ) on morphing.
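The role of the alpha factor can be illustrated with a pixel-wise blend of two already aligned grayscale images. This is a deliberately simplified sketch: real morphing pipelines also warp both faces to common landmarks before blending, which is omitted here, and the convention that `alpha` weights the second image is an assumption. The function name is hypothetical.

```python
def blend_pixels(image_a, image_b, alpha):
    """Pixel-wise blend (1 - alpha) * image_a + alpha * image_b.

    image_a, image_b: aligned grayscale images as equally sized lists of rows.
    alpha: contribution of image_b, between 0 and 1.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have identical dimensions")
    return [
        [(1.0 - alpha) * a + alpha * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
```

With `alpha = 0.5` both subjects contribute equally, while `alpha = 0.3` leaves the result closer to the first image.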

Furthermore, the processed images are resized using the OpenCV library [51] to maintain the same inter-eye distance distribution as observed in the morphed images, to avoid any possibility of inferring the image class from its dimensions. Post-processing consists of automatic and/or manual methods to conceal visible, and sometimes easy to detect, morphing traces. Due to such variation in algorithms, any MAD algorithm that can achieve significant detection accuracy on the SOTAMD dataset can be deemed robust. Examples of an automatically and manually post-processed digital morphed face image (left), and the same image after printing and scanning (right), are shown in Figure 3.

Fig. 3.

Illustration of post-processing - Careful processing to remove the artifacts can be noted in the eyelids, iris and nostril regions to eliminate the traces of the morphing process. Refer to Figure 4 for a detailed illustration.

Examples of a morphed face image, before (left) and after (right) manual post-processing are shown in Figure 4. Morphed face images that were both automatically and manually post-processed compose the most challenging subset. All the enrolment face images (bona fide and morphed) were processed with ICAO compliance [41] testing software before entering into the database. An overview of the basic subsets of morphed face images is shown in Table V.

TABLE V Total Number of Images With Morphing and Manual Post-Processing

Fig. 4.

Morphed face image before and after manual Post-processing from Figure 3. Only the central part of the face is reported to better appreciate the effect of artifact removal. Careful processing to remove the artifacts can be noted in the eyelids, iris and nostril regions to eliminate the traces of morphing process.

A detailed account of the morphing methods that were contributed by each partner can be seen in Table VI which provides the various approaches used for automated and manual post-processing pipelines.

TABLE VI Contributed Morphing Methods, Manual Post-Processing Methods and Automated Post-Processing Methods

A subset of the generated morphed images has been printed and scanned using multiple pipelines (in analogy with the bona fide enrolment images); the number of morphed images in the database is therefore 2045 in digital format and 3703 printed and scanned.

D. Gate Images

The SOTAMD database contains 10 gate images captured from each subject (overall 1500 images) during a single acquisition session at different locations under a simulated ABC gate operational scenario.¹

As an additional measure, the quality of the images captured in the emulated ABC set-up was validated by reading the corresponding eMRTD chip images and verifying them against the captured gate image using COTS FRS.

The gate images of 100 subjects were captured at two different partner facilities (Norwegian University of Science and Technology - referred to as NTN and Hochschule Darmstadt - referred to as HDA) using real ABC gates from two different vendors. These probe images, generated with the capture devices of two different vendors, represent images that are used in real operational settings. Another set of gate images, from 50 subjects (from University of Twente - referred to as UTW), was captured with a custom-built mock ABC gate. Thus, given three different ABC gate set-ups, the probe set provides the variation needed for benchmarking different MAD algorithms, which demands that the algorithms be agnostic to the capture set-up and robust. Examples of probe images captured with the different set-ups are illustrated in Figure 5.
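The image counts quoted in Sections I-E and III-B to III-D can be cross-checked with a short tally. All numbers below are taken from the text; the reading that the 1800 photographs are the digital enrolment images plus the gate images is an assumption that happens to reproduce the stated total, and the function and key names are hypothetical.

```python
SUBJECTS = 150
ENROLMENT_IMAGES_PER_SUBJECT = 2   # Section III-B
GATE_IMAGES_PER_SUBJECT = 10       # Section III-D


def tally_sotamd_counts():
    """Recompute the dataset totals from the per-subset counts in the text."""
    bona_fide_digital = SUBJECTS * ENROLMENT_IMAGES_PER_SUBJECT
    bona_fide_print_scan = 1096
    morphs_digital = 2045
    morphs_print_scan = 3703
    gate_images = SUBJECTS * GATE_IMAGES_PER_SUBJECT
    return {
        "bona_fide_digital": bona_fide_digital,
        "bona_fide_enrolment": bona_fide_digital + bona_fide_print_scan,
        "morphed": morphs_digital + morphs_print_scan,
        "gate": gate_images,
        # Assumed interpretation of the "1800 photographs" figure.
        "photographs": bona_fide_digital + gate_images,
        "gate_subjects": 100 + 50,  # NTN and HDA combined, plus UTW
    }
```

Under these assumptions the tally agrees with the 1,396 bona fide and 5,748 morphed images reported for the benchmark.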

Fig. 5.

Examples of probe face images captured from different ABC set-ups.

SECTION IV.

Evaluation Platform

We further present a new independent evaluation framework to measure the robustness of MAD. The MAD benchmarks have been realized following the testing framework of FVC-onGoing [39], [40]. A web-based automated evaluation platform has been designed to track the advances in MAD, through continuously updated independent testing and reporting of performances on given benchmarks. FVC-onGoing benchmarks are grouped into benchmark areas according to the (sub)problem addressed and the evaluation protocol adopted (e.g. Fingerprint Verification, Palmprint Verification, Face Image ISO Compliance Verification, etc.). To maximize trustworthiness of the results, tests are carried out using a strongly supervised approach on a collection of sequestered datasets and results are reported on-line by using well known performance indicators and metrics. We follow the same design principles to evaluate the MAD algorithms in this work.

The evaluation process, illustrated in Figure 6, is fully automated and consists of participant registration, algorithm submission, performance evaluation, and results visualization. To protect sensitive information (biometric data) and to prevent external attacks, the FVC-onGoing framework is composed of two different modules physically located on two separate servers:

The Front-End server containing the web site and the algorithm repository.
The Test Engine server containing the test engine and the benchmark datasets.

Fig. 6.

The figure shows the architecture of the FVC-onGoing evaluation framework and an example of a typical workflow: a given participant, after registering to the Web Site (1), submits some algorithms (2) to one or more of the available benchmarks; the algorithms (binary executable programs compliant to a given protocol) are stored in a specific repository (3). Each algorithm is evaluated by the Test Engine that, after some preliminary checks (4), executes it on the dataset of the corresponding benchmark (5) and processes its outputs (e.g. comparison scores) to generate (6) all the results (e.g. EER, score graphs), which are finally published (7) on the Web Site.


A firewall protects the Test Engine server by blocking all inbound and outbound connections on public and private networks. Only a few authorized users can access the Test Engine server from a specific terminal using a protected local connection. Moreover, to avoid undesirable behaviour of the submitted algorithms, all of them are first analysed by antivirus software and then executed in a strongly controlled environment with minimal permissions.

Algorithms can be provided in the form of i) a Win32 console application or ii) a Linux dynamically-linked library compliant with the NIST FRVT MORPH specifications [38].

Two different benchmark areas (D-MAD and S-MAD) have been created to evaluate the accuracy of MAD algorithms in the differential- and single-image scenarios. Table VII provides detailed information on the benchmarks contained in the two benchmark areas. Algorithms submitted to these benchmarks must comply with specific protocols, whose details are given on the FVC-onGoing web site [40].

TABLE VII D-MAD and S-MAD Benchmarks

A. Detection Performance Evaluation

The evaluation platform is designed to report a number of performance metrics for MAD algorithms as detailed in this section. For each experiment bona fide and morphed face images are used to compute the Bona fide Presentation Classification Error Rate (BPCER) and the Attack Presentation Classification Error Rate (APCER). As defined in [52] the BPCER is the proportion of bona fide presentations falsely classified as morphing presentation attacks while the APCER is the proportion of morphing attack presentations falsely classified as bona fide presentations. The following performance indicators are reported:

EER (detection Equal-Error-Rate): the error rate for which BPCER and APCER are identical
BPCER₁₀: the lowest BPCER for APCER≤10%
BPCER₂₀: the lowest BPCER for APCER≤5%
BPCER₁₀₀: the lowest BPCER for APCER≤1%
REJ_NBFRA: Number of bona fide face images that cannot be processed
REJ_NMRA: Number of morphed face images that cannot be processed
Bona fide and Morph detection score distributions
APCER($t$)/BPCER($t$) curves, where $t$ is the detection threshold
DET($t$) curve (the plot of BPCER against APCER)
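The indicators listed above can be illustrated with a minimal sketch. It assumes that each algorithm outputs a morphing likelihood score and that a score greater than or equal to the threshold $t$ is classified as a morphing attack; the function names and the example scores are hypothetical, and rejected images (REJ_NBFRA, REJ_NMRA) are not handled. The EER is approximated at the operating point where APCER and BPCER are closest.

```python
from typing import List, Sequence, Tuple


def error_rates(bona_fide: Sequence[float], morphs: Sequence[float], t: float) -> Tuple[float, float]:
    """Return (APCER, BPCER) at threshold t.

    Assumption: scores are morphing likelihoods, and a score >= t is
    classified as a morphing attack.
    """
    apcer = sum(1 for s in morphs if s < t) / len(morphs)         # attacks accepted as bona fide
    bpcer = sum(1 for s in bona_fide if s >= t) / len(bona_fide)  # bona fide flagged as attacks
    return apcer, bpcer


def operating_points(bona_fide: Sequence[float], morphs: Sequence[float]) -> List[Tuple[float, float]]:
    """(APCER, BPCER) at every distinct score, plus one threshold above all scores."""
    scores = sorted(set(bona_fide) | set(morphs))
    thresholds = scores + [scores[-1] + 1.0]
    return [error_rates(bona_fide, morphs, t) for t in thresholds]


def bpcer_at(bona_fide: Sequence[float], morphs: Sequence[float], max_apcer: float) -> float:
    """Lowest BPCER among the operating points with APCER <= max_apcer."""
    return min(b for a, b in operating_points(bona_fide, morphs) if a <= max_apcer)


def approximate_eer(bona_fide: Sequence[float], morphs: Sequence[float]) -> float:
    """Mean of APCER and BPCER at the operating point where they are closest."""
    a, b = min(operating_points(bona_fide, morphs), key=lambda p: abs(p[0] - p[1]))
    return (a + b) / 2.0


# Made-up scores, for illustration only.
bona_fide_scores = [0.1, 0.2, 0.3, 0.4, 0.6]
morph_scores = [0.35, 0.5, 0.7, 0.8, 0.9]

print("EER      :", approximate_eer(bona_fide_scores, morph_scores))
print("BPCER_10 :", bpcer_at(bona_fide_scores, morph_scores, 0.10))
print("BPCER_20 :", bpcer_at(bona_fide_scores, morph_scores, 0.05))
print("BPCER_100:", bpcer_at(bona_fide_scores, morph_scores, 0.01))
```

With only a handful of scores the three BPCER indicators coincide; on a benchmark with thousands of attempts they separate as the APCER constraint tightens.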

B. Protocols for Evaluation

In order to benchmark the MAD algorithms, we defined two specific protocols for D-MAD and S-MAD respectively:

D-MAD: in this case, the algorithms receive as input a pair of images (an enrolment image and a gate image) and are requested to estimate the probability that the enrolment image is morphed, based on a differential analysis of the two input images. The enrolment images available in the database are thus compared against the gate images (i.e. trusted live capture) according to the following protocol:
- Bona fide images: the bona fide enrolment image is compared against the gate images of the same subject;
- Morphed images (factor 0.3): the morphed enrolment image is compared against the gate images of the subject who contributed least in the morphing (the hidden identity);
- Morphed images (factor 0.5): the morphed enrolment image is compared against the gate images of both contributing subjects.
S-MAD: in this case, the algorithms receive as input a single image and are requested to estimate the probability that the image is morphed (i.e. to report a morphing likelihood score). To this aim, the probe set consists of the whole set of available enrolment images (bona fide and morphed).

The resulting number of attempts for the two benchmarks is provided in Table VII.
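The D-MAD pairing rules described above can be sketched as follows. This is only an illustration of the protocol logic: the `Enrolment` record, its field names and the image identifiers are hypothetical and do not reflect the data format used by the platform.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Enrolment:
    image_id: str
    contributors: Tuple[str, ...]            # 1 subject = bona fide, 2 subjects = morph
    least_contributor: Optional[str] = None  # set only for factor-0.3 morphs


def dmad_attempts(enrolments: List[Enrolment],
                  gate_images: Dict[str, List[str]]) -> List[Tuple[str, str, bool]]:
    """Return (enrolment image, gate image, is_morph) comparison attempts."""
    attempts = []
    for e in enrolments:
        is_morph = len(e.contributors) > 1
        if not is_morph:
            probes = e.contributors            # bona fide: same subject
        elif e.least_contributor is not None:
            probes = (e.least_contributor,)    # factor 0.3: hidden identity only
        else:
            probes = e.contributors            # factor 0.5: both contributing subjects
        for subject in probes:
            for gate in gate_images.get(subject, []):
                attempts.append((e.image_id, gate, is_morph))
    return attempts


# Hypothetical toy data: subject A has two gate images, subject B has one.
gates = {"A": ["A_gate_1", "A_gate_2"], "B": ["B_gate_1"]}
enrolled = [
    Enrolment("bona_fide_A", ("A",)),
    Enrolment("morph_AB_03", ("A", "B"), least_contributor="B"),
    Enrolment("morph_AB_05", ("A", "B")),
]
for attempt in dmad_attempts(enrolled, gates):
    print(attempt)
```

In the S-MAD case no pairing is needed, since every enrolment image is a single attempt.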

SECTION V.

MAD Algorithms

A number of existing state of the art MAD algorithms are evaluated on the newly created SOTAMD database using the new evaluation platform. Within the scope of this work, both D-MAD and S-MAD algorithms have been submitted to the corresponding FVC-onGoing benchmarks. In this section, we provide a brief description of the algorithms that were tested on the newly developed database and the evaluation platform.

A. D-MAD

A D-MAD algorithm uses additional information from a second image known to be bona fide (e.g. a live image captured in an ABC gate) to detect morphed face images. D-MAD algorithms measure the differences between the two images using either features (textural or deep features) or landmark shifts. We present the set of D-MAD algorithms evaluated on the SOTAMD database in the subsequent sections.

1) BSIF:

It is based on a set of texture features obtained using the Binarized Statistical Image Features (BSIF) with an 8-bit filter of size $3\times 3$, applied on the normalized and aligned image [53]. Given the histogram feature vectors $h_{s}$ and $h_{t}$, each of dimension $1\times 4096$, their difference is presented to an SVM classifier pre-trained on the bona fide and morphed data from FERET [54] and FRGC [55] images. The approach also considers a number of post-processing steps, such as median filtering, histogram normalization and sharpness processing, applied to the images before training the SVM classifier for morphs generated from FaceMorpher and OpenCV.

2) DFR:

It utilizes the embeddings (feature vectors) of the ArcFace algorithm [56], a ResNet-based face recognition system. The fundamental idea is to use the feature vectors of the face recognition neural network to train an SVM. Since the neural network does not encounter morphed facial images during training, the feature extraction cannot overfit to the artifacts of particular morphing algorithms, which in turn leads to a higher robustness of the resulting MAD algorithm. The ArcFace feature vector has a length of 512 features. The feature vectors of the e-gate live capture and the suspected morph image are subtracted, and the resulting difference is used to train an SVM with an RBF kernel. The algorithm evaluated in this paper was trained on the bona fide and morphed data from FERET [54] and FRGC [55]. Details of the DFR MAD algorithm can be found in [19].
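As a rough illustration of the feature step, the sketch below computes the difference of two embeddings and the RBF kernel value that an SVM would evaluate between two such difference vectors. The subtraction order, the value of `gamma` and the toy 4-dimensional vectors (standing in for 512-dimensional embeddings) are assumptions; extracting the embeddings and training the SVM are outside the scope of this sketch.

```python
import math
from typing import List, Sequence


def difference_vector(live: Sequence[float], suspect: Sequence[float]) -> List[float]:
    """Element-wise difference of two embeddings (order live - suspect is an assumption)."""
    if len(live) != len(suspect):
        raise ValueError("embeddings must have the same length")
    return [a - b for a, b in zip(live, suspect)]


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """RBF kernel exp(-gamma * ||x - y||^2) between two difference vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


# Toy 4-dimensional vectors standing in for 512-dimensional embeddings.
live_capture = [0.10, 0.40, -0.20, 0.30]
suspected_image = [0.05, 0.45, -0.10, 0.30]
training_example = [0.00, 0.00, 0.00, 0.00]   # hypothetical stored difference vector

diff = difference_vector(live_capture, suspected_image)
print(diff)
print(rbf_kernel(diff, training_example, gamma=0.5))
```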

3) MBLBP:

It consists of pre-processing and calculation of multi-block LBP from both the suspected image $I_{s}$ and the trusted image $I_{t}$, followed by classification as a bona fide or morphed image using a pre-trained SVM classifier [53]. In the pre-processing step, the Dlib landmark detector is used to detect the facial area and the landmarks within the face, and the face is realigned and normalized to achieve ICAO compliance [41]. The normalised face image is then cropped to a $320\times 320$ pixel region, from which the LBP information is extracted using $4\times 4$ equally sized blocks of the image. Within each block, a window size of $5\times 5$ pixels is employed to obtain the histograms. Given the histograms $h_{s}$ and $h_{t}$ for $I_{s}$ and $I_{t}$ respectively, the difference of $h_{s}$ and $h_{t}$ is computed and given to the SVM classifier to obtain a final decision on whether the suspected image is morphed or bona fide. Details on the MBLBP algorithm can be found in [53].

4) WL:

This method is based on the fact that facial landmarks are usually averaged between two individuals when morphed images are created. Therefore, the distance of a given landmark (e.g., right corner of the right eye) between two bona fide images of the same subject will be smaller than the distance between that same landmark in a bona fide image of the subject and in a morphed image of that subject with another subject. To exploit this idea, a set of 68 facial landmarks is extracted from each input image using dlib. Subsequently, two types of features are computed: Euclidean distances between landmarks, and angles between a pre-defined set of neighbouring landmarks. In order to account for the reliability of the landmark estimation (e.g., the eye corners are more stable than landmarks on the lips), different weights are applied to the distances before they are classified as bona fide or morphed images using an SVM. Details on the computations of the distances and angles can be found in [10], [57].
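One possible reading of these two feature types is sketched below: weighted Euclidean shifts of corresponding landmarks between the two images, and differences in the orientation of segments joining neighbouring landmarks. The alignment of the two landmark sets, the weights and the neighbour pairs are assumptions made for illustration; the exact definitions are those of [10], [57]. Two landmarks are used instead of 68 to keep the example short.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def weighted_shifts(ref: Sequence[Point], sus: Sequence[Point],
                    weights: Sequence[float]) -> List[float]:
    """Weighted Euclidean distance between corresponding landmarks of two aligned images."""
    return [w * math.dist(p, q) for p, q, w in zip(ref, sus, weights)]


def angle_differences(ref: Sequence[Point], sus: Sequence[Point],
                      neighbours: Sequence[Tuple[int, int]]) -> List[float]:
    """Change in orientation (radians) of the segment joining each neighbouring landmark pair."""
    features = []
    for i, j in neighbours:
        a_ref = math.atan2(ref[j][1] - ref[i][1], ref[j][0] - ref[i][0])
        a_sus = math.atan2(sus[j][1] - sus[i][1], sus[j][0] - sus[i][0])
        delta = a_sus - a_ref
        # Wrap the difference into [-pi, pi].
        features.append(math.atan2(math.sin(delta), math.cos(delta)))
    return features


# Hypothetical landmarks: trusted capture vs. suspected image.
trusted = [(0.0, 0.0), (1.0, 0.0)]
suspected = [(0.0, 0.0), (1.0, 1.0)]
reliability = [1.0, 0.5]          # hypothetical per-landmark weights

print(weighted_shifts(trusted, suspected, reliability))
print(angle_differences(trusted, suspected, [(0, 1)]))
```

The concatenated feature vector would then be passed to the SVM classifier.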

5) DR:

This method is based on differencing the bona fide image captured in a trusted environment (e.g., an ABC gate) and the suspected image from the electronic Machine-Readable Travel Document (eMRTD) [32]. Both images $I_{s}$ and $I_{t}$ are decomposed into normal maps and diffuse maps using SfSNet [58], following which the diffuse reconstructed image and a quantized normal map are obtained. From the diffuse map, the features are extracted using the ‘fc7’ activation layer of AlexNet [35]. The features from the normal map are extracted by converting them to quantized spherical angles (quantization is 24-bit). The features are used to train polynomial SVM classifiers, one for each set of features. The classifiers are then used to determine whether the suspected image is morphed, based on the fusion of the scores from the individual classifiers corresponding to the normal map and the diffuse map. Details on the DR D-MAD algorithm can be found in [32].

6) Face Demorphing:

The idea of Face Demorphing (FaDe) [15] is to invert the morphing process. Let $I_{s}$ be the suspected image stored in the ID document. A morphed image is generally a linear combination of multiple images, $I_{m}=I_{a}+I_{c}$, where $I_{a}$ and $I_{c}$ are the face images of the bona fide accomplice and the criminal, respectively. For a genuine ID document (with no morphing attack), the assumption is instead that the image $I_{m}$ is a combination of two identical images (e.g., $I_{m}=I_{a}+I_{a}$), where $I_{a}$ is the bona fide image.

Given the image $I_{t}$ captured in a trusted environment, the demorphing algorithm computes a difference between the suspected image $I_{s}$ and the captured image $I_{t}$ to obtain a demorphed image $I_{d}$. When $I_{d}$ is compared against $I_{t}$ using an FRS, a high comparison score ($S$) indicates no morphing and a lower score indicates a higher probability of morphing. Ferrara et al. [15] employ Dlib for comparing the trusted capture image $I_{t}$ and the demorphed image $I_{d}$ as given below:

$\begin{align*} S =\begin{cases} \max \left[{0, \dfrac {(d-\tau _{1})}{(2\times (\tau _{2} - \tau _{1}))} }\right], & \text {if } d\leq \tau _{2} \\[0.7pc] \min \left[{1, 0.5 + \dfrac {(d-\tau _{2})}{(2\times (\tau _{3} - \tau _{2}))} }\right], & \text {otherwise}. \end{cases}\qquad \tag{1}\end{align*}$

where $\tau _{1}, \tau _{2}, \tau _{3}$ are thresholds chosen from empirical trials and set to $0.3699$, $0.4565$ and $0.5469$, respectively.
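The piecewise mapping of Eq. (1) can be sketched as below, treating $d$ as the comparison value returned by Dlib for $I_{t}$ and $I_{d}$. The sketch assumes that the second branch is capped at 1 (i.e. a minimum with 1), so that $S$ stays in $[0, 1]$ and the two branches meet at $S=0.5$ for $d=\tau_{2}$; the function name is hypothetical.

```python
TAU_1, TAU_2, TAU_3 = 0.3699, 0.4565, 0.5469  # thresholds reported in the text


def fade_score(d: float, tau1: float = TAU_1, tau2: float = TAU_2, tau3: float = TAU_3) -> float:
    """Map a comparison value d to a score S in [0, 1] following Eq. (1).

    Assumption: the upper branch is clipped with min(1, ...).
    """
    if d <= tau2:
        # Lower branch: rises linearly from 0 at tau1 to 0.5 at tau2.
        return max(0.0, (d - tau1) / (2.0 * (tau2 - tau1)))
    # Upper branch: rises linearly from 0.5 at tau2 to 1 at tau3.
    return min(1.0, 0.5 + (d - tau2) / (2.0 * (tau3 - tau2)))


for d in (0.30, TAU_1, TAU_2, 0.50, TAU_3, 0.60):
    print(d, fade_score(d))
```

The mapping is continuous and non-decreasing in $d$, and saturates at 0 below $\tau_{1}$ and at 1 above $\tau_{3}$.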

B. S-MAD

An S-MAD algorithm determines whether an image is morphed directly, i.e. without using a trusted reference image. Most S-MAD algorithms first extract features from the suspected image using textural descriptors or deep networks, and then learn a classifier. The learnt classifier is used to determine whether the image is morphed or not. We briefly describe the set of S-MAD algorithms evaluated in this work.

1) PRNU:

This algorithm is based on the analysis of Photo Response Non-Uniformity (PRNU). In essence, the PRNU stems from slight variations among individual pixels during the photoelectric conversion in digital image sensors. As a consequence, it is present in all acquired images and can be considered as an inherent part of any sensor’s output. In fact, the PRNU has been successfully used for different forensic tasks, such as device identification or detection of digital forgeries. For the particular purpose of detecting morphed images [11], the PRNU is extracted from the preprocessed facial images and subsequently split into cells. From each cell, the variance of 100-bin histograms of the PRNU is computed. Then, the minimum value among all cells is thresholded to obtain a bona fide vs. morphed image decision. More details on this MAD mechanism can be found in [11].
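The cell-based decision step can be sketched as follows, assuming that the PRNU has already been extracted into a 2-D array. The histogram range (global minimum and maximum of the array), the square cell grid, the threshold value and the direction of the comparison (a minimum below the threshold flags a morph) are all assumptions made for illustration; the details are those of [11].

```python
from statistics import pvariance
from typing import List, Sequence


def histogram(values: Sequence[float], bins: int, lo: float, hi: float) -> List[int]:
    """Fixed-range histogram with the top edge included in the last bin."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        index = min(int((v - lo) / width), bins - 1) if width > 0 else 0
        counts[index] += 1
    return counts


def cell_variances(prnu: Sequence[Sequence[float]], cells_per_side: int, bins: int = 100) -> List[float]:
    """Variance of the histogram counts for every cell of a square grid."""
    rows, cols = len(prnu), len(prnu[0])
    lo = min(min(row) for row in prnu)
    hi = max(max(row) for row in prnu)
    cell_h, cell_w = rows // cells_per_side, cols // cells_per_side
    variances = []
    for r in range(cells_per_side):
        for c in range(cells_per_side):
            values = [prnu[y][x]
                      for y in range(r * cell_h, (r + 1) * cell_h)
                      for x in range(c * cell_w, (c + 1) * cell_w)]
            variances.append(pvariance(histogram(values, bins, lo, hi)))
    return variances


def is_suspected_morph(prnu, threshold: float, cells_per_side: int = 2, bins: int = 100) -> bool:
    """Threshold the minimum cell value (comparison direction is an assumption)."""
    return min(cell_variances(prnu, cells_per_side, bins)) < threshold


# Tiny made-up 4x4 "PRNU" with a 2x2 cell grid and 4 bins, for illustration only.
demo = [[0, 0, 1, 2],
        [0, 0, 3, 4],
        [1, 1, 2, 2],
        [3, 3, 4, 4]]
print(cell_variances(demo, cells_per_side=2, bins=4))
```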

2) Scale-Space Ensemble Approach (SSE):

The algorithm is based on an ensemble approach that extracts textural features and then learns a classifier [16]. The final decision on whether the image is bona fide or morphed is made from the set of scores obtained from different classifiers learnt on different features. Specifically, the image is decomposed into different color spaces such as YCbCr and HSV. For each channel of the color space, the image is decomposed into different scale spaces using a Laplacian pyramid with 3-level decomposition. Different textural features are then obtained using Binarized Statistical Image Features (BSIF), Local Binary Patterns (LBP) and Histogram of Oriented Gradients (HOG). The obtained features are used to learn the Collaborative Representation Classifier (CRC). While the testing is carried out on the SOTAMD dataset, the training was performed on a dataset derived from the FRGC face dataset. More details can be found in [3].

3) Deep-S-MAD:

This algorithm uses well-known pre-trained CNNs to detect morphed images [17]. The pre-trained networks have been fine-tuned using a large set of artificially generated digital images (both bona fide and morphed). Moreover, in order to deal with the print and scan (P&S) process, a further fine-tuning step has been performed for the P&S case, exploiting a set of images artificially generated to simulate P&S. The simulation follows a mathematical model that allows control of different image characteristics, related to both image visual quality and low-level signal content. In particular, the main visual effects produced when an image is printed and scanned can be successfully reproduced (blurring, gamma correction, color adjustment or noise).

The AlexNet architecture pre-trained on ImageNet [35] has been used on digital images while the VGG-Face16 [59] architecture pre-trained on the VGG-Face dataset [59] has been used on P&S images.

4) S-MBLBP:

The created classification system extracts multi-block local binary patterns from a face image and uses a support vector machine with a linear kernel to classify it as either morphed or bona fide [53]. The approach optimises the feature extraction process by using uniform LBPs with radius, r = 1 (i.e. number of neighbours, n = 8), and a histogram layout of $3\times 3$ . Before feature extraction the face is detected and cropped with a HOG-based face detector [45], converted to grey scale and finally histogram equalization is applied to enhance image contrast. The $3\times 3$ histogram layout is realized by splitting the face image by 2 equidistant vertical and horizontal lines. A single histogram contains 59 feature values, which means that after concatenating the 9 histograms of our layout our feature space has 531 dimensions. The classifier was trained on [55] and [60]. As pre-processing steps, all training images were converted to png format without any compression to avoid jpg compression artefacts being detected, and resized using nearest neighbour interpolation to the average size of the three training datasets. Additionally, faces were horizontally aligned to make them similar to (ICAO compliant) benchmark images.
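The feature dimensionality quoted above can be checked with a short sketch. It assumes the usual definition of uniform LBP codes (at most two 0/1 transitions in the circular 8-bit pattern) with one additional shared bin for all non-uniform codes; the function names are hypothetical.

```python
from typing import List


def transitions(pattern: int, bits: int = 8) -> int:
    """Number of 0/1 changes when the bit pattern is read circularly."""
    rotated = (pattern >> 1) | ((pattern & 1) << (bits - 1))
    return bin(pattern ^ rotated).count("1")


def uniform_patterns(bits: int = 8) -> List[int]:
    """All codes with at most two circular transitions."""
    return [p for p in range(2 ** bits) if transitions(p, bits) <= 2]


def histogram_bins(bits: int = 8) -> int:
    """One bin per uniform code plus one shared bin for non-uniform codes."""
    return len(uniform_patterns(bits)) + 1


def feature_dimension(grid_rows: int = 3, grid_cols: int = 3, bits: int = 8) -> int:
    """Length of the concatenated block histograms."""
    return grid_rows * grid_cols * histogram_bins(bits)


print(histogram_bins())      # bins per block histogram
print(feature_dimension())   # total feature length for the 3x3 layout
```

For n = 8 neighbours there are 58 uniform codes, which gives the 59 bins per histogram and the 531-dimensional feature space stated in the text.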

SECTION VI.

Results and Discussion

A. Results - D-MAD

The results observed on the Digital Image Benchmark (D-MAD-SOTAMD_D-1.0) are reported in Figure 7; Table X in the Appendix (see the Supplementary Material) additionally reports the results on the two subsets with morphing factor 0.3 and 0.5. The DET plots in Figure 7 refer to the overall results; additional results are reported in Appendix A of the Supplementary Material.

Fig. 7.

DET plots for the D-MAD-SOTAMD_D-1.0.


The detection accuracy of some of the evaluated algorithms is quite modest. Two algorithms perform better than the average, and the DFR algorithm in particular reaches very promising results. The general under-performance of the MAD algorithms with respect to the detection accuracy reported in the original publications could be due to the difficulty of the benchmark dataset and to the over-specialization of these algorithms on the native training sets previously used in the research labs. As to the FaDe approach, its better generalization capability is probably due to the absence of a specific training stage and/or hyperparameter tuning in the method. The good performance of DFR can be attributed to the fact that the ArcFace algorithm used for feature extraction was trained independently of morphed images, and thus the extracted feature vectors are not overfitted to the artifacts of individual morphing algorithms. Table X, as shown in the Supplementary Material, reports the performance of the tested MADs on the entire set of images as well as separately for the subsets of images with morphing factor 0.3 and 0.5. The results related to the morphing factor 0.3 are in general slightly better than those obtained on the entire database. A noticeable improvement on all the performance indicators can be observed only for the DFR and FaDe algorithms. The behavior of FaDe is explainable if we consider that the algorithm has been designed to work on asymmetric morphings. The performance gain of DFR can be attributed to the use of the difference vector: if the morphing factor is lower, the difference increases and so does the possibility to detect the morph.

For a deeper understanding of the image characteristics that most affect MAD performance, the results have been analyzed for specific subsets of images, described in Table XII presented in the Appendix, as shown in the Supplementary Material. The subsets have been selected according to the number of images available (subsets that are too small are therefore discarded).

The degree of influence of each specific subset on the overall performance has been evaluated by computing, for each subset $s$, the percentage deviation between the EER measured on the specific subset ($eer_{s}$) and the EER measured on the whole set of images ($eer_{o}$):

$\begin{equation*} dev_{s}=\frac {eer_{s}-eer_{o}}{eer_{o}}\times 100\tag{2}\end{equation*}$

A negative deviation indicates that the specific subset is “easier” than the overall set of images (a lower EER value has been observed), while high positive values identify more difficult subsets. The deviation computed for each algorithm, as well as the average deviation ( $\overline {dev_{s}}$ ) for the subset of tests with morphing factor 0.3, are reported in Table XV in the Appendix, as shown in the Supplementary Material, where the results are sorted by $\overline {dev_{s}}$ . Some interesting results can be observed, in relation to the main attributes characterizing the database images:

Ethnicity: in general the morphed images produced with Indian-Asian and Middle Eastern subjects are easier to detect for most of the algorithms. The cardinality of these subsets is lower than European/American, and the chance of selecting lookalike subjects for morphing was lower.
Automatic or manual post-processing: as expected manual post-processing (i.e., retouching for artefact removal) makes morphing detection more difficult w.r.t. automatic post-processing, even if the difference is just minor here.
Manual post-processing technique: significant differences can be observed in relation to the manual post-processing executor, thus confirming the importance of manual retouching aimed at removing small artefacts; while PM03 and PM06 are easier to detect, especially for some algorithms, PM02 and PM05 are more difficult to spot.
Subset of Morphs: the subset containing UTW images is more difficult than those from the other partners. In fact, in this case, very similar pairs of subjects were selected, making the resulting morphs more difficult to detect.
Morph quality: as expected high quality morphs (i.e., those accepted by commercial face verification algorithms) are more difficult to detect than low quality morphs (i.e., those already rejected by face verification algorithms).
Morphing algorithm: the results over different morphing algorithms are quite different; algorithms C06, C07 and C03 are generally easier to detect, while C02 and C01 are quite hard for most of the D-MAD algorithms.
Age: the results on subjects in the range 56–75 are generally much worse than those related to younger subjects; as per the Traits subsets (see below) we argue that the transfer of evident skin characteristics such as wrinkles, freckles or moles, can make the morphed images similar enough to both subjects.
Gender: morphing detection in female subjects looks on average more difficult.
Traits: the error rate on images with specific traits (moles, freckles) is on average higher than that measured on images without particular facial traits. See the above discussion on Age.
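The percentage deviation of Eq. (2) used throughout this analysis is a simple relative difference; the sketch below illustrates it with made-up EER values that are not taken from the tables of this work.

```python
def percentage_deviation(eer_subset: float, eer_overall: float) -> float:
    """dev_s = (eer_s - eer_o) / eer_o * 100, following Eq. (2)."""
    if eer_overall <= 0:
        raise ValueError("the overall EER must be positive")
    return (eer_subset - eer_overall) / eer_overall * 100.0


# Hypothetical EER values (as fractions), for illustration only.
overall_eer = 0.20
print(percentage_deviation(0.15, overall_eer))  # negative: subset easier than the overall set
print(percentage_deviation(0.30, overall_eer))  # positive: subset harder than the overall set
```

Because the deviation is normalised by the overall EER of each algorithm, the values of different algorithms can be averaged into $\overline {dev_{s}}$ even when their absolute error rates differ widely.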

The results reported in Table XV (Appendix as shown in the Supplementary Material) show that, even if a common behaviour can be observed for several subsets, in a number of cases (e.g. Type of Post-processing or Ethnicity) different algorithms provide significantly different performance. This leads us to suppose that the tested D-MADs produce quite independent errors and a combination of such different techniques can lead to a performance improvement.

The results obtained on the P&S Image Benchmark (D-MAD-SOTAMD_P&S-1.0) are summarized in Fig. 8. While for the best performing approach (DFR) the detection accuracy on Digital and P&S images is similar, in general a performance drop on Print and Scan images can be observed; for example, for the demorphing method (FaDe) the BPCER values are about 10% higher. Also in this case the influence of the morphing factor on the MAD performance can be observed in Table VIII reporting the results for the overall set of images and for the subsets of images with morphing factor 0.3 and 0.5.

TABLE VIII Performance Indicators Measured on the D-MAD-SOTAMD_P&S-1.0 Benchmark for the Overall Set of Images and for the Subsets of Images With Morphing Factor 0.3 and 0.5

Fig. 8.

DET plot for the D-MAD-SOTAMD_P&S-1.0.


B. Results - S-MAD

The results of S-MAD algorithms on printed-scanned images are given in Table IX and on digital images in Table XI (Appendix as shown in the Supplementary Material). In this case the overall performance is quite unsatisfactory in general and very far from the accuracy needed in real operational conditions. No significant differences can be observed between the different test cases: morphing factor 0.3 or 0.5, digital or printed-scanned images. We can conclude that morphing attack detection based on the analysis of a single image is still very complex, particularly in the presence of heterogeneous image sources, different processing pipelines and high quality morphs obtained through a careful selection of subjects and an accurate post-processing aimed at removing all visible artifacts. The results confirm again the importance of cross-database training and testing to improve the robustness of detection algorithms.

TABLE IX Performance Indicators Measured on the S-MAD-SOTAMD_P&S-1.0 Benchmark for the Overall Set of Images and for the Subsets of Images With Morphing Factor 0.3 and 0.5

C. Directions for Future Works

From the results reported in the previous sections, it is evident that the accuracy of MAD does not meet the operational requirements. If we focus on BPCER₁₀₀, we can see from Tables X and VIII that the result is around 20% for the best performing D-MAD approach. For all S-MAD algorithms (see Table IX and Table XI in Appendix as shown in the Supplementary Material), BPCER₁₀₀ is higher than 90%. From a practical point of view, this behaviour would cause a considerable number of false alarms and, as a consequence, a high number of false rejections during face verification at ABC gates. This would be unacceptable if we consider that operational face verification systems for ABC gates are expected to work at a False Accept Rate (FAR) of 0.1% with a False Rejection Rate (FRR) not higher than 5% [61].

Given the number of covariates impacting the MAD performance such as age, gender and ethnicity, accurate and better algorithms need to be developed to address the complex challenge of morphing attacks. The results presented in this work also suggest that the combination of approaches of different nature could lead to a general performance improvement.
As can also be noted from Table VIII, the print and scan process degrades the MAD accuracy. Reliable and accurate algorithms need to be developed for detecting morphing attacks specifically when images are processed through the print and scan pipeline.
As a complementary direction, the human detection performance should be studied in a standardized manner to understand the key factors in spotting the morphing attacks on FRS.

SECTION VII.

Conclusion and Summary

Given the complex nature of morphing attack detection and its impact on operational FRS, we presented a new evaluation framework and a new database of morphed images in this work. The sequestered morphed dataset being publicly available allows researchers to benchmark their algorithms in a continuous manner and to contribute to the development of morphing attack detection. Further, this work also provides a benchmark of the existing state-of-the-art algorithms to give a clear idea of the limitations of the existing algorithms for MAD.

ACKNOWLEDGMENT

The content of this report represents the views of the authors only and is their sole responsibility. The European Commission does not accept any responsibility for use that may be made of the information it contains. Further we are grateful to our colleagues at the German Federal Office for Information Security (BSI), the Hochschule Bonn-Rhein-Sieg (H-BRS) and to the Norwegian Police for the support in the data acquisition.
