Celeb-500K: A Large Training Dataset for Face Recognition | IEEE Conference Publication | IEEE Xplore

Celeb-500K: A Large Training Dataset for Face Recognition


Abstract:

In this paper, we propose a large training dataset named Celeb-500K for face recognition, which contains 50M images from 500K persons. To better facilitate academic research, we clean Celeb-500K to obtain Celeb-500K-2R, which contains 25M aligned face images from 365K persons. Based on the developed dataset, we achieve state-of-the-art face recognition performance and reveal two important observations for face recognition research. First, metric learning methods yield limited performance gains when the training dataset contains a large number of identities. Second, when building an efficient training dataset, the number of identities matters more to face recognition performance than the average number of images per identity. Extensive experimental results show the superiority of Celeb-500K and provide strong support for the two observations.
Date of Conference: 07-10 October 2018
Date Added to IEEE Xplore: 06 September 2018
ISBN Information:
Electronic ISSN: 2381-8549
Conference Location: Athens, Greece

1. Introduction

In this paper, we propose a large training dataset named Celeb-500K for deep learning [1]–[6] based large-scale face recognition [7]–[11]. The training dataset consists of 50M images from 500K persons. Our paper focuses on two issues in face recognition. First, as Table 1 shows, there are large gaps in scale between publicly available datasets and private datasets. For example, CelebFaces [12] has only 1/800 of the identities and 1/500 of the images of the Google dataset. Compared with industrial practitioners, the academic research community can therefore only resort to smaller datasets, which typically leads to biased conclusions, so the efficacy of proposed methods on larger training datasets needs further verification. For example, many metric learning methods, including Contrastive Loss [12], Center Loss [13] and Triplet Loss [14], have greatly improved the face recognition performance of models trained on smaller public datasets such as CelebFaces and CASIA-WebFace, but their effectiveness on larger-scale datasets needs further investigation.

Table 1. Recent face recognition training datasets.

Dataset             Availability  #People  #Images
YFD [15]            public        1595     3425 videos
VGGFace [16]        public        2600     2.6M
VGGFace2 [10]       public        9131     3.3M
CelebFaces [12]     public        10K      202K
CASIA-WebFace [17]  public        10K      500K
MS-Celeb-1M [18]    public        100K     10M
Celeb-500K          public        500K     50M
Facebook [18]       private       4K       4.4M
Google [18]         private       8M       100–200M
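
The scale gaps quoted in the introduction can be checked directly against the figures in Table 1. The following is a minimal sketch rather than anything taken from the paper: the dictionary layout and the helper names (images_per_identity, identity_ratio, image_ratio_range) are illustrative assumptions, and the Google image count is kept as a range because the table reports it only as 100–200M.

```python
# Figures copied from Table 1. The layout and helper names below are
# illustrative assumptions, not part of the paper.
datasets = {
    "CelebFaces": {"people": 10_000, "images": 202_000},
    "CASIA-WebFace": {"people": 10_000, "images": 500_000},
    "MS-Celeb-1M": {"people": 100_000, "images": 10_000_000},
    "Celeb-500K": {"people": 500_000, "images": 50_000_000},
}

# The private Google dataset; its image count is given only as a range.
google_people = 8_000_000
google_images = (100_000_000, 200_000_000)


def images_per_identity(name):
    """Average number of images per identity for a public dataset."""
    d = datasets[name]
    return d["images"] / d["people"]


def identity_ratio(name):
    """How many times more identities the Google dataset has."""
    return google_people / datasets[name]["people"]


def image_ratio_range(name):
    """(low, high) factor by which the Google image count exceeds the dataset's."""
    n = datasets[name]["images"]
    return google_images[0] / n, google_images[1] / n


if __name__ == "__main__":
    for name in datasets:
        low, high = image_ratio_range(name)
        print(
            f"{name}: {images_per_identity(name):.1f} images/identity, "
            f"Google has {identity_ratio(name):.0f}x the identities and "
            f"{low:.0f}x to {high:.0f}x the images"
        )
```

For CelebFaces the identity factor is exactly 800, matching the 1/800 figure in the text, while the image factor spans roughly 495 to 990, so the quoted 1/500 corresponds to the lower end of the Google range. Celeb-500K averages 100 images per identity.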
