1. Introduction
In this paper, we propose Celeb-500K, a large training dataset for deep learning [1]–[6] based large-scale face recognition [7]–[11]. The training dataset consists of 50M images of 500K persons. Our paper focuses on addressing the following two issues in face recognition.

First, as Table 1 shows, there is a large gap in scale between publicly available datasets and private datasets. For example, CelebFaces [12] has only about 1/800 of the identities and at most about 1/500 of the images of the Google dataset. Compared with industry, the academic research community must therefore rely on much smaller datasets, which tends to yield biased conclusions, and the efficacy of proposed methods on larger training datasets needs further verification. For example, many metric learning methods, including Contrastive Loss [12], Center Loss [13] and Triplet Loss [14], have greatly improved the face recognition performance of models trained on smaller public datasets such as CelebFaces and CASIA-WebFace, but their effectiveness on larger-scale datasets needs further investigation.

Table 1: Recent face recognition training datasets.

Dataset | Availability | #People | #Images |
---|---|---|---|
YFD [15] | public | 1595 | 3425 videos |
VGGFace [16] | public | 2600 | 2.6M |
VGGFace2 [10] | public | 9131 | 3.3M |
CelebFaces [12] | public | 10K | 202K |
CASIA-WebFace [17] | public | 10K | 500K |
MS-Celeb-1M [18] | public | 100K | 10M |
Celeb-500K | public | 500K | 50M |
Facebook [18] | private | 4K | 4.4M |
Google [18] | private | 8M | 100–200M |
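
The scale gap quoted in the text can be checked directly against the figures in Table 1. The snippet below is only an illustrative sketch: the helper name `times_smaller` and the dictionary layout are assumptions made for this example, and the counts are the rounded values from the table, not exact dataset statistics.

```python
# Rounded figures copied from Table 1; Google's image count is given as a range.
CELEBFACES = {"people": 10_000, "images": 202_000}
GOOGLE = {"people": 8_000_000, "images": (100_000_000, 200_000_000)}


def times_smaller(small: int, large: int) -> float:
    """Return how many times smaller `small` is than `large` (hypothetical helper)."""
    return large / small


identity_gap = times_smaller(CELEBFACES["people"], GOOGLE["people"])
image_gap_low = times_smaller(CELEBFACES["images"], GOOGLE["images"][0])
image_gap_high = times_smaller(CELEBFACES["images"], GOOGLE["images"][1])

print(f"identities: 1/{identity_gap:.0f}")                      # identities: 1/800
print(f"images: 1/{image_gap_low:.0f} to 1/{image_gap_high:.0f}")  # images: 1/495 to 1/990
```

With these rounded counts, the identity ratio comes out at 1/800, while the image ratio lies between roughly 1/500 and 1/1000 depending on which end of the Google range is used.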