Introduction
Facial beauty has a tremendous impact on our social and workplace lives, and facial beauty prediction (FBP) has been applied widely in fields such as beauty-related apps (e.g., MeiTu and Facetune) and plastic surgery.
In recent years, the convolution neural network (CNN) [1]–[8], as a research method in machine vision, has drawn increasing attention from researchers. Visual tasks in machine vision, such as image classification [1], [6], [9]–, object detection [7], and face recognition [3], [8], have benefited from the deep convolution neural network (DCNN) and deep learning, with significant gains in classification results. For example, the accuracy of CNN models in face recognition has improved from 97% to 99% on the Labeled Faces in the Wild (LFW) database [10], while the mean average precision in object detection has improved from 53.3% with Regions with CNN (R-CNN) to 74.3% with the Single shot multi-box detector (SSD) [11] on the PASCAL VOC database. Recently, CNN models have grown ever larger in pursuit of higher accuracy. Although progress in FBP has been relatively slow compared with face recognition, a series of studies have been published. In 2010, Gray et al. [12] constructed an automatic FBP system with a DCNN model that extracted deeper and more abstract appearance features, combining feature extraction and classification. In 2015, Xie et al. [13] proposed an FBP database of 500 people and achieved a maximum correlation coefficient of 0.8187 by conducting FBP at different depths of a CNN model. In the same year, Xu et al. [14] extracted textural features in the CIELab color space transformed from the RGB color space and achieved a maximum correlation coefficient of 0.88 on the SCUT-FBP database with a doubly cascaded fine-tuning method. In 2017, Xu et al. [15] improved the network of Ref. [14] and proposed a Psychologically inspired convolution neural network (PI-CNN) with a best correlation coefficient of 0.87, which combined recent psychological studies with significant appearance features of facial detail, lighting, and color to optimize the PI-CNN facial-beauty predictor on the SCUT-FBP database by regression. These studies show that deep learning has a wide range of applications in FBP. However, compared with face recognition, FBP still faces problems such as smaller-scale databases, shallower CNN models, and lower accuracy.
CNN models proposed for face recognition and image classification include DeepFace [16], VGG [4], GoogleNet [5], ResNet [6], the lighted CNN [17], etc. These CNN models have distinctive network frameworks and different construction principles. In DeepFace [16], face images are aligned and preprocessed with a 3D model, and convolution kernels with shared parameters are used in the first three layers while kernels without shared parameters are used in the latter three layers. Fixed-size convolution kernels in VGG are used to set all convolution and pooling layers with the same layer operation parameters, so that each group has the same output shape. The Inception model with multi-scale convolution filters is proposed in GoogleNet, which can extract multi-scale image features and improve the capacity of feature extraction. Shortcut connections in ResNet [6] sum the outputs of the stacked layers, smooth the data flow of the network, mitigate gradient vanishing, and allow a deeper network framework to be constructed. In Ref. [17], the Max-Feature-Map (MFM) operation is proposed to replace ReLU with a competition mechanism, obtaining more compact features and reducing network parameters. Experiments with a CNN model of 5 convolution layers and 2 fully connected layers show that good classification results are achieved.
The main contributions of this paper can be summarized as follows:
We propose a Lighted deep convolution neural network (LDCNN) for FBP with strengthened feature extraction, which combines network structure characteristics of GoogleNet, VGG, the lighted CNN, etc., as shown in Fig. 1.
The first convolution layer is an Inception model constructed with a split-and-merge strategy, which can extract multi-scale image features through multiple convolution filters.
Data augmentation is utilized to expand the scale of the database, which effectively improves accuracy.
The LDCNN model uses small convolution kernels to improve prediction accuracy and reduce network parameters. Our LDCNN model has 5,650K parameters in total, fewer than the other published CNN models [4], [6], [16], [18]–. The time used for FBP of each face is listed in the experiments: only 3.11 ms per face image.
Fig. 1. The basic structure of our LDCNN model, where 'C' denotes a convolution layer, 'M' an MFM activation layer, 'P' a max-pooling layer, and 'FC' a fully connected layer.
Related Work
1. Face Beauty Database
Although researchers have confirmed that facial beauty is a universal concept that can be learned automatically by machine, it is difficult to establish a beauty standard because judgments of facial beauty are subjective and influenced by factors such as ethnicity, age, social class, and culture. For this reason, there are few public authoritative databases for FBP, and it is difficult to construct a large-scale database shared by all researchers; thus researchers can only evaluate FBP algorithms on small-scale databases. In this paper, we focus only on female facial beauty in Asian areas, so we conduct our algorithm on the Large-scale Asian facial beauty database (LSAFBD), with 10K rated images and 80K unrated images. LSAFBD, built on the basis of Ref. [19], is the largest known FBP database with young women as the subject of study. During the creation of LSAFBD, a face image acquisition framework was constructed from image processing and image recognition algorithms, which could download images from the internet automatically and carry out follow-up operations including image preprocessing. Furthermore, numerous extremely beautiful samples gathered by manual collection were used to expand the database, so that the distribution of facial beauty is more objective and feasible. The images collected in LSAFBD satisfy unconstrained conditions, excluding those with overly large face deflection angles, excessive facial occlusion, poor resolution, and other defects that hamper the evaluation of facial beauty, as shown in Fig. 2. Facial beauty in LSAFBD is divided into 5 levels represented by the numbers 1 to 5. Among them, 1 represents extremely unattractive, 2 unattractive, 3 average, 4 attractive, and 5 extremely attractive. The label histogram of LSAFBD is shown in Fig. 3; the image labels follow a Gaussian distribution. Few face images are extremely attractive or extremely unattractive, and most are average, which is consistent with the distribution of beauty in the real environment.
Fig. 2. Samples and labels on LSAFBD. Facial beauty on LSAFBD is divided into 5 levels represented by the numbers 1 to 5, where 1 represents extremely unattractive, 2 unattractive, 3 average, 4 attractive, and 5 extremely attractive. Attractive females with labels 4 and 5 share the same characteristics: wide eyes, thin eyebrows, full lips, smooth skin, and a narrow chin.
2. Inception Model
The Inception model, a fundamental structure of GoogleNet, is used to extract multi-scale image features, which strengthens the capability of feature extraction. Furthermore, the Inception model increases the width of the hidden layers in GoogleNet, which can extract more detailed features and improve accuracy. Fig. 4 shows the framework of the Inception model, which includes 1 × 1 convolution layers, 3 × 3 convolution layers, 5 × 5 convolution layers, and 3 × 3 max pooling layers. Convolution kernels at different scales can extract image features carrying different spatial information and improve on single-scale feature extraction. Meanwhile, 1 × 1 convolution kernels are used to reduce network parameters, and the padding of the different convolution filters is set to 0, 1, and 2, respectively. The outputs of the different filters therefore have the same spatial dimensions and can be merged by a concatenate layer.
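As an illustration, the following PyTorch sketch builds an Inception-style block matching Fig. 4. The per-branch channel widths (64 each) and the 1 × 1 projection after the pooling branch are assumptions; the paper only fixes the kernel sizes, the paddings, and the 256-channel concatenated output.

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Inception-style first layer: parallel multi-scale filters whose
    padded outputs share one spatial size and are concatenated on the
    channel axis. Branch widths of 64 each are an assumption; the paper
    only states that the concatenated output has 256 channels."""
    def __init__(self, in_ch=1, branch_ch=64):
        super().__init__()
        self.b1x1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1, padding=0)
        self.b3x3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5x5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)
        # 3x3 max pooling with stride 1 keeps the 128 x 128 resolution;
        # a 1x1 conv after it projects to the branch width (assumed).
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
        )

    def forward(self, x):
        # Each branch returns N x 64 x 128 x 128; concatenation yields
        # N x 256 x 128 x 128, matching Fig. 4.
        return torch.cat(
            [self.b1x1(x), self.b3x3(x), self.b5x5(x), self.pool(x)], dim=1)

x = torch.randn(1, 1, 128, 128)   # one gray-scale 128 x 128 image
print(InceptionBlock()(x).shape)  # torch.Size([1, 256, 128, 128])
```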
Fig. 3. Label histogram of LSAFBD. Extremely attractive and extremely unattractive images are few; most are average.
Fig. 4. Framework of the Inception model. It is constructed with 1 × 1 convolution layers, 3 × 3 convolution layers, 5 × 5 convolution layers, and 3 × 3 max pooling layers. The image fed into the Inception model is gray-scale with a resolution of 128 × 128; the output of the Inception model has a resolution of 128 × 128 × 256.
3. Activation Function
Activation functions used in CNN models include ReLU, MFM, and so on. ReLU resembles a linear activation function, which makes it easy to optimize, and it outputs 0 for inputs less than 0. Owing to its unilateral activation and forced sparsity, ReLU's derivative keeps large values, overcomes gradient vanishing, and improves the convergence speed of the network, as shown in Fig. 5.
The ReLU activation function only carries out a simple piecewise-linear transformation, with limited capability of nonlinear representation. In Ref. [17], the MFM activation function is proposed on the basis of the Maxout [18] activation function, which can produce compact features and reduce network parameters. In contrast to the forced sparsity of ReLU, MFM retains the maximum information among features through a competition mechanism, as shown in Fig. 6.
Assuming there are 2N feature maps output by the previous convolution layer, the MFM activation is
\begin{equation*}
f_{ij}^{k}=\max\left(X_{ij}^{k},\ X_{ij}^{k+N}\right),\quad 1\leq k\leq N
\tag{1}
\end{equation*}
The gradient of Eq. (1) can be expressed as
\begin{equation*}
\frac{\partial f_{ij}^{k}}{\partial X_{ij}^{k^{\prime}}}=\begin{cases}
1, & X_{ij}^{k}\geq X_{ij}^{k+N}\\
0, & X_{ij}^{k} < X_{ij}^{k+N}
\end{cases}
\tag{2}
\end{equation*}
where $1\leq k^{\prime}\leq 2N$ and
\begin{equation*}
k=\begin{cases}
k^{\prime}, & 1\leq k^{\prime}\leq N\\
k^{\prime}-N, & N+1\leq k^{\prime}\leq 2N
\end{cases}
\tag{3}
\end{equation*}
According to Fig. 6 and Eq. (2), the gradient sparsity reaches 50%: for each pair of competing feature maps, only the larger element receives a nonzero gradient.
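As a concrete reference, here is a minimal PyTorch sketch of the MFM operation of Eq. (1); the channel counts in the example are arbitrary.

```python
import torch
import torch.nn as nn

class MFM(nn.Module):
    """Max-Feature-Map activation of Eq. (1): split the 2N input feature
    maps into two halves and keep the element-wise maximum, halving the
    channel count through the competition mechanism described above."""
    def forward(self, x):
        # x: N x 2C x H x W  ->  two N x C x H x W halves
        a, b = torch.chunk(x, 2, dim=1)
        return torch.max(a, b)

x = torch.randn(4, 96, 32, 32)   # 2N = 96 feature maps
print(MFM()(x).shape)            # torch.Size([4, 48, 32, 32])
```

Consistent with Eq. (2), backpropagation through `torch.max` routes the gradient only to the larger element of each pair, which is exactly the 50% gradient sparsity noted above.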
CNN Framework
The framework of the deep convolution neural network for FBP constructed in this paper is shown in Fig. 1. Inspired by both GoogleNet and the lighted CNN model, the network includes 1 Inception model, 4 convolution layers of 3 × 3, 4 convolution layers of 1 × 1, 9 MFM activation layers, and 2 fully connected layers, as listed in Table 1.
The input images are gray-scale with a resolution of 144 × 144. They are randomly cropped to 128 × 128 when fed into the CNN model, which increases the effective size of the training set. A 1 × 1 convolution layer and an MFM activation layer follow each pooling layer, which increases the channel dimensionality. Every convolution layer is divided into two independent parts, after which a merge layer, an MFM activation layer, and max pooling are applied. A dropout layer is used after the Fc1 layer to avoid over-fitting, and another MFM activation layer is added between the Fc1 layer and the dropout layer.
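To make the layout concrete, the sketch below assembles the network from the InceptionBlock and MFM modules sketched earlier. The channel widths are assumptions chosen only to illustrate the wiring; Table 1 fixes the layer types and counts, and this sketch does not attempt to reproduce the exact 5,650K parameter budget.

```python
import torch
import torch.nn as nn

def conv_mfm(in_ch, out_ch, k, pad):
    # A convolution emitting 2*out_ch maps ("two independent parts"),
    # followed by MFM, which merges them by element-wise maximum.
    return nn.Sequential(nn.Conv2d(in_ch, 2 * out_ch, k, padding=pad), MFM())

class LDCNN(nn.Module):
    """Wiring sketch of the LDCNN in Fig. 1 / Table 1: one Inception
    model, four 1x1 and four 3x3 convolution layers, nine MFM layers,
    and two fully connected layers (channel widths are assumptions)."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            InceptionBlock(1, 64), MFM(),           # 256 -> 128 maps, 128x128
            nn.MaxPool2d(2),                        # 128 -> 64
            conv_mfm(128, 96, 1, 0), conv_mfm(96, 96, 3, 1),
            nn.MaxPool2d(2),                        # 64 -> 32
            conv_mfm(96, 128, 1, 0), conv_mfm(128, 128, 3, 1),
            nn.MaxPool2d(2),                        # 32 -> 16
            conv_mfm(128, 128, 1, 0), conv_mfm(128, 128, 3, 1),
            nn.MaxPool2d(2),                        # 16 -> 8
            conv_mfm(128, 64, 1, 0), conv_mfm(64, 64, 3, 1),
            nn.MaxPool2d(2),                        # 8 -> 4
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 512), MFM(),      # Fc1 with its MFM layer
            nn.Dropout(0.75),                       # dropout ratio from Table 2
            nn.Linear(256, num_classes),            # Fc2
        )

    def forward(self, x):
        return self.classifier(self.features(x))

print(LDCNN()(torch.randn(1, 1, 128, 128)).shape)   # torch.Size([1, 5])
```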
Experiments and Analysis
1. Image Preprocessing
We conduct experiments on LSAFBD; samples are shown in Fig. 2. All face images in this database are RGB. Face detection and facial landmark detection are used in face image preprocessing, as shown in Fig. 7.
Fig. 7. Samples of face detection and facial landmark detection on LSAFBD. The red bounding box is the face detection result, and the five green dots are the detected facial key points.
Using the eye landmarks extracted by Ref. [20], we compute the angle between the horizontal line and the line connecting the two eye landmarks, and then rotate the image by this angle so that the eyes are horizontal, which compensates for in-plane pose variations. The distance between the center of the eye landmarks and the center of the mouth is fixed to 48 pixels, from which we obtain the scaling ratio of the face in the image. All faces are normalized to the same size according to this scaling ratio. Finally, all face images are cropped to 144 × 144 and converted to gray-scale, as shown in Fig. 8.
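A minimal OpenCV/NumPy sketch of this alignment follows. The landmark arguments and the vertical placement of the eye midpoint in the crop are assumptions; the 48-pixel eye-mouth distance and the 144 × 144 gray-scale output come from the text.

```python
import cv2
import numpy as np

def align_face(img, left_eye, right_eye, mouth, out_size=144, eye_mouth=48):
    """Rotate so the eye line is horizontal, scale so the eye-midpoint to
    mouth distance is 48 px, then crop to 144 x 144 gray-scale. The
    landmarks are (x, y) pairs from a 5-point detector as in Ref. [20];
    the vertical offset of the eye midpoint is an assumed choice."""
    left_eye, right_eye, mouth = map(np.float32, (left_eye, right_eye, mouth))
    dx, dy = right_eye - left_eye
    angle = float(np.degrees(np.arctan2(dy, dx)))   # in-plane rotation angle
    eye_center = (left_eye + right_eye) / 2
    scale = eye_mouth / np.linalg.norm(mouth - eye_center)
    M = cv2.getRotationMatrix2D((float(eye_center[0]), float(eye_center[1])),
                                angle, float(scale))
    M[0, 2] += out_size / 2 - eye_center[0]         # center horizontally
    M[1, 2] += out_size * 0.35 - eye_center[1]      # assumed vertical offset
    warped = cv2.warpAffine(img, M, (out_size, out_size))
    return cv2.cvtColor(warped, cv2.COLOR_BGR2GRAY)
```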
2. Training Method
We train the CNN model on an open-source deep learning framework, and the input images of the CNN model have been preprocessed as above. The images on LSAFBD are divided into two parts: 80% as the training set and 20% as the validation set. The hyperparameters of the CNN model are listed in Table 2.
The learning rate is set to 5e-4 initially and divided by ten whenever the training accuracy stops increasing and fluctuates near its maximum value; this manual reduction is repeated until the accuracy no longer improves. A dropout layer is attached to the Fc1 layer to avoid over-fitting, with the dropout ratio set to 0.75.
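This schedule can be summarized in the training-loop fragment below. The optimizer choice, the plateau test, and the helpers `train_one_epoch` and `evaluate` are assumptions introduced for illustration; only the initial rate of 5e-4, the ten-fold reductions, and the 0.75 dropout ratio come from the text.

```python
import torch

# Hypothetical helpers: train_one_epoch and evaluate stand in for the
# usual epoch loop and validation pass; they are not defined in the paper.
model = LDCNN()
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9)
max_epochs, best_acc, patience, stalls = 100, 0.0, 3, 0

for epoch in range(max_epochs):
    train_one_epoch(model, optimizer)          # assumed helper
    acc = evaluate(model)                      # assumed helper
    if acc > best_acc:
        best_acc, stalls = acc, 0
    else:
        stalls += 1                            # accuracy fluctuates near max
    if stalls >= patience:
        for group in optimizer.param_groups:   # ten-fold manual reduction
            group["lr"] /= 10.0
        stalls = 0
```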
3. Comparison Method
1) Comparison Between Deep Learning and Traditional Learning
To illustrate the effectiveness of the constructed CNN model, traditional learning methods are used in experiments on LSAFBD. Both image preprocessing and database division are the same as in Section IV.1. A support vector machine (SVM) is selected as the classifier in the traditional learning methods, and the Rank-1 recognition rate is used to evaluate the performance of the algorithms. The results are listed in Table 3.
Table 3 lists the results of different methods on LSAFBD. The features used for FBP include LPQ, LBP, raw pixels, K-means, and multi-scale K-means (MSK). Although MSK outperforms the other published traditional methods, the LDCNN model in our experiment achieves better results.
2) Comparison Among CNN Models at Different Depths
To illustrate that our LDCNN model is better suited to facial beauty prediction, some published CNN models, from fields such as face recognition and image recognition, are applied in comparative experiments on LSAFBD. Because the published CNN models target different fields, they differ greatly in image size, input channels, and so on. Before the comparative experiments, all face images on LSAFBD are transformed according to the requirements of the published CNN models so that they can be fed into those models. The face images applied to the published CNN models are preprocessed only in scale and color space, without other data augmentation; the evaluated performances are listed in Table 4.
Table 4 shows that our LDCNN model outperforms the published CNN methods on LSAFBD. This proves that the multi-scale features extracted by our constructed CNN model carry more facial beauty discriminant information, and that it is effective to study facial beauty from the perspective of multi-scale features. FBP costs only 3.11 ms per face image, slower only than DeepID2. The experiments also show that our method is correct and feasible.
3) Training on Data Augmentation Database
Owing to the shortage of public facial beauty databases, we are unable to verify the performance and generalization of our LDCNN model on other databases. Therefore, we apply data augmentation to expand LSAFBD into a larger facial beauty database on which to verify the performance of our LDCNN model. We convert the original images on LSAFBD that served as the training set in Section IV.1 to gray-scale, normalize them to 144 × 144, and add random noise. We then combine these images with the training set, expanding the original 8,000 training images to 16,000. In this experiment, the testing set remains unchanged, the same as in Section IV.1. Prediction results based on traditional learning methods are listed in Table 5. The accuracy is clearly improved after data augmentation; for example, the accuracy of MSK improves from 52.7% to 56.55%, 3.85% higher than before.
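A sketch of this augmentation step is given below. The Gaussian form and strength of the random noise are assumptions, since the text does not specify the noise model.

```python
import cv2
import numpy as np

def noisy_gray_copy(img, size=144, sigma=10.0):
    """Produce the extra training copy: gray-scale, 144 x 144, plus random
    noise (Gaussian noise with this sigma is an assumption; the paper does
    not specify the noise model)."""
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.resize(gray, (size, size))
    noise = np.random.normal(0.0, sigma, gray.shape)
    return np.clip(gray.astype(np.float32) + noise, 0, 255).astype(np.uint8)

# raw_images is a placeholder list of the original BGR training images;
# their noisy gray copies are appended to the 8,000-image training set,
# giving 16,000 images in total.
extra = [noisy_gray_copy(img) for img in raw_images]
```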
Our LDCNN model also outperforms the other competitors and achieves the best accuracy of 63.5%, 1.5% higher than before, as listed in Table 6. The results show that network performance can be improved with a larger training set. Table 6 also shows that the accuracies of published CNN models, such as NIN, DeepID2, GoogleNet, etc., improve with data augmentation. This proves that a larger facial beauty database can effectively improve classification accuracy.
4. Visualization of Accuracy and Loss
To better understand the changes in accuracy and loss during the training of the CNN model constructed in this paper, we visualize the accuracy and loss curves at different scales of the facial beauty database. Before FBP, we pretrain our LDCNN model on the web-face database with a training set of 315,755 images and a testing set of 79,000 images, with the output of the Fc2 layer set to 15,750. The pretrained model is obtained when the face recognition accuracy reaches 88.2%.
Visualization of accuracy and loss on 10K database
Fig. 9 shows that the CNN model converges after 10K iterations, and the best accuracy is 62%. Because the facial beauty database is small, we fine-tune the LDCNN model pretrained on the web-face database, which improves the performance of the CNN model and reduces the number of training iterations.
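In code, the pretrain-then-fine-tune step could look like the fragment below; the checkpoint file name is a placeholder, and the layer indexing follows the LDCNN sketch above, so both are assumptions.

```python
import torch
import torch.nn as nn

# Start from the identity-classification model (Fc2 output of 15,750),
# swap Fc2 for the 5 beauty classes, and continue training.
model = LDCNN(num_classes=15750)
state = torch.load("ldcnn_webface_pretrain.pth")   # hypothetical checkpoint
model.load_state_dict(state)
model.classifier[-1] = nn.Linear(256, 5)           # replace Fc2 for FBP
optimizer = torch.optim.SGD(model.parameters(), lr=5e-4, momentum=0.9)
```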
Fig. 10 shows the confusion matrix and receiver operating characteristic (ROC) curves of FBP. The confusion matrix gives, for each class, the percentage of face images predicted as that class and as the other classes. Both the x-axis and the y-axis carry the rated labels of the face images, with slightly different character representations; e.g., the label attract_1 on the y-axis and attract1 on the x-axis denote the same rated label. The coordinate (attract1, attract_1) = 0.59 is the percentage of class 1 predicted correctly as class 1, and the coordinate (attract2, attract_1) = 0.27 is the percentage of class 1 predicted wrongly as class 2. Likewise, the coordinates (attract3, attract_1) = 0.13 and (attract4, attract_1) = 0.01 are the percentages predicted wrongly as class 3 and class 4, respectively. The last coordinate (attract5, attract_1) = 0 shows that no face images are predicted wrongly as class 5. The curves labeled 1 to 5 are the ROCs of the 5 facial beauty classes; the ROC of each class is obtained from the confusion matrix in Fig. 10.
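For reference, a row-normalized confusion matrix of this kind can be computed as in the following sketch; the label lists are placeholders standing in for the rated and predicted classes.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 2, 3, 3, 4, 5]    # placeholder rated labels
y_pred = [1, 2, 1, 2, 3, 4, 4, 5]    # placeholder predicted labels

# Entry (i, j) is the fraction of class-i test images predicted as
# class j, i.e. the row-normalized matrix read off in Fig. 10.
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5]).astype(float)
cm /= cm.sum(axis=1, keepdims=True)
print(np.round(cm, 2))
```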
Visualization of accuracy and loss on 18K database
Fig. 11 shows that our LDCNN model begins to converge after 10K iterations on the 18K database. Owing to data augmentation, the CNN model's performance is improved, outperforming the results of the other CNN models fine-tuned on the web-face database.
Visualization of CNN model
To better understand what our LDCNN model learns on LSAFBD, we visualize the Inception model and the convolution layers from Conv1 to Conv5 on a face image of class 3. The CNN model used for visualization is the one trained in Section IV.3.3), which has the best classification performance on LSAFBD, as shown in Fig. 12.
According to the visualizations, the early convolution layers, including the Inception model and Conv2, extract multi-scale and low-level features such as contour and texture features. These features are highly relevant to facial beauty. Convolution layers such as Conv3, Conv4, and Conv5 are local connection layers that train a set of filters at every position of the face image so as to extract abstract features and increase discrimination. Meanwhile, the background information of the face image is discarded, which does not affect feature extraction in subsequent convolution layers. Although the features at Conv4 and Conv5 are abstract, the positions and shapes of the eyes and mouth remain prominent, which shows that the appearance features extracted by the CNN model are consistent with people's subjective understanding of beauty.
Fig. 12. Visualization of the Inception model and the 1st to 5th convolution layers. The images are gray-scale with a resolution of 144 × 144, preprocessed on LSAFBD. The Inception model has four convolution-layer visualizations, including the 1 × 1, 3 × 3, 5 × 5, and 1 × 1 convolution layers, from top to bottom and left to right.
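Feature maps such as those in Fig. 12 can be captured with forward hooks, as in the sketch below; the indices into `model.features` refer to the LDCNN sketch above and are assumptions, as is the random input standing in for the class-3 face image.

```python
import torch

# Hook-based capture of intermediate feature maps for visualization.
model = LDCNN(num_classes=5)
activations = {}

def grab(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

model.features[0].register_forward_hook(grab("inception"))   # Inception model
model.features[3].register_forward_hook(grab("conv_1x1_a"))  # first 1x1 block

with torch.no_grad():
    model(torch.randn(1, 1, 128, 128))   # stands in for the class-3 face
for name, fmap in activations.items():
    print(name, tuple(fmap.shape))       # e.g. inception (1, 256, 128, 128)
```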
Prediction
We download some face images of young Asian women from the internet, as shown in Fig. 13. These face images are obtained in an unconstrained environment, with great differences in face shape, hair style, image resolution, background, and brightness.
We organize 30 male college students aged between 18 and 30 to rate the beauty of these face images. No additional constraints are imposed on them, as long as they complete the rating in the specified time; they may repeatedly modify their ratings of any image. The averaged ratings are listed in Table 7.
The first row in Table 7 gives the names of the rated images in Fig. 13. The second row is the average of all 30 evaluations. The last row rounds the mean score to the nearest whole number. The numbers 1 to 5 represent the facial beauty value, where 1 represents extremely unattractive and 5 extremely attractive. These face images are then aligned and normalized to 144 × 144 by the method in Section IV.1, as shown in Fig. 14.
The gray images in Fig. 14 are fed into our LDCNN model to obtain the prediction results shown in Table 8.
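A minimal inference sketch follows. Center-cropping the 144 × 144 gray image to the 128 × 128 network input at test time is an assumption (training uses random crops).

```python
import torch

def predict_beauty(model, gray144):
    """Predict the 1-5 beauty level of one aligned 144 x 144 gray image
    (uint8 NumPy array); center-cropping to 128 x 128 is an assumption."""
    x = torch.from_numpy(gray144).float().div(255.0)
    x = x[8:136, 8:136].unsqueeze(0).unsqueeze(0)   # 1 x 1 x 128 x 128
    model.eval()
    with torch.no_grad():
        logits = model(x)
    return int(logits.argmax(dim=1)) + 1            # class indices 0-4 -> 1-5
```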
We put the values of Tables 7 and 8 together in a histogram so that they can be compared and analyzed, as shown in Fig. 15. The blue bars represent the rated values, and the tangerine bars represent the values predicted by our LDCNN model. As can be seen from Fig. 15, the prediction values are strongly correlated with the rated values and show the same global beauty trend. Four face images are predicted exactly at the rated values, and the other four deviate only slightly. The prediction values fluctuate around the rated values, and in no case is a prediction extremely deviated from the rating.
The possible reasons for the inconsistency of prediction include: (1) the rated values in Table 7 are the average of the ratings of 30 raters rounded to an integer, so there is a certain distortion in the rated values; (2) the closer two face images are in beauty, the harder they are to distinguish, because the discernible details overlap and become more obscure. This may be because the nature of facial beauty is still not clear and lacks an objective definition, so we can only understand it from a subjective point of view. In our experiment, the face images fed into our LDCNN model come from the internet and are obtained in the real environment. Experiments on these images show that the prediction results are robust, indicating that our LDCNN model is correct and feasible for FBP.
Conclusions
This paper constructs a lighted deep convolution neural network with strengthened feature extraction for FBP, which uses the Inception model from GoogleNet as the first convolution layer to extract multi-scale face image features and improve the capability of feature extraction. In contrast to published CNN models such as GoogleNet, VGG, DeepID, etc., our LDCNN model obtains more compact face image features and reduces the model's parameters with MFM activation layers, which replace ReLU activation layers through a competition mechanism. Meanwhile, the 3 × 3 convolution kernels in our LDCNN improve the capability of feature extraction, and the 1 × 1 convolution kernels reduce network parameters and streamline the network framework. Compared with published CNN models, our LDCNN model requires less computation, has fewer parameters, and is suitable for embedded devices. Experimental results on LSAFBD show that our LDCNN achieves the best accuracy of 63.5%, outperforming the other published CNN models for FBP.