Introduction
There are various examples and applications of facial emotion recognition. In this work, we take one such application, student behavior detection, to evaluate the model and explain briefly how it works. There are several reasons why student behavior matters [1]. First, because student engagement varies greatly depending on the environment, it is imperative to monitor and understand it both within schools and inside classrooms. Second, academic achievement and the likelihood of graduating with a degree are directly correlated with student behavior. Teachers can support students in reaching their goals and improving their chances of success by encouraging positive behaviors. According to [2], university students do not exhibit any single behavior, since different problems require different approaches or solutions. Every term, teachers struggle to conceptualize the particular behavioral problems that need to be addressed in the classroom; sometimes students have issues that prevent them from acting appropriately in class. As noted in [3], teachers can successfully identify and meet the different needs of their students by analyzing data on student behavior. Schools can learn more about their students' mental well-being, including any indications of stress, anxiety, or emotional distress, by looking at their behavioral patterns. With this information, educators and other support personnel can make timely interventions or refer students to specialized services. Additionally, according to [4], the analysis of behavior data offers the chance to pinpoint pupils who might need extra help along the way. Teachers can adapt their teaching strategies to each student's needs by recognizing their behavior patterns and degree of participation. For example, if a student repeatedly demonstrates low involvement, teachers can investigate alternative teaching styles or create tailored learning plans to improve that student's academic success. Such analysis also gives a breakdown of typical student conduct and explains how a teacher can identify or understand a student's behavior. There are many works on analyzing student performance, such as [5], [6], [7], and [8], along with papers on how to use Artificial Intelligence in [9] and [10]. Building on this existing research in the educational field and in student performance analysis, this paper focuses on building a computer vision model of student behavior and deriving the outcomes of a session.
Contributions of this work:
To conduct an extensive review of literature on utilizing Computer Vision for detecting student behavior, examining existing methodologies and the field’s challenges.
To apply data augmentation to expand the training dataset's diversity and size, improving the model's ability to generalize to novel data, and to formulate and train a Convolutional Neural Network (CNN) model for emotion classification, integrating advanced techniques such as Local Binary Pattern (LBP) feature extraction to enhance accuracy.
To perform comprehensive testing and evaluation of the model, scrutinizing performance metrics like accuracy, precision, recall, and F1-score to gauge its efficiency.
To refine the model architecture based on evaluation outcomes, ensuring its effectiveness in practical scenarios.
Related Work
This section presents the background study related to our topic. We reviewed prior work on facial emotion recognition to strengthen our knowledge of this topic. Khanzada et al. [11] used an ensemble of deep learning models on FER2013 and achieved up to 75.8% accuracy, but error analysis remained difficult due to the subjective nature of emotions. Zahara et al. [12] used a CNN for real-time FER on a Raspberry Pi with FER-2013, but faced problems with dataset constraints and hardware considerations. Ramis et al. [13] used a cross-dataset approach with a CNN-based FER system, which gave up to 73.05% accuracy, but it had limited adaptability to recent deep learning approaches. The work in [14] used deep neural networks with cross-database validation on AffectNet and RAF-DB, but the proposed CNN network is not consistently scalable across Asian facial expressions. Huang et al. [15] used a deep neural network with cross-database validation, where the approach was limited by the generalizability of the model to Asian faces and by cultural differences in expression interpretation. Mellouk et al. [16] discussed dynamic kernels for facial expression recognition, but there is not much information on computational efficiency in real-time solutions. Perveen et al. [17] used a robust FER system with a modular approach and 2D Taylor expansion; it needs further work to enhance automatic facial expression recognition in diverse conditions. Sujata et al. [18] amended the representation module for improved FER, but there are limited insights into the scalability of the amended representation module across different datasets and conditions.
In [19], a global multi-scale and local attention network (MA-Net) was used, which focused on improving performance under challenging conditions and achieved state-of-the-art results. It includes a feature pre-extractor, a multi-scale module, and a local attention module. Zhao et al. [20] used an attentional CNN for facial expression recognition; the network demonstrates notable advancements across diverse datasets, and visualization techniques pinpointed the facial regions pivotal for emotion detection. Minaee et al. [21] investigated the role of technology in assessment practices within play-based kindergarten classrooms, highlighting positive views on technology integration for enhancing teaching and assessment practices. Danniels et al. [22] studied observational assessment in early care and education, including timing and content/format factors, and provide insights into the nuances of quality benchmarks in assessment practices. Thorpe et al. [23] introduced a Spatio-Temporal Attention (STA) module for enhancing 3D CNNs in action recognition and detection tasks; it addresses spatial and temporal variations in video frames and achieves state-of-the-art performance on benchmark datasets. The Residual Multi-Task Learning Framework (RMT-Net) [24] pioneered simultaneous facial landmark localization and expression recognition, utilizing a unique residual learning module.
Multi-Modal Facial Expression Recognition (MRAN) [25] uses color, depth, and thermal information for improved recognition; however, the specific challenges faced in multi-modal recognition are not discussed. Dynamic kernels with a universal Gaussian mixture model are used in [26], which presents a robust facial expression recognition system that achieves good accuracy and computational efficiency. However, it still has challenges with dataset constraints and hardware considerations, and the authors recommend optimizing the CNN architecture and incorporating more diverse datasets. The themes discovered in this review are:
Facial Expression Recognition (FER) Advancements
Real-world Applicability and Deployment
Cross-dataset Approaches and Challenges
Innovations in Model Architectures
Multi-modal Approaches for Enhanced Recognition
Challenges in Interpretability and Error Analysis
The overall architecture of the proposed work is depicted in Figure 1.
A. Datasets
One of the most widely used datasets for facial expression recognition is FER-2013 [27]. It contains 48-by-48 grayscale pictures labeled with seven distinct emotions: angry, disgust, fear, happy, neutral, sad, and surprise (as shown in Figure 2 and Figure 3). Preprocessing includes resizing the photos to a standard size.
B. CNN Model
A subclass of deep neural networks, convolutional neural networks (CNNs) have demonstrated remarkable performance in image classification tasks [28]. CNNs are particularly effective for image-based emotion classification because of their ability to automatically learn spatial hierarchies of features from input images [29]. In the CNN model employed in this work, convolutional layers are utilized for feature extraction, pooling layers are used to compress spatial dimensions, and fully connected layers are used for classification [30]. Because of this architecture, the model is well suited to problems such as emotion recognition, since it can derive intricate patterns and characteristics from the input images. The capability of CNNs to autonomously extract features at scale sets them apart from traditional ML algorithms such as SVMs and decision trees; this ability eliminates the need for manual feature engineering, which boosts efficiency [31]. CNNs can recognize and extract patterns and features from the input regardless of changes in location, orientation, scale, or translation because of the translation-invariant properties that the convolutional layers confer on them. Numerous pre-trained CNN designs have proven to perform exceptionally well, such as VGG-16, ResNet50, InceptionV3, and EfficientNet. Through a process called fine-tuning, these models can be adapted to new tasks with comparatively little data. Beyond image classification, CNNs are flexible and can be used in many other fields, including speech recognition, time series analysis, and NLP. When compared to conventional machine learning methods, the accuracy and robustness of emotion recognition using CNNs have been substantially enhanced [32]. The ability of artificial intelligence to bridge the gap between human and computer capabilities has grown rapidly, with researchers and practitioners alike concentrating on numerous aspects of the field; among these disciplines is computer vision [33].
Convolutional Neural Networks (CNNs) offer a powerful tool for image classification tasks, allowing models to be tailored to specific image categories by organizing images into labeled folders. Their wide-ranging impact is illustrated by Facebook's DeepFace model, which reached roughly 97% accuracy in facial recognition, approaching human-level ability [34]. Moreover, CNNs are making inroads in the medical field, aiding in early cancer detection and in diagnosing diseases such as typhoid from X-ray images; such examples showcase the technology's versatility and rapid development and should encourage practitioners to explore CNN algorithms in their own projects. In order to address the issue of human face recognition on a small original dataset, the research in [35] presents a technique that combines a CNN with an augmented dataset: several facial image alterations are used to expand the initial small dataset into a large one, allowing the CNN to extract facial features more efficiently and achieve greater face recognition accuracy on the enhanced face image dataset. Numerous tests and comparisons with several popular facial recognition techniques attest to the efficiency of the suggested method.
Deep Neural Networks show promise in Arabic OCR, achieving 98.46% accuracy on MNIST [36]. CNNs excel in face recognition, with a model achieving perfect classification on the UJ Face dataset after 80 epochs [37]. Regularization is crucial for deep network performance. Both tasks highlight the effectiveness of deep learning in complex recognition tasks.
C. Data Augmentation
One key strategy used in this work to artificially increase the quantity and diversity of the training dataset, and thereby improve the model's capacity for generalization, is data augmentation [38]. Data augmentation creates variation by generating altered versions of the original photos, which helps the model identify robust features and patterns linked to various emotional states [39]. More specifically, different instances of the original images are created by applying augmentation techniques such as rotation, flipping, scaling, and shearing [40]. Rotation turns the image by a certain angle, while flipping creates mirror images by reflecting the original images horizontally or vertically. Scaling modifies the size of the images, and shearing changes the shape of the images by shifting one part of the image more than another. These variations simulate different viewing angles, lighting conditions, and facial orientations, enabling the model to learn to recognize emotions under various circumstances [41].
The application of data augmentation is especially beneficial when the size of the training dataset is limited, as it effectively increases the dataset's size without additional data collection effort [42]. Moreover, by exposing the model to a wider variety of instances during training, data augmentation helps reduce the danger of overfitting and enhances the model's capacity to generalize to new data [43].
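As a concrete illustration, the following minimal Python sketch, assuming Keras's ImageDataGenerator and a placeholder file name, produces randomly rotated, shifted, sheared, zoomed, and flipped variants of a single face image:

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator, img_to_array, load_img

# One generator combining the transformation families described above.
augmenter = ImageDataGenerator(rotation_range=40, width_shift_range=0.2,
                               height_shift_range=0.2, shear_range=0.2,
                               zoom_range=0.2, horizontal_flip=True)

# "face.png" is a placeholder for any face image; it is resized to 48x48 grayscale.
face = img_to_array(load_img("face.png", color_mode="grayscale", target_size=(48, 48)))
flow = augmenter.flow(face[np.newaxis], batch_size=1)
variants = [next(flow)[0] for _ in range(5)]  # five randomly transformed copies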
D. Uniform Local Binary Pattern
Uniform Local Binary Pattern (uLBP) feature extraction is a texture descriptor used for feature extraction from images. LBP compares each pixel in an image with its neighboring pixels and encodes the result into a binary pattern. In this study, uniform LBP is used to capture local texture patterns in the facial images. This technique helps the model capture fine-grained texture details in the facial images, enhancing its ability to distinguish between different emotions. The extracted LBP features are then concatenated with the pixel intensities and fed into the CNN for emotion classification.
In order to extract features from facial photos, this work heavily relies on the extraction of Uniform Local Binary Pattern (uLBP) features, which serve as a texture descriptor [44]. Effective local texture pattern capturing is achieved by LBP by the comparison of individual pixels in a picture with their neighboring pixels, followed by the encoding of the resultant binary pattern [45]. To capture the fine-grained texture features in facial images which are crucial for differentiating between distinct emotional states, uniform LBP is specially used in the work [46].
By incorporating uniform LBP feature extraction, the model can better understand the nuanced variations in facial textures associated with different emotions, thus enhancing its ability to classify emotions accurately [47]. LBPs are far more potent than Haralick texture features; however, depending on the application (and the kind of LBP method employed), the enhanced discriminative power may be at the expense of computationally prohibitive and possibly exploding feature vector size [48].
Local Binary Pattern, or LBP for short, is, as its name implies, a local representation of an image. It is made up of relative values obtained by contrasting every pixel with its surrounding pixels [49]. LBP's two primary strengths are its low computational cost and its resistance to variations in the grayscale values of images. Many advancements have been made since the first concept in 1994. It is commonly used for texture segmentation, facial image recognition, and other image analysis applications, especially in non-deep-learning systems. LBP detects microstructures such as edges, lines, spots, and flat areas, whose distribution can be estimated by the histogram. For face identification, [50] suggests a new discriminative face representation obtained by Linear Discriminant Analysis (LDA) of multi-scale local binary pattern histograms. First, the face image is divided into many non-overlapping regions, and regional features are created by concatenating multi-scale uniform local binary pattern histograms in each region. The features are then projected into the LDA space to be used as a distinguishing facial descriptor. The system is implemented and evaluated for face verification on the XM2VTS database and for face identification on the standard FERET database, with very encouraging results.
Key steps in the process of implementing LBP for texture classification:
Loading the Image: An image is loaded from the disk using OpenCV.
Converting to Grayscale: The image is converted to grayscale since LBP operates on single-channel images.
Computing LBP: The LBP representation of the image is computed using the feature.local_binary_pattern function from the scikit-image library.
Visualizing the LBP Image: The resulting LBP image is displayed to show the texture features.
Histogram Calculation: A histogram of the LBP image is computed to serve as a texture descriptor. This histogram can then be used for classification purposes [51].
Pixel Comparison: For each pixel in the grayscale image, compare its intensity with its surrounding pixels. Typically, the surrounding pixels form a 3×3 grid centered on the pixel being processed.
Binary Pattern: Generate a binary pattern by setting the center pixel as the threshold. If a surrounding pixel's intensity is greater than or equal to the center pixel's intensity, set the corresponding binary digit to 1; otherwise, set it to 0. As shown in Figure 4.
Binary to Decimal Conversion: Convert the 8-bit binary pattern (formed by the 8 surrounding pixels) to a decimal value. As shown in Figure 5.
LBP Image Creation: Replace the center pixel’s value with the corresponding decimal value, creating a new image where each pixel value represents a local binary pattern. As shown in Figure 6.
Histogram Calculation: Compute a histogram of the LBP values across the image. Histogram captures the distribution of texture patterns and serves as a robust descriptor for texture classification.
The first step in constructing an LBP is to take the 8-pixel neighborhood surrounding a center pixel and threshold it to construct a set of 8 binary digits.
The 8-bit binary neighborhood of the center pixel is then converted into a decimal representation.
The calculated LBP value is then stored in an output array with the same width and height as the original image.
Uniform Local Binary Patterns (uLBP) are outstanding for a variety of computer vision tasks because they provide a strong and effective method for texture analysis and feature extraction. In order to encode the results into a binary pattern that represents local texture information, uLBP compares each pixel with its neighbors. This technique produces stable and discriminative representations by drastically reducing the dimensionality of the input while maintaining key features. The uniformity feature of uLBP ensures that only patterns with a small number of transitions—0 to 1 or 1 to 0—are taken into account, improving noise resistance and discriminative power. As shown in [52], uLBP captures small textural differences to deliver higher performance in facial recognition tests. Furthermore, [53] emphasizes the method’s capacity to identify drowsiness, where uLBP’s resilience to changes in illumination and occlusions is crucial. Moreover, [54] highlights uLBP’s adaptability in various applications and effectiveness in real-time processing scenarios because of its minimal yet efficient computational requirements.
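To make these steps concrete, the following Python sketch uses OpenCV and scikit-image's feature.local_binary_pattern (referenced above) to compute a uniform LBP image and its histogram descriptor; the file path is a placeholder, and the neighborhood parameters (24 sampling points, radius 3) are an assumption chosen so that the descriptor has 26 bins, matching the feature length used later in Method 3.

import cv2
import numpy as np
from skimage import feature

# Load the image and convert it to grayscale (LBP operates on single-channel images).
image = cv2.imread("face.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Uniform LBP: P sampling points on a circle of radius R. With method="uniform",
# codes range from 0 to P+1, giving P+2 = 26 histogram bins for P = 24.
P, R = 24, 3
lbp = feature.local_binary_pattern(gray, P, R, method="uniform")

# Histogram of LBP codes: a compact texture descriptor for classification.
hist, _ = np.histogram(lbp.ravel(), bins=np.arange(0, P + 3), range=(0, P + 2))
hist = hist.astype("float32")
hist /= (hist.sum() + 1e-7)  # normalize so the descriptor sums to one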
Proposed Methodology
Following is a detailed explanation of the procedure followed in each method.
A. Method 1: Convolutional Neural Networks
The architecture of the CNN model in this project is designed with multiple layers to extract features from input images and classify them into seven emotion categories. The initial layers consist of two convolutional layers, the first with 32 filters and the second with 64 filters, followed by a max pooling layer to reduce spatial dimensions. To prevent overfitting, a dropout layer is added, which randomly sets a fraction of input units to zero during training. The model then includes two additional convolutional layers, each with 128 filters, followed by another max pooling layer to further reduce spatial dimensions. After flattening the output, a dense layer with 1024 units is used for feature extraction, followed by another dropout layer for regularization. Finally, a dense layer with 7 units and a softmax activation function is used for multi-class classification into the seven emotion categories. The model has a total of 2,268,607 trainable parameters, which are adjusted during training to minimize the categorical cross-entropy loss function. The Adam optimizer is used for optimization during training. Refer to Figure 7 for a detailed tabular description of the CNN model used in this method.
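A minimal Keras sketch of this architecture is given below; the layer widths, loss, and optimizer follow the description above, while the kernel sizes, pooling windows, and dropout rates are assumptions, so the trainable-parameter count of the sketch will not match the reported 2,268,607 exactly.

from tensorflow.keras import layers, models, optimizers

# Sketch of the Method 1 CNN; 3x3 kernels, 2x2 pooling, and the dropout rates
# are illustrative assumptions rather than the paper's exact configuration.
model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),            # 48x48 grayscale face images
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),                        # regularization against overfitting
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.Conv2D(128, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(1024, activation="relu"),       # feature-extraction layer
    layers.Dropout(0.5),
    layers.Dense(7, activation="softmax"),       # seven emotion categories
])
model.compile(optimizer=optimizers.Adam(),
              loss="categorical_crossentropy", metrics=["accuracy"])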
B. Method 2: With Data Augmentation
Input: Facial Emotion Recognition dataset.
Output: Student behavior detection model.
Begin
  for img in dataset:
    Data_augmentation(img)
  Remove_duplicates(augmented_imgs)
  Rescale(augmented_imgs)
  Dataset.append(augmented_imgs)
  CNN_Model2(dataset)  // Architecture of CNN_Model2 is the same as CNN_Model1
  Train_and_compile(CNN_Model2)
  Test(CNN_Model2)
End
Data augmentation was implemented on each emotion class within the FER-2013 dataset using the ImageDataGenerator class from the Keras library. The augmentation parameters were selected to introduce realistic variations in the images, enhancing the model's robustness and mitigating overfitting. A concise overview of the augmentation applied to each emotion class:
1. Angry: 3,500 images were augmented for the "angry" class. Augmentation included random rotation (40 degrees), horizontal and vertical shifts (20% of the image dimensions), shear transformations (20 degrees), zooming (20%), and horizontal flipping. These transformations aimed to simulate diverse facial expressions of anger.
2. Disgust: 6,500 images were augmented for the "disgust" class. Augmentation included random rotations (30 degrees), horizontal and vertical shifts (10% of the image dimensions), shear transformations (10 degrees), zooming (10%), and horizontal flipping. These variations aimed to capture various facial expressions associated with disgust.
3. Fear: 3,500 images were augmented for the "fear" class. Augmentation included random rotations (40 degrees), horizontal and vertical shifts (20% of the image dimensions), shear transformations (20 degrees), zooming (20%), and horizontal flipping. These transformations aimed to simulate diverse fearful expressions.
4. Neutral: 3,500 images were augmented for the "neutral" class. Augmentation included random rotations up to 40 degrees, horizontal and vertical shifts (20% of the image dimensions), shear transformations (20 degrees), zooming (20%), and horizontal flipping. These variations aimed to capture different neutral facial expressions.
5. Sad: 3,500 images were augmented for the "sad" class. Augmentation included random rotations (40 degrees), horizontal and vertical shifts (20% of the image dimensions), shear transformations (20 degrees), zooming (20%), and horizontal flipping. These transformations aimed to simulate various sad facial expressions.
6. Surprise: 3,500 images were augmented for the "surprise" class. Augmentation included random rotations (40 degrees), horizontal and vertical shifts (20% of the image dimensions), shear transformations (20 degrees), zooming (20%), and horizontal flipping. These transformations aimed to capture different surprised facial expressions.
Overall, data augmentation played a critical role in enhancing the dataset's diversity, potentially leading to a more robust and accurate model for facial expression detection.
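A minimal sketch of how this per-class augmentation could be configured is shown below; the directory layout, the target counts and settings for the remaining classes, and the helper names are illustrative assumptions rather than the exact training pipeline.

import os
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator, img_to_array, load_img

# Per-class augmentation settings reported above ("disgust" uses milder transforms);
# "fear", "neutral", "sad", and "surprise" reuse the "angry" settings.
AUG_PARAMS = {
    "angry":   dict(rotation_range=40, width_shift_range=0.2, height_shift_range=0.2,
                    shear_range=0.2, zoom_range=0.2, horizontal_flip=True),
    "disgust": dict(rotation_range=30, width_shift_range=0.1, height_shift_range=0.1,
                    shear_range=0.1, zoom_range=0.1, horizontal_flip=True),
}
TARGETS = {"angry": 3500, "disgust": 6500}  # augmented images to generate per class

def augment_class(src_dir, dst_dir, params, n_target):
    """Write n_target augmented variants of the images found in src_dir to dst_dir."""
    os.makedirs(dst_dir, exist_ok=True)
    gen = ImageDataGenerator(**params)
    files = [f for f in os.listdir(src_dir) if f.lower().endswith((".png", ".jpg"))]
    for i in range(n_target):
        path = os.path.join(src_dir, files[i % len(files)])
        img = img_to_array(load_img(path, color_mode="grayscale", target_size=(48, 48)))
        next(gen.flow(img[np.newaxis], batch_size=1, save_to_dir=dst_dir,
                      save_prefix="aug", save_format="png"))

for emotion, params in AUG_PARAMS.items():
    augment_class(f"fer2013/train/{emotion}", f"fer2013_aug/train/{emotion}",
                  params, TARGETS[emotion])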
Table 1 shows the result of data augmentation on the dataset: it enlarged the dataset from 28,709 images to 55,956 images.
The CNN model used in Method 1 is also used in Method 2, data augmentation being the only variation. Refer to Figure 7 for a detailed tabular description of the CNN model used in this method.
C. Method 3: Data Augmentation with a Combination of LBP Feature Extraction and CNN
After data augmentation, the dataset became much larger, which led to overfitting. To overcome this problem and to reduce computation runtime, the splitfolders utility is used to further split the training dataset into train, test, and validation subsets for better results. The results of splitfolders are shown in Table 2.
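A minimal sketch of such a split, assuming the community split-folders package and illustrative directory names (the exact split ratio is not stated in the text, so 80/10/10 is an assumption):

import splitfolders  # pip install split-folders

# Split the augmented, class-foldered training images into train/val/test subsets.
splitfolders.ratio("fer2013_aug/train", output="fer2013_split",
                   seed=42, ratio=(0.8, 0.1, 0.1))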
Input: Facial Emotion Recognition dataset.
Output: Student behavior detection model.
Begin
  for img in dataset:
    Data_Augmentation(img)
    Rescale(img)
  LBP_Feature_Extraction(dataset):
    feature_extraction_method = "LBP"
    Convert(img, grayscale)
    for pixel in img:
      Extract_LBP(pixel)
  CNN_Model3(dataset)
  Train_and_compile(CNN_Model3)
  Test(CNN_Model3)
End
Here, N is the length of the LBP feature vector, i.e., 26.
Input layers are created for images of size (48, 48, 1) and LBP features of length N.
Two convolutional layers with 32 and 64 filters, respectively, are followed by max pooling and dropout layers.
Another convolutional layer with 128 filters is added, followed by max pooling and dropout layers.
The output is flattened and passed through a dense layer with 1024 units for the image features.
The LBP features are processed through a dense layer with 1024 units before being concatenated with the image features.
The concatenated features pass through additional dense layers with 512 and 256 units before the output layer with 7 units for softmax classification.
Model developed for CNN with uniform LBP feature extraction can be briefly understood by referring to Figure 8.
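A minimal Keras functional-API sketch of this two-branch design is given below; the layer widths follow the description above, while the kernel sizes, dropout rates, and variable names are assumptions.

from tensorflow.keras import layers, models

N_LBP = 26  # length of the uniform LBP feature vector described above

# Image branch: convolutional feature extractor for 48x48 grayscale faces.
img_in = layers.Input(shape=(48, 48, 1), name="image")
x = layers.Conv2D(32, (3, 3), activation="relu")(img_in)
x = layers.Conv2D(64, (3, 3), activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Dropout(0.25)(x)
x = layers.Conv2D(128, (3, 3), activation="relu")(x)
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Dropout(0.25)(x)
x = layers.Flatten()(x)
x = layers.Dense(1024, activation="relu")(x)

# LBP branch: dense projection of the 26-bin histogram.
lbp_in = layers.Input(shape=(N_LBP,), name="lbp")
y = layers.Dense(1024, activation="relu")(lbp_in)

# Fuse both branches and classify into the seven emotion categories.
z = layers.concatenate([x, y])
z = layers.Dense(512, activation="relu")(z)
z = layers.Dense(256, activation="relu")(z)
out = layers.Dense(7, activation="softmax")(z)

model = models.Model(inputs=[img_in, lbp_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])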
Results and Discussion
Models built are evaluated and assessed using standard performance metrics. Here is some basic information about each of these metrics with relevant computational equations.
Accuracy
The proportion of accurately predicted occurrences to all instances in the dataset is known as accuracy. It gauges how accurate the model is overall.
\begin{equation*} Accuracy = \frac {a+b}{a+b+c+d}\end{equation*}
where
a: True Positive
b: True Negative
c: False Positive
d: False Negative
Precision
The ratio of correctly predicted positive instances to all predicted positive instances is known as precision. It gauges how accurate the positive predictions are.
\begin{equation*} Precision = \frac {a}{a+c}\end{equation*}
where
a: True Positive
c: False Positive
Recall
Recall is the ratio of correctly predicted positive instances to the actual positive instances in the dataset. It measures the ability of the model to identify all relevant instances. Recall is useful when the cost of false negatives is high. For example, in medical diagnosis, we want to minimize false negatives to ensure that all patients with a disease are correctly identified.
\begin{equation*} Recall = \frac {a}{a+d}\end{equation*}
where
a: True Positive
d: False Negative
F1-Score
The harmonic mean of recall and precision is known as the F1-score. It offers a balance between recall and precision.
\begin{equation*} \text{F1-Score} = \frac {2 \times (p \times r)}{p+r}\end{equation*}
where
p: Precision
r: Recall
Confusion Matrix
A table known as a confusion matrix is widely employed to illustrate how well a classification model performs on a set of data for which the true values are known. The confusion matrix provides a summary of the predictions made by a classification model and is particularly useful for understanding the types of errors the model makes, such as false positives and false negatives. From it one can extract performance measures such as True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN), accuracy, precision, recall (sensitivity), F1-score, and the misclassification (error) rate. Together, these quantities help in understanding how well a model is performing.
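As a small illustration, the confusion matrix and the per-class metrics above can be computed from a model's predictions with scikit-learn; the label and prediction arrays below are placeholders standing in for the real evaluation pipeline.

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# y_true: integer emotion labels of the test set; y_pred: argmax of the model's
# softmax outputs. Both are placeholder arrays for illustration only.
y_true = np.array([0, 3, 3, 6, 2])
y_pred = np.array([0, 3, 5, 6, 2])

labels = ["angry", "disgust", "fear", "happy", "neutral", "sad", "surprise"]
print(confusion_matrix(y_true, y_pred, labels=list(range(7))))
print(classification_report(y_true, y_pred, labels=list(range(7)),
                            target_names=labels, zero_division=0))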
A. Analysis of Results:
1) Method 1: CNN
The confusion matrix of the CNN model built in Method 1 is shown in Figure 9, and the corresponding classification report is shown in Table 3. With an overall accuracy of only 17%, the model correctly identified only 17% of the 7,178 photos. This emphasizes how urgently significant improvements are needed in order to discriminate between different emotions. Performance differed significantly between the emotion classes, with class 3 (happy) showing better results (precision: 24%, recall: 25%, F1-score: 24%) than class 1 (disgust), where all metrics were close to 1%. This shows that the model has trouble telling one emotion from another, such as disgust. The model showed low precision and recall across most emotion classes, suggesting that it missed many real instances of specific emotions (low recall) in addition to misclassifying a large number of images (low precision). Despite the support values indicating a balanced dataset with a similar number of images per class, the consistently poor performance across most classes suggests potential issues with the model's architecture or training process. In conclusion, the basic CNN model performed inadequately in this student emotion recognition task, exhibiting low overall accuracy and struggling to precisely identify most emotion classes.
2) Method 2: CNN With Data Augmentation
The confusion matrix of the CNN Model with Data Augmentation built in method 2 is shown in Figure 10. and the Corresponding Classification Report is shown in Table 4.
Data augmentation improved the model's accuracy greatly, raising it from the very low baseline of Method 1 to 85%. This indicates that 85% of the 7,178 photos in the test set are now accurately classified by the model. According to the report, performance was largely matched across all emotion classes (0-angry to 6-surprise), with roughly 85% precision, recall, and F1-score. This implies that the model can now accurately distinguish between the various emotions. Performance does, however, differ slightly across classes: for example, class 0 (angry) has 83% precision, 90% recall, and an 86% F1-score, while class 3 (happy) has 89% precision, 86% recall, and an 87% F1-score. This implies that the model may be marginally more adept at identifying happy expressions than angry ones. Overall, data augmentation successfully addressed the shortcomings of the prior model, yielding a high degree of overall accuracy and well-balanced performance in the majority of emotion classes.
3) Method 3: CNN With Data Augmentation and Uniform LBP Feature Extraction
The confusion matrix of the CNN model built in Method 3 is shown in Figure 11, and the corresponding classification report is shown in Table 5. With an overall accuracy of 95%, the model performs exceptionally well, correctly categorizing 95% of the 11,197 photos in the test set. This is a big step up from the earlier techniques that only used a basic CNN model or data augmentation. All emotion classes (0-angry to 6-surprise) had consistently high precision, recall, and F1-scores (around 95%), suggesting balanced accuracy across emotions and avoiding bias towards any particular class. Class 2 (fear) had a precision of 94%, recall of 96%, and an F1-score of 95%; class 5 (sad) had a precision of 96%, recall of 94%, and an F1-score of 95%. These slight differences in performance between the classes are insignificant. All in all, the model's consistency in identifying every emotion expressed by students is impressive. In summary, the combination of data augmentation with uniform LBP feature extraction is a powerful method for accurately identifying student emotions through the successful extraction of pertinent features from students' facial expressions.
The overall comparison between all three models used in this work is presented in Table 6, and Table 7 shows the results compared with other state-of-the-art methods. As we go from a basic CNN model to more sophisticated methods, the accuracy of emotion recognition clearly improves, as Table 6 demonstrates. At 17%, the accuracy of the most basic CNN model is the lowest. This is greatly enhanced with data augmentation, with which the model achieves an accuracy of 85%. The most sophisticated approach obtains the highest accuracy of 95% by combining LBP feature extraction with data augmentation. This implies that utilizing LBP in conjunction with data augmentation to extract local pattern information from facial photos is a potent method for recognizing student emotions.
Three distinct models created for the project on student behavior detection using computer vision are compared in the findings summary table. A CNN with a Data Augmentation Model, a CNN with Data Augmentation and LBP Feature Extraction Model, and a Simple CNN Model for Emotion Classification are among the models. The training and testing accuracy attained by each model is displayed in the table.
Training accuracy of 0.8989 and testing accuracy of 0.6258 were attained by the Simple CNN Model. Given that the test accuracy is much lower than the training accuracy, this model most likely suffered from overfitting. This suggests that the model did not generalize well to previously unseen data, instead memorizing the training set. Data augmentation was then used to address this problem.
Training accuracy was increased to 0.9224 and testing accuracy to 0.8587 with the CNN with Data Augmentation Model. Data augmentation is a popular approach used to artificially increase the size of a dataset by producing altered versions of its images, in order to increase the model's capacity to generalize. Testing accuracy improved significantly, indicating that data augmentation reduced overfitting and enhanced model performance.
Ultimately, the CNN with Data Augmentation and LBP Feature Extraction Model had the best testing and training accuracy, coming in at 0.9482 and 0.9655, respectively. The performance of the model was significantly improved by the use of LBP feature extraction. Texture patterns are vital for tasks like emotion recognition, and LBP is a texture descriptor that captures the local structure of an image. The excellent testing accuracy shows that the LBP feature-based model caught key elements for emotion classification and adapted well to new data.
The authors of [11] used ensemble deep learning models on FER2013 and achieved an unprecedented 75.8% accuracy, with practical applicability demonstrated through a mobile web app with rapid recognition speed; interpretability methods shed light on the relevant facial features, and the ensemble surpassed human-level accuracy. In [12], the authors used the FER-2013 dataset to develop a prediction system for micro-expressions using a Convolutional Neural Network (CNN) algorithm implemented on a Raspberry Pi; real-time prediction on the Raspberry Pi achieved 65.97% accuracy. This model handles facial position, distance, and photo recognition well; however, the paper lags in optimizing the model and dataset. The paper [13] uses a cross-dataset approach to enhance generality and performance. Their system achieved accuracies ranging from 31.56% to 61.78% on single-dataset tests, which improved up to 73.05% using the cross-dataset approach. Additionally, they compared the CNN's performance with human participants, who achieved 83.53% accuracy, highlighting a correlation between human and system results. The paper [51] utilizes transfer learning with pre-trained models (ResNet-50, Xception, EfficientNet-B0, Inception, DenseNet-121) to prevent overfitting by leveraging large-scale datasets and fine-tuning them for emotion recognition. It uses a combination of freezing layers of pre-trained models and fine-tuning them, which is effective for most models, and reports significant accuracy improvements with transfer learning. All models averaged 59% accuracy when trained on AffectNet; Xception was the most accurate (61%), followed sequentially by Inception (60%), EfficientNet-B0 (59%), ResNet-50 (58%), and DenseNet-121 (57%). The key difference from our work lies in the reliance on transfer learning and pre-trained models versus the development of a custom model with specific feature extraction and augmentation techniques.
Conclusion and Future Work
Our research presents an innovative approach to leveraging computer vision for detecting student behavior, particularly focusing on emotion recognition. By employing Convolutional Neural Networks (CNNs) for emotion classification, implementing data augmentation to diversify the dataset, and integrating Uniform Local Binary Pattern (LBP) feature extraction for enhanced texture analysis, we have developed a system capable of automatically identifying student emotions in real-time. This system provides educators with valuable insights into student engagement, allowing them to adapt their teaching methods accordingly. Our findings indicate that our approach achieves high accuracy in classifying student emotions. While the Simple CNN Model initially showed effectiveness, it displayed signs of overfitting, which we mitigated through data augmentation. The CNN with Data Augmentation Model notably improved performance, highlighting the efficacy of data augmentation in reducing overfitting and improving model generalization. Additionally, the CNN with Data Augmentation and LBP Feature Extraction Model further enhanced performance, underscoring the importance of texture analysis in emotion recognition tasks. Our study contributes to educational technology by offering a reliable, automated method for monitoring student behaviors and emotions, potentially fostering more dynamic and engaging learning environments. By providing real-time feedback on student emotions, our system enables educators to personalize their teaching strategies to better meet students’ needs. Overall, our research demonstrates the potential of computer vision techniques in enhancing the teaching and learning process, suggesting avenues for future exploration to enhance emotion recognition accuracy and broaden the capabilities of educational technology in supporting student engagement and learning.
The approach in this research paper differs from other papers that utilized Convolutional Neural Networks (CNN) and Local Binary Patterns (LBP) in the following ways:
An innovative fusion of techniques: The paper introduces a method that combines CNNs for emotion classification, data augmentation to enhance dataset diversity and size, and Uniform Local Binary Pattern (uLBP) feature extraction for improved texture analysis. This fusion represents a fresh approach to FER, leveraging the strengths of each method to achieve high accuracy in emotion classification.
Incorporation of data augmentation: The research addresses the challenge of model overfitting by implementing data augmentation to diversify the dataset, improving the model's ability to generalize to novel data. This effectively mitigates overfitting, a common issue in CNN-based models, setting this approach apart from others that may not have explicitly addressed this challenge.
Integration of uniform LBP: The paper specifically incorporates Uniform Local Binary Pattern (uLBP) feature extraction for enhanced texture analysis, a departure from traditional LBP feature extraction that offers a more refined and nuanced capture of texture patterns.
Emphasis on real-time insights into student emotions: The paper places significant focus on providing educators with real-time insights into student emotions, enabling them to adapt their teaching methods accordingly. This practical emphasis distinguishes it from papers that focus solely on technical aspects without a direct application to educational settings.