Introduction
Since the onset of the COVID-19 pandemic, there has been a heightened focus on the use of face masks [1]. Masks not only help prevent respiratory infectious diseases such as COVID-19, avian influenza, and tuberculosis, but also protect wearers by blocking the inhalation of harmful particles in everyday situations with severe air pollution [2]. However, heavy pedestrian traffic in public places and the small size of masks make manual monitoring of mask compliance difficult [3]. Advances in computer vision have enabled deep learning-based mask detection systems, which play an increasingly important role in disease control in public environments and dust mitigation in industrial environments [4].
Deep learning-based target detection algorithms fall into two groups according to the detection method. The first is two-stage target detection, represented by R-CNN [5], Fast R-CNN [6], Faster R-CNN [7], and Mask R-CNN [8]. These algorithms first generate candidate regions for potential targets and then classify and regress those regions through the network layers. They achieve higher detection accuracy in some cases, but their training and inference are slow, ruling them out of scenarios that require real-time performance, such as mask detection. The second group is one-stage target detection, represented by SSD [9] and the YOLO series (YOLO [10], YOLO9000 [11], YOLOv3 [12], YOLOv4 [13]), which directly predicts the classification probabilities and coordinates of detected targets and is therefore faster than two-stage methods, although one-stage methods tend to generate a large number of candidate boxes per image. Zhou et al. [14] proposed the idea of "objects as points", which represents a target by its center point and directly predicts its position and class, but this representation is relatively weak under occlusion. Against this background, YOLOv5 has attracted wide attention: it makes significant progress in both speed and accuracy, gives new impetus to one-stage target detection, and can meet the requirements of occlusion-heavy detection scenarios such as mask wearing.
Significant research effort has been devoted to the object detection algorithm YOLOv5, primarily aimed at improving its detection accuracy, efficiency, and lightweight design. H. Haibing et al. introduced a fast Spatial Pyramid Pooling (SPP) structure, SimSPPF, and incorporated an attention block into the backbone to increase the model's speed [15]. Lu et al. optimized the YOLOv5s network by introducing depthwise convolution, using an improved ShuffleNet V2 backbone for feature extraction, and adding an attention mechanism to the backbone; they then used a Path Aggregation Network (PAN) and Feature Pyramid Network (FPN) neck to improve feature fusion. Their results show gains in detection accuracy and model compactness over YOLOv5s [16], but replacing the backbone with ShuffleNet costs the model important representational capacity. Roy et al. [17] integrated DenseNet blocks into the backbone to improve the preservation and reuse of key feature information, and added a CBAM attention mechanism with an extra feature fusion layer and a Swin-Transformer Prediction Head (SPH) to detect multi-scale objects more efficiently; however, their crack detection is affected by occlusion, which lowers accuracy. Huang et al. [18] integrated an Efficient Channel Attention (ECA) mechanism into the C3 blocks of the backbone and introduced a Swin Transformer Block into the last C3 block; the shifted-window partitioning alleviates missed detections of small objects; finally, a Task-Specific Context Decoupling (TSCODE) head balances classification and regression to better exploit different contextual details. The resulting model, however, is not very lightweight. Jiang et al. [19] replaced the original CIOU loss with a Focal-SIOU loss to accelerate convergence and improve detection accuracy, introduced a multi-head self-attention module into the backbone to capture long-range dependencies, and added a shuffle-attention module in the neck to strengthen the fusion of spatial and channel features; finally, the 1/32 downsampling branch of the neck and the corresponding large-object detection head were removed. However, they did not consider the model's lightweight design. Kowalczyk et al. [20] addressed automatic mask detection to reduce the spread of viruses, using YOLOv5 to classify and localize masks in images.
While the previously mentioned models achieve better results, there is still room for improvement. Challenges remain in detection accuracy, low-light performance, and missed and false detections in mask recognition, as well as in real-time processing and deployment complexity. To enable real-time mask detection and recognition, this paper proposes a new YOLOv5-based lightweight model, YOLOv5-S2C2. The main improvements are described in the following contributions:
In YOLOv5-S2C2, we propose an improved FasterNet block. An Efficient Multi-scale Attention (EMA) mechanism is added to the residual branch of this block to preserve the feature information of each channel, forming EMA-FasterNet. The backbone network of YOLOv5 is then replaced by EMA-FasterNet. This makes the model lighter, reduces computation, and speeds up inference;
Secondly, we discuss the influence of placing DepthSepConv at different C3 positions and choose the best combination, replacing part of the C3 modules in the Neck with DepthSepConv to compress the model further while keeping a high mAP;
Finally, Soft NMS replaces the NMS in YOLOv5. Instead of deleting overlapping boxes, Soft NMS merely reduces their confidence, which prevents missed and false detections and enhances detection accuracy.
Basic Network
A. YOLOv5
The structure of YOLOv5 consists of Input, Backbone, Neck, and Head.
In the data input phase of YOLOv5, Mosaic augmentation randomly crops four images and combines them into one, enriching the image background information and allowing the model to process information from four images simultaneously. Additionally, YOLOv5 uses adaptive anchor computation and adaptive image scaling to automatically determine the optimal scaling factor for each image. The effect of this data augmentation is illustrated in Fig. 1.
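To make the augmentation concrete, the following is a minimal sketch of the four-image tiling idea. It is not YOLOv5's actual implementation, which additionally samples a random center point, scales each tile, and remaps the box labels; those steps are omitted here.

```python
import numpy as np

def mosaic4(imgs, size=640):
    """Minimal Mosaic sketch: tile four HxWx3 uint8 images into one canvas."""
    canvas = np.full((2 * size, 2 * size, 3), 114, dtype=np.uint8)  # gray fill
    offsets = [(0, 0), (0, size), (size, 0), (size, size)]  # TL, TR, BL, BR
    for img, (y0, x0) in zip(imgs, offsets):
        h, w = min(img.shape[0], size), min(img.shape[1], size)
        canvas[y0:y0 + h, x0:x0 + w] = img[:h, :w]  # crop each tile to fit
    return canvas
```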
Within the backbone, the Focus, C3, and SPP structures process the input image. The Focus module slices the input, which augments the number of feature channels while reducing feature size, floating-point operations, and layer count, ultimately boosting operational speed. The SPP structure utilizes multiple max-pooling operations to generate a consistent output size for diverse input features, contributing to scale invariance across input images of varying scales and aspect ratios.
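A minimal PyTorch sketch of the SPP idea is given below; the kernel sizes (5, 9, 13) follow the common YOLOv5 configuration, and the BN/activation layers of the full module are omitted for brevity.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """SPP sketch: stride-1 max-pools at several kernel sizes (padding k//2
    keeps the spatial size fixed), concatenated with the input features."""
    def __init__(self, c_in, c_out, ks=(5, 9, 13)):
        super().__init__()
        c_hid = c_in // 2
        self.cv1 = nn.Conv2d(c_in, c_hid, 1, bias=False)  # channel reduction
        self.pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2) for k in ks)
        self.cv2 = nn.Conv2d(c_hid * (len(ks) + 1), c_out, 1, bias=False)  # fuse

    def forward(self, x):
        x = self.cv1(x)
        return self.cv2(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```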
The essential components of the Neck are the FPN and PAN, which play a central role in improving the network's ability to recognize objects at different scales. The FPN conveys high-level semantic features from deeper layers to shallower layers, while the path aggregation network transfers precise positional features from shallow layers to deeper layers, improving the feature fusion capability.
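This two-way fusion can be sketched for a single pair of scales as follows; the toy example assumes both feature maps have the same channel count and a 2x spatial ratio, which simplifies the real Neck considerably.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNPANFuse(nn.Module):
    """Toy two-scale FPN+PAN fusion: semantics flow top-down (FPN), then
    localization cues flow back bottom-up (PAN)."""
    def __init__(self, c):
        super().__init__()
        self.fuse_shallow = nn.Conv2d(2 * c, c, 1)           # after top-down concat
        self.down = nn.Conv2d(c, c, 3, stride=2, padding=1)  # bottom-up downsample
        self.fuse_deep = nn.Conv2d(2 * c, c, 1)              # after bottom-up concat

    def forward(self, p_shallow, p_deep):
        top_down = F.interpolate(p_deep, scale_factor=2, mode="nearest")
        p_shallow = self.fuse_shallow(torch.cat([p_shallow, top_down], 1))
        p_deep = self.fuse_deep(torch.cat([p_deep, self.down(p_shallow)], 1))
        return p_shallow, p_deep
```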
The output side of the Head typically includes a regression loss function and NMS. The regression loss (CIOU in YOLOv5) is particularly effective on disjoint bounding boxes, a case that the previous YOLO series had difficulty optimizing.
YOLOv5 controls the model's size through depth and width multipliers, yielding four models: v5s, v5m, v5l, and v5x. In this paper, YOLOv5s, the shallowest and lightest of these, is chosen as the base model; it still leaves considerable room for optimizing volume and memory consumption. The structure is shown in Fig. 2 [21].
B. FasterNet
FasterNet introduces a novel convolution method called Partial Convolution (PConv) that efficiently extracts spatial features while minimizing redundant computations. This is shown in Fig. 3 [22].
For a standard convolution, the theoretical FLOPs are calculated as follows:\begin{equation*} FLOPs_{Conv} =h\times w\times k^{2}\times c^{2} \tag{1}\end{equation*} where $h$ and $w$ are the height and width of the feature map, $k$ is the kernel size, and $c$ is the number of channels.
The FLOPs of PConv are:\begin{equation*} FLOPs_{PConv} =h\times w\times k^{2}\times c_{p}^{2} \tag{2}\end{equation*} where $c_{p}$ is the number of channels actually convolved.
The memory access count of PConv is likewise dominated by the feature maps of the $c_{p}$ channels, since the weight term is comparatively negligible:\begin{equation*} h\times w\times 2c_{p} +k^{2}\times c_{p}^{2} \approx h\times w\times 2c_{p} \tag{3}\end{equation*}
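A minimal PyTorch sketch of PConv follows. The split ratio $c_p = c/4$ is the FasterNet default, under which Eqs. (1)-(2) give a FLOPs ratio of $(c_p/c)^2 = 1/16$ relative to a regular convolution.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial Convolution sketch: convolve only the first c_p channels and
    pass the remaining c - c_p channels through untouched."""
    def __init__(self, c, ratio=4, k=3):
        super().__init__()
        self.c_p = c // ratio
        self.conv = nn.Conv2d(self.c_p, self.c_p, k, padding=k // 2, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.c_p], x[:, self.c_p:]  # convolved / untouched split
        return torch.cat([self.conv(x1), x2], dim=1)
```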
PConv serves as the base operator of the FasterNet block, and the FasterNet network consists of four stages of such blocks. Each FasterNet block contains a PConv layer and two 1x1 PWConv layers, with a BN layer and a ReLU activation between them. Together these layers form an inverted residual block, with an expanded number of channels in the middle layer and a shortcut that reuses the input features. FasterNet offers several variants, namely FasterNet-T0/1/2, FasterNet-S, FasterNet-M, and FasterNet-L, which share the same structure but differ in depth and width.
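Continuing the PConv sketch above, a FasterNet block can be outlined as below; the expansion factor of 2 for the middle channels follows the FasterNet default, and the BN/ReLU placement follows the description above.

```python
import torch.nn as nn

class FasterNetBlock(nn.Module):
    """FasterNet block sketch: PConv, then two PWConvs forming an inverted
    residual (expanded middle channels) with a shortcut reusing the input."""
    def __init__(self, c, expansion=2):
        super().__init__()
        c_mid = c * expansion
        self.pconv = PConv(c)  # defined in the sketch above
        self.mlp = nn.Sequential(
            nn.Conv2d(c, c_mid, 1, bias=False),
            nn.BatchNorm2d(c_mid),
            nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c, 1, bias=False),
        )

    def forward(self, x):
        return x + self.mlp(self.pconv(x))  # residual shortcut
```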
In this paper, we prioritize model lightness by choosing FasterNet-T0 as the backbone network of the research model. The FasterNet-T0 structure is shown in Table 1.
Improved YOLOv5 Model
A. Improved FasterNet
Choosing FasterNet-T0 greatly compresses computation and memory accesses, thus increasing processing speed. However, detection accuracy decreases along with the computation and parameter count. To give the model better detection performance, this paper improves the FasterNet block by adding the EMA attention mechanism to its residual branch.
In the EMA mechanism, the 1x1 convolution component is inherited from the Coordinate Attention (CA) module and constitutes the "1x1 branch", whose structure is shown in Fig. 4(a). To integrate multi-scale spatial structure information, EMA adds a parallel "3x3 branch", shown on the right side of Fig. 4(b). The global spatial information within the 1x1 and 3x3 branches is encoded by two-dimensional global average pooling, and, for efficient computation, a combination of the Softmax function and a 2D Gaussian mapping is used to fit the linear transformation of the pooled output.
Within each group, the final output feature map is computed by aggregating the two spatial attention weight maps and applying a sigmoid function. This process efficiently captures pixel-level pairwise relationships and emphasizes the global context of all pixels [23]. EMA thus preserves the feature information of each channel; its structure is shown in Fig. 4(b).
The improved FasterNet block is shown in Fig. 6; it retains more channel feature information and thereby improves the detection performance of the model.
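One plausible reading of this modification, continuing the block sketch above, is given below. Here `EMA` stands in for the Efficient Multi-scale Attention module of [23] (its 1x1/3x3 branches and cross-spatial aggregation are not reproduced), and its exact placement on the residual branch is our assumption based on Fig. 6.

```python
class EMAFasterNetBlock(FasterNetBlock):
    """Improved FasterNet block sketch: EMA attention reweights the residual
    branch before the shortcut addition, preserving per-channel information."""
    def __init__(self, c, expansion=2):
        super().__init__(c, expansion)
        self.ema = EMA(c)  # assumed module implementing the mechanism of [23]

    def forward(self, x):
        return x + self.ema(self.mlp(self.pconv(x)))  # EMA on the residual path
```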
B. Fusion Depth Separable Convolution
The DepthSepConv structure is shown in Fig. 5. It consists of a Depthwise Convolution (DWConv) and a 1x1 Pointwise Convolution (PWConv). Each DWConv kernel operates on a single channel, so the convolved output feature maps remain "thin" and the channels are not mixed. PWConv then combines the independent channels, aggregating information across them. Both DWConv and PWConv are followed by batch normalization to prevent overfitting, and a ReLU activation provides non-linear expressiveness. Together they substantially reduce computation and parameters [24].
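A minimal PyTorch sketch of the DepthSepConv structure described above:

```python
import torch.nn as nn

class DepthSepConv(nn.Module):
    """Depthwise-separable convolution sketch: per-channel k x k DWConv, then
    a 1x1 PWConv that mixes channels; each followed by BN and ReLU."""
    def __init__(self, c_in, c_out, k=3, stride=1):
        super().__init__()
        self.dw = nn.Sequential(
            nn.Conv2d(c_in, c_in, k, stride, k // 2, groups=c_in, bias=False),  # one kernel per channel
            nn.BatchNorm2d(c_in),
            nn.ReLU(inplace=True),
        )
        self.pw = nn.Sequential(
            nn.Conv2d(c_in, c_out, 1, bias=False),  # aggregate across channels
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.pw(self.dw(x))
```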
The C3 module in YOLOv5 uses multiple separate convolutions and is applied frequently with many channels, occupying excessive cache space and computation. Therefore, following the second design guideline proposed by ShuffleNet V2, this paper uses DepthSepConv in place of some C3 modules to reduce the model's parameters and computation.
Because detection accuracy, parameter count, and computation vary depending on where DepthSepConv replaces a C3 module, this paper compares the replacement at different positions when training on the dataset. The comparison results are shown in Table 2. Positions 1, 2, and 3 in the table refer to the second, third, and fourth C3 modules within the Neck, specifically the C3 modules highlighted in red in Fig. 2.
As seen from Table 2, the mAP of the model reaches 94.6% both when replacing the C3 module at position 2 alone and when replacing positions 1 and 3 together. The latter replacement, however, makes the model more lightweight, shrinking the parameters and FLOPS by 7.4% and 4.9%, respectively, compared with the unmodified model. Therefore, this paper replaces the C3 modules at positions 1 and 3 with DepthSepConv, which gives the model both a high mAP and a higher degree of lightweighting.
C. Improvement of Non-Maximum Suppression
Due to the overlapping nature of mask images, using NMS can lead to missed objects in overlapping regions, because NMS directly removes candidate boxes whose IOU with the highest-scoring box exceeds a threshold; Soft NMS instead only reduces their confidence [25]. Soft NMS offers two rescoring methods, linear and Gaussian, where the linear method is shown in equation (4).\begin{align*} s_{i} =\begin{cases}\displaystyle s_{i}, & iou\left ({M,\textrm {b}_{i}} \right) < N_{t}\\ \displaystyle s_{i} \left ({1-iou\left ({M,\textrm {b}_{i}} \right)} \right), & iou\left ({M,\textrm {b}_{i}} \right)\ge N_{t} \end{cases} \tag{4}\end{align*} where $M$ is the current highest-scoring box, $\textrm {b}_{i}$ is a candidate box with confidence $s_{i}$, and $N_{t}$ is the IOU threshold.
If the IOU is greater than or equal to the threshold, the confidence of the candidate box decreases. However, the linear rescoring function is not continuous, which causes an abrupt change in confidence when the IOU crosses the threshold. The Gaussian rescoring method solves this problem, as shown in equation (5):\begin{equation*} s_{i} =s_{i} e^{-\frac {iou\left ({M,\textrm {b}_{i}} \right)^{2}}{\sigma }},\forall \textrm {b}_{i} \notin D \tag{5}\end{equation*} where $\sigma$ controls the width of the Gaussian penalty and $D$ is the set of detections already retained.
In this paper, Soft NMS with Gaussian rescoring replaces the standard NMS method, preventing missed detections that result from the removal of overlapping candidate boxes.
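A NumPy sketch of Gaussian Soft NMS per Eq. (5) is shown below; sigma = 0.5 and the final score threshold are illustrative values, not the paper's tuned settings.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one (x1, y1, x2, y2) box and an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def soft_nms_gaussian(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft NMS: decay overlapping scores by exp(-IoU^2 / sigma),
    Eq. (5), instead of deleting overlapping boxes outright."""
    scores = scores.copy()
    keep, idx = [], np.arange(len(scores))
    while len(idx):
        m = idx[np.argmax(scores[idx])]  # highest-scoring remaining box M
        keep.append(m)
        idx = idx[idx != m]
        if len(idx):
            decay = np.exp(-iou(boxes[m], boxes[idx]) ** 2 / sigma)
            scores[idx] *= decay                     # soft rescoring
            idx = idx[scores[idx] > score_thresh]    # drop near-zero boxes
    return keep
```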
Fig. 7 compares the mask recognition results before and after the NMS improvement. The enhanced model retains some overlapping candidate boxes that would otherwise be deleted and successfully detects mask objects that the original model misses, avoiding missed detections caused by mask occlusion and thus improving the detection accuracy of the model.
D. YOLOv5-S2C2 Model Design
Mask images feature small object sizes, large object counts, and substantial overlap, so directly applying a generic target detection method such as YOLOv5 is not ideal. To solve these problems, this paper makes the following improvements to the YOLOv5 network model structure:
To address the limited degree of lightweighting, this paper incorporates lightweight structures into YOLOv5s:
Replacement of the YOLOv5 backbone network with improved FasterNet;
Replacement of part of the C3 module in Neck with DepthSepConv;
Replacement of NMS with Soft NMS in YOLOv5 to retain, to a certain extent, the confidence of overlapping boxes.
The improved structure of YOLOv5-S2C2 is shown in Fig. 8.
Experiments and Analysis
A. Introduction to the Data Set and Experimental Environment
The experiments were run on Windows 10; the experimental configurations are listed in Table 3, and the hyperparameter settings for network training are listed in Table 4.
To verify the generalization of the model, two datasets are used for training and validation in this study, each split into training and validation sets at a ratio of 9:1.
Dataset I consists of public mask datasets collected from the Internet and images extracted from video frames of natural scenes. The training set contains 8278 images and the validation set 921 images, for a total of 9199 images.
Dataset II consists of images filtered from the MAFA [26] and WIDER FACE [27] datasets. The training set contains 7163 images and the validation set 796 images, for a total of 7959 images.
Example images from the datasets are shown in Fig. 9, where (a), (b), and (c) come from Dataset I and (d), (e), and (f) from Dataset II. Faces wearing masks are labeled "face_mask" and faces without masks are labeled "face".
B. Evaluation Indicators
Mask-wearing detection involves two classes, faces with and without masks, which yield the positive and negative samples. The experiments in this paper use recall (R), precision (P), average precision (AP), and mean average precision (mAP) to evaluate the accuracy of the detection model; the parameter count and FLOPs to evaluate its lightweight level; and the detection time to evaluate its real-time performance [28].\begin{align*} P&=\frac {TP}{TP+FP} \tag{6}\\ R&=\frac {TP}{TP+FN} \tag{7}\\ AP&=\int _{0}^{1} {p\left ({r }\right)} dr \tag{8}\\ mAP&=\frac {1}{m}\sum _{i=1}^{m} {AP_{i}} \tag{9}\end{align*}
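For reference, Eqs. (6)-(9) can be computed as in the following sketch; the AP integral is evaluated numerically over the precision-recall curve with the usual monotone-envelope interpolation.

```python
import numpy as np

def precision_recall(tp, fp, fn):
    """Eqs. (6)-(7): precision and recall from detection counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def average_precision(recall, precision):
    """Eq. (8): area under the P-R curve (points sorted by ascending recall)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]  # monotone non-increasing envelope
    return float(np.sum(np.diff(r) * p[1:]))

# Eq. (9): mAP is the mean of the per-class APs, e.g.
# m_ap = sum(ap_list) / len(ap_list)
```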
C. Comparison with Different Backbone
To verify FasterNet's ability as a backbone network for image feature extraction, this paper compares it with other lightweight backbones, ShuffleNet V2 [29] and MobileNet V3 [30]. Table 5 shows the comparison results.
Table 5 shows that with ShuffleNet V2 or MobileNet V3 as the backbone, the detection mAP is not high enough. With FasterNet as the backbone, the mAP drops 0.7 percentage points relative to the YOLOv5s baseline. The EMA-FasterNet proposed in this paper compensates well for this loss: its mAP differs only marginally from that of YOLOv5s, while its lightweight level is comparable to FasterNet's, with the parameter count and computation significantly compressed. Therefore, EMA-FasterNet is chosen to replace the YOLOv5 backbone for extracting mask image features, realizing a lightweight mask detection model.
D. Comparison with Other Detection Models
To more effectively showcase the benefits of the improved model, this experiment compares YOLOv5-S2C2 with several mainstream detection models, including SSD, YOLOv4-tiny, YOLOv5s, YOLOv7, Nanodet-plus, YOLOv8s, YOLOv9s, and Gelan-s [9], [12], [13], [31], [32], [33]. Table 6 compares each model's volume, mAP, parameters, FLOPS, and detection speed.
As can be seen from Table 6, the mAP of the proposed YOLOv5-S2C2 is 7.9% higher than that of YOLOv4-tiny. Compared to the original base model YOLOv5s, it drops only 0.2%: the reduction in parameters and computation during lightweighting slightly coarsens the extracted features, causing a small accuracy loss. Nevertheless, the average detection accuracy still reaches 94.8%. Relative to YOLOv5s, YOLOv5-S2C2 reduces the parameters and FLOPS by 57.0% and 51.3%, and its volume is 44.4% of the original model's.
The data in Table 6 and the comparison in Fig. 10 show that, among the lightweight models, YOLOv5-S2C2 achieves the highest mAP, up to 94.8%, with a low missed-detection rate for faces and better detection performance, which helps reduce protection vulnerabilities. Its volume, parameter count, and FLOPS show that the proposed model greatly reduces complexity and saves substantial computational resources, meeting the lightweight requirements. Its single-image detection time is only 2.8 ms, which meets the requirement of real-time detection.
YOLOv5-S2C2 also offers higher detection accuracy and better real-time performance than other non-lightweight one-stage object detection models (SSD, YOLOv5m). Its detection accuracy is 11.1% higher than SSD's, though 0.6% lower than YOLOv5m's: because YOLOv5s serves as the improved base model, the network has fewer layers and weaker detection capability than YOLOv5m, and the integrated lightweight modules further reduce computation and weaken detection performance. Nonetheless, YOLOv5m diverges substantially from YOLOv5-S2C2 in parameter count, FLOPS, model size, and detection speed, so YOLOv5m cannot meet the requirements of lightweight, real-time object detection.
In summary, the YOLOv5-S2C2 model presented in this paper effectively minimizes the consumption of computing and hardware resources while maintaining high accuracy. This meets the requirements of lightweight detection models and facilitates their practical use and deployment.
E. Ablation Experiments
The ablation experiment verifies the optimization effect of each improvement step. The experimental results on Dataset I are shown in Table 7 and those on Dataset II in Table 8; the numbers in Column 1 denote the improved model variants.
Table 7 shows the detection results on Dataset I, and Fig. 11(a) and (b) compare the detection results before and after the model improvement on Dataset I, respectively. For faces without masks, the improved YOLOv5-S2C2 performs better; for faces with masks, the results differ only slightly, and the overall mAP values are close.
The results in Table 5 and their analysis show that choosing EMA-FasterNet as the backbone of YOLOv5s substantially reduces the model's parameters and computation, making it lighter, although this lightweighting may slightly decrease representation accuracy. From the data in Table 7, the parameter count and FLOPS are reduced by 53.6% and 48.7%, respectively, while the mAP decreases by only 0.3 percentage points, a highly cost-effective trade-off. The experiments in Table 2 show that replacing the C3 modules at positions 1 and 3 with DepthSepConv works best; the results for model ②, which fuses depthwise separable convolutions alone, show a 14.0% reduction in parameters and an 8.9% compression of computation with no loss of mAP. The results for model ③ show that replacing NMS with Soft NMS alone leaves the computation and parameter count almost unchanged; for mask detection, it retains some otherwise-deleted overlapping candidate boxes and avoids missed detections due to occlusion, raising the model's mAP by 0.2 percentage points. All of these results indicate that every improvement in this study has a positive effect on the model.
After the EMA-FasterNet backbone is in place, the Neck is lightweighted mainly by applying DepthSepConv to deep features, i.e., model ④ in the table; the parameters and FLOPS are compressed by a further 7.6% and 4.9%, meeting the lightweight requirement, though some feature descriptions are also reduced and the mAP accordingly drops by 0.1 percentage points. This lightweight processing keeps the whole network compact and efficient and removes redundant operations from the feature extraction process. Finally, Soft NMS replaces NMS, preserving the confidence of overlapping boxes to a certain extent rather than deleting them, which improves small-object detection: the mAP rises by 0.2 percentage points while the parameters and FLOPS remain almost unchanged.
Fig. 11(c) and (d) show the comparison of the detection results before and after the model improvement on Dataset II, respectively. The data in Table 8 show that each improvement step proposed in this paper is also cost-effective on Dataset II, especially in terms of lightweighting: the complexity is compressed by about 60% while detection accuracy comparable to that of the original model is maintained.
Fig. 12(a) and (b) compare the confusion matrices of YOLOv5 and YOLOv5-S2C2 on Dataset I, and Fig. 12(c) and (d) show the corresponding comparison on Dataset II. Both models show considerable confusion among the background, face, and face_mask classes. Although the mAP difference before and after the improvement is not significant, the false negatives (FN) in the confusion matrix matter greatly for the mask-wearing detection task, since they directly reflect the missed-detection rate. The improved YOLOv5-S2C2 has fewer overall misses than YOLOv5s, avoiding the health risks caused by failing to detect faces and masks, while greatly improving the overall lightweight level.
Fig. 13 shows the comparison results before and after the model improvement on Dataset I, and Fig. 14 shows those on Dataset II. The first column shows the labeled "Ground truth" images; the second and third columns show the detection results of YOLOv5s and YOLOv5-S2C2, respectively. YOLOv5s fails to detect some face instances that the improved YOLOv5-S2C2 detects, with overall confidence equal to or even higher than that of YOLOv5s.
From the above results and analyses, although the overall mAP of YOLOv5s is slightly higher than that of YOLOv5-S2C2, YOLOv5-S2C2 misses fewer face instances than YOLOv5s. In summary, the improvements and optimizations to YOLOv5 in this paper are reasonable and practical: they improve the original model's performance on the mask detection task and better close the protection loopholes in the detection task.
Conclusion
Compared with YOLOv5s on Dataset I, the YOLOv5-S2C2 model's volume, parameters, and FLOPS are reduced by 55.6%, 57.0%, and 51.3%, respectively. In terms of detection speed, the detection time for a single image is only 2.8 ms, giving good real-time performance. Regarding detection accuracy, the mAP of YOLOv5-S2C2 reaches 94.8%, a high detection level. On Dataset II, YOLOv5-S2C2 likewise achieves a high mAP and a low missed-detection rate. The model can be transferred to disease prevention and control, dust mitigation, and other mask-related applications with good prospects. The detection accuracy of the proposed model still falls short of some object detection models that prioritize accuracy; subsequent research will focus on improving detection accuracy while maintaining the lightweight design.