Introduction
PPE compliance detection checks whether the personal protective equipment worn by workers complies with relevant regulations and standards, with the aim of eliminating potential safety hazards and ensuring worker safety. On construction sites in particular, workers must correctly wear personal protective equipment, such as helmets, reflective clothing, and safety belts, to ensure their safety and minimize the risk of injury or death. Existing PPE compliance detection methods fall mainly into two categories: sensor-based methods [1]-[3] and vision-based methods [4]-[8]. Sensor-based methods typically use positioning sensors to identify PPE, which is intrusive and incurs additional sensor costs [9]. In contrast, vision-based techniques use computer vision algorithms to detect PPE compliance non-intrusively [10], offering lower costs and better scalability.
Recently, deep learning algorithms have become the mainstream paradigm for PPE compliance detection and can be divided into two-stage [11]-[13] and one-stage methods [14]-[17]. Two-stage methods, such as Faster R-CNN [12], achieve higher accuracy, but their high model complexity leads to low efficiency; one-stage methods, such as YOLOv10 [34], enjoy a simple and elegant end-to-end architecture as well as real-time inference speed, but sacrifice precision to some extent. To address these issues, this paper proposes the MKD-YOLO framework (shown in Fig. 1(a)) with the following contributions:
We propose a C2f-EMSEC module that uses multiple convolutions with different kernel sizes to capture multi-scale spatial features. In addition, an LSPPF module is designed to extract global context features in the backbone network.
We further introduce BPNet in the neck stage to enhance the model's ability to capture multi-scale fine-grained details, which is particularly effective for small object detection and shows advantages over the typical PAN-FPN.
To achieve a lightweight model, we adopt a channel-wise knowledge distillation method for dense prediction, which exploits the KL divergence between the channel probability maps of the student and teacher models, making the student model focus on the most important regions in each channel and thereby improving distillation effectiveness.
Fig. 1. (a) The proposed MKD-YOLO framework based on YOLOv8n. (b) The proposed C2f-EMSEC module. (c) Details of the LSKA and LSPPF modules; LSPPF is an improved version of SPPF.
Proposed Method
This section details the proposed MKD-YOLO framework, including the C2f-EMSEC module, the LSPPF module, and BPNet, as well as model lightweighting and knowledge transfer via channel-wise knowledge distillation.
A. C2f-EMSEC Module
Canonical stacked networks consist primarily of multiple convolutional layers, where each layer typically uses convolutional kernels of a single size; for example, the VGG [20] architecture employs 3×3 convolutions throughout. This limits the network's ability to distinguish features of objects or scenes at different scales. To address this, we propose the Efficient Multi-Scale Enhanced Convolution (EMSEC) in the C2f blocks, as shown in Fig. 1(b). It introduces multiple convolutions with different kernel sizes, enabling the capture of spatial features at multiple scales [21], described as follows:
\begin{gather*} \mathrm{EMSEC} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}\big(\mathrm{Conv}_{1\times 1}(x_1), \mathrm{Conv}_{3\times 3}(x_2), \\ \mathrm{Conv}_{5\times 5}(x_3), \mathrm{Conv}_{7\times 7}(x_4)\big)\big). \tag{1}\end{gather*}
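To make Eq. (1) concrete, the following is a minimal PyTorch sketch of an EMSEC block. The split of the input into four equal channel groups and the use of "same" padding in each branch are our assumptions; the per-branch kernel sizes and the fusing 1×1 convolution follow Eq. (1).

```python
import torch
import torch.nn as nn

class EMSEC(nn.Module):
    """Efficient Multi-Scale Enhanced Convolution, per Eq. (1).

    The input is split into four channel groups x1..x4, each processed by
    a convolution with a different kernel size (1/3/5/7); the results are
    concatenated and fused by a 1x1 convolution.
    """
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "channels must be divisible by 4"
        c = channels // 4
        # padding = k // 2 keeps the spatial size constant in every branch.
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2) for k in (1, 3, 5, 7)
        )
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        splits = torch.chunk(x, 4, dim=1)  # x1..x4 in Eq. (1)
        multi_scale = [branch(s) for branch, s in zip(self.branches, splits)]
        return self.fuse(torch.cat(multi_scale, dim=1))
```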
B. LSPPF Module
The YOLOv8 network uses Spatial Pyramid Pooling-Fast (SPPF) [16] to enrich the feature representation. However, relying solely on max-pooling operations for feature extraction may lose some global information and degrade the model's accuracy. We therefore design the LSPPF module, which enhances the backbone's ability to extract global features by integrating Large Separable Kernel Attention (LSKA) [22], as shown in Fig. 1(c). The details of LSKA are given by Eqs. (2)-(5), where $D^C$ denotes a depth-wise convolution kernel applied to channel $C$, $F^C$ is the input feature map, $k$ is the kernel size, $d$ is the dilation rate, and $\otimes$ denotes element-wise multiplication.
\begin{align*} X^C &= \sum\limits_{H,W} D^C_{(2d-1)\times 1} * \Big( \sum\limits_{H,W} D^C_{1\times (2d-1)} * F^C \Big), \tag{2}\\ Y^C &= \sum\limits_{H,W} D^C_{\lfloor \frac{k}{d} \rfloor \times 1} * \Big( \sum\limits_{H,W} D^C_{1\times \lfloor \frac{k}{d} \rfloor} * X^C \Big), \tag{3}\\ I^C &= D_{1\times 1} * Y^C, \tag{4}\\ \bar{F}^C &= I^C \otimes F^C. \tag{5}\end{align*}
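For reference, here is a minimal PyTorch sketch of LSKA consistent with Eqs. (2)-(5): separable depth-wise convolutions (Eq. (2)), separable depth-wise dilated convolutions (Eq. (3)), a 1×1 convolution producing the attention map (Eq. (4)), and an element-wise product with the input (Eq. (5)). The default values k = 23 and d = 3 are illustrative assumptions, not values reported here.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Large Separable Kernel Attention, per Eqs. (2)-(5)."""
    def __init__(self, channels: int, k: int = 23, d: int = 3):
        super().__init__()
        m = k // d  # [k/d] in Eq. (3); assumed odd so padding stays symmetric
        g = channels  # depth-wise: one group per channel
        # Eq. (2): separable depth-wise convs with kernel size (2d - 1).
        self.dw_h = nn.Conv2d(channels, channels, (1, 2 * d - 1),
                              padding=(0, d - 1), groups=g)
        self.dw_v = nn.Conv2d(channels, channels, (2 * d - 1, 1),
                              padding=(d - 1, 0), groups=g)
        # Eq. (3): separable depth-wise *dilated* convs with kernel size [k/d].
        self.dwd_h = nn.Conv2d(channels, channels, (1, m), dilation=d,
                               padding=(0, d * (m - 1) // 2), groups=g)
        self.dwd_v = nn.Conv2d(channels, channels, (m, 1), dilation=d,
                               padding=(d * (m - 1) // 2, 0), groups=g)
        # Eq. (4): 1x1 conv producing the attention map I^C.
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        x = self.dw_v(self.dw_h(f))    # Eq. (2)
        y = self.dwd_v(self.dwd_h(x))  # Eq. (3)
        attn = self.pw(y)              # Eq. (4)
        return attn * f                # Eq. (5): element-wise product
```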
Fig. 2. (a) The original BiFPN structure. (b) The proposed BPNet module with the small object-aware detection layer P2.
C. BPNet
The neck of YOLOv8 originally utilizes the PAN-FPN network, whose multi-scale representation ability may be reduced by semantic differences and direct fusion between layers; moreover, the downsampling process may lose information from the highest-level pyramid features. To address these feature-fusion problems [23], we incorporate a small object-aware layer P2 into BiFPN [24] and propose BPNet to enhance the model's detection ability for small objects [25], as shown in Fig. 2(b).
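BPNet inherits BiFPN's learnable weighted feature fusion when merging levels. Below is a minimal PyTorch sketch of that fusion step, assuming the inputs have already been resized to a common resolution; the P2 wiring itself follows Fig. 2(b) and is omitted.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion of n same-shape feature maps.

    Each input gets a learnable non-negative weight; the weights are
    normalized so the output is a convex combination of the inputs.
    """
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, inputs: list[torch.Tensor]) -> torch.Tensor:
        w = torch.relu(self.weights)      # keep weights non-negative
        w = w / (w.sum() + self.eps)      # fast normalization
        return sum(wi * x for wi, x in zip(w, inputs))
```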
Fig. 3. Channel distribution distillation aligns each channel of the student's feature maps to that of the teacher network by minimizing the KL divergence.
D. Channel-wise Knowledge Distillation for MKD-YOLO
Knowledge distillation [26] aims to transfer knowledge from large, complex models to simpler, smaller models, enabling student models to match or even exceed the performance of their teachers. Conventional distillation methods are prone to introducing redundant information from the teacher into the student network. Channel-wise knowledge distillation [27] addresses this by normalizing the activation map of each channel for dense prediction tasks, as shown in Fig. 3.
Let the teacher and student networks be denoted as $T$ and $S$, and their activation maps as $m^T$ and $m^S$, respectively. The channel-wise distillation loss can be expressed in general form as:
\begin{equation*} \varphi\big(\phi(m^T), \phi(m^S)\big) = \varphi\big(\phi(m_c^T), \phi(m_c^S)\big), \tag{6}\end{equation*}
\begin{equation*} \phi(m_c) = \frac{\exp\left(\frac{m_{c,n}}{\tau}\right)}{\sum_{n=1}^{W\cdot H} \exp\left(\frac{m_{c,n}}{\tau}\right)}, \tag{7}\end{equation*}
\begin{equation*} \varphi\big(m^T, m^S\big) = \frac{\tau^2}{C} \sum\limits_{c=1}^{C} \sum\limits_{n=1}^{W\cdot H} \phi\big(m_{c,n}^T\big) \cdot \log\left[\frac{\phi\big(m_{c,n}^T\big)}{\phi\big(m_{c,n}^S\big)}\right]. \tag{8}\end{equation*}
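As a concrete reference, here is a minimal PyTorch sketch of Eqs. (7)-(8). Averaging over the batch dimension, and the assumption that teacher and student maps already share the same shape (e.g., via a 1×1 projection), are ours.

```python
import torch
import torch.nn.functional as F

def channel_wise_distillation(m_t: torch.Tensor, m_s: torch.Tensor,
                              tau: float = 1.0) -> torch.Tensor:
    """Channel-wise KD loss, per Eqs. (7)-(8).

    m_t, m_s: teacher and student activation maps of shape (B, C, H, W),
    assumed to have matching shapes.
    """
    b, c, h, w = m_t.shape
    # Eq. (7): softmax over the W*H spatial positions of each channel.
    p_t = F.softmax(m_t.reshape(b, c, -1) / tau, dim=-1)
    log_p_s = F.log_softmax(m_s.reshape(b, c, -1) / tau, dim=-1)
    # Eq. (8): per-channel KL divergence, scaled by tau^2 and averaged
    # over channels (and, here, the batch).
    kl = (p_t * (p_t.clamp_min(1e-12).log() - log_p_s)).sum(dim=-1)
    return (tau ** 2) * kl.sum() / (c * b)
```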
The KL divergence is an asymmetric metric: minimizing the KL divergence between the teacher's and student's channel distributions drives the student's parameters toward the teacher's behavior. Knowledge distillation can be categorized into feature distillation and logit distillation. Since YOLOv8 is a multi-layer model and logit distillation easily transfers the teacher model's prediction uncertainty to the student, this paper opts for feature distillation. The feature distillation loss, summed over the $L$ distilled feature layers, is formulated as follows:
\begin{equation*} \mathcal{L}_{fea} = \alpha \sum\limits_{i=1}^{L} \varphi\big(m_i^T, m_i^S\big), \tag{9}\end{equation*}
\begin{equation*} \mathcal{L} = \mathcal{L}_{orig} + \lambda \mathcal{L}_{fea}. \tag{10}\end{equation*}
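A toy example of wiring Eqs. (9)-(10) into the total loss, reusing channel_wise_distillation from the sketch above; the random stand-in features, the placeholder detection loss, and the value of λ are illustrative assumptions.

```python
import torch

alpha, lam, tau = 0.3, 1.0, 1.0  # alpha per Sec. III-E; lambda is an assumption
# Stand-ins for the L distilled feature layers (random for illustration).
teacher_feats = [torch.randn(2, 64, 40, 40) for _ in range(4)]
student_feats = [torch.randn(2, 64, 40, 40) for _ in range(4)]
l_orig = torch.tensor(1.0)  # placeholder for the detector's own loss

# Eq. (9): weighted sum of the channel-wise loss over the distilled layers.
l_fea = alpha * sum(channel_wise_distillation(t, s, tau)
                    for t, s in zip(teacher_feats, student_feats))
loss = l_orig + lam * l_fea  # Eq. (10)
```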
Experiments
A. Datasets and Evaluation Metrics
We conduct experiments on our proposed Smart Construction Site (SCS) dataset, which is derived from challenging scenes at real construction sites, such as long-distance monitoring and extremely small targets. It contains 3914 images in total (3132 for training, 391 for validation, and 391 for testing), covering 5 detection classes: head, safety helmet, reflective clothing, safety belt, and person. To validate the generalization of MKD-YOLO, we also perform experiments on two public datasets, SHD [28] and CSS [29]. The evaluation metrics include mAP50 (mean Average Precision at an IoU threshold of 0.5), mAP50−95 (mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 with a step of 0.05), the number of parameters (Param.), and the inference speed (FPS).
B. Implementation Details
The proposed MKD-YOLO is implemented in PyTorch 1.12.0 and trained on four RTX 3090 graphics cards. The input images for model training are 640 × 640 × 3, and the number of training epochs, the early-stopping patience, and the weight decay are set to 300, 50, and 0.0005, respectively. The batch size is set to 32. We use the SGD optimizer with an initial learning rate of 0.01 and a momentum factor of 0.937. The confidence threshold is set to 0.5, and the IoU threshold for non-maximum suppression is set to 0.7.
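For reference, a hypothetical optimizer setup matching the reported hyperparameters might look as follows; the stand-in module is an assumption, not the authors' released code.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the MKD-YOLO network
# SGD with the reported initial learning rate, momentum, and weight decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
```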
C. Comparison with State-of-the-Art Methods
We compare MKD-YOLO with eight SOTA detection methods, as shown in Table I. On the SCS dataset, MKD-YOLO outperforms Mamba-YOLO (the second-best method) by 4.0% in mAP50 and 2.9% in mAP50−95. Although MKD-YOLO's FPS is lower than that of YOLOv10 (the fastest method), it has the fewest parameters, demonstrating that MKD-YOLO maintains a good balance between detection accuracy and efficiency. On the SHD and CSS datasets, MKD-YOLO performs consistently well, ranking in the top two on all evaluation metrics. These results demonstrate the superiority and robustness of our method.
D. Ablation Studies on C2f-EMSEC, LSPPF and BPNet
To evaluate the effectiveness of each proposed module, we conduct ablation experiments on the SCS dataset; the results are shown in Table II. Relative to the YOLOv8n baseline (V8n-0 in Table II), the introduction of each module significantly improves performance. In particular, V8n-7 improves mAP50 by 4.0% and mAP50−95 by 3.8%, while reducing the number of parameters by 20% and increasing inference speed by nearly 50%. These results verify the effectiveness of the proposed C2f-EMSEC, LSPPF, and BPNet.
E. Ablation Study on Knowledge Distillation
We use the YOLOv8n-based MKD-YOLO (2.43M parameters) as the student model and the YOLOv8x-based MKD-YOLO (53.2M parameters) as the teacher model, with temperature parameter τ = 1. The feature fusion layers of the neck stage (layers 22, 25, 28, and 31 in Fig. 1(a)) are selected for distillation. We experiment with feature loss weights α from 0.1 to 1.0 and find that α = 0.3 achieves the best balance between mAP and speed (FPS); we therefore set α to 0.3 in the final model. Detailed results are shown in Table III. Compared with model V8n-7 before distillation (Table II), the student model's mAP50 improves by 1.3% at the cost of a 16% reduction in speed, while the number of parameters remains unchanged. These results show that detection accuracy can be enhanced without increasing parameters and with only a modest decrease in inference speed, proving the effectiveness of the adopted channel-wise knowledge distillation method.
Conclusion
This paper proposes a novel PPE compliance detection framework, MKD-YOLO, which designs the lightweight C2f-EMSEC, LSPPF, and BPNet for multi-scale, global-contextual, and fine-grained feature enhancement. Compared with SOTA detection methods, MKD-YOLO proves to be a lighter, more robust, and more efficient detector with fewer parameters and faster inference. In future work, we plan to explore more generalized multi-scale feature learning approaches and more effective knowledge distillation methods to further improve the detection precision and efficiency of the proposed model on more general PPE compliance detection tasks.