Introduction
PPE compliance detection checks whether the personal protective equipment worn by workers complies with relevant regulations and standards, with the aim of eliminating potential safety hazards and ensuring worker safety. On construction sites in particular, workers must correctly wear personal protective equipment, such as helmets, reflective clothing, and safety belts, to ensure their safety and minimize the risk of injury or death. Existing PPE compliance detection methods fall mainly into two categories: sensor-based methods [1]-[3] and vision-based methods [4]-[8]. Sensor-based methods typically use positioning sensors to identify PPE, which is intrusive and incurs additional sensor costs [9]. In contrast, vision-based techniques use computer vision algorithms to detect PPE compliance non-intrusively [10], offering lower costs and better scalability.
Recently, deep learning algorithms have become the mainstream paradigm for PPE compliance detection and can be divided into two-stage [11]-[13] and one-stage methods [14]-[17]. Two-stage methods, such as Faster R-CNN [12], achieve higher accuracy, but their high model complexity leads to low efficiency; one-stage methods, such as YOLOv10 [34], enjoy a simple and elegant end-to-end architecture as well as real-time inference speed, but sacrifice precision to some extent. To address these issues, this paper proposes the MKD-YOLO framework (shown in Fig. 1(a)) with the following contributions:
We propose a C2f-EMSEC module that uses multiple convolutions with different kernel sizes to capture multi-scale spatial features. In addition, an LSPPF module is designed to extract global context features in the backbone network.
We further introduce BPNet in the neck stage to enhance the model's ability to capture multi-scale fine-grained details, which is particularly effective for small object detection and shows advantages over the typical PAN-FPN.
To achieve a lightweight model, we adopt a channel-wise knowledge distillation method for dense prediction, which exploits the KL divergence between the channel probability maps of the student and teacher models, making the student model focus on the most important regions in each channel and thereby improving distillation effectiveness.
Fig. 1. (a) The proposed MKD-YOLO framework based on YOLOv8n. (b) The proposed C2f-EMSEC module. (c) Details of the LSKA and LSPPF modules; LSPPF is an improved version of SPPF.
Proposed Method
This section details the proposed MKD-YOLO framework, including the C2f-EMSEC module, the LSPPF module, and BPNet, as well as model lightweighting and knowledge transfer via channel-wise knowledge distillation.
A. C2f-EMSEC Module
Canonical stacked networks consist primarily of multiple convolutional layers, where each layer typically uses convolutional kernels of a single size; for example, the VGG [20] architecture employs 3×3 convolutions throughout. This limits the network's ability to distinguish features of objects or scenes at different scales. To address this, we propose the Efficient Multi-Scale Enhanced Convolution (EMSEC) in the C2f blocks, as shown in Fig. 1(b). It introduces multiple convolutions with different kernel sizes, enabling the capture of spatial features at multiple scales [21], described as follows:
\begin{gather*} \mathrm{EMSEC} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}\big(\mathrm{Conv}_{1\times 1}(x_1), \mathrm{Conv}_{3\times 3}(x_2), \\ \mathrm{Conv}_{5\times 5}(x_3), \mathrm{Conv}_{7\times 7}(x_4)\big)\big). \tag{1}\end{gather*}
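To make Eq. (1) concrete, the following is a minimal PyTorch sketch of an EMSEC block. The split of the input into four equal channel groups and the use of "same" padding in each branch are our assumptions; the per-branch kernel sizes and the fusing 1×1 convolution follow Eq. (1).

```python
import torch
import torch.nn as nn

class EMSEC(nn.Module):
    """Efficient Multi-Scale Enhanced Convolution, per Eq. (1).

    The input is split into four channel groups x1..x4, each processed by
    a convolution with a different kernel size (1/3/5/7); the results are
    concatenated and fused by a 1x1 convolution.
    """
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 4 == 0, "channels must be divisible by 4"
        c = channels // 4
        # padding = k // 2 keeps the spatial size constant in every branch.
        self.branches = nn.ModuleList(
            nn.Conv2d(c, c, k, padding=k // 2) for k in (1, 3, 5, 7)
        )
        self.fuse = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        splits = torch.chunk(x, 4, dim=1)  # x1..x4 in Eq. (1)
        multi_scale = [branch(s) for branch, s in zip(self.branches, splits)]
        return self.fuse(torch.cat(multi_scale, dim=1))
```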
B. LSPPF Module
The YOLOv8 network uses Spatial Pyramid Pooling-Fast (SPPF) [16] to enrich the feature representation. However, relying solely on max-pooling operations for feature extraction may lose some global information and degrade the model's accuracy. We therefore design the LSPPF module, which enhances the backbone's ability to extract global features by integrating Large Separable Kernel Attention (LSKA) [22], as shown in Fig. 1(c). The details of LSKA are given by Eqs. (2)-(5), where $D^C$ denotes a depth-wise convolution kernel applied to channel $C$, $F^C$ is the input feature map, $k$ is the kernel size, $d$ is the dilation rate, and $\otimes$ denotes element-wise multiplication.
\begin{align*} X^C &= \sum\limits_{H,W} D^C_{(2d-1)\times 1} * \Big( \sum\limits_{H,W} D^C_{1\times (2d-1)} * F^C \Big), \tag{2}\\ Y^C &= \sum\limits_{H,W} D^C_{\lfloor \frac{k}{d} \rfloor \times 1} * \Big( \sum\limits_{H,W} D^C_{1\times \lfloor \frac{k}{d} \rfloor} * X^C \Big), \tag{3}\\ I^C &= D_{1\times 1} * Y^C, \tag{4}\\ \bar{F}^C &= I^C \otimes F^C. \tag{5}\end{align*}
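For reference, here is a minimal PyTorch sketch of LSKA consistent with Eqs. (2)-(5): separable depth-wise convolutions (Eq. (2)), separable depth-wise dilated convolutions (Eq. (3)), a 1×1 convolution producing the attention map (Eq. (4)), and an element-wise product with the input (Eq. (5)). The default values k = 23 and d = 3 are illustrative assumptions, not values reported here.

```python
import torch
import torch.nn as nn

class LSKA(nn.Module):
    """Large Separable Kernel Attention, per Eqs. (2)-(5)."""
    def __init__(self, channels: int, k: int = 23, d: int = 3):
        super().__init__()
        m = k // d  # [k/d] in Eq. (3); assumed odd so padding stays symmetric
        g = channels  # depth-wise: one group per channel
        # Eq. (2): separable depth-wise convs with kernel size (2d - 1).
        self.dw_h = nn.Conv2d(channels, channels, (1, 2 * d - 1),
                              padding=(0, d - 1), groups=g)
        self.dw_v = nn.Conv2d(channels, channels, (2 * d - 1, 1),
                              padding=(d - 1, 0), groups=g)
        # Eq. (3): separable depth-wise *dilated* convs with kernel size [k/d].
        self.dwd_h = nn.Conv2d(channels, channels, (1, m), dilation=d,
                               padding=(0, d * (m - 1) // 2), groups=g)
        self.dwd_v = nn.Conv2d(channels, channels, (m, 1), dilation=d,
                               padding=(d * (m - 1) // 2, 0), groups=g)
        # Eq. (4): 1x1 conv producing the attention map I^C.
        self.pw = nn.Conv2d(channels, channels, 1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        x = self.dw_v(self.dw_h(f))    # Eq. (2)
        y = self.dwd_v(self.dwd_h(x))  # Eq. (3)
        attn = self.pw(y)              # Eq. (4)
        return attn * f                # Eq. (5): element-wise product
```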
Fig. 2. (a) The original BiFPN structure. (b) The proposed BPNet module with the small object-aware detection layer P2.
C. BPNet
The neck of YOLOv8 originally utilizes the PAN-FPN network, whose multi-scale representation ability may be reduced by semantic differences and direct fusion between layers; moreover, the downsampling process may lose information from the highest-level pyramid features. To address these feature-fusion problems [23], we incorporate a small object-aware layer P2 into BiFPN [24] and propose BPNet to enhance the model's detection ability for small objects [25], as shown in Fig. 2(b).
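BPNet inherits BiFPN's learnable weighted feature fusion when merging levels. Below is a minimal PyTorch sketch of that fusion step, assuming the inputs have already been resized to a common resolution; the P2 wiring itself follows Fig. 2(b) and is omitted.

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """BiFPN-style fast normalized fusion of n same-shape feature maps.

    Each input gets a learnable non-negative weight; the weights are
    normalized so the output is a convex combination of the inputs.
    """
    def __init__(self, n_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(n_inputs))
        self.eps = eps

    def forward(self, inputs: list[torch.Tensor]) -> torch.Tensor:
        w = torch.relu(self.weights)      # keep weights non-negative
        w = w / (w.sum() + self.eps)      # fast normalization
        return sum(wi * x for wi, x in zip(w, inputs))
```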
Fig. 3. Channel distribution distillation aligns each channel of the student's feature maps to that of the teacher network by minimizing the KL divergence.
D. Channel-wise Knowledge Distillation for MKD-YOLO
Knowledge distillation [26] aims to transfer knowledge from large, complex models to simpler, smaller models, enabling student models to match or even exceed the performance of their teachers. Conventional distillation methods are prone to introducing redundant information from the teacher into the student network. Channel-wise knowledge distillation [27] addresses this by normalizing the activation map of each channel for dense prediction tasks, as shown in Fig. 3.
Let the teacher and student networks be denoted as $T$ and $S$, and their activation maps as $m^T$ and $m^S$, respectively. The channel-wise distillation loss can be expressed in general form as:
\begin{equation*} \varphi\big(\phi(m^T), \phi(m^S)\big) = \varphi\big(\phi(m_c^T), \phi(m_c^S)\big), \tag{6}\end{equation*}
\begin{equation*} \phi(m_c) = \frac{\exp\left(\frac{m_{c,n}}{\tau}\right)}{\sum_{n=1}^{W\cdot H} \exp\left(\frac{m_{c,n}}{\tau}\right)}, \tag{7}\end{equation*}
\begin{equation*} \varphi\big(m^T, m^S\big) = \frac{\tau^2}{C} \sum\limits_{c=1}^{C} \sum\limits_{n=1}^{W\cdot H} \phi\big(m_{c,n}^T\big) \cdot \log\left[\frac{\phi\big(m_{c,n}^T\big)}{\phi\big(m_{c,n}^S\big)}\right]. \tag{8}\end{equation*}
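As a concrete reference, here is a minimal PyTorch sketch of Eqs. (7)-(8). Averaging over the batch dimension, and the assumption that teacher and student maps already share the same shape (e.g., via a 1×1 projection), are ours.

```python
import torch
import torch.nn.functional as F

def channel_wise_distillation(m_t: torch.Tensor, m_s: torch.Tensor,
                              tau: float = 1.0) -> torch.Tensor:
    """Channel-wise KD loss, per Eqs. (7)-(8).

    m_t, m_s: teacher and student activation maps of shape (B, C, H, W),
    assumed to have matching shapes.
    """
    b, c, h, w = m_t.shape
    # Eq. (7): softmax over the W*H spatial positions of each channel.
    p_t = F.softmax(m_t.reshape(b, c, -1) / tau, dim=-1)
    log_p_s = F.log_softmax(m_s.reshape(b, c, -1) / tau, dim=-1)
    # Eq. (8): per-channel KL divergence, scaled by tau^2 and averaged
    # over channels (and, here, the batch).
    kl = (p_t * (p_t.clamp_min(1e-12).log() - log_p_s)).sum(dim=-1)
    return (tau ** 2) * kl.sum() / (c * b)
```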
The KL divergence is an asymmetric metric: minimizing the KL divergence between the teacher's and student's channel distributions drives the student's parameters toward the teacher's behavior. Knowledge distillation can be categorized into feature distillation and logit distillation. Since YOLOv8 is a multi-layer model and logit distillation easily transfers the teacher model's prediction uncertainty to the student, this paper opts for feature distillation. The feature distillation loss, summed over the $L$ distilled feature layers, is formulated as follows:
\begin{equation*} \mathcal{L}_{fea} = \alpha \sum\limits_{i=1}^{L} \varphi\big(m_i^T, m_i^S\big), \tag{9}\end{equation*}
\begin{equation*} \mathcal{L} = \mathcal{L}_{orig} + \lambda \mathcal{L}_{fea}. \tag{10}\end{equation*}
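A toy example of wiring Eqs. (9)-(10) into the total loss, reusing channel_wise_distillation from the sketch above; the random stand-in features, the placeholder detection loss, and the value of λ are illustrative assumptions.

```python
import torch

alpha, lam, tau = 0.3, 1.0, 1.0  # alpha per Sec. III-E; lambda is an assumption
# Stand-ins for the L distilled feature layers (random for illustration).
teacher_feats = [torch.randn(2, 64, 40, 40) for _ in range(4)]
student_feats = [torch.randn(2, 64, 40, 40) for _ in range(4)]
l_orig = torch.tensor(1.0)  # placeholder for the detector's own loss

# Eq. (9): weighted sum of the channel-wise loss over the distilled layers.
l_fea = alpha * sum(channel_wise_distillation(t, s, tau)
                    for t, s in zip(teacher_feats, student_feats))
loss = l_orig + lam * l_fea  # Eq. (10)
```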
Experiments
A. Datasets and Evaluation Metrics
We conduct experiments on our proposed Smart Construction Site (SCS) dataset, which is derived from challenging scenes at real construction sites, such as long-distance monitoring and extremely small targets. It contains 3914 images in total (3132 for training, 391 for validation, and 391 for testing), covering 5 detection classes: head, safety helmet, reflective clothing, safety belt, and person. To validate the generalization of MKD-YOLO, we also perform experiments on two public datasets, SHD [28] and CSS [29]. The evaluation metrics include mAP50 (mean Average Precision at an IoU threshold of 0.5), mAP50−95 (mean Average Precision averaged over IoU thresholds from 0.5 to 0.95 with a step of 0.05), the number of parameters (Param.), and the inference speed (FPS).
B. Implementation Details
The proposed MKD-YOLO is implemented in PyTorch 1.12.0 and trained on four RTX 3090 graphics cards. The input images for model training are 640 × 640 × 3, and the number of training epochs, the early-stopping patience, and the weight decay are set to 300, 50, and 0.0005, respectively. The batch size is set to 32. We use the SGD optimizer with an initial learning rate of 0.01 and a momentum factor of 0.937. The confidence threshold is set to 0.5, and the IoU threshold for non-maximum suppression is set to 0.7.
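For reference, a hypothetical optimizer setup matching the reported hyperparameters might look as follows; the stand-in module is an assumption, not the authors' released code.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)  # stand-in for the MKD-YOLO network
# SGD with the reported initial learning rate, momentum, and weight decay.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)
```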
C. Comparison with State-of-the-Art Methods
We compare MKD-YOLO with eight SOTA detection methods, as shown in Table I. On the SCS dataset, MKD-YOLO outperforms Mamba-YOLO (the second-best method) by 4.0% in mAP50 and 2.9% in mAP50−95. Although MKD-YOLO's FPS is lower than that of YOLOv10 (the fastest method), it has the fewest parameters, demonstrating that MKD-YOLO maintains a good balance between detection accuracy and efficiency. On the SHD and CSS datasets, MKD-YOLO performs consistently well, ranking in the top two on all evaluation metrics. These results demonstrate the superiority and robustness of our method.
D. Ablation Studies on C2f-EMSEC, LSPPF and BPNet
To evaluate the effectiveness of each proposed module, we conduct ablation experiments on the SCS dataset; the results are shown in Table II. Relative to the YOLOv8n baseline (V8n-0 in Table II), the introduction of each module significantly improves performance. In particular, V8n-7 improves mAP50 by 4.0% and mAP50−95 by 3.8%, while reducing the number of parameters by 20% and increasing inference speed by nearly 50%. These results verify the effectiveness of the proposed C2f-EMSEC, LSPPF, and BPNet.
E. Ablation Study on Knowledge Distillation
We use the YOLOv8n-based MKD-YOLO (2.43M parameters) as the student model and the YOLOv8x-based MKD-YOLO (53.2M parameters) as the teacher model, with temperature parameter τ = 1. The feature fusion layers of the neck stage (layers 22, 25, 28, and 31 in Fig. 1(a)) are selected for distillation. We experiment with feature loss weights α from 0.1 to 1.0 and find that α = 0.3 achieves the best balance between mAP and speed (FPS); we therefore set α to 0.3 in the final model. Detailed results are shown in Table III. Compared with model V8n-7 before distillation (Table II), the student model's mAP50 improves by 1.3% at the cost of a 16% reduction in speed, while the number of parameters remains unchanged. These results show that detection accuracy can be enhanced without increasing parameters and with only a modest decrease in inference speed, proving the effectiveness of the adopted channel-wise knowledge distillation method.
Conclusion
This paper proposes a novel PPE compliance detection framework, MKD-YOLO, which designs the lightweight C2f-EMSEC, LSPPF, and BPNet for multi-scale, global-contextual, and fine-grained feature enhancement. Compared with SOTA detection methods, MKD-YOLO proves to be a lighter, more robust, and more efficient detector with fewer parameters and faster inference. In future work, we plan to explore more generalized multi-scale feature learning approaches and more effective knowledge distillation methods to further improve the detection precision and efficiency of the proposed model on more general PPE compliance detection tasks.