Introduction
As the core of the whole industrial system, rotating machines play a role of great importance in modern industry. Moreover, the safe and reliable operation of bearings determines, to a large degree, the operation of mechanical systems [1], [2]. However, bearing failures are likely to occur under high-load and strong-impact conditions, leading to the aging of the entire machine and even serious performance or safety losses. With the development of Industry 4.0 and Industrial Internet of Things (IIoT) technology, continuous monitoring and real-time fault diagnosis are indispensable for detecting faults before damage occurs, as well as for providing important support for maintenance [3]–[6].
Traditional mechanical fault diagnosis mainly includes three steps: i) constructing characteristic parameters that represent bearing faults by using advanced signal processing methods, such as wavelet decomposition, wavelet packet decomposition, empirical mode decomposition, variational mode decomposition, spectral kurtosis, and improved variants of these methods [7]–[9]; ii) selecting key feature parameters via dimensionality reduction methods such as principal component analysis and auto-encoders [10]; iii) realizing fault classification through pattern recognition methods, including support vector machines (SVM), decision trees, random forests, and artificial neural networks [11], [12]; a minimal sketch of such a pipeline is given below. However, traditional machine learning methods rely largely on signal processing techniques and diagnostic experience, which makes it difficult to deal with classification or regression problems in complex situations.
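As a concrete illustration of these three steps, the following minimal sketch (not from the paper) combines wavelet-packet energy features, PCA, and an SVM; the wavelet choice, window length, and placeholder data are illustrative assumptions.

```python
# Minimal sketch of the traditional three-step pipeline described above.
import numpy as np
import pywt                                    # PyWavelets
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def wavelet_packet_energy(signal, wavelet="db4", level=3):
    """Step i): energy of each terminal wavelet-packet node as a feature vector."""
    wp = pywt.WaveletPacket(signal, wavelet=wavelet, maxlevel=level)
    nodes = wp.get_level(level, order="freq")
    return np.array([np.sum(node.data ** 2) for node in nodes])

# X_raw: (n_samples, n_points) vibration windows; y: fault labels (placeholders)
X_raw = np.random.randn(100, 2048)
y = np.random.randint(0, 10, size=100)

X = np.vstack([wavelet_packet_energy(s) for s in X_raw])

# Steps ii) and iii): PCA dimensionality reduction followed by an SVM classifier.
clf = make_pipeline(StandardScaler(), PCA(n_components=5), SVC(kernel="rbf"))
clf.fit(X, y)
```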
With the rapid development of advanced measurement technology and the advancement of Industry 4.0 and IIoT technology, massive volumes of data are collected [13], [14]. Owing to high computational complexity, however, traditional machine learning methods fail to establish decision models for these data. Therefore, deep learning has emerged and has been adopted for fault diagnosis. C. L. et al. used a stacked de-noising auto-encoder to identify signal status under environmental noise and fluctuating working conditions [15]; the proposed method achieved high diagnostic accuracy and strong robustness. Y. L. et al. proposed a planetary gear fault diagnosis method based on power-spectral-entropy-based variational mode decomposition and a deep neural network, achieving a nearly perfect fault diagnosis effect [16].
As one of the key branches of deep learning, convolutional neural networks (CNN) have shown excellent performance in bearing fault diagnosis in massive-data contexts. S. G. et al. realized accurate, robust, and general fault diagnosis of rotating machines through the continuous wavelet transform and a CNN [17]. S. S. et al. proposed a multi-signal fault diagnosis method via deep convolutional neural networks (DCNN) that learns from multiple sensor signals to achieve robust and accurate induction motor fault identification [18]. W. Y. et al. adopted a broad convolutional neural network to improve the model's diagnostic performance and incremental learning capability by adding newly generated additional features for self-update, so as to include new abnormal samples and fault classes [19].
The aforementioned CNN methods, however, fail to take account of the computational and storage costs of the target models. Moreover, the CNN structures largely depend on expert experience to obtain the optimal diagnosis model, and the models may be difficult to train because of vanishing gradients as model depth increases under limited samples. Recently, fault diagnosis methods for high-speed devices that demand model deployment and real-time diagnosis have been widely studied, laying the foundation for real-time, rapid diagnosis in the IIoT context [20], [21]. In addition, high-precision diagnostic methods for limited datasets have also been continuously studied [22], [23]. With the aim of improving model accuracy and effectiveness for bearing fault diagnosis while reducing computation and storage costs, a lightweight convolutional neural network (LCNN) for intelligent diagnosis of bearing faults is proposed in this paper. The key questions of this study are how to effectively reduce the model parameters and storage space, and how to construct an optimal diagnostic model that achieves high accuracy. In the proposed LCNN method, a novel decomposed Hierarchical Search Space is introduced to explore the optimal network for bearing fault diagnosis. By introducing depthwise separable convolution in place of traditional convolution, together with the inverted residual structure and the linear bottleneck layer, the computational and storage costs of the model are reduced while its accuracy is improved. The convolution operation is performed on image representations of the fault samples to capture their non-linear structure and fault trends, and a physical interpretation of the extracted features and the model's recognition results is provided through visualization. The main contributions of this paper are as follows:
An LCNN model is constructed via lightweight convolution blocks rather than traditional convolution operations, which substantially improves fault diagnosis accuracy and largely reduces the calculation amount and model storage. Moreover, with significantly high accuracy, it effectively alleviates the serious overfitting and vanishing gradients caused by deepening the model under limited samples.
A novel decomposed Hierarchical Search Space is adopted for model optimization to balance accuracy and parameters. By automatically searching the dataset for the optimal bearing fault diagnosis model through this search space, the constructed LCNN greatly reduces the dependence on expert experience in the model construction process.
Since CNNs are "black boxes", TensorBoard is used to visualize the feature extraction results of each convolutional layer, and the t-distributed stochastic neighbor embedding (t-SNE) method is applied to visualize the learned features in the hidden fully connected layer. In this sense, visualization of the entire model is achieved.
Lightweight Convolutional Neural Network
A. Convolutional Neural Network
As one of the important branches of deep learning, CNNs excel in the field of pattern recognition as a result of their excellent feature capture capabilities [24]. A basic CNN includes an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer. In essence, it constructs multiple filters that convolve and pool the input data layer by layer, extracting features layer by layer [25]. Its unique network structure effectively reduces the number of training parameters, thus reducing the complexity of the network, while guaranteeing invariance to translation, rotation, and scaling to a certain degree.
The convolution layer consists of multiple convolution kernel filters. Each kernel filter convolves with the child nodes of the input layer and outputs the results. Each kernel filter repeatedly acts on its entire receptive field, performs a convolution operation on the pre-processed input feature map, and then uses the activation function to output the convolution result, forming a feature map that extracts local features of the input. Each convolution kernel filter consists of weights and a bias, and its operation is expressed as follows:\begin{equation*} x_{j}^{out}=f_{cov}\left({\sum \nolimits _{i\in M_{j}} {x_{i}^{input}\cdot k_{ij}+b_{j}} }\right)\tag{1}\end{equation*} where $x_{i}^{input}$ is the $i$-th input feature map in the receptive field $M_{j}$, $k_{ij}$ denotes the kernel weights, $b_{j}$ is the bias, $f_{cov}$ is the activation function, and $x_{j}^{out}$ is the $j$-th output feature map.
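To make (1) concrete, the following toy numpy sketch slides one kernel over a single feature map; stride 1, no padding, and the ReLU choice of $f_{cov}$ are illustrative assumptions.

```python
# Toy illustration of the convolution in (1) for a single input map and kernel.
import numpy as np

def conv2d_single(x, k, b):
    """Slide kernel k over input x, add bias b, apply activation f_cov (ReLU)."""
    H, W = x.shape
    Kh, Kw = k.shape
    out = np.empty((H - Kh + 1, W - Kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + Kh, j:j + Kw] * k) + b
    return np.maximum(out, 0.0)        # f_cov chosen as ReLU for the sketch

x = np.random.randn(6, 6)
print(conv2d_single(x, np.ones((3, 3)) / 9, b=0.1).shape)   # (4, 4)
```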
The pooling layer, which is also composed of a convolution kernel filter, is usually set after the convolution layer. The calculation in the pooling kernel filter is not a weighted sum of neuron nodes but a maximum or average operation. The purpose of the pooling layer is to perform secondary feature extraction: by reducing the length, width, and depth of the feature map matrix, the number of parameters is reduced and the calculation rate is increased. Unlike the convolution operation of the convolution layer, a single kernel filter of the pooling layer acts exclusively on nodes at one depth, so the pooling operation samples the input data not only in the length and width directions but also in the depth direction. During this dimensionality reduction, the pooling operation is equivalent to a secondary feature extraction of the input data. The sampling expression of the pooling layer filter is as follows:\begin{equation*} x_{il}^{out}=f\left({x_{ip}^{input},x_{i\left ({p+1 }\right)}^{input},\cdots }\right)\tag{2}\end{equation*}
The fully connected layer is the classification module of the model. It maps the distributed features extracted by the convolution and pooling layers to the target space, transforming them from a high-dimensional space to a low-dimensional one. Because it is fully connected with the previous layer, it has more parameters than the other layers. The feature map of the layer preceding the fully connected layer is flattened ("rolled out") in turn through the convolution operation and then activated by the Softmax function for classification output. The forward propagation of the fully connected layer is expressed as follows:\begin{equation*} a_{j}^{out}=f_{fc}\left({\sum \nolimits _{i\in M_{j}} {a_{i}^{in}\cdot w_{ij}+b_{j}} }\right)\tag{3}\end{equation*}
The fully connected layer adopts the Softmax activation function for mapping to achieve multi-class classification of the data. The Softmax expression is as follows:\begin{equation*} f\left ({z_{i} }\right)=\frac {\exp (z_{i})}{\sum \nolimits _{j=1}^{C} {\exp (z_{j})} }\tag{4}\end{equation*} where $C$ is the number of classes.
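As a small illustration (not from the paper), (4) can be implemented in a numerically stable way by shifting the logits by their maximum before exponentiation:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the class dimension, as in (4)."""
    z = z - np.max(z)          # shifting the logits leaves the result unchanged
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits))         # [0.659 0.242 0.099]; probabilities sum to 1
```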
B. Lightweight Convolutional Neural Network
Traditional CNNs usually increase model depth and complexity in pursuit of high accuracy, but such large, complicated models can hardly be applied in real scenarios such as mobile or embedded devices. As an improved version of the CNN, the LCNN operates with lightweight convolution instead of traditional convolution. While maintaining the performance of the model, the model size is largely reduced and the model speed is increased. It therefore effectively reduces the computation and storage load of the model, which is more conducive to applications on mobile terminals and in the Internet of Things context [26], [27].
1) Depthwise Separable Convolution
Depthwise separable convolution is one of the typical representatives of lightweight convolution, in which different convolution kernels are applied to different input channels. Experiments by F. C. et al. on Xception proved that depthwise separable convolution can be applied to DCNNs on a large scale [28]. In MobileNet, the large-scale use of depthwise separable convolution greatly reduces the number of parameters and calculations, thus accelerating the inference speed of the model [29]; moreover, under the same conditions, the accuracy loss of the model is negligible.
The schematic diagram of traditional convolution and depthwise separable convolution is shown in Fig. 1. In depthwise separable convolution, the convolution process is divided into two steps: depthwise convolution and pointwise convolution. In depthwise convolution, a number of convolution kernels equal to the number of input channels is used for layer-by-layer convolution along the depth of the feature map, but the convolution results are not aggregated; this is thus an extreme grouping convolution process. Pointwise convolution then uses $1\times 1$ convolution kernels to fuse the depthwise outputs across channels and produce the desired number of output feature maps.
The schematic diagram of traditional convolution and depthwise separable convolution: (a) traditional convolution; (b) depthwise separable convolution.
The parameters and calculation amounts of traditional convolution and depthwise separable convolution are shown in Table 1. Suppose that $N$ is the number of output channels, $M$ the number of input channels, $D_{k}\times D_{k}$ the convolution kernel size, and $D_{F}\times D_{F}$ the output feature map size.
As can be seen from Table 1, the ratio of the computation of the two operations is:\begin{equation*} \frac {D_{k}\times D_{k}\times M\times D_{F}\times D_{F}+M\times N\times D_{F}\times D_{F}}{D_{k}\times D_{k}\times M\times N\times D_{F}\times D_{F}}=\frac {1}{N}+\frac {1}{D_{k}^{2}}\tag{5}\end{equation*} so the saving grows with the number of output channels $N$ and the kernel size $D_{k}$.
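A quick numeric check of (5) for an illustrative layer shape (the values of $D_k$, $M$, $N$, $D_F$ are assumptions, not values from the paper) confirms the simplification above:

```python
# Computation-cost comparison from (5) for an illustrative layer shape.
D_k, M, N, D_F = 3, 32, 64, 56     # kernel size, in/out channels, feature map size

traditional = D_k**2 * M * N * D_F**2
depthwise   = D_k**2 * M * D_F**2          # depthwise step
pointwise   = M * N * D_F**2               # 1x1 pointwise step

ratio = (depthwise + pointwise) / traditional
print(ratio)                        # ~0.127: about 8x fewer operations
print(1 / N + 1 / D_k**2)           # identical: the ratio simplifies to 1/N + 1/D_k^2
```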
2) Depthwise Convolution
Depthwise convolution is a special grouping convolution. Grouping convolution was first proposed by Hinton et al. [30]: limited by hardware resources, the networks to be trained could not be processed on a single GPU, so A. K. divided the computation-heavy convolution operation of the convolutional layer into two groups, performed the calculations on two pieces of hardware, and fused the results in subsequent layers. The first step of depthwise separable convolution uses this grouping convolution in its extreme form. The schematic diagram of grouping convolution is shown in Fig. 2. Using the parameters defined above and assuming the number of groups is $g$, the parameters and computation of the grouping convolution are reduced to $1/g$ of those of traditional convolution; depthwise convolution corresponds to the extreme case $g=M$.
The schematic diagram of grouping convolution: (a) traditional convolution; (b) grouping convolution.
3) Pointwise Convolution
When the size of the convolution kernel in the convolution layer is set to 1, the operation is pointwise convolution. A $1\times 1$ kernel acts on every spatial position across all input channels, so it fuses the channel-wise information produced by the depthwise step and can raise or reduce the channel dimension at very low computational cost.
Proposed Intelligent Fault Diagnosis Method
A. Design of the LCNN for Bearing Fault Diagnosis
An LCNN is usually constructed with depthwise separable convolutions, but such a model cannot fully meet the requirements posed by fault diagnosis in the IIoT context. Therefore, the residual structure with residual blocks is introduced into the optimal LCNN model construction [31], since it effectively solves the vanishing gradient problem as the number of layers increases while improving the accuracy of the network.
The process of the standard residual block is presented in Fig. 3 (a). The input first undergoes a $1\times 1$ convolution that compresses the channel dimension, followed by a $3\times 3$ convolution and a $1\times 1$ convolution that restores the dimension; the block output is the element-wise sum of this result and the identity shortcut. The inverted residual block in Fig. 3 (b) reverses this order: it first expands the channel dimension, extracts features with a depthwise convolution in the expanded space, and finally projects the result back to a low dimension.
Residual convolution module: (a) standard residual block; (b) inverted residual block.
Since depthwise convolution cannot change the number of input channels, the dimension in which it extracts features depends on the output of the previous layer. To solve this problem, a $1\times 1$ pointwise convolution is added before the depthwise convolution to expand the channel dimension, and the final $1\times 1$ projection uses a linear activation rather than a non-linear one; this linear bottleneck layer prevents the non-linearity from destroying information in the low-dimensional space.
A basic LCNN module can be constructed from the depthwise separable convolution, the inverted residual structure, and the linear bottleneck layer; by stacking these modules, the basic LCNN is completed. The basic convolution block and the basic LCNN structure are shown in Fig. 4, and a minimal implementation sketch of the block is given below.
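As a minimal Keras sketch of the basic convolution block in Fig. 4 (a), the following code combines the $1\times 1$ expansion, depthwise convolution, and linear $1\times 1$ bottleneck with an identity skip, in the style of MobileNetV2; the expansion factor, kernel size, and filter counts are illustrative assumptions, not the searched values from the paper.

```python
# Sketch of a basic lightweight convolution block: 1x1 expansion, depthwise
# convolution, and a linear 1x1 bottleneck with an identity skip when shapes allow.
from tensorflow.keras import layers

def inverted_residual_block(x, filters, stride=1, expansion=6):
    in_channels = x.shape[-1]
    h = layers.Conv2D(expansion * in_channels, 1, padding="same", use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)                        # expansion layer
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)                        # depthwise convolution
    h = layers.Conv2D(filters, 1, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)             # linear bottleneck: no activation
    if stride == 1 and in_channels == filters:
        h = layers.Add()([x, h])                   # inverted residual skip connection
    return h

# usage: stack blocks after a small input stem
inputs = layers.Input(shape=(64, 64, 1))
x = layers.Conv2D(16, 3, strides=2, padding="same")(inputs)
x = inverted_residual_block(x, filters=16)
```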
The basic convolution block and basic LCNN structure: (a) basic convolution block; (b) basic LCNN module, where conv is a regular convolution and dconv is a depthwise convolution.
B. Network Optimization
In order to construct an optimal LCNN model suitable for bearing fault diagnosis, a novel decomposed Hierarchical Search Space is introduced. That is, the model is decomposed into different blocks, and the operations and block-to-block connection relations are then searched for each block, which allows different layer structures in different blocks [33]. For the widely used depthwise separable convolution, let the input feature map size be $(H, W, M)$; a convolution kernel of dimension $(K, K, M)$ performs the depthwise convolution to output a feature map of size $(H, W, M)$, after which a kernel of dimension $(1, 1, M, N)$ performs the pointwise convolution. Here $(H, W)$ is the input resolution, $M$ and $N$ are the input and output channel numbers, and the total computation cost is:\begin{equation*} H\times W\times M\times \left ({K\times K+N }\right)\tag{6}\end{equation*}
When the overall computing resources are limited, the kernel size $K$ and the number of filters $N$ need to be balanced carefully. For example, to increase the receptive field, the kernel size $K$ must be increased while reducing the number of convolution kernels $N$ in the same layer, or the computation of other layers must be reduced.
Fig. 5 demonstrates the baseline structure of the search space. The LCNN model is divided into a set of predefined blocks in which the input resolution is gradually reduced and the number of convolution kernels is increased. Each block contains a column of identical layers whose operations and connections are determined by the per-block sub-search space. Specifically, the sub-search space for block $i$ consists of the following choices (a configuration sketch follows the list):
Convolution operation: regular convolution, depthwise separable convolution;
Convolution kernel size;
Squeeze-and-Excitation ratio (SE);
Skip operation: pooling, residual block, no skip;
Output filter size $F_{i}$; and the number of layers $N_{i}$.
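A minimal sketch of how one block's sub-search space might be encoded follows; the candidate values are assumptions for illustration, not the exact option sets used in the paper.

```python
# Illustrative encoding of the sub-search space for a single block.
import itertools, random

block_search_space = {
    "conv_op":        ["regular_conv", "depthwise_separable_conv"],
    "kernel_size":    [3, 5],
    "se_ratio":       [0.0, 0.25],               # Squeeze-and-Excitation ratio
    "skip_op":        ["pooling", "residual", "none"],
    "output_filters": [16, 24, 40, 80],          # F_i
    "num_layers":     [1, 2, 3, 4],              # N_i
}

candidates = list(itertools.product(*block_search_space.values()))
print(len(candidates), "candidate configurations for this block")
print(random.choice(candidates))                 # one sampled block configuration
```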
For instance, in Fig. 5, each layer of block 4 has an inverted residual structure with a depthwise convolution and an identity skip connection, and the layer is repeated $N_{4}$ times.
C. Proposed Diagnosis Framework
To sum up, a novel model for bearing fault diagnosis in the IIoT context is proposed in this paper. The LCNN-based fault diagnosis framework is summarized in the following three steps.
Network structure construction. A basic lightweight convolution block is constructed from basic elements such as depthwise separable convolution and the inverted residual block. During LCNN model construction, these basic modules are stacked to form the network.
Network optimization. The novel decomposed Hierarchical Search Space decomposes the above model into different blocks and then searches for the operations and block-to-block connection relations of each block, which allows different layer structures in different blocks. An optimal LCNN model for bearing fault diagnosis is constructed through this network optimization.
Model deployment. The trained LCNN is extracted to construct the fault diagnosis model, which is then deployed on bearings for real-time monitoring and fault diagnosis.
The LCNN-based bearing fault diagnosis process in the IIoT context is shown in Fig. 7. It mainly includes data preprocessing, dataset partitioning, data augmentation, model training, model testing, and model visualization. In the device-level control part of the proposed method, data is first collected, and the system then makes decisions based on the collected data. With reference to [34], this paper deals with the noise in the dataset. During data processing, the original vibration signal is converted into a two-dimensional image through the Gramian Angular Field method [35] (a preprocessing sketch is given below); data preprocessing also includes image normalization and scale transformation. In order to verify the validity of the model, TensorBoard is used as a visualization tool for the model training process and features.
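A minimal sketch of the Gramian Angular Field step using the pyts library; the image size, window shape, and scaling choice are assumptions for illustration.

```python
# Sketch: convert 1-D vibration windows into Gramian Angular Field images.
import numpy as np
from pyts.image import GramianAngularField

segments = np.random.randn(32, 200)        # placeholder: 32 windows of 200 points
gaf = GramianAngularField(image_size=64, method="summation")
images = gaf.fit_transform(segments)       # shape (32, 64, 64), values in [-1, 1]

# scale to [0, 1] and add a channel axis for the CNN input
images = (images + 1.0) / 2.0
images = images[..., np.newaxis]           # shape (32, 64, 64, 1)
```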
Experimental Verification
A. Case Study on Case Western Reserve University Bearing Fault Dataset
1) Experiment Description
As one of the most widely used and representative datasets in mechanical fault diagnosis, the Case Western Reserve University bearing fault dataset is well suited to verifying the characteristics of the proposed method [36]. The bearing test rig is shown in Fig. 8. The test stand consists of a 2-horsepower motor (left), a torque sensor (center), a dynamometer (right), and control electronics (not shown). Fault diameters are set to 7 mils, 14 mils, 21 mils, 28 mils, and 40 mils. Accelerometers are placed at the twelve o'clock position on the drive and fan ends of the motor housing for vibration data collection. Data for normal bearings and for single-point drive-end and fan-end defects are collected. Drive-end bearing data are collected at 12,000 and 48,000 samples/second, and all fan-end bearing data are collected at 12,000 samples/second.
In this paper, the 12k drive-end bearing fault data is selected to construct the bearing fault diagnosis dataset. The vibration data is collected by an accelerometer with a magnetic base installed at the 12 o'clock position of the drive-end housing. The normal data, together with the 7, 14, 21, and 28 mil inner race faults, the 6 o'clock outer race fault, and the ball fault, are selected to construct 10 classes of health status. The dataset is shown in Table 3. Each class of health status consists of 2356 fault samples, and each sample contains 200 original time-signal points; a segmentation sketch is given below.
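A sketch of how the raw drive-end signals could be segmented into the 200-point samples described above; the file path and label assignment are hypothetical.

```python
# Sketch: split a raw vibration signal into fixed-length, non-overlapping samples.
import numpy as np

def segment_signal(signal, length=200, max_samples=2356):
    """Split a 1-D vibration signal into non-overlapping windows of 200 points."""
    n = min(len(signal) // length, max_samples)
    return signal[: n * length].reshape(n, length)

# signal = loadmat("12k_DE/IR007_0.mat")[...]   # hypothetical CWRU file and key
signal = np.random.randn(500_000)               # placeholder signal
samples = segment_signal(signal)                # shape (2356, 200)
labels = np.full(len(samples), 1)               # e.g. class 1 = 7 mil inner race
```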
2) Results
During the experiment, 80% of the samples of the 10 health classes are used as the training set (70% for model training and 10% for cross-validation), and the remaining 20% as the testing set; a sketch of this setup follows. The model is trained with Keras (TensorFlow backend) on a machine with a GeForce RTX 2060 GPU, an Intel i7-8700 CPU, and 16 GB of RAM. To verify the performance of the proposed LCNN model, SVM and traditional CNNs, namely LeNet, AlexNet, a traditional convolutional neural network with the same number of layers as the LCNN (TCNN), ResNet, and ShuffleNet [37], are used for comparison. The diagnosis results are presented in Table 4.
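A sketch of the split and training setup described above, assuming `model` is the LCNN built earlier and `X`, `y` are the GAF images and integer labels (both assumptions for the sketch):

```python
# Sketch: 80/20 train-test split, with 10% of all data reserved for cross-validation.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
history = model.fit(X_train, y_train,
                    validation_split=0.125,   # 10% of all data = 12.5% of the 80%
                    epochs=100, batch_size=64)
print(model.evaluate(X_test, y_test))         # [test loss, test accuracy]
```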
As can be seen in Table 4, the diagnostic accuracy of the LCNN model on datasets A and B is 97.386% and 100%, respectively, significantly higher than that of the other models. Specifically, the fault diagnosis accuracy of SVM on datasets A and B is 94.367% and 97.037%, LeNet's is 13.438% and 10.349%, AlexNet's is 10.382% and 12.157%, TCNN's is 12.462% and 9.585%, ResNet's is 83.285% and 89.587%, and ShuffleNet's is 93.094% and 96.175%. Moreover, LeNet, AlexNet, and TCNN cannot be trained effectively owing to vanishing gradients, which result from the limited number of samples. The training curves of the proposed LCNN, ShuffleNet, ResNet, and LeNet are shown in Fig. 9.
The training curves of different models: (a) LCNN; (b) ShuffleNet; (c) ResNet; (d) LeNet.
It can be seen from Fig. 9 that although the training of the proposed LCNN fluctuates considerably, the training and cross-validation accuracies are both close to 1 once the model stabilizes. Although the training process of ShuffleNet is more stable, its cross-validation accuracy is only about 0.85, so the model clearly under-learns. As can be seen from Fig. 9 (d), the training and cross-validation accuracies of LeNet start at only about 0.1; as the number of iterations increases, the training accuracy continues to improve, but the cross-validation accuracy stays near 0.1. This further verifies that, limited by the sample size, the model suffers from overfitting and vanishing gradients. Since the training curve of AlexNet is similar to that of LeNet, it is not shown here. In order to further verify the recognition performance of the LCNN model, the confusion matrices of the LCNN, ShuffleNet, and ResNet models are shown in Fig. 10, from which it can be seen that the diagnostic accuracy of the LCNN model is the best. In addition, the difference between datasets A and B is mainly caused by the confusion between the inner race fault and the ball fault at the 14 mil fault size in dataset A.
The confusion matrices of different models: (a) LCNN with dataset B; (b) LCNN with dataset A; (c) ShuffleNet with dataset B; (d) ShuffleNet with dataset A.
In the IIoT context, factors such as the calculation amount and storage space largely determine a model's applicability. Table 5 lists in detail other performance indicators, such as the weight storage and parameter counts of the above models. As shown in Table 5, ShuffleNet has the smallest number of parameters and the least storage space, followed by the proposed LCNN model, which is 1–2 orders of magnitude smaller than the other models. In spite of the longest training time, the performance of the LCNN model is not affected, since the model is updated offline. As for the testing time, LeNet takes the shortest, followed by AlexNet, ShuffleNet, LCNN, ResNet, and TCNN. Considering multiple indicators such as diagnostic accuracy, model parameters, and storage, the proposed LCNN model achieves state-of-the-art performance on the Case Western Reserve University bearing fault dataset.
A CNN is usually called a "black box" model. In order to give an intuitive explanation of the model predictions, the t-distributed stochastic neighbor embedding (t-SNE) method is adopted to visualize the learned features in the hidden fully connected layer; a sketch of this step is given below. The visualization results of LCNN, ShuffleNet, and ResNet are presented in Fig. 11. In the hidden fully connected layer, samples under the same fault condition are clearly clustered together and largely separated from other conditions, indicating the good representation capability of the learned feature descriptors. After the non-linear mapping in the classifier, the features of different fault conditions are well separated in the last hidden fully connected layer, despite slight overlap of individual samples, which is consistent with the diagnostic accuracy in Fig. 10. Furthermore, for the data under unseen conditions, a larger overlapping area can be observed for the ResNet model, while the features learned by the LCNN have better separability.
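A sketch of the t-SNE step, assuming the hidden fully connected layer is named `fc_hidden` (a hypothetical identifier) and `model`, `X_test`, `y_test` come from the training sketch above:

```python
# Sketch: embed hidden fully connected layer features in 2-D with t-SNE.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from tensorflow.keras import Model

feature_extractor = Model(model.input, model.get_layer("fc_hidden").output)
features = feature_extractor.predict(X_test)

embedded = TSNE(n_components=2, perplexity=30).fit_transform(features)
plt.scatter(embedded[:, 0], embedded[:, 1], c=y_test, cmap="tab10", s=5)
plt.title("t-SNE of hidden fully connected layer features")
plt.show()
```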
In order to present the features extracted by each convolutional layer and give them a physical interpretation, TensorBoard (TensorFlow, USA) is adopted to visualize the features; a sketch of attaching TensorBoard is given below. Fig. 12 shows the feature maps of some layers. The feature maps indicate that the learned convolution filters are relatively smooth in space, suggesting that training is adequate. In the initial layers, the convolution mainly extracts the outline of the vibration signal, after which the features gradually become more abstract.
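A sketch of attaching the TensorBoard callback during training; the log directory is an arbitrary choice.

```python
# Sketch: log training curves, weight histograms, and images for TensorBoard.
from tensorflow.keras.callbacks import TensorBoard

tb = TensorBoard(log_dir="logs/lcnn", histogram_freq=1, write_images=True)
model.fit(X_train, y_train, validation_split=0.125,
          epochs=100, batch_size=64, callbacks=[tb])
# then inspect with:  tensorboard --logdir logs/lcnn
```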
B. Case Study on MFPT Bearing Fault Dataset
1) Experiment Description
The MFPT bearing fault dataset, another typical dataset for mechanical bearing fault diagnosis, is an important reference for verifying the performance of the proposed method [38]. The dataset was provided by the Society for Machinery Failure Prevention Technology, with the data prepared by Dr. E. B., chief engineer of NRG Systems. It includes data from a bearing test stand (baseline bearing data, outer race failures under various loads, and inner race failures under various loads) and three actual failures.
In order to ensure sample balance, the bearing fault dataset is constructed from 3 baseline conditions (Nor), 7 inner race fault conditions (IR), and 7 outer race fault conditions (OR) to verify the performance of the LCNN. The data consist of 1,757,808 Nor data points, 1,025,388 IR data points, and 1,025,388 OR data points, at a sampling rate of 97,656 samples/second. The images generated from the data total 1800 for Nor, 2100 for IR, and 2100 for OR, with a time step of 0.01 seconds per image. The three types of samples are shown in Fig. 13.
2) Results
In order to verify the performance of the LCNN on the MFPT bearing fault dataset, 70% of the data is randomly selected as the training set, 10% as the validation set for model training, and the remaining 20% for testing. The diagnostic results of each model are shown in Table 6. As can be seen, the recognition rate of the LCNN reaches 99.92%. LeNet, AlexNet, and TCNN can be trained directly, but their diagnostic performance is extremely poor, AlexNet and TCNN in particular; vanishing gradients likely make these models difficult to train with the limited data samples.
Other performance indicators of the models are shown in Table 7. The model parameters and storage, as well as the training and testing times, are consistent with those in Section IV(A). After comprehensively considering multiple indicators such as diagnostic accuracy, model parameters, and storage, the proposed LCNN achieves state-of-the-art performance on the MFPT bearing fault dataset as well, which further validates the advantages of the proposed model for bearing fault diagnosis.
The t-SNE results of the LCNN, ShuffleNet, and ResNet models are shown in Fig. 14. The three types of samples are clearly separated, and the classification boundaries and regions of the LCNN are more distinct, further verifying its superior recognition accuracy. The features extracted by the convolution layers are shown in Fig. 15. In the initial stage of convolutional feature extraction, the convolution layers mainly extract the outline of the vibration signal, after which the features gradually become more abstract.
Conclusion
This paper proposes a novel LCNN model for intelligent bearing fault diagnosis. The proposed method comprises three main steps. First, the LCNN is constructed on the basis of basic operations such as depthwise separable convolution, the inverted residual structure, and the linear bottleneck. Second, the novel decomposed Hierarchical Search Space decomposes the model into different blocks and searches for the operations and block-to-block connection relations of each block to automatically explore the optimal LCNN for bearing fault diagnosis in the IIoT context. Finally, the model is trained and the learned deep features are input into the Softmax classifier to achieve accurate and stable diagnosis of bearing faults.
The proposed LCNN is applied to the fault diagnosis of experimental bearing vibration data. The results demonstrate that the method overcomes the dependence of traditional machine learning models on handcrafted feature extraction, and at the same time addresses traditional CNNs' dependence on large sample sizes, low diagnostic accuracy, and large storage and calculation costs. It is therefore more effective and robust than current intelligent diagnostic methods. Its high accuracy and small storage and calculation costs make the proposed LCNN well suited to the IIoT context. How to optimize the training and testing time of the model, however, remains a problem to be considered and solved in future work.