
Bearing Fault Diagnosis Based on Natural Adaptive Moment Estimation Algorithm and Improved Octave Convolution



Abstract:

Fault diagnosis of rolling bearings has been a focus of research. Bearing signals are often accompanied by similar information, resulting in redundancy between data. Moreover, rolling bearings are often used in situations with large background noise, so extracting the characteristic values of the rolling bearing signal and removing noise from the signal are of great significance. This paper presents a fault diagnosis model combining the NAdam (Natural Adaptive Moment Estimation) algorithm and improved octave convolution. First, a natural exponential decay function is proposed to replace the exponential decay function in the parameter update of Adam (Adaptive Moment Estimation). Compared with the exponential decay function, the natural exponential decay function accelerates the convergence of the model. The internal structure of octave convolution is then improved; the improved structure enhances feature extraction and eliminates data redundancy. Finally, dilated gate convolution layers are used to filter and classify the data. In simulation tests on the Case Western Reserve University data set and a laboratory power equipment data set, the accuracy reaches more than 98%. Experiments with variable load and different signal-to-noise ratios are carried out to verify the noise resistance and generalization performance of the proposed method.
Published in: IEEE Access ( Volume: 8)
Page(s): 196790 - 196803
Date of Publication: 27 October 2020
Electronic ISSN: 2169-3536


SECTION I.

Introduction

Rolling bearings are widely used as important components of mechanical equipment in the industrial field. A rolling bearing must bear the weight of the rotating machine and ensure the normal operation of both the bearing and the machine [1]. However, the bearing inevitably vibrates during operation. When wear, corrosion, or fatigue occurs, the vibration of the bearing is exacerbated, reducing its working efficiency and even causing casualties [2]–[4].

The working environment of a bearing is complicated and often accompanied by background noise [5], which greatly affects bearing signal analysis. Therefore, in the fault diagnosis of rolling bearings, effectively extracting the characteristic information of the bearing signal and filtering out the noise are important tasks to be studied.

In traditional learning methods, features are often extracted by hand on the basis of expert experience, and classifiers are then used to classify faults. Hand-crafted features tend to limit the accuracy of feature recognition and have great limitations. Wavelet transform [6] and Fourier transform [7] are therefore used to replace manual experience in feature extraction, and SVM [8], Bayes [9], and other classifiers are used for classification. The wavelet transform can observe the local characteristics of the signal and has a certain filtering effect on noise, but it produces redundancy. The Fourier transform computes a weighted average of the signal over the time domain [10], so it cannot provide sufficient time-domain information, and its sensitivity to abrupt signals is poor. Wu et al. [11] proposed a method combining the Fourier transform and the wavelet transform to filter noise in the signal, but its accuracy is not very high. To obtain a network with high precision and good generalization performance, both effective feature extraction and a high-performance classifier are needed. Deng et al. [12] proposed a motor bearing fault diagnosis method integrating the empirical wavelet transform, fuzzy entropy, and SVM, and achieved high accuracy. Mbo’O and Hameyer [13] used linear discriminant analysis to evaluate features and employed Bayesian classifiers to perform fault diagnosis; the proposed method can distinguish damaged bearings from normal bearings. Li et al. [14] proposed a feature extraction and classification method combining the wavelet scattering transform and twin support vector machines, using the scattering transform for time-domain analysis and SVM to classify the training data.

With the continuous development of artificial intelligence, deep learning has gradually been applied to text analysis [15], speech synthesis [16], and image classification [17]. In contrast to traditional methods, deep learning integrates feature extraction and classification into one model. CNNs are effective at feature extraction and can also act as classifiers. Sadoughi and Hu [18] introduced physical knowledge into the neural network by encoding physical information about the bearing and its fault characteristics into the network for analysis and testing. Wen et al. [19] proposed a method for automatically extracting features, performing a dimensional transformation of the data on the LeNet-5 model structure and extracting the feature values in the data. A simple network structure can avoid overfitting but also limits the achievable accuracy. With the introduction of deep network structures, an increasing number of deep neural networks are used in fault diagnosis. Zhou et al. [20] proposed a bearing fault diagnosis model based on an improved stacked recurrent neural network that uses a gating unit to solve the vanishing-gradient problem. Zhuang et al. [21] proposed a network model consisting of dilated convolution, gate convolution, and a residual network; the dilated convolution enlarges the local receptive field, thereby increasing the receiving domain of the convolution kernel. Fan et al. [22] analyzed the structure of octave convolution and proposed the same multi-frequency method for octave transposed convolution.

Deep learning is widely used in various fields. However, the working environment of bearings is complex, and background noise affects the training of the network model. At the same time, factors such as parameter settings and the choice of optimization algorithm affect the training accuracy of the network model.

When training deep learning networks, many optimization strategies are commonly used, such as learning rate decay [23] and gradient optimization [24]. Bello et al. [25] proposed adding noise to linear cosine decay to increase the randomness and exploration of the process to a certain extent. An et al. [26] proposed an exponentially decayed sine-wave learning rate to learn the parameters in the network; only a small number of iterations are required to achieve high accuracy and speed up training. In learning rate decay methods, selecting an appropriate learning rate is critical: if the learning rate is too large, the network will not converge, and if it is too small, the network will converge too slowly. In gradient optimization, the batch gradient descent method [27] and the momentum method [28] are often used. Li et al. [29] proposed a small-batch gradient separation algorithm, which solves the problem of minimizing data reconstruction errors. Tang et al. [30] used Nesterov momentum instead of traditional momentum and combined it with a deep belief network; this method speeds up training and improves accuracy. In recent years, optimization algorithms have been improved to better fit networks to data. However, noise interference and information redundancy often occur in the data. To obtain an excellent deep learning model, not only an appropriate algorithm for optimizing the network parameters but also a well-designed network structure is required.

In view of the problems in the above model structures, the NAdam (Natural Adaptive Moment Estimation) algorithm is proposed and combined with an improved octave convolution for bearing fault diagnosis. The contributions of the proposed method are as follows:

  1. The optimization algorithm Adam uses an exponential decay moving average to update the gradient values. The natural exponential decay function is simpler than the exponential decay function and converges faster. Therefore, the NAdam algorithm is proposed to optimize the network parameters, reduce memory usage, and shorten calculation time.

  2. Octave convolution eliminates data redundancy by using down-sampling, up-sampling, and convolution operations for feature extraction and dimensionality reduction. However, some data are lost during down-sampling and up-sampling. Dilated convolution is therefore used instead of the up-sampling, down-sampling, and convolution operations to prevent data loss.

  3. The working environment of rolling bearings is complex and the background noise is large, which affects feature extraction. Gate convolution is added to the dilated convolutional layers of the network structure to form dilated gate convolutional layers and eliminate noise interference.

The remaining parts of this paper are organized as follows. Section 2 explains the NAdam algorithm in detail. Section 3 introduces the structure and improvement of octave convolution. Section 4 introduces the proposed network model and provides further explanation. In Section 5, two different data sets are used to verify the noise resistance and generalization of the proposed method, and visualization is performed. Section 6 presents the conclusion and future work.

SECTION II.

NAdam Algorithm

A. Adam Algorithm

In building a deep learning network model for bearing fault diagnosis, selecting an appropriate learning rate is important when training model parameters because the size of the learning rate directly affects the convergence rate of the network.

Adam (Adaptive Moment Estimation) [31] is an optimization algorithm proposed by Kingma and Ba in 2015. Adam is a very efficient stochastic optimization algorithm that requires only a small amount of memory, leaving memory available for the calculation of other parameters. The algorithm adaptively adjusts the learning rate of each parameter and updates the parameters using the 1^{st} and 2^{nd} moment estimates of the gradient. Adam, a combination of Momentum and RMSprop (Root Mean Square prop) [32], uses momentum as the parameter update direction and adaptively adjusts the learning rate.

In the Adam algorithm, the gradient at iteration t is computed first. A mini-batch of m samples \left \{{x^{\left ({1 }\right)},x^{\left ({2 }\right)},\cdots,x^{\left ({m }\right)} }\right \} is drawn randomly from the training set, where y^{\left ({i }\right)} is the true value for x^{\left ({i }\right)} . The gradient g_{t} is:\begin{equation*} g_{t}=\frac {1}{m}\nabla _{\theta _{t-1}}\sum \nolimits _{i} {L\left ({f\left ({x^{\left ({i }\right)};\theta _{t-1} }\right),y^{\left ({i }\right)} }\right)}\tag{1}\end{equation*}

where \nabla _{\theta _{t-1}} denotes the gradient with respect to \theta _{t-1} , and f\left ({x^{\left ({i }\right)};\theta _{t-1} }\right) is a stochastic scalar function at time t-1 . The randomness may originate from the evaluation of random samples of data points or from function noise.

On the one hand, the exponentially weighted moving average M_{t} (the 1^{st} moment estimate) of the gradient g_{t} is calculated. On the other hand, the exponentially weighted moving average G_{t} (the 2^{nd} moment estimate) of the squared gradient g_{t}^{2} is calculated:\begin{align*} M_{t}=&\beta _{1}M_{t-1}+\left ({1-\beta _{1} }\right)g_{t} \\ G_{t}=&\beta _{2}G_{t-1}+\left ({1-\beta _{2} }\right)g_{t}\odot g_{t}\tag{2}\end{align*}


\beta _{1} and \beta _{2} are the decay rates of the two moving averages, set to \beta _{1}=0.9 and \beta _{2}=0.99 .

The initial values of M_{t} and G_{t} are usually set to M_{0}=0 and G_{0}=0 , which biases M_{t} and G_{t} toward 0 in the early stage of training. The bias of the 1^{st} and 2^{nd} moment estimates must therefore be corrected:\begin{align*} \widehat {M_{t}}=&M_{t}/\left ({1-\beta _{1}^{t} }\right) \\ \widehat {G_{t}}=&G_{t}/\left ({1-\beta _{2}^{t} }\right)\tag{3}\end{align*}


Finally, the parameter update value is:\begin{align*} \Delta \theta _{t}=&-\frac {\gamma }{\sqrt {\widehat {G_{t}}+\delta }}\widehat {M_{t}} \\ \theta _{t}=&\theta _{t-1}+\Delta \theta _{t}\tag{4}\end{align*}


Here, the learning rate is \gamma =0.001 , and the constant \delta =10^{-6} is added to the denominator to prevent division by zero.
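For concreteness, the Adam update of Eqs. (1)–(4) can be sketched in a few lines of NumPy. This is a minimal illustration; the toy objective and variable names are placeholders, not part of the paper.

```python
import numpy as np

def adam_step(theta, grad, M, G, t, gamma=0.001,
              beta1=0.9, beta2=0.99, delta=1e-6):
    """One Adam update following Eqs. (2)-(4); the step counter t starts at 1."""
    M = beta1 * M + (1 - beta1) * grad           # 1st moment estimate, Eq. (2)
    G = beta2 * G + (1 - beta2) * grad * grad    # 2nd moment estimate, Eq. (2)
    M_hat = M / (1 - beta1 ** t)                 # bias correction, Eq. (3)
    G_hat = G / (1 - beta2 ** t)
    theta = theta - gamma * M_hat / np.sqrt(G_hat + delta)   # update, Eq. (4)
    return theta, M, G

# toy usage: minimise f(theta) = theta^2, whose gradient is 2*theta
theta, M, G = 1.0, 0.0, 0.0
for t in range(1, 201):
    theta, M, G = adam_step(theta, 2 * theta, M, G, t)
```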

B. Improved Adam Algorithm

The calculation of the Adam algorithm involves exponential decay [33]. Exponential decay is commonly used to adjust the learning rate in neural networks, and the learning rate determines the convergence rate toward the optimal solution. With an exponential decay schedule, a large learning rate is set first so that the network approaches the optimal solution quickly. As the number of iterations increases, the learning rate gradually decreases, which makes the model more stable during the iterations and the training of the optimal solution. The exponential decay function is:\begin{equation*} \alpha _{t}=\alpha _{0}\beta ^{t}\tag{5}\end{equation*}


where t is the iteration number, \alpha _{0} is the initial learning rate, \alpha _{t} is the learning rate at iteration t , and \beta =0.96 is the decay rate.

Most popular methods for adaptively adjusting the learning rate, such as the RMSprop and AdaDelta algorithms, use exponential decay to obtain the learning rate. Besides the exponential decay function, adjusting the learning rate with the natural exponential decay function also works very well in practice.

The natural exponential decay [34] function trains the network faster than the exponential decay function. In contrast to exponential decay, natural exponential decay is based on e , which makes the calculation faster and allows the network to reach convergence sooner. The natural exponential decay function is:\begin{equation*} \alpha _{t}=\alpha _{0}\exp \left ({-\beta t }\right)\tag{6}\end{equation*}


where \beta =0.96 is the decay rate and t is the iteration number.

The learning rates produced by the exponential decay function and the natural exponential decay function are visualized in a simple experiment to compare their convergence rates intuitively.
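A minimal sketch of that comparison, assuming an initial learning rate of 0.001 and β = 0.96 as in Eqs. (5) and (6); the iteration range is illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

alpha0, beta = 0.001, 0.96
t = np.arange(0, 100)

exp_decay = alpha0 * beta ** t           # Eq. (5): exponential decay
nat_decay = alpha0 * np.exp(-beta * t)   # Eq. (6): natural exponential decay

plt.plot(t, exp_decay, label="exponential decay")
plt.plot(t, nat_decay, label="natural exponential decay")
plt.xlabel("iteration t")
plt.ylabel("learning rate")
plt.legend()
plt.show()
```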

As shown in Fig. 1, the learning rate under natural exponential decay decreases faster than under exponential decay, which speeds up the convergence of the model during training. In the Adam algorithm, the exponential decay moving averages in the bias correction of the first and second moments are replaced with natural exponential decay moving averages, yielding a new optimization algorithm, NAdam. The algorithm saves parameter calculations and improves the convergence speed of the network. In the calculation, the update values of \widehat {M_{t}} and \widehat {G_{t}} are smaller than those obtained with the exponential function but more accurate, which helps avoid overfitting. The specific algorithm flow is as follows:

FIGURE 1. Comparison of attenuation functions.

Algorithm 1 NAdam. Good Default Settings for the Tested Machine Learning Problems Are \alpha=0.001,\beta_{1}=0.9,\beta_{2}=0.99 and \delta={10}^{-6}

Require:

\beta _{1},\beta _{2}\in [0,1): Exponential decay rates for the moment estimates

Require:

\mathrm {f}\left ({\theta _{t-1} }\right) : Stochastic objective function with parameters \theta _{t-1}

Require:

\theta : Initialize parameter vector

M_{0}\leftarrow 0 (Initialize 1^{st} moment vector)

G_{0}\leftarrow 0 (Initialize 2^{nd} moment vector)

\mathrm {t}\leftarrow 0 (Initialize time step)

While \theta _{t} not converged do

t\leftarrow t+1 (Update time)

g_{t}\leftarrow \frac {1}{m}\nabla \theta _{t-1}\sum \nolimits _{i} {L\left ({f\left ({x^{\left ({i }\right)};\theta _{t-1} }\right),y^{\left ({i }\right)} }\right)} (Update gradient)

M_{t}\leftarrow \beta _{1} M_{t-1}+\left ({1-\beta _{1} }\right)g_{t} (Update biased first moment estimate)

G_{t}\leftarrow \beta _{2} G_{t-1}+\left ({1-\beta _{2} }\right)g_{t}\odot g_{t} (Update biased second raw moment estimate)

\widehat {M_{t}}=\frac {M_{t}}{1-e^{-\beta _{1}t}} (Bias-corrected first moment estimate)

\widehat {G_{t}}=\frac {G_{t}}{1-e^{-\beta _{2}t}} (Bias-corrected second moment estimate)

\Delta \theta _{t}\leftarrow -\frac {\gamma }{\sqrt {\widehat {G_{t}}+\delta }}\widehat {M_{t}} (Update parameters)

\theta _{t}\leftarrow \theta _{t-1}+\Delta \theta _{t} (Update parameters)

end while

return \theta _{t} (Resulting parameters)
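A sketch of one NAdam step in NumPy, following Algorithm 1. The natural-exponential bias correction shown here is our reading of the algorithm, and the variable names are illustrative only.

```python
import numpy as np

def nadam_step(theta, grad, M, G, t, gamma=0.001,
               beta1=0.9, beta2=0.99, delta=1e-6):
    """One NAdam update as in Algorithm 1; the step counter t starts at 1."""
    M = beta1 * M + (1 - beta1) * grad            # biased 1st moment estimate
    G = beta2 * G + (1 - beta2) * grad * grad     # biased 2nd moment estimate
    M_hat = M / (1 - np.exp(-beta1 * t))          # natural-exponential bias correction
    G_hat = G / (1 - np.exp(-beta2 * t))
    theta = theta - gamma * M_hat / np.sqrt(G_hat + delta)
    return theta, M, G
```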

SECTION III.

Improved Octave Convolution

In network training, the NAdam algorithm adjusts the learning rate to adapt to the update of the network parameters, so the learning rate is largest at the start of training. Besides updating the hyperparameters, the network structure also needs to be improved so that the characteristic information of the bearing data can be recognized well in a noisy environment.

For noisy one-dimensional bearing fault data, it is difficult to extract rich feature information with the dot-product operation of one-dimensional convolution. To extract more feature information, the dimensionality of the space is usually changed: mapping low-dimensional data into a higher-dimensional space displays the features better, which is conducive to feature extraction and improves classification accuracy [35]. To reduce the amount of calculation, this paper converts the one-dimensional data into a two-dimensional space for computation.

Convolutional neural networks are generally used for model training, and the convolution operation is the core of the network structure. Convolution sums the product of two variables over a certain range. Commonly used forms are one-dimensional and two-dimensional convolution:

one-dimensional:\begin{equation*} \mathrm {y}\left ({t }\right)=g\left ({k }\right)\ast x\left ({k }\right)=\int _{-\infty }^{\infty }{g\left ({k }\right)x\left ({t-k }\right)dk}\tag{7}\end{equation*}

two-dimensional:\begin{align*} \mathrm {y}\left ({x,y }\right)=&g\left ({u,v }\right)\ast x\left ({u,v }\right) \\=&\iint _{-\infty }^{\infty }{g\left ({u,v }\right)x\left ({x-u,y-v }\right)dudv}\tag{8}\end{align*}

Among them, g\left ({\cdot }\right) is the filter, x\left ({\cdot }\right) is the signal sequence.

According to these formulas, convolution relates two functions across two domains, usually the time and frequency domains, which reduces the computational workload. Convolution can also be interpreted as the area of overlap of two curves, which can be used for feature enhancement, and it has a smoothing effect on data, so it is commonly used in data processing. On this basis, a convolution kernel performs the convolution calculation on two-dimensional data and is often used for feature extraction. Different types of convolution kernels, such as edge-detection and blurring kernels, extract different kinds of feature information.

In neural networks, ordinary convolution operations are as follows:\begin{equation*} \mathrm {Y}_{p,q}=\sum \limits _{i,j\in N_{k}} W_{i+\frac {k-1}{2},j+\frac {k-1}{2}}^{T} X_{p+i,q+j}\tag{9}\end{equation*}

where N_{k}=\left \{{\left ({i,j }\right):i\in \left \{{-\frac {k-1}{2},\cdots,\frac {k-1}{2} }\right \},j\in \left \{{-\frac {k-1}{2},\cdots,\frac {k-1}{2} }\right \} }\right \} , W is a convolution kernel of size k\times k (generally k\geq 3 ), and \left ({p,q }\right) denotes the location coordinate.

In deep learning, a certain degree of similarity exists between feature maps; this spatial redundancy can be compressed further. Octave convolution [36] is a convolution structure proposed by Chen et al. in 2019. It differs from ordinary convolution in that it uses four ordinary convolutions operating on data of different frequencies.

Ordinary convolution operates on the entire data set. If the data are divided into high-frequency and low-frequency parts, an ordinary convolution must first bring the low-frequency data to the same resolution as the high-frequency data and then connect them with the high-frequency part. This incurs additional computation and memory and affects the transmission speed of the network. In octave convolution, four different convolution kernels operate between the high- and low-frequency data.

Octave convolution uses Gaussian convolution kernel in scale space theory [37], which divides the data into high- and low-frequency components. The high-frequency component refers to the original channel after Gaussian filtering. The low-frequency component refers to the channel compressed by Gaussian filtering. High-frequency components generally represent the details of the feature, and low-frequency components represent the contour information of the feature [38]. High- and low-frequency components are mapped to different groups and are converted through the corresponding convolution kernel to reduce spatial redundancy.

First, the structural parameters are defined: X^{H} and X^{L} are the high-frequency and low-frequency input components, respectively, and Y^{H} and Y^{L} are the high-frequency and low-frequency components of the convolution output. The high-frequency weight parameter is W^{H}=[W^{H\to H},W^{L\to H}] , and the low-frequency weight parameter is W^{L}=[W^{L\to L},W^{H\to L}] . W^{L\to H} is the weight parameter that converts low-frequency data into high-frequency data; the low-frequency component needs to be up-sampled during this information update. W^{H\to L} is the weight parameter that converts high-frequency data into low-frequency data by down-sampling the high-frequency component. Finally, convolution is performed with the corresponding parameters to obtain the convolution output:\begin{align*} Y_{p,q}^{H\to H}=&\sum \limits _{i,j\in N_{k}} {W_{i+\frac {k-1}{2},j+\frac {k-1}{2}}^{H\to H}}^{T} X_{p+i,q+j}^{H} \tag{10}\\ Y_{p,q}^{L\to H}=&\sum \limits _{i,j\in N_{k}} {W_{i+\frac {k-1}{2},j+\frac {k-1}{2}}^{L\to H}}^{T} X_{\left ({\frac {p}{2}+i }\right),\left ({\frac {q}{2}+j }\right)}^{L} \tag{11}\\ Y_{p,q}^{L\to L}=&\sum \limits _{i,j\in N_{k}} {W_{i+\frac {k-1}{2},j+\frac {k-1}{2}}^{L\to L}}^{T} X_{p+i,q+j}^{L} \tag{12}\\ Y_{p,q}^{H\to L}=&\sum \limits _{i,j\in N_{k}} {W_{i+\frac {k-1}{2},j+\frac {k-1}{2}}^{H\to L}}^{T} X_{\left ({2p+0.5+i }\right),\left ({2q+0.5+j }\right)}^{H} \tag{13}\\ Y_{p,q}^{H}=&Y_{p,q}^{H\to H}+Y_{p,q}^{L\to H} \tag{14}\\ Y_{p,q}^{L}=&Y_{p,q}^{L\to L}+Y_{p,q}^{H\to L}\tag{15}\end{align*}


Y_{p,q}^{H} and Y_{p,q}^{L} are the high-frequency and low-frequency outputs of the convolution operations, and their sum is the final convolution output. After the down-sampling operation, the convolution is equivalent to a strided convolution, which is less precise; therefore, average pooling is used in the W^{H\to L} path so that the final result is more precise. However, during the down-sampling and up-sampling in the network, data loss occurs. This problem is solved by replacing the up-sampling, down-sampling, and pooling operations with dilated convolution, whose dilate rate increases the receptive field of the network.
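As a rough illustration of Eqs. (10)–(15), a minimal PyTorch sketch of the original octave convolution is given below. The channel split ratio, layer names, and nearest-neighbour up-sampling are our assumptions rather than the paper's exact implementation, and the paper's improvement replaces the up/down-sampling shown here with dilated convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OctaveConv(nn.Module):
    """Minimal octave convolution: four convolutions exchange information
    between a high-frequency branch and a half-resolution low-frequency branch."""
    def __init__(self, in_ch, out_ch, alpha=0.5, k=3):
        super().__init__()
        in_lf, out_lf = int(alpha * in_ch), int(alpha * out_ch)
        in_hf, out_hf = in_ch - in_lf, out_ch - out_lf
        pad = k // 2
        self.h2h = nn.Conv2d(in_hf, out_hf, k, padding=pad)
        self.h2l = nn.Conv2d(in_hf, out_lf, k, padding=pad)
        self.l2h = nn.Conv2d(in_lf, out_hf, k, padding=pad)
        self.l2l = nn.Conv2d(in_lf, out_lf, k, padding=pad)

    def forward(self, x_h, x_l):
        # high-frequency output: H->H plus up-sampled L->H (Eq. 14)
        y_h = self.h2h(x_h) + F.interpolate(self.l2h(x_l), scale_factor=2, mode="nearest")
        # low-frequency output: L->L plus down-sampled H->L (Eq. 15)
        y_l = self.l2l(x_l) + self.h2l(F.avg_pool2d(x_h, 2))
        return y_h, y_l

# usage: a 16-channel high branch at 40x40 and a 16-channel low branch at 20x20
x_h, x_l = torch.randn(1, 16, 40, 40), torch.randn(1, 16, 20, 20)
y_h, y_l = OctaveConv(32, 32)(x_h, x_l)
```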

In Fig. 2, the green arrows indicate information updates, and the red arrows indicate information exchange between the two frequencies. The high- and low-frequency features are output through the four different convolutions. Finally, the resolution of the low-frequency feature is raised by setting the dilate rate, and it is added to the high-frequency feature to obtain the final output information.

FIGURE 2. Octave convolution.

SECTION IV.

Network Model

A bearing fault diagnosis model is constructed by combining NAdam and the improved octave convolution. The improved octave convolution preprocesses the data to reduce redundancy and extract characteristic information. Three dilated gate convolution layers form the main network structure: the dilated convolution increases the convolution receptive field, allowing the network to receive additional information, and the gate convolution filters noise. Finally, the dense layers, combined with the NAdam algorithm, perform the final classification of the data.

The network model structure is shown in Fig. 3. Because the network structure is two-dimensional, the input has shape [3, 30, 40], which is converted from the [1200, 1] time series. The first three layers use dilated gate convolution to deepen the network model and enhance feature extraction. Each layer uses a different number of filters, and the dilated convolution reduces the dimension of the feature map and increases the receptive field. Each layer also uses gate convolution to filter noise during feature extraction. Finally, two dense layers are connected, and the associations among the features are extracted and mapped into the output space.

FIGURE 3. Model structure.

A. Dilated Gate Convolution

Dilated convolution was proposed by Yu and Koltun [39] and was originally used in context analysis. With the development of big data, dilated convolution has been applied to different fields. The calculation process of the dilated convolution is different from the ordinary convolution. Common convolution operations use convolution and pooling to reduce the dimensionality and increase the receptive field. The dilated convolution only needs to adjust the corresponding dilated rate to achieve the same effect, which saves time for network calculations and improves the calculation speed. The specific calculation is as follows:\begin{equation*} \mathrm {Receptive ~field~ size}=2(Dilate ~rate-1) \ast \left ({k-1 }\right)+k\end{equation*}

where the dilate rate is the dilation factor and k is the size of the convolution kernel.

Fig. 4(A) shows a 3\times 3 convolution with a dilate rate of 1; the dilated convolution is then identical to a normal convolution. Fig. 4(B) shows a 3\times 3 convolution with a dilate rate of 2: only the 9 red points in the figure have non-zero weights, the rest are 0, and the receptive field of the convolution increases to 7\times 7 . Fig. 4(C) shows a dilated convolution with a dilate rate of 4, and the receptive field increases to 15\times 15 . Compared with the linear growth of traditional convolution operations, the receptive field of dilated convolution increases exponentially.

FIGURE 4. Dilated convolution.

Gate convolution uses the gate structure from the LSTM network. The gate structure selectively passes information: the output of a gate is a value between 0 and 1 that describes how much information can pass through. Gate convolution uses this property to filter noise in the signal, allowing only the effective information to pass through the gate structure.
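A minimal sketch of such a dilated gate convolution in PyTorch; the GLU-style formulation below (a feature branch multiplied by a sigmoid gate) and the size-preserving padding are our assumptions about how the gate is realised.

```python
import torch
import torch.nn as nn

class DilatedGateConv2d(nn.Module):
    """Dilated convolution whose output is modulated by a sigmoid gate."""
    def __init__(self, in_ch, out_ch, k=3, dilation=2):
        super().__init__()
        pad = dilation * (k - 1) // 2          # keeps the spatial size unchanged
        self.feature = nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation)
        self.gate = nn.Conv2d(in_ch, out_ch, k, padding=pad, dilation=dilation)

    def forward(self, x):
        # gate values in (0, 1) decide how much of each feature passes through
        return torch.relu(self.feature(x)) * torch.sigmoid(self.gate(x))

y = DilatedGateConv2d(3, 32, dilation=2)(torch.randn(1, 3, 30, 40))
print(y.shape)  # torch.Size([1, 32, 30, 40])
```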

B. Data Enhancement

Before network training, data augmentation is used to avoid over-fitting caused by the small number of samples. Bearing data are time series, so a sliding window is used for data augmentation [40]. In Fig. 5, a sampling window of fixed sequence length is moved along the time axis with step size S, so that adjacent windows overlap and the signal is resampled.
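A sketch of this sliding-window augmentation; the window length of 1200 matches the samples used later, while the step size and signal length are assumptions for illustration.

```python
import numpy as np

def sliding_window(signal, length=1200, step=300):
    """Cut an overlapping set of fixed-length samples out of one long signal."""
    n = (len(signal) - length) // step + 1
    return np.stack([signal[i * step: i * step + length] for i in range(n)])

samples = sliding_window(np.random.randn(120000))
print(samples.shape)   # (397, 1200) for this step size
```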

FIGURE 5. Sliding window.

SECTION V.

Experimental Analysis

A. Data Introduction

1) Case Western Reserve University Bearing Data

CWRU (Case Western Reserve University) bearing data are widely used in various industries as benchmark data for bearing fault diagnosis. The CWRU test bench [41] is composed of a 2 HP motor, an encoder, and a dynamometer. The test bearings are SKF6205 motor bearings. Test data are collected from acceleration sensors installed at the motor drive end and the fan end. According to the motor load, the data are divided into 0 HP, 1 HP, 2 HP, and 3 HP load data. The sampling frequency is 12 kHz. The experiment selects the drive-end data listed in TABLE 1. In the actual experiment, the three possible locations of bearing failure are the inner ring, the outer ring, and the rolling element. For each location, fault data with damage diameters of 7 mils, 14 mils, and 21 mils are collected at the drive end to verify the reliability of the proposed method. Four hundred samples are collected for each fault type, and every 1200 signal points form one sample. Therefore, the test data contain 10 kinds of signals: the normal signal and nine fault signals.

TABLE 1. CWRU Bearing Failure Data Set.

2) Data of the Driveline Diagnostic Simulator

The DDS (Driveline Diagnostic Simulator) test bench shown in Fig. 6 is used to verify the performance of the model. The bench is a complete power-system device composed of a variable-speed drive motor, an encoder, a torsion sensor, a main gearbox, a parallel gearbox, a programmable magnetic brake, and acceleration sensors [42]. The accuracy of the acceleration sensors is 0.089 g/(m\cdot s^{2}) . The test rig collects data on outer-ring and inner-ring faults of rolling bearings as well as rolling-element data. The data have practical significance for studying the noise and vibration characteristics of gearboxes.

FIGURE 6. DDS test bench.

B. Experimental Verification

1) Model Introduction

The proposed model is compared with CNN, LSTM, and CNN + LSTM models. The CNN and LSTM networks use the NAdam optimization function in the simulation experiments. Because the network structure of the proposed method is relatively simple, the number of network layers in TABLE 2 is kept consistent with the number of NDilted-CNN layers to make the models comparable. The CNN uses a three-layer network model; each convolutional layer is followed by a pooling layer to reduce the dimensionality and increase the receptive field, and the number of filters, convolution kernels, and other parameters are consistent with the model parameters mentioned in this article. The LSTM network adopts a three-layer structure; for bearing data with a time-series character, LSTM exhibits good performance. The CNN + LSTM network uses two convolutional layers and two LSTM layers, with a pooling layer after each convolutional layer. The proposed NDilted-CNN model adopts three dilated gate convolutional layers with dilate rates of 1, 2, and 4. The specific network parameters are shown in TABLE 2.

TABLE 2. Structure of the Model.

The NDilted-CNN network structure is relatively simple. The numbers of filters in the three dilated gate convolutional layers are 32, 24, and 16, respectively, and each layer uses a 3×3 convolution kernel. As the dilate rate increases, the receptive field of the convolution kernel also increases. Dense1 has 100 units and, according to the failure types in the data set, Dense2 has 10 units. The D-Conv layers and the final Dense layer use the relu and softmax activation functions, respectively.
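Based on the layer sizes just listed (filters 32/24/16, 3×3 kernels, dilate rates 1/2/4, Dense layers of 100 and 10 units with relu and softmax), a rough PyTorch sketch of the NDilted-CNN backbone is given below. The gated-convolution form, the size-preserving padding, and the flattening into the dense layers are our assumptions, and the improved octave-convolution preprocessing stage is omitted.

```python
import torch
import torch.nn as nn

def gated_block(in_ch, out_ch, dilation):
    """Dilated convolution paired with a parallel sigmoid gate (assumed form)."""
    pad = dilation  # keeps 3x3 convolutions size-preserving
    conv = nn.Conv2d(in_ch, out_ch, 3, padding=pad, dilation=dilation)
    gate = nn.Conv2d(in_ch, out_ch, 3, padding=pad, dilation=dilation)
    return conv, gate

class NDiltedCNN(nn.Module):
    def __init__(self, num_class=10):
        super().__init__()
        chans, rates = [3, 32, 24, 16], [1, 2, 4]
        self.convs, self.gates = nn.ModuleList(), nn.ModuleList()
        for i, r in enumerate(rates):
            c, g = gated_block(chans[i], chans[i + 1], r)
            self.convs.append(c)
            self.gates.append(g)
        self.dense1 = nn.Linear(16 * 30 * 40, 100)  # input reshaped to [3, 30, 40]
        self.dense2 = nn.Linear(100, num_class)

    def forward(self, x):
        for conv, gate in zip(self.convs, self.gates):
            x = torch.relu(conv(x)) * torch.sigmoid(gate(x))
        x = x.flatten(1)
        x = torch.relu(self.dense1(x))
        return torch.softmax(self.dense2(x), dim=1)

out = NDiltedCNN()(torch.randn(8, 3, 30, 40))
print(out.shape)  # torch.Size([8, 10])
```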

At the same time, the 4 models use the same hyperparameters: length = 1200, batch = 200, Lr = 0.001, train-test_rate = [0.7,0.3], Epochs = 100, num_class = 10, BatchNorm = True.

Selecting different parameters will affect the convergence of the entire network. In TABLE 3, the convergence of the network model is analyzed for different learning rates and whether to use Batch Normalization.

TABLE 3. Model Training Schedule.

It can be seen from TABLE 3 that the learning rate and Batch Normalization have a certain impact on the convergence of the network. When the learning rate is too large or too small, the network becomes unstable, which slows down its convergence. In this paper, a suitable learning rate is obtained from manual experience; in later work, the learning rate could be adjusted adaptively.

Batch Normalization keeps the input distribution of each layer of the neural network stable and at the same time speeds up network training. In deep learning, it is mostly used in deep network structures. In TABLE 3, because the proposed network has only three convolution layers, Batch Normalization has little effect on the model.

Although changing the learning rate and Batch Normalization occasionally makes the results of the network model oscillate, TABLE 3 shows that, despite the different parameter settings, the proposed model quickly reaches convergence. At the same time, the training accuracy reaches more than 98%, which demonstrates that the proposed method has good robustness.

In this experiment, all the tested algorithms are written in Python language and run on a computer with Windows 10 system, Intel Core i7-7700 processor and 16GB RAM.

2) CWRU Data Set

Case Western Reserve University divides the bearing data into four groups: drive-end data sampled at 12,000 and at 48,000 points per second, fan-end data sampled at 12,000 points per second, and normal data. As an example, 1200 sampling points are randomly selected from each group of data, and the signals are shown in Fig. 7.

FIGURE 7. CWRU original signal.

The drive-end and fan-end data in the figure contain large fluctuations, which are abnormal conditions caused by wear or corrosion. The signal amplitude of the normal data hardly fluctuates.

Fig. 8 compares the accuracy and loss curves of the Case Western Reserve University bearing data under the proposed model. Under the NDilted-CNN structure, the accuracy rises and, after dipping five times, reaches a converged state. The loss value declines from the beginning of training and gradually stabilizes once the model converges. The experiments with small batches of data achieve convergence in only a few iterations, demonstrating the reliability of the proposed model; in practical applications this also saves calculation time.

FIGURE 8. Loss value and accuracy curve.

Since the network structure of the proposed method is relatively simple, it is compared with the traditional CNN and LSTM structures. From the model comparison in TABLE 4, the training accuracy of octave convolution combined with the CNN and LSTM networks is not high. The CNN has the shortest testing time, but its accuracy is only 70.6%; the LSTM has the longest training time, yet its accuracy is only 70.9%; and the accuracy of the CNN + LSTM network is only 75.24%. The dilated gate convolution layer increases the receptive field of the convolution kernel and at the same time filters the noise in the signal well, extracting more characteristic information from the signal. In the experimental comparison, the accuracies of Dilted-CNN and NDilted-CNN are significantly higher than those of the other methods, exceeding 98%, and the proposed NDilted-CNN has the shortest training time, with an accuracy of 99.99%.

TABLE 4. Model Training Schedule.

As shown in Fig. 9, the CNN and LSTM networks remain in an oscillating state during training, fluctuating around an accuracy of 0.65, and after 20 iterations they still have not converged. In terms of training time, the CNN has an obvious advantage and takes the shortest time, whereas the LSTM requires the longest training time and occupies most of the memory because of its large number of parameters. The CNN + LSTM network lies between the CNN and the LSTM. Nevertheless, its training accuracy is only 0.75, far from the required high accuracy.

FIGURE 9. Model comparison diagram.

In the Dilted-CNN and NDilted-CNN structures, the accuracy is higher than in the three other models. As shown in Fig. 10, for a more intuitive comparison, the Dilted-CNN and NDilted-CNN models are compared separately. Dilted-CNN is the improved octave convolutional network model with Adam as its optimization algorithm. During training, Dilted-CNN reaches an accuracy of 0.9 within six iterations; the accuracy gradually stabilizes after 14 iterations and reaches a high precision of 0.98. When the NDilted-CNN model is trained, the convergence speed improves significantly and the accuracy of the network reaches 1. Hence, the proposed NAdam optimizer accelerates the convergence of the network.

FIGURE 10. Comparison diagram before and after the improved optimization algorithm.

FIGURE 11. T-SNE visualizes CWRU data.

T-SNE visualization is performed to observe the changing structure of the data intuitively; T-SNE reduces the dimensionality of the data for visualization. The input data are disorganized. After octave convolution, the data are effectively separated and aggregated, although some classes are still not separated. After the features are further extracted by the dilated gate convolutional layers, the data features are completely separated. Hence, the proposed model can effectively learn fault characteristics and perform fault classification.

3) DDS Data Verification

Compared with the Case Western Reserve University data, the DDS data in Fig. 12 have larger background noise. Noise is also added artificially during data collection to test the anti-noise performance of the network model. The comparison of the four models in Fig. 13 shows the effect of noise on the different networks. Compared with the Case Western Reserve University data, the accuracy on the DDS data fluctuates over a larger range, and the convergence speed of all four models decreases. Although the CNN, LSTM, and CNN + LSTM networks remain in an oscillating state, their accuracy values are not very different from those in the Case Western Reserve University experiment.

FIGURE 12. DDS original signal diagram.

FIGURE 13. Model comparison diagram.

The comparison of the accuracy curves of Dilted-CNN and NDilted-CNN in Fig. 14 shows that Dilted-CNN reaches an accuracy of 0.9 within 18 iterations, whereas NDilted-CNN reaches 0.9 within six iterations. Both models keep rising, and their accuracies remain above 0.9. Hence, the proposed dilated gate convolutional layer has a good filtering effect on noise.

FIGURE 14. Comparison diagram before and after the improved optimization algorithm.

T-SNE is used to visualize the DDS data. As shown in Fig. 15, only part of the features are separated after the octave convolution operation, and some data are mixed together and difficult to distinguish. After the dilated gate convolutional layers, the data features are clearly separated. This finding verifies the filtering effect of gate convolution on noise.

FIGURE 15. T-SNE visualizes DDS data.

4) Variable Load Experiment

To verify the generalization of the model, data under different loads are used for experimental verification. The comparison of the four models on CWRU data under different loads is shown in Fig. 16. Group A contains the drive-end and fan-end data at the 12K sampling rate; Group B contains the drive-end data at the 12K and 48K sampling rates; and Group C contains the 12K fan-end and 48K drive-end data. Under these different loads and sampling conditions, the proposed model maintains high accuracy and fast convergence and reaches an accuracy of more than 0.99.

FIGURE 16. Different load verification of CWRU data.

Fig. 17 shows a histogram of training accuracy at different speeds of the DDS data, where DDS-A and DDS-B are data at constant speeds of 1120 and 1350 rpm, respectively, and DDS-AB is the variable-speed data composed of A and B. From the histograms of the four models at different speeds, the accuracy of the CNN and LSTM networks does not change much when the noise and speed of the data set change. The training accuracy of the CNN + LSTM network is improved, and the proposed NDilted-CNN model reaches a high accuracy of 0.97.

FIGURE 17. Different load verification of DDS data.

5) Experiment With Different SNR

A good model must not only generalize but also achieve high accuracy under different SNR (signal-to-noise ratio) conditions. Gaussian noise and mixed noise with different signal-to-noise ratios are added to the original CWRU data to verify the anti-noise performance of the model. Gaussian noise with different SNRs is added to the CWRU data; as shown in Fig. 18, as the SNR increases, the amplitude of the signal also increases.
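A sketch of how Gaussian noise at a prescribed SNR can be added to a signal; the scaling follows the usual dB definition of SNR, and the exact noise-generation procedure of the experiment is an assumption.

```python
import numpy as np

def add_gaussian_noise(signal, snr_db):
    """Add white Gaussian noise so the result has the requested SNR in dB."""
    p_signal = np.mean(signal ** 2)                     # signal power
    p_noise = p_signal / (10 ** (snr_db / 10))          # noise power from SNR
    noise = np.random.randn(len(signal)) * np.sqrt(p_noise)
    return signal + noise

noisy = add_gaussian_noise(np.sin(np.linspace(0, 100, 1200)), snr_db=2)
```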

FIGURE 18. Spectrograms of different SNR.

In Fig. 19, the four models are simulated and verified at different SNRs. As the SNR increases, the training accuracy of the models decreases. However, the accuracy of the proposed model decreases more slowly than that of the three other models and remains above 0.94.

FIGURE 19. Comparison of models with different SNR.

Fig. 20 shows the frequency-domain diagrams of the normal signal and the mixed signal, where Gaussian noise and impulse noise together form the mixed noise.

FIGURE 20. Spectrograms of mixed signal.

Fig. 21 compares the four models under mixed noise with different signal-to-noise ratios. Under the mixed noise signals, the accuracy of the NDilted-CNN model reaches 93%, which is much higher than that of the other three models. Compared with Fig. 19, the proposed model performs well in both the mixed-noise and single-noise experiments. Fig. 22 shows the convergence of the accuracy curves under Gaussian noise and mixed noise when the SNR is 2.

FIGURE 21. Comparison of models with different SNR in mixed signal.

FIGURE 22. Comparison of models with different SNR.

FIGURE 23. T-SNE diagram with SNR of 2.

FIGURE 24. T-SNE diagram with SNR of −4.

FIGURE 25. T-SNE diagram with SNR of −10.

As can be seen from Fig. 22, when the input signal contains noise, the training accuracy of the network model is affected to a certain extent. After 12 iterations, the accuracy of both models reaches 0.9 and gradually stabilizes; after 30 iterations, both networks converge. For different types of background noise, the proposed method therefore achieves rapid convergence and stability.

The T-SNE visualizations show the data feature distributions when the SNR is 2, −4, and −10. As the SNR increases, the data features are separated better.

SECTION VI.

Conclusion

This paper proposes a new optimization algorithm combined with an improved octave convolution network model for rolling bearings that operate under high background noise and whose features are difficult to extract. In the proposed model, the NAdam algorithm improves the convergence speed of the network, so the model reaches the fitting state faster. The improved octave convolution and the dilated gate convolution layers effectively extract data features, with accuracy reaching 0.98 or more in the simulation tests on the Case Western Reserve University data set and the power equipment data set. In the experiments with variable loads and different SNRs, the proposed model still maintains high precision and fast convergence, which verifies its generalization and anti-noise performance.

In this paper, the parameters that affect network convergence are selected from manual experience. Therefore, the automatic search for optimal network parameters needs further research. In addition, how to analyze the convergence of the proposed network from a theoretical perspective is a problem worth studying.
