
Fault Diagnosis of Industrial Robot Based on Multi-Source Data Fusion and Channel Attention Convolutional Neural Networks


The structure of the multi-source data fusion and channel attention convolutional neural network (MD-CA-CNN).


Abstract:

Industrial robots are prone to failure due to harsh working environments, which affects movement accuracy. The fault diagnosis of industrial robots has become an indispensable part of robot collaborative maintenance in intelligent manufacturing. Most existing diagnostic methods only use a single data source, and the diagnostic accuracy will be affected due to signal acquisition errors and noise interference. This paper proposes a multi-source data fusion and channel attention convolutional neural network (MD-CA-CNN) for fault diagnosis of multi-joint industrial robots. The network takes the time-domain data and time-frequency domain data of the vibration signal, torque signal, and current signal of the six joints of the robot as input. Then, we realize the diagnosis of the faults by using a Softmax classifier layer after the two parts of feature extraction and feature fusion. In addition, a channel attention mechanism is developed. It acts on the two parts of feature extraction and feature fusion, respectively. It assigns weights to different source data and weights to time-domain and time-frequency domain features. Finally, we established a test bench to compare the proposed method with the deep learning algorithm that only uses multi-data source fusion, the deep learning algorithm that only uses a single data source, and the commonly used machine learning algorithm. The results show that the MD-CA-CNN model proposed in this paper has the highest accuracy and stability, reaching 95.8% ± 0.39%, which verifies the method's effectiveness.
Published in: IEEE Access ( Volume: 12)
Page(s): 82247 - 82260
Date of Publication: 28 May 2024
Electronic ISSN: 2169-3536

SECTION I.

Introduction

Industrial robots have long been used to automate manufacturing processes to improve productivity, quality, and safety [1]. Due to their advantages of high flexibility, cost efficiency, ample working space, high repeatability, and multiple functions, they have been widely used in modern manufacturing industries such as automobile assembly, the chemical industry, and aerospace [2]. The transmission system is the core of an industrial robot. In actual industrial production, the motor, reducer, and other components in the industrial robot transmission system may fail due to overload, prolonged operation, or lack of maintenance, affecting the stability of the robot's joints. In severe cases, the robot cannot continue operating, and the production task is interrupted. Condition monitoring and fault diagnosis of industrial robots can reduce unplanned downtime on highly automated production lines, saving the associated costs [3]. Therefore, it is meaningful to study a highly accurate fault diagnosis method for industrial robots so that appropriate maintenance strategies can be arranged and maintenance costs reduced.

In the industrial field, there are three main ways to diagnose equipment faults: knowledge-based, model-based, and data-driven [4]. Knowledge-based diagnostics require expert experience to identify the status of the device. Chen et al. realized composite fault identification of rolling bearings by adaptive extraction of resonant frequency bands [5]. Anouar et al. applied wavelet analysis to rotating machinery monitoring [6]. However, fault detection is a real-time process, and this approach requires one or more specialized personnel to assess machine performance, which is impractical in modern industrial production [7]. Model-based diagnostic methods typically analyze the correct behavior of each component in a robotic system, build a mathematical or physical model, and then identify faults by comparing the model's expected output with the actual output of the robotic system [4]. Kim et al. proposed a new phase-based time-domain averaging (PTDA) method to detect industrial robot reducer faults [8]. Sun et al. realized fault diagnosis of actuators based on mathematical models built from the six free-space motion equations of autonomous underwater vehicles [9]. Sabry et al. proposed an energy model-based method for monitoring industrial robots [10]. Muradore and Fiorini proposed a fault detection and isolation method based on industrial robot models and signal processing [11]. Although model-based diagnostic methods have been widely researched in fault diagnosis, complex nonlinear characteristics and extremely variable operating conditions limit their use for industrial robots with complex structures. With the development of sensing technology, data-driven robot fault diagnosis methods have emerged in large numbers. No detailed mechanistic knowledge is required; faults are diagnosed by analyzing and extracting features from sensor signals collected under different health states [12]. Machine learning (ML) has been widely used to diagnose equipment failures. Guo et al. proposed an improved random forest (IRF) algorithm that reselects and weights highly accurate and heterogeneous decision trees through hierarchical clustering. They used digital twin and transfer learning technology to diagnose faults on the physical production line and verified the algorithm through a case study of an automobile rear axle assembly line [13]. Xu et al. proposed a bolt-loosening detection method for industrial robot joints based on electromechanical modeling and motor current signature analysis (MCSA). They established the kinetic equations coupling the motor current and bolt loosening, extracted the time-frequency characteristics of the motor current, and then used SVM to recognize the bolt loosening [14]. Izagirre et al. proposed a network architecture, together with a practical implementation, for industrial robot data acquisition and predictive maintenance. They use SVR or ELM to detect degraded states of robot joints, and the architecture is implemented by extracting torque signals from a PLC on a real automotive assembly line [1]. Raouf et al. used statistical analysis of three-phase current signals for feature engineering to detect and diagnose six faults in the transmission mechanisms of industrial robots. They investigated a variety of algorithms, and the accuracy assessment showed that the SVM model was the most accurate [15]. Pan et al. used vibration signals to detect gap failures in industrial robot joints. The Wigner-Ville distribution (WVD) of the vibration signal is used to extract health features, which are then fed as inputs to an artificial neural network (ANN) algorithm [16].

As a particular data-driven method, the deep learning (DL) model has demonstrated superior capabilities in adaptive feature extraction and fault classification through multilayer nonlinear transformations [17]. With the advent of the artificial intelligence era and powerful computing support, deep learning methods have become a research hotspot in industrial robot fault diagnosis [18]. Zhi et al. designed a convolutional neural network and long short-term memory (CNN-LSTM) network to diagnose harmonic reducer faults in industrial robots [19]. Yildirim et al. captured sound signals of various faults in an industrial robot. They used wavelet analysis to process and denoise the captured sound signals, extracted relevant features, and trained five neural networks for noise analysis and classification [20]. He et al. proposed the multi-scale hybrid convolutional neural network (MSMCNN). They used the vibration signals of robot joints for adaptive feature extraction, effectively extracted comprehensive and complementary weak fault features, and realized the diagnosis of the harmonic reducers of multiple robot joints [21]. Park et al. proposed a fault detection model for edge computing. They built an edge device that integrates data collection, processing, storage, and analysis, and then detected six types of industrial robot faults using an LSTM network [22].

In recent years, multi-source information fusion has emerged in fault diagnosis. Due to signal acquisition errors and noise interference, a single data source may be insufficient in some cases. In addition, since each type of signal perceives the environment from a different aspect, multi-source data fusion can provide more useful and accurate information than a single-data-source approach, especially for complex systems [23]. For example, Miao et al. proposed a new channel-based convolutional neural network with feature enhancement (CWNN-FA) for fault diagnosis of wheeled mobile robots. This structure uses multi-heterogeneous sensor data such as IMUs and encoders as input [24]. Gültekin et al. proposed a data fusion method based on a convolutional neural network, using the sound and vibration signals of the motor to detect operational faults occurring in automatic vehicles (ATVs) [25]. Liu et al. proposed a fault diagnosis method for a chain jack hydraulic system based on multi-source sensor data fusion. The proposed method utilized pressure, temperature, and flow data under different operating conditions [26]. Cui et al. proposed a new multi-task multi-sensor fusion network (M2FN) to improve fault diagnosis performance. The method used convolutional neural networks to extract and fuse features from raw vibration and current signals [27]. All of the above methods have been shown experimentally to achieve higher diagnostic accuracy with multiple data sources than with a single data source.

The attention mechanism is a method that imitates the human visual and cognitive system. In recent years, applications of attention mechanisms in deep learning, especially in convolutional neural networks, have been proposed in many works. For example, Du et al. used an efficient channel-attentive deep dense convolutional neural network to achieve automatic classification of esophageal diseases in gastroscopy images [28]. In the field of fault diagnosis, Huang et al. proposed a multi-scale convolutional neural network with channel attention (CA-MCNN). In CA-MCNN, maximum pooling and average pooling layers are used to extract multi-scale information from the bearing signals, which increases the dimensions of the input, and a channel attention mechanism is introduced to increase the feature learning ability of the convolutional layers by adaptively scoring and assigning weights to the learned features [29]. Di et al. proposed a machine tool fault diagnosis method based on a multiscale-channel attention network (MSCANet). MSCANet effectively integrates the vibration signal characteristics of machine tool spindles in different directions, extracts hierarchical features of the vibration signals using multi-scale structures, and adaptively fuses features at different scales through a channel attention (CA) mechanism, thus improving the accuracy of tool wear status diagnosis [30]. Huang et al. proposed a multi-scale convolutional neural network for bearing fault diagnosis. This network obtains time-frequency representations from vibration time-domain signals as inputs to the model and uses convolution kernels of different sizes to extract multi-scale information from the time-frequency images. In addition, an attention mechanism is established to adaptively select features of different scales for classification, emphasizing key features and weakening redundant features [31]. Tong et al. proposed a coordinated attention (CA) model suitable for one-dimensional vibration signals and established a lightweight coordinated attention convolutional neural network (ACNN), which takes data from multiple vibration sensors as input. They also designed a kurtosis-weighted fusion strategy and, based on the ACNN and the weighted fusion strategy, proposed a rolling bearing fault diagnosis method based on multi-sensor ACNN [32]. The above methods all demonstrate that the attention mechanism can improve the performance and generalization ability of convolutional neural networks. However, these fault diagnosis methods still only utilize the time-domain or time-frequency-domain signals of a single vibration data source, and the attention mechanism is limited to different scales of a single data source.

Although various industrial robot fault diagnosis methods have improved diagnostic accuracy, they only use a single data source, such as vibration, torque, or current. Extracting discriminative features from multi-source data to provide accurate and reliable diagnosis is still challenging [27]. Therefore, this paper proposes an industrial robot fault diagnosis method based on multi-source data fusion and channel-attention convolutional neural networks, which can detect and diagnose industrial robot drive train faults. The time-domain signals contain the original feature information, and the time-frequency-domain signals contain both time-domain and frequency-domain information. An industrial robot is a coupled system, so a change in one joint causes responses in the other joints. Thus, we input the time-domain and time-frequency-domain representations of each joint's original vibration, current, and torque signals. 1D-CNN and 2D-CNN extract deep features, and we realize multi-source data fusion through channel connection. At the same time, we add a channel attention module that adaptively assigns weights to the different source data and to the time-domain and time-frequency-domain features. Finally, we concatenate the extracted features of all joints, and the diagnosis result is obtained through a fully connected layer. Because this approach leverages multiple data sources, it can diagnose the condition of industrial robots more accurately and robustly. The innovation of this paper in using the channel attention mechanism in convolutional neural networks is to combine it with multi-source information fusion: the mechanism is applied twice in succession, first to assign different weights to the different source data and then to assign weights to the time-domain and time-frequency-domain features, respectively. This study uses a 6-axis industrial robot as the experimental object. Comparison with other advanced fault diagnosis methods demonstrates the superiority of the proposed MD-CA-CNN. In summary, the main contributions of this paper are:

  1. We realize the fusion of the robot joints' vibration, current, and torque data by adopting CNN channel superposition. We obtain the time-frequency-domain image of the signal using CWT and extract the deep features of the original time-domain signal and the time-frequency-domain signal using 1D-CNN and 2D-CNN simultaneously. This avoids the problem of a single data source being susceptible to noise and provides more comprehensive information about the fault characteristics.

  2. We propose a CNN structure based on multi-source data and channel attention mechanism, namely MD-CA-CNN. A channel attention mechanism is developed and added to the CNN input port and deep feature channel fusion. It can adaptively assign weights to different source data and time-domain and time-frequency-domain features to improve the contribution of essential channels to fault diagnosis.

The rest of the paper is organized as follows: Section II details the overall framework of the fault diagnosis system for industrial robots and the proposed diagnostic approach. Section III describes the details of the experimental setup and the collected experimental data used to evaluate the proposed methodology, and Section IV gives the experimental results and comprehensive analysis. Finally, Section V summarizes the study.

SECTION II.

Proposed Methodology

A. Framework of the Diagnostic System

This study first proposes a general architecture of an industrial robot fault diagnosis method. As shown in Fig. 1, the proposed method has four phases. In the fault data acquisition phase, vibration, current, and torque signals are collected from each joint of the six-axis industrial robot.

FIGURE 1. The overall diagnostic framework of MD-CA-CNN.

In the data preprocessing stage, a sliding window method segments the data along the time dimension. The sliding window method is a commonly used data augmentation method in fault diagnosis, which can significantly increase the number of training samples. As shown in Fig. 2, a time window of length $w$ is used to segment the data, the step size is set to $s$, and the total number of data points is $T$. The formula for the number of training samples $N$ is shown in (1), and the result is rounded down to the nearest integer.\begin{equation*} N=\frac {T-w}{s}+1 \tag {1}\end{equation*}

FIGURE 2. The method of sliding window.
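As a concrete illustration of the segmentation in (1), the following minimal NumPy sketch splits one recorded segment into overlapping training samples. The function name and defaults are ours; the window length and step match the values reported later in Section III.

```python
import numpy as np

def sliding_window_segment(signal: np.ndarray, w: int = 512, s: int = 200) -> np.ndarray:
    """Split a 1-D signal of T points into N = floor((T - w) / s) + 1
    overlapping windows of length w with step s, as in Eq. (1)."""
    T = signal.shape[-1]
    n_windows = (T - w) // s + 1
    # Stack the windows into an (N, w) array.
    return np.stack([signal[..., i * s: i * s + w] for i in range(n_windows)], axis=0)

# Example with the settings reported in Section III: a 5120-point segment,
# window length 512 and step 200 yield 24 samples per segment.
segment = np.random.randn(5120)
samples = sliding_window_segment(segment, w=512, s=200)
print(samples.shape)  # (24, 512)
```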

The signals acquired in each time window are then converted into time-frequency images using the continuous wavelet transform (CWT), which reduces the noise effect and reveals fault-related features. The time-frequency images generated by CWT contain rich time-domain and frequency-domain information. The original time-domain signals also contain the most essential fault characterization information. Both are helpful for diagnosing robot faults, so both the time-domain and time-frequency-domain signals are used as input samples for model training.

In the model training stage, the processed data are randomly divided into three independent datasets, i.e., training set, validation set, and testing set, in the ratio of 7:2:1. The structure of the MD-CA-CNN model was determined by pre-experimentation. The proposed MD-CA-CNN model is trained, evaluated, and optimized offline using the training and validation set samples and their corresponding labels. The loss function is computed, and the model weights are updated according to the backpropagation principle. The test set samples are used to test the accuracy of the MD-CA-CNN model after training.
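To make the training stage concrete, below is a minimal PyTorch sketch of the offline training loop described above. The dataset here is random stand-in data, TinyModel is a placeholder for the MD-CA-CNN, and all hyperparameters (batch size, learning rate, number of epochs) are our assumptions rather than the authors' settings.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

# Stand-in data: 1000 samples of 3-channel time-domain windows and matching
# CWT images, with 7 health conditions (N, F1-F6). Real inputs would come
# from the preprocessing stage described above.
x_time = torch.randn(1000, 3, 512)
x_tf = torch.randn(1000, 3, 64, 64)
labels = torch.randint(0, 7, (1000,))
dataset = TensorDataset(x_time, x_tf, labels)

# 7:2:1 split into training, validation, and test sets.
n = len(dataset)
n_train, n_val = int(0.7 * n), int(0.2 * n)
train_set, val_set, test_set = random_split(dataset, [n_train, n_val, n - n_train - n_val])
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32)

class TinyModel(nn.Module):
    """Placeholder for MD-CA-CNN: flattens both inputs and classifies."""
    def __init__(self, n_classes: int = 7):
        super().__init__()
        self.fc = nn.Linear(3 * 512 + 3 * 64 * 64, n_classes)

    def forward(self, xt, xf):
        return self.fc(torch.cat([xt.flatten(1), xf.flatten(1)], dim=1))

model = TinyModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(5):
    model.train()
    for xt, xf, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(xt, xf), y)   # forward pass and loss
        loss.backward()                      # backpropagation
        optimizer.step()                     # weight update
    model.eval()
    correct = 0
    with torch.no_grad():
        for xt, xf, y in val_loader:
            correct += (model(xt, xf).argmax(dim=1) == y).sum().item()
    print(f"epoch {epoch}: validation accuracy = {correct / len(val_set):.3f}")
```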

In the fault diagnosis stage, the trained MD-CA-CNN model is deployed, and the robot field data is collected as input to the model using the same time window length as the data preprocessing stage to obtain the output results. Then, the diagnostic results are used to provide decision support for maintenance.

B. Continuous Wavelet Transform

In the field of signal analysis, the commonly used STFT can localize events in time. However, because the window size is fixed, it has the same time and frequency resolution at all frequencies, and the length of the window function is also difficult to determine. Therefore, STFT is more suitable for smooth signals with small frequency fluctuations than for non-smooth signals with large frequency fluctuations. CWT is an adaptive time-frequency analysis method because it introduces a wavelet function as the basis function. It can automatically adjust the window size according to the frequency and can more clearly detect the mutation points and oscillating parts of the signal. Therefore, it is better suited to the analysis of fault signals. The CWT of the signal $x(t)$ is shown below:\begin{equation*} X_{\omega }(a,b)=\frac {1}{\left |{ a }\right |}\int _{-\infty }^{\infty }{x(t)}\cdot \bar {\psi }\left ({\frac {t-b}{a}}\right)dt \tag {2}\end{equation*} where $\psi \left ({ \frac {t-b}{a} }\right)$ is the mother wavelet and $\bar {\psi }\left ({ \frac {t-b}{a} }\right)$ is its complex conjugate; $a$ is the scale factor, $b$ is the translation factor, and both $a$ and $b$ are arbitrary real numbers; $X_{\omega }\left ({ a,b }\right)$ is the wavelet coefficient.
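For illustration, a time-frequency map like the ones used as 2D-CNN inputs can be computed from one signal window with PyWavelets. The Morlet wavelet and the number of scales chosen here are our assumptions, not necessarily the settings used in the paper.

```python
import numpy as np
import pywt

def cwt_image(signal: np.ndarray, fs: float = 1024.0, n_scales: int = 128,
              wavelet: str = "morl") -> np.ndarray:
    """Continuous wavelet transform of a 1-D signal, returning the magnitude
    of the coefficients |X_w(a, b)| as a 2-D time-frequency map."""
    scales = np.arange(1, n_scales + 1)
    coeffs, _ = pywt.cwt(signal, scales, wavelet, sampling_period=1.0 / fs)
    return np.abs(coeffs)  # shape: (n_scales, len(signal))

window = np.random.randn(512)   # one 512-point window from the sliding-window step
tf_map = cwt_image(window)
print(tf_map.shape)             # (128, 512)
```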

C. Convolutional Neural Network

In recent years, 2D-CNN has occupied an essential position in the fields of computer vision and image recognition. However, 2D-CNN is a viable option for only some applications involving 1D signals. To address this, 1D-CNN has been proposed, and it has achieved state-of-the-art performance in several applications, such as biomedical data categorization and early diagnosis, as well as power electronics and motor fault detection [33]. The principles of 1D-CNN and 2D-CNN are the same; the main difference is the dimension along which the convolution kernel moves. The convolution kernels of 1D-CNN are convolved only along the time-step order, whereas the convolution kernels of 2D-CNN are convolved along the image's horizontal and vertical axes. The MD-CA-CNN proposed in this paper is based on both 1D-CNN and 2D-CNN.

The structure of CNN mainly consists of alternating convolutional layers, nonlinear layers, pooling layers, and fully connected output layers.

Convolutional layers are the most important part of a CNN. Each convolutional layer contains multiple convolution kernels. Unlike the traditional fully connected layer, the convolutional layer performs convolution operations through convolution kernels, so the input to each node is only a part of the neurons of the previous layer. Therefore, the convolution operation can extract feature information from the input data while significantly reducing the computational burden of the model and the interference of noise on the features. The formula for the convolution operation is shown below:\begin{equation*} y_{i}^{l} =\sum \nolimits _{r=1}^{M} {x_{r}^{l-1}} \otimes k_{i,r}^{l} +b_{i}^{l},\quad i=1,2,\cdots,N \tag {3}\end{equation*} where $N$ represents the number of convolution kernels in the $l$th convolutional layer; $M$ represents the number of convolution kernels in the $(l-1)$th convolutional layer; $x_{r}^{l-1}$ denotes the $r$th feature map in the $(l-1)$th convolutional layer; $k_{i,r}^{l}$ is the $r$th channel of the $i$th convolution kernel in the $l$th convolutional layer; $b_{i}^{l}$ denotes the bias of the $i$th convolution kernel in the $l$th convolutional layer; $y_{i}^{l}$ is the output corresponding to the $i$th convolution kernel in the $l$th convolutional layer; and $\otimes$ denotes the convolution operation.

The nonlinear layer is an integral part of the neural network. Generally, after the convolutional layer, an activation function is applied to perform a nonlinear mapping of the convolutional layer output. There are many activation functions, but in recent years the ReLU function has been popular because of its computational simplicity, its ability to prevent gradient vanishing, and other advantages. The formula for the ReLU function is shown below:\begin{equation*} y^{l}=\max \left ({0,x^{l}}\right) \tag {4}\end{equation*} where $x^{l}$ denotes the input of the $l$th ReLU nonlinear layer and $y^{l}$ denotes its output.

The pooling layer is generally sandwiched between consecutive convolutional layers and compresses the data while preserving its feature information. It can reduce the number of parameters, effectively control overfitting, and improve the network's operation speed. Pooling methods include maximum pooling and average pooling, with maximum pooling being used more often in practice. The formula for the maximum pooling method is shown below:\begin{align*} y_{k,i}^{l} =\max \left ({x_{k,i,1}^{l},\cdots,x_{k,i,j}^{l},\cdots,x_{k,i,j^{2}}^{l}}\right),\quad k=1,2,\cdots,N \tag {5}\end{align*} where $N$ represents the number of output feature maps, which is the same as the number of input feature maps; $x_{k,i,j}^{l}$ represents the $j$th element of the $i$th image block of the $k$th input feature map, each image block containing $j^{2}$ elements; and $y_{k,i}^{l}$ represents the $i$th element of the $k$th output feature map of the $l$th layer.

The final part of the CNN consists of a fully connected layer and an output layer for classification. First, the features obtained from the previous convolutional and pooling layers are flattened and then passed through the fully connected layer to the output layer, which produces the results. The output layer usually uses the softmax function because of its efficiency in classification tasks. The operations performed in the fully connected output layer are shown below:\begin{align*} y^{l}& =f\left ({w^{l}\bullet x^{l-1}+b^{l}}\right) \tag {6}\\ s^{l}& =\textrm {softmax}\left ({y^{l}}\right) \tag {7}\end{align*} where $x^{l-1}$ is the output of layer $l-1$; $w^{l}$ is the weight coefficient of the fully connected layer $l$; $b^{l}$ is the bias of the fully connected layer $l$; $y^{l}$ is the output of the fully connected layer $l$; and $s^{l}$ denotes the final classification result obtained after the softmax activation function.

D. Attention Mechanism

The attention mechanism is an approach in deep learning inspired by the human visual system and cognitive mechanisms. When confronted with a large amount of information, humans efficiently allocate their limited attentional resources to selectively focus on the information that is most valuable to them [34]. The core goal of the attention mechanism is to selectively focus on the information most critical to the current task from a multitude of inputs and to assign it higher weights to enhance its contribution to the outcome. The attention mechanism can improve the expressiveness of deep learning models and increase the model's sensitivity to key features [35].

The channel attention mechanism, a type of attention mechanism, relates the different channels of the input data to compute each channel's degree of importance and assigns weights to the different channel features. It improves the expressive power of the network's feature representation, which improves the model's performance, and it is therefore often used inside convolutional neural networks [35]. The computational procedure of the channel attention mechanism constructed in this paper is as follows (a code sketch is given after the four steps):

The input sample is denoted as $S =$ {$S_{1}$ , $S_{2}$ , $\cdots $ , $S_{i}$ , $\cdots $ , $S_{h}$ } and h is the number of input sample channels.

  1. Global average pooling is performed on the feature data for each channel:\begin{equation*} m_{i} =\textrm {Avgpool}(S_{i})=\frac {1}{l}\sum \limits _{j=1}^{l} {S_{i}^{j}} \tag {8}\end{equation*} where $S_{i}$ denotes the feature sequence of the $i$th channel, $l$ denotes the length of the feature sequence, $S_{i}^{j}$ denotes the $j$th data point of the feature sequence of the $i$th channel, and $m_{i}$ is the result of the global average pooling of the $i$th channel.

  2. Find the score of the $i$th channel feature:\begin{equation*} F=\sigma (W\bullet M+b) \tag {9}\end{equation*} where $\sigma$ is a scoring function, such as a sigmoid function or a ReLU function; $M = \{m_{1}, m_{2}, \cdots, m_{i}, \cdots, m_{h}\}$; $F = \{f_{1}, f_{2}, \cdots, f_{i}, \cdots, f_{h}\}$; and $f_{i}$ is the score of the $i$th channel feature.

  3. Normalization of the scores:\begin{equation*} \alpha _{i} =\textrm {softmax}(f_{i})=\frac {\exp (f_{i})}{\sum \nolimits _{i} \exp (f_{i})} \tag {10}\end{equation*} where $\alpha _{i}$ is the weight assigned by the channel attention mechanism to the $i$th channel data $S_{i}$.

  4. Obtain the final output of the attention mechanism:\begin{equation*} O=S\otimes A=\{S_{1} \alpha _{1},S_{2} \alpha _{2},\cdots,S_{h} \alpha _{h}\} \tag {11}\end{equation*} where $A = \{\alpha _{1}, \alpha _{2}, \cdots, \alpha _{i}, \cdots, \alpha _{h}\}$ and $\otimes$ denotes element-wise multiplication of the corresponding channels.
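A minimal PyTorch sketch of these four steps is given below. Whether the scoring layer $W$ is a single fully connected layer or shared across channels is not specified by (9), so a single fully connected layer is assumed here, and the channel count and sequence length are illustrative.

```python
import torch
from torch import nn

class ChannelAttention(nn.Module):
    """Sketch of the channel attention steps (8)-(11): global average pooling,
    a learned scoring layer with a sigmoid, softmax normalization over the
    channels, and channel-wise reweighting of the input."""
    def __init__(self, n_channels: int):
        super().__init__()
        self.score = nn.Linear(n_channels, n_channels)   # W . M + b in (9)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        # s: (batch, channels, length)
        m = s.mean(dim=-1)                               # Eq. (8): global average pooling
        f = torch.sigmoid(self.score(m))                 # Eq. (9): channel scores
        alpha = torch.softmax(f, dim=1)                  # Eq. (10): normalized weights
        return s * alpha.unsqueeze(-1)                   # Eq. (11): reweighted channels

x = torch.randn(8, 3, 512)     # e.g. vibration, current, torque channels
y = ChannelAttention(3)(x)
print(y.shape)                 # torch.Size([8, 3, 512])
```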

E. Proposed MD-CA-CNN Structure

In this section, the specific steps of the proposed method are described. Fig. 3 shows the structure of the proposed MD-CA-CNN. The proposed method uses the initial time-series signals and time-frequency images of the vibration, current, and torque of the six joints as the input to the system. A feature extraction model is constructed for each of the six joints. Finally, the extracted features of the six joints are connected in series to determine the final diagnosis result. The proposed structure consists of two main parts. In the feature extraction part, the weights of the feature data of the different channels are first determined by the channel attention mechanism, combining the time-domain information with the time-frequency-domain information. Then, the time-domain and time-frequency-domain features are extracted using 1D-CNN and 2D-CNN, respectively. In the feature fusion part, the time-domain and time-frequency-domain features are first fused, and then the weights of the time-domain features and time-frequency-domain features are determined using the channel attention mechanism. Afterward, the data of the two channels are fused in series, and the fused features of the joint are obtained through a fully connected layer. The fused features obtained for each joint are then fused in series to obtain the final fused features of the robot. Finally, the Softmax layer is used to obtain the fault diagnosis results.

FIGURE 3. The structure of MD-CA-CNN.

Since the fusion feature extraction models of the six joints are constructed in the same way, we use one of the joints as an example to describe the specific steps of the proposed method.

1) Feature Extraction

Before feature extraction of input data using CNN, the weights of different channel data are first determined by Channel attention block_1. The structure of Channel attention block_1 is shown in Fig. 4. Because the input data contains both time-domain and time-frequency domain representations of the same data, the effects of time-domain data and time-frequency domain data on the channel weights are considered simultaneously.

FIGURE 4. The structure of channel attention block_1.

Global average pooling is first performed in Channel attention block_1 for each channel of the two types of input data, $X^{1}$ and $X^{2}$, respectively. The formulas are as follows:\begin{align*} X_{AP,i}^{1} & =\textrm {Avgpool}(X_{i}^{1})=\frac {1}{L}\sum \limits _{j=1}^{L} {X_{i,j}^{1}} \tag {12}\\ X_{AP,i}^{2} & =\textrm {Avgpool}(X_{i}^{2})=\frac {1}{H\times W}\sum \limits _{j=1}^{H\times W} {X_{i,j}^{2}} \tag {13}\end{align*} where $X_{i,j}^{1}$ and $X_{i,j}^{2}$ are the $j$th data of the $i$th channel of $X^{1}$ and $X^{2}$, respectively, and $X_{AP,i}^{1}$ and $X_{AP,i}^{2}$ are the $i$th data of $X_{AP}^{1}$ and $X_{AP}^{2}$, respectively.

Then the average of the corresponding elements of $X_{AP}^{1}$ and $X_{AP}^{2}$ is taken to obtain $X_{Avg}$. The equation is as follows:\begin{equation*} X_{Avg,i} =\frac {1}{2}(X_{AP,i}^{1} +X_{AP,i}^{2}) \tag {14}\end{equation*}

After that, the score for each channel feature is found. The formula is as follows:\begin{equation*} F=\sigma (W\bullet X_{Avg} +b) \tag {15}\end{equation*}

In this paper the scoring function $\sigma $ uses the Sigmoid function; F = {$f_{1}$ , $f_{2}$ , $\cdots $ , $f_{\mathrm {C}}$ }.

Next, the score obtained for the $i$th channel is normalized using the softmax function:\begin{equation*} \alpha _{i} =\textrm {softmax}(f_{i})=\frac {\exp (f_{i})}{\sum \nolimits _{i} \exp (f_{i})} \tag {16}\end{equation*} where $\alpha _{i}$ is the weight assigned to the $i$th channel.

Finally, each obtained channel weight is multiplied by the corresponding channel data of the initial time-series signal and the time-frequency-domain signal, respectively, to obtain the final output of Channel attention block_1. The formulas are as follows:\begin{align*} O^{1}& =X^{1}\otimes A=\{X_{1}^{1} \alpha _{1},X_{2}^{1} \alpha _{2},\cdots,X_{C}^{1} \alpha _{C}\} \tag {17}\\ O^{2}& =X^{2}\otimes A=\{X_{1}^{2} \alpha _{1},X_{2}^{2} \alpha _{2},\cdots,X_{C}^{2} \alpha _{C} \} \tag {18}\end{align*} where $A = \{\alpha _{1}, \alpha _{2}, \cdots, \alpha _{i}, \cdots, \alpha _{C}\}$; $O^{1}$ and $O^{2}$ are the time-domain data and time-frequency-domain data, respectively, after Channel attention block_1 assigns the channel weights.
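The following is a minimal PyTorch sketch of Channel attention block_1 as described by (12)-(18). The input shapes and channel count are illustrative assumptions; a single fully connected scoring layer is again assumed for (15).

```python
import torch
from torch import nn

class ChannelAttentionBlock1(nn.Module):
    """Sketch of Channel attention block_1 (Eqs. (12)-(18)): channel weights are
    computed jointly from the time-domain data X1 (B, C, L) and the time-frequency
    images X2 (B, C, H, W) and applied to both."""
    def __init__(self, n_channels: int):
        super().__init__()
        self.score = nn.Linear(n_channels, n_channels)    # W . X_Avg + b in (15)

    def forward(self, x1: torch.Tensor, x2: torch.Tensor):
        x_ap1 = x1.mean(dim=-1)                           # Eq. (12): GAP over length L
        x_ap2 = x2.mean(dim=(-2, -1))                     # Eq. (13): GAP over H x W
        x_avg = 0.5 * (x_ap1 + x_ap2)                     # Eq. (14)
        f = torch.sigmoid(self.score(x_avg))              # Eq. (15): sigmoid scoring
        alpha = torch.softmax(f, dim=1)                   # Eq. (16)
        o1 = x1 * alpha.unsqueeze(-1)                     # Eq. (17)
        o2 = x2 * alpha.unsqueeze(-1).unsqueeze(-1)       # Eq. (18)
        return o1, o2

x1 = torch.randn(8, 3, 512)        # time-domain: vibration, current, torque
x2 = torch.randn(8, 3, 64, 64)     # matching time-frequency images
o1, o2 = ChannelAttentionBlock1(3)(x1, x2)
```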

After Channel attention block_1, a CNN is used to extract deep features. The 1D-CNN and 2D-CNN used in this paper extract the deep features of the time-domain and time-frequency-domain signals, respectively. Both consist of two convolutional modules followed by a Flatten layer. Each convolutional module, in turn, consists of a convolutional layer, a nonlinear layer, a convolutional layer, a nonlinear layer, and a pooling layer. Finally, the Flatten layer flattens the extracted feature data. The parameters of the 1D-CNN and 2D-CNN convolutional modules used in this paper are determined experimentally, and the details are shown in Table 1.

TABLE 1. CNN parameters.
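As an illustration of the two-branch feature extractor described above, the sketch below builds a 1D-CNN and a 2D-CNN, each with two convolutional modules (conv-ReLU-conv-ReLU-pool) and a Flatten layer. The kernel sizes and channel counts are placeholders, not the values from Table 1.

```python
import torch
from torch import nn

def conv_module_1d(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    """One convolutional module: conv -> ReLU -> conv -> ReLU -> max pool."""
    return nn.Sequential(
        nn.Conv1d(c_in, c_out, k, padding=k // 2), nn.ReLU(),
        nn.Conv1d(c_out, c_out, k, padding=k // 2), nn.ReLU(),
        nn.MaxPool1d(2),
    )

def conv_module_2d(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2), nn.ReLU(),
        nn.Conv2d(c_out, c_out, k, padding=k // 2), nn.ReLU(),
        nn.MaxPool2d(2),
    )

# Two modules plus a Flatten layer, as described for both branches.
cnn_1d = nn.Sequential(conv_module_1d(3, 16), conv_module_1d(16, 32), nn.Flatten())
cnn_2d = nn.Sequential(conv_module_2d(3, 16), conv_module_2d(16, 32), nn.Flatten())

feat_t = cnn_1d(torch.randn(8, 3, 512))        # time-domain deep features
feat_tf = cnn_2d(torch.randn(8, 3, 64, 64))    # time-frequency deep features
print(feat_t.shape, feat_tf.shape)             # (8, 4096) and (8, 8192) here
```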

2) Feature Fusion

First, the channels of the two 1D feature sequences from the feature extraction part are spliced to obtain two-channel sequence data. Then, the weights of the two channels (i.e., the weights of the time-domain features and the time-frequency-domain features) are determined by Channel attention block_2. The structure of Channel attention block_2 is shown in Fig. 5.

FIGURE 5. The structure of channel attention block_2.

A global average pooling operation is first performed on the dual-channel data X in Channel attention block_2. The formula is as follows:\begin{equation*} X_{AP,i} =\textrm {Avgpool}(X_{i})=\frac {1}{L}\sum \limits _{j=1}^{L} {X_{i,j}},\quad i=1,2 \tag {19}\end{equation*} where $X_{i,j}$ is the $j$th data of the $i$th channel of data X, and $X_{AP,i}$ is the $i$th data of $X_{AP}$.

After that, the score of each channel is found. The formula is as follows:\begin{equation*} F=\sigma (W\bullet X_{AP}+b) \tag {20}\end{equation*} Here, the scoring function $\sigma$ again uses the Sigmoid function; $F = \{f_{1}, f_{2}\}$.

The score obtained for the $i$th channel is normalized. The formula is as follows:\begin{equation*} \alpha _{i} =\textrm {softmax}(f_{i})=\frac {\exp (f_{i})}{\sum \nolimits _{i} \exp (f_{i})},\quad i=1,2 \tag {21}\end{equation*} where $\alpha _{i}$ is the weight assigned to the $i$th channel.

Finally, each of the obtained channel weights is multiplied with the corresponding channel data of the initial signal X to obtain the final output of Channel attention block_2. The formula is given below:\begin{equation*} O=X\otimes A=\{X_{1} \alpha _{1},X_{2} \alpha _{2}\} \tag {22}\end{equation*} where $A = \{\alpha _{1}, \alpha _{2}\}$, and O is the feature data after Channel attention block_2 assigns the channel weights.

After Channel attention block_2, the two channels of feature data O are fused in series through the Flatten layer to form a one-dimensional sequence. The fused feature T of the joint is then obtained through a fully connected layer. The formula is as follows:\begin{equation*} T=W\bullet O+b \tag {23}\end{equation*}

After that, the obtained fused features for each joint are again concatenated in tandem to obtain the final fused features of the robot. Finally, the Softmax layer is utilized to obtain the fault diagnosis results. The Softmax classifier is the most common classifier in deep learning, and its output reflects the probability distribution of features over the label space, which can handle multicategorization problems well [36]. Compared with some other classifiers, the Softmax function is computationally more stable and can be easily parallelized on GPUs, especially when the number of categories is not very large, with relatively low computational complexity. In contrast, some other classifiers (e.g., support vector machines) may require more complex computations, resulting in longer computation times.
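The feature fusion part can be sketched as follows. Because channel stacking requires the two feature vectors to have equal length, a linear projection to a common length is assumed here (how the lengths are matched is not stated above); the per-joint fused features are then concatenated across the six joints and classified with softmax. The feature dimensions follow the earlier CNN sketch and are otherwise arbitrary.

```python
import torch
from torch import nn

class ChannelAttentionBlock2(nn.Module):
    """Sketch of Channel attention block_2 (Eqs. (19)-(22)) on 2-channel data."""
    def __init__(self):
        super().__init__()
        self.score = nn.Linear(2, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, 2, L)
        f = torch.sigmoid(self.score(x.mean(dim=-1)))     # Eqs. (19)-(20)
        alpha = torch.softmax(f, dim=1)                   # Eq. (21)
        return x * alpha.unsqueeze(-1)                     # Eq. (22)

class JointFusion(nn.Module):
    """Per-joint fusion: align the two deep feature vectors, apply block_2,
    flatten, and map to a joint-level fused feature (Eq. (23))."""
    def __init__(self, d_time: int, d_tf: int, d_common: int = 256, d_joint: int = 64):
        super().__init__()
        self.proj_t = nn.Linear(d_time, d_common)   # assumed projection to equal lengths
        self.proj_f = nn.Linear(d_tf, d_common)
        self.ca2 = ChannelAttentionBlock2()
        self.fc = nn.Linear(2 * d_common, d_joint)

    def forward(self, feat_t: torch.Tensor, feat_tf: torch.Tensor) -> torch.Tensor:
        x = torch.stack([self.proj_t(feat_t), self.proj_f(feat_tf)], dim=1)  # (B, 2, d_common)
        o = self.ca2(x).flatten(1)                                           # series fusion
        return self.fc(o)                                                    # Eq. (23)

# Six joints, then concatenation and a softmax classifier over 7 conditions.
joints = nn.ModuleList(JointFusion(4096, 8192) for _ in range(6))
classifier = nn.Linear(6 * 64, 7)

feats_t = [torch.randn(8, 4096) for _ in range(6)]    # per-joint time-domain features
feats_f = [torch.randn(8, 8192) for _ in range(6)]    # per-joint time-frequency features
fused = torch.cat([joints[i](feats_t[i], feats_f[i]) for i in range(6)], dim=1)
probs = torch.softmax(classifier(fused), dim=1)        # fault probabilities per sample
```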

SECTION III.

Experiment

A. Experimental Platform Construction

We construct a testbed to verify the performance of the proposed fault diagnosis method for industrial robots. As shown in Fig. 6, the testbed consists of a 6-axis multi-joint robot, a laptop computer (MECHREVO), a data collector (IN-SDG), and six vibration sensors. We set the task of the robot to carry a 5kg weight.

FIGURE 6. Robot experiment platform.

B. Experimental Data Acquisition and Processing

The reducer is an essential power transmission device in the drive system of a 6-axis multi-joint industrial robot. It helps the robot achieve precise motion and command transmission and ensures that all parts work closely together to complete complex tasks. In this experiment, four kinds of single faults and two kinds of compound faults of the reducers are simulated through manual fault injection. As shown in Fig. 7, these are: pitting of the sun gear of the second joint reducer (F1), cracking of the planetary gear of the second joint reducer (F2), cracking of the sun gear of the third joint reducer (F3), pitting of the planetary gear of the third joint reducer (F4), pitting of the sun gear and cracking of the planetary gear of the second joint reducer (F5), and cracking of the sun gear of the second joint reducer combined with pitting of the planetary gear of the third joint reducer (F6). In addition, the normal reducer mode (N) is included, giving a total of seven different conditions to be simulated, as shown in Fig. 7.

FIGURE 7. Failure modes.

The vibration signals of each joint were acquired through vibration sensors, and the current and torque signals were acquired through the robot controller. The sampling frequency is set to 1024 Hz, and 750 data segments are obtained in each mode. Each segment contains 5120 data points. The window length is set to 512, and the step size is set to 200, so twenty-four samples are obtained from each segment of data by (1) ($\lfloor (5120-512)/200\rfloor +1=24$). Thus, a total of 18,000 samples are obtained in each mode. Each sample includes signals from three channels: vibration, current, and torque. Then, the time-frequency image of the sample data is obtained using CWT. The size of the time-frequency image is $512\times 512$; Fig. 8 shows the vibration data and the corresponding time-frequency image for each mode of the third joint. Finally, all samples are divided into three parts: 70% for the training dataset, 20% for the validation dataset, and 10% for the test dataset, as shown in Table 2.

TABLE 2. Description of the data set.
FIGURE 8. Vibration data of the third joint.

C. Design of Comparative Experiments

In order to analyze the performance of the MD-CA-CNN model based on multi-source data, experiments were designed to compare it with other commonly used machine learning and deep learning models. The models used in this experiment were implemented in Python 3.8 using the PyTorch framework and the Scikit-learn machine learning library, and they were trained on a laptop with an NVIDIA RTX 3060 GPU.

First, the model proposed in this paper is compared with models that only utilize 1D-CNN or 2D-CNN for feature extraction (MD-CA-1D-CNN, MD-CA-2D-CNN). The MD-CA-1D-CNN model only utilizes the original vibration signals, the original current signals, and the original torque signals as inputs. The MD-CA-2D-CNN model only utilizes the time-frequency images of the original vibration, current, and torque signals as inputs. Second, the model proposed in this paper is compared with a model that does not contain channel attention but utilizes multiple sources of data as inputs (MD-CNN) and with CNNs that utilize a single source of data as inputs, including a convolutional neural network that uses only raw vibration signals (V-CNN), a convolutional neural network that uses only raw current signals (C-CNN), and a convolutional neural network that uses only raw torque signals (T-CNN). Finally, the model proposed in this paper is compared with other commonly used deep learning and machine learning models. The deep learning models used as comparisons include the Long Short-Term Memory network (LSTM) and the Gated Recurrent Unit (GRU). The machine learning models used as comparisons include Random Forest (RF) [37], the K-Nearest Neighbor algorithm (KNN), and Support Vector Machines (SVM) [38]. LSTM and GRU are commonly used models for time-series tasks and are generally only suitable for taking time-series data as model inputs. Since traditional machine learning algorithms are unsuitable for directly taking the raw time series as input, some statistical features of the raw time series are standardized and used as algorithm inputs. The selected statistical features include the maximum value, mean value, root mean square value, peak-to-peak value, kurtosis, skewness, margin indicator, and impulse indicator. All of these features can characterize the hidden faults of mechanical equipment well.
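For reference, the statistical features fed to the machine learning baselines can be computed as in the sketch below. Standard definitions of the margin and impulse indicators are assumed, since the exact formulas are not given here.

```python
import numpy as np
from scipy import stats

def statistical_features(x: np.ndarray) -> np.ndarray:
    """Eight time-domain statistics of one signal window, roughly matching the
    feature set listed above (standard definitions assumed)."""
    rms = np.sqrt(np.mean(x ** 2))
    abs_mean = np.mean(np.abs(x))
    peak = np.max(np.abs(x))
    return np.array([
        np.max(x),                                   # maximum value
        np.mean(x),                                  # mean value
        rms,                                         # root mean square value
        np.max(x) - np.min(x),                       # peak-to-peak value
        stats.kurtosis(x),                           # kurtosis
        stats.skew(x),                               # skewness
        peak / np.mean(np.sqrt(np.abs(x))) ** 2,     # margin (clearance) indicator
        peak / abs_mean,                             # impulse indicator
    ])

window = np.random.randn(512)      # one segmented window of a raw signal
print(statistical_features(window))
```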

SECTION IV.

Results and Discussion

In order to reduce the effect of random factors, we conducted five trials for each algorithm model, with the data order in the dataset randomly shuffled in each trial. To further reduce randomness and improve the model's generalization ability, ten-fold cross-validation was used in each trial, and the average value was taken as the result of that trial. Figure 9 shows the test accuracy of each algorithm in each trial. Figure 10 shows violin plots of the test results of the different algorithms. Table 3 lists the detailed accuracy of each trial, the average test result of each algorithm, and the corresponding standard deviation.

TABLE 3. Detailed accuracy of experiments.
FIGURE 9. Diagnostic results for each algorithm in five trials.
FIGURE 10. Violin plots of the diagnostic results for each algorithm.

The results show that the diagnostic accuracy of the proposed MD-CA-CNN method is 95.8%, with a standard deviation of 0.39%. The diagnostic accuracy of the MD-CA-2D-CNN method is 91.8%, with a standard deviation of 0.80%, and that of the MD-CA-1D-CNN method is 90.9%, with a standard deviation of 0.83%. This comparison shows that the MD-CA-CNN method can simultaneously utilize 1D-CNN and 2D-CNN to extract time-domain and time-frequency-domain features, obtaining more comprehensive feature information and improving fault diagnosis performance. The diagnostic accuracy of the MD-CNN method is 90.7%, with a standard deviation of 0.99%; by comparison, the channel attention mechanism improves fault diagnosis performance by adaptively assigning weights to different source data as well as to the time-domain and time-frequency-domain features. The diagnostic accuracy of T-CNN is 82.3% (standard deviation 1.86%), that of V-CNN is 85.1% (standard deviation 1.54%), and that of C-CNN is 83.8% (standard deviation 1.83%). Thus, MD-CA-CNN and MD-CNN, which utilize multi-source data fusion, have higher diagnostic accuracy and more stable performance than T-CNN, V-CNN, and C-CNN, which utilize only a single data source. The diagnostic accuracy of the LSTM algorithm is 89.5%, with a standard deviation of 1.03%, and that of the GRU algorithm is 88.9%, with a standard deviation of 1.06%; the MD-CA-CNN method proposed in this paper therefore outperforms deep learning models such as LSTM and GRU in fault diagnosis. Among the machine learning algorithms, the diagnostic accuracy of KNN is 69.9% (standard deviation 1.88%), that of RF is 76.4% (standard deviation 1.35%), and that of SVM is 73.5% (standard deviation 1.96%). Compared with the deep learning algorithms above, the diagnostic accuracy and stability of the machine learning algorithms are much lower, reflecting the advantages of deep learning algorithms in dealing with long time-series problems.

In addition, Fig. 11 shows the confusion matrices for the third trial of the different algorithms, where the classification and actual results are presented. All three machine learning algorithms (KNN, RF, SVM) have low diagnostic accuracy for all types of faults, especially for the composite faults F5 and F6, which may be because machine learning is unsuitable for handling long time series. The three deep learning algorithms utilizing a single data source (T-CNN, V-CNN, C-CNN) significantly improve the diagnostic accuracy for all types of faults over the three machine learning algorithms. However, the results are still less than satisfactory, which may be due to the lack of robustness of a single data source. MD-CNN, LSTM, and GRU have better diagnostic accuracy for the single fault types; however, they are also prone to misclassifying the composite fault modes F5 and F6, which may be due to confusion about the importance of the multi-source data. The MD-CA-CNN algorithm has higher diagnostic accuracy for the various fault modes and the normal mode, possibly because the multi-source data provides more comprehensive information and improves robustness. In addition, the channel attention mechanism adaptively assigns weights to the multi-source data and to the time-domain and time-frequency-domain features, which improves the model's fault diagnosis performance, especially the diagnostic accuracy for F5 and F6.

FIGURE 11. Confusion matrix for the third trial of different algorithms.

SECTION V.

Conclusion

In this paper, we proposed a multi-joint industrial robot fault diagnosis model based on multi-source data fusion and a channel-attention convolutional neural network, namely MD-CA-CNN. The model takes the vibration data, torque data, and current data of the robot as inputs. The proposed method differs from existing multi-source data fusion techniques in the field of fault diagnosis in the following ways: firstly, the fusion of the vibration, current, and torque data of the robot joints is realized by using CNN channel superposition. Secondly, multiple dimensions of information, namely the original time-domain signals and the time-frequency-domain signals, are taken into account, and both 1D-CNN and 2D-CNN are utilized to extract deep features from the time-domain and time-frequency-domain signals to obtain more comprehensive fault feature information. Finally, a channel attention mechanism is developed to adaptively assign weights to different source data as well as to time-domain and time-frequency-domain features, thus improving the contribution of important channels to fault diagnosis.

We built a testbed to test the performance of the proposed method by diagnosing a variety of six-axis robot RV reducer failures. The proposed method was compared and analyzed against models that utilize only 1D-CNN or only 2D-CNN for feature extraction (MD-CA-1D-CNN, MD-CA-2D-CNN), a deep learning algorithm that utilizes only the fusion of multiple data sources (MD-CNN), three deep learning algorithms that utilize only a single data source (T-CNN, V-CNN, and C-CNN), two commonly used deep learning algorithms (LSTM, GRU), and three commonly used machine learning algorithms (KNN, RF, SVM). The MD-CA-CNN algorithm has the highest accuracy and stability (95.8% ± 0.39%). Also, the following conclusions can be obtained:

  1. Multi-source data can provide more comprehensive feature information. Combining redundant or complementary information from multiple sources in space or time can reduce the disadvantage of a single signal being susceptible to noise. In addition, simultaneously extracting the deep features of the original time-domain signal and the time-frequency-domain signal provides more comprehensive fault feature information and improves the robustness of the algorithm;

  2. The channel attention mechanism developed in this paper can model the importance of each feature channel and enhance or suppress different channels, which helps improve the performance of the model;

  3. In the field of fault diagnosis, deep learning algorithms are more advantageous than machine learning algorithms in dealing with high-complexity, long-time series problems.

The model proposed in this paper takes a long time to train due to its large structure and many parameters, but once trained it meets the time requirements for real-time diagnosis in the field when performing fault diagnosis tasks. In future research, we will focus on reducing the complexity of the proposed method.
