Introduction
In this era of industrial big data, a massive amount of data is available to the public through various industries such as intelligent transportation [1], [2], power grids [3], cloud computing [4], and finance [5]. Knowledge extraction from these data is crucial for continuous improvement, process automation, and resilience improvement of these industrial systems [6]. Even though data availability increases exponentially with time, these multi-variety data have many intricacies: they are often incomplete, high-dimensional, noisy, and rarely labeled [7]. This article focuses on two of these intricacies: high-dimensional and unlabeled data.
The first area of focus is the high-dimensionality of data. The reliability of knowledge extraction methods generally deteriorates due to the curse of dimensionality [8]. Conversely, extracting only the relevant features reduces the number of features, which results in efficient knowledge extraction methods with high accuracy [9]. Therefore, when using high-dimensional data for data-driven machine learning tasks, it is necessary to capture only the relevant information [2], [8], [10]. Extraction of relevant features and reduction of the input dimension are performed using various feature learning and dimensionality reduction techniques, typically by non-linearly mapping the input data into an embedded representation [11]–[13]. Since the embedded representation contains only the relevant information, these learned embedded representations can be used to perform various machine learning tasks with improved reliability.
The second area of focus is the abundance of unlabeled data. Real-world settings bring the challenge of dealing with high volumes of unlabeled data. Manual labeling is time-consuming, expensive, and requires domain expertise [14]. Further, supervised feature learning not only fails to take advantage of unlabeled data, but can also introduce biases by relying on labeled data. Therefore, unsupervised deep learning based feature learning (feature extraction) has gained tremendous attention.
Many dimensionality reduction based unsupervised feature learning methods have been proposed to address the above two problems. Widely used unsupervised feature learning techniques include Principal Component Analysis (PCA) [15], Independent Component Analysis (ICA), Locally Linear Embedding (LLE) [15], Factor Analysis embedding, and Singular Value Decomposition (SVD) embedding. Recently, deep learning has shown remarkable performance in many areas. It has been successfully used to convert high-dimensional feature spaces into new embedded representations with relevant and robust features [8], [14], [16]. This effective transformation of the input data space to an embedded space has been achieved through unsupervised deep learning methods such as deep convolutional autoencoders (C-AEs) [11], [13]. Figure 1 shows current applications of Deep Neural Network (DNN) based approaches for various industrial applications such as process automation and resilience improvement.
Figure 1. The need for Deep Neural Network (DNN) based unsupervised feature learning and its advantages.
Even though deep learning has become the primary technique with state-of-the-art performance in many areas, deep networks suffer from the vanishing gradient problem: as a network goes deeper, its performance saturates or even starts degrading rapidly [17]. Because of this, shallow counterparts can perform better than deep networks [17]. He et al. proposed residual blocks between layers to alleviate this performance degradation [17]. These networks are called ResNets [18]–[22].
While the idea of adding residual connections is well established, very limited work has applied it to unsupervised feature learning. Further, the existing work does not address the performance degradation of deep neural networks in unsupervised feature learning. Therefore, in this article, we present a framework that introduces residual blocks into AE architectures for unsupervised feature learning.
We use AEs to perform unsupervised feature learning. The term unsupervised here refers to the feature learning process, i.e., learning an embedded representation from the input data without using any labels. We use data labels only for the evaluation of the learned embedded representations. We hypothesize that AEs with residual connections (RAEs) will have improved resistance to performance degradation of the learned features and improved feature learning capability compared to standard AEs. That is, residual connections will alleviate possible information loss when increasing the number of hidden layers, and the embedded representation will provide better separability for classification/clustering tasks.
Especially for unlabeled data, it is challenging to decide the optimal number of hidden layers in advance when designing dimensionality reduction experiments. The proposed approach will always perform similarly or better, even with a higher number of layers. Therefore, users have the advantage of designing fewer experiments with large networks, knowing that there is no adverse effect on the network's dimension reduction performance. To test our hypothesis, it is necessary to show that RAEs exhibit lower performance degradation of unsupervised feature learning than AEs when increasing the networks' depth. We show the effectiveness of the approach quantitatively by calculating the drop in classification accuracy: we increase the number of hidden layers in both AEs and RAEs and check how the classification accuracies on the embedded representations change with depth. We use K Nearest Neighbor (KNN) for classification, as it allows us to check whether samples of the same class are close to each other in the learned embedded representation (i.e., whether the learned feature space encodes high-level concepts such as the classes of the input datasets).
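For concreteness, the evaluation protocol can be sketched in a few lines of Python; `build_ae`, `build_rae`, and `knn_accuracy` are hypothetical helper names standing in for the models and the KNN evaluation described in Sections III and IV, and the depth values are illustrative:

```python
# Hedged sketch of the evaluation protocol; build_ae, build_rae, and
# knn_accuracy are hypothetical placeholders for the models and the
# KNN evaluation described in Sections III and IV.
results = {"AE": {}, "RAE": {}}
for depth in (2, 10, 20, 50, 90):                  # illustrative depths
    for name, build in (("AE", build_ae), ("RAE", build_rae)):
        autoencoder, encoder = build(depth)        # model and its encoder
        autoencoder.fit(x_train, x_train, epochs=50, verbose=0)
        results[name][depth] = knn_accuracy(
            encoder, x_train, y_train, x_test, y_test, k=5)

# Performance degradation: relative accuracy drop across depths.
for name, acc in results.items():
    drop = (max(acc.values()) - min(acc.values())) / max(acc.values())
    print(f"{name}: {100 * drop:.2f}% degradation")
```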
The article presents the following contributions:
Addressing the performance degradation of deep neural networks in unsupervised feature learning.
A performance comparison between the proposed architecture (RAE) and standard AE based feature learning, using different numbers of hidden layers on three different datasets.
A performance comparison against widely used unsupervised dimensionality reduction methods.
We compare the presented method against two relevant groups of methods (seven methods in total). The first group consists of autoencoders, which the literature indicates to be the most commonly used state-of-the-art deep learning based unsupervised dimensionality reduction architectures. We focus on the standard Autoencoder and the standard Convolutional Autoencoder because these are 1) the most frequently used and 2) the models whose principles other AE variants in the literature follow. Our objective was to evaluate how residual connections improve feature learning, so we compared the same models with and without residual connections. The second group consists of five other feature extraction methods: Principal Component Analysis, Independent Component Analysis, Locally Linear Embedding, Factor Analysis, and Singular Value Decomposition.
The rest of the article is organized as follows: Section II presents the background and related work; Section III presents the RAE architecture; Section IV discusses the experiments and results; and finally, Section V presents the conclusions of the article.
Background and Related Work
This section consists of three subsections. The first subsection discusses widely used traditional unsupervised dimensionality reduction techniques; the second discusses Autoencoder based deep learning approaches for dimensionality reduction; and the third discusses the theory behind residual connections.
A. Traditional Unsupervised Machine Learning for Dimensionality Reduction
As discussed in the introduction, feature learning is essential for efficient and accurate machine learning tasks. Two types of dimensionality reduction based feature learning techniques exist, namely feature selection and feature transformation [23]. Feature selection retains a subset of features from the original space, whereas feature transformation (dimension reduction) generates an entirely new set of features. Both try to keep as much information in the data as possible while reducing the dimension. However, feature selection can be misleading as it assigns weights to individual features while ignoring the correlation between features [23]. Therefore, feature transformation approaches are preferable. Widely used dimension reduction techniques of this kind are discussed below.
Principal Component Analysis (PCA): A linear algorithm which preserves most of the data’s variability in the latent space [15]. It minimizes the redundancy (measured through covariance) of the data while maximizing the information (measured through variance) in the resulting space. Its limitations include: 1) it only considers linear correlations; 2) input variables are assumed to be scaled at the numeric level [24].
Independent Component Analysis (ICA): A linear transformation method that minimizes the dependence of the components of the transformed feature space [24]. Linearity is a major disadvantage of this method.
Locally Linear Embedding (LLE): A non-linear algorithm that uses neighborhood-preserving learning to generate the subspace [15], [24]. However, this method is highly sensitive to noise and outliers.
Factor Analysis: This is the same as PCA in cases where the added noise is zero [25]. This method assumes that input data represent independent, random samples from a multivariate distribution. If variables are correlated, generated factors can be highly correlated [26].
Singular Value Decomposition (SVD): This is mainly used for sparse data, i.e., when the data contain many zero values. It converts the input data space to a latent representation with a reduced number of features while keeping the maximum information from the original space [27]. This approach is computationally expensive.
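As an illustration, all five baselines are available in scikit-learn. The following hedged sketch reduces a flattened input matrix `x_flat` (an assumed NumPy array of shape samples × features, not defined in the original) to 32 dimensions, the embedded size used later in Section IV; hyper-parameters are left at their defaults rather than tuned values:

```python
# Sketch: the five traditional baselines via scikit-learn, each reducing
# the flattened input x_flat to a 32-dimensional embedding.
from sklearn.decomposition import PCA, FastICA, FactorAnalysis, TruncatedSVD
from sklearn.manifold import LocallyLinearEmbedding

reducers = {
    "PCA": PCA(n_components=32),
    "ICA": FastICA(n_components=32),
    "LLE": LocallyLinearEmbedding(n_components=32),
    "FactorAnalysis": FactorAnalysis(n_components=32),
    "TruncatedSVD": TruncatedSVD(n_components=32),
}
embeddings = {name: r.fit_transform(x_flat) for name, r in reducers.items()}
```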
B. Unsupervised Deep Autoencoders for Dimensionality Reduction
The traditional concept of unsupervised learning was mainly limited to data clustering and association rule mining. However, the expansion of deep learning methods and data mining, combined with this era of big data, has given a much broader perspective to traditional unsupervised learning. Unsupervised learning is therefore used not only for clustering, but also for dimensionality reduction (also referred to as unsupervised feature learning / deep embedded representation learning) [28], [29], generative modelling [30], [31], and auto-regressive modelling [32], [33]. This article focuses on deep unsupervised feature learning, which is the process of transforming the input space to an embedded space, preferably of lower dimension than the input data space, using deep neural networks.
Many recent classification tasks use different variants of AEs to learn feature representations from high-dimensional input data, where the learned (extracted) features provide good separability for classification. In these cases, feature extraction is performed in an unsupervised manner, whereas classification is performed on the extracted features in the reduced dimension in a supervised manner. Feature learning using variants of AEs has shown the following advantages: improved robustness of feature learning [13], non-linear feature extraction [12], replacement of handcrafted features with efficient algorithms for unsupervised feature learning [34], and reduced time and storage space through dimensionality reduction [35].
Variants of deep AEs have also been successfully used for deep embedded clustering tasks that perform feature learning and clustering simultaneously. In the past, clustering and feature learning were performed sequentially, i.e., the input space was embedded into a latent space and clustering was then performed on the embedded space [29], [36]. Deep embedded clustering instead performs a joint optimization of feature learning (dimensionality reduction) and clustering [29]. For example, in [37], the authors presented a deep clustering approach using fully connected convolutional AEs. They argue that the embedded representations extracted from an encoder may not be discriminative enough for efficient clustering. To overcome this, they proposed a soft k-means model on top of the encoder to form a unified clustering model.
C. Residual Connection Within Deep Neural Networks
He et al. raised awareness of the performance degradation problem [18]: when a network’s depth increases, its performance starts to saturate and eventually can even deteriorate [19]. This is caused not by over-fitting but by the vanishing gradient in deep neural networks [19].
This problem has been addressed by various network designs such as ResNets [18], [20], Highway Networks [21], and DenseNets [22]. All these networks use the same design principle, i.e., skip connections or residual connections [19]. Networks with skip connections have consistently shown state-of-the-art performance across different neural network topologies [18], [21]. Other advantages of skip connections include easier training [19], numerical stability, and easier optimization [19], [38]. Empirical evidence has shown that deep architectures with skip connections should not produce a larger error than their shallow counterparts [18], [20].
Methodology: ResNet Autoencoder Based Feature Learning for Deep Embedded Classification
This section discusses the stacked ResNet Autoencoder (RAE) based feature learning approach for classification. Figure 2 presents the standard C-AE architecture, consisting of multiple convolution and max-pooling layers with multiple filters.
In this article, we implemented standard and convolutional AEs (AEs and C-AEs) with residual connections. Our intent was to convey the advantages of adding residual connections to AE networks to improve feature learning capability. Therefore, we designed a simple and reproducible experiment which can run in a reasonable amount of time. We introduced residual connections into the AE architecture and present the novel Residual Autoencoder (RAE) framework for deep embedded classification; we call its convolutional counterpart C-RAE. The proposed framework is presented in Figure 3, where (a) presents the training of the RAE and (b) presents the classification task on the learned features.
Figure 3. RAE based feature learning: (a) training of C-RAE; (b) C-RAE based classification/clustering.
Similar to AEs, RAEs are trained to regenerate their input at the output (Figure 3(a)). The input sample $x_i$ is encoded into an embedded representation $z_i$, from which the reconstruction $y_i$ is produced.
Similar to AEs, an RAE consists of two phases, i.e., an encoding phase and a decoding phase [39], [40]. For a high-dimensional input $x$, the encoder applies a stack of hidden layers that map $x$ to the embedded representation $z$.
For the decoder, each hidden layer is a non-linear mapping of the form $h^{(l+1)} = f\left ({h^{(l)} }\right)$, where $h^{(l)}$ is the output of layer $l$ and $f$ denotes the layer’s non-linear transformation.
For the encoder, each hidden layer adds a shortcut mapping $r$ (identity in the simplest case) to the non-linear mapping $f$:\begin{equation*} h^{(l+1)} = r\left ({h^{(l)} }\right) + f\left ({h^{(l)} }\right)\tag{1}\end{equation*}
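A minimal Keras sketch of a residual encoder layer implementing Eq. (1) might look as follows. The 3×3 convolution, batch normalization, and LeakyReLU mirror the choices reported in Section IV, while the 1×1 projection shortcut for mismatched channel counts is our assumption, not necessarily the paper’s exact wiring:

```python
from tensorflow.keras import layers

def residual_conv_block(h, filters):
    """One encoder hidden layer per Eq. (1): h_next = r(h) + f(h)."""
    f = layers.Conv2D(filters, 3, padding="same")(h)   # non-linear mapping f
    f = layers.BatchNormalization()(f)
    f = layers.LeakyReLU()(f)
    # Shortcut r: identity when channel counts match, else a 1x1 projection
    # (an assumption to make the addition well-defined).
    r = h if h.shape[-1] == filters else layers.Conv2D(filters, 1)(h)
    return layers.add([r, f])
```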
Similar to AEs, the loss function $J_{\theta}$ is the mean squared reconstruction error between the input $x_i$ and the reconstructed output $y_i$ over the $T$ training samples:\begin{equation*} J_{\theta } = \frac {1}{T} \sum _{i=1}^{T} \| x_{i} - y_{i} \| ^{2}\tag{2}\end{equation*}
The RAE is trained to minimize the above loss function with respect to the network parameters $\theta$.
Similar to AEs, the dimension of the hidden representation ($z$) is kept lower than the input dimension, forcing the network to learn a compact embedded representation.
The encoded value $z$ serves as the learned embedded representation and is used for downstream classification/clustering tasks.
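Putting Eqs. (1) and (2) together, training can be sketched as below; `encoder` and `decoder` are assumed to be models built from residual blocks such as the one above, and the input shape and training schedule are illustrative rather than the paper’s exact settings:

```python
from tensorflow.keras import Input, Model

# `encoder` and `decoder` are assumed models built from residual blocks;
# the input shape matches a 28x28 grayscale image (e.g., MNIST).
inputs = Input(shape=(28, 28, 1))
z = encoder(inputs)                            # embedded representation z
y = decoder(z)                                 # reconstruction y
rae = Model(inputs, y)
rae.compile(optimizer="adadelta", loss="mse")  # Eq. (2): mean squared error
rae.fit(x_train, x_train, epochs=50, batch_size=128)
```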
For classification purposes, any supervised classification algorithm can be integrated at the end of the encoder (Figure 3(b)). For this experiment, the K-Nearest Neighbor algorithm (KNN) is used. Algorithm II presents the KNN based deep embedded classification.
As presented in Algorithm II, the trained RAE’s encoder is used to generate the embedded representations of the train and test data (lines 1–2). Class labels for the test data are then predicted by comparing each test record with all train records and taking the mode of the class labels of the K nearest train records (Algorithm II, lines 4–12). Finding the nearest neighbors requires a distance measure between a test record and a train record; for this experiment, the Euclidean distance is used:\begin{equation*} dist(z_{test}, z_{train}) = \sqrt { \sum _{i=1}^{dim}\left ({z_{test,i} - z_{train,i} }\right)^{2} }\tag{3}\end{equation*}
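A hedged scikit-learn realization of Algorithm II, assuming a trained `encoder` and using K = 5 with the Euclidean metric of Eq. (3), as in Section IV:

```python
from sklearn.neighbors import KNeighborsClassifier

z_train = encoder.predict(x_train)            # Algorithm II, line 1
z_test = encoder.predict(x_test)              # Algorithm II, line 2
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")  # Eq. (3)
knn.fit(z_train.reshape(len(z_train), -1), y_train)
y_pred = knn.predict(z_test.reshape(len(z_test), -1))  # lines 4-12
accuracy = (y_pred == y_test).mean()
```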
Experiment and Results
This section discusses the experiments and results. First, we discuss the datasets used for experimental evaluation. Then, we present the experimental set-up and architecture details of the networks. Finally, we discuss the results of the experiment with a comparison between existing dimensionality reduction methods.
A. Datasets
Three datasets were used for experimental evaluation: 1) MNIST [41], 2) CIFAR10 [42], and 3) Fashion MNIST [43]. All the datasets were scaled to the 0–1 range. These benchmark datasets were selected due to their relatively high dimension and reasonable training time with deep networks. Datasets were directly obtained from the Keras library [44].
The MNIST dataset consists of hand-written digits (0–9), where each digit is a 28 × 28 grayscale image. The dataset contains 60,000 training images and 10,000 test images.
The Fashion MNIST benchmark dataset consists of images used for clothing classification. It contains 28 × 28 grayscale images spanning 10 clothing categories, with 60,000 training and 10,000 test images.
The CIFAR10 dataset consists of 32 × 32 color images spanning 10 classes, with 50,000 training and 10,000 test images.
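For reference, the three datasets can be loaded directly from Keras and scaled to the 0–1 range as described above; this sketch uses the standard `tensorflow.keras.datasets` loaders:

```python
from tensorflow.keras.datasets import mnist, fashion_mnist, cifar10

datasets = {
    "MNIST": mnist.load_data(),
    "Fashion-MNIST": fashion_mnist.load_data(),
    "CIFAR10": cifar10.load_data(),
}
# Scale pixel values to the 0-1 range, as described above.
scaled = {
    name: tuple((x.astype("float32") / 255.0, y) for (x, y) in splits)
    for name, splits in datasets.items()
}
```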
B. Hyper-Parameters and Architectural Details
To maintain consistency in the experiments, all the architectures were kept constant across datasets when increasing the number of layers. Only two filter sizes were used, with 32 and 64 filters, and the size of the embedded representation was kept at 32. The number of layers was increased by repeating the convolution layer and pooling layer for a given filter size; for this experiment, the number of repeated layers was increased from 2 to 90 for each filter size. The optimizer (Adadelta) and K (5) were kept constant for all experiments across datasets. Batch normalization and LeakyReLU were used to improve model performance. For illustration purposes, the MNIST architecture with two filter sizes (32, 64) and 2 repeats is presented in Figure 4. For a given number of repeats (f), the total number of hidden layers is 2 + (f × number of filter sizes).
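The depth scaling can be sketched as follows. The repeat structure, the (32, 64) filter counts, and the 32-dimensional embedding follow the description above, while the exact placement of the pooling layers and the residual wiring (omitted here for brevity) are assumptions rather than the paper’s exact layout:

```python
from tensorflow.keras import Input, Model, layers

def build_encoder(repeats, filters=(32, 64), embedded_dim=32,
                  input_shape=(28, 28, 1)):
    """Encoder whose depth grows with `repeats` (f in the text)."""
    inputs = Input(shape=input_shape)
    h = inputs
    for n_filters in filters:
        for _ in range(repeats):                  # repeated conv layers
            h = layers.Conv2D(n_filters, 3, padding="same")(h)
            h = layers.BatchNormalization()(h)
            h = layers.LeakyReLU()(h)
        h = layers.MaxPooling2D(2, padding="same")(h)  # assumed placement
    h = layers.Flatten()(h)
    z = layers.Dense(embedded_dim)(h)             # 32-dimensional embedding
    return Model(inputs, z)
```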
C. Classification Accuracy
The trained autoencoder models were used to generate the embedded representations of the datasets. These embedded representations were then used for classification with the KNN algorithm, i.e., the encoder followed by KNN served as the classification network. Each experiment was repeated five times, and the average performances were recorded.
Table 2 shows the deep embedded classification accuracy obtained using the two models, C-AE and C-RAE, for the different datasets when increasing the number of hidden layers. Across all models, C-RAE showed improved accuracy compared to C-AE for all three datasets (highlighted values in Table 2). Table 3, column 6, shows the classification performance of KNN on the original high-dimensional data. Both C-AE and C-RAE based deep embedded classification achieved better accuracies than applying KNN directly to the original data. This indicates that these deep neural network models convert the original data into embedded representations that are more suitable for downstream tasks such as classification than the original input data.
Figure 5 shows a plot of the accuracies against the number of repeated layers. When increasing the number of layers, a small fluctuation in accuracy was observed for small models (up to 20 repeated layers) on all datasets. For large models, the accuracies started to decrease as the number of hidden layers increased. However, C-RAE showed significantly lower degradation compared to C-AE. It can therefore be inferred that C-RAE based embedded representations are less likely to under-perform when increasing the number of layers.
Figure 6 shows the classification accuracy distributions as box-and-whisker plots for all three datasets when increasing the number of layers. The height of each box indicates the variability of classification accuracy for that model; values outside the whisker range are marked as outliers with “X” marks. A shorter box-and-whisker plot indicates low variability of classification accuracy. For all C-RAEs, the whiskers are shorter than for the C-AEs, and there are no outliers, showing that C-RAE performs consistently, with low variability, when increasing the number of layers. Mean values are marked with “O”; all C-RAE means are higher than those of the C-AEs. These observations show that C-RAEs perform consistently as the number of hidden layers changes, whereas standard C-AEs require a thorough cross-validation process.
The last column of Table 2 shows the overall performance degradation of deep embedded classification when increasing the number of hidden layers. The performance degradation (PD) was calculated as the percentage accuracy drop when increasing the number of layers:\begin{equation*} PD = \frac {(\text {Maximum Acc} - \text {Minimum Acc})\times 100}{\text {Maximum Acc}}\tag{4}\end{equation*}
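For example, if a model’s highest embedded classification accuracy across depths is 0.98 and its lowest is 0.65 (illustrative values), then PD = (0.98 − 0.65) × 100 / 0.98 ≈ 33.7%.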
Both C-RAE and C-AE showed some performance degradation on all three datasets. C-AE, without residual connections, showed 33.38%–65.46% performance degradation, whereas C-RAE showed only 0.86%–1.97%. Based on these experimental results, residual connections reduce the possible performance degradation significantly.
D. Comparison Between Widely Used Dimensionality Reduction Methods
Table 3 presents the performance comparison between the proposed approaches and widely used unsupervised dimensionality reduction methods. We compared the proposed approach with two state-of-the-art deep neural network based dimensionality reduction methods (AE and C-AE) and the five conventional dimensionality reduction methods most widely used in the recent literature (PCA, LLE, ICA, Factor Analysis embedding, and Truncated SVD embedding). As described in the previous section, all these methods were used to convert the high-dimensional input space to an embedded representation of 32 features, and KNN was then used to perform classification on the embedded representations. Further, KNN was run on the original high-dimensional space to calculate the corresponding classification accuracy (last column of Table 3). For MNIST, all the embedded classification approaches except LLE and Factor Analysis embedding showed better accuracies than applying KNN to the original high-dimensional feature space; C-RAE showed the highest accuracy (0.9853). For Fashion MNIST, only the deep neural network based embedded classification methods showed higher accuracy than KNN; C-RAE showed the highest accuracy (0.8858). For CIFAR10, all the embedded classification methods except LLE showed higher accuracy than KNN; C-RAE showed the highest accuracy (0.4333). When comparing AEs and RAEs on all three datasets, RAEs showed slightly better performance, and when comparing RAE and C-RAE, C-RAE showed better accuracy on all three datasets. The results of Tables 2 and 3 indicate that deep neural network models convert the original data into embedded representations that are more suitable for downstream tasks such as classification than the original input data, and that C-RAE based embedded representations are less likely to under-perform when increasing the number of layers.
E. Overall Discussion and Future Work
Our hypothesis was that when adding new layers to standard AEs, their ability for effective feature learning degrades. Through the accuracy comparison in Table 2, we confirmed that the addition of residual connections to AEs (RAEs) improved their overall classification accuracy without incurring significant performance degradation (relative to standard AEs).
Through a comprehensive comparison with widely used unsupervised dimensionality reduction methods in Table 3, we demonstrated that C-RAE outperforms widely used feature learning methods such as standard AE, KNN, PCA, LLE, ICA, Factor Analysis, and SVD, with improvements of 1%–3% in classification accuracy. In addition to the accuracy improvement over the standard C-AE, C-RAE showed significantly lower performance degradation of classification accuracy (less than 3%) compared to C-AE (33%–65%) when increasing the network depth. These results evidence the advantages and overall superiority of C-RAEs for unsupervised feature learning compared to standard AEs and widely used traditional methods.
Finally, with the novel RAE framework presented here, one does not need to go through a trial-and-error process of finding the best architecture. Instead, one can safely opt for more layers in case a more complex model is required for improved overall performance, without sacrificing dimensionality reduction performance.
The experiment was conducted using three datasets that can be trained with deep neural networks within a reasonable amount of time. However, it should be noted that the advantage of using a deep neural network is more prominent when dealing with more complex datasets. Therefore, in future work, the framework will be tested with more complex datasets that are high in dimension and in the number of data records.
Conclusion
In this article, we tackled the performance degradation problem of automated deep unsupervised feature learning. We introduced an unsupervised deep learning framework, consisting of the ResNet Autoencoder (RAE) and its convolutional version C-RAE, that allows making neural networks deeper without sacrificing their dimensionality reduction performance. In this way, we improve resistance to performance degradation compared to standard Autoencoders (AEs) for feature learning. The performance of RAE in learning deep embedded representations was evaluated on a classification task using KNN. RAE was compared against AE while increasing the number of hidden layers, on three benchmark datasets. We demonstrated that C-RAE showed the highest accuracy on all three datasets. At the same time, C-RAE based classification showed only 0.86% to 2.68% performance degradation, which is significantly lower than the degradation shown by the standard C-AE (33.38%–65.46%). The empirical results confirmed that RAE reduces the performance degradation of deep embedded representation based classification. This framework allows users to design fewer experiments, knowing that larger networks will not harm the network’s performance, which is especially valuable when dealing with unlabeled data, where the optimal network size is challenging to decide. Further, the classification accuracy distributions showed that RAE models perform better in terms of mean accuracy and accuracy variance (lower variance), making them more suitable for deep embedded classification tasks than AEs. Finally, we compared RAEs with widely used dimensionality reduction methods and showed that C-RAE outperforms them on all evaluated datasets.