Introduction
Data clustering is a basic problem in many areas, such as machine learning, pattern recognition, computer vision, and data compression. The goal of clustering is to group similar data into one cluster based on some similarity measure (e.g., Euclidean distance). Although a large number of data clustering methods have been proposed [1]–[5], conventional clustering methods usually perform poorly on high-dimensional data, due to the inefficiency of the similarity measures they use. Furthermore, these methods generally suffer from high computational complexity on large-scale datasets. For this reason, dimensionality reduction and feature transformation methods have been extensively studied to map the raw data into a new feature space, where the transformed data are easier to separate with existing classifiers. Generally speaking, existing data transformation methods include linear transformations such as Principal Component Analysis (PCA) [6] and non-linear transformations such as kernel methods [7] and spectral methods [8]. Nevertheless, the highly complex latent structure of data still challenges the effectiveness of existing clustering methods. Owing to the development of deep learning [9], deep neural networks (DNNs) can be used to transform the data into more clustering-friendly representations thanks to their inherent capacity for highly non-linear transformation. For simplicity of description, we refer to clustering methods that rely on deep learning as deep clustering in this paper.
Previous work mainly treats feature transformation and clustering independently: data are usually mapped into a feature space and then directly fed into a clustering algorithm. In recent years, Deep Embedded Clustering (DEC) [11] was proposed and followed by other novel methods [12]–[18], making deep clustering a popular research field. An overview of deep clustering was recently presented in [19] to review the most remarkable algorithms in this field. Specifically, it presented some key elements of deep clustering and introduced related methods. However, that overview mainly focuses on methods based on the autoencoder [20] and does not cover many other important approaches, e.g., clustering based on deep generative models. Moreover, some up-to-date progress is also missing. Therefore, it is worthwhile to conduct a more systematic survey covering the advanced methods in deep clustering.
Classical clustering methods are usually categorized as partition-based methods [21], density-based methods [22], hierarchical methods [23], and so on. However, since the essence of deep clustering is to learn a clustering-oriented representation, it is not appropriate to classify methods according to the clustering loss; instead, we should focus on the network architecture used for clustering. In this paper, we survey deep clustering from the perspective of network architecture. The first category uses the autoencoder (AE) to obtain a feasible feature space. An autoencoder network provides a non-linear mapping function through learning an encoder and a decoder, where the encoder is the mapping function to be trained and the decoder is required to be capable of reconstructing the original data from the features generated by the encoder. The second category is based on feed-forward networks trained only by a specific clustering loss, so we refer to this type of DNN as a Clustering DNN (CDNN). The network architecture of this category can be very deep, and networks pre-trained on large-scale image datasets can further boost its clustering performance. The third and fourth categories are based on the Generative Adversarial Network (GAN) [24] and the Variational Autoencoder (VAE) [25], respectively, which are the most popular deep generative models of recent years. They can not only perform the clustering task but also generate new samples from the obtained clusters. In more detail, we present a taxonomy of existing deep clustering methods based on the network architecture, introduce the representative deep clustering methods, and compare the advantages and disadvantages of different architectures and methods. Finally, some directions are suggested for the future development of this field.
The rest of this paper is organized as follows: Section II reviews some preliminaries of deep clustering. Section III presents a taxonomy of existing deep clustering algorithms and introduces some representative methods. Finally, Section IV provides some notable trends of deep clustering and gives concluding remarks.
Preliminaries
In this section, we introduce some preliminary knowledge of deep clustering. It includes the related network architectures for feature representation, loss functions of standard clustering methods, and the performance evaluation metrics for deep clustering.
A. Neural Network Architecture for Deep Clustering
In this part, we introduce some neural network architectures, which have been extensively used to transform inputs to a new feature representation.
1) Feedforward Fully-Connected Neural Network
A fully-connected network (FCN) consists of multiple layers of neurons, in which each neuron is connected to every neuron in the previous layer and each connection has its own weight. The FCN is also known as a multi-layer perceptron (MLP). It is a completely general-purpose connection pattern and makes no assumptions about the structure of the features in the data. It is usually used in supervised learning when labels are provided. For clustering, however, a good initialization of the network parameters is necessary, because a naive FCN tends to reach a trivial solution in which all data points are simply mapped to tight clusters; this leads to a small clustering loss but is far from being desired [13].
2) Feedforward Convolutional Neural Network
Convolutional neural networks (CNNs) [26] were inspired by biological processes, in which the connectivity pattern between neurons resembles the organization of the animal visual cortex. Likewise, each neuron in a convolutional layer is only connected to a few nearby neurons in the previous layer, and the same set of weights is shared by every neuron. CNNs are widely applied to image datasets when locality and shift-invariance of feature extraction are required. They can be trained with a specific clustering loss directly, without any requirement on initialization, although a good initialization can significantly boost the clustering performance. To the best of our knowledge, no theoretical explanation has been given in existing papers, but extensive work shows their feasibility for clustering.
3) Deep Belief Network
Deep Belief Networks (DBNs) [27] are generative graphical models that learn to extract a deep hierarchical representation of the input data. A DBN is composed of several stacked Restricted Boltzmann Machines (RBMs) [28]. Greedy layer-wise unsupervised training is applied to DBNs with RBMs as the building blocks for each layer. Then, all (or part) of the parameters of the DBN are fine-tuned with respect to a certain criterion (loss function), e.g., a proxy for the DBN log-likelihood, a supervised training criterion, or a clustering loss.
4) Autoencoder
The autoencoder (AE) is one of the most significant algorithms in unsupervised representation learning. It is a powerful method for training a mapping function that minimizes the reconstruction error between the code layer and the data layer. Since the hidden layer usually has a smaller dimensionality than the data layer, it can help find the most salient features of the data. Although the autoencoder is mostly applied to find a better initialization of parameters for supervised learning, it is also natural to combine it with unsupervised clustering. More details and formulations will be introduced in Section III-A.
5) GAN & VAE
The Generative Adversarial Network (GAN) and the Variational Autoencoder (VAE) are the most powerful frameworks for deep generative learning. GAN aims to achieve an equilibrium between a generator and a discriminator, while VAE attempts to maximize a lower bound on the data log-likelihood. A series of model extensions has been developed for both GAN and VAE. Moreover, they have also been applied to clustering tasks. The details of the two models are elaborated in Section III-C and Section III-D, respectively.
B. Loss Functions Related to Clustering
This part introduces some clustering loss functions, which guide the network to learn clustering-friendly representations. Generally, there are two kinds of clustering loss, which we refer to as the principal clustering loss and the auxiliary clustering loss.
Principal Clustering Loss: This category of clustering loss functions contains the cluster centroids and the cluster assignments of samples. In other words, after training the network under the guidance of the clustering loss, the clusters can be obtained directly. It includes the $k$-means loss [13], the cluster assignment hardening loss [11], the agglomerative clustering loss [29], the nonparametric maximum margin clustering loss [30], and so on (a minimal illustration of the $k$-means loss is given at the end of this part).
Auxiliary Clustering Loss: The second category solely plays the role of guiding the network to learn a representation that is more feasible for clustering, but it cannot output clusters straightforwardly. This means that deep clustering methods with only an auxiliary clustering loss need to run a clustering method after the network has been trained in order to obtain the clusters. There are many auxiliary clustering losses used in deep clustering, such as the locality-preserving loss [31], which enforces the network to preserve the local property of the data embedding; the group sparsity loss [31], which exploits a block-diagonal similarity matrix for representation learning; and the sparse subspace clustering loss [32], which aims at learning a sparse code of the data.
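As a concrete illustration of the first category, the following is a minimal sketch (not taken from any of the surveyed papers) of the $k$-means loss evaluated on embedded points, assuming a generic embedding matrix `Z` produced by some encoder and a centroid matrix `M`:

```python
import numpy as np

def kmeans_loss(Z, M):
    """k-means clustering loss on embedded points.

    Z: (n, d) array of embeddings produced by some encoder.
    M: (k, d) array of cluster centroids.
    Returns the mean squared distance of each point to its nearest
    centroid, together with the hard cluster assignments.
    """
    # Pairwise squared Euclidean distances between points and centroids.
    d2 = ((Z[:, None, :] - M[None, :, :]) ** 2).sum(-1)   # (n, k)
    assign = d2.argmin(axis=1)                            # hard assignments
    loss = d2[np.arange(len(Z)), assign].mean()
    return loss, assign

# Toy usage with random embeddings and centroids.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 10))
M = rng.normal(size=(3, 10))
loss, assign = kmeans_loss(Z, M)
print(loss, np.bincount(assign, minlength=3))
```

Because this loss depends on explicit centroids and assignments, minimizing it during training yields the clusters directly, which is exactly what distinguishes a principal clustering loss from an auxiliary one.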
C. Performance Evaluation Metrics for Deep Clustering
Two standard unsupervised evaluation metrics are extensively used in many deep clustering papers. For all algorithms, the number of clusters is set to the number of ground-truth categories. The first metric is the unsupervised clustering accuracy (ACC):\begin{equation*} ACC = \max _{m} \frac {\sum _{i=1}^{n} \boldsymbol {1}\{y_{i}=m(c_{i})\}}{n}\end{equation*} where $y_{i}$ is the ground-truth label, $c_{i}$ is the cluster assignment produced by the algorithm, and $m$ ranges over all possible one-to-one mappings between clusters and labels; the best mapping can be found efficiently with the Hungarian algorithm [33].
The second one is the Normalized Mutual Information (NMI) [34]:\begin{equation*} NMI(Y,C) = \frac {I(Y,C)}{\frac {1}{2}[H(Y)+H(C)] }\end{equation*} where $Y$ denotes the ground-truth labels, $C$ denotes the cluster labels, $I$ is the mutual information, and $H$ is the entropy.
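Both metrics can be computed with standard Python tooling; the sketch below uses the Hungarian algorithm from SciPy to find the optimal cluster-to-label mapping for ACC, and scikit-learn's implementation of NMI:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """Unsupervised clustering accuracy: best one-to-one mapping
    between cluster indices and ground-truth labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1
    # The Hungarian algorithm maximizes the total matches
    # (equivalently, minimizes the negated count matrix).
    row, col = linear_sum_assignment(-cost)
    return cost[row, col].sum() / len(y_true)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [1, 1, 0, 0, 2, 2]          # same partition, permuted labels
print(clustering_accuracy(y_true, y_pred))              # 1.0
print(normalized_mutual_info_score(y_true, y_pred))     # 1.0
```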
Taxonomy of Deep Clustering
Deep clustering is a family of clustering methods that adopt deep neural networks to learn clustering-friendly representations. The loss function (optimizing objective) of deep clustering methods is typically composed of two parts: a network loss $L_{n}$ and a clustering loss $L_{c}$, which can be formulated as follows:\begin{equation*} L = \lambda L_{n} + (1 - \lambda)L_{c}\tag{1}\end{equation*} where $\lambda \in [0,1]$ is a hyper-parameter that balances the two terms. The network loss is used to learn feasible representations and to avoid trivial solutions, while the clustering loss encourages the learned representations to form clusters.
A. AE-Based Deep Clustering
The autoencoder is a kind of neural network designed for unsupervised data representation. It aims at minimizing the reconstruction loss. An autoencoder may be viewed as consisting of two parts: an encoder function $f_\phi(\cdot)$, which maps an input $\boldsymbol {x}$ into a latent code, and a decoder $g_\theta(\cdot)$, which produces a reconstruction from that code. The reconstruction objective is\begin{equation*} \min _{\phi, \theta } L_{rec} = \min _{\phi, \theta } \frac {1}{n}\sum _{i=1}^{n} \parallel \boldsymbol {x}_{i} - g_\theta (f_\phi (\boldsymbol {x}_{i})) \parallel ^{2}.\tag{2}\end{equation*}
Architecture: The original autoencoder is composed of multi-layer perceptrons. For the sake of handling data with spatial invariance, e.g., image data, convolutional and pooling layers can be used to construct a convolutional autoencoder (CAE).
Robustness: To avoid overfitting and to improve robustness, it is natural to add noise to the input. The denoising autoencoder [35] attempts to reconstruct $\boldsymbol {x}$ from $\tilde {\boldsymbol {x}}$, which is a corrupted version of $\boldsymbol {x}$ obtained through some form of noise. Additionally, noise can also be added to the inputs of each layer [14].
Restrictions on latent features: The under-complete autoencoder constrains the dimension of the latent code $\boldsymbol {z}$ to be lower than that of the input $\boldsymbol {x}$, enforcing the encoder to extract the most salient features from the original space. Other restrictions can also be adopted, e.g., the sparse autoencoder [36] imposes a sparsity constraint on the latent code to obtain a sparse representation.
Reconstruction loss: Commonly the reconstruction loss of an autoencoder involves only the discrepancy between the input and output layers, but the reconstruction losses of all layers can also be optimized jointly [14].
The optimizing objective of AE-based deep clustering is thus formulated as follows:\begin{equation*} L =\lambda L_{rec} + (1-\lambda) L_{c}\tag{3}\end{equation*} where $L_{rec}$ is the reconstruction loss in Equation (2), $L_{c}$ is the clustering loss, and $\lambda \in [0,1]$ balances the two terms.
Deep Clustering Network (DCN):
DCN [13] is one of the most remarkable methods in this field; it combines an autoencoder with the $k$-means algorithm. In the first step, it pre-trains an autoencoder. Then, it jointly optimizes the reconstruction loss and the $k$-means loss. Since $k$-means uses discrete cluster assignments, the method requires an alternating optimization procedure. The objective of DCN is simple compared with other methods, and its computational complexity is relatively low. A minimal sketch of this alternating scheme is given below.
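The following PyTorch sketch illustrates one alternating DCN-style update; the shallow encoder, decoder and hyper-parameters are placeholders rather than the architecture used in [13], and the centroids are refreshed with a simple batch mean instead of the streaming update of the original paper:

```python
import torch
import torch.nn as nn

# Placeholder encoder/decoder; the real DCN uses deeper, task-specific networks.
enc = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
dec = nn.Sequential(nn.Linear(10, 256), nn.ReLU(), nn.Linear(256, 784))
k, lam = 10, 0.5
centroids = torch.randn(k, 10)                 # cluster centroids in latent space
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

def dcn_step(x):
    """One alternating DCN-style update on a mini-batch x of shape (n, 784)."""
    z = enc(x)
    # (1) Fix the network, update the discrete assignments (nearest centroid).
    with torch.no_grad():
        assign = torch.cdist(z, centroids).argmin(dim=1)
    # (2) Fix the assignments, update the network by gradient descent on
    #     reconstruction loss + k-means loss (cf. Eq. (3)).
    rec = ((dec(z) - x) ** 2).sum(dim=1).mean()
    km = ((z - centroids[assign]) ** 2).sum(dim=1).mean()
    loss = lam * rec + (1 - lam) * km
    opt.zero_grad(); loss.backward(); opt.step()
    # (3) Fix the network, refresh each centroid as the mean of its assigned points.
    with torch.no_grad():
        z = enc(x)
        for j in range(k):
            if (assign == j).any():
                centroids[j] = z[assign == j].mean(dim=0)
    return loss.item()

print(dcn_step(torch.randn(64, 784)))
```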
Deep Embedding Network (DEN):
DEN [31] proposes a deep embedding network to extract effective representations for clustering. It first utilizes a deep autoencoder to learn reduced representations from the raw data. Secondly, in order to preserve the local structure of the original data, a locality-preserving constraint is applied. Furthermore, it also incorporates a group sparsity constraint to diagonalize the affinity of representations. Together with the reconstruction loss, the three losses are jointly optimized to fine-tune the network for a clustering-oriented representation. The locality-preserving and group sparsity constraints serve as the auxiliary clustering loss (see Section II-B); thus, as the last step, k-means is required to cluster the learned representations.
Deep Subspace Clustering Networks (DSC-Nets):
DSC-Nets [37] introduces a novel autoencoder architecture to learn an explicit non-linear mapping that is friendly to subspace clustering [38]. The key contribution is a novel self-expressive layer, which is a fully connected layer without bias or non-linear activation, inserted at the junction between the encoder and the decoder. This layer aims at encoding the self-expressiveness property [39], [40] of data drawn from a union of subspaces. Mathematically, its optimizing objective is a subspace clustering loss combined with a reconstruction loss. Although it shows superior performance on several small-scale datasets, it is very memory- and time-consuming and thus cannot be applied to large-scale datasets: its parameter count is $O(n^{2})$ for $n$ samples, and it can only be optimized by gradient descent. A minimal sketch of the self-expressive layer is given below.
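The PyTorch sketch below shows the self-expressive layer and its associated losses; the shallow encoder and decoder, the penalty weights and the l1 regularizer are illustrative choices, not the exact configuration of [37]:

```python
import torch
import torch.nn as nn

class SelfExpressiveLayer(nn.Module):
    """Fully connected layer without bias or activation, inserted between
    encoder and decoder; its n x n weight C expresses each latent code as a
    linear combination of all the others (self-expressiveness)."""
    def __init__(self, n_samples):
        super().__init__()
        self.C = nn.Parameter(1e-4 * torch.randn(n_samples, n_samples))

    def forward(self, z):                 # z: (n_samples, d) latent codes
        return self.C @ z                 # reconstructed codes C Z

# Hypothetical shallow encoder/decoder for illustration only.
n, d_in, d_z = 200, 32, 8
enc = nn.Sequential(nn.Linear(d_in, 16), nn.ReLU(), nn.Linear(16, d_z))
dec = nn.Sequential(nn.Linear(d_z, 16), nn.ReLU(), nn.Linear(16, d_in))
selfexp = SelfExpressiveLayer(n)

x = torch.randn(n, d_in)                  # the whole dataset is one batch
z = enc(x)
z_se = selfexp(z)
rec_loss = ((dec(z_se) - x) ** 2).mean()              # reconstruction loss
selfexp_loss = ((z_se - z) ** 2).mean()               # ||Z - CZ||^2
reg = selfexp.C.abs().sum()                           # e.g. an l1 penalty on C
loss = rec_loss + 0.1 * selfexp_loss + 1e-3 * reg
loss.backward()
# After training, spectral clustering on |C| + |C|^T recovers the subspaces.
```

Because `C` is an n-by-n parameter matrix and the whole dataset must be processed in a single batch, the quadratic memory footprint noted above is visible directly in this sketch.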
Deep Multi-Manifold Clustering (DMC):
DMC [41] is a deep-learning-based framework for multi-manifold clustering (MMC). It optimizes a joint loss function comprising two parts: a locality-preserving objective and a clustering-oriented objective. The first part makes the learned representations meaningful and embedded in their intrinsic manifold; it includes the autoencoder reconstruction loss and the locality-preserving loss. The second part penalizes representations based on their proximity to each cluster centroid, making the representations cluster-friendly and discriminative. Experimental results show that DMC performs better than the state-of-the-art multi-manifold clustering methods.
Deep Embedded Regularized Clustering (DEPICT):
DEPICT [14] is a sophisticated method that combines several effective techniques. It consists of a softmax layer stacked on top of a multi-layer convolutional autoencoder. It minimizes a relative entropy loss function with a regularization term for clustering. The regularization term encourages balanced cluster assignments and avoids allocating clusters to outlier samples. Furthermore, the reconstruction loss of the autoencoder is also employed to prevent corrupted feature representations. Note that each layer in both the encoder and the decoder contributes to the reconstruction loss, rather than only the input and output layers. Another highlight of this method is that it employs a noisy encoder to enhance the robustness of the algorithm. Experimental results show that DEPICT achieves superior clustering performance while being computationally efficient.
Deep Continuous Clustering (DCC):
DCC [42] is also an AE-based deep clustering algorithm. It aims at addressing two limitations of deep clustering. Since most deep clustering algorithms are based on classical center-based, divergence-based or hierarchical clustering formulations, they have some inherent limitations. For one thing, they require the number of clusters to be set a priori. For another, their optimization procedures involve discrete reconfigurations of the objective, which require the clustering parameters and network parameters to be updated alternately. DCC is rooted in Robust Continuous Clustering (RCC) [43], a formulation with a clear continuous objective and no prior knowledge of the number of clusters. As in many other methods, the representation learning and the clustering are optimized jointly.
Architecture of clustering based on autoencoder. The network is trained by both clustering loss and reconstruction loss.
Architecture of CDNN-based deep clustering algorithms. The network is only adjusted by the clustering loss. The network architecture can be FCN, CNN, DBN and so on.
B. CDNN-Based Deep Clustering
CDNN-based algorithms only use the clustering loss to train the network, where the network can be FCN, CNN or DBN. The optimizing objective of CDNN-based algorithms can be formulated as follows:\begin{equation*} L = L_{c}\tag{4}\end{equation*}
1) Unsupervised Pre-Trained Network
RBMs and autoencoders have been applied to CDNN-based clustering. These algorithms first train an RBM or an autoencoder in an unsupervised manner, and then fine-tune the network (only the encoder part in the case of the autoencoder) with the clustering loss. Several representative algorithms are introduced below.
Deep Nonparametric Clustering (DNC):
DNC [30] leverages unsupervised feature learning with a DBN for clustering analysis. It first trains a DBN to map the original training data into embedding codes. Then, it runs the nonparametric maximum margin clustering (NMMC) algorithm to obtain the number of clusters and labels for all training data. After that, it fine-tunes the parameters of the top layer of the DBN. The experimental results show advantages over classical clustering algorithms.
Deep Embedded Clustering (DEC):
DEC [11] is one of the most representative methods of deep clustering and has attracted a lot of attention to this field. It uses an autoencoder as the network architecture and the cluster assignment hardening loss as a regularization. It first trains an autoencoder using the reconstruction loss and then drops the decoder part. The features extracted by the encoder network serve as the input of the clustering module. After that, the network is fine-tuned using the cluster assignment hardening loss. Meanwhile, the clusters are iteratively refined by minimizing the KL-divergence between the distribution of soft labels and an auxiliary target distribution. As a result, the algorithm obtains good results and has become a baseline against which new deep clustering algorithms are compared.
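The cluster assignment hardening loss can be sketched as follows, following the formulation in [11]: soft assignments computed with a Student's t kernel, a sharpened target distribution, and the KL divergence between the two (the embeddings and centroids below are random placeholders standing in for the encoder output and learned cluster centers):

```python
import torch

def soft_assign(z, mu, alpha=1.0):
    """Student's t soft assignment q_ij between embeddings z and centroids mu."""
    d2 = torch.cdist(z, mu) ** 2
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    return q / q.sum(dim=1, keepdim=True)

def target_distribution(q):
    """Auxiliary target p_ij that sharpens q and normalizes by cluster frequency."""
    w = q ** 2 / q.sum(dim=0)
    return w / w.sum(dim=1, keepdim=True)

z = torch.randn(128, 10, requires_grad=True)   # placeholder embeddings
mu = torch.randn(5, 10, requires_grad=True)    # placeholder centroids
q = soft_assign(z, mu)
p = target_distribution(q).detach()            # target is held fixed per update
kl = (p * (p / q).log()).sum(dim=1).mean()     # cluster assignment hardening loss
kl.backward()                                  # gradients refine z and mu jointly
```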
Discriminatively Boosted Clustering (DBC):
DBC [12] has almost the same architecture as DEC; the only improvement is that it uses a convolutional autoencoder. In other words, it also first pre-trains an autoencoder and then uses the cluster assignment hardening loss to fine-tune the network, along with refining the clustering parameters. It outperforms DEC on image datasets on account of the use of the convolutional network.
2) Supervised Pre-Trained Network
Although unsupervised pre-training provides a better initialization of networks, it is still challenging to extract feasible features from complex image data. Guérin et al. [44] conduct extensive experiments testing combinations of different popular CNN architectures pre-trained on ImageNet [45] with different classical clustering algorithms. The experimental results show that features extracted from a deep CNN trained on a large and diverse labeled dataset, combined with a classical clustering algorithm, can outperform state-of-the-art image clustering methods. To this effect, when the clustering target is complex image data, it is natural to make use of popular network architectures such as VGG [46], ResNet [47] or Inception [48] models pre-trained on large-scale image datasets like ImageNet, to speed up the convergence of training and to boost the clustering quality. A minimal sketch of this feature-extraction-plus-clustering pipeline is given below, followed by the most remarkable method of this type.
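The sketch assumes PyTorch/torchvision with an ImageNet-pre-trained ResNet-18 as the feature extractor (downloading the weights requires torchvision 0.13+ and network access) and k-means from scikit-learn; the specific backbone and cluster count are arbitrary choices for illustration:

```python
import torch
import torch.nn as nn
from torchvision import models
from sklearn.cluster import KMeans

# Feature extractor pre-trained on ImageNet.
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()          # drop the classification head, keep 512-d features
backbone.eval()

@torch.no_grad()
def extract_features(images):
    """images: (n, 3, 224, 224) tensor, already normalized for ImageNet."""
    return backbone(images).cpu().numpy()

# Toy batch of random "images" just to show the pipeline end to end.
images = torch.rand(16, 3, 224, 224)
features = extract_features(images)
labels = KMeans(n_clusters=4, n_init=10).fit_predict(features)
print(labels)
```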
Clustering Convolutional Neural Network (CCNN):
CCNN [17] is an efficient and reliable deep clustering algorithm which can deal with large-scale image datasets. It proposes a CNN-based framework to solve clustering and representation learning iteratively. It first randomly picks $k$ samples and uses an initial model pre-trained on the ImageNet dataset to extract their features as the initial cluster centroids. In each step, mini-batch $k$-means is performed to update the assignments of samples and the cluster centroids, while stochastic gradient descent is used to update the parameters of the proposed CNN. The mini-batch $k$-means significantly reduces computation and memory costs, enabling CCNN to be adapted to large-scale datasets. Moreover, it also includes a novel iterative centroid updating method that avoids drift error induced by the feature inconsistency between two successive iterations. At the same time, only the top-$k_{m}$ samples with the smallest distances to their corresponding centroids are chosen to update the network parameters, in order to enhance the reliability of updates. All these techniques improve the clustering performance. To the best of our knowledge, it is the only deep clustering method that can deal with the task of clustering millions of images.
3) Non-Pre-Trained Network
Although a pre-trained network can significantly boost the clustering performance, networks can also be trained from scratch to extract discriminative features under the guidance of a well-designed clustering loss.
Information Maximizing Self-Augmented Training (IMSAT):
IMSAT [49] is an unsupervised discrete representation learning algorithm, the task of which is to obtain a function mapping data into discrete representations. Clustering is a special case of the task. It combines FCN and regularized Information Maximization (RIM) [50], which learns a probabilistic classifier such that mutual information between inputs and cluster assignments is maximized. Besides, the complexity of the classifier is regularized. At the same time, an flexible and useful regularization objective termed Self-Augmented Training (SAT) is proposed to impose the intended invariance on the data representations. This data augmentation technique significantly improves the performance of standard deep RIM. IMSAT shows state-of-the-art results on MNIST and REUTERS datasets.
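In the sketch, the mutual information is the marginal entropy of the predicted cluster distribution minus the mean conditional entropy, and the SAT-style penalty simply perturbs the input with Gaussian noise; the shallow classifier and this noise augmentation are stand-ins for the deeper network and the augmentations used in [49]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder probabilistic classifier p(c|x); IMSAT uses a deeper FCN.
net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def rim_sat_loss(x, lam=0.1, noise=0.1):
    """Negative mutual information between inputs and cluster assignments,
    plus a self-augmented-training-style penalty (Gaussian noise here is
    only a stand-in for the paper's augmentations)."""
    p = F.softmax(net(x), dim=1)                       # p(c|x)
    p_mean = p.mean(dim=0)                             # marginal p(c)
    marg_ent = -(p_mean * p_mean.log()).sum()          # H(p(c))
    cond_ent = -(p * p.log()).sum(dim=1).mean()        # E_x[H(p(c|x))]
    mutual_info = marg_ent - cond_ent
    p_aug = F.softmax(net(x + noise * torch.randn_like(x)), dim=1)
    sat = (p.detach() * (p.detach().log() - p_aug.log())).sum(dim=1).mean()
    return -mutual_info + lam * sat

loss = rim_sat_loss(torch.randn(64, 784))
loss.backward()
```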
Joint Unsupervised Learning (JULE):
JULE [16] is proposed to learn feature representations and cluster images jointly. A convolutional neural network is used for representation learning and hierarchical (specifically, agglomerative) clustering is used for clustering. It optimizes the objective iteratively in a recurrent process: hierarchical image clustering is performed in the forward pass, while the feature representation is learned in the backward pass. In the forward pass, the representations of images are regarded as initial samples, and label information is then generated from an undirected affinity matrix built on the deep representations of the images. After that, two clusters are merged according to a predefined loss metric. In the backward pass, the network parameters are iteratively updated towards a better feature representation by optimizing over the already merged clusters. In experiments, the method shows excellent results on image datasets and indicates that the learned representations can be transferred across different datasets. Nevertheless, the computational and memory costs are extremely high when the dataset is large, as the method requires constructing an undirected affinity matrix; worse, this cost can hardly be reduced since the matrix is dense.
Deep Adaptive Image Clustering (DAC):
DAC [51] is a single-stage convolutional-network-based method for clustering images. The method is motivated by the basic assumption that the relationship between pairwise images is binary, and its optimizing objective is a binary pairwise-classification problem. The images are represented by label features extracted by a convolutional neural network, and the pairwise similarities are measured by the cosine similarity between label features. Furthermore, DAC introduces a constraint that makes the learned label features tend to be one-hot vectors. Moreover, since the ground-truth similarities are unknown, it adopts an adaptive learning algorithm [52], an alternating iterative method, to optimize the model. In each iteration, pairs of images with confidently estimated similarities are selected based on the fixed network, and the network is then trained on the selected labeled pairs. DAC converges when all instances are used for training and the objective cannot be improved further. Finally, images are clustered according to the largest response of their label features. DAC achieves superior performance on five challenging datasets. A minimal sketch of the pairwise objective is given below.
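The sketch below illustrates the binary pairwise-classification objective on a batch of label features; the selection thresholds, which DAC actually adapts during training, are fixed here purely for illustration:

```python
import torch
import torch.nn.functional as F

def dac_pairwise_loss(label_features, upper=0.95, lower=0.455):
    """Binary pairwise-classification loss on a batch of label features
    (per-image softmax outputs of a CNN). Pairs whose cosine similarity
    exceeds `upper` are treated as similar, those below `lower` as
    dissimilar; ambiguous pairs are ignored."""
    f = F.normalize(label_features, dim=1)
    sim = f @ f.t()                                   # cosine similarities
    pos = (sim.detach() > upper).float()              # estimated similar pairs
    neg = (sim.detach() < lower).float()              # estimated dissimilar pairs
    eps = 1e-7
    loss_pos = -(pos * torch.log(sim.clamp(min=eps))).sum()
    loss_neg = -(neg * torch.log((1 - sim).clamp(min=eps))).sum()
    n_selected = pos.sum() + neg.sum()
    return (loss_pos + loss_neg) / n_selected.clamp(min=1)

# Label features would come from a softmax layer; random values just for shape.
features = torch.softmax(torch.randn(32, 10), dim=1)
print(dac_pairwise_loss(features).item())
```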
C. VAE-Based Deep Clustering
AE-based and CDNN-based deep clustering methods have achieved impressive improvements over classical clustering methods. However, they are designed specifically for clustering and fail to uncover the true underlying structure of the data, which prevents them from being extended to tasks beyond clustering, e.g., generating samples. Worse still, the assumptions underlying the dimensionality reduction techniques are generally independent of the assumptions of the clustering techniques, so there is no theoretical guarantee that the network will learn feasible representations. In recent years, the Variational Autoencoder (VAE), a kind of deep generative model, has attracted extensive attention and motivated a large number of variants. In this section, we introduce the deep clustering algorithms based on VAE.
VAE can be considered a generative variant of AE, as it enforces the latent code of AE to follow a predefined distribution. VAE combines variational Bayesian methods with the flexibility and scalability of neural networks. It introduces neural networks to fit the conditional posterior and thus can optimize the variational inference objective via stochastic gradient descent [53] and standard backpropagation [54]. To be specific, it uses the reparameterization of the variational lower bound to yield a simple differentiable unbiased estimator of the lower bound. This estimator can be used for efficient approximate posterior inference in almost any model with continuous latent variables. Mathematically, it aims at maximizing the (variational) lower bound on the marginal likelihood of the dataset $X$:\begin{align*} L(\theta, \phi; X)=&~\sum _{i=1}^{N} (-D_{KL}(q_\phi (\boldsymbol {z} | \boldsymbol {x}^{(i)}) \parallel p(\boldsymbol {z})) \\&\qquad +\,\mathbb {E}_{q_\phi (\boldsymbol {z}|\boldsymbol {x}^{(i)})}[\log p_\theta (\boldsymbol {x}^{(i)} | \boldsymbol {z})])\tag{5}\end{align*}
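The two terms of this bound and the reparameterization trick can be sketched as follows (a Bernoulli likelihood and a standard Gaussian prior are assumed, and the tiny fully connected networks are placeholders):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE with a Gaussian posterior and a Bernoulli likelihood."""
    def __init__(self, d_in=784, d_z=10):
        super().__init__()
        self.enc = nn.Linear(d_in, 128)
        self.mu, self.logvar = nn.Linear(128, d_z), nn.Linear(128, d_z)
        self.dec = nn.Sequential(nn.Linear(d_z, 128), nn.ReLU(), nn.Linear(128, d_in))

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization
        return self.dec(z), mu, logvar

def neg_elbo(x, x_logits, mu, logvar):
    # Reconstruction term E_q[log p(x|z)] (Bernoulli) and KL(q(z|x) || N(0, I)).
    rec = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
    return (rec + kl) / x.size(0)

vae = TinyVAE()
x = torch.rand(64, 784)                     # toy inputs in [0, 1]
loss = neg_elbo(x, *vae(x))
loss.backward()
```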
Variational Deep Embedding (VaDE):
VaDE [15] considers the generative model $p(\boldsymbol {x},\boldsymbol {z},c) = p(\boldsymbol {x}|\boldsymbol {z})p(\boldsymbol {z}|c)p(c)$. In this model, an observed sample $\boldsymbol {x}$ is generated by the following process:\begin{align*} c\sim&~Cat(1/K), \quad \boldsymbol {z} \sim \mathcal {N}(\boldsymbol {\mu }_{c},\boldsymbol {\sigma }_{c}^{2} \boldsymbol {I})\\ \boldsymbol {x}\sim&~\mathcal {N}(\boldsymbol {\mu }_{x}(\boldsymbol {z}),\boldsymbol {\sigma }_{x}^{2}(\boldsymbol {z}) \boldsymbol {I}) {~\text {or }} \mathcal {B}(\boldsymbol {\mu }_{x}(\boldsymbol {z}))\end{align*} where $Cat(\cdot)$ is the categorical distribution, $K$ is the predefined number of clusters, $\boldsymbol {\mu }_{c}$ and $\boldsymbol {\sigma }_{c}$ are the mean and the variance of the Gaussian distribution corresponding to cluster $c$, and $\mathcal {N}(\cdot)$ and $\mathcal {B}(\cdot)$ are the multivariate Gaussian and Bernoulli distributions parameterized by $\boldsymbol {\mu }_{x}, \boldsymbol {\sigma }_{x}$ and $\boldsymbol {\mu }_{x}$, respectively. VaDE is trained to maximize the likelihood of the given samples. The log-likelihood of VaDE can be bounded as:\begin{align*}&\hspace {-2pc}\log p(\boldsymbol {x}) \\=&~\log \int _{\boldsymbol {z}} \sum _{c} p(\boldsymbol {x},\boldsymbol {z},c)d\boldsymbol {z} \\\geq&~E_{q(\boldsymbol {z},c|\boldsymbol {x})}\left[{\log \frac {p(\boldsymbol {x},\boldsymbol {z},c)}{q(\boldsymbol {z},c|\boldsymbol {x})}}\right] \\=&~L_{ELBO}(\boldsymbol {x}) \\=&~E_{q(\boldsymbol {z},c|\boldsymbol {x})}[\log p(\boldsymbol {x}|\boldsymbol {z})] - D_{KL}(q(\boldsymbol {z},c|\boldsymbol {x})||p(\boldsymbol {z},c))\tag{6}\end{align*} where $L_{ELBO}$ is the evidence lower bound (ELBO) and $q(\boldsymbol {z},c|\boldsymbol {x})$ is the variational posterior that approximates the true posterior $p(\boldsymbol {z},c|\boldsymbol {x})$. The first term in Equation (6) is the reconstruction loss (network loss $L_{n}$), and the second term, the Kullback-Leibler divergence from the Mixture-of-Gaussians (MoG) prior $p(\boldsymbol {z},c)$ to the variational posterior $q(\boldsymbol {z},c|\boldsymbol {x})$, can be considered as the clustering loss $L_{c}$. After the lower bound has been maximized, the cluster assignments can be inferred directly from the MoG prior, as sketched below.
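A minimal sketch of this inference step is given here with placeholder mixture parameters; in VaDE these parameters are learned jointly with the encoder rather than fixed as below:

```python
import math
import torch

def vade_assign(z, pi, mu_c, logvar_c):
    """Cluster assignment from the MoG prior: p(c|z) is proportional to
    pi_c * N(z; mu_c, sigma_c^2 I).

    z: (n, d) latent codes from the trained encoder.
    pi: (k,) mixture weights; mu_c, logvar_c: (k, d) Gaussian parameters.
    """
    var = logvar_c.exp()                                   # (k, d)
    diff2 = (z.unsqueeze(1) - mu_c.unsqueeze(0)) ** 2      # (n, k, d)
    # log N(z; mu_c, sigma_c^2 I), summed over dimensions -> (n, k)
    log_gauss = -0.5 * (logvar_c + diff2 / var + math.log(2 * math.pi)).sum(-1)
    log_post = torch.log(pi) + log_gauss                   # unnormalized log p(c|z)
    return log_post.argmax(dim=1)

k, d = 5, 10
z = torch.randn(100, d)                       # placeholder latent codes
pi = torch.full((k,), 1.0 / k)                # placeholder mixture weights
mu_c, logvar_c = torch.randn(k, d), torch.zeros(k, d)
print(vade_assign(z, pi, mu_c, logvar_c)[:10])
```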
Gaussian Mixture VAE (GMVAE):
GMVAE [55] proposes a similar formulation. It considers the generative model $p(\boldsymbol {x},\boldsymbol {z},\boldsymbol {n},c) = p(\boldsymbol {x}|\boldsymbol {z})p_\beta (\boldsymbol {z}|c,\boldsymbol {n})p (\boldsymbol {n})p(c)$, in which an observed sample $\boldsymbol {x}$ is generated by the following process:\begin{align*} c\sim&~Cat(1/K), \quad \boldsymbol {n} \sim \mathcal {N}(0, \boldsymbol {I})\\ \boldsymbol {z}\sim&~\mathcal {N}(\boldsymbol {\mu }_{c}(\boldsymbol {n}),\boldsymbol { \sigma }_{c}^{2}(\boldsymbol {n}))\\ \boldsymbol {x}\sim&~\mathcal {N}(\boldsymbol {\mu }_{x}(\boldsymbol {z}),\boldsymbol {\sigma }_{x}(\boldsymbol {z})\boldsymbol {I}) {~\text {or }} \mathcal {B}(\boldsymbol {\mu }_{x}(\boldsymbol {z}))\end{align*} Note that GMVAE is slightly more complex than VaDE and performs worse empirically.
Architecture of VAE-based deep clustering algorithms. They impose a GMM prior over the latent code.
D. GAN-Based Deep Clustering
The Generative Adversarial Network (GAN) is another popular deep generative model of recent years. The GAN framework establishes a min-max adversarial game between two neural networks: a generative network (generator) $G$, which maps a noise sample $\boldsymbol {z} \sim p(\boldsymbol {z})$ into the data space, and a discriminative network (discriminator) $D$, which tries to distinguish real samples $\boldsymbol {x} \sim p_{data}$ from generated samples $G(\boldsymbol {z})$. The standard objective is\begin{align*} \min _{G} \max _{D} \mathbb {E}_{\boldsymbol {x} \sim p_{data}}[\log D(\boldsymbol {x})] + \mathbb {E}_{\boldsymbol {z} \sim p(\boldsymbol {z})}[\log (1-D(G(\boldsymbol {z})))] \\\tag{7}{}\end{align*}
Deep Adversarial Clustering (DAC):
DAC [56] is a generative model specific to clustering. It applies the adversarial autoencoder (AAE) [57] to clustering. AAE is similar to VAE: VAE uses a KL divergence penalty to impose a prior distribution on the latent representation, while AAE uses an adversarial training procedure to match the aggregated posterior of the latent representation with the prior distribution. Inspired by the success of VaDE, Harchaoui et al. [56] match the aggregated posterior of the latent representation with a Gaussian mixture distribution. The network architecture is illustrated in Figure 4(a). Its optimizing objective comprises three terms: the traditional autoencoder reconstruction objective, the Gaussian mixture model likelihood, and the adversarial objective, where the reconstruction objective can be considered as the network loss and the other two terms as the clustering loss. Experiments in [56] illustrate that it achieves a result comparable to VaDE on the MNIST dataset.
Categorial Generative Adversarial Network (CatGAN):
CatGAN [58] generalizes the GAN framework to multiple classes. As illustrated in Figure 4(b), it considers the problem of unsupervisedly learning a discriminative classifier
$D$ from a dataset, which classifies the data points into an a priori chosen number of categories instead of only two (fake or real). CatGAN introduces a new two-player game based on the GAN framework: instead of requiring $D$ to predict the probability of $\boldsymbol {x}$ belonging to the real dataset, it enforces $D$ to classify all data points into $k$ classes, while being uncertain of the class assignments for samples generated by $G$. On the other hand, it requires $G$ to generate samples belonging to precisely one out of the $k$ classes, instead of merely generating samples that resemble the dataset. Mathematically, $D$ is trained to be certain about the class of real samples, i.e., to minimize $H[p(c|\boldsymbol {x},D)]$, to be uncertain about generated samples, i.e., to maximize $H[p(c|G(\boldsymbol {z}),D)]$, and to use all $k$ classes equally, i.e., to maximize the marginal entropy $H[p(c|D)]$; $G$, in turn, tries to generate samples that $D$ assigns confidently to a single class. Here $H[\cdot]$ denotes the empirical entropy, $\boldsymbol {x}$ is a real sample, $\boldsymbol {z}$ is the random noise, and $c$ is the class label. The objective function of the discriminator, which we refer to as $\mathcal {L}_{D}$, and that of the generator, which we refer to as $\mathcal {L}_{G}$, can be defined as follows:\begin{align*} \mathcal {L}_{D}=&~\max _{D} H_{\mathcal {X}}[p(c|D)] - \mathbb {E}_{\boldsymbol {x} \sim \mathcal {X}}[H[p(c|\boldsymbol {x},D)]] \\&\,\,+\,\mathbb {E}_{\boldsymbol {z} \sim P(\boldsymbol {z})}[H[p(c|G(\boldsymbol {z}),D)]] \\ \mathcal {L}_{G}=&~\min _{G} - H_{G}[p(c|D)] + \mathbb {E}_{\boldsymbol {z} \sim P(\boldsymbol {z})}[H[p(c|G(\boldsymbol {z}),D)]] \tag{8}\end{align*} where $\mathcal {X}$ is the data distribution. Empirical evaluation shows that CatGAN is superior to k-means and RIM [50] on a synthetic "circles" dataset.
Information Maximizing Generative Adversarial Network (InfoGAN):
InfoGAN [59] is an unsupervised method that learns disentangled representations, and it can also be used for clustering. It can disentangle both discrete and continuous latent factors and scales to complicated datasets. The idea of InfoGAN is to maximize the mutual information [60] between a fixed small subset of the GAN's noise variables and the observation, which is relatively straightforward but surprisingly effective. To be specific, as illustrated in Figure 4(c), it decomposes the input noise vector into two parts: incompressible noise $\boldsymbol {z}$ and latent code $\boldsymbol {c}$, so the form of the generator becomes $G(\boldsymbol {z},\boldsymbol {c})$. To avoid trivial codes, it uses an information-theoretic regularization to ensure that the mutual information between the latent code $\boldsymbol {c}$ and the generator distribution $G(\boldsymbol {z},\boldsymbol {c})$ is high. The optimizing objective of InfoGAN thus becomes the following information-regularized minimax game:\begin{equation*} \min _{G} \max _{D} V_{I}(D,G) = V(D,G) - \lambda I(\boldsymbol {c};G(\boldsymbol {z},\boldsymbol {c}))\tag{9}\end{equation*} where $V(D,G)$ denotes the objective of the standard GAN and $I(\boldsymbol {c}; G(\boldsymbol {z},\boldsymbol {c}))$ is the information-theoretic regularization. When the latent codes are modeled with one categorical code taking $k$ values plus several continuous codes, InfoGAN has the function of clustering data points into $k$ clusters. Experiments in [59] show that it can achieve an accuracy of 0.95 on the MNIST dataset.
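In practice, the mutual information term is commonly maximized through a variational lower bound using an auxiliary network that predicts the code from generated samples; for a categorical code this reduces to a cross-entropy term, as in the hedged sketch below, where the generator `G` and auxiliary head `Q` are toy placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

k, d_noise, d_x = 10, 62, 784
# Placeholder generator and auxiliary Q head; real InfoGAN uses convolutional
# networks and shares most of Q with the discriminator.
G = nn.Sequential(nn.Linear(d_noise + k, 256), nn.ReLU(), nn.Linear(256, d_x))
Q = nn.Sequential(nn.Linear(d_x, 256), nn.ReLU(), nn.Linear(256, k))

def info_loss(batch_size=64):
    """Variational lower bound on I(c; G(z, c)) for a categorical code:
    cross-entropy between the sampled code and Q's prediction on G(z, c)."""
    c = torch.randint(0, k, (batch_size,))                 # sampled categorical code
    z = torch.randn(batch_size, d_noise)                   # incompressible noise
    x_fake = G(torch.cat([z, F.one_hot(c, k).float()], dim=1))
    logits = Q(x_fake)
    return F.cross_entropy(logits, c)   # minimized jointly by G and Q (the -lambda*I term of Eq. (9))

loss = info_loss()
loss.backward()
# At test time, Q(x).argmax(dim=1) can serve as the cluster assignment of a sample x.
```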
E. Summary of Deep Clustering Algorithms
In this part, we present an overall view of deep clustering algorithms. Specifically, we compare the four categories of algorithms in terms of loss function, advantages, disadvantages and computational complexity, as shown in Table 4. In regard to the loss function, apart from CDNN-based algorithms, the other three categories jointly optimize both the clustering loss $L_{c}$ and the network loss $L_{n}$, while CDNN-based algorithms optimize only the clustering loss.
AE-based DC algorithms are the most common, as the autoencoder can be combined with almost all clustering algorithms. The reconstruction loss of the autoencoder ensures that the network learns a feasible representation and avoids trivial solutions. However, due to the symmetric architecture, the network depth is limited for computational feasibility. Besides, the hyper-parameter that balances the two losses requires extra fine-tuning. In contrast, CDNN-based DC algorithms only optimize the clustering loss. Therefore, the depth of the network is unlimited, and supervisedly pre-trained architectures can be used to extract more discriminative features, so they are capable of clustering large-scale image datasets. However, without the reconstruction loss, they run the risk of learning a corrupted feature representation, so the clustering loss should be well designed. VAE-based and GAN-based DC algorithms are generative DC techniques, as they are capable of generating samples from the obtained clusters. VAE-based algorithms have a good theoretical guarantee because they maximize a variational lower bound on the marginal likelihood of the data, but they suffer from high computational complexity. GAN-based algorithms impose a multi-class prior on the general GAN framework. They are more flexible and diverse than VAE-based ones, and some of them aim at learning interpretable representations and just take the clustering task as a specific case. The shortcomings of GAN-based algorithms are similar to those of GANs in general, e.g., mode collapse and slow convergence.
The computational complexity of deep clustering varies a lot. For AE-based and CDNN-based algorithms, the computational cost is highly related to the clustering loss. For example, methods built on the k-means loss (e.g., DCN) are relatively cheap, whereas methods that require an affinity or coefficient matrix over all samples (e.g., JULE and DSC-Nets) incur computation and memory costs that grow rapidly with the dataset size. For VAE-based and GAN-based algorithms, the cost of training the generative model itself dominates and is generally higher.
Future Opportunities and Conclusions
A. Future Opportunities of Deep Clustering
Based on the aforementioned literature review and analysis, we argue that the following perspectives of deep clustering are worth being studied further:
Theoretical exploration
Although jointly optimizing the network and the clustering model significantly boosts the clustering performance, there is no theoretical analysis explaining why it works and how to further improve the performance. Therefore, it is meaningful to explore the theoretical guarantees of deep clustering, in order to guide further research in this area.
Other network architectures
Existing deep clustering algorithms mostly focus on image datasets, while few attempts have been made on sequential data, e.g., documents. To this effect, it is worthwhile to explore the feasibility of combining other network architectures with clustering, e.g., recurrent neural networks [61].
Tricks in deep learning
It is viable to introduce some tricks or techniques used in supervised deep learning to deep clustering, e.g., data augmentation and specific regularizations. A concrete example is augmenting the data with noise to improve the robustness of clustering methods.
Other clustering tasks
Combining deep neural networks with diverse clustering tasks, e.g., multi-task clustering and self-taught clustering (transfer clustering) [62], is another interesting research direction. To the best of our knowledge, these tasks have not yet exploited the powerful non-linear transformation of neural networks.
B. Conclusion Remarks
As deep learning has demonstrated a powerful ability for feature extraction in many practical applications, it is natural to combine clustering algorithms with deep learning for better clustering results. In this paper, we give a systematic survey of deep clustering, which has been a popular research topic in recent years. A taxonomy of deep clustering is proposed from the perspective of network architectures, and the representative algorithms are presented in detail. The taxonomy explicitly shows the characteristics, advantages and disadvantages of different deep clustering algorithms. Furthermore, we provide several interesting future directions of deep clustering. We hope this work can serve as a valuable reference for researchers who are interested in both deep learning and clustering.