I. Introduction
The broad goal of unsupervised learning is to extract salient features and discover structural information from collected data. Extracting reliable features and their latent structure is challenging when the documents are abundant and heterogeneous, since such collections tend to be redundant, noisy, ambiguous, mismatched and ill-posed. We aim to construct a flexible latent variable model that accommodates these heterogeneous conditions and annotates the observed documents for prediction of future documents. In the past decade, unsupervised learning via probabilistic topic models [1], [2] has been successfully developed for document categorization [3], collaborative filtering, topic discovery [2], audio classification [4], speech recognition [5], document summarization [6] and other natural language systems [7]. The latent features, or semantic topics, are learned from bags of words that are semantically similar across different documents. Conventionally, the parametric topic model based on latent Dirichlet allocation (LDA) [3] is constructed for modeling a set of documents. Each observed word is drawn from a multinomial distribution whose parameters are selected by a per-word topic label; the topic label is in turn drawn from the document's multinomial topic proportions, which follow a Dirichlet prior. The graphical representation of LDA is depicted in Fig. 1(a). LDA is known as a finite-dimensional mixture representation for documents, which assumes that
1) the number of topics is fixed;
2) topics within and across documents are independent;
3) all topics are used to represent a target document, even though some of them are irrelevant to it.

Fig. 1. Graphical representation for (a) LDA, (b) hDP and (c) the hDP topic model; plates over words, documents and topics are shown in blue, yellow and green, respectively.
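To make the generative view concrete, the following minimal sketch samples one document under these LDA assumptions. It is an illustrative reconstruction rather than code from the paper; the names K, V, alpha, beta and n_words, and the NumPy setting, are our own choices.

    import numpy as np

    rng = np.random.default_rng(0)

    K, V = 4, 100                 # fixed number of topics and vocabulary size
    alpha = np.full(K, 0.5)       # Dirichlet prior over per-document topic proportions
    beta = rng.dirichlet(np.ones(V), size=K)   # K topic-word multinomial parameters

    def generate_document(n_words):
        """Sample a document: theta ~ Dirichlet(alpha); for each word,
        a topic label z ~ Multinomial(theta) selects the row beta[z],
        from which the word w ~ Multinomial(beta[z]) is drawn."""
        theta = rng.dirichlet(alpha)                    # topic proportions of this document
        topics = rng.choice(K, size=n_words, p=theta)   # per-word topic labels z
        words = np.array([rng.choice(V, p=beta[z]) for z in topics])
        return words, topics

    words, topics = generate_document(50)

Note that K is fixed before any document is observed, which is precisely the restriction in assumption 1) that nonparametric constructions such as the hDP in Fig. 1(b) and (c) are designed to relax.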