I. Introduction
The estimation of the probability density function (pdf) from observed data samples is a fundamental problem in many machine learning and pattern recognition applications [1]–[3]. The Parzen window (PW) estimate is a simple yet remarkably accurate nonparametric density estimation technique [2]–[4]. A general and powerful approach to pdf estimation is the finite mixture model [5]. The finite mixture model includes the PW estimate as a special case: the PW estimate is a mixture whose number of components equals the number of training samples and whose mixing weights are all equal (this relationship is formalized below). A disadvantage of the PW estimate is the high computational cost of evaluating the density at a new data point when the training set is very large. By using a much smaller number of mixture components, the finite mixture model can therefore be regarded as a condensed representation of the data [5]. Note, however, that the mixing weights of the finite mixture model must be determined through parametric optimization, rather than simply being set equal as in the PW estimate.

Much of the work on fitting a finite mixture model assumes a fixed number of components and relies on the expectation–maximization (EM) algorithm [5]. This approach has two disadvantages: 1) the predetermined model size may not suit the data, and 2) the convergence of EM is generally slow. It is therefore desirable to develop new methods for fitting a finite mixture model that can efficiently infer a minimal number of components from the data.
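As a minimal sketch of this relationship, assume a kernel $K_{\rho}(\cdot)$ with width parameter $\rho$ and training samples $\{\mathbf{x}_k\}_{k=1}^{N}$; this notation is illustrative rather than drawn from [1]–[5]. The PW estimate is

$$\hat{p}(\mathbf{x}) = \frac{1}{N} \sum_{k=1}^{N} K_{\rho}(\mathbf{x} - \mathbf{x}_k),$$

whereas a finite mixture model with $M$ components takes the form

$$\hat{p}(\mathbf{x}) = \sum_{m=1}^{M} \beta_m K_{\rho_m}(\mathbf{x} - \mathbf{c}_m), \qquad \beta_m \ge 0, \quad \sum_{m=1}^{M} \beta_m = 1.$$

Setting $M = N$, $\mathbf{c}_m = \mathbf{x}_m$, and $\beta_m = 1/N$ recovers the PW estimate. Evaluating either model at a new point costs one kernel evaluation per component, which is why choosing $M \ll N$ yields the condensed representation noted above.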