I. Introduction and Related Work
A central goal of microarray data analysis is the identification of small subsets of informative genes with disease-specific expression profiles. While the problem of selecting genes for known disease types has been studied widely in the literature, the discovery of putative subtypes of diseases is still a challenging task. From a machine-learning viewpoint, this class-discovery problem can be formalized as an unsupervised clustering problem with simultaneous feature selection. Early approaches to this problem [1]–[3] were semi-automatic procedures that combined clustering techniques with human intervention for selecting “relevant” genes. Several shortcomings of such approaches, and some methods for overcoming them, have been discussed in the literature, e.g., [4]–[6]. The common strategy of most of these approaches is a (possibly iterated) stepwise procedure, in which the first step extracts a set of hypothetical partitions (the clustering step) and the second step scores genes for relevance (the relevance determination step). A possible shortcoming of these approaches is that the two steps are combined in an “ad hoc” manner: usually the relevance determination mechanism does not take into account the properties of the clustering method used. Rather, it attempts to find predictive subsets of genes by means of simple empirical statistical measures, such as t-test scores or correlation coefficients. These scoring measures treat the genes as independent objects, ignoring both the biological knowledge that gene expression levels are correlated and the inherent capability of many clustering methods to handle such correlations.
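To make the stepwise strategy concrete, the following minimal Python sketch runs one pass of a clustering step followed by a relevance determination step. It is an illustration only and does not reproduce the procedure of any cited work: the choice of two-cluster k-means, the Welch t-statistic as the relevance score, the function names (`two_means`, `welch_t`, `score_genes`), and the toy expression matrix are all assumptions introduced here.

```python
import statistics
from math import copysign, inf, sqrt


def sq_dist(a, b):
    """Squared Euclidean distance between two expression profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(samples, n_iter=20):
    """Clustering step: plain k-means with k = 2 (assumed stand-in method).

    Initialization is deterministic: the first sample and the sample
    farthest from it serve as the initial centers.
    """
    centers = [samples[0], max(samples, key=lambda s: sq_dist(s, samples[0]))]
    labels = []
    for _ in range(n_iter):
        new_labels = [
            0 if sq_dist(s, centers[0]) <= sq_dist(s, centers[1]) else 1
            for s in samples
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for k in (0, 1):
            members = [s for s, lab in zip(samples, labels) if lab == k]
            if members:  # keep the old center if a cluster becomes empty
                centers[k] = [statistics.fmean(col) for col in zip(*members)]
    return labels


def welch_t(a, b):
    """Welch t-statistic for two groups (each needs at least two values)."""
    diff = statistics.fmean(a) - statistics.fmean(b)
    se = sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    if se == 0.0:
        return 0.0 if diff == 0.0 else copysign(inf, diff)
    return diff / se


def score_genes(samples, labels):
    """Relevance step: score every gene separately against the partition."""
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        group0 = [s[g] for s, lab in zip(samples, labels) if lab == 0]
        group1 = [s[g] for s, lab in zip(samples, labels) if lab == 1]
        scores.append(welch_t(group0, group1))
    return scores


# Toy data (hypothetical): rows are samples, columns are genes.
# Only gene 0 separates the two groups of samples.
samples = [
    [1.0, 2.0, 3.1],
    [1.2, 2.1, 2.9],
    [0.9, 1.9, 3.0],
    [5.0, 2.0, 3.0],
    [5.1, 2.1, 3.1],
    [4.8, 1.9, 2.9],
]

labels = two_means(samples)
scores = score_genes(samples, labels)
ranking = sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)
print("partition:", labels)
print("genes ranked by |t|:", ranking)
```

Note that `score_genes` evaluates each gene in isolation and never consults how the partition was obtained, which is exactly the decoupling between the two steps criticized above.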