A central aim of many statistical analyses of microarray data is to cluster genes according to their similarity in expression behavior. In this paper, we perform clustering based on the likelihood fit of a multivariate normal mixture. This approach has several advantages over standard partitioning or hierarchical algorithms: it has an unambiguous inferential characterization, it produces soft partitions through membership probabilities, and it allows one to model component mean vectors and covariance structures and to manage anomalous and missing observations in a natural way. In particular, our mixture-based approach allows us to (i) model component mean vectors through linear reparameterizations, (ii) model component covariance structures through constraints on a special decomposition, (iii) handle outliers through the introduction of a contamination term (uniform on the hypervolume of the data), and (iv) impute missing values. The maximum likelihood estimation of parameters and membership probabilities, and the imputation of missing values, are accomplished through the EM algorithm. For model selection, we employ the classical Bayesian Information Criterion, pragmatically combined with consideration of other features, such as overall membership strength, within-cluster dispersion, and weight of the contamination term. To illustrate our approach, we analyze publicly available data on the reaction of yeast cells to heat shocks. The results of our analysis suggest two alternative clustering models, which provide two different and interesting interpretations of the structure in the data.
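The core of the approach described above, a normal mixture with a uniform contamination term fitted by EM and scored with BIC, can be sketched in a few lines. The following is a minimal illustration and not the authors' implementation: the function names (`log_gauss`, `e_step`, `fit_mixture_with_noise`), the random initialization, the small ridge added to each covariance matrix, and the BIC sign convention (smaller is better) are all assumptions made here. The sketch uses unconstrained means and covariances, takes the bounding box of the data as the hypervolume, assumes at least two variables with no constant column, and omits both the covariance constraints and the imputation of missing values.

```python
import numpy as np

def log_gauss(X, mean, cov):
    """Log density of a multivariate normal at each row of X."""
    d = X.shape[1]
    L = np.linalg.cholesky(cov)
    z = np.linalg.solve(L, (X - mean).T)  # whitened residuals, shape (d, n)
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (d * np.log(2 * np.pi) + log_det + np.sum(z ** 2, axis=0))

def e_step(X, weights, means, covs, log_unif):
    """Membership probabilities (last column: contamination) and log-likelihood."""
    n, k = X.shape[0], len(means)
    log_p = np.empty((n, k + 1))
    for j in range(k):
        log_p[:, j] = np.log(weights[j]) + log_gauss(X, means[j], covs[j])
    log_p[:, k] = np.log(weights[k]) + log_unif
    log_norm = np.logaddexp.reduce(log_p, axis=1)
    return np.exp(log_p - log_norm[:, None]), log_norm.sum()

def fit_mixture_with_noise(X, k, n_iter=200, ridge=1e-6, seed=0):
    """EM for k normal components plus one uniform contamination component."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumption: uniform density on the bounding box of the observed data.
    log_unif = -np.sum(np.log(X.max(axis=0) - X.min(axis=0)))
    # Assumption: initialize means at random data points, covariances at the pooled one.
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.stack([np.atleast_2d(np.cov(X.T)) + ridge * np.eye(d)] * k)
    weights = np.full(k + 1, 1.0 / (k + 1))
    for _ in range(n_iter):
        resp, _ = e_step(X, weights, means, covs, log_unif)
        # M-step: weighted proportions, means, and covariances.
        nk = resp.sum(axis=0)
        weights = np.maximum(nk / n, 1e-300)  # floor avoids log(0)
        weights /= weights.sum()
        for j in range(k):
            denom = max(nk[j], 1e-300)
            means[j] = resp[:, j] @ X / denom
            diff = X - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / denom + ridge * np.eye(d)
    resp, loglik = e_step(X, weights, means, covs, log_unif)
    # Free parameters: k means, k full covariances, k free mixing weights.
    n_params = k * d + k * d * (d + 1) // 2 + k
    bic = -2.0 * loglik + n_params * np.log(n)
    return {"weights": weights, "means": means, "covs": covs,
            "resp": resp, "loglik": loglik, "bic": bic}
```

In this sketch, each row of `resp` gives the soft membership of one gene, with the last column holding the probability of belonging to the contamination term, and the last entry of `weights` is the estimated weight of that term. Fitting the function for several values of `k` and comparing `bic` alongside those quantities mirrors, in a simplified way, the model selection strategy described above.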