Application of Gene Shaving and Mixture Models to Cluster Microarray Gene Expression Data

By K-A. Do, G.J. McLachlan, R. Bean and S. Wen


Researchers frequently face the analysis of microarray data in which a relatively large number of genes is measured on a small number of tissue samples. We examine the application of two statistical methods for clustering such microarray expression data: EMMIX-GENE and GeneClust. EMMIX-GENE is a mixture-model-based clustering approach, designed primarily to cluster tissue samples on the basis of the genes. GeneClust is an implementation of the gene shaving methodology, motivated by research to identify distinct sets of genes whose variation in expression could be related to a biological property of the tissue samples. We illustrate the use of these two methods in the analysis of Affymetrix oligonucleotide arrays from two well-known data sets: colon tissue samples with and without tumors, and tumor tissue samples from patients with leukemia. Although the two approaches were developed from different perspectives, the results demonstrate a clear correspondence between the gene clusters produced by GeneClust and EMMIX-GENE for the colon tissue data. For ribosomal proteins and smooth muscle genes in the colon data set, both methods can classify genes into co-regulated families. Tissue types (tumor and normal) can also be separated on the basis of subtle distributed patterns of genes. Applied to the leukemia tissue data, both methods produce a division of tissues corresponding closely to the external classification into acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL). We also identify genes specific to the subgroup of ALL-Tcell samples.
Overall, we find that the gene shaving method produces gene clusters at great speed, allows variable cluster sizes, can incorporate partial or full supervision, and finds clusters of genes in which the gene expression varies greatly over the tissue samples while maintaining a high level of coherence between the gene expression profiles. The intent of the EMMIX-GENE method is to cluster the tissue samples. It performs a filtering step that results in a subset of relevant genes, followed by gene clustering and then tissue clustering, and it performs favorably in the accuracy of its ranking of the clusters produced.
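
The shaving idea summarized above can be illustrated with a minimal sketch. It is not the GeneClust implementation: the function names, the `alpha` shaving fraction, and the use of NumPy are assumptions made for illustration. The sketch also omits the gap-statistic choice of cluster size, the sign-flipping of negatively correlated genes, the orthogonalization used to find further clusters, and any supervision. It repeatedly removes the genes least aligned with the leading principal component, giving a nested sequence of candidate clusters, and scores a cluster by the ratio of between-sample variance to total variance.

```python
import numpy as np

def shave_sequence(X, alpha=0.1):
    """Return nested gene-index sets from repeated shaving (illustrative only).

    X     : genes x samples expression matrix.
    alpha : fraction of genes dropped per iteration (assumed default).
    """
    # Center each gene (row) across the tissue samples.
    Xc = X - X.mean(axis=1, keepdims=True)
    idx = np.arange(Xc.shape[0])
    nested = [idx.copy()]
    while idx.size > 1:
        sub = Xc[idx]
        # Leading right singular vector: first principal component over samples.
        _, _, vt = np.linalg.svd(sub, full_matrices=False)
        scores = np.abs(sub @ vt[0])
        # Shave off the genes least aligned with that component (at least one).
        n_drop = max(1, int(np.floor(alpha * idx.size)))
        keep = np.argsort(scores)[n_drop:]
        idx = np.sort(idx[keep])
        nested.append(idx.copy())
    return nested

def coherence_r2(X, idx):
    """Between-sample variance of the cluster mean as a share of total variance."""
    sub = X[idx] - X[idx].mean(axis=1, keepdims=True)
    mean_profile = sub.mean(axis=0)               # average gene over the cluster
    between = np.var(mean_profile)                # variation across samples
    within = np.mean((sub - mean_profile) ** 2)   # spread of genes around the mean
    return between / (between + within)
```

A cluster whose mean profile varies strongly across samples while its member genes stay close to that mean scores near 1 on `coherence_r2`, which is the trade-off described in the abstract.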

Topics: Systems Biology Special Issue
Publisher: Libertas Academica
Provided by: PubMed Central

