Evaluation of clustering algorithms for gene expression data

Author: A Ruepp
I Gat-Viks
J Quackenbush
JA Hartigan
JD Banfield
JT Taylor
L Kaufman
MC Abba
PJ Rousseeuw
R Shamir
S Chu
S Datta
S Dudoit
Somnath Datta
Susmita Datta
T Kohonen
WN Venables
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Cluster analysis is an integral part of high-dimensional data analysis. In the context of large-scale gene expression data, a filtered set of genes is grouped according to expression profiles using one of the many clustering algorithms available in the statistics and machine learning literature. A closely related problem is selecting, from this long list of algorithms, one that is "optimal" in some sense.

RESULTS: In this paper, we propose two validation measures, each with two parts: one measuring the statistical consistency (stability) of the clusters produced and the other representing their biological functional congruence. Smaller values of these indices indicate better performance for a clustering algorithm. We illustrate this approach using two case studies with publicly available gene expression data sets: one involving SAGE data from breast cancer patients and the other involving time-course cDNA microarray data on yeast. Six well-known clustering algorithms (UPGMA, K-Means, Diana, Fanny, Model-Based and SOM) were evaluated.

CONCLUSION: No single clustering algorithm may be best suited for clustering genes into functional groups via expression profiles for all data sets. The validation measures introduced in this paper can aid in selecting an optimal algorithm, for a given data set, from a collection of available clustering algorithms.
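
To make the idea of a stability index concrete, here is a minimal sketch of one possible consistency score. It is an illustrative assumption, not the measure defined in the paper: the function name `average_non_overlap` and the scoring rule are hypothetical. It compares two label assignments of the same genes (for example, one from the full data and one from data with a sample removed) and, as in the abstract, a smaller value means more consistent clusters.

```python
def average_non_overlap(labels_a, labels_b):
    """Mean fraction of each item's cluster under labels_a that is NOT
    shared with the same item's cluster under labels_b.

    Hypothetical stability score: 0.0 means the two clusterings agree
    exactly (up to renaming of cluster labels); larger is less stable.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    if not labels_a:
        raise ValueError("labelings must not be empty")

    # Collect the member set of every cluster in each labeling.
    clusters_a, clusters_b = {}, {}
    for index, (a, b) in enumerate(zip(labels_a, labels_b)):
        clusters_a.setdefault(a, set()).add(index)
        clusters_b.setdefault(b, set()).add(index)

    total = 0.0
    for a, b in zip(labels_a, labels_b):
        members_a = clusters_a[a]
        members_b = clusters_b[b]
        # Share of the item's original cluster mates lost in the second run.
        total += 1.0 - len(members_a & members_b) / len(members_a)
    return total / len(labels_a)


# Relabeled but identical partition: perfectly stable.
same = average_non_overlap([0, 0, 1, 1], [1, 1, 0, 0])
# Every cluster is split in half by the second labeling.
split = average_non_overlap([0, 0, 1, 1], [0, 1, 0, 1])
```

In practice such a score would be averaged over many perturbed data sets and computed separately for each candidate algorithm, so that the algorithms can be ranked on the same data.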

Crossref

Springer - Publisher Connector

PubMed Central

Consensus clustering and functional interpretation of gene-expression data

Author: Kellam P.
Liu X.
Martin Nigel
Orengo C.A.
Swift S.
Tucker A.
Vinciotti V.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2004

Microarray analysis using clustering algorithms can suffer from a lack of inter-method consistency in assigning related gene-expression profiles to clusters. Obtaining a consensus set of clusters from a number of clustering methods should improve confidence in gene-expression analysis. Here we introduce consensus clustering, which provides such an advantage. When coupled with a statistically based gene functional analysis, our method allowed the identification of novel genes regulated by NFκB and the unfolded protein response in certain B-cell lymphomas.
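
The general idea of combining several clusterings can be sketched with a pairwise agreement count. This is a simplified illustration under stated assumptions, not the procedure from the paper: the helper names `agreement_matrix` and `consensus_groups`, the threshold `min_agree`, and the connected-component grouping are all hypothetical choices.

```python
def agreement_matrix(labelings):
    """Count, for every pair of items, how many methods co-cluster them."""
    n = len(labelings[0])
    matrix = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1
    return matrix


def consensus_groups(labelings, min_agree):
    """Group items linked by chains of pairs that at least `min_agree`
    methods placed in the same cluster (connected components)."""
    n = len(labelings[0])
    matrix = agreement_matrix(labelings)
    seen = [False] * n
    groups = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if not seen[j] and matrix[i][j] >= min_agree:
                    seen[j] = True
                    stack.append(j)
        groups.append(sorted(component))
    return groups


# Three hypothetical methods clustering the same five genes.
methods = [
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 2],
]
strict = consensus_groups(methods, min_agree=3)  # [[0, 1], [2], [3], [4]]
loose = consensus_groups(methods, min_agree=2)   # [[0, 1], [2, 3, 4]]
```

Requiring all three methods to agree keeps only genes 0 and 1 together, while a majority threshold also merges genes 2, 3 and 4. Note that connected components are transitive, so a lower threshold can chain together items that no single method grouped directly.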

Springer - Publisher Connector

UCL Discovery

PubMed Central

Birkbeck Institutional Research Online

Spiral - Imperial College Digital Repository

Brunel University Research Archive

A data driven equivariant approach to constrained Gaussian mixture modeling

Author: Di Mari Roberto
Gattone Stefano Antonio
Rocci Roberto
Publication date: 25/10/2016

Maximum likelihood estimation of Gaussian mixture models with different class-specific covariance matrices is known to be problematic. This is due to the unboundedness of the likelihood, together with the presence of spurious maximizers. Existing methods to bypass this obstacle are based on the fact that unboundedness is avoided if the eigenvalues of the covariance matrices are bounded away from zero. This can be done by imposing constraints on the covariance matrices, i.e. by incorporating a priori information on the covariance structure of the mixture components. The present work introduces a constrained equivariant approach, where the class-conditional covariance matrices are shrunk towards a pre-specified matrix Psi. Data-driven choices of the matrix Psi, for when a priori information is not available, and the optimal amount of shrinkage are investigated. The effectiveness of the proposal is evaluated on the basis of a simulation study and an empirical example.
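
The effect of shrinking a covariance estimate towards a target can be illustrated with a simple convex combination. This is only a sketch of the general shrinkage idea, not the constrained estimator from the paper: the function `shrink_covariance`, the parameter `amount`, and the choice of the identity matrix as Psi are assumptions made for the example.

```python
def shrink_covariance(sample_cov, target, amount):
    """Return (1 - amount) * sample_cov + amount * target, elementwise.

    `amount` in [0, 1] is a hypothetical shrinkage weight: 0 keeps the
    sample covariance, 1 replaces it with the target matrix (Psi).
    """
    if not 0.0 <= amount <= 1.0:
        raise ValueError("amount must lie in [0, 1]")
    return [
        [(1.0 - amount) * s + amount * t for s, t in zip(row_s, row_t)]
        for row_s, row_t in zip(sample_cov, target)
    ]


def det2(matrix):
    """Determinant of a 2x2 matrix, used here to check for singularity."""
    return matrix[0][0] * matrix[1][1] - matrix[0][1] * matrix[1][0]


# A degenerate (singular) class covariance, the kind of estimate that
# makes the Gaussian likelihood unbounded.
singular = [[1.0, 1.0], [1.0, 1.0]]
psi = [[1.0, 0.0], [0.0, 1.0]]  # assumed target: identity matrix

shrunk = shrink_covariance(singular, psi, amount=0.5)
# shrunk == [[1.0, 0.5], [0.5, 1.0]]; det2(shrunk) == 0.75
```

With a positive definite target and any positive shrinkage weight, the result of this combination is itself positive definite, so its eigenvalues stay away from zero, which is the property the abstract identifies as preventing unboundedness.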

arXiv.org e-Print Archive

ART

Archivio della ricerca- Università di Roma La Sapienza

Toeplitz Inverse Covariance-Based Clustering of Multivariate Time Series Data

Author: Banerjee O.
Berndt D. J.
Boyd S.
Cover T. M.
Cuturi M.
Das G.
Gray R. M.
Hsieh C.-J.
Lauritzen S. L.
Mohan K.
Smyth P.
Wytock M.
Publication date: 14/05/2018

Subsequence clustering of multivariate time series is a useful tool for discovering repeated patterns in temporal data. Once these patterns have been discovered, seemingly complicated datasets can be interpreted as a temporal sequence of only a small number of states, or clusters. For example, raw sensor data from a fitness-tracking application can be expressed as a timeline of a select few actions (e.g., walking, sitting, running). However, discovering these patterns is challenging because it requires simultaneous segmentation and clustering of the time series. Furthermore, interpreting the resulting clusters is difficult, especially when the data is high-dimensional. Here we propose a new method of model-based clustering, which we call Toeplitz Inverse Covariance-based Clustering (TICC). Each cluster in the TICC method is defined by a correlation network, or Markov random field (MRF), characterizing the interdependencies between different observations in a typical subsequence of that cluster. Based on this graphical representation, TICC simultaneously segments and clusters the time series data. We solve the TICC problem through alternating minimization, using a variation of the expectation maximization (EM) algorithm. We derive closed-form solutions to efficiently solve the two resulting subproblems in a scalable way, through dynamic programming and the alternating direction method of multipliers (ADMM), respectively. We validate our approach by comparing TICC to several state-of-the-art baselines in a series of synthetic experiments, and we then demonstrate on an automobile sensor dataset how TICC can be used to learn interpretable clusters in real-world scenarios.

Comment: This revised version fixes two small typos in the published version.
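
The dynamic-programming half of such an alternating scheme can be sketched as a shortest-path assignment with a switching penalty. This is a minimal illustration, not the TICC implementation: the function `assign_with_switch_penalty`, the precomputed `costs` table (standing in for per-cluster negative log-likelihoods), and the penalty `beta` are assumptions for the example.

```python
def assign_with_switch_penalty(costs, beta):
    """Assign each time step to a cluster, minimizing the summed
    per-step costs plus `beta` for every change of cluster.

    costs[t][k] is the (assumed, precomputed) cost of placing time
    step t in cluster k. Returns the list of chosen cluster indices.
    """
    n_clusters = len(costs[0])
    best = list(costs[0])  # best[k]: cheapest path ending in cluster k
    back = []              # back[t-1][k]: predecessor cluster at t-1

    for t in range(1, len(costs)):
        cheapest_prev = min(range(n_clusters), key=lambda k: best[k])
        new_best, pointers = [], []
        for k in range(n_clusters):
            stay = best[k]
            switch = best[cheapest_prev] + beta
            if stay <= switch:
                new_best.append(stay + costs[t][k])
                pointers.append(k)
            else:
                new_best.append(switch + costs[t][k])
                pointers.append(cheapest_prev)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path backwards from the final time step.
    k = min(range(n_clusters), key=lambda j: best[j])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return path


# Two clusters, six time steps; step 2 is a brief, weak deviation.
costs = [[0, 1], [0, 1], [0.6, 0.4], [0, 1], [1, 0], [1, 0]]
smooth = assign_with_switch_penalty(costs, beta=1.0)  # [0, 0, 0, 0, 1, 1]
greedy = assign_with_switch_penalty(costs, beta=0.0)  # [0, 0, 1, 0, 1, 1]
```

With `beta=0.0` every step simply takes its cheapest cluster, so the weak deviation at step 2 causes two extra switches. A positive penalty suppresses it while still allowing the sustained change at step 4, which is how a penalty of this kind yields contiguous segments.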

arXiv.org e-Print Archive

Crossref