Techniques for clustering gene expression data
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications that recognise these limitations and implement procedures to overcome them. It provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.
A temporal precedence based clustering method for gene expression microarray data
Background: Time-course microarray experiments can produce useful data which can help in understanding the underlying dynamics of the system. Clustering is an important stage in microarray data analysis where the data is grouped together according to certain characteristics. The majority of clustering techniques are based on distance or visual similarity measures, which may not be suitable for clustering temporal microarray data where the sequential nature of time is important. We present a Granger causality based technique to cluster temporal microarray gene expression data, which measures the interdependence between two time series by statistically testing whether one time series can be used to forecast the other.
Results: A gene-association matrix is constructed by testing temporal relationships between pairs of genes using the Granger causality test. The association matrix is further analyzed using a graph-theoretic technique to detect highly connected components representing interesting biological modules. We test our approach on synthesized datasets and real biological datasets obtained for Arabidopsis thaliana. We show the effectiveness of our approach by analyzing the results using the existing biological literature. We also report interesting structural properties of the association network commonly desired in any biological system.
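The pipeline described above (pairwise temporal tests, then a graph analysis of the resulting association matrix) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a lag-1 Granger-style F statistic, a hypothetical cutoff `F_THRESHOLD` in place of a proper significance test, plain connected components in place of the highly connected components the abstract mentions, and synthetic series `x`, `y`, `z` standing in for gene expression profiles.

```python
import random

def solve(a, b):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - s) / m[i][i]
    return coef

def rss(rows, y):
    """Residual sum of squares of the least-squares fit of y on rows."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)
    return sum((t - sum(c * v for c, v in zip(coef, r))) ** 2
               for r, t in zip(rows, y))

def granger_f(cause, effect):
    """F statistic: does the lagged cause improve a lag-1 forecast of effect?"""
    target = effect[1:]
    restricted = [[1.0, e] for e in effect[:-1]]
    full = [[1.0, e, c] for e, c in zip(effect[:-1], cause[:-1])]
    rss_r, rss_f = rss(restricted, target), rss(full, target)
    dof = len(target) - 3  # observations minus parameters of the full model
    return (rss_r - rss_f) / (rss_f / dof)

def components(n, edges):
    """Connected components of an undirected graph (depth-first search)."""
    adj = {i: set() for i in range(n)}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, out = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adj[node] - seen:
                seen.add(nb)
                stack.append(nb)
        out.append(sorted(comp))
    return out

# Synthetic stand-ins: y follows x with a one-step delay, z is unrelated.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(60)]
y = [0.0] + [x[t - 1] + rng.gauss(0, 0.01) for t in range(1, 60)]
z = [rng.gauss(0, 1) for _ in range(60)]
genes = [x, y, z]

F_THRESHOLD = 50.0  # hypothetical cutoff; a real analysis would use a p-value
edges = [(i, j) for i in range(3) for j in range(3)
         if i != j and granger_f(genes[i], genes[j]) > F_THRESHOLD]
modules = components(3, edges)
print(modules)
```

In this toy setup only the pair (x, y) is expected to be linked, so the grouping separates the unrelated series z from the other two. A fuller version would choose the lag order from the data and correct for multiple testing across gene pairs.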
Conclusions: Our experiments on synthesized and real microarray datasets show that our approach produces encouraging results. The method is simple in implementation and is statistically traceable at each step. The method can produce sets of functionally related genes which can be further used for reverse-engineering of gene circuits.
Methods of Hierarchical Clustering
We survey agglomerative hierarchical clustering algorithms and discuss
efficient implementations that are available in R and other software
environments. We look at hierarchical self-organizing maps, and mixture models.
We review grid-based clustering, focusing on hierarchical density-based
approaches. Finally we describe a recently developed very efficient (linear
time) hierarchical clustering algorithm, which can also be viewed as a
hierarchical grid-based algorithm.
Comment: 21 pages, 2 figures, 1 table, 69 references
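As a rough illustration of the agglomerative idea surveyed above, the sketch below repeatedly merges the two closest clusters until a requested number remains. It is a naive single-linkage version written for clarity, not one of the efficient implementations the survey discusses; the function name `single_linkage` and the sample points are hypothetical.

```python
from math import dist

def single_linkage(points, num_clusters):
    """Naive agglomerative clustering with single linkage.

    Returns the final clusters (lists of point indices) and the merge
    history as (cluster_a, cluster_b, distance) tuples.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > num_clusters:
        best = None
        # Find the pair of clusters with the smallest closest-point distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters, merges

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0)]
clusters, merges = single_linkage(points, 2)
print(sorted(sorted(c) for c in clusters))
```

The merge history is what a dendrogram would be drawn from. This version recomputes all pairwise cluster distances at every step, which is why practical implementations use more careful bookkeeping.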
On the Persistence of Clustering Solutions and True Number of Clusters in a Dataset
Typically, clustering algorithms provide clustering solutions with a
prespecified number of clusters. The lack of a priori knowledge of the true
number of underlying clusters in the dataset makes it important to have a
metric to compare clustering solutions with different numbers of clusters.
This article quantifies a notion of persistence of clustering solutions that
enables comparing solutions with different numbers of clusters. The persistence
relates to the range of data-resolution scales over which a clustering solution
persists; it is quantified in terms of the maximum over two-norms of all the
associated cluster-covariance matrices. Thus we associate a persistence value
with each element in a set of clustering solutions with different numbers of
clusters. We show that, for datasets where the natural clusters are known a
priori, the clustering solutions that identify the natural clusters are the
most persistent; in this way, this notion can be used to identify solutions
with the true number of clusters. Detailed experiments on a variety of standard and synthetic
datasets demonstrate that the proposed persistence-based indicator outperforms
existing approaches, such as the gap-statistic method, the X-means, G-means,
PG-means and dip-means algorithms, and the information-theoretic method, in
accurately identifying the clustering solutions with the true number of clusters.
Interestingly, our method can be explained in terms of the phase-transition
phenomenon in the deterministic annealing algorithm, where the number of
distinct cluster centers changes (bifurcates) with respect to an annealing
parameter.
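The abstract names one concrete ingredient of the persistence value: the maximum, over clusters, of the two-norm of each cluster's covariance matrix. The sketch below computes only that quantity and does not reproduce the full persistence formula, which the abstract does not give. It assumes 2-D points (so the two-norm has a closed form) and a covariance normalized by the cluster size; both choices, and all function names, are assumptions made for illustration.

```python
from math import sqrt

def covariance_2d(points):
    """Covariance matrix entries (a, b, c) of 2-D points, normalized by n.

    The matrix is [[a, b], [b, c]].
    """
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    return a, b, c

def two_norm_sym_2x2(a, b, c):
    """Two-norm of a symmetric positive semidefinite 2x2 matrix.

    For such a matrix the two-norm equals its largest eigenvalue.
    """
    return (a + c) / 2 + sqrt(((a - c) / 2) ** 2 + b ** 2)

def max_cluster_covariance_norm(clusters):
    """Maximum over clusters of the two-norm of the cluster covariance."""
    return max(two_norm_sym_2x2(*covariance_2d(pts)) for pts in clusters)

# Two hypothetical clusters: one spread along x, one spread along y.
clusters = [
    [(0.0, 0.0), (2.0, 0.0)],      # covariance [[1, 0], [0, 0]], norm 1
    [(10.0, 10.0), (10.0, 14.0)],  # covariance [[0, 0], [0, 4]], norm 4
]
print(max_cluster_covariance_norm(clusters))
```

Evaluating this quantity for candidate solutions with different numbers of clusters gives one number per solution, which is the kind of per-solution value the abstract describes comparing.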