Comparison of Clustering Methods for Time Course Genomic Data: Applications to Aging Effects
Time course microarray data provide insight about dynamic biological
processes. While several clustering methods have been proposed for the analysis
of these data structures, comparison and selection of appropriate clustering
methods are seldom discussed. We compared probabilistic clustering
methods and distance-based clustering methods for time course microarray
data. Among probabilistic methods, we considered smoothing spline clustering
(SSC), also known as model-based functional data analysis (MFDA), functional
clustering models for sparsely sampled data (FCM) and model-based clustering
(MCLUST). Among distance-based methods, we considered weighted gene
co-expression network analysis (WGCNA), clustering with dynamic time warping
distance (DTW) and clustering with autocorrelation based distance (ACF). We
studied these algorithms in both simulated settings and case study data. Our
investigations showed that FCM performed very well when gene curves were short
and sparse. DTW and WGCNA performed well when gene curves were medium or long
( observations). SSC performed very well when there were clusters of gene
curves similar to one another. Overall, ACF performed poorly in these
applications. In terms of computation time, FCM, SSC and DTW were considerably
slower than MCLUST and WGCNA. WGCNA outperformed MCLUST by generating more
accurate and biologically meaningful clustering results. WGCNA and MCLUST are the
best methods among the 6 methods compared, when performance and computation
time are both taken into account. WGCNA outperforms MCLUST, but MCLUST provides
model-based inference and uncertainty measures for the clustering results.
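As an illustration of the dynamic time warping distance mentioned in this abstract, the following is a minimal sketch of the textbook DTW recursion with an absolute-difference local cost. It is not the implementation used in the paper; the function names `dtw_distance` and `pairwise_dtw` are hypothetical, and the choice of local cost and the absence of a warping window are assumptions.

```python
def dtw_distance(a, b):
    """Textbook DTW distance between two numeric sequences.

    Assumption: the local cost is |a[i] - b[j]| and no warping window is used.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


def pairwise_dtw(curves):
    """Symmetric matrix of DTW distances between all pairs of curves."""
    k = len(curves)
    dist = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = dtw_distance(curves[i], curves[j])
    return dist
```

The resulting distance matrix could then be passed to any distance-based clustering routine, for example hierarchical clustering.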
Directional clustering through matrix factorization
This paper deals with a clustering problem where feature vectors are clustered depending on the angle between feature vectors, that is, feature vectors are grouped together if they point roughly in the same direction. This directional distance measure arises in several applications, including document classification and human brain imaging. Using ideas from the field of constrained low-rank matrix factorization and sparse approximation, a novel approach is presented that differs from classical clustering methods, such as seminonnegative matrix factorization, K-EVD, or k-means clustering, yet combines some aspects of all these. As in nonnegative matrix factorization and K-EVD, the matrix decomposition is iteratively refined to optimize a data fidelity term; however, no positivity constraint is enforced directly nor do we need to explicitly compute eigenvectors. As in k-means and K-EVD, each optimization step is followed by a hard cluster assignment. This leads to an efficient algorithm that is shown here to outperform common competitors in terms of clustering performance and/or computation speed. In addition to a detailed theoretical analysis of some of the algorithm's main properties, the approach is empirically evaluated on a range of toy problems, several standard text clustering data sets, and a high-dimensional problem in brain imaging, where functional magnetic resonance imaging data are used to partition the human cerebral cortex into distinct functional regions.
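To make the idea of grouping vectors by direction concrete, here is a minimal sketch of a spherical k-means style procedure: vectors are scaled to unit length, each is hard-assigned to the centre with the largest cosine similarity, and centres are re-estimated as normalised sums. This is a stand-in technique, not the matrix factorization algorithm proposed in the paper. The function name `directional_clusters` and the farthest-first initialisation are assumptions made for illustration.

```python
import math


def _unit(v):
    """Scale a vector to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def directional_clusters(vectors, k, n_iter=50):
    """Spherical k-means sketch: cluster vectors by direction, ignoring length."""
    units = [_unit(v) for v in vectors]
    # Deterministic farthest-first initialisation (an assumption of this sketch):
    # start from the first vector, then add the vector least similar to all
    # centres chosen so far.
    centers = [units[0]]
    while len(centers) < k:
        farthest = min(units, key=lambda u: max(_dot(u, c) for c in centers))
        centers.append(farthest)
    labels = None
    for _ in range(n_iter):
        # Hard assignment: largest cosine similarity wins.
        new_labels = [
            max(range(k), key=lambda j: _dot(u, centers[j])) for u in units
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update: each centre becomes the normalised sum of its members.
        for j in range(k):
            members = [u for u, lab in zip(units, labels) if lab == j]
            if members:
                total = [sum(col) for col in zip(*members)]
                if any(total):
                    centers[j] = _unit(total)
    return labels, centers
```

Because only directions matter, vectors such as `[1, 0.1]` and `[10, 0.5]` land in the same cluster despite their very different lengths.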
Parsimonious Time Series Clustering
We introduce a parsimonious model-based framework for clustering time course
data. In these applications the computational burden often becomes an issue due
to the number of available observations. The measured time series can also be
very noisy and sparse and a suitable model describing them can be hard to
define. We propose to model the observed measurements by using P-spline
smoothers and to cluster the functional objects as summarized by the optimal
spline coefficients. In principle, this idea can be adopted within all the most
common clustering frameworks. In this work we discuss applications based on a
k-means algorithm. We evaluate the accuracy and the efficiency of our proposal
through simulations and an application to Drosophila melanogaster gene expression
data.
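The pipeline described in this abstract (smooth each series with a P-spline, then cluster the spline coefficients with k-means) can be sketched as follows. This is an illustrative sketch under several assumptions not stated in the abstract: all series share one time grid, the basis is cubic with equally spaced knots, the penalty is a second-order difference penalty, and the smoothing parameter `lam` is fixed rather than optimised. All function names are hypothetical, and the farthest-first k-means initialisation is a convenience choice.

```python
import numpy as np


def bspline_basis(x, n_seg=8, degree=3):
    """B-spline basis on equally spaced knots via the Cox-de Boor recursion."""
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    h = (hi - lo) / n_seg
    # Knots extend `degree` segments beyond both ends of the data range.
    knots = lo + h * np.arange(-degree, n_seg + degree + 1)
    # Degree-0 basis: indicator of each half-open knot interval.
    B = ((x[:, None] >= knots[None, :-1]) & (x[:, None] < knots[None, 1:])).astype(float)
    for d in range(1, degree + 1):
        left = (x[:, None] - knots[None, :-d - 1]) / (knots[d:-1] - knots[:-d - 1])
        right = (knots[None, d + 1:] - x[:, None]) / (knots[d + 1:] - knots[1:-d])
        B = left * B[:, :-1] + right * B[:, 1:]
    return B  # shape: (len(x), n_seg + degree)


def pspline_coefficients(t, Y, n_seg=8, degree=3, lam=1.0):
    """Penalised least-squares coefficients, one row per series in Y."""
    B = bspline_basis(t, n_seg, degree)
    D = np.diff(np.eye(B.shape[1]), n=2, axis=0)  # second-order differences
    A = B.T @ B + lam * (D.T @ D)
    return np.linalg.solve(A, B.T @ np.asarray(Y, dtype=float).T).T


def kmeans(Z, k, n_iter=100):
    """Plain Lloyd iterations with a deterministic farthest-first start."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([np.sum((Z - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    C = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = np.argmin(((Z[:, None, :] - C[None]) ** 2).sum(axis=2), axis=1)
        new_C = np.array([
            Z[labels == j].mean(axis=0) if np.any(labels == j) else C[j]
            for j in range(k)
        ])
        if np.allclose(new_C, C):
            break
        C = new_C
    return labels


def cluster_curves(t, Y, k, **spline_kwargs):
    """Cluster time series by their P-spline coefficients."""
    return kmeans(pspline_coefficients(t, Y, **spline_kwargs), k)
```

Clustering the coefficients rather than the raw observations reduces each series to `n_seg + degree` numbers, which is where the computational saving in this kind of approach would come from.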
Communication-Avoiding Optimization Methods for Distributed Massive-Scale Sparse Inverse Covariance Estimation
Across a variety of scientific disciplines, sparse inverse covariance
estimation is a popular tool for capturing the underlying dependency
relationships in multivariate data. Unfortunately, most estimators are not
scalable enough to handle the sizes of modern high-dimensional data sets (often
on the order of terabytes), and assume Gaussian samples. To address these
deficiencies, we introduce HP-CONCORD, a highly scalable optimization method
for estimating a sparse inverse covariance matrix based on a regularized
pseudolikelihood framework, without assuming Gaussianity. Our parallel proximal
gradient method uses a novel communication-avoiding linear algebra algorithm
and runs across a multi-node cluster with up to 1k nodes (24k cores), achieving
parallel scalability on problems with up to ~819 billion parameters (1.28
million dimensions); even on a single node, HP-CONCORD demonstrates
scalability, outperforming a state-of-the-art method. We also use HP-CONCORD to
estimate the underlying dependency structure of the brain from fMRI data, and
use the result to identify functional regions automatically. The results show
good agreement with a clustering from the neuroscience literature.
Small-sample brain mapping: sparse recovery on spatially correlated designs with randomization and clustering
Functional neuroimaging can measure the brain's response to an external stimulus. It is used to perform brain mapping: identifying from these observations the brain regions involved. This problem can be cast into a linear supervised learning task where the neuroimaging data are used as predictors for the stimulus. Brain mapping is then seen as a support recovery problem. On functional MRI (fMRI) data, this problem is particularly challenging as i) the number of samples is small due to limited acquisition time and ii) the variables are strongly correlated. We propose to overcome these difficulties using sparse regression models over new variables obtained by clustering of the original variables. The use of randomization techniques, e.g. bootstrap samples, and clustering of the variables improves the recovery properties of sparse methods. We demonstrate the benefit of our approach on an extensive simulation study as well as two fMRI datasets.