78,446 research outputs found
Hierarchical maximum likelihood clustering approach
Objective: In this work, we focus on developing a clustering approach for biological data. In many biological analyses, such as multi-omics data analysis and genome-wide association studies (GWAS), it is crucial to find groups of samples belonging to subtypes of diseases or tumors.
Methods: Conventionally, the k-means algorithm is the clustering method most widely applied in many areas, including the biological sciences. There are, however, several alternative clustering algorithms, including support vector clustering. In this paper, taking into consideration the nature of biological data, we propose a maximum likelihood clustering scheme based on a hierarchical framework.
Results: The method can perform clustering even when the data belonging to different groups overlap, and also when the number of samples is lower than the data dimensionality.
Conclusion: The proposed scheme requires no initial settings to begin the search process. In addition, it does not require computing the first and second derivatives of likelihood functions, as many other maximum likelihood based methods do.
Significance: The algorithm uses distribution and centroid information to cluster a sample and was applied to biological data. A Matlab implementation of this method can be downloaded from http://www.riken.jp/en/research/labs/ims/med_sci_math/
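As a rough illustration of the general idea of hierarchical maximum likelihood clustering (not the paper's algorithm; its merge criterion based on distribution and centroid information is described in the publication), a minimal Python sketch might merge clusters greedily by Gaussian log-likelihood gain:

```python
import numpy as np

def cluster_loglik(points):
    """Spherical-Gaussian log-likelihood of one cluster (up to constants)."""
    mu = points.mean(axis=0)
    # pooled variance with a small floor so singleton clusters stay finite
    var = max(((points - mu) ** 2).mean(), 1e-6)
    n, d = points.shape
    return -0.5 * n * d * (np.log(2 * np.pi * var) + 1)

def hml_cluster(X, k):
    """Greedy agglomerative clustering: repeatedly merge the pair of
    clusters whose union gives the best total log-likelihood,
    until k clusters remain."""
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best, best_gain = None, -np.inf
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                merged = clusters[a] + clusters[b]
                gain = (cluster_loglik(X[merged])
                        - cluster_loglik(X[clusters[a]])
                        - cluster_loglik(X[clusters[b]]))
                if gain > best_gain:
                    best, best_gain = (a, b), gain
        a, b = best
        clusters[a] = clusters[a] + clusters.pop(b)
    return clusters
```

Like the scheme described above, this sketch needs no initial settings and no derivatives of the likelihood, but it assumes simple spherical Gaussian clusters.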
Spatial clustering of array CGH features in combination with hierarchical multiple testing
We propose a new approach for clustering DNA features using array CGH data from multiple tumor samples. We distinguish data-collapsing, which joins contiguous DNA clones or probes with extremely similar data into regions, from clustering, which joins contiguous, correlated regions based on a maximum likelihood principle. The model-based clustering algorithm accounts for the apparent spatial patterns in the data. We evaluate the randomness of the clustering result by a cluster stability score in combination with cross-validation. Moreover, we argue that the clustering really captures spatial genomic dependency by showing that coincidental clustering of independent regions is very unlikely. Using the region and cluster information, we test both for association with a clinical variable in a hierarchical multiple testing approach. This allows the significance of both regions and clusters to be interpreted while simultaneously controlling the family-wise error rate. We prove that, in the context of permutation tests and permutation-invariant clusters, it is permissible to perform clustering and testing on the same data set. Our procedures are illustrated on two cancer data sets.
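The data-collapsing step (joining contiguous probes with extremely similar data into regions) can be sketched as follows; the correlation rule and the `threshold` value are illustrative assumptions, not the published procedure:

```python
import numpy as np

def collapse_regions(data, threshold=0.95):
    """Join contiguous probes (rows) whose profiles across tumor samples
    (columns) are extremely similar into regions; return, for each region,
    the list of probe indices it covers.

    Hypothetical similarity rule: a probe joins the current region when its
    Pearson correlation with the region's last probe exceeds `threshold`."""
    regions = [[0]]
    for i in range(1, len(data)):
        prev = data[regions[-1][-1]]
        r = np.corrcoef(prev, data[i])[0, 1]
        if r >= threshold:
            regions[-1].append(i)   # extremely similar: same region
        else:
            regions.append([i])     # start a new region
    return regions
```

Because only contiguous probes are considered, the result preserves genomic order, which is what the subsequent spatial clustering step relies on.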
Introduction to fast Super-Paramagnetic Clustering
We map stock market interactions to spin models to recover their hierarchical structure using a simulated annealing based Super-Paramagnetic Clustering (SPC) algorithm. This is directly compared to a modified implementation of a maximum likelihood approach to fast Super-Paramagnetic Clustering (f-SPC). The methods are first applied to standard toy test-case problems, and then to a dataset of 447 stocks traded on the New York Stock Exchange (NYSE) over 1249 days. The signal-to-noise ratio of stock market correlation matrices is briefly considered. Our results recover clusters approximately representative of standard economic sectors, along with mixed clusters whose dynamics shed light on the adaptive nature of financial markets and raise concerns about the effectiveness of static, industry-based financial market classification in the world of real-time data analytics. A key result is that the standard maximum likelihood methods are confirmed to converge to solutions within a Super-Paramagnetic (SP) phase. We use insights arising from this to discuss the implications of using a Maximum Entropy Principle (MEP), as opposed to the Maximum Likelihood Principle (MLP), as an optimization device for this class of problems.
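The Potts-spin picture behind SPC can be sketched with a minimal simulated-annealing loop; the energy function is the standard Potts form, but the couplings, cooling schedule, and parameter values below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def potts_anneal(J, q=5, steps=30000, t0=1.0, t1=0.01, seed=1):
    """Minimal simulated-annealing sketch of Potts-model clustering.

    Spins s_i take one of q states; couplings J_ij (e.g. derived from a
    stock-correlation matrix) reward placing correlated items in the same
    state via the energy E = -sum_{i<j} J_ij * delta(s_i, s_j)."""
    J = np.asarray(J, dtype=float).copy()
    np.fill_diagonal(J, 0.0)                # no self-coupling
    rng = np.random.default_rng(seed)
    n = len(J)
    s = rng.integers(q, size=n)
    for step in range(steps):
        T = t0 * (t1 / t0) ** (step / steps)  # geometric cooling schedule
        i = rng.integers(n)
        new = rng.integers(q)
        # energy change of moving spin i to state `new`
        dE = (J[i] * (s == s[i])).sum() - (J[i] * (s == new)).sum()
        if dE <= 0 or rng.random() < np.exp(-dE / T):
            s[i] = new
    return s
```

At high temperature the spins are disordered; as the system cools through the super-paramagnetic regime, strongly coupled groups of spins align internally while remaining mutually disordered, which is the clustering signal SPC reads off.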
Taxonomy Induction Using Hierarchical Random Graphs
This paper presents a novel approach for inducing lexical taxonomies automatically from text. We recast the learning problem as that of inferring a hierarchy from a graph whose nodes represent taxonomic terms and whose edges represent their degree of relatedness. Our model takes this graph representation as input and fits a taxonomy to it via a combination of a maximum likelihood approach and a Monte Carlo sampling algorithm. Essentially, the method works by sampling hierarchical structures with probability proportional to the likelihood with which they produce the input graph. We use our model to infer a taxonomy over 541 nouns and show that it outperforms popular flat and hierarchical clustering algorithms.
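The likelihood of a candidate hierarchy given the input graph can be sketched in the spirit of hierarchical random graphs: each internal node r connects its left and right leaf sets independently with probability p_r, whose maximum likelihood estimate is E_r / (L_r * R_r). The code below is our reconstruction of that idea, not the paper's implementation:

```python
import math

def leaves(t):
    """Collect leaf node ids of a hierarchy given as nested 2-tuples."""
    return [t] if not isinstance(t, tuple) else leaves(t[0]) + leaves(t[1])

def hrg_loglik(tree, edges):
    """Log-likelihood of a hierarchy given an undirected graph: at each
    internal node, count the edges crossing between the left and right
    leaf sets and score them as independent Bernoulli trials with the
    maximum-likelihood connection probability."""
    E = {frozenset(e) for e in edges}
    ll = 0.0
    stack = [tree]
    while stack:
        t = stack.pop()
        if not isinstance(t, tuple):
            continue
        left, right = leaves(t[0]), leaves(t[1])
        cross = sum(frozenset((u, v)) in E for u in left for v in right)
        total = len(left) * len(right)
        p = cross / total
        if 0 < p < 1:   # p in {0, 1} contributes exactly zero
            ll += cross * math.log(p) + (total - cross) * math.log(1 - p)
        stack.extend(t)
    return ll
```

A Monte Carlo sampler would then propose local rearrangements of the tree and accept them with probability proportional to the likelihood ratio, so that hierarchies are sampled in proportion to how well they reproduce the graph.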
A new approach to cluster analysis: the clustering-function-based method
The purpose of the paper is to present a new statistical approach to hierarchical cluster analysis with n objects measured on p variables. Motivated by the model of multivariate analysis of variance and the method of maximum likelihood, the clustering problem is formulated as a least squares optimization problem, simultaneously solving for both an n-vector of unknown group memberships of the objects and a linear clustering function. This formulation is shown to be linked to linear regression analysis and Fisher linear discriminant analysis, and it accommodates principal component regression for tackling multicollinearity or rank deficiency, polynomial or B-spline regression for handling non-linearity, and various variable selection methods for eliminating irrelevant variables from the data analysis. Algorithmic issues are investigated using sign eigenanalysis.
Algorithms of maximum likelihood data clustering with applications
We address the problem of data clustering by introducing an unsupervised, parameter-free approach based on the maximum likelihood principle. Starting from the observation that data sets belonging to the same cluster share common information, we construct an expression for the likelihood of any possible cluster structure. The likelihood in turn depends only on the Pearson correlation coefficients of the data. We discuss clustering algorithms that provide a fast and reliable approximation to maximum likelihood configurations. Compared to standard clustering methods, our approach has the advantages that (i) it is parameter free, (ii) the number of clusters need not be fixed in advance, and (iii) the interpretation of the results is transparent. In order to test our approach and compare it with standard clustering algorithms, we analyze two very different data sets: time series of financial market returns and gene expression data. We find that different maximization algorithms produce similar cluster structures, whereas the outcome of standard algorithms has a much wider variability.
Comment: Accepted by Physica A; 12 pages, 5 figures. More information at:
http://www.sissa.it/dataclusterin
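A correlation-based cluster likelihood of this kind can be sketched as follows; the per-cluster formula is reproduced from memory of the published work and should be treated as a sketch, to be checked against the paper:

```python
import numpy as np

def cluster_likelihood(C, labels):
    """Log-likelihood of a cluster assignment given a Pearson correlation
    matrix C. For each cluster s with n_s > 1 elements and internal
    correlation c_s = sum_{i,j in s} C_ij, the contribution is
        (1/2) * [ ln(n_s / c_s)
                  + (n_s - 1) * ln((n_s^2 - n_s) / (n_s^2 - c_s)) ],
    which is zero when the cluster's off-diagonal correlations vanish."""
    labels = np.asarray(labels)
    total = 0.0
    for s in np.unique(labels):
        idx = np.flatnonzero(labels == s)
        n = len(idx)
        if n < 2:
            continue  # singleton clusters contribute nothing
        c = C[np.ix_(idx, idx)].sum()
        total += 0.5 * (np.log(n / c)
                        + (n - 1) * np.log((n * n - n) / (n * n - c)))
    return total
```

Any maximization strategy over partitions (greedy merging, simulated annealing, and so on) can then use this objective, which is why different maximization algorithms can be compared on the same footing.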