78,446 research outputs found

    Hierarchical maximum likelihood clustering approach

    Get PDF
    Objective: In this work, we focused on developing a clustering approach for biological data. In many biological analyses, such as multi-omics data analysis and genome-wide association studies (GWAS) analysis, it is crucial to find groups of data belonging to subtypes of diseases or tumors. Methods: Conventionally, the k-means clustering algorithm is overwhelmingly applied in many areas including biological sciences. There are, however, several alternative clustering algorithms that can be applied, including support vector clustering. In this paper, taking into consideration the nature of biological data, we propose a maximum likelihood clustering scheme based on a hierarchical framework. Results: This method can perform clustering even when the data belonging to different groups overlap. It can also perform clustering when the number of samples is lower than the data dimensionality. Conclusion: The proposed scheme is free from selecting initial settings to begin the search process. In addition, it does not require the computation of the first and second derivative of likelihood functions, as is required by many other maximum likelihood based methods. Significance: This algorithm uses distribution and centroid information to cluster a sample and was applied to biological data. A Matlab implementation of this method can be downloaded from the web-link http://www.riken.jp/en/research/labs/ims/med_sci_math/

    Spatial clustering of array CGH features in combination with hierarchical multiple testing

    Get PDF
    We propose a new approach for clustering DNA features using array CGH data from multiple tumor samples. We distinguish data-collapsing: joining contiguous DNA clones or probes with extremely similar data into regions, from clustering: joining contiguous, correlated regions based on a maximum likelihood principle. The model-based clustering algorithm accounts for the apparent spatial patterns in the data. We evaluate the randomness of the clustering result by a cluster stability score in combination with cross-validation. Moreover, we argue that the clustering really captures spatial genomic dependency by showing that coincidental clustering of independent regions is very unlikely. Using the region and cluster information, we combine testing of these for association with a clinical variable in an hierarchical multiple testing approach. This allows for interpreting the significance of both regions and clusters while controlling the Family-Wise Error Rate simultaneously. We prove that in the context of permutation tests and permutation-invariant clusters it is allowed to perform clustering and testing on the same data set. Our procedures are illustrated on two cancer data sets

    Introduction to fast Super-Paramagnetic Clustering

    Get PDF
    We map stock market interactions to spin models to recover their hierarchical structure using a simulated annealing based Super-Paramagnetic Clustering (SPC) algorithm. This is directly compared to a modified implementation of a maximum likelihood approach to fast-Super-Paramagnetic Clustering (f-SPC). The methods are first applied standard toy test-case problems, and then to a dataset of 447 stocks traded on the New York Stock Exchange (NYSE) over 1249 days. The signal to noise ratio of stock market correlation matrices is briefly considered. Our result recover approximately clusters representative of standard economic sectors and mixed clusters whose dynamics shine light on the adaptive nature of financial markets and raise concerns relating to the effectiveness of industry based static financial market classification in the world of real-time data-analytics. A key result is that we show that the standard maximum likelihood methods are confirmed to converge to solutions within a Super-Paramagnetic (SP) phase. We use insights arising from this to discuss the implications of using a Maximum Entropy Principle (MEP) as opposed to the Maximum Likelihood Principle (MLP) as an optimization device for this class of problems

    Taxonomy Induction Using Hierarchical Random Graphs

    Get PDF
    This paper presents a novel approach for inducing lexical taxonomies automatically from text. We recast the learning problem as that of inferring a hierarchy from a graph whose nodes represent taxonomic terms and edges their degree of relatedness. Our model takes this graph representation as input and fits a taxonomy to it via combination of a maximum likelihood approach with a Monte Carlo Sampling algorithm. Essentially, the method works by sampling hierarchical structures with probability proportional to the likelihood with which they produce the input graph. We use our model to infer a taxonomy over 541 nouns and show that it outperforms popular flat and hierarchical clustering algorithms.

    A new approach to cluster analysis: the clustering-function-based method

    Get PDF
    The purpose of the paper is to present a new statistical approach to hierarchical cluster analysis with n objects measured on p variables. Motivated by the model of multivariate analysis of variance and the method of maximum likelihood, a clustering problem is formulated as a least squares optimization problem, simultaneously solving for both an n-vector of unknown group membership of objects and a linear clustering function. This formulation is shown to be linked to linear regression analysis and Fisher linear discriminant analysis and includes principal component regression for tackling multicollinearity or rank deficiency, polynomial or B-splines regression for handling non-linearity and various variable selection methods to eliminate irrelevant variables from data analysis. Algorithmic issues are investigated by using sign eigenanalysis

    Algorithms of maximum likelihood data clustering with applications

    Full text link
    We address the problem of data clustering by introducing an unsupervised, parameter free approach based on maximum likelihood principle. Starting from the observation that data sets belonging to the same cluster share a common information, we construct an expression for the likelihood of any possible cluster structure. The likelihood in turn depends only on the Pearson's coefficient of the data. We discuss clustering algorithms that provide a fast and reliable approximation to maximum likelihood configurations. Compared to standard clustering methods, our approach has the advantages that i) it is parameter free, ii) the number of clusters need not be fixed in advance and iii) the interpretation of the results is transparent. In order to test our approach and compare it with standard clustering algorithms, we analyze two very different data sets: Time series of financial market returns and gene expression data. We find that different maximization algorithms produce similar cluster structures whereas the outcome of standard algorithms has a much wider variability.Comment: Accepted by Physica A; 12 pag., 5 figures. More information at: http://www.sissa.it/dataclusterin
    corecore