
    Model Assisted Variable Clustering: Minimax-optimal Recovery and Algorithms

    Model-based clustering defines population-level clusters relative to a model that embeds notions of similarity. Algorithms tailored to such models yield estimated clusters with a clear statistical interpretation. We take this view here and introduce the class of G-block covariance models as a background model for variable clustering. In such models, two variables in a cluster are deemed similar if they have similar associations with all other variables. This can arise, for instance, when groups of variables are noise-corrupted versions of the same latent factor. We quantify the difficulty of clustering data generated from a G-block covariance model in terms of cluster proximity, measured with respect to two related, but different, cluster separation metrics. We derive minimax cluster separation thresholds, which are the metric values below which no algorithm can recover the model-defined clusters exactly, and show that they are different for the two metrics. We therefore develop two algorithms, COD and PECOK, tailored to G-block covariance models, and study their minimax optimality with respect to each metric. Of independent interest is the fact that the analysis of the PECOK algorithm, which is based on a corrected convex relaxation of the popular K-means algorithm, provides the first statistical analysis of such algorithms for variable clustering. Additionally, we contrast our methods with another popular clustering method, spectral clustering, specialized to variable clustering, and show that ensuring exact cluster recovery via this method requires clusters to have a higher separation relative to the minimax threshold. Extensive simulation studies, as well as our data analyses, confirm the applicability of our approach. Comment: Main text: 38 pages; supplementary information: 37 pages
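    To make the block structure concrete, here is a minimal Python sketch (with illustrative parameters that are not taken from the paper) that simulates a G-block covariance model, where each variable is a noise-corrupted copy of its cluster's latent factor, and computes a COD-style dissimilarity under which two variables are close when their covariances with every other variable nearly coincide; the paper's exact estimator and thresholds may differ.

        import numpy as np

        rng = np.random.default_rng(0)

        # Simulate a G-block covariance model: every variable in cluster k is a
        # noise-corrupted copy of the latent factor Z_k (illustrative sizes).
        n, K, m = 500, 3, 4                    # samples, clusters, variables per cluster
        Z = rng.normal(size=(n, K))            # one latent factor per cluster
        labels = np.repeat(np.arange(K), m)    # cluster membership of each variable
        X = Z[:, labels] + 0.3 * rng.normal(size=(n, K * m))

        # COD-style dissimilarity: variables a and b are deemed similar when their
        # covariances with every *other* variable c nearly coincide.
        S = np.cov(X, rowvar=False)
        p = S.shape[0]
        D = np.zeros((p, p))
        for a in range(p):
            for b in range(p):
                others = [c for c in range(p) if c not in (a, b)]
                D[a, b] = np.max(np.abs(S[a, others] - S[b, others]))

        print(np.round(D[:m, :m], 2))       # same cluster: near zero
        print(np.round(D[:m, m:2 * m], 2))  # different clusters: clearly larger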

    Clustering based on Random Graph Model embedding Vertex Features

    Large datasets with interactions between objects are common to numerous scientific fields (e.g. social science, the internet, biology). The interactions naturally define a graph, and a common way to explore or summarize such datasets is graph clustering. Most techniques for clustering graph vertices use only the topology of connections, ignoring the information in the vertex features. In this paper, we provide a clustering algorithm that exploits both types of data, based on a statistical model with latent structure characterizing each vertex both by a vector of features and by its connectivity. We perform simulations to compare our algorithm with existing approaches, and also evaluate our method on real datasets of hypertext documents. We find that our algorithm successfully exploits whatever information is found in both the connectivity pattern and the features.
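    The generative side of such a model can be sketched as follows (a hypothetical parameterization, not the one from the paper): each vertex draws a latent cluster that drives both a cluster-specific feature distribution and a block-structured edge probability matrix; inference would then fit both sources of evidence jointly, e.g. by variational EM.

        import numpy as np

        rng = np.random.default_rng(1)

        # Each vertex gets a latent cluster that drives both its feature vector
        # and its connection probabilities (assumed Gaussian features and a
        # block-model graph; the paper's exact model may differ).
        n, K, d = 60, 3, 2
        pi = np.full(K, 1 / K)                         # cluster proportions
        mu = rng.normal(scale=3.0, size=(K, d))        # cluster feature means
        P = np.full((K, K), 0.05) + 0.25 * np.eye(K)   # edge probabilities

        z = rng.choice(K, size=n, p=pi)                # latent clusters
        features = mu[z] + rng.normal(size=(n, d))     # vertex features
        A = (rng.random((n, n)) < P[z][:, z]).astype(int)
        A = np.triu(A, 1)
        A = A + A.T                                    # undirected, no self-loops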

    Generating random networks with given degree-degree correlations and degree-dependent clustering

    Random networks are widely used to model complex networks and study their properties. To obtain a good approximation of the complex networks encountered in various disciplines of science, the ability to tune various statistical properties of random networks is very important. In this manuscript we present an algorithm that constructs networks with arbitrary degree-degree correlations and adjustable degree-dependent clustering. We verify the algorithm by using empirical networks as input, and additionally describe a simple way to fix a degree-dependent clustering function when degree-degree correlations are given. Comment: 4 pages, 3 figures
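    The two target statistics that such a construction must match can be measured from an empirical input network as in the sketch below (using networkx; the construction algorithm itself is not reproduced here): the mean neighbor degree as a function of degree captures the degree-degree correlations, and the mean local clustering coefficient as a function of degree gives the degree-dependent clustering function.

        from collections import defaultdict

        import networkx as nx
        import numpy as np

        # Stand-in for an empirical input network.
        G = nx.karate_club_graph()

        # Degree-degree correlations: k -> mean degree of the neighbors of
        # degree-k vertices.
        knn = nx.average_degree_connectivity(G)

        # Degree-dependent clustering: k -> mean local clustering coefficient
        # of degree-k vertices.
        local_c = nx.clustering(G)
        by_degree = defaultdict(list)
        for node, deg in G.degree():
            by_degree[deg].append(local_c[node])
        c_of_k = {k: float(np.mean(v)) for k, v in by_degree.items()}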

    Representation Learning for Clustering: A Statistical Framework

    We address the problem of communicating domain knowledge from a user to the designer of a clustering algorithm. We propose a protocol in which the user provides a clustering of a relatively small random sample of a data set. The algorithm designer then uses that sample to come up with a data representation under which k-means clustering results in a clustering (of the full data set) that is aligned with the user's clustering. We provide a formal statistical model for analyzing the sample complexity of learning a clustering representation with this paradigm. We then introduce a notion of capacity of a class of possible representations, in the spirit of the VC-dimension, showing that classes of representations with finite such dimension can be successfully learned with sample-size error bounds, and end our discussion with an analysis of that dimension for classes of representations induced by linear embeddings. Comment: To be published in Proceedings of UAI 201
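    The protocol itself is easy to instantiate. The sketch below (Python with scikit-learn) uses linear discriminant analysis as one concrete way to learn a linear embedding from the user-clustered sample, then runs k-means on the embedded full data set; the choice of LDA is an assumption for illustration, since the paper studies the statistical framework rather than a specific learner.

        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.datasets import make_blobs
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

        rng = np.random.default_rng(2)

        # Full data set; the user clusters only a small random sample of it.
        X, y_true = make_blobs(n_samples=2000, centers=3, n_features=10,
                               random_state=2)
        sample = rng.choice(len(X), size=60, replace=False)
        user_labels = y_true[sample]       # stands in for the user's clustering

        # Learn a linear embedding that separates the user's clusters on the
        # sample, then cluster the embedded full data set with k-means.
        embed = LinearDiscriminantAnalysis(n_components=2)
        embed.fit(X[sample], user_labels)
        clusters = KMeans(n_clusters=3, n_init=10).fit_predict(embed.transform(X))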

    RNN Language Model with Word Clustering and Class-based Output Layer

    The recurrent neural network language model (RNNLM) has shown significant promise for statistical language modeling. In this work, a new class-based output layer method is introduced to further improve the RNNLM. In this method, word class information is incorporated into the output layer by using the Brown clustering algorithm to estimate a class-based language model. Experimental results show that the new output layer with word clustering not only clearly improves convergence but also reduces the perplexity and word error rate in large-vocabulary continuous speech recognition.
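    The underlying factorization is P(w | h) = P(class(w) | h) · P(w | class(w), h), with the word-to-class map coming from Brown clustering. The PyTorch sketch below illustrates only this factorization (layer names and details are assumptions, not the paper's exact architecture); for brevity it scores the full vocabulary and masks it to the target's class, whereas the speed benefit of class-based layers comes from computing logits only for the words in that class.

        import torch
        import torch.nn as nn

        class ClassFactoredOutput(nn.Module):
            """Output layer factoring P(w|h) = P(class(w)|h) * P(w|class(w), h)."""

            def __init__(self, hidden, vocab, n_classes, word2class):
                super().__init__()
                self.class_head = nn.Linear(hidden, n_classes)
                self.word_head = nn.Linear(hidden, vocab)
                # word2class: LongTensor of shape (vocab,), e.g. from Brown clustering.
                self.register_buffer("word2class", word2class)

            def log_prob(self, h, targets):
                # h: (batch, hidden) RNN states; targets: (batch,) next-word ids.
                cls = self.word2class[targets]                    # (batch,)
                log_p_class = self.class_head(h).log_softmax(-1)
                # Restrict the word softmax to each target word's own class.
                in_class = self.word2class.unsqueeze(0) == cls.unsqueeze(1)
                word_logits = self.word_head(h).masked_fill(~in_class, float("-inf"))
                log_p_word = word_logits.log_softmax(-1)
                return (log_p_class.gather(1, cls.unsqueeze(1)).squeeze(1)
                        + log_p_word.gather(1, targets.unsqueeze(1)).squeeze(1))

        # Training would minimize the negative log-likelihood: loss = -log_prob(h, targets).mean()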