Model Assisted Variable Clustering: Minimax-optimal Recovery and Algorithms
Model-based clustering defines population level clusters relative to a model
that embeds notions of similarity. Algorithms tailored to such models yield
estimated clusters with a clear statistical interpretation. We take this view
here and introduce the class of G-block covariance models as a background model
for variable clustering. In such models, two variables in a cluster are deemed
similar if they have similar associations with all other variables. This can
arise, for instance, when groups of variables are noise corrupted versions of
the same latent factor. We quantify the difficulty of clustering data generated
from a G-block covariance model in terms of cluster proximity, measured with
respect to two related, but different, cluster separation metrics. We derive
minimax cluster separation thresholds, which are the metric values below which
no algorithm can recover the model-defined clusters exactly, and show that they
are different for the two metrics. We therefore develop two algorithms, COD and
PECOK, tailored to G-block covariance models, and study their
minimax-optimality with respect to each metric. Of independent interest is the
fact that the analysis of the PECOK algorithm, which is based on a corrected
convex relaxation of the popular K-means algorithm, provides the first
statistical analysis of such algorithms for variable clustering. Additionally,
we contrast our methods with another popular clustering method, spectral
clustering, specialized to variable clustering, and show that ensuring exact
cluster recovery via this method requires clusters to have a higher separation,
relative to the minimax threshold. Extensive simulation studies, as well as our
data analyses, confirm the applicability of our approach.
Comment: Main text: 38 pages; supplementary information: 37 pages
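The latent-factor example mentioned in the abstract above can be illustrated with a small simulation. This is a minimal sketch, not the paper's COD or PECOK algorithm: the function names, the cluster assignment, the noise level, and the sample size are all assumptions chosen for illustration. Each variable is its cluster's latent factor plus independent noise, so variables in the same cluster should covary strongly while variables in different clusters should not.

```python
import random


def sample_latent_factor_data(assignment, n_samples, noise_sd=0.5, seed=0):
    """Draw rows where variable j equals its cluster's latent factor plus noise.

    `assignment[j]` is the (hypothetical) cluster index of variable j.
    """
    rng = random.Random(seed)
    n_clusters = max(assignment) + 1
    rows = []
    for _ in range(n_samples):
        # One independent standard-normal latent factor per cluster.
        factors = [rng.gauss(0.0, 1.0) for _ in range(n_clusters)]
        # Each observed variable is a noise-corrupted copy of its factor.
        rows.append([factors[g] + rng.gauss(0.0, noise_sd) for g in assignment])
    return rows


def sample_covariance(rows, i, j):
    """Unbiased sample covariance between columns i and j."""
    n = len(rows)
    mean_i = sum(r[i] for r in rows) / n
    mean_j = sum(r[j] for r in rows) / n
    return sum((r[i] - mean_i) * (r[j] - mean_j) for r in rows) / (n - 1)


# Illustrative setup: four variables, two clusters of two variables each.
assignment = [0, 0, 1, 1]
data = sample_latent_factor_data(assignment, n_samples=2000)
within = sample_covariance(data, 0, 1)  # same cluster: close to the factor variance
across = sample_covariance(data, 0, 2)  # different clusters: close to zero
```

With these assumed settings the within-cluster covariance is near 1 and the across-cluster covariance is near 0, which is the block structure a variable-clustering method would try to recover.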
Clustering based on Random Graph Model embedding Vertex Features
Large datasets with interactions between objects are common to numerous
scientific fields (e.g. social science, the internet, biology). The interactions
naturally define a graph, and a common way to explore or summarize such a dataset
is graph clustering. Most techniques for clustering graph vertices use only the
topology of connections, ignoring information in the vertex features. In this
paper, we provide a clustering algorithm exploiting both types of data based on
a statistical model with latent structure characterizing each vertex both by a
vector of features as well as by its connectivity. We perform simulations to
compare our algorithm with existing approaches, and also evaluate our method
with real datasets based on hyper-textual documents. We find that our algorithm
successfully exploits whatever information is found both in the connectivity
pattern and in the features.
Generating random networks with given degree-degree correlations and degree-dependent clustering
Random networks are widely used to model complex networks and research their
properties. In order to get a good approximation of complex networks
encountered in various disciplines of science, the ability to tune various
statistical properties of random networks is very important. In this manuscript
we present an algorithm which is able to construct arbitrarily degree-degree
correlated networks with adjustable degree-dependent clustering. We verify the
algorithm by using empirical networks as input and describe additionally a
simple way to fix a degree-dependent clustering function if degree-degree
correlations are given.
Comment: 4 pages, 3 figures
Representation Learning for Clustering: A Statistical Framework
We address the problem of communicating domain knowledge from a user to the
designer of a clustering algorithm. We propose a protocol in which the user
provides a clustering of a relatively small random sample of a data set. The
algorithm designer then uses that sample to come up with a data representation
under which k-means clustering results in a clustering (of the full data set)
that is aligned with the user's clustering. We provide a formal statistical
model for analyzing the sample complexity of learning a clustering
representation with this paradigm. We then introduce a notion of capacity of a
class of possible representations, in the spirit of the VC-dimension, showing
that classes of representations that have finite such dimension can be
successfully learned with sample size error bounds, and end our discussion with
an analysis of that dimension for classes of representations induced by linear
embeddings.
Comment: To be published in Proceedings of UAI 201
RNN Language Model with Word Clustering and Class-based Output Layer
The recurrent neural network language model (RNNLM) has shown significant promise for statistical language modeling. In this work, a new class-based output layer method is introduced to further improve the RNNLM. In this method, word class information is incorporated into the output layer by using the Brown clustering algorithm to estimate a class-based language model. Experimental results show that the new output layer with word clustering not only noticeably improves convergence but also reduces perplexity and word error rate in large-vocabulary continuous speech recognition.
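One common way to build a class-based output layer is to factor the word probability as P(w | h) = P(class(w) | h) * P(w | class(w), h). The abstract does not give the paper's exact formulation, so the following is only a sketch under that assumption. The function names, the toy vocabulary, the class assignments, and the scores are hypothetical; in the setting described above, the classes would come from Brown clustering and the scores from the network's hidden state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of real-valued scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_based_word_probs(class_scores, word_scores, word_to_class):
    """Return P(w | h) = P(class(w) | h) * P(w | class(w), h) for every word.

    class_scores: one score per class (assumed to come from the hidden state).
    word_scores: dict mapping word -> score.
    word_to_class: dict mapping word -> class index (assumed clustering).
    """
    class_probs = softmax(class_scores)
    # Group words by class so each softmax runs over one class's members only.
    members = {}
    for word, cls in word_to_class.items():
        members.setdefault(cls, []).append(word)
    probs = {}
    for cls, words in members.items():
        within = softmax([word_scores[w] for w in words])
        for word, p in zip(words, within):
            probs[word] = class_probs[cls] * p
    return probs


# Toy example with made-up classes and scores.
word_to_class = {"cat": 0, "dog": 0, "run": 1}
word_scores = {"cat": 1.2, "dog": 0.3, "run": -0.5}
probs = class_based_word_probs([0.5, -0.2], word_scores, word_to_class)
```

The appeal of this factorization is that each prediction normalizes over the classes and then over one class's words, rather than over the whole vocabulary at once, while the resulting word probabilities still sum to one.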