Data spectroscopy: Eigenspaces of convolution operators and clustering
This paper focuses on obtaining clustering information about a distribution
from its i.i.d. samples. We develop theoretical results to understand and use
clustering information contained in the eigenvectors of data adjacency matrices
based on a radial kernel function with a sufficiently fast tail decay. In
particular, we provide population analyses to gain insights into which
eigenvectors should be used and when the clustering information for the
distribution can be recovered from the sample. We learn that a fixed number of
top eigenvectors might at the same time contain redundant clustering
information and miss relevant clustering information. We use this insight to
design the data spectroscopic clustering (DaSpec) algorithm that utilizes
properly selected eigenvectors to determine the number of clusters
automatically and to group the data accordingly. Our findings extend the
intuitions underlying existing spectral techniques such as spectral clustering
and Kernel Principal Component Analysis, and provide new insight into
their usability and modes of failure. Simulation studies and experiments on
real-world data are conducted to show the potential of our algorithm. In
particular, DaSpec is found to handle unbalanced groups and recover clusters of
different shapes better than the competing methods.

Comment: Published at http://dx.doi.org/10.1214/09-AOS700 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
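To make the eigenvector intuition above concrete, here is a minimal NumPy sketch of the underlying idea: for well-separated clusters, the leading eigenvectors of a radial (Gaussian) kernel matrix localize on individual clusters and can be read off as cluster indicators. This is an illustration of the general principle, not the paper's DaSpec algorithm; the data, bandwidth, and assignment rule are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 1-D clusters (illustrative data, not from the paper).
x = np.concatenate([rng.normal(-5, 0.5, 30), rng.normal(5, 0.5, 30)])

# Radial (Gaussian) kernel matrix: fast tail decay makes the matrix
# nearly block-diagonal when clusters are far apart.
sigma = 1.0
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma**2))

# Eigen-decomposition (ascending eigenvalues); each leading eigenvector
# concentrates its mass on one cluster.
vals, vecs = np.linalg.eigh(K)
top = vecs[:, -2:]  # the two leading eigenvectors

# Assign each point to the eigenvector with the largest magnitude there.
labels = np.argmax(np.abs(top), axis=1)
```

With clusters this far apart relative to the bandwidth, the assignment recovers the two groups exactly; as the abstract notes, with unbalanced or overlapping groups the informative eigenvectors need not be the top ones, which is the regime the paper analyzes.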
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
Kernel k-means is an effective method for data clustering that extends the commonly used k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is, however, computationally very demanding, as it requires the complete kernel matrix to be computed and stored. Further, the kernelized nature of the algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. In this paper, we define a
family of kernel-based low-dimensional embeddings that allows for scaling
kernel k-means on MapReduce via an efficient and unified parallelization
strategy. Afterwards, we propose two methods for low-dimensional embedding that
adhere to our definition of the embedding family. Exploiting the proposed
parallelization strategy, we present two scalable MapReduce algorithms for
kernel k-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.

Comment: Appears in Proceedings of the SIAM International Conference on Data Mining (SDM), 201
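The embedding idea can be sketched with random Fourier features, one well-known kernel-based low-dimensional embedding in this spirit (not necessarily one of the two methods the paper proposes): map each point to a short feature vector whose inner products approximate a Gaussian kernel, then run ordinary k-means on the embedded points. All data and parameters here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: two 2-D blobs (illustrative; the paper targets large-scale data).
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2))])

# Random Fourier features: an embedding z(x) with z(x)·z(y) approximating
# the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2).
d = 100
W = rng.normal(size=(2, d))          # frequencies ~ N(0, I)
b = rng.uniform(0, 2 * np.pi, d)     # random phases
Z = np.sqrt(2.0 / d) * np.cos(X @ W + b)

# Plain Lloyd's k-means on the embeddings stands in for kernel k-means.
# In a MapReduce setting, each iteration is a map step (assign points to
# the nearest centroid) and a reduce step (recompute centroids).
k = 2
C = Z[[0, len(Z) - 1]]               # simple deterministic initialization
for _ in range(20):
    labels = np.argmin(((Z[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    C = np.array([Z[labels == j].mean(0) for j in range(k)])
```

Because the embedding is explicit and low-dimensional, the expensive full kernel matrix never has to be formed, which is what makes the MapReduce parallelization straightforward.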
Batch kernel SOM and related Laplacian methods for social network analysis
Large graphs are natural mathematical models for describing the structure of
the data in a wide variety of fields, such as web mining, social networks,
information retrieval, biological networks, etc. For all these applications,
automatic tools are required to get a synthetic view of the graph and to reach
a good understanding of the underlying problem. In particular, discovering
groups of tightly connected vertices and understanding the relations between
those groups is very important in practice. This paper shows how a kernel
version of the batch Self Organizing Map can be used to achieve these goals via
kernels derived from the Laplacian matrix of the graph, especially when it is
used in conjunction with more classical methods based on the spectral analysis
of the graph. The proposed method is used to explore the structure of a
medieval social network modeled through a weighted graph that has been directly
built from a large corpus of agrarian contracts.
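A minimal sketch of the Laplacian machinery the abstract refers to: build a graph Laplacian, derive a kernel from it (here the heat kernel, one standard choice), and note that the Laplacian's spectral structure already exposes groups of tightly connected vertices. The toy graph and parameters are invented for illustration; this is not the paper's kernel SOM itself.

```python
import numpy as np

# Toy graph: two 4-node cliques joined by a single bridge edge
# (illustrative; the paper studies a medieval social network).
A = np.zeros((8, 8))
for block in (range(4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0  # bridge edge between the two communities

# Graph Laplacian L = D - A.
L = np.diag(A.sum(axis=1)) - A

# Heat kernel exp(-beta * L): one standard kernel derived from the
# Laplacian, computed via the eigendecomposition of the symmetric L.
# A kernel SOM would consume this matrix as its similarity input.
beta = 0.5
vals, vecs = np.linalg.eigh(L)       # ascending eigenvalues
K = vecs @ np.diag(np.exp(-beta * vals)) @ vecs.T

# The Fiedler vector (eigenvector of the second-smallest eigenvalue)
# separates the two communities by sign, the classical spectral view
# the paper combines with the SOM.
labels = (vecs[:, 1] > 0).astype(int)
```

The heat kernel is positive semidefinite by construction, so it is a valid kernel for the batch SOM, and its spectral origin is what ties the method to classical spectral analysis of the graph.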