Data spectroscopy: Eigenspaces of convolution operators and clustering
This paper focuses on obtaining clustering information about a distribution
from its i.i.d. samples. We develop theoretical results to understand and use
clustering information contained in the eigenvectors of data adjacency matrices
based on a radial kernel function with a sufficiently fast tail decay. In
particular, we provide population analyses to gain insights into which
eigenvectors should be used and when the clustering information for the
distribution can be recovered from the sample. We learn that a fixed number of
top eigenvectors might at the same time contain redundant clustering
information and miss relevant clustering information. We use this insight to
design the data spectroscopic clustering (DaSpec) algorithm that utilizes
properly selected eigenvectors to determine the number of clusters
automatically and to group the data accordingly. Our findings extend the
intuitions underlying existing spectral techniques such as spectral clustering
and Kernel Principal Component Analysis, and provide new insight into
their usability and modes of failure. Simulation studies and experiments on
real-world data are conducted to show the potential of our algorithm. In
particular, DaSpec is found to handle unbalanced groups and recover clusters of
different shapes better than the competing methods.

Comment: Published at http://dx.doi.org/10.1214/09-AOS700 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
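To make the eigenvector intuition above concrete, here is a minimal NumPy sketch of the underlying idea: for well-separated clusters, the leading eigenvectors of a radial (Gaussian) kernel matrix localize on individual clusters and can be read off as cluster indicators. This is an illustration of the general principle, not the paper's DaSpec algorithm; the data, bandwidth, and assignment rule are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated 1-D clusters (illustrative data, not from the paper).
x = np.concatenate([rng.normal(-5, 0.5, 30), rng.normal(5, 0.5, 30)])

# Radial (Gaussian) kernel matrix: fast tail decay makes the matrix
# nearly block-diagonal when clusters are far apart.
sigma = 1.0
K = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma**2))

# Eigen-decomposition (ascending eigenvalues); each leading eigenvector
# concentrates its mass on one cluster.
vals, vecs = np.linalg.eigh(K)
top = vecs[:, -2:]  # the two leading eigenvectors

# Assign each point to the eigenvector with the largest magnitude there.
labels = np.argmax(np.abs(top), axis=1)
```

With clusters this far apart relative to the bandwidth, the assignment recovers the two groups exactly; as the abstract notes, with unbalanced or overlapping groups the informative eigenvectors need not be the top ones, which is the regime the paper analyzes.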
Embed and Conquer: Scalable Embeddings for Kernel k-Means on MapReduce
Kernel k-means is an effective method for data clustering that extends the commonly used k-means algorithm to work on a similarity matrix over complex data structures. The kernel k-means algorithm is, however, computationally very demanding, as it requires the complete kernel matrix to be computed and stored. Further, the kernelized nature of the algorithm hinders the parallelization of its computations on modern infrastructures for distributed computing. In this paper, we define a
family of kernel-based low-dimensional embeddings that allows for scaling
kernel k-means on MapReduce via an efficient and unified parallelization
strategy. Afterwards, we propose two methods for low-dimensional embedding that
adhere to our definition of the embedding family. Exploiting the proposed
parallelization strategy, we present two scalable MapReduce algorithms for
kernel k-means. We demonstrate the effectiveness and efficiency of the proposed algorithms through an empirical evaluation on benchmark data sets.

Comment: Appears in Proceedings of the SIAM International Conference on Data Mining (SDM), 201
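The embedding idea can be sketched with random Fourier features, one well-known kernel-based low-dimensional embedding in this spirit (not necessarily one of the two methods the paper proposes): map each point to a short feature vector whose inner products approximate a Gaussian kernel, then run ordinary k-means on the embedded points. All data and parameters here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: two 2-D blobs (illustrative; the paper targets large-scale data).
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2))])

# Random Fourier features: an embedding z(x) with z(x)·z(y) approximating
# the Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2).
d = 100
W = rng.normal(size=(2, d))          # frequencies ~ N(0, I)
b = rng.uniform(0, 2 * np.pi, d)     # random phases
Z = np.sqrt(2.0 / d) * np.cos(X @ W + b)

# Plain Lloyd's k-means on the embeddings stands in for kernel k-means.
# In a MapReduce setting, each iteration is a map step (assign points to
# the nearest centroid) and a reduce step (recompute centroids).
k = 2
C = Z[[0, len(Z) - 1]]               # simple deterministic initialization
for _ in range(20):
    labels = np.argmin(((Z[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
    C = np.array([Z[labels == j].mean(0) for j in range(k)])
```

Because the embedding is explicit and low-dimensional, the expensive full kernel matrix never has to be formed, which is what makes the MapReduce parallelization straightforward.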
Batch kernel SOM and related Laplacian methods for social network analysis
Large graphs are natural mathematical models for describing the structure of
the data in a wide variety of fields, such as web mining, social networks,
information retrieval, biological networks, etc. For all these applications,
automatic tools are required to get a synthetic view of the graph and to reach
a good understanding of the underlying problem. In particular, discovering
groups of tightly connected vertices and understanding the relations between
those groups is very important in practice. This paper shows how a kernel
version of the batch Self Organizing Map can be used to achieve these goals via
kernels derived from the Laplacian matrix of the graph, especially when it is
used in conjunction with more classical methods based on the spectral analysis
of the graph. The proposed method is used to explore the structure of a
medieval social network modeled through a weighted graph that has been directly
built from a large corpus of agrarian contracts.
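A minimal sketch of the Laplacian machinery the abstract refers to: build a graph Laplacian, derive a kernel from it (here the heat kernel, one standard choice), and note that the Laplacian's spectral structure already exposes groups of tightly connected vertices. The toy graph and parameters are invented for illustration; this is not the paper's kernel SOM itself.

```python
import numpy as np

# Toy graph: two 4-node cliques joined by a single bridge edge
# (illustrative; the paper studies a medieval social network).
A = np.zeros((8, 8))
for block in (range(4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                A[i, j] = 1.0
A[3, 4] = A[4, 3] = 1.0  # bridge edge between the two communities

# Graph Laplacian L = D - A.
L = np.diag(A.sum(axis=1)) - A

# Heat kernel exp(-beta * L): one standard kernel derived from the
# Laplacian, computed via the eigendecomposition of the symmetric L.
# A kernel SOM would consume this matrix as its similarity input.
beta = 0.5
vals, vecs = np.linalg.eigh(L)       # ascending eigenvalues
K = vecs @ np.diag(np.exp(-beta * vals)) @ vecs.T

# The Fiedler vector (eigenvector of the second-smallest eigenvalue)
# separates the two communities by sign, the classical spectral view
# the paper combines with the SOM.
labels = (vecs[:, 1] > 0).astype(int)
```

The heat kernel is positive semidefinite by construction, so it is a valid kernel for the batch SOM, and its spectral origin is what ties the method to classical spectral analysis of the graph.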