Clustering, Hamming Embedding, Generalized LSH and the Max Norm

Authors: A. Gionis; J.L. Krivine; L. Danzer; L.V. Buchok; N. Alon; N. Srebro
Publication date: 01/01/2014

We study the convex relaxation of clustering and Hamming embedding, focusing on the asymmetric case (co-clustering and asymmetric Hamming embedding). We examine how these relaxations relate to LSH as studied by Charikar (2002) and to the max-norm ball, and how the symmetric and asymmetric versions differ. Comment: 17 pages
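For readers unfamiliar with the LSH scheme the abstract refers to, the following is a minimal sketch of random-hyperplane hashing in the style of Charikar (2002), written in plain Python. It covers only the classical symmetric scheme, not the paper's asymmetric generalization or its max-norm analysis. The function names (`random_hyperplanes`, `simhash`, `hamming_fraction`) and the example vectors are hypothetical and chosen for illustration.

```python
import math
import random

def random_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits Gaussian normal vectors, one hyperplane per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def simhash(x, planes):
    """Binary code: bit k records which side of hyperplane k the vector x lies on."""
    return [1 if sum(r_i * x_i for r_i, x_i in zip(r, x)) >= 0 else 0
            for r in planes]

def hamming_fraction(a, b):
    """Fraction of positions where two equal-length codes disagree."""
    return sum(u != v for u, v in zip(a, b)) / len(a)

def angle(x, y):
    """Angle between x and y in radians."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return math.acos(max(-1.0, min(1.0, dot / (nx * ny))))

# Illustrative vectors (assumed): they are 45 degrees apart.
planes = random_hyperplanes(dim=3, n_bits=4096, seed=0)
x = [1.0, 0.0, 0.0]
y = [1.0, 1.0, 0.0]

# Each bit differs with probability angle / pi, so the normalized
# Hamming distance between the codes estimates angle(x, y) / pi.
estimate = hamming_fraction(simhash(x, planes), simhash(y, planes))
target = angle(x, y) / math.pi
```

The property illustrated here is that the normalized Hamming distance between two codes concentrates around the angle between the vectors divided by π, which is what makes the codes usable as a Hamming embedding of angular similarity.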

Sources: arXiv.org e-Print Archive; CiteSeerX; Crossref

Probabilistic Clustering Using Maximal Matrix Norm Couplings

Authors: Makur Anuran; Qiu David; Zheng Lizhong
Publication date: 10/10/2018

In this paper, we present a local information theoretic approach to explicitly learn probabilistic clustering of a discrete random variable. Our formulation yields a convex maximization problem for which it is NP-hard to find the global optimum. To solve this optimization problem algorithmically, we propose two relaxations that are solved via gradient ascent and alternating maximization. Experiments on the MSR Sentence Completion Challenge, MovieLens 100K, and Reuters-21578 datasets demonstrate that our approach is competitive with existing techniques and worthy of further investigation. Comment: Presented at 56th Annual Allerton Conference on Communication, Control, and Computing, 201

Sources: arXiv.org e-Print Archive; Crossref; DSpace@MIT

Preconditioned Data Sparsification for Big Data with Applications to PCA and K-means

Authors: Becker Stephen; Pourkamali-Anaraki Farhad
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 19/09/2016

We analyze a compression scheme for large data sets that randomly keeps a small percentage of the components of each data sample. The benefit is that the output is a sparse matrix, so subsequent processing, such as PCA or K-means, is significantly faster, especially in a distributed-data setting. Furthermore, the sampling is single-pass and applicable to streaming data. The sampling mechanism is a variant of previous methods proposed in the literature, combined with a randomized preconditioning to smooth the data. We provide guarantees for PCA in terms of the covariance matrix, and guarantees for K-means in terms of the error in the center estimators at a given step. We present numerical evidence that our bounds are nearly tight and that our algorithms provide a real benefit when applied to standard test data sets, as well as certain benefits over related sampling approaches. Comment: 28 pages, 10 figures
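The compression idea in the abstract can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not specify the preconditioner or the sampling rule, so the code assumes random sign flips followed by a normalized Walsh-Hadamard transform (one common smoothing choice) and uniform sampling of a fixed number of components without replacement, rescaled so each kept entry is unbiased. All function names and the example vector are hypothetical.

```python
import math
import random

def fwht(v):
    """Normalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [scale * t for t in out]

def precondition(x, signs):
    """Assumed preconditioner: random sign flips, then an orthonormal transform."""
    return fwht([s * t for s, t in zip(signs, x)])

def sparsify(y, keep, rng):
    """Keep `keep` uniformly chosen components, rescaled by d / keep.

    Each index survives with probability keep / d, so the rescaling makes
    the sparse vector an unbiased estimate of y.
    """
    d = len(y)
    kept = rng.sample(range(d), keep)
    scale = d / keep
    return {i: scale * y[i] for i in kept}

rng = random.Random(0)
d = 8
signs = [rng.choice((-1.0, 1.0)) for _ in range(d)]

# A "spiky" sample (assumed): all energy sits in one coordinate, which is
# the worst case for plain coordinate sampling.
x = [5.0] + [0.0] * (d - 1)

y = precondition(x, signs)            # energy is now spread over all coordinates
sparse = sparsify(y, keep=2, rng=rng)  # single pass over one sample
```

Because the transform is orthonormal, preconditioning preserves the norm of each sample while spreading its energy across coordinates, so a small random subset of components is more representative than it would be for the raw spiky vector. Each sample is processed independently, which is consistent with the single-pass, streaming setting described in the abstract.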

Sources: arXiv.org e-Print Archive; CU Scholar Institutional Repository; Crossref