1 research outputs found
Scalable K-Medoids via True Error Bound and Familywise Bandits
K-Medoids(KM) is a standard clustering method, used extensively on
semi-metric data.Error analyses of KM have traditionally used an in-sample
notion of error,which can be far from the true error and suffer from
generalization gap. We formalize the true K-Medoid error based on the
underlying data distribution.We decompose the true error into fundamental
statistical problems of: minimum estimation (ME) and minimum mean estimation
(MME). We provide a convergence result for MME. We show \errMME decreases no
slower than , where is a measure of
sample size. Inspired by this bound, we propose a computationally efficient,
distributed KM algorithm namely MCPAM. MCPAM has expected runtime
,where is the number of medoids and is number of
samples. MCPAM provides massive computational savings for a small tradeoff in
accuracy. We verify the quality and scaling properties of MCPAM on various
datasets. And achieve the hitherto unachieved feat of calculating the KM of 1
billion points on semi-metric spaces