Determining the k in k-means with MapReduce

Debatty, Thibault; Mees, Wim; Michiardi, Pietro; Thonnard, Olivier

Publication date: 24 March 2014
Publisher: HAL CCSD

Abstract

In this paper we propose a MapReduce implementation of G-means, a variant of k-means that automatically determines k, the number of clusters. We show that our implementation scales to very large datasets and very large values of k, because its computation cost is proportional to nk. Other techniques, which run a clustering algorithm with different values of k and choose the value that provides the "best" results, have a computation cost proportional to nk². Our experiments confirm that the processing time is proportional to k. They also show that, because G-means adds new centers progressively, if and where they are needed, it reduces the probability of falling into a local minimum and ultimately finds better centers than classical k-means.
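The difference between the nk and nk² scalings can be illustrated with a rough count of point-to-center distance computations. The sketch below is not taken from the paper: the function names and the cost model are assumptions made for illustration. It charges one distance computation per point-center pair per pass, and it idealises G-means as doubling the number of centers each round (every cluster splits), ignoring the cost of the statistical tests that decide whether a cluster should be split.

```python
def multi_k_cost(n, k_max):
    """Distance computations for one k-means pass at every candidate
    k = 1..k_max (hypothetical cost model, one pass per candidate)."""
    return sum(n * k for k in range(1, k_max + 1))


def doubling_cost(n, k_max):
    """Distance computations when the number of centers doubles each
    round (1, 2, 4, ...) until it reaches k_max. This is an idealised
    stand-in for G-means in which every cluster splits every round."""
    total, k = 0, 1
    while k < k_max:
        total += n * k  # one pass over all points with k centers
        k *= 2
    return total + n * k_max  # final pass with k_max centers


if __name__ == "__main__":
    n = 1_000_000  # assumed dataset size, for illustration only
    for k_max in (8, 64, 512):
        print(k_max, multi_k_cost(n, k_max), doubling_cost(n, k_max))
```

Under these assumptions, doubling k roughly quadruples the first count but only roughly doubles the second, which is the qualitative behaviour the abstract describes.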

Available Versions

Archive Ouverte en Sciences de l'Information et de la Communication

oai:HAL:hal-01525708v1

Last time updated on 17/08/2017