K-means is undoubtedly the most widely used partitional clustering algorithm.
Unfortunately, due to its gradient descent nature, this algorithm is highly
sensitive to the initial placement of the cluster centers. Numerous
initialization methods have been proposed to address this problem. In this
paper, we first present an overview of these methods with an emphasis on their
computational efficiency. We then compare eight commonly used linear time
complexity initialization methods on a large and diverse collection of data
sets using various performance criteria. Finally, we analyze the experimental
results using non-parametric statistical tests and provide recommendations for
practitioners. We demonstrate that popular initialization methods often perform
poorly and that there are in fact strong alternatives to these methods.Comment: 17 pages, 1 figure, 7 table