Large-Scale Automatic K-Means Clustering for Heterogeneous Many-Core Supercomputer

Fu, Haohuan; Janjic, Vladimir; Liu, Pan; Thomson, John; Wang, Shicai; Yan, Xiaohan; Yang, Guangwen; Yu, Teng; Zhao, Wenlai

Authors: Haohuan Fu, Vladimir Janjic, Pan Liu, John Thomson, Shicai Wang, Xiaohan Yan, Guangwen Yang, Teng Yu, Wenlai Zhao
Publication date: 9 December 2019
Publisher: Institute of Electrical and Electronics Engineers (IEEE)

Abstract

Funding: UK EPSRC grants "Discovery" EP/P020631/1 and "ABC: Adaptive Brokerage for the Cloud" EP/R010528/1.

This article presents an automatic k-means clustering solution targeting the Sunway TaihuLight supercomputer. We first introduce a multilevel parallel partition approach that partitions not only by dataflow and centroid but also by dimension, which unlocks the potential of the hierarchical parallelism in the heterogeneous many-core processor and the system architecture of the supercomputer. The parallel design is able to process large-scale clustering problems with up to 196,608 dimensions and over 160,000 target centroids, while maintaining high performance and high scalability. Furthermore, we propose an automatic hyper-parameter determination process for k-means clustering, which automatically generates and executes clustering tasks with a set of candidate hyper-parameters and then determines the optimal hyper-parameter using a proposed evaluation method. The proposed auto-clustering solution not only achieves high performance and scalability for problems with massive high-dimensional data, but also supports clustering without sufficient prior knowledge of the number of target clusters, which can potentially extend the k-means algorithm to new application areas.

Postprint. Peer reviewed.
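The two ideas in the abstract (splitting the distance computation by dimension, and choosing the number of clusters by running several candidates and scoring them) can be illustrated with a small sequential sketch. This is not the authors' implementation. The abstract does not describe the proposed evaluation method, so the mean silhouette score is used here purely as a stand-in, and the farthest-point initialisation, the `chunk` parameter, and all function names are assumptions made for illustration. The dimension chunks run one after another here; on a real many-core system each chunk would belong to a different worker and the partial sums would be reduced.

```python
import math


def sq_dist(a, b, chunk=2):
    """Squared Euclidean distance accumulated over dimension chunks.

    Each chunk stands in for the slice of dimensions one worker would own
    (an illustrative assumption); here the chunks are summed sequentially.
    """
    total = 0.0
    for start in range(0, len(a), chunk):
        stop = start + chunk
        total += sum((x - y) ** 2 for x, y in zip(a[start:stop], b[start:stop]))
    return total


def farthest_point_init(points, k):
    """Deterministic seeding (assumed, not from the paper): start at the first
    point, then repeatedly add the point farthest from its nearest centroid."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(points, k, max_iters=100):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iters):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def mean_silhouette(points, labels, k):
    """Stand-in evaluation method: mean silhouette over all points.

    Assumes at least two non-empty clusters and distinct points.
    """
    clusters = [[p for p, lab in zip(points, labels) if lab == c] for c in range(k)]
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:  # singleton clusters score 0 by convention
            scores.append(0.0)
            continue
        # p is in `own` at distance 0, so divide by the number of other members
        a = sum(math.sqrt(sq_dist(p, q)) for q in own) / (len(own) - 1)
        b = min(sum(math.sqrt(sq_dist(p, q)) for q in other) / len(other)
                for c, other in enumerate(clusters) if c != lab and other)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def auto_kmeans(points, candidate_ks):
    """Run k-means for every candidate k and keep the best-scoring one."""
    scores = {}
    for k in candidate_ks:
        _, labels = kmeans(points, k)
        scores[k] = mean_silhouette(points, labels, k)
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data: three well-separated groups of six 4-dimensional points.
centers = [(0, 0, 0, 0), (10, 10, 10, 10), (-10, 10, -10, 10)]
offsets = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0),
           (0, 0, 1, 0), (0, 0, 0, 1), (-1, 0, 0, 0)]
points = [[c + o for c, o in zip(center, off)]
          for center in centers for off in offsets]

best_k, scores = auto_kmeans(points, candidate_ks=[2, 3, 4, 5])
print(best_k)  # prints 3
```

The candidate runs are independent of one another, which is what makes this kind of hyper-parameter search a natural fit for generating and executing many clustering tasks in parallel.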