The k-means algorithm is arguably the most popular nonparametric clustering
method but cannot generally be applied to datasets with incomplete records. The
usual practice is then either to impute missing values under an assumed
missing-completely-at-random mechanism or to discard the incomplete records, and
then to apply the algorithm to the resulting dataset. We develop an efficient version
of the k-means algorithm that allows for clustering in the presence of
incomplete records. Our extension is called km-means and reduces to the
k-means algorithm when all records are complete. We also provide
initialization strategies for our algorithm and methods to estimate the number
of groups in the dataset. Illustrations and simulations demonstrate the
efficacy of our approach in a variety of settings and patterns of missing data.
Our methods are also applied to the analysis of activation images obtained from
a functional Magnetic Resonance Imaging experiment.

Comment: 21 pages, 12 figures, 3 tables, in press, Statistical Analysis and
Data Mining -- The ASA Data Science Journal, 201
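
To make the idea of clustering incomplete records concrete, the following is a minimal sketch and not the km-means algorithm as specified in the paper. It assumes one simple way of handling missing values in a Lloyd-style iteration: distances and center updates use only the observed coordinates of each record. The function names, the use of `None` to mark a missing value, and the requirement that initial centers be supplied by the caller are all illustrative assumptions. When every record is complete, the sketch coincides with the ordinary k-means iteration.

```python
def partial_sq_dist(x, center):
    """Squared Euclidean distance over the coordinates observed in x.

    Missing coordinates (None) are skipped; this is an assumed convention.
    """
    return sum((xi - ci) ** 2 for xi, ci in zip(x, center) if xi is not None)


def kmeans_observed_only(records, centers, n_iter=20):
    """Lloyd-style iterations that ignore missing coordinates.

    records: list of equal-length lists, with None marking a missing value.
    centers: fully specified initial centers (hypothetical input; the sketch
             does not implement any initialization strategy).
    """
    centers = [list(c) for c in centers]
    labels = [0] * len(records)
    for _ in range(n_iter):
        # Assignment step: nearest center, judged on observed coordinates only.
        labels = [
            min(range(len(centers)), key=lambda k: partial_sq_dist(x, centers[k]))
            for x in records
        ]
        # Update step: each center coordinate becomes the mean of the values
        # observed for that coordinate among the records assigned to it.
        for k in range(len(centers)):
            for j in range(len(centers[k])):
                vals = [
                    x[j]
                    for x, lab in zip(records, labels)
                    if lab == k and x[j] is not None
                ]
                if vals:  # keep the previous value if nothing is observed
                    centers[k][j] = sum(vals) / len(vals)
    return labels, centers


# Toy data: two well-separated groups, each with partially observed records.
records = [
    [0.0, 0.1], [0.2, None], [None, -0.1],
    [5.0, 5.1], [5.2, None], [None, 4.9],
]
labels, centers = kmeans_observed_only(records, centers=[[0.0, 0.0], [5.0, 5.0]])
```

In this toy example no record is imputed or discarded; the incomplete records contribute to the assignment and to the center updates through whichever coordinates they do have.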