
    Clustering of Data with Missing Entries

    The analysis of large datasets is often complicated by the presence of missing entries, mainly because most current machine learning algorithms are designed to work with complete data. The main focus of this work is to introduce a clustering algorithm that provides good clustering even in the presence of missing data. The proposed technique solves an $\ell_0$ fusion penalty based optimization problem to recover the clusters. We theoretically analyze the conditions needed for successful recovery of the clusters. We also propose an algorithm to solve a relaxation of this problem using saturating non-convex fusion penalties. The method is demonstrated on simulated and real datasets, and is observed to perform well in the presence of large fractions of missing entries.
    Comment: arXiv admin note: substantial text overlap with arXiv:1709.0187
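    The abstract does not spell out the optimization details, so the following is only a minimal sketch of the fusion-penalty idea under stated assumptions: a data-fidelity term restricted to observed entries plus a pairwise penalty on representatives, with a truncated norm standing in for the paper's (unspecified) saturating non-convex fusion penalty. All names (`saturating_grad`, `fusion_cluster_missing`) are hypothetical.

```python
import numpy as np

def saturating_grad(diff, tau):
    """(Sub)gradient of a truncated-norm penalty phi(r) = min(r, tau),
    an assumed stand-in for the paper's saturating non-convex fusion
    penalty (the exact penalty is not given in the abstract)."""
    r = np.linalg.norm(diff)
    if r == 0.0 or r >= tau:
        return np.zeros_like(diff)  # fused, or saturated: no pull
    return diff / r

def fusion_cluster_missing(X, mask, lam=0.2, tau=1.0, lr=0.05, iters=400):
    """Gradient descent on
        sum_i ||mask_i * (u_i - x_i)||^2 + lam * sum_{i<j} phi(||u_i - u_j||).
    The fidelity term only touches observed entries; representatives
    u_i that fuse together define the recovered clusters."""
    n, _ = X.shape
    U = np.where(mask, X, 0.0)           # init unobserved entries at 0
    for _ in range(iters):
        grad = 2.0 * mask * (U - X)      # zero at unobserved entries
        for i in range(n):
            for j in range(i + 1, n):
                g = lam * saturating_grad(U[i] - U[j], tau)
                grad[i] += g
                grad[j] -= g
        U -= lr * grad
    return U

# toy example: two well-separated clusters with ~30% of entries missing
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (10, 2)), rng.normal(3, 0.1, (10, 2))])
mask = rng.random(X.shape) > 0.3
print(np.round(fusion_cluster_missing(X, mask), 2))
```

    Because the penalty saturates, far-apart representatives stop attracting each other, which is what lets the relaxation separate distinct clusters instead of collapsing everything to one centroid.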

    The Bane of Low-Dimensionality Clustering

    In this paper, we give a conditional lower bound of $n^{\Omega(k)}$ on the running time for the classic k-median and k-means clustering objectives (where n is the size of the input), even in low-dimensional Euclidean space of dimension four, assuming the Exponential Time Hypothesis (ETH). We also consider k-median (and k-means) with penalties, where each point need not be assigned to a center, in which case it must pay a penalty, and extend our lower bound to at least three-dimensional Euclidean space. This stands in stark contrast to many other geometric problems, such as the traveling salesman problem or computing an independent set of unit spheres. While these problems benefit from the so-called (limited) blessing of dimensionality, as they can be solved in time $n^{O(k^{1-1/d})}$ or $2^{n^{1-1/d}}$ in d dimensions, our work shows that widely-used clustering objectives have a lower bound of $n^{\Omega(k)}$, even in dimension four. We complete the picture by considering the two-dimensional case: we show that no algorithm solves the penalized version in time $n^{o(\sqrt{k})}$, and provide a matching upper bound of $n^{O(\sqrt{k})}$. The main tool we use to establish these lower bounds is the placement of points on the moment curve, which takes its inspiration from constructions of point sets yielding Delaunay complexes of high complexity.
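    To make the two ingredients of the abstract concrete, here is a small illustrative sketch (not the paper's construction): points placed on the four-dimensional moment curve $p(t) = (t, t^2, t^3, t^4)$, fed to a naive $n^{O(k)}$ k-median solver that restricts candidate centers to input points. The brute-force enumeration is exactly the kind of running time the $n^{\Omega(k)}$ bound says is essentially unavoidable under ETH. Function names are hypothetical.

```python
import numpy as np
from itertools import combinations

def moment_curve_points(ts, d=4):
    """Points p(t) = (t, t^2, ..., t^d) on the d-dimensional moment
    curve; the hardness constructions place input points on this curve
    (d = 4 for the dimension-four result)."""
    return np.array([[t ** p for p in range(1, d + 1)] for t in ts])

def k_median_brute_force(X, k):
    """Naive n^{O(k)} discrete k-median: try every k-subset of input
    points as centers and keep the cheapest assignment."""
    best_cost, best_centers = np.inf, None
    for centers in combinations(range(len(X)), k):
        C = X[list(centers)]
        # each point pays the distance to its closest chosen center
        cost = np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2).min(axis=1).sum()
        if cost < best_cost:
            best_cost, best_centers = cost, centers
    return best_cost, best_centers

X = moment_curve_points(np.linspace(0.0, 1.0, 12))
print(k_median_brute_force(X, k=3))
```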

    Uncovering Group Level Insights with Accordant Clustering

    Clustering is a widely-used data mining tool, which aims to discover partitions of similar items in data. We introduce a new clustering paradigm, \emph{accordant clustering}, which enables the discovery of (predefined) group level insights. Unlike previous clustering paradigms that aim to understand relationships amongst the individual members, the goal of accordant clustering is to uncover insights at the group level through the analysis of their members. Group level insight can often support a call to action that cannot be informed through previous clustering techniques. We propose the first accordant clustering algorithm, and prove that it finds near-optimal solutions when data possesses inherent cluster structure. The insights revealed by accordant clustering enabled experts in the field of medicine to isolate successful treatments for a neurodegenerative disease, and those in finance to discover patterns of unnecessary spending.
    Comment: accepted to SDM 2017 (oral)
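    The abstract does not describe the accordant clustering algorithm itself, so the following is only a toy illustration, under assumptions, of what "group level" as opposed to member level structure means: given a clustering and predefined group memberships, score each group by how concentrated its members are in a single cluster. The names (`group_concentration`) and the scoring rule are hypothetical, not the paper's method.

```python
import numpy as np
from collections import Counter

def group_concentration(labels, groups):
    """For each predefined group, measure how concentrated its members
    are in a single cluster: 1.0 means all members land together (the
    group behaves accordantly), lower values mean the group is split."""
    scores = {}
    for g in set(groups):
        member_labels = [l for l, gg in zip(labels, groups) if gg == g]
        top = Counter(member_labels).most_common(1)[0][1]
        scores[g] = top / len(member_labels)
    return scores

# toy data: two blobs; group "A" sits in one blob, group "B" straddles both
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(4, 0.2, (20, 2))])
groups = ["A"] * 10 + ["B"] * 20 + ["C"] * 10
labels = (X[:, 0] > 2).astype(int)  # trivial 1-d split stands in for a clustering
print(group_concentration(labels, groups))  # "B" scores ~0.5: not accordant
```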