The Power of Uniform Sampling for Coresets
Motivated by practical generalizations of the classic k-median and
k-means objectives, such as clustering with size constraints, fair
clustering, and Wasserstein barycenter, we introduce a meta-theorem for
designing coresets for constrained-clustering problems. The meta-theorem
reduces the task of coreset construction to one on a bounded number of ring
instances with a much-relaxed additive error. This reduction enables us to
construct coresets using uniform sampling, in contrast to the widely-used
importance sampling, and consequently we can easily handle constrained
objectives. Notably and perhaps surprisingly, this simpler sampling scheme can
yield coresets whose size is independent of n, the number of input points.
Our technique yields smaller coresets, and sometimes the first coresets, for
a large number of constrained clustering problems, including capacitated
clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in
minor-excluded graphs, and polygon clustering under Fréchet and Hausdorff
distance. Finally, our technique also yields smaller coresets for k-median in
low-dimensional Euclidean spaces, specifically of size
in and
in
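To make the contrast with importance sampling concrete, here is a minimal sketch of uniform sampling on a single ring-like instance. It is not the paper's construction: the ring generator, the sample size of 200, the two centers, and the helper names `kmedian_cost` and `uniform_coreset` are all assumptions chosen for illustration. Each sampled point simply receives the same weight n/m, so no per-point sensitivity scores are needed.

```python
import math
import random

def kmedian_cost(points, centers, weights=None):
    """Weighted k-median cost: sum of distances to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))

def uniform_coreset(points, m, rng):
    """Draw m points uniformly without replacement; each stands for n/m inputs."""
    sample = rng.sample(points, m)
    weight = len(points) / m
    return sample, [weight] * m

rng = random.Random(0)

# Hypothetical ring instance: every point lies at distance 1 to 2 from the origin,
# so all points have costs within a constant factor of each other.
ring = []
for _ in range(2000):
    angle = rng.uniform(0.0, 2.0 * math.pi)
    radius = rng.uniform(1.0, 2.0)
    ring.append((radius * math.cos(angle), radius * math.sin(angle)))

centers = [(0.0, 0.0), (3.0, 0.0)]  # arbitrary candidate solution (assumption)
coreset, weights = uniform_coreset(ring, 200, rng)

full_cost = kmedian_cost(ring, centers)
approx_cost = kmedian_cost(coreset, centers, weights)
rel_err = abs(full_cost - approx_cost) / full_cost
print(f"relative error of the weighted sample: {rel_err:.4f}")
```

Because the weights are uniform, the total weight always equals the number of input points, which is one reason a uniform sample is easier to reconcile with size or fairness constraints than a sample with heterogeneous weights.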
Spectral Clustering with Imbalanced Data
Spectral clustering is sensitive to how graphs are constructed from data,
particularly when proximal and imbalanced clusters are present. We show that
ratio cut (RCut) and normalized cut (NCut) objectives are not tailored to
imbalanced data since they tend to emphasize cut sizes over cut values. We
propose a graph partitioning problem that seeks minimum cut partitions under
minimum size constraints on partitions to deal with imbalanced data. Our
approach parameterizes a family of graphs by adaptively modulating node
degrees on a fixed node set, yielding a set of parameter-dependent cuts
reflecting varying levels of imbalance. The solution to our problem is then
obtained by optimizing over these parameters. We present rigorous limit cut
analysis results to justify our approach. We demonstrate the superiority of our
method through unsupervised and semi-supervised experiments on synthetic and
real data sets.
Comment: 24 pages, 7 figures. arXiv admin note: substantial text overlap with
arXiv:1302.513
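The tension between cut value and partition size can be illustrated with the standard two-way definitions, RCut(A, B) = cut(A, B) (1/|A| + 1/|B|) and NCut(A, B) = cut(A, B) (1/vol(A) + 1/vol(B)). The sketch below is not taken from the paper: the six-node weighted path graph and the function names are hypothetical, chosen so that the partition with the smaller cut value is the imbalanced one.

```python
def cut_value(edges, part_a):
    """Total weight of edges with exactly one endpoint in part_a."""
    return sum(w for (u, v), w in edges.items()
               if (u in part_a) != (v in part_a))

def volume(edges, part):
    """Sum of weighted degrees of the nodes in part."""
    return sum(w * ((u in part) + (v in part)) for (u, v), w in edges.items())

def rcut(edges, part_a, part_b):
    return cut_value(edges, part_a) * (1 / len(part_a) + 1 / len(part_b))

def ncut(edges, part_a, part_b):
    return cut_value(edges, part_a) * (
        1 / volume(edges, part_a) + 1 / volume(edges, part_b))

# Hypothetical path graph 0-1-2-3-4-5 with a small cluster {0, 1}
# lying close to a larger cluster {2, 3, 4, 5}.
edges = {(0, 1): 5.0, (1, 2): 1.0, (2, 3): 1.1, (3, 4): 5.0, (4, 5): 5.0}

imbalanced = ({0, 1}, {2, 3, 4, 5})   # smallest cut value
balanced = ({0, 1, 2}, {3, 4, 5})     # equal sizes, larger cut value

for name, (a, b) in [("imbalanced", imbalanced), ("balanced", balanced)]:
    print(name, cut_value(edges, a), rcut(edges, a, b), ncut(edges, a, b))
```

In this toy graph the imbalanced partition has the lower cut value, yet RCut scores the balanced partition lower, because the size term outweighs the small difference in cut weight.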
Clustering and Community Detection with Imbalanced Clusters
Spectral clustering methods, which are frequently used in clustering and
community detection applications, are sensitive to the specific graph
construction, particularly when imbalanced clusters are present. We show that
ratio cut (RCut) or normalized cut (NCut) objectives are not tailored to
imbalanced cluster sizes since they tend to emphasize cut sizes over cut
values. We propose a graph partitioning problem that seeks minimum cut
partitions under minimum size constraints on partitions to deal with imbalanced
cluster sizes. Our approach parameterizes a family of graphs by adaptively
modulating node degrees on a fixed node set, yielding a set of
parameter-dependent cuts reflecting varying levels of imbalance. The solution to our
problem is then obtained by optimizing over these parameters. We present
rigorous limit cut analysis results to justify our approach and demonstrate the
superiority of our method through experiments on synthetic and real datasets
for data clustering, semi-supervised learning and community detection.
Comment: Extended version of arXiv:1309.2303 with new applications. Accepted
to IEEE TSIP
Clustering with diversity
We consider the clustering with diversity problem: given a set of
colored points in a metric space, partition them into clusters such that each
cluster has at least ℓ points, all of which have distinct colors.
We give a 2-approximation to this problem for any ℓ when the objective
is to minimize the maximum radius of any cluster. We show that the
approximation ratio is optimal unless P = NP, by providing a matching
lower bound. Several extensions to our algorithm have also been developed for
handling outliers. This problem is mainly motivated by applications in
privacy-preserving data publication.
Comment: Extended abstract accepted in ICALP 2010. Keywords: Approximation
algorithm, k-center, k-anonymity, l-diversit
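The two ingredients of the problem, the diversity constraint and the maximum-radius objective, can be checked for a given clustering as follows. This is only a feasibility and cost checker, not the 2-approximation algorithm; the names `is_diverse`, `max_radius`, and `min_size` (the lower bound on cluster size) are hypothetical, and the sketch assumes Euclidean points with one designated center per cluster.

```python
import math
from collections import defaultdict

def is_diverse(labels, colors, min_size):
    """True if every cluster has at least min_size points, all with distinct colors."""
    clusters = defaultdict(list)
    for label, color in zip(labels, colors):
        clusters[label].append(color)
    return all(len(c) >= min_size and len(set(c)) == len(c)
               for c in clusters.values())

def max_radius(points, labels, centers):
    """Largest distance from any point to the center of its own cluster."""
    return max(math.dist(p, centers[label]) for p, label in zip(points, labels))

# Hypothetical instance: two groups on a line, three colors each.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
colors = ["red", "green", "blue", "red", "green", "blue"]
centers = {0: (1, 0), 1: (11, 0)}

good_labels = [0, 0, 0, 1, 1, 1]
bad_labels = [0, 0, 1, 0, 1, 1]   # cluster 0 would contain two red points

print(is_diverse(good_labels, colors, min_size=3))
print(is_diverse(bad_labels, colors, min_size=3))
print(max_radius(points, good_labels, centers))
```

In the privacy setting mentioned in the abstract, a color would typically stand for a sensitive attribute value, so the distinct-colors condition prevents a cluster from revealing that value.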
Graph Cuts with Arbitrary Size Constraints Through Optimal Transport
A common way of partitioning graphs is through minimum cuts. One drawback of
classical minimum cut methods is that they tend to produce small groups, which
is why more balanced variants such as normalized and ratio cuts have seen more
success. However, we believe that with these variants, the balance constraints
can be too restrictive for some applications, such as clustering imbalanced
datasets, while not being restrictive enough when searching for perfectly
balanced partitions. Here, we propose a new graph cut algorithm for
partitioning graphs under arbitrary size constraints. We formulate the graph
cut problem as a regularized Gromov-Wasserstein problem. We then propose to
solve it using an accelerated proximal gradient descent (GD) algorithm, which has
global convergence guarantees, results in sparse solutions, and only incurs an additional ratio of
compared to the classical spectral clustering algorithm
but was seen to be more efficient
Data clustering with cluster size constraints using a modified k-means algorithm
2014-2015 > Academic research: refereed > Refereed conference paper. Accepted Manuscript. Publishe