43 research outputs found
Coresets for Clustering in Graphs of Bounded Treewidth
We initiate the study of coresets for clustering in graph metrics, i.e., the
shortest-path metric of edge-weighted graphs. Such clustering problems are
essential to data analysis and arise, for example, in road networks and data
visualization. A coreset is a compact summary of the data that approximately
preserves the clustering objective for every possible center set, and it offers
significant efficiency improvements in terms of running time, storage, and
communication, including in streaming and distributed settings. Our main result
is a near-linear time construction of a coreset for $k$-Median in a general
graph $G$, whose size is bounded in terms of $k$, the accuracy parameter
$\epsilon$, and the treewidth of $G$, and we complement the construction with a nearly-tight size
lower bound. The construction is based on the framework of Feldman and Langberg
[STOC 2011], and our main technical contribution, as required by this
framework, is a uniform bound on the shattering
dimension under any point weights. We validate our coreset on real-world road
networks, and our scalable algorithm constructs tiny coresets with high
accuracy, which translates to a massive speedup of existing approximation
algorithms such as local search for graph $k$-Median.
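To make the coreset guarantee above concrete, here is a minimal sketch (not the paper's construction): a uniformly sampled, re-weighted subset of 1-D points is compared against the full data on arbitrary $k$-Median center sets. The point set and all parameters are made up for illustration.

```python
import random

def kmedian_cost(points, weights, centers):
    # Weighted k-Median objective: sum of w(p) * dist(p, nearest center), 1-D metric.
    return sum(w * min(abs(p - c) for c in centers) for p, w in zip(points, weights))

random.seed(0)
points = [random.uniform(0, 100) for _ in range(1000)]
full_w = [1.0] * len(points)

# Toy "coreset": m uniform samples, each re-weighted by n/m so totals match.
m = 200
coreset = random.sample(points, m)
coreset_w = [len(points) / m] * m

# A coreset must approximate the cost for EVERY center set, so probe several.
for _ in range(5):
    centers = [random.uniform(0, 100) for _ in range(3)]
    exact = kmedian_cost(points, full_w, centers)
    approx = kmedian_cost(coreset, coreset_w, centers)
    print(f"exact={exact:9.1f}  coreset={approx:9.1f}  ratio={approx / exact:.3f}")
```

The paper's construction instead uses importance sampling via the Feldman–Langberg framework; uniform sampling merely illustrates the defining property that a single weighted summary must serve all center sets at once.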
A New Coreset Framework for Clustering
Given a metric space, the $(k,z)$-clustering problem consists of finding $k$
centers such that the sum of the distances, raised to the power $z$, of every
point to its closest center is minimized. This encapsulates the famous
$k$-median ($z=1$) and $k$-means ($z=2$) clustering problems. Designing
small-space sketches of the data that approximately preserve the cost of the
solutions, also known as \emph{coresets}, has been an important research
direction over the last 15 years.
In this paper, we present a new, simple coreset framework that simultaneously
improves upon the best known bounds for a large variety of settings, ranging
from Euclidean spaces and doubling metrics to minor-free metrics and the
general metric case.
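The $(k,z)$-objective can be written down in a few lines; the tiny 1-D instance below is hypothetical and only shows how $z=1$ and $z=2$ recover $k$-median and $k$-means.

```python
def clustering_cost(points, centers, z):
    # (k, z)-clustering objective: sum over points of dist(p, nearest center) ** z.
    return sum(min(abs(p - c) for c in centers) ** z for p in points)

points = [0.0, 1.0, 2.0, 10.0, 11.0, 14.0]
centers = [1.0, 11.0]  # k = 2 centers

print(clustering_cost(points, centers, z=1))  # k-median cost: 6.0
print(clustering_cost(points, centers, z=2))  # k-means cost: 12.0
```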
Towards Optimal Coreset Construction for $(k,z)$-Clustering: Breaking the Quadratic Dependency on $k$
Constructing small-sized coresets for various clustering problems has
attracted significant attention recently. We provide efficient coreset
construction algorithms for $(k,z)$-Clustering with improved coreset sizes in
several metric spaces. In particular, we provide a coreset for
$(k,z)$-Clustering for all $z$ in Euclidean space, improving upon the best
known size upper bound [Cohen-Addad, Larsen, Saulpic, Schwiegelshohn,
STOC'22] and breaking the quadratic dependency on $k$ for the first time in
the relevant parameter regime. For example, our coreset sizes for Euclidean
$k$-Median and $k$-Means improve upon the best known results in the
corresponding regimes. We also obtain optimal or
improved coreset sizes for general metric space, metric space with bounded
doubling dimension, and shortest path metric when the underlying graph has
bounded treewidth, for all $z$. Our algorithm largely follows the
framework developed by Cohen-Addad et al. with some minor but useful changes.
Our technical contribution mainly lies in the analysis. An important
improvement in our analysis is a new notion of $\epsilon$-covering of distance
vectors with a novel error metric, which allows us to provide a tighter
variance bound. Another useful technical ingredient is terminal embedding with
additive errors, for bounding the covering number in the Euclidean case.
Coresets for Clustering: Foundations and Challenges
Clustering is a fundamental task in machine learning and data analysis. The main challenge for clustering in big data sets is that classical clustering algorithms often do not scale well. Coresets are data reduction techniques that turn big data into a tiny proxy. Prior research has shown that coresets can provide a scalable solution to clustering problems and imply streaming and distributed algorithms. In this work, we aim to solve a fundamental question and two modern challenges in coresets for clustering.
\textsf{Beyond Euclidean Space}: Coresets for clustering in Euclidean space have been well studied, and coresets of constant size are known to exist. In contrast, very few results are known beyond Euclidean space, which raises a fundamental question: what kinds of metric spaces admit constant-sized coresets for clustering? We focus on graph metrics, a common ambient space for clustering. We provide positive results asserting that constant-sized coresets exist in various families of graph metrics, including graphs of bounded treewidth, planar graphs, and the more general excluded-minor graphs.
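For intuition, the graph (shortest-path) metric in question can be materialized explicitly on a toy graph. The sketch below is illustrative only, with made-up edge weights, and uses a plain Floyd–Warshall computation rather than anything from the work; it then evaluates a $k$-Median cost in the resulting metric.

```python
INF = float("inf")

def shortest_path_metric(n, weighted_edges):
    # Floyd-Warshall: all-pairs shortest paths on an undirected weighted graph.
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in weighted_edges:
        d[u][v] = d[v][u] = min(d[u][v], w)
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d

def kmedian_cost(d, centers):
    # k-Median objective in the shortest-path metric, centers at vertices.
    return sum(min(d[p][c] for c in centers) for p in range(len(d)))

# A tiny "road network": a path 0-1-2-3 plus a shortcut edge 0-3.
edges = [(0, 1, 1.0), (1, 2, 1.0), (2, 3, 1.0), (0, 3, 1.5)]
d = shortest_path_metric(4, edges)

print(kmedian_cost(d, centers=[1]))     # one center at node 1 -> 4.0
print(kmedian_cost(d, centers=[0, 2]))  # two centers -> 2.0
```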
\textsf{Missing Values}: Missing values are a common phenomenon in real data sets, and clustering in their presence is a very challenging task. In this work, we construct the first coresets for clustering with multiple missing values. Previously, such coresets were only known to exist when each data point has at most one missing value \cite{DBLP:conf/nips/MaromF19}. We further design a near-linear time algorithm to construct our coresets. This algorithm implies the first near-linear time approximation scheme for \kMeans clustering with missing values and improves upon a recent result of
\cite{DBLP:conf/soda/EibenFGLPS21}.
\textsf{Simultaneous Coresets}: Most classical coresets are limited to a specific clustering objective. When there are multiple potential objectives, a stronger notion of ``simultaneous coresets'' is needed. Simultaneous coresets provide approximation guarantees for a whole family of objectives and can serve as a more flexible data-reduction tool. In this work, we design the first simultaneous coresets for a large clustering family that includes both \kMedian and \kCenter.
A Survey on Approximation in Parameterized Complexity: Hardness and Algorithms
Parameterization and approximation are two popular ways of coping with
NP-hard problems. More recently, the two have also been combined to derive many
interesting results. We survey developments in the area both from the
algorithmic and hardness perspectives, with emphasis on new techniques and
potential future research directions.
Applied Randomized Algorithms for Efficient Genomic Analysis
The scope and scale of biological data continue to grow at an exponential clip, driven by advances in genetic sequencing, annotation, and widespread adoption of surveillance efforts. For instance, the Sequence Read Archive (SRA) now contains more than 25 petabases of public data, while RefSeq, a collection of reference genomes, recently surpassed 100,000 complete genomes. In the process, biological data has outgrown the practical reach of many traditional algorithmic approaches in both time and space.
Motivated by this extreme scale, this thesis details efficient methods for clustering and summarizing large collections of sequence data. While our primary area of interest is biological sequences, these approaches largely apply to sequence collections of any type, including natural language, software source code, and graph structured data.
We applied recent advances in randomized algorithms to practical problems. We used MinHash, an example of Locality-Sensitive Hashing, and HyperLogLog, a cardinality-estimation sketch, as well as coresets, which are approximate representations for finite-sum problems, to build methods capable of scaling to billions of items. Ultimately, these are all derived from variations on sampling.
We combined these advances with hardware-based optimizations and incorporated them into free and open-source software libraries (sketch, frp, libsimdsampling) and practical software tools built on these libraries (Dashing, Minicore, Dashing 2), empowering users to interact practically with colossal datasets on commodity hardware.
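As one concrete example of the sketching techniques mentioned above, here is a minimal MinHash sketch for estimating Jaccard similarity between sets. This is a toy re-implementation, not code from the thesis's sketch library, and the salted built-in `hash` family is an illustrative stand-in for proper random hash functions.

```python
import random

def minhash_signature(items, num_hashes=128, seed=1):
    # One salted hash function per signature row; keep the minimum hash value.
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    # Pr[min-hashes agree] equals the Jaccard similarity of the underlying sets.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = set(range(0, 80))
b = set(range(40, 120))  # true Jaccard = 40 / 120 = 1/3
sig_a = minhash_signature(a)
sig_b = minhash_signature(b)
print(estimated_jaccard(sig_a, sig_b))
```

With 128 hash functions the estimate concentrates around the true Jaccard similarity of 1/3, with standard deviation roughly $\sqrt{p(1-p)/128} \approx 0.04$.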
The Power of Uniform Sampling for Coresets
Motivated by practical generalizations of the classic $k$-median and
$k$-means objectives, such as clustering with size constraints, fair
clustering, and Wasserstein barycenter, we introduce a meta-theorem for
designing coresets for constrained-clustering problems. The meta-theorem
reduces the task of coreset construction to one on a bounded number of ring
instances with a much-relaxed additive error. This reduction enables us to
construct coresets using uniform sampling, in contrast to the widely-used
importance sampling, and consequently we can easily handle constrained
objectives. Notably and perhaps surprisingly, this simpler sampling scheme can
yield coresets whose size is independent of $n$, the number of input points.
Our technique yields smaller coresets, and sometimes the first coresets, for
a large number of constrained clustering problems, including capacitated
clustering, fair clustering, Euclidean Wasserstein barycenter, clustering in
minor-excluded graph, and polygon clustering under Fr\'{e}chet and Hausdorff
distance. Finally, our technique also yields smaller coresets for $k$-median
in low-dimensional Euclidean spaces.
Fair Correlation Clustering in Forests
The study of algorithmic fairness has received growing attention recently. This stems from the awareness that bias in the input data for machine learning systems may result in discriminatory outputs. For clustering tasks, one of the most central notions of fairness is the formalization by Chierichetti, Kumar, Lattanzi, and Vassilvitskii [NeurIPS 2017]. A clustering is said to be fair if each cluster has the same distribution of manifestations of a sensitive attribute as the whole input set. This is motivated by various applications where the objects to be clustered have sensitive attributes that should not be over- or underrepresented. Most research on this version of fair clustering has focused on centroid-based objectives.
In contrast, we discuss the applicability of this fairness notion to Correlation Clustering. The existing literature on the resulting Fair Correlation Clustering problem either presents approximation algorithms with poor approximation guarantees or severely limits the possible distributions of the sensitive attribute (often only two manifestations with a 1:1 ratio are considered). Our goal is to understand if there is hope for better results in between these two extremes. To this end, we consider restricted graph classes which allow us to characterize the distributions of sensitive attributes for which this form of fairness is tractable from a complexity point of view.
While existing work on Fair Correlation Clustering gives approximation algorithms, we focus on exact solutions and investigate whether there are efficiently solvable instances. The unfair version of Correlation Clustering is trivial on forests, but adding fairness creates a surprisingly rich picture of complexities. We give an overview of the distributions and types of forests where Fair Correlation Clustering turns from tractable to intractable.
As the most surprising insight, we find that the cause of the hardness of Fair Correlation Clustering is not the strictness of the fairness condition: we lift most of our results to also hold for the relaxed version of the fairness condition. Instead, the source of hardness seems to be the distribution of the sensitive attribute. On the positive side, we identify some reasonable distributions that are indeed tractable. While this tractability is only shown for forests, it may open an avenue to designing reasonable approximations for larger graph classes.
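The fairness notion used here (each cluster reproduces the whole input's distribution of the sensitive attribute) is easy to state as a check. The sketch below is an illustrative implementation on made-up data, not code from the paper; exact `Fraction` arithmetic avoids floating-point comparisons.

```python
from collections import Counter
from fractions import Fraction

def attribute_distribution(colors):
    # Relative frequency of each manifestation of the sensitive attribute.
    n = len(colors)
    return {c: Fraction(cnt, n) for c, cnt in Counter(colors).items()}

def is_fair(clusters, color_of):
    # Fair iff every cluster matches the global distribution exactly.
    all_points = [p for cl in clusters for p in cl]
    global_dist = attribute_distribution([color_of[p] for p in all_points])
    return all(
        attribute_distribution([color_of[p] for p in cl]) == global_dist
        for cl in clusters
    )

# Made-up instance: six points, sensitive attribute in a 1:1 red/blue ratio.
color_of = {0: "red", 1: "blue", 2: "red", 3: "blue", 4: "red", 5: "blue"}
print(is_fair([[0, 1], [2, 3], [4, 5]], color_of))  # True: 1:1 in every cluster
print(is_fair([[0, 2], [1, 3], [4, 5]], color_of))  # False: an all-red cluster
```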