23 research outputs found
Fully dynamic clustering and diversity maximization in doubling metrics
We present approximation algorithms for some variants of center-based
clustering and related problems in the fully dynamic setting, where the
pointset evolves through an arbitrary sequence of insertions and deletions.
Specifically, we target the following problems: $k$-center (with and without
outliers), matroid-center, and diversity maximization. All algorithms employ a
coreset-based strategy and rely on the use of the cover tree data structure,
which we crucially augment to maintain, at any time, some additional
information enabling the efficient extraction of the solution for the specific
problem. For all of the aforementioned problems, our algorithms yield
$(\alpha+\varepsilon)$-approximations, where $\alpha$ is the best known
approximation attainable in polynomial time in the standard off-line setting
(except for $k$-center with $z$ outliers, where $\alpha = 2$ but we get a
$(3+\varepsilon)$-approximation) and $\varepsilon > 0$ is a user-provided
accuracy parameter. The analysis of the algorithms is performed in terms of the
doubling dimension of the underlying metric. Remarkably, and unlike previous
works, the data structure and the running times of the insertion and deletion
procedures do not depend in any way on the accuracy parameter $\varepsilon$
and, for the two $k$-center variants, on the parameter $k$. For spaces of
bounded doubling dimension, the running times are dramatically smaller than
those that would be required to compute solutions on the entire pointset from
scratch. To the best of our knowledge, ours are the first solutions for the
matroid-center and diversity maximization problems in the fully dynamic
setting.
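The abstract describes extracting solutions from a coreset but includes no pseudocode. As a rough illustration of the off-line extraction step, the following minimal Python sketch runs Gonzalez's classical greedy farthest-point heuristic, a well-known 2-approximation for $k$-center, on a small coreset; the function name and toy data are ours, and the paper's actual extraction procedure, cover-tree maintenance, and outlier/matroid variants are not modeled here.

```python
import math

def kcenter_greedy(points, k):
    # Gonzalez's farthest-point heuristic: a classical 2-approximation
    # for k-center. Running it on a small coreset is cheap compared to
    # re-clustering the entire pointset after every update.
    centers = [points[0]]
    # dist[i] = distance from points[i] to its closest chosen center
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        far = max(range(len(points)), key=lambda i: dist[i])  # farthest point
        centers.append(points[far])
        dist = [min(d, math.dist(p, points[far])) for p, d in zip(points, dist)]
    return centers

# toy coreset standing in for the one maintained by the augmented cover tree
coreset = [(0.0, 0.0), (1.0, 0.2), (5.0, 5.0), (5.2, 4.9), (9.0, 1.0)]
print(kcenter_greedy(coreset, k=3))
```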
Improved Approximation and Scalability for Fair Max-Min Diversification
Given an $n$-point metric space $(X, d)$ where each point belongs to one of
$m$ different categories or groups and a set of integers $k_1, \ldots, k_m$,
the fair Max-Min diversification problem is to select $k_i$ points belonging
to category $i \in [m]$, such that the minimum pairwise
distance between selected points is maximized. The problem was introduced by
Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample
large data sets in various applications so that the derived sample achieves a
balance over diversity, i.e., the minimum distance between a pair of selected
points, and fairness, i.e., ensuring enough points of each category are
included. We prove the following results:
1. We first consider general metric spaces. We present a randomized
polynomial time algorithm that returns a factor $2$-approximation to the
diversity but only satisfies the fairness constraints in expectation. Building
upon this result, we present a $6$-approximation that is guaranteed to satisfy
the fairness constraints up to a factor $1-\varepsilon$ for any constant
$\varepsilon$. We also present a linear time algorithm returning an $(m+1)$-approximation
with exact fairness. The best previous result was a $3m-1$
approximation.
2. We then focus on Euclidean metrics. We first show that the problem can be
solved exactly in one dimension. For constant dimensions, categories, and any
constant $\varepsilon > 0$, we present a $(1+\varepsilon)$-approximation algorithm that
runs in $O(nk) + 2^{O(k)}$ time, where $k = k_1 + \cdots + k_m$. We can improve the
running time to $O(nk) + \mathrm{poly}(k)$ at the expense of only picking
$(1-\varepsilon)k_i$ points from category $i$.
Finally, we present algorithms suitable for processing massive data sets,
including single-pass data stream algorithms and composable coresets for
distributed processing.
Comment: To appear in ICDT 2022.
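As a concrete, hedged illustration of the problem itself (not of the paper's algorithms or their guarantees), the sketch below greedily picks, among categories with remaining quota, the point farthest from the current selection; all names and data are ours.

```python
import math

def fair_max_min_greedy(points, labels, quotas):
    # quotas: dict mapping each category to its k_i.
    # Greedy heuristic for illustration only: it enforces the quotas
    # exactly but carries none of the paper's approximation guarantees.
    remaining = dict(quotas)
    selected = []

    def dist_to_selection(p):
        return min((math.dist(p, points[j]) for j in selected),
                   default=float("inf"))

    for _ in range(sum(quotas.values())):
        # candidates: unselected points whose category still has quota
        cands = [i for i in range(len(points))
                 if i not in selected and remaining[labels[i]] > 0]
        best = max(cands, key=lambda i: dist_to_selection(points[i]))
        selected.append(best)
        remaining[labels[best]] -= 1
    return [points[i] for i in selected]

pts = [(0, 0), (1, 0), (4, 4), (4, 5), (8, 0), (8, 1)]
grp = ["a", "a", "b", "a", "b", "b"]
print(fair_max_min_greedy(pts, grp, {"a": 2, "b": 2}))
```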
Streaming Algorithms for Diversity Maximization with Fairness Constraints
Diversity maximization is a fundamental problem with wide applications in
data summarization, web search, and recommender systems. Given a set $X$ of
$n$ elements, it asks to select a subset $S$ of $k$ elements with maximum
\emph{diversity}, as quantified by the dissimilarities among the elements in
$S$. In this paper, we focus on the diversity maximization problem with
fairness constraints in the streaming setting. Specifically, we consider the
max-min diversity objective, which selects a subset $S$ that maximizes the
minimum distance (dissimilarity) between any pair of distinct elements within
it. Assuming that the set $X$ is partitioned into $m$ disjoint groups by some
sensitive attribute, e.g., sex or race, ensuring \emph{fairness} requires that
the selected subset $S$ contains $k_i$ elements from each group $i \in [m]$.
A streaming algorithm should process $X$ sequentially in one pass and return a
subset with maximum \emph{diversity} while guaranteeing the fairness
constraint. Although diversity maximization has been extensively studied, the
only known algorithms that can work with the max-min diversity objective and
fairness constraints are very inefficient for data streams. Since diversity
maximization is NP-hard in general, we propose two approximation algorithms for
fair diversity maximization in data streams, the first of which is
$\frac{1-\varepsilon}{4}$-approximate and specific for $m = 2$, where
$\varepsilon \in (0, 1)$, and the second of which achieves a
$\frac{1-\varepsilon}{3m+2}$-approximation for an arbitrary $m$. Experimental
results on real-world and synthetic datasets show that both algorithms provide
solutions of comparable quality to the state-of-the-art algorithms while
running several orders of magnitude faster in the streaming setting.
Comment: 13 pages, 11 figures; published in ICDE 2022.
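For intuition about the streaming setting, here is a minimal one-pass thresholding sketch, a standard building block for streaming max-min diversification rather than the paper's algorithms: it keeps an element iff it lies at least a guessed distance $\tau$ from everything kept so far, and in practice one runs a geometric grid of $\tau$ values in parallel and returns the best feasible set. Fairness handling, the paper's actual contribution, is omitted; all names and data are ours.

```python
import math

def threshold_pass(stream, k, tau):
    # Keep an arriving element iff it is >= tau away from every element
    # kept so far; stop once k elements are kept. Any kept set has
    # pairwise distances >= tau by construction.
    kept = []
    for x in stream:
        if len(kept) == k:
            break
        if all(math.dist(x, y) >= tau for y in kept):
            kept.append(x)
    return kept

stream = [(0.0, 0.0), (0.1, 0.0), (3.0, 4.0), (6.0, 0.0), (0.0, 6.0)]
print(threshold_pass(stream, k=3, tau=2.5))
```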
Coresets for Clustering with General Assignment Constraints
Designing small-sized \emph{coresets}, which approximately preserve the costs
of the solutions for large datasets, has been an important research direction
for the past decade. We consider coreset construction for a variety of general
constrained clustering problems. We significantly extend and generalize the
results of a very recent paper (Braverman et al., FOCS'22), by demonstrating
that the idea of hierarchical uniform sampling (Chen, SICOMP'09; Braverman et
al., FOCS'22) can be applied to efficiently construct coresets for a very
general class of constrained clustering problems with general assignment
constraints, including capacity constraints on cluster centers, and assignment
structure constraints for data points (modeled by a convex body $\mathcal{B}$).
Our main theorem shows that a small-sized $\epsilon$-coreset exists as long
as a complexity measure $\mathsf{Lip}(\mathcal{B})$ of the structure
constraint, and the \emph{covering exponent} $\Lambda_\epsilon(\mathcal{M})$
for the metric space $\mathcal{M}$, are bounded. The complexity measure
$\mathsf{Lip}(\mathcal{B})$ for a convex body $\mathcal{B}$ is the Lipschitz
constant of a certain transportation problem constrained in $\mathcal{B}$,
called the \emph{optimal assignment transportation problem}. We prove nontrivial
upper bounds on $\mathsf{Lip}(\mathcal{B})$ for various polytopes, including
the general matroid basis polytopes and laminar matroid polytopes (with a
better bound). As an application of our general theorem, we construct the first
coreset for the fault-tolerant clustering problem (with or without capacity
upper/lower bound) for the above metric spaces, in which the fault-tolerance
requirement is captured by a uniform matroid basis polytope.
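For readers unfamiliar with hierarchical uniform sampling, the sketch below shows the basic pattern in the spirit of Chen (SICOMP'09): partition points by closest center and geometric distance ring, then take a weighted uniform sample from each part. It is only a caricature; the paper's contribution, proving that such samples remain valid under general assignment constraints via the $\mathsf{Lip}(\mathcal{B})$ machinery, is not modeled, and all names and parameters here are ours.

```python
import math
import random

def hierarchical_uniform_sample(points, centers, per_part=10):
    # Partition points by (closest center, geometric distance ring),
    # then uniformly sample each part, weighting samples so each part's
    # total weight equals its size.
    parts = {}
    for p in points:
        c = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        d = math.dist(p, centers[c])
        ring = 0 if d < 1.0 else int(math.log2(d)) + 1
        parts.setdefault((c, ring), []).append(p)
    coreset = []  # list of (point, weight) pairs
    for part in parts.values():
        m = min(per_part, len(part))
        coreset.extend((q, len(part) / m) for q in random.sample(part, m))
    return coreset

pts = [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(200)]
print(len(hierarchical_uniform_sample(pts, centers=[(2, 2), (8, 8)])))
```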
The Power of Randomization: Distributed Submodular Maximization on Massive Datasets
A wide variety of problems in machine learning, including exemplar
clustering, document summarization, and sensor placement, can be cast as
constrained submodular maximization problems. Unfortunately, the resulting
submodular optimization problems are often too large to be solved on a single
machine. We develop a simple distributed algorithm that is embarrassingly
parallel and achieves provable, constant-factor, worst-case approximation
guarantees. In our experiments, we demonstrate its efficiency on large problems
with different kinds of constraints, with objective values always close to what
is achievable in the centralized setting.
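The abstract does not spell out the algorithm; the sketch below shows the two-round pattern this line of work follows (randomly partition the data, run greedy locally in parallel, then run greedy on the union of the local solutions and keep the best candidate), with a toy coverage objective. Treat it as our reading of the approach, not the paper's exact procedure.

```python
import random

def greedy(f, ground, k):
    # Standard greedy for monotone submodular maximization under a
    # cardinality constraint: repeatedly add the element of largest
    # marginal gain.
    S = []
    for _ in range(k):
        best = max((x for x in ground if x not in S),
                   key=lambda x: f(S + [x]) - f(S), default=None)
        if best is None:
            break
        S.append(best)
    return S

def distributed_greedy(f, ground, k, machines):
    # Round 1: random partition, one greedy run per machine (the runs are
    # independent, hence embarrassingly parallel). Round 2: greedy on the
    # union of the local solutions; return the best candidate seen.
    random.shuffle(ground)
    parts = [ground[i::machines] for i in range(machines)]
    local_sols = [greedy(f, part, k) for part in parts]
    merged = [x for sol in local_sols for x in sol]
    return max(local_sols + [greedy(f, merged, k)], key=f)

# toy instance: f(S) = number of distinct items covered by the sets in S
items = [set(random.sample(range(30), 5)) for _ in range(60)]
cover = lambda S: len(set().union(*S)) if S else 0
print(cover(distributed_greedy(cover, items, k=4, machines=3)))
```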
GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning
Large-scale machine learning and deep models are extremely data-hungry.
Unfortunately, obtaining large amounts of labeled data is expensive, and
training state-of-the-art models (with hyperparameter tuning) requires
significant computing resources and time. Moreover, real-world data is noisy
and imbalanced. As a result, several recent papers try to make the training
process more efficient and robust. However, most existing work focuses on
either robustness or efficiency, but not both. In this work, we introduce Glister,
a GeneraLIzation based data Subset selecTion for Efficient and Robust learning
framework. We formulate Glister as a mixed discrete-continuous bi-level
optimization problem to select a subset of the training data, which maximizes
the log-likelihood on a held-out validation set. Next, we propose an iterative
online algorithm Glister-Online, which performs data selection iteratively
along with the parameter updates and can be applied to any loss-based learning
algorithm. We then show that for a rich class of loss functions including
cross-entropy, hinge-loss, squared-loss, and logistic-loss, the inner discrete
data selection is an instance of (weakly) submodular optimization, and we
analyze conditions for which Glister-Online reduces the validation loss and
converges. Finally, we propose Glister-Active, an extension to batch active
learning, and we empirically demonstrate the performance of Glister on a wide
range of tasks, including (a) data selection to reduce training time, (b)
robust learning under label noise and imbalance settings, and (c) batch-active
learning with several deep and shallow models. We show that our framework
improves upon the state of the art in both efficiency and accuracy (in cases
(a) and (c)) and is more efficient than other state-of-the-art robust
learning algorithms in case (b).
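To make the one-step idea concrete, here is a minimal sketch of the first-order reasoning behind Glister-Online as we read it (not the authors' code): after a gradient step on a chosen subset, the change in validation loss is approximately proportional to the inner product between each example's training gradient and the validation gradient, so under this linear approximation the greedy inner selection reduces to a top-$k$. The per-example gradients are assumed to be given (in practice, last-layer gradients).

```python
import numpy as np

def one_step_selection(train_grads, val_grad, budget):
    # train_grads: (n, d) per-example training-loss gradients.
    # val_grad:    (d,)  gradient of the validation loss.
    # A gradient step on example i moves parameters along -train_grads[i],
    # decreasing the validation loss by about <train_grads[i], val_grad>
    # (times the step size, which does not affect the ranking).
    gains = train_grads @ val_grad
    return np.argsort(-gains)[:budget]  # top-k examples by estimated gain

rng = np.random.default_rng(0)
G = rng.normal(size=(100, 8))   # toy per-example gradients
g_val = rng.normal(size=8)      # toy validation gradient
print(one_step_selection(G, g_val, budget=10))
```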
New perspectives and applications for greedy algorithms in machine learning
Approximating probability densities is a core problem in Bayesian statistics, where inference involves the computation of a posterior distribution. Variational Inference (VI) is a technique to approximate posterior distributions through optimization. It involves specifying a set of tractable densities, out of which the final approximation is to be chosen. While VI is traditionally motivated by the goal of tractability, the focus in this dissertation is to use Bayesian approximation to obtain parsimonious distributions. With this goal in mind, we develop greedy algorithm variants and study their theoretical properties by establishing novel connections between the resulting optimization problems in parsimonious VI and traditional studies in the discrete optimization literature. Specific realizations lead to efficient solutions for many sparse probabilistic models, such as sparse regression, sparse PCA, and sparse Collective Matrix Factorization (CMF). For cases where existing results are insufficient to provide acceptable approximation guarantees, we extend the optimization results for some large-scale algorithms to a much larger class of functions. The developed methods are applied to both simulated and real-world datasets, including high-dimensional functional Magnetic Resonance Imaging (fMRI) datasets, and to the real-world tasks of interpreting data exploration and model predictions.
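As a hedged example of the kind of greedy variant the dissertation studies, the sketch below runs greedy forward selection for sparse regression; guarantees for such schemes are usually obtained via (weak) submodularity of the goodness-of-fit set function. The code is a generic textbook routine on synthetic data, not the dissertation's implementation.

```python
import numpy as np

def forward_selection(X, y, k):
    # Greedy forward selection: at each step add the feature that most
    # reduces the residual sum of squares of the least-squares fit.
    S = []
    for _ in range(k):
        best, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in S:
                continue
            A = X[:, S + [j]]
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            rss = np.sum((y - A @ coef) ** 2)
            if rss < best_rss:
                best, best_rss = j, rss
        S.append(best)
    return S

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 10))
y = X[:, [2, 7]] @ np.array([3.0, -2.0]) + 0.1 * rng.normal(size=50)
print(forward_selection(X, y, k=2))  # should recover features [2, 7]
```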