Fully dynamic clustering and diversity maximization in doubling metrics
We present approximation algorithms for some variants of center-based
clustering and related problems in the fully dynamic setting, where the
pointset evolves through an arbitrary sequence of insertions and deletions.
Specifically, we target the following problems: k-center (with and without
outliers), matroid-center, and diversity maximization. All algorithms employ a
coreset-based strategy and rely on the use of the cover tree data structure,
which we crucially augment to maintain, at any time, some additional
information enabling the efficient extraction of the solution for the specific
problem. For all of the aforementioned problems, our algorithms yield
(c + ε)-approximations, where c is the best approximation factor known to be
attainable in polynomial time in the standard off-line setting (except for
k-center with outliers, where the factor we obtain is slightly larger than the
best off-line one) and ε is a user-provided accuracy parameter. The analysis
of the algorithms is performed in terms of the
doubling dimension of the underlying metric. Remarkably, and unlike previous
works, the data structure and the running times of the insertion and deletion
procedures do not depend in any way on the accuracy parameter ε and, for the
two k-center variants, on the parameter k. For spaces of
bounded doubling dimension, the running times are dramatically smaller than
those that would be required to compute solutions on the entire pointset from
scratch. To the best of our knowledge, ours are the first solutions for the
matroid-center and diversity maximization problems in the fully dynamic
setting.
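For context, the off-line k-center baseline that such dynamic schemes are measured against can be sketched with Gonzalez's greedy farthest-point heuristic, a classic 2-approximation in any metric space. This is an illustrative sketch only (the function name is hypothetical), not the paper's cover-tree construction:

```python
import math

def gonzalez_k_center(points, k):
    """Greedy farthest-point heuristic: a classic 2-approximation for
    k-center in any metric space (Euclidean distance used here)."""
    centers = [points[0]]                      # arbitrary first center
    while len(centers) < k:
        # add the point farthest from its nearest chosen center
        farthest = max(points,
                       key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(farthest)
    # covering radius achieved by the chosen centers
    radius = max(min(math.dist(p, c) for c in centers) for p in points)
    return centers, radius
```

Each greedy step costs O(nk) distance evaluations, which is exactly the from-scratch cost the dynamic data structures above are designed to avoid per update.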
Improved Approximation and Scalability for Fair Max-Min Diversification
Given an n-point metric space where each point belongs to one of m different
categories or groups, and a set of integers k_1, ..., k_m, the fair Max-Min
diversification problem is to select k_i points belonging to each category i,
such that the minimum pairwise distance between selected points is maximized.
The problem was introduced by
Moumoulidou et al. [ICDT 2021] and is motivated by the need to down-sample
large data sets in various applications so that the derived sample achieves a
balance over diversity, i.e., the minimum distance between a pair of selected
points, and fairness, i.e., ensuring enough points of each category are
included. We prove the following results:
1. We first consider general metric spaces. We present a randomized
polynomial time algorithm that returns a constant-factor approximation to the
diversity but only satisfies the fairness constraints in expectation. Building
upon this result, we present a constant-factor approximation that is
guaranteed to satisfy the fairness constraints up to a factor of 1 − ε for any
constant ε > 0. We also present a linear time algorithm, with an approximation
factor that grows with the number of categories m, that achieves exact
fairness and improves on the best previously known factor.
2. We then focus on Euclidean metrics. We first show that the problem can be
solved exactly in one dimension. For a constant number of dimensions and
categories and any constant ε > 0, we present a (1 + ε)-approximation
algorithm. The running time can be further improved at the expense of picking
slightly fewer than k_i points from category i.
Finally, we present algorithms suitable for processing massive data sets,
including single-pass data stream algorithms and composable coresets for
distributed processing.
Comment: To appear in ICDT 202
Streaming Algorithms for Diversity Maximization with Fairness Constraints
Diversity maximization is a fundamental problem with wide applications in
data summarization, web search, and recommender systems. Given a set V of n
elements, it asks to select a subset S of k elements with maximum
diversity, as quantified by the dissimilarities among the elements in
S. In this paper, we focus on the diversity maximization problem with
fairness constraints in the streaming setting. Specifically, we consider the
max-min diversity objective, which selects a subset S that maximizes the
minimum distance (dissimilarity) between any pair of distinct elements within
it. Assuming that the set V is partitioned into m disjoint groups by some
sensitive attribute, e.g., sex or race, ensuring fairness requires that
the selected subset contains k_i elements from each group i.
A streaming algorithm should process V sequentially in one pass and return a
subset with maximum diversity while guaranteeing the fairness
constraint. Although diversity maximization has been extensively studied, the
only known algorithms that can work with the max-min diversity objective and
fairness constraints are very inefficient for data streams. Since diversity
maximization is NP-hard in general, we propose two approximation algorithms for
fair diversity maximization in data streams: the first applies to the special
case of m = 2 groups, and the second achieves an approximation guarantee,
parameterized by an accuracy value ε, for an arbitrary number of groups m.
Experimental
results on real-world and synthetic datasets show that both algorithms provide
solutions of comparable quality to the state-of-the-art algorithms while
running several orders of magnitude faster in the streaming setting.
Comment: 13 pages, 11 figures; published in ICDE 202
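A common building block behind streaming max-min diversification (shown here only as an illustrative sketch, not as either of the paper's two algorithms) keeps an arriving element whenever it is sufficiently far from everything kept so far:

```python
import math

def threshold_stream(stream, k, tau):
    """One pass over the stream: keep an element iff it is at distance
    >= tau from everything kept so far, up to k elements. Streaming
    diversity algorithms typically run many such instances in parallel
    with geometrically spaced thresholds tau and return the best."""
    kept = []
    for p in stream:
        if len(kept) == k:
            break  # quota reached; remaining stream can be skipped
        if all(math.dist(p, q) >= tau for q in kept):
            kept.append(p)
    return kept
```

By construction every pair of kept elements is at distance at least tau, so any full instance certifies diversity tau; the accuracy loss of such schemes comes from discretizing the threshold grid.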
The Power of Randomization: Distributed Submodular Maximization on Massive Datasets
A wide variety of problems in machine learning, including exemplar
clustering, document summarization, and sensor placement, can be cast as
constrained submodular maximization problems. Unfortunately, the resulting
submodular optimization problems are often too large to be solved on a single
machine. We develop a simple distributed algorithm that is embarrassingly
parallel and achieves provable, constant-factor, worst-case approximation
guarantees. In our experiments, we demonstrate its efficiency on large
problems with different kinds of constraints, with objective values always
close to what is achievable in the centralized setting.
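As background, the centralized primitive that such distributed schemes parallelize over random data partitions is the standard greedy for monotone submodular maximization; a minimal sketch with set coverage as the submodular objective (the function name and instance are illustrative):

```python
def greedy_max_coverage(sets, k):
    """Classic greedy for maximizing a monotone submodular function (here,
    set coverage) under a cardinality constraint; it achieves the well-known
    (1 - 1/e) worst-case guarantee for this objective."""
    covered, chosen = set(), []
    for _ in range(k):
        # pick the set with the largest marginal coverage gain
        best = max(range(len(sets)), key=lambda i: len(sets[i] - covered))
        if not sets[best] - covered:
            break  # no marginal gain left; stop early
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered
```

In the distributed setting, each machine runs this greedy on its (randomized) share of the data, and a final greedy pass over the union of the partial solutions yields the constant-factor guarantee.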
GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning
Large scale machine learning and deep models are extremely data-hungry.
Unfortunately, obtaining large amounts of labeled data is expensive, and
training state-of-the-art models (with hyperparameter tuning) requires
significant computing resources and time. Moreover, real-world data is often noisy
and imbalanced. As a result, several recent papers try to make the training
process more efficient and robust. However, most existing work either focuses
on robustness or efficiency, but not both. In this work, we introduce Glister,
a GeneraLIzation based data Subset selecTion for Efficient and Robust learning
framework. We formulate Glister as a mixed discrete-continuous bi-level
optimization problem to select a subset of the training data, which maximizes
the log-likelihood on a held-out validation set. Next, we propose an iterative
online algorithm Glister-Online, which performs data selection iteratively
along with the parameter updates and can be applied to any loss-based learning
algorithm. We then show that for a rich class of loss functions including
cross-entropy, hinge-loss, squared-loss, and logistic-loss, the inner discrete
data selection is an instance of (weakly) submodular optimization, and we
analyze conditions for which Glister-Online reduces the validation loss and
converges. Finally, we propose Glister-Active, an extension to batch active
learning, and we empirically demonstrate the performance of Glister on a wide
range of tasks, including (a) data selection to reduce training time, (b)
robust learning under label noise and imbalance settings, and (c) batch-active
learning with several deep and shallow models. We show that our framework
improves upon the state of the art in both efficiency and accuracy (in cases
(a) and (c)), and is more efficient than other state-of-the-art robust
learning algorithms in case (b).
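The inner discrete selection step described above can be sketched as a generic marginal-gain greedy, which is well-defined for the (weakly) submodular objectives the paper analyzes. The `gain` callback, which in Glister would score the held-out validation log-likelihood improvement, is an assumed placeholder here:

```python
def greedy_subset(gain, universe, budget):
    """Naive greedy for the inner discrete step of a GLISTER-style bi-level
    objective: repeatedly add the training point with the largest marginal
    gain. `gain(S, e)` is assumed to return the marginal improvement (e.g.,
    in validation log-likelihood) of adding e to the current subset S; the
    real Glister-Online interleaves this with model parameter updates."""
    chosen = []
    for _ in range(budget):
        e = max((x for x in universe if x not in chosen),
                key=lambda x: gain(chosen, x))
        if gain(chosen, e) <= 0:
            break  # no element improves the validation objective
        chosen.append(e)
    return chosen
```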
New perspectives and applications for greedy algorithms in machine learning
Approximating probability densities is a core problem in Bayesian statistics, where inference involves the computation of a posterior distribution. Variational Inference (VI) is a technique for approximating posterior distributions through optimization. It involves specifying a set of tractable densities, out of which the final approximation is to be chosen. While VI is traditionally motivated by the goal of tractability, the focus in this dissertation is to use Bayesian approximation to obtain parsimonious distributions. With this goal in mind, we develop greedy algorithm variants and study their theoretical properties by establishing novel connections between the resulting optimization problems in parsimonious VI and traditional studies in the discrete optimization literature. Specific realizations lead to efficient solutions for many sparse probabilistic models, such as sparse regression, sparse PCA, and sparse Collective Matrix Factorization (CMF). For cases where existing results are insufficient to provide acceptable approximation guarantees, we extend the optimization results for some large-scale algorithms to a much larger class of functions. The developed methods are applied to both simulated and real-world datasets, including high-dimensional functional Magnetic Resonance Imaging (fMRI) datasets, and to the real-world tasks of interpreting data exploration and model predictions.
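One concrete instance of the greedy schemes discussed is forward selection for sparse regression. The following pure-Python sketch, which adds at each step the feature whose best single-coefficient fit most reduces the residual, is an illustrative assumption, not the dissertation's exact algorithm:

```python
def forward_select(X, y, k):
    """Greedy forward selection for sparse regression: at each step add the
    feature (column of X, given as a list of columns) whose best
    single-coefficient fit most reduces the squared residual."""
    residual = list(y)
    chosen = []
    for _ in range(k):
        def gain(j):
            col, denom = X[j], sum(c * c for c in X[j])
            if denom == 0:
                return 0.0
            coef = sum(c * r for c, r in zip(col, residual)) / denom
            return coef * coef * denom  # reduction in squared error
        j = max((j for j in range(len(X)) if j not in chosen), key=gain)
        if gain(j) <= 1e-12:
            break  # no remaining feature reduces the residual
        col = X[j]
        coef = sum(c * r for c, r in zip(col, residual)) / sum(c * c for c in col)
        residual = [r - coef * c for r, c in zip(residual, col)]
        chosen.append(j)
    return chosen, residual
```

Guarantees for such greedy rules classically hinge on submodularity-like properties of the objective, which is the connection to discrete optimization the dissertation develops.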
Diverse Data Selection under Fairness Constraints
Diversity is an important principle in data selection and summarization, facility location, and recommendation systems. Our work focuses on maximizing diversity in data selection, while offering fairness guarantees. In particular, we offer the first study that augments the Max-Min diversification objective with fairness constraints. More specifically, given a universe U of n elements that can be partitioned into m disjoint groups, we aim to retrieve a k-sized subset that maximizes the pairwise minimum distance within the set (diversity) and contains a pre-specified number k_i of elements from each group i (fairness). We show that this problem is NP-complete even in metric spaces, and we propose three novel algorithms, linear in n, that provide strong theoretical approximation guarantees for different values of m and k. Finally, we extend our algorithms and analysis to the case where groups can be overlapping.
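Since the problem is NP-complete, exact solutions are feasible only by exhaustive search on tiny instances. A hedged brute-force sketch of the objective (hypothetical helper, for illustration of the definition only) simply enumerates all quota-respecting k-subsets:

```python
import math
from itertools import combinations

def exact_fair_maxmin(points, labels, quotas):
    """Brute-force exact solver for the fair Max-Min objective, feasible only
    for tiny instances: enumerate all subsets of size k = sum of quotas that
    meet every per-group quota exactly, and return the one maximizing the
    minimum pairwise distance."""
    k = sum(quotas.values())
    best, best_div = None, -1.0
    for idx in combinations(range(len(points)), k):
        counts = {g: 0 for g in quotas}
        for i in idx:
            counts[labels[i]] += 1
        if counts != quotas:
            continue  # fairness constraint violated
        div = min(math.dist(points[a], points[b])
                  for a, b in combinations(idx, 2))
        if div > best_div:
            best, best_div = idx, div
    return best, best_div
```

The C(n, k) enumeration makes the exponential cost explicit, which is precisely what the linear-in-n approximation algorithms above avoid.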