Approximation Algorithms for Continuous Clustering and Facility Location Problems
We consider the approximability of center-based clustering problems where the
points to be clustered lie in a metric space, and no candidate centers are
specified. We call such problems "continuous", to distinguish them from
"discrete" clustering, where candidate centers are specified. For many
objectives, one can reduce the continuous case to the discrete case, and use an
α-approximation algorithm for the discrete case to get a (β·α)-approximation
for the continuous case, where β depends on the objective: e.g. for k-median,
β = 2, and for k-means, β = 4. Our motivating question is whether this gap of β
is inherent, or whether there are better algorithms for continuous clustering
than simply reducing to the discrete case. In a recent SODA 2021 paper,
Cohen-Addad, Karthik, and Lee prove a factor-2 and a factor-4 hardness,
respectively, for continuous k-median and k-means, even when the number of
centers is a constant. The discrete case for a constant number of centers is
exactly solvable in polytime, so the factor-β loss seems unavoidable in some
regimes.
In this paper, we approach continuous clustering via the round-or-cut
framework. For four continuous clustering problems, we outperform the reduction
to the discrete case. Notably, for the problem λ-UFL, where β = 2 and the
discrete case has a hardness of 1.27, we obtain an approximation ratio of
2.32 < 2 × 1.27 for the continuous case. Also, for continuous k-means, where
the best known approximation ratio for the discrete case is 9, we obtain an
approximation ratio of 32 < 4 × 9. The key challenge is that most algorithms
for discrete clustering, including the state of the art, depend on linear
programs that become infinite-sized in the continuous case. To overcome this,
we design new linear programs for the continuous case which are amenable to the
round-or-cut framework.
Comment: 24 pages, 0 figures. Full version of ESA 2022 paper
https://drops.dagstuhl.de/opus/volltexte/2022/16971 . This version adds a
link to the conference version and fixes minor formatting issues
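The discretization loss from reducing continuous to discrete clustering can be seen in a toy sketch (a hypothetical one-dimensional instance, not the paper's round-or-cut algorithm): restricting candidate centers to the input points yields a discrete instance whose optimum is within a bounded factor of the continuous optimum.

```python
# Toy illustration (not the paper's algorithm): reducing continuous
# 1-median / 1-means to the discrete case by restricting candidate
# centers to the input points themselves.
points = [0.0, 1.0, 2.0, 10.0]

def cost(center, p):
    """Sum of distances (p=1, median) or squared distances (p=2, means)."""
    return sum(abs(x - center) ** p for x in points)

# Continuous optimum on the line: a median point for p=1, the centroid for p=2.
cont_median = sorted(points)[len(points) // 2 - 1]  # any point between the middle two is optimal
cont_mean = sum(points) / len(points)

# Discrete optimum: the best center among the input points.
disc_median = min(points, key=lambda c: cost(c, 1))
disc_mean = min(points, key=lambda c: cost(c, 2))

ratio_median = cost(disc_median, 1) / cost(cont_median, 1)
ratio_means = cost(disc_mean, 2) / cost(cont_mean, 2)

# Discretization loses at most a factor 2 for median
# (and at most a factor 4 for means in a general metric).
assert ratio_median <= 2.0
assert ratio_means <= 4.0
```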
Sanitized Clustering against Confounding Bias
Real-world datasets inevitably contain biases that arise from different
sources or conditions during data collection. Consequently, such inconsistency
itself acts as a confounding factor that disturbs the cluster analysis.
Existing methods eliminate the biases by projecting the data onto the
orthogonal complement of the subspace spanned by the confounding factor before
clustering. There, the clustering factor of interest and the confounding
factor are treated coarsely in the raw feature space, where the correlation
between the data and the confounding factor is assumed to be linear for the
sake of convenient solutions. These approaches are thus limited in scope, as data in
real applications is usually complex and non-linearly correlated with the
confounding factor. This paper presents a new clustering framework named
Sanitized Clustering Against confounding Bias (SCAB), which removes the
confounding factor in the semantic latent space of complex data through a
non-linear dependence measure. To be specific, we eliminate the bias
information in the latent space by minimizing the mutual information between
the confounding factor and the latent representation delivered by a
Variational Auto-Encoder (VAE). Meanwhile, a clustering module is introduced to cluster
over the purified latent representations. Extensive experiments on complex
datasets demonstrate that our SCAB achieves a significant gain in clustering
performance by removing the confounding bias. The code is available at
\url{https://github.com/EvaFlower/SCAB}.
Comment: Machine Learning, in press
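For contrast with SCAB's non-linear approach, a minimal sketch of the linear projection baseline the abstract describes (synthetic data, assuming a single observed scalar confounder): removing the component of the data explained by the confounder leaves a representation that is linearly uncorrelated with it.

```python
# Sketch of the linear debiasing baseline the abstract contrasts with
# (not SCAB itself): project the data onto the orthogonal complement
# of the subspace spanned by the confounding factor, then cluster.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))   # data matrix, one row per sample
c = rng.normal(size=(100, 1))   # observed confounding factor (one column)

# Least-squares coefficients of X on c, then remove the explained component.
B = (c.T @ X) / (c.T @ c)       # shape (1, 5)
X_debiased = X - c @ B

# After projection, X_debiased is (linearly) uncorrelated with c;
# non-linear dependence, which SCAB targets, may of course remain.
assert np.allclose(c.T @ X_debiased, 0.0, atol=1e-8)
```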
Models and Mechanisms for Fairness in Location Data Processing
Location data use has become pervasive in the last decade due to the advent
of mobile apps, as well as novel areas such as smart health, smart cities, etc.
At the same time, significant concerns have surfaced with respect to fairness
in data processing. Individuals from certain population segments may be
unfairly treated when being considered for loan or job applications, access to
public resources, or other types of services. In the case of location data,
fairness is an important concern, given that an individual's whereabouts are
often correlated with sensitive attributes, e.g., race, income, education.
While fairness has received significant attention recently, e.g., in the case
of machine learning, there is little focus on the challenges of achieving
fairness when dealing with location data. Due to their characteristics and
specific type of processing algorithms, location data pose important fairness
challenges that must be addressed in a comprehensive and effective manner. In
this paper, we adapt existing fairness models to suit the specific properties
of location data and spatial processing. We focus on individual fairness, which
is more difficult to achieve, and more relevant for most location data
processing scenarios. First, we devise a novel building block to achieve
fairness in the form of fair polynomials. Then, we propose two mechanisms based
on fair polynomials that achieve individual fairness, corresponding to two
common interaction types based on location data. Extensive experimental results
on real data show that the proposed mechanisms achieve individual location
fairness without sacrificing utility.
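The fair-polynomial mechanisms themselves are not spelled out in the abstract; as background, here is a minimal sketch of the individual-fairness notion the paper builds on (the Dwork-style "similar individuals receive similar outcomes" condition, with hypothetical locations and outcome functions):

```python
# Generic (Dwork-style) individual-fairness check: an outcome function
# is individually fair if outcomes for any two individuals differ by at
# most a Lipschitz multiple of the distance between them.
from itertools import combinations

def is_individually_fair(points, outcome, lipschitz=1.0):
    """True iff |outcome(p) - outcome(q)| <= lipschitz * dist(p, q)
    for every pair of individuals: similar locations, similar treatment."""
    def dist(p, q):
        return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5
    return all(
        abs(outcome(p) - outcome(q)) <= lipschitz * dist(p, q)
        for p, q in combinations(points, 2)
    )

locations = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
assert is_individually_fair(locations, lambda p: 0.5 * (p[0] + p[1]))
assert not is_individually_fair(locations, lambda p: 10.0 * p[0])
```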
Approximation Algorithms for Fair Range Clustering
This paper studies the fair range clustering problem, in which the data points
are from different demographic groups and the goal is to pick k centers with
the minimum clustering cost such that each group is at least minimally
represented in the centers set and no group dominates the centers set. More
precisely, given a set of n points in a metric space (P, d) where each point
belongs to one of the ℓ different demographics (i.e., P = P_1 ⊎ P_2 ⊎ ... ⊎ P_ℓ)
and a set of ℓ intervals [α_1, β_1], ..., [α_ℓ, β_ℓ] on the desired number of
centers from each group, the goal is to pick a set of k centers C with minimum
ℓ_p-clustering cost (i.e., (Σ_{v ∈ P} d(v, C)^p)^{1/p}) such that for
each group i ∈ [ℓ], |C ∩ P_i| ∈ [α_i, β_i]. In particular,
the fair range ℓ_p-clustering captures fair range k-center, k-median
and k-means as its special cases. In this work, we provide efficient constant-factor
approximation algorithms for fair range ℓ_p-clustering for all
values of p ∈ [1, ∞).
Comment: ICML 2023
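The interval constraints and the clustering objective described above can be made concrete in a small sketch (toy instance with a hypothetical center choice; this is not the paper's approximation algorithm):

```python
# Toy sketch of the fair range constraint and the l_p-clustering
# objective: each demographic group must contribute a number of
# centers inside its interval, and the cost is the l_p objective.
import math

points = {  # point -> demographic group
    (0.0, 0.0): 0, (1.0, 0.0): 0, (0.0, 5.0): 1, (1.0, 5.0): 1,
}
intervals = {0: (1, 1), 1: (1, 2)}  # group -> (alpha_i, beta_i)

def lp_cost(centers, p=2):
    """(sum over points of d(v, C)^p)^(1/p); p=1 is k-median-style,
    p=2 corresponds to the squared-distance family."""
    total = sum(
        min(math.dist(v, c) for c in centers) ** p for v in points
    )
    return total ** (1.0 / p)

def is_fair(centers):
    """Group i must contribute between alpha_i and beta_i centers."""
    for g, (alpha, beta) in intervals.items():
        count = sum(1 for c in centers if points[c] == g)
        if not alpha <= count <= beta:
            return False
    return True

centers = [(0.0, 0.0), (0.0, 5.0)]  # one center from each group
assert is_fair(centers)
assert abs(lp_cost(centers) - math.sqrt(2.0)) < 1e-9
```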
Fair Clustering via Hierarchical Fair-Dirichlet Process
The advent of ML-driven decision-making and policy formation has led to an
increasing focus on algorithmic fairness. As clustering is one of the most
commonly used unsupervised machine learning approaches, there has naturally
been a proliferation of literature on {\em fair clustering}. A popular notion
of fairness in clustering mandates the clusters to be {\em balanced}, i.e.,
each level of a protected attribute must be approximately equally represented
in each cluster. Building upon the original framework, this literature has
rapidly expanded in various aspects. In this article, we offer a novel
model-based formulation of fair clustering, complementing the existing
literature, which is almost exclusively based on optimizing appropriate
objective functions.
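The "balanced" notion of fairness mentioned above can be illustrated with a small sketch (hypothetical data; balance taken as the common min-ratio definition, and assuming every group appears in every cluster):

```python
# Balance of a clustering: for each cluster, the ratio of the least-
# to the most-represented protected group; the overall balance is the
# minimum over clusters (1.0 = perfectly balanced).
from collections import Counter

def balance(cluster_ids, groups):
    """Assumes every group appears in every cluster; a missing group
    would make the cluster's true balance zero."""
    per_cluster = {}
    for cid, g in zip(cluster_ids, groups):
        per_cluster.setdefault(cid, Counter())[g] += 1
    return min(
        min(c.values()) / max(c.values()) for c in per_cluster.values()
    )

# Two clusters, protected attribute in {"a", "b"}.
clusters = [0, 0, 0, 0, 1, 1, 1, 1]
groups_balanced = ["a", "b", "a", "b", "a", "b", "a", "b"]
groups_skewed = ["a", "a", "a", "b", "b", "b", "b", "a"]

assert balance(clusters, groups_balanced) == 1.0
assert abs(balance(clusters, groups_skewed) - 1.0 / 3.0) < 1e-12
```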
Spectral Normalized-Cut Graph Partitioning with Fairness Constraints
Normalized-cut graph partitioning aims to divide the set of nodes in a graph
into disjoint clusters to minimize the fraction of the total edges between
any cluster and all other clusters. In this paper, we consider a fair variant
of the partitioning problem wherein nodes are characterized by a categorical
sensitive attribute (e.g., gender or race) indicating membership to different
demographic groups. Our goal is to ensure that each group is approximately
proportionally represented in each cluster while minimizing the normalized cut
value. To resolve this problem, we propose a two-phase spectral algorithm
called FNM. In the first phase, we add an augmented Lagrangian term based on
our fairness criteria to the objective function for obtaining a fairer spectral
node embedding. Then, in the second phase, we design a rounding scheme to
produce clusters from the fair embedding that effectively trades off
fairness and partition quality. Through comprehensive experiments on nine
benchmark datasets, we demonstrate the superior performance of FNM compared
with three baseline methods.
Comment: 17 pages, 7 figures, accepted to the 26th European Conference on
Artificial Intelligence (ECAI 2023)
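As background for the objective being minimized, a minimal sketch of computing the normalized-cut value of a partition on a toy graph (FNM's spectral relaxation and fair rounding scheme are not reproduced here):

```python
# Normalized cut of a partition: sum over clusters S of
# cut(S, V \ S) / vol(S), where vol(S) is the total degree of S.
import numpy as np

# Adjacency matrix of a small graph with two natural clusters
# {0, 1, 2} and {3, 4, 5}, joined by a single bridge edge (2, 3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

def ncut(A, clusters):
    total = 0.0
    for S in clusters:
        mask = np.zeros(len(A), dtype=bool)
        mask[list(S)] = True
        cut = A[mask][:, ~mask].sum()   # edge weight leaving S
        vol = A[mask].sum()             # total degree of S
        total += cut / vol
    return total

good = ncut(A, [[0, 1, 2], [3, 4, 5]])  # natural partition
bad = ncut(A, [[0, 3], [1, 2, 4, 5]])   # cuts many edges
assert good < bad
```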