
    Approximation Algorithms for Continuous Clustering and Facility Location Problems

    We consider the approximability of center-based clustering problems where the points to be clustered lie in a metric space and no candidate centers are specified. We call such problems "continuous", to distinguish them from "discrete" clustering, where candidate centers are specified. For many objectives, one can reduce the continuous case to the discrete case and use an α-approximation algorithm for the discrete case to get a βα-approximation for the continuous case, where β depends on the objective: e.g., for k-median, β = 2, and for k-means, β = 4. Our motivating question is whether this gap of β is inherent, or whether there are better algorithms for continuous clustering than simply reducing to the discrete case. In a recent SODA 2021 paper, Cohen-Addad, Karthik, and Lee prove a factor-2 and a factor-4 hardness, respectively, for continuous k-median and k-means, even when the number of centers k is a constant. The discrete case for constant k is exactly solvable in polynomial time, so the β loss seems unavoidable in some regimes. In this paper, we approach continuous clustering via the round-or-cut framework. For four continuous clustering problems, we outperform the reduction to the discrete case. Notably, for the problem λ-UFL, where β = 2 and the discrete case has a hardness of 1.27, we obtain an approximation ratio of 2.32 < 2 × 1.27 for the continuous case. Also, for continuous k-means, where the best known approximation ratio for the discrete case is 9, we obtain an approximation ratio of 32 < 4 × 9. The key challenge is that most algorithms for discrete clustering, including the state of the art, depend on linear programs that become infinite-sized in the continuous case. To overcome this, we design new linear programs for the continuous case that are amenable to the round-or-cut framework. Comment: 24 pages, 0 figures. Full version of ESA 2022 paper https://drops.dagstuhl.de/opus/volltexte/2022/16971 . This version adds a link to the conference version and fixes minor formatting issues.
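The factor-β reduction mentioned in the abstract has a simple form: restrict the candidate centers to the input points themselves; the triangle inequality then bounds the loss by 2 for k-median (and, with squared distances, by 4 for k-means). A minimal sketch for constant k, where the discrete case is solvable exactly by enumeration (function names are illustrative, not from the paper):

```python
import itertools

def kmedian_cost(points, centers, dist):
    # k-median objective: each point pays its distance to the nearest center
    return sum(min(dist(p, c) for c in centers) for p in points)

def discrete_kmedian_brute(points, k, dist):
    """Exact discrete k-median by brute force (feasible for constant k):
    centers are restricted to the input points. By the triangle inequality,
    the optimum of this discrete problem is a 2-approximation for the
    continuous problem, where centers may be arbitrary metric points."""
    best = None
    for centers in itertools.combinations(points, k):
        cost = kmedian_cost(points, centers, dist)
        if best is None or cost < best[0]:
            best = (cost, centers)
    return best
```

For points 0, 1, 10 on a line with k = 2, the brute-force search picks one center in {0, 1} and one at 10, for cost 1.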

    Sanitized Clustering against Confounding Bias

    Real-world datasets inevitably contain biases that arise from different sources or conditions during data collection. Consequently, such inconsistency itself acts as a confounding factor that disturbs the cluster analysis. Existing methods eliminate these biases by projecting data onto the orthogonal complement of the subspace spanned by the confounding factor before clustering. In these approaches, the clustering factor of interest and the confounding factor are treated coarsely in the raw feature space, where the correlation between the data and the confounding factor is assumed to be linear for the sake of convenient solutions. Such approaches are thus limited in scope, as data in real applications is usually complex and non-linearly correlated with the confounding factor. This paper presents a new clustering framework named Sanitized Clustering Against confounding Bias (SCAB), which removes the confounding factor in the semantic latent space of complex data through a non-linear dependence measure. Specifically, we eliminate the bias information in the latent space by minimizing the mutual information between the confounding factor and the latent representation delivered by a Variational Auto-Encoder (VAE). Meanwhile, a clustering module is introduced to cluster over the purified latent representations. Extensive experiments on complex datasets demonstrate that SCAB achieves a significant gain in clustering performance by removing the confounding bias. The code is available at https://github.com/EvaFlower/SCAB. Comment: Machine Learning, in press.
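The abstract's bias-removal term penalizes statistical dependence between the confounder and the VAE's latent code. As a rough illustration of a non-linear dependence measure of this kind, the sketch below computes a biased HSIC (Hilbert-Schmidt Independence Criterion) estimate between two batches of vectors; HSIC is a well-known stand-in used here for illustration, not the mutual-information estimator the paper actually uses:

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    # RBF kernel matrix from pairwise squared distances
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-d2 / (2 * sigma**2))

def hsic(X, Y, sigma=1.0):
    """Biased HSIC estimate: a kernel measure of (possibly non-linear)
    dependence between batches X and Y. It approaches 0 for independent
    samples under characteristic kernels, so minimizing a term like this
    pushes the latent code toward independence from the confounder."""
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    K, L = rbf_gram(X, sigma), rbf_gram(Y, sigma)
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```

On synthetic data, a strongly dependent pair scores visibly higher than an independent pair, which is what makes such a term usable as a training penalty.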

    Models and Mechanisms for Fairness in Location Data Processing

    Location data use has become pervasive in the last decade due to the advent of mobile apps, as well as novel areas such as smart health, smart cities, etc. At the same time, significant concerns have surfaced with respect to fairness in data processing. Individuals from certain population segments may be unfairly treated when being considered for loan or job applications, access to public resources, or other types of services. In the case of location data, fairness is an important concern, given that an individual's whereabouts are often correlated with sensitive attributes, e.g., race, income, education. While fairness has received significant attention recently, e.g., in the case of machine learning, there is little focus on the challenges of achieving fairness when dealing with location data. Due to their characteristics and the specific types of processing algorithms involved, location data pose important fairness challenges that must be addressed in a comprehensive and effective manner. In this paper, we adapt existing fairness models to suit the specific properties of location data and spatial processing. We focus on individual fairness, which is more difficult to achieve, and more relevant for most location data processing scenarios. First, we devise a novel building block to achieve fairness in the form of fair polynomials. Then, we propose two mechanisms based on fair polynomials that achieve individual fairness, corresponding to two common interaction types based on location data. Extensive experimental results on real data show that the proposed mechanisms achieve individual location fairness without sacrificing utility.
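Individual fairness, in its usual Dwork-style formulation, is a Lipschitz condition: similar individuals must receive similar outcomes. A hypothetical brute-force checker of that condition is sketched below; note the paper's fair polynomials are a mechanism for *achieving* such guarantees on location data, whereas this sketch only verifies them after the fact:

```python
def individually_fair(individuals, outcome, d_in, d_out, L=1.0):
    """Check the Lipschitz-style individual-fairness condition:
    d_out(f(x), f(y)) <= L * d_in(x, y) for every pair of individuals.
    `outcome` is the decision function f; d_in/d_out are the input and
    outcome metrics. Quadratic in the number of individuals."""
    for i, x in enumerate(individuals):
        for y in individuals[i + 1:]:
            if d_out(outcome(x), outcome(y)) > L * d_in(x, y):
                return False
    return True
```

For instance, the identity outcome is 1-Lipschitz under any metric, while an outcome that amplifies input differences tenfold violates the condition at L = 1.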

    Approximation Algorithms for Fair Range Clustering

    This paper studies the fair range clustering problem, in which the data points come from different demographic groups and the goal is to pick k centers with minimum clustering cost such that each group is at least minimally represented in the center set and no group dominates it. More precisely, given a set of n points in a metric space (P, d), where each point belongs to one of ℓ different demographic groups (i.e., P = P_1 ⊎ P_2 ⊎ ⋯ ⊎ P_ℓ), and a set of ℓ intervals [α_1, β_1], …, [α_ℓ, β_ℓ] on the desired number of centers from each group, the goal is to pick a set of k centers C with minimum ℓ_p-clustering cost (i.e., (∑_{v∈P} d(v, C)^p)^{1/p}) such that |C ∩ P_i| ∈ [α_i, β_i] for each group i ∈ [ℓ]. In particular, fair range ℓ_p-clustering captures fair range k-center, k-median and k-means as special cases. In this work, we provide efficient constant-factor approximation algorithms for fair range ℓ_p-clustering for all values of p ∈ [1, ∞). Comment: ICML 202
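The constraint side of the problem is easy to state in code: a solution must place between α_i and β_i centers in each group, which is achievable with k centers only when ∑α_i ≤ k ≤ ∑β_i. A small sketch (names are illustrative; this is the feasibility check, not the paper's approximation algorithm):

```python
from collections import Counter

def range_feasible(k, intervals):
    """A center budget k is consistent with per-group interval
    constraints [(alpha_i, beta_i), ...] iff
    sum(alpha_i) <= k <= sum(beta_i)."""
    lo = sum(a for a, _ in intervals)
    hi = sum(b for _, b in intervals)
    return lo <= k <= hi

def satisfies_fair_range(centers, group_of, intervals):
    """Check |C ∩ P_i| ∈ [α_i, β_i] for every group i, where
    group_of maps each center to its group index."""
    counts = Counter(group_of[c] for c in centers)
    return all(a <= counts.get(i, 0) <= b
               for i, (a, b) in enumerate(intervals))
```

With intervals [(1, 2), (1, 1)], for example, k = 2 is feasible while k = 4 is not, and a center set drawing only from the first group violates the second group's lower bound.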

    Fair Clustering via Hierarchical Fair-Dirichlet Process

    The advent of ML-driven decision-making and policy formation has led to an increasing focus on algorithmic fairness. As clustering is one of the most commonly used unsupervised machine learning approaches, there has naturally been a proliferation of literature on {\em fair clustering}. A popular notion of fairness in clustering mandates that the clusters be {\em balanced}, i.e., each level of a protected attribute must be approximately equally represented in each cluster. Building upon the original framework, this literature has rapidly expanded in various aspects. In this article, we offer a novel model-based formulation of fair clustering, complementing the existing literature, which is almost exclusively based on optimizing appropriate objective functions.
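The balance notion described above can be made concrete per cluster: take the minimum ratio between the counts of any two levels of the protected attribute within the cluster, so that 1.0 means perfect balance and 0.0 means a level is entirely absent. An illustrative helper (not from the article):

```python
from collections import Counter

def cluster_balance(labels, groups):
    """Per-cluster balance: for each cluster, the ratio of the least- to
    the most-represented level of the protected attribute. `labels` are
    cluster assignments; `groups` are protected-attribute levels."""
    levels = sorted(set(groups))
    balances = {}
    for c in set(labels):
        counts = Counter(g for l, g in zip(labels, groups) if l == c)
        vals = [counts.get(lv, 0) for lv in levels]
        balances[c] = min(vals) / max(vals) if max(vals) else 0.0
    return balances
```

Two clusters each containing one member of every group score 1.0; a cluster missing a group entirely scores 0.0.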

    Spectral Normalized-Cut Graph Partitioning with Fairness Constraints

    Normalized-cut graph partitioning aims to divide the set of nodes in a graph into k disjoint clusters so as to minimize the fraction of the total edges between any cluster and all other clusters. In this paper, we consider a fair variant of the partitioning problem wherein nodes are characterized by a categorical sensitive attribute (e.g., gender or race) indicating membership in different demographic groups. Our goal is to ensure that each group is approximately proportionally represented in each cluster while minimizing the normalized cut value. To solve this problem, we propose a two-phase spectral algorithm called FNM. In the first phase, we add an augmented-Lagrangian term based on our fairness criteria to the objective function to obtain a fairer spectral node embedding. Then, in the second phase, we design a rounding scheme to produce k clusters from the fair embedding that effectively trades off fairness and partition quality. Through comprehensive experiments on nine benchmark datasets, we demonstrate the superior performance of FNM compared with three baseline methods. Comment: 17 pages, 7 figures, accepted to the 26th European Conference on Artificial Intelligence (ECAI 2023).
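The objective being minimized here can be stated compactly: for a partition into clusters S_1, …, S_k, the normalized cut is ∑_i cut(S_i, V∖S_i) / vol(S_i), where vol(S_i) is the total degree inside S_i. A sketch for a weighted adjacency matrix (illustrative only; this evaluates the objective, it is not the paper's FNM algorithm):

```python
import numpy as np

def normalized_cut(A, labels):
    """Normalized cut of a partition: sum over clusters of
    cut(S, V\\S) / vol(S). A is a symmetric (weighted) adjacency
    matrix; `labels` assigns each node to a cluster."""
    labels = np.asarray(labels)
    deg = A.sum(axis=1)
    total = 0.0
    for c in np.unique(labels):
        S = labels == c
        cut = A[S][:, ~S].sum()   # edge weight leaving cluster c
        vol = deg[S].sum()        # total degree inside cluster c
        total += cut / vol if vol > 0 else 0.0
    return total
```

Two disconnected components split along their components give a normalized cut of 0, the best possible value, while cutting through the only edge of a two-node graph gives the worst case of 2.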