4 research outputs found

    Coresets for Clustering with General Assignment Constraints

    Designing small-sized \emph{coresets}, which approximately preserve the costs of the solutions for large datasets, has been an important research direction for the past decade. We consider coreset construction for a variety of general constrained clustering problems. We significantly extend and generalize the results of a very recent paper (Braverman et al., FOCS'22), by demonstrating that the idea of hierarchical uniform sampling (Chen, SICOMP'09; Braverman et al., FOCS'22) can be applied to efficiently construct coresets for a very general class of constrained clustering problems with general assignment constraints, including capacity constraints on cluster centers and assignment structure constraints for data points (modeled by a convex body $\mathcal{B}$). Our main theorem shows that a small-sized $\epsilon$-coreset exists as long as a complexity measure $\mathsf{Lip}(\mathcal{B})$ of the structure constraint and the \emph{covering exponent} $\Lambda_\epsilon(\mathcal{X})$ of the metric space $(\mathcal{X}, d)$ are bounded. The complexity measure $\mathsf{Lip}(\mathcal{B})$ for a convex body $\mathcal{B}$ is the Lipschitz constant of a certain transportation problem constrained in $\mathcal{B}$, called the \emph{optimal assignment transportation problem}. We prove nontrivial upper bounds on $\mathsf{Lip}(\mathcal{B})$ for various polytopes, including general matroid basis polytopes and laminar matroid polytopes (with a better bound). As an application of our general theorem, we construct the first coreset for the fault-tolerant clustering problem (with or without capacity upper/lower bounds) for the above metric spaces, in which the fault-tolerance requirement is captured by a uniform matroid basis polytope.
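To make the fault-tolerance requirement above concrete, the following sketch (illustrative only; the function name and toy data are ours, not the paper's construction) computes the fault-tolerant clustering cost in which every point must be served by $m$ distinct centers, the assignment structure given by a rank-$m$ uniform matroid basis polytope. For a fixed center set, the optimal such assignment sends each point to its $m$ nearest centers.

```python
import math

def fault_tolerant_cost(points, centers, m, z=2):
    """Fault-tolerant (z-power) clustering cost: each point is assigned to
    m distinct centers, and the optimal assignment uses its m nearest ones."""
    total = 0.0
    for p in points:
        dists = sorted(math.dist(p, c) for c in centers)  # distances to all centers
        total += sum(d ** z for d in dists[:m])           # m nearest, z-th powers
    return total

points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
centers = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
# With m=2, each point pays for its two nearest centers.
print(fault_tolerant_cost(points, centers, m=2))
```

Setting m=1 recovers the ordinary (k, z)-clustering cost, which is one way to see fault tolerance as an assignment constraint layered on top of the unconstrained problem.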

    Near-linear time approximation schemes for clustering in doubling metrics

    We consider the classic Facility Location, k-Median, and k-Means problems in metric spaces of doubling dimension $d$. We give nearly linear-time approximation schemes for each problem. The complexity of our algorithms is $\tilde{O}(2^{(1/\varepsilon)^{O(d^2)}} n)$, a significant improvement over the state-of-the-art algorithms, which run in time $n^{(d/\varepsilon)^{O(d)}}$. Moreover, we show how to extend our techniques to obtain the first efficient approximation schemes for prize-collecting k-Median and k-Means, and efficient bicriteria approximation schemes for k-Median with outliers, k-Means with outliers, and k-Center.

    $\varepsilon$-Coresets for Clustering (with Outliers) in Doubling Metrics

    We study the problem of constructing $\varepsilon$-coresets for the $(k, z)$-clustering problem in a doubling metric $M(X, d)$. An $\varepsilon$-coreset is a weighted subset $S \subseteq X$ with weight function $w : S \rightarrow \mathbb{R}_{\geq 0}$, such that for any $k$-subset $C \in [X]^k$, it holds that $\sum_{x \in S} w(x) \cdot d^z(x, C) \in (1 \pm \varepsilon) \cdot \sum_{x \in X} d^z(x, C)$. We present an efficient algorithm that constructs an $\varepsilon$-coreset for the $(k, z)$-clustering problem in $M(X, d)$, where the size of the coreset depends only on the parameters $k, z, \varepsilon$ and the doubling dimension $\mathsf{ddim}(M)$. To the best of our knowledge, this is the first efficient $\varepsilon$-coreset construction of size independent of $|X|$ for general clustering problems in doubling metrics. To this end, we establish the first relation between the doubling dimension of $M(X, d)$ and the shattering dimension (or VC-dimension) of the range space induced by the distance $d$. Such a relation was not known before, since one can easily construct instances in which neither one can be bounded by (some function of) the other. Surprisingly, we show that if we allow a small $(1 \pm \epsilon)$-distortion of the distance function $d$ and consider the notion of $\tau$-error probabilistic shattering dimension, we can prove an upper bound of $O(\mathsf{ddim}(M) \cdot \log(1/\varepsilon) + \log\log\frac{1}{\tau})$ on the probabilistic shattering dimension, even for weighted doubling metrics. We believe this new relation is of independent interest and may find other applications. We also study robust coresets and centroid sets in doubling metrics.
Our robust coreset construction leads to new results in clustering and property testing, and the centroid sets can be used to accelerate local search algorithms for clustering problems. (Comment: Appeared in FOCS 2018; this is the full version.)
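The $\varepsilon$-coreset guarantee above is easy to check numerically. The sketch below (illustrative only: it uses naive uniform sampling with reweighting as a baseline, not the paper's importance-sampling-based construction, and the toy data and names are ours) compares the weighted coreset cost against the full cost for one fixed center set.

```python
import math
import random

def cost(points, weights, centers, z):
    """Weighted (k, z)-clustering cost: sum over x of w(x) * d(x, C)^z,
    where d(x, C) is the distance from x to its nearest center."""
    return sum(w * min(math.dist(p, c) for c in centers) ** z
               for p, w in zip(points, weights))

random.seed(0)
X = [(random.random(), random.random()) for _ in range(1000)]

# Naive baseline: uniform sample S, each point reweighted by |X| / |S| so
# that the weighted cost is an unbiased estimate of the full cost.
S = random.sample(X, 200)
wS = [len(X) / len(S)] * len(S)

C = [(0.25, 0.25), (0.75, 0.75)]  # an arbitrary k=2 center set
full = cost(X, [1.0] * len(X), C, z=2)
core = cost(S, wS, C, z=2)
print(abs(core - full) / full)  # relative error for this single center set
```

Note the coreset guarantee is much stronger than this check: it must hold simultaneously for every $k$-subset $C$, which is exactly why the shattering-dimension machinery in the abstract is needed.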