Clustering with diversity
We consider the {\em clustering with diversity} problem: given a set of
colored points in a metric space, partition them into clusters such that each
cluster has at least $\ell$ points, all of which have distinct colors.
We give a 2-approximation to this problem for any $\ell$ when the objective
is to minimize the maximum radius of any cluster. We show that the
approximation ratio is optimal unless P = NP, by providing a matching
lower bound. Several extensions to our algorithm have also been developed for
handling outliers. This problem is mainly motivated by applications in
privacy-preserving data publication.

Comment: Extended abstract accepted in ICALP 2010. Keywords: Approximation
algorithm, k-center, k-anonymity, l-diversity
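As a small illustration of the problem's feasibility constraints (not of the 2-approximation algorithm itself), the sketch below checks a candidate partition, assuming Euclidean points, a color label per point, and $\ell$ as the diversity parameter; all function names are my own.

```python
import math

def cluster_radius(cluster, center):
    """Maximum distance from the chosen center to any point in the cluster."""
    return max(math.dist(p, center) for (p, _color) in cluster)

def is_diverse_clustering(clusters, ell):
    """Feasibility check for clustering with diversity: every cluster has
    at least ell points, all of which carry distinct colors."""
    for cluster in clusters:
        colors = [color for (_p, color) in cluster]
        if len(colors) < ell or len(set(colors)) != len(colors):
            return False
    return True

# Two clusters of colored 2-D points; colors are distinct within each cluster.
clusters = [
    [((0, 0), "red"), ((1, 0), "blue"), ((0, 1), "green")],
    [((5, 5), "red"), ((6, 5), "blue"), ((5, 6), "yellow")],
]
print(is_diverse_clustering(clusters, ell=3))             # True
# The objective to minimize is the maximum radius over all clusters
# (here using each cluster's first point as an arbitrary center).
print(max(cluster_radius(c, c[0][0]) for c in clusters))
```

Note that colors may repeat across clusters; the distinctness constraint is only within each cluster, which is what makes the model useful for l-diversity-style privacy guarantees.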
Capacitated Center Problems with Two-Sided Bounds and Outliers
In recent years, capacitated center problems have attracted a lot of
research interest. Given a set of vertices $V$, we want to find a subset of
vertices $S$, called centers, such that the maximum cluster radius is
minimized. Moreover, each center in $S$ should satisfy some capacity
constraint, which could be an upper or lower bound on the number of vertices it
can serve. Capacitated $k$-center problems with one-sided bounds (upper or
lower) have been well studied in previous work, and a constant factor
approximation was obtained.
We are the first to study the capacitated center problem with both capacity
lower and upper bounds (with or without outliers). We assume each vertex has a
uniform lower bound and a non-uniform upper bound. For the case of opening
exactly $k$ centers, we note that a generalization of a recent LP approach can
achieve constant factor approximation algorithms for our problems. Our main
contribution is a simple combinatorial algorithm for the case where there is no
cardinality constraint on the number of open centers. Our combinatorial
algorithm is simpler and achieves a better constant approximation factor than
the LP approach.
Matroid and Knapsack Center Problems
In the classic $k$-center problem, we are given a metric graph, and the
objective is to open $k$ nodes as centers such that the maximum distance from
any vertex to its closest center is minimized. In this paper, we consider two
important generalizations of $k$-center, the matroid center problem and the
knapsack center problem. Both problems are motivated by recent content
distribution network applications. Our contributions can be summarized as
follows:
1. We consider the matroid center problem in which the centers are required
to form an independent set of a given matroid. We show this problem is NP-hard
even on a line. We present a 3-approximation algorithm for the problem on
general metrics. We also consider the outlier version of the problem where a
given number of vertices can be excluded as the outliers from the solution. We
present a 7-approximation for the outlier version.
2. We consider the (multi-)knapsack center problem in which the centers are
required to satisfy one (or more) knapsack constraint(s). It is known that the
knapsack center problem with a single knapsack constraint admits a
3-approximation. However, when there are at least two knapsack constraints, we
show this problem is not approximable at all. To complement the hardness
result, we present a polynomial time algorithm that gives a 3-approximate
solution such that one knapsack constraint is satisfied and the others may be
violated by at most a factor of $1+\epsilon$. We also obtain a 3-approximation
for the outlier version that may violate the knapsack constraint by
$1+\epsilon$.

Comment: A preliminary version of this paper is accepted to IPCO 2013.
On the Cost of Essentially Fair Clusterings
Clustering is a fundamental tool in data mining. It partitions points into
groups (clusters) and may be used to make decisions for each point based on its
group. However, this process may harm protected (minority) classes if the
clustering algorithm does not adequately represent them in desirable clusters
-- especially if the data is already biased.
At NIPS 2017, Chierichetti et al. proposed a model for fair clustering
requiring the representation in each cluster to (approximately) preserve the
global fraction of each protected class. Restricting to two protected classes,
they developed both a 4-approximation for the fair $k$-center problem and a
$O(t)$-approximation for the fair $k$-median problem, where $t$ is a parameter
for the fairness model. For multiple protected classes, the best known result
is a 14-approximation for fair $k$-center.
We extend and improve the known results. Firstly, we give a 5-approximation
for the fair $k$-center problem with multiple protected classes. Secondly, we
propose a relaxed fairness notion under which we can give bicriteria
constant-factor approximations for all of the classical clustering objectives:
$k$-center, $k$-supplier, $k$-median, $k$-means and facility location. The
latter approximations are achieved by a framework that takes an arbitrary
existing unfair (integral) solution and a fair (fractional) LP solution and
combines them into an essentially fair clustering with a weakly supervised
rounding scheme. In this way, a fair clustering can be established belatedly,
in a situation where the centers are already fixed.
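The fairness requirement described above — each cluster approximately preserving the global fraction of each protected class — can be checked directly. Below is a minimal sketch assuming an additive tolerance; the paper's exact relaxation may be parameterized differently, and the function name is illustrative.

```python
from collections import Counter

def is_essentially_fair(clusters, tolerance):
    """Check whether each cluster's per-class fraction stays within an
    additive `tolerance` of that class's global fraction."""
    points = [label for cluster in clusters for label in cluster]
    global_frac = {c: n / len(points) for c, n in Counter(points).items()}
    for cluster in clusters:
        counts = Counter(cluster)
        for cls, frac in global_frac.items():
            if abs(counts[cls] / len(cluster) - frac) > tolerance:
                return False
    return True

# Globally half class "A", half class "B"; both clusters mirror that ratio.
clusters = [["A", "A", "B", "B"], ["A", "B"]]
print(is_essentially_fair(clusters, tolerance=0.1))  # True

# Fully segregated clusters badly violate the fairness requirement.
segregated = [["A", "A", "A"], ["B", "B", "B"]]
print(is_essentially_fair(segregated, tolerance=0.1))  # False
```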
Diversity-based Attribute Weighting for K-modes Clustering
Categorical data is a kind of data commonly processed in computer science. Extracting information from categorical data requires a clustering algorithm, and many clustering algorithms have been proposed by researchers. One clustering algorithm for categorical data is k-modes, which uses a simple matching approach based on similarity values: in k-modes, two matching attribute values have similarity 1, and 0 otherwise. In practice, each attribute takes several distinct values, and each value occurs with a different frequency, so a similarity value of 0 or 1 is not enough to represent the real semantic distance between a data object and a cluster. In this paper, we therefore generalize the k-modes algorithm for categorical data by adding a weight and diversity value for each attribute value, in order to optimize categorical data clustering.
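To make the idea concrete, here is a minimal sketch of a weighted simple-matching dissimilarity: plain k-modes counts a flat 1 per mismatched attribute, while the weighted variant lets each mismatch contribute an attribute-specific weight. The diversity weight used here (distinct values observed, normalized) is only an illustrative assumption; the paper's exact weighting scheme may differ.

```python
def attribute_weights(data):
    """Illustrative diversity weight per attribute: the number of distinct
    values observed in that attribute, normalized by the number of rows.
    This stands in for the paper's diversity-based weighting."""
    n_attrs = len(data[0])
    weights = []
    for j in range(n_attrs):
        distinct = len(set(row[j] for row in data))
        weights.append(distinct / len(data))
    return weights

def weighted_matching_dissimilarity(x, y, weights):
    """Plain k-modes: sum of 0/1 mismatches over attributes.
    Weighted variant: each mismatch on attribute j contributes weights[j]."""
    return sum(w for xj, yj, w in zip(x, y, weights) if xj != yj)

data = [("red", "small"), ("blue", "small"), ("green", "large")]
w = attribute_weights(data)  # attribute 0 is more diverse than attribute 1
print(weighted_matching_dissimilarity(data[0], data[2], w))
```

Under this weighting, a mismatch on a highly diverse attribute counts for more than a mismatch on a near-constant one, which is exactly the kind of graded distance the flat 0/1 matching cannot express.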
Resource-efficient fast prediction in healthcare data analytics: A pruned Random Forest regression approach
In predictive healthcare data analytics, high accuracy is paramount, as low accuracy can lead to misdiagnosis, which is known to cause serious health consequences or even death. Fast prediction is also an important desideratum, particularly for machines and mobile devices with limited memory and processing power. For real-time healthcare analytics applications, particularly those that run on mobile devices, both traits (high accuracy and fast prediction) are highly desirable. In this paper, we propose to use an ensemble regression technique based on CLUB-DRF, a pruned Random Forest that possesses these features. The speed and accuracy of the method are demonstrated by an experimental study on three medical data sets covering three different diseases.
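The core idea of a pruned forest is to discard trees that add little beyond those already kept, so prediction needs fewer tree evaluations. CLUB-DRF itself clusters trees by their predictions and keeps representatives; the greedy pass below is only a rough stand-in for that idea, with hypothetical function names, operating on precomputed per-tree prediction vectors.

```python
def prune_ensemble(tree_preds, threshold):
    """Greedy sketch of ensemble pruning: keep a tree only if its
    prediction vector differs (mean absolute difference) by more than
    `threshold` from every tree kept so far. This is an illustration
    of redundancy removal, not the actual CLUB-DRF procedure."""
    kept = []
    for preds in tree_preds:
        if all(
            sum(abs(a - b) for a, b in zip(preds, k)) / len(preds) > threshold
            for k in kept
        ):
            kept.append(preds)
    return kept

def ensemble_predict(tree_preds, i):
    """Average the (pruned) ensemble's predictions for sample i."""
    return sum(p[i] for p in tree_preds) / len(tree_preds)

# Five "trees", two of which are near-duplicates of the first.
trees = [
    [1.0, 2.0, 3.0],
    [1.0, 2.1, 3.0],   # redundant: nearly identical to tree 0
    [1.1, 2.0, 3.0],   # redundant
    [2.0, 3.0, 4.0],
    [0.0, 1.0, 2.0],
]
pruned = prune_ensemble(trees, threshold=0.2)
print(len(pruned))                  # 3 trees survive the pruning
print(ensemble_predict(pruned, 0))  # average of 1.0, 2.0, 0.0 -> 1.0
```

Evaluating 3 trees instead of 5 at prediction time is what buys the speed-up on memory- and compute-limited devices, provided accuracy is preserved.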
Privacy Preserving Clustering with Constraints
The k-center problem is a classical combinatorial optimization problem which asks to find k centers such that the maximum distance of any input point in a set P to its assigned center is minimized. The problem allows for elegant 2-approximations. However, the situation becomes significantly more difficult when constraints are added to the problem. We raise the question whether general methods can be derived to turn an approximation algorithm for a clustering problem with some constraints into an approximation algorithm that respects one additional constraint. Our constraint of choice is privacy: here, we are only allowed to open a center when at least l clients will be assigned to it. We show how to combine privacy with several other constraints.
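One of the "elegant 2-approximations" for the unconstrained k-center problem mentioned above is Gonzalez's farthest-first traversal; a minimal Euclidean sketch follows (it does not enforce the privacy lower bound of l clients per center).

```python
import math

def farthest_first_centers(points, k):
    """Gonzalez's farthest-first traversal, a classic 2-approximation for
    k-center: start from an arbitrary point, then repeatedly add the point
    farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers

def kcenter_radius(points, centers):
    """The k-center objective: maximum distance from any point to its
    nearest center."""
    return max(min(math.dist(p, c) for c in centers) for p in points)

# Two well-separated groups of points; k = 2 picks one center per group.
points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10)]
centers = farthest_first_centers(points, k=2)
print(kcenter_radius(points, centers))  # 1.0
```

A privacy-respecting variant would additionally have to ensure every opened center serves at least l clients, which is exactly the kind of extra constraint the abstract studies.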