134 research outputs found
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
Clustering non-Euclidean data is difficult, and one of the most used
algorithms besides hierarchical clustering is the popular algorithm
Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In
Euclidean geometry the mean-as used in k-means-is a good estimator for the
cluster center, but this does not hold for arbitrary dissimilarities. PAM uses
the medoid instead, the object with the smallest dissimilarity to all others in
the cluster. This notion of centrality can be used with any (dis-)similarity,
and thus is of high relevance to many domains such as biology that require the
use of Jaccard, Gower, or more complex distances.
A key issue with PAM is its high run time cost. We propose modifications to
the PAM algorithm to achieve an O(k)-fold speedup in the second SWAP phase of
the algorithm, but will still find the same results as the original PAM
algorithm. If we slightly relax the choice of swaps performed (at comparable
quality), we can further accelerate the algorithm by performing up to k swaps
in each iteration. With the substantially faster SWAP, we can now also explore
alternative strategies for choosing the initial medoids. We also show how the
CLARA and CLARANS algorithms benefit from these modifications. It can easily be
combined with earlier approaches to use PAM and CLARA on big data (some of
which use PAM as a subroutine, hence can immediately benefit from these
improvements), where the performance with high k becomes increasingly
important.
In experiments on real data with k=100, we observed a 200-fold speedup
compared to the original PAM SWAP algorithm, making PAM applicable to larger
data sets as long as we can afford to compute a distance matrix, and in
particular to higher k (at k=2, the new SWAP was only 1.5 times faster, as the
speedup is expected to increase with k)
Run Generation Revisited: What Goes Up May or May Not Come Down
In this paper, we revisit the classic problem of run generation. Run
generation is the first phase of external-memory sorting, where the objective
is to scan through the data, reorder elements using a small buffer of size M ,
and output runs (contiguously sorted chunks of elements) that are as long as
possible.
We develop algorithms for minimizing the total number of runs (or
equivalently, maximizing the average run length) when the runs are allowed to
be sorted or reverse sorted. We study the problem in the online setting, both
with and without resource augmentation, and in the offline setting.
(1) We analyze alternating-up-down replacement selection (runs alternate
between sorted and reverse sorted), which was studied by Knuth as far back as
1963. We show that this simple policy is asymptotically optimal. Specifically,
we show that alternating-up-down replacement selection is 2-competitive and no
deterministic online algorithm can perform better.
(2) We give online algorithms having smaller competitive ratios with resource
augmentation. Specifically, we exhibit a deterministic algorithm that, when
given a buffer of size 4M , is able to match or beat any optimal algorithm
having a buffer of size M . Furthermore, we present a randomized online
algorithm which is 7/4-competitive when given a buffer twice that of the
optimal.
(3) We demonstrate that performance can also be improved with a small amount
of foresight. We give an algorithm, which is 3/2-competitive, with
foreknowledge of the next 3M elements of the input stream. For the extreme case
where all future elements are known, we design a PTAS for computing the optimal
strategy a run generation algorithm must follow.
(4) Finally, we present algorithms tailored for nearly sorted inputs which
are guaranteed to have optimal solutions with sufficiently long runs
Quantifying Privacy: A Novel Entropy-Based Measure of Disclosure Risk
It is well recognised that data mining and statistical analysis pose a
serious treat to privacy. This is true for financial, medical, criminal and
marketing research. Numerous techniques have been proposed to protect privacy,
including restriction and data modification. Recently proposed privacy models
such as differential privacy and k-anonymity received a lot of attention and
for the latter there are now several improvements of the original scheme, each
removing some security shortcomings of the previous one. However, the challenge
lies in evaluating and comparing privacy provided by various techniques. In
this paper we propose a novel entropy based security measure that can be
applied to any generalisation, restriction or data modification technique. We
use our measure to empirically evaluate and compare a few popular methods,
namely query restriction, sampling and noise addition.Comment: 20 pages, 4 figure
BETULA: Numerically Stable CF-Trees for BIRCH Clustering
BIRCH clustering is a widely known approach for clustering, that has
influenced much subsequent research and commercial products. The key
contribution of BIRCH is the Clustering Feature tree (CF-Tree), which is a
compressed representation of the input data. As new data arrives, the tree is
eventually rebuilt to increase the compression. Afterward, the leaves of the
tree are used for clustering. Because of the data compression, this method is
very scalable. The idea has been adopted for example for k-means, data stream,
and density-based clustering.
Clustering features used by BIRCH are simple summary statistics that can
easily be updated with new data: the number of points, the linear sums, and the
sum of squared values. Unfortunately, how the sum of squares is then used in
BIRCH is prone to catastrophic cancellation.
We introduce a replacement cluster feature that does not have this numeric
problem, that is not much more expensive to maintain, and which makes many
computations simpler and hence more efficient. These cluster features can also
easily be used in other work derived from BIRCH, such as algorithms for
streaming data. In the experiments, we demonstrate the numerical problem and
compare the performance of the original algorithm compared to the improved
cluster features
On the Recognition of Four-Directional Orthogonal Ray Graphs
Orthogonal ray graphs are the intersection graphs of horizontal and vertical rays (i.e. half-lines) in the plane. If the rays can have any possible orientation (left/right/up/down) then the graph is a 4-directional orthogonal ray graph (4-DORG). Otherwise, if all rays are only pointing into the positive x and y directions, the intersection graph is a 2-DORG. Similarly, for 3-DORGs, the horizontal rays can have any direction but the vertical ones can only have the positive direction. The recognition problem of 2-DORGs, which are a nice subclass of bipartite comparability graphs, is known to be polynomial, while the recognition problems for 3-DORGs and 4-DORGs are open. Recently it has been shown that the recognition of unit grid intersection graphs, a superclass of 4-DORGs, is NP-complete. In this paper we prove that the recognition problem of 4-DORGs is polynomial, given a partition {L,R,U,D} of the vertices of G (which corresponds to the four possible ray directions). For the proof, given the graph G, we first construct two cliques G 1,G 2 with both directed and undirected edges. Then we successively augment these two graphs, constructing eventually a graph TeX with both directed and undirected edges, such that G has a 4-DORG representation if and only if TeX has a transitive orientation respecting its directed edges. As a crucial tool for our analysis we introduce the notion of an S-orientation of a graph, which extends the notion of a transitive orientation. We expect that our proof ideas will be useful also in other situations. Using an independent approach we show that, given a permutation π of the vertices of U (π is the order of y-coordinates of ray endpoints for U), while the partition {L,R} of V ∖ U is not given, we can still efficiently check whether G has a 3-DORG representation
Vertex Cover Kernelization Revisited: Upper and Lower Bounds for a Refined Parameter
An important result in the study of polynomial-time preprocessing shows that
there is an algorithm which given an instance (G,k) of Vertex Cover outputs an
equivalent instance (G',k') in polynomial time with the guarantee that G' has
at most 2k' vertices (and thus O((k')^2) edges) with k' <= k. Using the
terminology of parameterized complexity we say that k-Vertex Cover has a kernel
with 2k vertices. There is complexity-theoretic evidence that both 2k vertices
and Theta(k^2) edges are optimal for the kernel size. In this paper we consider
the Vertex Cover problem with a different parameter, the size fvs(G) of a
minimum feedback vertex set for G. This refined parameter is structurally
smaller than the parameter k associated to the vertex covering number vc(G)
since fvs(G) <= vc(G) and the difference can be arbitrarily large. We give a
kernel for Vertex Cover with a number of vertices that is cubic in fvs(G): an
instance (G,X,k) of Vertex Cover, where X is a feedback vertex set for G, can
be transformed in polynomial time into an equivalent instance (G',X',k') such
that |V(G')| <= 2k and |V(G')| <= O(|X'|^3). A similar result holds when the
feedback vertex set X is not given along with the input. In sharp contrast we
show that the Weighted Vertex Cover problem does not have a polynomial kernel
when parameterized by the cardinality of a given vertex cover of the graph
unless NP is in coNP/poly and the polynomial hierarchy collapses to the third
level.Comment: Published in "Theory of Computing Systems" as an Open Access
publicatio
Query Processing in Spatial Databases Containing Obstacles
Despite the existence of obstacles in many database applications, traditional spatial query processing assumes that points in space are directly reachable and utilizes the Euclidean distance metric. In this paper, we study spatial queries in the presence of obstacles, where the obstructed distance between two points is defined as the length of the shortest path that connects them without crossing any obstacles. We propose efficient algorithms for the most important query types, namely, range search, nearest neighbours, e-distance joins, closest pairs and distance semi-joins, assuming that both data objects and obstacles are indexed by R-trees. The effectiveness of the proposed solutions is verified through extensive experiments
- …