7,800 research outputs found
A Comparative Study of Efficient Initialization Methods for the K-Means Clustering Algorithm
K-means is undoubtedly the most widely used partitional clustering algorithm.
Unfortunately, due to its gradient descent nature, this algorithm is highly
sensitive to the initial placement of the cluster centers. Numerous
initialization methods have been proposed to address this problem. In this
paper, we first present an overview of these methods with an emphasis on their
computational efficiency. We then compare eight commonly used linear time
complexity initialization methods on a large and diverse collection of data
sets using various performance criteria. Finally, we analyze the experimental
results using non-parametric statistical tests and provide recommendations for
practitioners. We demonstrate that popular initialization methods often perform
poorly and that there are in fact strong alternatives to these methods.Comment: 17 pages, 1 figure, 7 table
TopSig: Topology Preserving Document Signatures
Performance comparisons between File Signatures and Inverted Files for text
retrieval have previously shown several significant shortcomings of file
signatures relative to inverted files. The inverted file approach underpins
most state-of-the-art search engine algorithms, such as Language and
Probabilistic models. It has been widely accepted that traditional file
signatures are inferior alternatives to inverted files. This paper describes
TopSig, a new approach to the construction of file signatures. Many advances in
semantic hashing and dimensionality reduction have been made in recent times,
but these were not so far linked to general purpose, signature file based,
search engines. This paper introduces a different signature file approach that
builds upon and extends these recent advances. We are able to demonstrate
significant improvements in the performance of signature file based indexing
and retrieval, performance that is comparable to that of state of the art
inverted file based systems, including Language models and BM25. These findings
suggest that file signatures offer a viable alternative to inverted files in
suitable settings and from the theoretical perspective it positions the file
signatures model in the class of Vector Space retrieval models.Comment: 12 pages, 8 figures, CIKM 201
Squarepants in a Tree: Sum of Subtree Clustering and Hyperbolic Pants Decomposition
We provide efficient constant factor approximation algorithms for the
problems of finding a hierarchical clustering of a point set in any metric
space, minimizing the sum of minimimum spanning tree lengths within each
cluster, and in the hyperbolic or Euclidean planes, minimizing the sum of
cluster perimeters. Our algorithms for the hyperbolic and Euclidean planes can
also be used to provide a pants decomposition, that is, a set of disjoint
simple closed curves partitioning the plane minus the input points into subsets
with exactly three boundary components, with approximately minimum total
length. In the Euclidean case, these curves are squares; in the hyperbolic
case, they combine our Euclidean square pants decomposition with our tree
clustering method for general metric spaces.Comment: 22 pages, 14 figures. This version replaces the proof of what is now
Lemma 5.2, as the previous proof was erroneou
Efficient Multi-Robot Coverage of a Known Environment
This paper addresses the complete area coverage problem of a known
environment by multiple-robots. Complete area coverage is the problem of moving
an end-effector over all available space while avoiding existing obstacles. In
such tasks, using multiple robots can increase the efficiency of the area
coverage in terms of minimizing the operational time and increase the
robustness in the face of robot attrition. Unfortunately, the problem of
finding an optimal solution for such an area coverage problem with multiple
robots is known to be NP-complete. In this paper we present two approximation
heuristics for solving the multi-robot coverage problem. The first solution
presented is a direct extension of an efficient single robot area coverage
algorithm, based on an exact cellular decomposition. The second algorithm is a
greedy approach that divides the area into equal regions and applies an
efficient single-robot coverage algorithm to each region. We present
experimental results for two algorithms. Results indicate that our approaches
provide good coverage distribution between robots and minimize the workload per
robot, meanwhile ensuring complete coverage of the area.Comment: In proceedings of IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), 201
From Data Topology to a Modular Classifier
This article describes an approach to designing a distributed and modular
neural classifier. This approach introduces a new hierarchical clustering that
enables one to determine reliable regions in the representation space by
exploiting supervised information. A multilayer perceptron is then associated
with each of these detected clusters and charged with recognizing elements of
the associated cluster while rejecting all others. The obtained global
classifier is comprised of a set of cooperating neural networks and completed
by a K-nearest neighbor classifier charged with treating elements rejected by
all the neural networks. Experimental results for the handwritten digit
recognition problem and comparison with neural and statistical nonmodular
classifiers are given
- …