Hierarchical Clustering Using Level Sets
Over the past several decades, clustering algorithms have earned their place as a go-to solution for database mining. This paper introduces a new concept, the level set, and uses it to develop Level-Set Clustering (LSC), a recursive version of DBSCAN that can successfully perform hierarchical clustering. A level set is a subset of points of a data set whose densities exceed some threshold t. Graphing the size of each level set against its respective t produces indents in the line graph that correspond to clusters in the data set, since the points in a cluster have very similar densities. The new algorithm produces its clustering result with the same O(n log n) time complexity as DBSCAN and OPTICS, while catching clusters the others missed.
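To make the size-vs-threshold idea concrete, here is a minimal Python sketch (not the paper's LSC implementation, whose details are beyond this abstract): it uses a crude kNN density proxy and traces how the level-set size shrinks as t rises; flat stretches, the "indents" above, correspond to clusters.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def level_set_sizes(X, k=10, num_thresholds=100):
    # Density proxy: inverse distance to the k-th nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    density = 1.0 / dists[:, -1]
    # Size of the level set {x : density(x) > t} over a grid of thresholds.
    ts = np.linspace(density.min(), density.max(), num_thresholds)
    sizes = np.array([(density > t).sum() for t in ts])
    return ts, sizes

# Two well-separated blobs: the curve drops steeply between clusters and
# flattens inside them, producing the "indents" described above.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 1, (200, 2))])
ts, sizes = level_set_sizes(X)
```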
Quickshift++: Provably Good Initializations for Sample-Based Mean Shift
We provide initial seedings to the Quick Shift clustering algorithm, which
approximate the locally high-density regions of the data. Such seedings act as
more stable and expressive cluster-cores than the singleton modes found by
Quick Shift. We establish statistical consistency guarantees for this
modification. We then show strong clustering performance on real datasets as
well as promising applications to image segmentation.
Comment: ICML 2018. Code release: https://github.com/google/quickshif
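For readers who want to try it, a hypothetical usage sketch against the released code: the class name QuickshiftPP and the parameters k (neighbors for the density estimate) and beta (how far below a mode's density a point may fall and still join its cluster-core) are assumptions about the repository's Python interface, not a verified API.

```python
import numpy as np
from QuickshiftPP import QuickshiftPP  # assumed module/class name from the code release

# k and beta as described above; both names are assumptions.
model = QuickshiftPP(k=20, beta=0.9)

X = np.random.default_rng(0).normal(size=(500, 2))
model.fit(X)                # assumed scikit-learn-style fit
labels = model.memberships  # assumed attribute holding cluster labels
```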
Introduction to the R package TDA
We present a short tutorial and introduction to using the R package TDA,
which provides some tools for Topological Data Analysis. In particular, it
includes implementations of functions that, given some data, provide
topological information about the underlying space, such as the distance
function, the distance to a measure, the kNN density estimator, the kernel
density estimator, and the kernel distance. The salient topological features of
the sublevel sets (or superlevel sets) of these functions can be quantified
with persistent homology. We provide an R interface for the efficient
algorithms of the C++ libraries GUDHI, Dionysus and PHAT, including a function
for the persistent homology of the Rips filtration, and one for the persistent
homology of sublevel sets (or superlevel sets) of arbitrary functions evaluated
over a grid of points. The significance of the features in the resulting
persistence diagrams can be analyzed with functions that implement recently
developed statistical methods. The R package TDA also includes the
implementation of an algorithm for density clustering, which allows us to
identify the spatial organization of the probability mass associated with a
density function and visualize it by means of a dendrogram, the cluster tree.
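The R interface is the paper's contribution; as a rough cross-language illustration, the Rips-filtration persistence that TDA delegates to GUDHI can also be computed with GUDHI's own Python bindings (the gudhi package):

```python
import numpy as np
import gudhi  # Python bindings for the GUDHI C++ library

# Points on a noisy circle: the persistence diagram should show one
# long-lived 1-dimensional feature (the loop) besides the 0-dim components.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (100, 2))

rips = gudhi.RipsComplex(points=X, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)
diag = st.persistence()                # list of (dim, (birth, death)) pairs
print([p for p in diag if p[0] == 1])  # the salient loop(s)
```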
Finite-Sample Analysis of Fixed-k Nearest Neighbor Density Functional Estimators
We provide finite-sample analysis of a general framework for using k-nearest
neighbor statistics to estimate functionals of a nonparametric continuous
probability density, including entropies and divergences. Rather than plugging
a consistent density estimate (which requires $k \to \infty$ as the sample size
$n \to \infty$) into the functional of interest, the estimators we consider fix
$k$ and perform a bias correction. This is more efficient computationally, and,
as we show in certain cases, statistically, leading to faster convergence
rates. Our framework unifies several previous estimators, for most of which
ours are the first finite sample guarantees.
Comment: 16 pages, 0 figures
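One classical instance of this framework is the Kozachenko-Leonenko entropy estimator, where fixed-k digamma terms supply the bias correction; a minimal sketch of that standard formula (not code from the paper):

```python
import numpy as np
from scipy.special import digamma, gammaln
from sklearn.neighbors import NearestNeighbors

def kl_entropy(X, k=3):
    """Kozachenko-Leonenko kNN estimate of differential entropy (in nats).

    digamma(n) - digamma(k) is the fixed-k bias correction: k stays constant
    as n grows, rather than letting a density estimate converge."""
    n, d = X.shape
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    eps = dists[:, -1]  # distance to the k-th nearest neighbor
    log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)  # log volume of unit d-ball
    return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))

# Sanity check: a standard 1-D normal has entropy 0.5*log(2*pi*e) ~ 1.4189 nats.
X = np.random.default_rng(0).normal(size=(5000, 1))
print(kl_entropy(X, k=3))
```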
Consistent procedures for cluster tree estimation and pruning
For a density $f$ on $\mathbb{R}^d$, a {\it high-density cluster} is any
connected component of $\{x : f(x) \geq \lambda\}$, for some $\lambda > 0$. The
set of all high-density clusters forms a hierarchy called the {\it cluster
tree} of $f$. We present two procedures for estimating the cluster tree given
samples from $f$. The first is a robust variant of the single linkage algorithm
for hierarchical clustering. The second is based on the $k$-nearest neighbor
graph of the samples. We give finite-sample convergence rates for these
algorithms which also imply consistency, and we derive lower bounds on the
sample complexity of cluster tree estimation. Finally, we study a tree pruning
procedure that guarantees, under milder conditions than usual, to remove
clusters that are spurious while recovering those that are salient.
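A minimal sketch of the core idea behind the second procedure, under the simplifying assumption of a kNN density proxy and without the paper's rates, consistency analysis, or pruning step: restrict the kNN graph to points of estimated density at least lambda, and read off connected components as the clusters at that level.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components
from sklearn.neighbors import NearestNeighbors, kneighbors_graph

def cluster_tree_levels(X, k=10, num_levels=20):
    # Density proxy: inverse distance to the k-th nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dists, _ = nn.kneighbors(X)
    density = 1.0 / dists[:, -1]
    # Undirected kNN graph of the sample.
    G = kneighbors_graph(X, n_neighbors=k, mode="connectivity")
    G = G.maximum(G.T)  # symmetrize
    levels = []
    for lam in np.linspace(density.max(), density.min(), num_levels):
        keep = np.where(density >= lam)[0]       # high-density points at level lam
        sub = G[keep][:, keep]                   # kNN graph restricted to them
        _, labels = connected_components(sub, directed=False)
        levels.append((lam, keep, labels))       # clusters at this level
    return levels
```

Sweeping lambda from high to low traces the hierarchy: components appear, grow, and merge, which is exactly the nesting structure the cluster tree records.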