Learning by Unsupervised Nonlinear Diffusion
This paper proposes and analyzes a novel clustering algorithm that combines
graph-based diffusion geometry with techniques based on density and mode
estimation. The proposed method is suitable for data generated from mixtures of
distributions with densities that are both multimodal and have nonlinear
shapes. A crucial aspect of this algorithm is the use of the time of a data-adapted
diffusion process as a scale parameter that is different from the local spatial
scale parameter used in many clustering algorithms. We prove estimates for the
behavior of diffusion distances with respect to this time parameter under a
flexible nonparametric data model, identifying a range of times in which the
mesoscopic equilibria of the underlying process are revealed, corresponding to
a gap between within-cluster and between-cluster diffusion distances. These
structures can be missed by the top eigenvectors of the graph Laplacian,
commonly used in spectral clustering. This analysis is leveraged to prove
sufficient conditions guaranteeing the accuracy of the proposed \emph{learning
by unsupervised nonlinear diffusion (LUND)} procedure. We implement LUND and
confirm its theoretical properties on illustrative datasets, demonstrating the
theoretical and empirical advantages over both spectral clustering and
density-based clustering techniques.Comment: 40 Pages, 17 Figure
Estimating number of speakers via density-based clustering and classification decision
It is crucial to robustly estimate the number of speakers (NoS) from recorded audio mixtures in a reverberant environment. Some popular time-frequency (TF) methods approach this NoS estimation problem by assuming that only one speech component is active at each TF slot. However, this condition is violated in many scenarios where the speeches are convolved with long room impulse responses, which degrades the performance of NoS estimation. To tackle this problem, a density-based clustering strategy is proposed to estimate NoS based on a local dominance assumption of speeches. Our method consists of several steps, from clustering to the classification of speakers, with robustness in mind. First, the leading eigenvectors are extracted from the local covariance matrices of the mixture TF components and ranked by the combination of local density and minimum distance to other leading eigenvectors of higher density. Second, a gap-based method is employed to determine the cluster centers from the ranked leading eigenvectors at each frequency bin. Third, a criterion based on the averaged volume of cluster centers is proposed to select reliable clustering results at some frequency bins for the classification decision on NoS. The experimental results demonstrate that the proposed algorithm is superior to existing methods in various reverberation cases, under both noise-free and noisy conditions.
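The ranking-plus-gap step can be sketched on synthetic vectors standing in for the leading eigenvectors at one frequency bin. The density/distance score below is a Rodriguez-Laio-style density-peaks ranking, a plausible reading of the abstract's description rather than the paper's exact procedure; the cutoff-distance heuristic and cluster geometry are our assumptions.

```python
import numpy as np

def num_centers_by_gap(V):
    """Rank candidate vectors by local density times distance-to-denser
    (a Rodriguez-Laio-style density-peaks score, standing in for the
    paper's ranking of leading eigenvectors) and choose the number of
    cluster centers at the largest gap in the sorted scores."""
    n = len(V)
    D = np.sqrt(((V[:, None] - V[None, :]) ** 2).sum(-1))
    dc = np.median(D)                         # cutoff distance (heuristic)
    rho = np.exp(-(D / dc) ** 2).sum(1)       # local density
    order = np.argsort(-rho)
    delta = np.empty(n)
    delta[order[0]] = D[order[0]].max()
    for rank, i in enumerate(order[1:], 1):
        delta[i] = D[i, order[:rank]].min()   # distance to a denser point
    score = np.sort(rho * delta)[::-1]        # ranked scores, descending
    return int(np.argmax(score[:-1] - score[1:])) + 1  # largest gap

rng = np.random.default_rng(1)
# three tight clusters of synthetic 'leading eigenvector' candidates
V = np.vstack([rng.normal(m, 0.05, (40, 3)) for m in (0.0, 1.0, 2.0)])
k = num_centers_by_gap(V)
```

Only the cluster centers get both high density and a large distance to any denser point, so the sorted score drops sharply after the true number of centers — the gap the method exploits.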
General framework for projection structures
In the first part, we develop a general framework for projection structures
and study several inference problems within this framework. We propose
procedures based on data dependent measures (DDM) and make connections with
empirical Bayes and penalization methods. The main inference problem is the
uncertainty quantification (UQ), but along the way we solve the estimation
and DDM-contraction problems, as well as a weak version of the structure recovery problem.
The approach is local in that the quality of the inference procedures is
measured by a local quantity, the oracle rate, which is the best trade-off
between the approximation error by a projection structure and the complexity of
that approximating projection structure. As in statistical learning settings,
we develop a distribution-free theory: no particular model is imposed, and we
only assume a mild condition on the stochastic part of the projection
predictor. We introduce the excessive bias restriction (EBR) under which we
establish the local confidence optimality of the constructed confidence ball.
The proposed general framework unifies a very broad class of high-dimensional
models and structures, interesting and important in their own right. In the
second part, we apply the developed theory and demonstrate how the general
results deliver a whole avenue of local and global minimax results (many new,
and some improving known results from the literature) for particular
models and structures as consequences, including white noise model and density
estimation with smoothness structure, linear regression and dictionary learning
with sparsity structures, biclustering and stochastic block models with
clustering structure, covariance matrix estimation with banding and sparsity
structures, and many others. Various adaptive minimax results over various
scales also follow from our local results.
Comment: 89 pages.
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
This paper documents the release of the ELKI data mining framework, version
0.7.5.
ELKI is open-source (AGPLv3) data mining software written in Java. The
focus of ELKI is research in algorithms, with an emphasis on unsupervised
methods in cluster analysis and outlier detection. In order to achieve high
performance and scalability, ELKI offers data index structures such as the
R*-tree that can provide major performance gains. ELKI is designed to be easy
to extend for researchers and students in this domain, and welcomes
contributions of additional methods. ELKI aims at providing a large collection
of highly parameterizable algorithms, in order to allow easy and fair
evaluation and benchmarking of algorithms.
We will first outline the motivation for this release and the plans for the
future, and then give a brief overview of the new functionality in this
version. We also include an appendix presenting an overview of the overall
implemented functionality.
Density Level Sets: Asymptotics, Inference, and Visualization
We derive asymptotic theory for the plug-in estimate of density level sets
under Hausdorff loss. Based on the asymptotic theory, we propose two bootstrap
confidence regions for level sets. The confidence regions can be used to
perform tests for anomaly detection and clustering. We also introduce a
technique to visualize high dimensional density level sets by combining mode
clustering and multidimensional scaling.
Comment: Accepted to JASA-T&M. 40 pages, 11 figures.
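The plug-in estimator at the heart of the analysis is simple to state: estimate the density with a KDE and threshold it. A minimal sketch, assuming a 2-D Gaussian KDE on a grid (the paper's contribution — asymptotics and bootstrap confidence regions — sits on top of this):

```python
import numpy as np

def plugin_level_set(X, grid, h, lam):
    """Plug-in estimator of the density level set {x : p_hat(x) >= lam},
    with p_hat a 2-D Gaussian KDE of bandwidth h (an illustrative sketch;
    the paper's inference adds bootstrap confidence regions on top)."""
    d2 = ((grid[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    p_hat = np.exp(-d2 / (2 * h ** 2)).mean(1) / (2 * np.pi * h ** 2)
    return grid[p_hat >= lam]

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (500, 2))                   # standard normal sample
g = np.linspace(-3, 3, 25)
grid = np.array([(a, b) for a in g for b in g])  # evaluation grid
S = plugin_level_set(X, grid, h=0.5, lam=0.05)   # estimated level set
```

For a standard normal sample, the estimated set at level 0.05 is (approximately) a disc around the origin; Hausdorff loss measures how far this estimate's boundary can stray from the true contour.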
Quickshift++: Provably Good Initializations for Sample-Based Mean Shift
We provide initial seedings to the Quick Shift clustering algorithm, which
approximate the locally high-density regions of the data. Such seedings act as
more stable and expressive cluster-cores than the singleton modes found by
Quick Shift. We establish statistical consistency guarantees for this
modification. We then show strong clustering performance on real datasets as
well as promising applications to image segmentation.
Comment: ICML 2018. Code release: https://github.com/google/quickshif
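A simplified sketch of the cluster-core idea (not the released code): estimate density with k-nearest-neighbor distances, take connected components of a high-density level set as cluster cores, then let the remaining points climb to their nearest higher-density neighbor, Quick Shift style. The single global density threshold below is our simplification of the paper's per-mode construction.

```python
import numpy as np

def quickshift_pp_sketch(X, k=10, beta=0.5):
    """Simplified Quickshift++-style clustering: (1) kNN density,
    (2) cluster cores = connected components of a density level set,
    (3) remaining points assigned via nearest higher-density neighbor."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    r_k = np.sort(D, 1)[:, k]          # distance to k-th neighbor
    rho = 1.0 / r_k                    # kNN density (up to constants)
    core = rho >= beta * rho.max()     # global threshold (simplification)
    label = -np.ones(n, int)
    c = 0                              # connected components of core points
    for i in np.flatnonzero(core):
        if label[i] >= 0:
            continue
        stack, label[i] = [i], c
        while stack:
            j = stack.pop()
            for nb in np.argsort(D[j])[:2 * k]:   # generous neighbor set
                if core[nb] and label[nb] < 0:
                    label[nb] = c
                    stack.append(nb)
        c += 1
    # Quick Shift step: non-core points follow nearest denser neighbor
    for i in np.argsort(-rho):
        if label[i] < 0:
            denser = np.flatnonzero(rho > rho[i])
            label[i] = label[denser[D[i, denser].argmin()]]
    return label

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
labels = quickshift_pp_sketch(X)
```

The cores play the role of stable seedings: a whole connected high-density region, rather than a single mode point, anchors each cluster.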
Kernel clustering: density biases and solutions
Kernel methods are popular in clustering due to their generality and
discriminating power. However, we show that many kernel clustering criteria
have density biases that theoretically explain some practically significant
artifacts empirically observed in the past. For example, we provide conditions
and formally prove the density mode isolation bias in kernel K-means for a
common class of kernels. We call it Breiman's bias due to its similarity to the
histogram mode isolation previously discovered by Breiman in decision tree
learning with Gini impurity. We also extend our analysis to other popular
kernel clustering methods, e.g. average/normalized cut or dominant sets, where
density biases can take different forms. For example, splitting isolated points
by cut-based criteria is essentially the sparsest subset bias, which is the
opposite of the density mode bias. Our findings suggest that a principled
solution for density biases in kernel clustering should directly address data
inhomogeneity. We show that density equalization can be implicitly achieved
using either locally adaptive weights or locally adaptive kernels. Moreover,
density equalization makes many popular kernel clustering objectives
equivalent. Our synthetic and real data experiments illustrate density biases
and proposed solutions. We anticipate that theoretical understanding of kernel
clustering limitations and their principled solutions will be important for a
broad spectrum of data analysis applications across the disciplines.
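The density bias and the "locally adaptive kernels" remedy can be made concrete on inhomogeneous data. Below, a fixed-bandwidth Gaussian kernel gives dense-region points far larger graph degrees than sparse-region points, while a locally scaled kernel (in the style of Zelnik-Manor and Perona, our choice of stand-in) roughly equalizes them. The data, parameters, and names are illustrative.

```python
import numpy as np

def affinity_degrees(X, k=7, sigma=0.5):
    """Compare node degrees under a fixed-bandwidth Gaussian kernel and a
    locally scaled one (s_i = distance to the k-th neighbor). Local
    scaling implicitly equalizes density across the sample."""
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    K_fixed = np.exp(-D ** 2 / (2 * sigma ** 2))
    s = np.sort(D, 1)[:, k]                    # local scale per point
    K_adapt = np.exp(-D ** 2 / np.outer(s, s))
    return K_fixed.sum(1), K_adapt.sum(1)      # node degrees

rng = np.random.default_rng(4)
# inhomogeneous data: a dense blob and a sparse blob
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal((5, 0), 1.0, (50, 2))])
deg_fixed, deg_adapt = affinity_degrees(X)
# mean-degree ratio (dense blob / sparse blob): ~1 means equalized
ratio_fixed = deg_fixed[:50].mean() / deg_fixed[50:].mean()
ratio_adapt = deg_adapt[:50].mean() / deg_adapt[50:].mean()
```

The skewed degrees of the fixed kernel are exactly the kind of data inhomogeneity that, per the abstract, drives mode-isolation and sparsest-subset biases in kernel clustering objectives.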
A comparison of bandwidth selectors for mean shift clustering
We explore the performance of several automatic bandwidth selectors,
originally designed for density gradient estimation, as data-based procedures
for nonparametric, modal clustering. The key tool to obtain a clustering from
density gradient estimators is the mean shift algorithm, which allows one to obtain
a partition not only of the data sample, but also of the whole space. The
results of our simulation study suggest that most of the methods considered
here, like cross-validation and plug-in bandwidth selectors, are useful for
cluster analysis via the mean shift algorithm.
Comment: 13 pages, 1 figure.
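The pipeline the paper evaluates — pick a bandwidth with a data-based selector, then cluster by mean shift — can be sketched as follows. The normal-reference rule below is only a crude stand-in for the cross-validation and plug-in selectors actually compared; its constant is motivated by the normal-scale rule for gradient estimation and should not be read as any of the paper's selectors.

```python
import numpy as np

def normal_scale_bandwidth(X):
    """Normal-reference bandwidth aimed at density *gradient* estimation
    in d dimensions (a crude stand-in for the selectors compared in the
    paper; the constant is only normal-reference-motivated)."""
    n, d = X.shape
    sigma = X.std(0, ddof=1).mean()
    return sigma * (4.0 / (n * (d + 4))) ** (1.0 / (d + 6))

def mean_shift(X, h, iters=50):
    """Gaussian mean shift: repeatedly move each point to the
    kernel-weighted average of the (fixed) sample."""
    Y = X.copy()
    for _ in range(iters):
        d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2 * h ** 2))
        Y = W @ X / W.sum(1, keepdims=True)
    return Y

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(3, 0.3, (60, 2))])
h = normal_scale_bandwidth(X)
Y = mean_shift(X, h)
# group points whose mean-shift iterates landed within h of each other
modes, labels = [], np.empty(len(X), int)
for i, y in enumerate(Y):
    for m, mode in enumerate(modes):
        if np.linalg.norm(y - mode) < h:
            labels[i] = m
            break
    else:
        modes.append(y)
        labels[i] = len(modes) - 1
```

Because mean shift can be run from any starting point, not just the sample, the same machinery partitions the whole space, as the abstract notes.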
Nonparametric modal regression
Modal regression estimates the local modes of the distribution of Y given
X = x, instead of the mean, as in the usual regression sense, and can hence
reveal important structure missed by usual regression methods. We study a
simple nonparametric method for modal regression, based on a kernel density
estimate (KDE) of the joint distribution of X and Y. We derive asymptotic
error bounds for this method, and propose techniques for constructing
confidence sets and prediction sets. The latter is used to select the smoothing
bandwidth of the underlying KDE. The idea behind modal regression is connected
to many others, such as mixture regression and density ridge estimation, and we
discuss these ties as well.
Comment: Published at http://dx.doi.org/10.1214/15-AOS1373 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
A Fuzzy Clustering Algorithm for the Mode Seeking Framework
In this paper, we propose a new fuzzy clustering algorithm based on the
mode-seeking framework. Given a dataset in $\mathbb{R}^d$, we define regions of
high density that we call cluster cores. We then consider a random walk on a
neighborhood graph built on top of our data points which is designed to be
attracted by high density regions. The strength of this attraction is
controlled by a temperature parameter. The membership of a point in
a given cluster is then the probability that the random walk hits the
corresponding cluster core before any other. While many properties of random
walks (such as hitting times, commute distances, etc.) have been shown to
eventually encode purely local information when the number of data points
grows, we show that the regularization introduced by the use of cluster cores
solves this issue. Empirically, we show how the choice of temperature influences
the behavior of our algorithm: for small temperatures the result is close
to hard mode-seeking, whereas for larger temperatures the result is similar
to the output of a (fuzzy) spectral clustering. Finally, we demonstrate the
scalability of our approach by providing the fuzzy clustering of a protein
configuration dataset containing a million data points.
Comment: Submitted to Pattern Recognition Letters.
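The membership computation described above is an absorption-probability problem: the probability of hitting one cluster core before the other solves a linear system in the walk's transition matrix. The sketch below is our illustration of that idea on a Gaussian neighborhood graph — the kernel, the temperature's role, and the hand-picked cores are assumptions, not the paper's exact construction.

```python
import numpy as np

def hitting_membership(X, coreA, coreB, T=1.0):
    """Fuzzy membership sketch: probability that a random walk on a
    Gaussian neighborhood graph hits cluster core A before core B,
    computed as an absorption probability (temperature T scales the
    kernel; an illustration, not the paper's exact construction)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-d2 / (2 * T))
    P /= P.sum(1, keepdims=True)              # random-walk transition matrix
    n = len(X)
    free = np.setdiff1d(np.arange(n), np.concatenate([coreA, coreB]))
    # absorption probabilities: (I - P_ff) u_free = P[free, coreA] @ 1
    A = np.eye(len(free)) - P[np.ix_(free, free)]
    u = np.ones(n)                            # core A points: membership 1
    u[coreB] = 0.0                            # core B points: membership 0
    u[free] = np.linalg.solve(A, P[np.ix_(free, coreA)].sum(1))
    return u

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2))])
coreA, coreB = np.arange(5), np.arange(40, 45)   # cores fixed by hand here
u = hitting_membership(X, coreA, coreB)
```

Because the cores absorb the walk before it can wander far, the memberships stay sharply separated — the regularization effect the abstract credits with preventing hitting probabilities from degenerating to purely local information.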