Learning by Unsupervised Nonlinear Diffusion
This paper proposes and analyzes a novel clustering algorithm that combines
graph-based diffusion geometry with techniques based on density and mode
estimation. The proposed method is suitable for data generated from mixtures of
distributions with densities that are both multimodal and have nonlinear
shapes. A crucial aspect of this algorithm is the use of the time of a data-adapted
diffusion process as a scale parameter that is different from the local spatial
scale parameter used in many clustering algorithms. We prove estimates for the
behavior of diffusion distances with respect to this time parameter under a
flexible nonparametric data model, identifying a range of times in which the
mesoscopic equilibria of the underlying process are revealed, corresponding to
a gap between within-cluster and between-cluster diffusion distances. These
structures can be missed by the top eigenvectors of the graph Laplacian,
commonly used in spectral clustering. This analysis is leveraged to prove
sufficient conditions guaranteeing the accuracy of the proposed \emph{learning
by unsupervised nonlinear diffusion (LUND)} procedure. We implement LUND and
confirm its theoretical properties on illustrative datasets, demonstrating the
theoretical and empirical advantages over both spectral clustering and
density-based clustering techniques.Comment: 40 Pages, 17 Figure
Estimating number of speakers via density-based clustering and classification decision
It is crucial to robustly estimate the number of speakers (NoS) from recorded audio mixtures in a reverberant environment. Some popular time-frequency (TF) methods approach this NoS estimation problem by assuming that only one speech component is active at each TF slot. However, this condition is violated in many scenarios where the speeches are convolved with long room impulse responses, which degrades the performance of NoS estimation. To tackle this problem, a density-based clustering strategy is proposed to estimate NoS based on a local dominance assumption of speeches. Our method consists of several steps, from clustering to the classification of speakers, with robustness in mind. First, the leading eigenvectors are extracted from the local covariance matrices of the mixture TF components and ranked by the combination of local density and minimum distance to other leading eigenvectors of higher density. Second, a gap-based method is employed to determine the cluster centers from the ranked leading eigenvectors at each frequency bin. Third, a criterion based on the averaged volume of cluster centers is proposed to select reliable clustering results at some frequency bins for the classification decision on NoS. The experimental results demonstrate that the proposed algorithm is superior to existing methods in various reverberation cases, under both noise-free and noisy conditions.
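The ranking-plus-gap step can be sketched on synthetic vectors standing in for the leading eigenvectors at one frequency bin. The density/distance score below is a Rodriguez-Laio-style density-peaks ranking, a plausible reading of the abstract's description rather than the paper's exact procedure; the cutoff-distance heuristic and cluster geometry are our assumptions.

```python
import numpy as np

def num_centers_by_gap(V):
    """Rank candidate vectors by local density times distance-to-denser
    (a Rodriguez-Laio-style density-peaks score, standing in for the
    paper's ranking of leading eigenvectors) and choose the number of
    cluster centers at the largest gap in the sorted scores."""
    n = len(V)
    D = np.sqrt(((V[:, None] - V[None, :]) ** 2).sum(-1))
    dc = np.median(D)                         # cutoff distance (heuristic)
    rho = np.exp(-(D / dc) ** 2).sum(1)       # local density
    order = np.argsort(-rho)
    delta = np.empty(n)
    delta[order[0]] = D[order[0]].max()
    for rank, i in enumerate(order[1:], 1):
        delta[i] = D[i, order[:rank]].min()   # distance to a denser point
    score = np.sort(rho * delta)[::-1]        # ranked scores, descending
    return int(np.argmax(score[:-1] - score[1:])) + 1  # largest gap

rng = np.random.default_rng(1)
# three tight clusters of synthetic 'leading eigenvector' candidates
V = np.vstack([rng.normal(m, 0.05, (40, 3)) for m in (0.0, 1.0, 2.0)])
k = num_centers_by_gap(V)
```

Only the cluster centers get both high density and a large distance to any denser point, so the sorted score drops sharply after the true number of centers — the gap the method exploits.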
General framework for projection structures
In the first part, we develop a general framework for projection structures
and study several inference problems within this framework. We propose
procedures based on data dependent measures (DDM) and make connections with
empirical Bayes and penalization methods. The main inference problem is the
uncertainty quantification (UQ), but along the way we solve the estimation
and DDM-contraction problems, as well as a weak version of the structure recovery problem.
The approach is local in that the quality of the inference procedures is
measured by a local quantity, the oracle rate, which is the best trade-off
between the approximation error by a projection structure and the complexity of
that approximating projection structure. As in statistical learning settings,
we develop a distribution-free theory: no particular model is imposed, and we
only assume a mild condition on the stochastic part of the projection
predictor. We introduce the excessive bias restriction (EBR) under which we
establish the local confidence optimality of the constructed confidence ball.
The proposed general framework unifies a very broad class of high-dimensional
models and structures, interesting and important in their own right. In the
second part, we apply the developed theory and demonstrate how the general
results deliver a whole avenue of local and global minimax results (many new,
and some improving known results from the literature) for particular
models and structures as consequences, including white noise model and density
estimation with smoothness structure, linear regression and dictionary learning
with sparsity structures, biclustering and stochastic block models with
clustering structure, covariance matrix estimation with banding and sparsity
structures, and many others. Various adaptive minimax results over various
scales also follow from our local results.
Comment: 89 pages.
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
This paper documents the release of the ELKI data mining framework, version
0.7.5.
ELKI is open-source (AGPLv3) data mining software written in Java. The
focus of ELKI is research in algorithms, with an emphasis on unsupervised
methods in cluster analysis and outlier detection. In order to achieve high
performance and scalability, ELKI offers data index structures such as the
R*-tree that can provide major performance gains. ELKI is designed to be easy
to extend for researchers and students in this domain, and welcomes
contributions of additional methods. ELKI aims at providing a large collection
of highly parameterizable algorithms, in order to allow easy and fair
evaluation and benchmarking of algorithms.
We will first outline the motivation for this release and the plans for the
future, and then give a brief overview of the new functionality in this
version. We also include an appendix presenting an overview of the overall
implemented functionality.
Density Level Sets: Asymptotics, Inference, and Visualization
We derive asymptotic theory for the plug-in estimate of density level sets
under Hausdorff loss. Based on the asymptotic theory, we propose two bootstrap
confidence regions for level sets. The confidence regions can be used to
perform tests for anomaly detection and clustering. We also introduce a
technique to visualize high dimensional density level sets by combining mode
clustering and multidimensional scaling.
Comment: Accepted to JASA-T&M. 40 pages, 11 figures.
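The plug-in estimator at the heart of the analysis is simple to state: estimate the density with a KDE and threshold it. A minimal sketch, assuming a 2-D Gaussian KDE on a grid (the paper's contribution — asymptotics and bootstrap confidence regions — sits on top of this):

```python
import numpy as np

def plugin_level_set(X, grid, h, lam):
    """Plug-in estimator of the density level set {x : p_hat(x) >= lam},
    with p_hat a 2-D Gaussian KDE of bandwidth h (an illustrative sketch;
    the paper's inference adds bootstrap confidence regions on top)."""
    d2 = ((grid[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    p_hat = np.exp(-d2 / (2 * h ** 2)).mean(1) / (2 * np.pi * h ** 2)
    return grid[p_hat >= lam]

rng = np.random.default_rng(2)
X = rng.normal(0, 1, (500, 2))                   # standard normal sample
g = np.linspace(-3, 3, 25)
grid = np.array([(a, b) for a in g for b in g])  # evaluation grid
S = plugin_level_set(X, grid, h=0.5, lam=0.05)   # estimated level set
```

For a standard normal sample, the estimated set at level 0.05 is (approximately) a disc around the origin; Hausdorff loss measures how far this estimate's boundary can stray from the true contour.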
Quickshift++: Provably Good Initializations for Sample-Based Mean Shift
We provide initial seedings to the Quick Shift clustering algorithm, which
approximate the locally high-density regions of the data. Such seedings act as
more stable and expressive cluster-cores than the singleton modes found by
Quick Shift. We establish statistical consistency guarantees for this
modification. We then show strong clustering performance on real datasets as
well as promising applications to image segmentation.
Comment: ICML 2018. Code release: https://github.com/google/quickshif
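A simplified sketch of the cluster-core idea (not the released code): estimate density with k-nearest-neighbor distances, take connected components of a high-density level set as cluster cores, then let the remaining points climb to their nearest higher-density neighbor, Quick Shift style. The single global density threshold below is our simplification of the paper's per-mode construction.

```python
import numpy as np

def quickshift_pp_sketch(X, k=10, beta=0.5):
    """Simplified Quickshift++-style clustering: (1) kNN density,
    (2) cluster cores = connected components of a density level set,
    (3) remaining points assigned via nearest higher-density neighbor."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    r_k = np.sort(D, 1)[:, k]          # distance to k-th neighbor
    rho = 1.0 / r_k                    # kNN density (up to constants)
    core = rho >= beta * rho.max()     # global threshold (simplification)
    label = -np.ones(n, int)
    c = 0                              # connected components of core points
    for i in np.flatnonzero(core):
        if label[i] >= 0:
            continue
        stack, label[i] = [i], c
        while stack:
            j = stack.pop()
            for nb in np.argsort(D[j])[:2 * k]:   # generous neighbor set
                if core[nb] and label[nb] < 0:
                    label[nb] = c
                    stack.append(nb)
        c += 1
    # Quick Shift step: non-core points follow nearest denser neighbor
    for i in np.argsort(-rho):
        if label[i] < 0:
            denser = np.flatnonzero(rho > rho[i])
            label[i] = label[denser[D[i, denser].argmin()]]
    return label

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(3, 0.2, (50, 2))])
labels = quickshift_pp_sketch(X)
```

The cores play the role of stable seedings: a whole connected high-density region, rather than a single mode point, anchors each cluster.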
Kernel clustering: density biases and solutions
Kernel methods are popular in clustering due to their generality and
discriminating power. However, we show that many kernel clustering criteria
have density biases that theoretically explain some practically significant
artifacts empirically observed in the past. For example, we provide conditions
and formally prove the density mode isolation bias in kernel K-means for a
common class of kernels. We call it Breiman's bias due to its similarity to the
histogram mode isolation previously discovered by Breiman in decision tree
learning with Gini impurity. We also extend our analysis to other popular
kernel clustering methods, e.g. average/normalized cut or dominant sets, where
density biases can take different forms. For example, splitting isolated points
by cut-based criteria is essentially the sparsest subset bias, which is the
opposite of the density mode bias. Our findings suggest that a principled
solution for density biases in kernel clustering should directly address data
inhomogeneity. We show that density equalization can be implicitly achieved
using either locally adaptive weights or locally adaptive kernels. Moreover,
density equalization makes many popular kernel clustering objectives
equivalent. Our synthetic and real data experiments illustrate density biases
and proposed solutions. We anticipate that theoretical understanding of kernel
clustering limitations and their principled solutions will be important for a
broad spectrum of data analysis applications across the disciplines.
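The density bias and the "locally adaptive kernels" remedy can be made concrete on inhomogeneous data. Below, a fixed-bandwidth Gaussian kernel gives dense-region points far larger graph degrees than sparse-region points, while a locally scaled kernel (in the style of Zelnik-Manor and Perona, our choice of stand-in) roughly equalizes them. The data, parameters, and names are illustrative.

```python
import numpy as np

def affinity_degrees(X, k=7, sigma=0.5):
    """Compare node degrees under a fixed-bandwidth Gaussian kernel and a
    locally scaled one (s_i = distance to the k-th neighbor). Local
    scaling implicitly equalizes density across the sample."""
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    K_fixed = np.exp(-D ** 2 / (2 * sigma ** 2))
    s = np.sort(D, 1)[:, k]                    # local scale per point
    K_adapt = np.exp(-D ** 2 / np.outer(s, s))
    return K_fixed.sum(1), K_adapt.sum(1)      # node degrees

rng = np.random.default_rng(4)
# inhomogeneous data: a dense blob and a sparse blob
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal((5, 0), 1.0, (50, 2))])
deg_fixed, deg_adapt = affinity_degrees(X)
# mean-degree ratio (dense blob / sparse blob): ~1 means equalized
ratio_fixed = deg_fixed[:50].mean() / deg_fixed[50:].mean()
ratio_adapt = deg_adapt[:50].mean() / deg_adapt[50:].mean()
```

The skewed degrees of the fixed kernel are exactly the kind of data inhomogeneity that, per the abstract, drives mode-isolation and sparsest-subset biases in kernel clustering objectives.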
A comparison of bandwidth selectors for mean shift clustering
We explore the performance of several automatic bandwidth selectors,
originally designed for density gradient estimation, as data-based procedures
for nonparametric, modal clustering. The key tool to obtain a clustering from
density gradient estimators is the mean shift algorithm, which allows one to obtain
a partition not only of the data sample, but also of the whole space. The
results of our simulation study suggest that most of the methods considered
here, like cross-validation and plug-in bandwidth selectors, are useful for
cluster analysis via the mean shift algorithm.
Comment: 13 pages, 1 figure.
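The pipeline the paper evaluates — pick a bandwidth with a data-based selector, then cluster by mean shift — can be sketched as follows. The normal-reference rule below is only a crude stand-in for the cross-validation and plug-in selectors actually compared; its constant is motivated by the normal-scale rule for gradient estimation and should not be read as any of the paper's selectors.

```python
import numpy as np

def normal_scale_bandwidth(X):
    """Normal-reference bandwidth aimed at density *gradient* estimation
    in d dimensions (a crude stand-in for the selectors compared in the
    paper; the constant is only normal-reference-motivated)."""
    n, d = X.shape
    sigma = X.std(0, ddof=1).mean()
    return sigma * (4.0 / (n * (d + 4))) ** (1.0 / (d + 6))

def mean_shift(X, h, iters=50):
    """Gaussian mean shift: repeatedly move each point to the
    kernel-weighted average of the (fixed) sample."""
    Y = X.copy()
    for _ in range(iters):
        d2 = ((Y[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2 * h ** 2))
        Y = W @ X / W.sum(1, keepdims=True)
    return Y

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(3, 0.3, (60, 2))])
h = normal_scale_bandwidth(X)
Y = mean_shift(X, h)
# group points whose mean-shift iterates landed within h of each other
modes, labels = [], np.empty(len(X), int)
for i, y in enumerate(Y):
    for m, mode in enumerate(modes):
        if np.linalg.norm(y - mode) < h:
            labels[i] = m
            break
    else:
        modes.append(y)
        labels[i] = len(modes) - 1
```

Because mean shift can be run from any starting point, not just the sample, the same machinery partitions the whole space, as the abstract notes.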
Nonparametric modal regression
Modal regression estimates the local modes of the distribution of Y given
X = x, instead of the mean, as in the usual regression sense, and can hence
reveal important structure missed by usual regression methods. We study a
simple nonparametric method for modal regression, based on a kernel density
estimate (KDE) of the joint distribution of X and Y. We derive asymptotic
error bounds for this method, and propose techniques for constructing
confidence sets and prediction sets. The latter is used to select the smoothing
bandwidth of the underlying KDE. The idea behind modal regression is connected
to many others, such as mixture regression and density ridge estimation, and we
discuss these ties as well.
Comment: Published at http://dx.doi.org/10.1214/15-AOS1373 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
A Fuzzy Clustering Algorithm for the Mode Seeking Framework
In this paper, we propose a new fuzzy clustering algorithm based on the
mode-seeking framework. Given a dataset in $\mathbb{R}^d$, we define regions of
high density that we call cluster cores. We then consider a random walk on a
neighborhood graph built on top of our data points which is designed to be
attracted by high density regions. The strength of this attraction is
controlled by a temperature parameter. The membership of a point in
a given cluster is then the probability that the random walk hits the
corresponding cluster core before any other. While many properties of random
walks (such as hitting times, commute distances, etc.) have been shown to
eventually encode purely local information when the number of data points
grows, we show that the regularization introduced by the use of cluster cores
solves this issue. Empirically, we show how the choice of temperature influences
the behavior of our algorithm: for small temperatures the result is close
to hard mode-seeking, whereas for larger temperatures the result is similar
to the output of a (fuzzy) spectral clustering. Finally, we demonstrate the
scalability of our approach by providing the fuzzy clustering of a protein
configuration dataset containing a million data points.
Comment: Submitted to Pattern Recognition Letters.
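The membership computation described above is an absorption-probability problem: the probability of hitting one cluster core before the other solves a linear system in the walk's transition matrix. The sketch below is our illustration of that idea on a Gaussian neighborhood graph — the kernel, the temperature's role, and the hand-picked cores are assumptions, not the paper's exact construction.

```python
import numpy as np

def hitting_membership(X, coreA, coreB, T=1.0):
    """Fuzzy membership sketch: probability that a random walk on a
    Gaussian neighborhood graph hits cluster core A before core B,
    computed as an absorption probability (temperature T scales the
    kernel; an illustration, not the paper's exact construction)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    P = np.exp(-d2 / (2 * T))
    P /= P.sum(1, keepdims=True)              # random-walk transition matrix
    n = len(X)
    free = np.setdiff1d(np.arange(n), np.concatenate([coreA, coreB]))
    # absorption probabilities: (I - P_ff) u_free = P[free, coreA] @ 1
    A = np.eye(len(free)) - P[np.ix_(free, free)]
    u = np.ones(n)                            # core A points: membership 1
    u[coreB] = 0.0                            # core B points: membership 0
    u[free] = np.linalg.solve(A, P[np.ix_(free, coreA)].sum(1))
    return u

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.3, (40, 2)), rng.normal(3, 0.3, (40, 2))])
coreA, coreB = np.arange(5), np.arange(40, 45)   # cores fixed by hand here
u = hitting_membership(X, coreA, coreB)
```

Because the cores absorb the walk before it can wander far, the memberships stay sharply separated — the regularization effect the abstract credits with preventing hitting probabilities from degenerating to purely local information.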