Fast high dimensional approximation via random embeddings
In the big data era, dimension reduction techniques have been a key tool for making high-dimensional geometric problems tractable. This thesis focuses on two such problems: hashing and parameter estimation. We study locality-sensitive hashing (LSH), a framework for randomized hashing that efficiently solves an approximate version of nearest neighbor search. We propose an efficient and provably optimal hash function for LSH that builds on a simple existing hash function called cross-polytope LSH. In the context of parameter estimation, we focus on regression, where the well-known LASSO requires precise knowledge of the unknown noise variance. We provide an estimator of this noise variance for sparse signals that is consistent and faster than a single iteration of LASSO. Finally, we discuss notions of distance between probability distributions for the purposes of quantization and propose using the Rényi divergence, which achieves both large- and small-scale bounds.
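The cross-polytope scheme the abstract builds on can be sketched in a few lines: apply a pseudo-random Gaussian "rotation" and hash a point to the nearest signed standard basis vector (a vertex of the cross-polytope). This is the basic scheme only, not the thesis's optimized variant; the function names are ours.

```python
import numpy as np

def make_cp_hash(rng, d_in, d_out):
    """Build one cross-polytope LSH function: a fixed Gaussian map
    followed by snapping to the closest signed basis vector."""
    A = rng.standard_normal((d_out, d_in))  # pseudo-random "rotation"

    def h(x):
        y = A @ (x / np.linalg.norm(x))  # project the unit-normalized point
        i = int(np.argmax(np.abs(y)))    # axis of the nearest ±e_i vertex
        return (i, 1 if y[i] > 0 else -1)

    return h
```

Because the input is normalized before hashing, the bucket depends only on direction, so nearby unit vectors tend to collide while distant ones rarely do.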
Feature Selection and Weighting by Nearest Neighbor Ensembles
In the field of statistical discrimination, nearest neighbor methods are a well-known, quite simple, but successful nonparametric classification tool. In higher dimensions, however, predictive power normally deteriorates. If some covariates are assumed to be noise variables, variable selection is a promising approach. The paper's main focus is the development and evaluation of a nearest neighbor ensemble with implicit variable selection. In contrast to other nearest neighbor approaches, we are not primarily interested in classification but in estimating the (posterior) class probabilities. In simulation studies and on real-world data, the proposed nearest neighbor ensemble is compared to an extended forward/backward variable selection procedure for nearest neighbor classifiers and to several well-established classification tools that also provide probability estimates. Despite its simple structure, the proposed method performs quite well, especially when relevant covariates can be separated from noise variables. Another advantage of the presented ensemble is the easy identification of interactions, which are usually hard to detect. Thus not simply variable selection but rather a form of feature selection is performed.
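A toy version of such an ensemble illustrates the idea of implicit variable selection: one kNN member per covariate estimates posterior class probabilities, and members are weighted by their above-chance leave-one-out accuracy, so noise features are driven toward zero weight. This is an illustrative sketch; the paper's actual weighting scheme differs in detail.

```python
import numpy as np

def knn_posterior(xs, ys, x0, k, n_classes):
    """Posterior estimate from the k nearest neighbors on one covariate."""
    idx = np.argsort(np.abs(xs - x0))[:k]
    return np.bincount(ys[idx], minlength=n_classes) / k

def ensemble_predict(X, y, x_new, k=5):
    """Nearest-neighbor ensemble with one member per covariate.
    Members are weighted by leave-one-out accuracy above chance,
    so uninformative (noise) covariates get weight near zero."""
    n, p = X.shape
    n_classes = int(y.max()) + 1
    weights = np.zeros(p)
    for j in range(p):
        correct = 0
        for i in range(n):
            mask = np.arange(n) != i  # leave observation i out
            post = knn_posterior(X[mask, j], y[mask], X[i, j], k, n_classes)
            correct += (post.argmax() == y[i])
        # above-chance margin; noise features land at ~0
        weights[j] = max(correct / n - 1.0 / n_classes, 0.0)
    total = weights.sum()
    weights = weights / total if total > 0 else np.full(p, 1.0 / p)
    # weighted average of the members' probability estimates
    return sum(w * knn_posterior(X[:, j], y, x_new[j], k, n_classes)
               for j, w in enumerate(weights))
```

Because the output is a weighted average of valid probability vectors, it is itself a probability estimate, matching the paper's emphasis on posterior probabilities rather than hard classification.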
The paper is a preprint of an article published in Chemometrics and Intelligent Laboratory Systems. Please use the journal version for citation.
Robust nearest-neighbor methods for classifying high-dimensional data
We suggest a robust nearest-neighbor approach to classifying high-dimensional
data. The method enhances sensitivity by employing a threshold and truncates to
a sequence of zeros and ones in order to reduce the deleterious impact of
heavy-tailed data. Empirical rules are suggested for choosing the threshold.
They require the bare minimum of data; only one data vector is needed from each
population. Theoretical and numerical aspects of performance are explored,
paying particular attention to the impacts of correlation and heterogeneity
among data components. On the theoretical side, it is shown that our truncated,
thresholded, nearest-neighbor classifier enjoys the same classification
boundary as more conventional, nonrobust approaches, which require finite
moments in order to achieve good performance. In particular, the greater
robustness of our approach does not come at the price of reduced effectiveness.
Moreover, when both training sample sizes equal 1, our new method can have
performance equal to that of optimal classifiers that require independent and
identically distributed data with known marginal distributions; yet, our
classifier does not itself need conditions of this type.
Comment: Published at http://dx.doi.org/10.1214/08-AOS591 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
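The truncate-and-threshold idea described above can be sketched directly: each component is mapped to 1 if its magnitude exceeds a threshold and to 0 otherwise, and a new observation is assigned to the class of the nearest binarized training vector under Hamming distance. The paper's empirical rules for choosing the threshold are not reproduced here; the threshold below is a free parameter.

```python
import numpy as np

def robust_nn_classify(x_new, train, threshold):
    """Truncated, thresholded nearest-neighbor classification.
    train: list of (label, vector) pairs, one vector per population.
    Binarizing caps the influence of any single heavy-tailed component."""
    def binarize(v):
        return (np.abs(v) > threshold).astype(int)

    bx = binarize(x_new)
    # nearest training vector in Hamming distance on the 0/1 sequences
    label, _ = min(train, key=lambda pair: np.sum(binarize(pair[1]) != bx))
    return label
```

Note that a wild outlier in one coordinate changes the binarized vector by at most one bit, which is the robustness mechanism the abstract describes: a heavy tail cannot dominate the distance.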
Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering
This study introduces a new method for detecting and sorting spikes from multiunit recordings. The method combines the wavelet transform, which localizes distinctive spike features, with superparamagnetic clustering, which allows automatic classification of the data without assumptions such as low variance or Gaussian distributions. Moreover, an improved method for setting amplitude thresholds for spike detection is proposed. We describe several criteria for implementation that render the algorithm unsupervised and fast. The algorithm is compared to other conventional methods using several simulated data sets whose characteristics closely resemble those of in vivo recordings. For these data sets, we found that the proposed algorithm outperformed conventional methods.
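The amplitude-thresholding step can be sketched with the robust noise estimate commonly paired with this method: the noise standard deviation is estimated as the median absolute signal value scaled by 0.6745 (exact for Gaussian noise), and the threshold is a small multiple of it. This shows the detection step only, not the wavelet features or the clustering.

```python
import numpy as np

def spike_threshold(signal, mult=4.0):
    """Amplitude threshold from a median-based (MAD-style) noise
    estimate, which is insensitive to the spikes themselves."""
    sigma = np.median(np.abs(signal)) / 0.6745  # robust noise std estimate
    return mult * sigma

def detect_spikes(signal, thr):
    """Indices where |signal| crosses the threshold from below."""
    above = np.abs(signal) > thr
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1
```

Using the median rather than the sample standard deviation matters because high-amplitude spikes inflate the latter, which would raise the threshold and miss small spikes.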
Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information
Conditional independence testing is a fundamental problem underlying causal
discovery and a particularly challenging task in the presence of nonlinear and
high-dimensional dependencies. Here a fully non-parametric test for continuous
data based on conditional mutual information combined with a local permutation
scheme is presented. Through a nearest neighbor approach, the test efficiently
adapts also to non-smooth distributions due to strongly nonlinear dependencies.
Numerical experiments demonstrate that the test reliably simulates the null
distribution even for small sample sizes and with high-dimensional conditioning
sets. The test is better calibrated than kernel-based tests utilizing an
analytical approximation of the null distribution, especially for non-smooth
densities, and reaches the same or higher power levels. Combining the local
permutation scheme with the kernel tests leads to better calibration, but
suffers in power. For smaller sample sizes and lower dimensions, the test is
faster than random Fourier feature-based kernel tests if the permutation scheme
is (embarrassingly) parallelized, but the runtime increases more sharply with
sample size and dimensionality. Thus, more theoretical research to analytically
approximate the null distribution and speed up the estimation for larger sample
sizes is desirable.
Comment: 17 pages, 12 figures, 1 table
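The nearest-neighbor conditional mutual information estimator underlying such a test can be sketched as follows, using the Frenzel-Pompe / KSG-style construction: take the distance to the k-th neighbor in the joint (X, Y, Z) space under the Chebyshev norm, count neighbors within that radius in the (X, Z), (Y, Z), and Z subspaces, and combine digamma terms. This is a simplified sketch of the estimator family, not the paper's exact implementation, and it omits the local permutation scheme.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def cmi_knn(x, y, z, k=5):
    """kNN estimate of the conditional mutual information I(X;Y|Z) in nats."""
    x, y, z = (np.asarray(a).reshape(len(a), -1) for a in (x, y, z))
    xyz = np.hstack([x, y, z])
    # distance to the k-th neighbor in the joint space (Chebyshev norm)
    eps = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]
    r = eps * (1 - 1e-10)  # count strictly inside the k-NN ball

    def count(data):
        # neighbors within r of each point, excluding the point itself
        return cKDTree(data).query_ball_point(
            data, r, p=np.inf, return_length=True) - 1

    n_xz = count(np.hstack([x, z]))
    n_yz = count(np.hstack([y, z]))
    n_z = count(z)
    return digamma(k) + np.mean(digamma(n_z + 1)
                                - digamma(n_xz + 1) - digamma(n_yz + 1))
```

The adaptive radius is what lets the estimator track non-smooth, strongly nonlinear dependencies: the k-NN ball shrinks where the data are dense and grows where they are sparse, without any bandwidth choice.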