Fast high dimensional approximation via random embeddings
In the big data era, dimension reduction techniques have been a key tool for making high-dimensional geometric problems tractable. This thesis focuses on two such problems: hashing and parameter estimation. We study locality-sensitive hashing (LSH), a framework for randomized hashing that efficiently solves an approximate version of nearest neighbor search. We propose an efficient and provably optimal hash function for LSH that builds on a simple existing hash function called cross-polytope LSH. In the context of parameter estimation, we focus on regression, where the well-known LASSO requires precise knowledge of the unknown noise variance. We provide an estimator of this noise variance for sparse signals that is consistent and faster than a single iteration of LASSO. Finally, we discuss notions of distance between probability distributions for the purposes of quantization and propose using the Rényi divergence, which achieves both large- and small-scale bounds.
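The cross-polytope scheme the abstract builds on can be sketched in a few lines: apply a pseudo-random Gaussian "rotation" and hash a point to the nearest signed standard basis vector (a vertex of the cross-polytope). This is the basic scheme only, not the thesis's optimized variant; the function names are ours.

```python
import numpy as np

def make_cp_hash(rng, d_in, d_out):
    """Build one cross-polytope LSH function: a fixed Gaussian map
    followed by snapping to the closest signed basis vector."""
    A = rng.standard_normal((d_out, d_in))  # pseudo-random "rotation"

    def h(x):
        y = A @ (x / np.linalg.norm(x))  # project the unit-normalized point
        i = int(np.argmax(np.abs(y)))    # axis of the nearest ±e_i vertex
        return (i, 1 if y[i] > 0 else -1)

    return h
```

Because the input is normalized before hashing, the bucket depends only on direction, so nearby unit vectors tend to collide while distant ones rarely do.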
Feature Selection and Weighting by Nearest Neighbor Ensembles
In the field of statistical discrimination, nearest neighbor methods are a well-known, quite simple, but successful nonparametric classification tool. In higher dimensions, however, predictive power normally deteriorates. If some covariates are assumed to be noise variables, variable selection is a promising approach. The paper's main focus is the development and evaluation of a nearest neighbor ensemble with implicit variable selection. In contrast to other nearest neighbor approaches, we are not primarily interested in classification but in estimating the (posterior) class probabilities. In simulation studies and on real-world data, the proposed nearest neighbor ensemble is compared to an extended forward/backward variable selection procedure for nearest neighbor classifiers and to several well-established classification tools that also provide probability estimates. Despite its simple structure, the proposed method performs quite well, especially when relevant covariates can be separated from noise variables. Another advantage of the presented ensemble is the easy identification of interactions, which are usually hard to detect. Thus not simply variable selection but rather a form of feature selection is performed.
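A toy version of such an ensemble illustrates the idea of implicit variable selection: one kNN member per covariate estimates posterior class probabilities, and members are weighted by their above-chance leave-one-out accuracy, so noise features are driven toward zero weight. This is an illustrative sketch; the paper's actual weighting scheme differs in detail.

```python
import numpy as np

def knn_posterior(xs, ys, x0, k, n_classes):
    """Posterior estimate from the k nearest neighbors on one covariate."""
    idx = np.argsort(np.abs(xs - x0))[:k]
    return np.bincount(ys[idx], minlength=n_classes) / k

def ensemble_predict(X, y, x_new, k=5):
    """Nearest-neighbor ensemble with one member per covariate.
    Members are weighted by leave-one-out accuracy above chance,
    so uninformative (noise) covariates get weight near zero."""
    n, p = X.shape
    n_classes = int(y.max()) + 1
    weights = np.zeros(p)
    for j in range(p):
        correct = 0
        for i in range(n):
            mask = np.arange(n) != i  # leave observation i out
            post = knn_posterior(X[mask, j], y[mask], X[i, j], k, n_classes)
            correct += (post.argmax() == y[i])
        # above-chance margin; noise features land at ~0
        weights[j] = max(correct / n - 1.0 / n_classes, 0.0)
    total = weights.sum()
    weights = weights / total if total > 0 else np.full(p, 1.0 / p)
    # weighted average of the members' probability estimates
    return sum(w * knn_posterior(X[:, j], y, x_new[j], k, n_classes)
               for j, w in enumerate(weights))
```

Because the output is a weighted average of valid probability vectors, it is itself a probability estimate, matching the paper's emphasis on posterior probabilities rather than hard classification.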
The paper is a preprint of an article published in Chemometrics and Intelligent Laboratory Systems. Please use the journal version for citation.
Robust nearest-neighbor methods for classifying high-dimensional data
We suggest a robust nearest-neighbor approach to classifying high-dimensional
data. The method enhances sensitivity by employing a threshold and truncates to
a sequence of zeros and ones in order to reduce the deleterious impact of
heavy-tailed data. Empirical rules are suggested for choosing the threshold.
They require the bare minimum of data; only one data vector is needed from each
population. Theoretical and numerical aspects of performance are explored,
paying particular attention to the impacts of correlation and heterogeneity
among data components. On the theoretical side, it is shown that our truncated,
thresholded, nearest-neighbor classifier enjoys the same classification
boundary as more conventional, nonrobust approaches, which require finite
moments in order to achieve good performance. In particular, the greater
robustness of our approach does not come at the price of reduced effectiveness.
Moreover, when both training sample sizes equal 1, our new method can have
performance equal to that of optimal classifiers that require independent and
identically distributed data with known marginal distributions; yet, our
classifier does not itself need conditions of this type.
Comment: Published at http://dx.doi.org/10.1214/08-AOS591 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
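The truncate-and-threshold idea described above can be sketched directly: each component is mapped to 1 if its magnitude exceeds a threshold and to 0 otherwise, and a new observation is assigned to the class of the nearest binarized training vector under Hamming distance. The paper's empirical rules for choosing the threshold are not reproduced here; the threshold below is a free parameter.

```python
import numpy as np

def robust_nn_classify(x_new, train, threshold):
    """Truncated, thresholded nearest-neighbor classification.
    train: list of (label, vector) pairs, one vector per population.
    Binarizing caps the influence of any single heavy-tailed component."""
    def binarize(v):
        return (np.abs(v) > threshold).astype(int)

    bx = binarize(x_new)
    # nearest training vector in Hamming distance on the 0/1 sequences
    label, _ = min(train, key=lambda pair: np.sum(binarize(pair[1]) != bx))
    return label
```

Note that a wild outlier in one coordinate changes the binarized vector by at most one bit, which is the robustness mechanism the abstract describes: a heavy tail cannot dominate the distance.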
Unsupervised spike detection and sorting with wavelets and superparamagnetic clustering
This study introduces a new method for detecting and sorting spikes from multiunit recordings. The method combines the wavelet transform, which localizes distinctive spike features, with superparamagnetic clustering, which allows automatic classification of the data without assumptions such as low variance or Gaussian distributions. Moreover, an improved method for setting amplitude thresholds for spike detection is proposed. We describe several criteria for implementation that render the algorithm unsupervised and fast. The algorithm is compared to other conventional methods using several simulated data sets whose characteristics closely resemble those of in vivo recordings. For these data sets, we found that the proposed algorithm outperformed conventional methods.
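The amplitude-thresholding step can be sketched with the robust noise estimate commonly paired with this method: the noise standard deviation is estimated as the median absolute signal value scaled by 0.6745 (exact for Gaussian noise), and the threshold is a small multiple of it. This shows the detection step only, not the wavelet features or the clustering.

```python
import numpy as np

def spike_threshold(signal, mult=4.0):
    """Amplitude threshold from a median-based (MAD-style) noise
    estimate, which is insensitive to the spikes themselves."""
    sigma = np.median(np.abs(signal)) / 0.6745  # robust noise std estimate
    return mult * sigma

def detect_spikes(signal, thr):
    """Indices where |signal| crosses the threshold from below."""
    above = np.abs(signal) > thr
    return np.flatnonzero(above[1:] & ~above[:-1]) + 1
```

Using the median rather than the sample standard deviation matters because high-amplitude spikes inflate the latter, which would raise the threshold and miss small spikes.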
Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information
Conditional independence testing is a fundamental problem underlying causal
discovery and a particularly challenging task in the presence of nonlinear and
high-dimensional dependencies. Here a fully non-parametric test for continuous
data based on conditional mutual information combined with a local permutation
scheme is presented. Through a nearest neighbor approach, the test efficiently
adapts also to non-smooth distributions due to strongly nonlinear dependencies.
Numerical experiments demonstrate that the test reliably simulates the null
distribution even for small sample sizes and with high-dimensional conditioning
sets. The test is better calibrated than kernel-based tests utilizing an
analytical approximation of the null distribution, especially for non-smooth
densities, and reaches the same or higher power levels. Combining the local
permutation scheme with the kernel tests leads to better calibration, but
suffers in power. For smaller sample sizes and lower dimensions, the test is
faster than random Fourier feature-based kernel tests if the permutation scheme
is (embarrassingly) parallelized, but the runtime increases more sharply with
sample size and dimensionality. Thus, more theoretical research to analytically
approximate the null distribution and speed up the estimation for larger sample
sizes is desirable.
Comment: 17 pages, 12 figures, 1 table
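The nearest-neighbor conditional mutual information estimator underlying such a test can be sketched as follows, using the Frenzel-Pompe / KSG-style construction: take the distance to the k-th neighbor in the joint (X, Y, Z) space under the Chebyshev norm, count neighbors within that radius in the (X, Z), (Y, Z), and Z subspaces, and combine digamma terms. This is a simplified sketch of the estimator family, not the paper's exact implementation, and it omits the local permutation scheme.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def cmi_knn(x, y, z, k=5):
    """kNN estimate of the conditional mutual information I(X;Y|Z) in nats."""
    x, y, z = (np.asarray(a).reshape(len(a), -1) for a in (x, y, z))
    xyz = np.hstack([x, y, z])
    # distance to the k-th neighbor in the joint space (Chebyshev norm)
    eps = cKDTree(xyz).query(xyz, k=k + 1, p=np.inf)[0][:, -1]
    r = eps * (1 - 1e-10)  # count strictly inside the k-NN ball

    def count(data):
        # neighbors within r of each point, excluding the point itself
        return cKDTree(data).query_ball_point(
            data, r, p=np.inf, return_length=True) - 1

    n_xz = count(np.hstack([x, z]))
    n_yz = count(np.hstack([y, z]))
    n_z = count(z)
    return digamma(k) + np.mean(digamma(n_z + 1)
                                - digamma(n_xz + 1) - digamma(n_yz + 1))
```

The adaptive radius is what lets the estimator track non-smooth, strongly nonlinear dependencies: the k-NN ball shrinks where the data are dense and grows where they are sparse, without any bandwidth choice.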