Maximum Margin Multiclass Nearest Neighbors
We develop a general framework for margin-based multicategory classification
in metric spaces. The basic work-horse is a margin-regularized version of the
nearest-neighbor classifier. We prove generalization bounds that match the
state of the art in sample size and significantly improve the dependence on
the number of classes k. Our point of departure is a nearly Bayes-optimal
finite-sample risk bound independent of k. Although k-free, this bound is
unregularized and non-adaptive, which motivates our main result: Rademacher and
scale-sensitive margin bounds with a logarithmic dependence on k. As the best
previous risk estimates in this setting grew polynomially with k, our bound is
exponentially sharper. From the algorithmic standpoint, in doubling metric
spaces our classifier admits efficient training on the sample and fast
evaluation on new points.
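As a concrete, simplified illustration of the basic work-horse, a plain 1-NN classifier over an arbitrary metric can be sketched as below. The margin regularization that drives the paper's bounds is omitted, and the function names, toy metric, and sample are illustrative assumptions, not the authors' implementation.

```python
def nn_classify(x, sample, metric):
    """Predict the label of x as the label of its nearest neighbor
    in `sample` (a list of (point, label) pairs) under `metric`."""
    _, nearest_label = min(sample, key=lambda pl: metric(x, pl[0]))
    return nearest_label

# Toy metric space: the real line with the absolute-difference metric.
metric = lambda a, b: abs(a - b)
sample = [(0.0, "A"), (1.0, "A"), (5.0, "B"), (6.0, "C")]
```

Any function satisfying the metric axioms can be plugged in for `metric`, which is what makes the nearest-neighbor rule natural in general metric spaces.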
Theoretical analysis of cross-validation for estimating the risk of the k-Nearest Neighbor classifier
The present work aims at deriving theoretical guarantees on the behavior of
some cross-validation procedures applied to the k-nearest neighbors (kNN)
rule in the context of binary classification. Here we focus on the
leave-p-out cross-validation (LpO) used to assess the performance of the
kNN classifier. Remarkably, this LpO estimator can be efficiently computed
in this context using closed-form formulas derived by
\cite{CelisseMaryHuard11}. We describe a general strategy to derive moment and
exponential concentration inequalities for the LpO estimator applied to the
kNN classifier. Such results are obtained first by exploiting the connection
between the LpO estimator and U-statistics, and second by making intensive
use of the generalized Efron-Stein inequality applied to the LpO estimator.
Another important contribution is the derivation of new quantifications of the
discrepancy between the LpO estimator and the classification error/risk of
the kNN classifier. The optimality of these bounds is discussed by means of
several lower bounds as well as simulation experiments.
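The closed-form LpO formulas of \cite{CelisseMaryHuard11} are not reproduced here; as a hedged baseline, the special case p = 1 (leave-one-out) can be computed by brute force for a 1-D kNN rule. All names and the 1-D setting are illustrative assumptions.

```python
def knn_predict(x, sample, k):
    """Majority vote among the k nearest neighbors of x (1-D points).
    `sample` is a list of (point, label) pairs."""
    neighbors = sorted(sample, key=lambda pl: abs(x - pl[0]))[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)

def loo_error(sample, k):
    """Leave-one-out estimate: classify each point with itself held out
    and report the fraction of misclassifications."""
    errors = 0
    for i, (x, y) in enumerate(sample):
        held_out = sample[:i] + sample[i + 1:]
        if knn_predict(x, held_out, k) != y:
            errors += 1
    return errors / len(sample)
```

The appeal of the closed-form LpO formulas is precisely that they avoid this brute-force recomputation, whose cost grows combinatorially in p.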
Robust nearest-neighbor methods for classifying high-dimensional data
We suggest a robust nearest-neighbor approach to classifying high-dimensional
data. The method enhances sensitivity by employing a threshold and truncates to
a sequence of zeros and ones in order to reduce the deleterious impact of
heavy-tailed data. Empirical rules are suggested for choosing the threshold.
They require the bare minimum of data; only one data vector is needed from each
population. Theoretical and numerical aspects of performance are explored,
paying particular attention to the impacts of correlation and heterogeneity
among data components. On the theoretical side, it is shown that our truncated,
thresholded, nearest-neighbor classifier enjoys the same classification
boundary as more conventional, nonrobust approaches, which require finite
moments in order to achieve good performance. In particular, the greater
robustness of our approach does not come at the price of reduced effectiveness.
Moreover, when both training sample sizes equal 1, our new method can have
performance equal to that of optimal classifiers that require independent and
identically distributed data with known marginal distributions; yet, our
classifier does not itself need conditions of this type.
Comment: Published at http://dx.doi.org/10.1214/08-AOS591 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
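A minimal sketch of the threshold-and-truncate idea follows, assuming one training vector per population, as the abstract describes. The threshold-selection rules from the paper are not reproduced, and all names are illustrative.

```python
def truncate(vector, threshold):
    """Reduce a data vector to a 0/1 sequence: 1 where a component
    exceeds the threshold, 0 elsewhere. This caps the influence of
    extreme, heavy-tailed components."""
    return [1 if v > threshold else 0 for v in vector]

def robust_nn_classify(x, prototypes, threshold):
    """`prototypes` maps each population label to its single training
    vector. Classify x by Hamming distance between 0/1 sequences."""
    tx = truncate(x, threshold)
    def hamming(label):
        tp = truncate(prototypes[label], threshold)
        return sum(a != b for a, b in zip(tx, tp))
    return min(prototypes, key=hamming)
```

Note that after truncation a wildly large component (say, 100 instead of 5) contributes exactly as much as a moderately large one, which is the robustness mechanism the abstract refers to.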
Active Nearest-Neighbor Learning in Metric Spaces
We propose a pool-based non-parametric active learning algorithm for general
metric spaces, called MArgin Regularized Metric Active Nearest Neighbor
(MARMANN), which outputs a nearest-neighbor classifier. We give prediction
error guarantees that depend on the noisy-margin properties of the input
sample, and are competitive with those obtained by previously proposed passive
learners. We prove that the label complexity of MARMANN is significantly lower
than that of any passive learner with similar error guarantees. MARMANN is
based on a generalized sample compression scheme, and a new label-efficient
active model-selection procedure.
PointMap: A real-time memory-based learning system with on-line and post-training pruning
Also published in the International Journal of Hybrid Intelligent Systems,
Volume 1, January 2004.
A memory-based learning system called PointMap is a simple and computationally
efficient extension of Condensed Nearest Neighbor that allows the user to limit
the number of exemplars stored during incremental learning. PointMap evaluates
the information value of coding nodes during training, and uses this index to
prune uninformative nodes either on-line or after training. These pruning
methods allow the user to control both a priori code size and sensitivity to
detail in the training data, as well as to determine the code size necessary
for accurate performance on a given data set. Coding and pruning computations
are local in space, with only the nearest coded neighbor available for
comparison with the input; and in time, with only the current input available
during coding. Pruning helps solve common problems of traditional memory-based
learning systems: large memory requirements, their accompanying slow on-line
computations, and sensitivity to noise. PointMap copes with the curse of
dimensionality by considering multiple nearest neighbors during testing without
increasing the complexity of the training process or the stored code. The
performance of PointMap is compared to that of a group of sixteen
nearest-neighbor systems on benchmark problems.
This research was supported by grants from the Air Force Office of Scientific
Research (AFOSR F49620-98-1-0108, F49620-01-1-0397, and F49620-01-1-0423) and
the Office of Naval Research (ONR N00014-01-1-0624).
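PointMap's information-value pruning is not shown here, but the Condensed Nearest Neighbor procedure it extends can be sketched as below (Hart's classic rule; names and the toy metric are illustrative assumptions).

```python
def condense(sample, metric):
    """Hart's Condensed Nearest Neighbor: scan the sample repeatedly,
    adding to the stored code only the exemplars that the current code
    misclassifies, until a full pass makes no additions."""
    code = [sample[0]]
    changed = True
    while changed:
        changed = False
        for point, label in sample:
            _, nearest_label = min(code, key=lambda pl: metric(point, pl[0]))
            if nearest_label != label:
                code.append((point, label))
                changed = True
    return code
```

The resulting code set classifies the training sample exactly as the full sample would under 1-NN, but is typically much smaller, which is the memory saving PointMap builds on.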
Bounds on the finite-sample risk for the exponential distribution
In this paper, we derive lower and upper bounds on the expected
nearest-neighbor distance for the exponential distribution, and use them to
obtain lower and upper bounds on the finite-sample risk of the
nearest-neighbor classifier under the exponential distribution.
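The analytic bounds themselves are not reproduced here; as a hedged companion, the expected nearest-neighbor distance within an Exp(rate) sample can be estimated by Monte Carlo (all names and parameters are illustrative assumptions).

```python
import random

def mean_nn_distance(n, trials, rate=1.0, seed=0):
    """Monte Carlo estimate of the expected nearest-neighbor distance
    within a sample of n points drawn i.i.d. from Exp(rate)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        pts = sorted(rng.expovariate(rate) for _ in range(n))
        # On the line, each point's nearest neighbor is an adjacent
        # order statistic, so only consecutive gaps need be compared.
        nn = [pts[1] - pts[0]]
        for i in range(1, n - 1):
            nn.append(min(pts[i] - pts[i - 1], pts[i + 1] - pts[i]))
        nn.append(pts[-1] - pts[-2])
        total += sum(nn) / n
    return total / trials
```

Such a simulation is a quick sanity check for any claimed lower and upper bounds at a given sample size.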
Bounded-Distortion Metric Learning
Metric learning aims to embed one metric space into another to benefit tasks
like classification and clustering. Although a greatly distorted metric space
has a high degree of freedom to fit training data, it is prone to overfitting
and numerical inaccuracy. This paper presents {\it bounded-distortion metric
learning} (BDML), a new metric learning framework which amounts to finding an
optimal Mahalanobis metric space with a bounded-distortion constraint. An
efficient solver based on the multiplicative weights update method is proposed.
Moreover, we generalize BDML to pseudo-metric learning and devise a
semidefinite relaxation and a randomized algorithm to approximately solve it.
We further provide theoretical analysis to show that distortion is a key
ingredient for stability and generalization ability of our BDML algorithm.
Extensive experiments on several benchmark datasets yield promising results.
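The BDML optimization itself is not reproduced here; a minimal sketch of the squared Mahalanobis (pseudo-)metric that such methods learn is shown below, assuming M is a symmetric positive semidefinite matrix given as a list of rows (names are illustrative).

```python
def mahalanobis_sq(x, y, m):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a
    symmetric positive semidefinite matrix M (list of rows)."""
    d = [a - b for a, b in zip(x, y)]
    md = [sum(row[j] * d[j] for j in range(len(d))) for row in m]
    return sum(d[i] * md[i] for i in range(len(d)))
```

With M the identity this reduces to the squared Euclidean distance; learning a non-identity M reweights directions of the input space, which is how metric learning reshapes nearest-neighbor decisions.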