Stabilized Nearest Neighbor Classifier and Its Statistical Properties
The stability of statistical analysis is an important indicator of
reproducibility, which is a main principle of the scientific method. It entails
that similar statistical conclusions can be reached based on independent
samples from the same underlying population. In this paper, we introduce a
general measure of classification instability (CIS) to quantify the sampling
variability of the prediction made by a classification method. Interestingly,
the asymptotic CIS of any weighted nearest neighbor classifier turns out to be
proportional to the Euclidean norm of its weight vector. Based on this concise
form, we propose a stabilized nearest neighbor (SNN) classifier, which
distinguishes itself from other nearest neighbor classifiers, by taking the
stability into consideration. In theory, we prove that SNN attains the minimax
optimal convergence rate in risk, and a sharp convergence rate in CIS. The
latter rate result is established for general plug-in classifiers under a
low-noise condition. Extensive simulated and real examples demonstrate that SNN
achieves a considerable improvement in CIS over existing nearest neighbor
classifiers, with comparable classification accuracy. We implement the
algorithm in a publicly available R package snn.
Comment: 48 Pages, 11 Figures. To Appear in JASA--T&
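
As a rough illustration of the two ideas in this abstract, the sketch below computes the Euclidean norm of a weight vector and estimates by Monte Carlo how often classifiers trained on two independent samples disagree. It is not the paper's CIS definition or the SNN weighting scheme: the 1-D population in `sample`, the plain k-NN rule, and all function names are assumptions made for illustration only.

```python
import math
import random

def knn_weights(n, k):
    """Weights of the standard k-NN rule: 1/k on the k nearest points, 0 on the rest."""
    return [1.0 / k] * k + [0.0] * (n - k)

def weight_norm(w):
    """Euclidean norm of a weight vector."""
    return math.sqrt(sum(x * x for x in w))

def sample(n, rng):
    """Hypothetical population (an assumption): x ~ Uniform[0, 1], P(Y = 1 | x) = x."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if rng.random() < x else 0
        data.append((x, y))
    return data

def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (use odd k to avoid ties)."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(y for _, y in nearest)
    return 1 if 2 * votes > k else 0

def estimate_disagreement(n, k, reps, rng):
    """Fraction of trials in which two independently trained k-NN rules disagree at a random x."""
    disagree = 0
    for _ in range(reps):
        train_a, train_b = sample(n, rng), sample(n, rng)
        x = rng.random()
        if knn_predict(train_a, x, k) != knn_predict(train_b, x, k):
            disagree += 1
    return disagree / reps

if __name__ == "__main__":
    rng = random.Random(0)
    for k in (1, 25):
        norm = weight_norm(knn_weights(100, k))
        rate = estimate_disagreement(100, k, 500, rng)
        print(f"k={k}: weight norm={norm:.3f}, estimated disagreement={rate:.3f}")
```

In this toy setting, a larger k gives a smaller weight norm (1/sqrt(k)) and a lower estimated disagreement rate, which is the qualitative relationship the abstract describes.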
On Reject and Refine Options in Multicategory Classification
In many real applications of statistical learning, a decision made from
misclassification can be too costly to afford; in this case, a reject option,
which defers the decision until further investigation is conducted, is often
preferred. In recent years, there has been much development for binary
classification with a reject option. Yet, little progress has been made for the
multicategory case. In this article, we propose margin-based multicategory
classification methods with a reject option. In addition, and more importantly,
we introduce a new and unique refine option for the multicategory problem,
where the class of an observation is predicted to be from a set of class
labels, whose cardinality is not necessarily one. The main advantage of both
options lies in their capacity to identify error-prone observations.
Moreover, the refine option can provide more constructive information for
classification by effectively ruling out implausible classes. Efficient
implementations have been developed for the proposed methods. On the
theoretical side, we offer a novel statistical learning theory and show a fast
convergence rate of the excess ℓ-risk of our methods with emphasis on
diverging dimensionality and number of classes. The results can be further
improved under a low noise assumption. A set of comprehensive simulation and
real data studies has shown the usefulness of the new learning tools compared
to regular multicategory classifiers. Detailed proofs of theorems and extended
numerical results are included in the supplemental materials available online.
Comment: A revised version of this paper was accepted for publication in the
Journal of the American Statistical Association Theory and Methods Section.
52 pages, 6 figures
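
To make the reject and refine options concrete, here is a minimal sketch that applies them to a vector of estimated class probabilities. The abstract's methods are margin-based; this plug-in thresholding rule is a simpler stand-in, and the function names and the `threshold` and `cutoff` parameters are hypothetical.

```python
def predict_with_reject(probs, threshold):
    """Return the most probable class, or None (reject) when its probability is below threshold."""
    best = max(range(len(probs)), key=lambda j: probs[j])
    return best if probs[best] >= threshold else None

def predict_with_refine(probs, cutoff):
    """Return the set of plausible classes: every class whose probability is at least cutoff.

    Falls back to the single most probable class if no class passes the cutoff.
    """
    kept = [j for j, p in enumerate(probs) if p >= cutoff]
    if not kept:
        kept = [max(range(len(probs)), key=lambda j: probs[j])]
    return kept

if __name__ == "__main__":
    confident = [0.90, 0.05, 0.05]
    ambiguous = [0.48, 0.47, 0.05]
    print(predict_with_reject(confident, 0.6))   # class 0 is accepted
    print(predict_with_reject(ambiguous, 0.6))   # None: decision deferred
    print(predict_with_refine(ambiguous, 0.2))   # classes 0 and 1 kept, class 2 ruled out
```

The reject rule defers the decision entirely, whereas the refine rule still returns useful information by ruling out the implausible class.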