An adaptive nearest neighbor rule for classification
We introduce a variant of the k-nearest neighbor classifier in which k is
chosen adaptively for each query, rather than supplied as a parameter. The
choice of k depends on properties of each neighborhood, and therefore may
vary significantly between different points. (For example, the algorithm will
use a larger k for predicting the labels of points in noisy regions.)
We provide theory and experiments that demonstrate that the algorithm
performs comparably to, and sometimes better than, k-NN with an optimal
choice of k. In particular, we derive bounds on the convergence rates of our
classifier that depend on a local quantity we call the `advantage', which is
significantly weaker than the Lipschitz conditions used in previous convergence
rate proofs. These generalization bounds hinge on a variant of the seminal
Uniform Convergence Theorem due to Vapnik and Chervonenkis; this variant
concerns conditional probabilities and may be of independent interest.
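
The idea of choosing k per query can be illustrated with a minimal sketch. The stopping rule used here (grow k until the average label among the k nearest neighbors exceeds a margin of c / sqrt(k)) is an assumption made for illustration; the abstract does not state the paper's actual rule, and the names `adaptive_knn_predict` and `c` are hypothetical.

```python
import math

def adaptive_knn_predict(train, query, c=1.0):
    """Illustrative adaptive k-NN for labels in {-1, +1}.

    train: list of (point, label) pairs, where point is a tuple of floats.
    Returns (predicted_label, k_used); predicted_label is None when no k
    gives a confident majority (the sketch abstains in that case).
    The margin c / sqrt(k) is an assumed threshold, not the paper's.
    """
    # Order training points by Euclidean distance to the query.
    ordered = sorted(train, key=lambda pair: math.dist(pair[0], query))
    running_sum = 0
    for k, (_, label) in enumerate(ordered, start=1):
        running_sum += label
        # Stop at the first k whose average label clears the margin.
        if abs(running_sum) / k > c / math.sqrt(k):
            return (1 if running_sum > 0 else -1), k
    return None, len(ordered)

# Clean neighborhood: every nearby label agrees, so a small k suffices.
clean = [((1.0,), 1), ((2.0,), 1), ((3.0,), 1)]
print(adaptive_knn_predict(clean, (0.0,)))   # (1, 2)

# Noisy neighborhood: mixed labels force a larger k before committing.
noisy_labels = [1, -1, 1, 1, -1, 1, 1, 1]
noisy = [((float(i + 1),), lab) for i, lab in enumerate(noisy_labels)]
print(adaptive_knn_predict(noisy, (0.0,)))   # (1, 7)
```

In this toy setting the query in the noisy region ends up with a larger k than the query in the clean region, which matches the behavior the abstract describes.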
Robust nearest-neighbor methods for classifying high-dimensional data
We suggest a robust nearest-neighbor approach to classifying high-dimensional
data. The method enhances sensitivity by employing a threshold and truncates to
a sequence of zeros and ones in order to reduce the deleterious impact of
heavy-tailed data. Empirical rules are suggested for choosing the threshold.
They require the bare minimum of data; only one data vector is needed from each
population. Theoretical and numerical aspects of performance are explored,
paying particular attention to the impacts of correlation and heterogeneity
among data components. On the theoretical side, it is shown that our truncated,
thresholded, nearest-neighbor classifier enjoys the same classification
boundary as more conventional, nonrobust approaches, which require finite
moments in order to achieve good performance. In particular, the greater
robustness of our approach does not come at the price of reduced effectiveness.
Moreover, when both training sample sizes equal 1, our new method can have
performance equal to that of optimal classifiers that require independent and
identically distributed data with known marginal distributions; yet, our
classifier does not itself need conditions of this type.
Comment: Published at http://dx.doi.org/10.1214/08-AOS591 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Positive Semidefinite Metric Learning Using Boosting-like Algorithms
The success of many machine learning and pattern recognition methods relies
heavily upon the identification of an appropriate distance metric on the input
data. It is often beneficial to learn such a metric from the input training
data, instead of using a default one such as the Euclidean distance. In this
work, we propose a boosting-based technique, termed BoostMetric, for learning a
quadratic Mahalanobis distance metric. Learning a valid Mahalanobis distance
metric requires enforcing the constraint that the matrix parameter to the
metric remains positive definite. Semidefinite programming is often used to
enforce this constraint, but does not scale well and is not easy to implement.
BoostMetric is instead based on the observation that any positive semidefinite
matrix can be decomposed into a linear combination of trace-one rank-one
matrices. BoostMetric thus uses rank-one positive semidefinite matrices as weak
learners within an efficient and scalable boosting-based learning process. The
resulting methods are easy to implement, efficient, and can accommodate various
types of constraints. We extend traditional boosting algorithms in that the
weak learner is a positive semidefinite matrix with trace and rank equal to one
rather than a classifier or regressor. Experiments on various datasets
demonstrate that the proposed algorithms compare favorably to
state-of-the-art methods in terms of classification accuracy and running time.
Comment: 30 pages, appearing in Journal of Machine Learning Research
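
The observation about trace-one rank-one matrices can be checked numerically in the reverse direction: a nonnegative combination of matrices u u^T with unit-norm u is positive semidefinite, and it induces a Mahalanobis squared distance that is never negative. The sketch below is only an illustration of that fact, not the BoostMetric training procedure; the helper names (`unit`, `rank_one`, `combine`, `mahalanobis_sq`) and the example weights are assumptions.

```python
import math

def unit(v):
    """Scale v to unit Euclidean norm."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def rank_one(u):
    """Outer product u u^T; for unit-norm u it has rank one and trace one."""
    return [[a * b for b in u] for a in u]

def combine(weights, directions):
    """M = sum_i w_i * u_i u_i^T with w_i >= 0 and u_i normalized."""
    d = len(directions[0])
    M = [[0.0] * d for _ in range(d)]
    for w, direction in zip(weights, directions):
        Z = rank_one(unit(direction))
        for i in range(d):
            for j in range(d):
                M[i][j] += w * Z[i][j]
    return M

def mahalanobis_sq(M, x, y):
    """Squared Mahalanobis distance (x - y)^T M (x - y)."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * M[i][j] * diff[j] for i in range(n) for j in range(n))

# Two hypothetical "weak learners" with nonnegative weights.
M = combine([2.0, 0.5], [[1.0, 1.0], [1.0, -1.0]])
print(M)                                   # approximately [[1.25, 0.75], [0.75, 1.25]]
print(mahalanobis_sq(M, [1.0, 0.0], [0.0, 0.0]))
```

Because each term contributes w_i (u_i . (x - y))^2, the distance stays nonnegative without any explicit semidefinite constraint, and the trace of M equals the sum of the weights.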
Stabilized Nearest Neighbor Classifier and Its Statistical Properties
The stability of statistical analysis is an important indicator of
reproducibility, which is a main principle of the scientific method. It entails
that similar statistical conclusions can be reached based on independent
samples from the same underlying population. In this paper, we introduce a
general measure of classification instability (CIS) to quantify the sampling
variability of the prediction made by a classification method. Interestingly,
the asymptotic CIS of any weighted nearest neighbor classifier turns out to be
proportional to the Euclidean norm of its weight vector. Based on this concise
form, we propose a stabilized nearest neighbor (SNN) classifier, which
distinguishes itself from other nearest neighbor classifiers, by taking the
stability into consideration. In theory, we prove that SNN attains the minimax
optimal convergence rate in risk, and a sharp convergence rate in CIS. The
latter rate result is established for general plug-in classifiers under a
low-noise condition. Extensive simulated and real examples demonstrate that SNN
achieves a considerable improvement in CIS over existing nearest neighbor
classifiers, with comparable classification accuracy. We implement the
algorithm in a publicly available R package snn.
Comment: 48 Pages, 11 Figures. To Appear in JASA--T&
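
Since the asymptotic CIS is stated to be proportional to the Euclidean norm of the weight vector, comparing norms gives a rough way to compare the stability of weighted nearest neighbor rules. The sketch below does this for plain k-NN, whose weight vector puts 1/k on the k nearest neighbors and zero elsewhere; it is not the SNN classifier, the proportionality constant is not given in the abstract, and the function names are hypothetical.

```python
import math

def knn_weights(n, k):
    """Weight vector of plain k-NN on n training points: 1/k on the k nearest."""
    return [1.0 / k] * k + [0.0] * (n - k)

def weight_norm(weights):
    """Euclidean norm of a weight vector."""
    return math.sqrt(sum(w * w for w in weights))

# The norm equals 1 / sqrt(k), so it shrinks as k grows.
for k in (1, 4, 25, 100):
    print(k, weight_norm(knn_weights(100, k)))
```

Under the stated proportionality, a smaller norm corresponds to lower instability, which is consistent with larger k averaging over more neighbors.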
Classification hardness for supervised learners on 20 years of intrusion detection data
This article consolidates analysis of established (NSL-KDD) and new intrusion detection datasets (ISCXIDS2012, CICIDS2017, CICIDS2018) through the use of supervised machine learning (ML) algorithms. The uniform analysis procedure makes the obtained results comparable. It also provides a stronger foundation for conclusions about the efficacy of supervised learners on the main classification task in network security. This research is motivated in part by the lack of adoption of these modern datasets. We start with a broad scope, covering classification by algorithms from different families on both established and new datasets, to expand the existing foundation and reveal the most opportune avenues for further inquiry. After obtaining baseline results, the classification task was made more difficult by reducing the data available to learn from, both horizontally and vertically. The data reduction serves as a stress test to verify whether the very high baseline results hold up under increasingly harsh constraints. Ultimately, this work contains the most comprehensive set of results on the topic of intrusion detection through supervised machine learning. Researchers working on algorithmic improvements can compare their results to this collection, knowing that all results reported here were gathered through a uniform framework. This work's main contributions are the outstanding classification results on the current state-of-the-art datasets for intrusion detection and the conclusion that these methods show remarkable resilience in classification performance even when the amount of data to learn from is aggressively reduced.
Positive Semidefinite Metric Learning with Boosting
The learning of appropriate distance metrics is a critical problem in image
classification and retrieval. In this work, we propose a boosting-based
technique, termed BoostMetric, for learning a Mahalanobis distance metric. One
of the primary difficulties in learning such a metric is to ensure that the
Mahalanobis matrix remains positive semidefinite. Semidefinite programming is
sometimes used to enforce this constraint, but does not scale well.
BoostMetric is instead based on a key observation that any positive
semidefinite matrix can be decomposed into a positive linear combination of
trace-one rank-one matrices. BoostMetric thus uses rank-one positive
semidefinite matrices as weak learners within an efficient and scalable
boosting-based learning process. The resulting method is easy to implement,
does not require tuning, and can accommodate various types of constraints.
Experiments on various datasets show that the proposed algorithm compares
favorably to state-of-the-art methods in terms of classification accuracy
and running time.
Comment: 11 pages, Twenty-Third Annual Conference on Neural Information
Processing Systems (NIPS 2009), Vancouver, Canada