k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)
Perhaps the most straightforward classifier in the arsenal of machine
learning techniques is the Nearest Neighbour Classifier -- classification is
achieved by identifying the nearest neighbours to a query example and using
those neighbours to determine the class of the query. This approach to
classification is of particular importance because poor run-time
performance is not such a problem these days with the computational power that
is available. This paper presents an overview of techniques for Nearest
Neighbour classification, focusing on: mechanisms for assessing similarity
(distance), computational issues in identifying nearest neighbours, and
mechanisms for reducing the dimension of the data.
This paper is the second edition of a paper previously published as a
technical report. Sections on similarity measures for time-series, retrieval
speed-up and intrinsic dimensionality have been added. An Appendix is included
providing access to Python code for the key methods.
Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kN
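The classification rule described in this abstract (find the nearest neighbours of a query, then let them determine its class) can be sketched as follows. This is a minimal illustration, not the code from the paper's appendix: the function name `knn_classify`, the Euclidean distance, the majority vote, and the toy data are all assumptions made for the example.

```python
from collections import Counter
import math


def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training examples.

    `train` is a list of (feature_vector, label) pairs (hypothetical layout).
    Euclidean distance is assumed; other similarity measures could be swapped in.
    """
    # Brute force: rank every training example by distance to the query.
    neighbours = sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]
    # Count the labels of the k nearest examples and return the most common one.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Toy data, made up for illustration only.
train = [
    ((0.0, 0.0), "a"),
    ((0.1, 0.2), "a"),
    ((1.0, 1.0), "b"),
    ((0.9, 1.1), "b"),
    ((1.2, 0.8), "b"),
]

print(knn_classify(train, (0.05, 0.1), k=3))
print(knn_classify(train, (1.0, 0.9), k=3))
```

The brute-force scan costs one distance computation per training example for every query, which is the run-time cost that the speed-up techniques mentioned in the abstract aim to reduce.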
Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space
For a set of $n$ points in $\mathbb{R}^d$, and parameters $k$ and $\varepsilon$, we present
a data structure that answers $(1+\varepsilon,k)$-ANN queries in logarithmic time.
Surprisingly, the space used by the data structure is $\tilde{O}(n/k)$; that
is, the space used is sublinear in the input size if $k$ is sufficiently large.
Our approach provides a novel way to summarize geometric data, such that
meaningful proximity queries on the data can be carried out using this sketch.
Using this, we provide a sublinear space data-structure that can estimate the
density of a point set under various measures, including:
(i) the sum of distances of the $k$ closest points to the query point, and
(ii) the sum of squared distances of the $k$ closest points to the query point.
Our approach generalizes to other distance-based density estimates of a
similar flavor. We also study the problem of approximating some of these
quantities when using sampling. In particular, we show that a sample of size
$\tilde{O}(n/k)$ is sufficient, in some restricted cases, to estimate the above
quantities. Remarkably, the sample size has only linear dependency on the
dimension.
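To make the two density measures concrete, the sketch below computes them exactly by brute force: the sum of distances, and the sum of squared distances, from a query to its k closest points. This is only a linear-space baseline for reference, not the sublinear-space data structure or the sampling scheme from the paper; the function name `knn_density_measures`, the Euclidean metric, and the sample points are assumptions made for illustration.

```python
import heapq
import math


def knn_density_measures(points, query, k):
    """Return (sum of distances, sum of squared distances) to the k closest points.

    Exact brute-force baseline: scans every point, so it uses the full input
    rather than a sketch or sample of it.
    """
    # The k smallest Euclidean distances from the query to the point set.
    dists = heapq.nsmallest(k, (math.dist(p, query) for p in points))
    return sum(dists), sum(d * d for d in dists)


# Toy point set, made up for illustration only.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, 0.0)]

total, total_sq = knn_density_measures(points, (0.0, 0.0), k=3)
print(total, total_sq)
```

Smaller values indicate that the query lies in a denser region of the point set, which is why these sums can serve as density estimates.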
Accelerating Nearest Neighbor Search on Manycore Systems
We develop methods for accelerating metric similarity search that are
effective on modern hardware. Our algorithms factor into easily parallelizable
components, making them simple to deploy and efficient on multicore CPUs and
GPUs. Despite the simple structure of our algorithms, their search performance
is provably sublinear in the size of the database, with a factor dependent only
on its intrinsic dimensionality. We demonstrate that our methods provide
substantial speedups on a range of datasets and hardware platforms. In
particular, we present results on a 48-core server machine, on graphics
hardware, and on a multicore desktop.
Efficient Nearest Neighbor Classification Using a Cascade of Approximate Similarity Measures
Nearest neighbor classification using shape context can yield highly accurate results in a number of recognition problems. Unfortunately, the approach can be too slow for practical applications, and thus approximation strategies are needed to make shape context practical. This paper proposes a method for efficient and accurate nearest neighbor classification in non-Euclidean spaces, such as the space induced by the shape context measure. First, a method is introduced for constructing a Euclidean embedding that is optimized for nearest neighbor classification accuracy. Using that embedding, multiple approximations of the underlying non-Euclidean similarity measure are obtained, at different levels of accuracy and efficiency. The approximations are automatically combined to form a cascade classifier, which applies the slower approximations only to the hardest cases. Unlike typical cascade-of-classifiers approaches, which are applied to binary classification problems, our method constructs a cascade for a multiclass problem. Experiments with a standard shape data set indicate that a speedup of two to three orders of magnitude is gained over the standard shape context classifier, with minimal losses in classification accuracy.
National Science Foundation (IIS-0308213, IIS-0329009, EIA-0202067); Office of Naval Research (N00014-03-1-0108