A MapReduce-based nearest neighbor approach for big-data-driven traffic flow prediction
In big-data-driven traffic flow prediction systems, the robustness of prediction performance depends on both accuracy and timeliness. This paper presents a new MapReduce-based nearest neighbor (NN) approach for traffic flow prediction using correlation analysis (TFPC) on a Hadoop platform. In particular, we develop a real-time prediction system with two key modules: offline distributed training (ODT) and online parallel prediction (OPP). Moreover, we build a parallel k-nearest-neighbor optimization classifier that incorporates correlation information among traffic flows into the classification process. Finally, we propose a novel prediction calculation method that combines the current data observed in OPP with the classification results obtained from large-scale historical data in ODT to generate traffic flow predictions in real time. An empirical study on real-world traffic flow big data, using leave-one-out cross validation, shows that TFPC significantly outperforms four state-of-the-art prediction approaches, i.e., autoregressive integrated moving average, Naïve Bayes, multilayer perceptron neural networks, and NN regression, in terms of accuracy, improving it by up to 90.07% in the best case, with an average mean absolute percent error of 5.53%. In addition, it displays excellent speedup, scaleup, and sizeup.
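The offline/online split described in this abstract can be illustrated with a toy map/reduce-style k-NN regression: map tasks find local nearest neighbors within each partition of the historical data, a reduce step merges them into a global top k, and the prediction is a distance-weighted average of the neighbors' observed flows. This is a minimal sketch of the general MapReduce k-NN pattern, not the paper's TFPC algorithm; the function names and the inverse-distance weighting are assumptions.

```python
from heapq import nsmallest

# Illustrative sketch only: a map/reduce-style k-NN regression for flow
# prediction. Each historical record is (feature_vector, observed_flow).

def map_partition(partition, query, k):
    # Map: each worker returns its local k nearest historical records
    # as (squared_distance, observed_flow) pairs.
    return nsmallest(
        k,
        ((sum((a - b) ** 2 for a, b in zip(x, query)), y) for x, y in partition),
    )

def reduce_topk(local_results, k):
    # Reduce: merge per-partition candidates into the global top k.
    return nsmallest(k, (pair for local in local_results for pair in local))

def predict(partitions, query, k=3):
    top = reduce_topk([map_partition(p, query, k) for p in partitions], k)
    # Inverse-distance-weighted average of the neighbors' observed flows
    # (an assumed weighting scheme, for illustration).
    weights = [1.0 / (d + 1e-9) for d, _ in top]
    return sum(w * y for w, (_, y) in zip(weights, top)) / sum(weights)
```

In a real Hadoop job the map and reduce functions would run on separate workers over HDFS partitions; here they are plain functions so the data flow is easy to follow.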
Stabilized Nearest Neighbor Classifier and Its Statistical Properties
The stability of statistical analysis is an important indicator for
reproducibility, a central principle of the scientific method. It entails
that similar statistical conclusions can be reached based on independent
samples from the same underlying population. In this paper, we introduce a
general measure of classification instability (CIS) to quantify the sampling
variability of the prediction made by a classification method. Interestingly,
the asymptotic CIS of any weighted nearest neighbor classifier turns out to be
proportional to the Euclidean norm of its weight vector. Based on this concise
form, we propose a stabilized nearest neighbor (SNN) classifier, which
distinguishes itself from other nearest neighbor classifiers, by taking the
stability into consideration. In theory, we prove that SNN attains the minimax
optimal convergence rate in risk, and a sharp convergence rate in CIS. The
latter rate result is established for general plug-in classifiers under a
low-noise condition. Extensive simulated and real examples demonstrate that SNN
achieves a considerable improvement in CIS over existing nearest neighbor
classifiers, with comparable classification accuracy. We implement the
algorithm in a publicly available R package, snn.
Comment: 48 pages, 11 figures. To appear in JASA--T&
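The norm result quoted in this abstract is easy to see numerically: if the asymptotic CIS of a weighted nearest neighbor classifier is proportional to the Euclidean norm of its normalized weight vector, then spreading weight over more neighbors lowers instability. A minimal sketch, with `cis_proxy` a hypothetical name for that norm (not from the paper or the snn package):

```python
import math

# Sketch under assumptions: compute the Euclidean norm of a normalized
# weight vector, which the abstract says is proportional to the
# classifier's asymptotic classification instability (CIS).

def cis_proxy(weights):
    total = sum(weights)
    return math.sqrt(sum((w / total) ** 2 for w in weights))

uniform = [1.0] * 10   # k-NN with k = 10: weight spread evenly
one_nn = [1.0]         # 1-NN: all weight on the nearest neighbor
print(cis_proxy(uniform))  # ~0.316: lower norm, more stable
print(cis_proxy(one_nn))   # 1.0: highest norm, least stable
```

This is why 1-NN is the least stable weighted NN classifier by this measure, and why a stabilized classifier like SNN would trade some weight concentration for a smaller norm.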
Robust nearest-neighbor methods for classifying high-dimensional data
We suggest a robust nearest-neighbor approach to classifying high-dimensional
data. The method enhances sensitivity by employing a threshold, and truncates the
data to a sequence of zeros and ones in order to reduce the deleterious impact of
heavy-tailed data. Empirical rules are suggested for choosing the threshold.
They require the bare minimum of data; only one data vector is needed from each
population. Theoretical and numerical aspects of performance are explored,
paying particular attention to the impacts of correlation and heterogeneity
among data components. On the theoretical side, it is shown that our truncated,
thresholded, nearest-neighbor classifier enjoys the same classification
boundary as more conventional, nonrobust approaches, which require finite
moments in order to achieve good performance. In particular, the greater
robustness of our approach does not come at the price of reduced effectiveness.
Moreover, when both training sample sizes equal 1, our new method can have
performance equal to that of optimal classifiers that require independent and
identically distributed data with known marginal distributions; yet, our
classifier does not itself need conditions of this type.
Comment: Published at http://dx.doi.org/10.1214/08-AOS591 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
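The truncation idea in this abstract can be sketched as follows: each vector is reduced to a 0/1 sequence by thresholding its components, and classification picks the nearest training vector under a distance on the truncated sequences. This is an illustration under assumptions (the Hamming distance and the fixed threshold are mine, not the paper's empirical threshold rules), but it shows how heavy-tailed outliers lose their influence:

```python
# Illustrative sketch: threshold-and-truncate nearest neighbor
# classification. Only one training vector per population is needed,
# matching the "bare minimum of data" setting in the abstract.

def truncate(x, t):
    # Truncate each component to 1 if it exceeds the threshold t, else 0,
    # so extreme heavy-tailed values contribute no more than moderate ones.
    return [1 if abs(v) > t else 0 for v in x]

def classify(query, train_vectors, labels, t):
    q = truncate(query, t)
    # Nearest neighbor in Hamming distance on the truncated sequences
    # (an assumed distance, for illustration).
    dists = [sum(a != b for a, b in zip(q, truncate(x, t)))
             for x in train_vectors]
    return labels[dists.index(min(dists))]
```

A component of value 100 and one of value 5 both truncate to 1, so a single extreme observation cannot dominate the distance the way it would for a Euclidean nearest-neighbor rule.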