A survey of outlier detection methodologies
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, and so purify the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we introduce a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
Finito: A Faster, Permutable Incremental Gradient Method for Big Data Problems
Recent advances in optimization theory have shown that smooth strongly convex
finite sums can be minimized faster than by treating them as a black box
"batch" problem. In this work we introduce a new method in this class with a
theoretical convergence rate four times faster than existing methods, for sums
with sufficiently many terms. This method is also amenable to a sampling
without replacement scheme that in practice gives further speed-ups. We give
empirical results showing state-of-the-art performance.
Robust Classification for Imprecise Environments
In real-world environments it usually is difficult to specify target
operating conditions precisely, for example, target misclassification costs.
This uncertainty makes building robust classification systems problematic. We
show that it is possible to build a hybrid classifier that will perform at
least as well as the best available classifier for any target conditions. In
some cases, the performance of the hybrid actually can surpass that of the best
known classifier. This robust performance extends across a wide variety of
comparison frameworks, including the optimization of metrics such as accuracy,
expected cost, lift, precision, recall, and workforce utilization. The hybrid
also is efficient to build, to store, and to update. The hybrid is based on a
method for the comparison of classifier performance that is robust to imprecise
class distributions and misclassification costs. The ROC convex hull (ROCCH)
method combines techniques from ROC analysis, decision analysis and
computational geometry, and adapts them to the particulars of analyzing learned
classifiers. The method is efficient and incremental, minimizes the management
of classifier performance data, and allows for clear visual comparisons and
sensitivity analyses. Finally, we point to empirical evidence that a robust
hybrid classifier indeed is needed for many real-world problems.
Comment: 24 pages, 12 figures. To be published in Machine Learning Journal.
For related papers, see http://www.hpl.hp.com/personal/Tom_Fawcett/ROCCH
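The geometric idea behind the ROC convex hull can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `roc_convex_hull` and `best_operating_point`, the sample points, and the slope convention (negative-class prior times false-positive cost, divided by positive-class prior times false-negative cost) are assumptions made here for illustration. Each classifier is a point (false positive rate, true positive rate); the hull keeps only the classifiers that can be optimal under some class distribution and cost setting.

```python
from typing import Iterable, List, Tuple

Point = Tuple[float, float]  # (false positive rate, true positive rate)


def _cross(o: Point, a: Point, b: Point) -> float:
    """Cross product of vectors o->a and o->b (negative means a right turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points: Iterable[Point]) -> List[Point]:
    """Return the upper convex hull of ROC points, ordered by increasing FPR.

    The trivial classifiers (0, 0) and (1, 1) are always included.
    """
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull: List[Point] = []
    for p in pts:
        # Drop the last hull point while it lies on or below the segment
        # from its predecessor to the new point p.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def best_operating_point(hull: List[Point], slope: float) -> Point:
    """Pick the hull vertex touched first by an iso-performance line.

    `slope` is assumed to be (P(neg) * cost_fp) / (P(pos) * cost_fn);
    the best vertex maximizes TPR - slope * FPR.
    """
    return max(hull, key=lambda p: p[1] - slope * p[0])


# Hypothetical classifier results, invented for this example.
classifiers = [(0.1, 0.5), (0.3, 0.6), (0.4, 0.3), (0.5, 0.9)]
hull = roc_convex_hull(classifiers)
```

With these sample points, (0.3, 0.6) and (0.4, 0.3) fall below the hull and are discarded. Steep iso-performance lines (costly false positives) select the conservative classifier at (0.1, 0.5), while shallow ones select (0.5, 0.9).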
Stochastic Primal-Dual Coordinate Method for Regularized Empirical Risk Minimization
We consider a generic convex optimization problem associated with regularized
empirical risk minimization of linear predictors. The problem structure allows
us to reformulate it as a convex-concave saddle point problem. We propose a
stochastic primal-dual coordinate (SPDC) method, which alternates between
maximizing over a randomly chosen dual variable and minimizing over the primal
variable. An extrapolation step on the primal variable is performed to obtain
an accelerated convergence rate. We also develop a mini-batch version of the SPDC
method which facilitates parallel computing, and an extension with weighted
sampling probabilities on the dual variables, which has a better complexity
than uniform sampling on unnormalized data. Both theoretically and empirically,
we show that the SPDC method has comparable or better performance than several
state-of-the-art optimization methods.
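The saddle-point reformulation mentioned in the abstract can be written out as follows. The notation is an assumption, since the abstract does not fix any symbols: here a_i are the feature vectors, \phi_i the convex loss functions, \phi_i^* their convex conjugates, and g the regularizer. The equivalence rests on the standard identity \phi_i(z) = \max_y \{ y z - \phi_i^*(y) \} for closed convex \phi_i.

```latex
\min_{x \in \mathbb{R}^d} \; \frac{1}{n} \sum_{i=1}^{n} \phi_i(a_i^\top x) + g(x)
\quad \Longleftrightarrow \quad
\min_{x \in \mathbb{R}^d} \, \max_{y \in \mathbb{R}^n} \;
\frac{1}{n} \sum_{i=1}^{n} \bigl( y_i \, a_i^\top x - \phi_i^*(y_i) \bigr) + g(x)
```

In this form each dual variable y_i is tied to a single data point, which is what makes it natural to update one randomly chosen y_i per iteration while minimizing over the primal variable x.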