A survey of outlier detection methodologies
Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error or simply through natural deviations in populations. Detecting them can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques drawn from the full gamut of Computer Science and Statistics are now used. In this paper, we present a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
Adaptive Threshold Sampling and Estimation
Sampling is a fundamental problem in both computer science and statistics. A
number of issues arise when designing a method based on sampling. These include
statistical considerations such as constructing a good sampling design and
ensuring there are good, tractable estimators for the quantities of interest as
well as computational considerations such as designing fast algorithms for
streaming data and ensuring the sample fits within memory constraints.
Unfortunately, existing sampling methods are only able to address all of these
issues in limited scenarios.
We develop a framework that can be used to address these issues in a broad
range of scenarios. In particular, it addresses the problem of drawing and
using samples under some memory budget constraint. This problem can be
challenging since the memory budget forces samples to be drawn
non-independently and consequently, makes computation of resulting estimators
difficult.
At the core of the framework is the notion of a data adaptive thresholding
scheme where the threshold effectively allows one to treat the non-independent
sample as if it were drawn independently. We provide sufficient conditions for
a thresholding scheme to allow this and provide ways to build and compose such
schemes.
Furthermore, we provide fast algorithms to efficiently sample under these
thresholding schemes.
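
As a rough illustration of the thresholding idea, the sketch below uses bottom-k sampling, a well-known scheme in which the threshold adapts to the data so that the sample fits a fixed memory budget. It is not taken from the paper: the function names `bottom_k_sample` and `estimate_sum` and the budget parameter `k` are assumptions made for this example. Each item receives a uniform random priority, the threshold is the (k+1)-th smallest priority, and the retained items are then weighted as if each had been included independently with probability equal to the threshold.

```python
import heapq
import random


def bottom_k_sample(stream, k, rng):
    """Return (sample, threshold) for a stream under a budget of k items.

    Hypothetical helper for illustration: keeps the k items with the
    smallest uniform priorities; the threshold is the (k+1)-th smallest.
    """
    heap = []  # max-heap via negated priorities: (-priority, index, value)
    for i, x in enumerate(stream):
        u = rng.random()
        if len(heap) < k + 1:
            heapq.heappush(heap, (-u, i, x))
        elif u < -heap[0][0]:
            # New priority beats the current largest one kept: swap it in.
            heapq.heapreplace(heap, (-u, i, x))
    if len(heap) <= k:
        # Stream fits in the budget: keep everything, inclusion probability 1.
        return [x for _, _, x in heap], 1.0
    threshold = -heap[0][0]
    sample = [x for neg_u, _, x in heap if -neg_u < threshold]
    return sample, threshold


def estimate_sum(sample, threshold):
    """Horvitz-Thompson style estimate of the stream total."""
    return sum(x / threshold for x in sample)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [1.0] * 1000
    sample, tau = bottom_k_sample(data, 100, rng)
    print(len(sample), tau, estimate_sum(sample, tau))
```

The estimator divides each kept value by the threshold, exactly as one would for an independent Bernoulli sample with that inclusion probability; only `k + 1` items are held in memory at any time.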
Bayesian emulation for optimization in multi-step portfolio decisions
We discuss the Bayesian emulation approach to computational solution of
multi-step portfolio studies in financial time series. "Bayesian emulation for
decisions" involves mapping the technical structure of a decision analysis
problem to that of Bayesian inference in a purely synthetic "emulating"
statistical model. This provides access to standard posterior analytic,
simulation and optimization methods that yield indirect solutions of the
decision problem. We develop this in time series portfolio analysis using
classes of economically and psychologically relevant multi-step ahead portfolio
utility functions. Studies with multivariate currency, commodity and stock
index time series illustrate the approach and show some of the practical
utility and benefits of the Bayesian emulation methodology. Comment: 24 pages, 7 figures, 2 tables
Cluster membership probabilities from proper motions and multiwavelength photometric catalogues: I. Method and application to the Pleiades cluster
We present a new technique designed to take full advantage of the high
dimensionality (photometric, astrometric, temporal) of the DANCe survey to
derive self-consistent and robust membership probabilities of the Pleiades
cluster. We aim to develop a methodology to infer membership probabilities
for the Pleiades cluster from the DANCe multidimensional astro-photometric data
set in a consistent way throughout the entire derivation. The determination of
the membership probabilities has to be applicable to censored data and must
incorporate the measurement uncertainties into the inference procedure.
We use Bayes' theorem and a curvilinear forward model for the likelihood of
the measurements of cluster members in the colour-magnitude space, to infer
posterior membership probabilities. The distribution of the cluster members'
proper motions and the distribution of contaminants in the full
multidimensional astro-photometric space are modelled with a
mixture-of-Gaussians likelihood. We analyse several representation spaces
composed of the proper motions plus a subset of the available magnitudes and
colour indices. We select two prominent representation spaces composed of
variables selected using feature relevance determination techniques based on
Random Forests, and analyse the resulting samples of high probability
candidates. We consistently find lists of high probability (p > 0.9975)
candidates with 1000 sources, 4 to 5 times more than obtained in the
most recent astro-photometric studies of the cluster.
The methodology presented here is ready for application in data sets that
include more dimensions, such as radial and/or rotational velocities, spectral
indices and variability. Comment: 14 pages, 4 figures, accepted by A&