
    A survey of outlier detection methodologies

    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise from mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply natural deviations in populations. Detecting them can expose system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we present a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.

    Adaptive Threshold Sampling and Estimation

    Sampling is a fundamental problem in both computer science and statistics. A number of issues arise when designing a method based on sampling. These include statistical considerations, such as constructing a good sampling design and ensuring there are good, tractable estimators for the quantities of interest, as well as computational considerations, such as designing fast algorithms for streaming data and ensuring the sample fits within memory constraints. Unfortunately, existing sampling methods are only able to address all of these issues in limited scenarios. We develop a framework that can be used to address these issues in a broad range of scenarios. In particular, it addresses the problem of drawing and using samples under some memory budget constraint. This problem can be challenging, since the memory budget forces samples to be drawn non-independently and consequently makes computation of the resulting estimators difficult. At the core of the framework is the notion of a data-adaptive thresholding scheme, where the threshold effectively allows one to treat the non-independent sample as if it were drawn independently. We provide sufficient conditions for a thresholding scheme to allow this and provide ways to build and compose such schemes. Furthermore, we provide fast algorithms to efficiently sample under these thresholding schemes.
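
One familiar instance of a data-adaptive threshold is bottom-k sampling, sketched below as an illustration only; it is not claimed to be the framework developed in the paper. Each item receives a uniform random priority, the k items with the smallest priorities are kept, and the (k+1)-th smallest priority serves as the threshold. Treating the kept items as if each had been included independently with probability equal to that threshold gives a Horvitz-Thompson style estimate of the stream total. The function names `bottom_k_sample` and `estimate_total` are hypothetical.

```python
import heapq
import random


def bottom_k_sample(stream, k, rng):
    """Return (sample, threshold) using at most k + 1 stored items.

    The threshold is the (k+1)-th smallest priority seen, or 1.0 if the
    stream held at most k items (in which case every item is kept).
    """
    # Max-heap on priority (via negation) holding the k + 1 smallest priorities.
    heap = []  # entries are (-priority, arrival_index, value)
    for i, x in enumerate(stream):
        u = rng.random()
        entry = (-u, i, x)
        if len(heap) < k + 1:
            heapq.heappush(heap, entry)
        elif u < -heap[0][0]:
            # New priority beats the largest stored priority: swap it in.
            heapq.heapreplace(heap, entry)
    if len(heap) <= k:
        return [x for _, _, x in heap], 1.0
    neg_tau, _, _ = heapq.heappop(heap)  # pops the largest stored priority
    return [x for _, _, x in heap], -neg_tau


def estimate_total(sample, threshold):
    """Weight each kept value by 1 / threshold, its assumed inclusion probability."""
    return sum(sample) / threshold


rng = random.Random(0)
sample, tau = bottom_k_sample(range(1000), k=50, rng=rng)
print(len(sample), tau, estimate_total(sample, tau))
```

The memory budget is fixed at k + 1 entries regardless of stream length, which is why inclusion decisions are not independent: whether one item is kept depends on the priorities of all the others.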

    Bayesian emulation for optimization in multi-step portfolio decisions

    We discuss the Bayesian emulation approach to computational solution of multi-step portfolio studies in financial time series. "Bayesian emulation for decisions" involves mapping the technical structure of a decision analysis problem to that of Bayesian inference in a purely synthetic "emulating" statistical model. This provides access to standard posterior analytic, simulation and optimization methods that yield indirect solutions of the decision problem. We develop this in time series portfolio analysis using classes of economically and psychologically relevant multi-step ahead portfolio utility functions. Studies with multivariate currency, commodity and stock index time series illustrate the approach and show some of the practical utility and benefits of the Bayesian emulation methodology. Comment: 24 pages, 7 figures, 2 tables

    Cluster membership probabilities from proper motions and multiwavelength photometric catalogues: I. Method and application to the Pleiades cluster

    We present a new technique designed to take full advantage of the high dimensionality (photometric, astrometric, temporal) of the DANCe survey to derive self-consistent and robust membership probabilities of the Pleiades cluster. We aim to develop a methodology that infers membership probabilities for the Pleiades cluster from the DANCe multidimensional astro-photometric data set in a consistent way throughout the entire derivation. The determination of the membership probabilities has to be applicable to censored data and must incorporate the measurement uncertainties into the inference procedure. We use Bayes' theorem and a curvilinear forward model for the likelihood of the measurements of cluster members in the colour-magnitude space to infer posterior membership probabilities. The distribution of the cluster members' proper motions and the distribution of contaminants in the full multidimensional astro-photometric space are modelled with a mixture-of-Gaussians likelihood. We analyse several representation spaces composed of the proper motions plus a subset of the available magnitudes and colour indices. We select two prominent representation spaces composed of variables selected using feature relevance determination techniques based on Random Forests, and analyse the resulting samples of high probability candidates. We consistently find lists of high probability (p > 0.9975) candidates with approximately 1000 sources, 4 to 5 times more than obtained in the most recent astro-photometric studies of the cluster. The methodology presented here is ready for application in data sets that include more dimensions, such as radial and/or rotational velocities, spectral indices and variability. Comment: 14 pages, 4 figures, accepted by A&
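
The Bayes' theorem step can be illustrated with a deliberately simplified two-component sketch: one Gaussian for cluster members and one for field contaminants, both in proper-motion space only. This is an assumption-laden toy, not the paper's model: it omits the curvilinear colour-magnitude forward model, the measurement uncertainties and the treatment of censored data, and every number below (prior fraction, means, dispersions) is a made-up illustrative value. The function names are hypothetical.

```python
import math


def gaussian_pdf_2d(x, mean, sigma):
    """Density of a 2-D Gaussian with independent axes (diagonal covariance)."""
    dens = 1.0
    for xi, mi, si in zip(x, mean, sigma):
        z = (xi - mi) / si
        dens *= math.exp(-0.5 * z * z) / (si * math.sqrt(2.0 * math.pi))
    return dens


def membership_probability(x, prior, cluster, field):
    """Posterior P(member | x) from Bayes' theorem for a two-component mixture.

    prior   : assumed prior fraction of cluster members
    cluster : (mean, sigma) of the member component
    field   : (mean, sigma) of the contaminant component
    """
    like_cluster = prior * gaussian_pdf_2d(x, *cluster)
    like_field = (1.0 - prior) * gaussian_pdf_2d(x, *field)
    return like_cluster / (like_cluster + like_field)


# Illustrative values only (proper motions in mas/yr), not fitted to any data.
cluster = ((20.0, -45.0), (1.5, 1.5))
field = ((0.0, 0.0), (20.0, 20.0))
for pm in [(20.0, -45.0), (18.0, -43.0), (0.0, 0.0)]:
    print(pm, membership_probability(pm, prior=0.05, cluster=cluster, field=field))
```

A source is then flagged as a high probability candidate when its posterior exceeds a chosen cut, such as the p > 0.9975 quoted in the abstract. For points extremely far from both components, both densities can underflow to zero, so a production version would work with log-densities instead.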