
    A survey of outlier detection methodologies

    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise from mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply natural deviations in populations. Detecting them can identify system faults and fraud before they escalate with potentially catastrophic consequences; it can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we present a survey of contemporary techniques for outlier detection, identify their respective motivations, and distinguish their advantages and disadvantages in a comparative review.
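
    As a concrete illustration of the simplest family of techniques a survey like this covers, the sketch below flags anomalies with the MAD-based "modified z-score" rule, which stays robust in the presence of the very outliers it is hunting. The 3.5 cutoff and the sample readings are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of one classical statistical outlier-detection rule:
# the median-absolute-deviation (MAD) based "modified z-score".
import statistics

def mad_outliers(values, cutoff=3.5):
    """Return values whose modified z-score exceeds the cutoff."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    # 0.6745 rescales the MAD to be comparable to a standard deviation;
    # a guard for mad == 0 (constant data) is omitted for brevity.
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]

readings = [10.1, 9.8, 10.3, 9.9, 10.0, 42.0]  # 42.0 is an injected anomaly
print(mad_outliers(readings))  # -> [42.0]
```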

    Massively-Parallel Feature Selection for Big Data

    We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) in Big Data settings (high dimensionality and/or sample size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix both in terms of rows (samples, training examples) and columns (features). By employing p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs. It then employs powerful and safe (asymptotically sound) heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, and Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Our empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores, while it dominates other competitive algorithms in its class.
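
    To make the meta-analysis idea concrete: each partition can compute a local p-value for a feature's conditional-independence test, and a method such as Fisher's combines them into one global p-value, so no raw data ever leaves a partition. The sketch below (using SciPy for the chi-squared tail) shows this plus a simplified Early Dropping rule; the specific dropping criterion and the example p-values are assumptions for illustration, not the authors' exact procedure.

```python
# Combining per-partition p-values with Fisher's method, then dropping
# features whose combined p-value is not significant (simplified sketch).
from math import log
from scipy.stats import chi2

def fisher_combine(p_values):
    """Combine independent p-values via Fisher's method."""
    statistic = -2.0 * sum(log(p) for p in p_values)
    return chi2.sf(statistic, 2 * len(p_values))  # chi-squared tail, df = 2k

def early_drop(local_pvals_per_feature, alpha=0.05):
    """Keep only features whose combined p-value stays below alpha."""
    combined = {f: fisher_combine(ps) for f, ps in local_pvals_per_feature.items()}
    return {f: p for f, p in combined.items() if p <= alpha}

# Hypothetical local p-values from three partitions for two features:
local = {"x1": [0.01, 0.02, 0.005], "x2": [0.40, 0.70, 0.55]}
print(early_drop(local))  # x1 survives; x2 is dropped early
```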

    Learning Models for Discrete Optimization

    We consider a class of optimization approaches that incorporate machine learning models into the algorithm structure. Our focus is on algorithms that can learn patterns in the search space in order to boost computational performance. The idea is to design optimization techniques that allow for computationally efficient tuning a priori. The final objective of this work is to provide efficient solvers that can be tuned for optimal performance in serial and parallel environments. This dissertation provides a novel machine learning model based on logistic regression and describes an implementation for scheduling problems. We incorporate the proposed learning model into a well-known optimization algorithm, tabu search, and demonstrate the potential of the underlying ideas. The dissertation also establishes a new framework for comparing optimization algorithms that is statistically meaningful and intuitive. Using this framework, we demonstrate that including the logistic regression model in the tabu search method provides a significant boost to its performance. Finally, we study a parallel implementation of the algorithm and evaluate its performance when more connections between threads exist.
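
    The sketch below illustrates the general idea of plugging a logistic scorer into tabu search: candidate moves are ranked by a logistic model before evaluation, so promising moves are tried first. The toy objective (minimizing weighted completion time of a job permutation), the hand-set logistic weights, and the feature choice are all illustrative assumptions; the dissertation's actual model and scheduling problem differ.

```python
# Tabu search over pairwise job swaps, with a logistic model ranking moves.
import math
import random

jobs = [(3, 2.0), (2, 1.0), (4, 3.0), (1, 2.5)]  # (duration, weight) per job

def cost(order):
    """Weighted completion time of the schedule."""
    t, total = 0, 0.0
    for j in order:
        dur, w = jobs[j]
        t += dur
        total += w * t
    return total

def move_score(order, i, k):
    """Logistic 'probability that swapping positions i and k improves things'."""
    d_i, w_i = jobs[order[i]]
    d_k, w_k = jobs[order[k]]
    x = (w_k - w_i) + 0.5 * (d_i - d_k)  # illustrative features
    return 1.0 / (1.0 + math.exp(-x))    # fixed weights stand in for training

def tabu_search(order, iters=50, tenure=5, top=3):
    best, best_cost = order[:], cost(order)
    tabu = {}  # move -> step until which it stays tabu
    for step in range(iters):
        swaps = [(i, k) for i in range(len(order)) for k in range(i + 1, len(order))]
        swaps.sort(key=lambda m: move_score(order, *m), reverse=True)
        for i, k in swaps[:top]:          # evaluate only the top-ranked moves
            if tabu.get((i, k), -1) >= step:
                continue
            order[i], order[k] = order[k], order[i]
            tabu[(i, k)] = step + tenure
            c = cost(order)
            if c < best_cost:
                best, best_cost = order[:], c
            break
    return best, best_cost

random.seed(0)
start = list(range(len(jobs)))
random.shuffle(start)
print(tabu_search(start))
```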

    Software development effort estimation modeling using a combination of fuzzy-neural network and differential evolution algorithm

    Software cost estimation has always been a serious challenge for software teams and should be considered carefully in the early stages of a project. A lack of sufficient information on final requirements, as well as inaccurate and vague requirements, are among the main reasons for unreliable estimates in this area. Although several effort estimation models have been proposed over the past decade, increasing their accuracy remains a contested issue, and researchers' efforts in this area are ongoing. This study presents a new model based on a hybrid of the adaptive network-based fuzzy inference system (ANFIS) and the differential evolution (DE) algorithm. The model aims to produce a more accurate estimate of software development effort and to perform well across a wide range of software projects compared to previous works. The proposed method outperformed optimization algorithms adapted from genetic algorithms, evolutionary algorithms, meta-heuristic algorithms, and neuro-fuzzy-based optimization algorithms, and improved accuracy by up to 7% on the MMRE and PRED(0.25) criteria.
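
    For readers unfamiliar with the optimizer half of the hybrid, the sketch below shows the classic DE/rand/1/bin loop that such a method uses to tune model parameters. The objective here is a toy stand-in for ANFIS training error (e.g. MMRE on a training set); the population size, F, and CR values are conventional defaults, not the paper's settings.

```python
# Differential evolution (DE/rand/1/bin) minimizing a toy objective that
# stands in for a fuzzy model's training error.
import random

def objective(params):
    """Toy stand-in for estimation error; minimum at params == [1, 1, ...]."""
    return sum((p - 1.0) ** 2 for p in params)

def differential_evolution(dim=4, pop_size=20, F=0.8, CR=0.9, generations=200):
    pop = [[random.uniform(-5, 5) for _ in range(dim)] for _ in range(pop_size)]
    fit = [objective(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutate: combine three distinct individuals other than i.
            a, b, c = random.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = random.randrange(dim)  # guarantee at least one mutated gene
            trial = [
                pop[a][j] + F * (pop[b][j] - pop[c][j])
                if (random.random() < CR or j == j_rand) else pop[i][j]
                for j in range(dim)
            ]
            f = objective(trial)
            if f <= fit[i]:                 # greedy selection
                pop[i], fit[i] = trial, f
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]

random.seed(1)
print(differential_evolution())  # parameters converge toward [1, 1, 1, 1]
```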