
    Towards Scalable and Unified Example-based Explanation and Outlier Detection

    When neural networks are employed for high-stakes decision making, it is desirable for them to provide explanations for their predictions so that we can understand the features that contributed to the decision. At the same time, it is important to flag potential outliers for in-depth verification by domain experts. In this work we propose to unify these two differing aspects, explainability and outlier detection. We argue for a broader adoption of prototype-based student networks capable of providing an example-based explanation for their predictions and, at the same time, identifying regions of similarity between the predicted sample and the examples. The examples are real prototypical cases sampled from the training set via our novel iterative prototype replacement algorithm. Furthermore, we propose to use the prototype similarity scores for identifying outliers. We compare the performance of our proposed network with other baselines in terms of classification, explanation quality, and outlier detection. We show that our prototype-based networks, extending beyond similarity kernels, deliver meaningful explanations and promising outlier detection results without compromising classification accuracy.
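
    The abstract does not spell out the scoring mechanism, but the core idea of using prototype similarity for outlier flagging can be sketched as follows. The encoder embeddings, the RBF-style kernel, and the threshold below are illustrative assumptions, not the authors' architecture: a sample is treated as an outlier candidate when even its best-matching prototype is dissimilar.

```python
import numpy as np

def prototype_similarities(embeddings: np.ndarray, prototypes: np.ndarray,
                           gamma: float = 1.0) -> np.ndarray:
    """RBF-style similarity between each sample embedding and each prototype.
    Returns an array of shape (n_samples, n_prototypes)."""
    d2 = ((embeddings[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def flag_outliers(embeddings, prototypes, threshold=0.1, gamma=1.0):
    """A sample is an outlier candidate if even its best-matching prototype is dissimilar."""
    best = prototype_similarities(embeddings, prototypes, gamma).max(axis=1)
    return best < threshold, best

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    protos = rng.normal(size=(5, 16))                     # hypothetical prototype embeddings
    inliers = protos[rng.integers(0, 5, 200)] + 0.1 * rng.normal(size=(200, 16))
    outliers = rng.normal(loc=4.0, size=(10, 16))         # far from every prototype
    flags, scores = flag_outliers(np.vstack([inliers, outliers]), protos)
    print("flagged:", int(flags.sum()), "of", len(flags))
```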

    Study of Distance-Based Outlier Detection Methods

    An outlier is an observation which differs from the others in a sample. Anomalies commonly occur in data due to measurement error. Anomaly detection is the task of identifying data in a given dataset that do not show normal behavior. It can be classified into three categories: unsupervised, supervised, and semi-supervised anomaly detection, and it is used in a variety of domains such as fault detection, fraud detection, health monitoring systems, and intrusion detection. Outlier detection methods can be grouped into five main categories: statistical-based, depth-based, clustering, distance-based, and density-based approaches. Distance-based methods, namely the index-based algorithm, the nested-loop algorithm, and LDOF, are discussed. To reduce the false positive error of LDOF, we propose the MLDOF algorithm. We tested LDOF and MLDOF by implementing them on several large, high-dimensional real datasets obtained from the UCI Machine Learning Repository. The experiments show that MLDOF improves the accuracy of anomaly detection with respect to LDOF and reduces the false positive error.
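
    For reference, the LDOF score that both algorithms build on compares a point's average distance to its k nearest neighbours with the average pairwise distance among those neighbours; scores well above 1 indicate points that sit outside their local neighbourhood. The sketch below implements plain LDOF only (the MLDOF modification is not described in the abstract); the brute-force distance matrix and parameter values are illustrative choices, not the paper's setup.

```python
import numpy as np
from itertools import combinations

def ldof_scores(X: np.ndarray, k: int = 10) -> np.ndarray:
    """Local Distance-based Outlier Factor (LDOF), minimal reference sketch.
    LDOF(p) = (mean distance from p to its k nearest neighbours)
              / (mean pairwise distance among those k neighbours)."""
    n = len(X)
    # Full pairwise distance matrix (fine for small data; use a KD-tree for large n).
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    scores = np.empty(n)
    for i in range(n):
        nn = np.argsort(D[i])[1:k + 1]          # k nearest neighbours, excluding i itself
        d_bar = D[i, nn].mean()                 # average distance of p to its neighbours
        D_bar = np.mean([D[a, b] for a, b in combinations(nn, 2)])
        scores[i] = d_bar / D_bar
    return scores

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])   # one planted outlier
    print("most outlying index:", ldof_scores(X, k=10).argmax())  # expect 100
```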

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. Outliers may be instances of error or may indicate genuine events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and to discover interesting and useful knowledge about unusual events within numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees for selecting the most suitable technique based on the characteristics of the data set. Furthermore, we highlight the advantages, disadvantages, and performance issues of each class of outlier detection techniques under this taxonomy framework.

    Detecting outlying subspaces for high-dimensional data: the new task, algorithms and performance

    In this paper, we identify a new task of studying the outlying degree (OD) of high-dimensional data, i.e. finding the subspaces (subsets of features) in which the given points are outliers, called their outlying subspaces. Since state-of-the-art outlier detection techniques fail to handle this new problem, we propose a novel detection algorithm, called High-Dimension Outlying subspace Detection (HighDOD), to detect the outlying subspaces of high-dimensional data efficiently. The intuitive idea of HighDOD is to measure the OD of a point by the sum of the distances between this point and its k nearest neighbors. Two heuristic pruning strategies are proposed to realize fast pruning in the subspace search, and an efficient dynamic subspace search method with a sample-based learning process has been implemented. Experimental results show that HighDOD is efficient and outperforms other search alternatives such as naive top-down, bottom-up, and random search, and that existing outlier detection methods cannot fulfill this new task effectively.
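
    The OD measure itself is simple to state: the sum of distances from a point to its k nearest neighbours, computed inside a candidate subspace. The sketch below shows that measure together with a brute-force enumeration of low-dimensional subspaces standing in for HighDOD's pruned, sample-based search; the ranking step and parameter values are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
from itertools import combinations

def outlying_degree(X: np.ndarray, idx: int, subspace, k: int = 5) -> float:
    """OD of the idx-th point in a feature subspace: the sum of distances to its
    k nearest neighbours, measured using only the selected features."""
    sub = list(subspace)
    d = np.sqrt(((X[:, sub] - X[idx, sub]) ** 2).sum(axis=1))
    return float(np.sort(d)[1:k + 1].sum())      # drop the zero self-distance

def naive_outlying_subspaces(X: np.ndarray, idx: int, max_dim: int = 2, k: int = 5):
    """Brute-force stand-in for HighDOD's pruned search: enumerate all subspaces
    up to max_dim features and return them ranked by the point's OD."""
    scored = [(sub, outlying_degree(X, idx, sub, k))
              for dim in range(1, max_dim + 1)
              for sub in combinations(range(X.shape[1]), dim)]
    return sorted(scored, key=lambda t: t[1], reverse=True)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(200, 4))
    X[0, 2] = 8.0                                  # point 0 is outlying only in feature 2
    print(naive_outlying_subspaces(X, idx=0)[:3])  # top subspaces should involve feature 2
```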

    Automatic Detection of Outliers in Multibeam Echo Sounding Data

    The data volumes produced by new-generation multibeam systems are very large, especially for shallow water systems. Results from recent multibeam surveys indicate that the ratio of field survey time to the time spent in interactive editing through graphical editing tools is about 1:1. An important reason for the large amount of processing time is that users subjectively decide which soundings are outliers. There is an apparent need for an automated approach to detecting outliers that would reduce the extensive labor and yield consistent results from the multibeam data cleaning process, independent of the individual who processed the data. The proposed automated algorithm for cleaning multibeam soundings was tested using the SAX-99 (Destin FL) multibeam survey data [2]. Eight days of survey data (6.9 gigabytes) were cleaned in 2.5 hours on an SGI platform. A comparison of the automatically cleaned data with the subjective, interactively cleaned data indicates that the proposed method is, if not better, at least equivalent to interactive editing as used on the SAX-99 multibeam data. Furthermore, the ratio of acquisition to processing time is considerably improved, since the time required for cleaning the data decreased from 192 hours to 2.5 hours (an improvement by a factor of 77).

    Shape Outlier Detection and Visualization for Functional Data: the Outliergram

    We propose a new method to visualize and detect shape outliers in samples of curves. In functional data analysis we observe curves defined over a given real interval, and shape outliers are those curves that exhibit a different shape from the rest of the sample. Whereas magnitude outliers, that is, curves that exhibit atypically high or low values at some points or across the whole interval, are in general easy to identify, shape outliers are often masked among the rest of the curves and thus difficult to detect. In this article we exploit the relation between two depths for functional data to help visualize curves in terms of shape and to develop an algorithm for shape outlier detection. We illustrate the use of the visualization tool, the outliergram, through several examples and assess the performance of the algorithm in a simulation study. We apply them to the detection of outliers in a children's growth dataset in which the girls' sample is contaminated with boys' curves and vice versa.
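
    The two functional depths the outliergram is usually described with are the modified band depth (MBD, with bands formed by pairs of curves) and the modified epigraph index (MEI): curves whose MBD falls well below the parabola traced by the MBD-MEI cloud are shape-outlier candidates. The sketch below computes both quantities on a common time grid and uses a fitted quadratic plus a boxplot-style cutoff as a simplified stand-in for the paper's exact parabolic bound and flagging rule; it is an illustration under those assumptions, not the authors' procedure.

```python
import numpy as np

def mei_mbd(curves: np.ndarray):
    """curves: array of shape (n_curves, n_timepoints) observed on a common grid.
    Returns the modified epigraph index (MEI) and modified band depth (MBD, J=2)."""
    n, _ = curves.shape
    on_or_above = (curves[None, :, :] >= curves[:, None, :]).sum(axis=1)   # incl. itself
    strictly_above = (curves[None, :, :] > curves[:, None, :]).sum(axis=1)
    strictly_below = (curves[None, :, :] < curves[:, None, :]).sum(axis=1)
    mei = on_or_above.mean(axis=1) / n
    choose2 = lambda m: m * (m - 1) / 2.0
    # pairs of curves whose band contains curve i at each time point
    enveloping = (n - 1) + choose2(n - 1) - choose2(strictly_above) - choose2(strictly_below)
    mbd = enveloping.mean(axis=1) / choose2(n)
    return mei, mbd

def outliergram_flags(curves: np.ndarray, factor: float = 1.5):
    """Simplified shape-outlier rule: flag curves lying far below a quadratic fit
    of MBD against MEI (the paper uses the exact parabolic bound instead)."""
    mei, mbd = mei_mbd(curves)
    resid = np.polyval(np.polyfit(mei, mbd, 2), mei) - mbd   # positive = below the fit
    q1, q3 = np.percentile(resid, [25, 75])
    return resid > q3 + factor * (q3 - q1), mei, mbd

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 50)
    sample = np.array([np.sin(2 * np.pi * t) + rng.normal(0, 0.1, t.size) for _ in range(40)])
    sample[0] = np.sin(4 * np.pi * t)         # a shape outlier with typical magnitude
    flags, _, _ = outliergram_flags(sample)
    print("flagged curves:", np.where(flags)[0])
```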

    FRIOD: a deeply integrated feature-rich interactive system for effective and efficient outlier detection

    In this paper, we propose a novel interactive outlier detection system called feature-rich interactive outlier detection (FRIOD), which features a deep integration of human interaction to improve detection performance and greatly streamline the detection process. A user-friendly interactive mechanism is developed to allow easy and intuitive user interaction in all the major stages of the underlying outlier detection algorithm, which include dense cell selection, location-aware distance thresholding, and final top-outlier validation. By doing so, we mitigate the major difficulty of competing outlier detection methods in specifying key parameter values, such as the density and distance thresholds. An innovative optimization approach is also proposed to optimize the grid-based space partitioning, which is a critical step of FRIOD. This optimization fully takes into account the high-quality outliers detected with the aid of human interaction. The experimental evaluation demonstrates that FRIOD can improve the quality of the detected outliers and make the detection process more intuitive, effective, and efficient.
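
    The interactive stages are not detailed in the abstract, but the grid-based first step can be illustrated by a fully automatic simplification: partition the space into equal-width cells, count points per cell, and treat points in sparsely populated cells as candidates for later inspection. The bin count and density threshold below are illustrative fixed values that a system like FRIOD would instead expose to the user.

```python
import numpy as np
from collections import Counter

def sparse_cell_candidates(X: np.ndarray, bins_per_dim: int = 10, min_count: int = 3):
    """Partition the space into an equal-width grid, count points per cell, and
    return indices of points falling in sparsely populated cells; these are the
    outlier candidates passed on to later (here omitted) validation stages."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / bins_per_dim, 1.0)
    cells = np.minimum(((X - lo) / width).astype(int), bins_per_dim - 1)
    counts = Counter(map(tuple, cells))
    return np.array([i for i, c in enumerate(cells) if counts[tuple(c)] < min_count])

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    X = np.vstack([rng.normal(size=(500, 2)), [[6.0, 6.0], [-5.0, 7.0]]])
    print(sparse_cell_candidates(X))     # expected to include the two planted points
```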

    Detection of Potential Transit Signals in Sixteen Quarters of Kepler Mission Data

    We present the results of a search for potential transit signals in four years of photometry data acquired by the Kepler Mission. The targets of the search include 111,800 stars which were observed for the entire interval and 85,522 stars which were observed for a subset of the interval. We found that 9,743 targets contained at least one signal consistent with the signature of a transiting or eclipsing object, where the criteria for detection are periodicity of the detected transits, adequate signal-to-noise ratio, and acceptance by a number of tests which reject false positive detections. When targets that had produced a signal were searched repeatedly, an additional 6,542 signals were detected on 3,223 target stars, for a total of 16,285 potential detections. Comparison of the set of detected signals with a set of known and vetted transit events in the Kepler field of view shows that the recovery rate for these signals is 96.9%. The ensemble properties of the detected signals are reviewed.

    A framework for exploration and cleaning of environmental data: Tehran air quality data experience

    Management and cleaning of large environmental monitoring data sets is a specific challenge. In this article, the authors present a novel framework for exploring and cleaning large datasets. As a case study, we applied the method to air quality data of Tehran, Iran from 1996 to 2013. The framework consists of data acquisition (here, data of particulate matter with aerodynamic diameter ≤10 µm, PM10), development of databases, initial descriptive analyses, removal of inconsistent data using a plausibility range (PR), and detection of missing-data patterns. Additionally, we developed a novel tool entitled the spatiotemporal screening tool (SST), which considers both the spatial and temporal nature of the data in the process of outlier detection. We also evaluated the effect of dust storms in the outlier detection phase. The raw mean concentration of PM10 before implementation of the algorithms was 88.96 µg/m³ for 1996-2013 in Tehran. After implementing the algorithms, in total 5.7% of data points were recognized as unacceptable outliers, of which 69% were detected by SST, 1% were detected via the dust storm algorithm, and 29% fell outside the PR. The mean concentration of PM10 after implementation of the algorithms was 88.41 µg/m³, while the standard deviation decreased markedly from 90.86 µg/m³ to 61.64 µg/m³. There was no distinguishable pattern in the missing data by hour, day, month, or year. We developed a novel framework for cleaning large environmental monitoring data sets which can identify hidden patterns, and we present a complete picture of PM10 from 1996 to 2013 in Tehran. Finally, we propose implementation of our framework on large spatiotemporal databases, especially in developing countries.
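
    The SST itself is not described in detail in the abstract, but the two cleaning steps it sits alongside can be sketched in a simplified, purely temporal form: discard values outside a plausibility range, then null out values whose robust (median/MAD) z-score against a rolling window at the same station is extreme. The column names, plausibility bounds, window length, and threshold below are assumptions for illustration, and the cross-station (spatial) comparison of the real SST is omitted.

```python
import numpy as np
import pandas as pd

def clean_pm10(df: pd.DataFrame, pr=(0, 1000), window=24, z_max=4.0) -> pd.DataFrame:
    """df: hourly records with (assumed) columns ['time', 'station', 'pm10']."""
    df = df.sort_values(["station", "time"]).copy()
    # 1) Plausibility range: values outside the physically plausible interval are removed.
    df.loc[~df["pm10"].between(*pr), "pm10"] = np.nan
    # 2) Temporal screening per station: robust z-score against a rolling window.
    rolling_median = df.groupby("station")["pm10"].transform(
        lambda s: s.rolling(window, center=True, min_periods=6).median())
    abs_dev = (df["pm10"] - rolling_median).abs()
    rolling_mad = abs_dev.groupby(df["station"]).transform(
        lambda s: s.rolling(window, center=True, min_periods=6).median())
    z = (df["pm10"] - rolling_median) / (1.4826 * rolling_mad.replace(0, np.nan))
    df.loc[z.abs() > z_max, "pm10"] = np.nan       # extreme values become missing
    return df
```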

    A survey of outlier detection methodologies

    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise due to mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply through natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences. It can also identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of computer science and statistics. In this paper, we present a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.