25,517 research outputs found

    Fast and Exact Outlier Detection in Metric Spaces: A Proximity Graph-based Approach

    Distance-based outlier detection is widely adopted in many fields, e.g., data mining and machine learning, because it is unsupervised, can be employed in a generic metric space, and makes no assumptions about the data distribution. Data mining and machine learning applications must deal with large datasets, which calls for efficient distance-based outlier detection algorithms. Because computing environments with large memory are now common, it is possible to build a main-memory index and detect outliers with it, which is a promising route to fast distance-based outlier detection. Motivated by this observation, we propose a novel approach that exploits a proximity graph. Our approach can employ an arbitrary proximity graph and obtains a significant speed-up over the state of the art. However, designing an effective proximity graph is challenging, because existing proximity graphs are not designed for efficient traversal in distance-based outlier detection. To overcome this challenge, we propose a novel proximity graph, MRPG. Our empirical study using real datasets demonstrates that MRPG detects outliers significantly faster than the state-of-the-art algorithms.
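The abstract above does not spell out the outlier definition, so the following is only a minimal sketch under an assumed, commonly used formulation: a point is an outlier if fewer than `k` other points lie within distance `r` of it. The function name `distance_based_outliers` and the parameters `r`, `k`, and `metric` are illustrative, not taken from the paper. This is the brute-force quadratic baseline that an index would aim to speed up; it is not the proximity-graph approach or MRPG.

```python
from math import dist


def distance_based_outliers(points, r, k, metric=dist):
    """Return indices of points with fewer than k other points within distance r.

    Assumed definition (not stated in the abstract). Any metric can be
    passed in, since the method only needs pairwise distances.
    """
    outliers = []
    for i, p in enumerate(points):
        neighbors = 0
        for j, q in enumerate(points):
            if i != j and metric(p, q) <= r:
                neighbors += 1
                if neighbors >= k:
                    break  # enough neighbours found: p is an inlier
        if neighbors < k:
            outliers.append(i)
    return outliers


# Four clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(distance_based_outliers(points, r=0.5, k=2))
```

The early `break` stops counting as soon as a point is known to be an inlier, which is the main saving available without an index.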

    Outlier detection in large databases using a cell-based approach

    This article addresses the problem of outlier detection in large databases. Building on the cell-based approach proposed by Edwin Knorr and Raymond Ng in 1998 in "Algorithms for Mining Distance-Based Outliers in Large Datasets", several versions of the algorithm were implemented that overcome the limitations of the original work, with modifications aimed at improving the algorithm's efficiency and its usability in different scenarios. Sociedad Argentina de Informática e Investigación Operativa (SADIO).

    Outlier Mining Methods Based on Graph Structure Analysis

    Outlier detection in high-dimensional datasets is a fundamental and challenging problem across disciplines that also has practical implications, as removing outliers from the training set improves the performance of machine learning algorithms. While many outlier mining algorithms have been proposed in the literature, they tend to be valid or efficient only for specific types of datasets (time series, images, videos, etc.). Here we propose two methods that can be applied to generic datasets, as long as there is a meaningful measure of distance between pairs of elements of the dataset. Both methods start by defining a graph in which the nodes are the elements of the dataset and the links carry weights equal to the distances between the nodes. The first method assigns an outlier score based on the percolation (i.e., the fragmentation) of the graph. The second method uses the popular IsoMap non-linear dimensionality reduction algorithm and assigns an outlier score by comparing the geodesic distances with the distances in the reduced space. We test these algorithms on real and synthetic datasets and show that they either outperform or perform on par with other popular outlier detection methods. A main advantage of the percolation method is that it is parameter-free and therefore does not require any training; the IsoMap method, on the other hand, has two integer parameters, and when they are appropriately selected, it performs similarly to or better than all the other methods tested. Peer reviewed. Postprint (published version).
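The abstract does not say how the percolation score is computed, so the following is only one plausible reading, sketched under stated assumptions: start from the complete distance-weighted graph, remove links from longest to shortest, and score each node by the weight of the link whose removal first separates it from the largest connected component. The names `largest_component` and `percolation_scores`, the Euclidean distance, and the score of 0.0 for the last surviving node are all illustrative choices, not the authors' specification.

```python
from itertools import combinations
from math import dist


def largest_component(n, edges):
    """Return the node set of the largest connected component."""
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, best = set(), set()
    for start in range(n):
        if start in seen:
            continue
        comp, stack = {start}, [start]
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            for nxt in adj[node]:
                if nxt not in comp:
                    comp.add(nxt)
                    stack.append(nxt)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best


def percolation_scores(points):
    """Score each point by the link weight at which it leaves the largest component."""
    n = len(points)
    weighted = sorted(
        ((dist(points[a], points[b]), a, b) for a, b in combinations(range(n), 2)),
        reverse=True,  # longest links are removed first
    )
    edges = {(a, b) for _, a, b in weighted}
    scores = [None] * n
    for w, a, b in weighted:
        edges.discard((a, b))
        giant = largest_component(n, edges)
        for node in range(n):
            if scores[node] is None and node not in giant:
                scores[node] = w
    # Assumption: a node that is never separated gets the lowest score.
    return [0.0 if s is None else s for s in scores]


points = [(0.0,), (1.0,), (2.0,), (3.0,), (10.0,)]
scores = percolation_scores(points)
print(scores)
```

Under this reading an isolated point breaks away while long links are still being removed, so it receives a high score, and no parameter has to be chosen. Recomputing the components after every removal keeps the sketch short but is far too slow for large datasets.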

    Detecting outlying subspaces for high-dimensional data: the new task, algorithms and performance

    In this paper, we identify a new task for studying the outlying degree (OD) of high-dimensional data, i.e. finding the subspaces (subsets of features) in which the given points are outliers, which are called their outlying subspaces. Since state-of-the-art outlier detection techniques fail to handle this new problem, we propose a novel detection algorithm, called High-Dimension Outlying subspace Detection (HighDOD), to detect the outlying subspaces of high-dimensional data efficiently. The intuitive idea of HighDOD is to measure the OD of a point as the sum of the distances between this point and its k nearest neighbors. Two heuristic pruning strategies are proposed to realize fast pruning in the subspace search, and an efficient dynamic subspace search method with a sample-based learning process has been implemented. Experimental results show that HighDOD is efficient and outperforms other search alternatives such as the naive top–down, bottom–up and random search methods, and that existing outlier detection methods cannot fulfill this new task effectively.
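The outlying degree described above, the sum of distances from a point to its k nearest neighbors within a chosen subspace, can be sketched as follows. The function name `outlying_degree`, the representation of a subspace as a tuple of feature indices, and the Euclidean distance are assumptions made for illustration; the pruning strategies and the dynamic subspace search of HighDOD are not shown.

```python
from math import sqrt


def outlying_degree(data, index, subspace, k):
    """Sum of distances from data[index] to its k nearest neighbours,
    measured only over the feature indices listed in `subspace`."""
    p = data[index]
    dists = sorted(
        sqrt(sum((p[f] - q[f]) ** 2 for f in subspace))
        for j, q in enumerate(data)
        if j != index
    )
    return sum(dists[:k])


# The last point is ordinary in feature 0 but extreme in feature 1.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.1, 9.0)]
od_feature0 = outlying_degree(data, 3, subspace=(0,), k=2)
od_feature1 = outlying_degree(data, 3, subspace=(1,), k=2)
print(od_feature0, od_feature1)
```

Comparing the two values shows why the subspace matters: the same point has a small OD in one feature and a large OD in the other, so only the second subspace would be reported as outlying for it.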