298 research outputs found

    Towards outlier detection for high-dimensional data streams using projected outlier analysis strategy

    Outlier detection is an important research problem in data mining that aims to discover useful abnormal and irregular patterns hidden in large data sets. Most existing outlier detection methods deal only with static data of relatively low dimensionality. Recently, outlier detection for high-dimensional stream data has emerged as a new research problem. A key observation that motivates this research is that outliers in high-dimensional data are projected outliers, i.e., they are embedded in lower-dimensional subspaces. Detecting projected outliers from high-dimensional stream data is a very challenging task for several reasons. First, detecting projected outliers is difficult even for high-dimensional static data: the exhaustive search for the outlying subspaces in which projected outliers are embedded is an NP-hard problem. Second, algorithms for handling data streams are constrained to a single pass over the streaming data, under conditions of limited space and time criticality. Existing outlier detection methods are found to be ineffective for detecting projected outliers in high-dimensional data streams. In this thesis, we present a new technique, called the Stream Projected Outlier deTector (SPOT), which detects projected outliers in high-dimensional data streams. SPOT employs an innovative window-based time model to capture dynamic statistics from stream data, and a novel data structure containing a set of top sparse subspaces to detect projected outliers effectively. SPOT also employs a multi-objective genetic algorithm as an effective search method for finding the outlying subspaces in which most projected outliers are embedded. The experimental results demonstrate that SPOT is efficient and effective in detecting projected outliers in high-dimensional data streams.
The main contribution of this thesis is that it provides a backbone for tackling the challenging problem of outlier detection in high-dimensional data streams. SPOT can facilitate the discovery of useful abnormal patterns and can potentially be applied to a variety of high-demand applications, such as sensor network data monitoring and online transaction protection.
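The abstract above does not disclose SPOT's internals, so the following is only a toy illustration of the general idea it describes, not SPOT itself: over a sliding window, a point is flagged as a projected outlier when it falls into a sparsely populated cell of some low-dimensional subspace. All class and parameter names are hypothetical, and data is assumed normalized to [0, 1).

```python
from collections import Counter, deque
from itertools import combinations

class WindowedSubspaceDetector:
    """Toy sketch (not SPOT): flag a point as a projected outlier when,
    in some low-dimensional subspace, its grid cell holds few window points."""

    def __init__(self, window=100, bins=5, max_dim=2, min_count=2):
        self.window, self.bins = window, bins
        self.max_dim, self.min_count = max_dim, min_count
        self.points = deque()

    def _cell(self, point, dims):
        # Discretize the chosen coordinates into equi-width bins on [0, 1).
        return tuple(min(int(point[d] * self.bins), self.bins - 1) for d in dims)

    def insert(self, point):
        """Return the subspaces (dimension tuples) in which `point` is sparse."""
        outlying = []
        for k in range(1, self.max_dim + 1):
            for dims in combinations(range(len(point)), k):
                counts = Counter(self._cell(p, dims) for p in self.points)
                if counts[self._cell(point, dims)] < self.min_count:
                    outlying.append(dims)
        self.points.append(point)
        if len(self.points) > self.window:
            self.points.popleft()  # one-pass window model: evict the oldest point
        return outlying
```

The eviction step mirrors the single-pass, bounded-space constraint the abstract mentions; SPOT's actual time model and top sparse-subspace structure are more elaborate.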

    Empirical performance analysis of two algorithms for mining intentional knowledge of distance-based outliers

    This thesis presents an empirical analysis of two algorithms, Uplattice and Jumplattice, for mining intentional knowledge of distance-based outliers [19]. These algorithms distinguish the strongest and weak outliers among those detected. Finding outliers is an important task in major applications such as credit-card fraud detection and NHL statistical studies. Data sets of varying sizes have been tested to analyze the empirical behavior of the two algorithms, and effective data structures have been used to improve memory performance. The two algorithms provide intentional knowledge of the detected outliers, which explains why an identified outlier is exceptional. This knowledge helps the user assess the validity of outliers and thus provides an improved understanding of the data.
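The distance-based outlier notion these algorithms build on (Knorr and Ng's DB(p, D)-outliers, presumably reference [19]) can be sketched in a few lines; the Uplattice and Jumplattice algorithms themselves, which mine the intentional knowledge, are not shown here. This naive O(n^2) version is only the definition made executable:

```python
import math

def db_outliers(points, p=0.95, D=1.0):
    """Knorr-Ng style DB(p, D)-outliers: a point is an outlier when at
    least a fraction p of the remaining points lies farther than D from it."""
    outliers = []
    n = len(points)
    for i, x in enumerate(points):
        far = sum(1 for j, y in enumerate(points)
                  if j != i and math.dist(x, y) > D)
        if far >= p * (n - 1):
            outliers.append(i)
    return outliers
```

For example, in a tight cluster plus one distant point, only the distant point satisfies the definition for reasonable p and D.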

    A Semi-Supervised Feature Engineering Method for Effective Outlier Detection in Mixed Attribute Data Sets

    Outlier detection is one of the crucial tasks in data mining, as it can lead to the discovery of valuable and meaningful information within the data. An outlier is a data point that is notably dissimilar from the other data points in the data set. As such, outlier detection methods play an important role in identifying and removing outliers, thereby increasing the performance and accuracy of prediction systems. Outlier detection is used in many areas, such as financial fraud detection, disease prediction, and network intrusion detection. Traditional outlier detection methods are founded on distance measures that estimate the similarity between points and are confined to data sets that are purely continuous or purely categorical. These methods, though effective, fail to elucidate the relationship between outliers and the known clusters/classes in the data set. We refer to this relationship as the context for a reported outlier. Alternative outlier detection methods establish the context of a reported outlier using underlying contextual beliefs about the data, i.e., the established relationships between the attributes of the data set. Several recent studies explore contextual beliefs to determine outlier behavior. However, these methods do not scale to situations where the data points and their respective contexts are sparse, so the outliers they report tend to lose meaning. Another limitation is that they assume all features are equally important and neither consider nor determine subspaces among the features when identifying outliers. Furthermore, determining subspaces is computationally expensive, as the number of possible subspaces grows exponentially with dimensionality, which makes searching through all possible subspaces impractical.
In this thesis, we propose a Hybrid Bayesian Network approach to capture the underlying contextual beliefs and detect meaningful outliers in mixed attribute data sets. Hybrid Bayesian Networks encode the information of the data in their probability distributions, and outliers are the points that violate this information. To deal with sparse contexts, we use an angle-based similarity method, which is then combined with the joint probability distributions of the Hybrid Bayesian Network in a robust manner. For subspace selection, we employ a feature engineering method consisting of two-stage feature selection using the Maximal Information Coefficient and the Markov blankets of Hybrid Bayesian Networks to select highly correlated feature subspaces. The proposed method was tested on a real-world medical record data set. The results indicate that the algorithm successfully identifies meaningful outliers. Moreover, we compare the performance of our algorithm with existing baseline outlier detection algorithms, present a detailed analysis of the outliers reported by our method, and demonstrate its efficiency when handling data points with sparse contexts.
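The thesis does not specify how its angle-based similarity is computed; the standard angle-based measure is the cosine of the angle between two attribute vectors, sketched below purely as an illustration (the combination with the Bayesian network's joint distributions is not reproduced here):

```python
import math

def cosine_similarity(u, v):
    """Angle-based similarity: cosine of the angle between two vectors.
    Angles degrade more gracefully than raw distances when the
    vectors are sparse or high-dimensional."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for the zero vector
    return dot / (norm_u * norm_v)
```

Parallel vectors score 1, orthogonal vectors score 0, regardless of their magnitudes.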

    Non-parametric Methods for Correlation Analysis in Multivariate Data with Applications in Data Mining

    In this thesis, we develop novel methods for correlation analysis in multivariate data, with a special focus on mining correlated subspaces. Our methods address major open challenges that arise when combining correlation analysis with subspace mining. Beyond traditional correlation analysis, we explore interaction-preserving discretization of multivariate data and causality analysis. We conduct experiments on a variety of real-world data sets, and the results validate the benefits of our methods.
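The thesis develops non-parametric correlation measures, which are not reproduced here; as the simplest possible stand-in, a correlated subspace can be illustrated as a set of attribute pairs whose absolute Pearson correlation exceeds a threshold. Function names and the threshold are illustrative only:

```python
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def correlated_pairs(columns, threshold=0.8):
    """Return attribute index pairs whose absolute Pearson correlation
    meets `threshold`: the most basic notion of a correlated 2-D subspace."""
    return [(i, j) for i, j in combinations(range(len(columns)), 2)
            if abs(pearson(columns[i], columns[j])) >= threshold]
```

Real correlated-subspace mining must also handle higher-order and non-linear dependencies, which is exactly where non-parametric measures like those in the thesis come in.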

    Correlation Clustering

    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The core step of the KDD process is the application of a Data Mining algorithm to produce a particular enumeration of patterns and relationships in large databases. Clustering is one of the major data mining techniques and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within a cluster is maximized and the similarity of objects from different clusters is minimized. This can serve to group customers with similar interests, or to group genes with related functionalities. High-dimensional feature spaces currently pose a particular challenge for clustering techniques. Due to modern facilities of data collection, real data sets usually contain many features, which are often noisy or exhibit correlations with each other. Since these effects vary in relevance across different parts of the data set, irrelevant features cannot be discarded in advance; the selection of relevant features must therefore be integrated into the data mining technique. For about ten years, specialized clustering approaches have been developed that cope with the problems of high-dimensional data better than classic clustering approaches. Often, however, problems of very different natures are not distinguished from one another. A main objective of this thesis is therefore a systematic classification of the diverse approaches developed in recent years according to their task definition, their basic strategy, and their algorithmic approach. We discern as main categories the search for clusters (i) w.r.t. closeness of objects in axis-parallel subspaces, (ii) w.r.t. common behavior (patterns) of objects in axis-parallel subspaces, and (iii) w.r.t. closeness of objects in arbitrarily oriented subspaces (so-called correlation clusters).
For the third category, the remaining parts of the thesis describe novel approaches. A first approach adapts density-based clustering to the problem of correlation clustering. The starting point is the first density-based approach in this field, the algorithm 4C. Subsequently, enhancements and variations of this approach are discussed that allow for more robust, more efficient, or more effective behavior, or that even find hierarchies of correlation clusters and the corresponding subspaces. The density-based approach to correlation clustering, however, is fundamentally unable to solve some issues, since it requires an analysis of local neighborhoods, which is problematic in high-dimensional data. Therefore, a novel method is proposed that tackles the correlation clustering problem with a global approach. Finally, a method is proposed to derive models for correlation clusters, allowing for an interpretation of the clusters and facilitating a more thorough analysis in the corresponding domain science. Possible applications of these models are then presented and discussed.
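The core idea behind 4C, which the thesis takes as its starting point, is local PCA: the eigenvalues of a local covariance matrix reveal linear dependencies among the attributes, and the number of "strong" eigenvalues gives the dimensionality of a potential correlation cluster. A minimal sketch of that check (the function name and the threshold `delta` are illustrative, not 4C's actual parameters):

```python
import numpy as np

def correlation_dimension(points, delta=0.1):
    """4C-style local PCA check: eigenvalues of the covariance matrix that
    are small relative to the largest one indicate linear dependencies;
    the remaining count is the dimensionality of a correlation cluster."""
    X = np.asarray(points, dtype=float)
    cov = np.cov(X, rowvar=False)             # covariance of the local point set
    eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]  # descending order
    strong = int(np.sum(eigvals > delta * eigvals[0]))
    return strong
```

Points lying near a line in 2-D yield one strong eigenvalue, while an uncorrelated 2-D point set yields two; 4C applies this test to the epsilon-neighborhood of each point, which is exactly the locality that becomes unreliable in high dimensions, motivating the thesis's global approach.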

    Correlation-based methods for data cleaning, with application to biological databases

    Ph.D. thesis (Doctor of Philosophy).

    Efficient and effective outlier detection.

    by Chiu Lai Mei. Thesis (M.Phil.)--Chinese University of Hong Kong, 2003. Includes bibliographical references. Abstracts in English and Chinese. Contents:
    Chapter 1: Introduction (Outlier Analysis; Problem Statement: Binary Property of Outlier, Overlapping Clusters with Different Densities, Large Datasets, High Dimensional Datasets; Contributions)
    Chapter 2: Related Work in Outlier Detection (Clustering-Based, Distance-Based, Density-Based, and Deviation-Based Methods; Breakthrough Outlier Notion: Degree of Outlier-ness; LOF: Local Outlier Factor: Definitions, Properties, Algorithm, Time Complexity, LOF of High Dimensional Data)
    Chapter 3: LOF': Formula with Intuitive Meaning (Definition of LOF'; Properties; Time Complexity)
    Chapter 4: LOF'' for Detecting Small Groups of Outliers (Definition of LOF''; Properties; Time Complexity)
    Chapter 5: GridLOF for Pruning Reasonable Portions from Datasets (GridLOF Algorithm; Determine Values of Input Parameters: Number of Intervals w, Threshold Value σ; Advantages; Time Complexity)
    Chapter 6: SOF: Efficient Outlier Detection for High Dimensional Data (Motivation; Notations and Definitions; Formal Definition and Properties of SOF, the Subspace Outlier Factor; SOF-Algorithm: the Overall Framework; Identify Associated Subspaces of Clusters: Technical Details in Phase I; Technical Details in Phase II and Phase III: Identify Outliers, Subspace Quantization, X-Tree Index Structure, Compute GSOF and SOF, Assign SO Values, Multi-threads Programming; Time Complexity; Strength of SOF-Algorithm)
    Chapter 7: Experiments on LOF', LOF'' and GridLOF (Datasets Used; LOF'; LOF''; GridLOF)
    Chapter 8: Empirical Results of SOF (Synthetic Data Generation; Experimental Setup; Performance Measure: Quality Measurement, Scalability of SOF-Algorithm, Effect of Parameters on SOF-Algorithm)
    Chapter 9: Conclusion
    Bibliography; Publication
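The LOF', LOF'' and SOF measures studied in this thesis all build on the standard Local Outlier Factor of Breunig et al., which the contents above review in Chapter 2. For reference, a compact textbook LOF can be sketched as follows (simplified: exactly k neighbours, no tie handling, brute-force neighbour search):

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i], plus its k-distance."""
    d = sorted((math.dist(points[i], points[j]), j)
               for j in range(len(points)) if j != i)
    return [j for _, j in d[:k]], d[k - 1][0]

def lof(points, k=2):
    """Textbook Local Outlier Factor. Scores near 1 indicate inliers;
    scores clearly above 1 indicate local outliers."""
    neigh, kdist = zip(*(knn(points, i, k) for i in range(len(points))))

    def reach(i, j):   # reachability distance of i w.r.t. neighbour j
        return max(kdist[j], math.dist(points[i], points[j]))

    def lrd(i):        # local reachability density
        return k / sum(reach(i, j) for j in neigh[i])

    return [sum(lrd(j) for j in neigh[i]) / (k * lrd(i))
            for i in range(len(points))]
```

On a tight cluster plus one distant point, the cluster members score close to 1 while the distant point scores well above it; the thesis's variants modify this formula for intuitiveness (LOF'), small outlier groups (LOF''), and subspaces (SOF).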

    A Survey on Explainable Anomaly Detection

    Full text link
    In the past two decades, most research on anomaly detection has focused on improving detection accuracy while largely ignoring the explainability of the corresponding methods, leaving the explanation of outcomes to practitioners. As anomaly detection algorithms are increasingly used in safety-critical domains, providing explanations for the high-stakes decisions made in those domains has become an ethical and regulatory requirement. This work therefore provides a comprehensive and structured survey of state-of-the-art explainable anomaly detection techniques. We propose a taxonomy based on the main aspects that characterize each explainable anomaly detection technique, aiming to help practitioners and researchers find the explainable anomaly detection method that best suits their needs.
    Comment: accepted for publication by the ACM Transactions on Knowledge Discovery from Data (TKDD); preprint version.