User-Centric Active Learning for Outlier Detection
Outlier detection searches for unusual, rare observations in large, often high-dimensional data sets.
One of the fundamental challenges of outlier detection is that "unusual" typically depends on the perception of a user, the recipient of the detection result.
This makes it difficult to find a formal definition of "unusual" that matches user expectations.
One way to deal with this issue is active learning, i.e., methods that ask users to provide auxiliary information, such as class label annotations, to return algorithmic results that are more in line with the user input.
Active learning is well-suited for outlier detection, and many such methods have been proposed in recent years.
However, existing methods build upon strong assumptions.
One example is the assumption that users can always provide accurate feedback, regardless of how algorithmic results are presented to them -- an assumption which is unlikely to hold when data is high-dimensional.
It is an open question to which extent existing assumptions are in the way of realizing active learning in practice.
In this thesis, we study this question from different perspectives with a differentiated, user-centric view on active learning.
In the beginning, we structure and unify the research area on active learning for outlier detection.
Specifically, we present a rigorous specification of the learning setup, structure the basic building blocks, and propose novel evaluation standards.
Throughout our work, this structure has turned out to be essential to select a suitable active learning method, and to assess novel contributions in this field.
We then present two algorithmic contributions to make active learning for outlier detection user-centric.
First, we bring together two research areas that have been looked at independently so far: outlier detection in subspaces and active learning.
Subspace outlier detection comprises methods that improve outlier detection quality in high-dimensional data and make detection results easier to interpret.
Our approach combines them with active learning such that one can balance between detection quality and annotation effort.
Second, we address one of the fundamental difficulties with adapting active learning to specific applications: selecting good hyperparameter values.
Existing methods to estimate hyperparameter values are heuristics, and it is unclear in which settings they work well.
In this thesis, we therefore propose the first principled method to estimate hyperparameter values.
Our approach relies on active learning to estimate hyperparameter values, and returns a quality estimate of the values selected.
In the last part of the thesis, we look at validating active learning for outlier detection practically.
There, we have identified several technical and conceptual challenges which we have experienced firsthand in our research.
We structure and document them, and finally derive a roadmap towards validating active learning for outlier detection with user studies.
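The core loop the thesis builds on can be illustrated with a minimal, hypothetical sketch of one active learning round for outlier detection: an unsupervised detector scores points, the user is asked to label the point the detector is least certain about, and the decision threshold is adjusted toward the feedback. The toy detector, the feedback rule, and all names here are illustrative assumptions, not the thesis's actual method.

```python
# Hypothetical sketch of one active learning round for outlier detection.

def outlier_scores(data):
    """Score each point by absolute distance from the mean (toy detector)."""
    mean = sum(data) / len(data)
    return [abs(x - mean) for x in data]

def query_index(scores, threshold):
    """Uncertainty sampling: ask about the point whose score is closest to the threshold."""
    return min(range(len(scores)), key=lambda i: abs(scores[i] - threshold))

def update_threshold(threshold, score, is_outlier, step=0.5):
    """Move the threshold so the labelled point falls on the correct side."""
    if is_outlier and score < threshold:
        return score - step      # lower threshold to include this outlier
    if not is_outlier and score >= threshold:
        return score + step      # raise threshold to exclude this inlier
    return threshold

data = [1.0, 1.2, 0.9, 1.1, 8.0]
scores = outlier_scores(data)
threshold = 6.0
i = query_index(scores, threshold)   # the point queried for a user label
threshold = update_threshold(threshold, scores[i], is_outlier=True)
```

In this toy run the detector is most uncertain about the extreme point, and the user's "outlier" label pulls the threshold down so that point is detected in later rounds.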
A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets
The term "outlier" generally denotes an observation that differs significantly from the other values in a data set. Outliers may be instances of error or indicators of noteworthy events. Outlier detection aims at identifying such outliers in order to improve the analysis of data and to discover interesting and useful knowledge about unusual events within numerous application domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees for selecting the most suitable technique for a given data set. Furthermore, we highlight the advantages, disadvantages, and performance issues of each class of outlier detection techniques under this taxonomy framework.
Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications
Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and an uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning (CSL) method to deal with the classification of imperfect data. Typically, most traditional approaches to classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with the Support Vector Machine, a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures for tackling imperfect data, along with addressing real problems in quality control and business analytics.
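The essence of cost-sensitive learning can be sketched without any particular classifier: instead of a default 0.5 cutoff, pick the decision threshold that minimizes total misclassification cost, with a miss on the rare class costing more than a false alarm. The costs, scores, and labels below are invented for illustration and are not from the paper.

```python
# Hypothetical sketch: cost-sensitive threshold selection on classifier scores.

def total_cost(scores, labels, threshold, c_fn=5.0, c_fp=1.0):
    """Total cost of classifying score >= threshold as the positive (rare) class."""
    cost = 0.0
    for s, y in zip(scores, labels):
        pred = 1 if s >= threshold else 0
        if y == 1 and pred == 0:
            cost += c_fn          # missed rare-class sample (expensive)
        elif y == 0 and pred == 1:
            cost += c_fp          # false alarm (cheap)
    return cost

def best_threshold(scores, labels, c_fn=5.0, c_fp=1.0):
    """Search candidate thresholds (the observed scores) for minimum total cost."""
    candidates = sorted(set(scores)) + [max(scores) + 1.0]
    return min(candidates, key=lambda t: total_cost(scores, labels, t, c_fn, c_fp))

scores = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]
labels = [0,   0,   1,    0,   1,   1  ]
t = best_threshold(scores, labels)
```

Because false negatives cost five times as much as false positives here, the selected threshold sits lower than a cost-blind one would, accepting extra false alarms to avoid missing rare-class samples.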
One-class Classification: An Approach to Handle Class Imbalance in Multimodal Biometric Authentication
Biometric verification is the process of authenticating a person's identity using his/her physiological and behavioural characteristics. It is well-known that multimodal biometric systems can further improve the authentication accuracy by combining information from multiple biometric traits at various levels, namely sensor, feature, match score and decision levels. Fusion at match score level is generally preferred due to the trade-off between information availability and fusion complexity. However, combining match scores poses a number of challenges when treated as a two-class classification problem due to the highly imbalanced class distributions. Most conventional classifiers assume equally balanced classes. They do not work well when samples of one class vastly outnumber the samples of the other class. These challenges become even more significant when the fusion is based on user-specific processing due to the limited availability of the genuine samples per user. This thesis aims at exploring the paradigm of one-class classification to advance the classification performance of imbalanced biometric data sets. The contributions of the research can be enumerated as follows.
Firstly, a thorough investigation of the various one-class classifiers, including Gaussian Mixture Model, k-Nearest Neighbour, K-means clustering and Support Vector Data Description, has been provided. These classifiers are applied in learning the user-specific and user-independent descriptions for the biometric decision inference. It is demonstrated that the one-class classifiers are particularly useful in handling the imbalanced learning problem in multimodal biometric authentication. The user-specific approach is a better alternative to its user-independent counterpart because it is able to overcome the so-called within-class sub-concepts problem, which arises frequently in multimodal biometric systems due to the existence of user variation.
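The defining feature of one-class classification is that the model is fitted on target-class (genuine) samples only, then accepts or rejects new samples by how well they fit that description. A minimal sketch with a single Gaussian illustrates the idea; the scores, the 3-sigma rule, and all names here are illustrative assumptions, far simpler than the classifiers investigated in the thesis.

```python
# Hypothetical sketch of one-class classification with a single Gaussian,
# fitted on genuine match scores only.
import math

def fit_gaussian(samples):
    """Estimate mean and standard deviation from target-class data only."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, math.sqrt(var)

def accept(x, mean, std, k=3.0):
    """One-class decision: accept if x lies within k standard deviations."""
    return abs(x - mean) <= k * std

genuine = [0.81, 0.79, 0.83, 0.80, 0.77]   # genuine match scores only
mean, std = fit_gaussian(genuine)
accept(0.80, mean, std)   # genuine-looking score: accepted
accept(0.20, mean, std)   # impostor-looking score: rejected
```

No impostor samples were needed to train this decision rule, which is exactly what makes the one-class paradigm attractive when genuine samples per user are scarce but impostor data is even harder to characterize.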
Secondly, a novel adapted score fusion scheme that consists of one-class classifiers and is trained using both the genuine user and impostor samples has been proposed. This method also replaces the user-independent description with a user-specific one to learn the characteristics of the impostor class, thus reducing the degree of class imbalance in the data. Extensive experiments are conducted on the BioSecure DS2 and XM2VTS databases to illustrate the potential of the proposed adapted score fusion scheme, which provides a relative improvement in terms of Equal Error Rate of 32% and 20% as compared to the standard sum of scores and likelihood ratio based score fusion, respectively.
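For context, the standard sum-of-scores baseline mentioned above can be sketched in a few lines: each matcher's raw scores are min-max normalized to a common range and then summed per sample. The matcher names and score values below are invented for illustration.

```python
# Hypothetical sketch of standard sum-of-scores fusion at match score level.

def min_max(scores):
    """Min-max normalize a matcher's raw scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def sum_fusion(score_lists):
    """Fuse the per-matcher scores of each sample by a normalized sum."""
    normed = [min_max(scores) for scores in score_lists]
    return [sum(col) for col in zip(*normed)]

face = [10.0, 55.0, 100.0]   # matcher 1 raw scores for 3 samples (invented)
iris = [0.2, 0.9, 0.5]       # matcher 2 raw scores for the same samples
fused = sum_fusion([face, iris])   # higher fused score, more likely genuine
```

Normalization is what makes the sum meaningful: without it, the matcher with the larger numeric range would dominate the fused decision.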
Thirdly, a hybrid boosting algorithm, called r-ABOC, has been developed, which is capable of exploiting the natural capabilities of both the well-known Real AdaBoost and one-class classification to further improve the system performance without causing overfitting. However, unlike the conventional Real AdaBoost, the individual classifiers in the proposed scheme are trained on the same data set, but with different parameter choices. This not only generates high diversity, which is vital to the success of r-ABOC, but also reduces the number of user-specified parameters. A comprehensive empirical study using the BioSecure DS2 and XM2VTS databases demonstrates that r-ABOC may achieve a performance gain in terms of Half Total Error Rate of up to 28% with respect to other state-of-the-art biometric score fusion techniques.
Finally, a Robust Imputation based on Group Method of Data Handling (RIBG) has been proposed to handle the missing data problem in the BioSecure DS2 database. RIBG is able to provide accurate predictions of incomplete score vectors. It is observed to achieve a better performance with respect to the state-of-the-art imputation techniques, including mean, median and k-NN imputations. An important feature of RIBG is that it does not require any parameter fine-tuning, and hence, is amenable to immediate applications.
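Among the baselines RIBG is compared against, k-NN imputation is easy to sketch: a missing entry in a score vector is replaced by the average of that entry across the k most similar complete vectors. The data, the Euclidean distance over observed positions, and all names here are illustrative assumptions, not the thesis's setup.

```python
# Hypothetical sketch of k-NN imputation for score vectors with missing entries.
import math

def distance(a, b):
    """Euclidean distance over the positions observed in both vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)
                         if x is not None and y is not None))

def knn_impute(incomplete, complete_rows, k=2):
    """Fill None entries using the k nearest complete score vectors."""
    neighbours = sorted(complete_rows, key=lambda r: distance(incomplete, r))[:k]
    return [x if x is not None else sum(r[j] for r in neighbours) / k
            for j, x in enumerate(incomplete)]

rows = [[0.9, 0.8, 0.7], [0.2, 0.1, 0.3], [0.85, 0.75, 0.8]]
filled = knn_impute([0.9, None, 0.75], rows, k=2)   # gap filled from 2 nearest rows
```

Unlike mean or median imputation, this uses the local structure of similar score vectors, which matters when genuine and impostor scores occupy very different regions.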
A New Feature Selection Method Based on Class Association Rule
Feature selection is a key process for supervised learning algorithms. It involves discarding irrelevant attributes from the training dataset from which the models are derived. One of the vital feature selection approaches is Filtering, which often uses mathematical models to compute the relevance of each feature in the training dataset and then sorts the features into descending order based on their computed scores. However, most Filtering methods face several challenges, including, but not limited to, considering only feature-class correlation when defining a feature's relevance, and not recommending which subset of features to retain. Leaving this decision to the end-user may be impractical for multiple reasons such as the experience required in the application domain, care, accuracy, and time. In this research, we propose a new hybrid Filtering method called Class Association Rule Filter (CARF) that deals with the aforementioned issues by identifying relevant features through the Class Association Rule Mining approach and then using these rules to define weights for the available features in the training dataset. More crucially, we propose a new procedure based on mutual information within the CARF method which suggests the subset of features to be retained by the end-user, hence reducing time and effort. Empirical evaluation using small, medium, and large datasets that belong to various dissimilar domains reveals that CARF was able to reduce the dimensionality of the search space when contrasted with other common Filtering methods. More importantly, the classification models devised by the different machine learning algorithms against the subsets of features selected by CARF were highly competitive in terms of various performance measures. These results indeed reflect the quality of the subsets of features selected by CARF and show the impact of the new cut-off procedure proposed.
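The mutual information criterion underlying the cut-off procedure can be sketched for discrete features: each feature is scored by how much information it shares with the class label, and a perfectly predictive feature scores higher than an independent one. The tiny dataset and the implementation below are illustrative assumptions, not CARF itself.

```python
# Hypothetical sketch of mutual information scoring for discrete features.
import math
from collections import Counter

def mutual_information(feature, labels):
    """I(X; Y) in bits for two discrete sequences of equal length."""
    n = len(labels)
    px, py = Counter(feature), Counter(labels)
    pxy = Counter(zip(feature, labels))
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

labels   = [1, 1, 0, 0]
relevant = [1, 1, 0, 0]    # perfectly predicts the label
noise    = [0, 1, 0, 1]    # statistically independent of the label
mutual_information(relevant, labels)   # 1.0 bit
mutual_information(noise, labels)      # 0.0 bits
```

A filter built on such scores ranks `relevant` above `noise`; the harder part, which the paper addresses, is deciding where in the ranked list to cut.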
Graph based Anomaly Detection and Description: A Survey
Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have recently come into focus. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the 'why', of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field.
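A minimal example of the unsupervised, plain-graph setting the survey categorizes: score each node by how far its degree deviates from the average degree, so a structurally unusual node such as a hub stands out. Real graph anomaly detectors are far richer; the graph and the deviation rule here are purely illustrative.

```python
# Hypothetical sketch of a degree-deviation anomaly score on a plain graph.

def degrees(edges, nodes):
    """Compute the degree of each node from an undirected edge list."""
    deg = {v: 0 for v in nodes}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def anomaly_scores(edges, nodes):
    """Score each node by the absolute deviation of its degree from the mean."""
    deg = degrees(edges, nodes)
    mean = sum(deg.values()) / len(deg)
    return {v: abs(d - mean) for v, d in deg.items()}

nodes = ["a", "b", "c", "d", "hub"]
edges = [("a", "b"), ("c", "d"),
         ("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d")]
scores = anomaly_scores(edges, nodes)   # "hub" gets the largest score
```

This toy score also hints at the attribution question the survey emphasizes: the anomaly comes with a structural "why" (an unusually high degree) rather than just a flag.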