One-Class Classification: Taxonomy of Study and Review of Techniques
One-class classification (OCC) algorithms aim to build classification models
when the negative class is either absent, poorly sampled or not well defined.
This situation constrains the learning of efficient classifiers, because the
class boundary must be defined using only knowledge of the positive class. The OCC
problem has been considered and applied under many research themes, such as
outlier/novelty detection and concept learning. In this paper we present a
unified view of the general problem of OCC by presenting a taxonomy of study
for OCC problems, which is based on the availability of training data, the
algorithms used, and the application domains. We further delve into each
of the categories of the proposed taxonomy and present a comprehensive
literature review of the OCC algorithms, techniques and methodologies with a
focus on their significance, limitations and applications. We conclude our
paper by discussing some open research problems in the field of OCC and presenting
our vision for future research.
Comment: 24 pages + 11 pages of references, 8 figures
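To make the OCC setting concrete, the sketch below fits a boundary from positive samples only and then classifies new points against it. It is a deliberately simple illustration (a centroid plus a distance threshold), not one of the techniques reviewed in the paper; the names `fit_one_class`, `predict`, and the `quantile` parameter are assumptions introduced here.

```python
import math

def fit_one_class(positives, quantile=0.95):
    """Fit a centroid-and-radius boundary from positive samples only."""
    dim = len(positives[0])
    centroid = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    distances = sorted(math.dist(p, centroid) for p in positives)
    # Radius = distance that covers roughly `quantile` of the training positives.
    index = min(len(distances) - 1, int(quantile * len(distances)))
    return centroid, distances[index]

def predict(model, x):
    """Return True if x falls inside the learned boundary (positive class)."""
    centroid, radius = model
    return math.dist(x, centroid) <= radius

# Toy data (hypothetical): no negative examples are available at training time.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
model = fit_one_class(positives)
print(predict(model, (0.5, 0.6)))  # a point near the training data
print(predict(model, (5.0, 5.0)))  # a point far from the training data
```

The threshold is chosen purely from the positive class, which is the defining constraint of OCC described above.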
Cost-Quality Trade-Offs in One-Class Active Learning
Active learning is a paradigm for involving users in a machine learning process. The core idea is to ask a user to annotate specific observations in order to improve classification performance. One important application of active learning is detecting outliers, i.e., unusual observations that deviate from the regular ones in a data set. Applying active learning to outlier detection in practice requires designing a system that consists of several components: the data, the classifier that discerns between inliers and outliers, the query strategy that selects the observations for feedback collection, and an oracle, e.g., the human expert who annotates the queries. Each of these components, and their interplay, influences the classification quality. Naturally, cost budgets limit certain parts of the system, e.g., the number of queries one can ask a human. Thus, to configure efficient active learning systems, one must decide on several trade-offs between cost and quality. The existing literature on active learning systems provides neither an overview nor a formal description of these cost-quality trade-offs, which makes configuring efficient active learning systems difficult in practice.
In this thesis, we study different cost-quality trade-offs that are pivotal for configuring an active learning system for outlier detection. We first provide an overview of the costs of an active learning system. Then, we analyze three important trade-offs and propose ways to model and quantify them. In our first contribution, we study how one can reduce classifier training costs by training only on a sample of the data set. We formalize the sampling trade-off between classifier training costs and resulting quality as an optimization problem and propose an efficient algorithm to solve it. In contrast to existing sampling methods in the literature, our approach guarantees that a classifier trained on our sample makes the same predictions as one trained on the complete data set. We can therefore reduce classifier training costs without a loss of classification quality. In our second contribution, we investigate how selecting multiple queries at once allows trading off costs against quality. So-called batch queries reduce classifier training costs because the system updates the classifier only once per batch. However, the annotations in a batch may carry redundant information, which reduces the quality achievable with a fixed query budget. We are the first to consider batch queries for outlier detection, a generalization of the more common case of querying sequentially. We formalize batch active learning and propose several strategies to construct batches by modeling the expected utility of a batch. In our third contribution, we propose query synthesis for outlier detection. Query synthesis allows queries to be generated artificially at any point in the data space, without being restricted to a pool of query candidates. We propose a framework to efficiently synthesize queries and develop a novel query strategy that uses active learning to improve the generalization of a classifier beyond a biased data set.
For all contributions, we derive recommendations for the cost-quality trade-offs from formal investigations and empirical studies to facilitate the configuration of robust and efficient active learning systems for outlier detection.
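The batch-query trade-off described in this abstract (fewer classifier updates versus redundant annotations) can be illustrated with a greedy selection rule that rewards informative candidates and penalizes candidates close to ones already chosen. This is only a sketch under stated assumptions: the scoring rule, the `informativeness` scores, and the `redundancy_weight` parameter are hypothetical and are not the strategies proposed in the thesis.

```python
import math

def select_batch(candidates, informativeness, batch_size, redundancy_weight=1.0):
    """Greedily pick a batch that trades informativeness against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < batch_size:
        def score(i):
            if not selected:
                return informativeness[i]
            nearest = min(math.dist(candidates[i], candidates[j]) for j in selected)
            # exp(-distance) is in (0, 1]; values near 1 mean a near-duplicate query.
            return informativeness[i] - redundancy_weight * math.exp(-nearest)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy pool (hypothetical): candidates 0 and 1 are almost identical.
candidates = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
informativeness = [0.9, 0.8, 0.5]
batch = select_batch(candidates, informativeness, batch_size=2)
print(batch)
```

In this toy pool the second-most informative candidate is nearly a duplicate of the first, so the redundancy penalty steers the batch toward the distant candidate instead.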
DOPING: Generative Data Augmentation for Unsupervised Anomaly Detection with GAN
Recently, the introduction of the generative adversarial network (GAN) and
its variants has enabled the generation of realistic synthetic samples, which
have been used to enlarge training sets. Previous work has primarily focused on
data augmentation for semi-supervised and supervised tasks. In this paper, we
instead focus on unsupervised anomaly detection and propose a novel generative
data augmentation framework optimized for this task. In particular, we propose
to oversample infrequent normal samples - normal samples that occur with small
probability, e.g., rare normal events. We show that these samples are
responsible for false positives in anomaly detection. However, oversampling of
infrequent normal samples is challenging for real-world high-dimensional data
with multimodal distributions. To address this challenge, we propose to use a
GAN variant known as the adversarial autoencoder (AAE) to transform the
high-dimensional multimodal data distributions into low-dimensional unimodal
latent distributions with well-defined tail probability. Then, we
systematically oversample at the 'edge' of the latent distributions to increase
the density of infrequent normal samples. We show that our oversampling
pipeline is a unified one: it is generally applicable to datasets with
different complex data distributions. To the best of our knowledge, our method
is the first data augmentation technique focused on improving performance in
unsupervised anomaly detection. We validate our method by demonstrating
consistent improvements across several real-world datasets.
Comment: Published as a conference paper at ICDM 2018 (IEEE International
Conference on Data Mining)
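The idea of oversampling at the edge of a unimodal latent distribution can be sketched as rejection sampling in latent space. The sketch assumes a standard Gaussian latent prior and a norm band that defines the "edge"; both are assumptions made here for illustration, as are the function name and its parameters. The trained AAE decoder that would map these codes back to data space is not shown.

```python
import math
import random

def sample_edge_latents(n, dim, min_norm, max_norm, seed=0):
    """Rejection-sample codes from N(0, I) whose norm lies in [min_norm, max_norm]."""
    rng = random.Random(seed)
    samples = []
    while len(samples) < n:
        z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(v * v for v in z))
        # Keep only codes in the low-density band far from the mode.
        if min_norm <= norm <= max_norm:
            samples.append(z)
    return samples

# Hypothetical settings: a 2-D latent space and an edge band of 2 to 3 standard deviations.
edge_codes = sample_edge_latents(n=100, dim=2, min_norm=2.0, max_norm=3.0)
print(len(edge_codes))
```

Passing such codes through a decoder would yield synthetic samples that resemble infrequent normal data, which is the role the abstract assigns to the oversampling step.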
NNG-Mix: Improving Semi-supervised Anomaly Detection with Pseudo-anomaly Generation
Anomaly detection (AD) is essential in identifying rare and often critical
events in complex systems, finding applications in fields such as network
intrusion detection, financial fraud detection, and fault detection in
infrastructure and industrial systems. While AD is typically treated as an
unsupervised learning task due to the high cost of label annotation, it is more
practical to assume access to a small set of labeled anomaly samples from
domain experts, as is the case for semi-supervised anomaly detection.
Semi-supervised and supervised approaches can leverage such labeled data,
resulting in improved performance. In this paper, rather than proposing a new
semi-supervised or supervised approach for AD, we introduce a novel algorithm
for generating additional pseudo-anomalies on the basis of the limited labeled
anomalies and a large volume of unlabeled data. This serves as an augmentation
to facilitate the detection of new anomalies. Our proposed algorithm, named
Nearest Neighbor Gaussian Mixup (NNG-Mix), efficiently integrates information
from both labeled and unlabeled data to generate pseudo-anomalies. We compare
the performance of this novel algorithm with commonly applied augmentation
techniques, such as Mixup and Cutout. We evaluate NNG-Mix by training various
existing semi-supervised and supervised anomaly detection algorithms on the
original training data along with the generated pseudo-anomalies. Through
extensive experiments on 57 benchmark datasets in ADBench, reflecting different
data types, we demonstrate that NNG-Mix outperforms other data augmentation
methods. It yields significant performance improvements compared to the
baselines trained exclusively on the original training data. Notably, NNG-Mix
yields improvements of up to 16.4%, 8.8%, and 8.0% on Classical, CV, and NLP
datasets in ADBench, respectively. Our source code will be available at
https://github.com/donghao51/NNG-Mix
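As a rough illustration of combining nearest neighbors, Mixup, and Gaussian noise, the sketch below mixes a labeled anomaly with one of its nearest unlabeled neighbors and perturbs the result. It is a loose reading of the abstract, not the authors' algorithm (see the repository above for that); the function name, the mixing range, and the parameters `k` and `noise_std` are assumptions.

```python
import math
import random

def make_pseudo_anomaly(anomaly, unlabeled, k=2, noise_std=0.01, rng=None):
    """Mix a labeled anomaly with a nearby unlabeled point and add Gaussian noise."""
    rng = rng or random.Random(0)
    # Take the k unlabeled points closest to the labeled anomaly.
    neighbors = sorted(unlabeled, key=lambda u: math.dist(anomaly, u))[:k]
    partner = rng.choice(neighbors)
    # Assumption: keep the mixture closer to the anomaly than to the neighbor.
    lam = rng.uniform(0.5, 1.0)
    return [
        lam * a + (1.0 - lam) * p + rng.gauss(0.0, noise_std)
        for a, p in zip(anomaly, partner)
    ]

# Toy data (hypothetical): one labeled anomaly and three unlabeled points.
anomaly = (10.0, 10.0)
unlabeled = [(9.0, 9.0), (9.5, 9.5), (0.0, 0.0)]
pseudo = make_pseudo_anomaly(anomaly, unlabeled)
print(pseudo)
```

Restricting the partner to nearest neighbors keeps the generated point in the vicinity of the known anomaly, while the noise term adds variation around the interpolated point.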