Truncated Affinity Maximization: One-class Homophily Modeling for Graph Anomaly Detection
One prevalent property we find empirically in real-world graph anomaly
detection (GAD) datasets is one-class homophily, i.e., normal nodes tend to
have strong connections/affinity with each other, while the homophily in
abnormal nodes is significantly weaker than that in normal nodes. However, this
anomaly-discriminative property is ignored by existing GAD methods that are
typically built using a conventional anomaly detection objective, such as data
reconstruction. In this work, we explore this property to introduce a novel
unsupervised anomaly scoring measure for GAD -- local node affinity -- that
assigns a larger anomaly score to nodes that are less affiliated with their
neighbors, with the affinity defined as similarity on node
attributes/representations. We further propose Truncated Affinity Maximization
(TAM) that learns tailored node representations for our anomaly measure by
maximizing the local affinity of nodes to their neighbors. Optimizing on the
original graph structure can be biased by non-homophily edges (i.e., edges
connecting normal and abnormal nodes). Thus, TAM is instead optimized on
truncated graphs where non-homophily edges are removed iteratively to mitigate
this bias. The learned representations result in significantly stronger local
affinity for normal nodes than abnormal nodes. Extensive empirical results on
six real-world GAD datasets show that TAM substantially outperforms seven
competing models, achieving over 10% increase in AUROC/AUPRC compared to the
best contenders on challenging datasets. Our code will be made available at
https://github.com/mala-lab/TAM-master/. Comment: 19 pages, 9 figures
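A minimal sketch of the local node affinity measure described above, assuming affinity is cosine similarity on node representations (function and variable names here are illustrative, not from the TAM code):

```python
import numpy as np

def local_affinity_scores(X, adj):
    """Anomaly score = 1 - mean cosine similarity between a node's
    representation and those of its neighbors (higher = more anomalous).
    X: (n, d) node representations; adj: {node: [neighbor indices]}."""
    # L2-normalise rows so dot products are cosine similarities
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    scores = np.zeros(len(X))
    for i, nbrs in adj.items():
        if nbrs:
            scores[i] = 1.0 - Xn[nbrs].dot(Xn[i]).mean()
        else:
            scores[i] = 1.0  # isolated nodes get the maximal score
    return scores

# Tiny example: node 3's attributes differ sharply from its neighbour's
X = np.array([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0]])
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
s = local_affinity_scores(X, adj)
```

In this toy graph the weakly-affiliated node 3 receives the largest score, matching the intuition that abnormal nodes show weaker homophily; TAM additionally learns the representations and truncates non-homophily edges before scoring.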
Homophily Outlier Detection in Non-IID Categorical Data
Most of existing outlier detection methods assume that the outlier factors
(i.e., outlierness scoring measures) of data entities (e.g., feature values and
data objects) are Independent and Identically Distributed (IID). This
assumption does not hold in real-world applications where the outlierness of
different entities is dependent on each other and/or taken from different
probability distributions (non-IID). This may lead to the failure of detecting
important outliers that are too subtle to be identified without considering the
non-IID nature. The issue is even intensified in more challenging contexts,
e.g., high-dimensional data with many noisy features. This work introduces a
novel outlier detection framework and its two instances to identify outliers in
categorical data by capturing non-IID outlier factors. Our approach first
defines and incorporates distribution-sensitive outlier factors and their
interdependence into a value-value graph-based representation. It then models
an outlierness propagation process in the value graph to learn the outlierness
of feature values. The learned value outlierness allows for either direct
outlier detection or outlying feature selection. The graph representation and
mining approach is employed to capture the rich non-IID
characteristics. Our empirical results on 15 real-world data sets with
different levels of data complexities show that (i) the proposed outlier
detection methods significantly outperform five state-of-the-art methods at the
95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most
complex data sets; and (ii) the proposed feature selection methods
significantly outperform three competing methods in enabling subsequent outlier
detection of two different existing detectors. Comment: To appear in Data Mining and Knowledge Discovery Journal
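The outlierness propagation step can be sketched as a random-walk-style update over a value-value graph; the update rule below is a generic hypothetical variant (the paper's exact formulation may differ), with all names illustrative:

```python
import numpy as np

def propagate_outlierness(W, init, alpha=0.85, iters=100):
    """Random-walk-style propagation of initial value outlierness over a
    value-value graph. W: (m, m) nonnegative value-coupling matrix;
    init: (m,) distribution-sensitive initial outlier factors."""
    # Column-normalise W so each value distributes its outlierness
    P = W / np.maximum(W.sum(axis=0, keepdims=True), 1e-12)
    base = init / init.sum()
    r = base.copy()
    for _ in range(iters):
        # Blend propagated outlierness with the initial factors
        r = alpha * P.dot(r) + (1 - alpha) * base
    return r

# Three feature values; value 2 looks rare initially but is only
# weakly coupled to the other values
W = np.array([[0.0, 1.0, 0.1],
              [1.0, 0.0, 0.1],
              [0.1, 0.1, 0.0]])
init = np.array([1.0, 1.0, 3.0])
r = propagate_outlierness(W, init)
```

The propagation redistributes outlierness according to the value couplings rather than treating each value's factor as IID, which is the key departure from conventional scoring described in the abstract.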
Anomaly Detection under Distribution Shift
Anomaly detection (AD) is a crucial machine learning task that aims to learn
patterns from a set of normal training samples to identify abnormal samples in
test data. Most existing AD studies assume that the training and test data are
drawn from the same data distribution, but the test data can have large
distribution shifts arising in many real-world applications due to different
natural variations such as new lighting conditions, object poses, or background
appearances, rendering existing AD methods ineffective in such cases. In this
paper, we consider the problem of anomaly detection under distribution shift
and establish performance benchmarks on four widely-used AD and
out-of-distribution (OOD) generalization datasets. We demonstrate that simple
adaptation of state-of-the-art OOD generalization methods to AD settings fails
to work effectively due to the lack of labeled anomaly data. We further
introduce a novel robust AD approach to diverse distribution shifts by
minimizing the distribution gap between in-distribution and OOD normal samples
in both the training and inference stages in an unsupervised way. Our extensive
empirical results on the four datasets show that our approach substantially
outperforms state-of-the-art AD methods and OOD generalization methods on data
with various distribution shifts, while maintaining the detection accuracy on
in-distribution data. Code and data are available at
https://github.com/mala-lab/ADShift. Comment: Accepted at ICCV 2023
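As a toy illustration of minimising the distribution gap between in-distribution and OOD normal samples, one simple (hypothetical, not the paper's actual loss) objective aligns channel-wise feature statistics:

```python
import numpy as np

def distribution_gap(feat_id, feat_ood):
    """Toy distribution-alignment objective: squared distance between
    channel-wise means and standard deviations of in-distribution and
    (simulated) OOD normal features. Minimising it pulls the two
    feature distributions together."""
    mu_gap = np.mean(feat_id, axis=0) - np.mean(feat_ood, axis=0)
    sd_gap = np.std(feat_id, axis=0) - np.std(feat_ood, axis=0)
    return float((mu_gap ** 2).sum() + (sd_gap ** 2).sum())

rng = np.random.default_rng(0)
feat_id = rng.normal(0.0, 1.0, size=(256, 8))
# Simulated shift, e.g. new lighting conditions changing feature statistics
feat_shifted = rng.normal(1.5, 2.0, size=(256, 8))
# Statistic alignment (standardise the shifted features per channel)
feat_aligned = (feat_shifted - feat_shifted.mean(0)) / feat_shifted.std(0)

gap_before = distribution_gap(feat_id, feat_shifted)
gap_after = distribution_gap(feat_id, feat_aligned)
```

Crucially, this alignment uses only normal samples and no anomaly labels, matching the unsupervised setting the abstract emphasises.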
Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection
Open-set supervised anomaly detection (OSAD) - a recently emerging anomaly
detection area - aims at utilizing a few samples of anomaly classes seen during
training to detect unseen anomalies (i.e., samples from open-set anomaly
classes), while effectively identifying the seen anomalies. Benefiting from the
prior knowledge illustrated by the seen anomalies, current OSAD methods can
often largely reduce false positive errors. However, these methods treat the
anomaly examples as from a homogeneous distribution, rendering them less
effective in generalizing to unseen anomalies that can be drawn from any
distribution. In this paper, we propose to learn heterogeneous anomaly
distributions using the limited anomaly examples to address this issue. To this
end, we introduce a novel approach, namely Anomaly Heterogeneity Learning
(AHL), that simulates a diverse set of heterogeneous (seen and unseen) anomaly
distributions and then utilizes them to learn a unified heterogeneous
abnormality model. Further, AHL is a generic framework that existing OSAD
models can plug and play for enhancing their abnormality modeling. Extensive
experiments on nine real-world anomaly detection datasets show that AHL can 1)
substantially enhance different state-of-the-art (SOTA) OSAD models in
detecting both seen and unseen anomalies, achieving new SOTA performance on a
large set of datasets, and 2) effectively generalize to unseen anomalies in new
target domains. Comment: 18 pages, 5 figures
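The first step AHL describes, simulating a diverse set of anomaly distributions from limited examples, might be sketched as follows (a hypothetical illustration; the construction details in the paper may differ):

```python
import numpy as np

def simulate_heterogeneous_sets(normals, anomalies, n_sets=3, seed=0):
    """Split the few seen anomalies into random subsets and pair each
    with a bootstrap sample of the normal data, yielding several
    surrogate (heterogeneous) anomaly distributions to learn from."""
    rng = np.random.default_rng(seed)
    sets = []
    for _ in range(n_sets):
        a_idx = rng.choice(len(anomalies), size=max(1, len(anomalies) // 2),
                           replace=False)
        n_idx = rng.choice(len(normals), size=len(normals), replace=True)
        sets.append((normals[n_idx], anomalies[a_idx]))
    return sets

normals = np.random.default_rng(1).normal(0, 1, size=(100, 4))
anomalies = np.random.default_rng(2).normal(4, 1, size=(8, 4))
surrogates = simulate_heterogeneous_sets(normals, anomalies, n_sets=3)
```

A detector trained on each surrogate set, with the results unified into one abnormality model, avoids the homogeneous-distribution assumption criticised in the abstract.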
Deep Weakly-supervised Anomaly Detection
Anomaly detection is typically posited as an unsupervised learning task in
the literature due to the prohibitive cost and difficulty to obtain large-scale
labeled anomaly data, but this ignores the fact that a very small number
(e.g., a few dozen) of labeled anomalies can often be made available with
small/trivial cost in many real-world anomaly detection applications. To
leverage such labeled anomaly data, we study an important anomaly detection
problem termed weakly-supervised anomaly detection, in which, in addition to a
large amount of unlabeled data, a limited number of labeled anomalies are
available during modeling. Learning with the small labeled anomaly data enables
anomaly-informed modeling, which helps identify anomalies of interest and
address the notorious high false positives in unsupervised anomaly detection.
However, the problem is especially challenging, since (i) the limited amount of
labeled anomaly data often, if not always, cannot cover all types of anomalies
and (ii) the unlabeled data is often dominated by normal instances but has
anomaly contamination. We address the problem by formulating it as a pairwise
relation prediction task. Particularly, our approach defines a two-stream
ordinal regression neural network to learn the relation of randomly sampled
instance pairs, i.e., whether the instance pair contains two labeled anomalies,
one labeled anomaly, or just unlabeled data instances. The resulting model
effectively leverages both the labeled and unlabeled data to substantially
augment the training data and learn well-generalized representations of
normality and abnormality. Comprehensive empirical results on 40 real-world
datasets show that our approach (i) significantly outperforms four
state-of-the-art methods in detecting both of the known and previously unseen
anomalies and (ii) is substantially more data-efficient. Comment: Theoretical results are refined and extended. Significantly more empirical results are added, including results on detecting previously unknown anomalies.
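The pairwise relation prediction task can be sketched by the training-pair construction: each pair gets an ordinal target depending on whether it contains two labeled anomalies, one, or none. The target values 2/1/0 and all names below are illustrative, not necessarily the paper's:

```python
import numpy as np

def sample_relation_pairs(unlabeled, anomalies, n_pairs=6, seed=0):
    """Build (x_i, x_j, ordinal target) training pairs: target 2 for
    anomaly-anomaly, 1 for anomaly-unlabeled, 0 for
    unlabeled-unlabeled pairs."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n_pairs):
        kind = rng.integers(3)
        if kind == 2:    # two labeled anomalies
            i, j = rng.choice(len(anomalies), 2)
            pairs.append((anomalies[i], anomalies[j], 2))
        elif kind == 1:  # one labeled anomaly, one unlabeled instance
            i = rng.integers(len(anomalies))
            j = rng.integers(len(unlabeled))
            pairs.append((anomalies[i], unlabeled[j], 1))
        else:            # two unlabeled instances
            i, j = rng.choice(len(unlabeled), 2)
            pairs.append((unlabeled[i], unlabeled[j], 0))
    return pairs

U = np.zeros((50, 3))  # unlabeled (mostly normal) instances
A = np.ones((5, 3))    # the few labeled anomalies
pairs = sample_relation_pairs(U, A, n_pairs=10)
```

A two-stream network regressing these ordinal targets then scores a test instance by pairing it with labeled anomalies and unlabeled data and averaging the predicted relation values, which is how the pairwise formulation augments the scarce labeled data.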
Open-Set Graph Anomaly Detection via Normal Structure Regularisation
This paper considers an under-explored Graph Anomaly Detection (GAD) task,
namely open-set GAD, which aims to detect anomalous nodes using a small number
of labelled training normal and anomaly nodes (known as seen anomalies) that
cannot illustrate all possible inference-time abnormalities. The task has
attracted growing attention due to the availability of anomaly prior knowledge
from the label information that can help to substantially reduce detection
errors. However, current methods tend to over-emphasise fitting the seen
anomalies, leading to a weak generalisation ability to detect unseen anomalies,
i.e., those that are not illustrated by the labelled anomaly nodes. Further,
they were designed to handle Euclidean data, failing to effectively capture
important non-Euclidean features for GAD. In this work, we propose a novel
open-set GAD approach, namely normal structure regularisation (NSReg), to
leverage the rich normal graph structure embedded in the labelled nodes to
tackle the aforementioned two issues. In particular, NSReg trains an
anomaly-discriminative supervised graph anomaly detector, with a plug-and-play
regularisation term to enforce compact, semantically-rich representations of
normal nodes. To this end, the regularisation is designed to differentiate
various types of normal nodes, including labelled normal nodes that are
connected in their local neighbourhood, and those that are not connected. By
doing so, it helps incorporate strong normality into the supervised anomaly
detector learning, mitigating their overfitting to the seen anomalies.
Extensive empirical results on real-world datasets demonstrate the superiority
of our proposed NSReg for open-set GAD.
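The node grouping that the regularisation term relies on, labelled normal nodes that are connected to another labelled normal in their local neighbourhood versus those that are not, can be sketched as below (the regularisation loss itself is omitted; names are illustrative):

```python
def split_normal_nodes(labelled_normals, adj):
    """Partition labelled normal nodes by whether their neighbourhood
    contains another labelled normal node. NSReg's regulariser would
    treat the two groups differently to enforce compact, semantically
    rich normal representations."""
    normal_set = set(labelled_normals)
    connected, unconnected = [], []
    for v in labelled_normals:
        if any(u in normal_set for u in adj.get(v, [])):
            connected.append(v)
        else:
            unconnected.append(v)
    return connected, unconnected

# Nodes 0, 1, 2 are labelled normal; node 5 is unlabelled
adj = {0: [1, 5], 1: [0], 2: [5], 5: [0, 2]}
connected, unconnected = split_normal_nodes([0, 1, 2], adj)
```

Distinguishing these two kinds of normal nodes is what lets the plug-and-play term inject graph-structural normality into the supervised detector, rather than relying on labels alone.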