Active Anomaly Detection via Ensembles: Insights, Algorithms, and Interpretability
The anomaly detection (AD) task is to identify the true anomalies in a given
set of data instances. AD algorithms score the data instances and
produce a ranked list of candidate anomalies, which are then analyzed by a
human to discover the true anomalies. However, this process can be laborious
for the human analyst when the number of false-positives is very high.
Therefore, in many real-world AD applications including computer security and
fraud prevention, the anomaly detector must be configurable by the human
analyst so as to minimize the effort wasted on false positives.
In this paper, we study the problem of active learning to automatically tune
an ensemble of anomaly detectors to maximize the number of true anomalies
discovered. We make four main contributions towards this goal. First, we
present an important insight that explains the practical successes of AD
ensembles and how ensembles are naturally suited for active learning. Second,
we present several algorithms for active learning with tree-based AD ensembles.
These algorithms help us to improve the diversity of discovered anomalies,
generate rule sets for improved interpretability of anomalous instances, and
adapt to streaming data settings in a principled manner. Third, we present a
novel algorithm called GLocalized Anomaly Detection (GLAD) for active learning
with generic AD ensembles. GLAD allows end-users to retain the use of simple
and understandable global anomaly detectors by automatically learning their
local relevance to specific data instances using label feedback. Fourth, we
present extensive experiments to evaluate our insights and algorithms. Our
results show that in addition to discovering significantly more anomalies than
state-of-the-art unsupervised baselines, our active learning algorithms under
the streaming-data setup are competitive with the batch setup.
Comment: 47 pages including appendix; code is available at
https://github.com/shubhomoydas/ad_examples. arXiv admin note: substantial
text overlap with arXiv:1809.0647
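The active-tuning idea above can be sketched as a feedback loop: combine member scores with learned weights, query the analyst on the top-ranked unlabeled instance, and re-weight members according to whether they ranked the labeled instance correctly. The multiplicative update below is an illustrative assumption, not the paper's actual AAD weight-update rule:

```python
def active_ad_loop(member_scores, true_labels, budget):
    """Toy active anomaly detection loop over an ensemble.

    member_scores: per-member score lists, shape [m][n]; higher = more anomalous.
    true_labels:   1 = anomaly, 0 = nominal (stands in for the analyst).
    Returns the number of true anomalies discovered within `budget` queries.
    """
    m, n = len(member_scores), len(member_scores[0])
    weights = [1.0 / m] * m            # start from a uniform ensemble
    queried, found = set(), 0
    for _ in range(budget):
        combined = [sum(weights[i] * member_scores[i][j] for i in range(m))
                    for j in range(n)]
        # query the analyst on the top-ranked unlabeled instance
        j = max((j for j in range(n) if j not in queried),
                key=lambda j: combined[j])
        queried.add(j)
        if true_labels[j] == 1:
            found += 1
            for i in range(m):         # boost members that ranked it highly
                weights[i] *= 1.0 + member_scores[i][j]
        else:
            for i in range(m):         # penalize members that ranked it highly
                weights[i] /= 1.0 + member_scores[i][j]
        total = sum(weights)
        weights = [w / total for w in weights]   # renormalize
    return found
```

With an informative member and a misleading one, feedback quickly shifts weight toward the informative member, so later queries find more true anomalies.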
Meta-AAD: Active Anomaly Detection with Deep Reinforcement Learning
A high false-positive rate is a long-standing challenge for anomaly detection
algorithms, especially in high-stakes applications. To identify the true
anomalies, in practice, analysts or domain experts will be employed to
investigate the top instances one by one in a ranked list of anomalies
identified by an anomaly detection system. This verification procedure
generates informative labels that can be leveraged to re-rank the anomalies so
as to help the analyst to discover more true anomalies given a time budget.
Some re-ranking strategies have been proposed to approximate the above
sequential decision process. Specifically, existing strategies have been
focused on making the top instances more likely to be anomalous based on the
feedback. Then they greedily select the top-1 instance for query. However,
these greedy strategies could be sub-optimal since some low-ranked instances
could be more helpful in the long-term. In this work, we propose Active Anomaly
Detection with Meta-Policy (Meta-AAD), a novel framework that learns a
meta-policy for query selection. Specifically, Meta-AAD leverages deep
reinforcement learning to train the meta-policy to select the most appropriate
instance, explicitly optimizing the number of discovered anomalies throughout
the querying process. Meta-AAD is easy to deploy since a trained meta-policy
can be directly applied to any new dataset without further tuning. Extensive
experiments on 24 benchmark datasets demonstrate that Meta-AAD significantly
outperforms the state-of-the-art re-ranking strategies and the unsupervised
baseline. The empirical analysis shows that the trained meta-policy is
transferable and inherently achieves a balance between long-term and short-term
rewards.
Comment: Accepted by ICDM 202
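The sequential decision process described above can be made concrete as a tiny episodic environment plus the greedy top-1 baseline the paper improves on. Everything here is a hypothetical toy (`QueryEnv`, `greedy_policy` are made-up names); Meta-AAD itself trains a deep RL meta-policy over transferable features rather than acting greedily on raw scores:

```python
class QueryEnv:
    """Toy episodic formulation of the query-selection process:
    each step queries one instance, and the reward is 1 iff the
    analyst confirms it as a true anomaly."""

    def __init__(self, labels, budget):
        self.labels = labels      # ground truth standing in for the analyst
        self.budget = budget      # number of queries the analyst will answer
        self.queried = set()

    def step(self, action):
        self.queried.add(action)
        reward = int(self.labels[action] == 1)
        done = len(self.queried) >= self.budget
        return reward, done


def greedy_policy(scores, env):
    """The greedy top-1 baseline: always query the highest-scored
    instance that has not been queried yet."""
    total, done = 0, False
    while not done:
        action = max((j for j in range(len(scores)) if j not in env.queried),
                     key=lambda j: scores[j])
        reward, done = env.step(action)
        total += reward
    return total
```

A learned meta-policy replaces `greedy_policy` and may deliberately query lower-ranked instances when that yields more discovered anomalies over the whole episode.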
Industry Practice of Coverage-Guided Enterprise-Level DBMS Fuzzing
As an infrastructure for data persistence and analysis, Database Management
Systems (DBMSs) are the cornerstones of modern enterprise software. To improve
their correctness, the industry has been applying blackbox fuzzing for decades.
Recently, the research community achieved impressive fuzzing gains using
coverage guidance. However, due to the complexity and distributed nature of
enterprise-level DBMSs, these research advances have seldom been applied in
industry.
In this paper, we apply coverage-guided fuzzing to enterprise-level DBMSs
from Huawei and Bloomberg LP. In our practice of testing GaussDB and Comdb2, we
found major challenges in all three testing stages. The challenges are
collecting precise coverage, optimizing fuzzing performance, and analyzing root
causes. In search of a general method to overcome these challenges, we propose
Ratel, a coverage-guided fuzzer for enterprise-level DBMSs. With its
industry-oriented design, Ratel improves the feedback precision, enhances the
robustness of input generation, and performs an on-line investigation on the
root cause of bugs. As a result, Ratel outperformed other fuzzers in terms of
coverage and bugs. Compared to the industrial black-box fuzzers SQLsmith and
SQLancer, as well as the coverage-guided academic fuzzer Squirrel, Ratel
covered 38.38%, 106.14%, and 583.05% more basic blocks than the best results
of the other three fuzzers in GaussDB, PostgreSQL, and Comdb2, respectively.
More importantly,
Ratel has discovered 32, 42, and 5 previously unknown bugs in GaussDB, Comdb2,
and PostgreSQL, respectively.
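At its core, coverage guidance keeps any mutated input that reaches previously unseen basic blocks. The loop below is a generic, minimal sketch with placeholder `get_coverage` and `mutate` callbacks; Ratel's contribution is making each stage — precise coverage collection, robust input generation, and online root-cause analysis — work for distributed enterprise-level DBMSs:

```python
import random


def coverage_guided_fuzz(seed_corpus, get_coverage, mutate, iterations, rng):
    """Minimal coverage-guided fuzzing loop: a mutated input joins the
    corpus only if it reaches coverage (e.g. basic blocks) not seen before."""
    corpus = list(seed_corpus)
    seen = set()
    for s in corpus:                  # baseline coverage of the seeds
        seen |= get_coverage(s)
    for _ in range(iterations):
        parent = rng.choice(corpus)   # pick a seed to mutate
        child = mutate(parent, rng)
        cov = get_coverage(child)
        if cov - seen:                # child reached new blocks: keep it
            corpus.append(child)
            seen |= cov
    return corpus, seen
```

In a real setup, `get_coverage` would execute the DBMS on a SQL input under instrumentation and return the basic blocks hit; here any set-valued stand-in demonstrates the corpus-growth mechanic.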
Improve black-box sequential anomaly detector relevancy with limited user feedback
Anomaly detectors are often designed to catch statistical anomalies.
End-users are typically not interested in all of the detected outliers, but
only those relevant to their application. Given an existing black-box
sequential anomaly detector, this paper proposes a method to improve its user
relevancy using a small amount of human feedback. As our first contribution,
the method is agnostic to the detector: it only assumes access to its anomaly
scores, without requiring any additional information about its internals.
Inspired by the fact that anomalies are of different types, our approach
identifies these types and uses user feedback to assign a relevancy to each
type. This relevancy
score, as our second contribution, is used to adjust the subsequent anomaly
selection process. Empirical results on synthetic and real-world datasets show
that our approach yields significant improvements in precision and recall over
a range of anomaly detectors.
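The type-level relevancy idea can be sketched as: group flagged instances into types, estimate a smoothed relevancy per type from the few labels available, and multiply each anomaly score by its type's relevancy before re-ranking. The smoothing and the multiplicative adjustment below are illustrative assumptions, not the paper's exact procedure:

```python
from collections import defaultdict


def type_relevancy(anomaly_types, feedback, prior=0.5, strength=1.0):
    """Smoothed per-type relevancy from limited feedback.
    anomaly_types: type label of each instance.
    feedback: dict mapping instance index -> 1 (relevant) or 0 (not)."""
    pos, tot = defaultdict(float), defaultdict(float)
    for j, y in feedback.items():
        t = anomaly_types[j]
        pos[t] += y
        tot[t] += 1
    # unseen types fall back to the prior relevancy
    return {t: (pos[t] + strength * prior) / (tot[t] + strength)
            for t in set(anomaly_types)}


def rerank(scores, anomaly_types, relevancy):
    """Re-rank instances by relevancy-adjusted anomaly score."""
    adjusted = [s * relevancy[t] for s, t in zip(scores, anomaly_types)]
    return sorted(range(len(scores)), key=lambda j: -adjusted[j])
```

A single negative label on one type and a single positive label on another is already enough to push the positively-labeled type's instances to the top of the list.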
Recent Research Advances on Interactive Machine Learning
Interactive Machine Learning (IML) is an iterative learning process that
tightly couples a human with a machine learner and is widely used by
researchers and practitioners to effectively solve a wide variety of real-world
application problems. Although recent years have witnessed the proliferation of
IML in the field of visual analytics, most recent surveys either focus on a
specific area of IML or aim to summarize a visualization field that is too
generic for IML. In this paper, we systematically review the recent literature
on IML and classify it under a task-oriented taxonomy that we construct. We
conclude the survey with a discussion of open challenges and research
opportunities that we believe are inspiring for future work in IML.
Machine Learning for Fraud Detection in E-Commerce: A Research Agenda
Fraud detection and prevention play an important part in ensuring the
sustained operation of any e-commerce business. Machine learning (ML) often
plays a central role in these anti-fraud operations, but the organizational
context in which these ML models operate cannot be ignored. In this paper, we
take an organization-centric view on the topic of fraud detection by
formulating an operational model of the anti-fraud departments in e-commerce
organizations. We derive 6 research topics and 12 practical challenges for
fraud detection from this operational model. We summarize the state of the
literature for each research topic, discuss potential solutions to the
practical challenges, and identify 22 open research challenges.
Comment: Accepted and to appear in the proceedings of the KDD 2021 co-located
workshop: the 2nd International Workshop on Deployable Machine Learning for
Security Defense (MLHat)
Feature Encoding with AutoEncoders for Weakly-supervised Anomaly Detection
Weakly-supervised anomaly detection aims at learning an anomaly detector from
a limited amount of labeled data and abundant unlabeled data. Recent works
build deep neural networks for anomaly detection by discriminatively mapping
the normal samples and abnormal samples to different regions in the feature
space or fitting different distributions. However, due to the limited number of
annotated anomaly samples, directly training networks with the discriminative
loss may not be sufficient. To overcome this issue, this paper proposes a novel
strategy to transform the input data into a more meaningful representation that
could be used for anomaly detection. Specifically, we leverage an autoencoder
to encode the input data and utilize three factors, hidden representation,
reconstruction residual vector, and reconstruction error, as the new
representation for the input data. This representation amounts to encoding a
test sample by its projection onto the training-data manifold, its direction
toward that projection, and its distance from it. In addition to this encoding,
we
also propose a novel network architecture to seamlessly incorporate those three
factors. From our extensive experiments, the benefits of the proposed strategy
are clearly demonstrated by its superior performance over the competitive
methods.
Comment: 12 pages, 4 figures; published in IEEE Transactions on Neural Networks
and Learning Systems, 2021, DOI: 10.1109/TNNLS.2021.308613
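The three-factor representation can be written down directly: given a trained encoder/decoder pair, concatenate the hidden representation, the reconstruction residual vector, and the reconstruction error. The toy `encode`/`decode` in the test are hypothetical stand-ins; a real autoencoder would be learned from data:

```python
import math


def three_factor_encoding(x, encode, decode):
    """Concatenate the three factors for one sample x:
    - hidden representation h (projection onto the training manifold),
    - residual vector x - decode(h) (direction toward the projection),
    - reconstruction error ||x - decode(h)|| (distance to the projection)."""
    h = encode(x)
    x_hat = decode(h)
    residual = [xi - ri for xi, ri in zip(x, x_hat)]
    error = math.sqrt(sum(r * r for r in residual))
    return h + residual + [error]
```

An anomaly detector is then trained on this enriched representation instead of the raw input, so samples far from the training manifold are easier to separate.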
The Art of Social Bots: A Review and a Refined Taxonomy
Social bots represent a new generation of bots that make use of online social
networks (OSNs) as a command and control (C&C) channel. Malicious social bots
have been responsible for launching large-scale spam campaigns, promoting
low-cap stocks, manipulating users' digital influence, and conducting political
astroturfing. This paper presents a detailed review of current social bots and
of the techniques that let them fly under the radar of OSN defences and remain
undetected for long periods of time. We also suggest a refined taxonomy of
detection approaches from a social-network perspective, along with commonly
used datasets and their corresponding findings. Our study can help OSN
administrators and researchers understand the destructive potential of
malicious social bots and can improve current defence strategies against
them.
MacroBase: Prioritizing Attention in Fast Data
As data volumes continue to rise, manual inspection is becoming increasingly
untenable. In response, we present MacroBase, a data analytics engine that
prioritizes end-user attention in high-volume fast data streams. MacroBase
enables efficient, accurate, and modular analyses that highlight and aggregate
important and unusual behavior, acting as a search engine for fast data.
MacroBase is able to deliver order-of-magnitude speedups over alternatives by
optimizing the combination of explanation and classification tasks and by
leveraging a new reservoir sampler and heavy-hitters sketch specialized for
fast data streams. As a result, MacroBase delivers accurate results at speeds
of up to 2M events per second per query on a single core. The system has
delivered meaningful results in production, including at a telematics company
monitoring hundreds of thousands of vehicles.
Comment: SIGMOD 201
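One ingredient mentioned above — the heavy-hitters sketch — can be illustrated with a classic space-saving summary: keep at most k counters and, when full, evict the minimum-count item, letting the newcomer inherit its count. This is a textbook sketch for intuition only; MacroBase's actual structure is specialized for fast data streams:

```python
def space_saving(stream, k):
    """Space-saving heavy-hitters summary: tracks at most k candidate
    frequent items in a single pass with bounded memory."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            # evict the minimum-count entry; newcomer inherits its count
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts
```

Counts are overestimates for rare items, but any item occurring more than n/k times in a stream of length n is guaranteed to remain in the summary — exactly what an attention-prioritizing engine needs.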
Exploiting Epistemic Uncertainty of Anatomy Segmentation for Anomaly Detection in Retinal OCT
Diagnosis and treatment guidance are aided by detecting relevant biomarkers
in medical images. Although supervised deep learning can perform accurate
segmentation of pathological areas, it is limited by requiring a priori
definitions of these regions, large-scale annotations, and a representative
patient cohort in the training set. In contrast, anomaly detection is not
limited to specific definitions of pathologies and allows for training on
healthy samples without annotation. Anomalous regions can then serve as
candidates for biomarker discovery. Knowledge about normal anatomical structure
brings implicit information for detecting anomalies. We propose to take
advantage of this property using Bayesian deep learning, based on the
assumption that epistemic uncertainties will correlate with anatomical
deviations from a normal training set. A Bayesian U-Net is trained on a
well-defined healthy environment using weak labels of healthy anatomy produced
by existing methods. At test time, we capture epistemic uncertainty estimates
of our model using Monte Carlo dropout. A novel post-processing technique is
then applied to exploit these estimates and transfer their layered appearance
to smooth blob-shaped segmentations of the anomalies. We experimentally
validated this approach in retinal optical coherence tomography (OCT) images,
using weak labels of retinal layers. Our method achieved a Dice index of 0.789
in an independent anomaly test set of age-related macular degeneration (AMD)
cases. The resulting segmentations allowed very high accuracy for separating
healthy and diseased cases with late wet AMD, dry geographic atrophy (GA),
diabetic macular edema (DME) and retinal vein occlusion (RVO). Finally, we
qualitatively observed that our approach can also detect other deviations in
normal scans such as cut edge artifacts.
Comment: Accepted for publication in IEEE Transactions on Medical Imaging,
201
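The test-time procedure — several stochastic forward passes with dropout left on, per-output variance taken as epistemic uncertainty — can be sketched independently of any particular network. The `predict` callback below is a stand-in for a Bayesian U-Net forward pass with dropout; real per-pixel uncertainty maps would come from the segmentation model:

```python
import random


def mc_dropout_uncertainty(predict, x, passes, rng):
    """Monte Carlo dropout at test time: run `passes` stochastic forward
    passes and use the per-output variance as the epistemic-uncertainty
    estimate (the per-output mean serves as the prediction)."""
    samples = [predict(x, rng) for _ in range(passes)]
    d = len(samples[0])
    mean = [sum(s[i] for s in samples) / passes for i in range(d)]
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / passes
           for i in range(d)]
    return mean, var
```

Outputs the model is certain about show near-zero variance across passes, while outputs affected by dropped units — the proxy for epistemic uncertainty — show high variance, which the post-processing step then turns into anomaly segmentations.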