Active Anomaly Detection via Ensembles: Insights, Algorithms, and Interpretability
The anomaly detection (AD) task is to identify the true anomalies in a given
set of data instances. AD algorithms score the data instances and
produce a ranked list of candidate anomalies, which are then analyzed by a
human to discover the true anomalies. However, this process can be laborious
for the human analyst when the number of false-positives is very high.
Therefore, in many real-world AD applications including computer security and
fraud prevention, the anomaly detector must be configurable by the human
analyst so as to minimize the effort wasted on false positives.
In this paper, we study the problem of active learning to automatically tune
an ensemble of anomaly detectors to maximize the number of true anomalies
discovered. We make four main contributions towards this goal. First, we
present an important insight that explains the practical successes of AD
ensembles and how ensembles are naturally suited for active learning. Second,
we present several algorithms for active learning with tree-based AD ensembles.
These algorithms help us to improve the diversity of discovered anomalies,
generate rule sets for improved interpretability of anomalous instances, and
adapt to streaming data settings in a principled manner. Third, we present a
novel algorithm called GLocalized Anomaly Detection (GLAD) for active learning
with generic AD ensembles. GLAD allows end-users to retain the use of simple
and understandable global anomaly detectors by automatically learning their
local relevance to specific data instances using label feedback. Fourth, we
present extensive experiments to evaluate our insights and algorithms. Our
results show that in addition to discovering significantly more anomalies than
state-of-the-art unsupervised baselines, our active learning algorithms under
the streaming-data setup are competitive with the batch setup.
Comment: 47 pages including appendix; code is available at
https://github.com/shubhomoydas/ad_examples. arXiv admin note: substantial
text overlap with arXiv:1809.0647
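The active-tuning idea above can be sketched as a feedback loop: combine member scores with learned weights, query the analyst on the top-ranked unlabeled instance, and re-weight members according to whether they ranked the labeled instance correctly. The multiplicative update below is an illustrative assumption, not the paper's actual AAD weight-update rule:

```python
def active_ad_loop(member_scores, true_labels, budget):
    """Toy active anomaly detection loop over an ensemble.

    member_scores: per-member score lists, shape [m][n]; higher = more anomalous.
    true_labels:   1 = anomaly, 0 = nominal (stands in for the analyst).
    Returns the number of true anomalies discovered within `budget` queries.
    """
    m, n = len(member_scores), len(member_scores[0])
    weights = [1.0 / m] * m            # start from a uniform ensemble
    queried, found = set(), 0
    for _ in range(budget):
        combined = [sum(weights[i] * member_scores[i][j] for i in range(m))
                    for j in range(n)]
        # query the analyst on the top-ranked unlabeled instance
        j = max((j for j in range(n) if j not in queried),
                key=lambda j: combined[j])
        queried.add(j)
        if true_labels[j] == 1:
            found += 1
            for i in range(m):         # boost members that ranked it highly
                weights[i] *= 1.0 + member_scores[i][j]
        else:
            for i in range(m):         # penalize members that ranked it highly
                weights[i] /= 1.0 + member_scores[i][j]
        total = sum(weights)
        weights = [w / total for w in weights]   # renormalize
    return found
```

With an informative member and a misleading one, feedback quickly shifts weight toward the informative member, so later queries find more true anomalies.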
Meta-AAD: Active Anomaly Detection with Deep Reinforcement Learning
A high false-positive rate is a long-standing challenge for anomaly detection
algorithms, especially in high-stakes applications. To identify the true
anomalies, in practice, analysts or domain experts will be employed to
investigate the top instances one by one in a ranked list of anomalies
identified by an anomaly detection system. This verification procedure
generates informative labels that can be leveraged to re-rank the anomalies so
as to help the analyst to discover more true anomalies given a time budget.
Some re-ranking strategies have been proposed to approximate the above
sequential decision process. Specifically, existing strategies have been
focused on making the top instances more likely to be anomalous based on the
feedback. Then they greedily select the top-1 instance for query. However,
these greedy strategies could be sub-optimal since some low-ranked instances
could be more helpful in the long-term. In this work, we propose Active Anomaly
Detection with Meta-Policy (Meta-AAD), a novel framework that learns a
meta-policy for query selection. Specifically, Meta-AAD leverages deep
reinforcement learning to train the meta-policy to select the most appropriate
instance, explicitly optimizing the number of discovered anomalies throughout
the querying process. Meta-AAD is easy to deploy since a trained meta-policy
can be directly applied to any new dataset without further tuning. Extensive
experiments on 24 benchmark datasets demonstrate that Meta-AAD significantly
outperforms the state-of-the-art re-ranking strategies and the unsupervised
baseline. The empirical analysis shows that the trained meta-policy is
transferable and inherently achieves a balance between long-term and short-term
rewards.
Comment: Accepted by ICDM 202
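The sequential decision process described above can be made concrete as a tiny episodic environment plus the greedy top-1 baseline the paper improves on. Everything here is a hypothetical toy (`QueryEnv`, `greedy_policy` are made-up names); Meta-AAD itself trains a deep RL meta-policy over transferable features rather than acting greedily on raw scores:

```python
class QueryEnv:
    """Toy episodic formulation of the query-selection process:
    each step queries one instance, and the reward is 1 iff the
    analyst confirms it as a true anomaly."""

    def __init__(self, labels, budget):
        self.labels = labels      # ground truth standing in for the analyst
        self.budget = budget      # number of queries the analyst will answer
        self.queried = set()

    def step(self, action):
        self.queried.add(action)
        reward = int(self.labels[action] == 1)
        done = len(self.queried) >= self.budget
        return reward, done


def greedy_policy(scores, env):
    """The greedy top-1 baseline: always query the highest-scored
    instance that has not been queried yet."""
    total, done = 0, False
    while not done:
        action = max((j for j in range(len(scores)) if j not in env.queried),
                     key=lambda j: scores[j])
        reward, done = env.step(action)
        total += reward
    return total
```

A learned meta-policy replaces `greedy_policy` and may deliberately query lower-ranked instances when that yields more discovered anomalies over the whole episode.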
Industry Practice of Coverage-Guided Enterprise-Level DBMS Fuzzing
As an infrastructure for data persistence and analysis, Database Management
Systems (DBMSs) are the cornerstones of modern enterprise software. To improve
their correctness, the industry has been applying blackbox fuzzing for decades.
Recently, the research community achieved impressive fuzzing gains using
coverage guidance. However, due to the complexity and distributed nature of
enterprise-level DBMSs, these research advances have seldom been applied in
industry.
In this paper, we apply coverage-guided fuzzing to enterprise-level DBMSs
from Huawei and Bloomberg LP. In our practice of testing GaussDB and Comdb2, we
found major challenges in all three testing stages. The challenges are
collecting precise coverage, optimizing fuzzing performance, and analyzing root
causes. In search of a general method to overcome these challenges, we propose
Ratel, a coverage-guided fuzzer for enterprise-level DBMSs. With its
industry-oriented design, Ratel improves the feedback precision, enhances the
robustness of input generation, and performs an on-line investigation on the
root cause of bugs. As a result, Ratel outperformed other fuzzers in terms of
coverage and bugs. Compared to the industrial black-box fuzzers SQLsmith and
SQLancer, as well as the coverage-guided academic fuzzer Squirrel, Ratel
covered 38.38%, 106.14%, and 583.05% more basic blocks than the best results
of the other three fuzzers in GaussDB, PostgreSQL, and Comdb2, respectively.
More importantly,
Ratel has discovered 32, 42, and 5 previously unknown bugs in GaussDB, Comdb2,
and PostgreSQL, respectively.
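At its core, coverage guidance keeps any mutated input that reaches previously unseen basic blocks. The loop below is a generic, minimal sketch with placeholder `get_coverage` and `mutate` callbacks; Ratel's contribution is making each stage — precise coverage collection, robust input generation, and online root-cause analysis — work for distributed enterprise-level DBMSs:

```python
import random


def coverage_guided_fuzz(seed_corpus, get_coverage, mutate, iterations, rng):
    """Minimal coverage-guided fuzzing loop: a mutated input joins the
    corpus only if it reaches coverage (e.g. basic blocks) not seen before."""
    corpus = list(seed_corpus)
    seen = set()
    for s in corpus:                  # baseline coverage of the seeds
        seen |= get_coverage(s)
    for _ in range(iterations):
        parent = rng.choice(corpus)   # pick a seed to mutate
        child = mutate(parent, rng)
        cov = get_coverage(child)
        if cov - seen:                # child reached new blocks: keep it
            corpus.append(child)
            seen |= cov
    return corpus, seen
```

In a real setup, `get_coverage` would execute the DBMS on a SQL input under instrumentation and return the basic blocks hit; here any set-valued stand-in demonstrates the corpus-growth mechanic.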
Improve black-box sequential anomaly detector relevancy with limited user feedback
Anomaly detectors are often designed to catch statistical anomalies.
End-users are typically not interested in all of the detected outliers, but
only those relevant to their application. Given an existing black-box
sequential anomaly detector, this paper proposes a method to improve its user
relevancy using a small amount of human feedback. As our first contribution,
the method is agnostic to the detector: it only assumes access to its anomaly
scores, without requiring any additional information about its internals.
Inspired by the fact that anomalies are of different types, our approach
identifies these types and uses user feedback to assign a relevancy to each
type. This relevancy
score, as our second contribution, is used to adjust the subsequent anomaly
selection process. Empirical results on synthetic and real-world datasets show
that our approach yields significant improvements in precision and recall over
a range of anomaly detectors.
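The type-level relevancy idea can be sketched as: group flagged instances into types, estimate a smoothed relevancy per type from the few labels available, and multiply each anomaly score by its type's relevancy before re-ranking. The smoothing and the multiplicative adjustment below are illustrative assumptions, not the paper's exact procedure:

```python
from collections import defaultdict


def type_relevancy(anomaly_types, feedback, prior=0.5, strength=1.0):
    """Smoothed per-type relevancy from limited feedback.
    anomaly_types: type label of each instance.
    feedback: dict mapping instance index -> 1 (relevant) or 0 (not)."""
    pos, tot = defaultdict(float), defaultdict(float)
    for j, y in feedback.items():
        t = anomaly_types[j]
        pos[t] += y
        tot[t] += 1
    # unseen types fall back to the prior relevancy
    return {t: (pos[t] + strength * prior) / (tot[t] + strength)
            for t in set(anomaly_types)}


def rerank(scores, anomaly_types, relevancy):
    """Re-rank instances by relevancy-adjusted anomaly score."""
    adjusted = [s * relevancy[t] for s, t in zip(scores, anomaly_types)]
    return sorted(range(len(scores)), key=lambda j: -adjusted[j])
```

A single negative label on one type and a single positive label on another is already enough to push the positively-labeled type's instances to the top of the list.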
Recent Research Advances on Interactive Machine Learning
Interactive Machine Learning (IML) is an iterative learning process that
tightly couples a human with a machine learner and is widely used by
researchers and practitioners to effectively solve a wide variety of real-world
application problems. Although recent years have witnessed the proliferation of
IML in the field of visual analytics, most recent surveys either focus on a
specific area of IML or aim to summarize a visualization field that is too
generic for IML. In this paper, we systematically review the recent literature
on IML and classify it under a task-oriented taxonomy that we construct. We
conclude the survey with a discussion of open challenges and research
opportunities that we believe are inspiring for future work in IML.
Machine Learning for Fraud Detection in E-Commerce: A Research Agenda
Fraud detection and prevention play an important part in ensuring the
sustained operation of any e-commerce business. Machine learning (ML) often
plays a central role in these anti-fraud operations, but the organizational
context in which these ML models operate cannot be ignored. In this paper, we
take an organization-centric view on the topic of fraud detection by
formulating an operational model of the anti-fraud departments in e-commerce
organizations. We derive 6 research topics and 12 practical challenges for
fraud detection from this operational model. We summarize the state of the
literature for each research topic, discuss potential solutions to the
practical challenges, and identify 22 open research challenges.
Comment: Accepted and to appear in the proceedings of the KDD 2021 co-located
workshop: the 2nd International Workshop on Deployable Machine Learning for
Security Defense (MLHat)
Feature Encoding with AutoEncoders for Weakly-supervised Anomaly Detection
Weakly-supervised anomaly detection aims at learning an anomaly detector from
a limited amount of labeled data and abundant unlabeled data. Recent works
build deep neural networks for anomaly detection by discriminatively mapping
the normal samples and abnormal samples to different regions in the feature
space or fitting different distributions. However, due to the limited number of
annotated anomaly samples, directly training networks with the discriminative
loss may not be sufficient. To overcome this issue, this paper proposes a novel
strategy to transform the input data into a more meaningful representation that
could be used for anomaly detection. Specifically, we leverage an autoencoder
to encode the input data and utilize three factors, hidden representation,
reconstruction residual vector, and reconstruction error, as the new
representation for the input data. This representation amounts to encoding a
test sample by its projection onto the training-data manifold, its direction
toward that projection, and its distance from it. In addition to this encoding,
we
also propose a novel network architecture to seamlessly incorporate those three
factors. From our extensive experiments, the benefits of the proposed strategy
are clearly demonstrated by its superior performance over the competitive
methods.
Comment: 12 pages, 4 figures; published in IEEE Transactions on Neural Networks
and Learning Systems, 2021, DOI: 10.1109/TNNLS.2021.308613
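The three-factor representation can be written down directly: given a trained encoder/decoder pair, concatenate the hidden representation, the reconstruction residual vector, and the reconstruction error. The toy `encode`/`decode` in the test are hypothetical stand-ins; a real autoencoder would be learned from data:

```python
import math


def three_factor_encoding(x, encode, decode):
    """Concatenate the three factors for one sample x:
    - hidden representation h (projection onto the training manifold),
    - residual vector x - decode(h) (direction toward the projection),
    - reconstruction error ||x - decode(h)|| (distance to the projection)."""
    h = encode(x)
    x_hat = decode(h)
    residual = [xi - ri for xi, ri in zip(x, x_hat)]
    error = math.sqrt(sum(r * r for r in residual))
    return h + residual + [error]
```

An anomaly detector is then trained on this enriched representation instead of the raw input, so samples far from the training manifold are easier to separate.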
The Art of Social Bots: A Review and a Refined Taxonomy
Social bots represent a new generation of bots that make use of online social
networks (OSNs) as a command and control (C&C) channel. Malicious social bots
have been responsible for launching large-scale spam campaigns, promoting
low-cap stocks, manipulating users' digital influence, and conducting political
astroturfing. This paper presents a detailed review of current social bots and
of the techniques that let them fly under the radar of OSN defences and remain
undetected for long periods of time. We also suggest a refined taxonomy of
detection approaches from a social-network perspective, along with commonly
used datasets and their corresponding findings. Our study can help OSN
administrators and researchers understand the destructive potential of
malicious social bots and can improve current defence strategies against
them.
MacroBase: Prioritizing Attention in Fast Data
As data volumes continue to rise, manual inspection is becoming increasingly
untenable. In response, we present MacroBase, a data analytics engine that
prioritizes end-user attention in high-volume fast data streams. MacroBase
enables efficient, accurate, and modular analyses that highlight and aggregate
important and unusual behavior, acting as a search engine for fast data.
MacroBase is able to deliver order-of-magnitude speedups over alternatives by
optimizing the combination of explanation and classification tasks and by
leveraging a new reservoir sampler and heavy-hitters sketch specialized for
fast data streams. As a result, MacroBase delivers accurate results at speeds
of up to 2M events per second per query on a single core. The system has
delivered meaningful results in production, including at a telematics company
monitoring hundreds of thousands of vehicles.
Comment: SIGMOD 201
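One ingredient mentioned above — the heavy-hitters sketch — can be illustrated with a classic space-saving summary: keep at most k counters and, when full, evict the minimum-count item, letting the newcomer inherit its count. This is a textbook sketch for intuition only; MacroBase's actual structure is specialized for fast data streams:

```python
def space_saving(stream, k):
    """Space-saving heavy-hitters summary: tracks at most k candidate
    frequent items in a single pass with bounded memory."""
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            # evict the minimum-count entry; newcomer inherits its count
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts
```

Counts are overestimates for rare items, but any item occurring more than n/k times in a stream of length n is guaranteed to remain in the summary — exactly what an attention-prioritizing engine needs.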
Exploiting Epistemic Uncertainty of Anatomy Segmentation for Anomaly Detection in Retinal OCT
Diagnosis and treatment guidance are aided by detecting relevant biomarkers
in medical images. Although supervised deep learning can perform accurate
segmentation of pathological areas, it is limited by requiring a priori
definitions of these regions, large-scale annotations, and a representative
patient cohort in the training set. In contrast, anomaly detection is not
limited to specific definitions of pathologies and allows for training on
healthy samples without annotation. Anomalous regions can then serve as
candidates for biomarker discovery. Knowledge about normal anatomical structure
brings implicit information for detecting anomalies. We propose to take
advantage of this property using Bayesian deep learning, based on the
assumption that epistemic uncertainties will correlate with anatomical
deviations from a normal training set. A Bayesian U-Net is trained on a
well-defined healthy environment using weak labels of healthy anatomy produced
by existing methods. At test time, we capture epistemic uncertainty estimates
of our model using Monte Carlo dropout. A novel post-processing technique is
then applied to exploit these estimates and transfer their layered appearance
to smooth blob-shaped segmentations of the anomalies. We experimentally
validated this approach in retinal optical coherence tomography (OCT) images,
using weak labels of retinal layers. Our method achieved a Dice index of 0.789
in an independent anomaly test set of age-related macular degeneration (AMD)
cases. The resulting segmentations allowed very high accuracy for separating
healthy and diseased cases with late wet AMD, dry geographic atrophy (GA),
diabetic macular edema (DME) and retinal vein occlusion (RVO). Finally, we
qualitatively observed that our approach can also detect other deviations in
normal scans such as cut edge artifacts.
Comment: Accepted for publication in IEEE Transactions on Medical Imaging,
201
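The test-time procedure — several stochastic forward passes with dropout left on, per-output variance taken as epistemic uncertainty — can be sketched independently of any particular network. The `predict` callback below is a stand-in for a Bayesian U-Net forward pass with dropout; real per-pixel uncertainty maps would come from the segmentation model:

```python
import random


def mc_dropout_uncertainty(predict, x, passes, rng):
    """Monte Carlo dropout at test time: run `passes` stochastic forward
    passes and use the per-output variance as the epistemic-uncertainty
    estimate (the per-output mean serves as the prediction)."""
    samples = [predict(x, rng) for _ in range(passes)]
    d = len(samples[0])
    mean = [sum(s[i] for s in samples) / passes for i in range(d)]
    var = [sum((s[i] - mean[i]) ** 2 for s in samples) / passes
           for i in range(d)]
    return mean, var
```

Outputs the model is certain about show near-zero variance across passes, while outputs affected by dropped units — the proxy for epistemic uncertainty — show high variance, which the post-processing step then turns into anomaly segmentations.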