Homophily Outlier Detection in Non-IID Categorical Data
Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are independent and identically distributed (IID). This assumption does not hold in real-world applications, where the outlierness of different entities is dependent on each other and/or drawn from different probability distributions (non-IID). This can cause important outliers to be missed because they are too subtle to identify without considering the non-IID nature of the data. The issue is further intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and two of its instances to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines distribution-sensitive outlier factors and incorporates them and their interdependence into a value-value graph-based representation. It then models an outlierness propagation process on the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection by two different existing detectors.
Comment: To appear in Data Mining and Knowledge Discovery Journal
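To make the value-graph idea concrete, below is a minimal sketch of outlierness propagation on a value-value co-occurrence graph. It illustrates only the general flavor of such graph-based approaches; the paper's actual distribution-sensitive outlier factors and propagation model are more sophisticated, and all names and parameters here (including the PageRank-style damping factor alpha and the frequency-based seed) are illustrative assumptions.

```python
# Minimal sketch of value-graph outlierness propagation for categorical data.
# General idea only; NOT the paper's exact outlier factors or model.
import numpy as np

def value_outlierness(X, alpha=0.85, iters=50):
    """X: 2D array of categorical values (n_objects x n_features)."""
    # Enumerate distinct (feature, value) pairs as graph nodes.
    values = sorted({(j, v) for row in X for j, v in enumerate(row)})
    idx = {fv: i for i, fv in enumerate(values)}
    n = len(values)

    # Adjacency: values co-occurring in the same object are connected.
    A = np.zeros((n, n))
    for row in X:
        ids = [idx[(j, v)] for j, v in enumerate(row)]
        for a in ids:
            for b in ids:
                if a != b:
                    A[a, b] += 1.0

    # Seed score: rarer values get a higher initial outlier factor.
    counts = A.sum(axis=1) + 1.0
    seed = (1.0 / counts) / (1.0 / counts).sum()

    # Column-normalize and propagate (personalized-PageRank style).
    P = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)
    s = seed.copy()
    for _ in range(iters):
        s = alpha * P @ s + (1 - alpha) * seed
    return {fv: s[i] for fv, i in idx.items()}

# Object outlierness = sum of its values' learned outlierness.
X = np.array([["a", "x"], ["a", "x"], ["a", "y"], ["b", "z"]])
vo = value_outlierness(X)
scores = [sum(vo[(j, v)] for j, v in enumerate(row)) for row in X]
```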
Outlier Detection Ensemble with Embedded Feature Selection
Feature selection plays an important role in improving the performance of outlier detection, especially for noisy data. Existing methods usually perform feature selection and outlier scoring separately, which can select feature subsets that do not optimally serve outlier detection, leading to unsatisfactory performance. In this paper, we propose an outlier detection ensemble framework with embedded feature selection (ODEFS) to address this issue. Specifically, for each random sub-sampling based learning component, ODEFS unifies feature selection and outlier detection into a pairwise ranking formulation to learn feature subsets that are tailored to the outlier detection method. Moreover, we adopt thresholded self-paced learning to simultaneously optimize feature selection and example selection, which helps improve the reliability of the training set. We then design an alternating algorithm with proven convergence to solve the resulting optimization problem. In addition, we analyze the generalization error bound of the proposed framework, which provides a theoretical guarantee for the method as well as insightful practical guidance. Comprehensive experimental results on 12 real-world datasets from diverse domains validate the superiority of the proposed ODEFS.
Comment: 10 pages, AAAI 2020
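As a rough illustration of what embedding feature selection in a pairwise ranking formulation can look like, the sketch below learns a sparse, nonnegative feature-weight vector so that presumed outliers rank above presumed inliers. This is a heavily simplified, assumption-laden sketch, not ODEFS itself: the ensemble structure and the thresholded self-paced example selection are omitted, and the linear score, hinge loss, and hyperparameters are illustrative.

```python
# Illustrative sketch only: a pairwise-ranking objective with an embedded
# L1-regularized feature-weight vector, loosely inspired by ODEFS.
import numpy as np

def fit_feature_weights(X, out_idx, in_idx, lam=0.01, lr=0.1, epochs=200):
    """X: data matrix; out_idx/in_idx: indices presumed outlier/inlier."""
    rng = np.random.default_rng(0)
    n, d = X.shape
    w = np.abs(rng.normal(size=d)) * 0.01  # nonnegative feature weights

    for _ in range(epochs):
        i = rng.choice(out_idx)
        j = rng.choice(in_idx)
        # Linear outlier score under the current feature weighting.
        si, sj = X[i] @ w, X[j] @ w
        # Hinge-style pairwise loss: the outlier should outscore the inlier.
        if si - sj < 1.0:
            w += lr * (X[i] - X[j])   # gradient step on the ranking margin
        w -= lr * lam * np.sign(w)    # L1 shrinkage -> sparse selection
        w = np.maximum(w, 0.0)        # keep weights nonnegative
    return w  # large entries mark features useful for outlier ranking
```

Features whose final weight is near zero would simply be dropped before running any downstream detector, which is the sense in which the selection is "embedded" in the ranking objective.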
Streaming Active Learning Strategies for Real-Life Credit Card Fraud Detection: Assessment and Visualization
Credit card fraud detection is a very challenging problem because of the specific nature of transaction data and the labeling process. The transaction data are peculiar because they arrive in a streaming fashion, are strongly imbalanced, and are prone to non-stationarity. The labeling is the outcome of an active learning process, as every day human investigators contact only a small number of cardholders (associated with the riskiest transactions) and obtain the class (fraud or genuine) of the related transactions. An adequate selection of the set of cardholders is therefore crucial for an efficient fraud detection process. In this paper, we present a number of active learning strategies and investigate their fraud detection accuracy. We compare different criteria (supervised, semi-supervised and unsupervised) for querying unlabeled transactions. Finally, we highlight the existence of an exploitation/exploration trade-off for active learning in the context of fraud detection, which has so far been overlooked in the literature.
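A minimal sketch of the exploitation/exploration trade-off in daily querying might look as follows. The investigator budget, the classifier, and the mixing ratio are assumptions for illustration, not the exact strategies assessed in the paper.

```python
# Hedged sketch of daily active-learning queries for fraud detection.
# Budget, model, and exploration fraction are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def select_queries(model, X_day, budget=100, explore_frac=0.2):
    """Pick transactions for investigators: mostly the riskiest ones
    (exploitation), plus a random slice (exploration) to counter the
    sampling bias that pure risk-based querying induces."""
    rng = np.random.default_rng(0)
    p_fraud = model.predict_proba(X_day)[:, 1]
    n_explore = int(budget * explore_frac)
    exploit = np.argsort(-p_fraud)[: budget - n_explore]
    rest = np.setdiff1d(np.arange(len(X_day)), exploit)
    explore = rng.choice(rest, size=min(n_explore, len(rest)), replace=False)
    return np.concatenate([exploit, explore])

# Daily loop: retrain on labels obtained so far, query, collect new labels.
# model = RandomForestClassifier().fit(X_labeled, y_labeled)
# queried = select_queries(model, X_today)
```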
Unsupervised Heterogeneous Coupling Learning for Categorical Representation
Complex categorical data is often hierarchically coupled, with heterogeneous relationships between attributes and attribute values and couplings between objects. Such value-to-object couplings are heterogeneous, with complementary and inconsistent interactions and distributions. The limited existing research on representing unlabeled categorical data ignores these heterogeneous and hierarchical couplings, underestimates data characteristics and complexities, and overuses redundant information. Deep representation learning of unlabeled categorical data is also challenging: it overlooks such value-to-object couplings and their complementarity and inconsistency, and it requires large data, disentanglement, and high computational power. This work introduces a shallow but powerful UNsupervised heTerogeneous couplIng lEarning (UNTIE) approach for representing coupled categorical data by untying the interactions between couplings and revealing the heterogeneous distributions embedded in each type of coupling. UNTIE is efficiently optimized w.r.t. a kernel k-means objective function for unsupervised representation learning of heterogeneous and hierarchical value-to-object couplings. Theoretical analysis shows that UNTIE can represent categorical data with maximal separability while effectively representing heterogeneous couplings and disclosing their roles in categorical data. The UNTIE-learned representations yield significant performance improvements over state-of-the-art categorical representations and deep representation models on 25 categorical data sets with diversified characteristics.
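To give a concrete feel for value-to-object couplings, the sketch below encodes each categorical value by an intra-attribute coupling (its frequency) and inter-attribute couplings (conditional co-occurrence distributions with the values of other attributes), then represents objects by concatenating their values' descriptors. This is only a hedged illustration of coupling-based encoding; UNTIE's actual representation and its kernel k-means optimization are considerably richer.

```python
# Illustrative coupling-based categorical encoding (NOT UNTIE's actual
# method): each value gets a frequency feature plus its conditional
# co-occurrence distribution with every other attribute.
import numpy as np

def coupling_encode(X):
    """X: (n_objects, n_features) array of categorical values."""
    n, d = X.shape
    cols = []
    for j in range(d):
        vals, inv = np.unique(X[:, j], return_inverse=True)
        # Intra-attribute coupling: value frequency.
        feats = [np.bincount(inv) [inv] / n]
        # Inter-attribute couplings: P(value_k | value_j), k != j.
        for k in range(d):
            if k == j:
                continue
            vk, invk = np.unique(X[:, k], return_inverse=True)
            joint = np.zeros((len(vals), len(vk)))
            np.add.at(joint, (inv, invk), 1.0)
            cond = joint / joint.sum(axis=1, keepdims=True)
            feats.append(cond[inv])  # each object gets its value's row
        cols.append(np.column_stack(feats))
    return np.hstack(cols)  # feed to (kernel) k-means for clustering
```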
A Survey on Explainable Anomaly Detection
In the past two decades, most research on anomaly detection has focused on
improving the accuracy of the detection, while largely ignoring the
explainability of the corresponding methods and thus leaving the explanation of
outcomes to practitioners. As anomaly detection algorithms are increasingly
used in safety-critical domains, providing explanations for the high-stakes
decisions made in those domains has become an ethical and regulatory
requirement. Therefore, this work provides a comprehensive and structured
survey on state-of-the-art explainable anomaly detection techniques. We propose
a taxonomy based on the main aspects that characterize each explainable anomaly
detection technique, aiming to help practitioners and researchers find the
explainable anomaly detection method that best suits their needs.Comment: Paper accepted by the ACM Transactions on Knowledge Discovery from
Data (TKDD) for publication (preprint version
ADBench: Anomaly Detection Benchmark
Given a long list of anomaly detection algorithms developed in the last few
decades, how do they perform with regard to (i) varying levels of supervision,
(ii) different types of anomalies, and (iii) noisy and corrupted data? In this
work, we answer these key questions by conducting (to our best knowledge) the
most comprehensive anomaly detection benchmark with 30 algorithms on 57
benchmark datasets, named ADBench. Our extensive experiments (98,436 in total)
identify meaningful insights into the role of supervision and anomaly types,
and unlock future directions for researchers in algorithm selection and design.
With ADBench, researchers can easily conduct comprehensive and fair evaluations
for newly proposed methods on the datasets (including our contributed ones from
natural language and computer vision domains) against the existing baselines.
To foster accessibility and reproducibility, we fully open-source ADBench and
the corresponding results.
Comment: NeurIPS 2022. All authors contribute equally and are listed alphabetically. Code available at https://github.com/Minqi824/ADBench
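For readers new to such benchmarks, a generic detector-by-dataset evaluation loop looks roughly like the following. This is not ADBench's own interface (see the linked repository for that); the detector, toy data, and AUC metric are illustrative.

```python
# Generic AUC evaluation loop in the spirit of anomaly detection
# benchmarks; NOT ADBench's API.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import roc_auc_score

def evaluate(detectors, datasets):
    """detectors: name -> factory; datasets: name -> (X, y), y=1 = anomaly."""
    results = {}
    for data_name, (X, y) in datasets.items():
        for det_name, make in detectors.items():
            det = make().fit(X)
            # Higher should mean more anomalous; score_samples is higher
            # for inliers, so negate it.
            scores = -det.score_samples(X)
            results[(data_name, det_name)] = roc_auc_score(y, scores)
    return results

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(95, 2)), rng.normal(5, 1, size=(5, 2))])
y = np.array([0] * 95 + [1] * 5)
res = evaluate({"IForest": lambda: IsolationForest(random_state=0)},
               {"toy": (X, y)})
```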
Empowering Responsible Use of Large Language Models
The rapid advancement of powerful Large Language Models (LLMs), such as ChatGPT and Llama, has revolutionized the world by bringing new creative possibilities and enhancing productivity. However, these advancements also pose significant challenges and risks, including the potential for misuse in the form of fake news, academic dishonesty, intellectual property infringements, and privacy leaks. In response to these concerns, this thesis explores approaches to promoting the responsible use of LLMs from both theoretical and empirical perspectives. Three key approaches are presented: (1) Detecting AI-generated text via watermarking: we propose a robust and high-quality watermarking method called Unigram-Watermark and introduce a rigorous theoretical framework to quantify the effectiveness and robustness of LLM watermarks; furthermore, we propose PF-Watermark, which achieves the best balance of high detection accuracy and low perplexity. (2) Protecting the intellectual property of LLMs: we safeguard the intellectual property of LLMs through novel watermarking techniques designed to prevent model-stealing attacks in both text classification and text generation tasks. (3) Privacy-preserving LLMs: we employ Confidential Redacted Training (CRT) to train and fine-tune language generation models while protecting sensitive information. In summary, we propose a suite of algorithms and solutions to address the pressing safety, security, and privacy concerns around LLMs. We hope our studies provide valuable insights for researchers exploring future directions that promote responsible AI development and deployment.
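As background on how green-list watermark detection typically works (the family to which Unigram-Watermark belongs), the sketch below flags text whose tokens are over-represented in a fixed, pseudo-randomly chosen "green" fraction of the vocabulary, using a one-proportion z-test. The vocabulary size, gamma, seed, and threshold are illustrative assumptions, not the thesis's exact configuration.

```python
# Hedged sketch of green-list watermark detection via a one-proportion
# z-test; parameters are illustrative, not the thesis's configuration.
import math
import numpy as np

VOCAB_SIZE, GAMMA, SEED = 50_000, 0.5, 42

def green_list(vocab_size=VOCAB_SIZE, gamma=GAMMA, seed=SEED):
    """Fixed (prompt-independent) partition of the vocabulary, as in
    unigram-style schemes; generation boosts green-token logits."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(vocab_size)
    return set(perm[: int(gamma * vocab_size)].tolist())

def detect(token_ids, z_threshold=4.0):
    """Flag text as watermarked if green tokens are over-represented
    relative to the expected fraction gamma."""
    green = green_list()
    t = len(token_ids)
    n_green = sum(tok in green for tok in token_ids)
    z = (n_green - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))
    return z, z > z_threshold
```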