279 research outputs found

    Homophily Outlier Detection in Non-IID Categorical Data

    Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications, where the outlierness of different entities is interdependent and/or drawn from different probability distributions (non-IID). As a result, important outliers that are too subtle to identify without considering the non-IID nature may go undetected. The issue is intensified in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and two instances of it that identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines distribution-sensitive outlier factors and incorporates them, together with their interdependence, into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection by two different existing detectors. Comment: To appear in the Data Mining and Knowledge Discovery Journal.
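The idea of propagating outlierness over a value-value graph can be illustrated with a small sketch. This is not the method from the paper: the initial scores (rarity relative to the feature's mode), the edge weights (co-occurrence counts biased toward outlying neighbours), the damped random walk, and the names `value_outlierness` and `object_scores` are all assumptions chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def value_outlierness(rows, damping=0.85, iters=100):
    """Toy outlierness propagation over a value-value graph (illustrative only)."""
    n_features = len(rows[0])
    # Nodes are (feature index, value) pairs with their frequencies.
    freq = Counter((j, row[j]) for row in rows for j in range(n_features))
    nodes = list(freq)

    # Assumed initial outlierness: rarity relative to the feature's most
    # frequent value, plus a small constant so no score is exactly zero.
    mode = Counter()
    for (j, _), count in freq.items():
        mode[j] = max(mode[j], count)
    init = {v: 1.0 - freq[v] / mode[v[0]] + 1e-6 for v in nodes}
    total = sum(init.values())
    init = {v: s / total for v, s in init.items()}

    # Assumed edge weights: co-occurrence counts between values of different
    # features, biased toward neighbours that already look outlying.
    cooc = Counter()
    for row in rows:
        for i, j in combinations(range(n_features), 2):
            cooc[(i, row[i]), (j, row[j])] += 1
            cooc[(j, row[j]), (i, row[i])] += 1
    out = {u: {} for u in nodes}
    for (u, v), count in cooc.items():
        out[u][v] = count * init[v]

    # Damped random walk that restarts according to the initial scores.
    score = dict(init)
    for _ in range(iters):
        nxt = {v: (1.0 - damping) * init[v] for v in nodes}
        for u in nodes:
            weight_sum = sum(out[u].values())
            if weight_sum == 0.0:
                # A node without neighbours hands its mass back to the restart.
                for v in nodes:
                    nxt[v] += damping * score[u] * init[v]
                continue
            for v, w in out[u].items():
                nxt[v] += damping * score[u] * w / weight_sum
        score = nxt
    return score


def object_scores(rows, score):
    """Score each object as the sum of the outlierness of its values."""
    return [sum(score[(j, value)] for j, value in enumerate(row)) for row in rows]
```

Calling `object_scores(rows, value_outlierness(rows))` on a list of equal-length tuples yields one score per object, with higher values meaning more outlying under these assumptions. The same value scores could also be aggregated per feature to rank features, which loosely mirrors the two uses the abstract mentions.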

    CURE: Flexible Categorical Data Representation by Hierarchical Coupling Learning

    The representation of categorical data with hierarchical value coupling relationships (i.e., various value-to-value cluster interactions) is critical yet challenging for capturing complex data characteristics in learning tasks. This paper proposes a novel and flexible coupled unsupervised categorical data representation (CURE) framework, which not only captures the hierarchical couplings but is also flexible enough to be instantiated for contrastive learning tasks. CURE first learns value clusters of different granularities based on multiple value coupling functions and then learns the value representation from the couplings between the obtained value clusters. With two complementary value coupling functions, CURE is instantiated into two models: coupled data embedding (CDE) for clustering and coupled outlier scoring of high-dimensional data (COSH) for outlier detection. These show that CURE is flexible for value clustering and for coupling learning between value clusters in different learning tasks. CDE embeds categorical data into a new space in which features are independent and semantics are rich. COSH represents data w.r.t. an outlying vector to capture complex outlying behaviors of objects in high-dimensional data. Substantial experiments show that CDE significantly outperforms three popular unsupervised encoding methods and three state-of-the-art similarity measures, and that COSH performs significantly better than five state-of-the-art outlier detection methods on high-dimensional data. CDE and COSH are scalable and stable, linear in data size and quadratic in the number of features, and insensitive to their parameters.

    Coupled behavior informatics : modeling, analysis and learning

    University of Technology, Sydney. Faculty of Engineering and Information Technology. Behavior refers to the action or property of an actor, entity or otherwise, in response to situations or stimuli in its environment. The in-depth analysis of behavior has been increasingly recognized as a crucial means for understanding and disclosing interior driving forces and intrinsic cause-effects in business and social applications, including web community analysis, counterterrorism, fraud detection and customer relationship management. Currently, behavior modeling and analysis have been extensively investigated by researchers in different disciplines, e.g. psychology, economics, mathematics, engineering and information science. From those diverse perspectives, there are widespread and long-standing explorations of behavior, such as behavior recognition, reasoning about action, interactive process modeling, multivariate time series analysis, and outlier mining of trading behaviors. All the above methods, however, suffer to different extents from the following common problems: (1) Existing behavior modeling approaches take too many styles and forms according to distinct situations, which makes them troublesome for cross-discipline researchers to follow. (2) Traditional behavior analysis relies on implicit behavior and explicit business appearance, often leading to an ineffective and limited understanding of business and social activities. (3) Complex coupling relationships between behaviors are often ignored or only weakly addressed, which fails to provide a complete understanding of the underlying problems and their comprehensive solutions. (4) Current research usually overlooks the checking of behavior interactions, which weakens the soundness and robustness of models built for complex behavior applications. (5) Most classic mining and learning algorithms follow the fundamental assumption of independent and identical distribution (i.e. IIDness), but this is too strong to match the reality and complexities of practical applications. With the deepening and widening of social and business intelligence and their networking, there is great demand to consolidate and formalize the concept of behavior in order to scrutinize the native behavior intention, lifecycle and impact on complex problems and business issues. In real-world applications, group behavior interactions (i.e. coupled behaviors) are widely seen in natural, social and artificial behavior-related problems. The verification of behavior modeling is further desired to assure reliability and stability. In addition, complex behavior and social applications often exhibit strong explicit or implicit coupling relationships between both their entities and their properties. They cannot be abstracted or weakened to the extent of satisfying the IIDness assumption. These characteristics greatly challenge current behavior-related analysis approaches. Moreover, it is also very difficult to model, analyze and check behaviors coupled with one another due to the complexity from data, domain, context and impact perspectives. Based on the above research limitations and challenges, this thesis reports state-of-the-art advances and our research innovations in modeling, analyzing and learning coupled behaviors, which constitute coupled behavior informatics. Coupled behaviors are categorized as qualitative coupled behaviors and quantitative coupled behaviors, depending on whether the behavior involved is qualified by actions or quantified by properties. For qualitative coupled behavior modeling and analysis, we propose an Ontology-based Qualitative Coupled Behavior Modeling and Checking (OntoB for short) system to explicitly represent and verify complex behavior relationships, aggregations and constraints.
The effectiveness of the OntoB system in modeling multi-robot behaviors and their interactions in the RoboCup soccer competition has been demonstrated. With regard to quantitative coupled behavior analysis and learning, we explore the three tasks below. They operate under the non-IIDness assumption for entities, properties, or both, which caters for the intrinsic essence of real-world problems and applications. For numerical coupled behavior analysis, we introduce a framework to address the comprehensive dependency among continuous properties. Substantial experiments show that the coupled representation can effectively model the global couplings of numerical properties and outperforms the traditional approach. For categorical coupled behavior analysis, we present an efficient data-driven similarity learning approach that generates a coupled property similarity measure for nominal entities. Intensive empirical studies show that the coupled property similarity can appropriately quantify the intrinsic and global interactions within and between categorical properties, especially for large-scale behavior data. For coupled behavior ensemble learning, we explicate the couplings between methods and between entities in the application of clustering ensembles, and put forward a framework for coupled clustering ensembles (CCE). The CCE is shown experimentally to capture the implicit relationships of base clusterings and entities with higher clustering accuracy, stability and robustness than existing techniques. All these models and frameworks are supported by statistical analysis. Finally, we provide a consolidated understanding of coupled behaviors by summarizing the qualitative and quantitative aspects, extract the multi-level couplings embedded in them, and then formalize a coupled behavior algebra at its preliminary stage. Many open research issues and opportunities related to our proposed approaches and this novel algebra are discussed accordingly. Under varying backgrounds and scenarios, our proposed systems, algorithms and frameworks for coupled behavior informatics are shown to outperform state-of-the-art methods via theoretical analysis, empirical studies, or both. All these outcomes have been accepted by top conferences, and the follow-up work has also been recognized. Therefore, coupled behavior informatics is a promising, though wholly new, research topic with many attractive opportunities for further exploration and development.
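The thesis abstract mentions a data-driven coupled similarity between categorical values. A minimal sketch of one way such a measure could look is given here, assuming that two values of a feature are similar when they co-occur with the same values of the other features. The overlap-of-conditional-distributions definition and the function names are assumptions for illustration and are not taken from the thesis.

```python
from collections import Counter, defaultdict


def conditional_dist(rows, feature, other):
    """Estimate P(value of `other` | value of `feature`) from co-occurrence counts."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[feature]][row[other]] += 1
    return {
        u: {w: c / sum(cnt.values()) for w, c in cnt.items()}
        for u, cnt in counts.items()
    }


def inter_coupled_similarity(rows, feature, u, v):
    """Average overlap of the conditional distributions of u and v.

    Assumes at least two features and that both u and v occur in `feature`.
    """
    others = [j for j in range(len(rows[0])) if j != feature]
    total = 0.0
    for j in others:
        dist = conditional_dist(rows, feature, j)
        pu, pv = dist[u], dist[v]
        # The overlap is the probability mass shared by the two distributions.
        total += sum(min(pu.get(w, 0.0), pv.get(w, 0.0)) for w in set(pu) | set(pv))
    return total / len(others)
```

Under this definition two values that never appear in the same object can still be judged similar, because the comparison runs through the other features rather than through direct matching. That is the kind of between-property interaction the abstract contrasts with IID-style measures.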

    Outlier Detection Ensemble with Embedded Feature Selection

    Feature selection plays an important role in improving the performance of outlier detection, especially for noisy data. Existing methods usually perform feature selection and outlier scoring separately, which may select feature subsets that do not optimally serve outlier detection, leading to unsatisfactory performance. In this paper, we propose an outlier detection ensemble framework with embedded feature selection (ODEFS) to address this issue. Specifically, for each random sub-sampling based learning component, ODEFS unifies feature selection and outlier detection into a pairwise ranking formulation to learn feature subsets that are tailored to the outlier detection method. Moreover, we adopt thresholded self-paced learning to simultaneously optimize feature selection and example selection, which helps improve the reliability of the training set. We then design an alternating algorithm with proven convergence to solve the resultant optimization problem. In addition, we analyze the generalization error bound of the proposed framework, which provides a theoretical guarantee for the method and insightful practical guidance. Comprehensive experimental results on 12 real-world datasets from diverse domains validate the superiority of the proposed ODEFS. Comment: 10 pages, AAAI202

    Unsupervised Heterogeneous Coupling Learning for Categorical Representation.

    Complex categorical data is often hierarchically coupled with heterogeneous relationships between attributes and attribute values and the couplings between objects. Such value-to-object couplings are heterogeneous with complementary and inconsistent interactions and distributions. The limited research that exists on unlabeled categorical data representations ignores the heterogeneous and hierarchical couplings, underestimates data characteristics and complexities, and overuses redundant information. Deep representation learning of unlabeled categorical data is challenging: it overlooks such value-to-object couplings, complementarity and inconsistency, and requires large data, disentanglement, and high computational power. This work introduces a shallow but powerful UNsupervised heTerogeneous couplIng lEarning (UNTIE) approach for representing coupled categorical data by untying the interactions between couplings and revealing heterogeneous distributions embedded in each type of coupling. UNTIE is efficiently optimized w.r.t. a kernel k-means objective function for unsupervised representation learning of heterogeneous and hierarchical value-to-object couplings. Theoretical analysis shows that UNTIE can represent categorical data with maximal separability while effectively representing heterogeneous couplings and disclosing their roles in categorical data. The UNTIE-learned representations achieve significant performance improvements over state-of-the-art categorical representations and deep representation models on 25 categorical data sets with diversified characteristics.

    A Survey on Explainable Anomaly Detection

    In the past two decades, most research on anomaly detection has focused on improving the accuracy of the detection, while largely ignoring the explainability of the corresponding methods and thus leaving the explanation of outcomes to practitioners. As anomaly detection algorithms are increasingly used in safety-critical domains, providing explanations for the high-stakes decisions made in those domains has become an ethical and regulatory requirement. Therefore, this work provides a comprehensive and structured survey of state-of-the-art explainable anomaly detection techniques. We propose a taxonomy based on the main aspects that characterize each explainable anomaly detection technique, aiming to help practitioners and researchers find the explainable anomaly detection method that best suits their needs. Comment: Paper accepted by the ACM Transactions on Knowledge Discovery from Data (TKDD) for publication (preprint version).

    Table2Vec-automated universal representation learning of enterprise data DNA for benchmarkable and explainable enterprise data science.

    Enterprise data typically involves multiple heterogeneous data sources and external data that respectively record business activities, transactions, customer demographics, status, behaviors, interactions and communications with the enterprise, and the consumption and feedback of its products, services, production, marketing, operations, and management, etc. They involve enterprise DNA associated with domain-oriented transactions and master data, informational and operational metadata, and relevant external data. A critical challenge in enterprise data science is to enable an effective 'whole-of-enterprise' data understanding and data-driven discovery and decision-making on all-round enterprise DNA. Accordingly, here we introduce a neural encoder Table2Vec for automated universal representation learning of entities such as customers from all-round enterprise DNA with automated data characteristics analysis and data quality augmentation. The learned universal representations serve as representative and benchmarkable enterprise data genomes (similar to biological genomes and DNA in organisms) and can be used for enterprise-wide and domain-specific learning tasks. Table2Vec integrates automated universal representation learning on low-quality enterprise data and downstream learning tasks. Such automated universal enterprise representation and learning cannot be addressed by existing enterprise data warehouses (EDWs), business intelligence and corporate analytics systems, where 'enterprise big tables' are constructed with reporting and analytics conducted by specific analysts on respective domain subjects and goals. It addresses critical limitations and gaps of existing representation learning, enterprise analytics and cloud analytics, which are analytical subject, task and data-specific, creating analytical silos in an enterprise. 
We illustrate Table2Vec by characterizing all-round customer data DNA in an enterprise on complex heterogeneous multi-relational big tables to build universal customer vector representations. The learned universal representation of each customer is all-round, representative and benchmarkable, and supports both enterprise-wide and domain-specific learning goals and tasks in enterprise data science. Table2Vec significantly outperforms the existing shallow, boosting and deep learning methods typically used for enterprise analytics. We further discuss the research opportunities, directions and applications of automated universal enterprise representation and learning, and of the learned enterprise data DNA, for automated, all-purpose, whole-of-enterprise and ethical machine learning and data science.