338 research outputs found
Table2Vec-automated universal representation learning of enterprise data DNA for benchmarkable and explainable enterprise data science.
Enterprise data typically involves multiple heterogeneous data sources and external data that respectively record business activities, transactions, customer demographics, status, behaviors, interactions and communications with the enterprise, and the consumption and feedback of its products, services, production, marketing, operations, and management, etc. They involve enterprise DNA associated with domain-oriented transactions and master data, informational and operational metadata, and relevant external data. A critical challenge in enterprise data science is to enable an effective 'whole-of-enterprise' data understanding and data-driven discovery and decision-making on all-round enterprise DNA. Accordingly, here we introduce a neural encoder Table2Vec for automated universal representation learning of entities such as customers from all-round enterprise DNA with automated data characteristics analysis and data quality augmentation. The learned universal representations serve as representative and benchmarkable enterprise data genomes (similar to biological genomes and DNA in organisms) and can be used for enterprise-wide and domain-specific learning tasks. Table2Vec integrates automated universal representation learning on low-quality enterprise data and downstream learning tasks. Such automated universal enterprise representation and learning cannot be addressed by existing enterprise data warehouses (EDWs), business intelligence and corporate analytics systems, where 'enterprise big tables' are constructed with reporting and analytics conducted by specific analysts on respective domain subjects and goals. It addresses critical limitations and gaps of existing representation learning, enterprise analytics and cloud analytics, which are analytical subject, task and data-specific, creating analytical silos in an enterprise. 
We illustrate Table2Vec by characterizing all-round customer data DNA in an enterprise on complex heterogeneous multi-relational big tables to build universal customer vector representations. The learned universal representation of each customer is all-round, representative and benchmarkable, supporting both enterprise-wide and domain-specific learning goals and tasks in enterprise data science. Table2Vec significantly outperforms the existing shallow, boosting and deep learning methods typically used for enterprise analytics. We further discuss the research opportunities, directions and applications of automated universal enterprise representation and learning and the learned enterprise data DNA for automated, all-purpose, whole-of-enterprise and ethical machine learning and data science.
Outlier Detection Ensemble with Embedded Feature Selection
Feature selection plays an important role in improving the performance of
outlier detection, especially for noisy data. Existing methods usually perform
feature selection and outlier scoring separately, so the selected feature
subsets may not be optimal for outlier detection, leading to
unsatisfactory performance. In this paper, we propose an outlier detection
ensemble framework with embedded feature selection (ODEFS), to address this
issue. Specifically, for each random sub-sampling based learning component,
ODEFS unifies feature selection and outlier detection into a pairwise ranking
formulation to learn feature subsets that are tailored for the outlier
detection method. Moreover, we adopt the thresholded self-paced learning to
simultaneously optimize feature selection and example selection, which is
helpful to improve the reliability of the training set. After that, we design
an alternate algorithm with proved convergence to solve the resultant
optimization problem. In addition, we analyze the generalization error bound of
the proposed framework, which provides theoretical guarantee on the method and
insightful practical guidance. Comprehensive experimental results on 12
real-world datasets from diverse domains validate the superiority of the
proposed ODEFS. Comment: 10 pages, AAAI202
CLUSTERED HIERARCHICAL ANOMALY AND OUTLIER DETECTION ALGORITHMS
Anomaly and outlier detection is a long-standing problem in machine learning. In some cases, anomaly detection is easy, such as when data are drawn from well-characterized distributions such as the Gaussian. However, when data occupy high-dimensional spaces, anomaly detection becomes more difficult. We present CLAM (Clustered Learning of Approximate Manifolds), a manifold mapping technique that works in any metric space. CLAM begins with a fast hierarchical clustering technique and then induces a graph from the cluster tree, based on overlapping clusters selected using several geometric and topological features. Using these graphs, we implement CHAODA (Clustered Hierarchical Anomaly and Outlier Detection Algorithms), which explores various properties of the graphs and their constituent clusters to find outliers. CHAODA employs a form of transfer learning based on a training set of datasets, and applies this knowledge to a separate test set of datasets of different cardinalities, dimensionalities, and domains. On 24 publicly available datasets, we compare CHAODA (by measure of ROC AUC) to a variety of state-of-the-art unsupervised anomaly-detection algorithms. Six of the datasets are used for training. CHAODA outperforms other approaches on 16 of the remaining 18 datasets. CLAM and CHAODA scale to large, high-dimensional “big data” anomaly-detection problems, and generalize across datasets and distance functions. Source code for CLAM and CHAODA is freely available on GitHub.
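The comparison above is reported in terms of ROC AUC. As a minimal sketch of how that measure can be computed from outlier scores, the following uses the pairwise-ranking definition (the fraction of outlier/inlier pairs in which the outlier receives the higher score, with ties counted as half). The function name `roc_auc` and the toy scores are illustrative assumptions and are not taken from the CLAM or CHAODA source code.

```python
def roc_auc(scores, labels):
    """ROC AUC via the pairwise-ranking definition.

    scores: outlier scores, higher means more anomalous (assumed convention).
    labels: 1 for a true outlier, 0 for an inlier.
    """
    outliers = [s for s, y in zip(scores, labels) if y == 1]
    inliers = [s for s, y in zip(scores, labels) if y == 0]
    if not outliers or not inliers:
        raise ValueError("need at least one outlier and one inlier")

    wins = 0.0
    for o in outliers:
        for i in inliers:
            if o > i:
                wins += 1.0   # outlier correctly ranked above inlier
            elif o == i:
                wins += 0.5   # ties count as half a win
    return wins / (len(outliers) * len(inliers))


# Toy example: one of the two outliers is ranked below one inlier.
toy_scores = [0.10, 0.40, 0.35, 0.80]
toy_labels = [0, 0, 1, 1]
print(roc_auc(toy_scores, toy_labels))
```

This quadratic-time form is only meant to make the definition concrete; a sort-based computation would be preferable for large datasets.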
Deep Weakly-supervised Anomaly Detection
Anomaly detection is typically posited as an unsupervised learning task in
the literature due to the prohibitive cost and difficulty of obtaining large-scale
labeled anomaly data, but this ignores the fact that a very small number
(e.g., a few dozen) of labeled anomalies can often be made available at
small/trivial cost in many real-world anomaly detection applications. To
leverage such labeled anomaly data, we study an important anomaly detection
problem termed weakly-supervised anomaly detection, in which, in addition to a
large amount of unlabeled data, a limited number of labeled anomalies are
available during modeling. Learning with the small labeled anomaly data enables
anomaly-informed modeling, which helps identify anomalies of interest and
address the notoriously high false-positive rates in unsupervised anomaly detection.
However, the problem is especially challenging, since (i) the limited amount of
labeled anomaly data often, if not always, cannot cover all types of anomalies
and (ii) the unlabeled data is often dominated by normal instances but has
anomaly contamination. We address the problem by formulating it as a pairwise
relation prediction task. Particularly, our approach defines a two-stream
ordinal regression neural network to learn the relation of randomly sampled
instance pairs, i.e., whether the instance pair contains two labeled anomalies,
one labeled anomaly, or just unlabeled data instances. The resulting model
effectively leverages both the labeled and unlabeled data to substantially
augment the training data and learn well-generalized representations of
normality and abnormality. Comprehensive empirical results on 40 real-world
datasets show that our approach (i) significantly outperforms four
state-of-the-art methods in detecting both known and previously unseen
anomalies and (ii) is substantially more data-efficient. Comment: Theoretical results are refined and extended. Significantly more
empirical results are added, including results on detecting previously
unknown anomalies
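The pairwise relation formulation described in this abstract can be sketched as a data-preparation step: instance pairs are sampled at random and each pair receives an ordinal target according to how many labeled anomalies it contains. The sketch below covers only that sampling step. The target values 2, 1 and 0, the uniform choice among the three pair types, and the names `sample_pairs`, `labeled_anomalies` and `unlabeled` are assumptions made for illustration; the two-stream ordinal regression network itself is not shown.

```python
import random


def sample_pairs(labeled_anomalies, unlabeled, n_pairs, seed=0):
    """Sample instance pairs with ordinal relation targets.

    Assumed targets (illustrative only):
      2 -> both instances are labeled anomalies
      1 -> one labeled anomaly, one unlabeled instance
      0 -> both instances are unlabeled
    Requires at least two labeled anomalies and two unlabeled instances.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        kind = rng.choice(("anomaly-anomaly", "anomaly-unlabeled",
                           "unlabeled-unlabeled"))
        if kind == "anomaly-anomaly":
            a, b = rng.sample(labeled_anomalies, 2)
            target = 2
        elif kind == "anomaly-unlabeled":
            a = rng.choice(labeled_anomalies)
            b = rng.choice(unlabeled)
            target = 1
        else:
            a, b = rng.sample(unlabeled, 2)
            target = 0
        pairs.append((a, b, target))
    return pairs


# Toy data: feature vectors as tuples (hypothetical values).
anomalies = [(9.0, 9.5), (8.7, 10.1), (9.9, 8.8)]
unlabeled_pool = [(0.1, 0.2), (0.0, -0.1), (0.3, 0.1), (-0.2, 0.0)]
for a, b, target in sample_pairs(anomalies, unlabeled_pool, n_pairs=5):
    print(a, b, target)
```

Because a few labeled anomalies can be paired with many unlabeled instances, the number of distinct training pairs grows far beyond the number of labeled examples, which is the augmentation effect the abstract refers to.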
Outlier detection using flexible categorisation and interrogative agendas
Categorization is one of the basic tasks in machine learning and data
analysis. Building on formal concept analysis (FCA), the starting point of the
present work is that different ways to categorize a given set of objects exist,
which depend on the choice of the sets of features used to classify them, and
different such sets of features may yield better or worse categorizations,
relative to the task at hand. In their turn, the (a priori) choice of a
particular set of features over another might be subjective and express a
certain epistemic stance (e.g. interests, relevance, preferences) of an agent
or a group of agents, namely, their interrogative agenda. In the present paper,
we represent interrogative agendas as sets of features, and explore and compare
different ways to categorize objects w.r.t. different sets of features
(agendas). We first develop a simple unsupervised FCA-based algorithm for
outlier detection which uses categorizations arising from different agendas. We
then present a supervised meta-learning algorithm to learn suitable (fuzzy)
agendas for categorization as sets of features with different weights or
masses. We combine this meta-learning algorithm with the unsupervised outlier
detection algorithm to obtain a supervised outlier detection algorithm. We show
that these algorithms perform on par with commonly used outlier detection
algorithms on commonly used outlier detection datasets. These algorithms
provide both local and global explanations of their results.
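To make the idea of categorizing objects with respect to different agendas concrete, here is a small sketch under stated assumptions. A formal context is modeled as a mapping from objects to binary feature sets, an agenda as a set of features, and the category generated by an object under an agenda as all objects sharing that object's features within the agenda. The scoring rule (one minus the relative size of that category, averaged over agendas) and the names `category_size` and `outlier_scores` are hypothetical illustrations, not the algorithm from the paper.

```python
def category_size(context, obj, agenda):
    """Size of the smallest category containing obj when only the
    features in the agenda are considered."""
    intent = context[obj] & agenda  # features of obj visible under this agenda
    return sum(1 for feats in context.values() if intent <= (feats & agenda))


def outlier_scores(context, agendas):
    """Average, over agendas, of 1 - |category| / |objects|.

    Objects that fall into small categories under many agendas get
    higher scores (assumed scoring rule, for illustration only).
    """
    n = len(context)
    return {
        obj: sum(1 - category_size(context, obj, agenda) / n
                 for agenda in agendas) / len(agendas)
        for obj in context
    }


# Toy context: three similar objects and one that shares no features with them.
toy_context = {
    "a": {"f1", "f2"},
    "b": {"f1", "f2"},
    "c": {"f1", "f2"},
    "d": {"f3"},
}
toy_agendas = [{"f1", "f2", "f3"}, {"f1"}]
print(outlier_scores(toy_context, toy_agendas))
```

The per-agenda category sizes also hint at how such a method can explain its output: one can report which agenda placed an object in an unusually small category.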