Resilient VAE: Unsupervised Anomaly Detection at the SLAC Linac Coherent Light Source
Significant advances in utilizing deep learning for anomaly detection have
been made in recent years. However, these methods largely assume the existence
of a normal training set (i.e., uncontaminated by anomalies) or even a
completely labeled training set. In many complex engineering systems, such as
particle accelerators, labels are sparse and expensive; in order to perform
anomaly detection in these cases, we must drop these assumptions and utilize a
completely unsupervised method. This paper introduces the Resilient Variational
Autoencoder (ResVAE), a deep generative model specifically designed for anomaly
detection. ResVAE exhibits resilience to anomalies present in the training data
and provides feature-level anomaly attribution. During the training process,
ResVAE learns the anomaly probability for each sample as well as each
individual feature, utilizing these probabilities to effectively disregard
anomalous examples in the training data. We apply our proposed method to detect
anomalies in the accelerator status at the SLAC Linac Coherent Light Source
(LCLS). By utilizing shot-to-shot data from the beam position monitoring system, we demonstrate the capability of ResVAE in identifying various types of anomalies that are visible in the accelerator.
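The core idea of tolerating contaminated training data, down-weighting samples that look anomalous, can be sketched as follows. This is an illustrative simplification, not the paper's formulation: ResVAE learns per-sample and per-feature anomaly probabilities jointly during training, whereas here they are derived post hoc from reconstruction errors, and the function name and sigmoid-of-z-score mapping are assumptions:

```python
import numpy as np

def anomaly_weights(recon_errors, temperature=1.0):
    """Map per-sample reconstruction errors to anomaly probabilities and
    return training weights that down-weight likely anomalies.
    Illustrative stand-in for ResVAE's learned probabilities."""
    errors = np.asarray(recon_errors, dtype=float)
    z = (errors - errors.mean()) / (errors.std() + 1e-8)
    p_anom = 1.0 / (1.0 + np.exp(-z / temperature))  # sigmoid of z-score
    return 1.0 - p_anom  # weight = probability the sample is normal

errors = [0.1, 0.12, 0.09, 5.0]  # last sample is an outlier
w = anomaly_weights(errors)      # outlier receives a much smaller weight
```

In a full model these weights would multiply each sample's contribution to the ELBO, so anomalous examples are effectively ignored during training.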
Explaining Anomalies using Denoising Autoencoders for Financial Tabular Data
Recent advances in Explainable AI (XAI) increased the demand for deployment
of safe and interpretable AI models in various industry sectors. Despite the
latest success of deep neural networks in a variety of domains, understanding
the decision-making process of such complex models still remains a challenging
task for domain experts. In the financial domain especially, merely pointing to an anomaly composed of often hundreds of mixed-type columns has limited value for experts. Hence, in this paper, we propose a framework for explaining anomalies using denoising autoencoders designed for mixed-type tabular data. We
specifically focus our technique on anomalies that are erroneous observations.
This is achieved by localizing individual sample columns (cells) with potential
errors and assigning corresponding confidence scores. In addition, the model
provides the expected cell value estimates to fix the errors. We evaluate our
approach based on three standard public tabular datasets (Credit Default,
Adult, IEEE Fraud) and one proprietary dataset (Holdings). We find that
denoising autoencoders applied to this task already outperform other approaches
in the cell error detection rates as well as in the expected value rates.
Additionally, we analyze how a specialized loss designed for cell error
detection can further improve these metrics. Our framework is designed for a
domain expert to understand abnormal characteristics of an anomaly, as well as
to improve in-house data quality management processes.
Comment: 10 pages, 4 figures, 3 tables, preprint version
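The cell-level localization described above can be sketched as a per-column normalization of reconstruction residuals. This is a hypothetical minimal interface, not the paper's implementation: `localize_cell_errors`, the threshold, and the toy `x_hat` are assumptions; in the framework, `x_hat` would come from the trained denoising autoencoder and would also provide the suggested corrected values:

```python
import numpy as np

def localize_cell_errors(x, x_hat, threshold=2.0):
    """Score each cell by its column-normalized reconstruction residual.
    Cells above `threshold` are flagged as likely errors, with x_hat
    supplying the expected (corrected) value for each flagged cell."""
    resid = np.abs(x - x_hat)
    col_scale = resid.mean(axis=0) + 1e-8
    scores = resid / col_scale               # per-cell confidence scores
    flagged = np.argwhere(scores > threshold)
    return scores, flagged

x = np.array([[1.0, 10.0], [1.1, 11.0], [9.0, 10.5]])      # cell (2, 0) corrupted
x_hat = np.array([[1.0, 10.2], [1.1, 10.9], [1.2, 10.4]])  # model reconstruction
scores, flagged = localize_cell_errors(x, x_hat)           # flags cell (2, 0)
```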
AutoCure: Automated Tabular Data Curation Technique for ML Pipelines
Machine learning algorithms have become increasingly prevalent in multiple
domains, such as autonomous driving, healthcare, and finance. In such domains,
data preparation remains a major challenge in developing accurate models, requiring substantial expertise and time to navigate the huge space of data curation and transformation tools. To address this
challenge, we present AutoCure, a novel and configuration-free data curation
pipeline that improves the quality of tabular data. Unlike traditional data
curation methods, AutoCure synthetically enhances the density of the clean data
fraction through an adaptive ensemble-based error detection method and a data
augmentation module. In practice, AutoCure can be integrated with open source
tools, e.g., Auto-sklearn, H2O, and TPOT, to promote the democratization of
machine learning. As a proof of concept, we provide a comparative evaluation of
AutoCure against 28 combinations of traditional data curation tools,
demonstrating superior performance and predictive accuracy without user
intervention. Our evaluation shows that AutoCure is an effective approach to
automating data preparation and improving the accuracy of machine learning
models.
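The ensemble-based error detection step can be illustrated with a toy majority vote among simple outlier detectors. Everything below is a sketch under strong assumptions: the detector set, thresholds, and function name are invented for illustration; AutoCure's actual ensemble is adaptive and is paired with a data augmentation module that densifies the clean fraction:

```python
import numpy as np

def ensemble_error_mask(col, vote_threshold=2):
    """Flag values in a numeric column as erroneous when a majority of
    simple detectors agree. Toy stand-in for an adaptive ensemble."""
    col = np.asarray(col, dtype=float)
    votes = np.zeros(len(col), dtype=int)
    # Detector 1: z-score (lenient threshold for tiny samples)
    z = np.abs((col - col.mean()) / (col.std() + 1e-8))
    votes += (z > 1.5).astype(int)
    # Detector 2: IQR fences
    q1, q3 = np.percentile(col, [25, 75])
    iqr = q3 - q1
    votes += ((col < q1 - 1.5 * iqr) | (col > q3 + 1.5 * iqr)).astype(int)
    # Detector 3: median absolute deviation
    mad = np.median(np.abs(col - np.median(col))) + 1e-8
    votes += (np.abs(col - np.median(col)) / mad > 3.5).astype(int)
    return votes >= vote_threshold

col = [10.0, 11.0, 9.5, 10.2, 500.0]   # last value is an obvious error
mask = ensemble_error_mask(col)         # only the last value is flagged
```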
On Memorization in Probabilistic Deep Generative Models
Recent advances in deep generative models have led to impressive results in a
variety of application domains. Motivated by the possibility that deep learning
models might memorize part of the input data, there have been increased efforts
to understand how memorization arises. In this work, we extend a recently
proposed measure of memorization for supervised learning (Feldman, 2019) to the
unsupervised density estimation problem and adapt it to be more computationally
efficient. Next, we present a study that demonstrates how memorization can
occur in probabilistic deep generative models such as variational autoencoders.
This reveals that the form of memorization to which these models are
susceptible differs fundamentally from mode collapse and overfitting.
Furthermore, we show that the proposed memorization score measures a phenomenon
that is not captured by commonly-used nearest neighbor tests. Finally, we
discuss several strategies that can be used to limit memorization in practice.
Our work thus provides a framework for understanding problematic memorization
in probabilistic generative models.
Comment: Accepted for publication at NeurIPS 202
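The adapted memorization measure compares the model's density at a training point when that point is included versus held out. The sketch below applies the idea to a deliberately simple density estimator (a single maximum-likelihood Gaussian) rather than a deep generative model; the function names and toy data are assumptions:

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def memorization_score(data, i):
    """Feldman-style memorization adapted to density estimation: the gain
    in log-density at x_i from including x_i in the training set.
    Stand-in estimator: one Gaussian fit by maximum likelihood."""
    data = np.asarray(data, dtype=float)
    x_i = data[i]
    rest = np.delete(data, i)
    lp_with = gaussian_logpdf(x_i, data.mean(), data.std() + 1e-8)
    lp_without = gaussian_logpdf(x_i, rest.mean(), rest.std() + 1e-8)
    return lp_with - lp_without

data = [0.0, 0.1, -0.1, 0.05, 8.0]       # last point is atypical
m_outlier = memorization_score(data, 4)  # large: density at 8.0 depends on 8.0
m_typical = memorization_score(data, 1)  # small: 0.1 is well covered anyway
```

Atypical points that the model can only assign density to by "remembering" them receive high scores, which is exactly the phenomenon the nearest-neighbor tests mentioned above fail to capture.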
Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo
Variational Autoencoders (VAEs) have recently been highly successful at
imputing and acquiring heterogeneous missing data. However, within this
specific application domain, existing VAE methods are restricted by using only
one layer of latent variables and strictly Gaussian posterior approximations.
To address these limitations, we present HH-VAEM, a Hierarchical VAE model for
mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic
hyper-parameter tuning for improved approximate inference. Our experiments show
that HH-VAEM outperforms existing baselines in the tasks of missing data
imputation and supervised learning with missing features. Finally, we also
present a sampling-based approach for efficiently computing the information
gain when missing features are to be acquired with HH-VAEM. Our experiments
show that this sampling-based approach is superior to alternatives based on
Gaussian approximations.
Comment: Accepted at NeurIPS 202
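The sampling-based information gain for feature acquisition can be sketched as the current predictive entropy minus the expected posterior entropy, averaged over samples of the missing feature. Everything here is a toy stand-in: a binary target, a hand-written conditional, and samples from a plain normal; HH-VAEM instead draws these samples from its hierarchical latent space with Hamiltonian Monte Carlo:

```python
import numpy as np

def entropy(p):
    """Entropy of a Bernoulli(p) in nats."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def expected_info_gain(p_y, feature_samples, p_y_given_x):
    """Monte Carlo estimate of the information gained about a binary
    target y by acquiring a missing feature x: H(y) minus the expected
    entropy of y after observing x, averaged over predictive samples."""
    post_entropies = [entropy(p_y_given_x(x)) for x in feature_samples]
    return entropy(p_y) - float(np.mean(post_entropies))

rng = np.random.default_rng(0)
samples = rng.normal(size=1000)                   # predictive samples of x
p_y_given_x = lambda x: 1.0 / (1.0 + np.exp(-3.0 * x))  # x is informative
ig = expected_info_gain(0.5, samples, p_y_given_x)      # positive gain
```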
Machine Learning Models for High-dimensional Biomedical Data
Recent technological advances enable the collection of complex, heterogeneous, and high-dimensional data in biomedical domains. The increasing availability of such data creates the need for new machine learning models for effective data analysis and knowledge discovery. This dissertation introduces several unsupervised and supervised methods to help understand the data, discover patterns, and improve decision making. All the proposed methods generalize to other industrial fields.
The first topic of this dissertation focuses on data clustering, often the first step in analyzing a dataset without label information. Clustering high-dimensional data with mixed categorical and numeric attributes remains a challenging yet important task. A clustering algorithm based on tree ensembles, CRAFTER, is proposed to tackle this task in a scalable manner.
The second part of this dissertation aims to develop data representation methods for genome sequencing data, a special type of high-dimensional data in the biomedical domain. The proposed data representation method, Bag-of-Segments, can summarize the key characteristics of the genome sequence into a small number of features with good interpretability.
The third part of this dissertation introduces an end-to-end deep neural network model, GCRNN, for time series classification with emphasis on both the accuracy and the interpretation. GCRNN contains a convolutional network component to extract high-level features, and a recurrent network component to enhance the modeling of the temporal characteristics. A feed-forward fully connected network with the sparse group lasso regularization is used to generate the final classification and provide good interpretability.
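The sparse group lasso regularization mentioned for GCRNN combines an L1 penalty (sparsity of individual weights) with a group-wise L2 penalty (sparsity of whole feature groups), which is what yields interpretable classifications. A minimal sketch of the penalty, with hypothetical weights and group structure:

```python
import numpy as np

def sparse_group_lasso(W, groups, lam=0.1, alpha=0.5):
    """Sparse group lasso penalty: a convex mix of an L1 term
    (cell-level sparsity) and a group-L2 term (group-level sparsity).
    The weights and groups here are hypothetical."""
    l1 = np.abs(W).sum()
    group_l2 = sum(np.linalg.norm(W[g]) for g in groups)
    return lam * (alpha * l1 + (1 - alpha) * group_l2)

W = np.array([0.0, 0.0, 0.5, -0.5])        # first group already zeroed out
penalty = sparse_group_lasso(W, groups=[[0, 1], [2, 3]])
```

Because the group-L2 term is not differentiable at zero group norm, the optimizer can drive entire groups exactly to zero, pruning whole input features rather than just individual connections.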
The last topic centers on dimensionality reduction methods for time series data. A good dimensionality reduction method is important for storage, decision making, and pattern visualization. The CRNN autoencoder is proposed to not only achieve low reconstruction error but also generate discriminative features. A variational version of this autoencoder has great potential for applications such as anomaly detection and process control.
Dissertation/Thesis: Doctoral Dissertation, Industrial Engineering, 201
Detection of Stealthy False Data Injection Attacks Against State Estimation in Electric Power Grids Using Deep Learning Techniques
Since communication technologies are being integrated into the smart grid, its vulnerability to false data injection is increasing. State estimation is a critical component used for monitoring the operation of the power grid. However, a tailored attack could circumvent the bad data detection of the state estimator and thus disturb the stability of the grid. Such attacks are called stealthy false data injection attacks (FDIAs). This thesis proposes a prediction-based detector using deep learning techniques to detect injected measurements. The proposed detector adopts both Convolutional Neural Networks and Recurrent Neural Networks, making full use of the spatial-temporal correlations in the measurement data. With its separable architecture, three discriminators with different feature extraction methods were designed for the predictor. In addition, a measurement restoration mechanism was proposed based on the prediction. The proposed detection mechanism was assessed by simulating FDIAs on the IEEE 39-bus system. The results demonstrate that the proposed mechanism achieves satisfactory performance compared with existing algorithms.
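The prediction-based detection mechanism can be sketched as: forecast the next measurement vector from recent history and flag a sample whose prediction residual is anomalously large. The thesis uses a CNN+RNN predictor; here a trivial mean-of-recent-samples forecaster stands in, and the threshold, scaling, and signal are all illustrative assumptions:

```python
import numpy as np

def detect_injection(history, measurement, threshold=3.0):
    """Flag a measurement vector whose residual against a forecast is
    large on any channel. Toy forecaster in place of the CNN+RNN."""
    history = np.asarray(history, dtype=float)
    pred = history[-3:].mean(axis=0)          # toy one-step forecast
    resid = measurement - pred
    scale = history.std(axis=0) + 1e-8
    score = np.abs(resid / scale).max()       # worst-channel normalized residual
    return score > threshold, pred            # pred doubles as a restoration value

# Two smoothly varying measurement channels
history = np.stack([np.sin(0.1 * t + np.array([0.0, 1.0])) for t in range(50)])
clean = np.sin(0.1 * 50 + np.array([0.0, 1.0]))
attacked = clean + np.array([0.0, 2.5])       # injected bias on one channel
flag_clean, _ = detect_injection(history, clean)        # not flagged
flag_attacked, _ = detect_injection(history, attacked)  # flagged
```

The same forecast supplies the measurement restoration mechanism: once a sample is flagged, the predicted vector can replace the suspect measurement.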