1,656 research outputs found

    Autoencoders and Generative Adversarial Networks for Imbalanced Sequence Classification

    Generative Adversarial Networks (GANs) have been used in many different applications to generate realistic synthetic data. We introduce a novel GAN with Autoencoder (GAN-AE) architecture to generate synthetic samples for variable-length, multi-feature sequence datasets. In this model, we develop a GAN architecture with an additional autoencoder component, where recurrent neural networks (RNNs) are used for each component of the model, in order to generate synthetic data that improves classification accuracy on a highly imbalanced medical device dataset. In addition to the medical device dataset, we evaluate GAN-AE on two further datasets and demonstrate its application to a sequence-to-sequence task where both synthetic sequence inputs and sequence outputs must be generated. To evaluate the quality of the synthetic data, we train encoder-decoder models both with and without the synthetic data and compare the classification performance. We show that a model trained with GAN-AE-generated synthetic data outperforms models trained with synthetic data generated by standard oversampling techniques such as SMOTE and autoencoders, as well as by state-of-the-art GAN-based models.
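
For orientation, the SMOTE baseline named in this abstract can be sketched as interpolation between a minority sample and one of its nearest minority neighbours. This is a minimal NumPy sketch of that general idea on fixed-length feature vectors, not the GAN-AE model and not the authors' code; the function name `smote_like_oversample`, its parameters, and the toy data are assumptions made for illustration.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by interpolating between a
    sampled minority point and one of its k nearest minority neighbours.
    Hypothetical helper for illustration only."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)
    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    # Pick a base point and one of its neighbours for each synthetic sample.
    base = rng.integers(0, n, size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation weight in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: four points on the unit square (made-up data).
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_oversample(X_minority, n_new=6)
```

Because each synthetic point lies on a segment between two real points, this approach needs fixed-length vectors, which is one reason variable-length sequences call for a generative model instead.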

    DOPING: Generative Data Augmentation for Unsupervised Anomaly Detection with GAN

    Recently, the introduction of the generative adversarial network (GAN) and its variants has enabled the generation of realistic synthetic samples, which have been used for enlarging training sets. Previous work primarily focused on data augmentation for semi-supervised and supervised tasks. In this paper, we instead focus on unsupervised anomaly detection and propose a novel generative data augmentation framework optimized for this task. In particular, we propose to oversample infrequent normal samples: normal samples that occur with small probability, e.g., rare normal events. We show that these samples are responsible for false positives in anomaly detection. However, oversampling infrequent normal samples is challenging for real-world high-dimensional data with multimodal distributions. To address this challenge, we propose to use a GAN variant known as the adversarial autoencoder (AAE) to transform the high-dimensional multimodal data distributions into low-dimensional unimodal latent distributions with well-defined tail probability. Then, we systematically oversample at the 'edge' of the latent distributions to increase the density of infrequent normal samples. We show that our oversampling pipeline is a unified one: it is generally applicable to datasets with different complex data distributions. To the best of our knowledge, our method is the first data augmentation technique focused on improving performance in unsupervised anomaly detection. We validate our method by demonstrating consistent improvements across several real-world datasets. Comment: Published as a conference paper at ICDM 2018 (IEEE International Conference on Data Mining).
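
The idea of oversampling at the 'edge' of a unimodal latent distribution can be illustrated with a short sketch. It assumes a standard Gaussian latent prior and defines the edge as the outermost fraction of samples by Euclidean norm; both choices, along with the name `sample_latent_edge` and its parameters, are assumptions for illustration, since the abstract does not specify the exact rule. In a full pipeline a trained decoder (not shown) would map these latent points back to data space.

```python
import numpy as np

def sample_latent_edge(n_samples, latent_dim=2, tail_fraction=0.1, seed=0):
    """Draw latent vectors from N(0, I) and keep only those in the outer
    tail_fraction of the distribution, measured by Euclidean norm.
    Hypothetical helper; the Gaussian prior is an assumption."""
    rng = np.random.default_rng(seed)
    # Draw a pool large enough that the tail holds at least n_samples points.
    pool_size = int(np.ceil(n_samples / tail_fraction)) * 2
    pool = rng.standard_normal((pool_size, latent_dim))
    norms = np.linalg.norm(pool, axis=1)
    # Empirical radius beyond which only tail_fraction of the pool lies.
    threshold = np.quantile(norms, 1.0 - tail_fraction)
    edge = pool[norms >= threshold]
    return edge[:n_samples], threshold

edge_latents, radius = sample_latent_edge(n_samples=100)
```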

    Integrated Machine Learning Approaches to Improve Classification performance and Feature Extraction Process for EEG Dataset

    Epilepsy is a chronic neurological disorder that occurs due to abnormal activity of brain neurons and affects approximately 50 million people worldwide. Epilepsy can affect patients' health and lead to life-threatening emergencies. Early detection of epilepsy is highly effective in avoiding seizures by intervening in treatment. The electroencephalogram (EEG) signal, which contains valuable information about electrical activity in the brain, is a standard neuroimaging tool used by clinicians to monitor and diagnose epilepsy. Visually inspecting the EEG signal is an expensive, tedious, and error-prone practice. Moreover, the result varies between neurophysiologists for an identical reading. Thus, automatically classifying epilepsy into different epileptic states with a high accuracy rate is an urgent requirement and has long been investigated. This PhD thesis contributes to the epileptic seizure detection problem using Machine Learning (ML) techniques. Machine learning algorithms have been implemented to automatically classify epilepsy from EEG data. Imbalanced class distributions and effective feature extraction from EEG signals are the two major concerns in applying machine learning algorithms effectively and efficiently to epilepsy classification. The algorithms exhibit results biased towards the majority class when classes are imbalanced, while effective feature extraction can improve classification performance. In this thesis, we present three novel frameworks to effectively classify epileptic states while addressing the above issues. Firstly, a deep neural network-based framework exploring different sampling techniques was proposed, in which both traditional and state-of-the-art sampling techniques were evaluated for their capability of improving the imbalance ratio and classification performance.
    Secondly, a novel integrated machine learning-based framework was proposed to effectively learn from imbalanced EEG data, leveraging Principal Component Analysis to extract high- and low-variance principal components, which are empirically customized for imbalanced data classification. This study showed that principal components associated with low variances can capture implicit patterns of the minority class of a dataset. Thirdly, we proposed a novel framework to effectively classify epilepsy leveraging summary statistics analysis of window-based features of EEG signals. The framework first denoised the signals using power spectrum density analysis and replaced outliers with a k-NN imputer. Next, window-level features were extracted from the statistical, temporal, and spectral domains. Basic summary statistics were then computed from the extracted features to feed into different machine learning classifiers. An optimal set of features was selected by variance thresholding and by dropping correlated features before classification. Finally, we applied traditional machine learning classifiers such as Support Vector Machine, Decision Tree, Random Forest, and k-Nearest Neighbors, along with Deep Neural Networks, to classify epilepsy. We evaluated the frameworks on a benchmark dataset through rigorous experimental settings and demonstrated their effectiveness in terms of accuracy, precision, recall, and F-beta score.
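
The window-based steps in this abstract (window-level features, summary statistics, variance thresholding, dropping correlated features) can be sketched roughly as follows. This is a simplified NumPy illustration under stated assumptions, not the thesis pipeline: the specific features, the window length, both thresholds, and the names `window_summary_features` and `select_features` are all hypothetical, the denoising and k-NN imputation steps are omitted, and the signals are random noise rather than EEG.

```python
import numpy as np

def window_summary_features(signal, window=64):
    """Split a 1-D signal into non-overlapping windows, compute a few
    window-level features, then summarise each feature across windows."""
    signal = np.asarray(signal, dtype=float)
    n_windows = len(signal) // window
    windows = signal[: n_windows * window].reshape(n_windows, window)
    per_window = np.column_stack([
        windows.mean(axis=1),                                      # statistical
        windows.std(axis=1),                                       # statistical
        np.abs(np.diff(windows, axis=1)).mean(axis=1),             # temporal
        (np.abs(np.fft.rfft(windows, axis=1)) ** 2).mean(axis=1),  # spectral
    ])
    # Summary statistics (mean and std) of every window-level feature.
    return np.concatenate([per_window.mean(axis=0), per_window.std(axis=0)])

def select_features(X, var_threshold=1e-8, corr_threshold=0.95):
    """Return column indices that pass a variance threshold and are not
    highly correlated with an earlier kept column."""
    keep = np.flatnonzero(X.var(axis=0) > var_threshold)
    corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
    selected = []
    for j in range(len(keep)):
        if all(corr[j, i] <= corr_threshold for i in selected):
            selected.append(j)
    return keep[selected]

# Made-up data: 20 random signals standing in for EEG recordings.
rng = np.random.default_rng(0)
signals = rng.standard_normal((20, 512))
X = np.vstack([window_summary_features(s) for s in signals])
# Append a redundant column (scaled copy of column 0) and a constant column.
X = np.hstack([X, X[:, :1] * 2.0, np.ones((20, 1))])
kept = select_features(X)
```

In this toy run the scaled copy and the constant column are the ones expected to be dropped; the surviving columns would then be passed to a classifier.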

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.