2,382 research outputs found

    Autoencoders and Generative Adversarial Networks for Imbalanced Sequence Classification

    Full text link
    Generative Adversarial Networks (GANs) have been used in many different applications to generate realistic synthetic data. We introduce a novel GAN with Autoencoder (GAN-AE) architecture to generate synthetic samples for variable length, multi-feature sequence datasets. In this model, we develop a GAN architecture with an additional autoencoder component, where recurrent neural networks (RNNs) are used for each component of the model in order to generate synthetic data to improve classification accuracy for a highly imbalanced medical device dataset. In addition to the medical device dataset, we also evaluate the GAN-AE performance on two additional datasets and demonstrate the application of GAN-AE to a sequence-to-sequence task where both synthetic sequence inputs and sequence outputs must be generated. To evaluate the quality of the synthetic data, we train encoder-decoder models both with and without the synthetic data and compare the classification model performance. We show that a model trained with GAN-AE generated synthetic data outperforms models trained with synthetic data generated both with standard oversampling techniques such as SMOTE and Autoencoders as well as with state of the art GAN-based models

    Multi-label Class-imbalanced Action Recognition in Hockey Videos via 3D Convolutional Neural Networks

    Get PDF
    Automatic analysis of the video is one of most complex problems in the fields of computer vision and machine learning. A significant part of this research deals with (human) activity recognition (HAR) since humans, and the activities that they perform, generate most of the video semantics. Video-based HAR has applications in various domains, but one of the most important and challenging is HAR in sports videos. Some of the major issues include high inter- and intra-class variations, large class imbalance, the presence of both group actions and single player actions, and recognizing simultaneous actions, i.e., the multi-label learning problem. Keeping in mind these challenges and the recent success of CNNs in solving various computer vision problems, in this work, we implement a 3D CNN based multi-label deep HAR system for multi-label class-imbalanced action recognition in hockey videos. We test our system for two different scenarios: an ensemble of kk binary networks vs. a single kk-output network, on a publicly available dataset. We also compare our results with the system that was originally designed for the chosen dataset. Experimental results show that the proposed approach performs better than the existing solution.Comment: Accepted to IEEE/ACIS SNPD 2018, 6 pages, 3 figure

    Imbalanced Ensemble Classifier for learning from imbalanced business school data set

    Full text link
    Private business schools in India face a common problem of selecting quality students for their MBA programs to achieve the desired placement percentage. Generally, such data sets are biased towards one class, i.e., imbalanced in nature. And learning from the imbalanced dataset is a difficult proposition. This paper proposes an imbalanced ensemble classifier which can handle the imbalanced nature of the dataset and achieves higher accuracy in case of the feature selection (selection of important characteristics of students) cum classification problem (prediction of placements based on the students' characteristics) for Indian business school dataset. The optimal value of an important model parameter is found. Numerical evidence is also provided using Indian business school dataset to assess the outstanding performance of the proposed classifier

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    The Synthetic-Oversampling Method: Using Photometric Colors to Discover Extremely Metal-Poor Stars

    Get PDF
    Extremely metal-poor (EMP) stars ([Fe/H] < -3.0 dex) provide a unique window into understanding the first generation of stars and early chemical enrichment of the Universe. EMP stars are exceptionally rare, however, and the relatively small number of confirmed discoveries limits our ability to exploit these near-field probes of the first ~500 Myr after the Big Bang. Here, a new method to photometrically estimate [Fe/H] from only broadband photometric colors is presented. I show that the method, which utilizes machine-learning algorithms and a training set of ~170,000 stars with spectroscopically measured [Fe/H], produces a typical scatter of ~0.29 dex. This performance is similar to what is achievable via low-resolution spectroscopy, and outperforms other photometric techniques, while also being more general. I further show that a slight alteration to the model, wherein synthetic EMP stars are added to the training set, yields the robust identification of EMP candidates. In particular, this synthetic-oversampling method recovers ~20% of the EMP stars in the training set, at a precision of ~0.05. Furthermore, ~65% of the false positives from the model are very metal-poor stars ([Fe/H] < -2.0 dex). The synthetic-oversampling method is biased towards the discovery of warm (~F-type) stars, a consequence of the targeting bias from the SDSS/SEGUE survey. This EMP selection method represents a significant improvement over alternative broadband optical selection techniques. The models are applied to >12 million stars, with an expected yield of ~600 new EMP stars, which promises to open new avenues for exploring the early universe.Comment: 15 pages, 7 figures, to be submitted to Ap
    • …
    corecore