Autoencoders and Generative Adversarial Networks for Imbalanced Sequence Classification
Generative Adversarial Networks (GANs) have been used in many different
applications to generate realistic synthetic data. We introduce a novel GAN
with Autoencoder (GAN-AE) architecture to generate synthetic samples for
variable length, multi-feature sequence datasets. In this model, we develop a
GAN architecture with an additional autoencoder component, where recurrent
neural networks (RNNs) are used for each component of the model in order to
generate synthetic data to improve classification accuracy for a highly
imbalanced medical device dataset. In addition to the medical device dataset,
we also evaluate the GAN-AE performance on two additional datasets and
demonstrate the application of GAN-AE to a sequence-to-sequence task where both
synthetic sequence inputs and sequence outputs must be generated. To evaluate
the quality of the synthetic data, we train encoder-decoder models both with
and without the synthetic data and compare the classification model
performance. We show that a model trained with GAN-AE generated synthetic data
outperforms models trained with synthetic data generated with standard
oversampling techniques such as SMOTE and autoencoders, as well as with
state-of-the-art GAN-based models.
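For readers unfamiliar with the SMOTE baseline mentioned in this abstract, the following is a minimal sketch of SMOTE-style interpolation, not the paper's GAN-AE method. It assumes fixed-length feature vectors (not the variable-length sequences the paper targets); the function name `smote_like`, its parameters, and the toy data are hypothetical and chosen only for illustration.

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like(minority, n_new=6)
```

Because each synthetic point lies on a line segment between two real minority samples, this kind of interpolation does not carry over directly to sequences of differing lengths, which is the setting the abstract describes.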
Multi-label Class-imbalanced Action Recognition in Hockey Videos via 3D Convolutional Neural Networks
Automatic analysis of video is one of the most complex problems in the fields
of computer vision and machine learning. A significant part of this research
deals with (human) activity recognition (HAR) since humans, and the activities
that they perform, generate most of the video semantics. Video-based HAR has
applications in various domains, but one of the most important and challenging
is HAR in sports videos. Some of the major issues include high inter- and
intra-class variations, large class imbalance, the presence of both group
actions and single player actions, and recognizing simultaneous actions, i.e.,
the multi-label learning problem. Keeping in mind these challenges and the
recent success of CNNs in solving various computer vision problems, in this
work, we implement a 3D CNN based multi-label deep HAR system for multi-label
class-imbalanced action recognition in hockey videos. We test our system for
two different scenarios: an ensemble of binary networks vs. a single
multi-output network, on a publicly available dataset. We also compare our
results with the system that was originally designed for the chosen dataset.
Experimental results show that the proposed approach performs better than the
existing solution. Comment: Accepted to IEEE/ACIS SNPD 2018, 6 pages, 3 figures
Imbalanced Ensemble Classifier for learning from imbalanced business school data set
Private business schools in India face a common problem of selecting quality
students for their MBA programs to achieve the desired placement percentage.
Generally, such data sets are biased towards one class, i.e., imbalanced in
nature, and learning from an imbalanced dataset is difficult.
This paper proposes an imbalanced ensemble classifier that handles the
imbalanced nature of the dataset and achieves higher accuracy on the combined
feature selection (selection of important characteristics of students) and
classification (prediction of placements based on the students'
characteristics) problem for an Indian business school dataset. The optimal
value of an important model parameter is found. Numerical evidence is also
provided using the Indian business school dataset to demonstrate the
outstanding performance of the proposed classifier.
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.
The Synthetic-Oversampling Method: Using Photometric Colors to Discover Extremely Metal-Poor Stars
Extremely metal-poor (EMP) stars ([Fe/H] < -3.0 dex) provide a unique window
into understanding the first generation of stars and early chemical enrichment
of the Universe. EMP stars are exceptionally rare, however, and the relatively
small number of confirmed discoveries limits our ability to exploit these
near-field probes of the first ~500 Myr after the Big Bang. Here, a new method
to photometrically estimate [Fe/H] from only broadband photometric colors is
presented. I show that the method, which utilizes machine-learning algorithms
and a training set of ~170,000 stars with spectroscopically measured [Fe/H],
produces a typical scatter of ~0.29 dex. This performance is similar to what is
achievable via low-resolution spectroscopy, and outperforms other photometric
techniques, while also being more general. I further show that a slight
alteration to the model, wherein synthetic EMP stars are added to the training
set, yields the robust identification of EMP candidates. In particular, this
synthetic-oversampling method recovers ~20% of the EMP stars in the training
set, at a precision of ~0.05. Furthermore, ~65% of the false positives from the
model are very metal-poor stars ([Fe/H] < -2.0 dex). The synthetic-oversampling
method is biased towards the discovery of warm (~F-type) stars, a consequence
of the targeting bias from the SDSS/SEGUE survey. This EMP selection method
represents a significant improvement over alternative broadband optical
selection techniques. The models are applied to >12 million stars, with an
expected yield of ~600 new EMP stars, which promises to open new avenues for
exploring the early universe. Comment: 15 pages, 7 figures, to be submitted to Ap
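The recovery rate (~20%) and precision (~0.05) quoted in this abstract are the usual recall and precision of a classifier. As a rough illustration only, the sketch below computes both from confusion counts; the counts are hypothetical numbers chosen to reproduce those two ratios and are not taken from the paper.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp)  # fraction of flagged candidates that are real
    recall = tp / (tp + fn)     # fraction of real objects that are flagged
    return precision, recall

# Hypothetical counts: 100 true EMP stars, 20 of them recovered,
# and 380 non-EMP stars wrongly flagged as candidates.
precision, recall = precision_recall(tp=20, fp=380, fn=80)
```

With these assumed counts, one in twenty candidates is a genuine EMP star, which is why the abstract's note that most false positives are still very metal-poor matters for follow-up efficiency.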