4,013 research outputs found
Extreme Entropy Machines: Robust information theoretic classification
Most of the existing classification methods are aimed at minimization of
empirical risk (through some simple point-based error measured with loss
function) with added regularization. We propose to approach this problem in a
more information theoretic way by investigating applicability of entropy
measures as a classification model objective function. We focus on quadratic
Renyi's entropy and connected Cauchy-Schwarz Divergence which leads to the
construction of Extreme Entropy Machines (EEM).
The main contribution of this paper is proposing a model based on the
information theoretic concepts which on the one hand shows new, entropic
perspective on known linear classifiers and on the other leads to a
construction of very robust method competetitive with the state of the art
non-information theoretic ones (including Support Vector Machines and Extreme
Learning Machines).
Evaluation on numerous problems spanning from small, simple ones from UCI
repository to the large (hundreads of thousands of samples) extremely
unbalanced (up to 100:1 classes' ratios) datasets shows wide applicability of
the EEM in real life problems and that it scales well
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values
This work is motivated by the needs of predictive analytics on healthcare
data as represented by Electronic Medical Records. Such data is invariably
problematic: noisy, with missing entries, with imbalance in classes of
interests, leading to serious bias in predictive modeling. Since standard data
mining methods often produce poor performance measures, we argue for
development of specialized techniques of data-preprocessing and classification.
In this paper, we propose a new method to simultaneously classify large
datasets and reduce the effects of missing values. It is based on a multilevel
framework of the cost-sensitive SVM and the expected maximization imputation
method for missing values, which relies on iterated regression analyses. We
compare classification results of multilevel SVM-based algorithms on public
benchmark datasets with imbalanced classes and missing values as well as real
data in health applications, and show that our multilevel SVM-based method
produces fast, and more accurate and robust classification results.Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
Support Vector Machine Classification on a Biased Training Set: Multi-Jet Background Rejection at Hadron Colliders
This paper describes an innovative way to optimize a multivariate classifier,
in particular a Support Vector Machine algorithm, on a problem characterized by
a biased training sample. This is possible thanks to the feedback of a
signal-background template fit performed on a validation sample and included
both in the optimization process and in the input variable selection. The
procedure is applied to a real case of interest at hadron collider experiments:
the reduction and the estimate of the multi-jet background in the
plus jets data sample collected by the CDF experiment. The training samples,
partially derived from data and partially from simulation, are described in
detail together with the input variables exploited for the classification. At
present, the reached performance is superior to any other prescription applied
to the same final state at hadron collider experiments.Comment: 24 pages, 8 figures, preprint of NIM pape
- …