47,227 research outputs found
High Dimensional Classification with combined Adaptive Sparse PLS and Logistic Regression
Motivation: The high dimensionality of genomic data calls for the development
of specific classification methodologies, especially to prevent over-optimistic
predictions. This challenge can be tackled by compression and variable
selection, which combined constitute a powerful framework for classification,
as well as data visualization and interpretation. However, current proposed
combinations lead to instable and non convergent methods due to inappropriate
computational frameworks. We hereby propose a stable and convergent approach
for classification in high dimensional based on sparse Partial Least Squares
(sparse PLS). Results: We start by proposing a new solution for the sparse PLS
problem that is based on proximal operators for the case of univariate
responses. Then we develop an adaptive version of the sparse PLS for
classification, which combines iterative optimization of logistic regression
and sparse PLS to ensure convergence and stability. Our results are confirmed
on synthetic and experimental data. In particular we show how crucial
convergence and stability can be when cross-validation is involved for
calibration purposes. Using gene expression data we explore the prediction of
breast cancer relapse. We also propose a multicategorial version of our method
on the prediction of cell-types based on single-cell expression data.
Availability: Our approach is implemented in the plsgenomics R-package.Comment: 9 pages, 3 figures, 4 tables + Supplementary Materials 8 pages, 3
figures, 10 table
Are screening methods useful in feature selection? An empirical study
Filter or screening methods are often used as a preprocessing step for
reducing the number of variables used by a learning algorithm in obtaining a
classification or regression model. While there are many such filter methods,
there is a need for an objective evaluation of these methods. Such an
evaluation is needed to compare them with each other and also to answer whether
they are at all useful, or a learning algorithm could do a better job without
them. For this purpose, many popular screening methods are partnered in this
paper with three regression learners and five classification learners and
evaluated on ten real datasets to obtain accuracy criteria such as R-square and
area under the ROC curve (AUC). The obtained results are compared through curve
plots and comparison tables in order to find out whether screening methods help
improve the performance of learning algorithms and how they fare with each
other. Our findings revealed that the screening methods were useful in
improving the prediction of the best learner on two regression and two
classification datasets out of the ten datasets evaluated.Comment: 29 pages, 4 figures, 21 table
On the combination of omics data for prediction of binary outcomes
Enrichment of predictive models with new biomolecular markers is an important
task in high-dimensional omic applications. Increasingly, clinical studies
include several sets of such omics markers available for each patient,
measuring different levels of biological variation. As a result, one of the
main challenges in predictive research is the integration of different sources
of omic biomarkers for the prediction of health traits. We review several
approaches for the combination of omic markers in the context of binary outcome
prediction, all based on double cross-validation and regularized regression
models. We evaluate their performance in terms of calibration and
discrimination and we compare their performance with respect to single-omic
source predictions. We illustrate the methods through the analysis of two real
datasets. On the one hand, we consider the combination of two fractions of
proteomic mass spectrometry for the calibration of a diagnostic rule for the
detection of early-stage breast cancer. On the other hand, we consider
transcriptomics and metabolomics as predictors of obesity using data from the
Dietary, Lifestyle, and Genetic determinants of Obesity and Metabolic syndrome
(DILGOM) study, a population-based cohort, from Finland
Support vector machine for functional data classification
In many applications, input data are sampled functions taking their values in
infinite dimensional spaces rather than standard vectors. This fact has complex
consequences on data analysis algorithms that motivate modifications of them.
In fact most of the traditional data analysis tools for regression,
classification and clustering have been adapted to functional inputs under the
general name of functional Data Analysis (FDA). In this paper, we investigate
the use of Support Vector Machines (SVMs) for functional data analysis and we
focus on the problem of curves discrimination. SVMs are large margin classifier
tools based on implicit non linear mappings of the considered data into high
dimensional spaces thanks to kernels. We show how to define simple kernels that
take into account the unctional nature of the data and lead to consistent
classification. Experiments conducted on real world data emphasize the benefit
of taking into account some functional aspects of the problems.Comment: 13 page
- …