Stacked Penalized Logistic Regression for Selecting Views in Multi-View Learning
In biomedical research, many different types of patient data can be
collected, such as various types of omics data and medical imaging modalities.
Applying multi-view learning to these different sources of information can
increase the accuracy of medical classification models compared with
single-view procedures. However, collecting biomedical data can be expensive
and/or burdensome for patients, so it is important to reduce the amount of
required data collection. It is therefore necessary to develop multi-view
learning methods which can accurately identify those views that are most
important for prediction. In recent years, several biomedical studies have used
an approach known as multi-view stacking (MVS), where a model is trained on
each view separately and the resulting predictions are combined through
stacking. In these studies, MVS has been shown to increase classification
accuracy. However, the MVS framework can also be used for selecting a subset of
important views. To study the view selection potential of MVS, we develop a
special case called stacked penalized logistic regression (StaPLR). Compared
with existing view-selection methods, StaPLR can make use of faster
optimization algorithms and is easily parallelized. We show that nonnegativity
constraints on the parameters of the function which combines the views play an
important role in preventing unimportant views from entering the model. We
investigate the performance of StaPLR through simulations, and consider two
real data examples. We compare the performance of StaPLR with an existing view
selection method called the group lasso and observe that, in terms of view
selection, StaPLR is often more conservative and has a consistently lower false
positive rate.
Comment: 26 pages, 9 figures. Accepted manuscript.
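The multi-view stacking scheme described above can be sketched in a few lines: view-specific learners produce cross-validated predictions, and a nonnegativity-constrained combiner decides which views enter the final model. The data, penalty settings, and optimizer below are illustrative assumptions, not the authors' StaPLR implementation.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 200
# Two hypothetical views: only the first carries signal.
X1 = rng.normal(size=(n, 10))
X2 = rng.normal(size=(n, 10))
y = (X1[:, 0] + X1[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

# Step 1: a penalized logistic model per view; cross-validated
# predictions keep the meta-learner from rewarding overfit views.
Z = np.column_stack([
    cross_val_predict(LogisticRegression(), X, y, cv=5,
                      method="predict_proba")[:, 1]
    for X in (X1, X2)
])

# Step 2: nonnegative logistic meta-learner over the view-level
# predictions; the bound at zero keeps unimportant views out.
def nll(w):
    p = 1.0 / (1.0 + np.exp(-(Z @ w)))
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

res = minimize(nll, x0=np.array([0.5, 0.5]), bounds=[(0.0, None)] * 2)
print(res.x)
```

With a nonnegative bound, the weight on the pure-noise view is driven toward zero rather than fluctuating around it, which is the view-selection behavior the abstract highlights.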
Fully Bayesian Logistic Regression with Hyper-Lasso Priors for High-dimensional Feature Selection
High-dimensional feature selection arises in many areas of modern science.
For example, in genomic research we want to find the genes that can be used to
separate tissues of different classes (e.g. cancer and normal) from tens of
thousands of genes that are active (expressed) in certain tissue cells. To this
end, we wish to fit regression and classification models with a large number of
features (also called variables, predictors). In the past decade, penalized
likelihood methods for fitting regression models based on hyper-LASSO
penalization have received increasing attention in the literature. However,
fully Bayesian methods that use Markov chain Monte Carlo (MCMC) remain
underdeveloped. In this paper we introduce an MCMC
(fully Bayesian) method for learning severely multi-modal posteriors of
logistic regression models based on hyper-LASSO priors (non-convex penalties).
Our MCMC algorithm uses Hamiltonian Monte Carlo in a restricted Gibbs sampling
framework; we call our method Bayesian logistic regression with hyper-LASSO
(BLRHL) priors. We have used simulation studies and real data analysis to
demonstrate the superior performance of hyper-LASSO priors, and to investigate
the issues of choosing the heaviness and scale of hyper-LASSO priors.
Comment: 33 pages. arXiv admin note: substantial text overlap with arXiv:1308.469
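The effect of a heavy-tailed (hyper-LASSO-like) prior can be illustrated with a deliberately simple sampler. The Student-t prior and the random-walk Metropolis updates below are stand-ins for the paper's hyper-LASSO priors and its HMC-within-restricted-Gibbs scheme; only the qualitative shrinkage behavior is meant to carry over.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 120, 3
X = rng.normal(size=(n, p))
true_beta = np.array([2.0, 0.0, 0.0])  # one active feature, two noise
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(int)

def log_post(beta, scale=1.0, df=1.0):
    # Logistic log-likelihood plus a heavy-tailed Student-t (df=1:
    # Cauchy) log-prior, a simple stand-in for a hyper-LASSO prior.
    eta = X @ beta
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    logprior = np.sum(-0.5 * (df + 1.0) * np.log1p((beta / scale) ** 2 / df))
    return loglik + logprior

# Random-walk Metropolis; the paper's HMC-within-Gibbs scales far
# better in high dimensions, but targets the same kind of posterior.
beta = np.zeros(p)
lp = log_post(beta)
samples = []
for _ in range(5000):
    prop = beta + 0.3 * rng.normal(size=p)
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:
        beta, lp = prop, lp_prop
    samples.append(beta.copy())
post_mean = np.mean(samples[1000:], axis=0)
print(post_mean)
```

The heavy tails let the truly large coefficient escape shrinkage while the noise coefficients concentrate near zero, which is the selective-shrinkage property motivating hyper-LASSO priors.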
Factorial graphical lasso for dynamic networks
Dynamic network models describe a growing number of important scientific
processes, from cell biology and epidemiology to sociology and finance. There
are many aspects of dynamical networks that require statistical considerations.
In this paper we focus on determining network structure. Estimating dynamic
networks is a difficult task since the number of components involved in the
system is very large. As a result, the number of parameters to be estimated
exceeds the number of observations. However, a characteristic of many
networks is that they are sparse. For example, the molecular structure of
genes makes their interactions with other components a highly structured,
and therefore sparse, process.
Penalized Gaussian graphical models have been used to estimate sparse
networks. However, the literature has focussed on static networks, which lack
specific temporal constraints. We propose a structured Gaussian dynamical
graphical model, where structures can consist of specific time dynamics, known
presence or absence of links and block equality constraints on the parameters.
Thus, the number of parameters to be estimated is reduced and the accuracy of
the estimates, including the identification of the network, can be improved. Here,
we show that the constrained optimization problem can be solved by taking
advantage of an efficient solver, logdetPPA, developed in convex optimization.
Moreover, model selection methods for checking the sensitivity of the inferred
networks are described. Finally, synthetic and real data illustrate the
proposed methodologies.
Comment: 30 pages, 5 figures.
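As a baseline for the structured estimator described above, a static penalized Gaussian graphical model can be fit with an off-the-shelf graphical lasso. The chain network below is a made-up example; the paper's factorial model would add time dynamics, known links, and block equality constraints on top of this.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(2)
# Hypothetical sparse truth: a chain network on 5 nodes, encoded in
# the precision matrix (nonzero off-diagonals = edges).
p = 5
prec = np.eye(p)
for i in range(p - 1):
    prec[i, i + 1] = prec[i + 1, i] = 0.4
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(prec), size=500)

# Static penalized Gaussian graphical model with cross-validated
# regularization; sparsity in the estimated precision matrix
# corresponds to absent edges in the network.
model = GraphicalLassoCV().fit(X)
est = model.precision_
edges = (np.abs(est) > 1e-4) & ~np.eye(p, dtype=bool)
print(edges.sum() // 2, "edges selected")
```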
Tree-guided group lasso for multi-response regression with structured sparsity, with an application to eQTL mapping
We consider the problem of estimating a sparse multi-response regression
function, with an application to expression quantitative trait locus (eQTL)
mapping, where the goal is to discover genetic variations that influence
gene-expression levels. In particular, we investigate a shrinkage technique
capable of capturing a given hierarchical structure over the responses, such as
a hierarchical clustering tree with leaf nodes for responses and internal nodes
for clusters of related responses at multiple granularities, and we seek to
leverage this structure to recover covariates relevant to each
hierarchically-defined cluster of responses. We propose a tree-guided group
lasso, or tree lasso, for estimating such structured sparsity under
multi-response regression by employing a novel penalty function constructed
from the tree. We describe a systematic weighting scheme for the overlapping
groups in the tree-penalty such that each regression coefficient is penalized
in a balanced manner despite the inhomogeneous multiplicity of group
memberships of the regression coefficients due to overlaps among groups. For
efficient optimization, we employ a smoothing proximal gradient method that was
originally developed for a general class of structured-sparsity-inducing
penalties. Using simulated and yeast data sets, we demonstrate that our method
shows a superior performance in terms of both prediction errors and recovery of
true sparsity patterns, compared to other methods for learning a
multivariate-response regression.
Comment: Published at http://dx.doi.org/10.1214/12-AOAS549 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
High Dimensional Classification with combined Adaptive Sparse PLS and Logistic Regression
Motivation: The high dimensionality of genomic data calls for the development
of specific classification methodologies, especially to prevent over-optimistic
predictions. This challenge can be tackled by compression and variable
selection, which combined constitute a powerful framework for classification,
as well as data visualization and interpretation. However, currently proposed
combinations lead to unstable and non-convergent methods due to inappropriate
computational frameworks. We hereby propose a stable and convergent approach
for classification in high-dimensional settings based on sparse Partial Least Squares
(sparse PLS). Results: We start by proposing a new solution for the sparse PLS
problem that is based on proximal operators for the case of univariate
responses. Then we develop an adaptive version of the sparse PLS for
classification, which combines iterative optimization of logistic regression
and sparse PLS to ensure convergence and stability. Our results are confirmed
on synthetic and experimental data. In particular we show how crucial
convergence and stability can be when cross-validation is involved for
calibration purposes. Using gene expression data we explore the prediction of
breast cancer relapse. We also propose a multicategorical version of our
method and apply it to the prediction of cell types based on single-cell expression data.
Availability: Our approach is implemented in the plsgenomics R-package.
Comment: 9 pages, 3 figures, 4 tables + Supplementary Materials: 8 pages, 3 figures, 10 tables.
Robust Identification of Target Genes and Outliers in Triple-negative Breast Cancer Data
Correct classification of breast cancer sub-types is of high importance as it
directly affects the therapeutic options. We focus on triple-negative breast
cancer (TNBC) which has the worst prognosis among breast cancer types. Using
cutting edge methods from the field of robust statistics, we analyze Breast
Invasive Carcinoma (BRCA) transcriptomic data publicly available from The
Cancer Genome Atlas (TCGA) data portal. Our analysis identifies statistical
outliers that may correspond to misdiagnosed patients. Furthermore, it is
illustrated that classical statistical methods may fail in the presence of
these outliers, prompting the need for robust statistics. Using robust sparse
logistic regression we obtain 36 relevant genes, of which ca. 60% have been
previously reported as biologically relevant to TNBC, reinforcing the validity
of the method. The remaining 14 genes identified are new potential biomarkers
for TNBC. Out of these, JAM3, SFT2D2 and PAPSS1 were previously associated
with breast tumors or other types of cancer. The relevance of these genes is
confirmed by the new DetectDeviatingCells (DDC) outlier detection technique. A
comparison of gene networks on the selected genes showed significant
differences between TNBC and non-TNBC data. The individual role of FOXA1 in
TNBC and non-TNBC, and the strong FOXA1-AGR2 connection in TNBC, stand out.
Not only will our results contribute to the understanding of breast cancer
and TNBC, and ultimately their management; they also show that robust
regression and outlier detection constitute key strategies for coping with
high-dimensional clinical data such as omics data.
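The outlier-flagging idea can be sketched with a plain L1-penalized logistic fit and deviance residuals. The paper's robust estimator additionally downweights outliers during fitting; here the mislabeling is simulated and the flagging is purely post hoc.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n, p = 150, 30
X = rng.normal(size=(n, p))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
y[:3] = 1 - y[:3]  # simulated mislabeled samples (e.g. misdiagnoses)

# Sparse (L1-penalized) logistic fit: most of the 30 coefficients are
# driven to exact zero, selecting a small gene-like subset.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X, y)
p_hat = np.clip(clf.predict_proba(X)[:, 1], 1e-12, 1 - 1e-12)

# Deviance residuals: observations the model cannot explain stand out
# as candidate outliers.
dev = -2.0 * (y * np.log(p_hat) + (1 - y) * np.log(1.0 - p_hat))
flagged = np.argsort(dev)[-3:]
print(sorted(flagged.tolist()))
```

On average the mislabeled samples carry much larger deviance residuals than the clean ones, which is what makes residual-based screening a useful first pass before a fully robust fit.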