Integrating biological knowledge into variable selection: an empirical Bayes approach with an application in cancer biology
Background:
An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary information is increasingly available, it is not always clear how it should be used nor how it should be weighted in relation to primary data.
Results:
We put forward an approach in which biological knowledge is incorporated using informative prior distributions over variable subsets, with prior information selected and weighted in an automated, objective manner using an empirical Bayes formulation. We employ continuous, linear models with interaction terms and exploit biochemically-motivated sparsity constraints to permit exact inference. We show an example of priors for pathway- and network-based information and illustrate our proposed method on both synthetic response data and cancer drug response data. Comparisons are also made to alternative Bayesian and frequentist penalised-likelihood methods for incorporating network-based information.
Conclusions:
The empirical Bayes method proposed here can aid prior elicitation for Bayesian variable selection studies and help to guard against mis-specification of priors. Empirical Bayes, together with the proposed pathway-based priors, results in an approach with competitive variable selection performance. In addition, the overall procedure is fast, deterministic, and has very few user-set parameters, yet is capable of capturing interplay between molecular players. The approach presented is general and readily applicable in any setting with multiple sources of biological prior knowledge.
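To make the idea of weighting prior knowledge by empirical Bayes concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes a Zellner g-prior marginal likelihood, a hypothetical prior over subsets proportional to exp(strength × number of selected variables that lie in a known pathway), and exact enumeration of all subsets up to a small maximum size. The names `in_pathway`, `strengths`, and `max_size` are illustrative only.

```python
import itertools
import numpy as np

def log_marginal(X, y, subset, g):
    """Log marginal likelihood (up to a constant) under a g-prior; an assumed form."""
    n = len(y)
    fit = 0.0
    if subset:
        Xs = X[:, list(subset)]
        coef, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        fit = y @ (Xs @ coef)  # y' P y, with P the projection onto the columns of Xs
    return -0.5 * len(subset) * np.log1p(g) - 0.5 * n * np.log(y @ y - g / (1 + g) * fit)

def logsumexp(a):
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

def empirical_bayes_selection(X, y, in_pathway, max_size, strengths):
    n, p = X.shape
    # The sparsity constraint (at most max_size variables) keeps enumeration exact.
    subsets = [s for k in range(max_size + 1)
               for s in itertools.combinations(range(p), k)]
    log_lik = np.array([log_marginal(X, y, s, g=n) for s in subsets])
    score = np.array([sum(in_pathway[j] for j in s) for s in subsets], dtype=float)
    best = None
    for lam in strengths:
        log_prior = lam * score
        log_prior -= logsumexp(log_prior)        # normalise the prior over subsets
        evidence = logsumexp(log_lik + log_prior)
        if best is None or evidence > best[0]:   # empirical Bayes: maximise the evidence
            best = (evidence, lam, log_lik + log_prior - evidence)
    _, lam_hat, log_post = best
    post = np.exp(log_post)
    inclusion = np.array([sum(post[i] for i, s in enumerate(subsets) if j in s)
                          for j in range(p)])
    return lam_hat, inclusion

# Synthetic check: variables 0 and 1 drive the response and lie in the pathway.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=100)
in_pathway = [1, 1, 1, 0, 0, 0, 0, 0]
strengths = np.linspace(0.0, 3.0, 7)
lam_hat, inclusion = empirical_bayes_selection(X, y, in_pathway, 3, strengths)
```

The prior strength is chosen from a grid by maximising the evidence, so the data decide how much weight the pathway information receives; a strength of zero recovers a flat prior over the enumerated subsets.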
Bayesian Model Selection in Complex Linear Systems, as Illustrated in Genetic Association Studies
Motivated by examples from genetic association studies, this paper considers
the model selection problem in a general complex linear model system and in a
Bayesian framework. We discuss formulating model selection problems and
incorporating context-dependent a priori information through different
levels of prior specifications. We also derive analytic Bayes factors and their
approximations to facilitate model selection and discuss their theoretical and
computational properties. We demonstrate our Bayesian approach based on an
implemented Markov Chain Monte Carlo (MCMC) algorithm in simulations and a real
data application of mapping tissue-specific eQTLs. Our novel results on Bayes
factors provide a general framework to perform efficient model comparisons in
complex linear model systems.
One-step estimator paths for concave regularization
The statistics literature of the past 15 years has established many favorable
properties for sparse diminishing-bias regularization: techniques which can
roughly be understood as providing estimation under penalty functions spanning
the range of concavity between $L_0$ and $L_1$ norms. However, lasso
$L_1$-regularized estimation remains the standard tool for industrial 'Big
Data' applications because of its minimal computational cost and the presence
of easy-to-apply rules for penalty selection. In response, this article
proposes a simple new algorithm framework that requires no more computation
than a lasso path: the path of one-step estimators (POSE) does penalized
regression estimation on a grid of decreasing penalties, but adapts
coefficient-specific weights to decrease as a function of the coefficient
estimated in the previous path step. This provides sparse diminishing-bias
regularization at no extra cost over the fastest lasso algorithms. Moreover,
our 'gamma lasso' implementation of POSE is accompanied by a reliable heuristic
for the fit degrees of freedom, so that standard information criteria can be
applied in penalty selection. We also provide novel results on the distance
between weighted-$L_1$ and $L_0$ penalized predictors; this allows us to build
intuition about POSE and other diminishing-bias regularization schemes. The
methods and results are illustrated in extensive simulations and in application
of logistic regression to evaluating the performance of hockey players.
Comment: Data and code are in the gamlr package for R. Supplemental appendix
is at https://github.com/TaddyLab/pose/raw/master/paper/supplemental.pd
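The path-of-one-step-estimators idea described above can be sketched in a few lines of Python. This is an illustration, not the gamlr code. It assumes squared-error loss, a plain coordinate-descent solver, and the weight rule w_j = 1 / (1 + gamma * |previous coefficient j|), which is one choice consistent with weights that "decrease as a function of the coefficient estimated in the previous path step". The function names are hypothetical.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * max(abs(z) - t, 0.0)

def weighted_lasso(X, y, lam, weights, beta, n_sweeps=200, tol=1e-8):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * sum_j weights[j] * |b_j|."""
    n, p = X.shape
    beta = beta.copy()                      # warm start from the previous path step
    col_scale = (X ** 2).sum(axis=0) / n
    resid = y - X @ beta
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            rho = X[:, j] @ resid / n + col_scale[j] * old
            new = soft_threshold(rho, lam * weights[j]) / col_scale[j]
            if new != old:
                resid -= X[:, j] * (new - old)
                beta[j] = new
                max_change = max(max_change, abs(new - old))
        if max_change < tol:
            break
    return beta

def pose_path(X, y, lambdas, gamma):
    """Fit a grid of decreasing penalties, re-weighting from the previous estimate."""
    beta = np.zeros(X.shape[1])
    path = []
    for lam in lambdas:
        # Larger previous coefficients get smaller weights, hence less shrinkage bias.
        weights = 1.0 / (1.0 + gamma * np.abs(beta))
        beta = weighted_lasso(X, y, lam, weights, beta)
        path.append(beta.copy())
    return np.array(path)

# Synthetic example with two true signals among ten predictors.
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:2] = [3.0, -2.0]
y = X @ true_beta + rng.normal(size=n)
lam_max = np.max(np.abs(X.T @ y)) / n       # smallest penalty that zeroes every coefficient
lambdas = lam_max * np.geomspace(1.0, 0.01, 20)
path = pose_path(X, y, lambdas, gamma=1.0)
```

Setting `gamma=0.0` keeps all weights at one and reduces the sketch to an ordinary lasso path, which makes the cost comparison in the abstract easy to see: each step is a single weighted lasso solve.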
Bayesian Approximate Kernel Regression with Variable Selection
Nonlinear kernel regression models are often used in statistics and machine
learning because they can be more accurate than linear models. Variable selection
for kernel regression models is a challenge partly because, unlike the linear
regression setting, there is no clear concept of an effect size for regression
coefficients. In this paper, we propose a novel framework that provides an
effect size analog of each explanatory variable for Bayesian kernel regression
models when the kernel is shift-invariant (for example, the Gaussian kernel).
We use function analytic properties of shift-invariant reproducing kernel
Hilbert spaces (RKHS) to define a linear vector space that: (i) captures
nonlinear structure, and (ii) can be projected onto the original explanatory
variables. The projection onto the original explanatory variables serves as an
analog of effect sizes. The specific function analytic property we use is that
shift-invariant kernel functions can be approximated via random Fourier bases.
Based on the random Fourier expansion we propose a computationally efficient
class of Bayesian approximate kernel regression (BAKR) models for both
nonlinear regression and binary classification for which one can compute an
analog of effect sizes. We illustrate the utility of BAKR by examining two
important problems in statistical genetics: genomic selection (i.e. phenotypic
prediction) and association mapping (i.e. inference of significant variants or
loci). State-of-the-art methods for genomic selection and association mapping
are based on kernel regression and linear models, respectively. BAKR is the
first method that is competitive in both settings.
Comment: 22 pages, 3 figures, 3 tables; theory added; new simulations
presented; references added
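The random Fourier approximation that the abstract relies on can be illustrated with a short Python sketch. It shows only the kernel approximation step for a Gaussian kernel, not the BAKR model or its effect-size projection. The bandwidth parameterisation k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) and the function names are assumptions of this example.

```python
import numpy as np

def random_fourier_features(X, n_features, bandwidth, rng):
    """Map X (n x p) to features whose inner products approximate a Gaussian kernel."""
    p = X.shape[1]
    # Frequencies are drawn from the kernel's spectral density,
    # which for a Gaussian kernel is itself Gaussian with scale 1 / bandwidth.
    W = rng.normal(scale=1.0 / bandwidth, size=(p, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

def gaussian_kernel(X, bandwidth):
    """Exact Gaussian kernel matrix, used here only to check the approximation."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
Z = random_fourier_features(X, n_features=5000, bandwidth=2.0, rng=rng)
K_exact = gaussian_kernel(X, bandwidth=2.0)
K_approx = Z @ Z.T
max_abs_error = np.max(np.abs(K_exact - K_approx))
```

Because `Z` is an explicit finite-dimensional feature matrix, a kernel regression on `X` can be replaced by a linear regression on `Z`, and the approximation error shrinks as `n_features` grows.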
Bayesian Compressed Regression
As an alternative to variable selection or shrinkage in high dimensional
regression, we propose to randomly compress the predictors prior to analysis.
This dramatically reduces storage and computational bottlenecks, performing
well when the predictors can be projected to a low dimensional linear subspace
with minimal loss of information about the response. As opposed to existing
Bayesian dimensionality reduction approaches, the exact posterior distribution
conditional on the compressed data is available analytically, speeding up
computation by many orders of magnitude while also bypassing robustness issues
due to convergence and mixing problems with MCMC. Model averaging is used to
reduce sensitivity to the random projection matrix, while accommodating
uncertainty in the subspace dimension. Strong theoretical support is provided
for the approach by showing near parametric convergence rates for the
predictive density in the large p small n asymptotic paradigm. Practical
performance relative to competitors is illustrated in simulations and real data
applications.
Comment: 29 pages, 4 figures
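A minimal Python sketch of the compress-then-regress idea follows. It is not the authors' procedure. It assumes a Gaussian random projection matrix, a Gaussian prior on the compressed coefficients, and a known noise variance, so that the posterior is available in closed form without MCMC. Projections are averaged with equal weights here as a simplification, whereas the abstract refers to model averaging; all function names are hypothetical.

```python
import numpy as np

def compressed_posterior(X, y, m, rng, prior_var=1.0, noise_var=1.0):
    """Project p predictors down to m and return the exact Gaussian posterior."""
    n, p = X.shape
    Phi = rng.normal(size=(m, p)) / np.sqrt(p)   # random projection (assumed Gaussian)
    Z = X @ Phi.T                                # compressed predictors, n x m
    # Conjugate update for coefficients on Z with prior N(0, prior_var * I).
    precision = Z.T @ Z / noise_var + np.eye(m) / prior_var
    cov = np.linalg.inv(precision)
    mean = cov @ Z.T @ y / noise_var
    return Phi, mean, cov

def predict(X_new, Phi, mean):
    """Posterior mean prediction: compress new predictors, then apply the coefficients."""
    return X_new @ Phi.T @ mean

# Large p, small n example.
rng = np.random.default_rng(1)
n, p, m = 100, 2000, 20
X = rng.normal(size=(n, p))
y = X[:, :5].sum(axis=1) + rng.normal(size=n)

Phi, mean, cov = compressed_posterior(X, y, m, rng)

# Average predictions over several independent projections (equal weights, an assumption)
# to reduce sensitivity to any single random matrix.
preds = np.mean(
    [predict(X, *compressed_posterior(X, y, m, rng)[:2]) for _ in range(10)],
    axis=0,
)
```

Only an m x m system is solved per projection, which is where the storage and computational savings come from when p is much larger than m.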