Incorporating biological information into linear models: A Bayesian approach to the selection of pathways and genes
The vast amount of biological knowledge accumulated over the years has
allowed researchers to identify various biochemical interactions and define
different families of pathways. There is an increased interest in identifying
pathways and pathway elements involved in particular biological processes. Drug
discovery efforts, for example, are focused on identifying biomarkers as well
as pathways related to a disease. We propose a Bayesian model that addresses
this question by incorporating information on pathways and gene networks in the
analysis of DNA microarray data. Such information is used to define pathway
summaries, specify prior distributions, and structure the MCMC moves to fit the
model. We illustrate the method with an application to gene expression data
with censored survival outcomes. In addition to identifying markers that would
have been missed otherwise and improving prediction accuracy, the integration
of existing biological knowledge into the analysis provides a better
understanding of underlying molecular processes.
Comment: Published at http://dx.doi.org/10.1214/11-AOAS463 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
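The abstract mentions using network information to "define pathway summaries" without specifying how. One common choice, sketched below, is the leading principal component of a pathway's member genes; the data and gene indices are toy values, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy expression matrix: 20 samples x 6 genes (all values hypothetical).
X = rng.normal(size=(20, 6))
pathway = [0, 2, 3]  # column indices of the genes in one pathway

def pathway_summary(X, gene_idx):
    """First principal component of the pathway's member genes:
    one common way to summarise a gene set as a single score per sample."""
    sub = X[:, gene_idx]
    sub = sub - sub.mean(axis=0)          # centre each gene
    # SVD of the centred submatrix; the leading right-singular
    # vector gives the principal-component loadings.
    _, _, vt = np.linalg.svd(sub, full_matrices=False)
    return sub @ vt[0]

score = pathway_summary(X, pathway)
print(score.shape)  # one summary score per sample -> (20,)
```

Such a score can then enter a regression or survival model as a single predictor in place of all member genes.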
Bayesian classification of tumours by using gene expression data
Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/75678/1/j.1467-9868.2005.00498.x.pd
Multivariate Analysis of Tumour Gene Expression Profiles Applying Regularisation and Bayesian Variable Selection Techniques
High-throughput microarray technology is here to stay, e.g. in oncology for tumour classification
and gene expression profiling to predict cancer pathology and clinical outcome. The global
objective of this thesis is to investigate multivariate methods that are suitable for this task.
After introducing the problem and the biological background, an overview of multivariate
regularisation methods is given in Chapter 3 and the binary classification problem is outlined
(Chapter 4). The focus of applications presented in Chapters 5 to 7 is on sparse binary classifiers
that are both parsimonious and interpretable. Particular emphasis is on sparse penalised
likelihood and Bayesian variable selection models, all in the context of logistic regression. The
thesis concludes with a final discussion chapter.
The variable selection problem is particularly challenging here, since the number of variables
is much larger than the sample size, which results in an ill-conditioned problem with
many equally good solutions. Thus, one open problem is the stability of gene expression profiles.
In a resampling study, various characteristics including stability are compared between a
variety of classifiers applied to five gene expression data sets and validated on two independent
data sets.
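The stability compared in the resampling study can be made concrete as the average overlap of selected gene sets across bootstrap resamples. A toy sketch, with a simple correlation ranker standing in for the thesis's actual classifiers:

```python
import numpy as np

rng = np.random.default_rng(3)

def top_k(X, y, k):
    """Indices of the k features most correlated with y
    (a stand-in for any gene-ranking classifier)."""
    r = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return set(np.argsort(r)[-k:])

n, p, k = 60, 40, 5
X = rng.normal(size=(n, p))
y = (X[:, :3].sum(axis=1) + rng.normal(size=n) > 0).astype(float)

# Stability: average Jaccard overlap of the selected gene sets
# across bootstrap resamples of the samples.
sets = []
for _ in range(30):
    idx = rng.integers(0, n, n)           # bootstrap resample
    sets.append(top_k(X[idx], y[idx], k))
overlaps = [len(a & b) / len(a | b) for i, a in enumerate(sets)
            for b in sets[i + 1:]]
print(round(float(np.mean(overlaps)), 3))  # 1.0 = perfectly stable selection
```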
Bayesian variable selection provides an alternative to resampling for estimating the uncertainty
in the selection of genes. MCMC methods are used for model space exploration, but
because of the high dimensionality standard algorithms are computationally expensive and/or
result in poor Markov chain mixing. A novel MCMC algorithm is presented that uses the
dependence structure between input variables for finding blocks of variables to be updated together.
This drastically improves mixing while keeping the computational burden acceptable.
Several algorithms are compared in a simulation study. In an ovarian cancer application in
Chapter 7, the best-performing MCMC algorithms are combined with parallel tempering and
compared with an alternative method.
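The block-update idea — using the dependence structure among inputs to pick groups of variables to update jointly — can be illustrated with a simple correlation-threshold rule. The thesis's algorithm is more elaborate; this is only a sketch on toy data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy design: variables 0-2 are near-copies of a common factor,
# variables 3-5 are independent noise.
z = rng.normal(size=100)
X = np.column_stack(
    [z + 0.1 * rng.normal(size=100) for _ in range(3)]
    + [rng.normal(size=100) for _ in range(3)])

def correlated_block(X, seed, threshold=0.8):
    """Variables whose absolute correlation with variable `seed`
    exceeds `threshold`: a candidate block whose inclusion
    indicators a sampler would propose updating together."""
    corr = np.corrcoef(X, rowvar=False)
    return np.flatnonzero(np.abs(corr[seed]) > threshold)

print(correlated_block(X, seed=0))  # -> [0 1 2]
```

Updating such a block jointly lets the chain swap one highly correlated variable for another in a single move, which is exactly the transition that single-site updates struggle to make.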
Joint Bayesian variable and graph selection for regression models with network-structured predictors
In this work, we develop a Bayesian approach to perform selection of predictors that are linked within a network. We achieve this by combining a sparse regression model relating the predictors to a response variable with a graphical model describing conditional dependencies among the predictors. The proposed method is well-suited for genomic applications because it allows the identification of pathways of functionally related genes or proteins that impact an outcome of interest. In contrast to previous approaches for network-guided variable selection, we infer the network among predictors using a Gaussian graphical model and do not assume that network information is available a priori. We demonstrate that our method outperforms existing methods in identifying network-structured predictors in simulation settings and illustrate our proposed model with an application to inference of proteins relevant to glioblastoma survival.
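A standard way to let a network steer variable selection is a Markov random field (Ising-type) prior on the inclusion indicators, which rewards selecting predictors that are neighbours in the graph. The sketch below is illustrative and not the paper's exact formulation:

```python
import numpy as np

def log_mrf_prior(gamma, G, a=-2.0, b=0.5):
    """Log of the unnormalised Markov random field prior
    p(gamma) ∝ exp(a * sum(gamma) + b * gamma' G gamma).
    a < 0 encodes sparsity; b > 0 favours selecting neighbours in G.
    (Both values are illustrative, not taken from the paper.)"""
    gamma = np.asarray(gamma, dtype=float)
    return a * gamma.sum() + b * gamma @ G @ gamma

# Three predictors; adjacency matrix with an edge between 0 and 1 only.
G = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]])

# Selecting the linked pair scores higher than an unlinked pair.
print(log_mrf_prior([1, 1, 0], G))  # -4 + 0.5*2 = -3.0
print(log_mrf_prior([1, 0, 1], G))  # -4 + 0     = -4.0
```

In the paper's joint framework, G itself is not fixed but inferred from the data through a Gaussian graphical model.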
Variable selection for BART: An application to gene regulation
We consider the task of discovering gene regulatory networks, which are
defined as sets of genes and the corresponding transcription factors which
regulate their expression levels. This can be viewed as a variable selection
problem, potentially with high dimensionality. Variable selection is especially
challenging in high-dimensional settings, where it is difficult to detect
subtle individual effects and interactions between predictors. Bayesian
Additive Regression Trees [BART, Ann. Appl. Stat. 4 (2010) 266-298] provides a
novel nonparametric alternative to parametric regression approaches, such as
the lasso or stepwise regression, especially when the number of relevant
predictors is sparse relative to the total number of available predictors and
the fundamental relationships are nonlinear. We develop a principled
permutation-based inferential approach for determining when the effect of a
selected predictor is likely to be real. Going further, we adapt the BART
procedure to incorporate informed prior information about variable importance.
We present simulations demonstrating that our method compares favorably to
existing parametric and nonparametric procedures in a variety of data settings.
To demonstrate the potential of our approach in a biological context, we apply
it to the task of inferring the gene regulatory network in yeast (Saccharomyces
cerevisiae). We find that our BART-based procedure is best able to recover the
subset of covariates with the largest signal compared to other variable
selection methods. The methods developed in this work are readily available in
the R package bartMachine.
Comment: Published at http://dx.doi.org/10.1214/14-AOAS755 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
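The permutation-based inference described above can be sketched generically: recompute variable importances under permuted responses to build a null distribution, and flag a predictor only when its observed importance exceeds that null. Here a simple squared-correlation score stands in for BART's variable-inclusion proportions:

```python
import numpy as np

rng = np.random.default_rng(2)

def importance(X, y):
    """Stand-in importance score: squared correlation with y.
    (The BART procedure would use posterior inclusion proportions.)"""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(0) / (
        np.sqrt((Xc**2).sum(0)) * np.sqrt((yc**2).sum()))
    return r**2

n, p = 200, 5
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] + rng.normal(size=n)   # only variable 0 matters

obs = importance(X, y)
# Permutation null: shuffling y destroys any real X-y association,
# so re-computed importances estimate the noise distribution.
null = np.array([importance(X, rng.permutation(y)) for _ in range(200)])
pvals = (null >= obs).mean(axis=0)
print(pvals[0])  # essentially 0: no permutation matches the real signal
```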
Knowledge-Guided Bayesian Support Vector Machine Methods For High-Dimensional Data
Support vector machines (SVM) is a popular classification method for analysis of high dimensional data such as genomics data. Recently, new SVM methods have been developed to achieve variable selection through either frequentist regularization or Bayesian shrinkage. The Bayesian framework provides a probabilistic interpretation for SVM and allows direct uncertainty quantification. In this dissertation, we develop four knowledge-guided SVM methods for the analysis of high dimensional data.
In Chapter 1, I first review the theory of SVM and existing methods for incorporating prior knowledge, represented by graphs, into SVM. Second, I review the terminology of variable selection and the limitations of existing methods for SVM variable selection. Last, I introduce some Bayesian variable selection techniques as well as Markov chain Monte Carlo (MCMC) algorithms.
In Chapter 2, we develop a new Bayesian SVM method that enables variable selection guided by structural information among predictors, e.g., biological pathways among genes. This method uses a spike and slab prior for feature selection combined with an Ising prior for incorporating structural information. The performance of the proposed method is evaluated in comparison with existing SVM methods in terms of prediction and feature selection in extensive simulations. Furthermore, the proposed method is illustrated in an analysis of genomic data from a cancer study, demonstrating its advantage in generating biologically meaningful results and identifying potentially important features.
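The spike-and-slab prior at the core of this chapter mixes a narrow "spike" normal and a wide "slab" normal for each coefficient, according to its inclusion indicator. A minimal sketch (the spike and slab scales are illustrative choices, not those of the dissertation):

```python
import numpy as np

def spike_slab_logpdf(beta, gamma, spike_sd=0.01, slab_sd=1.0):
    """Log-density of a spike-and-slab prior: each coefficient is
    N(0, spike_sd^2) when its indicator gamma is 0 (effectively zero)
    and N(0, slab_sd^2) when gamma is 1 (free to be large)."""
    sd = np.where(np.asarray(gamma) == 1, slab_sd, spike_sd)
    return float(np.sum(-0.5 * np.log(2 * np.pi * sd**2)
                        - np.asarray(beta)**2 / (2 * sd**2)))

# A sizeable coefficient is far more plausible under the slab (gamma=1)
# than under the spike (gamma=0) -- the contrast that drives selection.
beta = np.array([1.2])
print(spike_slab_logpdf(beta, [1]) > spike_slab_logpdf(beta, [0]))  # -> True
```

In a Gibbs sampler, exactly this likelihood ratio (combined with the Ising prior over indicators) determines the posterior odds of including each feature.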
The model developed in Chapter 2 might suffer from the issue of phase transition (Li and Zhang, 2010) when the number of variables becomes extremely large. In Chapter 3, we propose another Bayesian SVM method that assigns an adaptive structured shrinkage prior to the coefficients; the graph information is incorporated via the hyper-priors imposed on the precision matrix of the log-transformed shrinkage parameters. This method is shown to outperform the method in Chapter 2 in both simulations and real data analysis.
In Chapter 4, to relax the linearity assumption in Chapters 2 and 3, we develop a novel knowledge-guided Bayesian non-linear SVM. The proposed method uses a diagonal matrix whose entries are one for selected features and zero for unselected features, combined with the Ising prior to perform feature selection. The performance of our method is evaluated and compared with several penalized linear SVM methods and the standard kernel SVM method in terms of prediction and feature selection in extensive simulation settings. Also, analyses of genomic data from a cancer study show that our method yields a more accurate prediction model for patient survival and reveals biologically more meaningful results than the existing methods.
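The diagonal selection matrix enters the non-linear model by zeroing unselected features before the kernel is evaluated. A minimal RBF sketch of that mechanism (toy data; not the chapter's exact model):

```python
import numpy as np

def rbf_kernel(X, d, gamma=1.0):
    """RBF kernel evaluated on D x, where D = diag(d) and d is a 0/1
    selection vector: unselected features are zeroed out before the
    kernel is computed, so they cannot influence the classifier."""
    Xs = X * d  # broadcasting zeroes out the unselected columns
    sq = ((Xs[:, None, :] - Xs[None, :, :])**2).sum(-1)
    return np.exp(-gamma * sq)

X = np.array([[1.0,  5.0],
              [1.0, -5.0]])
# With feature 2 switched off the two points coincide, so K[0, 1] = 1;
# with both features on they are far apart and K[0, 1] is near 0.
print(rbf_kernel(X, np.array([1.0, 0.0]))[0, 1])  # -> 1.0
```

Placing the Ising prior on d then couples this kernel-level selection to the graph structure among features.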
In Chapter 5, we extend the work of Chapter 4 and use a joint model to identify the relevant features and learn the structural information among them simultaneously. This model does not require that the structural information among the predictors is known, which is more powerful when the prior knowledge about pathways is limited or inaccurate. We demonstrate that our method outperforms the method developed in Chapter 4 when the prior knowledge is partially true or inaccurate in simulations, and illustrate our proposed model with an application to a glioblastoma data set.
In Chapter 6, we propose some future work, including extending our methods to more general types of outcomes, such as categorical or continuous variables.
- …