
    Node-Structured Integrative Gaussian Graphical Model Guided by Pathway Information

    To date, many biological pathways related to cancer have been extensively applied, thanks to the outputs of burgeoning biomedical research. This leads to a new technical challenge of exploring and validating biological pathways that can characterize transcriptomic mechanisms across different disease subtypes. To accommodate multiple studies, the joint Gaussian graphical model was previously proposed to incorporate nonzero edge effects. However, this model inevitably depends on post hoc analysis to confirm biological significance. To circumvent this drawback, we attempt not only to combine transcriptomic data but also to embed pathway information, itself well-ascertained biological evidence, into the model. To this end, we propose a novel statistical framework for fitting a joint Gaussian graphical model simultaneously with informative pathways consistently expressed across multiple studies. In theory, structured nodes can be prespecified with multiple genes. The optimization rule employs the structured input-output lasso model in order to estimate a sparse precision matrix constructed from the simultaneous effects of multiple studies and structured nodes. In an application to breast cancer data sets, we found that the proposed model is superior in efficiently capturing structures of biological evidence (e.g., pathways). An R software package, nsiGGM, is publicly available at the author's webpage.
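The abstract above relies on the link between a sparse precision matrix and a graph: in a Gaussian graphical model, a zero off-diagonal precision entry means the two variables are conditionally independent, so no edge is drawn. The following is a minimal sketch of that reading step only, not of the nsiGGM estimator itself; the function names and the 3-gene matrix `theta` are hypothetical and chosen for illustration.

```python
import math


def partial_correlations(precision):
    """Convert a precision matrix (list of lists) into partial correlations.

    Uses the standard identity rho_ij = -omega_ij / sqrt(omega_ii * omega_jj).
    """
    p = len(precision)
    rho = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i == j:
                rho[i][j] = 1.0
            else:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                rho[i][j] = -precision[i][j] / scale
    return rho


def edges_from_precision(precision, tol=1e-8):
    """Return the edge set: pairs (i, j), i < j, with a nonzero precision entry."""
    p = len(precision)
    return {
        (i, j)
        for i in range(p)
        for j in range(i + 1, p)
        if abs(precision[i][j]) > tol
    }


# Hypothetical precision matrix for three genes (illustrative values only).
# The zero in position (0, 2) encodes conditional independence of genes 0 and 2.
theta = [
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
]

print(edges_from_precision(theta))
print(partial_correlations(theta)[0][1])
```

In this toy case the graph is the chain 0-1-2. A sparse estimator would typically produce the zeros through a penalty; here they are simply written in by hand.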

    Controlling the Precision-Recall Tradeoff in Differential Dependency Network Analysis

    Graphical models have gained a lot of attention recently as a tool for learning and representing dependencies among variables in multivariate data. Often, domain scientists are looking specifically for differences among the dependency networks of different conditions or populations (e.g. differences between regulatory networks of different species, or differences between dependency networks of diseased versus healthy populations). The standard method for finding these differences is to learn the dependency networks for each condition independently and compare them. We show that this approach is prone to high false discovery rates (low precision) that can render the analysis useless. We then show that by imposing a bias towards learning similar dependency networks for each condition the false discovery rates can be reduced to acceptable levels, at the cost of finding a reduced number of differences. Algorithms developed in the transfer learning literature can be used to vary the strength of the imposed similarity bias and provide a natural mechanism to smoothly adjust this differential precision-recall tradeoff to cater to the requirements of the analysis conducted. We present real case studies (oncological and neurological) where domain experts use the proposed technique to extract useful differential networks that shed light on the biological processes involved in cancer and brain function
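The precision-recall tradeoff described above can be made concrete with a small sketch. It assumes each network is represented as a set of undirected edges stored as sorted tuples, takes the differential network to be the edges present in exactly one condition, and scores a predicted differential network against a known one. The function names and the toy edge sets are hypothetical; the abstract does not specify an implementation.

```python
def differential_edges(net_a, net_b):
    """Edges present in exactly one of the two networks (symmetric difference)."""
    return set(net_a) ^ set(net_b)


def precision_recall(predicted, truth):
    """Precision and recall of predicted differential edges against the truth.

    Convention (an assumption of this sketch): an empty prediction or an
    empty truth set yields 1.0 for the corresponding ratio.
    """
    predicted, truth = set(predicted), set(truth)
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall


# Toy ground-truth networks for two conditions (illustrative only).
true_a = {(0, 1), (1, 2)}
true_b = {(0, 1), (2, 3)}

# Networks learned independently per condition, each with one spurious edge.
est_a = {(0, 1), (1, 2), (0, 3)}
est_b = {(0, 1), (2, 3), (1, 3)}

truth = differential_edges(true_a, true_b)
predicted = differential_edges(est_a, est_b)
print(precision_recall(predicted, truth))
```

Each spurious edge in an independently learned network shows up as a false differential edge, which is why independent estimation tends to lower precision. In this toy case, half of the predicted differences are spurious.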

    Joint Bayesian variable and graph selection for regression models with network-structured predictors

    In this work, we develop a Bayesian approach to perform selection of predictors that are linked within a network. We achieve this by combining a sparse regression model relating the predictors to a response variable with a graphical model describing conditional dependencies among the predictors. The proposed method is well-suited for genomic applications because it allows the identification of pathways of functionally related genes or proteins that impact an outcome of interest. In contrast to previous approaches for network-guided variable selection, we infer the network among predictors using a Gaussian graphical model and do not assume that network information is available a priori. We demonstrate that our method outperforms existing methods in identifying network-structured predictors in simulation settings and illustrate our proposed model with an application to inference of proteins relevant to glioblastoma survival.

    Assisted Network Analysis in Cancer Genomics

    Cancer is a molecular disease. In the past two decades, we have witnessed a surge of high-throughput profiling in cancer research and corresponding development of high-dimensional statistical techniques. In this dissertation, the focus is on gene expression, which has played a uniquely important role in cancer research. Compared to some other types of molecular measurements, for example DNA changes, gene expressions are “closer” to cancer outcomes. In addition, processed gene expression data have good statistical properties, in particular, continuity. In the “early” cancer gene expression data analysis, attention has been on marginal properties such as mean and variance. Genes function in a coordinated way. As such, techniques that take a system perspective have been developed to also take into account the interconnections among genes. Among such techniques, graphical models, with lucid biological interpretations and satisfactory statistical properties, have attracted special attention. Graphical model-based analysis can not only lead to a deeper understanding of genes’ properties but also serve as a basis for other analyses, for example, regression and clustering. Cancer molecular studies usually have limited sizes. In the graphical model-based analysis, the number of parameters to be estimated gets squared. Combined together, they lead to a serious lack of information. The overarching goal of this dissertation is to conduct more effective graphical model analysis for cancer gene expression studies. One literature review and three methodological projects have been conducted. The overall strategy is to borrow strength from additional information so as to assist gene expression graphical model estimation. In the first chapter, the literature review is conducted. The methods developed in Chapter 2 and Chapter 4 take advantage of information on regulators of gene expressions (such as methylation, copy number variation, microRNA, and others). 
As they belong to the vertical data integration framework, we first provide a review of such data integration for gene expression data in Chapter 1. Additionally, graphical model-based analysis for gene expression data is reviewed. Research reported in this chapter has led to a paper published in Briefings in Bioinformatics. In Chapters 2-4, to accommodate the extreme complexity of information-borrowing for graphical models, three different approaches have been proposed. In Chapter 2, two graphical models, a gene-expression-only one and a gene-expression-regulator one, are simultaneously considered. A biologically sensible hierarchy between the sparsity structures of these two networks is developed, which is the first of its kind. This hierarchy is then used to link the estimation of the two graphical models. This work has led to a paper published in Genetic Epidemiology. In Chapter 3, additional information is mined from published literature, for example, those deposited at PubMed. The consideration is that published studies have been based on many independent experiments and can contain valuable information on genes’ interconnections. The challenge is to recognize that such information can be partial or even wrong. A two-step approach, consisting of information-guided and information-incorporated estimations, is developed. This work has led to a paper published in Biometrics. In Chapter 4, we slightly shift attention and examine the difference in graphs, which has important implications for understanding cancer development and progression. Our strategy is to link changes in gene expression graphs with those in regulator graphs, which means additional information for estimation. It is noted that, to make individual chapters stand-alone, there can be minor overlap in descriptions. All methodological developments in this research fit the advanced penalization paradigm, which has been popular for cancer gene expression and other molecular data analysis. 
This methodological coherence is highly desirable. For the methods described in Chapters 2-4, we have developed new penalized estimations which have lucid interpretations and can directly lead to variable selection (and so sparse and interpretable graphs). We have also developed effective computational algorithms and R code, which have been made publicly available at Dr. Shuangge Ma’s GitHub software repository. For the methods described in Chapters 2 and 3, statistical properties under ultrahigh dimensional settings and mild regularity conditions have been established, providing the proposed methods a uniquely strong ground. Statistical properties for the method developed in Chapter 4 are relatively straightforward and hence are omitted. For all the proposed methods, we have conducted extensive simulations, comparisons with the most relevant competitors, and data analysis. The practical advantage is fully established. Overall, this research has delivered a practically sensible information-incorporating strategy for improving graphical model-based analysis for cancer gene expression data, multiple highly competitive methods, R programs that can have broad utilization, and new findings for multiple cancer types.

    Biological network models for inferring mechanism of action, characterizing cellular phenotypes, and predicting drug response

    A primary challenge in the analysis of high-throughput biological data is the abundance of correlated variables. A small change to a gene's expression or a protein's binding availability can cause significant downstream effects. The existence of such chain reactions presents challenges in numerous areas of analysis. By leveraging knowledge of the network interactions that underlie this type of data, we can often enable better understanding of biological phenomena. This dissertation will examine network-based statistical approaches to the problems of mechanism-of-action inference, characterization of gene expression changes, and prediction of drug response. First, we develop a method for multi-target perturbation detection in multi-omics biological data. We estimate a joint Gaussian graphical model across multiple data types using penalized regression, and filter for network effects. Next, we apply a set of likelihood ratio tests to identify the most likely site of the original perturbation. We also present a conditional testing procedure to allow for detection of secondary perturbations. Second, we address the problem of characterization of cellular phenotypes via Bayesian regression in the Gene Ontology (GO). In our model, we use the structure of the GO to assign changes in gene expression to functional groups, and to model the covariance between these groups. In addition to describing changes in expression, we use these functional activity estimates to predict the expression of unobserved genes. We further determine when such predictions are likely to be inaccurate by identifying GO terms with poor agreement to gene-level estimates. In a case study, we identify GO terms relevant to changes in the growth rate of S. cerevisiae. Lastly, we consider the prediction of drug sensitivity in cancer cell lines based on pathway-level activity estimates from ASSIGN, a Bayesian factor analysis model. 
We use penalized regression to predict response to various cancer treatments based on cancer subtype, pathway activity, and 2-way interactions thereof. We also present network representations of these interaction models and examine common patterns in their structure across treatments.

    Knowledge-Guided Bayesian Support Vector Machine Methods For High-Dimensional Data

    The support vector machine (SVM) is a popular classification method for analysis of high dimensional data such as genomics data. Recently, new SVM methods have been developed to achieve variable selection through either frequentist regularization or Bayesian shrinkage. The Bayesian framework provides a probabilistic interpretation for SVM and allows direct uncertainty quantification. In this dissertation, we develop four knowledge-guided SVM methods for the analysis of high dimensional data. In Chapter 1, I first review the theory of SVM and existing methods for incorporating prior knowledge, represented by graphs, into SVM. Second, I review the terminology on variable selection and limitations of the existing methods for SVM variable selection. Last, I introduce some Bayesian variable selection techniques as well as Markov chain Monte Carlo (MCMC) algorithms. In Chapter 2, we develop a new Bayesian SVM method that enables variable selection guided by structural information among predictors, e.g., biological pathways among genes. This method uses a spike and slab prior for feature selection combined with an Ising prior for incorporating structural information. The performance of the proposed method is evaluated in comparison with existing SVM methods in terms of prediction and feature selection in extensive simulations. Furthermore, the proposed method is illustrated in analysis of genomic data from a cancer study, demonstrating its advantage in generating biologically meaningful results and identifying potentially important features. The model developed in Chapter 2 might suffer from the issue of phase transition \citep{li2010bayesian} when the number of variables becomes extremely large. In Chapter 3, we propose another Bayesian SVM method that assigns an adaptive structured shrinkage prior to the coefficients, and the graph information is incorporated via the hyper-priors imposed on the precision matrix of the log-transformed shrinkage parameters. 
This method is shown to outperform the method in Chapter 2 in both simulations and real data analysis. In Chapter 4, to relax the linearity assumption in Chapters 2 and 3, we develop a novel knowledge-guided Bayesian non-linear SVM. The proposed method uses a diagonal matrix with ones representing selected features and zeros representing unselected features, and combines it with the Ising prior to perform feature selection. The performance of our method is evaluated and compared with several penalized linear SVM methods and the standard kernel SVM method in terms of prediction and feature selection in extensive simulation settings. Also, analyses of genomic data from a cancer study show that our method yields a more accurate prediction model for patient survival and reveals biologically more meaningful results than the existing methods. In Chapter 5, we extend the work of Chapter 4 and use a joint model to identify the relevant features and learn the structural information among them simultaneously. This model does not require that the structural information among the predictors be known, which makes it more powerful when the prior knowledge about pathways is limited or inaccurate. We demonstrate in simulations that our method outperforms the method developed in Chapter 4 when the prior knowledge is partially true or inaccurate, and illustrate our proposed model with an application to a glioblastoma data set. In Chapter 6, we propose some future work, including extending our methods to more general types of outcomes such as categorical or continuous variables.
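The Ising prior mentioned in this abstract can be illustrated with a short sketch. The abstract does not give its exact form, so the code assumes one common parameterization: an unnormalized log prior over binary inclusion indicators `gamma`, with a sparsity parameter `a` and a smoothness parameter `b` that rewards selecting features joined by an edge in the pathway graph. The function name, the graph, and the parameter values are all hypothetical.

```python
def ising_log_prior(gamma, edges, a, b):
    """Unnormalized Ising log prior over binary inclusion indicators.

    Assumed form: a * sum_i gamma_i + b * sum_{(i, j) in edges} gamma_i * gamma_j.
    A negative `a` favours sparsity; a positive `b` favours jointly selecting
    features that are neighbours in the graph.
    """
    sparsity_term = a * sum(gamma)
    smoothness_term = b * sum(gamma[i] * gamma[j] for i, j in edges)
    return sparsity_term + smoothness_term


# Hypothetical pathway graph on three genes: a chain 0 - 1 - 2.
edges = [(0, 1), (1, 2)]
a, b = -2.0, 1.0

# Two candidate selections with the same number of genes.
connected = [1, 1, 0]     # genes 0 and 1 share an edge
disconnected = [1, 0, 1]  # genes 0 and 2 do not share an edge

print(ising_log_prior(connected, edges, a, b))
print(ising_log_prior(disconnected, edges, a, b))
```

With equal model size, the selection whose genes are neighbours in the graph receives the higher prior weight, which is how structural knowledge can guide feature selection.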

    Data-driven modelling of biological multi-scale processes

    Biological processes involve a variety of spatial and temporal scales. A holistic understanding of many biological processes therefore requires multi-scale models which capture the relevant properties on all these scales. In this manuscript we review mathematical modelling approaches used to describe the individual spatial scales and how they are integrated into holistic models. We discuss the relation between spatial and temporal scales and its implications for multi-scale modelling. Based upon this overview of state-of-the-art modelling approaches, we formulate key challenges in mathematical and computational modelling of biological multi-scale and multi-physics processes. In particular, we consider the availability of analysis tools for multi-scale models and model-based multi-scale data integration. We provide a compact review of methods for model-based data integration and model-based hypothesis testing. Furthermore, novel approaches and recent trends are discussed, including computation time reduction using reduced order and surrogate models, which contribute to the solution of inference problems. We conclude the manuscript by providing a few ideas for the development of tailored multi-scale inference methods. Comment: This manuscript will appear in the Journal of Coupled Systems and Multiscale Dynamics (American Scientific Publishers)