157 research outputs found

    Modelling Transcriptional Regulation with a Mixture of Factor Analyzers and Variational Bayesian Expectation Maximization

    Understanding the mechanisms of gene transcriptional regulation through analysis of high-throughput postgenomic data is one of the central problems of computational systems biology. Various approaches have been proposed, but most of them fail to address at least one of the following objectives: (1) allow for the fact that transcription factors are potentially subject to posttranscriptional regulation; (2) allow for the fact that transcription factors cooperate as a functional complex in regulating gene expression, and (3) provide a model and a learning algorithm with manageable computational complexity. The objective of the present study is to propose and test a method that addresses these three issues. The model we employ is a mixture of factor analyzers, in which the latent variables correspond to different transcription factors, grouped into complexes or modules. We pursue inference in a Bayesian framework, using the Variational Bayesian Expectation Maximization (VBEM) algorithm for approximate inference of the posterior distributions of the model parameters, and estimation of a lower bound on the marginal likelihood for model selection. We have evaluated the performance of the proposed method on three criteria: activity profile reconstruction, gene clustering, and network inference

    Statistical Modeling for Cellular Heterogeneity Problems in Cancer Research: Deconvolution, Gaussian Graphical Models and Logistic Regression

    Tumor tissue samples comprise a mixture of cancerous and surrounding normal cells. Investigating cellular heterogeneity in tumors is crucial to genomic analyses associated with cancer prognosis and treatment decisions, where contamination by non-cancerous cells may substantially affect gene expression profiling in clinically derived malignant tumor samples. For this purpose, we first computationally purify tumor profiles, and then develop new statistical modeling techniques that incorporate tumor purity estimates for genetic correlation and prediction of clinical outcome in cancer research. In this thesis, we propose novel approaches to analyzing and modeling cellular heterogeneity problems using genomic data from three perspectives. First, we develop a computational tool, DeMixT, which applies a deconvolution algorithm to explicitly account for at most three cellular components associated with cancer. Compared with the experimental approach of isolating single cells, in silico dissection of tumor samples is faster and cheaper, but previously developed computational tools have limited ability to estimate cellular proportions and tumor-specific expression profiles when neither is given as prior information. Our model allows inclusion of the infiltrating immune cells as a component alongside the tumor cells and stromal cells. We assume a linear mixture of gene expression profiles, with each component following a log2-normal distribution, and propose an iterated conditional modes algorithm to estimate parameters. We also introduce a novel two-stage estimation procedure for the three-component deconvolution. Our method is computationally feasible and yields accurate estimates in simulations and real data analyses. The estimated cellular proportions and purified expression profiles can provide deeper insight for cancer biomarker studies.
Second, we propose a novel edge regression model for undirected graphs, which incorporates subject-level covariates to estimate the conditional dependencies. Current work on constructing graphical models for multivariate data does not take subject-specific information into account, which can bias the conditional independence structure in heterogeneous data. Especially for tumor samples with inherent contamination from normal cells, ignoring the cellular heterogeneity and modeling population-level genomic graphs may inhibit the discovery of the true tumor graph, which would be attenuated towards the normal graph. Our model allows undirected networks to vary with the exogenous covariates and is able to borrow strength from different related graphs to estimate more robust covariate-specific graphs. Bayesian shrinkage algorithms are presented to efficiently estimate and induce sparsity for generating subject-level graphs. We demonstrate the good performance of our method through simulation studies and apply it to cytokine measurements from blood plasma samples from hepatocellular carcinoma (HCC) patients and normal controls. Third, we build a logistic regression model that includes tumor purity as a scaling factor to improve model robustness for both estimation and prediction. Penalized logistic regression is used to identify variables (genes) and predict clinical status with binary outcomes that are associated with cancers in high-dimensional genomic data. We aim to reduce the uncertainty introduced by cellular heterogeneity by incorporating the measure of tumor purity to quantify the power of the data for each sample. We provide strategies for choosing scaling parameters. Finally, our model is shown to work well in a set of simulation studies.
We believe that the statistical modeling, technical pipelines and computational results included in our work will serve as a first guide for the development of statistical methods accounting for cellular heterogeneity in cancer research

    Sparse Model Building From Genome-Wide Variation With Graphical Models

    High-throughput sequencing and expression characterization have led to an explosion of phenotypic and genotypic molecular data underlying both experimental studies and outbred populations. We develop a novel class of algorithms to reconstruct sparse models among these molecular phenotypes (e.g. expression products) and genotypes (e.g. single nucleotide polymorphisms), via both a Bayesian hierarchical model, for when the sample size is much smaller than the model dimension (i.e. p ≫ n), and the well-characterized adaptive lasso algorithm. Specifically, we propose novel approaches to the problems of increasing power to detect additional loci in genome-wide association studies using our variational algorithm, efficiently learning directed cyclic graphs from expression and genotype data using the adaptive lasso, and constructing genome-wide undirected graphs among genotype, expression and downstream phenotype data using an extension of the variational feature selection algorithm. The Bayesian hierarchical model is derived for a parametric multiple regression model with a mixture prior of a point mass and a normal distribution for each regression coefficient, and appropriate priors for the set of hyperparameters. When combined with a probabilistic consistency bound on the model dimension, this approach leads to very sparse solutions without the need for cross-validation. We use a variational Bayes approximate inference approach in our algorithm, where we impose a complete factorization across all parameters for the approximate posterior distribution, and then minimize the Kullback-Leibler divergence between the approximate and true posterior distributions. Since the prior distribution is non-convex, we restart the algorithm many times to find multiple posterior modes, and combine information across all discovered modes in an approximate Bayesian model averaging framework, to reduce the variance of the posterior probability estimates.
We analyze three major publicly available datasets: the HapMap 2 genotype and expression data collected on immortalized lymphoblastoid cell lines, the genome-wide gene expression and genetic marker data collected for a yeast intercross, and genome-wide gene expression, genetic marker, and downstream phenotype data related to weight in a mouse F2 intercross. Based on both simulations and data analysis, we show that our algorithms can outperform other state-of-the-art model selection procedures when including thousands to hundreds of thousands of genotypes and expression traits, in terms of aggressively controlling the false discovery rate and generating rich simultaneous statistical models
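
The model described in this abstract can be written out roughly as follows. This is only a sketch: the symbols (y, X, β, π, σ, θ, q) are notation assumed here for illustration and are not taken from the thesis, and the direction of the Kullback-Leibler divergence shown is the one conventionally used in variational Bayes.

```latex
% Multiple regression with a point-mass-plus-normal ("spike-and-slab") prior
% on each coefficient; all symbols are assumed notation.
\begin{align}
  y &= X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_n), \\
  p(\beta_j) &= (1 - \pi)\,\delta_0(\beta_j) + \pi\,\mathcal{N}(\beta_j \mid 0, \sigma_\beta^2),
      \qquad j = 1, \dots, p, \\
  q(\theta) &= \prod_k q_k(\theta_k), \qquad
  q^\ast = \arg\min_q \, \mathrm{KL}\bigl(q(\theta) \,\|\, p(\theta \mid y)\bigr).
\end{align}
```

Here θ stands for all regression coefficients and hyperparameters together, and the product form of q is the "complete factorization" the abstract mentions.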

    Innovative Algorithms and Evaluation Methods for Biological Motif Finding

    Biological motifs are defined as overly recurring sub-patterns in biological systems. Sequence motifs and network motifs are examples of biological motifs. Because of their wide range of applications, many algorithms and computational tools have been developed to search efficiently for biological motifs. As a result, there are more computationally derived motifs than experimentally validated motifs, and how to validate the biological significance of the ‘candidate motifs’ becomes an important question. Some sequence motifs are verified by their structural similarities or their functional roles in DNA or protein sequences, and stored in databases. However, the biological role of network motifs has not yet been validated, and currently no databases exist for this purpose. In this thesis, we focus not only on computational efficiency but also on the biological meaning of the motifs. We provide an efficient way to incorporate biological information into clustering analysis methods. For example, a sparse nonnegative matrix factorization (SNMF) method is used with Chou-Fasman parameters for protein motif finding. Biological network motifs are searched for by various clustering algorithms with Gene Ontology (GO) information. Experimental results show that the algorithms perform better than existing algorithms by producing a larger number of high-quality biological motifs. In addition, we apply biological network motifs to the discovery of essential proteins. Essential proteins are defined as a minimum set of proteins that are vital for development to a fertile adult and for cellular life in an organism. We design a new centrality algorithm with biological network motifs, named MCGO, and score proteins in a protein-protein interaction (PPI) network to find essential proteins. MCGO is also combined with other centrality measures to predict essential proteins using machine learning techniques.
This thesis makes three contributions to the study of biological motifs: 1) clustering analysis is used efficiently, and biological information is easily integrated with the analysis; 2) we focus more on the biological meaning of motifs by adding biological knowledge to the algorithms and by suggesting biologically related evaluation methods; 3) biological network motifs are successfully applied to the practical task of predicting essential proteins

    Integrative Modeling of Transcriptional Regulation in Response to Autoimmune Disease Therapies

    Rheumatoid arthritis (RA) and multiple sclerosis (MS) are generally classified as autoimmune diseases. Immunomodulatory drugs are used to treat these diseases, for example TNF-alpha blockers (e.g. Etanercept) in the case of RA and IFN-beta preparations (e.g. Betaferon and Avonex) in the case of MS. To date, the molecular mechanisms of these therapies are largely unknown. Moreover, their efficacy and tolerability are insufficient in some patients. In this work, the transcriptional response in the blood of patients to each of these three therapies was investigated in order to better understand how these drugs act. Network inference methods were employed with the aim of reconstructing the gene regulatory networks (GRNs) of the genes whose expression is altered. The starting point of each analysis was a gene expression dataset. From this, genes that are up- or downregulated after the start of therapy were first filtered. The gene regulatory regions of these genes were then analyzed for transcription factor binding sites (TFBS). Finally, to derive GRN models, a new network inference algorithm (TILAR) was used. TILAR distinguishes between genes and TFs and describes the regulatory effects between them by a system of linear equations. TILAR also allows prior knowledge about gene-TF and TF-gene interactions to be incorporated. As a result, complex network structures were reconstructed that describe the regulatory relationships between the genes that are differentially expressed in the course of the therapies. For the Etanercept therapy, a subnetwork was found containing genes that show lower expression levels in RA patients who respond very well to the drug. The analysis of GRNs can thus contribute to a better understanding of therapy-associated processes and reveal transcriptional differences between patients

    The effect of noise on dynamics and the influence of biochemical systems

    Understanding a complex system requires integration and collective analysis of data from many levels of organisation. Predictive modelling of biochemical systems is particularly challenging because the data are plagued by noise operating at every level. Inevitably we have to decide whether we can reliably infer the structure and dynamics of biochemical systems from present data. Here we approach this problem from many fronts by analysing the interplay between deterministic and stochastic dynamics in a broad collection of biochemical models. In a classical mathematical model we first illustrate how this interplay can be described in surprisingly simple terms; we furthermore demonstrate the advantages of a statistical point of view for more complex systems as well. We then investigate strategies for the integrated analysis of models characterised by different organisational levels, and trace the propagation of noise through such systems. We use this approach to uncover, for the first time, the dynamics of metabolic adaptation of a plant pathogen throughout its life cycle and discuss the ecological implications. Finally, we investigate how reliably we can infer the parameters of biochemical models. We develop a novel sensitivity/inferability analysis framework that is generally applicable to a large fraction of current mathematical models of biochemical systems. By using this framework to quantify the effect of parametric variation on system dynamics, we provide practical guidelines as to when and why certain parameters are easily estimated while others are much harder to infer. We highlight the limitations on parameter inference due to model structure and qualitative dynamical behaviour, and identify candidate elements of control in biochemical pathways that are most likely to be subject to regulation

    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    The present paper explores the technical efficiency of four hotels from Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking of these four hotel units, all located in Portugal, is established using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiencies in the estimation process, and so to investigate the main causes of inefficiency. Several suggestions for improving efficiency are made for each hotel studied.

    Differential geometric MCMC methods and applications

    This thesis presents novel Markov chain Monte Carlo methodology that exploits the natural representation of a statistical model as a Riemannian manifold. The methods developed provide generalisations of the Metropolis-adjusted Langevin algorithm and the Hybrid Monte Carlo algorithm for Bayesian statistical inference, and resolve many shortcomings of existing Monte Carlo algorithms when sampling from target densities that may be high dimensional and exhibit strong correlation structure. The performance of these Riemannian manifold Markov chain Monte Carlo algorithms is rigorously assessed by performing Bayesian inference on logistic regression models, log-Gaussian Cox point process models, stochastic volatility models, and both parameter and model level inference of dynamical systems described by nonlinear differential equations
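
For orientation, the plain Metropolis-adjusted Langevin algorithm that the thesis generalises can be sketched as below. This is the standard Euclidean version, not the Riemannian manifold method of the thesis; the function name `mala`, its parameters, and the Gaussian example target are assumptions made purely for illustration.

```python
import numpy as np


def mala(log_density, grad_log_density, x0, step, n_samples, seed=0):
    """Plain (Euclidean) MALA sampler; a hypothetical helper, not the thesis method."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_samples, x.size))
    accepted = 0

    def log_q(x_to, x_from):
        # Log density (up to a constant) of the Langevin proposal x_from -> x_to.
        mean = x_from + 0.5 * step**2 * grad_log_density(x_from)
        return -np.sum((x_to - mean) ** 2) / (2.0 * step**2)

    for i in range(n_samples):
        # Langevin proposal: drift along the gradient plus Gaussian noise.
        noise = rng.standard_normal(x.size)
        proposal = x + 0.5 * step**2 * grad_log_density(x) + step * noise
        # Metropolis-Hastings correction for the asymmetric proposal.
        log_alpha = (log_density(proposal) + log_q(x, proposal)
                     - log_density(x) - log_q(proposal, x))
        if np.log(rng.uniform()) < log_alpha:
            x = proposal
            accepted += 1
        samples[i] = x
    return samples, accepted / n_samples


# Illustrative target: a standard two-dimensional Gaussian.
def log_density(x):
    return -0.5 * np.sum(x**2)


def grad_log_density(x):
    return -x


samples, acceptance_rate = mala(log_density, grad_log_density,
                                x0=np.zeros(2), step=0.9, n_samples=5000)
```

Roughly speaking, the Riemannian variants described in the abstract replace the fixed step scaling with a position-dependent metric, which this sketch does not attempt.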

    Statistical methods for the analysis of the genetics of gene expression

    Bayesian Inference of Gene Regulatory Networks : From Parameter Estimation to Experimental Design

    Learning the structure of gene regulatory networks is an interesting and important topic in systems biology. This structure could be used to identify key regulators, and this knowledge may in turn be used to develop new drugs that affect the expression of these regulators. However, the inference of gene regulatory networks, especially from time-series data, is a challenging task. This is due to the limited amount of available data, which additionally contain a lot of noise. From a technical point of view, such data cause problems for the parameter estimation procedure, such as non-identifiability and sloppiness of parameters. To address these difficulties, this thesis develops new methods for both the parameter estimation task and the experimental design for gene regulatory networks, based on a non-linear ordinary differential equations model. The methods use a Bayesian procedure and generate samples from the underlying distribution of the parameters. These distributions are of high interest, since they do not provide only one network structure but give all network structures that are consistent with the given data. All of these structures can then be examined in more detail. The proposed method for Bayesian parameter estimation uses smoothing splines to circumvent the numerical integration of the underlying system of ordinary differential equations, which is usually required for parameter estimation in such systems. An iterative Hybrid Monte Carlo and Metropolis-Hastings algorithm is used to sample the model parameters and the smoothing factor. The new method is applied to simulated data, which shows that it is able to reconstruct the topology of the underlying gene regulatory network with high accuracy. The approach was also applied to real experimental data from a synthetically designed 5-gene network (the DREAM 2 Challenge #3 data), where it outperforms other methods.
For the Bayesian experimental design step, a fully Bayesian approach was used that makes no parametric assumption about the posterior distribution and does not linearize around a point estimate. To make the fully Bayesian approach computationally manageable, maximum entropy sampling is used together with a population-based Markov chain Monte Carlo algorithm. The approach was applied to simulated data and to real experimental data (the DREAM 2 Challenge #3 data), and outperforms both random experiments and a classical experimental design method