Unsupervised Bayesian linear unmixing of gene expression microarrays
Background: This paper introduces a new constrained model and the corresponding algorithm, called unsupervised Bayesian linear unmixing (uBLU), to identify biological signatures from high-dimensional assays such as gene expression microarrays. The basis for uBLU is a Bayesian model in which each data sample is represented as an additive mixture of random positive gene signatures, called factors, with random positive mixing coefficients, called factor scores, that specify the relative contribution of each signature to a specific sample. The distinguishing feature of the proposed method is that uBLU constrains the factor loadings to be non-negative and the factor scores to be probability distributions over the factors. It also provides estimates of the number of factors. A Gibbs sampling strategy is adopted to generate random samples according to the posterior distribution of the factors, factor scores, and number of factors. These samples are then used to estimate all the unknown parameters. Results: Firstly, the proposed uBLU method is applied to several simulated datasets with known ground truth and compared with previous factor decomposition methods, such as principal component analysis (PCA), non-negative matrix factorization (NMF), Bayesian factor regression modeling (BFRM), and the gradient-based algorithm for general matrix factorization (GB-GMF). Secondly, we illustrate the application of uBLU on a real time-evolving gene expression dataset from a recent viral challenge study in which individuals were inoculated with influenza A/H3N2/Wisconsin. We show that the uBLU method significantly outperforms the other methods on the simulated and real datasets considered here. Conclusions: The results obtained on synthetic and real data illustrate the accuracy of the proposed uBLU method when compared to other factor decomposition methods from the literature (PCA, NMF, BFRM, and GB-GMF).
The uBLU method identifies an inflammatory component closely associated with clinical symptom scores collected during the study. Using a constrained model allows recovery of all the inflammatory genes in a single factor.
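The constraints described in this abstract (non-negative factor loadings, and factor scores that form a probability distribution over the factors for each sample) can be illustrated with a small data simulator. The following is only a minimal sketch of the generative side of such a model, not the authors' uBLU code and not the Gibbs sampler: the function name `simulate_mixture`, the Gamma and uniform-Dirichlet draws, and the Gaussian noise level are all assumptions made for illustration.

```python
import random


def simulate_mixture(n_genes, n_samples, n_factors, noise_sd=0.05, seed=0):
    """Simulate Y = M A + noise with M >= 0 and each column of A on the simplex.

    Hypothetical helper: the distributions below are illustrative choices,
    not the priors used in the uBLU paper.
    """
    rng = random.Random(seed)

    # Factor loadings M (n_genes x n_factors): non-negative gene signatures.
    M = [[rng.gammavariate(2.0, 1.0) for _ in range(n_factors)]
         for _ in range(n_genes)]

    # Factor scores, stored one list per sample: each list is a probability
    # vector over the factors. Normalised Gamma(1, 1) draws are uniform on
    # the simplex, i.e. Dirichlet(1, ..., 1).
    A_cols = []
    for _ in range(n_samples):
        raw = [rng.gammavariate(1.0, 1.0) for _ in range(n_factors)]
        total = sum(raw)
        A_cols.append([r / total for r in raw])

    # Observed expression matrix Y (n_genes x n_samples) with additive noise.
    Y = [[sum(M[g][k] * A_cols[s][k] for k in range(n_factors))
          + rng.gauss(0.0, noise_sd)
          for s in range(n_samples)]
         for g in range(n_genes)]
    return M, A_cols, Y


if __name__ == "__main__":
    M, A_cols, Y = simulate_mixture(n_genes=20, n_samples=5, n_factors=3)
    print("scores for first sample:", [round(a, 3) for a in A_cols[0]])
    print("sum of scores:", round(sum(A_cols[0]), 6))
```

Data generated this way has a known ground truth, which is the kind of setting the abstract describes for comparing factor decomposition methods.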
Estimation and Detection of Multivariate Gene Regulatory Relationships
The Coefficient of Determination (CoD) plays an important role in Genomics problems, for instance, in the inference of gene regulatory networks from gene expression data. However, the inference theory of the CoD has not been investigated systematically. In this dissertation, we study the inference of the discrete CoD from both frequentist and Bayesian perspectives, with applications to system identification problems in Genomics. From a frequentist viewpoint, we provide a theoretical framework for CoD estimation by introducing nonparametric CoD estimators and parametric maximum-likelihood (ML) CoD estimators based on static and dynamical Boolean models. Inference algorithms are developed to discover gene regulatory relationships, and numerical examples show that the ML approach performs better when sufficient prior knowledge is available. To make applications of the CoD independent of user-selectable thresholds, we describe rigorous multiple testing procedures to identify significant regulatory relationships among genes using the discrete CoD, and to discover canalyzing genes using the intrinsically multivariate prediction (IMP) criterion. We develop practical statistical tools that are open to the scientific community. From a Bayesian viewpoint, we propose a framework for the inference of the CoD across a parametrized family of joint distributions between target and predictors. The Bayesian approach is compared with the nonparametric and parametric approaches using synthetic data.
We have found that, in applications to system identification problems in Genomics, both the parametric and the Bayesian CoD estimation approaches outperform the nonparametric approaches. Hence, we conclude that parametric and Bayesian estimation approaches are preferred when partial knowledge about gene regulation is available. We have also shown that the two proposed statistical testing frameworks can detect, respectively, well-known gene regulatory relationships and canalyzing genes such as p53 and DUSP1 in real data sets. This indicates that our methodology could serve as a promising tool for the detection of potential gene regulatory relationships and canalyzing genes. In short, this dissertation is intended to serve as a foundation for a detailed study of applications of CoD estimation in Genomics and related fields.
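For readers unfamiliar with the discrete CoD, it is commonly defined as CoD = (e0 - e) / e0, where e0 is the error of the best predictor of the target that uses no predictors and e is the error of the best predictor that uses the chosen predictor set. The sketch below computes a simple plug-in (resubstitution) estimate of this quantity for a binary target. It is an illustration under stated assumptions, not the estimators developed in the dissertation; the function name `cod_estimate` is hypothetical.

```python
from collections import Counter, defaultdict


def cod_estimate(predictors, target):
    """Plug-in estimate of the discrete CoD for a binary (0/1) target.

    predictors: list of tuples of discrete predictor values, one per sample.
    target:     list of 0/1 target values, one per sample.
    """
    n = len(target)
    if n == 0 or n != len(predictors):
        raise ValueError("predictors and target must be non-empty and aligned")

    # e0: error of the best constant guess (majority class of the target).
    ones = sum(target)
    eps0 = min(ones, n - ones) / n
    if eps0 == 0:
        # The ratio is 0/0 here; this sketch leaves the choice of convention
        # to the caller instead of picking one.
        raise ValueError("target is constant; CoD is undefined without a convention")

    # e: error of the majority vote inside each observed predictor cell.
    cells = defaultdict(Counter)
    for x, y in zip(predictors, target):
        cells[tuple(x)][y] += 1
    errors = sum(sum(c.values()) - max(c.values()) for c in cells.values())
    eps = errors / n

    return (eps0 - eps) / eps0


if __name__ == "__main__":
    # XOR target: the pair of predictors explains it, either one alone does not.
    pair = [(0, 0), (0, 1), (1, 0), (1, 1)]
    y = [0, 1, 1, 0]
    print(cod_estimate(pair, y))                     # both predictors
    print(cod_estimate([(a,) for a, _ in pair], y))  # first predictor only
```

The XOR case is a toy example of intrinsically multivariate prediction: the full predictor set yields a high CoD while each proper subset yields a low one.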
MCMC implementation of the optimal Bayesian classifier for non-Gaussian models: model-based RNA-Seq classification
BACKGROUND: Sequencing datasets consist of a finite number of reads which map to specific regions of a reference genome. Most effort in modeling these datasets focuses on the detection of univariate differentially expressed genes. However, for classification, we must consider multiple genes and their interactions. RESULTS: Thus, we introduce a hierarchical multivariate Poisson model (MP) and the associated optimal Bayesian classifier (OBC) for classifying samples using sequencing data. Lacking closed-form solutions, we employ a Markov chain Monte Carlo (MCMC) approach to perform classification. We demonstrate superior or equivalent classification performance compared to typical classifiers for two synthetic datasets and over a range of classification problem difficulties. We also introduce the Bayesian minimum mean squared error (MMSE) conditional error estimator and demonstrate its computation over the feature space. In addition, we demonstrate superior or class-leading performance on an RNA-Seq dataset containing two lung cancer tumor types from The Cancer Genome Atlas (TCGA). CONCLUSIONS: Through model-based, optimal Bayesian classification, we demonstrate superior classification performance for both synthetic and real RNA-Seq datasets. A tutorial video and Python source code are available under an open source license at http://bit.ly/1gimnss. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-014-0401-3) contains supplementary material, which is available to authorized users.
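The idea of classifying by an MCMC-approximated posterior predictive density can be sketched on a deliberately simplified stand-in: a single gene with a Poisson count model per class, a Gaussian prior on the log-rate, and a random-walk Metropolis sampler. This is not the hierarchical multivariate Poisson model or the code released with the paper; every name and setting below (`sample_log_rate`, `predictive`, `classify`, the prior scale, the step size) is an assumption for illustration, and equal class priors are assumed.

```python
import math
import random


def log_poisson_pmf(x, log_rate):
    """Log of the Poisson pmf at count x, parameterised by the log-rate."""
    return x * log_rate - math.exp(log_rate) - math.lgamma(x + 1)


def sample_log_rate(counts, n_iter=4000, burn_in=1000, step=0.3,
                    prior_sd=5.0, seed=0):
    """Random-walk Metropolis samples of a Poisson log-rate given counts."""
    rng = random.Random(seed)

    def log_post(theta):
        # Gaussian prior on the log-rate (up to a constant) plus log-likelihood.
        log_prior = -0.5 * (theta / prior_sd) ** 2
        return log_prior + sum(log_poisson_pmf(x, theta) for x in counts)

    theta = math.log(sum(counts) / len(counts) + 0.5)  # start near the data
    current = log_post(theta)
    samples = []
    for i in range(n_iter):
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        if i >= burn_in:
            samples.append(theta)
    return samples


def predictive(x, samples):
    """Monte Carlo estimate of the posterior predictive probability of x."""
    return sum(math.exp(log_poisson_pmf(x, t)) for t in samples) / len(samples)


def classify(x, samples_by_class):
    """Pick the class with the largest predictive probability (equal priors)."""
    return max(samples_by_class,
               key=lambda c: predictive(x, samples_by_class[c]))


if __name__ == "__main__":
    training = {0: [2, 3, 1, 4, 2, 3], 1: [20, 25, 18, 22, 19, 24]}
    posterior = {c: sample_log_rate(xs, seed=c) for c, xs in training.items()}
    for x in (3, 21):
        print(x, "->", classify(x, posterior))
```

Averaging the likelihood over posterior samples, as in `predictive`, is what replaces the closed-form solution when none is available; a multivariate model would apply the same averaging to a joint likelihood over genes.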
Optimal Model-Based Approaches for Predictive Inference in Biology
Predictive modeling of the dynamic, multivariate, non-linear, stochastic systems of biology is a difficult enterprise. High-throughput measurement techniques are enabling new approaches to computational biology, but the small number of samples typically available relative to the number of features measured makes additional sources of information critical for accurate predictions. In this dissertation, we offer an approach to incorporate biological pathway knowledge into a predictive stochastic model for genetic regulatory networks. In addition, we propose a statistical model for shotgun sequencing and use computational approximation strategies to derive optimal estimators for classification.
We compare classifiers trained using this framework with existing classification rules, including non-linear support vector machines. On both synthetic and real sequencing data, our classifiers delivered lower classification error rates than existing classification techniques. In addition, we demonstrate how prior knowledge can be incorporated into the classifier through properly constructed prior distributions, and we present several scenarios in which this improves classification performance. This research establishes a flexible framework for generating optimal estimators with respect to statistical biological models. By demonstrating the role and power of computation in unlocking these estimators, we point future research efforts in computational biology towards this computationally intensive approach.