45 research outputs found

    Sparse Model Building From Genome-Wide Variation With Graphical Models

    High-throughput sequencing and expression characterization have led to an explosion of phenotypic and genotypic molecular data underlying both experimental studies and outbred populations. We develop a novel class of algorithms to reconstruct sparse models among these molecular phenotypes (e.g. expression products) and genotypes (e.g. single nucleotide polymorphisms), via both a Bayesian hierarchical model, when the sample size is much smaller than the model dimension (i.e. p ≫ n), and the well-characterized adaptive lasso algorithm. Specifically, we propose novel approaches to the problems of increasing power to detect additional loci in genome-wide association studies using our variational algorithm, efficiently learning directed cyclic graphs from expression and genotype data using the adaptive lasso, and constructing genome-wide undirected graphs among genotype, expression and downstream phenotype data using an extension of the variational feature selection algorithm. The Bayesian hierarchical model is derived for a parametric multiple regression model with a mixture prior of a point mass and normal distribution for each regression coefficient, and appropriate priors for the set of hyperparameters. When combined with a probabilistic consistency bound on the model dimension, this approach leads to very sparse solutions without the need for cross-validation. We use a variational Bayes approximate inference approach in our algorithm, where we impose a complete factorization across all parameters for the approximate posterior distribution, and then minimize the Kullback-Leibler divergence between the approximate and true posterior distributions. Since the prior distribution is non-convex, we restart the algorithm many times to find multiple posterior modes, and combine information across all discovered modes in an approximate Bayesian model averaging framework, to reduce the variance of the posterior probability estimates. 
We analyze three major publicly available datasets: the HapMap 2 genotype and expression data collected on immortalized lymphoblastoid cell lines, the genome-wide gene expression and genetic marker data collected for a yeast intercross, and genome-wide gene expression, genetic marker, and downstream phenotypes related to weight in a mouse F2 intercross. Based on both simulations and data analysis, we show that our algorithms can outperform other state-of-the-art model selection procedures when including thousands to hundreds of thousands of genotypes and expression traits, in terms of aggressively controlling the false discovery rate and generating rich simultaneous statistical models.
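
As a rough illustration of the model described in this abstract, the point-mass-plus-normal mixture prior and the fully factorized variational objective can be written as below. The notation ($\beta_j$, $\pi$, $\sigma_\beta^2$, $\sigma_e^2$, $\theta$, $q$) is assumed here for illustration and is not taken from the thesis; the hyperparameter priors are omitted.

```latex
\begin{align}
y &= X\beta + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_e^2 I_n), \\
\beta_j &\sim \pi\,\mathcal{N}(0, \sigma_\beta^2) + (1-\pi)\,\delta_0, \qquad j = 1,\dots,p, \\
q(\theta) &= \prod_k q_k(\theta_k), \qquad
q^\ast = \arg\min_q \mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid y, X)\big).
\end{align}
```

Here $\delta_0$ is the point mass at zero, so a coefficient is exactly zero with prior probability $1-\pi$, which is what drives the sparse solutions the abstract mentions.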

    Graphical models for high dimensional genomic data

    Graphical models study the relations among a set of random variables. In a graph, vertices represent variables and edges capture relations among the variables. We have developed three statistical methods for graphical model construction using high-dimensional genomic data. We first focus on estimating a high-dimensional partial correlation matrix. It is estimated with a ridge penalty followed by hypothesis testing. The null distribution of the test statistics derived from penalized partial correlation estimates has not been established. We address this challenge by estimating the null distribution from the empirical distribution of the test statistics of all the penalized partial correlation estimates. The performance of our method is systematically evaluated in simulation and application studies. Next, we consider estimating Directed Acyclic Graph (DAG) models for multivariate Gaussian random variables. The skeleton of a DAG is an undirected graphical model, which is constructed by removing the directions of all the edges in the DAG. Given observational data, not all the directions of the edges of a DAG are identifiable; however, the skeleton of the DAG is identifiable. We propose a novel method named PenPC to estimate the skeleton of a high-dimensional DAG by a two-step approach. We first estimate an undirected graph by selecting the non-zero entries of the partial correlation matrix, then remove false connections in this undirected graph to obtain the skeleton. We systematically study the asymptotic properties of PenPC on high-dimensional problems. Both simulations and real data analysis suggest that our method has substantially higher sensitivity and specificity in estimating the network skeleton than existing methods. To orient the edges in the skeleton of a DAG, we exploit interventional data on an additional set of variables. The variables are direct causes of some vertices in the DAG and enable estimating directions of the edges in the skeleton. 
More specifically, given the skeleton of a DAG, we calculate the posterior probabilities of edge directions using the additional set of variables. We evaluate our method by simulations and an application where the variables modeled by a DAG are gene expression and the additional set of variables are DNA polymorphisms. Doctor of Philosophy
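
The first step described in this abstract, a ridge-penalized estimate of the partial correlation matrix, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's estimator: it adds a hypothetical penalty `lam` to the diagonal of the sample covariance, inverts the result, and rescales the inverse. The function names and the toy data are invented for the example, and the hypothesis-testing step with the empirical null distribution is not shown.

```python
import math
import random


def covariance(data):
    """Sample covariance matrix; rows of `data` are samples, columns are variables."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in data) / (n - 1)
             for j in range(p)] for i in range(p)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def ridge_partial_correlations(data, lam):
    """Partial correlations from the inverse of (S + lam * I); `lam` > 0 is assumed."""
    cov = covariance(data)
    p = len(cov)
    penalized = [[cov[i][j] + (lam if i == j else 0.0) for j in range(p)]
                 for i in range(p)]
    omega = invert(penalized)
    return [[1.0 if i == j else -omega[i][j] / math.sqrt(omega[i][i] * omega[j][j])
             for j in range(p)] for i in range(p)]


# Toy chain x -> y -> z (hypothetical data): x and z are related only through y.
random.seed(0)
samples = []
for _ in range(500):
    x = random.gauss(0, 1)
    y = x + 0.5 * random.gauss(0, 1)
    z = y + 0.5 * random.gauss(0, 1)
    samples.append([x, y, z])

pc = ridge_partial_correlations(samples, lam=0.01)
```

In this toy chain the x-z partial correlation should be close to zero while the x-y and y-z entries stay large, which is the pattern a skeleton estimator would use to drop the x-z edge.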

    Statistical methods for gene selection and genetic association studies

    This dissertation includes five chapters, briefly described as follows. In Chapter One, we propose a signed bipartite genotype and phenotype network (GPN) by linking phenotypes and genotypes based on their statistical associations. It provides new insight into the genetic architecture among multiple correlated phenotypes and helps explore where phenotypes might be related at a higher level of cellular and organismal organization. We show that multiple-phenotype association studies that use the proposed network are improved by incorporating genetic information into the phenotype clustering. In Chapter Two, we first apply the proposed GPN to GWAS summary statistics. Then, we assess contributions to constructing a well-defined GPN with a clear representation of genetic associations by comparing the network properties with those of a random network, including connectivity, centrality, and community structure. The network topology annotations based on the sparse representations of the GPN can be used to understand disease heritability for highly correlated phenotypes. In applications to phenome-wide association studies, the proposed GPN can identify more significant pairs of genetic variants and phenotype categories. In Chapter Three, a powerful and computationally efficient gene-based association test is proposed, aggregating information from different gene-based association tests and also incorporating expression quantitative trait locus information. We show that the proposed method controls the type I error rate very well, has higher power in the simulation studies, and can identify more significant genes in the real data analyses. In Chapter Four, we develop six statistical selection methods based on penalized regression for inferring target genes of a transcription factor (TF). 
In this study, the proposed selection methods combine statistics, machine learning, and convex optimization approaches, and are highly effective at identifying the true target genes. The methods fill the gap left by the lack of appropriate methods for predicting target genes of a TF, and are instrumental for validating experimental results obtained from ChIP-seq and DAP-seq and, conversely, for selecting and annotating TFs based on their target genes. In Chapter Five, we propose a gene selection approach that captures gene-level signals in network-based regression for case-control association studies with DNA sequence data or DNA methylation data, inspired by the popular gene-based association tests that use a weighted combination of genetic variants to capture the combined effect of individual genetic variants within a gene. We show that the proposed gene selection approach has higher true positive rates than traditional dimension reduction techniques in the simulation studies and selects potentially rheumatoid-arthritis-related genes that are missed by existing methods.

    Statistical model identification : dynamical processes and large-scale networks in systems biology

    Magdeburg, Univ., Faculty of Process and Systems Engineering, dissertation, 2014. By Robert Johann Flassi

    A constraint optimization framework for discovery of cellular signaling and regulatory networks

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Computational and Systems Biology Program, 2011. Cataloged from PDF version of thesis. Includes bibliographical references. Cellular signaling and regulatory networks underlie fundamental biological processes such as growth, differentiation, and response to the environment. Although there are now various high-throughput methods for studying these processes, knowledge of them remains fragmentary. Typically, the majority of hits identified by transcriptional, proteomic, and genetic assays lie outside of the expected pathways. In addition, not all components in the regulatory networks can be exposed in one experiment because of systematic biases in the assays. These unexpected and hidden components of the cellular response are often the most interesting, because they can provide new insights into biological processes and potentially reveal new therapeutic approaches. However, they are also the most difficult to interpret. We present a technique, based on the Steiner tree problem, that uses a probabilistic protein-protein interaction network and high-confidence measurement and prediction of protein-DNA interactions to determine how these hits are organized into functionally coherent pathways, revealing many components of the cellular response that are not readily apparent in the original data. We report the results of applying this method to (1) phosphoproteomic and transcriptional data from the pheromone response in yeast, and (2) phosphoproteomic, DNaseI hypersensitivity sequencing and mRNA profiling data from the U87MG glioblastoma cell lines over-expressing the variant III mutant of the epidermal growth factor receptor (EGFRvIII). In both cases the method identifies changes in diverse cellular processes that extend far beyond the expected pathways. 
Analysis of the EGFRvIII network connectivity properties and of the transcriptional regulators that link observed changes in protein phosphorylation and differential expression suggests a few intriguing hypotheses that may lead to improved therapeutic strategies for glioblastoma. By Shao-shan Carol Huang. Ph.D.
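
To make the Steiner tree idea in this abstract concrete, here is a minimal sketch under stated assumptions. It is not the thesis's algorithm: it uses a simple shortest-path heuristic that repeatedly attaches the nearest remaining hit to the growing tree, and it assumes each edge cost is the negative log of a hypothetical interaction probability. The node names and probabilities are invented, and node labels are assumed to be mutually comparable (for example, strings).

```python
import heapq
import math


def shortest_paths(graph, sources):
    """Multi-source Dijkstra; returns distance and predecessor maps."""
    dist = {s: 0.0 for s in sources}
    prev = {}
    heap = [(0.0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, math.inf):
            continue  # stale heap entry
        for v, cost in graph.get(u, {}).items():
            nd = d + cost
            if nd < dist.get(v, math.inf):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    return dist, prev


def steiner_tree_heuristic(edge_probs, terminals):
    """Connect all terminals (hits), preferring high-probability interactions."""
    graph = {}
    for (u, v), p in edge_probs.items():
        cost = -math.log(p)  # probability in (0, 1] gives a non-negative cost
        graph.setdefault(u, {})[v] = cost
        graph.setdefault(v, {})[u] = cost

    terminals = list(terminals)
    tree_nodes = {terminals[0]}
    tree_edges = set()
    remaining = set(terminals[1:])
    while remaining:
        dist, prev = shortest_paths(graph, tree_nodes)
        reachable = [t for t in remaining if t in dist]
        if not reachable:
            raise ValueError("some terminals cannot be connected to the tree")
        node = min(reachable, key=lambda t: (dist[t], t))
        # Walk back along the shortest path until the existing tree is reached.
        while node not in tree_nodes:
            parent = prev[node]
            tree_edges.add(frozenset((node, parent)))
            tree_nodes.add(node)
            node = parent
        remaining -= tree_nodes
    return tree_nodes, tree_edges


# Hypothetical interaction network: A, C and D are hits; B was not measured.
edge_probs = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.9,
    ("A", "C"): 0.2,
    ("B", "D"): 0.8,
    ("C", "D"): 0.1,
}
nodes, edges = steiner_tree_heuristic(edge_probs, ["A", "C", "D"])
```

In this toy network the heuristic routes all three hits through B, a node that was not itself a hit, which mirrors how such a method can expose hidden components of a response.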

    The influence of genetic variation in gene expression

    Variations in gene expression have long been hypothesised to be the major cause of individual differences. An initial focus of this research thesis is to elucidate the genetic regulatory architecture of gene expression. Expression quantitative trait locus (eQTL) mapping analyses have been performed on expression levels of over 22,000 mRNAs from three tissues of a panel of recombinant inbred mice. These analyses are "single-locus", where "linkage" (i.e. significant correlation) between an expression trait and a putative eQTL is considered independently of other loci. Major conclusions from these analyses are: 1. Gene expression is mainly influenced by genetic (sequence) variations that act in trans rather than in cis; 2. Subsets of genes are controlled by master regulators that influence multiple genes; 3. Gene expression is a polygenic trait with multiple regulators. Single-locus mapping analyses are not designed for detecting multiple regulators of gene expression, and so the observation of multiple-linkages (i.e. one expression trait mapped to multiple eQTLs) formed the basis of the second objective of this research project: to investigate the relationship between multiple-linkages and genotype pattern-association. A locus-pair is said to have associated genotype patterns if the two loci have similar inheritance patterns across a panel of individuals, and these are attributed to one of four sources: 1. linkage disequilibrium between loci located on the same chromosome; 2. non-syntenic association; 3. random association; 4. un-associated. To understand the validity of multiple-linkages observed in single-locus mapping studies, a newly developed method, bqtl.twolocus, is applied to confirm two-locus effects for a total of 898 out of 1,233 multiple-linkages identified from the three studies mentioned above as well as from seven publicly available eQTL-mapping studies. 
Combining these results with information on genotype pattern-association yields a subset of 478 multiple-linkages that are, with high confidence, real.
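
The single-locus scan described in this abstract, in which each locus is tested against an expression trait independently of the others, can be sketched roughly as follows. This is an illustrative assumption, not the thesis's procedure: linkage is approximated by a Pearson correlation exceeding a hypothetical cutoff, whereas a real analysis would use a proper significance threshold. The locus names, genotype codes, and expression values are invented.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def single_locus_scan(genotypes, expression, threshold):
    """Test each locus on its own; keep those with |r| at or above the cutoff."""
    hits = {}
    for locus, codes in genotypes.items():
        r = pearson(codes, expression)
        if abs(r) >= threshold:
            hits[locus] = r
    return hits


# Hypothetical panel of six inbred lines; genotypes coded 0/1 by parental allele.
genotypes = {
    "locus_1": [0, 0, 1, 1, 0, 1],
    "locus_2": [0, 1, 0, 1, 0, 1],
}
expression = [1.0, 1.2, 3.1, 2.9, 1.1, 3.0]

hits = single_locus_scan(genotypes, expression, threshold=0.8)
```

A trait that appears in `hits` for more than one locus would be a multiple-linkage in the abstract's terminology, and the scan by itself cannot tell whether those loci act jointly or merely share a genotype pattern.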

    Simulation and identification of gene regulatory networks

    Gene regulatory networks are a well-established model for representing the functioning, at gene level, of highly elaborate biological networks. Studying and understanding such models of gene communication might enable researchers to better target costly laboratory experiments, e.g. by selecting a small set of genes deemed to be responsible for a particular disease, or by indicating with confidence which molecule is likely to be susceptible to certain drug treatments. This thesis explores two main aspects of gene regulatory networks: (i) the simulation of realistic perturbative and systems genetics experiments in gene networks, and (ii) the inference of gene networks from simulated and real data measurements. In detail, the following themes are discussed: (i) SysGenSIM, open source software to produce gene networks with realistic topology and simulate systems genetics or targeted perturbative experiments; (ii) two state-of-the-art algorithms for the structural identification of gene networks from single-gene knockout measurements; (iii) an approach to reverse-engineering gene networks from heterogeneous compendia; (iv) a methodology to infer gene interactions from systems genetics datasets. These works have been positively recognized by the scientific community. In particular, SysGenSIM has been used, in addition to providing valuable test benches for the development of the above inference algorithms, to generate benchmark datasets for international competitions such as the DREAM5 Systems Genetics challenge and the StatSeq workshop. The identification methodologies proved their worth by accurately reverse-engineering gene networks at established contests, namely the DREAM Network Inference challenges. Results are explained and discussed thoroughly in the thesis.
