237 research outputs found

    Gene communities in co-expression networks across different tissues

    With the recent availability of tissue-specific gene expression data, e.g., provided by the GTEx Consortium, there is interest in comparing gene co-expression patterns across tissues. One promising approach to this problem is to use a multilayer network analysis framework and perform multilayer community detection. Communities in gene co-expression networks reveal groups of genes similarly expressed across individuals, potentially involved in related biological processes responding to specific environmental stimuli or sharing common regulatory variations. We construct a multilayer network in which each of the four layers is an exocrine gland tissue-specific gene co-expression network. We develop methods for multilayer community detection with correlation matrix input and an appropriate null model. Our correlation matrix input method identifies five groups of genes that are similarly co-expressed in multiple tissues (a community that spans multiple layers, which we call a generalist community) and two groups of genes that are co-expressed in just one tissue (a community that lies primarily within just one layer, which we call a specialist community). We further find gene co-expression communities where the genes physically cluster across the genome significantly more than expected by chance (on chromosomes 1 and 11). This clustering hints at underlying regulatory elements determining similar expression patterns across individuals and cell types. We suggest that KRTAP3-1, KRTAP3-3, and KRTAP3-5 share regulatory elements in skin and pancreas. Furthermore, we find that CELA3A and CELA3B share associated expression quantitative trait loci in the pancreas. The results indicate that our multilayer community detection method for correlation matrix input extracts biologically interesting communities of genes.
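
As a rough illustration of the "one co-expression network per tissue" idea in this abstract, the sketch below builds a Pearson correlation layer for each tissue from toy expression values and lists the layers in which a gene pair is strongly co-expressed. It is not the authors' multilayer community detection method or null model. The gene names, the toy values, the `0.8` threshold, and the helper names are all hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length value lists (non-constant input assumed)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_layer(expression):
    """Map each unordered gene pair to its correlation across individuals."""
    genes = sorted(expression)
    return {
        (g, h): pearson(expression[g], expression[h])
        for i, g in enumerate(genes)
        for h in genes[i + 1:]
    }


def layers_where_coexpressed(layers, pair, threshold=0.8):
    """Tissues (layers) in which the signed correlation of `pair` reaches the threshold."""
    return [tissue for tissue, corr in layers.items() if corr[pair] >= threshold]


# Toy expression values: tissue -> gene -> expression across four individuals.
toy_expression = {
    "pancreas": {"GENE_A": [1, 2, 3, 4], "GENE_B": [2, 4, 6, 8], "GENE_C": [4, 1, 3, 2]},
    "skin": {"GENE_A": [1, 2, 3, 4], "GENE_B": [3, 1, 4, 2], "GENE_C": [4, 3, 2, 1]},
}

layers = {tissue: coexpression_layer(expr) for tissue, expr in toy_expression.items()}
print(layers_where_coexpressed(layers, ("GENE_A", "GENE_B")))
```

In this toy setting, a pair that passes the threshold in a single layer loosely mirrors the "specialist" pattern, and a pair that passes in several layers mirrors the "generalist" pattern. The abstract defines both terms for communities rather than for individual gene pairs.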

    Model-based approaches for the detection of biologically active genomic regions from next generation sequencing data

    Next Generation Sequencing (NGS) technologies are quickly gaining popularity in biomedical research. A popular application of NGS is to detect potential gene regulatory elements that are captured or enriched by certain experimental procedures, for example, Chromatin Immunoprecipitation (ChIP-seq), DNase hypersensitive site mapping (DNase-seq), and Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), among others. While ChIP-seq can be used to identify protein-DNA interaction sites, both DNase-seq and FAIRE-seq can be used to identify open chromatin regions, which are more likely to contain elements involved in gene expression regulation. We collectively refer to these types of sequencing data as DAE-seq, where DAE stands for DNA After Enrichment. DAE-seq data can provide important insight into gene regulation, which is crucial to understanding the molecular mechanism of phenotypic outcomes, such as complex diseases. Here we address several practical issues facing biomedical researchers in the analysis of DAE-seq data through the development of several new and relevant statistical methods. We first introduce a three-component mixture regression model to discover "enriched regions", i.e., the genomic regions with more DAE-seq signal than expected in relation to background regions. We demonstrate its practical utility and accuracy in detecting regions of active regulatory elements across a wide range of commonly used DAE-seq datasets and experimental conditions. We then develop a novel Autoregressive Hidden Markov Model (AR-HMM) to account for often-ignored spatial dependence in DAE-seq data, and demonstrate that accounting for such dependence leads to increased performance in identifying biologically active genomic regions in both simulated and real datasets.
We also introduce an efficient and novel variable selection procedure in the context of Hidden Markov Models when the means of the emission distributions of each state are modelled with covariates. We study the asymptotic properties of the proposed variable selection procedure and apply this approach to simulated and real DAE-seq data. Lastly, we introduce a new method for the joint analysis of total and allele-specific read counts from DAE-seq data and RNA-seq data. In all, we develop several statistical procedures for the analysis of DAE-seq data that are highly relevant to biomedical researchers and have broader applicability to other problems in statistics. Doctor of Philosophy

    Analysis of High-dimensional and Left-censored Data with Applications in Lipidomics and Genomics

    Recently, new kinds of high-throughput measurement techniques have emerged, enabling biological research to focus on fundamental building blocks of living organisms such as genes, proteins, and lipids. Alongside this new type of data, referred to as omics data, modern data analysis techniques have emerged. Much of this research focuses on finding biomarkers for detecting abnormalities in a person's health status and on learning unobservable network structures representing functional associations of biological regulatory systems. Omics data have certain specific qualities, such as left-censored observations due to the limitations of the measurement instruments, missing data, non-normal observations, and very large dimensionality, and the interest often lies in the connections between the large number of variables. There are two major aims in this thesis. The first is to provide efficient methodology for dealing with various types of missing or censored omics data that can be used for visualisation and biomarker discovery based on, for example, regularised regression techniques. A maximum-likelihood-based covariance estimation method for data with censored values is developed and the algorithms are described in detail. The second major aim is to develop novel approaches for detecting interactions displaying functional associations from large-scale observations. For more complicated data connections, a technique based on partial least squares regression is investigated. The technique is applied to network construction as well as to differential network analyses, both on multiply imputed censored data and on next-generation sequencing count data.
    [Translated from the Finnish abstract:] New measurement technologies have made it possible to build a more comprehensive understanding of the molecular-level processes of living organisms. So-called omics technologies, such as genomics, proteomics, and lipidomics, can produce vast amounts of measurement data on the expression or concentration levels of individual genes, proteins, and lipids with unprecedented precision. At the same time, the need to develop new analysis methods has grown. Particular interest has focused on identifying biomarkers that predict the risk or prognosis of certain diseases and on reconstructing biological networks. Omics data sets have several special characteristics that limit the direct and efficient application of conventional methods. The most important of these are left-censored and missing observations and the large number of observed variables. The first aim of this dissertation is to provide tailored analysis methods for visualising incomplete omics data sets and for model selection, using, for example, regularised regression models. We also describe a maximum likelihood estimator of the covariance matrix suited to censored data. The second aim is to develop new methods for examining the association structures of omics data sets. For examining, visualising, and comparing more complex structures, we present several variations of an algorithm based on the partial least squares method, which can be used to reconstruct association networks for both multiply imputed censored data and count data. Transferred from Doria
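
To make the notion of maximum likelihood with left-censored values concrete, here is a minimal univariate sketch. It is a simplification and not the thesis's covariance estimator. It assumes normally distributed measurements and a single known detection limit: each observed value contributes the normal density, and each censored value contributes the probability of falling below the limit. The data, the detection limit, the search grid, and the function names are hypothetical, and the crude grid search stands in for a proper optimiser.

```python
from math import log
from statistics import NormalDist


def censored_loglik(mu, sigma, observed, n_censored, limit):
    """Log-likelihood of a normal sample with left-censoring at `limit`."""
    dist = NormalDist(mu, sigma)
    # Fully observed values contribute the density at each value.
    loglik = sum(log(dist.pdf(x)) for x in observed)
    # Censored values are only known to lie below the detection limit.
    loglik += n_censored * log(dist.cdf(limit))
    return loglik


def fit_by_grid(observed, n_censored, limit, mus, sigmas):
    """Return the (mu, sigma) grid point with the highest log-likelihood."""
    return max(
        ((m, s) for m in mus for s in sigmas),
        key=lambda p: censored_loglik(p[0], p[1], observed, n_censored, limit),
    )


# Toy data: five measured values and three values below a detection limit of 1.0.
observed = [1.2, 1.5, 2.0, 2.4, 3.1]
n_censored = 3
limit = 1.0

mus = [i / 10 for i in range(0, 31)]      # candidate means 0.0 .. 3.0
sigmas = [i / 10 for i in range(5, 21)]   # candidate standard deviations 0.5 .. 2.0

mu_hat, sigma_hat = fit_by_grid(observed, n_censored, limit, mus, sigmas)
naive_mean = sum(observed) / len(observed)
print(mu_hat, sigma_hat, naive_mean)
```

Because the censored observations enter the likelihood, the estimated mean falls below the naive mean computed from the measured values alone, which ignores everything under the detection limit.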

    IsoDOT Detects Differential RNA-isoform Expression/Usage with respect to a Categorical or Continuous Covariate with High Sensitivity and Specificity

    We have developed a statistical method named IsoDOT to assess differential isoform expression (DIE) and differential isoform usage (DIU) using RNA-seq data. Here isoform usage refers to relative isoform expression given the total expression of the corresponding gene. IsoDOT performs two tasks that cannot be accomplished by existing methods: to test DIE/DIU with respect to a continuous covariate, and to test DIE/DIU for one case versus one control. The latter task is not an uncommon situation in practice, e.g., comparing the paternal and maternal alleles of one individual or comparing tumor and normal samples of one cancer patient. Simulation studies demonstrate the high sensitivity and specificity of IsoDOT. We apply IsoDOT to study the effects of haloperidol treatment on the mouse transcriptome and identify a group of genes whose isoform usages respond to haloperidol treatment.
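
The distinction between isoform expression and isoform usage can be illustrated with a small calculation. This is not IsoDOT itself. It only applies the definition given in the abstract, usage as an isoform's share of the gene's total expression, and the isoform names and counts are made up.

```python
def isoform_usage(counts):
    """Return each isoform's share of the gene's total expression."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("gene has no expression; usage is undefined")
    return {isoform: count / total for isoform, count in counts.items()}


# Hypothetical read counts for one gene with two isoforms.
control = {"iso1": 30, "iso2": 70}
case_scaled = {"iso1": 60, "iso2": 140}    # every isoform doubles
case_switched = {"iso1": 70, "iso2": 30}   # same total, different mix

print(isoform_usage(control))
print(isoform_usage(case_scaled))
print(isoform_usage(case_switched))
```

In `case_scaled` the expression of each isoform differs from the control while the usage is unchanged, which corresponds to DIE without DIU. In `case_switched` the gene's total expression is the same but the usage differs.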

    FCAT: A FLEXIBLE CLASSIFICATION TOOLBOX FOR SIGNAL DETECTION IN HIGH-THROUGHPUT SEQUENCING DATA

    As applications of high-throughput sequencing technologies continue to grow at a fast rate, being able to conveniently develop effective data analysis solutions that can take full advantage of application-specific data characteristics is becoming increasingly important. FCAT is a flexible classification framework and toolbox for signal detection in a wide class of high-throughput sequencing applications where the objective is to locate signals in the genome based on their enrichment, shape and other features. FCAT takes aligned sequence reads (BAM files) as input. It uses supervised learning to automatically extract application-specific features that distinguish signals from noise. Users can aggregate multiple learning algorithms, including random forests and L1- and L2-regularized logistic regression, to improve prediction accuracy and robustness. A non-parametric inference method is developed for estimating the false discovery rate of prediction results. We demonstrate FCAT through a variety of applications including analyses of DNase-seq, ATAC-seq, ChIP-seq, GRO-seq and TIP-seq data. We show that FCAT not only offers flexibility and convenience to handle data from different sequencing applications, but also yields competitive or improved signal detection accuracy compared to existing tools for each application. The FCAT framework can greatly increase the efficiency and reduce the burden of developing bioinformatics solutions to new sequencing applications. FCAT is an open source software package developed using C++ and Python. It is freely available at https://github.com/HeBing/FCAT

    Revealing the vectors of cellular identity with single-cell genomics

    Single-cell genomics has now made it possible to create a comprehensive atlas of human cells. At the same time, it has reopened definitions of a cell's identity and of the ways in which identity is regulated by the cell's molecular circuitry. Emerging computational analysis methods, especially in single-cell RNA sequencing (scRNA-seq), have already begun to reveal, in a data-driven way, the diverse simultaneous facets of a cell's identity, from discrete cell types to continuous dynamic transitions and spatial locations. These developments will eventually allow a cell to be represented as a superposition of 'basis vectors', each determining a different (but possibly dependent) aspect of cellular organization and function. However, computational methods must also overcome considerable challenges, from handling technical noise and data scale to forming new abstractions of biology. As the scale of single-cell experiments continues to increase, new computational approaches will be essential for constructing and characterizing a reference map of cell identities. National Institutes of Health (U.S.) (grant P50 HG006193); BRAIN Initiative (grant U01 MH105979); National Institutes of Health (U.S.) (BRAIN grant 1U01MH105960-01); National Cancer Institute (U.S.) (grant 1U24CA180922); National Institute of Allergy and Infectious Diseases (U.S.) (grant 1U24AI118672-01)

    Inferring community-driven structure in complex networks
