Gene communities in co-expression networks across different tissues
With the recent availability of tissue-specific gene expression data, e.g.,
provided by the GTEx Consortium, there is interest in comparing gene
co-expression patterns across tissues. One promising approach to this problem
is to use a multilayer network analysis framework and perform multilayer
community detection. Communities in gene co-expression networks reveal groups
of genes similarly expressed across individuals, potentially involved in
related biological processes responding to specific environmental stimuli or
sharing common regulatory variations. We construct a multilayer network in
which each of the four layers is an exocrine gland tissue-specific gene
co-expression network. We develop methods for multilayer community detection
with correlation matrix input and an appropriate null model. Our correlation
matrix input method identifies five groups of genes that are similarly
co-expressed in multiple tissues (a community that spans multiple layers, which
we call a generalist community) and two groups of genes that are co-expressed
in just one tissue (a community that lies primarily within just one layer,
which we call a specialist community). We further found gene co-expression
communities where the genes physically cluster across the genome significantly
more than expected by chance (on chromosomes 1 and 11). This clustering hints
at underlying regulatory elements determining similar expression patterns
across individuals and cell types. We suggest that KRTAP3-1, KRTAP3-3, and
KRTAP3-5 share regulatory elements in skin and pancreas. Furthermore, we find
that CELA3A and CELA3B share associated expression quantitative trait loci in
the pancreas. The results indicate that our multilayer community detection
method for correlation matrix input extracts biologically interesting
communities of genes.
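The network construction step described in this abstract can be sketched as follows. This is a minimal illustration only: it builds one Pearson co-expression matrix per tissue and stacks them into a multilayer (supra-adjacency) matrix. The function names, the toy expression values, and the interlayer coupling weight `omega` are assumptions made for the example. The abstract does not specify the authors' null model or community detection procedure, so neither is reproduced here.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def coexpression_layer(expr):
    """expr maps gene -> expression values across individuals.
    Returns (genes, matrix) where matrix[i][j] is the Pearson correlation."""
    genes = sorted(expr)
    matrix = [[pearson(expr[g], expr[h]) for h in genes] for g in genes]
    return genes, matrix

def supra_adjacency(layers, omega=1.0):
    """Stack per-tissue matrices into one block matrix.
    Diagonal blocks hold intralayer correlations (self-loops set to 0);
    copies of the same gene in different layers are linked with weight omega.
    Assumes every layer uses the same gene order."""
    n, n_layers = len(layers[0]), len(layers)
    size = n * n_layers
    supra = [[0.0] * size for _ in range(size)]
    for a, matrix in enumerate(layers):
        for i in range(n):
            for j in range(n):
                if i != j:
                    supra[a * n + i][a * n + j] = matrix[i][j]
            for b in range(n_layers):
                if b != a:
                    supra[a * n + i][b * n + i] = omega
    return supra

# Toy data (hypothetical): three genes measured in four individuals, two tissues.
tissue_a = {"G1": [1, 2, 3, 4], "G2": [2, 4, 6, 8], "G3": [4, 3, 2, 1]}
tissue_b = {"G1": [1, 2, 3, 4], "G2": [1, 3, 2, 4], "G3": [2, 1, 4, 3]}
genes_a, mat_a = coexpression_layer(tissue_a)
genes_b, mat_b = coexpression_layer(tissue_b)
supra = supra_adjacency([mat_a, mat_b], omega=0.5)
```

A multilayer community detection method would then operate on a matrix like `supra`, or on the correlation matrices directly as the abstract describes.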
Model-based approaches for the detection of biologically active genomic regions from next generation sequencing data
Next Generation Sequencing (NGS) technologies are quickly gaining popularity in biomedical research. A popular application of NGS is to detect potential gene regulatory elements that are captured or enriched by certain experimental procedures, for example, Chromatin Immunoprecipitation (ChIP-seq), DNase hypersensitive site mapping (DNase-seq), and Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE-seq), among others. While ChIP-seq can be used to identify protein-DNA interaction sites, both DNase-seq and FAIRE-seq can be used to identify open chromatin regions, which are more likely to contain elements involved in gene expression regulation. We collectively refer to these types of sequencing data as DAE-seq, where DAE stands for DNA After Enrichment. DAE-seq data can provide important insight into gene regulation, which is crucial to understanding the molecular mechanism of phenotypic outcomes, such as complex diseases. Here we address several practical issues facing biomedical researchers in the analysis of DAE-seq data through the development of several new and relevant statistical methods. We first introduce a three-component mixture regression model to discover "enriched regions", i.e., the genomic regions with more DAE-seq signal than expected in relation to background regions. We demonstrate its practical utility and accuracy in detecting regions of active regulatory elements across a wide range of commonly used DAE-seq datasets and experimental conditions. We then develop a novel Autoregressive Hidden Markov Model (AR-HMM) to account for often-ignored spatial dependence in DAE-seq data, and demonstrate that accounting for such dependence leads to increased performance in identifying biologically active genomic regions in both simulated and real datasets. We also introduce an efficient and novel variable selection procedure in the context of Hidden Markov Models when the means of the emission distributions of each state are modelled with covariates.
We study the asymptotic properties of the proposed variable selection procedure and apply this approach to simulated and real DAE-seq data. Lastly, we introduce a new method for the joint analysis of total and allele-specific read counts from DAE-seq data and RNA-seq data. In all, we develop several statistical procedures for the analysis of DAE-seq data that are highly relevant to biomedical researchers and have broader applicability to other problems in statistics.
Doctor of Philosophy
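To illustrate why modelling spatial dependence helps when calling enriched regions, here is a deliberately simplified sketch: a plain two-state hidden Markov model (background versus enriched) with Poisson emissions, decoded with the Viterbi algorithm. It is not the thesis's AR-HMM or its mixture regression model. The rates `lam_bg` and `lam_enriched`, the self-transition probability `p_stay`, and the toy counts are all assumptions chosen for the example.

```python
import math

def poisson_logpmf(k, lam):
    """Log probability of observing count k under a Poisson(lam) distribution."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def viterbi_two_state(counts, lam_bg=2.0, lam_enriched=10.0, p_stay=0.9):
    """Most likely state path (0 = background, 1 = enriched) for a
    non-empty sequence of per-bin read counts."""
    rates = [lam_bg, lam_enriched]
    log_stay, log_switch = math.log(p_stay), math.log(1.0 - p_stay)
    # Uniform initial state distribution.
    score = [math.log(0.5) + poisson_logpmf(counts[0], r) for r in rates]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for k in counts[1:]:
        new_score, pointers = [], []
        for s in (0, 1):
            candidates = [
                score[p] + (log_stay if p == s else log_switch) for p in (0, 1)
            ]
            best = 0 if candidates[0] >= candidates[1] else 1
            pointers.append(best)
            new_score.append(candidates[best] + poisson_logpmf(k, rates[s]))
        back.append(pointers)
        score = new_score
    # Trace the best path backwards from the best final state.
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy per-bin read counts (hypothetical): a short enriched stretch in the middle.
counts = [1, 2, 0, 12, 9, 11, 1, 2]
states = viterbi_two_state(counts)
```

Because leaving a state is penalised through `p_stay`, neighbouring bins tend to share a label, which is the basic intuition behind accounting for spatial dependence.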
Analysis of High-dimensional and Left-censored Data with Applications in Lipidomics and Genomics
Recently, new kinds of high-throughput measurement techniques have emerged, enabling biological research to focus on fundamental building blocks of living organisms such as genes, proteins, and lipids. Alongside this new type of data, referred to as omics data, modern data analysis techniques have been developed. Much of this research focuses on finding biomarkers for detecting abnormalities in a person's health status, as well as on learning unobservable network structures representing functional associations in biological regulatory systems. Omics data have specific qualities, such as left-censored observations due to the limitations of the measurement instruments, missing data, non-normal observations, and very high dimensionality, and the interest often lies in the connections between the large number of variables.
There are two major aims in this thesis. The first is to provide efficient methodology for dealing with various types of missing or censored omics data that can be used for visualisation and biomarker discovery based on, for example, regularised regression techniques. A maximum-likelihood-based covariance estimation method for data with censored values is developed, and the algorithms are described in detail. The second major aim is to develop novel approaches for detecting interactions displaying functional associations from large-scale observations. For more complicated data connections, a technique based on partial least squares regression is investigated. The technique is applied to network construction as well as to differential network analyses, both on multiply imputed censored data and on next-generation sequencing count data.
New measurement technologies have made it possible to build a more comprehensive understanding of the molecular-level processes of living organisms. So-called omics technologies, such as genomics, proteomics, and lipidomics, can produce vast amounts of measurement data on the expression or concentration levels of individual genes, proteins, and lipids with unprecedented precision. At the same time, the need to develop new analysis methods has grown. Particular interest has focused on identifying biomarkers that predict the risk or prognosis of certain diseases, and on reconstructing biological networks.
Omics data sets have several special characteristics that limit the direct and efficient application of conventional methods. The most important of these are left-censored and missing observations, and the large number of observed variables. The first aim of this thesis is to provide tailored analysis methods for the visualisation of incomplete omics data sets and for model selection, using, for example, regularised regression models. We also describe a maximum likelihood estimator of the covariance matrix suited to censored data. The second aim is to develop new methods for examining the association structures of omics data sets. For examining, visualising, and comparing more complex structures, we present several variations of an algorithm based on the partial least squares method, with which association networks can be reconstructed for both multiply imputed censored data and count data.
Transferred from Doria
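The idea behind likelihood-based handling of left-censored measurements can be sketched in one dimension. The example below is an assumption-laden simplification, not the thesis's covariance estimator. It models a single variable as normal, lets each fully observed value contribute its density and each value below the detection limit contribute the probability of falling below that limit, and maximises the log-likelihood with a coarse grid search. The grid ranges, the detection limit, and the toy data are hypothetical.

```python
import math

def censored_normal_loglik(mu, sigma, observed, n_censored, limit):
    """Log-likelihood of a normal model with left-censoring at `limit`.
    Observed values contribute the normal log-density; each censored value
    contributes log P(X <= limit)."""
    ll = 0.0
    for x in observed:
        z = (x - mu) / sigma
        ll += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)
    if n_censored:
        z_lim = (limit - mu) / sigma
        cdf = 0.5 * math.erfc(-z_lim / math.sqrt(2.0))
        ll += n_censored * math.log(cdf)
    return ll

def fit_by_grid(observed, n_censored, limit):
    """Coarse grid search over mu in [0.50, 2.50] and sigma in [0.20, 2.00]
    (step 0.01). The ranges suit the toy data below only."""
    best_ll, best_mu, best_sigma = -math.inf, None, None
    for i in range(50, 251):
        mu = i / 100
        for j in range(20, 201):
            sigma = j / 100
            ll = censored_normal_loglik(mu, sigma, observed, n_censored, limit)
            if ll > best_ll:
                best_ll, best_mu, best_sigma = ll, mu, sigma
    return best_mu, best_sigma

# Toy data (hypothetical): six measured values, four below a detection limit of 1.0.
observed = [1.2, 1.5, 1.8, 2.1, 2.4, 2.7]
n_censored, limit = 4, 1.0
mu_hat, sigma_hat = fit_by_grid(observed, n_censored, limit)

# Naive alternative: replace every censored value by the detection limit.
naive_mean = (sum(observed) + n_censored * limit) / (len(observed) + n_censored)
```

Substituting the detection limit treats censored values as if they sat exactly at the limit, whereas the likelihood allows them to lie anywhere below it, so `mu_hat` comes out lower than `naive_mean` on this toy data.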
The interplay of global chromosomal organisation, promoter-enhancer interactions and transcription
All somatic cells within an organism contain the same genetic material, yet they display pronounced differences in function and morphology. Precise control of gene expression is of fundamental importance to allow cells to properly develop, maintain homeostasis, and respond to external stimuli. The first step in gene expression is transcription, which starts at the core promoter region. While core promoters are crucial for transcriptional initiation, they are insufficient for establishing complex tissue- and condition-specific gene expression patterns in multicellular organisms.
Additional transcriptional control elements, such as gene enhancers, are required for this, with many such elements localising considerable distances away from their target promoters. Enhancers commonly convey their regulatory signals to target promoters by forming physical contacts with them through three-dimensional DNA looping, underpinning the importance of chromosomal organisation in transcriptional control. In recent years, the emergence of chromosome conformation capture and
related methodologies has dramatically increased our understanding of chromosomal organisation. In particular, high-throughput Hi-C analyses across cell types have led to the identification of spatial genomic structures, including Topologically Associating Domains (TADs). In parallel, high-resolution versions of these technologies (such as 5C, ChIA-PET, HiChIP and Capture Hi-C) have detected multitudes of novel looping interactions, including connections between promoters and enhancers. The interplay between precise regulatory interactions, the higher-order chromosomal organisation, and their joint contribution to transcriptional
control is incompletely understood and is the focus of this work. In the first part of this work, I take advantage of high-resolution Promoter Capture
Hi-C (PCHi-C) data to investigate the localisation of promoter interactions with respect to TAD boundaries in human primary blood cells and cell-cycle synchronised HeLa cells. I show that the majority of promoter interactions originate at, and are constrained by, TAD boundaries. However, a minority of promoter interactions appear to cross TAD boundaries in all analysed cell types. Furthermore, I identify genes with multiple TAD-boundary crossing interactions per promoter and present evidence that
these interactions may be supported by transcriptional machinery. These results suggest a role for transcriptional machinery in shaping promoter interactions in a TAD-independent manner. In the second part of this work, I investigate promoter interaction rewiring upon perturbations of architectural proteins. For this analysis, I use PCHi-C data from HeLa cells, in which cohesin or CTCF is rapidly depleted using auxin-induced degradation. I show that promoter interactions that are lost, maintained, or gained upon cohesin depletion possess distinct distance profiles and relate to TAD organisation in markedly different ways. I demonstrate that promoter-interacting regions that are lost upon cohesin depletion associate with architectural proteins, while those that are maintained or gained show characteristics of enhancers. Finally, I show evidence for a functional role of cohesin-mediated interactions in transcriptional regulation. Collectively, this work reveals the interplay between TADs, promoter interactions and transcription, while suggesting that promoter interactions may be supported by TAD-independent mechanisms.
MRC DTP studentship
BBS/E/B/000M0816 and 164259
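Whether a promoter interaction crosses a TAD boundary can be checked with a small amount of interval logic. The sketch below is an assumption-based simplification rather than the thesis's analysis pipeline. It treats one chromosome, represents TADs as the contiguous intervals between sorted boundary coordinates, and reduces each interaction to two point coordinates. The function names and all coordinates are hypothetical.

```python
from bisect import bisect_right

def tad_index(position, boundaries):
    """Index of the inter-boundary interval containing `position`.
    `boundaries` is a sorted list of boundary coordinates on one chromosome."""
    return bisect_right(boundaries, position)

def crosses_boundary(promoter_pos, other_end_pos, boundaries):
    """True if the two ends of an interaction fall in different intervals."""
    return tad_index(promoter_pos, boundaries) != tad_index(other_end_pos, boundaries)

# Toy data (hypothetical): three boundaries and three promoter interactions,
# each given as (promoter position, interacting-region position).
boundaries = [100_000, 500_000, 900_000]
interactions = [(120_000, 450_000), (480_000, 620_000), (950_000, 910_000)]

flags = [crosses_boundary(p, o, boundaries) for p, o in interactions]
fraction_crossing = sum(flags) / len(flags)
```

On real data, fragment widths, boundary resolution, and multiple chromosomes would all need handling. The sketch only shows the basic classification into constrained and boundary-crossing interactions.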
IsoDOT Detects Differential RNA-isoform Expression/Usage with respect to a Categorical or Continuous Covariate with High Sensitivity and Specificity
We have developed a statistical method named IsoDOT to assess differential
isoform expression (DIE) and differential isoform usage (DIU) using RNA-seq
data. Here isoform usage refers to relative isoform expression given the total
expression of the corresponding gene. IsoDOT performs two tasks that cannot be
accomplished by existing methods: to test DIE/DIU with respect to a continuous
covariate, and to test DIE/DIU for one case versus one control. The latter task
is not an uncommon situation in practice, e.g., comparing the paternal and
maternal alleles of one individual or comparing tumor and normal samples of one
cancer patient. Simulation studies demonstrate the high sensitivity and
specificity of IsoDOT. We apply IsoDOT to study the effects of haloperidol
treatment on the mouse transcriptome and identify a group of genes whose
isoform usages respond to haloperidol treatment.
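The distinction between isoform expression and isoform usage can be made concrete with a short calculation. The sketch below only illustrates the definition given in the abstract, namely usage as isoform expression relative to the total expression of the gene. It does not implement IsoDOT's statistical test, and the function name, isoform labels, and expression values are hypothetical.

```python
def isoform_usage(isoform_expression):
    """Convert absolute isoform expression into usage: each isoform's share
    of the total expression of its gene."""
    total = sum(isoform_expression.values())
    if total == 0:
        return {iso: 0.0 for iso in isoform_expression}
    return {iso: value / total for iso, value in isoform_expression.items()}

# Toy data (hypothetical): one gene with two isoforms in two conditions.
control = {"isoA": 30.0, "isoB": 10.0}
treated = {"isoA": 60.0, "isoB": 60.0}

usage_control = isoform_usage(control)
usage_treated = isoform_usage(treated)
```

In this toy example the expression of `isoA` doubles between conditions while its usage falls from 0.75 to 0.5, which is why differential isoform expression and differential isoform usage are tested as separate questions.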
FCAT: A FLEXIBLE CLASSIFICATION TOOLBOX FOR SIGNAL DETECTION IN HIGH-THROUGHPUT SEQUENCING DATA
As applications of high-throughput sequencing technologies continue to grow at a fast rate, being able to conveniently develop effective data analysis solutions that can take full advantage of application-specific data characteristics is becoming increasingly important. FCAT is a flexible classification framework and toolbox for signal detection in a wide class of high-throughput sequencing applications where the objective is to locate signals in the genome based on their enrichment, shape and other features. FCAT takes aligned sequence reads (BAM files) as input. It uses supervised learning to automatically extract application-specific features that distinguish signals from noise. Users can aggregate multiple learning algorithms, including random forests and L1- and L2-regularized logistic regression, to improve prediction accuracy and robustness. A non-parametric inference method is developed for estimating the false discovery rate of prediction results. We demonstrate FCAT through a variety of applications including analyses of DNase-seq, ATAC-seq, ChIP-seq, GRO-seq and TIP-seq data. We show that FCAT not only offers flexibility and convenience to handle data from different sequencing applications, but also yields competitive or improved signal detection accuracy compared to existing tools for each application. The FCAT framework can greatly increase the efficiency and reduce the burden for developing bioinformatics solutions to new sequencing applications. FCAT is an open source software package developed using C++ and Python. It is freely available at https://github.com/HeBing/FCAT
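Two ideas from this abstract, aggregating several classifiers and estimating a false discovery rate without distributional assumptions, can be sketched generically. The code below is not FCAT's implementation or API, and the abstract does not describe its actual inference method. The averaging rule, the use of scores from control (null) regions, the function names, and every number are assumptions made purely for illustration.

```python
def average_scores(per_model_scores):
    """Average the scores that several models assign to the same candidates.
    per_model_scores: list of equal-length score lists, one list per model."""
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]

def empirical_fdr(candidate_scores, null_scores, threshold):
    """Estimate FDR at `threshold` as (expected false calls) / (total calls),
    where the false-call rate is the fraction of null scores >= threshold."""
    n_called = sum(s >= threshold for s in candidate_scores)
    if n_called == 0:
        return 0.0
    null_rate = sum(s >= threshold for s in null_scores) / len(null_scores)
    return min(1.0, null_rate * len(candidate_scores) / n_called)

# Toy data (hypothetical): two models scoring four candidate regions,
# plus scores of the aggregated model on four control regions.
forest_scores = [0.9, 0.2, 0.7, 0.4]
logistic_scores = [0.7, 0.4, 0.9, 0.2]
combined = average_scores([forest_scores, logistic_scores])
null_scores = [0.1, 0.2, 0.3, 0.9]
fdr_at_half = empirical_fdr(combined, null_scores, threshold=0.5)
```

Here two of the four candidates pass the threshold and a quarter of the control scores do as well, so the estimated false discovery rate at 0.5 is one half.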
Revealing the vectors of cellular identity with single-cell genomics
Single-cell genomics has now made it possible to create a comprehensive atlas of human cells. At the same time, it has reopened definitions of a cell's identity and of the ways in which identity is regulated by the cell's molecular circuitry. Emerging computational analysis methods, especially in single-cell RNA sequencing (scRNA-seq), have already begun to reveal, in a data-driven way, the diverse simultaneous facets of a cell's identity, from discrete cell types to continuous dynamic transitions and spatial locations. These developments will eventually allow a cell to be represented as a superposition of 'basis vectors', each determining a different (but possibly dependent) aspect of cellular organization and function. However, computational methods must also overcome considerable challenges, from handling technical noise and data scale to forming new abstractions of biology. As the scale of single-cell experiments continues to increase, new computational approaches will be essential for constructing and characterizing a reference map of cell identities.
National Institutes of Health (U.S.) (grant P50 HG006193)
BRAIN Initiative (grant U01 MH105979)
National Institutes of Health (U.S.) (BRAIN grant 1U01MH105960-01)
National Cancer Institute (U.S.) (grant 1U24CA180922)
National Institute of Allergy and Infectious Diseases (U.S.) (grant 1U24AI118672-01)