52 research outputs found

    Understanding co-expressed gene sets by identifying regulators and modeling genomic elements

    Genomic researchers commonly study complex phenotypes by identifying experimentally derived sets of functionally related genes with similar transcriptional profiles. These gene sets are then frequently subjected to statistical tests of association relating them to previously characterized gene sets from literature and public databases. However, few tools exist for examining the non-coding, regulatory sequence of gene sets for evidence of a shared regulatory signature that may signal the involvement of important DNA-binding proteins called transcription factors (TFs). Here, we proposed and developed new computational methods for identifying major regulatory features of co-expressed gene sets that incorporate TF-DNA binding specificities (“motifs”) with other important features such as sequence conservation and chromatin structure. We additionally demonstrated a novel approach for discovering regulatory signatures that are shared across gene sets from multiple experimental conditions or tissues. Given the co-expressed genes of a particular cell type, we also attempted to annotate their specific regulatory sequences (“enhancers”) by constructing models of enhancer activity that incorporate the expression and binding specificities of the relevant transcription factors. We first developed and tested these models in well-characterized cell types, and then evaluated the extent to which they could be applied, using only minimal experimental evidence, to poorly characterized systems without known transcriptional regulators and functional enhancers. Finally, we developed a network-based algorithm for examining novel gene sets that integrates many diverse types of biological evidence and relationships to better discover functionally related genes. This novel approach processed a comprehensive, heterogeneous network of biological knowledge and ranked genes and molecular properties represented in the network for their relevance to the given set of co-expressed genes.

    Computational Modelling of Human Transcriptional Regulation by an Information Theory-based Approach

    ChIP-seq experiments can identify the genome-wide binding site motifs of a transcription factor (TF) and determine its sequence specificity. Multiple algorithms were developed to derive TF binding site (TFBS) motifs from ChIP-seq data, including the entropy minimization-based Bipad, which can derive both contiguous and bipartite motifs. Prior studies applying these algorithms to ChIP-seq data analyzed only a small number of top peaks with the highest signal strengths, biasing their resultant position weight matrices (PWMs) towards consensus-like, strong binding sites; nor did they derive bipartite motifs, which prevented accurate modelling of the binding behavior of dimeric TFs. This thesis presents a novel motif discovery pipeline that adds recursive masking and thresholding functionalities to Bipad to improve detection of primary binding motifs. Analyzing 765 ENCODE ChIP-seq datasets with this pipeline generated contiguous and bipartite information theory-based PWMs (iPWMs) for 93 sequence-specific TFs, discovered 23 cofactor motifs for 127 TFs and revealed six high-confidence novel motifs. The accuracy of these iPWMs was determined via four independent validation methods, including detection of experimentally proven TFBSs, explanation of effects of characterized SNPs, comparison with previously published motifs and statistical analyses. Novel cofactor motifs supported previously unreported TF coregulatory interactions. This thesis further presents a unified framework to identify variants in hereditary breast and ovarian cancer (HBOC), successfully applying these iPWMs to prioritize TFBS variants in 20 complete genes of HBOC patients. The spatial distribution and information composition of cis-regulatory modules (e.g. TFBS clusters) in promoters substantially determine gene expression patterns and TF target genes. Multiple algorithms were developed to detect TFBS clusters, including the information density-based clustering (IDBC) algorithm that simultaneously considers the spatial and information densities of TFBSs. Prior studies predicting tissue-specific gene expression levels and differentially expressed (DE) TF targets used log likelihood ratios to quantify TFBS strengths and merged adjacent TFBSs into clusters. This thesis presents a machine learning framework that uses the Bray-Curtis function to quantify the similarity between tissue-wide expression profiles of genes, and IDBC-identified clusters from iPWM-detected TFBSs to predict gene expression profiles and DE direct TF targets. Multiple clusters enable gene expression to be robust against TFBS mutations.
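
The two quantities named in this abstract, the information carried by a binding-site model and the Bray-Curtis similarity between expression profiles, can be illustrated with a minimal sketch. It is not the thesis pipeline: the function names are hypothetical, the information measure assumes a uniform DNA background with no small-sample correction, and treating two all-zero profiles as identical is a choice made here for convenience.

```python
import math

def information_content(count_columns):
    """Per-position information (bits) of aligned binding sites.

    count_columns: list of dicts mapping 'A', 'C', 'G', 'T' to counts.
    Assumption: R_i = 2 - H_i (uniform background, no small-sample correction).
    """
    bits = []
    for column in count_columns:
        total = sum(column.values())
        entropy = 0.0
        for base in "ACGT":
            p = column.get(base, 0) / total
            if p > 0:
                entropy -= p * math.log2(p)
        bits.append(2.0 - entropy)
    return bits

def bray_curtis_similarity(profile_a, profile_b):
    """1 - Bray-Curtis dissimilarity for two non-negative expression profiles."""
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same tissues")
    numerator = sum(abs(a - b) for a, b in zip(profile_a, profile_b))
    denominator = sum(a + b for a, b in zip(profile_a, profile_b))
    if denominator == 0:
        return 1.0  # assumption: two all-zero profiles count as identical
    return 1.0 - numerator / denominator

if __name__ == "__main__":
    columns = [{"A": 10}, {"A": 5, "C": 5}, {"A": 1, "C": 1, "G": 1, "T": 1}]
    print(information_content(columns))
    print(bray_curtis_similarity([2, 0], [1, 1]))
```

A fully conserved position carries 2 bits and a position with all four bases equally frequent carries none; summing the per-position values gives the total information of the motif under these assumptions.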

    An Omega-Based Bacterial One-Hybrid System for the Determination of Transcription Factor Specificity

    From the yeast genome completed in 1996 to the 12 Drosophila genomes published earlier this year, little more than a decade has provided an incredible amount of genomic data. Yet even with this mountain of genetic information, the regulatory networks that control gene expression remain relatively undefined. In part, this is due to the enormous amount of non-coding DNA, over 98% of the human genome, which needs to be made sense of. It is also due to the large number of transcription factors, potentially 2,000 such factors in the human genome, which may contribute to any given network directly or indirectly. Certainly, one of the central limitations has been the paucity of transcription factor (TF) specificity data that would aid in the prediction of regulatory targets throughout a genome. The general lack of specificity data has hindered the prediction of regulatory targets for individual TFs as well as groups of factors that function within a common regulatory pathway. A large collection of factor specificities would allow for the combinatorial prediction of regulatory targets that considers all factors actively expressed in a given cell, under a given condition. Herein we describe substantial improvements to a previous bacterial one-hybrid system with increased sensitivity and dynamic range that make it amenable to the high-throughput analysis of sequence-specific TFs. Currently we have characterized 108 (14.3%) of the predicted TFs in Drosophila that fall into a broad range of DNA-binding domain families, demonstrating the feasibility of characterizing a large number of TFs using this technology. To fully exploit our large database of binding specificities, we have created a GBrowse-based search tool that allows an end-user to examine the overrepresentation of binding sites for any number of individual factors as well as combinations of these factors in up to six Drosophila genomes (veda.cs.uiuc.edu/cgi-bin/gbrowse/gbrowse/Dmel4). We have used this tool to demonstrate that a collection of factor specificities within a common pathway will successfully predict previously validated cis-regulatory modules within a genome. Furthermore, within our database we provide a complete catalog of DNA-binding specificities for all 84 homeodomains in Drosophila. This catalog enabled us to propose and test a detailed set of recognition rules for homeodomains and use this information to predict the specificities of the majority of homeodomains in the human genome.

    Exact distribution of a pattern in a set of random sequences generated by a Markov source: applications to biological data

    Background: In bioinformatics it is common to search for a pattern of interest in a potentially large set of rather short sequences (upstream gene regions, proteins, exons, etc.). Although many methodological approaches allow practitioners to compute the distribution of a pattern count in a random sequence generated by a Markov source, no specific developments have taken into account the counting of occurrences in a set of independent sequences. We aim to address this problem by deriving efficient approaches and algorithms to perform these computations both for low and high complexity patterns in the framework of homogeneous or heterogeneous Markov models. Results: The latest advances in the field allowed us to use a technique of optimal Markov chain embedding based on deterministic finite automata to introduce three innovative algorithms. Algorithm 1 is the only one able to deal with heterogeneous models. It also avoids any convolution product of the pattern distributions in individual sequences. When working with homogeneous models, Algorithm 2 yields a dramatic reduction in complexity by taking advantage of previous computations to obtain moment generating functions efficiently. In the particular case of low or moderate complexity patterns, Algorithm 3 exploits power computation and binary decomposition to further reduce the time complexity to a logarithmic scale. All these algorithms and their relative interest in comparison with existing ones were then tested and discussed on a toy example and three biological data sets: structural patterns in protein loop structures, PROSITE signatures in a bacterial proteome, and transcription factors in upstream gene regions. On these data sets, we also compared our exact approaches to the tempting approximation of concatenating the sequences in the data set into a single sequence. Conclusions: Our algorithms prove to be effective and able to handle real data sets with multiple sequences, as well as biological patterns of interest, even when the latter display a high complexity (PROSITE signatures for example). In addition, these exact algorithms allow us to avoid the edge effect observed under the single sequence approximation, which leads to erroneous results, especially when the marginal distribution of the model displays a slow convergence toward the stationary distribution. We conclude with a discussion of our method and its potential improvements.
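
The problem this abstract addresses can be made concrete with a small sketch. It is not one of the paper's three algorithms: it is the straightforward baseline, assuming a single word as the pattern, overlapping occurrences, and a homogeneous first-order Markov source. The count distribution of each sequence is computed exactly by tracking a pattern-matching automaton, and the per-sequence distributions are then convolved. All function and parameter names are hypothetical.

```python
from collections import defaultdict

def next_state(word, state, letter):
    """Length of the longest prefix of `word` that is a suffix of the text read so far."""
    text = word[:state] + letter
    for k in range(min(len(word), len(text)), 0, -1):
        if text.endswith(word[:k]):
            return k
    return 0

def count_distribution(word, length, initial, transition):
    """Exact P(N = k) for overlapping occurrences of `word` in one sequence.

    initial[a] is P(X_1 = a); transition[a][b] is P(X_{t+1} = b | X_t = a).
    """
    alphabet = sorted(initial)
    # Embedded chain: (last letter, automaton state, occurrences so far) -> probability
    current = defaultdict(float)
    for a in alphabet:
        s = next_state(word, 0, a)
        current[(a, s, int(s == len(word)))] += initial[a]
    for _ in range(length - 1):
        following = defaultdict(float)
        for (a, s, k), p in current.items():
            for b in alphabet:
                q = transition[a][b]
                if q == 0.0:
                    continue
                s2 = next_state(word, s, b)
                following[(b, s2, k + int(s2 == len(word)))] += p * q
        current = following
    totals = defaultdict(float)
    for (_, _, k), p in current.items():
        totals[k] += p
    return [totals[k] for k in range(max(totals) + 1)]

def convolve(p, q):
    """Distribution of the sum of two independent counts."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def count_distribution_in_set(word, lengths, initial, transition):
    """Exact count distribution over independent sequences of the given lengths."""
    total = [1.0]
    for length in lengths:
        total = convolve(total, count_distribution(word, length, initial, transition))
    return total

if __name__ == "__main__":
    uniform = {"A": 0.5, "B": 0.5}
    chain = {"A": dict(uniform), "B": dict(uniform)}
    print(count_distribution_in_set("AA", [2, 2], uniform, chain))
    print(count_distribution("AA", 4, uniform, chain))  # concatenation approximation
```

In this toy setting the two length-2 sequences have an expected count of 0.5, while the concatenated length-4 sequence has an expected count of 0.75, because concatenation admits an occurrence that spans the boundary between sequences.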

    Cell type-specific transcriptomic analyses of immunity in Arabidopsis thaliana roots

    Plant roots represent a complex organ consisting of different cell types with highly varied functions. Thus, the response of plant roots to environmental stresses, such as pathogen infection, requires the concerted action of many cell types. Cell type-specific transcriptomic studies are essential to understand stress resistance signalling in such a complex organ. In this thesis, the transcriptomic response to immunity elicitation is examined at the resolution of tissues and individual cell types in two large-scale RNA-seq experiments. Firstly, Fluorescence-Activated Cell Sorting combined with RNA-seq was used to produce the first high-resolution gene expression atlas of plant root immunity. The resulting data set encompassed the transcriptomes of three root cell types which had been treated with two immunity elicitors. Differential gene expression analysis revealed that both immunity elicitors induced a largely cell type-specific response with a comparatively small set of genes differentially expressed in all three cell types. This strong specificity indicates that cell identity is a strong driver of the transcriptomic immune response. Secondly, gene expression in root tips was analysed using the single-cell technique Drop-seq. Clustering methods were used to identify cells from three developmental stages and multiple cell types, and the immune responses were characterised in these tissues. In an effort to interpret and predict immunity network regulation in different cell types, a novel tool entitled the Paired Motif Enrichment Tool (PMET) was developed to investigate gene regulation by combinatorial transcription factor groups. The tool identifies enriched pairs of known regulatory motifs within immune responsive gene sets and revealed that each cell type/immune response combination has a largely unique regulatory landscape. Furthermore, PMET has predicted new roles of transcription factors within immunity networks.
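
The idea of testing motif pairs for enrichment in a gene set can be sketched as follows. This is an illustration only and not PMET's actual procedure, which the abstract does not specify: it assumes a precomputed map from each motif to the genes whose promoters contain a hit, scores each pair with a one-sided hypergeometric test, and omits multiple-testing correction. All names are hypothetical.

```python
from itertools import combinations
from math import comb

def hypergeom_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` genes without replacement."""
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)

def paired_motif_enrichment(motif_hits, gene_set, universe):
    """Rank motif pairs by how strongly their co-occurrence is enriched in gene_set.

    motif_hits: dict mapping motif name -> set of genes with a promoter hit.
    Returns (motif_a, motif_b, overlap, p_value) tuples, smallest p-value first.
    """
    universe = set(universe)
    gene_set = set(gene_set) & universe
    results = []
    for a, b in combinations(sorted(motif_hits), 2):
        both = motif_hits[a] & motif_hits[b] & universe
        overlap = len(both & gene_set)
        p_value = hypergeom_tail(overlap, len(universe), len(both), len(gene_set))
        results.append((a, b, overlap, p_value))
    return sorted(results, key=lambda row: row[3])

if __name__ == "__main__":
    genes = {f"g{i}" for i in range(1, 11)}
    hits = {
        "m1": {"g1", "g2", "g3", "g4"},
        "m2": {"g1", "g2", "g3", "g9"},
        "m3": {"g8", "g9", "g10"},
    }
    for row in paired_motif_enrichment(hits, {"g1", "g2", "g3"}, genes):
        print(row)
```

With many motifs the number of pairs grows quadratically, so a real analysis would need to adjust the p-values for multiple comparisons.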