161 research outputs found

    Network motif-based identification of transcription factor-target gene relationships by integrating multi-source biological data

    Background: Integrating data from multiple global assays and curated databases is essential to understand the spatio-temporal interactions within cells. Different experiments measure cellular processes at various widths and depths, while databases contain biological information based on established facts or published data. Integrating these complementary datasets helps infer a mutually consistent transcriptional regulatory network (TRN) with strong similarity to the structure of the underlying genetic regulatory modules. Decomposing the TRN into a small set of recurring regulatory patterns, called network motifs (NM), facilitates the inference. Identifying NMs defined by specific transcription factors (TF) establishes the framework structure of a TRN and allows the inference of TF-target gene relationships. This paper introduces a computational framework for utilizing data from multiple sources to infer TF-target gene relationships on the basis of NMs. The data include time course gene expression profiles, genome-wide location analysis data, binding sequence data, and gene ontology (GO) information. Results: The proposed computational framework was tested using gene expression data associated with cell cycle progression in yeast. Among 800 cell cycle related genes, 85 were identified as candidate TFs and classified into four previously defined NMs. The NMs for a subset of TFs were obtained from the literature. Support vector machine (SVM) classifiers were used to estimate NMs for the remaining TFs. The potential downstream target genes for the TFs were clustered into 34 biologically significant groups. The relationships between TFs and potential target gene clusters were examined by training recurrent neural networks whose topologies mimic the NMs to which the TFs are classified. 
The identified relationships between TFs and gene clusters were evaluated using the following biological validation and statistical analyses: (1) Gene set enrichment analysis (GSEA) to evaluate the clustering results; (2) Leave-one-out cross-validation (LOOCV) to ensure that the SVM classifiers assign TFs to NM categories with high confidence; (3) Binding site enrichment analysis (BSEA) to determine enrichment of the gene clusters for the cognate binding sites of their predicted TFs; (4) Comparison with previously reported results in the literature to confirm the inferred regulations. Conclusion: The major contribution of this study is the development of a computational framework to assist the inference of TRN by integrating heterogeneous data from multiple sources and by decomposing a TRN into NM-based modules. The inference capability of the proposed framework is verified statistically (e.g., LOOCV) and biologically (e.g., GSEA, BSEA, and literature validation). The proposed framework is useful for inferring small NM-based modules of TF-target gene relationships that can serve as a basis for generating new testable hypotheses.
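The leave-one-out cross-validation step mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: a nearest-centroid classifier stands in for the SVM so the example needs only the standard library, and the function names and feature vectors are hypothetical.

```python
def nearest_centroid_predict(train_x, train_y, x):
    """Stand-in classifier (NOT an SVM): return the label whose mean vector is closest to x."""
    sums, counts = {}, {}
    for features, label in zip(train_x, train_y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, value in enumerate(features):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1

    def squared_distance(label):
        # Squared Euclidean distance from x to the centroid of this label.
        return sum((s / counts[label] - v) ** 2 for s, v in zip(sums[label], x))

    return min(sums, key=squared_distance)


def loocv_accuracy(xs, ys, predict=nearest_centroid_predict):
    """Hold out each sample once, train on the rest, and return the fraction predicted correctly."""
    hits = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        hits += predict(train_x, train_y, xs[i]) == ys[i]
    return hits / len(xs)
```

Any classifier with the same `predict(train_x, train_y, x)` shape could be passed in place of the stand-in, including a wrapper around an SVM implementation.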

    Reconstruction of Gene Regulatory Modules in Cancer Cell Cycle by Multi-Source Data Integration

    Precise regulation of the cell cycle is crucial to the growth and development of all organisms. Understanding the regulatory mechanism of the cell cycle is crucial to unraveling many complicated diseases, most notably cancer. Multiple sources of biological data are available to study the dynamic interactions among many genes that are related to the cancer cell cycle. Integrating these informative and complementary data sources can help to infer a mutually consistent gene transcriptional regulatory network with strong similarity to the underlying gene regulatory relationships in cancer cells. We propose an integrative framework that infers gene regulatory modules from the cell cycle of cancer cells by incorporating multiple sources of biological data, including gene expression profiles, gene ontology, and molecular interaction. Among 846 human genes with putative roles in cell cycle regulation, we identified 46 transcription factors and 39 gene ontology groups. We reconstructed regulatory modules to infer the underlying regulatory relationships. Four regulatory network motifs were identified from the interaction network. The relationship between each transcription factor and predicted target gene groups was examined by training a recurrent neural network whose topology mimics the network motif(s) to which the transcription factor was assigned. Inferred network motifs related to eight well-known cell cycle genes were confirmed by gene set enrichment analysis, binding site enrichment analysis, and comparison with previously published experimental results. We established a robust method that can accurately infer underlying relationships between a given transcription factor and its downstream target genes by integrating different layers of biological data. Our method could also be beneficial to biologists for predicting the components of regulatory modules in which any candidate gene is involved. 
Such predictions can then be used to design a more streamlined experimental approach for biological validation. Understanding the dynamics of these modules will shed light on the processes that occur in cancer cells resulting from errors in cell cycle regulation.

    An intuitionistic approach to scoring DNA sequences against transcription factor binding site motifs

    Background: Transcription factors (TFs) control transcription by binding to specific regions of DNA called transcription factor binding sites (TFBSs). The identification of TFBSs is a crucial problem in computational biology and includes the subtask of predicting the location of known TFBS motifs in a given DNA sequence. It has previously been shown that, when scoring matches to known TFBS motifs, interdependencies between positions within a motif should be taken into account. However, this remains a challenging task because sequences similar to those of known TFBSs can occur by chance with a relatively high frequency. Here we present a new method for matching sequences to TFBS motifs based on intuitionistic fuzzy sets (IFS) theory, an approach that has been shown to be particularly appropriate for tackling problems that embody a high degree of uncertainty. Results: We propose SCintuit, a new scoring method for measuring sequence-motif affinity based on IFS theory. Unlike existing methods that consider dependencies between positions, SCintuit is designed to prevent overestimation of less conserved positions of TFBSs. For a given pair of bases, SCintuit is computed not only as a function of their combined probability of occurrence, but also taking into account the individual importance of each single base at its corresponding position. We used SCintuit to identify known TFBSs in DNA sequences. Our method provides excellent results when dealing with both synthetic and real data, outperforming the sensitivity and the specificity of two existing methods in all the experiments we performed. Conclusions: The results show that SCintuit improves on the prediction quality of existing approaches for TFs without compromising sensitivity. In addition, we show how SCintuit can be successfully applied to real research problems. In this study the reliability of the IFS theory for motif discovery tasks is proven.
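For context on the motif-matching subtask described above, the following is a minimal sketch of the conventional baseline: a log-odds position weight matrix that treats motif positions as independent. It is not SCintuit and does not model the position interdependencies the abstract discusses. The function names, the pseudocount, and the uniform background frequency are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position weight matrix from aligned, equal-length sites."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        # Smoothed base frequency at this position, relative to the background.
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm


def score_window(pwm, window):
    """Sum the per-position log-odds scores; positions are assumed independent."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_match(pwm, sequence):
    """Slide the matrix along the sequence and return (best score, start index)."""
    width = len(pwm)
    return max(
        (score_window(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

Called as `best_match(build_pwm(sites), sequence)`, this returns the highest-scoring window and its offset. Because each position contributes independently, weakly conserved positions can inflate the score of chance matches, which is the kind of overestimation the abstract says SCintuit is designed to prevent.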

    Defining the Plasticity of Transcription Factor Binding Sites by Deconstructing DNA Consensus Sequences: The PhoP-Binding Sites among Gamma/Enterobacteria

    Transcriptional regulators recognize specific DNA sequences. Because these sequences are embedded in the background of genomic DNA, it is hard to identify the key cis-regulatory elements that determine disparate patterns of gene expression. The detection of the intra- and inter-species differences among these sequences is crucial for understanding the molecular basis of both differential gene expression and evolution. Here, we address this problem by investigating the target promoters controlled by the DNA-binding PhoP protein, which governs virulence and Mg2+ homeostasis in several bacterial species. PhoP is particularly interesting; it is highly conserved in different gamma/enterobacteria, regulating not only ancestral genes but also governing the expression of dozens of horizontally acquired genes that differ from species to species. Our approach consists of decomposing the DNA binding site sequences for a given regulator into families of motifs (i.e., termed submotifs) using a machine learning method inspired by the “Divide & Conquer” strategy. By partitioning a motif into sub-patterns, computational advantages for classification were produced, resulting in the discovery of new members of a regulon, and alleviating the problem of distinguishing functional sites in chromatin immunoprecipitation and DNA microarray genome-wide analysis. Moreover, we found that certain partitions were useful in revealing biological properties of binding site sequences, including modular gains and losses of PhoP binding sites through evolutionary turnover events, as well as conservation in distant species. The high conservation of PhoP submotifs within gamma/enterobacteria, as well as the regulatory protein that recognizes them, suggests that the major cause of divergence between related species is not due to the binding sites, as was previously suggested for other regulators. 
Instead, the divergence may be attributed to the fast evolution of orthologous target genes and/or the promoter architectures resulting from the interaction of those binding sites with the RNA polymerase.

    Improving Thermodynamic Models of Transcription by Combining ChIP and Expression Measurements of Synthetic Promoters

    Regulation of gene expression is a fundamental process in biology. Accurate mathematical models of the relationship between regulatory sequence and observed expression would advance our understanding of biology. I developed ReLoS, a regulatory logic simulator, to explore mathematical frameworks for describing the relationship between regulatory sequence and observed expression and to explore methods of learning combinatorial regulatory rules from expression data. ReLoS is a flexible simulator allowing a variety of formalisms to be applied. ReLoS was used to explore the question of how complex rules of combinatorial transcriptional regulation must be to explain the complexity of transcriptional regulation observed in biology. A previously published dataset was analyzed for regulatory elements that explained the behavior of regulatory modules for 254 genes in 255 conditions. I found that ReLoS was able to recapitulate a reasonable fraction of the variation (mean gene-wise correlation of 0.7) with only twelve combinatorial rules comprising 13 cis-regulatory elements. This result suggested that learning the combinatorial rules of transcriptional regulation should be possible. State ensemble statistical thermodynamic models are a class of models used to describe combinatorial transcriptional regulation. One way to parameterize these models is measuring the expression of a reporter gene driven by many similar promoters. Models parameterized in this fashion do better at explaining the sequence to expression relationship, but fail to distinguish between multiple biological mechanisms that give rise to equivalent expression results in the synthetic promoters, thus limiting the generalizability of the models. I developed a ChIP-based strategy for quantitatively measuring the relative occupancy of transcription factors on synthetic promoters. This data complements existing methods for obtaining expression data from the same promoters. 
Comparison of models parameterized with only expression, only occupancy, or expression and occupancy reveals specific biological details that are missed when considering only expression data. In particular, the occupancy data suggests that differential regulatory effects of Cbf1 in glucose versus amino acid are a function of how it interacts with polymerase rather than changes in concentration or binding affinity. Additionally, the occupancy data suggests that Gcn4 binds in a cooperative manner and that Gcn4 occupancy is adversely affected by the presence of a nearby Nrg1 site. Finally, the occupancy data and expression data taken together suggest that Gcn4 binds in competition with another transcription factor. Synthesizing disparate sources of information resulted in an improved understanding of the mechanics of transcriptional regulation of the synthetic promoters and was ultimately largely successful in decoupling the DNA binding energies from the TF interactions with polymerase. However, it suggests that more sophisticated models of the relationship between occupancy and expression may be required in at least some cases. Incorporating different sources of data into models of regulation will continue to be important for learning the biological specifics that drive expression changes.
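To make the phrase "state ensemble statistical thermodynamic model" concrete, here is a minimal sketch of the generic textbook form for a promoter with two binding sites. It is not the dissertation's model: the two-site layout, the function and parameter names (`conc_a`, `kd_a`, `omega`), and the numerical values in the usage note are assumptions chosen for illustration.

```python
def state_weights(conc_a, conc_b, kd_a, kd_b, omega=1.0):
    """Statistical weights of the four promoter states for two sites, A and B.

    Each bound factor contributes concentration / dissociation constant.
    omega scales the doubly bound state: omega > 1 models cooperative binding,
    omega < 1 models interference, and omega == 1 means independent binding.
    """
    q_a = conc_a / kd_a
    q_b = conc_b / kd_b
    return {"empty": 1.0, "A": q_a, "B": q_b, "AB": omega * q_a * q_b}


def occupancy(conc_a, conc_b, kd_a, kd_b, omega=1.0):
    """Fractional occupancy of each site: weight of states with the site bound over the partition sum."""
    weights = state_weights(conc_a, conc_b, kd_a, kd_b, omega)
    partition = sum(weights.values())
    return {
        "A": (weights["A"] + weights["AB"]) / partition,
        "B": (weights["B"] + weights["AB"]) / partition,
    }
```

With both concentrations equal to their dissociation constants and `omega=1.0`, each site is half occupied; raising `omega` to 10 lifts the occupancy of each site to 11/13. Expression data alone cannot always separate such a change in `omega` from a change in how a bound factor interacts with polymerase, which is why the abstract pairs expression with direct occupancy measurements.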

    Dissecting cis and trans Determinants of Nucleosome Positioning: A Dissertation

    Eukaryotic DNA is packaged in chromatin, whose repeating subunit, the nucleosome, consists of an octamer of histone proteins wrapped by about 147 bp of DNA. This packaging affects the accessibility of DNA and hence any process that occurs on DNA, such as replication, repair, and transcription. An early observation from genome-wide nucleosome mapping in yeast was that genes had a surprisingly characteristic structure, which has motivated studies to understand what determines this architecture. Both sequence and trans acting factors are known to influence chromatin packaging, but the relative contributions of cis and trans determinants of nucleosome positioning are debated. Here we present data using genetic approaches to examine the contributions of cis and trans acting factors on nucleosome positioning in budding yeast. We developed the use of yeast artificial chromosomes to exploit quantitative differences in the chromatin structures of different yeast species. This allows us to place approximately 150 kb of sequence from any species into the S. cerevisiae cellular environment and compare the nucleosome positions on this same sequence in different environments to discover what features are variant and hence regulated by trans acting factors. This method allowed us to conclusively show that the great preponderance of nucleosomes are positioned by trans acting factors. We observe the maintenance of nucleosome depletion over some promoter sequences, but partial fill-in of NDRs in some of the YAC v promoters indicates that even this feature is regulated to varying extents by trans acting factors. We are able to extend our use of evolutionary divergence in order to search for specific trans regulators whose effects vary between the species. We find that a subset of transcription factors can compete with histones to help generate some NDRs, with clear effects documented in a cbf1 deletion mutant. 
In addition, we find that Chd1p acts as a potential “molecular ruler” involved in defining the nucleosome repeat length differences between S. cerevisiae and K. lactis. The mechanism of this measurement is unclear as the alteration in activity is partially attributable to the N-terminal portion of the protein, for which there is no structural data. Our observations of a specialized chromatin structure at de novo transcriptional units along with results from nucleosome mapping in the absence of active transcription indicate that transcription plays a role in engineering genic nucleosome architecture. This work strongly supports the role of trans acting factors in setting up a dynamic, regulated chromatin structure that allows for robustness and fine-tuning of gene expression.

    Evolution of regulatory complexes: a many-body system

    The recent advent of large-scale genomic sequence data and improvement of sequencing technologies has enabled population genetics to advance from a mostly abstract theoretical basis to a quantitative molecular description. However, functional units in DNA are typically combinations of interacting nucleotide segments, and evolutionary forces acting on these segments can result in very complicated population dynamics. The goal is to formulate these interactions in such a way that the macroscopic features are independent of the microscopic details, as in statistical mechanics. In this thesis, I discuss the evolutionary dynamics of regulatory sequences, which control the production of protein in cells. One of the primary forms of regulation occurs through interactions of proteins called transcription factors with binding sites in the DNA sequence, and the strength of these interactions influences the individual's fitness in the population. What makes this an ideal model system for quantitative analysis of genomic evolution is the possibility of inferring this relationship. Compared to prokaryotes and yeast, gene regulation is much more complex in higher eukaryotes. Regulatory information is organized in modules with multiple binding sites that are linked to a common function. In Chapter 2, we show that binding site complexes are commonly formed by local sequence duplications, as opposed to forming from scratch by single point mutations. We also show that the underlying regulatory grammar is in tune with this mechanism such that the duplication events confer an adaptive advantage. Regulatory complexes resemble a many-particle system whose function emerges from the collective dynamics of its elements. In Chapter 3, we develop a thermodynamic framework to characterize the effective affinity of site complexes to multiple transcription factors with cooperative binding. 
These affinities are the phenotype, or trait of binding complexes on which selection acts, and we characterize their evolution. From the yeast genome polymorphism data, we infer a fitness landscape as a function of binding affinity by using the novel method developed in Chapter 4. This method of quantitative trait analysis can deal with long-range correlations between sites which arise in asexual populations. Our fitness landscape quantitatively predicts the amount of conservation of the phenotype, as well as the amount of compensatory changes between sites. Our results open a new avenue to understand the regulatory "grammar" of eukaryotic genomes based on quantitative evolution models. They prove that a combination of theoretical models, high-throughput experimental measurements, and analysis of genomic variation is necessary for a proper quantitative understanding of biological systems.

    Interplay of nucleosome positioning and transcription initiation in Schizosaccharomyces pombe
