21 research outputs found

    Understanding protein-DNA binding events

    DNA binding proteins regulate essential biological processes such as DNA replication, transcription, repair, and splicing. Transcription factors (TFs) are the focus of this work because they have the largest effect on activating and repressing gene expression by influencing transcription rates. It is important to model TF binding affinity to DNA and to predict protein-DNA binding events to understand how they regulate cell mechanisms. Higher-order Markov models bring de novo motif discovery to the next level. BaMM!motif has been shown to provide robust predictions of these more sophisticated binding models. Here I introduce the BaMM!motif web application, a web-based platform which combines de novo motif discovery with motif enrichment and motif-motif comparison tools and a database of known motifs. This web application enables the usage of the BaMM!motif algorithm in a straightforward and robust environment. Post-translational histone modifications and linker histone incorporation regulate chromatin structure and genome activity. How these systems interface on a molecular level is unclear. Using biochemistry, one observes that the modification behavior of N-terminal histone H3 tails depends on the nucleosomal context. I found that linker histones inhibit modifications of different H3 sites on a genome-wide level. This suggests that alterations of H3 tail-linker DNA interactions by linker histones execute basal control mechanisms of chromatin function. Pervasive transcription of eukaryotic genomes stems to a large extent from bidirectional promoters that synthesize mRNA and divergent noncoding RNA (ncRNA). Here, I show that early termination that relies on the essential RNA-binding factor Nrd1 attenuates transcription of 32 genes in yeast. Further, depletion of Nrd1 from the nucleus results in 1,526 Nrd1-unterminated transcripts (NUTs) that originate from nucleosome-depleted regions (NDRs) and can deregulate mRNA synthesis by antisense repression and transcription interference.

    Statistical methods for biological sequence analysis for DNA binding motifs and protein contacts

    Over the last decades, a revolution in novel measurement techniques has permeated the biological sciences, filling the databases with unprecedented amounts of data ranging from genomics, transcriptomics, proteomics and metabolomics to structural and ecological data. In order to extract insights from this vast quantity of data, computational and statistical methods are nowadays crucial tools in the toolbox of every biological researcher. In this thesis I summarize my contributions to two data-rich fields in the biological sciences: transcription factor binding to DNA and protein structure prediction from protein sequences with shared evolutionary ancestry. In the first part of my thesis I introduce our work towards a web server for analysing transcription factor binding data with Bayesian Markov models. In contrast to classical PWM or di-nucleotide models, Bayesian Markov models can capture complex inter-nucleotide dependencies that can arise from shape readout and alternative binding modes. In addition to giving access to our methods in an easy-to-use, intuitive web interface, we provide our users with novel tools and visualizations to better evaluate the biological relevance of the inferred binding motifs. We hope that our tools will prove useful for investigating weak and complex transcription factor binding motifs which cannot be predicted accurately with existing tools. The second part discusses a statistical attempt to correct for the phylogenetic bias arising in co-evolution methods applied to the contact prediction problem. Co-evolution methods revolutionized the protein structure prediction field more than 10 years ago and, until very recently, have retained their importance as crucial input features to deep neural networks. As the co-evolution information is extracted from evolutionarily related sequences, we investigated whether the phylogenetic bias in the signal can be corrected in a principled way, using a variation of Felsenstein's tree-pruning algorithm in combination with an independent-pair assumption to derive pairwise amino acid counts that are corrected for the evolutionary history. Unfortunately, the contact predictions derived from our corrected pairwise amino acid counts did not yield competitive performance. 2021-09-2

    Bayesian Markov models improve the prediction of binding motifs beyond first order

    Transcription factors (TFs) regulate gene expression by binding to specific DNA motifs. Accurate models for predicting binding affinities are crucial for a quantitative understanding of transcriptional regulation. Motifs are commonly described by position weight matrices, which assume that each position contributes independently to the binding energy. Models that can learn dependencies between positions, for instance those induced by DNA structure preferences, have yielded markedly improved predictions for most TFs on in vivo data. However, they are more prone to overfit the data and to learn patterns that are merely correlated with, rather than directly involved in, TF binding. We present an improved, faster version of our Bayesian Markov model software, BaMMmotif2. We tested it against state-of-the-art motif discovery tools on a large collection of ChIP-seq and HT-SELEX datasets. Fifth-order BaMMmotif2 models achieved a median false-discovery-rate-averaged recall 13.6% and 12.2% higher than the next-best tool on 427 ChIP-seq datasets and 164 HT-SELEX datasets, respectively, while being 8 to 1000 times faster. BaMMmotif2 models showed no signs of overtraining in cross-cell-line and cross-platform tests, with similar improvements over the next-best tool. These results demonstrate that dependencies beyond first order clearly improve binding models for most TFs.
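
The contrast this abstract draws between position weight matrices and models with inter-position dependencies can be illustrated with a minimal Python sketch. It is not the BaMMmotif2 implementation: the toy sites, the pseudocount, the uniform background and all function names are assumptions made for illustration, and it uses a plain first-order position-specific Markov model without the Bayesian regularization that the models described above rely on.

```python
import math

# Hypothetical toy alignment of binding sites (illustration only).
SITES = ["ACGT", "ACGA", "TGCT", "TGCA"]
ALPHABET = "ACGT"
PSEUDO = 0.5  # arbitrary pseudocount, an assumption of this sketch


def train_pwm(sites):
    """Estimate independent per-position nucleotide probabilities."""
    pwm = []
    for j in range(len(sites[0])):
        counts = {a: PSEUDO for a in ALPHABET}
        for s in sites:
            counts[s[j]] += 1
        total = sum(counts.values())
        pwm.append({a: counts[a] / total for a in ALPHABET})
    return pwm


def train_first_order(sites):
    """Estimate P(x_j | x_{j-1}) for every position j >= 1."""
    cond = []
    for j in range(1, len(sites[0])):
        table = {}
        for prev in ALPHABET:
            counts = {a: PSEUDO for a in ALPHABET}
            for s in sites:
                if s[j - 1] == prev:
                    counts[s[j]] += 1
            total = sum(counts.values())
            table[prev] = {a: counts[a] / total for a in ALPHABET}
        cond.append(table)
    return cond


def log_odds_pwm(seq, pwm, background=0.25):
    """Sum of per-position log-odds against a uniform background."""
    return sum(math.log2(pwm[j][x] / background) for j, x in enumerate(seq))


def log_odds_first_order(seq, pwm, cond, background=0.25):
    """First position scored by the PWM, the rest by conditionals."""
    score = math.log2(pwm[0][seq[0]] / background)
    for j in range(1, len(seq)):
        score += math.log2(cond[j - 1][seq[j - 1]][seq[j]] / background)
    return score


pwm = train_pwm(SITES)
cond = train_first_order(SITES)
for candidate in ("ACGT", "AGGT"):
    print(candidate,
          round(log_odds_pwm(candidate, pwm), 3),
          round(log_odds_first_order(candidate, pwm, cond), 3))
```

In this toy alignment, A at the first position is always followed by C and T by G. The independent model cannot tell the observed site ACGT from the unobserved mixture AGGT, whereas the first-order model scores the mixture lower.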

    Application of alternative de novo motif recognition models for analysis of structural heterogeneity of transcription factor binding sites: a case study of FOXA2 binding sites

    The most popular model for searching ChIP-seq data for transcription factor binding sites (TFBS) is the positional weight matrix (PWM). However, this model does not take into account dependencies between nucleotide occurrences at different site positions. Two recently proposed models, BaMM and InMoDe, can account for such dependencies. However, the application of these models has usually been limited to comparing their recognition accuracies with that of PWMs, and no analysis of the co-prediction and relative positioning of hits of different models in peaks has yet been performed. To close this gap, we propose a pipeline called MultiDeNA. This pipeline includes stages of model training, assessment of recognition accuracy, scanning of ChIP-seq peaks and their classification based on the scan results. We applied our pipeline to 22 ChIP-seq datasets of the TF FOXA2 and considered PWM, dinucleotide PWM (diPWM), BaMM and InMoDe models. The combination of these four models allowed a significant increase of 26.3% in the fraction of recognized peaks compared to that for the PWM model alone. The BaMM model provided the main contribution to the recognition of sites. Although the major fraction of predicted peaks contained TFBS of different models at coinciding positions, the medians of the fraction of peaks containing predictions of only a single model were 1.08, 0.49, 4.15 and 1.73% for PWM, diPWM, BaMM and InMoDe, respectively. Thus, FOXA2 binding sites (BSs) were not fully described by any single model, which indicates their heterogeneity. We assume that the BaMM model is the most successful in describing the structure of the FOXA2 BS in the ChIP-seq datasets under study.

    Motif models proposing independent and interdependent impacts of nucleotides are related to high and low affinity transcription factor binding sites in Arabidopsis

    The position weight matrix (PWM) is the traditional motif model representing transcription factor (TF) binding sites. It assumes that positions contribute independently to TF binding affinity, although this hypothesis does not fit the data perfectly. This explains why PWM hits are missing in a substantial fraction of ChIP-seq peaks. To study various modes of the direct binding of plant TFs, we compiled a benchmark collection of 111 ChIP-seq datasets for Arabidopsis thaliana and applied the traditional PWM and two alternative motif models, BaMM and SiteGA, which account for dependencies between positions. Varying the stringency of the recognition thresholds for the models suggested that the hits of the PWM, BaMM, and SiteGA models are associated with sites of high/medium, any, and low affinity, respectively. At the medium recognition threshold, about 60% of ChIP-seq peaks contain PWM hits consisting of conserved core consensuses, while BaMM and SiteGA provide hits for an additional 15% of peaks in which a weaker core consensus is compensated through intra-motif dependencies. The presence/absence of these dependencies in the motifs of the alternative/traditional models was confirmed by the dependency logo DepLogo, which visualizes the position-wise partitioning of the alignments of predicted sites. We exemplify the detailed analysis of ChIP-seq profiles for the plant TFs CCA1, MYC2, and SEP3. Gene ontology (GO) enrichment analysis revealed that, among the three motif models, SiteGA had the highest proportion of genes with significantly enriched GO terms among all predicted genes. We showed that both alternative motif models extend the sites predicted by the traditional PWM more for the TFs MYC2/SEP3, which have condition/tissue-specific functions, than for the TF CCA1, which has housekeeping functions. Overall, the combined application of standard and alternative motif models is beneficial for detecting various modes of direct TF-DNA interactions in the largest possible portion of ChIP-seq loci.

    Development of Computational Techniques for Identification of Regulatory DNA Motif

    Identifying precise transcription factor binding sites (TFBS) or regulatory DNA motifs (motifs) plays a fundamental role in researching transcriptional regulatory mechanisms in cells and in helping construct regulatory networks for biological investigation. Chromatin immunoprecipitation combined with sequencing (ChIP-seq) and lambda exonuclease digestion followed by high-throughput sequencing (ChIP-exo) enable researchers to identify TFBS on a genome scale with improved resolution. Several algorithms have been developed to perform motif identification, employing widely different methods and often giving divergent results. In addition, these existing methods still suffer from limited prediction accuracy. This thesis focuses on the development of improved regulatory DNA motif identification techniques. We designed an integrated framework, WTSA, that can reliably combine the experimental signals from ChIP-exo data at base-pair (bp) resolution to predict statistically significant DNA motifs. The algorithm improves the prediction accuracy and extends the scope of applicability of the existing methods. We applied the framework to the Escherichia coli K-12 genome and evaluated WTSA prediction performance through comparison with seven existing programs. The performance evaluation indicated that WTSA provides reliable predictive power for regulatory motifs using ChIP-exo data. An important application of DNA motif identification is to identify transcriptional regulatory mechanisms. The rapid development of single-cell RNA-sequencing (scRNA-seq) technologies provides an unprecedented opportunity to discover gene transcriptional regulation at the single-cell level. In scRNA-seq analyses, a critical step is to identify the cell-type-specific regulons (CTS-Rs), each of which is a group of genes co-regulated by the same transcription regulator in a specific cell type. We developed a web server, IRIS3 (Integrated Cell-type-specific Regulon Inference Server from Single-cell RNA-Seq), to solve this problem by integrating data preprocessing, cell type prediction, gene module identification, and cis-regulatory motif analyses. Compared with other packages, IRIS3 predicts more efficiently and provides more accurate regulons from scRNA-seq data. These CTS-Rs can substantially improve the elucidation of heterogeneous regulatory mechanisms among various cell types and allow reliable construction of global transcriptional regulation networks encoded in a specific cell type. Also presented in this thesis is DESSO (DEep Sequence and Shape mOtif), which uses deep neural networks and the binomial distribution model to identify DNA motifs. DESSO outperformed existing tools, including DeepBind, on 690 human ENCODE ChIP-seq datasets. DESSO also further expanded motif identification power by integrating the detection of DNA shape features.

    Algorithms for the analysis of molecular sequences
