236 research outputs found

    Inference of biomolecular interactions from sequence data

    Get PDF
    This thesis describes our work on the inference of biomolecular interactions from sequence data. In particular, the first part of the thesis focuses on proteins and describes computational methods that we have developed for the inference of both intra- and inter-protein interactions from genomic data. The second part of the thesis centers around protein-RNA interactions and describes a method for the inference of binding motifs of RNA-binding proteins from high-throughput sequencing data. The thesis is organized as follows. In the first part, we start by introducing a novel mathematical model for the characterization of protein sequences (chapter 1). We then show how, using genomic data, this model can be successfully applied to two different problems, namely to the inference of interacting amino acid residues in the tertiary structure of protein domains (chapter 2) and to the prediction of protein-protein interactions in large paralogous protein families (chapters 3 and 4). We conclude the first part by a discussion of potential extensions and generalizations of the methods presented (chapter 5). In the second part of this thesis, we first give a general introduction about RNA- binding proteins (chapter 6). We then describe a novel experimental method for the genome-wide identification of target RNAs of RNA-binding proteins and show how this method can be used to infer the binding motifs of RNA-binding proteins (chapter 7). Finally, we discuss a potential mechanism by which KH domain-containing RNA- binding proteins could achieve the specificity of interaction with their target RNAs and conclude the second part of the thesis by proposing a novel type of motif finding algorithm tailored for the inference of their recognition elements (chapter 8)

    Statistical methods for biological sequence analysis for DNA binding motifs and protein contacts

    Get PDF
    Over the last decades a revolution in novel measurement techniques has permeated the biological sciences filling the databases with unprecedented amounts of data ranging from genomics, transcriptomics, proteomics and metabolomics to structural and ecological data. In order to extract insights from the vast quantity of data, computational and statistical methods are nowadays crucial tools in the toolbox of every biological researcher. In this thesis I summarize my contributions in two data-rich fields in biological sciences: transcription factor binding to DNA and protein structure prediction from protein sequences with shared evolutionary ancestry. In the first part of my thesis I introduce our work towards a web server for analysing transcription factor binding data with Bayesian Markov Models. In contrast to classical PWM or di-nucleotide models, Bayesian Markov models can capture complex inter-nucleotide dependencies that can arise from shape-readout and alternative binding modes. In addition to giving access to our methods in an easy-to-use, intuitive web-interface, we provide our users with novel tools and visualizations to better evaluate the biological relevance of the inferred binding motifs. We hope that our tools will prove useful for investigating weak and complex transcription factor binding motifs which cannot be predicted accurately with existing tools. The second part discusses a statistical attempt to correct out the phylogenetic bias arising in co-evolution methods applied to the contact prediction problem. Co-evolution methods have revolutionized the protein-structure prediction field more than 10 years ago, and, until very recently, have retained their importance as crucial input features to deep neural networks. As the co-evolution information is extracted from evolutionarily related sequences, we investigated whether the phylogenetic bias to the signal can be corrected out in a principled way using a variation of the Felsenstein's tree-pruning algorithm applied in combination with an independent-pair assumption to derive pairwise amino counts that are corrected for the evolutionary history. Unfortunately, the contact prediction derived from our corrected pairwise amino acid counts did not yield a competitive performance.2021-09-2

    Recognition models to predict DNA-binding specificities of homeodomain proteins

    Get PDF
    Motivation: Recognition models for protein-DNA interactions, which allow the prediction of specificity for a DNA-binding domain based only on its sequence or the alteration of specificity through rational design, have long been a goal of computational biology. There has been some progress in constructing useful models, especially for C2H2 zinc finger proteins, but it remains a challenging problem with ample room for improvement. For most families of transcription factors the best available methods utilize k-nearest neighbor (KNN) algorithms to make specificity predictions based on the average of the specificities of the k most similar proteins with defined specificities. Homeodomain (HD) proteins are the second most abundant family of transcription factors, after zinc fingers, in most metazoan genomes, and as a consequence an effective recognition model for this family would facilitate predictive models of many transcriptional regulatory networks within these genomes

    Why Should We Care About Molecular Coevolution?

    Get PDF
    Non-independent evolution of amino acid sites has become a noticeable limitation of most methods aimed at identifying selective constraints at functionally important amino acid sites or protein regions. The need for a generalised framework to account for non-independence of amino acid sites has fuelled the design and development of new mathematical models and computational tools centred on resolving this problem. Molecular coevolution is one of the most active areas of research, with an increasing rate of new models and methods being developed everyday. Both parametric and non-parametric methods have been developed to account for correlated variability of amino acid sites. These methods have been utilised for detecting phylogenetic, functional and structural coevolution as well as to identify surfaces of amino acid sites involved in protein-protein interactions. Here we discuss and briefly describe these methods, and identify their advantages and limitations

    Protein-DNA Recognition Models for the Homeodomain and C2H2 Zinc Finger Transcription Factor Families

    Get PDF
    Transcription factors: TFs) play a central role in the gene regulatory network of each cell. They can stimulate or inhibit transcription of their target genes by binding to short, degenerate DNA sequence motifs. The goal of this research is to build improved models of TF binding site recognition. This can facilitate the determination of regulatory networks and also allow for the prediction of binding site motifs based only on the TF protein sequence. Recent technological advances have rapidly expanded the amount of quantitative TF binding data available. PBMs: Protein Binding Microarrays) have recently been implemented in a format that allows all 10mers to be assayed in parallel. There is now PBM data available for hundreds of transcription factors. Another fairly recent technique for determining the binding preference of a TF is an in vivo bacterial one-hybrid assay: B1H). In this approach a TF is expressed in E. coli where it can be used to select strong binding sites from a library of randomized sites located upstream of a weak promoter, driving expression of a selectable gene. When coupled with high throughput sequencing and a newly developed analysis method, quantitative binding data can be obtained. In the last few years, the binding specificities of hundreds of TFs have been determined using B1H. The two largest eukaryotic transcription factor families are the zf-C2H2 and homeodomain TF families. Newly available PBM and B1H specificity models were used to develop recognition models for these two families, with the goal of being able to predict the binding specific of a TF from its protein sequence. We developed a feature selection method based on adjusted mutual information that automatically recovers nearly all of the known key residues for the homeodomain and zf-C2H2 families. Using those features we find that, for both families, random forest: RF) and support vector machine: SVM) based recognition models outperform the nearest neighbor method, which has previously been considered the best method

    Combining structure probing data on RNA mutants with evolutionary information reveals RNA-binding interfaces

    Get PDF
    International audienceSystematic structure probing experiments (e.g. SHAPE) of RNA mutants such as the mutate-and-map protocol give us a direct access into the genetic robustness of ncRNA structures. Comparative studies of homologous sequences provide a distinct, yet complementary, approach to analyze structural and functional properties of non-coding RNAs. In this paper, we introduce a formal framework to combine the biochemical signal collected from mutate-and-map experiments, with the evolutionary information available in multiple sequence alignments. We apply neutral theory principles to detect complex long-range dependencies between nucleotides of a single stranded RNA, and implement these ideas into a software called aRNhAck. We illustrate the biological significance of this signal and show that the nucleotides networks calculated with aRNhAck are correlated with nucleotides located in RNA-RNA, RNA-protein, RNA-DNA and RNA-ligand interfaces. aRNhAck is freely available at http://csb.cs.mcgill.ca/arnhack

    Use of ChIP-Seq data for the design of a multiple promoter-alignment method

    Get PDF
    We address the challenge of regulatory sequence alignment with a new method, Pro-Coffee, a multiple aligner specifically designed for homologous promoter regions. Pro-Coffee uses a dinucleotide substitution matrix estimated on alignments of functional binding sites from TRANSFAC. We designed a validation framework using several thousand families of orthologous promoters. This dataset was used to evaluate the accuracy for predicting true human orthologs among their paralogs. We found that whereas other methods achieve on average 73.5% accuracy, and 77.6% when trained on that same dataset, the figure goes up to 80.4% for Pro-Coffee. We then applied a novel validation procedure based on multi-species ChIP-seq data. Trained and untrained methods were tested for their capacity to correctly align experimentally detected binding sites. Whereas the average number of correctly aligned sites for two transcription factors is 284 for default methods and 316 for trained methods, Pro-Coffee achieves 331, 16.5% above the default average. We find a high correlation between a method's performance when classifying orthologs and its ability to correctly align proven binding sites. Not only has this interesting biological consequences, it also allows us to conclude that any method that is trained on the ortholog data set will result in functionally more informative alignments

    Using evolutionary covariance to infer protein sequence-structure relationships

    Get PDF
    During the last half century, a deep knowledge of the actions of proteins has emerged from a broad range of experimental and computational methods. This means that there are now many opportunities for understanding how the varieties of proteins affect larger scale behaviors of organisms, in terms of phenotypes and diseases. It is broadly acknowledged that sequence, structure and dynamics are the three essential components for understanding proteins. Learning about the relationships among protein sequence, structure and dynamics becomes one of the most important steps for understanding the mechanisms of proteins. Together with the rapid growth in the efficiency of computers, there has been a commensurate growth in the sizes of the public databases for proteins. The field of computational biology has undergone a paradigm shift from investigating single proteins to looking collectively at sets of related proteins and broadly across all proteins. we develop a novel approach that combines the structure knowledge from the PDB, the CATH database with sequence information from the Pfam database by using co-evolution in sequences to achieve the following goals: (a) Collection of co-evolution information on the large scale by using protein domain family data; (b) Development of novel amino acid substitution matrices based on the structural information incorporated; (c) Higher order co-evolution correlation detection. The results presented here show that important gains can come from improvements to the sequence matching. What has been done here is simple and the pair correlations in sequence have been decomposed into singlet terms, which amounts to discarding much of the correlation information itself. The gains shown here are encouraging, and we would like to develop a sequence matching method that retains the pair (or higher order) correlation information, and even higher order correlations directly, and this should be possible by developing the sequence matching separately for different domain structures. The many body correlations in particular have the potential to transform the common perceptions in biology from pairs that are not actually so very informative to higher-order interactions. Fully understanding cellular processes will require a large body of higher-order correlation information such as has been initiated here for single proteins

    Using mutual information to reprogram DNA specificity of LAGLIDADG endonucleases

    Get PDF
    A systematic approach to reprogram protein-DNA interactions has yet to be discovered. This study investigates the ability of co-variation analyses to identify potential protein-DNA contacts that regulate specificity. Here, 27 LAGLIDADG Homing Endonucleases (LHEs) and their 22-basepair DNA targets were collated into a Multiple Sequence Alignment (MSA) that was subjected to pairwise co-variation calculations. Using the LHE I-OnuI as a reference, an amino acid-DNA pair, lysine (K) 231 and adenine +3, generated the highest score. To test if the K231/A3 score was biologically relevant we tested protein mutants for altered nuclease specificity at +3 DNA point mutants. Randomizing the 231st amino acid did not alone restore cleavage activity on substrate mutants but randomization in conjunction with, aspartic acid (D) 240, restored cleavage activity on A3T and A3G substrates. In conclusion, co-variation analyses identified in part, amino acids that could be mutated to alter DNA specificity. Future work should focus on mapping more LHE-DNA target sequences to increase MSA diversity
    corecore