28 research outputs found

    Detecting DNA-binding helix–turn–helix structural motifs using sequence and structure information

    In this work, we analyse the potential for using structural knowledge to improve the detection of the DNA-binding helix–turn–helix (HTH) motif from sequence. Starting from a set of DNA-binding protein structures that include a functional HTH motif and have no apparent sequence similarity to each other, two different libraries of hidden Markov models (HMMs) were built. One library included sequence models of whole DNA-binding domains, which incorporate the HTH motif; the second library included shorter models of ‘partial’ domains, representing only the fraction of the domain that corresponds to the functionally relevant HTH motif itself. The libraries were scanned against a dataset of protein sequences, some containing HTH motifs, others not. HMM predictions were compared with the results obtained from a previously published structure-based method and subsequently combined with it. The combined method proved more effective than either of the single-featured approaches, showing that the information carried by motif sequences and by motif structures is to some extent complementary and can successfully be used together for the detection of DNA-binding HTHs in proteins of unknown function

    Recognition of short functional motifs in protein sequences

    The main goal of this study was to develop a method for computational de novo prediction of short linear motifs (SLiMs) in protein sequences that would offer users advantages over existing solutions. The users are typically biological laboratory researchers who want to elucidate a protein function that is possibly mediated by a short motif. Such functions include subcellular localization, secretion, post-translational modification or degradation of proteins. Conducting such studies with experimental techniques alone is often associated with high costs and a risk of uncertain results. Preliminary prediction of putative motifs with computational methods, which are fast and much less expensive, makes it possible to generate hypotheses and therefore to plan experiments in a more directed and efficient way. To meet this goal, I have developed HH-MOTiF – a web-based tool for de novo discovery of SLiMs in a set of protein sequences. While working on the project, I have also detected patterns in sequence properties of certain SLiMs that make their de novo prediction easier. As some of these patterns are not yet described in the literature, I am sharing them in this thesis. While evaluating and comparing motif prediction results, I have identified conceptual gaps in theoretical studies, as well as in existing practical solutions, for comparing two sets of positional data annotating the same set of biological sequences. To close these gaps and to be able to carry out in-depth performance analyses of HH-MOTiF in comparison to other predictors, I have developed a corresponding statistical method, SLALOM (for StatisticaL Analysis of Locus Overlap Method). It is currently available as a standalone command line tool


    Identification and Genomic Analysis of Transcription Factors in Archaeal Genomes Exemplifies Their Functional Architecture and Evolutionary Origin

    Archaea, which represent a large fraction of the phylogenetic diversity of organisms, are prokaryotes with eukaryote-like basal transcriptional machinery. This organization makes the study of their DNA-binding transcription factors (TFs) and their transcriptional regulatory networks particularly interesting. In addition, there are limited experimental data regarding their TFs. In this work, 3,918 TFs were identified and exhaustively analyzed in 52 archaeal genomes. TFs represented less than 5% of the gene products in all the studied species, which is comparable with the number of TFs identified in parasites or intracellular pathogenic bacteria and suggests a deficit in this class of proteins. A total of 75 families were identified, of which the HTH_3, AsnC, TrmB, and ArsR families were universally and abundantly identified in all the archaeal genomes. We found that archaeal TFs are significantly smaller than other protein-coding genes in archaea as well as bacterial TFs, suggesting that a large fraction of these small-sized TFs could compensate for the probable deficit of TFs in archaea, possibly by forming different combinations of monomers similar to those observed in eukaryotic transcriptional machinery. Our results show that although the DNA-binding domains of archaeal TFs are similar to those of bacteria, there is an underrepresentation of ligand-binding domains in smaller TFs, which suggests that protein–protein interactions may act as mediators of regulatory feedback, indicating a chimera of bacterial and eukaryotic TFs’ functionality. The analysis presented here contributes to the understanding of the details of the transcriptional apparatus in archaea and provides a framework for the analysis of regulatory networks in these organisms

    Prediction of Indel Flanking Regions and Its Application in the Alignment of Multiple Protein Sequences

    Proteins are the most important molecules in living organisms, and they are involved in every function of the cell, such as signal transmission, metabolic regulation, transportation of molecules, and defense mechanisms. As new protein sequences are discovered on an everyday basis and protein databases continue to grow exponentially with time, analysis of protein families, understanding their evolutionary trends and detection of remote homologues have become extremely important. The traditional laboratory techniques of studying these proteins are very slow and time consuming. Therefore, biologists have turned to automated methods that are fast and capable of analyzing large amounts of data and determining relationships between proteins that would be difficult, if not impossible, for humans to identify through the traditional techniques. Insertion/deletion (indel) and substitution of an amino acid are two common events that lead to the evolution of and variations in protein sequences. Further, many human diseases and much of the functional divergence between homologous proteins are related more to indel mutations than to substitution mutations, even though the former occur less often than the latter. Reliable detection of indels and their flanking regions is a major challenge in research related to protein evolution, structures and functions. The first and most important step in studying a newly discovered protein sequence is to search protein databases for proteins that are similar or closely related to the new protein, and then to align the new protein sequence to these proteins. Thus, the alignment of multiple protein sequences is one of the most commonly performed tasks in bioinformatics analyses, and has been used in many applications, including sequence annotation, phylogenetic tree estimation, evolutionary analysis, secondary structure prediction and protein database search. 
Despite the considerable research effort recently devoted to improving the performance of multiple sequence alignment (MSA) algorithms, finding a highly accurate alignment between multiple protein sequences remains a challenging problem. The objectives of this thesis are to develop a novel scheme to predict indel flanking regions (IndelFRs) in a protein sequence and to develop an efficient algorithm for the alignment of multiple protein sequences incorporating the information on the predicted IndelFRs. In the first part of the thesis, a variable-order Markov model-based scheme to predict indel flanking regions in a protein sequence for a given protein fold is proposed. In this scheme, two predictors, referred to as the PPM IndelFR and PST IndelFR predictors, are designed based on prediction by partial match and probabilistic suffix tree, respectively. The performance of the proposed IndelFR predictors is evaluated in terms of two commonly used metrics, namely, accuracy of prediction and F1-measure. It is shown through extensive performance evaluation that the proposed predictors are able to predict IndelFRs in protein sequences with high values of accuracy and F1-measure. It is also shown that if one is interested only in predicting IndelFRs in protein sequences, it would be preferable to use the proposed predictors instead of HMMER 3.0 in view of the substantially superior performance of the former. In the second part of the thesis, a novel and efficient algorithm incorporating the information on the predicted IndelFRs for the alignment of multiple protein sequences is proposed. A new variable gap penalty function is introduced, which makes the gap placement in protein sequences more accurate for the protein alignment. The performance of the proposed alignment algorithm, named MSAIndelFR, is evaluated in terms of the sum-of-pairs (SP) and total columns (TC) metrics. 
It is shown through extensive performance evaluation using four popular benchmarks, BAliBASE 3.0, OXBENCH, PREFAB 4.0, and SABmark 1.65, that the performance of MSAIndelFR is superior to that of the six most-widely used alignment algorithms, namely, Clustal W2, Clustal Omega, MSAProbs, Kalign2, MAFFT and MUSCLE. Through the study undertaken in this thesis it is shown that a reliable detection of indels and their flanking regions can be achieved by using the proposed IndelFR predictors, and a substantial improvement in the protein alignment accuracy can be achieved by using the proposed variable gap penalty function. Thus, it is anticipated that this investigation will not only facilitate future studies on the modeling of indel mutations and protein sequence alignment, but will also open up new avenues for research concerning protein evolution, structures, and functions as well as for research concerning protein sequence alignment

    Using machine learning for decoy discrimination in protein tertiary structure prediction.

    In this thesis, the novel use of machine learning to identify low-RMSD structures for decoy discrimination in protein tertiary structure prediction is investigated. More specifically, neural networks are used to learn to recognize low-RMSD structures, using native protein structures as positive training examples, and simulated decoy structures as negative training examples. Simulated decoy structures are derived by reversing the sequences of native structures in the set of positive training examples, and threading the reversed sequences back to the native structures. Various input features, extracted from these native and simulated decoy structures, are used as inputs to the neural networks. These input features are the identities of residue pairs, the separation between the residues along the sequence, the pairwise distance and the relative solvent accessibilities of the residues. Various neural networks are created depending on the number of input features used. The neural networks are tested against the in-house pairwise potentials of mean force method, as well as against a K-Nearest Neighbours algorithm. The second novel idea of this thesis is to use evolutionary information in the decoy discrimination process. Evolutionary information, in the form of PSI-BLAST profiles, is used as input to the neural networks. Results have shown that the best performing neural network is the one that uses input information comprising PSI-BLAST profiles of residue pairs, pairwise distance and the relative solvent accessibilities of the residues. This neural network is the best among all methods tested, including the pairwise potentials method, in discriminating the native structures. Therefore this thesis has demonstrated the feasibility of using machine learning, more specifically neural networks, in the problem of decoy discrimination. 
More significantly, evolutionary information in the form of PSI-BLAST profiles has been successfully used to further improve decoy discrimination, particularly in the discrimination of native structures

    Nucleosome positioning dynamics in evolution and disease

    Nucleosome positioning is involved in a variety of cellular processes, and it provides a likely substrate for species evolution and may play roles in human disease. However, many fundamental aspects of nucleosome positioning remain controversial, such as the relative importance of underlying sequence features, genomic neighbourhood and trans-acting factors. In this thesis, I have focused on analyses of the divergence and conservation of nucleosome positioning, associated substitution spectra, and the interplay between them. I have investigated the extent to which nucleosome positioning patterns change following the duplication of a DNA sequence and its insertion into a new genomic region within the same species, by assessing the relative nucleosome positioning between paralogous regions in both the human (using in vitro and in vivo datasets) and yeast (in vivo) genomes. I observed that the positioning of paralogous nucleosomes is generally well conserved and detected a strong rotational preference where nucleosome positioning has diverged. I have also found, in all datasets, that DNA sequence features appear to be more important than local chromosomal environments in nucleosome positioning evolution, while controlling for trans-acting factors that can potentially confound inter-species comparisons. I have also examined the relationships between chromatin structure and DNA sequence variation, with a particular focus on the spectra of (germline and somatic) substitutions seen in human diseases. Both somatic and germline substitutions are found to be enriched at sequences coinciding with nucleosome cores. In addition, transitions appear to be enriched in germline relative to somatic substitutions at nucleosome core regions. This difference in transition to transversion ratio is also seen at transcription start sites (TSSs) genome wide. 
However, the contrasts seen between somatic and germline mutational spectra do not appear to be attributable to alterations in nucleosome positioning between cell types. Examination of multiple human nucleosome positioning datasets shows conserved positioning across TSSs and strongly conserved global phasing between 4 cancer cell lines and 7 non-cancer cell lines. This suggests that the particular mutational profiles seen for somatic and germline cells occur upon a common landscape of conserved chromatin structure. I extended my studies of mutational spectra by analysing genome sequencing data from various tissues in a cohort of individuals to identify human somatic mutations. This allowed an assessment of the relationship between age and mutation accumulation and a search for inherited genetic variants linked to high somatic mutation rates. A list of candidate germline variants that potentially predispose to increased somatic mutation rates was the outcome. Together these analyses contribute to an integrated view of genome evolution, encompassing the divergence of DNA sequence and chromatin structure, and explorations of how they may interact in human disease

    Developing algorithms for the in silico identification of transcription factor binding sites

    Modeling the specificity of transcription factors for DNA is one of the challenges that has kept many bioinformatics researchers busy since the early days of the field. Initially it was expected that a universal recognition code describing the amino acid to base pair contacts would be able to describe protein-DNA complex formation. However, to this day no universal recognition code has been found, and alternative methods have become more important. Nowadays, methods that describe the specificity of only one transcription factor (or a small family of transcription factors) are used most often. These methods make use of a set of experimentally validated binding sites to construct a profile for each transcription factor. One of the oldest profile-based methods is the consensus sequence method. Consensus sequences consist of a simple text string in which each character of the string represents the most prevalent nucleotide in the corresponding position of DNA binding sites. As an extension to these consensus sequences, in 1982, Gary Stormo introduced the well-known and very popular positional weight matrix (PWM). These PWMs consist of a 4xL matrix, with L being the length of the binding sites. In each row of these matrices, the frequency of occurrence of one of the four nucleotides is given for each position in the binding sites. Even though these PWMs are a big improvement over the consensus sequence method, they also lead to many false positive predictions. Many alternative methods try to improve the accuracy of these PWMs, most of them with very limited success. In this thesis I will discuss the shortcomings of the previous generation of prediction methods and I will suggest new methods that overcome some of these shortcomings. The first method that will be discussed in this thesis makes use of a multiple sequence alignment (MSA) to visualize evolutionarily conserved transcription factor binding sites that are predicted with the PWM method. 
Binding sites that are conserved across all species in these alignments have a higher likelihood to be functional. Mutation of these binding sites would result in a less fit species, therefore mutations in these binding sites would have a negative effect. By inspecting these multiple sequence alignments for putative PWM hits we can reduce a large number of false positive predictions as false positive hits are less likely to be conserved. A second contribution of this thesis to the improvement of prediction methods is the research on and development of a number of new methods that make use of the structure and the biophysical characteristics of protein-DNA complexes. These characteristics are often overlooked in the previous generation of prediction methods even though they are very important for binding specificity in many protein-DNA complexes. With the help of the Random Forest classification method and sequence-based structural and biophysical characteristics we managed to develop models that can predict transcription factor binding sites with a higher level of accuracy. Based on this method, we also developed a user-friendly web-tool that can make use of a large number of pre-calculated transcription factor models
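
The consensus sequence and the 4xL frequency matrix described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis author's implementation: the function names `build_frequency_matrix` and `consensus` and the four example binding sites are hypothetical, chosen only for illustration, and the matrix holds plain per-position frequencies without pseudocounts or log-odds scoring.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def build_frequency_matrix(sites):
    """Return a 4 x L matrix as a dict: one row of L frequencies per nucleotide."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    matrix = {n: [0.0] * length for n in NUCLEOTIDES}
    for pos in range(length):
        # Count how often each nucleotide occurs at this position.
        counts = Counter(site[pos] for site in sites)
        for n in NUCLEOTIDES:
            matrix[n][pos] = counts[n] / len(sites)
    return matrix


def consensus(matrix):
    """Pick the most prevalent nucleotide per position (ties go to the first in ACGT order)."""
    length = len(matrix["A"])
    return "".join(
        max(NUCLEOTIDES, key=lambda n: matrix[n][pos]) for pos in range(length)
    )


# Hypothetical binding sites, made up for illustration only.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pwm = build_frequency_matrix(sites)
print(consensus(pwm))  # TATAAT
print(pwm["T"][2])     # 0.75
```

The consensus string keeps only the top nucleotide per column, whereas the matrix also records that the third position is C in one of the four sites. That extra information is what the matrix adds over the consensus sequence.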