10 research outputs found

    Predicting conserved protein motifs with Sub-HMMs

    Background: Profile HMMs (hidden Markov models) provide effective methods for modeling the conserved regions of protein families. A limitation of the resulting domain models is the difficulty of pinpointing their much shorter functional sub-features, such as catalytically relevant sequence motifs in enzymes or ligand binding signatures of receptor proteins. Results: To identify these conserved motifs efficiently, we propose a method for extracting the most information-rich regions in protein families from their profile HMMs. The method was used here to predict a comprehensive set of sub-HMMs from the Pfam domain database. Cross-validations with the PROSITE and CSA databases confirmed the efficiency of the method in predicting most of the known functionally relevant motifs and residues. At the same time, 46,768 novel conserved regions could be predicted. The data set also allowed us to link at least 461 Pfam domains of known and unknown function by their common sub-HMMs. Finally, the sub-HMM method showed very promising results as an alternative search method for identifying proteins that share only short sequence similarities. Conclusions: Sub-HMMs extend the application spectrum of profile HMMs to motif discovery. Their most interesting utility is the identification of the functionally relevant residues in proteins of known and unknown function. Additionally, sub-HMMs can be used for highly localized sequence similarity searches that focus on shorter conserved features rather than entire domains or global similarities. The motif data generated by this study is a valuable knowledge resource for characterizing protein functions in the future.
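
    The idea of pulling "information-rich regions" out of a profile HMM can be illustrated with a small sketch. This is not the authors' published procedure: it assumes each match state is given as a dictionary of emission probabilities, scores each state by its relative entropy against a background distribution, and merges sliding windows whose mean score passes a cutoff. The function names, the window size, and the threshold are all hypothetical choices made for illustration.

```python
import math


def column_information(probs, background):
    """Relative entropy (in bits) of one emission distribution vs. the background."""
    return sum(p * math.log2(p / background[a]) for a, p in probs.items() if p > 0)


def information_rich_regions(columns, background, window=3, threshold=1.0):
    """Return merged half-open (start, end) ranges of high-information columns.

    Illustrative only: `window` and `threshold` are assumed parameters,
    not values taken from the sub-HMM paper.
    """
    info = [column_information(c, background) for c in columns]
    merged = []
    for i in range(len(info) - window + 1):
        if sum(info[i:i + window]) / window < threshold:
            continue
        start, end = i, i + window
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous hit: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy four-letter alphabet with a uniform background.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
uniform = dict(background)
conserved = {"A": 1.0}
columns = [uniform, uniform, conserved, conserved, conserved, uniform, uniform]
regions = information_rich_regions(columns, background, window=3, threshold=1.5)
```

    In this toy input only the three fully conserved columns carry information (2 bits each), so a single region covering them is reported. A real profile HMM would use the 20-letter amino acid alphabet and its own null model.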

    Rapid Sequence Identification of Potential Pathogens Using Techniques from Sparse Linear Algebra

    The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing have rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we present D4RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear algebra and statistical properties to increase computational performance while retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield speed and precision tradeoffs, with applications in biodefense and medical diagnostics. The D4RAGenS analysis algorithm is tested over several datasets, including three utilized for the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest.
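
    As a rough illustration of how sequence identification can be phrased as sparse linear algebra, the sketch below treats k-mer counts as sparse vectors and scores a sample against each reference with a sparse product. It is an assumption-laden toy, not the D4RAGenS algorithm or the D4M API: the k-mer representation, the helper names, and the scoring rule are hypothetical, and the subsampling step mentioned in the abstract is omitted.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(references, k):
    """Sparse map k-mer -> {reference name: count}.

    This plays the role of a sparse reference-by-k-mer matrix,
    stored by k-mer so that only nonzero entries are kept.
    """
    index = defaultdict(dict)
    for name, seq in references.items():
        for kmer, count in Counter(kmers(seq, k)).items():
            index[kmer][name] = count
    return index


def match_scores(sample, index, k):
    """Sparse vector-matrix product of sample k-mer counts with the index."""
    scores = Counter()
    for kmer, count in Counter(kmers(sample, k)).items():
        for name, ref_count in index.get(kmer, {}).items():
            scores[name] += count * ref_count
    return scores


# Hypothetical toy references and read.
references = {"ref_a": "ACGTACGT", "ref_b": "TTTTGGGG"}
index = build_index(references, k=4)
scores = match_scores("CGTAC", index, k=4)
best = scores.most_common(1)[0][0]
```

    Only k-mers shared between the sample and a reference contribute to the score, which is why the sparse representation keeps the work proportional to the number of matches rather than to the full k-mer space.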

    Sewage effluent from an Indian hospital harbors novel carbapenemases and integron-borne antibiotic resistance genes

    Background: Hospital wastewaters contain fecal material from a large number of individuals, many of whom are undergoing antibiotic therapy. It is, thus, plausible that hospital wastewaters could provide opportunities to find novel carbapenemases and other resistance genes not yet described in clinical strains. Our aim was therefore to investigate the microbiota and antibiotic resistome of hospital effluent collected from the city of Mumbai, India, with a special focus on identifying novel carbapenemases. Results: Shotgun metagenomics revealed a total of 112 different mobile antibiotic resistance gene types, conferring resistance against almost all classes of antibiotics. Beta-lactamase genes, including those encoding clinically important carbapenemases such as NDM, VIM, IMP, KPC, and OXA-48, were abundant. NDM (0.9% abundance relative to 16S rRNA genes) was the most common carbapenemase gene, followed by OXA-58 (0.84% abundance relative to 16S rRNA genes). Among the investigated mobile genetic elements, class 1 integrons (11% abundance relative to 16S rRNA genes) were the most abundant. The genus Acinetobacter accounted for as many as 30% of the total 16S rRNA reads, with A. baumannii accounting for an estimated 2.5%. High throughput sequencing of amplified integron gene cassettes identified a novel functional variant of an IMP-type (proposed IMP-81) carbapenemase gene (eight aa substitutions) along with recently described novel resistance genes like sul4 and bla RSA1. Using a computational hidden Markov model, we detected 27 unique metallo-beta-lactamase (MBL) genes in the shotgun data, of which nine were novel subclass B1 genes, one novel subclass B2, and 10 novel subclass B3 genes. Six of the seven novel MBL genes were functional when expressed in Escherichia coli. Conclusion: By exploring hospital wastewater from India, our understanding of the diversity of carbapenemases has been extended. The study also demonstrates that the microbiota of hospital wastewater can serve as a reservoir of novel resistance genes, including previously uncharacterized carbapenemases with the potential to spread further.

    Principles of the evolution of the protein homeostasis network

    Cell homeostasis is the ability of a cell to maintain its balance and functionality. One cause of instability in this balance is stress, which leads to an accumulation of misfolded proteins that can form aggregates and cause neurodegenerative diseases. Molecular chaperones are the main cellular mechanism that promotes protein folding and quality control. They form the core of a network called the protein homeostasis network, whose purpose is to control, ensure, and protect the proteome through refolding and the elimination of aggregates. The network plays a vital role in maintaining cellular protein homeostasis, known as proteostasis. Failure of protein homeostasis is linked to aging and aging-associated neurodegenerative diseases such as Alzheimer's and Parkinson's. Currently, our understanding of how this network keeps the proteome in balance in health, and how it fails and causes disease, remains incomplete. Different species offer striking examples. For instance, the naked mole-rat, Heterocephalus glaber, remarkable for its life expectancy of over 30 years, has an effective stress-resistance mechanism that allows it to maintain homeostasis easily. On the other hand, the killifish, Nothobranchius furzeri, whose life expectancy is very short, shows accelerated aging with a pronounced loss of homeostasis. Here we ask how this balance works in these and other organisms. This research project uses bioinformatics and comparative genomics approaches to highlight the fundamental principles of proteostasis. The evolution of the chaperone network is analyzed in the context of proteome adaptation. We analyze the diversification of the chaperone network across the eukaryotic phylogeny and compare it with aspects of the evolution of the corresponding proteomes. This master's project provides fundamental insights into the biology and evolution of the protein homeostasis network. Specifically, we present a comparative genomic analysis of chaperones from 216 eukaryotic species, which indicates that the balance in proteostasis could be a key variable in explaining organismal robustness.

    Structure-based prediction of protein-protein interaction sites

    Protein-protein interactions play a central role in the formation of protein complexes and the biological pathways that orchestrate virtually all cellular processes. Reliable identification of the specific amino acid residues that form the interface of a protein with one or more other proteins is critical to understanding the structural and physico-chemical basis of protein interactions and their role in key cellular processes, predicting protein complexes, validating protein interactions predicted by high throughput methods, and identifying and prioritizing drug targets in computational drug design. Because of the difficulty and the high cost of experimental characterization of interface residues, there is an urgent need for computational methods for reliably predicting protein-protein interface residues from the sequence, and when available, the structure of a query protein, and when known, its putative interacting partner. Against this background, this thesis develops improved methods for predicting protein-protein interface residues and protein-protein interfaces from the three dimensional structure of an unbound query protein without considering information about its binding partner. Towards this end, we develop (i) ProtInDb (http://protindb.cs.iastate.edu), a database of protein-protein interface residues to facilitate (a) the generation of datasets of protein-protein interface residues that can be used to perform analysis of interaction sites and to train and evaluate predictors of interface residues, and (b) the visualization of interaction sites between proteins in both the amino acid sequences and the 3D protein structures, among other applications; (ii) PoInterS (http://pointers.cs.iastate.edu/), a method for predicting protein-protein interaction sites formed by spatially contiguous clusters of interface residues based on the predictions generated by a protein interface residue predictor. PoInterS divides a protein surface into a series of patches composed of several surface residues, and uses the outputs of the interface residue predictors to rank and select a small set of patches that are the most likely to constitute the interaction sites; and (iii) PrISE (http://prise.cs.iastate.edu/), a method for predicting protein-protein interface residues based on the similarity of the structural element formed by the query residue and its neighboring residues and the structural elements extracted from the interface and non-interface regions of proteins that are members of experimentally determined protein complexes. A structural element captures the atomic composition and solvent accessibility of a central residue and its closest neighbors in the protein structure. PrISE decomposes a query protein into a set of structural elements and searches for similar elements in a large set of proteins that belong to one or more experimentally determined complexes. The structural elements that are most similar to each structural element extracted from the query protein are then used to infer whether its central residue is or is not an interface residue. The results of our experiments using a variety of benchmark datasets show that PoInterS and PrISE generally outperform the state-of-the-art structure-based methods for predicting interaction patches and interface residues, respectively.

    Recognition of short functional motifs in protein sequences

    The main goal of this study was to develop a method for computational de novo prediction of short linear motifs (SLiMs) in protein sequences that would provide advantages over existing solutions for the users. The users are typically biological laboratory researchers who want to elucidate a function of a protein that is possibly mediated by a short motif. Such a function can be subcellular localization, secretion, post-translational modification or degradation of proteins. Conducting such studies only with experimental techniques is often associated with high costs and risks of uncertainty. Preliminary prediction of putative motifs with computational methods, which are fast and much less expensive, provides possibilities for generating hypotheses and, therefore, for more directed and efficient planning of experiments. To meet this goal, I have developed HH-MOTiF, a web-based tool for de novo discovery of SLiMs in a set of protein sequences. While working on the project, I have also detected patterns in sequence properties of certain SLiMs that make their de novo prediction easier. As some of these patterns are not yet described in the literature, I am sharing them in this thesis. While evaluating and comparing motif prediction results, I have identified conceptual gaps in theoretical studies, as well as in existing practical solutions, for comparing two sets of positional data annotating the same set of biological sequences. To close this gap and to be able to carry out in-depth performance analyses of HH-MOTiF in comparison to other predictors, I have developed a corresponding statistical method, SLALOM (for StatisticaL Analysis of Locus Overlap Method). It is currently available as a standalone command line tool.
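
    Comparing two sets of positional annotations on the same sequence can be made concrete with a minimal residue-level overlap measure. The sketch below is not SLALOM and makes no claim about its statistics: it assumes annotations are inclusive (start, end) intervals and reports the fraction of predicted positions that fall in the reference set (precision) and the fraction of reference positions that are recovered (recall). All names are hypothetical.

```python
def covered_positions(intervals):
    """Expand inclusive (start, end) intervals into a set of positions."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions


def residue_overlap(predicted, reference):
    """Residue-level precision and recall between two annotation sets.

    Overlapping intervals within one set are counted once, because
    positions are collected into a set before comparison.
    """
    pred = covered_positions(predicted)
    ref = covered_positions(reference)
    shared = len(pred & ref)
    precision = shared / len(pred) if pred else 0.0
    recall = shared / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical motif annotations on one protein sequence.
precision, recall = residue_overlap(predicted=[(5, 9)], reference=[(8, 12)])
```

    Here the predicted motif covers positions 5 to 9 and the reference motif covers 8 to 12, so two of five positions agree on each side. Counting positions rather than whole motifs is one of several possible conventions; a site-level comparison would count a motif as found once any overlap exists.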
