98,851 research outputs found

    On Prediction Using Variable Order Markov Models

    Full text link
    This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a "decomposed" CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems

    HMMER cut-off threshold tool (HMMERCTTER): Supervised classification of superfamily protein sequences with a reliable cut-off threshold

    Get PDF
    Background: Protein superfamilies can be divided into subfamilies of proteins with different functional characteristics. Their sequences can be classified hierarchically, which is part of sequence function assignation. Typically, there are no clear subfamily hallmarks that would allow pattern-based function assignation by which this task is mostly achieved based on the similarity principle. This is hampered by the lack of a score cut-off that is both sensitive and specific. Results: HMMER Cut-off Threshold Tool (HMMERCTTER) adds a reliable cut-off threshold to the popular HMMER. Using a high quality superfamily phylogeny, it clusters a set of training sequences such that the cluster-specific HMMER profiles show cluster or subfamily member detection with 100% precision and recall (P&R), thereby generating a specific threshold as inclusion cut-off. Profiles and thresholds are then used as classifiers to screen a target dataset. Iterative inclusion of novel sequences to groups and the corresponding HMMER profiles results in high sensitivity while specificity is maintained by imposing 100% P&R self detection. In three presented case studies of protein superfamilies, classification of large datasets with 100% precision was achieved with over 95% recall. Limits and caveats are presented and explained. Conclusions: HMMERCTTER is a promising protein superfamily sequence classifier provided high quality training datasets are used. It provides a decision support system that aids in the difficult task of sequence function assignation in the twilight zone of sequence similarity. All relevant data and source codes are available from the Github repository at the following.Fil: Pagnuco, Inti Anabela. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mar del Plata. Instituto de Investigaciones Científicas y Tecnológicas en Electrónica. Universidad Nacional de Mar del Plata. Facultad de Ingeniería. Instituto de Investigaciones Científicas y Tecnológicas en Electrónica; ArgentinaFil: Revuelta, María Victoria. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mar del Plata. Instituto de Investigaciones Biológicas. Universidad Nacional de Mar del Plata. Facultad de Ciencias Exactas y Naturales. Instituto de Investigaciones Biológicas; ArgentinaFil: Bondino, Hernán Gabriel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mar del Plata. Instituto de Investigaciones Biológicas. Universidad Nacional de Mar del Plata. Facultad de Ciencias Exactas y Naturales. Instituto de Investigaciones Biológicas; ArgentinaFil: Brun, Marcel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mar del Plata. Instituto de Investigaciones Científicas y Tecnológicas en Electrónica. Universidad Nacional de Mar del Plata. Facultad de Ingeniería. Instituto de Investigaciones Científicas y Tecnológicas en Electrónica; ArgentinaFil: Ten Have, Arjen. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Mar del Plata. Instituto de Investigaciones Biológicas. Universidad Nacional de Mar del Plata. Facultad de Ciencias Exactas y Naturales. Instituto de Investigaciones Biológicas; Argentin

    Small and mighty: adaptation of superphylum Patescibacteria to groundwater environment drives their genome simplicity.

    Get PDF
    BackgroundThe newly defined superphylum Patescibacteria such as Parcubacteria (OD1) and Microgenomates (OP11) has been found to be prevalent in groundwater, sediment, lake, and other aquifer environments. Recently increasing attention has been paid to this diverse superphylum including > 20 candidate phyla (a large part of the candidate phylum radiation, CPR) because it refreshed our view of the tree of life. However, adaptive traits contributing to its prevalence are still not well known.ResultsHere, we investigated the genomic features and metabolic pathways of Patescibacteria in groundwater through genome-resolved metagenomics analysis of > 600 Gbp sequence data. We observed that, while the members of Patescibacteria have reduced genomes (~ 1 Mbp) exclusively, functions essential to growth and reproduction such as genetic information processing were retained. Surprisingly, they have sharply reduced redundant and nonessential functions, including specific metabolic activities and stress response systems. The Patescibacteria have ultra-small cells and simplified membrane structures, including flagellar assembly, transporters, and two-component systems. Despite the lack of CRISPR viral defense, the bacteria may evade predation through deletion of common membrane phage receptors and other alternative strategies, which may explain the low representation of prophage proteins in their genomes and lack of CRISPR. By establishing the linkages between bacterial features and the groundwater environmental conditions, our results provide important insights into the functions and evolution of this CPR group.ConclusionsWe found that Patescibacteria has streamlined many functions while acquiring advantages such as avoiding phage invasion, to adapt to the groundwater environment. The unique features of small genome size, ultra-small cell size, and lacking CRISPR of this large lineage are bringing new understandings on life of Bacteria. Our results provide important insights into the mechanisms for adaptation of the superphylum in the groundwater environments, and demonstrate a case where less is more, and small is mighty

    ProLanGO: Protein Function Prediction Using Neural~Machine Translation Based on a Recurrent Neural Network

    Full text link
    With the development of next generation sequencing techniques, it is fast and cheap to determine protein sequences but relatively slow and expensive to extract useful information from protein sequences because of limitations of traditional biological experimental techniques. Protein function prediction has been a long standing challenge to fill the gap between the huge amount of protein sequences and the known function. In this paper, we propose a novel method to convert the protein function problem into a language translation problem by the new proposed protein sequence language "ProLan" to the protein function language "GOLan", and build a neural machine translation model based on recurrent neural networks to translate "ProLan" language to "GOLan" language. We blindly tested our method by attending the latest third Critical Assessment of Function Annotation (CAFA 3) in 2016, and also evaluate the performance of our methods on selected proteins whose function was released after CAFA competition. The good performance on the training and testing datasets demonstrates that our new proposed method is a promising direction for protein function prediction. In summary, we first time propose a method which converts the protein function prediction problem to a language translation problem and applies a neural machine translation model for protein function prediction.Comment: 13 pages, 5 figure

    Automated Protein Structure Classification: A Survey

    Full text link
    Classification of proteins based on their structure provides a valuable resource for studying protein structure, function and evolutionary relationships. With the rapidly increasing number of known protein structures, manual and semi-automatic classification is becoming ever more difficult and prohibitively slow. Therefore, there is a growing need for automated, accurate and efficient classification methods to generate classification databases or increase the speed and accuracy of semi-automatic techniques. Recognizing this need, several automated classification methods have been developed. In this survey, we overview recent developments in this area. We classify different methods based on their characteristics and compare their methodology, accuracy and efficiency. We then present a few open problems and explain future directions.Comment: 14 pages, Technical Report CSRG-589, University of Toront

    On the hierarchical classification of G Protein-Coupled Receptors

    Get PDF
    Motivation: G protein-coupled receptors (GPCRs) play an important role in many physiological systems by transducing an extracellular signal into an intracellular response. Over 50% of all marketed drugs are targeted towards a GPCR. There is considerable interest in developing an algorithm that could effectively predict the function of a GPCR from its primary sequence. Such an algorithm is useful not only in identifying novel GPCR sequences but in characterizing the interrelationships between known GPCRs. Results: An alignment-free approach to GPCR classification has been developed using techniques drawn from data mining and proteochemometrics. A dataset of over 8000 sequences was constructed to train the algorithm. This represents one of the largest GPCR datasets currently available. A predictive algorithm was developed based upon the simplest reasonable numerical representation of the protein's physicochemical properties. A selective top-down approach was developed, which used a hierarchical classifier to assign sequences to subdivisions within the GPCR hierarchy. The predictive performance of the algorithm was assessed against several standard data mining classifiers and further validated against Support Vector Machine-based GPCR prediction servers. The selective top-down approach achieves significantly higher accuracy than standard data mining methods in almost all cases

    Testing statistical hypothesis on random trees and applications to the protein classification problem

    Full text link
    Efficient automatic protein classification is of central importance in genomic annotation. As an independent way to check the reliability of the classification, we propose a statistical approach to test if two sets of protein domain sequences coming from two families of the Pfam database are significantly different. We model protein sequences as realizations of Variable Length Markov Chains (VLMC) and we use the context trees as a signature of each protein family. Our approach is based on a Kolmogorov--Smirnov-type goodness-of-fit test proposed by Balding et al. [Limit theorems for sequences of random trees (2008), DOI: 10.1007/s11749-008-0092-z]. The test statistic is a supremum over the space of trees of a function of the two samples; its computation grows, in principle, exponentially fast with the maximal number of nodes of the potential trees. We show how to transform this problem into a max-flow over a related graph which can be solved using a Ford--Fulkerson algorithm in polynomial time on that number. We apply the test to 10 randomly chosen protein domain families from the seed of Pfam-A database (high quality, manually curated families). The test shows that the distributions of context trees coming from different families are significantly different. We emphasize that this is a novel mathematical approach to validate the automatic clustering of sequences in any context. We also study the performance of the test via simulations on Galton--Watson related processes.Comment: Published in at http://dx.doi.org/10.1214/08-AOAS218 the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
    corecore