118 research outputs found

    A generalized risk approach to path inference based on hidden Markov models

    Full text link
    Motivated by the unceasing interest in hidden Markov models (HMMs), this paper re-examines hidden path inference in these models, using primarily a risk-based framework. While the most common maximum a posteriori (MAP), or Viterbi, path estimator and the minimum error, or Posterior Decoder (PD), have long been around, other path estimators, or decoders, have been either only hinted at or applied more recently and in dedicated applications generally unfamiliar to the statistical learning community. Over a decade ago, however, a family of algorithmically defined decoders aiming to hybridize the two standard ones was proposed (Brushe et al., 1998). The present paper gives a careful analysis of this hybridization approach, identifies several problems and issues with it and other previously proposed approaches, and proposes practical resolutions of those. Furthermore, simple modifications of the classical criteria for hidden path recognition are shown to lead to a new class of decoders. Dynamic programming algorithms to compute these decoders in the usual forward-backward manner are presented. A particularly interesting subclass of such estimators can be also viewed as hybrids of the MAP and PD estimators. Similar to previously proposed MAP-PD hybrids, the new class is parameterized by a small number of tunable parameters. Unlike their algorithmic predecessors, the new risk-based decoders are more clearly interpretable, and, most importantly, work "out of the box" in practice, which is demonstrated on some real bioinformatics tasks and data. Some further generalizations and applications are discussed in conclusion.Comment: Section 5: corrected denominators of the scaled beta variables (pp. 27-30), => corrections in claims 1, 3, Prop. 12, bottom of Table 1. Decoder (49), Corol. 14 are generalized to handle 0 probabilities. Notation is more closely aligned with (Bishop, 2006). Details are inserted in eqn-s (43); the positivity assumption in Prop. 11 is explicit. Fixed typing errors in equation (41), Example

    DeepSig: Deep learning improves signal peptide detection in proteins

    Get PDF
    Motivation: The identification of signal peptides in protein sequences is an important step toward protein localization and function characterization. Results: Here, we present DeepSig, an improved approach for signal peptide detection and cleavage-site prediction based on deep learning methods. Comparative benchmarks performed on an updated independent dataset of proteins show that DeepSig is the current best performing method, scoring better than other available state-of-the-art approaches on both signal peptide detection and precise cleavage-site identification. Availability and implementation: DeepSig is available as both standalone program and web server at https://deepsig.biocomp.unibo.it. All datasets used in this study can be obtained from the same website

    Progressive Mauve: Multiple alignment of genomes with gene flux and rearrangement

    Full text link
    Multiple genome alignment remains a challenging problem. Effects of recombination including rearrangement, segmental duplication, gain, and loss can create a mosaic pattern of homology even among closely related organisms. We describe a method to align two or more genomes that have undergone large-scale recombination, particularly genomes that have undergone substantial amounts of gene gain and loss (gene flux). The method utilizes a novel alignment objective score, referred to as a sum-of-pairs breakpoint score. We also apply a probabilistic alignment filtering method to remove erroneous alignments of unrelated sequences, which are commonly observed in other genome alignment methods. We describe new metrics for quantifying genome alignment accuracy which measure the quality of rearrangement breakpoint predictions and indel predictions. The progressive genome alignment algorithm demonstrates markedly improved accuracy over previous approaches in situations where genomes have undergone realistic amounts of genome rearrangement, gene gain, loss, and duplication. We apply the progressive genome alignment algorithm to a set of 23 completely sequenced genomes from the genera Escherichia, Shigella, and Salmonella. The 23 enterobacteria have an estimated 2.46Mbp of genomic content conserved among all taxa and total unique content of 15.2Mbp. We document substantial population-level variability among these organisms driven by homologous recombination, gene gain, and gene loss. Free, open-source software implementing the described genome alignment approach is available from http://gel.ahabs.wisc.edu/mauve .Comment: Revision dated June 19, 200

    Advantages of combined transmembrane topology and signal peptide prediction—the Phobius web server

    Get PDF
    When using conventional transmembrane topology and signal peptide predictors, such as TMHMM and SignalP, there is a substantial overlap between these two types of predictions. Applying these methods to five complete proteomes, we found that 30–65% of all predicted signal peptides and 25–35% of all predicted transmembrane topologies overlap. This impairs predictions of 5–10% of the proteome, hence this is an important issue in protein annotation

    Algorithms for incorporating prior topological information in HMMs: application to transmembrane proteins

    Get PDF
    BACKGROUND: Hidden Markov Models (HMMs) have been extensively used in computational molecular biology, for modelling protein and nucleic acid sequences. In many applications, such as transmembrane protein topology prediction, the incorporation of limited amount of information regarding the topology, arising from biochemical experiments, has been proved a very useful strategy that increased remarkably the performance of even the top-scoring methods. However, no clear and formal explanation of the algorithms that retains the probabilistic interpretation of the models has been presented so far in the literature. RESULTS: We present here, a simple method that allows incorporation of prior topological information concerning the sequences at hand, while at the same time the HMMs retain their full probabilistic interpretation in terms of conditional probabilities. We present modifications to the standard Forward and Backward algorithms of HMMs and we also show explicitly, how reliable predictions may arise by these modifications, using all the algorithms currently available for decoding HMMs. A similar procedure may be used in the training procedure, aiming at optimizing the labels of the HMM's classes, especially in cases such as transmembrane proteins where the labels of the membrane-spanning segments are inherently misplaced. We present an application of this approach developing a method to predict the transmembrane regions of alpha-helical membrane proteins, trained on crystallographically solved data. We show that this method compares well against already established algorithms presented in the literature, and it is extremely useful in practical applications. CONCLUSION: The algorithms presented here, are easily implemented in any kind of a Hidden Markov Model, whereas the prediction method (HMM-TM) is freely available for academic users at , offering the most advanced decoding options currently available

    Transmembrane protein topology prediction using support vector machines

    Get PDF
    Background: Alpha-helical transmembrane (TM) proteins are involved in a wide range of important biological processes such as cell signaling, transport of membrane-impermeable molecules, cell-cell communication, cell recognition and cell adhesion. Many are also prime drug targets, and it has been estimated that more than half of all drugs currently on the market target membrane proteins. However, due to the experimental difficulties involved in obtaining high quality crystals, this class of protein is severely under-represented in structural databases. In the absence of structural data, sequence-based prediction methods allow TM protein topology to be investigated.Results: We present a support vector machine-based (SVM) TM protein topology predictor that integrates both signal peptide and re-entrant helix prediction, benchmarked with full cross-validation on a novel data set of 131 sequences with known crystal structures. The method achieves topology prediction accuracy of 89%, while signal peptides and re-entrant helices are predicted with 93% and 44% accuracy respectively. An additional SVM trained to discriminate between globular and TM proteins detected zero false positives, with a low false negative rate of 0.4%. We present the results of applying these tools to a number of complete genomes. Source code, data sets and a web server are freely available from http://bioinf.cs.ucl.ac.uk/psipred/.Conclusion: The high accuracy of TM topology prediction which includes detection of both signal peptides and re-entrant helices, combined with the ability to effectively discriminate between TM and globular proteins, make this method ideally suited to whole genome annotation of alpha-helical transmembrane proteins

    Profile hidden Markov models for foreground object modelling

    Get PDF
    Accurate background/foreground segmentation is a preliminary process essential to most visual surveillance applications. With the increasing use of freely moving cameras, strategies have been proposed to refine initial segmentation. In this paper, it is proposed to exploit the Vide-omics paradigm, and Profile Hidden Markov Models in particular, to create a new type of object descriptors relying on spatiotemporal information. Performance of the proposed methodology has been evaluated using a standard dataset of videos captured by moving cameras. Results show that usage of the proposed object descriptors allows better foreground extraction than standard approaches

    XRate: a fast prototyping, training and annotation tool for phylo-grammars

    Get PDF
    BACKGROUND: Recent years have seen the emergence of genome annotation methods based on the phylo-grammar, a probabilistic model combining continuous-time Markov chains and stochastic grammars. Previously, phylo-grammars have required considerable effort to implement, limiting their adoption by computational biologists. RESULTS: We have developed an open source software tool, xrate, for working with reversible, irreversible or parametric substitution models combined with stochastic context-free grammars. xrate efficiently estimates maximum-likelihood parameters and phylogenetic trees using a novel "phylo-EM" algorithm that we describe. The grammar is specified in an external configuration file, allowing users to design new grammars, estimate rate parameters from training data and annotate multiple sequence alignments without the need to recompile code from source. We have used xrate to measure codon substitution rates and predict protein and RNA secondary structures. CONCLUSION: Our results demonstrate that xrate estimates biologically meaningful rates and makes predictions whose accuracy is comparable to that of more specialized tools

    New decoding algorithms for Hidden Markov Models using distance measures on labellings

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Existing hidden Markov model decoding algorithms do not focus on approximately identifying the sequence feature boundaries.</p> <p>Results</p> <p>We give a set of algorithms to compute the conditional probability of all labellings "near" a reference labelling <it>λ </it>for a sequence <it>y </it>for a variety of definitions of "near". In addition, we give optimization algorithms to find the best labelling for a sequence in the robust sense of having all of its feature boundaries nearly correct. Natural problems in this domain are <it>NP</it>-hard to optimize. For membrane proteins, our algorithms find the approximate topology of such proteins with comparable success to existing programs, while being substantially more accurate in estimating the positions of transmembrane helix boundaries.</p> <p>Conclusion</p> <p>More robust HMM decoding may allow for better analysis of sequence features, in reasonable runtimes.</p
    corecore