
    PRED-TMBB2: Improved topology prediction and detection of beta-barrel outer membrane proteins

    Motivation: The PRED-TMBB method is based on Hidden Markov Models and is capable of predicting the topology of beta-barrel outer membrane proteins and discriminating them from water-soluble ones. Here, we present an updated version of the method, PRED-TMBB2, with several newly developed features that improve its performance. The inclusion of a properly defined end state allows for better modeling of the beta-barrel domain, while different emission probabilities for the adjacent residues in strands are used to incorporate knowledge concerning the asymmetric amino acid distribution occurring there. Furthermore, the training was performed using newly developed algorithms in order to optimize the labels of the training sequences. Moreover, the method is retrained on a larger, non-redundant dataset which includes recently solved structures, and a newly developed decoding method was added to the already available options. Finally, the method now allows the incorporation of evolutionary information in the form of multiple sequence alignments. Results: The results of a strict cross-validation procedure show that PRED-TMBB2 with homology information performs significantly better compared to other available prediction methods. It yields 76% in correct topology predictions and outperforms the best available predictor by 7%, with an overall SOV of 0.9. Regarding detection of beta-barrel proteins, PRED-TMBB2, using just the query sequence as input, achieves an MCC value of 0.92, outperforming even predictors that are designed for this task and are much slower. Availability and Implementation: The method, along with all datasets used, is freely available for academic users at http://www.compgen.org/tools/PRED-TMBB2. © The Author 2016. Published by Oxford University Press. All rights reserved

    OMPdb: A database of β-barrel outer membrane proteins from Gram-negative bacteria

    We describe here OMPdb, which is currently the most complete and comprehensive collection of integral β-barrel outer membrane proteins from Gram-negative bacteria. The database currently contains 69 354 proteins, which are classified into 85 families, based mainly on structural and functional criteria. Although OMPdb follows the annotation scheme of Pfam, many of the families included in the database were not previously described or annotated in other publicly available databases. There are also cross-references to other databases, references to the literature and annotation for sequence features, like transmembrane segments and signal peptides. Furthermore, via the web interface, the user can not only browse the available data but also submit advanced text searches, run BLAST queries against the database protein sequences, and run domain searches against the collection of profile Hidden Markov Models that represent each family's domain organization. The database is freely accessible for academic users at http://bioinformatics.biol.uoa.gr/OMPdb and we expect it to be useful for genome-wide analyses, comparative genomics as well as for providing training and test sets for predictive algorithms regarding transmembrane β-barrels. © The Author(s) 2010

    Extending hidden Markov models to allow conditioning on previous observations

    Hidden Markov Models (HMMs) are probabilistic models widely used in computational molecular biology. However, the Markovian assumption regarding transition probabilities which dictates that the observed symbol depends only on the current state may not be sufficient for some biological problems. In order to overcome the limitations of the first order HMM, a number of extensions have been proposed in the literature to incorporate past information in HMMs conditioning either on the hidden states, or on the observations, or both. Here, we implement a simple extension of the standard HMM in which the current observed symbol (amino acid residue) depends both on the current state and on a series of observed previous symbols. The major advantage of the method is the simplicity in the implementation, which is achieved by properly transforming the observation sequence, using an extended alphabet. Thus, it can utilize all the available algorithms for the training and decoding of HMMs. We investigated the use of several encoding schemes and performed tests in a number of important biological problems previously studied by our team (prediction of transmembrane proteins and prediction of signal peptides). The evaluation shows that, when enough data are available, the performance increased by 1.8%-8.2% and the existing prediction methods may improve using this approach. The methods, for which the improvement was significant (PRED-TMBB2, PRED-TAT and HMM-TM), are available as web-servers freely accessible to academic users at www.compgen.org/tools/. © 2018 World Scientific Publishing Europe Ltd
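
    The observation-sequence transformation described above can be illustrated with a minimal sketch. It assumes one possible encoding scheme (each residue is paired with the preceding `order` residues to form a composite symbol) and a hypothetical padding character for the start of the sequence; the abstract mentions several encoding schemes and does not specify them, so this is not necessarily the one the authors used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # standard 20-residue alphabet
PAD = "-"  # hypothetical padding symbol for positions before the sequence start


def encode_with_context(sequence, order=1):
    """Rewrite a sequence so each symbol carries the previous `order` residues.

    The result is a list of composite symbols over an extended alphabet,
    which a standard first-order HMM can then treat as ordinary emissions.
    """
    padded = PAD * order + sequence
    return [padded[i:i + order + 1] for i in range(len(sequence))]


def first_order_alphabet():
    """Enumerate the extended alphabet for order=1 (previous residue + current)."""
    return [prev + cur for prev in PAD + AMINO_ACIDS for cur in AMINO_ACIDS]


if __name__ == "__main__":
    print(encode_with_context("MKT", order=1))  # ['-M', 'MK', 'KT']
    print(len(first_order_alphabet()))          # 420
```

    Because the transformed sequence is still a plain symbol sequence, existing training and decoding routines can be reused unchanged; the cost is an alphabet that grows quickly with `order`, which is consistent with the abstract's remark that the gain appears when enough data are available.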

    Prediction of lipoprotein signal peptides in Gram-positive bacteria with a Hidden Markov Model

    We present a Hidden Markov Model method for the prediction of lipoprotein signal peptides of Gram-positive bacteria, trained on a set of 67 experimentally verified lipoproteins. The method outperforms LipoP and the methods based on regular expression patterns, in various data sets containing experimentally characterized lipoproteins, secretory proteins, proteins with an N-terminal TM segment and cytoplasmic proteins. The method is also very sensitive and specific in the detection of secretory signal peptides and in terms of overall accuracy outperforms even SignalP, which is the top-scoring method for the prediction of signal peptides. PRED-LIPO is freely available at http://bioinformatics.biol.uoa.gr/PRED-LIPO/, and we anticipate that it will be a valuable tool for the experimentalists studying secreted proteins and lipoproteins from Gram-positive bacteria. © 2008 American Chemical Society

    GWAR: Robust analysis and meta-analysis of genome-wide association studies

    Motivation: In the context of genome-wide association studies (GWAS), a variety of statistical techniques is available for conducting the analysis, but in most cases the underlying genetic model is unknown. Under these circumstances, the classical Cochran-Armitage trend test (CATT) is suboptimal. Robust procedures that maximize the power and preserve the nominal type I error rate are preferable. Moreover, performing a meta-analysis using robust procedures is of great interest and has never been addressed in the past. The primary goal of this work is to implement several robust methods for analysis and meta-analysis in the statistical package Stata and subsequently to make the software available to the scientific community. Results: The CATT under a recessive, additive and dominant model of inheritance as well as robust methods based on the Maximum Efficiency Robust Test statistic, the MAX statistic and the MIN2 were implemented in Stata. Concerning MAX and MIN2, we calculated their asymptotic null distributions relying on numerical integration resulting in a great gain in computational time without losing accuracy. All the aforementioned approaches were employed in a fixed or a random effects meta-analysis setting using summary data with weights equal to the reciprocal of the combined cases and controls. Overall, this is the first complete effort to implement procedures for analysis and meta-analysis in GWAS using Stata. Availability and Implementation: A Stata program and a web-server are freely available for academic users at http://www.compgen.org/tools/GWAR. © The Author 2017
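
    For readers unfamiliar with the CATT, the following is a minimal Python sketch of the textbook statistic for a single 2x3 genotype table, with the usual score vectors for the recessive, additive and dominant models. It is not the GWAR Stata code and does not implement MERT, MAX or MIN2; the function name, the score dictionary and the example counts are assumptions made purely for illustration.

```python
import math

# Conventional genotype scores (aa, Aa, AA) for each inheritance model.
SCORES = {
    "recessive": (0.0, 0.0, 1.0),
    "additive": (0.0, 0.5, 1.0),
    "dominant": (0.0, 1.0, 1.0),
}


def catt(cases, controls, model="additive"):
    """Return (chi-square statistic, p-value) of the Cochran-Armitage trend test.

    `cases` and `controls` are genotype counts ordered (aa, Aa, AA).
    The statistic is referred to a chi-square distribution with 1 df.
    """
    x = SCORES[model]
    n = [r + s for r, s in zip(cases, controls)]  # genotype totals
    R, S = sum(cases), sum(controls)
    N = R + S
    sum_xr = sum(xi * ri for xi, ri in zip(x, cases))
    sum_xn = sum(xi * ni for xi, ni in zip(x, n))
    sum_x2n = sum(xi * xi * ni for xi, ni in zip(x, n))
    numerator = N * (N * sum_xr - R * sum_xn) ** 2
    denominator = R * S * (N * sum_x2n - sum_xn ** 2)
    if denominator == 0:
        return 0.0, 1.0  # degenerate table: no trend can be estimated
    chi2 = numerator / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail of chi-square, 1 df
    return chi2, p_value


if __name__ == "__main__":
    # Illustrative counts only, not data from the paper.
    print(catt((10, 20, 30), (30, 20, 10), model="additive"))
```

    Running the test once per score vector shows why the choice of model matters: the same table can give different statistics under the recessive, additive and dominant scores, which is the situation the robust procedures in the abstract are meant to handle.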

    Prediction of signal peptides in archaea

    Computational prediction of signal peptides (SPs) and their cleavage sites is of great importance in computational biology; however, currently there is no available method capable of predicting reliably the SPs of archaea, due to the limited amount of experimentally verified proteins with SPs. We performed an extensive literature search in order to identify archaeal proteins having experimentally verified SP and managed to find 69 such proteins, the largest number ever reported. A detailed analysis of these sequences revealed some unique features of the SPs of archaea, such as the unique amino acid composition of the hydrophobic region with a higher than expected occurrence of isoleucine, and a cleavage site more closely resembling the sequences of Gram-positive bacteria, with almost equal amounts of alanine and valine at position -3 before the cleavage site and a dominant alanine at position -1, followed in abundance by serine and glycine. Using these proteins as a training set, we trained a hidden Markov model method that predicts the presence of the SPs and their cleavage sites and also discriminates such proteins from cytoplasmic and transmembrane ones. The method performs satisfactorily, yielding, in a 35-fold cross-validation procedure, a sensitivity of 100% and a specificity of 98.41%, with the Matthews' correlation coefficient being equal to 0.964. This particular method is currently the only available method for the prediction of secretory SPs in archaea, and performs consistently and significantly better compared with other available predictors that were trained on sequences of eukaryotic or bacterial origin. Searching 48 completely sequenced archaeal genomes we identified 9437 putative SPs. The method, PRED-SIGNAL, and the results are freely available for academic users at http://bioinformatics.biol.uoa.gr/PRED-SIGNAL/ and we anticipate that it will be a valuable tool for the computational analysis of archaeal genomes. © The Author 2008. Published by Oxford University Press. All rights reserved
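
    The three measures reported above (sensitivity, specificity and the Matthews correlation coefficient) are all derived from the counts of true and false positives and negatives. A minimal sketch of the standard definitions follows; the function names and the example counts are illustrative assumptions and are not taken from the paper.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true positives among all actual positives."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true negatives among all actual negatives."""
    return tn / (tn + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Made-up confusion counts, for illustration only.
    tp, tn, fp, fn = 8, 9, 1, 2
    print(sensitivity(tp, fn), specificity(tn, fp), mcc(tp, tn, fp, fn))
```

    Unlike sensitivity or specificity alone, the MCC uses all four counts, so it stays informative when the positive and negative sets differ greatly in size, as they typically do when a small set of verified SPs is tested against many non-SP proteins.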

    Semi-supervised learning of hidden Markov models for biological sequence analysis

    Motivation: Hidden Markov Models (HMMs) are probabilistic models widely used in applications in computational sequence analysis. HMMs are basically unsupervised models. However, in the most important applications, they are trained in a supervised manner. Training examples accompanied by labels corresponding to different classes are given as input and the set of parameters that maximize the joint probability of sequences and labels is estimated. A main problem with this approach is that, in the majority of the cases, labels are hard to find and thus the amount of training data is limited. On the other hand, there are plenty of unclassified (unlabeled) sequences deposited in the public databases that could potentially contribute to the training procedure. Training on labeled and unlabeled data together is called semi-supervised learning and could be very helpful in many applications. Results: We propose here a method for semi-supervised learning of HMMs that can incorporate labeled, unlabeled and partially labeled data in a straightforward manner. The algorithm is based on a variant of the Expectation-Maximization (EM) algorithm, where the missing labels of the unlabeled or partially labeled data are considered as the missing data. We apply the algorithm to several biological problems, namely, for the prediction of transmembrane protein topology for alpha-helical and beta-barrel membrane proteins and for the prediction of archaeal signal peptides. The results are very promising, since the algorithms presented here can significantly improve the prediction performance of even the top-scoring classifiers. © The Author(s) 2018. Published by Oxford University Press. All rights reserved

    Prediction of cell wall sorting signals in Gram-positive bacteria with a hidden Markov model: Application to complete genomes

    Surface proteins in Gram-positive bacteria are frequently implicated in virulence. We have focused on a group of extracellular cell wall-attached proteins (CWPs), containing an LPXTG motif for cleavage and covalent coupling to peptidoglycan by sortase enzymes. A hidden Markov model (HMM) approach for predicting the LPXTG-anchored cell wall proteins of Gram-positive bacteria was developed and compared against existing methods. The HMM model is parsimonious in terms of the number of freely estimated parameters, and it has proved to be very sensitive and specific in a training set of 55 experimentally verified LPXTG-anchored cell wall proteins as well as in reliable data sets of globular and transmembrane proteins. In order to identify such proteins in Gram-positive bacteria, a comprehensive analysis of 94 completely sequenced genomes has been performed. We identified, in total, 860 LPXTG-anchored cell wall proteins, a number that is significantly higher compared to those obtained by other available methods. Of these proteins, 237 are hypothetical proteins according to the annotation of SwissProt, and 88 had no homologs in the SwissProt database - this might be evidence that they are members of newly identified families of CWPs. The prediction tool, the database with the proteins identified in the genomes, and supplementary material are available online at http://bioinformatics.biol.uoa.gr/CW-PRED/. © 2008 Imperial College Press
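
    To make the LPXTG motif concrete, here is a minimal sketch of a plain pattern search, where X stands for any residue. This is only a motif scan, not the HMM described in the abstract, and it ignores everything else a real cell wall sorting signal requires; the function name and the toy sequence are assumptions for illustration.

```python
import re

# "X" in LPXTG means any single residue, hence the "." wildcard.
LPXTG = re.compile(r"LP.TG")


def find_lpxtg(sequence):
    """Return (0-based start position, matched pentapeptide) for each LPXTG hit."""
    return [(m.start(), m.group()) for m in LPXTG.finditer(sequence.upper())]


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    print(find_lpxtg("AAALPETGKK"))  # [(3, 'LPETG')]
```

    A bare pattern like this matches any occurrence of the five residues, wherever it falls in the protein, which is one reason a probabilistic model of the whole sorting signal can be both more sensitive and more specific.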

    Landscape of Eukaryotic Transmembrane Beta Barrel Proteins

    Even though in the last few years several families of eukaryotic β-barrel outer membrane proteins have been discovered, their computational characterization and their annotation in public databases are far from complete. The PFAM database includes only very few characteristic profiles for these families, and in most cases, the profile hidden Markov models (pHMMs) have been trained using prokaryotic and eukaryotic proteins together. Here, we present for the first time a comprehensive computational analysis of eukaryotic transmembrane β-barrels. Twelve characteristic pHMMs were built, based on an extensive literature search, which can discriminate eukaryotic β-barrels from other classes of proteins (globular and bacterial β-barrel ones), as well as between mitochondrial and chloroplastic ones. We built eight novel profiles for the chloroplastic β-barrel families that are not present in the PFAM database, updated the PFAM profile for the MDM10 family (PF12519), and divided the porin family (PF01459) into two separate families, namely, VDAC and TOM40. Copyright © 2020 American Chemical Society

    ExTopoDB: A database of experimentally derived topological models of transmembrane proteins

    Summary: ExTopoDB is a publicly accessible database of experimentally derived topological models of transmembrane proteins. It contains information collected from studies in the literature that report the use of biochemical methods for the determination of the topology of α-helical transmembrane proteins. Transmembrane protein topology is highly important in order to understand their function and ExTopoDB provides an up to date, complete and comprehensive dataset of experimentally determined topologies of α-helical transmembrane proteins. Topological information is combined with transmembrane topology prediction resulting in more reliable topological models. © The Author 2010. Published by Oxford University Press. All rights reserved