3,239 research outputs found

    Conditional random fields for the prediction of signal peptide cleavage sites

    Get PDF
    Correct prediction of signal peptide cleavage sites has a significant impact on drug design. State-of-the-art approaches to cleavage site prediction typically use generative models (such as HMMs) to represent the statistics of amino acid sequences or use neural networks to detect the changes in short amino-acid segments along a query sequence. By formulating cleavage site prediction as a sequence labeling problem, this paper demonstrates how conditional random fields (CRFs) can be applied to cleavage site prediction. The paper also demonstrates how amino acid properties can be exploited and incorporated into the CRFs to boost prediction performance. Results show that the performance of CRFs is comparable to that of a state-of-the-art predictor (SignalP V3.0). Further performance improvement was observed when the decisions of SignalP and the CRF-based predictor are fused. Index Terms — Conditional random fields, discriminative models, signal peptides, cleavage sites, protein sequences

    DeepSig: Deep learning improves signal peptide detection in proteins

    Get PDF
    Motivation: The identification of signal peptides in protein sequences is an important step toward protein localization and function characterization. Results: Here, we present DeepSig, an improved approach for signal peptide detection and cleavage-site prediction based on deep learning methods. Comparative benchmarks performed on an updated independent dataset of proteins show that DeepSig is the current best performing method, scoring better than other available state-of-the-art approaches on both signal peptide detection and precise cleavage-site identification. Availability and implementation: DeepSig is available as both standalone program and web server at https://deepsig.biocomp.unibo.it. All datasets used in this study can be obtained from the same website

    Fast subcellular localization by cascaded fusion of signal-based and homology-based methods

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means.</p> <p>Results</p> <p>This paper proposes mitigating the computation burden of alignment-based approaches to subcellular localization prediction by a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal shorting signals. Then, the sequences are truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular localization are subsequently predicted by a profile-to-profile alignment support-vector-machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on the perturbational discriminant analysis (PDA).</p> <p>Conclusions</p> <p>Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method can make use of the best property of signal- and homology-based approaches and can attain an accuracy comparable to that achieved by using full-length sequences. Analysis of profile-alignment score matrices suggest that both profile creation time and profile alignment time can be reduced without significant reduction in subcellular localization accuracy. It was found that PDA enjoys a short training time as compared to the conventional SVM. We advocate that the method will be important for biologists to conduct large-scale protein annotation or for bioinformaticians to perform preliminary investigations on new algorithms that involve pairwise alignments.</p

    LabCaS: Labeling calpain substrate cleavage sites from amino acid sequence using conditional random fields

    Full text link
    The calpain family of Ca 2+ ‐dependent cysteine proteases plays a vital role in many important biological processes which is closely related with a variety of pathological states. Activated calpains selectively cleave relevant substrates at specific cleavage sites, yielding multiple fragments that can have different functions from the intact substrate protein. Until now, our knowledge about the calpain functions and their substrate cleavage mechanisms are limited because the experimental determination and validation on calpain binding are usually laborious and expensive. In this work, we aim to develop a new computational approach (LabCaS) for accurate prediction of the calpain substrate cleavage sites from amino acid sequences. To overcome the imbalance of negative and positive samples in the machine‐learning training which have been suffered by most of the former approaches when splitting sequences into short peptides, we designed a conditional random field algorithm that can label the potential cleavage sites directly from the entire sequences. By integrating the multiple amino acid features and those derived from sequences, LabCaS achieves an accurate recognition of the cleave sites for most calpain proteins. In a jackknife test on a set of 129 benchmark proteins, LabCaS generates an AUC score 0.862. The LabCaS program is freely available at: http://www.csbio.sjtu.edu.cn/bioinf/LabCaS . Proteins 2013. © 2012 Wiley Periodicals, Inc.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/97155/1/24217_ftp.pd

    Speeding up subcellular localization by extracting informative regions of protein sequences for profile alignment

    Full text link
    The functions of proteins are closely related to their subcellular locations. In the post-proteomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means. This paper proposes mitigating the computation burden of alignment-based approaches to subcellular localization prediction by using the information provided by the N-terminal sorting signals. To this end, a cascaded fusion of cleavage site prediction and profile alignment is proposed. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor. Then, only the informative segments are applied to a homology-based classifier for predicting the subcellular locations. Experimental results on a newly constructed dataset show that the method can make use of the best property of both approaches and can attain an accuracy higher than using the full-length sequences. Moreover, the method can reduce the computation time by 20 folds. We advocate that the method will be important for biologists to conduct large-scale protein annotation or for bioinformaticians to perform preliminary investigations on new algorithms that involve pairwise alignments.Department of Electronic and Information EngineeringRefereed conference pape

    Finding functional motifs in protein sequences with deep learning and natural language models

    Get PDF
    Recently, prediction of structural/functional motifs in protein sequences takes advantage of powerful machine learning based approaches. Protein encoding adopts protein language models overpassing standard procedures. Different combinations of machine learning and encoding schemas are available for predicting different structural/functional motifs. Particularly interesting is the adoption of protein language models to encode proteins in addition to evolution information and physicochemical parameters. A thorough analysis of recent predictors developed for annotating transmembrane regions, sorting signals, lipidation and phosphorylation sites allows to investigate the state-of-the-art focusing on the relevance of protein language models for the different tasks. This highlights that more experimental data are necessary to exploit available powerful machine learning methods

    Computational methods for genome and proteome annotation

    Get PDF
    Il progresso tecnologico nel campo della biologia molecolare, pone la comunità scientifica di fronte all’esigenza di dare un’interpretazione all’enormità di sequenze biologiche che a mano a mano vanno a costituire le banche dati, siano esse proteine o acidi nucleici. In questo contesto la bioinformatica gioca un ruolo di primaria importanza. Un nuovo livello di possibilità conoscitive è stato introdotto con le tecnologie di Next Generation Sequencing (NGS), per mezzo delle quali è possibile ottenere interi genomi o trascrittomi in poco tempo e con bassi costi. Tra le applicazioni del NGS più rilevanti ci sono senza dubbio quelle oncologiche che prevedono la caratterizzazione genomica di tessuti tumorali e lo sviluppo di nuovi approcci diagnostici e terapeutici per il trattamento del cancro. Con l’analisi NGS è possibile individuare il set completo di variazioni che esistono nel genoma tumorale come varianti a singolo nucleotide, riarrangiamenti cromosomici, inserzioni e delezioni. Va però sottolineato che le variazioni trovate nei geni vanno in ultima battuta osservate dal punto di vista degli effetti a livello delle proteine in quanto esse sono le responsabili più dirette dei fenotipi alterati riscontrabili nella cellula tumorale. L’expertise bioinformatica va quindi collocata sia a livello dell’analisi del dato prodotto per mezzo di NGS ma anche nelle fasi successive ove è necessario effettuare l’annotazione dei geni contenuti nel genoma sequenziato e delle relative strutture proteiche che da esso sono espresse, o, come nel caso dello studio mutazionale, la valutazione dell’effetto della variazione genomica. È in questo contesto che si colloca il lavoro presentato: da un lato lo sviluppo di metodologie computazionali per l’annotazione di sequenze proteiche e dall’altro la messa a punto di una pipeline di analisi di dati prodotti con tecnologie NGS in applicazioni oncologiche avente come scopo finale quello della individuazione e caratterizzazione delle mutazioni genetiche tumorali a livello proteico.The technological advancement in the molecular biology field is strongly dependent from the availability of bioinformatics support in order to manage and to interpret the very large amount of biological sequences produced. A new level of knowledge is derivable from the Next Generation Sequencing technology (NGS), that allows to obtain whole genomes and transcriptomes in few days and with low costs. Cancer research is one of the most relevant applications of NGS technology. With the NGS analysis is possible to perform the genomic characterization of tumors highlighting all the detectable mutations such as single nucleotide variants, chromosomal rearrangements, insertions and deletions, in order to develop new diagnostic, prognostic and therapeutic approaches in the cancer treatment. However, the genetic variation found in the samples must be analyzed at protein level since they directly alter the tumor cell phenotype. The bioinformatic expertise is essential to implement the NGS data analysis and also to perform the annotation of the genes emerging from the sequenced genome and of the protein expressed by it. In this work of thesis two computational methods for protein annotation have been developed (prediction of targeting and signal peptides); moreover we will introduce a bioinformatic pipeline to analyze NGS data and to annotate the genetic variants in the cancer genomic research

    Coronavirus 3CL(pro )proteinase cleavage sites: Possible relevance to SARS virus pathology

    Get PDF
    BACKGROUND: Despite the passing of more than a year since the first outbreak of Severe Acute Respiratory Syndrome (SARS), efficient counter-measures are still few and many believe that reappearance of SARS, or a similar disease caused by a coronavirus, is not unlikely. For other virus families like the picornaviruses it is known that pathology is related to proteolytic cleavage of host proteins by viral proteinases. Furthermore, several studies indicate that virus proliferation can be arrested using specific proteinase inhibitors supporting the belief that proteinases are indeed important during infection. Prompted by this, we set out to analyse and predict cleavage by the coronavirus main proteinase using computational methods. RESULTS: We retrieved sequence data on seven fully sequenced coronaviruses and identified the main 3CL proteinase cleavage sites in polyproteins using alignments. A neural network was trained to recognise the cleavage sites in the genomes obtaining a sensitivity of 87.0% and a specificity of 99.0%. Several proteins known to be cleaved by other viruses were submitted to prediction as well as proteins suspected relevant in coronavirus pathology. Cleavage sites were predicted in proteins such as the cystic fibrosis transmembrane conductance regulator (CFTR), transcription factors CREB-RP and OCT-1, and components of the ubiquitin pathway. CONCLUSIONS: Our prediction method NetCorona predicts coronavirus cleavage sites with high specificity and several potential cleavage candidates were identified which might be important to elucidate coronavirus pathology. Furthermore, the method might assist in design of proteinase inhibitors for treatment of SARS and possible future diseases caused by coronaviruses. It is made available for public use at our website:
    corecore