DeepSig: Deep learning improves signal peptide detection in proteins
Motivation:
The identification of signal peptides in protein sequences is an important step toward protein localization and function characterization.
Results:
Here, we present DeepSig, an improved approach for signal peptide detection and cleavage-site prediction based on deep learning methods. Comparative benchmarks performed on an updated independent dataset of proteins show that DeepSig is the current best performing method, scoring better than other available state-of-the-art approaches on both signal peptide detection and precise cleavage-site identification.
Availability and implementation:
DeepSig is available as both a standalone program and a web server at https://deepsig.biocomp.unibo.it. All datasets used in this study can be obtained from the same website.
CoBaltDB: Complete bacterial and archaeal orfeomes subcellular localization database and associated resources
BACKGROUND: The functions of proteins are strongly related to their localization in cell compartments (for example the cytoplasm or membranes), but the experimental determination of the subcellular localization of proteomes is laborious and expensive. A fast and low-cost alternative approach is in silico prediction, based on features of the protein primary sequences. However, biologists are confronted with a very large number of computational tools that use different methods and address various localization features with diverse specificities and sensitivities. As a result, exploiting these computer resources to predict protein localization accurately involves querying all tools and comparing every prediction output; this is a painstaking task. Therefore, we developed a comprehensive database, called CoBaltDB, that gathers all prediction outputs concerning complete prokaryotic proteomes. DESCRIPTION: The current version of CoBaltDB integrates the results of 43 localization predictors for 784 complete bacterial and archaeal proteomes (2,548,292 proteins in total). CoBaltDB supplies a simple user-friendly interface for retrieving and exploring relevant information about predicted features (such as signal peptide cleavage sites and transmembrane segments). Data are organized into three work-sets ("specialized tools", "meta-tools" and "additional tools"). The database can be queried using the organism name, a locus tag or a list of locus tags and may be browsed using numerous graphical and text displays. CONCLUSIONS: With its new functionalities, CoBaltDB is a novel, powerful platform that provides easy access to the results of multiple localization tools and support for predicting prokaryotic protein localizations with higher confidence than previously possible. CoBaltDB is available at http://www.umr6026.univ-rennes1.fr/english/home/research/basic/software/cobalten
Protein structure-based evaluation of missense variants: Resources, challenges and future directions.
We provide an overview of the methods that can be used for protein structure-based evaluation of missense variants. The algorithms can be broadly divided into those that calculate the difference in free energy (ΔΔG) between the wild type and variant structures and those that use structural features to predict the damaging effect of a variant without providing a ΔΔG. A wide range of machine learning approaches have been employed to develop those algorithms. We also discuss challenges and opportunities for variant interpretation in view of the recent breakthrough in three-dimensional structural modelling using deep learning
MULocDeep: A deep-learning framework for protein subcellular and suborganellar localization prediction with residue-level interpretation
Prediction of protein localization plays an important role in understanding protein function and mechanisms. In this paper, we propose a general deep learning-based localization prediction framework, MULocDeep, which can predict multiple localizations of a protein at both subcellular and suborganellar levels. We collected a dataset with 44 suborganellar localization annotations in 10 major subcellular compartments—the most comprehensive suborganelle localization dataset to date. We also experimentally generated an independent dataset of mitochondrial proteins in Arabidopsis thaliana cell cultures, Solanum tuberosum tubers, and Vicia faba roots and made this dataset publicly available. Evaluations using the above datasets show that overall, MULocDeep outperforms other major methods at both subcellular and suborganellar levels. Furthermore, MULocDeep assesses each amino acid's contribution to localization, which provides insights into the mechanism of protein sorting and localization motifs. A web server can be accessed at http://mu-loc.org
Evaluation of signal peptide prediction algorithms for identification of mycobacterial signal peptides using sequence data from proteomic methods
Secreted proteins play an important part in the pathogenicity of Mycobacterium tuberculosis, and are the primary source of vaccine and diagnostic candidates. A majority of these proteins are exported via the signal peptidase I-dependent pathway, and have a signal peptide that is cleaved off during the secretion process. Sequence similarities within signal peptides have spurred the development of several algorithms for predicting their presence as well as the respective cleavage sites. For proteins exported via this pathway, algorithms exist for eukaryotes, and for Gram-negative and Gram-positive bacteria. However, the unique structure of the mycobacterial membrane raises the question of whether the existing algorithms are suitable for predicting signal peptides within mycobacterial proteins. In this work, we have evaluated the performance of nine signal peptide prediction algorithms on a positive validation set, consisting of 57 proteins with a verified signal peptide and cleavage site, and a negative set, consisting of 61 proteins that have an N-terminal sequence that confirms the annotated translational start site. We found the hidden Markov model of SignalP v3.0 to be the best-performing algorithm for predicting the presence of a signal peptide in mycobacterial proteins. It predicted no false positives or false negatives, and predicted a correct cleavage site for 45 of the 57 proteins in the positive set. Based on these results, we used the hidden Markov model of SignalP v3.0 to analyse the 10 available annotated proteomes of mycobacterial species, including annotations of M. tuberculosis H37Rv from the Wellcome Trust Sanger Institute and the J. Craig Venter Institute (JCVI). When excluding proteins with transmembrane regions among the proteins predicted to harbour a signal peptide, we found between 7.8 and 10.5 % of the proteins in the proteomes to be putative secreted proteins. 
Interestingly, we observed a consistent difference in the percentage of predicted proteins between the Sanger Institute and JCVI. We have determined the most valuable algorithm for predicting signal peptidase I-processed proteins of M. tuberculosis, and used this algorithm to estimate the number of mycobacterial proteins with the potential to be exported via this pathway
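The benchmark figures quoted in the abstract above can be turned into the usual rates with a few lines of arithmetic. The following is a minimal sketch, not the authors' evaluation code: the variable names are illustrative assumptions, and only the counts (57 positives, 61 negatives, no false positives or false negatives, 45 correct cleavage sites) come from the abstract.

```python
# Counts reported in the abstract for the SignalP v3.0 hidden Markov model.
# Variable names are hypothetical; only the numbers come from the text.
positives = 57          # proteins with a verified signal peptide and cleavage site
negatives = 61          # proteins with a confirmed translational start site
false_negatives = 0     # positives predicted to lack a signal peptide
false_positives = 0     # negatives predicted to carry a signal peptide
correct_cleavage = 45   # positives with a correctly predicted cleavage site

# Sensitivity: fraction of true signal peptides that were detected.
sensitivity = (positives - false_negatives) / positives

# Specificity: fraction of negatives correctly predicted as lacking a signal peptide.
specificity = (negatives - false_positives) / negatives

# Cleavage-site accuracy, measured over the positive set only.
cleavage_accuracy = correct_cleavage / positives

print(f"sensitivity:       {sensitivity:.1%}")
print(f"specificity:       {specificity:.1%}")
print(f"cleavage accuracy: {cleavage_accuracy:.1%}")
```

Under these counts, detection is perfect on both sets, while the cleavage site is placed correctly for roughly four in five proteins of the positive set.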
Computer modelling of metabolic adaptions during mitochondrial dysfunction and machine learning to predict novel mitochondrial disease genes
Mitochondria are organelles found in almost every eukaryote and are primarily responsible for generating chemical energy in the form of adenosine triphosphate. This thesis investigates two main causes of mitochondrial dysfunction: mitochondrial toxicity arising from side-effects of drugs; and mitochondrial diseases arising from defects in nuclear-encoded genes.
Novel chemical entities being developed as drug leads are screened for cellular toxicity in which mitochondrial dysfunction is a major cause. However, our lack of understanding of the metabolic adaptations to mitochondrial dysfunction limits the accurate screening of mitochondrial dysfunction for pharmaceutical companies, thus preventing potentially useful drugs from being developed. To further our understanding of these adaptations, I analysed a large-scale metabolomics data set of rats administered a known mitochondrial complex III inhibitor. The analyses revealed many perturbed pathways which can be exploited as biomarkers of mild mitochondrial dysfunction, a condition which is currently clinically undetectable during the drug development process. To direct future studies on mitochondrial dysfunction, a multi-organ model of mitochondrial metabolism was generated and used to simulate inhibition of the mitochondrial respiratory complexes. The simulations of complex III inhibition accurately predicted many of the metabolite behaviours identified in the metabolomics analyses and provided theories for their significance. Simulations of the other complexes’ inhibitions identified many unique behaviours which can be used to direct future studies, studies which would greatly improve our understanding of the metabolic adaptations and provide higher confidence biomarkers.
Mitochondrial dysfunction is linked to many late-onset diseases such as Parkinson’s, and inborn errors of mitochondrial metabolism cause severe neurological and physiological diseases. Patients with suspected mitochondrial disease have their DNA sequenced and analysed. Diagnosis of mitochondrial disease by sequencing requires knowledge of the mitochondrial proteome, which is currently incomplete. A predicted mitochondrial proteome was generated using a support vector machine trained using the abundance of protein localisation data available in the MitoMiner database. The support vector machine identified 442 novel mitochondrial proteins. The success rate of diagnosing mitochondrial disease using sequencing is currently limited by our inability to filter and prioritise a patient’s DNA variants. Patients who do not have a variant in one of the already known mitochondrial disease genes are usually left with hundreds of potential disease-causing variants. A probability of being disease-causing for each gene in the mitochondrial proteome was generated using two trained neural networks. The networks were trained on a large number of different data sources for differentiating mitochondrial disease genes, including protein-protein interaction network metrics, gene tissue expression and protein evolution. The predicted probabilities allow for better filtering and prioritisation of a patient’s variants for candidate disease-causing genes to be experimentally verified. The predicted mitochondrial proteome and its predicted disease-causing probabilities are currently used in an NGS analysis pipeline at the MRC Mitochondrial Biology Unit for diagnosing mitochondrial disease patient samples.
Proteomic analysis of biomarkers associated with immunotherapy in murine tumour models
The emergence of proteomics and high-throughput technologies has allowed the identification of protein expression patterns of disease that potentially hold clinical importance in predictive medicine. The analysis of complex data generated by these technologies incorporates the use of computer algorithms for data mining and identification of important protein biomarkers. Such candidate biomarkers can potentially be used for diagnosis, prognosis and monitoring of a variety of diseases as well as the prediction of therapy response. Mass spectrometry has been used widely for the discovery and quantitation of disease-associated biomarkers using a variety of samples such as serum and tissue. In particular, matrix-assisted laser desorption/ionisation time of flight mass spectrometry (MALDI-TOF MS) has been used to generate proteomic profiles or “fingerprints” from serum to distinguish patients at different clinical stages of disease. Currently, early stage disease is difficult to diagnose in most cancers as current cancer markers have limited sensitivity and specificity. In advanced stage metastatic disease, treatment options are limited, although it is recognised that some patients may benefit from immunotherapy and in particular vaccine therapy. The use of animal models is critical to evaluate the efficacy of immunotherapies and to investigate tumour immunity in general and the mechanisms involved in tumour progression. These models provide an in vivo environment which cannot be reproduced in vitro, which results in more accurate and reliable information on the host response to immunotherapy and the mechanisms involved.
SARS-CoV-2 3D database: Understanding the Coronavirus Proteome and Evaluating Possible Drug Targets.
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a rapidly growing infectious disease, widely spread with high mortality rates. Since the release of the SARS-CoV-2 genome sequence in March 2020, there has been an international focus on developing target-based drug discovery, which also requires knowledge of the 3D structure of the proteome. Where there are no experimentally solved structures, our group has created 3D models with coverage of 97.5% and characterised them using state-of-the-art computational approaches. Models of protomers and oligomers, together with predictions of substrate and allosteric binding sites, protein-ligand docking, SARS-CoV-2 protein interactions with human proteins, impacts of mutations, and mapped solved experimental structures are freely available for download. These are implemented in SARS CoV-2 3D, a comprehensive and user-friendly database, available at https://sars3d.com/. This provides essential information for drug discovery, both to evaluate targets and design new potential therapeutics. This work is supported and funded by King Abdullah scholarship (Saudi Arabia research council), and American Leprosy Missions grants (G88726), SET is funded by the Cystic Fibrosis Trust (RG 70975) and Fondation Botnar (RG91317). A.R.J is funded by the Biotechnology and Biological Sciences Research Council (BBSRC) DTP studentship (BB/M011194/1). B.B. is funded by the Cystic Fibrosis Trust and L.C. on a studentship from Ipsen. T.L.B. is funded by a Wellcome Trust Investigator Award, PHZJ/489 RG83114 (2016-2021