55 research outputs found

    Homology-based inference sets the bar high for protein function prediction

    Background: Any method that predicts protein function de novo should do better than random. More challenging, it also ought to outperform simple homology-based inference. Methods: Here, we describe a few methods that predict protein function exclusively through homology. Together, they set the bar or lower limit for future improvements. Results and conclusions: During the development of these methods, we faced two surprises. Firstly, our most successful implementation for the baseline ranked very high at CAFA1. In fact, our best combination of homology-based methods fared only slightly worse than the top-of-the-line prediction method from the Jones group. Secondly, although the concept of homology-based inference is simple, this work revealed that the precise details of the implementation are crucial: not only did the methods span from top to bottom performers at CAFA, but also the reasons for these differences were unexpected. In this work, we also propose a new rigorous measure to compare predicted and experimental annotations. It puts more emphasis on the details of protein function than the other measures employed by CAFA and may best reflect the expectations of users. Clearly, the definition of proper goals remains one major objective for CAFA.

    LocTree3 prediction of localization

    The prediction of protein sub-cellular localization is an important step toward elucidating protein function. For each query protein sequence, LocTree2 applies machine learning (profile kernel SVM) to predict the native sub-cellular localization in 18 classes for eukaryotes, in six for bacteria and in three for archaea. The method outputs a score that reflects the reliability of each prediction. LocTree2 has performed on par with or better than any other state-of-the-art method. Here, we report the availability of LocTree3 as a public web server. The server includes the machine learning-based LocTree2 and improves over it through the addition of homology-based inference. Assessed on sequence-unique data, LocTree3 reached an 18-state accuracy Q18 = 80 ± 3% for eukaryotes and a six-state accuracy Q6 = 89 ± 4% for bacteria. The server accepts submissions ranging from single protein sequences to entire proteomes. Response time of the unloaded server is about 90 s for a 300-residue eukaryotic protein and a few hours for an entire eukaryotic proteome not considering the generation of the alignments. For over 1000 entirely sequenced organisms, the predictions are directly available as downloads. The web server is available at http://www.rostlab.org/services/loctree3

    Rare penetrant mutations confer severe risk of common diseases

    [INTRODUCTION] Genome-wide association studies (GWASs) have identified thousands of common genetic variants that are predictive of common disease susceptibility, but these variants individually have mild effects on disease owing to the effects of natural selection. By contrast, rare genetic variants can have large effects on common disease risk, but their use in genetic risk prediction has been limited to date owing to the difficulty of distinguishing pathogenic from benign variants and estimating the magnitude of their effects. [RATIONALE] PrimateAI-3D is a three-dimensional convolutional neural network for missense variant–effect prediction, which was trained with common genetic variants from the population sequencing of 233 primate species. By applying this method to estimate the pathogenicity of rare coding variants in 454,712 UK Biobank individuals, we aimed to improve rare-variant association tests and genetic risk prediction for common diseases and complex traits. [RESULTS] We performed rare-variant burden tests for 90 well-powered, clinically relevant phenotypes in the UK Biobank exome dataset. Stratifying missense variants with PrimateAI-3D greatly improved gene discovery, revealing 73% more significant gene-phenotype associations (false discovery rate <0.05) compared with not using PrimateAI-3D. When benchmarked against prior studies, gene-phenotype pairs identified with our method were better supported by orthogonal genetic evidence from GWAS and genes from related Mendelian disorders. In addition, PrimateAI-3D scores showed the strongest correlation among existing variant interpretation algorithms for predicting the quantitative effects of rare variants on continuous clinical phenotypes.
Having validated our method for finding gene-phenotype relationships, we next constructed a rare-variant polygenic risk score (PRS) model by combining the rare-variant genes for each phenotype, weighting variants by their PrimateAI-3D prediction score and the direction and effect size of each associated gene. For comparison, we constructed common-variant PRS models and evaluated the performance of the two models for genetic risk prediction in a withheld-test subset of the cohort. Although common variants better explained overall population variance, rare-variant PRSs had more power at the ends of the distribution to identify individuals at the greatest risk for disease, and thus may be more relevant for population genetic screening and risk management. By contrast to common-variant PRS models derived from European populations that show poor generalization to non-Europeans, rare-variant PRSs were substantially more portable to different cohorts and ancestry groups that were not seen during model training. Moreover, because they incorporate orthogonal information from nonoverlapping sets of variants, we combined rare- and common-variant PRS models into a unified model and observed further improvement in genetic risk prediction for common diseases. To understand the extent by which rare-variant PRSs can be expected to improve with increases in discovery cohort size, we repeated our analyses in down-sampled subsets of the UK Biobank cohort. We found that the number of genes contributing to the rare-variant PRS increased linearly, with no signs of plateauing at a half-million exomes. Newly discovered rare-variant genes were strongly enriched at GWAS loci, forming allelic series with effect sizes that were ~10-fold larger on average than the respective common GWAS variant. 
Among well-powered GWAS loci that could be unambiguously assigned to a single gene, the majority showed subthreshold signal on the rare-variant burden test, indicating that rare penetrant variants exist at a large fraction of GWAS loci and can be incorporated into the rare-variant PRS with further advances in cohort size and variant effect prediction. [CONCLUSION] Understanding the impact of rare variants in common diseases is of prime interest for both precision medicine and the discovery of drug targets. By leveraging advances in variant effect prediction, we have demonstrated major improvements in rare-variant burden testing and genetic risk prediction. Notably, we observed that nearly all individuals carried at least one rare penetrant variant for the phenotypes we examined, demonstrating the utility of personal genome sequencing for otherwise healthy individuals in the general population. T.M.B. is supported by funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement no. 864203), PID2021-126004NB-100 (MICIIN/FEDER, UE) and Secretaria d’Universitats i Recerca, and CERCA Programme del Departament d’Economia i Coneixement de la Generalitat de Catalunya (GRC 2021 SGR 00177).

    An expanded evaluation of protein function prediction methods shows an improvement in accuracy

    Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent. Keywords: Protein function prediction, Disease gene prioritization

    The landscape of tolerated genetic variation in humans and primates.

    Personalized genome sequencing has revealed millions of genetic differences between individuals, but our understanding of their clinical relevance remains largely incomplete. To systematically decipher the effects of human genetic variants, we obtained whole-genome sequencing data for 809 individuals from 233 primate species and identified 4.3 million common protein-altering variants with orthologs in humans. We show that these variants can be inferred to have nondeleterious effects in humans based on their presence at high allele frequencies in other primate populations. We use this resource to classify 6% of all possible human protein-altering variants as likely benign and impute the pathogenicity of the remaining 94% of variants with deep learning, achieving state-of-the-art accuracy for diagnosing pathogenic variants in patients with genetic diseases.

    Alternative Protein-Protein Interfaces Are Frequent Exceptions

    The intricate molecular details of protein-protein interactions (PPIs) are crucial for function. Therefore, when the same interacting protein pair is measured again, we expect the same result. This work measured the similarity in the molecular details of interaction for the same and for homologous protein pairs between different experiments. All scores analyzed suggested that different experiments often find exceptions in the interfaces of similar PPIs: up to 22% of all comparisons revealed some differences even for sequence-identical pairs of proteins. The corresponding number for pairs of close homologs reached 68%. Conversely, the interfaces differed entirely for 12–29% of all comparisons. All these estimates were calculated after redundancy reduction. The magnitude of interface differences ranged from subtle to the extreme, as illustrated by a few examples. An extreme case was a change of the interacting domains between two observations of the same biological interaction. One reason for different interfaces was the number of copies of an interaction in the same complex: the probability of observing alternative binding modes increases with the number of copies. Even after removing the special cases with alternative hetero-interfaces to the same homomer, a substantial variability remained. Our results strongly support the surprising notion that there are many alternative solutions to make the intricate molecular details of PPIs crucial for function.

    Accelerating the Original Profile Kernel.

    One of the most accurate multi-class protein classification systems continues to be the profile-based SVM kernel introduced by the Leslie group. Unfortunately, its CPU requirements render it too slow for practical applications of large-scale classification tasks. Here, we introduce several software improvements that enable significant acceleration. Using various non-redundant data sets, we demonstrate that our new implementation reaches a maximal speed-up as high as 14-fold for calculating the same kernel matrix. Some predictions are over 200 times faster and render the kernel as possibly the top contender in a low ratio of speed/performance. Additionally, we explain how to parallelize various computations and provide an integrative program that reduces creating a production-quality classifier to a single program call. The new implementation is available as a Debian package under a free academic license and does not depend on commercial software. For non-Debian based distributions, the source package ships with a traditional Makefile-based installer. Download and installation instructions can be found at https://rostlab.org/owiki/index.php/Fast_Profile_Kernel. Bugs and other issues may be reported at https://rostlab.org/bugzilla3/enter_bug.cgi?product=fastprofkernel