46 research outputs found

    Integrative cell biology

    Doctoral Programme in Biotechnology, Engineering and Chemical Technology. Research line: Bioinformatics in Biotechnology and Biomedicine. Programme code: DBI. Line code: 7. Proteins are the key to understanding cell biology. Determining their role and function helps us uncover the characteristics of the molecular processes that underlie life. High-throughput techniques have allowed scientists to accumulate a large amount of DNA sequence data from thousands of different organisms. The function of the proteins encoded in these stretches of DNA is determined by manual or automatic annotation methods, using computational and biological experiments to obtain a coherent description. Although manual review of these predictions ultimately produces the most reliable annotations, this approach is not feasible at the current rate at which sequences are deposited in biological databases. This limits what is known about the biology of many organisms. Manual review efforts focus mainly on the characterization of model organisms. Consequently, the databases where this information is gathered hold large amounts of data for a specific subset of organisms. Currently, only large consortia can generate these web resources, while other groups investigating recently sequenced organisms lack the means and resources to achieve a more complete proteome annotation. In addition, the vast majority of protein annotation software focuses on only some aspects of a protein's function; complementary information that could be derived from other sources, both in silico and in vivo, is therefore missing.
The aim of this thesis is to develop a new approach to protein function annotation that addresses the problems described above, including new tools and resources that improve the current state of function prediction, and to apply it to non-model organisms. We call it "Integrative Cell Biology" (ICB). ICB is based on the integration of several data sources, including sequence and structure features. In this way we obtain a broader annotation that gives the user a more complete description of a protein. ICB can also visualize multiple proteins easily and quickly through a web browser. We tested the Integrative Cell Biology approach, with the resulting computational pipeline, by characterizing 39 proteomes of the bacterial superphylum Planctomycetes-Verrucomicrobia-Chlamydia (PVC). Beyond their relevance in several fields, their proteomes have a low percentage of annotated proteins, and only a few have been characterized experimentally. Their properties were determined by experimental observation, while the sequences that encode them are mostly unknown. By applying the ICB pipeline, we drastically increased the number of annotations for their proteomes, addressing biological questions about their behaviour. To make our findings available to the PVC research community, we created PVCbase, a single platform for examining ICB results through DataTables, running homology-based sequence searches and visualizing the secondary-structure features of proteins. To further demonstrate the capabilities of ICB, we analysed three recently sequenced Planctomycetes associated with the macroalgal environment.
The genomes of Rubripirellula obstinata LF1, Roseimaritima ulvae UC8 and Mariniblastus fucicola FC18 were assembled, annotated using ICB, and further characterized by comparison with Planctomyces from other environments. Their metabolic pathways were then complemented and their identity was assessed through phylogeny. The analyses showed that some proteins are involved in the interaction with the algal hosts, including some of extraordinary size that merit further analysis. A Docker container version of ICB was created that streamlines installation and use of the pipeline, allowing research groups with shared interests to create a platform similar to PVCbase. The DataTables output and the diversity of tools included allow a smooth transition from sequences to easily browsable protein annotations. These resources create shared environments for analysing large sets of proteins, with little or no coding knowledge required. The concept of Integrative Cell Biology and its derived resources contribute to the field of protein function prediction and provide a solution for poorly annotated or newly sequenced organisms. PVCbase has been used by several PVC microbiology research groups (16 universities from 14 countries as of August 2018), and its user base will benefit from the addition of proteomes and analyses. Integrating several sources of information to assess protein function is a possible solution to the inconsistency and unreliability of prediction tools. By using ICB, we can answer questions that could not be addressed by other means. In the future, new information sources implemented in ICB will broaden our knowledge of several unknown features of various organisms. Universidad Pablo de Olavide de Sevilla. Escuela de Doctorado. Postprint

    Contrastive learning on protein embeddings enlightens midnight zone

    Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce using single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, is better at recognizing distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT
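
The two ideas named in this abstract, a contrastive objective and embedding-based annotation transfer, can be illustrated with a minimal sketch. It is not the ProtTucker or EAT code: the triplet margin loss is assumed here as one common contrastive objective, the vectors are toy values rather than pLM embeddings, and the names `fold_A`, `fold_B`, `triplet_margin_loss` and `transfer_annotation` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Assumed contrastive objective: the loss is zero once the negative
    is at least `margin` farther from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def transfer_annotation(query, lookup):
    """Annotation transfer by nearest neighbour: return (label, distance)
    of the lookup embedding closest to `query`.
    `lookup` maps a label to a single embedding vector."""
    label = min(lookup, key=lambda name: euclidean(query, lookup[name]))
    return label, euclidean(query, lookup[label])

# Toy 3-dimensional vectors standing in for much larger pLM embeddings.
lookup = {
    "fold_A": [1.0, 0.0, 0.0],  # hypothetical label
    "fold_B": [0.0, 1.0, 0.0],  # hypothetical label
}
query = [0.9, 0.1, 0.0]

label, dist = transfer_annotation(query, lookup)
loss = triplet_margin_loss(query, lookup["fold_A"], lookup["fold_B"], margin=0.5)
print(label, round(dist, 3), loss)
```

Because the lookup is a distance computation on fixed-length vectors, no alignment has to be generated, which is consistent with the speed argument made in the abstract.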

    Characterizing and explaining the impact of disease-associated mutations in proteins without known structures or structural homologs

    Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques, such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologs. We modeled and extracted the domains from 553 disease-associated human proteins without known protein structures or close homologs in the Protein Data Bank. We noticed that the model quality was higher and the root mean square deviation (RMSD) between AlphaFold and RoseTTAFold models was lower for domains that could be assigned to CATH families than for those that could only be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein–protein interfaces and conserved residues in these predicted structures. We then explored whether the disease-associated missense mutations were in the proximity of these predicted functional sites, whether they destabilized the protein structure based on ddG calculations or whether they were predicted to be pathogenic. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization or pathogenicity. When compared to polymorphisms, a larger percentage of disease-associated missense mutations were buried, closer to predicted functional sites, predicted as destabilizing and pathogenic. Using models from the two state-of-the-art techniques provides better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.
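
The RMSD used above to compare AlphaFold and RoseTTAFold models can be written out as a short sketch. It assumes the two coordinate sets are already superposed and list the same atoms in the same order; a real comparison would first perform an optimal superposition, which is omitted here. The coordinates are invented for illustration.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally sized lists of
    (x, y, z) points. Assumes the structures are already superposed
    and that atoms appear in the same order in both lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and the same length")
    squared = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Invented coordinates: model_b is model_a shifted by 1 unit along z.
model_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]

value = rmsd(model_a, model_b)
print(value)
```

A low value between two independently predicted models indicates that the methods agree on the fold, which is how the abstract uses the measure.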

    Novel machine learning approaches revolutionize protein knowledge

    Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP), AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.

    KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units

    Protein kinases are important targets for treating human disorders, and they are the second most targeted family after G-protein coupled receptors. Several resources provide classification of kinases into evolutionary families (based on sequence homology); however, very few systematically classify functional families (FunFams) comprising evolutionary relatives that share similar functional properties. We have developed the FunFam-MARC (Multidomain ARchitecture-based Clustering) protocol, which uses multi-domain architectures of protein kinases and specificity-determining residues for functional family classification. FunFam-MARC predicts 2210 kinase functional families (KinFams), which have increased functional coherence, in terms of EC annotations, compared to the widely used KinBase classification. Our protocol provides a comprehensive classification for kinase sequences from >10,000 organisms. We associate human KinFams with diseases and drugs and identify 28 druggable human KinFams, i.e., enriched in clinically approved drugs. Since relatives in the same druggable KinFam tend to be structurally conserved, including the drug-binding site, these KinFams may be valuable for shortlisting therapeutic targets. Information on the human KinFams and associated 3D structures from AlphaFold2 is provided via our CATH FTP website and Zenodo. This gives the domain structure representative of each KinFam together with information on any drug compounds available. For 32% of the KinFams, we provide information on highly conserved residue sites that may be associated with specificity. Adeyelu T, Bordin N, Waman VP, Sadlej M, Sillitoe I, Moya-Garcia AA, Orengo CA. KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units. Biomolecules. 2023; 13(2):277. https://doi.org/10.3390/biom1302027

    CATHe: Detection of remote homologues for CATH superfamilies using embeddings from protein language models

    MOTIVATION: CATH is a protein domain classification resource that exploits an automated workflow of structure and sequence comparison alongside expert manual curation to construct a hierarchical classification of evolutionary and structural relationships. The aim of this study was to develop algorithms for detecting remote homologues missed by state-of-the-art HMM-based approaches. The method developed (CATHe) combines a neural network with sequence representations obtained from protein Language Models. It was assessed using a dataset of remote homologues having less than 20% sequence identity to any domain in the training set. RESULTS: The CATHe models trained on the 1773 largest and the 50 largest CATH superfamilies had an accuracy of 85.6 ± 0.4% and 98.2 ± 0.3%, respectively. As a further test of the power of CATHe to detect more remote homologues missed by HMMs derived from CATH domains, we used a dataset consisting of protein domains that had annotations in Pfam, but not in CATH. By using highly reliable CATHe predictions (expected error rate <0.5%), we were able to provide CATH annotations for 4.62 million Pfam domains. For a subset of these domains from Homo sapiens, we structurally validated 90.86% of the predictions by comparing their corresponding AlphaFold 2 structures with structures from the CATH superfamilies to which they were assigned. AVAILABILITY AND IMPLEMENTATION: The code for the developed models can be found on https://github.com/vam-sin/CATHe. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

    Broad functional profiling of fission yeast proteins using phenomics and machine learning

    Many proteins remain poorly characterized even in well-studied organisms, presenting a bottleneck for research. We applied phenomics and machine-learning approaches with Schizosaccharomyces pombe for broad cues on protein functions. We assayed colony-growth phenotypes to measure the fitness of deletion mutants for 3509 non-essential genes in 131 conditions with different nutrients, drugs, and stresses. These analyses exposed phenotypes for 3492 mutants, including 124 mutants of ‘priority unstudied’ proteins conserved in humans, providing varied functional clues. For example, over 900 proteins were newly implicated in the resistance to oxidative stress. Phenotype-correlation networks suggested roles for poorly characterized proteins through ‘guilt by association’ with known proteins. For complementary functional insights, we predicted Gene Ontology (GO) terms using machine learning methods exploiting protein-network and protein-homology data (NET-FF). We obtained 56,594 high-scoring GO predictions, of which 22,060 also featured high information content. Our phenotype-correlation data and NET-FF predictions showed a strong concordance with existing PomBase GO annotations and protein networks, with integrated analyses revealing 1675 novel GO predictions for 783 genes, including 47 predictions for 23 priority unstudied proteins. Experimental validation identified new proteins involved in cellular aging, showing that these predictions and phenomics data provide a rich resource to uncover new protein functions.
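
The 'guilt by association' step can be sketched as a correlation network over phenotype profiles. This is only an illustration under assumptions: Pearson correlation and the 0.9 threshold are choices made here, not details taken from the study, and the gene names and fitness values are invented.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length profiles.
    Assumes neither profile is constant (non-zero variance)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_edges(profiles, threshold=0.9):
    """Return (gene1, gene2, r) for every gene pair whose phenotype
    profiles correlate at or above `threshold` (an assumed cutoff)."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        r = pearson(profiles[g1], profiles[g2])
        if r >= threshold:
            edges.append((g1, g2, r))
    return edges

# Invented fitness scores of three deletion mutants across four conditions.
profiles = {
    "known_gene": [1.0, 0.2, 0.9, 0.1],
    "unstudied_gene": [0.9, 0.3, 1.0, 0.2],
    "unrelated_gene": [0.2, 1.0, 0.1, 0.9],
}

edges = correlation_edges(profiles)
for g1, g2, r in edges:
    print(g1, g2, round(r, 3))
```

In this toy input only the known and unstudied genes are linked, so the unstudied gene would inherit a functional hypothesis from its characterized neighbour.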

    Hansenula polymorpha Pex37 is a peroxisomal membrane protein required for organelle fission and segregation

    Here, we describe a novel peroxin, Pex37, in the yeast Hansenula polymorpha. H. polymorpha Pex37 is a peroxisomal membrane protein, which belongs to a protein family that includes, among others, the Neurospora crassa Woronin body protein Wsc, the human peroxisomal membrane protein PXMP2, the Saccharomyces cerevisiae mitochondrial inner membrane protein Sym1, and its mammalian homologue MPV17. We show that deletion of H. polymorpha PEX37 does not appear to have a significant effect on peroxisome biogenesis or proliferation in cells grown under peroxisome-inducing conditions (methanol). However, the absence of Pex37 results in a reduction in peroxisome numbers and a defect in peroxisome segregation in cells grown under peroxisome-repressing conditions (glucose). Conversely, overproduction of Pex37 in glucose-grown cells results in an increase in peroxisome numbers in conjunction with a decrease in their size. The increase in numbers in PEX37-overexpressing cells depends on the dynamin-related protein Dnm1. Together our data suggest that Pex37 is involved in peroxisome fission in glucose-grown cells. Introduction of human PXMP2 in H. polymorpha pex37 cells partially restored the peroxisomal phenotype, indicating that PXMP2 represents a functional homologue of Pex37. H. polymorpha pex37 cells did not show aberrant growth on any of the tested carbon and nitrogen sources that are metabolized by peroxisomal enzymes, suggesting that Pex37 may not fulfill an essential function in transport of these substrates or compounds required for their metabolism across the peroxisomal membrane. This work was supported by a grant from the Marie Curie Initial Training Networks (ITN) program PerFuMe (Grant Agreement Number 316723) to RS, NB, DPD, and IJvdK. Peer reviewed

    Pex24 and Pex32 are required to tether peroxisomes to the ER for organelle biogenesis, positioning and segregation in yeast

    © 2020. Published by The Company of Biologists Ltd. The yeast Hansenula polymorpha contains four members of the Pex23 family of peroxins, which characteristically contain a DysF domain. Here we show that all four H. polymorpha Pex23 family proteins localize to the endoplasmic reticulum (ER). Pex24 and Pex32, but not Pex23 and Pex29, predominantly accumulate at peroxisome–ER contacts. Upon deletion of PEX24 or PEX32 – and to a much lesser extent, of PEX23 or PEX29 – peroxisome–ER contacts are lost, concomitant with defects in peroxisomal matrix protein import, membrane growth, and organelle proliferation, positioning and segregation. These defects are suppressed by the introduction of an artificial peroxisome–ER tether, indicating that Pex24 and Pex32 contribute to tethering of peroxisomes to the ER. Accumulation of Pex32 at these contact sites is lost in cells lacking the peroxisomal membrane protein Pex11, in conjunction with disruption of the contacts. This indicates that Pex11 contributes to Pex32-dependent peroxisome–ER contact formation. The absence of Pex32 has no major effect on pre-peroxisomal vesicles that occur in pex3 atg1 deletion cells. This work was supported by a grant from the FP7 People: Marie-Curie Actions Initial Training Networks (ITN) program PerFuMe (Grant Agreement Number 316723) to N.B., D.P.D. and I.J.v.d.K., from the China Scholarship Council (CSC) to F.W., and from the Nederlandse Organisatie voor Wetenschappelijk Onderzoek/Chemical Sciences (NWO/CW) to A.A. (711.012.002)