97 research outputs found

    Integrative cell biology

    Doctoral Programme in Biotechnology, Engineering and Chemical Technology. Research line: Bioinformatics in Biotechnology and Biomedicine. Programme code: DBI. Line code: 7.
    Proteins are key to understanding cell biology. Determining their role and function helps us uncover the characteristics of the molecular processes underlying life. High-throughput techniques have allowed scientists to accumulate a large amount of DNA sequence data from thousands of different organisms. The function of the proteins encoded in these stretches of DNA is determined by manual or automatic annotation methods, using computational and biological experiments to obtain a coherent description. Although manual review of these predictions ultimately produces the most reliable annotations, this approach is not feasible at the current rate at which sequences are deposited in biological databases. This affects our knowledge of the biology of many organisms. Manual review efforts focus mainly on characterizing model organisms. Consequently, the databases where this information is gathered hold large amounts of data for a specific subset of organisms. Currently, only large consortia can generate these web resources, while other groups investigating recently sequenced organisms lack the means and resources to achieve a more complete proteome annotation. Moreover, the vast majority of protein annotation software focuses on only some aspects of a protein's function; complementary information that could be derived from other sources, both in silico and in vivo, is therefore missing.
    The aim of this thesis is to develop a new approach to protein function annotation that addresses the problems described above, including new tools and resources that improve the current state of function prediction, and to apply it to non-model organisms. We call it "Integrative Cell Biology" (ICB). ICB is based on integrating several data sources, including sequence and structure features. In this way we obtain a broader annotation that gives the user a more complete description of a protein. ICB can also visualize multiple proteins quickly and easily through a web browser. We tested the Integrative Cell Biology approach, with its resulting computational pipeline, by characterizing 39 proteomes of the bacterial superphylum Planctomycetes-Verrucomicrobia-Chlamydia (PVC). Besides their relevance in several fields, their proteomes have a low percentage of annotated proteins, and only a few have been characterized experimentally. Their properties were determined by experimental observation, while the sequences encoding them are mostly unknown. By applying the ICB pipeline, we drastically increased the number of annotations for their proteomes, addressing biological questions about their behavior. To make our findings available to the PVC research community, we created PVCbase, a single platform for examining ICB results through DataTables, running homology-based sequence searches and visualizing protein secondary-structure features. To further demonstrate the capabilities of ICB, we analyzed three recently sequenced Planctomycetes associated with the macroalgal environment.
    The genomes of Rubripirellula obstinata LF1, Roseimaritima ulvae UC8 and Mariniblastus fucicola FC18 were assembled, annotated using ICB, and further characterized by comparison with Planctomyces from other environments. Their metabolic pathways were then complemented and their identity was assessed through phylogeny. The analyses showed that some proteins are involved in the interaction with algal hosts, including some of extraordinary size that merit further analysis. A Docker container version of ICB was created that streamlines installation and use of the pipeline, allowing research groups with shared interests to create a platform similar to PVCbase. The DataTables output and the diversity of tools included allow a smooth transition from sequences to easily browsable protein annotations. These resources create shared environments for analyzing large sets of proteins, with little or no coding knowledge required. The concept of Integrative Cell Biology and its derived resources contribute to the field of protein function prediction and provide a solution for poorly annotated or newly sequenced organisms. PVCbase has been used by several PVC microbiology research groups (16 universities in 14 countries as of August 2018), and its user base will benefit from the addition of proteomes and analyses. Integrating several sources of information to assess protein function is one possible solution to the inconsistency and unreliability of prediction tools. Using ICB, we can answer questions that could not be addressed by other means. In the future, new information sources implemented in ICB will broaden our knowledge of various unknown features of various organisms.
    Universidad Pablo de Olavide de Sevilla. Escuela de Doctorado. Postprint

    Contrastive learning on protein embeddings enlightens midnight zone

    Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce the use of single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, is better able to recognize distant homologous relationships than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the 'midnight zone' of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT
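The two ideas in this abstract, a contrastive objective and annotation transfer by nearest neighbor in embedding space, can be illustrated with a minimal sketch. The abstract does not state which loss or distance is used, so the triplet margin loss and the Euclidean distance below are assumptions. The function names, the two-dimensional toy embeddings and the labels `class_A` and `class_B` are hypothetical and are not taken from the ProtTucker code.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Assumed contrastive objective: the loss is zero once the negative
    is at least `margin` farther from the anchor than the positive."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

def transfer_annotation(query, lookup):
    """Return (label, distance) of the lookup embedding closest to the query.

    `lookup` is a list of (embedding, label) pairs for annotated proteins.
    """
    embedding, label = min(lookup, key=lambda item: euclidean(query, item[0]))
    return label, euclidean(query, embedding)

# Toy lookup set with hypothetical labels; real pLM embeddings have
# hundreds to thousands of dimensions.
lookup = [([0.0, 0.0], "class_A"), ([1.0, 1.0], "class_B")]
label, distance = transfer_annotation([0.9, 1.1], lookup)
print(label, round(distance, 3))
```

Because the lookup needs only distance computations between fixed-length vectors, no alignment is generated, which is consistent with the speed advantage the abstract reports.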

    Clustering protein functional families at large scale with hierarchical approaches

    Proteins, fundamental to cellular activities, reveal their function and evolution through their structure and sequence. CATH functional families (FunFams) are coherent clusters of protein domain sequences in which the function is conserved across their members. The increasing volume and complexity of protein data enabled by large-scale repositories like MGnify or AlphaFold Database requires more powerful approaches that can scale to the size of these new resources. In this work, we introduce MARC and FRAN, two algorithms developed to build upon and address limitations of GeMMA/FunFHMMER, our original methods developed to classify proteins with related functions using a hierarchical approach. We also present CATH-eMMA, which uses embeddings or Foldseek distances to form relationship trees from distance matrices, reducing computational demands and handling various data types effectively. CATH-eMMA offers a highly robust and much faster tool for clustering protein functions on a large scale, providing a new tool for future studies in protein function and evolution.
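Building a relationship tree from a distance matrix can be sketched with plain agglomerative clustering. The abstract does not say which linkage or tree-building rule CATH-eMMA uses, so average linkage is an assumption here. The function name `average_linkage_tree` and the toy domain names are hypothetical.

```python
def average_linkage_tree(dist, names):
    """Build a nested-tuple tree by repeatedly merging the two clusters
    with the smallest average pairwise distance.

    `dist` is a symmetric matrix (list of lists) and `names` labels its rows.
    """
    clusters = {i: [i] for i in range(len(names))}   # cluster id -> member rows
    trees = {i: names[i] for i in range(len(names))}  # cluster id -> subtree
    next_id = len(names)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair into a new cluster and record the subtree.
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        trees[next_id] = (trees.pop(a), trees.pop(b))
        next_id += 1
    return trees[next_id - 1]

# Toy distances, e.g. between embeddings of three domains.
names = ["d1", "d2", "d3"]
dist = [[0, 1, 4],
        [1, 0, 5],
        [4, 5, 0]]
tree = average_linkage_tree(dist, names)
print(tree)
```

This naive version recomputes every cluster-to-cluster distance at each step, so it only suits small inputs; a method working at the scale described in the abstract would need a more efficient scheme.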

    Characterizing and explaining the impact of disease-associated mutations in proteins without known structures or structural homologs

    Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques, such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologs. We modeled and extracted the domains from 553 disease-associated human proteins without known protein structures or close homologs in the Protein Databank. We noticed that the model quality was higher and the root-mean-square deviation (RMSD) between AlphaFold and RoseTTAFold models was lower for domains that could be assigned to CATH families than for those that could only be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein–protein interfaces and conserved residues in these predicted structures. We then explored whether the disease-associated missense mutations were in the proximity of these predicted functional sites, whether they destabilized the protein structure based on ddG calculations or whether they were predicted to be pathogenic. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization or pathogenicity. When compared to polymorphisms, a larger percentage of disease-associated missense mutations were buried, closer to predicted functional sites, predicted as destabilizing and pathogenic. Using models from the two state-of-the-art techniques provides better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.
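The RMSD used to compare two models of the same domain is the square root of the mean squared distance between matched atoms. The sketch below assumes the two coordinate sets are already superposed and list the same atoms in the same order; the superposition step itself (for example with the Kabsch algorithm) is not shown, and the abstract does not say which atoms were compared. The function name and toy coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two matched, pre-superposed
    lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (xa - xb) ** 2
        for atom_a, atom_b in zip(coords_a, coords_b)
        for xa, xb in zip(atom_a, atom_b)
    )
    return math.sqrt(total / len(coords_a))

# Two toy "models" of a two-atom fragment; the second atom differs by 2 units in z.
model_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model_2 = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
value = rmsd(model_1, model_2)
print(value)
```

A low value between two independently predicted models indicates that the methods agree on the fold, which is how the abstract uses it as a sign of confidence.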

    Comparative genomic analyses of aerobic planctomycetes isolated from the deep sea and the ocean surface

    On the deep and dark seafloor, a cryptic and yet untapped microbial diversity flourishes around hydrothermal vent systems. This remote and hard-to-access environment exhibits extreme conditions, including high pressure, steep temperature and redox gradients, limited availability of oxygen and complete darkness. In this study, we analysed the genomes of three aerobic strains belonging to the phylum Planctomycetota that were isolated from two deep-sea iron-rich hydroxide deposits with low-temperature diffusive vents. The vents are located in the Arctic and Pacific Ocean at a depth of 600 and 1,734 m below sea level, respectively. The isolated strains Pr1dT, K2D and TBK1r were analysed with a focus on genome-encoded features that allow phenotypical adaptations to the low-temperature, iron-rich deep-sea environment. The comparison with genomes of closely related surface-inhabiting counterparts indicates that the deep-sea isolates do not differ significantly from members of the phylum Planctomycetota inhabiting other habitats, such as macroalgae biofilms and the ocean surface waters. Despite inhabiting extreme environments, our "deep and dark" strains revealed a mostly non-extreme genome biology.

    Novel machine learning approaches revolutionize protein knowledge

    Breakthrough methods in machine learning (ML), protein structure prediction, and novel ultrafast structural aligners are revolutionizing structural biology. Obtaining accurate models of proteins and annotating their functions on a large scale is no longer limited by time and resources. The most recent method to be top ranked by the Critical Assessment of Structure Prediction (CASP) assessment, AlphaFold 2 (AF2), is capable of building structural models with an accuracy comparable to that of experimental structures. Annotations of 3D models are keeping pace with the deposition of the structures due to advancements in protein language models (pLMs) and structural aligners that help validate these transferred annotations. In this review we describe how recent developments in ML for protein science are making large-scale structural bioinformatics available to the general scientific community.

    KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units

    Protein kinases are important targets for treating human disorders, and they are the second most targeted families after G-protein coupled receptors. Several resources provide classification of kinases into evolutionary families (based on sequence homology); however, very few systematically classify functional families (FunFams) comprising evolutionary relatives that share similar functional properties. We have developed the FunFam-MARC (Multidomain ARchitecture-based Clustering) protocol, which uses multi-domain architectures of protein kinases and specificity-determining residues for functional family classification. FunFam-MARC predicts 2210 kinase functional families (KinFams), which have increased functional coherence, in terms of EC annotations, compared to the widely used KinBase classification. Our protocol provides a comprehensive classification for kinase sequences from >10,000 organisms. We associate human KinFams with diseases and drugs and identify 28 druggable human KinFams, i.e., enriched in clinically approved drugs. Since relatives in the same druggable KinFam tend to be structurally conserved, including the drug-binding site, these KinFams may be valuable for shortlisting therapeutic targets. Information on the human KinFams and associated 3D structures from AlphaFold2 is provided via our CATH FTP website and Zenodo. This gives the domain structure representative of each KinFam together with information on any drug compounds available. For 32% of the KinFams, we provide information on highly conserved residue sites that may be associated with specificity.
    Adeyelu T, Bordin N, Waman VP, Sadlej M, Sillitoe I, Moya-Garcia AA, Orengo CA. KinFams: De-Novo Classification of Protein Kinases Using CATH Functional Units. Biomolecules. 2023; 13(2):277. https://doi.org/10.3390/biom1302027

    Broad functional profiling of fission yeast proteins using phenomics and machine learning

    Many proteins remain poorly characterized even in well-studied organisms, presenting a bottleneck for research. We applied phenomics and machine-learning approaches with Schizosaccharomyces pombe to obtain broad cues on protein functions. We assayed colony-growth phenotypes to measure the fitness of deletion mutants for 3509 non-essential genes in 131 conditions with different nutrients, drugs, and stresses. These analyses exposed phenotypes for 3492 mutants, including 124 mutants of ‘priority unstudied’ proteins conserved in humans, providing varied functional clues. For example, over 900 proteins were newly implicated in the resistance to oxidative stress. Phenotype-correlation networks suggested roles for poorly characterized proteins through ‘guilt by association’ with known proteins. For complementary functional insights, we predicted Gene Ontology (GO) terms using machine learning methods exploiting protein-network and protein-homology data (NET-FF). We obtained 56,594 high-scoring GO predictions, of which 22,060 also featured high information content. Our phenotype-correlation data and NET-FF predictions showed a strong concordance with existing PomBase GO annotations and protein networks, with integrated analyses revealing 1675 novel GO predictions for 783 genes, including 47 predictions for 23 priority unstudied proteins. Experimental validation identified new proteins involved in cellular aging, showing that these predictions and phenomics data provide a rich resource to uncover new protein functions.