    Mass & secondary structure propensity of amino acids explain their mutability and evolutionary replacements

    Why is an amino acid replacement in a protein accepted during evolution? The answer given by bioinformatics relies on the frequency with which each amino acid is replaced by another and on the propensity of each to remain unchanged. We propose that these replacement rules are recoverable from the secondary structural trends of amino acids. A distance measure between high-resolution Ramachandran distributions reveals that structurally similar residues coincide with those found in substitution matrices such as BLOSUM: Asn ↔ Asp, Phe ↔ Tyr, Lys ↔ Arg, Gln ↔ Glu, Ile ↔ Val, Met → Leu; with Ala, Cys, His, Gly, Ser, Pro, and Thr as structurally idiosyncratic residues. We also found a high average correlation ($\overline{R} = 0.85$) between thirty amino acid mutability scales and the mutational inertia ($I_X$), which measures the energetic cost weighted by the number of observations at the most probable amino acid conformation. These results indicate that amino acid substitutions follow two optimally efficient principles: (a) amino acid interchangeability privileges secondary structural similarity, and (b) amino acid mutability depends directly on biosynthetic energy cost and inversely on frequency. These two principles are the underlying rules governing the observed amino acid substitutions. © 2017 The Author(s)
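
    The abstract above does not say which distance measure was applied to the Ramachandran distributions. As a minimal sketch of the general idea, the following bins (phi, psi) angles into a normalized 2D histogram and compares two residues with the Jensen-Shannon distance. The choice of metric, the 30-degree bin width, and the function names are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter


def ramachandran_histogram(angles, bin_width=30):
    """Bin (phi, psi) pairs given in degrees into a normalized 2D histogram.

    Returns a dict mapping (phi_bin, psi_bin) to the fraction of observations.
    The bin width is an illustrative assumption.
    """
    n_bins = 360 // bin_width
    counts = Counter()
    for phi, psi in angles:
        i = int((phi + 180) // bin_width) % n_bins  # wrap +180 onto -180
        j = int((psi + 180) // bin_width) % n_bins
        counts[(i, j)] += 1
    total = sum(counts.values())
    return {cell: c / total for cell, c in counts.items()}


def jensen_shannon_distance(p, q):
    """Square root of the base-2 Jensen-Shannon divergence, bounded in [0, 1]."""
    divergence = 0.0
    for cell in set(p) | set(q):
        p_i, q_i = p.get(cell, 0.0), q.get(cell, 0.0)
        m_i = 0.5 * (p_i + q_i)
        if p_i > 0:
            divergence += 0.5 * p_i * math.log2(p_i / m_i)
        if q_i > 0:
            divergence += 0.5 * q_i * math.log2(q_i / m_i)
    return math.sqrt(max(divergence, 0.0))


# Toy angle sets (hypothetical values): helix-like versus sheet-like residues.
helix_like = ramachandran_histogram([(-60, -45), (-65, -40), (-57, -47)])
sheet_like = ramachandran_histogram([(-120, 130), (-135, 135), (-110, 125)])
distance = jensen_shannon_distance(helix_like, sheet_like)
```

    Under this metric, two residues whose angle distributions occupy the same bins with the same frequencies are at distance 0, and residues with non-overlapping distributions are at distance 1. Pairs such as Ile and Val would be expected to score close to 0 if their conformational preferences are as similar as the abstract reports.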

    Protein sequence analysis using the MPI Bioinformatics Toolkit

    The MPI Bioinformatics Toolkit (https://toolkit.tuebingen.mpg.de) provides interactive access to a wide range of the best‐performing bioinformatics tools and databases, including the state‐of‐the‐art protein sequence comparison methods HHblits and HHpred. The Toolkit currently includes 35 external and in‐house tools, covering functionalities such as sequence similarity searching, prediction of sequence features, and sequence classification. Due to this breadth of functionality, the tight interconnection of its constituent tools, and its ease of use, the Toolkit has become an important resource for biomedical research and for teaching protein sequence analysis to students in the life sciences. In this article, we provide detailed information on utilizing the three most widely accessed tools within the Toolkit: HHpred for the detection of homologs, HHpred in conjunction with MODELLER for structure prediction and homology modeling, and CLANS for the visualization of relationships in large sequence datasets. Basic Protocol 1: Sequence similarity searching using HHpred. Alternate Protocol: Pairwise sequence comparison using HHpred. Support Protocol: Building a custom multiple sequence alignment using PSI‐BLAST and forwarding it as input to HHpred. Basic Protocol 2: Calculation of homology models using HHpred and MODELLER. Basic Protocol 3: Cluster analysis using CLANS.

    Recovery and characterization of viral diversity from aquatic short- and long-read metagenomes

    Viruses are the most abundant biological entities in marine ecosystems and play an essential role in global biogeochemical cycles. They have important ecological functions as drivers of bacterial populations through lytic infections and contribute to bacterial genetic diversification. Unfortunately, their study is severely limited by the difficulty of culturing and isolating them under laboratory conditions. Culture-independent techniques such as metagenomics can complement culture-based approaches to capture more phage diversity. However, the vast majority of viral sequences recovered through these methods are uncharacterized and therefore provide no information about their interactions with the bacterial community, a phenomenon that has been named “viral dark matter”. In this thesis, several bioinformatic techniques are applied to both short- and long-read metagenomic datasets to recover biological information from the marine viral sequences they contain. A pipeline for recovering viral sequences based on a reference genome was developed and applied to the study of myophages infecting the alphaproteobacterial SAR11 clade, one of the most abundant bacterioplankton groups in surface marine and freshwater ecosystems. We were able to recover 22 new genomes, which include the first genomes of myophages infecting LD12, the SAR11 freshwater clade. These sequences are underrepresented in datasets derived from the viral fraction, suggesting a bias of either technical or biological nature. Surprisingly, this family of phages codes for an operon which resembles the secretion system type VIII operon in Escherichia coli. The function of this phage operon is still unknown. Next, a long-read dataset from the Mediterranean Sea was explored for viral contigs to contrast phage recovery between long- and short-read datasets. 
The analysis revealed that while long-read assemblies resulted in viral sequences of better quality, there was a sizable amount of intra-clade viral diversity that was not included in the assemblies. This viral diversity, found only in long reads, is even greater than previously thought. This untapped diversity could aid biotechnological efforts, as evidenced by the discovery of new endolysins. Finally, a tool (Random Forest Assignment of Hosts, or RaFAH) for assigning hosts to phage sequences obtained from metagenomic datasets was created. The tool is based on a machine learning model trained with phage protein clusters generated de novo. Benchmarking shows that RaFAH is on par with other state-of-the-art classifiers and is able to classify phage contigs at the level of Kingdom, which makes it the first classifier to accurately detect archaeal viruses from metagenomic samples. A feature importance analysis reveals that the protein clusters with the most predictive power are those involved in host recognition.

    Unifying the known and unknown microbial coding sequence space

    5 figures, 13 appendixes. Data availability: We used public data as described in the Methods section and Appendix 1-table 5. The code used for the analyses in the manuscript is available at https://github.com/functional-dark-side/functional-dark-side.github.io/tree/master/scripts. A list with the program versions can be found in https://github.com/functional-dark-side/functional-dark-side.github.io/blob/master/programs_and_versions.txt. The code to create the figures is available at https://github.com/functional-dark-side/vanni_et_al-figures, and the data for the figure can be downloaded from https://doi.org/10.6084/m9.figshare.12738476.v2. A reproducible version of the workflow is available at https://github.com/functional-dark-side/agnostos-wf. The data is publicly available at https://doi.org/10.6084/m9.figshare.12459056
Genes of unknown function are among the biggest challenges in molecular biology, especially in microbial systems, where 40%-60% of the predicted genes are unknown. Despite previous attempts, systematic approaches to include the unknown fraction into analytical workflows are still lacking. Here, we present a conceptual framework, its translation into the computational workflow AGNOSTOS and a demonstration on how we can bridge the known-unknown gap in genomes and metagenomes. By analyzing 415,971,742 genes predicted from 1,749 metagenomes and 28,941 bacterial and archaeal genomes, we quantify the extent of the unknown fraction, its diversity, and its relevance across multiple organisms and environments. The unknown sequence space is exceptionally diverse, phylogenetically more conserved than the known fraction and predominantly taxonomically restricted at the species level. From the 71M genes identified to be of unknown function, we compiled a collection of 283,874 lineage-specific genes of unknown function for Cand. Patescibacteria (also known as Candidate Phyla Radiation, CPR), which provides a significant resource to expand our understanding of their unusual biology. Finally, by identifying a target gene of unknown function for antibiotic resistance, we demonstrate how we can enable the generation of hypotheses that can be used to augment experimental data.
The authors thankfully acknowledge the computer resources at MareNostrum and the technical support provided by Barcelona Supercomputing Center (RES-AECT-2014-2-0085), the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A537B, 031A533A, 031A538A, 031A533B, 031A535A, 031A537C, 031A534A, 031A532B), the University of Oxford Advanced Research Computing (http://dx.doi.org/10.5281/zenodo.22558) and the MARBITS bioinformatics core at ICM-CSIC. CV was supported by the Max Planck Society. AFG received funding from the European Union’s Horizon 2020 research and innovation program Blue Growth: Unlocking the potential of Seas and Oceans under grant agreement no. 634486 (project acronym INMARE). AM was supported by the Biotechnology and Biological Sciences Research Council [BB/M011755/1, BB/R015228/1] and RDF by the European Molecular Biology Laboratory core funds. EOC was supported by project INTERACTOMA RTI2018-101205-B-I00 from the Spanish Agency of Science MICIU/AEI. SGA and PS received additional funding by the project MAGGY (CTM2017-87736-R) from the Spanish Ministry of Economy and Competitiveness. The Malaspina 2010 Expedition was supported by the Spanish Ministry of Economy and Competitiveness (MINECO) through the Consolider-Ingenio program (ref. CSD2008-00077). The authors thank Johannes Söding and Alex Bateman for helpful discussions. Peer reviewed. With the institutional support of the ‘Severo Ochoa Centre of Excellence’ accreditation (CEX2019-000928-S).

    Citrullination Was Introduced into Animals by Horizontal Gene Transfer from Cyanobacteria

    Protein posttranslational modifications add great sophistication to biological systems. Citrullination, a key regulatory mechanism in human physiology and pathophysiology, is enigmatic from an evolutionary perspective. Although the citrullinating enzymes peptidylarginine deiminases (PADIs) are ubiquitous across vertebrates, they are absent from yeast, worms, and flies. Based on this distribution, PADIs were proposed to have been horizontally transferred, but this has been contested. Here, we map the evolutionary trajectory of PADIs into the animal lineage. We present strong phylogenetic support for a clade encompassing animal and cyanobacterial PADIs that excludes fungal and other bacterial homologs. The animal and cyanobacterial PADI proteins share functionally relevant primary and tertiary synapomorphic sequences that are distinct from a second PADI type present in fungi and actinobacteria. Molecular clock calculations and sequence divergence analyses using the fossil record estimate the last common ancestor of the cyanobacterial and animal PADIs to be less than 1 billion years old. Additionally, under an assumption of vertical descent, PADI sequence change during this evolutionary time frame is anachronistically low, even when compared with products of likely endosymbiont gene transfer, mitochondrial proteins, and some of the most highly conserved sequences in life. The consilience of evidence indicates that PADIs were introduced from cyanobacteria into animals by horizontal gene transfer (HGT). The ancestral cyanobacterial PADI is enzymatically active and can citrullinate eukaryotic proteins, suggesting that the PADI HGT event introduced a new catalytic capability into the regulatory repertoire of animals. This study reveals the unusual evolution of a pleiotropic protein modification.

    Deep learning and embeddings for problems of computational biology

    The development of Next Generation Sequencing has moved biology into the Big Data era. The ever-increasing gap between proteins with known sequences and those with a complete functional annotation requires computational methods for automatic structural and functional annotation. My research has focused on proteins and has so far led to the development of three novel tools, DeepREx, E-SNPs&GO and ISPRED-SEQ, based on Machine and Deep Learning approaches. DeepREx computes the solvent exposure of residues in a protein chain. This problem is relevant for defining structural constraints on the possible folding of the protein. DeepREx exploits Long Short-Term Memory layers to capture residue-level interactions between positions distant in the sequence, achieving state-of-the-art performance. With DeepREx, I conducted a large-scale analysis investigating the relationship between the solvent exposure of a residue and its probability of being pathogenic upon mutation. E-SNPs&GO predicts the pathogenicity of a Single Residue Variation. Variations occurring in a protein sequence can have different effects, possibly leading to the onset of disease. E-SNPs&GO exploits protein embeddings generated by two novel Protein Language Models (PLMs), as well as a new way of representing functional information coming from the Gene Ontology. The method achieves state-of-the-art performance and is extremely time-efficient when compared to traditional approaches. ISPRED-SEQ predicts the presence of Protein-Protein Interaction sites in a protein sequence. Knowing how a protein interacts with other molecules is crucial for accurate functional characterization. ISPRED-SEQ exploits a convolutional layer to parse local context after embedding the protein sequence with two novel PLMs, greatly surpassing the current state of the art. All methods are published in international journals and are available as user-friendly web servers. 
They have been developed with standard guidelines for FAIRness in mind (FAIR: Findable, Accessible, Interoperable, Reusable) and are integrated into the public collection of tools provided by ELIXIR, the European infrastructure for Bioinformatics.

    Systematic in silico discovery of novel solute carrier-like proteins from proteomes

    Solute carrier (SLC) proteins represent the largest superfamily of transmembrane transporters. While many of them play key biological roles, their systematic analysis has been hampered by their functional and structural heterogeneity. Based on available nomenclature systems, we hypothesized that many as yet unidentified SLC transporters exist in the human genome, which await further systematic analysis. Here, we present criteria for defining "SLC-likeness" to curate a set of "SLC-like" protein families from the Transporter Classification Database (TCDB) and Protein families (Pfam) databases. Computational sequence similarity searches surprisingly identified ~120 more proteins in humans with potential SLC-like properties compared to previous annotations. Interestingly, several of these have documented transport activity in the scientific literature. To complete the overview of the "SLC-ome", we present an algorithm to classify SLC-like proteins into protein families, investigate their known functions and evolutionary relationships to similar proteins from 6 other clinically relevant experimental organisms, and pinpoint structural orphans. We envision that our work will serve as a stepping stone for future studies of the biological function and the identification of the natural substrates of the many under-explored SLC transporters, as well as for the development of new therapeutic applications, including strategies for personalized medicine and drug delivery.

    In Silico Characterization of Protein-Protein Interactions Mediated by Short Linear Motifs

    Short linear motifs (SLiMs), often found in intrinsically disordered regions (IDRs), can initiate protein-protein interactions in eukaryotes. Although pathogens tend to have less disorder than eukaryotes, their proteins alter host cellular function through molecular mimicry of SLiMs. The first objective was to study sequence-based structural properties of viral SLiMs in the ELM database and the conservation of selected viral motifs involved in the virus life cycle. The second objective was to compare the structural features of SLiMs in pathogens and eukaryotes in the ELM database. Our analysis showed that many viral SLiMs, particularly glycosylation motifs, are not found in IDRs. Moreover, analysis of disorder and secondary structure properties of the same motif in pathogens and eukaryotes shed light on similarities and differences in motif properties between pathogens and their eukaryotic equivalents. Our results indicate that the interaction mechanism may differ between pathogens and their eukaryotic hosts for the same motif.
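
    SLiMs are commonly written as regular expressions, so checking whether a motif instance falls inside a disordered region reduces to a pattern scan plus an interval test. The sketch below illustrates that step under stated assumptions: it uses the classic N-glycosylation sequon N-x-S/T (x not proline) as the pattern, which may differ from the exact expression stored in ELM, and the sequence and disordered-region coordinates are hypothetical. In practice the regions would come from a disorder predictor, which the abstract does not name.

```python
import re

# Classic N-glycosylation sequon: Asn, any residue except Pro, then Ser or Thr.
# The lookahead lets overlapping instances be reported. This pattern is an
# illustrative assumption and is not copied from the ELM database.
SEQUON = re.compile(r"(?=(N[^P][ST]))")


def find_motifs(sequence, pattern=SEQUON):
    """Return 0-based, half-open (start, end) spans of every motif instance."""
    return [(m.start(1), m.end(1)) for m in pattern.finditer(sequence)]


def in_disorder(span, disordered_regions):
    """True if the whole motif span lies inside a single disordered region."""
    start, end = span
    return any(r_start <= start and end <= r_end
               for r_start, r_end in disordered_regions)


# Hypothetical protein fragment and hypothetical predicted disordered region.
sequence = "MKNGSAPNPTLLNVTQ"
disordered_regions = [(0, 6)]

spans = find_motifs(sequence)
ordered_motifs = [s for s in spans if not in_disorder(s, disordered_regions)]
```

    Here the candidate at positions 7 to 9 (NPT) is rejected by the proline exclusion, and of the two remaining instances only the first lies within the disordered interval. Counting the fraction of instances outside disorder, per motif class, gives the kind of summary the abstract reports for viral glycosylation motifs.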

    Improving protein structure prediction using amino acid contact & distance prediction

    As more and more protein sequences are generated, interpreting these data has become one of the most pressing tasks in bioinformatics. This thesis concerns how to predict the 3D structure of a protein from its sequence alone, which is a long-standing problem in computational biology. A commonly adopted intermediate step for this task is to predict pairwise amino acid contacts based on the query sequence. Due to the simplicity of the current algorithms, which include statistical models and machine learning techniques, the accuracy of contact prediction is still low for many proteins. Also, these available algorithms are unable to predict amino acid distances (that is, distances longer than the contact range). Thus, the lack of sufficient high-quality geometric constraints makes 3D structure prediction difficult for many proteins. To deal with the current limitations of amino acid constraint and structure prediction, a state-of-the-art deep neural network based amino acid contact & distance prediction algorithm, DeepCDpred, is proposed in this thesis. For a given query protein sequence, the geometric constraints predicted by DeepCDpred are fed into a Rosetta ab initio modelling protocol for protein structure prediction. In addition, a neural network-based method is proposed to evaluate the quality of predicted structures. The accuracies of amino acid contact and distance predictions, the quality of structure predictions and the accuracy of confidence score predictions were evaluated on a test set of 108 protein chains whose experimental structures are known. No sequence in the test set shares more than 25% sequence identity with any sequence in the training set, which was used to train DeepCDpred. The accuracy of the amino acid contact predictions of DeepCDpred is only slightly worse than that of a newly published method, RaptorX, but exceeds that of all others mentioned in this thesis. 
Thanks to the predicted extra distance constraints and the Rosetta ab initio modelling protocol, the structure prediction quality based on the algorithms proposed in this study is better than that from the RaptorX server. A blind test, performed with a yet-to-be-released protein, was also used to validate the effectiveness of DeepCDpred. The protein classes of structures predicted with amino acid contact constraints from MetaPSICOV (the amino acid contact predictor with which DeepCDpred is most often compared in this thesis) are analysed and compared to the predictions based on contact constraints from DeepCDpred, and also to the predictions based on both contact and distance constraints from DeepCDpred. An online server, http://proteincoevolution.bham.ac.uk, has been programmed and released to make the proposed methods for amino acid contact and distance prediction, structure prediction and structure confidence prediction accessible to average users, and it is expected to be beneficial to the research community.
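
    The abstract does not state how contacts are defined or how prediction accuracy is scored. A minimal sketch of one widely used convention follows, assuming that two residues are in contact when their representative atoms lie closer than 8 Å and that accuracy is the precision of the k highest-scoring predicted pairs. The cutoff, the minimum sequence separation, the function names, and the toy coordinates are all assumptions for illustration and are not taken from DeepCDpred.

```python
import math


def contact_set(coords, cutoff=8.0, min_separation=6):
    """Return residue pairs (i, j), i < j, closer than `cutoff` angstroms.

    `coords` holds one (x, y, z) point per residue. Pairs nearer than
    `min_separation` positions in sequence are skipped as trivial.
    """
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts


def top_k_precision(scored_pairs, true_contacts, k):
    """Fraction of the k highest-scoring predicted pairs that are true contacts."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for (i, j), _ in ranked
               if (min(i, j), max(i, j)) in true_contacts)
    return hits / len(ranked)


# Toy hairpin: four residues out along x, four back at y = 5 (hypothetical).
coords = [(0.0, 0, 0), (3.8, 0, 0), (7.6, 0, 0), (11.4, 0, 0),
          (11.4, 5, 0), (7.6, 5, 0), (3.8, 5, 0), (0.0, 5, 0)]
true_contacts = contact_set(coords, min_separation=3)

# Hypothetical predictor output: ((i, j), confidence score).
predictions = [((0, 7), 0.9), ((2, 5), 0.8), ((0, 3), 0.7), ((1, 6), 0.4)]
precision = top_k_precision(predictions, true_contacts, k=3)
```

    Distance prediction extends the same idea: rather than a single binary cutoff, pairs are assigned to distance ranges beyond the contact threshold, which supplies additional geometric restraints to the structure modelling step.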

    Caspase-1 activates gasdermin A in non-mammals

    Gasdermins oligomerize to form pores in the cell membrane, causing a regulated lytic cell death called pyroptosis. Mammals encode five gasdermins that can trigger pyroptosis: GSDMA, B, C, D, and E. Caspase and granzyme proteases cleave the linker regions of, and thereby activate, GSDMB, C, D, and E, but no endogenous activation pathways are yet known for GSDMA. Here, we perform a comprehensive evolutionary analysis of the gasdermin family. A gene duplication of GSDMA in the common ancestor of caecilian amphibians, reptiles, and birds gave rise to GSDMA-D in mammals. Uniquely in our tree, amphibian, reptile, and bird GSDMA groups in a separate clade from mammal GSDMA. Remarkably, GSDMA in numerous bird species contains caspase-1 cleavage sites like YVAD or FASD in the linker. We show that GSDMA from birds, amphibians, and reptiles are all cleaved by caspase-1. Thus, GSDMA was originally cleaved by the host-encoded protease caspase-1. In mammals the caspase-1 cleavage site in GSDMA is disrupted; instead, a new protein, GSDMD, is the target of caspase-1. Mammal caspase-1 uses exosite interactions with the GSDMD C-terminal domain to confer the specificity of this interaction, whereas we show that bird caspase-1 uses a stereotypical tetrapeptide sequence to confer specificity for bird GSDMA. Our results reveal an evolutionarily stable association between caspase-1 and the gasdermin family, albeit a shifting one. Caspase-1 repeatedly changes its target gasdermin over evolutionary time at speciation junctures, initially cleaving GSDME in fish, then GSDMA in amphibians/reptiles/birds, and finally GSDMD in mammals.