21 research outputs found

    What is hidden in the darkness? Characterization of AlphaFold structural space

    The recent public release of the latest version of the AlphaFold database has given us access to over 200 million predicted protein structures. We use a “shape-mer” approach, a structural fragmentation method analogous to sequence k-mers, to describe these structures and look for novelties - both in terms of proteins with rare or novel structural composition and possible functional annotation of under-studied proteins. Data and code will be made available at https://github.com/TurtleTools/afdb-shapemer-darknes

    Embedding-based alignment: combining protein language models and alignment approaches to detect structural similarities in the twilight-zone

    Language models are now routinely used for text classification and generative tasks. Recently, the same architectures were applied to protein sequences, unlocking powerful tools in the bioinformatics field. Protein language models (pLMs) generate high dimensional embeddings on a per-residue level and encode the "semantic meaning" of each individual amino acid in the context of the full protein sequence. Multiple works use these representations as a starting point for downstream learning tasks and, more recently, for identifying distant homologous relationships between proteins. In this work, we introduce a new method that generates embedding-based protein sequence alignments (EBA), and show how these capture structural similarities even in the twilight zone, outperforming both classical sequence-based scores and other approaches based on protein language models. The method shows excellent accuracy despite the absence of training and parameter optimization. We expect that the association of pLMs and alignment methods will soon rise in popularity, helping the detection of relationships between proteins in the twilight-zone
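    As a rough illustration of the idea of aligning two proteins from per-residue embeddings, the sketch below scores residue pairs by embedding similarity and runs a standard global dynamic-programming alignment over those scores. The abstract does not describe the actual EBA scoring, so the cosine similarity, the linear gap penalty, and the names `cosine` and `align_score` are all assumptions made for illustration, not the published method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (assumed scoring)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align_score(emb_a, emb_b, gap=-0.5):
    """Global (Needleman-Wunsch style) alignment score.

    emb_a, emb_b: lists of per-residue embedding vectors.
    gap: hypothetical linear gap penalty, not taken from the source.
    """
    n, m = len(emb_a), len(emb_b)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + cosine(emb_a[i - 1], emb_b[j - 1])
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


# Toy 2-dimensional "embeddings"; real pLM embeddings have far more dimensions.
protein_a = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
protein_b = [(1.0, 0.0), (1.0, 1.0)]
print(align_score(protein_a, protein_a))
print(align_score(protein_a, protein_b))
```

    No training is involved in such a scheme: the only inputs are the embeddings and a gap penalty, which is in the spirit of the abstract's remark about the absence of training and parameter optimization.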

    Caretta – A multiple protein structure alignment and feature extraction suite

    The vast number of protein structures currently available opens exciting opportunities for machine learning on proteins, aimed at predicting and understanding functional properties. In particular, in combination with homology modelling, it is now possible to not only use sequence features as input for machine learning, but also structure features. However, in order to do so, robust multiple structure alignments are imperative. Here we present Caretta, a multiple structure alignment suite meant for homologous but sequentially divergent protein families which consistently returns accurate alignments with a higher coverage than current state-of-the-art tools. Caretta is available as a GUI and command-line application and additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning. We show Caretta's performance on two benchmark datasets, and present an example application of Caretta in predicting the conformational state of cyclin-dependent kinases.

    What is hidden in the darkness? Deep-learning assisted large-scale protein family curation uncovers novel protein families and folds

    Driven by the development and upscaling of fast genome sequencing and assembly pipelines, the number of protein-coding sequences deposited in public protein sequence databases is increasing exponentially. Recently, the dramatic success of deep learning-based approaches applied to protein structure prediction has done the same for protein structures. We are now entering a new era in protein sequence and structure annotation, with hundreds of millions of predicted protein structures made available through the AlphaFold database. These models cover most of the catalogued natural proteins, including those difficult to annotate for function or putative biological role based on standard, homology-based approaches. In this work, we quantified how much of such "dark matter" of the natural protein universe was structurally illuminated by AlphaFold2 and modelled this diversity as an interactive sequence similarity network that can be navigated at https://uniprot3d.org/atlas/AFDB90v4 . In the process, we discovered multiple novel protein families by searching for novelties from sequence, structure, and semantic perspectives. We added a number of them to Pfam, and experimentally demonstrate that one of these belongs to a novel superfamily of toxin-antitoxin systems, TumE-TumA. This work highlights the role of large-scale, evolution-driven protein comparison efforts in combination with structural similarities, genomic context conservation, and deep-learning based function prediction tools for the identification of novel protein families, aiding not only annotation and classification efforts but also the curation and prioritisation of target proteins for experimental characterisation

    New prediction categories in CASP15

    Prediction categories in the Critical Assessment of Structure Prediction (CASP) experiments change with the need to address specific problems in structure modeling. In CASP15, four new prediction categories were introduced: RNA structure, ligand-protein complexes, accuracy of oligomeric structures and their interfaces, and ensembles of alternative conformations. This paper lists technical specifications for these categories and describes their integration in the CASP data management system

    A structural biology community assessment of AlphaFold2 applications

    Most proteins fold into 3D structures that determine how they function and orchestrate the biological processes of the cell. Recent developments in computational methods for protein structure predictions have reached the accuracy of experimentally determined models. Although this has been independently verified, the implementation of these methods across structural-biology applications remains to be tested. Here, we evaluate the use of AlphaFold2 (AF2) predictions in the study of characteristic structural elements; the impact of missense variants; function and ligand binding site predictions; modeling of interactions; and modeling of experimental structural data. For 11 proteomes, an average of 25% additional residues can be confidently modeled when compared with homology modeling, identifying structural features rarely seen in the Protein Data Bank. AF2-based predictions of protein disorder and complexes surpass dedicated tools, and AF2 models can be used across diverse applications equally well compared with experimentally determined structures, when the confidence metrics are critically considered. In summary, we find that these advances are likely to have a transformative impact in structural biology and broader life-science research

    Protein target highlights in CASP15: Analysis of models by structure providers

    We present an in-depth analysis of selected CASP15 targets, focusing on their biological and functional significance. The authors of the structures identify and discuss key protein features and evaluate how effectively these aspects were captured in the submitted predictions. While the overall ability to predict three-dimensional protein structures continues to impress, reproducing uncommon features not previously observed in experimental structures is still a challenge. Furthermore, instances with conformational flexibility and large multimeric complexes highlight the need for novel scoring strategies to better emphasize biologically relevant structural regions. Looking ahead, closer integration of computational and experimental techniques will play a key role in determining the next challenges to be unraveled in the field of structural molecular biology

    Computational approaches to discover novel enzymes for fragrance and flavour

    Plant specialized metabolites (SMs) are crucial to plants and to humanity, with numerous applications in food, healthcare, agriculture, and cosmetics. The enzyme families involved in producing SMs, such as the terpene synthases, are very diverse, both across and within families. Understanding and predicting compound specificity of these enzymes is critical for biotechnological applications and protein engineering. The growing availability of structure data and improved computational modelling techniques puts us in the position to use structural bioinformatics and machine learning (ML) techniques to learn patterns across all enzymes in an SM family, instead of focusing on a few structures or mutants. In this thesis I explore new algorithms and approaches to analyse datasets of SM families and take advantage of their complex structural data.
    In Chapter 1 I introduce the terpene synthases and place them in context among the wider field of plant specialized metabolism. Their importance in both the plant and human worlds is discussed along with a history of the elucidation of their catalytic mechanisms via structural and mutational studies. I explore the various opportunities and challenges offered by computational techniques, found in the structural bioinformatics and ML fields, to better understand such elusive SM enzyme families. In Chapter 2 I describe the creation of a database of experimentally characterized plant sesquiterpene synthases (STSs), collected from literature studies, covering over 250 enzymes collectively responsible for the production of over a hundred sesquiterpene compounds. These proteins are analysed from a sequence perspective leading to interesting results on previously studied motifs, as well as the conclusion that phylogeny plays a larger role in STS sequence similarity than product specificity. This further expedited the need for protein structure information, extracted using homology modelling.
    In Chapter 3 I put forth an analysis of STS major and minor products, demonstrating that sesquiterpenes produced by an STS tend to be derived from the same reaction path. This enabled us to simplify the idea of product prediction to parent cation prediction, where I show that ML on the modelled STS structures out-performs sequence-based approaches.
    To make further use of this structural information, in Chapters 4, 5 and 6 I developed structural bioinformatics embeddings for ML applications, resulting in an embedding allowing alignment-free comparison of the topologies and shapes contained in a structure, and a multiple structure alignment algorithm for structural features. The former, termed Geometricus and presented in Chapters 5 and 6, uses a concept from computer vision called rotation invariant moments to extract and count “shape-mers”, structural analogues to sequence k-mers. The latter, Caretta, presented in Chapters 4 and 6, is a multiple structure aligner that incorporates Geometricus shape-mer counting to scale to many thousands of proteins, and includes a feedback loop between single proteins and the progressively created alignment to return accurate and high-coverage alignments. To enable downstream ML analyses, Caretta also extracts and outputs aligned feature matrices, including the moment invariants used by Geometricus as a novel feature source describing protein shape and topology.
    This novel feature extraction and alignment approach is applied in Chapter 7 to the task of predicting STS product specificity. To increase our coverage of STS sequence and compound space we use what we learned in Chapters 2 and 3 to select and experimentally characterize over 60 new STSs. As the number of possible products precludes the classification approach in Chapter 3, I create a joint protein-compound framework combining aligned protein structural features with chemical compound features to both successfully predict product specificity, and pinpoint residues involved in the formation of each sesquiterpene.
    Many of the analyses and techniques used in this thesis are common across protein biology and bioinformatics. To allow life scientists to explore the interconnected properties of their protein family of interest from a variety of different perspectives, and share these findings across the web, in Chapter 8 I present Turterra, an interactive data visualization portal.
    Chapter 9 concludes this thesis by describing ongoing challenges in studying SM enzyme families and their potential solutions from an ML perspective. I expand the discussion to the broader field of protein structure bioinformatics and the many opportunities it holds for enhancing our understanding of biological function.

    Beyond sequence: Structure-based machine learning

    Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field

    Geometricus represents protein structures as shape-mers derived from moment invariants

    MOTIVATION: As the number of experimentally solved protein structures rises, it becomes increasingly appealing to use structural information for predictive tasks involving proteins. Due to the large variation in protein sizes, folds and topologies, an attractive approach is to embed protein structures into fixed-length vectors, which can be used in machine learning algorithms aimed at predicting and understanding functional and physical properties. Many existing embedding approaches are alignment based, which is both time-consuming and ineffective for distantly related proteins. On the other hand, library- or model-based approaches depend on a small library of fragments or require the use of a trained model, both of which may not generalize well. RESULTS: We present Geometricus, a novel and universally applicable approach to embedding proteins in a fixed-dimensional space. The approach is fast, accurate, and interpretable. Geometricus uses a set of 3D moment invariants to discretize fragments of protein structures into shape-mers, which are then counted to describe the full structure as a vector of counts. We demonstrate the applicability of this approach in various tasks, including fast structure similarity search, unsupervised clustering, and structure classification, both across proteins from different superfamilies and within the same family. AVAILABILITY AND IMPLEMENTATION: Python code available at https://git.wur.nl/durai001/geometricus.
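    The fragment, discretize, and count pipeline described above can be sketched in a few lines of plain Python. This is a simplified illustration and not the Geometricus implementation: the two invariants used here (the trace and the sum of principal minors of the second-order moment tensor) stand in for the set of 3D moment invariants the abstract mentions, and the window length `k`, the log-scaled binning, the `resolution` parameter, and the function names are all assumptions.

```python
import math
from collections import Counter


def moment_invariants(fragment):
    """Two rotation- and translation-invariant descriptors of a 3D fragment.

    These are stand-in invariants chosen for brevity, not the ones used by
    Geometricus.
    """
    n = len(fragment)
    cx = sum(p[0] for p in fragment) / n
    cy = sum(p[1] for p in fragment) / n
    cz = sum(p[2] for p in fragment) / n
    centered = [(x - cx, y - cy, z - cz) for x, y, z in fragment]
    m200 = sum(x * x for x, _, _ in centered) / n
    m020 = sum(y * y for _, y, _ in centered) / n
    m002 = sum(z * z for _, _, z in centered) / n
    m110 = sum(x * y for x, y, _ in centered) / n
    m101 = sum(x * z for x, _, z in centered) / n
    m011 = sum(y * z for _, y, z in centered) / n
    trace = m200 + m020 + m002
    minors = (m200 * m020 + m200 * m002 + m020 * m002
              - m110 ** 2 - m101 ** 2 - m011 ** 2)
    return trace, minors


def shape_mers(coords, k=4, resolution=2.0):
    """Count discretized invariants over all length-k windows of coords."""
    counts = Counter()
    for i in range(len(coords) - k + 1):
        invariants = moment_invariants(coords[i:i + k])
        # Log-scaled binning (an assumed scheme); clamp tiny negative
        # values that can arise from floating-point error.
        key = tuple(int(math.log1p(max(v, 0.0)) * resolution)
                    for v in invariants)
        counts[key] += 1
    return counts


# Toy backbone: six points on a straight line, 3.8 units apart.
backbone = [(i * 3.8, 0.0, 0.0) for i in range(6)]
print(shape_mers(backbone))
```

    Because every window of the toy backbone has the same shape, all fragments fall into a single shape-mer. The resulting counter plays the role of the count vector: two structures can be compared by their shape-mer counts without computing an alignment.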