30 research outputs found

    Caretta – A multiple protein structure alignment and feature extraction suite

    The vast number of protein structures currently available opens exciting opportunities for machine learning on proteins, aimed at predicting and understanding functional properties. In particular, in combination with homology modelling, it is now possible to not only use sequence features as input for machine learning, but also structure features. However, in order to do so, robust multiple structure alignments are imperative. Here we present Caretta, a multiple structure alignment suite meant for homologous but sequentially divergent protein families which consistently returns accurate alignments with a higher coverage than current state-of-the-art tools. Caretta is available as a GUI and command-line application and additionally outputs an aligned structure feature matrix for a given set of input structures, which can readily be used in downstream steps for supervised or unsupervised machine learning. We show Caretta's performance on two benchmark datasets, and present an example application of Caretta in predicting the conformational state of cyclin-dependent kinases.

    Genomic prediction in plants: opportunities for ensemble machine learning based approaches [version 2; peer review: 1 approved, 2 approved with reservations]

    Background: Many studies have demonstrated the utility of machine learning (ML) methods for genomic prediction (GP) of various plant traits, but a clear rationale for choosing ML over conventionally used, often simpler parametric methods, is still lacking. Predictive performance of GP models might depend on a plethora of factors including sample size, number of markers, population structure and genetic architecture. Methods: Here, we investigate which problem and dataset characteristics are related to good performance of ML methods for genomic prediction. We compare the predictive performance of two frequently used ensemble ML methods (Random Forest and Extreme Gradient Boosting) with parametric methods including genomic best linear unbiased prediction (GBLUP), reproducing kernel Hilbert space regression (RKHS), BayesA and BayesB. To explore problem characteristics, we use simulated and real plant traits under different genetic complexity levels determined by the number of Quantitative Trait Loci (QTLs), heritability (h2 and h2e), population structure and linkage disequilibrium between causal nucleotides and other SNPs. Results: Decision tree based ensemble ML methods are a better choice for nonlinear phenotypes and are comparable to Bayesian methods for linear phenotypes in the case of large effect Quantitative Trait Nucleotides (QTNs). Furthermore, we find that ML methods are susceptible to confounding due to population structure but less sensitive to low linkage disequilibrium than linear parametric methods. Conclusions: Overall, this provides insights into the role of ML in GP as well as guidelines for practitioners.

    Novel routes towards bioplastics from plants: elucidation of the methylperillate biosynthesis pathway from Salvia dorisiana trichomes

    Plants produce a large variety of highly functionalized terpenoids. Functional groups such as partially unsaturated rings and carboxyl groups provide handles to use these compounds as feedstock for biobased commodity chemicals. For instance, methylperillate, a monoterpenoid found in Salvia dorisiana, may be used for this purpose, as it carries both an unsaturated ring and a methylated carboxyl group. The biosynthetic pathway of methylperillate in plants is still unclear. In this work, we identified glandular trichomes from S. dorisiana as the location of biosynthesis and storage of methylperillate. mRNA from purified trichomes was used to identify four genes that can encode the pathway from geranyl diphosphate towards methylperillate. This pathway includes a (–)-limonene synthase (SdLS), a limonene 7-hydroxylase (SdL7H, CYP71A76), and a perillyl alcohol dehydrogenase (SdPOHDH). We also identified a terpene acid methyltransferase, perillic acid O-methyltransferase (SdPAOMT), with homology to salicylic acid OMTs. Transient expression in Nicotiana benthamiana of these four genes, in combination with a geranyl diphosphate synthase to boost precursor formation, resulted in production of methylperillate. This demonstrates the potential of these enzymes for metabolic engineering of a feedstock for biobased commodity chemicals.

    An Expanded Evaluation of Protein Function Prediction Methods Shows an Improvement In Accuracy

    Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent.

    An expanded evaluation of protein function prediction methods shows an improvement in accuracy

    Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent. Keywords: Protein function prediction, Disease gene prioritization

    Floral pathway integrator gene expression mediates gradual transmission of environmental and endogenous cues to flowering time

    The appropriate timing of flowering is crucial for the reproductive success of plants. Hence, intricate genetic networks integrate various environmental and endogenous cues such as temperature or hormonal status. These signals integrate into a network of floral pathway integrator genes. At a quantitative level, it is currently unclear how the impact of genetic variation in signaling pathways on flowering time is mediated by floral pathway integrator genes. Here, using datasets available from literature, we connect Arabidopsis thaliana flowering time in genetic backgrounds varying in upstream signalling components with the expression levels of floral pathway integrator genes in these genetic backgrounds. Our modelling results indicate that flowering time depends in a quite linear way on expression levels of floral pathway integrator genes. This gradual, proportional response of flowering time to upstream changes enables a gradual adaptation to changing environmental factors such as temperature and light.

    Beyond sequence: Structure-based machine learning

    Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field.

    Geometricus represents protein structures as shape-mers derived from moment invariants

    MOTIVATION: As the number of experimentally solved protein structures rises, it becomes increasingly appealing to use structural information for predictive tasks involving proteins. Due to the large variation in protein sizes, folds and topologies, an attractive approach is to embed protein structures into fixed-length vectors, which can be used in machine learning algorithms aimed at predicting and understanding functional and physical properties. Many existing embedding approaches are alignment based, which is both time-consuming and ineffective for distantly related proteins. On the other hand, library- or model-based approaches depend on a small library of fragments or require the use of a trained model, both of which may not generalize well. RESULTS: We present Geometricus, a novel and universally applicable approach to embedding proteins in a fixed-dimensional space. The approach is fast, accurate, and interpretable. Geometricus uses a set of 3D moment invariants to discretize fragments of protein structures into shape-mers, which are then counted to describe the full structure as a vector of counts. We demonstrate the applicability of this approach in various tasks, ranging from fast structure similarity search, unsupervised clustering and structure classification across proteins from different superfamilies as well as within the same family. AVAILABILITY AND IMPLEMENTATION: Python code available at https://git.wur.nl/durai001/geometricus.
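
    The shape-mer idea described in this abstract can be illustrated with a minimal sketch. This is not the Geometricus implementation or its API: the two invariants below (mean squared and mean fourth-power distance to the fragment centroid) are simplified stand-ins for the 3D moment invariants the authors use, and the function names, the fragment length `k`, and the `resolution` parameter are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def fragment_invariants(coords):
    """Return two translation- and rotation-invariant moments of a fragment.

    Illustrative stand-ins only, not the invariants used by Geometricus.
    """
    n = len(coords)
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    # Squared distance of every point to the fragment centroid.
    sq = [(x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords]
    return (sum(sq) / n, sum(d * d for d in sq) / n)


def shape_mer(invariants, resolution=2.0):
    """Discretize continuous invariants into a tuple of integers (a shape-mer)."""
    return tuple(int(resolution * log(1.0 + v)) for v in invariants)


def embed(coords, k=4, resolution=2.0):
    """Count the shape-mers of all overlapping k-residue fragments."""
    counts = Counter()
    for i in range(len(coords) - k + 1):
        counts[shape_mer(fragment_invariants(coords[i:i + k]), resolution)] += 1
    return counts


def to_vector(counts, vocabulary):
    """Turn shape-mer counts into a fixed-length vector over a given vocabulary."""
    return [counts.get(s, 0) for s in vocabulary]
```

    Calling `embed` on a list of (x, y, z) coordinates, for example one point per residue, yields a `Counter` of shape-mers, and `to_vector` maps it onto a shared vocabulary so that structures of different lengths give vectors of equal length. A coarser `resolution` merges more fragments into the same shape-mer, which trades specificity for robustness to small coordinate differences.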

    Transcriptional Feedback in Plant Growth and Defense by PIFs, BZR1, HY5, and MYC Transcription Factors

    Growth of Arabidopsis is controlled by the activity of a set of bHLH and bZIP transcription factors of which phytochrome interacting factor4 (PIF4), BRASSINAZOLE-RESISTANT 1 (BZR1), and elongated hypocotyl 5 (HY5) have been most extensively studied. Defense responses are controlled by a set of MYC transcription factors of which MYC2 is best characterized. Moreover, hundreds of additional proteins (here named co-factors) have been identified which (in)directly may affect the expression or activity of these TFs. Thus, regulation of expression of genes encoding these co-factors becomes an integral part of understanding the molecular control of growth and defense. Here, we review RNA-seq data related to PIF, BZR1, HY5, or MYC activity, which indicate that 125 co-factor genes affecting PIFs, HY5, BZR1, or MYCs are themselves under transcriptional control by these TFs, thus revealing potential feedback regulation in growth and defense. The transcriptional feedback on co-factor genes related to PIF4, BZR1, and MYC2 by PIFs, BZR1, or MYCs, mostly results in negative feedback on PIF4, BZR1, or MYC2 activity. In contrast, transcription feedback on co-factor genes for HY5 by HY5 mostly results in positive feedback on HY5 activity. PIF4 and BZR1 exert a balanced regulation of photoreceptor-gene expression, whose products directly or indirectly affect PIF4, HY5, and MYC2 protein stability as a function of light. Growth itself is balanced by both multiple positive and multiple negative feedback on PIF4 and BZR1 activity. The balance between growth and defense is mostly through direct cross-regulation between HY5 and MYC2 as previously described, but also through potential transcriptional feedback on co-factor genes for MYC2 by PIF4, BZR1, and HY5 and through transcriptional feedback of co-factors for PIF4 and BZR1 by MYC2. The interlocking feed-forward and feed-backward transcriptional regulation of PIF4, BZR1, HY5, and MYC2 co-factors is a signature of robust and temporal control of signaling related to growth and defense.