45 research outputs found

    Protein Function Prediction using Phylogenomics, Domain Architecture Analysis, Data Integration, and Lexical Scoring

    “As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally.” (Radivojac, Clark, Oron, et al. 2013) With this goal, three new protein function annotation tools were developed, which produce trustworthy and concise protein annotations, are easy to obtain and install, and are capable of processing large sets of proteins with reasonable computational resource demands. Especially for high-throughput analysis, e.g. on genome scale, these tools improve over existing tools in both ease of use and accuracy. They are dubbed:
    • Automated Assignment of Human Readable Descriptions (AHRD) (github.com/groupschoof/AHRD; Hallab, Klee, Srinivas, and Schoof 2014),
    • AHRD on gene clusters, and
    • Phylogenetic predictions of Gene Ontology (GO) terms with specific calibrations (PhyloFun v2).
    “AHRD” assigns human readable descriptions (HRDs) to query proteins and was developed to mimic the decision-making process of an expert curator. To this end it processes the descriptions of reference proteins obtained by searching selected databases with BLAST (Altschul, Madden, Schaffer, et al. 1997). Here, the trust a user puts into results found in each of these databases can be weighted separately. In the next step the descriptions of the found homologous proteins are filtered, removing accessions and species information, and finally discarding uninformative candidate descriptions such as “putative protein”. Afterwards a dictionary of meaningful words is constructed from those found in the remaining candidates. In this step, another filter is applied to ignore words that convey no information, such as the word “protein” itself. In a lexical approach each word is assigned a score based on its frequency in all candidate descriptions, the sequence alignment quality associated with the candidate reference proteins, and finally the already mentioned trust put into the database the reference was obtained from.
Subsequently each candidate description is assigned a score, which is computed from the respective scores of the meaningful words contained in that candidate. Also incorporated into this score is the description’s frequency among all regarded candidates. In the final step the highest scoring description is assigned to the query protein. The performance of this lexical algorithm, implemented in “AHRD”, was subsequently compared with that of competing methods, namely Blast2GO and “best Blast”, where the latter simply passes the description of the best scoring hit to the query protein. To enable this comparison of performance, and for lack of a robust evaluation procedure, a new method to measure the accuracy of textual human readable protein descriptions was developed and applied with success. In this method, the accuracy of each assigned description was inferred with the frequently used “F-measure”, the harmonic mean of precision and recall, which we computed by regarding meaningful words appearing in both the reference and the assigned descriptions as true positives. The results showed that “AHRD” not only outperforms its competitors by far, but is also very robust and thus does not require its users to carefully select parameters. In fact, AHRD’s robustness was demonstrated through cross validation and use of three different reference sets. The second annotation tool “AHRD on gene clusters” uses conserved protein domains from the InterPro database (Apweiler, Attwood, Bairoch, et al. 2000) to annotate clusters of homologous proteins. In a first step the domains found in each cluster are filtered, such that only the most informative are retained. For example, family descriptions are discarded if more detailed sub-family descriptions are also found annotated to members of the cluster. Subsequently, the most frequent candidate description is assigned, favoring those of type “family” over “domain”.
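The lexical scoring steps described for AHRD can be illustrated with a minimal sketch. It is not the AHRD implementation: the abstract does not give the scoring formulas, so the way word scores are combined here (a sum of alignment quality times database trust, then a mean word score plus relative description frequency) is an assumption, as are the names `best_description`, `db_trust`, and the two small filter lists. The removal of accessions and species information is omitted.

```python
import re
from collections import defaultdict

# Assumption: tiny filter lists; the abstract names only these two examples.
UNINFORMATIVE_WORDS = {"protein"}
UNINFORMATIVE_DESCRIPTIONS = {"putative protein"}


def tokens(description):
    """Return the lower-cased informative words of a description."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in UNINFORMATIVE_WORDS}


def best_description(hits, db_trust):
    """Pick the highest scoring description.

    hits: list of (description, database, alignment_quality in [0, 1]).
    db_trust: user-chosen weight per database (hypothetical parameter).
    """
    hits = [h for h in hits if h[0].lower() not in UNINFORMATIVE_DESCRIPTIONS]
    if not hits:
        return None

    # Word score: summing over hits rewards frequent words; each occurrence
    # is weighted by alignment quality and trust in the source database.
    word_score = defaultdict(float)
    for desc, db, quality in hits:
        for word in tokens(desc):
            word_score[word] += quality * db_trust[db]

    # Count how often each exact description occurs among the candidates.
    counts = defaultdict(int)
    for desc, _, _ in hits:
        counts[desc] += 1

    def score(desc):
        # Description score: mean word score plus relative frequency.
        words = tokens(desc)
        lexical = sum(word_score[w] for w in words) / len(words) if words else 0.0
        return lexical + counts[desc] / len(hits)

    return max(counts, key=score)
```

Raising the trust weight of a curated database shifts the result towards descriptions found there, which mirrors the separately weighted database trust mentioned in the text.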
Finally the third tool “PhyloFun (v2)” was developed to annotate large sets of query proteins with terms from the Gene Ontology. This work focussed on extending the “Belief propagation” (Pearl 1988) algorithm implemented in the “Sifter” annotation tool (Engelhardt, Jordan, Muratore, and Brenner 2005; Engelhardt, Jordan, Srouji, and Brenner 2011). Jöcker had developed a phylogenetic pipeline generating the input that was fed into the Sifter program. This pipeline executes stringent sequence similarity searches in a database of selected reference proteins, and reconstructs a phylogenetic tree from the found orthologs and inparalogs. This tree is then used by the Sifter program and interpreted as a “Bayesian Network” into which the GO term annotations of the homologous reference proteins are fed as “diagnostic evidence” (Pearl 1988). Subsequently the current strength of belief, the probability of this evidence being also the true state of ancestral tree nodes, is spread recursively through the tree towards its root, and then in the reverse direction towards the tips. These, of course, include the query protein, which in the final step is annotated with those GO terms that have the strongest belief. Note that during this recursive belief propagation a given GO term’s annotation probability depends on both the length of the currently processed branch and the type of evolutionary event that took place. This event can be either “speciation” or “duplication”, such that function mutation becomes more likely on longer branches and particularly after “duplication” events. A particular goal in extending this algorithm was to base the annotation probability of a given GO term not on a preconceived model of function evolution among homologous proteins as implemented in Sifter, but instead to compute these GO term annotation probabilities based on empirical measurements.
To achieve this, calibrations were computed for each GO term separately, and reference proteins annotated with a given GO term were investigated such that the probability of function loss could be assessed empirically for decreasing sequence homology among related proteins. A second goal was to overcome errors in the identification of the type of evolutionary events. These errors arose from missing knowledge of true species trees, which, in version 1 of the PhyloFun pipeline, are compared with the actual protein trees in order to tell “duplication” from “speciation” events (Zmasek and Eddy 2001). As reliable reference species trees are sparse or in many cases not available, the part of the algorithm incorporating the type of evolutionary event was discarded. Finally, the third goal postulated for the development of PhyloFun’s version 2 was to enable easy installation, usage, and calibration on the latest available knowledge. This was motivated by observations made during the application of the first version of PhyloFun, in which maintaining the knowledge base was barely feasible. This obstacle was overcome in version 2 of PhyloFun by obtaining required reference data directly from publicly available databases. The accuracy and performance of the new PhyloFun version 2 were assessed and compared with selected competing methods. These were chosen based on their widespread usage, as well as their applicability to large sets of query proteins without surpassing reasonable time and computational resource requirements. The measurement of each method’s performance was carried out on a “gold standard”, obtained from the Uniprot/Swissprot public database (Boeckmann, Bairoch, Apweiler, et al. 2003), of 1000 selected reference proteins, all of which had GO term annotations made by expert curators and mostly based on experimental verifications.
Subsequently the performance assessment was executed with a slightly modified version of the “Critical Assessment of Function Annotation” (CAFA) experiment (Radivojac, Clark, Oron, et al. 2013). CAFA compares the performance of different protein function annotation tools on a worldwide scale using a provided set of reference proteins. In this experiment, the predictions the competitors deliver are evaluated using the already introduced “F-measure”. Our performance evaluation of PhyloFun’s protein annotations showed that PhyloFun outperformed all of its competitors. Its use is further recommended by the highly accurate phylogenetic trees the pipeline computes for each query and the found homologous reference proteins. In conclusion, three new premium tools addressing important matters in the computational prediction of protein function were developed and, in two cases, their performance assessed. Here, both AHRD and PhyloFun (v2) outperformed their competitors. Further arguments for the usage of all three tools are that they are easy to install and use and have reasonable resource demands. Because of these results the publications of AHRD and PhyloFun (v2) are in preparation, while AHRD is already applied by different researchers worldwide.
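The word-level F-measure described in this abstract for evaluating assigned descriptions can be sketched as follows. This is an illustrative reading, not the authors' evaluation code: the tokenization rule and the one-word filter list are assumptions, and `f_measure` is a name chosen here.

```python
import re

# Assumption: the abstract names only "protein" as an uninformative word.
UNINFORMATIVE_WORDS = {"protein"}


def meaningful_words(description):
    """Return the set of lower-cased informative words in a description."""
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in UNINFORMATIVE_WORDS}


def f_measure(assigned, reference):
    """Harmonic mean of word-level precision and recall.

    True positives are meaningful words shared by the assigned and the
    reference description.
    """
    assigned_words = meaningful_words(assigned)
    reference_words = meaningful_words(reference)
    true_positives = len(assigned_words & reference_words)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(assigned_words)
    recall = true_positives / len(reference_words)
    return 2 * precision * recall / (precision + recall)
```

For instance, an assigned description with three meaningful words, two of which match a two-word reference, has precision 2/3 and recall 1, giving an F-measure of 0.8.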

    Photoreceptor Activity Contributes to Contrasting Responses to Shade in Cardamine and Arabidopsis Seedlings

    Plants have evolved two major ways to deal with nearby vegetation or shade: avoidance and tolerance. Moreover, some plants respond to shade in different ways; for example, Arabidopsis thaliana undergoes an avoidance response to shade produced by vegetation, but its close relative Cardamine hirsuta tolerates shade. How plants adopt opposite strategies to respond to the same environmental challenge is unknown. Here, using a genetic strategy, we identified the C. hirsuta slender in shade1 (sis1) mutants, which produce strongly elongated hypocotyls in response to shade. These mutants lack the phytochrome A (phyA) photoreceptor. Our findings suggest that C. hirsuta has evolved a highly efficient phyA-dependent pathway that suppresses hypocotyl elongation when challenged by shade from nearby vegetation. This suppression relies, at least in part, on stronger phyA activity in C. hirsuta; this is achieved by increased ChPHYA expression and protein accumulation combined with a stronger specific intrinsic repressor activity. We suggest that modulation of photoreceptor activity is a powerful mechanism in nature to achieve physiological variation (shade tolerance vs. avoidance) for species to colonize different habitats.

    Pan-European study of genotypes and phenotypes in the Arabidopsis relative Cardamine hirsuta reveals how adaptation, demography, and development shape diversity patterns

    We study natural DNA polymorphisms and associated phenotypes in the Arabidopsis relative Cardamine hirsuta. We observed strong genetic differentiation among several ancestry groups and broader distribution of Iberian relict strains in European C. hirsuta compared to Arabidopsis. We found synchronization between vegetative and reproductive development and a pervasive role for heterochronic pathways in shaping C. hirsuta natural variation. A single, fast-cycling ChFRIGIDA allele evolved adaptively allowing range expansion from glacial refugia, unlike Arabidopsis where multiple FRIGIDA haplotypes were involved. The Azores islands, where Arabidopsis is scarce, are a hotspot for C. hirsuta diversity. We identified a quantitative trait locus (QTL) in the heterochronic SPL9 transcription factor as a determinant of an Azorean morphotype. This QTL shows evidence for positive selection, and its distribution mirrors a climate gradient that broadly shaped the Azorean flora. Overall, we establish a framework to explore how the interplay of adaptation, demography, and development shaped diversity patterns of two related plant species.

    Cell type specific transcriptional reprogramming of maize leaves during Ustilago maydis induced tumor formation

    Ustilago maydis is a biotrophic pathogen and well-established genetic model to understand the molecular basis of biotrophic interactions. U. maydis suppresses plant defense and induces tumors on all aerial parts of its host plant maize. In a previous study we found that U. maydis induced leaf tumor formation builds on two major processes: the induction of hypertrophy in the mesophyll and the induction of cell division (hyperplasia) in the bundle sheath. In this study we analyzed the cell-type specific transcriptome of maize leaves 4 days post infection. This analysis allowed identification of key features underlying the hypertrophic and hyperplasic cell identities derived from mesophyll and bundle sheath cells, respectively. We examined the differentially expressed (DE) genes with particular focus on maize cell cycle genes and found that three A-type cyclins and one each of the B-, D- and T-type are upregulated in the hyperplasic tumorous cells, in which the U. maydis effector protein See1 promotes cell division. Additionally, most of the proteins involved in the formation of the pre-replication complex (pre-RC, which ensures that each daughter cell receives identical DNA copies), the transcription factors E2F and DPa as well as several D-type cyclins are deregulated in the hypertrophic cells.

    Plant PhysioSpace: a robust tool to compare stress response across plant species

    Generalization of transcriptomics results can be achieved by comparison across experiments. This generalization is based on integration of interrelated transcriptomics studies into a compendium. Such a focus on the bigger picture enables both characterizations of the fate of an organism and distinction between generic and specific responses. Numerous methods for analyzing transcriptomics datasets exist. Yet, most of these methods focus on gene-wise dimension reduction to obtain marker genes and gene sets for use in, for example, pathway analysis. Relying only on isolated biological modules might result in missing important confounders and relevant contexts. We developed a method called Plant PhysioSpace, which enables researchers to compute experimental conditions across species and platforms without a priori reducing the reference information to specific gene sets. Plant PhysioSpace extracts physiologically relevant signatures from a reference dataset (i.e. a collection of public datasets) by integrating and transforming heterogeneous reference gene expression data into a set of physiology-specific patterns. New experimental data can be mapped to these patterns, resulting in similarity scores between the acquired data and the extracted compendium. Because of its robustness against platform bias and noise, Plant PhysioSpace can function as an inter-species or cross-platform similarity measure. We have demonstrated its success in translating stress responses between different species and platforms, including single-cell technologies. We have also implemented two R packages, one software and one data package, and a Shiny web application to facilitate access to our method and precomputed models.

    GXP: Analyze and Plot Plant Omics Data in Web Browsers

    Next-generation sequencing and metabolomics have become very cost- and labor-efficient and are integrated into an ever-growing number of life science research projects. Typically, established software pipelines analyze raw data and produce quantitative data on gene expression or metabolite concentrations. These results need to be visualized and further analyzed in order to support scientific hypothesis building and identification of underlying biological patterns. Some tools for this purpose already exist, but require installation or manual programming. We developed “Gene Expression Plotter” (GXP), an RNAseq and metabolomics data visualization and analysis tool that runs entirely in the user’s web browser and thus needs no custom installation, manual programming or uploading of confidential data to third-party servers. Consequently, upon receiving the bioinformatic raw data analysis of RNAseq or other omics results, GXP immediately enables the user to interact with the data according to biological questions by performing knowledge-driven, in-depth data analyses and candidate identification via visualization and data exploration. Thereby, GXP can support and accelerate complex interdisciplinary omics projects and downstream analyses. GXP offers an easy way to publish data, plots, and analysis results either as a simple exported file or as a custom website. GXP is freely available on GitHub (see introduction).

    Recently duplicated sesterterpene (C25) gene clusters in Arabidopsis thaliana modulate root microbiota

    Land plants co-speciate with a diversity of continually expanding plant specialized metabolites (PSMs) and root microbial communities (microbiota). Homeostatic interactions between plants and root microbiota are essential for plant survival in natural environments. A growing appreciation of the importance of microbiota for plant health is fuelling rapid advances in understanding the genetic mechanisms by which host plants control microbiota. PSMs have long been proposed to mediate interactions between plants and single microbes. However, the effects of PSMs, especially evolutionarily new PSMs, on root microbiota at the community level remain to be elucidated. Here, we discovered sesterterpenes in Arabidopsis thaliana, produced by recently duplicated prenyltransferase-terpene synthase (PT-TPS) gene clusters, with neo-functionalization. A single-residue substitution played a critical role in the acquisition of sesterterpene synthase (sesterTPS) activity in Brassicaceae plants. Moreover, we found that the absence of two root-specific sesterterpenoids, with similar chemical structure, significantly affected root microbiota assembly in similar patterns. Our results not only demonstrate the sensitivity of plant microbiota to PSMs but also establish a complete framework of host plants to control root microbiota composition through evolutionarily dynamic PSMs.