Recommended from our members
PhosBoost: Improved phosphorylation prediction recall using gradient boosting and protein language models.
Protein phosphorylation is a dynamic and reversible post-translational modification that regulates a variety of essential biological processes. The regulatory role of phosphorylation in cellular signaling pathways, protein-protein interactions, and enzymatic activities has motivated extensive research efforts to understand its functional implications. Experimental protein phosphorylation data in plants remains limited to a few species, necessitating a scalable and accurate prediction method. Here, we present PhosBoost, a machine-learning approach that leverages protein language models and gradient-boosting trees to predict protein phosphorylation from experimentally derived data. Trained on data obtained from a comprehensive plant phosphorylation database, qPTMplants, we compared the performance of PhosBoost to existing protein phosphorylation prediction methods, PhosphoLingo and DeepPhos. For serine and threonine prediction, PhosBoost achieved higher recall than PhosphoLingo and DeepPhos (.78, .56, and .14, respectively) while maintaining a competitive area under the precision-recall curve (.54, .56, and .42, respectively). PhosphoLingo and DeepPhos failed to predict any tyrosine phosphorylation sites, while PhosBoost achieved a recall score of .6. Despite the precision-recall tradeoff, PhosBoost offers improved performance when recall is prioritized while consistently providing more confident probability scores. A sequence-based pairwise alignment step improved prediction results for all classifiers by effectively increasing the number of inferred positive phosphosites. We provide evidence to show that PhosBoost models are transferable across species and scalable for genome-wide protein phosphorylation predictions. PhosBoost is freely and publicly available on GitHub
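The precision-recall tradeoff mentioned in this abstract can be illustrated with a small sketch. This is not PhosBoost code: the labels, scores, and the helper name `precision_recall` are hypothetical, chosen only to show how lowering a probability threshold raises recall at the cost of precision.

```python
def precision_recall(y_true, y_score, threshold=0.5):
    """Return (precision, recall) for binary labels at a given score threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1  # phosphosite correctly predicted
        elif predicted_positive and label == 0:
            fp += 1  # non-site wrongly predicted as a phosphosite
        elif not predicted_positive and label == 1:
            fn += 1  # phosphosite missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy data: 1 = phosphorylated site, 0 = not phosphorylated.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.3, 0.7, 0.35, 0.3, 0.1]

strict = precision_recall(y_true, y_score, threshold=0.5)
lenient = precision_recall(y_true, y_score, threshold=0.25)
print("threshold 0.50:", strict)
print("threshold 0.25:", lenient)
```

With these made-up scores, the lower threshold recovers every true site but admits more false positives, which is the kind of tradeoff the abstract refers to when recall is prioritized.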
Assessing the performance of generative artificial intelligence in retrieving information against manually curated genetic and genomic data.
Curated resources at centralized repositories provide high-value service to users by enhancing data veracity. Curation, however, comes with a cost, as it requires dedicated time and effort from personnel with deep domain knowledge. In this paper, we investigate the performance of a large language model (LLM), specifically generative pre-trained transformer (GPT)-3.5 and GPT-4, in extracting and presenting data against a human curator. In order to accomplish this task, we used a small set of journal articles on wheat and barley genetics, focusing on traits, such as salinity tolerance and disease resistance, which are becoming more important. The 36 papers were then curated by a professional curator for the GrainGenes database (https://wheat.pw.usda.gov). In parallel, we developed a GPT-based retrieval-augmented generation question-answering system and compared how GPT performed in answering questions about traits and quantitative trait loci (QTLs). Our findings show that on average GPT-4 correctly categorized manuscripts 97% of the time, correctly extracted 80% of traits, and 61% of marker-trait associations. Furthermore, we assessed the ability of a GPT-based DataFrame agent to filter and summarize curated wheat genetics data, showing the potential of human and computational curators working side-by-side. In one case study, our findings show that GPT-4 was able to retrieve up to 91% of disease-related, human-curated QTLs across the whole genome, and up to 96% across a specific genomic region through prompt engineering. Also, we observed that across most tasks, GPT-4 consistently outperformed GPT-3.5 while generating fewer hallucinations, suggesting that improvements in LLMs will make generative artificial intelligence a much more accurate companion for curators in extracting information from scientific literature.
Despite their limitations, LLMs demonstrated a potential to extract and present information to curators and users of biological databases, as long as users are aware of potential inaccuracies and the possibility of incomplete information extraction
Structural Variability of Pfam Domains Based on Alphafold2 Predictions
Understanding the biological functions of proteins is one of the main goals of functional genomics. Such understanding will help control and manipulate biological processes to enhance desirable traits, including improved abiotic and biotic stress resistance in humans, animals, plants, and microbes. Protein domains, regarded as the functional building blocks of proteins, have been used extensively to predict protein function. Sequence-based approaches for protein function prediction, including the use of protein domain prediction from resources like the Pfam database, remain popular due to their reliability, low cost, and ease of use. Although the sequence variability of Pfam domains has been reported in several studies, their structural variability has been understudied. Here, we have extracted the Pfam domain structural portion from the predicted structures of the 16 model organism proteomes in the AlphaFold2 database. Our analysis revealed that many families contained between 20% and 40% members with no assigned regular secondary structures, demonstrating within-family structural variability. To better understand this structural variability, we used FoldSeek and agglomerative clustering to identify structural variability in Pfam families. We then analyzed specific cases to provide structural details for this variability. In this study, we have used two popular prediction applications/resources, AlphaFold2 and Pfam, to demonstrate inherent variability in protein domain predictions by comparing their predicted structures. Our study shows that detection of structural variability in Pfam families can facilitate curation and refinement of Pfam families, while demonstrating the need to develop more accurate protein domain prediction workflows
Comparative analyses of responses to exogenous and endogenous antiherbivore elicitors enable a forward genetics approach to identify maize gene candidates mediating sensitivity to herbivore‐associated molecular patterns
Probing layers of maize immunity through integration of genetic, transcriptomic and physiological approaches
To efficiently protect themselves against pests and disease, plants surveil for attacking organisms and upon recognition, activate protective inducible defenses. Here, I present my work on the regulation and function of maize inducible defenses by integrating genetic, transcriptomic and physiological approaches. This work elucidated mechanisms underlying three layers of the maize immune response, including: (1) A novel genetic locus associated with sensitivity to exogenous herbivore-associated elicitors of the fatty-acid amino-acid conjugate (FAC) family, (2) regulatory function of phytocytokines from the Plant elicitor peptide (Pep) family, and (3) biosynthesis of antibiotic specialized metabolite defenses. Early maize signaling events triggered in the context of herbivory were probed through comparative transcriptomic analyses upon treatment with ZmPeps and FACs, indicating a largely shared signaling pathway and identifying specific genes involved in antiherbivore defense. Genetic mapping using the Intermated B73 x Mo17 mapping population derived from B73, an FAC-sensitive line, and Mo17, an FAC-insensitive line, identified a single locus on chromosome 4 associated with FAC sensitivity that was further fine-mapped to a region containing 19 genes. A candidate gene within this region, FAC SENSITIVITY-ASSOCIATED (FACS), was expressed at significantly lower levels in the insensitive parent line, and heterologous expression of FACS increased FAC sensitivity in Nicotiana benthamiana, suggesting a role in regulation of FAC-induced responses. Work characterizing the maize ZmPep family led to several new insights into Pep signaling mechanisms: Maize Pep precursors (PROPEPs) were found to contain multiple nested active peptides, a phenomenon not previously observed for this family.
Additionally, in contrast to Peps in Arabidopsis, individual maize Peps were found to have specific activities defined by the relative magnitude of elicited responses through rheostat-like tuning of phytohormone levels. Peptide structure-function analysis and physiological assays further identified ZmPep5a as a potential antagonist peptide. Finally, we report on an R Shiny web-application developed to facilitate mutual rank-based coexpression analyses integrating user-provided supporting information. The utility of this user-friendly app was demonstrated through application to define two new biosynthetic pathways for maize terpenoid antibiotics
MutRank: an R shiny web-application for exploratory targeted mutual rank-based coexpression analyses integrated with user-provided supporting information
The rapid assignment of genotypes to phenotypes has been a historically challenging process. The discovery of genes encoding biosynthetic pathway enzymes for defined plant specialized metabolites has been informed and accelerated by the detection of gene clusters. Unfortunately, biosynthetic pathway genes are commonly dispersed across chromosomes or reside in gene clusters that provide little predictive value. More reliably, transcript abundance of genes underlying biochemical pathways for plant specialized metabolites displays significant coregulation. By rapidly identifying highly coexpressed transcripts, it is possible to efficiently narrow candidate genes encoding pathway enzymes and more easily predict both functions and functional associations. Mutual Rank (MR)-based coexpression analyses in plants accurately demonstrate functional associations for many specialized metabolic pathways; however, despite the clear predictive value of MR analyses, the application is uncommonly used to drive new pathway discoveries. Moreover, many coexpression databases aid in the prediction of both functional associations and gene functions, but lack customizability for refined hypothesis testing. To facilitate and speed flexible MR-based hypothesis testing, we developed MutRank, an R Shiny web-application for coexpression analyses. MutRank provides an intuitive graphical user interface with multiple customizable features that integrates user-provided data and supporting information suitable for personal computers. Tabular and graphical outputs facilitate the rapid analyses of both unbiased and user-defined coexpression results that accelerate gene function predictions. We highlight the recent utility of MR analyses for functional predictions and discoveries in defining two maize terpenoid antibiotic pathways.
Beyond applications in biosynthetic pathway discovery, MutRank provides a simple, customizable and user-friendly interface to enable coexpression analyses relating to a breadth of plant biology inquiries. Data and code are available at GitHub: https://github.com/eporetsky/MutRank.
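As a rough illustration of the Mutual Rank idea, the sketch below uses the commonly cited definition, in which the MR of two genes is the geometric mean of the rank of each gene in the other's correlation-sorted list. It is written in Python for illustration and is not taken from the MutRank R code; the expression values, gene names, and function names are hypothetical, and details such as the correlation measure or tie handling may differ in the actual tool.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mutual_rank(expr):
    """Return {(gene_a, gene_b): MR} for every ordered pair of distinct genes.

    rank[a][b] is the position of gene b when all other genes are sorted by
    decreasing correlation with gene a (1 = most strongly coexpressed).
    """
    genes = list(expr)
    rank = {}
    for a in genes:
        others = [g for g in genes if g != a]
        others.sort(key=lambda g: pearson(expr[a], expr[g]), reverse=True)
        rank[a] = {g: i + 1 for i, g in enumerate(others)}
    return {
        (a, b): math.sqrt(rank[a][b] * rank[b][a])
        for a in genes
        for b in genes
        if a != b
    }


# Hypothetical expression values across five samples.
expr = {
    "geneA": [1, 2, 3, 4, 5],
    "geneB": [2, 4, 6, 8, 11],
    "geneC": [5, 3, 4, 1, 2],
    "geneD": [1, 3, 2, 5, 4],
}

mr = mutual_rank(expr)
print(mr[("geneA", "geneB")])
```

Lower MR values indicate stronger mutual coexpression. Because both directions of ranking enter the score, a gene that correlates weakly with many hub-like genes is not rewarded the way it would be by a one-directional rank.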
Harnessing the predicted maize pan-interactome for putative gene function prediction and prioritization of candidate genes for important traits
The recent assembly and annotation of the 26 maize nested association mapping (NAM) population founder inbreds have enabled large-scale pan-genomic comparative studies. These studies have expanded our understanding of agronomically important traits by integrating pan-transcriptomic data with trait-specific gene candidates from previous association mapping results. In contrast to the availability of pan-transcriptomic data, obtaining reliable protein-protein interaction (PPI) data has remained a challenge due to its high cost and complexity. We generated predicted PPI networks for each of the 26 genomes using the established STRING database. The individual genome-interactomes were then integrated to generate core- and pan-interactomes. We deployed the PPI clustering algorithm ClusterONE to identify numerous PPI clusters that were functionally annotated using gene ontology (GO) functional enrichment, demonstrating a diverse range of enriched GO terms across different clusters. Additional cluster annotations were generated by integrating gene co-expression data and gene description annotations, providing additional useful information. We show that the functionally annotated PPI clusters establish a useful framework for protein function prediction and prioritization of candidate genes of interest. Our study not only provides a comprehensive resource of predicted PPI networks for 26 maize genomes, but also offers annotated interactome clusters for predicting protein functions and prioritizing gene candidates. The source code for the Python implementation of the analysis workflow and a standalone web application for accessing the analysis results are available at https://github.com/eporetsky/PanPPI
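One simple way to picture the integration of per-genome interactomes is with set operations on interaction edges. The sketch below is not the PanPPI workflow. It assumes, by analogy with core and pan genomes, that a "core" interaction is one found in every genome and a "pan" interaction is one found in at least one genome, and that proteins from different genomes share comparable identifiers. The genome labels, identifiers, and function names are hypothetical.

```python
def normalize(edges):
    """Treat interactions as undirected: (a, b) and (b, a) are the same edge."""
    return {frozenset(edge) for edge in edges}


def core_and_pan(interactomes):
    """Return (core, pan) edge sets from a non-empty {genome: edge list} mapping."""
    edge_sets = [normalize(edges) for edges in interactomes.values()]
    core = set.intersection(*edge_sets)  # edges present in every genome
    pan = set.union(*edge_sets)          # edges present in at least one genome
    return core, pan


# Hypothetical per-genome predicted interactions keyed by shared identifiers.
interactomes = {
    "genome_1": [("pg1", "pg2"), ("pg2", "pg3")],
    "genome_2": [("pg2", "pg1"), ("pg3", "pg4")],
    "genome_3": [("pg1", "pg2"), ("pg2", "pg3")],
}

core, pan = core_and_pan(interactomes)
print(len(core), "core edges;", len(pan), "pan edges")
```

In this toy example only the pg1-pg2 interaction is shared by all three genomes, while the pan set contains three distinct interactions.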
Supplemental Material for Poretsky et al., 2024
Maize, an essential crop with significant agricultural importance, has been the subject of extensive research, resulting in a wealth of genomic and phenotypic data. The recent release of the genome assemblies and annotations for the 26 maize NAM inbred lines has enabled large-scale pan-genomic comparative studies. Our study provides a comprehensive predicted pan-interactome resource and offers a means to predict putative protein functions and prioritize gene candidates through the analysis of annotated interactome clusters.
