24 research outputs found

    Large-scale automated protein function prediction

    Get PDF
    Includes bibliographical references.2016 Summer.Proteins are the workhorses of life, and identifying their functions is a very important biological problem. The function of a protein can be loosely defined as everything it performs or happens to it. The Gene Ontology (GO) is a structured vocabulary which captures protein function in a hierarchical manner and contains thousands of terms. Through various wet-lab experiments over the years scientists have been able to annotate a large number of proteins with GO categories which reflect their functionality. However, experimentally determining protein functions is a highly resource-intensive task, and a large fraction of proteins remain un-annotated. Recently a plethora automated methods have emerged and their reasonable success in computationally determining the functions of proteins using a variety of data sources – by sequence/structure similarity or using various biological network data, has led to establishing automated function prediction (AFP) as an important problem in bioinformatics. In a typical machine learning problem, cross-validation is the protocol of choice for evaluating the accuracy of a classifier. But, due to the process of accumulation of annotations over time, we identify the AFP as a combination of two sub-tasks: making predictions on annotated proteins and making predictions on previously unannotated proteins. In our first project, we analyze the performance of several protein function prediction methods in these two scenarios. Our results show that GOstruct, an AFP method that our lab has previously developed, and two other popular methods: binary SVMs and guilt by association, find it hard to achieve the same level of accuracy on these two tasks compared to the performance evaluated through cross-validation, and that predicting novel annotations for previously annotated proteins is a harder problem than predicting annotations for uncharacterized proteins. We develop GOstruct 2.0 by proposing improvements which allows the model to make use of information of a protein's current annotations to better handle the task of predicting novel annotations for previously annotated proteins. Experimental results on yeast and human data show that GOstruct 2.0 outperforms the original GOstruct, demonstrating the effectiveness of the proposed improvements. Although the biomedical literature is a very informative resource for identifying protein function, most AFP methods do not take advantage of the large amount of information contained in it. In our second project, we conduct the first ever comprehensive evaluation on the effectiveness of literature data for AFP. Specifically, we extract co-mentions of protein-GO term pairs and bag-of-words features from the literature and explore their effectiveness in predicting protein function. Our results show that literature features are very informative of protein function but with further room for improvement. In order to improve the quality of automatically extracted co-mentions, we formulate the classification of co-mentions as a supervised learning problem and propose a novel method based on graph kernels. Experimental results indicate the feasibility of using this co-mention classifier as a complementary method that aids the bio-curators who are responsible for maintaining databases such as Gene Ontology. This is the first study of the problem of protein-function relation extraction from biomedical text. The recently developed human phenotype ontology (HPO), which is very similar to GO, is a standardized vocabulary for describing the phenotype abnormalities associated with human diseases. At present, only a small fraction of human protein coding genes have HPO annotations. But, researchers believe that a large portion of currently unannotated genes are related to disease phenotypes. Therefore, it is important to predict gene-HPO term associations using accurate computational methods. In our third project, we introduce PHENOstruct, a computational method that directly predicts the set of HPO terms for a given gene. We compare PHENOstruct with several baseline methods and show that it outperforms them in every respect. Furthermore, we highlight a collection of informative data sources suitable for the problem of predicting gene-HPO associations, including large scale literature mining data

    Friend Network as Gatekeeper: A Study of WeChat Users' Consumption of Friend-Curated Contents

    Full text link
    Social media enables users to publish, disseminate, and access information easily. The downside is that it has fewer gatekeepers of what content is allowed to enter public circulation than the traditional media. In this paper, we present preliminary empirical findings from WeChat, a popular messaging app of the Chinese, indicating that social media users leverage their friend networks collectively as latent, dynamic gatekeepers for content consumption. Taking a mixed-methods approach, we analyze over seven million users' information consumption behaviors on WeChat and conduct an online survey of 216216 users. Both quantitative and qualitative evidence suggests that friend network indeed acts as a gatekeeper in social media. Shifting from what should be produced that gatekeepers used to decide, friend network helps separate the worthy from the unworthy for individual information consumption, and its structure and dynamics that play an important role in gatekeeping may inspire the future design of socio-technical systems

    Perspective Chapter: Natural Adjuvants for Mucosal Vaccines—The Promise of Tomatine as an Inherent Adjuvant in Tomatoes

    Get PDF
    Certain natural immunostimulatory compounds inherent to plants, animals, and microorganisms, in both terrestrial and aquatic ecosystems, have been reported to enhance the immunogenicity of vaccines by conferring an adjuvant effect and/or possessing potent immunomodulatory properties acting as immunogens themselves. In this chapter, we summarize the current state of vaccine adjuvant development and application, encompassing a range of immunomodulatory compounds that improve protective immune responses and enhance vaccine efficacy. We place special focus on the effectiveness of tomatine, inherent to tomatoes, as a natural immunostimulant. We discuss the adjuvant- and immunomodulatory-properties of tomatine and its advantages in plant-based vaccine production, cost-effectiveness, development, safety profiles, and applications compared to other adjuvants and vaccine delivery systems. This chapter provides a futuristic overview and insights into the promise of tomatine for the development of safer, easily-scalable, sustainable, and more efficient vaccines

    Perspective Chapter: The Most Natural Possible Vaccine Administered in the Most Natural Possible Way - Noninvasive over Injectable Vaccine Delivery Routes

    Get PDF
    This chapter provides perspective on the routes of vaccine administration, comparing invasive and noninvasive delivery methods. We begin with an analysis of the most frequently used routes of administration: invasive, such as traditional needle-based injections (intramuscular and subcutaneous) and noninvasive, including oral and intranasal routes. We discuss recent advancements, for example, aerosols and jet injectors, as well as other novel administration methods for immunization such as improved mucosally-administered vaccines. Finally, we provide an update on how different delivery methods can impact consumer (vaccine recipients) compliance rates and vaccine availability (e.g., cold chain logistics in areas of the world with infrastructure limitations) from the perspectives of both the vaccine provider and the vaccine recipient

    An Expanded Evaluation of Protein Function Prediction Methods Shows an Improvement In Accuracy

    Get PDF
    Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent

    An expanded evaluation of protein function prediction methods shows an improvement in accuracy

    Get PDF
    Background: A major bottleneck in our understanding of the molecular underpinnings of life is the assignment of function to proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, assessing methods for protein function prediction and tracking progress in the field remain challenging. Results: We conducted the second critical assessment of functional annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. We evaluated 126 methods from 56 research groups for their ability to predict biological functions using Gene Ontology and gene-disease associations using Human Phenotype Ontology on a set of 3681 proteins from 18 species. CAFA2 featured expanded analysis compared with CAFA1, with regards to data set size, variety, and assessment metrics. To review progress in the field, the analysis compared the best methods from CAFA1 to those of CAFA2. Conclusions: The top-performing methods in CAFA2 outperformed those from CAFA1. This increased accuracy can be attributed to a combination of the growing number of experimental annotations and improved methods for function prediction. The assessment also revealed that the definition of top-performing algorithms is ontology specific, that different performance metrics can be used to probe the nature of accurate predictions, and the relative diversity of predictions in the biological process and human phenotype ontologies. While there was methodological improvement between CAFA1 and CAFA2, the interpretation of results and usefulness of individual methods remain context-dependent. Keywords: Protein function prediction, Disease gene prioritizationpublishedVersio

    The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

    Get PDF
    Background The Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function. Results Here, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aureginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory. Conclusion We conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.Peer reviewe

    The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens

    Get PDF
    BackgroundThe Critical Assessment of Functional Annotation (CAFA) is an ongoing, global, community-driven effort to evaluate and improve the computational annotation of protein function.ResultsHere, we report on the results of the third CAFA challenge, CAFA3, that featured an expanded analysis over the previous CAFA rounds, both in terms of volume of data analyzed and the types of analysis performed. In a novel and major new development, computational predictions and assessment goals drove some of the experimental assays, resulting in new functional annotations for more than 1000 genes. Specifically, we performed experimental whole-genome mutation screening in Candida albicans and Pseudomonas aureginosa genomes, which provided us with genome-wide experimental data for genes associated with biofilm formation and motility. We further performed targeted assays on selected genes in Drosophila melanogaster, which we suspected of being involved in long-term memory.ConclusionWe conclude that while predictions of the molecular function and biological process annotations have slightly improved over time, those of the cellular component have not. Term-centric prediction of experimental annotations remains equally challenging; although the performance of the top methods is significantly better than the expectations set by baseline methods in C. albicans and D. melanogaster, it leaves considerable room and need for improvement. Finally, we report that the CAFA community now involves a broad range of participants with expertise in bioinformatics, biological experimentation, biocuration, and bio-ontologies, working together to improve functional annotation, computational function prediction, and our ability to manage big data in the era of large experimental screens.</p

    BioSGAN: Protein-phenotype Co-mention classification using semi-supervised generative adversarial networks

    No full text
    Valuable and relevant information that relates human proteins with their phenotypes in biomedical literature stays hidden from biomedical scientists due to the rapid rise in biomedical publications. Previous studies that developed computational methods to extract this knowledge mostly rely on rule-based linguistic patterns and supervised machine learning approaches. In this work, we propose the use of generative adversarial networks to develop a novel method called BioSGAN for the protein-phenotype co-mention classification task. We demonstrate the potential associated with combining a small labeled dataset with vast unlabelled biomedical text data extracted from Medline abstracts and PubMed Central open Access full-text in a semi-supervised machine learning framework. Our method achieves state-of-the-art performance for classifying the validity of a given sentence-level co-mention of a human protein and phenotype by convincingly outperforming a traditional machine learning-based counterpart. These findings have implications for biocurators, researchers, and the text mining community involved with biomedical relation extraction
    corecore