    Automatic categorization of diverse experimental information in the bioscience literature

    Background: Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify, from all published literature, the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest, and thus is usually time-consuming. We developed an automatic method, based on the machine learning method Support Vector Machine (SVM), for identifying papers containing these curation data types among a large pool of published scientific papers. This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in production use for automatic categorization of ten different experimental data types in the biocuration process at WormBase for the past two years, and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community, thereby greatly reducing the time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature, such as C. elegans and D. melanogaster, to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance. Results: We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genome Informatics (MGI). It is being used in the curation workflow at WormBase for automatic association of newly published papers with ten data types: RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction. Conclusions: Our methods are applicable to a variety of data types with training sets containing several hundred to a few thousand documents. The method is completely automatic and thus can be readily incorporated into different workflows at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort.
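
The classification idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain bag-of-words features and trains a linear SVM (without a bias term) by Pegasos-style subgradient descent, using only the Python standard library. The training sentences, the function names and the parameter values are all hypothetical. Positive examples are pooled from two literatures to mimic the cross-corpus procedure the abstract mentions.

```python
import random
import re
from collections import Counter


def featurize(text):
    """Turn text into an L2-normalized bag-of-words vector (dict: token -> weight)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = sum(c * c for c in counts.values()) ** 0.5 or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def score(weights, features):
    """Linear decision score; positive means 'paper contains the data type'."""
    return sum(weights.get(tok, 0.0) * v for tok, v in features.items())


def train_linear_svm(labeled_texts, lam=0.01, epochs=200, seed=0):
    """Fit a linear SVM (no bias term) by Pegasos-style subgradient descent.

    labeled_texts: list of (text, label) pairs with label +1 or -1.
    lam: L2 regularization strength (illustrative value, not tuned).
    """
    examples = [(featurize(text), label) for text, label in labeled_texts]
    rng = random.Random(seed)
    weights = {}
    step = 0
    for _ in range(epochs):
        rng.shuffle(examples)
        for features, label in examples:
            step += 1
            eta = 1.0 / (lam * step)
            violated = label * score(weights, features) < 1.0
            # Shrink all weights (gradient of the L2 penalty).
            for tok in weights:
                weights[tok] *= 1.0 - eta * lam
            # Hinge-loss subgradient step, only for margin violations.
            if violated:
                for tok, v in features.items():
                    weights[tok] = weights.get(tok, 0.0) + eta * label * v
    return weights


# Hypothetical toy training sentences. Positives (+1) stand for papers with
# RNAi data and are pooled from worm and fly literature.
TRAINING = [
    ("RNAi knockdown of the gene caused embryonic lethality", +1),               # worm
    ("we performed RNAi feeding experiments and scored knockdown animals", +1),  # worm
    ("dsRNA injection produced RNAi knockdown in fly embryos", +1),              # fly
    ("an RNAi screen by feeding identified knockdown phenotypes", +1),           # worm
    ("the crystal structure of the enzyme was solved at high resolution", -1),
    ("phylogenetic analysis of ribosomal sequences across nematode species", -1),
    ("we review the history of the field and its major discoveries", -1),
    ("a mathematical model of population growth in the laboratory", -1),
]

model = train_linear_svm(TRAINING)
print(score(model, featurize("RNAi knockdown by feeding")) > 0)
```

A production system would train one such binary classifier per data type on full-text papers, with hundreds to thousands of training documents as the abstract notes, and would more likely rely on an established SVM library than on a hand-written trainer.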

    WormBase: A modern Model Organism Information Resource

    WormBase (https://wormbase.org/) is a mature Model Organism Information Resource supporting researchers using the nematode Caenorhabditis elegans as a model system for studies across a broad range of basic biological processes. Toward this mission, WormBase efforts are arranged in three primary facets: curation, user interface and architecture. In this update, we describe progress in each of these three areas. In particular, we discuss the status of literature curation and recently added data, detail new features of the web interface and options for users wishing to conduct data mining workflows, and discuss our efforts to build a robust and scalable architecture by leveraging commercial cloud offerings. We conclude with a description of WormBase's role as a founding member of the nascent Alliance of Genome Resources.

    2018 Update on Protein-Protein Interaction Data in WormBase

    Protein interaction is an important data type for understanding the biological function of the proteins involved, and it helps researchers deduce the biological nature of unknown proteins from the well-characterized functions of their interaction partners. High-throughput studies, coupled with the aggregation of individual experiments, provide a global 'snapshot' of the protein interactions occurring at all levels of biological processes or circumstances. This snapshot of the interaction network, the interactome, is important for understanding the overall events up to the level of comparison between species or pathway simulation, for finding new factors not yet defined in the processes, and for adding details to biological processes and pathways. As of September 2018, WormBase (www.wormbase.org) (Lee et al. 2018) contains 28,279 physical protein-protein interactions for the roundworm Caenorhabditis elegans. Among these, 1,500 protein-protein interactions have been curated by BioGRID in collaboration with WormBase. Within the data set, 17,990 protein-protein interactions are unique, and 6,079 unique genes are involved in these interactions. To visualize the overall interaction map, a network diagram of all the unique interactions was generated using the ‘Cytoscape’ program, version 3.6.1 (Shannon et al. 2003) (Figure 1A). These numbers represent a 108% increase in the number of interaction annotations since 2017. These interaction data were curated from 1,251 peer-reviewed papers, which were selected from the literature by ‘Textpresso Central’ using automatic SVM (Support Vector Machine)-based text mining approaches (Fang et al. 2012; Müller et al. 2018) and manual verification. Compared to other databases providing C. elegans protein-protein interactions, WormBase now presents the largest data set, with 1.72-fold more interaction annotations than IMEx (Orchard et al. 2012) and 4.51-fold more than BioGRID (Chatr-Aryamontri et al. 2017) (Figure 1B). Most significantly, WormBase now houses the complete protein interaction data from almost all of the C. elegans literature published from 1993 to 2018. The data sets presented at IMEx and BioGRID are annotated from 253 and 174 papers, respectively. All the physical interaction data in WormBase are supported by experimental evidence from original research papers. The statistics of the detection methods used as experimental evidence are shown in Figure 1C. The majority of the interaction data came from high-throughput analyses such as large-scale yeast two-hybrid assays or mass spectrometry; however, a significant portion of the data (13.1%) is supported by more direct, small-scale, low-throughput detection methods such as co-immunoprecipitation or co-crystallography (Figure 1C). In WormBase, protein-protein interaction data can be found as a subclass of physical interaction data in the ‘Interactions widget’ on the gene report page. The Interactions widget provides all types of interaction data related to the gene of interest, such as physical, genetic, regulatory, and predicted interactions. All the interaction data are represented together in a graph created with ‘Cytoscape.js’ and in a table. In the table, the gene names of the interaction partners (bait-target) are displayed along with the publication. The interaction details, including the detection method, are also captured in the summary and the remark field on the Interactions page. Users can query the data by using the search bar on the WormBase front page or download all the available data files from the WormBase FTP site (ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/species/c_elegans/PRJNA13758/annotation/c_elegans.PRJNA13758.WSXXX.interactions.txt.gz, where WSXXX is the database release version, such as “WS267”). All the interaction data in WormBase will soon be available at the new information resource for multiple model organisms, the Alliance of Genome Resources (https://www.alliancegenome.org/). This site will integrate all the interaction data from human and from the model organisms C. elegans, budding yeast (Saccharomyces cerevisiae), fruit fly (Drosophila melanogaster), zebrafish (Danio rerio), mouse (Mus musculus) and rat (Rattus norvegicus). Integrated views of interaction data from diverse model organisms will be extremely helpful for building interaction databases for species-to-species comparison and for quickly establishing disease models based on those databases. For the most efficient analysis of the interaction data in WormBase, we are now developing a new ‘Venn diagram tool’ and integrating the ‘Gene Set Enrichment Analysis tool’ (https://wormbase.org/tools/enrichment/tea/tea.cgi) into the Interactions widget. We will continue to curate other types of macromolecular interactions, including protein-DNA, protein-RNA and RNA-RNA interactions, as well as newly reported protein-protein interaction data, to serve our research community.
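
The download path and the "unique interactions" and "unique genes" counts described in this abstract can be sketched as follows. This is an illustration only. The template follows the FTP path quoted in the abstract, written without internal whitespace, and whether that path is still served is not checked here. The release-name pattern is an assumption based on the "WS267" example. The sketch also assumes that an interaction is an unordered gene pair, which the abstract does not state. The function names and gene records are hypothetical.

```python
import re

# Path pattern quoted from the abstract; {release} stands in for "WSXXX".
URL_TEMPLATE = (
    "ftp://ftp.wormbase.org/pub/wormbase/releases/current-production-release/"
    "species/c_elegans/PRJNA13758/annotation/"
    "c_elegans.PRJNA13758.{release}.interactions.txt.gz"
)


def interactions_url(release):
    """Fill a release name such as 'WS267' into the download path."""
    # Assumption: release names are 'WS' followed by digits, as in "WS267".
    if not re.fullmatch(r"WS\d+", release):
        raise ValueError(f"unexpected release name: {release!r}")
    return URL_TEMPLATE.format(release=release)


def summarize_pairs(pairs):
    """Count unique interactions and unique genes in (gene_a, gene_b) records.

    Assumption: direction is ignored, so (A, B) and (B, A) count once.
    """
    unique_pairs = {frozenset(pair) for pair in pairs}
    unique_genes = {gene for pair in pairs for gene in pair}
    return len(unique_pairs), len(unique_genes)


# Hypothetical records, not taken from WormBase.
records = [
    ("geneA", "geneB"),
    ("geneB", "geneA"),
    ("geneA", "geneC"),
    ("geneA", "geneB"),
]
print(interactions_url("WS267"))
print(summarize_pairs(records))
```

If bait and target roles need to be kept distinct, the `frozenset` would be replaced by the ordered tuple, and the unique-interaction count could then be higher.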

    Multi-dimensional classification of biomedical text: Toward automated, practical provision of high-utility text to diverse users

    Motivation: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no ‘average biologist’ client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD), database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective if they can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks.

    Text mining meets community curation: a newly designed curation platform to improve author experience and participation at WormBase

    Biological knowledgebases rely on expert biocuration of the research literature to maintain up-to-date collections of data organized in machine-readable form. To enter information into knowledgebases, curators need to follow three steps: (i) identify papers containing relevant data, a process called triaging; (ii) recognize named entities; and (iii) extract and curate data in accordance with the underlying data models. WormBase (WB), the authoritative repository for research data on Caenorhabditis elegans and other nematodes, uses text mining (TM) to semi-automate its curation pipeline. In addition, WB engages its community, via an Author First Pass (AFP) system, to help recognize entities and classify data types in their recently published papers. In this paper, we present a new WB AFP system that combines TM and AFP into a single application to enhance community curation. The system employs string-searching algorithms and statistical methods (e.g. support vector machines (SVMs)) to extract biological entities and classify data types, and it presents the results to authors in a web form where they validate the extracted information, rather than enter it de novo as the previous form required. With this new system, we lessen the burden on authors while at the same time receiving valuable feedback on the performance of our TM tools. The new user interface also links out to specific structured data submission forms, e.g. for phenotype or expression pattern data, giving authors the opportunity to contribute more detailed curation that can be incorporated into WB with minimal curator review. Our approach is generalizable and could be applied to additional knowledgebases that would like to engage their user community in assisting with curation. In the five months following the launch of the new system, the response rate has been comparable with that of the previous AFP version, but the quality and quantity of the data received have greatly improved.
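
The extract-then-validate flow described in this abstract can be sketched roughly as below. This is not the WormBase AFP code. It covers only the string-searching step and the author review step, and it leaves out the SVM-based data type classification. It assumes a small hand-written lexicon and exact, case-sensitive matching. The lexicon contents, the function names and the example sentence (including the made-up token "daf-160") are hypothetical.

```python
import re

# Hypothetical lexicon; a real system would load curated entity lists.
LEXICON = {
    "gene": ["daf-16", "unc-22", "lin-12"],
    "strain": ["N2", "CB1370"],
}


def extract_entities(text, lexicon=LEXICON):
    """Return {category: [names found in text]} using exact string search."""
    found = {}
    for category, names in lexicon.items():
        hits = []
        for name in names:
            # Reject matches embedded in longer tokens ("daf-16" inside "daf-160").
            pattern = r"(?<![\w-])" + re.escape(name) + r"(?![\w-])"
            if re.search(pattern, text):
                hits.append(name)
        found[category] = hits
    return found


def apply_author_review(proposed, rejected=None, added=None):
    """Merge author feedback into the pre-filled form contents.

    rejected / added: {category: [names]} supplied by the author.
    """
    rejected = rejected or {}
    added = added or {}
    validated = {}
    for category in sorted(set(proposed) | set(added)):
        kept = [
            name for name in proposed.get(category, [])
            if name not in rejected.get(category, [])
        ]
        extra = [name for name in added.get(category, []) if name not in kept]
        validated[category] = kept + extra
    return validated


text = "We crossed daf-16(mu86) into N2 animals and compared them with daf-160 controls."
proposed = extract_entities(text)
validated = apply_author_review(
    proposed, rejected={"strain": ["N2"]}, added={"gene": ["unc-22"]}
)
print(proposed)
print(validated)
```

Comparing `proposed` with `validated` is also what would give the feedback on text-mining performance that the abstract mentions: rejected items are false positives and author-added items are misses.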