
    Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation

    Background: Manual curation of experimental data from the biomedical literature is an expensive and time-consuming endeavor. Nevertheless, most biological knowledge bases still rely heavily on manual curation for data extraction and entry. Text mining software that can semi- or fully automate information retrieval from the literature would thus provide a significant boost to manual curation efforts. Results: We employ the Textpresso category-based information retrieval and extraction system (http://www.textpresso.org), developed by WormBase, to explore how Textpresso might improve the efficiency with which we manually curate C. elegans proteins to the Gene Ontology's Cellular Component Ontology. Using a training set of sentences that describe results of localization experiments in the published literature, we generated three new curation task-specific categories (Cellular Components, Assay Terms, and Verbs) containing words and phrases associated with reports of experimentally determined subcellular localization. We compared the results of manual curation to those of Textpresso queries that searched the full text of articles for sentences containing terms from each of the three new categories plus the name of a previously uncurated C. elegans protein, and found that Textpresso searches identified curatable papers with recall and precision rates of 79.1% and 61.8%, respectively (F-score of 69.5%), when compared to manual curation. Within those documents, Textpresso identified relevant sentences with recall and precision rates of 30.3% and 80.1% (F-score of 44.0%). From returned sentences, curators were able to make 66.2% of all possible experimentally supported GO Cellular Component annotations with 97.3% precision (F-score of 78.8%). Measuring the relative efficiencies of Textpresso-based versus manual curation, we find that Textpresso has the potential to increase curation efficiency by at least 8-fold, and perhaps as much as 15-fold, given differences in individual curatorial speed. Conclusion: Textpresso is an effective tool for improving the efficiency of manual, experimentally based curation. Incorporating a Textpresso-based Cellular Component curation pipeline at WormBase has allowed us to transition from strictly manual curation of this data type to a more efficient pipeline of computer-assisted validation. Continued development of curation task-specific Textpresso categories will provide an invaluable resource for genomics databases that rely heavily on manual curation.
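
    The F-scores quoted above are the harmonic mean of the corresponding precision and recall values; the short Python check below recomputes them from the (precision, recall) pairs reported in the abstract (small differences are due to rounding).

        def f_score(precision: float, recall: float) -> float:
            """F1 = 2PR / (P + R), with precision and recall as fractions."""
            return 2 * precision * recall / (precision + recall)

        # (precision, recall) pairs quoted in the abstract
        stages = {
            "paper retrieval":    (0.618, 0.791),  # quoted F-score: 69.5%
            "sentence retrieval": (0.801, 0.303),  # quoted F-score: 44.0%
            "GO annotation":      (0.973, 0.662),  # quoted F-score: 78.8%
        }
        for stage, (p, r) in stages.items():
            print(f"{stage}: F1 = {f_score(p, r):.1%}")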

    WormBase - Nematode Biology and Genomes

    WormBase is the major public online database resource for the _Caenorhabditis_ research community. The database was developed primarily for the nematode _C. elegans_ but has expanded to host genomes and biological data from other closely related nematode species, including _C. briggsae_, _C. remanei_, _C. brenneri_, _C. japonica_ and _Pristionchus pacificus_. WormBase has developed tools to mine the data held within the database and to compare the hosted species. Over the years we have developed a variety of curation pipelines, which often begin with a "first-pass" literature curation step: each paper receives a brief overview before being directed to specialised data curators who extract all relevant information. Curators focus on particular data types or experimental techniques, such as gene structure changes (see the Sequence curation poster), variations, phenotypes or RNAi, and their expertise in these fields makes curation efficient. WormBase works with many other groups and consortia to validate, process and integrate both large- and small-scale data resources. WormBase also provides data of interest to the wider biomedical and bioinformatics communities, allowing researchers to utilise the information and techniques offered by nematodes to study broader topics, including medicine and disease.

    Wormbase Curation Interfaces and Tools

    Curating biological information from the published literature can be time- and labor-intensive, especially without automated tools. WormBase [1] has adopted several curation interfaces and tools, most of which were built in-house, to help curators recognize and extract data from the literature more efficiently. These tools range from simple computer interfaces for data entry to scripts built on complex text extraction algorithms, which automatically identify specific objects in a paper and present them to the curator for curation. Because the tools are built in-house, we are also able to tailor each one to the individual needs and preferences of its curator. For example, Gene Ontology Cellular Component and gene-gene interaction curators employ the text mining software Textpresso [2] to identify, retrieve, and extract relevant sentences from the full text of an article. The curators then use a web-based curation form to enter the data into our local database. For transgene and antibody curation, curators use the publicly available Phenote ontology annotation curation interface (developed by the Berkeley Bioinformatics Open-Source Projects (BBOP)), which we have adapted with datatype-specific configurations. This tool has served as the basis for our own Ontology Annotator tool, which is now used by our phenotype and gene ontology curators. For RNAi curation, we created web-based submission forms that allow the curator to efficiently capture all relevant information. In all cases, the data undergo a final scripted data dump step to ensure that all of the information conforms to a file format readable by our object-oriented database.
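
    As a rough illustration of the category-based sentence retrieval described above, the sketch below keeps only sentences that mention a target protein together with a term from every curation category. The category word lists and helper function are illustrative assumptions, not Textpresso's actual categories or implementation.

        import re

        # Toy category word lists; real curation categories are far larger.
        CATEGORIES = {
            "cellular component": {"nucleus", "mitochondria", "cytoplasm", "membrane"},
            "assay term": {"gfp", "antibody", "immunostaining"},
            "verb": {"localizes", "expressed", "detected"},
        }

        def candidate_sentences(full_text: str, protein: str) -> list[str]:
            """Return sentences mentioning the protein plus a term from every category."""
            hits = []
            for sentence in re.split(r"(?<=[.!?])\s+", full_text):
                tokens = set(re.findall(r"[\w-]+", sentence.lower()))
                if protein.lower() in tokens and all(tokens & c for c in CATEGORIES.values()):
                    hits.append(sentence)
            return hits

        print(candidate_sentences(
            "UNC-1 was detected in the nucleus by GFP fusion. UNC-1 regulates locomotion.",
            "UNC-1"))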

    International data curation education action (IDEA) working group: a report from the second workshop of IDEA

    The second workshop of the International Data Curation Education Action (IDEA) Working Group was held December 5, 2008, in Edinburgh, Scotland, following the 4th International Digital Curation Conference. The workshop was jointly organized by the UK's Digital Curation Centre (DCC), the US's Institute of Museum and Library Services (IMLS), and the School of Information and Library Science at the University of North Carolina at Chapel Hill (SILS). Nearly forty educators and researchers accepted invitations to attend, with representation from universities, research centers, and funding agencies in Canada, the US, the UK, and Germany.

    Present and future of proteomics data curation at the PRIDE database

    Significant progress has been made in improving the accessibility and utility of the large amounts of generated high-throughput proteomics data through the introduction of publicly available proteomics repositories. One such repository is PRIDE (the ‘PRoteomics IDEntifications’ database, http://www.ebi.ac.uk/pride). PRIDE stores mass spectrometry-related data, including peptide and protein identifications, mass spectra and valuable additional metadata.

At present, data curation in PRIDE is limited to data submission support. All submissions must be made in the PRIDE XML format. Mass spectrometry-derived data are very heterogeneous in terms of experimental approaches, instrumentation, data formats, etc., which is why converting all of this data to PRIDE XML is far from trivial and can be very time consuming, since tailored submission pipelines must often be constructed. However, the situation has improved now that tools such as PRIDE Converter (http://code.google.com/p/pride-converter) are available for submitters to convert their data to PRIDE XML.
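
A minimal sketch of what such a conversion step looks like in practice: a few identification records are serialized to a PRIDE-XML-like document. The element and field names below are simplified placeholders for illustration, not the actual PRIDE XML schema.

    import xml.etree.ElementTree as ET

    # Serialize identification records to a PRIDE-XML-like document.
    # Element and field names are illustrative, not the real PRIDE XML schema.
    def to_pride_like_xml(title: str, identifications: list[dict]) -> str:
        experiment = ET.Element("Experiment")
        ET.SubElement(experiment, "Title").text = title
        for ident in identifications:
            protein = ET.SubElement(experiment, "ProteinIdentification",
                                    accession=ident["accession"])
            for peptide in ident["peptides"]:
                ET.SubElement(protein, "PeptideSequence").text = peptide
        return ET.tostring(experiment, encoding="unicode")

    print(to_pride_like_xml("Example experiment",
                            [{"accession": "P12345", "peptides": ["SAMPLER", "PEPTIDEK"]}]))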

In the near future, data curation in PRIDE will be significantly extended. High-quality data will be included in a new repository called PRIDE-plus. First, it will be necessary to define a set of minimal-requirement rules for deciding which datasets can be included in PRIDE-plus. The design and implementation of new curation tools to perform data quality assessment will then be essential. It will also be necessary to investigate how these new curation and annotation tasks can be automated.
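
Such minimal-requirement rules could be expressed as simple predicate checks over submission metadata. The sketch below is only a guess at their shape; the field names and the threshold are assumptions, not published PRIDE-plus criteria.

    # Hypothetical minimal-requirement checks for a PRIDE-plus-style submission;
    # the field names and threshold are assumptions, not published criteria.
    REQUIRED_FIELDS = ("species", "instrument", "search_engine", "spectra_count")

    def minimal_requirement_problems(submission: dict) -> list[str]:
        """Return a list of problems; an empty list means the dataset qualifies."""
        problems = [f"missing field: {field}"
                    for field in REQUIRED_FIELDS if not submission.get(field)]
        if submission.get("spectra_count", 0) < 100:  # assumed threshold
            problems.append("too few spectra for meaningful quality assessment")
        return problems

    print(minimal_requirement_problems({"species": "Homo sapiens", "spectra_count": 5000}))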

    Extracting, Transforming and Archiving Scientific Data

    It is becoming common to archive research datasets that are not only large but also numerous. In addition, their corresponding metadata and the software required to analyse or display them need to be archived. Yet the manual curation of research data can be difficult and expensive, particularly in very large digital repositories; hence the importance of models and tools for automating digital curation tasks. The automation of these tasks faces three major challenges: (1) research data and data sources are highly heterogeneous, (2) future research needs are difficult to anticipate, and (3) data is hard to index. To address these problems, we propose the Extract, Transform and Archive (ETA) model for managing and mechanizing the curation of research data. Specifically, we propose a scalable strategy for addressing the research-data problem, ranging from the extraction of legacy data to its long-term storage. We review some existing solutions and propose novel avenues of research.
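
    One way to read the ETA model is as a staged pipeline with pluggable components for each phase. The interface sketch below is an illustrative interpretation of that reading, not the paper's actual specification.

        from abc import ABC, abstractmethod
        from typing import Iterable

        class Extractor(ABC):
            @abstractmethod
            def extract(self, source: str) -> Iterable[dict]:
                """Pull raw records out of a (possibly legacy) data source."""

        class Transformer(ABC):
            @abstractmethod
            def transform(self, record: dict) -> dict:
                """Normalize a record, e.g. attach metadata or convert formats."""

        class Archiver(ABC):
            @abstractmethod
            def archive(self, record: dict) -> None:
                """Commit a normalized record to long-term storage."""

        def run_eta(source: str, extractor: Extractor,
                    transformers: list[Transformer], archiver: Archiver) -> None:
            for record in extractor.extract(source):
                for transformer in transformers:
                    record = transformer.transform(record)
                archiver.archive(record)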

    Data curation standards and social science occupational information resources

    Occupational information resources - data about the characteristics of different occupational positions - are widely used in the social sciences, across a range of disciplines and international contexts. They are available in many formats, most often as small electronic files made freely downloadable from academic web pages. However, there are several challenges associated with how occupational information resources are distributed to, and exploited by, social researchers. In this paper we describe features of occupational information resources and indicate the role digital curation can play in exploiting them. We report on the strategies used in the GEODE research project (Grid Enabled Occupational Data Environment, http://www.geode.stir.ac.uk), which attempts to develop long-term standards for the distribution of occupational information resources by providing a standardized, framework-based electronic depository for such resources, together with a data indexing service, based on e-Science middleware, that collates occupational information resources and makes them readily accessible to non-specialist social scientists.
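
    At its core, the indexing service described above maps occupational classification codes to the resources that describe them. The sketch below illustrates the idea; the resource record and its fields are hypothetical examples, not GEODE's actual data model.

        from collections import defaultdict

        # Hypothetical resource records; fields are illustrative only.
        resources = [
            {"name": "Example prestige scale", "scheme": "SOC2000",
             "codes": ["1111", "1112"], "url": "http://www.geode.stir.ac.uk/"},
        ]

        # Index: (classification scheme, occupation code) -> resource names.
        index = defaultdict(list)
        for resource in resources:
            for code in resource["codes"]:
                index[(resource["scheme"], code)].append(resource["name"])

        print(index[("SOC2000", "1111")])  # -> ['Example prestige scale']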

    Education alignment

    This essay reviews recent developments in embedding data management and curation skills into information technology, library and information science, and research-based postgraduate courses in various national contexts. It also investigates how formal education can be joined up more coherently with professional development training opportunities. The potential for using professional internships as a means of improving communication and understanding between disciplines is also explored. A key aim of this essay is to identify the level of complementarity needed across the various disciplines to support the entire data curation lifecycle most effectively and efficiently.