
    Overview of BioCreative II gene normalization

    Background: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%. Results: Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. Conclusion: Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases.
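
The abstract above reports balanced F-measures and mentions "simple voting schemes" without giving details. The following is a minimal sketch of both ideas, not the evaluation code used in BioCreative II. The function names `f_measure` and `vote`, the `min_votes` threshold, and the toy identifiers are assumptions introduced for illustration; only the figures 763, 785, and 0.23 come from the abstract.

```python
from collections import Counter


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def vote(system_outputs, min_votes):
    """Keep identifiers proposed by at least `min_votes` systems.

    Hypothetical voting rule; the abstract does not specify the scheme used.
    """
    counts = Counter(
        gene_id for output in system_outputs for gene_id in set(output)
    )
    return {gene_id for gene_id, n in counts.items() if n >= min_votes}


# Figures quoted in the abstract for the pooled 'maximum recall' system.
pooled_recall = 763 / 785
pooled_f = f_measure(0.23, pooled_recall)

# Toy inputs with made-up identifiers, not data from the evaluation.
toy_outputs = [{"1017", "7157"}, {"7157", "672"}, {"7157", "1017"}]
consensus = vote(toy_outputs, min_votes=2)
```

Under these assumptions, `pooled_recall` rounds to the reported 0.97, and the low precision pulls the balanced F-measure of the pooled system well below the 0.80 to 0.81 achieved by the best individual systems, which is why voting over the pooled outputs is needed to trade some recall back for precision.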

    The COMBREX Project: Design, Methodology, and Initial Results

    © 2013 Brian P. et al. Prior to the “genomic era,” when the acquisition of DNA sequence involved significant labor and expense, the sequencing of genes was strongly linked to the experimental characterization of their products. Sequencing at that time directly resulted from the need to understand an experimentally determined phenotype or biochemical activity. Now that DNA sequencing has become orders of magnitude faster and less expensive, focus has shifted to sequencing entire genomes. Since biochemistry and genetics have not, by and large, enjoyed the same improvement of scale, public sequence repositories now predominantly contain putative protein sequences for which there is no direct experimental evidence of function. Computational approaches attempt to leverage evidence associated with the ever-smaller fraction of experimentally analyzed proteins to predict function for these putative proteins. Maximizing our understanding of function over the universe of proteins in toto requires not only robust computational methods of inference but also a judicious allocation of experimental resources, focusing on proteins whose experimental characterization will maximize the number and accuracy of follow-on predictions. COMBREX is funded by a GO grant from the National Institute of General Medical Sciences (NIGMS) (1RC2GM092602-01). Peer Reviewed.

    The COMBREX project: design, methodology, and initial results.

    Experimental data exist for only a vanishingly small fraction of sequenced microbial genes. This community page discusses the progress made by the COMBREX project to address this important issue using both computational and experimental resources.

    Definitions of COMBREX functional status symbols and fractions of microbial genes in COMBREX in each status category.

    Experimentally characterized proteins are green. (Those in the green set that have been manually curated by the GSDB are also marked with a gold “G.”) Proteins with functional predictions but no experimental evidence are blue. Proteins with no available functional predictions are black.

    Schematic overview of the computational and experimental contributions of COMBREX and its users, and the interrelationships of these contributions.

    Data and results specific to COMBREX are shown in boxes. External data imported into COMBREX are also shown, with arrows indicating entry points into the cycle. Methodology employed by COMBREX and its users is shown in blue type, as it is used to generate data. Not shown are two critical contributions to COMBREX: genome and cluster data imported from NCBI RefSeq and ProtClustDB, respectively, and NIH funding, which enables the grants that COMBREX issues to experimental laboratories.