12 research outputs found

    LINNAEUS: A species name identification system for biomedical literature

    Background: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. Results: In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. Conclusions: LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.
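The dictionary-based approach described in this abstract can be illustrated with a minimal Python sketch. It is not the LINNAEUS implementation: it swaps the deterministic finite-state automaton for a simple token-level trie with longest-match lookup, the dictionary is a toy assumption with a handful of names, and the names `SPECIES`, `build_trie`, and `find_mentions` are hypothetical. It also omits the heuristics for ambiguous mentions.

```python
import re

# Toy dictionary (an assumption for illustration): species name -> NCBI taxonomy ID.
SPECIES = {
    "homo sapiens": 9606,
    "human": 9606,
    "mus musculus": 10090,
    "mouse": 10090,
    "escherichia coli": 562,
}


def build_trie(dictionary):
    """Build a token-level trie; a node holding the key '$id' ends a complete name."""
    root = {}
    for name, tax_id in dictionary.items():
        node = root
        for token in name.split():
            node = node.setdefault(token, {})
        node["$id"] = tax_id
    return root


def find_mentions(text, trie):
    """Return (start, end, surface, tax_id) tuples, taking the longest match at each position."""
    tokens = [(m.group().lower(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    mentions = []
    i = 0
    while i < len(tokens):
        node, best, j = trie, None, i
        # Walk the trie as far as the tokens allow, remembering the last complete name.
        while j < len(tokens) and tokens[j][0] in node:
            node = node[tokens[j][0]]
            j += 1
            if "$id" in node:
                best = (j, node["$id"])
        if best:
            end_idx, tax_id = best
            start, end = tokens[i][1], tokens[end_idx - 1][2]
            mentions.append((start, end, text[start:end], tax_id))
            i = end_idx  # resume after the matched name
        else:
            i += 1
    return mentions


if __name__ == "__main__":
    trie = build_trie(SPECIES)
    for mention in find_mentions("Homo sapiens and mouse genes in Escherichia coli.", trie):
        print(mention)
```

Matching on lowercased word tokens keeps the sketch short; a production dictionary matcher would also need to handle abbreviations such as "E. coli" and names shared by several species.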

    Corpus Refactoring: a Feasibility Study

    © 2007 Johnson et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution Licens

    Facilitating the development of controlled vocabularies for metabolomics technologies with text mining

    BACKGROUND: Many bioinformatics applications rely on controlled vocabularies or ontologies to consistently interpret and seamlessly integrate information scattered across public resources. Experimental data sets from metabolomics studies need to be integrated with one another, but also with data produced by other types of omics studies in the spirit of systems biology, hence the pressing need for vocabularies and ontologies in metabolomics. However, it is time-consuming and nontrivial to construct these resources manually. RESULTS: We describe a methodology for rapid development of controlled vocabularies, a study originally motivated by the need for vocabularies describing metabolomics technologies. We present case studies involving two controlled vocabularies (for nuclear magnetic resonance spectroscopy and gas chromatography) whose development is currently underway as part of the Metabolomics Standards Initiative. The initial vocabularies were compiled manually, providing a total of 243 and 152 terms, respectively. A total of 5,699 and 2,612 new terms, respectively, were acquired automatically from the literature. The analysis of the results showed that full-text articles (especially the Materials and Methods sections), rather than paper abstracts, are the major source of technology-specific terms. CONCLUSIONS: We suggest a text mining method for efficient corpus-based term acquisition as a way of rapidly expanding a set of controlled vocabularies with the terms used in the scientific literature. We adopted an integrative approach, combining relatively generic software and data resources for time- and cost-effective development of a text mining tool for expansion of controlled vocabularies across various domains, as a practical alternative to both manual term collection and tailor-made named entity recognition methods.

    MetNet Online: a novel integrated resource for plant systems biology


    A framework for discovering meaningful associations in the annotated life sciences Web

    During the last decade, life sciences researchers have gained access to the entire human genome, reliable high-throughput biotechnologies, affordable computational resources, and public network access. This has produced vast amounts of data and knowledge captured in the life sciences Web, and has created the need for new tools to analyze this knowledge and make discoveries. Consider a simplified Web of three publicly accessible data resources: Entrez Gene, PubMed and OMIM. Data records in each resource are annotated with terms from multiple controlled vocabularies (CVs). The links between data records in two resources form a relationship between the two resources. Thus, a record in Entrez Gene, annotated with GO terms, can have links to multiple records in PubMed that are annotated with MeSH terms. Similarly, OMIM records annotated with terms from SNOMED CT may have links to records in Entrez Gene and PubMed. This forms a rich web of annotated data records. The objective of this research is to develop the Life Science Link (LSLink) methodology and tools to discover meaningful patterns across resources and CVs. In a first step, we execute a protocol to follow links, extract annotations, and generate datasets of termlinks, which consist of data records and CV terms. We then mine the termlinks of the datasets to find potentially meaningful associations between pairs of terms from two CVs. Biologically meaningful associations of pairs of CV terms may yield innovative nuggets of previously unknown knowledge. Moreover, the bridge of associations across CV terms will reflect the practice of how scientists annotate data across linked data repositories. Contributions include a methodology to create background datasets, metrics for mining patterns, the application of semantic knowledge for generalization, tools for discovery, and validation with biological use cases.
    Inspired by research in association rule mining and linkage analysis, we develop two metrics to determine support and confidence scores in the associations of pairs of CV terms. Associations that have a statistically significant high score and are biologically meaningful may lead to new knowledge. To further validate the support and confidence metrics, we develop a secondary test for significance based on the hypergeometric distribution. We also exploit the semantics of the CVs. We aggregate termlinks over siblings of a common parent CV term and use them as additional evidence to boost the support and confidence scores in the associations of the parent CV term. We provide a simple discovery interface where biologists can review associations and their scores. Finally, a cancer informatics use case validates the discovery of associations between human genes and diseases.
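The abstract does not define its two metrics, so the following Python sketch only illustrates the general idea. It assumes the textbook association-rule definitions of support and confidence, and a one-sided hypergeometric tail probability as the secondary significance test. The termlink representation, the function names, and the term identifiers in the usage example are all hypothetical.

```python
from math import comb


def support_confidence(termlinks, term_a, term_b):
    """Textbook support and confidence for the association term_a -> term_b.

    termlinks: list of (terms_from_cv1, terms_from_cv2) set pairs,
    one pair per link between two annotated records (assumed layout).
    """
    n_total = len(termlinks)
    n_a = sum(1 for a_terms, _ in termlinks if term_a in a_terms)
    n_ab = sum(
        1 for a_terms, b_terms in termlinks
        if term_a in a_terms and term_b in b_terms
    )
    support = n_ab / n_total if n_total else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


def hypergeom_pvalue(n_total, n_a, n_b, n_ab):
    """P(X >= n_ab) for X ~ Hypergeometric(population n_total, n_a successes, n_b draws).

    A small value suggests the two terms co-occur more often than chance would predict.
    """
    upper = min(n_a, n_b)
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_ab, upper + 1)
    )
    return tail / comb(n_total, n_b)


if __name__ == "__main__":
    # Made-up termlinks between records annotated with two CVs.
    links = [
        ({"GO:1"}, {"MeSH:X"}),
        ({"GO:1"}, {"MeSH:X"}),
        ({"GO:1"}, {"MeSH:Y"}),
        ({"GO:2"}, {"MeSH:Y"}),
    ]
    print(support_confidence(links, "GO:1", "MeSH:X"))
    print(hypergeom_pvalue(n_total=4, n_a=3, n_b=2, n_ab=2))
```

Using exact integer binomials from `math.comb` avoids a SciPy dependency; for large datasets a library routine for the hypergeometric survival function would be the more practical choice.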