8,082 research outputs found

    Large-scale event extraction from literature with multi-level gene normalization

    Get PDF
    Text mining for the life sciences aims to aid database curation, knowledge summarization and information retrieval through the automated processing of biomedical texts. To provide comprehensive coverage and enable full integration with existing biomolecular database records, it is crucial that text mining tools scale up to millions of articles and that their analyses can be unambiguously linked to information recorded in resources such as UniProt, KEGG, BioGRID and NCBI databases. In this study, we investigate how fully automated text mining of complex biomolecular events can be augmented with a normalization strategy that identifies biological concepts in text, mapping them to identifiers at varying levels of granularity, ranging from canonicalized symbols to unique gene and proteins and broad gene families. To this end, we have combined two state-of-the-art text mining components, previously evaluated on two community-wide challenges, and have extended and improved upon these methods by exploiting their complementary nature. Using these systems, we perform normalization and event extraction to create a large-scale resource that is publicly available, unique in semantic scope, and covers all 21.9 million PubMed abstracts and 460 thousand PubMed Central open access full-text articles. This dataset contains 40 million biomolecular events involving 76 million gene/protein mentions, linked to 122 thousand distinct genes from 5032 species across the full taxonomic tree. Detailed evaluations and analyses reveal promising results for application of this data in database and pathway curation efforts. The main software components used in this study are released under an open-source license. Further, the resulting dataset is freely accessible through a novel API, providing programmatic and customized access (http://www.evexdb.org/api/v001/). Finally, to allow for large-scale bioinformatic analyses, the entire resource is available for bulk download from http://evexdb.org/download/, under the Creative Commons -Attribution - Share Alike (CC BY-SA) license

    Biodiversity informatics: the challenge of linking data and the role of shared identifiers

    Get PDF
    A major challenge facing biodiversity informatics is integrating data stored in widely distributed databases. Initial efforts have relied on taxonomic names as the shared identifier linking records in different databases. However, taxonomic names have limitations as identifiers, being neither stable nor globally unique, and the pace of molecular taxonomic and phylogenetic research means that a lot of information in public sequence databases is not linked to formal taxonomic names. This review explores the use of other identifiers, such as specimen codes and GenBank accession numbers, to link otherwise disconnected facts in different databases. The structure of these links can also be exploited using the PageRank algorithm to rank the results of searches on biodiversity databases. The key to rich integration is a commitment to deploy and reuse globally unique, shared identifiers (such as DOIs and LSIDs), and the implementation of services that link those identifiers

    Grounding Gene Mentions with Respect to Gene Database Identifiers

    Get PDF
    We describe our submission for task 1B of the BioCreAtIvE competition which is concerned with grounding gene mentions with respect to databases of organism gene identifiers. Several approaches to gene identification, lookup, and disambiguation are presented. Results are presented with two possible baseline systems and a discussion of the source of precision and recall errors as well as an estimate of precision and recall for an organism-specific tagger bootstrapped from gene synonym lists and the task 1B training data. 1

    Community next steps for making globally unique identifiers work for biocollections data

    Get PDF
    Biodiversity data is being digitized and made available online at a rapidly increasing rate but current practices typically do not preserve linkages between these data, which impedes interoperation, provenance tracking, and assembly of larger datasets. For data associated with biocollections, the biodiversity community has long recognized that an essential part of establishing and preserving linkages is to apply globally unique identifiers at the point when data are generated in the field and to persist these identifiers downstream, but this is seldom implemented in practice. There has neither been coalescence towards one single identifier solution (as in some other domains), nor even a set of recommended best practices and standards to support multiple identifier schemes sharing consistent responses. In order to further progress towards a broader community consensus, a group of biocollections and informatics experts assembled in Stockholm in October 2014 to discuss community next steps to overcome current roadblocks. The workshop participants divided into four groups focusing on: identifier practice in current field biocollections; identifier application for legacy biocollections; identifiers as applied to biodiversity data records as they are published and made available in semantically marked-up publications; and cross-cutting identifier solutions that bridge across these domains. The main outcome was consensus on key issues, including recognition of differences between legacy and new biocollections processes, the need for identifier metadata profiles that can report information on identifier persistence missions, and the unambiguous indication of the type of object associated with the identifier. Current identifier characteristics are also summarized, and an overview of available schemes and practices is provided

    Identifiers.org and MIRIAM Registry: perennial identifiers for crossreferencing purposes

    Get PDF
    The MIRIAM Registry provides unique, perennial and location independent identifiers for data used in the biomedical domain. At its core is a shared catalogue of data collections, for each of which a unique namespace is created, and extensive metadata recorded. This namespace allows the generation of Uniform Resource Identifiers (URIs) to uniquely identify any record in a collection. We present here the new Identifiers.org service which is built upon the information stored in the Registry and which provides directly resolvable identifiers, in the form of Uniform Resource Locators (URLs). The flexibility of the identification scheme and resolving system allows its use in many different fields, where unambiguous and perennial identification of data entities is necessary

    Surfacing the deep data of taxonomy

    Get PDF
    Taxonomic databases are perpetuating approaches to citing literature that may have been appropriate before the Internet, often being little more than digitised 5 × 3 index cards. Typically the original taxonomic literature is either not cited, or is represented in the form of a (typically abbreviated) text string. Hence much of the “deep data” of taxonomy, such as the original descriptions, revisions, and nomenclatural actions are largely hidden from all but the most resourceful users. At the same time there are burgeoning efforts to digitise the scientific literature, and much of this newly available content has been assigned globally unique identifiers such as Digital Object Identifiers (DOIs), which are also the identifier of choice for most modern publications. This represents an opportunity for taxonomic databases to engage with digitisation efforts. Mapping the taxonomic literature on to globally unique identifiers can be time consuming, but need be done only once. Furthermore, if we reuse existing identifiers, rather than mint our own, we can start to build the links between the diverse data that are needed to support the kinds of inference which biodiversity informatics aspires to support. Until this practice becomes widespread, the taxonomic literature will remain balkanized, and much of the knowledge that it contains will linger in obscurity
    corecore