49 research outputs found

    Evaluation initiatives for the semantic indexing of medical literature in Spanish: PlanTL, LILACS, IBECS and BioASQ

    XVI Jornadas Nacionales de Información y Documentación en Ciencias de la Salud. Oviedo, 4-5 April 2019.
The health flagship project of the Plan for the Advancement of Language Technology (PlanTL) aims to promote the development of natural language processing (NLP), text mining and machine translation systems for Spanish and co-official languages. There is a growing demand for better exploitation of datasets generated by clinicians, especially electronic health records, and for the integration and management of this kind of data in personalized medicine platforms that also integrate information extracted from the literature. In this context, PlanTL collaborates in the organization of evaluation campaigns for clinical NLP and text mining systems. Such campaigns are a key mechanism for evaluating the quality of the results obtained by automated systems and predictive algorithms, and a fundamental driver for the development of language technology tools and resources.
Given the importance of the literature for medical decision-making and the considerable, growing volume of medical publications in Spanish, PlanTL, in collaboration with the BSC, the CNIO, the Biblioteca Nacional de Ciencias de la Salud (BNCS) and the BioASQ initiative, has launched a shared task on the automatic indexing of Spanish medical literature with DeCS terms. The aim of this track is to generate semantic annotation resources that can assist manual indexing. The Spanish biomedical semantic indexing track of BioASQ (bioasq.org) will rely on abstracts of journal articles contained in the LILACS (Literatura Latinoamericana en Ciencias de la Salud) and IBECS (Índice Bibliográfico Español en Ciencias de la Salud) databases as a manually labeled Gold Standard benchmark set for the development of automatic indexing algorithms, particularly those based on artificial intelligence language models.
Participating systems are evaluated through the BioASQ platform in a continuous evaluation process: participants are asked to assign DeCS terms automatically to new records as they are made public, before the manual indexing is completed and released. Indexing performance is calculated by comparing the automatic indexing against the manual annotations. The results of previous BioASQ editions on PubMed considerably improved the MeSH indexing process of that resource. This new effort on medical indexing in Spanish will serve to generate comparable resources for semantically indexing not only LILACS and IBECS but also other health databases and repositories in Spanish.
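The comparison of automatic against manual indexing can be illustrated with a small sketch. The abstract does not say which measure is used, so the micro-averaged precision, recall and F-measure below are an assumption (they are a common choice for multi-label indexing). The function name `micro_prf`, the record identifiers and the descriptor labels are hypothetical placeholders, not real DeCS data.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]], predicted: Dict[str, Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over all gold records.

    Counts are pooled across records before the ratios are computed,
    so records with many descriptors weigh more than records with few.
    """
    tp = fp = fn = 0
    for doc_id, gold_terms in gold.items():
        pred_terms = predicted.get(doc_id, set())
        tp += len(gold_terms & pred_terms)   # descriptors assigned by both
        fp += len(pred_terms - gold_terms)   # assigned only by the system
        fn += len(gold_terms - pred_terms)   # assigned only by the indexer
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical records with placeholder descriptor labels.
gold = {
    "doc1": {"Neoplasms", "Adult", "Humans"},
    "doc2": {"Diabetes Mellitus", "Humans"},
}
predicted = {
    "doc1": {"Neoplasms", "Humans", "Child"},
    "doc2": {"Humans"},
}

precision, recall, f1 = micro_prf(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

In this sketch, predictions for records that have no manual annotation yet are ignored, which mirrors the idea that systems can only be scored once the manual indexing exists.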

    CHEMDNER: The drugs and chemical names extraction challenge

    Natural language processing (NLP) and text mining technologies for the chemical domain (ChemNLP or chemical text mining) are key to improving the access to and integration of information from unstructured data such as patents or the scientific literature. Therefore, the BioCreative organizers posed the CHEMDNER (chemical compound and drug name recognition) community challenge, which promoted the development of novel, competitive and accessible chemical text mining systems. This task allowed a comparative assessment of the performance of various methodologies, using as Gold Standard data a carefully prepared collection of text manually labeled by specially trained chemists. We evaluated two important aspects: one covered the indexing of documents with chemicals (chemical document indexing, CDI task), and the other was concerned with finding the exact mentions of chemicals in text (chemical entity mention recognition, CEM task). A total of 27 teams (23 academic and 4 commercial, 87 researchers in all) returned results for the CHEMDNER tasks: 26 teams for the CEM task and 23 for the CDI task. Top scoring teams obtained an F-score of 87.39% for the CEM task and 88.20% for the CDI task, a very promising result when compared to the agreement between human annotators (91%). The strategies used to detect chemicals included machine learning methods (e.g. conditional random fields) using a variety of features, chemistry and drug lexica, and domain-specific rules. We expect that the tools and resources resulting from this effort will have an impact on future developments of chemical text mining applications and will form the basis for finding related chemical information for the detected entities, such as toxicological or pharmacogenomic properties.
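As a rough illustration of the lexicon-based strategy and of scoring exact mentions, the sketch below tags chemical names from a tiny word list and computes an F-score over exactly matching character spans. It is not the method of any participating team; the lexicon, the sentence and the function names `find_mentions` and `exact_match_f1` are hypothetical, and real chemical names containing brackets or hyphens would need more careful boundary handling than the `\b` anchors used here.

```python
import re
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start offset, end offset, surface text)


def find_mentions(text: str, lexicon: Set[str]) -> List[Span]:
    """Return case-insensitive lexicon matches with character offsets."""
    if not lexicon:
        return []
    # Longest entries first, so a multi-word name is preferred over a
    # shorter entry that starts at the same position.
    entries = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(e) for e in entries) + r")\b",
        re.IGNORECASE,
    )
    return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]


def exact_match_f1(
    gold: Set[Tuple[int, int]], predicted: Set[Tuple[int, int]]
) -> float:
    """F1 where a prediction counts only if both offsets match exactly."""
    tp = len(gold & predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentence and toy lexicon.
text = "Aspirin (acetylsalicylic acid) was compared with ibuprofen."
lexicon = {"aspirin", "acetylsalicylic acid", "ibuprofen"}

mentions = find_mentions(text, lexicon)
for start, end, surface in mentions:
    print(start, end, surface)
```

Exact-span matching is strict by design: a prediction that is off by one character counts as both a false positive and a false negative.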

    The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

    BACKGROUND: Determining the usefulness of biomedical text mining systems requires realistic task definitions and data selection criteria without artificial constraints, and the measurement of performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations. They tried to address how the end user would oversee the generated output, for instance by providing ranked results and textual evidence for human interpretation, or by measuring the time saved by using automated systems. Detecting articles that describe complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose the BCIII-ACT corpus was provided; it includes training, development and test sets totalling over 12,000 PPI-relevant and non-relevant PubMed abstracts, labeled manually by domain experts, together with the recorded human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and the interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.
RESULTS: A total of 11 teams participated in at least one of the two PPI tasks (10 in the ACT and 8 in the IMT), and a total of 62 persons were involved either as participants or in preparing data sets and evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. Of the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89%, and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some also used lexical resources such as MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations made by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, and the best MCC score was 0.55. For competitive systems with an acceptable recall (above 35%), the macro-averaged precision ranged between 50% and 80%, with a maximum F-score of 55%.
CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time needed with unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from the pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of the method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
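To make the reported measures concrete, here is a minimal sketch of the Matthews correlation coefficient next to plain accuracy, both computed from a binary confusion matrix using the standard textbook definitions. The counts are hypothetical and chosen only to mimic an unbalanced collection; they are not taken from the BCIII-ACT evaluation.

```python
import math


def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Fraction of all articles that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 to 1; it is defined as 0 here when any marginal
    total is zero, which makes the usual denominator vanish.
    """
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical unbalanced test set: 110 relevant, 890 non-relevant.
tp, fp, fn, tn = 50, 40, 60, 850

acc = accuracy(tp, fp, fn, tn)
score = mcc(tp, fp, fn, tn)
print(f"accuracy={acc:.2f} MCC={score:.2f}")
```

With these counts the accuracy is high while the MCC stays moderate, because the many true negatives dominate accuracy. A classifier that labels every article as non-relevant would receive an MCC of 0 under this definition, which is one reason MCC is informative on unbalanced collections.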

    Annotating genes and genomes with DNA sequences extracted from biomedical articles

    Motivation: Increasing rates of publication and DNA sequencing make the problem of finding relevant articles for a particular gene or genomic region more challenging than ever. Existing text-mining approaches focus on finding gene names or identifiers in English text. These are often not unique and do not identify the exact genomic location of a study.
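The abstract above is cut off before the method is described, so the following is only a sketch of the idea stated in the title: locating DNA-like strings in article text. The regular expression, the minimum length of 15 nucleotides and the function name `find_dna_sequences` are assumptions made for illustration, not the authors' pipeline.

```python
import re
from typing import List, Tuple


def find_dna_sequences(text: str, min_length: int = 15) -> List[Tuple[int, str]]:
    """Return (offset, sequence) for runs of A/C/G/T of at least min_length.

    The lookarounds keep the run from being a fragment of a longer
    alphabetic word; short runs such as gene symbols are skipped by
    the length threshold.
    """
    pattern = re.compile(
        r"(?<![A-Za-z])[ACGTacgt]{%d,}(?![A-Za-z])" % min_length
    )
    return [(m.start(), m.group(0).upper()) for m in pattern.finditer(text)]


# Hypothetical sentence containing a primer sequence.
text = "The forward primer 5'-ATGGCCAAGCTTGGATCCGA-3' amplified the CAT gene."

hits = find_dna_sequences(text)
for offset, sequence in hits:
    print(offset, sequence)
```

A sequence found this way could then be aligned against a genome to obtain an exact location, which gene names alone do not provide; that alignment step is outside this sketch.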

    pubmed2ensembl: A Resource for Mining the Biological Literature on Genes

    The last two decades have witnessed a dramatic acceleration in the production of genomic sequence information and the publication of biomedical articles. Although genome sequence data and publications are two of the sources of information most heavily relied upon by biologists, very little effort has been made to systematically integrate data from genomic sequences directly with the biological literature. For a limited number of model organisms, dedicated teams manually curate publications about genes; however, for species with no such dedicated staff, many thousands of articles are never mapped to genes or genomic regions.
To overcome the lack of integration between genomic data and biological literature, we have developed pubmed2ensembl (http://www.pubmed2ensembl.org), an extension to the BioMart system that links over 2,000,000 articles in PubMed to nearly 150,000 genes in Ensembl from 50 species. We use several curated (e.g., Entrez Gene) and automatically generated (e.g., gene names extracted through text-mining on MEDLINE records) sources of gene-publication links, allowing users to filter and combine different data sources to suit their individual needs for information extraction and biological discovery. In addition to extending the Ensembl BioMart database to include published information on genes, we also implemented a scripting language for automated BioMart construction and a novel BioMart interface that allows text-based queries to be performed against PubMed and PubMed Central documents in conjunction with constraints on genomic features.
Finally, we illustrate the potential of pubmed2ensembl through typical use cases that involve integrated queries across the biomedical literature and genomic data. By allowing biologists to find the relevant literature on specific genomic regions or sets of functionally related genes more easily, pubmed2ensembl offers a much-needed, genome-informatics-inspired solution to accessing the ever-increasing biomedical literature.

    Text mining for biology - the way forward: opinions from leading scientists

    This article collects opinions from leading scientists about how text mining can provide better access to the biological literature, how the scientific community can help with this process, what the next steps are, and what role future BioCreative evaluations can play. The responses identify several broad themes, including the possibility of fusing literature and biological databases through text mining; the need for user interfaces tailored to different classes of users and supporting community-based annotation; the importance of scaling text mining technology and inserting it into larger workflows; and suggestions for additional challenge evaluations, new applications, and additional resources needed to make progress.

    BioCreative III interactive task: an overview

    The BioCreative challenge evaluation is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. The biocurator community, as an active user of biomedical literature, provides a diverse and engaged end user group for text mining tools. Earlier BioCreative challenges involved many text mining teams in developing basic capabilities relevant to biological curation, but they did not address the issues of system usage, insertion into the workflow and adoption by curators. Thus, in BioCreative III (BC-III), the InterActive Task (IAT) was introduced to address the utility and usability of text mining tools for real-life biocuration tasks. To support the aims of the IAT in BC-III, involvement of both developers and end users was solicited, and the development of a user interface to address the tasks interactively was requested.

    Linguistic measures of chemical diversity and the "keywords" of molecular collections

    Computerized linguistic analyses have proven of immense value in comparing and searching through large text collections ("corpora"), including those deposited on the Internet. Indeed, it would nowadays be hard to imagine browsing the Web without, for instance, search algorithms extracting the most appropriate keywords from documents. This paper describes how such corpus-linguistic concepts can be extended to chemistry based on characteristic "chemical words" that span more than traditional functional groups and instead look at the common structural fragments that molecules share. Using these words, it is possible to quantify the diversity of chemical collections and databases in new ways and to define molecular "keywords" by which such collections are best characterized and annotated.