    Iniciativas de evaluación para la indización semántica de literatura médica en español: PLANTL, LILACS, IBECS Y BIOASQ

    XVI Jornadas Nacionales de Información y Documentación en Ciencias de la Salud. Oviedo, 4-5 de abril de 2019. The health flagship project of the Plan for the Advancement of Language Technology (PlanTL) aims to promote the development of natural language processing (NLP), text mining and machine translation resources for Spanish and the co-official languages. There is a growing demand for better exploitation of datasets generated by clinicians, especially electronic health records, and for the integration and management of this kind of data in personalized medicine platforms that also draw on information extracted from the literature. In this context, the PlanTL collaborates in organizing evaluation campaigns for clinical NLP and text mining systems. Such campaigns are a key mechanism for evaluating the quality of the results obtained by automated systems and predictive algorithms, and a fundamental driver for the development of language technology tools and resources. Given the importance of the literature for medical decision-making and the growing volume of Spanish medical publications, the PlanTL, in collaboration with the BSC, the CNIO, the Biblioteca Nacional de Ciencias de la Salud (BNCS) and the BioASQ team, has launched a shared task on the automatic indexing of Spanish abstracts with DeCS terms. The aim of this track is to generate semantic annotation resources that can assist manual indexing. The Spanish biomedical semantic indexing track of BioASQ (bioasq.org) will rely on abstracts of journal articles contained in the LILACS (Literatura Latinoamericana en Ciencias de la Salud) and IBECS (Índice Bibliográfico Español en Ciencias de la Salud) databases as a manually labeled Gold Standard benchmark set for developing automatic indexing algorithms, particularly those based on artificial intelligence language models.
The evaluation of participating systems is carried out on the BioASQ platform through a continuous evaluation process: participants are asked to assign DeCS terms automatically to documents newly added to the databases, as they are made public and before the manual indexing results are released. Indexing performance is calculated by comparing the automatic indexing against the manual annotations. Thanks to the results of previous BioASQ editions on indexing PubMed, the MeSH indexing process of that resource was considerably improved. This new effort on medical indexing in Spanish will serve to generate comparable resources for semantically indexing not only LILACS and IBECS but also other health databases and repositories in Spanish.
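
The comparison of automatic and manual indexing described above can be illustrated with a minimal sketch. It assumes micro-averaged precision, recall and F1 over sets of term identifiers per document; the abstract does not state which measure is used, and the function name `micro_prf`, the document IDs and the term codes are hypothetical.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]], predicted: Dict[str, Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 for multi-label indexing.

    `gold` maps document IDs to manually assigned terms and `predicted`
    maps document IDs to automatically assigned terms. Documents that
    appear only in `predicted` are ignored (an assumption of this sketch).
    """
    tp = fp = fn = 0
    for doc_id, gold_terms in gold.items():
        pred_terms = predicted.get(doc_id, set())
        tp += len(gold_terms & pred_terms)   # terms both sides agree on
        fp += len(pred_terms - gold_terms)   # automatic terms not in the manual set
        fn += len(gold_terms - pred_terms)   # manual terms the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical toy data: two documents with made-up term codes.
gold = {"doc1": {"T001", "T002"}, "doc2": {"T003"}}
predicted = {"doc1": {"T001", "T004"}, "doc2": {"T003"}}
precision, recall, f1 = micro_prf(gold, predicted)
```

Micro-averaging pools the counts over all documents, so documents with many terms weigh more than documents with few; a per-document average would be an equally plausible choice.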

    Efforts to foster biomedical text mining efforts beyond English: the Spanish national strategic plan for language technologies

    Considerable effort has been made to apply text mining technologies to biomedical literature and clinical records written in English, while attempts to process documents in other languages have attracted far less attention despite their practical relevance. Given the considerable number of biomedical documents written in Spanish, there is a pressing need for access to biomedical and clinical text mining resources developed for this high-impact language.
To address this issue, the Spanish Ministry of State for Telecommunications launched the Plan for Promotion of Language Technologies in the field of biomedicine, with the aim of providing specialized technical support for the research and development of software solutions adapted to this domain. This article briefly describes the main lines of action of the project in its initial stages, namely: (a) identifying relevant biomedical NLP resources and tools and facilitating access to them, (b) examining and enabling system interoperability, (c) outlining strategies and support for evaluation settings, (d) disseminating the project and its results, and (e) aligning and collaborating with related national and international projects. Moreover, we have identified some of the critical biomedical text processing tasks that require additional research and availability of tools.

    The biomedical abbreviation recognition and resolution (BARR) track: Benchmarking, evaluation and importance of abbreviation recognition systems applied to Spanish biomedical abstracts

    Healthcare professionals are generating a substantial volume of clinical data in narrative form. As healthcare providers are confronted with serious time constraints, they frequently use telegraphic phrases, domain-specific abbreviations and shorthand notes. Efficient clinical text processing tools need to cope with the recognition and resolution of abbreviations, a task that has been extensively studied for English documents. Despite the large number of clinical documents written worldwide in Spanish, only a marginal number of studies have been published on this subject. In clinical texts, as opposed to the medical literature, abbreviations are generally used without their definitions or expanded forms. The aim of the first Biomedical Abbreviation Recognition and Resolution (BARR) track, posed at the IberEval 2017 evaluation campaign, was to assess and promote the development of systems for generating a sense inventory of medical abbreviations. The BARR track required the detection of mentions of abbreviations or short forms and their corresponding long forms or definitions in Spanish medical abstracts. For this track, the organizers provided the BARR medical document collection, the BARR corpus of abstracts manually annotated by domain experts, and the BARR-Markyt evaluation platform. A total of 7 teams submitted 25 runs for the two BARR subtasks: (a) the identification of mentions of abbreviations and their definitions and (b) the correct detection of short form-long form pairs. Here we describe the BARR track setting, the results obtained and the methodologies used by the participating systems. The BARR task summary, corpus, resources and evaluation tool for testing systems beyond this campaign are available at: http://temu.inab.org. We acknowledge the Encomienda MINETAD-CNIO/OTG Sanidad Plan TL and the Open-Minted (654021) H2020 project for funding.

    AbreMES-X

    [Medical semantic annotation] Software used to generate the Spanish Medical Abbreviation DataBase (https://github.com/PlanTL/AbreMES-DB). The database is generated by detecting abbreviations and their potential definitions when both are explicitly mentioned in the same sentence. The sentences are extracted from the metadata (titles and abstracts) of biomedical publications written in Spanish.
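
As an illustration of detecting an abbreviation and its candidate definition within one sentence, here is a minimal sketch. It is not the AbreMES-X algorithm, whose details the description does not give. It assumes the common "long form (SHORT)" pattern and matches initials backwards over the preceding words; the regular expression, the stopword list and the function names are all hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Assumption: a short form is a parenthesised token of 2-10 characters
# that starts with an uppercase letter, e.g. "(PLN)".
SHORT_FORM = re.compile(r"\(([A-ZÁÉÍÓÚÑ][\w-]{1,9})\)")

# Assumption: a small, illustrative set of Spanish function words that
# may occur inside a long form without contributing a letter.
STOPWORDS = {"de", "del", "la", "el", "los", "las", "en", "y"}


def long_form_for(short: str, preceding: List[str]) -> Optional[str]:
    """Walk backwards over the preceding words, matching initials."""
    letters = list(short.lower())
    start = None
    i = len(preceding) - 1
    while letters and i >= 0:
        word = preceding[i].lower()
        if word[0] == letters[-1]:
            letters.pop()          # this word supplies the next letter
            start = i
        elif word not in STOPWORDS:
            return None            # a content word breaks the match
        i -= 1
    if letters or start is None:
        return None                # not every letter was accounted for
    return " ".join(preceding[start:])


def find_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Return (short form, long form) pairs found in one sentence."""
    pairs = []
    for match in SHORT_FORM.finditer(sentence):
        short = match.group(1)
        preceding = sentence[: match.start()].split()
        long_form = long_form_for(short, preceding)
        if long_form is not None:
            pairs.append((short, long_form))
    return pairs


pairs = find_pairs("El procesamiento del lenguaje natural (PLN) facilita la indización.")
no_pairs = find_pairs("Se realizó una tomografía (TAC) en 2019.")
```

Initial-letter matching misses abbreviations built from inner letters or syllables, so a real inventory would need looser alignment rules and manual review.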

    Eliminando menciones ruidosas para la supervisión a distancia

    Relation Extraction methods based on Distant Supervision rely on true tuples to retrieve noisy mentions of those tuples, which are then used to train traditional supervised relation extraction methods. In this paper we analyze the sources of noise in the mentions and explore simple methods to filter out noisy mentions. The results show that combining a mention frequency cut-off, Pointwise Mutual Information, and the removal of mentions that lie far from the feature centroids of their relation labels significantly improves the results of two relation extraction models.
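
The centroid-based filter mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: mentions are taken to be dense feature vectors, distance is Euclidean, and the cut-off is a fixed `max_distance`. The feature representation, the threshold and all names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# A mention is a (relation label, feature vector) pair in this sketch.
Mention = Tuple[str, Sequence[float]]


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def filter_far_mentions(mentions: List[Mention], max_distance: float) -> List[Mention]:
    """Keep mentions within `max_distance` of their label's centroid."""
    by_label: Dict[str, List[Sequence[float]]] = defaultdict(list)
    for label, vector in mentions:
        by_label[label].append(vector)
    centroids = {label: centroid(vectors) for label, vectors in by_label.items()}
    return [
        (label, vector)
        for label, vector in mentions
        if math.dist(vector, centroids[label]) <= max_distance
    ]


# Hypothetical toy data: three similar mentions and one outlier.
mentions = [
    ("born_in", (0.0, 0.0)),
    ("born_in", (0.2, 0.0)),
    ("born_in", (0.1, 0.0)),
    ("born_in", (5.0, 5.0)),
]
kept = filter_far_mentions(mentions, max_distance=2.0)
```

Because the outlier itself shifts the centroid, the threshold needs some slack; recomputing centroids after a first filtering pass is one possible refinement.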