8 research outputs found

    Using domain-targeted text corpora to improve phenotype named entity recognition

    Scientific corpora serve as the backbone for advancements in Natural Language Processing (NLP) tasks within the biomedical domain. However, current methods for corpus creation often rely solely on PubMed abstracts and Open Access (OA) publications on PubMed Central (PMC). This approach overlooks the information contained within the full text of scientific articles not available in these two services. Furthermore, existing tools for UMLS named entity recognition, such as MetaMap, can be computationally slow, hindering large-scale analysis. This work addresses these limitations by introducing novel tools and resources specifically designed to enhance NLP tasks, especially UMLS and Phenotype NER, in the biomedical field. First, I present Cadmus, the first fully automated pipeline for scientific corpus creation that goes beyond PubMed abstracts and leverages the full text of OA and non-OA publications. Cadmus utilizes a combination of APIs, web scraping and text processing techniques to create comprehensive scientific corpora. Our analysis demonstrates that Cadmus corpus creation provides a significant increase in the number of identified entities (representing 64.9% of the total available UMLS entities on our DDG2P dataset) compared to prior methods. Second, I introduce ParallelPyMetaMap, a Python implementation of MetaMap. ParallelPyMetaMap offers full access to MetaMap's robust named entity recognition capabilities while incorporating a multiprocessing approach. This approach significantly accelerates processing times, allowing researchers to analyze larger datasets more efficiently. Third, I present the Autism Spectrum Disorder (ASD) Corpus, the first fully automated, full-text biomedical corpus. The ASD corpus is constructed by employing Cadmus to gather full-text articles related to ASD, encompassing both OA and non-OA publications. This corpus represents a valuable resource for researchers focused on ASD, providing a comprehensive collection of full-text articles for in-depth analysis. Our ASD corpus captures a significant portion of relevant publications (82.64% of 72,058) for ASD research. Finally, I introduce a novel Phenotype Named Entity Recognition (NER) model specifically optimized for identifying phenotypic entities within biomedical text. Our Phenotype NER model is trained on a large-scale silver standard dataset and incorporates optimized pre-processing strategies. When compared to current state-of-the-art methods on three human expert-annotated datasets, our model outperforms existing approaches on two out of three datasets, demonstrating its effectiveness in identifying phenotypic entities. In conclusion, this work presents a comprehensive suite of tools and resources that significantly enhance NLP capabilities in the biomedical domain. Cadmus, with its corpus creation, and the Phenotype NER model demonstrably improve the identification of entities and phenotypes, while ParallelPyMetaMap accelerates UMLS named entity recognition. The ASD Corpus offers a valuable collection of full-text articles for researchers focused on Autism Spectrum Disorder. These advancements offer an alternative to existing methods that have been used and reused over the years.

    KU AIGEN ICL EDI@BC8 Track 3: Advancing Phenotype Named Entity Recognition and Normalization for Dysmorphology Physical Examination Reports

    No full text
    The objective of BioCreative8 Track 3 is to extract phenotypic key medical findings embedded within EHR texts and subsequently normalize these findings to their Human Phenotype Ontology (HPO) terms. However, the presence of diverse surface forms in phenotypic findings makes it challenging to accurately normalize them to the correct HPO terms. To address this challenge, we explored various models for named entity recognition and implemented data augmentation techniques such as synonym marginalization to enhance the normalization step. Our pipeline resulted in an exact extraction and normalization F1 score 2.6% higher than the mean score of all submissions received in response to the challenge. Furthermore, in terms of the normalization F1 score, our approach surpassed the average performance by 1.9%. These findings contribute to the advancement of automated medical data extraction and normalization techniques, showcasing potential pathways for future research and application in the biomedical domain. This article is part of the Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the era of Generative Models (https://zenodo.org/doi/10.5281/zenodo.10103190).

    Nekrosen, Gangrän, Geschwüre (Necroses, Gangrene, Ulcers)

    No full text

    Sensing the ocean biological carbon pump from space: A review of capabilities, concepts, research gaps and future developments

    No full text