BioConceptVec: creating and evaluating literature-based biomedical concept embeddings on a large scale
Capturing the semantics of related biological concepts, such as genes and
mutations, is of significant importance to many research tasks in computational
biology such as protein-protein interaction detection, gene-drug association
prediction, and biomedical literature-based discovery. Here, we propose to
leverage state-of-the-art text mining tools and machine learning models to
learn the semantics via vector representations (i.e., embeddings) of over
400,000 biological concepts mentioned across all PubMed abstracts. Our
learned embeddings, namely BioConceptVec, can capture related concepts based on
their surrounding contextual information in the literature, which is beyond
exact term match or co-occurrence-based methods. BioConceptVec has been
thoroughly evaluated in multiple bioinformatics tasks consisting of over 25
million instances from nine different biological datasets. The evaluation
results demonstrate that BioConceptVec has better performance than existing
methods in all tasks. Finally, BioConceptVec is made freely available to the
research community and general public via
https://github.com/ncbi-nlp/BioConceptVec.
Comment: 33 pages, 6 figures, 7 tables, accepted by PLOS Computational Biology
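The idea of retrieving related concepts from embeddings can be illustrated with a minimal sketch of cosine-similarity nearest-neighbour lookup. This is not the BioConceptVec pipeline or its file format: the concept identifiers, the toy three-dimensional vectors, and the helper names `cosine` and `nearest_concepts` are all hypothetical and exist only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_concepts(query, embeddings, k=2):
    """Return the k concepts most similar to `query`, excluding the query itself."""
    q = embeddings[query]
    scored = [
        (other, cosine(q, vec))
        for other, vec in embeddings.items()
        if other != query
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy vectors; real concept embeddings have far more dimensions.
toy_embeddings = {
    "Gene_A": [1.0, 0.0, 0.0],
    "Gene_B": [0.9, 0.1, 0.0],
    "Disease_X": [0.0, 1.0, 0.0],
    "Chemical_Y": [0.0, 0.0, 1.0],
}

if __name__ == "__main__":
    for name, score in nearest_concepts("Gene_A", toy_embeddings, k=2):
        print(f"{name}\t{score:.3f}")
```

Ranking by similarity in vector space is what lets two concepts be related without ever co-occurring in the same abstract, as long as they appear in similar contexts.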
Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health
Linking clinical narratives to standardized vocabularies and coding systems
is a key component of unlocking the information in medical text for analysis.
However, many domains of medical concepts lack well-developed terminologies
that can support effective coding of medical text. We present a framework for
developing natural language processing (NLP) technologies for automated coding
of under-studied types of medical information, and demonstrate its
applicability via a case study on physical mobility function. Mobility is a
component of many health measures, from post-acute care and surgical outcomes
to chronic frailty and disability, and is coded in the International
Classification of Functioning, Disability, and Health (ICF). However, mobility
and other types of functional activity remain under-studied in medical
informatics, and neither the ICF nor commonly used medical terminologies
capture functional status terminology in practice. We investigated two
data-driven paradigms, classification and candidate selection, to link
narrative observations of mobility to standardized ICF codes, using a dataset
of clinical narratives from physical therapy encounters. Representations from
recent language models and word embeddings were used as features for established
machine learning models and a novel deep learning approach, achieving a macro
F-1 score of 84% on linking mobility activity reports to ICF codes. Both
classification and candidate selection approaches present distinct strengths
for automated coding in under-studied domains, and we highlight that the
combination of (i) a small annotated data set; (ii) expert definitions of codes
of interest; and (iii) a representative text corpus is sufficient to produce
high-performing automated coding systems. This study has implications for the
ongoing growth of NLP tools for a variety of specialized applications in
clinical care and research.
Comment: Updated final version, published in Frontiers in Digital Health,
https://doi.org/10.3389/fdgth.2021.620828. 34 pages (23 text + 11
references); 9 figures, 2 tables
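For readers unfamiliar with the macro F-1 metric reported in the abstract above, the following is a minimal sketch of how it is commonly computed: the per-label F-1 scores are averaged with equal weight, regardless of how often each label occurs. The label names are hypothetical placeholders rather than real ICF codes, and averaging over the union of gold and predicted labels is an assumed convention; the study's own evaluation code may differ.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-label F-1 over all labels seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F-1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never matches.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical labels standing in for codes assigned to mobility reports.
    gold = ["code_1", "code_1", "code_2", "code_2"]
    pred = ["code_1", "code_2", "code_2", "code_2"]
    print(f"macro F-1 = {macro_f1(gold, pred):.3f}")
```

Because every label counts equally, a system cannot reach a high macro F-1 by doing well only on the most frequent codes.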
Data for: Concept Embedding to Measure Semantic Relatedness for Biomedical Information Ontologies
First, we extended the definition information of the CUI terms using the Wikipedia database to improve the coverage of the similarity model. Second, we adopted document embedding for vector representations of the CUI terms. We used UMLS2015AB for the data. This dataset is archived at DANS/EASY.