2 research outputs found

    The MeSH-gram Neural Network Model: Extending Word Embedding Vectors with MeSH Concepts for UMLS Semantic Similarity and Relatedness in the Biomedical Domain

    Full text link
    Eliciting semantic similarity between concepts in the biomedical domain remains a challenging task. Recent approaches founded on embedding vectors have gained in popularity as they risen to efficiently capture semantic relationships The underlying idea is that two words that have close meaning gather similar contexts. In this study, we propose a new neural network model named MeSH-gram which relies on a straighforward approach that extends the skip-gram neural network model by considering MeSH (Medical Subject Headings) descriptors instead words. Trained on publicly available corpus PubMed MEDLINE, MeSH-gram is evaluated on reference standards manually annotated for semantic similarity. MeSH-gram is first compared to skip-gram with vectors of size 300 and at several windows contexts. A deeper comparison is performed with tewenty existing models. All the obtained results of Spearman's rank correlations between human scores and computed similarities show that MeSH-gram outperforms the skip-gram model, and is comparable to the best methods but that need more computation and external resources.Comment: 6 pages, 2 table

    Biomedical Literature Mining and Knowledge Discovery of Phenotyping Definitions

    Get PDF
    Indiana University-Purdue University Indianapolis (IUPUI)Phenotyping definitions are essential in cohort identification when conducting clinical research, but they become an obstacle when they are not readily available. Developing new definitions manually requires expert involvement that is labor-intensive, time-consuming, and unscalable. Moreover, automated approaches rely mostly on electronic health records’ data that suffer from bias, confounding, and incompleteness. Limited efforts established in utilizing text-mining and data-driven approaches to automate extraction and literature-based knowledge discovery of phenotyping definitions and to support their scalability. In this dissertation, we proposed a text-mining pipeline combining rule-based and machine-learning methods to automate retrieval, classification, and extraction of phenotyping definitions’ information from literature. To achieve this, we first developed an annotation guideline with ten dimensions to annotate sentences with evidence of phenotyping definitions' modalities, such as phenotypes and laboratories. Two annotators manually annotated a corpus of sentences (n=3,971) extracted from full-text observational studies’ methods sections (n=86). Percent and Kappa statistics showed high inter-annotator agreement on sentence-level annotations. Second, we constructed two validated text classifiers using our annotated corpora: abstract-level and full-text sentence-level. We applied the abstract-level classifier on a large-scale biomedical literature of over 20 million abstracts published between 1975 and 2018 to classify positive abstracts (n=459,406). After retrieving their full-texts (n=120,868), we extracted sentences from their methods sections and used the full-text sentence-level classifier to extract positive sentences (n=2,745,416). Third, we performed a literature-based discovery utilizing the positively classified sentences. Lexica-based methods were used to recognize medical concepts in these sentences (n=19,423). Co-occurrence and association methods were used to identify and rank phenotype candidates that are associated with a phenotype of interest. We derived 12,616,465 associations from our large-scale corpus. Our literature-based associations and large-scale corpus contribute in building new data-driven phenotyping definitions and expanding existing definitions with minimal expert involvement
    corecore