
    Insights into Analogy Completion from the Biomedical Domain

    Analogy completion has been a popular task in recent years for evaluating the semantic properties of word embeddings, but the standard methodology makes a number of assumptions about analogies that do not always hold, either in recent benchmark datasets or when expanding into other domains. Through an analysis of analogies in the biomedical domain, we identify three assumptions: that of a Single Answer for any given analogy, that the pairs involved describe the Same Relationship, and that each pair is Informative with respect to the other. We propose modifying the standard methodology to relax these assumptions by allowing for multiple correct answers, reporting MAP and MRR in addition to accuracy, and using multiple example pairs. We further present BMASS, a novel dataset for evaluating linguistic regularities in biomedical embeddings, and demonstrate that the relationships described in the dataset pose significant semantic challenges to current word embedding methods. Comment: Accepted to BioNLP 2017. (10 pages
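
The abstract above proposes reporting MAP and MRR when an analogy may have several correct answers. The following is a minimal sketch of those two metrics using their standard information-retrieval definitions; it is not the authors' evaluation code, and the function names and toy data are assumptions made for illustration.

```python
from typing import Iterable, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """Return 1/rank of the first correct answer, or 0.0 if none is ranked."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Average the precision at each rank where a correct answer appears."""
    if not gold:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in gold:
            hits += 1
            precision_sum += hits / rank
    # Dividing by len(gold) penalizes correct answers that were never ranked.
    return precision_sum / len(gold)


def map_and_mrr(queries: Iterable[Tuple[Sequence[str], Set[str]]]) -> Tuple[float, float]:
    """Compute (MAP, MRR) over (ranked candidates, gold answer set) pairs."""
    queries = list(queries)
    if not queries:
        return 0.0, 0.0
    mean_ap = sum(average_precision(r, g) for r, g in queries) / len(queries)
    mean_rr = sum(reciprocal_rank(r, g) for r, g in queries) / len(queries)
    return mean_ap, mean_rr


# Hypothetical analogy query with two acceptable answers ("b" and "d").
example = [(["a", "b", "c", "d"], {"b", "d"})]
mean_ap, mean_rr = map_and_mrr(example)
```

Unlike top-1 accuracy, both metrics give partial credit when a correct answer is ranked below the first position, which is what makes them usable once the Single Answer assumption is dropped.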

    Jointly Embedding Entities and Text with Distant Supervision

    Learning representations for knowledge base entities and concepts is becoming increasingly important for NLP applications. However, recent entity embedding methods have relied on structured resources that are expensive to create for new domains and corpora. We present a distantly-supervised method for jointly learning embeddings of entities and text from an unannotated corpus, using only a list of mappings between entities and surface forms. We learn embeddings from open-domain and biomedical corpora, and compare against prior methods that rely on human-annotated text or large knowledge graph structure. Our embeddings capture entity similarity and relatedness better than prior work, both in existing biomedical datasets and a new Wikipedia-based dataset that we release to the community. Results on analogy completion and entity sense disambiguation indicate that entities and words capture complementary information that can be effectively combined for downstream use. Comment: 12 pages; Accepted to 3rd Workshop on Representation Learning for NLP (Repl4NLP 2018). Code at https://github.com/OSU-slatelab/JE

    How essential are unstructured clinical narratives and information fusion to clinical trial recruitment?

    Electronic health records capture patient information using structured controlled vocabularies and unstructured narrative text. While structured data typically encodes lab values, encounters and medication lists, unstructured data captures the physician's interpretation of the patient's condition, prognosis, and response to therapeutic intervention. In this paper, we demonstrate that information extraction from unstructured clinical narratives is essential to most clinical applications. We perform an empirical study to validate the argument and show that structured data alone is insufficient in resolving eligibility criteria for recruiting patients onto clinical trials for chronic lymphocytic leukemia (CLL) and prostate cancer. Unstructured data is essential to solving 59% of the CLL trial criteria and 77% of the prostate cancer trial criteria. More specifically, for resolving eligibility criteria with temporal constraints, we show the need for temporal reasoning and information integration with medical events within and across unstructured clinical narratives and structured data. Comment: AMIA TBI 2014, 6 page

    Enabling qualitative research data sharing using a natural language processing pipeline for deidentification: Moving beyond HIPAA Safe Harbor identifiers

    OBJECTIVE: Sharing health research data is essential for accelerating the translation of research into actionable knowledge that can impact health care services and outcomes. Qualitative health research data are rarely shared due to the challenge of deidentifying text and the potential risks of participant reidentification. Here, we establish and evaluate a framework for deidentifying qualitative research data using automated computational techniques including removal of identifiers that are not considered HIPAA Safe Harbor (HSH) identifiers but are likely to be found in unstructured qualitative data. MATERIALS AND METHODS: We developed and validated a pipeline for deidentifying qualitative research data using automated computational techniques. An in-depth analysis and qualitative review of different types of qualitative health research data were conducted to inform and evaluate the development of a natural language processing (NLP) pipeline using named-entity recognition, pattern matching, dictionary, and regular expression methods to deidentify qualitative texts. RESULTS: We collected 2 datasets with 1.2 million words derived from over 400 qualitative research data documents. We created a gold-standard dataset with 280K words (70 files) to evaluate our deidentification pipeline. The majority of identifiers in qualitative data are non-HSH and not captured by existing systems. Our NLP deidentification pipeline had a consistent F1-score of ∼0.90 for both datasets. CONCLUSION: The results of this study demonstrate that NLP methods can be used to identify both HSH identifiers and non-HSH identifiers. Automated tools to assist researchers with the deidentification of qualitative data will be increasingly important given the new National Institutes of Health (NIH) data-sharing mandate.
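
The abstract above names regular expression and dictionary methods among the pipeline's components. The sketch below illustrates only those two ideas on a toy sentence; it is not the authors' pipeline, it omits named-entity recognition entirely, and the patterns, the dictionary entry, and the placeholder labels are all hypothetical.

```python
import re

# Hypothetical regular-expression patterns; a real pipeline would need many more.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
]

# Hypothetical dictionary of study-specific terms that are not standard
# identifiers but could still reveal who a participant is.
DICTIONARY = {"ORG": ["Riverside Clinic"]}


def deidentify(text: str) -> str:
    """Replace pattern and dictionary matches with bracketed placeholder labels."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    for label, terms in DICTIONARY.items():
        for term in terms:
            # re.escape keeps punctuation in a term from acting as regex syntax.
            text = re.sub(re.escape(term), f"[{label}]", text, flags=re.IGNORECASE)
    return text


sample = "Call 555-123-4567 or email jo@example.org about Riverside Clinic."
redacted = deidentify(sample)
```

Dictionary lookup is the part that can be extended with identifiers specific to one study, such as a local clinic name, which fixed patterns would not catch.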

    Validation of vessel size imaging (VSI) in high-grade human gliomas using magnetic resonance imaging, image-guided biopsies, and quantitative immunohistochemistry.

    We evaluate the association between a vessel size index (VSIMRI) derived from dynamic susceptibility contrast (DSC) perfusion imaging using a custom spin-and-gradient echo echoplanar imaging (SAGE-EPI) sequence and quantitative estimates of vessel morphometry based on immunohistochemistry from image-guided biopsy samples. The current study evaluated both relative cerebral blood volume (rCBV) and VSIMRI in eleven patients with high-grade glioma (7 WHO grade III and 4 WHO grade IV). Following 26 MRI-guided glioma biopsies in these 11 patients, we evaluated tissue morphometry, including vessel density and average radius, using an automated procedure based on the endothelial cell marker CD31 to highlight tumor vasculature. Measures of rCBV and VSIMRI were then compared to histological measures. We demonstrate good agreement between VSI measured by MRI and histology; VSIMRI = 13.67 μm and VSIHistology = 12.60 μm, with slight overestimation of VSIMRI in grade III patients compared to histology. rCBV showed a moderate but significant correlation with vessel density (r = 0.42, p = 0.03), and a correlation was also observed between VSIMRI and VSIHistology (r = 0.49, p = 0.01). The current study supports the hypothesis that vessel size measures using MRI accurately reflect vessel caliber within high-grade gliomas, while traditional measures of rCBV are correlated with vessel density and not vessel caliber.

    Topology of the conceptual network of language

    We define two words in a language to be connected if they express similar concepts. The network of connections among the many thousands of words that make up a language is important not only for the study of the structure and evolution of languages, but also for cognitive science. We study this issue quantitatively, by mapping out the conceptual network of the English language, with the connections being defined by the entries in a Thesaurus dictionary. We find that this network presents a small-world structure, with an amazingly small average shortest path, and appears to exhibit an asymptotic scale-free feature with algebraic connectivity distribution. Comment: 4 pages, 2 figures, Revte
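
The abstract above reports a small average shortest path for a network in which words are linked when a thesaurus lists them together. As a rough illustration of that quantity, the sketch below runs a breadth-first search from every node of a tiny undirected graph. The three-word graph is invented for the example, and the code is not taken from the paper.

```python
from collections import deque
from typing import Dict, Set


def average_shortest_path(graph: Dict[str, Set[str]]) -> float:
    """Mean shortest-path length over all ordered pairs of mutually reachable nodes."""
    total_distance = 0
    pair_count = 0
    for source in graph:
        # Breadth-first search gives shortest distances in an unweighted graph.
        distance = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in graph.get(node, ()):
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        total_distance += sum(distance.values())
        pair_count += len(distance) - 1  # exclude the source itself
    return total_distance / pair_count if pair_count else 0.0


# Toy word network (hypothetical): "big" is listed with "large" and "huge".
toy_network = {
    "big": {"large", "huge"},
    "large": {"big"},
    "huge": {"big"},
}
mean_path = average_shortest_path(toy_network)
```

Running one search per node costs time proportional to nodes times edges, which is fine for a sketch but would likely call for sampling on a network of many thousands of words.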

    Machine learning for modeling the progression of Alzheimer disease dementia using clinical data: A systematic literature review

    OBJECTIVE: Alzheimer disease (AD) is the most common cause of dementia, a syndrome characterized by cognitive impairment severe enough to interfere with activities of daily life. We aimed to conduct a systematic literature review (SLR) of studies that applied machine learning (ML) methods to clinical data derived from electronic health records in order to model risk for progression of AD dementia. MATERIALS AND METHODS: We searched for articles published between January 1, 2010, and May 31, 2020, in PubMed, Scopus, ScienceDirect, IEEE Xplore Digital Library, Association for Computing Machinery Digital Library, and arXiv. We used predefined criteria to select relevant articles and summarized them according to key components of ML analysis such as data characteristics, computational algorithms, and research focus. RESULTS: There has been a considerable rise over the past 5 years in the number of research papers using ML-based analysis for AD dementia modeling. We reviewed 64 relevant articles in our SLR. The results suggest that the majority of existing research has focused on predicting progression of AD dementia using publicly available datasets containing both neuroimaging and clinical data (neurobehavioral status exam scores, patient demographics, neuroimaging data, and laboratory test values). DISCUSSION: Identifying individuals at risk for progression of AD dementia could potentially help to personalize disease management to plan future care. Clinical data consisting of both structured data tables and clinical notes can be effectively used in ML-based approaches to model risk for AD dementia progression. Data sharing and reproducibility of results can enhance the impact, adaptation, and generalizability of this research.

    Ribosomal Proteins RPS11 and RPS20, Two Stress-Response Markers of Glioblastoma Stem Cells, Are Novel Predictors of Poor Prognosis in Glioblastoma Patients.

    Glioblastoma stem cells (GSC), which co-exhibit a tumor-initiating capacity and a radio-chemoresistant phenotype, are a compelling cell model for explaining tumor recurrence. We have previously characterized patient-derived, treatment-resistant GSC clones (TRGC) that survived radiochemotherapy. Compared to glucose-dependent, treatment-sensitive GSC clones (TSGC), TRGC exhibited reduced glucose dependence that favors the fatty acid oxidation pathway as their energy source. Using comparative genome-wide transcriptome analysis, a series of defense signatures associated with TRGC survival were identified and verified by siRNA-based gene knockdown experiments that led to loss of cell integrity. In this study, we investigate the prognostic value of defense signatures in glioblastoma (GBM) patients using gene expression analysis with Probeset Analyzer (131 GBM) and The Cancer Genome Atlas (TCGA) data, and protein expression with a tissue microarray (50 GBM), yielding the first TRGC-derived prognostic biomarkers for GBM patients. Ribosomal protein S11 (RPS11) and RPS20, individually and together, consistently predicted poor survival of newly diagnosed primary GBM tumors when overexpressed at the RNA or protein level [RPS11: Hazard Ratio (HR) = 11.5, p<0.001; RPS20: HR = 4.5, p = 0.03; RPS11+RPS20: HR = 17.99, p = 0.001]. The prognostic significance of RPS11 and RPS20 was further supported by whole tissue section RPS11 immunostaining (27 GBM; HR = 4.05, p = 0.01) and TCGA gene expression data (578 primary GBM; RPS11: HR = 1.19, p = 0.06; RPS20: HR = 1.25, p = 0.02; RPS11+RPS20: HR = 1.43, p = 0.01). Moreover, tumors that exhibited unmethylated O-6-methylguanine-DNA methyltransferase (MGMT) or wild-type isocitrate dehydrogenase 1 (IDH1) were associated with higher RPS11 expression levels [corr (IDH1, RPS11) = 0.64, p = 0.03; corr (MGMT, RPS11) = 0.52, p = 0.04]. These data indicate that increased expression of RPS11 and RPS20 predicts shorter patient survival. The study also suggests that TRGC are clinically relevant cells that represent resistant tumorigenic clones from patient tumors and that their properties, at least in part, are reflected in poor-prognosis GBM. The screening of TRGC signatures may represent a novel alternative strategy for identifying new prognostic biomarkers.