Insights into Analogy Completion from the Biomedical Domain
Analogy completion has been a popular task in recent years for evaluating the
semantic properties of word embeddings, but the standard methodology makes a
number of assumptions about analogies that do not always hold, either in recent
benchmark datasets or when expanding into other domains. Through an analysis of
analogies in the biomedical domain, we identify three assumptions: that of a
Single Answer for any given analogy, that the pairs involved describe the Same
Relationship, and that each pair is Informative with respect to the other. We
propose modifying the standard methodology to relax these assumptions by
allowing for multiple correct answers, reporting MAP and MRR in addition to
accuracy, and using multiple example pairs. We further present BMASS, a novel
dataset for evaluating linguistic regularities in biomedical embeddings, and
demonstrate that the relationships described in the dataset pose significant
semantic challenges to current word embedding methods.
Comment: Accepted to BioNLP 2017. (10 pages
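The evaluation change described in this abstract (multiple correct answers, with MAP and MRR reported alongside accuracy) can be illustrated with a minimal sketch. The function names, the toy queries, and the choice to normalize average precision by the number of correct answers are assumptions made for illustration; the abstract does not specify the exact definitions used in the paper.

```python
from typing import Dict, List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], correct: Set[str]) -> float:
    """Return 1 / rank of the first correct answer, or 0.0 if none appears."""
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in correct:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], correct: Set[str]) -> float:
    """Average precision@k over the ranks k that hold a correct answer.

    Assumption: the sum is divided by the total number of correct answers,
    so answers missing from the ranking lower the score.
    """
    if not correct:
        return 0.0
    hits = 0
    total = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        if candidate in correct:
            hits += 1
            total += hits / rank
    return total / len(correct)


def evaluate(queries: List[Tuple[Sequence[str], Set[str]]]) -> Dict[str, float]:
    """Compute top-1 accuracy, MRR, and MAP over (ranking, answer set) pairs."""
    n = len(queries)
    accuracy = sum(1 for r, c in queries if r and r[0] in c) / n
    mrr = sum(reciprocal_rank(r, c) for r, c in queries) / n
    mean_ap = sum(average_precision(r, c) for r, c in queries) / n
    return {"accuracy": accuracy, "mrr": mrr, "map": mean_ap}


# Hypothetical analogy queries: a ranked candidate list and the set of
# answers accepted as correct for that analogy.
toy_queries = [
    (["a", "b", "c"], {"b", "c"}),
    (["x", "y"], {"x"}),
]
scores = evaluate(toy_queries)
```

With a single accepted answer per query, average precision reduces to the reciprocal rank, so MAP and MRR only diverge once an analogy has several correct completions.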
Jointly Embedding Entities and Text with Distant Supervision
Learning representations for knowledge base entities and concepts is becoming
increasingly important for NLP applications. However, recent entity embedding
methods have relied on structured resources that are expensive to create for
new domains and corpora. We present a distantly-supervised method for jointly
learning embeddings of entities and text from an unannotated corpus, using only
a list of mappings between entities and surface forms. We learn embeddings from
open-domain and biomedical corpora, and compare against prior methods that rely
on human-annotated text or large knowledge graph structure. Our embeddings
capture entity similarity and relatedness better than prior work, both in
existing biomedical datasets and a new Wikipedia-based dataset that we release
to the community. Results on analogy completion and entity sense disambiguation
indicate that entities and words capture complementary information that can be
effectively combined for downstream use.
Comment: 12 pages; Accepted to 3rd Workshop on Representation Learning for NLP
(Repl4NLP 2018). Code at https://github.com/OSU-slatelab/JE
Enabling qualitative research data sharing using a natural language processing pipeline for deidentification: Moving beyond HIPAA Safe Harbor identifiers
OBJECTIVE: Sharing health research data is essential for accelerating the translation of research into actionable knowledge that can impact health care services and outcomes. Qualitative health research data are rarely shared due to the challenge of deidentifying text and the potential risks of participant reidentification. Here, we establish and evaluate a framework for deidentifying qualitative research data using automated computational techniques including removal of identifiers that are not considered HIPAA Safe Harbor (HSH) identifiers but are likely to be found in unstructured qualitative data.
MATERIALS AND METHODS: We developed and validated a pipeline for deidentifying qualitative research data using automated computational techniques. An in-depth analysis and qualitative review of different types of qualitative health research data were conducted to inform and evaluate the development of a natural language processing (NLP) pipeline using named-entity recognition, pattern matching, dictionary, and regular expression methods to deidentify qualitative texts.
RESULTS: We collected 2 datasets with 1.2 million words derived from over 400 qualitative research data documents. We created a gold-standard dataset with 280K words (70 files) to evaluate our deidentification pipeline. The majority of identifiers in qualitative data are non-HSH and not captured by existing systems. Our NLP deidentification pipeline had a consistent F1-score of ∼0.90 for both datasets.
CONCLUSION: The results of this study demonstrate that NLP methods can be used to identify both HSH identifiers and non-HSH identifiers. Automated tools to assist researchers with the deidentification of qualitative data will be increasingly important given the new National Institutes of Health (NIH) data-sharing mandate
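Two of the methods named in this abstract, regular expressions and dictionary lookup, can be sketched as follows. This is an illustration only and not the authors' pipeline: the patterns, the placeholder tags, and the `deidentify` function are hypothetical, and the named-entity recognition stage is omitted entirely.

```python
import re
from typing import Iterable

# Hypothetical patterns for a few identifier types; a real pipeline would
# need far broader coverage and validation against a gold standard.
PATTERNS = [
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
]


def deidentify(text: str, dictionary: Iterable[str] = ()) -> str:
    """Replace pattern matches and dictionary terms with bracketed tags."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    for term in dictionary:
        # Whole-word, case-insensitive match on each dictionary entry.
        text = re.sub(rf"\b{re.escape(term)}\b", "[TERM]", text,
                      flags=re.IGNORECASE)
    return text


sample = "Call Dr. Alvarez at 555-123-4567 or jo@example.org on 3/14/2021."
redacted = deidentify(sample, dictionary=["Alvarez"])
```

The dictionary pass is where identifiers outside the HIPAA Safe Harbor list (for example, study-specific site or clinic names) could be supplied by the researcher.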
Diffusion MR Characteristics Following Concurrent Radiochemotherapy Predicts Progression-Free and Overall Survival in Newly Diagnosed Glioblastoma.
The standard of care for newly diagnosed glioblastoma (GBM) is surgery, then radiotherapy (RT) with concurrent temozolomide (TMZ), followed by adjuvant TMZ. We hypothesized that patients with low diffusivity, measured using apparent diffusion coefficient (ADC) histogram analysis after RT+TMZ and prior to adjuvant TMZ, would have significantly shorter progression-free survival (PFS) and overall survival (OS). To test this hypothesis, we evaluated 120 patients with newly diagnosed GBM receiving RT+TMZ followed by adjuvant TMZ. MRI was performed after completion of RT+TMZ, prior to initiation of adjuvant TMZ. A double Gaussian mixed model was used to describe the ADC histograms within the enhancing tumor, where ADCL and ADCH were defined as the mean ADC value of the lower and higher Gaussian distribution, respectively. An ADCL value of 1.0 μm²/ms and an ADCH value of 1.6 μm²/ms were used to stratify patients into high and low risk categories. Results suggest patients with low ADCL had significantly shorter PFS (Cox Hazard Ratio = 0.12, P = 0.0006). OS was significantly shorter with low ADCL tumors, showing a median OS of 407 vs. 644 days (Cox Hazard Ratio = 0.31, P = 0.047). ADCH was not predictive of PFS or OS when accounting for age and ADCL. In summary, newly diagnosed glioblastoma patients with low ADCL after completion of RT+TMZ are likely to progress and die earlier than patients with higher ADCL. Results suggest ADC histogram analysis may be useful for patient risk stratification following completion of RT+TMZ.
Validation of vessel size imaging (VSI) in high-grade human gliomas using magnetic resonance imaging, image-guided biopsies, and quantitative immunohistochemistry.
To evaluate the association between a vessel size index (VSIMRI) derived from dynamic susceptibility contrast (DSC) perfusion imaging using a custom spin-and-gradient echo echoplanar imaging (SAGE-EPI) sequence and quantitative estimates of vessel morphometry based on immunohistochemistry from image-guided biopsy samples. The current study evaluated both relative cerebral blood volume (rCBV) and VSIMRI in eleven patients with high-grade glioma (7 WHO grade III and 4 WHO grade IV). Following 26 MRI-guided glioma biopsies in these 11 patients, we evaluated tissue morphometry, including vessel density and average radius, using an automated procedure based on the endothelial cell marker CD31 to highlight tumor vasculature. Measures of rCBV and VSIMRI were then compared to histological measures. We demonstrate good agreement between VSI measured by MRI and histology; VSIMRI = 13.67 μm and VSIHistology = 12.60 μm, with slight overestimation of VSIMRI in grade III patients compared to histology. rCBV showed a moderate but significant correlation with vessel density (r = 0.42, p = 0.03), and a correlation was also observed between VSIMRI and VSIHistology (r = 0.49, p = 0.01). The current study supports the hypothesis that vessel size measures using MRI accurately reflect vessel caliber within high-grade gliomas, while traditional measures of rCBV are correlated with vessel density and not vessel caliber
Topology of the conceptual network of language
We define two words in a language to be connected if they express similar
concepts. The network of connections among the many thousands of words that
make up a language is important not only for the study of the structure and
evolution of languages, but also for cognitive science. We study this issue
quantitatively, by mapping out the conceptual network of the English language,
with the connections being defined by the entries in a Thesaurus dictionary. We
find that this network presents a small-world structure, with an amazingly
small average shortest path, and appears to exhibit an asymptotic scale-free
feature with algebraic connectivity distribution.
Comment: 4 pages, 2 figures, Revte
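The small-world claim in this abstract rests on the average shortest path between words. A minimal sketch of that measurement on an unweighted graph is given below; the four-word adjacency map is a made-up stand-in for thesaurus entries, and the function names are assumptions for illustration.

```python
from collections import deque
from typing import Dict, Set


def bfs_distances(graph: Dict[str, Set[str]], source: str) -> Dict[str, int]:
    """Breadth-first search: hop counts from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def average_shortest_path(graph: Dict[str, Set[str]]) -> float:
    """Mean shortest-path length over all ordered pairs of connected nodes."""
    total = 0
    pairs = 0
    for source in graph:
        for target, d in bfs_distances(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


# Hypothetical toy network: two words are linked if one lists the other
# as a synonym. Edges are stored in both directions.
toy = {
    "happy": {"glad", "content"},
    "glad": {"happy"},
    "content": {"happy", "calm"},
    "calm": {"content"},
}
avg_path = average_shortest_path(toy)
```

Running one breadth-first search per node costs time proportional to nodes times edges, which is practical for a network of some tens of thousands of words but would call for sampling on much larger graphs.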
Machine learning for modeling the progression of Alzheimer disease dementia using clinical data: A systematic literature review
OBJECTIVE: Alzheimer disease (AD) is the most common cause of dementia, a syndrome characterized by cognitive impairment severe enough to interfere with activities of daily life. We aimed to conduct a systematic literature review (SLR) of studies that applied machine learning (ML) methods to clinical data derived from electronic health records in order to model risk for progression of AD dementia.
MATERIALS AND METHODS: We searched for articles published between January 1, 2010, and May 31, 2020, in PubMed, Scopus, ScienceDirect, IEEE Xplore Digital Library, Association for Computing Machinery Digital Library, and arXiv. We used predefined criteria to select relevant articles and summarized them according to key components of ML analysis such as data characteristics, computational algorithms, and research focus.
RESULTS: There has been a considerable rise over the past 5 years in the number of research papers using ML-based analysis for AD dementia modeling. We reviewed 64 relevant articles in our SLR. The results suggest that the majority of existing research has focused on predicting progression of AD dementia using publicly available datasets containing both neuroimaging and clinical data (neurobehavioral status exam scores, patient demographics, neuroimaging data, and laboratory test values).
DISCUSSION: Identifying individuals at risk for progression of AD dementia could potentially help to personalize disease management to plan future care. Clinical data consisting of both structured data tables and clinical notes can be effectively used in ML-based approaches to model risk for AD dementia progression. Data sharing and reproducibility of results can enhance the impact, adaptation, and generalizability of this research
Ribosomal Proteins RPS11 and RPS20, Two Stress-Response Markers of Glioblastoma Stem Cells, Are Novel Predictors of Poor Prognosis in Glioblastoma Patients.
Glioblastoma stem cells (GSC) co-exhibiting a tumor-initiating capacity and a radio-chemoresistant phenotype are a compelling cell model for explaining tumor recurrence. We have previously characterized patient-derived, treatment-resistant GSC clones (TRGC) that survived radiochemotherapy. Compared to glucose-dependent, treatment-sensitive GSC clones (TSGC), TRGC exhibited reduced glucose dependence that favors the fatty acid oxidation pathway as their energy source. Using comparative genome-wide transcriptome analysis, a series of defense signatures associated with TRGC survival were identified and verified by siRNA-based gene knockdown experiments that led to loss of cell integrity. In this study, we investigate the prognostic value of defense signatures in glioblastoma (GBM) patients using gene expression analysis with Probeset Analyzer (131 GBM) and The Cancer Genome Atlas (TCGA) data, and protein expression with a tissue microarray (50 GBM), yielding the first TRGC-derived prognostic biomarkers for GBM patients. Ribosomal protein S11 (RPS11) and RPS20, individually and together, consistently predicted poor survival of newly diagnosed primary GBM tumors when overexpressed at the RNA or protein level [RPS11: Hazard Ratio (HR) = 11.5, p<0.001; RPS20: HR = 4.5, p = 0.03; RPS11+RPS20: HR = 17.99, p = 0.001]. The prognostic significance of RPS11 and RPS20 was further supported by whole tissue section RPS11 immunostaining (27 GBM; HR = 4.05, p = 0.01) and TCGA gene expression data (578 primary GBM; RPS11: HR = 1.19, p = 0.06; RPS20: HR = 1.25, p = 0.02; RPS11+RPS20: HR = 1.43, p = 0.01). Moreover, tumors that exhibited unmethylated O-6-methylguanine-DNA methyltransferase (MGMT) or wild-type isocitrate dehydrogenase 1 (IDH1) were associated with higher RPS11 expression levels [corr (IDH1, RPS11) = 0.64, p = 0.03; corr (MGMT, RPS11) = 0.52, p = 0.04]. These data indicate that increased expression of RPS11 and RPS20 predicts shorter patient survival.
The study also suggests that TRGC are clinically relevant cells that represent resistant tumorigenic clones from patient tumors and that their properties, at least in part, are reflected in poor-prognosis GBM. The screening of TRGC signatures may represent a novel alternative strategy for identifying new prognostic biomarkers