14 research outputs found

    Towards Automatic Generation of Shareable Synthetic Clinical Notes Using Neural Language Models

    Large-scale clinical data is invaluable to driving many computational scientific advances today. However, understandable concerns regarding patient privacy hinder the open dissemination of such data and give rise to suboptimal, siloed research. De-identification methods attempt to address these concerns but have been shown to be susceptible to adversarial attacks. In this work, we focus on the vast amounts of unstructured natural language data stored in clinical notes and propose to automatically generate synthetic clinical notes that are more amenable to sharing, using generative models trained on real de-identified records. To evaluate the merit of such notes, we measure both their privacy-preservation properties and their utility in training clinical NLP models. Experiments using neural language models yield notes whose utility is close to that of the real ones in some clinical NLP tasks, yet leave ample room for future improvements. Comment: Clinical NLP Workshop 201

    A Simple Language Model based on PMI Matrix Approximations

    In this study, we introduce a new approach for learning language models by training them to estimate word-context pointwise mutual information (PMI), and then deriving the desired conditional probabilities from PMI at test time. Specifically, we show that with minor modifications to word2vec's algorithm, we get principled language models that are closely related to the well-established Noise Contrastive Estimation (NCE) based language models. A compelling aspect of our approach is that our models are trained with the same simple negative sampling objective function that is commonly used in word2vec to learn word embeddings. Comment: Accepted to EMNLP 201
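
    The step of deriving conditional probabilities from PMI can be illustrated with a minimal sketch. It relies only on the definition PMI(w, c) = log p(w | c) - log p(w); the function name `conditional_from_pmi`, the toy numbers, and the explicit renormalization are assumptions made for illustration and are not taken from the paper.

```python
import math

def conditional_from_pmi(pmi_row, unigram):
    """Turn PMI scores for one context into p(word | context).

    pmi_row: dict word -> estimated PMI(word, context)
    unigram: dict word -> unigram probability p(word)
    """
    # By definition PMI(w, c) = log p(w | c) - log p(w),
    # so p(w | c) = p(w) * exp(PMI(w, c)).
    unnormalized = {w: unigram[w] * math.exp(pmi_row[w]) for w in unigram}
    # Estimated PMI values are only approximate, so renormalize
    # (an assumption of this sketch) to get a proper distribution.
    total = sum(unnormalized.values())
    return {w: v / total for w, v in unnormalized.items()}

# Toy numbers, made up for illustration only.
unigram = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
pmi_row = {"cat": 0.0, "dog": math.log(2.0), "fish": math.log(0.5)}
probs = conditional_from_pmi(pmi_row, unigram)
```

    In this toy case a positive PMI lifts "dog" above its unigram probability, while a negative PMI pushes "fish" below it.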

    Dotted interval graphs and high throughput genotyping

    We introduce a generalization of interval graphs, which we call dotted interval graphs (DIG). A dotted interval graph is an intersection graph of arithmetic progressions (= dotted intervals). Coloring of dotted interval graphs naturally arises in the context of high throughput genotyping. We study the properties of dotted interval graphs, with a focus on coloring. We show that any graph is a DIG but that DIG_d graphs, i.e. DIGs in which the arithmetic progressions have a jump of at most d, form a strict hierarchy. We show that coloring DIG_d graphs is NP-complete even for d = 2. For any fixed d, we provide a (7/8)d approximation for the coloring of DIG_d graphs.
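
    The definition above can be made concrete with a short sketch that builds the intersection graph of a few arithmetic progressions and colors it. The helper names and the example progressions are hypothetical, and the coloring is a plain first-fit greedy heuristic, not the approximation algorithm from the paper.

```python
from itertools import combinations

def dotted_interval(start, jump, count):
    """Arithmetic progression start, start+jump, ... with `count` points."""
    return frozenset(start + i * jump for i in range(count))

def intersection_graph(intervals):
    """Adjacency sets: two dotted intervals are adjacent if they share a point."""
    adj = {i: set() for i in range(len(intervals))}
    for i, j in combinations(range(len(intervals)), 2):
        if intervals[i] & intervals[j]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def greedy_coloring(adj):
    """First-fit coloring in vertex order (a heuristic, not the paper's method)."""
    colors = {}
    for v in adj:
        used = {colors[u] for u in adj[v] if u in colors}
        # With n vertices, at most n - 1 colors are in use by neighbors.
        colors[v] = next(c for c in range(len(adj)) if c not in used)
    return colors

# Three dotted intervals with jumps 2, 2 and 3 (so a DIG_3 instance).
intervals = [
    dotted_interval(0, 2, 4),  # {0, 2, 4, 6}
    dotted_interval(1, 2, 4),  # {1, 3, 5, 7}
    dotted_interval(3, 3, 2),  # {3, 6}
]
adj = intersection_graph(intervals)
colors = greedy_coloring(adj)
```

    The first two progressions cover the overlapping range 0..7 yet share no point, so they are not adjacent. This is the sense in which dotted intervals generalize ordinary intervals.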