14 research outputs found
Towards Automatic Generation of Shareable Synthetic Clinical Notes Using Neural Language Models
Large-scale clinical data is invaluable to driving many computational
scientific advances today. However, understandable concerns regarding patient
privacy hinder the open dissemination of such data and give rise to suboptimal
siloed research. De-identification methods attempt to address these concerns
but were shown to be susceptible to adversarial attacks. In this work, we focus
on the vast amounts of unstructured natural language data stored in clinical
notes and propose to automatically generate synthetic clinical notes that are
more amenable to sharing using generative models trained on real de-identified
records. To evaluate the merit of such notes, we measure both their privacy
preservation properties as well as utility in training clinical NLP models.
Experiments using neural language models yield notes whose utility is close to
that of the real ones in some clinical NLP tasks, yet leave ample room for
future improvements.
Comment: Clinical NLP Workshop 201
A Simple Language Model based on PMI Matrix Approximations
In this study, we introduce a new approach for learning language models by
training them to estimate word-context pointwise mutual information (PMI), and
then deriving the desired conditional probabilities from PMI at test time.
Specifically, we show that with minor modifications to word2vec's algorithm, we
get principled language models that are closely related to the well-established
Noise Contrastive Estimation (NCE) based language models. A compelling aspect
of our approach is that our models are trained with the same simple negative
sampling objective function that is commonly used in word2vec to learn word
embeddings.
Comment: Accepted to EMNLP 201
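The step of deriving conditional probabilities from PMI at test time can be sketched as follows. This is a minimal illustration, not the authors' implementation. It relies only on the identity PMI(w, c) = log p(w | c) - log p(w), so that p(w | c) is proportional to p(w) * exp(PMI(w, c)). The function name `conditional_from_pmi`, the toy vocabulary, and all probability values are hypothetical.

```python
import math

def conditional_from_pmi(pmi_scores, unigram):
    """Turn PMI scores for one fixed context c into a distribution p(w | c).

    pmi_scores: dict mapping word -> estimated PMI(w, c)
    unigram:    dict mapping word -> unigram probability p(w)
    """
    # PMI(w, c) = log p(w | c) - log p(w), hence p(w | c) = p(w) * exp(PMI(w, c)).
    unnorm = {w: unigram[w] * math.exp(pmi_scores[w]) for w in unigram}
    # Estimated PMI values are inexact, so renormalize over the vocabulary.
    z = sum(unnorm.values())
    return {w: v / z for w, v in unnorm.items()}

# Toy check: start from a known (hypothetical) conditional distribution,
# compute its exact PMI values, and recover the distribution from them.
unigram = {"cat": 0.5, "dog": 0.3, "fish": 0.2}
true_cond = {"cat": 0.7, "dog": 0.2, "fish": 0.1}
pmi = {w: math.log(true_cond[w] / unigram[w]) for w in unigram}
recovered = conditional_from_pmi(pmi, unigram)
```

With exact PMI values the normalizer is 1 and the original conditional distribution is recovered; with learned PMI estimates the explicit renormalization keeps the output a valid distribution.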
Dotted interval graphs and high throughput genotyping
We introduce a generalization of interval graphs, which we call dotted interval graphs (DIG). A dotted interval graph is an intersection graph of arithmetic progressions (= dotted intervals). Coloring of dotted interval graphs naturally arises in the context of high throughput genotyping. We study the properties of dotted interval graphs, with a focus on coloring. We show that any graph is a DIG but that DIG_d graphs, i.e. DIGs in which the arithmetic progressions have a jump of at most d, form a strict hierarchy. We show that coloring DIG_d graphs is NP-complete even for d = 2. For any fixed d, we provide a (7/8)d approximation for the coloring of DIG_d graphs.
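To make the definition concrete, the sketch below builds the intersection graph of a few dotted intervals and colors it. It assumes a dotted interval is given by a start, a jump, and a number of points; the helper names are hypothetical. The coloring routine is a plain greedy heuristic used for illustration only, not the approximation algorithm of the paper.

```python
from itertools import combinations

def dotted_interval(start, jump, count):
    """Points of the arithmetic progression start, start + jump, ... (count terms)."""
    return frozenset(start + i * jump for i in range(count))

def intersection_graph(intervals):
    """Adjacency sets: i and j are adjacent iff their progressions share a point."""
    adj = {i: set() for i in range(len(intervals))}
    for i, j in combinations(range(len(intervals)), 2):
        if intervals[i] & intervals[j]:
            adj[i].add(j)
            adj[j].add(i)
    return adj

def greedy_coloring(adj):
    """Plain greedy coloring in vertex order (illustrative, not the paper's method)."""
    colors = {}
    for v in sorted(adj):
        used = {colors[u] for u in adj[v] if u in colors}
        # A vertex has fewer than len(adj) neighbors, so a free color always exists.
        colors[v] = next(c for c in range(len(adj)) if c not in used)
    return colors

# Three dotted intervals with jump at most 2, i.e. a DIG_2 instance.
intervals = [
    dotted_interval(0, 2, 4),  # points 0, 2, 4, 6
    dotted_interval(1, 2, 4),  # points 1, 3, 5, 7
    dotted_interval(3, 1, 2),  # points 3, 4
]
adj = intersection_graph(intervals)
colors = greedy_coloring(adj)
```

The first two progressions interleave without sharing a point, so they are not adjacent even though their spans overlap; this is what separates dotted interval graphs from ordinary interval graphs.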