171 research outputs found
MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining
One of the biggest challenges prohibiting the use of many current NLP methods in clinical settings is the scarcity of publicly available datasets. In this work, we present MeDAL, a large medical text dataset curated for abbreviation disambiguation, designed for natural language understanding pre-training in the medical domain. We pre-trained several models of common architectures on this dataset and empirically showed that such pre-training leads to improved performance and faster convergence when fine-tuning on downstream medical tasks.
Comment: EMNLP 2020 Clinical NLP
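As a rough illustration of the pre-training task MeDAL is built for, abbreviation disambiguation can be framed as classification over candidate expansions, conditioned on the abbreviation's surrounding context. The sketch below is a minimal, hypothetical PyTorch setup; the encoder, vocabulary sizes, and toy batch are placeholders, not the authors' actual models or data.

```python
# Minimal sketch of abbreviation disambiguation as classification.
# All sizes and the toy batch are illustrative placeholders, not the
# authors' actual MeDAL setup.
import torch
import torch.nn as nn

class AbbrevDisambiguator(nn.Module):
    def __init__(self, vocab_size=30000, emb_dim=128, hidden=256, n_expansions=5000):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_expansions)

    def forward(self, token_ids, abbrev_pos):
        # Encode the full context, then classify the expansion from the
        # hidden state at the abbreviation's position.
        states, _ = self.lstm(self.emb(token_ids))                    # (B, T, 2H)
        at_abbrev = states[torch.arange(states.size(0)), abbrev_pos]  # (B, 2H)
        return self.head(at_abbrev)                                   # logits over expansions

model = AbbrevDisambiguator()
tokens = torch.randint(0, 30000, (4, 64))  # toy batch: 4 contexts of 64 tokens each
pos = torch.tensor([10, 3, 42, 7])         # position of the abbreviation in each context
loss = nn.functional.cross_entropy(model(tokens, pos), torch.randint(0, 5000, (4,)))
```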
Surf at MEDIQA 2019: Improving Performance of Natural Language Inference in the Clinical Domain by Adopting Pre-trained Language Model
While deep learning techniques have shown promising results in many natural language processing (NLP) tasks, they have not been widely applied in the clinical domain. The lack of large datasets and the pervasive use of domain-specific language (e.g., abbreviations and acronyms) have made progress in clinical NLP slower than in general-domain NLP. To fill this gap, we employ word- and subword-level models that adopt large-scale, data-driven methods such as pre-trained language models and transfer learning for analyzing clinical text. Empirical results demonstrate the superiority of the proposed methods, which achieve 90.6% accuracy on a medical-domain natural language inference task. Furthermore, we inspect the independent strengths of the proposed approaches both quantitatively and qualitatively. This analysis will help researchers select the necessary components when building models for the medical domain.
Comment: 9 pages, accepted to the ACL 2019 workshop on BioNLP
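The general recipe the paper adopts, fine-tuning a pre-trained language model on sentence pairs, can be sketched with the Hugging Face transformers library. The checkpoint, label set, and example pair below are illustrative assumptions, not the authors' exact configuration.

```python
# Generic sketch of sentence-pair natural language inference with a
# pre-trained encoder, using the Hugging Face `transformers` library.
# The checkpoint and the example pair are illustrative, not the paper's setup.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # entailment / neutral / contradiction

premise = "The patient was started on IV antibiotics for sepsis."
hypothesis = "The patient received antibiotics."
inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (untrained head: near-uniform)
```

In practice the classification head would be fine-tuned on labeled premise/hypothesis pairs before the probabilities are meaningful.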
A Generative Model of Words and Relationships from Multiple Sources
Neural language models are a powerful tool for embedding words into semantic vector spaces. However, learning such models generally relies on the availability of abundant and diverse training examples. In highly specialised domains this requirement may not be met, due to difficulties in obtaining a large corpus or the limited range of expression in average use. Such domains may instead encode prior knowledge about entities in a knowledge base or ontology. We propose a generative model that integrates evidence from diverse data sources, enabling the sharing of semantic information. We achieve this by generalising the concept of co-occurrence from distributional semantics to include other relationships between entities or words, which we model as affine transformations on the embedding space. We demonstrate the effectiveness of this approach by outperforming recent models on a link-prediction task and by showing its ability to profit from partially or fully unobserved training labels. We further demonstrate the usefulness of learning from different data sources with overlapping vocabularies.
Comment: 8 pages, 5 figures; incorporated feedback from reviewers; to appear in Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, 2016
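The central modelling idea, relationships as affine transformations on the embedding space, can be read as a link-prediction scoring rule: a triple (head, relation, tail) scores highly when the relation's affine map carries the head embedding close to the tail embedding. The sketch below illustrates only this scoring idea, not the paper's full generative model; all names and sizes are placeholders.

```python
# Sketch: relationships as affine transformations on an embedding space.
# A triple (head, rel, tail) is scored by how close the relation's affine
# map sends the head embedding to the tail embedding. Illustrative only.
import torch
import torch.nn as nn

class AffineRelationScorer(nn.Module):
    def __init__(self, n_entities, n_relations, dim=50):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        # One affine map (A_r, b_r) per relation, initialised near the identity.
        self.A = nn.Parameter(torch.eye(dim).repeat(n_relations, 1, 1))
        self.b = nn.Parameter(torch.zeros(n_relations, dim))

    def score(self, head, rel, tail):
        # Higher score = A_r @ e_head + b_r lands closer to e_tail.
        mapped = torch.bmm(self.A[rel], self.ent(head).unsqueeze(-1)).squeeze(-1) + self.b[rel]
        return -((mapped - self.ent(tail)) ** 2).sum(dim=-1)

scorer = AffineRelationScorer(n_entities=1000, n_relations=10)
h, r, t = torch.tensor([3, 7]), torch.tensor([0, 2]), torch.tensor([42, 9])
print(scorer.score(h, r, t))  # one score per candidate triple
```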
Privacy-Preserving Predictive Modeling: Harmonization of Contextual Embeddings From Different Sources
Background: Data sharing has been a major challenge in biomedical informatics because of privacy concerns. Contextual embedding models have demonstrated a strong capability to represent medical concepts (and their context), and they have shown promise as an alternative way to support deep-learning applications without the need to disclose original data. However, contextual embedding models acquired from individual hospitals cannot be directly combined, because their embedding spaces differ and naive pooling renders the combined embeddings useless.
Objective: The aim of this study was to present a novel approach that addresses these issues and promotes the sharing of representations without sharing data. Without sacrificing privacy, we also aimed to build a global model from representations learned from local private data and to synchronize information from multiple sources.
Methods: We propose a methodology that harmonizes different local contextual embeddings into a global model. We used Word2Vec to generate contextual embeddings from each source and Procrustes to fuse different vector models into one common space by using a list of corresponding pairs as anchor points. We performed prediction analysis with harmonized embeddings.
Results: We used sequential medical events extracted from the Medical Information Mart for Intensive Care III (MIMIC-III) database to evaluate the proposed methodology in predicting the next likely diagnosis of a new patient, using either structured or unstructured data. Under different experimental scenarios, we confirmed that the global model built from harmonized local models achieves more accurate predictions than local models and than global models built by naive pooling.
Conclusions: Such aggregation of local models using our harmonization approach can serve as a proxy for a global model, combining information from a wide range of institutions and information sources. It allows information unique to a certain hospital to become available to other sites, increasing the fluidity of information flow in health care.
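The harmonization step described in the Methods, aligning independently trained Word2Vec spaces through anchor pairs, is an instance of the orthogonal Procrustes problem, which has a closed-form SVD solution. The sketch below uses toy random vectors in place of real Word2Vec embeddings; in the paper's setting, the anchors would be embeddings of concepts shared across hospitals.

```python
# Minimal sketch of Procrustes harmonization: rotate one locally trained
# embedding space onto a reference space using anchor pairs. Toy random
# data stands in for Word2Vec vectors of shared medical concepts.
import numpy as np

def procrustes_align(local, reference):
    """Find the orthogonal R minimizing ||local @ R - reference||_F (SVD closed form)."""
    u, _, vt = np.linalg.svd(local.T @ reference)
    return u @ vt

rng = np.random.default_rng(0)
ref_anchors = rng.normal(size=(100, 50))              # anchor vectors in the global space
true_rot = np.linalg.qr(rng.normal(size=(50, 50)))[0]  # hidden rotation between spaces
local_anchors = ref_anchors @ true_rot                 # same concepts, locally rotated

R = procrustes_align(local_anchors, ref_anchors)
aligned = local_anchors @ R
print(np.allclose(aligned, ref_anchors, atol=1e-6))  # True: the rotation is recovered
```

Once each local space has been rotated into the common space this way, embeddings from different sites can be pooled meaningfully, which is what naive pooling without alignment fails to achieve.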