MASK: A flexible framework to facilitate de-identification of clinical texts
Medical health records and clinical summaries contain a vast amount of
important information in textual form that can help advance research on
treatments, drugs and public health. However, most of this information
is not shared because it contains private information about patients, their
families, or the medical staff treating them. Regulations such as HIPAA in the US,
PHIPA in Canada and GDPR regulate the protection, processing and distribution
of this information. If this information is de-identified, with personal
information replaced or redacted, it could be distributed to the research
community. In this paper, we present MASK, a software package that is designed
to perform the de-identification task. The software is able to perform named
entity recognition using some of the state-of-the-art techniques and then mask
or redact recognized entities. The user can select the named entity
recognition algorithm (currently implemented are two versions of CRF-based
techniques and a BiLSTM-based neural network with pre-trained GloVe and ELMo
embeddings) and the masking algorithm (e.g. shift dates, replace names/locations,
or fully redact the entity)
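The masking step described in this abstract can be illustrated with a minimal sketch. It is not taken from the MASK package: the function name `mask_text`, the `(start, end, label)` entity format, the label names, and the ISO date format are all assumptions made for illustration, and the entity spans are assumed to come from an upstream NER model.

```python
from datetime import datetime, timedelta


def mask_text(text, entities, day_shift=30, name_placeholder="[NAME]"):
    """Apply a per-label masking strategy to recognized entity spans.

    entities: list of (start, end, label) character offsets (hypothetical format).
    DATE spans are shifted by `day_shift` days, NAME spans are replaced with a
    placeholder, and every other label is fully redacted.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])  # copy untouched text before the span
        span = text[start:end]
        if label == "DATE":
            # Assumes dates are written as YYYY-MM-DD.
            shifted = datetime.strptime(span, "%Y-%m-%d") + timedelta(days=day_shift)
            pieces.append(shifted.strftime("%Y-%m-%d"))
        elif label == "NAME":
            pieces.append(name_placeholder)
        else:
            pieces.append("XXX")  # total redaction
        cursor = end
    pieces.append(text[cursor:])  # copy the remainder after the last span
    return "".join(pieces)
```

Keeping the strategy choice separate from entity recognition mirrors the abstract's point that the two algorithms are selected independently.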
An Instance Transfer based Approach Using Enhanced Recurrent Neural Network for Domain Named Entity Recognition
Recently, neural networks have shown promising results for named entity
recognition (NER), which requires a sizable amount of labeled data for model training.
In a new domain (target domain), there is little or no labeled
data, which makes domain NER much more difficult. As NER has been researched
for a long time, some similar domains already have well-labelled data (source
domain). Therefore, in this paper, we focus on domain NER by studying how to
utilize the labelled data from such a similar source domain for the new target
domain. We design a kernel function based instance transfer strategy by getting
similar labelled sentences from a source domain. Moreover, we propose an
enhanced recurrent neural network (ERNN) by adding an additional layer that
combines the source domain labelled data into the traditional RNN structure.
Comprehensive experiments are conducted on two datasets. The comparison results
among HMM, CRF and RNN show that RNN performs better than the others. When there is
no labelled data in the target domain, compared to directly using the source domain
labelled data without selecting transferred instances, our enhanced RNN
approach improves the F1 measure from 0.8052 to 0.9328
Syllable-based Neural Named Entity Recognition for Myanmar Language
Named Entity Recognition (NER) for Myanmar Language is essential to Myanmar
natural language processing research work. In this work, NER for Myanmar
language is treated as a sequence tagging problem and the effectiveness of deep
neural networks on NER for Myanmar language has been investigated. Experiments
are performed by applying deep neural network architectures on syllable level
Myanmar contexts. The very first manually annotated NER corpus for the Myanmar language
is also constructed and proposed. In developing our in-house NER corpus,
sentences from online news websites and sentences from the
ALT-Parallel-Corpus are used. This ALT corpus is one part of the Asian
Language Treebank (ALT) project under ASEAN IVO. This paper contributes the
first evaluation of neural network models on the NER task for the Myanmar language. The
experimental results show that those neural sequence models can produce
promising results compared to the baseline CRF model. Among those neural
architectures, a bidirectional LSTM network with a CRF layer on top gives the
highest F-score. This work also aims to discover the effectiveness of
neural network approaches to Myanmar textual processing as well as to promote
further research on this understudied language.
Comment: Myanmar NE
Investigating how well contextual features are captured by bi-directional recurrent neural network models
Learning algorithms for natural language processing (NLP) tasks traditionally
rely on manually defined relevant contextual features. On the other hand,
neural network models using only distributional representations of words have
been successfully applied to several NLP tasks. Such models learn features
automatically and avoid explicit feature engineering. Across several domains,
neural models become a natural choice, especially when little is known about
the characteristics of the data. However, this flexibility comes at the cost of
interpretability. In this paper, we define three different methods to
investigate the ability of bi-directional recurrent neural networks (RNNs) to
capture contextual features. In particular, we analyze RNNs for sequence
tagging tasks. We perform a comprehensive analysis on general as well as
biomedical domain datasets. Our experiments focus on important contextual words
as features, which can easily be extended to analyze various other feature
types. We also investigate positional effects of context words and show how the
developed methods can be used for error analysis.
Comment: Camera ready version of ICON-201
A Biomedical Information Extraction Primer for NLP Researchers
Biomedical Information Extraction is an exciting field at the crossroads of
Natural Language Processing, Biology and Medicine. It encompasses a variety of
different tasks that require application of state-of-the-art NLP techniques,
such as NER and Relation Extraction. This paper provides an overview of the
problems in the field and discusses some of the techniques used for solving
them
Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches
This work investigates multiple approaches to Named Entity Recognition (NER)
for text in Electronic Health Record (EHR) data. In particular, we look into
the application of (i) rule-based, (ii) deep learning and (iii) transfer
learning systems for the task of NER on brain imaging reports with a focus on
records from patients with stroke. We explore the strengths and weaknesses of
each approach, develop rules and train on a common dataset, and evaluate each
system's performance on common test sets of Scottish radiology reports from two
sources (brain imaging reports in ESS -- Edinburgh Stroke Study data collected
by NHS Lothian as well as radiology reports created in NHS Tayside). Our
comparison shows that a hand-crafted system is the most accurate way to
automatically label EHR, but machine learning approaches can provide a feasible
alternative where resources for a manual system are not readily available.
Comment: 8 pages, presented at HealTAC 2019, Cardiff, 24-25/04/201
Audio De-identification: A New Entity Recognition Task
Named Entity Recognition (NER) has been mostly studied in the context of
written text. Specifically, NER is an important step in de-identification
(de-ID) of medical records, many of which are recorded conversations between a
patient and a doctor. In such recordings, audio spans with personal information
should be redacted, similar to the redaction of sensitive character spans in
de-ID for written text. The application of NER in the context of audio
de-identification has yet to be fully investigated. To this end, we define the
task of audio de-ID, in which audio spans with entity mentions should be
detected. We then present our pipeline for this task, which involves Automatic
Speech Recognition (ASR), NER on the transcript text, and text-to-audio
alignment. Finally, we introduce a novel metric for audio de-ID and a new
evaluation benchmark consisting of a large labeled segment of the Switchboard
and Fisher audio datasets and detail our pipeline's results on it.
Comment: Accepted to NAACL 2019 Industry Trac
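The last stage of the pipeline in this abstract, mapping entity mentions in the transcript back to audio, can be sketched as follows. This is an illustrative sketch only, not the authors' alignment method: it assumes the ASR system already provides per-word timestamps and that the NER step yields one tag per word, with "O" meaning no entity. The function name `entity_audio_spans` and the tuple layout are hypothetical.

```python
def entity_audio_spans(words, tags):
    """Merge consecutive entity-tagged words into audio spans to redact.

    words: list of (token, start_sec, end_sec) from a hypothetical ASR output.
    tags:  one tag per word; "O" marks words outside any entity.
    Returns a list of (start_sec, end_sec, tag) spans.
    """
    spans = []
    prev_tag = "O"
    for (_, start, end), tag in zip(words, tags):
        if tag != "O":
            if tag == prev_tag:
                # Same entity type as the previous word: extend the open span.
                spans[-1] = (spans[-1][0], end, tag)
            else:
                # A new entity mention starts here.
                spans.append((start, end, tag))
        prev_tag = tag
    return spans
```

Because adjacent words with the same tag are merged, two back-to-back mentions of the same type become one span; a BIO tagging scheme would be needed to keep them apart.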
Extraction and Analysis of Clinically Important Follow-up Recommendations in a Large Radiology Dataset
Communication of follow-up recommendations when abnormalities are identified
on imaging studies is prone to error. In this paper, we present a natural
language processing approach based on deep learning to automatically identify
clinically important recommendations in radiology reports. Our approach first
identifies the recommendation sentences and then extracts reason, test, and
time frame of the identified recommendations. To train our extraction models,
we created a corpus of 567 radiology reports annotated for recommendation
information. Our extraction models achieved 0.92 f-score for recommendation
sentence, 0.65 f-score for reason, 0.73 f-score for test, and 0.84 f-score for
time frame. We applied the extraction models to a set of over 3.3 million
radiology reports and analyzed the adherence of follow-up recommendations.
Comment: Under Review at American Medical Informatics Association Fall
Symposium'201
SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data
We present SwellShark, a framework for building biomedical named entity
recognition (NER) systems quickly and without hand-labeled data. Our approach
views biomedical resources like lexicons as function primitives for
autogenerating weak supervision. We then use a generative model to unify and
denoise this supervision and construct large-scale, probabilistically labeled
datasets for training high-accuracy NER taggers. In three biomedical NER tasks,
SwellShark achieves competitive scores with state-of-the-art supervised
benchmarks using no hand-labeled training data. In a drug name extraction task
using patient medical records, one domain expert using SwellShark achieved
within 5.1% of a crowdsourced annotation approach -- which originally utilized
20 teams over the course of several weeks -- in 24 hours
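The idea of turning lexicons into weak supervision can be illustrated with a small sketch. It deliberately swaps in a simple majority vote where the abstract describes a generative model for unifying and denoising the votes, so it shows only the shape of the approach. The names `lexicon_lf` and `majority_vote`, the label strings, and the token-level matching are assumptions for illustration.

```python
from collections import Counter


def lexicon_lf(lexicon, label):
    """Build a labeling function that votes `label` for tokens in `lexicon`."""
    terms = {term.lower() for term in lexicon}

    def labeling_function(token):
        # Returning None means the function abstains on this token.
        return label if token.lower() in terms else None

    return labeling_function


def majority_vote(tokens, labeling_functions, default="O"):
    """Combine labeling-function votes per token by simple majority.

    This is a stand-in for the generative model in the abstract, which would
    instead estimate the accuracy of each source and weight its votes.
    """
    labels = []
    for token in tokens:
        votes = Counter(
            vote
            for vote in (lf(token) for lf in labeling_functions)
            if vote is not None
        )
        labels.append(votes.most_common(1)[0][0] if votes else default)
    return labels
```

The resulting labels are noisy by construction and would serve as training data for a downstream tagger rather than as final predictions.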
An Encoder-Decoder Model for ICD-10 Coding of Death Certificates
Information extraction from textual documents such as hospital records and
health-related user discussions has become a topic of intense interest. The task
of medical concept coding is to map a variable length text to medical concepts
and corresponding classification codes in some external system or ontology. In
this work, we utilize recurrent neural networks to automatically assign ICD-10
codes to fragments of death certificates written in English. We develop
end-to-end neural architectures directly tailored to the task, including a basic
encoder-decoder architecture for statistical translation. In order to
incorporate prior knowledge, we concatenate a vector of cosine similarities between
the text and dictionary entries to the encoded state. Applied to a standard
benchmark from the CLEF eHealth 2017 challenge, our model achieved an F-measure of
85.01% on a full test set, a significant improvement over the
average score of 62.2% for all official participants' approaches
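The prior-knowledge step in this abstract, appending a vector of cosine similarities to the encoded state, can be sketched as below. This is a rough illustration under stated assumptions: the text and dictionary entries are represented as bag-of-words counts (the abstract does not say which representation is used), the encoded state is a plain list of floats, and the function names are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_features(text, dictionary_entries):
    """One cosine similarity per dictionary entry, using bag-of-words counts."""
    text_vec = Counter(text.lower().split())
    return [
        cosine(text_vec, Counter(entry.lower().split()))
        for entry in dictionary_entries
    ]


def augment_state(encoded_state, text, dictionary_entries):
    """Concatenate the similarity vector to the encoder's output state."""
    return list(encoded_state) + similarity_features(text, dictionary_entries)
```

In a real model the augmented state would be passed to the decoder; here it is only a longer list of floats, one extra value per dictionary entry.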