MASK: A flexible framework to facilitate de-identification of clinical texts
Medical health records and clinical summaries contain a vast amount of
important information in textual form that can help advance research on
treatments, drugs and public health. However, most of this information is not
shared because it contains private information about patients, their families,
or the medical staff treating them. Regulations such as HIPAA in the US, PHIPA
in Canada and GDPR govern the protection, processing and distribution of this
information. If this information is de-identified, with personal information
replaced or redacted, it could be distributed to the research community. In
this paper, we present MASK, a software package that is designed to perform the
de-identification task. The software is able to perform named entity
recognition using some of the state-of-the-art techniques and then mask or
redact recognized entities. The user is able to select the named entity
recognition algorithm (currently implemented are two versions of CRF-based
techniques and a BiLSTM-based neural network with pre-trained GloVe and ELMo
embeddings) and the masking algorithm (e.g. shift dates, replace
names/locations, totally redact entity).
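The masking step described above can be illustrated with a minimal sketch. It
is not MASK's actual code or API: the function name, the (start, end, label)
entity format, the label names, the ISO date format and the placeholder strings
are all assumptions made for illustration. Entities are assumed to be
non-overlapping and to come from some upstream NER step.

```python
from datetime import datetime, timedelta


def mask_entities(text, entities, day_shift=30):
    """Apply a per-label masking strategy to already recognized entities.

    `entities` is a hypothetical list of non-overlapping
    (start, end, label) character spans.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(entities):
        pieces.append(text[cursor:start])
        span = text[start:end]
        if label == "DATE":
            # Shift dates by a fixed offset (assumes ISO dates for simplicity).
            try:
                shifted = datetime.strptime(span, "%Y-%m-%d") + timedelta(days=day_shift)
                pieces.append(shifted.strftime("%Y-%m-%d"))
            except ValueError:
                pieces.append("[DATE]")
        elif label == "NAME":
            # Replace names with a placeholder token.
            pieces.append("[NAME]")
        else:
            # Totally redact any other entity type.
            pieces.append("[REDACTED]")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Keeping the strategy choice per label mirrors the idea that the user selects a
masking algorithm for each entity type independently of the NER algorithm.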
An Instance Transfer based Approach Using Enhanced Recurrent Neural Network for Domain Named Entity Recognition
Recently, neural networks have shown promising results for named entity
recognition (NER), which requires labeled data for model training. In a new
domain (the target domain), there is little or no labeled data, which makes
domain NER much more difficult. As NER has been researched for a long time,
some similar domains (source domains) already have well-labelled data.
Therefore, in this paper, we focus on domain NER by studying how to utilize the
labelled data from such a similar source domain for the new target domain. We
design a kernel function based instance transfer strategy that retrieves
similar labelled sentences from a source domain. Moreover, we propose an
enhanced recurrent neural network (ERNN) by adding an additional layer that
combines the source domain labelled data into the traditional RNN structure.
Comprehensive experiments are conducted on two datasets. The comparison results
among HMM, CRF and RNN show that RNN performs better than the others. When
there is no labelled data in the target domain, compared to directly using the
source domain labelled data without selecting transferred instances, our
enhanced RNN approach improves the F1 measure from 0.8052 to 0.9328.
Syllable-based Neural Named Entity Recognition for Myanmar Language
Named Entity Recognition (NER) for the Myanmar language is essential to Myanmar
natural language processing research. In this work, NER for the Myanmar
language is treated as a sequence tagging problem, and the effectiveness of
deep neural networks on NER for the Myanmar language is investigated.
Experiments are performed by applying deep neural network architectures to
syllable-level Myanmar contexts. The first manually annotated NER corpus for
the Myanmar language is also constructed and proposed. In developing our
in-house NER corpus, sentences from online news websites and sentences from the
ALT-Parallel-Corpus are used. This ALT corpus is one part of the Asian
Language Treebank (ALT) project under ASEAN IVO. This paper contributes the
first evaluation of neural network models on the NER task for the Myanmar
language. The experimental results show that these neural sequence models can
produce promising results compared to the baseline CRF model. Among the neural
architectures, a bidirectional LSTM network with a CRF layer on top gives the
highest F-score. This work also aims to explore the effectiveness of neural
network approaches to Myanmar textual processing as well as to promote further
research on this understudied language.
Comment: Myanmar NE
Investigating how well contextual features are captured by bi-directional recurrent neural network models
Learning algorithms for natural language processing (NLP) tasks traditionally
rely on manually defined relevant contextual features. On the other hand,
neural network models using only distributional representations of words have
been successfully applied to several NLP tasks. Such models learn features
automatically and avoid explicit feature engineering. Across several domains,
neural models become a natural choice, particularly when little is known about
the characteristics of the data. However, this flexibility comes at the cost of
interpretability. In this paper, we define three different methods to
investigate the ability of bi-directional recurrent neural networks (RNNs) to
capture contextual features. In particular, we analyze RNNs for sequence
tagging tasks. We perform a comprehensive analysis on general as well as
biomedical domain datasets. Our experiments focus on important contextual words
as features, which can easily be extended to analyze various other feature
types. We also investigate positional effects of context words and show how the
developed methods can be used for error analysis.
Comment: Camera ready version of ICON-201
A Biomedical Information Extraction Primer for NLP Researchers
Biomedical Information Extraction is an exciting field at the crossroads of
Natural Language Processing, Biology and Medicine. It encompasses a variety of
different tasks that require application of state-of-the-art NLP techniques,
such as NER and Relation Extraction. This paper provides an overview of the
problems in the field and discusses some of the techniques used for solving
them.
Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches
This work investigates multiple approaches to Named Entity Recognition (NER)
for text in Electronic Health Record (EHR) data. In particular, we look into
the application of (i) rule-based, (ii) deep learning and (iii) transfer
learning systems for the task of NER on brain imaging reports with a focus on
records from patients with stroke. We explore the strengths and weaknesses of
each approach, develop rules and train on a common dataset, and evaluate each
system's performance on common test sets of Scottish radiology reports from two
sources (brain imaging reports in ESS -- Edinburgh Stroke Study data collected
by NHS Lothian as well as radiology reports created in NHS Tayside). Our
comparison shows that a hand-crafted system is the most accurate way to
automatically label EHR, but machine learning approaches can provide a feasible
alternative where resources for a manual system are not readily available.
Comment: 8 pages, presented at HealTAC 2019, Cardiff, 24-25/04/201
Extraction and Analysis of Clinically Important Follow-up Recommendations in a Large Radiology Dataset
Communication of follow-up recommendations when abnormalities are identified
on imaging studies is prone to error. In this paper, we present a natural
language processing approach based on deep learning to automatically identify
clinically important recommendations in radiology reports. Our approach first
identifies the recommendation sentences and then extracts reason, test, and
time frame of the identified recommendations. To train our extraction models,
we created a corpus of 567 radiology reports annotated for recommendation
information. Our extraction models achieved an F-score of 0.92 for
recommendation sentences, 0.65 for reason, 0.73 for test, and 0.84 for time
frame. We applied the extraction models to a set of over 3.3 million radiology
reports and analyzed adherence to follow-up recommendations.
Comment: Under Review at American Medical Informatics Association Fall
Symposium'201
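For readers unfamiliar with how F-scores such as those above are typically
obtained for extraction tasks, the following is a minimal sketch of exact-match
span scoring. The abstract does not state the matching criterion used, so exact
matching, the function name and the (start, end, label) span format are
assumptions made for illustration.

```python
def span_f1(gold, predicted):
    """Exact-match F1 over (start, end, label) spans.

    A predicted span counts as correct only if its boundaries and label
    both match a gold span (an assumed criterion, not taken from the paper).
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return 0.0
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Under this criterion, a prediction with slightly wrong boundaries counts as
both a false positive and a false negative, which is one reason span-level
scores for free-text fields tend to be lower than sentence-level scores.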
SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data
We present SwellShark, a framework for building biomedical named entity
recognition (NER) systems quickly and without hand-labeled data. Our approach
views biomedical resources like lexicons as function primitives for
autogenerating weak supervision. We then use a generative model to unify and
denoise this supervision and construct large-scale, probabilistically labeled
datasets for training high-accuracy NER taggers. In three biomedical NER tasks,
SwellShark achieves competitive scores with state-of-the-art supervised
benchmarks using no hand-labeled training data. In a drug name extraction task
using patient medical records, one domain expert using SwellShark achieved
within 5.1% of a crowdsourced annotation approach -- which originally utilized
20 teams over the course of several weeks -- in 24 hours.
A Joint Named-Entity Recognizer for Heterogeneous Tag-sets Using a Tag Hierarchy
We study a variant of domain adaptation for named-entity recognition where
multiple, heterogeneously tagged training sets are available. Furthermore, the
test tag-set is not identical to any individual training tag-set. Yet, the
relations between all tags are provided in a tag hierarchy, covering the test
tags as a combination of training tags. This setting occurs when various
datasets are created using different annotation schemes. It also arises when
extending a tag-set with a new tag by annotating only the new tag in a new
dataset. We propose to use the given tag hierarchy to jointly learn a neural
network that shares its tagging layer among all tag-sets. We compare this model
to combining independent models and to a model based on the multitasking
approach. Our experiments show the benefit of the tag-hierarchy model,
especially when facing non-trivial consolidation of tag-sets.
Comment: Accepted at ACL 201
Learning Named Entity Tagger using Domain-Specific Dictionary
Recent advances in deep neural models allow us to build reliable named entity
recognition (NER) systems without handcrafting features. However, such methods
require large amounts of manually-labeled training data. There have been
efforts to replace human annotations with distant supervision (in conjunction
with external dictionaries), but the generated noisy labels pose significant
challenges for learning effective neural models. Here we propose two neural
models to suit noisy distant supervision from the dictionary. First, under the
traditional sequence labeling framework, we propose a revised fuzzy CRF layer
to handle tokens with multiple possible labels. After identifying the nature of
noisy labels in distant supervision, we go beyond the traditional framework and
propose a novel, more effective neural model AutoNER with a new Tie or Break
scheme. In addition, we discuss how to refine distant supervision for better
NER performance. Extensive experiments on three benchmark datasets demonstrate
that AutoNER achieves the best performance when only using dictionaries with no
additional human effort, and delivers competitive results with state-of-the-art
supervised benchmarks.
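The dictionary-based distant supervision mentioned above can be sketched as a
longest-match lookup that assigns each token a set of possible labels, which is
the kind of ambiguous input a fuzzy CRF layer is meant to handle. This is a
simplified illustration, not the paper's labeling procedure or the Tie or Break
scheme: the function name, the dictionary format, lowercased matching and the
`max_len` parameter are all assumptions.

```python
def distant_label(tokens, dictionary, max_len=3):
    """Assign each token a set of candidate labels via dictionary matching.

    `dictionary` is a hypothetical mapping from a lowercased phrase to the
    set of entity types it may carry. Unmatched tokens get {"O"}.
    """
    labels = [{"O"} for _ in tokens]
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in dictionary:
                for j in range(i, i + n):
                    labels[j] = set(dictionary[phrase])
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels
```

A dictionary entry with several types leaves the token ambiguous rather than
forcing a single, possibly wrong label, so the noise from the dictionary stays
visible to the downstream model.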