Effective Use of Bidirectional Language Modeling for Transfer Learning in Biomedical Named Entity Recognition
Biomedical named entity recognition (NER) is a fundamental task in text
mining of medical documents and has many applications. Deep learning-based
approaches to this task have been gaining increasing attention in recent years
as their parameters can be learned end-to-end without the need for
hand-engineered features. However, these approaches rely on high-quality
labeled data, which is expensive to obtain. To address this issue, we
investigate how to use unlabeled text data to improve the performance of NER
models. Specifically, we train a bidirectional language model (BiLM) on
unlabeled data and transfer its weights to "pretrain" an NER model with the
same architecture as the BiLM, which results in a better parameter
initialization of the NER model. We evaluate our approach on four benchmark
datasets for biomedical NER and show that it leads to a substantial improvement
in the F1 scores compared with the state-of-the-art approaches. We also show
that BiLM weight transfer leads to faster model training and that the pretrained model requires fewer training examples to achieve a given F1 score. (Machine Learning for Healthcare (MLHC) 2018.)
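The transfer step lends itself to a short sketch. The following is a minimal PyTorch illustration of the idea, not the authors' exact configuration: a BiLM and an NER tagger share one encoder architecture, and the pretrained BiLM encoder weights initialize the tagger. All sizes and names are illustrative assumptions.

```python
import torch.nn as nn

VOCAB, TAGS, EMB, HID = 5000, 9, 100, 256  # illustrative sizes

class Encoder(nn.Module):
    """Shared architecture: embeddings plus a bidirectional LSTM."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.lstm = nn.LSTM(EMB, HID, bidirectional=True, batch_first=True)

    def forward(self, x):
        out, _ = self.lstm(self.emb(x))
        return out  # (batch, seq, 2 * HID)

class BiLM(nn.Module):
    """Bidirectional LM head over the shared encoder."""
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.proj = nn.Linear(2 * HID, VOCAB)

    def forward(self, x):
        return self.proj(self.encoder(x))

class NERTagger(nn.Module):
    """NER head over the same encoder architecture."""
    def __init__(self):
        super().__init__()
        self.encoder = Encoder()
        self.proj = nn.Linear(2 * HID, TAGS)

    def forward(self, x):
        return self.proj(self.encoder(x))

bilm = BiLM()
# ... pretrain `bilm` on unlabeled biomedical text (omitted) ...
tagger = NERTagger()
# Weight transfer: the pretrained BiLM encoder initializes the NER encoder.
tagger.encoder.load_state_dict(bilm.encoder.state_dict())
# ... fine-tune `tagger` on labeled NER data as usual ...
```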
Cross-type Biomedical Named Entity Recognition with Deep Multi-Task Learning
Motivation: State-of-the-art biomedical named entity recognition (BioNER)
systems often require handcrafted features specific to each entity type, such
as genes, chemicals and diseases. Although recent studies explored using neural
network models for BioNER to free experts from manual feature engineering, the
performance remains limited by the available training data for each entity
type. Results: We propose a multi-task learning framework for BioNER to
collectively use the training data of different types of entities and improve
the performance on each of them. In experiments on 15 benchmark BioNER
datasets, our multi-task model achieves substantially better performance
compared with state-of-the-art BioNER systems and baseline neural sequence
labeling models. Further analysis shows that the large performance gains come
from sharing character- and word-level information among relevant biomedical
entities across differently labeled corpora.
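A hedged sketch of the multi-task setup: a shared encoder with one tagging head per entity type, trained on alternating batches from the different corpora. Dimensions, tag-set sizes, and the use of plain word embeddings (rather than the paper's combined character- and word-level layers) are simplifying assumptions.

```python
import torch.nn as nn

class MultiTaskTagger(nn.Module):
    def __init__(self, task_tagsets, vocab=5000, emb=100, hid=256):
        super().__init__()
        # Lower layers are shared across all tasks/corpora.
        self.emb = nn.Embedding(vocab, emb)
        self.shared = nn.LSTM(emb, hid, bidirectional=True, batch_first=True)
        # Each dataset keeps its own classification head.
        self.heads = nn.ModuleDict(
            {t: nn.Linear(2 * hid, n) for t, n in task_tagsets.items()})

    def forward(self, x, task):
        h, _ = self.shared(self.emb(x))
        return self.heads[task](h)

# Training alternates mini-batches drawn from the different corpora, e.g.
model = MultiTaskTagger({"gene": 3, "chemical": 3, "disease": 3})
# loss = criterion(model(batch_tokens, task="chemical"), batch_tags)
```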
Neural Metric Learning for Fast End-to-End Relation Extraction
Relation extraction (RE) is an indispensable information extraction task in
several disciplines. RE models typically assume that named entity recognition
(NER) is already performed in a previous step by another independent model.
Several recent efforts, under the theme of end-to-end RE, seek to exploit
inter-task correlations by modeling both NER and RE tasks jointly. Earlier work
in this area commonly reduces the task to a table-filling problem wherein an
additional expensive decoding step involving beam search is applied to obtain
globally consistent cell labels. In efforts that do not employ table-filling,
global optimization in the form of CRFs with Viterbi decoding for the NER
component is still necessary for competitive performance. We introduce a novel
neural architecture utilizing the table structure, based on repeated
applications of 2D convolutions for pooling local dependency and metric-based
features, that improves on the state-of-the-art without the need for global
optimization. We validate our model on the ADE and CoNLL04 datasets for
end-to-end RE and demonstrate gains (in F-score) over prior best results, with training and testing times that are seven to ten times faster; the latter is highly advantageous for time-sensitive end-user applications.
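The table-based pooling idea can be pictured roughly as follows. This is a loose PyTorch sketch, not the paper's architecture (the metric-based features in particular are omitted): pairwise token representations form a table, repeated 2D convolutions pool local dependencies, and per-cell relation scores are read off without any global decoding. Shapes and layer sizes are guesses.

```python
import torch
import torch.nn as nn

B, SEQ, D = 2, 20, 64
h = torch.randn(B, SEQ, D)  # token encodings from any sentence encoder

# Build the table: cell (i, j) concatenates the encodings of tokens i and j.
table = torch.cat(
    [h.unsqueeze(2).expand(-1, -1, SEQ, -1),
     h.unsqueeze(1).expand(-1, SEQ, -1, -1)], dim=-1)  # (B, SEQ, SEQ, 2D)

# Repeated 2D convolutions pool local dependencies within the table.
conv = nn.Sequential(
    nn.Conv2d(2 * D, D, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(D, D, kernel_size=3, padding=1), nn.ReLU())
pooled = conv(table.permute(0, 3, 1, 2))  # (B, D, SEQ, SEQ)

# Per-cell relation scores are read off directly; no beam search or
# Viterbi decoding is required for a consistent output.
rel_scores = nn.Conv2d(D, 5, kernel_size=1)(pooled)  # 5 = assumed relations
```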
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Biomedical text mining is becoming increasingly important as the number of
biomedical documents rapidly grows. With the progress in natural language
processing (NLP), extracting valuable information from biomedical literature
has gained popularity among researchers, and deep learning has boosted the
development of effective biomedical text mining models. However, directly
applying the advancements in NLP to biomedical text mining often yields
unsatisfactory results due to a word distribution shift from general domain
corpora to biomedical corpora. In this article, we investigate how the recently
introduced pre-trained language model BERT can be adapted for biomedical
corpora. We introduce BioBERT (Bidirectional Encoder Representations from
Transformers for Biomedical Text Mining), which is a domain-specific language
representation model pre-trained on large-scale biomedical corpora. With almost
the same architecture across tasks, BioBERT largely outperforms BERT and
previous state-of-the-art models in a variety of biomedical text mining tasks
when pre-trained on biomedical corpora. While BERT obtains performance
comparable to that of previous state-of-the-art models, BioBERT significantly
outperforms them on the following three representative biomedical text mining
tasks: biomedical named entity recognition (0.62% F1 score improvement),
biomedical relation extraction (2.80% F1 score improvement) and biomedical
question answering (12.24% MRR improvement). Our analysis results show that
pre-training BERT on biomedical corpora helps it to understand complex
biomedical texts. We make the pre-trained weights of BioBERT freely available
at https://github.com/naver/biobert-pretrained, and the source code for
fine-tuning BioBERT available at https://github.com/dmis-lab/biobert. (Bioinformatics.)
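As an illustration of how such a model is typically consumed, here is a sketch of loading a BioBERT checkpoint for token classification with the Hugging Face `transformers` library; the checkpoint id is an assumed community upload and may differ from the weights linked above.

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

name = "dmis-lab/biobert-base-cased-v1.1"  # assumed Hub checkpoint id
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForTokenClassification.from_pretrained(name, num_labels=3)

enc = tok("BRCA1 mutations increase cancer risk.", return_tensors="pt")
logits = model(**enc).logits   # (1, seq_len, num_labels)
pred = logits.argmax(-1)       # per-token BIO label ids
# From here, fine-tuning proceeds as standard supervised training on
# the labeled NER corpus of interest.
```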
CollaboNet: collaboration of deep neural networks for biomedical named entity recognition
Background: Finding biomedical named entities is one of the most essential
tasks in biomedical text mining. Recently, deep learning-based approaches have
been applied to biomedical named entity recognition (BioNER) and showed
promising results. However, as deep learning approaches need an abundant amount
of training data, a lack of data can hinder performance. BioNER datasets are
scarce resources and each dataset covers only a small subset of entity types.
Furthermore, many bio entities are polysemous, which is one of the major
obstacles in named entity recognition. Results: To address the lack of data and
the entity type misclassification problem, we propose CollaboNet, which utilizes a combination of multiple NER models. In CollaboNet, models trained on different datasets are connected to each other so that a target model obtains information from the other collaborator models to reduce false positives. Each model is an expert on its target entity type, and every model takes turns serving as a target and as a collaborator during training. The experimental results
show that CollaboNet can be used to greatly reduce the number of false
positives and misclassified entities including polysemous words. CollaboNet
achieved state-of-the-art performance in terms of precision, recall and F1
score. Conclusions: We demonstrated the benefits of combining multiple models
for BioNER. Our model has successfully reduced the number of misclassified
entities and improved the performance by leveraging multiple datasets annotated
for different entity types. Given the state-of-the-art performance of our
model, we believe that CollaboNet can improve the accuracy of downstream
biomedical text mining applications such as bio-entity relation extraction. (DTMBio workshop at CIKM 2018, Turin, Italy, 22-26 October 2018.)
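One way to picture the collaborator mechanism is a target tagger that conditions on the label distributions emitted by models trained on other entity types. The sketch below is only illustrative; the layer sizes, the feature-concatenation scheme, and all names are assumptions, not CollaboNet's actual wiring.

```python
import torch
import torch.nn as nn

class TargetTagger(nn.Module):
    """A target model whose input is augmented with collaborator outputs."""
    def __init__(self, hid=256, tags=3, n_collab=3):
        super().__init__()
        self.lstm = nn.LSTM(hid + n_collab * tags, hid,
                            bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hid, tags)

    def forward(self, tok_repr, collab_probs):
        # collab_probs: (B, seq, n_collab * tags), label distributions from
        # models trained on other entity types; the extra signal helps the
        # target suppress false positives on polysemous mentions.
        h, _ = self.lstm(torch.cat([tok_repr, collab_probs], dim=-1))
        return self.out(h)

# During training, each model alternates between acting as the target and
# serving as a (frozen) collaborator for the others.
```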
MASK: A flexible framework to facilitate de-identification of clinical texts
Medical health records and clinical summaries contain a vast amount of important information in textual form that can help advance research on treatments, drugs and public health. However, most of this information is not shared because it contains private information about patients, their families, or the medical staff treating them. Regulations such as HIPAA in the US, PHIPA in Canada and the GDPR in the EU govern the protection, processing and distribution of this information. If this information is de-identified, with personal details replaced or redacted, it can be distributed to the research community. In this paper, we present MASK, a software package designed to perform the de-identification task. The software performs named entity recognition using some of the state-of-the-art techniques and then masks or redacts the recognized entities. The user can select the named entity recognition algorithm (currently implemented are two versions of CRF-based techniques and a BiLSTM-based neural network with pre-trained GloVe and ELMo embeddings) and the masking algorithm (e.g. shift dates, replace names/locations, totally redact the entity).
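The masking stage can be sketched as a per-entity-type strategy applied to recognized spans. The toy example below uses hypothetical function names, not the real MASK API: it shifts dates and redacts names and locations.

```python
import datetime

def shift_date(text, days=30):
    """Shift a date by a fixed offset: intervals survive, dates do not."""
    d = datetime.datetime.strptime(text, "%Y-%m-%d")
    return (d + datetime.timedelta(days=days)).strftime("%Y-%m-%d")

STRATEGIES = {
    "NAME": lambda t: "[NAME]",          # total redaction
    "LOCATION": lambda t: "[LOCATION]",  # total redaction
    "DATE": shift_date,                  # date shifting
}

def deidentify(text, spans):
    """spans: (start, end, entity_type) triples from the NER step."""
    for start, end, etype in sorted(spans, reverse=True):
        text = text[:start] + STRATEGIES[etype](text[start:end]) + text[end:]
    return text

note = "John Smith was admitted on 2019-03-14 in Boston."
spans = [(0, 10, "NAME"), (27, 37, "DATE"), (41, 47, "LOCATION")]
print(deidentify(note, spans))
# -> [NAME] was admitted on 2019-04-13 in [LOCATION].
```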
SwellShark: A Generative Model for Biomedical Named Entity Recognition without Labeled Data
We present SwellShark, a framework for building biomedical named entity
recognition (NER) systems quickly and without hand-labeled data. Our approach
views biomedical resources like lexicons as function primitives for
autogenerating weak supervision. We then use a generative model to unify and
denoise this supervision and construct large-scale, probabilistically labeled
datasets for training high-accuracy NER taggers. In three biomedical NER tasks,
SwellShark achieves competitive scores with state-of-the-art supervised
benchmarks using no hand-labeled training data. In a drug name extraction task using patient medical records, one domain expert using SwellShark achieved, in 24 hours, results within 5.1% of a crowdsourced annotation approach that originally utilized 20 teams over the course of several weeks.
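The lexicon-as-labeling-function idea can be sketched as follows. SwellShark unifies and denoises its sources with a generative model; in this illustration a simple vote stands in for that step, and the lexicon entries and suffix rules are invented examples.

```python
def lf_lexicon(tok, lexicon=frozenset({"aspirin", "ibuprofen"})):
    """Votes 'entity' when the token appears in a drug lexicon."""
    return 1 if tok.lower() in lexicon else 0

def lf_suffix(tok):
    """Votes 'entity' on common drug-name suffixes."""
    return 1 if tok.lower().endswith(("mycin", "cillin", "profen")) else 0

LFS = [lf_lexicon, lf_suffix]

def weak_label(tokens):
    """Soft token labels from aggregated labeling-function votes."""
    return [sum(lf(t) for lf in LFS) / len(LFS) for t in tokens]

tokens = "the patient received ibuprofen and amoxicillin".split()
print(list(zip(tokens, weak_label(tokens))))
# The resulting probabilistic labels then train an ordinary NER tagger.
```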
Few-shot Learning for Named Entity Recognition in Medical Text
Deep neural network models have recently achieved state-of-the-art
performance gains in a variety of natural language processing (NLP) tasks
(Young, Hazarika, Poria, & Cambria, 2017). However, these gains rely on the
availability of large amounts of annotated examples, without which
state-of-the-art performance is rarely achievable. This is especially
inconvenient for the many NLP fields where annotated examples are scarce, such
as medical text. To improve NLP models in this situation, we evaluate five
improvements on named entity recognition (NER) tasks when only ten annotated
examples are available: (1) layer-wise initialization with pre-trained weights,
(2) hyperparameter tuning, (3) combining pre-training data, (4) custom word
embeddings, and (5) optimizing out-of-vocabulary (OOV) words. Experimental
results show that the F1 score of 69.3% achievable by state-of-the-art models can be improved to 78.87%.
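Improvement (1), layer-wise initialization with pre-trained weights, might look like the following sketch: copy every pretrained tensor whose name and shape match the target model and leave the rest randomly initialized. The checkpoint name is a placeholder.

```python
import torch

def layerwise_init(target_model, pretrained_state):
    """Copy every pretrained tensor whose name and shape match the target."""
    own = target_model.state_dict()
    copied = {k: v for k, v in pretrained_state.items()
              if k in own and v.shape == own[k].shape}
    own.update(copied)
    target_model.load_state_dict(own)
    return sorted(copied)  # which layers were actually transferred

# Usage with a hypothetical checkpoint from a larger source NER corpus:
# transferred = layerwise_init(model, torch.load("source_ner.pt"))
```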
Unsupervised Domain Adaptation of Contextualized Embeddings for Sequence Labeling
Contextualized word embeddings such as ELMo and BERT provide a foundation for
strong performance across a wide range of natural language processing tasks by
pretraining on large corpora of unlabeled text. However, the applicability of
this approach is unknown when the target domain varies substantially from the
pretraining corpus. We are specifically interested in the scenario in which
labeled data is available in only a canonical source domain such as news text,
and the target domain is distinct from both the labeled and pretraining texts.
To address this scenario, we propose domain-adaptive fine-tuning, in which the
contextualized embeddings are adapted by masked language modeling on text from
the target domain. We test this approach on sequence labeling in two
challenging domains: Early Modern English and Twitter. Both domains differ
substantially from existing pretraining corpora, and domain-adaptive
fine-tuning yields substantial improvements over strong BERT baselines, with
particularly impressive results on out-of-vocabulary words. We conclude that
domain-adaptive fine-tuning offers a simple and effective approach for the
unsupervised adaptation of sequence labeling to difficult new domains. (EMNLP 2019.)
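Domain-adaptive fine-tuning amounts to continuing masked language modeling on raw target-domain text before the supervised step. A sketch using the Hugging Face `transformers` and `datasets` libraries follows; the file name and hyperparameters are assumptions.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tok = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# target_domain.txt: raw, unlabeled text from the target domain
# (e.g. Twitter or Early Modern English).
ds = load_dataset("text", data_files={"train": "target_domain.txt"})
ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=128),
            batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="adapted", num_train_epochs=3),
    train_dataset=ds["train"],
    data_collator=DataCollatorForLanguageModeling(tok, mlm_probability=0.15),
)
trainer.train()
# The adapted encoder then initializes the downstream sequence labeler.
```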
Probing Biomedical Embeddings from Language Models
Contextualized word embeddings derived from pre-trained language models (LMs)
show significant improvements on downstream NLP tasks. Pre-training on
domain-specific corpora, such as biomedical articles, further improves their
performance. In this paper, we conduct probing experiments to determine what
additional information is carried intrinsically by the in-domain trained
contextualized embeddings. For this we use the pre-trained LMs as fixed feature
extractors and restrict the downstream task models to not have additional
sequence modeling layers. We compare BERT, ELMo, BioBERT and BioELMo, a
biomedical version of ELMo trained on 10M PubMed abstracts. Surprisingly, while
fine-tuned BioBERT is better than BioELMo in biomedical NER and NLI tasks, as a
fixed feature extractor BioELMo outperforms BioBERT in our probing tasks. We
use visualization and nearest neighbor analysis to show that better encoding of
entity-type and relational information leads to this superiority. (NAACL-HLT 2019 Workshop on Evaluating Vector Space Representations for NLP, RepEval.)
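The probing setup, an LM used as a fixed feature extractor with no added sequence-modeling layers, can be sketched as a frozen encoder plus a linear classifier. The checkpoint and label count below are placeholders.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-cased")  # placeholder LM
lm = AutoModel.from_pretrained("bert-base-cased").eval()
for p in lm.parameters():
    p.requires_grad = False  # fixed feature extractor: no fine-tuning

# The probe is a single linear layer; no extra sequence-modeling layers.
probe = torch.nn.Linear(lm.config.hidden_size, 3)  # 3 = assumed classes

enc = tok("EGFR inhibitors treat lung cancer.", return_tensors="pt")
with torch.no_grad():
    feats = lm(**enc).last_hidden_state  # (1, seq, hidden)
logits = probe(feats)                    # only `probe` receives gradients
```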