Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT
The availability of biomedical text data and advances in natural language
processing (NLP) have made new applications in biomedical NLP possible.
Language models trained or fine-tuned on domain-specific corpora can
outperform general-domain models, but work to date in biomedical NLP has
been limited in terms of corpora and tasks. We present BioALBERT, a
domain-specific adaptation of A Lite Bidirectional Encoder Representations
from Transformers (ALBERT), trained on biomedical (PubMed and PubMed
Central) and clinical (MIMIC-III) corpora and fine-tuned for six different
tasks across 20 benchmark datasets. Experiments show that BioALBERT
outperforms the state of the art on named entity recognition (+11.09%
BLURB score), relation extraction (+0.80% BLURB score), sentence
similarity (+1.05% BLURB score), document classification (+0.62%
F1-score), and question answering (+2.83% BLURB score). It establishes a
new state of the art on 17 of the 20 benchmark datasets. By making the
BioALBERT models and data available, we aim to help the biomedical NLP
community avoid the computational cost of training and to establish a new
set of baselines for future efforts across a broad range of biomedical NLP
tasks.
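The BLURB-style task scores reported above are macro-averages over the
datasets belonging to each task. A minimal sketch of that aggregation, with
invented dataset scores (the numbers below are hypothetical and not from
the paper):

```python
# Hypothetical illustration of BLURB-style aggregation: a task's score is
# the unweighted mean of its per-dataset scores, and an overall score is
# the unweighted mean across tasks.
def macro_average(scores):
    """Unweighted mean of a list of scores."""
    return sum(scores) / len(scores)

# Invented numbers, for illustration only.
task_scores = {
    "NER": [85.2, 88.1, 79.4],          # e.g. one F1 per NER dataset
    "relation_extraction": [74.0, 77.5],
    "question_answering": [68.3],
}
per_task = {task: macro_average(s) for task, s in task_scores.items()}
overall = macro_average(list(per_task.values()))
```

Because the average is unweighted, a small dataset moves the task score as
much as a large one, which is why per-dataset results are usually reported
alongside the aggregate.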
Is artificial data useful for biomedical Natural Language Processing algorithms?
A major obstacle to the development of Natural Language Processing (NLP)
methods in the biomedical domain is data accessibility. This problem can be
addressed by generating medical data artificially. Most previous studies have
focused on the generation of short clinical text, and evaluation of the data
utility has been limited. We propose a generic methodology to guide the
generation of clinical text with key phrases. We use the artificial data as
additional training data in two key biomedical NLP tasks: text classification
and temporal relation extraction. We show that artificially generated training
data used in conjunction with real training data can lead to performance boosts
for data-greedy neural network algorithms. We also demonstrate the usefulness
of the generated data for NLP setups where it fully replaces real training
data.
Comment: BioNLP 201
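The augmentation setup described above can be sketched in a few lines:
synthetic records built around key phrases are concatenated with the real
training set. The templates and key phrases below are invented for
illustration; the paper's actual generation method is not reproduced here.

```python
import random

# Hypothetical templates and key phrases -- this only sketches the
# "real + artificial training data" mixing, not the paper's generator.
TEMPLATES = [
    "Patient reports {kp} with no prior history.",
    "Assessment: {kp}; follow-up recommended.",
]
KEY_PHRASES = ["chest pain", "shortness of breath", "elevated troponin"]

def generate_artificial(n, seed=0):
    """Fill templates with key phrases to produce n synthetic sentences."""
    rng = random.Random(seed)
    return [rng.choice(TEMPLATES).format(kp=rng.choice(KEY_PHRASES))
            for _ in range(n)]

# Real examples (invented here) plus synthetic ones form the training set.
real_data = ["Patient denies fever.", "ECG shows sinus rhythm."]
augmented = real_data + generate_artificial(4)
```

Seeding the generator keeps the synthetic portion reproducible across
training runs, which matters when comparing augmented and non-augmented
baselines.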
Exploring subdomain variation in biomedical language.
BACKGROUND: Applications of Natural Language Processing (NLP) technology
to biomedical texts have generated significant interest in recent years.
In this paper we identify and investigate the phenomenon of linguistic
subdomain variation within the biomedical domain, i.e., the extent to
which different subject areas of biomedicine are characterised by
different linguistic behaviour. While variation at a coarser domain level,
such as between newswire and biomedical text, is well studied and known to
affect the portability of NLP systems, we are the first to conduct an
extensive investigation into more fine-grained levels of variation.

RESULTS: Using the large OpenPMC text corpus, which spans the many
subdomains of biomedicine, we investigate variation across a number of
lexical, syntactic, semantic and discourse-related dimensions. These
dimensions are chosen for their relevance to the performance of NLP
systems. We use clustering techniques to analyse commonalities and
distinctions among the subdomains.

CONCLUSIONS: We find that while patterns of inter-subdomain variation
differ somewhat from one feature set to another, robust clusters can be
identified that correspond to intuitive distinctions such as that between
clinical and laboratory subjects. In particular, subdomains relating to
genetics and molecular biology, which are the most common sources of
material for training and evaluating biomedical NLP tools, are not
representative of all biomedical subdomains. We conclude that an awareness
of subdomain variation is important when considering the practical use of
language processing applications by biomedical researchers.
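The clustering idea can be sketched with a toy example: represent each
subdomain by a term-frequency profile and compare profiles by cosine
similarity. The subdomains and word counts below are invented; the paper's
actual feature sets (lexical, syntactic, semantic, discourse) are far
richer.

```python
import math

# Invented toy word counts per subdomain, for illustration only.
profiles = {
    "genetics":  {"gene": 9, "allele": 5, "patient": 1},
    "molecular": {"gene": 7, "protein": 6, "patient": 1},
    "clinical":  {"patient": 9, "treatment": 6, "gene": 1},
}

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[w] * b.get(w, 0) for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

# With these toy counts, the genetics and molecular-biology profiles sit
# closer to each other than either does to the clinical profile.
sim_gm = cosine(profiles["genetics"], profiles["molecular"])
sim_gc = cosine(profiles["genetics"], profiles["clinical"])
```

A real analysis would feed such similarity (or distance) matrices into a
clustering algorithm rather than comparing pairs by hand.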
Implementing a Portable Clinical NLP System with a Common Data Model - a Lisp Perspective
This paper presents a Lisp architecture for a portable NLP system, termed
LAPNLP, for processing clinical notes. LAPNLP integrates multiple standard,
customized and in-house developed NLP tools. Our system facilitates portability
across different institutions and data systems by incorporating an enriched
Common Data Model (CDM) to standardize necessary data elements. It utilizes
UMLS to perform domain adaptation when integrating generic domain NLP tools. It
also features stand-off annotations that are specified by positional reference
to the original document. We built an interval-tree-based search engine to
efficiently query and retrieve the stand-off annotations by specifying
positional requirements. We also developed a utility to convert an inline
annotation format to stand-off annotations, enabling the reuse of clinical
text datasets with inline annotations. We experimented with our system on
several NLP-facilitated tasks, including computational phenotyping for
lymphoma patients and semantic relation extraction for clinical notes.
These experiments showcased the broader applicability and utility of
LAPNLP.
Comment: 6 pages, accepted by IEEE BIBM 2018 as regular paper
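The inline-to-stand-off conversion described above can be sketched as
follows. The inline markup here is a hypothetical XML-like format, not
necessarily LAPNLP's actual one; the point is only that each annotation
becomes a (start, end, label) triple positioned in the tag-free text.

```python
import re

# Hypothetical inline markup: <label>text</label> embedded in the note.
INLINE_TAG = re.compile(r"<(\w+)>(.*?)</\1>")

def inline_to_standoff(text):
    """Strip inline tags, returning (plain_text, annotations), where each
    annotation is a (start, end, label) triple over the plain text."""
    plain, annotations, last, offset = [], [], 0, 0
    for m in INLINE_TAG.finditer(text):
        plain.append(text[last:m.start()])   # untagged text before the tag
        offset += m.start() - last
        span = m.group(2)
        annotations.append((offset, offset + len(span), m.group(1)))
        plain.append(span)                   # keep the annotated text itself
        offset += len(span)
        last = m.end()
    plain.append(text[last:])                # trailing untagged text
    return "".join(plain), annotations

doc, anns = inline_to_standoff("Dx: <disease>lymphoma</disease>, stage II.")
# doc == "Dx: lymphoma, stage II."; anns == [(4, 12, "disease")]
```

Because stand-off annotations reference character offsets, an interval
tree over the (start, end) spans then supports efficient positional
queries without rescanning the document.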