206 research outputs found
Applying Stacking and Corpus Transformation to a Chunking Task
In this paper we present an application of the stacking technique
to a chunking task: named entity recognition. Stacking consists of
applying machine learning techniques to combine the results of different
models. Instead of using several corpora or several tagger generators
to obtain the models needed for stacking, we applied three transformations
to a single training corpus and then used the four versions
of the corpus to train a single tagger generator. Taking as baseline
the results obtained with the original corpus (Fβ=1 value of 81.84), our
experiments show that the three transformations improve on this baseline
(the best one reaches 84.51), and that applying stacking improves it
further, reaching an Fβ=1 measure of 88.43.
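As a concrete illustration, the stacking idea described above can be sketched in a few lines of Python. The lookup-table meta-learner and the toy BIO sequences below are illustrative assumptions, not the paper's actual tagger generator or data; any machine learning algorithm could serve as the meta-level model.

```python
from collections import Counter, defaultdict

def train_meta(base_predictions, gold):
    """Learn a meta-classifier mapping each tuple of base-model
    predictions to the gold tag it most often co-occurs with."""
    table = defaultdict(Counter)
    for preds, tag in zip(zip(*base_predictions), gold):
        table[preds][tag] += 1
    return {preds: counts.most_common(1)[0][0]
            for preds, counts in table.items()}

def apply_meta(meta, base_predictions, fallback_index=0):
    """Combine base-model outputs token by token; unseen prediction
    tuples fall back to a designated base model's tag."""
    return [meta.get(preds, preds[fallback_index])
            for preds in zip(*base_predictions)]

# Toy example: three base taggers over five tokens (hypothetical data).
gold    = ["B", "I", "O", "B", "O"]
tagger1 = ["B", "O", "O", "B", "O"]
tagger2 = ["B", "I", "O", "O", "O"]
tagger3 = ["O", "I", "O", "B", "O"]

meta = train_meta([tagger1, tagger2, tagger3], gold)
combined = apply_meta(meta, [tagger1, tagger2, tagger3])
```

A real system would train the meta-level model on held-out predictions rather than on the base models' own training output, to avoid overfitting to their training-time errors.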
Named Entity Recognition Through Corpus Transformation and System Combination
In this paper we investigate how to combine different
taggers to improve their performance on the named entity recognition
task. The main resources used in our experiments are the publicly available
taggers TnT and TBL and a corpus of Spanish texts in which named
entity occurrences are tagged with BIO tags. We have defined three
transformations that provide three additional versions of the training
corpus. The transformations change either the words or the tags, and
all three improve the results of TnT and TBL when they are
trained with the original version of the corpus. With the four versions of
the corpus and the two taggers, we have eight different models that can
be combined with several techniques. The experiments carried out show
that combining them with machine learning techniques improves performance
considerably. We improve the baselines for TnT (Fβ=1 value of
85.25) and TBL (Fβ=1 value of 87.45) up to a value of 90.90 in the best
of our experiments.
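Among the several combination techniques such a setup allows, the simplest is token-level majority voting over the models' BIO outputs. The sketch below uses hypothetical outputs from four models and an assumed tie-breaking rule; it is not the paper's actual combination method.

```python
from collections import Counter

def majority_vote(tag_sequences, tie_breaker=0):
    """Token-level majority voting over several taggers' BIO outputs.
    Ties are resolved in favour of the tagger at index `tie_breaker`."""
    combined = []
    for preds in zip(*tag_sequences):
        counts = Counter(preds)
        top, n = counts.most_common(1)[0]
        # On a tie, fall back to a designated tagger's prediction.
        if list(counts.values()).count(n) > 1:
            top = preds[tie_breaker]
        combined.append(top)
    return combined

# Hypothetical BIO outputs from four models over four tokens.
outputs = [
    ["B", "I", "O", "O"],
    ["B", "O", "O", "B"],
    ["O", "I", "O", "B"],
    ["B", "I", "O", "O"],
]
```

Learned combiners, like the machine-learning-based ones the paper reports, can outperform plain voting because they learn which tagger to trust in which context.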
A comparative study of classifier combination applied to NLP tasks
The paper is devoted to a comparative study of classifier combination methods, which have been successfully
applied to multiple tasks, including Natural Language Processing (NLP) tasks. There is a variety of classifier
combination techniques, and the major difficulty is to choose the one that best fits a particular
task. In our study we explored the performance of a number of combination methods, such as voting,
Bayesian merging, behavior knowledge space, bagging, stacking, feature sub-spacing and cascading, for
the part-of-speech tagging task using nine corpora in five languages. The results show that some methods
that are currently not very popular can deliver much better performance. In addition, we learned
how corpus size and quality influence the performance of the combination methods. We also provide the
results of applying the classifier combination methods to other NLP tasks, such as named entity recognition
and chunking. We believe that our study is the most exhaustive comparison of combination
methods applied to NLP tasks so far.
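Of the methods compared, bagging is the easiest to sketch: each base model is trained on a bootstrap resample of the corpus, and their outputs are then combined. The toy tagged corpus below is a hypothetical illustration, not one of the study's nine corpora.

```python
import random

def bootstrap_samples(corpus, n_models, seed=0):
    """Bagging: draw `n_models` bootstrap resamples (with replacement)
    of the training corpus, one per base model to be trained."""
    rng = random.Random(seed)
    size = len(corpus)
    return [[corpus[rng.randrange(size)] for _ in range(size)]
            for _ in range(n_models)]

# Hypothetical tagged corpus of (word, tag) pairs.
corpus = [("John", "B"), ("Smith", "I"), ("visited", "O"), ("Madrid", "B")]
samples = bootstrap_samples(corpus, n_models=3)
```

Each resample has the same size as the original corpus but duplicates some items and omits others, which is what gives the base models the diversity the ensemble needs.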
Improving the Performance of a Tagger Generator in an Information Extraction Application
In this paper we report our experience with extracting named entities
from Spanish texts using stacking. Named Entity Extraction (NEE) is a subtask of
Information Extraction that involves the identification of the groups of words that make
up the name of an entity, and the classification of these names into a set of predefined
categories. Our approach is corpus-based: we use a re-trainable tagger generator to
obtain a named entity extractor from a set of tagged examples. The main contribution
of our work is that we obtain the systems needed in a stacking scheme without
making use of any additional training material or tagger generators. Instead, we
generate the variability needed for stacking by applying corpus transformations to
the original training corpus. Once we have several versions of the training corpus, we
generate several extractors and combine them by means of a machine learning algorithm.
Experiments show that the combination of corpus transformation and stacking
improves the performance of the tagger generator in this kind of natural language processing
application. The best of our experiments achieves an improvement of more
than six percentage points with respect to the predefined baseline.
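The abstract does not spell out the transformations themselves, but a typical word-level corpus transformation in this kind of setup replaces each word with its orthographic shape, yielding an additional training-corpus version. The sketch below is a hypothetical example of such a transformation, not the paper's actual one.

```python
def word_shape(word):
    """Map a word to a coarse orthographic shape, collapsing runs:
    'Madrid' -> 'Xx', 'IBM' -> 'X', 'año2001' -> 'x0'."""
    shape = []
    for ch in word:
        if ch.isupper():
            c = "X"
        elif ch.islower():
            c = "x"
        elif ch.isdigit():
            c = "0"
        else:
            c = ch
        if not shape or shape[-1] != c:
            shape.append(c)
    return "".join(shape)

def transform_corpus(tagged_corpus):
    """Produce an additional training-corpus version by replacing
    each word with its shape while keeping the tags."""
    return [(word_shape(w), t) for w, t in tagged_corpus]
```

Training the same tagger generator on the original and transformed versions yields distinct models whose disagreements the stacking combiner can exploit.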
Improving the Performance of a Named Entity Extractor by Applying a Stacking Scheme
In this paper we investigate how to improve the performance
of a Named Entity Extraction (NEE) system by applying machine
learning techniques and corpus transformation. The main resources used
in our experiments are the publicly available tagger TnT and a corpus
of Spanish texts in which named entity occurrences are tagged with
BIO tags. We split the NEE task into two subtasks: 1) Named Entity
Recognition (NER), which involves the identification of the group of words
that make up the name of an entity, and 2) Named Entity Classification
(NEC), which determines the category of a named entity. We have focused
our work on the improvement of the NER task, generating four different
taggers with the same training corpus and combining them using a
stacking scheme. We improve the baseline of the NER task (Fβ=1 value
of 81.84) up to a value of 88.37. When a NEC module is added to the
NER system, the performance of the whole NEE task is also improved:
a value of 70.47 is achieved from a baseline of 66.07.
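The BIO scheme referred to above marks each token as Beginning, Inside, or Outside a named entity, and splitting NEE into NER and NEC amounts to separating the boundary part of each tag from its category. A minimal sketch, with hypothetical token spans:

```python
def bio_encode(tokens, entities):
    """Tag a sentence with BIO labels from (start, end, category)
    token spans, where `end` is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, category in entities:
        tags[start] = "B-" + category
        for i in range(start + 1, end):
            tags[i] = "I-" + category
    return tags

tokens = ["El", "presidente", "Felipe", "González", "visitó", "Sevilla"]
entities = [(2, 4, "PER"), (5, 6, "LOC")]
tags = bio_encode(tokens, entities)

# The NER subtask only needs the boundary part of each tag;
# a NEC module later restores the category.
ner_tags = [t.split("-")[0] for t in tags]
```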
De-identifying Hospital Discharge Summaries: An End-to-End Framework using Ensemble of Deep Learning Models
Electronic Medical Records (EMRs) contain clinical narrative text that is of
great potential value to medical researchers. However, this information is
mixed with Personally Identifiable Information (PII) that presents risks to
patient and clinician confidentiality. This paper presents an end-to-end
de-identification framework to automatically remove PII from hospital discharge
summaries. Our corpus included 600 hospital discharge summaries which were
extracted from the EMRs of two principal referral hospitals in Sydney,
Australia. Our end-to-end de-identification framework consists of three
components: 1) Annotation: labelling of PII in the 600 hospital discharge
summaries using five pre-defined categories: person, address, date of birth,
identification number and phone number; 2) Modelling: training six named entity
recognition (NER) deep learning base-models on balanced and imbalanced
datasets, and evaluating ensembles that combine all six base-models, the three
base-models with the best F1 scores, and the three base-models with the best
recall scores, using token-level majority voting and stacking
methods; and 3) De-identification: removing PII from the hospital discharge
summaries. Our results showed that the ensemble model combined using the
stacking Support Vector Machine (SVM) method on the three base-models with the
best F1 scores achieved excellent results, with an F1 score of 99.16% on the test
set of our corpus. We also evaluated the robustness of our modelling component
on the 2014 i2b2 de-identification dataset. Our ensemble model, which uses the
token-level majority voting method on all six base-models, achieved the highest
F1 score of 96.24% at strict entity matching and the highest F1 score of 98.64%
at binary token-level matching compared to two state-of-the-art methods. The
framework provides a robust solution to de-identifying clinical narrative text
safely.
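The two evaluation regimes reported above differ in what counts as a match: strict entity matching requires the span and category to agree exactly, while binary token-level matching only asks whether a token was flagged as PII at all. A minimal sketch, with hypothetical gold and predicted annotations:

```python
def prf1(gold, pred):
    """Precision, recall and F1 over two sets of items."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# Strict entity matching: (start, end, category) spans must agree exactly.
gold_entities = {(0, 2, "PERSON"), (5, 6, "DATE")}
pred_entities = {(0, 2, "PERSON"), (5, 7, "DATE")}
_, _, strict_f1 = prf1(gold_entities, pred_entities)

# Binary token-level matching: a token only needs to be flagged as PII.
gold_tokens = {0, 1, 5}
pred_tokens = {0, 1, 5, 6}
_, _, token_f1 = prf1(gold_tokens, pred_tokens)
```

Token-level scores run higher than strict scores because a prediction with a slightly wrong boundary still gets credit for every token it covers correctly, which is why both are usually reported.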
- …