
    Applying Stacking and Corpus Transformation to a Chunking Task

    In this paper we present an application of the stacking technique to a chunking task: named entity recognition. Stacking consists of applying machine learning techniques to combine the results of different models. Instead of using several corpora or several tagger generators to obtain the models needed for stacking, we apply three transformations to a single training corpus and then use the four resulting versions of the corpus to train a single tagger generator. Taking as baseline the results obtained with the original corpus (Fβ=1 value of 81.84), our experiments show that each of the three transformations improves this baseline (the best reaches 84.51), and that applying stacking improves it further, reaching an Fβ=1 value of 88.43.
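    The stacking scheme the abstract describes can be sketched roughly as follows. This is a minimal illustration with invented data, not the paper's actual meta-learner: a meta-classifier is trained on a held-out set where each example is the tuple of base-model tags for a token and the label is the gold tag (here the "meta-classifier" is a simple learned decision table).

```python
from collections import Counter, defaultdict

def train_stacker(base_preds, gold):
    """base_preds: list of tag sequences (one per base model) on held-out data.
    gold: the gold tag sequence. Learns a decision table mapping each tuple
    of base predictions to the most frequent gold tag seen with that tuple."""
    table = defaultdict(Counter)
    for tags, g in zip(zip(*base_preds), gold):
        table[tags][g] += 1
    return {k: c.most_common(1)[0][0] for k, c in table.items()}

def apply_stacker(table, base_preds):
    """Fall back to the first base model when a tuple was never seen."""
    return [table.get(tags, tags[0]) for tags in zip(*base_preds)]

# Held-out data: the second model systematically misses I-PER, so the
# stacker learns to resolve the pattern ('I-PER', 'O') as I-PER.
held_a = ["B-PER", "I-PER", "O"]
held_b = ["B-PER", "O",     "O"]
gold   = ["B-PER", "I-PER", "O"]
table = train_stacker([held_a, held_b], gold)

print(apply_stacker(table, [["O", "I-PER"], ["O", "O"]]))
# ['O', 'I-PER']
```

    A real stacker would use richer features and a trained classifier, but the structure — base-model outputs become meta-level inputs — is the same.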

    Named Entity Recognition Through Corpus Transformation and System Combination

    In this paper we investigate how to combine different taggers to improve their performance on the named entity recognition task. The main resources used in our experiments are the publicly available taggers TnT and TBL and a corpus of Spanish texts in which named-entity occurrences are tagged with BIO tags. We have defined three transformations that provide three additional versions of the training corpus. The transformations change either the words or the tags, and all three improve the results of TnT and TBL over training on the original version of the corpus. With the four versions of the corpus and the two taggers, we obtain eight different models that can be combined with several techniques. Our experiments show that combining them with machine learning techniques improves performance considerably. We improve the baselines for TnT (Fβ=1 value of 85.25) and TBL (Fβ=1 value of 87.45) up to a value of 90.90 in the best of our experiments.
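    A word-changing corpus transformation of the kind mentioned above can be sketched like this. The abstract does not detail the paper's three transformations, so this is only an illustrative example: each word is rewritten as its capitalisation shape, so a tagger generalises over unseen proper names while the BIO tags stay unchanged.

```python
import re

def word_shape(word):
    """Map a word to its shape: uppercase -> X, lowercase -> x, digit -> 9."""
    shape = re.sub(r"[A-ZÁÉÍÓÚÑ]", "X", word)
    shape = re.sub(r"[a-záéíóúñü]", "x", shape)
    return re.sub(r"[0-9]", "9", shape)

def transform_corpus(tagged_sentence):
    """tagged_sentence: list of (word, BIO-tag) pairs; tags are preserved."""
    return [(word_shape(w), t) for w, t in tagged_sentence]

sentence = [("García", "B-PER"), ("visitó", "O"), ("Sevilla", "B-LOC")]
print(transform_corpus(sentence))
# [('Xxxxxx', 'B-PER'), ('xxxxxx', 'O'), ('Xxxxxxx', 'B-LOC')]
```

    Training the same tagger generator on the original and on transformed versions like this yields models with different error profiles, which is exactly the variability that combination techniques exploit.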

    A comparative study of classifier combination applied to NLP tasks

    The paper is devoted to a comparative study of classifier combination methods, which have been successfully applied to many tasks, including Natural Language Processing (NLP) tasks. There is a variety of classifier combination techniques, and the major difficulty is choosing the one that best fits a particular task. In our study we explored the performance of a number of combination methods, such as voting, Bayesian merging, behavior knowledge space, bagging, stacking, feature sub-spacing and cascading, on the part-of-speech tagging task using nine corpora in five languages. The results show that some methods that are not currently very popular can demonstrate much better performance. In addition, we examined how corpus size and quality influence the performance of the combination methods. We also provide results of applying the classifier combination methods to other NLP tasks, namely named entity recognition and chunking. We believe that our study is the most exhaustive comparison of combination methods applied to NLP tasks made so far.
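    The simplest of the listed methods, per-token majority voting, can be sketched in a few lines (tagger names and tags below are invented for illustration; ties here fall to whichever tag `Counter` counts first):

```python
from collections import Counter

def vote(tag_sequences):
    """Combine several taggers' outputs by per-token majority vote."""
    combined = []
    for tags in zip(*tag_sequences):  # one tuple of predictions per token
        (winner, _), = Counter(tags).most_common(1)
        combined.append(winner)
    return combined

# Three hypothetical POS taggers disagreeing on two of three tokens.
tagger_1 = ["DET", "NOUN", "VERB"]
tagger_2 = ["DET", "NOUN", "NOUN"]
tagger_3 = ["DET", "ADJ",  "VERB"]
print(vote([tagger_1, tagger_2, tagger_3]))
# ['DET', 'NOUN', 'VERB']
```

    The more elaborate methods compared in the paper (Bayesian merging, behavior knowledge space, stacking) replace this unweighted vote with a combiner that is itself estimated or trained from data.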

    Improving the Performance of a Tagger Generator in an Information Extraction Application

    In this paper we present an experiment in the extraction of named entities from Spanish texts using stacking. Named Entity Extraction (NEE) is a subtask of Information Extraction that involves identifying the groups of words that make up the name of an entity and classifying these names into a set of predefined categories. Our approach is corpus-based: we use a re-trainable tagger generator to obtain a named entity extractor from a set of tagged examples. The main contribution of our work is that we obtain the systems needed in a stacking scheme without using any additional training material or tagger generators. Instead, we generate the variability needed for stacking by applying corpus transformations to the original training corpus. Once we have several versions of the training corpus, we generate several extractors and combine them by means of a machine learning algorithm. Experiments show that the combination of corpus transformation and stacking improves the performance of the tagger generator in this kind of natural language processing application. The best of our experiments achieves an improvement of more than six percentage points with respect to the predefined baseline.

    Improving the Performance of a Named Entity Extractor by Applying a Stacking Scheme

    In this paper we investigate how to improve the performance of a Named Entity Extraction (NEE) system by applying machine learning techniques and corpus transformation. The main resources used in our experiments are the publicly available tagger TnT and a corpus of Spanish texts in which named-entity occurrences are tagged with BIO tags. We split the NEE task into two subtasks: 1) Named Entity Recognition (NER), which involves identifying the group of words that make up the name of an entity, and 2) Named Entity Classification (NEC), which determines the category of a named entity. We have focused our work on improving the NER task, generating four different taggers from the same training corpus and combining them using a stacking scheme. We improve the baseline of the NER task (Fβ=1 value of 81.84) up to a value of 88.37. When a NEC module is added to the NER system, the performance of the whole NEE task also improves: a value of 70.47 is achieved from a baseline of 66.07.
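    The NER → NEC split described above can be sketched as a two-stage pipeline. The span-extraction logic is standard for BIO sequences; the NEC stand-in below is a toy lexical rule (the category names and example words are invented, and the paper trains a real model for this stage).

```python
def ner_spans(bio_tags):
    """Turn a BIO sequence into (start, end) entity spans (end exclusive)."""
    spans, start = [], None
    for i, tag in enumerate(bio_tags):
        if tag == "B":                 # a new entity begins
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))   # the open entity ends
            start = None
    if start is not None:
        spans.append((start, len(bio_tags)))
    return spans

def nec(tokens):
    """Toy NEC stand-in: classify a detected span by a simple lexical cue."""
    return "LOC" if tokens[0] in {"Sevilla", "Madrid"} else "PER"

words = ["Juan", "Pérez", "vive", "en", "Sevilla"]
bio   = ["B", "I", "O", "O", "B"]
entities = [(s, e, nec(words[s:e])) for s, e in ner_spans(bio)]
print(entities)
# [(0, 2, 'PER'), (4, 5, 'LOC')]
```

    Keeping the two stages separate lets the boundary detector (NER) be improved — here by stacking — independently of the categoriser (NEC).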

    De-identifying Hospital Discharge Summaries: An End-to-End Framework using Ensemble of Deep Learning Models

    Electronic Medical Records (EMRs) contain clinical narrative text that is of great potential value to medical researchers. However, this information is mixed with Personally Identifiable Information (PII) that presents risks to patient and clinician confidentiality. This paper presents an end-to-end de-identification framework to automatically remove PII from hospital discharge summaries. Our corpus included 600 hospital discharge summaries extracted from the EMRs of two principal referral hospitals in Sydney, Australia. Our end-to-end de-identification framework consists of three components: 1) Annotation: labelling of PII in the 600 hospital discharge summaries using five pre-defined categories: person, address, date of birth, identification number, and phone number; 2) Modelling: training six named entity recognition (NER) deep learning base models on balanced and imbalanced datasets, and evaluating ensembles that combine all six base models, the three base models with the best F1 scores, and the three base models with the best recall scores, respectively, using token-level majority voting and stacking methods; and 3) De-identification: removing PII from the hospital discharge summaries. Our results showed that the ensemble combining the three base models with the best F1 scores via stacking with a Support Vector Machine (SVM) achieved excellent results, with an F1 score of 99.16% on the test set of our corpus. We also evaluated the robustness of our modelling component on the 2014 i2b2 de-identification dataset. Our ensemble model, which uses token-level majority voting over all six base models, achieved the highest F1 score of 96.24% at strict entity matching and the highest F1 score of 98.64% at binary token-level matching compared to two state-of-the-art methods. The framework provides a robust solution for de-identifying clinical narrative text safely.
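    The token-level majority-voting step followed by PII removal can be sketched as below. The model outputs and token labels are invented for illustration; the paper's six base models are deep-learning NER systems, not the toy sequences shown here.

```python
from collections import Counter

def ensemble_vote(model_outputs):
    """model_outputs: per-model label sequences ('O' or a PII category).
    Returns the per-token majority label across all models."""
    return [Counter(tags).most_common(1)[0][0] for tags in zip(*model_outputs)]

def redact(tokens, labels):
    """Replace every token the ensemble labels as PII with a placeholder."""
    return [f"[{lab}]" if lab != "O" else tok
            for tok, lab in zip(tokens, labels)]

tokens = ["Discharged", "on", "12/03/2020", "by", "Dr", "Smith"]
m1 = ["O", "O", "DATE", "O", "O",      "PERSON"]
m2 = ["O", "O", "DATE", "O", "PERSON", "PERSON"]
m3 = ["O", "O", "O",    "O", "O",      "PERSON"]
labels = ensemble_vote([m1, m2, m3])
print(redact(tokens, labels))
# ['Discharged', 'on', '[DATE]', 'by', 'Dr', '[PERSON]']
```

    The stacking variant in the paper replaces the unweighted vote with an SVM meta-classifier trained on the base models' outputs; the redaction step is the same either way.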