Search CORE

4,114 research outputs found

Win-Win Cooperation: Bundling Sequence and Span Models for Named Entity Recognition

Author: Ji Bin
Li Shasha
Liu Huijun
Ma Jun
Yu Jie
Publication venue
Publication date: 15/07/2022
Field of study

For Named Entity Recognition (NER), sequence labeling-based and span-based paradigms are quite different. Previous research has demonstrated that the two paradigms have clear complementary advantages, but few models have attempted to leverage these advantages in a single NER model as far as we know. In our previous work, we proposed a paradigm known as Bundling Learning (BL) to address the above problem. The BL paradigm bundles the two NER paradigms, enabling NER models to jointly tune their parameters by weighted summing each paradigm's training loss. However, three critical issues remain unresolved: When does BL work? Why does BL work? Can BL enhance the existing state-of-the-art (SOTA) NER models? To address the first two issues, we implement three NER models, involving a sequence labeling-based model--SeqNER, a span-based NER model--SpanNER, and BL-NER that bundles SeqNER and SpanNER together. We draw two conclusions regarding the two issues based on the experimental results on eleven NER datasets from five domains. We then apply BL to five existing SOTA NER models to investigate the third issue, consisting of three sequence labeling-based models and two span-based models. Experimental results indicate that BL consistently enhances their performance, suggesting that it is possible to construct a new SOTA NER system by incorporating BL into the current SOTA system. Moreover, we find that BL reduces both entity boundary and type prediction errors. In addition, we compare two commonly used labeling tagging methods as well as three types of span semantic representations

arXiv.org e-Print Archive

Contributions to information extraction for spanish written biomedical text

Author: Pérez Miguel Naiara
Publication venue
Publication date: 28/03/2023
Field of study

285 p.Healthcare practice and clinical research produce vast amounts of digitised, unstructured data in multiple languages that are currently underexploited, despite their potential applications in improving healthcare experiences, supporting trainee education, or enabling biomedical research, for example. To automatically transform those contents into relevant, structured information, advanced Natural Language Processing (NLP) mechanisms are required. In NLP, this task is known as Information Extraction. Our work takes place within this growing field of clinical NLP for the Spanish language, as we tackle three distinct problems. First, we compare several supervised machine learning approaches to the problem of sensitive data detection and classification. Specifically, we study the different approaches and their transferability in two corpora, one synthetic and the other authentic. Second, we present and evaluate UMLSmapper, a knowledge-intensive system for biomedical term identification based on the UMLS Metathesaurus. This system recognises and codifies terms without relying on annotated data nor external Named Entity Recognition tools. Although technically naive, it performs on par with more evolved systems, and does not exhibit a considerable deviation from other approaches that rely on oracle terms. Finally, we present and exploit a new corpus of real health records manually annotated with negation and uncertainty information: NUBes. This corpus is the basis for two sets of experiments, one on cue andscope detection, and the other on assertion classification. Throughout the thesis, we apply and compare techniques of varying levels of sophistication and novelty, which reflects the rapid advancement of the field

Archivo Digital para la Docencia y la Investigación

Recognising Biomedical Names: Challenges and Solutions

Author: Dai Xiang
Publication venue: Faculty of Engineering, School of Computer Science
Publication date: 01/01/2021
Field of study

The growth rate in the amount of biomedical documents is staggering. Unlocking information trapped in these documents can enable researchers and practitioners to operate confidently in the information world. Biomedical Named Entity Recognition (NER), the task of recognising biomedical names, is usually employed as the first step of the NLP pipeline. Standard NER models, based on sequence tagging technique, are good at recognising short entity mentions in the generic domain. However, there are several open challenges of applying these models to recognise biomedical names: ● Biomedical names may contain complex inner structure (discontinuity and overlapping) which cannot be recognised using standard sequence tagging technique; ● The training of NER models usually requires large amount of labelled data, which are difficult to obtain in the biomedical domain; and, ● Commonly used language representation models are pre-trained on generic data; a domain shift therefore exists between these models and target biomedical data. To deal with these challenges, we explore several research directions and make the following contributions: (1) we propose a transition-based NER model which can recognise discontinuous mentions; (2) We develop a cost-effective approach that nominates the suitable pre-training data; and, (3) We design several data augmentation methods for NER. Our contributions have obvious practical implications, especially when new biomedical applications are needed. Our proposed data augmentation methods can help the NER model achieve decent performance, requiring only a small amount of labelled data. Our investigation regarding selecting pre-training data can improve the model by incorporating language representation models, which are pre-trained using in-domain data. Finally, our proposed transition-based NER model can further improve the performance by recognising discontinuous mentions

Sydney eScholarship

TermEval 2020 : shared task on automatic term extraction using the Annotated Corpora for term Extraction Research (ACTER) dataset

Author: Drouin Patrick
Hoste Veronique
Lefever Els
Rigouts Terryn Ayla
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2020
Field of study

The TermEval 2020 shared task provided a platform for researchers to work on automatic term extraction (ATE) with the same dataset: the Annotated Corpora for Term Extraction Research (ACTER). The dataset covers three languages (English, French, and Dutch) and four domains, of which the domain of heart failure was kept as a held-out test set on which final f1-scores were calculated. The aim was to provide a large, transparent, qualitatively annotated, and diverse dataset to the ATE research community, with the goal of promoting comparative research and thus identifying strengths and weaknesses of various state-of-the-art methodologies. The results show a lot of variation between different systems and illustrate how some methodologies reach higher precision or recall, how different systems extract different types of terms, how some are exceptionally good at finding rare terms, or are less impacted by term length. The current contribution offers an overview of the shared task with a comparative evaluation, which complements the individual papers by all participants

Ghent University Academic Bibliography