NLNDE: Enhancing Neural Sequence Taggers with Attention and Noisy Channel for Robust Pharmacological Entity Detection
Named entity recognition has been extensively studied on English news texts.
However, the transfer to other domains and languages is still a challenging
problem. In this paper, we describe the system with which we participated in
the first subtrack of the PharmaCoNER competition of the BioNLP Open Shared
Tasks 2019. Aiming at pharmacological entity detection in Spanish texts, the
task provides a non-standard domain and language setting. However, we propose
an architecture that requires neither language nor domain expertise. We treat
the task as a sequence labeling task and experiment with attention-based
embedding selection and with training on automatically annotated data to further
improve our system's performance. Our system achieves promising results,
especially by combining the different techniques, and reaches up to 88.6% F1 in the competition.
Comment: Published at BioNLP-OST@EMNLP 2019
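The attention-based embedding selection the abstract mentions can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `attention_select`, the `query` vector, and the per-token list of candidate embeddings are hypothetical names, and in the actual system the attention parameters would be trained jointly with the sequence tagger.

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of floats
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_select(embeddings, query):
    """Combine several candidate embeddings of one token into a single vector.

    embeddings: one vector per embedding type (all of the same dimension)
    query: a learned attention vector of that same dimension
    """
    # score each candidate embedding by its dot product with the query vector
    scores = [sum(q * x for q, x in zip(query, emb)) for emb in embeddings]
    weights = softmax(scores)
    dim = len(embeddings[0])
    # weighted sum of the candidate embeddings
    return [sum(w * emb[i] for w, emb in zip(weights, embeddings))
            for i in range(dim)]
```

With a zero query the candidates are averaged; a trained query learns to favour whichever embedding type is most informative for the token at hand.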
Feature-Dependent Confusion Matrices for Low-Resource NER Labeling with Noisy Labels
In low-resource settings, the performance of supervised labeling models can
be improved with automatically annotated or distantly supervised data, which is
cheap to create but often noisy. Previous works have shown that significant
improvements can be reached by injecting information about the confusion
between clean and noisy labels in this additional training data into the
classifier training. However, for noise estimation, these approaches either do
not take the input features (in our case word embeddings) into account, or they
need to learn the noise modeling from scratch which can be difficult in a
low-resource setting. We propose to cluster the training data using the input
features and then compute different confusion matrices for each cluster. To the
best of our knowledge, our approach is the first to leverage feature-dependent
noise modeling with pre-initialized confusion matrices. We evaluate on
low-resource named entity recognition settings in several languages, showing
that our methods improve upon other confusion-matrix-based methods by up to 9%.
Comment: Published at EMNLP-IJCNLP 2019
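The core idea, clustering the training data by input features and estimating a separate confusion matrix per cluster, can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the function names are hypothetical, the cluster centroids are assumed to come from some external clustering step (e.g. k-means over word embeddings), and the `smoothing` pseudo-count stands in loosely for the pre-initialisation the paper describes.

```python
def nearest_cluster(vec, centroids):
    # index of the centroid closest to vec (squared Euclidean distance)
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda k: dist2(vec, centroids[k]))

def cluster_confusion_matrices(features, clean, noisy, centroids, labels,
                               smoothing=1.0):
    """Estimate one confusion matrix P(noisy | clean) per feature cluster.

    features:  one feature vector (e.g. a word embedding) per token
    clean:     clean labels; noisy: distantly supervised labels (aligned)
    centroids: cluster centres obtained from the input features
    smoothing: pseudo-count acting as a simple pre-initialisation
    """
    counts = {k: {c: {n: smoothing for n in labels} for c in labels}
              for k in range(len(centroids))}
    for vec, c, n in zip(features, clean, noisy):
        counts[nearest_cluster(vec, centroids)][c][n] += 1.0
    for k in counts:                     # normalise each row to probabilities
        for c in labels:
            z = sum(counts[k][c].values())
            for n in labels:
                counts[k][c][n] /= z
    return counts
```

During classifier training, the matrix of the cluster a token falls into would then model how its noisy label was likely corrupted from the clean one.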
Evaluating automated and hybrid neural disambiguation for African historical named entities
Documents detailing South African history contain ambiguous names. Names may be ambiguous because different people share the same name or because the same person is referred to by multiple names. Thus, when searching for or attempting to extract information about a particular person, the name used may affect the results. This problem can be alleviated by using a Named Entity Disambiguation (NED) system to disambiguate names by linking them to a knowledge base. In recent years, transformer-based language models have led to improvements in NED systems. Furthermore, multilingual language models have shown the ability to learn concepts across languages, reducing the amount of training data required in low-resource languages. A multilingual language model-based NED system was therefore developed to disambiguate people's names within a historical South African context, using documents written in English and isiZulu from the Five Hundred Year Archive (FHYA). The multilingual language model-based system substantially improved on a probability-based baseline and achieved a micro F1-score of 0.726, while the entity linking component linked 81.9% of the mentions to the correct entity. However, the system performed significantly worse on documents written in isiZulu than on those written in English, so it was augmented with handcrafted rules. The addition of handcrafted rules resulted in a small but significant improvement in performance compared to the unaugmented NED system.
Towards the extraction of cross-sentence relations through event extraction and entity coreference
Cross-sentence relation extraction deals with the extraction of relations beyond the sentence boundary. This thesis focuses on two of the NLP tasks which are of importance to the successful extraction of cross-sentence relation mentions: event extraction and coreference resolution. The first part of the thesis focuses on addressing data sparsity issues in event extraction. We propose a self-training approach for obtaining additional labeled examples for the task. The process starts off with a Bi-LSTM event tagger trained on a small labeled data set which is used to discover new event instances in a large collection of unstructured text. The high confidence model predictions are selected to construct a data set of automatically-labeled training examples. We present several ways in which the resulting data set can be used for re-training the event tagger in conjunction with the initial labeled data. The best configuration achieves statistically significant improvement over the baseline on the ACE 2005 test set (macro-F1), as well as in a 10-fold cross validation (micro- and macro-F1) evaluation. Our error analysis reveals that the augmentation approach is especially beneficial for the classification of the most under-represented event types in the original data set. The second part of the thesis focuses on the problem of coreference resolution. While a certain level of precision can be reached by modeling surface information about entity mentions, their successful resolution often depends on semantic or world knowledge. This thesis investigates an unsupervised source of such knowledge, namely distributed word representations. We present several ways in which word embeddings can be utilized to extract features for a supervised coreference resolver. 
Our evaluation results and error analysis show that each of these features helps improve over the baseline coreference system's performance, with a statistically significant improvement (CoNLL F1) achieved when the proposed features are used jointly. Moreover, all features lead to a reduction in the number of precision errors in resolving references between common nouns, demonstrating that they successfully incorporate semantic information into the process.
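The self-training loop described for event extraction can be sketched as follows. This is a generic sketch, not the thesis implementation: `self_train` and the `fit`/`predict` interface are hypothetical names, the real tagger is a Bi-LSTM rather than the abstract `model` object assumed here, and the confidence threshold is an illustrative parameter.

```python
def self_train(model, labeled, unlabeled, confidence=0.9, rounds=1):
    """Grow the training set with high-confidence model predictions.

    model:   any object with fit(examples) and predict(x) -> (label, score)
    labeled: list of (x, y) pairs; unlabeled: list of raw inputs x
    """
    data = list(labeled)
    for _ in range(rounds):
        model.fit(data)
        auto = []
        for x in unlabeled:
            label, score = model.predict(x)
            if score >= confidence:      # keep only confident predictions
                auto.append((x, label))
        # re-train on the original labeled data plus the auto-labeled examples
        data = list(labeled) + auto
    model.fit(data)
    return model
```

The abstract's "several ways" of combining the two data sets (e.g. simple concatenation, as here, versus weighting or staged training) would all plug into the final `fit` call.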
Recognising Biomedical Names: Challenges and Solutions
The number of biomedical documents is growing at a staggering rate. Unlocking the information trapped in these documents can enable researchers and practitioners to act on it with confidence. Biomedical Named Entity Recognition (NER), the task of recognising biomedical names, is usually employed as the first step of the NLP pipeline.
Standard NER models, based on the sequence tagging technique, are good at recognising short entity mentions in the generic domain. However, applying these models to biomedical names raises several open challenges:
● Biomedical names may have a complex inner structure (discontinuity and overlapping) which cannot be recognised using the standard sequence tagging technique;
● Training NER models usually requires a large amount of labelled data, which is difficult to obtain in the biomedical domain; and,
● Commonly used language representation models are pre-trained on generic data; a domain shift therefore exists between these models and the target biomedical data.
To deal with these challenges, we explore several research directions and make the following contributions: (1) we propose a transition-based NER model which can recognise discontinuous mentions; (2) we develop a cost-effective approach that identifies suitable pre-training data; and (3) we design several data augmentation methods for NER.
Our contributions have clear practical implications, especially when new biomedical applications are needed. Our proposed data augmentation methods help the NER model achieve decent performance using only a small amount of labelled data. Our investigation into selecting pre-training data improves the model by incorporating language representation models that are pre-trained on in-domain data. Finally, our proposed transition-based NER model further improves performance by recognising discontinuous mentions.
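How a transition system can produce a discontinuous mention can be sketched with a toy interpreter. This is a deliberately simplified sketch, not the thesis model: the three-action set below is an assumption for illustration (the actual transition-based model uses a richer action inventory, and a classifier predicts the actions from the parser state rather than taking them as input).

```python
def run_transitions(tokens, actions):
    """Apply a sequence of transition actions to recover entity mentions.

    Simplified action set:
      SHIFT    - move the next token into the partial mention
      OUT      - discard the next token (this is what creates gaps,
                 i.e. discontinuous mentions)
      COMPLETE - emit the accumulated tokens as one finished mention
    """
    buffer = list(enumerate(tokens))      # (position, token) pairs
    partial, mentions = [], []
    for action in actions:
        if action == "SHIFT":
            partial.append(buffer.pop(0))
        elif action == "OUT":
            buffer.pop(0)
        elif action == "COMPLETE":
            mentions.append(partial)
            partial = []
    return mentions
```

For example, in "severe joint and muscle pain" the discontinuous mention "joint ... pain" is recovered by OUT, SHIFT, OUT, OUT, SHIFT, COMPLETE, something a flat BIO tagging scheme cannot express.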