A multi-level methodology for the automated translation of a coreference resolution dataset: an application to the Italian language
In the last decade, the demand for readily accessible corpora has touched all areas of natural language processing, including
coreference resolution. Nevertheless, it remains one of the least explored sub-fields in recent developments,
and almost all existing resources are available only for English. To address this gap, this work proposes a
methodology to create a corpus for coreference resolution in Italian by exploiting annotated resources in other languages.
Starting from OntoNotes, the methodology translates and refines English utterances to obtain utterances that respect Italian
grammar, handling language-specific phenomena while preserving coreference chains and mentions. A quantitative and qualitative
evaluation is performed to assess the well-formedness of the generated utterances, considering readability, grammaticality,
and acceptability indexes. The results confirm the effectiveness of the methodology in deriving a good
coreference resolution dataset from an existing one. The quality of the dataset is also assessed by training a
coreference resolution model based on the BERT language model, which achieves promising results. Although the methodology
has been tailored to English and Italian, it has a general basis that is easily extendable to other languages by adapting a
small number of language-dependent rules to cover most linguistic phenomena of the language under
examination.
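The abstract above does not specify how mention annotations survive automatic translation, but one common way to project span annotations through a black-box translator is to wrap each mention in unique markers before translating and to recover the offsets afterwards. The sketch below illustrates that idea; the marker scheme and both helper functions are assumptions for illustration, not the paper's actual method.

```python
import re

def protect_mentions(text, mentions):
    """Wrap each annotated mention in unique index markers so that
    coreference spans can be recovered after automatic translation.
    `mentions` is a list of (start, end) character offsets."""
    out, prev = [], 0
    for i, (start, end) in enumerate(sorted(mentions)):
        out.append(text[prev:start])
        out.append(f"<m{i}>{text[start:end]}</m{i}>")
        prev = end
    out.append(text[prev:])
    return "".join(out)

def recover_mentions(translated):
    """Strip the markers from translated text and return the clean text
    plus the character offsets of each surviving mention."""
    pattern = re.compile(r"<m(\d+)>(.*?)</m\1>")
    spans, clean, pos, last = {}, [], 0, 0
    for m in pattern.finditer(translated):
        clean.append(translated[last:m.start()])
        pos += m.start() - last
        spans[int(m.group(1))] = (pos, pos + len(m.group(2)))
        clean.append(m.group(2))
        pos += len(m.group(2))
        last = m.end()
    clean.append(translated[last:])
    return "".join(clean), spans
```

For example, `protect_mentions("John said he was tired", [(0, 4), (10, 12)])` yields a marked sentence that a translator can carry into Italian, after which `recover_mentions` restores the mention offsets in the translated output.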
Towards the extraction of cross-sentence relations through event extraction and entity coreference
Cross-sentence relation extraction deals with the extraction of relations beyond the sentence boundary. This thesis focuses on two NLP tasks that are important to the successful extraction of cross-sentence relation mentions: event extraction and coreference resolution. The first part of the thesis addresses data sparsity issues in event extraction. We propose a self-training approach for obtaining additional labeled examples for the task. The process starts with a Bi-LSTM event tagger trained on a small labeled data set, which is used to discover new event instances in a large collection of unstructured text. The high-confidence model predictions are selected to construct a data set of automatically labeled training examples. We present several ways in which the resulting data set can be used for re-training the event tagger in conjunction with the initial labeled data. The best configuration achieves a statistically significant improvement over the baseline on the ACE 2005 test set (macro-F1), as well as in a 10-fold cross-validation (micro- and macro-F1) evaluation. Our error analysis reveals that the augmentation approach is especially beneficial for the classification of the most under-represented event types in the original data set.

The second part of the thesis focuses on the problem of coreference resolution. While a certain level of precision can be reached by modeling surface information about entity mentions, their successful resolution often depends on semantic or world knowledge. This thesis investigates an unsupervised source of such knowledge, namely distributed word representations. We present several ways in which word embeddings can be utilized to extract features for a supervised coreference resolver.
Our evaluation results and error analysis show that each of these features helps improve over the baseline coreference system's performance, with a statistically significant improvement (CoNLL F1) achieved when the proposed features are used jointly. Moreover, all features lead to a reduction in the number of precision errors in resolving references between common nouns, demonstrating that they successfully incorporate semantic information into the process.
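The self-training procedure described above (train on seed data, tag unlabeled text, keep high-confidence predictions, retrain on the union) can be sketched generically as follows. The `fit`/`predict` interface of the tagger is an assumption for illustration; the thesis uses a Bi-LSTM event tagger, not this toy interface.

```python
def self_train(tagger, labeled, unlabeled, threshold=0.95, rounds=1):
    """Generic self-training loop: train on the seed data, tag the
    unlabeled pool, keep only predictions above a confidence threshold,
    and retrain on seed data plus the automatically labeled examples.
    `tagger` is any object exposing fit(examples) and
    predict(example) -> (label, confidence) -- an assumed interface."""
    data = list(labeled)
    for _ in range(rounds):
        tagger.fit(data)
        confident = []
        for x in unlabeled:
            label, conf = tagger.predict(x)
            if conf >= threshold:
                confident.append((x, label))
        # The original seed data is always retained alongside the
        # automatically labeled examples.
        data = list(labeled) + confident
    tagger.fit(data)
    return tagger, data
```

The threshold controls the precision/coverage trade-off of the augmented data set: raising it admits fewer but cleaner automatically labeled examples.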
An ELECTRA-Based Model for Neural Coreference Resolution
In recent years, coreference resolution has received a considerable performance boost by exploiting
different pre-trained neural language models, from BERT to SpanBERT to Longformer. This work
aims to assess, for the first time, the impact of the ELECTRA model on this task, motivated by experimental
evidence of an improved contextual representation and better performance on different downstream tasks.
In particular, ELECTRA has been employed as the representation layer in an established neural coreference
architecture able to detect entity mentions among spans of text and cluster them. The architecture
itself has been optimized: i) by simplifying the representation of spans of text while still considering
both the context in which they appear and their entire content; ii) by maximizing both the number and length
of input textual segments, to better exploit the improved contextual representation power of ELECTRA;
iii) by maximizing the number of spans of text to be processed, since they potentially represent mentions,
while preserving computational efficiency. Experimental results on the OntoNotes dataset have shown the effectiveness
of this solution from both a quantitative and qualitative perspective, also with respect to other
state-of-the-art models, thanks to a more proficient token and span representation. The results also hint at
the possible use of this solution for low-resource languages, simply requiring a pre-trained version of
ELECTRA instead of language-specific models trained to handle either spans of text or long documents.
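The span representations this family of coreference architectures builds on contextual token embeddings can be sketched as follows: concatenate the embeddings of a span's first and last tokens with an attention-weighted average of its content. This is the classic formulation (in the style of end-to-end neural coreference), not the paper's specific simplification; the learned attention vector is stubbed with fixed weights for illustration.

```python
import numpy as np

def span_representation(token_embs, start, end, attn_w):
    """Build a span representation from contextual token embeddings
    (e.g. the output of an ELECTRA encoder): the span's boundary token
    embeddings concatenated with an attention-weighted average of all
    tokens inside the span.
    token_embs: (seq_len, dim) array; start/end: inclusive indices;
    attn_w: (dim,) scoring vector (a stand-in for a learned parameter)."""
    span = token_embs[start:end + 1]        # (span_len, dim)
    scores = span @ attn_w                  # one scalar score per token
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()                  # softmax over span tokens
    head = alphas @ span                    # attention-weighted content
    return np.concatenate([token_embs[start], token_embs[end], head])
```

The resulting vector encodes both the span's boundaries (context) and its full content, which is exactly the trade-off the optimization i) above is concerned with.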
Review of coreference resolution in English and Persian
Coreference resolution (CR) is one of the most challenging areas of natural
language processing. This task seeks to identify all textual references to the
same real-world entity. Research in this field is divided into coreference
resolution and anaphora resolution. Owing to its application in textual
comprehension and its utility in tasks such as information extraction,
document summarization, and machine translation, where it has a significant
effect on system quality, this field has attracted considerable
interest. This article reviews the existing corpora and
evaluation metrics in this field. Then, an overview of the coreference
algorithms, from rule-based methods to the latest deep learning techniques, is
provided. Finally, coreference resolution and pronoun resolution systems in
Persian are investigated.
Toward Concept-Based Text Understanding and Mining
There is a huge amount of text information in the world, written in natural languages. Most of this text information is hard to access compared with other well-structured information sources such as relational databases. This is because reading and understanding text requires the ability to disambiguate text fragments at several levels, syntactically and semantically, abstracting away details and using background knowledge in a variety of ways. One possible solution to these problems is to implement a framework for concept-based text understanding and mining, that is, a mechanism for analyzing and integrating segregated information, and a framework for organizing, indexing, and accessing textual information centered around real-world concepts.
A fundamental difficulty toward this goal is caused by the concept ambiguity of natural language. In text, real-world entities are referred to using their names. The variability in writing a given concept, along with the fact that different concepts/entities may have very similar writings, poses a significant challenge to progress in text understanding and mining. Supporting concept-based natural language understanding requires resolving conceptual ambiguity and, in particular, identifying whether different mentions of real-world entities, within and across documents, actually represent the same concept.
This thesis systematically studies this fundamental problem. We study and propose different machine learning techniques to address different aspects of this problem, and show that as more information can be exploited, the learning techniques developed accordingly can continuously improve the identification accuracy. In addition, we extend our global probabilistic model to address a significant application -- semantic integration between text and databases.
The role of knowledge in determining identity of long-tail entities
NIL entities do not have an accessible representation, which means that their identity cannot be established through traditional disambiguation. Consequently, they have received little attention in entity linking systems and tasks so far. Given the non-redundancy of knowledge on NIL entities, the lack of frequency priors, their potentially extreme ambiguity, and their numerousness, they form an extreme class of long-tail entities and pose a great challenge for state-of-the-art systems. In this paper, we investigate the role of knowledge in establishing the identity of NIL entities mentioned in text. What kind of knowledge can be applied to establish the identity of NILs? Can we potentially link to them at a later point? How can we capture implicit knowledge and fill knowledge gaps in communication? We formulate and test hypotheses to provide insights into these questions. Due to the unavailability of instance-level knowledge, we propose to enrich the locally extracted information with profiling models that rely on background knowledge in Wikidata. We describe and implement two profiling machines based on state-of-the-art neural models. We evaluate their intrinsic behavior and their impact on the task of determining the identity of NIL entities.