Building effective representations for domain adaptation in coreference resolution
Over the past few years, research in coreference resolution, one of the core tasks in Natural Language Processing, has shown significant improvement. However, domain adaptation for coreference resolution remains largely unexplored: Moosavi and Strube [2017] have shown that the performance of state-of-the-art coreference resolution systems drops when the systems are tested on datasets from different domains. We modify e2e-coref [Lee et al., 2017], a state-of-the-art coreference resolution system, to perform well on new domains by adding sparse linguistic features, incorporating information from Wikipedia, and attaching a domain adversarial network to the system. Our experiments show that each modification improves the precision of the system. We train the model on the CoNLL-2012 datasets and test it on several datasets: WikiCoref, and the pt and wb documents from CoNLL-2012. Our best models gain 0.50, 0.52, and 1.14 F1 over the baselines of the respective test sets.
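The abstract names a domain adversarial network but gives no architectural details; below is a minimal PyTorch sketch of the standard gradient-reversal recipe [Ganin and Lempitsky, 2015] as it might be attached to span representations. The module names, dimensions, and the lambd weight are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class DomainAdversary(nn.Module):
    """Predicts the source domain from span representations, so that the
    reversed gradient pushes the encoder toward domain-invariant features."""
    def __init__(self, hidden_dim, num_domains, lambd=1.0):  # assumed hyperparameters
        super().__init__()
        self.lambd = lambd
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_domains),
        )

    def forward(self, span_repr):
        reversed_repr = GradReverse.apply(span_repr, self.lambd)
        return self.classifier(reversed_repr)
```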
Review of coreference resolution in English and Persian
Coreference resolution (CR) is one of the most challenging areas of natural language processing. The task seeks to identify all textual references to the same real-world entity. Research in this field is divided into coreference resolution and anaphora resolution. Due to its application in textual comprehension and its utility in other tasks such as information extraction, document summarization, and machine translation, this field has attracted considerable interest, as it has a significant effect on the quality of these systems. This article reviews the existing corpora and evaluation metrics in the field, then provides an overview of coreference algorithms, from rule-based methods to the latest deep learning techniques. Finally, coreference resolution and pronoun resolution systems in Persian are investigated.
Comment: 44 pages, 11 figures, 5 tables
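As a concrete illustration of the rule-based end of the spectrum this review covers, here is a toy single-sieve pass that clusters mentions by exact surface-string match, in the spirit of classic multi-pass sieve systems. It is a didactic sketch, not code from the article.

```python
from collections import defaultdict

def exact_match_sieve(mentions):
    """Toy rule-based pass: cluster mentions whose lowercased surface
    strings match exactly -- typically the first, highest-precision
    sieve in multi-pass coreference systems."""
    clusters = defaultdict(list)
    for start, end, text in mentions:
        clusters[text.lower()].append((start, end))
    return [spans for spans in clusters.values() if len(spans) > 1]

mentions = [(0, 2, "Barack Obama"), (10, 11, "he"), (20, 22, "Barack Obama")]
print(exact_match_sieve(mentions))  # [[(0, 2), (20, 22)]]
```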
Aspects of Coherence for Entity Analysis
Natural language understanding is an important topic in natural language processing. Given a text, a computer program should, at the very least, be able to understand what the text is about, and ideally also situate it in its extra-textual context and understand what purpose it serves. What exactly it means to understand what a text is about is an open question, but it is generally accepted that, at a minimum, understanding involves being able to answer questions like “Who did what to whom? Where? When? How? And Why?”. Entity analysis, the computational analysis of entities mentioned in a text, aims to support answering the questions “Who?” and “Whom?” by identifying entities mentioned in a text. If the answers to “Where?” and “When?” are specific, named locations and events, entity analysis can also provide these answers. Entity analysis aims to answer these questions by performing entity linking, that is, linking mentions of entities to their corresponding entry in a knowledge base; coreference resolution, that is, identifying all mentions in a text that refer to the same entity; and entity typing, that is, assigning a label such as Person to mentions of entities.
In this thesis, we study how different aspects of coherence can be exploited to improve entity analysis. Our main contribution is a method that allows exploiting knowledge-rich, specific aspects of coherence, namely geographic, temporal, and entity type coherence. Geographic coherence expresses the intuition that entities mentioned in a text tend to be geographically close. Similarly, temporal coherence captures the intuition that entities mentioned in a text tend to be close in the temporal dimension. Entity type coherence is based on the observation that in a text about a certain topic, such as sports, the entities mentioned in it tend to have the same or related entity types, such as sports team or athlete. We show how to integrate features modeling these aspects of coherence into entity linking systems and establish their utility in extensive experiments covering different datasets and systems. Since entity linking often requires computationally expensive joint, global optimization, we propose a simple but effective rule-based approach that enjoys some of the benefits of joint, global approaches while avoiding some of their drawbacks. To enable convenient error analysis for system developers, we introduce a tool for visual analysis of entity linking system output. Investigating another aspect of coherence, namely the coherence between a predicate and its arguments, we devise a distributed model of selectional preferences and assess its impact on a neural coreference resolution system. Our final contribution examines how multilingual entity typing can be improved by incorporating subword information. We train and make publicly available subword embeddings in 275 languages and show their utility in a multilingual entity typing task.
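As an illustration of how a geographic coherence feature could be computed, the sketch below scores a set of candidate entity disambiguations by their mean pairwise great-circle distance; the function names and scoring choice are assumptions for illustration, not the thesis's actual implementation.

```python
from math import radians, sin, cos, asin, sqrt
from itertools import combinations

def haversine_km(a, b):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(h))

def geographic_coherence(coords):
    """Mean pairwise distance of the candidate entities' coordinates;
    lower values indicate a geographically more coherent disambiguation."""
    pairs = list(combinations(coords, 2))
    if not pairs:
        return 0.0
    return sum(haversine_km(a, b) for a, b in pairs) / len(pairs)

# Two candidate readings of ambiguous place names: the lower score
# (closer entities) would be preferred by a coherence feature.
print(geographic_coherence([(39.78, -89.65), (41.88, -87.63)]))  # Illinois pair
print(geographic_coherence([(39.78, -89.65), (42.10, -72.59)]))  # IL + MA pair
```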
Entity-centric knowledge discovery for idiosyncratic domains
Technical and scientific knowledge is produced at an ever-accelerating pace, leading to increasing issues when trying to automatically organize or process it, e.g., when searching for relevant prior work. Knowledge can today be produced both in unstructured (plain text) and structured (metadata or linked data) forms. However, unstructured content is still the most dominant form used to represent scientific knowledge. In order to facilitate the extraction and discovery of relevant content, new automated and scalable methods for processing, structuring and organizing scientific knowledge are called for. In this context, a number of applications are emerging, ranging from Named Entity Recognition (NER) and Entity Linking tools for scientific papers to specific platforms leveraging information extraction techniques to organize scientific knowledge. In this thesis, we tackle the tasks of Entity Recognition, Disambiguation and Linking in idiosyncratic domains, with an emphasis on scientific literature. Furthermore, we study the related task of co-reference resolution with a specific focus on named entities. We start by exploring Named Entity Recognition, a task that aims to identify the boundaries of named entities in textual content. We propose a new method to generate candidate named entities based on n-gram collocation statistics and design several entity recognition features to further classify them. In addition, we show how the use of external knowledge bases (either domain-specific like DBLP or generic like DBPedia) can be leveraged to improve the effectiveness of NER for idiosyncratic domains. Subsequently, we move to Entity Disambiguation, which is typically performed after entity recognition in order to link an entity to a knowledge base. We propose novel semi-supervised methods for word disambiguation leveraging the structure of a community-based ontology of scientific concepts. Our approach exploits the graph structure that connects different terms and their definitions to automatically identify the correct sense that was originally picked by the authors of a scientific publication. We then turn to co-reference resolution, a task aiming at identifying entities that appear under various forms throughout the text. We propose an approach to type entities leveraging an inverted index built on top of a knowledge base, and to subsequently re-assign entities based on the semantic relatedness of the introduced types. Finally, we describe an application whose goal is to help researchers discover and manage scientific publications. We focus on the problem of selecting relevant tags to organize collections of research papers in that context. We experimentally demonstrate that the use of a community-authored ontology, together with information about the position of the concepts in the documents, significantly increases the precision of tag selection over standard methods.
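As a rough illustration of candidate generation from n-gram collocation statistics, the sketch below ranks adjacent token pairs by pointwise mutual information (PMI). The thresholds and function names are illustrative assumptions, not the thesis's actual method.

```python
from collections import Counter
from math import log

def candidate_bigrams(tokens, min_pmi=3.0, min_count=2):
    """Score adjacent token pairs by PMI; high-PMI, frequent bigrams
    (e.g., 'support' + 'vector') are kept as candidate named entities.
    The thresholds here are illustrative choices."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    candidates = []
    for (w1, w2), c in bigrams.items():
        if c < min_count:
            continue
        # PMI = log P(w1, w2) / (P(w1) * P(w2))
        pmi = log((c / (n - 1)) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        if pmi >= min_pmi:
            candidates.append(((w1, w2), pmi))
    return sorted(candidates, key=lambda x: -x[1])
```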
Coreference resolution with and for Wikipedia
Wikipedia is a resource of choice exploited in many NLP applications, yet we are not aware of recent attempts to adapt coreference resolution to this resource, a preliminary step toward understanding Wikipedia texts. In the first part of this master's thesis, we build an English coreference corpus where all documents are drawn from the English version of Wikipedia. We annotated each markable with its coreference type, mention type and the equivalent Freebase topic. Our corpus has no restriction on the topics of the documents being annotated, and documents of various sizes have been considered for annotation. Our annotation scheme follows the one of OntoNotes with a few disparities. In part two, we propose a testbed for evaluating coreference systems on a simple task: identifying the mentions of the concept described in a Wikipedia page (e.g., the mentions of President Obama in the Wikipedia page dedicated to that person). We show that by exploiting the Wikipedia markup (categories, redirects, infoboxes, etc.) of a document, as well as links to external knowledge bases such as Freebase (gender and number information, types of relations with other entities, etc.), we can acquire useful information on entities that helps to classify mentions as coreferent or not.
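A minimal sketch of the kind of alias matching this markup enables, assuming the redirect titles and infobox names have already been extracted from the Wikipedia markup (the function names and the containment heuristic are illustrative, not the thesis's actual classifier):

```python
def build_alias_set(page_title, redirects, infobox_names):
    """Collect surface forms that may refer to the page's main concept.
    `redirects` and `infobox_names` are assumed to be pre-extracted
    from the Wikipedia markup, e.g., by a dump parser."""
    aliases = {page_title.lower()}
    aliases.update(r.lower() for r in redirects)
    aliases.update(n.lower() for n in infobox_names)
    return aliases

def mention_matches_concept(mention, aliases):
    """Treat a mention as a likely coreferent of the main concept if its
    surface form matches, or is contained in, a known alias."""
    m = mention.lower().strip()
    return m in aliases or any(m in a for a in aliases)

aliases = build_alias_set("Barack Obama",
                          redirects=["Barack Hussein Obama II", "Obama"],
                          infobox_names=["Barack Obama"])
print(mention_matches_concept("Obama", aliases))    # True
print(mention_matches_concept("Michelle", aliases)) # False
```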
Inferring Missing Entity Type Instances for Knowledge Base Completion: New Dataset and Methods
Most previous work on knowledge base (KB) completion has focused on the problem of relation extraction. In this work, we focus on the task of inferring missing entity type instances in a KB, a fundamental task for KB completion that has received little attention. Due to the novelty of this task, we construct a large-scale dataset and design an automatic evaluation methodology. Our knowledge base completion method uses information within the existing KB and external information from Wikipedia. We show that individual methods trained with a global objective that considers unobserved cells from both the entity and the type side give consistently higher-quality predictions than baseline methods. We also perform manual evaluation on a small subset of the
the correctness of our proposed automatic evaluation method.Comment: North American Chapter of the Association for Computational
Linguistics- Human Language Technologies, 201
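The abstract does not spell out its global objective; as a hedged illustration of the general idea, the sketch below factorizes a binary entity-type matrix with a logistic loss, sampling unobserved cells as negatives on both the entity and the type side. All names and hyperparameters are assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_entity_type_factors(pos_pairs, n_entities, n_types,
                              dim=16, lr=0.05, neg_per_pos=2, epochs=50):
    """Toy logistic matrix factorization over an entity-type matrix.
    Observed (entity, type) pairs are positives; unobserved cells,
    sampled by corrupting either side of a positive, are negatives."""
    E = rng.normal(scale=0.1, size=(n_entities, dim))
    T = rng.normal(scale=0.1, size=(n_types, dim))
    observed = set(pos_pairs)
    for _ in range(epochs):
        for e, t in pos_pairs:
            samples = [(e, t, 1.0)]
            for _ in range(neg_per_pos):
                nt = int(rng.integers(n_types))      # corrupt the type side
                ne = int(rng.integers(n_entities))   # corrupt the entity side
                if (e, nt) not in observed:
                    samples.append((e, nt, 0.0))
                if (ne, t) not in observed:
                    samples.append((ne, t, 0.0))
            for ei, ti, y in samples:
                p = 1.0 / (1.0 + np.exp(-E[ei] @ T[ti]))
                g = p - y                            # gradient of the log loss
                E[ei], T[ti] = E[ei] - lr * g * T[ti], T[ti] - lr * g * E[ei]
    return E, T

# Tiny usage example: 4 entities, 3 types, 3 observed type assertions.
E, T = train_entity_type_factors([(0, 0), (1, 0), (2, 1)], n_entities=4, n_types=3)
print(E @ T.T)  # scores for every (entity, type) cell, observed or not
```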