Cross-lingual Coreference Resolution of Pronouns
This work is, to our knowledge, the first attempt at a machine learning approach to cross-lingual
coreference resolution, i.e. coreference resolution (CR) performed on a bitext. Focusing on CR of English pronouns, we leverage language differences and enrich the feature set of a standard monolingual CR system for English with features extracted from the Czech side of the bitext. Our work also includes a supervised pronoun aligner that outperforms a GIZA++ baseline both in intrinsic evaluation and in evaluation on CR. The final cross-lingual CR system outperforms both a monolingual CR system and a cross-lingual projection system.
Multilingual Coreference Resolution in Multiparty Dialogue
Existing multiparty dialogue datasets for coreference resolution are nascent,
and many challenges are still unaddressed. We create a large-scale dataset,
Multilingual Multiparty Coref (MMC), for this task based on TV transcripts. Due
to the availability of gold-quality subtitles in multiple languages, we propose
reusing the annotations to create silver coreference data in other languages
(Chinese and Farsi) via annotation projection. On the gold (English) data,
off-the-shelf models perform relatively poorly on MMC, suggesting that MMC has
broader coverage of multiparty coreference than prior datasets. On the silver
data, we find success both using it for data augmentation and training from
scratch, which effectively simulates the zero-shot cross-lingual setting.
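The annotation projection described above can be illustrated with a minimal sketch: gold English mention spans are carried over to a target-language sentence through word alignments to produce silver annotations. All names and the exact filtering heuristics here are illustrative assumptions, not the MMC authors' code.

```python
# Project gold coreference annotations across a word-aligned bitext.
# Spans are inclusive (start, end) token indices; an alignment is a
# dict mapping source token indices to target token indices.

def project_mention(span, alignment):
    """Map an English token span to a target-language span via the
    word alignment; return None if no token of the span is aligned."""
    tgt_positions = [alignment[i] for i in range(span[0], span[1] + 1)
                     if i in alignment]
    if not tgt_positions:          # unaligned mention: drop it
        return None
    return (min(tgt_positions), max(tgt_positions))

def project_clusters(clusters, alignment):
    """Project every mention of every coreference cluster; keep only
    clusters that retain at least two mentions after projection."""
    projected = []
    for cluster in clusters:
        spans = [s for s in (project_mention(m, alignment) for m in cluster)
                 if s is not None]
        if len(spans) >= 2:
            projected.append(spans)
    return projected
```

Dropping clusters that shrink below two mentions mirrors the usual convention that singletons are not coreference links; a real pipeline would add further quality filters on top of this.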
Parallel Data Helps Neural Entity Coreference Resolution
Coreference resolution is the task of finding expressions that refer to the
same entity in a text. Coreference models are generally trained on monolingual
annotated data but annotating coreference is expensive and challenging.
Hardmeier et al. (2013) showed that parallel data contains latent anaphoric
knowledge, but it has not yet been explored in end-to-end neural models. In
this paper, we propose a simple yet effective model to exploit coreference
knowledge from parallel data. In addition to the conventional modules learning
coreference from annotations, we introduce an unsupervised module to capture
cross-lingual coreference knowledge. Our proposed cross-lingual model achieves
consistent improvements, up to 1.74 percentage points, on the OntoNotes 5.0
English dataset using 9 different synthetic parallel datasets. These
experimental results confirm that parallel data can provide additional
coreference knowledge which is beneficial to coreference resolution tasks.
Comment: camera-ready version; to appear in the Findings of ACL 202
BenCoref: A Multi-Domain Dataset of Nominal Phrases and Pronominal Reference Annotations
Coreference Resolution is a well studied problem in NLP. While widely studied
for English and other resource-rich languages, coreference resolution in
Bengali remains largely unexplored due to the absence of relevant datasets.
Bengali, a low-resource language, exhibits greater morphological richness
than English. In this article, we introduce a new
dataset, BenCoref, comprising coreference annotations for Bengali texts
gathered from four distinct domains. This relatively small dataset contains
5200 mention annotations forming 502 mention clusters within 48,569 tokens. We
describe the process of creating this dataset and report performance of
multiple models trained using BenCoref. We anticipate that our work sheds some
light on the variations in coreference phenomena across multiple domains in
Bengali and encourages the development of additional resources for Bengali.
Furthermore, we found poor cross-lingual performance in the zero-shot setting
from English, highlighting the need for more language-specific resources for
this task.
Cross-lingual Incongruences in the Annotation of Coreference
In the present paper, we deal with incongruences in English-German multilingual coreference annotation and present automated methods to discover them. More specifically, we automatically detect full coreference chains in parallel texts and analyse discrepancies in their annotations. In doing so, we wish to find out whether the discrepancies derive from language-typological constraints, from the translation, or from the annotation process itself. The results of our study contribute to the referential analysis of similarities and differences across languages and support the evaluation of cross-lingual coreference annotation. They are also useful for cross-lingual coreference resolution systems and contrastive linguistic studies.
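The discrepancy detection described above can be sketched as follows: each English chain is projected through word alignments, matched to the German chain with the greatest token overlap, and flagged when the two chains disagree in mention count. This is a hedged toy reconstruction; the matching criterion and all function names are assumptions, not the authors' implementation.

```python
# Flag incongruent coreference chains in a word-aligned parallel text.
# Chains are lists of inclusive (start, end) token spans; the alignment
# maps English token indices to German token indices.

def aligned_chain(en_chain, alignment, de_chains):
    """Return the index of the German chain whose tokens overlap most
    with the projection of an English chain, or None if none overlap."""
    projected = {alignment[t] for (s, e) in en_chain
                 for t in range(s, e + 1) if t in alignment}
    best, best_overlap = None, 0
    for idx, chain in enumerate(de_chains):
        tokens = {t for (s, e) in chain for t in range(s, e + 1)}
        overlap = len(projected & tokens)
        if overlap > best_overlap:
            best, best_overlap = idx, overlap
    return best

def incongruences(en_chains, de_chains, alignment):
    """Flag English chains whose matched German chain has a different
    number of mentions (a candidate annotation or typological mismatch)."""
    flags = []
    for i, en in enumerate(en_chains):
        j = aligned_chain(en, alignment, de_chains)
        if j is None or len(de_chains[j]) != len(en):
            flags.append(i)
    return flags
```

A flagged chain only signals a candidate incongruence; deciding whether it stems from typology, translation, or annotation still requires manual inspection, as the abstract notes.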
Dynamic Entity Representations in Neural Language Models
Understanding a long document requires tracking how entities are introduced
and evolve over time. We present a new type of language model, EntityNLM, that
can explicitly model entities, dynamically update their representations, and
contextually generate their mentions. Our model is generative and flexible; it
can model an arbitrary number of entities in context while generating each
entity mention at an arbitrary length. In addition, it can be used for several
different tasks such as language modeling, coreference resolution, and entity
prediction. Experimental results with all these tasks demonstrate that our
model consistently outperforms strong baselines and prior work.
Comment: EMNLP 2017 camera-ready version
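The dynamic entity updates described above can be pictured with a small sketch: each entity keeps a vector that is interpolated with the current hidden state whenever the entity is mentioned. The scalar sigmoid gate used here is a simplified stand-in for the paper's exact update rule, and the parameter names are assumptions.

```python
# Schematic EntityNLM-style entity update: blend the stored entity
# vector with the current LSTM hidden state through a learned gate.

import numpy as np

def update_entity(entity_vec, hidden, W):
    """Gated interpolation of an entity vector with the current context.

    entity_vec: (d,) current unit-norm entity representation
    hidden:     (d,) hidden state at the mention position
    W:          (d, d) learned bilinear gate parameters
    """
    gate = 1.0 / (1.0 + np.exp(-(hidden @ W @ entity_vec)))  # scalar in (0, 1)
    new_vec = gate * entity_vec + (1.0 - gate) * hidden
    return new_vec / np.linalg.norm(new_vec)                 # re-normalize
```

Re-normalizing after every mention keeps entity vectors on the unit sphere, so entities mentioned many times do not drift in magnitude relative to fresh ones.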
mOKB6: A Multilingual Open Knowledge Base Completion Benchmark
Automated completion of open knowledge bases (Open KBs), which are
constructed from triples of the form (subject phrase, relation phrase, object
phrase) obtained via an open information extraction (Open IE) system, is
useful for discovering novel facts that may not be directly present in the text.
However, research in Open KB completion (Open KBC) has so far been limited to
resource-rich languages like English. Using the latest advances in multilingual
Open IE, we construct the first multilingual Open KBC dataset, called mOKB6,
containing facts from Wikipedia in six languages (including English). Improving
the previous Open KB construction pipeline by doing multilingual coreference
resolution and keeping only entity-linked triples, we create a dense Open KB.
We experiment with several models for the task and observe a consistent benefit
of combining languages with the help of shared embedding space as well as
translations of facts. We also observe that current multilingual models
struggle to remember facts seen in languages of different scripts.
Comment: camera-ready version for ACL 202
Investigating Multilingual Coreference Resolution by Universal Annotations
Multilingual coreference resolution (MCR) has been a long-standing and
challenging task. With the newly proposed multilingual coreference dataset,
CorefUD (Nedoluzhko et al., 2022), we conduct an investigation into the task by
using its harmonized universal morphosyntactic and coreference annotations.
First, we study coreference by examining the ground truth data at different
linguistic levels, namely mention, entity and document levels, and across
different genres, to gain insights into the characteristics of coreference
across multiple languages. Second, we perform an error analysis of the most
challenging cases that the SotA system fails to resolve in the CRAC 2022 shared
task using the universal annotations. Last, based on this analysis, we extract
features from universal morphosyntactic annotations and integrate these
features into a baseline system to assess their potential benefits for the MCR
task. Our results show that our best configuration of features improves the
baseline by 0.9% F1 score.
Comment: Accepted at Findings of EMNLP202
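Features of the kind described above can be read off universal morphosyntactic annotations directly; a minimal sketch, assuming CoNLL-U-style FEATS strings on the mentions' syntactic heads, is a pairwise agreement indicator for a coreference scorer. The feature set and its integration into the baseline are illustrative assumptions, not the authors' exact configuration.

```python
# Pairwise morphosyntactic agreement features from CoNLL-U FEATS strings,
# e.g. "Gender=Fem|Number=Sing", as annotated in CorefUD/UD treebanks.

def feats_dict(feats):
    """Parse a CoNLL-U FEATS string into a dict; '_' means no features."""
    if feats in ("_", ""):
        return {}
    return dict(item.split("=", 1) for item in feats.split("|"))

def agreement_features(head_feats_a, head_feats_b):
    """Agreement indicators for Gender and Number between the heads of
    two candidate mentions: 1 = agree, 0 = clash, -1 = underspecified."""
    a, b = feats_dict(head_feats_a), feats_dict(head_feats_b)
    out = []
    for key in ("Gender", "Number"):
        if key in a and key in b:
            out.append(1 if a[key] == b[key] else 0)
        else:
            out.append(-1)
    return out
```

Because the annotations are harmonized across CorefUD languages, the same feature extractor applies unchanged to every language in the collection, which is what makes such features attractive for multilingual coreference.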