Multilingual Coreference Resolution in Multiparty Dialogue
Existing multiparty dialogue datasets for coreference resolution are nascent,
and many challenges are still unaddressed. We create a large-scale dataset,
Multilingual Multiparty Coref (MMC), for this task based on TV transcripts. Due
to the availability of gold-quality subtitles in multiple languages, we propose
reusing the annotations to create silver coreference data in other languages
(Chinese and Farsi) via annotation projection. On the gold (English) data,
off-the-shelf models perform relatively poorly on MMC, suggesting that MMC has
broader coverage of multiparty coreference than prior datasets. On the silver
data, we find success both in using it for data augmentation and in training
from scratch, the latter effectively simulating the zero-shot cross-lingual
setting.
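The annotation-projection idea described above can be sketched in a few lines: gold mention spans on source-language tokens are carried over to the target language through a word alignment, producing silver spans. The function name, the alignment format, and the span-merging heuristic below are illustrative assumptions, not the paper's actual pipeline.

```python
def project_mentions(mentions, alignment):
    """Map source-token mention spans onto target-token spans.

    mentions:  list of (start, end) inclusive source-token spans.
    alignment: dict mapping a source token index to a target token index.
    Spans with no aligned token are dropped rather than guessed.
    """
    projected = []
    for start, end in mentions:
        # Collect target positions for every aligned token in the span.
        targets = [alignment[i] for i in range(start, end + 1) if i in alignment]
        if targets:
            # Take the minimal target span covering all aligned tokens.
            projected.append((min(targets), max(targets)))
    return projected

# Toy example: two single-token mentions, target word order differs.
gold_mentions = [(0, 0), (2, 2)]
alignment = {0: 1, 1: 0, 2: 2}  # source index -> target index
print(project_mentions(gold_mentions, alignment))  # [(1, 1), (2, 2)]
```

In practice the alignments would come from a statistical or neural word aligner, and coreference chain identity is preserved by projecting all mentions of a chain with the same cluster id.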
Code Book for the Annotation of Diverse Cross-Document Coreference of Entities in News Articles
This paper presents a scheme for annotating coreference across news articles,
extending beyond traditional identity relations by also considering
near-identity and bridging relations. It includes a precise description of how
to set up Inception, the annotation tool used; how to annotate entities in
news articles and connect them with diverse coreferential relations; and how
to link them across documents to Wikidata's global knowledge graph. This multi-layered
annotation approach is discussed in the context of the problem of media bias.
Our main contribution lies in providing a methodology for creating a diverse
cross-document coreference corpus which can be applied to the analysis of media
bias by word choice and labelling.
Investigating Multilingual Coreference Resolution by Universal Annotations
Multilingual coreference resolution (MCR) has been a long-standing and
challenging task. With the newly proposed multilingual coreference dataset,
CorefUD (Nedoluzhko et al., 2022), we conduct an investigation into the task by
using its harmonized universal morphosyntactic and coreference annotations.
First, we study coreference by examining the ground truth data at different
linguistic levels, namely mention, entity and document levels, and across
different genres, to gain insights into the characteristics of coreference
across multiple languages. Second, we perform an error analysis of the most
challenging cases that the SotA system fails to resolve in the CRAC 2022 shared
task using the universal annotations. Last, based on this analysis, we extract
features from universal morphosyntactic annotations and integrate these
features into a baseline system to assess their potential benefits for the MCR
task. Our results show that our best configuration of features improves the
baseline by 0.9% F1 score.
Comment: Accepted at Findings of EMNLP 2023