Cross-Lingual Zero Pronoun Resolution
In languages like Arabic, Chinese, Italian, Japanese, Korean, Portuguese, Spanish, and many others, predicate arguments in certain syntactic positions are omitted rather than realized as overt pronouns, and are thus called zero- or null-pronouns. Identifying and resolving such omitted arguments is crucial to machine translation, information extraction and other NLP tasks, but depends heavily on semantic coherence and lexical relationships. We propose a BERT-based cross-lingual model for zero pronoun resolution, and evaluate it on the Arabic and Chinese portions of OntoNotes 5.0. As far as we know, ours is the first neural model of zero-pronoun resolution for Arabic; our model also outperforms the state-of-the-art for Chinese. In the paper we also evaluate BERT feature-extraction and fine-tuning models on the task, and compare them with our model. We also report on an investigation of BERT layers indicating which layer encodes the most suitable representation for the task. Our code is available at https://github.com/amaloraini/cross-lingual-Z
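As a rough illustration (not the authors' implementation), zero-pronoun resolution can be framed as ranking candidate antecedents for each empty argument slot. The scorer below is a deliberately simple word-overlap stand-in for the paper's BERT-based model; all names and data here are hypothetical:

```python
# Minimal sketch of zero-pronoun resolution as antecedent ranking.
# The scoring function is a toy stand-in for a BERT-based model;
# names and example data are illustrative, not from the paper.

def score_candidate(zp_context: str, candidate: str) -> float:
    """Toy scorer: word overlap between the zero-pronoun context and a
    candidate mention. A real system would replace this with
    contextual-embedding similarity or a trained classifier."""
    ctx_words = set(zp_context.lower().split())
    cand_words = set(candidate.lower().split())
    return len(ctx_words & cand_words) / max(len(cand_words), 1)

def resolve_zero_pronoun(zp_context: str, candidates: list) -> str:
    """Pick the highest-scoring candidate antecedent."""
    return max(candidates, key=lambda c: score_candidate(zp_context, c))

# Example: the dropped argument's context mentions "Maria", so the
# overlap scorer links the zero pronoun back to that entity.
antecedent = resolve_zero_pronoun(
    "the new phone that Maria wanted",
    ["Maria", "the store", "yesterday"],
)
print(antecedent)  # -> Maria
```

A trained model would replace the overlap heuristic with scores derived from contextual representations of the gap and each candidate.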
Do language models make human-like predictions about the coreferents of Italian anaphoric zero pronouns?
Some languages allow arguments to be omitted in certain contexts. Yet human
language comprehenders reliably infer the intended referents of these zero
pronouns, in part because they construct expectations about which referents are
more likely. We ask whether Neural Language Models also extract the same
expectations. We test whether 12 contemporary language models display
expectations that reflect human behavior when exposed to sentences with zero
pronouns from five behavioral experiments conducted in Italian by Carminati
(2005). We find that three models - XGLM 2.9B, 4.5B, and 7.5B - capture the
human behavior from all the experiments, with others successfully modeling some
of the results. This result suggests that human expectations about coreference
can be derived from exposure to language, and also indicates features of
language models that allow them to better reflect human behavior. Comment: Accepted at COLING 202
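The comparison underlying a study like this can be sketched as checking, item by item, whether the model's preferred reading (the higher-probability antecedent) matches the human-preferred one. The probabilities below are invented for illustration; in the paper they would come from the language models' scores:

```python
# Sketch of scoring a language model's agreement with human antecedent
# preferences for null pronouns. All numbers are made up.

def agreement_with_humans(items):
    """Each item is (p_subject_reading, p_object_reading, human_pref).
    Returns the fraction of items where the model's higher-probability
    reading matches the human-preferred antecedent."""
    hits = sum(
        1 for p_subj, p_obj, human_pref in items
        if ("subject" if p_subj > p_obj else "object") == human_pref
    )
    return hits / len(items)

items = [
    (0.72, 0.28, "subject"),
    (0.55, 0.45, "subject"),
    (0.30, 0.70, "object"),
    (0.40, 0.60, "subject"),  # model disagrees with humans here
]
print(agreement_with_humans(items))  # -> 0.75
```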
Pronoun Prediction with Linguistic Features and Example Weighing
We present a system submitted to the WMT16 shared task in cross-lingual pronoun prediction, in particular, to the English-to-German and German-to-English sub-tasks. The system is based on a linear classifier making use of features both from the target language model and from linguistically analyzed source and target texts. Furthermore, we apply example weighing in classifier learning, which proved to be beneficial for recall in less frequent pronoun classes. Compared to other shared task participants, our best English-to-German system is able to rank just below the top-performing submissions.
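One common form of example weighting for imbalanced classes is inverse class frequency, where examples of rare pronoun classes get larger weights in the classifier's loss. The shared-task system's exact scheme may differ; this is a generic sketch:

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each training example inversely to its class frequency,
    so rare pronoun classes contribute more to the loss. A common
    scheme; not necessarily the one used by the shared-task system."""
    counts = Counter(labels)
    total = len(labels)
    n_classes = len(counts)
    return [total / (n_classes * counts[y]) for y in labels]

# Hypothetical German pronoun-class labels, with "es" dominant:
labels = ["es", "es", "es", "sie", "er", "es"]
print(inverse_frequency_weights(labels))
# -> [0.5, 0.5, 0.5, 2.0, 2.0, 0.5]
```

Weights computed this way are typically passed to the classifier's training routine (e.g. as per-example sample weights), boosting recall on the infrequent classes.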
ParCorFull2.0: a Parallel Corpus Annotated with Full Coreference
In this paper, we describe ParCorFull2.0, a parallel corpus annotated with full coreference chains for multiple languages, which is an extension of the existing corpus ParCorFull (Lapshinova-Koltunski et al., 2018). Similar to the previous version, this corpus has been created to address translation of coreference across languages, a phenomenon still challenging for machine translation (MT) and other multilingual natural language processing (NLP) applications. The current version of the corpus that we present here contains not only parallel texts for the language pair English-German, but also for English-French and English-Portuguese, which are all major European languages. The new language pairs belong to the Romance languages. The addition of a new language group creates a need for extension not only in terms of texts added, but also in terms of the annotation guidelines. Both French and Portuguese contain structures not found in English and German. Moreover, Portuguese is a pro-drop language, bringing even more systemic differences in the realisation of coreference into our cross-lingual resources. These differences cause problems for multilingual coreference resolution and machine translation. Our parallel corpus with full annotation of coreference will be a valuable resource with a variety of uses not only for NLP applications, but also for contrastive linguists and researchers in translation studies. Christian Hardmeier and Elina Lartaud were supported by the Swedish Research Council under grant 2017-930, which also funded the annotation work of the French data. Pedro Augusto Ferreira was supported by FCT, Foundation for Science and Technology, Portugal, under grant SFRH/BD/146578/2019.
Investigating Multilingual Coreference Resolution by Universal Annotations
Multilingual coreference resolution (MCR) has been a long-standing and
challenging task. With the newly proposed multilingual coreference dataset,
CorefUD (Nedoluzhko et al., 2022), we conduct an investigation into the task by
using its harmonized universal morphosyntactic and coreference annotations.
First, we study coreference by examining the ground truth data at different
linguistic levels, namely mention, entity and document levels, and across
different genres, to gain insights into the characteristics of coreference
across multiple languages. Second, we perform an error analysis of the most
challenging cases that the SotA system fails to resolve in the CRAC 2022 shared
task using the universal annotations. Last, based on this analysis, we extract
features from universal morphosyntactic annotations and integrate these
features into a baseline system to assess their potential benefits for the MCR
task. Our results show that our best configuration of features improves the
baseline by 0.9% F1 score. Comment: Accepted at Findings of EMNLP202
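Features drawn from universal morphosyntactic annotations often take the form of agreement checks between the heads of two mentions. The sketch below derives simple match features from CoNLL-U-style morphological attributes; the feature names and attribute set are assumptions, not the paper's actual feature inventory:

```python
# Hypothetical pairwise mention features from CoNLL-U-style
# morphological attributes (the FEATS column), e.g.
# {"Gender": "Fem", "Number": "Sing"}. Feature names are illustrative.

def agreement_features(head_a: dict, head_b: dict) -> dict:
    """Compare morphological attributes of two mention heads and emit
    match/mismatch/unknown features a coreference system could use."""
    feats = {}
    for attr in ("Gender", "Number", "Person"):
        a, b = head_a.get(attr), head_b.get(attr)
        if a is None or b is None:
            feats[attr + "_match"] = "unknown"
        else:
            feats[attr + "_match"] = "yes" if a == b else "no"
    return feats

mention1 = {"Gender": "Fem", "Number": "Sing"}
mention2 = {"Gender": "Fem", "Number": "Plur", "Person": "3"}
print(agreement_features(mention1, mention2))
# -> {'Gender_match': 'yes', 'Number_match': 'no', 'Person_match': 'unknown'}
```

Such features would then be concatenated with a baseline system's existing mention-pair representation, which is in the spirit of the feature-integration experiment the abstract describes.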
Wino-X: Multilingual Winograd Schemas for Commonsense Reasoning and Coreference Resolution
Winograd schemas are a well-established tool for evaluating coreference resolution (CoR) and commonsense reasoning (CSR) capabilities of computational models. So far, schemas have remained largely confined to English, limiting their utility in multilingual settings. This work presents Wino-X, a parallel dataset of German, French, and Russian schemas, aligned with their English counterparts. We use this resource to investigate whether neural machine translation (NMT) models can perform CoR that requires commonsense knowledge and whether multilingual language models (MLLMs) are capable of CSR across multiple languages. Our findings show Wino-X to be exceptionally challenging for NMT systems that are prone to undesirable biases and unable to detect disambiguating information. We quantify biases using established statistical methods and define ways to address both of these issues. We furthermore present evidence of active cross-lingual knowledge transfer in MLLMs, whereby fine-tuning models on English schemas yields CSR improvements in other languages.
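Evaluations of this kind are often contrastive: for each schema, the model's score for the correctly disambiguated target-side sentence is compared against its score for a version with the wrong pronoun. The scoring function and German sentences below are illustrative stand-ins, not taken from Wino-X itself:

```python
# Sketch of contrastive evaluation for pronoun disambiguation.
# `score` stands in for an NMT model's sentence-level log-probability;
# the fixed scores and example sentences are invented for illustration.

def contrastive_accuracy(pairs, score):
    """Fraction of (correct, contrastive) sentence pairs where the
    model scores the correct translation strictly higher."""
    hits = sum(1 for good, bad in pairs if score(good) > score(bad))
    return hits / len(pairs)

# Toy stand-in scorer: a lookup of made-up log-probabilities.
toy_scores = {
    "Der Pokal passt nicht in den Koffer, weil er zu gross ist.": -1.2,
    "Der Pokal passt nicht in den Koffer, weil sie zu gross ist.": -2.5,
}
pairs = [
    ("Der Pokal passt nicht in den Koffer, weil er zu gross ist.",
     "Der Pokal passt nicht in den Koffer, weil sie zu gross ist."),
]
print(contrastive_accuracy(pairs, toy_scores.get))  # -> 1.0
```

A real evaluation would obtain the two scores by force-decoding each candidate translation with the NMT model rather than from a lookup table.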
A Survey on Semantic Processing Techniques
Semantic processing is a fundamental research domain in computational
linguistics. In the era of powerful pre-trained language models and large
language models, the advancement of research in this domain appears to be
decelerating. However, the study of semantics is multi-dimensional in
linguistics. The research depth and breadth of computational semantic
processing can be largely improved with new technologies. In this survey, we
analyze five semantic processing tasks, namely word sense disambiguation,
anaphora resolution, named entity recognition, concept extraction, and
subjectivity detection. We study relevant theoretical research in these fields,
advanced methods, and downstream applications. We connect the surveyed tasks
with downstream applications because this may inspire future scholars to fuse
these low-level semantic processing tasks with high-level natural language
processing tasks. The review of theoretical research may also inspire new tasks
and technologies in the semantic processing domain. Finally, we compare the
different semantic processing techniques and summarize their technical trends,
application trends, and future directions. Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing from the published version due to the publication policies. Please contact Prof. Erik Cambria for details.