Reflexive pronouns in Spanish Universal Dependencies
In this paper, we argue that the annotation of Spanish reflexives in current Universal Dependencies treebanks is an unsolved problem, one that clearly affects the accuracy and consistency of current parsers. We evaluate different proposals for fine-tuning the various categories, and discuss remaining open issues. We believe that the solution to these issues could lie in a multi-layered annotation of the characteristics, combining annotation of the dependency relation with that of the so-called token features, rather than in expanding the number of categories on one layer. We apply this proposal to the v2.5 Spanish UD AnCora treebank and provide a categorized conversion table that can be run with a Python script.
The CoNLL 2007 shared task on dependency parsing
The Conference on Computational Natural Language Learning features a shared task, in which participants train and test their learning systems on the same data sets. In 2007, as in 2006, the shared task was devoted to dependency parsing, this year with both a multilingual track and a domain adaptation track. In this paper, we define the tasks of the different tracks and describe how the data sets were created from existing treebanks for ten languages. In addition, we characterize the different approaches of the participating systems, report the test results, and provide a first analysis of these results.
A Universal Part-of-Speech Tagset
To facilitate future research in unsupervised induction of syntactic structure and to standardize best practices, we propose a tagset that consists of twelve universal part-of-speech categories. In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set. As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages. We highlight the use of this resource via two experiments, including one that reports competitive accuracies for unsupervised grammar induction without gold standard part-of-speech tags.
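The mapping idea described in this abstract can be illustrated with a minimal Python sketch. The dictionary below is a small, illustrative subset of a Penn Treebank to universal-tag mapping written from memory, not the published mapping file; the names `PTB_TO_UNIVERSAL` and `to_universal` are hypothetical, and using `X` as the fallback for unlisted tags is an assumption of this sketch.

```python
# Illustrative subset only; consult the published mapping files for the real table.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON",
    "DT": "DET", "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "RP": "PRT", ".": ".",
}


def to_universal(tagged_tokens, mapping=PTB_TO_UNIVERSAL, fallback="X"):
    """Replace each treebank-specific tag with its coarse universal category.

    Tags missing from the mapping fall back to `fallback` (an assumption here).
    """
    return [(word, mapping.get(tag, fallback)) for word, tag in tagged_tokens]


if __name__ == "__main__":
    sentence = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")]
    print(to_universal(sentence))
```

Because the conversion is a plain lookup, the same function could be reused for any other treebank by passing a different `mapping` dictionary.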
UD Annotatrix: An Annotation Tool For Universal Dependencies
In this paper we introduce UD Annotatrix, a tool for manual annotation of Universal Dependencies. The tool is designed around the needs of the Universal Dependencies (UD) community: it can operate fully offline and is freely available under the GNU GPL licence. We provide some background to the tool, an overview of its development, and a description of how it works. We compare it with other widely used tools for Universal Dependencies annotation, describe some features unique to UD Annotatrix, and finally outline some avenues for future work and provide a few concluding remarks.
Findings of the Shared Task on Multilingual Coreference Resolution
This paper presents an overview of the shared task on multilingual coreference resolution associated with the CRAC 2022 workshop. Shared task participants were asked to develop trainable systems capable of identifying mentions and clustering them according to identity coreference. The public edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used as the source of training and evaluation data. The CoNLL score used in previous coreference-oriented shared tasks was used as the main evaluation metric. There were 8 coreference prediction systems submitted by 5 participating teams; in addition, there was a competitive Transformer-based baseline system provided by the organizers at the beginning of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of the CoNLL scores averaged across all datasets for individual languages).
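As a rough illustration of the metric mentioned above: the CoNLL score is conventionally the unweighted mean of the MUC, B-cubed, and CEAF-e F1 scores, and the abstract reports it averaged across datasets. The sketch below assumes that definition and a simple macro-average; the function names are hypothetical, any input numbers are placeholders rather than shared-task results, and details of the official scorer (such as how mentions are matched) are not modelled.

```python
def conll_score(muc_f1, b_cubed_f1, ceafe_f1):
    """Unweighted mean of the three coreference F1 scores."""
    return (muc_f1 + b_cubed_f1 + ceafe_f1) / 3.0


def macro_average(per_dataset_scores):
    """Average CoNLL scores over datasets, each dataset weighted equally."""
    scores = list(per_dataset_scores.values())
    if not scores:
        raise ValueError("no datasets given")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Placeholder F1 values for two hypothetical datasets, not real results.
    per_dataset = {
        "dataset_a": conll_score(70.0, 60.0, 50.0),
        "dataset_b": conll_score(80.0, 70.0, 60.0),
    }
    print(macro_average(per_dataset))
```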
HamleDT 2.0: Thirty Dependency Treebanks Stanfordized
We present HamleDT 2.0 (HArmonized Multi-LanguagE Dependency Treebank). HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that became popular recently.
We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both annotation styles, including the adjustments that were necessary, and provide details about the conversion process. We also discuss the differences between the two styles, evaluating their advantages and disadvantages, and note the effects of the differences on the conversion.
We regard the stanfordization as generally successful, although we admit several shortcomings, especially in the distinction between direct and indirect objects, that have to be addressed in the future.
We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline.