HamleDT 2.0: Thirty Dependency Treebanks Stanfordized
We present HamleDT 2.0 (HArmonized Multi-LanguagE Dependency Treebank). HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that has recently become popular.
We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both annotation styles, including the adjustments that proved necessary, and provide details about the conversion process. We also discuss the differences between the two styles, evaluating their advantages and disadvantages, and note the effects of these differences on the conversion.
We regard the stanfordization as generally successful, although we admit several shortcomings, especially in the distinction between direct and indirect objects, that have to be addressed in future.
We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline.
Discourse Relations and Connectives in Higher Text Structure
The present article investigates the possibilities and limits of local (shallow) analysis of discourse coherence with respect to the phenomena of global coherence and higher composition of texts. We study corpora annotated with local discourse relations in Czech and partly in English to find clues in the local annotation that indicate a higher discourse structure. First, we classify patterns of subsequent or overlapping pairs of local relations, and hierarchies formed by nested local relations. Special attention is then given to relations crossing paragraph boundaries and their semantic types, and to paragraph-initial discourse connectives. In the third part, we examine situations in which annotators tend to mark a large argument (larger than one sentence) of a discourse relation even with a minimality principle annotation rule in place. Our analyses bring (i) new linguistic insights regarding coherence signals in local and higher contexts, e.g. detection and description of hierarchies of local discourse relations up to 5 levels in Czech and English, description of distribution differences in semantic types in cross-paragraph and other settings, identification of Czech connectives typical only of higher structures, and detection of the prevalence of large left-sided arguments in locally annotated data; and (ii) as another type of contribution, some new reflections on the methodologies of the approaches under scrutiny.
Proceedings
Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010.
Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk.
NEALT Proceedings Series, Vol. 10 (2010), 98 pages.
© 2010 The editors and contributors.
Published by Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/15893
Findings of the Shared Task on Multilingual Coreference Resolution
This paper presents an overview of the shared task on multilingual
coreference resolution associated with the CRAC 2022 workshop. Shared task
participants were supposed to develop trainable systems capable of identifying
mentions and clustering them according to identity coreference. The public
edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used
as the source of training and evaluation data. The CoNLL score used in previous
coreference-oriented shared tasks was used as the main evaluation metric. There
were 8 coreference prediction systems submitted by 5 participating teams; in
addition, there was a competitive Transformer-based baseline system provided by
the organizers at the beginning of the shared task. The winning system
outperformed the baseline by 12 percentage points (in terms of the CoNLL scores
averaged across all datasets for individual languages).
GENTLE: A Genre-Diverse Multilayer Challenge Set for English NLP and Linguistic Evaluation
We present GENTLE, a new mixed-genre English challenge corpus totaling 17K
tokens and consisting of 8 unusual text types for out-of-domain evaluation:
dictionary entries, esports commentaries, legal documents, medical notes,
poetry, mathematical proofs, syllabuses, and threat letters. GENTLE is manually
annotated for a variety of popular NLP tasks, including syntactic dependency
parsing, entity recognition, coreference resolution, and discourse parsing. We
evaluate state-of-the-art NLP systems on GENTLE and find severe degradation for
at least some genres in their performance on all tasks, which indicates
GENTLE's utility as an evaluation dataset for NLP systems.
Comment: Camera-ready for LAW-XVII collocated with ACL 202
Dependency parsing of Turkish
The suitability of different parsing methods for different languages is an important topic in
syntactic parsing. Lesser-studied languages in particular, typologically different from the languages
for which methods were originally developed, pose interesting challenges in this respect.
This article presents an investigation of data-driven dependency parsing of Turkish, an agglutinative
free constituent order language that can be seen as the representative of a wider class
of languages of similar type. Our investigations show that morphological structure plays an
essential role in finding syntactic relations in such a language. In particular, we show that
employing sublexical representations called inflectional groups, rather than word forms, as the
basic parsing units improves parsing accuracy. We compare two different parsing methods, one
based on a probabilistic model with beam search, the other based on discriminative classifiers and
a deterministic parsing strategy, and show that the usefulness of sublexical units holds regardless
of parsing method. We examine the impact of morphological and lexical information in detail and
show that, properly used, this kind of information can improve parsing accuracy substantially.
Applying the techniques presented in this article, we achieve the highest reported accuracy for
parsing the Turkish Treebank.