HamleDT 2.0: Thirty Dependency Treebanks Stanfordized
We present HamleDT 2.0 (HArmonized Multi-LanguagE Dependency Treebank), a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, an annotation style that has recently become popular.
We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both annotation styles, including the adjustments we needed to make, and provide details about the conversion process. We also discuss the differences between the two styles, evaluate their advantages and disadvantages, and note the effects of the differences on the conversion.
We regard the Stanfordization as generally successful, although we acknowledge several shortcomings, especially in the distinction between direct and indirect objects, which need to be addressed in future work.
We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline.
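At its core, converting between annotation styles means relabeling (and partly restructuring) dependency trees. The sketch below is purely illustrative and not the actual HamleDT pipeline: the afun-to-relation mapping is a hypothetical, simplified excerpt, and real conversion also transforms tree structure (e.g. for coordination).

```python
# Illustrative sketch of dependency-label conversion between annotation
# styles. The mapping is a hypothetical, simplified excerpt; the real
# HamleDT conversion also restructures trees (e.g. coordination).

# Hypothetical subset of a Prague-afun -> Stanford-relation mapping.
AFUN_TO_STANFORD = {
    "Sb": "nsubj",    # subject
    "Obj": "dobj",    # object (the direct/indirect split is the hard part)
    "Atr": "nmod",    # attribute modifying a noun
    "Adv": "advmod",  # adverbial
    "Pred": "root",   # main predicate
}

def stanfordize(tree):
    """Relabel a tree stored as (form, head_index, deprel) triples."""
    return [
        (form, head, AFUN_TO_STANFORD.get(deprel, "dep"))
        for form, head, deprel in tree
    ]

tree = [("John", 2, "Sb"), ("sleeps", 0, "Pred"), ("soundly", 2, "Adv")]
print(stanfordize(tree))
```

A label lookup with a `dep` fallback mirrors why the paper reports shortcomings: a pure relabeling cannot recover distinctions (such as direct vs. indirect object) that the source style does not encode.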
Translation of "It" in a Deep Syntax Framework
We present a novel approach to the translation of the English personal pronoun it into Czech. We conduct a linguistic analysis of how the distinct categories of it are usually mapped to their Czech counterparts. Armed with these observations, we design a discriminative translation model of it, which is then integrated into the TectoMT deep-syntax MT framework. Features in the model take advantage of the rich syntactic annotation TectoMT is based on, external tools for anaphoricity resolution, lexical co-occurrence frequencies measured on a large parallel corpus, and gold coreference annotation. Even though the new model for it exhibits no improvement in terms of BLEU, manual evaluation shows that it outperforms the original solution in 8.5% of the sentences containing it.
Cross-lingual Coreference Resolution of Pronouns
This work is, to our knowledge, the first attempt at a machine learning approach to cross-lingual coreference resolution, i.e., coreference resolution (CR) performed on a bitext. Focusing on CR of English pronouns, we leverage language differences and enrich the feature set of a standard monolingual CR system for English with features extracted from the Czech side of the bitext. Our work also includes a supervised pronoun aligner that outperforms a GIZA++ baseline in terms of both intrinsic evaluation and evaluation on CR. The final cross-lingual CR system outperforms both a monolingual CR system and a cross-lingual projection system.
Proceedings
Proceedings of the Workshop on Annotation and
Exploitation of Parallel Corpora AEPC 2010.
Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk.
NEALT Proceedings Series, Vol. 10 (2010), 98 pages.
© 2010 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt.
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/15893
Findings of the Shared Task on Multilingual Coreference Resolution
This paper presents an overview of the shared task on multilingual
coreference resolution associated with the CRAC 2022 workshop. Shared task
participants were supposed to develop trainable systems capable of identifying
mentions and clustering them according to identity coreference. The public
edition of CorefUD 1.0, which contains 13 datasets for 10 languages, was used
as the source of training and evaluation data. The CoNLL score used in previous
coreference-oriented shared tasks was used as the main evaluation metric. There
were 8 coreference prediction systems submitted by 5 participating teams; in
addition, there was a competitive Transformer-based baseline system provided by
the organizers at the beginning of the shared task. The winning system outperformed the baseline by 12 percentage points (in terms of the CoNLL scores averaged across all datasets for the individual languages).
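The CoNLL score used as the main metric is conventionally the unweighted mean of the MUC, B³, and CEAF F1 scores, which is then averaged over datasets for the overall ranking. A minimal sketch of that aggregation, where all numeric values are invented placeholders rather than real shared-task results:

```python
# CoNLL score = mean of the MUC, B3 and CEAF F1 scores; the shared task
# then averages per-dataset CoNLL scores for the overall ranking.
# All numbers below are invented placeholders, not real results.

def conll_score(muc_f1, b3_f1, ceaf_f1):
    """Unweighted mean of the three coreference F1 metrics."""
    return (muc_f1 + b3_f1 + ceaf_f1) / 3.0

# Hypothetical per-dataset metric F1s for one system:
datasets = {
    "lang1_corpus": (70.0, 65.0, 63.0),
    "lang2_corpus": (60.0, 55.0, 50.0),
}

per_dataset = {name: conll_score(*f1s) for name, f1s in datasets.items()}
macro_average = sum(per_dataset.values()) / len(per_dataset)
print(per_dataset, round(macro_average, 2))
```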
Proceedings
Proceedings of the Ninth International Workshop
on Treebanks and Linguistic Theories.
Editors: Markus Dickinson, Kaili Müürisep and Marco Passarotti.
NEALT Proceedings Series, Vol. 9 (2010), 268 pages.
© 2010 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt.
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/15891
Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLP
Transfer learning, particularly approaches that combine multi-task learning
with pre-trained contextualized embeddings and fine-tuning, has advanced the
field of Natural Language Processing tremendously in recent years. In this
paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized
embeddings in multi-task settings. The benefits of MaChAmp are its flexible
configuration options and its support for a variety of natural language
processing tasks in a uniform toolkit, from text classification and sequence
labeling to dependency parsing, masked language modeling, and text generation.
Comment: https://machamp-nlp.github.io
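Toolkits of this kind typically declare a multi-task setup in a configuration file rather than in code. The fragment below is only an assumed illustration of that pattern; the dataset name, paths, and field names are hypothetical and are not taken from MaChAmp's actual documentation, so consult the project page for the real format.

```json
{
  "ewt": {
    "train_data_path": "data/ewt.train.conllu",
    "word_idx": 1,
    "tasks": {
      "upos": { "task_type": "seq",        "column_idx": 3 },
      "deps": { "task_type": "dependency", "column_idx": 6 }
    }
  }
}
```

Declaring tasks per dataset and per column is what lets one uniform toolkit cover classification, sequence labeling, and parsing without task-specific training scripts.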