197 research outputs found

Introduction to the special issue on annotated corpora

Author: Candito Marie
Liberman Mark
Publication venue: 'Associacio catalana de Salut Laboral'
Publication date: 20/12/2019

Annotated corpora are increasingly important for linguistic scholarship, science and technology. This special issue briefly surveys the development of the field and points to challenges within the current framework of annotation using analytical categories, as well as challenges to the framework itself. It presents three articles: one concerning the evaluation of annotation quality, and two concerning French treebanks. Of these two, the first deals with the oldest treebank project for French, the French Treebank, and the second concerns the conversion of French corpora into the cross-lingual framework of Universal Dependencies, thus offering an illustration of the history of treebank development worldwide.

From the world to word order: Deriving biases in noun phrase order from statistical properties of the world

Author: Culbertson Jennifer
Kirby Simon
Schouwstra Marieke
Publication venue: 'Project Muse'
Publication date: 01/09/2020

Edinburgh Research Explorer

The Parallel-TUT: a multilingual and multiformat treebank

Author: Bosco Cristina
Lesmo Leonardo
Sanguinetti Manuela
Publication venue: European Language Resources Association (ELRA)
Publication date: 01/01/2012

The paper introduces an ongoing project for the development of a parallel treebank for Italian, English and French, i.e. Parallel–TUT, or simply ParTUT. For the development of this resource, both the dependency and constituency-based formats of the Italian Turin University Treebank (TUT) have been applied to a preliminary dataset, which includes the whole text of the Universal Declaration of Human Rights, sentences from the JRC-Acquis Multilingual Parallel Corpus and the Creative Commons licence. The project focuses mainly on the quality of the annotation and on investigating the kinds of data alignment that the TUT formats allow, also taking into account the availability of conversion tools for displaying data in standard formats such as Tiger–XML and CoNLL. It is, in fact, our belief that increasing the portability of our treebank could give us the opportunity to access resources and tools provided by other research groups, especially at this stage of the project, where no tool compatible with the TUT format is available to tackle the alignment problems.

Archivio istituzionale della ricerca - Università di Cagliari

Institutional Research Information System University of Turin

Increasing return on annotation investment: the automatic construction of a Universal Dependency treebank for Dutch

Author: Bouma Gosse
van Noord Gerardus
Publication date: 01/05/2017

ARTS repository - University of Groningen

New Treebank or Repurposed? On the Feasibility of Cross-Lingual Parsing of Romance Languages with Universal Dependencies

Author: Alonso Miguel A
García Marcos
Gómez-Rodríguez Carlos
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 06/10/2017

This is the final peer-reviewed manuscript that was accepted for publication in Natural Language Engineering. Changes resulting from the publishing process, such as editing, corrections, structural formatting, and other quality control mechanisms, may not be reflected in this document.

This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, comparing its performance with that obtained using manually annotated resources of the target languages. Several experiments take into account factors such as the lexical distance between the source and target varieties, the impact of delexicalization, the combination of different source treebanks, and the adaptation of resources to the target language, among others. The results of these evaluations show that directly applying a parser from one Romance language to another reaches labeled attachment score (LAS) values similar to those obtained with a manual annotation of about 3,000 tokens in the target language, and unlabeled attachment score (UAS) results equivalent to the use of around 7,000 tokens, depending on the case. These numbers can increase noticeably with a focused selection of the source treebanks. Furthermore, removing the words from the training corpus (delexicalization) is not useful in most cases of cross-lingual parsing of Romance languages. The lessons learned from these experiments were used to build a new UD treebank for Galician, with 1,000 sentences manually corrected after an automatic cross-lingual annotation. Several evaluations on this new resource show that a cross-lingual parser built with the best combination and adaptation of the source treebanks performs better (77 percent LAS and 82 percent UAS) than using more than 16,000 (for LAS results) and more than 20,000 (UAS) manually labeled tokens of Galician.

Funding: Ministerio de Economía y Competitividad, FJCI-2014-22853; Ministerio de Economía y Competitividad, FFI2014-51978-C2-1-R; Ministerio de Economía y Competitividad, FFI2014-51978-C2-2-

Repositorio da Universidade da Coruña

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

The Lacunae of Danish Natural Language Processing

Author: Derczynski Leon
Kirkedal Andreas Søeborg
Plank Barbara
Schluter Natalie
Publication date: 01/01/2019

The IT University of Copenhagen's Repository

Universal Dependencies and Morphology for Hungarian - and on the Price of Universality

Author: Farkas Richárd
Simkó Katalin Ilona
Szántó Zsolt
Vincze Veronika
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/04/2017

In this paper, we present how the principles of universal dependencies and morphology have been adapted to Hungarian. We report the most challenging grammatical phenomena and our solutions to them. On the basis of the adapted guidelines, we have converted and manually corrected 1,800 sentences from the Szeged Treebank to universal dependency format. We also introduce experiments on this manually annotated corpus for evaluating automatic conversion and the added value of language-specific, i.e. non-universal, annotations. Our results reveal that converting to universal dependencies is not necessarily trivial; moreover, using language-specific morphological features may have an impact on overall performance.

Repository of the Academy's Library

HamleDT 2.0: Thirty Dependency Treebanks Stanfordized

Author: Mareček David
Mašek Jan
Popel Martin
Rosa Rudolf
Zeman Daniel
Žabokrtský Zdeněk
Publication date: 01/01/2014

We present HamleDT 2.0 (HArmonized Multi-LanguagE Dependency Treebank). HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that has recently become popular. We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both annotation styles, including the adjustments that were necessary, and provide details about the conversion process. We also discuss the differences between the two styles, evaluating their advantages and disadvantages, and note the effects of the differences on the conversion. We regard the stanfordization as generally successful, although we admit several shortcomings, especially in the distinction between direct and indirect objects, that have to be addressed in the future. We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline.

Biblio at Institute of Formal and Applied Linguistics