197 research outputs found
Introduction to the special issue on annotated corpora
International audienceLes corpus annoteĚs sont toujours plus cruciaux, aussi bien pour la recherche scien- tifique en linguistique que le traitement automatique des langues. Ce numeĚro speĚcial passe brieĚvement en revue lâeĚvolution du domaine et souligne les deĚfis aĚ relever en restant dans le cadre actuel dâannotations utilisant des cateĚgories analytiques, ainsi que ceux remettant en question le cadre lui-meĚme. Il preĚsente trois articles, lâun concernant lâeĚvaluation de la qualiteĚ dâannotation, et deux concernant des corpus arboreĚs du français, lâun traitant du plus ancien projet de corpus arboreĚ du français, le French Treebank, le second concernant la conversion de corpus français dans le scheĚma interlingue des Universal Dependencies, offrant ainsi une illustration de lâhistoire du deĚveloppement des corpus arboreĚs.Annotated corpora are increasingly important for linguistic scholarship, science and technology. This special issue briefly surveys the development of the field and points to challenges within the current framework of annotation using analytical categories as well as challenges to the framework itself. It presents three articles, one concerning the evaluation of the quality of annotation, and two concerning French treebanks, one dealing with the oldest project for French, the French Treebank, the second concerning the conversion of French corpora into the cross-lingual framework of Universal Dependencies, thus offering an illustration of the history of treebank development worldwide
The Parallel-TUT: a multilingual and multiformat treebank
The paper introduces an ongoing project for the development of a parallel treebank for Italian, English and French, i.e. ParallelâTUT,
or simply ParTUT. For the development of this resource, both the dependency and constituency-based formats of the Italian Turin
University Treebank (TUT) have been applied to a preliminary dataset, which includes the whole text of the Universal Declaration
of Human Rights, sentences from the JRC-Acquis Multilingual Parallel Corpus and the Creative Commons licence. The focus of the project is mainly on the quality of the annotation and the investigation of some issues related to the alignment of data that can be allowed by the TUT formats, also taking into account the availability of conversion tools for display data in standard ways, such as TigerâXML and CoNLL formats. It is, in fact, our belief that increasing the portability of our treebank could give us the opportunity to access resources and tools provided by other research groups, especially at this stage of the project, where no particular tool â compatible with the TUT format â is available in order to tackle the alignment problems
New Treebank or Repurposed? On the Feasibility of Cross-Lingual Parsing of Romance Languages with Universal Dependencies
This is the final peer-reviewed manuscript that was accepted for publication in Natural Language Engineering. Changes resulting from the publishing process, such as editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document.[Abstract] This paper addresses the feasibility of cross-lingual parsing with Universal Dependencies (UD) between Romance languages, analyzing its performance when compared to the use of manually annotated resources of the target languages. Several experiments take into account factors such as the lexical distance between the source and target varieties, the impact of delexicalization, the combination of different source treebanks or the adaptation of resources to the target language, among others. The results of these evaluations show that the direct application of a parser from one Romance language to another reaches similar labeled attachment score (LAS) values to those obtained with a manual annotation of about 3,000 tokens in the target language, and unlabeled attachment score (UAS) results equivalent to the use of around 7,000 tokens, depending on the case. These numbers can noticeably increase by performing a focused selection of the source treebanks. Furthermore, the removal of the words in the training corpus (delexicalization) is not useful in most cases of cross-lingual parsing of Romance languages. The lessons learned with the performed experiments were used to build a new UD treebank for Galician, with 1,000 sentences manually corrected after an automatic cross-lingual annotation. Several evaluations in this new resource show that a cross-lingual parser built with the best combination and adaptation of the source treebanks performs better (77 percent LAS and 82 percent UAS) than using more than 16,000 (for LAS results) and more than 20,000 (UAS) manually labeled tokens of Galician.Ministerio de EconomĂa y Competitividad; FJCI-2014-22853Ministerio de EconomĂa y Competitividad; FFI2014-51978-C2-1-RMinisterio de EconomĂa y Competitividad; FFI2014-51978-C2-2-
Universal Dependencies and Morphology for Hungarian - and on the Price of Universality
In this paper, we present how the principles of universal dependencies and morphology have been adapted to Hungarian. We report the most challenging grammatical phenomena and our solutions to those. On the basis of the adapted guidelines, we have converted and manually corrected 1,800 sentences from the Szeged Treebank to universal dependency format. We also introduce experiments on this manually annotated corpus for evaluating automatic conversion and the added value of language-specific, i.e. non-universal, annotations. Our results reveal that converting to universal dependencies is not necessarily trivial, moreover, using language-specific morphological features may have an impact on overall performance
HamleDT 2.0: Thirty Dependency Treebanks Stanfordized
We present HamleDT 2.0 (HArmonized Multi-LanguagE Dependency Treebank). HamleDT 2.0 is a collection of 30 existing treebanks harmonized into a common annotation style, the Prague Dependencies, and further transformed into Stanford Dependencies, a treebank annotation style that became popular recently.
We use the newest basic Universal Stanford Dependencies, without added language-specific subtypes. We describe both of the annotation styles, including adjustments that were necessary to make, and provide details about the conversion process. We also discuss the differences between the two styles, evaluating their advantages and disadvantages, and note the effects of the differences on the conversion.
We regard the stanfordization as generally successful, although we admit several shortcomings, especially in the distinction between direct and indirect objects, that have to be addressed in future.
We release part of HamleDT 2.0 freely; we are not allowed to redistribute the whole dataset, but we do provide the conversion pipeline
- âŚ