532 research outputs found

Treebank annotation schemes and parser evaluation for German

Author: Rehbein Ines
van Genabith Josef
Publication venue: 'Association for Computational Linguistics (ACL)'
Publication date: 01/01/2007

Recent studies focussed on the question of whether less-configurational languages like German are harder to parse than English, or whether the lower parsing scores are an artefact of treebank encoding schemes and data structures, as claimed by Kübler et al. (2006). This claim is based on the assumption that PARSEVAL metrics fully reflect parse quality across treebank encoding schemes. In this paper we present new experiments to test this claim. We use the PARSEVAL metric, the Leaf-Ancestor metric as well as a dependency-based evaluation, and present novel approaches measuring the effect of controlled error insertion on treebank trees and parser output. We also provide extensive post-parsing cross-treebank conversion. The results of the experiments show that, contrary to Kübler et al. (2006), the question of whether or not German is harder to parse than English remains undecided.

Irish Universities

DCU Online Research Access Service

A Maximum-Entropy Partial Parser for Unrestricted Text

Author: Brants Thorsten
Skut Wojciech
Publication date: 01/01/1998

This paper describes a partial parser that assigns syntactic structures to sequences of part-of-speech tags. The program uses the maximum entropy parameter estimation method, which allows a flexible combination of different knowledge sources: the hierarchical structure, parts of speech and phrasal categories. In effect, the parser goes beyond simple bracketing and recognises even fairly complex structures. We give accuracy figures for different applications of the parser.

arXiv.org e-Print Archive

CiteSeerX

How to compare treebanks

Author: Kübler Sandra
Maier Wolfgang
Rehbein Ines
Versley Yannick
Publication date: 01/01/2008

Recent years have seen an increasing interest in developing standards for linguistic annotation, with a focus on the interoperability of the resources. This effort, however, requires a profound knowledge of the advantages and disadvantages of linguistic annotation schemes in order to avoid importing the flaws and weaknesses of existing encoding schemes into the new standards. This paper addresses the question of how to compare syntactically annotated corpora and gain insights into the usefulness of specific design decisions. We present an exhaustive evaluation of two German treebanks with crucially different encoding schemes. We evaluate three different parsers trained on the two treebanks and compare results using EVALB, the Leaf-Ancestor metric, and a dependency-based evaluation. Furthermore, we present TePaCoC, a new testsuite for the evaluation of parsers on complex German grammatical constructions. The testsuite provides a well thought-out error classification, which enables us to compare parser output for parsers trained on treebanks with different encoding schemes and provides interesting insights into the impact of treebank annotation schemes on specific constructions like PP attachment or non-constituent coordination.

Hochschulschriftenserver - Universität Frankfurt am Main

Corpus Annotation for Parser Evaluation

Author: Briscoe Ted
Carroll John
Minnen Guido
Publication date: 01/01/1999

We describe a recently developed corpus annotation scheme for evaluating parsers that avoids shortcomings of current methods. The scheme encodes grammatical relations between heads and dependents, and has been used to mark up a new public-domain corpus of naturally occurring English text. We show how the corpus can be used to evaluate the accuracy of a robust parser, and relate the corpus to extant resources.

arXiv.org e-Print Archive

CiteSeerX

The TüBa-D/Z treebank: annotating German with a context-free backbone

Author: Hinrichs Erhard
Kübler Sandra
Telljohann Heike
Publication date: 01/01/2004

The purpose of this paper is to describe the TüBa-D/Z treebank of written German and to compare it to the independently developed TIGER treebank (Brants et al., 2002). Both treebanks, TIGER and TüBa-D/Z, use an annotation framework that is based on phrase structure grammar and that is enhanced by a level of predicate-argument structure. The comparison between the annotation schemes of the two treebanks focuses on the different treatments of free word order and discontinuous constituents in German as well as on differences in phrase-internal annotation.

CiteSeerX

Hochschulschriftenserver - Universität Frankfurt am Main

A testsuite for testing parser performance on complex German grammatical constructions [TePaCoC - a corpus for testing parser performance on complex German grammatical constructions]

Author: Genabith Josef van
Kübler Sandra
Rehbein Ines
Publication date: 01/11/2008

Traditionally, parsers are evaluated against gold standard test data. This can cause problems if there is a mismatch between the data structures and representations used by the parser and the gold standard. A particular case in point is German, for which two treebanks (TiGer and TüBa-D/Z) are available with highly different annotation schemes for the acquisition of (e.g.) PCFG parsers. The differences between the TiGer and TüBa-D/Z annotation schemes make fair and unbiased parser evaluation difficult [7, 9, 12]. The resource (TEPACOC) presented in this paper takes a different approach to parser evaluation: instead of providing evaluation data in a single annotation scheme, TEPACOC uses comparable sentences and their annotations for 5 selected key grammatical phenomena (with 20 sentences per phenomenon) from both TiGer and TüBa-D/Z resources. This provides a 2 × 100 sentence comparable testsuite which allows us to evaluate TiGer-trained parsers against the TiGer part of TEPACOC, and TüBa-D/Z-trained parsers against the TüBa-D/Z part of TEPACOC for key phenomena, instead of comparing them against a single (and potentially biased) gold standard. To overcome the problem of inconsistency in human evaluation and to bridge the gap between the two different annotation schemes, we provide an extensive error classification, which enables us to compare parser output across the two different treebanks. In the remaining part of the paper we present the testsuite and describe the grammatical phenomena covered in the data. We discuss the different annotation strategies used in the two treebanks to encode these phenomena and present our error classification of potential parser errors.

Utrecht University Repository

Hochschulschriftenserver - Universität Frankfurt am Main

Automatic F-Structure Annotation from the AP Treebank

Author: Sadler Louisa
van Genabith Josef
Way Andy
Publication venue: CSLI Publications
Publication date: 01/01/2000

We present a method for automatically annotating treebank resources with functional structures. The method defines systematic patterns of correspondence between partial PS configurations and functional structures. These are applied to PS rules extracted from treebanks. The set of techniques which we have developed constitutes a methodology for corpus-guided grammar development. Despite the widespread belief that treebank representations are not very useful in grammar development, we show that systematic patterns of c-structure to f-structure correspondence can be simply and successfully stated over such rules. The method is partial in that it requires manual correction of the annotated grammar rules.

CiteSeerX

Irish Universities

DCU Online Research Access Service

Parallel Aligned Treebank Corpora at LDC: Methodology, Annotation and Integration

Author: Bies Ann
Ge Niyu
Grimes Stephen
Ismael Safa
Li Xuansong
Ma Xiaoyi
Maamouri Mohamed
Strassel Stephanie
Xue Nianwen
Publication date: 01/01/2010

Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 14-23. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

CiteSeerX

DSpace at Tartu University Library