A diagnostic tool for German syntax

Abstract

In this paper we describe an effort to construct a catalogue of syntactic data exemplifying the major syntactic patterns of German. The purpose of the corpus is to support the diagnosis of errors in the syntactic components of natural language processing (NLP) systems. Two secondary aims are the evaluation of NLP system components and the support of theoretical and empirical work on German syntax. The data consist of artificially and systematically constructed expressions, including negative (ungrammatical) examples. The data are organized into a relational database and annotated with basic information about the phenomena illustrated and the internal structure of the sample sentences. The organization of the data supports selective, systematic testing of specific areas of syntax, and also allows the collection to serve as a linguistic database. The paper first gives some general motivation for the necessity of syntactic precision in some areas of NLP and discusses the potential contribution of a syntactic database to the field of component evaluation. The second part of the paper describes the setup and control methods applied in the construction of the sentence suite and of the annotations to the examples. We illustrate the approach with the example of verbal government. This part also contains a description of the abstract data model, the design of the database, and the query language used to access the data. The final sections compare our work to existing approaches and sketch some future extensions. We invite other research groups to participate in our effort, so that the diagnostic tool can eventually become public domain.
