
    The Spoken Dutch Corpus (“Corpus Gesproken Nederlands”

    The paper describes the syntactic annotation of the Spoken Dutch Corpus (“Corpus Gesproken Nederlands” or CGN), the Dutch-Flemish project (1998-2003) aiming at the collection, description and annotation of ten million words of spoken Dutch. In the first part, the background of the parsing strategy is discussed, as well as some details concerning the actual implementation of the parsing process. The second part discusses some examples of practical applications of the result of the parsing process.

    Lemmatisation and Morphosyntactic Annotation for the Spoken Dutch Corpus

    This paper describes the lemmatisation and tagging guidelines developed for the “Spoken Dutch Corpus”, and lays out the philosophy behind the high-granularity tagset that was designed for the project. To bootstrap the annotation of large quantities of material (10 million words) with this new tagset, we tested several existing taggers and tagger generators on initial samples of the corpus. The results show that the most effective method, when trained on the small samples, is a high-quality implementation of a Hidden Markov Model tagger generator. However, we also show that a combination of systems improves the accuracy, and that this combinatory approach allows us to leverage existing taggers with different tagsets and lexical resources. 1 Introduction The Dutch-Flemish project “Corpus Gesproken Nederlands” (1998-2003) aims at the collection, transcription and annotation of ten million words of spoken Dutch, see Oostdijk (this volume). The first layer of linguistic annotation concerns th..