Morphological Tagging of a Spoken Portuguese Corpus Using Available Resources

Amaro, Raquel; Mendes, Amália; Nascimento, Maria Fernanda Bacelar do

Authors: Raquel Amaro, Amália Mendes, Maria Fernanda Bacelar do Nascimento
Publication date: 1 January 2004
Publisher: Colibri

Abstract

This paper discusses the experience of reusing annotation tools developed for written corpora to tag a spoken corpus with POS information. Eric Brill’s tagger, initially trained on a written, tagged corpus of 250,000 words, is being used to tag the Portuguese C-ORAL-ROM spoken corpus of 300,000 words. First, we address issues related to the tagset definition as well as the tagger’s performance on the written corpus. We discuss important options concerning the transcription of the spoken corpus, which have a direct impact on the tagging task, as well as the additional tags required. In some cases, transcription options allow for automatic tag identification and replacement through a post-tagger process. Other cases, such as the annotation of discourse markers, are more complex and require manual revision (and possibly listening to the recordings). Since the final annotation will include not only the POS tag but also the wordform lemma, the paper also addresses issues related to the lemmatisation task. The positive results obtained show that tagging and lemmatising a spoken Portuguese corpus by reusing already available resources may serve as an example of how to minimize the costs of such a task without compromising the results. Finally, we discuss some possible developments to improve the tagger’s performance.

Available Versions

Universidade de Lisboa: Repositório.UL

oai:repositorio.ul.pt:10451/37...

Last time updated on 27/03/2019

Repositório da Universidade de Lisboa

oai:repositorio.ul.pt:10451/37...

Last time updated on 12/06/2020