This paper addresses the issue of {\sc pos} tagger evaluation. Such
evaluation is usually performed by comparing the tagger output with a reference
test corpus, which is assumed to be error-free. Currently used corpora contain
noise which causes the obtained performance to be a distortion of the real
value. We analyze to what extent this distortion may invalidate the comparison
between taggers or the measure of the improvement given by a new system. The
main conclusion is that a more rigorous testing experimentation
setting/designing is needed to reliably evaluate and compare tagger accuracies.Comment: Appears in proceedings of joint COLING-ACL 1998, Montreal, Canad