Studies of the contextual and linguistic factors that constrain discourse
phenomena such as reference are coming to depend increasingly on annotated
language corpora. In preparing the corpora, it is important to evaluate the
reliability of the annotation, but methods for doing so have not been readily
available. In this report, I present a method for computing the reliability of
coreference annotation. First, I review a method, proposed by Marc Vilain and
his collaborators, for applying the information retrieval metrics of recall and
precision to coreference annotation. I show how this method makes it possible to
construct contingency tables for computing Cohen's Kappa, a familiar
reliability metric. By comparing recall and precision to reliability on the
same data sets, I also show that recall and precision can be misleadingly high.
Because Kappa factors out chance agreement among coders, it is a preferable
measure for developing annotated corpora where no pre-existing target
annotation exists.

Comment: 10 pages, 2-column format; uuencoded, gzipped, tarfil
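
As a rough illustration of the Kappa computation mentioned above, the following Python sketch computes Cohen's Kappa from a square contingency table of two coders' decisions. The function name `cohens_kappa` and the example counts are hypothetical and are not taken from the report. How the table cells are derived from coreference annotations is not spelled out here, so the sketch assumes the table has already been built.

```python
from typing import Sequence


def cohens_kappa(table: Sequence[Sequence[float]]) -> float:
    """Cohen's Kappa for a square contingency table.

    table[i][j] is the number of items that coder A put in category i
    and coder B put in category j (hypothetical layout for this sketch).
    """
    k = len(table)
    if any(len(row) != k for row in table):
        raise ValueError("contingency table must be square")
    total = sum(sum(row) for row in table)
    if total == 0:
        raise ValueError("contingency table is empty")

    # Observed agreement: proportion of items on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / total

    # Chance agreement: sum over categories of the product of the
    # two coders' marginal proportions for that category.
    row_marginals = [sum(row) / total for row in table]
    col_marginals = [sum(table[i][j] for i in range(k)) / total for j in range(k)]
    expected = sum(r * c for r, c in zip(row_marginals, col_marginals))

    if expected == 1.0:
        raise ValueError("Kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


# Made-up counts: raw agreement is 0.90, yet Kappa is slightly negative,
# because both coders choose the first category almost all the time.
skewed = [[90, 5], [5, 0]]
print(cohens_kappa(skewed))
```

With these made-up counts the two coders agree on 90 of 100 items, but nearly all of that agreement is what their skewed marginals would produce by chance, so Kappa falls just below zero. This is one way an uncorrected agreement figure can look high while the chance-corrected value does not.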