Currently, computational linguists and cognitive scientists working in the
area of discourse and dialogue argue that their subjective judgments are
reliable using several different statistics, none of which are easily
interpretable or comparable to each other. Meanwhile, researchers in content
analysis have already experienced the same difficulties and come up with a
solution in the kappa statistic. We discuss what is wrong with reliability
measures as they are currently used for discourse and dialogue work in
computational linguistics and cognitive science, and argue that we would be
better off as a field adopting techniques from content analysis.
Comment: 9 page
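
To make concrete the kind of agreement measure the abstract refers to, here is a minimal sketch of a kappa computation for two coders who each assign one category to every item. It assumes the common two-coder form, kappa = (P(A) - P(E)) / (1 - P(E)), where P(A) is the observed proportion of agreement and P(E) is the agreement expected by chance from each coder's own category frequencies. The function name `cohen_kappa` and the sample labels are illustrative assumptions; the variant of kappa used in the paper itself may estimate P(E) differently.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Two-coder kappa: chance-corrected agreement on categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty item set")

    n = len(labels_a)

    # P(A): proportion of items on which the two coders agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # P(E): chance agreement, from each coder's own category distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")

    return (observed - expected) / (1 - expected)


# Hypothetical dialogue-act labels from two coders on four utterances.
coder_a = ["question", "question", "statement", "statement"]
coder_b = ["question", "statement", "statement", "statement"]
kappa = cohen_kappa(coder_a, coder_b)
```

In this invented example the coders agree on three of four items (P(A) = 0.75), while chance agreement is 0.5, so raw agreement overstates how much the coders agree beyond what chance alone would produce.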