Evaluation of sentiment analysis, like large-scale IR evaluation, relies on the accuracy of human assessors to create judgments. Subjectivity in judgments is a problem for relevance assessment, and even more so for sentiment annotation. In this study we examine the degree to which assessors agree upon sentence-level sentiment annotation. We show that inter-assessor agreement is not contingent on document length or frequency of sentiment, but correlates positively with automated opinion retrieval performance. We also examine the individual annotation categories to determine which categories pose the most difficulty for annotators.