Evaluation of sentiment analysis, like large-scale IR evaluation, relies on the accuracy of human assessors to create judgments. Subjectivity in judgments is a problem for relevance assessment, and even more so for sentiment annotation. In this study we examine the degree to which assessors agree upon sentence-level sentiment annotation. We show that inter-assessor agreement is not contingent on document length or frequency of sentiment, but correlates positively with automated opinion retrieval performance. We also examine the individual annotation categories to determine which categories pose the most difficulty for annotators.