20 research outputs found

What Determines Inter-Coder Agreement in Manual Annotations? A Meta-Analytic Investigation

Author: Carletta Jean
Julius Sim
Karsten Ingmar Paul
Petra Saskia Bayerl
Publication venue: 'MIT Press - Journals'
Publication date: 01/01/2011

Recent discussions of annotator agreement have mostly centered around its calculation and interpretation, and the correct choice of indices. Although these discussions are important, they only consider the "back-end" of the story, namely, what to do once the data are collected. Just as important in our opinion is to know how agreement is reached in the first place and what factors influence coder agreement as part of the annotation process or setting, as this knowledge can provide concrete guidelines for the planning and set-up of annotation projects. To investigate whether there are factors that consistently impact annotator agreement we conducted a meta-analytic investigation of annotation studies reporting agreement percentages. Our meta-analysis synthesized factors reported in 96 annotation studies from three domains (word-sense disambiguation, prosodic transcriptions, and phonetic transcriptions) and was based on a total of 346 agreement indices. Our analysis identified seven factors that influence reported agreement values: annotation domain, number of categories in a coding scheme, number of annotators in a project, whether annotators received training, the intensity of annotator training, the annotation purpose, and the method used for the calculation of percentage agreements. Based on our results we develop practical recommendations for the assessment, interpretation, calculation, and reporting of coder agreement. We also briefly discuss theoretical implications for the concept of annotation quality

Crossref

EUR Research Repository

Erasmus University Digital Repository

ChatGPT outperforms crowd workers for text-annotation tasks

Author: Alizadeh Meysam
Gilardi Fabrizio
Kubli Mael
Publication venue: National Academy of Sciences
Publication date: 01/01/2023

Many NLP applications require manual text annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using four samples of tweets and news articles (n = 6,183), we show that ChatGPT outperforms crowd workers for several annotation tasks, including relevance, stance, topics, and frame detection. Across the four datasets, the zero-shot accuracy of ChatGPT exceeds that of crowd workers by about 25 percentage points on average, while ChatGPT’s intercoder agreement exceeds that of both crowd workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003—about thirty times cheaper than MTurk. These results demonstrate the potential of large language models to drastically increase the efficiency of text classification

ZORA

All that glitters...: Interannotator agreement in natural language processing

Author: Borin Lars
Publication venue: 'UiT The Arctic University of Norway'
Publication date: 30/08/2022

Evaluation has emerged as a central concern in natural language processing (NLP) over the last few decades. Evaluation is done against a gold standard, a manually linguistically annotated dataset, which is assumed to provide the ground truth against which the accuracy of the NLP system can be assessed automatically. In this article, some methodological questions in connection with the creation of gold standard datasets are discussed, in particular (non-)expectations of linguistic expertise in annotators and the interannotator agreement measure standardly but unreflectedly used as a kind of quality index of NLP gold standards

Septentrio Academic Publishing

Sentiment and behaviour annotation in a corpus of dialogue summaries

Author: Alvares Alexandre Rossi
Carvalho Ariadne Maria Brito Rizzoni
Piwek Paul
Roman Norton Trevisan
Publication date: 01/01/2015

This paper proposes a scheme for sentiment annotation. We show how the task can be made tractable by focusing on one of the many aspects of sentiment: sentiment as it is recorded in behaviour reports of people and their interactions. Together with a number of measures for supporting the reliable application of the scheme, this allows us to obtain sufficient to good agreement scores (in terms of Krippendorff's alpha) on three key dimensions: polarity, evaluated party and type of clause. Evaluation of the scheme is carried out through the annotation of an existing corpus of dialogue summaries (in English and Portuguese) by nine annotators. Our contribution to the field is twofold: (i) a reliable multi-dimensional annotation scheme for sentiment in behaviour reports; and (ii) an annotated corpus that was used for testing the reliability of the scheme and which is made available to the research community

ZENODO

Open Research Online (The Open University)

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

ARPHA OAI-PMH Endpoint

ARPHA Preprints

Empirical Methodology for Crowdsourcing Ground Truth

Author: Aroyo Lora
Dumitrache Anca
Inel Oana
Ortiz Carlos
Sips Robert-Jan
Timmermans Benjamin
Welty Chris
Publication date: 24/09/2018

The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, there is ambiguity in the data, as well as a multitude of perspectives of the information examples. We present an empirically derived methodology for efficiently gathering ground truth data in a diverse set of use cases covering a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics that capture inter-annotator disagreement. We show that measuring disagreement is essential for acquiring a high quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical Relation Extraction, Twitter Event Identification, News Event Extraction and Sound Interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.

Comment: in publication at the Semantic Web Journal

arXiv.org e-Print Archive

Inter-Annotator Agreement in linguistica: una rassegna critica

Author: Gagliardi Gloria
Publication venue: 'OpenEdition'
Publication date: 01/01/2018

Inter-Annotator Agreement coefficients are widely used in Computational Linguistics and NLP to assess the "reliability" of linguistic annotations. The paper offers a brief review of the scientific literature on the topic, illustrating chance-corrected coefficients and their interpretation

Crossref

ARCHIVIO ISTITUZIONALE DELLA RICERCA-UNIVERSITA' DEGLI STUDI DI NAPOLI "L'ORIENTALE"

Università degli Studi di Napoli L'Orientale: CINECA IRIS

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

OpenEdition

The foreign language classroom anxiety scale and academic achievement: an overview of the prevailing literature and a meta-analysis

Author: Botes E.
Dewaele Jean-Marc
Greiff S.
Publication venue: International Association for the Psychology of Language Learning (IAPLL)
Publication date: 27/06/2020

Foreign language learners experience a unique type of anxiety during the language learning process: Foreign Language Classroom Anxiety (FLCA). This situation-specific anxiety is frequently examined alongside academic achievement in foreign language courses. The present meta-analysis examined the relationship between FLCA measured through the Foreign Language Classroom Anxiety Scale (FLCAS) and five forms of academic achievement: general academic achievement and four competency-specific outcome scores (reading-, writing-, listening-, and speaking academic achievement). A total of k = 99 effect sizes were analysed with an overall sample size of N = 14128 in a random effects model with Pearson correlation coefficients. A moderate negative correlation was found between FLCA and all categories of academic achievement (e.g., general academic achievement: r = -.39; k = 59; N = 12585). The results of this meta-analysis confirm the negative association between FLCA and academic achievement in foreign language courses

Birkbeck Institutional Research Online