20 research outputs found

    What Determines Inter-Coder Agreement in Manual Annotations? A Meta-Analytic Investigation

    Get PDF
    Recent discussions of annotator agreement have mostly centered around its calculation and interpretation, and the correct choice of indices. Although these discussions are important, they only consider the "back-end" of the story, namely, what to do once the data are collected. Just as important in our opinion is to know how agreement is reached in the first place and what factors influence coder agreement as part of the annotation process or setting, as this knowledge can provide concrete guidelines for the planning and set-up of annotation projects. To investigate whether there are factors that consistently impact annotator agreement we conducted a meta-analytic investigation of annotation studies reporting agreement percentages. Our meta-analysis synthesized factors reported in 96 annotation studies from three domains (word-sense disambiguation, prosodic transcriptions, and phonetic transcriptions) and was based on a total of 346 agreement indices. Our analysis identified seven factors that influence reported agreement values: annotation domain, number of categories in a coding scheme, number of annotators in a project, whether annotators received training, the intensity of annotator training, the annotation purpose, and the method used for the calculation of percentage agreements. Based on our results we develop practical recommendations for the assessment, interpretation, calculation, and reporting of coder agreement. We also briefly discuss theoretical implications for the concept of annotation quality

    ChatGPT outperforms crowd workers for text-annotation tasks

    Full text link
    Many NLP applications require manual text annotations for a variety of tasks, notably to train classifiers or evaluate the performance of unsupervised models. Depending on the size and degree of complexity, the tasks may be conducted by crowd workers on platforms such as MTurk as well as trained annotators, such as research assistants. Using four samples of tweets and news articles (n = 6,183), we show that ChatGPT outperforms crowd workers for several annotation tasks, including relevance, stance, topics, and frame detection. Across the four datasets, the zero-shot accuracy of ChatGPT exceeds that of crowd workers by about 25 percentage points on average, while ChatGPT’s intercoder agreement exceeds that of both crowd workers and trained annotators for all tasks. Moreover, the per-annotation cost of ChatGPT is less than $0.003—about thirty times cheaper than MTurk. These results demonstrate the potential of large language models to drastically increase the efficiency of text classification

    All that glitters...: Interannotator agreement in natural language processing

    Get PDF
    Evaluation has emerged as a central concern in natural language processing (NLP) over the last few decades. Evaluation is done against a gold standard, a manually linguistically annotated dataset, which is assumed to provide the ground truth against which the accuracy of the NLP system can be assessed automatically. In this article, some methodological questions in connection with the creation of gold standard datasets are discussed, in particular (non-)expectations of linguistic expertise in annotators and the interannotator agreement measure standardly but unreflectedly used as a kind of quality index of NLP gold standards

    Sentiment and behaviour annotation in a corpus of dialogue summaries

    Get PDF
    This paper proposes a scheme for sentiment annotation. We show how the task can be made tractable by focusing on one of the many aspects of sentiment: sentiment as it is recorded in behaviour reports of people and their interactions. Together with a number of measures for supporting the reliable application of the scheme, this allows us to obtain sufficient to good agreement scores (in terms of Krippendorf's alpha) on three key dimensions: polarity, evaluated party and type of clause. Evaluation of the scheme is carried out through the annotation of an existing corpus of dialogue summaries (in English and Portuguese) by nine annotators. Our contribution to the field is twofold: (i) a reliable multi-dimensional annotation scheme for sentiment in behaviour reports; and (ii) an annotated corpus that was used for testing the reliability of the scheme and which is made available to the research community

    Empirical Methodology for Crowdsourcing Ground Truth

    Full text link
    The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, there is ambiguity in the data, as well as a multitude of perspectives of the information examples. We present an empirically derived methodology for efficiently gathering of ground truth data in a diverse set of use cases covering a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics that capture inter-annotator disagreement. We show that measuring disagreement is essential for acquiring a high quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical Relation Extraction, Twitter Event Identification, News Event Extraction and Sound Interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.Comment: in publication at the Semantic Web Journa

    Inter-Annotator Agreement in linguistica: una rassegna critica

    Get PDF
    I coefficienti di Inter-Annotator Agreement sono ampiamente utilizzati in Linguistica Computazionale e NLP per valutare il livello di “affidabilità” delle annotazioni linguistiche. L’articolo propone una breve revisione della letteratura scientifica sull’argomento.Agreement indexes are widely used in Computational Linguistics and NLP to assess the reliability of annotation tasks. The paper aims at reviewing the literature on the topic, illustrating chance-corrected coefficients and their interpretation

    The foreign language classroom anxiety scale and academic achievement: an overview of the prevailing literature and a meta-analysis

    Get PDF
    Foreign language learners experience a unique type of anxiety during the language learning process: Foreign Language Classroom Anxiety (FLCA). This situation-specific anxiety is frequently examined alongside academic achievement in foreign language courses. The present meta-analysis examined the relationship between FLCA measured through the Foreign Language Classroom Anxiety Scale (FLCAS) and five forms of academic achievement: general academic achievement and four competency-specific outcome scores (reading-, writing-, listening-, and speaking academic achievement). A total of k = 99 effect sizes were analysed with an overall sample size of N = 14128 in a random effects model with Pearson correlation coefficients. A moderate negative correlation was found between FLCA and all categories of academic achievement (e.g., general academic achievement: r = -.39; k = 59; N = 12585). The results of this meta analysis confirm the negative association between FLCA and academic achievement in foreign language courses
    corecore