    Modelling Instance-Level Annotator Reliability for Natural Language Labelling Tasks

    When constructing models that learn from noisy labels produced by multiple annotators, it is important to accurately estimate the reliability of annotators. Annotators may provide labels of inconsistent quality due to their varying expertise and reliability in a domain. Previous studies have mostly focused on estimating each annotator's overall reliability on the entire annotation task. However, in practice, the reliability of an annotator may depend on each specific instance. Only a limited number of studies have investigated modelling per-instance reliability and these only considered binary labels. In this paper, we propose an unsupervised model which can handle both binary and multi-class labels. It can automatically estimate the per-instance reliability of each annotator and the correct label for each instance. We specify our model as a probabilistic model which incorporates neural networks to model the dependency between latent variables and instances. For evaluation, the proposed method is applied to both synthetic and real data, including two labelling tasks: text classification and textual entailment. Experimental results demonstrate our novel method can not only accurately estimate the reliability of annotators across different instances, but also achieve superior performance in predicting the correct labels and detecting the least reliable annotators compared to state-of-the-art baselines. Comment: 9 pages, 1 figure, 10 tables, 2019 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL 2019)

    Comparative judgments are more consistent than binary classification for labelling word complexity

    Lexical simplification systems replace complex words with simple ones based on a model of which words are complex in context. We explore how users can help train complex word identification models through labelling more efficiently and reliably. We show that using an interface where annotators make comparative rather than binary judgments leads to more reliable and consistent labels, and explore whether comparative judgments may provide a faster way for collecting labels. © 2019 Association for Computational Linguistics

    On the study of crowdsourced labelled data and annotators: beyond noisy labels.

    123 p. This thesis includes three contributions to the area known as "learning from crowds", which studies learning methods based on data labelled through crowdsourcing. Such labels carry an associated uncertainty because the reliability of the annotators is not guaranteed. First, a new label aggregation method called "domain aware voting" is proposed, an extension of the popular and simple "majority voting" method that takes the descriptive variable into account and obtains better results, especially when labels are scarcer. The second contribution is a new labelling framework, "candidate labelling", which lets annotators express their doubts about the labels they give by allowing them to assign several labels to each instance. Two label aggregation methods associated with this type of labelling are proposed, and an experimental framework that combines traditional and candidate labelling shows that candidate labelling extracts more information with the same number of annotators. Finally, an annotator model and two learning methods adapted to this new labelling are developed, based on the EM algorithm, which generally obtain better results than the analogous methods in the traditional labelling framework. bcam Excelencia Severo Ocho

    In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora

    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a large need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents a unique new set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are not just suited for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation

    Machine learning from crowds using candidate set-based labelling

    Crowdsourcing is a popular cheap alternative in machine learning for gathering information from a set of annotators. Learning from crowd-labelled data involves dealing with its inherent uncertainty and inconsistencies. In the classical framework, each annotator provides a single label per example, which fails to capture the complete knowledge of annotators. We propose candidate labelling, that is, to allow annotators to provide a set of candidate labels for each example and thus express their doubts. We propose an appropriate model for the annotators, and present two novel learning methods that deal with the two basic steps (label aggregation and model learning) sequentially or jointly. Our empirical study shows the advantage of candidate labelling and the proposed methods with respect to the classical framework
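    To make the idea of candidate labelling concrete, here is a minimal Python sketch of one way to aggregate candidate sets. It is not the annotator model or either learning method from the abstract: it is a plain voting baseline, and the function name `aggregate_candidate_votes` and the even splitting of each annotator's vote are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_candidate_votes(candidate_sets):
    """Pick one label for an example from per-annotator candidate sets.

    Assumption (illustrative only): each annotator has one vote, split
    evenly across the labels in their candidate set.
    """
    scores = defaultdict(float)
    for candidates in candidate_sets:
        if not candidates:
            continue  # an empty set carries no information
        weight = 1.0 / len(candidates)
        for label in candidates:
            scores[label] += weight
    if not scores:
        raise ValueError("no candidate labels provided")
    # Highest score wins; ties are broken by label order for determinism.
    return min(scores, key=lambda label: (-scores[label], label))


# Three hypothetical annotators label the same example.
example_votes = [{"cat", "dog"}, {"cat"}, {"dog", "bird"}]
winner = aggregate_candidate_votes(example_votes)
```

    With a single label per annotator this reduces to ordinary majority voting, so the sketch shows how a candidate set adds partial evidence rather than forcing a single choice.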

    Learning facial-expression models with crowdsourcing

    Computational power is increasing day by day. Despite that, there are some tasks that are still difficult or even impossible for a computer to perform. For example, while identifying a facial expression is easy for a human, for a computer it is an area in development. To tackle this and similar issues, crowdsourcing has grown as a way to use human computation on a large scale. Crowdsourcing is a novel approach to collecting labels quickly and cheaply by sourcing them from the crowd. However, these labels lack reliability since annotators are not guaranteed to have any expertise in the field. This fact has led to a new research area where we must create or adapt annotation models to handle these weakly labeled data. Current techniques explore the annotators' expertise and the task difficulty as variables that influence label correctness. Other specific aspects are also considered by noisy-label analysis techniques. The main contribution of this thesis is the process to collect reliable crowdsourcing labels for a facial expressions dataset. This process consists of two steps: first, we design our crowdsourcing tasks to collect annotators' labels; next, we infer the true label from the collected labels by applying state-of-the-art crowdsourcing algorithms. At the same time, a facial expression dataset is created, containing 40,000 images and their respective labels. Finally, we publish the resulting dataset

    Learning from disagreement: a survey

    Many tasks in Natural Language Processing (NLP) and Computer Vision (CV) offer evidence that humans disagree, from objective tasks such as part-of-speech tagging to more subjective tasks such as classifying an image or deciding whether a proposition follows from certain premises. While most learning in artificial intelligence (AI) still relies on the assumption that a single (gold) interpretation exists for each item, a growing body of research aims to develop learning methods that do not rely on this assumption. In this survey, we review the evidence for disagreements on NLP and CV tasks, focusing on tasks for which substantial datasets containing this information have been created. We discuss the most popular approaches to training models from datasets containing multiple judgments potentially in disagreement. We systematically compare these different approaches by training them with each of the available datasets, considering several ways to evaluate the resulting models. Finally, we discuss the results in depth, focusing on four key research questions, and assess how the type of evaluation and the characteristics of a dataset determine the answers to these questions. Our results suggest, first of all, that even if we abandon the assumption of a gold standard, it is still essential to reach a consensus on how to evaluate models. This is because the relative performance of the various training methods is critically affected by the chosen form of evaluation. Secondly, we observed a strong dataset effect. With substantial datasets, providing many judgments by high-quality coders for each item, training directly with soft labels achieved better results than training from aggregated or even gold labels. This result holds for both hard and soft evaluation. But when the above conditions do not hold, leveraging both gold and soft labels generally achieved the best results in the hard evaluation. All datasets and models employed in this paper are freely available as supplementary materials.
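    The distinction between soft and aggregated (hard) labels in the abstract above can be sketched in a few lines of Python. This is an illustrative assumption about how such targets are commonly built, not the survey's code: the helper names `soft_label`, `hard_label` and `cross_entropy`, and the toy judgments, are hypothetical.

```python
import math
from collections import Counter


def soft_label(judgments, classes):
    """Turn raw annotator judgments into a probability distribution."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    counts = Counter(judgments)
    return [counts[c] / len(judgments) for c in classes]


def hard_label(judgments, classes):
    """Majority-vote aggregation as a one-hot target (ties: first class)."""
    soft = soft_label(judgments, classes)
    best = soft.index(max(soft))
    return [1.0 if i == best else 0.0 for i in range(len(classes))]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and model probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


# Hypothetical item judged by four annotators on an entailment task.
classes = ["entailment", "neutral", "contradiction"]
judgments = ["entailment", "entailment", "neutral", "contradiction"]
soft_target = soft_label(judgments, classes)   # keeps the disagreement
hard_target = hard_label(judgments, classes)   # discards it

model_probs = [0.5, 0.25, 0.25]
soft_loss = cross_entropy(soft_target, model_probs)
hard_loss = cross_entropy(hard_target, model_probs)
```

    Training with the soft target rewards a model for reproducing the annotators' distribution, whereas the hard target rewards only the majority class. This is the difference between "soft" and "hard" treatment that the abstract refers to.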