227 research outputs found

    Crowd-Certain: Label Aggregation in Crowdsourced and Ensemble Learning Classification

    Crowdsourcing systems have been used to accumulate massive amounts of labeled data for applications such as computer vision and natural language processing. However, because crowdsourced labeling is inherently dynamic and uncertain, developing a technique that can work in most situations is extremely challenging. In this paper, we introduce Crowd-Certain, a novel approach for label aggregation in crowdsourced and ensemble learning classification tasks that offers improved performance and computational efficiency for different numbers of annotators and a variety of datasets. The proposed method uses the consistency of the annotators versus a trained classifier to determine a reliability score for each annotator. Furthermore, Crowd-Certain leverages predicted probabilities, enabling the reuse of trained classifiers on future sample data, thereby eliminating the need for recurrent simulation processes inherent in existing methods. We extensively evaluated our approach against ten existing techniques across ten different datasets, each labeled by varying numbers of annotators. The findings demonstrate that Crowd-Certain outperforms the existing methods (Tao, Sheng, KOS, MACE, MajorityVote, MMSR, Wawa, Zero-Based Skill, GLAD, and Dawid Skene) in nearly all scenarios, delivering higher average accuracy, F1 scores, and AUC rates. Additionally, we introduce a variation of two existing confidence score measurement techniques. Finally, we evaluate these two confidence score techniques using two evaluation metrics: Expected Calibration Error (ECE) and Brier Score Loss. Our results show that Crowd-Certain achieves higher Brier Score and lower ECE across the majority of the examined datasets, suggesting better calibrated results. Comment: 49 pages, 5 figures
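
The two calibration metrics named in the abstract above can be illustrated with a minimal sketch. This is not the paper's code: the function names, the binary setting, the 0.5 decision threshold, and the ten equal-width bins are all assumptions chosen for illustration.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted P(y=1) and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """Binary ECE: size-weighted mean of |accuracy - confidence| per bin.

    Confidence is the probability of the predicted class, and bins are
    equal-width intervals over [0, 1] (an assumption of this sketch).
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        pred = 1 if p >= 0.5 else 0
        conf = p if pred == 1 else 1.0 - p
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    total = len(labels)
    ece = 0.0
    for members in bins:
        if members:
            avg_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(a for _, a in members) / len(members)
            ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

Both functions take the predicted probability of the positive class and the true 0/1 labels, for example `brier_score([0.9, 0.2], [1, 0])`.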

    A robust consistency model of crowd workers in text labeling tasks

    Crowdsourcing is a popular human-based model to acquire labeled data. Despite its ability to generate huge amounts of labeled data at moderate costs, it is susceptible to low-quality labels. This can happen through unintentional or intentional errors by the crowd workers. Consistency is an important attribute of reliability. It is a practical metric that evaluates a crowd worker's reliability based on their ability to conform to themselves by yielding the same output when repeatedly given a particular input. Consistency has not yet been sufficiently explored in the literature. In this work, we propose a novel consistency model based on the pairwise comparisons method. We apply this model on unpaid workers. We measure the workers' consistency on tasks of labeling political text-based claims and study the effects of different duplicate task characteristics on their consistency. Our results show that the proposed model outperforms the current state-of-the-art models in terms of accuracy. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0

    A GENERAL MODEL FOR NOISY LABELS IN MACHINE LEARNING

    Machine learning is an ever-growing and increasingly pervasive presence in everyday life; we entrust these models, and systems built on these models, with some of our most sensitive information and security applications. However, for all of the trust that we place in these models, it is essential to recognize the fact that such models are simply reflections of the data and labels on which they are trained. To wit, if the data and labels are suspect, then so too must be the models that we rely on. Yet, as larger and more comprehensive datasets become standard in contemporary machine learning, it becomes increasingly difficult to obtain reliable, trustworthy label information. While recent work has begun to investigate mitigating the effect of noisy labels, to date this critical field has been disjointed and disconnected, despite the common goal. In this work, we propose a new model of label noise, which we call “labeler-dependent noise (LDN).” LDN extends and generalizes the canonical instance-dependent noise model to multiple labelers, and unifies every preceding modeling strategy under a single umbrella. Furthermore, studying the LDN model leads us to propose a more general, modular framework for noise-robust learning called “labeler-aware learning (LAL).” Our comprehensive suite of experiments demonstrates that, unlike previous methods that are unable to remain robust under the general LDN model, LAL retains its full learning capabilities under extreme, and even adversarial, conditions of label noise. We believe that LDN and LAL should mark a paradigm shift in how we learn from labeled data, so that we may both discover new insights about machine learning, and develop more robust, trustworthy models on which to build our daily lives.

    Combining crowd worker, algorithm, and expert efforts to find boundaries of objects in images

    While traditional approaches to image analysis have typically relied upon either manual annotation by experts or purely algorithmic approaches, the rise of crowdsourcing now provides a new source of human labor to create training data or perform computations at run-time. Given this richer design space, how should we utilize algorithms, crowds, and experts to better annotate images? To answer this question for the important task of finding the boundaries of objects or regions in images, I focus on image segmentation, an important precursor to solving a variety of fundamental image analysis problems, including recognition, classification, tracking, registration, retrieval, and 3D visualization. The first part of the work includes a detailed analysis of the relative strengths and weaknesses of three different approaches to demarcate object boundaries in images: by experts, by crowdsourced laymen, and by automated computer vision algorithms. The second part of the work describes three hybrid system designs that integrate computer vision algorithms and crowdsourced laymen to demarcate boundaries in images. Experiments revealed that hybrid system designs yielded more accurate results than relying on algorithms or crowd workers alone and could yield segmentations that are indistinguishable from those created by biomedical experts. To encourage a community-wide effort to continue developing methods and systems for image-based studies that can have a real and measurable impact benefiting society at large, datasets and code are publicly shared (http://www.cs.bu.edu/~betke/BiomedicalImageSegmentation/).

    CONSENSUS-BASED CROWDSOURCING: TECHNIQUES AND APPLICATIONS

    Crowdsourcing solutions are receiving more and more attention in the recent literature about social computing and distributed problem solving. In general terms, crowdsourcing can be considered a social-computing model aimed at fostering the autonomous formation and emergence of the so-called wisdom of the crowd. Quality assessment is a crucial issue for the effectiveness of crowdsourcing systems, with respect to both task and worker management. Another aspect to consider in crowdsourcing systems is the kind of contributions workers can make. Usually, crowdsourcing approaches rely only on tasks where workers have to decide among a predefined set of possible solutions. On the other hand, tasks leaving the workers a higher level of freedom in producing their answer (e.g., free-hand drawing) are more difficult to manage and verify. In this thesis, we present the LiquidCrowd approach based on consensus and trustworthiness techniques for managing the execution of collaborative tasks. By collaborative task, we refer to a task for which a factual answer is not possible/appropriate, or a task whose result depends on the personal perception/point-of-view of the worker. We introduce the notion of worker trustworthiness to denote the worker “reliability”, namely her/his capability to foster the successful completion of tasks. Furthermore, we improve the conventional score-based mechanism by introducing the notion of award, a bonus provided to those workers who contribute to reaching consensus within groups. This way, groups with certain trustworthiness requirements can be composed on demand to deal with complex tasks, such as tasks where consensus has not been reached during the first execution. In LiquidCrowd, we define a democratic mechanism based on the notion of supermajority to enable the flexible specification of the expected degree of agreement required for obtaining the consensus within a worker group. 
In LiquidCrowd, three task typologies are provided: choice, where the worker is asked to choose the answer among a list of predefined options; range, where the worker is asked to provide a free numeric answer; proposition, where the worker is asked to provide a free-text answer. To evaluate the quality of the results obtained through LiquidCrowd consensus techniques, we test against the SQUARE crowdsourcing benchmark. Furthermore, to evaluate the capability of LiquidCrowd to effectively support a real problem, real case studies about web data classification have been selected.
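
The supermajority idea in the abstract above, a configurable degree of agreement that a worker group must reach, can be sketched for choice tasks as follows. This is only an illustration, not LiquidCrowd's implementation: the function name `supermajority_consensus` and the default threshold of 2/3 are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from typing import Hashable, Optional, Sequence


def supermajority_consensus(
    answers: Sequence[Hashable],
    threshold: Fraction = Fraction(2, 3),  # assumed default, not from the source
) -> Optional[Hashable]:
    """Return the most common answer if its share of votes meets the threshold.

    Returns None when there are no answers or no answer reaches the
    required degree of agreement.
    """
    if not answers:
        return None
    answer, count = Counter(answers).most_common(1)[0]
    # Exact rational comparison avoids float rounding at the boundary (e.g. 2/3).
    if Fraction(count, len(answers)) >= threshold:
        return answer
    return None
```

Raising `threshold` makes consensus stricter. A group that returns `None` could then be handled by a follow-up round, as the abstract suggests for tasks where consensus was not reached during the first execution.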

    Crowdsource Annotation and Automatic Reconstruction of Online Discussion Threads

    Modern communication relies on electronic messages organized in the form of discussion threads. Emails, IMs, SMS, website comments, and forums are all composed of threads, which consist of individual user messages connected by metadata and discourse coherence to messages from other users. Threads are used to display user messages effectively in a GUI such as an email client, providing a background context for understanding a single message. Many messages are meaningless without the context provided by their thread. However, a number of factors may result in missing thread structure, ranging from user mistake (replying to the wrong message), to missing metadata (some email clients do not produce/save headers that fully encapsulate thread structure; and conversion of archived threads from one repository to another may also result in lost metadata), to covert use (users may avoid metadata to render discussions difficult for third parties to understand). In the field of security, law enforcement agencies may obtain vast collections of discussion turns that require automatic thread reconstruction to understand. For example, the Enron Email Corpus, obtained by the Federal Energy Regulatory Commission during its investigation of the Enron Corporation, has no inherent thread structure. In this thesis, we will use natural language processing approaches to reconstruct threads from message content. Reconstruction based on message content sidesteps the problem of missing metadata, permitting post hoc reorganization and discussion understanding. We will investigate corpora of email threads and Wikipedia discussions. However, there is a scarcity of annotated corpora for this task. For example, the Enron Emails Corpus contains no inherent thread structure. Therefore, we also investigate issues faced when creating crowdsourced datasets and learning statistical models of them. 
Several of our findings are applicable for other natural language machine classification tasks, beyond thread reconstruction. We will divide our investigation of discussion thread reconstruction into two parts. First, we explore techniques needed to create a corpus for our thread reconstruction research. Like other NLP pairwise classification tasks such as Wikipedia discussion turn/edit alignment and sentence pair text similarity rating, email thread disentanglement is a heavily class-imbalanced problem, and although the advent of crowdsourcing has reduced annotation costs, the common practice of crowdsourcing redundancy is too expensive for class-imbalanced tasks. As the first contribution of this thesis, we evaluate alternative strategies for reducing crowdsourcing annotation redundancy for class-imbalanced NLP tasks. We also examine techniques to learn the best machine classifier from our crowdsourced labels. In order to reduce noise in training data, most natural language crowdsourcing annotation tasks gather redundant labels and aggregate them into an integrated label, which is provided to the classifier. However, aggregation discards potentially useful information from linguistically ambiguous instances. For the second contribution of this thesis, we show that, for four of five natural language tasks, filtering of the training dataset based on crowdsource annotation item agreement improves task performance, while soft labeling based on crowdsource annotations does not improve task performance. Second, we investigate thread reconstruction as divided into the tasks of thread disentanglement and adjacency recognition. We present the Enron Threads Corpus, a newly-extracted corpus of 70,178 multi-email threads with emails from the Enron Email Corpus. In the original Enron Emails Corpus, emails are not sorted by thread. 
To disentangle these threads, and as the third contribution of this thesis, we perform pairwise classification, using text similarity measures on non-quoted texts in emails. We show that i) content text similarity metrics outperform style and structure text similarity metrics in both a class-balanced and class-imbalanced setting, and ii) although feature performance is dependent on the semantic similarity of the corpus, content features are still effective even when controlling for semantic similarity. To reconstruct threads, it is also necessary to identify adjacency relations among pairs. For the forum of Wikipedia discussions, metadata is not available, and dialogue act typologies, helpful for other domains, are inapplicable. As our fourth contribution, via our experiments, we show that adjacency pair recognition can be performed using lexical pair features, without a dialogue act typology or metadata, and that this is robust to controlling for topic bias of the discussions. Yet, lexical pair features do not effectively model the lexical semantic relations between adjacency pairs. To model lexical semantic relations, and as our fifth contribution, we perform adjacency recognition using extracted keyphrases enhanced with semantically related terms. While this technique outperforms a most frequent class baseline, it fails to outperform lexical pair features or tf-idf weighted cosine similarity. Our investigation shows that this is the result of poor word sense disambiguation and poor keyphrase extraction causing spurious false positive semantic connections. In concluding this thesis, we also reflect on open issues and unanswered questions remaining after our research contributions, discuss applications for thread reconstruction, and suggest some directions for future work.
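
The abstract above contrasts two ways of using redundant crowd labels for training: filtering items by annotator agreement, and soft labeling. A minimal sketch of the two is given below. It is not the thesis code: the function names, the data layout (item id mapped to a list of labels), and the 0.8 agreement cutoff are all assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def majority_and_agreement(labels: List[Hashable]) -> Tuple[Hashable, float]:
    """Return the most common label and the fraction of annotators who chose it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


def filter_by_agreement(
    items: Dict[str, List[Hashable]],
    min_agreement: float = 0.8,  # hypothetical cutoff, not from the source
) -> Dict[str, Hashable]:
    """Keep only items whose majority label reaches the agreement cutoff."""
    kept = {}
    for item_id, labels in items.items():
        label, agreement = majority_and_agreement(labels)
        if agreement >= min_agreement:
            kept[item_id] = label
    return kept


def soft_label(labels: List[Hashable]) -> Dict[Hashable, float]:
    """Turn redundant labels into a probability distribution over classes."""
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}
```

Filtering drops ambiguous items from the training set entirely, whereas soft labeling keeps every item and passes the disagreement to the classifier as a label distribution.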