1,029 research outputs found

    Explicit diversification of event aspects for temporal summarization

    Get PDF
    During major events, such as emergencies and disasters, a large volume of information is reported on newswire and social media platforms. Temporal summarization (TS) approaches are used to automatically produce concise overviews of such events by extracting text snippets from related articles over time. Current TS approaches rely on a combination of event relevance and textual novelty for snippet selection. However, for events that span multiple days, textual novelty is often a poor criterion for selecting snippets, since many snippets are textually unique but are semantically redundant or non-informative. In this article, we propose a framework for the diversification of snippets using explicit event aspects, building on recent works in search result diversification. In particular, we first propose two techniques to identify explicit aspects that a user might want to see covered in a summary for different types of event. We then extend a state-of-the-art explicit diversification framework to maximize the coverage of these aspects when selecting summary snippets for unseen events. Through experimentation over the TREC TS 2013, 2014, and 2015 datasets, we show that explicit diversification for temporal summarization significantly outperforms classical novelty-based diversification, as the use of explicit event aspects reduces the amount of redundant and off-topic snippets returned, while also increasing summary timeliness

    Empirical Methodology for Crowdsourcing Ground Truth

    Full text link
    The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, there is ambiguity in the data, as well as a multitude of perspectives of the information examples. We present an empirically derived methodology for efficiently gathering of ground truth data in a diverse set of use cases covering a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics that capture inter-annotator disagreement. We show that measuring disagreement is essential for acquiring a high quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical Relation Extraction, Twitter Event Identification, News Event Extraction and Sound Interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.Comment: in publication at the Semantic Web Journa

    A flexible framework for assessing the quality of crowdsourced data

    Get PDF
    Ponencias, comunicaciones y pĂłsters presentados en el 17th AGILE Conference on Geographic Information Science "Connecting a Digital Europe through Location and Place", celebrado en la Universitat Jaume I del 3 al 6 de junio de 2014.Crowdsourcing as a means of data collection has produced previously unavailable data assets and enriched existing ones, but its quality can be highly variable. This presents several challenges to potential end users that are concerned with the validation and quality assurance of the data collected. Being able to quantify the uncertainty, define and measure the different quality elements associated with crowdsourced data, and introduce means for dynamically assessing and improving it is the focus of this paper. We argue that the required quality assurance and quality control is dependent on the studied domain, the style of crowdsourcing and the goals of the study. We describe a framework for qualifying geolocated data collected from non-authoritative sources that enables assessment for specific case studies by creating a workflow supported by an ontological description of a range of choices. The top levels of this ontology describe seven pillars of quality checks and assessments that present a range of techniques to qualify, improve or reject data. Our generic operational framework allows for extension of this ontology to specific applied domains. This will facilitate quality assurance in real-time or for post-processing to validate data and produce quality metadata. It enables a system that dynamically optimises the usability value of the data captured. A case study illustrates this framework

    Human-in-the-Loop Learning From Crowdsourcing and Social Media

    Get PDF
    Computational social studies using public social media data have become more and more popular because of the large amount of user-generated data available. The richness of social media data, coupled with noise and subjectivity, raise significant challenges for computationally studying social issues in a feasible and scalable manner. Machine learning problems are, as a result, often subjective or ambiguous when humans are involved. That is, humans solving the same problems might come to legitimate but completely different conclusions, based on their personal experiences and beliefs. When building supervised learning models, particularly when using crowdsourced training data, multiple annotations per data item are usually reduced to a single label representing ground truth. This inevitably hides a rich source of diversity and subjectivity of opinions about the labels. Label distribution learning associates for each data item a probability distribution over the labels for that item, thus it can preserve diversities of opinions, beliefs, etc. that conventional learning hides or ignores. We propose a humans-in-the-loop learning framework to model and study large volumes of unlabeled subjective social media data with less human effort. We study various annotation tasks given to crowdsourced annotators and methods for aggregating their contributions in a manner that preserves subjectivity and disagreement. We introduce a strategy for learning label distributions with only five-to-ten labels per item by aggregating human-annotated labels over multiple, semantically related data items. We conduct experiments using our learning framework on data related to two subjective social issues (work and employment, and suicide prevention) that touch many people worldwide. Our methods can be applied to a broad variety of problems, particularly social problems. Our experimental results suggest that specific label aggregation methods can help provide reliable representative semantics at the population level

    When in doubt ask the crowd : leveraging collective intelligence for improving event detection and machine learning

    Get PDF
    [no abstract

    Harnessing the power of the general public for crowdsourced business intelligence: a survey

    Get PDF
    International audienceCrowdsourced business intelligence (CrowdBI), which leverages the crowdsourced user-generated data to extract useful knowledge about business and create marketing intelligence to excel in the business environment, has become a surging research topic in recent years. Compared with the traditional business intelligence that is based on the firm-owned data and survey data, CrowdBI faces numerous unique issues, such as customer behavior analysis, brand tracking, and product improvement, demand forecasting and trend analysis, competitive intelligence, business popularity analysis and site recommendation, and urban commercial analysis. This paper first characterizes the concept model and unique features and presents a generic framework for CrowdBI. It also investigates novel application areas as well as the key challenges and techniques of CrowdBI. Furthermore, we make discussions about the future research directions of CrowdBI

    Quality Control in Crowdsourcing: A Survey of Quality Attributes, Assessment Techniques and Assurance Actions

    Get PDF
    Crowdsourcing enables one to leverage on the intelligence and wisdom of potentially large groups of individuals toward solving problems. Common problems approached with crowdsourcing are labeling images, translating or transcribing text, providing opinions or ideas, and similar - all tasks that computers are not good at or where they may even fail altogether. The introduction of humans into computations and/or everyday work, however, also poses critical, novel challenges in terms of quality control, as the crowd is typically composed of people with unknown and very diverse abilities, skills, interests, personal objectives and technological resources. This survey studies quality in the context of crowdsourcing along several dimensions, so as to define and characterize it and to understand the current state of the art. Specifically, this survey derives a quality model for crowdsourcing tasks, identifies the methods and techniques that can be used to assess the attributes of the model, and the actions and strategies that help prevent and mitigate quality problems. An analysis of how these features are supported by the state of the art further identifies open issues and informs an outlook on hot future research directions.Comment: 40 pages main paper, 5 pages appendi

    VGI and crowdsourced data credibility analysis using spam email detection techniques

    Get PDF
    Volunteered geographic information (VGI) can be considered a subset of crowdsourced data (CSD) and its popularity has recently increased in a number of application areas. Disaster management is one of its key application areas in which the benefits of VGI and CSD are potentially very high. However, quality issues such as credibility, reliability and relevance are limiting many of the advantages of utilising CSD. Credibility issues arise as CSD come from a variety of heterogeneous sources including both professionals and untrained citizens. VGI and CSD are also highly unstructured and the quality and metadata are often undocumented. In the 2011 Australian floods, the general public and disaster management administrators used the Ushahidi Crowd-mapping platform to extensively communicate flood-related information including hazards, evacuations, emergency services, road closures and property damage. This study assessed the credibility of the Australian Broadcasting Corporation’s Ushahidi CrowdMap dataset using a Naïve Bayesian network approach based on models commonly used in spam email detection systems. The results of the study reveal that the spam email detection approach is potentially useful for CSD credibility detection with an accuracy of over 90% using a forced classification methodology

    Accurate and budget-efficient text, image, and video analysis systems powered by the crowd

    Full text link
    Crowdsourcing systems empower individuals and companies to outsource labor-intensive tasks that cannot currently be solved by automated methods and are expensive to tackle by domain experts. Crowdsourcing platforms are traditionally used to provide training labels for supervised machine learning algorithms. Crowdsourced tasks are distributed among internet workers who typically have a range of skills and knowledge, differing previous exposure to the task at hand, and biases that may influence their work. This inhomogeneity of the workforce makes the design of accurate and efficient crowdsourcing systems challenging. This dissertation presents solutions to improve existing crowdsourcing systems in terms of accuracy and efficiency. It explores crowdsourcing tasks in two application areas, political discourse and annotation of biomedical and everyday images. The first part of the dissertation investigates how workers' behavioral factors and their unfamiliarity with data can be leveraged by crowdsourcing systems to control quality. Through studies that involve familiar and unfamiliar image content, the thesis demonstrates the benefit of explicitly accounting for a worker's familiarity with the data when designing annotation systems powered by the crowd. The thesis next presents Crowd-O-Meter, a system that automatically predicts the vulnerability of crowd workers to believe \enquote{fake news} in text and video. The second part of the dissertation explores the reversed relationship between machine learning and crowdsourcing by incorporating machine learning techniques for quality control of crowdsourced end products. In particular, it investigates if machine learning can be used to improve the quality of crowdsourced results and also consider budget constraints. The thesis proposes an image analysis system called ICORD that utilizes behavioral cues of the crowd worker, augmented by automated evaluation of image features, to infer the quality of a worker-drawn outline of a cell in a microscope image dynamically. ICORD determines the need to seek additional annotations from other workers in a budget-efficient manner. Next, the thesis proposes a budget-efficient machine learning system that uses fewer workers to analyze easy-to-label data and more workers for data that require extra scrutiny. The system learns a mapping from data features to number of allocated crowd workers for two case studies, sentiment analysis of twitter messages and segmentation of biomedical images. Finally, the thesis uncovers the potential for design of hybrid crowd-algorithm methods by describing an interactive system for cell tracking in time-lapse microscopy videos, based on a prediction model that determines when automated cell tracking algorithms fail and human interaction is needed to ensure accurate tracking
    • …
    corecore