
    Engineering Crowdsourced Stream Processing Systems

    A crowdsourced stream processing (CSP) system is a system that incorporates crowdsourced tasks in the processing of a data stream. This can be seen as enabling crowdsourcing work to be applied to a sample of large-scale data at high speed, or equivalently, enabling stream processing to employ human intelligence. It also leads to a substantial expansion of the capabilities of data processing systems. Engineering a CSP system requires combining human and machine computation elements. From a general systems theory perspective, this means taking into account inherited as well as emergent properties of both kinds of element. In this paper, we position CSP systems within a broader taxonomy, outline a series of design principles and evaluation metrics, present an extensible framework for their design, and describe several design patterns. We showcase the capabilities of CSP systems through a case study that applies our proposed framework to the design and analysis of a real system (AIDR) that classifies social media messages during time-critical crisis events. Results show that, compared to a pure stream processing system, AIDR achieves higher classification accuracy, while compared to a pure crowdsourcing solution, it makes better use of human workers by requiring much less manual effort.
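    To make the hybrid design concrete, below is a minimal sketch of the routing pattern the abstract describes, in which a stream processor accepts confident machine labels and defers uncertain items to crowd workers. All names and the threshold value are illustrative assumptions, not AIDR's actual API.

    ```python
    # Sketch of a hybrid crowd/machine stream classifier in the spirit of
    # the CSP pattern above. route_message, CONFIDENCE_THRESHOLD, and
    # crowd_queue are hypothetical names, not the paper's API.
    from collections import deque

    CONFIDENCE_THRESHOLD = 0.8  # assumed cutoff for trusting the machine label
    crowd_queue = deque()       # items awaiting human annotation

    def classify(message):
        """Stub machine classifier: returns (label, confidence)."""
        return ("relevant", 0.5)  # a real system would call a trained model

    def route_message(message):
        """Accept the machine label when confident; otherwise enqueue the
        item for crowdsourced labeling and defer the decision."""
        label, confidence = classify(message)
        if confidence >= CONFIDENCE_THRESHOLD:
            return label                 # machine path
        crowd_queue.append(message)      # human path: labeled by the crowd
        return None                      # decision deferred
    ```

    Crowd labels returned asynchronously would typically be fed back as training data, which is one way such systems reduce manual effort over time.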

    Explicit diversification of event aspects for temporal summarization

    During major events, such as emergencies and disasters, a large volume of information is reported on newswire and social media platforms. Temporal summarization (TS) approaches are used to automatically produce concise overviews of such events by extracting text snippets from related articles over time. Current TS approaches rely on a combination of event relevance and textual novelty for snippet selection. However, for events that span multiple days, textual novelty is often a poor criterion for selecting snippets, since many snippets are textually unique but semantically redundant or uninformative. In this article, we propose a framework for diversifying snippets using explicit event aspects, building on recent work in search result diversification. In particular, we first propose two techniques to identify explicit aspects that a user might want to see covered in a summary for different types of events. We then extend a state-of-the-art explicit diversification framework to maximize the coverage of these aspects when selecting summary snippets for unseen events. Through experimentation over the TREC TS 2013, 2014, and 2015 datasets, we show that explicit diversification for temporal summarization significantly outperforms classical novelty-based diversification, as the use of explicit event aspects reduces the number of redundant and off-topic snippets returned, while also improving summary timeliness.
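    As a rough illustration of the aspect-coverage idea, the sketch below performs greedy selection in the style of explicit diversification frameworks such as xQuAD: each round it picks the snippet that best trades off relevance against coverage of aspects not yet covered by earlier picks. The weighting scheme and helper names are assumptions for the sketch, not the article's exact formulation.

    ```python
    # Hypothetical greedy aspect-coverage snippet selection.
    def select_snippets(candidates, aspects, relevance, coverage, k, lam=0.5):
        """Greedily pick up to k snippets, balancing relevance against
        coverage of event aspects not yet covered by earlier picks.

        candidates: list of snippet ids
        aspects:    dict aspect -> prior importance P(a)
        relevance:  dict snippet -> relevance score
        coverage:   dict (snippet, aspect) -> coverage score in [0, 1]
        """
        selected = []
        uncovered = dict(aspects)  # probability each aspect is still uncovered
        for _ in range(min(k, len(candidates))):
            def gain(s):
                div = sum(uncovered[a] * coverage.get((s, a), 0.0)
                          for a in aspects)
                return (1 - lam) * relevance[s] + lam * div
            best = max((s for s in candidates if s not in selected), key=gain)
            selected.append(best)
            for a in aspects:  # discount aspects the new pick already covers
                uncovered[a] *= 1.0 - coverage.get((best, a), 0.0)
        return selected
    ```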

    Hybrid geo-information processing: crowdsourced supervision of geo-spatial machine learning tasks


    Human-in-the-Loop Learning From Crowdsourcing and Social Media

    Computational social studies using public social media data have become increasingly popular because of the large amount of user-generated data available. The richness of social media data, coupled with its noise and subjectivity, raises significant challenges for computationally studying social issues in a feasible and scalable manner. Machine learning problems are, as a result, often subjective or ambiguous when humans are involved: humans solving the same problem might come to legitimate but completely different conclusions, based on their personal experiences and beliefs. When building supervised learning models, particularly with crowdsourced training data, multiple annotations per data item are usually reduced to a single ground-truth label. This inevitably hides a rich source of diversity and subjectivity in opinions about the labels. Label distribution learning instead associates with each data item a probability distribution over the labels for that item, preserving the diversity of opinions and beliefs that conventional learning hides or ignores. We propose a human-in-the-loop learning framework to model and study large volumes of unlabeled, subjective social media data with less human effort. We study various annotation tasks given to crowdsourced annotators and methods for aggregating their contributions in a manner that preserves subjectivity and disagreement. We introduce a strategy for learning label distributions with only five to ten labels per item by aggregating human-annotated labels over multiple, semantically related data items. We conduct experiments using our learning framework on data related to two subjective social issues (work and employment, and suicide prevention) that touch many people worldwide. Our methods can be applied to a broad variety of problems, particularly social ones. Our experimental results suggest that specific label aggregation methods can help provide reliable, representative semantics at the population level.
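    A minimal sketch of the pooling strategy described above, assuming related items are precomputed: the label distribution for an item is estimated from its own handful of labels plus those of semantically related items, with additive smoothing. The grouping and smoothing choices are illustrative assumptions, not the authors' exact method.

    ```python
    from collections import Counter

    def label_distribution(item, labels_by_item, related_items, alpha=1.0):
        """Estimate a probability distribution over labels for `item` by
        pooling its 5-10 crowd labels with those of related items.

        labels_by_item: dict item -> list of crowd labels
        related_items:  dict item -> list of semantically related items
        alpha:          additive smoothing constant (assumed)
        """
        pool = list(labels_by_item.get(item, []))
        for other in related_items.get(item, []):
            pool.extend(labels_by_item.get(other, []))
        counts = Counter(pool)
        total = sum(counts.values()) + alpha * len(counts)
        return {lab: (n + alpha) / total for lab, n in counts.items()}
    ```

    Pooling in this way keeps the per-item annotation budget small while still yielding a distribution rather than a single collapsed label.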

    Real-time Traffic State Assessment using Multi-source Data

    The normal flow of traffic is impeded by abnormal events, and the impacts of such events extend over time and space. In recent years, with the rapid growth of multi-source data, traffic researchers have sought to leverage those data to identify the spatial-temporal dynamics of traffic flow and proactively manage abnormal traffic conditions. However, the characteristics of data collected by different techniques have not been fully understood. To this end, this study presents a series of investigations that provide insight into data from different sources and dynamically detect real-time traffic states using those data. Speed is one of the three fundamental parameters in traffic flow theory that describe traffic flow states. While speed collection techniques have evolved over the past decades, the average-speed calculation method has not been updated. The first section of this study points out that the traditional harmonic mean-based average-speed calculation can produce erroneous results for probe-based data, and proposes a new calculation method based on the fundamental definition of speed. The second section evaluates the spatial-temporal accuracy of a different type of crowdsourced data, crowdsourced user reports, and characterizes Waze user behavior. Based on the evaluation results, a traffic detection system was developed to support the dynamic detection of incidents and traffic queues. A critical problem with current automatic incident detection algorithms (AIDs), which limits their application in practice, is their heavy calibration requirements. The third section addresses this problem by proposing a self-evaluation module that determines the occurrence of traffic incidents and serves as an auto-calibration procedure. Following incident detection, the fourth section proposes a clustering algorithm that detects the spatial-temporal movements of congestion by clustering crowdsourced reports. This study contributes to the understanding of fundamental traffic parameters and expands the knowledge of multi-source data. It has implications for future speed, flow, and density calculations as data collection techniques advance. Additionally, the proposed dynamic algorithms allow the system to run automatically with minimal human intervention, thus promoting the intelligence of traffic operation systems. The algorithms apply not only to incident and queue detection but also to a variety of other detection systems.
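    The speed-averaging point in the first section can be shown with a small worked example: the harmonic mean of per-probe speeds equals the fundamental definition (total distance over total time) only when all probes cover the same distance, and diverges otherwise. The probe records below are made-up numbers for illustration.

    ```python
    # Each probe record: (distance in miles, travel time in hours)
    probes = [(1.0, 1 / 60), (1.0, 1 / 30), (0.5, 1 / 120)]

    speeds = [d / t for d, t in probes]  # per-probe speeds: 60, 30, 60 mph

    # Traditional approach: harmonic mean of probe speeds
    harmonic = len(speeds) / sum(1 / v for v in speeds)

    # Fundamental definition: total distance over total travel time
    fundamental = sum(d for d, _ in probes) / sum(t for _, t in probes)

    print(f"harmonic mean: {harmonic:.1f} mph")     # 45.0 mph
    print(f"fundamental:   {fundamental:.1f} mph")  # 42.9 mph
    ```

    Because the third probe covers only half the distance, the unweighted harmonic mean over-weights it, which is the kind of error the proposed definition-based method avoids.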