
    Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents

    Conducting user studies is a crucial component of many scientific fields. While some studies require participants to be physically present, others can be conducted both physically (e.g. in-lab) and online (e.g. via crowdsourcing). Inviting participants to the lab can be a time-consuming and logistically difficult endeavor, and sometimes research groups are simply unable to run in-lab experiments, for example because of a pandemic. Crowdsourcing platforms such as Amazon Mechanical Turk (AMT) or Prolific can therefore be a suitable alternative for certain experiments, such as evaluating virtual agents. Although previous studies have investigated the use of crowdsourcing platforms for running experiments, there is still uncertainty as to whether the results are reliable for perceptual studies. Here we replicate a previous experiment in which participants evaluated a gesture generation model for virtual agents. The experiment is conducted across three participant pools -- in-lab, Prolific, and AMT -- with similar demographics for the in-lab and Prolific participants. Our results show no difference between the three participant pools with regard to their evaluations of the gesture generation models or their reliability scores. The results indicate that online platforms can successfully be used for perceptual evaluations of this kind. Comment: Accepted to IVA 2020. Patrik Jonell and Taras Kucherenko contributed equally to this work.
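    The paper's central claim is that the three participant pools give indistinguishable ratings. One common way to test a claim of that kind is a non-parametric comparison across groups. Below is a minimal sketch in Python, assuming a hypothetical ratings.csv with "pool" and "rating" columns; it illustrates the style of analysis, not the authors' actual evaluation code.

```python
# Sketch: testing whether median ratings differ across participant pools.
# Assumes a CSV with hypothetical columns "pool" (in-lab / Prolific / AMT)
# and "rating" (per-stimulus preference score); not the authors' pipeline.
import pandas as pd
from scipy.stats import kruskal

df = pd.read_csv("ratings.csv")
groups = [g["rating"].values for _, g in df.groupby("pool")]

# Kruskal-Wallis H-test: non-parametric analogue of one-way ANOVA,
# suitable for ordinal preference ratings.
stat, p = kruskal(*groups)
print(f"H = {stat:.2f}, p = {p:.3f}")
# A non-significant p (e.g. p > 0.05) would be consistent with the finding
# that the three pools give indistinguishable evaluations.
```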

    An Open source Implementation of ITU-T Recommendation P.808 with Validation

    The ITU-T Recommendation P.808 provides a crowdsourcing approach for conducting a subjective assessment of speech quality using the Absolute Category Rating (ACR) method. We provide an open-source implementation of the ITU-T Rec. P.808 that runs on the Amazon Mechanical Turk platform. We extended our implementation to include the Degradation Category Rating (DCR) and Comparison Category Rating (CCR) test methods. We also significantly speed up the test process by integrating the participant qualification step into the main rating task, rather than using a two-stage qualification-then-rating procedure. We provide scripts for creating and executing the subjective test, cleansing the data, and analyzing the answers, so as to avoid operational errors. To validate the implementation, we compare the Mean Opinion Scores (MOS) collected through our implementation with MOS values from a standard laboratory experiment conducted based on the ITU-T Rec. P.800. We also evaluate the reproducibility of the subjective speech quality assessment conducted through crowdsourcing with our implementation. Finally, we quantify the impact of the parts of the system designed to improve reliability: environmental tests, gold and trapping questions, rating patterns, and a headset usage test.
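    The validation step described above boils down to computing per-condition MOS from the crowdsourced ACR votes and comparing them against laboratory MOS. The sketch below shows that kind of comparison; the file names and columns ("condition", "rating", "lab_mos") are assumptions for illustration, not the schema of the open-source P.808 toolkit.

```python
# Sketch: per-condition MOS from crowdsourced ACR ratings, compared
# against laboratory MOS in the spirit of a P.808-style validation.
import pandas as pd
from scipy.stats import pearsonr

crowd = pd.read_csv("crowd_ratings.csv")   # one row per vote, 1..5 ACR scale
lab = pd.read_csv("lab_mos.csv")           # per-condition MOS from a P.800 lab test

# Mean Opinion Score: average of all valid votes per condition.
mos = (crowd.groupby("condition")["rating"]
            .mean()
            .rename("crowd_mos")
            .reset_index())
merged = lab.merge(mos, on="condition")

r, p = pearsonr(merged["lab_mos"], merged["crowd_mos"])
print(f"Pearson r between lab and crowd MOS: {r:.3f} (p = {p:.3g})")
```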

    Training and Application of Correct Information Unit Analysis to Structured and Unstructured Discourse

    Correct Information Unit (CIU) analysis is one of the few measures of discourse that attempts to quantify discourse as a function of communicating information efficiently. Though this analysis is used reliably as a research tool, most studies apply CIUs to structured discourse tasks and do not specifically describe how raters are trained. If certified clinical speech-language pathologists (SLPs) can likewise reliably apply CIU analysis within clinical settings to unstructured discourse, such as the discourse of people with aphasia (PWA), it may allow clinicians to quantify the information communicated efficiently in clinical populations with discourse deficits. Purpose: The purpose of this study is to determine whether, using the outlined training module, clinicians are able to score CIUs with inter-rater reliability similar to that of researchers across both structured and unstructured discourse samples. Method: Four certified SLPs will undergo a two-hour training session in CIU analysis similar to a university research staff's CIU training protocol. Each SLP will score CIUs in structured and unstructured language samples collected from individuals diagnosed with aphasia. The SLPs' scores within the structured and unstructured discourse samples will be compared to those of the university research lab staff. This will determine (1) whether SLPs can reliably code CIUs when compared with research raters in a lab setting, when both use the same two-hour CIU training and allotted resources; and (2) whether there is a significant difference in reliability when structured and unstructured discourse is analyzed.
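    Inter-rater reliability of CIU counts between clinician and research-lab raters is typically reported with an intraclass correlation coefficient. The sketch below shows one way to compute it; the long-format columns ("sample", "rater", "ciu_count") and the use of the pingouin library are assumptions for illustration, not the study's actual analysis.

```python
# Sketch: intraclass correlation for CIU counts rated by multiple raters.
import pandas as pd
import pingouin as pg

df = pd.read_csv("ciu_counts.csv")   # one row per (discourse sample, rater)

icc = pg.intraclass_corr(data=df, targets="sample",
                         raters="rater", ratings="ciu_count")
# ICC2 (two-way random effects, absolute agreement) is a common choice
# when raters are treated as a random sample of possible raters.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```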

    Automated Virtual Coach for Surgical Training

    Surgical educators have recommended individualized coaching for the acquisition, retention, and improvement of expertise in technical skills. Such one-on-one coaching is limited to institutions that can afford surgical coaches and is certainly not feasible at national and global scales. We hypothesize that automated methods that model intraoperative video, the surgeon's hand and instrument motion, and sensor data can provide effective and efficient individualized coaching. With the advent of instrumented operating rooms and training laboratories, access to such large-scale intra-operative data has become feasible. Previous methods for automated skill assessment present an overall evaluation at the task/global level to surgeons, without any directed feedback or error analysis. Demonstration, if present at all, takes the form of fixed instructional videos, while deliberate practice is completely absent from automated training platforms. We believe that an effective coach should: demonstrate expert behavior (how do I do it correctly), evaluate trainee performance (how did I do) at the task and segment level, critique errors and deficits (where and why was I wrong), recommend deliberate practice (what do I do to improve), and monitor skill progress (when do I become proficient). In this thesis, we present new methods and solutions towards these coaching interventions in different training settings, viz. virtual reality simulation, bench-top simulation, and the operating room. First, we outline a summarization-based approach for surgical phase modeling using various sources of intra-operative procedural data, such as system events (sensors) as well as crowdsourced surgical activity context. We validate a crowdsourced approach to obtaining context summarizations of intra-operative surgical activity. Second, we develop a new scoring method to evaluate task segments using rankings derived from pairwise comparisons of performances obtained via crowdsourcing. We show that reliable and valid crowdsourced pairwise comparisons can be obtained across multiple training task settings. Additionally, we present preliminary results comparing inter-rater agreement in relative ratings and absolute ratings for crowdsourced assessments of an endoscopic sinus surgery training task data set. Third, we implement a real-time feedback and teaching framework using virtual reality simulation to present teaching cues and deficit metrics targeted at critical learning elements of a task. We compare the effectiveness of this real-time coach to independent self-driven learning on a needle-passing task in a pilot randomized controlled trial. Finally, we present an integration of the above components of task progress detection, segment-level evaluation, and real-time feedback towards the first end-to-end automated virtual coach for surgical training.
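    Turning crowdsourced pairwise comparisons of performances into scores is a standard ranking-from-comparisons problem. The sketch below uses a simple Bradley-Terry model with minorize-maximize updates to illustrate the general idea; it is not the thesis's exact scoring method, and the input format (a list of winner/loser pairs) is a simplifying assumption.

```python
# Sketch: segment scores from pairwise comparisons via a Bradley-Terry model.
from collections import defaultdict

def bradley_terry(comparisons, iters=100):
    """comparisons: list of (winner, loser) judgments. Returns item -> score."""
    items = {x for pair in comparisons for x in pair}
    wins = defaultdict(int)   # total wins per item
    n = defaultdict(int)      # number of comparisons per unordered pair
    for w, l in comparisons:
        wins[w] += 1
        n[frozenset((w, l))] += 1

    p = {i: 1.0 for i in items}
    for _ in range(iters):
        new_p = {}
        for i in items:
            # Minorize-maximize update: wins_i / sum_j n_ij / (p_i + p_j)
            denom = sum(n[frozenset((i, j))] / (p[i] + p[j])
                        for j in items if j != i and n[frozenset((i, j))])
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())
        p = {i: v / total for i, v in new_p.items()}   # normalize each round
    return p

scores = bradley_terry([("A", "B"), ("A", "C"), ("B", "C"), ("A", "B")])
print(sorted(scores.items(), key=lambda kv: -kv[1]))   # higher = better segment
```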

    Human-in-the-Loop Learning From Crowdsourcing and Social Media

    Computational social studies using public social media data have become more and more popular because of the large amount of user-generated data available. The richness of social media data, coupled with its noise and subjectivity, raises significant challenges for computationally studying social issues in a feasible and scalable manner. Machine learning problems are, as a result, often subjective or ambiguous when humans are involved. That is, humans solving the same problems might come to legitimate but completely different conclusions, based on their personal experiences and beliefs. When building supervised learning models, particularly when using crowdsourced training data, multiple annotations per data item are usually reduced to a single label representing ground truth. This inevitably hides a rich source of diversity and subjectivity of opinions about the labels. Label distribution learning associates with each data item a probability distribution over the labels for that item; it can thus preserve the diversity of opinions, beliefs, etc. that conventional learning hides or ignores. We propose a human-in-the-loop learning framework to model and study large volumes of unlabeled subjective social media data with less human effort. We study various annotation tasks given to crowdsourced annotators and methods for aggregating their contributions in a manner that preserves subjectivity and disagreement. We introduce a strategy for learning label distributions with only five to ten labels per item by aggregating human-annotated labels over multiple, semantically related data items. We conduct experiments using our learning framework on data related to two subjective social issues (work and employment, and suicide prevention) that touch many people worldwide. Our methods can be applied to a broad variety of problems, particularly social problems. Our experimental results suggest that specific label aggregation methods can help provide reliable representative semantics at the population level.
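    The core contrast the abstract draws is between majority-vote labels and per-item label distributions. The sketch below shows the simplest form of that aggregation; the label names and annotation layout are hypothetical examples, not the paper's dataset schema.

```python
# Sketch: per-item label distributions from multiple crowdsourced annotations,
# instead of collapsing them to a single majority-vote label.
from collections import Counter

def label_distribution(annotations, labels):
    """Return a probability distribution over `labels` for one item."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {lab: counts.get(lab, 0) / total for lab in labels}

LABELS = ["supportive", "neutral", "dismissive"]
item_annotations = {
    "post_1": ["supportive", "supportive", "neutral", "supportive", "dismissive"],
    "post_2": ["neutral", "dismissive", "neutral", "neutral", "dismissive"],
}

# Majority voting would hide the disagreement that these distributions preserve.
for item, anns in item_annotations.items():
    print(item, label_distribution(anns, LABELS))
```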

    Fully Automatic Analysis of Engagement and Its Relationship to Personality in Human-Robot Interactions

    Engagement is crucial to designing intelligent systems that can adapt to the characteristics of their users. This paper focuses on automatic analysis and classification of engagement based on humans’ and the robot’s personality profiles in a triadic human-human-robot interaction setting. More explicitly, we present a study that involves two participants interacting with a humanoid robot, and investigate how participants’ personalities can be used together with the robot’s personality to predict the engagement state of each participant. The fully automatic system is first trained to predict the Big Five personality traits of each participant by extracting individual and interpersonal features from their nonverbal behavioural cues. Second, the output of the personality prediction system is used as an input to the engagement classification system. Third, we focus on the concept of “group engagement”, which we define as the collective engagement of the participants with the robot, and analyse the impact of similar and dissimilar personalities on the engagement classification. Our experimental results show that (i) using the automatically predicted personality labels for engagement classification yields an F-measure on par with using the manually annotated personality labels, demonstrating the effectiveness of the proposed automatic personality prediction module; (ii) using the individual and interpersonal features without utilising personality information is not sufficient for engagement classification, whereas incorporating the participants’ and robot’s personalities with individual/interpersonal features increases engagement classification performance; and (iii) the best classification performance is achieved when the participants and the robot are extroverted, while the worst results are obtained when all are introverted.
    This work was performed within the Labex SMART project (ANR-11-LABX-65) supported by French state funds managed by the ANR within the Investissements d’Avenir programme under reference ANR-11-IDEX-0004-02. The work of Oya Celiktutan and Hatice Gunes is also funded by the EPSRC under its IDEAS Factory Sandpits call on Digital Personhood (Grant Ref.: EP/L00416X/1). This is the author accepted manuscript. The final version is available from the Institute of Electrical and Electronics Engineers via http://dx.doi.org/10.1109/ACCESS.2016.261452
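    The two-stage design described above, in which predicted personality traits are fed into the engagement classifier alongside the nonverbal features, can be sketched as follows. The synthetic feature arrays and the choice of support vector models are illustrative assumptions, not the authors' exact models or data.

```python
# Sketch: a two-stage pipeline -- Big Five traits predicted from nonverbal
# cues, then appended to those cues for engagement classification.
import numpy as np
from sklearn.svm import SVR, SVC

rng = np.random.default_rng(0)
X_nonverbal = rng.normal(size=(200, 20))       # individual + interpersonal cues
y_personality = rng.normal(size=(200, 5))      # Big Five annotations (training only)
y_engagement = rng.integers(0, 2, size=200)    # engaged / not engaged

# Stage 1: one regressor per Big Five trait.
trait_models = [SVR().fit(X_nonverbal, y_personality[:, t]) for t in range(5)]
predicted_traits = np.column_stack([m.predict(X_nonverbal) for m in trait_models])

# Stage 2: engagement classifier on cues plus predicted traits.
X_engagement = np.hstack([X_nonverbal, predicted_traits])
clf = SVC().fit(X_engagement, y_engagement)
print("train accuracy:", clf.score(X_engagement, y_engagement))
```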

    An Italian lexical resource for incivility detection in online discourses
