Can we trust online crowdworkers? Comparing online and offline participants in a preference test of virtual agents
Conducting user studies is a crucial component in many scientific fields.
While some studies require participants to be physically present, other studies
can be conducted both physically (e.g. in-lab) and online (e.g. via
crowdsourcing). Inviting participants to the lab can be a time-consuming and
logistically difficult endeavor, not to mention that sometimes research groups
might not be able to run in-lab experiments, because of, for example, a
pandemic. Crowdsourcing platforms such as Amazon Mechanical Turk (AMT) or
Prolific can therefore be a suitable alternative to run certain experiments,
such as evaluating virtual agents. Although previous studies investigated the
use of crowdsourcing platforms for running experiments, there is still
uncertainty as to whether the results are reliable for perceptual studies. Here
we replicate a previous experiment where participants evaluated a gesture
generation model for virtual agents. The experiment is conducted across three
participant pools -- in-lab, Prolific, and AMT -- with similar demographics
between the in-lab and Prolific participants. Our results show no
difference between the three participant pools with regard to their evaluations
of the gesture generation models and their reliability scores. The results
indicate that online platforms can successfully be used for perceptual
evaluations of this kind.
Comment: Accepted to IVA 2020. Patrik Jonell and Taras Kucherenko contributed
equally to this work.
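As a rough sketch of the kind of statistical check such a comparison involves, the snippet below tests whether preference ratings differ across the three pools. The pool names follow the paper, but the ratings, the 1-5 scale, and the choice of a Kruskal-Wallis test are illustrative assumptions rather than the paper's actual analysis.

from scipy.stats import kruskal

# Hypothetical preference ratings (1-5) per participant pool; the real
# study's data and statistical test may differ.
ratings = {
    "in-lab":   [4, 5, 3, 4, 4, 5, 2, 4],
    "Prolific": [4, 4, 3, 5, 4, 4, 3, 4],
    "AMT":      [3, 5, 4, 4, 3, 4, 4, 5],
}

# Kruskal-Wallis: a non-parametric test for differences between groups.
h_stat, p_value = kruskal(*ratings.values())
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")
if p_value >= 0.05:
    print("No evidence of a difference between participant pools.")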
An Open source Implementation of ITU-T Recommendation P.808 with Validation
The ITU-T Recommendation P.808 provides a crowdsourcing approach for
conducting a subjective assessment of speech quality using the Absolute
Category Rating (ACR) method. We provide an open-source implementation of the
ITU-T Rec. P.808 that runs on the Amazon Mechanical Turk platform. We extended
our implementation to include Degradation Category Ratings (DCR) and Comparison
Category Ratings (CCR) test methods. We also significantly speed up the test
process by integrating the participant qualification step into the main rating
task, rather than running qualification and rating as two separate stages. We
provide program scripts for creating and executing the subjective test, and
for cleansing the data and analyzing the answers to avoid operational errors.
To validate
the implementation, we compare the Mean Opinion Scores (MOS) collected through
our implementation with MOS values from a standard laboratory experiment
conducted based on the ITU-T Rec. P.800. We also evaluate the reproducibility
of the result of the subjective speech quality assessment through crowdsourcing
using our implementation. Finally, we quantify the impact of parts of the
system designed to improve the reliability: environmental tests, gold and
trapping questions, rating patterns, and a headset usage test.
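A minimal sketch of the core ACR computation, assuming a simple answer format and an all-or-nothing trapping-question rule; the actual toolkit's scripts, data schema, and cleansing criteria are more elaborate, so treat the field names below as illustrative.

from collections import defaultdict

def mean_opinion_scores(answers, min_trap_accuracy=1.0):
    """answers: dicts with keys 'worker', 'clip', 'rating' (ACR, 1-5),
    and 'trap_correct' (bool for trapping questions, else None)."""
    # Reject workers who fall below the trapping-question threshold.
    trap_hits = defaultdict(list)
    for a in answers:
        if a["trap_correct"] is not None:
            trap_hits[a["worker"]].append(a["trap_correct"])
    rejected = {w for w, hits in trap_hits.items()
                if sum(hits) / len(hits) < min_trap_accuracy}

    # Average the remaining ratings per clip to obtain MOS values.
    per_clip = defaultdict(list)
    for a in answers:
        if a["worker"] not in rejected and a["trap_correct"] is None:
            per_clip[a["clip"]].append(a["rating"])
    return {clip: sum(r) / len(r) for clip, r in per_clip.items()}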
Training and Application of Correct Information Unit Analysis to Structured and Unstructured Discourse
Correct Information Unit (CIU) analysis is one of the few discourse measures that attempts to quantify discourse as a function of how efficiently information is communicated. Though this analysis is used reliably as a research tool, most studies apply CIUs to structured discourse tasks and do not specifically describe how raters are trained. If certified clinical speech-language pathologists (SLPs) can likewise reliably apply CIU analysis within clinical settings to unstructured discourse, such as the discourse of people with aphasia (PWA), it may allow clinicians to quantify the information communicated efficiently in clinical populations with discourse deficits. Purpose: The purpose of this study is to determine whether, using the outlined training module, clinicians are able to score CIUs across both structured and unstructured discourse samples with inter-rater reliability similar to that of researchers. Method: Four certified SLPs will undergo a two-hour training session in CIU analysis similar to a university research staff's CIU training protocol. Each SLP will score CIUs in structured and unstructured language samples collected from individuals diagnosed with aphasia. The SLPs' scores on the structured and unstructured discourse samples will be compared to those of a university research lab staff. This will determine (1) whether SLPs can reliably code CIUs, compared with research raters in a lab setting, when both use the same two-hour CIU training and allotted resources; and (2) whether there is a significant difference in reliability when structured versus unstructured discourse is analyzed.
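For concreteness, the sketch below shows two efficiency measures commonly derived from CIU counts (%CIU and CIUs per minute) together with a simple point-to-point agreement check between two raters. The function names and the word-level boolean representation are illustrative assumptions; identifying the CIUs themselves remains the trained rater's task, and the study may use a different reliability statistic such as ICC.

def percent_ciu(n_cius: int, n_words: int) -> float:
    # Proportion of produced words that are correct information units.
    return 100.0 * n_cius / n_words

def cius_per_minute(n_cius: int, duration_sec: float) -> float:
    # Rate measure: how efficiently information is communicated over time.
    return 60.0 * n_cius / duration_sec

def agreement(rater_a: list[bool], rater_b: list[bool]) -> float:
    # Point-to-point agreement on word-level CIU judgments
    # (True = the word was counted as part of a CIU).
    assert len(rater_a) == len(rater_b)
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)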
Automated Virtual Coach for Surgical Training
Surgical educators have recommended individualized coaching for acquisition, retention and improvement of expertise in technical skills. Such one-on-one coaching is limited to institutions that can afford surgical coaches and is certainly not feasible at national and global scales. We hypothesize that automated methods that model intraoperative video, surgeon's hand and instrument motion, and sensor data can provide effective and efficient individualized coaching. With the advent of instrumented operating rooms and training laboratories, access to such large scale intra-operative data has become feasible. Previous methods for automated skill assessment present an overall evaluation at the task/global level to the surgeons without any directed feedback and error analysis. Demonstration, if at all, is present in the form of fixed instructional videos, while deliberate practice is completely absent from automated training platforms. We believe that an effective coach should: demonstrate expert behavior (how do I do it correctly), evaluate trainee performance (how did I do) at task and segment-level, critique errors and deficits (where and why was I wrong), recommend deliberate practice (what do I do to improve), and monitor skill progress (when do I become proficient).
In this thesis, we present new methods and solutions towards these coaching interventions in different training settings, viz. virtual reality simulation, bench-top simulation and the operating room. First, we outline a summarizations-based approach for surgical phase modeling using various sources of intra-operative procedural data, such as system events (sensors) as well as crowdsourced surgical activity context. We validate a crowdsourced approach to obtain context summarizations of intra-operative surgical activity. Second, we develop a new scoring method to evaluate task segments using rankings derived from pairwise comparisons of performances obtained via crowdsourcing. We show that reliable and valid crowdsourced pairwise comparisons can be obtained across multiple training task settings. Additionally, we present preliminary results comparing inter-rater agreement in relative ratings and absolute ratings for crowdsourced assessments of an endoscopic sinus surgery training task data set. Third, we implement a real-time feedback and teaching framework using virtual reality simulation to present teaching cues and deficit metrics that are targeted at critical learning elements of a task. We compare the effectiveness of this real-time coach to independent self-driven learning on a needle passing task in a pilot randomized controlled trial. Finally, we present an integration of the above components of task progress detection, segment-level evaluation and real-time feedback towards the first end-to-end automated virtual coach for surgical training.
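As an illustration of the second component, the sketch below turns crowdsourced pairwise comparisons into per-performance scores with a Bradley-Terry model fitted by the standard MM updates. The thesis specifies only that rankings are derived from pairwise comparisons, so the particular model and the example win counts here are assumptions.

import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 200) -> np.ndarray:
    # wins[i, j] = number of crowd votes saying performance i beat j.
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(iters):
        for i in range(n):
            total_wins = wins[i].sum()
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            if denom > 0:
                p[i] = total_wins / denom
        p /= p.sum()
    return p  # higher value = stronger performance

# Example: three task segments compared pairwise by crowdworkers.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]])
print(np.argsort(-bradley_terry(wins)))  # ranking, best first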
Human-in-the-Loop Learning From Crowdsourcing and Social Media
Computational social studies using public social media data have become increasingly popular because of the large amount of user-generated data available. The richness of social media data, coupled with noise and subjectivity, raises significant challenges for computationally studying social issues in a feasible and scalable manner. Machine learning problems are, as a result, often subjective or ambiguous when humans are involved. That is, humans solving the same problems might come to legitimate but completely different conclusions, based on their personal experiences and beliefs. When building supervised learning models, particularly when using crowdsourced training data, multiple annotations per data item are usually reduced to a single label representing ground truth. This inevitably hides a rich source of diversity and subjectivity of opinions about the labels.
Label distribution learning associates with each data item a probability distribution over the labels for that item, so it can preserve the diversity of opinions, beliefs, etc. that conventional learning hides or ignores. We propose a human-in-the-loop learning framework to model and study large volumes of unlabeled subjective social media data with less human effort. We study various annotation tasks given to crowdsourced annotators and methods for aggregating their contributions in a manner that preserves subjectivity and disagreement. We introduce a strategy for learning label distributions with only five to ten labels per item by aggregating human-annotated labels over multiple, semantically related data items. We conduct experiments using our learning framework on data related to two subjective social issues (work and employment, and suicide prevention) that touch many people worldwide. Our methods can be applied to a broad variety of problems, particularly social problems. Our experimental results suggest that specific label aggregation methods can help provide reliable representative semantics at the population level.
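A minimal sketch of the aggregation idea, assuming items have already been grouped by semantic relatedness (the grouping criterion is the framework's key ingredient and is simply given as input here):

from collections import Counter

def label_distributions(groups: dict[str, list[str]],
                        item_labels: dict[str, list[str]]) -> dict:
    # groups: group id -> ids of semantically related items.
    # item_labels: item id -> raw crowd labels (e.g. five to ten per item).
    dists = {}
    for gid, items in groups.items():
        pooled = Counter(lab for it in items for lab in item_labels[it])
        total = sum(pooled.values())
        # Normalize pooled counts into a label distribution for the group.
        dists[gid] = {lab: c / total for lab, c in pooled.items()}
    return dists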
A framework for evaluating automatic indexing or classification in the context of retrieval
Tools for automatic subject assignment help deal with scale and sustainability in creating and enriching metadata, establishing more connections across and between resources and enhancing consistency. While some software vendors and experimental researchers claim the tools can replace manual subject indexing, hard scientific evidence of their performance in operating information environments is scarce. A major reason for this is that research is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The paper reviews and discusses issues with existing evaluation approaches, such as problems of aboutness and relevance assessments, implying the need to use more than a single "gold standard" method when evaluating indexing and retrieval, and proposes a comprehensive evaluation framework. The framework is informed by a systematic review of the literature on indexing and classification evaluation approaches: evaluating indexing quality directly, through assessment by an evaluator or through comparison with a gold standard; evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow; and evaluating indexing quality indirectly by analyzing retrieval performance.
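For the gold-standard strand of the framework, the comparison typically reduces to set overlap between automatically assigned and indexer-assigned subject terms. A small sketch, with made-up terms, and with the caveat the paper itself raises: this is only one evaluation method among several.

def indexing_prf(auto: set[str], gold: set[str]) -> tuple[float, float, float]:
    # Precision, recall, and F1 of automatic terms against a gold standard.
    if not auto or not gold:
        return 0.0, 0.0, 0.0
    true_pos = len(auto & gold)
    precision = true_pos / len(auto)
    recall = true_pos / len(gold)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One document: tool-assigned terms vs. a human indexer's terms.
print(indexing_prf({"crowdsourcing", "metadata", "indexing"},
                   {"subject indexing", "metadata", "indexing"}))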
Fully Automatic Analysis of Engagement and Its Relationship to Personality in Human-Robot Interactions
Engagement is crucial to designing intelligent systems that can adapt to the characteristics of their users. This paper focuses on automatic analysis and classification of engagement based on humans’ and robot’s personality profiles in a triadic human-human-robot interaction setting. More explicitly, we present a study that involves two participants interacting with a humanoid robot, and investigate how participants’ personalities can be used together with the robot’s personality to predict the engagement state of each participant. The fully automatic system is firstly trained to predict the Big Five personality traits of each participant by extracting individual and interpersonal features from their nonverbal behavioural cues. Secondly, the output of the personality prediction system is used as an input to the engagement classification system. Thirdly, we focus on the concept of “group engagement”, which we define as the collective engagement of the participants with the robot, and analyse the impact of similar and dissimilar personalities on the engagement classification. Our experimental results show that (i) using the automatically predicted personality labels for engagement classification yields an F-measure on par with using the manually annotated personality labels, demonstrating the effectiveness of the automatic personality prediction module proposed; (ii) using the individual and interpersonal features without utilising personality information is not sufficient for engagement classification, instead incorporating the participants’ and robot’s personalities with individual/interpersonal features increases engagement classification performance; and (iii) the best classification performance is achieved when the participants and the robot are extroverted, while the worst results are obtained when all are introverted.

This work was performed within the Labex SMART project (ANR-11-LABX-65) supported by French state funds managed by the ANR within the Investissements d’Avenir programme under reference ANR-11-IDEX-0004-02. The work of Oya Celiktutan and Hatice Gunes is also funded by the EPSRC under its IDEAS Factory Sandpits call on Digital Personhood (Grant Ref.: EP/L00416X/1).

This is the author accepted manuscript. The final version is available from Institute of Electrical and Electronics Engineers via http://dx.doi.org/10.1109/ACCESS.2016.261452
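The two-stage design can be sketched schematically as below: nonverbal features feed a personality predictor, whose outputs are appended to the features for engagement classification. The random data, feature dimensionality, and random-forest models are placeholders, not the paper's actual feature extractor or classifiers.

import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))            # individual + interpersonal features
y_traits = rng.normal(size=(200, 5))      # Big Five annotations (training only)
y_engaged = rng.integers(0, 2, size=200)  # engagement labels

# Stage 1: predict Big Five personality traits from nonverbal cues.
trait_model = RandomForestRegressor(n_estimators=100).fit(X, y_traits)
predicted_traits = trait_model.predict(X)

# Stage 2: classify engagement from features plus predicted traits.
clf = RandomForestClassifier(n_estimators=100)
clf.fit(np.hstack([X, predicted_traits]), y_engaged)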