7,146 research outputs found
Recommended from our members
Project Retrosight. Understanding the returns from cardiovascular and stroke research: Methodology Report
Copyright @ 2011 RAND Europe. All rights reserved. The full text article is available via the link below.This project explores the impacts arising from cardiovascular and stroke research funded 15-20 years ago and attempts to draw out aspects of the research, researcher or environment that are associated with high or low impact. The project is a case study-based review of 29 cardiovascular and stroke research grants, funded in Australia, Canada and UK between 1989 and 1993. The case studies focused on the individual grants but considered the development of the investigators and ideas involved in the research projects from initiation to the present day. Grants were selected through a stratified random selection approach that aimed to include both high- and low-impact grants. The key messages are as follows: 1) The cases reveal that a large and diverse range of impacts arose from the 29 grants studied. 2) There are variations between the impacts derived from basic biomedical and clinical research. 3) There is no correlation between knowledge production and wider impacts 4) The majority of economic impacts identified come from a minority of projects. 5) We identified factors that appear to be associated with high and low impact. This report presents the key observations of the study and an overview of the methods involved. It has been written for funders of biomedical and health research and health services, health researchers, and policy makers in those fields. It will also be of interest to those involved in research and impact evaluation.This study was initiated with internal funding from RAND Europe and HERG, with continuing funding from the UK National Institute for Health Research, the Canadian Institutes of Health Research, the Heart and Stroke Foundation of Canada and the National Heart Foundation of Australia. The UK Stroke Association and the British Heart Foundation provided support in kind through access to their archives
Like trainer, like bot? Inheritance of bias in algorithmic content moderation
The internet has become a central medium through which `networked publics'
express their opinions and engage in debate. Offensive comments and personal
attacks can inhibit participation in these spaces. Automated content moderation
aims to overcome this problem using machine learning classifiers trained on
large corpora of texts manually annotated for offence. While such systems could
help encourage more civil debate, they must navigate inherently normatively
contestable boundaries, and are subject to the idiosyncratic norms of the human
raters who provide the training data. An important objective for platforms
implementing such measures might be to ensure that they are not unduly biased
towards or against particular norms of offence. This paper provides some
exploratory methods by which the normative biases of algorithmic content
moderation systems can be measured, by way of a case study using an existing
dataset of comments labelled for offence. We train classifiers on comments
labelled by different demographic subsets (men and women) to understand how
differences in conceptions of offence between these groups might affect the
performance of the resulting models on various test sets. We conclude by
discussing some of the ethical choices facing the implementers of algorithmic
moderation systems, given various desired levels of diversity of viewpoints
amongst discussion participants.Comment: 12 pages, 3 figures, 9th International Conference on Social
Informatics (SocInfo 2017), Oxford, UK, 13--15 September 2017 (forthcoming in
Springer Lecture Notes in Computer Science
Investigating the Perceptual Validity of Evaluation Metrics for Automatic Piano Music Transcription
Automatic Music Transcription (AMT) is usually evaluated using low-level criteria, typically by counting the numbers of errors, with equal weighting. Yet, some errors (e.g. out-of-key notes) are more salient than others. In this study, we design an online listening test to gather judgements about AMT quality. These judgements take the form of pairwise comparisons of transcriptions of the same music by pairs of different AMT systems. We investigate how these judgements correlate with benchmark metrics, and find that although they match in many cases, agreement drops when comparing pairs with similar scores, or pairs of poor transcriptions. We show that onset-only notewise F-measure is the benchmark metric that correlates best with human judgement, all the more so with higher onset tolerance thresholds. We define a set of features related to various musical attributes, and use them to design a new metric that correlates significantly better with listeners' quality judgements. We examine which musical aspects were important to raters by conducting an ablation study on the defined metric, highlighting the importance of the rhythmic dimension (tempo, meter). We make the collected data entirely available for further study, in particular to evaluate the perceptual relevance of new AMT metrics
Supporting Answerers with Feedback in Social Q&A
Prior research has examined the use of Social Question and Answer (Q&A)
websites for answer and help seeking. However, the potential for these websites
to support domain learning has not yet been realized. Helping users write
effective answers can be beneficial for subject area learning for both
answerers and the recipients of answers. In this study, we examine the utility
of crowdsourced, criteria-based feedback for answerers on a student-centered
Q&A website, Brainly.com. In an experiment with 55 users, we compared
perceptions of the current rating system against two feedback designs with
explicit criteria (Appropriate, Understandable, and Generalizable). Contrary to
our hypotheses, answerers disagreed with and rejected the criteria-based
feedback. Although the criteria aligned with answerers' goals, and crowdsourced
ratings were found to be objectively accurate, the norms and expectations for
answers on Brainly conflicted with our design. We conclude with implications
for the design of feedback in social Q&A.Comment: Published in Proceedings of the Fifth Annual ACM Conference on
Learning at Scale, Article No. 10, London, United Kingdom. June 26 - 28, 201
Recommended from our members
Rater Cognition in L2 Speaking Assessment: A Review of the Literature
This literature review attempts to survey representative studies within the context of L2 speaking assessment that have contributed to the conceptualization of rater cognition. Two types of studies are looked at: 1) studies that examine how raters differ (and sometimes agree) in their cognitive processes and rating behaviors, in terms of their focus and feature attention, their approaches to scoring, and their treatment of the scoring criteria and non-criteria relevant aspects and features of the speaking performance; 2) studies that explore why raters differ, through the analysis of the interactions between several rater background factors (i.e., rater language background, rater experience and rater training) and their rating behaviors and decision-making processes. The two types of studies have improved our understanding of the nature and the causes of rater variability in their perception and evaluation of L2 speech. However, very few of those studies has drawn on existing theories of human information processing and research on strategy use, which can explain on a cognitive-processing (Purpura, 2014) level what goes on in raters’ mind during assessment. It is argued as a final conclusion that only based on established frameworks of human information processing and research on (meta)cognitive strategy use can rater cognition be explored with more depth and breadth
Identifying quality improvement intervention publications - A comparison of electronic search strategies
Abstract Background The evidence base for quality improvement (QI) interventions is expanding rapidly. The diversity of the initiatives and the inconsistency in labeling these as QI interventions makes it challenging for researchers, policymakers, and QI practitioners to access the literature systematically and to identify relevant publications. Methods We evaluated search strategies developed for MEDLINE (Ovid) and PubMed based on free text words, Medical subject headings (MeSH), QI intervention components, continuous quality improvement (CQI) methods, and combinations of the strategies. Three sets of pertinent QI intervention publications were used for validation. Two independent expert reviewers screened publications for relevance. We compared the yield, recall rate, and precision of the search strategies for the identification of QI publications and for a subset of empirical studies on effects of QI interventions. Results The search yields ranged from 2,221 to 216,167 publications. Mean recall rates for reference publications ranged from 5% to 53% for strategies with yields of 50,000 publications or fewer. The 'best case' strategy, a simple text word search with high face validity ('quality' AND 'improv*' AND 'intervention*') identified 44%, 24%, and 62% of influential intervention articles selected by Agency for Healthcare Research and Quality (AHRQ) experts, a set of exemplar articles provided by members of the Standards for Quality Improvement Reporting Excellence (SQUIRE) group, and a sample from the Cochrane Effective Practice and Organization of Care Group (EPOC) register of studies, respectively. We applied the search strategy to a PubMed search for articles published in 10 pertinent journals in a three-year period which retrieved 183 publications. Among these, 67% were deemed relevant to QI by at least one of two independent raters. Forty percent were classified as empirical studies reporting on a QI intervention. Conclusions The presented search terms and operating characteristics can be used to guide the identification of QI intervention publications. Even with extensive iterative development, we achieved only moderate recall rates of reference publications. Consensus development on QI reporting and initiatives to develop QI-relevant MeSH terms are urgently needed
Integration of a web-based rating system with an oral proficiency interview test: argument-based approach to validation
This dissertation focuses on the validation of the Oral Proficiency Interview (OPI), a component of the Oral English Certification Test for international teaching assistants. The rating of oral responses was implemented through an innovative computer technology—a web-based rating system called Rater-Platform (R-Plat). The main purpose of the dissertation was to investigate the validity of interpretations and uses of the OPI scores derived from raters’ assessment of examinees’ performance during the web-based rating process. Following the argument-based validation approach (Kane, 2006), an interpretive argument for the OPI was constructed. The interpretive argument specifies a series of inferences, warrants for each inference, as well as underlying assumptions and specific types of backing necessary to support the assumptions. Of seven inferences—domain description, evaluation, generalization, extrapolation, explanation, utilization, and impact—this study focuses on two. Specifically, it aims to obtain validity evidence for three assumptions underlying the evaluation inference and for three assumptions underlying the generalization inference. The research questions addressed: (1) raters’ perceptions towards R-Plat in terms of clarity, effectiveness, satisfaction, and comfort level; (2) quality of raters’ diagnostic descriptor markings; (3) quality of raters’ comments; (4) quality of OPI scores; (5) quality of individual raters’ OPI ratings; (6) prompt difficulty; and (7) raters’ rating practices.
A mixed-methods design was employed to collect and analyze qualitative and quantitative data. Qualitative data consisted of: (a) 14 raters’ responses to open-ended questions about their perceptions towards R-Plat, (b) 5 recordings of individual/focus group interviews on eliciting raters’ perceptions, and (c) 1,900 evaluative units extracted from raters’ comments about examinees’ speaking performance. Quantitative data included: (a) 14 raters’ responses to six-point scale statements about their perceptions, (b) 2,524 diagnostic descriptor markings of examinees’ speaking ability, (c) OPI scores for 279 examinees, (d) 803 individual raters’ ratings, (e) individual prompt ratings divided by each intended prompt level, given by each rater, and (f) individual raters’ ratings on the given prompts, grouped by test administration.
The results showed that the assumptions for the evaluation inference were supported. Raters’ responses to questionnaire and individual/focus group interviews revealed positive attitudes towards R-Plat. Diagnostic descriptors and raters’ comments, analyzed by chi-square tests, indicated different speaking ability levels. OPI scores were distributed across different proficiency levels throughout different test administrations. For the generalization inference, both positive and negative evidence was obtained. MFRM analyses showed that OPI scores reliably separated examinees into different speaking ability levels. Observed prompt difficulty matched intended prompt levels, although several problematic prompts were identified. Finally, while the raters used rating scales consistently adequately within the same test administration, they were not consistent in their severity. Overall, the foundational parts for the validity argument were successfully established.
The findings of this study allow for moving forward with the investigation of the subsequent inferences in order to construct a complete OPI validity argument. They also suggest important implications for argument-based validation research, for the study of raters and task variability, and for future applications of web-based rating systems for speaking assessment
VidPlat: A Tool for Fast Crowdsourcing of Quality-of-Experience Measurements
For video or web services, it is crucial to measure user-perceived quality of
experience (QoE) at scale under various video quality or page loading delays.
However, fast QoE measurements remain challenging as they must elicit
subjective assessment from human users. Previous work either (1) automates QoE
measurements by letting crowdsourcing raters watch and rate QoE test videos or
(2) dynamically prunes redundant QoE tests based on previously collected QoE
measurements. Unfortunately, it is hard to combine both ideas because
traditional crowdsourcing requires QoE test videos to be pre-determined before
a crowdsourcing campaign begins. Thus, if researchers want to dynamically prune
redundant test videos based on other test videos' QoE, they are forced to
launch multiple crowdsourcing campaigns, causing extra overheads to
re-calibrate or train raters every time.
This paper presents VidPlat, the first open-source tool for fast and
automated QoE measurements, by allowing dynamic pruning of QoE test videos
within a single crowdsourcing task. VidPlat creates an indirect shim layer
between researchers and the crowdsourcing platforms. It allows researchers to
define a logic that dynamically determines which new test videos need more QoE
ratings based on the latest QoE measurements, and it then redirects
crowdsourcing raters to watch QoE test videos dynamically selected by this
logic. Other than having fewer crowdsourcing campaigns, VidPlat also reduces
the total number of QoE ratings by dynamically deciding when enough ratings are
gathered for each test video. It is an open-source platform that future
researchers can reuse and customize. We have used VidPlat in three projects
(web loading, on-demand video, and online gaming). We show that VidPlat can
reduce crowdsourcing cost by 31.8% - 46.0% and latency by 50.9% - 68.8%
- …