7,146 research outputs found

Project Retrosight. Understanding the returns from cardiovascular and stroke research: Methodology Report

Author: Buxton MJ
Grant J
Hanney S
Pollitt A
Wooding S
Publication venue: RAND Europe
Publication date: 01/01/2011

Copyright © 2011 RAND Europe. All rights reserved. The full text article is available via the link below. This project explores the impacts arising from cardiovascular and stroke research funded 15-20 years ago and attempts to draw out aspects of the research, researcher or environment that are associated with high or low impact. The project is a case study-based review of 29 cardiovascular and stroke research grants, funded in Australia, Canada and UK between 1989 and 1993. The case studies focused on the individual grants but considered the development of the investigators and ideas involved in the research projects from initiation to the present day. Grants were selected through a stratified random selection approach that aimed to include both high- and low-impact grants. The key messages are as follows: 1) The cases reveal that a large and diverse range of impacts arose from the 29 grants studied. 2) There are variations between the impacts derived from basic biomedical and clinical research. 3) There is no correlation between knowledge production and wider impacts. 4) The majority of economic impacts identified come from a minority of projects. 5) We identified factors that appear to be associated with high and low impact. This report presents the key observations of the study and an overview of the methods involved. It has been written for funders of biomedical and health research and health services, health researchers, and policy makers in those fields. It will also be of interest to those involved in research and impact evaluation. This study was initiated with internal funding from RAND Europe and HERG, with continuing funding from the UK National Institute for Health Research, the Canadian Institutes of Health Research, the Heart and Stroke Foundation of Canada and the National Heart Foundation of Australia. The UK Stroke Association and the British Heart Foundation provided support in kind through access to their archives.

Brunel University Research Archive

Like trainer, like bot? Inheritance of bias in algorithmic content moderation

Author: A Caliskan
A Centivany
AA Anderson
AF Hayes
D Halpern
FL Johnson
I Gagliardone
J Feinberg
J Wolak
JS Mill
K Crawford
L Dahlberg
LA Sutton
NJ Stroud
P Burnap
RS Tokunaga
T Calders
T Gillespie
T Jay
TB Ksiazek
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2017

The internet has become a central medium through which 'networked publics' express their opinions and engage in debate. Offensive comments and personal attacks can inhibit participation in these spaces. Automated content moderation aims to overcome this problem using machine learning classifiers trained on large corpora of texts manually annotated for offence. While such systems could help encourage more civil debate, they must navigate inherently normatively contestable boundaries, and are subject to the idiosyncratic norms of the human raters who provide the training data. An important objective for platforms implementing such measures might be to ensure that they are not unduly biased towards or against particular norms of offence. This paper provides some exploratory methods by which the normative biases of algorithmic content moderation systems can be measured, by way of a case study using an existing dataset of comments labelled for offence. We train classifiers on comments labelled by different demographic subsets (men and women) to understand how differences in conceptions of offence between these groups might affect the performance of the resulting models on various test sets. We conclude by discussing some of the ethical choices facing the implementers of algorithmic moderation systems, given various desired levels of diversity of viewpoints amongst discussion participants. Comment: 12 pages, 3 figures, 9th International Conference on Social Informatics (SocInfo 2017), Oxford, UK, 13-15 September 2017 (forthcoming in Springer Lecture Notes in Computer Science)

arXiv.org e-Print Archive

Crossref

UCL Discovery

Oxford University Research Archive

Investigating the Perceptual Validity of Evaluation Metrics for Automatic Piano Music Transcription

Author: Benetos E
Liu L
Pearce M
Ycart A
Publication venue: 'Ubiquity Press, Ltd.'
Publication date: 01/01/2020

Automatic Music Transcription (AMT) is usually evaluated using low-level criteria, typically by counting the numbers of errors, with equal weighting. Yet, some errors (e.g. out-of-key notes) are more salient than others. In this study, we design an online listening test to gather judgements about AMT quality. These judgements take the form of pairwise comparisons of transcriptions of the same music by pairs of different AMT systems. We investigate how these judgements correlate with benchmark metrics, and find that although they match in many cases, agreement drops when comparing pairs with similar scores, or pairs of poor transcriptions. We show that onset-only notewise F-measure is the benchmark metric that correlates best with human judgement, all the more so with higher onset tolerance thresholds. We define a set of features related to various musical attributes, and use them to design a new metric that correlates significantly better with listeners' quality judgements. We examine which musical aspects were important to raters by conducting an ablation study on the defined metric, highlighting the importance of the rhythmic dimension (tempo, meter). We make the collected data entirely available for further study, in particular to evaluate the perceptual relevance of new AMT metrics
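The onset-only notewise F-measure mentioned in this abstract can be illustrated with a short sketch. This is not the authors' evaluation code: the function name `onset_f_measure`, the `(onset_seconds, midi_pitch)` note representation, the 50 ms default tolerance, and the simple greedy one-to-one matching are all assumptions made for illustration.

```python
from collections import defaultdict


def onset_f_measure(reference, estimated, tolerance=0.05):
    """Onset-only notewise F-measure (illustrative sketch).

    Notes are (onset_seconds, midi_pitch) tuples. An estimated note counts
    as correct if it has the same pitch as an unmatched reference note and
    its onset lies within `tolerance` seconds of that note's onset.
    """
    ref_by_pitch = defaultdict(list)
    est_by_pitch = defaultdict(list)
    for onset, pitch in reference:
        ref_by_pitch[pitch].append(onset)
    for onset, pitch in estimated:
        est_by_pitch[pitch].append(onset)

    matched = 0
    for pitch, est_onsets in est_by_pitch.items():
        ref_onsets = sorted(ref_by_pitch.get(pitch, []))
        est_onsets = sorted(est_onsets)
        i = j = 0
        # Greedy two-pointer matching: each note is used at most once.
        while i < len(ref_onsets) and j < len(est_onsets):
            diff = est_onsets[j] - ref_onsets[i]
            if abs(diff) <= tolerance:
                matched += 1
                i += 1
                j += 1
            elif diff < 0:
                j += 1  # estimated note is too early for this reference note
            else:
                i += 1  # reference note has no estimate close enough

    precision = matched / len(estimated) if estimated else 0.0
    recall = matched / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): one note is 80 ms late, one has the wrong pitch.
reference = [(0.0, 60), (0.5, 62), (1.0, 64)]
estimated = [(0.02, 60), (0.58, 62), (1.0, 65)]
strict = onset_f_measure(reference, estimated, tolerance=0.05)
lenient = onset_f_measure(reference, estimated, tolerance=0.1)
```

In this toy case the late note is only accepted under the larger tolerance, so `lenient` is higher than `strict`; the metric weights every error equally regardless of how audible it is.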

Queen Mary Research Online

Supporting Answerers with Feedback in Social Q&A

Author: Frens John
Hsieh Gary
Walker Erin
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 26/09/2018

Prior research has examined the use of Social Question and Answer (Q&A) websites for answer and help seeking. However, the potential for these websites to support domain learning has not yet been realized. Helping users write effective answers can be beneficial for subject area learning for both answerers and the recipients of answers. In this study, we examine the utility of crowdsourced, criteria-based feedback for answerers on a student-centered Q&A website, Brainly.com. In an experiment with 55 users, we compared perceptions of the current rating system against two feedback designs with explicit criteria (Appropriate, Understandable, and Generalizable). Contrary to our hypotheses, answerers disagreed with and rejected the criteria-based feedback. Although the criteria aligned with answerers' goals, and crowdsourced ratings were found to be objectively accurate, the norms and expectations for answers on Brainly conflicted with our design. We conclude with implications for the design of feedback in social Q&A. Comment: Published in Proceedings of the Fifth Annual ACM Conference on Learning at Scale, Article No. 10, London, United Kingdom. June 26 - 28, 201

arXiv.org e-Print Archive

Crossref

Rater Cognition in L2 Speaking Assessment: A Review of the Literature

Author: Han Qie
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2015

This literature review attempts to survey representative studies within the context of L2 speaking assessment that have contributed to the conceptualization of rater cognition. Two types of studies are looked at: 1) studies that examine how raters differ (and sometimes agree) in their cognitive processes and rating behaviors, in terms of their focus and feature attention, their approaches to scoring, and their treatment of the scoring criteria and non-criteria relevant aspects and features of the speaking performance; 2) studies that explore why raters differ, through the analysis of the interactions between several rater background factors (i.e., rater language background, rater experience and rater training) and their rating behaviors and decision-making processes. The two types of studies have improved our understanding of the nature and the causes of rater variability in their perception and evaluation of L2 speech. However, very few of those studies have drawn on existing theories of human information processing and research on strategy use, which can explain on a cognitive-processing (Purpura, 2014) level what goes on in raters’ minds during assessment. It is argued as a final conclusion that only based on established frameworks of human information processing and research on (meta)cognitive strategy use can rater cognition be explored with more depth and breadth.

Columbia University Academic Commons

Directory of Open Access Journals

Identifying quality improvement intervention publications - A comparison of electronic search strategies

Author: BE Landon
EA Balas
EG Stone
F Davidoff
G Jamtvedt
JA Alexander
JM Glanville
K Dickersin
K Wells
KA Robinson
Lisa V Rubenstein
LM Schouten
LV Rubenstein
M Jenkins
Marjorie Danz
MS Danz
NL Wilczynski
P Glasziou
Paul G Shekelle
PB Batalden
R Anderson
R Sladek
RM Sladek
Robbie Foy
Roberta M Shanman
S Michie
SR Arnold
Su Golder
Susanne Hempel
WM McClellan
X Yao
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The evidence base for quality improvement (QI) interventions is expanding rapidly. The diversity of the initiatives and the inconsistency in labeling these as QI interventions makes it challenging for researchers, policymakers, and QI practitioners to access the literature systematically and to identify relevant publications. Methods: We evaluated search strategies developed for MEDLINE (Ovid) and PubMed based on free text words, Medical Subject Headings (MeSH), QI intervention components, continuous quality improvement (CQI) methods, and combinations of the strategies. Three sets of pertinent QI intervention publications were used for validation. Two independent expert reviewers screened publications for relevance. We compared the yield, recall rate, and precision of the search strategies for the identification of QI publications and for a subset of empirical studies on effects of QI interventions. Results: The search yields ranged from 2,221 to 216,167 publications. Mean recall rates for reference publications ranged from 5% to 53% for strategies with yields of 50,000 publications or fewer. The 'best case' strategy, a simple text word search with high face validity ('quality' AND 'improv*' AND 'intervention*') identified 44%, 24%, and 62% of influential intervention articles selected by Agency for Healthcare Research and Quality (AHRQ) experts, a set of exemplar articles provided by members of the Standards for Quality Improvement Reporting Excellence (SQUIRE) group, and a sample from the Cochrane Effective Practice and Organization of Care Group (EPOC) register of studies, respectively. We applied the search strategy to a PubMed search for articles published in 10 pertinent journals in a three-year period which retrieved 183 publications. Among these, 67% were deemed relevant to QI by at least one of two independent raters. Forty percent were classified as empirical studies reporting on a QI intervention.
Conclusions: The presented search terms and operating characteristics can be used to guide the identification of QI intervention publications. Even with extensive iterative development, we achieved only moderate recall rates of reference publications. Consensus development on QI reporting and initiatives to develop QI-relevant MeSH terms are urgently needed.
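The yield, recall rate, and precision compared in this abstract can be sketched in a few lines. This is an illustrative sketch only, not the study's code: the function names and the toy publication identifiers are hypothetical, and the definitions assumed here are the standard ones (recall as the share of a reference set that a search retrieves, precision as the share of retrieved records judged relevant).

```python
def search_yield(retrieved):
    """Yield: the number of publications a search strategy returns."""
    return len(retrieved)


def recall_rate(retrieved, reference):
    """Share of a reference set of known publications found by the search."""
    if not reference:
        raise ValueError("reference set must not be empty")
    return len(retrieved & reference) / len(reference)


def precision(retrieved, relevant):
    """Share of retrieved publications that reviewers judged relevant."""
    if not retrieved:
        raise ValueError("retrieved set must not be empty")
    return len(retrieved & relevant) / len(retrieved)


# Toy identifiers (hypothetical), standing in for publication IDs.
retrieved = {"p1", "p2", "p3", "p4"}   # what the search returned
reference = {"p1", "p9"}               # known pertinent publications
relevant = {"p1", "p2", "p3"}          # retrieved items rated relevant

print(search_yield(retrieved))            # 4
print(recall_rate(retrieved, reference))  # 0.5
print(precision(retrieved, relevant))     # 0.75
```

A broader search raises yield and usually recall, but tends to lower precision, which is the trade-off the reported ranges reflect.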

Crossref

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

White Rose Research Online

Integration of a web-based rating system with an oral proficiency interview test: argument-based approach to validation

Author: Yang Hye Jin
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2016

This dissertation focuses on the validation of the Oral Proficiency Interview (OPI), a component of the Oral English Certification Test for international teaching assistants. The rating of oral responses was implemented through an innovative computer technology—a web-based rating system called Rater-Platform (R-Plat). The main purpose of the dissertation was to investigate the validity of interpretations and uses of the OPI scores derived from raters’ assessment of examinees’ performance during the web-based rating process. Following the argument-based validation approach (Kane, 2006), an interpretive argument for the OPI was constructed. The interpretive argument specifies a series of inferences, warrants for each inference, as well as underlying assumptions and specific types of backing necessary to support the assumptions. Of seven inferences—domain description, evaluation, generalization, extrapolation, explanation, utilization, and impact—this study focuses on two. Specifically, it aims to obtain validity evidence for three assumptions underlying the evaluation inference and for three assumptions underlying the generalization inference. The research questions addressed: (1) raters’ perceptions towards R-Plat in terms of clarity, effectiveness, satisfaction, and comfort level; (2) quality of raters’ diagnostic descriptor markings; (3) quality of raters’ comments; (4) quality of OPI scores; (5) quality of individual raters’ OPI ratings; (6) prompt difficulty; and (7) raters’ rating practices. A mixed-methods design was employed to collect and analyze qualitative and quantitative data. Qualitative data consisted of: (a) 14 raters’ responses to open-ended questions about their perceptions towards R-Plat, (b) 5 recordings of individual/focus group interviews on eliciting raters’ perceptions, and (c) 1,900 evaluative units extracted from raters’ comments about examinees’ speaking performance. 
Quantitative data included: (a) 14 raters’ responses to six-point scale statements about their perceptions, (b) 2,524 diagnostic descriptor markings of examinees’ speaking ability, (c) OPI scores for 279 examinees, (d) 803 individual raters’ ratings, (e) individual prompt ratings divided by each intended prompt level, given by each rater, and (f) individual raters’ ratings on the given prompts, grouped by test administration. The results showed that the assumptions for the evaluation inference were supported. Raters’ responses to the questionnaire and individual/focus group interviews revealed positive attitudes towards R-Plat. Diagnostic descriptors and raters’ comments, analyzed by chi-square tests, indicated different speaking ability levels. OPI scores were distributed across different proficiency levels throughout different test administrations. For the generalization inference, both positive and negative evidence was obtained. MFRM analyses showed that OPI scores reliably separated examinees into different speaking ability levels. Observed prompt difficulty matched intended prompt levels, although several problematic prompts were identified. Finally, while the raters used rating scales with adequate consistency within the same test administration, they were not consistent in their severity. Overall, the foundational parts for the validity argument were successfully established. The findings of this study allow for moving forward with the investigation of the subsequent inferences in order to construct a complete OPI validity argument. They also suggest important implications for argument-based validation research, for the study of raters and task variability, and for future applications of web-based rating systems for speaking assessment.

Digital Repository @ Iowa State University (ISU)

VidPlat: A Tool for Fast Crowdsourcing of Quality-of-Experience Measurements

Author: Chetty Marshini
Feamster Nick
Jiang Junchen
Li Hanchen
Schmitt Paul
Zhang Xu
Publication date: 11/11/2023

For video or web services, it is crucial to measure user-perceived quality of experience (QoE) at scale under various video quality or page loading delays. However, fast QoE measurements remain challenging as they must elicit subjective assessment from human users. Previous work either (1) automates QoE measurements by letting crowdsourcing raters watch and rate QoE test videos or (2) dynamically prunes redundant QoE tests based on previously collected QoE measurements. Unfortunately, it is hard to combine both ideas because traditional crowdsourcing requires QoE test videos to be pre-determined before a crowdsourcing campaign begins. Thus, if researchers want to dynamically prune redundant test videos based on other test videos' QoE, they are forced to launch multiple crowdsourcing campaigns, causing extra overheads to re-calibrate or train raters every time. This paper presents VidPlat, the first open-source tool for fast and automated QoE measurements, by allowing dynamic pruning of QoE test videos within a single crowdsourcing task. VidPlat creates an indirect shim layer between researchers and the crowdsourcing platforms. It allows researchers to define a logic that dynamically determines which new test videos need more QoE ratings based on the latest QoE measurements, and it then redirects crowdsourcing raters to watch QoE test videos dynamically selected by this logic. Other than having fewer crowdsourcing campaigns, VidPlat also reduces the total number of QoE ratings by dynamically deciding when enough ratings are gathered for each test video. It is an open-source platform that future researchers can reuse and customize. We have used VidPlat in three projects (web loading, on-demand video, and online gaming). We show that VidPlat can reduce crowdsourcing cost by 31.8% - 46.0% and latency by 50.9% - 68.8%

arXiv.org e-Print Archive

Standardisation methods, mark schemes, and their impact on marking reliability

Publication venue: Office of Qualifications and Examinations Regulation
Publication date: 01/01/2014

Digital Education Resource Archive