7,146 research outputs found

    Like trainer, like bot? Inheritance of bias in algorithmic content moderation

    The internet has become a central medium through which 'networked publics' express their opinions and engage in debate. Offensive comments and personal attacks can inhibit participation in these spaces. Automated content moderation aims to overcome this problem using machine learning classifiers trained on large corpora of texts manually annotated for offence. While such systems could help encourage more civil debate, they must navigate inherently normatively contestable boundaries, and are subject to the idiosyncratic norms of the human raters who provide the training data. An important objective for platforms implementing such measures might be to ensure that they are not unduly biased towards or against particular norms of offence. This paper provides some exploratory methods by which the normative biases of algorithmic content moderation systems can be measured, by way of a case study using an existing dataset of comments labelled for offence. We train classifiers on comments labelled by different demographic subsets (men and women) to understand how differences in conceptions of offence between these groups might affect the performance of the resulting models on various test sets. We conclude by discussing some of the ethical choices facing the implementers of algorithmic moderation systems, given various desired levels of diversity of viewpoints amongst discussion participants. Comment: 12 pages, 3 figures, 9th International Conference on Social Informatics (SocInfo 2017), Oxford, UK, 13-15 September 2017 (forthcoming in Springer Lecture Notes in Computer Science).
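
    A minimal sketch of the kind of cross-group comparison this abstract describes, not the authors' code: it builds a toy comment corpus, trains one offence classifier per annotator group, and scores each model on both groups' held-out test sets. Column names and data are illustrative assumptions.
```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy stand-in for a corpus of comments, each annotated for offence (1 = offensive)
# by raters of a known gender. Real data would have thousands of distinct rows.
df = pd.DataFrame({
    "text": ["you are an idiot", "thanks for sharing", "what a stupid take",
             "interesting point", "get lost, troll", "I respectfully disagree"] * 10,
    "label": [1, 0, 1, 0, 1, 0] * 10,
    "rater_gender": (["men"] * 6 + ["women"] * 6) * 5,
})

def group_split(frame, group):
    """Hold out a stratified test set for one annotator group."""
    sub = frame[frame["rater_gender"] == group]
    return train_test_split(sub["text"], sub["label"], test_size=0.3,
                            stratify=sub["label"], random_state=0)

splits = {g: group_split(df, g) for g in ("men", "women")}

# Train one classifier per annotator group, then score every model on every
# group's test set; diverging scores hint at group-specific norms of offence.
for train_group, (X_tr, _, y_tr, _) in splits.items():
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(X_tr, y_tr)
    for test_group, (_, X_te, _, y_te) in splits.items():
        print(f"trained on {train_group}, tested on {test_group}: "
              f"F1 = {f1_score(y_te, clf.predict(X_te)):.2f}")
```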

    Investigating the Perceptual Validity of Evaluation Metrics for Automatic Piano Music Transcription

    Automatic Music Transcription (AMT) is usually evaluated using low-level criteria, typically by counting the numbers of errors, with equal weighting. Yet, some errors (e.g. out-of-key notes) are more salient than others. In this study, we design an online listening test to gather judgements about AMT quality. These judgements take the form of pairwise comparisons of transcriptions of the same music by pairs of different AMT systems. We investigate how these judgements correlate with benchmark metrics, and find that although they match in many cases, agreement drops when comparing pairs with similar scores, or pairs of poor transcriptions. We show that onset-only notewise F-measure is the benchmark metric that correlates best with human judgement, all the more so with higher onset tolerance thresholds. We define a set of features related to various musical attributes, and use them to design a new metric that correlates significantly better with listeners' quality judgements. We examine which musical aspects were important to raters by conducting an ablation study on the defined metric, highlighting the importance of the rhythmic dimension (tempo, meter). We make all of the collected data available for further study, in particular to evaluate the perceptual relevance of new AMT metrics.
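
    As a hedged illustration of the benchmark metric discussed above, the sketch below computes onset-only notewise F-measure with mir_eval at several onset tolerance thresholds. The note arrays are toy placeholders, not data from this study.
```python
import numpy as np
import mir_eval

# Reference and estimated notes: [onset, offset] intervals in seconds plus pitches in Hz.
ref_intervals = np.array([[0.00, 0.50], [0.50, 1.00], [1.00, 1.50]])
ref_pitches = np.array([440.0, 494.0, 523.3])
est_intervals = np.array([[0.02, 0.48], [0.56, 1.02], [1.00, 1.40]])
est_pitches = np.array([440.0, 494.0, 523.3])

for tol in (0.05, 0.10, 0.15):  # onset tolerance in seconds
    p, r, f, _ = mir_eval.transcription.precision_recall_f1_overlap(
        ref_intervals, ref_pitches, est_intervals, est_pitches,
        onset_tolerance=tol, offset_ratio=None)  # offset_ratio=None => onset-only matching
    print(f"onset tolerance {tol:.2f}s: F-measure {f:.2f}")
# With a 0.05 s tolerance the second note (onset error 0.06 s) is counted as a miss;
# looser thresholds accept it, which is why the threshold matters for correlation.
```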

    Supporting Answerers with Feedback in Social Q&A

    Prior research has examined the use of Social Question and Answer (Q&A) websites for answer and help seeking. However, the potential for these websites to support domain learning has not yet been realized. Helping users write effective answers can be beneficial for subject area learning, for both answerers and the recipients of answers. In this study, we examine the utility of crowdsourced, criteria-based feedback for answerers on a student-centered Q&A website, Brainly.com. In an experiment with 55 users, we compared perceptions of the current rating system against two feedback designs with explicit criteria (Appropriate, Understandable, and Generalizable). Contrary to our hypotheses, answerers disagreed with and rejected the criteria-based feedback. Although the criteria aligned with answerers' goals, and crowdsourced ratings were found to be objectively accurate, the norms and expectations for answers on Brainly conflicted with our design. We conclude with implications for the design of feedback in social Q&A. Comment: Published in Proceedings of the Fifth Annual ACM Conference on Learning at Scale, Article No. 10, London, United Kingdom, June 26-28, 2018.

    Identifying quality improvement intervention publications - A comparison of electronic search strategies

    Background: The evidence base for quality improvement (QI) interventions is expanding rapidly. The diversity of the initiatives and the inconsistency in labeling them as QI interventions make it challenging for researchers, policymakers, and QI practitioners to access the literature systematically and to identify relevant publications. Methods: We evaluated search strategies developed for MEDLINE (Ovid) and PubMed based on free-text words, Medical Subject Headings (MeSH), QI intervention components, continuous quality improvement (CQI) methods, and combinations of these strategies. Three sets of pertinent QI intervention publications were used for validation. Two independent expert reviewers screened publications for relevance. We compared the yield, recall rate, and precision of the search strategies for the identification of QI publications and for a subset of empirical studies on the effects of QI interventions. Results: The search yields ranged from 2,221 to 216,167 publications. Mean recall rates for reference publications ranged from 5% to 53% for strategies with yields of 50,000 publications or fewer. The 'best case' strategy, a simple text word search with high face validity ('quality' AND 'improv*' AND 'intervention*'), identified 44%, 24%, and 62% of influential intervention articles selected by Agency for Healthcare Research and Quality (AHRQ) experts, a set of exemplar articles provided by members of the Standards for Quality Improvement Reporting Excellence (SQUIRE) group, and a sample from the Cochrane Effective Practice and Organization of Care Group (EPOC) register of studies, respectively. We applied this search strategy to a PubMed search for articles published in 10 pertinent journals over a three-year period, which retrieved 183 publications. Among these, 67% were deemed relevant to QI by at least one of two independent raters, and 40% were classified as empirical studies reporting on a QI intervention. Conclusions: The presented search terms and operating characteristics can be used to guide the identification of QI intervention publications. Even with extensive iterative development, we achieved only moderate recall rates for the reference publications. Consensus development on QI reporting, and initiatives to develop QI-relevant MeSH terms, are urgently needed.
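
    A small illustrative sketch, not the study's code, of how a strategy's operating characteristics can be computed once its retrieved records and a reference set of relevant QI publications are known. The PMIDs are placeholders, and the PubMed field tags in the query string are an assumption about how the 'best case' strategy might be expressed.
```python
# The paper's 'best case' text word strategy, written as a PubMed-style query
# (the [tiab] title/abstract tags are an illustrative assumption).
BEST_CASE_QUERY = "quality[tiab] AND improv*[tiab] AND intervention*[tiab]"

def recall_rate(retrieved_ids, reference_ids):
    """Share of a reference set of known-relevant publications captured by the search."""
    return len(retrieved_ids & reference_ids) / len(reference_ids)

def precision(n_relevant_in_sample, n_screened):
    """Share of screened retrieved records judged relevant by reviewers."""
    return n_relevant_in_sample / n_screened

# Toy numbers, echoing how the paper reports results.
retrieved = {"101", "102", "103", "104", "105"}   # PMIDs returned by the strategy
reference = {"101", "103", "104", "999"}          # PMIDs in a reference set
print(f"yield: {len(retrieved)} records")
print(f"recall: {recall_rate(retrieved, reference):.0%}")   # 75% of the reference set found
print(f"precision: {precision(122, 183):.0%}")              # ~67%, as in the 10-journal sample
```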

    Integration of a web-based rating system with an oral proficiency interview test: argument-based approach to validation

    This dissertation focuses on the validation of the Oral Proficiency Interview (OPI), a component of the Oral English Certification Test for international teaching assistants. The rating of oral responses was implemented through an innovative computer technology: a web-based rating system called Rater-Platform (R-Plat). The main purpose of the dissertation was to investigate the validity of interpretations and uses of the OPI scores derived from raters' assessment of examinees' performance during the web-based rating process. Following the argument-based validation approach (Kane, 2006), an interpretive argument for the OPI was constructed. The interpretive argument specifies a series of inferences, warrants for each inference, as well as underlying assumptions and specific types of backing necessary to support the assumptions. Of the seven inferences (domain description, evaluation, generalization, extrapolation, explanation, utilization, and impact), this study focuses on two. Specifically, it aims to obtain validity evidence for three assumptions underlying the evaluation inference and for three assumptions underlying the generalization inference. The research questions addressed: (1) raters' perceptions towards R-Plat in terms of clarity, effectiveness, satisfaction, and comfort level; (2) quality of raters' diagnostic descriptor markings; (3) quality of raters' comments; (4) quality of OPI scores; (5) quality of individual raters' OPI ratings; (6) prompt difficulty; and (7) raters' rating practices. A mixed-methods design was employed to collect and analyze qualitative and quantitative data. Qualitative data consisted of: (a) 14 raters' responses to open-ended questions about their perceptions towards R-Plat, (b) 5 recordings of individual/focus group interviews eliciting raters' perceptions, and (c) 1,900 evaluative units extracted from raters' comments about examinees' speaking performance. Quantitative data included: (a) 14 raters' responses to six-point scale statements about their perceptions, (b) 2,524 diagnostic descriptor markings of examinees' speaking ability, (c) OPI scores for 279 examinees, (d) 803 individual raters' ratings, (e) individual prompt ratings divided by each intended prompt level, given by each rater, and (f) individual raters' ratings on the given prompts, grouped by test administration. The results showed that the assumptions for the evaluation inference were supported. Raters' responses to the questionnaire and individual/focus group interviews revealed positive attitudes towards R-Plat. Diagnostic descriptors and raters' comments, analyzed by chi-square tests, indicated different speaking ability levels. OPI scores were distributed across different proficiency levels throughout different test administrations. For the generalization inference, both positive and negative evidence was obtained. MFRM analyses showed that OPI scores reliably separated examinees into different speaking ability levels. Observed prompt difficulty matched intended prompt levels, although several problematic prompts were identified. Finally, while the raters applied the rating scales adequately and consistently within the same test administration, they were not consistent in their severity. Overall, the foundational parts of the validity argument were successfully established. The findings of this study allow for moving forward with the investigation of the subsequent inferences in order to construct a complete OPI validity argument. They also suggest important implications for argument-based validation research, for the study of rater and task variability, and for future applications of web-based rating systems for speaking assessment.
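
    A minimal sketch, not the dissertation's analysis, of the kind of chi-square test mentioned above: whether raters' diagnostic descriptor markings in R-Plat differ across OPI score levels. The contingency table is invented for illustration.
```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are OPI score levels, columns are counts
# of positive vs. negative diagnostic descriptors marked by raters.
observed = np.array([
    [120,  60],   # level 1: positive vs. negative descriptor markings
    [200, 110],   # level 2
    [310,  90],   # level 3
])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4f}")
# A small p-value would indicate that descriptor profiles distinguish speaking
# ability levels, the pattern the study reports for the evaluation inference.
```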

    VidPlat: A Tool for Fast Crowdsourcing of Quality-of-Experience Measurements

    For video or web services, it is crucial to measure user-perceived quality of experience (QoE) at scale under various video quality or page loading delays. However, fast QoE measurements remain challenging because they must elicit subjective assessments from human users. Previous work either (1) automates QoE measurements by letting crowdsourced raters watch and rate QoE test videos or (2) dynamically prunes redundant QoE tests based on previously collected QoE measurements. Unfortunately, it is hard to combine both ideas because traditional crowdsourcing requires QoE test videos to be pre-determined before a crowdsourcing campaign begins. Thus, if researchers want to dynamically prune redundant test videos based on other test videos' QoE, they are forced to launch multiple crowdsourcing campaigns, incurring extra overhead to re-calibrate or train raters every time. This paper presents VidPlat, the first open-source tool for fast and automated QoE measurements, which allows dynamic pruning of QoE test videos within a single crowdsourcing task. VidPlat creates an indirect shim layer between researchers and the crowdsourcing platforms. It allows researchers to define a logic that dynamically determines which new test videos need more QoE ratings based on the latest QoE measurements, and it then redirects crowdsourced raters to watch the QoE test videos selected by this logic. Beyond requiring fewer crowdsourcing campaigns, VidPlat also reduces the total number of QoE ratings by dynamically deciding when enough ratings have been gathered for each test video. It is an open-source platform that future researchers can reuse and customize. We have used VidPlat in three projects (web loading, on-demand video, and online gaming) and show that it can reduce crowdsourcing cost by 31.8%-46.0% and latency by 50.9%-68.8%.
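
    A hedged sketch of the kind of researcher-defined pruning logic the abstract describes; this is not VidPlat's actual API, and the stopping rule and names (e.g. `needs_more_ratings`) are illustrative assumptions.
```python
import statistics

def needs_more_ratings(ratings, min_ratings=5, max_halfwidth=0.5):
    """Return True while a video's QoE estimate is still too uncertain.

    `ratings` are 1-5 opinion scores collected so far; the stopping rule
    (a ~95% confidence half-width below `max_halfwidth`) is an assumption
    chosen for illustration, not VidPlat's built-in policy.
    """
    if len(ratings) < min_ratings:
        return True
    stderr = statistics.stdev(ratings) / len(ratings) ** 0.5
    return 1.96 * stderr > max_halfwidth

# A controller built on such logic would redirect raters only to videos whose
# estimates have not yet converged, pruning redundant tests within one campaign.
collected = {"video_a": [4, 4, 5, 4, 4, 5], "video_b": [2, 5, 1, 4, 3]}
pending = [v for v, r in collected.items() if needs_more_ratings(r)]
print(pending)  # ['video_b']: its ratings are still too spread out
```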

    Standardisation methods, mark schemes, and their impact on marking reliability
