Rise to the Challenge or Not Give a Damn: Differential Performance in High vs. Low Stakes Tests
This paper studies how different demographic groups respond to incentives by comparing performance on the GRE examination in high- and low-stakes situations. The high-stakes situation is the real GRE examination, and the low-stakes situation is a voluntary experimental section of the GRE that examinees were invited to take immediately after they finished the real GRE exam. We show that males exhibit a larger difference in performance between the high- and low-stakes examinations than females, and that Whites exhibit a larger difference in performance between the high- and low-stakes examinations than Asians, Blacks, and Hispanics. We find that the larger performance differential between high- and low-stakes tests among men and Whites can be partially explained by the lower level of effort these groups invest in the low-stakes test.

Keywords: gender, competition, incentives, GRE, high stakes, low stakes, test score gap
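To make the comparison concrete, here is a minimal sketch of the group-level high-minus-low stakes gap, assuming a hypothetical pandas DataFrame with one row per examinee (all column names and numbers below are illustrative, not from the study):

    import pandas as pd

    # Hypothetical records: one row per examinee, with a score from the real
    # (high-stakes) GRE section and from the voluntary (low-stakes) section.
    scores = pd.DataFrame({
        "group": ["male", "male", "female", "female"],
        "high_stakes_score": [158, 162, 159, 161],
        "low_stakes_score": [149, 151, 156, 158],
    })

    # Per-examinee drop from high to low stakes, averaged within each group;
    # a larger mean gap indicates a stronger response to the stakes.
    scores["stakes_gap"] = scores["high_stakes_score"] - scores["low_stakes_score"]
    print(scores.groupby("group")["stakes_gap"].mean())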
Investigating the Factor Structure of iSkills™
This paper investigates the issue of internal validity in the context of complex assessments and constructs, focusing on Information and Communication Technology (ICT) literacy as measured by the iSkills assessment. Using exploratory and confirmatory factor analyses, the paper explores the internal structure of the iSkills assessment vis-à-vis unidimensional and multidimensional views of ICT literacy. The results indicate that ICT literacy, as measured by iSkills, emerges more as an integrated skill set than as a collection of distinct domains. The paper contributes to the broader conversation on the internal structure of assessments designed for complex constructs.
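A minimal sketch of the kind of dimensionality check the abstract describes, using scikit-learn's FactorAnalysis on simulated item scores (the data generator and factor counts are assumptions for illustration, not the iSkills design):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(0)
    # Simulated scores for 500 examinees on 6 iSkills-style tasks, generated
    # from a single underlying ability (the unidimensional hypothesis).
    ability = rng.normal(size=(500, 1))
    loadings = rng.uniform(0.6, 0.9, size=(1, 6))
    items = ability @ loadings + rng.normal(scale=0.5, size=(500, 6))

    # Compare a one-factor (integrated skill) model with a three-factor
    # (distinct domains) model by held-out average log-likelihood.
    for k in (1, 3):
        fa = FactorAnalysis(n_components=k).fit(items[:400])
        print(f"{k} factor(s): avg log-likelihood = {fa.score(items[400:]):.3f}")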
Maintaining and monitoring quality of a continuously administered digital assessment
Digital-first assessments are a new generation of high-stakes assessments that can be taken anytime and anywhere in the world. The flexibility, complexity, and high-stakes nature of these assessments pose quality assurance challenges and require continuous data monitoring and the ability to promptly identify, interpret, and correct anomalous results. In this manuscript, we illustrate the development of a quality assurance system for anomaly detection for a new high-stakes digital-first assessment whose population of test takers is still in flux. Various control charts and models are applied to detect and flag abnormal changes in the assessment statistics, which are then reviewed by experts. The procedure for determining the causes of a score anomaly is demonstrated with a real-world example. Several categories of statistics, including scores, test-taker profiles, repeaters, item analysis, and item exposure, are monitored to provide context and evidence for evaluating score anomalies and to assure the quality of the assessment. The monitoring results and alerts are automatically updated and delivered daily via an interactive dashboard.
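As one concrete instance of this kind of monitoring, here is a minimal Shewhart-style control chart sketch that flags daily mean scores outside 3-sigma control limits; the data, baseline window, and threshold are illustrative assumptions, not the system described in the paper:

    import numpy as np

    rng = np.random.default_rng(1)
    # Hypothetical daily mean scores for a continuously administered test.
    daily_means = rng.normal(loc=120, scale=2, size=90)
    daily_means[60] = 131  # injected anomaly, e.g. a scoring pipeline change

    # Shewhart-style chart: control limits from a 30-day baseline window;
    # days outside center +/- 3 sigma are flagged for expert review.
    baseline = daily_means[:30]
    center, sigma = baseline.mean(), baseline.std(ddof=1)
    flagged_days = np.where(np.abs(daily_means - center) > 3 * sigma)[0]
    print("days flagged for expert review:", flagged_days)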
A Developmental Writing Scale
This report describes the development of grade norms for timed-writing performance in two modes of writing: persuasive and descriptive. These norms are based on objective, automatically computed measures of writing quality in grammar, usage, mechanics, style, vocabulary, organization, and development. The same measures are used in the automated essay scoring system e-rater® V.2. Norms were developed through a large-scale data collection effort that involved a national sample of 170 schools, more than 500 classes from 4th, 6th, 8th, 10th, and 12th grades, and more than 12,000 students. Personal and school background information was also collected. The students wrote (in 30-minute sessions) up to four essays (two in each mode of writing) on topics selected from a pool of 20 topics. The data allowed us to explore a range of questions about the development and nature of writing proficiency. Specifically, this paper describes the trajectory of development in writing performance from 4th grade to 12th grade. The validity of a single developmental writing scale is examined through a human scoring experiment and a longitudinal study, and is further explored through exploratory and confirmatory factor analyses of the internal structure of writing performance and changes in that structure from 4th grade to 12th grade. The paper also explores important factors affecting performance, including prompt difficulty, writing mode, and student background (gender, ethnicity, and English language background).
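A minimal sketch of how grade norms of this kind can be tabulated, assuming hypothetical essay scores on a common scale (the grade means and percentile choices are illustrative, not the report's norms):

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    # Hypothetical essay scores by grade, rising with grade level.
    df = pd.DataFrame({
        "grade": np.repeat([4, 6, 8, 10, 12], 200),
        "score": np.concatenate(
            [rng.normal(m, 1.0, 200) for m in (3.0, 4.0, 5.0, 5.8, 6.5)]
        ),
    })

    # Grade norms: score percentiles within each grade on the common scale.
    norms = df.groupby("grade")["score"].quantile([0.25, 0.50, 0.75]).unstack()
    print(norms.round(2))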
Automated Essay Scoring With e-rater® V.2
E-rater® has been used by the Educational Testing Service for automated essay scoring since 1999. This paper describes a new version, e-rater V.2, that differs from other automated essay scoring systems in several important respects. Its main innovations are a small, intuitive, and meaningful set of scoring features; a single scoring model and scoring standards that can be used across all prompts of an assessment; and modeling procedures that are transparent, flexible, and can be based entirely on expert judgment. The paper describes the new system and presents evidence on the validity and reliability of its scores.
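A minimal sketch of the kind of transparent scoring model the abstract describes: one linear model over a small feature set, trained against human scores and then reused across prompts (the feature count, weights, and data are assumptions for illustration, not the published e-rater V.2 model):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(3)
    # Hypothetical standardized essay features in the spirit of e-rater V.2:
    # grammar, usage, mechanics, style, vocabulary, organization/development.
    X = rng.normal(size=(1000, 6))
    true_weights = np.array([0.5, 0.4, 0.3, 0.6, 0.7, 0.8])  # assumed
    human_scores = X @ true_weights + rng.normal(scale=0.5, size=1000)

    # One transparent linear model, trained once and applied to all prompts.
    model = LinearRegression().fit(X[:800], human_scores[:800])
    print("feature weights:", model.coef_.round(2))
    holdout_r = np.corrcoef(model.predict(X[800:]), human_scores[800:])[0, 1]
    print("holdout correlation with human scores:", round(float(holdout_r), 3))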