State-Of-The-Art Automated Essay Scoring: Competition, Results, and Future Directions from a United States Demonstration
This article summarizes the highlights of two studies: a national demonstration that contrasted commercial vendors' performance on automated essay scoring (AES) with that of human raters, and an international competition to match or exceed commercial vendor performance benchmarks. In these studies, the automated essay scoring engines performed well on five of seven measures and approximated human rater performance on the other two. With additional validity studies, automated essay scoring appears to hold the potential to play a viable role in high-stakes writing assessments.
Contrasting state-of-the-art automated scoring of essays: analysis
This study compared the results from nine automated essay scoring engines on eight essay scoring prompts drawn from six states that annually administer high-stakes writing assessments.
Student essays from each state were randomly divided into three sets: a training set used for modeling the essay prompt responses, consisting of essay text and ratings from two human raters along with a final or resolved score; a test set used for a blind test of the vendor-developed models, consisting of text responses only; and a validation set that was not employed in this study.
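As a rough illustration of this kind of three-way partition (the split proportions, random seed, and data layout below are assumptions for the sketch, not values reported by the study), a minimal version might look like:

import random

def three_way_split(essays, train_frac=0.6, test_frac=0.2, seed=42):
    """Randomly partition essays into training, blind test, and held-out
    validation sets. Fractions and seed are illustrative assumptions."""
    rng = random.Random(seed)
    shuffled = essays[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_test = int(n * test_frac)
    train = shuffled[:n_train]                 # text + two human ratings + resolved score
    test = shuffled[n_train:n_train + n_test]  # text only, for blind vendor scoring
    validation = shuffled[n_train + n_test:]   # withheld from the study
    return train, test, validation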
The essays encompassed writing assessment items from three grade levels (7, 8, and 10) and were evenly divided between source-based prompts (i.e., essay prompts developed on the basis of provided source material) and prompts drawn from traditional writing genres (i.e., narrative, descriptive, persuasive). The total sample size was N = 22,029. Six of the eight essay sets were transcribed from their original handwritten responses by two transcription vendors, with a transcription accuracy rate of 98.70% across 17,502 essays. The remaining essays were typed by students during the actual assessment and provided in ASCII form.
Seven of the eight essay sets were holistically scored, and one employed score assignments for two traits. Scale ranges, rubrics, and scoring adjudications for the essay sets were quite variable. Results were presented on the distributional properties of the data (mean and standard deviation) along with traditional measures used in automated essay scoring: exact agreement, exact-plus-adjacent agreement, kappa, quadratic-weighted kappa, and the Pearson r. Overall, the results demonstrated that automated essay scoring was capable of producing scores similar to human scores for extended-response writing items, with equal performance for both source-based and traditional writing genres.
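To make the agreement measures concrete, the following sketch (not the study's evaluation code; the integer score range and input vectors are assumptions) computes exact agreement, exact-plus-adjacent agreement, and quadratic-weighted kappa for paired human and engine scores:

from collections import Counter

def exact_agreement(human, machine):
    # Proportion of essays where the two scores match exactly
    return sum(h == m for h, m in zip(human, machine)) / len(human)

def adjacent_agreement(human, machine):
    # Exact-plus-adjacent: scores matching or differing by at most one point
    return sum(abs(h - m) <= 1 for h, m in zip(human, machine)) / len(human)

def quadratic_weighted_kappa(human, machine, min_score, max_score):
    """Quadratic-weighted kappa between two integer score vectors."""
    scores = list(range(min_score, max_score + 1))
    n = len(human)
    k = len(scores)
    observed = Counter(zip(human, machine))  # observed joint score counts
    h_marg = Counter(human)                  # marginal counts for chance-expected matrix
    m_marg = Counter(machine)
    num = 0.0
    den = 0.0
    for i, si in enumerate(scores):
        for j, sj in enumerate(scores):
            w = ((i - j) ** 2) / ((k - 1) ** 2)      # quadratic disagreement weight
            o = observed[(si, sj)] / n               # observed proportion
            e = (h_marg[si] / n) * (m_marg[sj] / n)  # expected proportion under independence
            num += w * o
            den += w * e
    return 1.0 - num / den

For example, with human = [2, 3, 4, 4] and machine = [2, 3, 3, 4] on a 1-6 scale, exact agreement is 0.75 and exact-plus-adjacent agreement is 1.0.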