Generating and Evaluating Tests for K-12 Students with Language Model Simulations: A Case Study on Sentence Reading Efficiency
Developing an educational test can be expensive and time-consuming, as each item must be written by experts and then evaluated by collecting hundreds of student responses. Moreover, many tests require multiple distinct sets of questions, known as parallel tests, administered throughout the school year to closely monitor students' progress. In this study, we focus on tests of silent sentence reading efficiency, used to assess students' reading ability over time. To generate high-quality parallel tests, we propose to fine-tune large language models (LLMs) to simulate how previous students would have responded to unseen items. With these simulated responses, we can estimate each item's difficulty and ambiguity. We first use GPT-4 to generate new test items following a list of expert-developed rules, and then apply a fine-tuned LLM to filter the items based on criteria from psychological measurement. We also propose an optimal-transport-inspired technique for generating parallel tests and show that the generated tests closely correspond to the original test's difficulty and reliability based on crowdworker responses. Our evaluation of a generated test with 234 students from grades 2 to 8 produces test scores highly correlated (r=0.93) with those of a standard test form written by human experts and evaluated across thousands of K-12 students. Comment: Accepted to EMNLP 2023 (Main Conference).
ROAR-CAT: Rapid Online Assessment of Reading ability with Computerized Adaptive Testing
The Rapid Online Assessment of Reading (ROAR) is a web-based lexical decision task that measures single-word reading ability in children and adults without a proctor. Here we study whether item response theory (IRT) and computerized adaptive testing (CAT) can be used to create a more efficient online measure of word recognition. To construct an item bank, we first analyzed data from four groups of students (N = 1,960) who differed in age, socioeconomic status, and language-based learning disabilities. The majority of item parameters were highly consistent across groups (r = 0.78-0.94); 6 items that functioned differently across groups were removed, leaving 246 items in the final item bank. Next, we implemented a JavaScript CAT algorithm and conducted a validation experiment with 485 students in grades 1-8 who were randomly assigned to complete trials of all items in the item bank in either (a) a random order or (b) an order determined by the CAT algorithm. We found that, to achieve a reliability of 0.9, CAT improved test efficiency by 40%: 75 CAT items produced the same standard error of measurement as 125 items in a random order. Subsequent validation in 32 public school classrooms showed that 40 CAT items (approximately 3 minutes) achieve high correlations (r = .89 for 1st grade, r = .73 for 2nd grade) with alternative, individually proctored reading assessments that take 15-20 minutes. Our findings suggest that ROAR-CAT is a promising tool for efficiently and accurately measuring single-word reading ability in reading research and educational practice. Furthermore, our development process serves as a model for creating adaptive online assessments that bridge research and practice.
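A sketch of the generic IRT/CAT loop the abstract refers to, written in Python rather than the JavaScript used by ROAR-CAT and not the authors' code: maximum-information item selection under a Rasch (1PL) model, grid-based MAP ability estimation, and a stopping rule at SE ≈ 0.316, the standard-error level that corresponds to a reliability of 0.9 on a standardized ability scale (reliability ≈ 1 - SE^2). The item bank, response simulator, and grid are illustrative.

```python
# Minimal CAT loop under a Rasch model (illustrative, not the ROAR-CAT implementation).
import numpy as np

rng = np.random.default_rng(1)
bank = rng.normal(size=246)                    # item difficulties b_i (illustrative bank)
true_theta = 0.8                               # examinee ability, known only to the simulator

def p_correct(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))  # Rasch (1PL) response model

def answer(item):                              # simulated examinee response to one item
    return rng.random() < p_correct(true_theta, bank[item])

theta_grid = np.linspace(-4, 4, 161)
loglik = -0.5 * theta_grid**2                  # standard-normal prior -> MAP estimation
administered, se = [], np.inf

while se > 0.316 and len(administered) < len(bank):
    theta_hat = theta_grid[np.argmax(loglik)]            # provisional ability estimate
    p = p_correct(theta_hat, bank)
    info = p * (1 - p)                                   # Rasch item information at theta_hat
    info[administered] = -np.inf                         # never reuse an item
    item = int(np.argmax(info))                          # maximum-information selection
    administered.append(item)
    p_grid = p_correct(theta_grid, bank[item])           # likelihood of this response on the grid
    loglik += np.log(p_grid) if answer(item) else np.log(1 - p_grid)
    theta_hat = theta_grid[np.argmax(loglik)]            # re-estimate after the response
    p_admin = p_correct(theta_hat, bank[administered])
    se = 1.0 / np.sqrt(np.sum(p_admin * (1 - p_admin)))  # SE(theta) = 1 / sqrt(test information)

print(f"stopped after {len(administered)} items; "
      f"theta_hat = {theta_grid[np.argmax(loglik)]:.2f}, SE = {se:.2f}")
```

The stopping rule is what drives the efficiency gain described in the abstract: the test ends as soon as the accumulated item information pushes the standard error below the target, rather than after a fixed number of items.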
Development and validation of a rapid online sentence reading efficiency assessment
The speed at which students can accurately read and understand connected text is at the foundation of reading development. Timed reading measures go by a variety of names (e.g., reading fluency, reading efficiency and comprehension) and involve different levels of demands on comprehension, making it hard to interpret the extent to which scores reflect differences in reading efficiency versus comprehension. Here we define a new measure of silent sentence reading efficiency (SRE) and explore key aspects of item development for an unproctored, online SRE assessment (ROAR-SRE). In doing so, we set forth an argument for developing sentences that are simple assertions with an unambiguous answer, requiring minimal background knowledge and vocabulary. We then run a large-scale validation study to document convergent validity between ROAR-SRE and other measures of reading. Finally, we validate the reliability and accuracy of using artificial intelligence (AI) to generate matched test forms. We find that a short, one-minute SRE assessment is highly correlated with other reading measures and has exceptional reliability. Moreover, AI can automatically generate test forms that are almost perfectly matched to manually authored test forms. Together, these results highlight the potential for regular (even weekly) assessment and progress monitoring at scale with ROAR-SRE.
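Two simple checks one might run when validating that an AI-generated form is matched to a manually authored one, sketched in Python on made-up data (not the study's actual analysis): the alternate-form correlation of per-student scores, and a two-sample Kolmogorov-Smirnov test comparing the two forms' item difficulty distributions. Student counts, item counts, and noise levels are invented for illustration.

```python
# Illustrative form-matching checks on hypothetical data.
import numpy as np
from scipy.stats import pearsonr, ks_2samp

rng = np.random.default_rng(2)

# Hypothetical data: per-student scores on a manually authored form and an
# AI-generated form, plus per-item difficulty estimates for each form.
true_ability = rng.normal(size=300)
score_manual = true_ability + rng.normal(scale=0.3, size=300)
score_ai = true_ability + rng.normal(scale=0.3, size=300)
difficulty_manual = rng.normal(size=130)
difficulty_ai = rng.normal(size=130)

r, _ = pearsonr(score_manual, score_ai)                     # alternate-form correlation
ks_stat, ks_p = ks_2samp(difficulty_manual, difficulty_ai)  # do difficulty profiles match?
print(f"alternate-form r = {r:.2f}; KS statistic = {ks_stat:.2f} (p = {ks_p:.2f})")
```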