
    Generating and Evaluating Tests for K-12 Students with Language Model Simulations: A Case Study on Sentence Reading Efficiency

    Developing an educational test can be expensive and time-consuming, as each item must be written by experts and then evaluated by collecting hundreds of student responses. Moreover, many tests require multiple distinct sets of questions administered throughout the school year to closely monitor students' progress, known as parallel tests. In this study, we focus on tests of silent sentence reading efficiency, used to assess students' reading ability over time. To generate high-quality parallel tests, we propose to fine-tune large language models (LLMs) to simulate how previous students would have responded to unseen items. With these simulated responses, we can estimate each item's difficulty and ambiguity. We first use GPT-4 to generate new test items following a list of expert-developed rules and then apply a fine-tuned LLM to filter the items based on criteria from psychological measurement. We also propose an optimal-transport-inspired technique for generating parallel tests and show that the generated tests closely correspond to the original test's difficulty and reliability based on crowdworker responses. Our evaluation of a generated test with 234 students from grades 2 to 8 produces test scores highly correlated (r = 0.93) with those of a standard test form written by human experts and evaluated across thousands of K-12 students. Comment: Accepted to EMNLP 2023 (Main Conference).
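
This abstract describes estimating item difficulty from simulated responses and an optimal-transport-inspired construction of parallel forms. The sketch below is a rough illustration only, not the authors' code: it approximates difficulty as the proportion of incorrect (simulated) answers and pairs candidate items with the original form's items via a linear-sum assignment, a discrete relative of optimal transport. The function names and random data are hypothetical placeholders.

```python
# Illustrative sketch: build a parallel form by matching candidate items to an
# existing form on estimated difficulty (an assignment / discrete-OT problem).
# Difficulty-as-proportion-incorrect is a simplification of the paper's
# LLM-simulation-based estimates.
import numpy as np
from scipy.optimize import linear_sum_assignment

def estimate_difficulty(responses: np.ndarray) -> np.ndarray:
    """responses: (n_students, n_items) binary matrix; higher value = harder item."""
    return 1.0 - responses.mean(axis=0)

def build_parallel_form(orig_difficulty: np.ndarray, cand_difficulty: np.ndarray) -> np.ndarray:
    """Assign one candidate item to each original item, minimizing total difficulty mismatch."""
    cost = np.abs(orig_difficulty[:, None] - cand_difficulty[None, :])
    _, cols = linear_sum_assignment(cost)
    return cols  # indices of candidate items forming the parallel test

# Toy usage with random data standing in for LLM-simulated student responses.
rng = np.random.default_rng(0)
orig = estimate_difficulty(rng.integers(0, 2, size=(200, 30)))
cand = estimate_difficulty(rng.integers(0, 2, size=(200, 90)))
print("selected candidate items:", build_parallel_form(orig, cand)[:10])
```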

    ROAR-CAT: Rapid Online Assessment of Reading ability with Computerized Adaptive Testing

    The Rapid Online Assessment of Reading (ROAR) is a web-based, lexical decision task that measures single word reading abilities in children and adults without a proctor. Here we study whether item response theory (IRT) and computerized adaptive testing (CAT) can be used to create a more efficient online measure of word recognition. To construct an item bank, we first analyzed data taken from four groups of students (N = 1,960) who differed in age, socioeconomic status, and language-based learning disabilities. The majority of item parameters were highly consistent across groups (r = 0.78-0.94); 6 items that functioned differently across groups were removed, leaving 246 items in the final item bank. Next, we implemented a JavaScript CAT algorithm and conducted a validation experiment with 485 students in grades 1-8 who were randomly assigned to complete trials of all items in the item bank in either (a) a random order or (b) an order determined by the CAT algorithm. We found that, to achieve a reliability of 0.9, CAT improved test efficiency by 40%: 75 CAT items produced the same standard error of measurement as 125 items in a random order. Subsequent validation in 32 public school classrooms shows that 40 CAT items (approximately 3 minutes) can achieve high correlations (r = .89 for 1st grade, r = .73 for 2nd grade) with alternative 15-20 minute, individually proctored reading assessments. Our findings suggest that ROAR-CAT is a promising tool for efficiently and accurately measuring single word reading ability in reading research and educational practice. Furthermore, our development process serves as a model for creating adaptive online assessments that bridge research and practice.
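
To make the efficiency claim concrete, here is a minimal sketch of one IRT-based CAT loop under a 2PL model: pick the unused item with maximum Fisher information at the current ability estimate, update the estimate, and stop once the standard error of measurement falls below a target. This is an illustration only, not the JavaScript ROAR-CAT implementation; the item parameters, stopping threshold, and helper names are assumptions.

```python
# Minimal 2PL CAT loop: adaptive item selection by Fisher information.
import numpy as np

def p_correct(theta, a, b):
    # Two-parameter logistic (2PL) probability of a correct response.
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    # Fisher information of a 2PL item at ability theta.
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def estimate_theta(responses, a, b, grid=np.linspace(-4, 4, 161)):
    # Grid-search maximum-likelihood estimate of ability from items seen so far.
    ll = np.zeros_like(grid)
    for r, ai, bi in zip(responses, a, b):
        p = p_correct(grid, ai, bi)
        ll += r * np.log(p) + (1 - r) * np.log(1 - p)
    return grid[int(np.argmax(ll))]

def run_cat(bank_a, bank_b, answer_fn, target_sem=0.33, max_items=75):
    # Administer the most informative unused item at the current ability estimate,
    # update the estimate, and stop when the standard error of measurement (SEM)
    # reaches the target (SEM ~ 0.32 roughly corresponds to reliability ~ 0.9 on a
    # standardized theta scale) or max_items is reached.
    asked, responses, theta = [], [], 0.0
    while len(asked) < max_items:
        info = item_information(theta, bank_a, bank_b)
        info[asked] = -np.inf                    # never re-administer an item
        nxt = int(np.argmax(info))
        asked.append(nxt)
        responses.append(answer_fn(nxt, theta))  # 1 = correct, 0 = incorrect
        theta = estimate_theta(responses, bank_a[asked], bank_b[asked])
        sem = 1.0 / np.sqrt(item_information(theta, bank_a[asked], bank_b[asked]).sum())
        if sem <= target_sem:
            break
    return theta, len(asked)

# Toy usage: a simulated examinee with true ability 1.0 on a random 246-item bank.
rng = np.random.default_rng(1)
a_bank = rng.uniform(0.8, 2.0, 246)
b_bank = rng.normal(0.0, 1.0, 246)
answer = lambda i, _theta: int(rng.random() < p_correct(1.0, a_bank[i], b_bank[i]))
theta_hat, n_items = run_cat(a_bank, b_bank, answer)
print(f"estimated ability {theta_hat:.2f} after {n_items} items")
```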

    Development and validation of a rapid online sentence reading efficiency assessment

    The speed at which students can accurately read and understand connected text is at the foundation of reading development. Timed reading measures go under a variety of names (e.g., reading fluency, reading efficiency and comprehension, etc.) and involve different levels of demands on comprehension, making it hard to interpret the extent to which scores reflect differences in reading efficiency versus comprehension. Here we define a new measure of silent sentence reading efficiency (SRE) and explore key aspects of item development for an unproctored, online SRE assessment (ROAR-SRE). In doing so, we set forth an argument for developing sentences that are simple assertions, with an unambiguous answer, requiring minimal background knowledge and vocabulary. We then run a large-scale validation study to document convergent validity between ROAR-SRE and other measures of reading. Finally, we validate the reliability and accuracy of using artificial intelligence (AI) to generate matched test forms. We find that a short, one-minute SRE assessment is highly correlated with other reading measures and has exceptional reliability. Moreover, AI can automatically generate test forms that are almost perfectly matched to manually authored test forms. Together, these results highlight the potential for regular, even weekly, assessment and progress monitoring at scale with ROAR-SRE.
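
The form-equivalence claim above (AI-generated forms closely matched to manually authored ones) is commonly checked with an alternate-form correlation and a comparison of mean scores. The sketch below shows that kind of check on simulated data; the function name, score scale, and data are hypothetical and not taken from the ROAR-SRE validation study.

```python
# Illustrative alternate-form equivalence check on simulated SRE scores.
import numpy as np
from scipy.stats import pearsonr

def compare_forms(scores_manual: np.ndarray, scores_generated: np.ndarray) -> dict:
    r, p = pearsonr(scores_manual, scores_generated)   # alternate-form correlation
    return {
        "alternate_form_r": float(r),
        "p_value": float(p),
        "mean_diff": float(np.mean(scores_generated) - np.mean(scores_manual)),
    }

# Toy usage with simulated scores standing in for real student data.
rng = np.random.default_rng(2)
true_ability = rng.normal(50, 10, 300)                 # e.g., sentences correct per minute
manual = true_ability + rng.normal(0, 3, 300)
generated = true_ability + rng.normal(0, 3, 300)
print(compare_forms(manual, generated))
```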