Validation of automated scoring for learning progression-aligned Next Generation Science Standards performance assessments
Introduction: The Framework for K-12 Science Education promotes supporting the development of knowledge application skills along previously validated learning progressions (LPs). Effective assessment of knowledge application requires LP-aligned constructed-response (CR) assessments, but these assessments are time-consuming and expensive to score and to provide feedback on. Machine learning (ML), a branch of artificial intelligence, offers an invaluable tool for conducting validation studies and providing immediate feedback. To fully evaluate the validity of machine-based scores, it is important to investigate human-machine score consistency beyond observed scores. Importantly, no formal studies have explored the nature of disagreements between human- and machine-assigned scores as related to LP levels.
Methods: We used quantitative and qualitative approaches to investigate the nature of disagreements between human scores and scores generated by two machine learning approaches, using a previously validated assessment instrument aligned to an LP for scientific argumentation.
Results: We applied quantitative approaches, including agreement measures, confirmatory factor analysis, and generalizability studies, to identify items that represent threats to validity for different machine scoring approaches. This analysis allowed us to determine the specific elements of argumentation practice at each LP level that are associated with a higher percentage of misscores by each scoring approach. We then used qualitative analysis of the items identified by the quantitative methods to examine the consistency between the misscores, the scoring rubrics, and student responses. We found that rubrics that require interpretation by human coders and items that target more sophisticated argumentation practice present the greatest threats to the validity of machine scores.
Discussion: We use this information to construct a fine-grained validity argument for machine scores, which is important because it provides insights for improving the design of LP-aligned assessments and artificial intelligence-enabled scoring of those assessments.
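As an illustration of the kind of human-machine agreement measure this abstract refers to, the sketch below (not the authors' code; the score lists and the use of scikit-learn are assumptions) computes a quadratic weighted kappa and exact agreement between human- and machine-assigned LP-level scores.

    # Minimal sketch of a human-machine agreement check on ordinal LP-level scores.
    # The score lists are invented; quadratic weighting is a common (assumed) choice
    # for ordinal levels because it penalizes distant disagreements more heavily.
    from sklearn.metrics import cohen_kappa_score

    human_scores   = [1, 2, 2, 3, 4, 3, 2, 1, 4, 3]
    machine_scores = [1, 2, 3, 3, 4, 2, 2, 1, 4, 4]

    qwk = cohen_kappa_score(human_scores, machine_scores, weights="quadratic")
    exact = sum(h == m for h, m in zip(human_scores, machine_scores)) / len(human_scores)
    print(f"quadratic weighted kappa: {qwk:.2f}, exact agreement: {exact:.0%}")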
Mixed Student Ideas about Mechanisms of Human Weight Loss
Recent calls for college biology education reform have identified "pathways and transformations of matter and energy" as a big idea in biology that is crucial for students to learn. Previous work has examined how college students think about such matter-transforming processes; however, little research has investigated how students connect these ideas. Here, we probe student thinking about matter transformations in the familiar context of human weight loss. Our analysis of 1192 student constructed responses revealed three scientific (which we label "Normative") and five less scientific (which we label "Developing") ideas that students use to explain weight loss. Additionally, students combine these ideas in their responses, with an average of 2.19 ± 1.07 ideas per response and 74.4% of responses containing two or more ideas. These results highlight the extent to which students hold multiple (both correct and incorrect) ideas about complex biological processes. We described student responses as conforming to Scientific, Mixed, or Developing descriptive models, which had averages of 1.9 ± 0.6, 3.1 ± 0.9, and 1.7 ± 0.8 ideas per response, respectively. Such heterogeneous student thinking is characteristic of difficulties in both conceptual change and early expertise development and will require careful instructional intervention for lasting learning gains.
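Descriptive figures of this kind (ideas per response, share of responses combining two or more ideas) can be computed from coded data with a short script such as the sketch below; the idea counts are invented for illustration, not the study's data.

    # Minimal sketch: summarizing the number of coded ideas per constructed response.
    import statistics

    ideas_per_response = [1, 2, 3, 2, 1, 4, 2, 3, 2, 1]  # hypothetical counts

    mean_ideas = statistics.mean(ideas_per_response)
    sd_ideas = statistics.stdev(ideas_per_response)
    multi_share = sum(n >= 2 for n in ideas_per_response) / len(ideas_per_response)

    print(f"{mean_ideas:.2f} +/- {sd_ideas:.2f} ideas per response")
    print(f"{multi_share:.1%} of responses contain two or more ideas")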
Introductory biology undergraduate students' mixed ideas about genetic information flow
The core concept of genetic information flow was identified in recent calls to improve undergraduate biology education. Previous work shows that students have difficulty differentiating between the three processes of the Central Dogma (CD; replication, transcription, and translation). We built upon this work by developing and applying an analytic coding rubric to 1050 student written responses to a three-question item about the CD. Each response was previously coded only for correctness using a holistic rubric. Our rubric captures subtleties of student conceptual understanding of each process that previous work has not yet captured at a large scale. Regardless of holistic correctness scores, student responses included five or six distinct ideas. By analyzing common co-occurring rubric categories in student responses, we found a common pair representing two normative ideas about the molecules produced by each CD process. By applying analytic coding to student responses preinstruction and postinstruction, we found student thinking about the processes involved was most prone to change. The combined strengths of analytic and holistic rubrics allow us to reveal mixed ideas about the CD processes and provide a detailed picture of which conceptual ideas students draw upon when explaining each CD process.
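A sketch of the co-occurrence analysis described above, with invented rubric category names standing in for the published analytic codes:

    # Minimal sketch: counting which analytic rubric categories co-occur in the
    # same response. Category names and codings are illustrative assumptions.
    from collections import Counter
    from itertools import combinations

    coded_responses = [
        {"mRNA_product", "protein_product"},
        {"mRNA_product", "protein_product", "DNA_template"},
        {"protein_product", "base_pairing"},
        {"mRNA_product", "protein_product"},
    ]

    pair_counts = Counter()
    for codes in coded_responses:
        pair_counts.update(combinations(sorted(codes), 2))

    # The most frequent pairs play the role of the "common co-occurring categories".
    print(pair_counts.most_common(3))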
Deconstruction of Holistic Rubrics into Analytic Rubrics for Large-Scale Assessments of Students' Reasoning of Complex Science Concepts
Constructed responses can be used to assess the complexity of student thinking and can be evaluated using rubrics. The two most common rubric types are holistic and analytic. Holistic rubrics may be difficult to apply to expert-level reasoning that has additive or overlapping language. In an attempt to unpack complexity in holistic rubrics at a large scale, we have developed a systematic approach called deconstruction. We define deconstruction as the process of breaking a holistic rubric into the individual conceptual components that can be used for analytic rubric development and application. These individual components can then be recombined into a holistic score, which stays true to the purpose of the holistic rubric while maximizing the benefits and minimizing the shortcomings of each rubric type. This paper outlines the deconstruction process and presents a case study of concept definitions for a hierarchical holistic rubric developed for an undergraduate physiology content-reasoning context. These methods offer one way for assessment developers to unpack complex student reasoning, which may ultimately improve the reliability and validation of assessments targeted at uncovering complex scientific reasoning at a large scale.
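The recombination step can be pictured with a short sketch; the component names and the mapping rule below are hypothetical stand-ins, not the rubric published in the paper.

    # Minimal sketch: recombining binary analytic component codes into a
    # hierarchical holistic level. Components and mapping rule are assumed.
    def holistic_level(components: dict) -> int:
        """Map analytic component scores (0/1) to a holistic level (0-3)."""
        if components.get("mechanistic_link") and components.get("core_principle"):
            return 3
        if components.get("core_principle"):
            return 2
        if components.get("relevant_variable"):
            return 1
        return 0

    example = {"relevant_variable": 1, "core_principle": 1, "mechanistic_link": 0}
    print(holistic_level(example))  # prints 2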
Using Lexical Analysis Software to Assess Student Writing in Statistics
Meaningful assessments that reveal student thinking are vital to the success of addressing the GAISE recommendation: use assessments to improve and evaluate student learning. Constructed-response questions, also known as open-response or short-answer questions, in which students must write an answer in their own words, have been shown to reveal students' understanding better than multiple-choice questions, but they are much more time-consuming to grade for classroom use or to code for research purposes. This paper describes and illustrates the use of two different software packages to analyze open-response data collected from undergraduate students' writing. The analysis and results produced by the two packages are contrasted with each other and with the results obtained from hand coding of the same data sets. The article concludes with a discussion of the advantages and limitations of the analysis options for statistics education research.
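A comparison of software output against hand coding, of the kind described above, can be summarized in a cross-tabulation, as in the sketch below (the labels and the use of pandas are assumptions, not the paper's procedure).

    # Minimal sketch: contrasting software-assigned categories with hand codes
    # for the same responses. Labels are invented.
    import pandas as pd

    hand = ["correct", "partial", "incorrect", "correct", "partial", "incorrect"]
    software = ["correct", "partial", "partial", "correct", "incorrect", "incorrect"]

    table = pd.crosstab(pd.Series(hand, name="hand code"),
                        pd.Series(software, name="software code"))
    print(table)

    agreement = sum(h == s for h, s in zip(hand, software)) / len(hand)
    print(f"simple percent agreement: {agreement:.0%}")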