524 research outputs found
Towards Interpretable Deep Learning Models for Knowledge Tracing
As an important technique for modeling the knowledge states of learners,
traditional knowledge tracing (KT) models have been widely used to support
intelligent tutoring systems and MOOC platforms. Driven by the fast
advancement of deep learning techniques, deep neural networks have recently
been adopted to design new KT models that achieve better prediction performance.
However, the lack of interpretability of these models has seriously impeded
their practical application, as their outputs and working mechanisms suffer
from an opaque decision process and complex inner structures. We thus
propose to adopt a post-hoc method to tackle the interpretability issue for
deep learning based knowledge tracing (DLKT) models. Specifically, we focus on
applying the layer-wise relevance propagation (LRP) method to interpret an
RNN-based DLKT model by backpropagating the relevance from the model's output
layer to its input layer. The experimental results show the feasibility of using
the LRP method for interpreting the DLKT model's predictions, and partially
validate the computed relevance scores at both the question level and the
concept level. We believe this can be a solid step towards fully interpreting
DLKT models and promoting their practical application in the education domain.
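To make the relevance backpropagation concrete, the following is a minimal sketch of one LRP step through a single linear connection, using the epsilon-stabilised rule that is commonly paired with LRP. The abstract does not state which propagation rule the authors use, and an RNN also contains gated (multiplicative) connections that this sketch does not cover; the function name `lrp_linear` and all numeric values are illustrative assumptions.

```python
def lrp_linear(inputs, weights, bias, relevance_out, eps=1e-6):
    """Redistribute output relevance to the inputs of z_j = sum_i x_i * w[i][j] + b_j.

    Hypothetical helper: weights[i][j] connects input i to output j.
    """
    n_in, n_out = len(inputs), len(bias)
    relevance_in = [0.0] * n_in
    for j in range(n_out):
        z_j = sum(inputs[i] * weights[i][j] for i in range(n_in)) + bias[j]
        # The stabiliser keeps the denominator away from zero without flipping its sign.
        denom = z_j + (eps if z_j >= 0 else -eps)
        for i in range(n_in):
            # Each input receives relevance in proportion to its contribution x_i * w_ij.
            relevance_in[i] += inputs[i] * weights[i][j] / denom * relevance_out[j]
    return relevance_in


# Illustrative values only: three input features feeding one output unit.
x = [1.0, 0.5, 0.0]
w = [[0.4], [-0.2], [0.9]]
b = [0.0]
z = sum(x[i] * w[i][0] for i in range(3)) + b[0]

# Start from the output activation as the relevance to be explained.
r_in = lrp_linear(x, w, b, [z])
```

With a zero bias, the input relevances sum (up to the stabiliser) to the output relevance, which is the conservation property that makes the scores readable as shares of the prediction.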
Searching for equity in math education : an examination into issues of course access and classroom experiences for Black and Hispanic youth
Achieving equity in math education requires investigations into issues of access as well as experiences of students in their math classrooms. In this dissertation, I present three analytic chapters that explore equitable access by race/ethnicity to advanced math courses as well as equitable experiences within math classrooms. Specifically, in the first analytic chapter I explore the extent to which Black and Hispanic students in a large and diverse school district are underrepresented in 8th grade algebra relative to their White peers (and each other) within the context of the racial/ethnic composition of their schools. In the second analytic chapter, I examine whether those students who successfully complete algebra in the 8th grade go on to take geometry in the 9th grade at the same rate as their White peers. In these two chapters I find that equitable access to 8th grade algebra depends largely on the racial/ethnic composition of the school students attend, such that Black and Hispanic students are disadvantaged in some contexts but not in others. However, I also find that once students enter the pipeline of advanced math course-taking in the 8th grade, access to subsequent advanced math is equitable. In the final analytic chapter, I shift my focus to what happens in math classrooms by utilizing national data to examine the extent to which students perceive their 9th grade math teachers as being equitable and how these perceptions affect student outcomes. My findings indicate that the impact of having an equitable teacher on math test scores varies by race/ethnicity, such that Black students realize positive effects of having an equitable teacher regardless of their math course level, while their Hispanic and White peers realize differing effects depending on course level. Science, Technology, Engineering, and Mathematics Educatio
The effects of individual differences and linguistic features on reading comprehension of health-related texts
Background. Relatively little attention has been focused on whether or how the effects of reader characteristics, or of the linguistic properties of a text, predict reading comprehension of health-related information. In addition, there is little evidence for the utility of any of the writing guidelines promulgated by the National Health Service (NHS) in order to improve the comprehension of health information. Nonetheless, some previous research suggests that health-related texts could be adapted for different groups of users to optimise understanding. Thus, existing knowledge presents important limitations, and raises concerns with potentially far-reaching practical implications. To address these concerns, I investigated how variation in individual differences and in text features predicts the comprehension of health-related texts, examining how the effects of textual features may differ for different kinds of readers. Method. The focus of this thesis is on Study 3, in which I investigated the predictors of tested comprehension, but I report preliminary studies where I examined the readability of a sample of health-related texts (Study 1), and the perceived comprehension of a sample of health-related texts (Study 2). In the primary study (Study 3), I used Bayesian mixed-effects models to analyse the influences that affect the accuracy of responses to questions probing the comprehension of a sample of health-related texts. I measured variation among 200 participants in their cognitive abilities, to capture the effects of individual differences, as well as variation in the linguistic features of texts, to capture the effects of text structure and content. Results. I found that tested comprehension was less likely to be accurate among older participants. However, comprehension accuracy was greater given higher levels of education, health literacy, and English language proficiency levels. 
In addition, self-rated evaluations of perceived comprehension predicted comprehension, but only in the absence of other individual-differences-related predictors. Variation in text features, including readability estimates, did not predict comprehension accuracy, and there was no evidence for the modulation of the effects of individual differences by text features. Discussion. Text features did not modulate the effects of individual differences to influence comprehension accuracy in any meaningful way. This suggests that adapting health-related texts to different groups of the population may be of limited practical value. Implications. Individual differences matter to comprehension. Thus, optimally, understanding of health-related texts amongst the end-users should be tested, and interventions to aid readers, such as those with relatively low health literacy levels, could be used to improve comprehension of health-related texts. In the absence of sensitive measures of reader characteristics, and when testing of understanding is not possible, the use of end-user evaluations of health-related texts may serve as a useful proxy of tested comprehension. However, looking for text effects, and guidance focusing on text effects, seems less useful given the reported evidence. Consequently, the effectiveness of designing health-related texts with the consideration of the NHS's text writing guidelines is likely to be limited.
Student Modeling in Intelligent Tutoring Systems
After decades of development, Intelligent Tutoring Systems (ITSs) have become a common learning environment for learners of various domains and academic levels. ITSs are computer systems designed to provide instruction and immediate feedback, which is customized to individual students, but without requiring the intervention of human instructors. All ITSs share the same goal: to provide tutorial services that support learning. Since learning is a very complex process, it is not surprising that a range of technologies and methodologies from different fields is employed. Student modeling is a pivotal technique used in ITSs. The model observes student behaviors in the tutor and creates a quantitative representation of student properties of interest necessary to customize instruction, to respond effectively, to engage students' interest and to promote learning. In this dissertation work, I focus on the following aspects of student modeling. Part I: Student Knowledge: Parameter Interpretation. Student modeling is widely used to obtain scientific insights about how people learn. Student models typically produce semantically meaningful parameter estimates, such as how quickly students learn a skill on average. Therefore, it is fundamental that parameter estimates be interpretable and plausible. My work includes automatically generating data-suggested Dirichlet priors for the Bayesian Knowledge Tracing model, in order to obtain more plausible parameter estimates. I also proposed, implemented, and evaluated an approach to generate multiple Dirichlet priors to improve parameter plausibility, accommodating the assumption that there are subsets of skills which students learn similarly. Part II: Student Performance: Student Performance Prediction. Accurately predicting student performance is one of the most desired features for an ITS and a common evaluation for student modeling.
The task, however, is very challenging, particularly in predicting a student's response on an individual problem in the tutor. I analyzed the components of two common student models to determine which aspects provide predictive power in classifying student performance. I found that modeling the student's overall knowledge led to improved predictive accuracy. I also presented an approach, which, rather than assuming students are drawn from a single distribution, modeled multiple distributions of student performances to improve the model's accuracy. Part III: Wheel-spinning: Student Future Failure in Mastery Learning. One drawback of the mastery learning framework is its possibility to leave a student stuck attempting to learn a skill he is unable to master. We refer to this phenomenon of students being given practice with no improvement as wheel-spinning. I analyzed student wheel-spinning across different tutoring systems and estimated the scope of the problem. To investigate the negative consequences of wheel-spinning, I examined the relationships between wheel-spinning and two other constructs of interest about students: efficiency of learning and "gaming the system". In addition, I designed a generic model of wheel-spinning, which uses features easily obtained by most ITSs. The model generalizes well to unknown students, classifying mastery and wheel-spinning problems with high accuracy. When used as a detector, the model can detect wheel-spinning in its early stage with satisfactory precision and recall.
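The Bayesian Knowledge Tracing model named in Part I can be sketched in a few lines, which also shows why its parameters (learn, guess, slip) are "semantically meaningful". This is the standard textbook update, not the dissertation's Dirichlet-prior fitting procedure; the function names and all parameter values below are hypothetical.

```python
def bkt_update(p_known, correct, p_learn, p_guess, p_slip):
    """One standard BKT step: condition on the response, then apply learning."""
    if correct:
        num = p_known * (1.0 - p_slip)
        den = num + (1.0 - p_known) * p_guess
    else:
        num = p_known * p_slip
        den = num + (1.0 - p_known) * (1.0 - p_guess)
    posterior = num / den
    # The student may also acquire the skill during this practice opportunity.
    return posterior + (1.0 - posterior) * p_learn


def trace(responses, p_init=0.3, p_learn=0.1, p_guess=0.2, p_slip=0.1):
    """Return the estimated probability of mastery after each response."""
    p = p_init
    history = []
    for r in responses:
        p = bkt_update(p, r, p_learn, p_guess, p_slip)
        history.append(p)
    return history


# Illustrative response sequence: correct, correct, incorrect, correct.
mastery = trace([True, True, False, True])
```

A run of responses whose mastery estimate never approaches a mastery threshold is the kind of pattern a wheel-spinning detector would look for, although the dissertation's detector uses its own feature set.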
Neural Cognitive Diagnosis for Intelligent Education Systems
Cognitive diagnosis is a fundamental issue in intelligent education, which
aims to discover the proficiency level of students on specific knowledge
concepts. Existing approaches usually model linear interactions in the student
exercising process with manually designed functions (e.g., the logistic
function), which is not sufficient for capturing complex relations between
students and exercises. In this paper, we propose a general Neural Cognitive
Diagnosis (NeuralCD) framework, which incorporates neural networks to learn the
complex exercising interactions and obtain both accurate and interpretable
diagnosis results. Specifically, we project students and exercises to factor
vectors and leverage multiple neural layers for modeling their interactions,
where the monotonicity assumption is applied to ensure the interpretability of
both factors. Furthermore, we propose two implementations of NeuralCD by
specializing the required concepts of each exercise, i.e., the NeuralCDM with a
traditional Q-matrix and the improved NeuralCDM+ exploring the rich text
content. Extensive experimental results on real-world datasets show the
effectiveness of the NeuralCD framework in both accuracy and interpretability.
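The role of the monotonicity assumption can be illustrated with a deliberately tiny model. The sketch below is not the NeuralCDM architecture; it assumes a single layer, a per-concept difficulty vector, and a Q-matrix row masking unrequired concepts, and it enforces monotonicity by clamping weights to be non-negative. All names and numbers are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_correct(proficiency, difficulty, q_row, weights, bias):
    """Probability of a correct answer from per-concept factor vectors."""
    # Only concepts the exercise requires (q_row[k] == 1) contribute.
    gaps = [q * (p - d) for p, d, q in zip(proficiency, difficulty, q_row)]
    # Non-negative weights make the output non-decreasing in every proficiency
    # entry, so a larger proficiency value can be read as "knows the concept better".
    z = sum(max(w, 0.0) * g for w, g in zip(weights, gaps)) + bias
    return sigmoid(z)


# Two concepts; the exercise requires only the first one.
difficulty = [0.6, 0.4]
q_row = [1, 0]
weights = [1.5, 0.8]

low = predict_correct([0.2, 0.5], difficulty, q_row, weights, 0.0)
high = predict_correct([0.9, 0.5], difficulty, q_row, weights, 0.0)
unrelated = predict_correct([0.2, 0.9], difficulty, q_row, weights, 0.0)
```

Raising proficiency on the required concept raises the predicted probability, while changing proficiency on a concept the exercise does not require leaves it unchanged.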
Studies in Analytical Chemistry and Chemical Education. Part 1: Characterization of Complex Organics By Raman Spectroscopy and Gas Chromatography. Part 2: Differential Item Functioning on Multiple-choice General Chemistry Assessments
PART 1: CHARACTERIZATION OF COMPLEX ORGANICS BY RAMAN SPECTROSCOPY AND GAS CHROMATOGRAPHY.
The analytical chemistry component of this thesis focused on instrumentation and methods to address challenges in art conservation, particularly the identification, quantitation, and reactivity of a set of representative varnishes and their degradation products. Methods for characterizing varnishes are of great interest to art conservators, who need them to restore artwork more accurately. A database was created as a means to identify and quantify the composition of aged varnishes. Fourier Transform (FT)-Raman Spectroscopy was used to study common organic acids found in varnishes. The database included nine short-chain carboxylic acids, four di-carboxylic acids, and six medium-to-long-chain fatty acids. Four varnish samples (Linseed Oil, Tung Oil, Dammar, and Mastic) were studied as well. Through visual comparison and fingerprint analysis, components in the Raman Spectral Database were recognized as components of the varnish samples. Singular Value Decomposition (SVD) was conducted to determine how well the database represented the unknown varnish samples. SVD was applied to the 19 standards collected in building the database. To reduce the amount of data, seven singular values were chosen. The seven singular values were then used to model several unknowns: Linseed Oil, Tung Oil, Dammar, and Mastic. The root-mean-square (RMS) errors for the unknowns were 0.08, 0.13, 0.21, and 0.21 Raman Intensity units for Linseed Oil, Tung Oil, Dammar, and Mastic, respectively. When those values are compared with the largest peak in each unknown spectrum, the relative RMS errors are 1.7%, 1.7%, 4.9%, and 6.4%, respectively.
A method based upon Gas Chromatography (GC) was developed to characterize carboxylic acids formed as a result of varnish degradation. In this method, a headspace solid-phase microextraction (SPME) approach was optimized in which a 75 µm carboxen-polydimethylsiloxane (CAR/PDMS) SPME fiber was used to analyze mono carboxylic acids. For quantitative determinations, the injection port was in the splitless mode and held at 250 °C for 1.0 min for the desorption of the analytes from the SPME fiber. After the initial minute, the injector was switched to a 1:100 split ratio. The temperature program consisted of the oven being initially set to a temperature of 30 °C and held for 1 min, and then ramped at 25 °C/min to 200 °C, where the temperature was held for 1 min, thereby resulting in a total run time of 8.80 min. The PFPD was held at 200 °C for the entire run with a 0.5 ms gate delay, and the gate width was set to 20.0 ms. The mono carboxylic acids that were studied were Formic, Acetic, Propanoic, Butyric, Valeric, and Caproic Acid. A linear relationship was observed between the number of carbons in the carboxylic acid and the retention time (y = 0.75x + 1.55, R² = 0.95). Quantitation of Acetic Acid was done by calibration using a first-order regression fit. The model yielded: y = 0.29x + 0.92 (R² = 0.95). Using a second-order model, a better fit was found: y = 0.0025x² - 0.0016x + 5.9 (R² = 0.99).
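The stated run time and the carbon-number trend can be checked with a short calculation. The sketch below recomputes the total run time from the oven program and evaluates the reported linear fit for the six acids; it assumes that the fitted y is a retention time in minutes, which the abstract does not state explicitly, and the function names are illustrative.

```python
def run_time_min(start_c, end_c, ramp_c_per_min, initial_hold_min, final_hold_min):
    """Total oven program time: initial hold, linear ramp, final hold."""
    ramp_min = (end_c - start_c) / ramp_c_per_min
    return initial_hold_min + ramp_min + final_hold_min


def predicted_retention(n_carbons, slope=0.75, intercept=1.55):
    """Reported linear fit y = 0.75x + 1.55 (y assumed to be in minutes)."""
    return slope * n_carbons + intercept


# 30 C held 1 min, ramped at 25 C/min to 200 C, held 1 min.
total = run_time_min(30.0, 200.0, 25.0, 1.0, 1.0)

# Carbon counts of the six mono carboxylic acids studied.
acids = {"Formic": 1, "Acetic": 2, "Propanoic": 3,
         "Butyric": 4, "Valeric": 5, "Caproic": 6}
retention = {name: predicted_retention(n) for name, n in acids.items()}
```

The ramp covers 170 °C at 25 °C/min, or 6.8 min, which with the two 1 min holds reproduces the reported 8.80 min. Under the minutes assumption, every predicted retention time falls inside that window.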
An ageing chamber was designed, fabricated, and tested as a means for better understanding the decomposition of varnishes over time as a function of temperature, humidity, and ultraviolet light. The goal in the development of the ageing chamber was to demonstrate that it may be possible to create Standard Reference Materials (SRMs) artificially that resemble authentically aged varnishes. This is possible because the ageing chamber is directly incorporated into a GC oven, where temperature, UV radiation, humidity levels, and pollutants can be precisely controlled and carefully monitored. The GC method for carboxylic acids described above was developed to aid in the measurement of carboxylic acid fragments that could arise from the ageing process. Initial results are promising, with the Raman Intensity increasing as the sample aged.
PART 2: DIFFERENTIAL ITEM FUNCTIONING ON MULTIPLE-CHOICE GENERAL CHEMISTRY ASSESSMENTS.
Over the past 30 years, there has been a plethora of studies on gender differences. Some of the earlier studies found that male students typically outperform female students in visual-spatial and quantitative abilities, whereas female students outperform male students in verbal abilities. Later studies reinforced that female students still tended to outperform male students in verbal abilities, while the gap in science and mathematics (the latter as an extension of visual-spatial and quantitative abilities) closed greatly. During this same time, more female students entered the science, technology, engineering, and mathematics (STEM) fields. In 1966, only 25% of all STEM bachelor's degrees were obtained by female students, whereas in 2010 that percentage had grown to 50%. Specifically in chemistry, 49.9% of the bachelor's degrees were earned by women, compared with 18.5% in 1966 [1]. With assessments as a large source of the student's overall course grade, it is imperative that those assessments be valid and unbiased. One way to determine this is to use Differential Item Functioning (DIF). DIF occurs when subgroups of equal ability perform statistically differently on an assessment item, whereas students matched on ability would typically be expected to have an equal probability of answering the item correctly. Because of the difficulty in determining students' ability, the subgroups are often matched on their proficiency or on the score they received on an assessment.
This dissertation focused on four main questions. The first question focused on identifying items that exhibited DIF. The second question was to determine if DIF was real, i.e. did it persist no matter the set of students or the matching criteria used? The third question focused on determining the causes of DIF by cloning the items by content and construct (format). Lastly, it was hypothesized that one reason DIF occurs lies in the students' problem-solving processes, which were examined for the use of incorrect heuristics.
Data for the first part of the study was collected from two American Chemical Society-Examinations Institute (ACS-EI) trial tests (Form A and Form B) that were given to students who had completed one term of general chemistry. This data was analyzed using the Mantel-Haenszel statistic to determine which items exhibited possible DIF. Along with the Mantel-Haenszel statistic, a two-stage DIF analysis [2] was conducted. Out of the 140 items, 33 exhibited DIF. On Form A there were 14 items which exhibited DIF, seven that favored male students and seven that favored female students. On Form B there were 19 items which exhibited DIF, 11 that favored female students and eight that favored male students. Those items that exhibited the highest probability of DIF were cloned and included on hourly examinations. These items were examined for DIF persistence against both stages of the two-stage analysis and other relevant measures of proficiency. As more results were collected, patterns emerged for persistent DIF items. On the 24 hourly examinations that were included in this analysis, there were a total of 687 items: 33 (5%) had a significant value using the Mantel-Haenszel statistic, thereby exhibiting persistent DIF. Of those 33 items, 15 were flagged with persistent DIF that favored female students and 18 were flagged with persistent DIF that favored male students. On the three standardized examinations, there were a total of 140 items; 19 (14%) had a significant value using the Mantel-Haenszel statistic, thereby exhibiting persistent DIF. Of those 19 items, two of the items that were flagged with persistent DIF favored female students and 17 of the items that were flagged with persistent DIF favored male students.
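As a rough illustration of how the Mantel-Haenszel statistic flags an item, the sketch below computes the common odds ratio and the continuity-corrected chi-square across strata of students matched on score. It follows the standard textbook formulas rather than the dissertation's exact procedure (the two-stage refinement is not shown), and the choice of reference and focal group, the function name, and all counts are made up for illustration.

```python
def mantel_haenszel(strata):
    """strata: list of (ref_correct, ref_incorrect, focal_correct, focal_incorrect),
    one tuple per matched score level. Returns (common odds ratio, chi-square).
    Raises ZeroDivisionError for degenerate data with nothing to compare."""
    num = den = 0.0
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        num += a * d / n
        den += b * c / n
        n_ref, n_foc = a + b, c + d
        m_right, m_wrong = a + c, b + d
        observed += a
        # Expected reference-group correct count if the item shows no DIF.
        expected += n_ref * m_right / n
        variance += n_ref * n_foc * m_right * m_wrong / (n * n * (n - 1))
    odds_ratio = num / den
    chi_square = (abs(observed - expected) - 0.5) ** 2 / variance
    return odds_ratio, chi_square


# Hypothetical item where both groups perform identically within each score level.
fair_or, fair_chi = mantel_haenszel([(20, 10, 20, 10), (30, 5, 30, 5)])

# Hypothetical item that favors the reference group at every score level.
biased_or, biased_chi = mantel_haenszel([(25, 5, 15, 15), (30, 5, 20, 15)])
```

An odds ratio near 1 indicates that matched students from both groups have similar odds of answering correctly; values well above or below 1, together with a large chi-square, mark the item for closer inspection.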
Along with these items, certain content areas and formats of the items were found to favor one gender. Over six semesters of testing, the content areas that consistently showed DIF that favored male students were measurement (density), greatest/least number of atoms, limiting reagents, ideal gas equation, and crystal structures; the content areas that favored female students were nomenclature and molecular orbital theory. The formats that tended to favor male students were visual-spatial, reasoning, and computation; the format that favored female students was specific chemical knowledge. By cloning these items, it was found that some of the possible causes of persistent DIF for certain items were the content and/or the format.
Lastly, semi-structured interviews were conducted, and it was found that for seven items a possible reason for DIF was one subgroup's use of an incorrect heuristic. These items were in the specific content areas of measurement (density), greatest/least number of atoms, stoichiometry-general, and crystal structures. Additionally, the format inclusions of visual-spatial, reasoning, and computation for these items could also be contributing factors to the observed results.
References
1. S&E Degrees: 1966-2010: National Center for Science and Engineering Statistics. http://www.nsf.gov/statistics/nsf11316/content.cfm?pub_id=4062&id=2 (accessed May 26).
2. Zenisky, A. L.; Hambleton, R. K., Detection of Differential Item Functioning in Large-Scale State Assessments: A Study Evaluating a Two-Stage Approach. Educational and Psychological Measurement 2003a, 63 (1), 51-64
I've (Urn)ed This: An Application and Criterion-based Evaluation of the Urnings Algorithm
There is increased interest in personalized learning and making e-learning environments more adaptable. Some e-learning systems may use an Item Response Theory (IRT)-based assessment system. An important distinction between assessment and learning contexts is that learner proficiency is expected to remain constant across an assessment, while it is expected to change over time in a learning context. Constant learner proficiency during an assessment enables conventional approaches to estimating person and item parameters using IRT. These IRT-based systems could be abandoned for alternative approaches to modeling learners and system learning content, but assessments may provide more functions than adapting learning material to students. Thus, there is the question, how can e-learning systems with IRT-based assessment components more dynamically adapt their learning content? Is there a solution that leverages IRT for adapting the learning content of the system? A promising solution is the Urnings algorithm. Like other candidate algorithms, it is computationally light, but this algorithm has mechanisms for preventing variance inflation and is suitable for e-learning contexts. It also provides a measure of uncertainty around estimates. It has been studied both through simulations and applications to e-learning systems. Results are promising; however, there has not been an application of the Urnings algorithm to an e-learning context where there are conventionally estimated person parameters to compare the algorithm estimates to. This study addresses this gap by applying the Urnings algorithm to a K-8 reading and mathematics learning platform. In data from this platform, we have person parameter estimates across academic years from an in-system diagnostic assessment. Results from this study will help industry researchers understand the feasibility of the Urnings algorithm for large e-learning systems with IRT-based assessment components.
Psychometrics in Practice at RCEC
A broad range of topics is dealt with in this volume: from combining the psychometric generalizability and item response theories to the ideas for an integrated formative use of data-driven decision making, assessment for learning and diagnostic testing. A number of chapters pay attention to computerized (adaptive) and classification testing. Other chapters treat the quality of testing in a general sense, but for topics like maintaining standards or the testing of writing ability, the quality of testing is dealt with more specifically.
All authors are connected to RCEC as researchers. They present one of their current research topics and provide some insight into the focus of RCEC. The topics were selected and edited so that the book is of special interest to educational researchers, psychometricians and practitioners in educational assessment.
Investigating the Construct of Topical Knowledge in a Scenario-Based Assessment Designed to Simulate Real-Life Second Language Use
The vast development of digital technology and the widespread use of social network platforms have reshaped how we live in the world. For L2 learners to maximally utilize their language proficiency to function effectively as members of modern society, they need not only the necessary L2 knowledge, skills, and abilities (KSAs) but also essential topical knowledge. While many researchers believe that topical knowledge should be viewed as an integral component of L2 communicative competence, the role of topical knowledge has not always been accounted for in an assessment context due to the difficulty of operationalizing the construct.
Scenario-based assessment, an innovative, technology-based assessment approach, allows great affordances for expanding the measured constructs of an assessment. It is designed expressly for learners to demonstrate their KSAs in a context that simulates real-life language use. Through the utilization of a sequence of thematically-related tasks, along with simulated character interaction, scenario-based assessment offers opportunities to examine L2 learners' communicative competence in a purposeful, interactive, and contextually meaningful manner.
In this study, a scenario-based language assessment (SBLA) was developed to measure high-intermediate L2 learners' topical knowledge and their L2 KSAs as part of the broadened construct of L2 communicative competence. To fulfill the scenario goal, learners were required to demonstrate their listening, reading, and writing abilities to build and share knowledge. In addition, learners' prior topical knowledge was measured and their topical learning was tracked using the same set of topical knowledge items.
A total of 118 adult EFL learners participated in the study. The results showed that the SBLA served as an appropriate measure of high-intermediate learners' L2 proficiency. The topical knowledge items were found to function appropriately, supporting the use of the SBLA to measure topical knowledge as part of the broadened construct of communicative competence. In addition, most learners exhibited substantial topical learning over the course of the SBLA, suggesting that with proper contextualization, learning can be facilitated within an assessment. In sum, this study demonstrated the potential value of scenario-based assessment as an approach to measure complex constructs of communicative language competence in an L2 context.