
    Predicting the difficulty of multiple choice questions in a high-stakes medical exam

    Predicting the construct-relevant difficulty of Multiple-Choice Questions (MCQs) has the potential to reduce cost while maintaining the quality of high-stakes exams. In this paper, we propose a method for estimating the difficulty of MCQs from a high-stakes medical exam, where all questions were deliberately written to a common reading level. To accomplish this, we extract a large number of linguistic features and embedding types, as well as features quantifying the difficulty of the items for an automatic question-answering system. The results show that the proposed approach outperforms various baselines with a statistically significant difference. Best results were achieved when using the full feature set, where embeddings had the highest predictive power, followed by linguistic features. An ablation study of the various types of linguistic features suggested that information from all levels of linguistic processing contributes to predicting item difficulty, with features related to semantic ambiguity and the psycholinguistic properties of words having a slightly higher importance. Owing to its generic nature, the presented approach has the potential to generalize to other exams containing MCQs.
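
    The abstract names the feature families but not the model. The following is a minimal sketch of this kind of text-to-difficulty regression, assuming a simple TF-IDF plus ridge pipeline and synthetic items in place of the paper's linguistic, embedding, and question-answering features; it only illustrates the setup and the comparison against a mean-prediction baseline.

```python
# Illustrative sketch only: regress observed item difficulty on simple text
# features of the MCQ stem and compare against a mean-prediction baseline.
# The stems and difficulty values below are synthetic placeholders.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

stems = [
    "Which enzyme is deficient in phenylketonuria?",
    "A 45-year-old presents with chest pain radiating to the left arm...",
    "What is the first-line treatment for uncomplicated hypertension?",
    "Which cranial nerve innervates the lateral rectus muscle?",
] * 25  # repeated so there are enough rows for a train/test split
difficulty = np.random.RandomState(0).uniform(0.2, 0.9, size=len(stems))

X_train, X_test, y_train, y_test = train_test_split(
    stems, difficulty, test_size=0.25, random_state=0
)

# Lexical n-grams stand in for the richer linguistic/embedding features.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), Ridge(alpha=1.0))
model.fit(X_train, y_train)

rmse_model = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
rmse_baseline = mean_squared_error(y_test, np.full_like(y_test, y_train.mean())) ** 0.5
print(f"model RMSE: {rmse_model:.3f}  baseline RMSE: {rmse_baseline:.3f}")
```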

    The use of situational judgment tests in admission to higher education: validity and coaching effects

    Medical and dental education in Europe faces enormous challenges, and admission is one of them. The number of candidates for both programmes often exceeds the available places. This dissertation provides a first look at the use of a fairly new selection tool in admission procedures for medical and dental education: the situational judgment test (SJT). First, a general introduction and overview of the literature on SJTs is given. Next, the setting of the dissertation is described: the admission exam for medical and dental studies in Flanders. The selection of medical and dental students in Flanders differs from admission procedures in other countries. First, the Flemish admission exam is exactly the same for both medical and dental students. However, the first study shows that students with lower scores on the cognitive tests tend to choose dental education. This finding raises questions about using the same admission exam for two different majors. Second, the Flemish admission exam uses an SJT as a non-cognitive predictor. SJTs have proven their value in the context of job selection, and studies in both medical and dental education show that they can be valid predictors of both academic and job performance. Over time (from year 1 through year 5/7) the validity of the SJT for predicting academic performance (GPA) slightly increased, and there was evidence of incremental validity of the SJT over cognitive ability. The SJT was also a predictor of supervisory-rated job performance nine years later. In the last study, the technique of propensity scoring is used to study the coaching effects of both cognitive and non-cognitive tests; this technique allows treatment-control comparisons among individuals with approximately equal probabilities of having received the treatment. Results show that the people who sought coaching were those with the lowest scores on the pretest. Coaching effects were largest for the SJT (d = .50), followed by the knowledge tests (d = .45) and the general mental ability test (d = .34). SJTs can be valuable additions to cognitive tests in an admission procedure for higher education. However, the coaching effects found raise questions about using the same SJT on a long-term basis.
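
    As a rough illustration of the propensity-scoring technique used in the last study, the sketch below simulates pretest, coaching, and posttest scores, estimates each candidate's propensity to seek coaching from the pretest, matches coached to uncoached candidates on that propensity, and computes a Cohen's d on the posttest. The variables, matching scheme, and effect size are assumptions, not the dissertation's actual data or procedure.

```python
# Minimal propensity-score sketch with simulated data (not the dissertation's code).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 400
pretest = rng.normal(50, 10, n)
# Lower pretest scores make seeking coaching more likely, as in the study.
coached = (rng.random(n) < 1 / (1 + np.exp((pretest - 50) / 10))).astype(int)
posttest = pretest + 5 * coached + rng.normal(0, 8, n)  # hypothetical coaching gain

# Propensity of receiving coaching given the pretest score.
ps = LogisticRegression().fit(pretest.reshape(-1, 1), coached).predict_proba(
    pretest.reshape(-1, 1)
)[:, 1]

# Greedy 1:1 nearest-neighbour matching of coached to uncoached candidates.
treated = np.where(coached == 1)[0]
control = np.where(coached == 0)[0]
matched_control = [control[np.argmin(np.abs(ps[control] - ps[t]))] for t in treated]

diff = posttest[treated] - posttest[matched_control]
pooled_sd = np.sqrt((posttest[treated].var(ddof=1) + posttest[matched_control].var(ddof=1)) / 2)
print(f"Cohen's d after matching: {diff.mean() / pooled_sd:.2f}")
```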

    Introducing a framework to assess newly created questions with Natural Language Processing

    Statistical models such as those derived from Item Response Theory (IRT) enable the assessment of students on a specific subject, which can be useful for several purposes (e.g., learning path customization, drop-out prediction). However, the questions have to be assessed as well and, although it is possible with IRT to estimate the characteristics of questions that have already been answered by several students, this technique cannot be used on newly generated questions. In this paper, we propose a framework to train and evaluate models for estimating the difficulty and discrimination of newly created Multiple Choice Questions by extracting meaningful features from the text of the question and of the possible choices. We implement one model using this framework and test it on a real-world dataset provided by CloudAcademy, showing that it outperforms previously proposed models, reducing the RMSE by 6.7% for difficulty estimation and by 10.8% for discrimination estimation. We also present the results of an ablation study performed to support our feature choice and to show the effects of different characteristics of the questions' text on difficulty and discrimination. Comment: Accepted at the International Conference on Artificial Intelligence in Education.
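
    For context on the quantities being estimated, the short sketch below shows the two-parameter logistic (2PL) item response function in which an item's difficulty and discrimination are defined; the parameter values are illustrative only and unrelated to the CloudAcademy dataset or the proposed framework.

```python
# The 2PL item response function: the probability that a student with ability
# theta answers an item correctly, given discrimination a and difficulty b.
import numpy as np

def p_correct(theta, a, b):
    """P(correct) = 1 / (1 + exp(-a * (theta - b))) under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

abilities = np.linspace(-3, 3, 7)
print(p_correct(abilities, a=1.2, b=0.5))  # steep transition near theta = 0.5
print(p_correct(abilities, a=0.3, b=0.5))  # low discrimination: flatter curve
```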

    Do the Guideline Violations Influence Test Difficulty of High-stake Test?: An Investigation on University Entrance Examination in Turkey

    Multiple-choice (MC) items are commonly used in high-stakes tests, so each item of such tests should be meticulously constructed to increase the accuracy of decisions based on test results. Haladyna and his colleagues (2002) set out item-writing guidelines for constructing high-quality MC items in order to increase test reliability and validity. However, violating these guidelines is very common in high-stakes tests. This study addressed two of these guidelines: "AVOID the complex MC (Type K) format" and "Word the stem positively, avoid negatives such as NOT or EXCEPT". After reviewing a total of 2336 MC items extracted from the university entrance examination (UEE) in Turkey administered over the past 15 years, we investigated the impact of violations of these item-writing guidelines on test difficulty using multiple regression analysis. The findings showed that test difficulty did not change statistically when MC items with negative stems were used in a test. They indicated, however, that the use of complex MC items has a statistically significant negative effect on test difficulty. The paper concludes with the possible consequences of eliminating items that violate the item-writing guidelines from the test, and with directions for future studies.
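
    A minimal sketch of the kind of regression described, assuming simulated items in place of the UEE item bank: item difficulty is regressed on indicator variables for the two guideline violations, and the coefficient p-values show which violation has a statistically detectable effect. Variable names, effect sizes, and the data are placeholders.

```python
# Hedged sketch: regress item difficulty on indicators for the two violations.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_items = 500
df = pd.DataFrame({
    "complex_format": rng.integers(0, 2, n_items),  # Type K / complex MC item
    "negative_stem": rng.integers(0, 2, n_items),   # stem contains NOT / EXCEPT
})
# Simulated proportion-correct: complex items harder, negative stems no effect.
df["difficulty"] = (
    0.6 - 0.10 * df["complex_format"] + 0.0 * df["negative_stem"]
    + rng.normal(0, 0.1, n_items)
).clip(0, 1)

model = smf.ols("difficulty ~ complex_format + negative_stem", data=df).fit()
print(model.params)   # estimated effect of each violation
print(model.pvalues)  # which effects are statistically significant
```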

    Operationalizing item difficulty modeling in a medical certification context

    This research study modeled item difficulty in general pediatric test items using content, cognitive complexity, linguistic, and text-based variables. The research first presents an introduction that addresses the current shortcomings found in item development and alternative methods, such as principled assessment design, which aim to address those shortcomings. Next, a review of the literature is presented which addresses traditional item development, item development using cognitive demands, item difficulty modeling, and the Coh-Metrix (Graesser et al., 2004) linguistic tool. The methods section outlines how content, cognitive, linguistic, and text-based variables were defined and coded using both subject matter experts (SMEs) and the Coh-Metrix web-based software. The methods section goes on to outline the backward multiple regression analysis conducted to determine the proportion of variance in Rasch item difficulty accounted for by the defined variables, and a study that can be used to demonstrate the impact of the current findings on examinee ability calibration. The results of the study demonstrate an operationalizable process for determining item difficulty variables. The results also show that Rasch item difficulty was significantly predicted by five item difficulty variables, which together accounted for .324 of the variance in Rasch item difficulty. The research concludes with a discussion of the findings, including steps that can be taken in future studies to build upon the current research and results.
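
    The sketch below illustrates a generic backward-elimination multiple regression of the kind described, with simulated predictors standing in for the SME-coded and Coh-Metrix variables; the predictor names, alpha threshold, and data are assumptions, not the study's.

```python
# Illustrative backward-elimination regression on simulated predictors of
# Rasch item difficulty (variable names and data are hypothetical).
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 300
X = pd.DataFrame(rng.normal(size=(n, 5)),
                 columns=["cognitive_level", "word_count", "word_frequency",
                          "referential_cohesion", "syntactic_complexity"])
y = 0.5 * X["cognitive_level"] + 0.3 * X["word_count"] + rng.normal(0, 1, n)

def backward_select(X, y, alpha=0.05):
    """Drop the least significant predictor until all p-values are <= alpha."""
    cols = list(X.columns)
    while True:
        fit = sm.OLS(y, sm.add_constant(X[cols])).fit()
        pvals = fit.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] <= alpha or len(cols) == 1:
            return fit, cols
        cols.remove(worst)

fit, kept = backward_select(X, y)
print("retained predictors:", kept)
print("R-squared:", round(fit.rsquared, 3))
```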

    Psychometrics in Practice at RCEC

    A broad range of topics is dealt with in this volume: from combining the psychometric generalizability and item response theories to ideas for an integrated formative use of data-driven decision making, assessment for learning, and diagnostic testing. A number of chapters pay attention to computerized (adaptive) and classification testing. Other chapters treat the quality of testing in a general sense, but for topics like maintaining standards or the testing of writing ability, the quality of testing is dealt with more specifically. All authors are connected to RCEC as researchers. They present one of their current research topics and provide some insight into the focus of RCEC. The selection of topics and the editing are intended to make the book of special interest to educational researchers, psychometricians, and practitioners in educational assessment.

    Factors that Affect Reattempting the Emergency Medical Technician Cognitive Certification Examination

    Certification as an Emergency Medical Technician (EMT) is often the entry point for firefighting careers and is a prerequisite for entering Advanced EMT or paramedic programs. EMT candidates in most of the United States must pass the National Registry of EMTs cognitive examination (NREMT-C) to be eligible for state licensure. Many candidates who fail their first NREMT-C attempt never take even one of the five additional possible attempts within the specified two-year time frame. Using binary logistic regression with de-identified existing NREMT test data from 2007 through 2012, this research attempted to develop a model showing the relative contribution of previous NREMT-C score, demographic factors, pay status, employment status, and school accreditation in predicting candidates' likelihood of retesting. A literature review suggested that these factors influence candidates' success on their first exam attempt; however, no literature has examined whether these factors predict a candidate's decision to make at least one additional examination attempt. Results showed that the theta score from the prior attempt was a strong predictor of reattempting examinations two through six. Female gender was negatively associated with attempting examinations two through five. Younger candidates were more likely to attempt examinations two through four, whereas at attempt five the odds that an older candidate would try again were slightly higher. Military candidates were much more likely to persist through examination attempt three; however, this trend reversed at attempts four and five, when they were less likely to reattempt. Having someone else pay for the prior exam increased candidates' odds of taking the second and third examination. All race and ethnic categories (except Hispanic) were weakly associated with the odds of taking a second examination but not with any subsequent attempts. Students who attended schools associated with accredited paramedic programs were slightly more likely to persist through exams two through four, as were individuals with more education. While this analysis identifies some factors related to examination persistence, it produced weak models, suggesting that many more individual variables are associated with the decision to persist after failing the NREMT-C EMT examination than those examined here.
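
    As a simplified illustration of the analysis, the sketch below fits a binary logistic regression of a retest indicator on a few of the predictors mentioned and reports odds ratios; the data are simulated and the variable names are hypothetical, not the NREMT dataset.

```python
# Simplified sketch: logistic regression of "took a second attempt" on prior
# theta score and candidate characteristics, reported as odds ratios.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 2000
df = pd.DataFrame({
    "prior_theta": rng.normal(-0.5, 1.0, n),  # score on the failed attempt
    "female": rng.integers(0, 2, n),
    "age": rng.normal(27, 7, n),
    "military": rng.integers(0, 2, n),
})
logit = (0.8 * df["prior_theta"] - 0.3 * df["female"]
         - 0.02 * df["age"] + 0.5 * df["military"])
df["retested"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = smf.logit("retested ~ prior_theta + female + age + military", data=df).fit(disp=0)
print(np.exp(model.params))  # odds ratios: >1 raises, <1 lowers the odds of retesting
```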