
    Conditional reliability of admissions interview ratings: extreme ratings are the most informative

    Admissions interviews are unreliable and have poor predictive validity, yet are the sole measures of non-cognitive skills used by most medical school admissions departments. The low reliability may be due in part to variation in conditional reliability across the rating scale. Objectives: To describe an empirically derived estimate of conditional reliability and use it to improve the predictive validity of interview ratings. Methods: A set of medical school interview ratings was compared to a Monte Carlo simulated set to estimate conditional reliability controlling for range restriction, response scale bias and other artefacts. This estimate was used as a weighting function to improve the predictive validity of a second set of interview ratings for predicting non-cognitive measures (USMLE Step II residuals from Step I scores). Results: Compared with the simulated set, both observed sets showed more reliability at low and high rating levels than at moderate levels. Raw interview scores did not predict USMLE Step II scores after controlling for Step I performance (additional r² = 0.001, not significant). Weighting interview ratings by estimated conditional reliability improved predictive validity (additional r² = 0.121, P < 0.01). Conclusions: Conditional reliability is important for understanding the psychometric properties of subjective rating scales. Weighting these measures during the admissions process would improve admissions decisions.
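
    A minimal sketch of the weighting idea described above, assuming hypothetical data: each interview rating is multiplied by an estimated conditional reliability at its own scale point (highest at the extremes, as the study reports), and the incremental R² over Step I alone is then compared for raw versus weighted ratings. The data, the reliability curve, and the regression helper are illustrative assumptions, not the study's materials, so the printed values will not reproduce the reported results.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
step1 = rng.normal(220, 15, n)                    # hypothetical Step I scores
interview = rng.integers(1, 8, n).astype(float)   # hypothetical 1-7 interview ratings
step2 = 0.8 * step1 + rng.normal(0, 10, n)        # hypothetical Step II scores

# Illustrative conditional-reliability curve: highest at the scale extremes,
# lowest near the middle (the qualitative pattern reported in the study).
def conditional_reliability(rating, lo=1.0, hi=7.0):
    midpoint = (lo + hi) / 2
    return 0.4 + 0.5 * np.abs(rating - midpoint) / (hi - midpoint)

weighted = interview * conditional_reliability(interview)

def r_squared(y, *predictors):
    # R^2 from an ordinary least-squares fit with an intercept
    X = np.column_stack([np.ones_like(y), *predictors])
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return 1 - resid.var() / y.var()

base = r_squared(step2, step1)
print(f"additional R^2, raw ratings:      {r_squared(step2, step1, interview) - base:.3f}")
print(f"additional R^2, weighted ratings: {r_squared(step2, step1, weighted) - base:.3f}")
```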

    Main and Regional Campus Assessments of Applicants to a Rural Physician Leadership Program: A Generalizability Analysis

    While the selection of qualified applicants often relies, in part, on scores generated from a medical school pre-admission interview (MSPI), the growth of regional medical campuses (RMCs) – many with specialized rural tracks, programs, or missions – has challenged schools to accommodate a wider range of stakeholder input. This study examines the reliabilities of main (urban) and regional (rural) campus interviewers' assessments of applicants to a Rural Physician Leadership Program (RPLP) located in the southeastern United States. Data from RPLP applicants completing MSPIs on two campuses from 2009-2017 (n = 232) were examined in a generalizability analysis. In two separate interviews on each campus (4 total), raters independently evaluated applicants' overall acceptability and likelihood of practicing in a rural area of the state. Results provided campus-specific and combined (composite) estimates of obtained and projected reliabilities. The person-by-campus interaction accounted for 11% and 5% of the respective variance in interviewers' ratings of overall applicant acceptability and likelihood of rural in-state practice, and the reliability of mean scores across the four independent interviews (each with a single, unique rater) was 0.73 and 0.82, respectively. Error variances were higher among main campus interviewers, but scores correlated highly between the two campuses. While broadening the universe of generalization often results in decreased reliability, reliability was enhanced here by the addition of regional (rural) campus interviews. As the RPLP matures, an examination of graduates' actual practice locations should yield insights into the predictive validity of these pre-admissions assessments. More generally, future research may explore the conditions under which increasing the diversity of stakeholder input can be accommodated without concomitant reductions in overall reliability.
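
    A minimal sketch of how a composite reliability across four interviews can be obtained from variance components in a person x (rater : campus) design. The variance components below are hypothetical placeholders, not the study's estimates; the point is only that the relative error for a mean score shrinks the person-by-campus and rater terms by the numbers of campuses and interviews sampled.

```python
# Hypothetical variance components for a person x (rater : campus) design.
var_person        = 0.50   # universe-score (person) variance
var_person_campus = 0.10   # person x campus interaction
var_person_rater  = 0.40   # person x rater-within-campus interaction + residual

def g_coefficient(n_campuses, n_raters_per_campus):
    relative_error = (var_person_campus / n_campuses
                      + var_person_rater / (n_campuses * n_raters_per_campus))
    return var_person / (var_person + relative_error)

print("single interview:               ", round(g_coefficient(1, 1), 2))
print("four interviews (2 per campus): ", round(g_coefficient(2, 2), 2))
```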

    New web-based applications for mechanistic case diagramming

    The goal of mechanistic case diagramming (MCD) is to provide students with a more in-depth understanding of cause-and-effect relationships and basic mechanistic pathways in medicine. This will enable them to better explain how observed clinical findings develop from preceding pathogenic and pathophysiological events. The pedagogic function of MCD is in relating risk factors, disease entities and morphology, signs and symptoms, and test and procedure findings in a specific case scenario with etiologic pathogenic and pathophysiological sequences within a flow diagram. In this paper, we describe the addition of automation and predetermined lists to further develop the original concept of MCD as described by Engelberg in 1992 and Guerrero in 2001. We demonstrate that with these modifications, MCD is effective and efficient in small-group case-based teaching for second-year medical students (ratings of ~3.4 on a 4.0 scale). There was also a significant correlation with other measures of competency, with a 'true' score correlation of 0.54. A traditional calculation of reliability showed promising results (α = 0.47) within a low-stakes, ungraded environment. Further, we have demonstrated MCD's potential for use in independent learning and team-based learning (TBL). Future studies are needed to evaluate MCD's potential for use in medium-stakes assessment or self-paced independent learning and assessment. MCD may be especially relevant in returning students to the application of basic medical science mechanisms in the clinical years.
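
    A short sketch of how a 'true' score correlation of the kind reported above is typically obtained, by correcting the observed correlation for unreliability in both measures. Only α = 0.47 is taken from the text; the observed correlation and the comparison measure's reliability below are assumed values chosen so the corrected result lands near the reported 0.54.

```python
from math import sqrt

alpha_mcd  = 0.47   # reported reliability of the MCD scores
rel_other  = 0.80   # assumed reliability of the comparison competency measure
r_observed = 0.33   # assumed observed correlation between the two measures

# Disattenuation (correction for attenuation) formula
r_true = r_observed / sqrt(alpha_mcd * rel_other)
print(f"'true' score correlation: {r_true:.2f}")   # ~0.54 under these assumptions
```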

    A report on the piloting of a novel computer-based medical case simulation for teaching and formative assessment of diagnostic laboratory testing

    Objectives: Insufficient attention has been given to how information from computer-based clinical case simulations is presented, collected, and scored. Research is needed on how best to design such simulations to acquire valid performance assessment data that can act as useful feedback for educational applications. This report describes a study of a new simulation format with design features aimed at improving both its formative assessment feedback and educational function. Methods: Case simulation software (LabCAPS) was developed to target a highly focused and well-defined measurement goal with a response format that allowed objective scoring. Data from an eight-case computer-based performance assessment administered in a pilot study to 13 second-year medical students were analyzed using classical test theory and generalizability analysis. In addition, a similar analysis was conducted on an administration in a less controlled setting, but with a much larger sample (n = 143), within a clinical course that utilized two random case subsets from a library of 18 cases. Results: Classical test theory case-level item analysis of the pilot assessment yielded an average case discrimination of 0.37, and all eight cases were positively discriminating (range = 0.11–0.56). The classical test theory coefficient alpha and the decision study showed the eight-case performance assessment to have an observed reliability of α = G = 0.70. The decision study further demonstrated that a G = 0.80 could be attained with approximately 3 h and 15 min of testing. The less controlled educational application within a large medical class produced a somewhat lower reliability for eight cases (G = 0.53). Students gave high ratings to the logic of the simulation interface, its educational value, and the fidelity of the tasks. Conclusions: LabCAPS software shows the potential to provide formative assessment of medical students' skill at diagnostic test ordering and to provide valid feedback to learners. The perceived fidelity of the performance tasks and the statistical reliability findings support the validity of using the automated scores for formative assessment and learning. LabCAPS cases appear well designed for use as a scored assignment, for stimulating discussion in small-group educational settings, for self-assessment, and for independent learning. Extension of the more highly controlled pilot assessment study with a larger sample will be needed to confirm its reliability in other assessment applications.
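
    A minimal sketch of the decision-study projection behind the 'approximately 3 h and 15 min' figure, using the Spearman-Brown relation: solve for the per-case generalizability implied by G = 0.70 with eight cases, then project the number of cases needed for G = 0.80. The 14-minutes-per-case figure is an assumption used only to convert case counts into testing time.

```python
g_obs, n_obs, g_target = 0.70, 8, 0.80
minutes_per_case = 14                 # assumed average time per case

# Invert the Spearman-Brown relation G_n = n*g1 / (1 + (n - 1)*g1) for g1,
# then solve it for the n that reaches the target G.
g1 = g_obs / (n_obs - (n_obs - 1) * g_obs)
n_needed = g_target * (1 - g1) / (g1 * (1 - g_target))

print(f"per-case generalizability: {g1:.3f}")
print(f"cases needed for G = 0.80: {n_needed:.1f}")
print(f"projected testing time:    {n_needed * minutes_per_case / 60:.2f} h")
```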

    Using systematically observed clinical encounters (SOCEs) to assess medical students' skills in clinical settings

    George R Bergus (1–3), Jerold C Woodhead (4), Clarence D Kreiter (2,5). Affiliations: 1 Performance Based Assessment Program, Office of Student Affairs and Curriculum; 2 Department of Family Medicine; 3 Department of Psychiatry; 4 Department of Pediatrics; 5 Office of Consultation and Research in Medical Education, Roy J and Lucille A Carver College of Medicine, The University of Iowa, Iowa City, IA, USA.
    Introduction: The Objective Structured Clinical Examination (OSCE) is widely used to assess the clinical performance of medical students. However, concerns related to cost, availability, and validity have led educators to investigate alternatives to the OSCE. Some alternatives involve assessing students while they provide care to patients – the mini-CEX (mini-Clinical Evaluation Exercise) and the Long Case are examples. We investigated the psychometrics of systematically observed clinical encounters (SOCEs), in which physicians are supplemented by trained lay observers, as a means of assessing the clinical performances of medical students. Methods: During the pediatrics clerkship at the University of Iowa, trained lay observers assessed the communication skills of third-year medical students using a communication checklist while the students interviewed and examined pediatric patients. Students then verbally presented their findings to faculty, who assessed students' clinical skills using a standardized form. The reliability of the combined communication and clinical skills scores was calculated using generalizability theory. Results: Fifty-one medical students completed 199 observed patient encounters. The mean combined clinical and communication skills score (out of a maximum 45 points) was 40.8 (standard deviation 3.3). The calculated reliability of the SOCE scores, using generalizability theory, from 10 observed patient encounters was 0.81. Students reported receiving helpful feedback from faculty after 97% of their observed clinical encounters. Conclusion: The SOCE can reliably assess the clinical performances of third-year medical students on their pediatrics clerkship. The SOCE is an attractive addition to the other methods utilizing real patient encounters for assessing the skills of learners. Keywords: performance assessment, clinical skills, medical education.

    Threats to Validity in the Use and Interpretation of Script Concordance Test Scores

    Recent reviews have claimed that the Script Concordance Test (SCT) methodology generally produces reliable and valid assessments of clinical reasoning. We describe three major validity threats not yet considered in prior research. First, the predominant method for aggregate and partial-credit scoring of SCTs introduces logical inconsistencies in the scoring key. Second, reliability studies of SCTs have generally ignored inter-panel, inter-panelist, and test-retest measurement error. Instead, studies have focused on observed levels of coefficient alpha, which is neither an informative index of internal structure nor a comprehensive index of reliability for SCT scores. As such, claims that SCT scores show acceptable reliability are premature. Finally, SCT criteria for item inclusion, in concert with a statistical artifact of its format, cause anchors at the extremes of the scale to have less expected credit than anchors near or at the midpoint. Consequently, SCT scores are likely to reflect construct-irrelevant differences in examinees' response style. This makes the test susceptible to bias against groups that endorse extreme scale anchors more readily; it also makes the test susceptible to score inflation due to coaching. In a re-analysis of existing SCT data, we found that simulating a strategy whereby examinees never endorse extreme scale points resulted in considerable score inflation (d = 1.51), and examinees who simply endorsed the scale midpoint for every item would still have outperformed most examinees who used the scale as intended. Given the severity of these threats, we conclude that aggregate scoring cannot be recommended. Recommendations for revisions of SCT methodology are discussed.
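
    A self-contained, synthetic illustration of the scale artifact described above (this is not the authors' re-analysis; the d = 1.51 figure comes from their real data): under aggregate partial-credit scoring, where each anchor's credit is proportional to the number of panelists endorsing it, extreme anchors of a -2..+2 scale tend to carry less expected credit than inner anchors, which is the mechanism that rewards a never-endorse-the-extremes response style. The panel distributions below, including the assumption that modal panel answers sit away from the extremes more often than not, are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
anchors = np.arange(-2, 3)                     # -2 .. +2 response scale
n_items, n_panel = 200, 15

keys = []
for _ in range(n_items):
    # Modal panel judgments are assumed to fall away from the extremes more
    # often than not (reflecting item-inclusion criteria that favor spread).
    center = rng.choice(anchors, p=[0.10, 0.25, 0.30, 0.25, 0.10])
    panel = np.clip(np.round(rng.normal(center, 1.0, n_panel)), -2, 2)
    counts = np.array([(panel == a).sum() for a in anchors])
    keys.append(counts / counts.max())         # modal anchor gets full credit
keys = np.array(keys)                          # items x anchors credit matrix

expected_credit = keys.mean(axis=0)            # average credit per anchor
for a, c in zip(anchors.tolist(), expected_credit):
    print(f"anchor {a:+d}: expected credit {c:.2f}")
print("extreme anchors mean:", expected_credit[[0, 4]].mean().round(2),
      "| inner anchors mean:", expected_credit[1:4].mean().round(2))
```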

    Examining rater and occasion influences in observational assessments obtained from within the clinical environment

    Background: When ratings of student performance within the clerkship consist of a variable number of ratings per clinical teacher (rater), an important measurement question arises regarding how to combine such ratings to accurately summarize performance. As previous G studies have not estimated the independent influence of the occasion and rater facets in observational ratings within the clinic, this study was designed to provide estimates of these two sources of error. Method: During 2 years of an emergency medicine clerkship at a large Midwestern university, 592 students were evaluated an average of 15.9 times each. Ratings were performed at the end of clinical shifts, and students often received multiple ratings from the same rater. A completely nested G study model (occasion : rater : person) was used to analyze the sampled rating data. Results: The variance component (VC) related to occasion was small relative to the VC associated with rater. The D study clearly demonstrates that having a preceptor rate a student on multiple occasions does not substantially enhance the reliability of a clerkship performance summary score. Conclusions: Although further research is needed, it is clear that case-specific factors do not explain the low correlation between ratings, and that having one or two raters repeatedly rate a student on different occasions/cases is unlikely to yield a reliable mean score. This research suggests that it may be more efficient to have a preceptor rate a student just once. However, when multiple ratings from a single preceptor are available for a student, it is recommended that the mean of the preceptor's ratings be used when calculating the student's overall mean performance score.
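
    A minimal decision-study sketch for the fully nested (occasion : rater : person) design described above, assuming hypothetical variance components chosen only to reflect the qualitative finding that rater variance dwarfs occasion variance; they are not the study's estimates. It shows why adding occasions for the same rater barely moves the reliability of the mean score, while adding raters does.

```python
# Hypothetical variance components for a nested (occasion : rater : person) design.
var_person          = 0.30   # universe-score (person) variance
var_rater_in_person = 0.45   # rater : person
var_occ_in_rater    = 0.10   # occasion : rater : person (includes residual)

def g_coef(n_raters, n_occasions_per_rater):
    error = (var_rater_in_person / n_raters
             + var_occ_in_rater / (n_raters * n_occasions_per_rater))
    return var_person / (var_person + error)

print("1 rater,  1 occasion per rater: ", round(g_coef(1, 1), 2))
print("1 rater,  5 occasions per rater:", round(g_coef(1, 5), 2))   # little gain
print("5 raters, 1 occasion per rater: ", round(g_coef(5, 1), 2))   # large gain
```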