Factors influencing the performance of the Mantel-Haenszel procedure in identifying differential item functioning.
The Mantel-Haenszel (MH) procedure has emerged as one of the methods of choice for identifying test items that exhibit differential item functioning (DIF). Although there has been considerable research examining its performance in this context, important gaps remain in the knowledge base for effectively applying the procedure. This investigation attempts to fill those gaps with the results of five simulation studies. The first study examines the utility of the two-step procedure recommended by Holland and Thayer, in which the matching criterion used in the second step is refined by removing items identified in the first step. The results showed that the two-step procedure is associated with a reduction in the Type II error rate. The second study examined the capability of the MH procedure to identify uniform DIF. The statistic was used to identify simulated DIF in items with varying levels of difficulty and discrimination and with differing between-group differences in difficulty. The results indicated that, when the difference in difficulty was held constant, poorly discriminating items and very difficult items were less likely to be identified by the procedure. The third study considered the effects of sample size. Although the MH procedure has been repeatedly recommended for use with small samples, the results suggest that samples below 200 per group may be inadequate; performance with larger samples was satisfactory and improved as sample size increased. The fourth study examines the effects of score group width on the statistic. Holland and Thayer recommended that n + 1 score groups be used for matching (where n is the number of items); various authors have since suggested that there may be utility in using fewer (wider) score groups. It was shown that this variation on the MH procedure could dramatically inflate Type I error rates. The final study examined a simple variation on the MH statistic that may allow it to identify non-uniform DIF. The MH statistic's inability to identify certain types of non-uniform DIF items has been noted as a major shortcoming. Use of the variation resulted in identification of many of the simulated non-uniform DIF items with little or no increase in the Type I error rate.
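For orientation, the MH statistic compares the odds of a correct response for reference- and focal-group examinees matched on total score. In a standard formulation (the paper's own notation may differ): with examinees sorted into K matched score groups, where A_k and B_k are the numbers of reference-group examinees answering the item correctly and incorrectly in group k, C_k and D_k are the corresponding focal-group counts, and T_k is the group total, the common odds ratio is estimated as

    \hat{\alpha}_{\mathrm{MH}} = \frac{\sum_{k=1}^{K} A_k D_k / T_k}{\sum_{k=1}^{K} B_k C_k / T_k}

and is often reported on the ETS delta scale as MH D-DIF = -2.35 \ln(\hat{\alpha}_{\mathrm{MH}}), with values near zero indicating negligible DIF.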
Physician Experiences and Understanding of Genomic Sequencing in Oncology
The amount of information produced by genomic sequencing is vast, technically complicated, and can be difficult to interpret. Appropriately tailoring genomic information for non-geneticists is an essential next step in the clinical use of genomic sequencing. To initiate development of a framework for genomic results communication, we conducted eighteen qualitative interviews with oncologists who had referred adult cancer patients to a matched tumor-normal tissue genomic sequencing study. In our qualitative analysis, we found varied levels of clinician knowledge relating to sequencing technology, the scope of the tumor genomic sequencing study, and incidental germline findings. Clinicians expressed a perceived need for more genetics education. Additionally, they had a variety of suggestions for improving results reports and possible resources to aid in results interpretation. Most clinicians felt genetic counselors were needed when incidental germline findings were identified. Our research suggests that more consistent genetics education is imperative in ensuring the proper utilization of genomic sequencing in cancer care. Clinician suggestions for results interpretation resources and results report modifications could be used to improve communication. Clinicians' perceived need to involve genetic counselors when incidental germline findings were found suggests genetic specialists could play a critical role in ensuring patients receive appropriate follow-up.
Peer Reviewed. https://deepblue.lib.umich.edu/bitstream/2027.42/147187/1/jgc40187.pd
The Impact of Examinee Performance Information on Judges' Cut Scores in Modified Angoff Standard-Setting Exercises
Educational Measurement: Issues and Practice, Vol. 33, No. 1, pp. 15–22.
This research evaluated the impact of a common modification to Angoff standard-setting exercises: the provision of examinee performance data. Data from 18 independent standard-setting panels across three different medical licensing examinations were examined to investigate whether and how the provision of performance information impacted judgments and the resulting cut scores. Results varied by panel but in general indicated that both the variability among the panelists and the resulting cut scores were affected by the data. After the review of performance data, panelist variability generally decreased. In addition, for all panels and examinations, pre- and post-data cut scores were significantly different. Investigation of the practical significance of the findings indicated that nontrivial fail-rate changes were associated with the cut score changes for a majority of standard-setting exercises. This study is the first to provide a large-scale, systematic evaluation of the impact of a common standard-setting practice, and the results can provide practitioners with insight into how the practice influences panelist variability and resulting cut scores.
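For context, a standard way a modified Angoff cut score is computed (not necessarily these panels' exact procedure): each of J panelists estimates, for every item i of n, the probability p_{ij} that a minimally competent examinee would answer it correctly, and the cut score is the panelist average of the summed estimates,

    c = \frac{1}{J} \sum_{j=1}^{J} \sum_{i=1}^{n} p_{ij}.

Performance data give panelists empirical item difficulties against which to recalibrate their p_{ij} judgments, which is why both panelist variability and c itself can shift after the data review.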
Evaluation of missing data in an assessment of professional behaviors
BACKGROUND: The National Board of Medical Examiners is currently developing the Assessment of Professional Behaviors, a multisource feedback (MSF) tool intended for formative use with medical students and residents. This study investigated whether missing responses on this tool can be considered random; evidence that missing values are not random would suggest response bias, a significant threat to score validity.
METHOD: Correlational analyses of pilot data (N = 2,149) investigated whether missing values were systematically related to global evaluations of observees.
RESULTS: The percentage of missing items was correlated with global evaluations of observees; observers answered more items for preferred observees compared with nonpreferred observees.
CONCLUSIONS: Missing responses on this MSF tool seem to be nonrandom and are instead systematically related to global perceptions of observees. Further research is needed to determine whether modifications to the items, the instructions, or other components of the assessment process can reduce this effect.
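A minimal sketch of the type of correlational analysis described, using simulated data in place of the pilot sample (the rating scale, number of items, and missingness mechanism below are illustrative assumptions, not the study's):

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    n_pairs = 500  # observer-observee pairs (the pilot had N = 2,149)

    # Hypothetical global evaluation of each observee on a 1-5 scale.
    global_eval = rng.integers(1, 6, n_pairs)

    # Simulate nonrandom missingness: less-preferred observees get more blanks.
    items = rng.integers(1, 6, (n_pairs, 15)).astype(float)
    p_missing = 0.35 - 0.05 * (global_eval - 1)  # higher rating -> fewer blanks
    items[rng.random((n_pairs, 15)) < p_missing[:, None]] = np.nan

    # Percentage of missing items per pair, correlated with the global rating.
    pct_missing = np.isnan(items).mean(axis=1) * 100
    r, p = pearsonr(pct_missing, global_eval)
    print(f"r = {r:.2f}, p = {p:.3g}")  # a negative r mirrors the reported pattern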
Using natural language processing to predict item response times and improve test construction
In this article, it is shown how item text can be represented by (a) 113 features quantifying the text's linguistic characteristics, (b) 16 measures of the extent to which an information-retrieval-based automatic question-answering system finds an item challenging, and (c) dense word representations (word embeddings). Using a random forests algorithm, these data then are used to train a prediction model for item response times, and predicted response times then are used to assemble test forms. Using empirical data from the United States Medical Licensing Examination, we show that timing demands are more consistent across these specially assembled forms than across forms comprising randomly selected items. Because an exam's timing conditions affect examinee performance, this result has implications for exam fairness whenever examinees are compared with each other or against a common standard.
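A schematic sketch of the modeling pipeline described above, heavily simplified: the actual work uses 113 linguistic features, 16 information-retrieval-based measures, and word embeddings, whereas this toy version substitutes three surface features, and the items and timings are made up for illustration:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def featurize(item_text: str) -> list[float]:
        """Toy stand-in for the paper's feature extraction."""
        words = item_text.split()
        return [
            float(len(words)),                                 # length in words
            sum(len(w) for w in words) / max(len(words), 1),   # mean word length
            float(item_text.count(",")),                       # clause-complexity proxy
        ]

    # Made-up items and mean response times (seconds), for illustration only.
    items = [
        "A 45-year-old man presents with acute chest pain radiating to the jaw.",
        "Which enzyme catalyzes the rate-limiting step of glycolysis?",
        "A 3-week-old infant has projectile vomiting, and labs show alkalosis.",
    ]
    observed_seconds = np.array([72.0, 41.0, 68.0])

    X = np.array([featurize(t) for t in items])
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X, observed_seconds)

    # Predicted times for candidate items can then be summed per form so that
    # assembled forms carry comparable total timing demands.
    print(model.predict(X))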
Collecting validity evidence for an assessment of professionalism: findings from think-aloud interviews
BACKGROUND: This study investigated whether participants' subjective reports of how they assigned ratings on a multisource feedback instrument provide evidence to support interpreting the resulting scores as objective, accurate measures of professional behavior.
METHOD: Twenty-six participants completed think-aloud interviews while rating students, residents, or faculty members they had worked with previously. The items rated included 15 behavioral items and one global item.
RESULTS: Participants referred to generalized behaviors and global impressions six times as often as specific behaviors, rated observees in the absence of information necessary to do so, relied on indirect evidence about performance, and varied in how they interpreted items.
CONCLUSIONS: Behavioral change becomes difficult to address if it is unclear what behaviors raters actually considered when providing feedback. These findings highlight the importance of explicitly stating and empirically investigating the assumptions that underlie the use of an observational assessment tool.