    A Simulation Study to Assess the Effect of the Number of Response Categories on the Power of Ordinal Logistic Regression for Differential Item Functioning Analysis in Rating Scales

    Objective. The present study uses simulated data to determine the optimal number of response categories needed to achieve adequate power in the ordinal logistic regression (OLR) model for differential item functioning (DIF) analysis in psychometric research. Methods. A hypothetical ten-item quality of life scale with three, four, and five response categories was simulated. The power and Type I error rates of the OLR model for detecting uniform DIF were investigated under different combinations of ability distribution (θ), sample size, sample size ratio, and magnitude of uniform DIF across the reference and focal groups. Results. When θ was distributed identically in the reference and focal groups, increasing the number of response categories from three to five increased the power of the OLR model to detect uniform DIF by approximately 8%. The power of OLR was below 0.36 when the ability distributions in the reference and focal groups were highly skewed to the left and to the right, respectively. Conclusions. The clearest conclusion from this research is that the minimum number of response categories for DIF analysis using OLR is five. However, the impact of the number of response categories on DIF detection was smaller than might be expected.
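
    The uniform-DIF comparison described above follows the usual nested ordinal logistic regression setup: a matching-variable-only model versus a model that adds group, compared with a likelihood-ratio test. A minimal sketch of that comparison for a single simulated item is given below, using MASS::polr in R; the variable names, effect size, and cut points are illustrative assumptions, not the study's code or data.

```r
## Illustrative uniform-DIF test for one polytomous item via ordinal logistic
## regression: compare a matching-variable-only model to one adding group.
library(MASS)  # polr()

set.seed(1)
n     <- 500
group <- factor(rep(c("reference", "focal"), each = n / 2),
                levels = c("reference", "focal"))
theta <- rnorm(n)                                      # ability / matching variable
## Five-category item with a small uniform shift for the focal group
latent <- theta + 0.4 * (group == "focal") + rlogis(n)
resp   <- cut(latent, c(-Inf, -1.5, -0.5, 0.5, 1.5, Inf),
              labels = 1:5, ordered_result = TRUE)
dat <- data.frame(resp, theta, group)

m0 <- polr(resp ~ theta,         data = dat, Hess = TRUE)  # no DIF
m1 <- polr(resp ~ theta + group, data = dat, Hess = TRUE)  # uniform DIF

## 1-df likelihood-ratio test; a small p-value flags uniform DIF
pchisq(deviance(m0) - deviance(m1), df = 1, lower.tail = FALSE)
```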

    Using Differential Item Functioning and Anchoring Vignettes to Examine the Fairness of Achievement Motivation Items

    Achievement motivation is a well-documented predictor of a variety of positive student outcomes. However, researchers have also found threats to fairness and measurement scale comparability in motivation items, including group differences in response scale use and response styles. As such, the measurement comparability of achievement motivation items was evaluated before and after using anchoring vignettes to account for the effect of group-specific response scale use as a source of differential item functioning (DIF) across gender and ethnicity. Within a combined item response theory/ordinal logistic regression DIF framework, gender DIF was assessed using pairwise comparisons, and ethnicity DIF was tested using both multiple-group DIF with a common base group as the reference group and all possible pairwise comparisons. Overall, using the vignettes changed both the form of DIF within items and the pattern of DIF between groups across items. Results indicated the presence of DIF between genders, but the DIF was unrelated to group differences in response scale use. Across ethnic groups, Black/African American students and Asian students demonstrated group-specific response scale use. When groups showed such response tendencies, accounting for their scale use with the vignettes had a greater effect on reducing DIF in base-group comparisons than in pairwise comparisons. Although DIF was identified in multiple items, its magnitude was negligible in every case and had little practical implication. Therefore, the achievement motivation items appeared to demonstrate measurement comparability. Because sources of DIF often go unidentified, a contribution of this study was the novel use of anchoring vignettes to account for group differences in response scale use as the source of DIF and to clarify the effect of those differences on measurement scale comparability and DIF.
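
    For readers unfamiliar with the mechanics, the nonparametric anchoring-vignette adjustment re-expresses each respondent's self-report relative to that respondent's own ratings of ordered vignettes, which is what removes person- and group-specific response-scale use before DIF testing. The sketch below shows the standard two-vignette recode in generic form; the function name and example values are illustrative, and this is not necessarily the exact recoding procedure used in the study.

```r
## Generic nonparametric anchoring-vignette recode with two ordered vignettes:
## the self-report is placed relative to the respondent's own vignette ratings,
## yielding a scale that is comparable across different response-scale uses.
## Assumes each respondent rates the "mild" vignette at or below the "severe" one.
recode_vignette <- function(self, v_mild, v_severe) {
  ifelse(self <  v_mild,   1,
  ifelse(self == v_mild,   2,
  ifelse(self <  v_severe, 3,
  ifelse(self == v_severe, 4, 5))))
}

## Two respondents give the same raw self-rating (4) but use the scale
## differently, so their adjusted scores differ.
recode_vignette(self = 4, v_mild = 2, v_severe = 5)  # 3: between the two vignettes
recode_vignette(self = 4, v_mild = 1, v_severe = 3)  # 5: above the severe vignette
```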

    Assessment of Differential Item Functioning in Health-Related Outcomes: A Simulation and Empirical Analysis with Hierarchical Polytomous Data

    Background. The purpose of this study was to evaluate the effectiveness of two methods of detecting differential item functioning (DIF) in the presence of multilevel data and polytomously scored items. The assessment of DIF with multilevel data (e.g., patients nested within hospitals, hospitals nested within districts) from large-scale assessment programs has received considerable attention, but very few studies have evaluated the effect of the hierarchical structure of the data on DIF detection for polytomously scored items. Methods. Ordinal logistic regression (OLR) and hierarchical ordinal logistic regression (HOLR) were used to assess DIF in simulated and real multilevel polytomous data. Six factors (DIF magnitude, grouping variable, intraclass correlation coefficient, number of clusters, number of participants per cluster, and item discrimination parameter) were considered in a fully crossed simulation design. Furthermore, data from the Pediatric Quality of Life Inventory™ (PedsQL™) 4.0, collected from 576 healthy school children, were analyzed. Results. Overall, the results indicate that both methods performed equivalently in terms of Type I error control and detection power. Conclusions. The current study showed a negligible difference between OLR and HOLR in detecting DIF for polytomously scored items in a hierarchical structure. Implications and considerations for analyzing real data are also discussed.
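
    As a rough illustration of how the HOLR side of this comparison differs from ordinary OLR, the sketch below fits the hierarchical model as a cumulative link mixed model with a random cluster intercept using the R package `ordinal`; the simulated data, effect sizes, and variable names are assumptions for illustration, not the study's design or the PedsQL™ data.

```r
## Illustrative HOLR uniform-DIF check for one item with respondents nested in
## clusters, using a cumulative link mixed model (random cluster intercept).
library(ordinal)  # clmm()

set.seed(2)
n_clus <- 50; n_per <- 20
cluster <- factor(rep(seq_len(n_clus), each = n_per))
group   <- factor(rep(c("reference", "focal"), length.out = n_clus * n_per))
theta   <- rnorm(n_clus * n_per)
u       <- rnorm(n_clus, sd = 0.5)[as.integer(cluster)]   # cluster random effect
latent  <- theta + 0.4 * (group == "focal") + u + rlogis(n_clus * n_per)
resp    <- cut(latent, c(-Inf, -1, 0, 1, Inf), labels = 1:4, ordered_result = TRUE)
dat     <- data.frame(resp, theta, group, cluster)

m0 <- clmm(resp ~ theta         + (1 | cluster), data = dat)  # no DIF
m1 <- clmm(resp ~ theta + group + (1 | cluster), data = dat)  # uniform DIF

## 1-df likelihood-ratio test for the group effect, respecting the nesting
lr <- as.numeric(2 * (logLik(m1) - logLik(m0)))
pchisq(lr, df = 1, lower.tail = FALSE)
```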

    Assessing the Performance of Two Procedures for Detecting Differential Item Functioning within the Multilevel Partial Credit Model

    This simulation study evaluates the performance of two procedures for detecting uniform differential item functioning (DIF). Simulated data were generated from a multilevel partial credit model (MLPCM). The purpose of the study was to compare the accuracy of two DIF detection procedures for multilevel data: hierarchical ordinal logistic regression (HOLR) and the multilevel generalized Mantel-Haenszel test (MGMH; French & Finch, 2013; French, Finch, & Imekus, 2019). The conditions manipulated were the number of participants per cluster (20, 40), the number of clusters (50, 100, 200), DIF magnitude (0, .4, .8), and the magnitude of the intraclass correlation coefficient (.05, .25, .45). Furthermore, only one within-group grouping variable was used. Data were simulated using R (R Core Team, 2019), and analyses were performed using SAS 9.4 (SAS Institute, 2013) and R. In general, HOLR maintained the Type I error rate better than MGMH and had greater power than MGMH under most simulation conditions.
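
    For readers who want to see the non-multilevel baseline that MGMH extends, the sketch below runs a standard generalized Mantel-Haenszel DIF check for one polytomous item in R by stratifying on a matching variable and calling mantelhaen.test on the response × group × stratum table. The data, strata, and effect size are illustrative assumptions, and this is the ordinary single-level GMH rather than the multilevel MGMH studied in the paper.

```r
## Single-level generalized Mantel-Haenszel check for one polytomous item:
## test the response x group association within strata of a matching variable.
set.seed(3)
n     <- 600
group <- factor(rep(c("reference", "focal"), each = n / 2))
theta <- rnorm(n)
item  <- cut(theta + 0.5 * (group == "focal") + rlogis(n),
             c(-Inf, -1, 0, 1, Inf), labels = 1:4)
strat <- cut(theta, breaks = quantile(theta, 0:4 / 4),
             include.lowest = TRUE, labels = FALSE)   # quartile matching strata

tab <- table(item, group, stratum = strat)   # response x group x stratum
mantelhaen.test(tab)                         # generalized CMH test across strata
```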

    Differential Item Functioning for Polytomous Response Items Using Hierarchical Generalized Linear Model

    The hierarchical generalized linear model (HGLM) is a relatively new approach to detecting differential item functioning (DIF) and has several advantages, such as handling extreme response patterns (e.g., perfect or all-missed scores) and allowing covariates and levels to be added to identify the sources and consequences of DIF simultaneously. Several studies have examined the performance of HGLM for DIF assessment with dichotomous items, but only a few exist for polytomous items. This study examined the DIF-free-then-DIF strategy for selecting DIF-free anchor items and the performance of HGLM in DIF assessment for polytomous items. The study extends the work of Williams and Beretevas (2006) by adopting the constant anchor item method as the model identification method for HGLM and by examining the performance of DIF evaluation in the presence of latent trait differences between the focal and reference groups. In addition, it extends the work of Chen, Chen, and Shih (2014) by exploring the performance of HGLM for polytomous items with three response categories and by comparing the results to logistic regression and the generalized Mantel-Haenszel (GMH) procedure. In this study, the accuracy of using iterative HGLM with the DIF-free-then-DIF strategy to select DIF-free anchor items was examined first. Then, HGLM with a 1-item anchor and with a 4-item anchor was fitted to the data, along with logistic regression and GMH. Type I error and power rates were computed for all four methods. The results showed that, compared to dichotomous items, the accuracy of the HGLM methods in selecting DIF-free items was generally lower for polytomous items. The HGLM 1-item and 4-item anchor methods showed adequate control of the Type I error rate, while logistic regression and GMH showed considerably inflated Type I error. In terms of power, the HGLM 4-item anchor method outperformed the 1-item anchor method. Logistic regression behaved similarly to HGLM with a 1-item anchor. GMH was generally more powerful, especially under small sample size conditions; however, this may be a result of its inflated Type I error. Recommendations are made for applied researchers in selecting among HGLM, logistic regression, and GMH for DIF assessment of polytomous items.
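
    The iterative DIF-free-then-DIF anchor selection mentioned above is essentially a purification loop wrapped around a per-item DIF test. The sketch below shows that loop in generic form in R; `dif_test` is a hypothetical placeholder for whatever item-level test (HGLM, logistic regression, or GMH) returns a p-value, and the toy p-values exist only to make the example run.

```r
## Generic DIF-free-then-DIF purification loop: start with all items as anchors,
## drop items flagged for DIF, and repeat until the anchor set stabilizes.
purify_anchors <- function(items, dif_test, alpha = 0.05, max_iter = 10) {
  anchors <- items
  for (i in seq_len(max_iter)) {
    p_vals <- vapply(items, function(it) dif_test(it, setdiff(anchors, it)),
                     numeric(1))
    new_anchors <- items[p_vals >= alpha]   # keep items not flagged for DIF
    if (identical(new_anchors, anchors)) break
    anchors <- new_anchors
  }
  anchors
}

## Toy usage: a stand-in test that ignores the anchors and returns fixed
## p-values, so item i3 is the only item excluded from the anchor set.
toy_p <- c(i1 = 0.60, i2 = 0.45, i3 = 0.001, i4 = 0.70)
purify_anchors(names(toy_p), function(item, anchors) toy_p[[item]])
```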

    Safeguarding psychological measurement: Psychometric approaches to correct for test and agreeing bias

    Psychological measurement is critical for understanding human behavior, but measuring intangible psychological constructs is challenging. Self-report scales are often used to measure these latent constructs but are prone to biases that can compromise validity. This dissertation presents a series of methodological contributions on psychometric methods to detect and correct for test and agreeing bias. These methods can help improve the validity and reliability of self-report scales and other psychological assessments, ultimately helping us build valid and effective measurement tools.

    Beyond motivation: Differences in score meaning between assessment conditions

    Written communication is a skill necessary not only for the success of undergraduate students but also for post-graduates in the workplace. Furthermore, according to employers, the writing skills of post-graduates tend to fall below expectations. Therefore, the assessment of such skills within higher education is in high demand. Written communication assessments tend to be administered in one of two conditions: 1) a course-embedded condition or 2) a low-stakes, non-embedded condition. The current study investigated possible construct-irrelevant variance in writing assessment scores using data from a mid-sized public university in the Mid-Atlantic region of the United States. Specifically, 157 student products were scored by Multi-State Collaborative trained raters using the Association of American Colleges and Universities’ Written Communication rubric. In the final sample, 57 student products were in the non-embedded assessment condition and 107 student products were in the embedded assessment condition. Differential item functioning analyses were conducted using a Rasch rating scale model and an ordinal regression in which Verbal SAT was used as an external criterion of ability. Said differently, this study investigated whether students of the same proficiency had different probabilities of receiving particular written communication scores. After controlling for motivation, the results provide evidence of possible differential item functioning for Content Development as well as Genre and Disciplinary Conventions. Students of the same ability tend to obtain higher written communication scores in the non-embedded assessment condition. These results raise concerns about the presence of construct-irrelevant variance aside from motivation. Future research should investigate faculty feedback, allotted time, and task structure as possible sources of construct-irrelevant variance when using low-stakes, non-embedded assessments of written communication.
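
    As with the other ordinal-regression studies above, the condition comparison here amounts to nested ordinal regressions on an external ability criterion, testing the condition main effect (uniform DIF) and the criterion-by-condition interaction (non-uniform DIF). A rough sketch of that comparison with MASS::polr in R follows; the simulated scores, sample sizes, and variable names (sat_verbal, condition) are illustrative assumptions, not the study's data.

```r
## Illustrative DIF check for one rubric dimension across assessment conditions,
## matching on an external criterion (a simulated Verbal SAT-like score).
library(MASS)  # polr()

set.seed(4)
n          <- 200
condition  <- factor(rep(c("embedded", "non_embedded"), each = n / 2))
sat_verbal <- round(rnorm(n, mean = 550, sd = 80))
z          <- (sat_verbal - mean(sat_verbal)) / sd(sat_verbal)
latent     <- z + 0.5 * (condition == "non_embedded") + rlogis(n)
score      <- cut(latent, c(-Inf, -1, 0, 1, Inf), labels = 1:4,
                  ordered_result = TRUE)
dat <- data.frame(score, sat_verbal, condition)

m_base    <- polr(score ~ sat_verbal,             data = dat, Hess = TRUE)
m_uniform <- polr(score ~ sat_verbal + condition, data = dat, Hess = TRUE)
m_full    <- polr(score ~ sat_verbal * condition, data = dat, Hess = TRUE)

## 1-df tests: condition effect (uniform DIF), then interaction (non-uniform DIF)
pchisq(deviance(m_base)    - deviance(m_uniform), df = 1, lower.tail = FALSE)
pchisq(deviance(m_uniform) - deviance(m_full),    df = 1, lower.tail = FALSE)
```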
    • …