
    The Crit coefficient in Mokken scale analysis: A simulation study and an application in quality-of-life research

    PURPOSE: In Mokken scaling, the Crit index was proposed and is sometimes used as evidence (or lack thereof) of violations of some common model assumptions. The main goal of our study was twofold: to make the formulation of the Crit index explicit and accessible, and to investigate its distribution under various measurement conditions. METHODS: We conducted two simulation studies in the context of dichotomously scored item responses. We manipulated the type of assumption violation, the proportion of violating items, sample size, and quality. False positive rates and power to detect assumption violations were our main outcome variables. Furthermore, we used the Crit coefficient in a Mokken scale analysis of responses to the General Health Questionnaire (GHQ-12), a self-administered questionnaire for assessing current mental health. RESULTS: We found that the false positive rates of Crit were close to the nominal rate in most conditions and that the power to detect misfit depended on sample size, type of violation, and the number of assumption-violating items. Overall, in small samples Crit lacked the power to detect misfit, and in larger samples power differed considerably depending on the type of violation and the proportion of misfitting items. Furthermore, our empirical example showed that even in large samples the Crit index may fail to detect assumption violations. DISCUSSION: Even in large samples, the Crit coefficient showed limited usefulness for detecting moderate and severe violations of monotonicity. Our findings are relevant to researchers and practitioners who use Mokken scaling for scale and questionnaire construction and revision. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1007/s11136-021-02924-z.
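    In practice, Crit values can be inspected with the mokken package in R. The sketch below is a minimal illustration only: it assumes the acl example data shipped with the package and the check.monotonicity/summary interface from the package documentation, not the GHQ-12 data analysed in the paper.

        ## Minimal sketch: inspecting Crit values with the mokken package.
        ## Assumes the acl example data shipped with the package; the GHQ-12
        ## data from the paper are not used here.
        library(mokken)

        data(acl)
        X <- acl[, 1:10]                     # a small subset of items

        ## Check the monotonicity assumption; the per-item summary reports the
        ## number of active comparisons, the violations found, and Crit.
        mono <- check.monotonicity(X)
        summary(mono)

        ## Rule of thumb sometimes used (cutoffs vary across sources):
        ## Crit < 40 negligible, 40-80 worth inspecting, > 80 serious misfit.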

    Essays on invariant item ordering


    Getting started with Mokken Scale Analysis in R


    Person fit analysis with simulation-based methods

    Aberrant responding in test or questionnaire data that violates the principles of item response theory is a prevalent phenomenon in the psychological and educational sciences. Person-fit statistics identify such aberrant responding and thereby prevent the computation of inadequate ability estimates. Simulation-based methods for person-fit analysis were investigated in simulation studies with regard to Type I error and statistical power to detect aberrancy. Real-data analyses from the psychological and educational sciences further illustrate the usefulness of person-fit statistics based on the presented approaches. In Study 1, the Rasch Sampler, a Markov chain Monte Carlo algorithm for sampling binary data matrices, is applied to simulate the null distribution of person-fit statistics under the Rasch model. Results are compared to standardized statistics and show that the new approach (1) correctly recovers the nominal Type I error rates (while the standardized statistics deviate substantially) and (2) offers predominantly similar or higher statistical power. Results from the application to the Rasch scalability of two subscales taken from Heller and Perleth's (2000) multidimensional intelligence test (KFT) confirmed the findings from the simulation studies. In Study 2, the Type I error and power of person-fit tests based on weighted maximum likelihood ability estimators and the parametric bootstrap were evaluated and compared to established methods for person-fit analysis. Bootstrapping based on robust maximum likelihood estimators improves statistical power, but a satisfactory recovery of nominal Type I error rates requires strong downweighting of aberrant item responses. Bootstrapping based on Warm's (1989) estimator, applied as the scoring method to both original and simulated data, displayed promising results concerning Type I error recovery and statistical power. Results from the simulations were matched by findings from the analysis of four samples of students with disabilities participating in a state-wide large-scale assessment program, investigating whether the assessment of competence is invalidated by test modifications for these students. Both studies provide new insights into the benefits of simulation-based methods for applying person-fit tests to detect aberrant response behavior.
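    The parametric-bootstrap idea from Study 2 can be sketched in a few lines of base R: simulate response vectors from the fitted model and compare the observed person-fit statistic with its simulated null distribution. The item difficulties, ability value, and log-likelihood statistic below are invented for illustration, and the re-estimation of ability for each replicate (e.g., with Warm's weighted likelihood estimator) is omitted; this is a sketch of the general approach, not the exact procedure of the thesis.

        ## Sketch of a parametric bootstrap for a person-fit statistic under the
        ## Rasch model (illustration only; parameters are invented).
        set.seed(1)
        beta  <- seq(-2, 2, length.out = 20)   # assumed item difficulties
        theta <- 0.5                           # assumed ability estimate

        p_rasch <- function(theta, beta) plogis(theta - beta)

        ## Person-fit statistic: the person log-likelihood (lower = worse fit).
        loglik <- function(x, theta, beta) {
          p <- p_rasch(theta, beta)
          sum(x * log(p) + (1 - x) * log(1 - p))
        }

        ## Observed responses, with a few flips injected to mimic aberrance.
        x_obs <- rbinom(length(beta), 1, p_rasch(theta, beta))
        x_obs[1:4] <- 1 - x_obs[1:4]

        ## Null distribution: the statistic computed on model-conform simulees.
        null_stat <- replicate(1000, {
          x_sim <- rbinom(length(beta), 1, p_rasch(theta, beta))
          loglik(x_sim, theta, beta)
        })

        p_value <- mean(null_stat <= loglik(x_obs, theta, beta))  # lower tail
        p_value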

    A Bayesian approach to person-fit analysis in item response theory models


    PerFit: An R package for person-fit analysis in IRT

    Checking the validity of test scores is important in both educational and psychological measurement. Person-fit analysis provides several statistics that help practitioners assess whether individual item-score vectors conform to a prespecified item response theory model or, alternatively, to a group of test takers. Software enabling easy access to most person-fit statistics had been lacking; the PerFit R package was written to fill this void. A theoretical overview of relatively simple person-fit statistics is provided, together with a practical guide showing how the main functions of PerFit can be used. Both numerical and graphical tools are described and illustrated using examples. The goal is to show how person-fit statistics can be easily applied to test and questionnaire data.
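    A minimal PerFit workflow might look like the sketch below. The dichotomous data are simulated for illustration, and the exact argument names of cutoff(), flagged.resp(), and the plot method are taken from the package documentation and should be checked against the installed version.

        ## Minimal sketch of a PerFit workflow on simulated dichotomous data.
        library(PerFit)

        set.seed(2)
        theta <- rnorm(500)
        beta  <- seq(-2, 2, length.out = 20)
        X <- t(sapply(theta, function(th) rbinom(20, 1, plogis(th - beta))))

        ## Nonparametric person-fit statistic; higher U3 values signal misfit.
        u3.out <- U3(X)

        ## Bootstrap-based cutoff and the respondents flagged as misfitting.
        u3.cut  <- cutoff(u3.out)
        flagged <- flagged.resp(u3.out, u3.cut)

        ## Graphical summary of the statistic's distribution.
        plot(u3.out)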

    Item-Score Reliability as a Selection Tool in Test Construction

    This study investigates the usefulness of item-score reliability as a criterion for item selection in test construction. Methods MS, λ6, and CA were investigated as item-assessment methods in item selection and compared to the corrected item-total correlation, which was used as a benchmark. An ideal ordering for adding items to the test (bottom-up procedure) or omitting items from the test (top-down procedure) was defined based on the population test-score reliability. The orderings that the four item-assessment methods produced in samples were compared to the ideal ordering, and the degree of resemblance was expressed by means of Kendall's τ. To investigate the concordance of the orderings across 1,000 replicated samples, Kendall's W was computed for each item-assessment method. The results showed that, for both the bottom-up and the top-down procedures, item-assessment method CA and the corrected item-total correlation most closely resembled the ideal ordering. Generally, all item-assessment methods resembled the ideal ordering better, and the concordance of the orderings was greater, for larger sample sizes and greater variance of the item discrimination parameters.
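    As an illustration of the benchmark criterion, the corrected item-total correlation and a top-down selection based on it can be sketched in base R as below. The data are simulated for illustration, and methods MS, λ6, and CA are not implemented here.

        ## Sketch: corrected item-total correlation as an item-selection
        ## criterion (the study's benchmark), on simulated dichotomous data.
        set.seed(3)
        theta <- rnorm(1000)
        disc  <- runif(10, 0.5, 2)                  # item discriminations
        diff  <- seq(-1.5, 1.5, length.out = 10)    # item difficulties
        X <- t(sapply(theta, function(th) rbinom(10, 1, plogis(disc * (th - diff)))))

        ## Correlate each item with the rest score (total excluding that item).
        rit <- sapply(seq_len(ncol(X)), function(j) cor(X[, j], rowSums(X[, -j])))
        round(rit, 2)

        ## Top-down procedure: repeatedly omit the item with the lowest value.
        keep <- seq_len(ncol(X)); dropped <- integer(0)
        while (length(keep) > 2) {
          r <- sapply(seq_along(keep), function(k)
            cor(X[, keep[k]], rowSums(X[, keep[-k], drop = FALSE])))
          worst   <- keep[which.min(r)]
          dropped <- c(dropped, worst)
          keep    <- setdiff(keep, worst)
        }
        dropped    # order in which items would be omitted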