
    Estimating the reliability of composite scores

    In situations where multiple tests are administered (such as the GCSE subjects), scores from the individual tests are frequently combined into a composite score. As part of the Ofqual reliability programme, this literature review aims to: examine the different approaches used to form composite scores from component or unit scores; investigate the implications of these approaches for the psychometric properties, particularly the reliability, of the composite scores; and identify procedures commonly used to estimate the reliability of composite scores.
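    A common way to address the last of these aims expresses the reliability of a weighted composite in terms of the components' weights, standard deviations, reliabilities, and intercorrelations. Below is a minimal Python sketch of this classical (Mosier-type) computation; the function name and the three-unit example values are hypothetical and illustrate the formula only, not the specific procedures reviewed in the study.

```python
import numpy as np

def composite_reliability(weights, sds, reliabilities, corr):
    """Reliability of a weighted composite from component statistics.

    Classical (Mosier-type) formula:
        rho_XX = 1 - sum_i w_i^2 * sd_i^2 * (1 - rho_ii) / var(X),
    where var(X) = w' * Sigma * w and Sigma is the component covariance matrix.
    """
    w = np.asarray(weights, dtype=float)
    sds = np.asarray(sds, dtype=float)
    rel = np.asarray(reliabilities, dtype=float)
    cov = np.outer(sds, sds) * np.asarray(corr, dtype=float)  # component covariances
    var_composite = w @ cov @ w                               # variance of the composite
    error_var = np.sum(w**2 * sds**2 * (1.0 - rel))           # weighted error variance
    return 1.0 - error_var / var_composite

# Hypothetical three-unit qualification: unit reliabilities 0.85, 0.80, 0.78
print(composite_reliability(
    weights=[1.0, 1.0, 2.0],
    sds=[10.0, 12.0, 15.0],
    reliabilities=[0.85, 0.80, 0.78],
    corr=[[1.0, 0.60, 0.50],
          [0.60, 1.0, 0.55],
          [0.50, 0.55, 1.0]],
))
```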

    Comparisons of subscoring methods in computerized adaptive testing: a simulation study

    Given the increasing demand for subscore reports, various subscoring methods and augmentation techniques have been developed to improve subscore estimates, but few studies have systematically compared these methods within the framework of computerized adaptive testing (CAT). This research conducts a simulation study to compare five subscoring methods on score estimation under varied simulated CAT conditions. Among the five methods, IND-UCAT scoring ignores the correlations among subtests, whereas the other four correlation-based methods (SEQ-CAT, PC-MCAT, reSEQ-CAT, and AUG-CAT) capitalize on the correlation information in the scoring procedure. By manipulating the subtest lengths, the correlation structures, and the item selection algorithms, more comparable, pragmatic, and systematic testing scenarios are created for comparison. Also, to make full use of the information underlying the assessments, the study proposes a successive scoring procedure based on the structure of the higher-order IRT model, in which the total test score of an individual examinee is calculated after the subscore estimation procedure is conducted; through this procedure, the subscores and the total score of an examinee can be derived sequentially from one test. The results indicate that in the low correlation structure, the original IND-UCAT is suggested for subscore estimation given its ease of implementation in practice, while the proposed total score estimation procedure is not recommended because of its large divergences from the true total scores. For the mixed correlation structure with two moderate correlations and one strong correlation, the original SEQ-CAT or the combination of SEQ-CAT item selection and PC-MCAT scoring should be considered not only for subscore estimation but also for total score estimation; if post-hoc estimation is allowed, the original SEQ-CAT and reSEQ-CAT scoring could be conducted jointly for the best score estimates. In the high correlation structure, the original PC-MCAT and the combination of PC-MCAT scoring and SEQ-CAT item selection are suggested for both subscore and total score estimation; in terms of post-hoc score estimation, reSEQ-CAT scoring in conjunction with the original SEQ-CAT is strongly recommended, and if implementation complexity is an issue in practice, reSEQ-CAT scoring jointly conducted with the original IND-UCAT could be considered for reasonable score estimates. Additionally, to compensate for the constrained use of item pools in PC-MCAT, PC-MCAT with adaptively sequenced subtests (SEQ-MCAT) is proposed for future investigation. Simplifications of the item and/or subtest selection criteria in simple-structure MCAT, PC-MCAT, and SEQ-MCAT are also pointed out to ease their application in practice. Finally, the limitations of the study are discussed and directions for future research are provided.
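    The correlation-based methods compared above share the core idea of classical subscore augmentation: each subscore estimate borrows strength from the other, correlated subscores. The Python sketch below illustrates this general Kelley/Wainer-style regressed-subscore idea; it is not an implementation of SEQ-CAT, PC-MCAT, reSEQ-CAT, or AUG-CAT, and the means, covariance matrix, and reliabilities are hypothetical.

```python
import numpy as np

def augmented_subscores(x, means, cov_obs, reliabilities):
    """Kelley/Wainer-style augmented (regressed) subscore estimates.

    The true-score covariance is approximated by the observed covariance with
    its diagonal shrunk by each subscore's reliability; the augmented estimate
    regresses each observed subscore toward the mean while borrowing strength
    from the correlated subscores.
    """
    x = np.asarray(x, float)
    mu = np.asarray(means, float)
    S_obs = np.asarray(cov_obs, float)
    S_true = S_obs.copy()
    np.fill_diagonal(S_true, np.asarray(reliabilities, float) * np.diag(S_obs))
    B = S_true @ np.linalg.inv(S_obs)    # regression weights
    return mu + B @ (x - mu)

# Hypothetical three-subtest example with moderately correlated subscores
means = [20.0, 18.0, 22.0]
cov_obs = np.array([[16.0,  8.0,  6.0],
                    [ 8.0, 25.0, 10.0],
                    [ 6.0, 10.0, 20.0]])
print(augmented_subscores([24.0, 15.0, 22.0], means, cov_obs,
                          reliabilities=[0.70, 0.75, 0.65]))
```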

    plink: An R Package for Linking Mixed-Format Tests Using IRT-Based Methods

    The R package plink has been developed to facilitate the linking of mixed-format tests for multiple groups under a common-item design, using unidimensional and multidimensional IRT-based methods. This paper presents the capabilities of the package in the context of the unidimensional methods. The package supports nine unidimensional item response models (the Rasch model, 1PL, 2PL, 3PL, graded response model, partial credit and generalized partial credit models, nominal response model, and multiple-choice model) and four separate-calibration linking methods (mean/sigma, mean/mean, Haebara, and Stocking-Lord). It also includes functions for importing item and/or ability parameters from common IRT software, conducting IRT true-score and observed-score equating, and plotting item response curves and parameter comparisons.
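    For context, the mean/sigma method mentioned above derives a linear transformation theta* = A*theta + B from the common items' difficulty estimates on the two scales. The Python sketch below shows that general computation and the corresponding rescaling of 2PL item parameters; it is not the plink API, and the parameter values are hypothetical.

```python
import numpy as np

def mean_sigma_constants(b_from, b_to):
    """Mean/sigma linking constants for theta_to = A * theta_from + B,
    computed from common-item difficulty estimates on the two scales."""
    b_from, b_to = np.asarray(b_from, float), np.asarray(b_to, float)
    A = b_to.std(ddof=1) / b_from.std(ddof=1)
    B = b_to.mean() - A * b_from.mean()
    return A, B

def rescale_2pl(a, b, A, B):
    """Place 2PL item parameters from the 'from' scale onto the 'to' scale."""
    return np.asarray(a, float) / A, A * np.asarray(b, float) + B

# Hypothetical common-item difficulties estimated in two separate calibrations
b_new  = [-1.2, -0.4, 0.1, 0.8, 1.5]   # new form's scale
b_base = [-0.9, -0.1, 0.4, 1.2, 1.9]   # base (target) scale
A, B = mean_sigma_constants(b_new, b_base)
print(A, B)
print(rescale_2pl(a=[1.1, 0.8, 1.4, 0.9, 1.2], b=b_new, A=A, B=B))
```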

    THE ROBUSTNESS OF IRT-BASED VERTICAL SCALING METHODS TO VIOLATION OF UNIDIMENSIONALITY

    In recent years, many states have adopted Item Response Theory (IRT) based vertically scaled tests because of their compelling features in a growth-based accountability context. However, selecting a practical and effective calibration/scaling method and properly understanding possible multidimensionality in the test data are critical to ensuring accuracy and reliability. This study uses Monte Carlo simulation to investigate the robustness of various unidimensional scaling methods under different test conditions and different degrees of departure from unidimensionality in a common-item nonequivalent groups design (grades 3 to 8). The main research questions are: 1) Which calibration/scaling method (concurrent, semi-concurrent, separate calibration with Stocking-Lord [SL] scaling, separate calibration with mean/sigma scaling, or pair-wise calibration) yields the least biased ability estimates in the vertical scaling context? 2) How do different degrees of multidimensionality affect the use of these methods? Results indicate that the calibration and scaling methods perform very differently under different test conditions, especially for the grades furthest from the base grade. Under the unidimensional condition, the five calibration and linking methods produced very similar results for grades close to the base grade (grade 5); however, for grades 7 and 8, semi-concurrent and concurrent calibration yielded more biased results, while the results for the other three methods were comparable. Under multidimensional conditions, all five methods produced more biased results, and the bias patterns differed across methods. In general, the more severe the multidimensionality, the larger the bias. Among the five methods compared, separate calibration with SL linking is the most robust to variations in multidimensionality.
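    As a point of reference for the separate-calibration conditions, the Stocking-Lord (SL) method chooses the linear transformation that minimizes the squared difference between the two test characteristic curves over the common items. The Python sketch below illustrates that criterion under a 2PL parameterization with hypothetical item parameters; it is a generic illustration, not the calibration setup used in this study.

```python
import numpy as np
from scipy.optimize import minimize

def tcc_2pl(theta, a, b):
    """Expected common-item score (test characteristic curve) under the 2PL."""
    z = np.asarray(a)[None, :] * (np.asarray(theta)[:, None] - np.asarray(b)[None, :])
    return (1.0 / (1.0 + np.exp(-z))).sum(axis=1)

def stocking_lord(a_from, b_from, a_to, b_to, theta=np.linspace(-4, 4, 41)):
    """Find A, B minimizing the squared TCC difference on the target scale."""
    def loss(AB):
        A, B = AB
        if A <= 0:                      # keep the slope positive during the search
            return 1e10
        # place the 'from' parameters onto the target scale, then compare TCCs
        diff = tcc_2pl(theta, a_to, b_to) - tcc_2pl(theta, np.asarray(a_from) / A,
                                                    A * np.asarray(b_from) + B)
        return np.sum(diff ** 2)
    res = minimize(loss, x0=[1.0, 0.0], method="Nelder-Mead")
    return res.x  # A, B

# Hypothetical common-item parameters estimated separately in two adjacent grades
A, B = stocking_lord(a_from=[1.0, 1.3, 0.8, 1.1], b_from=[-0.5, 0.2, 0.9, 1.4],
                     a_to=[0.9, 1.2, 0.85, 1.05], b_to=[-0.1, 0.6, 1.3, 1.8])
print(A, B)
```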

    A Comparison of Two MCMC Algorithms for Estimating the 2PL IRT Models

    Fully Bayesian estimation via Markov chain Monte Carlo (MCMC) techniques has become popular for estimating item response theory (IRT) models. Current MCMC development includes two major algorithms: Gibbs sampling and the No-U-Turn sampler (NUTS). While the former has been used for fitting various IRT models, the latter is relatively new, calling for research that compares it with other algorithms. The purpose of the present study is to evaluate the performance of these two MCMC algorithms in estimating two two-parameter logistic (2PL) IRT models, namely the 2PL unidimensional model and the 2PL multi-unidimensional model, under various test conditions. By investigating the accuracy and bias of the model parameter estimates given different test lengths, sample sizes, prior specifications, and/or correlations for these models, the study aims to provide researchers and practitioners with general guidelines for estimating a unidimensional IRT (UIRT) model and a multi-unidimensional IRT model. The results suggest that NUTS is as effective as Gibbs sampling at parameter estimation under most conditions for the 2PL IRT models. The findings also shed light on the use of the two MCMC algorithms with more complex IRT models.
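    Both samplers target the same joint posterior and differ only in how they explore it. The sketch below writes out a 2PL log-posterior in plain numpy, with illustrative priors (standard normal on abilities and difficulties, lognormal on discriminations) that are assumptions of this sketch rather than the study's exact specifications; either a Gibbs-type or a NUTS sampler could be run on such a function.

```python
import numpy as np

def log_posterior_2pl(theta, a, b, y):
    """Joint log-posterior (up to a constant) of a unidimensional 2PL model.

    theta : (n_persons,) abilities
    a, b  : (n_items,) discriminations and difficulties
    y     : (n_persons, n_items) 0/1 response matrix
    Priors (illustrative): theta, b ~ N(0, 1); log(a) ~ N(0, 0.5^2).
    """
    z = a[None, :] * (theta[:, None] - b[None, :])
    loglik = np.sum(y * z - np.logaddexp(0.0, z))   # stable Bernoulli log-likelihood
    logprior = (-0.5 * np.sum(theta**2)
                - 0.5 * np.sum(b**2)
                - np.sum(np.log(a)) - 0.5 * np.sum((np.log(a) / 0.5)**2))
    return loglik + logprior

# Tiny simulated data set for a quick check
rng = np.random.default_rng(1)
theta = rng.normal(size=50)
a = rng.lognormal(0.0, 0.3, size=10)
b = rng.normal(size=10)
p = 1.0 / (1.0 + np.exp(-a[None, :] * (theta[:, None] - b[None, :])))
y = rng.binomial(1, p)
print(log_posterior_2pl(theta, a, b, y))
```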

    Evaluation of Item Response Theory Models for Nonignorable Omissions

    When competence tests are administered, subjects frequently omit items. These missing responses pose a threat to correctly estimating the proficiency level. Newer model-based approaches aim to take nonignorable missing-data processes into account by incorporating a latent missing propensity into the measurement model. Two assumptions are typically made when using these models: (1) the missing propensity is unidimensional, and (2) the missing propensity and the ability are bivariate normally distributed. These assumptions may, however, be violated in real data sets and could thus threaten the validity of this approach. The present study models competencies in various domains using data from a school sample (N = 15,396) and an adult sample (N = 7,256) from the National Educational Panel Study. Our interest was to investigate whether violations of unidimensionality and of the normal distribution assumption severely affect the performance of the model-based approach in terms of differences in ability estimates. We propose a model with a competence dimension, a unidimensional missing propensity, and a distributional assumption more flexible than the multivariate normal. Using this model for ability estimation yields ability estimates that differ from those of a model ignoring missing responses. Implications for ability estimation in large-scale assessments are discussed.
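    To make the model-based approach concrete, the sketch below writes out the marginal likelihood of one examinee under a simple illustrative latent-missing-propensity model of the Holman/Glas type: a Rasch model for the observed responses driven by ability, a Rasch-type model for the response indicators driven by a missing propensity, and a bivariate normal distribution for the two latent variables. The parameterization (b, gamma, rho) and the five-item example are assumptions of this sketch, not the models evaluated in the study.

```python
import numpy as np
from scipy.stats import multivariate_normal

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def person_loglik(y, d, b, gamma, rho, n_nodes=21):
    """Marginal log-likelihood of one examinee under an illustrative
    latent-missing-propensity model: ability theta drives the observed
    responses (Rasch), missing propensity xi drives the response indicators
    (Rasch), and (theta, xi) are bivariate standard normal with correlation
    rho. Omitted items (d == 0) contribute only through the indicator model.
    """
    y, d, b, gamma = map(np.asarray, (y, d, b, gamma))
    nodes = np.linspace(-4.0, 4.0, n_nodes)
    T, X = np.meshgrid(nodes, nodes, indexing="ij")
    pts = np.column_stack([T.ravel(), X.ravel()])
    w = multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]]).pdf(pts)
    w /= w.sum()                                   # normalized grid weights
    total = 0.0
    for (t, x), wt in zip(pts, w):
        p_item = sigmoid(t - b)                    # P(correct | theta)
        p_resp = sigmoid(x - gamma)                # P(item attempted | xi)
        lik_items = np.where(d == 1,
                             p_resp * np.where(y == 1, p_item, 1.0 - p_item),
                             1.0 - p_resp)
        total += wt * np.prod(lik_items)
    return np.log(total)

# Hypothetical 5-item example: the last two responses are omitted
print(person_loglik(y=np.array([1, 0, 1, 0, 0]), d=np.array([1, 1, 1, 0, 0]),
                    b=np.zeros(5), gamma=np.full(5, -0.5), rho=0.3))
```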

    A Note on Improving Variational Estimation for Multidimensional Item Response Theory

    Survey instruments and assessments are frequently used in many domains of social science. When the constructs that these assessments try to measure become multifaceted, multidimensional item response theory (MIRT) provides a unified framework and a convenient statistical tool for item analysis, calibration, and scoring. However, the computational challenge of estimating MIRT models limits their wide use, because many extant methods can hardly provide results in a realistic time frame when the number of dimensions, the sample size, and the test length are large. Variational estimation methods, such as the Gaussian Variational Expectation Maximization (GVEM) algorithm, have recently been proposed to solve this estimation challenge by providing a fast and accurate solution. However, results have shown that variational estimation methods may produce some bias in the discrimination parameters during confirmatory model estimation, and this note proposes an importance-weighted version of GVEM (IW-GVEM) to correct for such bias under MIRT models. We also use the adaptive moment estimation method to update the learning rate for gradient descent automatically. Our simulations show that IW-GVEM can effectively correct the bias with a modest increase in computation time compared with GVEM. The proposed method may also shed light on improving variational estimation for other psychometric models.
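    The correction named in the note follows the general importance-weighting idea for variational estimators: draw samples from the fitted Gaussian variational posterior and reweight them by the ratio of the joint model density to the variational density. The sketch below computes such an importance-weighted bound for a compensatory MIRT (multidimensional 2PL) likelihood under simple assumptions (standard normal prior, diagonal Gaussian variational posterior); it is a schematic of the general idea, not the IW-GVEM algorithm itself.

```python
import numpy as np

def iw_bound_2pl(y, A, b, mu, sigma, n_samples=200, rng=None):
    """Importance-weighted lower bound on log p(y) for one examinee under a
    compensatory MIRT (multidimensional 2PL) model with a standard normal
    prior on theta and a diagonal Gaussian variational posterior
    q = N(mu, diag(sigma^2))."""
    rng = np.random.default_rng() if rng is None else rng
    D = len(mu)
    theta = mu + sigma * rng.standard_normal((n_samples, D))   # samples from q
    z = theta @ A.T - b                                        # item logits
    loglik = np.sum(y * z - np.logaddexp(0.0, z), axis=1)      # Bernoulli log-lik
    logprior = -0.5 * np.sum(theta**2, axis=1) - 0.5 * D * np.log(2 * np.pi)
    logq = (-0.5 * np.sum(((theta - mu) / sigma)**2, axis=1)
            - np.sum(np.log(sigma)) - 0.5 * D * np.log(2 * np.pi))
    logw = loglik + logprior - logq                            # log importance weights
    return np.logaddexp.reduce(logw) - np.log(n_samples)       # log-mean-exp

# Hypothetical 10-item, 2-dimension example
rng = np.random.default_rng(0)
A = np.abs(rng.normal(1.0, 0.3, size=(10, 2)))
b = rng.normal(size=10)
y = rng.integers(0, 2, size=10)
print(iw_bound_2pl(y, A, b, mu=np.zeros(2), sigma=np.ones(2), rng=rng))
```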

    An Examination of Parameter Recovery Using Different Multiple Matrix Booklet Designs

    Educational large-scale assessments examine students' achievement in various content domains and thus provide key findings to inform educational research and evidence-based educational policy. To this end, large-scale assessments involve hundreds of items to test students' achievement across content domains. Administering all of these items to a single student would over-burden the student, reduce participation rates, and consume too much time and too many resources. Hence, multiple matrix sampling is used, in which the test items are distributed across various test forms called "booklets"; each student is administered one booklet, containing a subset of items that can sensibly be answered within the allotted testing time. However, there are numerous ways in which these booklets can be designed, and the booklet design can influence the precision of parameter recovery at both the global and the subpopulation level. One popular booklet design with many desirable characteristics is the balanced incomplete 7-block, or Youden squares, design; extensions of this design are used in many large-scale assessments such as TIMSS and PISA. This doctoral project examines the degree to which item and population parameters are recovered in real and simulated data in relation to matrix sparseness when various balanced incomplete block booklet designs are used. To this end, key factors (e.g., the number of items, the number of persons, the number of items per person, and the match between the distributions of item and person parameters) are experimentally manipulated to learn how they affect the precision with which these designs recover the true population parameters. In doing so, the project expands the empirical knowledge base on the statistical properties of booklet designs, which in turn could help improve the design of future large-scale studies. Generally, the results show that for a typical large-scale assessment (with a sample size of at least 3,000 students and more than 100 test items), population and item parameters are recovered accurately and without bias under the various multi-matrix booklet designs, both at the global population level and at the subgroup or subpopulation level. Further, for such an assessment, the match between the distribution of person abilities and the distribution of item difficulties has a negligible effect on the precision with which person and item parameters are recovered under these designs. These results lend further support to the use of multi-matrix booklet designs as a reliable test-abridgment technique in large-scale assessments and for the accurate measurement of performance gaps between policy-relevant subgroups within populations. However, item position effects were not fully considered, and different results are possible if similar studies are performed (a) under conditions involving items that poorly measure student abilities (e.g., with students having skewed ability distributions) or (b) under conditions with a large amount of missing data due to non-response rather than missingness by design. This should be further investigated in future studies.
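    The balanced incomplete 7-block structure referenced above can be generated from a cyclic difference set: numbering the item blocks 0 to 6, booklet i contains blocks {i, i+1, i+3} mod 7, so every pair of blocks appears together in exactly one booklet and every block appears in exactly three booklets. The Python sketch below constructs this assignment and verifies the balance property; it illustrates the design principle, not the exact booklet layouts used in TIMSS or PISA.

```python
from itertools import combinations

def bib7_booklets():
    """Balanced incomplete 7-block (Youden-squares-type) booklet design:
    7 booklets of 3 item blocks each, generated from the cyclic
    difference set {0, 1, 3} mod 7."""
    return [sorted({i % 7, (i + 1) % 7, (i + 3) % 7}) for i in range(7)]

booklets = bib7_booklets()
for i, bk in enumerate(booklets, start=1):
    print(f"Booklet {i}: blocks {bk}")

# Balance checks: each block appears in 3 booklets, and each pair of blocks
# appears together in exactly 1 booklet.
pair_counts = {pair: 0 for pair in combinations(range(7), 2)}
for bk in booklets:
    for pair in combinations(bk, 2):
        pair_counts[pair] += 1
assert all(count == 1 for count in pair_counts.values())
assert all(sum(block in bk for bk in booklets) == 3 for block in range(7))
```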

    A COMPARISON OF SUBSCORE REPORTING METHODS FOR A STATE ASSESSMENT OF ENGLISH LANGUAGE PROFICIENCY

    Educational tests that assess multiple content domains, which are related to one another to varying degrees, often have subsections based on these domains; the scores assigned to these subsections are commonly known as subscores. In today's accountability-oriented educational environment, testing programs face increasing demand to report subscores in addition to total test scores. While reporting subscores can provide much-needed information for teachers, administrators, and students about proficiency in the test domains, a major drawback of subscore reporting is their lower reliability compared with the test as a whole. This dissertation explored several methods of assigning subscores to the four domains of an English language proficiency test (listening, reading, writing, and speaking), including classical test theory (CTT)-based number correct, unidimensional item response theory (UIRT), augmented item response theory (A-IRT), and multidimensional item response theory (MIRT), and compared the reliability and precision of these methods across language domains and grade bands. CTT and UIRT methods were found to have similar reliability and precision, both lower than those of the augmented IRT and MIRT methods. The reliability of augmented IRT and MIRT was comparable for most domains and grade bands. The policy implications and limitations of this study, as well as directions for further research, are discussed.
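    Much of the reliability gap between subscores and the total score is a matter of test length, as the Spearman-Brown relation makes explicit. The sketch below is a simple illustration under an assumed common average inter-item correlation; the item counts and the correlation value are hypothetical, not figures from the assessment studied here.

```python
def spearman_brown_reliability(n_items, avg_inter_item_corr):
    """Reliability (standardized alpha) of an n_items test whose items share a
    common average inter-item correlation, via the Spearman-Brown relation."""
    r = avg_inter_item_corr
    return n_items * r / (1.0 + (n_items - 1) * r)

# Hypothetical test: a 10-item domain subscore vs. the 40-item total,
# assuming an average inter-item correlation of 0.20 throughout.
print(f"10-item subscore reliability: {spearman_brown_reliability(10, 0.20):.2f}")
print(f"40-item total reliability:    {spearman_brown_reliability(40, 0.20):.2f}")
```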
    • 

    corecore