124 research outputs found
Impact of test design, item quality, and item bank size on the psychometric properties of computer-based credentialing examinations.
Computer-based testing has become a common occurrence among credentialing examination agencies. At the same time, selecting a test design is difficult because several are available: parallel-forms, computer-adaptive (CAT), and multi-stage (MST). The merits of these designs interact with exam conditions. These conditions include item quality, bank size, candidate score distribution, placement of the passing score, exam length, and more. In this study three popular computer-based test designs under some common examination conditions were investigated using computer simulation techniques. Item quality and bank size were varied. The results from the study were clear: both item bank size and item quality had a practically significant impact on decision consistency and accuracy. Interestingly, even in nearly ideal situations, the choice of test design was not a factor in the results. Two conclusions seem to follow from the findings: (1) more time and resources should be committed to expanding both the size and quality of item banks, and (2) designs that individualize an exam administration, such as MST and CAT, may not be especially helpful when the primary purpose of an examination is to make pass-fail decisions and conditions are present for using parallel forms of examinations with a target information function that can be centered at the passing score. Obviously, the validity of these conclusions needs to be thoroughly checked with additional simulations and real data
Item Bias Review
Accessed 139,291 times on https://pareonline.net from November 13, 1999 to December 31, 2019.
NAEP State Reports in Mathematics: Valuable Information for Monitoring Education Reform
The National Assessment of Educational Progress (NAEP), a congressionally mandated program, can provide valuable data to educational policymakers in Massachusetts and other New England states about the status of their educational reform initiatives and their performance standards. The three purposes of this article are to describe NAEP and its goals and structure, to present some of the results of the 1992 Mathematics NAEP Assessment as an example of the utility of this national assessment program, and to highlight ways in which background data collected by NAEP can be helpful in interpreting assessment results and monitoring educational reform. The six New England states aspire to performance standards that approximate national and international standards of excellence. NAEP, which provides an excellent database to influence the standard-setting process, therefore should be of considerable interest to policymakers who are serious about setting meaningful performance standards and monitoring the quality of educational progress
Advances in item response theory and applications: an introduction
Test theories can be divided roughly into two categories. The first is classical test theory, which dates back to Spearman’s conception of the observed test score as a composite of true and error components, and which was introduced to psychologists at the beginning of this century. Important milestones in its long and venerable tradition are Gulliksen’s Theory of Mental Tests (1950) and Lord and Novick’s Statistical Theories of Mental Test Scores (1968). The second is item response theory, or latent trait theory, as it has been called until recently. At the present time, item response theory (IRT) is having a major impact on the field of testing. Models derived from IRT are being used to develop tests, to equate scores from nonparallel tests, to investigate item bias, and to report scores, as well as to address many other pressing measurement problems (see, e.g., Hambleton, 1983; Lord, 1980). IRT differs from classical test theory in that it assumes a different relation of the test score to the variable measured by the test. Although there are parallels between models from IRT and psychophysical models formulated around the turn of the century, only in the last 10 years has IRT had any impact on psychometricians and test users. Work by Rasch (1980/1960), Fischer (1974), Birnbaum (1968), Wright and Panchapakesan (1969), Bock (1972), and Lord (1974) has been especially influential in this turnabout; and Lazarsfeld’s pioneering work on latent structure analysis in sociology (Lazarsfeld, 1950; Lazarsfeld & Henry, 1968) has also provided impetus. One objective of this introduction is to review the conceptual differences between classical test theory and IRT. A second objective is to introduce the goals of this special issue on item response theory and the seven papers. Some basic problems with classical test theory are reviewed in the next section. Then, IRT approaches to educational and psychological measurement are presented and compared to classical test theory. The final two sections present the goals for this special issue and an outline of the seven invited papers
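As a rough illustration of the contrast this abstract draws, the two frameworks can be written side by side. The notation is standard textbook notation rather than anything quoted from the paper, and the three-parameter logistic model is chosen here only as one example of an IRT model. Classical test theory treats the observed score X as the sum of a true score T and an error E, while an IRT model expresses the probability of a correct response to item i as a function of a latent trait θ and item parameters (assumed here: discrimination a_i, difficulty b_i, lower asymptote c_i).

```latex
% Classical test theory: observed score = true score + error
X = T + E

% Item response theory (three-parameter logistic model, shown as an example)
P_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}
```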
Effect of Adjusting Pseudo-Guessing Parameter Estimates on Test Scaling When Item Parameter Drift Is Present
In item response theory test scaling/equating with the three-parameter model, the scaling coefficients A and B have no impact on the c-parameter estimates of the test items since the c-parameter estimates are not adjusted in the scaling/equating procedure. The main research question in this study concerned how serious the consequences would be if c-parameter estimates are not adjusted in the test equating procedure when item-parameter drift (IPD) is present. This drift is commonly observed in equating studies and hence, has been the source of considerable research. The results from a series of Monte-Carlo simulation studies conducted under 32 different combinations of conditions showed that some calibration strategies in the study, where the c-parameters were adjusted to be identical across two test forms, resulted in more robust equating performance in the presence of IPD. This paper discusses the practical effectiveness and the theoretical importance of appropriately adjusting c-parameter estimates in equating. Accessed 3,754 times on https://pareonline.net from July 04, 2015 to December 31, 2019.
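The first sentence of this abstract can be made concrete with a minimal sketch. It assumes the usual linear scale transformation for the three-parameter logistic model (θ* = Aθ + B, a* = a/A, b* = Ab + B, c* = c), which is standard textbook material and not code from the paper. The function names and the numeric values are hypothetical, chosen only for illustration. The sketch shows that the response probability is unchanged by the transformation and that c is carried over untouched.

```python
import math


def p3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic form, no 1.7 constant)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def rescale_item(a, b, c, A, B):
    """Move item parameters onto a new scale defined by theta* = A * theta + B.

    Discrimination and difficulty are transformed; the pseudo-guessing
    parameter c is returned unchanged, as described in the abstract.
    """
    return a / A, A * b + B, c


# Hypothetical scaling coefficients and item parameters (illustration only).
A, B = 1.2, -0.3
a, b, c = 1.1, 0.4, 0.2
theta = 0.75

a_new, b_new, c_new = rescale_item(a, b, c, A, B)
p_old = p3pl(theta, a, b, c)
p_new = p3pl(A * theta + B, a_new, b_new, c_new)

print("c before and after rescaling:", c, c_new)
print("probabilities agree:", math.isclose(p_old, p_new, abs_tol=1e-12))
```

Because A and B never touch c, any drift in the c-parameter between two forms passes straight through this kind of transformation, which is the situation the study examines.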
An Application of Item Response Theory to Psychological Test Development
Item response theory (IRT) has become a popular methodological framework for modeling response data from assessments in education and health; however, its use is not widespread among psychologists. This paper aims to provide a didactic application of IRT and to highlight some of its advantages for psychological test development. IRT was applied to two scales (a positive and a negative affect scale) of a self-report test. Respondents were 853 university students (57% women) between the ages of 17 and 35. IRT analyses revealed that the positive affect scale has items with moderate discrimination that measure respondents below the average score more effectively. The negative affect scale also has items with moderate discrimination that measure respondents across the trait continuum, although with much less precision. Some features of IRT are used to show how such results can improve the measurement of the scales. The authors illustrate and emphasize how knowledge of the features of IRT may allow test makers to refine and increase the validity and reliability of other psychological measures
International Test Commission guidelines for test adaptation: A criterion checklist
Background: To improve the quality of test translation and adaptation, and hence the comparability of scores across cultures, the International Test Commission (ITC) proposed a number of guidelines for the adaptation process. Although these guidelines are well-known, they are not implemented as often as they should be. One possible reason for this is the broad scope of the guidelines, which makes them difficult to apply in practice. The goal of this study was therefore to draw up an evaluative criterion checklist that would help test adapters to implement the ITC recommendations and which would serve as a model for assessing the quality of test adaptations. Method: Each ITC guideline was operationalized through a number of criteria. For each criterion, acceptable and excellent levels of accomplishment were proposed. The initial checklist was then reviewed by a panel of 12 experts in testing and test adaptation. The resulting checklist was applied to two different tests by two pairs of independent reviewers. Results: The final evaluative checklist consisted of 29 criteria covering all phases of test adaptation: planning, development, confirmation, administration, score interpretation, and documentation. Conclusions: We believe that the proposed evaluative checklist will help to improve the quality of test adaptation
Profiles of Mathematics Anxiety Among 15-Year-Old Students: A Cross-Cultural Study Using Multi-Group Latent Profile Analysis
Using PISA 2012 data, the present study explored profiles of mathematics anxiety (MA) among 15-year-old students from Finland, Korea, and the United States to determine the similarities and differences of MA across the three national samples by applying a multi-group latent profile analysis (LPA). The major findings were that (a) three MA profiles were found in all three national samples, i.e., Low MA, Mid MA, and High MA profiles, and (b) the percentages of students classified into each of the three MA profiles differed across the Finnish, Korean, and American samples, with the United States having the highest prevalence of High MA and Finland the lowest. Multi-group LPA also provided clear and useful latent profile separation. The High MA profile demonstrated significantly poorer mathematics performance and lower mathematics interest, self-efficacy, and self-concept than the Mid and Low MA profiles. The same differences appeared between the Mid and Low MA profiles. The implications of the findings seem clear: (1) it is possible that there is some relative level of universality in MA among 15-year-old students which is independent of cultural context; and (2) multi-group LPA could be a useful analytic tool for research on the study of classification and cultural differences of MA
A Functional Difficulty and Functional Pain Instrument for Hip and Knee Osteoarthritis
Introduction: The objectives of this study were to develop a functional outcome instrument for hip and knee osteoarthritis research (OA-FUNCTION-CAT) using item response theory (IRT) and computer adaptive test (CAT) methods and to assess its psychometric performance compared to the current standard in the field. Methods: We conducted an extensive literature review, focus groups, and cognitive testing to guide the construction of an item bank consisting of 125 functional activities commonly affected by hip and knee osteoarthritis. We recruited a convenience sample of 328 adults with confirmed hip and/or knee osteoarthritis. Subjects reported their degree of functional difficulty and functional pain in performing each activity in the item bank and completed the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC). Confirmatory factor analyses were conducted to assess scale uni-dimensionality, and IRT methods were used to calibrate the items and examine the fit of the data. We assessed the performance of OA-FUNCTION-CATs of different lengths relative to the full item bank and WOMAC using CAT simulation analyses. Results: Confirmatory factor analyses revealed distinct functional difficulty and functional pain domains. Descriptive statistics for scores from 5-, 10-, and 15-item CATs were similar to those for the full item bank. The 10-item OA-FUNCTION-CAT scales demonstrated a high degree of accuracy compared with the item bank (r = 0.96 and 0.89, respectively). Compared to the WOMAC, both scales covered a broader score range and demonstrated a higher degree of precision at the ceiling and reliability across the range of scores. Conclusions: The OA-FUNCTION-CAT provided superior reliability throughout the score range and improved breadth and precision at the ceiling compared with the WOMAC. Further research is needed to assess whether these improvements carry over into superior ability to measure change
Envelope Determinants of Equine Lentiviral Vaccine Protection
Lentiviral envelope (Env) antigenic variation and associated immune evasion present major obstacles to vaccine development. The concept that Env is a critical determinant for vaccine efficacy is well accepted; however, defined correlates of protection associated with Env variation have yet to be determined. We reported an attenuated equine infectious anemia virus (EIAV) vaccine study that directly examined the effect of lentiviral Env sequence variation on vaccine efficacy. The study identified a significant, inverse, linear correlation between vaccine efficacy and increasing divergence of the challenge virus Env gp90 protein compared to the vaccine virus gp90. The report demonstrated approximately 100% protection of immunized ponies from disease after challenge by virus with a homologous gp90 (EV0), and roughly 40% protection against challenge by virus (EV13) with a gp90 13% divergent from the vaccine strain. In the current study we examine whether the protection observed when challenging with the EV0 strain could be conferred to animals via chimeric challenge viruses between the EV0 and EV13 strains, allowing for mapping of protection to specific Env sequences. Viruses containing the EV13 proviral backbone and selected domains of the EV0 gp90 were constructed and in vitro and in vivo infectivity examined. Vaccine efficacy studies indicated that homology between the vaccine strain gp90 and the N-terminus of the challenge strain gp90 was capable of inducing immunity that resulted in significantly lower levels of post-challenge virus and significantly delayed the onset of disease. However, a homologous N-terminal region alone inserted in the EV13 backbone could not impart the 100% protection observed with the EV0 strain. Data presented here denote the complicated and potentially contradictory relationship between in vitro virulence and in vivo pathogenicity. The study highlights the importance of structural conformation for immunogens and emphasizes the need for antibody binding, not neutralizing, assays that correlate with vaccine protection. © 2013 Craigo et al
- …