179 research outputs found

    Can value-added measures of teacher performance be trusted?

    We investigate whether commonly used value-added estimation strategies can produce accurate estimates of teacher effects. We estimate teacher effects in simulated student achievement data sets that mimic plausible types of student grouping and teacher assignment scenarios. No single method accurately captures true teacher effects in all scenarios, and the potential for misclassifying teachers as high- or low-performing can be substantial. Misspecifying dynamic relationships can exacerbate estimation problems. However, some estimators are more robust across scenarios and better suited to estimating teacher effects than others.

    Optimal item pool design for computerized adaptive tests with polytomous items using GPCM

    Computerized adaptive testing (CAT) is a testing procedure that improves measurement precision and increases test efficiency. An item pool with optimal characteristics is the foundation for a CAT program to achieve those desirable psychometric features. This study proposed a method to design an optimal item pool for tests with polytomous items using the generalized partial credit model (GPCM). It extended a method for approximating optimality so that polytomous items could be described succinctly for the purpose of pool design. Optimal item pools were generated using CAT simulations with and without the practical constraints of content balancing and item exposure control. The performance of the item pools was evaluated against an operational item pool. The results indicated that the item pools designed with stratification based on discrimination parameters performed well, making efficient use of the less discriminating items within the target accuracy levels. The implications for developing item pools are also discussed.
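    For readers unfamiliar with the model named in this abstract, the following is a minimal sketch of the standard GPCM category probabilities for a single item. It is not taken from the paper: the function names `gpcm_probs` and `expected_score` and the parameter values in the usage note are illustrative assumptions, and the paper's pool-design method is not reproduced here.

```python
import math


def gpcm_probs(theta, a, steps):
    """Category probabilities for one item under the standard GPCM.

    theta: latent trait value
    a:     item discrimination
    steps: step parameters b_1..b_m (hypothetical values in the usage note)
    Returns a list of m + 1 probabilities for scores 0..m.
    """
    # Cumulative sums of a * (theta - b_v); the score-0 term is 0 by convention.
    cum = [0.0]
    for b in steps:
        cum.append(cum[-1] + a * (theta - b))
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(cum)
    expo = [math.exp(c - top) for c in cum]
    total = sum(expo)
    return [e / total for e in expo]


def expected_score(theta, a, steps):
    """Expected item score at theta: sum of k * P(X = k)."""
    return sum(k * p for k, p in enumerate(gpcm_probs(theta, a, steps)))
```

    As a usage example, `gpcm_probs(0.5, 1.0, [-1.0, 0.0, 1.0])` returns four probabilities that sum to one, and `expected_score` increases with `theta` for a fixed item.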

    Does the Precision and Stability of Value-Added Estimates of Teacher Performance Depend on the Types of Students They Serve?

    This paper investigates how the precision and stability of a teacher's value-added estimate relate to the characteristics of the teacher's students. Using a large administrative data set and a variety of teacher value-added estimators, it finds that the stability over time of teacher value-added estimates can depend on the previous achievement level of a teacher's students. The differences are large in magnitude and statistically significant. The year-to-year stability of teacher value-added estimates is typically 25% to more than 50% larger for teachers serving initially higher performing students than for teachers with initially lower performing students. In addition, some differences are detected even when the number of student observations is artificially set to the same level and the data are pooled across two years to compute teacher value-added. Finally, the paper offers a policy simulation which demonstrates that teachers who face students with certain characteristics may be differentially likely to receive sanctions in a high-stakes policy based on value-added estimates and more likely to see their estimates vary from year to year due to low stability.

    A Comparison of Growth Percentile and Value-Added Models of Teacher Performance

    School districts and state departments of education frequently must choose among a variety of methods for estimating teacher quality. This paper examines under what circumstances the choice between estimators of teacher quality is important. We examine estimates derived from growth percentile measures and estimates derived from commonly used value-added estimators. Using simulated data, we examine how well the estimators can rank teachers and avoid misclassification errors under a variety of assignment scenarios of teachers to students. We find that growth percentile measures perform worse than value-added measures that control for prior year student test scores and control for teacher fixed effects when assignment of students to teachers is nonrandom. In addition, using actual data from a large diverse anonymous state, we find evidence that growth percentile measures are less correlated with value-added measures with teacher fixed effects when there is evidence of nonrandom grouping of students in schools. This evidence suggests that the choice between estimators is most consequential under nonrandom assignment of teachers to students, and that value-added measures controlling for teacher fixed effects may be better suited to estimating teacher quality in this case.

    How do principals assign students to teachers? Finding evidence in administrative data and the implications for value-added

    The federal government's Race to the Top competition has promoted the adoption of test-based performance measures as a component of teacher evaluations throughout many states, but the validity of these measures has been controversial among researchers and widely contested by teachers' unions. A key concern is the extent to which nonrandom sorting of students to teachers may bias the results and lead to a misclassification of teachers as high or low performing. In light of this, it is important to assess the extent to which evidence of sorting can be found in the large administrative data sets used for value-added model (VAM) estimation. Using a large longitudinal data set from an anonymous state, we find evidence that a nontrivial amount of sorting exists, particularly sorting based on prior test scores, and that the extent of sorting varies considerably across schools, a fact obscured by the types of aggregate sorting indices developed in prior research. We also find that VAM estimation is sensitive to the presence of nonrandom sorting. There is less agreement across estimation approaches regarding a particular teacher's rank in the distribution of estimated effectiveness when schools engage in sorting.

    A proof of principle for using adaptive testing in Routine Outcome Monitoring: the efficiency of the Mood and Anxiety Symptoms Questionnaire - Anhedonic Depression CAT

    Background: In Routine Outcome Monitoring (ROM) there is a high demand for short assessments. Computerized Adaptive Testing (CAT) is a promising method for efficient assessment. In this article, the efficiency of a CAT version of the Mood and Anxiety Symptom Questionnaire - Anhedonic Depression scale (MASQ-AD) for use in ROM was scrutinized in a simulation study. Methods: The responses of a large sample of patients (N = 3,597) obtained through ROM were used. The psychometric evaluation showed that the items met the requirements for CAT. In the simulations, CATs with several measurement precision requirements were run on the item responses as if they had been collected adaptively. Results: CATs employing only a small number of items gave results which, both in terms of depression measurement and criterion validity, were only marginally different from the results of a full MASQ-AD assessment. Conclusions: It was concluded that CAT substantially improved the efficiency of the MASQ-AD questionnaire. The strengths and limitations of the application of CAT in ROM are discussed.

    Measuring the ICF components of impairment, activity limitation and participation restriction: an item analysis using classical test theory and item response theory

    The International Classification of Functioning, Disability and Health (ICF) proposes three main health outcomes, Impairment (I), Activity Limitation (A) and Participation Restriction (P), but good measures of these constructs are needed. The aim of this study was to use both Classical Test Theory (CTT) and Item Response Theory (IRT) methods to carry out an item analysis to improve measurement of these three components in patients having joint replacement surgery mainly for osteoarthritis (OA). A geographical cohort of patients about to undergo lower limb joint replacement was invited to participate. Five hundred and twenty-four patients completed ICF items that had been previously identified as measuring only a single ICF construct in patients with osteoarthritis. There were 13 I, 26 A and 20 P items. The SF-36 was used to explore the construct validity of the resultant I, A and P measures. The CTT and IRT analyses were run separately to identify items for inclusion or exclusion in the measurement of each construct. The results from both analyses were compared and contrasted. Overall, the item analysis resulted in the removal of 4 I items, 9 A items and 11 P items. CTT and IRT identified the same 14 items for removal, with CTT additionally excluding 3 items, and IRT a further 7 items. In a preliminary exploration of reliability and validity, the new measures appeared acceptable. New measures were developed that reflect the ICF components of Impairment, Activity Limitation and Participation Restriction for patients with advanced arthritis. The resulting Aberdeen IAP measures (Ab-IAP), comprising I (Ab-I, 9 items), A (Ab-A, 17 items), and P (Ab-P, 9 items), met the criteria of conventional psychometric (CTT) analyses and the additional criteria (information and discrimination) of IRT. The use of both methods was more informative than the use of only one of these methods. Thus combining CTT and IRT appears to be a valuable tool in the development of measures.

    Linking tests of English for academic purposes to the CEFR: the score user’s perspective

    The Common European Framework of Reference for Languages (CEFR) is widely used in setting language proficiency requirements, including for international students seeking access to university courses taught in English. When different language examinations have been related to the CEFR, the process is claimed to help score users, such as university admissions staff, to compare and evaluate these examinations as tools for selecting qualified applicants. This study analyses the linking claims made for four internationally recognised tests of English widely used in university admissions. It uses the Council of Europe’s (2009) suggested stages of specification, standard setting, and empirical validation to frame an evaluation of the extent to which, in this context, the CEFR has fulfilled its potential to “facilitate comparisons between different systems of qualifications.” Findings show that testing agencies make little use of CEFR categories to explain test content; represent the relationships between their tests and the framework in different terms; and arrive at conflicting conclusions about the correspondences between test scores and CEFR levels. This raises questions about the capacity of the CEFR to communicate competing views of a test construct within a coherent overarching structure.