179 research outputs found

Can value-added measures of teacher performance be trusted?

Author: Guarino Cassandra
Reckase Mark D.
Wooldridge Jeffrey M.
Publication venue: Bonn: Institute for the Study of Labor (IZA)
Publication date: 01/01/2012

We investigate whether commonly used value-added estimation strategies can produce accurate estimates of teacher effects. We estimate teacher effects in simulated student achievement data sets that mimic plausible types of student grouping and teacher assignment scenarios. No one method accurately captures true teacher effects in all scenarios, and the potential for misclassifying teachers as high- or low-performing can be substantial. Misspecifying dynamic relationships can exacerbate estimation problems. However, some estimators are more robust across scenarios and better suited to estimating teacher effects than others.

EconStor (ZBW Kiel)

Optimal item pool design for computerized adaptive tests with polytomous items using GPCM

Author: Xuechun Zhou
Mark D Reckase
Publication date: 01/01/2014

Computerized adaptive testing (CAT) is a testing procedure with advantages in improving measurement precision and increasing test efficiency. An item pool with optimal characteristics is the foundation for a CAT program to achieve those desirable psychometric features. This study proposed a method to design an optimal item pool for tests with polytomous items using the generalized partial credit model (GPCM). It extended a method for approximating optimality with polytomous items being described succinctly for the purpose of pool design. Optimal item pools were generated using CAT simulations with and without practical constraints of content balancing and item exposure control. The performances of the item pools were evaluated against an operational item pool. The results indicated that the item pools designed with stratification based on discrimination parameters performed well with an efficient use of the less discriminative items within the target accuracy levels. The implications for developing item pools are also discussed.

CiteSeerX

Does the Precision and Stability of Value-Added Estimates of Teacher Performance Depend on the Types of Students They Serve?

Author: Guarino Cassandra
Reckase Mark D.
Stacy Brian
Wooldridge Jeffrey M.
Publication venue: Bonn: Institute for the Study of Labor (IZA)
Publication date: 01/01/2013

This paper investigates how the precision and stability of a teacher's value-added estimate relate to the characteristics of the teacher's students. Using a large administrative data set and a variety of teacher value-added estimators, it finds that the stability over time of teacher value-added estimates can depend on the previous achievement level of a teacher's students. The differences are large in magnitude and statistically significant. The year-to-year stability level of teacher value-added estimates is typically 25% to more than 50% larger for teachers serving initially higher performing students compared to teachers with initially lower performing students. In addition, some differences are detected even when the number of student observations is artificially set to the same level and the data are pooled across two years to compute teacher value-added. Finally, the paper offers a policy simulation which demonstrates that teachers who face students with certain characteristics may be differentially likely to be the recipient of sanctions in a high stakes policy based on value-added estimates and more likely to see their estimates vary from year to year due to low stability.

EconStor (ZBW Kiel)

A Comparison of Growth Percentile and Value-Added Models of Teacher Performance

Author: Guarino Cassandra
Reckase Mark D.
Stacy Brian
Wooldridge Jeffrey M.
Publication venue: Bonn: Institute for the Study of Labor (IZA)
Publication date: 01/01/2014

School districts and state departments of education frequently must choose between a variety of methods for estimating teacher quality. This paper examines under what circumstances the decision between estimators of teacher quality is important. We examine estimates derived from growth percentile measures and estimates derived from commonly used value-added estimators. Using simulated data, we examine how well the estimators can rank teachers and avoid misclassification errors under a variety of assignment scenarios of teachers to students. We find that growth percentile measures perform worse than value-added measures that control for prior year student test scores and control for teacher fixed effects when assignment of students to teachers is nonrandom. In addition, using actual data from a large diverse anonymous state, we find evidence that growth percentile measures are less correlated with value-added measures with teacher fixed effects when there is evidence of nonrandom grouping of students in schools. This evidence suggests that the choice between estimators is most consequential under nonrandom assignment of teachers to students, and that value-added measures controlling for teacher fixed effects may be better suited to estimating teacher quality in this case.

EconStor (ZBW Kiel)

How do principals assign students to teachers? Finding evidence in administrative data and the implications for value-added

Author: Dieterle Steven G.
Guarino Cassandra
Reckase Mark D.
Wooldridge Jeffrey M.
Publication venue: Bonn: Institute for the Study of Labor (IZA)
Publication date: 01/01/2012

The federal government's Race to the Top competition has promoted the adoption of test-based performance measures as a component of teacher evaluations throughout many states, but the validity of these measures has been controversial among researchers and widely contested by teachers' unions. A key concern is the extent to which nonrandom sorting of students to teachers may bias the results and lead to a misclassification of teachers as high or low performing. In light of this, it is important to assess the extent to which evidence of sorting can be found in the large administrative data sets used for VAM estimation. Using a large longitudinal data set from an anonymous state, we find evidence that a nontrivial amount of sorting exists - particularly sorting based on prior test scores - and that the extent of sorting varies considerably across schools, a fact obscured by the types of aggregate sorting indices developed in prior research. We also find that VAM estimation is sensitive to the presence of nonrandom sorting. There is less agreement across estimation approaches regarding a particular teacher's rank in the distribution of estimated effectiveness when schools engage in sorting.

EconStor (ZBW Kiel)

Modeling Judgments in the Angoff and Contrasting-Groups Method of Standard Setting

Author: Angoff W. H.
Brandon P. R.
Gelman A.
Gilks W.
Haertel E. H.
Jaeger R. M.
Kane M. T.
Livingston S. A.
Longford N. T.
Lord F. M.
Meskauskas J. A.
Reckase M. D.
Tanner M. A.
Zieky M. J.
Publication venue: 'Wiley'

Crossref

Nonparametric IRT analysis of Quality-of-Life Scales and its application to the World Health Organization Quality-of-Life Scale (WHOQOL-Bref)

Crossref

PubMed Central

EUR Research Repository

Tilburg University Repository

A proof of principle for using adaptive testing in Routine Outcome Monitoring: the efficiency of the Mood and Anxiety Symptoms Questionnaire - Anhedonic Depression CAT

Background: In Routine Outcome Monitoring (ROM) there is a high demand for short assessments. Computerized Adaptive Testing (CAT) is a promising method for efficient assessment. In this article, the efficiency of a CAT version of the Mood and Anxiety Symptom Questionnaire - Anhedonic Depression scale (MASQ-AD) for use in ROM was scrutinized in a simulation study. Methods: The responses of a large sample of patients (N = 3,597) obtained through ROM were used. The psychometric evaluation showed that the items met the requirements for CAT. In the simulations, CATs with several measurement precision requirements were run on the item responses as if they had been collected adaptively. Results: CATs employing only a small number of items gave results which, both in terms of depression measurement and criterion validity, were only marginally different from the results of a full MASQ-AD assessment. Conclusions: It was concluded that CAT improved the efficiency of the MASQ-AD questionnaire very much. The strengths and limitations of the application of CAT in ROM are discussed.

Crossref

VU Research Portal

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Leiden University Scholary Publications

Measuring the ICF components of impairment, activity limitation and participation restriction: an item analysis using classical test theory and item response theory

Author: A Cieza
A Williams
AJ Carr
B Pollard
BB Reeve
Beth Pollard
BG Tabachnick
CA McHorney
D Andrich
D Thissen
Diane Dixon
DJ Cooke
DL Patrick
EF Sinar
J Dawson
J Singh
JE Ware
JF Fries
JN Insall
L Prieto
LJ Cronbach
M Akai
M Johnston
M Reckase
M Weigl
MAM Gignac
Marie Johnston
MG Lequesne
N Bellamy
Paul Dieppe
PM Fayers
R Hays
R Lindeboom
R Wilkie
RB Fletcher
RD Hays
RF Meenan
RH Harwood
RJM Perenboom
RK Hambleton
RO Anderson
SE Embretson
SM Downing
SM Haley
W Kuyken
WH Harris
WHO
WHOQOL group
Publication venue: BioMed Central
Publication date: 01/01/2009

The International Classification of Functioning, Disability and Health (ICF) proposes three main health outcomes, Impairment (I), Activity Limitation (A) and Participation Restriction (P), but good measures of these constructs are needed. The aim of this study was to use both Classical Test Theory (CTT) and Item Response Theory (IRT) methods to carry out an item analysis to improve measurement of these three components in patients having joint replacement surgery mainly for osteoarthritis (OA). A geographical cohort of patients about to undergo lower limb joint replacement was invited to participate. Five hundred and twenty-four patients completed ICF items that had been previously identified as measuring only a single ICF construct in patients with osteoarthritis. There were 13 I, 26 A and 20 P items. The SF-36 was used to explore the construct validity of the resultant I, A and P measures. The CTT and IRT analyses were run separately to identify items for inclusion or exclusion in the measurement of each construct. The results from both analyses were compared and contrasted. Overall, the item analysis resulted in the removal of 4 I items, 9 A items and 11 P items. CTT and IRT identified the same 14 items for removal, with CTT additionally excluding 3 items, and IRT a further 7 items. In a preliminary exploration of reliability and validity, the new measures appeared acceptable. New measures were developed that reflect the ICF components of Impairment, Activity Limitation and Participation Restriction for patients with advanced arthritis. The resulting Aberdeen IAP measures (Ab-IAP) comprising I (Ab-I, 9 items), A (Ab-A, 17 items), and P (Ab-P, 9 items) met the criteria of conventional psychometric (CTT) analyses and the additional criteria (information and discrimination) of IRT. The use of both methods was more informative than the use of only one of these methods. Thus combining CTT and IRT appears to be a valuable tool in the development of measures.

Crossref

University of Strathclyde Institutional Repository

Stirling Online Research Repository (RIOXX)

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Stirling Online Research Repository

Linking tests of English for academic purposes to the CEFR: the score user’s perspective

The Common European Framework of Reference for Languages (CEFR) is widely used in setting language proficiency requirements, including for international students seeking access to university courses taught in English. When different language examinations have been related to the CEFR, the process is claimed to help score users, such as university admissions staff, to compare and evaluate these examinations as tools for selecting qualified applicants. This study analyses the linking claims made for four internationally recognised tests of English widely used in university admissions. It uses the Council of Europe’s (2009) suggested stages of specification, standard setting, and empirical validation to frame an evaluation of the extent to which, in this context, the CEFR has fulfilled its potential to “facilitate comparisons between different systems of qualifications.” Findings show that testing agencies make little use of CEFR categories to explain test content; represent the relationships between their tests and the framework in different terms; and arrive at conflicting conclusions about the correspondences between test scores and CEFR levels. This raises questions about the capacity of the CEFR to communicate competing views of a test construct within a coherent overarching structure.

Crossref

University of Bedfordshire Repository