
Rating the methodological quality in systematic reviews of studies on measurement properties: a scoring system for the COSMIN checklist

Author: A. D. Furlan
Caroline B. Terwee
CB Terwee
Dirk L. Knol
GH Guyatt
H Wind
HCW de Vet
Henrica C. W. de Vet
J Marinus
J Stevens
JL Brozek
JM Valderas
KL Haywood
LB Mokkink
Lex M. Bouter
Lidwine B. Mokkink
Raymond W. J. G. Ostelo
S Alla
Scientific Advisory Committee of the Medical Outcomes Trust
Publication venue: Springer Netherlands
Publication date: 01/01/2011

Background: The COSMIN checklist is a standardized tool for assessing the methodological quality of studies on measurement properties. It contains 9 boxes, each dealing with one measurement property, with 5-18 items per box about design aspects and statistical methods. Our aim was to develop a scoring system for the COSMIN checklist to calculate quality scores per measurement property when using the checklist in systematic reviews of measurement properties. Methods: The scoring system was developed based on discussions among experts and testing of the scoring system on 46 articles from a systematic review. Four response options were defined for each COSMIN item (excellent, good, fair, and poor). A quality score per measurement property is obtained by taking the lowest rating of any item in a box ("worst score counts"). Results: Specific criteria for excellent, good, fair, and poor quality for each COSMIN item are described. In defining the criteria, the "worst score counts" algorithm was taken into consideration. This means that only fatal flaws were defined as poor quality. The scores of the 46 articles show how the scoring system can be used to provide an overview of the methodological quality of studies included in a systematic review of measurement properties. Conclusions: Based on experience in testing this scoring system on 46 articles, the COSMIN checklist with the proposed scoring system seems to be a useful tool for assessing the methodological quality of studies included in systematic reviews of measurement properties. © The Author(s) 2011
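The "worst score counts" rule described in the abstract can be sketched in a few lines. This is only an illustration: the `Rating` enum, the `box_score` function, and the example ratings are hypothetical names and values, not part of the COSMIN materials.

```python
from enum import IntEnum


class Rating(IntEnum):
    """Four response options per item, ordered from worst to best."""
    POOR = 0
    FAIR = 1
    GOOD = 2
    EXCELLENT = 3


def box_score(item_ratings):
    """Return the lowest rating among the items of one box.

    This is the 'worst score counts' rule: a single poor item
    makes the whole box poor.
    """
    ratings = list(item_ratings)
    if not ratings:
        raise ValueError("a box needs at least one rated item")
    return min(ratings)


# Hypothetical ratings for the items of one box in one study.
example_box = [Rating.EXCELLENT, Rating.GOOD, Rating.FAIR, Rating.EXCELLENT]
print(box_score(example_box).name)
```

Because the minimum decides the outcome, the criteria for "poor" have to be reserved for fatal flaws, which is the design consideration the abstract mentions.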

A multivariate hierarchical Bayesian approach to measuring agreement in repeated measurement method comparison studies

Author: A Gelman
B Carstensen
DJ Lunn
G Lu
H Goldstein
HC de Vet
J Ludbrook
JM Bland
JO Berger
KJ Rothman
M Oliver
MW Woolrich
P Congdon
Philip J Schluter
RR Luiz
SA White
SP Brooks
Stata Corporation
Publication venue: BioMed Central
Publication date: 01/01/2009

Background. Assessing agreement in method comparison studies depends on two fundamentally important components: validity (the between-method agreement) and reproducibility (the within-method agreement). The Bland-Altman limits of agreement technique is one of the favoured approaches in medical literature for assessing between-method validity. However, few researchers have adopted this approach for the assessment of both validity and reproducibility. This may be partly due to a lack of flexible, easily implemented and readily available statistical machinery to analyse repeated measurement method comparison data. Methods. Adopting the Bland-Altman framework, but using Bayesian methods, we present this statistical machinery. Two multivariate hierarchical Bayesian models are advocated, one which assumes that the underlying values for subjects remain static (exchangeable replicates) and one which assumes that the underlying values can change between repeated measurements (non-exchangeable replicates). Results. We illustrate the salient advantages of these models using two separate datasets that have been previously analysed and presented: (i) assuming static underlying values, analysed using both multivariate hierarchical Bayesian models, and (ii) assuming each subject's underlying value is a continually changing quantity, analysed using the non-exchangeable replicate multivariate hierarchical Bayesian model. Conclusion. These easily implemented models allow for full parameter uncertainty and simultaneous method comparison, handle unbalanced or missing data, and provide estimates and credible regions for all the parameters of interest. Computer code for the analyses is also presented, provided in the freely available and currently cost-free software package WinBUGS.
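For orientation, the classical (non-Bayesian) Bland-Altman limits of agreement that this abstract builds on can be sketched as follows. This is not the authors' hierarchical Bayesian model and it ignores repeated measurements; the function name `limits_of_agreement` and the sample readings are illustrative assumptions.

```python
from statistics import mean, stdev


def limits_of_agreement(method_a, method_b, z=1.96):
    """Classical Bland-Altman limits for paired single measurements.

    Returns (lower limit, bias, upper limit), where bias is the mean
    difference and the limits are bias -/+ z times the SD of the differences.
    """
    if len(method_a) != len(method_b):
        raise ValueError("both methods need one reading per subject")
    if len(method_a) < 2:
        raise ValueError("at least two subjects are required")
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias - z * sd, bias, bias + z * sd


# Hypothetical paired readings from two measurement methods.
lower, bias, upper = limits_of_agreement([10, 12, 14, 16], [9, 13, 13, 17])
print(f"bias={bias:.2f}, limits=({lower:.2f}, {upper:.2f})")
```

The point estimate above carries no uncertainty for the limits themselves, which is one of the gaps the Bayesian models in the abstract are meant to address.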

UQ eSpace (University of Queensland)

The COSMIN checklist for assessing the methodological quality of studies on measurement properties of health status measurement instruments: an international Delphi study

Author: AP Verhagen
B Kirshner
C Powell
C Veenhof
Caroline B. Terwee
CB Terwee
Dirk L. Knol
Donald L. Patrick
HCW de Vet
Henrica C. W. de Vet
JC Nunnally
JM Bland
JM Valderas
Jordi Alonso
KN Lohr
LB Mokkink
LE Pfennings
Lex M. Bouter
Lidwine B. Mokkink
M Marshall
MR de Boer
Paul W. Stratford
S Evers
US Department of Health and Human Services FDA Center for Drug Evaluation and Research
Publication venue: Springer Netherlands
Publication date: 01/01/2010

BACKGROUND: The aim of the COSMIN study (COnsensus-based Standards for the selection of health status Measurement INstruments) was to develop a consensus-based checklist to evaluate the methodological quality of studies on measurement properties. We present the COSMIN checklist and the agreement of the panel on the items of the checklist. METHODS: A four-round Delphi study was performed with international experts (psychologists, epidemiologists, statisticians and clinicians). Of the 91 invited experts, 57 agreed to participate (63%). Panel members were asked to rate their (dis)agreement with each proposal on a five-point scale. Consensus was considered to be reached when at least 67% of the panel members indicated 'agree' or 'strongly agree'. RESULTS: Consensus was reached on the inclusion of the following measurement properties: internal consistency, reliability, measurement error, content validity (including face validity), construct validity (including structural validity, hypotheses testing and cross-cultural validity), criterion validity, responsiveness, and interpretability. The latter was not considered a measurement property. The panel also reached consensus on how these properties should be assessed. CONCLUSIONS: The resulting COSMIN checklist could be useful when selecting a measurement instrument, peer-reviewing a manuscript, designing or reporting a study on measurement properties, or for educational purposes. This study was financially supported by the EMGO Institute for Health and Care Research, VU University Medical Center, Amsterdam, and the Anna Foundation, Leiden, The Netherlands.
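The consensus rule stated in the methods (at least 67% of panel members choosing 'agree' or 'strongly agree') can be sketched as below. The numeric coding of the five-point scale (4 = 'agree', 5 = 'strongly agree') and the function names are assumptions made for illustration; the abstract does not give a coding.

```python
# Assumed coding of the five-point scale: 4 = 'agree', 5 = 'strongly agree'.
AGREE_CODES = {4, 5}


def agreement_share(ratings):
    """Fraction of panel members who rated 'agree' or 'strongly agree'."""
    if not ratings:
        raise ValueError("no ratings supplied")
    return sum(1 for r in ratings if r in AGREE_CODES) / len(ratings)


def consensus_reached(ratings, threshold=0.67):
    """Apply the 'at least 67% agree' rule to one proposal."""
    return agreement_share(ratings) >= threshold


# Hypothetical ratings from a ten-member panel for one proposal.
print(consensus_reached([5, 4, 4, 5, 4, 4, 5, 2, 3, 1]))

# Participation rate reported in the abstract: 57 of 91 invited experts.
print(round(100 * 57 / 91))
```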

Keele Research Repository

UPF Digital Repository

Evaluation of the measurement properties of the Manchester foot pain and disability index

Author: AP Garrow
Babette C van der Zwaard
BC de Morais
BC van der Zwaard
Berend Terluin
Caroline B Terwee
CB Terwee
CE Cook
CJ Bowen
DAWM van der Windt
DE Beaton
E Roddy
E Thomas
Edward Roddy
F Benvenuti
G Peat
HB Menz
HCW de Vet
Henriette E van der Horst
HSJ Picavet
JB Schreiber
KJ Gorter
LB Mokkink
LT Hu
MA Petersen
MM Kuyvenhoven
P Kaoulla
Petra JM Elders
S Muller
SW Choi
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2014

BACKGROUND: The Manchester Foot Pain and Disability Index (MFPDI, 19 items) was developed to measure functional limitations, pain and appearance for patients with foot pain and is frequently used in both observational studies and randomised controlled trials. A Dutch version of the MFPDI was developed. The aims of this study were to evaluate all the measurement properties for the Dutch version of the MFPDI and to evaluate comparability to the original version. METHOD: The MFPDI was translated into Dutch using a forward/backward translation process. The dimensionality was evaluated using exploratory and confirmatory factor analysis. Measurement properties were evaluated per subscale according to the COSMIN taxonomy, consisting of: reliability (internal consistency, test-retest reliability and measurement error), validity (structural validity, content validity and cross-cultural validity comparing the Dutch version to the English version), responsiveness and interpretation. RESULTS: The questionnaire consists of three scales, measuring foot function, foot pain and perception. The reliability of the foot function scale is acceptable (Cronbach’s α > 0.7, ICC = 0.7, SEM = 2.2 on a 0-18 scale). The construct validity of the function and pain scales was confirmed, and only the pain scale contains one item with differential item functioning (DIF). The responsiveness of the function and pain scales is moderate when compared to anchor questions. CONCLUSION: Results using the Dutch MFPDI version can be compared to results using the original version. The foot function sub-scale (items 1-9) is a reliable and valid sub-scale. This study indicates that the use of the MFPDI as a longitudinal instrument might be problematic for measuring change in musculoskeletal foot pain due to moderate responsiveness.

The COSMIN checklist for evaluating the methodological quality of studies on measurement properties: A clarification of its content

Author: C Powell
CA McHorney
Caroline B Terwee
CB Terwee
CM Goodman
DA Revicki
DG Altman
Dirk L Knol
DL Streiner
Donald L Patrick
DW Levine
F Hasson
FJ Floyd
GH Guyatt
GJ Van der Heijden
GR Norman
H De Vet
HC De Vet
Henrica CW de Vet
I McDowell
IB Wilson
J Cohen
JM Cortina
Jordi Alonso
LB Mokkink
Lex M Bouter
Lidwine B Mokkink
LJ Cronbach
ME Strauss
MR De Boer
MR Stockler
Paul W Stratford
PM Fayers
S Keeney
S Messick
World Health Organization
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The COSMIN checklist (COnsensus-based Standards for the selection of health status Measurement INstruments) was developed in an international Delphi study to evaluate the methodological quality of studies on measurement properties of health-related patient-reported outcomes (HR-PROs). In this paper, we explain our choices for the design requirements and preferred statistical methods for which no evidence is available in the literature or on which the Delphi panel members had substantial discussion. Methods: The issues described in this paper are a reflection of the Delphi process in which 43 panel members participated. Results: The topics discussed are internal consistency (relevance for reflective and formative models, and distinction with unidimensionality), content validity (judging relevance and comprehensiveness), hypotheses testing as an aspect of construct validity (specificity of hypotheses), criterion validity (relevance for PROs), and responsiveness (concept and relation to validity, and (in)appropriate measures). Conclusions: We expect that this paper will contribute to a better understanding of the rationale behind the items, thereby enhancing the acceptance and use of the COSMIN checklist.

UPF Digital Repository

The size of the treatment effect: do patients and proxies agree?

Author: AD Sadovnick
Alan J Thompson
AS Pickard
Bernard MJ Uitdehaag
Chris H Polman
EL Hoogervorst
FA van der Linden
FD Lublin
Femke AH van der Linden
G Guyatt
GH Guyatt
GR Norman
HC de Vet
Henk M van der Ploeg
J Cohen
J Hobart
J Lee
JA Husted
JC Hobart
JC Nunnally
Jeremy C Hobart
JM Bland
Jolijn J Kragt
KO McGraw
L Costelloe
M Korostil
M Ross
MA Sprangers
Martin Klein
MP Amato
SM Rao
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: This study examined whether MS patients and proxy respondents agreed on change in disease impact, which was induced by treatment. This may be of interest in situations when patients suffer from limitations that interfere with reliable self-assessment, such as cognitive impairment. Methods: MS patients and proxies completed the Multiple Sclerosis Impact Scale (MSIS-29) before and after intravenous steroid treatment. Analyses focused on patient-proxy agreement between MSIS-29 change scores. Transition ratings were used to measure the patient's judgement of change and whether this change was reflected in the MSIS-29 change of patients and proxies. Receiver operating characteristic (ROC) analyses were also performed to examine the diagnostic properties of the MSIS-29 when completed by patients and proxies. Results: 42 patients and proxy respondents completed the MSIS-29 at baseline and follow-up. Patient-proxy differences between change scores on the physical and psychological MSIS-29 subscales were quite small, although large variability was found. The direction of mean change was in concordance with the transition ratings of the patients. Results of the ROC analyses of the MSIS-29 were similar when completed by patients (physical scale: AUC = 0.79, 95% CI: 0.65 - 0.93 and 0.66, 95% CI: 0.48 - 0.84 for the psychological scale) and proxies (physical scale: 0.80, 95% CI: 0.72 - 0.96 and 0.71, 95% CI: 0.56 - 0.87 for the psychological scale). Conclusion: Although the results need to be further explored in larger samples, these results do point towards possible use of proxy respondents to assess patient-perceived treatment change at the group level.
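The AUC values reported here summarise how well change scores separate two groups. A minimal sketch of the AUC as a pairwise-comparison statistic follows; the grouping (for example, patients whose transition rating indicates improvement versus those whose rating does not), the function name `auc`, and the sample change scores are assumptions for illustration, not the study's data or software.

```python
def auc(scores_positive, scores_negative):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen 'positive' subject has a
    higher score than a randomly chosen 'negative' subject, with ties
    counted as one half.
    """
    if not scores_positive or not scores_negative:
        raise ValueError("both groups need at least one score")
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


# Hypothetical change scores for improved vs. not-improved patients.
print(round(auc([3, 5, 7], [1, 4, 5]), 2))
```

An AUC of 0.5 means the change scores do not separate the groups at all; 1.0 means perfect separation.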

PEARL (Univ. of Plymouth)

UCL Discovery

Reproducibility and responsiveness of the Symptom Severity Scale and the hand and finger function subscale of the Dutch arthritis impact measurement scales (Dutch-AIMS2-HFF) in primary care patients with wrist or hand problems

Author: AF De Bruin
AW Evers
CA McHorney
Caroline B Terwee
CB Terwee
Daniëlle AWM van der Windt
DL Streiner
DW Levine
GR Norman
HC De Vet
HCW De Vet
JA Hanley
JC Nunnally
JM Bland
KO McGraw
KW Wyrwich
Marinda N Spies-Dorgelo
MR De Boer
N Van der Roer
R Jaeschke
RA Deyo
RF Meenan
RP Riemsma
Wim AB Stalman
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: To determine the clinimetric properties of two questionnaires assessing symptoms (Symptom Severity Scale) and physical functioning (hand and finger function subscale of the AIMS2) in a Dutch primary care population. METHODS: The first 84 participants in a 1-year follow-up study on the diagnosis and prognosis of hand and wrist problems completed the Symptom Severity Scale and the hand and finger function subscale of the Dutch-AIMS2 twice within 1 to 2 weeks. The data were used to assess test-retest reliability (ICC) and smallest detectable change (SDC, based on the standard error of measurement (SEM)). To assess responsiveness, changes in scores between baseline and the 3 month follow-up were related to an external criterion to estimate the minimal important change (MIC). We calculated the group size needed to detect the MIC beyond measurement error. RESULTS: The ICC for the Symptom Severity Scale was 0.68 (95% CI: 0.54–0.78). The SDC was 1.00 at individual level and 0.11 at group level, both on a 5-point scale. The MIC was 0.23, exceeding the SDC at group level. The group size required to detect a MIC beyond measurement error was 19 for the Symptom Severity Scale. The ICC for the hand and finger function subscale of the Dutch-AIMS2 was 0.62 (95% CI: 0.47–0.74). The SDC was 3.80 at individual level and 0.42 at group level, both on an 11-point scale. The MIC was 0.31, which was less than the SDC at group level. The group size required to detect a MIC beyond measurement error was 150. CONCLUSION: In our heterogeneous primary care population the Symptom Severity Scale was found to be a suitable instrument to assess the severity of symptoms, whereas the hand and finger function subscale of the Dutch-AIMS2 was less suitable for the measurement of physical functioning in patients with hand and wrist problems
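The abstract does not spell out its formulas. The sketch below uses the relations commonly applied in this literature, which is an assumption here: SDC at individual level = 1.96 × √2 × SEM, SDC at group level = individual SDC / √n, and the group size needed to detect the MIC = (individual SDC / MIC)². Function names are illustrative. With the rounded figures from the abstract, these relations approximately reproduce the reported group-level SDC and group sizes.

```python
import math


def sdc_individual(sem, z=1.96):
    """Smallest detectable change for one person, from the SEM."""
    return z * math.sqrt(2) * sem


def sdc_group(sdc_ind, n):
    """Smallest detectable change for the mean of a group of n people."""
    return sdc_ind / math.sqrt(n)


def group_size_for_mic(sdc_ind, mic):
    """Group size at which the group-level SDC shrinks to the MIC."""
    return (sdc_ind / mic) ** 2


# Symptom Severity Scale: individual SDC 1.00, n = 84, MIC 0.23.
print(round(sdc_group(1.00, 84), 2))
print(round(group_size_for_mic(1.00, 0.23)))

# Dutch-AIMS2-HFF: individual SDC 3.80, MIC 0.31.
print(round(group_size_for_mic(3.80, 0.31)))
```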

Keele Research Repository

Factorial validity and internal consistency of the PRAFAB questionnaire in women with stress urinary incontinence

Author: Arnold TM Bernards
CVZ
CW de Vet
DE Johnson
DL Streiner
E Faber
EJ Hendriks
Erik JM Hendriks
FJ Floyd
GH Guyatt
H Sandvik
Henrica CW de Vet
I Gasquet
J Bart Staal
J Spruijt
JC Nunnally
JL Melville
JS Brown
K Avery
LCM Berghmans
LJ Cronbach
MD Smith
ME Borghouts
ME Vierhout
NH Fultz
P Abrams
P Kline
R Fitzpatrick
RA Deyo
Rob A de Bie
TAM Teunissen
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: To investigate the factor structure, dimensionality and construct validity of the (5-item) PRAFAB questionnaire score in women with stress urinary incontinence (stress UI). Methods: A cross-validation study design was used in a cohort of 279 patients who were randomly divided into Sample A or B. Sample A was used for preliminary exploratory factor analyses with promax rotation. Sample B provided an independent sample for confirming the premeditated and proposed factor structure and item retention. Internal consistency, item-total and subscale correlations were determined to assess the dimensionality. Construct validity was assessed by comparing factor-based scale means by clinical characteristics based on known relationships. Results: Factor analyses resulted in a two-factor structure or subscales: items related to 'leakage severity' (protection, amount and frequency) and items related to its 'perceived symptom impact' or consequences of stress UI on the patient's life (adjustment and body (or self) image). The patterns of the factor loadings were fairly identical for both study samples. The two constructed subscales demonstrated adequate internal consistency, with Cronbach's alphas of 0.78 and 0.84, respectively. Scale scores differed by clinical characteristics according to the expectations and supported the construct validity of the scales. Conclusion: The findings suggest a two-factorial structure of the PRAFAB questionnaire. Furthermore, the results confirmed the internal consistency and construct validity as demonstrated in our previous study. The best description of the factorial structure of the PRAFAB questionnaire was given by a two-factor solution, measuring the stress UI leakage severity items and the perceived symptom impact items. Future research will be necessary to replicate these findings in different settings, types of UI, and non-white women and men.
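Internal consistency is reported here as Cronbach's alpha. A minimal sketch of the standard formula, alpha = k/(k-1) × (1 - sum of item variances / variance of total scores), is given below. The function name and the example item scores are hypothetical and unrelated to the PRAFAB data.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha for a scale.

    `items` is a list of per-item score lists, each holding one score per
    respondent, in the same respondent order.
    """
    k = len(items)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    n = len(items[0])
    if any(len(item) != n for item in items):
        raise ValueError("every item needs one score per respondent")
    # Total scale score per respondent.
    totals = [sum(item[i] for item in items) for i in range(n)]
    sum_item_variances = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - sum_item_variances / variance(totals))


# Hypothetical two-item subscale answered by four respondents.
print(round(cronbach_alpha([[1, 2, 3, 4], [2, 1, 4, 3]]), 2))
```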

Inter-rater agreement and reliability of the COSMIN (COnsensus-based Standards for the selection of health status Measurement Instruments) Checklist

Author: Caroline B Terwee
Dirk L Knol
Donald L Patrick
E Moberg-Mogren
Elizabeth Gibbons
HC Kraemer
Henrica CW de Vet
JL Fleiss
JM Valderas
Jordi Alonso
JR Landis
L Lin
LB Mokkink
Lex M Bouter
Lidwine B Mokkink
N Smidt
Paul W Stratford
W Vach
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The COSMIN checklist is a tool for evaluating the methodological quality of studies on measurement properties of health-related patient-reported outcomes. The aim of this study is to determine the inter-rater agreement and reliability of each item score of the COSMIN checklist (n = 114). Methods: 75 articles evaluating measurement properties were randomly selected from the bibliographic database compiled by the Patient-Reported Outcome Measurement Group, Oxford, UK. Raters were asked to assess the methodological quality of three articles, using the COSMIN checklist. In a one-way design, percentage agreement and intraclass kappa coefficients or quadratic-weighted kappa coefficients were calculated for each item. Results: 88 raters participated. Of the 75 selected articles, 26 articles were rated by four to six participants, and 49 by two or three participants. Overall, percentage agreement was appropriate (68% was above 80% agreement), and the kappa coefficients for the COSMIN items were low (61% was below 0.40, 6% was above 0.75). Reasons for low inter-rater agreement were the need for subjective judgement and raters being accustomed to different standards, terminology and definitions. Conclusions: Results indicated that raters often choose the same response option, but that it is difficult at item level to distinguish between articles. When using the COSMIN checklist in a systematic review, we recommend getting some training and experience, having it completed by two independent raters, and reaching consensus on one final rating. Instructions for using the checklist have been improved.
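The combination of high percentage agreement and low kappa reported above can be illustrated with a small sketch. It uses Cohen's kappa for two raters as a simpler stand-in, not the intraclass or quadratic-weighted kappa in a one-way design that the study used; the function names and ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(rater1, rater2):
    """Fraction of items on which two raters chose the same option."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must rate the same non-empty item set")
    return sum(a == b for a, b in zip(rater1, rater2)) / len(rater1)


def cohen_kappa(rater1, rater2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater1)
    observed = percent_agreement(rater1, rater2)
    counts1, counts2 = Counter(rater1), Counter(rater2)
    categories = set(counts1) | set(counts2)
    expected = sum(counts1[c] * counts2[c] for c in categories) / n ** 2
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical item: nearly every article gets the same answer.
rater1 = ["yes"] * 9 + ["no"]
rater2 = ["yes"] * 10
print(percent_agreement(rater1, rater2), cohen_kappa(rater1, rater2))
```

When almost all articles receive the same response, agreement is high but mostly attributable to chance, so kappa stays near zero. This is one way to read the finding that raters often choose the same option while the item still fails to distinguish between articles.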