The validity of the tool “statcheck” in discovering statistical reporting inconsistencies
The R package “statcheck” (Epskamp & Nuijten, 2016) is a tool to extract statistical results from articles and check whether the reported p-value matches the accompanying test statistic and degrees of freedom. A previous study showed high interrater reliabilities (between .76 and .89) between statcheck and manual coding of inconsistencies (Nuijten, Hartgerink, Van Assen, Epskamp, & Wicherts, 2016). Here we present an additional, detailed study of the validity of statcheck. In Study 1, we calculated its sensitivity and specificity. We found that statcheck’s sensitivity (true positive rate) and specificity (true negative rate) were high: between 85.3% and 100%, and between 96.0% and 100%, respectively, depending on the assumptions and settings. The overall accuracy of statcheck ranged from 96.2% to 99.9%. In Study 2, we investigated statcheck’s ability to deal with statistical corrections for multiple testing or violations of assumptions in articles. We found that the prevalence of corrections for multiple testing or violations of assumptions in psychology was higher than we initially estimated in Nuijten et al. (2016). Although we found numerous reporting inconsistencies in results corrected for violations of the sphericity assumption, we demonstrate that inconsistencies associated with statistical corrections do not explain the high estimates of the prevalence of statistical reporting inconsistencies in psychology.
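The kind of check described in this abstract can be illustrated with a minimal sketch. It is not statcheck's actual algorithm: it assumes a two-sided z-test only, and it treats a result as consistent when the recomputed p-value, rounded to the reported number of decimals, equals the reported p-value. The function names are hypothetical.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def is_consistent(z, reported_p, decimals):
    """Return True if the reported p matches the recomputed p after rounding.

    Simplifying assumption: only the rounding of the p-value is considered,
    not the rounding of the reported test statistic itself.
    """
    recomputed = two_sided_p_from_z(z)
    return abs(round(recomputed, decimals) - reported_p) < 1e-9


if __name__ == "__main__":
    # "z = 2.20, p = .03" is consistent; "z = 2.20, p = .01" is not.
    print(is_consistent(2.20, 0.03, 2))
    print(is_consistent(2.20, 0.01, 2))
```

A fuller check would also cover t, F, chi-square, and r statistics, which need distribution functions beyond the Python standard library.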
The dire disregard of measurement invariance testing in psychological science
In psychological science, self-report scales are widely used to compare means in targeted latent constructs across time points, groups, or experimental conditions. For these scale mean comparisons (SMC) to be meaningful and unbiased, the scales should be measurement invariant across the compared time points or (experimental) groups. Measurement invariance (MI) testing checks whether the latent constructs are measured equivalently across groups or time points. Since MI is essential for meaningful comparisons, we conducted a systematic review to check whether MI is taken seriously in psychological research. Specifically, we sampled 426 psychology articles with openly available data that involved a total of 918 SMCs to (1) investigate common practices in conducting and reporting of MI testing, (2) check whether reported MI test results can be reproduced, and (3) conduct MI tests for the SMCs that enabled sufficiently powerful MI testing with the shared data. Our results indicate that (1) 4% of the 918 scales underwent MI testing across groups or time and that these tests were generally poorly reported, (2) none of the reported MI tests could be successfully reproduced, and (3) of 161 newly performed MI tests, a mere 46 (29%) reached sufficient MI (scalar invariance), and MI often failed completely (89; 55%). Thus, MI tests were rarely done and poorly reported in psychological studies, and the frequent violations of MI indicate that reported group differences cannot be solely attributed to group differences in the latent constructs. We offer recommendations on reporting MI tests and improving computational reproducibility practices.
Citation patterns following a strongly contradictory replication result: Four case studies from psychology
Replication studies that contradict prior findings may facilitate scientific self-correction by triggering a reappraisal of the original studies; however, the research community's response to replication results has not been studied systematically. One approach for gauging responses to replication results is to examine how they impact citations to original studies. In this study, we explored post-replication citation patterns in the context of four prominent multi-laboratory replication attempts published in the field of psychology that strongly contradicted and outweighed prior findings. Generally, we observed a small post-replication decline in the number of favourable citations and a small increase in unfavourable citations. This indicates only modest corrective effects and implies considerable perpetuation of belief in the original findings. Replication results that strongly contradict an original finding do not necessarily nullify its credibility; however, one might at least expect the replication results to be acknowledged and explicitly debated in subsequent literature. By contrast, we found substantial citation bias: the majority of articles citing the original studies neglected to cite relevant replication results. Of those articles that did cite the replication, but continued to cite the original study favourably, approximately half offered an explicit defence of the original study. Our findings suggest that even replication results that strongly contradict original findings do not necessarily prompt a corrective response from the research community.
The meta-plot: A graphical tool for interpreting the results of a meta-analysis
The meta-plot is a descriptive visual tool for meta-analysis that provides information on the primary studies in the meta-analysis and the results of the meta-analysis. More precisely, the meta-plot portrays (i) the precision and statistical power of the primary studies in the meta-analysis, (ii) the estimate and confidence interval of a random-effects meta-analysis, (iii) the results of a cumulative random-effects meta-analysis yielding a robustness check of the meta-analytic effect size with respect to primary studies’ precision, and (iv) evidence of publication bias. After explaining the underlying logic and theory, the meta-plot is applied to two cherry-picked meta-analyses that appear to be biased and to ten meta-analyses randomly selected from the psychological literature. We recommend using the meta-plot in addition to any meta-analysis of common effect size measures, rather than variants of the funnel plot.
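The cumulative random-effects meta-analysis mentioned in (iii) can be sketched as follows. This is an illustration under stated assumptions, not the meta-plot's implementation: it assumes the DerSimonian-Laird estimator for the between-study variance, a normal-approximation 95% confidence interval, and that studies are added from most to least precise. The abstract specifies none of these choices, and the function names are hypothetical.

```python
import math


def random_effects(effects, variances):
    """Random-effects estimate with 95% CI (DerSimonian-Laird, an assumption)."""
    k = len(effects)
    w = [1.0 / v for v in variances]
    sw = sum(w)
    mu_fe = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    q = sum(wi * (yi - mu_fe) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    # Between-study variance, truncated at zero; undefined for one study.
    tau2 = max(0.0, (q - (k - 1)) / c) if k > 1 and c > 0 else 0.0
    w_re = [1.0 / (v + tau2) for v in variances]
    sw_re = sum(w_re)
    mu = sum(wi * yi for wi, yi in zip(w_re, effects)) / sw_re
    se = math.sqrt(1.0 / sw_re)
    return mu, mu - 1.96 * se, mu + 1.96 * se


def cumulative_meta_analysis(effects, variances):
    """Re-estimate after adding each study, most precise study first."""
    order = sorted(range(len(effects)), key=lambda i: variances[i])
    results = []
    for n in range(1, len(order) + 1):
        idx = order[:n]
        results.append(random_effects([effects[i] for i in idx],
                                      [variances[i] for i in idx]))
    return results


if __name__ == "__main__":
    # Made-up effect sizes and sampling variances, for illustration only.
    for est, lo, hi in cumulative_meta_analysis([0.5, 0.2, 0.8],
                                                [0.04, 0.01, 0.09]):
        print(f"{est:.3f} [{lo:.3f}, {hi:.3f}]")
```

If the estimate drifts steadily as less precise studies are added, that pattern is the kind of sensitivity to study precision the robustness check is meant to reveal.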