
    Six solutions for more reliable infant research

    Infant research is often underpowered, undermining the robustness and replicability of our findings. Improving the reliability of infant studies offers a solution for increasing statistical power independent of sample size. Here, we discuss two senses of the term reliability in the context of infant research: reliable (large) effects and reliable measures. We examine the circumstances under which effects are strongest and measures are most reliable and use synthetic datasets to illustrate the relationship between effect size, measurement reliability, and statistical power. We then present six concrete solutions for more reliable infant research: (a) routinely estimating and reporting the effect size and measurement reliability of infant tasks, (b) selecting the best measurement tool, (c) developing better infant paradigms, (d) collecting more data points per infant, (e) excluding unreliable data from the analysis, and (f) conducting more sophisticated data analyses. Deeper consideration of measurement in infant research will improve our ability to study infant development.
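
    The chain the abstract describes, from per-trial reliability to aggregate reliability (Spearman-Brown), from reliability to attenuated observed effect size, and from effect size to statistical power, can be sketched numerically. The snippet below is illustrative only: the reliability values, the true effect of r = .4, and the sample of 30 infants are assumptions, and power is approximated with Fisher's z rather than taken from the paper's synthetic datasets.

        import numpy as np
        from scipy import stats

        def spearman_brown(r_single_trial, n_trials):
            """Reliability of an n-trial average, given single-trial reliability."""
            return n_trials * r_single_trial / (1 + (n_trials - 1) * r_single_trial)

        def observed_r(true_r, rel_x, rel_y):
            """Attenuation: the observed correlation shrinks with unreliable measures."""
            return true_r * np.sqrt(rel_x * rel_y)

        def power_correlation(r, n, alpha=0.05):
            """Approximate power of a two-sided test of r = 0 via Fisher's z."""
            mu = np.arctanh(r) * np.sqrt(n - 3)
            z_crit = stats.norm.ppf(1 - alpha / 2)
            return stats.norm.sf(z_crit - mu) + stats.norm.cdf(-z_crit - mu)

        # Assumed numbers: true effect r = .4, single-trial reliability .2,
        # predictor reliability .8, 30 infants.
        for n_trials in (4, 8, 16, 32):
            rel_y = spearman_brown(0.2, n_trials)
            r_obs = observed_r(0.4, rel_x=0.8, rel_y=rel_y)
            print(n_trials, round(rel_y, 2), round(r_obs, 2),
                  round(power_correlation(r_obs, n=30), 2))

    Under these assumptions, collecting more trials per infant (solution d) raises reliability and, through attenuation, the observed effect size and power, without adding a single participant.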

    Null hypothesis significance testing: a short tutorial

    Although thoroughly criticized, null hypothesis significance testing (NHST) remains the statistical method of choice for providing evidence of an effect in the biological, biomedical, and social sciences. In this short tutorial, I first summarize the concepts behind the method, distinguishing tests of significance (Fisher) from tests of acceptance (Neyman-Pearson), and point to common interpretation errors regarding the p-value. I then present the related concept of confidence intervals and again point to common interpretation errors. Finally, I discuss what should be reported in which context. The goal is to clarify these concepts in order to avoid interpretation errors and to propose reporting practices.
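
    To make the distinction concrete, a minimal sketch of the two quantities discussed, the p-value from a significance test and a confidence interval for the same effect, is given below; the data are simulated and the pooled-variance t test is just one common choice, not something prescribed by the tutorial.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(1)
        a = rng.normal(loc=0.0, scale=1.0, size=40)   # simulated control group
        b = rng.normal(loc=0.5, scale=1.0, size=40)   # simulated treatment group

        # NHST: p-value from a pooled-variance two-sample t test.
        t_stat, p_value = stats.ttest_ind(b, a, equal_var=True)

        # 95% confidence interval for the mean difference, from the same model.
        diff = b.mean() - a.mean()
        df = len(a) + len(b) - 2
        sp2 = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / df
        se = np.sqrt(sp2 * (1 / len(a) + 1 / len(b)))
        t_crit = stats.t.ppf(0.975, df)
        ci = (diff - t_crit * se, diff + t_crit * se)

        print(f"p = {p_value:.3f}; 95% CI for the difference: [{ci[0]:.2f}, {ci[1]:.2f}]")
        # Neither number is the probability that the null hypothesis is true;
        # both describe the long-run behaviour of the procedure, which is
        # exactly the kind of interpretation error the tutorial warns against.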

    [Comment] Redefine statistical significance

    The lack of reproducibility of scientific studies has caused growing concern over the credibility of claims of new discoveries based on “statistically significant” findings. There has been much progress toward documenting and addressing several causes of this lack of reproducibility (e.g., multiple testing, P-hacking, publication bias, and under-powered studies). However, we believe that a leading cause of non-reproducibility has not yet been adequately addressed: statistical standards of evidence for claiming discoveries in many fields of science are simply too low. Associating “statistically significant” findings with P < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural, and reporting problems. For fields where the threshold for defining statistical significance is P < 0.05, we propose a change to P < 0.005. This simple step would immediately improve the reproducibility of scientific research in many fields. Results that would currently be called “significant” but do not meet the new threshold should instead be called “suggestive.” While statisticians have long known the relative weakness of using P ≈ 0.05 as a threshold for discovery, and the proposal to lower it to 0.005 is not new (1, 2), a critical mass of researchers now endorse this change. We restrict our recommendation to claims of discovery of new effects. We do not address the appropriate threshold for confirmatory or contradictory replications of existing claims. We also do not advocate changes to discovery thresholds in fields that have already adopted more stringent standards (e.g., genomics and high-energy physics research; see Potential Objections below). We also restrict our recommendation to studies that conduct null hypothesis significance tests. We have diverse views about how best to improve reproducibility, and many of us believe that other ways of summarizing the data, such as Bayes factors or other posterior summaries based on clearly articulated model assumptions, are preferable to P-values. However, changing the P-value threshold is simple and might quickly achieve broad acceptance.
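
    The quantitative core of the argument, that P < 0.05 admits a high rate of false positives, can be illustrated with the standard false-positive-risk calculation shown below; the prior probability of a true effect (10%) and the average power (80%) are assumed values chosen for illustration, not figures taken from the comment itself.

        def false_positive_rate(alpha, power, prior_true):
            """Share of 'significant' findings that are false positives, given the
            significance threshold, average power, and the prior probability that
            a tested effect is real."""
            false_pos = alpha * (1 - prior_true)
            true_pos = power * prior_true
            return false_pos / (false_pos + true_pos)

        # Assumed scenario: 1 in 10 tested effects is real, power is 0.8.
        for alpha in (0.05, 0.005):
            fpr = false_positive_rate(alpha, power=0.8, prior_true=0.1)
            print(f"alpha = {alpha}: ~{fpr:.0%} of 'significant' results are false positives")

    Under these assumptions the false-positive share drops from roughly a third at P < 0.05 to about 5% at P < 0.005, which is the kind of improvement the authors have in mind.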

    Quality-of-life assessment in dementia: the use of DEMQOL and DEMQOL-Proxy total scores

    Purpose There is a need to determine whether health-related quality-of-life (HRQL) assessments in dementia capture what is important, to form a coherent basis for guiding research and clinical and policy decisions. This study investigated the structural validity of HRQL assessments made using the DEMQOL system, with particular interest in studying domains that might be central to HRQL, and the external validity of these HRQL measurements. Methods HRQL of people with dementia was evaluated by 868 self-reports (DEMQOL) and 909 proxy reports (DEMQOL-Proxy) at a community memory service. Exploratory and confirmatory factor analyses (EFA and CFA) were conducted using bifactor models to investigate domains that might be central to general HRQL. Reliability of the general and specific factors measured by the bifactor models was examined using omega (ω) and omega hierarchical (ω_h) coefficients. Multiple-indicators multiple-causes models were used to explore the external validity of these HRQL measurements in terms of their associations with other clinical assessments. Results Bifactor models showed adequate goodness of fit, supporting HRQL in dementia as a general construct that underlies a diverse range of health indicators. At the same time, additional factors were necessary to explain residual covariation of items within specific health domains identified from the literature. Based on these models, DEMQOL and DEMQOL-Proxy overall total scores showed excellent reliability (ω_h > 0.8). After accounting for common variance due to a general factor, subscale scores were less reliable (ω_h < 0.7) for informing on individual differences in specific HRQL domains. Depression was more strongly associated with general HRQL based on DEMQOL than on DEMQOL-Proxy (−0.55 vs. −0.22). Cognitive impairment had no reliable association with general HRQL based on DEMQOL or DEMQOL-Proxy. Conclusions The tenability of a bifactor model of HRQL in dementia suggests that it is possible to retain theoretical focus on the assessment of a general phenomenon, while exploring variation in specific HRQL domains for insights into what may lie at the ‘heart’ of HRQL for people with dementia. These data suggest that DEMQOL and DEMQOL-Proxy total scores are likely to be accurate measures of individual differences in HRQL, but that subscale scores should not be used. No specific domain was solely responsible for general HRQL at dementia diagnosis. Better HRQL was moderately associated with fewer depressive symptoms, but this was less apparent based on informant reports. HRQL was not associated with severity of cognitive impairment.
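
    The reliability coefficients referred to above are simple functions of bifactor model loadings: omega total counts variance due to the general and specific factors, while omega hierarchical counts only the general factor. The sketch below uses invented loadings for a six-item toy example, not the actual DEMQOL estimates.

        import numpy as np

        def omega_coefficients(general_loadings, group_loadings, residual_vars):
            """Omega total and omega hierarchical for a bifactor model.
            group_loadings is a list of loading arrays, one per specific factor."""
            gen = np.sum(general_loadings) ** 2
            spec = sum(np.sum(lam) ** 2 for lam in group_loadings)
            total_var = gen + spec + np.sum(residual_vars)
            omega_total = (gen + spec) / total_var
            omega_h = gen / total_var      # general factor only
            return omega_total, omega_h

        # Invented standardized loadings: six items, two specific factors.
        general = np.array([0.6, 0.7, 0.5, 0.6, 0.7, 0.6])
        groups = [np.array([0.3, 0.4, 0.3]), np.array([0.2, 0.3, 0.4])]
        residuals = 1 - general**2 - np.concatenate(groups)**2

        print(omega_coefficients(general, groups, residuals))

    A high ω_h for the total score, as reported for DEMQOL and DEMQOL-Proxy, means that most of the reliable variance in the summed items reflects the general factor, which is why the subscale scores carry little unique reliable information once that factor is accounted for.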

    The performance of robust test statistics with categorical data

    This paper reports on a simulation study that evaluated the performance of five structural equation model test statistics appropriate for categorical data. Both Type I error rate and power were investigated. Different model sizes, sample sizes, numbers of categories, and threshold distributions were considered. Statistics associated with both the diagonally weighted least squares (cat-DWLS) estimator and with the unweighted least squares (cat-ULS) estimator were studied. Recent research suggests that cat-ULS parameter estimates and robust standard errors slightly outperform cat-DWLS estimates and robust standard errors (Forero, Maydeu-Olivares, & Gallardo-Pujol, 2009). The findings of the present research suggest that the mean- and variance-adjusted test statistic associated with the cat-ULS estimator performs best overall. A new version of this statistic now exists that does not require a degrees-of-freedom adjustment (Asparouhov & Muthén, 2010), and this statistic is recommended. Overall, the cat-ULS estimator is recommended over cat-DWLS, particularly in small to medium sample sizes.
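
    The two performance criteria in such simulation studies, empirical Type I error rate and power, are simply rejection rates across replications. The sketch below illustrates that bookkeeping with ordinary central and noncentral chi-square draws standing in for the fitted test statistics, since reproducing cat-ULS or cat-DWLS itself would require a full SEM estimation routine.

        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(7)

        def rejection_rate(stat_values, df, alpha=0.05):
            """Fraction of replications whose statistic exceeds the chi-square
            critical value: Type I error under H0, power under misspecification."""
            crit = stats.chi2.ppf(1 - alpha, df)
            return np.mean(np.asarray(stat_values) > crit)

        df, n_reps = 10, 20_000

        # A well-calibrated test statistic is chi-square(df) under a true model ...
        null_stats = rng.chisquare(df, size=n_reps)
        print("empirical Type I error:", rejection_rate(null_stats, df))   # close to 0.05

        # ... and shifted (noncentral) under a misspecified model.
        alt_stats = rng.noncentral_chisquare(df, nonc=15.0, size=n_reps)
        print("empirical power:", rejection_rate(alt_stats, df))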

    On the Asymptotic Relative Efficiency of Planned Missingness Designs

    In planned missingness (PM) designs, certain data are set a priori to be missing. PM designs can increase validity and reduce cost; however, little is known about the loss of efficiency that accompanies these designs. The present paper compares PM designs to reduced sample (RN) designs that have the same total number of data points concentrated in fewer participants. In 4 studies, we consider models for both observed and latent variables, designs that do or do not include an "X set" of variables with complete data, and a full range of between- and within-set correlation values. All results are obtained using asymptotic relative efficiency formulas, and thus no data are generated; this novel approach allows us to examine whether PM designs have theoretical advantages over RN designs removing the impact of sampling error. Our primary findings are that (a) in manifest variable regression models, estimates of regression coefficients have much lower relative efficiency in PM designs as compared to RN designs, (b) relative efficiency of factor correlation or latent regression coefficient estimates is maximized when the indicators of each latent variable come from different sets, and (c) the addition of an X set improves efficiency in manifest variable regression models only for the parameters that directly involve the X-set variables, but it substantially improves efficiency of most parameters in latent variable models. We conclude that PM designs can be beneficial when the model of interest is a latent variable model; recommendations are made for how to optimize such a design.
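
    Relative efficiency here is a ratio of sampling variances for the same parameter under two designs. The Monte Carlo sketch below illustrates the idea for a single mean, comparing a PM design (all participants measured on an auxiliary x, half planned-missing on y, estimated with the regression-based ML estimator) against a reduced-sample design with complete data on half as many participants; the sample size, the correlation of .6, and the choice of estimator are illustrative assumptions, not the paper's closed-form asymptotic results.

        import numpy as np

        rng = np.random.default_rng(0)
        N, rho, n_reps = 200, 0.6, 5_000
        cov = [[1.0, rho], [rho, 1.0]]

        pm_est, rn_est = [], []
        for _ in range(n_reps):
            x, y = rng.multivariate_normal([0.0, 0.0], cov, size=N).T

            # PM design: y is planned-missing for a random half; x is observed for all.
            obs = rng.permutation(N)[: N // 2]
            beta = np.polyfit(x[obs], y[obs], 1)[0]          # slope from complete cases
            pm_est.append(y[obs].mean() + beta * (x.mean() - x[obs].mean()))

            # Reduced-sample design: half as many participants, complete data.
            rn_est.append(y[: N // 2].mean())

        rel_eff = np.var(rn_est, ddof=1) / np.var(pm_est, ddof=1)
        print(f"relative efficiency of the PM design for the mean of y: {rel_eff:.2f}")

    Values above 1 indicate that borrowing strength from the fully observed variable makes the PM design more precise for this parameter than simply dropping participants, mirroring the paper's point that the payoff of a PM design depends on how the observed and missing variables are related.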