Six solutions for more reliable infant research
Infant research is often underpowered, undermining the robustness and replicability of our findings. Improving the reliability of infant studies offers a solution for increasing statistical power independent of sample size. Here, we discuss two senses of the term reliability in the context of infant research: reliable (large) effects and reliable measures. We examine the circumstances under which effects are strongest and measures are most reliable and use synthetic datasets to illustrate the relationship between effect size, measurement reliability, and statistical power. We then present six concrete solutions for more reliable infant research: (a) routinely estimating and reporting the effect size and measurement reliability of infant tasks, (b) selecting the best measurement tool, (c) developing better infant paradigms, (d) collecting more data points per infant, (e) excluding unreliable data from the analysis, and (f) conducting more sophisticated data analyses. Deeper consideration of measurement in infant research will improve our ability to study infant development.
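The link this abstract draws between effect size, measurement reliability, and statistical power can be made concrete with a short calculation. The sketch below is not from the paper; it assumes a hypothetical two-sample design with a true effect of d = 0.5 and uses the classical attenuation approximation (observed effect ≈ true effect × √reliability) to show how lower reliability shrinks the detectable effect and the resulting power.

```python
# Minimal sketch: how measurement reliability attenuates an effect and
# thereby reduces statistical power (illustrative values, not the paper's).
import numpy as np
from scipy.stats import norm

def two_sample_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample test via the normal approximation."""
    ncp = d * np.sqrt(n_per_group / 2)          # noncentrality parameter
    z_crit = norm.ppf(1 - alpha / 2)
    return 1 - norm.cdf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

true_d, n = 0.5, 30                              # hypothetical true effect and group size
for reliability in (1.0, 0.8, 0.6, 0.4):
    observed_d = true_d * np.sqrt(reliability)   # attenuated (observed) effect size
    print(f"reliability={reliability:.1f}  observed d={observed_d:.2f}  "
          f"power={two_sample_power(observed_d, n):.2f}")
```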
Social support, stress, health, and academic success in Ghanaian adolescents: A path analysis
The aim of this study is to gain a better understanding of the role psychosocial factors play in promoting the health and academic success of adolescents. A total of 770 adolescent boys and girls in Senior High Schools were randomly selected to complete a self-report questionnaire. School-reported latest terminal examination grades were used as the measure of academic success. Structural equation modelling indicated a relatively good fit to the a posteriori model, with four of the hypothesised paths fully supported and two partially supported. Perceived social support was negatively related to stress and predictive of health and wellbeing but not academic success. Stress was predictive of health but not academic success. Finally, health and wellbeing predicted academic success. These findings have policy implications for efforts aimed at promoting the health and wellbeing as well as the academic success of adolescents in Ghana.
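For readers unfamiliar with path analysis, a model of the kind described here can be specified as a set of regression-like equations. The sketch below is hypothetical: the variable names, data file, and use of the semopy library are illustrative assumptions, not the study's actual model or data.

```python
# Hypothetical path-model specification in semopy; variable and file names
# are placeholders, not the study's measures.
import pandas as pd
import semopy

model_desc = """
stress    ~ social_support
wellbeing ~ social_support + stress
academic  ~ social_support + stress + wellbeing
"""

data = pd.read_csv("adolescent_survey.csv")      # placeholder file name
model = semopy.Model(model_desc)
model.fit(data)
print(model.inspect())                           # path coefficients, SEs, p-values
print(semopy.calc_stats(model))                  # fit indices such as CFI and RMSEA
```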
Null hypothesis significance testing: a short tutorial
Although thoroughly criticized, null hypothesis significance testing (NHST) remains the statistical method of choice for providing evidence of an effect in the biological, biomedical, and social sciences. In this short tutorial, I first summarize the concepts behind the method, distinguishing tests of significance (Fisher) from tests of acceptance (Neyman-Pearson), and point to common interpretation errors regarding the p-value. I then present the related concept of confidence intervals and again point to common interpretation errors. Finally, I discuss what should be reported in which context. The goal is to clarify concepts to avoid interpretation errors and to propose reporting practices.
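The two quantities the tutorial contrasts can be computed side by side. The snippet below is a minimal illustration, not taken from the tutorial, using simulated data and a one-sample t-test.

```python
# Minimal illustration: a p-value from a significance test and a confidence
# interval for the same estimate, computed on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=0.3, scale=1.0, size=40)   # hypothetical sample

t_stat, p_value = stats.ttest_1samp(sample, popmean=0.0)

mean = sample.mean()
sem = stats.sem(sample)
t_crit = stats.t.ppf(0.975, df=len(sample) - 1)
ci = (mean - t_crit * sem, mean + t_crit * sem)

# The p-value is the probability of data at least this extreme under the null;
# the CI gives the range of effect sizes compatible with the observed data.
print(f"p = {p_value:.3f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```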
[Comment] Redefine statistical significance
The lack of reproducibility of scientific studies has caused growing concern over the credibility of claims of new discoveries based on “statistically significant” findings. There has been much progress toward documenting and addressing several causes of this lack of reproducibility (e.g., multiple testing, P-hacking, publication bias, and under-powered studies). However, we believe that a leading cause of non-reproducibility has not yet been adequately addressed: Statistical standards of evidence for claiming discoveries in many fields of science are simply too low. Associating “statistically significant” findings with P < 0.05 results in a high rate of false positives even in the absence of other experimental, procedural and reporting problems.
For fields where the threshold for defining statistical significance is P<0.05, we propose a change to P<0.005. This simple step would immediately improve the reproducibility of scientific research in many fields. Results that would currently be called “significant” but do not meet the new threshold should instead be called “suggestive.” While statisticians have known the relative weakness of using P≈0.05 as a threshold for discovery and the proposal to lower it to 0.005 is not new (1, 2), a critical mass of researchers now endorse this change.
We restrict our recommendation to claims of discovery of new effects. We do not address the appropriate threshold for confirmatory or contradictory replications of existing claims. We also do not advocate changes to discovery thresholds in fields that have already adopted more stringent standards (e.g., genomics and high-energy physics research; see Potential Objections below).
We also restrict our recommendation to studies that conduct null hypothesis significance tests. We have diverse views about how best to improve reproducibility, and many of us believe that other ways of summarizing the data, such as Bayes factors or other posterior summaries based on clearly articulated model assumptions, are preferable to P-values. However, changing the P-value threshold is simple and might quickly achieve broad acceptance.
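The core of the argument can be checked with a back-of-the-envelope calculation: the share of "significant" findings that are false positives depends on the threshold, statistical power, and the prior odds that a tested effect is real. The prior odds of 1:10 and power of 0.8 below are illustrative assumptions, not figures taken from the article.

```python
# Sketch: expected share of false positives among significant results,
# under assumed prior odds that a tested effect is real and assumed power.
def false_positive_rate(alpha, power, prior_odds_real):
    """Share of significant results that are false positives."""
    p_real = prior_odds_real / (1 + prior_odds_real)
    p_null = 1 - p_real
    false_pos = alpha * p_null     # nulls declared significant
    true_pos = power * p_real      # real effects declared significant
    return false_pos / (false_pos + true_pos)

for alpha in (0.05, 0.005):
    rate = false_positive_rate(alpha, power=0.8, prior_odds_real=1 / 10)
    print(f"alpha={alpha}: ~{rate:.0%} of significant findings are false positives")
```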
The earth is flat (p > 0.05): significance thresholds and the crisis of unreplicable research
The widespread use of ‘statistical significance’ as a license for making a claim of a scientific finding leads to considerable distortion of the scientific process (according to the American Statistical Association). We review why degrading p-values into ‘significant’ and ‘nonsignificant’ contributes to making studies irreproducible, or to making them seem irreproducible. A major problem is that we tend to take small p-values at face value, but mistrust results with larger p-values. In either case, p-values tell little about the reliability of research, because they are hardly replicable even if an alternative hypothesis is true. Significance (p ≤ 0.05) is also hardly replicable: at a good statistical power of 80%, two studies will be ‘conflicting’, meaning that one is significant and the other is not, in one third of the cases if there is a true effect. A replication can therefore not be interpreted as having failed only because it is nonsignificant. Many apparent replication failures may thus reflect faulty judgment based on significance thresholds rather than a crisis of unreplicable research. Reliable conclusions on replicability and practical importance of a finding can only be drawn using cumulative evidence from multiple independent studies. However, applying significance thresholds makes cumulative knowledge unreliable. One reason is that with anything but ideal statistical power, significant effect sizes will be biased upwards. Interpreting inflated significant results while ignoring nonsignificant results will thus lead to wrong conclusions. But current incentives to hunt for significance lead to selective reporting and to publication bias against nonsignificant findings. Data dredging, p-hacking, and publication bias should be addressed by removing fixed significance thresholds. Consistent with the recommendations of the late Ronald Fisher, p-values should be interpreted as graded measures of the strength of evidence against the null hypothesis. Larger p-values also offer some evidence against the null hypothesis, and they cannot be interpreted as supporting the null hypothesis and falsely concluding that ‘there is no effect’. Information on possible true effect sizes that are compatible with the data must be obtained from the point estimate, e.g., from a sample average, and from the interval estimate, such as a confidence interval. We review how confusion about the interpretation of larger p-values can be traced back to historical disputes among the founders of modern statistics. We further discuss potential arguments against removing significance thresholds, for example that decision rules should rather be more stringent, that sample sizes could decrease, or that p-values should better be completely abandoned. We conclude that whatever method of statistical inference we use, dichotomous threshold thinking must give way to non-automated informed judgment.
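The "one third of the cases" claim follows from simple arithmetic: if each of two independent studies detects a true effect with probability 0.8, then the probability that exactly one is significant is 2 × 0.8 × 0.2 = 0.32. A quick check of this arithmetic (not from the paper):

```python
# Probability that two studies with 80% power "conflict" (one significant,
# one not) when a true effect exists.
power = 0.8
p_conflict = 2 * power * (1 - power)
print(f"P(one significant, one not) = {p_conflict:.2f}")   # 0.32, about one third
```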
Quality-of-life assessment in dementia: the use of DEMQOL and DEMQOL-Proxy total scores
Purpose
There is a need to determine whether health-related quality-of-life (HRQL) assessments in dementia capture what is important, to form a coherent basis for guiding research and clinical and policy decisions. This study investigated structural validity of HRQL assessments made using the DEMQOL system, with particular interest in studying domains that might be central to HRQL, and the external validity of these HRQL measurements.
Methods
HRQL of people with dementia was evaluated by 868 self-reports (DEMQOL) and 909 proxy reports (DEMQOL-Proxy) at a community memory service. Exploratory and confirmatory factor analyses (EFA and CFA) were conducted using bifactor models to investigate domains that might be central to general HRQL. Reliability of the general and specific factors measured by the bifactor models was examined using omega (ω) and omega hierarchical (ω_h) coefficients. Multiple-indicators multiple-causes models were used to explore the external validity of these HRQL measurements in terms of their associations with other clinical assessments.
Results
Bifactor models showed adequate goodness of fit, supporting HRQL in dementia as a general construct that underlies a diverse range of health indicators. At the same time, additional factors were necessary to explain residual covariation of items within specific health domains identified from the literature. Based on these models, DEMQOL and DEMQOL-Proxy overall total scores showed excellent reliability (ω_h > 0.8). After accounting for common variance due to a general factor, subscale scores were less reliable (ω_h < 0.7) for informing on individual differences in specific HRQL domains. Depression was more strongly associated with general HRQL based on DEMQOL than on DEMQOL-Proxy (−0.55 vs −0.22). Cognitive impairment had no reliable association with general HRQL based on DEMQOL or DEMQOL-Proxy.
Conclusions
The tenability of a bifactor model of HRQL in dementia suggests that it is possible to retain theoretical focus on the assessment of a general phenomenon, while exploring variation in specific HRQL domains for insights on what may lie at the ‘heart’ of HRQL for people with dementia. These data suggest that DEMQOL and DEMQOL-Proxy total scores are likely to be accurate measures of individual differences in HRQL, but that subscale scores should not be used. No specific domain was solely responsible for general HRQL at dementia diagnosis. Better HRQL was moderately associated with fewer depressive symptoms, but this was less apparent based on informant reports. HRQL was not associated with severity of cognitive impairment.
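The omega coefficients reported above can be derived directly from standardized bifactor loadings. The sketch below uses made-up loadings for six hypothetical items split across two specific factors; it illustrates the standard formulas, not the DEMQOL estimates.

```python
# Sketch: omega (ω) and omega hierarchical (ω_h) from standardized bifactor
# loadings; the loadings and item grouping below are hypothetical.
import numpy as np

general  = np.array([0.70, 0.60, 0.65, 0.55, 0.60, 0.70])  # loadings on the general factor
specific = np.array([0.30, 0.40, 0.35, 0.45, 0.20, 0.25])  # loadings on specific factors
specific_group = np.array([0, 0, 0, 1, 1, 1])              # which specific factor each item loads on

uniqueness = 1 - general**2 - specific**2                   # item residual variances
total_var = (general.sum()**2
             + sum(specific[specific_group == g].sum()**2
                   for g in np.unique(specific_group))
             + uniqueness.sum())

omega_total = (total_var - uniqueness.sum()) / total_var    # reliability of the total score
omega_h = general.sum()**2 / total_var                      # share attributable to the general factor

print(f"omega = {omega_total:.2f}, omega_h = {omega_h:.2f}")
```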
Personality traits and mental disorders
To leave or not to leave? A multi-sample study on individual, job-related, and organizational antecedents of employability and retirement intentions
On Asymptotic Robustness of NT Methods with Missing Data
The literature on asymptotic robustness of normal theory (NT) methods outlines conditions under which the NT estimator remains asymptotically efficient and the NT test statistic retains its chi-square distribution even under nonnormality. These conditions have been stated both abstractly and in terms of properties of specific models. This research discusses issues associated with extending asymptotic robustness theory to the direct ML estimator and the associated test statistic when data are missing completely at random (MCAR). It is shown that the same abstract robustness condition necessary for robustness to hold with complete data is required for incomplete data, while properties of specific models (such as mutual independence of the errors and their independence of the factors in a CFA model) no longer ensure robustness with incomplete data. The lack of robustness in such a case is illustrated both mathematically and empirically via a simulation study. The violation becomes more severe when the data are highly nonnormal and when a higher proportion of data is missing.