Estimating credibility of science claims : analysis of forecasting data from metascience projects : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Statistics at Massey University, Albany, New Zealand

Abstract

The veracity of scientific claims is not always certain. In fact, enough claims have been proven incorrect that many scientists believe science itself is facing a “replication crisis”. Large-scale replication projects have provided empirical evidence that only around 50% of published social and behavioral science findings are replicable. Multiple forecasting studies have shown that the outcomes of replication projects can be predicted by crowdsourced human evaluators. The research presented in this thesis builds on previous forecasting studies, deriving new findings and exploring new scope and scale. The research is centered around the DARPA SCORE (Systematizing Confidence in Open Research and Evidence) programme, a project aimed at developing measures of credibility for social and behavioral science claims. As part of my contribution to SCORE, I, along with an international collaboration, elicited forecasts from human experts via surveys and prediction markets to predict the replicability of 3000 claims. I also present research on other forecasting studies.

In Chapter 2, I pool data from previous studies to analyse the performance of prediction markets and surveys with higher statistical power. I confirm that prediction markets are better at forecasting replication outcomes than surveys. This study also demonstrates the relationship between the p-values of original findings and replication outcomes. These findings informed the experimental and statistical design for forecasting the replicability of 3000 claims as part of the SCORE programme. A full description of the design, including planned statistical analyses, is included in Chapter 3. Due to COVID-19 restrictions, the direct replications conducted by other teams within the SCORE collaboration could not be completed, so our forecasts could not be validated and those results cannot be presented in this thesis. The completion of these replications is now scheduled for 2022, and the pre-analysis plan presented in Chapter 3 will provide the basis for the analysis of the resulting data.

In Chapter 4, I present an analysis of ‘meta’ forecasts, that is, forecasts of field-wide and year-specific replication rates. We elicited and published community expectations that replication rates will differ by field and will increase over time. These forecasts offer valuable insight into the academic community’s views of the replication crisis, including for research fields in which no large-scale replication studies have yet been undertaken. Once the full results from SCORE are available, validating these community expectations will yield additional insights. I also analyse forecasters’ ability to predict replication outcomes and effect sizes in Chapters 5 (Creative Destruction in Science) and 6 (A creative destruction approach to replication: Implicit work and sex morality across cultures). These projects used a ‘creative destruction’ approach to replication, in which a claim is compared not only to the null hypothesis but also to alternative, contradictory claims. I conclude that forecasters can predict the size and direction of effects. Chapter 7 examines the use of forecasting for scientific outcomes beyond replication. In the COVID-19 preprint forecasting project, I find that forecasters can predict whether a preprint will be published within one year, as well as the quality of the publishing journal. Forecasters can also predict the number of citations preprints will receive.
This thesis demonstrates that information about the replicability of scientific claims is dispersed within the scientific community. I have helped to develop methodologies and tools to efficiently elicit and aggregate forecasts. Forecasts about scientific outcomes can be used as guides to credibility, to gauge community expectations, and to allocate scarce replication resources efficiently.