Veridical Data Science
Building and expanding on principles of statistics, machine learning, and
scientific inquiry, we propose the predictability, computability, and stability
(PCS) framework for veridical data science. Our framework, comprising both a
workflow and documentation, aims to provide responsible, reliable,
reproducible, and transparent results across the entire data science life
cycle. The PCS workflow uses predictability as a reality check and considers
the importance of computation in data collection/storage and algorithm design.
It augments predictability and computability with an overarching stability
principle for the data science life cycle. Stability expands on statistical
uncertainty considerations to assess how human judgment calls impact data
results through data and model/algorithm perturbations. Moreover, we develop
inference procedures that build on PCS, namely PCS perturbation intervals and
PCS hypothesis testing, to investigate the stability of data results relative
to problem formulation, data cleaning, modeling decisions, and interpretations.
We illustrate PCS inference through neuroscience and genomics projects of our
own and others and compare it to existing methods in high dimensional, sparse
linear model simulations. Over a wide range of misspecified simulation models,
PCS inference demonstrates favorable performance in terms of ROC curves.
Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook,
with publicly available, reproducible codes and narratives to back up human
choices made throughout an analysis. The PCS workflow and documentation are
demonstrated in a genomics case study available on Zenodo.
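
As a rough illustration of the perturbation-interval idea described in this abstract, the sketch below pools estimates of one coefficient across data perturbations (bootstrap resamples) and model perturbations (several ridge penalties), then reports quantiles of the pooled estimates. It is a minimal sketch under stated assumptions, not the authors' procedure: the function name `perturbation_interval`, the choice of ridge regression, the penalty grid, and the quantile level are all hypothetical, and the predictability screening step that the abstract associates with PCS is omitted.

```python
import numpy as np


def perturbation_interval(X, y, penalties=(0.1, 1.0, 10.0), n_boot=200,
                          coef_index=0, level=0.9, seed=0):
    """Quantile interval for one coefficient, pooled over data and model perturbations.

    Hypothetical helper for illustration only; the names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    estimates = []
    for _ in range(n_boot):
        # Data perturbation: resample rows with replacement.
        idx = rng.integers(0, n, size=n)
        Xb, yb = X[idx], y[idx]
        for lam in penalties:
            # Model perturbation: refit ridge regression under each penalty.
            beta = np.linalg.solve(Xb.T @ Xb + lam * np.eye(p), Xb.T @ yb)
            estimates.append(beta[coef_index])
    lower, upper = np.quantile(estimates, [(1 - level) / 2, (1 + level) / 2])
    return float(lower), float(upper)


if __name__ == "__main__":
    # Synthetic data, generated here only to exercise the function.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 3))
    y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
    print(perturbation_interval(X, y))
```

Pooling over both kinds of perturbation means the interval reflects sensitivity to the analyst's modeling choice as well as sampling variability, which is the distinction the abstract draws between stability and conventional statistical uncertainty.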
Decentralized Infrastructure for Reproducible and Replicable Geographical Science
The I-GUIDE cyberinfrastructure project for convergence science is a leading example of the possibilities the geospatial data revolution holds for scientific discovery. However, rapidly expanding access to increasingly complex data sources and methods of computational analysis also presents a challenge to the research community. With more data and more potential analyses, researchers face the possibility of jeopardizing the inferential power of convergence research with selection bias. Well-designed infrastructure that can flexibly guide researchers as they record and track decisions in their research designs opens a path to mitigating this problem, while also expanding the reproducibility and replicability of research. Much of the infrastructure needed for convergence research can be borrowed and adapted from other disciplines, but geographic convergence research confronts at least five novel challenges. These are the need for geographically explicit project metadata, managing diverse and complex data inputs, handling restricted data, specifying and reproducing computational environments, and disclosing researcher decisions and threats to validity that are unique to geographic research. We introduce a template research compendium and analysis plan for study preregistration to address these novel challenges.
Toward a taxonomy of trust for probabilistic machine learning
Probabilistic machine learning increasingly informs critical decisions in medicine, economics, politics, and beyond. To aid the development of trust in these decisions, we develop a taxonomy delineating where trust in an analysis can break down: (i) in the translation of real-world goals to goals on a particular set of training data, (ii) in the translation of abstract goals on the training data to a concrete mathematical problem, (iii) in the use of an algorithm to solve the stated mathematical problem, and (iv) in the use of a particular code implementation of the chosen algorithm. We detail how trust can fail at each step and illustrate our taxonomy with two case studies. Finally, we describe a wide variety of methods that can be used to increase trust at each step of our taxonomy. The use of our taxonomy highlights not only steps where existing research work on trust tends to concentrate but also steps where building trust is particularly challenging.