34 research outputs found

    Veridical Data Science

    Building and expanding on principles of statistics, machine learning, and scientific inquiry, we propose the predictability, computability, and stability (PCS) framework for veridical data science. Our framework, comprising both a workflow and documentation, aims to provide responsible, reliable, reproducible, and transparent results across the entire data science life cycle. The PCS workflow uses predictability as a reality check and considers the importance of computation in data collection/storage and algorithm design. It augments predictability and computability with an overarching stability principle for the data science life cycle. Stability expands on statistical uncertainty considerations to assess how human judgment calls impact data results through data and model/algorithm perturbations. Moreover, we develop inference procedures that build on PCS, namely PCS perturbation intervals and PCS hypothesis testing, to investigate the stability of data results relative to problem formulation, data cleaning, modeling decisions, and interpretations. We illustrate PCS inference through neuroscience and genomics projects of our own and others and compare it to existing methods in high-dimensional, sparse linear model simulations. Over a wide range of misspecified simulation models, PCS inference demonstrates favorable performance in terms of ROC curves. Finally, we propose PCS documentation based on R Markdown or Jupyter Notebook, with publicly available, reproducible code and narratives to back up human choices made throughout an analysis. The PCS workflow and documentation are demonstrated in a genomics case study available on Zenodo.
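    The perturbation-interval idea in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a hypothetical toy linear-regression problem and bootstrap resampling as the data perturbation, then reports the central range of coefficient estimates across perturbations as a stability interval.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical toy data: y = X @ beta_true + noise (not from the paper)
    n, p = 200, 3
    X = rng.normal(size=(n, p))
    beta_true = np.array([1.5, 0.0, -2.0])
    y = X @ beta_true + rng.normal(size=n)

    def fit_ols(X, y):
        # Least-squares coefficient estimates
        return np.linalg.lstsq(X, y, rcond=None)[0]

    # Data perturbations: bootstrap resampling of observations
    estimates = []
    for _ in range(500):
        idx = rng.integers(0, n, size=n)
        estimates.append(fit_ols(X[idx], y[idx]))
    estimates = np.array(estimates)

    # Stability ("perturbation") interval: central 95% range of
    # estimates across the perturbed datasets
    lo, hi = np.percentile(estimates, [2.5, 97.5], axis=0)
    for j in range(p):
        print(f"beta[{j}]: [{lo[j]:.2f}, {hi[j]:.2f}]")
    ```

    In the full PCS framework, perturbations also span model/algorithm choices and data-cleaning decisions, not only resampling; a coefficient whose sign or magnitude flips across reasonable perturbations would be flagged as unstable.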

    Interview, Building Trust in Medical AI Algorithms with Veridical Data Science


    Three principles of data science: predictability, computability, and stability (PCS)


    Decentralized Infrastructure for Reproducible and Replicable Geographical Science

    The I-GUIDE cyberinfrastructure project for convergence science is a leading example of the possibilities the geospatial data revolution holds for scientific discovery. However, rapidly expanding access to increasingly complex data sources and methods of computational analysis also presents a challenge to the research community. With more data and more potential analyses, researchers face the possibility of jeopardizing the inferential power of convergence research with selection bias. Well-designed infrastructure that can flexibly guide researchers as they record and track decisions in their research designs opens a path to mitigating this problem, while also expanding the reproducibility and replicability of research. Much of the infrastructure needed for convergence research can be borrowed and adapted from other disciplines, but geographic convergence research confronts at least five novel challenges: the need for geographically explicit project metadata, managing diverse and complex data inputs, handling restricted data, specifying and reproducing computational environments, and disclosing researcher decisions and threats to validity that are unique to geographic research. We introduce a template research compendium and analysis plan for study preregistration to address these novel challenges.

    Toward a taxonomy of trust for probabilistic machine learning

    Probabilistic machine learning increasingly informs critical decisions in medicine, economics, politics, and beyond. To aid the development of trust in these decisions, we develop a taxonomy delineating where trust in an analysis can break down: (i) in the translation of real-world goals to goals on a particular set of training data, (ii) in the translation of abstract goals on the training data to a concrete mathematical problem, (iii) in the use of an algorithm to solve the stated mathematical problem, and (iv) in the use of a particular code implementation of the chosen algorithm. We detail how trust can fail at each step and illustrate our taxonomy with two case studies. Finally, we describe a wide variety of methods that can be used to increase trust at each step of our taxonomy. The use of our taxonomy highlights not only steps where existing research work on trust tends to concentrate but also steps where building trust is particularly challenging.