Computational reproducibility of Jupyter notebooks from biomedical publications
Jupyter notebooks facilitate the bundling of executable code with its
documentation and output in one interactive environment, and they represent a
popular mechanism to document and share computational workflows. The
reproducibility of computational aspects of research is a key component of
scientific reproducibility but has not yet been assessed at scale for Jupyter
notebooks associated with biomedical publications. We address computational
reproducibility at two levels: First, using fully automated workflows, we
analyzed the computational reproducibility of Jupyter notebooks related to
publications indexed in PubMed Central. We identified such notebooks by mining
the articles' full text, locating them on GitHub and re-running them in an
environment as close to the original as possible. We documented reproduction
success and exceptions and explored relationships between notebook
reproducibility and variables related to the notebooks or publications. Second,
this study represents a reproducibility attempt in and of itself, using
essentially the same methodology twice on PubMed Central over two years. Out of
27271 notebooks from 2660 GitHub repositories associated with 3467 articles,
22578 notebooks were written in Python, including 15817 that had their
dependencies declared in standard requirement files and that we attempted to
re-run automatically. For 10388 of these, all declared dependencies could be
installed successfully, and we re-ran them to assess reproducibility. Of these,
1203 notebooks ran through without any errors, including 879 that produced
results identical to those reported in the original notebook and 324 for which
our results differed from the originally reported ones. Running the other
notebooks resulted in exceptions. We zoom in on common problems, highlight
trends and discuss potential improvements to Jupyter-related workflows
associated with biomedical publications.
Comment: arXiv admin note: substantial text overlap with arXiv:2209.0430
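The final step described above, deciding whether a successful re-run produced results identical to the originally reported ones, can be sketched as a comparison of code-cell outputs in the notebook's JSON representation (nbformat v4). This is an illustrative sketch, not the study's actual pipeline; the function names are ours.

```python
import json

def load_notebook(path):
    """Load a .ipynb file; notebooks are plain JSON (nbformat v4)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def code_cell_outputs(nb):
    """Collect the recorded outputs of every code cell, in order."""
    return [cell.get("outputs", []) for cell in nb["cells"]
            if cell.get("cell_type") == "code"]

def classify_rerun(original, rerun):
    """Label an error-free re-run 'identical' or 'different' by its outputs."""
    same = code_cell_outputs(original) == code_cell_outputs(rerun)
    return "identical" if same else "different"
```

In practice the comparison would need to normalize volatile output fields (timestamps, memory addresses, execution counts) before declaring a difference; the sketch compares outputs verbatim.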
Notebook articles: towards a transformative publishing experience in nonlinear science
Open Science, Reproducible Research, Findable, Accessible, Interoperable and
Reusable (FAIR) data principles are long term goals for scientific
dissemination. However, the implementation of these principles calls for a
reinspection of our means of dissemination. In our viewpoint, we discuss and
advocate, in the context of nonlinear science, how a notebook article
represents an essential step toward this objective by fully embracing cloud
computing solutions. Notebook articles, as scholarly articles, offer an
alternative, efficient and more ethical way to disseminate research through
their versatile environment. This format invites the readers to delve deeper
into the reported research. Through the interactivity of the notebook articles,
research results, such as equations and figures, are reproducible
even for non-expert readers. The codes and methods are available, in a
transparent manner, to interested readers. The methods can be reused and
adapted to answer additional questions in related topics. The codes run on
cloud computing services, which provide easy access, even to low-income
countries and research groups. The versatility of this environment provides the
stakeholders - from the researchers to the publishers - with opportunities to
disseminate the research results in innovative ways.
Comment: This article is an editorial viewpoint.
Ten simple rules for writing Dockerfiles for reproducible data science.
Computational science has been greatly improved by the use of containers for packaging software and data dependencies. In a scholarly context, the main drivers for using these containers are transparency and support of reproducibility; in turn, a workflow's reproducibility can be greatly affected by the choices made when building containers. In many cases, the build process for the container's image is created from instructions provided in a Dockerfile format. In support of this approach, we present a set of rules to help researchers write understandable Dockerfiles for typical data science workflows. By following the rules in this article, researchers can create containers suitable for sharing with fellow scientists, for including in scholarly communication such as education or scientific papers, and for effective and sustainable personal workflows.
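A few of the recurring recommendations in this genre of guidance (pin the base image, pin dependency versions, order layers for caching, declare an explicit default command) can be illustrated with a minimal Dockerfile sketch. The image tag, file names, and label values below are ours, not taken from the article.

```dockerfile
# Pin the base image to an exact version for deterministic builds
FROM python:3.11.4-slim

# Document provenance with labels
LABEL maintainer="researcher@example.org"

# Pin dependency versions in requirements.txt and install them in one layer
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the analysis code last, so the dependency layer stays cached
COPY analysis/ /app/analysis/
WORKDIR /app

# Declare the default command explicitly
CMD ["python", "analysis/run.py"]
```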
HPC-oriented Canonical Workflows for Machine Learning Applications in Climate and Weather Prediction
Machine learning (ML) applications in weather and climate are gaining momentum as big data and the immense increase in high-performance computing (HPC) power are paving the way. Ensuring FAIR data and reproducible ML practices are significant challenges for Earth system researchers. Even though the FAIR principles are well known to many scientists, research communities have been slow to adopt them. The Canonical Workflow Framework for Research (CWFR) provides a platform to ensure the FAIRness and reproducibility of these practices without overwhelming researchers. This conceptual paper envisions a holistic CWFR approach towards ML applications in weather and climate, focusing on HPC and big data. Specifically, we discuss the FAIR Digital Object (FDO) and Research Object (RO) in the DeepRain project to achieve granular reproducibility. DeepRain is a project that aims to improve precipitation forecasts in Germany using ML. Our concept envisages the raster datacube to provide data harmonization and fast, scalable data access. We suggest the Jupyter notebook as a single reproducible experiment. In addition, we envision JupyterHub as a scalable and distributed central platform that connects all these elements and the HPC resources to the researchers via an easy-to-use graphical interface.
Introducing Reproducibility to Citation Analysis: a Case Study in the Earth Sciences
Objectives: Replicate methods from a 2019 study of Earth Science researcher citation practices. Calculate programmatically whether researchers in Earth Science rely on a smaller subset of literature than estimated by the 80/20 rule. Determine whether these reproducible citation analysis methods can be used to analyze open access uptake.
Methods: Replicated methods of a prior citation study provide an updated transparent, reproducible citation analysis protocol that can be replicated with Jupyter Notebooks.
Results: This study replicated the prior citation study’s conclusions, and also adapted the author’s methods to analyze the citation practices of Earth Scientists at four institutions. We found that 80% of the citations could be accounted for by only 7.88% of journals, a key metric to help identify a core collection of titles in this discipline. We then demonstrated programmatically that 36% of these cited references were available as open access.
Conclusions: Jupyter Notebooks are a viable platform for disseminating replicable processes for citation analysis. A completely open methodology is emerging, and we consider this a step forward. Adherence to the 80/20 rule aligned with institutional research output, but citation preferences are evident. Reproducible citation analysis methods may be used to analyze open access uptake; however, results are inconclusive. It is difficult to determine whether an article was open access at the time of citation, or became open access after an embargo.
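The 80/20-style metric reported above (the fraction of journals accounting for 80% of citations) reduces to a simple cumulative-sum calculation: sort journals by citation count, accumulate from the top, and report the fraction of journals needed to reach the coverage threshold. A minimal sketch of that computation, not the study's own code; the function name and parameter are ours.

```python
from collections import Counter

def core_journal_fraction(cited_journals, coverage=0.80):
    """Smallest fraction of journals whose citations cover `coverage`
    of all citations, counting from the most-cited journal down."""
    counts = sorted(Counter(cited_journals).values(), reverse=True)
    total = sum(counts)
    running, needed = 0, 0
    for c in counts:
        running += c
        needed += 1
        if running >= coverage * total:
            break
    return needed / len(counts)
```

On the study's data this kind of calculation yielded 7.88% of journals covering 80% of citations; a small toy list behaves the same way.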