Workflow analysis of data science code in public GitHub repositories
Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Although the scientific literature has hinted at the iterative and exploratory nature of data science coding, further empirical evidence is needed to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, to inform tooling support. To obtain a deeper understanding of the iterative and exploratory nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of these transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using a first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigated the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted dataset and (b) the dataset labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.
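The first-order Markov chain analysis described above can be sketched as follows: given per-notebook sequences of cell-level step labels, count adjacent-pair transitions and normalise each row by its outgoing total. The step names and sequences here are illustrative placeholders, not the study's actual annotation scheme or data.

```python
from collections import Counter, defaultdict

def transition_probabilities(step_sequences):
    """Estimate first-order Markov transition probabilities from
    sequences of per-cell step labels (one sequence per notebook)."""
    counts = defaultdict(Counter)
    for seq in step_sequences:
        # Each adjacent pair of cells contributes one observed transition.
        for src, dst in zip(seq, seq[1:]):
            counts[src][dst] += 1
    # Normalise each source step's counts into a probability distribution.
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

# Hypothetical label sequences for two notebooks (labels are invented).
notebooks = [
    ["load", "preprocess", "model", "evaluate", "preprocess", "model"],
    ["load", "explore", "preprocess", "model"],
]
probs = transition_probabilities(notebooks)
```

In this toy input, half of the transitions out of "load" go to "preprocess", and every transition out of "preprocess" goes to "model", which is the kind of row-wise probability the study compares across steps.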
Computational reproducibility of Jupyter notebooks from biomedical publications
Jupyter notebooks facilitate the bundling of executable code with its
documentation and output in one interactive environment, and they represent a
popular mechanism to document and share computational workflows. The
reproducibility of computational aspects of research is a key component of
scientific reproducibility but has not yet been assessed at scale for Jupyter
notebooks associated with biomedical publications. We address computational
reproducibility at two levels: First, using fully automated workflows, we
analyzed the computational reproducibility of Jupyter notebooks related to
publications indexed in PubMed Central. We identified such notebooks by mining
the articles' full text, locating them on GitHub and re-running them in an
environment as close to the original as possible. We documented reproduction
success and exceptions and explored relationships between notebook
reproducibility and variables related to the notebooks or publications. Second,
this study represents a reproducibility attempt in and of itself, using
essentially the same methodology twice on PubMed Central over two years. Out of
27271 notebooks from 2660 GitHub repositories associated with 3467 articles,
22578 notebooks were written in Python, including 15817 that had their
dependencies declared in standard requirement files and that we attempted to
re-run automatically. For 10388 of these, all declared dependencies could be
installed successfully, and we re-ran them to assess reproducibility. Of these,
1203 notebooks ran through without any errors, including 879 that produced
results identical to those reported in the original notebook and 324 for which
our results differed from the originally reported ones. Running the other
notebooks resulted in exceptions. We zoom in on common problems, highlight
trends and discuss potential improvements to Jupyter-related workflows
associated with biomedical publications.
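The automated re-run step described above can be sketched as: detect a standard dependency declaration file in the repository checkout, install from it, then execute the notebook headlessly so failures surface as exceptions. The file list and helper name are assumptions for illustration; the study's actual pipeline may differ.

```python
import pathlib
import sys

# Standard dependency declaration files a pipeline might check for;
# the exact set used in the study is an assumption here.
REQUIREMENT_FILES = ("requirements.txt", "environment.yml", "Pipfile", "setup.py")

def plan_rerun(repo_dir, notebook):
    """Sketch of one repository's re-run plan: find declared dependency
    files and build the install and headless-execution commands."""
    repo = pathlib.Path(repo_dir)
    deps = [f for f in REQUIREMENT_FILES if (repo / f).is_file()]
    commands = []
    if "requirements.txt" in deps:
        # Install the declared dependencies before executing anything.
        commands.append([sys.executable, "-m", "pip", "install",
                        "-r", str(repo / "requirements.txt")])
    # Execute the notebook without a browser; cell errors make
    # nbconvert exit with a non-zero status, which the pipeline
    # would record as a reproduction exception.
    commands.append(["jupyter", "nbconvert", "--to", "notebook",
                     "--execute", "--inplace", str(repo / notebook)])
    return deps, commands
```

Comparing the re-executed notebook's outputs against the committed ones would then distinguish the "identical results" and "differing results" cases reported above.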
Machine Learning practices and infrastructures
Machine Learning (ML) systems, particularly when deployed in high-stakes
domains, are deeply consequential. They can exacerbate existing inequities,
create new modes of discrimination, and reify outdated social constructs.
Accordingly, the social context (i.e. organisations, teams, cultures) in which
ML systems are developed is a site of active research for the field of AI
ethics, and intervention for policymakers. This paper focuses on one aspect of
social context that is often overlooked: interactions between practitioners and
the tools they rely on, and the role these interactions play in shaping ML
practices and the development of ML systems. In particular, through an
empirical study of questions asked on the Stack Exchange forums, the use of
interactive computing platforms (e.g. Jupyter Notebook and Google Colab) in ML
practices is explored. I find that interactive computing platforms are used in
a host of learning and coordination practices, which constitutes an
infrastructural relationship between interactive computing platforms and ML
practitioners. I describe how ML practices are co-evolving alongside the
development of interactive computing platforms, and highlight how this risks
making invisible aspects of the ML life cycle that AI ethics researchers have
demonstrated to be particularly salient for the societal impact of deployed ML
systems.
Setting the basis of best practices and standards for curation and annotation of logical models in biology
The fast accumulation of biological data calls for their integration, analysis and exploitation through more systematic approaches. The generation of novel, relevant hypotheses from this enormous quantity of data remains challenging. Logical models have long been used to answer a variety of questions regarding the dynamical behaviours of regulatory networks. As the number of published logical models increases, there is a pressing need for systematic model annotation, referencing and curation in community-supported and standardised formats. This article summarises the key topics and future directions of a meeting entitled ‘Annotation and curation of computational models in biology’, organised as part of the 2019 [BC]2 conference. The purpose of the meeting was to develop and drive forward a plan towards the standardised annotation of logical models, and to review and connect various ongoing projects of experts from different communities involved in the modelling and annotation of molecular biological entities, interactions, pathways and models. This article defines a roadmap towards the annotation and curation of logical models, including milestones for best practices and minimum standard requirements.