Report on the Third Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE3)
This report records and discusses the Third Workshop on Sustainable Software
for Science: Practice and Experiences (WSSSPE3). The report includes a
description of the keynote presentation of the workshop, which served as an
overview of sustainable scientific software. It also summarizes a set of
lightning talks in which speakers highlighted to-the-point lessons and
challenges pertaining to sustaining scientific software. The final and main
contribution of the report is a summary of the discussions, future steps, and
future organization for a set of self-organized working groups on topics
including developing pathways to funding scientific software; constructing
useful common metrics for crediting software stakeholders; identifying
principles for sustainable software engineering design; reaching out to
research software organizations around the world; and building communities for
software sustainability. For each group, we include a point of contact and a
landing page that can be used by those who want to join that group's future
activities. The main challenge left by the workshop is whether the groups
will carry out the activities they have scheduled, and how the WSSSPE
community can encourage this to happen.
How Do Data Science Workers Communicate Intermediate Results?
Data science workers increasingly collaborate on large-scale projects before
communicating insights to a broader audience in the form of visualization.
While prior work has modeled how data science teams, oftentimes with distinct
roles and work processes, communicate knowledge to outside stakeholders, we
know little about how data science workers communicate intermediate results
before delivering the final products. In this work, we contribute a nuanced
description of the intermediate communication process within data science
teams. By analyzing interviews with 8 self-identified data science workers,
we characterized the data science intermediate communication process with four
factors, including the types of audience, communication goals, shared
artifacts, and mode of communication. We also identified overarching
challenges in the current communication process and discussed design
implications that might inform better tools for facilitating intermediate
communication within data science teams.
Comment: This paper was accepted for presentation as part of the eighth
Symposium on Visualization in Data Science (VDS) at ACM KDD 2022 as well as
IEEE VIS 2022. http://www.visualdatascience.org/2022/index.htm
Chemical information matters: an e-Research perspective on information and data sharing in the chemical sciences
Recently, a number of organisations have called for open access to scientific information and especially to the data obtained from publicly funded research, among which the Royal Society report and the European Commission press release are particularly notable. It has long been accepted that building research on the foundations laid by other scientists is both effective and efficient. Regrettably, some disciplines, chemistry being one, have been slow to recognise the value of sharing and have thus been reluctant to curate their data and information in preparation for exchanging it. The very significant increases in both the volume and the complexity of the datasets produced have encouraged the expansion of e-Research, and stimulated the development of methodologies for managing, organising, and analysing "big data". We review the evolution of cheminformatics, the amalgam of chemistry, computer science, and information technology, and assess the wider e-Science and e-Research perspective. Chemical information does matter, as do matters of communicating data and collaborating with data. For chemistry, unique identifiers, structure representations, and property descriptors are essential to the activities of sharing and exchange. Open science entails the sharing of more than mere facts: for example, the publication of negative outcomes can facilitate better understanding of which synthetic routes to choose, an aspiration of the Dial-a-Molecule Grand Challenge. The protagonists of open notebook science go even further and exchange their thoughts and plans. We consider the concepts of preservation, curation, provenance, discovery, and access in the context of the research lifecycle, and then focus on the role of metadata, particularly the ontologies on which the emerging chemical Semantic Web will depend. Among our conclusions, we present our choice of the "grand challenges" for the preservation and sharing of chemical information.
Workflow analysis of data science code in public GitHub repositories
Despite the ubiquity of data science, we are far from rigorously understanding how coding in data science is performed. Even though the scientific literature has hinted at the iterative and explorative nature of data science coding, we need further empirical evidence to understand this practice and its workflows in detail. Such understanding is critical to recognise the needs of data scientists and, for instance, inform tooling support. To obtain a deeper understanding of the iterative and explorative nature of data science coding, we analysed 470 Jupyter notebooks publicly available in GitHub repositories. We focused on the extent to which data scientists transition between different types of data science activities, or steps (such as data preprocessing and modelling), as well as the frequency and co-occurrence of such transitions. For our analysis, we developed a dataset with the help of five data science experts, who manually annotated the data science steps for each code cell within the aforementioned 470 notebooks. Using the first-order Markov chain model, we extracted the transitions and analysed the transition probabilities between the different steps. In addition to providing deeper insights into the implementation practices of data science coding, our results provide evidence that the steps in a data science workflow are indeed iterative and reveal specific patterns. We also evaluated the use of the annotated dataset to train machine-learning classifiers to predict the data science step(s) of a given code cell. We investigated the representativeness of the classification by comparing the workflow analysis applied to (a) the predicted data set and (b) the data set labelled by experts, finding an F1-score of about 71% for the 10-class data science step prediction problem.
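The first-order Markov analysis described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the step labels and example sequences are hypothetical, and each notebook is assumed to be available as an ordered list of per-cell step annotations.

```python
from collections import Counter

def transition_probabilities(notebooks):
    """Estimate first-order Markov transition probabilities between data
    science steps from per-notebook sequences of cell annotations."""
    pair_counts = Counter()  # counts of (from_step, to_step) transitions
    from_counts = Counter()  # counts of each step appearing as a source
    for steps in notebooks:
        for a, b in zip(steps, steps[1:]):  # consecutive cell pairs
            pair_counts[(a, b)] += 1
            from_counts[a] += 1
    # Normalise each transition count by its source step's total outgoing count.
    return {(a, b): n / from_counts[a] for (a, b), n in pair_counts.items()}

# Hypothetical annotated notebooks: each is an ordered list of step labels.
notebooks = [
    ["load", "preprocess", "model", "evaluate"],
    ["load", "preprocess", "preprocess", "model"],
]
probs = transition_probabilities(notebooks)
# e.g. probs[("load", "preprocess")] == 1.0: every "load" cell here is
# followed by preprocessing; repeated "preprocess" steps reflect iteration.
```

Self-transitions such as ("preprocess", "preprocess") are what make the iterative nature of the workflow visible in the transition matrix.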
Computational Notebooks as Co-Design Tools: Engaging Young Adults Living with Diabetes, Family Carers, and Clinicians with Machine Learning Models
Engaging end user groups with machine learning (ML) models can help align the design of predictive systems with
people’s needs and expectations. We present a co-design study investigating the benefits and challenges of using
computational notebooks to inform ML models with end user groups. We used a computational notebook to engage
young adults, carers, and clinicians with an example ML model that predicted health risk in diabetes care. Through
co-design workshops and retrospective interviews, we found that participants particularly valued using the interactive
data visualisations of the computational notebook to scaffold multidisciplinary learning, anticipate benefits and harms
of the example ML model, and create fictional feature importance plots to highlight care needs. Participants also
reported challenges, from running code cells to managing information asymmetries and power imbalances. We discuss
the potential of leveraging computational notebooks as interactive co-design tools to meet end user needs early in ML
model lifecycles.
Visualising data science workflows to support third-party notebook comprehension: an empirical study
Data science is an exploratory and iterative process that often leads to complex and unstructured code. This code is usually poorly documented and, consequently, hard to understand by a third party. In this paper, we first collect empirical evidence for the non-linearity of data science code from real-world Jupyter notebooks, confirming the need for new approaches that aid in data science code interaction and comprehension. Second, we propose a visualisation method that elucidates implicit workflow information in data science code and assists data scientists in navigating the so-called garden of forking paths in non-linear code. The visualisation also provides information such as the rationale and the identification of the data science pipeline step based on cell annotations. We conducted a user experiment with data scientists to evaluate the proposed method, assessing the influence of (i) different workflow visualisations and (ii) cell annotations on code comprehension. Our results show that visualising the exploration helps the users obtain an overview of the notebook, significantly improving code comprehension. Furthermore, our qualitative analysis provides more insights into the difficulties faced during data science code comprehension.