
    Quantifying Reproducibility in Computational Biology: The Case of the Tuberculosis Drugome

    How easy is it to reproduce the results found in a typical computational biology paper? Either through experience or intuition, the reader will already know that the answer is “with difficulty” or “not at all”. In this paper we attempt to quantify this difficulty by reproducing a previously published paper for different classes of users (ranging from users with little expertise to domain experts), and we suggest ways in which the situation might be improved. Quantification is achieved by estimating the time required to reproduce each of the steps in the method described in the original paper and to make them part of an explicit workflow that reproduces the original results. Reproducing the method took several months of effort, and required using new versions and new software, which posed challenges to reconstructing and validating the results. The quantification leads to “reproducibility maps” that reveal that novice researchers would only be able to reproduce a few of the steps in the method, and that only expert researchers with advanced knowledge of the domain would be able to reproduce the method in its entirety. The workflow itself is published as an online resource together with supporting software and data. The paper concludes with a brief discussion of the complexities of requiring reproducibility in terms of cost versus benefit, and a set of desiderata with our observations and guidelines for improving reproducibility. This has implications not only for reproducing the work of others from published papers, but also for reproducing work from one’s own laboratory.
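    The abstract describes the quantification only at a high level; as a rough illustration of the idea (our sketch, not the authors' code), a reproducibility map can be encoded as estimated reproduction times per method step and per user class, from which one can read off how many steps each class could reproduce within a given effort budget. Step names and hour estimates below are hypothetical placeholders.

```python
# Illustrative sketch: a "reproducibility map" as per-step, per-user-class
# effort estimates. All step names and numbers are hypothetical.

# Estimated hours to reproduce each step; None = could not be reproduced.
STEP_EFFORT_HOURS = {
    "obtain input data":                  {"novice": 4,    "expert": 1},
    "run the computational analysis":     {"novice": 40,   "expert": 6},
    "rebuild with current software":      {"novice": None, "expert": 12},
    "validate against published results": {"novice": None, "expert": 20},
}

def reproducibility_map(budget_hours: float) -> dict:
    """For each user class, list the steps reproducible within the budget."""
    classes = ["novice", "expert"]
    return {
        cls: [step for step, costs in STEP_EFFORT_HOURS.items()
              if costs[cls] is not None and costs[cls] <= budget_hours]
        for cls in classes
    }

for cls, steps in reproducibility_map(budget_hours=24).items():
    print(f"{cls}: {len(steps)}/{len(STEP_EFFORT_HOURS)} steps reproducible -> {steps}")
```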

    A survey of general-purpose experiment management tools for distributed systems

    In the field of large-scale distributed systems, experimentation is particularly difficult. The systems under study are complex, often nondeterministic and unreliable, the software is plagued with bugs, and the experiment workflows are unclear and hard to reproduce. These obstacles have led many independent researchers to design tools to control their experiments, boost productivity, and improve the quality of scientific results. Despite much research in the domain of distributed-systems experiment management, the current fragmentation of efforts calls for a general analysis. We therefore propose to build a framework to uncover missing functionality in these tools, enable meaningful comparisons between them, and derive recommendations for future improvements and research. The contribution of this paper is twofold. First, we provide an extensive list of features offered by general-purpose experiment management tools dedicated to distributed-systems research on real platforms. We then use it to assess existing solutions and compare them, outlining possible future paths for improvement.
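    As a minimal, hypothetical sketch of the kind of comparison such a feature framework enables (the feature names and tool assessments below are invented, not the survey's actual data), the feature list can be treated as a matrix over tools, exposing coverage and functional gaps:

```python
# Illustrative feature-matrix comparison; features and assessments are invented.
FEATURES = ["experiment description", "deployment", "monitoring",
            "fault handling", "data analysis", "provenance capture"]

TOOLS = {
    "ToolA": {"experiment description", "deployment", "monitoring"},
    "ToolB": {"deployment", "fault handling", "provenance capture"},
}

def coverage(supported: set) -> float:
    """Fraction of the surveyed feature list that a tool supports."""
    return len(supported & set(FEATURES)) / len(FEATURES)

for name, supported in TOOLS.items():
    missing = [f for f in FEATURES if f not in supported]
    print(f"{name}: coverage={coverage(supported):.0%}, missing={missing}")
```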

    A new approach for publishing workflows: abstractions, standards, and linked data

    In recent years, a variety of systems have been developed that export the workflows used to analyze data and make them part of published articles. We argue that the workflows published in current approaches are dependent on the specific codes used for execution, the specific workflow system used, and the specific workflow catalogs where they are published. In this paper, we describe a new approach that addresses these shortcomings and makes workflows more reusable through: 1) the use of abstract workflows to complement executable workflows, so they remain reusable when the execution environment is different; 2) the publication of both abstract and executable workflows using standards such as the Open Provenance Model, so they can be imported by other workflow systems; and 3) the publication of workflows as Linked Data, resulting in open, web-accessible workflow repositories. We illustrate this approach using a complex workflow that we re-created from an influential publication that describes the generation of 'drugomes'.
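    To make the Linked Data idea concrete, here is a minimal sketch (our illustration, using rdflib) of describing an abstract workflow step and an executable realisation of it as RDF and serialising it for the web. The namespaces and vocabulary terms are placeholders, not the OPM-based vocabulary actually used by the authors.

```python
# Minimal Linked Data sketch with rdflib; vocabulary terms are placeholders.
from rdflib import Graph, Namespace, Literal, RDF

EX = Namespace("http://example.org/workflow/")
WF = Namespace("http://example.org/vocab/wf#")  # placeholder vocabulary

g = Graph()
g.bind("wf", WF)

# An abstract step, independent of the concrete code used to execute it.
g.add((EX["step/similarity-search"], RDF.type, WF.AbstractStep))
g.add((EX["step/similarity-search"], WF.label, Literal("protein similarity search")))

# An executable realisation of that step, linked back to the abstract description.
g.add((EX["run/blast-search"], RDF.type, WF.ExecutableStep))
g.add((EX["run/blast-search"], WF.implements, EX["step/similarity-search"]))

# Serialise as Turtle, ready to publish as web-accessible Linked Data.
print(g.serialize(format="turtle"))
```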

    Principles for data analysis workflows

    Traditional data science education often omits training on research workflows: the process that moves a scientific investigation from raw data to coherent research question to insightful contribution. In this paper, we elaborate basic principles of a reproducible data analysis workflow by defining three phases: the Exploratory, Refinement, and Polishing Phases. Each workflow phase is roughly centered around the audience to whom research decisions, methodologies, and results are being immediately communicated. Importantly, each phase can also give rise to a number of research products beyond traditional academic publications. Where relevant, we draw analogies between principles for data-intensive research workflows and established practice in software development. The guidance provided here is not intended to be a strict rulebook; rather, the suggested practices and tools for advancing reproducible, sound data-intensive analysis may support both students and current professionals.
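    Purely as an illustration of the framing (the audiences and example products below are our assumptions, not an enumeration from the paper), the three phases can be written down as a small data structure that pairs each phase with the audience it centres on and the research products it can yield:

```python
# Illustrative encoding of the three workflow phases; audiences and products
# are assumed examples, not taken verbatim from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class Phase:
    name: str
    audience: str
    example_products: tuple

WORKFLOW_PHASES = (
    Phase("Exploratory", "yourself",           ("exploratory notebooks", "data-quality notes")),
    Phase("Refinement",  "collaborators",      ("cleaned datasets", "documented analysis scripts")),
    Phase("Polishing",   "the wider community", ("reproducible code release", "paper or report")),
)

for phase in WORKFLOW_PHASES:
    products = ", ".join(phase.example_products)
    print(f"{phase.name}: communicated to {phase.audience}; possible products: {products}")
```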

    Interactive tools for reproducible science

    Reproducibility should be a cornerstone of science. It plays an essential role in research validation and reuse. In recent years, the scientific community and the general public have become increasingly aware of the reproducibility crisis, i.e., the widespread inability of researchers to reproduce published work, including their own. The reproducibility crisis has been identified in most branches of data-driven science. The effort required to document, clean, preserve, and share experimental resources has been described as one of the core contributors to this irreproducibility challenge. Documentation, preservation, and sharing are key reproducible research practices, but they are of little perceived value to scientists because they fall outside the traditional academic reputation economy, which is focused on novelty-driven scientific contributions.

Scientific research is increasingly focused on the creation, observation, processing, and analysis of large data volumes. On the one hand, this transition towards computational and data-intensive science poses new challenges for research reproducibility and reuse. On the other hand, increased availability of, and advances in, computation and web technologies offer new opportunities to address the reproducibility crisis. A prominent example is the World Wide Web (WWW), which was developed in response to researchers’ needs to quickly share research data and findings with the scientific community. The WWW was invented at the European Organization for Nuclear Research (CERN), a key laboratory in High Energy Physics (HEP), one of the most data-intensive scientific domains.

This thesis reports on research conducted in the context of CERN Analysis Preservation (CAP), a Research Data Management (RDM) service tailored to CERN's major experiments. We use this scientific environment to study the role and requirements of interactive tools in facilitating reproducible research, and we build a wider understanding of researchers' interactions with tools that support research documentation, preservation, and sharing. From an HCI perspective, the following aspects are fundamental: (1) characterize and map requirements and practices around research preservation and reuse; (2) understand the wider role and impact of RDM tools in scientific workflows; (3) design tools and interactions that promote, motivate, and acknowledge reproducible research practices.

The research reported in this thesis represents the first systematic application of HCI methods in the study and design of interactive tools for reproducible science. We have built an empirical understanding of reproducible research practices and the role of supportive tools through research in HEP and across a variety of scientific fields. We designed prototypes and implemented services that aim to create rewarding and motivating interactions, and we conducted mixed-method evaluations to assess the UX of the designs, in particular with respect to usefulness, suitability, and persuasiveness. We report on four empirical studies in which 42 researchers and data managers participated.

In the first interview study, we asked HEP data analysts about RDM practices and invited them to explore and discuss CAP. Our findings show that tailored preservation services allow for introducing and promoting meaningful rewards and incentives that benefit contributors in their research work. Here, we introduce the term secondary usage forms of RDM tools: while not part of the core mission of the tools, secondary usage forms motivate contributions through meaningful rewards. We extended this research through a cross-domain interview study with data analysts and data stewards from a diverse set of scientific fields. Based on the findings of this cross-domain study, we contribute a Stage-Based Model of Personal RDM Commitment Evolution that explains how and why scientists commit to open and reproducible science.

To address the motivation challenge, we explored if and how gamification can motivate contributions and promote reproducible research practices. To this end, we designed two prototypes of a gamified preservation service inspired by CAP, each making use of different underlying mechanisms. HEP researchers found both implementations valuable, enjoyable, suitable, and persuasive. The gamification layer improves the visibility of scientists and research work and facilitates content navigation and discovery. Based on these findings, we implemented six tailored science badges in CAP in our second gamification study. The badges promote and reward high-quality documentation and special uses of preserved research. Findings from our evaluation with HEP researchers show that tailored science badges enable novel forms of research repository navigation and content discovery that benefit users and contributors. We discuss how the use of tailored science badges as an incentivizing element paves new ways for interaction with research repositories.

Finally, we describe the role of HCI in supporting reproducible research practices. We stress that tailored RDM tools can improve content navigation and discovery, which is key to the design of secondary usage forms. Moreover, we argue that incentivizing elements like gamification may not only motivate contributions, but also promote secondary uses and enable new forms of interaction with preserved research. Based on our empirical research, we describe the roles of both HCI scholars and practitioners in building interactive tools for reproducible science, and we outline our vision to transform computational and data-driven research preservation through ubiquitous preservation strategies that integrate into research workflows and make use of automated knowledge recording. In conclusion, this thesis advocates the unique role of HCI in supporting, motivating, and transforming reproducible research practices through the design of tools that enable effective RDM. We present practices around research preservation and reuse in HEP and beyond. Our research paves new ways for interaction with RDM tools that support and motivate reproducible science.
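    As a rough, hypothetical sketch of the badge idea described above (not the actual CAP implementation; badge names, signals, and criteria are invented), tailored science badges can be derived from simple documentation and reuse signals on a preserved record and then used to filter repository listings:

```python
# Illustrative badge logic; signals, badge names, and thresholds are invented.
RECORDS = [
    {"title": "Analysis A", "has_readme": True,  "has_container": True,  "reuse_count": 3},
    {"title": "Analysis B", "has_readme": False, "has_container": True,  "reuse_count": 0},
]

def award_badges(record: dict) -> set:
    """Map documentation and usage signals on a preserved record to badges."""
    badges = set()
    if record["has_readme"] and record["has_container"]:
        badges.add("well-documented")
    if record["reuse_count"] > 0:
        badges.add("reused")
    return badges

# Badge-based navigation: list only the records carrying a given badge.
well_documented = [r["title"] for r in RECORDS if "well-documented" in award_badges(r)]
print(well_documented)
```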

    Scientific Workflows: Past, Present and Future

    This special issue and our editorial celebrate 10 years of progress with data-intensive or scientific workflows. There have been very substantial advances in the representation of workflows and in the engineering of workflow management systems (WMS). The creation and refinement stages are now well supported, with a significant improvement in usability. Improved abstraction supports cross-fertilisation between different workflow communities and consistent interpretation as WMS evolve. Through such re-engineering, WMS deliver much improved performance, significantly increased scale, and sophisticated reliability mechanisms. Further improvement is anticipated from substantial advances in optimisation. We invited papers from those who have delivered these advances and selected 14 to represent today's achievements and representative plans for future progress. This editorial introduces those contributions with an overview and categorisation of the papers. Furthermore, it elucidates responses from a survey of major workflow systems, which provides evidence of substantial progress and a structured index of related papers. We conclude with suggestions on areas where further research and development are needed and offer a vision of future research directions.