1,459 research outputs found

    Reprodutibilidade e reuso de experimentos em eScience : workflows, ontologias e scripts

    Get PDF
    Orientadores: Claudia Maria Bauzer Medeiros, Yolanda GilTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Scripts e Sistemas Gerenciadores de Workflows Científicos (SGWfC) são abordagens comumente utilizadas para automatizar o fluxo de processos e análise de dados em experimentos científicos computacionais. Apesar de amplamente usados em diversas disciplinas, scripts são difíceis de entender, adaptar, reusar e reproduzir. Por esta razão, diversas soluções têm sido propostas para auxiliar na reprodutibilidade de experimentos que utilizam ambientes baseados em scripts. Porém, estas soluções não permitem a documentação completa do experimento, nem ajudam quando outros cientistas querem reusar apenas parte do código do script. SGWfCs, por outro lado, ajudam na documentação e reuso através do suporte aos cientistas durante a modelagem e execução dos seus experimentos, que são especificados e executados como componentes interconectados (reutilizáveis) de workflows. Enquanto workflows são melhores que scripts para entendimento e reuso dos experimentos, eles também exigem documentação adicional. Durante a modelagem de um experimento, cientistas frequentemente criam variantes de workflows, e.g., mudando componentes do workflow. Reuso e reprodutibilidade exigem o entendimento e rastreamento da proveniência das variantes, uma tarefa que consome muito tempo. Esta tese tem como objetivo auxiliar na reprodutibilidade e reuso de experimentos computacionais. Para superar estes desafios, nós lidamos com dois problemas de pesquisas: (1) entendimento de um experimento computacional, e (2) extensão de um experimento computacional. Nosso trabalho para resolver estes problemas nos direcionou na escolha de workflows e ontologias como respostas para ambos os problemas. As principais contribuições desta tese são: (i) apresentar os requisitos para a conversão de experimentos baseados em scripts em experimentos reprodutíveis; (ii) propor uma metodologia que guia o cientista durante o processo de conversão de experimentos baseados em scripts em workflow research objects reprodutíveis. (iii) projetar e implementar funcionalidades para avaliação da qualidade de experimentos computacionais; (iv) projetar e implementar o W2Share, um arcabouço para auxiliar a metodologia de conversão, que explora ferramentas e padrões que foram desenvolvidos pela comunidade científica para promover o reuso e reprodutibilidade; (v) projetar e implementar o OntoSoft-VFF, um arcabouço para captura de informação sobre software e componentes de workflow para auxiliar cientistas a gerenciarem a exploração e evolução de workflows. Nosso trabalho é apresentado via casos de uso em Dinâmica Molecular, Bioinformática e Previsão do TempoAbstract: Scripts and Scientific Workflow Management Systems (SWfMSs) are common approaches that have been used to automate the execution flow of processes and data analysis in scientific (computational) experiments. Although widely used in many disciplines, scripts are hard to understand, adapt, reuse, and reproduce. For this reason, several solutions have been proposed to aid experiment reproducibility for script-based environments. However, they neither allow to fully document the experiment nor do they help when third parties want to reuse just part of the code. SWfMSs, on the other hand, help documentation and reuse by supporting scientists in the design and execution of their experiments, which are specified and run as interconnected (reusable) workflow components (a.k.a. building blocks). While workflows are better than scripts for understandability and reuse, they still require additional documentation. During experiment design, scientists frequently create workflow variants, e.g., by changing workflow components. Reuse and reproducibility require understanding and tracking variant provenance, a time-consuming task. This thesis aims to support reproducibility and reuse of computational experiments. To meet these challenges, we address two research problems: (1) understanding a computational experiment, and (2) extending a computational experiment. Our work towards solving these problems led us to choose workflows and ontologies to answer both problems. The main contributions of this thesis are thus: (i) to present the requirements for the conversion of script to reproducible research; (ii) to propose a methodology that guides the scientists through the process of conversion of script-based experiments into reproducible workflow research objects; (iii) to design and implement features for quality assessment of computational experiments; (iv) to design and implement W2Share, a framework to support the conversion methodology, which exploits tools and standards that have been developed by the scientific community to promote reuse and reproducibility; (v) to design and implement OntoSoft-VFF, a framework for capturing information about software and workflow components to support scientists manage workflow exploration and evolution. Our work is showcased via use cases in Molecular Dynamics, Bioinformatics and Weather ForecastingDoutoradoCiência da ComputaçãoDoutor em Ciência da Computação2013/08293-7, 2014/23861-4, 2017/03570-3FAPES

    Capturing the "Whole Tale" of Computational Research: Reproducibility in Computing Environments

    Full text link
    We present an overview of the recently funded "Merging Science and Cyberinfrastructure Pathways: The Whole Tale" project (NSF award #1541450). Our approach has two nested goals: 1) deliver an environment that enables researchers to create a complete narrative of the research process including exposure of the data-to-publication lifecycle, and 2) systematically and persistently link research publications to their associated digital scholarly objects such as the data, code, and workflows. To enable this, Whole Tale will create an environment where researchers can collaborate on data, workspaces, and workflows and then publish them for future adoption or modification. Published data and applications will be consumed either directly by users using the Whole Tale environment or can be integrated into existing or future domain Science Gateways

    MOLNs: A cloud platform for interactive, reproducible and scalable spatial stochastic computational experiments in systems biology using PyURDME

    Full text link
    Computational experiments using spatial stochastic simulations have led to important new biological insights, but they require specialized tools, a complex software stack, as well as large and scalable compute and data analysis resources due to the large computational cost associated with Monte Carlo computational workflows. The complexity of setting up and managing a large-scale distributed computation environment to support productive and reproducible modeling can be prohibitive for practitioners in systems biology. This results in a barrier to the adoption of spatial stochastic simulation tools, effectively limiting the type of biological questions addressed by quantitative modeling. In this paper, we present PyURDME, a new, user-friendly spatial modeling and simulation package, and MOLNs, a cloud computing appliance for distributed simulation of stochastic reaction-diffusion models. MOLNs is based on IPython and provides an interactive programming platform for development of sharable and reproducible distributed parallel computational experiments

    Distributed workflows with Jupyter

    Get PDF
    The designers of a new coordination interface enacting complex workflows have to tackle a dichotomy: choosing a language-independent or language-dependent approach. Language-independent approaches decouple workflow models from the host code's business logic and advocate portability. Language-dependent approaches foster flexibility and performance by adopting the same host language for business and coordination code. Jupyter Notebooks, with their capability to describe both imperative and declarative code in a unique format, allow taking the best of the two approaches, maintaining a clear separation between application and coordination layers but still providing a unified interface to both aspects. We advocate the Jupyter Notebooks’ potential to express complex distributed workflows, identifying the general requirements for a Jupyter-based Workflow Management System (WMS) and introducing a proof-of-concept portable implementation working on hybrid Cloud-HPC infrastructures. As a byproduct, we extended the vanilla IPython kernel with workflow-based parallel and distributed execution capabilities. The proposed Jupyter-workflow (Jw) system is evaluated on common scenarios for High Performance Computing (HPC) and Cloud, showing its potential in lowering the barriers between prototypical Notebooks and production-ready implementations

    A provenance-based semantic approach to support understandability, reproducibility, and reuse of scientific experiments

    Get PDF
    Understandability and reproducibility of scientific results are vital in every field of science. Several reproducibility measures are being taken to make the data used in the publications findable and accessible. However, there are many challenges faced by scientists from the beginning of an experiment to the end in particular for data management. The explosive growth of heterogeneous research data and understanding how this data has been derived is one of the research problems faced in this context. Interlinking the data, the steps and the results from the computational and non-computational processes of a scientific experiment is important for the reproducibility. We introduce the notion of end-to-end provenance management'' of scientific experiments to help scientists understand and reproduce the experimental results. The main contributions of this thesis are: (1) We propose a provenance modelREPRODUCE-ME'' to describe the scientific experiments using semantic web technologies by extending existing standards. (2) We study computational reproducibility and important aspects required to achieve it. (3) Taking into account the REPRODUCE-ME provenance model and the study on computational reproducibility, we introduce our tool, ProvBook, which is designed and developed to demonstrate computational reproducibility. It provides features to capture and store provenance of Jupyter notebooks and helps scientists to compare and track their results of different executions. (4) We provide a framework, CAESAR (CollAborative Environment for Scientific Analysis with Reproducibility) for the end-to-end provenance management. This collaborative framework allows scientists to capture, manage, query and visualize the complete path of a scientific experiment consisting of computational and non-computational steps in an interoperable way. We apply our contributions to a set of scientific experiments in microscopy research projects

    The workflow trace archive:Open-access data from public and private computing infrastructures

    Get PDF
    Realistic, relevant, and reproducible experiments often need input traces collected from real-world environments. In this work, we focus on traces of workflows - common in datacenters, clouds, and HPC infrastructures. We show that the state-of-the-art in using workflow-traces raises important issues: (1) the use of realistic traces is infrequent and (2) the use of realistic, open-access traces even more so. Alleviating these issues, we introduce the Workflow Trace Archive (WTA), an open-access archive of workflow traces from diverse computing infrastructures and tooling to parse, validate, and analyze traces. The WTA includes {>}48>48 million workflows captured from {>}10>10 computing infrastructures, representing a broad diversity of trace domains and characteristics. To emphasize the importance of trace diversity, we characterize the WTA contents and analyze in simulation the impact of trace diversity on experiment results. Our results indicate significant differences in characteristics, properties, and workflow structures between workload sources, domains, and fields

    AiiDA: Automated Interactive Infrastructure and Database for Computational Science

    Full text link
    Computational science has seen in the last decades a spectacular rise in the scope, breadth, and depth of its efforts. Notwithstanding this prevalence and impact, it is often still performed using the renaissance model of individual artisans gathered in a workshop, under the guidance of an established practitioner. Great benefits could follow instead from adopting concepts and tools coming from computer science to manage, preserve, and share these computational efforts. We illustrate here our paradigm sustaining such vision, based around the four pillars of Automation, Data, Environment, and Sharing. We then discuss its implementation in the open-source AiiDA platform (http://www.aiida.net), that has been tuned first to the demands of computational materials science. AiiDA's design is based on directed acyclic graphs to track the provenance of data and calculations, and ensure preservation and searchability. Remote computational resources are managed transparently, and automation is coupled with data storage to ensure reproducibility. Last, complex sequences of calculations can be encoded into scientific workflows. We believe that AiiDA's design and its sharing capabilities will encourage the creation of social ecosystems to disseminate codes, data, and scientific workflows.Comment: 30 pages, 7 figure
    corecore