
    Pingo: A Framework for the Management of Storage of Intermediate Outputs of Computational Workflows

    Scientific workflows allow scientists to model and express all of the steps of a data processing pipeline, typically as a directed acyclic graph (DAG). These workflows are made up of a collection of tasks that usually take a long time to compute and that produce a considerable amount of intermediate data. Because of the nature of scientific exploration, a workflow can be modified and re-run multiple times, and new workflows may be created that make use of past intermediate datasets. Storing intermediate datasets therefore has the potential to save computation time. Since storage is limited, a central problem is determining which intermediate datasets should be saved at creation time in order to minimize the computational time of the workflows to be run in the future. This thesis proposes the design and implementation of Pingo, a system that manages the computation of scientific workflows as well as the storage, provenance and deletion of intermediate datasets. Pingo uses the history of workflows submitted to the system to predict the datasets most likely to be needed in the future, and subjects the decision of dataset deletion to the optimization of the computational time of future workflows. Dissertation/Thesis, Masters Thesis, Computer Science 201
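
    The dataset-retention problem described above can be viewed as a cost-aware cache-eviction decision. The sketch below is a hypothetical illustration of that framing, not Pingo's actual algorithm: each intermediate dataset is scored by its estimated reuse probability (which a system like Pingo would derive from workflow-submission history) multiplied by the recomputation time it would save, per gigabyte of storage it occupies; when space runs out, the lowest-scoring datasets are deleted first. All class names, fields and the scoring formula are assumptions made for illustration.

```python
# Hypothetical sketch of history-driven intermediate-dataset retention;
# not Pingo's actual algorithm.
from dataclasses import dataclass

@dataclass
class IntermediateDataset:
    name: str
    size_gb: float            # storage footprint
    recompute_hours: float    # time to regenerate from upstream tasks
    reuse_probability: float  # estimated from workflow-submission history

def retention_score(d: IntermediateDataset) -> float:
    """Expected compute hours saved per GB of storage kept (illustrative metric)."""
    return (d.reuse_probability * d.recompute_hours) / d.size_gb

def select_for_deletion(datasets, capacity_gb):
    """Keep the highest-scoring datasets that fit; mark the rest for deletion."""
    ranked = sorted(datasets, key=retention_score, reverse=True)
    used, to_delete = 0.0, []
    for d in ranked:
        if used + d.size_gb <= capacity_gb:
            used += d.size_gb
        else:
            to_delete.append(d)
    return to_delete

if __name__ == "__main__":
    pool = [
        IntermediateDataset("alignment", 40.0, 6.0, 0.8),
        IntermediateDataset("raw_filtered", 200.0, 1.0, 0.1),
        IntermediateDataset("features", 15.0, 12.0, 0.6),
    ]
    for d in select_for_deletion(pool, capacity_gb=60.0):
        print("delete:", d.name)   # prints: delete: raw_filtered
```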

    Reusability Challenges of Scientific Workflows: A Case Study for Galaxy

    Scientific workflow has become essential in software engineering because it provides a structured approach to designing, executing, and analyzing scientific experiments. Software developers and researchers have developed hundreds of scientific workflow management systems so that scientists in various domains can benefit from them by automating repetitive tasks, enhancing collaboration, and ensuring the reproducibility of their results. However, even for expert users, workflow creation is a complex task due to the dramatic growth of tools and data heterogeneity. Scientists therefore attempt to reuse existing workflows shared in workflow repositories. Unfortunately, several challenges prevent scientists from reusing those workflows. In this study, we first attempted to identify those reusability challenges. We also offered an action list and evidence-based guidelines to promote the reusability of scientific workflows. Our intensive manual investigation examined the reusability of existing workflows and exposed several challenges. The challenges preventing reusability include tool upgrading, tool support unavailability, design flaws, incomplete workflows, failure to load a workflow, etc. These challenges and our action list offer guidelines to future workflow composers for creating better workflows with enhanced reusability. In the future, we plan to develop a recommender system built on reusable workflows that can assist scientists in creating effective and error-free workflows. Comment: Accepted in APSEC 202

    Automatic vs Manual Provenance Abstractions: Mind the Gap

    In recent years the need to simplify or to hide sensitive information in provenance has given rise to research on provenance abstraction. In the context of scientific workflows, existing research provides techniques to semi-automatically create abstractions of a given workflow description, which are in turn used as filters over the workflow's provenance traces. An alternative approach commonly adopted by scientists is to build workflows with abstractions embedded into the workflow's design, such as sub-workflows. This paper reports on a comparison of manual versus semi-automated approaches in a context where result abstractions are used to filter report-worthy results of computational scientific analyses. Specifically, we take a real-world workflow containing user-created design abstractions and compare these with abstractions created by the ZOOM UserViews and Workflow Summaries systems. Our comparison shows that the semi-automatic and manual approaches largely overlap from a process perspective, whereas there is a dramatic mismatch in terms of the data artefacts retained in an abstracted account of derivation. We discuss the reasons and suggest future research directions. Comment: Preprint accepted to the 2016 workshop on the Theory and Applications of Provenance, TAPP 201
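
    To make the notion of an embedded design abstraction concrete, the sketch below shows one simple way a user-defined grouping (such as a sub-workflow) could be used to filter a provenance trace: every node inside the group is collapsed into a single composite node, and edges internal to the group are dropped. This is a hypothetical illustration, not the abstraction procedure of ZOOM UserViews or Workflow Summaries; all function and node names are assumptions.

```python
# Hypothetical sketch: collapse a group of provenance nodes into one abstract node.
# The grouping is assumed to come from the workflow design (e.g. a sub-workflow).

def abstract_provenance(edges, group, label):
    """edges: iterable of (source, target) pairs in a provenance DAG.
    group: set of node names to hide behind a single composite node `label`."""
    abstracted = set()
    for src, dst in edges:
        s = label if src in group else src
        t = label if dst in group else dst
        if s != t:                      # drop edges internal to the group
            abstracted.add((s, t))
    return sorted(abstracted)

trace = [("fetch", "align"), ("align", "filter"), ("filter", "report")]
print(abstract_provenance(trace, {"align", "filter"}, "preprocess"))
# [('fetch', 'preprocess'), ('preprocess', 'report')]
```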

    EPiK-a Workflow for Electron Tomography in Kepler.

    Scientific workflows integrate data and computing interfaces as configurable, semi-automatic graphs to solve a scientific problem. Kepler is such a software system for designing, executing, reusing, evolving, archiving and sharing scientific workflows. Electron tomography (ET) enables high-resolution views of complex cellular structures, such as cytoskeletons, organelles, viruses and chromosomes. Imaging investigations produce large datasets. For instance, in electron tomography the size of a 16-fold image tilt series is about 65 gigabytes, with each projection image containing 4096 × 4096 pixels. When serial sections or montage techniques are used for large-field ET, the dataset becomes even larger. For higher-resolution images with multiple tilt series, the data size may be in the terabyte range. The demands of mass data processing and complex algorithms require the integration of diverse codes into flexible software structures. This paper describes a workflow for Electron Tomography Programs in Kepler (EPiK). The EPiK workflow embeds the tracking process of IMOD and realizes the main algorithms, including filtered backprojection (FBP) from TxBR and iterative reconstruction methods. We have tested the three-dimensional (3D) reconstruction process using EPiK on ET data. EPiK can be a potential toolkit for biology researchers, with the advantages of logical viewing, easy handling, convenient sharing and future extensibility.
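
    For readers unfamiliar with the reconstruction step mentioned above, the sketch below illustrates filtered backprojection for the simplest 2D parallel-beam case using NumPy: each projection is ramp-filtered in the Fourier domain and then smeared back across the image along its acquisition angle. This is an illustrative toy, not the TxBR or IMOD implementation used in EPiK, and it ignores the curvilinear geometry and large data volumes of real electron tomography.

```python
import numpy as np

def filtered_backprojection(sinogram, angles_deg):
    """Minimal 2D parallel-beam FBP: ramp-filter each projection, then
    backproject it across the image along its acquisition angle (toy sketch)."""
    n_angles, n_det = sinogram.shape
    # Ramp filter applied along the detector axis in the Fourier domain.
    freqs = np.fft.fftfreq(n_det)
    filtered = np.real(np.fft.ifft(np.fft.fft(sinogram, axis=1) * np.abs(freqs), axis=1))

    # Backproject onto an n_det x n_det grid centred at the origin.
    recon = np.zeros((n_det, n_det))
    coords = np.arange(n_det) - n_det / 2
    xx, yy = np.meshgrid(coords, coords)
    for proj, theta in zip(filtered, np.deg2rad(angles_deg)):
        # Detector coordinate of each pixel for this projection angle.
        t = xx * np.cos(theta) + yy * np.sin(theta) + n_det / 2
        recon += np.interp(t.ravel(), np.arange(n_det), proj).reshape(n_det, n_det)
    return recon * np.pi / (2 * len(angles_deg))

# Toy usage with a random sinogram (no real data): 60 angles, 128 detector bins.
angles = np.linspace(0.0, 180.0, 60, endpoint=False)
image = filtered_backprojection(np.random.rand(60, 128), angles)
print(image.shape)  # (128, 128)
```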

    The Evolution of myExperiment

    The myExperiment social website for sharing scientific workflows, designed according to Web 2.0 principles, has grown to be the largest public repository of its kind. It is distinctive for its focus on sharing methods, its researcher-centric design and its facility to aggregate content into shareable 'research objects'. This evolution of myExperiment has occurred hand in hand with its users. myExperiment now supports Linked Data as a step toward our vision of the future research environment, which we categorise here as '3rd generation e-Research'.

    The Sciences of Data – Moving Towards a Comprehensive Systems Perspective

    Data science's rapid development in a dynamically growing data environment endows it with unique characteristics among scientific disciplines, combining challenges typically encountered in theoretical as well as empirical sciences. This raises questions as to the identification of the most pressing problems for data science, as well as to what constitutes its theoretical foundations. In this contribution, we first describe data science from the perspective of the philosophy of science. We argue that the current mode of development of data science is adequately described by what we term the differentiational-expansionist mode. This leads us to conclude that data science concerns the acquisition of scientific theories relating to the application of methods, workflows and algorithms that generate value for users, which we term the integrative view. This definition emphasizes the interdependent nature of human and algorithmic elements in complex data workflows. We then offer four challenges for the future of the field. We conclude that since full control of entire data workflows is unfeasible, attention should be redirected towards the creation of an infrastructure by which data workflows will self-organize in a useful manner.

    Progress and prospects for accelerating materials science with automated and autonomous workflows

    Accelerating materials research by integrating automation with artificial intelligence is increasingly recognized as a grand scientific challenge for discovering and developing materials for emerging and future technologies. While the solid-state materials science community has demonstrated a broad range of high-throughput methods and effectively leveraged computational techniques to accelerate individual research tasks, revolutionary acceleration of materials discovery has yet to be fully realized. This perspective review presents a framework and ontology to outline a materials experiment lifecycle and to visualize materials discovery workflows, providing a context for mapping the realized levels of automation and the next generation of autonomous loops in terms of scientific and automation complexity. Expanding autonomous loops to encompass larger portions of complex workflows will require the integration of a range of experimental techniques as well as the automation of expert decisions, including subtle reasoning about data quality, responses to unexpected data, and model design. Recent demonstrations of workflows that integrate multiple techniques and include autonomous loops, combined with emerging advances in artificial intelligence and high-throughput experimentation, signal the imminence of a revolution in materials discovery.
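
    As a rough illustration of what an autonomous loop entails, the sketch below couples a simulated experiment with a simple propose-measure-update cycle: a selection rule proposes the next candidate composition, an automated experiment measures it, and the accumulated observations steer the next proposal. It is a hypothetical toy, not a description of any system in the review; the objective function, the acquisition rule and all names are assumptions.

```python
import random

# Hypothetical sketch of an autonomous materials-discovery loop:
# propose a candidate, run a (simulated) experiment, update the record, repeat.

def run_experiment(composition: float) -> float:
    """Stand-in for an automated synthesis + characterisation step (simulated)."""
    return -(composition - 0.7) ** 2 + random.gauss(0, 0.01)

def propose_next(observations, candidates):
    """Greedy-with-jitter acquisition rule: explore at random when nothing is
    known, otherwise sample near the best composition measured so far."""
    if not observations:
        return random.choice(candidates)
    best_x, _ = max(observations, key=lambda xy: xy[1])
    return min(candidates, key=lambda x: abs(x - best_x) + random.uniform(0, 0.1))

observations = []
candidates = [i / 100 for i in range(101)]
for _ in range(20):                     # autonomous loop: propose, measure, update
    x = propose_next(observations, candidates)
    observations.append((x, run_experiment(x)))

print("best composition found:", max(observations, key=lambda xy: xy[1])[0])
```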