Scientific workflow execution reproducibility using cloud-aware provenance
Scientific experiments and projects such as CMS and neuGRIDforYou (N4U) produce data on the order of petabytes annually. They adopt scientific workflows to analyse this large amount of data and extract meaningful information. These workflows are executed over distributed compute and storage resources provided by the Grid and, more recently, by the Cloud. The Cloud is becoming the playing field for scientists as it provides scalability and on-demand resource provisioning. Reproducing a workflow execution to verify results is vital for scientists, yet it has proven to be a challenge: according to one study (Belhajjame et al. 2012), around 80% of workflows cannot be reproduced, and 12% of these failures are due to missing information about the execution environment. The dynamic, on-demand provisioning capability of the Cloud makes this even more challenging. To overcome these challenges, this research investigates how to capture the execution provenance of a scientific workflow along with the resources used to execute it in a Cloud infrastructure. This information then enables a scientist to reproduce workflow-based scientific experiments on the Cloud by re-provisioning similar resources.

Provenance has been recognised as information that helps in debugging, verifying and reproducing a scientific workflow execution. The recent adoption of Cloud-based scientific workflows presents an opportunity to investigate the suitability of existing approaches, or to propose new ones, for collecting provenance information from the Cloud and using it for workflow reproducibility on the Cloud. A literature analysis found that existing approaches for the Grid or the Cloud neither provide detailed resource information nor offer an automatic provenance-capturing approach for the Cloud environment. To mitigate these challenges and fill this knowledge gap, a provenance-based approach, ReCAP, is proposed in this thesis. In ReCAP, workflow execution reproducibility is achieved by (a) capturing the Cloud-aware provenance (CAP), (b) re-provisioning similar resources on the Cloud and re-executing the workflow on them, and (c) comparing the provenance graph structure, including the Cloud resource information, and the outputs of the workflows. ReCAP captures the Cloud resource information and links it with the workflow provenance to generate Cloud-aware provenance, which consists of hardware and software configuration parameters describing a resource on the Cloud. Once captured, this information aids in re-provisioning the same execution infrastructure on the Cloud for workflow re-execution. Since resources on the Cloud can be used statically or dynamically (i.e. destroyed when a task finishes), provenance capturing must handle both scenarios; this thesis therefore presents different capturing and mapping approaches. These mapping approaches work outside the virtual machine and collect resource information from the Cloud middleware, so they do not affect job performance. The impact of the collected Cloud resource information on job as well as workflow execution has been evaluated through various experiments in this thesis. In ReCAP, workflow reproducibility is verified by comparing the provenance graph structure, the infrastructure details and the output produced by the workflows.
To compare the provenance graphs, the captured provenance information, including infrastructure details, is translated to a graph model. The graphs of the original and reproduced executions are then compared to analyse their similarity. Two comparison approaches are presented that produce both a qualitative and a quantitative analysis of the graph structure. The ReCAP framework and its constituent components are evaluated using scientific workflows such as ReconAll and Montage, from the domains of neuroscience (i.e. N4U) and astronomy respectively. The results show that ReCAP can capture the Cloud-aware provenance and demonstrate workflow execution reproducibility by re-provisioning the same resources on the Cloud. The results also demonstrate that the provenance comparison approaches can determine the similarity between two given provenance graphs, and that the output comparison approach is suitable for comparing the outputs of scientific workflows, especially deterministic ones.
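As an illustration of the comparison step described above, the following minimal Python sketch builds a provenance graph annotated with Cloud resource parameters and compares two such graphs both quantitatively (edge-set similarity) and qualitatively (differing edges and mismatched VM parameters). The graph model, node names and the Jaccard metric are illustrative assumptions, not ReCAP's actual implementation.

import networkx as nx

def build_provenance_graph(jobs):
    """Each job is a dict {'id', 'inputs', 'outputs', 'vm'}, where 'vm'
    holds assumed Cloud resource parameters (flavour, vCPUs, RAM)."""
    g = nx.DiGraph()
    for job in jobs:
        g.add_node(job['id'], kind='job', **job['vm'])
        for f in job['inputs']:
            g.add_edge(f, job['id'])    # file used by job
        for f in job['outputs']:
            g.add_edge(job['id'], f)    # file generated by job
    return g

def compare(g1, g2):
    """Quantitative: Jaccard similarity over edge sets.
    Qualitative: differing edges and mismatched VM parameters."""
    e1, e2 = set(g1.edges()), set(g2.edges())
    jaccard = len(e1 & e2) / len(e1 | e2) if e1 | e2 else 1.0
    vm_diffs = {n: (g1.nodes[n], g2.nodes[n])
                for n in set(g1.nodes) & set(g2.nodes)
                if g1.nodes[n] != g2.nodes[n]}   # e.g. different flavour
    return jaccard, e1 ^ e2, vm_diffs

jobs = [{'id': 'mProject_1', 'inputs': ['raw.fits'], 'outputs': ['proj.fits'],
         'vm': {'flavour': 'm1.small', 'vcpus': 1, 'ram_mb': 2048}}]
g = build_provenance_graph(jobs)
print(compare(g, g))   # identical graphs -> (1.0, set(), {})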
Scientific Workflow Repeatability through Cloud-Aware Provenance
The transformations, analyses and interpretations of data in scientific workflows are vital for their repeatability and reliability. Capturing this provenance has been done effectively in Grid-based scientific workflow systems. However, the recent adoption of Cloud-based scientific workflows presents an opportunity to investigate the suitability of existing approaches, or to propose new ones, for collecting provenance information from the Cloud and using it for workflow repeatability in the Cloud infrastructure. The dynamic nature of the Cloud makes this difficult in comparison to the Grid, because resources are provisioned on demand. This paper presents a novel approach that can assist in mitigating this challenge. The approach collects Cloud infrastructure information along with workflow provenance and establishes a mapping between them; this mapping is later used to re-provision resources on the Cloud. Workflow execution repeatability is achieved by (a) capturing the Cloud infrastructure information (virtual machine configuration) along with the workflow provenance, and (b) re-provisioning similar resources on the Cloud and re-executing the workflow on them. The evaluation of an initial prototype suggests that the proposed approach is feasible and merits further investigation.

Comment: 6 pages, 5 figures, 3 tables. In Proceedings of the Recomputability 2014 workshop of the 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014), London, December 2014.
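A hedged sketch of the capture-and-map step the paper describes: pull virtual machine configuration from the Cloud middleware (here assumed to be an EC2-style API queried via boto3) and join it to the workflow engine's job records. The host-IP join key and the job record layout are assumptions for illustration; the paper's prototype targets its own middleware.

import boto3

def capture_vm_configurations(region='eu-west-2'):
    """Return {private_ip: configuration} for all instances visible
    to the credentials in use."""
    ec2 = boto3.client('ec2', region_name=region)
    vms = {}
    for reservation in ec2.describe_instances()['Reservations']:
        for inst in reservation['Instances']:
            vms[inst.get('PrivateIpAddress')] = {
                'instance_type': inst['InstanceType'],   # flavour
                'image_id': inst['ImageId'],             # machine image
                'zone': inst['Placement']['AvailabilityZone'],
            }
    return vms

def map_jobs_to_vms(job_records, vms):
    """Join workflow provenance to Cloud provenance on the host IP that
    the workflow engine recorded for each job (assumed field names)."""
    return {job['id']: dict(job, vm=vms.get(job['host_ip']))
            for job in job_records}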
Reproducibility of scientific workflows execution using cloud-aware provenance (ReCAP)
© 2018, Springer-Verlag GmbH Austria, part of Springer Nature. Provenance of scientific workflows has been considered a means to provide workflow reproducibility. However, the provenance approaches adopted so far are not applicable in the Cloud context because the provenance trace lacks Cloud information. This paper presents a novel approach that collects Cloud-aware provenance and represents it as a graph. Workflow execution reproducibility on the Cloud is determined by comparing workflow provenance at three levels: workflow structure, execution infrastructure and workflow outputs. The experimental evaluation shows that the implemented approach can detect changes in the provenance traces and in the outputs produced by the workflow.
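The third comparison level, workflow outputs, can be pictured with a small sketch like the one below: hash each output file of the original and reproduced runs and diff the digests. File-level MD5 is an assumed choice here, and, as the abstract notes, such byte-wise comparison suits deterministic workflows best.

import hashlib
import pathlib

def digest_outputs(outdir):
    """Map each output file name to the MD5 digest of its contents."""
    return {p.name: hashlib.md5(p.read_bytes()).hexdigest()
            for p in pathlib.Path(outdir).iterdir() if p.is_file()}

def outputs_match(original_dir, reproduced_dir):
    a, b = digest_outputs(original_dir), digest_outputs(reproduced_dir)
    changed = {name for name in a.keys() & b.keys() if a[name] != b[name]}
    # Identical only if the file sets agree and no shared file changed.
    return a.keys() == b.keys() and not changed, changed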
Sharing and Preserving Computational Analyses for Posterity with encapsulator
Open data and open-source software may be part of the solution to science's "reproducibility crisis", but they are insufficient to guarantee reproducibility. Requiring minimal end-user expertise, encapsulator creates a "time capsule" with reproducible code in a self-contained computational environment, providing end-users with a fully-featured desktop environment for reproducible research.

Comment: 11 pages, 6 figures.
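The "time capsule" idea can be illustrated with a hypothetical sketch (this is not encapsulator's own interface): copy the analysis script into a capsule directory together with a frozen record of the interpreter and package versions, so a later run can detect environment drift.

import json, pathlib, platform, shutil, subprocess, sys

def make_capsule(script, capsule_dir='capsule'):
    """Copy the analysis script and a frozen environment record into a
    self-contained directory (illustrative only)."""
    out = pathlib.Path(capsule_dir)
    out.mkdir(exist_ok=True)
    shutil.copy(script, out / pathlib.Path(script).name)
    freeze = subprocess.run([sys.executable, '-m', 'pip', 'freeze'],
                            capture_output=True, text=True, check=True)
    (out / 'environment.json').write_text(json.dumps({
        'python': platform.python_version(),
        'packages': freeze.stdout.splitlines(),   # pinned package versions
    }, indent=2))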
Automatic deployment and reproducibility of workflow on the Cloud using container virtualization
PhD Thesis. Cloud computing is a service-oriented approach to distributed computing that has many attractive features, including on-demand access to large compute resources. One type of cloud application is the scientific workflow, which is playing an increasingly important role in building applications from heterogeneous components. Workflows are increasingly used in science as a means to capture, share, and publish computational analysis. Clouds can offer a number of benefits to workflow systems, including the dynamic provisioning of the resources needed for computation and storage, which has the potential to dramatically increase the ability to quickly extract new results from the huge amounts of data now being collected.

However, there is an increasing number of Cloud computing platforms, each with different functionality and interfaces. It therefore becomes increasingly challenging to define workflows in a portable way so that they can be run reliably on different clouds. As a consequence, workflow developers face the problem of deciding which Cloud to select and, more importantly for the long term, how to avoid vendor lock-in.

A further issue that has arisen with workflows is that it is common for them to stop being executable a relatively short time after they were created. This can be due to the external resources required to execute a workflow, such as data and services, becoming unavailable. It can also be caused by changes in the execution environment on which the workflow depends, such as changes to a library causing an error when a workflow service is executed. This "workflow decay" issue is recognised as an impediment to the reuse of workflows and the reproducibility of their results. It is becoming a major problem, as the reproducibility of science is increasingly dependent on the reproducibility of scientific workflows.

In this thesis we present new solutions to address these challenges. We propose a new approach to workflow modelling that offers a portable and re-usable description of the workflow using the TOSCA specification language. Our approach addresses portability by allowing workflow components to be systematically specified and automatically deployed on a range of clouds, or in local computing environments, using container virtualisation techniques.

To address the issues of reproducibility and workflow decay, our modelling and deployment approach has also been integrated with source control and container management techniques to create a new framework that efficiently supports dynamic workflow deployment, (re-)execution and reproducibility.

To improve deployment performance, we extend the framework with a number of new optimisation techniques and evaluate their effect on a range of real and synthetic workflows.

Ministry of Higher Education and Scientific Research in Iraq and Mosul University.
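The modelling-plus-deployment idea can be sketched as follows: a simplified, TOSCA-style YAML description of workflow components (the schema below is an illustration, not the TOSCA specification itself) is deployed with container virtualisation via the Docker SDK for Python. Image names and commands are placeholders.

import docker
import yaml

SPEC = """
components:
  - name: extract
    image: alpine:3.19
    command: echo extracting
  - name: transform
    image: alpine:3.19
    command: echo transforming
"""

def deploy(spec_text):
    """Run each described workflow component as its own container, so the
    same description deploys on any cloud (or locally) that hosts Docker."""
    spec = yaml.safe_load(spec_text)
    client = docker.from_env()
    return [client.containers.run(comp['image'], comp['command'],
                                  name=comp['name'], detach=True)
            for comp in spec['components']]

containers = deploy(SPEC)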
The Research Object Suite of Ontologies: Sharing and Exchanging Research Data and Methods on the Open Web
Research in life sciences is increasingly being conducted in a digital and online environment. In particular, life scientists have been pioneers in embracing new computational tools to conduct their investigations. To support the sharing of digital objects produced during such research investigations, we have witnessed in the last few years the emergence of specialized repositories, e.g., DataVerse and FigShare. Such repositories provide users with the means to share and publish datasets that were used or generated in research investigations. While these repositories have proven their usefulness, interpreting and reusing evidence for most research results is a challenging task. Additional contextual descriptions are needed to understand how those results were generated and/or the circumstances under which they were concluded. Because of this, scientists are calling for models that go beyond the publication of datasets to systematically capture the life cycle of scientific investigations and provide a single entry point to access the information about the hypothesis investigated, the datasets used, the experiments carried out, the results of the experiments, the people involved in the research, etc. In this paper we present the Research Object (RO) suite of ontologies, which provide a structured container to encapsulate research data and methods along with essential metadata descriptions. Research Objects are portable units that enable the sharing, preservation, interpretation and reuse of research investigation results. The ontologies we present have been designed in the light of requirements that we gathered from life scientists. They have been built upon existing popular vocabularies to facilitate interoperability. Furthermore, we have developed tools to support the creation and sharing of Research Objects, thereby promoting and facilitating their adoption.

Comment: 20 pages.
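A hedged sketch of what a minimal Research Object description might look like when built with rdflib; the namespace URIs follow the wf4ever RO suite as commonly published, and the aggregated resources are made-up examples.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

RO = Namespace('http://purl.org/wf4ever/ro#')
ORE = Namespace('http://www.openarchives.org/ore/terms/')

g = Graph()
ro_uri = URIRef('http://example.org/ro/my-investigation/')
g.add((ro_uri, RDF.type, RO.ResearchObject))
g.add((ro_uri, DCTERMS.creator, Literal('A. Scientist')))
for path in ('data/results.csv', 'workflow/analysis.t2flow'):
    resource = URIRef(ro_uri + path)
    g.add((resource, RDF.type, RO.Resource))
    g.add((ro_uri, ORE.aggregates, resource))   # ORE aggregation link
print(g.serialize(format='turtle'))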
The Application of Cloud Computing to the Creation of Image Mosaics and Management of Their Provenance
We have used the Montage image mosaic engine to investigate the cost and performance of processing images on the Amazon EC2 cloud, and to inform the requirements that higher-level products impose on provenance management technologies. We will present a detailed comparison of the performance of Montage on the cloud and on the Abe high-performance cluster at the National Center for Supercomputing Applications (NCSA). Because Montage generates many intermediate products, we have used it to understand the science requirements that higher-level products impose on provenance management technologies. We describe experiments with provenance management technologies such as the "Provenance Aware Service Oriented Architecture" (PASOA).

Comment: 15 pages, 3 figures.