11 research outputs found
Scientific workflow execution reproducibility using cloud-aware provenance
Scientific experiments and projects such as CMS and neuGRIDforYou (N4U) annually produce data on the order of petabytes. They adopt scientific workflows to analyse this large amount of data in order to extract meaningful information. These workflows are executed over distributed compute and storage resources provided by the Grid and, more recently, by the Cloud. The Cloud is becoming the playing field for scientists as it provides scalability and on-demand resource provisioning. Reproducing a workflow execution to verify results is vital for scientists and has proven to be a challenge. According to one study (Belhajjame et al. 2012), around 80% of workflows cannot be reproduced, and 12% of these cases are due to the lack of information about the execution environment. The dynamic and on-demand provisioning capability of the Cloud makes this more challenging. To overcome these challenges, this research aims to investigate how to capture the execution provenance of a scientific workflow along with the resources used to execute the workflow in a Cloud infrastructure. This information will then enable a scientist to reproduce workflow-based scientific experiments on the Cloud infrastructure by re-provisioning similar resources on the Cloud.
Provenance has been recognised as information that helps in debugging, verifying and reproducing a scientific workflow execution. The recent adoption of Cloud-based scientific workflows presents an opportunity to investigate the suitability of existing approaches or to propose new approaches to collect provenance information from the Cloud and to utilize it for workflow reproducibility on the Cloud. The literature analysis found that the existing approaches for the Grid or the Cloud neither provide detailed resource information nor present an automatic provenance capturing approach for the Cloud environment.
To mitigate these challenges and fill the knowledge gap, a provenance-based approach, ReCAP, has been proposed in this thesis. In ReCAP, workflow execution reproducibility is achieved by (a) capturing the Cloud-aware provenance (CAP), (b) re-provisioning similar resources on the Cloud and re-executing the workflow on them, and (c) comparing the provenance graph structure, including the Cloud resource information, and the outputs of workflows. ReCAP captures the Cloud resource information and links it with the workflow provenance to generate Cloud-aware provenance. The Cloud-aware provenance consists of configuration parameters relating to the hardware and software that describe a resource on the Cloud. Once captured, this information aids in re-provisioning the same execution infrastructure on the Cloud for workflow re-execution. Since resources on the Cloud can be used in a static or dynamic (i.e. destroyed when a task is finished) manner, this presents a challenge for the devised provenance capturing approach. In order to deal with these scenarios, different capturing and mapping approaches have been presented in this thesis. These mapping approaches work outside the virtual machine and collect resource information from the Cloud middleware, so they do not affect job performance. The impact of the collected Cloud resource information on the job as well as on the workflow execution has been evaluated through various experiments in this thesis. In ReCAP, workflow reproducibility is verified by comparing the provenance graph structure, the infrastructure details and the output produced by the workflows. To compare the provenance graphs, the captured provenance information, including infrastructure details, is translated to a graph model. The graphs of the original execution and the reproduced execution are then compared in order to analyse their similarity.
In this regard, two comparison approaches have been presented that can produce both a qualitative and a quantitative analysis of the graph structure. The ReCAP framework and its constituent components are evaluated using different scientific workflows such as ReconAll and Montage from the domains of neuroscience (i.e. N4U) and astronomy respectively. The results have shown that ReCAP has been able to capture the Cloud-aware provenance and demonstrate workflow execution reproducibility by re-provisioning the same resources on the Cloud. The results have also demonstrated that the provenance comparison approaches can determine the similarity between two given provenance graphs. The results of the workflow output comparison have shown that this approach is suitable for comparing the outputs of scientific workflows, especially for deterministic workflows.
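As a minimal sketch of what a combined qualitative and quantitative comparison of two provenance graphs might look like, the Python below represents each graph as a set of labelled edges, reports a Jaccard similarity score, and lists the edges that differ. The edge representation, the `compare_provenance` function, and the sample job and VM labels are illustrative assumptions; they are not the comparison algorithms defined in the thesis.

```python
from typing import NamedTuple

# Assumed representation: an edge is (source node, relation, target node).
Edge = tuple[str, str, str]


class GraphDiff(NamedTuple):
    similarity: float          # quantitative view: Jaccard index of the edge sets
    only_in_original: set      # qualitative view: edges missing from the re-run
    only_in_reproduced: set    # qualitative view: edges new in the re-run


def compare_provenance(original: set[Edge], reproduced: set[Edge]) -> GraphDiff:
    """Compare two provenance graphs given as sets of labelled edges."""
    union = original | reproduced
    common = original & reproduced
    similarity = len(common) / len(union) if union else 1.0
    return GraphDiff(similarity, original - reproduced, reproduced - original)


# Hypothetical graphs: the re-execution ran on a differently sized VM.
original = {
    ("job1", "used", "input.img"),
    ("job1", "executedOn", "vm:2cpu-4GB"),
    ("output.img", "wasGeneratedBy", "job1"),
}
reproduced = {
    ("job1", "used", "input.img"),
    ("job1", "executedOn", "vm:1cpu-2GB"),
    ("output.img", "wasGeneratedBy", "job1"),
}

diff = compare_provenance(original, reproduced)
print(diff.similarity)
print(diff.only_in_original)
```

Because the infrastructure details are stored as ordinary edges in this sketch, a change in the VM configuration lowers the score in the same way as a change in the workflow structure.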
Scientific Workflow Repeatability through Cloud-Aware Provenance
The transformations, analyses and interpretations of data in scientific
workflows are vital for the repeatability and reliability of scientific
workflows. This provenance of scientific workflows has been effectively carried
out in Grid-based scientific workflow systems. However, the recent adoption of
Cloud-based scientific workflows presents an opportunity to investigate the
suitability of existing approaches or propose new approaches to collect
provenance information from the Cloud and to utilize it for workflow
repeatability in the Cloud infrastructure. The dynamic nature of the Cloud in
comparison to the Grid makes it difficult because resources are provisioned
on-demand unlike the Grid. This paper presents a novel approach that can assist
in mitigating this challenge. This approach can collect Cloud infrastructure
information along with workflow provenance and can establish a mapping between
them. This mapping is later used to re-provision resources on the Cloud. The
repeatability of the workflow execution is performed by: (a) capturing the
Cloud infrastructure information (virtual machine configuration) along with the
workflow provenance, and (b) re-provisioning similar resources on the Cloud
and re-executing the workflow on them. The evaluation of an initial prototype
suggests that the proposed approach is feasible and can be investigated
further.
Comment: 6 pages; 5 figures; 3 tables; in Proceedings of the Recomputability
2014 workshop of the 7th IEEE/ACM International Conference on Utility and
Cloud Computing (UCC 2014). London December 201
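The mapping between workflow provenance and Cloud infrastructure information described in this abstract can be pictured as a join between job records and virtual machine records. The Python sketch below is hypothetical: the join key (the host IP address), the field names (`flavor`, `vcpus`, `ram_mb`, `image`), and the helpers `build_mapping` and `reprovision_request` are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VMConfig:
    """Assumed subset of the VM details obtainable from the Cloud middleware."""
    ip: str
    flavor: str
    vcpus: int
    ram_mb: int
    image: str


def build_mapping(jobs: list[dict], vms: list[VMConfig]) -> dict[str, VMConfig]:
    """Link each job to the VM it ran on, matching on the host IP address."""
    by_ip = {vm.ip: vm for vm in vms}
    return {
        job["job_id"]: by_ip[job["host_ip"]]
        for job in jobs
        if job["host_ip"] in by_ip  # jobs with no known VM are left unmapped
    }


def reprovision_request(vm: VMConfig) -> dict:
    """Build the parameters needed to request a similar VM for re-execution."""
    request = asdict(vm)
    request.pop("ip")  # the address is assigned afresh, so it is not requested
    return request


# Hypothetical provenance records and middleware records.
jobs = [
    {"job_id": "j1", "host_ip": "10.0.0.5"},
    {"job_id": "j2", "host_ip": "10.0.0.9"},
]
vms = [VMConfig("10.0.0.5", "m1.small", 1, 2048, "ubuntu-base")]

mapping = build_mapping(jobs, vms)
print(reprovision_request(mapping["j1"]))
```

Job `j2` stays unmapped here because no VM record matches its host, which is the kind of gap that arises when a resource has already been destroyed by the time the information is collected.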
Provision of an integrated data analysis platform for computational neuroscience experiments
© Emerald Group Publishing Limited. Purpose – The purpose of this paper is to provide an integrated analysis base to facilitate computational neuroscience experiments, following a user-led approach to provide access to the integrated neuroscience data and to enable the analyses demanded by the biomedical research community. Design/methodology/approach – The design and development of the N4U analysis base and related information services address the existing research and practical challenges by offering an integrated medical data analysis environment with the necessary building blocks for neuroscientists to optimally exploit neuroscience workflows, large image data sets and algorithms to conduct analyses. Findings – The provision of an integrated e-science environment of computational neuroimaging can enhance the prospects, speed and utility of the data analysis process for neurodegenerative diseases. Originality/value – The N4U analysis base enables biomedical data analyses to be conducted by indexing and interlinking the neuroimaging and clinical study data sets stored on the grid infrastructure, algorithms and scientific workflow definitions along with their associated provenance information.
Frequency of Disc Degeneration at Different Levels of Cervical Vertebrae in Adult Patients with Neck Pain on Magnetic Resonance Imaging
Background: Disc degeneration is the terminology used for heterogeneous changes affecting the anatomy and physiology of the intervertebral disc. Disc degeneration alters the material properties of the intervertebral disc, leading to an unfavorable distribution and transmission of stress to adjacent spinal structures. Objective: The aim of the study was to determine the frequency of disc degeneration at different levels of the cervical vertebrae in adult patients with neck pain on magnetic resonance imaging. Methodology: In this descriptive study, 180 adult patients were included. All patients were drawn from DHQ Hospital Gilgit and Ghurki Trust Teaching Hospital. After informed consent, data were collected using 1.5 tesla GE (closed bore) and 0.35 tesla Hitachi (open bore) MRI machines. Results: Findings show that among 180 adult patients, 136 presented with disc degeneration, of whom 81 were males and 55 were females. Among the 81 males, 63 had disc degeneration at multiple levels while 18 had single disc degeneration. Among the females, 35 patients showed multiple disc degeneration while 20 involved a single disc. Conclusion: It is concluded that disc degeneration is more prevalent in males than in females. Disc degeneration at multiple levels is more frequent than single disc degeneration in both genders. Keywords: Disc degeneration, magnetic resonance imaging, intervertebral disc. DOI: 10.7176/JHMN/71-02 Publication date: February 29th 202
Development of a large-scale neuroimages and clinical variables data atlas in the neuGRID4You (N4U) project
© 2015 Elsevier Inc. Exceptional growth in the availability of large-scale clinical imaging datasets has led to the development of computational infrastructures that offer scientists access to image repositories and associated clinical variables data. The EU FP7 neuGRID and its follow-on neuGRID4You (N4U) projects provide a leading e-Infrastructure where neuroscientists can find core services and resources for brain image analysis. The core component of this e-Infrastructure is the N4U Virtual Laboratory, which offers easy access for neuroscientists to a wide range of datasets and algorithms, pipelines, computational resources, services, and associated support services. The foundation of this virtual laboratory is a massive data store plus a set of Information Services collectively called the 'Data Atlas'. This data atlas stores datasets, clinical study data, data dictionaries and algorithm/pipeline definitions, and provides interfaces for parameterised querying so that neuroscientists can perform analyses on required datasets. This paper presents the overall design and development of the Data Atlas and its associated dataset indexing and retrieval services, which originated from the development of the N4U Virtual Laboratory in the EU FP7 N4U project in the light of detailed user requirements.
CMS workflow execution using intelligent job scheduling and data access strategies
Complex scientific workflows can process large amounts of data using thousands of tasks. The turnaround times of these workflows are often affected by various latencies such as the resource discovery, scheduling and data access latencies for the individual workflow processes or actors. Minimizing these latencies will improve the overall execution time of a workflow and thus lead to a more efficient and robust processing environment. In this paper, we propose a pilot job concept that has intelligent data reuse and job execution strategies to minimize the scheduling, queuing, execution and data access latencies. The results have shown that significant improvements in the overall turnaround time of a workflow can be achieved with this approach. The proposed approach has been evaluated, first using the CMS Tier0 data processing workflow, and then simulating the workflows to evaluate its effectiveness in a controlled environment. © 2011 IEEE
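One way to picture a data reuse strategy inside a pilot job is a pilot that prefers queued jobs whose input files it already holds locally, so that fewer files need to be fetched. The Python sketch below illustrates that idea only; the selection rule (largest fraction of cached inputs), the `Pilot` class, and the job fields are assumptions and do not describe the strategies evaluated in the paper.

```python
def pick_next_job(pending: list[dict], cached_files: set[str]):
    """Return the pending job with the largest fraction of inputs already cached."""
    def reuse_fraction(job: dict) -> float:
        inputs = set(job["inputs"])
        return len(inputs & cached_files) / len(inputs) if inputs else 0.0

    # max() keeps the earliest job on ties, so queue order is the tie-breaker.
    return max(pending, key=reuse_fraction) if pending else None


class Pilot:
    """Hypothetical pilot that keeps fetched input files in a local cache."""

    def __init__(self) -> None:
        self.cache: set[str] = set()

    def run(self, queue: list[dict]) -> list[str]:
        """Drain the queue, returning job ids in the order they were executed."""
        pending = list(queue)
        executed = []
        while pending:
            job = pick_next_job(pending, self.cache)
            pending.remove(job)
            self.cache |= set(job["inputs"])  # inputs stay cached after the run
            executed.append(job["id"])
        return executed


# Hypothetical queue: job C shares two of its three inputs with job A.
queue = [
    {"id": "A", "inputs": ["f1", "f2"]},
    {"id": "B", "inputs": ["f3"]},
    {"id": "C", "inputs": ["f1", "f2", "f4"]},
]
order = Pilot().run(queue)
print(order)
```

With an empty cache the pilot starts with the first queued job, then moves job C ahead of job B because most of C's inputs are already local.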
DIANA Scheduling Hierarchies for Optimizing Bulk Job Scheduling
The use of meta-schedulers for resource management in large-scale distributed systems often leads to a hierarchy of schedulers. In this paper, we discuss why existing meta-scheduling hierarchies are sometimes not sufficient for Grid systems due to their inability to re-organise jobs already scheduled locally. Such job re-organisation is required to adapt to evolving loads, which are common in heavily used Grid infrastructures. We propose a peer-to-peer scheduling model and evaluate it using case studies and mathematical modelling. We detail the DIANA (Data Intensive and Network Aware) scheduling algorithm and its queue management system for coping with the load distribution and for supporting bulk job scheduling. We demonstrate that such a system is beneficial for dynamic, distributed and self-organizing resource management and can assist in optimizing load or job distribution in complex Grid infrastructures.
Reproducibility of scientific workflows execution using cloud-aware provenance (ReCAP)
© 2018, Springer-Verlag GmbH Austria, part of Springer Nature. Provenance of scientific workflows has been considered a means to provide workflow reproducibility. However, the provenance approaches adopted so far are not applicable in the context of the Cloud because the provenance trace lacks the Cloud information. This paper presents a novel approach that collects the Cloud-aware provenance and represents it as a graph. The workflow execution reproducibility on the Cloud is determined by comparing the workflow provenance at three levels, i.e. workflow structure, execution infrastructure and workflow outputs. The experimental evaluation shows that the implemented approach can detect changes in the provenance traces and the outputs produced by the workflow.
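For the third comparison level, workflow outputs, one simple possibility is to compare cryptographic digests of the output files from the original and the reproduced run. The Python sketch below assumes that outputs are available as bytes keyed by file name and that byte-identical content is the criterion, which only holds for deterministic workflows; the `compare_outputs` helper and the sample file names are hypothetical, and the abstract does not state which technique the paper uses.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of an output file's content."""
    return hashlib.sha256(data).hexdigest()


def compare_outputs(original: dict[str, bytes],
                    reproduced: dict[str, bytes]) -> dict[str, bool]:
    """Report, per output file name, whether both runs produced identical bytes."""
    names = sorted(original.keys() | reproduced.keys())
    return {
        name: (
            name in original
            and name in reproduced
            and digest(original[name]) == digest(reproduced[name])
        )
        for name in names
    }


# Hypothetical outputs: one file matches, one differs, one is missing.
original_run = {"mosaic.fits": b"pixels-v1", "stats.txt": b"mean=4.2", "log.txt": b"ok"}
reproduced_run = {"mosaic.fits": b"pixels-v1", "stats.txt": b"mean=4.3"}

report = compare_outputs(original_run, reproduced_run)
print(report)
```

A file that exists in only one of the two runs is reported as a mismatch, so a missing output is flagged in the same way as changed content.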
Cloud infrastructure provenance collection and management to reproduce scientific workflows execution
© 2017 Elsevier B.V. The emergence of Cloud computing provides a new computing paradigm for scientific workflow execution. It provides dynamic, on-demand and scalable resources that enable the processing of complex workflow-based experiments. With the ever-growing size of experimental data and increasingly complex processing workflows, the need for reproducibility has also become essential. Provenance has been thought of as a mechanism to verify a workflow and to provide workflow reproducibility. One of the obstacles in reproducing an experiment execution is the lack of information about the execution infrastructure in the collected provenance. This information becomes critical in the context of the Cloud, in which resources are provisioned on-demand by specifying resource configurations. Therefore, a mechanism is required that captures infrastructure information along with the provenance of workflows executing on the Cloud to facilitate the re-creation of the execution environment on the Cloud. This paper presents a framework to Reproduce Scientific Workflow Execution using Cloud-Aware Provenance (ReCAP), along with the proposed mapping approaches that aid in capturing the Cloud-aware provenance information and help in re-provisioning the execution resources on the Cloud with similar configurations. Experimental evaluation has shown the impact of different resource configurations on workflow execution performance, and therefore justifies the need for collecting such provenance information in the context of the Cloud. The evaluation has also demonstrated that the proposed mapping approaches can capture Cloud information in various Cloud usage scenarios without causing performance overhead and can also enable the re-provisioning of resources on the Cloud. Experiments were conducted using workflows from different scientific domains such as astronomy and neuroscience to demonstrate the applicability of this research to different workflows.