2,338 research outputs found
Simulation of the performance of complex data-intensive workflows
PhD ThesisRecently, cloud computing has been used for analytical and data-intensive processes
as it offers many attractive features, including resource pooling, on-demand capability
and rapid elasticity. Scientific workflows use these features to tackle the problems of
complex data-intensive applications. Data-intensive workflows are composed of many
tasks that may involve large input data sets and produce large amounts of data as
output, which typically runs in highly dynamic environments. However, the resources
should be allocated dynamically depending on the demand changes of the work
flow, as over-provisioning increases the cost and under-provisioning causes Service Level
Agreement (SLA) violation and poor Quality of Service (QoS). Performance prediction
of complex workflows is a necessary step prior to the deployment of the workflow.
Performance analysis of complex data-intensive workflows is a challenging task due
to the complexity of their structure, diversity of big data, and data dependencies, in
addition to the required examination to the performance and challenges associated
with running their workflows in the real cloud.
In this thesis, a solution is explored to address these challenges, using a Next Generation
Sequencing (NGS) workflow pipeline as a case study, which may require hundreds/
thousands of CPU hours to process a terabyte of data. We propose a methodology to
model, simulate and predict runtime and the number of resources used by the complex
data-intensive workflows. One contribution of our simulation methodology is that it
provides an ability to extract the simulation parameters (e.g., MIPs and BW values)
that are required for constructing a training set and a fairly accurate prediction of
the run time for input for cluster sizes much larger than ones used in training of the
prediction model. The proposed methodology permits the derivation of run time prediction
based on historical data from the provenance fi les. We present the run time
prediction of the complex workflow by considering different cases of its running in the
cloud such as execution failure and library deployment time. In case of failure, the
framework can apply the prediction only partially considering the successful parts of
the pipeline, in the other case the framework can predict with or without considering
the time to deploy libraries. To further improve the accuracy of prediction, we propose
a simulation model that handles I/O contention
CERN openlab Whitepaper on Future IT Challenges in Scientific Research
This whitepaper describes the major IT challenges in scientific research at CERN and several other European and international research laboratories and projects. Each challenge is exemplified through a set of concrete use cases drawn from the requirements of large-scale scientific programs. The paper is based on contributions from many researchers and IT experts of the participating laboratories and also input from the existing CERN openlab industrial sponsors. The views expressed in this document are those of the individual contributors and do not necessarily reflect the view of their organisations and/or affiliates
- …