Reducing deadline miss rate for grid workloads running in virtual machines: a deadline-aware and adaptive approach
This thesis explores three major areas of research: integration of virtualization into scientific grid infrastructures, evaluation of the virtualization overhead on HPC grid jobs' performance, and optimization of job execution times to increase throughput by reducing the job deadline miss rate.
Integrating virtualization into the grid to deploy on-demand virtual machines for jobs, in a way that is transparent to end users and has minimal impact on the existing system, poses a significant challenge. This involves creating virtual machines, decompressing the operating system image, adapting the virtual environment to satisfy the software requirements of the job, constantly updating the job state once it is running without modifying the batch system or existing grid middleware, and finally bringing the host machine back to a consistent state.
To facilitate this research, an existing, in-production pilot job framework was modified to deploy virtual machines on demand on the grid, using the virtualization administrative domain to handle all I/O and increase network throughput. This approach limits the change impact on the existing grid infrastructure while leveraging the execution and performance isolation capabilities of virtualization for job execution. This work led to an evaluation of various scheduling strategies used by the Xen hypervisor, measuring the sensitivity of job performance to the amount of CPU and memory allocated under various configurations.
However, virtualization overhead is also a critical factor in determining job execution times. Grid jobs have diverse requirements for machine resources such as CPU, memory, and network, and have inter-dependencies with other jobs in meeting their deadlines, since the input of one job can be the output of a previous job. A novel resource provisioning model was devised to decrease the impact of virtualization overhead on job execution.
Finally, dynamic deadline-aware optimization algorithms were introduced, using exponential smoothing and rate limiting to predict job failure rates based on static and dynamic virtualization overhead. Statistical techniques were also integrated into the optimization algorithm to flag jobs at risk of missing their deadlines and to take preventive action, increasing overall job throughput.
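As an illustration of the exponential-smoothing and rate-limiting approach the abstract describes, a minimal Python sketch is shown below. The class, attribute names, smoothing factor, and rate limit are invented for illustration and do not reproduce the thesis's actual model.

```python
# Illustrative sketch only: exponentially smooth the observed virtualization
# overhead and use it to flag jobs at risk of missing their deadlines,
# acting on at most a few jobs per scheduling cycle (rate limiting).
from dataclasses import dataclass


@dataclass
class Job:
    remaining_time: float   # estimated seconds of work left
    deadline: float         # absolute deadline (seconds since epoch)


class DeadlineRiskMonitor:
    def __init__(self, alpha=0.3, max_reschedules_per_cycle=5):
        self.alpha = alpha                       # smoothing factor (assumed)
        self.smoothed_overhead = 0.0             # EWMA of per-job overhead (s)
        self.max_reschedules = max_reschedules_per_cycle

    def observe(self, measured_overhead):
        """Update the exponentially smoothed overhead estimate."""
        self.smoothed_overhead = (self.alpha * measured_overhead
                                  + (1 - self.alpha) * self.smoothed_overhead)

    def at_risk_jobs(self, running_jobs, now):
        """Return jobs predicted to miss their deadline, capped by the rate limit."""
        flagged = []
        for job in running_jobs:
            predicted_finish = now + job.remaining_time + self.smoothed_overhead
            if predicted_finish > job.deadline:
                flagged.append(job)
            if len(flagged) >= self.max_reschedules:
                break   # rate limiting: act on only a few jobs per cycle
        return flagged
```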
Introducing risk management into the grid
Service Level Agreements (SLAs) are explicit statements about all expectations and obligations in the business partnership between customers and providers. They have been introduced in Grid computing to overcome the best-effort approach, making the Grid more attractive for commercial applications. However, decisions on negotiation and system management still rely on static approaches that do not reflect the risk linked with those decisions. The EC-funded project "AssessGrid" aims to introduce risk assessment and management as a novel decision paradigm into Grid computing. This paper gives a general motivation for risk management and presents the envisaged architecture of a "risk-aware" Grid middleware and Grid fabric, highlighting its functionality by means of three showcase scenarios.
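The risk-aware decision paradigm can be illustrated with a toy expected-penalty calculation in Python. The function, prices, penalties, and probabilities below are invented for illustration and are not the AssessGrid middleware or its API.

```python
# Toy illustration of a risk-aware SLA admission decision: the provider
# accepts an SLA only if the expected profit, after weighing the assessed
# probability of failure against the agreed penalty, stays positive.
def accept_sla(price, penalty, failure_probability):
    """Return True if the expected profit of taking the SLA is positive."""
    expected_profit = (1 - failure_probability) * price - failure_probability * penalty
    return expected_profit > 0


# Example: a 100-unit job with a 250-unit penalty is only worth accepting
# when the assessed risk of violating the SLA is below roughly 28.6%.
print(accept_sla(price=100, penalty=250, failure_probability=0.20))   # True
print(accept_sla(price=100, penalty=250, failure_probability=0.35))   # False
```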
Large Scale In Silico Screening on Grid Infrastructures
Large-scale grid infrastructures for in silico drug discovery open opportunities of particular interest to neglected and emerging diseases. In 2005 and 2006, we were able to deploy large-scale in silico docking within the framework of the WISDOM initiative against Malaria and Avian Flu, requiring about 105 years of CPU on the EGEE, Auvergrid and TWGrid infrastructures. These achievements demonstrated the relevance of large-scale grid infrastructures for virtual screening by molecular docking. This also allowed us to evaluate the performance of the grid infrastructures and to identify specific issues raised by large-scale deployment.
Comment: 14 pages, 2 figures, 2 tables. The Third International Life Science Grid Workshop, LSGrid 2006, Yokohama, Japan, 13-14 October 2006; to appear in the proceedings.
Exploiting relocation to reduce network dimensions of resilient optical grids
Optical grids are widely deployed to solve complex problems we face today. An important aspect of the supporting network is resiliency, i.e., the ability to overcome network failures. In contrast to classical network protection schemes, we do not necessarily provide a backup path between the source and the original destination. Instead, we try to relocate the job to another server location if this allows us to provide a backup path that comprises fewer wavelengths than the one the traditional scheme would suggest. This relocation is backed by the grid-specific anycast principle: a user generally does not care where his job is executed and is only interested in its results. We present ILP formulations for both resilience schemes and evaluate them in a case study on a European network topology.
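The ILP formulations themselves are not reproduced here; the sketch below only illustrates the relocation idea in Python (using networkx, on an invented toy topology with hop count standing in for wavelength usage): instead of protecting the working path to the original server, pick whichever candidate server admits the shortest link-disjoint backup path.

```python
# Sketch of anycast relocation (not the paper's ILP): the backup path may
# terminate at ANY acceptable server, so we choose the one whose link-disjoint
# backup path uses the fewest hops. Topology and names are invented.
import networkx as nx


def backup_with_relocation(G, source, primary_server, candidate_servers):
    working = nx.shortest_path(G, source, primary_server)
    # Remove working-path links so the backup is link-disjoint.
    H = G.copy()
    H.remove_edges_from(zip(working, working[1:]))
    best = None
    for server in candidate_servers:          # anycast: any server is acceptable
        try:
            hops = nx.shortest_path_length(H, source, server)
        except nx.NetworkXNoPath:
            continue
        if best is None or hops < best[1]:
            best = (server, hops)
    return working, best   # best may differ from primary_server => relocation


# Tiny invented topology: relocating to 'server_B' yields a 2-hop backup,
# whereas protecting the original 'server_A' would need 3 hops.
G = nx.Graph([('user', 'a'), ('a', 'server_A'), ('user', 'b'),
              ('b', 'c'), ('c', 'server_A'), ('b', 'server_B')])
print(backup_with_relocation(G, 'user', 'server_A', ['server_A', 'server_B']))
```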
MOON: MapReduce On Opportunistic eNvironments
Abstract—MapReduce offers a flexible programming model for processing and generating large data sets on dedicated resources, where only a small fraction of such resources are ever unavailable at any given time. In contrast, when MapReduce is run on volunteer computing systems, which opportunistically harness idle desktop computers via frameworks such as Condor, it results in poor performance due to the volatility of the resources, in particular the high rate of node unavailability. Specifically, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. The adaptive task and data scheduling algorithms in MOON distinguish between (1) different types of MapReduce data and (2) different types of node outages in order to strategically place tasks and data on both volatile and dedicated nodes. Our tests demonstrate that MOON can deliver a 3-fold performance improvement over Hadoop in volatile, volunteer computing environments.
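A toy Python sketch of the hybrid placement idea follows; it is not MOON's actual algorithm or Hadoop's API. The node names, availability figures, and policy (critical data always keeps one replica on a dedicated node, with the number of volatile replicas scaled to observed node volatility) are assumptions made for illustration.

```python
# Toy replica-placement sketch for a hybrid volatile/dedicated cluster.
import math
import random


def place_replicas(block_is_critical, dedicated_nodes, volatile_nodes,
                   node_unavailability=0.4, target_availability=0.99):
    # Enough volatile replicas so that P(all replicas unavailable) stays small.
    k = max(1, math.ceil(math.log(1 - target_availability)
                         / math.log(node_unavailability)))
    replicas = random.sample(volatile_nodes, min(k, len(volatile_nodes)))
    if block_is_critical and dedicated_nodes:
        # Dedicated nodes are assumed to be (nearly) always available.
        replicas.append(random.choice(dedicated_nodes))
    return replicas


dedicated = ["d0", "d1"]
volatile = [f"v{i}" for i in range(20)]
print(place_replicas(True, dedicated, volatile))  # e.g. ['v3', 'v17', ..., 'd1']
```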
Predicting Scheduling Failures in the Cloud
Cloud Computing has emerged as a key technology to deliver and manage computing, platform, and software services over the Internet. Task scheduling algorithms play an important role in the efficiency of cloud computing services, as they aim to reduce the turnaround time of tasks and improve resource utilization. Several task scheduling algorithms have been proposed in the literature for cloud computing systems, the majority relying on the computational complexity of tasks and the distribution of resources. However, several tasks scheduled following these algorithms still fail because of unforeseen changes in the cloud environment. In this paper, using task execution and resource utilization data extracted from the execution traces of real-world applications at Google, we explore the possibility of predicting the scheduling outcome of a task using statistical models. If we can successfully predict task failures, we may be able to reduce the execution time of jobs by rescheduling failed tasks earlier (i.e., before their actual failing time). Our results show that statistical models can predict task failures with a precision of up to 97.4% and a recall of up to 96.2%. We simulate the potential benefits of such predictions using the toolkit GloudSim and find that they can improve the number of finished tasks by up to 40%. We also perform a case study using the Hadoop framework of Amazon Elastic MapReduce (EMR) and the jobs of a gene expression correlation analysis study from breast cancer research. We find that when extending the scheduler of Hadoop with our predictive models, the percentage of failed jobs can be reduced by up to 45%, with an overhead of less than 5 minutes.
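The paper's specific statistical models are not reproduced here; the sketch below is a generic Python/scikit-learn illustration of the overall idea, training a classifier on per-task resource-usage features to predict whether a scheduled task will fail. The feature names, the synthetic labelling rule, and the data are invented and do not reflect the Google trace schema.

```python
# Generic sketch of task-failure prediction from execution traces
# (synthetic data, invented features; not the paper's models).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n = 5000
# Hypothetical per-task features: requested CPU, requested memory,
# mean CPU usage, mean memory usage, priority, prior eviction score.
X = rng.random((n, 6))
# Synthetic label: tasks that over-consume relative to their CPU request and
# have a high eviction score are marked as failures (purely illustrative).
y = ((X[:, 2] > X[:, 0]) & (X[:, 5] > 0.5)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
```

In a real setting, the features would come from the scheduler's execution traces, and a positive prediction would trigger the earlier rescheduling described in the abstract.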