Reducing deadline miss rate for grid workloads running in virtual machines: a deadline-aware and adaptive approach
This thesis explores three major areas of research: integration of virtualization into scientific grid infrastructures, evaluation of the virtualization overhead on HPC grid jobs' performance, and optimization of job execution times to increase throughput by reducing the job deadline miss rate.
Integrating virtualization into the grid to deploy on-demand virtual machines for jobs, in a way that is transparent to end users and has minimal impact on the existing system, poses a significant challenge. This involves creating virtual machines, decompressing the operating system image, adapting the virtual environment to satisfy the software requirements of the job, constantly updating the job state once it is running without modifying the batch system or existing grid middleware, and finally bringing the host machine back to a consistent state.
To facilitate this research, an existing, in-production pilot job framework was modified to deploy virtual machines on demand on the grid, using the virtualization administrative domain to handle all I/O and increase network throughput. This approach limits the change impact on the existing grid infrastructure while leveraging the execution and performance isolation capabilities of virtualization for job execution. This work led to an evaluation of various scheduling strategies used by the Xen hypervisor, measuring the sensitivity of job performance to the amount of CPU and memory allocated under various configurations.
However, virtualization overhead is also a critical factor in determining job execution times. Grid jobs have diverse requirements for machine resources such as CPU, memory, and network, and have inter-dependencies with other jobs in meeting their deadlines, since the input of one job can be the output of a previous job. A novel resource provisioning model was devised to decrease the impact of virtualization overhead on job execution.
Finally, dynamic deadline-aware optimization algorithms were introduced, using exponential smoothing and rate limiting to predict job failure rates based on static and dynamic virtualization overhead. Statistical techniques were also integrated into the optimization algorithm to flag jobs at risk of missing their deadlines and to take preventive action, increasing overall job throughput.
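As an illustration of the exponential-smoothing and rate-limiting approach the abstract describes, a minimal Python sketch is shown below. The class, attribute names, smoothing factor, and rate limit are invented for illustration and do not reproduce the thesis's actual model.

```python
# Illustrative sketch only: exponentially smooth the observed virtualization
# overhead and use it to flag jobs at risk of missing their deadlines,
# acting on at most a few jobs per scheduling cycle (rate limiting).
from dataclasses import dataclass


@dataclass
class Job:
    remaining_time: float   # estimated seconds of work left
    deadline: float         # absolute deadline (seconds since epoch)


class DeadlineRiskMonitor:
    def __init__(self, alpha=0.3, max_reschedules_per_cycle=5):
        self.alpha = alpha                       # smoothing factor (assumed)
        self.smoothed_overhead = 0.0             # EWMA of per-job overhead (s)
        self.max_reschedules = max_reschedules_per_cycle

    def observe(self, measured_overhead):
        """Update the exponentially smoothed overhead estimate."""
        self.smoothed_overhead = (self.alpha * measured_overhead
                                  + (1 - self.alpha) * self.smoothed_overhead)

    def at_risk_jobs(self, running_jobs, now):
        """Return jobs predicted to miss their deadline, capped by the rate limit."""
        flagged = []
        for job in running_jobs:
            predicted_finish = now + job.remaining_time + self.smoothed_overhead
            if predicted_finish > job.deadline:
                flagged.append(job)
            if len(flagged) >= self.max_reschedules:
                break   # rate limiting: act on only a few jobs per cycle
        return flagged
```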
Introducing risk management into the grid
Service Level Agreements (SLAs) are explicit statements about all expectations and obligations in the business partnership between customers and providers. They have been introduced in Grid computing to overcome the best-effort approach, making the Grid more attractive for commercial applications. However, decisions on negotiation and system management still rely on static approaches that do not reflect the risk linked with those decisions. The EC-funded project "AssessGrid" aims to introduce risk assessment and management as a novel decision paradigm into Grid computing. This paper gives a general motivation for risk management and presents the envisaged architecture of a "risk-aware" Grid middleware and Grid fabric, highlighting its functionality by means of three showcase scenarios.
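The risk-aware decision paradigm can be illustrated with a toy expected-penalty calculation in Python. The function, prices, penalties, and probabilities below are invented for illustration and are not the AssessGrid middleware or its API.

```python
# Toy illustration of a risk-aware SLA admission decision: the provider
# accepts an SLA only if the expected profit, after weighing the assessed
# probability of failure against the agreed penalty, stays positive.
def accept_sla(price, penalty, failure_probability):
    """Return True if the expected profit of taking the SLA is positive."""
    expected_profit = (1 - failure_probability) * price - failure_probability * penalty
    return expected_profit > 0


# Example: a 100-unit job with a 250-unit penalty is only worth accepting
# when the assessed risk of violating the SLA is below roughly 28.6%.
print(accept_sla(price=100, penalty=250, failure_probability=0.20))   # True
print(accept_sla(price=100, penalty=250, failure_probability=0.35))   # False
```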
Large Scale In Silico Screening on Grid Infrastructures
Large-scale grid infrastructures for in silico drug discovery open opportunities of particular interest to neglected and emerging diseases. In 2005 and 2006, we were able to deploy large-scale in silico docking within the framework of the WISDOM initiative against Malaria and Avian Flu, requiring about 105 years of CPU on the EGEE, Auvergrid and TWGrid infrastructures. These achievements demonstrated the relevance of large-scale grid infrastructures for virtual screening by molecular docking. This also allowed us to evaluate the performance of the grid infrastructures and to identify specific issues raised by large-scale deployment.
Comment: 14 pages, 2 figures, 2 tables. The Third International Life Science Grid Workshop, LSGrid 2006, Yokohama, Japan, 13-14 October 2006; to appear in the proceedings.
Exploiting relocation to reduce network dimensions of resilient optical grids
Optical grids are widely deployed to solve complex problems we face today. An important aspect of the supporting network is resiliency, i.e., the ability to overcome network failures. In contrast to classical network protection schemes, we do not necessarily provide a backup path between the source and the original destination. Instead, we try to relocate the job to another server location if this allows us to provide a backup path that comprises fewer wavelengths than the one the traditional scheme would suggest. This relocation is backed by the grid-specific anycast principle: a user generally does not care where his job is executed and is only interested in its results. We present ILP formulations for both resilience schemes and evaluate them in a case study on a European network topology.
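The ILP formulations themselves are not reproduced here; the sketch below only illustrates the relocation idea in Python (using networkx, on an invented toy topology with hop count standing in for wavelength usage): instead of protecting the working path to the original server, pick whichever candidate server admits the shortest link-disjoint backup path.

```python
# Sketch of anycast relocation (not the paper's ILP): the backup path may
# terminate at ANY acceptable server, so we choose the one whose link-disjoint
# backup path uses the fewest hops. Topology and names are invented.
import networkx as nx


def backup_with_relocation(G, source, primary_server, candidate_servers):
    working = nx.shortest_path(G, source, primary_server)
    # Remove working-path links so the backup is link-disjoint.
    H = G.copy()
    H.remove_edges_from(zip(working, working[1:]))
    best = None
    for server in candidate_servers:          # anycast: any server is acceptable
        try:
            hops = nx.shortest_path_length(H, source, server)
        except nx.NetworkXNoPath:
            continue
        if best is None or hops < best[1]:
            best = (server, hops)
    return working, best   # best may differ from primary_server => relocation


# Tiny invented topology: relocating to 'server_B' yields a 2-hop backup,
# whereas protecting the original 'server_A' would need 3 hops.
G = nx.Graph([('user', 'a'), ('a', 'server_A'), ('user', 'b'),
              ('b', 'c'), ('c', 'server_A'), ('b', 'server_B')])
print(backup_with_relocation(G, 'user', 'server_A', ['server_A', 'server_B']))
```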
MOON: MapReduce On Opportunistic eNvironments
Abstract—MapReduce offers a flexible programming model for processing and generating large data sets on dedicated resources, where only a small fraction of such resources are ever unavailable at any given time. In contrast, when MapReduce is run on volunteer computing systems, which opportunistically harness idle desktop computers via frameworks such as Condor, it results in poor performance due to the volatility of the resources, in particular the high rate of node unavailability. Specifically, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. The adaptive task and data scheduling algorithms in MOON distinguish between (1) different types of MapReduce data and (2) different types of node outages in order to strategically place tasks and data on both volatile and dedicated nodes. Our tests demonstrate that MOON can deliver a 3-fold performance improvement over Hadoop in volatile, volunteer computing environments.
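A toy Python sketch of the hybrid placement idea follows; it is not MOON's actual algorithm or Hadoop's API. The node names, availability figures, and policy (critical data always keeps one replica on a dedicated node, with the number of volatile replicas scaled to observed node volatility) are assumptions made for illustration.

```python
# Toy replica-placement sketch for a hybrid volatile/dedicated cluster.
import math
import random


def place_replicas(block_is_critical, dedicated_nodes, volatile_nodes,
                   node_unavailability=0.4, target_availability=0.99):
    # Enough volatile replicas so that P(all replicas unavailable) stays small.
    k = max(1, math.ceil(math.log(1 - target_availability)
                         / math.log(node_unavailability)))
    replicas = random.sample(volatile_nodes, min(k, len(volatile_nodes)))
    if block_is_critical and dedicated_nodes:
        # Dedicated nodes are assumed to be (nearly) always available.
        replicas.append(random.choice(dedicated_nodes))
    return replicas


dedicated = ["d0", "d1"]
volatile = [f"v{i}" for i in range(20)]
print(place_replicas(True, dedicated, volatile))  # e.g. ['v3', 'v17', ..., 'd1']
```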
Predicting Scheduling Failures in the Cloud
Cloud Computing has emerged as a key technology to deliver and manage computing, platform, and software services over the Internet. Task scheduling algorithms play an important role in the efficiency of cloud computing services, as they aim to reduce the turnaround time of tasks and improve resource utilization. Several task scheduling algorithms have been proposed in the literature for cloud computing systems, the majority relying on the computational complexity of tasks and the distribution of resources. However, several tasks scheduled following these algorithms still fail because of unforeseen changes in the cloud environment. In this paper, using task execution and resource utilization data extracted from the execution traces of real-world applications at Google, we explore the possibility of predicting the scheduling outcome of a task using statistical models. If we can successfully predict task failures, we may be able to reduce the execution time of jobs by rescheduling failed tasks earlier (i.e., before their actual failing time). Our results show that statistical models can predict task failures with a precision of up to 97.4% and a recall of up to 96.2%. We simulate the potential benefits of such predictions using the toolkit GloudSim and find that they can improve the number of finished tasks by up to 40%. We also perform a case study using the Hadoop framework of Amazon Elastic MapReduce (EMR) and the jobs of a gene expression correlation analysis study from breast cancer research. We find that when extending the scheduler of Hadoop with our predictive models, the percentage of failed jobs can be reduced by up to 45%, with an overhead of less than 5 minutes.
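The paper's specific statistical models are not reproduced here; the sketch below is a generic Python/scikit-learn illustration of the overall idea, training a classifier on per-task resource-usage features to predict whether a scheduled task will fail. The feature names, the synthetic labelling rule, and the data are invented and do not reflect the Google trace schema.

```python
# Generic sketch of task-failure prediction from execution traces
# (synthetic data, invented features; not the paper's models).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
n = 5000
# Hypothetical per-task features: requested CPU, requested memory,
# mean CPU usage, mean memory usage, priority, prior eviction score.
X = rng.random((n, 6))
# Synthetic label: tasks that over-consume relative to their CPU request and
# have a high eviction score are marked as failures (purely illustrative).
y = ((X[:, 2] > X[:, 0]) & (X[:, 5] > 0.5)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
print("precision:", precision_score(y_test, pred))
print("recall:   ", recall_score(y_test, pred))
```

In a real setting, the features would come from the scheduler's execution traces, and a positive prediction would trigger the earlier rescheduling described in the abstract.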