
    Energy-aware scheduling under reliability and makespan constraints

    We consider a task graph mapped on a set of homogeneous processors. We aim at minimizing the energy consumption while enforcing two constraints: a prescribed bound on the execution time (or makespan) and a reliability threshold. Dynamic voltage and frequency scaling (DVFS) is frequently used to reduce the energy consumption of a schedule, but slowing down the execution of a task to save energy also decreases the reliability of that execution. In this work, to improve the reliability of a schedule while reducing its energy consumption, we allow some tasks to be re-executed. We assess the complexity of the tri-criteria scheduling problem (makespan, reliability, energy) of deciding which tasks to re-execute and at which speed each execution of a task should run, under two speed models: either processors can take arbitrary speeds (continuous model), or a processor can run at a finite number of different speeds and change its speed during a computation (VDD model). We propose several novel tri-criteria scheduling heuristics under the continuous speed model and evaluate them through a set of simulations. The two best heuristics turn out to be very efficient and complementary.
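
    As a rough illustration of the trade-off described above, the sketch below uses a DVFS fault model common in this literature (energy proportional to w*s^2, and a failure rate that grows exponentially as the speed drops), together with re-execution of a task at two chosen speeds. The constants LAMBDA0, D and the speed range are illustrative assumptions, not values from the paper.

        import math

        # Illustrative constants (assumptions, not taken from the paper):
        LAMBDA0 = 1e-5   # failure rate at maximum speed
        D = 3.0          # sensitivity of the failure rate to frequency scaling
        S_MIN, S_MAX = 0.5, 1.0

        def reliability(w, s):
            """Reliability of one execution of a task of weight w at speed s,
            under the usual exponential fault model in which the failure rate
            increases when the processor is slowed down."""
            lam = LAMBDA0 * 10 ** (D * (S_MAX - s) / (S_MAX - S_MIN))
            return math.exp(-lam * w / s)

        def energy(w, s):
            """Dynamic energy of one execution: time (w/s) times power (s^3)."""
            return w * s ** 2

        def with_reexecution(w, s1, s2):
            """Energy, worst-case time and reliability when a task may be
            re-executed: the task succeeds if at least one execution does."""
            e = energy(w, s1) + energy(w, s2)
            t = w / s1 + w / s2     # worst case: both executions run
            r = 1 - (1 - reliability(w, s1)) * (1 - reliability(w, s2))
            return e, t, r

        if __name__ == "__main__":
            w = 10.0
            print("single run at s=1.0:", energy(w, 1.0), w / 1.0, reliability(w, 1.0))
            print("two runs at s=0.7:  ", with_reexecution(w, 0.7, 0.7))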

    Failure analysis and reliability-aware resource allocation of parallel applications in High Performance Computing systems

    The demand for more computational power to solve complex scientific problems has driven the physical size of High Performance Computing (HPC) systems to hundreds and thousands of nodes. Uninterrupted execution of large-scale parallel applications naturally becomes a major challenge, because a single node failure interrupts the entire application, and the probability of a job completing decreases as the number of nodes increases. Accurate reliability knowledge of an HPC system enables runtime systems, such as resource managers and applications, to minimize performance loss due to random failures while also providing better Quality of Service (QoS) for computational users. This dissertation makes three major contributions to reliability evaluation and resource management in HPC systems. First, we study the failure properties of HPC systems and observe that the Times To Failure (TTFs) of individual compute nodes follow a time-varying failure-rate distribution such as the Weibull distribution. We then propose a model for the TTF distribution of a system of k independent nodes when individual nodes exhibit time-varying failure rates. Based on the reliability given by the proposed TTF model, we develop reliability-aware resource allocation algorithms and evaluate them on actual parallel workloads and failure data of an HPC system. Our observations indicate that applying a time-varying failure-rate-based reliability function, combined with some heuristics, reduces the performance loss due to unexpected failures by as much as 30 to 53 percent. Finally, we also study the effect of reliability with respect to the number of nodes and propose a reliability-aware optimal k-node allocation algorithm for large-scale parallel applications. Our simulation results comparing the optimal k-node algorithm indicate that choosing the number of nodes for a large-scale parallel application based on the reliability of compute nodes can reduce the overall completion time and wasted time, as the chosen k may be smaller than the total number of nodes in the system.
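
    The following minimal sketch illustrates the kind of reliability computation such a model builds on, assuming k identical, independent nodes with Weibull-distributed times to failure (a parallel job fails as soon as any of its nodes fails). The shape and scale values are illustrative assumptions, not the dissertation's measured data.

        import math

        def node_reliability(t, shape, scale):
            """Weibull reliability of a single node: P(node survives past t)."""
            return math.exp(-((t / scale) ** shape))

        def system_reliability(t, k, shape, scale):
            """A k-node parallel job fails as soon as any node fails (series
            system), so with independent identical nodes the reliabilities
            multiply: R_sys(t) = exp(-k * (t/scale)^shape)."""
            return math.exp(-k * (t / scale) ** shape)

        def system_mttf(k, shape, scale):
            """The minimum of k i.i.d. Weibull lifetimes is again Weibull with
            the same shape and scale reduced by k^(1/shape); its mean uses the
            Gamma function."""
            return scale * k ** (-1.0 / shape) * math.gamma(1.0 + 1.0 / shape)

        if __name__ == "__main__":
            shape, scale = 0.7, 2000.0   # illustrative values (hours), not measured data
            for k in (1, 64, 1024):
                print(k, system_reliability(24.0, k, shape, scale), system_mttf(k, shape, scale))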

    Approximation Algorithms for Energy, Reliability, and Makespan Optimization Problems

    We consider the problem of scheduling an application on a parallel computational platform. The application is a particular task graph: either a linear chain of tasks or a set of independent tasks. The platform is made of identical processors whose speed can be dynamically modified. It is also subject to failures: if a processor is slowed down to decrease the energy consumption, it has a higher chance of failing. Therefore, the scheduling problem requires us to re-execute or replicate tasks (i.e., execute the same task twice, either on the same processor or on two distinct processors) in order to increase the reliability. It is a tri-criteria problem: the goal is to minimize the energy consumption while enforcing a bound on the total execution time (the makespan) and a constraint on the reliability of each task. Our main contribution is to propose approximation algorithms for linear chains of tasks and for independent tasks. For linear chains, we design a fully polynomial-time approximation scheme. However, we show that there exists no constant-factor approximation algorithm for independent tasks unless P = NP, and we propose in this case an approximation algorithm with a relaxation on the makespan constraint.
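
    To make the re-execution versus replication choice concrete, here is a small illustrative sketch (not the paper's algorithm): under a simplified energy/reliability model it picks, for a single task, the cheapest option that satisfies both the makespan bound and the reliability threshold. All constants and the model itself are assumptions for illustration.

        import math

        def reliability(w, s, lam0=1e-5, d=3.0, s_min=0.5, s_max=1.0):
            """Success probability of one execution at speed s (assumed model)."""
            lam = lam0 * 10 ** (d * (s_max - s) / (s_max - s_min))
            return math.exp(-lam * w / s)

        def replicate(w, s):
            """Run the task on two distinct processors in parallel at speed s."""
            return {"energy": 2 * w * s ** 2,
                    "time": w / s,
                    "reliability": 1 - (1 - reliability(w, s)) ** 2}

        def reexecute(w, s):
            """Run the task twice on the same processor at speed s (worst case:
            both executions are performed, so the time doubles)."""
            return {"energy": 2 * w * s ** 2,
                    "time": 2 * w / s,
                    "reliability": 1 - (1 - reliability(w, s)) ** 2}

        def cheapest_feasible(w, speeds, makespan_bound, rel_threshold):
            """Pick the lowest-energy option that meets both constraints."""
            options = [replicate(w, s) for s in speeds] + [reexecute(w, s) for s in speeds]
            feasible = [o for o in options
                        if o["time"] <= makespan_bound and o["reliability"] >= rel_threshold]
            return min(feasible, key=lambda o: o["energy"]) if feasible else None

        if __name__ == "__main__":
            print(cheapest_feasible(10.0, [0.6, 0.8, 1.0], makespan_bound=25.0, rel_threshold=0.999))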

    Cooperative checkpointing for supercomputing systems

    Thesis (M. Eng.) -- Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. This electronic version was submitted by the student author; the certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 91-94). A system-level checkpointing mechanism with global knowledge of the state and health of the machine can improve performance and reliability by dynamically deciding when to skip checkpoint requests made by applications. This thesis presents such a technique, called cooperative checkpointing, and models its behavior as an online algorithm. Where C is the checkpoint overhead and I is the request interval, a worst-case analysis proves a lower bound of (2 + [C/I])-competitiveness for deterministic cooperative checkpointing algorithms, and proves that a number of simple algorithms meet this bound. Using an expected-case analysis, the thesis proves that an optimal periodic checkpointing algorithm that assumes an exponential failure distribution may be arbitrarily bad relative to an optimal cooperative checkpointing algorithm that permits a general failure distribution. Calculations suggest that, under realistic conditions, an application using cooperative checkpointing may make progress 4 times faster than one using periodic checkpointing. Finally, the thesis suggests an embodiment of cooperative checkpointing for a large-scale high-performance computer system and presents the results of some preliminary simulations. These results show that, in extreme cases, cooperative checkpointing improved system utilization by more than 25% and reduced bounded slowdown by a factor of 9, while simultaneously reducing the amount of work lost due to failures by 30%. This thesis contributes a unique approach to providing large-scale system reliability through cooperative checkpointing, techniques for analyzing the approach, and blueprints for implementing it in practice. By Adam Jamison Oliner. M.Eng.
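
    For a feel of the numbers in the worst-case bound, the toy snippet below evaluates 2 + C/I for a few overhead/interval pairs, treating the bracketed ratio in the bound as a plain ratio. The values of C and I are illustrative assumptions, not taken from the thesis.

        def cooperative_lower_bound(C, I):
            """Worst-case competitive-ratio lower bound quoted in the abstract,
            with C the checkpoint overhead and I the checkpoint request interval."""
            return 2 + C / I

        if __name__ == "__main__":
            # Illustrative values only (minutes); the thesis does not fix these.
            for C, I in [(5, 60), (10, 60), (5, 15)]:
                print(f"C={C}, I={I}: competitive ratio >= {cooperative_lower_bound(C, I):.2f}")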

    Contributions to High-Throughput Computing Based on the Peer-to-Peer Paradigm

    XII, 116 p. This dissertation focuses on High Throughput Computing (HTC) systems and on how to build a working HTC system using Peer-to-Peer (P2P) technologies. Traditional HTC systems, designed to process the largest possible number of tasks per unit of time, revolve around a central node that implements a queue used to store and manage submitted tasks. This central node limits the scalability and fault tolerance of the HTC system. A usual solution involves the use of replicas of the master node that can replace it; this solution is, however, limited by the number of replicas used. In this thesis, we propose an alternative solution that follows the P2P philosophy: a completely distributed system in which all worker nodes participate in the scheduling of tasks, with a physically distributed task queue implemented on top of a P2P storage system. The fault tolerance and scalability of this proposal are therefore limited only by the number of nodes in the system. The proper operation and scalability of our proposal have been validated through experimentation with a real system. The data availability provided by Cassandra, the P2P data management framework used in our proposal, is analysed by means of several stochastic models. These models can be used to make predictions about the availability of any Cassandra deployment, as well as to select the best possible configuration of any Cassandra system. In order to validate the proposed models, experiments with real Cassandra clusters were carried out, showing that our models are good descriptors of Cassandra's availability. Finally, we propose a set of scheduling policies that try to solve a common problem of HTC systems: the re-execution of tasks due to a failure of the node where a task was running, without additional resource misspending. In order to reduce the number of re-executions, our proposals try to find good fits between the reliability of nodes and the estimated length of each task. Extensive simulation-based experiments show that our policies are capable of reducing the number of re-executions, improving system performance and the utilization of nodes.
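
    A minimal sketch of the kind of "fit" such scheduling policies aim for (not the thesis's actual policies): longer tasks go to more reliable nodes, so that the work most expensive to re-execute is the least likely to be lost to a node failure. The Node and Task structures and their fields are assumed for illustration.

        from dataclasses import dataclass
        from typing import List

        @dataclass
        class Node:
            name: str
            reliability: float   # estimated probability of surviving a long task

        @dataclass
        class Task:
            name: str
            est_length: float    # estimated run time

        def match_tasks_to_nodes(tasks: List[Task], nodes: List[Node]) -> List[tuple]:
            """Toy 'fit' policy: pair the longest tasks with the most reliable
            nodes, reducing the chance that a long task has to be re-executed."""
            tasks_sorted = sorted(tasks, key=lambda t: t.est_length, reverse=True)
            nodes_sorted = sorted(nodes, key=lambda n: n.reliability, reverse=True)
            return list(zip(tasks_sorted, nodes_sorted))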

    Fault-aware job scheduling for BlueGene/L systems

    Large-scale systems like BlueGene/L are susceptible to a number of software and hardware failures that can affect system performance. In this paper, we evaluate the effectiveness of a previously developed job scheduling algorithm for BlueGene/L in the presence of faults. We have developed two new job-scheduling algorithms that take failures into account while scheduling jobs. We have also evaluated the impact of these algorithms on average bounded slowdown, average response time, and system utilization, considering different levels of proactive failure prediction and prevention techniques reported in the literature. Our simulation studies show that the use of these new algorithms, even with trivial fault prediction confidence or accuracy levels (as low as), can significantly improve the performance of the BlueGene/L system.
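
    A minimal sketch of what a fault-aware allocation step might look like (not the paper's algorithms): given a set of nodes flagged by a failure predictor, the scheduler prefers unflagged nodes and falls back to flagged ones only when the job could not be placed otherwise. All names and structures here are assumptions for illustration.

        from typing import List, Set

        def fault_aware_allocation(free_nodes: List[str],
                                   predicted_faulty: Set[str],
                                   job_size: int) -> List[str]:
            """Prefer nodes that the failure predictor has not flagged; use
            flagged nodes only to fill any remaining slots."""
            healthy = [n for n in free_nodes if n not in predicted_faulty]
            if len(healthy) >= job_size:
                return healthy[:job_size]
            flagged = [n for n in free_nodes if n in predicted_faulty]
            allocation = healthy + flagged
            return allocation[:job_size] if len(allocation) >= job_size else []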