873 research outputs found
Checkpointing algorithms and fault prediction
This paper deals with the impact of fault prediction techniques on
checkpointing strategies. We extend the classical first-order analysis of Young
and Daly in the presence of a fault prediction system, characterized by its
recall and its precision. In this framework, we provide an optimal algorithm to
decide when to take predictions into account, and we derive the optimal value
of the checkpointing period. These results allow to analytically assess the key
parameters that impact the performance of fault predictors at very large scale.Comment: Supported in part by ANR Rescue. Published in Journal of Parallel and
Distributed Computing. arXiv admin note: text overlap with arXiv:1207.693
Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management
As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs - systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the check-pointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM
Reliable Provisioning of Spot Instances for Compute-intensive Applications
Cloud computing providers are now offering their unused resources for leasing
in the spot market, which has been considered the first step towards a
full-fledged market economy for computational resources. Spot instances are
virtual machines (VMs) available at lower prices than their standard on-demand
counterparts. These VMs will run for as long as the current price is lower than
the maximum bid price users are willing to pay per hour. Spot instances have
been increasingly used for executing compute-intensive applications. In spite
of an apparent economical advantage, due to an intermittent nature of biddable
resources, application execution times may be prolonged or they may not finish
at all. This paper proposes a resource allocation strategy that addresses the
problem of running compute-intensive jobs on a pool of intermittent virtual
machines, while also aiming to run applications in a fast and economical way.
To mitigate potential unavailability periods, a multifaceted fault-aware
resource provisioning policy is proposed. Our solution employs price and
runtime estimation mechanisms, as well as three fault tolerance techniques,
namely checkpointing, task duplication and migration. We evaluate our
strategies using trace-driven simulations, which take as input real price
variation traces, as well as an application trace from the Parallel Workload
Archive. Our results demonstrate the effectiveness of executing applications on
spot instances, respecting QoS constraints, despite occasional failures.Comment: 8 pages, 4 figure
Reliability -aware optimal checkpoint /restart model in high performance computing
Computational power demand for large challenging problems has increasingly driven the physical size of High Performance Computing (HPC) systems. As the system gets larger, it requires more and more components (processor, memory, disk, switch, power supply and so on). Thus, challenges arise in handling reliability of such large-scale systems. In order to minimize the performance loss due to unexpected failures, fault tolerant mechanisms are vital to sustain computational power in such environment. Checkpoint/restart is a common fault tolerant technique which has been widely applied in the single computer system. However, checkpointing in a large-scale HPC environment is much more challenging due to complexity, coordination, and timing issues. In this dissertation, we present a reliability-aware method for an optimal checkpoint/restart strategy. Our scheme aims to address the fault tolerance challenge, especially in a large-scale HPC system, by providing optimal checkpoint placement techniques derived from the actual system reliability. Unlike existing checkpoint models, which can only handle Poisson failure and a constant checkpoint interval, our model can perform a varying checkpoint interval and deal with different failure distributions. In addition, the approach considers optimality for both checkpoint overhead and rollback time. Our validation results suggest a significant improvement over existing techniques
A Reliable and Cost-Efficient Auto-Scaling System for Web Applications Using Heterogeneous Spot Instances
Cloud providers sell their idle capacity on markets through an auction-like
mechanism to increase their return on investment. The instances sold in this
way are called spot instances. In spite that spot instances are usually 90%
cheaper than on-demand instances, they can be terminated by provider when their
bidding prices are lower than market prices. Thus, they are largely used to
provision fault-tolerant applications only. In this paper, we explore how to
utilize spot instances to provision web applications, which are usually
considered availability-critical. The idea is to take advantage of differences
in price among various types of spot instances to reach both high availability
and significant cost saving. We first propose a fault-tolerant model for web
applications provisioned by spot instances. Based on that, we devise novel
auto-scaling polices for hourly billed cloud markets. We implemented the
proposed model and policies both on a simulation testbed for repeatable
validation and Amazon EC2. The experiments on the simulation testbed and the
real platform against the benchmarks show that the proposed approach can
greatly reduce resource cost and still achieve satisfactory Quality of Service
(QoS) in terms of response time and availability
Recommended from our members
Transiency-driven Resource Management for Cloud Computing Platforms
Modern distributed server applications are hosted on enterprise or cloud data centers that provide computing, storage, and networking capabilities to these applications. These applications are built using the implicit assumption that the underlying servers will be stable and normally available, barring for occasional faults. In many emerging scenarios, however, data centers and clouds only provide transient, rather than continuous, availability of their servers. Transiency in modern distributed systems arises in many contexts, such as green data centers powered using renewable intermittent sources, and cloud platforms that provide lower-cost transient servers which can be unilaterally revoked by the cloud operator.
Transient computing resources are increasingly important, and existing fault-tolerance and resource management techniques are inadequate for transient servers because applications typically assume continuous resource availability. This thesis presents research in distributed systems design that treats transiency as a first-class design principle. I show that combining transiency-specific fault-tolerance mechanisms with resource management policies to suit application characteristics and requirements, can yield significant cost and performance benefits. These mechanisms and policies have been implemented and prototyped as part of software systems, which allow a wide range of applications, such as interactive services and distributed data processing, to be deployed on transient servers, and can reduce cloud computing costs by up to 90\%.
This thesis makes contributions to four areas of computer systems research: transiency-specific fault-tolerance, resource allocation, abstractions, and resource reclamation. For reducing the impact of transient server revocations, I develop two fault-tolerance techniques that are tailored to transient server characteristics and application requirements. For interactive applications, I build a derivative cloud platform that masks revocations by transparently moving application-state between servers of different types. Similarly, for distributed data processing applications, I investigate the use of application level periodic checkpointing to reduce the performance impact of server revocations. For managing and reducing the risk of server revocations, I investigate the use of server portfolios that allow transient resource allocation to be tailored to application requirements.
Finally, I investigate how resource providers (such as cloud platforms) can provide transient resource availability without revocation, by looking into alternative resource reclamation techniques. I develop resource deflation, wherein a server\u27s resources are fractionally reclaimed, allowing the application to continue execution albeit with fewer resources. Resource deflation generalizes revocation, and the deflation mechanisms and cluster-wide policies can yield both high cluster utilization and low application performance degradation
On the impact of process replication on executions of large-scale parallel applications with coordinated checkpointing
International audienceProcessor failures in post-petascale parallel computing platforms are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback-recovery, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback-recovery, has been recently advocated. We first derive novel theoretical results for Exponential failure distributions, namely exact values for the Mean Number of Failures To Interruption and the Mean Time To Interruption. We then extend these results to arbitrary failure distributions, obtaining closed-form solutions for Weibull distributions. Finally, we evaluate process replica-tion in simulation using both synthetic and real-world failure traces so as to quantify average application makespan. One interesting result from these experiments is that, when process repli-cation is used, application performance is not sensitive to the checkpointing period, provided that that period is within a large neighborhood of the optimal period. More generally, our empirical results make it possible to identify regimes in which process replication is beneficial
- …