808 research outputs found
Reliable Provisioning of Spot Instances for Compute-intensive Applications
Cloud computing providers are now offering their unused resources for leasing
in the spot market, which has been considered the first step towards a
full-fledged market economy for computational resources. Spot instances are
virtual machines (VMs) available at lower prices than their standard on-demand
counterparts. These VMs will run for as long as the current price is lower than
the maximum bid price users are willing to pay per hour. Spot instances have
been increasingly used for executing compute-intensive applications. In spite
of an apparent economical advantage, due to an intermittent nature of biddable
resources, application execution times may be prolonged or they may not finish
at all. This paper proposes a resource allocation strategy that addresses the
problem of running compute-intensive jobs on a pool of intermittent virtual
machines, while also aiming to run applications in a fast and economical way.
To mitigate potential unavailability periods, a multifaceted fault-aware
resource provisioning policy is proposed. Our solution employs price and
runtime estimation mechanisms, as well as three fault tolerance techniques,
namely checkpointing, task duplication and migration. We evaluate our
strategies using trace-driven simulations, which take as input real price
variation traces, as well as an application trace from the Parallel Workload
Archive. Our results demonstrate the effectiveness of executing applications on
spot instances, respecting QoS constraints, despite occasional failures.Comment: 8 pages, 4 figure
Checkpointing algorithms and fault prediction
This paper deals with the impact of fault prediction techniques on
checkpointing strategies. We extend the classical first-order analysis of Young
and Daly in the presence of a fault prediction system, characterized by its
recall and its precision. In this framework, we provide an optimal algorithm to
decide when to take predictions into account, and we derive the optimal value
of the checkpointing period. These results allow to analytically assess the key
parameters that impact the performance of fault predictors at very large scale.Comment: Supported in part by ANR Rescue. Published in Journal of Parallel and
Distributed Computing. arXiv admin note: text overlap with arXiv:1207.693
Doing-it-All with Bounded Work and Communication
We consider the Do-All problem, where cooperating processors need to
complete similar and independent tasks in an adversarial setting. Here we
deal with a synchronous message passing system with processors that are subject
to crash failures. Efficiency of algorithms in this setting is measured in
terms of work complexity (also known as total available processor steps) and
communication complexity (total number of point-to-point messages). When work
and communication are considered to be comparable resources, then the overall
efficiency is meaningfully expressed in terms of effort defined as work +
communication. We develop and analyze a constructive algorithm that has work
and a nonconstructive
algorithm that has work . The latter result is close to the
lower bound on work. The effort of each of
these algorithms is proportional to its work when the number of crashes is
bounded above by , for some positive constant . We also present a
nonconstructive algorithm that has effort
Building on Quicksand
Reliable systems have always been built out of unreliable components. Early
on, the reliable components were small such as mirrored disks or ECC (Error
Correcting Codes) in core memory. These systems were designed such that
failures of these small components were transparent to the application. Later,
the size of the unreliable components grew larger and semantic challenges crept
into the application when failures occurred.
As the granularity of the unreliable component grows, the latency to
communicate with a backup becomes unpalatable. This leads to a more relaxed
model for fault tolerance. The primary system will acknowledge the work request
and its actions without waiting to ensure that the backup is notified of the
work. This improves the responsiveness of the system.
There are two implications of asynchronous state capture: 1) Everything
promised by the primary is probabilistic. There is always a chance that an
untimely failure shortly after the promise results in a backup proceeding
without knowledge of the commitment. Hence, nothing is guaranteed! 2)
Applications must ensure eventual consistency. Since work may be stuck in the
primary after a failure and reappear later, the processing order for work
cannot be guaranteed.
Platform designers are struggling to make this easier for their applications.
Emerging patterns of eventual consistency and probabilistic execution may soon
yield a way for applications to express requirements for a "looser" form of
consistency while providing availability in the face of ever larger failures.
This paper recounts portions of the evolution of these trends, attempts to
show the patterns that span these changes, and talks about future directions as
we continue to "build on quicksand".Comment: CIDR 200
- …