
    Reliable Provisioning of Spot Instances for Compute-intensive Applications

    Cloud computing providers now offer their unused resources for leasing in the spot market, which has been considered the first step towards a full-fledged market economy for computational resources. Spot instances are virtual machines (VMs) available at lower prices than their standard on-demand counterparts. These VMs run for as long as the current spot price stays below the maximum bid price users are willing to pay per hour. Spot instances have been increasingly used for executing compute-intensive applications. Despite this apparent economic advantage, the intermittent nature of biddable resources means that application execution times may be prolonged, or applications may not finish at all. This paper proposes a resource allocation strategy that addresses the problem of running compute-intensive jobs on a pool of intermittent virtual machines, while also aiming to run applications quickly and economically. To mitigate potential unavailability periods, a multifaceted fault-aware resource provisioning policy is proposed. Our solution employs price and runtime estimation mechanisms, as well as three fault tolerance techniques, namely checkpointing, task duplication and migration. We evaluate our strategies using trace-driven simulations, which take as input real price variation traces as well as an application trace from the Parallel Workload Archive. Our results demonstrate the effectiveness of executing applications on spot instances, respecting QoS constraints, despite occasional failures. Comment: 8 pages, 4 figures
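
    As a rough illustration of the bid-based lifecycle this abstract describes (an instance runs only while the spot price stays below the bid, and uncheckpointed work is lost on termination), here is a minimal sketch. It is not the paper's provisioning policy; the hourly-price model, function names, and parameters are all assumptions made for the example.

```python
# Illustrative sketch (not the paper's implementation): simulate a bid-based
# spot instance under a price trace, with periodic checkpointing so that
# completed work survives out-of-bid terminations.

def simulate_spot_run(price_trace, bid, work_hours, checkpoint_every=1):
    """Return (hours_elapsed, hours_billed) to finish `work_hours` of compute.

    price_trace      -- iterable of hourly spot prices (assumed model)
    bid              -- maximum price the user is willing to pay per hour
    checkpoint_every -- hours of progress persisted at each checkpoint
    """
    done = 0.0          # completed work that has been checkpointed
    in_progress = 0.0   # work since the last checkpoint (lost on termination)
    elapsed = billed = 0

    for price in price_trace:
        if done >= work_hours:
            break
        elapsed += 1
        if price <= bid:                  # instance keeps running this hour
            billed += 1
            in_progress += 1
            if in_progress >= checkpoint_every:
                done += in_progress       # persist progress
                in_progress = 0.0
        else:                             # out-of-bid: instance is terminated
            in_progress = 0.0             # uncheckpointed work is lost
    return elapsed, billed

# Example: a volatile trace where the price occasionally exceeds the bid.
trace = [0.03, 0.03, 0.09, 0.02, 0.02, 0.12, 0.03, 0.03, 0.03, 0.03]
print(simulate_spot_run(trace, bid=0.05, work_hours=6))
```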

    Checkpointing algorithms and fault prediction

    This paper deals with the impact of fault prediction techniques on checkpointing strategies. We extend the classical first-order analysis of Young and Daly to the presence of a fault prediction system, characterized by its recall and its precision. In this framework, we provide an optimal algorithm to decide when to take predictions into account, and we derive the optimal value of the checkpointing period. These results allow us to analytically assess the key parameters that impact the performance of fault predictors at very large scale. Comment: Supported in part by ANR Rescue. Published in the Journal of Parallel and Distributed Computing. arXiv admin note: text overlap with arXiv:1207.693
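
    For context, the classical Young/Daly first-order analysis that this paper extends gives an optimal checkpointing period of roughly sqrt(2 · MTBF · C) for a checkpoint cost C. The sketch below computes that baseline; the `period_with_prediction` helper is only an illustrative guess at how a predictor's recall might enter, not the paper's derived optimum.

```python
import math

def young_daly_period(mtbf_s, checkpoint_cost_s):
    """Classical first-order optimal checkpoint period: T ~ sqrt(2 * MTBF * C)."""
    return math.sqrt(2.0 * mtbf_s * checkpoint_cost_s)

def period_with_prediction(mtbf_s, checkpoint_cost_s, recall):
    """Hedged illustration only: if a predictor catches a fraction `recall` of
    faults, periodic checkpoints need only absorb the unpredicted remainder,
    which to first order stretches the period by about 1/sqrt(1 - recall).
    The paper derives the exact optimal value; this is just the intuition."""
    return young_daly_period(mtbf_s / (1.0 - recall), checkpoint_cost_s)

# Example: platform MTBF of one day, checkpoint cost of ten minutes.
mtbf, cost = 86_400.0, 600.0
print(f"Young/Daly period:          {young_daly_period(mtbf, cost) / 3600:.2f} h")
print(f"With a recall-0.8 predictor: {period_with_prediction(mtbf, cost, 0.8) / 3600:.2f} h")
```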

    Doing-it-All with Bounded Work and Communication

    We consider the Do-All problem, where $p$ cooperating processors need to complete $t$ similar and independent tasks in an adversarial setting. Here we deal with a synchronous message passing system with processors that are subject to crash failures. Efficiency of algorithms in this setting is measured in terms of work complexity (also known as total available processor steps) and communication complexity (total number of point-to-point messages). When work and communication are considered to be comparable resources, then the overall efficiency is meaningfully expressed in terms of effort, defined as work + communication. We develop and analyze a constructive algorithm that has work $O(t + p \log p\,(\sqrt{p\log p}+\sqrt{t\log t}))$ and a nonconstructive algorithm that has work $O(t + p \log^2 p)$. The latter result is close to the lower bound $\Omega(t + p \log p / \log\log p)$ on work. The effort of each of these algorithms is proportional to its work when the number of crashes is bounded above by $c\,p$, for some positive constant $c < 1$. We also present a nonconstructive algorithm that has effort $O(t + p^{1.77})$.
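
    Purely as a back-of-the-envelope illustration (not from the paper), the snippet below plugs sample values of $p$ and $t$ into the three work expressions quoted above, with all constant factors ignored, to show how they scale relative to one another.

```python
# Rough numeric comparison of the quoted work bounds, constants ignored,
# to illustrate their growth in p (processors) and t (tasks).
import math

def constructive_work(p, t):
    # O(t + p log p (sqrt(p log p) + sqrt(t log t)))
    return t + p * math.log2(p) * (math.sqrt(p * math.log2(p)) + math.sqrt(t * math.log2(t)))

def nonconstructive_work(p, t):
    # O(t + p log^2 p)
    return t + p * math.log2(p) ** 2

def lower_bound(p, t):
    # Omega(t + p log p / log log p)
    return t + p * math.log2(p) / math.log2(math.log2(p))

for p in (2**10, 2**16):
    t = p  # e.g. one task per processor
    print(p, f"{constructive_work(p, t):.3e}",
          f"{nonconstructive_work(p, t):.3e}", f"{lower_bound(p, t):.3e}")
```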

    Building on Quicksand

    Reliable systems have always been built out of unreliable components. Early on, the reliable components were small, such as mirrored disks or ECC (Error Correcting Codes) in core memory. These systems were designed such that failures of these small components were transparent to the application. Later, the size of the unreliable components grew larger and semantic challenges crept into the application when failures occurred. As the granularity of the unreliable component grows, the latency to communicate with a backup becomes unpalatable. This leads to a more relaxed model for fault tolerance. The primary system will acknowledge the work request and its actions without waiting to ensure that the backup is notified of the work. This improves the responsiveness of the system. There are two implications of asynchronous state capture: 1) Everything promised by the primary is probabilistic. There is always a chance that an untimely failure shortly after the promise results in a backup proceeding without knowledge of the commitment. Hence, nothing is guaranteed! 2) Applications must ensure eventual consistency. Since work may be stuck in the primary after a failure and reappear later, the processing order for work cannot be guaranteed. Platform designers are struggling to make this easier for their applications. Emerging patterns of eventual consistency and probabilistic execution may soon yield a way for applications to express requirements for a "looser" form of consistency while providing availability in the face of ever larger failures. This paper recounts portions of the evolution of these trends, attempts to show the patterns that span these changes, and talks about future directions as we continue to "build on quicksand". Comment: CIDR 200
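
    The following toy sketch (not from the paper) illustrates the asynchronous state-capture point above: a primary that acknowledges work before the backup has been notified can leave the client holding a promise the backup never sees if the primary fails first. All class and method names are invented for the example.

```python
# Toy contrast between synchronous and asynchronous primary/backup replication.
# With asynchronous capture, the primary may acknowledge work that the backup
# never learns about if the primary crashes before shipping it.

class Backup:
    def __init__(self):
        self.log = []
    def replicate(self, work):
        self.log.append(work)

class Primary:
    def __init__(self):
        self.log = []                 # acknowledged work
        self.replication_queue = []   # work not yet sent to the backup

    def accept(self, work, synchronous, backup):
        self.log.append(work)
        if synchronous:
            backup.replicate(work)                 # wait for backup before acking
        else:
            self.replication_queue.append(work)    # ack now, ship later
        return f"ACK {work}"

    def drain(self, backup):
        while self.replication_queue:
            backup.replicate(self.replication_queue.pop(0))

primary, backup = Primary(), Backup()
print(primary.accept("debit $10", synchronous=False, backup=backup))
# Imagine the primary crashes here, before drain() ever runs: the client holds
# an ACK, but the backup's log is empty -- the promise was only probabilistic.
print("backup knows about:", backup.log)
```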