282 research outputs found
Transiency-driven Resource Management for Cloud Computing Platforms
Modern distributed server applications are hosted on enterprise or cloud data centers that provide computing, storage, and networking capabilities to these applications. These applications are built using the implicit assumption that the underlying servers will be stable and normally available, barring occasional faults. In many emerging scenarios, however, data centers and clouds only provide transient, rather than continuous, availability of their servers. Transiency in modern distributed systems arises in many contexts, such as green data centers powered using renewable intermittent sources, and cloud platforms that provide lower-cost transient servers which can be unilaterally revoked by the cloud operator.
Transient computing resources are increasingly important, and existing fault-tolerance and resource management techniques are inadequate for transient servers because applications typically assume continuous resource availability. This thesis presents research in distributed systems design that treats transiency as a first-class design principle. I show that combining transiency-specific fault-tolerance mechanisms with resource management policies to suit application characteristics and requirements can yield significant cost and performance benefits. These mechanisms and policies have been implemented and prototyped as part of software systems, which allow a wide range of applications, such as interactive services and distributed data processing, to be deployed on transient servers, and can reduce cloud computing costs by up to 90%.
This thesis makes contributions to four areas of computer systems research: transiency-specific fault-tolerance, resource allocation, abstractions, and resource reclamation. For reducing the impact of transient server revocations, I develop two fault-tolerance techniques that are tailored to transient server characteristics and application requirements. For interactive applications, I build a derivative cloud platform that masks revocations by transparently moving application state between servers of different types. Similarly, for distributed data processing applications, I investigate the use of application-level periodic checkpointing to reduce the performance impact of server revocations. For managing and reducing the risk of server revocations, I investigate the use of server portfolios that allow transient resource allocation to be tailored to application requirements.
Finally, I investigate how resource providers (such as cloud platforms) can provide transient resource availability without revocation, by looking into alternative resource reclamation techniques. I develop resource deflation, wherein a server's resources are fractionally reclaimed, allowing the application to continue execution albeit with fewer resources. Resource deflation generalizes revocation, and the deflation mechanisms and cluster-wide policies can yield both high cluster utilization and low application performance degradation.
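To make the idea of fractional reclamation concrete, the following is a minimal sketch of one possible deflation policy: each server gives up CPU in proportion to its reclaimable headroom. The proportional rule, the `VM` class, the `min_cpus` floor, and the `deflate` function are all assumptions made for illustration; the abstract does not say which policies the thesis actually uses.

```python
from dataclasses import dataclass


@dataclass
class VM:
    name: str
    cpus: float      # CPU cores currently allocated
    min_cpus: float  # hypothetical floor the application needs to keep running


def deflate(vms, cpus_needed):
    """Reclaim `cpus_needed` cores by shrinking every VM in proportion to its
    headroom (cpus - min_cpus). Returns a dict of new allocations by VM name."""
    if cpus_needed <= 0:
        return {vm.name: vm.cpus for vm in vms}
    headroom = sum(vm.cpus - vm.min_cpus for vm in vms)
    if cpus_needed > headroom:
        # Deflation cannot free enough; a provider would have to revoke servers.
        raise ValueError("not enough reclaimable capacity")
    fraction = cpus_needed / headroom
    return {vm.name: vm.cpus - fraction * (vm.cpus - vm.min_cpus) for vm in vms}


if __name__ == "__main__":
    fleet = [VM("a", cpus=8, min_cpus=2), VM("b", cpus=4, min_cpus=2)]
    print(deflate(fleet, cpus_needed=4))
```

Under this sketch, a VM whose `min_cpus` is 0 and that is asked to give up its entire headroom ends with no resources at all, which is one way to read the claim that deflation generalizes revocation.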
Integrating Scale Out and Fault Tolerance in Stream Processing using Operator State Management
As users of big data applications expect fresh results, we witness a new breed of stream processing systems (SPS) that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges: (i) to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases; (ii) failures are common with deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times, yet low per-machine overheads. An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results. Our key idea is to expose internal operator state explicitly to the SPS through a set of state management primitives. Based on them, we describe an integrated approach for dynamic scale out and recovery of stateful operators. Externalised operator state is checkpointed periodically by the SPS and backed up to upstream VMs. The SPS identifies individual operator bottlenecks and automatically scales them out by allocating new VMs and partitioning the checkpointed state. At any point, failed operators are recovered by restoring checkpointed state on a new VM and replaying unprocessed tuples. We evaluate this approach with the Linear Road Benchmark on the Amazon EC2 cloud platform and show that it can scale automatically to a load factor of L=350 with 50 VMs, while recovering quickly from failures. Copyright © 2013 ACM
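The checkpoint, restore, replay, and partition steps described in this abstract can be sketched with a toy counting operator. This is only an illustration under stated assumptions: the `CountOperator` class, the per-tuple sequence numbers, and the CRC-based key partitioning are hypothetical and are not the primitives defined in the paper.

```python
import zlib
from collections import Counter


class CountOperator:
    """Toy stateful operator: counts tuples per key and exposes its state."""

    def __init__(self):
        self.state = Counter()
        self.last_seq = -1  # sequence number of the last processed tuple

    def process(self, seq, key):
        self.state[key] += 1
        self.last_seq = seq

    def checkpoint(self):
        # Externalised state: a plain copy the system could back up upstream.
        return dict(self.state), self.last_seq

    @classmethod
    def restore(cls, snapshot):
        state, last_seq = snapshot
        op = cls()
        op.state = Counter(state)
        op.last_seq = last_seq
        return op


def recover(snapshot, upstream_buffer):
    """Restore a failed operator, then replay tuples newer than the checkpoint."""
    op = CountOperator.restore(snapshot)
    for seq, key in upstream_buffer:
        if seq > op.last_seq:
            op.process(seq, key)
    return op


def partition(snapshot, n):
    """Split checkpointed state across n new operator instances by key hash."""
    state, last_seq = snapshot
    parts = [dict() for _ in range(n)]
    for key, value in state.items():
        parts[zlib.crc32(key.encode()) % n][key] = value
    return [(part, last_seq) for part in parts]


if __name__ == "__main__":
    stream = [(0, "a"), (1, "b"), (2, "a"), (3, "c")]
    op = CountOperator()
    op.process(*stream[0])
    op.process(*stream[1])
    snap = op.checkpoint()         # taken before tuples 2 and 3 arrive
    print(recover(snap, stream).state)
```

Because replay skips every tuple at or below the checkpointed sequence number, the recovered counts match what an operator that never failed would hold, which is the "without affecting query results" property the abstract asks for.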
A Reliable and Cost-Efficient Auto-Scaling System for Web Applications Using Heterogeneous Spot Instances
Cloud providers sell their idle capacity on markets through an auction-like
mechanism to increase their return on investment. The instances sold in this
way are called spot instances. Although spot instances are usually 90%
cheaper than on-demand instances, they can be terminated by the provider when
their bidding prices are lower than market prices. Thus, they are largely used to
provision fault-tolerant applications only. In this paper, we explore how to
utilize spot instances to provision web applications, which are usually
considered availability-critical. The idea is to take advantage of differences
in price among various types of spot instances to reach both high availability
and significant cost saving. We first propose a fault-tolerant model for web
applications provisioned by spot instances. Based on that, we devise novel
auto-scaling policies for hourly billed cloud markets. We implemented the
proposed model and policies both on a simulation testbed for repeatable
validation and on Amazon EC2. The experiments on the simulation testbed and the
real platform against the benchmarks show that the proposed approach can
greatly reduce resource cost and still achieve satisfactory Quality of Service
(QoS) in terms of response time and availability.
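The termination rule stated above (an instance is lost when its bid is lower than the market price) and the benefit of mixing instance types can be illustrated with a short sketch. The type names, bids, prices, and capacities below are invented for illustration and do not come from the paper or from any real price list.

```python
def surviving_capacity(fleet, market_prices):
    """Sum the capacity of instances whose bid is not below the market price.

    fleet: list of (instance_type, bid, capacity) tuples.
    market_prices: dict mapping instance_type to its current market price.
    """
    return sum(
        capacity
        for instance_type, bid, capacity in fleet
        if bid >= market_prices[instance_type]
    )


if __name__ == "__main__":
    # Hypothetical markets: the price of type_a spikes above every bid.
    prices = {"type_a": 0.15, "type_b": 0.08}

    homogeneous = [("type_a", 0.10, 100)] * 4
    heterogeneous = [("type_a", 0.10, 100)] * 2 + [("type_b", 0.12, 100)] * 2

    print(surviving_capacity(homogeneous, prices))    # whole fleet terminated
    print(surviving_capacity(heterogeneous, prices))  # type_b instances remain
```

In this toy scenario a single price spike removes the entire homogeneous fleet, while the mixed fleet keeps half of its capacity, which is the intuition behind spreading a web application across several spot markets.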
Fail Over Strategy for Fault Tolerance in Cloud Computing Environment
Cloud fault tolerance is an important issue in cloud computing platforms and applications. In the event of an unexpected
system failure or malfunction, a robust fault-tolerant design may allow the cloud to continue functioning correctly,
possibly at a reduced level, instead of failing completely. To ensure high availability of critical cloud services, the
application execution and hardware performance, various fault tolerant techniques exist for building self-autonomous
cloud systems. In comparison to current approaches, this paper proposes a more robust and reliable architecture using
an optimal checkpointing strategy to ensure high system availability and reduced system task service finish time. Using
pass rates and virtualised mechanisms, the proposed Smart Failover Strategy (SFS) scheme uses components such as
Cloud fault manager, Cloud controller, Cloud load balancer and a selection mechanism, providing fault tolerance via
redundancy, optimized selection and checkpointing. In our approach, the Cloud fault manager repairs faults generated
before the task time deadline is reached, blocking unrecoverable faulty nodes as well as their virtual nodes. This scheme
is also able to remove temporary software faults from recoverable faulty nodes, thereby making them available for future
requests. We argue that the proposed SFS algorithm makes the system highly fault tolerant by considering forward and
backward recovery using diverse software tools. Compared to existing approaches, preliminary experiments with the SFS
algorithm indicate an increase in pass rates and a consequent decrease in failure rates, showing an overall good
performance in task allocations. We present these results using experimental validation tools with comparison to other
techniques, laying a foundation for a fully fault tolerant IaaS Cloud environment.
Optimising Fault Tolerance in Real-time Cloud Computing IaaS Environment
Fault tolerance is the ability of a system to respond
swiftly to an unexpected failure. Failures in a cloud computing
environment are normal rather than exceptional, but fault
detection and system recovery in a real time cloud system is a
crucial issue. To deal with this problem and to minimize the risk
of failure, an optimal fault tolerance mechanism was introduced
where fault tolerance was achieved using the combination of the
Cloud Master, Compute nodes, Cloud load balancer, Selection
mechanism and Cloud Fault handler. In this paper, we propose
an optimized fault tolerance approach where a model is designed
to tolerate faults based on the reliability of each compute node
(virtual machine), which can be replaced if its performance is not
optimal. Preliminary tests of our algorithm indicate that the rate
of increase in pass rate exceeds the decrease in failure rate and it
also considers forward and backward recovery using diverse
software tools. Our results obtained are demonstrated through
experimental validation thereby laying a foundation for a fully
fault tolerant IaaS Cloud environment, which suggests a good
performance of our model compared to current existing
approaches.
Petroleum Technology Development Fund (PTDF)
Extended Fault Taxonomy of SOA-Based Systems
Service Oriented Architecture (SOA) is considered a standard for enterprise software development. The main characteristics of SOA are dynamic discovery and composition of software services in a heterogeneous environment. These properties pose newer challenges in fault management of SOA-based systems (SBS). A proper understanding of different faults in an SBS is necessary for effective fault handling. A comprehensive three-fold fault taxonomy is presented here that covers distributed, SOA-specific and non-functional faults in a holistic manner. A comprehensive fault taxonomy is a key starting point for providing techniques and methods for assessing the quality of a given system. In this paper, an attempt has been made to organize several SBS faults into a well-structured taxonomy that may assist developers in planning suitable fault repair strategies. Some commonly emphasized fault recovery strategies are also discussed. Some challenges that may occur during fault handling of SBSs are also mentioned.
System Support for Managing Risk in Cloud Computing Platforms
Cloud platforms sell computing to applications for a price. However, by precisely defining and controlling the service-level characteristics of cloud servers, they expose applications to a number of implicit risks throughout the application’s lifecycle. For example, a user’s request for a server may be denied, leading to rejection risk; an allocated resource may be withdrawn, resulting in revocation risk; an acquired cloud server’s price may rise relative to others, causing price risk; a cloud server’s performance may vary due to external factors, triggering valuation risk. Though these risks are implicit, the costs they impose on the applications are not.
While some risks exist in all Infrastructure-as-a-Service offerings, they are most pronounced in an emerging category called transient cloud servers. Since transient servers are carved out of instantaneous idle cloud capacity, they exhibit two distinct features: (i) revocations that are intentional, frequent and come with advance warning, and (ii) prices that are low on average but vary across time and location. Thus, despite enabling inexpensive access to at-scale computing, transient cloud servers expose applications to risks on a scale unseen in past platforms. Unfortunately, current-generation system software is not designed to handle these risks, which in turn results in inconsistent performance, unexpected failures, missed savings, and slower adoption.
In this dissertation, we elevate risk management to a first-class system design principle. Our goal is to identify the risks, quantify their costs, and explicitly manage them for applications deployed on cloud platforms. Towards that goal, we adapt and extend concepts from finance and economics to propose a new system design approach called financializing cloud computing. By treating cloud resources as investments, and by quantifying the cost of their risks, financialization enables system software to manage the risk-reward trade-offs, explicitly and autonomously.
We demonstrate the utility of our approach via four contributions: (i) mitigating revocation risk with an insurance policy, (ii) reducing price risk through active trading, (iii) eliminating uncertainty risk by index tracking, and (iv) minimizing a server’s valuation risk via asset pricing. We conclude by observing that diversity and asymmetry in the creation and consumption of cloud compute resources are on the rise, and that financialization can be effectively employed to manage their complexity and risks.