1,052 research outputs found

    A Reliable and Cost-Efficient Auto-Scaling System for Web Applications Using Heterogeneous Spot Instances

    Full text link
    Cloud providers sell their idle capacity on markets through an auction-like mechanism to increase their return on investment. The instances sold in this way are called spot instances. In spite that spot instances are usually 90% cheaper than on-demand instances, they can be terminated by provider when their bidding prices are lower than market prices. Thus, they are largely used to provision fault-tolerant applications only. In this paper, we explore how to utilize spot instances to provision web applications, which are usually considered availability-critical. The idea is to take advantage of differences in price among various types of spot instances to reach both high availability and significant cost saving. We first propose a fault-tolerant model for web applications provisioned by spot instances. Based on that, we devise novel auto-scaling polices for hourly billed cloud markets. We implemented the proposed model and policies both on a simulation testbed for repeatable validation and Amazon EC2. The experiments on the simulation testbed and the real platform against the benchmarks show that the proposed approach can greatly reduce resource cost and still achieve satisfactory Quality of Service (QoS) in terms of response time and availability

    Proactive Fault Tolerance Through Cloud Failure Prediction Using Machine Learning

    Get PDF
    One of the crucial aspects of cloud infrastructure is fault tolerance, and its primary responsibility is to address the situations that arise when different architectural parts fail. A sizeable cloud data center must deliver high service dependability and availability while minimizing failure incidence. However, modern large cloud data centers continue to have significant failure rates owing to a variety of factors, including hardware and software faults, which often lead to task and job failures. To reduce unexpected loss, it is critical to forecast task or job failures with high accuracy before they occur. This research examines the performance of four machine learning (ML) algorithms for forecasting failure in a real-time cloud environment to increase system availability using real-time data gathered from the Google Cluster Workload Traces 2019. We applied four distinct supervised machine learning algorithms are logistic regression, KNN, SVM, decision tree, and logistic regression classifiers. Confusion matrices as well as ROC curves were used to assess the reliability and robustness of each algorithm. This study will assist cloud service providers developing a robust fault tolerance design by optimizing device selection, consequently boosting system availability and eliminating unexpected system downtime

    Service Quality Assessment for Cloud-based Distributed Data Services

    Full text link
    The issue of less-than-100% reliability and trust-worthiness of third-party controlled cloud components (e.g., IaaS and SaaS components from different vendors) may lead to laxity in the QoS guarantees offered by a service-support system S to various applications. An example of S is a replicated data service to handle customer queries with fault-tolerance and performance goals. QoS laxity (i.e., SLA violations) may be inadvertent: say, due to the inability of system designers to model the impact of sub-system behaviors onto a deliverable QoS. Sometimes, QoS laxity may even be intentional: say, to reap revenue-oriented benefits by cheating on resource allocations and/or excessive statistical-sharing of system resources (e.g., VM cycles, number of servers). Our goal is to assess how well the internal mechanisms of S are geared to offer a required level of service to the applications. We use computational models of S to determine the optimal feasible resource schedules and verify how close is the actual system behavior to a model-computed \u27gold-standard\u27. Our QoS assessment methods allow comparing different service vendors (possibly with different business policies) in terms of canonical properties: such as elasticity, linearity, isolation, and fairness (analogical to a comparative rating of restaurants). Case studies of cloud-based distributed applications are described to illustrate our QoS assessment methods. Specific systems studied in the thesis are: i) replicated data services where the servers may be hosted on multiple data-centers for fault-tolerance and performance reasons; and ii) content delivery networks to geographically distributed clients where the content data caches may reside on different data-centers. The methods studied in the thesis are useful in various contexts of QoS management and self-configurations in large-scale cloud-based distributed systems that are inherently complex due to size, diversity, and environment dynamicity

    Resource Provisioning Exploiting Cost and Performance Diversity within IaaS Cloud Providers

    Get PDF
    IaaS platforms such as Amazon EC2 allow clients access to massive computational power in the form of instances. Amazon hosts three different instance purchasing options, each with its own SLA covering pricing and availability. Amazon also offers access to a number of geographical regions, zones, and instance types to select from. In this thesis, the problem of utilizing Spot and On-Demand instances is analyzed and two approaches are presented in order to exploit the cost and performance diversity among different instance types and availability zones, and among the Spot markets they represent. We first develop RAMP, a framework designed to calculate the expected profit of using a specific Spot or On-Demand instance through an evaluation of instance reliability. RAMP is extended to develop RAMC-DC, a framework designed to allocate the most cost effective instance through strategies that facilitate interchangeability of instances among short jobs, reliability of instances among long jobs, and a comparison of the estimated costs of possible allocations. RAMC-DC achieves fault tolerance through comparisons of the price dynamics across instance types and availability zones, and through an examination of three basic checkpointing methods. Evaluations demonstrate that both frameworks take a large step toward low-volatility, high cost-efficiency resource provisioning. While achieving early-termination rates as low as 2.2%, RAMP can completely offset the total cost when charging the user just 17.5% of the On-Demand price. Moreover, the increases in profit resulting from relatively small additional charges to users are notably high, i.e., 100% profit compared to the resource provisioning cost with 35% of the equivalent On-Demand price. RAMC-DC can maintain deadline breaches below 1.8% of all jobs, achieve both early-termination and deadline breach rates as low as 0.5% of all jobs, and lowers total costs by between 80% and 87% compared to using only On-Demand instances

    Resource Provisioning Exploiting Cost and Performance Diversity within IaaS Cloud Providers

    Get PDF
    IaaS platforms such as Amazon EC2 allow clients access to massive computational power in the form of instances. Amazon hosts three different instance purchasing options, each with its own SLA covering pricing and availability. Amazon also offers access to a number of geographical regions, zones, and instance types to select from. In this thesis, the problem of utilizing Spot and On-Demand instances is analyzed and two approaches are presented in order to exploit the cost and performance diversity among different instance types and availability zones, and among the Spot markets they represent. We first develop RAMP, a framework designed to calculate the expected profit of using a specific Spot or On-Demand instance through an evaluation of instance reliability. RAMP is extended to develop RAMC-DC, a framework designed to allocate the most cost effective instance through strategies that facilitate interchangeability of instances among short jobs, reliability of instances among long jobs, and a comparison of the estimated costs of possible allocations. RAMC-DC achieves fault tolerance through comparisons of the price dynamics across instance types and availability zones, and through an examination of three basic checkpointing methods. Evaluations demonstrate that both frameworks take a large step toward low-volatility, high cost-efficiency resource provisioning. While achieving early-termination rates as low as 2.2%, RAMP can completely offset the total cost when charging the user just 17.5% of the On-Demand price. Moreover, the increases in profit resulting from relatively small additional charges to users are notably high, i.e., 100% profit compared to the resource provisioning cost with 35% of the equivalent On-Demand price. RAMC-DC can maintain deadline breaches below 1.8% of all jobs, achieve both early-termination and deadline breach rates as low as 0.5% of all jobs, and lowers total costs by between 80% and 87% compared to using only On-Demand instances

    Reliable and energy efficient resource provisioning in cloud computing systems

    Get PDF
    Cloud Computing has revolutionized the Information Technology sector by giving computing a perspective of service. The services of cloud computing can be accessed by users not knowing about the underlying system with easy-to-use portals. To provide such an abstract view, cloud computing systems have to perform many complex operations besides managing a large underlying infrastructure. Such complex operations confront service providers with many challenges such as security, sustainability, reliability, energy consumption and resource management. Among all the challenges, reliability and energy consumption are two key challenges focused on in this thesis because of their conflicting nature. Current solutions either focused on reliability techniques or energy efficiency methods. But it has been observed that mechanisms providing reliability in cloud computing systems can deteriorate the energy consumption. Adding backup resources and running replicated systems provide strong fault tolerance but also increase energy consumption. Reducing energy consumption by running resources on low power scaling levels or by reducing the number of active but idle sitting resources such as backup resources reduces the system reliability. This creates a critical trade-off between these two metrics that are investigated in this thesis. To address this problem, this thesis presents novel resource management policies which target the provisioning of best resources in terms of reliability and energy efficiency and allocate them to suitable virtual machines. A mathematical framework showing interplay between reliability and energy consumption is also proposed in this thesis. A formal method to calculate the finishing time of tasks running in a cloud computing environment impacted with independent and correlated failures is also provided. The proposed policies adopted various fault tolerance mechanisms while satisfying the constraints such as task deadlines and utility values. This thesis also provides a novel failure-aware VM consolidation method, which takes the failure characteristics of resources into consideration before performing VM consolidation. All the proposed resource management methods are evaluated by using real failure traces collected from various distributed computing sites. In order to perform the evaluation, a cloud computing framework, 'ReliableCloudSim' capable of simulating failure-prone cloud computing systems is developed. The key research findings and contributions of this thesis are: 1. If the emphasis is given only to energy optimization without considering reliability in a failure prone cloud computing environment, the results can be contrary to the intuitive expectations. Rather than reducing energy consumption, a system ends up consuming more energy due to the energy losses incurred because of failure overheads. 2. While performing VM consolidation in a failure prone cloud computing environment, a significant improvement in terms of energy efficiency and reliability can be achieved by considering failure characteristics of physical resources. 3. By considering correlated occurrence of failures during resource provisioning and VM allocation, the service downtime or interruption is reduced significantly by 34% in comparison to the environments with the assumption of independent occurrence of failures. Moreover, measured by our mathematical model, the ratio of reliability and energy consumption is improved by 14%
    • …
    corecore