183 research outputs found

    Autonomic management of virtualized resources in cloud computing

    The last five years have witnessed a rapid growth of cloud computing in business, governmental and educational IT deployment. The success of cloud services depends critically on the effective management of virtualized resources. A key requirement of cloud management is the ability to dynamically match resource allocations to actual demands. To this end, we aim to design and implement a cloud resource management mechanism that manages underlying complexity, automates resource provisioning and controls client-perceived quality of service (QoS) while still achieving resource efficiency. The design of an automatic resource management mechanism centers on two questions: when to adjust resource allocations and how much to adjust. In a cloud, applications have different definitions of capacity, and cloud dynamics make it difficult to determine a static resource-to-performance relationship. In this dissertation, we have proposed a generic metric that measures application capacity, designed model-independent and adaptive approaches to manage resources, and built a cloud management system scalable to a cluster of machines. To understand web system capacity, we propose a productivity index (PI), defined as the ratio of yield to cost, to measure the system processing capability online. PI is a generic concept that can be applied at different levels to monitor system progress and identify whether more capacity is needed. We applied the concept of PI to the problem of overload prevention in multi-tier websites. The overload predictor built on the PI metric shows more accurate and responsive overload prevention than conventional approaches. To address the lack of an accurate server model, we propose a model-independent fuzzy-control-based approach for CPU allocation. For adaptive and stable control performance, we embed the controller with self-tuning output amplification and flexible rule selection.
We then build a QoS provisioning framework that supports multi-objective QoS control and service differentiation. Experiments on a virtual cluster with two service classes show the effectiveness of our approach in both performance and power control. To address the problems of complex interplay between resources and process delays in fine-grained multi-resource allocation, we consider capacity management as a decision-making problem and employ reinforcement learning (RL) to optimize the process. The optimization depends on trial-and-error interactions with the cloud system. To improve the initial management performance, we propose a model-based RL algorithm. The neural-network-based environment model, which is learned from previous management history, generates simulated resource allocations for the RL agent. Experiment results on heterogeneous applications show that our approach makes efficient use of limited interactions and finds near-optimal resource configurations within 7 steps. Finally, we present a distributed reinforcement learning approach to cluster-wide cloud resource management. We decompose the cluster-wide resource allocation problem into sub-problems concerning individual VM resource configurations. The cluster-wide allocation is optimized if individual VMs meet their SLA with a high resource utilization. For scalability, we develop an efficient reinforcement learning approach with continuous state space. For adaptability, we use VM low-level runtime statistics to accommodate workload dynamics. Prototyped in an iBalloon system, the distributed learning approach successfully manages 128 VMs on a 16-node closely correlated cluster.

    Self-Adaptive Provisioning of Virtualized Resources in Cloud Computing

    Although cloud computing has gained considerable popularity recently, there are still some key impediments to enterprise adoption. Cloud management is one of the top challenges. The ability to partition hardware resources on the fly into virtual machine (VM) instances provides users with an elastic computing environment. But the extra layer of resource virtualization poses challenges for effective cloud management. Time-varying user demand, the complicated interplay between co-hosted VMs and the arbitrary deployment of multi-tier applications make it difficult for administrators to plan good VM configurations. In this paper, we propose a distributed learning mechanism that facilitates self-adaptive virtual machine resource provisioning. We treat cloud resource allocation as a distributed learning task, in which each VM, being a highly autonomous agent, submits resource requests according to its own benefit. The mechanism evaluates the requests and replies with feedback. We develop a reinforcement learning algorithm with a highly efficient representation of experiences as the heart of the VM-side learning engine. We prototype the mechanism and the distributed learning algorithm in an iBalloon system. Experiment results on a Xen-based cloud testbed demonstrate the effectiveness of iBalloon. The distributed VM agents are able to reach near-optimal configuration decisions in 7 iteration steps at no more than 5% performance cost. Most importantly, iBalloon shows good scalability on resource allocation by scaling to 128 correlated VMs.

    Optimizing Cloud-Service Performance: Efficient Resource Provisioning Via Optimal Workload Allocation

    Cloud computing is being widely accepted and utilized in the business world. From the perspective of businesses utilizing the cloud, it is critical to meet their customers' requirements by achieving service-level objectives. Hence, the ability to accurately characterize and optimize cloud-service performance is of great importance. In this dissertation, a stochastic multi-tenant framework is proposed to model the service of customer requests in a cloud infrastructure composed of heterogeneous virtual machines (VMs). The proposed framework addresses the critical concepts and characteristics in the cloud, including virtualization, multi-tenancy, heterogeneity of VMs, VM isolation for the purpose of security and/or performance guarantee, and the stochastic response time of a customer request. Two cloud-service performance metrics are mathematically characterized, namely the percentile of the stochastic response time and the mean of the stochastic response time of a customer request. Based upon the proposed multi-tenant framework, a workload-allocation algorithm, termed the max-min-cloud algorithm, is then devised to optimize the performance of the cloud service. A rigorous optimality proof of the max-min-cloud algorithm is given for the case in which the stochastic response time of a customer request is assumed to be exponentially distributed. Furthermore, extensive Monte Carlo simulations are conducted to validate the optimality of the max-min-cloud algorithm by comparing it with two other workload-allocation algorithms under various scenarios. Next, the resource provisioning problem in the cloud is studied in light of the max-min-cloud algorithm. In particular, an efficient resource-provisioning strategy, termed the MPC strategy, is proposed for serving dynamically arriving customer requests. The efficacy of the MPC strategy is verified through two practical cases in which the arrival of the customer requests is predictable and unpredictable, respectively.
As an extension of the max-min-cloud algorithm, we further devise the max-load-first algorithm to deal with the VM placement problem in the cloud. Monte Carlo (MC) simulation results show that the max-load-first VM-placement algorithm outperforms the other two heuristic algorithms in terms of reducing the mean of the stochastic completion time of a group of arbitrary customers' requests. Simulation results also provide insight into how the initial loads of servers affect the performance of the cloud system. In summary, the findings in this dissertation work can be of great benefit to both service providers (namely business owners) and cloud providers. For business owners, the max-min-cloud workload-allocation algorithm and the MPC resource-provisioning strategy together can be used to help them build a better understanding of how many virtual resources in the cloud they may need to meet customers' expectations subject to cost constraints. For cloud providers, the max-load-first VM-placement algorithm can be used to optimize the computational performance of the service by appropriately utilizing the physical machines and efficiently placing the VMs in their cloud infrastructures.

    Evaluating and Enabling Scalable High Performance Computing Workloads on Commercial Clouds

    Performance, usability, and accessibility are critical components of high performance computing (HPC). Usability and performance are especially important to academic researchers as they generally have little time to learn a new technology and demand a certain type of performance in order to ensure the quality and quantity of their research results. We have observed that while not all workloads run well in the cloud, some workloads perform well. We have also observed that although commercial cloud adoption by industry has been growing at a rapid pace, its use by academic researchers has not grown as quickly. We aim to help close this gap and enable researchers to utilize the commercial cloud more efficiently and effectively. We present our results on architecting and benchmarking an HPC environment on Amazon Web Services (AWS), where we observe that there are particular types of applications that are and are not suited for the commercial cloud. Then, we present our results on architecting and building a provisioning and workflow management tool (PAW), an application that enables a user to launch an HPC environment in the cloud, execute a customizable workflow, and, after the workflow has completed, delete the HPC environment automatically. We then present our results on the scalability of PAW and the commercial cloud for compute-intensive workloads by deploying a 1.1 million vCPU cluster. We then discuss our research into the feasibility of utilizing commercial cloud infrastructure to help tackle the large spikes and data-intensive characteristics of Transportation Cyberphysical Systems (TCPS) workloads. Then, we present our research in utilizing the commercial cloud for urgent HPC applications by deploying a 1.5 million vCPU cluster to process 211TB of traffic video data to be utilized by first responders during an evacuation situation. Lastly, we present the contributions and conclusions drawn from this work.

    Virtualization in the Private Cloud: State of the Practice

    Virtualization has become a mainstream technology that allows efficient and safe resource sharing in data centers. In this paper, we present a large-scale workload characterization study of 90K virtual machines hosted on 8K physical servers, across several geographically distributed corporate data centers of a major service provider. The study covers 19 days of operation and focuses on the state of the practice, i.e., how virtual machines are deployed across different physical resources, with an emphasis on processors and memory. It examines resource sharing and usage of physical resources, virtual machine life cycles, and migration patterns and their frequencies. This paper illustrates that there is indeed a strong tendency toward over-provisioning CPU and memory resources, while certain virtualization features (e.g., migration and collocation) are used rather conservatively, showing that there is significant room for the development of policies that aim to reduce operational costs in data centers.

    QoS-aware fine-grained power management in networked computing systems

    Power is a major design concern of today's networked computing systems, from low-power battery-powered mobile and embedded systems to high-power enterprise servers. Embedded systems are required to be power efficient because most of them are powered by batteries with limited capacity. A similar concern about power expenditure arises in enterprise server environments due to cooling requirements, power delivery limits, electricity costs and environmental pollution. Power consumption in networked computing systems includes that of the circuit board and that of communication. In the context of networked real-time systems, the power dissipation of wireless communication is more significant than that of the circuit board. We focus on packet scheduling for wireless real-time systems with renewable energy resources. In such a scenario, data with a higher level of importance must be transmitted periodically. We formulate this packet scheduling problem as an NP-hard reward maximization problem with time and energy constraints. An optimal solution with pseudo-polynomial time complexity is presented. In addition, we propose a sub-optimal solution with polynomial time complexity. Circuit board power consumption, especially that of the processor, is still the major source of system power consumption. We provide a general-purpose, practical and comprehensive power management middleware for networked computing systems to manage circuit board power consumption and thereby affect system-level power consumption. It provides power and performance monitoring, power management (PM) policy selection and PM control, as well as energy efficiency analysis. This middleware includes an extensible PM policy library. We implemented a prototype of this middleware on Base Band Units (BBUs) with three PM policies enclosed. These policies have been validated on different platforms, such as enterprise servers, virtual environments and BBUs.
In enterprise environments, the power dissipation on the circuit board dominates. Regulating on-board computing resources has a significant impact on power consumption. Dynamic Voltage and Frequency Scaling (DVFS) is an effective technique to conserve energy. We investigate system-level power management in order to avoid system failures due to power capacity overload or overheating. This management needs to control the power consumption in an accurate and responsive manner, which cannot be achieved by the existing black-box feedback control. Thus we present a model-predictive feedback controller to regulate processor frequency so that the power budget can be satisfied without significant loss of performance. In addition to providing a power guarantee, performance with respect to service-level agreements (SLAs) must be guaranteed as well. The proliferation of virtualization technology imposes new challenges on power management due to resource sharing. It is hard to optimize both power and performance on shared infrastructures due to system dynamics. We propose vPnP, a feedback-control-based coordination approach that provides guarantees on application-level performance and underlying physical host power consumption in virtualized environments. This system can adapt gracefully to workload change. The preliminary results show its flexibility in achieving different levels of tradeoff between power and performance as well as its robustness over a variety of workloads. It is desirable to improve the energy efficiency of systems, such as BBUs, that host soft real-time applications. We propose a power management strategy for controlling delay and minimizing power consumption using DVFS. We use the Robbins-Monro (RM) stochastic approximation method to estimate the delay quantile. We couple a fuzzy controller with the RM algorithm to scale CPU frequency so as to maintain performance within the specified QoS.
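
The abstract above mentions Robbins-Monro stochastic approximation for estimating a delay quantile. A minimal sketch of the generic textbook recursion is given below; it is not taken from the dissertation, and the function name `rm_quantile`, its parameters, and the step-size schedule `step / n**gamma` are illustrative assumptions. The fuzzy controller and DVFS actuation are not shown.

```python
import random


def rm_quantile(samples, p, q0=0.0, step=1.0, gamma=0.7):
    """Estimate the p-quantile of a stream with a Robbins-Monro recursion.

    Hypothetical helper (not from the source). Update rule:
        q <- q + a_n * (p - 1[x_n <= q]),  a_n = step / n**gamma
    with gamma in (0.5, 1] so that sum(a_n) diverges and sum(a_n**2) converges.
    """
    q = q0
    for n, x in enumerate(samples, start=1):
        indicator = 1.0 if x <= q else 0.0   # did the sample fall at or below the estimate?
        q += (step / n ** gamma) * (p - indicator)
    return q


if __name__ == "__main__":
    # Synthetic "delays" drawn uniformly from [0, 1); the true 0.9-quantile is 0.9.
    rng = random.Random(0)
    delays = (rng.random() for _ in range(20000))
    print(rm_quantile(delays, p=0.9))
```

The estimate is updated one sample at a time and needs no stored history, which is the usual reason to prefer this kind of recursion for online monitoring.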

    Autonomic Overload Management For Large-Scale Virtualized Network Functions

    The explosion of data traffic in telecommunication networks has been impressive in the last few years. To keep up with the high demand and stay profitable, Telcos are embracing the Network Function Virtualization (NFV) paradigm by shifting from hardware network appliances to software virtual network functions, which are expected to support extremely large-scale architectures, providing both high performance and high reliability. The main objective of this dissertation is to provide frameworks and techniques to enable proper overload detection and mitigation for the emerging virtualized software-based network services. The thesis contribution is threefold. First, it proposes a novel approach to quickly detect performance anomalies in complex and large-scale VNF services. Second, it presents NFV-Throttle, an autonomic overload control framework that protects NFV services from overload within a short period of time, preserving the QoS of traffic flows admitted by network services in response to both traffic spikes (up to 10x the available capacity) and capacity reduction due to infrastructure problems (such as CPU contention). Third, it proposes DRACO to manage overload problems arising in novel large-scale multi-tier applications, such as complex stateful network functions in which the state is spread across modern key-value stores to achieve both scalability and performance. DRACO performs fine-grained admission control by tuning the amount and type of traffic according to datastore node dependencies among the tiers (which are dynamically discovered at run-time) and to the current capacity of individual nodes, in order to mitigate overloads and prevent hot spots.
This thesis presents the implementation details and an extensive experimental evaluation for all the above overload management solutions, by means of a virtualized IP Multimedia Subsystem (IMS), which provides modern multimedia services for Telco operators, such as Videoconferencing and VoLTE, and which is one of the top use-cases of the NFV technology

    Cooperative Resource Management in a IaaS

    Virtualized IaaS generally rely on a server consolidation system to pack virtual machines (VMs) on as few servers as possible, for energy saving. However, two situations are not taken into account and could enhance consolidation. First, since the managed VMs can be of various sizes (small, medium, large, etc.), VM packing can be obstructed when sizes don't fit available spaces on servers. Therefore, we would need to "split" such VMs. Second, two VMs which host replicas of the same application server (for scalability) could be "fusioned" when they are located on the same physical server, in order to reduce virtualization overhead and VM memory footprint. Split and fusion operations lead to the management of elastic VMs and require cooperation between the application level and the provider level, as they impact management at both levels. In this paper, we propose an IaaS resource management system which implements elastic VMs based on split/fusion operations and cooperative management. We show its benefit with a set of experiments.

    Autonomic Management And Performance Optimization For Cloud Computing Services

    Cloud computing has become an increasingly important computing paradigm. It offers three levels of on-demand services to cloud users: software as a service (SaaS), platform as a service (PaaS), and infrastructure as a service (IaaS). The success of cloud services heavily depends on the effectiveness of cloud management strategies. In this dissertation work, we aim to design and implement an automatic cloud management system to improve application performance, increase platform efficiency and optimize resource allocation. For large-scale multi-component applications, especially web-based cloud applications, parameter setting is crucial to service availability and quality. The increasing system complexity requires an automatic and efficient application configuration strategy. To improve the quality of application services, we propose a reinforcement learning (RL)-based autonomic configuration framework. It is able to adapt application parameter settings not only to variations in workload, but also to changes in virtual resource allocation. The RL approach is enhanced with an efficient initialization policy to reduce the learning time for online decisions. Experiments on a Xen-based virtual cluster with TPC-W benchmarks show that the framework can drive applications into an optimal configuration in less than 25 iterations. For cloud platform services, one of the key challenges is to efficiently adapt the offered platforms to the virtualized environment while maintaining their service features. MapReduce has become an important distributed parallel programming paradigm. Offering MapReduce as a cloud service presents an attractive usage model for enterprises. In a virtual MapReduce cluster, the interference between virtual machines (VMs) causes performance degradation of map and reduce tasks and renders existing data locality-aware task scheduling policies, such as delay scheduling, no longer effective.
On the other hand, virtualization offers an extra opportunity of data locality for co-hosted VMs. To address these issues, we present a task scheduling strategy that mitigates interference while preserving task data locality for MapReduce applications. The strategy includes an interference-aware scheduling policy, based on a task performance prediction model, and an adaptive delay scheduling algorithm for data locality improvement. Experimental results on a 72-node Xen-based virtual cluster show that the scheduler is able to achieve a speedup of 1.5 to 6.5 times for individual jobs and yield an improvement of up to 1.9 times in system throughput in comparison with four other MapReduce schedulers. A key requirement of cloud computing is resource configuration in real time. In such virtualized environments, both virtual machines (VMs) and hosted applications need to be configured on the fly to adapt to system dynamics. The interplay between the layers of VMs and applications further complicates the problem of cloud configuration. Independent tuning of each aspect may not lead to optimal system-wide performance. In this work, we propose a framework for coordinated configuration of VMs and resident applications. At the heart of the framework is a model-free hybrid reinforcement learning (RL) approach, which combines the advantages of the Simplex method and the RL method and is further enhanced by the use of system-knowledge-guided exploration policies. Experimental results on Xen-based virtualized environments with TPC-W and TPC-C benchmarks demonstrate that the framework is able to drive a virtual server cluster into an optimal or near-optimal configuration state on the fly, in response to changes in workload. It improves the system throughput by more than 30% over independent tuning strategies. In comparison with coordinated tuning strategies based on the basic RL or Simplex algorithm, the hybrid RL algorithm gains 25% to 40% throughput improvement.

    A Reliable and Cost-Efficient Auto-Scaling System for Web Applications Using Heterogeneous Spot Instances

    Cloud providers sell their idle capacity on markets through an auction-like mechanism to increase their return on investment. The instances sold in this way are called spot instances. Although spot instances are usually 90% cheaper than on-demand instances, they can be terminated by the provider when their bidding prices are lower than market prices. Thus, they are largely used to provision fault-tolerant applications only. In this paper, we explore how to utilize spot instances to provision web applications, which are usually considered availability-critical. The idea is to take advantage of differences in price among various types of spot instances to reach both high availability and significant cost saving. We first propose a fault-tolerant model for web applications provisioned by spot instances. Based on that, we devise novel auto-scaling policies for hourly billed cloud markets. We implemented the proposed model and policies both on a simulation testbed for repeatable validation and on Amazon EC2. The experiments on the simulation testbed and the real platform against the benchmarks show that the proposed approach can greatly reduce resource cost and still achieve satisfactory Quality of Service (QoS) in terms of response time and availability.