158 research outputs found

    Optimizing Virtual Resource Management in Cloud Datacenters

    Get PDF
    Datacenter clouds (e.g., Microsoft\u27s Azure, Google\u27s App Engine, and Amazon\u27s EC2) are emerging as a popular infrastructure for computing and storage due to their high scalability and elasticity. More and more companies and organizations shift their services (e.g., online social networks, Dropbox file hosting) to clouds to avoid large capital expenditures. Cloud systems employ virtualization technology to provide resources in physical machines (PMs) in the form of virtual machines (VMs). Users create VMs deployed on the cloud and each VM consumes resources (e.g., CPU, memory and bandwidth) from its host PM. Cloud providers supply services by signing Service Level Agreement (SLA) with cloud customers that serves as both the blueprint and the warranty for cloud computing. Under-provisioning of resources leads to SLA violations while over-provisioning of resources leads to resource underutilization and then revenue decrease for the cloud providers. Thus, a formidable challenge is effective management of virtual resource to maximize energy efficiency and resource utilization while satisfying the SLA. This proposal is devoted to tackle this challenge by addressing three fundamental and essential issues: i) initial VM allocation, ii) VM migration for load balance, and iii) proactive VM migration for long-term load balance. Accordingly, this proposal consists of three innovative components: (1) Initial Complementary VM Consolidation. Previous resource provisioning strategies either allocate physical resources to virtual machines (VMs) based on static VM resource demands or dynamically handle the variations in VM resource requirements through live VM migrations. However, the former fail to maximize energy efficiency and resource utilization while the latter produce high migration overhead. To handle these problems, we propose an initial VM allocation mechanism that consolidates complementary VMs with spatial/temporal-awareness. Complementary VMs are the VMs whose total demand of each resource dimension (in the spatial space) nearly reaches their host\u27s capacity during VM lifetime period (in the temporal space). Based on our observation of the existence of VM resource utilization patterns, the mechanism predicts the lifetime resource utilization patterns of short-term VMs or periodical resource utilization patterns of long-term VMs. Based on the predicted patterns, it coordinates the requirements of different resources and consolidates complementary VMs in the same physical machine (PM). This mechanism reduces the number of PMs needed to provide VM service hence increases energy efficiency and resource utilization and also reduces the number of VM migrations and SLA violations. (2) Resource Intensity Aware VM Migration for Load Balance. The unique features of clouds pose formidable challenges to achieving effective and efficient load balancing. First, VMs in clouds use different resources (e.g., CPU, bandwidth, memory) to serve a variety of services (e.g., high performance computing, web services, file services), resulting in different overutilized resources in different PMs. Also, the overutilized resources in a PM may vary over time due to the time-varying heterogenous service requests. Second, there is intensive network communication between VMs. However, previous load balancing methods statically assign equal or predefined weights to different resources, which leads to degraded performance in terms of speed and cost to achieve load balance. Also, they do not strive to minimize the VM communications between PMs. This proposed mechanism dynamically assigns different weights to different resources according to their usage intensity in the PM, which significantly reduces the time and cost to achieve load balance and avoids future load imbalance. It also tries to keep frequently communicating VMs in the same PM to reduce bandwidth cost, and migrate VMs to PMs with minimum VM performance degradation. (3) Proactive VM Migration for Long-Term Load Balance. Previous reactive load balancing algorithms migrate VMs upon the occurrence of load imbalance, while previous proactive load balancing algorithms predict PM overload to conduct VM migration. However, both methods cannot maintain long-term load balance and produce high overhead and delay due to migration VM selection and destination PM selection. To overcome these problems, we propose a proactive Markov Decision Process (MDP)-based load balancing algorithm. We handle the challenges of allying MDP in virtual resource management in cloud datacenters, which allows a PM to proactively find an optimal action to transit to a lightly loaded state that will maintain for a longer period of time. We also apply the MDP to determine destination PMs to achieve long-term PM load balance state. Our algorithm reduces the numbers of SLA violations by long-term load balance maintenance, and also reduces the load balancing overhead (e.g., CPU time, energy) and delay by quickly identifying VMs and destination PMs to migrate. Finally, we conducted extensive experiments to evaluate the proposed three mechanisms. i) We conducted simulation experiments based on two real traces and real-world testbed experiments to show that the initial complementary VM consolidation mechanism significantly reduces the number of PMs used, SLA violations and VM migrations of the previous resource provisioning strategies. ii) We conducted trace-driven simulation and real-world testbed experiments to show that RIAL outperforms other load balancing approaches in regards to the number of VM migrations, VM performance degradation and VM communication cost. iii) We conducted trace-driven experiments to show that the MDP-based load balancing algorithm outperforms previous reactive and proactive load balancing algorithms in terms of SLA violation, load balancing efficiency and long-term load balance maintenance

    Towards An Efficient Cloud Computing System: Data Management, Resource Allocation and Job Scheduling

    Get PDF
    Cloud computing is an emerging technology in distributed computing, and it has proved to be an effective infrastructure to provide services to users. Cloud is developing day by day and faces many challenges. One of challenges is to build cost-effective data management system that can ensure high data availability while maintaining consistency. Another challenge in cloud is efficient resource allocation which ensures high resource utilization and high SLO availability. Scheduling, referring to a set of policies to control the order of the work to be performed by a computer system, for high throughput is another challenge. In this dissertation, we study how to manage data and improve data availability while reducing cost (i.e., consistency maintenance cost and storage cost); how to efficiently manage the resource for processing jobs and increase the resource utilization with high SLO availability; how to design an efficient scheduling algorithm which provides high throughput, low overhead while satisfying the demands on completion time of jobs. Replication is a common approach to enhance data availability in cloud storage systems. Previously proposed replication schemes cannot effectively handle both correlated and non-correlated machine failures while increasing the data availability with the limited resource. The schemes for correlated machine failures must create a constant number of replicas for each data object, which neglects diverse data popularities and cannot utilize the resource to maximize the expected data availability. Also, the previous schemes neglect the consistency maintenance cost and the storage cost caused by replication. It is critical for cloud providers to maximize data availability hence minimize SLA (Service Level Agreement) violations while minimize cost caused by replication in order to maximize the revenue. In this dissertation, we build a nonlinear programming model to maximize data availability in both types of failures and minimize the cost caused by replication. Based on the model\u27s solution for the replication degree of each data object, we propose a low-cost multi-failure resilient replication scheme (MRR). MRR can effectively handle both correlated and non-correlated machine failures, considers data popularities to enhance data availability, and also tries to minimize consistency maintenance and storage cost. In current cloud, providers still need to reserve resources to allow users to scale on demand. The capacity offered by cloud offerings is in the form of pre-defined virtual machine (VM) configurations. This incurs resource wastage and results in low resource utilization when the users actually consume much less resource than the VM capacity. Existing works either reallocate the unused resources with no Service Level Objectives (SLOs) for availability\footnote{Availability refers to the probability of an allocated resource being remain operational and accessible during the validity of the contract~\cite{CarvalhoCirne14}.} or consider SLOs to reallocate the unused resources for long-running service jobs. This approach increases the allocated resource whenever it detects that SLO is violated in order to achieve SLO in the long term, neglecting the frequent fluctuations of jobs\u27 resource requirements in real-time application especially for short-term jobs that require fast responses and decision making for resource allocation. Thus, this approach cannot fully utilize the resources to process data because they cannot quickly adjust the resource allocation strategy dealing with the fluctuations of jobs\u27 resource requirements. What\u27s more, the previous opportunistic based resource allocation approach aims at providing long-term availability SLOs with good QoS for long-running jobs, which ensures that the jobs can be finished within weeks or months by providing slighted degraded resources with moderate availability guarantees, but it ignores deadline constraints in defining Quality of Service (QoS) for short-lived jobs requiring online responses in real-time application, thus it cannot truly guarantee the QoS and long-term availability SLOs. To overcome the drawbacks of previous works, we adequately consider the fluctuations of unused resource caused by bursts of jobs\u27 resource demands, and present a cooperative opportunistic resource provisioning (CORP) scheme to dynamically allocate the resource to jobs. CORP leverages complementarity of jobs\u27 requirements on different resource types and utilizes the job packing to reduce the resource wastage and increase the resource utilization. An increasing number of large-scale data analytics frameworks move towards larger degrees of parallelism aiming at high throughput. Scheduling that assigns tasks to workers and preemption that suspends low-priority tasks and runs high-priority tasks are two important functions in such frameworks. There are many existing works on scheduling and preemption in literature to provide high throughput. However, previous works do not substantially consider dependency in increasing throughput in scheduling or preemption. Considering dependency is crucial to increase the overall throughput. Besides, extensive task evictions for preemption increase context switches, which may decrease the throughput. To address the above problems, we propose an efficient scheduling system Dependency-aware Scheduling and Preemption (DSP) to achieve high throughput in scheduling and preemption. First, we build a mathematical model to minimize the makespan with the consideration of task dependency, and derive the target workers for tasks which can minimize the makespan; second, we utilize task dependency information to determine tasks\u27 priorities for preemption; finally, we present a probabilistic based preemption to reduce the numerous preemptions, while satisfying the demands on completion time of jobs. We conduct trace driven simulations on a real-cluster and real-world experiments on Amazon S3/EC2 to demonstrate the efficiency and effectiveness of our proposed system in comparison with other systems. The experimental results show the superior performance of our proposed system. In the future, we will further consider data update frequency to reduce consistency maintenance cost, and we will consider the effects of node joining and node leaving. Also we will consider energy consumption of machines and design an optimal replication scheme to improve data availability while saving power. For resource allocation, we will consider using the greedy approach for deep learning to reduce the computation overhead caused by the deep neural network. Also, we will additionally consider the heterogeneity of jobs (i.e., short jobs and long jobs), and use a hybrid resource allocation strategy to provide SLO availability customization for different job types while increasing the resource utilization. For scheduling, we will aim to handle scheduling tasks with partial dependency, worker failures in scheduling and make our DSP fully distributed to increase its scalability. Finally, we plan to use different workloads and real-world experiment to fully test the performance of our methods and make our preliminary system design more mature

    CloudScope: diagnosing and managing performance interference in multi-tenant clouds

    Get PDF
    © 2015 IEEE.Virtual machine consolidation is attractive in cloud computing platforms for several reasons including reduced infrastructure costs, lower energy consumption and ease of management. However, the interference between co-resident workloads caused by virtualization can violate the service level objectives (SLOs) that the cloud platform guarantees. Existing solutions to minimize interference between virtual machines (VMs) are mostly based on comprehensive micro-benchmarks or online training which makes them computationally intensive. In this paper, we present CloudScope, a system for diagnosing interference for multi-tenant cloud systems in a lightweight way. CloudScope employs a discrete-time Markov Chain model for the online prediction of performance interference of co-resident VMs. It uses the results to optimally (re)assign VMs to physical machines and to optimize the hypervisor configuration, e.g. the CPU share it can use, for different workloads. We have implemented CloudScope on top of the Xen hypervisor and conducted experiments using a set of CPU, disk, and network intensive workloads and a real system (MapReduce). Our results show that CloudScope interference prediction achieves an average error of 9%. The interference-aware scheduler improves VM performance by up to 10% compared to the default scheduler. In addition, the hypervisor reconfiguration can improve network throughput by up to 30%

    Automatic elasticity in OpenStack

    Get PDF
    Cloud computing infrastructures are the most recent ap- proach to the development and conception of computational systems. Cloud infrastructures are complex environments with various subsystems, each one with their own challenges. Cloud systems should be able to provide the following fun- damental property: elasticity. Elasticity is the ability to automatically add and remove instances according to the needs of the system. This is an requirement for pay-per-use billing models. Various open source software solutions allow companies and institutions to build their own Cloud infrastructure. How- ever, in most of these, the elasticity feature is quite imma- ture. Monitoring and timely adapting the active resources of a Cloud computing infrastructure is key to provide the elasticity required by diverse, multi tenant and pay-per-use business models. In this paper, we propose Elastack, an automated monitor- ing and adaptive system, generic enough to be applied to existing IaaS frameworks and intended to enable the elastic- ity they currently lack. Our approach offers any Cloud in- frastructure the mechanisms to implement automated mon- itoring and adaptation as well as the flexibility to go beyond these. We evaluate Elastack by integrating it with the Open- Stack and showing how easy it is to add these important features with a minimum, almost imperceptible, amount of modifications to the default installation.(undefined

    Variability in Behavior of Application Service Workload in a Utility Cloud

    Get PDF
    Using the elasticity feature of a utility cloud, users can acquire and release resources as required and pay for what they use. Applications with time-varying workloads can request for variable resources over time that makes cloud a convenient option for such applications. The elasticity in current IaaS cloud provides mainly two options to the users: horizontal and vertical scaling. In both ways of scaling the basic resource allocation unit is fixed-sized VM, it forces the cloud users to characterize their workload based on VM size, which might lead to under-utilization or over-allocation of resources. This turns out to be an inefficient model for both cloud users and providers. In this paper we discuss and calculate the variability in different kinds of application service workload. We also discuss different dynamic provisioning approaches proposed by researchers. We conclude with a brief introduction to the issues or limitations in existing solutions and our approach to resolve them in a way that is suitable and economic for both cloud user and provider

    Virtualized application performance prediction using system metrics

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.Cataloged from PDF version of thesis.Includes bibliographical references (p. 79-80).Virtualized datacenter administrators would like to consolidate virtual machines (VMs) onto as few physical hosts as possible in order to decrease costs, but must leave enough physical resources for each VM to meet application service level objectives (SLOs). The threshold between good and bad performance in terms of resource settings, however, is hard to determine and rarely static due to changing workloads and resource usage. Thus, in order to avoid SLO violations, system administrators must err on the side of caution by running fewer VMs per host than necessary or setting reservations, which prevents resources from being shared. To ameliorate this situation, we are working to design and implement a system that automatically monitors VM-level metrics to predict impending application SLO violations, and takes appropriate action to prevent the SLO violation from occurring. So far we have implemented the performance prediction, which is detailed in this document, while the preventative actions are left as future work. We created a three-stage pipeline in order to achieve scalable performance prediction. The three stages are prediction, which predicts future VM ESX performance counter values based on current time-series data; aggregation, which aggregates the predicted VM metrics into a single set of global metrics; and finally classification, which for each VM classifies its performance as good or bad based on the predicted VM counters and the predicted global state. Prediction of each counter is performed by a least-squares linear fit, aggregation is performed simply by summing each counter across all VMs, and classification is performed using a support vector machine (SVM) for each application. In addition, we created an experimental system running a MongoDB instance in order to test and evaluate our pipeline implementation. Our results on this experimental system are promising, but further research will be necessary before applying these techniques to a production system.by Skye A. Wanderman-Milne.M.Eng

    Efficiently Conducting Quality-of-Service Analyses by Templating Architectural Knowledge

    Get PDF
    Previously, software architects were unable to effectively and efficiently apply reusable knowledge (e.g., architectural styles and patterns) to architectural analyses. This work tackles this problem with a novel method to create and apply templates for reusable knowledge. These templates capture reusable knowledge formally and can efficiently be integrated in architectural analyses

    Log4Perf: Suggesting and Updating Logging Locations for Web-based Systems' Performance Monitoring

    Get PDF
    Performance assurance activities are an essential step in the release cycle of software systems. Logs have become one of the most important sources of information that is used to monitor, understand and improve software performance. However, developers often face the challenge of making logging decisions, i.e., neither logging too little and logging too much is desirable. Although prior research has proposed techniques to assist in logging decisions, those automated logging guidance techniques are rather general, without considering a particular goal, such as monitoring software performance. In this thesis, we present Log4Perf, an automated approach that provides suggestions of where to insert logging statements with the goal of monitoring web-based systems' software performance. In particular, our approach builds and manipulates a statistical performance model to identify the locations in the source code that statistically significantly influence software performance. To evaluate Log4Perf, we conduct case studies on open source systems, i.e., CloudStore and OpenMRS, and one large-scale commercial system. Our evaluation results show that Log4Perf can build well-fit statistical performance models, indicating that such models can be leveraged to investigate the influence of locations in the source code on performance. Also, the suggested logging locations are often small and simple methods that do not have logging statements and that are not performance hotspots, making our approach an ideal complement to traditional approaches that are based on software metrics or performance hotspots. In addition, we proposed approaches that can suggest the need for updating logging locations when software evolves. After evaluating our approach, we manually examine the logging locations that are newly suggested or deprecated and identify seven root-causes. Log4Perf is integrated into the release engineering process of the commercial software to provide logging suggestions on a regular basis
    corecore