    CloudScope: diagnosing and managing performance interference in multi-tenant clouds

    © 2015 IEEE. Virtual machine consolidation is attractive in cloud computing platforms for several reasons, including reduced infrastructure costs, lower energy consumption, and ease of management. However, the interference between co-resident workloads caused by virtualization can violate the service level objectives (SLOs) that the cloud platform guarantees. Existing solutions to minimize interference between virtual machines (VMs) are mostly based on comprehensive micro-benchmarks or online training, which makes them computationally intensive. In this paper, we present CloudScope, a system for diagnosing interference in multi-tenant cloud systems in a lightweight way. CloudScope employs a discrete-time Markov chain model for the online prediction of performance interference between co-resident VMs. It uses the results to optimally (re)assign VMs to physical machines and to optimize the hypervisor configuration, e.g., the CPU share it can use, for different workloads. We have implemented CloudScope on top of the Xen hypervisor and conducted experiments using a set of CPU-, disk-, and network-intensive workloads and a real system (MapReduce). Our results show that CloudScope's interference prediction achieves an average error of 9%. The interference-aware scheduler improves VM performance by up to 10% compared to the default scheduler. In addition, the hypervisor reconfiguration can improve network throughput by up to 30%.
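
The abstract above names a discrete-time Markov chain as the prediction model but does not give its states or transition probabilities. As a rough illustration of the general technique only, the following sketch propagates a state distribution through a made-up three-state chain. The state names, the matrix `P`, and the helper functions are all assumptions for illustration and are not taken from CloudScope.

```python
# Illustrative discrete-time Markov chain. The states and probabilities
# below are invented for this sketch; they are NOT CloudScope's model.
STATES = ["idle", "cpu_bound", "io_bound"]

# P[i][j] = probability of moving from state i to state j in one time step.
# Each row sums to 1.
P = [
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.2, 0.6],
]


def step(dist, matrix):
    """Advance a state distribution by one time step: dist' = dist * P."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def predict(dist, matrix, steps):
    """Return the distribution over states after `steps` transitions."""
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist


start = [1.0, 0.0, 0.0]            # hypothetical: the workload is idle now
after_one = predict(start, P, 1)   # equals the "idle" row of P
after_ten = predict(start, P, 10)  # still a probability distribution
```

A predictor built this way would read off the probability of each state at a future step; how those probabilities map to an interference estimate is not specified in the abstract.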

    CATS: linearizability and partition tolerance in scalable and self-organizing key-value stores

    Distributed key-value stores provide scalable, fault-tolerant, and self-organizing storage services, but fall short of guaranteeing linearizable consistency in partially synchronous, lossy, partitionable, and dynamic networks when data is distributed and replicated automatically by the principle of consistent hashing. This paper introduces consistent quorums as a solution for achieving atomic consistency. We present the design and implementation of CATS, a distributed key-value store that uses consistent quorums to guarantee linearizability and partition tolerance in such adverse and dynamic network conditions. CATS is scalable, elastic, and self-organizing, which are key properties for modern cloud storage middleware. Our system shows that consistency can be achieved with practical performance and modest throughput overhead (5%) for read-intensive workloads.
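
For readers unfamiliar with the setting described above, the sketch below shows plain consistent hashing with a majority quorum size. It does not implement the consistent quorums that the paper introduces. The class `HashRing`, the function names, the node names, and the choice of SHA-1 are all hypothetical, chosen only for illustration.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a point on the ring (SHA-1 is an arbitrary choice here)."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Minimal consistent-hashing ring with one point per node."""

    def __init__(self, nodes):
        self._points = sorted((ring_position(n), n) for n in nodes)
        self._keys = [p for p, _ in self._points]

    def replicas(self, key: str, count: int):
        """Return up to `count` nodes found clockwise from the key's position."""
        count = min(count, len(self._points))
        start = bisect.bisect_right(self._keys, ring_position(key))
        size = len(self._points)
        return [self._points[(start + i) % size][1] for i in range(count)]


def majority(replication_degree: int) -> int:
    """Smallest quorum size such that any two quorums share a member."""
    return replication_degree // 2 + 1


# Hypothetical five-node cluster with a replication degree of 3.
ring = HashRing(["node-a", "node-b", "node-c", "node-d", "node-e"])
group = ring.replicas("user:42", 3)
quorum_size = majority(len(group))
```

Because node positions depend only on the hash of the node name, adding or removing a node changes the replica group only for keys near that node on the ring. This automatic reassignment is the dynamic behaviour under which, per the abstract, linearizability becomes hard to guarantee.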

    Computing in the RAIN: a reliable array of independent nodes

    The RAIN project is a research collaboration between Caltech and NASA-JPL on distributed computing and data-storage systems for future spaceborne missions. The goal of the project is to identify and develop key building blocks for reliable distributed systems built with inexpensive off-the-shelf components. The RAIN platform consists of a heterogeneous cluster of computing and/or storage nodes connected via multiple interfaces to networks configured in fault-tolerant topologies. The RAIN software components run in conjunction with operating system services and standard network protocols. Through software-implemented fault tolerance, the system tolerates multiple node, link, and switch failures, with no single point of failure. The RAIN technology has been transferred to Rainfinity, a start-up company focusing on creating clustered solutions for improving the performance and availability of Internet data centers. In this paper, we describe the following contributions: 1) fault-tolerant interconnect topologies and communication protocols providing consistent error reporting of link failures, 2) fault management techniques based on group membership, and 3) data storage schemes based on computationally efficient error-control codes. We present several proof-of-concept applications: a highly available video server, a highly available Web server, and a distributed checkpointing system. We also describe a commercial product, Rainwall, built with the RAIN technology.

    Building global and scalable systems with atomic multicast

    The rise of worldwide Internet-scale services demands large distributed systems. Indeed, when handling several millions of users, it is common to operate thousands of servers spread across the globe. Here, replication plays a central role, as it helps improve the user experience by hiding failures and by providing acceptable latency. In this thesis, we claim that atomic multicast, with strong and well-defined properties, is the appropriate abstraction to efficiently design and implement globally scalable distributed systems. Internet-scale services rely on data partitioning and replication to provide scalable performance and high availability. Moreover, to reduce user-perceived response times and tolerate disasters (i.e., the failure of a whole datacenter), services are increasingly becoming geographically distributed. Data partitioning and replication, combined with local and geographical distribution, introduce daunting challenges, including the need to carefully order requests among replicas and partitions. One way to tackle this problem is to use group communication primitives that encapsulate order requirements. While replication is a common technique used to design such reliable distributed systems, to cope with the requirements of modern cloud-based "always-on" applications, replication protocols must additionally allow for throughput scalability and dynamic reconfiguration, that is, on-demand replacement or provisioning of system resources. We propose a dynamic atomic multicast protocol that fulfills these requirements. It allows resources to be dynamically added to and removed from an online replicated state machine, and crashed processes to be recovered. Major efforts have been spent in recent years to improve the performance, scalability, and reliability of distributed systems. In order to hide the complexity of designing distributed applications, many proposals provide efficient high-level communication abstractions. 
Since the implementation of a production-ready system based on this abstraction is still a major task, we further propose to expose our protocol to developers in the form of distributed data structures. B-trees, for example, are commonly used in different kinds of applications, including database indexes or file systems. Providing a distributed, fault-tolerant and scalable data structure would help developers to integrate their applications in a distribution transparent manner. This work describes how to build reliable and scalable distributed systems based on atomic multicast and demonstrates their capabilities by an implementation of a distributed ordered map that supports dynamic re-partitioning and fast recovery. To substantiate our claim, we ported an existing SQL database atop our distributed lock-free data structure.

    Survey and Analysis of Production Distributed Computing Infrastructures

    This report has two objectives. First, we describe a set of the production distributed infrastructures currently available, so that the reader has a basic understanding of them. This includes explaining why each infrastructure was created and made available and how it has succeeded and failed. The set is not complete, but we believe it is representative. Second, we describe the infrastructures in terms of their use, which is a combination of how they were designed to be used and how users have found ways to use them. Applications are often designed and created with specific infrastructures in mind, with both an appreciation of the existing capabilities provided by those infrastructures and an anticipation of their future capabilities. Here, the infrastructures we discuss were often designed and created with specific applications in mind, or at least specific types of applications. The reader should understand how the interplay between the infrastructure providers and the users leads to such usages, which we call usage modalities. These usage modalities are really abstractions that exist between the infrastructures and the applications; they influence the infrastructures by representing the applications, and they influence the applications by representing the infrastructures.

    Electric Vehicles Charging Control based on Future Internet Generic Enablers

    In this paper, a rationale for the deployment of Future Internet based applications in the field of Electric Vehicle (EV) smart charging is presented. The focus is on the Connected Device Interface (CDI) Generic Enabler (GE) and the Network Information and Controller (NetIC) GE, which are recognized to have a potential impact on the charging control problem and the configuration of communications networks within reconfigurable clusters of charging points. The CDI GE can be used for capturing driver feedback in terms of Quality of Experience (QoE) in situations where the charging power is abruptly limited as a consequence of short-term grid needs, such as a load-shedding action requested by the Transmission System Operator of the Distribution System Operator to clear network contingencies due to the loss of a transmission line or large wind power fluctuations. The NetIC GE can be used when a master Electric Vehicle Supply Equipment (EVSE) hosts the Load Area Controller, which is responsible for managing simultaneous charging sessions within a given Load Area (LA); when reconfiguration of the distribution grid topology shifts EVSEs among LAs, reallocation of slave EVSEs is needed. Involved actors, equipment, communications, and processes are identified through the standardized framework provided by the Smart Grid Architecture Model (SGAM). Comment: To appear in IEEE International Electric Vehicle Conference (IEEE IEVC 2014).

    Autonomic management of virtualized resources in cloud computing

    The last five years have witnessed a rapid growth of cloud computing in business, governmental, and educational IT deployment. The success of cloud services depends critically on the effective management of virtualized resources. A key requirement of cloud management is the ability to dynamically match resource allocations to actual demands. To this end, we aim to design and implement a cloud resource management mechanism that manages underlying complexity, automates resource provisioning, and controls client-perceived quality of service (QoS) while still achieving resource efficiency. The design of automatic resource management centers on two questions: when to adjust resource allocations and how much to adjust. In a cloud, applications have different definitions of capacity, and cloud dynamics make it difficult to determine a static resource-to-performance relationship. In this dissertation, we have proposed a generic metric that measures application capacity, designed model-independent and adaptive approaches to manage resources, and built a cloud management system scalable to a cluster of machines. To understand web system capacity, we propose to use a metric called the productivity index (PI), defined as the ratio of yield to cost, to measure the system processing capability online. PI is a generic concept that can be applied at different levels to monitor system progress in order to identify whether more capacity is needed. We applied the concept of PI to the problem of overload prevention in multi-tier websites. The overload predictor built on the PI metric shows more accurate and responsive overload prevention compared to conventional approaches. To address the lack of an accurate server model, we propose a model-independent, fuzzy-control-based approach for CPU allocation. For adaptive and stable control performance, we equip the controller with self-tuning output amplification and flexible rule selection. 
Finally, we build a QoS provisioning framework that supports multi-objective QoS control and service differentiation. Experiments on a virtual cluster with two service classes show the effectiveness of our approach in both performance and power control. To address the problems of complex interplay between resources and process delays in fine-grained multi-resource allocation, we consider capacity management as a decision-making problem and employ reinforcement learning (RL) to optimize the process. The optimization depends on trial-and-error interactions with the cloud system. In order to improve the initial management performance, we propose a model-based RL algorithm. The neural-network-based environment model, which is learned from previous management history, generates simulated resource allocations for the RL agent. Experimental results on heterogeneous applications show that our approach makes efficient use of limited interactions and finds near-optimal resource configurations within 7 steps. Finally, we present a distributed reinforcement learning approach to cluster-wide cloud resource management. We decompose the cluster-wide resource allocation problem into sub-problems concerning individual VM resource configurations. The cluster-wide allocation is optimized if individual VMs meet their SLAs with high resource utilization. For scalability, we develop an efficient reinforcement learning approach with a continuous state space. For adaptability, we use low-level VM runtime statistics to accommodate workload dynamics. Prototyped in the iBalloon system, the distributed learning approach successfully manages 128 VMs on a 16-node closely correlated cluster.
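
The productivity index mentioned in this abstract is defined only as the ratio of yield to cost. The sketch below computes that ratio over a few intervals. The choice of completed requests as yield, CPU-seconds as cost, the sample values, and the `needs_capacity` rule with its `drop_ratio` threshold are all assumptions made for illustration; the abstract does not specify them.

```python
def productivity_index(yield_amount: float, cost: float) -> float:
    """PI = yield / cost, following the definition quoted in the abstract."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return yield_amount / cost


# Hypothetical samples per monitoring interval:
# (completed requests, CPU-seconds consumed).
samples = [(900, 10.0), (1800, 20.0), (2000, 40.0)]
pi_series = [productivity_index(y, c) for y, c in samples]


def needs_capacity(series, drop_ratio=0.7):
    """Assumed rule (not from the source): flag when the latest PI falls
    below drop_ratio times the best PI observed so far."""
    return series[-1] < drop_ratio * max(series)


overloaded = needs_capacity(pi_series)
```

In this made-up trace, the third interval consumes twice the resources of the second for only slightly more completed work, so the ratio drops and the assumed rule flags a possible overload.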

    On the Benefit of Virtualization: Strategies for Flexible Server Allocation

    Virtualization technology facilitates a dynamic, demand-driven allocation and migration of servers. This paper studies how the flexibility offered by network virtualization can be used to improve Quality-of-Service parameters such as latency, while taking into account allocation costs. A generic use case is considered where both the overall demand issued for a certain service (for example, an SAP application in the cloud, or a gaming application) and the origins of the requests change over time (e.g., due to time zone effects or user mobility), and we present online and optimal offline strategies to compute the number and location of the servers implementing this service. These algorithms also allow us to study the fundamental benefits of dynamic resource allocation compared to static systems. Our simulation results confirm our expectation that the gain of flexible server allocation is particularly high in scenarios with moderate dynamics.