26 research outputs found

    Overcommitment in Cloud Services -- Bin packing with Chance Constraints

    Full text link
    This paper considers a traditional problem of resource allocation, scheduling jobs on machines. One such recent application is cloud computing, where jobs arrive in an online fashion with capacity requirements and need to be immediately scheduled on physical machines in data centers. It is often observed that the requested capacities are not fully utilized, hence offering an opportunity to employ an overcommitment policy, i.e., selling resources beyond capacity. Setting the right overcommitment level can induce a significant cost reduction for the cloud provider, while only inducing a very low risk of violating capacity constraints. We introduce and study a model that quantifies the value of overcommitment by modeling the problem as a bin packing with chance constraints. We then propose an alternative formulation that transforms each chance constraint into a submodular function. We show that our model captures the risk pooling effect and can guide scheduling and overcommitment decisions. We also develop a family of online algorithms that are intuitive, easy to implement and provide a constant factor guarantee from optimal. Finally, we calibrate our model using realistic workload data, and test our approach in a practical setting. Our analysis and experiments illustrate the benefit of overcommitment in cloud services, and suggest a cost reduction of 1.5% to 17% depending on the provider's risk tolerance

    Solving the Batch Stochastic Bin Packing Problem in Cloud: A Chance-constrained Optimization Approach

    Full text link
    This paper investigates a critical resource allocation problem in the first party cloud: scheduling containers to machines. There are tens of services and each service runs a set of homogeneous containers with dynamic resource usage; containers of a service are scheduled daily in a batch fashion. This problem can be naturally formulated as Stochastic Bin Packing Problem (SBPP). However, traditional SBPP research often focuses on cases of empty machines, whose objective, i.e., to minimize the number of used machines, is not well-defined for the more common reality with nonempty machines. This paper aims to close this gap. First, we define a new objective metric, Used Capacity at Confidence (UCaC), which measures the maximum used resources at a probability and is proved to be consistent for both empty and nonempty machines, and reformulate the SBPP under chance constraints. Second, by modeling the container resource usage distribution in a generative approach, we reveal that UCaC can be approximated with Gaussian, which is verified by trace data of real-world applications. Third, we propose an exact solver by solving the equivalent cutting stock variant as well as two heuristics-based solvers -- UCaC best fit, bi-level heuristics. We experimentally evaluate these solvers on both synthetic datasets and real application traces, demonstrating our methodology's advantage over traditional SBPP optimal solver minimizing the number of used machines, with a low rate of resource violations.Comment: To appear in SIGKDD 2022 as Research Track pape

    Efficient and elastic management of computing infrastructures

    Full text link
    Tesis por compendio[EN] Modern data centers integrate a lot of computer and electronic devices. However, some reports state that the mean usage of a typical data center is around 50% of its peak capacity, and the mean usage of each server is between 10% and 50%. A lot of energy is destined to power on computer hardware that most of the time remains idle. Therefore, it would be possible to save energy simply by powering off those parts from the data center that are not actually used, and powering them on again as they are needed. Most data centers have computing clusters that are used for intensive computing, recently evolving towards an on-premises Cloud service model. Despite the use of low consuming components, higher energy savings can be achieved by dynamically adapting the system to the actual workload. The main approach in this case is the usage of energy saving criteria for scheduling the jobs or the virtual machines into the working nodes. The aim is to power off idle servers automatically. But it is necessary to schedule the power management of the servers in order to minimize the impact on the end users and their applications. The objective of this thesis is the elastic and efficient management of cluster infrastructures, with the aim of reducing the costs associated to idle components. This objective is addressed by automating the power management of the working nodes in a computing cluster, and also proactive stimulating the load distribution to achieve idle resources that could be powered off by means of memory overcommitment and live migration of virtual machines. Moreover, this automation is of interest for virtual clusters, as they also suffer from the same problems. While in physical clusters idle working nodes waste energy, in the case of virtual clusters that are built from virtual machines, the idle working nodes can waste money in commercial Clouds or computational resources in an on-premises Cloud.[ES] En los Centros de Procesos de Datos (CPD) existe una gran concentración de dispositivos informáticos y de equipamiento electrónico. Sin embargo, algunos estudios han mostrado que la utilización media de los CPD está en torno al 50%, y que la utilización media de los servidores se encuentra entre el 10% y el 50%. Estos datos evidencian que existe una gran cantidad de energía destinada a alimentar equipamiento ocioso, y que podríamos conseguir un ahorro energético simplemente apagando los componentes que no se estén utilizando. En muchos CPD suele haber clusters de computadores que se utilizan para computación de altas prestaciones y para la creación de Clouds privados. Si bien se ha tratado de ahorrar energía utilizando componentes de bajo consumo, también es posible conseguirlo adaptando los sistemas a la carga de trabajo en cada momento. En los últimos años han surgido trabajos que investigan la aplicación de criterios energéticos a la hora de seleccionar en qué servidor, de entre los que forman un cluster, se debe ejecutar un trabajo o alojar una máquina virtual. En muchos casos se trata de conseguir equipos ociosos que puedan ser apagados, pero habitualmente se asume que dicho apagado se hace de forma automática, y que los equipos se encienden de nuevo cuando son necesarios. Sin embargo, es necesario hacer una planificación de encendido y apagado de máquinas para minimizar el impacto en el usuario final. En esta tesis nos planteamos la gestión elástica y eficiente de infrastructuras de cálculo tipo cluster, con el objetivo de reducir los costes asociados a los componentes ociosos. Para abordar este problema nos planteamos la automatización del encendido y apagado de máquinas en los clusters, así como la aplicación de técnicas de migración en vivo y de sobreaprovisionamiento de memoria para estimular la obtención de equipos ociosos que puedan ser apagados. Además, esta automatización es de interés para los clusters virtuales, puesto que también sufren el problema de los componentes ociosos, sólo que en este caso están compuestos por, en lugar de equipos físicos que gastan energía, por máquinas virtuales que gastan dinero en un proveedor Cloud comercial o recursos en un Cloud privado.[CA] En els Centres de Processament de Dades (CPD) hi ha una gran concentració de dispositius informàtics i d'equipament electrònic. No obstant això, alguns estudis han mostrat que la utilització mitjana dels CPD està entorn del 50%, i que la utilització mitjana dels servidors es troba entre el 10% i el 50%. Estes dades evidencien que hi ha una gran quantitat d'energia destinada a alimentar equipament ociós, i que podríem aconseguir un estalvi energètic simplement apagant els components que no s'estiguen utilitzant. En molts CPD sol haver-hi clusters de computadors que s'utilitzen per a computació d'altes prestacions i per a la creació de Clouds privats. Si bé s'ha tractat d'estalviar energia utilitzant components de baix consum, també és possible aconseguir-ho adaptant els sistemes a la càrrega de treball en cada moment. En els últims anys han sorgit treballs que investiguen l'aplicació de criteris energètics a l'hora de seleccionar en quin servidor, d'entre els que formen un cluster, s'ha d'executar un treball o allotjar una màquina virtual. En molts casos es tracta d'aconseguir equips ociosos que puguen ser apagats, però habitualment s'assumix que l'apagat es fa de forma automàtica, i que els equips s'encenen novament quan són necessaris. No obstant això, és necessari fer una planificació d'encesa i apagat de màquines per a minimitzar l'impacte en l'usuari final. En esta tesi ens plantegem la gestió elàstica i eficient d'infrastructuras de càlcul tipus cluster, amb l'objectiu de reduir els costos associats als components ociosos. Per a abordar este problema ens plantegem l'automatització de l'encesa i apagat de màquines en els clusters, així com l'aplicació de tècniques de migració en viu i de sobreaprovisionament de memòria per a estimular l'obtenció d'equips ociosos que puguen ser apagats. A més, esta automatització és d'interés per als clusters virtuals, ja que també patixen el problema dels components ociosos, encara que en este cas estan compostos per, en compte d'equips físics que gasten energia, per màquines virtuals que gasten diners en un proveïdor Cloud comercial o recursos en un Cloud privat.Alfonso Laguna, CD. (2015). Efficient and elastic management of computing infrastructures [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/57187Compendi

    Learning Cooperative Oversubscription for Cloud by Chance-Constrained Multi-Agent Reinforcement Learning

    Full text link
    Oversubscription is a common practice for improving cloud resource utilization. It allows the cloud service provider to sell more resources than the physical limit, assuming not all users would fully utilize the resources simultaneously. However, how to design an oversubscription policy that improves utilization while satisfying the some safety constraints remains an open problem. Existing methods and industrial practices are over-conservative, ignoring the coordination of diverse resource usage patterns and probabilistic constraints. To address these two limitations, this paper formulates the oversubscription for cloud as a chance-constrained optimization problem and propose an effective Chance Constrained Multi-Agent Reinforcement Learning (C2MARL) method to solve this problem. Specifically, C2MARL reduces the number of constraints by considering their upper bounds and leverages a multi-agent reinforcement learning paradigm to learn a safe and optimal coordination policy. We evaluate our C2MARL on an internal cloud platform and public cloud datasets. Experiments show that our C2MARL outperforms existing methods in improving utilization (20%86%20\%\sim 86\%) under different levels of safety constraints

    Balancing the use of batteries and opportunistic scheduling policies for maximizing renewable energy consumption in a Cloud data center

    Get PDF
    International audienceThe fast growth of cloud computing considerably increases the energy consumption of cloud infrastructures, especially , data centers. To reduce brown energy consumption and carbon footprint, renewable energy such as solar/wind energy is considered recently to supply new green data centers. As renewable energy is intermittent and fluctuates from time to time, this paper considers two fundamental approaches for improving the usage of renewable energy in a small/medium-sized data center. One approach is based on opportunistic scheduling: more jobs are performed when renewable energy is available. The other approach relies on Energy Storage Devices (ESDs), which store renewable energy surplus at first and then, provide energy to the data center when renewable energy becomes unavailable. In this paper, we explore these two means to maximize the utilization of on-site renewable energy for small data centers. By using real-world job workload and solar energy traces, our experimental results show the energy consumption with varying battery size and solar panel dimensions for opportunistic scheduling or ESD-only solution. The results also demonstrate that opportunistic scheduling can reduce the demand for ESD capacity. Finally, we find an intermediate solution mixing both approaches in order to achieve a balance in all aspects, implying minimizing the renewable energy losses. It also saves brown energy consumption by up to 33% compared to ESD-only solution

    Multi-objective temporal bin packing problem : an application in cloud computing

    Get PDF
    Improving energy efficiency and lowering operational costs are the main challenges faced in systems with multiple servers. One prevalent objective in such systems is to minimize the number of servers required to process a given set of tasks under server capacity constraints. This objective leads to the well-known bin packing problem. In this study, we consider a generalization of this problem with a time dimension, where the tasks are to be performed with predefined start and end times. This new dimension brings about new performance considerations, one of which is the uninterrupted utilization of servers. This study is motivated by the problem of energy efficient assignment of virtual machines to physical servers in a cloud computing service. We address the virtual machine placement problem and present a binary integer programming model to develop different assignment policies. By analyzing the structural properties of the problem, we propose an efficient heuristic method based on solving smaller versions of the original problem iteratively. Moreover, we design a column generation algorithm that yields a lower bound on the objective value, which can be utilized to evaluate the performance of the heuristic algorithm. Our numerical study indicates that the proposed heuristic is capable of solving large-scale instances in a short time with small optimality gaps
    corecore