Search CORE

102 research outputs found

Automatic memory-based vertical elasticity and oversubscription on cloud platforms

Author: Caballer
Carlos de Alfonso
Clark
Dawoud
Denning
de~Alfonso
Galante
Germán Moltó
Hines
Householder
Hwang
Kivity
L.
Litke
Merkel
Miguel Caballer
Moltó
Moreno-Vozmediano
Salomie
Tasoulas
Tomás
Waldspurger
Williams
Publication venue: 'Elsevier BV'
Publication date: 01/03/2016
Field of study

Hypervisors and Operating Systems support vertical elasticity techniques such as memory ballooning to dynamically assign the memory of Virtual Machines (VMs). However, current Cloud Management Platforms (CMPs), such as OpenNebula or OpenStack, do not currently support dynamic vertical elasticity. This paper describes a system that integrates with the CMP to provide automatic vertical elasticity to adapt the memory size of the VMs to their current memory consumption, featuring live migration to prevent overload scenarios, without downtime for the VMs. This enables an enhanced VM-per-host consolidation ratio while maintaining the Quality of Service for VMs, since their memory is dynamically increased as necessary. The feasibility of the development is assessed via two case studies based on OpenNebula featuring (i) horizontal and vertical elastic virtual clusters on a production Grid infrastructure and (ii) elastic multi-tenant VMs that run Docker containers coupled with live migration techniques. The results show that memory oversubscription can be integrated on CMPs to deliver automatic memory management without severely impacting the performance of the VMs. This results in a memory management framework for on-premises Clouds that features live migration to safely enable transient oversubscription of physical resources in a CMP. © 2015 Elsevier B.V. All rights reserved.The authors would like to thank the Spanish "Ministerio de Economia y Competitividad" for the project CLUVIEM (TIN2013-44390-R) and the European Commission for the project INDIGO-DataCloud with grant number 653549.Moltó, G.; Caballer Fernández, M.; Alfonso Laguna, CD. (2016). Automatic memory-based vertical elasticity and oversubscription on cloud platforms. Future Generation Computer Systems. 56:1-10. https://doi.org/10.1016/j.future.2015.10.002S1105

Crossref

RiuNet

Efficient and elastic management of computing infrastructures

Author: Alfonso Laguna Carlos de
Publication venue: 'Universitat Politecnica de Valencia'
Publication date: 23/10/2016
Field of study

Tesis por compendio[EN] Modern data centers integrate a lot of computer and electronic devices. However, some reports state that the mean usage of a typical data center is around 50% of its peak capacity, and the mean usage of each server is between 10% and 50%. A lot of energy is destined to power on computer hardware that most of the time remains idle. Therefore, it would be possible to save energy simply by powering off those parts from the data center that are not actually used, and powering them on again as they are needed. Most data centers have computing clusters that are used for intensive computing, recently evolving towards an on-premises Cloud service model. Despite the use of low consuming components, higher energy savings can be achieved by dynamically adapting the system to the actual workload. The main approach in this case is the usage of energy saving criteria for scheduling the jobs or the virtual machines into the working nodes. The aim is to power off idle servers automatically. But it is necessary to schedule the power management of the servers in order to minimize the impact on the end users and their applications. The objective of this thesis is the elastic and efficient management of cluster infrastructures, with the aim of reducing the costs associated to idle components. This objective is addressed by automating the power management of the working nodes in a computing cluster, and also proactive stimulating the load distribution to achieve idle resources that could be powered off by means of memory overcommitment and live migration of virtual machines. Moreover, this automation is of interest for virtual clusters, as they also suffer from the same problems. While in physical clusters idle working nodes waste energy, in the case of virtual clusters that are built from virtual machines, the idle working nodes can waste money in commercial Clouds or computational resources in an on-premises Cloud.[ES] En los Centros de Procesos de Datos (CPD) existe una gran concentración de dispositivos informáticos y de equipamiento electrónico. Sin embargo, algunos estudios han mostrado que la utilización media de los CPD está en torno al 50%, y que la utilización media de los servidores se encuentra entre el 10% y el 50%. Estos datos evidencian que existe una gran cantidad de energía destinada a alimentar equipamiento ocioso, y que podríamos conseguir un ahorro energético simplemente apagando los componentes que no se estén utilizando. En muchos CPD suele haber clusters de computadores que se utilizan para computación de altas prestaciones y para la creación de Clouds privados. Si bien se ha tratado de ahorrar energía utilizando componentes de bajo consumo, también es posible conseguirlo adaptando los sistemas a la carga de trabajo en cada momento. En los últimos años han surgido trabajos que investigan la aplicación de criterios energéticos a la hora de seleccionar en qué servidor, de entre los que forman un cluster, se debe ejecutar un trabajo o alojar una máquina virtual. En muchos casos se trata de conseguir equipos ociosos que puedan ser apagados, pero habitualmente se asume que dicho apagado se hace de forma automática, y que los equipos se encienden de nuevo cuando son necesarios. Sin embargo, es necesario hacer una planificación de encendido y apagado de máquinas para minimizar el impacto en el usuario final. En esta tesis nos planteamos la gestión elástica y eficiente de infrastructuras de cálculo tipo cluster, con el objetivo de reducir los costes asociados a los componentes ociosos. Para abordar este problema nos planteamos la automatización del encendido y apagado de máquinas en los clusters, así como la aplicación de técnicas de migración en vivo y de sobreaprovisionamiento de memoria para estimular la obtención de equipos ociosos que puedan ser apagados. Además, esta automatización es de interés para los clusters virtuales, puesto que también sufren el problema de los componentes ociosos, sólo que en este caso están compuestos por, en lugar de equipos físicos que gastan energía, por máquinas virtuales que gastan dinero en un proveedor Cloud comercial o recursos en un Cloud privado.[CA] En els Centres de Processament de Dades (CPD) hi ha una gran concentració de dispositius informàtics i d'equipament electrònic. No obstant això, alguns estudis han mostrat que la utilització mitjana dels CPD està entorn del 50%, i que la utilització mitjana dels servidors es troba entre el 10% i el 50%. Estes dades evidencien que hi ha una gran quantitat d'energia destinada a alimentar equipament ociós, i que podríem aconseguir un estalvi energètic simplement apagant els components que no s'estiguen utilitzant. En molts CPD sol haver-hi clusters de computadors que s'utilitzen per a computació d'altes prestacions i per a la creació de Clouds privats. Si bé s'ha tractat d'estalviar energia utilitzant components de baix consum, també és possible aconseguir-ho adaptant els sistemes a la càrrega de treball en cada moment. En els últims anys han sorgit treballs que investiguen l'aplicació de criteris energètics a l'hora de seleccionar en quin servidor, d'entre els que formen un cluster, s'ha d'executar un treball o allotjar una màquina virtual. En molts casos es tracta d'aconseguir equips ociosos que puguen ser apagats, però habitualment s'assumix que l'apagat es fa de forma automàtica, i que els equips s'encenen novament quan són necessaris. No obstant això, és necessari fer una planificació d'encesa i apagat de màquines per a minimitzar l'impacte en l'usuari final. En esta tesi ens plantegem la gestió elàstica i eficient d'infrastructuras de càlcul tipus cluster, amb l'objectiu de reduir els costos associats als components ociosos. Per a abordar este problema ens plantegem l'automatització de l'encesa i apagat de màquines en els clusters, així com l'aplicació de tècniques de migració en viu i de sobreaprovisionament de memòria per a estimular l'obtenció d'equips ociosos que puguen ser apagats. A més, esta automatització és d'interés per als clusters virtuals, ja que també patixen el problema dels components ociosos, encara que en este cas estan compostos per, en compte d'equips físics que gasten energia, per màquines virtuals que gasten diners en un proveïdor Cloud comercial o recursos en un Cloud privat.Alfonso Laguna, CD. (2015). Efficient and elastic management of computing infrastructures [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/57187Compendi

RiuNet

Understanding Security Threats in Cloud

Author: Xu Zhang
Publication venue: W&M ScholarWorks
Publication date: 01/01/2016
Field of study

As cloud computing has become a trend in the computing world, understanding its security concerns becomes essential for improving service quality and expanding business scale. This dissertation studies the security issues in a public cloud from three aspects. First, we investigate a new threat called power attack in the cloud. Second, we perform a systematical measurement on the public cloud to understand how cloud vendors react to existing security threats. Finally, we propose a novel technique to perform data reduction on audit data to improve system capacity, and hence helping to enhance security in cloud. In the power attack, we exploit various attack vectors in platform as a service (PaaS), infrastructure as a service (IaaS), and software as a service (SaaS) cloud environments. to demonstrate the feasibility of launching a power attack, we conduct series of testbed based experiments and data-center-level simulations. Moreover, we give a detailed analysis on how different power management methods could affect a power attack and how to mitigate such an attack. Our experimental results and analysis show that power attacks will pose a serious threat to modern data centers and should be taken into account while deploying new high-density servers and power management techniques. In the measurement study, we mainly investigate how cloud vendors have reacted to the co-residence threat inside the cloud, in terms of Virtual Machine (VM) placement, network management, and Virtual Private Cloud (VPC). Specifically, through intensive measurement probing, we first profile the dynamic environment of cloud instances inside the cloud. Then using real experiments, we quantify the impacts of VM placement and network management upon co-residence, respectively. Moreover, we explore VPC, which is a defensive service of Amazon EC2 for security enhancement, from the routing perspective. Advanced Persistent Threat (APT) is a serious cyber-threat, cloud vendors are seeking solutions to ``connect the suspicious dots\u27\u27 across multiple activities. This requires ubiquitous system auditing for long period of time, which in turn causes overwhelmingly large amount of system audit logs. We propose a new approach that exploits the dependency among system events to reduce the number of log entries while still supporting high quality forensics analysis. In particular, we first propose an aggregation algorithm that preserves the event dependency in data reduction to ensure high quality of forensic analysis. Then we propose an aggressive reduction algorithm and exploit domain knowledge for further data reduction. We conduct a comprehensive evaluation on real world auditing systems using more than one-month log traces to validate the efficacy of our approach

College of William & Mary: W&M Publish

CloudScope: diagnosing and managing performance interference in multi-tenant clouds

Author: Chen X
Franciosi F
Knottenbelt W
Osman R
Pietzuch P
Rupprecht L
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 05/10/2015
Field of study

© 2015 IEEE.Virtual machine consolidation is attractive in cloud computing platforms for several reasons including reduced infrastructure costs, lower energy consumption and ease of management. However, the interference between co-resident workloads caused by virtualization can violate the service level objectives (SLOs) that the cloud platform guarantees. Existing solutions to minimize interference between virtual machines (VMs) are mostly based on comprehensive micro-benchmarks or online training which makes them computationally intensive. In this paper, we present CloudScope, a system for diagnosing interference for multi-tenant cloud systems in a lightweight way. CloudScope employs a discrete-time Markov Chain model for the online prediction of performance interference of co-resident VMs. It uses the results to optimally (re)assign VMs to physical machines and to optimize the hypervisor configuration, e.g. the CPU share it can use, for different workloads. We have implemented CloudScope on top of the Xen hypervisor and conducted experiments using a set of CPU, disk, and network intensive workloads and a real system (MapReduce). Our results show that CloudScope interference prediction achieves an average error of 9%. The interference-aware scheduler improves VM performance by up to 10% compared to the default scheduler. In addition, the hypervisor reconfiguration can improve network throughput by up to 30%

Spiral - Imperial College Digital Repository

Improving energy efficiency in cloud computing data centres using intelligent mobile agents

Author: Okonor Ogechukwu M.
Publication venue
Publication date: 01/01/2021
Field of study

Portsmouth University Research Portal (Pure)

PROV-TE: A Provenance-Driven Diagnostic Framework for Task Eviction in Data Centers

Author: Albatli A
Lau L
McKee D
Townend P
Xu J
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/04/2017
Field of study

Cloud Computing allows users to control substantial computing power for complex data processing, generating huge and complex data. However, the virtual resources requested by users are rarely utilized to their full capacities. To mitigate this, providers often perform over-commitment to maximize profit, which can result in node overloading and consequent task eviction. This paper presents a novel framework that mines the huge and growing historical usage data generated by Cloud data centers to identify the causes of overloads. Provenance modelling is applied to add contextual meaning to the data, and the PROV-TE diagnostic framework provides algorithms to efficiently identify the causality of task eviction. Using simulation to reflect real world scenarios, our results demonstrate a precision and recall of the diagnostic algorithms of 83% and 90% respectively. This demonstrates a high level of accuracy of the identification of causes

Crossref

White Rose Research Online

Optimisation for Optical Data Centre Switching and Networking with Artificial Intelligence

Author: Shabka Zacharaya
Publication venue: UCL (University College London)
Publication date: 28/06/2023
Field of study

Cloud and cluster computing platforms have become standard across almost every domain of business, and their scale quickly approaches

\mathbf{O}(10^6)

servers in a single warehouse. However, the tier-based opto-electronically packet switched network infrastructure that is standard across these systems gives way to several scalability bottlenecks including resource fragmentation and high energy requirements. Experimental results show that optical circuit switched networks pose a promising alternative that could avoid these. However, optimality challenges are encountered at realistic commercial scales. Where exhaustive optimisation techniques are not applicable for problems at the scale of Cloud-scale computer networks, and expert-designed heuristics are performance-limited and typically biased in their design, artificial intelligence can discover more scalable and better performing optimisation strategies. This thesis demonstrates these benefits through experimental and theoretical work spanning all of component, system and commercial optimisation problems which stand in the way of practical Cloud-scale computer network systems. Firstly, optical components are optimised to gate in

\approx 500 ps

and are demonstrated in a proof-of-concept switching architecture for optical data centres with better wavelength and component scalability than previous demonstrations. Secondly, network-aware resource allocation schemes for optically composable data centres are learnt end-to-end with deep reinforcement learning and graph neural networks, where

3\times

less networking resources are required to achieve the same resource efficiency compared to conventional methods. Finally, a deep reinforcement learning based method for optimising PID-control parameters is presented which generates tailored parameters for unseen devices in

\mathbf{O}(10^{-3}) s

. This method is demonstrated on a market leading optical switching product based on piezoelectric actuation, where switching speed is improved

>20\%

with no compromise to optical loss and the manufacturing yield of actuators is improved. This method was licensed to and integrated within the manufacturing pipeline of this company. As such, crucial public and private infrastructure utilising these products will benefit from this work

UCL Discovery

Recommended from our members

Reconfigurable Optically Interconnected Systems

Author: Shen Yiwen
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2020
Field of study

With the immense growth of data consumption in today's data centers and high-performance computing systems driven by the constant influx of new applications, the network infrastructure supporting this demand is under increasing pressure to enable higher bandwidth, latency, and flexibility requirements. Optical interconnects, able to support high bandwidth wavelength division multiplexed signals with extreme energy efficiency, have become the basis for long-haul and metro-scale networks around the world, while photonic components are being rapidly integrated within rack and chip-scale systems. However, optical and photonic interconnects are not a direct replacement for electronic-based components. Rather, the integration of optical interconnects with electronic peripherals allows for unique functionalities that can improve the capacity, compute performance and flexibility of current state-of-the-art computing systems. This requires physical layer methodologies for their integration with electronic components, as well as system level control planes that incorporates the optical layer characteristics. This thesis explores various network architectures and the associated control plane, hardware infrastructure, and other supporting software modules needed to integrate silicon photonics and MEMS based optical switching into conventional datacom network systems ranging from intra-data center and high-performance computing systems to the metro-scale layer networks between data centers. In each of these systems, we demonstrate dynamic bandwidth steering and compute resource allocation capabilities to enable significant performance improvements. The key accomplishments of this thesis are as follows. In Part 1, we present high-performance computing network architectures that integrate silicon photonic switches for optical bandwidth steering, enabling multiple reconfigurable topologies that results in significant system performance improvements. As high-performance systems rely on increased parallelism by scaling up to greater numbers of processor nodes, communication between these nodes grows rapidly and the interconnection network becomes a bottleneck to the overall performance of the system. It has been observed that many scientific applications operating on high-performance computing systems cause highly skewed traffic over the network, congesting only a small percentage of the total available links while other links are underutilized. This mismatch of the traffic and the bandwidth allocation of the physical layer network presents the opportunity to optimize the bandwidth resource utilization of the system by using silicon photonic switches to perform bandwidth steering. This allows the individual processors to perform at their maximum compute potential and thereby improving the overall system performance. We show various testbeds that integrates both microring resonator and Mach-Zehnder based silicon photonic switches within Dragonfly and Fat-Tree topology networks built with conventional equipment, and demonstrate 30-60% reduction in execution time of real high-performance benchmark applications. Part 2 presents a flexible network architecture and control plane that enables autonomous bandwidth steering and IT resource provisioning capabilities between metro-scale geographically distributed data centers. It uses a software-defined control plane to autonomously provision both network and IT resources to support different quality of service requirements and optimizes resource utilization under dynamically changing load variations. By actively monitoring both the bandwidth utilization of the network and CPU or memory resources of the end hosts, the control plane autonomously provisions background or dynamic connections with different levels of quality of service using optical MEMS switching, as well as initializing live migrations of virtual machines to consolidate or distribute workload. Together these functionalities provide flexibility and maximize efficiency in processing and transferring data, and enables energy and cost savings by scaling down the system when resources are not needed. An experimental testbed of three data center nodes was built to demonstrate the feasibility of these capabilities. Part 3 presents Lightbridge, a communications platform specifically designed to provide a more seamless integration between processor nodes and an optically switched network. It addresses some of the crucial issues faced by the works presented in the previous chapters related to optical switching. When optical switches perform switching operations, they change the physical topology of the network, and they lack the capability to buffer packets, resulting in certain optical circuits being unavailable. This prompts the question of whether it is safe to transmit packets by end hosts at any given time. Lightbridge was developed to coordinate switching and routing of optical circuits across the network, by having the processors gain information about the current state of the optical network before transmitting packets, and being able to buffer packets when the optical circuit is not available. This part describes details of Lightbridge which is constituted by a loadable Linux kernel module along with other supporting modifications to the Linux kernel in order to achieve the necessary functionalities

Columbia University Academic Commons