559 research outputs found

    TIME AWARE VIRTUAL MACHINE PLACEMENT AND ROUTING FOR POWER EFFICIENCY IN DATA CENTERS

    Ph.D., Doctor of Philosophy

    Empowering Cloud Data Centers with Network Programmability

    Cloud data centers are a critical infrastructure for modern Internet services such as web search, social networking and e-commerce. However, the gradual slowdown of Moore’s law has constrained the growth of data centers’ performance and energy efficiency. In addition, the growing number of millisecond-scale and microsecond-scale tasks places higher demands on the throughput and latency of cloud applications. Today’s server-based solutions struggle to meet the performance requirements in many scenarios such as resource management, scheduling, high-speed traffic monitoring and testing. In this dissertation, we study these problems from a network perspective. We investigate a new architecture that leverages the programmability of new-generation network switches to improve the performance and reliability of clouds. As programmable switches provide only very limited memory and functionality, we exploit compact data structures and deeply co-design software and hardware to make the best use of these resources. More specifically, this dissertation presents four systems: (i) NetLock: A new centralized lock management architecture that co-designs programmable switches and servers to simultaneously achieve high performance and rich policy support. It provides orders-of-magnitude higher throughput than existing systems with microsecond-level latency, and supports many commonly used policies such as performance isolation. (ii) HCSFQ: A scalable and practical solution for implementing hierarchical fair queueing on commodity hardware at line rate. Instead of relying on a hierarchy of queues with complex queue management, HCSFQ keeps no per-flow state and uses only one queue to achieve hierarchical fair queueing. (iii) AIFO: A new approach to programmable packet scheduling that uses only a single FIFO queue. AIFO uses an admission control mechanism to approximate PIFO, which is theoretically ideal but hard to implement with commodity devices. (iv) Lumina: A tool that enables fine-grained analysis of hardware network stacks. By exploiting network programmability to emulate various network scenarios, Lumina helps users understand the micro-behaviors of hardware network stacks.

    Improving efficiency and resilience in large-scale computing systems through analytics and data-driven management

    Applications running in large-scale computing systems such as high performance computing (HPC) or cloud data centers are essential to many aspects of modern society, from weather forecasting to financial services. As the number and size of data centers increase with the growing computing demand, scalable and efficient management becomes crucial. However, data center management is a challenging task due to the complex interactions between applications, middleware, and hardware layers such as processors, network, and cooling units. This thesis claims that to improve robustness and efficiency of large-scale computing systems, significantly higher levels of automated support than what is available in today's systems are needed, and this automation should leverage the data continuously collected from various system layers. Towards this claim, we propose novel methodologies to automatically diagnose the root causes of performance and configuration problems and to improve efficiency through data-driven system management. We first propose a framework to diagnose software and hardware anomalies that cause undesired performance variations in large-scale computing systems. We show that by training machine learning models on resource usage and performance data collected from servers, our approach successfully diagnoses 98% of the injected anomalies at runtime in real-world HPC clusters with negligible computational overhead. We then introduce an analytics framework to address another major source of performance anomalies in cloud data centers: software misconfigurations. Our framework discovers and extracts configuration information from cloud instances such as containers or virtual machines. This is the first framework to provide comprehensive visibility into software configurations in multi-tenant cloud platforms, enabling systematic analysis for validating the correctness of software configurations. 
    This thesis also contributes to the design of robust and efficient system management methods that leverage continuously monitored resource usage data. To improve performance under power constraints, we propose a workload- and cooling-aware power budgeting algorithm that distributes the available power among servers and cooling units in a data center, achieving up to 21% improvement in throughput per Watt compared to the state-of-the-art. Additionally, we design a network- and communication-aware HPC workload placement policy that reduces communication overhead by up to 30% in terms of hop-bytes compared to existing policies.
    2019-07-02T00:00:00
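The hop-bytes metric mentioned in this abstract is commonly defined as the sum, over all messages, of the message size multiplied by the number of network hops the message traverses. The sketch below illustrates that definition only; it is not the thesis's placement policy. The 2D mesh topology, the Manhattan-distance hop count, the rank names, and the traffic volumes are all assumptions made for illustration.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # hypothetical (x, y) position of a node in a 2D mesh


def hops(a: Coord, b: Coord) -> int:
    """Hop count between two nodes, assuming shortest-path mesh routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hop_bytes(placement: Dict[str, Coord],
              traffic: Dict[Tuple[str, str], int]) -> int:
    """Sum of (bytes sent) x (hops traversed) over all communicating pairs."""
    return sum(nbytes * hops(placement[src], placement[dst])
               for (src, dst), nbytes in traffic.items())


# Illustrative traffic matrix: bytes exchanged between three ranks.
traffic = {("r0", "r1"): 1000, ("r1", "r2"): 500, ("r0", "r2"): 100}

# Two candidate placements of the same job.
scattered = {"r0": (0, 0), "r1": (3, 3), "r2": (0, 3)}
compact = {"r0": (0, 0), "r1": (0, 1), "r2": (1, 1)}

print(hop_bytes(scattered, traffic))  # 7800
print(hop_bytes(compact, traffic))    # 1700
```

Placing heavily communicating ranks close together lowers the metric, which is the intuition behind communication-aware placement.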

    Strategic and operational services for workload management in the cloud

    In hosting environments such as Infrastructure as a Service (IaaS) clouds, desirable application performance is typically guaranteed through the use of Service Level Agreements (SLAs), which specify minimal fractions of resource capacities that must be allocated by a service provider for unencumbered use by customers to ensure proper operation of their workloads. Most IaaS offerings are presented to customers as fixed-size and fixed-price SLAs that do not match the needs of specific applications well. Furthermore, arbitrary colocation of applications with different SLAs may lead to inefficient utilization of hosts' resources and to economically undesirable customer behavior. In this thesis, we propose the design and architecture of a Colocation as a Service (CaaS) framework: a set of strategic and operational services that allow the efficient colocation of customer workloads. CaaS strategic services provide customers the means to specify their application workload using an SLA language that gives them the opportunity and incentive to take advantage of any tolerances they may have regarding the scheduling of their workloads. CaaS operational services provide the information necessary for, and carry out the reconfigurations mandated by, strategic services. We recognize that there may be multiple, functionally equivalent ways to express an SLA. To that end, we present a service that allows the provably safe transformation of SLAs from one form to another for the purpose of achieving more efficient colocation. Our CaaS framework could be incorporated into an IaaS offering by providers, or it could be implemented as a value-added proposition by IaaS resellers. To establish the practicality of such offerings, we present a prototype implementation of our proposed CaaS framework.

    Machine Learning-based Orchestration Solutions for Future Slicing-Enabled Mobile Networks

    Fifth-generation mobile networks (5G) will incorporate novel technologies such as network programmability and virtualization enabled by the Software-Defined Networking (SDN) and Network Function Virtualization (NFV) paradigms, which have recently attracted major interest from both academic and industrial stakeholders. Building on these concepts, Network Slicing has emerged as the main driver of a novel business model in which mobile operators may open, i.e., “slice”, their infrastructure to new business players and offer independent, isolated and self-contained sets of network functions and physical/virtual resources tailored to specific service requirements. While Network Slicing has the potential to increase the revenue sources of service providers, it involves a number of technical challenges that must be carefully addressed. End-to-end (E2E) network slices encompass time and spectrum resources in the radio access network (RAN), transport resources on the fronthauling/backhauling links, and computing and storage resources at core and edge data centers. Additionally, the heterogeneity of vertical service requirements (e.g., high throughput, low latency, high reliability) exacerbates the need for novel orchestration solutions able to manage end-to-end network slice resources across different domains, while satisfying stringent service level agreements and specific traffic requirements. An end-to-end network slicing orchestration solution shall i) admit network slice requests such that the overall system revenues are maximized, ii) provide the required resources across different network domains to fulfill the Service Level Agreements (SLAs), and iii) dynamically adapt the resource allocation based on the real-time traffic load, end-users’ mobility and instantaneous wireless channel statistics.
    A mobile network is a fast-changing scenario characterized by complex spatio-temporal relationships connecting end-users’ traffic demand with social activities and the economy. Legacy models that aim to provide dynamic resource allocation based on traditional traffic demand forecasting techniques fail to capture these important aspects. To close this gap, machine learning-aided solutions are quickly emerging as promising technologies to sustain, in a scalable manner, the set of operations required by the network slicing context. How to implement such resource allocation schemes among slices, while making the most efficient use of the networking resources composing the mobile infrastructure, is a key problem underlying the network slicing paradigm and is addressed in this thesis.

    Writing Rangers


    Performance-oriented service management in clouds

    Cloud computing has made it convenient for many IT-related and traditional industries to use feature-rich services to process complex requests. Various services are deployed in the cloud, and they interact with each other to deliver the required results. How to effectively manage these services, whose number is ever increasing, has become a critical issue for both tenants and service providers of the cloud. In this thesis, we first develop novel resource provisioning frameworks to determine the resources to provision for interactive services. Next, we propose algorithms for mapping Virtual Machines (VMs) to Physical Machines (PMs) under different constraints, aiming to achieve the desired Quality of Service (QoS) while optimizing the provisioning of both computing resources and communication bandwidth. Finally, job scheduling may itself become a performance bottleneck in such a large-scale cloud. To address this issue, distributed job scheduling frameworks have been proposed in the literature. However, distributed job scheduling may cause resource conflicts among the distributed job schedulers, because individual schedulers make their scheduling decisions independently. In this thesis, we investigate methods for reducing resource conflicts. We apply game-theoretic methodology to capture the behaviour of the distributed schedulers in the cloud. The frameworks and methods developed in this thesis have been evaluated with a simulated workload, a large-scale workload trace and a real cloud testbed.

    Network and Server Resource Management Strategies for Data Centre Infrastructures: A Survey

    The advent of virtualisation and the increasing demand for outsourced, elastic compute charged on a pay-as-you-use basis has stimulated the development of large-scale Cloud Data Centres (DCs) housing tens of thousands of computer clusters. Of the significant capital outlay required for building and operating such infrastructures, server and network equipment account for 45% and 15% of the total cost, respectively, making resource utilisation efficiency paramount in order to increase the operators' Return-on-Investment (RoI). In this paper, we present an extensive survey on the management of server and network resources over virtualised Cloud DC infrastructures, highlighting key concepts and results, and critically discussing their limitations and implications for future research opportunities. We highlight the need for and benefits of adaptive resource provisioning that alleviates reliance on static utilisation prediction models and exploits direct measurement of resource utilisation on servers and network nodes. Coupling such distributed measurement with logically-centralised Software Defined Networking (SDN) principles, we subsequently discuss the challenges and opportunities for converged resource management over converged ICT environments, through unifying control loops to globally orchestrate adaptive and load-sensitive resource provisioning.

    Scalable and fault-tolerant data stream processing on multi-core architectures

    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. 
    To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them into a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.
    Open Access
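The idea of sharing computation across overlapping windows by using the algebraic properties of the aggregation function can be illustrated with a generic textbook case. This is not the thesis's engine or its specific techniques. The sketch assumes count-based windows and an invertible aggregation (sum), so each window result is derived from the previous one by adding the arriving value and subtracting the evicted one. The function name and parameters are hypothetical.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def sliding_sums(values: Iterable[float], size: int, slide: int) -> Iterator[float]:
    """Yield the sum of every count-based window of `size` items,
    advancing by `slide` items, with O(1) work per input item."""
    window: Deque[float] = deque()
    running = 0.0
    seen = 0
    for v in values:
        window.append(v)
        running += v                      # incorporate the new item
        if len(window) > size:
            running -= window.popleft()   # invert the evicted item's contribution
        seen += 1
        # Emit once the first window is full, then after every `slide` items.
        if seen >= size and (seen - size) % slide == 0:
            yield running


data = [1, 2, 3, 4, 5, 6]
print(list(sliding_sums(data, size=4, slide=2)))  # [10.0, 18.0]
```

Subtraction works here only because sum has an inverse. Aggregations such as max or min lack one and need a different sharing strategy, which is why the choice of technique depends on the function's algebraic properties.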

    Processamento de eventos complexos como serviço em ambientes multi-nuvem (Complex event processing as a service in multi-cloud environments)

    Advisors: Luiz Fernando Bittencourt, Miriam Akemi Manabe Capretz
    Thesis (doctorate) - Universidade Estadual de Campinas, Instituto de Computação
    Abstract: The rise of mobile technologies and the Internet of Things, combined with advances in Web technologies, has created a new Big Data world in which the volume and velocity of data generation have reached an unprecedented scale. As a technology created to process continuous streams of data, Complex Event Processing (CEP) has often been associated with Big Data and used as a tool to obtain real-time insights. However, despite this recent surge of interest, the CEP market is still dominated by solutions that are either costly and inflexible or too low-level and hard to operate. To address these problems, this research proposes the creation of a CEP system that can be offered as a service and used over the Internet. Such a CEP as a Service (CEPaaS) system would give its users CEP functionalities together with the advantages of the services model, such as no up-front investment and low maintenance cost. Nevertheless, creating such a service involves challenges that are not addressed by current CEP systems. This research proposes solutions for three open problems that exist in this context. First, to address the problem of understanding and reusing existing CEP management procedures, this research introduces the Attributed Graph Rewriting for Complex Event Processing Management (AGeCEP) formalism as a technology- and language-agnostic representation of queries and their reconfigurations. Second, to address the problem of evaluating CEP query management and processing strategies, this research introduces CEPSim, a simulator of cloud-based CEP systems. Finally, this research also introduces a CEPaaS system based on a multi-cloud architecture, container management systems, and an AGeCEP-based multi-tenant design. To demonstrate its feasibility, AGeCEP was used to design an autonomic manager and a selected set of self-management policies. Moreover, CEPSim was thoroughly evaluated through experiments that showed it can simulate existing systems with accuracy and low execution overhead. Finally, additional experiments validated the CEPaaS system and demonstrated that it achieves the goal of offering CEP functionalities as a scalable and fault-tolerant service. Together, these results confirm that this research significantly advances the CEP state of the art and provides novel tools and methodologies that can be applied to CEP research.
    Doctorate; Computer Science; Doctor of Computer Science; 140920/2012-9; CNP