40 research outputs found

    Performance-Aware Speculative Resource Oversubscription for Large-Scale Clusters

    It is a long-standing challenge to achieve a high degree of resource utilization in cluster scheduling. Resource oversubscription has become a common practice for improving resource utilization and reducing cost. However, current centralized approaches to oversubscription suffer from resource mismatch and fail to take other performance requirements, e.g., tail latency, into account. In this article we present ROSE, a new resource management platform capable of performance-aware resource oversubscription. ROSE allows latency-sensitive long-running applications (LRAs) to co-exist with computation-intensive batch jobs. Instead of waiting for resource allocation to be confirmed by the centralized scheduler, job managers in ROSE can independently request to launch speculative tasks on specific machines according to their suitability for oversubscription. The node agents of those machines can, however, avoid excessive resource oversubscription through an admission-control mechanism that combines multi-resource threshold control with performance-aware resource throttling. Experiments show that, with mixed co-location of batch jobs and latency-sensitive LRAs, CPU utilization and disk utilization can reach 56.34 and 43.49 percent, respectively, while the 95th-percentile read latency of YCSB workloads increases by only 5.4 percent compared with executing the LRAs alone.
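
    To illustrate the multi-resource threshold control described above, here is a minimal sketch, assuming hypothetical threshold values and a simple node-side check that rejects a speculative task whenever any oversubscribed resource would cross its limit. Names such as NodeAgent and admit_speculative_task are illustrative and not part of ROSE's actual API.

    from dataclasses import dataclass

    @dataclass
    class ResourceUsage:
        cpu: float      # fraction of machine CPU in use, 0.0-1.0
        memory: float   # fraction of memory in use
        disk_io: float  # fraction of disk bandwidth in use

    class NodeAgent:
        """Illustrative node-side admission control: a speculative task is only
        admitted if every resource stays below its oversubscription threshold."""

        def __init__(self, thresholds: ResourceUsage):
            self.thresholds = thresholds

        def admit_speculative_task(self, current: ResourceUsage,
                                   demand: ResourceUsage) -> bool:
            projected = ResourceUsage(
                cpu=current.cpu + demand.cpu,
                memory=current.memory + demand.memory,
                disk_io=current.disk_io + demand.disk_io,
            )
            # Multi-resource threshold control: reject the task if any single
            # dimension would exceed its limit, avoiding excessive oversubscription.
            return (projected.cpu <= self.thresholds.cpu and
                    projected.memory <= self.thresholds.memory and
                    projected.disk_io <= self.thresholds.disk_io)

    # Example: a speculative batch task asking for 10% CPU on a machine at 50%.
    agent = NodeAgent(ResourceUsage(cpu=0.8, memory=0.9, disk_io=0.7))
    print(agent.admit_speculative_task(ResourceUsage(0.5, 0.6, 0.3),
                                       ResourceUsage(0.1, 0.05, 0.1)))  # True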

    Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks

    The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes in size and are growing rapidly as they collect data from varied sources: web applications, mobile phones, sensors, and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks.

    Load Balancing in Distributed Cloud Computing: A Reinforcement Learning Algorithms in Heterogeneous Environment

    Load balancing in cloud environments is an important aspect that plays a vital role in sharing load among different types of resources, such as virtual machines hosted on servers, storage in the form of hard drives, and the servers themselves. Reinforcement learning approaches can be adopted in cloud computing to achieve quality-of-service goals such as minimized cost and response time, increased throughput, fault tolerance, and utilization of all available resources in the network, thus increasing system performance. Reinforcement-learning-based approaches enable effective resource utilization by selecting the most suitable processor for task execution with minimum makespan. Earlier related work on load sharing includes only a limited number of reinforcement-learning-based approaches. This paper therefore focuses on the importance of RL-based approaches for achieving balanced load in distributed cloud computing. A reinforcement learning framework is proposed and implemented for the execution of tasks in heterogeneous environments, in particular Least Load Balancing (LLB) and Booster Reinforcement Controller (BRC) load balancing. With the help of reinforcement learning approaches, an optimal result is achieved for load sharing and task allocation. In this RL-based framework, processor workload is taken as the input. The results of the proposed RL-based approaches are evaluated for cost and makespan and compared with existing load balancing techniques for task execution and resource utilization.
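
    As a rough illustration of how a reinforcement-learning agent can map workload observations to processor selection, the sketch below implements a tiny Q-learning dispatcher that learns to assign tasks so as to keep the makespan low. The state/action encoding, the reward (negative completion time), and the name QLoadBalancer are assumptions made for illustration; this is not the paper's LLB or BRC algorithm.

    import random
    from collections import defaultdict

    class QLoadBalancer:
        """Toy Q-learning dispatcher: the state is the index of the currently
        least-loaded processor, the action is the processor chosen for the task."""

        def __init__(self, n_procs, alpha=0.5, gamma=0.9, epsilon=0.1):
            self.n = n_procs
            self.q = defaultdict(float)      # (state, action) -> estimated value
            self.alpha, self.gamma, self.eps = alpha, gamma, epsilon
            self.load = [0.0] * n_procs      # pending work per processor

        def _state(self):
            return min(range(self.n), key=lambda p: self.load[p])

        def assign(self, task_cost):
            s = self._state()
            if random.random() < self.eps:
                a = random.randrange(self.n)                          # explore
            else:
                a = max(range(self.n), key=lambda p: self.q[(s, p)])  # exploit
            self.load[a] += task_cost
            reward = -self.load[a]           # shorter completion time -> higher reward
            s_next = self._state()
            best_next = max(self.q[(s_next, p)] for p in range(self.n))
            self.q[(s, a)] += self.alpha * (reward + self.gamma * best_next
                                            - self.q[(s, a)])
            return a

    # Dispatch 100 tasks of random cost across 4 heterogeneous processors.
    lb = QLoadBalancer(n_procs=4)
    for _ in range(100):
        lb.assign(task_cost=random.uniform(1.0, 5.0))
    print("makespan estimate:", max(lb.load))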

    Reconfigurable Optically Interconnected Systems

    With the immense growth of data consumption in today's data centers and high-performance computing systems, driven by the constant influx of new applications, the network infrastructure supporting this demand is under increasing pressure to deliver higher bandwidth, lower latency, and greater flexibility. Optical interconnects, able to carry high-bandwidth wavelength-division-multiplexed signals with extreme energy efficiency, have become the basis for long-haul and metro-scale networks around the world, while photonic components are being rapidly integrated within rack- and chip-scale systems. However, optical and photonic interconnects are not a direct replacement for electronic components. Rather, integrating optical interconnects with electronic peripherals enables unique functionalities that can improve the capacity, compute performance, and flexibility of current state-of-the-art computing systems. This requires physical-layer methodologies for their integration with electronic components, as well as system-level control planes that incorporate the characteristics of the optical layer. This thesis explores various network architectures and the associated control planes, hardware infrastructure, and other supporting software modules needed to integrate silicon photonics and MEMS-based optical switching into conventional datacom network systems, ranging from intra-data-center and high-performance computing systems to the metro-scale networks between data centers. In each of these systems, we demonstrate dynamic bandwidth steering and compute resource allocation capabilities that enable significant performance improvements. The key accomplishments of this thesis are as follows.
    In Part 1, we present high-performance computing network architectures that integrate silicon photonic switches for optical bandwidth steering, enabling multiple reconfigurable topologies that result in significant system performance improvements. As high-performance systems rely on increased parallelism by scaling up to greater numbers of processor nodes, communication between these nodes grows rapidly and the interconnection network becomes a bottleneck to the overall performance of the system. It has been observed that many scientific applications running on high-performance computing systems generate highly skewed traffic over the network, congesting only a small percentage of the available links while other links are underutilized. This mismatch between the traffic and the bandwidth allocation of the physical-layer network presents an opportunity to optimize the bandwidth utilization of the system by using silicon photonic switches to perform bandwidth steering. This allows the individual processors to perform at their maximum compute potential, thereby improving overall system performance. We show various testbeds that integrate both microring-resonator and Mach-Zehnder based silicon photonic switches within Dragonfly and Fat-Tree topology networks built with conventional equipment, and demonstrate a 30-60% reduction in the execution time of real high-performance benchmark applications.
    Part 2 presents a flexible network architecture and control plane that enables autonomous bandwidth steering and IT resource provisioning between metro-scale, geographically distributed data centers. It uses a software-defined control plane to autonomously provision both network and IT resources to support different quality-of-service requirements and to optimize resource utilization under dynamically changing loads. By actively monitoring both the bandwidth utilization of the network and the CPU and memory resources of the end hosts, the control plane autonomously provisions background or dynamic connections with different levels of quality of service using optical MEMS switching, and initiates live migration of virtual machines to consolidate or distribute workload. Together these functionalities provide flexibility, maximize efficiency in processing and transferring data, and enable energy and cost savings by scaling down the system when resources are not needed. An experimental testbed of three data center nodes was built to demonstrate the feasibility of these capabilities.
    Part 3 presents Lightbridge, a communications platform specifically designed to provide more seamless integration between processor nodes and an optically switched network. It addresses some of the crucial issues related to optical switching faced by the work presented in the previous chapters. When optical switches perform switching operations they change the physical topology of the network, and because they cannot buffer packets, certain optical circuits become temporarily unavailable. This raises the question of whether it is safe for end hosts to transmit packets at any given time. Lightbridge coordinates switching and routing of optical circuits across the network: processors obtain information about the current state of the optical network before transmitting packets and can buffer packets while an optical circuit is unavailable. This part describes the details of Lightbridge, which consists of a loadable Linux kernel module along with supporting modifications to the Linux kernel that provide the necessary functionality.
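
    To make the buffer-when-unavailable behaviour concrete, the following is a minimal user-space sketch of the idea: packets destined for a node whose optical circuit is being reconfigured are queued and flushed once the switch reports the circuit as established. The real Lightbridge is implemented as a Linux kernel module; the class and method names below are hypothetical and purely illustrative.

    from collections import defaultdict, deque

    class CircuitAwareSender:
        """Queue packets while the optical circuit to a destination is down,
        and flush them once the switch reports the circuit as established."""

        def __init__(self):
            self.circuit_up = defaultdict(bool)    # destination -> circuit state
            self.pending = defaultdict(deque)      # destination -> buffered packets

        def on_switch_event(self, dst, is_up):
            # Called when the optical switch starts or finishes reconfiguring.
            self.circuit_up[dst] = is_up
            if is_up:
                while self.pending[dst]:
                    self._transmit(dst, self.pending[dst].popleft())

        def send(self, dst, packet):
            if self.circuit_up[dst]:
                self._transmit(dst, packet)
            else:
                self.pending[dst].append(packet)   # buffer instead of dropping

        def _transmit(self, dst, packet):
            print(f"tx -> {dst}: {packet}")

    sender = CircuitAwareSender()
    sender.send("nodeB", "pkt1")              # circuit down: pkt1 is buffered
    sender.on_switch_event("nodeB", True)     # circuit established: pkt1 flushed
    sender.send("nodeB", "pkt2")              # transmitted immediately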

    Energy-aware service provisioning in P2P-assisted cloud ecosystems

    Joint doctoral thesis (cotutela) between Universitat Politècnica de Catalunya and Instituto Tecnico de Lisboa. Energy has emerged as a first-class computing resource in modern systems. This trend has led to a strong focus on reducing the energy consumption of data centers, coupled with growing awareness of their adverse environmental impact, and consequently to a strong focus on energy management for server-class systems. In this work, we address energy-aware service provisioning in P2P-assisted cloud ecosystems, leveraging economics-inspired mechanisms. Toward this goal, we address a number of challenges. To frame an energy-aware service provisioning mechanism in the P2P-assisted cloud, we first need to compare the energy consumption of each individual service in the P2P-cloud and in data centers. However, while decreasing the energy consumption of cloud services, we risk violating performance requirements. We therefore formulate a performance-aware energy analysis metric, conceptualized across the service provisioning stack, and leverage this metric to derive an energy analysis framework. We then sketch a framework for analyzing energy effectiveness on P2P-cloud and data center platforms in order to choose the right service platform according to performance and energy characteristics. This framework maps energy from the hardware-oblivious top level of the stack down to the particular hardware setting at the bottom layer. Afterwards, we introduce an economics-inspired mechanism to increase energy effectiveness in the P2P-assisted cloud platform, moving toward greener ICT as well as ICT for a greener ecosystem.
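
    A much-simplified sketch of the platform-selection idea: compute a performance-aware energy metric for both the P2P-cloud and the data-center option and choose the platform with the better score. The metric below (energy per served request, inflated when a response-time target is violated) and all names are assumptions made for illustration, not the thesis's actual formulation.

    def energy_effectiveness(energy_joules, requests_served,
                             response_time_s, target_response_s):
        """Hypothetical performance-aware energy metric: joules per request,
        inflated when the response-time target is violated."""
        penalty = max(1.0, response_time_s / target_response_s)
        return (energy_joules / requests_served) * penalty

    def choose_platform(p2p_stats, dc_stats, target_response_s=0.2):
        # Lower score means less energy per request at acceptable performance.
        scores = {
            name: energy_effectiveness(s["energy"], s["requests"],
                                       s["response_time"], target_response_s)
            for name, s in (("p2p-cloud", p2p_stats), ("data-center", dc_stats))
        }
        return min(scores, key=scores.get), scores

    platform, scores = choose_platform(
        p2p_stats={"energy": 1200.0, "requests": 5000, "response_time": 0.25},
        dc_stats={"energy": 1800.0, "requests": 5000, "response_time": 0.15},
    )
    print(platform, scores)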

    Towards multiprogrammed GPUs

    Programmable Graphics Processing Units (GPUs) have recently become the most pervasive massively parallel processors. They have come a long way, from fixed-function ASICs designed to accelerate graphics tasks to a programmable architecture that can also execute general-purpose computations. Because of their performance and efficiency, an increasing amount of software relies on them to accelerate data-parallel and computationally intensive sections of code. They have earned a place in many systems, from low-power mobile devices to the biggest data centers in the world. However, GPUs are still plagued by the fact that they essentially have no multiprogramming support, resulting in low system performance if the GPU is shared among multiple programs. In this dissertation we set out to provide rich GPU multiprogramming support by improving multitasking capabilities and increasing virtual memory functionality and performance.
    The main issue hindering multitasking support in GPUs is the non-preemptive execution of GPU kernels. We propose two preemption mechanisms with different design philosophies that can be used by a scheduler to preempt execution on GPU cores and make room for another process. We also argue for spatial sharing of the GPU and propose a concrete hardware scheduler implementation that dynamically partitions the GPU cores among running kernels according to their assigned priorities. Contrary to the assumptions made in related work, we demonstrate that preemptive execution is feasible and is the desired approach to GPU multitasking. We further show improved system fairness and responsiveness with our scheduling policy.
    We also pinpoint that at the core of the insufficient virtual memory support lies the exception handling mechanism used by modern GPUs. Currently, GPUs offload the actual exception handling work to the CPU while the faulting instruction is stalled in the GPU core. This stall-on-fault model prevents some virtual memory features and optimizations and is especially harmful in multiprogrammed environments because it prevents context switching the GPU unless all in-flight faults are resolved. In this dissertation, we propose three GPU core organizations with varying performance-complexity trade-offs that eliminate stall-on-fault execution and enable preemptible exceptions on the GPU (i.e., the faulting instruction can be squashed and restarted later). Building on this support, we implement two use cases and demonstrate their utility. One is a scheme that context-switches the faulted threads and tries to find other useful work to do in the meantime, hiding the latency of the fault and improving system performance. The other enables the fault handling code to run locally on the GPU, instead of relying on CPU offloading, and shows that local fault handling can also improve performance.
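
    The dynamic spatial-partitioning policy can be pictured with a small sketch: a scheduler divides the available GPU cores (streaming multiprocessors) among the running kernels in proportion to their priorities, repartitioning whenever a kernel arrives or finishes. This is a toy model written under my own assumptions, not the hardware scheduler proposed in the dissertation; all names are illustrative.

    def partition_cores(total_cores, kernels):
        """Divide GPU cores among running kernels proportionally to priority,
        guaranteeing each kernel at least one core (largest-remainder rounding)."""
        total_prio = sum(priority for _, priority in kernels)
        shares = {k: total_cores * p / total_prio for k, p in kernels}
        alloc = {k: max(1, int(s)) for k, s in shares.items()}
        # Hand out any leftover cores to the kernels with the largest remainders.
        leftover = total_cores - sum(alloc.values())
        for k, _ in sorted(shares.items(), key=lambda kv: kv[1] - int(kv[1]),
                           reverse=True)[:max(0, leftover)]:
            alloc[k] += 1
        return alloc

    # Three kernels with priorities 4, 2 and 1 sharing a 28-SM GPU.
    print(partition_cores(28, [("kernelA", 4), ("kernelB", 2), ("kernelC", 1)]))
    # e.g. {'kernelA': 16, 'kernelB': 8, 'kernelC': 4}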

    Dependable mapreduce in a cloud-of-clouds

    Doctoral thesis, Informática (Engenharia Informática), Universidade de Lisboa, Faculdade de Ciências, 2017. MapReduce is a simple and elegant programming model suitable for loosely coupled parallelization problems, that is, problems that can be decomposed into subproblems. Hadoop MapReduce has become the most popular framework for performing large-scale computation on off-the-shelf clusters, and it is widely used to process these problems in a parallel and distributed fashion. The framework is highly scalable, deals efficiently with large volumes of unstructured data, and serves as a platform for many other applications. However, it has limitations concerning dependability. Namely, it is prepared only to tolerate crash faults, by re-executing tasks in case of failure, and to detect file corruption using file checksums. Unfortunately, there is evidence that arbitrary faults do occur and can affect the correctness of MapReduce executions. Although such Byzantine faults are considered rare, particular MapReduce applications are critical and intolerant to this type of fault. Furthermore, typical MapReduce implementations are constrained to a single cloud environment. This is a problem, as there is increasing evidence of outages in major cloud offerings, raising concerns about dependence on a single cloud. In this thesis, I propose techniques to improve the dependability of MapReduce systems. The proposed solutions allow MapReduce to scale out computations to a multi-cloud environment, or cloud-of-clouds, to tolerate arbitrary and malicious faults as well as cloud outages. The proposals have three important properties: they increase the dependability of MapReduce by tolerating the faults mentioned above; they require minimal or no modifications to users' applications; and they achieve this increased level of fault tolerance at reasonable cost. To achieve these goals, I introduce three key ideas: minimizing the required replication; applying context-based job scheduling based on cloud and network conditions; and performing fine-grained replication. I evaluated all proposed solutions in real testbed environments running typical MapReduce applications. The results demonstrate interesting trade-offs between resilience and performance when compared to traditional methods. The fundamental conclusion is that the cost introduced by our solutions is small, and thus acceptable for many critical applications.
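
    One way to picture the "minimizing the required replication" idea is a deferred-voting scheme: execute each task on f + 1 clouds first, accept the result if all digests match, and launch additional replicas only when they disagree. The sketch below is a generic illustration written under that assumption; it is not the thesis's exact protocol, and all names are hypothetical.

    import hashlib
    from collections import Counter

    def digest(result: bytes) -> str:
        return hashlib.sha256(result).hexdigest()

    def run_task_with_voting(run_on_cloud, clouds, f=1):
        """Run a task on f+1 clouds; if digests disagree, add replicas one at a
        time until some digest has f+1 matching votes (masks up to f faults)."""
        votes = Counter()
        outputs = {}
        for cloud in clouds[:f + 1]:
            out = run_on_cloud(cloud)
            d = digest(out)
            votes[d] += 1
            outputs[d] = out
        extra = f + 1
        while votes.most_common(1)[0][1] < f + 1 and extra < len(clouds):
            out = run_on_cloud(clouds[extra])        # launch another replica
            d = digest(out)
            votes[d] += 1
            outputs[d] = out
            extra += 1
        winner, count = votes.most_common(1)[0]
        if count < f + 1:
            raise RuntimeError("not enough matching results to mask faults")
        return outputs[winner]

    # Simulated clouds: 'cloud-2' returns a corrupted (arbitrary-fault) result.
    results = {"cloud-1": b"42", "cloud-2": b"corrupted", "cloud-3": b"42"}
    print(run_task_with_voting(lambda c: results[c],
                               ["cloud-1", "cloud-2", "cloud-3"], f=1))  # b'42'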

    Proactive Interference-aware Resource Management in Deep Learning Training Cluster

    Deep Learning (DL) applications are growing at an unprecedented rate across many domains, ranging from weather prediction and map navigation to medical imaging. However, training these deep learning models in large-scale compute clusters faces substantial challenges in the form of low cluster resource utilisation and long job waiting times. State-of-the-art DL cluster resource managers are needed to increase GPU utilisation and maximise throughput. While co-locating DL jobs within the same GPU has been shown to be an effective means towards achieving this, co-location incurs performance interference that slows jobs down. We argue that effective workload placement can minimise DL cluster interference at scheduling time by understanding DL workload characteristics and their respective hardware resource consumption. However, existing DL cluster resource managers reserve isolated GPUs to perform online profiling, directly measuring GPU utilisation and kernel patterns for each unique submitted job. Such a feedback-based, reactive approach results in additional waiting time as well as reduced cluster resource efficiency and availability. In this thesis, we propose Horus: an interference-aware and prediction-based DL cluster resource manager. By empirically studying a series of microbenchmarks and DL workload co-location combinations across heterogeneous GPU hardware, we demonstrate the negative effects of performance interference when co-locating DL workloads, and identify GPU utilisation as a general proxy metric for determining good placement decisions. From these findings, we design Horus, which, in contrast to existing approaches, proactively predicts the GPU utilisation of heterogeneous DL workloads from DL model computation graph features when making placement decisions, removing the need for online profiling and isolated reserved GPUs. Through empirical experimentation on a medium-scale DL cluster as well as a large-scale trace-driven simulation of a production system, we demonstrate that Horus improves cluster GPU utilisation, reduces cluster makespan and waiting time, and can scale to operate across hundreds of machines.
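
    A minimal sketch of the predict-then-place flow described above, under assumptions of my own: a placeholder linear model estimates a job's GPU utilisation from computation-graph features, and the scheduler co-locates the job on the GPU whose combined predicted utilisation is lowest while staying under a saturation cap. The feature names, model, weights, and cap are illustrative and not part of Horus's actual design.

    def predict_gpu_util(features, weights, bias=5.0):
        """Placeholder linear model: estimated GPU utilisation (%) from
        computation-graph features such as FLOPs and parameter count."""
        return min(100.0, bias + sum(weights[k] * v for k, v in features.items()))

    def place_job(job_features, gpus, weights, cap=95.0):
        """Pick the GPU whose utilisation after co-location is lowest while
        staying under the saturation cap; return None if no GPU qualifies."""
        demand = predict_gpu_util(job_features, weights)
        candidates = [(gpu, util + demand) for gpu, util in gpus.items()
                      if util + demand <= cap]
        if not candidates:
            return None                  # queue the job instead of co-locating it
        return min(candidates, key=lambda c: c[1])[0]

    weights = {"gflops_per_step": 0.004, "million_params": 0.05}
    job = {"gflops_per_step": 9000, "million_params": 340}    # hypothetical DL job
    gpus = {"gpu-0": 20.0, "gpu-1": 55.0, "gpu-2": 80.0}      # current utilisation %
    print(place_job(job, gpus, weights))                      # gpu-0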