    Changement de contexte pour tâches virtualisées à l'échelle des grappes

    Today, cluster resources are managed by allocating time slices to applications, with the amounts specified statically by the users. For a user, the requested resources are either over-estimated, in which case the cluster is under-used, or under-estimated, in which case the computations are in most cases lost. The advent of virtualization has brought some flexibility to the management of applications and cluster resources. However, to optimize the use of these resources and free users from hazardous estimates, it becomes necessary to allocate resources dynamically according to the applications' real needs: to be able to start an application dynamically when a resource becomes free, or to suspend it when the resource must be reassigned. In other words, to build a mechanism comparable to the context switch of standard computers, but for applications running on a cluster. By relying on virtualization, developing such a mechanism in a generic way becomes feasible. In this paper we propose an infrastructure that provides the notion of a context switch for virtualized applications on clusters. This solution allowed us to build a scheduler that runs as many virtualized applications simultaneously as possible. We show that such a solution increases the occupancy rate of our cluster and reduces application turnaround time.
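
    The abstract above describes a cluster-wide analogue of the operating-system context switch for virtualized jobs. The sketch below is a minimal Python illustration of that idea, not the infrastructure proposed in the paper: the VirtualJob and ContextSwitchScheduler classes and their methods are invented for the example, and a real suspend/resume would snapshot and restore virtual machines rather than flip a state flag.

```python
# Minimal sketch (illustrative only) of a cluster-level "context switch":
# suspend a running virtualized job when its nodes must be reclaimed, and
# resume it later when capacity frees up.
from collections import deque


class VirtualJob:
    def __init__(self, name, nodes_needed):
        self.name = name
        self.nodes_needed = nodes_needed
        self.state = "pending"  # pending | running | suspended

    def suspend(self):
        # A real implementation would snapshot the job's VMs to shared storage.
        self.state = "suspended"

    def resume(self):
        # A real implementation would restore the VMs from their snapshots.
        self.state = "running"


class ContextSwitchScheduler:
    """Keeps as many virtualized jobs running as the cluster can hold."""

    def __init__(self, total_nodes):
        self.free_nodes = total_nodes
        self.running = []
        self.waiting = deque()

    def submit(self, job):
        self.waiting.append(job)
        self.schedule()

    def schedule(self):
        # Start waiting jobs while capacity allows; when the head of the queue
        # does not fit, suspend the most recently started job to reclaim nodes.
        while self.waiting:
            job = self.waiting[0]
            if job.nodes_needed <= self.free_nodes:
                self.waiting.popleft()
                if job.state == "suspended":
                    job.resume()           # restore previously suspended VMs
                else:
                    job.state = "running"  # first start of the job's VMs
                self.running.append(job)
                self.free_nodes -= job.nodes_needed
            elif self.running:
                victim = self.running.pop()
                victim.suspend()
                self.free_nodes += victim.nodes_needed
                self.waiting.append(victim)
            else:
                break  # nothing left to preempt; the job must wait
```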

    Topology-aware GPU scheduling for learning workloads in cloud environments

    Recent advances in hardware, such as systems with multiple GPUs and their availability in the cloud, are enabling deep learning in various domains including health care, autonomous vehicles, and the Internet of Things. Multi-GPU systems exhibit complex connectivity among GPUs and between GPUs and CPUs. Workload schedulers must consider hardware topology and workload communication requirements in order to allocate CPU and GPU resources for optimal execution time and improved utilization in shared cloud environments. This paper presents a new topology-aware workload placement strategy to schedule deep learning jobs on multi-GPU systems. The placement strategy is evaluated with a prototype on a Power8 machine with Tesla P100 cards, showing speedups of up to ≈1.30x compared to state-of-the-art strategies; the proposed algorithm achieves this result by allocating GPUs that satisfy workload requirements while preventing interference. Additionally, a large-scale simulation shows that the proposed strategy provides higher resource utilization and performance in cloud systems.

    This project is supported by the IBM/BSC Technology Center for Supercomputing collaboration agreement. It has also received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P and Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493). We thank our IBM Research colleagues Alaa Youssef and Asser Tantawi for the valuable discussions. We also thank SC17 committee member Blair Bethwaite of Monash University for his constructive feedback on the earlier drafts of this paper.
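
    As a rough illustration of topology-aware placement (not the paper's algorithm), the sketch below scores candidate GPU subsets by their aggregate pairwise link bandwidth and picks the best-connected set among the free GPUs. The bandwidth matrix, link values, and function names are invented for the example.

```python
# Illustrative topology-aware GPU placement: prefer GPU sets whose pairwise
# connectivity (e.g. NVLink > same-socket PCIe > cross-socket) is highest.
from itertools import combinations


def pairwise_bandwidth(gpu_set, bw):
    """Sum of link bandwidths between all GPU pairs in the candidate set."""
    return sum(bw[a][b] for a, b in combinations(sorted(gpu_set), 2))


def place_job(num_gpus, free_gpus, bw):
    """Return the free-GPU subset with the highest aggregate bandwidth."""
    best = max(combinations(free_gpus, num_gpus),
               key=lambda cand: pairwise_bandwidth(cand, bw),
               default=None)
    return list(best) if best else None


# Toy 4-GPU topology: GPUs 0-1 and 2-3 share a fast link (80 GB/s); all other
# pairs only reach each other over a slower path (10 GB/s).
bw = [[0, 80, 10, 10],
      [80, 0, 10, 10],
      [10, 10, 0, 80],
      [10, 10, 80, 0]]

print(place_job(2, [0, 1, 2, 3], bw))  # -> [0, 1], a well-connected pair
```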

    DCDB Wintermute: Enabling Online and Holistic Operational Data Analytics on HPC Systems

    As we approach the exascale era, the size and complexity of HPC systems continue to increase, raising concerns about their manageability and sustainability. For this reason, more and more HPC centers are experimenting with fine-grained monitoring coupled with Operational Data Analytics (ODA) to optimize the efficiency and effectiveness of system operations. However, while monitoring is a common reality in HPC, there is no well-stated and comprehensive list of requirements, nor matching frameworks, to support holistic and online ODA. This leads to insular ad-hoc solutions, each addressing only specific aspects of the problem. In this paper we propose Wintermute, a novel generic framework to enable online ODA on large-scale HPC installations. Its design is based on the results of a literature survey of common operational requirements. We implement Wintermute on top of the holistic DCDB monitoring system, offering a large variety of configuration options to accommodate the varying requirements of ODA applications. Moreover, Wintermute is based on a set of logical abstractions to ease the configuration of models at a large scale and maximize code re-use. We highlight Wintermute's flexibility through a series of practical case studies, each targeting a different aspect of the management of HPC systems, and then demonstrate the small resource footprint of our implementation.

    Comment: Accepted for publication at the 29th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2020).
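
    The following sketch illustrates, in generic Python and without reference to Wintermute's or DCDB's actual APIs, the kind of online operator abstraction such an ODA framework provides: an operator consumes a stream of sensor readings and publishes a derived metric over a bounded window. All class and sensor names are hypothetical.

```python
# Illustrative online analytics operator: keep a bounded window of samples per
# sensor and publish a sliding-window average that downstream consumers can use.
from collections import deque, defaultdict


class MovingAverageOperator:
    """Publishes a sliding-window average for each input sensor."""

    def __init__(self, window=60):
        self.samples = defaultdict(lambda: deque(maxlen=window))

    def push(self, sensor, value):
        # Called by the monitoring layer for every new reading.
        self.samples[sensor].append(value)

    def output(self, sensor):
        buf = self.samples[sensor]
        return sum(buf) / len(buf) if buf else None


op = MovingAverageOperator(window=3)
for v in (210.0, 230.0, 250.0, 400.0):
    op.push("node1.power_watts", v)
print(op.output("node1.power_watts"))  # average of the last 3 samples
```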

    VSCM: a Virtual Server Consolidation Manager for Cluster

    Virtual server consolidation uses virtual machines to encapsulate applications running on multiple physical servers in a cluster and then consolidates them onto a small number of servers. Nowadays, with the expansion of enterprise-class data centers, virtual server consolidation can eliminate a large number of servers, helping enterprises reduce hardware and operating costs significantly and improve server utilization greatly. In this paper, we propose the VSCM manager for virtual clusters, which solves the consolidation problem from a globally optimal view and also takes migration overhead into account. Experimental results in a virtual cluster demonstrate that VSCM can greatly reduce both the number of servers and the migration overhead.
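
    For illustration only, the sketch below shows a consolidation pass in the same spirit: a simple first-fit-decreasing packing of VM loads onto servers that reports both how many servers remain in use and how many VMs would migrate. VSCM itself solves the problem from a globally optimal view, so this greedy heuristic is only a stand-in, and the data layout is invented for the example.

```python
# Illustrative consolidation: first-fit-decreasing bin packing of VM loads,
# counting migrations under the assumption that consolidated server i
# corresponds to original host i.
def consolidate(vms, capacity):
    """vms: list of (vm_id, load, current_host); returns plan, servers, migrations."""
    servers = []  # each server is [used_load, [vm_ids]]
    plan = {}
    for vm_id, load, _host in sorted(vms, key=lambda v: v[1], reverse=True):
        for idx, srv in enumerate(servers):
            if srv[0] + load <= capacity:
                srv[0] += load
                srv[1].append(vm_id)
                plan[vm_id] = idx
                break
        else:
            servers.append([load, [vm_id]])
            plan[vm_id] = len(servers) - 1
    migrations = sum(1 for vm_id, _load, host in vms if plan[vm_id] != host)
    return plan, len(servers), migrations


vms = [("vm1", 0.6, 0), ("vm2", 0.3, 1), ("vm3", 0.3, 2), ("vm4", 0.5, 3)]
plan, n_servers, n_migrations = consolidate(vms, capacity=1.0)
print(n_servers, n_migrations)  # 4 original hosts packed onto 2 servers, 3 migrations
```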

    PIASA: A power and interference aware resource management strategy for heterogeneous workloads in cloud data centers

    Cloud data centers have been progressively adopted in different scenarios, as reflected in the execution of heterogeneous applications with diverse workloads and diverse quality of service (QoS) requirements. Virtual machine (VM) technology eases resource management in physical servers and helps cloud providers achieve goals such as optimization of energy consumption. However, the performance of an application running inside a VM is not guaranteed due to the interference among co-hosted workloads sharing the same physical resources. Moreover, the different types of co-hosted applications with diverse QoS requirements, as well as the dynamic behavior of the cloud, make efficient resource provisioning an even more challenging problem in cloud data centers. In this paper, we address the problem of resource allocation within a data center that runs different types of application workloads, particularly CPU- and network-intensive applications. To address these challenges, we propose an interference- and power-aware management mechanism that combines a performance deviation estimator and a scheduling algorithm to guide resource allocation in virtualized environments. We conduct simulations by injecting synthetic workloads whose characteristics follow the latest version of the Google Cloud tracelogs. The results indicate that our performance-enforcing strategy is able to fulfill contracted SLAs of real-world environments while reducing energy costs by as much as 21%.
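
    A hypothetical sketch of the kind of scoring an interference- and power-aware scheduler might perform is shown below. The interference table, weights, and power model are invented for the example and are not PIASA's performance deviation estimator; they only illustrate the trade-off between co-location penalties and the cost of powering on additional hosts.

```python
# Illustrative interference- and power-aware placement: pick the host with the
# lowest combined score of estimated co-location slowdown and marginal power.

# Estimated slowdown when a workload of one class shares a host with another.
INTERFERENCE = {
    ("cpu", "cpu"): 0.05,
    ("cpu", "net"): 0.02,
    ("net", "cpu"): 0.02,
    ("net", "net"): 0.15,
}


def placement_score(host, new_class, idle_power=100.0, power_per_load=2.0,
                    alpha=1.0, beta=0.01):
    """Lower is better: weighted interference penalty plus marginal power."""
    penalty = sum(INTERFERENCE[(new_class, c)] for c in host["classes"])
    # A host that is already active only pays the load-proportional power; an
    # empty host must also pay its idle power to be switched on.
    marginal_power = power_per_load + (0.0 if host["classes"] else idle_power)
    return alpha * penalty + beta * marginal_power


hosts = [
    {"name": "h1", "classes": ["cpu", "cpu"]},
    {"name": "h2", "classes": ["net"]},
    {"name": "h3", "classes": []},
]
best = min(hosts, key=lambda h: placement_score(h, "net"))
print(best["name"])  # -> "h1": co-locating with CPU-bound jobs scores best
```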

    A Survey of Research on Power Management Techniques for High Performance Systems

    This paper surveys the research on power management techniques for high performance systems, including both commercial high performance clusters and scientific high performance computing (HPC) systems. Power consumption has rapidly risen to an intolerable scale, resulting in both high operating costs and high failure rates, so it is now a major cause for concern and has imposed new challenges on the development of high performance systems. In this paper, we first review the basic mechanisms that underlie power management techniques. Then we survey two fundamental techniques for power management: metrics and profiling. After that, we review the research for the two major types of high performance systems: commercial clusters and supercomputers. Based on this, we discuss the new opportunities and problems presented by the recent adoption of virtualization techniques, and present the most recent research on them. Finally, we summarise and discuss future research directions.
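
    As a small worked example of the "metrics" side of power management (a general concept, not a result from this survey), the snippet below computes energy and the energy-delay product (EDP) for a frequency-scaling trade-off, showing how a setting can save energy yet score worse once the longer runtime is penalized. The numbers are illustrative.

```python
# Two common power-management metrics: energy consumed and the energy-delay
# product (EDP), which penalizes settings that save power by running longer.
def energy_joules(avg_power_watts, runtime_seconds):
    return avg_power_watts * runtime_seconds


def energy_delay_product(avg_power_watts, runtime_seconds):
    return energy_joules(avg_power_watts, runtime_seconds) * runtime_seconds


# DVFS-style trade-off: the reduced frequency uses less energy overall, but its
# EDP is worse because the job runs 30% longer.
full = (200.0, 100.0)      # average watts, runtime seconds at full frequency
scaled = (140.0, 130.0)    # average watts, runtime seconds at reduced frequency
for label, (p, t) in (("full", full), ("scaled", scaled)):
    print(label, energy_joules(p, t), energy_delay_product(p, t))
```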