31 research outputs found

    A Reinforcement Learning Approach for Performance-aware Reduction in Power Consumption of Data Center Compute Nodes

    Full text link
    As exascale computing becomes a reality, the energy needs of compute nodes in cloud data centers will continue to grow. A common approach to reducing this energy demand is to limit the power consumption of hardware components when workloads experience bottlenecks elsewhere in the system. However, designing a resource controller capable of detecting and limiting power consumption on the fly is a complex problem and can also adversely impact application performance. In this paper, we explore the use of Reinforcement Learning (RL) to design a power-capping policy on cloud compute nodes using observations of current power consumption and instantaneous application performance (heartbeats). Leveraging the Argo Node Resource Management (NRM) software stack in conjunction with the Intel Running Average Power Limit (RAPL) hardware control mechanism, we design an agent that controls the maximum power supplied to processors without compromising application performance. Employing a Proximal Policy Optimization (PPO) agent to learn an optimal policy on a mathematical model of the compute nodes, we use the STREAM benchmark to demonstrate and evaluate how a trained agent running on actual hardware can act to balance power consumption and application performance.
    Comment: This manuscript consists of a total of 10 pages with 8 figures and 3 tables and is awaiting publication at IC2E-202
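
    The control loop described above can be sketched compactly: observe power and application heartbeats, pick a power cap, and score the outcome with a reward that trades performance against power. The sketch below is a toy version of that loop, assuming a made-up node model and an epsilon-greedy bandit in place of the paper's PPO agent; the sensing and the actual NRM/RAPL actuation are stubbed out.

```python
# Toy power-capping loop: a stub agent stands in for PPO; node_model and
# the reward weights are illustrative assumptions, not the paper's model.
import random

CAPS_W = [60, 80, 100, 120]           # candidate package power caps (watts)

def node_model(cap_w):
    """Toy stand-in for a compute node: returns (power draw, perf)."""
    power = min(cap_w, 110.0)         # uncapped draw ~110 W in this model
    perf = min(1.0, cap_w / 100.0)    # workload throttles below 100 W
    return power, perf

def reward(power, perf, alpha=0.5):
    # Higher is better: keep performance high and power low.
    return perf - alpha * power / max(CAPS_W)

q = {c: 0.0 for c in CAPS_W}          # running value estimate per cap
n = {c: 0 for c in CAPS_W}

for step in range(500):
    cap = random.choice(CAPS_W) if random.random() < 0.1 \
          else max(q, key=q.get)      # epsilon-greedy stand-in for PPO
    power, perf = node_model(cap)     # a real agent reads RAPL + heartbeats
    r = reward(power, perf)
    n[cap] += 1
    q[cap] += (r - q[cap]) / n[cap]   # incremental mean update

print("learned cap:", max(q, key=q.get))
```

    For this toy model the loop settles on the 100 W cap, the smallest cap that does not throttle the simulated workload.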

    Workshop on Resource Arbitration for Dynamic Runtimes (RADR)

    Get PDF
    International audience. Efficient dynamic allocation of compute-node resources, such as cores, by independent libraries or runtime systems can be a nightmare. Scientists writing application components have no way to efficiently specify and compose resource-hungry components. As application software stacks grow deeper and multiple runtime layers compete for resources from the operating system, it has become clear that intelligent cooperation is needed. Resources such as compute cores, in-package memory, and even electrical power must be orchestrated dynamically across application components, with the ability to query each other and respond appropriately. A more integrated solution would reduce intra-application resource competition and improve performance. Furthermore, application runtime systems could request and allocate specific hardware assets and adjust runtime tuning parameters up and down the software stack. The goal of this workshop is to gather and share the latest scholarly research from the community working on these issues, at all levels of the HPC software stack. This includes thread allocation, resource arbitration and management, containers, and so on, from runtime-system designers to compilers. We will also use panel sessions and keynote talks to discuss these issues, share visions, and present solutions.

    Introduction to RADR 2019

    Get PDF
    International audience. Efficient dynamic allocation of compute-node resources, such as cores, by independent libraries or runtime systems can be a nightmare. Scientists writing application components have no way to efficiently specify and compose resource-hungry components. As application software stacks grow deeper and multiple runtime layers compete for resources from the operating system, it has become clear that intelligent cooperation is needed. Resources such as compute cores, in-package memory, and even electrical power must be orchestrated dynamically across application components, with the ability to query each other and respond appropriately. A more integrated solution would reduce intra-application resource competition and improve performance. Furthermore, application runtime systems could request and allocate specific hardware assets and adjust runtime tuning parameters up and down the software stack. The goal of this workshop is to gather and share the latest scholarly research from the community working on these issues, at all levels of the HPC software stack. This includes thread allocation, resource arbitration and management, containers, and so on, from runtime-system designers to compilers. We will also use panel sessions and keynote talks to discuss these issues, share visions, and present solutions.

    Scope: Over the last five years, the number of nodes in large supercomputers has remained largely unchanged. In fact, the Oak Ridge National Laboratory machine leading the Top500 list, Summit, has fewer nodes than its predecessor, which is 20 times slower. Machines are getting faster not by adding nodes, but by adding parallelism, cores, and hierarchical memory to each compute node. This shift in how computers are scaled makes it imperative that parallel computing resources within a node be carefully orchestrated to achieve maximum performance. Dynamically allocating and managing threads, and mapping these threads to cores, is a challenge that requires cooperation and coordination between the different components of the software stack.
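
    As a concrete, minimal illustration of the coordination problem (my sketch, not a workshop artifact), the Linux-only example below has two cooperating processes partition the visible cores between themselves instead of oversubscribing; the even split is an arbitrary example policy.

```python
# Two processes divide the available cores rather than competing for all
# of them. Linux-only: os.sched_setaffinity is not available elsewhere.
import os
import multiprocessing as mp

def worker(name, cores):
    os.sched_setaffinity(0, cores)           # pin this process to its share
    print(name, "running on cores", sorted(os.sched_getaffinity(0)))

if __name__ == "__main__":
    all_cores = sorted(os.sched_getaffinity(0))   # cores we may use
    half = len(all_cores) // 2 or 1
    shares = [set(all_cores[:half]),
              set(all_cores[half:]) or {all_cores[0]}]
    procs = [mp.Process(target=worker, args=(f"runtime-{i}", s))
             for i, s in enumerate(shares)]
    for p in procs: p.start()
    for p in procs: p.join()
```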

    Comment rater la validation de votre algorithme d'ordonnancement

    Get PDF
    National audience. Imagine you have just developed a new scheduling algorithm: congratulations! To obtain qualitative information about your algorithm and compare it with others, you have decided, like many before you, to run simulations. Very classically, your simulations use random input data (here, directed acyclic graphs).
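
    For readers unfamiliar with the setup being critiqued, one classical way to produce such random DAGs is the Erdős–Rényi-style construction sketched below: fix a vertex order and keep each forward edge with probability p, which guarantees acyclicity. This is a generic sketch of one method among several; the choice of method is precisely what this line of work argues can bias a validation.

```python
# Random DAG via ordered vertices: edge i -> j (i < j) kept with prob. p.
import random

def random_dag(n, p, seed=None):
    rng = random.Random(seed)
    edges = [(i, j) for i in range(n) for j in range(i + 1, n)
             if rng.random() < p]              # only forward edges: acyclic
    return list(range(n)), edges

nodes, edges = random_dag(10, 0.3, seed=42)
print(len(edges), "edges, e.g.", edges[:5])
```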

    Narrowing the Search Space of Applications Mapping on Hierarchical Topologies

    Get PDF
    To be held in conjunction with SC21. International audience. Processor architectures at exascale and beyond are expected to continue to suffer from nonuniform access to in-die and node-wide shared resources. Mapping applications onto these resource hierarchies is an ongoing performance concern, requiring specific care to increase locality and resource sharing while limiting the contention that ensues. Application-agnostic approaches to searching for efficient mappings are based on heuristics; indeed, the size of the search space makes it impractical to find optimal solutions today, and this will only worsen as the complexity of computing systems increases over time. In this paper, we leverage the hierarchical structure of modern compute nodes to reduce the size of this search space. As a result, we facilitate the search for optimal mappings and improve the ability to evaluate existing heuristics. Using widely known benchmarks, we show that permuting thread and process placement within each node of a hierarchical topology leads to similar performance. As a result, the mapping search space can be narrowed down by several orders of magnitude when performing exhaustive search. This reduced search space will enable the design of new approaches, including exhaustive search or automatic exploration. Moreover, it provides new insights into heuristic-based approaches, including better upper bounds and a smaller solution space.
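
    A back-of-the-envelope computation shows why quotienting out such permutations shrinks the space by orders of magnitude: mappings that differ only by swapping identical subtrees at some level preserve all locality relations. The sketch below counts one-thread-per-core placements on an assumed balanced 2x4x4 hierarchy (an illustrative topology, not one from the paper) before and after dividing by the tree's symmetries.

```python
# Count placements on a balanced hierarchy before/after symmetry reduction.
from math import factorial, prod

arities = [2, 4, 4]                      # children per level: 32 cores total
cores = prod(arities)

naive = factorial(cores)                 # one-thread-per-core placements
aut = 1                                  # tree symmetries we can quotient out
nodes_at_level = 1
for a in arities:
    aut *= factorial(a) ** nodes_at_level   # permute a identical subtrees
    nodes_at_level *= a                     # at every node of this level

print(f"{cores} cores: {naive:.3e} placements, "
      f"~{naive / aut:.3e} after symmetry reduction (factor {aut:.2e})")
```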

    Environnements pour l'analyse expérimentale d'applications de calcul haute performance

    No full text
    High performance computing systems are increasingly complex. Nowadays, each compute node can contain several sockets or several cores sharing multiple memory caches in a hierarchical way. To understand an application's performance on such systems, or to develop new algorithms and validate their behavior, an experimental study is often required. In this thesis, we consider two types of experimental analysis: execution on real systems and simulation using randomly generated inputs. In both cases, scientists can improve the quality of their performance analysis by controlling the environment (hardware or input data) used. We therefore propose two methods to control the allocation of hardware resources inside a system: one for the processor time given to an application, the other for the amount of cache memory available to it. Both methods let us study how an application's behavior changes according to the amount of resources allocated. Based on modifications to the operating system, we implemented these methods for Linux and demonstrated their use in the analysis of several parallel applications. Regarding simulation, we studied the problem of randomly generating directed acyclic graphs (DAGs) for scheduler simulations. While numerous generation algorithms exist for this problem, most papers in the field rely on ad hoc, poorly validated implementations. To tackle this issue, we propose a complete environment providing most of the classical generation methods. We validated this environment through large-scale analysis campaigns on Grid'5000, verifying the known statistical properties of most algorithms. We also demonstrated that the measured performance of a scheduler can be affected by the generation method used, identifying a reversal phenomenon: changing the generation algorithm can reverse the outcome of a comparison between two schedulers.
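
    The thesis implements processor-time control through operating-system modifications; as a rough present-day analogue rather than the thesis's mechanism, the Linux cgroups-v2 sketch below throttles a process's CPU quota so an application can be re-run under different processor-time allocations. The cgroup path is hypothetical, the cpu controller must be enabled, and writing to cgroupfs requires privileges.

```python
# Rough analogue of processor-time control using cgroups v2 "cpu.max".
# Assumes a writable v2 hierarchy at the hypothetical path below.
import os

CGROUP = "/sys/fs/cgroup/expcontrol"     # hypothetical experiment cgroup

def limit_cpu(pid, quota_us=50_000, period_us=100_000):
    """Grant `pid` at most quota/period of one CPU (here: 50%)."""
    os.makedirs(CGROUP, exist_ok=True)   # mkdir in cgroupfs creates a cgroup
    with open(os.path.join(CGROUP, "cpu.max"), "w") as f:
        f.write(f"{quota_us} {period_us}")
    with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
        f.write(str(pid))                # move the target into the cgroup

# Example: re-run an application under 25%, 50%, 100% CPU allocations and
# compare its behavior at each level, in the spirit of the analysis above.
```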

    Environments for the experimental analysis of HPC applications.

    No full text
    High performance computing systems are increasingly complex. Nowadays, each compute node can contain several sockets or several cores sharing multiple memory caches in a hierarchical way. To understand an application's performance on such systems, or to develop new algorithms and validate their behavior, an experimental study is often required. In this thesis, we consider two types of experimental analysis: execution on real systems and simulation using randomly generated inputs. In both cases, scientists can improve the quality of their performance analysis by controlling the environment (hardware or input data) used. We therefore propose two methods to control the allocation of hardware resources inside a system: one for the processor time given to an application, the other for the amount of cache memory available to it. Both methods let us study how an application's behavior changes according to the amount of resources allocated. Based on modifications to the operating system, we implemented these methods for Linux and demonstrated their use in the analysis of several parallel applications. Regarding simulation, we studied the problem of randomly generating directed acyclic graphs (DAGs) for scheduler simulations. While numerous generation algorithms exist for this problem, most papers in the field rely on ad hoc, poorly validated implementations. To tackle this issue, we propose a complete environment providing most of the classical generation methods. We validated this environment through large-scale analysis campaigns on Grid'5000, verifying the known statistical properties of most algorithms. We also demonstrated that the measured performance of a scheduler can be affected by the generation method used, identifying a reversal phenomenon: changing the generation algorithm can reverse the outcome of a comparison between two schedulers.
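
    The reversal phenomenon suggests a simple experimental harness: run the same two schedulers on DAGs from two different generators and compare average makespans per generator. The sketch below has that shape, with toy generators and priority rules (stand-ins, not the thesis's setup); whether the ranking actually flips depends on the parameters, the point is the structure of the experiment.

```python
# Compare two list schedulers across two random-DAG generators.
import random

def gnp_dag(n, p, rng):                  # forward edges kept with prob. p
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if rng.random() < p]

def chain_dag(n, rng):                   # long chains: a different regime
    return [(i, i + 1) for i in range(n - 1) if rng.random() < 0.9]

def makespan(n, edges, m, prio):
    """Unit-time list scheduling of a DAG on m machines."""
    succ = {i: [] for i in range(n)}
    indeg = [0] * n
    for u, v in edges:
        succ[u].append(v); indeg[v] += 1
    ready = [i for i in range(n) if indeg[i] == 0]
    t = done = 0
    while done < n:
        batch = sorted(ready, key=lambda u: prio(u, succ))[:m]
        ready = [r for r in ready if r not in batch]
        for u in batch:                  # releasing successors
            for v in succ[u]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    ready.append(v)
        done += len(batch); t += 1
    return t

fanout = lambda u, succ: -len(succ[u])   # scheduler A: most successors first
fifo   = lambda u, succ: u               # scheduler B: generation order

rng = random.Random(0)
for name, gen in [("gnp", lambda: gnp_dag(50, 0.05, rng)),
                  ("chain", lambda: chain_dag(50, rng))]:
    dags = [gen() for _ in range(100)]
    for label, prio in [("A", fanout), ("B", fifo)]:
        avg = sum(makespan(50, d, 4, prio) for d in dags) / len(dags)
        print(name, label, round(avg, 2))
```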

    Ggen : génération aléatoire de graphes pour l'ordonnancement

    No full text
    National audience. No abstract.