
    Energy and thermal models for simulation of workload and resource management in computing systems

    In recent years, we have faced the evolution of high-performance computing (HPC) systems towards higher scale, density and heterogeneity. In particular, hardware vendors along with software providers, HPC centers, and scientists are struggling with the exascale computing challenge. As the density of both computing power and heat is growing, proper energy and thermal management becomes crucial for overall system efficiency. Moreover, an accurate and relatively fast method to evaluate such large-scale computing systems is needed. In this paper we present a way to model the energy and thermal behavior of computing systems. The proposed models can be used to effectively estimate system performance, energy consumption, and energy-efficiency metrics. We evaluate their accuracy by comparing the values calculated from these models against measurements obtained on real hardware. Finally, we show how the proposed models can be applied to workload scheduling and resource management in large-scale computing systems by integrating them into the DCworms simulation framework.
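    As a rough illustration of the kind of model such a simulator might evaluate (the paper's own equations are not reproduced here), the sketch below combines a linear utilization-based power model with a lumped-RC thermal model; all constants (p_idle, p_max, r_thermal, c_thermal) are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (assumed form, not the paper's exact models): a linear
# utilization-based power model plus a lumped-RC thermal model of the kind
# commonly plugged into data-centre simulators such as DCworms.

def node_power(utilization, p_idle=90.0, p_max=220.0):
    """Estimate node power draw [W] from CPU utilization in [0, 1]."""
    return p_idle + (p_max - p_idle) * utilization

def step_temperature(t_current, power, t_ambient=22.0,
                     r_thermal=0.3, c_thermal=400.0, dt=1.0):
    """Advance a lumped-RC thermal model by dt seconds.

    r_thermal [K/W] and c_thermal [J/K] are illustrative constants.
    """
    dT = (power - (t_current - t_ambient) / r_thermal) * dt / c_thermal
    return t_current + dT

if __name__ == "__main__":
    temp, energy_joules = 35.0, 0.0
    # A made-up utilization trace: idle, then a heavy phase, then idle again.
    for util in [0.2] * 60 + [0.9] * 120 + [0.1] * 60:
        p = node_power(util)
        temp = step_temperature(temp, p)
        energy_joules += p          # 1-second steps, so W * s = J
    print(f"final temperature ~ {temp:.1f} C, energy ~ {energy_joules / 3.6e6:.4f} kWh")
```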

    Using an Adaptive HPC Runtime System to Reconfigure the Cache Hierarchy

    The cache hierarchy often consumes a large portion of a processor’s energy. To save energy in HPC environments, this paper proposes software-controlled reconfiguration of the cache hierarchy with an adaptive runtime system. Our approach addresses the two major limitations of other methods that reconfigure the caches: predicting the application’s future and finding the best cache hierarchy configuration. Our approach uses formal language theory to express the application’s pattern and help predict its future. Furthermore, it uses the prevalent Single Program Multiple Data (SPMD) model of HPC codes to quickly find the best configuration in parallel. Our experiments using cycle-level simulations indicate that 67% of the cache energy can be saved with only a 2.4% performance penalty on average. Moreover, we demonstrate that, for some applications, switching to a software-controlled reconfigurable streaming buffer configuration can improve performance by up to 30% and save 75% of the cache energy.
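    A hedged sketch of the general idea follows: a controller tries each candidate cache configuration once per recurring application phase and afterwards reuses the one with the lowest measured energy-delay product. The configuration names and the serial search are illustrative assumptions only; the paper itself expresses phase patterns with a formal grammar and explores configurations in parallel across SPMD ranks.

```python
# Illustrative sketch only (the paper's runtime, pattern grammar, and cache
# knobs are not reproduced here): explore candidate configurations per phase,
# then exploit the one with the lowest energy-delay product (EDP).

class PhaseCacheController:
    def __init__(self, configs):
        self.configs = configs          # hypothetical reconfiguration knobs
        self.results = {}               # phase id -> {config: measured EDP}

    def choose(self, phase_id):
        """Return the configuration to run the upcoming phase with."""
        tried = self.results.get(phase_id, {})
        for cfg in self.configs:        # still exploring: pick an untried config
            if cfg not in tried:
                return cfg
        return min(tried, key=tried.get)  # exploration done: keep the best one

    def report(self, phase_id, config, energy_delay):
        """Record the measured energy-delay product of the finished phase."""
        self.results.setdefault(phase_id, {})[config] = energy_delay

if __name__ == "__main__":
    ctrl = PhaseCacheController(["L2-2MB", "L2-1MB", "L2-512KB", "stream-buffer"])
    # Pretend phase "A" recurs and these EDP values are measured for it.
    measured = {"L2-2MB": 1.0, "L2-1MB": 0.8, "L2-512KB": 0.9, "stream-buffer": 1.4}
    for _ in range(6):
        cfg = ctrl.choose("A")
        ctrl.report("A", cfg, measured[cfg])
    print("chosen for phase A:", ctrl.choose("A"))   # -> L2-1MB
```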

    Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget

    Building future generation supercomputers while constraining their power consumption is one of the biggest challenges faced by the HPC community. For example, the US Department of Energy has set a goal of 20 MW for an exascale (10^18 flops) supercomputer. To realize this goal, a lot of research is being done to revolutionize hardware design to build power-efficient computers and network interconnects. In this work, we propose a software-based online resource management system that leverages hardware-facilitated capabilities to constrain the power consumption of each node in order to optimally allocate power and nodes to a job. Our scheme uses this hardware capability in conjunction with an adaptive runtime system that can dynamically change the resource configuration of a running job, allowing our resource manager to re-optimize allocation decisions for running jobs as new jobs arrive or running jobs terminate. We also propose a performance modeling scheme that estimates the essential power characteristics of a job at any scale. The proposed online resource manager uses these performance characteristics to make scheduling and resource allocation decisions that maximize the job throughput of the supercomputer under a given power budget. We demonstrate the benefits of our approach by using a mix of jobs with different power-response characteristics. We show that with a power budget of 4.75 MW, we can obtain up to a 5.2X improvement in job throughput when compared with the power-unaware SLURM scheduling policy. We corroborate our results with real experiments on a relatively small-scale cluster, in which we obtain a 1.7X improvement.
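    The sketch below is not the paper's optimiser, only an assumed greedy stand-in: it hands out a strict cluster power budget in fixed increments to whichever job gains the most according to a made-up concave performance model, while respecting per-job minimum and maximum power caps. Job names and all model parameters are hypothetical.

```python
# Hedged sketch of the core idea (not the paper's actual resource manager):
# greedily split a strict power budget across jobs by giving each slice of
# power to the job with the largest marginal throughput gain.

def perf(job, power_w):
    """Illustrative concave performance model: diminishing returns in power."""
    usable = max(0.0, power_w - job["p_min"])
    return job["alpha"] * usable ** 0.7

def allocate_power(jobs, budget_w, step_w=1000.0):
    alloc = {j["name"]: j["p_min"] for j in jobs}     # every job gets its floor
    remaining = budget_w - sum(alloc.values())
    while remaining >= step_w:
        # marginal gain of one more power step for each job still below its cap
        gains = {
            j["name"]: perf(j, alloc[j["name"]] + step_w) - perf(j, alloc[j["name"]])
            for j in jobs
            if alloc[j["name"]] + step_w <= j["p_max"]
        }
        if not gains:
            break
        best = max(gains, key=gains.get)
        alloc[best] += step_w
        remaining -= step_w
    return alloc

if __name__ == "__main__":
    jobs = [  # hypothetical jobs with different power-response characteristics
        {"name": "jobA", "p_min": 40e3, "p_max": 120e3, "alpha": 2.0},
        {"name": "jobB", "p_min": 30e3, "p_max": 200e3, "alpha": 1.2},
    ]
    print(allocate_power(jobs, budget_w=250e3))
```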

    Intelligent Load Balancing in Cloud Computer Systems

    Cloud computing is an established technology allowing users to share resources on a scale never before seen in IT history. A cloud system connects multiple individual servers in order to process related tasks in several environments at the same time. Clouds are typically more cost-effective than single computers of comparable computing performance. The sheer physical size of the system itself means that thousands of machines may be involved. The focus of this research was to design a strategy that dynamically allocates tasks without overloading Cloud nodes, so that system stability is maintained at minimum cost. This research has added the following new contributions to the state of knowledge: (i) a novel taxonomy and categorisation of three classes of schedulers, namely OS-level, Cluster and Big Data, which highlights their distinct evolution and underlines their different objectives; (ii) an abstract model of cloud resource utilisation, covering multiple types of resources and the cost of task migration; (iii) experiments with virtual machine live migration, which produced a formula estimating the network traffic generated by this process; (iv) a high-fidelity Cloud workload simulator based on month-long workload traces from Google's computing cells; (v) two approaches to resource management, proposed and examined in the practical part of the manuscript: a centralised metaheuristic load balancer and a decentralised agent-based system. The project involved extensive experiments run on the University of Westminster HPC cluster, and the promising results are presented together with detailed discussions and a conclusion.
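    The thesis' own fitted migration-traffic formula is not reproduced here; as a stand-in, the sketch below uses the standard iterative pre-copy model of live migration, in which each round resends the pages dirtied during the previous round, so total traffic is roughly a geometric series in the ratio of page-dirty rate to link bandwidth. All numeric parameters are illustrative.

```python
# Sketch of a standard pre-copy live-migration traffic model (an assumed
# stand-in, not the thesis' empirical formula). Each pre-copy round resends
# the pages dirtied while the previous round was transferred, giving a
# geometric series in r = page_dirty_rate / link_bandwidth.

def migration_traffic(mem_bytes, dirty_rate_bps, bandwidth_bps,
                      max_rounds=30, stop_threshold_bytes=64 * 2**20):
    """Estimate total bytes sent by iterative pre-copy migration."""
    r = dirty_rate_bps / bandwidth_bps
    total, to_send = 0.0, float(mem_bytes)
    for _ in range(max_rounds):
        total += to_send
        to_send *= r                    # memory dirtied during the last round
        if to_send <= stop_threshold_bytes or r >= 1.0:
            break                       # small enough (or non-converging): stop
    return total + to_send              # final stop-and-copy round

if __name__ == "__main__":
    gib = 2**30
    traffic = migration_traffic(mem_bytes=8 * gib,
                                dirty_rate_bps=100 * 2**20,   # ~100 MiB/s dirtied
                                bandwidth_bps=1.25 * gib)     # ~10 Gbit/s link
    print(f"estimated migration traffic ~ {traffic / gib:.2f} GiB")
```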