971 research outputs found

    Power-aware load balancing of large scale MPI applications

    Get PDF
    Power consumption is a very important issue for HPC community, both at the level of one application or at the level of whole workload. Load imbalance of a MPI application can be exploited to save CPU energy without penalizing the execution time. An application is load imbalanced when some nodes are assigned more computation than others. The nodes with less computation can be run at lower frequency since otherwise they have to wait for the nodes with more computation blocked in MPI calls. A technique that can be used to reduce the speed is Dynamic Voltage Frequency Scaling (DVFS). Dynamic power dissipation is proportional to the product of the frequency and the square of the supply voltage, while static power is proportional to the supply voltage. Thus decreasing voltage and/or frequency results in power reduction. Furthermore, over-clocking can be applied in some CPUs to reduce overall execution time. This paper investigates the impact of using different gear sets , over-clocking, and application and platform propreties to reduce CPU power. A new algorithm applying DVFS and CPU over-clocking is proposed that reduces execution time while achieving power savings comparable to prior work. The results show that it is possible to save up to 60% of CPU energy in applications with high load imbalance. Our results show that six gear sets achieve, on average, results close to the continuous frequency set that has been used as a baseline.Peer ReviewedPostprint (published version

    BSLD threshold driven parallel job scheduling for energy efficient HPC centers

    Get PDF
    Recently, power awareness in high performance computing (HPC) community has increased significantly. While CPU power reduction of HPC applications using Dynamic Voltage Frequency Scaling (DVFS) has been explored thoroughly, CPU power management for large scale parallel systems at system level has left unexplored. In this paper we propose a power-aware parallel job scheduler assuming DVFS enabled clusters. Traditional parallel job schedulers determine when a job will be run, power aware ones should assign CPU frequency which it will be run at. We have introduced two adjustable thresholds in order to enable fine grain energy performance trade-off control. Since our power reduction approach is policy independent it can be added to any parallel job scheduling policy. Furthermore, we have done an analysis of HPC system dimension. Running an application at lower frequency on more processors can be more energy efficient than running it at the highest CPU frequency on less processors. This paper investigates whether having more DVFS enabled processors and same load can lead to better energy efficiency and performance. Five workload logs from systems in production use with up to 9 216 processors are simulated to evaluate the proposed algorithm and the dimensioning problem. Our approach decreases CPU energy by 7%- 18% on average depending on allowed job performance penalty. Applying the same frequency scaling algorithm on 20% larger system, CPU energy needed to execute same load can be decreased by almost 30% while having same or better job performance.Postprint (published version

    BSLD threshold driven power management policy for HPC centers

    Get PDF
    In this paper, we propose a power-aware parallel job scheduler assuming DVFS enabled clusters. A CPU frequency assignment algorithm is integrated into the well established EASY backfilling job scheduling policy. Running a job at lower frequency results in a reduction in power dissipation and accordingly in energy consumption. However, lower frequencies introduce a penalty in performance. Our frequency assignment algorithm has two adjustable parameters in order to enable fine grain energy-performance trade-off control. Furthermore, we have done an analysis of HPC system dimension. This paper investigates whether having more DVFS enabled processors for same load can lead to better energy efficiency and performance. Five workload traces from systems in production use with up to 9 216 processors are simulated to evaluate the proposed algorithm and the dimensioning problem. Our approach decreases CPU energy by 7%– 18% on average depending on allowed job performance penalty. Using the power-aware job scheduling for 20% larger system, CPU energy needed to execute same load can be decreased by almost 30% while having same or better job performance.Peer ReviewedPostprint (published version

    Optimizing job performance under a given power constraint in HPC centers

    Get PDF
    Never-ending striving for performance has resulted in a tremendous increase in power consumption of HPC centers. Power budgeting has become very important from several reasons such as reliability, operating costs and limited power draw due to the existing infrastructure. In this paper we propose a power budget guided job scheduling policy that maximize overall job performance for a given power budget. We have shown that using DVFS under a power constraint performance can be significantly improved as it allows more jobs to run simultaneously leading to shorter wait times. Aggressiveness of frequency scaling applied to a job depends on instantaneous power consumption and on the job's predicted performance. Our policy has been evaluated for four workload traces from systems in production use with up to 4 008 processors. The results show that our policy achieves up to two times better performance compared to power budgeting without DVFS. Moreover it leads to 23% lower CPU energy consumption on average. Furthermore, we have investigated how much job performance and energy efficiency can be improved under our policy and same power budget by an increase in the number of DVFS enabled processors.Peer ReviewedPostprint (published version

    Contributions to the efficient use of general purpose coprocessors: kernel density estimation as case study

    Get PDF
    142 p.The high performance computing landscape is shifting from assemblies of homogeneous nodes towards heterogeneous systems, in which nodes consist of a combination of traditional out-of-order execution cores and accelerator devices. Accelerators provide greater theoretical performance compared to traditional multi-core CPUs, but exploiting their computing power remains as a challenging task.This dissertation discusses the issues that arise when trying to efficiently use general purpose accelerators. As a contribution to aid in this task, we present a thorough survey of performance modeling techniques and tools for general purpose coprocessors. Then we use as case study the statistical technique Kernel Density Estimation (KDE). KDE is a memory bound application that poses several challenges for its adaptation to the accelerator-based model. We present a novel algorithm for the computation of KDE that reduces considerably its computational complexity, called S-KDE. Furthermore, we have carried out two parallel implementations of S-KDE, one for multi and many-core processors, and another one for accelerators. The latter has been implemented in OpenCL in order to make it portable across a wide range of devices. We have evaluated the performance of each implementation of S-KDE in a variety of architectures, trying to highlight the bottlenecks and the limits that the code reaches in each device. Finally, we present an application of our S-KDE algorithm in the field of climatology: a novel methodology for the evaluation of environmental models

    Complexity plots

    Get PDF
    In this paper, we present a novel visualization technique for assisting in observation and analysis of algorithmic\ud complexity. In comparison with conventional line graphs, this new technique is not sensitive to the units of\ud measurement, allowing multivariate data series of different physical qualities (e.g., time, space and energy) to be juxtaposed together conveniently and consistently. It supports multivariate visualization as well as uncertainty visualization. It enables users to focus on algorithm categorization by complexity classes, while reducing visual impact caused by constants and algorithmic components that are insignificant to complexity analysis. It provides an effective means for observing the algorithmic complexity of programs with a mixture of algorithms and blackbox software through visualization. Through two case studies, we demonstrate the effectiveness of complexity plots in complexity analysis in research, education and application

    Dynamic Power Management for Reactive Stream Processing on the SCC Tiled Architecture

    Get PDF
    This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.Dynamic voltage and frequency scaling} (DVFS) is a means to adjust the computing capacity and power consumption of computing systems to the application demands. DVFS is generally useful to provide a compromise between computing demands and power consumption, especially in the areas of resource-constrained computing systems. Many modern processors support some form of DVFS. In this article we focus on the development of an execution framework that provides light-weight DVFS support for reactive stream-processing systems (RSPS). RSPS are a common form of embedded control systems, operating in direct response to inputs from their environment. At the execution framework we focus on support for many-core scheduling for parallel execution of concurrent programs. We provide a DVFS strategy for RSPS that is simple and lightweight, to be used for dynamic adaptation of the power consumption at runtime. The simplicity of the DVFS strategy became possible by sole focus on the application domain of RSPS. The presented DVFS strategy does not require specific assumptions about the message arrival rate or the underlying scheduling method. While DVFS is a very active field, in contrast to most existing research, our approach works also for platforms like many-core processors, where the power settings typically cannot be controlled individually for each computational unit. We also support dynamic scheduling with variable workload. While many research results are provided with simulators, in our approach we present a parallel execution framework with experiments conducted on real hardware, using the SCC many-core processor. The results of our experimental evaluation confirm that our simple DVFS strategy provides potential for significant energy saving on RSPS.Peer reviewe

    Software Approaches to Manage Resource Tradeoffs of Power and Energy Constrained Applications

    Get PDF
    Power and energy efficiency have become an increasingly important design metric for a wide spectrum of computing devices. Battery efficiency, which requires a mixture of energy and power efficiency, is exceedingly important especially since there have been no groundbreaking advances in battery capacity recently. The need for energy and power efficiency stretches from small embedded devices to portable computers to large scale data centers. The projected future of computing demand, referred to as exascale computing, demands that researchers find ways to perform exaFLOPs of computation at a power bound much lower than would be required by simply scaling today's standards. There is a large body of work on power and energy efficiency for a wide range of applications and at different levels of abstraction. However, there is a lack of work studying the nuances of different tradeoffs that arise when operating under a power/energy budget. Moreover, there is no work on constructing a generalized model of applications running under power/energy constraints, which allows the designer to optimize their resource consumption, be it power, energy, time, bandwidth, or space. There is need for an efficient model that can provide bounds on the optimality of an application's resource consumption, becoming a basis against which online resource management heuristics can be measured. In this thesis, we tackle the problem of managing resource tradeoffs of power/energy constrained applications. We begin by studying the nuances of power/energy tradeoffs with the response time and throughput of stream processing applications. We then study the power performance tradeoff of batch processing applications to identify a power configuration that maximizes performance under a power bound. Next, we study the tradeoff of power/energy with network bandwidth and precision. Finally, we study how to combine tradeoffs into a generalized model of applications running under resource constraints. The work in this thesis presents detailed studies of the power/energy tradeoff with response time, throughput, performance, network bandwidth, and precision of stream and batch processing applications. To that end, we present an adaptive algorithm that manages stream processing tradeoffs of response time and throughput at the CPU level. At the task-level, we present an online heuristic that adaptively distributes bounded power in a cluster to improve performance, as well as an offline approach to optimally bound performance. We demonstrate how power can be used to reduce bandwidth bottlenecks and extend our offline approach to model bandwidth tradeoffs. Moreover, we present a tool that identifies parts of a program that can be downgraded in precision with minimal impact on accuracy, and maximal impact on energy consumption. Finally, we combine all the above tradeoffs into a flexible model that is efficient to solve and allows for bounding and/or optimizing the consumption of different resources
    • …
    corecore