738 research outputs found

    Power-aware load balancing of large scale MPI applications

    Get PDF
    Power consumption is a very important issue for HPC community, both at the level of one application or at the level of whole workload. Load imbalance of a MPI application can be exploited to save CPU energy without penalizing the execution time. An application is load imbalanced when some nodes are assigned more computation than others. The nodes with less computation can be run at lower frequency since otherwise they have to wait for the nodes with more computation blocked in MPI calls. A technique that can be used to reduce the speed is Dynamic Voltage Frequency Scaling (DVFS). Dynamic power dissipation is proportional to the product of the frequency and the square of the supply voltage, while static power is proportional to the supply voltage. Thus decreasing voltage and/or frequency results in power reduction. Furthermore, over-clocking can be applied in some CPUs to reduce overall execution time. This paper investigates the impact of using different gear sets , over-clocking, and application and platform propreties to reduce CPU power. A new algorithm applying DVFS and CPU over-clocking is proposed that reduces execution time while achieving power savings comparable to prior work. The results show that it is possible to save up to 60% of CPU energy in applications with high load imbalance. Our results show that six gear sets achieve, on average, results close to the continuous frequency set that has been used as a baseline.Peer ReviewedPostprint (published version

    A Three Step Blind Approach for Improving HPC Systems' Energy Performance

    Get PDF
    International audienceNowadays, there is no doubt that energy consumption has become a limiting factor in the design and operation of high performance computing (HPC) systems. This is evidenced by the rise of efforts both from the academia and the industry to reduce the energy consumption of those systems. Unlike hardware solutions, software initiatives targeting HPC systems' energy consumption reduction despite their effectiveness are often limited for reasons including: (i) the program specific nature of the solution proposed; (ii) the need of deep understanding of applications at hand; (iii) proposed solutions are often difficult to use by novices and/or are designed for single task environments. This paper propose a three step blind system-wide, application independent, fine-grain, and easy to use (user friendly) methodology for improving energy performance of HPC systems. The methodology typically breaks into phase detection, phase characterization, and phase identification and system reconfiguration. And it is blind in the sense that it does not require any knowledge from users. It relies upon reconfigurable capabilities offered by the majority of HPC subsystems -- including the processor, storage, memory, and communication subsystems -- to reduce the overall energy consumption of the system (excluding network equipments) at runtime. We also present an implementation of our methodology through which we demonstrate its effectiveness via static analyses and experiments using benchmarks representative of HPC workloads

    A Workload-Aware, Eco-Friendly Daemon for Cluster Computing

    Get PDF
    This paper presents an eco-friendly daemon that reduces power consumption while better maintaining high performance via a novel behavioral quantification of workload. Specifically, our behavioral quantification achieves a more accurate workload characterization than previous approaches by inferring "processor stall cycles due to off-chip activities." This quantification, in turn, provides a foundation upon which we construct an interval-based, power-aware, run-time algorithm that is implemented within a system-wide daemon. We then evaluate our power-aware daemon in a cluster-computing environment with the NAS Parallel Benchmarks. The results indicate that our novel behavioral quantification of workload allows our power-aware daemon to more tightly control performance while delivering substantial energy savings

    Power Saving Experiments for Large Scale Global Optimization

    Get PDF
    Green computing, an emerging ïŹeld of research that seeks to reduce excess power consumption in high performance computing (HPC), is gaining popularity among researchers. Research in this ïŹeld often relies on simulation or only uses a small cluster, typically 8 or 16 nodes, because of the lack of hardware support. In contrast, System G at Virginia Tech is a 2592 processor supercomputer equipped with power aware components suitable for large scale green computing research. DIRECT is a deterministic global optimization algorithm, implemented in the mathematical software package VTDIRECT95. This paper explores the potential energy savings for the parallel implementation of DIRECT, called pVTdirect, when used with a large scale computational biology application, parameter estimation for a budding yeast cell cycle model, on System G. Two power aware approaches for pVTdirect are developed and compared against the CPUSPEED power saving system tool. The results show that knowledge of the parallel workload of the underlying application is beneficial for power management

    Phase-Based Application-Driven Hierarchical Power Management on the Single-chip Cloud Computer

    Full text link
    To improve energy efficiency processors allow for Dy-namic Voltage and Frequency Scaling (DVFS), which enables changing their performance and power consumption on-the-fly. Many-core architectures, such as the Single-chip Cloud Computer (SCC) experimental processor from Intel Labs, have DVFS infrastructures that scale by having many more independent voltage and frequency domains on-die than to-day’s multi-cores. This paper proposes a novel, hierarchical, and transparent client-server power management scheme applicable to such architectures. The scheme tries to minimize energy consump-tion within a performance window taking into consideration not only the local information for cores within frequency do-mains but also information that spans multiple frequency and voltage domains. We implement our proposed hierarchical power control using a novel application-driven phase detection and predic-tion approach for Message Passing Interface (MPI) applica-tions, a natural choice on the SCC with its fast on-chip net-work and its non-coherentmemory hierarchy. This phase pre-dictor operates as the front-end to the hierarchical DVFS con-troller, providing the necessary DVFS scheduling points. Experimental results with SCC hardware show that our ap-proach provides significant improvement of the EnergyDelay Product (EDP) of as much as 27.2%, and 11.4 % on average, with an average increase in execution time of 7.7 % over a baseline version without DVFS. These improvements come from both improved phase prediction accuracy and more ef-fective DVFS control of the domains, compared to existing approaches

    RUNTIME METHODS TO IMPROVE ENERGY EFFICIENCY IN SUPERCOMPUTING APPLICATIONS

    Get PDF
    Energy efficiency in supercomputing is critical to limit operating costs and carbon footprints. While the energy efficiency of future supercomputing centers needs to improve at all levels, the energy consumed by the processing units is a large fraction of the total energy consumed by High Performance Computing (HPC) systems. HPC applications use a parallel programming paradigm like the Message Passing Interface (MPI) to coordinate computation and communication among thousands of processors. With dynamically-changing factors both in hardware and software affecting energy usage of processors, there exists a need for power monitoring and regulation at runtime to achieve savings in energy. This dissertation highlights an adaptive runtime framework that enables processors with core-specific power control by dynamically adapting to workload characteristics to reduce power with little or no performance impact. Two opportunities to improve the energy efficiency of processors running MPI applications are identified - computational workload imbalance and waiting on memory. Monitoring of performance and power regulation is performed by the framework transparently within the MPI runtime system, eliminating the need for code changes to MPI applications. The effect of enforcing power limits (capping) on processors is also investigated. Experiments on 32 nodes (1024 cores) show that in presence of workload imbalance, the runtime reduces Central Processing Unit (CPU) frequency on cores not on the critical path, thereby reducing power and hence energy usage without deteriorating performance. Using this runtime, six MPI mini-applications and a full MPI application show an overall 20% decrease in energy use with less than 1% increase in execution time. In addition, the lowering of frequency on non-critical cores reduces run-to-run performance variation and improves performance. For the full application, an average speedup of 11% is seen, while the power is lowered by about 31% for an energy savings of up to 42%. Another experiment on 16 nodes (256 cores) that are power capped also shows performance improvement along with power reduction. Thus, energy optimization can also be a performance optimization. For applications that are limited by memory access times, memory metrics identified facilitate lowering of power by up to 32% without adversely impacting performance.Doctor of Philosoph

    BSLD threshold driven parallel job scheduling for energy efficient HPC centers

    Get PDF
    Recently, power awareness in high performance computing (HPC) community has increased significantly. While CPU power reduction of HPC applications using Dynamic Voltage Frequency Scaling (DVFS) has been explored thoroughly, CPU power management for large scale parallel systems at system level has left unexplored. In this paper we propose a power-aware parallel job scheduler assuming DVFS enabled clusters. Traditional parallel job schedulers determine when a job will be run, power aware ones should assign CPU frequency which it will be run at. We have introduced two adjustable thresholds in order to enable fine grain energy performance trade-off control. Since our power reduction approach is policy independent it can be added to any parallel job scheduling policy. Furthermore, we have done an analysis of HPC system dimension. Running an application at lower frequency on more processors can be more energy efficient than running it at the highest CPU frequency on less processors. This paper investigates whether having more DVFS enabled processors and same load can lead to better energy efficiency and performance. Five workload logs from systems in production use with up to 9 216 processors are simulated to evaluate the proposed algorithm and the dimensioning problem. Our approach decreases CPU energy by 7%- 18% on average depending on allowed job performance penalty. Applying the same frequency scaling algorithm on 20% larger system, CPU energy needed to execute same load can be decreased by almost 30% while having same or better job performance.Postprint (published version

    Dynamic Power Management for Reactive Stream Processing on the SCC Tiled Architecture

    Get PDF
    This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.Dynamic voltage and frequency scaling} (DVFS) is a means to adjust the computing capacity and power consumption of computing systems to the application demands. DVFS is generally useful to provide a compromise between computing demands and power consumption, especially in the areas of resource-constrained computing systems. Many modern processors support some form of DVFS. In this article we focus on the development of an execution framework that provides light-weight DVFS support for reactive stream-processing systems (RSPS). RSPS are a common form of embedded control systems, operating in direct response to inputs from their environment. At the execution framework we focus on support for many-core scheduling for parallel execution of concurrent programs. We provide a DVFS strategy for RSPS that is simple and lightweight, to be used for dynamic adaptation of the power consumption at runtime. The simplicity of the DVFS strategy became possible by sole focus on the application domain of RSPS. The presented DVFS strategy does not require specific assumptions about the message arrival rate or the underlying scheduling method. While DVFS is a very active field, in contrast to most existing research, our approach works also for platforms like many-core processors, where the power settings typically cannot be controlled individually for each computational unit. We also support dynamic scheduling with variable workload. While many research results are provided with simulators, in our approach we present a parallel execution framework with experiments conducted on real hardware, using the SCC many-core processor. The results of our experimental evaluation confirm that our simple DVFS strategy provides potential for significant energy saving on RSPS.Peer reviewe

    Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications

    Get PDF
    Energy efficiency is becoming increasingly important for computing systems, in particular for large scale HPC facilities. In this work we evaluate, from an user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors in order to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. We finally estimate the benefits obtainable running the full code on a HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need of any additional hardware, and we enable it to change CPUs and GPUs clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy saving strategies which can be easily adopted on recent high-end HPC systems for generic applications
    • 

    corecore