1,268 research outputs found

Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications

Author: Biferale
Calore
Crimi
Dick
Etinski
Ge
Khabi
Lim
Mantovani
Mazouz
Peraza
Sbragaglia
Scagliarini
Succi
Sundriyal
Williams
Wittmann
Publication venue: 'Wiley'
Publication date: 01/01/2017

Energy efficiency is becoming increasingly important for computing systems, in particular for large-scale HPC facilities. In this work we evaluate, from a user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors, in order to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. We finally estimate the benefits obtainable by running the full code on an HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need for any additional hardware, and we enable it to change CPU and GPU clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy-saving strategies which can be easily adopted on recent high-end HPC systems for generic applications.
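The trade-off between energy-to-solution and time-to-solution mentioned in this abstract can be illustrated with a minimal sketch. It assumes that energy-to-solution is average power multiplied by run time; the `Sample` type, the helper names, and all numbers in `SAMPLES` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One hypothetical measurement of a kernel at a fixed clock frequency."""
    freq_mhz: int
    time_s: float        # time-to-solution
    avg_power_w: float   # average power over the run

    @property
    def energy_j(self) -> float:
        # Energy-to-solution, assuming roughly constant power during the run.
        return self.avg_power_w * self.time_s


def min_energy(samples):
    """Return the sample with the lowest energy-to-solution."""
    return min(samples, key=lambda s: s.energy_j)


def min_energy_within_slowdown(samples, max_slowdown):
    """Lowest-energy sample whose run time stays within max_slowdown
    times the fastest observed run time."""
    fastest = min(s.time_s for s in samples)
    allowed = [s for s in samples if s.time_s <= fastest * max_slowdown]
    return min(allowed, key=lambda s: s.energy_j)


# Made-up measurements, for illustration only.
SAMPLES = [
    Sample(freq_mhz=1200, time_s=10.0, avg_power_w=100.0),
    Sample(freq_mhz=1000, time_s=11.0, avg_power_w=85.0),
    Sample(freq_mhz=800, time_s=14.0, avg_power_w=70.0),
]

if __name__ == "__main__":
    best = min_energy(SAMPLES)
    print(best.freq_mhz, best.energy_j)
```

With these invented numbers the lowest clock is not the most energy-efficient one, because the longer run time outweighs the lower power draw; a per-function tuning would repeat the same selection for each function.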

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università di Ferrara

Mitigation of performance variability induced by Checkpoint-Restart using DVFS

Author: Kokolis Apostolos
Κοκόλης Απόστολος
Publication date: 30/09/2015

DSpace at NTUA

Exploring performance and power properties of modern multicore chips via simple machine models

Author: Chen
Hager
Hoisie
Hähnel
Kerbyson
Li
Nudd
Qian
Rotem
Succi
Suleman
Treibig
Wellein
Wolf-Gladrow
Zeiser
Ziegler
Publication venue: 'Wiley'
Publication date: 19/03/2014

Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single- and multi-core performance of streaming kernels. The model refines the well-known roofline model, since it can predict the scaling and the saturation behavior of bandwidth-limited loop kernels on a multicore chip. The saturation point is especially relevant for considerations of energy consumption. From power dissipation measurements of benchmark programs with vastly different hardware requirements, we derive a simple, phenomenological power model for the Sandy Bridge processor. Together with the ECM model, we are able to explain many peculiarities in the performance and power behavior of multicore processors, and derive guidelines for energy-efficient execution of parallel programs. Finally, we show that the ECM and power models can be successfully used to describe the scaling and power behavior of a lattice-Boltzmann flow solver code. Comment: 23 pages, 10 figures. Typos corrected, DOI added.
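As a rough illustration of the roofline bound and of the saturation behavior this abstract refers to, the sketch below assumes the standard roofline formula (attainable performance is the smaller of peak performance and memory bandwidth times arithmetic intensity) and a simplified scaling rule in which performance grows linearly with core count until it reaches the bandwidth-limited ceiling. It is not the ECM model itself; the function names and all example values are hypothetical.

```python
import math


def roofline_gflops(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Roofline bound: limited either by peak compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)


def scaled_gflops(n_cores, single_core_gflops, saturated_gflops):
    """Simplified scaling: linear in the core count until saturation."""
    return min(n_cores * single_core_gflops, saturated_gflops)


def saturation_cores(single_core_gflops, saturated_gflops):
    """Smallest core count at which the saturated performance is reached."""
    return math.ceil(saturated_gflops / single_core_gflops)


if __name__ == "__main__":
    # Hypothetical chip: 100 GFlop/s peak, 40 GB/s memory bandwidth,
    # kernel with 0.125 flop/byte, 1.5 GFlop/s on a single core.
    ceiling = roofline_gflops(100.0, 40.0, 0.125)
    for n in range(1, 9):
        print(n, scaled_gflops(n, 1.5, ceiling))
    print("saturates at", saturation_cores(1.5, ceiling), "cores")
```

Under these assumptions, cores beyond the saturation point add power draw without adding performance, which is why the saturation point matters for energy considerations.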

arXiv.org e-Print Archive

Crossref

Industry Paper: On the Performance of Commodity Hardware for Low Latency and Low Jitter Packet Processing

Author: Almgren Magnus
Bonnier Staffan
Gillander Linus
Johansson Bengt
Landsiedel Olaf
Neish Trevor
Papatriantafilou Marina
Stylianopoulos Charalampos
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2020

With the introduction of Virtual Network Functions (VNF), network processing is no longer done solely on special-purpose hardware. Instead, deploying network functions on commodity servers increases flexibility and has been proven effective for many network applications. However, new industrial applications and the Internet of Things (IoT) call for event-based systems and middleware that can deliver ultra-low and predictable latency, which presents a challenge for the packet processing infrastructure they are deployed on. In this industry experience paper, we take a hands-on look at the performance of network functions on commodity servers to determine the feasibility of using them in existing and future latency-critical event-based applications. We identify sources of significant latency (delays in packet processing and forwarding) and jitter (variation in latency), and we propose application- and system-level improvements for removing them or keeping them within required limits. Our results show that network functions that are highly optimized for throughput perform sub-optimally under the very different requirements set by latency-critical applications, compared to latency-optimized versions that have up to 9.8X lower latency. We also show that hardware-aware, system-level configurations, such as disabling frequency scaling technologies, greatly reduce jitter, by up to 2.4X, and lead to more predictable latency.

Crossref

Chalmers Research

An extension to VORO++ for multithreaded computation of Voronoi cells

Author: Lazar Emanuel A.
Lu Jiayin
Rycroft Chris H.
Publication date: 08/07/2023

VORO++ is a software library written in C++ for computing the Voronoi tessellation, a technique in computational geometry that is widely used for analyzing systems of particles. VORO++ was released in 2009 and is based on computing the Voronoi cell for each particle individually. Here, we take advantage of modern computer hardware and extend the original serial version to allow for multithreaded computation of Voronoi cells via the OpenMP application programming interface. We test the performance of the code and demonstrate that we can achieve parallel efficiencies greater than 95% in many cases. The multithreaded extension follows standard OpenMP programming paradigms, allowing it to be incorporated into other programs. We provide an example of this using the VoroTop software library, performing a multithreaded Voronoi cell topology analysis of up to 102.4 million particles. Comment: Fix typo and section number.
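The parallel efficiency figure quoted in this abstract can be read through the usual definition, sketched below: speedup is serial time divided by parallel time, and efficiency is speedup divided by the number of threads. The function names and the timings are hypothetical and do not come from the VORO++ paper.

```python
def speedup(t_serial_s, t_parallel_s):
    """Ratio of the serial run time to the multithreaded run time."""
    return t_serial_s / t_parallel_s


def parallel_efficiency(t_serial_s, t_parallel_s, n_threads):
    """Speedup per thread; 1.0 corresponds to ideal linear scaling."""
    return speedup(t_serial_s, t_parallel_s) / n_threads


if __name__ == "__main__":
    # Made-up timings: 80 s on one thread, 10.5 s on eight threads.
    print(parallel_efficiency(80.0, 10.5, 8))
```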

arXiv.org e-Print Archive

eScholarship - University of California