Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
Energy efficiency is becoming increasingly important for computing systems,
in particular for large-scale HPC facilities. In this work we evaluate, from a
user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS)
techniques, assisted by the power- and energy-monitoring capabilities of modern
processors, to tune applications for energy efficiency. We run selected
kernels and a full HPC application on two high-end processors widely used in
the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate
the available trade-offs between energy-to-solution and time-to-solution,
attempting a function-by-function frequency tuning. Finally, we estimate the
benefits obtainable by running the full code on an HPC multi-GPU node, relative
to the default clock-frequency governors. We instrument our code to accurately
monitor power consumption and execution time without the need for additional
hardware, and we enable it to change CPU and GPU clock frequencies while
running. We analyze our results on the different architectures using a simple
energy-performance model, and derive a number of energy-saving strategies that
can be easily adopted on recent high-end HPC systems for generic applications.
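The trade-off the abstract describes can be illustrated with a toy energy-performance model (all constants below are illustrative assumptions, not the paper's measured data): dynamic power grows roughly as f·V(f)², while only the compute-bound fraction of the runtime shrinks as 1/f, so energy-to-solution is often minimized below the peak frequency.

```python
def time_to_solution(f, work=1.0, mem_frac=0.3, f_ref=1.0):
    """Runtime model: the compute-bound fraction scales as 1/f, while the
    memory-bound fraction is insensitive to core frequency."""
    return work * ((1 - mem_frac) * f_ref / f + mem_frac)

def power(f, p_static=20.0, c=80.0):
    """Power model: a static floor plus a dynamic term ~ C * f * V(f)^2,
    with a toy linear voltage curve V(f) = 0.7 + 0.3 f."""
    v = 0.7 + 0.3 * f
    return p_static + c * f * v * v

def energy_to_solution(f):
    return power(f) * time_to_solution(f)

# sweep normalized frequencies 0.6 .. 1.4 and pick the energy-optimal one
freqs = [round(0.6 + 0.1 * i, 1) for i in range(9)]
best = min(freqs, key=energy_to_solution)
# with these toy constants the energy optimum sits at the low end of the
# sweep, even though time-to-solution is best at the high end
```

The shape of the trade-off, not the numbers, is the point: lowering the clock costs time but can save energy whenever the dynamic-power savings outweigh the longer runtime against static power.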
Higher-order CFD and Interface Tracking Methods on Highly-Parallel MPI and GPU systems
A computational investigation of the effects of higher-order accurate schemes on parallel performance was carried out on two different computational systems: a traditional CPU-based MPI cluster and a system of four Graphics Processing Units (GPUs) controlled by a single quad-core CPU. The investigation was based on the solution of the level set equations for interface tracking using a High-Order Upstream Central (HOUC) scheme. Different variants of the HOUC scheme were employed together with a 3rd-order TVD Runge-Kutta time integration. An increase in performance of two orders of magnitude was seen when comparing a single CPU core to a single GPU, with a greater increase at higher orders of accuracy and at lower precision.
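The 3rd-order TVD Runge-Kutta time integration named above is commonly the Shu–Osher (SSP-RK3) scheme; a minimal sketch for a semi-discrete system du/dt = L(u), checked here on a scalar test equation:

```python
import math

def tvd_rk3_step(u, dt, L):
    """One step of the 3rd-order TVD (strong-stability-preserving)
    Runge-Kutta scheme of Shu and Osher, for du/dt = L(u)."""
    u1 = u + dt * L(u)
    u2 = 0.75 * u + 0.25 * (u1 + dt * L(u1))
    return u / 3.0 + (2.0 / 3.0) * (u2 + dt * L(u2))

# sanity check on du/dt = -u, whose exact solution is exp(-t)
u, dt = 1.0, 0.01
for _ in range(100):
    u = tvd_rk3_step(u, dt, lambda v: -v)
# after t = 1.0, u should be close to exp(-1) (global error is O(dt^3))
```

In the paper's setting, L(u) would be the HOUC spatial discretization of the level set advection term rather than this scalar right-hand side.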
Heterogeneous Computing on Mixed Unstructured Grids with PyFR
PyFR is an open-source high-order accurate computational fluid dynamics
solver for mixed unstructured grids that can target a range of hardware
platforms from a single codebase. In this paper we demonstrate the ability of
PyFR to perform high-order accurate unsteady simulations of flow on mixed
unstructured grids using heterogeneous multi-node hardware. Specifically, after
benchmarking single-node performance for various platforms, PyFR v0.2.2 is used
to undertake simulations of unsteady flow over a circular cylinder at Reynolds
number 3 900 using a mixed unstructured grid of prismatic and tetrahedral
elements on a desktop workstation containing an Intel Xeon E5-2697 v2 CPU, an
NVIDIA Tesla K40c GPU, and an AMD FirePro W9100 GPU. Both the performance and
accuracy of PyFR are assessed. PyFR v0.2.2 is freely available under a 3-Clause
New Style BSD license (see www.pyfr.org).
Comment: 21 pages, 9 figures, 6 tables
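One simple way a heterogeneous run like the one above can balance work across dissimilar devices is to assign mesh elements in proportion to each device's benchmarked throughput. This is a hypothetical sketch with made-up figures, not PyFR's actual (mesh-based) partitioning logic:

```python
def partition_elements(n_elements, throughput):
    """Split n_elements across devices proportionally to their measured
    throughput (hypothetical numbers; real partitioners must also respect
    mesh connectivity to keep communication volume low)."""
    total = sum(throughput.values())
    shares = {d: int(n_elements * t / total) for d, t in throughput.items()}
    # hand leftover elements (from rounding down) to the fastest device
    leftover = n_elements - sum(shares.values())
    fastest = max(throughput, key=throughput.get)
    shares[fastest] += leftover
    return shares

# e.g. relative throughputs from a single-node benchmark (made-up figures)
shares = partition_elements(100_000, {"cpu": 250, "gpu0": 1200, "gpu1": 1100})
```

Weighting by measured rather than theoretical peak performance matters, since achieved throughput per element type can differ widely between a CPU and the two GPUs.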
Implementation of a High Throughput 3GPP Turbo Decoder on GPU
Turbo codes are computationally intensive channel codes that are widely used in current and upcoming wireless standards. A general-purpose graphics processing unit (GPGPU) is a programmable commodity processor that achieves high computational throughput by using many simple cores. In this paper,
we present a 3GPP LTE compliant Turbo decoder accelerator that takes advantage of the processing power of GPU to offer fast Turbo decoding throughput. Several
techniques are used to improve the performance of the decoder. To fully utilize the computational resources on GPU, our decoder can decode multiple codewords simultaneously, divide the workload for a single codeword across multiple cores, and pack multiple codewords to fit the single instruction multiple
data (SIMD) instruction width. In addition, we use shared memory judiciously to enable hundreds of concurrent threads while keeping frequently used
data local to keep memory access fast. To improve efficiency of the decoder in the high SNR regime, we also present a low complexity early termination scheme
based on average extrinsic LLR statistics. Finally, we examine how different workload partitioning choices affect the error-correction performance and the decoder throughput.
Sponsors: Renesas Mobile, Texas Instruments, Xilinx, National Science Foundation
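The early-termination idea — stop iterating once the decoder looks confident — can be sketched as a check on the mean magnitude of the extrinsic LLRs. The threshold, iteration structure, and toy "decoder pass" below are illustrative assumptions, not the paper's parameters:

```python
def should_terminate(extrinsic_llrs, threshold=10.0):
    """Stop iterating once the average extrinsic-LLR magnitude exceeds a
    confidence threshold (the threshold here is a made-up value)."""
    mean_abs = sum(abs(l) for l in extrinsic_llrs) / len(extrinsic_llrs)
    return mean_abs > threshold

def decode(channel_llrs, half_iteration, max_iters=8):
    """Iterate a stand-in for the MAP decoder pass until the
    early-termination test fires or max_iters is reached."""
    ext = [0.0] * len(channel_llrs)
    for it in range(1, max_iters + 1):
        ext = half_iteration(channel_llrs, ext)
        if should_terminate(ext):
            return ext, it
    return ext, max_iters

# toy pass whose confidence grows each iteration, as it would at high SNR
grow = lambda llrs, ext: [e + l for e, l in zip(ext, llrs)]
_, iters = decode([3.0, -4.0, 5.0], grow)
```

The appeal of an average-LLR criterion is that it needs only a reduction over values the decoder already computes, so it adds almost no work per iteration while skipping whole iterations at high SNR.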
GPU Accelerated Scalable Parallel Decoding of LDPC Codes
This paper proposes a flexible low-density parity-check (LDPC) decoder which leverages graphics processing units (GPUs) to provide high decoding throughput. LDPC codes are widely adopted by emerging standards for wireless
communication systems and storage applications due to their near-capacity error correcting performance. To achieve high decoding throughput on GPU, we leverage the parallelism embedded in the check-node computation and variable-node
computation and propose a parallel strategy of partitioning the decoding jobs among the multiprocessors of the GPU. In addition, we propose a scalable multi-codeword decoding scheme to fully utilize the computation resources of the GPU. Furthermore, we develop a novel adaptive performance-tuning method to make
our decoder implementation more flexible and scalable. The experimental results show that our LDPC decoder is scalable and flexible, and that the adaptive performance-tuning method can deliver peak performance on the given GPU architecture.
Sponsors: Renesas Mobile, Samsung, National Science Foundation
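The check-node computation whose parallelism the decoder exploits is, in the common min-sum approximation, independent per check node and per edge. A minimal sketch of one check node's update (the paper's exact schedule and any normalization factor are not reproduced here):

```python
def min_sum_check_update(v2c):
    """Min-sum check-node update: the outgoing message on each edge takes
    the sign product and minimum magnitude of all *other* incoming
    variable-to-check messages. Each check node (and each edge within it)
    can be computed independently, which is what a GPU decoder maps to
    threads."""
    out = []
    for i in range(len(v2c)):
        others = v2c[:i] + v2c[i + 1:]
        sign = 1.0
        for m in others:
            sign = -sign if m < 0 else sign
        out.append(sign * min(abs(m) for m in others))
    return out

msgs = min_sum_check_update([0.5, -1.2, 2.0])
```

A production decoder computes the two smallest magnitudes once per check node instead of re-scanning per edge, but the independence structure — the source of the GPU parallelism — is the same.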
Dwarfs on Accelerators: Enhancing OpenCL Benchmarking for Heterogeneous Computing Architectures
For reasons of both performance and energy efficiency, high-performance
computing (HPC) hardware is becoming increasingly heterogeneous. The OpenCL
framework supports portable programming across a wide range of computing
devices and is gaining influence in programming next-generation accelerators.
To characterize the performance of these devices across a range of applications
requires a diverse, portable and configurable benchmark suite, and OpenCL is an
attractive programming model for this purpose. We present an extended and
enhanced version of the OpenDwarfs OpenCL benchmark suite, with a strong focus
placed on the robustness of applications, curation of additional benchmarks
with an increased emphasis on correctness of results and choice of problem
size. Preliminary results and analysis are reported for eight benchmark codes
on a diverse set of architectures -- three Intel CPUs, five Nvidia GPUs, six
AMD GPUs and a Xeon Phi.
Comment: 10 pages, 5 figures
A Performance-Portable SYCL Implementation of CRK-HACC for Exascale
The first generation of exascale systems will include a variety of machine
architectures, featuring GPUs from multiple vendors. As a result, many
developers are interested in adopting portable programming models to avoid
maintaining multiple versions of their code. It is necessary to document
experiences with such programming models to assist developers in understanding
the advantages and disadvantages of different approaches.
To this end, this paper evaluates the performance portability of a SYCL
implementation of a large-scale cosmology application (CRK-HACC) running on
GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the
process of migrating the original code from CUDA to SYCL and show that
specializing kernels for specific targets can greatly improve performance
portability without significantly impacting programmer productivity. The SYCL
version of CRK-HACC achieves a performance portability of 0.96 with a code
divergence of almost 0, demonstrating that SYCL is a viable programming model
for performance-portable applications.
Comment: 12 pages, 13 figures, 2023 International Workshop on Performance, Portability & Productivity in HPC
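The "performance portability of 0.96" quoted above presumably refers to the widely used Pennycook–Sewall–Lee metric: the harmonic mean of the application's per-platform efficiency over a platform set, defined as zero if the application fails to run anywhere in the set. A minimal implementation:

```python
def performance_portability(efficiencies):
    """Pennycook et al. performance-portability metric: harmonic mean of
    per-platform efficiencies (each in (0, 1]); 0 if the application is
    unsupported on any platform in the set (None or 0 efficiency)."""
    if any(e is None or e == 0 for e in efficiencies):
        return 0.0
    return len(efficiencies) / sum(1.0 / e for e in efficiencies)

# illustrative per-GPU efficiencies (made-up numbers, not CRK-HACC's data)
pp = performance_portability([0.95, 0.97, 0.96])
```

The harmonic mean is deliberately punishing: one poorly performing platform drags the score down far more than an arithmetic mean would, and a single unsupported platform zeroes it, so a score near 0.96 across three vendors' GPUs is a strong result.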
Datacenter Design for Future Cloud Radio Access Network
Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers.
I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding.
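Trellis algorithms such as Viterbi decoding share the structure the dissertation parallelizes: within each trellis stage, every state's path-metric update reads only the previous stage, so the loop over states is data-parallel. A minimal, generic hard-decision Viterbi sketch (a toy stand-in, not the dissertation's CUDA implementation):

```python
def viterbi(observations, states, trans_cost, emit_cost):
    """Generic hard-decision Viterbi over a trellis. Within each stage,
    every state's metric update depends only on the previous stage's
    metrics, so the inner loop over states is data-parallel -- the
    property GPU trellis implementations exploit."""
    metric = {s: emit_cost(s, observations[0]) for s in states}
    back = []
    for obs in observations[1:]:
        new, choice = {}, {}
        for s in states:  # independent per state: one GPU thread each
            prev = min(states, key=lambda p: metric[p] + trans_cost(p, s))
            new[s] = metric[prev] + trans_cost(prev, s) + emit_cost(s, obs)
            choice[s] = prev
        back.append(choice)
        metric = new
    s = min(states, key=metric.get)  # traceback from the best final state
    path = [s]
    for choice in reversed(back):
        s = choice[s]
        path.append(s)
    return path[::-1]

# toy 2-state trellis: cost 1 to switch state, emission cost |state - obs|
path = viterbi([0, 0, 1, 1], [0, 1],
               lambda p, s: 0 if p == s else 1,
               lambda s, o: abs(s - o))
```

The same stage-by-stage pattern underlies the BCJR passes inside a Turbo decoder, which is why a parallelization strategy for one trellis algorithm generalizes to the family.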
Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that a datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis shows that GPU servers save on average 13× more energy and 6× more cost. Thus, I propose that the C-RAN datacenter be built using GPUs as the server platform.
Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a "hill-climbing" power management scheme that combines powering off GPUs with DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among the three, pipelining packets yields the largest throughput improvement, at 10% and 16% for balanced and unbalanced loads, respectively.
PhD dissertation, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd
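The hill-climbing idea — incrementally powering GPUs off or on and stepping DVFS to track the diurnal load — can be sketched as a greedy controller. Everything below (capacity model, frequency levels, step order) is an illustrative assumption, not the dissertation's controller:

```python
def step_config(load, n_gpus, freq, capacity_per_gpu=1.0,
                n_max=8, freqs=(0.6, 0.8, 1.0)):
    """One hill-climbing step: if capacity falls short of the traffic
    load, climb (raise frequency, then power another GPU on); if a
    cheaper neighboring configuration still fits the load, descend.
    Purely illustrative capacity model: n_gpus * freq * capacity_per_gpu."""
    def capacity(n, f):
        return n * f * capacity_per_gpu
    i = freqs.index(freq)
    if capacity(n_gpus, freq) < load:           # under-provisioned: climb
        if i + 1 < len(freqs):
            return n_gpus, freqs[i + 1]
        return min(n_gpus + 1, n_max), freqs[0]
    if i > 0 and capacity(n_gpus, freqs[i - 1]) >= load:
        return n_gpus, freqs[i - 1]             # descend one DVFS level
    if n_gpus > 1 and capacity(n_gpus - 1, freqs[-1]) >= load:
        return n_gpus - 1, freqs[-1]            # power one GPU off
    return n_gpus, freq                         # local optimum: hold

# nighttime lull: load drops to 1.5 units, controller sheds capacity
n, f = 4, 1.0
for _ in range(10):
    n, f = step_config(load=1.5, n_gpus=n, freq=f)
```

Each step changes the configuration by only one "neighbor" move, which keeps the controller robust to noisy load estimates; here it settles at the smallest configuration whose capacity still covers the load.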