Comparative evaluation of bandwidth-bound applications on the Intel Xeon CPU MAX Series
In this paper we explore the performance of the Intel Xeon CPU MAX Series,
representing the most significant new variation upon the classical CPU
architecture since the Intel Xeon Phi Processor. Given the availability of a
large on-package high-bandwidth memory, the bandwidth-to-compute ratio has
significantly shifted compared to other CPUs on the market. Since a large
fraction of HPC workloads are sensitive to the available bandwidth, we explore
how this architecture performs on a selection of HPC proxies and applications
that are mostly sensitive to bandwidth, and how it compares to the previous 3rd
generation Intel Xeon Scalable processors (codenamed Ice Lake) and an AMD EPYC
7003 Series Processor with 3D V-Cache Technology (codenamed Milan-X). We
explore performance with different parallel implementations (MPI, MPI+OpenMP,
MPI+SYCL), compiled with different compilers and flags, and executed with or
without hyperthreading. We show how performance bottlenecks shift from
bandwidth to communication latencies for some applications, and we demonstrate
speedups of 2.0x-4.3x over the previous generation.
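To make the bandwidth-bound regime concrete, the following is a minimal sketch of a STREAM-style triad kernel in C++ with OpenMP; the array size, names, and scaling factor are illustrative and not taken from the paper's benchmark suite.

    // STREAM-style triad: performance is limited by memory bandwidth,
    // not compute, which is the regime the paper's benchmarks target.
    // Illustrative only; compile with an OpenMP flag such as -fopenmp.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    int main() {
        const std::size_t n = std::size_t(1) << 24;   // large enough to exceed caches
        std::vector<double> a(n), b(n, 1.0), c(n, 2.0);
        const double scalar = 3.0;

        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i)
            a[i] = b[i] + scalar * c[i];              // 3 memory streams, 2 flops per element

        std::printf("a[0] = %f\n", a[0]);
        return 0;
    }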
Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library
We present an analysis of optimizing the performance of a single C++11 source
code using the Alpaka hardware abstraction library. For this we use the general
matrix multiplication (GEMM) algorithm in order to show that compilers can
optimize Alpaka code effectively when tuning key parameters of the algorithm.
We do not intend to rival existing, highly optimized DGEMM versions, but merely
choose this example to prove that Alpaka allows for platform-specific tuning
with a single source code. In addition we analyze the optimization potential
available with vendor-specific compilers when confronted with the heavily
templated abstractions of Alpaka. We specifically test the code for bleeding
edge architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL)
and Haswell architecture as well as IBM's Power8 system. On some of these we
are able to reach almost 50% of the peak floating-point performance
using the aforementioned means. When adding compiler-specific #pragmas we are
able to reach 5 TFLOP/s on a P100 and over 1 TFLOP/s on a KNL system.
Comment: Accepted paper for the P^3MA workshop at ISC 2017 in Frankfurt
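The tuning idea the paper exploits can be illustrated without Alpaka's actual API: expose the tuning parameters (here, a tile size) as compile-time template parameters, so each platform instantiates its own specialized variant from one source. This is a plain C++ sketch; the tile values are hypothetical, and the paper's tuned values are found per architecture.

    // Single-source tiled GEMM with the tile size as a compile-time
    // parameter. Assumes row-major n x n matrices and a pre-zeroed C.
    // Not Alpaka code; a sketch of the single-source tuning idea.
    #include <cstddef>

    template <std::size_t TileSize>
    void gemm_tiled(const double* A, const double* B, double* C, std::size_t n) {
        for (std::size_t ii = 0; ii < n; ii += TileSize)
            for (std::size_t kk = 0; kk < n; kk += TileSize)
                for (std::size_t jj = 0; jj < n; jj += TileSize)
                    for (std::size_t i = ii; i < ii + TileSize && i < n; ++i)
                        for (std::size_t k = kk; k < kk + TileSize && k < n; ++k)
                            for (std::size_t j = jj; j < jj + TileSize && j < n; ++j)
                                C[i * n + j] += A[i * n + k] * B[k * n + j];
    }

    // Hypothetical per-platform instantiations; each target would pick
    // the tile size that suits its cache or vector width.
    template void gemm_tiled<32>(const double*, const double*, double*, std::size_t);
    template void gemm_tiled<64>(const double*, const double*, double*, std::size_t);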
Providing performance portable numerics for Intel GPUs
With discrete Intel GPUs entering the high-performance computing landscape, there is an urgent need for production-ready software stacks for these platforms. In this article, we report how we enable the Ginkgo math library to execute on Intel GPUs by developing a kernel backend based on the DPC++ programming environment. We discuss conceptual differences between the CUDA and DPC++ programming models and describe workflows for simplified code conversion. We evaluate the performance of basic and advanced sparse linear algebra routines available in Ginkgo's DPC++ backend against hardware-specific performance bounds and compare against routines providing the same functionality that ship with Intel's oneMKL vendor library.
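As a flavor of the programming model involved, here is a minimal sketch of a sparse matrix-vector product (CSR format, one work-item per row) written against the SYCL 2020 API that DPC++ implements. Function and variable names are illustrative; this is not Ginkgo's actual backend code.

    // CSR SpMV as a SYCL kernel. Buffers copy data to the device; the
    // result vector y is written back when its buffer is destroyed.
    #include <sycl/sycl.hpp>
    #include <vector>

    void spmv_csr(sycl::queue& q,
                  const std::vector<int>& row_ptr,    // size nrows + 1
                  const std::vector<int>& col_idx,
                  const std::vector<double>& val,
                  const std::vector<double>& x,
                  std::vector<double>& y) {
        const std::size_t nrows = y.size();
        sycl::buffer<int>    rp(row_ptr.data(), sycl::range<1>(row_ptr.size()));
        sycl::buffer<int>    ci(col_idx.data(), sycl::range<1>(col_idx.size()));
        sycl::buffer<double> va(val.data(),     sycl::range<1>(val.size()));
        sycl::buffer<double> xb(x.data(),       sycl::range<1>(x.size()));
        sycl::buffer<double> yb(y.data(),       sycl::range<1>(nrows));

        q.submit([&](sycl::handler& h) {
            sycl::accessor rpa(rp, h, sycl::read_only);
            sycl::accessor cia(ci, h, sycl::read_only);
            sycl::accessor vaa(va, h, sycl::read_only);
            sycl::accessor xa(xb, h, sycl::read_only);
            sycl::accessor ya(yb, h, sycl::write_only);
            h.parallel_for(sycl::range<1>(nrows), [=](sycl::id<1> idx) {
                const int row = static_cast<int>(idx[0]);
                double sum = 0.0;
                for (int k = rpa[row]; k < rpa[row + 1]; ++k)
                    sum += vaa[k] * xa[cia[k]];
                ya[row] = sum;
            });
        });
    }   // yb's destructor blocks and writes the result back into y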
FLASH 1.0: A Software Framework for Rapid Parallel Deployment and Enhancing Host Code Portability in Heterogeneous Computing
In this paper, we present FLASH 1.0, a C++-based software framework for rapid
parallel deployment and enhancing host code portability in heterogeneous
computing. FLASH takes a novel approach in describing kernels and dynamically
dispatching them in a hardware-agnostic manner. FLASH features truly
hardware-agnostic frontend interfaces, which not only unify the compile-time
control flow but also enforce a portability-optimized code organization that
imposes a demarcation between computational (performance-critical) and
functional (non-performance-critical) codes as well as the separation of
hardware-specific and hardware-agnostic codes in the host application. We use
static code analysis to measure the hardware independence ratio of popular HPC
applications and show that up to 99.72% code portability can be achieved with
FLASH. Similarly, we measure the complexity of state-of-the-art portable
programming models and show that a code reduction of up to 2.2x can be achieved
for two common HPC kernels while maintaining 100% code portability with a
normalized framework overhead between 1% - 13% of the total kernel runtime. The
codes are available at https://github.com/PSCLab-ASU/FLASH.
Comment: 12 pages
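The separation FLASH enforces, with hardware-specific code isolated behind a hardware-agnostic host interface and kernels dispatched dynamically, can be sketched with a simple kernel registry. This is a hypothetical illustration of the idea, not FLASH's actual API.

    // Hypothetical kernel registry: host code names a kernel, and the
    // registry resolves it to whichever backend was registered. The
    // hardware-specific body lives only inside the registered callable.
    #include <cstddef>
    #include <functional>
    #include <map>
    #include <stdexcept>
    #include <string>

    using Kernel = std::function<void(double*, std::size_t)>;

    class KernelRegistry {
        std::map<std::string, Kernel> kernels_;
    public:
        void add(const std::string& name, Kernel k) { kernels_[name] = std::move(k); }
        void run(const std::string& name, double* data, std::size_t n) const {
            auto it = kernels_.find(name);
            if (it == kernels_.end()) throw std::runtime_error("no backend for " + name);
            it->second(data, n);   // hardware-specific code hidden behind the interface
        }
    };

    int main() {
        KernelRegistry reg;
        // A GPU build would register a device implementation under the
        // same name; the host call site below would not change.
        reg.add("scale", [](double* d, std::size_t n) {
            for (std::size_t i = 0; i < n; ++i) d[i] *= 2.0;
        });

        double data[4] = {1, 2, 3, 4};
        reg.run("scale", data, 4);   // hardware-agnostic host code
        return 0;
    }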
Data Parallel C++
Learn how to accelerate C++ programs using data parallelism. This open access book enables C++ programmers to be at the forefront of this exciting and important new development that is helping to push computing to new levels. It is full of practical advice, detailed explanations, and code examples to illustrate key topics. Data parallelism in C++ enables access to parallel resources in a modern heterogeneous system, freeing you from being locked into any particular computing device. Now a single C++ application can use any combination of devices, including GPUs, CPUs, FPGAs, and AI ASICs, that are suitable to the problems at hand. This book begins by introducing data parallelism and foundational topics for effective use of the SYCL standard from the Khronos Group and Data Parallel C++ (DPC++), the open source compiler used in this book. Later chapters cover advanced topics including error handling, hardware-specific programming, communication and synchronization, and memory model considerations. Data Parallel C++ provides you with everything needed to use SYCL for programming heterogeneous systems. What You'll Learn: accelerate C++ programs using data-parallel programming; target multiple device types (e.g. CPU, GPU, FPGA); use SYCL and SYCL compilers; connect with computing's heterogeneous future via Intel's oneAPI initiative. Who This Book Is For: those new to data-parallel programming, and computer programmers interested in data-parallel programming using C++.
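The programming model the book teaches can be seen in a minimal SYCL vector addition: a queue targets whatever device is available, and buffers and accessors manage data movement. The sizes and names here are illustrative, not an excerpt from the book.

    // Minimal SYCL 2020 vector addition. The default queue selects an
    // available device (CPU, GPU, or FPGA emulator).
    #include <sycl/sycl.hpp>
    #include <iostream>
    #include <vector>

    int main() {
        const std::size_t n = 1024;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

        sycl::queue q;   // default device selector
        {
            sycl::buffer<float> ab(a.data(), sycl::range<1>(n));
            sycl::buffer<float> bb(b.data(), sycl::range<1>(n));
            sycl::buffer<float> cb(c.data(), sycl::range<1>(n));
            q.submit([&](sycl::handler& h) {
                sycl::accessor x(ab, h, sycl::read_only);
                sycl::accessor y(bb, h, sycl::read_only);
                sycl::accessor z(cb, h, sycl::write_only);
                h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                    z[i] = x[i] + y[i];
                });
            });
        }   // buffers go out of scope and write back to the host vectors

        std::cout << "c[0] = " << c[0] << '\n';   // expect 3
        return 0;
    }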
Evaluating Portable Parallelization Strategies for Heterogeneous Architectures in High Energy Physics
High-energy physics (HEP) experiments have developed millions of lines of
code over decades that are optimized to run on traditional x86 CPU systems.
However, we are seeing a rapidly increasing fraction of floating point
computing power in leadership-class computing facilities and traditional data
centers coming from new accelerator architectures, such as GPUs. HEP
experiments are now faced with the untenable prospect of rewriting millions of
lines of x86 CPU code for the increasingly dominant architectures found in
these computational accelerators. This task is made more challenging by the
architecture-specific languages and APIs promoted by manufacturers such as
NVIDIA, Intel and AMD. Producing multiple, architecture-specific
implementations is not a viable option, given the available developer effort
and the code maintenance burden.
The Portable Parallelization Strategies team of the HEP Center for
Computational Excellence is investigating the use of Kokkos, SYCL, OpenMP,
std::execution::parallel and alpaka as potential portability solutions that
promise to execute on multiple architectures from the same source code, using
representative use cases from major HEP experiments, including the DUNE
experiment of the Long Baseline Neutrino Facility, and the ATLAS and CMS
experiments of the Large Hadron Collider. This cross-cutting evaluation of
portability solutions using real applications will help inform and guide the
HEP community when choosing their software and hardware suites for the next
generation of experimental frameworks. We present the outcomes of our studies,
including performance metrics, porting challenges, API evaluations, and build
system integration.
Comment: 18 pages, 9 figures, 2 tables
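Of the portability layers evaluated, C++17 parallel algorithms (std::execution::parallel) require the least new syntax; the sketch below applies one to a toy transform. It is illustrative only, with a hypothetical calibration factor, and is not taken from the HEP codebases the paper studies.

    // One source line of std::execution-based parallelism. The same code
    // runs on multi-core CPUs, and with offloading compilers (e.g.
    // nvc++ -stdpar) can also execute on GPUs.
    #include <algorithm>
    #include <execution>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<double> energies(1'000'000, 1.0);

        std::transform(std::execution::par_unseq,
                       energies.begin(), energies.end(), energies.begin(),
                       [](double e) { return e * 1.5; });   // hypothetical calibration

        std::cout << energies.front() << '\n';   // 1.5
        return 0;
    }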