Device specialization in heterogeneous multi-GPU environments
In the last few years there have been many efforts towards coupling CPUs and GPUs in order to get the most from CPU-GPU heterogeneous systems. One of the main problems that prevents these systems from being exploited in a device-aware manner is the CPU-GPU communication bottleneck, which often makes it impossible to produce code more efficient than the GPU-only and CPU-only counterparts. As a consequence, most heterogeneous scheduling systems treat CPUs and GPUs as homogeneous nodes, electing map-like data partitioning to employ both processing resources. We propose to study how the radical change in the connection between GPU, CPU, and memory that characterizes APUs (Accelerated Processing Units) affects the architecture of a compiler, and whether it is possible to use all these computing resources in a device-aware manner. We investigate a methodology to analyze the devices that populate heterogeneous multi-GPU systems and to classify general-purpose algorithms in order to perform near-optimal control flow and data partitioning.
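As a rough illustration of the kind of device-aware data partitioning this line of work targets, the C sketch below (not taken from the paper; device names and throughput rates are hypothetical) splits an iteration space across a CPU and GPUs in proportion to each device's measured rate, so that all devices finish at roughly the same time:

```c
/* A minimal sketch of throughput-proportional CPU/GPU data partitioning:
 * each device gets a share of the iteration space proportional to its
 * measured rate, so all devices finish together. Rates are hypothetical. */
#include <stdio.h>
#include <stddef.h>

typedef struct {
    const char *name;
    double rate;   /* measured elements/second for this kernel */
} device_t;

/* Split n elements across ndev devices proportionally to their rates. */
static void partition(size_t n, const device_t *dev, size_t ndev, size_t *count)
{
    double total = 0.0;
    for (size_t i = 0; i < ndev; i++) total += dev[i].rate;

    size_t assigned = 0;
    for (size_t i = 0; i < ndev; i++) {
        count[i] = (size_t)((double)n * dev[i].rate / total);
        assigned += count[i];
    }
    count[ndev - 1] += n - assigned;   /* rounding remainder to last device */
}

int main(void)
{
    /* In practice, rates would come from micro-benchmarks of the kernel. */
    device_t dev[] = { { "CPU", 2.0e9 }, { "GPU0", 7.5e9 }, { "GPU1", 7.5e9 } };
    size_t count[3];

    partition(100000000u, dev, 3, count);
    for (int i = 0; i < 3; i++)
        printf("%s gets %zu elements\n", dev[i].name, count[i]);
    return 0;
}
```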
The future of computing beyond Moore's Law.
Moore's Law is a techno-economic model that has enabled the information technology industry to double the performance and functionality of digital electronics roughly every 2 years within a fixed cost, power and area. Advances in silicon lithography have enabled this exponential miniaturization of electronics, but, as transistors reach atomic scale and fabrication costs continue to rise, the classical technological driver that has underpinned Moore's Law for 50 years is failing and is anticipated to flatten by 2025. This article provides an updated view of what a post-exascale system will look like and the challenges ahead, based on our most recent understanding of technology roadmaps. It also discusses the tapering of historical improvements, and how it affects options available to continue scaling of successors to the first exascale machine. Lastly, this article covers the many different opportunities and strategies available to continue computing performance improvements in the absence of historical technology drivers. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
DAPHNE: An Open and Extensible System Infrastructure for Integrated Data Analysis Pipelines
Integrated data analysis (IDA) pipelines—which combine data management (DM) and query processing, high-performance computing (HPC), and machine learning (ML) training and scoring—are becoming increasingly common in practice. Interestingly, systems in these areas share many compilation and runtime techniques, and the hardware infrastructure they use—which is increasingly heterogeneous—is converging as well. Yet, the programming paradigms, cluster resource management, data formats and representations, and execution strategies differ substantially. DAPHNE is an open and extensible system infrastructure for such IDA pipelines, including language abstractions, compilation and runtime techniques, multi-level scheduling, hardware (HW) accelerators, and computational storage, aimed at increasing productivity and eliminating unnecessary overheads. In this paper, we make a case for IDA pipelines and describe the overall DAPHNE system architecture, its key components, and the design of a vectorized execution engine for computational storage, HW accelerators, and local as well as distributed operations. Preliminary experiments comparing DAPHNE with MonetDB, Pandas, DuckDB, and TensorFlow show promising results.
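To make the idea of a vectorized execution engine for mixed DM/ML pipelines concrete, here is a minimal C sketch (not DAPHNE's actual API; the operator names and tile size are invented) in which a query-style filter and an ML-style scoring operator are applied tile by tile, so intermediates stay cache-resident and tiles can later be dispatched to threads, accelerators, or remote nodes:

```c
/* A minimal sketch of tile-wise (vectorized) pipeline execution:
 * each tile flows through the whole operator pipeline before the
 * next tile starts, keeping intermediates cache-resident. */
#include <stdio.h>
#include <math.h>

#define N    1024
#define TILE 128

/* Hypothetical fused operators: a DM-style filter and an ML-style scorer. */
static void filter_tile(double *x, int len, double threshold)
{
    for (int i = 0; i < len; i++)
        if (x[i] < threshold) x[i] = 0.0;     /* selection, as in DM */
}

static void score_tile(double *x, int len)
{
    for (int i = 0; i < len; i++)
        x[i] = 1.0 / (1.0 + exp(-x[i]));      /* sigmoid scoring, as in ML */
}

int main(void)
{
    static double data[N];
    for (int i = 0; i < N; i++) data[i] = (double)(i % 7) - 3.0;

    /* The pipeline runs per tile; tiles are the natural unit to hand
     * to worker threads, accelerators, or distributed nodes. */
    for (int t = 0; t < N; t += TILE) {
        filter_tile(&data[t], TILE, 0.0);
        score_tile(&data[t], TILE);
    }
    printf("data[0]=%f data[N-1]=%f\n", data[0], data[N - 1]);
    return 0;
}
```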
From Physics Model to Results: An Optimizing Framework for Cross-Architecture Code Generation
Starting from a high-level problem description in terms of partial differential equations using abstract tensor notation, the Chemora framework discretizes, optimizes, and generates complete high-performance codes for a wide range of compute architectures. Chemora extends the capabilities of Cactus, facilitating the efficient use of large-scale CPU/GPU systems for complex applications without low-level code tuning. Chemora achieves parallelism through MPI and multi-threading, combining OpenMP and CUDA. Optimizations include high-level code transformations, efficient loop traversal strategies, dynamically selected data and instruction cache usage strategies, and JIT compilation of GPU code tailored to the problem characteristics. The discretization is based on higher-order finite differences on multi-block domains. Chemora's capabilities are demonstrated by simulations of black hole collisions. This problem provides an acid test of the framework, as the Einstein equations contain hundreds of variables and thousands of terms.
Comment: 18 pages, 4 figures, accepted for publication in Scientific Programming
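As a concrete example of the higher-order finite differences such a framework discretizes and generates automatically, the following self-contained C sketch (independent of Chemora's generated code) applies a standard fourth-order central-difference stencil to sin(x) and compares the result against the analytic derivative:

```c
/* A minimal sketch of a fourth-order central finite difference,
 * the kind of stencil a PDE code generator emits:
 *   f'(x) ~= (-f(x+2h) + 8 f(x+h) - 8 f(x-h) + f(x-2h)) / (12 h)  */
#include <stdio.h>
#include <math.h>

#define N 64

int main(void)
{
    const double PI = 3.14159265358979323846;
    double h = 2.0 * PI / N;
    double f[N], df[N];

    for (int i = 0; i < N; i++) f[i] = sin(i * h);

    /* Interior points only; real codes handle boundaries via ghost zones. */
    for (int i = 2; i < N - 2; i++)
        df[i] = (-f[i + 2] + 8.0 * f[i + 1]
                 - 8.0 * f[i - 1] + f[i - 2]) / (12.0 * h);

    /* Compare against the analytic derivative cos(x) at one point. */
    printf("numeric %f vs exact %f\n", df[N / 2], cos((N / 2) * h));
    return 0;
}
```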
JACC: An OpenACC Runtime Framework with Kernel-Level and Multi-GPU Parallelization
The rapid development of computing technology has paved the way for directive-based programming models to take a principal role in maintaining the software portability of performance-critical applications. Such models demand minimal engineering cost for enabling computational acceleration on multiple architectures: programmers are only required to annotate sequential code with meta information. Optimizations for obtaining the best possible efficiency, however, are often challenging. Directives inserted by the programmer can have side effects that limit the compiler optimizations available, which may result in performance degradation. This is exacerbated when targeting multi-GPU systems, as pragmas do not automatically adapt to such systems and require expensive, time-consuming code adjustment by programmers.
This paper introduces JACC, an OpenACC runtime framework which enables the dynamic extension of OpenACC programs by serving as a transparent layer between the program and the compiler. We add a versatile code-translation method for multi-device utilization by which manually optimized applications can be distributed automatically while keeping the original code structure and parallelism. We show nearly linear scaling of kernel execution in some cases on NVIDIA V100 GPUs. While adaptively using multiple GPUs, the resulting performance improvements amortize the latency of GPU-to-GPU communication.
Comment: Extended version of a paper to appear in: Proceedings of the 28th IEEE International Conference on High Performance Computing, Data, and Analytics (HiPC), December 17-18, 2021
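The following C sketch illustrates, in spirit only (it is not JACC's implementation), the kind of multi-GPU transformation described above: a single OpenACC parallel loop is split into chunks, each launched asynchronously on its own device via the standard OpenACC runtime API. It assumes NVIDIA devices are visible at run time:

```c
/* A minimal sketch of splitting one OpenACC kernel's iteration space
 * across all visible GPUs, each chunk launched on its own device. */
#include <stdio.h>
#include <openacc.h>

#define N 1000000

int main(void)
{
    static float a[N], b[N];
    for (int i = 0; i < N; i++) a[i] = (float)i;

    int ndev = acc_get_num_devices(acc_device_nvidia);
    if (ndev < 1) {
        fprintf(stderr, "no NVIDIA devices found\n");
        return 1;
    }

    for (int d = 0; d < ndev; d++) {
        acc_set_device_num(d, acc_device_nvidia);
        int lo = (int)((long)N * d / ndev);
        int hi = (int)((long)N * (d + 1) / ndev);

        /* async lets the chunks run concurrently on different GPUs */
        #pragma acc parallel loop async copyin(a[lo:hi-lo]) copyout(b[lo:hi-lo])
        for (int i = lo; i < hi; i++)
            b[i] = 2.0f * a[i];
    }

    /* Wait on each device's default async queue. */
    for (int d = 0; d < ndev; d++) {
        acc_set_device_num(d, acc_device_nvidia);
        #pragma acc wait
    }
    printf("b[N-1] = %f\n", b[N - 1]);
    return 0;
}
```

A runtime layer like the one the paper describes would perform this chunking and device dispatch automatically, leaving the original single-device pragmas in the source untouched.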
GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring "standard" as well as "accelerated" resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out, and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.
Comment: 32 pages, 11 figures
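As an illustration of the sparse building blocks such a toolkit provides, here is a minimal C sketch (not GHOST's actual interface; the type and function names are invented) of a sparse matrix-vector multiply over the CSR format, with OpenMP standing in for the "X" in MPI+X at the node level:

```c
/* A minimal sketch of a CSR sparse matrix-vector multiply (y = A*x),
 * with node-level parallelism via OpenMP. */
#include <stdio.h>

typedef struct {
    int nrows;
    const int *rowptr;     /* size nrows+1; start of each row's nonzeros */
    const int *col;        /* column index per nonzero */
    const double *val;     /* value per nonzero */
} csr_t;

static void spmv(const csr_t *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int r = 0; r < A->nrows; r++) {
        double sum = 0.0;
        for (int j = A->rowptr[r]; j < A->rowptr[r + 1]; j++)
            sum += A->val[j] * x[A->col[j]];
        y[r] = sum;
    }
}

int main(void)
{
    /* 3x3 tridiagonal example: [[4,1,0],[1,4,1],[0,1,4]] */
    const int rowptr[] = {0, 2, 5, 7};
    const int col[]    = {0, 1, 0, 1, 2, 1, 2};
    const double val[] = {4, 1, 1, 4, 1, 1, 4};
    const csr_t A = {3, rowptr, col, val};
    const double x[] = {1, 1, 1};
    double y[3];

    spmv(&A, x, y);
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);  /* expect [5, 6, 5] */
    return 0;
}
```

In an MPI+X setting, each rank would hold a block of rows and run a kernel like this locally, overlapping halo exchanges of x with computation on purely local rows.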