GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems
While many of the architectural details of future exascale-class high
performance computer systems are still a matter of intense research, there
appears to be a general consensus that they will be strongly heterogeneous,
featuring "standard" as well as "accelerated" resources. Today, such resources
are available as multicore processors, graphics processing units (GPUs), and
other accelerators such as the Intel Xeon Phi. Any software infrastructure that
claims usefulness for such environments must be able to meet their inherent
challenges: massive multi-level parallelism, topology, asynchronicity, and
abstraction. The "General, Hybrid, and Optimized Sparse Toolkit" (GHOST) is a
collection of building blocks that targets algorithms dealing with sparse
matrix representations on current and future large-scale systems. It implements
the "MPI+X" paradigm, has a pure C interface, and provides hybrid-parallel
numerical kernels, intelligent resource management, and truly heterogeneous
parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We
describe the details of its design with respect to the challenges posed by
modern heterogeneous supercomputers and recent algorithmic developments.
Implementation details which are indispensable for achieving high efficiency
are pointed out and their necessity is justified by performance measurements or
predictions based on performance models. The library code and several
applications are available as open source. We also provide instructions on how
to make use of GHOST in existing software packages, together with a case study
which demonstrates the applicability and performance of GHOST as a component
within a larger software stack.
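GHOST's own C API is documented with the library; to make the "MPI+X"
idea concrete, here is a minimal sketch of that pattern in plain
MPI+OpenMP, with a CRS sparse matrix-vector multiply as the kernel. All
identifiers below are illustrative and are not part of GHOST's
interface.

    /* Minimal MPI+OpenMP sketch of the "MPI+X" pattern GHOST implements:
     * each rank owns a block of matrix rows (CRS format) plus the matching
     * slice of y; OpenMP ("X") parallelizes across the local rows.
     * All identifiers are illustrative, not part of GHOST's API. */
    #include <mpi.h>
    #include <omp.h>

    typedef struct {
        int nrows;       /* rows owned by this rank          */
        int *rowptr;     /* CRS row pointers, length nrows+1 */
        int *col;        /* global column indices            */
        double *val;     /* nonzero values                   */
    } crs_local;

    /* y_local = A_local * x, with x already gathered on every rank */
    void spmv_mpi_x(const crs_local *A, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < A->nrows; ++i) {
            double s = 0.0;
            for (int j = A->rowptr[i]; j < A->rowptr[i + 1]; ++j)
                s += A->val[j] * x[A->col[j]];
            y[i] = s;
        }
        /* In a real hybrid code, the halo exchange of x (MPI_Isend/
         * MPI_Irecv) would be overlapped with the local computation. */
    }

In GHOST itself the "X" additionally covers Nvidia GPUs and the Xeon
Phi, and the library's resource manager decides how threads and devices
are mapped onto the hardware topology.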
Performance Engineering for Real and Complex Tall & Skinny Matrix Multiplication Kernels on GPUs
General matrix-matrix multiplications with double-precision real and complex
entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized
for square matrices but often perform poorly for tall & skinny matrices,
which are much taller than wide. For such shapes, NVIDIA's current
cuBLAS implementation delivers only a fraction of the attainable
performance indicated by the roofline model. We describe the challenges and key characteristics
of an implementation that can achieve close to optimal performance. We further
evaluate different strategies of parallelization and thread distribution, and
devise a flexible, configurable mapping scheme. To ensure flexibility and allow
for highly tailored implementations we use code generation combined with
autotuning. For a large range of matrix sizes in the domain of interest we
achieve at least 2/3 of the roofline performance and often substantially
outperform state-of-the-art cuBLAS results on an NVIDIA Volta GPGPU.
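To make the roofline argument concrete: for C = A^T * B with A of size
K x M and B of size K x N (K much larger than M and N), the kernel must
stream roughly 8*K*(M+N) bytes to perform 2*K*M*N flops, giving an
arithmetic intensity of M*N / (4*(M+N)) flop/byte. The sketch below
evaluates the resulting bandwidth bound; the bandwidth and peak figures
are assumed, roughly Volta-class placeholders, not measurements.

    /* Roofline bound for tall & skinny DGEMM C = A^T * B,
     * A: K x M, B: K x N, K >> M,N. Memory traffic is dominated by
     * streaming A and B once: 8*K*(M+N) bytes for 2*K*M*N flops. */
    #include <stdio.h>

    int main(void)
    {
        const double beta  = 900e9;   /* assumed bandwidth, bytes/s */
        const double p_max = 7800e9;  /* assumed FP64 peak, flop/s  */

        for (int m = 1; m <= 64; m *= 2) {
            int n = m;                /* square C block, M = N      */
            double intensity = (2.0 * m * n) / (8.0 * (m + n));
            double bound = intensity * beta;
            if (bound > p_max) bound = p_max; /* min(P_peak, beta*I) */
            printf("M=N=%2d  I=%6.2f flop/B  bound=%7.1f Gflop/s\n",
                   m, intensity, bound / 1e9);
        }
        return 0;
    }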
Accelerating 128-bit Floating-Point Matrix Multiplication on FPGAs
General Matrix Multiplication (GEMM) is a fundamental operation widely used
in scientific computations. Its performance and accuracy significantly impact
the performance and accuracy of applications that depend on it. One such
application is semidefinite programming (SDP), which often requires
binary128 or higher-precision arithmetic to be solved stably. However,
few processors support binary128 arithmetic natively, which makes SDP
solvers generally slow. In this study, we focused on accelerating GEMM with binary128
arithmetic on field-programmable gate arrays (FPGAs) to enable the flexible
design of accelerators for the desired computations. Our binary128 GEMM designs
on a recent high-performance FPGA achieved approximately 90 GFlops, 147x faster
than the computation executed on a recent CPU with 20 threads for large
matrices. Using our binary128 GEMM design on the FPGA, we successfully
accelerated two numerical applications: LU decomposition and SDP problems, for
the first time.
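The CPU baseline such a design competes against is software-emulated
quad precision, e.g. GCC's __float128 type backed by libquadmath. A
naive reference GEMM in that mode looks as follows (our own
illustration, unrelated to the paper's FPGA kernels); every multiply-add
expands into a library call, which is why hardware acceleration pays off.

    /* Naive binary128 (quad-precision) GEMM reference using GCC's
     * __float128, which x86 CPUs emulate in software via libquadmath.
     * Compile with: gcc -O2 gemm128.c -lquadmath */
    #include <quadmath.h>
    #include <stdio.h>

    /* C = A*B, all matrices n x n, row-major */
    static void qgemm(int n, const __float128 *A, const __float128 *B,
                      __float128 *C)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                __float128 s = 0;
                for (int k = 0; k < n; ++k)
                    s += A[i * n + k] * B[k * n + j];
                C[i * n + j] = s;
            }
    }

    int main(void)
    {
        enum { N = 2 };
        __float128 A[N * N] = {1, 2, 3, 4}, B[N * N] = {5, 6, 7, 8};
        __float128 C[N * N];
        char buf[64];

        qgemm(N, A, B, C);
        quadmath_snprintf(buf, sizeof buf, "%.3Qf", C[0]); /* 19.000 */
        printf("C[0][0] = %s\n", buf);
        return 0;
    }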
Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors
Factorization and multiplication of dense matrices and tensors are critical,
yet extremely expensive pieces of the scientific toolbox. Careful use of low
rank approximation can drastically reduce the computation and memory
requirements of these operations. In addition to a lower arithmetic complexity,
such methods can, by their structure, be designed to efficiently exploit modern
hardware architectures. The majority of existing work relies on batched BLAS
libraries to handle the computation of many small dense matrices. We
show that through careful analysis of cache utilization, accumulation in
SIMD registers, and a redesign of the implementation, one can achieve
significantly higher throughput for these types of batched low-rank matrices
across a large range of block and batch sizes. We test our algorithm on
three CPUs with diverse ISAs -- the Fujitsu A64FX using Arm SVE, the
Intel Xeon 6148 using AVX-512, and the AMD EPYC 7502 using AVX2 -- and
show that our new batching methodology obtains more than twice the
throughput of vendor-optimized libraries for all CPU architectures and
problem sizes.
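As one illustration of the register-accumulation idea (our own sketch,
not the paper's kernel): keep a small block of C resident in SIMD
registers for the entire inner product, so C traffic hits memory exactly
once per block. Below is an AVX-512 variant for an 8x8 FP64 block; the
function name and blocking are ours.

    /* One small matrix product from a batch: an 8x8 FP64 block of C
     * stays in eight AVX-512 registers for the whole k-loop, so C is
     * read and written to memory exactly once. Row-major storage. */
    #include <immintrin.h>

    /* C(8x8) += A(8xK) * B(Kx8); ldX = row strides in doubles */
    static void microkernel_8x8(int K, const double *A, int lda,
                                const double *B, int ldb,
                                double *C, int ldc)
    {
        __m512d acc[8];
        for (int i = 0; i < 8; ++i)
            acc[i] = _mm512_loadu_pd(C + i * ldc); /* C rows -> regs */

        for (int k = 0; k < K; ++k) {
            __m512d b = _mm512_loadu_pd(B + k * ldb); /* row of B */
            for (int i = 0; i < 8; ++i) {
                __m512d a = _mm512_set1_pd(A[i * lda + k]); /* bcast */
                acc[i] = _mm512_fmadd_pd(a, b, acc[i]);
            }
        }
        for (int i = 0; i < 8; ++i)
            _mm512_storeu_pd(C + i * ldc, acc[i]); /* single write-back */
    }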
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
High-performance implementations of graph algorithms are difficult to
achieve on new parallel hardware such as GPUs because of three
challenges: (1) the difficulty of coming up with graph building blocks,
(2) load imbalance on parallel hardware, and (3) the low arithmetic
intensity of graph problems.
To address some of these challenges, GraphBLAS is an innovative, ongoing
effort by the graph analytics community to propose building blocks based on
sparse linear algebra, which will allow graph algorithms to be expressed in a
performant, succinct, composable and portable manner. In this paper, we examine
the performance challenges of a linear-algebra-based approach to building graph
frameworks and describe new design principles for overcoming these bottlenecks.
Among the new design principles is exploiting input sparsity, which allows
users to write graph algorithms without specifying a push or pull
direction. Exploiting output sparsity allows users to tell the backend
which values of the output of a single vectorized computation they do
not want computed. Load balancing distributes work evenly amongst
parallel workers; we describe the load-balancing features needed to
handle graphs with different characteristics. The design principles described in this paper
have been implemented in "GraphBLAST", the first open-source
high-performance linear-algebra-based graph framework on NVIDIA GPUs.
The results show that on a single GPU, GraphBLAST achieves on average at
least an order-of-magnitude speedup over the previous GraphBLAS
implementations SuiteSparse and GBTL,
comparable performance to the fastest GPU hardwired primitives and
shared-memory graph frameworks Ligra and Gunrock, and better performance than
any other GPU graph framework, while offering a simpler and more concise
programming model.
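The flavor of the linear-algebra formulation is easy to see in the
standard GraphBLAS C API (GraphBLAST itself exposes a C++ interface;
this sketch uses the C API for brevity): one masked vector-matrix
product per BFS level, where the complemented mask is exactly the
output-sparsity mechanism described above.

    /* Level-synchronous BFS in the GraphBLAS C API: one masked vxm per
     * level over the Boolean (lor, land) semiring. The complemented
     * mask (GrB_DESC_RC) exploits output sparsity: already-visited
     * vertices are never recomputed. */
    #include <GraphBLAS.h>

    /* levels: empty GrB_INT32 vector of size n on entry; on return,
     * levels[i] holds the BFS depth of vertex i (src gets level 1). */
    GrB_Info bfs(GrB_Vector levels, GrB_Matrix A, GrB_Index src)
    {
        GrB_Index n, nvals;
        GrB_Matrix_nrows(&n, A);

        GrB_Vector q;                      /* current frontier */
        GrB_Vector_new(&q, GrB_BOOL, n);
        GrB_Vector_setElement_BOOL(q, true, src);

        for (int32_t level = 1; ; ++level) {
            /* levels<q> = level: stamp the frontier with its depth */
            GrB_Vector_assign_INT32(levels, q, NULL, level,
                                    GrB_ALL, n, NULL);
            /* q<!levels> = q lor.land A: expand to unvisited vertices */
            GrB_vxm(q, levels, NULL, GrB_LOR_LAND_SEMIRING_BOOL,
                    q, A, GrB_DESC_RC);
            GrB_Vector_nvals(&nvals, q);
            if (nvals == 0) break;         /* frontier empty: done */
        }
        GrB_Vector_free(&q);
        return GrB_SUCCESS;
    }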