68 research outputs found
GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU
High-performance graph algorithms are difficult to implement on new parallel
hardware such as GPUs for three reasons: (1) the difficulty of coming up with
graph building blocks, (2) load imbalance on parallel hardware, and (3) the low
arithmetic intensity of graph problems.
To address some of these challenges, GraphBLAS is an innovative, on-going
effort by the graph analytics community to propose building blocks based on
sparse linear algebra, which will allow graph algorithms to be expressed in a
performant, succinct, composable and portable manner. In this paper, we examine
the performance challenges of a linear-algebra-based approach to building graph
frameworks and describe new design principles for overcoming these bottlenecks.
Among the new design principles is exploiting input sparsity, which allows
users to write graph algorithms without specifying push and pull direction.
Exploiting output sparsity allows users to tell the backend which values of the
output in a single vectorized computation they do not want computed.
Load-balancing is an important feature for balancing work amongst parallel
workers. We describe the important load-balancing features for handling graphs
with different characteristics. The design principles described in this paper
have been implemented in "GraphBLAST", the first high-performance linear
algebra-based graph framework on NVIDIA GPUs that is open-source. The results
show that on a single GPU, GraphBLAST has on average at least an order of
magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL,
comparable performance to the fastest GPU hardwired primitives and
shared-memory graph frameworks Ligra and Gunrock, and better performance than
any other GPU graph framework, while offering a simpler and more concise
programming model.
Comment: 50 pages, 14 figures, 14 tables
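The ideas of input and output sparsity in the abstract above can be illustrated with a minimal sketch of breadth-first search written as repeated sparse vector-matrix products. This is not GraphBLAST's API; the function name `bfs_levels` and the dictionary-based adjacency format are assumptions made for illustration. Iterating only over the nonzeros of the frontier stands in for exploiting input sparsity, and skipping already visited vertices stands in for an output mask.

```python
def bfs_levels(adj, source):
    """Return {vertex: BFS depth} for all vertices reachable from source.

    adj is a hypothetical sparse adjacency structure:
    dict mapping a vertex to a list of its out-neighbours.
    """
    levels = {source: 0}   # visited set doubles as the output mask
    frontier = {source}    # sparse "vector": only nonzero entries are stored
    depth = 0
    while frontier:
        depth += 1
        next_frontier = set()
        # Sparse vector x sparse matrix over a Boolean (OR, AND) semiring:
        # touch only rows that belong to nonzeros of the frontier.
        for u in frontier:
            for v in adj.get(u, ()):
                # Output mask: do not compute entries that are already known.
                if v not in levels:
                    next_frontier.add(v)
        for v in next_frontier:
            levels[v] = depth
        frontier = next_frontier
    return levels
```

A GPU framework would additionally choose between push and pull traversal and balance work across threads; this sketch only shows the algebraic shape of the computation.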
Multi-GPU aggregation-based AMG preconditioner for iterative linear solvers
We present and release in open source format a sparse linear solver which
efficiently exploits heterogeneous parallel computers. The solver can be easily
integrated into scientific applications that need to solve large and sparse
linear systems on modern parallel computers made of hybrid nodes hosting NVIDIA
Graphics Processing Unit (GPU) accelerators.
The work extends our previous efforts in the exploitation of a single GPU
accelerator and proposes an implementation, based on the hybrid MPI-CUDA
software environment, of a Krylov-type linear solver relying on an efficient
Algebraic MultiGrid (AMG) preconditioner already available in the BootCMatchG
library. Our design for the hybrid implementation has been driven by the best
practices for minimizing data communication overhead when multiple GPUs are
employed, yet preserving the efficiency of the single GPU kernels. Strong and
weak scalability results on well-known benchmark test cases of the new version
of the library are discussed. Comparisons with the NVIDIA AmgX solution show an
improvement of up to 2.0x in the solve phase.
A Systematic Survey of General Sparse Matrix-Matrix Multiplication
SpGEMM (General Sparse Matrix-Matrix Multiplication) has attracted much
attention from researchers in the fields of multigrid methods and graph analysis.
Many optimization techniques have been developed for specific application fields
and computing architectures over the decades. The objective of this paper is to
provide a structured and comprehensive overview of the research on SpGEMM.
Existing optimization techniques have been grouped into different categories
based on their target problems and architectures. Covered topics include SpGEMM
applications, size prediction of result matrix, matrix partitioning and load
balancing, result accumulating, and target architecture-oriented optimization.
The rationales of different algorithms in each category are analyzed, and a
wide range of SpGEMM algorithms are summarized. This survey sufficiently
reveals the latest progress and research status of SpGEMM optimization from
1977 to 2019. More specifically, an experimentally comparative study of
existing implementations on CPU and GPU is presented. Based on our findings, we
highlight future research directions and how future studies can leverage our
findings to encourage better design and implementation.
Comment: 19 pages, 11 figures, 2 tables, 4 algorithms
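As a small illustration of the "result accumulating" topic listed in the survey abstract above, the following sketch multiplies two sparse matrices row by row and merges partial products in a per-row hash map. It is a generic row-wise formulation, not an algorithm taken from the survey; the function name `spgemm` and the nested-dictionary matrix format are assumptions made for illustration.

```python
def spgemm(a, b):
    """Multiply two sparse matrices stored as {row: {col: value}}.

    Row i of the result is a linear combination of the rows of b
    selected by the nonzeros in row i of a.
    """
    c = {}
    for i, a_row in a.items():
        acc = {}  # hash accumulator for the nonzeros of result row i
        for k, a_ik in a_row.items():
            for j, b_kj in b.get(k, {}).items():
                acc[j] = acc.get(j, 0) + a_ik * b_kj
        if acc:   # store only rows that contain at least one entry
            c[i] = acc
    return c
```

The number of entries in each `acc` is not known before the multiplication runs, which is why size prediction of the result matrix appears as a separate topic in the abstract.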
Doctor of Philosophy
dissertation
Emerging trends such as growing architectural diversity and increased emphasis on energy and power efficiency motivate the need for code that adapts to its execution context (input dataset and target architecture). Unfortunately, writing such code remains difficult, and is typically attempted only by a small group of motivated expert programmers who are highly knowledgeable about the relationship between software and its hardware mapping. In this dissertation, we introduce novel abstractions and techniques based on automatic performance tuning that enable both experts and nonexperts (application developers) to produce adaptive code. We present two new frameworks for adaptive programming: Nitro and Surge. Nitro enables expert programmers to specify code variants, or alternative implementations of the same computation, together with meta-information for selecting among them. It then utilizes supervised classification to select an optimal code variant at runtime based on characteristics of the execution context. Surge, on the other hand, provides a high-level nested data-parallel programming interface for application developers to specify computations. It then employs a two-level mechanism to automatically generate code variants and then tunes them using Nitro. The resulting code performs on par with or better than handcrafted reference implementations on both CPUs and GPUs. In addition to abstractions for expressing code variants, this dissertation also presents novel strategies for adaptively tuning them. First, we introduce a technique for dynamically selecting an optimal code variant at runtime based on characteristics of the input dataset. On five high-performance GPU applications, variants tuned using this strategy achieve over 93% of the performance of variants selected through exhaustive search.
Next, we present a novel approach based on multitask learning to develop a code variant selection model on a target architecture from training on different source architectures. We evaluate this approach on a set of six benchmark applications and a collection of six NVIDIA GPUs from three distinct architecture generations. Finally, we implement support for combined code variant and frequency selection based on multiple objectives, including power and energy efficiency. Using this strategy, we construct a GPU sorting implementation that provides improved energy and power efficiency with less than a proportional drop in sorting throughput.
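The idea of selecting a code variant at runtime from input characteristics, described in the abstract above, can be sketched as follows. This is not Nitro's interface: the names `VARIANTS`, `TRAINING`, `features`, `select_variant` and `adaptive_sort` are hypothetical, the training labels are made up for illustration, and a 1-nearest-neighbour rule stands in for whatever supervised classifier a real tuner would train.

```python
def insertion_sort(xs):
    """Simple variant that tends to suit very small inputs."""
    out = []
    for x in xs:
        i = len(out)
        while i > 0 and out[i - 1] > x:
            i -= 1
        out.insert(i, x)
    return out

# Alternative implementations of the same computation.
VARIANTS = {"insertion": insertion_sort, "builtin": sorted}

# Hypothetical training set: (feature vector, best variant label).
# A real tuner would obtain the labels by timing each variant offline.
TRAINING = [
    ((8,), "insertion"),
    ((16,), "insertion"),
    ((1024,), "builtin"),
    ((4096,), "builtin"),
]

def features(xs):
    """Input characteristics used for selection (here only the length)."""
    return (len(xs),)

def select_variant(xs):
    """Pick the label of the nearest training example (1-NN)."""
    f = features(xs)
    def sq_dist(example):
        return sum((a - b) ** 2 for a, b in zip(example[0], f))
    return min(TRAINING, key=sq_dist)[1]

def adaptive_sort(xs):
    """Dispatch to the variant predicted for this input."""
    return VARIANTS[select_variant(xs)](xs)
```

Whichever variant is chosen, the result is the same sorted list; only the expected running time differs, which is what makes the selection safe to automate.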
Exploiting commutativity to reduce the cost of updates to shared data in cache-coherent systems
We present Coup, a technique to lower the cost of updates to shared data in cache-coherent systems. Coup exploits the insight that many update operations, such as additions and bitwise logical operations, are commutative: they produce the same final result regardless of the order they are performed in. Coup allows multiple private caches to simultaneously hold update-only permission to the same cache line. Caches with update-only permission can locally buffer and coalesce updates to the line, but cannot satisfy read requests. Upon a read request, Coup reduces the partial updates buffered in private caches to produce the final value. Coup integrates seamlessly into existing coherence protocols, requires inexpensive hardware, and does not affect the memory consistency model.
We apply Coup to speed up single-word updates to shared data. On a simulated 128-core, 8-socket system, Coup accelerates state-of-the-art implementations of update-heavy algorithms by up to 2.4×.
Center for Future Architectures Research; National Science Foundation (U.S.) (CAREER-1452994); Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science (Grier Presidential Fellowship); Microelectronics Advanced Research Corporation; United States. Defense Advanced Research Projects Agency
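The buffering and reduction behaviour described in the Coup abstract above can be mimicked in software with a small model. This is only a conceptual sketch of commutative-update coalescing, not the hardware protocol; the class name `CommutativeLine` and its methods are hypothetical, and addition is used as the example commutative operation.

```python
class CommutativeLine:
    """Toy model of one shared value updated by several private caches."""

    def __init__(self, value, n_caches):
        self.value = value
        # One locally buffered partial update per private cache.
        self.partial = [0] * n_caches

    def update(self, cache_id, delta):
        """Coalesce an addition locally; no other cache is involved."""
        self.partial[cache_id] += delta

    def read(self):
        """Reduce all buffered partial updates to produce the final value."""
        self.value += sum(self.partial)
        self.partial = [0] * len(self.partial)
        return self.value
```

Because addition is commutative, the value returned by `read` does not depend on the order in which the caches issued their updates.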