4,236 research outputs found
Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation
Sparse matrix-vector multiplication (spMVM) is the dominant operation in many
sparse solvers. We investigate performance properties of spMVM with matrices of
various sparsity patterns on the nVidia "Fermi" class of GPGPUs. A new "padded
jagged diagonals storage" (pJDS) format is proposed which may substantially
reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme. In our
test scenarios the pJDS format cuts the overall spMVM memory footprint on the
GPGPU by up to 70%, and achieves 95% to 130% of the ELLPACK-R performance.
Using a suitable performance model we identify performance bottlenecks on the
node level that invalidate some types of matrix structures for efficient
multi-GPGPU parallelization. For appropriate sparsity patterns we extend
previous work on distributed-memory parallel spMVM to demonstrate a scalable
hybrid MPI-GPGPU code, achieving efficient overlap of communication and
computation.Comment: 10 pages, 5 figures. Added reference to other recent sparse matrix
format
Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques
Department of Computer Science and EngineeringAs the performance and energy efficiency requirement of GPGPUs have risen, memory management techniques of GPGPUs have improved to meet the requirements by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower latency and higher bandwidth of the memory. However, these methods do not always guarantee improved performance and energy efficiency due to the small cache size and heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques.
In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management techniques. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and the cache indexing latency. We also demonstrate that ACI continues to achieve high performance in various settings.
Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy efficiency gains over the baseline GPGPU architecture even when enhanced with advanced architectural technologies (e.g., higher capacity, associativity).
Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of a application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on the optimal page allocation ratio, BLPP dynamically allocate pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and state-of-the-art technique (i.e., 13.4% and 16.7%) and performs similar to the static-best version (i.e., 1.2% difference), which requires extensive offline profiling.clos
Scratchpad Sharing in GPUs
GPGPU applications exploit on-chip scratchpad memory available in the
Graphics Processing Units (GPUs) to improve performance. The amount of thread
level parallelism present in the GPU is limited by the number of resident
threads, which in turn depends on the availability of scratchpad memory in its
streaming multiprocessor (SM). Since the scratchpad memory is allocated at
thread block granularity, part of the memory may remain unutilized. In this
paper, we propose architectural and compiler optimizations to improve the
scratchpad utilization. Our approach, Scratchpad Sharing, addresses scratchpad
under-utilization by launching additional thread blocks in each SM. These
thread blocks use unutilized scratchpad and also share scratchpad with other
resident blocks. To improve the performance of scratchpad sharing, we propose
Owner Warp First (OWF) scheduling that schedules warps from the additional
thread blocks effectively. The performance of this approach, however, is
limited by the availability of the shared part of scratchpad.
We propose compiler optimizations to improve the availability of shared
scratchpad. We describe a scratchpad allocation scheme that helps in allocating
scratchpad variables such that shared scratchpad is accessed for short
duration. We introduce a new instruction, relssp, that when executed, releases
the shared scratchpad. Finally, we describe an analysis for optimal placement
of relssp instructions such that shared scratchpad is released as early as
possible.
We implemented the hardware changes using the GPGPU-Sim simulator and
implemented the compiler optimizations in Ocelot framework. We evaluated the
effectiveness of our approach on 19 kernels from 3 benchmarks suites: CUDA-SDK,
GPGPU-Sim, and Rodinia. The kernels that underutilize scratchpad memory show an
average improvement of 19% and maximum improvement of 92.17% compared to the
baseline approach
FDTD/K-DWM simulation of 3D room acoustics on general purpose graphics hardware using compute unified device architecture (CUDA)
The growing demand for reliable prediction of sound fields in rooms have resulted in adaptation of various approaches for physical modeling, including the Finite Difference Time Domain (FDTD) and the Digital Waveguide Mesh (DWM). Whilst considered versatile and attractive methods, they suffer from dispersion errors that increase with frequency and vary with direction of propagation, thus imposing a high frequency calculation limit. Attempts have been made to reduce such errors by considering different mesh topologies, by spatial interpolation, or by simply oversampling the grid. As the latter approach is computationally expensive, its application to three-dimensional problems has often been avoided. In this paper, we propose an implementation of the FDTD on general purpose graphics hardware, allowing for high sampling rates whilst maintaining reasonable calculation times. Dispersion errors are consequently reduced and the high frequency limit is increased. A range of graphics processors are evaluated and compared with traditional CPUs in terms of accuracy, calculation time and memory requirements
- …