Nested data-parallelism on the GPU
Abstract
Graphics processing units (GPUs) provide both memory bandwidth and arithmetic performance far greater than that available on CPUs but, because of their Single-Instruction-Multiple-Data (SIMD) architecture, they are hard to program. Most of the programs ported to GPUs thus far use traditional data-level parallelism, performing only operations that apply uniformly over vectors. NESL is a first-order functional language that was designed to allow programmers to write irregular-parallel programs, such as parallel divide-and-conquer algorithms, for wide-vector parallel computers. This paper presents our port of the NESL implementation to GPUs and provides empirical evidence that nested data-parallelism (NDP) on GPUs significantly outperforms CPU-based implementations and matches or beats newer GPU languages that support only flat parallelism. While our performance does not match that of hand-tuned CUDA programs, we argue that the notational conciseness of NESL is worth the loss in performance. This work provides the first language implementation that directly supports NDP on a GPU.
Categories and Subject Descriptors: D.3.0 [Programming Languages]: General; D.3.2 [Programming Languages]: Language Classifications—Applicative (Functional) Programming; Concurrent, distributed, and parallel languages; D.3.4 [Programming Languages]: Processors—Compilers
General Terms: Languages, Performance
Keywords: GPU, GPGPU, NESL, nested data parallelism