Efficient Primitives and Algorithms for Many-core Architectures
Graphics Processing Units (GPUs) are a fast-evolving architecture. Over the last decade their programmability has been harnessed to solve non-graphics tasks, in many cases with a large performance advantage over CPUs. Unlike CPUs, GPUs have always been highly parallel: thousands of lightweight execution contexts are needed to keep the chip busy. CPUs have also become parallel over the same period, but having started from the opposite end of the spectrum, they expose parallelism of a different nature, with only tens of heavyweight execution contexts active at any point in time. As parallel architectures become increasingly common, GPUs provide a research vehicle for developing parallel algorithms, programming models, and languages. This dissertation describes the CUDA Data Parallel Primitives Library (CUDPP), a library of building blocks, called primitives, for efficiently solving a broad range of problems on the GPU. CUDPP has become one of the most widely used libraries for data-parallel programming on GPUs, demonstrating the applicability of these building blocks. The most basic primitive is scan, together with its segmented variant; both have a rich history in the data-parallel programming literature. I describe multiple efficient implementations of both variants. I then look at a class of problems exhibiting nested parallelism that has traditionally proved challenging for many-core processors; segmented-scan-based primitives solve each of these problems efficiently on the GPU for the first time. Sort and hash, fundamental algorithms in computer science, prove challenging to implement efficiently on the GPU. I take a detailed look at radix sort and a class of comparison sorts, compare two merge strategies used on parallel processors in the past, and discuss their efficiency on GPUs. I show how sort is used to develop the first efficient algorithm for building spatial hierarchies on the GPU, demonstrating that building spatial hierarchies is, in essence, sorting in three dimensions. The penultimate chapter describes the first efficient hash algorithm for GPUs, based on cuckoo hashing. I close by examining two implementations of the set data structure.
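As a point of reference for what the scan and segmented-scan primitives compute, the sketch below is an illustration using Thrust, which ships with the CUDA toolkit; it is not CUDPP's own API or implementation. An exclusive scan turns a sequence into its running sums, and a segmented scan restarts the sum at each segment boundary (expressed here with keys instead of CUDPP-style head flags).

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <cstdio>

int main() {
    // Exclusive scan (prefix sum): out[i] holds the sum of in[0..i-1].
    int h_in[8]   = {3, 1, 7, 0, 4, 1, 6, 3};
    int h_keys[8] = {0, 0, 0, 1, 1, 2, 2, 2};   // segment ids (head-flag equivalent)
    thrust::device_vector<int> in(h_in, h_in + 8);
    thrust::device_vector<int> keys(h_keys, h_keys + 8);
    thrust::device_vector<int> out(8), seg(8);

    thrust::exclusive_scan(in.begin(), in.end(), out.begin());
    // out = 0 3 4 11 11 15 16 22

    // Segmented scan: the running sum restarts whenever the key changes,
    // which is the behavior CUDPP exposes through head flags.
    thrust::inclusive_scan_by_key(keys.begin(), keys.end(), in.begin(), seg.begin());
    // seg = 3 4 11 | 0 4 | 1 7 10

    for (int i = 0; i < 8; ++i) std::printf("%d ", (int)out[i]);
    std::printf("\n");
    for (int i = 0; i < 8; ++i) std::printf("%d ", (int)seg[i]);
    std::printf("\n");
    return 0;
}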
Assessment of Graphic Processing Units (GPUs) for Department of Defense (DoD) Digital Signal Processing (DSP) Applications
In this report we analyze the performance of the fast Fourier transform (FFT) on graphics hardware (the GPU), comparing it to the best-of-class CPU implementation, FFTW. We describe the FFT, the architecture of the GPU, and how general-purpose computation is structured on the GPU. We then identify the factors that influence FFT performance and describe several experiments that compare these factors between the CPU and the GPU. We conclude that the overheads of transferring data and initiating GPU computation are substantially higher than on the CPU, and thus for latency-critical applications the CPU is the superior choice. We show that the CPU implementation is limited by computation and the GPU implementation by GPU memory bandwidth and its lack of a writable cache. The GPU is comparatively better suited to larger FFTs, with many FFTs computed in parallel, in applications where FFT throughput matters most; on these applications GPU and CPU performance is roughly on par. We also demonstrate that adding computation to an application that includes the FFT, particularly computation that is GPU-friendly, puts the GPU at an advantage over the CPU.
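The report predates CUDA, but the throughput-oriented scenario it identifies, many independent FFTs executed together so that transfer and launch overheads are amortized, is what a batched plan expresses in today's cuFFT. The sketch below only illustrates that usage pattern; the transform length and batch size are arbitrary assumptions, not figures from the report.

#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n     = 4096;   // length of each transform (assumed)
    const int batch = 1024;   // number of independent FFTs executed together (assumed)

    cufftComplex* data = nullptr;
    cudaMalloc(&data, sizeof(cufftComplex) * n * batch);

    // One plan covers all `batch` transforms; the GPU amortizes launch and
    // transfer overhead across them, which is where it wins on throughput.
    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);
    cufftExecC2C(plan, data, data, CUFFT_FORWARD);   // in-place forward FFTs
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    std::printf("executed %d FFTs of length %d\n", batch, n);
    return 0;
}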
Resolution-Matched Shadow Maps
This paper presents resolution-matched shadow maps (RMSM), a modified adaptive shadow map (ASM) algorithm that is practical for interactive rendering of dynamic scenes. Adaptive shadow maps, which build a quadtree of shadow samples to match the projected resolution of each shadow texel in eye space, offer a robust solution to projective and perspective aliasing in shadow maps. However, their use for interactive dynamic scenes is plagued by an expensive iterative edge-finding algorithm that takes a highly variable amount of time per frame and is not guaranteed to converge to a correct solution. This paper introduces a simplified algorithm that is up to ten times faster than ASMs, has more predictable performance, and delivers more accurate shadows. Our main contribution is the observation that it is more efficient to forgo the iterative refinement analysis in favor of generating all shadow texels requested by the pixels in the eye-space image. The practicality of this approach rests on the insight that, for surfaces continuously visible from the eye, adjacent eye-space pixels map to adjacent shadow texels in quadtree shadow space. This means that the number of contiguous regions of shadow texels (which can be generated efficiently with a rasterizer) is proportional to the number of continuously visible surfaces in the scene. Moreover, these regions can be coalesced to further reduce the number of render passes required to shadow an image. The secondary contribution of this paper is demonstrating the design and use of data-parallel algorithms, inseparably mixed with traditional graphics programming, to implement a novel interactive rendering algorithm. For the scenes described in this paper, we achieve 60-80 frames per second on static scenes and 20-60 frames per second on dynamic scenes for 512 x 512 and 1024 x 1024 images, with a maximum effective shadow resolution of 32,768 x 32,768 texels.
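To make the "resolution matching" idea concrete, the sketch below shows one plausible form of the per-pixel computation: from the screen-space derivatives of the light-space coordinates, estimate the pixel's footprint in shadow space and request the quadtree level whose resolution matches it. This is a hedged illustration, not the paper's shader code; the function name, parameters, and the 32,768-texel virtual-resolution clamp are assumptions drawn only from the abstract.

#include <cuda_runtime.h>
#include <math.h>
#include <cstdio>

// duv_dx, duv_dy: screen-space derivatives of the normalized light-space (shadow)
// coordinates of the surface seen through this pixel (hypothetical inputs).
// virtual_res: resolution of the virtual shadow map (e.g. 32,768 texels per side).
__host__ __device__ int required_quadtree_level(float2 duv_dx, float2 duv_dy,
                                                int max_level, float virtual_res) {
    // Extent of one eye-space pixel's footprint in light space, per axis.
    float fx = fmaxf(fabsf(duv_dx.x), fabsf(duv_dy.x));
    float fy = fmaxf(fabsf(duv_dx.y), fabsf(duv_dy.y));
    float footprint = fmaxf(fx, fy);

    // Never request more than the virtual resolution allows.
    footprint = fmaxf(footprint, 1.0f / virtual_res);

    // Resolution at which one shadow texel covers at most one eye-space pixel,
    // then the quadtree level (2^level texels per side) that provides it.
    float texels_needed = 1.0f / footprint;
    int level = (int)ceilf(log2f(texels_needed));
    if (level < 0) level = 0;
    if (level > max_level) level = max_level;
    return level;
}

int main() {
    // A pixel whose footprint spans 1/2048 of the light's view should request
    // a 2048-texel node, i.e. quadtree level 11.
    int lvl = required_quadtree_level(make_float2(1.0f / 2048.0f, 0.0f),
                                      make_float2(0.0f, 1.0f / 2048.0f),
                                      15, 32768.0f);
    std::printf("requested quadtree level: %d\n", lvl);
    return 0;
}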
Efficient GPU Stream Transform
Asymmetric data patterns and workloads pose a challenge to massively parallel algorithm design, in particular for modern wide-SIMD architectures exhibiting several levels of parallelism. We propose a simple-to-use primitive that enables programmers to design algorithms with arbitrary data expansion or compaction while hiding the architecture details. We evaluate and characterize the performance of the primitive for a range of workloads, both synthetic and real-world. The results demonstrate that the primitive can be an effective tool in the toolbox of designers of parallel algorithms.
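For context, the classic way to express variable-rate output (compaction) on a GPU is a flag/scan/scatter sequence; a primitive like the one described above generalizes and hides exactly this kind of machinery. The sketch below is that textbook pattern written with Thrust plus a small CUDA kernel, not the paper's primitive or API.

#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/transform.h>
#include <cuda_runtime.h>
#include <cstdio>

struct is_positive {
    __host__ __device__ int operator()(int x) const { return x > 0 ? 1 : 0; }
};

// Scatter each surviving element to the slot computed by the exclusive scan.
__global__ void scatter_survivors(const int* in, const int* flags,
                                  const int* offsets, int n, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && flags[i]) out[offsets[i]] = in[i];
}

int main() {
    const int n = 8;
    int h_in[n] = {3, -1, 7, 0, 4, -2, 6, -5};
    thrust::device_vector<int> in(h_in, h_in + n);
    thrust::device_vector<int> flags(n), offsets(n), out(n);

    // 1. Flag the elements to keep (here: strictly positive values).
    thrust::transform(in.begin(), in.end(), flags.begin(), is_positive());
    // 2. Exclusive scan of the flags gives each survivor's output index.
    thrust::exclusive_scan(flags.begin(), flags.end(), offsets.begin());
    // 3. Scatter survivors into a dense output stream.
    scatter_survivors<<<1, 64>>>(thrust::raw_pointer_cast(in.data()),
                                 thrust::raw_pointer_cast(flags.data()),
                                 thrust::raw_pointer_cast(offsets.data()),
                                 n, thrust::raw_pointer_cast(out.data()));
    cudaDeviceSynchronize();

    int count = (int)offsets[n - 1] + (int)flags[n - 1];   // 4 survivors: 3 7 4 6
    for (int i = 0; i < count; ++i) std::printf("%d ", (int)out[i]);
    std::printf("\n");
    return 0;
}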