Assessment of Graphic Processing Units (GPUs) for Department of Defense (DoD) Digital Signal Processing (DSP) Applications by Owens, John D. et al.
UC Davis
IDAV Publications
Title
Assessment of Graphic Processing Units (GPUs) for Department of Defense (DoD) Digital 
Signal Processing (DSP) Applications
Permalink
https://escholarship.org/uc/item/6wm775kj
Authors
Owens, John D.
Sengupta, Shubhabrata
Horn, Daniel
Publication Date
2005
 
Peer reviewed
Assessment of Graphic Processing Units (GPUs) for Department of
Defense (DoD) Digital Signal Processing (DSP) Applications
John D. Owens, Shubhabrata Sengupta, and Daniel Horn†
University of California, Davis
† Stanford University
Abstract
In this report we analyze the performance of the fast Fourier transform (FFT) on
graphics hardware (the GPU), comparing it to the best-of-class CPU implementation,
FFTW. We describe the FFT, the architecture of the GPU, and how general-purpose
computation is structured on the GPU. We then identify the factors that influence FFT
performance and describe several experiments that compare these factors between the
CPU and the GPU. We conclude that the overheads of transferring data and initiating
GPU computation are substantially higher than on the CPU, and thus for latency-critical
applications, the CPU is a superior choice. We show that the CPU implementation
is limited by computation and the GPU implementation by GPU memory
bandwidth and its lack of a writable cache. The GPU is comparatively better suited to
larger FFTs, with many FFTs computed in parallel, in applications where FFT throughput
is most important; on these applications GPU and CPU performance is roughly
on par. We also demonstrate that adding computation to an application
that includes the FFT, particularly computation that is GPU-friendly, puts the GPU
at an advantage compared to the CPU.
The future of desktop processing is parallel. The last few years have seen an explosion of
single-chip commodity parallel architectures that promise to be the centerpieces of future
computing platforms. Parallel hardware on the desktop (new multicore microprocessors,
graphics processors, and stream processors) has the potential to greatly increase the
computational power available to today’s computer users, with a resulting impact in
computation domains such as multimedia, entertainment, signal and image processing, and
scientific computation.
Several vendors have recently addressed this need for parallel computing: Intel and AMD
are producing multicore CPUs; the IBM Cell processor delivers impressive performance with
its  parallel cores; and several stream processor startups, including Stream Processors Inc.
and Clearspeed, are producing commercial parallel stream processors. None of these chips,
however, has achieved market penetration to the degree of the graphics processor (GPU).
Today’s GPU features massive arithmetic capability and memory bandwidth with superior
performance and price-performance when compared to the CPU. For instance, the NVIDIA
John D. Owens, Shubhabrata Sengupta, and Daniel Horn. “Assessment of Graphic Processing Units (GPUs)
for Department of Defense (DoD) Digital Signal Processing (DSP) Applications”. Technical Report ECE-CE--,
Computer Engineering Research Laboratory, University of California, Davis, .
http://www.ece.ucdavis.edu/cerl/techreports/-/

[Figure: GFLOPS (vertical axis, 0 to 150) by year (horizontal axis, 2002 to 2005) for NVIDIA
(NV30, NV35, NV40, G70), ATI (R300, R360, R420), and the Intel Pentium 4 (single-core except
where marked dual-core).]
Figure : The programmable floating-point performance of GPUs (measured on the multiply-add
instruction as  floating-point operations per MAD) has increased dramatically over the last four years
when compared to CPUs. Figure from Owens et al. [OLG+].
GeForce GTX features over GFLOPSof programmable ﬂoating-point computation,
. GB/s of peak main memory bandwidth, and about  GB/s of measured sequentially
accessed main memory bandwidth. These numbers are substantially greater than the peak
values for upcoming Intel multicore GPUs ( GFLOPS and . GB/s), and are growing
more quickly as well (Figure ). And the GPU is currently shipping in large volumes, with
a rate of hundreds of millions of units per year. The market penetration of the GPU, its
economies of scale, and its established programming libraries make the compelling case that
it may be the parallel processor of choice in future systems.
Parallel processing, however, is not without its costs. Programming efficient parallel code
is a difficult task, and the limited and restrictive programming models of these emerging
parallel processors make this task more difficult. Traditional scalar programs do not efficiently
map to parallel hardware, and thus new approaches to programming these systems are
necessary. From a research point of view, we believe that the major obstacle to the success of
these new architectures will be the difficulty in programming desktop parallel hardware.
This report is arranged as follows. In Section , we first describe the architecture of the
GPU and how general-purpose applications map to it. We then introduce the fast Fourier
transform (FFT) in Section  and outline our implementation of it. Section  gives high-level
metrics for evaluating the FFT, and Section  analyzes the performance of the FFT on the
GPU and CPU. We offer thoughts about the future in Section .
GPU Architecture
We begin by describing the architecture of today’s GPU and how general-purpose programs
map to it. The recent GPU Gems 2 book has longer articles on these topics: Kilgariff and
Fernando summarize the recent GeForce  series of GPUs from NVIDIA [KF], Mark Harris
discusses mapping computational concepts to GPUs [Har], and Ian Buck offers advice on
writing fast and efficient GPU applications [Bucb].
GFLOPS numbers courtesy of Ian Buck, Stanford University; GPU GFLOPS were measured via
GPUBench [BFHa]; GPU peak memory bandwidth is from NVIDIA marketing literature [NVI], and
measured GPU sequential bandwidth from Buck [Buca].

Figure : Block diagram of the NVIDIA GeForce . The vertex and fragment stages are
programmable, with most GPGPU work concentrated in the fragment processors. Note there are  parallel
fragment processors, each of which operates on -wide vectors of -bit floating-point values. (The GeForce
 GTX has  fragment processors.) Figure courtesy Nick Triantos, NVIDIA.
The most important recent change in the GPU has been a move from a fixed-function
pipeline, in which the graphics hardware could only implement the hardwired shading and
lighting functionality built into the graphics hardware, to a programmable pipeline, where
the shading and lighting calculations are programmed by the user on programmable
functional units. Figure  shows the block diagram of the NVIDIA GeForce . To keep up
with real-time demands for high-performance, visually compelling graphics, these
programmable units collectively deliver enormous arithmetic capability, which can instead be used
for computationally demanding general-purpose tasks.
To describe how a general-purpose program maps onto graphics hardware, we compare
side by side how a graphics application is structured on the GPU with how a
general-purpose application is structured on the GPU.

Graphics application: First, the graphics application specifies the scene geometry. This
geometry is transformed into screen space, resulting in a set of geometric primitives
(triangles or quads) that cover regions of the screen.
General-purpose application: In general-purpose applications, the programmer instead
typically specifies a large single piece of geometry that covers the screen. For the purposes
of this example, assume that geometry is  ×  pixels square.

Graphics application: The next step is called “rasterization”. Each screen-space primitive
covers a set of pixel locations on the screen. The rasterizer generates a “fragment” for each
covered pixel location.
General-purpose application: The rasterizer will generate one fragment for each pixel
location in the  ×  square, resulting in one million fragments.

Graphics application: Now the programmable hardware takes center stage. Each fragment is
evaluated by a fragment program that in graphics applications calculates the color of the
fragment. The GPU features multiple fragment engines that can calculate many fragments in
parallel. These fragments are processed in SIMD fashion, meaning they are all evaluated by
the same program running in lockstep over all fragments.
General-purpose application: The fragment processors are the primary computational engines
within the GPU. The previous steps were only necessary to generate a large number of
fragments that are then processed by the fragment processors. In the most recent NVIDIA
processor,  fragment engines process fragments at the same time, with all operations
on -wide -bit floating-point vectors.

Graphics application: Calculating the color of fragments typically involves reading from
global memory organized as “textures” that are mapped onto the geometric primitives. (Think
of a picture of a brick wall decaled onto a large rectangle.) Fragment processors are
efficient at fetching texture elements (texels) from memory.
General-purpose application: General-purpose applications use the texture access mechanism
to read values from global memory; in vector processing terminology, this is called a
“gather”. Though fragment programs can read from random locations in memory, they cannot
write to random locations in memory (“scatter”). This limitation restricts the generality of
general-purpose programs (though scatter can be synthesized using other GPU mechanisms
[Bucb]).

Graphics application: Generated fragments are then assembled into a final image. This image
can be used as texture for another pass of the graphics pipeline if desired, and many
graphics applications take many passes through the pipeline to generate a final image.
General-purpose application: After the fragment computation is completed, the resulting
 ×  array of computed values can be stored into global memory for use on another pass.
In practice, almost all interesting GPGPU applications use many passes, often dozens of
passes.
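The pass structure described above can be mimicked on the CPU to make the gather-only restriction concrete. The following is a minimal Python sketch, not GPU code: `run_pass` and `blur_kernel` are hypothetical names introduced only for illustration, and the nested loops stand in for work that the fragment processors would perform in parallel.

```python
def run_pass(kernel, inputs, width, height):
    """Run one GPGPU-style pass: evaluate `kernel` once per output location."""
    output = [[None] * width for _ in range(height)]
    for y in range(height):          # on a GPU these iterations run in parallel
        for x in range(width):
            # The kernel may read any input location (gather), but its result
            # is always written to (x, y): there is no scatter.
            output[y][x] = kernel(inputs, x, y)
    return output


def blur_kernel(inputs, x, y):
    """Example kernel: average a texel with its right-hand neighbor (wrapping)."""
    row = inputs[y]
    return 0.5 * (row[x] + row[(x + 1) % len(row)])


texture = [[1.0, 3.0, 5.0, 7.0],
           [2.0, 4.0, 6.0, 8.0]]
first = run_pass(blur_kernel, texture, width=4, height=2)
# The output of one pass becomes the input "texture" of the next pass.
second = run_pass(blur_kernel, first, width=4, height=2)
print(first[0])    # [2.0, 4.0, 6.0, 4.0]
print(second[0])   # [3.0, 5.0, 5.0, 3.0]
```

Because each output element is computed independently from read-only inputs, any number of such elements can be evaluated at once, which is the property the fragment processors exploit.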

The designers of the GPU have also made architectural tradeoffs that are different from
those of the CPU. Perhaps the most important is an architectural philosophy on the GPU
that emphasizes throughput over latency. CPUs generally optimize for latency; their
memory systems are designed to return values as quickly as possible to keep their computation
units busy. The GPU, historically, has different goals, because it targets the human visual
system, which operates on millisecond time scales. With computation occurring on nanosecond
time scales, six orders of magnitude faster than the visual system, the designers of the GPU
learned that a tradeoff that increased throughput at the expense of latency did not matter
to the users of the GPU and improved overall performance. Consequently pipelines for the
GPU are thousands of cycles long (compared to tens of cycles on a CPU); individual
operations take much longer than their CPU counterparts, but these deep pipelines permit more
concurrency and more throughput.
GPU manufacturers have found that many of their applications of interest are limited not
by the GPU but by the CPU, specifically its ability to marshal and send the data to the GPU.
A significant effort in GPU architecture and API design is thus directed toward alleviating this
bottleneck. The GPU manufacturers also often find that their applications that are limited by the
GPU are limited less by GPU computation and more by its memory system. As we mentioned
in the introduction, the NVIDIA GeForce  features . GB/s of peak main memory
bandwidth, and about  GB/s of measured sequentially accessed main memory bandwidth.
However, the bandwidth on random reads is significantly lower, only about  GB/s [Buca].
The FFT on the GPU
We now describe the high-level structure of the FFT and its mapping to the GPU, discuss the
previous work in this field, and then detail the implementation we chose for this study.
The Fourier transform links signal representations between the physical domain (usually
time) and the spectral domain (frequency). While direct computation of the Fourier transform
for n points has complexity $O(n^2)$, the “fast Fourier transform” (FFT) described by
Cooley and Tukey [CT] reduced this cost to $O(n \log n)$. The FFT and its variants are the most
common methods of computing the Fourier transform in today’s signal processing applications.
The most common formulation of the FFT is described as “radix-2, decimation-in-time
(DIT)” and we use it to describe the general structure of the FFT computation. The FFT is
at its core a divide-and-conquer algorithm, and radix-2 means that at each step, the problem
is divided into 2 parts. Other radixes are possible, but here, we can conclude that a radix-2
n-point FFT will take $\log_2 n$ stages. Decimation-in-time means that the physical domain
signals (usually measurements as a function of time) are split into two halves in an interleaved
fashion: even measurements into the first half, odd measurements into the second half. The
alternative, decimation-in-frequency, splits the signal in the spectral domain rather than the
physical domain.
The recursive nature of the FFT means that the problem, for power-of-2 values of n,
eventually decomposes to a series of 2-point transforms, known as “butterflies”. These butterflies
produce two outputs from two inputs: $o_0 = i_0 + W_n i_1$; $o_1 = i_0 - W_n i_1$. The $W_n$ coefficients
are called “twiddle factors” and are complex roots of 1. The computation of the butterflies
and (possibly) the twiddle factors constitutes the computational needs of the algorithm.
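To make the divide-and-conquer structure and the butterfly concrete, here is a minimal Python sketch of a recursive radix-2 DIT FFT alongside a direct $O(n^2)$ transform used as a reference. It is an illustration only, not the implementation evaluated in this report; the function names are hypothetical, and the forward-transform sign convention $W_n^k = e^{-2\pi i k/n}$ is an assumption.

```python
import cmath


def dft(x):
    """Direct O(n^2) discrete Fourier transform, used as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def fft_dit(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of 2."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft_dit(x[0::2])   # even-indexed samples form the first half-size problem
    odd = fft_dit(x[1::2])    # odd-indexed samples form the second
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(-2j * cmath.pi * k / n)   # twiddle factor W_n^k (assumed sign)
        # Butterfly: two outputs from two inputs.
        out[k] = even[k] + w * odd[k]
        out[k + n // 2] = even[k] - w * odd[k]
    return out


samples = [1, 2, 3, 4, 0, -1, -2, -3]
spectrum = fft_dit(samples)
reference = dft(samples)
# The two transforms agree up to floating-point rounding.
max_error = max(abs(a - b) for a, b in zip(spectrum, reference))
```

The recursion halves the problem $\log_2 n$ times and performs $n/2$ butterflies at each level, which is where the $O(n \log n)$ cost comes from.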
The FFT, then, has the following characteristics that are interesting from the point of
view of efficient implementations:
• The FFT is a multi-stage algorithm; for a radix-2 formulation, an n-point FFT requires
$\log_2 n$ stages.
• The FFT is typically performed on complex numbers.
• The recursive structure of the FFT, coupled with the interleaving of DIT, means that
the inputs to the first stage of butterflies are not in input order. Instead, they are in
“bit-reversed” order. Memory systems with bit-reversed load primitives are common
in digital signal processors.
• Each successive stage also requires a structured but non-trivial and non-sequential
communication pattern; this pattern is implementation-dependent.
• If organized properly, by evaluating blocks of the FFT as a coherent whole, the memory
traffic of the FFT is highly cache efficient.
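The “bit-reversed” order mentioned in the list above can be shown in a few lines of Python. This is a small sketch with hypothetical helper names: element i of the reordered input is the sample whose index is i with its binary digits reversed.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_order(n):
    """Return the load order for an n-point radix-2 DIT FFT (n a power of 2)."""
    bits = n.bit_length() - 1
    return [bit_reverse(i, bits) for i in range(n)]


print(bit_reversed_order(8))   # [0, 4, 2, 6, 1, 5, 3, 7]
```

For n = 8, index 1 (binary 001) maps to 4 (binary 100), index 3 (011) maps to 6 (110), and so on; indices whose bit patterns are palindromes stay in place.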
Previous GPU Implementations of the FFT
In our recent GPGPU survey we compiled a list of recent GPU FFT implementations [OLG+], which
we reproduce below:
Motivated by the high arithmetic capabilities of modern GPUs, several projects
have recently developed GPU implementations of the fast Fourier transform
(FFT) [BFH+b, JvHK, MA, SL]. (The GPU Gems 2 chapter by Sumanaweera
and Liu, in particular, gives a detailed description of the FFT and their
GPU implementation [SL].) In general, these implementations operate on d
or d input data, use a radix-2 decimation-in-time approach, and require one
fragment-program pass per FFT stage. The real and imaginary components of
the FFT can be computed in two components of the -vectors in each fragment
processor, so two FFTs can easily be processed in parallel. These implementations
are primarily limited by memory bandwidth and the lack of effective
caching in today’s GPUs, and only by processing two FFTs simultaneously can
they match the performance of a highly tuned CPU implementation [FJ]. Daniel
Horn maintains an open-source optimized FFT library based on the Brook
distribution [Horb].
Since the publication of this survey (August ), we have not come across any new
implementations described in the literature.

Mapping the FFT to the GPU
The most straightforward way to map the FFT to the GPU is to perform one pass through the
GPU for each of the log n stages of the FFT. The previous algorithms (Section .) all use this
structure. Each stage requires the computation of many butterflies (an n-point FFT requires
n/2 butterflies), which can all be computed in parallel and with the same SIMD instruction
stream.
The structure of each stage is similar: draw geometry that covers n fragments for an
n-point FFT (depending on the packing and the structure of the computation, n/2 or n/4
fragments might be appropriate); in the fragment program, load the inputs to the butterflies
as a texture fetch; and store the outputs of the butterfly as a texture for use in the next stage.
Two major challenges in doing this efficiently are ensuring that the -wide fragment processors
are fully utilized and attempting to run the same code on each stage, meaning the
communication pattern must be identical in each stage (this is not true in a traditional CPU-based
FFT computation).
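The one-pass-per-stage structure described above can be emulated in Python. The sketch below is a CPU-side illustration under stated assumptions, not GPU code and not any of the cited implementations: every output element of a stage is computed by a hypothetical kernel, `fft_stage_kernel`, that only gathers two inputs, and each stage's output is fed to the next stage as its input. Note that in this conventional formulation the read addresses change from stage to stage, which is exactly the second challenge named above.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_stage_kernel(src, p, half):
    """Compute output element p of one stage by gathering two inputs."""
    span = 2 * half
    k = p % half                                 # twiddle index within the butterfly group
    top = p - half if p % span >= half else p    # butterfly partner with the lower index
    w = cmath.exp(-2j * cmath.pi * k / span)     # assumed forward-transform sign
    if p == top:
        return src[top] + w * src[top + half]
    return src[top] - w * src[top + half]


def fft_multipass(x):
    """Radix-2 DIT FFT as a sequence of gather-only passes; returns (result, passes)."""
    n = len(x)
    bits = n.bit_length() - 1
    src = [x[bit_reverse(i, bits)] for i in range(n)]   # bit-reversed load
    half, passes = 1, 0
    while half < n:
        # One "pass": every output element is produced independently.
        dst = [fft_stage_kernel(src, p, half) for p in range(n)]
        src = dst          # this pass's output is the next pass's input texture
        half *= 2
        passes += 1
    return src, passes


result, passes = fft_multipass([1, 2, 3, 4, 0, -1, -2, -3])
print(passes)   # 3 passes for an 8-point FFT
```

Each pass reads and writes all n points, so the number of passes, and the memory traffic, grows with $\log_2 n$.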
Our Implementation
After evaluating the existing implementations and their advantages and disadvantages, for
this study we adapted Daniel Horn’s “libgpufft” library [Hora, Horb]. This library was
developed at Stanford in conjunction with the Brook stream programming system [BFH+b]
and has several factors that made it the most attractive basis for this study.
• libgpufft has two major technical advantages over other GPU FFT implementations:
– libgpufft has an identical communication pattern for each stage, allowing the
same fragment program to be used across all stages. Consequently libgpufft does
not incur the cost of switching the fragment program between stages. This pattern,
while structured and consistent across passes, is not sequential. Thus it will
not realize the maximum memory bandwidth from the GPU memory system.
(We discuss this point further in Section ..) This difficulty is common to all
GPU implementations of the FFT.
– libgpufft fully utilizes the -wide floating-point arithmetic units in the GPU’s
fragment program units. Horn’s implementation packs two complex floating-point
numbers into each fragment, allowing the -wide units to be used at maximum
efficiency. Other FFT implementations require two FFTs to run simultaneously
to fully utilize the math capabilities of these units.
• libgpufft is written in Brook, a high-level language for GPUs, making its code simpler
to understand and adapt than other implementations. Buck indicated that the FFT’s
loss of efficiency when written in Brook compared to written in a low-level language
is small [Buca].
• Its source code is publicly available.
• It is efficient on both NVIDIA and ATI hardware.
• Horn and his colleagues have compared their Brook implementation to other FFT
implementations, including GPU implementations (Moreland’s original FFT
implementation [MA] and ATI’s reference implementation) as well as the best-of-class
CPU implementation, FFTW [FJ]. Brook has comparable or better performance
than any of the GPU implementations.
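The first advantage listed above, an identical communication pattern in every stage, can be illustrated with a constant-geometry FFT in the style of Pease. The Python sketch below is an assumption-laden illustration and should not be read as libgpufft's actual access pattern, which this report does not spell out: it uses a decimation-in-frequency formulation, hypothetical function names, and one complex number per output element rather than libgpufft's packing of two.

```python
import cmath


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def constant_geometry_pass(src, stage):
    """One pass whose read addresses are the same in every stage."""
    n = len(src)
    half = n // 2
    dst = []
    for p in range(n):                        # one output element per "fragment"
        j = p >> 1
        a, b = src[j], src[j + half]          # addresses do not depend on the stage
        if p & 1 == 0:
            dst.append(a + b)
        else:
            exponent = (j >> stage) << stage  # only the twiddle depends on the stage
            dst.append((a - b) * cmath.exp(-2j * cmath.pi * exponent / n))
    return dst


def fft_constant_geometry(x):
    """Radix-2 decimation-in-frequency FFT with a fixed per-stage access pattern."""
    n = len(x)
    stages = n.bit_length() - 1
    data = list(x)
    for stage in range(stages):
        data = constant_geometry_pass(data, stage)
    # Results emerge in bit-reversed order; one final permutation restores order.
    return [data[bit_reverse(k, stages)] for k in range(n)]


spectrum = fft_constant_geometry([1, 2, 3, 4, 0, -1, -2, -3])
```

Every pass reads elements `j` and `j + n/2` to produce outputs `2j` and `2j + 1`, so the same program can be reused across passes with only the stage number changing. The reads are strided rather than sequential, which is consistent with the bandwidth caveat noted in the list.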
Buck indicated in August  that the (d) libgpufft has twice the performance (throughput)
of FFTW [Buca]. libgpufft, however, only addresses d FFTs, thus for this study we
adapted its techniques to apply to arrays of d FFTs. We plan to incorporate our changes and
additions into the libgpufft release.
Analyzing FFT Performance
To aid in understanding the results below (Section ), we describe some of the factors that
contribute to the runtime and some of the considerations we should make in analyzing the
results.
Keys to FFT High Performance
Efficient FFT computation depends on the following factors:
Computation: FFTs demand high floating-point computation rates. For a -point FFT,
the butterflies require on the order of , operations.
Main memory bandwidth: For an n-point FFT, each stage of computation requires reading
and writing n points from and to memory. FFT-specific hardware also often features a
bit-reversal primitive in the memory system, but its effect on the overall performance
here would likely be minimal.
Cache efficiency: Because main memory bandwidth is often the limiting factor in FFT
implementations, any gains in memory bandwidth from caching may directly improve
FFT performance.
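The computation and bandwidth factors above can be put side by side with a back-of-the-envelope estimate. The Python sketch below uses the common $5 n \log_2 n$ operation count for a radix-2 complex FFT (ten real operations per butterfly) and assumes, purely for illustration, 8 bytes per complex point (two 32-bit floats) and no caching between stages; the function names are hypothetical, and none of these figures are measurements from this report.

```python
def fft_flops(n):
    """Approximate real operation count of a radix-2 complex FFT: 5 * n * log2(n)."""
    stages = n.bit_length() - 1        # log2(n) for a power-of-2 n
    butterflies_per_stage = n // 2
    flops_per_butterfly = 10           # 1 complex multiply (6) + 2 complex adds (4)
    return stages * butterflies_per_stage * flops_per_butterfly


def fft_bytes_moved(n, bytes_per_complex=8):
    """Memory traffic if every stage reads and writes all n points (no caching)."""
    stages = n.bit_length() - 1
    return stages * 2 * n * bytes_per_complex


n = 1024
print(fft_flops(n))                          # 51200
print(fft_bytes_moved(n))                    # 163840
print(fft_flops(n) / fft_bytes_moved(n))     # 0.3125 operations per byte
```

Under these assumptions the ratio of arithmetic to memory traffic is well below one operation per byte regardless of n, which suggests why an implementation that cannot keep intermediate results in a cache tends to be limited by memory bandwidth rather than by computation.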
Using these factors as a basis, we can divide the runtime of our computation into four
components and compare them between the CPU and the GPU. Table  summarizes these
Buck compared CPU (Pentium  . GHz) vs. GPU (NVIDIA GeForce ) performance for several
benchmarks. On the microbenchmark SAXPY, for instance, the GPU’s advantage was over  to . The
discrepancy between the : GPU performance advantage on d FFTs and what we report here for d FFTs is because
the d FFT requires a cache-unfriendly transposition step on the CPU, whereas the GPU has a native d memory
system that has equal efficiency for horizontal and vertical strides.

components. In summary, the raw computation and main memory bandwidth rates of the
GPU are superior to the CPU, but the superior cache performance of the CPU, and its
substantially smaller transfer and setup times, give it an advantage over the GPU in many cases.
Evaluating the FFT
The user of the FFT may be interested in two aspects of FFT performance. First, the latency
of the FFT may be most important, and implementations should strive to minimize the
amount of time to compute a single FFT. Second, FFT throughput may be most important,
so implementations may instead attempt to maximize the number of FFTs performed in a
given amount of time.
These two goals often conflict, particularly as systems move from scalar to
parallel. The additional parallel hardware can either be used to speed up a single FFT
(improving latency) or perform multiple FFTs in parallel (improving throughput). In general,
the parallelism exploited by the GPU is better suited to improving throughput than
latency.
Results and Analysis
Experimental Setup
We evaluated the FFT on two platforms, a laptop running Windows XP and a desktop also
running Windows XP. All results shown are for the (faster) desktop.
For the CPU FFT, we used the FFTW version . available at http://www.fftw.org/
compiled with Microsoft Visual C++; for the GPU FFT, we used the CVS version of Brook with
the OpenGL backend (also compiled with Microsoft Visual C++).
All tests were run  times and the results averaged for our final measurements.
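The repeat-and-average procedure described above can be sketched as a small Python timing harness. This is an illustration rather than the harness used for this report: the helper names and the repeat count passed in the usage line are hypothetical (the report's own repeat count is not legible in this copy), and the MFLOPS conversion assumes the common $5 n \log_2 n$ operation-count convention, which the report does not state explicitly.

```python
import time


def mean_runtime(fn, repeats):
    """Call fn() `repeats` times and return the mean wall-clock time in seconds."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        total += time.perf_counter() - start
    return total / repeats


def mflops(n, seconds):
    """Convert the time for one n-point FFT into MFLOPS (assumed 5*n*log2(n) count)."""
    flops = 5 * n * (n.bit_length() - 1)
    return flops / seconds / 1e6


# Hypothetical usage: time a stand-in workload 10 times and average.
elapsed = mean_runtime(lambda: sum(range(1000)), repeats=10)
```

Averaging over repeated runs smooths out one-time costs such as cold caches, so a harness of this kind reports steady-state behavior rather than first-call latency.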
Single Issue FFTs
We begin by analyzing the amount of time to evaluate a single FFT. Figure  illustrates the
runtime of a single FFT as a function of the size of the FFT, and Figure  shows MFLOPS
(millions of floating-point operations per second). This test primarily measures FFT latency.
The results indicate several interesting points. First, at all sizes, the CPU-based FFTW
has a much smaller latency than the GPU-based libgpufft. We also note that, as we expect,
FFTs with more points take longer to compute on the CPU using FFTW than FFTs with fewer
CPU: 2128 MHz Pentium M 770; 64 KB L1, 2048 KB L2 cache; main memory: 512 MB x2 133 MHz
DDR2-SDRAM; GPU: GeForce GO 6800 Ultra PCI-E (NV41) with 256 MB memory.
CPU: 2412 MHz AMD Athlon 64 FX 53; 128 KB L1, 1024 KB L2 cache; main memory: 1024x1 200 MHz
DDR-SDRAM; GPU: GeForce 7800 GTX with 256 MB memory.
OpenGL is a cross-platform API for graphics supported by modern graphics cards; the alternative is the
Windows-only “DirectX” API.

Component: Transfer time
CPU: Zero
GPU: Fixed
Comments: For the GPU to begin computation it must first transfer the data from the CPU’s
memory to the GPU’s memory over a system bus. Though the recent introduction of the
PCI Express (PCI-E) bus has improved the bandwidth of this transfer, and continued driver
improvements have helped the latency, this cost is still an important component for any
GPU-based computation, particularly for small amounts of GPU computation.

Component: Setup time
CPU: Smaller
GPU: Larger
Comments: To run a program over a dataset, the GPU must set up the graphics pipeline, load
the fragment program, run the input through the entire pipeline, and flush the pipeline
at the end. Collectively this is termed a “pass”. In our GPU FFT implementation, one pass is
necessary for each stage. While the cost of a pass has decreased over the past few years due
to better drivers and hardware, processing speed has increased even faster, so ever-larger
amounts of work are necessary to mitigate the setup cost. In any case the setup cost is still
certainly larger than the cost of setting up and initiating a comparable CPU program.

Component: Memory loads/stores
CPU: Slower, but cacheable
GPU: Faster
Comments: The GPU memory system has – times as much main memory bandwidth as a CPU
memory system, but the CPU caches can cache intermediate results, unlike the GPU. (GPU
caches are read-only, so no intermediate results can be written back to them. CPUs, on the
other hand, take advantage of the read-write capabilities of their caches to place an entire
block of the FFT within their caches and fully compute it without intermediate accesses to
main memory.) To take full advantage of GPU memory bandwidth, the dataset should allow
sequential memory fetches as much as possible.

Component: Computation
CPU: Slower
GPU: Faster
Comments: If structured in a GPU-friendly manner, GPU computation is substantially more
capable than CPU computation. The CPU can make up some ground by effectively using its
vector (SSE and SSE-) units, but the CPU still has many fewer functional units than the GPU.
The peak floating-point performance for today’s hottest GPU is more than  times as high as
its CPU counterpart, and this gap is growing. The challenge for GPUs is to organize the
computation to take advantage of the full power of their hardware.

Table : FFT runtime components compared between CPU and GPU.

[Figure: time per FFT (s) against size of FFT (points), logarithmic axes, with curves for fftw and ogl.]
Figure : Time to compute one FFT (latency), as a function of the size of the FFT. Toward the bottom
of the graph is faster. “fftw” is the CPU-based FFTW; “ogl” is the GPU-based Brook OpenGL back end
to libgpufft.
[Figure: MFLOPS against size of FFT (points), logarithmic axes, with curves for fftw and ogl.]
Figure : Measured MFLOPS (millions of floating-point operations per second), as a function of the
size of the FFT. Toward the top of the graph is higher MFLOPS. “fftw” is the CPU-based FFTW; “ogl” is
the GPU-based Brook OpenGL back end to libgpufft.

points. However, to first order, the latency of one single FFT on the GPU does not depend
on the size of the FFT except for very large FFTs (k points). Why is this?
As we saw in Section ., the runtime of an FFT depends on several factors. In the case
of the CPU, the most significant component is the time to compute the FFT. For the GPU,
however, the time to compute one single FFT is insignificant compared to other factors.
It is difficult to separate the various components that account for the GPU’s runtime, but
they would include the time to load the dataset from CPU memory, the time to set up and
perform the transfer to the GPU, the time to initially set up the graphics pipeline and load
the fragment program, the time to flush the pipeline when the computation is complete, the
time to set up and perform the transfer from the GPU back to the CPU, and the time to store
the result back in the CPU’s main memory.
A second factor, for GPUs, is their inefficiency with small numbers of fragments. Modern
GPUs appear to need a minimum number of fragments to fill the pipeline
for optimal efficiency. On the latest GPUs, Horn et al. report this threshold is roughly
 fragments for their HMMer application (described as “not unlike libgpufft”) [HHH].
What this means in practice is that processing any number of fragments less than  will
take the same amount of time as processing  fragments. Horn et al. also report that a
larger threshold is necessary to get peak performance, and again on the latest GPUs, this
threshold is roughly , fragments. Running a batch of fewer than , fragments will
not approach peak performance on these GPUs, and these fragment thresholds certainly will
not shrink but are instead likely to grow with new GPU generations.
The lesson to be learned here is that the CPU has substantially smaller overhead associated
with initiating an FFT than the GPU, and in these tests, the GPU FFT is limited by this
overhead. Though many of the factors that contribute to GPU overhead have improved over
the past few GPU generations, it is probably safe to say that CPU overhead will continue to be
substantially lower than GPU overhead in the near to intermediate future. As we discussed
in Section , low latency is not the goal of GPU designers; thus if your task involves running a
single FFT, or latency is your primary concern, the CPU’s FFT is a better choice.
Varying FFT Issue
Evaluating only a single FFT at a time is a task poorly suited for graphics hardware. The
GPU is much better at emphasizing throughput and evaluating multiple FFTs at the same
time. For a CPU the cost of additional FFTs is linear: evaluating m FFTs takes m times as
long as evaluating  FFT. The GPU has a very different tradeoff. For example, in our tests,
evaluating  -point FFTs takes only . times as long as evaluating  -point
FFT. This is because the cost of evaluating a single FFT is dominated by overhead (as we
saw in Section .), but this overhead changes very little when we calculate multiple FFTs at
the same time. Also, running a larger number of fragments through the fragment programs
(in this case, one million fragments, well above the thresholds we discussed in the previous
section) allows the fragment program to reach maximum efficiency.
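The contrast between linear CPU cost and overhead-dominated GPU cost can be captured in a toy cost model. The Python sketch below is only a model: the function names and all three timing constants are hypothetical values chosen to show the shape of the tradeoff, not measurements from this report.

```python
def cpu_batch_time(m, per_fft):
    """CPU model: m FFTs cost m times one FFT."""
    return m * per_fft


def gpu_batch_time(m, overhead, per_fft):
    """GPU model: a fixed per-batch overhead plus a small incremental cost per FFT."""
    return overhead + m * per_fft


def break_even_batch(cpu_per_fft, gpu_overhead, gpu_per_fft):
    """Smallest m at which the modeled GPU batch is no slower than the CPU batch."""
    if cpu_per_fft <= gpu_per_fft:
        return None          # in this model the GPU never catches up
    m = 1
    while gpu_batch_time(m, gpu_overhead, gpu_per_fft) > cpu_batch_time(m, cpu_per_fft):
        m += 1
    return m


# Hypothetical parameters, in seconds, for illustration only.
CPU_PER_FFT = 20e-6
GPU_OVERHEAD = 2e-3
GPU_PER_FFT = 5e-6

print(break_even_batch(CPU_PER_FFT, GPU_OVERHEAD, GPU_PER_FFT))   # 134
```

With these made-up constants a single FFT is about a hundred times slower on the modeled GPU, yet batches of 134 or more favor it, because the fixed overhead is paid once per batch rather than once per FFT.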

[Figure: three panels, each plotting time per FFT (s) against the number of simultaneous -point FFTs
on logarithmic axes, with curves for fftw and ogl. The panels differ in the number of sequential FFTs.]
Figure : Time per -point FFT, varying the number of simultaneous (parallel) FFTs (moving toward
the right on the graphs above) and varying the number of sequential FFTs (moving down between the
graphs above). “fftw” is the CPU-based FFTW; “ogl” is the GPU-based Brook OpenGL back end to
libgpufft.

[Figure: three panels, each plotting MFLOPS against the number of simultaneous -point FFTs
on logarithmic axes, with curves for fftw and ogl. The panels differ in the number of sequential FFTs.]
Figure : Measured MFLOPS (millions of floating-point operations per second) for the -point FFT,
varying the number of simultaneous (parallel) FFTs (moving toward the right on the graphs above) and
varying the number of sequential FFTs (moving down between the graphs above). “fftw” is the
CPU-based FFTW; “ogl” is the GPU-based Brook OpenGL back end to libgpufft.

[Figure: six panels in two columns, each plotting time per FFT (s) against the number of simultaneous
-pt. FFTs on logarithmic axes, with curves for fftw and ogl. The rows differ in the number of
sequential FFTs.]
Figure : Time per -point (left) and -point (right) FFT, varying the number of simultaneous
(parallel) FFTs (moving toward the right on the graphs above) and varying the number of sequential
FFTs (moving down between the graphs above). “fftw” is the CPU-based FFTW; “ogl” is the GPU-based
Brook OpenGL back end to libgpufft.

[Plot panels not reproduced. Left and right columns correspond to the two FFT sizes named in the caption. Axes: MFLOPS versus number of simultaneous FFTs (logarithmic, 10^0 to 10^4); one row per number of sequential FFTs. Series: fftw, ogl.]
Figure : Measured MFLOPS (millions of ﬂoating-point operations per second) for the -point (left)
and -point (right) FFT, varying the number of simultaneous (parallel) FFTs (moving toward the
right on the graphs above) and varying the number of sequential FFTs (moving down between the graphs
above). “fftw” is the CPU-based FFTW; “ogl” is the GPU-based Brook OpenGL back end to libgpufft.

[Plot panels not reproduced. Axes: time per FFT (s) versus number of simultaneous FFTs (logarithmic, 1 to 100); one panel per number of sequential FFTs. Series: fftw, ogl.]
Figure : Time per ,-point FFT, varying the number of simultaneous (parallel) FFTs (moving
toward the right on the graphs above) and varying the number of sequential FFTs (moving down between
the graphs above). “fftw” is the CPU-based FFTW; “ogl” is the GPU-based Brook OpenGL back end to
libgpufft.

[Plot panels not reproduced. Axes: MFLOPS versus number of simultaneous FFTs (logarithmic, 1 to 100); one panel per number of sequential FFTs. Series: fftw, ogl.]
Figure : Measured MFLOPS (millions of ﬂoating-point operations per second) for the ,-point
FFT, varying the number of simultaneous (parallel) FFTs (moving toward the right on the graphs above)
and varying the number of sequential FFTs (moving down between the graphs above). “fftw” is the
CPU-based FFTW; “ogl” is the GPU-based Brook OpenGL back end to libgpufft.

Figures  and  summarize our results from varying FFT issue for -point FFTs. (For
completeness, Figures  and  have results for -point and -point FFTs, and Figures 
and  have results for ,-point FFTs. The latter is an interesting case: the size of this FFT
makes it cache more poorly than smaller FFTs on the CPU, but this degradation in perfor-
mance is balanced by the GPU’s need to introduce address translation arithmetic to handle
the large size of the necessary data storage.) We describe the results of two experiments here.
• First, we vary the number of simultaneous (parallel) FFTs calculated. Instead of
measuring the amount of time to calculate  FFT, we measure the amount of time to
calculate m FFTs. As the number of simultaneous FFTs grows, the gap between the CPU’s
throughput and the GPU’s throughput shrinks. This is because the marginal cost of
issuing more FFTs after the overhead is paid for the first is much smaller for a GPU. In
Figures  and , as we move right along the graphs, the number of simultaneous FFTs
increases.
• Second, we replace a single FFT with a series of FFT-IFFT pairs. These are calculated
sequentially; thus, instead of running a single FFT, we run FFT-IFFT-FFT-IFFT…. On
the GPU, this mitigates the overhead of transferring the data to and from the GPU. In
Figures  and , as we move down between the graphs, the number of sequential FFTs
increases.
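The amortization argument behind both experiments can be sketched with a simple cost model. The Python sketch below is illustrative only: the function name, the way the cost is split into terms, and every numeric value are assumptions chosen for the example, not measurements from this report.

```python
def time_per_fft(m, s, setup, transfer_per_fft, pass_overhead, compute_per_fft):
    """Average time per FFT for m simultaneous and s sequential FFTs.

    Hypothetical model: a fixed setup cost and the transfer of m data sets
    are paid once; each of the s sequential steps then pays a per-pass
    overhead plus the compute cost of m FFTs.
    """
    if m < 1 or s < 1:
        raise ValueError("m and s must be at least 1")
    total = setup + m * transfer_per_fft + s * (pass_overhead + m * compute_per_fft)
    return total / (m * s)


# Illustrative costs in seconds (assumed, not taken from the report).
GPU = dict(setup=1e-3, transfer_per_fft=2e-5, pass_overhead=1e-4, compute_per_fft=4e-5)
CPU = dict(setup=0.0, transfer_per_fft=0.0, pass_overhead=0.0, compute_per_fft=4e-5)

for m in (1, 10, 100, 1000):
    for s in (1, 10, 100):
        gpu = time_per_fft(m, s, **GPU)
        cpu = time_per_fft(m, s, **CPU)
        print(f"m={m:5d} s={s:4d}  gpu={gpu:.2e} s  cpu={cpu:.2e} s")
```

Under these assumed costs the modeled GPU time per FFT approaches the CPU's as m and s grow, because the fixed terms are divided across more FFTs. The total time for a batch still includes the full overhead, so the model does not predict any latency improvement.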
Both of these experiments mitigate the overhead associated with the GPU. While they
have little impact on the throughput of the CPU, the throughput per FFT on the GPU rises
with more simultaneous FFTs and with more sequential FFTs. The throughputs of the CPU
and the GPU are roughly equivalent for  parallel, -or-more sequential -point
FFTs. The GPU performs correspondingly better on larger FFTs than smaller FFTs (up to a
limit: moving to the huge ,-point FFT did not produce an additional performance
advantage for the GPU). It is important to note, however, that in the parallel case, the latency
does not change on the GPU compared to the serial case.
By running multiple FFTs on the GPU in parallel and emphasizing throughput as our
metric, we avoid our performance being limited by the overhead of the transfer and pipeline
setup. We can now determine the limitation of the FFT on the GPU proper.
In Figure , we can see that the maximum throughput of -point FFTs on the GPU
is roughly one every  microseconds. The memory requirement for a -point FFT is
roughly  kB, and at  FFT every  microseconds, the memory bandwidth sustained
by the GPU in our tests is a little more than  GB/s. This bandwidth falls between the
maximum measured memory bandwidth for random accesses ( GB/s) and for sequential
accesses ( GB/s) discussed in Section , which is to be expected since the memory pattern
is structured and regular but not sequential.
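The bandwidth estimate above follows from two quantities: the bytes an FFT moves through memory and the time per FFT. The Python sketch below shows that arithmetic. The function names are hypothetical, the default of log2(points) stages assumes a radix-2 FFT, and the example values (256 points, 3 words per point per stage, 8 bytes per word, 10 microseconds per FFT) are placeholders rather than the report's figures.

```python
def fft_traffic_bytes(points, words_per_point_per_stage, bytes_per_word, stages=None):
    """Bytes moved for one FFT: stages x words per stage x bytes per word x points."""
    if points < 1 or points & (points - 1):
        raise ValueError("points must be a power of two")
    if stages is None:
        # Assumption: radix-2 FFT, so log2(points) stages.
        stages = points.bit_length() - 1
    return stages * words_per_point_per_stage * bytes_per_word * points


def sustained_bandwidth(bytes_per_fft, seconds_per_fft):
    """Sustained memory bandwidth in bytes per second."""
    return bytes_per_fft / seconds_per_fft


# Placeholder inputs (assumed): one read, one write, one twiddle-factor read
# per point per stage, and 8-byte complex single-precision words.
traffic = fft_traffic_bytes(points=256, words_per_point_per_stage=3, bytes_per_word=8)
bandwidth = sustained_bandwidth(traffic, seconds_per_fft=1e-5)
print(f"{traffic} bytes per FFT, {bandwidth / 1e9:.2f} GB/s sustained")
```

Substituting the measured time per FFT and the footnote's actual factors gives the sustained figure quoted in the text.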
On the other hand, the -point FFT requires on the order of , instructions (Sec-
tion .), and at  microseconds per FFT, the required arithmetic capability is . GFLOPS.
 stages×  words per stage ( read,  write,  for reading the twiddle factors)×  bytes per word× 
points =  kB.

[Plot not reproduced. Axes: elapsed time (s) versus number of complex multiplies (logarithmic, 1 to 1000). Series: fftw, ogl.]
Figure : Time to compute  -point FFTs, a series of complex multiplies on each point of the
result (indicated on the x axis), and  -point IFFTs. Toward the bottom of the graph is faster.
“fftw” is the CPU-based FFTW; “ogl” is the GPU-based Brook OpenGL back end to libgpufft.
This is far below the  GFLOPS of GPU ﬂoating-point capability. We can thus conclude
that the core GPU FFT is limited by memory bandwidth and not by computation.
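The reasoning in this paragraph compares how much of each peak resource the FFT needs. The Python sketch below is one rough way to express that comparison. It uses the conventional 5 N log2 N operation count that FFTW's benchmarks use for complex transforms; whether this report counted operations the same way is an assumption, and the peak figures and timings in the example are placeholders.

```python
import math


def fft_flops(points):
    """Conventional 5 N log2 N flop estimate for a complex FFT of N points."""
    return 5.0 * points * math.log2(points)


def sustained_gflops(flops, seconds):
    """Arithmetic rate in GFLOPS for a given flop count and elapsed time."""
    return flops / seconds / 1e9


def limiting_resource(gflops_needed, peak_gflops, bandwidth_needed, peak_bandwidth):
    """Name the resource that is used at the larger fraction of its peak."""
    compute_fraction = gflops_needed / peak_gflops
    memory_fraction = bandwidth_needed / peak_bandwidth
    return "memory bandwidth" if memory_fraction > compute_fraction else "computation"


# Placeholder inputs (assumed): a 256-point FFT every 10 microseconds,
# a 50 GFLOPS arithmetic peak, and 5 GB/s needed of a 20 GB/s memory peak.
needed = sustained_gflops(fft_flops(256), 1e-5)
print(f"needed: {needed:.3f} GFLOPS")
print("limited by:", limiting_resource(needed, 50.0, 5e9, 20e9))
```

The fraction-of-peak comparison is a heuristic, not a full performance model, but it captures the argument: a kernel that uses a small fraction of peak arithmetic while its bandwidth demand sits near measured memory rates is memory-bound.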
Recall that the GPU and CPU throughputs were similar for many parallel, sequential
FFTs. Thus in our tests, we conclude that our CPU could sustain roughly . GFLOPS.
FFTW’s internal tests on a . GHz Pentium  Xeon show that -point d FFTs sus-
tain roughly . GFLOPS [FFTb], and on a similar but slower processor to ours, roughly
. GFLOPS [FFTa].
. Adding Intermediate Computation
Any real application would be unlikely to run just the FFT. Instead, the FFT is more typically
part of a chain of processing kernels. For instance, an image processing task may require an
FFT, a complex multiply on each frequency component to implement a filter, and then an
inverse FFT. Other applications may require much more complex intermediate computation.
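The FFT, per-component multiply, inverse FFT chain described above can be sketched in a few lines. This Python example is a minimal CPU-only illustration using a textbook recursive radix-2 FFT; it is not the FFTW or libgpufft implementation, and the names `fft`, `ifft`, and `filter_signal` are hypothetical.

```python
import cmath


def fft(x, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two.

    The inverse transform here is unnormalized; ifft() applies the 1/N scale.
    """
    n = len(x)
    if n < 1 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1.0 if inverse else -1.0
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor times the odd half (one complex multiply per butterfly).
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(x):
    """Inverse FFT with 1/N normalization."""
    n = len(x)
    return [v / n for v in fft(x, inverse=True)]


def filter_signal(signal, response):
    """FFT, one complex multiply per frequency component, then inverse FFT."""
    spectrum = fft(signal)
    filtered = [s * h for s, h in zip(spectrum, response)]
    return ifft(filtered)


signal = [1, 2, 3, 4, 0, 0, 0, 0]
keep_dc_only = [1] + [0] * 7  # keeps only the zero-frequency component
print([round(v.real, 6) for v in filter_signal(signal, keep_dc_only)])
```

The middle step is the part that the next experiment scales up: it touches each frequency component independently, which is why it maps well to data-parallel hardware.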
Our next experiment measures the impact of placing a task between an FFT and an
inverse FFT (Figures  and ). The task is a series of complex multiplies on each frequency
component generated by the FFT. For a simple filter, only  complex multiply may be
necessary. To simulate more complex tasks, we scaled the number of complex multiplies, up to 
complex multiplies on each frequency component. Recall that a -point FFT requires on

[Plot not reproduced. Axes: MFLOPS versus number of complex multiplies (logarithmic, 1 to 1000). Series: fftw, ogl.]
Figure : Measured MFLOPS (millions of floating-point operations per second) while computing 
-point FFTs, a series of complex multiplies on each point of the result (indicated on the x axis), and
 -point IFFTs. Toward the top of the graph is faster. “fftw” is the CPU-based FFTW; “ogl”
is the GPU-based Brook OpenGL back end to libgpufft.
the order of , floating-point operations.  complex multiplies on each of  points
is over  million operations.
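The operation count for the intermediate task can be estimated as follows. This Python sketch assumes 6 real floating-point operations per complex multiply (4 multiplies and 2 additions) and the conventional 5 N log2 N count for the FFT; both conventions, the function names, and the example sizes are assumptions made for illustration, not values from this report.

```python
import math

FLOPS_PER_COMPLEX_MULTIPLY = 6  # assumed: 4 real multiplies + 2 real additions


def intermediate_flops(multiplies_per_point, points):
    """Flops spent on the complex multiplies applied to every point."""
    return FLOPS_PER_COMPLEX_MULTIPLY * multiplies_per_point * points


def fft_flops(points):
    """Conventional 5 N log2 N flop estimate for a complex FFT of N points."""
    return 5.0 * points * math.log2(points)


# Placeholder sizes (assumed): 100 multiplies on each of 256 points.
work = intermediate_flops(100, 256)
ratio = work / fft_flops(256)
print(f"{work} flops of intermediate work, {ratio:.1f}x the FFT itself")
```

Because the multiply count scales linearly while the FFT cost is fixed, even moderate multiply counts make the intermediate task dominate the arithmetic.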
When adding GPU-friendly intermediate computation like  complex multiplies,
the GPU demonstrates a clear advantage over the CPU. We should qualify this result
because the complex multiplies are ideal for the GPU: they require only raw computational
horsepower without need for any communication through memory or caches. This
particular benchmark is thus a best-case scenario for the GPU. Still, on this benchmark (
-point FFTs, then  complex multiplies on each of the million frequency components,
then  -point IFFTs), the GPU is sustaining over  GFLOPS (Figure ), a little less
than half its theoretical peak and substantially more than the maximum computation rate of
any current CPU. Note that while increasing the number of complex multiplies from  to 
changes the performance of the CPU by more than an order of magnitude, it has almost
no impact on the GPU at all. This confirms the conclusion from the previous section that
memory bandwidth is the bottleneck for the GPU. Thus on this benchmark, the bottleneck
for the CPU is the complex multiplies; on the GPU, it is the memory bandwidth of the FFT.

. GPU Caveats
GPU hardware, and programming environments for the GPU, are not nearly as mature as
their CPU counterparts. Consequently, several caveats must be offered when
discussing GPU computation. Each of these is an active research topic, and none seems
insurmountable at this time; however, any project concerned with making practical
use of the GPU today must consider them.
• GPU computation today is limited to single-precision ﬂoating-point arithmetic. No
integer capability is currently available, though we expect future hardware to support
integers; double precision is not available at all and does not appear likely in the near
to intermediate future (though synthesizing it in software is a possibility).
GPU ﬂoating-point arithmetic is still not IEEE-compliant [HL], with deﬁciencies
in rounding, precision, and denormals. In addition, GPUs do not support any sort of
exceptions (such as a trap on divide-by-zero).
• GPU programs have hard limits on their size; the Pixel Shader  standard allows 
unique instructions with loops (typically) permitting k instructions. Though soft-
ware techniques can divide large programs into multiple smaller programs, such tech-
niques are not part of today’s production programming environments. We note, how-
ever, that the size of the FFT kernel is far below the program sizes of today’s GPUs; it
would only be much larger programs that would run into this restriction.
• The programming model for the GPU is quite limited compared to the CPU. Programming
parallel hardware is difficult, and not all tasks map well to the SIMD-parallel
model and the limited feature set of the GPU. Much of the research in the GPGPU field
is devoted to finding good mappings between tasks and the GPU hardware. Though
continued progress has been made in this area, it is unlikely that we will ever see vanilla
C or Fortran code compiled directly and efficiently to a GPU or other parallel
architectures.
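The first caveat above mentions synthesizing higher precision in software. One well-known approach, shown here only as an illustration and not as a technique this report evaluated, represents a value as an unevaluated pair of single-precision numbers and uses Knuth's TwoSum to capture rounding error. The Python sketch emulates single precision by rounding through `struct`; the helper names are hypothetical. It relies on IEEE round-to-nearest behavior, which, as noted above, GPUs of this generation do not fully provide.

```python
import struct


def f32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def two_sum(a, b):
    """Knuth's TwoSum in emulated single precision.

    Returns (s, e) where s is the rounded sum and e is its rounding error.
    """
    s = f32(a + b)
    bb = f32(s - a)
    e = f32(f32(a - f32(s - bb)) + f32(b - bb))
    return s, e


def add_to_pair(hi, lo, b):
    """Add single-precision b to the value hi + lo held as a (hi, lo) pair."""
    s, e = two_sum(hi, b)
    e = f32(e + lo)
    hi_new = f32(s + e)
    lo_new = f32(e - f32(hi_new - s))  # renormalize so lo stays small
    return hi_new, lo_new


small = f32(1e-8)
naive = f32(1.0)
hi, lo = f32(1.0), f32(0.0)
for _ in range(1000):
    naive = f32(naive + small)          # plain single precision loses every addend
    hi, lo = add_to_pair(hi, lo, small)  # the pair keeps the lost low-order bits

print("single precision:", naive)
print("pair (hi + lo):  ", hi + lo)
```

In this example the plain single-precision accumulator never moves from 1.0, while the pair tracks the running sum closely. The cost is several single-precision operations per addition.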
 The Future
The future holds both good and bad news for the FFT. The good news is for the future of the
GPU. The bad news is for the technology trends that will shape future computation hardware.
Details on next-generation GPUs are scant. The best way to judge what is coming in
the future is to study the software trends, in particular the upcoming Microsoft DirectX
standards, since GPU vendors must build hardware to conform to these standards. Cur-
rent hardware is compliant with DirectX .. The next generation standard will be called
Windows Graphics Foundation (WGF) (WGF . will be the name for what is logically Di-
rectX ) and chips that will comply with it are still under development. Details of these
chips have not yet been released, so it is diﬃcult to speculate on what might come next.
Nonetheless, from these chips I believe we can expect the following.

Technology Improvements Certainly the computation rates, and particularly the program-
mable computation rates, will continue to rise as they have historically. Figure  shows
the incredible historical growth of programmable ﬂoating-point performance. From
a technology point of view this may not be sustainable in the long term, but it is rea-
sonable to predict that GPU performance improvements will track clock-speed and
transistor-density technology improvements of  a year [Sem] rather than the
 annual performance increases of recent CPUs [EWN]. GPUs expect these
faster rates because they are more easily able to take advantage of parallelism. (For
more on technology trends, please read our recent article in GPU Gems 2 [Owe].)
Memory bandwidth and latency are largely outside the domain of the GPU
manufacturer and will thus improve at industry rates (the  International Roadmap for
Semiconductors forecasts a  annual improvement for DRAM bandwidth [Sem],
and historically, DRAM latency has improved by  annually). As the core GPU FFT
is currently limited by GPU memory performance, and future memory performance
will improve more slowly than compute performance, it seems most likely that the
GPU FFT will continue to be limited by slowly improving memory. Adding
writable caches to the GPU would be the best antidote to this trend.
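The widening gap between compute and memory performance follows from compounding two different annual rates. The Python sketch below shows the arithmetic. The rates used (50% per year for compute, 20% per year for memory bandwidth) are placeholders chosen for the example, not the figures cited in this report, and the function names are hypothetical.

```python
def growth_factor(annual_rate, years):
    """Total improvement after compounding annual_rate for the given years."""
    return (1.0 + annual_rate) ** years


def compute_to_memory_gap(compute_rate, memory_rate, years):
    """How much faster compute has grown than memory bandwidth after years."""
    return growth_factor(compute_rate, years) / growth_factor(memory_rate, years)


# Placeholder rates (assumed): compute +50% per year, memory bandwidth +20% per year.
for years in (1, 3, 5, 10):
    gap = compute_to_memory_gap(0.50, 0.20, years)
    print(f"after {years:2d} years, compute has outgrown memory by {gap:.2f}x")
```

Any kernel that is memory-bound today falls further behind the arithmetic peak each year under such rates, which is the concern raised above for the GPU FFT.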
PCI Express (PCI-E) is a scalable standard. The x PCI-E buses common in work-
stations today appear suﬃcient for graphics applications.
Mitigating GPU Disadvantages We saw in the results that the primary disadvantages for
the GPU are the high overhead for initiating computation and the lack of a writable
cache.
Newer buses have helped lessen the CPU-to-GPU transfer time for large data, but the
initiation cost is still high, and will continue to be a factor. For tasks where latency is
paramount, and the compute time is modest, the CPU will likely be the target of choice
for the near future. (Note, however, that changes in CPU architectures, in particular
the move to multicore, prioritize throughput over latency.)
The cost of setting up a pass has decreased over the past few years as multipass tech-
niques have become more common in graphics applications. This trend should con-
tinue. In addition, recent additions to graphics APIs allow more ﬂexible and faster
manipulation of data in graphics memory: two years ago Bolz et al. indicated that the
cost of switching data buﬀers from output to input was the bottleneck for their lin-
ear algebra application [BFGS], but graphics vendors have greatly reduced that cost
over the last two years. The current “frame buﬀer object” (FBO) data container is a
promising one.
Finally, recent additions to the graphics API that allow more ﬂexible use of buﬀers,
and proposed changes in future WGF-supporting hardware, will allow data to enter
and leave the pipeline at locations in the pipeline that are not the beginning or the end.
In the newest hardware, the output of a pass can be used as geometry or texture, for

instance, and WGF allows an intermediate “stream out” of data from the pipeline into
memory before it reaches the rasterizer.
As for cache, that is difficult to predict. From a strictly graphics point of view, a writable
cache is not useful, and adding the ability to write to a cache significantly increases
the complexity of its design. We have heard no rumors of a writable cache in future
graphics hardware at this time; the only real argument for it is its usefulness in general-
purpose computation. One possible direction for the GPU manufacturers is to make
the shared on-chip level- cache writable while maintaining read-only level- caches
at each fragment processor. Users who are primarily interested in FFT-like signal
processing applications that have potential gains from adding writability to the
cache would be advised to make that known to GPU vendors.
Other WGF Changes WGF-compliant hardware will add a third programmable unit to the
pipeline, the “geometry shader”. It will fall between the vertex shader and the raster-
izer in the pipeline, will work on entire triangles at the same time, and (unlike the
other programmable units) will be able to produce output more ﬂexibly than the frag-
ment and vertex shaders’ one-in, one-out model (the geometry shader will be able to
produce – outputs for each input).
Microsoft Windows is the most popular platform for GPGPU application development,
primarily because Windows’ ubiquity has led GPU vendors to make its drivers
the most stable and fully featured. One current concern about WGF is Microsoft’s push
toward layering the OpenGL graphics API on top of it. Graphics researchers in general
prefer using OpenGL for several reasons. Perhaps the most important is that OpenGL
is cross-platform and works under other platforms such as Linux (which is common
for cluster research) and the Apple Macintosh. Though this issue is far from resolved,
graphics researchers fear that layering OpenGL atop WGF will reduce its performance
in an unacceptable way.
Opinions These are my personal thoughts and should be interpreted in that light. I believe
we will see improvements in several areas over the near to intermediate future. Or-
thogonality is an important goal for the GPU community: the same instruction sets
work across the programmable units, each supported feature works properly with each
datatype, and tools will become more fully featured and interoperable. Stability is another
goal of the GPU vendors (and a goal of WGF): currently new drivers or new tool
revisions have signiﬁcant and undesirable impacts on performance and functionality.
This has improved recently and must continue to improve. Finally, I hope that we will
see new and varied interfaces, APIs, libraries, and abstractions emerge as graphics and
general-purpose computation continue to converge.
However, there are aspects of the GPU that will likely not change in the near to inter-
mediate future. One issue for GPGPU programmers is the constant churn of new fea-
tures in graphics hardware and its resulting impact on writing the most eﬃcient code.
Though this makes programming diﬃcult, GPU vendors will continue to introduce

new features for business reasons, as it keeps them ahead of possible competition. The
vendors’ concentration on entertainment applications will continue as it is by far the
largest market for them; the important task for the GPGPU community is to identify
a “killer app” that will make an impact on GPU sales. Such a killer app may promote
features like -bit ﬂoating-point that would not otherwise be on the GPU horizon.
Finally, the concentration on parallelism for GPU performance, and the diﬃculty of
programming parallel hardware, will not change.
 About the PI
John Owens is Assistant Professor of Electrical and Computer Engineering at the University
of California, Davis, where he leads research groups in graphics hardware (with an emphasis
on general-purpose programmability on graphics processors, or GPGPU) and sensor networks.
He was awarded the Department of Energy Early Career Principal Investigator Award
in  and frequently gives presentations (such as an invited talk at the  High Perfor-
mance Embedded Computing Conference) and tutorials (such as at IEEE Visualization in
 and ) on GPGPU. John was also the author of the recent comprehensive survey of
GPGPU [OLG+] published as a Eurographics State of the Art Report in September .
John earned his Ph.D. in electrical engineering in  from Stanford University, where
he was an architect of the Imagine Stream Processor and was responsible for major portions
of its architecture, programming system, and applications. He earned his B.S. in electrical
engineering and computer sciences from the University of California, Berkeley, in . He
can be contacted at the Department of Electrical and Computer Engineering, One Shields
Avenue, Davis, CA  USA; at jowens@ece.ucdavis.edu; or at +  -.
Acknowledgements
We gratefully acknowledge the ﬁnancial support of Lockheed-Martin and the Department
of Defense that made this work possible.
Bibliography
[BFGS] Jeﬀ Bolz, Ian Farmer, EitanGrinspun, and Peter Schröder. Sparsematrix solvers
on the GPU: Conjugate gradients and multigrid. ACM Transactions on Graph-
ics, ():–, July .
[BFHa] Ian Buck, Kayvon Fatahalian, and Pat Hanrahan. GPUBench: Evaluating GPU
performance for numerical and scientiﬁc applications. In  ACM Work-
shop on General-Purpose Computing on Graphics Processors, pages C–, Au-
gust .

[BFH+b] Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike
Houston, and Pat Hanrahan. Brook for GPUs: Stream computing on graphics
hardware. ACM Transactions on Graphics, ():–, August .
[Buca] Ian Buck. GPGPU:General-purpose computation on graphics hardware—high
level languages for GPUs. ACM SIGGRAPH Course Notes, August .
[Bucb] Ian Buck. Taking the plunge into GPU computing. In Matt Pharr, editor, GPU
Gems , chapter , pages –. Addison Wesley, March .
[CT] JamesW. Cooley and JohnW. Tukey. An algorithm for the machine calculation
of complex Fourier series. Mathematics of Computation, :–, .
[EWN] Magnus Ekman, FredrikWarg, and JimNilsson. An in-depth look at computer
performance growth. ACMSIGARCHComputer Architecture News, ():–
, March .
[FFTa] FFTW. single-precision complex, d transforms [AMDAthlon XP . GHz].
http://www.fftw.org/speed/athlonXP-MHz/amd.d.scxx.p.png, .
[FFTb] FFTW. single-precision complex, d transforms [Pentium  Xeon . GHz].
http://www.fftw.org/speed/p-.GHz-new/vce-new.d.scxx.p.png, .
[FJ] Matteo Frigo and Steven G. Johnson. FFTW: An adaptive software architecture
for the FFT. In Proceedings of the  International Conference on Acoustics,
Speech, and Signal Processing, volume , pages –, May .
[Har] Mark Harris. Mapping computational concepts to GPUs. InMatt Pharr, editor,
GPU Gems , chapter , pages –. Addison Wesley, March .
[HHH] Daniel Reiter Horn, Mike Houston, and Pat Hanrahan. ClawHMMer:
A streaming HMMer-search implementation. In Proceedings of the 
ACM/IEEE Conference on Supercomputing, November .
[HL] Karl E. Hillesland and Anselmo Lastra. GPU ﬂoating-point paranoia. In Pro-
ceedings of the ACMWorkshop on General Purpose Computing on Graphics Pro-
cessors, pages C–, August .
[Hora] Daniel Horn. Fast Fourier transforms. Unpublished, .
[Horb] Daniel Horn. libgpufft. http://sourceforge.net/projects/gpufft/, .
[JvHK] Thomas Jansen, Bartosz von Rymon-Lipinski, Nils Hanssen, and Erwin Keeve.
Fourier volume rendering on theGPUusing a Split-Stream-FFT. InProceedings
of Vision, Modeling, and Visualization, pages –, November .

[KF] Emmett Kilgariﬀ and Randima Fernando. The GeForce  series GPU architec-
ture. In Matt Pharr, editor, GPU Gems , chapter , pages –. Addison
Wesley, March .
[MA] Kenneth Moreland and Edward Angel. The FFT on a GPU. In Graphics
Hardware , pages –, July . http://www.cs.unm.edu/∼kmorel/
documents/fftgpu/.
[NVI] NVIDIA Developer Relations. NVIDIA GeForce  GPUs speciﬁcations.
http://www.nvidia.com/page/specs_gf.html, August .
[OLG+] John D. Owens, David Luebke, Naga Govindaraju, Mark Harris, Jens Krüger,
Aaron E. Lefohn, and Tim Purcell. A survey of general-purpose computation
on graphics hardware. In Eurographics , State of the Art Reports, pages
–, September .
[Owe] John Owens. Streaming architectures and technology trends. In Matt Pharr,
editor, GPU Gems , chapter , pages –. Addison Wesley, March .
[Sem] Semiconductor Industry Association. International technology roadmap for
semiconductors. http://public.itrs.net/, .
[SL] Thilaka Sumanaweera and Donald Liu. Medical image reconstruction with the
FFT. In Matt Pharr, editor, GPU Gems , chapter , pages –. Addison
Wesley, March .
Revision History
•  September : Submitted to Lockheed-Martin.
•  October : MFLOPS graphs added.
•  October : Integrated Daniel Horn comments; ﬁled as technical report.

