Optimizing Parallel Reduction In Cuda To Reach GPU Peak Performance by Mahardito, Adityo et al.
Optimizing Parallel Reduction in CUDA to Reach GPU Peak Performance
Adityo Mahardito (1), Adang Suhendra (2), Deni Tri Hasta (3)
(1) Gunadarma University (adit@student.gunadarma.ac.id)
(2) Gunadarma University (adang@staff.gunadarma.ac.id)
(3) Gunadarma University (deni@student.gunadarma.ac.id)
Abstract
GPUs are massively multithreaded many-core chips with hundreds of cores, thousands of
concurrent threads, high performance, and high memory bandwidth. Nowadays, GPUs are used
not only for graphics processing but also to accelerate numerically intensive
high-performance computing applications in parallel. This thesis aims primarily to
demonstrate programming-model approaches that can maximize the performance of GPUs. This
is accomplished by showing how close a parallel computation can come to the maximum
memory bandwidth and what speedup the GPU delivers. The programming environment used is
NVIDIA's CUDA, a parallel architecture and programming model for general-purpose
computing on a GPU.
1 Introduction
Currently, the demand for hardware resources for processing data on computers is
increasing, as shown by the continuously rising minimum requirements of software
applications. The advent of multi-core CPUs and many-core GPUs means that mainstream
processor chips are now parallel systems. Many applications that process large data sets
can use a data-parallel programming model to speed up their computations. Applications
in many fields are accelerated by data-parallel processing, from general image rendering
and processing, signal processing, and physics simulation to computational finance and
computational biology. For data-parallel processing, the GPU can be a better choice than
the CPU, because the two differ significantly in their ability to process large data
sets in parallel. GPUs are very efficient not only at manipulating computer graphics;
their highly parallel structure also makes them more effective than general-purpose CPUs
for a range of complex algorithms [2].
The GPU is specialized for compute-intensive, highly parallel computation and is
therefore designed such that more transistors are devoted to data processing rather than
data caching and flow control. Many companies have produced GPUs under a number of brand
names, including Intel, NVIDIA, AMD/ATI, S3 Graphics, VIA Technologies, and Matrox, but
NVIDIA and ATI control nearly 100% of the market. Recently, NVIDIA began releasing cards
supporting an API extension to the C programming language, CUDA ("Compute Unified Device
Architecture"), which allows specified functions from a normal C program to run on the
GPU's stream processors [8].
The performance offered by the GPU in fact depends on the approach used when writing a
CUDA program, even when the same algorithm is used. In parallel programming, especially
in CUDA, the things that must be considered to obtain optimal results are memory usage
and instruction usage. The main focus of this thesis is the optimization of parallel
programs running on the GPU in the CUDA environment, using the parallel reduction
algorithm. The parallel reduction implemented here is a summation algorithm that
computes the sum of large arrays of values.
2 Literature Review
2.1 Graphics Processing Unit
A graphics processing unit, or GPU, is a processor attached to a graphics card and
dedicated to calculating floating-point operations. The GPU has evolved into a highly
parallel, multithreaded, many-core processor with tremendous computational horsepower
and very high memory bandwidth.
The reason behind the discrepancy in floating-point capability between the CPU and the
GPU is that the GPU is specialized for compute-intensive, highly parallel computation
and is therefore designed such that more transistors are devoted to data processing
rather than data caching and flow control.
This difference makes the CPU and the GPU suited to different kinds of problems. In a
simplified way, one can say that a CPU performs best when a few small pieces of data are
processed in a complex but sequential way. This lets the CPU utilize the many
transistors used for caching, branch prediction, and instruction-level parallelism. The
GPU, on the other hand, needs massively data-parallel problems to work efficiently. The
programming model most commonly used when programming a GPU is based on the stream
programming model. In the stream programming model, the input to and output from a
computation come in the form of streams. A stream is a collection of homogeneous data
elements on which some operation, called a kernel, is to be performed, and the operation
on one element is independent of the other elements in the stream.
Another important difference between a general-purpose processor and a typical GPU is
the memory bandwidth. Because of simpler memory models and no requirements from legacy
operating systems, the GPU can support more than 100 GB/s of memory bandwidth, while the
bandwidth of general-purpose processors is around 20 GB/s [7].
2.1.1 GPU Architecture
The starting point is the streaming processor (SP, see Figure 1). It is a fully
pipelined, single-issue, in-order microprocessor complete with two ALUs and an FPU. An
SP has no cache, so it is not particularly good at anything other than cranking through
large numbers of mathematical operations. Since an SP spends most of its time working on
pixel or vertex data, the lack of a cache does not really matter: graphics rendering is
a highly parallelizable task, so combining enough of these cores yields a productive
processor [3].
The GPU is a many-core processor containing an array of Streaming Multiprocessors (SMs).
An SM consists of 8 Streaming Processors (SPs) along with two more processors called
Special Function Units (SFUs). Each SFU has four FP multiply units, which are used for
transcendental operations (e.g. sine, cosine) and for interpolation, the latter being
used in calculations such as anisotropic texture filtering. NVIDIA does not describe the
SFU architecture in detail; here each SFU is assumed to also be a fully pipelined,
single-issue, in-order microprocessor. An MT issue unit dispatches instructions to all
of the SPs and SFUs in the group.
Figure 1: Streaming Processor (SP)
In addition to the processor cores, an SM contains a very small instruction cache, a
read-only data cache, and a 16 KB read/write shared memory. These cache sizes are kept
purposefully small because, unlike on a conventional desktop microprocessor, the
datasets to be cached here are small. Each SP ends up working on an individual pixel
and, despite the move to 32-bit floating-point values, there is only so much data
associated with a single pixel. The 16 KB memory is akin to the Cell processor's local
stores in that it is not a cache but a software-managed data store, so its latency is
always predictable. With this many cores in a single SM, control and predictability are
very important to making the whole thing work efficiently.
The next level is the Texture Processor Cluster (TPC), shown in Figure 2. NVIDIA
purposefully designed its GPU architecture to be modular, so a single TPC can be made up
of any number of SMs. In the GT200 architecture a TPC is made up of three SMs; the
number can differ in other series [4].
Figure 2: Texture Processor Cluster
The components of the TPC, however, have not changed: a TPC is made up of SMs, some
control logic, and a texture block. Since an SM has 8 SPs and 2 SFUs, this brings the
total to 24 SPs and 6 SFUs per cluster in GT200. The texture block includes texture
addressing and filtering logic as well as an L1 texture cache.
The modular theme continues with the Streaming Processor Array (SPA), which is composed
of a number of TPCs; see Figure 3 [5].
Figure 3: Streaming Processor Array
Similarly, the processed data elements need to be stored back in global memory before
they are copied back to host memory. Effectively hiding global memory access latency and
managing the memory hierarchy is crucial for obtaining maximal performance from the GPU.
Each SM manages the creation, execution, synchronization, and destruction of concurrent
threads in hardware, with zero scheduling overhead, and this is one of the key factors
in achieving very high execution throughput. Each parallel thread is mapped to an SP for
execution, and each thread maintains its own register state. All the SPs in an SM
execute their threads in lock-step, according to the order of instructions issued by the
per-SM instruction unit. The SM creates and manages threads in groups of 32, and each
such group is called a warp [6].
A warp is the smallest unit of scheduling within each SM. The GPU achieves efficiency by
splitting its workload into multiple warps and multiplexing many warps onto the same SM.
2.2 CUDA Overview
CUDA (Compute Unified Device Architecture) is a general-purpose hardware interface
designed to let programmers use NVIDIA graphics hardware for purposes other than
graphics in a more familiar way. The hardware need not be a graphics card at all, as
there are cards designed specifically for general-purpose calculations. CUDA defines a
programming model and a memory model that are consistent across all CUDA devices. The
programming model describes how parallel code is written, launched, and executed on a
device and how threads are grouped into blocks. The memory model defines the different
types of memory that are available to a CUDA program.
2.3 CUDA Programming Model
NVIDIA's Compute Unified Device Architecture (CUDA) is a programming model and
hardware/software environment for GPGPU, which consists of the following components:
• An extension to the C programming language that allows programmers to define GPGPU
functions, called kernels, which are executed by multiple threads on the GPU
• A compiler and related tools for translating CUDA source code to GPU device-specific
binary
• A software stack consisting of GPGPU application libraries, CUDA runtime libraries,
and the CUDA device driver
• A CUDA-enabled GPU device
2.3.1 CUDA Kernels
The CUDA programming language provides a means for programmers to express a problem that
exhibits significant data parallelism as a CUDA kernel. A CUDA kernel is a function that
is executed in Single Program Multiple Data (SPMD) fashion on a large set of data
elements.
Kernel code is written at the thread level, with access to built-in variables that
identify the executing thread. A kernel is defined using the __global__ declaration
specifier, and the number of CUDA threads for each call is specified using a new
<<<...>>> syntax.
2.3.2 CUDA Memory Hierarchy
CUDA threads may access data from multiple memory spaces during their execution, as
illustrated by Figure 4.
Each thread has a private local memory. Each thread block has a shared memory that is
visible to all threads of the block and has the same lifetime as the block.
2.3.3 Thread Hierarchy
The CUDA programming model organizes threads into a three-level hierarchy, as shown in
Figure 5. At the highest level of the hierarchy is the grid. A grid is a 2D array of
thread blocks, and thread blocks are in turn 3D arrays of threads.
The sizes of the grid and of the thread blocks are determined by the programmer
according to the size of the problem being solved, and are communicated to the driver at
kernel launch time. Each thread block in a grid has its own unique identifier, and each
thread has a unique identifier within its block. Using a combination of block ID and
thread ID, it is possible to distinguish each individual thread running on the entire
device. Only a single grid of thread blocks can be launched on the GPU at once, and the
hardware limits on the number of thread blocks and threads vary across different GPU
architectures.
Figure 4: Memory Access
Figure 5: Hierarchy of threads in the CUDA Programming Model
A kernel is executed by a grid, which contains blocks. These blocks contain the threads.
A thread block is a batch of threads that can cooperate by sharing data through shared
memory and synchronizing their execution. Threads from different blocks operate
independently.
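The combination of block ID and thread ID described above can be sketched as follows. This is a minimal illustration, not code from the paper: the names `globalThreadId`, `writeIds`, and `launchWriteIds` are hypothetical, and the grid is assumed to be 2D and the block 3D, as in the text. It also shows a `__global__` kernel definition and a `<<<...>>>` launch.

```cuda
#include <cuda_runtime.h>

// Flatten (block ID, thread ID) into one index that is unique across the grid.
// Assumes a 2D grid of 3D blocks, as described in the text.
__device__ unsigned int globalThreadId()
{
    unsigned int blockId         = blockIdx.y * gridDim.x + blockIdx.x;
    unsigned int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    unsigned int threadInBlock   = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x
                                   + threadIdx.x;
    return blockId * threadsPerBlock + threadInBlock;
}

// Kernel: every thread records its own global ID.
__global__ void writeIds(unsigned int* out)
{
    unsigned int id = globalThreadId();
    out[id] = id;
}

// Host-side launch: the grid and block sizes are chosen by the programmer and
// passed at launch time. d_out must hold one element per launched thread.
void launchWriteIds(unsigned int* d_out, dim3 grid, dim3 block)
{
    writeIds<<<grid, block>>>(d_out);
}
```

For example, `launchWriteIds(d_out, dim3(2, 3), dim3(4, 2, 2))` starts 6 blocks of 16 threads each, so `d_out` needs room for 96 values.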
2.4 CUDA Optimization Strategy
The CUDA performance optimization strategy offers 4 approaches, i.e.:
• Instruction performance
• Number of threads per block
• Data transfer between host (CPU) and device (GPU)
• Texture fetch versus global or constant memory read
In this case, the strategy used is the instruction performance approach.
2.4.1 Instruction Performance
To process an instruction for a warp of threads, a multiprocessor must read the
instruction operands for each thread of the warp, execute the instruction, then write
the result for each thread of the warp.
Therefore, the effective instruction throughput depends on the nominal instruction
throughput as well as on the memory latency and bandwidth. It is maximized by minimizing
the use of instructions with low throughput, then maximizing the use of the available
memory bandwidth for each category of memory.
2.4.2 Instruction Throughput
• Arithmetic Instructions
To issue one instruction for a warp, a multiprocessor takes 4 clock cycles for
single-precision floating-point add, multiply, and multiply-add, and for integer add,
bitwise operations, compare, min, max, and type conversion instructions. Integer
division and modulo operations are particularly costly and should be avoided if possible
or replaced with bitwise operations whenever possible. If n is a power of 2, (i / n) is
equivalent to (i >> log2(n)) and (i % n) is equivalent to (i & (n − 1)); the compiler
performs these conversions if n is a literal.
• Control Flow Instructions
Any flow control instruction (if, switch, do, for, while) can significantly impact the
effective instruction throughput by causing threads of the same warp to diverge, that
is, to follow different execution paths. If this happens, the different execution paths
have to be serialized, increasing the total number of instructions executed for this
warp. When all the different execution paths have completed, the threads converge back
to the same execution path. To obtain the best performance in cases where the control
flow depends on the thread ID, the controlling condition should be written so as to
minimize the number of divergent warps.
• Memory Instructions
Memory instructions include any instruction that reads from or writes to shared, local,
or global memory. A multiprocessor takes 4 clock cycles to issue one memory instruction
for a warp. When accessing local or global memory, there are, in addition, 400 to 600
clock cycles of memory latency. Much of this global memory latency can be hidden by the
thread scheduler if there are sufficient independent arithmetic instructions that can be
issued while waiting for the global memory access to complete.
• Synchronization
The __syncthreads() instruction takes 4 clock cycles to issue for a warp if no thread
has to wait for any other threads.
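Two of the points above can be sketched in code: the power-of-two replacements for division and modulo, and a branch condition that is uniform within each warp. This is an illustrative sketch; the names `divPow2`, `modPow2`, and `branchPerWarp` are hypothetical, and the kernel assumes that one thread is launched per array element.

```cuda
#include <cuda_runtime.h>

// i / n for n = 2^log2n, written as a shift.
__host__ __device__ inline unsigned int divPow2(unsigned int i, unsigned int log2n)
{
    return i >> log2n;
}

// i % n for a power-of-two n, written as a bit mask.
__host__ __device__ inline unsigned int modPow2(unsigned int i, unsigned int n)
{
    return i & (n - 1u);
}

// The condition depends only on the warp index (threadIdx.x / warpSize), so all
// 32 threads of a warp take the same path and no warp diverges. A condition
// such as (threadIdx.x % 2 == 0) would instead split every warp in two.
__global__ void branchPerWarp(float* data)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / warpSize) % 2 == 0) {
        data[i] *= 2.0f;
    } else {
        data[i] += 1.0f;
    }
}
```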
2.4.3 Memory Bandwidth
The effective bandwidth of each memory space depends significantly on the memory access
pattern. Since device memory has much higher latency and lower bandwidth than on-chip
memory, device memory accesses should be minimized. A typical programming pattern is to
stage data coming from device memory into shared memory; in other words, to have each
thread of a block:
1. Load data from device memory to shared memory
2. Synchronize with all the other threads of the block so that each thread can safely
read shared memory locations that were written by different threads
3. Process the data in shared memory
4. Synchronize again if necessary to make sure that shared memory has been updated with
the results
5. Write the results back to device memory
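The five steps above can be sketched with a small kernel that reverses the elements handled by each block. The kernel name `reverseWithinBlock` is hypothetical, and the sketch assumes that n is a multiple of the block size and that the launch passes `blockDim.x * sizeof(float)` bytes of dynamic shared memory.

```cuda
#include <cuda_runtime.h>

__global__ void reverseWithinBlock(const float* in, float* out, unsigned int n)
{
    extern __shared__ float tile[];              // one element per thread of the block
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;

    // 1. Load data from device memory to shared memory.
    tile[tid] = (i < n) ? in[i] : 0.0f;

    // 2. Synchronize so that locations written by other threads can be read safely.
    __syncthreads();

    // 3. Process the data in shared memory: read the mirrored position,
    //    which was written by a different thread.
    float v = tile[blockDim.x - 1 - tid];

    // 4. No second synchronization is needed here, because shared memory is
    //    not modified after step 2.

    // 5. Write the results back to device memory.
    if (i < n) out[i] = v;
}
```

A launch of the form `reverseWithinBlock<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n)` supplies the shared memory size as the third launch parameter.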
2.5 Parallel Reduction
Parallel reduction is one of the Parallel Random Access Machine (PRAM) algorithms. A
PRAM is a shared-memory abstract machine used by designers of parallel algorithms to
estimate algorithm performance (such as time complexity). A parallel reduction is thus a
PRAM process that manipulates data stored in the memory of global registers. Several
operations can be used in a parallel reduction, e.g. addition, subtraction, and
multiplication. Parallel reduction can be described with a binary tree: a group of n
values is combined in a logarithmic number of parallel addition steps.
2.5.1 Kahan Summation Algorithm
In numerical analysis, the Kahan summation algorithm (see Algorithm 1), also known as
compensated summation, significantly reduces the numerical error in the total obtained
by adding a sequence of finite-precision floating-point numbers, compared to the obvious
approach. This is done by keeping a separate running compensation (a variable to
accumulate small errors).
In particular, simply summing n numbers in sequence has a worst-case error that grows
proportionally to n, and a root mean square error that grows as $\sqrt{n}$ for random
inputs (the roundoff errors form a random walk). With compensated summation, the
worst-case error bound is independent of n, so a large number of values can be summed
with an error that depends only on the floating-point precision.
Algorithm 1 Kahan Summation Algorithm
function kahanSum(input)
    var sum ← input[1]
    var c ← 0.0        // a running compensation for lost low-order bits
    for i = 2 to input.length
        y ← input[i] - c
        t ← sum + y
        c ← (t - sum) - y
        sum ← t
    next i
    return sum
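Algorithm 1 might be written in CUDA C roughly as follows. This is a sketch, not the authors' code: the function is marked `__host__ __device__` so that it can be called on either side, it uses zero-based indexing where Algorithm 1 is one-based, and it assumes the compiler is not allowed to reassociate floating-point arithmetic (for example, no fast-math options), since that would cancel the compensation term.

```cuda
#include <cuda_runtime.h>

// Compensated (Kahan) summation of n single-precision values.
__host__ __device__ inline float kahanSum(const float* input, int n)
{
    if (n <= 0) return 0.0f;
    float sum = input[0];
    float c   = 0.0f;                 // running compensation for lost low-order bits
    for (int i = 1; i < n; ++i) {
        float y = input[i] - c;       // apply the correction carried from the last step
        float t = sum + y;            // low-order bits of y may be lost here
        c = (t - sum) - y;            // recover what was lost (with opposite sign)
        sum = t;
    }
    return sum;
}
```

A case where the compensation matters is adding many values that are individually too small to change a single-precision running total, such as 1.0f followed by a long run of 1e-8f terms.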
2.5.2 Parallel Reduction Complexity and Cost
In parallel processing, there are three kinds of complexity, i.e. step complexity, work
complexity, and time complexity.
• Step complexity: there are $\log_2 N$ parallel steps, and step $S$ performs $N/2^S$
independent operations, so the step complexity is $O(\log N)$.
• Work complexity: for $N = 2^D$, the reduction performs
$\sum_{S=1}^{D} 2^{D-S} = N - 1$ operations, so the work complexity is $O(N)$.
• Time complexity: with $P$ threads physically in parallel ($P$ processors), the time
complexity is $O(N/P + \log N)$.
3 Methodology
3.1 General Scenario
A reduction is the process of combining the elements of a vector or array to yield a
single aggregate element. It is commonly used in scientific computations in parallel
processing. This research therefore implements the method in order to find a way of
optimizing the parallel device. A reduction can use several operations, e.g. addition,
multiplication, subtraction, and division. The authors chose addition on the grounds
that it does not produce elements that are negative, fractional, or too large. The
reduction is then implemented to process the elements of a large array on the GPU
device. The test case is the sum of the elements of an incrementing array in the CUDA
environment; the CUDA programming model is then explored to obtain several different
implementations, in an attempt to find an efficient and scalable one.
3.2 GPU Device
In this research, the GPU device to be optimized is the GeForce GT 240. This is a
desktop-class device, commonly used to play games on a PC. It has 96 CUDA cores for
parallel processing and a memory bandwidth of 54.4 GB/s. The goals on this GPU are
maximum memory bandwidth and minimum processing time.
3.3 Research Method
The research method consists of the following steps:
1. Start from a recent example program from Ian Buck's research [1]
2. Find a weakness or problem in the program
3. Try to solve the problem, then upgrade to a better program
4. Test and compare all of the programs
3.4 Program Implementation
3.4.1 Interleaved Addressing with Divergent Branching by Ian Buck
The program, from here on called a kernel, performs a reduction in which each thread
operates on two elements loaded from shared memory. It implements a sum reduction in the
EREW PRAM model, with an indexing method that uses a modulo operation in every
iteration.
This method is simple but has a weakness: the thread indexing is wasteful. Because the
active threads are interleaved with idle ones, the warps are highly divergent, which
makes the process inefficient.
Following the performance strategy, the program can be revised using the
arithmetic-instruction and control-flow approaches: the modulo-based test if (tid % ...)
is replaced in order to minimize the number of divergent branches and warps. It can be
replaced with the control flow if (index < blockDim.x), where int index = 2 * s * tid.
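A kernel of the kind described in this section might look like the following. This is a sketch under stated assumptions rather than the program used in the paper: the name `reduceInterleavedDivergent` is hypothetical, the modulo test is assumed to be `tid % (2 * s) == 0`, the block size is assumed to be a power of two, and each block writes one partial sum that still has to be added up afterwards.

```cuda
#include <cuda_runtime.h>

__global__ void reduceInterleavedDivergent(const float* g_in, float* g_out, unsigned int n)
{
    extern __shared__ float sdata[];             // blockDim.x floats, set at launch
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;

    // Each thread loads one element from global to shared memory.
    sdata[tid] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    // Interleaved addressing: in each step only every (2 * s)-th thread is
    // active, so active and idle threads share warps and the warps diverge.
    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    // Thread 0 writes this block's partial sum.
    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```

It would be launched as `reduceInterleavedDivergent<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n)`, with `d_out` holding one float per block.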
3.4.2 Reduction kernel with bank conflict method
This method applies the solution to interleaved addressing with divergent branching: the
thread indexing is simply replaced with a strided index and a non-divergent branch. The
thread ID indexing is thus no longer tied to the order of the element values in shared
memory.
With this method, the active thread indices are contiguous, with no gaps between them,
so no threads are wasted.
The new problem, however, is shared memory bank conflicts. To achieve high memory
bandwidth, shared memory is divided into equally sized memory modules, called banks,
which can be accessed simultaneously. Any memory read or write request made of n
addresses that fall in n distinct memory banks can therefore be serviced simultaneously,
yielding an effective bandwidth that is n times as high as the bandwidth of a single
module. However, if two addresses of a memory request fall in the same memory bank,
there is a bank conflict and the access has to be serialized. The hardware splits a
memory request with bank conflicts into as many separate conflict-free requests as
necessary, decreasing the effective bandwidth by a factor equal to the number of
separate memory requests. If the number of separate memory requests is n, the initial
memory request is said to cause n-way bank conflicts. To get maximum performance, it is
therefore important to understand how memory addresses map to memory banks in order to
schedule the memory requests so as to minimize bank conflicts.
For all threads of a warp (32 threads), accessing shared memory is as fast as accessing
a register as long as there are no bank conflicts between the threads. The solution is
therefore to use sequential addressing for the thread ID indexing.
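The strided-index variant described in this section might be sketched as follows. The kernel name `reduceStridedIndex` is hypothetical, the block size is assumed to be a power of two, and each block writes one partial sum. The stride between the shared memory addresses touched by neighbouring threads is 2 * s elements, which is what leads to the bank conflicts discussed above.

```cuda
#include <cuda_runtime.h>

__global__ void reduceStridedIndex(const float* g_in, float* g_out, unsigned int n)
{
    extern __shared__ float sdata[];             // blockDim.x floats, set at launch
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        // Consecutive thread IDs are active, so whole warps go idle together
        // instead of diverging internally.
        unsigned int index = 2 * s * tid;
        if (index < blockDim.x) {
            sdata[index] += sdata[index + s];
        }
        __syncthreads();
    }

    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```

The launch has the form `reduceStridedIndex<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n)`.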
3.4.3 Reduction kernel with sequential addressing
The sequential addressing method avoids the bank conflict problem. In this method the
thread ID indexing is sequential, so the threads no longer contend for the same memory
bank; each thread accesses its own consecutive shared memory address.
In the source code, the only thing that must be replaced is the strided indexing in the
inner loop, which becomes a reversed loop with thread-ID-based indexing.
This approach still has a problem: half of the threads are idle in the first loop
iteration, which is wasteful and affects performance. The solution is to perform part of
the reduction while loading elements from global memory into shared memory.
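The reversed loop with thread-ID-based indexing might be sketched as follows. The kernel name `reduceSequential` is hypothetical, the block size is assumed to be a power of two, and each block writes one partial sum.

```cuda
#include <cuda_runtime.h>

__global__ void reduceSequential(const float* g_in, float* g_out, unsigned int n)
{
    extern __shared__ float sdata[];             // blockDim.x floats, set at launch
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * blockDim.x + tid;

    sdata[tid] = (i < n) ? g_in[i] : 0.0f;
    __syncthreads();

    // Reversed loop: the stride starts at half the block size and halves each
    // step. Thread tid reads sdata[tid] and sdata[tid + s], so neighbouring
    // threads touch neighbouring addresses.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```

In the first iteration only threads with `tid < blockDim.x / 2` do any work, which is the idle-thread problem noted in the text.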
3.4.4 Reduction kernel with first add during load
The idle threads of the sequential addressing method can be avoided by performing part
of the reduction while loading elements from global memory into shared memory; this is
called the first add during load method.
In this method, while elements are being loaded from global memory, the elements already
loaded are reduced before the result is finally written to shared memory. The elements
processed in the kernel function are therefore already the result of a sum. The point is
to replace the single load with two loads and to perform the first add during the load.
This should be the best method for reaching GPU peak performance, but in fact the memory
bandwidth achieved is still far from the bandwidth bound. This may be caused by an
instruction bottleneck or by instruction latency. An instruction's latency is the number
of clock cycles it takes for the instruction to pass through the pipeline. Because the
reduction has low arithmetic intensity, the strategy that can be used to avoid this
condition is to unroll loops.
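Replacing the single load with two loads and an add might be sketched as follows. The kernel name `reduceFirstAddDuringLoad` is hypothetical and the block size is assumed to be a power of two. Each block now covers 2 * blockDim.x input elements, so only half as many blocks are needed for the same input.

```cuda
#include <cuda_runtime.h>

__global__ void reduceFirstAddDuringLoad(const float* g_in, float* g_out, unsigned int n)
{
    extern __shared__ float sdata[];             // blockDim.x floats, set at launch
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + tid;

    // Two loads and the first add of the reduction, before shared memory is written.
    float v = (i < n) ? g_in[i] : 0.0f;
    if (i + blockDim.x < n) v += g_in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Sequential addressing for the rest of the reduction.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            sdata[tid] += sdata[tid + s];
        }
        __syncthreads();
    }

    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```

For n elements and a given block size, the number of blocks is the smallest integer that is at least n / (2 * threads).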
3.4.5 Reduction kernel with unroll last warp
Sometimes the compiler may unroll loops or optimize out if or switch statements by using
branch predication instead. In these cases, no warp can ever diverge. The programmer can
also control loop unrolling.
As the reduction proceeds, the number of active threads decreases. In the GPU
architecture, instructions are SIMD-synchronous within a warp. A warp is a group of 32
threads that are executed physically in parallel. This means that when s <= 32 there is
no longer any need for __syncthreads(), nor for the if (tid < s) test, because it does
not save any work. The last 6 iterations of the inner loop can therefore be unrolled.
This saves useless work in all warps, not just the last one. Without unrolling, all
warps execute every iteration of the for loop and if statement.
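Unrolling the last 6 iterations might be sketched as follows. The names `warpStep` and `reduceUnrollLastWarp` are hypothetical, and the block size is assumed to be a power of two and at least 64. One deliberate difference from the text: the sketch keeps a `__syncwarp()` between the unrolled steps. On the hardware discussed here the threads of a warp ran in lock-step, but on newer NVIDIA architectures with independent thread scheduling that is not guaranteed, so the explicit warp-level barrier (available from CUDA 9) is added as an assumption for portability.

```cuda
#include <cuda_runtime.h>

// One unrolled reduction step inside a single warp.
__device__ __forceinline__ void warpStep(float* sdata, unsigned int tid, float& v,
                                         unsigned int offset)
{
    v += sdata[tid + offset];
    __syncwarp();                // all lanes have read before any lane writes
    sdata[tid] = v;
    __syncwarp();                // all lanes have written before the next read
}

__global__ void reduceUnrollLastWarp(const float* g_in, float* g_out, unsigned int n)
{
    extern __shared__ float sdata[];             // blockDim.x floats, set at launch
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockDim.x * 2) + tid;

    // First add during load, as in Section 3.4.4.
    float v = (i < n) ? g_in[i] : 0.0f;
    if (i + blockDim.x < n) v += g_in[i + blockDim.x];
    sdata[tid] = v;
    __syncthreads();

    // Block-wide loop only while more than one warp is active.
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Last 6 iterations (s = 32, 16, 8, 4, 2, 1), unrolled for the first warp.
    if (tid < 32) {
        v = sdata[tid];
        warpStep(sdata, tid, v, 32);
        warpStep(sdata, tid, v, 16);
        warpStep(sdata, tid, v, 8);
        warpStep(sdata, tid, v, 4);
        warpStep(sdata, tid, v, 2);
        warpStep(sdata, tid, v, 1);
    }

    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```

Lanes above the current offset compute values that are never used; only `sdata[0]` is read at the end.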
3.4.6 Reduction kernel with completely unrolled
If the number of iterations is known at compile time, the reduction can be completely
unrolled. Fortunately, the GPU limits the block size to 512 threads, and this work
sticks to power-of-2 block sizes. The loop can therefore easily be unrolled for a fixed
block size, replacing the iterative for loop with completely unrolled code.
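One way to make the block size known at compile time is a template parameter, as in the following sketch. The names `warpStep` and `reduceFullyUnrolled` are hypothetical; the block size is assumed to be a power of two between 64 and 512 and must match the number of threads used at launch. As an assumption for newer GPUs, `__syncwarp()` (CUDA 9 and later) is kept between the steps inside the last warp rather than relying on implicit lock-step execution.

```cuda
#include <cuda_runtime.h>

// One unrolled reduction step inside a single warp.
__device__ __forceinline__ void warpStep(float* sdata, unsigned int tid, float& v,
                                         unsigned int offset)
{
    v += sdata[tid + offset];
    __syncwarp();
    sdata[tid] = v;
    __syncwarp();
}

template <unsigned int blockSize>
__global__ void reduceFullyUnrolled(const float* g_in, float* g_out, unsigned int n)
{
    static_assert(blockSize >= 64 && blockSize <= 512 &&
                  (blockSize & (blockSize - 1)) == 0,
                  "blockSize must be a power of two between 64 and 512");

    __shared__ float sdata[blockSize];           // size is known at compile time
    unsigned int tid = threadIdx.x;
    unsigned int i   = blockIdx.x * (blockSize * 2) + tid;

    float v = (i < n) ? g_in[i] : 0.0f;
    if (i + blockSize < n) v += g_in[i + blockSize];
    sdata[tid] = v;
    __syncthreads();

    // blockSize is a compile-time constant, so the compiler can drop the
    // steps that do not apply to the chosen block size.
    if (blockSize >= 512) { if (tid < 256) sdata[tid] += sdata[tid + 256]; __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) sdata[tid] += sdata[tid + 128]; __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64)  sdata[tid] += sdata[tid + 64];  __syncthreads(); }

    if (tid < 32) {
        v = sdata[tid];
        warpStep(sdata, tid, v, 32);
        warpStep(sdata, tid, v, 16);
        warpStep(sdata, tid, v, 8);
        warpStep(sdata, tid, v, 4);
        warpStep(sdata, tid, v, 2);
        warpStep(sdata, tid, v, 1);
    }

    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```

A launch names the block size twice, for example `reduceFullyUnrolled<256><<<blocks, 256>>>(d_in, d_out, n)`; host code that supports several block sizes would typically select the instantiation with a switch statement.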
3.4.7 Reduction kernel with multiple elements per thread
Given a total number of threads per grid, the number of threads per block, or
equivalently the number of blocks, should be chosen to maximize the utilization of the
available computing resources. This means that there should be at least as many blocks
as there are multiprocessors in the device.
Furthermore, running only one block per multiprocessor will force the multiprocessor to
idle during thread synchronization and also during device memory reads if there are not
enough threads per block to cover the load latency. It is therefore usually better to
allow two or more blocks to be active on each multiprocessor, to allow overlap between
blocks that wait and blocks that can run.
For this to happen, not only should there be at least twice as many blocks as there are
multiprocessors in the device, but the amount of shared memory allocated per block
should also be at most half the total amount of shared memory available per
multiprocessor.
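The section title names the technique without spelling it out, so the following is one possible reading rather than the authors' kernel: each thread sums several input elements in a loop that strides over the whole grid before the in-block reduction starts, which lets the number of blocks be chosen independently of the input size. The name `reduceMultiElement` is hypothetical, the block size is assumed to be a power of two, and the in-block part is kept as a plain sequential-addressing loop for brevity.

```cuda
#include <cuda_runtime.h>

__global__ void reduceMultiElement(const float* g_in, float* g_out, unsigned int n)
{
    extern __shared__ float sdata[];             // blockDim.x floats, set at launch
    unsigned int tid      = threadIdx.x;
    unsigned int i        = blockIdx.x * (blockDim.x * 2) + tid;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;   // elements covered per pass

    // Each thread accumulates as many elements as needed to cover the input.
    float v = 0.0f;
    while (i < n) {
        v += g_in[i];
        if (i + blockDim.x < n) v += g_in[i + blockDim.x];
        i += gridSize;
    }
    sdata[tid] = v;
    __syncthreads();

    // In-block tree reduction with sequential addressing.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_out[blockIdx.x] = sdata[0];
}
```

With this form the block count can follow the guidance above, for example a small multiple of the number of multiprocessors, while `d_out` still receives one partial sum per block.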
4 Result and Comparison
4.1 Parameter
Before the programs for each method are run, the parameters used in CUDA for all
methods are set as follows:
• Number of array elements: $2^{17}$ to $2^{25}$
• Number of threads per block: 512
• Number of blocks: 512
4.2 Result and Comparison
• Result for $2^{24}$ array elements
The results of running all of the methods, compared by memory bandwidth and elapsed
time, are shown in Figure 6. It describes the performance of each method when processing
$2^{24}$ array elements. The "Time" column is the processing time in milliseconds. The
"Bandwidth" column is the rate at which data can be read from or stored into memory;
memory bandwidth is usually expressed in bytes/second. The "Step Speedup" column is the
previous kernel's time divided by the current kernel's time, and the "Cumulative
Speedup" column is the time of kernel 1 (the divergent branch method) divided by the
current kernel's time.
Figure 6: Performance results for all methods
• Result for $2^{17}$ to $2^{25}$ array elements
Figure 7 shows the performance comparison by time as a graph. The figure shows that from
131,072 to 1,048,576 ($2^{17}$ to $2^{20}$) array elements, all of the methods are still
practical to use. From 2,097,152 to 16,777,216 ($2^{21}$ to $2^{24}$) array elements,
the Divergent Branching, Bank Conflict, and Sequential Addressing methods are no longer
practical, because they need more than 1.0 second. Furthermore, at $2^{25}$ elements
only four methods remain practical. Overall, the best method is Multiple Elements per
Thread, because it takes less time than the others.
Figure 7: Result Comparison by time
Figure 8 shows the performance comparison by bandwidth as a graph; the best method is
the one that reaches the highest bandwidth, i.e. comes closest to the GPU bandwidth
bound of 54.4 GB/s. The figure shows the performance from 131,072 to 33,554,432
($2^{17}$ to $2^{25}$) elements.
Figure 8: Result Comparison by bandwidth
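The time and bandwidth figures above could be obtained along the following lines. This is a sketch under stated assumptions, not the authors' measurement code: the helper names `timeMs`, `effectiveBandwidthGBs`, and `fractionOfPeak` are hypothetical, 1 GB is taken as 10^9 bytes, and the byte count is assumed to be the size of the input array, read once.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Time whatever the callable launches on the default stream, in milliseconds,
// using CUDA events.
template <typename Launch>
float timeMs(Launch launch)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);          // wait until the recorded work has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Effective bandwidth in GB/s for a kernel that moved `bytes` bytes in `milliseconds`.
inline double effectiveBandwidthGBs(std::size_t bytes, double milliseconds)
{
    return static_cast<double>(bytes) / (milliseconds * 1.0e-3) / 1.0e9;
}

// Fraction of the device's bandwidth bound that was reached.
inline double fractionOfPeak(double measuredGBs, double peakGBs)
{
    return measuredGBs / peakGBs;
}
```

For $2^{24}$ single-precision elements the input occupies 67,108,864 bytes, so a hypothetical run time of 1 ms would correspond to about 67.1 GB/s; the measured time is what goes into `effectiveBandwidthGBs`, with the kernel launch wrapped in a lambda passed to `timeMs`.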
5 Conclusions
5.1 Conclusion
According to the results, the GPU has come close to its peak performance. This is shown
by the memory bandwidth achieved (43.3 GB/s) compared with the memory bandwidth bound
(54.4 GB/s) of the GPU device (GeForce GT 240).
The way to optimize the GPU for the parallel reduction process is to keep the
multiprocessors on the device as busy as possible. A device in which work is poorly
balanced across the multiprocessors will deliver suboptimal performance. Hence, it is
important to design the application to use threads and blocks in a way that maximizes
hardware utilization and to limit practices that impede the free distribution of work.
Another important concept is the management of the system resources allocated for a
particular task.
The CUDA optimization strategy can be used to maximize hardware utilization. In the case
of the reduction process, it can address divergent branching and bank conflicts, and
apply memory coalescing with sequential addressing and latency hiding with loop
unrolling.
Many types of approach can be used to optimize performance on a GPU, but for CUDA in
particular there are two types of strategy, i.e. instruction throughput and memory
bandwidth.
5.2 Future Work
The authors wish to continue this research on GPU performance in future work by trying
several different things, i.e.:
• Using a different type of GPU device with a higher specification, to process larger
data
• Varying the thread and block parameters, to find the ideal portion of shared memory
usage
• Changing the processing algorithm, to break through the memory bandwidth boundary
References
[1] Ian Buck. Parallel programming with CUDA. NVIDIA, 2008.
[2] NVIDIA. NVIDIA CUDA Compute Unified Device Architecture. In NVIDIA CUDA Programming
Guide 2.0, volume 2, page 1. 2008.
[3] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. Page 2,
2008. http://www.anandtech.com/show/2549/2, accessed August 2010.
[4] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. 2008.
http://www.anandtech.com/show/2549/4, accessed April 2010.
[5] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. Page 4,
2008. http://www.anandtech.com/show/2549/4, accessed April 2010.
[6] Anand Lal Shimpi and Derek Wilson. NVIDIA's 1.4 billion transistor GPU. Page 2,
2008. http://www.anandtech.com/show/2549/2, accessed April 2010.
[7] Wikipedia. GPGPU. 2010. http://en.wikipedia.org/wiki/GPGPU.
[8] Wikipedia. Graphics processing unit. 2010.
http://en.wikipedia.org/wiki/Graphics_processing_unit, accessed August 2010.
