On the Robust Mapping of Dynamic Programming onto a Graphics Processing Unit by Xiao, Shucai et al.
On the Robust Mapping of Dynamic Programming
onto a Graphics Processing Unit
Shucai Xiao, Ashwin Aji, and Wu-chun Feng
Department of Computer Science
Center of High-End Computing Systems
Virginia Tech
{shucai, aaji, wfeng}@vt.edu
Abstract
Graphics processing units (GPUs) have been widely used
to accelerate algorithms that exhibit massive data paral-
lelism or task parallelism. When such parallelism is not
inherent in an algorithm, computational scientists resort to
simply replicating the algorithm on every multiprocessor of
a NVIDIA GPU, for example, to create such parallelism,
resulting in embarrassingly parallel ensemble runs that
deliver significant aggregate speed-up. However, the funda-
mental issue with such ensemble runs is that the problem
size to achieve this speed-up is limited to the available
shared memory and cache of a GPU multiprocessor.
An example of the above is dynamic programming (DP),
one of the Berkeley 13 dwarfs. All known DP implemen-
tations to date use the coarse-grained approach of embar-
rassingly parallel ensemble runs because a finer-grained
parallelization on the GPU would require extensive com-
munication between the multiprocessors of a GPU, which
could easily cripple performance as communication between
multiprocessors is not natively supported in a GPU.
Consequently, we address the above by proposing a
fine-grained parallelization of a single instance of the DP
algorithm that is mapped to the GPU. Our parallelization
incorporates a set of techniques aimed to substantially
improve GPU performance: matrix re-alignment, co-
alesced memory access, tiling, and GPU (rather than
CPU) synchronization. The specific DP algorithm that
we parallelize is called Smith-Waterman (SWat), which
is an optimal local-sequence alignment algorithm. We
use this SWat algorithm as a baseline to compare our
GPU implementation, i.e., CUDA-SWat, to our Cell
implementation, i.e., Cell-SWat.
1 Introduction
Today, gains in computational horsepower are no longer
driven by clock speeds. Instead, the gains are increasingly
achieved through parallelism, both in traditional x86
multi-core architectures as well as the many-core archi-
tectures of the graphics processing unit (GPU). Amongst
the most prominent many-core architectures are the
GPUs from NVIDIA and AMD/ATI, which can support
general-purpose computation on the GPU (GPGPU).
Thus, GPUs have evolved from their traditional roots of
graphics pipeline model into programmable devices that
are suited for accelerating scientific applications such as
sequence match and fast N-body simulation [10–12,14].
In general, the many-core GPU architecture only maps
well to data- or task-parallel applications that have
minimal to no inter-multiprocessor communication [2].
This is mainly due to the lack of explicit hardware or
software support for inter-thread communication between
different multiprocessors across the entire GPU chip.
The current (implicit) synchronization strategy for the
NVIDIA Compute Unified Device Architecture (CUDA)
platform [9] synchronizes via the host CPU and then
re-launches a new kernel from the CPU to GPU, which is
a costly operation. In addition, another accelerator-based
parallel computing platform, the Cell Broadband Engine
(Cell/BE) [3], is also used for general-purpose computa-
tion. It is a heterogeneous processor, where direct data
communication is supported among the Power Processing
Element (PPE) and the Synergistic Processing Elements
(SPEs) via the Element Interconnect Bus (EIB).
Dynamic programming (DP) is one of the Berkeley 13
dwarfs, where a dwarf is defined as an algorithmic method
that captures a pattern of computation and communica-
tion.1 In this paper, we investigate the mapping of the
Smith-Waterman (SWat) algorithm [13] — an example of
the dynamic programming algorithm — onto the GPU.
Specifically, we focus on a fine-grained parallelization of
the SWat algorithm, in which a single problem instance
is processed across all the multiprocessors of the GPU,
thus resulting in a more robust mapping of DP to the GPU
where virtually all sequence sizes within NIH’s GenBank
can be processed. This is in stark contrast to previous
coarse-grained parallelizations [6, 7], where multiple prob-
lem instances are simply replicated onto each multiproces-
sor of a NVIDIA GPU, thus creating an “embarrassingly
parallel ensemble run” that severely restricts the problem
size that can be solved, as shown conceptually in Fig-
ure 1(a). When the sequence sizes exceed 1024 in length,
i.e., a significant portion of the NIH GenBank database,
the sequences cannot be processed by this coarse-grained
parallelization, as shown in Figure 1(b). Our fine-grained
parallelization, however, can process all sequence sizes up
to 8461 in length, as shown conceptually in Figure 1(c). We
then use global memory to store the alignment information
due to the limited size of the shared memory and cache.
Due to the long latency-access times to the global
memory of the GPU, coupled with SWat being a memory-
1The dwarfs represent different algorithmic equivalence classes,
where membership in an equivalence class is defined by similarity in
computation and communication.
(a) Coarse-grained parallelization for small se-
quences (SM: Streaming Multiprocessor) [6,7,15].
(b) Coarse-grained parallelization for large sequences (SM:
Streaming Multiprocessor) [6, 7, 15].
(c) Our fine-grained parallelization for arbi-
trarily sized sequences (SM: Streaming Mul-
tiprocessor).
Figure 1. Coarse-Grained vs. Fine-Grained
Parallelization
bound application [5], we propose three techniques to
more effectively access the global memory of the GPU:
matrix re-alignment, coalesced memory access, and tiling.
In addition, to achieve a fine-grained parallelization of DP
on the GPU, data will need to be shared across different
blocks, and hence multiprocessors, of the GPU. Thus,
global synchronization across the GPU will be needed.
As such, we propose a GPU synchronization method
that does not involve the host CPU, thus eliminating
the overhead between switching back and forth between
the GPU and CPU. (Although the above techniques are
proposed for the SWat algorithm, they can be used in any
other dynamic programming-based application.)
The overarching contribution of the paper is the robust
mapping of our fine-grained SWat parallelization onto the
GPU, thus enabling it to support the alignment of virtu-
ally all the sequences in the NIH GenBank. To achieve
the above, we propose the following: (1) techniques for ac-
celerating access to global memory: matrix re-alignment,
coalesced memory access, and tiling; (2) a GPU synchro-
nization method to improve global synchronization time,
and hence, communication between multiprocessors; and
(3) a performance comparison relative to the Cell/BE,
another type of accelerator-based parallel platform.
The rest of this paper is organized as follows: Section 2
presents the related work. Section 3 describes the NVIDIA
GTX 280 architecture and the CUDA programming model.
Section 4 introduces the sequential SWat algorithm. Tech-
niques to accelerate SWat in CUDA are described in Sec-
tion 5. Section 6 compares and analyzes the performance
of the various implementations on the GPU and across the
GPU and the Cell/BE. Section 7 concludes the paper.
2 Related Work
The Smith-Waterman (SWat) algorithm has previously
been implemented on the GPU by using graphics prim-
itives [4, 5], and more recently, using CUDA [6, 7, 15].
Though the most recent CUDA implementations of
SWat [6, 7] report speed-ups as high as 30-fold, they all
suffer from a myriad of limitations. First, each of their
approaches only follows a coarse-grained, embarrassingly
parallel approach that assigns a single problem instance
to each multiprocessor on the device, thereby sharing
the available GPU resources among multiple concurrent
problem instances, as shown in Figure 1(a). This approach
severely restricts the maximum problem size that can be
solved by the GPU to sequenes of length 1024 or less. In
contrast, we propose and implement a fine-grained paral-
lelization of SWat by distributing the task of processing
a single problem instance across all the multiprocessors
on the GPU, thereby supporting realistic problem size, as
large as 8461 in lengths. The above limitation is due to
the physical size of global memory on the GPU. Though
the global memory can support the alignment of a large
sequences, the coarse-grained parallel approach forces the
global memory to be shared amongst multiple instances
of SWat whereas our fine-grained parallel approach leaves
the entire global memory resource available to a single in-
stance of SWat, allowing larger sequences to be processed.
In [15], Striemer et al. also primarily use shared memory
and constant cache for coarse-grained SWat parallelism
on the GPU. Thus, their implementation is also limited
to query sequences of length 1024 or less because of the
limited shared memory and cache size.
With respect to GPU synchronization, the most closely
related work to ours is that of Volkov et al. [16], who have
also “implemented” a global software synchronization
method to accelerate dense linear-algebra constructs. Our
GPU synchronization differs from Volkov’s in the following
ways: (1) they have not actually integrated their GPU
synchronization with any of the linear algebra routines,
and (2) they have acknowledged a severe drawback in their
synchronization method, i.e., their global synchronization
does not guarantee that previous accesses to all levels of
the memory hierarchy have completed unless a memory
consistency model is assumed. With respect to the first
point, Volkov et al. only provide theoretical estimates of the
performance that could possibly be obtained if their im-
plementation was used in conjunction with their synchro-
nization code. In contrast, we have implemented our GPU
synchronization, integrated it in our CUDA-SWat imple-
mentation, and shown actual experimental results. For the
second point, our implementation guarantees that trans-
actions at all levels of the memory hierarchy have been
completed at the end of the synchronization code. Each
thread block waits for its respective representative (leader)
thread to reach the global synchronization, followed by a
call to the local synchronization function syncthreads(),
which guarantees that all the memory transactions are
committed at the different memory hierarchies.
3 The GTX 280 GPGPU and the CUDA Pro-
gramming Model
The GTX 280 GPGPU (or device) consists of a set of
30 SIMT streaming multiprocessors (SMs), where each
SM consists of eight scalar processor (SP) cores running
at 1.2 GHz with 16-KB on-chip shared memory and a
multi-threaded instruction unit.
The device memory, which can be accessed by all
the SMs, consists of 1 GB of read-write global memory
and 64 KB of read-only constant memory and read-only
texture memory. However, all the device memory modules
can be read or written to by the host processor. Each
SM has on-chip memory, which can be accessed by all the
SPs within the SM and will be one of the following four
types: a set of 16k local 32-bit registers; 16 KB of parallel,
shared, and software-managed data cache; a read-only
constant cache that caches the data from the constant
device memory; and a read-only texture cache that caches
the data from the texture device memory. The global
memory space is not cached by the device.
CUDA (Compute Unified Device Architecture) [9] is the
parallel programming model and software environment
provided by NVIDIA to run applications on their GPUs.
It abstracts the architecture to parallel programmers via
simple extensions to the C programming language.
CUDA follows a code off-loading model, i.e. data-
parallel, compute-intensive portions of applications
running on the host processor are typically off-loaded onto
the device. The kernel is the portion of the program that
is compiled to the instruction set of the device and then
off-loaded to the device before execution.
In CUDA, threads can communicate and synchronize
only within a thread block via the shared memory of the
SM; there exists no mechanism for threads to communicate
across thread blocks. If threads from two different blocks
try to communicate via global memory, the outcome of the
memory transaction is undefined, and NVIDIA does not
recommend such inter-block communication. Thus, the
launch of a kernel from the host processor to the GPGPU
currently serves as an implicit barrier to all threads that
were launched by the previous kernel.
4 Smith-Waterman (SWat) Algorithm
The SWat algorithm [13] is an optimal local sequence
alignment methodology that follows the dynamic pro-
gramming paradigm, where intermediate alignment scores
are stored in a matrix before the maximum alignment
score is calculated. Next, the matrix entries are inspected,
and the highest-scoring local alignment is generated. The
SWat algorithm can thus be broadly classified into two
phases: (1) matrix filling and (2) backtracing.
To fill out the dynamic-programming (DP ) matrix, the
SWat algorithm follows a scoring system that consists of
a scoring matrix and a gap-penalty scheme. The scoring
matrix, M is a two-dimensional matrix containing the
scores for aligning individual amino acid or nucleotide
residues. The gap-penalty scheme provides the option of
gaps being introduced within the alignments, hoping that
a better alignment score can be generated; but they incur
some penalty or negative score. In our implementation,
we consider an affine gap penalty scheme that consists
of two types of penalties. The gap-open penalty, o is
incurred for starting (or opening) a gap in the alignment,
and the gap-extension penalty, e is imposed for extending
a previously existing gap by one unit. The gap-extension
penalty is usually smaller than the gap-open penalty.
Using this scoring scheme, the dynamic-programming
matrix is populated via a wavefront pattern, i.e. beginning
from the northwest corner element and going toward the
NW N
W
(a) (b)
Figure 2. The SWat Wavefront Algorithm and its
Dependencies.
southeast corner; the current anti-diagonal is filled after
the previous anti-diagonals are computed, as shown in
Figure 2(a). Moreover, each element in the matrix can
be computed only after its north, west, and northwest
neighbors are computed, as shown in Figure 2(b). Thus,
elements within the same anti-diagonal are independent
of each other and can therefore be computed in parallel.
The backtracing phase of the algorithm is essentially a se-
quential operation that generates the highest scoring local
alignment. In this paper, we mainly focus on accelerating
the matrix filling because it consumes more than 99% of
the execution time and it is the object to be parallelized.
5 Techniques for Accelerating SWat on the GPU
In this section, we describe a series of techniques to
accelerate the access of global memory and decrease the
synchronization time, and hence, communication between
multiprocessors.
5.1 Accelerating Global Memory Access
There are three techniques proposed for the effective
global memory access, which are: 1) matrix re-alignment
to optimize the sequence of memory accesses; 2) coalesced
memory access to amortize the overhead of loading from
and storing to memory; 3) tiling in order to sufficiently
increase computational granularity so as to amortize the
overhead of global memory access.
5.1.1 Matrix Re-Alignment
To leverage the SIMD processing of the GPU, we store
the matrix in memory in the diagonal-major data format
instead of the row-major data format, as shown in
Figure 3. The matrix in Figure 3(a) is stored as the layout
in Figure 3(b), where each anti-diagonal are arranged
in sequence. As a result, threads in a block can access
memory in adjacent locations, and data can be transferred
in blocks with larger size [9]. An implementation using
this technique is referred to as a simple implementation.
With the diagonal-major data format, we oﬄoad the
data-intensive, matrix-filling part onto the GPU device. If
the CPU implicit synchronization is used, the dependence
between consecutive anti-diagonals of the matrix forces
a synchronization operation, and hence a new kernel
invocation, after the computation of every anti-diagonal.
We distribute the computation of the elements in every
anti-diagonal uniformly across all the threads in the kernel
in order to completely utilize all the SMs on the chip
and support the realistic problem size. To simplify our
implementation, every kernel contains a one-dimensional
Figure 3. Matrix Representation in Memory.
grid of thread blocks, where each block further contains a
single dimension of threads, as shown in Figure 4.
Figure 4. Mapping of Threads to Matrix Elements
→ Variation in the Computational Load Imposed
on Successive Kernels.
5.1.2 Coalesced Memory Access
In addition to the changing the matrix layout in Fig-
ure 3(b) to improve the effectiveness of global memory
access, we transform non-coalesced data accesses into
coalesced ones. Running the CUDA profiler [8] shows
that 63% of memory accesses are non-coalesced if only the
matrix re-alignment technique (i.e., simple implementa-
tion) is used. In this subsection, we propose a method to
improve percentage of coalesced memory accesses and refer
to this implementation as our coalesced implementation.
Since each anti-diagonal is computed by a new kernel
and each thread accesses an integer (32-bit word), if the
starting addresses of every anti-diagonal is aligned to 64-
byte boundaries, then all the writes are coalesced, as shown
in Figure 5. The skewed dependence between the elements
of neighboring anti-diagonals restricts the degree of coa-
lescing for the read operations from global memory. Also,
we select the block and grid dimensions of the kernel, such
that all the thread blocks have coalesced memory store.
5.1.3 Tiling
In the previous subsections, we proposed techniques for
efficient global memory access. Here we apply the tiled-
wavefront design pattern, which was used to efficiently
map SWat to the IBM Cell/BE [1], to the GPU architec-
Figure 5. Incorporating Coalesced Data Represen-
tation of Successive Anti-Diagonals in Memory.
ture. This approach amortizes the cost of kernel launches
by grouping the matrix elements into computationally
independent tiles. In addition, shared memory is used to
reduce the amount of data needed to load from and store
to the global memory.
Our tile-scheduling scheme assigns a thread block to
compute a tile, and a grid of blocks (kernel) is mapped to
process a single tile-diagonal, thus decreasing the number
of barriers needed to fill the alignment matrix. CPU
implicit synchronization via new kernel launches or our
proposed GPU synchronization (see Section 5.2) serve as
barriers to the threads from the previous tile-diagonal.
Consecutive tile-diagonals are computed one after another
from the northwest corner to the southeast corner of the
matrix in the design pattern of a tiled wavefront, as shown
in Figure 6.
Figure 6. Tiled Wavefront.
The elements within a tile are computed by a thread
block by following the simple wavefront pattern, starting
from the northwest element of the tile. The threads
within each block are synchronized after computing
every anti-diagonal by explicitly calling CUDA’s local
synchronization function syncthreads().
Each thread block computes its allocated tiles within
shared memory. The processed tile is then transferred
back to the designated location in global memory. This
memory transfer will be coalesced because we handcraft
the allocation of each tile to follow the rules for coalesced
memory accesses.
5.2 GPU Synchronization
Efficient global memory accesses can accelerate the
computation, but it cannot accelerate the synchronization
on the GPU. While the tiled-wavefront technique reduces
the number of kernel launches, it explicitly and implicitly
serializes the computation both within and across tiles,
respectively. One solution to this problem is to introduce
an efficient synchronization mechanism across thread
blocks. In this way, we avoid both launching the kernel
multiple times and serializing the tiled wavefront.
1 //the mutex variable
2 __device__ volatile int g_mutex;
3
4 //GPU simple synchronization function
5 __device__ void __GPU_sync(int goalVal)
6 {
7 // thread ID in a block
8 int tid = threadIdx.x * blockDim.y
9 + threadIdx.y;
10
11 // only thread 0 is used for
12 // synchronization
13 if (tid == 0) {
14 atomicAdd ((int *)& g_mutex , 1);
15
16 //only when all blocks add 1 to
17 //g_mutex , will it be equal to
18 // goalVal
19 while(g_mutex != goalVal) {
20 //Do nothing
21 }
22 }
23 __syncthreads ();
24 }
Figure 7. Code Snapshot of GPU Synchroniza-
tion
However, NVIDIA discourages inter-block communi-
cation via global memory because its outcome can be
undefined. Furthermore, a classic deadlock scenario can
occur if multiple blocks are mapped to the same SM, and
the active block waits on a message from the block that
is yet to be scheduled by the CUDA thread scheduler [9].
CUDA threads do not yield the execution, i.e., they
run to completion once spawned by the CUDA thread
scheduler, and therefore, the deadlocks cannot be resolved
in the same way as in traditional processor environments.
This problem can be solved if and only if there exists
a one-to-one mapping between the SM and the thread
block. In other words, if a GPU has ‘X’ SMs, then at most
‘X’ thread blocks can be spawned in the kernel to avoid
deadlock. In this paper, we use the GTX 280, which has
30 SMs, and spawn kernels with at most 30 thread blocks
each. Note that this idea also works on other generations
of the NVIDIA GPU with any number of SMs. To further
ensure that no two blocks can be scheduled to be executed
on the same SM, we can spawn maximum permissible
threads per block or pre-allocate all of the available shared
memory to each block.
We implement our GPU synchronization by first iden-
tifying a global memory variable (g mutex) as a shared
mutex, initialized to 0. At the synchronization step, each
block chooses one representative thread to increment
g mutex by using the atomic function atomicAdd.2 The
synchronization point is considered to be reached when the
value of g mutex equals the goalVal, which is the number
of blocks in the kernel. Figure 7 shows the pseudo-code
for the GPU synchronization function GPU sync().
6 Performance Analysis
This section presents a performance evaluation of various
SWat implementations. From Section 5, by combining the
three memory-access techniques — matrix re-alignment,
coalesced memory access, and tiled wavefront—and the
two synchronization methods — CPU (implicit) syn-
chronization and GPU synchronization, we present six
different implementations on the GPU. In addition, we
compare our new GPU implementation (CUDA-SWat) to
our previous Cell/BE implementation (Cell-SWat) [1].3
We use CUDA version 2.0 as our programming interface
to the GTX 280 GPU card, which has 1-GB global mem-
ory and 30 SMs, each running at 1.2GHz. Our Cell/BE is
a PlayStation 3 (PS3) game console, which is powered by
the Cell/BE processor. The PS3 is housed as part of a 24-
node PS3 cluster at Virginia Tech. The Linux kernel on the
PS3 runs on top of a proprietary hypervisor that disallows
the use of one of the SPE cores while another SPE core
is hardware-disabled. Thus, we can effectively use only 6
SPE cores for computational purposes. The Cell processor
on the PS3 executes at 3.2 GHz and has a total main mem-
ory of 256MB. For the results presented here, we choose
sequence pairs of size 3K and report their alignment re-
sults. While the GPU can handle sequences of 8K in size,
the Cell cannot since anything larger would not fit into
the 256-MB main memory of the PS3. So, although we
can align sequences as long as 8461 characters in length on
the NVIDIA GTX 280, we do not report these even better
results in this paper in order to ensure a fair comparison
between the Cell/BE and the GTX 280 graphics card.
With the above SWat implementations and their associ-
ated computing platforms, we evaluate the performance of
our six different implementations along three dimensions:
(1) total matrix filling time, (2) profile of the time spent
computing versus synchronizing in the matrix-filling
phase on the GPU, and (3) comparison of the “compute
+ data communication” approach of the Cell/BE ver-
sus the “compute + synchronization” approach of the
GPGPU, which is necessitated due to the lack of direct
multiprocessor-to-multiprocessor communication.
A final note: As traditionally done, we focus on par-
allelizing the matrix-filling phase. All the results are the
average of three runs.
6.1 Matrix-Filling Time
Figure 8 presents the total matrix-filling time of the
six GPU implementations. Figure 8(a) shows the matrix-
filling time of the simple implementation with both the
2Atomic operations are available on the NVIDIA graphics cards
with compute capability 1.2 and up.
3Given the memory constraints of the GPU, and particularly the
Cell, we set the tile size to 32 × 32, and the number of threads per
block to 128.
CPU and the GPU synchronization. We set the block
number from 8 to 30 to show the matrix-filling time
change against the block number. Figure 8(b) shows the
results of the coalesced implementation, and Figure 8(c)
is for the tiled wavefront.
Figure 8 illustrates that for the same number of blocks,
the coalesced implementation (with either CPU or GPU
synchronization) is by far the fastest while the tiled
wavefront is the slowest. Why? In the coalesced imple-
mentation, fewer data transactions are needed to fill the
same matrix, thus reducing the aggregate data-access time
to memory and speeding up program execution. Using the
CUDA profiler, we indirectly verify this by showing the
difference in the number of data transactions between the
simple and coalesced implementation, as shown in Figure 9.
(a) Simple implementation;
(b) Coalesced implementation;
(c) Tiled wavefront.
Figure 8. Matrix-Filling Time versus Number of
Blocks in the Kernel
Though the tiled wavefront reduces the synchronization
time by reducing the number of kernel launches, each
multiprocessor must now execute a larger computational
granularity. More importantly, the occupancy of the mul-
tiprocessor in the tiled-wavefront implementation is only
Figure 9. Number of Data Transactions to Global
Memory.
0.125, much lower than that of the simple implementation
at 1.000, as shown in Table 1. This means that 0.875 of
the GPU “goes to waste” when using the tiled-wavefront
approach. Thus, while the technique of tiling significantly
improves the performance of SWat on the Cell/BE, it has
the opposite effect on the GTX 280 GPU.
In addition, as mentioned in Section 5.2, the tiled wave-
front serializes the matrix filling explicitly and implicitly.
When using CPU synchronization, the best SWat perfor-
mance is achieved with 30 blocks per kernel launch. With
GPU synchronization, however, only 22 blocks are needed
in the simple implementation and 18 blocks in the coalesced
implementation to achieve the best performance. The lat-
ter can be attributed to the tradeoff between the increase
in synchronization time and the decrease in the computa-
tion time when more blocks are in the kernel. With more
blocks, additional resources can be used for the computa-
tion, which can accelerate the computation; however, at
the same time, more time is needed for GPU synchroniza-
tion because of the “atomic add” in the synchronization
function, which can only execute sequentially even in differ-
ent blocks. With 30 blocks, in the simple and coalesced im-
plementation, the time increase for GPU synchronization is
more than the computation time decrease, when compared
to that of 22 and 18 blocks in the kernel, respectively.
For tiled wavefront, the synchronization time is very
small compared to the computation time. The decrease
in computation time is much larger than the increase in
GPU synchronization time when more blocks are used.
As a result, the best performance for tiled wavefront is
achieved with 30 blocks in the kernel for both CPU and
the GPU synchronization.
Finally, in comparing the performance of only the
synchronization within SWat, GPU synchronization
always outperforms CPU synchronization. With the
CPU synchronization, the best synchronization times are
183.019 ms, 129.504 ms, and 372.728 ms for the simple im-
plementation, coalesced implementation, and tiled wave-
front, respectively, while the corresponding GPU synchro-
nization times are 175.989 ms, 124.144 ms, and 372.176 ms,
respectively. This indicates that performance can be im-
proved with a better synchronization method on the GPU.
6.2 GPU Implementation: Performance Profile
The kernel execution time of dynamic programming
(DP) algorithms such as SWat contains both computation
and non-computation time, i.e., the synchronization time
Table 1. Simple Implementation versus Tiled Wavefront via the CUDA Profiler.
occu- coalesced coalesced divergent instruc- warp
#calls pancy load store branch branch tions serialize
Simple imlementation 10359 1.0 5541640 5950074 737542 118791 3588471 76389
Tiled wavefront 335 0.125 5109783 7510320 3608729 346777 15249137 261469
Figure 10. Kernel Execution Time on the GPU.
on CUDA. Thus, we partition the kernel execution time
into the computation time and the synchronization time.
Here we assume, much as we did for global memory
access, that the computation time is the same no matter
which synchronization method is used. We obtain this
value by running the SWat implementations with GPU
synchronization but with the synchronization function
GPU sync() removed. The synchronization time is then
the difference between the total kernel execution time
(with either the CPU or the GPU synchronization) and
the computation time.
Figure 10 shows the percentage of the computation time
and the synchronization time corresponding to the best
performance of the six GPU implementations. Firstly,
in Figure 10, we observe that the percentage of time
to synchronize in the simple implementation is smaller
than that in the coalesced implementation for both CPU
and GPU synchronization, respectively. The reason is
the computation time of the coalesced implementation
is smaller, which makes the synchronization time occupy
a larger percentage of the total matrix filling time.
Secondly, percentage of the synchronization time for the
GPU synchronization is always less than that of the CPU
synchronization if the memory access technique is the
same, which indicates that the GPU synchronization needs
less time than the CPU synchronization. Thirdly, though
GPU synchronization can reduce the synchronization
time, the percentage of time spent synchronizing is still
21.96% and 30.74% for the simple and coalesced imple-
mentations, respectively. This means the synchronization
consumes a large part of the total kernel execution time.
6.3 GPU vs. Cell/BE: Performance Profile
On the Cell/BE, data communication between SPEs is
supported via direct software communication primitives
over the EIB. In contrast, the GPU has no such direct
communication between its “SPEs” known as multipro-
cessors. As a consequence, we used the high-overhead
CPU (implicit) barrier synchronization to accomplish
this communication on the GPU as well as constructed
our own software-based GPU barrier synchronization by
which multiprocessors could communicate.
In this section, we focus on the non-computation time
on the GPU and the Cell/BE. Figure 11 shows the compo-
sition of the matrix-filling time of all the implementations
on the GPU and the Cell/BE for their best performance.
From Figure 11, though the tiled wavefront on the Cell/BE
can achieve a good performance [1], the matrix-filling time
(239.13 ms) is larger than that of the simple implementa-
tion (183.02 ms and 175.99 ms for CPU and GPU synchro-
nization, respectively) and the coalesced implementation
(129.504 ms and 124.14 ms for CPU and GPU synchro-
nization, respectively), but it is less than that of the tiled
wavefront on the GPU (372.728 ms and 372.176 ms).
This indicates that the computational horsepower on the
Cell/BE (PS3) is not as fast as that on the GPU.
However, if we look at the percentage of the data com-
munication time on the Cell/BE, it is less than all the GPU
implementations. Though the GPU tiled wavefront can
reduce the synchronization time to 2.25 ms and 1.70 ms
for CPU and the GPU synchronization, respectively, and
close to that on the Cell/BE at 0.69 ms, it is at the cost
of significantly more computation time and, as a result,
the total kernel execution time for the GPU of 370 ms
is larger than that on the Cell/BE 238.44 ms. In other
words, data communication on the Cell/BE is much more
efficient than that on the GPU. The reason is obvious, i.e.,
the Cell/BE supports direct data communication among
the PPE and the SPEs; but on the GPU, no explicit data
communication methods are provided.
Figure 11. Performance Profile of GPU vs.
Cell/BE SWat.
7 Conclusions
In this paper, we choose one of the dynamic program-
ming applications—Smith-Waterman—as an example
to accelerate it on the GPU. In contrast to the pre-
vious coarse-grained parallelization, we implement the
fine-grained parallelization of the SWat. To improve its
performance, techniques to decrease both the computation
time and the data communication time are proposed. For
the former, it mainly focuses on the efficient global mem-
ory access; and for the latter, a new GPU synchronization
method is proposed, which can synchronize the execution
across different blocks without the host involved. These
techniques are proposed based on the SWat, but they can
apply to any dynamic programming-based applications.
To evaluate the performance, we compare the matrix
filling time of each GPU implementation. From the ex-
periment results, compared to the simple implementation,
the coalesced implementation can speedup the execution;
but the tiled wavefront can not. Also, the proposed GPU
synchronization can achieve a better performance than
the CPU synchronization, though the performance is
achieved by using only part of the SMs on the GPU. From
the results, synchronization time percentage can decrease
from 32.91% to 21.96% for the simple implementation and
from 45.67% to 30.74% for the coalesced implementation
with the GPU synchronization. In addition, we compare
the computation time and the data communication time
between the Cell/BE and the GPU. From our results,
matrix filling time on Cell/BE is larger than those of the
simple and coalesced implementations on the GPU, but
its data communication is much less. Though we have a
tiled wavefront on the GPU that can decrease the syn-
chronization time to a few milli-seconds, it is at the cost
of increasing the computation time to much more than
that on the Cell/BE. As a result, for non-data-parallel
applications such as the dynamic programming, efficient
data communication mechanism is important for them to
be accelerated on multi-core and many-core architectures.
As future work, we would like to use some cached
memory, e.g., texture memory and constant memory, to
decrease the computation time even more for the SWat.
Also, from our experiment results, one problem in the
GPU synchronization is the atomic operation—the more
blocks are in the kernel, the more time is needed for the
synchronization. In future, we will try to decrease the
overhead caused by the atomic operations in some degree
or even totally removing them in the GPU synchronization
function. Finally, we would like to extend the proposed
GPU synchronization method to other applications.
Acknowledgments
We would like to thank Heshan Lin, Jeremy Archuleta,
Tom Scogland, and Song Huang for their technical support
and feedback on the manuscript.
This work was supported in part by an IBM Faculty
Award and a NVIDIA Professor Partnership Award.
References
[1] A. M. Aji, W. Feng, F. Blagojevic, and D. S. Nikolopou-
los. Cell-SWat: Modeling and Scheduling Wavefront
Computations on the Cell Broadband Engine. In Proc.
of the ACM International Conference on Computing
Frontiers, May 2008.
[2] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer,
and K. Skadron. A Performance Study of General-
Purpose Applications on Graphics Processors Using
CUDA. Journal of Parallel and Distributed Computing,
68(10):1370–1380, 2008.
[3] T. Chen, R. Raghavan, J. Dale, and E. Iwata. Cell Broad-
band Engine Architecture and its First Implementation.
IBM developerWorks, Nov 2005.
[4] W. Liu, B. Schmidt, G. Voss, A. Schroder, and W. Muller-
Wittig. Bio-Sequence Database Scanning on a GPU.
Proc. of the 20th International Parallel and Distributed
Processing Symposium, April 2006.
[5] Y. Liu, W. Huang, J. Johnson, and S. Vaidya. GPU
Accelerated Smith-Waterman. In Proc. of the 2006
International Conference on Computational Science,
Lectures Notes in Computer Science Vol. 3994, pages
188–195, June 2006.
[6] S. A. Manavski and G. Valle. CUDA Compatible GPU
Cards as Efficient Hardware Accelerators for Smith-
Waterman Sequence Alignment. BMC Bioinformatics,
2008.
[7] Y. Munekawa, F. Ino, and K. Hagihara. Design and
Implementation of the Smith-Waterman Algorithm on
the CUDA-Compatible GPU. 8th IEEE International
Conference on BioInformatics and BioEngineering, pages
1–6, Oct. 2008.
[8] NVIDIA. CUDA Profiler, 2008. http://developer.
download.nvidia.com/compute/cuda/2.0-Beta2/
docs/CudaVisualProfiler_linux_release_notes_1.0_
13June08.txt.
[9] NVIDIA. NVIDIA CUDA Programming Guide, 2008.
http://developer.download.nvidia.com/compute/
cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.
pdf.
[10] L. Nyland, M. Harris, and J. Prins. Fast N-Body
Simulation with CUDA. GPU Gems, 3:677–695, 2007.
[11] C. I. Rodrigues, D. J. Hardy, J. E. Stone, K. Schulten, and
W. W. Hwu. GPU Acceleration of Cutoff Pair Potentials
for Molecular Modeling Applications. In Proceedings of
the Conference on Computing Frontiers, pages 273–282.
ACM New York, NY, USA, 2008.
[12] M. Schatz, C. Trapnell, A. Delcher, and A. Varshney.
High-Throughput Sequence Alignment Using Graphics
Processing Units. BMC Bioinformatics, 8(1):474, 2007.
[13] T. Smith and M. Waterman. Identification of Common
Molecular Subsequences. Journal of Molecular Biology,
147:195–197, 1981.
[14] J. E. Stone, J. C. Phillips, P. L. Freddolino, D. J. Hardy,
L. G. Trabuco, and K. Schulten. Accelerating Molecular
Modeling Applications with Graphics Processors. Journal
of Computational Chemistry, 28:2618–2640, 2007.
[15] G. M. Striemer and A. Akoglu. Sequence Alignment with
GPU: Performance and Design Challenges. In IPDPS,
May 2009.
[16] V. Volkov and J. Demmel. LU, QR and Cholesky
Factorizations Using Vector Capabilities of GPUs. Tech-
nical Report, UCB/EECS-2008-49, EECS Department,
University of California, Berkeley, CA, May 2008.
