CUDAMPF++: A Proactive Resource Exhaustion Scheme for Accelerating
  Homologous Sequence Search on CUDA-enabled GPU by Jiang, Hanyu et al.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 1
CUDAMPF++: A Proactive Resource Exhaustion
Scheme for Accelerating Homologous Sequence
Search on CUDA-enabled GPU
Hanyu Jiang, Student Member, IEEE, Narayan Ganesan, Senior Member, IEEE,
and Yu-Dong Yao, Fellow, IEEE
Abstract—Genomic sequence alignment is an important research topic in bioinformatics and continues to attract significant efforts. As
genomic data grow exponentially, however, most of alignment methods face challenges due to their huge computational costs.
HMMER, a suite of bioinformatics tools, is widely used for the analysis of homologous protein and nucleotide sequences with high
sensitivity, based on profile hidden Markov models (HMMs). Its latest version, HMMER3, introdues a heuristic pipeline to accelerate the
alignment process, which is carried out on central processing units (CPUs) with the support of streaming SIMD extensions (SSE)
instructions. Few acceleration results have since been reported based on HMMER3. In this paper, we propose a five-tiered parallel
framework, CUDAMPF++, to accelerate the most computationally intensive stages of HMMER3’s pipeline, multiple/single segment
Viterbi (MSV/SSV), on a single graphics processing unit (GPU). As an architecture-aware design, the proposed framework aims to fully
utilize hardware resources via exploiting finer-grained parallelism (multi-sequence alignment) compared with its predecessor
(CUDAMPF). In addition, we propose a novel method that proactively sacrifices L1 Cache Hit Ratio (CHR) to get improved
performance and scalability in return. A comprehensive evaluation shows that the proposed framework outperfroms all existig work and
exhibits good consistency in performance regardless of the variation of query models or protein sequence datasets. For MSV (SSV)
kernels, the peak performance of the CUDAMPF++ is 283.9 (471.7) GCUPS on a single K40 GPU, and impressive speedups ranging
from 1.x (1.7x) to 168.3x (160.7x) are achieved over the CPU-based implementation (16 cores, 32 threads).
Index Terms—GPU, CUDA, SIMD, L1 cache, hidden Markov model, HMMER, MSV, SSV, Viterbi algorithm.
F
1 INTRODUCTION
T YPICAL algorithms and applications in bioinformatics,computational biology and system biology share a com-
mon trait that they are computationally challenging and
demand more computing power due to the rapid growth
of genomic data and the need for high fidelity simulations.
As one of the most important branches, the genomic se-
quence analysis with various alignment methods scales the
abstraction level from atoms to RNA/DNA molecules and
even whole genomes, which aims to interpret the similarity
and detect homologous domains amongst sequences [1].
For example, the protein motif detection is key to identify
conserved protein domains within a known family of pro-
teins. This paper addresses HMMER [2], [3], a widely used
toolset designed for the analysis of homologous protein and
nucleotide sequences with high sensitivity, which is carried
out on central processing units (CPUs) originally.
HMMER is built on the basis of probabilistic inference
methods with profile hidden Markov models (HMMs) [3].
Particularly, the profile HMM used in HMMER is Plan-
7 architecture that consists of five main states (Match(M),
Insert(I), Delete(D), Begin(B) and End(E)) as well as five
special states (N, C, J, S and T). The M, I and D states which
are in the same position form a node, and the number of
• H. Jiang, N. Ganesan, Y. Yao are with the Department of Electrical and
Computer Engineering, Stevens Institute of Technology, Hoboken, NJ
07030.
E-mail: {hjiang5, nganesan, Yu-Dong.Yao}@stevens.edu.
Manuscript received xxxx xx, xxxx; revised xxxx xx, xxxx.
nodes included in a profile HMM indicates its length. The
digital number “7” in Plan-7 refers to the total of seven
transitions per node, which exist in the architecture and each
has a transition probability. In addtion, some states also have
emission probabilities. This architecute is a little bit different
from the original one proposed by Krogh et al. [4] which
contains extra I-D and D-I transitions.
The profile HMMs employ position-specific Insert or
Delete probabilities rather than gap penalties, which enables
HMMER to outperform BLAST [5] on senstivity [3]. How-
ever, the previous version of HMMER, HMMER2, suffers
the computational expense and gains less utilization than
BLAST. Due to well-designed heuristics, BLAST is in the
order of 100x to 1000x faster than HMMER2 [3]. There-
fore, numerous acceleration efforts have been made for
HMMER2, such as [6], [7], [8], [9], [10]. Most of them em-
ploy application accelerators and co-processors, like field-
programmable gate array (FPGA), graphics processing unit
(GPU) and other parallel infrastructures, which provide
good performence improvement. To popularize HMMER
for standard commodity processors, Eddy et al. propose new
versions of HMMER, HMMER3 (v3.0) and its subsequent
version (v3.1), which achieve the comparable performance
as BLAST [3]. As the main contribution, HMMER3 imple-
ments a heuristic pipeline in hmmsearch which aligns a query
model with the whole sequence dataset to find out signifi-
cantly similar sequence matches. The heuristic acceleration
pipeline is highly optimized on CPU-based systems with the
support of streaming SIMD extensions (SSE) instructions,
ar
X
iv
:1
70
7.
09
68
3v
1 
 [c
s.C
E]
  3
0 J
ul 
20
17
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 2
and hence only few acceleration attempts, including [11],
[12], [13], [14], [15], [16] and [17], report further speedups.
Our previous work [18], CUDAMPF, proposes a multi-
tiered parallel framework to accelerate HMMER3 pipeline
on a single GPU, which was shown to exceed the cur-
rent state-of-the-art. However, the performance evaluation
shows that the thoughput of the computational kernel de-
pends on the model length, especially for small models,
which implies underutilization of the GPU. This inspires
us to exploit finer-grained parallelism, compared with the
framework presented in [18]. In this paper, we describe
another tier of parallelization that aims to fully take ad-
vantage of the hardware resources provided by single GPU.
A novel optimization strategy that proactively utilizes on-
chip cache system is proposed to further boosts kernel
throughput and improves scalability of the framework. A
comprehensive evaluation indicates that our method ex-
hibits good consistency of performance regardless of query
models and protein sequence datasets. The generalization
of the proposed framework as well as performance-oriented
suggestions are also discussed.
The rest of the paper is organized as follows. Section 2
presents background of HMMER3 pipeline, GPU architec-
ture and highlighted CUDA features, followed by a review
of CUDAMPF implementation. In Section 3, the in-depth
description of our proposed framework is presented. Then,
we give comprehensive evaluations and analyses in Section
4. Related works and discussions are presented in Section 5
and 6, respectively. Finally, we present the conclusion of this
paper.
2 BACKGROUND
In this section, we go through the new heuristic pipeline
of HMMER3 and highlights its computationally intensive
stages. An overview of the GPU architecture and the CUDA
programming model is also presented. For better under-
standing of subsequent ideas, we briefly review our pre-
vious work, CUDAMPF, at end of this section.
2.1 Heuristic Pipeline in HMMER3
The main contribution that accelerates the HMMER3 is a
new algorithm, multiple segment Viterbi (MSV) [3], which
is derived from the standard Viterbi algoritm. The MSV
model is a kind of ungapped local alignment model with
multiple hits, as shown in Fig. 1, and it is achieved by
pruning Delete and Insert states as well as their transitions
in the original profile HMMs. The M -M transitions are also
treated as constants of 1. In addition to the MSV algorithm,
another simpler algorithm, single segment Viterbi (SSV), is
also introduced to boost the overall performance further.
Given that the J state is the bridge between two matched
alignments, the SSV model assumes that there is rarely a
matched alignment with a score that is higher than the cost
of going through the J state, and hence it speculatively
removes the J state to gain a significant speedup [19].
However, in order to avoid false negatives, the SSV model
is followed by regular MSV processing to re-calculate sus-
pected sequences. Fig. 1 illustrates profiles of P7Viterbi,
MSV and SSV models with an example of 4 nodes. The solid
arrows indicate transitions between different types of states
whereas dashed arrows represent the self-increase of a state.
M1 M2 M3 M4B E
M1 M2 M3 M4BS E TN C
J
M1 M2 M3 M4BS E TN C
J
I1 I2 I3
D2 D3 D4
SSV: Single ungapped local alignment 
Segment Viterbi
MSV: Multiple ungapped local alignment 
Segment Viterbi
P7Viterbi: Plan-7 architecture based 
Viterbi
TCS N
Fig. 1. Profiles of P7Viterbi, MSV and SSV models.
In the pipeline, SSV and MSV models work as heuristic
filters (stages) that filter out nonhomologous sequences. All
sequences are scored during SSV and MSV stages, and
only about 2.2% of sequences are passed to the next stage,
given a threshold. The second stage consists of the P7Viterbi
model which only allows roughly 0.1% of sequences pass,
and resulting sequences are then scored with the the full
Forward algorithm [3]. These four stages mentioned above
form the main part of HMMER3’s pipeline. However, SSV
and MSV stages consumes more than 70% of the overall
execution time [17], [18], and hence they are prime targets
of optimization.
Fig. 2 illustrates the dynamic programming (DP) matrix
of the P7Viterbi stage, corresponding to Fig. 1. A lattice of
the middle region contains three scores for Match, Insert
and Delete states, respectively, whereas flanking lattices only
have one. To complete the alignment of a sequence, we
need to calculate every lattice of the DP matrix starting
from the left-top corner to the right-bottom corner (green)
in a row-by-row order, which is a computationally intensive
process. In the middle region of the DP matix, each lattice
(red) depends on four cells (blue) directly, denoted by solid
arrows, which can be formulated as:
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 3
E J CBN 1 Mj
0
i
L
Query model (states)
Se
qu
en
ce
 (
re
si
d
u
es
)
Fig. 2. Dynamic programming matrix of the P7Viterbi stage.
VM [i, j] = M + max

VM [i− 1, j − 1] + T (Mj−1,Mj),
VI [i− 1, j − 1] + T (Ij−1,Mj),
VD[i− 1, j − 1] + T (Dj−1,Mj),
B[i− 1] + T (B,Mj)
VI [i, j] = I + max
{
VM [i− 1, j] + T (Mj , Ij),
VI [i− 1, j] + T (Ij , Ij)
VD[i, j] = max
{
VM [i, j − 1] + T (Mj−1, Dj),
VD[i, j − 1] + T (Dj−1, Dj)
(1)
where  denotes emission scores. V and T represent scores
of M/I/D states and transitions, respectively. As for MSV
and SSV stages, the mathematical formula can be simpli-
fied via removing VI and VD , which results in moderate
dependencies and fewer amount of computation than the
P7Viterbi stage. However, in order to exceed the perfor-
mance of the highly optimized CPU-based implementation
with the SIMD vector parallelization, it is imperative to go
beyond general methods and exploit more parallelism on
other multi/many-core processor architectures.
2.2 GPU Architecture and CUDA Programming Model
As parallel computing engines, CUDA-enabled GPUs are
built around a scalable array of multi-threaded streaming
multiprocessors (SMs) for large-scale data and task paral-
lelism, which are capable of executing thousands of threads
in the single-instruction multiple-thread (SIMT) pattern [20].
Each generation of GPU introduces more hardware re-
sources and new features, which aims to deal with the ever-
increasing demand for computing power in both industry
and academia. In this paper, we implement our design on
Tesla K40 GPU of Kepler GK110 architecture which equips
with 15 powerful streaming multiprocessors, also known
as SMXs. Each SMX consists of 192 single-precision CUDA
cores, 64 double-precision units, 32 special function units
and load/store units [21]. The architecture offers another
48KB on-chip read-only texture cache with an independent
datapath from the existing L1 and shared memory datapath,
and the maximum amount of available registers for each
thread is increased to 255 instead of prior 63 per thread.
Moreover, a set of Shuffle instructions that enables a warp of
threads to share data without going through shared memory
are also introduced in Kepler architecture. This new feature
is heavily used in our proposed framework.
The CUDA programming model is designed for NVIDIA
GPUs, and it provides users with a development environ-
ment to easily leverage horsepower of GPUs. In CUDA, a
kernel is usually defined as a function that is executed by all
CUDA threads concurrently. Both grid and block are vitural
units that form a thread hierarchy with some restrictions. Al-
though CUDA allows users to launch thousands of threads,
only a warp of threads (32 threads, currently) guarantee that
they advance exectuions in lockstep, which is scheduled
by a warp scheduler. Hence, the full efficiency is achieved
only if all threads within a warp have the same execution
path. Traditionally, in order to make sure that threads keep
the same pace, a barrier synchronization has to be called
explicitly, which imposes additional overhead.
2.3 CUDAMPF
In [18], we proposed a four-tiered parallel framework, CUD-
AMPF, implemented on single GPU to accelerate SSV, MSV
and P7Viterbi stages of hmmsearch pipeline. The framework
describes a hierarchical method that parallelizes algorithms
and distributes the computational workload considering
available hardware resources. CUDAMPF is completely
warp-based that regards each resident warp as a compute
unit to handle the exclusive workload, and hence the explict
thread-synchronization is eliminated. Instead, the built-in
warp-synchronism is fully utilized. A warp of threads make
the alignment of one protein sequence one time and then
pick up next scheduled sequence. Given that 8-bit or 16-
bit values are sufficient to the precision of algorithms,
we couple SIMT execution mechanism with SIMD video
instructions to achieve 64 and 128-fold parallelism within
each warp. In addition, the runtime compilation (NVRTC),
first appeared in CUDA v7.0, was also incorporated into the
framework, which enabled swichable kernels and innermost
loop unrolling to boost the performance further. CUDAMPF
yields upto 440, 277 and 14.3 GCUPS (giga cells updates per
second) with strong scalability for SSV, MSV and P7Viterbi
kernels, respectively, and it has been proved to exceed all
existing work.
3 PROPOSED FRAMEWORK: CUDAMPF++
This section presents detailed implementations of the pro-
posed framework, CUDAMPF++, that is designed to gain
more parallelism based on CUDAMPF. We first introduce
a new tier of parallelism followed by a data reformatting
scheme for protein sequence data, and then in-depth expla-
nations of kernel design are presented. Finally, we discuss
the optimizations of the proposed framework.
3.1 Five-tiered Parallelism
In CUDAMPF, the four-tiered parallel framework is pro-
posed to implement MSV, SSV and P7Viterbi kernels. Al-
though the performance improvement is observed on all
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4
accelerated kernels, the speedup on P7Viterbi kernel is very
limited whereas MSV/SSV kernel yields significant im-
provement. Given the profiling information [18], we are able
to gain additional insights into the behaviors: (a) L1 Cache
Hit Ratio (CHR) of the P7Viterbi kernel degrades rapidly
as model size increases, and (b) its register usage always
exceed the maximum pre-allocation for each thread, which
indicates the exhaustion of on-chip memory resources and
serious register spill. As for MSV/SSV kernels, however, the
on-chip memory resources are sufficient. A large amount of
low-latency registers, especially when aligning with small
models, are underutilized. This can also be proved by per-
formance curves in [18] in which only upward slopes are
observed without any flat or downward trends as model
size increases from 100 to 2405. The underutilization leaves
an opportunity to exploit further parallelism that can fully
take advantage of hardware resources on GPUs.
In addition to original four-tiered structure, another
tier of parallelism, inserted between 3rd and 4th tiers, is
proposed to enable each warp handle multiple alignments
with different protein sequences in parallel while the de-
sign of the CUDAMPF only allow single-sequence aligment
per warp. This scheme aims to exhaust on-chip memory
resources, regardless of the model length, to maximize the
throughput of MSV/SSV kernels. Fig. 3 illustrates the five-
tiered parallel framework.
The first tier is based on multiple SMXs that possess
plenty of computing and memory resources individually.
Given the inexistence of data and execution dependency
between each protein sequence, it is straightforward to
partition whole sequence database into several chunks and
distribute them to SMXs, as the most basic data paral-
lelism. Tier 2 describes the parallelism between multiple
warps that reside on each SMX. Since our implementation
still applies the warp-synchronous execution that all warps
are assigned to process different sequences without inter-
warp interactions, explicit synchronizations are eliminated
completely. Unlike the CUDAMPF in which warps move
to their next scheduled task once the sequence at hand is
done, the current design allocates a sequence data block
to each warp in advance. A data block may contain thou-
sands of sequences, less or more, depending on the total
size of protein sequence dataset and available warps. For
multi-sequence alignment, all sequences within each data
block need to be reformatted as the striped layout, which
enables coalesced access to the global memory. Details of
data reformatting will be discussed in Sec. 3.2. The number
of resident warps per SMX is still fixed to 32 due to the
complexity of MSV and SSV algorithms, which ensures that
every thread obtains enough registers to handle complex
execution dependencies and avoids excessive register spill.
Besides, each SMX contains 4 warp schedulers, each with
dual instruction dispath units [21], and hence it is able
to dispatch upto 8 independent instructions each cycle.
Those hardware resources have the critical influence on the
parallelism of tier 2 and the performance of warp-based
CUDA kernels.
Tier 3 is built on the basis of warps. A warp of threads
update different model states simultaneously and iterate
over remaining model states in batches. The number of itera-
tions depends on the query model size. Once the alignment
SMX 
Block
Shared 
Memory
L2 Cache
Global Memory
Warp
Block 
Warp
Warp
Warp
sequence data blocks
Warp Scheduler
2 Instruction 
Dispatch Units
A
iterate through 
all states of the 
query model
R
striped layout of query model states
32 threads in one warp
Seq.
A
R
C
C
striped 
model 
states
H
W
LGroup 1
Seq. 2
32 threads in one warp
alignment
H
W
LGroup 2
Seq. 1
H
W
LGroup S
Seq. S
Thread 1 Thread 2Thread 1 Thread 2
32-bit register
8 bits 
MSV / SSV (4-lane per thread) 
32-bit register
16 bits 
P7Viterbi (2-lane per thread) 
Tier 1
Tier 2
Tier 3
Tier 4
Tier 5
L1 
Cache
Read-
only 
Cache
SMX 
Block
Shared 
Memory
L1 
Cache
Read-
only 
Cache
SMX 
Block
Shared 
Memory
L1 
Cache
Read-
only 
Cache
A
V
L
S
D
V
D
R
E
Warp Scheduler
2 Instruction 
Dispatch Units
Warp Scheduler
2 Instruction 
Dispatch Units
Fig. 3. Five-tiered parallel framework based on the NVIDIA Kepler archi-
tecture.
of an amino-acid residue is done, such as the ’A’ and its
alignment scores (marked as a row of blue lattices) shown
in Fig. 3 (Tier 3), the warp moves to next residue ’R’ and
start over the alignment until the end of current sequence.
On the basis of tier 3, tier 4 illustrates the multi-sequence
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 5
alignment that a warp of threads are evenly partitioned into
several groups, each has L threads, to make alignments
with S different sequences in parallel. For example, the
first residues of S sequences, like ’A’ of Seq. 1, ’V’ of Seq.
2 and ’V’ of Seq. S, are extracted together for the first-
round alignment. Each group of threads update W scores
per iteration, and H iterations are required to finish one
alignment. The model states are formatted as a rectangle
with W×H lattices. Considering two models, a large model
Ml and a small model Ms where the size of Ml is S times
larger than the size of Ms, we are able to get Ws = 1SWl
given Hs = Hl, which provides S-lane parallelism and
roughly keeps register utilization of Ms as same as Ml. The
tier 5 remains unchanged as the fine-grained data paral-
lelism: every thread can operate on four 8-bit values and
two 16-bit values simutaneously, using single SIMD video
instruction [22], for MSV/SSV and P7Viterbi algorithms,
respectively. With the support of tier 5, the parallelism of
tier 4 is further extended because each thread takes charge
of four different sequences at most. The value of S, as the
number of sequences processed in parallel by a warp, is
defined as below:
{S | S = 2i, i ∈ Z ∩ [1, log2
sˆwˆr
wˆv
]}, (2)
where sˆ is the warp size, wˆr and wˆv represent the width of
registers and participant values, respectively. With Eq. 2, the
rest of values can be also formulated as:
W =
sˆwˆr
wˆvS
,
L = d sˆ
S
e
H = max{2, d mˆ
W
e},
(3)
where mˆ represents the size of query model. W , L and H
can be regarded as functions of S.
3.2 Warp-based Sequence Data Blocks
Due to the introduction of the multi-sequence alignment,
loading sequence data in the sequential layout is highly
inefficient. Originally, in [18], residues of each sequence
are stored in contiguous memory space, and warps always
read 128 residues of one sequence by one coalesced global
memory transaction. As for current design, however, the
sequential data layout may lead to 128 transactions per
memory request while extracting residues from 128 dif-
ferent sequences. Hence, a striped data layout is the most
straightforward solution. Given that warps have exclusive
tasks, we propose a data reformatting method that arranges
sequences in a striped layout to (a) achieve fully coalesced
memory access and (b) balance workload amongst warps.
The proposed method partitions large sequence dataset into
blocks based on the number of resident warps.
As shown in Fig. 4, all protein sequences are divided
into N blocks, each consists of Mi × 128 residues, where
N is the total number of resident warps, and Mi represents
the height of each block with i = [1, 2, 3...N ]. The number
128, as the width of each block, is pre-fixed for two reasons:
(a) MSV/SSV kernels only need participant values with the
width of wˆv = 8 bits, and hence a warp can handle up
P
L
A
A
M
A
P
R
M
M
R
P
V
T
@
G
S
T
C
C
M
D
D
S
C
M
@
H
G
G
F
D
L
E
E
P
M
@
V
N
R
H
@
Q
Q
A
K
F
A
T
P
E
M
E
E
@
N
V
G
G
L
Y
Q
S
@
Q
G
P
#
#
@
A
#
#
#
#
#
#
#
@
Q
R
A
S
M
Q
@
D
D
A
S
D
K
T
S
L
#
#
#
@
V
H
V
L
M
V
@
P
E
W
K
K
G
T
@
Y
L
K
M
V
M
S
I
L
P
R
@
H
D
D
R
A
#
#
#
@
#
#
@
T
0 1 2 3 4 5 126 128127
Mi=1
rows
128 columns (128 Bytes / row)
Warp 1 Warp 2 Warp 3 Warp N
128 cols
128 cols
128 cols
Addr.
(Bytes)
Fig. 4. Warp-based sequence data blocks.
to S = 128 sequences simultaneously. (b) 128 residues,
each occupies 1 byte, achieves aligned memory address
for coalescing access. Marked as different colors, residues
consist of three types: regular (green), padding (blue) and
ending (red). In each block, sequences are concatenated and
filled up 128 columns. A ending residue ’@’ is inserted at the
end of each sequence, which is used to trigger an intra-wrap
branch to calculate and record the score of current sequence
in the kernel. The value of Mi is equal to the length of
longest column within each block, such as second column of
the data block for warp 1. As for the rest of columns whose
length is less than Mi, padding residue ’#’s are attached to
make them all aligned. Residues that belong to the same
warp are stored together in row-major order.
Algorithm 1 shows the pseduo-code of reformatting
protein sequence data. N is determined by the number
of SMXs and the number of resident warps per SMX,
denoted by Nsmx and Nwarp, respectively (line 1). Given
S, N ∗ S containers Cx are created to hold and shape
sequences as we need. One container corresponds to one
column as shown in Fig. 4. We load first N ∗ S seqeunces
as the basis (line 4), and the for loop (line 7 to 19) iterates
over the rest to distribute them into Cx. To avoid serious
imbalance, the maxLen (line 6) is employed to monitor the
longest container as the upper limit. Every new sequence
searches for a suitable position based on the accumulated
length of each container and maxLen (line 13). We force
the sequence to be attached to the last container (line 12)
if no position is found. maxLen always be checked and
updated by checkMax() function once a new sequence is
attached successfully (line 17). Cx are evenly divdied into
N blocks when all sequences are attached, and the length
of longest container within each block Mi is recorded by
getMax(S) function (line 20). The padding process, as the
last step (line 21), shapes warp-based sequence data blocks
Bi into rectangles. Implementation of this method and the
evaluation of workload balancing are presented in Sec. 4.1.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 6
Algorithm 1 Protein sequence data reformatting
Input: all sequences sorted by length in descending order
Seq, and some parameters, such as S.
Output: warp-based sequence data blocks Bi.
1: N ← Nsmx ∗Nwarp
2: lines← N ∗ S
3: create lines containers Cx where x ∈ Z ∩ [0, lines).
4: Cx ← Seq[x] . load first lines sequences
5: ptr ← lines . start from last container
6: maxLen← max{Cx.len()}
7: for all Seq[y] where y ≥ lines do
8: repeat
9: ptr ← ptr − 1
10: if ptr < 0 then . position is not found
11: ptr ← lines . start over
12: C(ptr−1).attach(Seq[y])
13: else if Seq[y].len() +Cptr.len() ≤ maxLen then
14: Cptr.attach(Seq[y])
15: end if
16: until Seq[y] is attached.
17: maxLen.checkMax()
18: ptr ← lines if ptr < 0. . refresh pointer normally
19: end for . all sequences are attached
20: Mi ← Cx.divide(N).getMax(S) where i = [1, ..., N ]
21: Bi ← Cx.padding(Mi)
22: return Bi
3.3 Kernel Design
Since the number of sequences that are processed in parallel
is within the range of {2, 4, 8, 16, 32, 64, 128}, the proposed
algorithms are designed to cover all these cases. Therefore,
14 different types of kernels are generated, named as S-lane
MSV/SSV kernels, and their implementations slightly vary
with value S.
Algorithm 2 outlines the S-lane MSV kernels that is
more complex than the implementation of single-sequence
alignment in [18]. Some features are inherited, like (a) using
local memory to hold intermediate values as well as register
spill (line 1), (b) loading scores through read-only cache
instead of shared memory to avoid weak scalability with
low occupancy (line 16) and (c) fully unrolling the innermost
loop for maximizing registers usage to reside high frequency
values on the on-chip memory (line 14). In order to assign
different threads of a warp to work on different sequences
without mutual interference, we label group ID gid and the
offset in group oig on each thread (line 3 and 4). Threads that
work on the same sequence are grouped with a unique gid,
and they are assigned to different in-group tasks based on
the oig. Inter-thread collaborations are only allowed within
each thread group.
The outer loop (line 5) iterates over columns of the warp-
based sequence data block B while the middle loop (line 11)
takes charge of each row. The cycle times of outer loop is
directly affected by the query model size: the larger model
results in the more cycles. This is because on-chip memory
resources are limited when making alignment with large
models, and it further leads to the kernel selection with
small S. For example, a model with length of 45 can be
handled by 128-lane kernels whereas a model of 1000-length
Algorithm 2 MSV kernels with multi-sequence alignment
Input: emission score E, sequence data block B, height of
data block M , sequence length Len, offset of sequence
Oseq , offset of sequence length Olen and other parame-
ters, such as L, W , H , S and dbias, etc.
Output: P-values of all sequences Pi.
1: local memory Γ[H]
2: wid← blockIdx.y ∗ blockDim.y + threadIdx.y
3: gid← bthreadIdx.y/Lc . group id
4: oig ← threadIdx.x % L . offset in group
5: for k ← 0 toW − 1 do
6: R←M [wid]
7: count← 0
8: mask, scE ← 0x00000000
9: Iseq ← Oseq[wid] + gid ∗ L+ bk/4c
10: scB ← initialize it based on Len, Olen[wid] and k.
11: while count < R do
12: r ← extract res(B, Iseq, k, S)
13: γ ← inter or intra reorder(L, S, v, oig)
14: #pragma unroll H
15: for h← 0 to H − 1 do
16: σ ← load emission score(E,S, L, oig, h, r)
17: v ← vmaxu4(γ, scB)
18: v ← vaddus4(v, dbias)
19: v ← vsubus4(v, σ)
20: scE ← vmaxu4(scE , v)
21: γ ← Γ[h] & mask . load old scores
22: Γ[h]← v . store new scores
23: end for
24: scE ← max reduction(L, S, scE)
25: scJ , scB ← update special states, given scE .
26: mask ← 0xffffffff
27: if r contains ending residue @ then . branch
28: Pi ← calculate P-value and record it.
29: mask ← set affected bits to 0x00.
30: scJ , scE , scB ← update or reset special states.
31: end if
32: count, Iseq ← step up.
33: end while
34: end for
may only select 4-lane kernels. Given that, in one warp,
k is used to index S columns of residues simultaneously
during each iteration, and the Iseq always points to residues
that are being extracted from global memory. The details
of residue extraction are shown in Algorithm 3. For S-lane
kernels with S ≤ 32, only one 8-bit value (one residue) per
thread group is extracted and be ready to make alignment
though a warp always have the fully coalesced access to 128
residues assembled in 128-byte memory space. Instead, 64-
lane kernels extract two 8-bit values, and 128-lane kernels
are able to handle all of them. These residues are then
used in the function load emission score (line 16) to load
corresponding emission scores of “Match” states (line 3, 5
and 11 in Algorithm 4). The total number of amino acids is
extended to 32, and the extra states are filled with invalid
scores, which aims to cover the newly introduced residues
(ending and padding). 64 and 128-lane kernels are treated in a
specical way as shown in Algorithm 4 (line 3-9 and line 12-
15) due to the demand of score assembly. In this case, each
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7
thread assembles two or four scores of different residues
into a 32-bit register to be ready for subsequent SIMD in-
structions. All emission scores are loaded through read-only
cache to keep shared/L1 cache path from overuse, and the
score sharing is done via inter-thread shuffle instructions.
Algorithm 3 Extract residues from data block - extract res
Input: B, Iseq , k and S.
Output: a 32-bit value r that contains one, two or four
residues, given S.
1: if S ∈ {2, 4, 8, 16, 32} then
2: r ← (B[Iseq] >> 8 ∗ (k % 4)) & 0x000000ff
3: else if S = 64 then
4: r ← (B[Iseq] >> 16 ∗ k) & 0x0000ffff
5: else if S = 128 then
6: r ← B[Iseq]
7: end if
Algorithm 4 Get “Match” scores - load emission score
Input: E, S, L, oig, h and r.
Output: a 32-bit value σ that contains four emission scores,
each is 8-bit, in striped layout.
1: Na ← 32 . amino acids
2: if S ∈ {2, 4, 8, 16, 32} then
3: σ ← E[h ∗Na ∗ L+ r ∗ L+ oig] . ldg
4: else if S = 64 then
5: sc← E[h ∗Na + threadIdx.x] & 0x0000ffff
6: res← r & 0x000000ff
7: σ ← σ ‖ ( shfl(sc, res)) . assembly
8: res← (r >> 8) & 0x000000ff
9: σ ← σ ‖ ( shfl(sc, res) << 16) . assembly
10: else if S = 128 then
11: sc← E[h ∗Na + threadIdx.x] & 0x000000ff
12: for bits ∈ {0, 8, 16, 24} do
13: res← (r >> bits) & 0x000000ff
14: σ ← σ ‖ ( shfl(sc, res) << bits) . assembly
15: end for
16: end if
Algorithm 5 and 6 detail two crucial steps of MSV/SSV
kernels, inter or intra reorder and max reduction, (line
13 and 24 in Algorithm 2) via the PTX assembly to expose
internal mechanisms of massive bitwise operaions for the
multi-sequence alignment. They aim to reorder 8-bit values
and get the maximum value amongst each thread group
in parallel, and meanwhile, noninterference between thread
groups is guaranteed. Our design still avoids to use shared
memory since available L1 cache is the key factor on per-
formance when unrolling the innermost loop. Therefore, all
intermediate or temporary values are held by private mem-
ory space of each thread, such as registers or local memory.
The shuffle instruction, shfl, is employed again to achieve
the inter-thread communication but the difference is that a
mask is specified to split a warp into sub-segments (line
8 in Algorithm 5). Each sub-segment represents a thread
group. In Algorithm 6, two reduction phases are required
for S-lane kernels with S ≤ 16. Line 5 to 12 presents inter-
thread reductions by using shfl and vmax to get top four
values of 8-bit within each thread group. The following
lines are intra-thread reductions which only happen inside
each thread and eventually works out the maximum value.
As an example, Fig. 5 illustrates the reordering and max-
reduction for 16-lane kernels. W = 8 is the width of striped
model states, and it can also be regarded as the scope of
each thread group. For S = 16, a warp is partitioned
into 16 thread groups, and each handles eight values of
8-bit. Yellow lattices in Fig. 5(a) are values that need to
be exchanged betweem two threads. Arrows indicate the
reordering direction. In Fig. 5(b), assuming digital numbers
(0 to 127) labeled inside lattices represent the values held
by threads, the yellow lattices always track the maximum
value of each thread group. Three pairs of shuffle and SIMD
instructions are used to calculate the maximum value and
broadcast it to all members of the thread group.
Algorithm 5 Reorder 8-bit value inter-thread or intra-thread
- inter or intra reorder
Input: L, S, v and oig.
Output: a 32-bit value γ that contain four 8-bit values.
1: if S ∈ {2, 4, 8, 16} then . inter-thread
2: lane← (oig + L− 1) % L . source lane
3: asm{shr.u32 x, v, 24}
4: asm{mov.b32 mask, 0x1f}
5: asm{sub.b32 mask, mask, L(S)-1}
6: asm{shl.b32 mask, mask, 16}
7: asm{or.b32 mask, mask, 0x1f}
8: asm{shfl.idx.b32 x, x, lane, mask}
9: asm{shl.b32 y, v, 8}
10: γ ← x | y
11: else if S = 32 then . intra-thread
12: asm{shr.u32 x, v, 24}
13: asm{shl.b32 y, v, 8}
14: γ ← x | y
15: else if S = 64 then . intra-thread
16: asm{shr.u32 x, v, 8}
17: asm{and.b32 x, x, 0x00ff00ff}
18: asm{shl.b32 y, v, 8}
19: asm{and.b32 y, y, 0xff00ff00}
20: γ ← x | y
21: else if S = 128 then
22: γ ← 0x00000000 or 0x80808080 . not reorder
23: end if
The potential divergent execution only happens in the
branch for recording P-value of ended sequences, as shown
in Algorithm 2, line 27-31. Threads which find the existence
of ending residues keep active and move into the branch
whereas others are inactive during this period. A mask is
introduced to mark the position of ended sequences within
32-bit memory space and set those affected bits to 0. This
is particularly helpful to 64 and 128-lane kernels because it
only cleans up corresponding lanes for new sequence while
keeping data of other lanes unchanged. Moreover, bitwise
operations with mask minimize the number of instructions
needed inside the innermost loop, which is also beneficial
to the overall performance. As for the SSV kernel shown in
Algorithm 7, it shares the same framework with the MSV
kernel but has less computational workload. Besides, one
more mask value is added to reset affected bits since the
-inf of SSV kernel is 0x80 rather than 0x00. mask1 cleans
up outdated scores as the first step followed by a bitwise
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 8
Algorithm 6 Get and broadcast maximum value through
reduction operations - max reduction
Input: L, S and scE .
Output: a 32-bit value scE that contains four 8-bit or two
16-bit values.
1: if S = 128 then
2: do nothing but Return scE .
3: else
4: x← scE , y ← 0, z ← 0
5: if S ∈ {2, 4, 8, 16} then . inter-thread reduction
6: i← log2 L− 1
7: for lm← 20 to 2i do
8: asm{shfl.bfly.b32 y, x, lm, 0x1f}
9: asm{and.b32 m, m, 0x00000000}
10: asm{vmax4.u32.u32.u32 x, x, y, m}
11: end for
12: end if
13: asm{shr.u32 y, x, 8}
14: asm{and.b32 y, y, 0x00ff00ff}
15: asm{shl.b32 z, x, 8}
16: asm{and.b32 z, z, 0xff00ff00}
17: asm{or.b32 y, y, z}
18: asm{and.b32 m, m, 0x00000000}
19: asm{vmax4.u32.u32.u32 x, x, y, m}
20: Return scE ← x if S = 64.
21: asm{shr.u32 y, x, 16}
22: asm{shl.b32 z, x, 16}
23: asm{or.b32 y, y, z}
24: asm{and.b32 m, m, 0x00000000}
25: asm{vmax4.u32.u32.u32 x, x, y, m}
26: Return scE ← x if S ∈ {1, 2, 4, 8, 16, 32}.
27: end if
32-bit reg.
Thread #31
W = 8
6 57 4 2 13 014 1315 12 10 911 8126 125127 124 122 121123 120
2 13 0 6 57 410 911 8 14 1315 12122 121123 120 126 125127 124
6 57 4 6 57 414 1315 12 14 1315 12126 125127 124 126 125127 124
7 46 5 7 46 515 1214 13 15 1214 13127 124126 125 127 124126 125
7 57 5 7 57 515 1315 13 15 1315 13127 125127 125 127 125127 125
5 75 7 5 75 713 1513 15 13 1513 15125 127125 127 125 127125 127
7 77 7 7 77 715 1515 15 15 1515 15127 127127 127 127 127127 127
Reorder
8-bit
value
(a) Reordering for MSV/SSV kernels with S = 16
32-bit reg.
Thread #30
32-bit reg.
Thread #3
32-bit reg.
Thread #2
32-bit reg.
Thread #1
32-bit reg.
Thread #0
W = 8 W = 8
32-bit reg.
Thread #31
32-bit reg.
Thread #30
32-bit reg.
Thread #3
32-bit reg.
Thread #2
32-bit reg.
Thread #1
32-bit reg.
Thread #0
Inter-thread
shuffle
Intra-thread
shuffle
Intra-thread
shuffle
vmax4
vmax4
vmax4
(b) Max-reduction for MSV/SSV kernels with S = 16
Fig. 5. An example of Reordering and Max-reduction for kernels that
handle 16 sequences simutaneously.
disjunction with mask2 to reset local memory Γ (line 18).
3.4 Kernel Optimization
The kernel performance of CUDAMPF shows that H = 19
is able to cover the longest query model whose length is
Algorithm 7 SSV kernel with multi-sequence alignment
Input: emission score E, sequence data block B, height of
data block M , sequence length Len, offset of sequence
Oseq , offset of sequence length Olen and other parame-
ters, such as L, W , H , S and dbias, etc.
Output: P-values of all sequences Pi.
1: local memory Γ[H]
2: wid← blockIdx.y ∗ blockDim.y + threadIdx.y
3: gid← bthreadIdx.y/Lc . group id
4: oig ← threadIdx.x % L . offset in group
5: for k ← 0 toW − 1 do
6: T ←M [wid]
7: count← 0
8: mask1, mask2, scE ← 0x80808080
9: Iseq ← Oseq[wid] + gid ∗ L+ bk/4c
10: while count < T do
11: r ← extract res(B, Iseq, k, S)
12: γ ← inter or intra reorder(L, S, v, oig)
13: #pragma unroll H
14: for h← 0 to H − 1 do
15: σ ← load emission score(E,S, L, oig, h, r)
16: v ← vsubus4(v, σ)
17: scE ← vmaxu4(scE , v)
18: γ ← Γ[h] & mask1 ‖ mask2 . load old scores
19: Γ[h]← v . store new scores
20: end for
21: scE ← max reduction(L, S, scE)
22: mask1← 0xffffffff, mask2← 0x00000000
23: if r contains ending residue @ then . branch
24: Pi ← calculate P-value and record it.
25: mask1← set affected bits to 0x00.
26: mask2← set affected bits to 0x80.
27: scE ← update or reset special states.
28: end if
29: count, Iseq ← step up.
30: end while
31: end for
2405, for MSV and SSV algorithms [18]. In current design,
however, 2-lane kernels can only handle models with the
length of 1216 at most, given the same H . In addition, we
recall that L1 Cache-Hit-Ratio (CHR) is employed as a met-
ric to evaluate register spill in CUDAMPF, and MSV/SSV
kernels with maximum 64 registers per thread have no spill
to local memory. This enables us to push upH to hold larger
models for the multi-sequence alignment. Two optimization
schemes are proposed to improve overall performance and
address the concern about scalability.
3.4.1 CHR Sacrificed Kernel
It is well-known that L1 cache shares the same block of
on-chip memory with shared memory physically, and it is
used to cache accesses to local memory as well as register
spill. The L1 cache has low latency on data retrieval and
storage with a cache hit, which can be further utilized to
increase the throughput of kernels based on the proposed
framework. We treat L1 cache as secondary registers, and the
usage is measured by CHR for local loads and stores. By
increasing up H , more model states can reside in registers
and cached local memory. The moderate loss of performance
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 9
due to uncached register spills are acceptable, which is
attributed to highly optimized task and data parallelism in
the current framework. However, it is impossible to increase
H unboundedly due to the limited capacity of L1 cache.
Overly large H leads to the severe register spill that causes
low CHR and stalls warps significantly. Hence, there always
is a trade-off between CHR and H , and the goal is to
find a reasonable point where kernel performance (GCUPS)
begins fall off. The decline in performance indicates that
the latency, caused by excessive communications between
on and off-chip memory, starts to overwhelm the benefits
of parallelism. The corresponding H at the turning point is
considered to be the maximum one, denoted by Hmax.
TABLE 1
Benchmarks of the maximum H via innermost loop unrolling
H
reg.
per thread
stack
frame
spill
stores
spill
loads GCUPS
L1 CHR
(%)
MSV kernels
20 63 8 0 0 258.7 99.97
25 63 8 0 0 261.5 99.97
30 64 8 0 0 272.1 99.97
35 64 40 40 44 279.3 75.65
40 64 48 52 56 283.9 75.41
45 64 48 56 52 280.6 75.49
50 64 96 152 96 263.9 25.61
55 64 128 208 140 240.8 14.95
SSV kernels
20 62 8 0 0 269.5 99.96
25 62 8 0 0 314.0 99.96
30 62 8 0 0 329.0 99.96
35 61 8 0 0 347.9 99.96
40 64 16 4 4 361.5 99.94
45 64 48 44 40 375.9 80.44
50 64 56 64 44 375.9 60.30
55 64 80 116 72 340.9 17.81
*** Data collections on 32-lane kernels compiled with nvcc 8.0. Use env nr [23]
as the sequence dataset.
TABLE 1 lists a benchmark result that shows the rela-
tionship between H , kernel performance and CHR. Starting
from H = 20 with a step of 5 , intuitively, CHR is being
consumed after on-chip registers are exhausted, and the
kernel performance increases first and falls back eventually
as expected. We choose 45 and 50 as the Hmax for MSV
and SSV kernels, respectively. Larger H results in rapid
degradation of both performance and L1 CHR. Besides,
the difference of Hmax indicates that MSV kernels have
more instructions than SSV kernels within the innermost
loop, and hence more registers or local memory are used
while unrolling the loop. The Hmax is therefore algorithm-
dependent. Given Eq. (2), (3) and Hmax, we formulate the
selection of S as below:
argmax
S
f = {S | f = WSHmax,∀mˆ ≤W2Hmax : f ≥ mˆ},
(4)
where f is the function of S, and W2Hmax indicates the
maximum length of query models that 2-lane kernels can
handle. The CUDAMPF implementation will be used in-
stead if any larger model is applicable. Eq. (4) describes a
rule of kernel selection that always prefer to use kernels
with more lanes if they are able to cover the model length.
Once the kernel type (S-lane) is determined, the H of every
query model located in the coverage area can be obtained
via:
argmin
H
Φ = {H | ∀H ∈ [d mˆ
WS
e, Hmax] ∩ Z : Φ = WSH,
Φ ≥ mˆ}, (5)
where Φ represents the function of H . Eq. (5) minimizes H
to fit query models perfectly, which thereby avoid redun-
dant computation and memory allocation.
In summary, this optimization scheme aims to fully
leverage the speedy on-chip memory, including L1 cache
via sacrificing CHR proactively, to further boost kernel
throughput, and in the meanwhile, it extends coverage of
the proposed framework to larger query models.
3.4.2 Performance-oriented Kernel
Although the proposed framework achieves a significant
improvement in performance, it certainly introduces over-
head due to the implementation of multi-sequence align-
ment, compared with CUDAMPF. This downside becomes
more apparent as the model length increases (S decreases).
Therefore, it is expected that CUDAMPF with single-
sequence alignment may exceed CUDAMPF++ for large
enough query models. In order to pursue optimal per-
formance consistently, we also merge CUDAMPF imple-
mention into the proposed framework as a special case
with S = 1. The maximum model length mˆmax on which
CUDAMPF++ still outperforms is defined as the threshold
of kernel switch.
Similar to mˆmax, another threshold mˆmin can also be
employed to optimize 128 and 64-lane kernels for small
models. We recall that 128 and 64-lane kernels need extra
operations to load emission scores in Algorithm 4. Thus,
they have more overhead than other kernels within the
innermost loop, which may counteract their advantages on
the number of parallel lanes. We extend the coverage of 32-
lane kernels to handle small models owned by 128 and 64-
lane kernels previously, and the evaluation is presented in
Sec. 4.3.
4 EXPERIMENTAL RESULTS
In this section, we present several performance evalua-
tions on the proposed CUDAMPF++, such as workload
balancing, kernel throughput and scalability. The compar-
ison targets consist of CUDAMPF, CPU-based hmmsearch
of latest HMMER v3.1b2 and other related acceleration
attempts. Both CUDAMPF++ and CUDAMPF are evaluated
on a NVIDIA Tesla K40 GPU and compiled with CUDA
v8.0 compiler. Tesla K40 is built with the Kepler GK110
architecture that contains 15 SMXs (2880 CUDA cores) and
12 GB off-chip memory [24]. One of NVIDIA profiling
tools, nvprof [25], is also used to track metrics like L1/tex
CHR, register usage and spill. For hmmsearch, two types
of CPUs are employed to collect performance results: Intel
Xeon E5620 (4 physical cores with maximum 8 threads) and
dual Intel Xeon E5-2650 (16 physical cores and 32 threads
in total). All programs are executed in the 64-bit Linux
operating system.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 10
Unlike CUDAMPF implementation, the NVRTC is dep-
recated in current design due to its unstability in compiling
kernels with high usage of on-chip registers. Even with
latest compiler nvcc v8.0, runtime compilation with the
NVRTC library still generates unexpected binary files or
report the error of resources exhaustion, especially when
unrolling large loops and register spill happens. Thus, we
choose the just-in-time (JIT) compilation instead. All kernels
are pre-compiled in offline mode and stored as .ptx files,
each with an unique kernel name inside. Given different
query models, the corresponding .ptx file is loaded and
further compiled to binary code at runtime. The load time
and overhead of compilation are negligible.
Two protein sequence databsets [23] are chosen for ex-
periments: (a) env nr (1.9 GB) and (b) est human (5.6 GB). As
for query models, we still use Pfam 27.0 [26] that contains 34
thousand HMMs with different sizes ranging from 7 to 2405.
The overall performance is measured in kernel throughput
(GCUPS) which is directly calculated by the total number
of residues contained in each database, model length and
kernel execution time.
4.1 Evaluation of Workload Balancing
To avoid time overhead of data reformatting introduced in
Sec. 3.2, we incorporate Redis [27], a high performance in-
memory database, into the proposed framework. Redis is
written in ANSI C and able to work with CUDA seamlessly.
It currently works as an auxiliary component to hold warp-
based sequence data blocks and query models in memory,
which offers blazing fast speed for data retrieval. Given
the single K40 GPU, each protein sequence dataset is parti-
tioned into 61,440 blocks which are then ingested into Redis
database separately as key-value pairs. The quantity of data
blocks resided in Redis database should be integral multiple
of the number of available warps.
Table 2 summarizes the evaluation result of workload
balancing for both protein sequence datasets. The “avg.”
and “sd.” represent average value and standard deviation
across all blocks, respectively. We recall that M is the height
of data block which serves as the metrics of computational
workload for each warp, and the number of ending residues
is another impact factor of performance because the ending
residue may lead to thread idling. It is clear to see that
both sd. M and sd. ending residues are trivial, and the last
two columns show that average multiprocessor efficiency
approaches 100%, which are strong evidences of balanced
workload over all warps on GPU. Besides, the Padding-to-
Real Ratio (PRR) that compares the level of invalid compu-
tation to the level of desired computation is investigated to
assess the negative effect of padding residues, and it is also
proved to be negligible.
4.2 Performance Evaluation and Analysis
In order to demonstrate the outstanding performance of
proposed method and its correlation with the utilization
of memory resources, we make an in-depth comparison
between CUDAMPF++ and CUDAMPF via profiling both
MSV and SSV kernels, reported in Table 3 and 4, respec-
tively. A total of 27 query models are selected to investigate
the impact of H on the performance of S-lane kernels. Each
kernel type is evaluated with two models that correspond
to H = Hmax and H = dHmax−12 + 1e, except for the 2-lane
kernel since the 2405 is the largest model length in [26] with
corresponding H = 38.
For the MSV kernels, the maximum speedup listed on
Table 3 is 17.0x when mˆ = 23, and the trend of speedup
is descending as model length increases. This is because
memory resources, like on-chip registers, L1 cache and
even local memory, are significantly underutilized in CUD-
AMPF when making alignment with small models whereas
CUDAMPF++ always intends to fully take advantage of
them. Given 64 as the maximum number of register per
thread, only about half the amount of registers are occupied
in CUDAMPF till mˆ = 735, and other resources are not
utilized at all. In contrast, the CUDAMPF++ not only keeps
high usage of registers but also utilizes L1 cache and local
memory to hold more data, which results in a near constant
performance regardless of the model length. The texture
CHR is dominated by model length since we only use
texture cache for loading emission scores. Larger model
leads to lower texture CHR. Comparing the performance
of S-lane kernels in CUDAMPF++, the cases of H = 45
outperform the cases of H = 23 though more local memory
are allocated with register spill. One exception is the 128-
lane kernel due to its higher complexity of innermost loop,
which can be optimized via using the 32-lane kernel instead.
As shown in Table 4, SSV kernels have similar evaluation
results with MSV kernels but higher throughput. Starting
from mˆ = 832, nevertheless, CUDAMPF outperforms and
eventually yields upto 468.9 GCUPS which is 1.5x faster
than CUDAMPF++. The case that peak performance of two
frameworks are not comparable is due to the overhead of
extra instructions introduced for the multi-sequence align-
ment in CUDAMPF++. The kernel profiling indicates that
both MSV and SSV kernels are bounded by computation
and memory bandwidth (texture). However, unlike MSV
kernels, SSV kernels have fewer operations within the in-
nermost loop, which makes them more “sensitive”. In other
words, newly added operations (i.e., bitwise operations
for mask) within the innermost loop, compared with CU-
DAMPF, have more negative effect on SSV kernels than
MSV kernels. Therefore, an upto 50% performance gap is
observed only in SSV kernels.
4.3 Scalability Evaluation
In order to demonstrate the scalability of the proposed
framework, a total of 57 query models with different sizes
ranging from 10 to 2450 are investigated. The interval
of model length is fixed to 50. Fig. 6 and 7 show the
performance comparison between CUDAMPF++ and CUD-
AMPF for MSV and SSV kernels, respectively. The coverage
area of model length for each kernel type is highlighted.
Right subfigure depicts the performance of 128 and 64-lane
kernels while others are shown in the left one. Overall,
CUDAMPF++ achieves near constant performance and sig-
nificantly outperform CUDAMPF with small models. The
S-lane MSV (SSV) kernel yields the maximum speedup of
30x (23x) with respect to CUDAMPF. It is worth mentioning
that, in CUDAMPF++, SSV kernels have larger fluctuation
margin of performance than MSV kernels. This is caused
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 11
TABLE 2
Evaluation of workload balancing for the warp-based sequence data blocks
DB
name
DB
size (GB)
total
seq.
total
residues avg. M sd. M
avg. ending
residues
sd. ending
residues avg. PRR
SMX eff.
MSV (%)*
SMX eff.
SSV (%)*
env nr 1.9 6,549,721 1,290,247,663 21,109 85 13,645 71 1.14E-4 97.13 96.7
est human 5.6 8,704,954 4,449,477,543 72,563 174 18,135 60 3.22E-5 97.97 97.76
* Data collection with 32-lane kernels.
TABLE 3
Performance comparison of MSV kernel between the proposed CUDAMPF++ and CUDAMPF
CUDAMPF++ vs. CUDAMPF
S-lane
kernels
model
length mˆ acc. ID H
reg.
per thread
stack
frame
spill
stores
spill
loads
L1 CHR
(%)
Tex. CHR
(%) GCUPS speedup
128 23 PF13823.1 23 / 2 63 / 29 24 / 0 0 / 0 0 / 0 99.94 / unused 100 / 100 168.6 / 9.9 17.0x
128 45 PF05931.6 45 / 2 64 / 29 48 / 0 48 / 0 24 / 0 51.93 / unused 100 / 100 144.4 / 19.4 7.5x
64 46 PF09501.5 23 / 2 64 / 29 16 / 0 0 / 0 0 / 0 99.96 / unused 100 / 100 225.7 / 19.8 11.4x
64 90 PF05777.7 45 / 2 64 / 29 48 / 0 64 / 0 32 / 0 50.88 / unused 100 / 100 231.3 / 38.7 6.0x
32 92 PF00207.17 23 / 2 64 / 29 8 / 0 0 / 0 0 / 0 99.97 / unused 100 / 100 274.1 / 39.6 7.0x
32 180 PF02737.13 45 / 2 64 / 29 48 / 0 56 / 0 52 / 0 75.49 / unused 100 / 100 278.8 / 77.4 3.6x
16 184 PF00596.16 23 / 2 63 / 29 8 / 0 0 / 0 0 / 0 99.99 / unused 100 / 100 266.7 / 78.9 3.4x
16 360 PF01117.15 45 / 3 64 / 30 56 / 0 60 / 0 60 / 0 80.11 / unused 100 / 100 277.0 / 130.9 2.1x
8 368 PF05208.8 23 / 3 63 / 30 8 / 0 0 / 0 0 / 0 99.99 / unused 100 / 100 266.7 / 133.6 2.0x
8 720 PB000053 45 / 6 64 / 34 56 / 0 60 / 0 60 / 0 80.10 / unused 78.48 / 92.37 271.6 / 183.8 1.5x
4 735 PF03971.9 23 / 6 63 / 34 8 / 0 0 / 0 0 / 0 100 / unused 75.3 / 92.37 262.4 / 187.4 1.4x
4 1439 PF12252.3 45 / 12 64 / 51 56 / 0 60 / 0 60 / 0 80.08 / unused 67.09 / 72.66 271.5 / 231.6 1.2x
2 1471 PB006678 23 / 12 63 / 51 8 / 0 0 / 0 0 / 0 100 / unused 63.38 / 72.6 259.7 / 237.3 1.1x
2 2405 PB003055 38 / 19 64 / 64 40 / 0 32 / 0 44 / 0 100 / unused 60.15 / 64.06 264.0 / 271.2 -1.0x
*** The env nr [23] is used in data collection.
TABLE 4
Performance comparison of SSV kernel between the proposed CUDAMPF++ and CUDAMPF
CUDAMPF++ vs. CUDAMPF
S-lane
kernels
model
length acc. ID H
reg.
per thread
stack
frame
spill
stores
spill
loads
L1 CHR
(%)
Tex. CHR
(%) GCUPS speedup
128 26 PF02822.9 26 / 2 62 / 33 24 / 0 0 / 0 0 / 0 99.93 / unused 100 / 100 178.4 / 15.9 11.2x
128 50 PF03869.9 50 / 2 64 / 33 56 / 0 60 / 0 32 / 0 54.29 / unused 100 / 100 144.6 / 30.7 4.7x
64 52 PF02770.14 26 / 2 63 / 33 16 / 0 0 / 0 0 / 0 99.96 / unused 100 / 100 276.1 / 31.8 8.7x
64 100 PB000229 50 / 2 64 / 33 72 / 0 100 / 0 52 / 0 30.65 / unused 100 / 100 239.8 / 61.4 3.9x
32 104 PF14807.1 26 / 2 61 / 33 8 / 0 0 / 0 0 / 0 99.96 / unused 100 / 100 307.8 / 63.6 4.8x
32 200 PF13087.1 50 / 2 64 / 33 56 / 0 64 / 0 44 / 0 60.3 / unused 100 / 100 365.0 / 122.4 3.0x
16 208 PF15420.1 26 / 2 62 / 33 8 / 0 0 / 0 0 / 0 99.98 / unused 100 / 100 305.0 / 127.2 2.4x
16 400 PF13372.1 50 / 4 64 / 37 56 / 0 72 / 0 52 / 0 58.57 / unused 95.66 / 100 347.8 / 199.3 1.7x
8 416 PF06808.7 26 / 4 62 / 37 8 / 0 0 / 0 0 / 0 99.99 / unused 98.92 / 100 302.6 / 209.1 1.4x
8 800 PF02460.13 50 / 7 64 / 42 56 / 0 72 / 0 52 / 0 58.53 / unused 75.79 / 87.24 344.8 / 306.9 1.1x
4 832 PB001474 26 / 7 62 / 42 8 / 0 0 / 0 0 / 0 99.99 / unused 76.90 / 87.24 302.1 / 319.1 -1.1x
4 1600 PB000744 50 / 13 64 / 56 56 / 0 72 / 0 52 / 0 58.49 / unused 65.68 / 70.80 349.3 / 403.5 -1.2x
2 1630 PB000663 26 / 13 62 / 56 8 / 0 0 / 0 0 / 0 100 / unused 66.57 / 70.81 293.1 / 411.1 -1.4x
2 2405 PB003055 38 / 19 64 / 63 16 / 0 0 / 0 0 / 0 100 / unused 59.69 / 64.05 318.8 / 468.9 -1.5x
*** The env nr [23] is used in data collection.
by the overhead of using read-only cache (texture cache) to
load emission scores. Although both MSV and SSV kernels
have only one ldg() function inside innermost loop, the
texture cache read in SSV kernels, as one of kernel limits,
has a higher proportion of negative effect on performance
than that in MSV kernels, which results in such obvious
fluctuation. A simple evidence is that the performance curve
will be smooth and regular if replacing texture cache reads
with a constant value.
Besides, 32-lane kernels are also tested to compare with
128 and 64-lane kernels. By decreasing H , the 32-lane kernel
is able to cover smaller query models, and it outperforms
128 and 64-lane kernels until the model length is slightly
larger than 20. We simply set mˆmin = 20 for both MSV and
SSV kernels in terms of evaluation results. As for mˆmax , the
model lengths of 2450 and 1000 are selected for MSV and
SSV kernels, respectively.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 12
0 500 1000 1500 2000 2500
Model Length
50
100
150
200
250
Gi
ga
 C
el
ls 
Up
da
te
 P
er
 S
ec
on
d 
(G
CU
PS
)
CUDAMPF++
CUDAMPF
32 16 8 4 2
Coverage Area of S-lane Kernels
10 20 30 40 50 60 70 80 90
Model Length
0
50
100
150
200
250
Gi
ga
 C
el
ls 
Up
da
te
 P
er
 S
ec
on
d 
(G
CU
PS
)
32-lane kernel
CUDAMPF++
CUDAMPF
128 64
Coverage Area of S-lane Kernels
Fig. 6. Performance comparison between CUDAMPF++ and CUDAMPF for the MSV kernel.
0 500 1000 1500 2000 2500
Model Length
100
150
200
250
300
350
400
450
Gi
ga
 C
el
ls 
Up
da
te
 P
er
 S
ec
on
d 
(G
CU
PS
)
CUDAMPF++
CUDAMPF
32 16 8 4 2
Coverage Area of S-lane Kernels
20 40 60 80 100
Model Length
0
50
100
150
200
250
300
350
Gi
ga
 C
el
ls 
Up
da
te
 P
er
 S
ec
on
d 
(G
CU
PS
)
32-lane kernel
CUDAMPF++
CUDAMPF
128 64
Coverage Area of S-lane Kernels
Fig. 7. Performance comparison between CUDAMPF++ and CUDAMPF for the SSV kernel.
4.4 Performance Comparison: CUDAMPF++ vs. Others
A comprehensive performance comparison is also made
between optimized CUDAMPF++ and other implementa-
tions. Fig 8 and 9 present results of comparison between
CUDAMPF++ and CPU-based MSV/SSV stages with two
datasets. The CUDAMPF++ achieves upto 282.6 (283.9) and
465.7 (471.7) GCUPS for MSV and SSV kernels, respectively,
given the env nr (est human) dataset. Compared with the
best performance achieved by dual Xeon E5-2650 CPUs,
a maximum speedup of 168.3x (160.7x) and a minimum
speedup of 1.8x (1.7x) are observed for the MSV (SSV)
kernel of CUDAMPF++.
In the original HMMER3 paper [3], Eddy reports 12
GCUPS for MSV stage, achieved by a single CPU core.
Several acceleration efforts exist and report higher perfor-
mance: (a) an FPGA-based implementation [11] yields upto
81 GCUPS for MSV stage; (b) Lin [13] inherits and modifies
a GPU-based implementation of HMMER2 [6] to accelerate
MSV stage of HMMER3, which achieves upto 32.8 GCUPS
on a Quadro K4000 GPU; (c) [16] claims the first acceleration
work on SSV stage of latest HMMER v3.1b2 and reports
the maximum performance of 372.1 GCUPS on a GTX570
GPU. To sum up, as shown in Fig. 8 and 9, the proposed
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 13
0 500 1000 1500 2000 2500
Model Length
0
50
100
150
200
250
Gi
ga
 C
el
ls 
Up
da
te
 P
er
 S
ec
on
d 
(G
CU
PS
)
CUDAMPF++,env,K40
CUDAMPF++,human,K40
HMMER3,env,E5620(8 threads)
HMMER3,human,E5620(8 threads)
HMMER3,env,E5-2650(32 threads)
HMMER3,human,E5-2650(32 threads)
10 20 30 40 50 60 70 80 90
Model Length
0
50
100
150
200
250
Gi
ga
 C
el
ls 
Up
da
te
 P
er
 S
ec
on
d 
(G
CU
PS
)
Fig. 8. Performance comparison between CUDAMPF++ and HMMER3’s CPU-based implementation for the MSV kernel (stage).
500 1000 1500 2000 2500
Model Length
0
100
200
300
400
Gi
ga
 C
el
ls 
Up
da
te
 P
er
 S
ec
on
d 
(G
CU
PS
)
CUDAMPF++,env,K40
CUDAMPF++,human,K40
HMMER3,env,E5620(8 threads)
HMMER3,human,E5620(8 threads)
HMMER3,env,E5-2650(32 threads)
HMMER3,human,E5-2650(32 threads)
20 40 60 80 100
Model Length
0
50
100
150
200
250
300
350
400
Gi
ga
 C
el
ls 
Up
da
te
 P
er
 S
ec
on
d 
(G
CU
PS
)
Fig. 9. Performance comparison between CUDAMPF++ and HMMER3’s CPU-based implementation for the SSV kernel (stage).
framework, CUDAMPF++, exceeds all existing work and
exhibits strong consistency in performance regardless of
either the model length or the amount of protein sequences.
5 RELATED WORK
As one of the most popular tool for the analysis of ho-
mologous protein and nucleotide sequences, HMMER at-
tracts many acceleration attempts. The previous version,
HMMER2, is based on Viterbi algorithm that has proved to
be the computational bottleneck. The initial effort of GPU-
based implementation for HMMER2 is ClawHMMER [28]
which introduces a streaming version of Viterbi algorithm
for GPUs. They also demonstrate the implementation run-
ning on a 16-node GPU cluster, each equipped with a
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 14
Radeon 9800 Pro GPU. Another early GPU-based imple-
mentation is proposed by Walters et al. [6] who properly
fit the Viterbi algorithm into the CUDA-enabled GPU with
several optimizations, like memory coalescing, proper ker-
nel occupancy and shared/constant memory usage, which
outperforms the ClawHMMER substantially. Yao et al. [29]
present a CPU-GPU cooperative pattern to accelerate HM-
MER2. Ganesan et al. [7] re-design the aligment process of
a single sequence across multiple threads to partially break
the sequential dependency in computation of Viterbi scores.
This helps building a hybrid task and data-level parallelism
that eliminates the overhead due to unbalanced sequence
lengths.
However, with the heuristic pipeline, HMMER3 achieves
about 100x to 1000x speedups over its predecessor [3],
which hence renders any acceleration effort of HMMER2
obsolete. There are only few existing work that aim to accel-
erate SSV, MSV and P7Viterbi stages of hmmsearch pipeline
in HMMER3. Abbas et al. [11] re-writes mathematical formu-
las of MSV and Viterbi algorithms to expose reduction and
prefix scan computation patterns which are fitted into the
FPGA architecture. In [12], a speculative method is proposed
to reduce the number of global memory access on the GPU,
which aims to accelerate the MSV stage. Lin et al. [13],
[14] also focus on MSV stage but incorporate SIMD video
instructions provied by the CUDA-enabled GPU into their
method. Like the strategy of [6], they assign each thread to
handle a whole sequence. A CPU-based implementation of
P7Viterb stage is done by Ferreira et al. [15] who propose a
cache-oblivious parallel SIMD Viterbi algorithm that offsets
cache miss penalties of original HMMER3 work. Neto et
al. [16] accelerate the SSV stage via a set of optimizations
on the GPU, such as model tiling, outer loop unrolling,
coalesced and vectorized memory access.
6 DISCUSSION
While we have shown that the proposed framework with
hierarchical parallelism achieves impressive performance
based on Kepler architecture, we believe that more ad-
vanced GPU architectures, like Maxwell, Pascal and Volta,
could also benefit from it because of its hardware-based
design. It is easy to port the framework to run on advanced
GPUs and gain better performance given more available
hardware resources, such as on-chip registers, cache ca-
pacity, memory bandwidth and SMs. Also, the framework
naturally has linear scalability when distributing protein
sequences to multiple GPUs. To handle large models that
exceed the carrying capability of single GPU, however, one
potential solution is the model partitioning that distributes
different segments of model to different GPUs while intro-
ducing inter-device communication (i.e., max-reduction, re-
ordering). The multi-GPU implementation of the proposed
framework is being investigated.
As for the general applicability, not only is the frame-
work suitable for accelerating analogous algorithms of ge-
nomic sequence analysis, other domain-specified applica-
tions with some features may also benefit from it. The high-
light features, for example, may include data irregularity,
large-scaled working set and relatively complex logic with
execution dependency. In the contrary, for some agent-based
problems that usually investigate the behavior of millions
of individuals, such as molecular dynamics or simulation
of spatio-temporal dynamics, our framework may not be
the preferred choice. Actually, the key performance factor is
the innermost loop, corresponding to 3rd, 4th and 5th tiers
of the proposed framework, in which we should only put
necessary operations. In general, assuming that kernels are
bound by the innermost loop, there are several suggestions
related to minimizing the cost inside the innermost loop:
(a) try to hold repeatedly used values in registers to avoid
high-frequency communications between on and off-chip
memory; (b) pre-load data needed by the innermost loop
in outer loops; (c) use either L1 or texture cache to reduce
the overhead of load/store operations; (d) try to use high-
throughput arithmetic instructions. (e) use shuffle instruc-
tions rather than shared memory, if applicable.
Ultimately, this work sheds light a strategy to amplify
the horsepower of individual GPU in an architecture-aware
way while other acceleration efforts usually aim to exploit
performance scaling with muliple GPUs.
7 CONCLUSION
In this paper, we propose a five-tiered parallel framework,
CUDAMPF++, to accelerate computationally intensive tasks
of the homologous protein sequence search with profile
HMMs. This framework is based on CUDA-enabled GPUs,
and it aims to fully utilize hardware resources of the GPU
via exploiting finer-grained parallelism (multi-sequence
alignment) compared with its predecessor. In addition, we
introduce a novel idea that improves the performance and
scalability of the proposed framework by sacrificing L1
CHR proactively. As shown by experimental results, the
optimized framework outperforms all existing work, and
it exhibits good consistency in performance regardless of
the variation of query models or protein sequence datasets.
For MSV (SSV) kernels, the peak performance of the CU-
DAMPF++ is 283.9 (471.7) GCUPS on single K40 GPU,
and impressive speedups ranging from 1.8x (1.7x) to 168.3x
(160.7x) are achieved over the CPU-based implementation
(16 cores, 32 threads). Moreover, further generalization of
the proposed framework is also discussed.
ACKNOWLEDGMENTS
The authors would like to thank the NVIDIA-Professor
partnership for generous donations in carrying out this
research.
REFERENCES
[1] N. Marco S., C. Paolo, T. Andrea, and B. Daniela, “Graphics
processing units in bioinformatics, computational biology and
systems biology,” Briefings in Bioinformatics, p. 116, 2016.
[2] S. Eddy, “Profile hidden markov models,” Bioinformatics, vol. 14,
pp. 755–763, 1998.
[3] ——, “Accelerated profile HMM searches,” PLoS Comput Biol,
vol. 7, no. 10, 2011, doi:10.1371/journal.pcbi.1002195.
[4] K. Anders, B. Michael, M. I. Saira, S. Kiminen, and H. David,
“Hidden Markov models in computational biology: Applications
to protein modeling,” Journal of Molecular Biology, vol. 235, pp.
1501–1531, 1994.
[5] S. Altschul, W. Gish, W. Miller, E. Myers, and D. Lipman, “Basic
local alignment search tool,” Journal of Molecular Biology, vol. 215,
no. 3, pp. 403–410, 1990.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 15
[6] J. P. Walters, V. Balu, S. Kompalli, and V. Chaudhary, “Evaluating
the use of gpus in liver image segmentation and hmmer database
searches,” in International Symposium on Parallel & Distributed Pro-
cessing (IPDPS); Rome. IEEE, 2009, pp. 1–12.
[7] N. Ganesan, R. D. Chamberlain, J. Buhler, and M. Taufer, “Accel-
erating HMMER on GPUs by implementing hybrid data and task
parallelism,” in Proceedings of the First ACM Int. Conf. on Bioinfor-
matics and Computational Biology (ACM-BCB); Buffalo. ACM, 2010,
pp. 418–421.
[8] R. Maddimsetty, J. Buhler, R. Chamberlain, M. Franklin, and
B. Harris, “Accelerator design for protein sequence HMM search,”
in Proc. 20th ACM International Conference on Supercomputing, 2006.
[9] T. Oliver, L. Y. Yeow, and B. Schmidt, “Integrating FPGA accelera-
tion into HMMer,” Parallel Computing, vol. 34, no. 11, pp. 681–691,
2008.
[10] T. Takagi and T. Maruyama, “Accelerating HMMER search using
FPGA,” in International Conference on Field Programmable Logic and
Applications (FPL); Prague. IEEE, 2009, pp. 332–337.
[11] N. Abbas, S. Derrien, S. Rajopadye, and P. Quinton, “Accelerating
HMMER on FPGA using Parallel Prefixes and Reductions,” in
International Conference on Field-Programmable Technology (FPT) : 28-
10 Dec. 2010; Beijing. IEEE., 2010, pp. 37–44.
[12] X. Li, W. Han, G. Liu, H. An, M. Xu, W. Zhou, and Q. Li, “A
speculative HMMER search implementation on GPU,” in 26th
IPDPS Workshop and PhD Forum; Shanghai. IEEE, 2012, pp. 73–
74.
[13] L. Cheng and G. Butler, “Implementing and Accelerating HM-
MER3 Protein Sequence Search on CUDA-Enabled GPU,” Ph.D.
dissertation, Concordia University, The Department of Computer
Science and Software Engineering, 2014.
[14] ——, “Accelerating search of protein sequence databases using
CUDA-enabled GPU,” in 20th International Conference on Database
Systems for Advanced Applications (DASFAA) : April 20-23. 2015;
Hanoi. IEEE., 2015, pp. 279–298.
[15] M. Ferreira, N. Roma, and L. Russo, “Cache-Oblivious parallel
SIMD Viterbi decoding for sequence search in HMMER,” BMC
Bioinformatics, vol. 15, no. 165, 2014.
[16] A. C. de Arajo Neto and N. Moreano, “Acceleration of Single-
and Multiple-Segment Viterbi Algorithms for Biological Sequence-
Profile Comparison on GPU,” in 21st International Conference
on Parallel and Distributed Processing Techniques and Applications
(PDPTA) : July 27-30. 2015; Las Vegas. WORLDCOMP, 2015, pp.
65–71.
[17] H. Jiang and G. Narayan, “Fine-Grained Acceleration of HMMER
3.0 via Architecture-aware Optimization on Massively Parallel
Processors,” in 14th IEEE International Workshop on High Perfor-
mance Computational Biology (HiCOMB) in IPDPSW : May 25-29.
2015; Hyderabad. IEEE., 2015.
[18] H. Jiang and N. Ganesan, “CUDAMPF: a multi-tiered parallel
framework for accelerating protein sequence search in HMMER
on CUDA-enabled GPU,” BMC Bioinformatics, vol. 17, no. 1, p.
106, 2016.
[19] K. Bjarne. (2015, Mar.) The ssv filter implementation.
[20] NVIDIA, “CUDA C programming guide,” June 2017.
[Online]. Available: http://docs.nvidia.com/pdf/CUDA C
Programming Guide.pdf
[21] ——, “NVIDIAs next generation CUDA com-
pute architecture: Kepler GK110/210,” 2014,
nVIDIA Corporation Whitepaper. [Online]. Avail-
able: http://international.download.nvidia.com/pdf/kepler/
NVIDIA-Kepler-GK110-GK210-Architecture-Whitepaper.pdf
[22] ——, “Parallel thread execution ISA,” June 2017. [Online].
Available: http://docs.nvidia.com/pdf/ptx isa 5.0.pdf
[23] Fasta sequence databases. sep. 2014. [Online]. Available:
ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/
[24] NVIDIA, “NVIDIA Tesla GPU accelerators,” 2013. [On-
line]. Available: http://www.nvidia.com/content/tesla/pdf/
NVIDIA-Tesla-Kepler-Family-Datasheet.pdf
[25] ——, “Profiler user’s guide,” June 2017. [Online]. Available:
http://docs.nvidia.com/pdf/CUDA Profiler Users Guide.pdf
[26] Pfam: Protein family database. may 2013. [Online]. Available:
ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam27.0/
[27] S. Karl and N. Perry, “The little redis book,” 2012. [Online].
Available: http://openmymind.net/redis.pdf
[28] D. Horn, M. Houston, and P. Hanrahan, “ClawHMMER: A
streaming HMMer-search implementation,” in Proceedings of the
ACM/IEEE Supercomputing Conference. IEEE, 2005.
[29] Y. Ping, A. Hong, X. Mu, L. Gu, L. Xiaoqiang, W. Yaobin, and
H. Wenting, “CuHMMer: A load-balanced CPU-GPU cooperative
bioinformatics application,” in International Conference on High
Performance Computing and Simulation (HPCS) : June. 2010; Caen,
France. IEEE., 2010, pp. 24–30.
PLACE
PHOTO
HERE
Hanyu Jiang Biography text here.
PLACE
PHOTO
HERE
Narayan Ganesan Biography text here.
PLACE
PHOTO
HERE
Yu-Dong Yao Biography text here.
