Hardware acceleration of the pair HMM algorithm for DNA variant calling by Huang, Sitao
c© 2017 Sitao Huang
HARDWARE ACCELERATION OF THE PAIR HMM ALGORITHM
FOR DNA VARIANT CALLING
BY
SITAO HUANG
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2017
Urbana, Illinois
Adviser:
Professor Deming Chen
Professor Wen-Mei W. Hwu
ABSTRACT
With the advent of several accurate and sophisticated statistical algorithms
and pipelines for DNA sequence analysis, it is becoming increasingly possi-
ble to translate raw sequencing data into biologically meaningful informa-
tion for further clinical analysis and processing. However, given the large
volume of the data involved, even modestly complex algorithms would re-
quire a prohibitively long time to complete. Hence it is urgent to explore
non-conventional implementation platforms to accelerate genomics research.
In this thesis, we present a Field-Programmable Gate Array (FPGA) ac-
celerated implementation of the Pair Hidden Markov Model (Pair HMM)
forward algorithm, the performance bottleneck in the HaplotypeCaller, a
critical function in the popular Genome Analysis Toolkit (GATK) variant
calling tool. We introduce the PE ring structure which, thanks to the fine-
grained parallelism allowed by the FPGA, can be built into various configu-
rations striking a trade-off between Instruction-Level Parallelism (ILP) and
data parallelism. We investigate the resource utilization and performance of
different configurations. Our solution can achieve a speed-up of up to 487×
compared to the C++ baseline implementation on CPU and 1.56× compared
to the previous best hardware implementation.
ii
To my parents, for their love and support.
iii
ACKNOWLEDGMENTS
Foremost, I would like to express my sincere gratitude to my advisors Pro-
fessor Deming Chen and Professor Wen-Mei W. Hwu. for their continuous
support of my study and research, for their great patience, guidance, trust
and encouragement. Their immense knowledge, brilliant intuition, and en-
thusiasm make them the role model of great researchers. I could not have
imagined having better advisors for my study and research.
I am deeply indebted to my labmates and colleagues, Gowthami Jayashri
Manikandan, Anand Ramachandran, and Dr. Kyle Rupnow, for their great
help and stimulating and exiting discussions in my thesis research. I could
never have completed this thesis without their generous help. I would like to
thank all my other labmates in the ES-CAD research group and the IMPACT
research group. I learned a lot and had a great time working with them.
My special thanks go to Jiaqi Mu, for all her love and support.
Lastly, I would like to thank my mother Yali Kang and my father Jianyuan
Huang, for allowing me to realize my own potential, for their patience, en-
couragement, and support throughout my life.
iv
TABLE OF CONTENTS
LIST OF ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . vi
CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . 1
CHAPTER 2 RELATED WORK . . . . . . . . . . . . . . . . . . . . 6
CHAPTER 3 FORWARD ALGORITHM . . . . . . . . . . . . . . . . 8
CHAPTER 4 DESIGN AND IMPLEMENTATION . . . . . . . . . . 11
4.1 PE Array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
4.2 PE Ring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
4.3 Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
CHAPTER 5 EXPERIMENTAL RESULTS . . . . . . . . . . . . . . 23
5.1 Test Data and Target FPGAs . . . . . . . . . . . . . . . . . . 23
5.2 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
CHAPTER 6 CONCLUSION . . . . . . . . . . . . . . . . . . . . . . 27
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
v
LIST OF ABBREVIATIONS
ALM Adaptive Logic Modules
ASIC Application-Specific Integrated Circuit
CPU Central Processing Unit
DNA Deoxyribonucleic Acid
DP Dynamic Programming
DSP Digital Signal Processor
FPGA Field-Programmable Gate Array
GATK Genome Analysis Toolkit
GPU Graphics Processing Unit
HMM Hidden Markov Model
ILP Instruction-Level Parallelism
IP Intellectual Property
NGS Next Generation Sequencing
PE Processing Element
RTL Register-Transfer Level
SDK Software Development Kit
vi
CHAPTER 1
INTRODUCTION
Bioinformatics is a fast-growing field, with increasing demand for high com-
putational capabilities for several applications. Next Generation Sequencing
(NGS) technologies and the increasing availability of genome data through
public databases have enabled the development of comprehensive pipelines
to sequence and process complex genomes. Translation and interpretation of
the raw sequencing data to biologically meaningful information for further
clinical analysis like disease prediction, drug performance evaluation etc.,
are of utmost importance now. Several algorithms have been developed to
address problems like DNA sequence alignment [1, 2, 3, 4], error correction
[5, 6, 7] and variant calling [8].
Many bioinformatics workflows suffer from very high computational times
resulting from the explosion in data available from low-cost, high-throughput
sequencing. In response to this, several bioinformatics algorithms have been
implemented on alternative computing platforms like FPGAs and GPUs to
reduce their execution times. Such recent developments point to workflows
being executed on heterogeneous computing platforms where instructions
for the critical and computation intensive parts of an algorithm will be off-
loaded for execution on an FPGA or a GPU to achieve significant speedups
compared to a CPU-only implementation. Acceleration of complex bioinfor-
matics algorithms has become an widely explored area [9, 10, 11, 12].
Our work deals with a specific bioinformatic analysis called variant call-
ing. Variant calling identifies differences between a given subject’s DNA
and a standard reference DNA. The input data to the analysis are the stan-
dard reference DNA and the sequencing data of the individual in the form of
alignment files. Alignment files are a specific type of representation of the se-
This thesis is based on my previous publication, “Hardware Acceleration of the Pair-
HMM Algorithm for DNA Variant Calling”, in Proceedings of the 25th ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays, pp. 275-284, 2017.
1
 Figure 1.1: The major steps in the HaplotypeCaller [13].
quencing data showing the sequenced reads and the most likely areas they are
sequenced from with respect to the reference sequence. These associations
are determined by minimizing the edit-distance between a read sequence and
regions in the reference sequence [1, 2]. Edit-distance minimizations may be
considered valid because the error rates involved in the sequencing technolo-
gies of interest are small and localized, and we are primarily interested in
small variations from the reference.
GATK’s HaplotypeCaller [13, 14] is one of the most popular variant calling
tools available today. The tool does the analysis in two steps. In the first
step, it identifies locations in the genome where the chances are high that a
variation is present based on simple computations. These locations, called
active sites, are processed further to confirm the initial assessment.
To determine if an active site in the sample is a variant or not, the tool
assembles what it thinks are the probable haplotypes in the region surround-
ing the active site. The probable haplotypes in a location are the tool’s
initial guess regarding the different sequences of DNA present in that partic-
ular location. More than one sequence may be found at a given location in
the genome because, for instance, the human genomic material is made up of
pairs of chromosomes, and two chromosomes in a pair can be different locally,
while being very similar globally. Once the probable haplotypes are con-
structed, the tool assesses which of the candidate haplotypes are most likely
present in the individual’s genome by calculating the likelihood of alignment
2
of each haplotype to the read sequences in that region in the input alignment
file. If the established haplotypes are different from the reference sequence,
the tool determines that there is a variation at that location. The general
flow of the HaplotypeCaller is summarized in Figure 1.1.
GATK’s HaplotypeCaller assumes that haplotypes and read sequences fol-
low a pair hidden Markov model (Pair HMM). Pair HMM [15] is a popular
statistical model to study pairwise alignment probabilities of two sequences.
We can infer several aspects of the alignment using various inference algo-
rithms of the Pair HMM model, such as optimal sequence alignment (Viterbi
algorithm) and the overall alignment probability (forward algorithm). The
forward algorithm of the Pair HMM gives the statistical measure of the sim-
ilarity between two sequences. It computes the overall alignment probability
by summing the likelihood of all possible alignments between the two se-
quences. The forward algorithm for the Pair HMM model are used in several
applications like gene prediction, functional similarity analysis between two
protein sequences and variant calling [16, 17, 18]. Specifically, it is used by
GATK’s HaplotypeCaller to measure similarity between reads and probable
haplotypes.
Table 1.1 shows the runtime results of each stage of the HaplotypeCaller
for Chromosome 20 from sample NA12878 (Whole Genome Sequence data)
as published by Mauricio et al. of Broad Institute [19]. This shows that the
major bottleneck is the forward algorithm computations of the Pair HMM.
In the Pair HMM stage, every candidate haplotype is compared to each
input read to compute the likelihood score using the forward algorithm. The
number of computations involved is of the order N ×M ×R×H, where N
is the number of input reads, M is the number of candidate haplotypes, R
is the length of the input reads and H refers to the length of the candidate
haplotype.
In this thesis, the objective is to achieve a speedup of the Pair HMM’s
forward algorithm on an FPGA-based computing platform to minimize the
bottleneck of computing flows similar to GATK’s HaplotypeCaller that utilize
such a Pair HMM model to compute the overall alignment probability.
Pair HMM is a type of more complicated dynamic programming algorithm,
compared to many other programming algorithm. The propagation opera-
tor in Pair HMM is complicated combination of floating-point additions and
multiplications, rather than simple min or max operators as in many other
3
Table 1.1: Profiling results for HaplotypeCaller run on Chromosome 20 of
NA12878 Whole Genome Sequence (WGS) sample [19].
Stage Time Runtime
Assembly 2,598s 13%
Pair HMM (forward algorithm) 14,225s 70%
Traversal + Genotyping 3,379s 17%
dynamic programming algorithms. Further, there are three matrices that
each has data dependencies on the other two. Besides, Pair HMM requires
all the values to be at least single-precision floating-point numbers. Normal-
izing to fix point domain will lead to either overflow or underflow problem.
Performing floating-point operations efficiently is usually a hard problem for
FPGAs.
Traditionally, in FPGA-accelerated solutions to the Pair HMM, systolic ar-
rays have been commonly used. However, the systolic array structure lacks
the flexibility of handling variable input lengths and hence lacks design scal-
ability and configurability. In this work, we propose a Processing Element
(PE) ring structure to compute the forward algorithm. To the best of our
knowledge, this is the first PE ring structure based implementation of the
Pair HMM forward algorithm. The PE ring structure can be configured
for inputs of varied lengths without any changes to the hardware. In the
following chapters, we will demonstrate how our PE ring structure exhibits
significant advantage in terms of performance and flexibility.
The contributions in this thesis may be summarized as follows:
• We propose a ring-based hardware implementation of the Pair HMM’s
forward algorithm, which can support flexible lengths for input read
sequences. Our implementation can achieve significant speedups of up
to 487× compared to the C++ baseline implementation on a CPU, and
1.56× compared to the published best hardware implementation.
• Several optimization techniques for improving the PE ring’s perfor-
mance are proposed and discussed; the PE ring structure and the opti-
mization may be readily extended to accelerate other applications that
are based on similar dynamic programming algorithms.
• Based on both the above implementation and optimization techniques,
we present experimental results to illustrate the trade-offs and other
4
design considerations in using the PE ring structure to accelerate dy-
namic programming algorithms.
• Our work provides an example of how to effectively accelerate compli-
cated floating-point dynamic programming calculation with FPGA.
The rest of the thesis is organized as follows: Chapter 2 provides an
overview of the prior implementations of bioinformatics algorithms that are
similar to the Pair HMM; Chapter 3 introduces the fundamentals of the
forward algorithm of the Pair HMM; Chapter 4 presents the details of our
ring-based implementation of the forward algorithm; Chapter 5 illustrates
the experimental results and presents a comparison of the performance of
the ring-based implementation to other implementations of the Pair HMM
algorithm; and Chapter 6 concludes the thesis.
5
CHAPTER 2
RELATED WORK
There are many illustrative examples of the speed optimizations offered by
FPGA accelerators for bioinformatics algorithms. FPGA acceleration of Er-
ror Correction in NGS reads has been achieved in [12]. The work gets a
significant 35× speedup compared to the software implementation of its base
error correction algorithm presented in [5].
Dynamic Programming algorithms have been explored and implemented
on varies types of hardware platforms, such as GPU, FPGA and ASICs. For
the FPGA platform, most of the work uses systolic arrays structures, e.g.
the work in [20]. An FPGA accelerated version of the Smith-Waterman al-
gorithm that identifies the best local alignment between two DNA sequences
is presented by Isaac TS Li et al. [11]. The work achieved a 160× accelera-
tion compared to the baseline software version. In [21], ring-based structure
is proposed to accelerate the dynamic time warping algorithm, which is a
type of dynamic programming algorithm whose propagation operator is min
operator. In that problem, the data could be processed in the fixed point
number domain. Ring-based structure is also adopted to accelerate the dy-
namic programming algorithm in this work. However, Pair HMM is an even
more complicated dynamic programming algorithm. Pair HMM’s propaga-
tion operator is a complicated combination of floating-point additions and
multiplications, and there are three matrices involved at the same time. This
is the reason why Pair HMM is hard to accelerate using FPGA. There are
some other studies in literature that accelerates dynamic programming algo-
rithms with novel circuit design. In [22], the accumulated score/penalty in
the dynamic programming problem is represented by the latency of a path in
a combinational circuit, which corresponding to the dynamic programming
search path. This novel design achieve significant speedup compared to the
computation based methods. However, this latency based design could only
be applied to the those dynamic programming problems whose propagation
6
operator is a max or min operator.
An FPGA implementation of the Pair HMM stage of HaplotypeCaller on
Convey computers is reported in [19]. It achieves 13× speedup compared
to the Java implementation of the Pair HMM algorithm on a CPU. The
Convey machine contains four high-end FPGAs. In our work, we target a
single FPGA chip. However, we leverage most of the optimization techniques
feasible through the ring-based RTL-design modifications and achieve higher
performance and flexibility of the Pair HMM’s forward algorithm. We achieve
a faster and efficient implementation of the Pair HMM’s forward algorithm
that can be used in similar flows that utilize it for other applications.
A recent work from Altera [23] accelerates the Pair HMM algorithm with
Altera OpenCL SDK and FPGAs. Their work achieves significant speedup
compared to software. The hardware structure in this work is the systolic
array. Processing elements are placed in a grid structure. Grid structures
are very common in FPGA acceleration designs. However, using the grid
structure to process dynamic programming matrices introduces additional
overhead of having to store intermediate results back to memory in every
step of computation. Our PE ring implementation eliminates this overhead.
In the PE ring, the output of one PE is delivered to the neighbor PE and
consumed immediately. The whole PE ring produces at most one intermedi-
ate result that needs to be stored every cycle. Besides, the PE ring is very
amenable to trade-offs through restructuring which can reduce the number
of idle PEs when boundary conditions (starting or ending of processing a
single sequence pair) happen. This can be thought of as a trade-off between
Instruction Level Parallelism (ILP) and data-parallelism. In the case of using
a single PE ring, the execution goes over one set of dynamic-programming
matrices with identical dimensions, while when using multiple smaller rings,
we concentrate in parallel on many, possibly differently sized, sets of dynamic
programming matrices which may not be amenable to simultaneous process-
ing by a single PE ring. We will discuss more about such techniques later in
Section 4.3.
7
CHAPTER 3
FORWARD ALGORITHM
When used as a generative model, the Pair HMM may be thought of as
emitting a pair of aligned sequences X and Y. To model an alignment based
on edit-distance, as opposed to Hamming distance, the model has match,
insert and delete states. The Pair HMM allows us to draw inferences about
the alignment quality of a pair of sequences under the assumption that the
sequence pair was emitted by itself (Figure 3.1). In the HaplotypeCaller, the
overall alignment quality between a candidate haplotype and an input read
is computed using the Pair HMM.
 
 
 
 
 
 
 
Match 
M 
Delete 
D 
Insert 
I 
iia
ima
mma
mia
mda
dma
dda
Figure 3.1: Insert, delete, and match states of Pair HMM.
Table 3.1: An example of hidden-state sequence generation using Pair
HMM [16].
Sequence X: A G G T A -
Sequence Y: - - G T A A
Hidden state Sequence: I I M M M D
Figure 3.1 shows the state transitions in a Pair HMM model. The hid-
den states are represented as M, D and I. When in state M, the Pair HMM
emits symbols from both sequences, implying that the symbols may align to
8
each other. When in state I, it emits one symbol from sequence X, and a
blank symbol “-” means no symbol from sequence Y, indicating an insertion
in sequence X. Similarly, D state represents a deletion in sequence X (or an
insertion in sequence Y). The edge-weights between the states represent tran-
sition probabilities. Table 3.1 illustrates how a particular alignment between
two sequences may be represented using a Pair HMM state sequence. The
probability of the particular alignment is the product of the corresponding
state transition probabilities. To find the overall alignment probability, we
need to find the sum of the probabilities of all such alignments between the
two sequences. However, if we follow a brute-force approach to evaluate each
possible sequence alignment, it can be computationally expensive as there can
be a large number of possible alignments between two sequences. Instead,
the forward algorithm is used to efficiently compute the overall alignment
probability.
The forward algorithm is essentially a dynamic programming approach,
which uses three matrices: fM , f I and fD. The i and j correspond to the
position indices in the sequences X and Y. The fk(i, j) is the forward vari-
able that represents the combined probability of all alignments up to posi-
tions (i, j) that end in state k. The forward algorithm may be summarized as:
Initialization: 
fM(0, 0) = 1
fX(0, 0) = fD(0, 0) = 0
fM(i, 0) = f I(i, 0) = 0
fM(0, j) = fD(0, j) = 0
(3.1)
Recursion:
fM(i, j) = Prior · (ammfM(i− 1, j − 1)
+ aimf
I(i− 1, j − 1) + admfD(i− 1, j − 1))
f I(i, j) = amif
M(i− 1, j) + aiif I(i− 1, j)
fD(i, j) = amdf
M(i, j − 1) + addfD(i, j − 1)
(3.2)
Termination:
Result = fM(Nh, Nr) + f
I(Nh, Nr) + f
D(Nh, Nr) (3.3)
where Nh and Nr are the lengths of the haplotype (sequence X) and input
9
read (sequence Y) respectively.
Many quantities in the forward algorithm recursion need elaboration. These
quantities are evaluated from additional information available in the input
dataset pertaining to the quality of a read sequence and various character-
istics related to its alignment at each position in the sequence. Specifically,
four different quality score values are used in GATK’s formulation of the
Pair HMM, upon which we base our implementation. Three of them assign
penalties to gaps in the alignment, namely the insertion gap open penalty (or
base insertion quality Qi), the deletion gap open penalty (or base deletion
quality Qd) and the gap continuation penalty (Qg). In addition, there is also
data that indicates the level of confidence in the correctness of each symbol
in each read sequence, which we represent as Qbase. In the forward algorithm
recursion presented here, aij represents a transition probability from state i
to state j. For example, amm represents a transition from the match state to
the match state. The Prior value represents the probability of emitting an
aligned pair of symbols. The emission and transition probabilities are com-
puted in the Pair HMM implementation for the HaplotypeCaller as follows
[24]:
Prior =
{
1−Qbase; if the bases match
Qbase; if the bases do not match
(3.4)
amm = 1− (Qi + Qd) − match continuation
aim = 1−Qg − insertion to match
adm = 1−Qg − deletion to match
ami = Qi − match to insertion
aii = Qg − insertion continuation
amd = Qd − match to deletion
add = Qg − deletion continuation
(3.5)
This set of prior probabilities and transition probabilities need to be com-
puted for each input read, as they differ for each position in the read.
10
CHAPTER 4
DESIGN AND IMPLEMENTATION
The flow for Pair HMM computations in GATK’s HaplotypeCaller is as fol-
lows: The Pair HMM input datasets are parsed to get the read and haplotype
bases, read base qualities and other gap penalty scores. The prior and tran-
sition probability matrices for each read-haplotype pair are computed using
these quality scores. The input matrices are fed to the ring-based forward
algorithm computation block on the FPGA to get the overall alignment like-
lihood. To parse and pre-process the read-sequences and the haplotypes,
and obtain the various quality scores in the right form, we use the reference
software implementation of the Pair HMM stage of the HaplotypeCaller [19].
The original order of computations of the forward algorithm’s Dynamic Pro-
gramming (DP) matrix is row-wise (or column-wise). These computations
cannot be parallelized as they are due to the data dependencies between the
successive computations according to Equation 3.2. Modifying the array ac-
cess patterns will help us leverage the parallel nature of the computations and
design a Processing Element (PE) ring structure to maximize performance.
Figure 4.1 shows the data dependencies within the forward algorithm matrix
and the computing order in our implementation. PEs are placed along the
diagonal and all diagonal values are computed in parallel and stored in in-
ternal registers. These register values are used to calculate the values of the
next diagonal of the matrix. The forward algorithm involves computations
for three such DP matrices fM , fD and f I .
The forward algorithm essentially requires a series of arithmetic opera-
tions on probability values. Software implementations can comfortably work
with floating-point numbers to represent data and execute computations.
For FPGA-based acceleration, it is common practice to normalize all the
floating-point numbers to be within a range of fixed point numbers, and use
the FPGA to process those fixed point numbers. However, based on our ex-
periments, the values that turn up in the forward algorithm’s computations
11
           
           
           
           
 
Haplotype 
 T       C         G       A        G      G       T        C        T        A       C 
In
p
u
t R
ead
 B
ases 
T
      G
        A
       C
 
Untouched elements Elements being computed Completed elements 
… 
… 
Independent elements Dependency example 
Figure 4.1: Diagonal matrix access pattern: computations along the
diagonal can be parallelized as there are no dependencies among them.
cover a large range of values, tens of orders of magnitude larger than what
fixed point numbers can represent. Working with the fixed point domain will
lead to overflow or underflow issues for this algorithm, or a very large loss of
precision. Thus, our design needs to use floating-point numbers and compu-
tations to obtain the requisite accuracy and correctness of computation.
Generally, the hardware design for a dynamic programming algorithm con-
sists of a series of PEs, similar to what was described in Figure 4.1. In our
design, a fixed number of processing elements are arranged in a ring-based
structure. Ring-based organization has been proposed and used to accelerate
other dynamic programming algorithms [21]. However, given that the for-
ward algorithm requires two different types of critical operations (both addi-
tion and multiplication) and given the fact that we need to use floating-point
operations for accurate execution of the algorithm, the forward algorithm
poses a new set of challenges to tackle. According to Equation 3.2, there are
seven multiply operations and four add operations for Pair HMM calculation
in each PE. The critical path consists of two adders and two multipliers.
All the operations are implemented using Altera’s floating-point IPs, and
are fully pipelined. However this adds many cycles of latency; floating-point
multiplication requires five cycles while floating-point addition requires seven
cycles (at an operating frequency of 200 MHz). Thus, the overall latency of
the critical path is 24 cycles (on Stratix V). Without careful optimizations,
the initiation interval of each PE will be too large to provide useful acceler-
ation. We will discuss the techniques we use to improve the performance of
these arithmetic operations in Section 4.3.
12
4.1 PE Array
Figure 4.2 shows a simplified organization of the PE array, the core compo-
nent of the PE ring structure. The figure also illustrates how various PEs in
the design process the DP matrix elements. Our design consists of n basic
PEs, where n is the optimal number of read bases to be processed. As shown
in Figure 4.2, during any step of processing, the processing elements (PEs)
are “placed” (processing) along a diagonal of the matrix. One may notice
that there are no dependencies along such a diagonal (Figure 4.1), and all
the PEs can compute at the same time, given the results of computation of
the previous diagonal. After each computation, the ring moves along the
horizontal direction of the matrix to place itself on the next diagonal par-
allel to the current one. Though there are no dependencies among the PEs
within a diagonal, there are dependencies among the neighboring PEs across
consecutive diagonals. There are several data buses between the neighboring
PEs to share intermediate results to satisfy these dependencies.
         
         
         
         
         
         
 
Haplotype Bases 
R
ead
 B
ases 
 
Untouched elements Elements being computed Completed elements 
PE0 
PE1 
PE2 
… 
PEn-2 
PEn-1 
Figure 4.2: PE array processing the corresponding matrix elements.
Let v(PEi) be the temporary calculation result produced by the i-th PE in
the current clock cycle, v′(PEi) be the result produced by the i-th PE at the
last clock cycle, and v′′(PEi) be the result produced by the i-th PE at the
clock cycle before the last clock cycle. Then, based on the data dependency
13
 v 
v' 
 
v'' 
…… 
PE0 PE1 PE2 PEn-1 
v v v 
v' 
  
v' 
  
v' 
  
v'' v'' v'' 
Figure 4.3: The internal shift registers and their data dependencies among
PEs.
relationships among the matrix elements of the forward algorithm, we get
the following formula.
v(PEi) = f(v
′(PEi), v′(PEi−1), v′′(PEi−1)) (4.1)
where f is a function that summarizes the forward algorithm recursion (see
Equation 3.2). In our design, v(PEi), v
′(PEi), and v′′(PEi) are stored in
three sets of registers. The value of v′(PEi) and v′′(PEi) can be stored for
computations through shift registers. Figure 4.3 shows the internal registers
in the PEs and the data dependencies among those PEs.
Moreover, Equation 4.1 is a general equation that can be used to describe
many dependency patterns. Thus, a similar hardware structure can be used
to accelerate other dynamic programming algorithms with dependencies that
look like Equation 4.1.
4.2 PE Ring
The PE array structure can be extended to a PE ring-based organization
by connecting the first PE and the last PE with an internal buffer. The
ring-based structure provides extra flexibility and scalability for the design.
Figure 4.4 demonstrates how we use the PE ring structure to calculate the
values of the forward algorithm matrix.
With the PE ring structure, the first n rows of the forward-algorithm
14
matrix are calculated first, where n is the number of instantiated PEs (dif-
ferent from the input read length and the haplotype length). After the first
PE reaches the last element in the first row, it can continue to process the
(n + 1)th row in the matrix. In the PE ring structure, the first PE, PE0,
and the last PE, PEn−1, are connected to an internal data buffer. This in-
ternal buffer is used to store the temporary results produced by the last PE,
PEn−1, so that the first PE can use this data when it processes the new row
in the matrix. This way, the design can handle computations for different
sizes of the dynamic programming matrix irrespective of the number of PEs
instantiated in the design.
In our PE ring implementation, only one of the PEs, the PE that computes
the first element of the first row of the forward algorithm matrix, (labeled the
“first PE”) takes in the haplotype bases and quality scores, and the inputs to
all the other PEs come from their neighboring PEs. This reduces the amount
of data transfer among a large number of PEs, as well as the data storage.
PE ring has several advantages over the other organizations, e.g. the sys-
tolic array and the PE array. First, with a ring-based structure, we are able
to process matrices with more rows than the number of instantiated PEs.
This cannot be done with a PE array. The second advantage of the PE ring
is that there are at most two intermediate results produced per PE which
are consumed within two cycles by itself or by neighboring PEs (see Equa-
tion 4.1). This reduces considerable data transfer and data storage overhead
in some other organizations, e.g. systolic arrays. In the case that an m× n
two-dimensional systolic array is used to compute the dynamic programming
algorithm, after each step of calculation, n or m of intermediate data items
are produced and need to be stored in the local on-chip buffer or external
off-chip memory until the entire matrix computation is complete.
4.3 Optimizations
We adopt several techniques to further improve the hardware described in
Section 4.1 and Section 4.2.
15
          
          
          
          
          
          
          
          
 
Haplotype Bases 
R
ead
 B
ases 
Internal Buffer 
(Connects with PE1) 
(Connects with PE2) 
PE2 
… 
PEn-2 
PEn-1 
PE0 
PE1 
Untouched elements Elements being computed Completed elements 
Figure 4.4: PE ring processing the corresponding matrix elements.
4.3.1 Shorten critical paths in arithmetic operations
Figure 4.5 shows the arithmetic operations inside each PE. The naive imple-
mentation of Figure 4.5 has 24 cycles of latency (based on Altera’s Stratix V
floating-point IPs), which is a big drawback from the point of view of overall
latency. If we analyze the critical path in Figure 4.5 carefully, we will find
that the operands of the first adder and first two multipliers in the critical
path (f Ii−1,j−1, f
D
i−1,j−1, and f
M
i−1,j−1) have been ready two rounds earlier than
the current round. This means that we would be able to start the computa-
tion of the critical path earlier and redistribute the operations among PEs.
Based on this idea, we move the first adder and the first two multipliers to
the previous PE, and start computation as soon as the operands are ready.
After adopting this scheme, the arithmetic operations within each PE are
shown in Figure 4.6. Intermediate computation results tai,j and t
b
i,j are sent
to the next PE in the PE ring. Note that, compared to Figure 4.5, the
operands’ indices of first adder and first two multipliers (fM , f I , and fD)
have changed, because they are used to compute ta and tb in advance.
With this optimization, the critical path of arithmetic operations is short-
ened to have only one adder and one multiplier with 12 cycles of latency (on
16
+× 
+
× 
× 
I
jif 1,1 
D
jif 1,1 
dma mma
M
jif 1,1 
prior
M
jif ,
× 
+ 
I
jif ,
× 
M
jif ,1
I
jif ,1mia iia
× 
+ 
D
jif ,
× 
M
jif 1, 
D
jif 1, mda dda
 
Figure 4.5: Arithmetic operations inside PEs.
+
× 
+
× × 
I
jif 1, 
D
jif 1, 
dma mma
M
jif 1,  prior
M
jif ,
× 
+ 
I
jif ,
× 
M
jif ,1
I
jif ,1mia iia
× 
+ 
D
jif ,
× 
M
jif 1, 
D
jif 1, mda dda
a
jit ,
b
jit ,
a
jit ,1
b
jit ,1
 
Figure 4.6: Optimized arithmetic operations inside PEs.
Stratix V), while the overall number of adders and multipliers remains the
same.
4.3.2 Pipelining and resource sharing
Note that all the adders and multipliers in the PEs are fully pipelined and the
critical path has the latency of 12 cycles (on Stratix V). During this 12-cycle
period, we could initiate the computation of other matrices corresponding
to another read-haplotype pair. Since the computations of DP matrices are
independent from each other, the PEs could compute 12 matrices at the same
time.
Figure 4.7 shows an example of computing multiple matrices at the same
time with full pipelining. In the different pipeline stages of the arithmetic
operators, operands from different matrices are used to do the computation.
17
Matrix10[i][j].add.cycle3 Matrix11[i][j].add.cycle3
Matrix9[i][j].add.cycle4
Matrix8[i][j].add.cycle5
Matrix10[i][j].add.cycle4
Matrix9[i][j].add.cycle5
Matrix5[i][j].multiply.cycle1
Matrix4[i][j].multiply.cycle2
Matrix3[i][j].multiply.cycle3
Matrix7[i][j].add.cycle6
Matrix6[i][j].add.cycle7
Matrix6[i][j].multiply.cycle1
Matrix5[i][j].multiply.cycle2
Matrix4[i][j].multiply.cycle3
Matrix8[i][j].add.cycle6
Matrix7[i][j].add.cycle7+
× 
prior
M
jif ,
a
jit ,1− b jit ,1−
Add and multiply with 
12 cycle latency
Clock cycle k Clock cycle k+1
Matrix2[i][j].multiply.cycle4
Matrix1[i][j].multiply.cycle5
Matrix3[i][j].multiply.cycle4
Matrix2[i][j].multiply.cycle5
Matrix12[i][j].add.cycle1
Matrix11[i][j].add.cycle2
Matrix13[i][j].add.cycle1
Matrix12[i][j].add.cycle2
 
Figure 4.7: Example of pipelining inside ith PE.
4.3.3 Tuning PE ring size and number of PE rings
Another property that can be used to tune the design is the size of the PE
rings, i.e., the number of PEs in a single PE ring. If the number of PEs in
the PE ring is small, we can place a larger number of PE rings inside the
FPGA chip.
Having smaller, but many PE rings could have some potential benefits.
First, with multiple PE rings, multiple matrices can be processed at the
same time. It is to be noted that this parallelism is different from that
described in Section 4.3.2. The parallelism utilized with multiple PE rings is
more coarse-grained. Second, a smaller PE ring will potentially have fewer
idle PEs during the first and the last few cycles of processing a matrix.
Note that there are idle PEs for a small number of cycles at the beginning of
matrix computations or at the last few rounds (when the number of remaining
rows is smaller than the number of PEs in the ring). This also means that
there are no idle PEs when processing the middle rows of the DP matrix
since the PEs that finish the computation of a row will move to uncomputed
rows below.
Figure 4.8 illustrates the idle PEs when the PE ring is processing the DP
18
Working PEs Idle PEs
(a)
(b) Internal Buffer
Internal Buffer
Figure 4.8: Idle PEs at the beginning of processing and depth of internal
buffer when using (a) long PE ring; (b) shorter PE ring.
matrix during the first few cycles. The parallel diagonal lines represent the
positions of PE ring at consecutive cycles. The PE ring starts from the top-
right corner, and moves to the left. Comparing (a) and (b) in Figure 4.8,
we can see that if the PE ring is shorter, there will be fewer idle PEs during
processing. Considering the fact that while using shorter PE rings, we can
place more PE rings on the FPGA; we can also compare the idle #PE ×
#cycle product for the whole FPGA design. Figure 4.9 shows one intuitive
way to do the comparison. In the figure, the area of triangles represents the
idle #PE ×#cycle product. Assuming that the maximum total number of
PEs that can be placed into an FPGA chip is a constant, i.e. if #PE×#ring
is a constant, then we can see that the idle #PE×#cycle product of a single
longer PE ring is a single big triangle, while the sum of idle #PE ×#cycle
product of the shorter PE rings are several smaller triangles. Comparing the
areas of two set of triangles, we could see that the configuration of shorter
PE rings has fewer idle PEs in general.
We could also count the number of idle PEs directly. Let M be the total
number of PEs that can be placed in the FPGA, i.e. the length of a PE ring
when we try to deploy a single large PE ring on the FPGA. Let M = kN ,
19
where N is the length of smaller PE rings, and there are k PE rings. Then,
for the single PE ring case, the number of idle PEs are
M−1∑
i=1
i =
1
2
M(M − 1) (4.2)
For the smaller PE ring case, the number of idle PEs are
k
N−1∑
i=1
i =
k
2
N(N − 1) = 1
2
M(N − 1) (4.3)
Thus a design with multiple smaller PE rings only has N−1
M−1 of the idle PEs
of the design with a single long PE ring.
(a) (b)
PE0
PE1
PE2
Figure 4.9: Comparing #idle PEs × #idle cycles when using (a) one single
longer PE ring; (b) multiple shorter PE rings.
Then, consider idle PEs while processing the last few rows of the DP
matrix. Figure 4.10 illustrates this case. This case of idle PEs happens when
the number of matrix’s rows could not be divided exactly by the number of
PEs. This is common because the number of matrix’s rows, which is the read
sequence’s length could be an arbitrary positive integer. If we assume the
distribution of read sequence’s length is uniform, then any number of idle
PEs could happen with the same probability, giving a mean value of half of
the PE ring length. This means the number of idle PEs is proportional to the
length of PE rings. If the total number of PEs is a constant, then the total
numbers of idle PEs for different configurations are similar. Based on this
analysis, using shorter PE rings would not reduce the number of idle PEs
20
during the computation of the final rows of the DP matrix. Figure 4.10(b)
shows only the processing steps of one of the n short PE rings.
Working PEs Idle PEs
(a)
(b)
 n 
 
Figure 4.10: Idle PEs at the last stage of processing when using (a) one
single long PE ring; (b) n short PE rings (showing only one of n PE rings
in the figure).
Even though using shorter PE rings could reduce the number of idle PEs
at the beginning of processing, using shorter PE rings could also have some
disadvantages. While using a shorter PE ring, the internal buffer needs to
be larger. This is because there will be more intermediate results produced
before the next iteration on columns starts. That will significantly increases
memory block utilization. This point is also illustrated in Figure 4.8. Besides,
too many small PE rings will lead to potential memory port contention,
because each PE ring will need to fetch data in every clock cycle.
21
4.3.4 Floating-point operator implementation
All the floating-point addition and multiplication operators are built with
Altera’s floating-point IPs. Those arithmetic IPs could be instantiated and
mapped to either DSP blocks or logic elements in FPGA. Both mappings give
operators with the exact same functionality. However, the amount of avail-
able logic resource and DSP resource could be different on a target FPGA,
thus the ratio of IPs mapped to logic elements and that mapped to the
DSPs should be carefully tuned to achieve maximum resource utilization in
the FPGA chip. In our implementation, we calculate the maximum number
of operators that can be implemented using logic elements and with DSPs.
Based on this calculation, we figure out the best ratio of the number of
IP instances mapped to logic and that mapped to DSP. In our final design
for Stratix V FPGA, we utilize 83% of the logic elements and 75% of DSP
blocks. Generally, when considering the mapping of IPs, given a fixed target
frequency, the latency in terms of cycles could be less important than re-
source utilization. This is because when all the operators are fully pipelined,
we may be able to almost completely hide the effects of latency by feeding in
new data to occupy the different pipeline stages of the operators, as shown
in Figure 4.7.
This type of tuning of the IP implementation becomes even more important
if there are multiple mapping choices, multiple latency options and multiple
frequency choices. In our case, there are not many options, thus manual
tuning works well enough. If the number of options grows, this could become
an interesting design space exploration problem.
22
CHAPTER 5
EXPERIMENTAL RESULTS
5.1 Test Data and Target FPGAs
The test data comes from a Whole Genome Sequence (WGS) dataset avail-
able at [24] that represents Pair HMM inputs generated by HaplotypeCaller
from GATK version 2.7. The benchmark consists of individual datasets each
having different haplotype sizes and input read lengths. Each dataset con-
tains test cases consisting of read and haplotype pairs of lengths varying from
10 to 302 bases.
To fully analyze the performance and resource utilization of our imple-
mentation, we synthesize our design targeting two FPGAs, Altera’s Stratix
V FPGA (5SGXEA7N2F45C2), which is the FPGA on Terasic’s DE5-Net
experiment board, and the new Arria 10 FPGA (10AX115H1F34E1SG). Al-
tera’s Stratix V is built with 28 nm process technology, and it is one of
Altera’s high-end FPGAs. Altera’s new Arria 10 FPGA uses 20 nm process
technology, and it has more logic and DSP resources. Besides, the Arria
10 FPGA has hard floating-point elements, which makes floating-point com-
putations more efficient. We collect the performance data of our design by
running simulations of our design.
5.2 Performance
5.2.1 Compared with other implementations
We compare our performance data with that of other representative im-
plementations [19], including CPU, GPU, multi-cores, and FPGAs. The
comparison shows that our implementation is faster than other reported im-
23
plementations.
Table 5.1: Performance comparison across various implementations.
Platform Runtime(ms) Speedup
Java on CPU 10800 1×
C++ Baseline 1267 9×
Intel Xeon AVX Single Core 138 78×
NVidia K40 GPU 70 154×
Intel Xeon 24 Cores 15 720×
Altera OpenCL (Stratix V) 8.3 1301×
PE Ring (Stratix V) 5.3 2038×
Altera OpenCL (Arria 10) 2.8 3857×
PE Ring (Arria 10) 2.6 4154×
Table 5.1 compares the performance of our implementation versus a few
others. The performance data for CPU and GPU platforms are reported in
[19]. The dataset used is the “10s” dataset available along with the original
Java implementation in GATK. We present our results with this dataset
because it allows us to make comparisons to other implementations. We also
run the experiments using larger datasets, and significant speedup is also
achieved for larger datasets. The execution time on other datasets is listed
in Figure 5.1.
Our performance data in the table is based on 8 PEs/ring × 8 rings con-
figuration on the Stratix V FPGA, and 8 PEs/ring × 16 rings configuration
on the Arria 10 FPGA. Our performance numbers are based on the over-
all FPGA frequency of 200 MHz. Our FPGA synthesis targets have the
same number of logic elements and DSP blocks as that of Altera’s OpenCL
implementation [23].
Prior to our work, the state-of-the-art implementation has been Altera’s
OpenCL implementation of the Pair HMM on FPGA. These performance re-
sults come from Altera’s whitepaper [23]. As shown in the table, on Stratix
V, our PE ring design could achieve 1.56× further speedup compared with
Altera’s implementation. On Arria 10, our design is 7.7% faster than Altera’s
implementation. The speedup from our Arria 10 implementation is smaller
because our implementation contains 128 PEs while Altera’s implementation
has 208 PEs. In comparison to other platforms, our proposed PE ring de-
sign for Arria 10 could achieve 5.77× speedup over Intel Xeon 24 core AVX
implementation, 26.92× speedup over K40 GPU implementation, and more
24
than 4000× speedup over the original Java implementation.
5.2.2 Impact of PE ring size
Based on our synthesis results, Altera Stratix V FPGA can accommodate
a single PE ring consisting of 64 PEs or multiple rings of shorter lengths.
For example, eight PE rings each consisting of eight PEs can be put into the
Stratix V FPGA. We synthesize designs with various lengths and numbers
of PE rings, and run simulations to get performance data.
0.06ms
5.26ms 35.79ms
0.07ms
5.75ms 39.16ms
0.14ms
6.93ms 47.51ms
0.28ms 9.99ms 69.42ms
0
0.2
0.4
0.6
0.8
1
1.2
tiny 10s 1m
N
o
rm
al
iz
ed
 E
xe
cu
ti
o
n
 T
im
e
8x8
16x4
32x2
64x1
Matrix Size:           10~41      10~30210~263
#Pair-HMM Runs:   332      293073550
Dataset: 
Figure 5.1: Normalized execution time on three datasets when using
different sizes of PE rings (Stratix V). The “m× n” stands for the
configuration of m PEs/ring × n rings.
Figure 5.1 shows the normalized execution time on three datasets when
the PE ring sizes are varied keeping the total number of PEs constant. The
figure legend “m× n” stands for the configuration of m PEs/ring × n rings.
As discussed in Section 4.3.3, using multiple smaller PE rings could reduce
the total number of idle PEs during the processing. Figure 5.1 supports this
observation.
Note that we do not further reduce the size of the PE rings below eight
PEs/ring. There are two reasons. First, too many small PE rings will lead to
potential memory port contention, because each PE ring needs to read input
25
data every clock cycle. Second, multiple smaller PE rings require more and
larger internal buffers, which significantly increases memory block utilization.
5.2.3 Implementation on Stratix V and Arria 10
To explore performance trends with more PEs and hard floating-point IP
blocks, we synthesize our design targeting Altera’s Arria 10 FPGA as well.
Arria 10 FPGA has hard floating-point elements that shorten the latency
of floating-point operations. On Arria 10, the latencies of single-precision
floating-point addition and multiplication are both only three cycles when
using hard floating-point elements. Besides, the number of logic resources
available on the Arria 10 FPGA is more than what is available on Stratix V.
The Arria 10 synthesis target has 427,200 Adaptive Logic Modules (ALM),
while the Stratix V target has 234,720 ALMs. Also, Arria 10 has 1,518
DSP blocks, while Stratix V has 256 DSP blocks. Arria 10 FPGA is able
to accommodate 128 PEs, and the total latency of arithmetic operations in
each PE is only six cycles, while the Stratix V FPGA can accommodate 64
PEs, and the total latency of arithmetic operations in each PE is 12 cycles.
Table 5.2: Synthesis results for both target FPGAs.
FPGA #PEs Fmax Logic DSP
Stratix V 64 200.16 MHz 83% 75%
Arria 10 128 230.73 MHz 4% 93%
Table 5.2 shows the maximum number of PEs, maximum frequency, and re-
source utilization for the two implementations. We observed that our design
on Arria 10 is able to achieve a higher frequency. However, the implementa-
tion on Arria 10 utilizes almost all the DSP resources while utilizing a very
small percentage of logic elements. This is because the synthesis tool (Altera
Quartus Prime) maps all the floating-point IPs for Arria 10 to DSP blocks.
If those floating-point operators could be mapped to the logic elements, the
Arria 10 FPGA will be able to fit in much more PEs, and thus gain further
speedup. In the Stratix V case, the resource utilization is more balanced, as
a fraction of the operations are also mapped to the logic elements.
26
CHAPTER 6
CONCLUSION
Pair HMM forward algorithm computation is a major bottleneck in several
DNA sequence analysis flows. Essentially, the Pair HMM’s forward algo-
rithm is a complicated floating-point dynamic programming algorithm with
a high computational complexity. The forward algorithm involves the com-
putation of three matrices while respecting data dependencies among the
matrix elements, and it involves a series of floating-point add and multiply
operations. In this thesis, we proposed an efficient and flexible ring-based
hardware implementation of the Pair HMM forward algorithm, as well as
several optimization techniques to further boost the performance of the PE
ring structure. Our ring-based design achieved a significant speedup of up
to 487× compared to the C++ baseline implementation on CPU, and up
to 1.56× further speedup compared to the published best hardware imple-
mentation. In our design, the ring structure exhibited its unique advantages
of flexibility allowing trade-offs between coarse and fine-grained parallelism,
and reduced data transfers between the hardware kernel and memory compo-
nents. We also analyzed at depth, the details of how dynamic programming
calculations implemented on hardware could benefit from varying the PE ring
size. The proposed design could be configured as multiple shorter PE rings,
which has fewer idle PEs during computation. This configuration could be
adjusted accordingly based on the resources available on the specific FPGA.
27
REFERENCES
[1] T. F. Smith and M. S. Waterman, “Identification of common molecular
subsequences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195–
197, 1981.
[2] S. B. Needleman and C. D. Wunsch, “A general method applicable to
the search for similarities in the amino acid sequence of two proteins,”
Journal of Molecular Biology, vol. 48, no. 3, pp. 443–453, 1970.
[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman,
“Basic local alignment search tool,” Journal of Molecular Biology, vol.
215, no. 3, pp. 403–410, 1990.
[4] H. Li, J. Ruan, and R. Durbin, “Mapping short DNA sequencing reads
and calling variants using mapping quality scores,” Genome Research,
vol. 18, no. 11, pp. 1851–1858, 2008.
[5] Y. Heo, X.-L. Wu, D. Chen, J. Ma, and W.-M. Hwu, “BLESS: Bloom
filter-based error correction solution for high-throughput sequencing
reads,” Bioinformatics, vol. 30, no. 10, p. 1354, 2014.
[6] W.-C. Kao, A. H. Chan, and Y. S. Song, “Echo: A reference-free short-
read error correction algorithm,” Genome Research, vol. 21, no. 7, pp.
1181–1192, 2011.
[7] L. Ilie, F. Fazayeli, and S. Ilie, “Hitec: Accurate error correction in
high-throughput sequencing data,” Bioinformatics, vol. 27, no. 3, pp.
295–302, 2011.
[8] Q. Wang, P. Jia, F. Li, H. Chen, H. Ji, D. Hucks, K. B. Dahlman,
W. Pao, and Z. Zhao, “Detecting somatic point mutations in cancer
genome sequencing data: A comparison of mutation callers,” Genome
Medicine, vol. 5, no. 10, p. 1, 2013.
[9] M. C. Schatz, C. Trapnell, A. L. Delcher, and A. Varshney, “High-
throughput sequence alignment using graphics processing units,” BMC
Bioinformatics, vol. 8, no. 1, p. 474, 2007.
28
[10] C.-M. Liu, T. Wong, E. Wu, R. Luo, S.-M. Yiu, Y. Li, B. Wang, C. Yu,
X. Chu, K. Zhao et al., “SOAP3: Ultra-fast GPU-based parallel align-
ment tool for short reads,” Bioinformatics, vol. 28, no. 6, pp. 878–879,
2012.
[11] I. T. Li, W. Shum, and K. Truong, “160-fold acceleration of the Smith-
Waterman algorithm using a field programmable gate array (FPGA),”
BMC Bioinformatics, vol. 8, no. 1, p. 1, 2007.
[12] A. Ramachandran, Y. Heo, W.-m. Hwu, J. Ma, and D. Chen, “FPGA
accelerated DNA error correction,” in Proceedings of the 2015 Design,
Automation & Test in Europe Conference & Exhibition. EDA Consor-
tium, 2015, pp. 1371–1376.
[13] Broad Institute, “Haplotypecaller overview,” https://www.broad-
institute.org/gatk/guide/article?id=4148.
[14] A. McKenna, M. Hanna, E. Banks, A. Sivachenko, K. Cibulskis,
A. Kernytsky, K. Garimella, D. Altshuler, S. Gabriel, M. Daly et al.,
“The genome analysis toolkit: A MapReduce framework for analyz-
ing next-generation DNA sequencing data,” Genome Research, vol. 20,
no. 9, pp. 1297–1303, 2010.
[15] R. Durbin, S. R. Eddy, A. Krogh, and G. Mitchison, Biological Se-
quence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Cambridge University Press, 1998.
[16] B.-J. Yoon, “Hidden Markov models and their applications in biological
sequence analysis,” Current Genomics, vol. 10, no. 6, pp. 402–415, 2009.
[17] C. B. Do, M. S. Mahabhashyam, M. Brudno, and S. Batzoglou, “Prob-
cons: Probabilistic consistency-based multiple sequence alignment,”
Genome Research, vol. 15, no. 2, pp. 330–340, 2005.
[18] W. H. Majoros, M. Pertea, and S. L. Salzberg, “Efficient implementa-
tion of a generalized pair hidden Markov model for comparative gene
finding,” Bioinformatics, vol. 21, no. 9, pp. 1782–1788, 2005.
[19] Broad Institute, “Accelerating variant calling,” 2013.
[20] S. O. Settle, “High-performance dynamic programming on FPGAs with
OpenCL,” in Proceedings IEEE High Performance Extreme Computing
Conference, 2013, pp. 1–6.
[21] Z. Wang, S. Huang, L. Wang, H. Li, Y. Wang, and H. Yang, “Accel-
erating subsequence similarity search based on dynamic time warping
distance with FPGA,” in Proceedings of the ACM/SIGDA International
Symposium on Field Programmable Gate Arrays. ACM, 2013, pp. 53–
62.
29
[22] A. Madhavan, T. Sherwood, and D. Strukov, “Race logic: A hardware
acceleration for dynamic programming algorithms,” in Proceeding of
the 41st Annual International Symposium on Computer Architecuture.
IEEE, 2014, pp. 517–528.
[23] Altera, “Accelerating genomics research with OpenCL and FPGAs,”
2016.
[24] “Pair HMM test data,” https://github.com/MauricioCarneiro/PairHMM/
tree/master/test data.
30
