ASAP: Accelerated Short-Read Alignment on Programmable Hardware by Banerjee, Subho S. et al.
ASAP: Accelerated Short-Read Alignment on
Programmable Hardware
Subho S. Banerjee, Mohamed El-Hadedy, Jong Bin Lim,
Zbigniew T. Kalbarczyk, Deming Chen, Steven S. Lumetta, and Ravishankar K. Iyer
Abstract—The proliferation of high-throughput sequencing machines ensures rapid generation of up to billions of short nucleotide fragments in a short
period of time. This massive amount of sequence data can quickly overwhelm today’s storage and compute infrastructure. This paper explores the use of
hardware acceleration to significantly improve the runtime of short-read alignment, a crucial step in preprocessing sequenced genomes. We focus on the
Levenshtein distance (edit-distance) computation kernel and propose the ASAP accelerator, which utilizes the intrinsic delay of circuits for edit-distance
computation elements as a proxy for computation. Our design is implemented on an Xilinx Virtex 7 FPGA in an IBM POWER8 system that uses the CAPI
interface for cache coherence across the CPU and FPGA. Our design is 200× faster than the equivalent C implementation of the kernel running on the host
processor and 2× faster for an end-to-end alignment tool for 120–150 base-pair short-read sequences. Further the design represents a 3760× improvement
over the CPU in performance/Watt terms.
Index Terms—Bioinformatics, Genomics, Levenshtein Distance, Application-Specific Processor, Hardware Accelerator.
1 INTRODUCTION
THE advent of high-throughput next-generation sequencing (NGS) technology has created a deluge of genomic
data for computational analysis [1]. Processing this data
efficiently requires a new generation of high-performance
computing systems. These application-specific
and accelerator-rich computing systems are expected to
deliver performance, power, and energy improvements over
traditional systems [2].
A crucial step in a significant number of NGS data
analytics applications (e.g., variant discovery, genome-wide
association studies, and phylogeny creation) is the mapping
of short fragments of sequenced genetic material (called
reads) to their most likely points of origin in the genome,
popularly called the short-read alignment problem. This pa-
per presents the design and implementation of ASAP, an
accelerator for computing Levenshtein distance [3], [4] (LD;
used interchangeably with edit-distance) in the context of
the short-read alignment problem. LD is a measure of the
similarity between strings, which is computed by counting
the number of single-character edits required to change
one string into the other. LD computation is a prominent
underlying mathematical kernel that is common to a large
number of short-read alignment algorithms and tools (e.g.,
BLAST [5], Bowtie [6], [7], BWA [8], and SNAP [9]), and is
responsible for 50% – 70% of their runtime [10].
ASAP represents a novel approach to accelerate the LD
computation, in that it uses algorithmic approximations, and
maps these approximations into hardware to significantly
improve overall performance (∼ 200× compared to the CPU
baseline). The core algorithm in ASAP leverages two key
observations about the computation and datasets involved
in the short-read alignment problem:
• S. S. Banerjee, M. el-Hadedy, J. B. Lim, Z. T. Kalbarczyk, D. Chen, S. Lumetta and R.
K. Iyer are with the Coordinated Science Laboratory, and the Departments of Computer
Science and Electrical and Computer Engineering all at the University of Illinois at
Urbana-Champaign, Urbana, IL, 61801.
1) Although all the tools mentioned above calculate the exact
value of LD between pairs of nucleotide strings, they use
them only to build a total ordering (i.e., an ordered list)
of the most likely points of origin in the genome. The
best alignment is the pair of strings corresponding to the
minimum LD in the ordered list. Hence, it is sufficient to
only calculate the total ordering (in this instance, returning
the pair that corresponds to the minimum LD), and not
essential to compute the exact value of the LD. This
distinction enables approximation in the computation
of LD to gain performance, while preserving the overall
accuracy of the alignment algorithm (which comes from
the total ordering).
2) Modern sequencing platforms (like the Illumina HiSeq
2500) represent a very low sequencing error regime
(≤ 1%) [11], [12], and modern alignment tools (mentioned
above) have accurate candidate region-matching algo-
rithms (described in Section 2). Hence, LD computations
process significantly more “matches” than “mismatches,”
in the majority of sequencing experiments.1 The ASAP ar-
chitecture uses this heuristic to accelerate LD computation
(described in Sections 3.1 and 3.2).
To take advantage of these observations, ASAP augments
RaceLogic [13]2 using application heuristics, as well as
hardware architectural optimizations to realize the design on
FPGAs. In particular, this paper proposes (a) a mechanism
to encode LD computation parameters (e.g., gap-penalties;
described further in Section 2) into the ASAP architecture,
making it possible to map the time taken to process a “match”
exactly as a circuit delay. This mapping gives us the ability to
tune the performance of ASAP to match data characteristics;
and (b) the use of “zero delay” circuit elements to explore
large portions of the search space (LDs of substrings of
1. This is a facet of the accurate sequencing process and the thoroughly validated
reference genome for human subjects. This observation will also apply to most
model organisms whose genome has been extensively studied.
2. RaceLogic uses propagation delay of circuit elements to perform computa-
tions.
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
arXiv:1803.02657v2 [cs.DC] 23 May 2018
the strings being compared) in parallel within one clock
cycle, and to ignore parts of the search space that do not
contribute to an answer, thereby saving energy. Overall,
ASAP can compute alignments quickly (∼ 200× faster than
the CPU baseline and ∼ 50× faster than an equivalent
RaceLogic design), and with the same accuracy as traditional
software- or hardware-based alignment tools. We leverage
reconfigurable FPGA devices to prototype ASAP, thereby
allowing us to reconfigure the accelerator based on user
decisions on input parameters (described in Section 2), as
well as to adapt the accelerator to input NGS datasets of
varying read lengths.
Contributions. To summarize, the primary contributions
of this paper are as follows:
1) Presents a measurement-driven study that demonstrates
that computation of LD represents a significant portion of
the runtime of several short-read alignment programs.
2) Builds on top of the delay-based computation paradigm
presented in [13] to encode gap-penalties as “zero delay”
circuit elements. This allows us to approximate
the LD between strings by using combinational circuit
elements. We prove the correctness of this encoding and
demonstrate that the result of the approximation can be
used as a proxy for computing LD in short-read aligners.
That is, a tool using the approximation and the accelerator
produces alignments identical to those of tools based on
traditional methods (e.g., BWA-MEM [8]).
3) Presents an FPGA-based implementation of the accel-
erated LD computation in the ASAP accelerator that
leverages the coherent accelerator-processor interface
(CAPI) [14], [15] for communication between the host
and accelerator.
4) Demonstrates that ASAP on an FPGA is able to accelerate
the runtime of the LD computation by 200× compared
to a CPU-based execution, and by ∼ 5× over the best
xFPGA result, while consuming less energy.
5) Demonstrates that integration of the ASAP accelerator
into a short-read alignment framework like SNAP can
accelerate it by nearly 2× (which is close to the Amdahl’s
law limit for the accelerator).
Other Applications. Our approach can be adapted to
a variety of other problems in which a total ordering of
LDs is computed. For example, in signal processing, where
similarity between signals is computed [3]; in text retrieval,
where misspelled words have to be accounted for in a
dictionary [16]; and in computer-security where virus- and
intrusion-detection requires comparison of signatures [17].
Organization. The remainder of this paper is organized as
follows. Section 2 describes the recursion-based formulation
of the LD computation and its use in popular short-read
alignment tools. Section 3 1) briefly describes a mathematical
formalism for encoding computation in circuit delays; 2) uses
that formalism to define the approximation algorithm at the
core of ASAP and proves its correctness; and 3) presents the
hardware architecture of ASAP that leverages this approximation
algorithm. Section 4 presents the evaluation of the accelerator.
Section 5 compares the ASAP approach to other CPU-, GPU-,
and FPGA-based approaches for computing LD, and, finally,
we conclude in Section 6.
2 LEVENSHTEIN DISTANCE COMPUTATION AND
SHORT-READ ALIGNMENT
Traditional methods for aligning reads to a reference
genome find the position (locus) of a single read in the
reference by minimizing the maximum edit distance between
the short read being aligned (called the query, and denoted
by Q) and the reference genome sequence. The Smith-
Waterman algorithm (SW) [18] and Needleman-Wunsch
algorithm (NW) [19] utilize a dynamic programming-based
algorithm to calculate the alignment score (Levenshtein
distance) between the read and a particular section R of
the reference genome, accounting for base pair substitutions,
insertions, and deletions. Both of these algorithms work by
constructing a matrix S (which is used interchangeably with
lattice) of size lQ × lR, where lQ and lR are the lengths of
the two strings, between which the edit distance must be
calculated. Consider a matrix S in which the (i, j)th entry,
S(i, j), is the minimum edit distance between the substrings
Q[1 : j] and R[1 : i]. S(i, j) is recursively defined as
\[
S(i, j) = \min
\begin{cases}
S(i-1, j) + \Delta(-, R_j), \\
S(i, j-1) + \Delta(Q_i, -), \\
S(i-1, j-1) + \Delta(Q_i, R_j)
\end{cases}
\tag{1}
\]
where ∆ corresponds to input parameters called gap penalties.
These ∆-parameters assign scores for insertion, deletion,
match,3 or mismatch between the sequences such that a more
desirable outcome has a smaller score associated with it. The
parameters ∆(Qi, Rj), ∆(−, Rj) and ∆(Qi,−) correspond
to the match/mismatch, deletion, and insertion penalties
respectively. These parameters are chosen to optimize the
accuracy of alignments based on prior information about the
sequences being compared (e.g., evolutionary information
about mutations in a population [20], [21], [22]). This paper
describes the use of constant gap penalties (i.e., a fixed score
is assigned to every gap between nucleotides). That is,
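\[
\left.
\begin{aligned}
\Delta(Q_i, R_j) &= \Delta(\text{Match}) && \text{if } Q_i = R_j \\
\Delta(Q_i, R_j) &= \Delta(\text{Mismatch}) && \text{if } Q_i \neq R_j \\
\Delta(-, R_j) &= \Delta(\text{Delete}) \\
\Delta(Q_i, -) &= \Delta(\text{Insert})
\end{aligned}
\right\} \quad \forall R_i, Q_j.
\tag{2}
\]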
Such gap penalties are commonly used in DNA alignment
(e.g., in NCBI-BLASTN, or WU-BLASTN [20]).
The NW algorithm computes a global alignment in which
the entirety of the query is matched to the reference, as shown
in Fig. 1. It does so by computing the value of S(m,n). The
SW algorithm computes a local alignment and matches the
largest (substring) of the query to the reference, and, hence,
needs to calculate the minimum value in the row S(m,−).
For example, when the strings AGCACACA and ACACAACT
are compared with constant penalties ∆(Match) = 0,
∆(Mismatch) = 2, and ∆(Insert) = ∆(Delete) = 1, we
get the matrix described in Fig. 1. The optimal alignment is
then calculated from this matrix by finding the minimum
weighted path (in S) from (m,n) to (0, 0) in the NW
algorithm and (m,N ) to (0, 0) in the SW algorithm. N
corresponds to the largest substring of the reference to which
the query string maps with the lowest LD.
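The worked example above can be reproduced with a short script. This is a minimal sketch rather than the implementation used by any aligner: the function name `ld_matrix` and its argument names are illustrative, rows are assumed to follow the query and columns the reference (as in Fig. 1), and the vertical move is assumed to carry the deletion penalty and the horizontal move the insertion penalty.

```python
def ld_matrix(query, ref, match=0, mismatch=2, insert=1, delete=1):
    """Fill the matrix S of recurrence (1); rows follow the query, columns the reference."""
    rows, cols = len(query) + 1, len(ref) + 1
    S = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        S[i][0] = S[i - 1][0] + delete      # first column: vertical moves only
    for j in range(1, cols):
        S[0][j] = S[0][j - 1] + insert      # first row: horizontal moves only
    for i in range(1, rows):
        for j in range(1, cols):
            diag = match if query[i - 1] == ref[j - 1] else mismatch
            S[i][j] = min(S[i - 1][j] + delete,
                          S[i][j - 1] + insert,
                          S[i - 1][j - 1] + diag)
    return S


S = ld_matrix("AGCACACA", "ACACAACT")
nw_score = S[-1][-1]     # global score: bottom-right corner, S(m, n)
sw_score = min(S[-1])    # local score as described above: minimum over the last row
print(nw_score, sw_score)  # 4 2
```

The two printed values agree with the cells S(8, 8) and S(8, 6) referenced in the caption of Fig. 1.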
Although these methods are guaranteed to produce the
optimal alignment, they are prohibitively expensive for
3. Gap penalties traditionally do not have match scores. We group them together
for simplicity in our notation.
Table 1
Mathematical formulation of different aligners to fit them into the structure of Algorithm 1.
Function | BWA-MEM [8] | SNAP [9]
------------------- | --------------------------------------------- | ------------------------------
Build_Index | Burrows-Wheeler transform [23] of prefix trie | Ukkonen’s algorithm [24]
Candidate_Locations | Prefix trie traversal | Hash table lookup
Edit_Distance | Smith-Waterman algorithm [18] | Landau-Vishkin algorithm [25]
Find_Config | Smith-Waterman algorithm [18] | Landau-Vishkin algorithm [25]
         A  C  A  C  A  A  C  T
      0  1  2  3  4  5  6  7  8
   A  1  0  1  2  3  4  5  6  7
   G  2  1  2  3  4  5  6  7  8
   C  3  2  1  2  3  4  5  6  7
   A  4  3  2  1  2  3  4  5  6
   C  5  4  3  2  1  2  3  4  5
   A  6  5  4  3  2  1  2  3  4
   C  7  6  5  4  3  2  3  2  3
   A  8  7  6  5  4  3  2  3  4
[Figure 1 panels: the matrix S; output for the Needleman-Wunsch algorithm; output for the Smith-Waterman algorithm.]
Figure 1. The matrix S for the strings AGCACACA and ACACAACT, assuming
∆(Match) = 0, ∆(Mismatch) = 2, and ∆(Insert) = ∆(Delete) = 1. The
colored paths from S(8, 8) and S(8, 6) to S(0, 0) show the optimal alignments
produced by the NW and SW algorithms, respectively.
whole-genome alignments because of O(lQ × lR) space
and time complexity. Therefore, a large number of align-
ment tools are designed to heuristically reduce the search
space required to find the optimal match of a query in
the reference. An extensive amount of research, e.g., [5],
[6], [7], [8], [9], has been conducted, focusing on indexing
strategies for the reference genome to rapidly reduce the
number of candidate locations that have to be searched.
Most of these tools use some variant of a backwards search
algorithm utilizing an FM-index [26] or a hash-table-like data
structure. As a result of this reduction in the search space,
linear-time heuristic algorithms like the Landau-Vishkin
algorithm (LV) [25] (in addition to traditional algorithms
like SW and NW) can be applied to the sequence alignment
problem in SNAP [9], to compute edit distance accurately
up to a particular number of mismatches (assuming that
correct alignments have lower numbers of mismatches).
Algorithm 1 describes the skeleton of these heuristic accel-
erated algorithms for single-ended read alignment [27]. The
definitions of the Build_Index, Candidate_Locations,
Edit_Distance, and Find_Config functions define dif-
ferent variants of these algorithms. For example, Table 1
defines the BWA-MEM and SNAP alignment tools by substi-
tuting these placeholder functions with specific algorithms.
We performed a profiling study of the SNAP aligner on
an in silico (from an Illumina HiSeq 2500) whole human
genome [28] with 50× coverage (i.e., each nucleotide of the
reference is backed by an average of 50 reads that align to
that base) on the Blue Waters [29] supercomputer. We chose
the SNAP aligner in particular because it is significantly
faster than other alignment tools like BWA and Bowtie.
Also, as the LV algorithm used in SNAP has a linear time
complexity, its comparison to ASAP as the CPU baseline is
much more challenging. Table 2 describes the distribution
of runtime for the SNAP aligner across the corresponding
Algorithm 1: Algorithmic skeleton for single-ended
short-read-alignment algorithms.
Data: NGS Read Dataset, Reference Genome
Result: Aligned positions and mapping of reads in
Reference Genome
1  ngsdata ← Set of reads;
2  reference ← String(s) corresponding to a reference;
3  index ← Build_Index(reference);
4  alignment ← ∅;
5  for read ∈ ngsdata do
6      locs ← Candidate_Locations(read, index);
7      opt ← argmin_{loc ∈ locs}(Edit_Distance(read, loc));
8      config ← Find_Config(read, opt);
9      alignment ← alignment ∪ config;
10 end
11 return alignment;
steps of Algorithm 1.4 These measurements, along with static
analysis of Algorithm 1, show the following:
1) The LD computation corresponds to nearly 60% of the
running time of the SNAP aligner.
2) The LD computation is one of the most frequently called
algorithmic kernels in the alignment process (on average
called 54.1 times per read).
3) The LD kernel is used to build a total ordering of all
candidate locations for a read in the reference; refer to
Line 7 of Algorithm 1.
4) The backtrack-based alignment [18], [19] is computed only
for the best-matched location in the reference.
5) The remaining portion of SNAP’s runtime (after the LD
computation) is spent in either memory or IO bound com-
putation (e.g., hash table look-ups and reading/writing
files). This part is unsuitable for acceleration on PCIe-
based devices because of the time-cost associated with
performing data transfer over the bus.
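The skeleton in Algorithm 1 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the structure of SNAP or BWA-MEM: the k-mer hash index, the seeding on the first k-mer of the read, the unit-cost `edit_distance`, and all function names are hypothetical stand-ins for the placeholder functions, and `Find_Config` (the backtrace) is omitted.

```python
from collections import defaultdict


def build_index(reference, k=4):
    """Toy index: map every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def candidate_locations(read, index, k=4):
    """Seed with the first k-mer of the read; returns candidate start positions."""
    return index.get(read[:k], [])


def edit_distance(a, b):
    """Unit-cost Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def align(reads, reference, k=4):
    index = build_index(reference, k)
    alignment = {}
    for read in reads:
        locs = candidate_locations(read, index, k)
        if not locs:
            alignment[read] = None          # no seed hit: read stays unaligned
            continue
        # Line 7 of Algorithm 1: only the ordering of the distances is used.
        alignment[read] = min(
            locs, key=lambda p: edit_distance(read, reference[p:p + len(read)]))
    return alignment


reference = "GGACGTTTGGACGTACCC"
result = align(["ACGTACC", "ACGTTAG"], reference)
print(result)  # {'ACGTACC': 10, 'ACGTTAG': 2}
```

Both reads share the seed ACGT, which occurs at positions 2 and 10 of the toy reference; the arg min over the edit distances is what separates the two candidates, which is the role the LD kernel plays in Line 7.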
3 DESIGN OF THE ASAP ACCELERATOR
This section describes the approximation algorithm that
drives the design of ASAP, provides a proof for its correct-
ness, and describes its implementation in programmable
hardware. Section 3.1 briefly summarizes the RaceLogic
paper [13], describing and formalizing the encoding of the
computation of LD scores into circuit propagation delay.
4. Note that some steps of the SNAP aligner implementation include a variety
of other miscellaneous tasks, e.g., memory allocation, IO. These are collectively
described in the “Misc” category. Also note, the SNAP aligner is optimized
to perform asynchronous pre-fetch based disk IO. Hence wait time for IO is
minimized.
Table 2
Distribution of runtime across the steps of Algorithm 1 for the SNAP tool aligning
an in silico human genome with 50× coverage.
Lines in Algorithm 1 | % of runtime | # of calls
-------------------- | ------------ | -----------
Line 5 | 6.79 | 1.5 × 10^10
Line 6 | 18.59 | 6 × 10^10
Line 7 | 59.22 | 8.3 × 10^11
Line 8 | 9.25 | 1 × 10^10
Misc | 6.15 | –
[Figure 2 panels: addition of propagation delays as a proxy for addition (two elements with propagation delays D1 and D2 in series give a net delay of D1 + D2); OR gate as a proxy for choosing the signal with minimum delay (the output picks the signal that arrives first).]
Figure 2. Computing with propagation delays: Delay-based proxy for the addition
operator is a series connection, and the proxy for the min operator is the OR gate.
Section 3.2 describes the approximation at the heart of
ASAP: using the ability to directly tune the performance
of the algorithm to input-data characteristics (i.e., using
circuit propagation delays to encode both the algorithm and its
computation time), we show a method to choose appropriate
propagation delays to compute approximate answers for
LD while maintaining their total ordering (i.e., satisfying the
application invariant for correctness). Finally, Sections 3.3
and 3.4 describe the ASAP FPGA implementation.
3.1 Encoding LD Computation in Circuit Propagation
Delays
The core idea is to map addition and minimization, the
two mathematical operators necessary for the recursive
computation defined in (1), to particular topologies of circuit
elements. Fig. 2 illustrates the mapping explained here:
1) If circuit elements are combined in series, the net propaga-
tion delay of a signal is the sum of the propagation delays
for all of the individual elements. This construction is a
proxy for addition.
2) If two circuit elements are connected to an OR gate, the
signal that emerges out of the OR gate corresponds to
the signal that arrived first at the gate. This construction
is a proxy for the minimization operator (in particular,
the rising edge of the OR gate’s output computes a
minimization in time).
For example, Fig. 3 demonstrates the computation of
“min(X + 2, X + 3)” using the aforementioned delay-based
computing. In the example, X corresponds to an arbitrary
input signal that is represented in the delay encoding; the
2- and 3-length shift registers serve as the delay elements
implementing the · + 2 and · + 3 operators, respectively; the
OR gate serves as the minimization operator; and the counter
serves as the decoder.
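The example can be mimicked in software by treating each signal as a waveform sampled once per clock tick. The following is a sketch under stated assumptions, not a hardware description: the names `shift_register`, `first_rising_edge`, and `race_min_plus` are illustrative, and the simulation window `horizon` must be long enough for the earliest edge to appear.

```python
def shift_register(signal, length):
    """Delay a 0/1 waveform (one sample per clock tick) by `length` ticks."""
    return [0] * length + signal[:len(signal) - length]


def first_rising_edge(signal):
    """Decoder: count the clock ticks until the waveform first goes high."""
    return signal.index(1)   # raises ValueError if no edge falls inside the window


def race_min_plus(x, delays, horizon=32):
    """Compute min(x + d for d in delays) by racing delayed copies of a step signal."""
    # Encoding: X is a step that rises at tick x and stays high.
    step = [1 if t >= x else 0 for t in range(horizon)]
    # Series connection with a shift register is the proxy for addition.
    branches = [shift_register(step, d) for d in delays]
    # OR gate: the output goes high as soon as any branch is high (proxy for min).
    or_out = [int(any(bits)) for bits in zip(*branches)]
    return first_rising_edge(or_out)


print(race_min_plus(5, [2, 3]))  # 7
```

For X = 5 the two branches rise at ticks 7 and 8, so the OR output rises at tick 7, which is min(X + 2, X + 3).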
We formalize this delay-based computation succinctly in
the following lemma.
[Figure 3 panels: encoding (the values 2 and 3 are represented as shift registers (SR) of length 2 and 3 in units of the clock width; X is represented as a binary signal), computation (the OR gate acts as the min operator), and decoding (a counter with enable and disable inputs measures intervals of time to produce min(X+2, X+3) in decoded form).]
Figure 3. Example of the encoding, computation, and decoding phase for computing
“min(X + 2, X + 3)” using the circuit-delay proposed in RaceLogic [13]. Note that
we present this example using shift-registers for delay elements as opposed to
comparators proposed in [13].
[Figure 4: a lattice of delay elements D(i, j), each containing D_M, D_I, and D_D blocks, with the two input strings along its edges, an input signal, and two counters (with enable and disable inputs) that produce the output of NW and the output of SW.]
Figure 4. High-level design of the ASAP accelerator to compute the minimum edit
distance between two strings. The accelerator lattice is of size lQ × lR, where lQ
and lR are the sizes of the query and reference, respectively.
Lemma 1. Propagation-delay-based computation can occur
on a tropical semiring structure T over {0} ∪ Z+ (i.e., time
measured in clock ticks) that defines a binary addition
operation, a minimization operator (using an OR gate), and
a maximization operator (using an AND gate).
The delay-based proxies for the addition and minimization
operators can be used by replacing the LD values S(i, j) in (1)
with the equivalent propagation delays. The resulting circuit
represents the application of the addition and minimization
operators in the computation of S(i, j). Fig. 4 shows the
structure of the circuit that produces this computation. It is
composed of a lattice of lQ × lR delay elements (DEs). The
connections in the lattice build on the recursive definition
of S: each DE D(i, j)’s inputs are connected to the outputs
of the preceding elements D(i − 1, j − 1), D(i − 1, j), and
D(i, j − 1), and its outputs are connected to the input of
D(i+ 1, j + 1), D(i+ 1, j), and D(i, j + 1). At a high level,
each DE is composed of three delay blocks: 1) DM (delay due
to match or mismatch at (i, j)), 2) DI (delay due to insertion
at (i, j)), and 3) DD (delay due to deletion at (i, j)). This
design is specialized for FPGAs in Section 3.3.
The computation can be started by injecting a high signal
(logic value 1) at the inputs of index D(0, 0) in the array. The
time-encoded value of the LD is then found by measuring
the propagation delay of the signal exiting the array of
delay elements. Note that the delay-based computation can
be applied to all variants (SW, NW, and LV) of the LD
computation as follows.
1) The delay-based version of the SW variant can be com-
puted by measuring the delay between the introduction
of the input signal in the lattice, and its emergence at any
of the delay elements on the last row, i.e., (lR,−)th DE.
Fig. 4 illustrates this configuration.
2) The delay-based version of the NW variant can be com-
puted by measuring the delay between the introduction
of the input signal in the lattice, and its emergence at the
(lR, lQ)th DE. This configuration is also shown in Fig. 4.
3) The delay-based version of the LV variant can be com-
puted by assigning the maximum permissible LD as the
result of the computation. This represents the “timeout”
with which the signal wavefront will emerge from the DE
lattice. If the timeout is triggered, the maximum value
of LD, as set by the user, is used as the result of the
computation. One delay element and one AND gate (not
shown in Fig. 4) suffice to implement the timeout.
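The three variants above can be illustrated with an event-driven software model of the lattice. This is a sketch and not the hardware design: in hardware all DEs operate in parallel, whereas here a priority queue replays signal edges in time order. The names `race_lattice` and `lv_result`, and the choice to model the timeout by stopping the race, are assumptions made for illustration.

```python
import heapq


def race_lattice(query, ref, d_match=0, d_mismatch=2, d_insert=1, d_delete=1,
                 timeout=None):
    """Return the first-arrival tick of the wavefront at every DE (None if never reached)."""
    rows, cols = len(query) + 1, len(ref) + 1
    arrival = [[None] * cols for _ in range(rows)]
    heap = [(0, 0, 0)]                       # (tick, i, j): signal injected at D(0, 0)
    while heap:
        t, i, j = heapq.heappop(heap)
        if arrival[i][j] is not None:        # OR gate: only the earliest edge counts
            continue
        if timeout is not None and t > timeout:
            break                            # LV-style timeout: stop the race
        arrival[i][j] = t
        if i + 1 < rows:
            heapq.heappush(heap, (t + d_delete, i + 1, j))
        if j + 1 < cols:
            heapq.heappush(heap, (t + d_insert, i, j + 1))
        if i + 1 < rows and j + 1 < cols:
            d = d_match if query[i] == ref[j] else d_mismatch
            heapq.heappush(heap, (t + d, i + 1, j + 1))
    return arrival


def lv_result(arrival, max_ld):
    """LV variant: report max_ld if the wavefront did not exit before the timeout."""
    t = arrival[-1][-1]
    return max_ld if t is None else min(t, max_ld)


arr = race_lattice("AGCACACA", "ACACAACT")
nw = arr[-1][-1]                                   # NW: emergence at the corner DE
sw = min(t for t in arr[-1] if t is not None)      # SW: earliest emergence on the last row
lv = lv_result(race_lattice("AGCACACA", "ACACAACT", timeout=3), max_ld=3)
print(nw, sw, lv)  # 4 2 3
```

With a timeout of 3 ticks the wavefront has not yet reached the corner DE, so the LV-style result saturates at the user-set maximum.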
[Figure 5 panels: snapshots of the lattice at T=0, T=1, T=2, T=3, and T=4 for the strings AGCACACA and ACACAACT; the edit distance from SW is available at T=2 and the edit distance from NW at T=4; the remaining cells are unexplored search space.]
Figure 5. An example of the ASAP accelerator processing the same inputs used in Fig. 1. The signal wavefront is shown progressing through the ASAP lattice until the
outputs of the SW and NW algorithms are produced in 2 and 4 clock cycles, respectively. The values in the matrix represent the clock cycles in which the corresponding
DEs were enabled.
3.2 Approximating LD Computations in ASAP
A key aspect of the aforementioned method is the mapping
of gap-penalty parameters (∆-parameters) to their corre-
sponding circuit delays. The ASAP accelerator uses this
mapping both to encode the approximation (mentioned in
Section 1), and to reduce the time taken to do the “match”-
based computation. Both actions are formally stated below.
Definition 1. A Delay Encoding Function E : R → T is a
mapping between the set of real numbers and its propagation-
delay-based representation. E is constrained to obey the
Cauchy functional equation (E(x+ y) = E(x) + E(y)).
More general delay encoding functions can be considered,
for example in analog circuits, where circuit elements do
not exhibit linear behavior for all inputs. We constrain
ourselves to those that satisfy the Cauchy functional equation
(CFE) because doing so simplifies the proof of correctness of the
transformation. Although the domain of E can be the set of
real numbers R, the ASAP implementation presented in this
paper uses integer or rational gap penalties, which can be
easily mapped to integer delay values (which can further be
represented as multiples of the clock width).
Definition 2. A δ-parameter is the time-encoded representa-
tion of a user-inputted ∆-parameter. That is
δ(Insert) = E(∆(Insert))
δ(Delete) = E(∆(Delete))
δ(Match) = E(∆(Match))
δ(Mismatch) = E(∆(Mismatch)) (3)
These parameters are used to define the delays in the
DM , DI , and DD blocks. Note that we have assumed that
∆(Match) = 0, and thus δ(Match) = E(0) is also 0 based
on Definition 1.
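With the δ-parameters in place, the recurrence in (1) can be restated in the time domain. The following is a sketch in notation that is assumed here rather than taken from the text: T(i, j) denotes the clock tick at which the wavefront first leaves D(i, j), and δ(Q_i, R_j) is shorthand for the delay of the D_M block.

```latex
\begin{aligned}
T(0,0) &= 0,\\
T(i,j) &= \min\bigl\{\, T(i-1,j) + \delta(\text{Delete}),\;
                      T(i,j-1) + \delta(\text{Insert}),\;
                      T(i-1,j-1) + \delta(Q_i,R_j) \,\bigr\},\\
\delta(Q_i,R_j) &=
\begin{cases}
\delta(\text{Match}) & \text{if } Q_i = R_j,\\
\delta(\text{Mismatch}) & \text{otherwise.}
\end{cases}
\end{aligned}
```

Each sum corresponds to a series connection through the D_D, D_I, or D_M block, and the outer min corresponds to the OR gate that merges the three paths.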
Based on definitions 1 and 2, we now show that any
encoding of δ-parameters based on E produces the same
ordering of LDs as the original algorithm.
Lemma 2. When a query string Q and a reference string R
are compared under the traditional (see (1)) and delay-based
algorithm for computing LD at loci l1, . . . , ln of the reference,
to produce LDs e1, . . . , en and propagation delays d1, . . . , dn,
respectively, then di = E(ei), and consequently
ei ≤ ej ⇐⇒ E(ei) ≤ E(ej) ⇐⇒ di ≤ dj ∀i, j.
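Lemma 2 can be checked numerically for a linear encoding. The sketch below assumes E(x) = m·x with a positive integer m, so that the delay-based result is modeled by running the same recurrence with every penalty scaled by m; the candidate strings are hypothetical stand-ins for the reference contents at loci l1, . . . , ln, and the helper `ld` is illustrative.

```python
def ld(query, ref, pen):
    """Global LD (bottom-right cell of S) for pen = (match, mismatch, insert, delete)."""
    match, mismatch, insert, delete = pen
    prev = [j * insert for j in range(len(ref) + 1)]
    for i, q in enumerate(query, 1):
        cur = [i * delete]
        for j, r in enumerate(ref, 1):
            diag = match if q == r else mismatch
            cur.append(min(prev[j] + delete, cur[j - 1] + insert, prev[j - 1] + diag))
        prev = cur
    return prev[-1]


query = "AGCACACA"
candidates = ["ACACAACT", "AGCACACA", "AGCTCACA", "TTTTTTTT"]  # hypothetical loci contents
delta = (0, 2, 1, 1)                      # ∆(Match), ∆(Mismatch), ∆(Insert), ∆(Delete)
m = 3                                     # assumed linear encoding E(x) = m * x
scaled = tuple(m * p for p in delta)      # the corresponding δ-parameters

e = [ld(query, c, delta) for c in candidates]    # traditional LDs e_i
d = [ld(query, c, scaled) for c in candidates]   # modeled propagation delays d_i

print(d == [m * x for x in e])            # True: d_i = E(e_i)
print(e.index(min(e)) == d.index(min(d))) # True: the arg min is unchanged
```

Scaling every penalty by the same positive factor scales the cost of every path through the lattice by that factor, so the minimum scales with it and the ordering of the candidates is unchanged.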
Lemma 2 is sufficient to show that using the ASAP
accelerator to compute LD in the context of Algorithm 1
(in line 7; i.e., using an “arg min” operator over the results of
multiple executions of the ASAP accelerator) produces the
same result as the traditional algorithm (without requiring
the computation of the inverse for E). A key observation in
the formalism of E is that the choice of the numerical values
of δ can be tuned to directly change the performance of the
accelerator, as they correspond to circuit propagation delays.
That is, the parameters and inputs to the accelerator jointly
define the net propagation delay of the circuit. Below we
demonstrate one such transformation, which forms the core
of the approximation used in ASAP.
Lemma 3. When a query string Q and a reference string R
are compared at loci l1, . . . , ln of the reference, they produce
LDs e1, . . . , en for gap penalties ∆, and LDs e′1, . . . , e′n for
gap penalties ∆ + k, for some number k. The e′i obey the
relationship: e′i = ei + nik, for some ni ∈ Z such that (ni ≥
0) ∧ (ei ≤ ej ⇐⇒ ni ≤ nj), and consequently
ei ≤ ej ⇐⇒ e′i ≤ e′j ∀i, j.
Our algorithm for the approximation at the core of ASAP
uses Lemmas 2 and 3 to select values of the delay-encoded
parameters that correspond to minimizing the time taken
to process a dataset. For example, to optimize performance
for our observed case of most nucleotides corresponding to
“matches,” we modify the gap-penalties to set the match
penalty (i.e., δ(Match)) to 0 cycles5. This transformation
uses a two-step process to convert (encode) user-inputted
∆-parameters into δ-parameters:
1) ∆ ↦ ∆ + k, choosing k so that ∆(Match) = 0 after the
transformation;
2) ∆ + k ↦ E(∆ + k), with E(x) = mx to produce the
required delay value.6
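The two-step encoding can be sketched as a small helper. This is an illustration under stated assumptions: the function name `encode_penalties`, the dictionary keys, and the checks for non-negative, whole-tick delays are choices made here (the last two follow from the requirement that encoded values be realizable propagation delays).

```python
from fractions import Fraction


def encode_penalties(delta, m=1):
    """Map ∆-parameters to δ-parameters: shift so Match becomes 0, then scale by m."""
    k = -delta["match"]                                   # step 1: ∆(Match) + k = 0
    shifted = {name: value + k for name, value in delta.items()}
    encoded = {name: m * value for name, value in shifted.items()}   # step 2: E(x) = m * x
    if any(v < 0 for v in encoded.values()):
        raise ValueError("encoded delays must be non-negative")
    if any(v != int(v) for v in encoded.values()):
        raise ValueError("encoded delays must be whole clock ticks")
    return {name: int(v) for name, v in encoded.items()}


# k = 0, m = 1: the penalties of the running example map to themselves.
print(encode_penalties({"match": 0, "mismatch": 2, "insert": 1, "delete": 1}))
# A non-zero match penalty is shifted away (k = -1).
print(encode_penalties({"match": 1, "mismatch": 3, "insert": 2, "delete": 2}))
# Rational penalties are scaled to integer delays (m = 2).
half = Fraction(1, 2)
print(encode_penalties({"match": 0, "mismatch": 1, "insert": half, "delete": half}, m=2))
```

All three calls produce delays of 0, 2, 1, and 1 clock ticks for match, mismatch, insertion, and deletion, respectively.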
As a result, the parameters in the LD algorithm are tweaked
to better suit the delay-based computation hardware. The
answer (i.e., the exact values of LD) produced by this
approximate version of the algorithm is not identical to
that produced by the original algorithm. However, based
on the aforementioned lemmas, we can see that the total
ordering created by the approximated LDs is identical to that
of the original algorithm. Furthermore, assuming that most
nucleotide comparisons are matches (which is true for the
indexed reference-based techniques described in Section 2),
5. True “0 cycle” propagation delay is not possible because of finite combinational
and wire delays in the circuit. Here we imply that the computation is done
in combinational logic, whose propagation delay is much lower than the
clock width of the circuit (i.e., effectively 0 time). This is explained further in Section 3.3.
6. The choice of k and m has to ensure that none of the encoded gap penalties
are negative. As the encoded values represent circuit propagation delays, negative
numbers are meaningless.
[Figure 6: the inputs from (i-1, j-1), (i, j-1), and (i-1, j) each feed a chain of flip-flops (FF) followed by a multiplexer (MUX); the insertion penalty and deletion penalty configure D_I and D_D, and the mismatch or match penalty, selected by Read[i] == Ref[j], configures D_M; the delay elements are clock-gated (clk) and produce the output from (i, j).]
Figure 6. Design of a single delay element D in ASAP. The DE is composed of three separate delay units corresponding to DM , DI , and DD in Fig. 4.
this encoding ensures that (almost) zero time is taken to
explore large portions of the search space that correspond
to matches. We explore the relation of this optimization to
timing closure on the FPGA design in Section 3.3. In other re-
sequencing experiments, where “matches” do not represent
the common computation, a user can set δ(k) = 0 for
k ∈ {Insert, Delete, Mismatch}. Note that in our formulation
of the problem (as described in Section 2), ∆(Match) is
required to be the minimum positive value amongst all the
∆-parameters.
Consider the example of computing the LD between the
strings AGCACACA and ACACAACT, presented in Section 2.
Based on our encoding mechanism (k = 0,m = 1), we com-
pute the δ-parameters of the ASAP accelerator as δ(Match) =
0, δ(Mismatch) = 2, and δ(Insert) = δ(Delete) = 1. Fig. 5
illustrates the propagation of the signal wavefront through
the ASAP accelerator for that example. The accelerator
produces an output for the SW notion of LD (local alignment)
in two clock cycles and the NW notion of LD (global
alignment) in four clock cycles. The figure shows the portion
of the array explored and the value of the propagation delay
at each element D(i, j) of the lattice. Note that some portions
of the array are not explored at all (e.g., for SW and NW, only
25 and 53 DEs out of a total of 81 are triggered, respectively).
This design thus provides a large savings in both time (using
“zero delay” circuit components for the most commonly used
computation) and power (clock-gating unused DEs with
their input signals ensures minimal power usage) compared
to traditional methods.
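The size of the explored region can be estimated in software from the arrival times. The sketch below rests on an assumption about the counting convention that is not spelled out in the text: a DE is counted as triggered if the wavefront reaches it no later than the tick at which the output is produced. The names `arrival_times` and `triggered` are illustrative.

```python
def arrival_times(query, ref, d_match=0, d_mismatch=2, d_insert=1, d_delete=1):
    """Arrival tick of the wavefront at each DE (computed with recurrence (1) on delays)."""
    rows, cols = len(query) + 1, len(ref) + 1
    T = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        T[i][0] = T[i - 1][0] + d_delete
    for j in range(1, cols):
        T[0][j] = T[0][j - 1] + d_insert
    for i in range(1, rows):
        for j in range(1, cols):
            diag = d_match if query[i - 1] == ref[j - 1] else d_mismatch
            T[i][j] = min(T[i - 1][j] + d_delete,
                          T[i][j - 1] + d_insert,
                          T[i - 1][j - 1] + diag)
    return T


def triggered(arrival, t_out):
    """Count DEs reached by tick t_out (assumed convention: arrival <= t_out)."""
    return sum(1 for row in arrival for t in row if t <= t_out)


arr = arrival_times("AGCACACA", "ACACAACT")
total = len(arr) * len(arr[0])
sw_tick = min(arr[-1])                       # tick at which the SW output appears
print(sw_tick, triggered(arr, sw_tick), total)  # 2 25 81
```

Under this convention the SW count agrees with the 25 of 81 DEs quoted above. The NW count depends on how the DEs that fire on the final tick are counted, so it is left to the reader to evaluate with `triggered(arr, arr[-1][-1])`.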
To summarize, using the encoding of δ-parameters de-
scribed in this section, the ASAP accelerator has two clear
advantages over traditional techniques:
1) Faster Processing: One can explore large portions of the
search space in a small amount of time by setting delay
parameters appropriately.
2) Energy Savings: DEs in the ASAP lattice are used only
when their output can contribute to the answer; other-
wise, they are switched off to save energy. This can be
accomplished by clock-gating the DEs with their input
signal.
3.3 ASAP: The FPGA Implementation
3.3.1 Why FPGA?
The techniques discussed so far in the paper represent an
approximation technique and architecture that can
be implemented on ASICs, FPGAs, or any other platform. The
original RaceLogic design was demonstrated in simulation
as an ASIC [13]. However, some key characteristics of the
short-read alignment problem and the ASAP architecture
make ASAP particularly suitable for FPGAs, as they offer
programmability and reconfiguration. The ASAP accelerator
is runtime-programmable only for changing the values of
gap penalties. The input data size, which defines the size of
the accelerator lattice, is fixed at compile time. To allow users
to sweep such “meta-parameters” (i.e., input
data size, gap-penalty bit-width, and input encoding), ASAP
is designed to be re-synthesized and re-programmed on
an FPGA. Potentially, the use of partial reconfiguration can
allow users to change these parameters on the fly. We leave
this possibility for future work. We discuss the advantages of
the ASAP design compared to the commonly used systolic
array based design (e.g., [30], [31], [32], [33], [34]) in Section 5.
3.3.2 Design of a Delay Element
The overall architecture of the ASAP accelerator is shown
in Fig. 4. Fig. 6 shows the design of a single DE. A DE
utilizes sequential logic in the form of a shift-register to add
a user-specified amount of delay. Each DE has 1) three input
signals (representing input wavefront) that connect it to its
preceding DEs in the grid, 2) two input signals representing
the nucleotides being compared by the element, and 3) three
input signals representing the δ-parameters. Each DE has one
output signal representing the propagated wavefront after
the delay has been added. The match, mismatch, insertion,
and deletion delay penalties are defined in terms of multiples
of the clock period. When the input signal wavefront first
reaches an element, it is propagated through a shift register
to create delay. Based on the gap penalty specified for
match/mismatch, insertion and deletion, the DE propagates
the input signals to the output. The output of each flip-
flop in the shift register is muxed to allow for the selection
of the bit corresponding to the gap-penalty of the block
(illustrated in Fig. 6). The ASAP array allows the user to
program (i.e., dynamically set at runtime) the values of the
select lines of these MUXs. This provides the ASAP array with
a degree of programmability, allowing it to be reused across
computations that merely require re-parameterization of the
gap-penalties. Changes in input sizes or in the dynamic range
of the gap penalties (i.e., the number of bits required to represent
them) require re-synthesis and reconfiguration
of the accelerator on the FPGA.
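The behavior of a single DE described above can be approximated with a small cycle-level model. This is a hedged sketch rather than the authors' RTL: the class name `DelayElement`, the use of one shift register per input, and the assignment of the "up" input to deletions and the "left" input to insertions are assumptions made for illustration. A delay of 0 selects the input directly (a combinational path), and a delay of k selects the output of the k-th flip-flop.

```python
class DelayElement:
    """Cycle-level model of a delay element built from shift registers.

    Assumptions (not from the paper): one shift register per input,
    'up' carries the deletion path and 'left' the insertion path.
    """

    def __init__(self, depth):
        # depth = largest programmable delay, in clock cycles
        self.depth = depth
        self.regs = {name: [0] * depth for name in ("diag", "up", "left")}

    def step(self, diag, up, left, match,
             d_match, d_mismatch, d_insert, d_delete):
        """Advance one clock cycle and return the output wavefront bit."""
        inputs = {"diag": diag, "up": up, "left": left}
        delays = {"diag": d_match if match else d_mismatch,
                  "up": d_delete,
                  "left": d_insert}
        out = 0
        for name, value in inputs.items():
            # Tap 0 is the live input; tap k is the k-th flip-flop output.
            taps = [value] + self.regs[name]
            out |= taps[delays[name]]      # MUX selects the programmed tap
            self.regs[name] = taps[:-1]    # shift by one position
        return out


if __name__ == "__main__":
    de = DelayElement(depth=2)
    # Wavefront arrives on the diagonal input; nucleotides mismatch.
    print([de.step(1, 0, 0, False, 0, 2, 1, 1) for _ in range(4)])
```

Changing the δ arguments between runs corresponds to reprogramming the MUX select lines at runtime, whereas changing `depth` corresponds to the re-synthesis case.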
As described in the motivating example for the ASAP
accelerator given in Fig. 5, the power of the ASAP accelerator
is that it can explore a large portion of the search space
of possible mappings between the query string and the
reference within a clock cycle by setting δ(Match) = 0. This
improvement in computational speed can be coupled with
a decrease in energy consumed by the accelerator by clock-gating
the DE (illustrated in Fig. 6) with the input signal.
The approach mentioned above has problems with long
chains of combinational logic and may lead to timing
violations on large lattices of DEs. To get around this problem,
larger lattices of delay elements are composed from
smaller tiles of ASAP accelerators (for which the timing
violations do not occur), with a set of clock-triggered
flip-flops added between the tiles to break the chains
of combinational logic (see Fig. 7). Further, the diagonal tile
crossing (i.e., the flip-flops at the lower right corner of the tile)
corresponds to a 2-cycle delay (i.e., two flip-flops in series).
Although the addition of the tile flip-flops changes the
results of ASAP from what was described in the last section,
the overall total ordering is preserved, as this constitutes
a constant addition of delay to all outputs of the ASAP
accelerator. Each tile is synthesized, optimized, and placed-and-routed
separately by defining separate design partitions.
This approach prevents the compiler from performing optimizations
across partition boundaries [35]. It
also ensures that unintended wiring delays do not creep into
the netlist of the ASAP accelerator.
Figure 7. The architecture of the ASAP accelerator in terms of tiles of delay elements whose output is buffered by clock-synchronous flip-flops (FFs).
The counter that decodes the delayed signal output from
the ASAP lattice (shown in Fig. 4) is designed based on a
computation of the number of clock cycles for the signal
wavefront to emerge from the lattice. The bit-width of this
counter, No, is calculated from the sizes of the input strings
and the user-input gap-penalty parameters, and is given by
$$N_o = \left\lceil \log_2 \min\left\{ \delta_I l_Q + \delta_D l_R,\; \delta_M l_Q + \delta_D (l_R - l_Q) \right\} \right\rceil .$$
This expression is an upper bound (albeit a loose one) on the
maximum delay caused by a DE.
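As a quick numerical illustration, the expression for No can be evaluated directly. The sketch below assumes a hypothetical helper name `counter_bits` and treats δM simply as the third delay parameter in the formula (the text does not spell out here whether it denotes the match or the mismatch delay); it evaluates the expression as written and makes no claim about how the authors' toolflow computes the width.

```python
import math


def counter_bits(d_insert, d_delete, d_m, len_query, len_ref):
    """Evaluate N_o = ceil(log2(min(dI*lQ + dD*lR, dM*lQ + dD*(lR - lQ))))."""
    bound = min(d_insert * len_query + d_delete * len_ref,
                d_m * len_query + d_delete * (len_ref - len_query))
    if bound < 1:
        raise ValueError("delay bound must be at least 1 clock cycle")
    return math.ceil(math.log2(bound))


if __name__ == "__main__":
    # Illustrative parameters: unit insert/delete delays, d_m = 2.
    print(counter_bits(1, 1, 2, 128, 128))
```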
3.3.3 Scalability Issues in the ASAP Accelerator
There are challenges involved in scaling the ASAP accel-
erator to large input sizes and large gap penalties. Those
challenges can be addressed as follows:
1) Large Input Sizes. The size of the reference and read strings
being compared in the ASAP accelerator plays a role in
the size of the lattice defined by the ASAP accelerator.
The size of the accelerator grows as O(lQ × s) with the
input size (see footnote 7). The tile size parameter defines a tunable knob
to control the critical combinational path in the circuit.
It can be used to trade off performance against meeting
timing closure as the size of the accelerator grows to a
significant portion of the resources available on the FPGA.
Section 4 demonstrates our scaling experiments with the
accelerator.
7. This corresponds to quadratic growth in size of the ASAP lattice (i.e., O(n^2)) when lQ = s = n.
Figure 8. Elimination of unused tiles from the ASAP lattice in the case of the LV variant of the LD algorithm. Labels in the figure: Eliminated Blocks; Max tolerable edit distance (application specific); All Match; Max mismatch (insert); Max mismatch (delete).
2) Large Gap Penalties. A large dynamic range of the gap-
penalty values negatively affects the ASAP accelerator, as
it increases the size of the shift-registers and multiplexers
in the DE (see Fig. 6). We work around this problem by
using BRAM-based shift registers, which can be ∼10^3
bits long (without intermediate routing). In general, we
do not expect large gap penalties to be a problem for
genomic sequences (as opposed to protein sequences), for
which the dynamic range in gap-penalties is low.
3) Potentially Unused Tiles. Fig. 5 shows that a large part of
the ASAP array is not involved in computation when
the input strings have low LD (which is indeed the case
in the short read alignment problem). There are several
ways to tackle the problem of unused tiles across the
three variants of the LD computation (i.e., SW, NW, and
LV). As mentioned earlier, in the case of SW or NW,
clock-gating individual delay elements ensures minimal
power consumption. Further, in the LV case, as the
worst-case LD is specified, we can use this information
at compile (in this case synthesis) time to eliminate the part
of the ASAP lattice that will not contribute to an answer.
Fig. 8 illustrates such an elimination on an 18× 18 lattice
with a maximum of 6 insertions or deletions permitted,
resulting in a 56%(= 20/36× 100) reduction in area.
3.3.4 Issues with Timing Closure
Computing with propagation delays is disadvantaged by
the fact that thermal dissipation and temperature variations
at different parts of the FPGA chip can change the physical
time associated with a unit delay. However, the ASAP accelerator
is resilient to these thermal changes up to the maximum
operating temperature of the FPGA (i.e., timing violations do
not occur). Further, only delays that are multiples of the clock
period can affect the computed LD. The tile length serves
as a tunable knob between runtime performance and worst
case negative slack for the circuit. This slack is enforced by
the compiler (e.g., Xilinx Vivado, Altera Quartus) as only
values of tile length for which timing closure can be met can
be used in the FPGA. Furthermore, the counters in Fig. 4
that measure edit distance are synchronously triggered by
the clock, thereby ensuring that all delay-based LDs are
computed as multiples of the clock cycle.
Figure 9. The design of the interface between the host Power8 processor and the FPGA running the ASAP accelerator using the CAPI interface. The diagram assumes an ASAP accelerator that computes on input strings that are 64 nucleotides long and encoded as 2 bits per nucleotide. Components shown: on the IBM POWER8 host, the cores, coherence bus, CAPP, and PCIe; on the FPGA, the IBM-supplied POWER Service Layer (PSL) and the Accelerator Function Unit (AFU), which contains the control unit, an internal input cache (32 KB), an internal output cache (128 B), a crossbar, and the ASAP lattices.
3.3.5 Encoding Input Sequences
The implementation of the ASAP accelerator assumes
genomic data, implying that the entire alphabet can be
represented in two bits (i.e., A, C, G and T). The bases N, -, R,
Y, K, M, S, and W (which represent an unknown or ambiguous
nucleotide) are removed from the alphabet. Our design could
potentially be extended to larger alphabets, e.g., for protein
sequence alignment.
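A two-bit encoding of this kind can be sketched as follows. The specific code assignment (A = 00, C = 01, G = 10, T = 11), the little-endian packing order, and the function names are assumptions for illustration; the paper does not state which bit patterns the hardware uses.

```python
# Assumed mapping; the actual bit patterns used by ASAP are not specified.
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
DECODING = "ACGT"


def encode(sequence):
    """Pack a nucleotide string into an integer, 2 bits per base."""
    word = 0
    for position, base in enumerate(sequence):
        if base not in ENCODING:
            # Ambiguity codes such as N or R fall outside the 2-bit alphabet.
            raise ValueError(f"unsupported base {base!r} at {position}")
        word |= ENCODING[base] << (2 * position)
    return word


def decode(word, length):
    """Unpack `length` bases from an integer produced by encode()."""
    return "".join(DECODING[(word >> (2 * i)) & 0b11] for i in range(length))


if __name__ == "__main__":
    packed = encode("AGCACACA")
    print(bin(packed), decode(packed, 8))
```

Under this scheme a 64-bp string occupies 128 bits, so a read-reference pair fits in the 256-bit input described in Section 3.4.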
3.4 Host-to-Accelerator Communication via CAPI
Communication between the host and accelerator is imple-
mented using the CAPI interface [14], [15] provided on an
IBM Power8 CPU. The CAPI interface gives an accelerator (a
PCIe-attached FPGA) coherent access to the virtual address
space of a process running on the host CPU, with all address
translations from virtual to physical memory done in the
CPU. Fig. 9 shows the interface and mechanism by which
the host CPU communicates with the ASAP accelerator.
The Power8 is a superscalar symmetric multiprocessor, that
has 12 cores per chip, with up to 8 hardware threads per
core. All cores have access to shared memory through a
PowerBus (shared memory bus). The Coherent Attached
Processor Proxy (CAPP) enables the interface (CAPI) by
maintaining a directory of cache lines held by the processor
and providing coherency by snooping the PowerBus on
behalf of the accelerator (or any other PCIe device). The
PCIe host bridge provides connectivity between the CAPP
and the Power Service Layer (PSL) on the FPGA over the
PCIe bus. The PSL on the accelerator acts as a proxy for
the CAPI protocol on the FPGA, communicating between
the CAPP and the Accelerator Functional Unit (AFU). The
AFU contains the custom acceleration logic and reads/writes
coherent data across the PCIe. The PSL unit runs at the same
speed as the PCIe bus (250 MHz). It contains a memory
management unit (MMU) to handle address translation on
the accelerator side on its copy of the processor’s cache
directory.
The AFU interacts with the PSL to provide word-level
read and write commands. If these requests are made to
cache lines (which are 1024 bits long) in a shared or exclusive
state on the device, they are served locally. Otherwise, the
PSL interacts with the CAPP over the PCIe bus to attempt
virtual-to-physical address translation, loading of the cache
line from main memory (if it is not already present in the
processor’s cache), moving (or copying) of the cache line to
the PSL, and changing the coherence state of the cache line in the
processor’s directory [14], [36]. We use the AFU in dedicated
mode, meaning only one MMU context is supported by the
accelerator. That is, only one user-space process can use the
accelerator at one time.
Fig. 9 shows the configuration of the interface to the
PSL for an ASAP accelerator that computes on two 64-bp
strings, with each nucleotide encoded by two bits. Hence
the accelerator takes 256-bit inputs (64 bp × 2 bits/bp × 2)
and produces a propagation delay measurement encoded in
32 bits (to match the signed-integer implementation in the
short-read aligner), which is the number of clock cycles for
the signal to emerge from the ASAP accelerator (depending
on whether the SW or NW algorithm is used). There is
an internal 32 kB cache, which has a 1024-bit input port
connected to the PSL, and a 1024-bit output port that is
connected to the input of the ASAP accelerator. This cache is
configured in a modified FIFO configuration; each entry in
the FIFO contains multiple input cases (in this case, four). A
4× 1 MUX controlled by the AFU control unit is responsible
for producing 256 bits at a time from the 1024-bit input. The
AFU packs the 32-bit outputs from the ASAP array into
1024-bit cache lines before writing them back to the address
space of the host over DMA. The AFU uses the work element
descriptor (WED; [14]) to communicate the pointer to the
input and output, as well as the progress of the accelerator.
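The data movement described above (four 256-bit input cases per 1024-bit cache line, selected by a 4×1 MUX, and 32-bit results packed into 1024-bit lines) can be mimicked with integer bit manipulation. This is only a sketch: the function names and the little-endian ordering of cases within a line are assumptions, not details taken from the AFU design.

```python
LINE_BITS = 1024    # cache-line width
INPUT_BITS = 256    # one read/reference pair: 64 bp x 2 bits/bp x 2
OUTPUT_BITS = 32    # one propagation-delay measurement


def select_input(cache_line, select):
    """Model of the 4x1 MUX: extract input case `select` (0..3)."""
    if not 0 <= select < LINE_BITS // INPUT_BITS:
        raise ValueError("select out of range")
    return (cache_line >> (INPUT_BITS * select)) & ((1 << INPUT_BITS) - 1)


def pack_outputs(results):
    """Pack up to 32 results of 32 bits each into one 1024-bit line."""
    if len(results) > LINE_BITS // OUTPUT_BITS:
        raise ValueError("too many results for one cache line")
    line = 0
    for slot, value in enumerate(results):
        line |= (value & 0xFFFFFFFF) << (OUTPUT_BITS * slot)
    return line


if __name__ == "__main__":
    line = sum((case + 1) << (INPUT_BITS * case) for case in range(4))
    print([select_input(line, s) for s in range(4)])
    print(hex(pack_outputs([4, 2])))
```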
4 EVALUATION AND DISCUSSION
The ASAP accelerator is implemented in Chisel [37] and
can potentially be compiled across FPGAs and CAD tools
provided by Xilinx and Altera. The host-accelerator interface
(which utilizes IBM CAPI) is implemented in VHDL and is
specific to an IBM Power8 S824L system with an Alpha-Data
ADM-PCIE-7V3 board (that uses a Xilinx Virtex 7 XC7VX690T
FPGA) clocked at 250 MHz. All measurements (baseline CPU
as well as FPGA-based) were done on this machine. Fig. 10
illustrates the layout of four ASAP lattices and the CAPI
based interface on the Virtex 7 FPGA mentioned above.
All inputs for the experiments presented in this section
are derived from the human reference genome hg38 by
simulating [28] 100 million reads of appropriate length. The
read simulation introduced random mutations and simulated
sequencing-error models from an Illumina HiSeq 2500 with
a 0.1% sequencing error rate. We verified the correctness of
our implementation through comparison with 1) answers
generated from the software tools (i.e., in this case SNAP [9]); and
2) the ground truth values generated by the simulator.
The remainder of this section is organized as follows.
In Section 4.1, we discuss the resource consumption in
ASAP with varying input sizes. Then, in Section 4.2, we
discuss a micro-benchmark performance comparison of
a single ASAP lattice (configured in SW mode) with a
software-based SW implementation. Section 4.3 discusses the
performance implications of the CAPI interface. Finally in
Sections 4.4 and 4.5, we discuss the power and performance
characteristics (respectively) of an end-to-end application of
ASAP (configured in LV mode) in the SNAP aligner.
4.1 Area-based Scaling
The resource utilization of the ASAP accelerator scales
quadratically with the lengths of the sequences being compared.
For example, Fig. 11 shows the number of flip-flops
used by the ASAP accelerator with increasing string
length, based on a 16×16 square tile size (see footnote 8). In comparison,
an FPGA-based systolic array implementation of the LD
computation [30] (described in Section 5) scales linearly
(i.e., 2N + 1, where N is the length of the strings being
compared). It is apparent that for larger sequences, ASAP
quickly exhausts the amount of available storage (flip-flops),
even on the largest FPGAs available today. However, ASAP is
able to compute LD for shorter-read sequences (e.g., the 100-
150 bp sequences that are typically obtained from an Illumina
HiSeq 2500) which are popularly being used in resequencing
experiments. In addition, we leave approximately 20% of
the area of the FPGA free, to allow the design compiler to
place-and-route the circuit without timing violations due to
wiring delays (see footnote 9). As a result, the largest accelerator we are able
to fit on our FPGA handles 128 bp reads. Fitting larger blocks
leads to timing violations because of delays introduced by
the on-chip interconnect. Given the industry trend towards
FPGAs with larger programmable area, in the future it should
be possible to extend ASAP to read sequences that are
potentially thousands of nucleotides long. The results in
the following sections show measurements from a Power8
system of up to 128 bp inputs, and simulated results (using
Xilinx Vivado’s RTL Simulator) for larger input sizes.
8. This example does not include flip-flops required for the CAPI interface.
9. There is no simple analytical method to derive the optimal tile size, sequence size and free area on the FPGA, as the synthesis tools are a black box.
Figure 10. Layout of the accelerator on the Xilinx Virtex 7 XC7VX690T FPGA. The design implemented above has 4 instances of the ASAP accelerator (ASAP Core 0 to ASAP Core 3), a crossbar, and the IBM CAPI interface for host-accelerator communications.
Figure 11. Scaling of FPGA resource utilization (accelerator size) with increase in input string size. The plot shows area utilized (FF, logarithmic axis) against string length for the ASAP (Simple), ASAP (Optimized), and Systolic designs.
Table 3
Comparison of performance of end-to-end run-time for LD computation on CPU and ASAP (50th percentile). Rows marked with “*” are simulated results for the FPGA.
Read Size   CPU Baseline   ASAP      Speedup
64          1890 µs        10.3 µs   183×
128         2083 µs        10.7 µs   194×
192*        3326 µs        16.4 µs   203×
256*        3906 µs        17.2 µs   219×
320*        4484 µs        18.9 µs   237×
Currently, the ASAP accelerator can be used to compute
LD for larger strings by adding a special control algorithm in
software to compute LD between sub-strings of the original
queries, and combine them to compute the result. The
algorithm works by measuring (and storing) the time at
which the signal wavefront leaves the extremal DEs of the
ASAP lattice, and reintroducing this signal wavefront in the
same lattice after updating the nucleotides to be another
disjoint substring of the queries. We leave the hardware
implementation of this approach for future work.
4.2 Performance of the Accelerator
The ASAP accelerator (configured in SW mode) is approximately
200× faster than the baseline C implementation
of the SW algorithm for computing LD, which is optimized
to use single instruction multiple data (SIMD; e.g., Intel
AVX) instructions and simultaneous multi-threading (SMT;
e.g., Intel Hyper-Threading) [38]. The
baseline implementation exploits inter-task parallelism (i.e.,
data parallelism) by processing multiple reads across threads.
Table 3 describes the comparison of the performance of a
single lattice ASAP accelerator. Having multiple cores on
the CPU or multiple ASAP lattices on the FPGA does not
change this comparison, as each core/lattice is expected
to be computing a separate unrelated instance of the LD
computation. The performance of ASAP depends not only on
the size of the inputs, but on the inputs themselves (i.e., more
mismatched inputs mean a higher computation time). Hence
we present all ASAP measurements as the median across
all the randomly generated reads. We observe that a single
ASAP lattice shows ∼ 200× speedup relative to a single
CPU core (containing 8 SMT threads and SIMD units), with
potential improvements in performance with growing input
size (see Table 3). Overall, a Power8 CPU chip contains six
such cores, whereas our implementation of ASAP can scale
to four lattices (see Fig. 10). Hence a chip-to-chip comparison
yields a 133× improvement in performance.
Fig. 12(a) illustrates the latency of the accelerator (without
the overhead of communication between the host and device)
in computing LD (in the SW sense) for a single read-
reference pair. In contrast to traditional systolic-array-based
accelerators, ASAP needs to update only the cells (DEs) that
can contribute to the LD computation (i.e., corresponding to
the colored cells in Fig. 5). Hence, throughput of the ASAP
accelerator can be computed in two ways: we can compute
it either by considering the total number of cells in the LD
lattice, or by considering only the cells updated by ASAP. The
first method, which we refer to as effective-GCUP/s, is directly
comparable to traditional techniques, as they too consider
updating all elements in the LD lattice. In terms of the first
method, ASAP achieves an average of 609.6 GCUP/s (10^9
cell updates per second) for 128-bp reads; in terms of the second
method, it achieves an average of 204.8 GCUP/s. This implies that
in the median case, ASAP is approximately 5× better than
an equivalent systolic-array-based FPGA implementation
(e.g., 122 GCUP/s were physically achieved on an FPGA
in [39]; see footnote 10). Fig. 12(b) shows the effect of changing tile length
on the latency of the accelerator. It is evident that there are
diminishing returns for increasing the tile length, with almost
no improvement beyond tile size 16.
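The two throughput figures follow from the same latency, divided into either the full lattice size or only the cells actually updated. The sketch below shows that arithmetic; the helper name `gcups` and the 7-cycle latency in the usage example are hypothetical values chosen for illustration, while 609.6, 204.8, and 122 GCUP/s are the figures quoted in the text.

```python
def gcups(cells, cycles, clock_hz):
    """Giga cell updates per second for `cells` processed in `cycles`."""
    seconds = cycles / clock_hz
    return cells / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical latency of 7 cycles at 250 MHz for a 128 x 128 lattice.
    print(f"effective: {gcups(128 * 128, 7, 250e6):.1f} GCUP/s")

    # Ratios implied by the numbers reported in the text.
    asap_effective, asap_updated, systolic = 609.6, 204.8, 122.0
    print(f"vs. systolic [39]: {asap_effective / systolic:.2f}x")
    print(f"fraction of cells updated: {asap_updated / asap_effective:.2f}")
```

The ratio of the two reported ASAP figures suggests that, on average, roughly one third of the lattice cells are updated per alignment.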
Another point to note about Fig. 12 is that ASAP represents
a method to trade off worst-case performance against average-case
performance. The approximations that we present may
be slower than the baseline in the worst
case (i.e., when the read mismatches the reference completely).
However, we see that for representative data sets, the median
performance as well as the 75th percentile performance are
significantly better than the baseline. For the short read
alignment problem, we observe that matches occur more
frequently than insertions, deletions or mismatches. The
ASAP accelerator can also be applied to other cases where
insertions or deletions are more frequent by dealing with
those cases in combinational logic.
4.3 Performance of the CAPI Interface
The ASAP accelerator benefits from the use of the CAPI
interface, because CAPI 1) significantly simplifies, and
2) significantly streamlines the process of initializing and
communicating with the accelerator. We benefit from using
a unified virtual memory space across the PCIe bus with
hardware-supported address translation, compared with the
traditional model, which requires significant hand-holding
by an OS. For example, a typical device driver would execute
approximately 20k instructions and perform PCIe bounce-buffering
and page-pinning to communicate between the host
and the accelerator. We performed measurements on the CAPI
interface using a loopback accelerator [36] (i.e., an accelerator
that reads a cache line and writes it back to a different
location). We observed that (see Table 4 and Fig. 13(a)) the
CAPI interface can perform random reads and writes with
1) sub-µs latency, and 2) approximately 4 GB/s bandwidth, which are both
close to the measured native PCIe latency/bandwidth for the
FPGA board used in the evaluation. The one disadvantage
that we observe with the CAPI interface is that it allows
an AFU to use only 50% of the available peak-theoretical
PCIe bandwidth. Our measurements of PCIe goodput (i.e.,
bandwidth for user data to and from the accelerator) are
similar to those from CAPI (see Fig. 13(a) and footnote 11). Bandwidth is
currently not a limitation for the accelerator. Fig. 13(b) shows
the fraction of the runtime of the accelerator spent in stall
over the execution of a large number of reads. However,
moving to a larger FPGA that supports larger ASAP lattices
or multiple smaller ASAP lattices (executing in parallel), or
clocking the ASAP accelerator higher than 250 MHz, will
require larger bandwidth for the host-accelerator interface.
10. The comparison to [39] is made based on numbers presented in their paper; we have not re-implemented their design. Note that the comparison is fair as the FPGAs in question are from the same architecture series as well as running on similar clock frequencies.
Figure 12. Latency of the accelerator (in clock cycles) as a function of (a) the input string length (tile length = 16) and (b) the tile length (input length = 128). The shaded area in both graphs shows the 25th to 75th percentile measurements from simulation. Panel (a) compares Systolic, ASAP (Simulated), and ASAP (Hardware); panel (b) shows Best, Worst, and Median (Simulated) and Median (Hardware).
Table 4
Basic CAPI-based memory access performance for an AFU running on the Alpha-Data board. Latency measurements include round-trip latency to shared memory as seen from the accelerator.
Interface   Payload (B)   Type                        Measurement
PCIe        128           Mean read/write latency     0.87 µs
CAPI        128           Mean read/write latency     126 ns
CAPI        128           Mean read/write bandwidth   3.88 GB/s
11. We speculate that this limitation occurs because of non-optimal interactions
between the OS-modules (e.g., CAPI cache misses trigger TLB (ERAT in IBM
parlance) or page misses) and the PCIe-endpoint ASIC (e.g., dealing with out-of-
order packet delivery) on the FPGA board. We leave the optimization of such
direct memory access (DMA) issues to future work.
Figure 13. Mean host-accelerator bandwidth over the CAPI interface and its effect on the performance of the ASAP accelerator. (a) Mean observed bandwidth (throughput in GB/s against payload in bits, for PCIe DMA and CAPI). (b) CDF of the fraction of cycles stalled (× 10^-3) due to unavailability of data.
4.4 FPGA Resource Utilization
Previously, Section 4.1 described the scaling behavior of
the ASAP lattice with the input size; this section describes the
overall on-chip resource utilization to implement the CAPI
interface and multiple ASAP lattices on the FPGA. Fig. 14
illustrates this utilization with the increasing number of
lattices for two implementation styles for the ASAP delay ele-
ment. First, the comparator based design that was presented
in the original RaceLogic paper [13] (referred to as CMP in the
figure), and second, the shift-register based design (presented
in Section 3) that has been optimized for FPGAs (referred to
as SR in the figure). Fig. 14(a) demonstrates the significant
reduction (nearly 15%) in number of logic elements (i.e., slice
resources) required to implement SR compared to CMP. This
further translates to a ∼ 1.9× reduction in power consumed
by the SR design (shown in Fig. 14(b)). The proposed design
is nearly 18.8× more power efficient than the IBM Power8
CPU (∼10.1 W compared to 190 W). This implies an overall
3,760× (= 200 × 18.8; based on Section 4.2) improvement
over the CPU in performance/Watt terms.
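The performance/Watt figure is the product of the kernel speedup and the power ratio. The short sketch below reproduces that arithmetic from the numbers quoted in the text; the variable names are illustrative, and the unrounded product differs slightly from 3,760× because the text rounds the power ratio to 18.8 before multiplying.

```python
# Values quoted in the text (Sections 4.2 and 4.4).
kernel_speedup = 200.0   # ASAP vs. CPU baseline for the LD kernel
cpu_power_w = 190.0      # IBM Power8 CPU
fpga_power_w = 10.1      # ASAP design, worst case from the synthesis tool

power_ratio = cpu_power_w / fpga_power_w
perf_per_watt_gain = kernel_speedup * power_ratio

print(f"power ratio: {power_ratio:.1f}x")
print(f"performance/Watt gain: {perf_per_watt_gain:.0f}x (unrounded)")
print(f"with rounded ratio: {kernel_speedup * round(power_ratio, 1):.0f}x")
```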
Note that the power consumption for the chip is calculated
from the synthesis tool (i.e., Xilinx Vivado) and represents
worst-case power consumed by the accelerator. However, the
real power consumption is input-dependent and lower than
that mentioned above, as clock-gating on off-diagonal delay
elements will be enabled differently based on inputs (recall
Fig. 5). We computed this difference in power consumption
using the S824L’s on-board power meters on the Flexible
Service Processor (FSP; see footnote 12). The FSP measurements report
power consumption of the entire computer system averaged
over 30 s intervals. To calculate the power consumption
of the ASAP accelerator, we measured the difference in
power consumed by the system when executing the 4-lattice
instance of the ASAP accelerator shown in Fig. 10, and
when idling. We observed an average difference (i.e., the
ASAP accelerator’s average power consumption) over 100
executions (of the entire benchmark dataset) of 6.9 W with a
standard deviation of 2.8 W. These measurements support
our claim that the actual power consumption of ASAP is
lower than that reported by the synthesis tool.
12. The FSP is an auxiliary processor on the S824L that is an always-on management processor enabling out-of-band management of the server.
Figure 14. Comparison of on-chip resources utilized by the CMP and SR implementations of the ASAP design. Each ASAP lattice is 128 × 128. (a) Scaling of FPGA resource utilization (accelerator size; slice, LUT, and FF utilization in %) with increase in number of ASAP lattices. (b) Power dissipation (static, dynamic, and GTH, in W) from the ASAP accelerator with increase in number of ASAP lattices per chip.
4.5 Integration into the SNAP Aligner
We now evaluate the ASAP accelerator (configured in
LV mode) when used in an end-to-end aligner that uses
Algorithm 1, SNAP [9] (see footnote 13). We ensure that the maximum
permissible LD in both the ASAP and SNAP implementations
of the LV algorithm are identical across all individual
alignments. The baseline SNAP aligner exploits parallelism in
the alignment problem by dividing the work of aligning a set
of reads among all of the 192 threads available on the system.
Since our current implementation of ASAP allows for only
one calling context on the host-side, we use ASAP in SNAP
by maintaining a pool of memory shared among all threads
to communicate with the accelerator. The procedure for each
thread communicating with the accelerator is as follows:
1) the thread picks a read from the set it was assigned; 2) it queries
the reference index for candidate locations for the read;
3) it contends for a lock, then writes nucleotides for the read
and the candidate locations into shared memory; 4) at this
point, the accelerator reads from the shared memory and
writes out the results to another shared segment of memory;
and 5) the thread polls for results from the accelerator using a test and
test-and-set based locking protocol [40], then consumes the
output. This algorithm is another example scenario where
CAPI is very beneficial, as we can make use of cache
coherence between the CPUs and FPGA to easily implement
mutual exclusion.
13. We used version 1.0 of the SNAP tool.
Figure 15. Comparing performance of SNAP using the default LV implementation and ASAP using the SW algorithm. The plot shows the cumulative probability of the time (s) per read for LV (in SNAP) and ASAP.
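The test and test-and-set idea used for mutual exclusion can be illustrated in software. The sketch below is not the SNAP/ASAP host code: the class `TTASLock`, the helper `run_workers`, and the use of an ordinary Python lock to stand in for an atomic hardware test-and-set instruction are all assumptions. The point it shows is the protocol's structure, in which a waiter spins on a plain read of the flag and only attempts the atomic operation once the flag appears free.

```python
import threading
import time


class TTASLock:
    """Test and test-and-set spin lock (software emulation)."""

    def __init__(self):
        self._held = False
        # Stands in for the atomicity of a hardware test-and-set.
        self._atomic = threading.Lock()

    def _test_and_set(self):
        with self._atomic:
            previous = self._held
            self._held = True
            return previous

    def acquire(self):
        while True:
            while self._held:          # "test": spin on a plain read
                time.sleep(0)          # yield to other threads
            if not self._test_and_set():   # "test-and-set": try to claim
                return

    def release(self):
        self._held = False


def run_workers(num_threads=4, writes_per_thread=200):
    """Each worker writes to a shared buffer under the lock."""
    lock = TTASLock()
    shared = {"writes": 0}

    def worker():
        for _ in range(writes_per_thread):
            lock.acquire()
            try:
                shared["writes"] += 1   # critical section
            finally:
                lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return shared["writes"]


if __name__ == "__main__":
    print(run_workers())
```

Spinning on a read keeps waiters in their local caches until the flag changes, which is why the protocol benefits from hardware cache coherence between the CPU and the FPGA.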
Fig. 15 shows the time taken per read by the baseline and
the ASAP accelerator for all LD computations. We see that
there is a large spread for total time spent in computing LD
because some reads map to more regions of the reference
than others. This variation is an artifact of both the nature
of the human genome and the read simulator’s practice
of picking reads at random from the genome. The ASAP
accelerator was used in SW mode in these experiments (a
quadratic complexity algorithm), but is compared to a CPU
implementation of the LV algorithm (a linear complexity
algorithm). This comparison represents the most challenging
acceleration use-case for ASAP. We demonstrate how to
configure ASAP as an LV accelerator in Section 3.2. Overall,
we see that the aligner is accelerated by 2× (i.e., 1.85 hr/0.92 hr).
This is close to the Amdahl’s law limit of the SNAP algorithm
based on our measurements presented in Table 2.
5 RELATED WORK
The sequence alignment problem has been addressed by
an extensive body of work that looks at algorithms and
their high-performance implementations on CPUs and on
accelerators like GPUs and FPGAs. In this section we restrict
ourselves to the comparison of the ASAP accelerator to other
implementations of the LD computations. Refer to Section 2
for a discussion of algorithms.
On CPUs and GPUs. The LD computation and sequence-
alignment problem has been studied on SIMD and MIMD
processors that exploit parallelism in the problem at two
levels: inter-task parallelism [41] (using multiple cores
to independently compute alignments of different short
reads), and intra-task parallelism [42], [43] (using SIMD
instructions and efficient use of the memory hierarchy to
effectively compute (1)). Most of the popular SW or NW
implementations exploit the use of both of these techniques.
These techniques have also been applied to GPUs [44], [45],
[46]. One such example is NVIDIA’s NVBIO [47] library
and the accompanying set of tools nvBWT, nvFM-server.
These look at accelerating the construction and look-up
of data structures that index the reference genome. The
major disadvantages of this approach is the large power
consumption of these processors, and their restrictive lock-
step parallelism based programming models.
On FPGAs and ASICs. Custom hardware acceleration
of the problem on FPGAs and ASICs has also been widely
studied. Most of the popular hardware architectures are
based on systolic arrays [30], [31], [32], [33], [34]. These
architectures, like the SIMD and MIMD approaches, are
limited by the amount of parallelism they can exploit. It
has been shown in [48] that exploiting deeper pipelines with
much larger inter-task parallelism can potentially enable
more efficient use of FPGAs. We may be able to use this
optimization to further increase the throughput of the
accelerator, particularly on larger FPGAs that can sustain
larger off-chip bandwidth. Kaplan et al. [49] present an
ASIC design for a Processing-in-Memory accelerator for the
SW algorithm that leverages resistive content-addressable
memory to compute matches/mismatches of nucleotides.
ASAP represents a significant improvement over [49] in
throughput/Watt terms, i.e., ASAP achieves 61 GCUP/s/W
(= 609.6/10.1) compared to their 53 GCUP/s/W. Turakhia et
al. [50] present an accelerator to perform long-read assembly,
one step of which includes a SW-based alignment (through a
seed-and-extend approach). Aligning long reads (i.e., ≥ 1000
base-pair reads) poses a significantly different algorithmic
challenge than aligning short reads (i.e., 100–250 base-pair
reads), as the sequencing chemistry that produces
long reads is inherently error prone. As a result, [50]’s SW
implementation is not directly comparable to ASAP. Alser
et al. [51] present an FPGA-based accelerator to efficiently
filter candidate locations to calculate LD. This accelerator
targets Line 6 of Algorithm 1, whereas ASAP
targets Line 7; hence, it can be used in
addition to ASAP to accelerate the end-to-end alignment
process. More recent work [52] has also shown the benefit of
distributing the compute-intensive LD computation across
multiple accelerators (including CPUs, GPUs, FPGAs, Xeon
Phis). We observe that ASAP significantly outperforms such
multi-accelerator systems in terms of both performance
and performance per Watt. The Host + 2× FPGA design
presented in [52] achieves only 441.6 GCUP/s
at 1.51 GCUP/s/W. In comparison, ASAP achieves 609.6
effective GCUP/s at 61 GCUP/s/W on a single FPGA.¹⁴
Other work, e.g., [53], [54], [55], has demonstrated the use
of systolic-array-based designs to accelerate computations
on Pair-HMM models, where gap-penalties are replaced by
probability distributions. That may be a future direction for
the extension of the ASAP design.
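The efficiency comparison above is simple arithmetic on throughput and power, and can be reproduced with the short sketch below. All figures are taken from the text; the helper name and the derived power of the design in [52] (obtained by dividing its throughput by its efficiency) are our own illustrative additions. Note that the quotient 609.6/10.1 as written evaluates to roughly 60.4 GCUP/s/W, which the text reports as 61.

```python
def gcups_per_watt(gcups, watts):
    """Throughput per Watt, in GCUP/s/W."""
    return gcups / watts


# Figures quoted in the text.
asap_gcups, asap_watts = 609.6, 10.1   # ASAP: effective GCUP/s and Watts
pim_eff = 53.0                         # [49]: GCUP/s/W
multi_gcups, multi_eff = 441.6, 1.51   # [52]: GCUP/s and GCUP/s/W

asap_eff = gcups_per_watt(asap_gcups, asap_watts)

# Derived quantity (not stated in the text): power implied by [52]'s figures.
multi_watts = multi_gcups / multi_eff

print(f"ASAP efficiency: {asap_eff:.1f} GCUP/s/W")
print(f"ASAP vs. [49]:   {asap_eff / pim_eff:.2f}x")
print(f"ASAP vs. [52]:   {asap_eff / multi_eff:.1f}x")
print(f"implied power of [52]: {multi_watts:.0f} W")
```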
ASAP’s design philosophy is most closely related to
Madhavan et al.’s RaceLogic [13] ASIC design, which also
encodes LD computations as circuit delay. However, ASAP
builds on this basic model to further optimize the design by
1) using approximation algorithms for the LD computation
that maintain the total ordering of LDs, and 2) accelerating
the most common computation (in this case the processing
of “matches”) in combinational circuitry, thereby spending
minimal runtime in its computation. This is demonstrated
by the fact that ASAP is ∼ 50× faster than a RaceLogic
implementation. Further, the nature of the alignment problem
and the rapidly evolving sequencing technology (i.e., read
lengths) imply that fixed-function ASICs are not favorable
because of the large monetary investment required and the
inability of the accelerator to adapt to new input sizes. ASAP
circumvents these problems by using reconfigurable FPGAs.
Of course, an ASIC will almost always outperform an FPGA
in energy efficiency because of its customized layout. Hence,
going forward, a design with a fixed-function (i.e., ASIC-based)
IO interface (i.e., CAPI) with a configurable substrate
for ASAP accelerators might present an ideal trade-off.
14. The comparison is made across an equivalent generation of Altera and Xilinx
FPGAs, using effective-GCUP/s (described in Section 4.2).
Comparison to Systolic Arrays. Relative to the related
work described above, ASAP has some decided advantages:
1) The systolic array based approaches require each element
of the array to compute on as many bits as the maximum
LD computed. Our approach requires only as many bits
per delay element as the maximum delay between inputs
at that point in the lattice.
2) The earlier accelerators have to explore the entirety of
the lattice before computing the LD. We show that the
ASAP accelerator explores only the portions of the lattice
that are reachable before the final result is produced. This
represents a significant saving in run time and energy
expended for computation.
3) The ASAP accelerator can explore multiple elements in the
lattice in under one clock cycle by setting δ(Match) = 0.
Systolic array based architectures cannot perform this
optimization, as this creates large combinational chains
which make timing closure difficult to obtain.
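Points 2) and 3) above can be illustrated with a small software analogy of delay-based computation. The sketch below is not the ASAP circuit; it is a shortest-arrival-time search (Dijkstra's algorithm) over the edit lattice, in which the time at which a "signal" reaches the far corner equals the LD. The function name and the delay values (0 for a match, 1 for a mismatch or gap) are assumptions chosen for illustration.

```python
import heapq


def race_edit_distance(a, b, d_match=0, d_mismatch=1, d_gap=1):
    """Return (arrival_time, cells_fired) for a signal racing from lattice
    cell (0, 0) to cell (len(a), len(b)). Delays are illustrative."""
    m, n = len(a), len(b)
    arrival = {(0, 0): 0}      # earliest known arrival time per cell
    fired = set()              # cells whose output has already switched
    queue = [(0, 0, 0)]        # (time, i, j), ordered by time
    while queue:
        t, i, j = heapq.heappop(queue)
        if (i, j) in fired:
            continue
        fired.add((i, j))
        if (i, j) == (m, n):
            # The result is available; cells that never fired were never explored.
            return t, len(fired)
        steps = []
        if i < m:
            steps.append((i + 1, j, d_gap))
        if j < n:
            steps.append((i, j + 1, d_gap))
        if i < m and j < n:
            delay = d_match if a[i] == b[j] else d_mismatch
            steps.append((i + 1, j + 1, delay))
        for ni, nj, delay in steps:
            nt = t + delay
            if nt < arrival.get((ni, nj), float("inf")):
                arrival[(ni, nj)] = nt
                heapq.heappush(queue, (nt, ni, nj))
    raise AssertionError("the sink cell is always reachable")


distance, explored = race_edit_distance("ACGT", "ACGT")
print(distance, explored, "of", 5 * 5, "lattice cells")
```

For two identical 4-character strings, the zero-delay match edges carry the signal straight down the diagonal, so only 5 of the 25 lattice cells fire before the result is known. A systolic array, by contrast, evaluates every cell.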
On Neuromorphic Computers. Neuromorphic computing
is modeled on biological neurons that communicate and
compute using information encoded as voltage pulses, or
spikes. Such spiking-based models convey and process
information via precise spike-timing relationships measured
across multiple communication paths. Though the
computational model is similar in principle to the delay-based
computation outlined in this paper, the realization of
this technology and its applications in domains other than
pattern recognition are still open research questions [56],
[57]. This field of research represents an avenue to extend
ASAP using analog circuit components.
6 CONCLUSION AND FUTURE WORK
This paper proposed ASAP, an accelerator for rapid
computation of Levenshtein distance, in the context of the
short-read alignment problem. ASAP builds upon the idea
that the LD between strings can be approximated for the
short-read alignment problem by encoding gap penalties
in propagation delays of circuit elements. We show that by
effectively setting these delays, it is possible to accelerate
performance significantly, and at the same time ensure that
the accuracy of alignment is maintained. Accelerators like
ASAP that synergize well with technologies like the CAPI
interface point to a new generation of HPC machines that
will embrace heterogeneity and allow for efficient handling
of high throughput genomic data.
The ASAP accelerator, and the approach (based on heuristic
approximations) presented in this paper, can also be
adapted to a variety of other problems in which a total
ordering of LDs is computed. Examples include signal processing,
where different instances of a signal have to be aligned
to compute similarity [3]; text retrieval, where misspelled
words have to be accounted for in a dictionary [16]; and
virus- and intrusion-detection, where signatures have to be
aligned to a baseline [17].
Future Work. Our future work will primarily look to
extend ASAP to handle more complex gap-penalty models.
This paper describes the use of constant gap penalties (i.e., a
fixed score is assigned to every gap), which are commonly
used in DNA alignment (e.g., in NCBI-BLASTN, or WU-
BLASTN [20]). We can extend ASAP to handle linear, affine,
and convex gap penalties by letting each DE track the
propagation of the signal wavefront in the portion of the
lattice before it. Further, ASAP can be extended for use in
the alignment of proteins by using substitution matrices, like
BLOSUM [21], which assign unique scores to each pair of
residues.
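To make the gap-penalty models named above concrete, the following sketch writes each one as a function of gap length. The parameter values, the function names, and the specific affine and convex forms are assumptions for illustration only (conventions differ between tools); only the constant model corresponds to what this paper implements.

```python
import math


def constant_gap(length, penalty=2.0):
    """Fixed score for every gap, regardless of its length."""
    return penalty if length > 0 else 0.0


def linear_gap(length, extend=1.0):
    """Score grows in proportion to gap length."""
    return extend * length


def affine_gap(length, gap_open=3.0, extend=1.0):
    """Opening cost for the first position plus a per-position extension cost."""
    return gap_open + extend * (length - 1) if length > 0 else 0.0


def convex_gap(length, gap_open=3.0, scale=1.0):
    """Each additional position costs less than the previous one (log growth)."""
    return gap_open + scale * math.log(length) if length > 0 else 0.0


for g in (1, 2, 5, 10):
    print(g, constant_gap(g), linear_gap(g), affine_gap(g), round(convex_gap(g), 2))
```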
ACKNOWLEDGMENTS
This research was supported by several grants: in part
by the National Science Foundation under Grant No. CNS
13-37732; in part by the Blue Waters sustained-petascale
computing project supported by the National Science Foundation
(awards OCI-0725070 and ACI-1238993) and the state of
Illinois; and in part by IBM Faculty Awards. We thank Jenny
Applequist and Kathleen Atchley for their help in preparing
the manuscript.
REFERENCES
[1] Z. D. Stephens, S. Y. Lee, F. Faghri, R. H. Campbell, C. Zhai, M. J. Efron,
R. Iyer, M. C. Schatz, S. Sinha, and G. E. Robinson, “Big data: Astronomical
or genomical?” PLOS Biology, vol. 13, no. 7, p. e1002195, jul 2015.
[2] Y. S. Shao and D. Brooks, “Research infrastructures for hardware accelerators,”
Synthesis Lectures on Computer Architecture, vol. 10, no. 4, pp. 1–99, 2015.
[3] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions,
and reversals,” Tech. Rep. 8, 1966.
[4] G. Navarro, “A guided tour to approximate string matching,” ACM Comput.
Surv., vol. 33, no. 1, pp. 31–88, Mar. 2001.
[5] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic
local alignment search tool,” Journal of Molecular Biology, vol. 215, no. 3, pp.
403–410, oct 1990.
[6] B. Langmead, C. Trapnell, M. Pop, and S. L. Salzberg, “Ultrafast and memory-
efficient alignment of short DNA sequences to the human genome,” Genome
Biol, vol. 10, no. 3, p. R25, 2009.
[7] B. Langmead and S. Salzberg, “Fast gapped-read alignment with bowtie 2,”
Nature Methods, vol. 9, pp. 357–359, March 2012.
[8] H. Li and R. Durbin, “Fast and accurate long-read alignment with burrows-
wheeler transform,” Bioinformatics, vol. 26, no. 5, pp. 589–595, jan 2010.
[9] M. Zaharia, W. J. Bolosky, K. Curtis, A. Fox, D. A. Patterson, S. Shenker,
I. Stoica, R. M. Karp, and T. Sittler, “Faster and more accurate sequence
alignment with SNAP,” CoRR, vol. abs/1111.5572, 2011.
[10] S. S. Banerjee, A. P. Athreya, L. S. Mainzer, C. V. Jongeneel, W.-M. Hwu, Z. T.
Kalbarczyk, and R. K. Iyer, “Efficient and scalable workflows for genomic
analyses,” in Proceedings of the ACM International Workshop on Data-Intensive
Distributed Computing, ser. DIDC ’16. New York, NY, USA: ACM, 2016, pp.
27–36.
[11] T. C. Glenn, “Field guide to next-generation DNA sequencers,” Molecular
Ecology Resources, vol. 11, no. 5, pp. 759–769, may 2011.
[12] M. G. Ross, C. Russ, M. Costello, A. Hollinger, N. J. Lennon, R. Hegarty,
C. Nusbaum, and D. B. Jaffe, “Characterizing and measuring bias in sequence
data,” Genome Biology, vol. 14, no. 5, p. R51, 2013.
[13] A. Madhavan, T. Sherwood, and D. Strukov, “Race logic: A hardware
acceleration for dynamic programming algorithms,” SIGARCH Comput. Archit.
News, vol. 42, no. 3, pp. 517–528, Jun. 2014.
[14] I. C. Systems and T. Group. (2015) Coherent accelerator processor interface:
User’s manual. [Online]. Available: http://www.nallatech.com/wp-content/
uploads/IBM CAPI Users Guide 1-2.pdf
[15] J. Stuecheli, B. Blaner, C. R. Johns, and M. S. Siegel, “CAPI: A coherent
accelerator processor interface,” IBM Journal of Research and Development,
vol. 59, no. 1, pp. 7:1–7:7, Jan 2015.
[16] R. A. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Boston,
MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1999.
[17] S. Kumar and E. H. Spafford, “A pattern matching model for misuse intrusion
detection,” in Proceedings of the 17th National Computer Security Conference,
1994, pp. 11–21.
[18] T. Smith and M. Waterman, “Identification of common molecular subse-
quences,” Journal of Molecular Biology, vol. 147, no. 1, pp. 195 – 197, 1981.
[19] S. B. Needleman and C. D. Wunsch, “A general method applicable to the
search for similarities in the amino acid sequence of two proteins,” Journal of
Molecular Biology, vol. 48, no. 3, pp. 443–453, mar 1970.
[20] W.-K. Sung, Algorithms in Bioinformatics: A Practical Introduction (Chapman &
Hall/CRC Mathematical and Computational Biology). Chapman and Hall/CRC,
2009.
[21] S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices from
protein blocks.” Proceedings of the National Academy of Sciences, vol. 89, no. 22,
pp. 10 915–10 919, nov 1992.
[22] C. Wang, R.-X. Yan, X.-F. Wang, J.-N. Si, and Z. Zhang, “Comparison of
linear gap penalties and profile-based variable gap penalties in profile–profile
alignments,” Computational Biology and Chemistry, vol. 35, no. 5, pp. 308–318,
oct 2011.
[23] M. Burrows and D. J. Wheeler, “A block-sorting lossless data compression
algorithm,” Tech. Rep., 1994.
[24] E. Ukkonen, “Algorithms for approximate string matching,” Information and
control, vol. 64, no. 1, pp. 100–118, 1985.
[25] G. M. Landau and U. Vishkin, “Efficient string matching with k mismatches,”
Theoretical Computer Science, vol. 43, pp. 239–249, 1986.
[26] P. Ferragina and G. Manzini, “Opportunistic data structures with applica-
tions,” in Proceedings of the 41st Annual Symposium on Foundations of Computer
Science, ser. FOCS ’00. Washington, DC, USA: IEEE Computer Society, 2000,
pp. 390–.
[27] Illumina. (2010) Pair-end sequencing. [Online]. Available:
http://www.illumina.com/technology/next-generation-sequencing/
paired-end-sequencing assay.html
[28] Z. D. Stephens, M. E. Hudson, L. S. Mainzer, M. Taschuk, M. R. Weber, and
R. K. Iyer, “Simulating next-generation sequencing datasets from empirical
mutation and sequencing models,” PLOS ONE, vol. 11, no. 11, p. e0167047,
nov 2016.
[29] National Center for Supercomputing Applications (NCSA). (2012) Blue Waters
supercomputer. [Online]. Available: https://bluewaters.ncsa.illinois.edu/
[30] R. J. Lipton and D. Lopresti, “A systolic array for rapid string comparison,”
in Proceedings of the Chapel Hill Conference on VLSI, 1985, pp. 363–376.
[31] D. T. Hoang and D. P. Lopresti, “FPGA implementation of systolic sequence
alignment,” in Lecture Notes in Computer Science. Springer Science + Business
Media, 1993, pp. 183–191.
[32] S. A. Guccione and K. Eric, Field-Programmable Logic and Applications:
Reconfigurable Computing Is Going Mainstream: 12th International Conference, FPL
2002 Montpellier, France, September 2–4, 2002 Proceedings. Berlin, Heidelberg:
Springer Berlin Heidelberg, 2002, ch. Gene Matching Using JBits, pp. 1168–
1171.
[33] P. Zhang, G. Tan, and G. R. Gao, “Implementation of the smith-waterman
algorithm on a reconfigurable supercomputing platform,” in Proceedings of
the 1st International Workshop on High-performance Reconfigurable Computing
Technology and Applications: Held in Conjunction with SC07, ser. HPRCTA ’07.
New York, NY, USA: ACM, 2007, pp. 39–48.
[34] N. Ahmed, V. M. Sima, E. Houtgast, K. Bertels, and Z. Al-Ars, “Heterogeneous
hardware/software acceleration of the bwa-mem dna alignment algorithm,”
in 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD),
Nov 2015, pp. 240–246.
[35] Xilinx. (2010) Hierarchical design methodology guide. [Online].
Available: https://www.xilinx.com/support/documentation/sw manuals/
xilinx12 2/Hierarchical Design Methodology Guide.pdf
[36] M. J. Jaspers, “Acceleration of read alignment with coherent attached
FPGA coprocessors,” Master’s thesis, Delft University of Technology, The
Netherlands, 2015.
[37] J. Bachrach, H. Vo, B. Richards, Y. Lee, A. Waterman, R. Avižienis,
J. Wawrzynek, and K. Asanović, “Chisel: Constructing hardware in a scala
embedded language,” in DAC Design Automation Conference 2012, June 2012,
pp. 1212–1221.
[38] J. Daily, “Parasail: SIMD c library for global, semi-global, and local pairwise
sequence alignments,” BMC Bioinformatics, vol. 17, no. 1, feb 2016.
[39] A. Sirasao, E. Delaye, R. Sunkavalli, and S. Neuendorffer, “Fpga
based opencl acceleration of genome sequencing software,” in Poster
presented at Supercomputing 2015, Austin, TX, Nov 2015. [Online].
Available: http://sc15.supercomputing.org/sites/all/themes/SC15images/
tech poster/tech poster pages/post269.html
[40] G. Andrews, Foundations of Multithreaded, Parallel, and Distributed Programming.
Addison-Wesley, 2000.
[41] E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick,
“meraligner: A fully parallel sequence aligner,” in Parallel and Distributed
Processing Symposium (IPDPS), 2015 IEEE International, May 2015, pp. 561–570.
[42] M. Farrar, “Striped smith-waterman speeds database searches six times over
other SIMD implementations,” Bioinformatics, vol. 23, no. 2, pp. 156–161, nov
2006.
[43] R. Hughey, “Parallel hardware for sequence comparison and alignment,”
Comput. Appl. Biosci., vol. 12, no. 6, pp. 473–479, Dec 1996.
[44] Y. Liu, W. Huang, J. Johnson, and S. Vaidya, Computational Science – ICCS 2006:
6th International Conference, Reading, UK, May 28-31, 2006, Proceedings, Part IV.
Berlin, Heidelberg: Springer Berlin Heidelberg, 2006, ch. GPU Accelerated
Smith-Waterman, pp. 188–195.
[45] Y. Liu, B. Schmidt, and D. L. Maskell, “CUSHAW: a CUDA compatible short
read aligner to large genomes based on the burrows-wheeler transform,”
Bioinformatics, vol. 28, no. 14, pp. 1830–1837, may 2012.
[46] K. Zhao and X. Chu, “G-BLASTN: accelerating nucleotide alignment by
graphics processors,” Bioinformatics, vol. 30, no. 10, pp. 1384–1391, jan 2014.
[47] NVIDIA Corporation. (2015) NVBIO. [Online]. Available: https://developer.nvidia.
com/nvbio
[48] Y. T. Chen, J. Cong, J. Lei, and P. Wei, “A novel high-throughput acceleration
engine for read alignment,” in Field-Programmable Custom Computing Machines
(FCCM), 2015 IEEE 23rd Annual International Symposium on, May 2015, pp.
199–202.
[49] R. Kaplan, L. Yavits, R. Ginosar, and U. Weiser, “A resistive cam processing-
in-storage architecture for dna sequence alignment,” IEEE Micro, vol. 37,
no. 4, pp. 20–28, 2017.
[50] Y. Turakhia, G. Bejerano, and W. J. Dally, “Darwin: A genomics co-processor
provides up to 15,000x acceleration on long read assembly,” in Proceedings of
the Twenty-Third International Conference on Architectural Support for Program-
ming Languages and Operating Systems, ser. ASPLOS ’18, 2018, pp. 199–213.
[51] M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, and C. Alkan, “GateKeeper:
a new hardware architecture for accelerating pre-alignment in DNA short
read mapping,” Bioinformatics, vol. 33, no. 21, pp. 3355–3363, may 2017.
[52] E. Rucci, C. Garcia, G. Botella, A. E. D. Giusti, M. Naiouf, and M. Prieto-
Matias, “Oswald: Opencl smith–waterman on altera’s fpga for large protein
databases,” The International Journal of High Performance Computing Applications,
vol. 32, no. 3, pp. 337–350, 2018.
[53] J. Peltenburg, S. Ren, and Z. Al-Ars, “Maximizing systolic array
efficiency to accelerate the PairHMM Forward Algorithm,” 2016 IEEE Int.
Conf. Bioinformatics and Biomedicine (BIBM), vol. 00, pp. 758–762, 2016.
[54] S. S. Banerjee, M. el Hadedy, C. Y. Tan, Z. T. Kalbarczyk, S. Lumetta, and R. K.
Iyer, “On accelerating pair-hmm computations in programmable hardware,”
in 2017 27th International Conference on Field Programmable Logic and Applications
(FPL), Sept 2017, pp. 1–8.
[55] S. Huang, G. J. Manikandan, A. Ramachandran, K. Rupnow, W. mei W. Hwu,
and D. Chen, “Hardware Acceleration of the Pair-HMM Algorithm for DNA
Variant Calling,” in Proc. 2017 ACM/SIGDA Int. Symp. on Field-Programmable
Gate Arrays, ser. FPGA ’17. New York, NY, USA: ACM, 2017, pp. 275–284.
[56] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada,
F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, B. Brezzo, I. Vo,
S. K. Esser, R. Appuswamy, B. Taba, A. Amir, M. D. Flickner, W. P. Risk,
R. Manohar, and D. S. Modha, “A million spiking-neuron integrated circuit
with a scalable communication network and interface,” Science, vol. 345, no.
6197, pp. 668–673, aug 2014.
[57] B. V. Benjamin, P. Gao, E. McQuinn, S. Choudhary, A. R. Chandrasekaran,
J.-M. Bussat, R. Alvarez-Icaza, J. V. Arthur, P. A. Merolla, and K. Boahen,
“Neurogrid: A mixed-analog-digital multichip system for large-scale neural
simulations,” Proceedings of the IEEE, vol. 102, no. 5, pp. 699–716, may 2014.
Subho S. Banerjee is a PhD candidate in Computer Science at the University of
Illinois at Urbana-Champaign. His research focuses on the design and implementation
of workload-optimized computing systems (using hardware accelerators and
parallel runtime environments) for data analytics workloads. He holds a B.Tech.
degree in Computer Science and Engineering from LNMIIT, India.
Mohamed el-Hadedy is a Research Scientist at the University of Illinois at Urbana-
Champaign. He earned his B.Sc. and M.Sc. degrees from Mansoura University,
Egypt, in 2002 and 2006, respectively, and his PhD degree in Electrical and Computer
Engineering from the Telematics Department at the Norwegian University of
Science and Technology, Trondheim, Norway in 2012. His main research interests
include FPGA-based accelerator design for Cryptography, Signal/Image Processing,
Robotics, and Genomics.
Jong Bin Lim received his B.S. degree in Electrical Engineering from the University
of Illinois at Urbana-Champaign in 2014. He is currently working toward his
Ph.D. degree in the Department of Electrical and Computer Engineering at the
University of Illinois at Urbana-Champaign. His current research interests include
optimal System-On-Chip and accelerator design by using high-level synthesis, and
hardware-software co-design.
Zbigniew T. Kalbarczyk is a Research Professor at the Electrical and Computer
Engineering and the Coordinated Science Laboratory of the University of Illinois at
Urbana-Champaign. Dr. Kalbarczyk’s research interests are in the area of design
and validation of reliable and secure computing systems.
Deming Chen received the B.S. degree in computer science from the University of
Pittsburgh, PA, USA, in 1995, and the M.S. and Ph.D. degrees in computer science
from the University of California at Los Angeles, in 2001 and 2005, respectively. He
is a Professor with the ECE Department, University of Illinois at Urbana–Champaign,
where he is the Donald Biggar Willett Faculty Scholar. His current research interests
include system-level and high-level synthesis, nano-systems design and nano-
centric CAD techniques, GPU and reconfigurable computing, hardware security,
and computational genomics.
Steven S. Lumetta received the A.B. degree in physics and the M.S. and Ph.D.
degrees in computer science from the University of California, Berkeley, in 1991,
1994, and 1998, respectively. He is an Associate Professor of Electrical and
Computer Engineering and a Research Associate Professor with the Coordinated
Science Laboratory, University of Illinois at Urbana-Champaign. His research
interests are in optical networking, high-performance networking and computing,
hierarchical systems, and parallel run-time software.
Ravishankar K. Iyer is the George and Ann Fisher Distinguished Professor of
Engineering at the University of Illinois at Urbana-Champaign. He holds appointments
in the Department of Electrical and Computer Engineering, the Coordinated
Science Laboratory (CSL), and the Department of Computer Science, serves as
Chief Scientist of the Information Trust Institute, and is affiliate faculty of the National
Center for Supercomputing Applications (NCSA).
