Reconfigurable acceleration of genetic sequence alignment: A survey of two decades of efforts by Ng, HC et al.
Reconfigurable Acceleration of Genetic Sequence
Alignment: A Survey of Two Decades of Efforts
Ho-Cheung Ng, Shuanglong Liu, Wayne Luk
Department of Computing, Imperial College London, UK
{h.ng16, s.liu13, w.luk}@imperial.ac.uk
Abstract—Genetic sequence alignment has always been a
computational challenge in bioinformatics. Depending on the
problem size, software-based aligners can take multiple CPU-
days to process the sequence data, creating a bottleneck point
in the bioinformatic analysis flow. Reconfigurable accelerators can
achieve high performance for such computation by providing
massive parallelism, but at the expense of programming flexibility,
and thus have not been commensurately used by practitioners.
Therefore, this paper aims to provide a thorough survey of the
proposed accelerators by giving a qualitative categorization based
on their algorithms and speedup. A comprehensive comparison
between the surveyed works is also presented so as to guide selection
for biologists, and to provide insight into future research directions for
FPGA scientists.
I. INTRODUCTION
Genetic sequence alignment is an important and fundamental
aspect of modern molecular biology. However, the exponential
growth of bio-sequence databases [1] and significant
improvements in next-generation sequencing (NGS) machines
have posed a computational challenge for general-purpose
processors, especially as the performance of NGS machines has
been improving at a rate faster than Moore’s law [2].
In the literature, FPGA technology has been shown to be a
promising candidate for accelerating genetic sequence alignment.
Because of their highly parallel, bit-oriented architecture,
FPGAs have been leveraged to accelerate various alignment
algorithms since 1992 [3]. This survey therefore covers
traditional sequence analysis on FPGAs, such as pairwise sequence
alignment and genomic database search. In addition, the
recently evolved NGS technology and its applications are
discussed to demonstrate the benefits of FPGAs in accelerating
genetic sequence alignment.
The rest of this paper is organized as follows: Section II
discusses the commonly used techniques and algorithms in
pattern matching. Previous work on reconfigurable acceler-
ation of genetic sequence alignment is then elaborated in
Section III. Concluding remarks are drawn in Section IV.
II. BACKGROUND AND ALGORITHMS
Over the last two decades, FPGA researchers have applied
different techniques to accelerate genetic sequence alignment.
For example, Fernandez et al. implement a direct comparison
design where bases from a streaming reference sequence and
a stationary short read are compared [4]. Other algorithms
such as the Aho-Corasick algorithm [5] or hash tables [6] are also
adopted in FPGA aligners.
In this section, the two most commonly used algorithms for
genetic sequence alignment, Smith-Waterman [7] and the
FM-index [8], are described to provide a background for the various
acceleration approaches. Other algorithms such as the
seed-and-extension strategy are briefly mentioned in Section III
from the application perspective.
A. Smith-Waterman Algorithm
The Smith-Waterman algorithm is a dynamic programming (DP)
technique based on the Needleman-Wunsch algorithm [9]. A
scoring matrix V is used to reveal the optimal local alignment
between two sequences S and T, where |S| = n and |T| = m.
Each entry of matrix V is calculated recursively according to
equations (1) and (2).
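\[
V(i, j) = \max
\begin{cases}
0 \\
V(i-1, j-1) + \sigma(S[i], T[j]) & \text{Match/Mismatch} \\
V(i-1, j) + \sigma(S[i], -) & \text{Deletion} \\
V(i, j-1) + \sigma(-, T[j]) & \text{Insertion}
\end{cases}
\qquad \text{for } 1 \le i \le n,\ 1 \le j \le m \tag{1}
\]
\[
\text{Base case: }
\begin{cases}
V(i, 0) = 0 & 0 \le i \le n \\
V(0, j) = 0 & 0 \le j \le m
\end{cases}
\tag{2}
\]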
The function σ(x, y) determines the relative weighting of
matches, mismatches, deletions and insertions between characters
x and y. The weighting can be adjusted
according to different alignment requirements. For example,
the insertion and deletion penalties can be set to a higher value
than the substitution penalty if the presence of redundant
characters is less acceptable than a character difference.
TABLE I: Example of calculating the scoring matrix for the
sequences S = AT and T = CTCATGC.
      -  C  T  C  A  T  G  C
  -   0  0  0  0  0  0  0  0
  A   0  0  0  0  2  1  0  0
  T   0  0  2  1  1  4  3  2
Match: σ(x, x) = +2, Mismatch: σ(x, y) = −1
Deletion: σ(x, -) = −1, Insertion: σ(-, x) = −1
Table I illustrates the calculation of the scoring matrix
for the alignment of sequence S = AT to a reference T =
CTCATGC. The optimal alignment is obtained by completing
the matrix V; the highest score indicates whether a
pattern can be mapped to another sequence within the allowed
diversity. In Table I the highest score is 4, which corresponds to
two consecutive matches at +2 each and therefore
indicates that pattern S can be mapped exactly to sequence T.
By backtracing from the highest score to the entry in which the
score becomes zero, the optimal alignment can be constructed
as a string representation.
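The recurrence in equations (1) and (2) and the backtracing step can be sketched in a few lines of Python. This is a minimal software illustration, not any of the surveyed hardware designs; the function name `smith_waterman` and its keyword parameters are assumptions made for this example, and the scores default to those listed under Table I.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment matrix V and backtrace the best alignment.

    Illustrative sketch: names and defaults are assumptions, chosen to
    reproduce the scores used in Table I.
    """
    n, m = len(s), len(t)
    # (n + 1) x (m + 1) matrix; row 0 and column 0 are the base case (2).
    v = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            v[i][j] = max(0,
                          v[i - 1][j - 1] + sub,  # match / mismatch
                          v[i - 1][j] + gap,      # deletion
                          v[i][j - 1] + gap)      # insertion
            if v[i][j] > best:
                best, best_pos = v[i][j], (i, j)

    # Backtrace from the highest score until an entry with score zero.
    i, j = best_pos
    a, b = [], []
    while i > 0 and j > 0 and v[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if v[i][j] == v[i - 1][j - 1] + sub:
            a.append(s[i - 1])
            b.append(t[j - 1])
            i -= 1
            j -= 1
        elif v[i][j] == v[i - 1][j] + gap:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
    return v, best, "".join(reversed(a)), "".join(reversed(b))


matrix, score, aligned_s, aligned_t = smith_waterman("AT", "CTCATGC")
```

For S = AT and T = CTCATGC the two non-trivial rows of `matrix` match the rows of Table I, and the backtrace recovers the exact match AT.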
B. FM-Index
The FM-index is a data structure that combines the properties of
the suffix array with the Burrows-Wheeler transform (BWT) [10].
This data structure provides an efficient mechanism to locate
all the occurrences of a pattern P in a long reference sequence
R. As a result, the BWT and FM-index have been broadly
employed in many software tools for short-read alignment
such as Bowtie [11], SOAP2 [12] and BWA [13].
To compute the BWT of a reference sequence R, i.e.
BWT(R), R is first terminated with a unique character ‘$’,
which is lexicographically smaller than every other character. Then, all the
rotations of the text are generated and sorted lexicographically.
The suffix array can be obtained by considering the characters
before ‘$’ in each entry of the rotation list. BWT(R) is
formed by extracting and concatenating the last characters
of all the entries in the sorted list.
Table IIa shows an example of deriving the BWT of the
sequence R = ACACGT. The strings preceding the ‘$’ sign
in the sorted rotations form the suffix array (SA), which
indicates the position of each possible suffix in the original
string.
TABLE II: (a) Example of deriving the suffix array and BWT
of reference sequence R. (b) i(x) and c(n, x) functions for
the sequence R.
(a)
R = ACACGT
Index SA Sorted Rotations
0 6 $ACACGT
1 0 ACACGT$
2 2 ACGT$AC
3 1 CACGT$A
4 3 CGT$ACA
5 4 GT$ACAC
6 5 T$ACACG
BWT (R) = T$CAACG
(b)
c(n, x)
Index n   A  C  G  T
0         0  0  0  0
1         0  0  0  1
2         0  0  0  1
3         0  1  0  1
4         1  1  0  1
5         2  1  0  1
6         2  2  0  1
7         2  2  1  1
i(x)      1  3  5  6
After generating the suffix array, the BWT (R) is sorted to
form the i and c functions. For each element x of the alphabet
of R, i(x) is defined as the index of its first occurrence in
sorted-BWT (R), while for each index n in BWT (R) and for
each character x in the alphabet, c(n, x) stores the number of
occurrences of x in BWT (R) in the range [0, n−1]. Table IIb
illustrates the i(x) and c(n, x) functions for the sequence R.
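The construction described above can be reproduced with a short Python sketch. It sorts all rotations explicitly, which is only practical for small examples; the function name `build_fm_index` and the dictionary-based layout of i and c are assumptions made for illustration rather than the layout used by any surveyed tool.

```python
def build_fm_index(r):
    """Return (SA, BWT, i, c) for reference r via explicit rotation sorting.

    Illustrative sketch: production tools use far more memory-efficient
    construction algorithms and sampled tables.
    """
    text = r + "$"  # '$' sorts before A, C, G and T in ASCII
    n = len(text)
    # The start offset of each sorted rotation gives the suffix array.
    sa = sorted(range(n), key=lambda k: text[k:] + text[:k])
    # The last character of the rotation starting at k is text[k - 1].
    bwt = "".join(text[k - 1] for k in sa)

    alphabet = sorted(set(r))
    # i(x): index of the first occurrence of x in the sorted BWT.
    sorted_bwt = sorted(bwt)
    i = {x: sorted_bwt.index(x) for x in alphabet}

    # c[n][x]: number of occurrences of x in bwt[0:n], for n = 0..len(bwt).
    c = [{x: 0 for x in alphabet}]
    for ch in bwt:
        row = dict(c[-1])
        if ch in row:  # '$' is not counted
            row[ch] += 1
        c.append(row)
    return sa, bwt, i, c


sa, bwt, i_table, c_table = build_fm_index("ACACGT")
```

For R = ACACGT this yields the suffix array, BWT(R) = T$CAACG, and the i(x) and c(n, x) values listed in Table II.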
Essentially, the FM-index is a pattern searching technique
that applies the i(x) and c(n, x) functions repeatedly.
Two pointers, top and bottom, are defined to perform
the search. top refers to the index of the suffix array element
where a specific pattern is first located, and bottom is the
index just past the location where the pattern is last found. If bottom points
to an index that is less than or equal to the index pointed to by
top, the pattern does not occur in the text.
To locate a specific pattern P with the FM-index, one character
is processed at a time, beginning with the last character of the
pattern. top and bottom are first initialized with the first
and last indices of the c(n, x) function respectively. Then, for each
character x of the pattern, both
pointers are updated according to the following equations:
\[ top_{new} = c(top_{current}, x) + i(x) \]
\[ bottom_{new} = c(bottom_{current}, x) + i(x) \]
Notice that the time needed to locate the pattern in the reference
sequence is linear in the length of P instead of the length of R.
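The pointer updates can be traced with the following Python sketch, which rebuilds the tables for a small reference and then performs the backward search. The name `fm_search` is hypothetical, and c(n, x) is computed here by counting on demand for brevity; a real FM-index tabulates it so that each update takes constant time.

```python
def fm_search(r, p):
    """Return the sorted start positions of pattern p in reference r.

    Illustrative sketch of backward search; the on-demand c(n, x) below
    is a simplification and does not give the linear-time bound.
    """
    text = r + "$"
    # With a unique smallest terminator, sorting suffixes equals sorting rotations.
    sa = sorted(range(len(text)), key=lambda k: text[k:])
    bwt = [text[k - 1] for k in sa]
    sorted_bwt = sorted(bwt)

    def i(x):
        # First occurrence of x in the sorted BWT.
        return sorted_bwt.index(x)

    def c(n, x):
        # Occurrences of x in bwt[0:n].
        return bwt[:n].count(x)

    top, bottom = 0, len(bwt)  # first and last indices of c(n, x)
    for x in reversed(p):      # start with the last character of the pattern
        if x not in sorted_bwt:
            return []
        top = c(top, x) + i(x)
        bottom = c(bottom, x) + i(x)
        if bottom <= top:      # empty range: the pattern does not occur
            return []
    return sorted(sa[top:bottom])


positions = fm_search("ACACGT", "AC")
```

Searching for AC in R = ACACGT first narrows the range to [3, 5) for C and then to [1, 3) for AC, whose suffix array entries are positions 0 and 2.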
III. FPGA ACCELERATION OF GENOMIC ALIGNMENT
Depending on the application, different alignment operations
can be applied to perform different genomic analyses. In this
section, existing work on FPGA-accelerated genomic aligners is
studied from the application perspective. Additionally, the
interplay between the hardware characteristics of FPGAs and the
algorithmic techniques is briefly described.
A. Pairwise Sequence Alignment
A fundamental problem in the field of computational bi-
ology is the comparison and alignment of two sequences of
DNA strands. Depending on the applications, the alignment
results can provide useful biological or medical information
such as evolutionary development of a species, or identification
of causal cancer genes and genetic diseases [14].
As mentioned, the Smith-Waterman algorithm is the most commonly
used algorithm for genetic sequence alignment, particularly
pairwise sequence alignment (PSA). However, because
of the enormous size of DNA sequences, purely software-based
implementations of the algorithm suffer from prolonged
execution latency. To accelerate PSA, FPGA devices have been
extensively used to reduce the time complexity from O(mn)
in software to O(m+n) in parallel processing hardware, since the
cells on one anti-diagonal of the scoring matrix do not depend on
each other and can be updated concurrently.
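The reduction from O(mn) to O(m+n) time steps can be illustrated by filling the scoring matrix one anti-diagonal at a time. The Python sketch below only simulates this schedule sequentially; the function name `sw_wavefront` and the step counter are assumptions for illustration, and each outer iteration stands for what a systolic array could complete in a single clock cycle.

```python
def sw_wavefront(s, t, match=2, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman matrix anti-diagonal by anti-diagonal.

    Illustrative sketch: the inner loop is sequential here, but its
    iterations are independent and could run in parallel in hardware.
    """
    n, m = len(s), len(t)
    v = [[0] * (m + 1) for _ in range(n + 1)]
    steps = 0
    for d in range(2, n + m + 1):  # d = i + j is constant on an anti-diagonal
        # Cells with i + j = d read only anti-diagonals d - 1 and d - 2.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            sub = match if s[i - 1] == t[j - 1] else mismatch
            v[i][j] = max(0,
                          v[i - 1][j - 1] + sub,
                          v[i - 1][j] + gap,
                          v[i][j - 1] + gap)
        steps += 1                 # one parallel time step per anti-diagonal
    return v, steps


matrix, steps = sw_wavefront("AT", "CTCATGC")
```

For the sequences of Table I, the 14 matrix cells are covered in n + m − 1 = 8 parallel steps, and the resulting matrix is identical to the one computed row by row.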
Simple Aligner — In [15] and [16], the authors present one of
the first FPGA accelerators for the Smith-Waterman algorithm.
A reconfigurable systolic array is adopted to provide a large
amount of parallelism. Also, runtime reconfiguration is used
to write one string directly into the FPGA’s bitstream.
Yu et al. [17] later propose an improved,
reconfiguration-free systolic array architecture that
can be deployed on FPGAs from different vendors.
Experiments show that the proposed solution can achieve 814
giga cell updates per second (GCUPS) when implemented on a
Virtex-E XCV1000E-6 FPGA.
Affine Gap Cost Model — Very often, alignment of two
sequences favours extending an existing gap over opening a new one.
Therefore, instead of giving the same fixed negative score to every
gap position, biologists usually apply an affine gap penalty, in which
opening a gap costs more than extending it, when computing
the scoring matrix.
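One common way to realise an affine gap penalty in software is the three-matrix recurrence usually attributed to Gotoh, sketched below in Python. The source does not give a formula, so the gap cost model (a gap of length k costs gap_open + (k − 1) · gap_extend), the function names and the default scores are all assumptions made for this example.

```python
NEG_INF = float("-inf")


def gap_cost(k, gap_open=-3, gap_extend=-1):
    """Assumed affine model: the first gap position costs gap_open,
    every further position costs gap_extend."""
    return gap_open + (k - 1) * gap_extend


def sw_affine(s, t, match=2, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Best local alignment score under the assumed affine gap model."""
    n, m = len(s), len(t)
    h = [[0] * (m + 1) for _ in range(n + 1)]        # best score ending at (i, j)
    e = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap in s
    f = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap in t
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend the current gap or open a new one.
            e[i][j] = max(e[i][j - 1] + gap_extend, h[i][j - 1] + gap_open)
            f[i][j] = max(f[i - 1][j] + gap_extend, h[i - 1][j] + gap_open)
            sub = match if s[i - 1] == t[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + sub, e[i][j], f[i][j])
            best = max(best, h[i][j])
    return best


score = sw_affine("AAAACCCC", "AAAAGGCCCC")
```

In this example the eight matching bases score 16 and the two-base gap costs −3 − 1 = −4, giving 12. Setting gap_open equal to gap_extend recovers the fixed per-position penalty of equation (1).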
In [18] and [19], the authors propose the first FPGA-based
accelerator that supports affine gap penalty. Systolic matching
cells are implemented to support different cost functions
and alignment algorithms such as the Needleman-Wunsch or
Smith-Waterman. Compared to software implementation on
Xeon 3GHz processor, a speedup of 370× can be achieved
when implemented on Virtex-II Pro XC2VP70 FPGA.
Similarly, Jiang et al. [20] implement a reconfigurable
accelerator that supports the affine gap penalty. In this design, a
modified equation is proposed to improve the mapping efficiency
of a processing element (PE). A special floor plan is applied
to the fine-grain parallel PE array to cut down its routing delay.
With these two techniques, the proposed implementation on
Stratix EP1S30 improves the performance by 345× versus
comparable software on a Xeon 2.8GHz processor.
Basically, most of the research efforts such as [21]–[27]
utilize systolic array or fine-grain PE architecture to accelerate
PSA with affine gap penalty. Experiments show that, when
compared to state-of-the-art software implementations, the
reconfigurable accelerators can achieve a speedup from around
40× to 246×.
Accelerator with Traceback — To further improve the accel-
erator performance, some FPGA designs realise the traceback
procedure instead of relying on the host CPU to perform
backtracing. For example, Benkrid et al. [28] implement a
Smith-Waterman accelerator on Virtex-II XC2VP100 where a
pipeline of PEs can be used to calculate the scoring matrix and
traceback. An improved accelerator is later proposed in [29]
in which a space-efficient algorithm is used to overcome the
memory size and bandwidth limitations. Compared to software
on Core2 Duo 2.4GHz, a performance gain over 300× can
be obtained with 256 PEs on Virtex-4 FX100 FPGA.
Moreover, a few researchers accelerate variants of the
Smith-Waterman such as DIALIGN [30] to accomplish better
alignment sensitivity. In particular, Boukerche et al. [31]
propose a reconfigurable accelerator for DIALIGN by imple-
menting wavefront array processors on Stratix-II EP2S180.
The traceback procedure can also be executed on FPGA to
retrieve the alignment and the overall speedup is around 141×
compared to a similar software implementation.
Hardware Abstraction in RC-PSA — Some of the efforts
are devoted to improving the portability and usability of the
accelerated system. In [32], the authors design a systolic
architecture that can be applied to solve general DP-based
alignment problems. Others such as [19], [28], [33] provide
generic, parameterizable FPGA cores for PSA which are
portable across various FPGA platforms. Finally, Liu et al. [34]
introduce the concept of “RC-PSA in the cloud” where a
web server is used to serve alignment requests. All these
implementations, compared to state-of-the-art CPU designs,
can deliver a speedup of more than 62×.
B. Database Search
Computational search through large databases of DNA is
another important tool for uncovering homologous sequences in
modern molecular biology. Database sequences that exhibit
high similarity with the query are hypothesized to derive from
the same ancestral sequence and often display the same biological
function.
Heuristic algorithms such as BLAST [35] are extensively
used by biologists to perform database search. The
BLAST algorithm works in three consecutive stages: (1) Word
Matching, (2) Ungapped Extension, (3) Gapped Extension.
However, as the size of the most commonly used databases
such as the NCBI databank [1] grows at the same pace as Moore’s
law, running BLAST on a general-purpose processor has become
the bottleneck in homology analysis.
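The first two stages can be sketched in Python as follows. This is a heavily simplified illustration rather than the NCBI implementation: the function names, the word size w, the match and mismatch scores and the X-drop termination rule with its threshold are all assumptions chosen for the example.

```python
from collections import defaultdict


def word_matches(query, subject, w=4):
    """Stage 1 sketch: (query_pos, subject_pos) pairs sharing an exact w-mer."""
    index = defaultdict(list)
    for q in range(len(query) - w + 1):
        index[query[q:q + w]].append(q)
    hits = []
    for s in range(len(subject) - w + 1):
        for q in index.get(subject[s:s + w], []):
            hits.append((q, s))
    return hits


def ungapped_extend(query, subject, q, s, w=4, match=1, mismatch=-3, x_drop=5):
    """Stage 2 sketch: extend a hit in both directions without gaps.

    Returns (score, query_start, query_end). Extension stops once the
    running score falls more than x_drop below the best score seen.
    """
    def extend(step, qi, si):
        score = best = 0
        length = best_len = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            si += step
        return best, best_len

    right, right_len = extend(+1, q + w, s + w)
    left, left_len = extend(-1, q - 1, s - 1)
    return w * match + left + right, q - left_len, q + w + right_len


query, subject = "ACGTACGT", "TTACGTACGTTT"
hits = word_matches(query, subject)
result = ungapped_extend(query, subject, 0, 2)
```

Here the hit at query position 0 and subject position 2 extends to cover the whole query. Stage 3 would then run a gapped (Smith-Waterman style) alignment only around such high-scoring segments.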
In this sub-section, previous work on reconfigurable
acceleration of BLAST is reviewed and a summary is shown
in Table III. Although some variations of BLAST such as
BLASTp or BLASTx do not target the alignment of genetic
nucleotides, they are also included in the discussion because
of their similarities in heuristics and methodology.
Basic Accelerators — The TUC BLAST is one of the earliest
efforts in accelerating BLAST on reconfigurable devices.
In [36], [37], Sotiriades et al. develop the first version of TUC
BLAST in which the entire BLAST algorithm is mapped onto
Virtex-4 4VFX140FF1517-11 FPGA. The architecture can
support small queries of up to 1,000/5,000 letters regardless
of the database size. A hash table is used to build hit finders
and extension is done with basic comparators. Experiments
indicate that the proposed accelerator can achieve a speedup
of 215× versus BLASTn on Xeon 2GHz.
Hybrid Systems — The TUC BLAST is then revised and
incorporated with the PowerPC processor onboard to perform
extension [38]. Implemented on Virtex-II PRO V2P30, the
modified accelerator achieves 32× speedup compared to ex-
ecution on Pentium-4 3.0GHz. Moreover, the same authors
explore the design space on ASIC to reduce technology related
limitations of FPGA in [39].
Xia et al. [40]–[42] also design a hybrid accelerator where
the first two stages of BLAST are accelerated with a
Stratix-II EP2S130C5 FPGA and the final stage is executed on a
commodity CPU. To decrease the on-chip memory requirement
and support longer queries, a systolic array of 3072 PEs
is used to perform multi-seed detection and multi-channel
hardware modules are implemented to complete ungapped
extension. The experimental results show that the accelerator
can deliver 48× speedup versus a Pentium-4 2.6GHz CPU.
Chen et al. [43] also present an FPGA-based reconfigurable
architecture to accelerate the word-matching stage of BLAST
while maintaining the computations of other stages on CPU.
This design consists of three sub-stages: a parallel Bloom filter,
an off-chip hash table, and a match redundancy eliminator.
The performance of this architecture, when implemented on
Virtex-5 LX330, demonstrates 10× speedup against Core2
Duo 3.2GHz (1-thread) in Word Matching*.
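The division of labour between the three sub-stages can be sketched in Python as below. This is a software analogy only: the class `BloomFilter`, the function `word_match`, the filter size, the number of hash functions and the use of SHA-256 are all assumptions for illustration and are not taken from [43].

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; sizes and hash construction are illustrative."""

    def __init__(self, n_bits=1024, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits)  # one byte per bit, for clarity

    def _positions(self, word):
        for k in range(self.n_hashes):
            digest = hashlib.sha256(f"{k}:{word}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, word):
        for pos in self._positions(word):
            self.bits[pos] = 1

    def __contains__(self, word):
        return all(self.bits[pos] for pos in self._positions(word))


def word_match(query, database, w=4):
    """Bloom filter, then hash table, then duplicate removal."""
    bloom, table = BloomFilter(), {}
    for q in range(len(query) - w + 1):
        word = query[q:q + w]
        bloom.add(word)
        table.setdefault(word, []).append(q)

    hits = set()  # a set plays the role of the redundancy eliminator
    for d in range(len(database) - w + 1):
        word = database[d:d + w]
        if word in bloom:                  # cheap test, may give false positives
            for q in table.get(word, []):  # exact confirmation in the hash table
                hits.add((q, d))
    return sorted(hits)


hits = word_match("ACGTAC", "TTACGTTT")
```

Because a Bloom filter never reports a false negative, the final result is exact; the filter merely spares most database words the more expensive hash table lookup, which is off-chip in the design described above.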
Mercury BLAST — The Mercury system, on the other hand,
is reconfigurable logic associated with the disk controller that
provides computation in close proximity to the data flowing
off the disk drive [55]. Such a platform is frequently employed
to accelerate BLAST because of its high-throughput data
channels.
TABLE III: Summary of the previous work on reconfigurable acceleration of database search.
Category | Paper | Algorithm & Method | Co-Processor | Device | Max. Query Length | Tested Database Length | Speedup
Basic Accelerator | [44] | Smith-Waterman | Pentium-III 1GHz | Virtex-E XCV2000E | 2,048 | 64M bases | 330×
Basic Accelerator | [36] [37] | All Stages | - | Virtex-4 4VFX140 | 5,000 | 44M bases | 215×
Hybrid Systems | [38] | Stage 1, 2 | PowerPC 405 300MHz | Virtex-II PRO V2P30 | 200k | - | 32×
Hybrid Systems | [42] | Stage 1, 2 | Pentium-4 2.60GHz | Stratix-II EP2S130C5 | 3,000 | 101M bases | 48×
Hybrid Systems | [45] | Stage 1, 2 | Core i7 2.80GHz | Virtex-5 ML509 | 3,000 | 123M bases | 46×
Hybrid Systems | [43] | Stage 1 | DRC coprocessor | Virtex-5 LX330 | 1M | 1.4G bases | 10×*
Mercury BLAST | [46] | Stage 1 | Pentium-D 3GHz | Virtex-II 6000 | 1M | 1.16G bases | 7×
Mercury BLAST | [47] | Stage 1, 2 | Opteron 2GHz | Virtex-II 6000 | 64,000 | 1.5G bases | 11×
Mercury BLAST | [48] | Stage 1, 2 (partially) | - | Virtex-II 6000 | 25,000 | 1.5G bases | 50×
Results Compatible | [49] | Function Blast_Nt_Scan | PowerPC 405 300MHz | Virtex-4 ML410 | 975M bases | 400MB | 3×
Database Pre-filter | [50] | Pre-filtering - Threshold | - | Virtex-5 XC5VLX330T | - | - | 5×
Database Pre-filter | [51] | Pre-filtering - TreeBLAST | - | Stratix-III EP3SL340 | 1,000 | 7.17G bases | 12×
Hardware Abstraction in RC-BLAST | [52] | Two-Hit Method | - | Virtex-4 XC4VLX60 | - | - | 52×
Hardware Abstraction in RC-BLAST | [53] | Mitrion Virtual Processor | Itanium | Virtex-4 LX200 | 100M | - | 20×
Hardware Abstraction in RC-BLAST | [54] | Smith-Waterman | Xeon E5620 | Stratix-IV E530 | 1.6M | 975M bases | 59×
Krishnamurthy et al. [46] develop the first Mercury BLAST
implementation on Virtex-II 6000 where the Word Matching
stage is accelerated with Bloom filters and a hash table. Since
the first stage is found to be the bottleneck in BLAST exe-
cution, this accelerator can demonstrate 7× speedup against
Pentium-4 2.8GHz.
Mercury BLAST is later improved by Buhler et al. [47]
where Word Matching and Ungapped Extension are both
accelerated on FPGA. The hardware-accelerated ungapped
extension employs a similar heuristic as BLAST in order to
achieve a speedup of 11× while retaining 98.5-99% of all
alignments found by NCBI BLASTN.
Finally, Lancaster et al. [48] further enhance the design by
implementing a pre-filter on FPGA for the third stage and
at the same time offloading the computation of ungapped
extension on CPU. By heavily parallelizing and pipelining the
hardware modules, the accelerator accepts queries of 25k bases
and achieves a 50× improvement while maintaining sensitivity
equivalent to that of the BLAST software.
Moreover, BLASTp is also accelerated using the Mercury
framework. Word Matching is accelerated in [56] and Gapped
Extension is accelerated with Smith-Waterman in [57]. Finally,
[58] presents a full acceleration of BLASTp where all the
previous efforts are combined to deliver a full implementation.
Single-Pass BLAST — Since BLAST involves multiple passes
during database queries, some researchers introduce a new
algorithm that operates in a single pass at streaming rate
to improve performance. In particular, Herbordt et al. [59],
[60] propose the use of a DP approach on FPGA to emulate
the seeding and extension phases of BLAST. This algorithm,
named TreeBLAST, can improve the performance of the
database search by 400× on Virtex-4 LX160 FPGA compared
to multiple-pass NCBI BLASTp on Xeon 2.8GHz.
Results Compatible Accelerator — Although the mentioned
implementations demonstrate significant speedups compared
to software, the search outcomes are not always in complete
agreement with the NCBI results. Since biologists typically
cannot tell whether the differences are statistically
significant, some FPGA researchers argue that the hardware
accelerated design should be NCBI BLAST compatible.
Datta et al. [49] propose a memory efficient FPGA design
that implements the Blast_Nt_Scan function of BLAST. The
primary role of the scan function is to stream the subject
data sequence and locate hits. Without compromising fidelity,
the proposed implementation on Virtex-4 ML410 can improve
performance by a factor of 3 (compared to Pentium-4 3.2GHz)
while in complete agreement with the standard NCBI BLAST.
Database Pre-filtering — In addition to accelerating different
phases of BLAST using FPGAs, another useful approach to
improving the overall performance is to profile the code and
reduce the database size.
Afratis et al. [50] propose the first pre-filtering approach
to BLAST by finding and reporting matches in the areas of
high similarity between database and query. It is found that
pre-filtering offers at least a factor of 5 and up to 3 orders of
magnitude reduction in the database space.
Park et al. [51], [61] also apply pre-filtering with the
TreeBLAST algorithm so as to quickly reduce the size of
the database to a small fraction. The sensitivity of the pre-
filtering approach is tuned to exceed that of the NCBI BLAST
implementation to ensure identical results. Experimental re-
sults show that, compared with NCBI BLASTn, the speedup
is greater than 12× when pre-filtering and accelerator in [59]
are used in execution.
Hardware Abstraction in RC-BLAST — In spite of the
promising results described above, the FPGA-based solutions
should also be portable and straightforward in order to promote
the use of reconfigurable accelerator among biologists.
In [62], Muriki et al. present the first portable, cost-effective,
open source solution of RC-BLAST to guarantee usability.
Kasap et al. [52] also present a portable FPGA accelerator
for BLAST by capturing the design with an FPGA-platform-
independent language Handel-C. The architecture of the ac-
celerator can also be parametrized in terms of the sequence
lengths, match scores, gap penalties, and cut-off and threshold
values. It is reported that the hardware implementation is 52×
faster than equivalent software implementations on Centrino
Duo 2.2GHz.
Moreover, Abelsson et al. [53] propose the use of Mitrion
Virtual Processor to accelerate BLAST. Since Mitrion enables
software developers to target FPGA-based computers without
requiring hardware design skills, users can continue
using the familiar BLAST interface, while at the same time
getting searches completed 10× to 20× faster.
Finally, Lam et al. [54] introduce an FPGA-accelerated
BLAST in the cloud framework. Smith-Waterman is accel-
erated on 64 Stratix-IV E530 on multiple PS4 compute nodes
of Novo-G to provide database search. A robust software
interface is also provided to seamlessly integrate the FPGA
design into existing processing pipelines of NCBI BLAST.
C. Multiple Sequence Alignment
Multiple Sequence Alignment (MSA) is an extension of
PSA and is generally used to construct family representations
of sequences or to reveal evolutionary histories of species.
However, it is an NP-hard problem, and the optimal
solution can only be obtained with an i-dimensional DP table,
where i is the number of sequences [63].
Heuristic algorithms, such as ClustalW [64], have been widely
used among biologists because of their efficiency. Basically,
ClustalW uses a progressive algorithm which consists of three
major steps: (1) PSA between all sequences to generate a distance
matrix, (2) guide tree generation based on the distance
matrix, (3) successively building the MSA by performing PSA
based on the branching order of the guide tree. However,
ClustalW faces the same problem as BLAST does due to the
rapid growth of the sequence databases, and aligning a few
hundred sequences could require several hours on a computer.
Research has been done to overcome this problem by ac-
celerating MSA with reconfigurable devices. In [65] and [66],
the authors present an accelerated ClustalW by offloading
the computation of the first stage onto FPGA. As more
than 90% of the runtime is spent in the first stage, [65]
provides a speedup of 50× on Virtex-II XC2V6000 compared
to Pentium-4 3GHz for the first stage, and [66] achieves 10×
performance improvement when Stratix EP1S30 is compared
to Xeon 2.8GHz.
Finally, the third stage of ClustalW is accelerated in [67].
Compared to Core2 2.4GHz, an overall speedup of 150× can
be achieved by reducing subgroups of aligned sequences into
discrete profiles before PSA is performed on Virtex-4 FX100.
D. Mapping
Mapping, or resequencing, refers to the alignment of a
generated sequence to a reference genome where the complete
sequence of the species concerned, such as human, is already
known. Such an application is essentially used to determine the
genomic variations of a sample in relation to the reference
so as to explore and understand genetic diseases and recent
cancer genomes.
Mapping is one of the dominant applications of next-generation
sequencing, where millions of DNA fragments,
called short reads, each 75 to 200 base pairs (bp) in length, are generated
by an NGS machine and mapped to the reference genome. Software
tools such as Bowtie, BWA, SOAP2 and BWA-MEM [68] are
widely used among biologists as the de facto sequence alignment
programs of choice. Yet, since sequencing machines are
improving at a rate faster than the growth in transistor count
described by Moore’s law, mapping a generated sequence such as
the complete human genome takes on the order of days
of computing time [63]. Therefore, FPGA technology has
been extensively used by researchers to speed up the mapping
process. A summary of the previous work on reconfigurable
acceleration of short-read alignment is displayed in Table IV.
Basic Mappers — Fernandez et al. [4] implement the first
hardware short-read mapper in 2010 where the design is based
on a naive solution. The reconfigurable implementation on
Virtex-5 LX330 delivers a speedup of 1.6× to 4× when compared
to the fastest software tools RAMP [82] and ELAND [83]
on Xeon Harpertown 2.5GHz (1-thread). However, the
performance of this design decreases as the read
length increases; a follow-up work [84] is therefore proposed in which
the authors develop the first implementation of the FM-index on
FPGA. As the FM-index does not need to perform all the character
comparisons required by the naive solution, this approach, when
implemented on Virtex-6 LX760, outperforms the previous
work by around 2× and, more importantly, provides a 133×
speedup compared to Bowtie on Xeon 2.5GHz (1-thread).
Approximate String Matching — Since [84] is only limited
to exact string matching, the authors extend their work as a
multi-threaded FPGA design called FHAST which supports up
to 2 mismatches [70]. In this implementation of FM-index,
each read represents a thread in the search and maximally
512 concurrent threads can be executed on a single Virtex-5
XC5VLX330 FPGA of Convey HC-1. Experimental results
show that FHAST achieves a speedup of up to 70× over
Bowtie running on Xeon L540B and E5520 (16-thread),
and a second version that runs on Convey Computers HC-
2ex provides a higher sensitivity for higher number of mis-
matches [85]. Using four Virtex-6 LX760 FPGAs, FHAST
version-II can provide a speedup up to 12× compared to
Bowtie on two Xeon E5-2634 (8-thread).
Besides the FM-index, other researchers propose different
FPGA solutions for approximate string matching. In [69],
Olson et al. propose an accelerator that is based on indexing
of reference with Smith-Waterman alignment performed on
FPGA. The authors optimize the size of the candidate align-
ment location (CAL) lookup table and partition the design into
eight Pico M-503 boards each with one XC6VLX240T FPGA.
This 8-FPGA system can achieve 31× speedup versus Bowtie
running on two Xeon E-5520 (8-thread).
Chen et al. [6], [86] also implement an accelerated short-read
aligner based on the seed-and-extension strategy [6]. The
basic idea of this strategy rests on the heuristic that only a
limited number of errors (substitutions, insertions and deletions)
exist in a significant alignment, and therefore long exact
match regions would exist. Thus aligning the exact matches,
i.e. the seeds, first, and then extending in both directions of the
sequence with approximate string matching can reduce the
search space enormously.
TABLE IV: Summary of the previous work on reconfigurable acceleration of short-read alignment.
Category | Paper | Algorithm & Method | Platform | Device | Speedup | Mbp/s | kbp/J
Basic Mapper | [4] | Brute-Force | - | Virtex-5 LX330 | 4× | 0.006 | -
Approximate String Matching | [6] | Hash + Needleman-Wunsch | - | Virtex-5 LX330 | 5.2× | 3.6 | -
Approximate String Matching | [69] | BFAST SW | Pico Computing M-503 | Virtex-6 LX240T×8 | 32× | 112 | 225
Approximate String Matching | [70] | FM-index | Convey HC-1 | Virtex-5 LX330×4 | 70× | 59.2 | -
Accurate Mapper | [71] | Compact Linear Systolic | - | Virtex-6 LX550T | 1× | 1 | -
Runtime Reconfigurable Mappers | [72] | FM-index + Smith-Waterman | Maxeler MAX3 | Virtex-6 SX475T | 293× | 516 | -
Runtime Reconfigurable Mappers | [73] | FM-index | Maxeler MAX3 | Virtex-6 SX475T | 18.1× | 97.5 | -
Runtime Reconfigurable Mappers | [74] | FM-index | Maxeler MPC-X1000 | Stratix-V×8 | 14.9× | 66.4 | -
Hybrid Systems | [75] | PerM Algorithm | Blade Server (AMD 2.8GHz CPU×2) | Virtex-5 LX330 | 42.9× | 11 | -
Hybrid Systems | [76] | BWT | Pico Computing M-505 | FPGAs×12 | 48× | - | -
Long Read Mapper | [77] | Smith-Waterman | - | Virtex-5 LX110T | 3.3× | - | -
BWA-MEM Accelerator | [78] [79] | Smith-Waterman | Alpha Data ADM-PCIE-7V3 | Virtex-7 VX690T-2 | 2× | 4.1 | 34.1
BWA-MEM Accelerator | [80] | Smith-Waterman | - | Virtex-7 VC707 | 26.4× | - | -
BWA-MEM Accelerator | [81] | BWT | Intel-Altera HARP | Stratix-V | 26%† | - | -
In these implementations, Chen et al. use a hash table as
the seed engine and apply the Needleman-Wunsch algorithm for
the extension. Using a Virtex-5 LX330 device, the hardware
aligner achieves a speedup of between 2.5× and 5.9× compared
to BWA at a higher sensitivity.
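A software analogy of such a hash-seeded aligner is sketched below in Python. It is not the design of [6] or [86]: the function names, the seed length k, the scoring values and the choice to extend by globally aligning the read against an equal-length reference window are simplifying assumptions for illustration.

```python
def build_seed_table(reference, k):
    """Hash table from every k-mer of the reference to its positions."""
    table = {}
    for pos in range(len(reference) - k + 1):
        table.setdefault(reference[pos:pos + k], []).append(pos)
    return table


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score of a and b, keeping only two DP rows."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def map_read(read, reference, table, k):
    """Seed with exact k-mers, then extend; returns (score, start) or None."""
    best = None
    for offset in range(len(read) - k + 1):
        for pos in table.get(read[offset:offset + k], []):
            start = pos - offset  # candidate alignment location
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            score = needleman_wunsch(read, window)
            if best is None or score > best[0]:
                best = (score, start)
    return best


reference = "GATTACAGATTTCCAGG"
table = build_seed_table(reference, 4)
hit = map_read("ACAGCTTTC", reference, table, 4)
```

The read ACAGCTTTC differs from the reference window at position 4 by a single substitution; two of its 4-mers still match exactly, so only one candidate location has to be scored instead of every reference position.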
Highly Accurate Mappers — On the other hand, Knodel et
al. [87] design a short-read mapper on FPGA that allows a
freely adjustable character mismatch threshold. This mapper
is based on a brute-force approach that relies on a massive
number of shift registers (Block RAM) and comparators to
perform matching, and it guarantees a 100% mapping rate
within the mismatch threshold. Compared to Bowtie on Core2
Duo 2.66GHz (2-thread), the hardware mapper can run 2×
faster and can align 20% more genome when implemented on
Virtex-6 XC6VLX240T FPGA.
The authors continue their work and design another short-
read mapper based on linear systolic computation scheme
to achieve better performance [71]. Implemented on Virtex-6
XC6VLX550T FPGA, the hardware mapper reports 2× more
locations than Bowtie while maintaining the execution latency
competitive to software executed on i7-2600K (4-thread). This
solution is also ported onto Virtex-7 XC7VX485T and is
realized as an open-source package called PoC-Align [88].
Runtime Reconfigurable Mappers — Some researchers man-
age to take advantage of the reconfigurable property of FPGA
device to further improve the performance of hardware short-
read aligner. In [89] and [72], Arram et al. introduce a
hardware design that incorporates specialized matchers for
exact and approximate sequence alignment, while at the same
time runtime reconfiguration is used to fully populate the
FPGA with each type of matcher. Such decoupling enables
the flexibility of optimizing each matcher according to the
intended workload, hence resulting in higher parallelism and
performance. With this scheme, results reported on Virtex-6
SX475T of Maxeler MAX3 are 293× faster than BWA, and
496× faster than Bowtie on Xeon X5650 (20-thread).
Using the same approach, the authors further extend their
work and design specialized filters that can align short reads to
a reference genome with a different edit distance [73]. These
filters are arranged in a pipeline according to an increasing
edit distance, in which reads unable to be mapped by a
given filter are forwarded to the next filter in the pipeline
for further processing. Specifically, runtime reconfiguration is
used to fully populate the FPGA with each filter in the pipeline in turn.
With specialised filters based on a
novel bidirectional backtracking version of the FM-index, it is
found that the alignment time on Maxeler MAX3 can be up
to 18.1× faster than BWA running on two X5650 (12-thread).
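The control flow of such a filter pipeline can be sketched in Python as follows. The sketch replaces the FM-index based filters with a plain edit-distance scan over fixed-length reference windows, so it only illustrates how unmapped reads are forwarded from one stage to the next; all function names and the stage thresholds (0, 1, 2) are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, keeping two DP rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def make_filter(max_dist):
    """Build a stage that maps reads within max_dist edits (brute force)."""
    def run(read, reference):
        for start in range(len(reference) - len(read) + 1):
            window = reference[start:start + len(read)]
            if edit_distance(read, window) <= max_dist:
                return start
        return None
    return run


def align_pipeline(reads, reference, max_dists=(0, 1, 2)):
    """Run the stages in order of increasing edit distance."""
    mapped, pending = {}, list(reads)
    for dist in max_dists:
        # In hardware, this is where the FPGA would be reconfigured
        # with the filter for the next edit distance.
        stage = make_filter(dist)
        unmapped = []
        for read in pending:
            pos = stage(read, reference)
            if pos is None:
                unmapped.append(read)  # forwarded to the next filter
            else:
                mapped[read] = (pos, dist)
        pending = unmapped
    return mapped, pending


reference = "GATTACAGATTTCCAGG"
mapped, unmapped = align_pipeline(["TTACAG", "ACAGCTTTC", "CCCCCCCC"], reference)
```

Exact reads are resolved by the cheapest stage, and only the remainder reaches the more expensive stages, which is the workload property the pipelined design exploits.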
Hybrid Systems — Hybrid aligner refers to the concept
of hardware-software co-design for accelerating short-read
alignment. In [76], Draghicescu et al. design BWT aligners
on twelve Virtex-5 505 FPGAs under Pico Computing’s
framework. The accelerator ties into the existing BWA software
and leaves the CPU to perform the tasks it is optimized for,
such as file handling and memory management. The proposed
system achieves a 48× speedup over the software version
of BWA running on 16 cores.
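BWT-based aligners typically rely on backward search over an FM-index [8]. As an illustration only, the Python sketch below performs exact backward search with a naively constructed index. It does not reflect the hardware organisation of [76], and the function names build_fm_index and backward_search are hypothetical.

```python
def build_fm_index(text: str):
    """Build a naive FM-index for `text` (a '$' terminator is appended).

    Assumption: quadratic-time construction, acceptable only for tiny inputs.
    """
    text += "$"
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)  # text[-1] is '$' when i == 0
    # first[c]: number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in sorted(set(text)):
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in first}
    for i, b in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (b == c)
    return suffix_array, first, occ


def backward_search(pattern: str, index):
    """Return the sorted text positions at which `pattern` occurs exactly."""
    suffix_array, first, occ = index
    low, high = 0, len(suffix_array)  # half-open suffix-array interval
    for c in reversed(pattern):       # consume the pattern right to left
        if c not in first:
            return []
        low = first[c] + occ[c][low]
        high = first[c] + occ[c][high]
        if low >= high:               # interval is empty: no occurrence
            return []
    return sorted(suffix_array[low:high])
```

Each pattern character costs two table lookups regardless of the reference length, which is one reason this search maps well onto simple, replicated hardware units.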
Tang et al. [75] also develop a hybrid accelerator in which
a host program running on a PC controls the loading and
storing of read and reference data to and from the hardware.
The hardware mapper is based on PerM [90], a software mapper
that uses periodic spaced seeds to significantly improve mapping
efficiency for large reference genomes. Meshes of processing
elements are implemented on a Virtex-5 LX330 to take
advantage of the spatial parallelism of the FPGA. Experiments
show that the accelerator delivers a 22.2× to 42.9× speedup
versus PerM on a six-core Xeon (Westmere) CPU.
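To make the idea of spaced seeds concrete, the following Python sketch indexes a reference by the bases under the "care" positions of a periodic mask, so that a mismatch under a "don't care" position still yields a candidate location. The mask 110110 and all function names are hypothetical choices for illustration; they are not PerM's actual seed patterns or the processing-element design of [75].

```python
from collections import defaultdict


def spaced_key(sequence: str, offset: int, mask: str) -> str:
    """Concatenate the bases of `sequence` that sit under the '1's of `mask`."""
    return "".join(sequence[offset + i] for i, m in enumerate(mask) if m == "1")


def build_seed_index(reference: str, mask: str):
    """Map each spaced key to the reference positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - len(mask) + 1):
        index[spaced_key(reference, pos, mask)].append(pos)
    return index


def candidate_positions(read: str, index, mask: str):
    """Candidate positions for a read whose length equals the mask length.

    Assumption: a single seed covers the whole read; candidates would still
    need verification against the reference.
    """
    return index.get(spaced_key(read, 0, mask), [])
```

For example, with the reference ACGTACGTTAGC and the mask 110110, the read ACATTA differs from the reference window at position 4 only under a "0" of the mask, so position 4 is still returned as a candidate.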
Other efforts mentioned above, such as [86] and [88], are also
tightly coupled with a software environment and presented as
hybrid systems for accelerating short-read alignment.
Acceleration of BWA-MEM — In addition to the above imple-
mentations, some research efforts are devoted to accelerating
certain alignment software. In particular, BWA-MEM has been
widely studied and accelerated by FPGA researchers because
of the accuracy and improved efficiency of the software [68].
The BWA-MEM algorithm consists of three main
procedures, which are executed in succession for each read
in the input: (a) SMEM (i.e. seed) Generation, (b) Seed
Extension, and (c) Output Generation.
In [80], Chen et al. propose an acceleration engine for
BWA-MEM that offloads seed extension, the computation
bottleneck, onto a Virtex VC707 FPGA. The authors
develop an efficient Smith-Waterman implementation that
supports massive task-level parallelism, sharply varying input
sizes, and software pruning strategies. Compared to BWA-MEM
software on a 6-core CPU with 24 threads, the proposed design
demonstrates a 26.4× improvement in execution latency.
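The seed extension kernel is built on Smith-Waterman local alignment [7]. The Python sketch below computes only the best local alignment score, under assumed scoring values (match +1, mismatch -1, linear gap -2) that are illustrative and not the parameters of BWA-MEM or of [80]; production aligners generally use affine gap penalties and banded extension.

```python
def smith_waterman_score(query: str, target: str,
                         match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between `query` and `target`.

    Assumption: linear gap penalty and score-only output (no traceback).
    """
    best = 0
    prev = [0] * (len(target) + 1)  # scores of the previous matrix row
    for q in query:
        curr = [0]                  # column 0 is always zero
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbours, so all cells on the same anti-diagonal can be evaluated concurrently; this is the property that systolic-array implementations exploit.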
The authors continue their work by offloading SMEM
generation onto the FPGA in the latest Intel-Altera HARP
system [81]. With a 16-PE accelerator engine, seed
generation is accelerated by 4×, and the overall SMEM seeding
stage by 26%, when compared with 16-thread CPU execution†.
Houtgast et al. [78], [79] also implement a hardware aligner
based on BWA-MEM, in which a systolic array architecture
accelerates the seed extension kernel with Smith-Waterman.
By offloading this computational bottleneck onto a
Virtex-7 XC7VX690T-2 FPGA, the entire system
delivers a total acceleration of about 45%. This work is
later extended by Ahmed et al. [91], where a hardware suffix
array is used to partially accelerate SMEM generation,
enabling a total application acceleration of 2.6× compared to
the original software version.
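The gap between kernel-level and application-level figures in the works above is the behaviour one would expect from Amdahl's law: when only part of the runtime is offloaded, the remaining software portion bounds the overall gain. The Python sketch below states the relation; the function name overall_speedup is hypothetical, and the runtime fractions of the cited systems are not given here, so any inputs are assumptions.

```python
def overall_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Whole-application speedup when only part of the runtime is accelerated.

    kernel_fraction: share of the original runtime spent in the offloaded kernel.
    kernel_speedup:  factor by which that kernel alone is accelerated.
    """
    if not 0.0 <= kernel_fraction <= 1.0 or kernel_speedup <= 0:
        raise ValueError("fraction must lie in [0, 1] and speedup must be positive")
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)
```

With purely hypothetical inputs, a kernel that accounts for half of the runtime and is accelerated by 4× gives overall_speedup(0.5, 4.0) = 1.6, which illustrates why accelerating further stages of the pipeline matters.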
IV. CONCLUSION
This paper reviews recent work on reconfigurable acceleration
of genetic sequence alignment by classifying it into
four main categories. Within these high-level categories, we
elaborate on and compare each work based on its features and
corresponding performance. We show that FPGA-based
solutions are promising candidates for the discussed topic, and
we believe future research should improve the design
portability and usability of accelerators, for example through
the concept of RC-accelerated aligners in the cloud. We hope
this survey can provide guidance on the choice of accelerator
for genetic sequence alignment, and hence promote the use of
FPGAs among the life sciences community.
ACKNOWLEDGMENT
The support of the Lee Family Scholarship, the EU Horizon
2020 Research and Innovation Programme under grant agreement
number 671653 and the UK EPSRC (EP/L00058X/1, EP/L016796/1,
EP/N031768/1 and EP/P010040/1) is gratefully acknowledged.
REFERENCES
[1] GenBank and WGS Statistics. [Online]. Available:
https://www.ncbi.nlm.nih.gov/genbank/statistics/
[2] J. Arram, T. Kaplan, W. Luk, and P. Jiang, “Leveraging FPGAs for Accelerating
Short Read Alignment,” IEEE/ACM Transactions on Computational Biology and
Bioinformatics, vol. 14, no. 3, pp. 668–677, 2017.
[3] D. T. Hoang, “Searching Genetic Databases on Splash 2,” in Proceedings IEEE
Workshop on FPGAs for Custom Computing Machines, 1993, pp. 185–191.
[4] E. Fernandez et al., “Exploration of Short Reads Genome Mapping in Hardware,”
in 2010 International Conference on Field Programmable Logic and Applications,
2010, pp. 360–363.
[5] Y. S. Dandass et al., “Accelerating String Set Matching in FPGA Hardware for
Bioinformatics Research,” BMC Bioinformatics, vol. 9, no. 1, p. 197, 2008.
[6] Y. Chen et al., “An FPGA Aligner for Short Read Mapping,” in 22nd International
Conference on Field Programmable Logic and Applications, 2012, pp. 511–514.
[7] T. Smith and M. Waterman, “Identification of Common Molecular Subsequences,”
Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
[8] P. Ferragina and G. Manzini, “An Experimental Study of an Opportunistic Index,” in
Proceedings of the Twelfth Annual ACM-SIAM Symposium on Discrete Algorithms,
ser. SODA ’01, 2001, pp. 269–278.
[9] S. B. Needleman and C. D. Wunsch, “A General Method Applicable to the Search
for Similarities in the Amino Acid Sequence of Two Proteins,” Journal of Molecular
Biology, vol. 48, no. 3, pp. 443–453, 1970.
[10] M. Burrows and D. Wheeler, “A Block-sorting Lossless Data Compression Algo-
rithm,” Digital Equipment Corporation, Tech. Rep., 1994.
[11] B. Langmead et al., “Ultrafast and memory-efficient alignment of short DNA
sequences to the human genome,” Genome Biology, vol. 10, no. 3, p. R25,
2009.
[12] R. Li et al., “SOAP2: an improved ultrafast tool for short read alignment,”
Bioinformatics, vol. 25, no. 15, pp. 1966–1967, 2009.
[13] H. Li and R. Durbin, “Fast and accurate short read alignment with
Burrows–Wheeler transform,” Bioinformatics, vol. 25, no. 14, pp. 1754–1760, 2009.
[14] R. Durbin et al., Biological Sequence Analysis: Probabilistic Models of Proteins
and Nucleic Acids. Cambridge University Press, 1998.
[15] S. A. Guccione and E. Keller, “Gene Matching Using JBits,” in Field-
Programmable Logic and Applications: Reconfigurable Computing Is Going
Mainstream: 12th International Conference, FPL 2002 Montpellier, France, 2002
Proceedings, 2002, pp. 1168–1171.
[16] K. Puttegowda et al., “A Run-Time Reconfigurable System for Gene-Sequence
Searching,” in 16th International Conference on VLSI Design, 2003. Proceedings.,
2003, pp. 561–566.
[17] C. W. Yu et al., “A Smith-Waterman Systolic Cell,” in Field Programmable Logic
and Application: 13th International Conference, FPL 2003, Lisbon, Portugal, 2003
Proceedings, 2003, pp. 375–384.
[18] T. V. Court and M. C. Herbordt, “Families of FPGA-based Algorithms for Ap-
proximate String Matching,” in Proceedings. 15th IEEE International Conference
on Application-Specific Systems, Architectures and Processors, 2004., 2004, pp.
354–364.
[19] ——, “Families of FPGA-Based Accelerators for Approximate String Matching,”
Microprocessors and Microsystems, vol. 31, no. 2, pp. 135–145, 2007.
[20] X. Jiang et al., “A Reconfigurable Accelerator for Smith-Waterman Algorithm,”
IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 54, no. 12, pp.
1077–1081, 2007.
[21] T. Oliver et al., “Hyper Customized Processors for Bio-sequence Database Scanning
on FPGAs,” in Proceedings of the 2005 ACM/SIGDA 13th International Symposium
on Field-programmable Gate Arrays, ser. FPGA ’05, 2005, pp. 229–237.
[22] M. Gok and C. Yilmaz, “Efficient Cell Designs for Systolic Smith-Waterman
Implementations,” in 2006 International Conference on Field Programmable Logic
and Applications, 2006, pp. 1–4.
[23] P. Faes et al., “Scalable Hardware Accelerator for Comparing DNA and Protein
Sequences,” in Proceedings of the 1st International Conference on Scalable
Information Systems, ser. InfoScale ’06, 2006.
[24] A. Boukerche et al., “Reconfigurable Architecture for Biological Sequence Com-
parison in Reduced Memory Space,” in 2007 IEEE International Parallel and
Distributed Processing Symposium, 2007, pp. 1–8.
[25] I. T. Li et al., “160-fold acceleration of the Smith-Waterman algorithm using a field
programmable gate array (FPGA),” BMC Bioinformatics, vol. 8, no. 185, 2007.
[26] P. Zhang et al., “Implementation of the Smith-Waterman Algorithm on a Reconfig-
urable Supercomputing Platform,” in Proceedings of the 1st International Workshop
on High-performance Reconfigurable Computing Technology and Applications:
Held in Conjunction with SC07, ser. HPRCTA ’07, 2007, pp. 39–48.
[27] K. Benkrid et al., “High Performance Biological Pairwise Sequence Alignment:
FPGA versus GPU versus Cell BE versus GPP,” International Journal of Recon-
figurable Computing, vol. 2012, no. 752910, 2012.
[28] ——, “A Highly Parameterized and Efficient FPGA-Based Skeleton for Pairwise
Biological Sequence Alignment,” IEEE Transactions on Very Large Scale Integra-
tion (VLSI) Systems, vol. 17, no. 4, pp. 561–570, 2009.
[29] S. Lloyd and Q. O. Snell, “Hardware Accelerated Sequence Alignment with
Traceback,” International Journal of Reconfigurable Computing, vol. 2009, no.
762362, 2009.
[30] B. Morgenstern et al., “DIALIGN: Finding Local Similarities by Multiple Sequence
Alignment,” Bioinformatics, vol. 14, no. 3, pp. 290–294, 1998.
[31] A. Boukerche et al., “A Hardware Accelerator for the Fast Retrieval of DIALIGN
Biological Sequence Alignments in Linear Space,” IEEE Transactions on Comput-
ers, vol. 59, no. 6, pp. 808–821, 2010.
[32] R. P. Jacobi et al., “Reconfigurable Systems for Sequence Alignment and for
General Dynamic Programming,” Genetics and molecular research : GMR, vol. 4,
no. 3, pp. 543–552, 2005.
[33] O. Creţ et al., “A Hardware Algorithm for The Exact Subsequence Matching
Problem in DNA Strings,” Romanian Journal of Information Science and Technology,
vol. 12, no. 1, pp. 51–67, 2009.
[34] Y. Liu et al., “An FPGA-Based Web Server for High Performance Biological
Sequence Alignment,” in 2009 NASA/ESA Conference on Adaptive Hardware and
Systems, 2009, pp. 361–368.
[35] S. F. Altschul et al., “Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389–3402,
1997.
[36] E. Sotiriades et al., “FPGA based Architecture for DNA Sequence Comparison
and Database Search,” in Proceedings 20th IEEE International Parallel Distributed
Processing Symposium, 2006, 8 pp.
[37] ——, “Some Initial Results on Hardware BLAST Acceleration with a Reconfig-
urable Architecture,” in Proceedings 20th IEEE International Parallel Distributed
Processing Symposium, 2006, 8 pp.
[38] E. Sotiriades and A. Dollas, “A General Reconfigurable Architecture for the
BLAST Algorithm,” The Journal of VLSI Signal Processing Systems for Signal,
Image, and Video Technology, vol. 48, no. 3, pp. 189–208, 2007.
[39] ——, “Design Space Exploration for the BLAST Algorithm Implementation,”
in 15th Annual IEEE Symposium on Field-Programmable Custom Computing
Machines (FCCM 2007), 2007, pp. 323–326.
[40] F. Xia et al., “Hardware BLAST Algorithms with Multi-seeds Detection and
Parallel Extension,” in Reconfigurable Computing: Architectures, Tools and Appli-
cations: 4th International Workshop, ARC 2008, London, UK. Proceedings, 2008.
[41] ——, “FPGA-Based Accelerators for BLAST Families with Multi-Seeds Detection
and Parallel Extension,” in 2008 2nd International Conference on Bioinformatics
and Biomedical Engineering, 2008, pp. 58–62.
[42] ——, “Families of FPGA-Based Accelerators for BLAST Algorithm with Multi-
seeds Detection and Parallel Extension,” in Bioinformatics Research and De-
velopment: Second International Conference, BIRD 2008 Vienna, Austria, 2008
Proceedings, 2008, pp. 43–57.
[43] Y. Chen et al., “Reconfigurable Accelerator for the Word-Matching Stage of
BLASTN,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 21, no. 4, pp. 659–669, 2013.
[44] Y. Yamaguchi et al., “High Speed Homology Search Using Run-Time Reconfigura-
tion,” in Field-Programmable Logic and Applications: Reconfigurable Computing Is
Going Mainstream: 12th International Conference, FPL 2002 Montpellier, France,
2002 Proceedings, 2002, pp. 281–291.
[45] X. Guo et al., “A Systolic Array-Based FPGA Parallel Architecture for the BLAST
Algorithm,” ISRN Bioinformatics, vol. 2012, pp. 1–11, 2012.
[46] P. Krishnamurthy et al., “Biosequence Similarity Search on the Mercury System,”
Journal of VLSI Signal Processing System, vol. 49, no. 1, pp. 101–121, 2007.
[47] J. D. Buhler et al., “Mercury BLASTN : Faster DNA Sequence Comparison Using
a Streaming Hardware Architecture,” in Proceedings - 3rd Annual Reconfigurable
Systems Summer Institute, 2007.
[48] J. Lancaster et al., “Acceleration of Ungapped Extension in Mercury BLAST,”
Microprocessors and Microsystems, vol. 33, no. 4, pp. 281–289, 2009.
[49] S. Datta et al., “RC-BLASTn: Implementation and Evaluation of the BLASTn
Scan Function,” in 2009 17th IEEE Symposium on Field Programmable Custom
Computing Machines, 2009, pp. 88–95.
[50] P. Afratis et al., “A Rate-based Prefiltering Approach to BLAST Acceleration,”
in 2008 International Conference on Field Programmable Logic and Applications,
2008, pp. 631–634.
[51] J. H. Park et al., “CAAD BLASTn: Accelerated NCBI BLASTn with FPGA
prefiltering,” in Proceedings of 2010 IEEE International Symposium on Circuits
and Systems, 2010, pp. 3797–3800.
[52] S. Kasap et al., “High Performance FPGA-based Core for BLAST Sequence
Alignment with the Two-Hit Method,” in 2008 8th IEEE International Conference
on BioInformatics and BioEngineering, 2008, pp. 1–7.
[53] H. Abelsson et al., “Accelerating NCBI BLAST FPGA Supercomputing Coming
of Age,” in Proceedings of the CUG Conference, Seattle, Wash, USA, 2007, pp. 5+.
[54] B. C. Lam et al., “BSW: FPGA-accelerated BLAST-Wrapped Smith-Waterman
aligner,” in 2013 International Conference on Reconfigurable Computing and
FPGAs (ReConFig), 2013, pp. 1–7.
[55] R. D. Chamberlain et al., “The Mercury System: Exploiting Truly Fast Hardware
for Data Search,” in Proceedings of the International Workshop on Storage Network
Architecture and Parallel I/Os, ser. SNAPI ’03, 2003, pp. 65–72.
[56] A. Jacob et al., “FPGA-accelerated seed generation in Mercury BLASTP,” in 15th
Annual IEEE Symposium on Field-Programmable Custom Computing Machines
(FCCM 2007), 2007, pp. 95–106.
[57] B. Harris et al., “A Banded Smith-Waterman FPGA Accelerator for Mercury
BLASTP,” in 2007 International Conference on Field Programmable Logic and
Applications, 2007, pp. 765–769.
[58] A. Jacob et al., “Mercury BLASTP: Accelerating Protein Sequence Alignment,”
ACM Transactions on Reconfigurable Technology and Systems, vol. 1, no. 2, p. 9, 2008.
[59] M. C. Herbordt et al., “Single Pass, BLAST-Like, Approximate String Matching on
FPGAs,” in 2006 14th Annual IEEE Symposium on Field-Programmable Custom
Computing Machines, 2006, pp. 217–226.
[60] ——, “Single Pass Streaming BLAST on FPGAs,” Parallel Computing, vol. 33,
no. 10-11, pp. 741–756, 2007.
[61] J. H. Park et al., “CAAD BLASTP: NCBI BLASTP Accelerated with FPGA-Based
Accelerated Pre-Filtering,” in 2009 17th IEEE Symposium on Field Programmable
Custom Computing Machines, 2009, pp. 81–87.
[62] K. Muriki et al., “RC-BLAST: towards a portable, cost-effective open source
hardware implementation,” in 19th IEEE International Parallel and Distributed
Processing Symposium, 2005, 8 pp.
[63] S. Aluru and N. Jammula, “A Review of Hardware Acceleration for Computational
Genomics,” IEEE Design Test, vol. 31, no. 1, pp. 19–30, 2014.
[64] J. D. Thompson et al., “CLUSTAL W: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific gap
penalties and weight matrix choice,” Nucleic Acids Research, vol. 22, no. 22, pp.
4673–4680, 1994.
[65] T. Oliver et al., “Multiple Sequence Alignment on an FPGA,” in 11th International
Conference on Parallel and Distributed Systems, vol. 2, 2005, pp. 326–330.
[66] X. Lin et al., “To Accelerate Multiple Sequence Alignment using FPGAs,” in
Eighth International Conference on High-Performance Computing in Asia-Pacific
Region (HPCASIA’05), 2005, pp. 5 pp.–180.
[67] S. Lloyd and Q. O. Snell, “Accelerated large-scale multiple sequence alignment,”
BMC Bioinformatics, vol. 12, no. 1, p. 466, 2011.
[68] H. Li, “Aligning sequence reads, clone sequences and assembly contigs with BWA-
MEM,” arXiv preprint arXiv:1303.3997v2, 2013.
[69] C. B. Olson et al., “Hardware Acceleration of Short Read Mapping,” in 2012
IEEE 20th International Symposium on Field-Programmable Custom Computing
Machines, 2012, pp. 161–168.
[70] E. Fernandez et al., “Multithreaded FPGA acceleration of DNA Sequence Map-
ping,” in 2012 IEEE Conference on High Performance Extreme Computing, 2012,
pp. 1–6.
[71] T. B. Preußer et al., “Short-Read Mapping by a Systolic Custom FPGA Computa-
tion,” in 2012 IEEE 20th International Symposium on Field-Programmable Custom
Computing Machines, 2012, pp. 169–176.
[72] J. Arram et al., “Reconfigurable Acceleration of Short Read Mapping,” in 2013
IEEE 21st Annual International Symposium on Field-Programmable Custom Com-
puting Machines, 2013, pp. 210–217.
[73] ——, “Reconfigurable filtered acceleration of short read alignment,” in 2013
International Conference on Field-Programmable Technology, 2013, pp. 438–441.
[74] ——, “Ramethy: Reconfigurable Acceleration of Bisulfite Sequence Alignment,”
in Proceedings of the 2015 ACM/SIGDA International Symposium on Field-
Programmable Gate Arrays, ser. FPGA ’15, 2015, pp. 250–259.
[75] W. Tang et al., “Accelerating Millions of Short Reads Mapping on a Heterogeneous
Architecture with FPGA Accelerator,” in 2012 IEEE 20th International Symposium
on Field-Programmable Custom Computing Machines, 2012, pp. 184–187.
[76] P. Draghicescu et al., Inexact Search Acceleration on FPGAs Using the Burrows-
Wheeler Transform. Pico Computing, 2012.
[77] P. Chen et al., “Accelerating the Next Generation Long Read Mapping with the
FPGA-Based System,” IEEE/ACM Transactions on Computational Biology and
Bioinformatics, vol. 11, no. 5, pp. 840–852, 2014.
[78] E. J. Houtgast et al., “An FPGA-based Systolic Array to Accelerate the BWA-MEM
Genomic Mapping Algorithm,” in 2015 International Conference on Embedded
Computer Systems: Architectures, Modeling, and Simulation, 2015, pp. 221–227.
[79] ——, “Power-Efficiency Analysis of Accelerated BWA-MEM Implementations
on Heterogeneous Computing Platforms,” in 2016 International Conference on
ReConFigurable Computing and FPGAs (ReConFig), 2016, pp. 1–8.
[80] Y. T. Chen et al., “A Novel High-Throughput Acceleration Engine for Read Align-
ment,” in 2015 IEEE 23rd Annual International Symposium on Field-Programmable
Custom Computing Machines, 2015, pp. 199–202.
[81] M. C. F. Chang et al., “The SMEM Seeding Acceleration for DNA Sequence Align-
ment,” in 2016 IEEE 24th Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), 2016, pp. 32–39.
[82] A. D. Smith et al., “Using quality scores and longer reads improves accuracy of
Solexa read mapping,” BMC Bioinformatics, vol. 9, no. 1, p. 128, 2008.
[83] H. Li et al., “Mapping short DNA sequencing reads and calling variants using
mapping quality scores,” Genome Research, vol. 18, no. 11, pp. 1851–1858, 2008.
[84] E. Fernandez et al., “String Matching in Hardware Using the FM-Index,” in
Proceedings of the 2011 IEEE 19th Annual International Symposium on Field-
Programmable Custom Computing Machines, ser. FCCM ’11, 2011, pp. 218–225.
[85] ——, “FHAST: FPGA-Based Acceleration of Bowtie in Hardware,” IEEE/ACM
Transactions on Computational Biology and Bioinformatics, vol. 12, no. 5, pp.
973–981, 2015.
[86] Y. Chen et al., “A hybrid short read mapping accelerator,” BMC Bioinformatics,
vol. 14, no. 1, p. 67, 2013.
[87] O. Knodel et al., “Next-generation massively parallel short-read mapping on
FPGAs,” in ASAP 2011 - 22nd IEEE International Conference on Application-
specific Systems, Architectures and Processors, 2011, pp. 195–201.
[88] T. B. Preußer et al., “PoC-align: An open-source alignment accelerator using
FPGAs,” in 2014 International Conference on ReConFigurable Computing and
FPGAs (ReConFig14), 2014, pp. 1–6.
[89] J. Arram et al., “Hardware Acceleration of Genetic Sequence Alignment,” in
Reconfigurable Computing: Architectures, Tools and Applications: 9th International
Symposium, ARC 2013, Los Angeles, CA, USA. Proceedings, 2013, pp. 13–24.
[90] Y. Chen et al., “PerM: efficient mapping of short sequencing reads with periodic
full sensitive spaced seeds,” Bioinformatics, vol. 25, no. 19, pp. 2514–2521, 2009.
[91] N. Ahmed et al., “Heterogeneous Hardware/Software Acceleration of the BWA-
MEM DNA Alignment Algorithm,” in Proceedings of the IEEE/ACM International
Conference on Computer-Aided Design, ser. ICCAD ’15, 2015, pp. 240–246.
