Abstract - Genetic sequence alignment has always been a computational challenge in bioinformatics. Depending on the problem size, software-based aligners can take multiple CPU-days to process the sequence data, creating a bottleneck in the bioinformatic analysis flow. Reconfigurable accelerators can achieve high performance for such computation by providing massive parallelism, but at the expense of programming flexibility, and thus have not been commensurately used by practitioners. Therefore, this paper aims to provide a thorough survey of the proposed accelerators by giving a qualitative categorization based on their algorithms and speedup. A comprehensive comparison between works is also presented so as to guide selection for biologists and to provide insight into future research directions for FPGA scientists.
I. INTRODUCTION
Genetic sequence alignment is an important and fundamental aspect of modern molecular biology. However, the exponential growth of bio-sequence databases [1] and the significant improvement of next-generation sequencing (NGS) machines have posed a computational challenge for general-purpose processors, especially as the performance of NGS machines has been improving at a rate faster than Moore's law [2].
In the literature, FPGA technology has been shown to be a promising candidate for accelerating genetic sequence alignment. Because of their highly parallel, bit-oriented architecture, FPGAs have been leveraged to accelerate various alignment algorithms since 1992 [3]. Therefore, this survey covers traditional sequence analysis such as pairwise sequence alignment and genomic database search on FPGAs. In addition, the recently evolved NGS technology and its applications are also discussed to demonstrate the benefits of FPGAs in accelerating genetic sequence alignment.
The rest of this paper is organized as follows: Section II discusses the commonly used techniques and algorithms in pattern matching. Previous work on reconfigurable acceleration of genetic sequence alignment is then elaborated in Section III. Concluding remarks are drawn in Section IV.
II. BACKGROUND AND ALGORITHMS
Over the last two decades, FPGA researchers have applied different techniques to accelerate genetic sequence alignment. For example, Fernandez et al. implement a direct comparison design where bases from a streaming reference sequence and a stationary short read are compared [4]. Other techniques, such as the Aho-Corasick algorithm [5] or hash tables [6], are also adopted in FPGA aligners.
In this section, the most commonly used algorithms for genetic sequence alignment, Smith-Waterman [7] and the FM-index [8], are described to provide a background for the various acceleration approaches. Other algorithms, such as the seed-and-extension strategy, will also be briefly mentioned in Section III from the applications perspective.
A. Smith-Waterman Algorithm
The Smith-Waterman algorithm is a dynamic programming (DP) technique based on the Needleman-Wunsch algorithm [9]. A scoring matrix V is used to reveal the optimal local alignment between two sequences S and T, where |S| = n and |T| = m. Each entry in matrix V is calculated recursively according to equations (1) and (2).
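The recurrence cited as equations (1) and (2) can be written in the standard Smith-Waterman form. The version below is a reconstruction under the scoring function σ and the values of Table I, not necessarily the authors' exact notation; the gap symbol "-" and the subscripts S_i and T_j (the i-th and j-th characters of S and T) are assumptions.

```latex
\begin{align}
V(i, 0) &= V(0, j) = 0, \qquad 0 \le i \le n,\; 0 \le j \le m \tag{1} \\
V(i, j) &= \max \begin{cases}
0 \\
V(i-1, j-1) + \sigma(S_i, T_j) \\
V(i-1, j) + \sigma(S_i, -) \\
V(i, j-1) + \sigma(-, T_j)
\end{cases} \tag{2}
\end{align}
```

The zero term in (2) is what makes the alignment local: a score never falls below zero, so a new alignment may start at any cell.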
The function σ(x, y) determines the relative weighting of matches, mismatches, deletions and insertions between characters x and y. The weighting, on the other hand, can be adjusted according to different alignment requirements. For example, the insertion and deletion penalties can be set to higher value than the substitution penalty if the presence of redundant characters is less acceptable than character difference.

TABLE I: Example of calculating the scoring matrix for the sequences S = AT and T = CTCATGC.

       -  C  T  C  A  T  G  C
    -  0  0  0  0  0  0  0  0
    A  0  0  0  0  2  1  0  0
    T  0  0  2  1  1  4  3  2

Match: σ(x, x) = +2, Mismatch: σ(x, y) = −1
Deletion: σ(x, -) = −1, Insertion: σ(-, x) = −1

Table I illustrates the calculation for the scoring matrix for the alignment of sequence S = AT to a reference T = CTCATGC. The optimal alignment can be obtained by completing the matrix V and the highest score indicates if a pattern can be mapped to another sequence within the allowed diversity. In Table I the highest score obtained is 4 which indicates that pattern S can be exactly mapped to sequence T. By backtracing from the highest score to the entry in which the score becomes zero, the optimal alignment can be constructed as a string representation.
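The matrix fill and backtrace described above can be sketched in software as follows. This is a minimal illustration using the scores of Table I, not code from any surveyed accelerator; the function names smith_waterman and traceback and their parameters are assumptions made for this example.

```python
def smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Fill the local-alignment scoring matrix V for pattern s against text t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                0,                      # restart: local alignment never goes negative
                V[i - 1][j - 1] + sub,  # match or mismatch
                V[i - 1][j] + gap,      # character of s aligned to a gap
                V[i][j - 1] + gap,      # character of t aligned to a gap
            )
    return V


def traceback(V, s, t, match=2, mismatch=-1, gap=-1):
    """Walk back from the highest score until a zero entry is reached."""
    cells = ((r, c) for r in range(len(V)) for c in range(len(V[0])))
    i, j = max(cells, key=lambda rc: V[rc[0]][rc[1]])
    top, bottom = [], []
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sub:
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))
```

For S = AT and T = CTCATGC, smith_waterman returns the rows of Table I, and traceback returns the pair ("AT", "AT"), matching the exact mapping with score 4.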
B. FM-Index
The FM-index is a data structure that combines the properties of the suffix array with the Burrows-Wheeler transform (BWT) [10]. Such a data structure provides an efficient mechanism to locate all the occurrences of a pattern P in a long reference sequence R. As a result, the BWT and FM-index have been broadly employed in many software tools for short-read alignment, such as Bowtie [11], SOAP2 [12] and BWA [13].
To compute the BWT of a reference sequence R, i.e. BWT(R), R is first terminated with a unique character, '$', which is lexicographically the smallest value. Then, all the rotations of the text are generated and sorted lexicographically. The suffix array can be obtained by considering the characters before '$' in each entry of the rotation list. BWT(R) can also be formed by extracting and concatenating the last characters of all the entries on the sorted list. Table IIa shows an example of deriving the BWT of the sequence R = ACACGT. The strings preceding the '$' sign in the sorted rotations form the suffix array (SA), which indicates the position of each possible suffix in the original string.
TABLE II: (a) Example of deriving the suffix array and BWT of reference sequence R. (b) i(x) and c(n, x) functions for the sequence R.

(b)
    n      A  C  G  T
    0      0  0  0  0
    1      0  0  0  1
    2      0  0  0  1
    3      0  1  0  1
    4      1  1  0  1
    5      2  1  0  1
    6      2  2  0  1
    7      2  2  1  1
    i(x)   1  3  5  6
After generating the suffix array, BWT(R) is sorted to form the i and c functions. For each element x of the alphabet of R, i(x) is defined as the index of its first occurrence in the sorted BWT(R), while for each index n in BWT(R) and for each character x in the alphabet, c(n, x) stores the number of occurrences of x in BWT(R) in the range [0, n−1]. Table IIb illustrates the i(x) and c(n, x) functions for the sequence R.
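The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than an implementation from any of the surveyed works; the function name bwt_tables and the dictionary layout are assumptions made for this example, and the naive rotation sort is only suitable for short sequences.

```python
def bwt_tables(r, alphabet="ACGT"):
    """Return (suffix array, BWT string, i function, c function) for r."""
    text = r + "$"  # '$' sorts before A, C, G and T in ASCII order
    n = len(text)
    # Sorting the rotation start positions yields the suffix array.
    sa = sorted(range(n), key=lambda k: text[k:] + text[:k])
    # The last character of the rotation starting at k is text[k - 1].
    bwt = "".join(text[(k - 1) % n] for k in sa)
    sorted_bwt = sorted(bwt)
    # i(x): index of the first occurrence of x in the sorted BWT.
    i_func = {x: sorted_bwt.index(x) for x in alphabet if x in bwt}
    # c_func[k][x]: occurrences of x in bwt[0:k], for k = 0 .. n.
    c_func = [{x: 0 for x in alphabet}]
    for ch in bwt:
        row = dict(c_func[-1])
        if ch in row:
            row[ch] += 1
        c_func.append(row)
    return sa, bwt, i_func, c_func
```

For R = ACACGT the sketch yields BWT(R) = T$CAACG and i(x) = 1, 3, 5, 6 for A, C, G, T, in agreement with the i(x) and c(n, x) values listed in Table II.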
Essentially, the FM-index is a pattern searching technique that operates on the i(x) and c(n, x) functions recursively. Two pointers, top and bottom, are defined to perform the search. top refers to the index of the suffix array element where a specific pattern is first located, and bottom is the location where the pattern can be last found. If bottom points to an index that is less than or equal to the index pointed to by top, the pattern does not occur in the text.
To locate a specific pattern P with the FM-index, one character is processed at a time, beginning with the last character of the pattern. The top and bottom pointers are first initialized with the first and last indices of the c(n, x) function respectively. Then both pointers are updated according to the following equations:
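With i(x) and c(n, x) as defined above, the standard backward-search update for the current pattern character x takes the following form. This is a reconstruction consistent with Table IIb and with the termination condition stated earlier, not necessarily the authors' exact notation; the subscript "new" is an assumption.

```latex
\begin{align}
top_{new} &= i(x) + c(top, x) \\
bottom_{new} &= i(x) + c(bottom, x)
\end{align}
```

As a worked example, searching P = AC in R = ACACGT starts with top = 0 and bottom = 7. Processing C gives top = 3 + c(0, C) = 3 and bottom = 3 + c(7, C) = 5; processing A gives top = 1 + c(3, A) = 1 and bottom = 1 + c(5, A) = 3. Since bottom > top, the pattern occurs, and the two suffix array entries in the range [1, 3) give its positions.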
Notice that the time needed to locate the pattern in the reference sequence is linear in the length of P rather than in the length of R.
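The search procedure can be sketched as follows, using the usual backward-search update top = i(x) + c(top, x) and bottom = i(x) + c(bottom, x). This is an illustrative software model under that assumption, not a description of any surveyed hardware design; the names build_fm_index and fm_search are hypothetical, and the index is built with a naive rotation sort.

```python
def build_fm_index(r, alphabet="ACGT"):
    """Build the suffix array and the i and c functions for reference r."""
    text = r + "$"
    n = len(text)
    sa = sorted(range(n), key=lambda k: text[k:] + text[:k])
    bwt = [text[(k - 1) % n] for k in sa]
    first = sorted(bwt)
    i_func = {x: first.index(x) for x in alphabet if x in first}
    c_func = [{x: 0 for x in alphabet}]
    for ch in bwt:
        row = dict(c_func[-1])
        if ch in row:
            row[ch] += 1
        c_func.append(row)
    return sa, i_func, c_func


def fm_search(pattern, sa, i_func, c_func):
    """Return the sorted start positions of pattern in the reference."""
    top, bottom = 0, len(c_func) - 1  # first and last indices of c(n, x)
    for x in reversed(pattern):       # start from the last pattern character
        if x not in i_func:
            return []                 # character absent from the reference
        top = i_func[x] + c_func[top][x]
        bottom = i_func[x] + c_func[bottom][x]
        if bottom <= top:
            return []                 # empty range: no occurrence
    return sorted(sa[k] for k in range(top, bottom))
```

The loop body performs a constant number of table lookups per pattern character, which is why the search cost depends on the length of P and not on the length of R.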
III. FPGA ACCELERATION OF GENOMIC ALIGNMENT
Depending on the application, different alignment operations can be applied to perform different genomic analyses. In this section, existing work on FPGA-accelerated genomic aligners is studied from the application perspective. Additionally, the interplay between the hardware characteristics of FPGAs and the algorithmic techniques is briefly described.
A. Pairwise Sequence Alignment
A fundamental problem in the field of computational biology is the comparison and alignment of two sequences of DNA strands. Depending on the applications, the alignment results can provide useful biological or medical information such as evolutionary development of a species, or identification of causal cancer genes and genetic diseases [14] .
As mentioned, the Smith-Waterman algorithm is the most commonly used algorithm for genetic sequence alignment, particularly pairwise sequence alignment (PSA). However, because of the enormous size of DNA sequences, purely software-based implementations of the algorithm suffer from prolonged execution latency. To accelerate PSA, FPGA devices have been extensively used to reduce the time complexity from O(mn) in software to O(m+n) in parallel processing hardware.
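The reduction from O(mn) to O(m+n) follows from the data dependencies of the scoring matrix: every cell on one anti-diagonal depends only on the two previous anti-diagonals, so a systolic array can update a whole anti-diagonal per step. The sketch below models this ordering sequentially in software and counts the steps; it is an illustration of the principle under the Table I scores, and the function name wavefront_smith_waterman is an assumption. Non-empty input sequences are assumed.

```python
def wavefront_smith_waterman(s, t, match=2, mismatch=-1, gap=-1):
    """Fill V one anti-diagonal at a time and count the wavefront steps."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    steps = 0
    for d in range(2, n + m + 1):  # d = i + j identifies one anti-diagonal
        # Cells with the same d are mutually independent; in hardware each
        # would be handled by its own processing element in the same cycle.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(
                0,
                V[i - 1][j - 1] + sub,  # from anti-diagonal d - 2
                V[i - 1][j] + gap,      # from anti-diagonal d - 1
                V[i][j - 1] + gap,      # from anti-diagonal d - 1
            )
        steps += 1
    return V, steps
```

For S = AT and T = CTCATGC the matrix is completed in 8 steps (n + m − 1) instead of the 14 cell updates (n × m) a sequential loop performs, while producing the same values as Table I.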
Simple Aligner - In [15] and [16], the authors present one of the first FPGA accelerators for the Smith-Waterman algorithm. A reconfigurable systolic array is adopted to provide a large amount of parallelism. Also, runtime reconfiguration is used to write one string directly into the FPGA's bitstream.
Yu et al. [17] later propose an improved, reconfiguration-free systolic array architecture that allows the accelerator to be deployed on FPGAs from different vendors. Experiments show that the proposed solution can achieve 814 entry/cell updates per second (GCUPs) when implemented on a Virtex-E XCV1000E-6 FPGA.
Affine Gap Cost Model - Very often, the alignment of two sequences favours extending an existing gap over opening a new one. Therefore, instead of giving a fixed negative score to every gap, biologists usually apply an affine gap penalty when computing the scoring matrix.
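An affine gap penalty can be modelled with the commonly used three-matrix formulation, in which separate matrices track alignments ending in a gap. The sketch below is illustrative only: the function name affine_local_score and the values gap_open = 3 and gap_extend = 1 are assumptions, with a gap of length k costing gap_open + (k − 1) × gap_extend.

```python
NEG_INF = float("-inf")


def affine_local_score(s, t, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score of s and t under an affine gap penalty."""
    n, m = len(s), len(t)
    H = [[0] * (m + 1) for _ in range(n + 1)]        # best score ending at (i, j)
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a gap in s
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # ends with a gap in t
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an open gap or open a new one from H.
            E[i][j] = max(E[i][j - 1] - gap_extend, H[i][j - 1] - gap_open)
            F[i][j] = max(F[i - 1][j] - gap_extend, H[i - 1][j] - gap_open)
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best
```

Aligning AAAATTTT against AAAACCTTTT gives 12 with these values (eight matches minus one gap of length two costing 4), whereas the best gap-free alignment scores only 10. Setting gap_open equal to gap_extend reduces the model to the fixed per-character gap penalty.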
In [18] and [19], the authors propose the first FPGA-based accelerator that supports an affine gap penalty. Systolic matching cells are implemented to support different cost functions and alignment algorithms such as Needleman-Wunsch or Smith-Waterman. Compared to a software implementation on a Xeon 3 GHz processor, a speedup of 370× can be achieved when implemented on a Virtex-II Pro XC2VP70 FPGA.
Similarly, Jiang et al. [20] implement a reconfigurable accelerator that supports an affine gap penalty. In this design, a modified equation is proposed to improve the mapping efficiency of a processing element (PE). A special floor plan is applied to the fine-grain parallel PE array to cut down its routing delay. With these two techniques, the proposed implementation on Stratix EP1S30 improves performance by 345× versus similar software on a Xeon 2.8 GHz processor.
Most research efforts, such as [21]-[27], utilize a systolic array or fine-grain PE architecture to accelerate PSA with an affine gap penalty. Experiments show that, when compared to state-of-the-art software implementations, the reconfigurable accelerators can achieve a speedup from around 40× to 246×.
Accelerator with Traceback - To further improve accelerator performance, some FPGA designs realise the traceback procedure in hardware instead of relying on the host CPU to perform backtracing. For example, Benkrid et al. [28] implement a Smith-Waterman accelerator on Virtex-II XC2VP100 where a pipeline of PEs is used to calculate the scoring matrix and perform the traceback. An improved accelerator is later proposed in [29], in which a space-efficient algorithm is used to overcome the memory size and bandwidth limitations. Compared to software on a Core2 Duo 2.4 GHz, a performance gain of over 300× can be obtained with 256 PEs on a Virtex-4 FX100 FPGA.
Moreover, a few researchers accelerate variants of the Smith-Waterman such as DIALIGN [30] to accomplish better alignment sensitivity. In particular, Boukerche et al. [31] propose a reconfigurable accelerator for DIALIGN by implementing wavefront array processors on Stratix-II EP2S180. The traceback procedure can also be executed on FPGA to retrieve the alignment and the overall speedup is around 141× compared to a similar software implementation.
Hardware Abstraction in RC-PSA - Some efforts are devoted to improving the portability and usability of the accelerated system. In [32], the authors design a systolic architecture that can be applied to solve general DP-based alignment problems. Others, such as [19], [28], [33], provide generic, parameterizable FPGA cores for PSA which are portable across various FPGA platforms. Finally, Liu et al. [34] introduce the concept of "RC-PSA in the cloud", where a web server is used to serve alignment requests. All these implementations, compared to state-of-the-art CPU designs, can deliver a speedup of more than 62×.
B. Database Search
Computational search through large databases of DNA is another important tool for uncovering homologous sequences in modern molecular biology. Database sequences that exhibit high similarity with the query are hypothesized to derive from a common ancestral sequence and often display the same biological function.
Heuristic algorithms such as BLAST [35] are extensively used among biologists to perform database search. The BLAST algorithm works in three consecutive stages: (1) Word Matching, (2) Ungapped Extension, (3) Gapped Extension. However, as the size of the most commonly used databases, such as the NCBI databank [1], grows at the same pace as Moore's law, running BLAST on a general-purpose processor has become the bottleneck in homology analysis.
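The first two BLAST stages can be illustrated with a simplified software model. The sketch below is not NCBI BLAST and does not reproduce any surveyed accelerator: the function names word_matches and ungapped_extend, the word size w = 4, the match and mismatch scores, and the drop-off threshold x_drop are all assumptions chosen for a small example.

```python
from collections import defaultdict


def word_matches(query, subject, w=4):
    """Stage 1: find exact w-mer hits using a hash table built from the query."""
    table = defaultdict(list)
    for q in range(len(query) - w + 1):
        table[query[q:q + w]].append(q)
    return [(q, s)
            for s in range(len(subject) - w + 1)
            for q in table.get(subject[s:s + w], [])]


def ungapped_extend(query, subject, q, s, w=4, match=1, mismatch=-3, x_drop=5):
    """Stage 2: extend a hit in both directions without gaps.

    Returns (score, (query_start, subject_start, length)).
    """
    def extend(qi, si, step):
        score = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            score += match if query[qi] == subject[si] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break  # score fell too far below the best seen so far
            qi += step
            si += step
        return best, best_len

    right, rlen = extend(q + w, s + w, +1)
    left, llen = extend(q - 1, s - 1, -1)
    return w * match + left + right, (q - llen, s - llen, w + llen + rlen)
```

With query ACGTACGTAA and subject TTACGTACGTCC, word_matches reports the hit (0, 2) among others, and extending it yields a score of 8 over the 8-base region ACGTACGT. Only hits whose extended score passes a threshold would proceed to the more expensive gapped extension stage.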
In this sub-section, previous work on reconfigurable acceleration of BLAST is reviewed and a summary is shown in Table III. Although some variations of BLAST, such as BLASTp or BLASTx, do not target the alignment of genetic nucleotides, they are also included in the discussion because of their similarities in heuristic and methodology.
Basic Accelerators - TUC BLAST is one of the earliest efforts in accelerating BLAST on reconfigurable devices. In [36], [37], Sotiriades et al. develop the first version of TUC BLAST, in which the entire BLAST algorithm is mapped onto a Virtex-4 4VFX140FF1517-11 FPGA. The architecture can support small queries of up to 1,000/5,000 letters regardless of the database size. A hash table is used to build hit finders and extension is done with basic comparators. Experiments indicate that the proposed accelerator can achieve a speedup of 215× versus BLASTn on a Xeon 2 GHz.
Hybrid Systems - TUC BLAST is then revised to use the on-board PowerPC processor to perform extension [38]. Implemented on Virtex-II PRO V2P30, the modified accelerator achieves a 32× speedup compared to execution on a Pentium-4 3.0 GHz. Moreover, the same authors explore the design space on ASIC to reduce the technology-related limitations of FPGAs in [39].
Xia et al. [40]-[42] also design a hybrid accelerator where the first two stages of BLAST are accelerated with a Stratix-II EP2S130C5 FPGA and the final stage is executed on a commodity CPU. To decrease the on-chip memory requirement and support longer queries, a systolic array of 3072 PEs is used to perform multi-seed detection, and multi-channel hardware modules are implemented to complete ungapped extension. The experimental results show that the accelerator can deliver a 48× speedup versus a Pentium-4 2.6 GHz CPU.
Chen et al. [43] also present an FPGA-based reconfigurable architecture to accelerate the word-matching stage of BLAST while keeping the computations of the other stages on the CPU. This design consists of three sub-stages: a parallel Bloom filter, an off-chip hash table, and a match redundancy eliminator. When implemented on Virtex-5 LX330, this architecture demonstrates a 10× speedup in Word Matching against a Core2 Duo 3.2 GHz (1-thread).
Mercury BLAST - The Mercury system, on the other hand, is reconfigurable logic associated with the disk controller that provides computation in close proximity to the data flowing off the disk drive [55]. Such a platform is frequently employed ... Mercury BLAST is later improved by Buhler et al. [47], where Word Matching and Ungapped Extension are both accelerated on FPGA. The hardware-accelerated ungapped extension employs a heuristic similar to that of BLAST and achieves a speedup of 11× while retaining 98.5-99% of all alignments found by NCBI BLASTN.
Finally, Lancaster et al. [48] further enhance the design by implementing a pre-filter on FPGA for the third stage and at the same time offloading the computation of ungapped extension onto the CPU. By heavily parallelizing and pipelining the hardware modules, the accelerator accepts queries of 25k bases and achieves a 50× improvement while maintaining sensitivity equivalent to that of the BLAST software.
Moreover, BLASTp is also accelerated using the Mercury framework. Word Matching is accelerated in [56] and Gapped Extension is accelerated with Smith-Waterman in [57] . Finally, [58] presents a full acceleration of BLASTp where all the previous efforts are combined to deliver a full implementation.
Single-Pass BLAST - Since BLAST involves multiple passes during database queries, some researchers introduce a new algorithm that operates in a single pass at streaming rate to improve performance. In particular, Herbordt et al. [59], [60] propose the use of a DP approach on FPGA to emulate the seeding and extension phases of BLAST. This algorithm, named TreeBLAST, can improve the performance of the database search by 400× on a Virtex-4 LX160 FPGA compared to multiple-pass NCBI BLASTp on a Xeon 2.8 GHz.
Results Compatible Accelerator - Although the implementations mentioned above demonstrate significant speedups compared to software, the search outcomes are not always in complete agreement with the NCBI results. Since biologists typically cannot tell whether the differences are statistically significant, some FPGA researchers argue that hardware-accelerated designs should be NCBI BLAST compatible.
Datta et al. [49] propose a memory-efficient FPGA design that implements the Blast_Nt_Scan function of BLAST. The primary purpose of the scan function is to stream the subject data sequence and locate hits. Without compromising fidelity, the proposed implementation on Virtex-4 ML410 improves performance by a factor of 3 (compared to a Pentium-4 3.2 GHz) while remaining in complete agreement with the standard NCBI BLAST.
Database Pre-filtering - In addition to accelerating different phases of BLAST using FPGAs, another useful approach to improving the overall performance is to profile the code and reduce the database size.
Afratis et al. [50] propose the first pre-filtering approach to BLAST by finding and reporting matches in the areas of high similarity between database and query. It is found that pre-filtering offers at least a factor of 5 and up to 3 orders of magnitude reduction in the database space.
Park et al. [51], [61] also apply pre-filtering with the TreeBLAST algorithm so as to quickly reduce the size of the database to a small fraction. The sensitivity of the pre-filtering approach is tuned to exceed that of the NCBI BLAST implementation to ensure identical results. Experimental results show that, compared with NCBI BLASTn, the speedup is greater than 12× when pre-filtering and the accelerator in [59] are used together.
Hardware Abstraction in RC-BLAST - In spite of the promising results described above, FPGA-based solutions should also be portable and straightforward to use in order to promote the adoption of reconfigurable accelerators among biologists.
In [62], Muriki et al. present the first portable, cost-effective, open-source solution of RC-BLAST to guarantee usability. Kasap et al. [52] also present a portable FPGA accelerator for BLAST by capturing the design in an FPGA-platform-independent language, Handel-C. The architecture of the accelerator can also be parametrized in terms of the sequence lengths, match scores, gap penalties, and cut-off and threshold values. It is reported that the hardware implementation is 52× faster than equivalent software implementations on a Centrino Duo 2.2 GHz.
Moreover, Abelsson et al. [53] propose the use of the Mitrion Virtual Processor to accelerate BLAST. Since Mitrion enables software developers to target FPGA-based computers without needing hardware design skills, users can continue using the familiar BLAST interface while getting searches completed 10× to 20× faster.
Finally, Lam et al. [54] introduce an FPGA-accelerated BLAST in the cloud framework. Smith-Waterman is accelerated on 64 Stratix-IV E530 on multiple PS4 compute nodes of Novo-G to provide database search. A robust software interface is also provided to seamlessly integrate the FPGA design into existing processing pipelines of NCBI BLAST.
C. Multiple Sequence Alignment
Multiple Sequence Alignment (MSA) is an extension of PSA and is generally used to construct family representations of sequences or to reveal the evolutionary histories of species. However, it is an NP-hard problem and therefore the optimal solution can only be obtained with an i-dimensional DP table, where i is the number of sequences [63].
Heuristic algorithms such as ClustalW [64] have been widely used among biologists because of their efficiency. ClustalW uses a progressive algorithm which consists of three major steps: (1) PSA between all sequences to generate a distance matrix, (2) guide tree generation based on the distance matrix, (3) successively building the MSA by performing PSA based on the branching order of the guide tree. However, ClustalW faces the same problem as BLAST does due to the rapid growth of sequence databases, and aligning a few hundred sequences could require several hours on a computer.
Research has been done to overcome this problem by accelerating MSA with reconfigurable devices. In [65] and [66], the authors present an accelerated ClustalW by offloading the computation of the first stage, which accounts for more than 90% of the runtime, onto FPGA. For the first stage, [65] provides a speedup of 50× on Virtex-II XC2V6000 compared to a Pentium-4 3 GHz, and [66] achieves a 10× performance improvement when Stratix EP1S30 is compared to a Xeon 2.8 GHz.
Finally, the third stage of ClustalW is accelerated in [67] . Compared to Core2 2.4 GHz, an overall speedup of 150× can be achieved by reducing subgroups of aligned sequences into discrete profiles before PSA is performed on Virtex-4 FX100.
D. Mapping
Mapping, or resequencing, refers to the alignment of a generated sequence to a reference genome where the complete sequence of the species concerned, such as human, is already known. Such an application is essentially used to determine the genomic variations of a sample in relation to the reference so as to explore and understand genetic diseases and cancer genomes.
Mapping is one of the dominant applications of next-generation sequencing, where millions of DNA fragments called short reads, 75 to 200 bp in length, are generated by an NGS machine and mapped to the reference genome. Software such as Bowtie, BWA, SOAP2 and BWA-MEM [68] is widely used among biologists as the de facto sequence alignment programs of choice. Yet, since sequencing machines are improving at a rate faster than the transistor counts predicted by Moore's law, mapping a generated sequence such as the complete human genome takes on the order of days of computing time [63]. Therefore, FPGA technology has been extensively used by researchers to speed up the mapping process. A summary of the previous work on reconfigurable acceleration of short-read alignment is displayed in Table IV.
Basic Mappers - Fernandez et al. [4] implement the first hardware short-read mapper in 2010, where the design is based on a naive solution. The reconfigurable implementation on Virtex-5 LX330 delivers a speedup of 1.6× to 4× when compared to the fastest software tools, RAMP [82] and ELAND [83], on Xeon Harpertown 2.5 GHz (1-thread). However, the performance of this design decreases as read length increases; therefore, a follow-up work [84] is proposed in which the authors develop the first implementation of the FM-index on FPGA. As the FM-index does not need to perform all the character matching required by the naive solution, this approach, when implemented on Virtex-6 LX760, outperforms the previous work by around 2× and, more importantly, provides a 133× speedup compared to Bowtie on Xeon 2.5 GHz (1-thread).
Approximate String Matching - Since [84] is limited to exact string matching, the authors extend their work into a multi-threaded FPGA design called FHAST, which supports up to 2 mismatches [70]. In this implementation of the FM-index, each read represents a thread in the search and at most 512 concurrent threads can be executed on a single Virtex-5 XC5VLX330 FPGA of Convey HC-1. Experimental results show that FHAST achieves a speedup of up to 70× over Bowtie running on Xeon L540B and E5520 (16-thread). A second version that runs on Convey Computers HC2ex provides a higher sensitivity for higher numbers of mismatches [85]. Using four Virtex-6 LX760 FPGAs, FHAST version-II can provide a speedup of up to 12× compared to Bowtie on two Xeon E5-2634 (8-thread).
Besides the FM-index, other researchers propose different FPGA solutions for approximate string matching. In [69], Olson et al. propose an accelerator that is based on indexing of the reference, with Smith-Waterman alignment performed on FPGA. The authors optimize the size of the candidate alignment location (CAL) lookup table and partition the design across eight Pico M-503 boards, each with one XC6VLX240T FPGA. This 8-FPGA system can achieve a 31× speedup versus Bowtie running on two Xeon E-5520 (8-thread).
Chen et al. [6], [86] also implement an accelerated short-read aligner based on the seed-and-extension strategy. The basic idea of such a strategy rests on the heuristic that only a limited amount of errors (substitution, insertion and deletion) ...
Highly Accurate Mappers - On the other hand, Knodel et al. [87] design a short-read mapper on FPGA that allows a freely adjustable character mismatch threshold. This mapper is based on a brute-force approach that relies on a massive number of shift registers (Block RAM) and comparators to perform matching, and it guarantees a 100% mapping rate within the mismatch threshold. Compared to Bowtie on Core2 Duo 2.66 GHz (2-thread), the hardware mapper can run 2× faster and can align 20% more genome when implemented on a Virtex-6 XC6VLX240T FPGA.
The authors continue their work and design another short-read mapper based on a linear systolic computation scheme to achieve better performance [71]. Implemented on a Virtex-6 XC6VLX550T FPGA, the hardware mapper reports 2× more locations than Bowtie while keeping the execution latency competitive with software executed on an i7-2600K (4-thread). This solution is also ported onto Virtex-7 XC7VX485T and is realized as an open-source package called PoC-Align [88].
Runtime Reconfigurable Mappers - Some researchers take advantage of the reconfigurable property of FPGA devices to further improve the performance of hardware short-read aligners. In [89] and [72], Arram et al. introduce a hardware design that incorporates specialized matchers for exact and approximate sequence alignment, while runtime reconfiguration is used to fully populate the FPGA with each type of matcher. Such decoupling provides the flexibility to optimize each matcher according to the intended workload, hence resulting in higher parallelism and performance. With this scheme, results reported on Virtex-6 SX475T of Maxeler MAX3 are 293× faster than BWA, and 496× faster than Bowtie on Xeon X5650 (20-thread).
Using the same approach, the authors further extend their work and design specialized filters, each of which aligns short reads to a reference genome with a different edit distance [73]. These filters are arranged in a pipeline in order of increasing edit distance, in which reads that cannot be mapped by a given filter are forwarded to the next filter in the pipeline for further processing. Specifically, runtime reconfiguration is used to fully populate the FPGA with each filter in the pipeline in turn. With specialised filters based on a novel bidirectional backtracking version of the FM-index, it is found that the alignment time on Maxeler MAX3 can be up to 18.1× faster than BWA running on two X5650 (12-thread).
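The control flow of such a filter pipeline can be sketched in software. The example below deliberately swaps in a much simpler technique than the one in [73]: each stage is a naive scan that counts mismatches (Hamming distance) rather than an FM-index filter with edit distance, and the function names map_with_mismatches and filter_pipeline are hypothetical. It only illustrates how unmapped reads are forwarded to the next, more permissive stage.

```python
def map_with_mismatches(read, reference, k):
    """Return positions where read matches reference with at most k mismatches."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(1 for a, b in zip(read, window) if a != b)
        if mismatches <= k:
            hits.append(pos)
    return hits


def filter_pipeline(reads, reference, max_k=2):
    """Run reads through stages of increasing mismatch tolerance.

    Returns (mapped, pending): mapped[read] = (stage k, positions), and
    pending lists the reads that no stage could map.
    """
    mapped = {}
    pending = list(reads)
    for k in range(max_k + 1):  # stage k tolerates up to k mismatches
        unmapped = []
        for read in pending:
            hits = map_with_mismatches(read, reference, k)
            if hits:
                mapped[read] = (k, hits)
            else:
                unmapped.append(read)  # forward to the next stage
        pending = unmapped
    return mapped, pending
```

Because most reads are typically resolved by the early, cheap stages, only a small remainder reaches the stages that tolerate more errors.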
Hybrid Systems - A hybrid aligner refers to the concept of hardware-software co-design for accelerating short-read alignment. In [76], Draghicescu et al. design BWT aligners on twelve Virtex-5 505 FPGAs under Pico Computing's framework. The accelerator ties into the existing BWA software and allows the CPU to perform the tasks it is optimized for, such as file handling and memory management. The proposed system can achieve a 48× speedup compared to the software version of BWA running on 16 cores.
Tang et al. [75] also develop a hybrid accelerator where a host program running on a PC is dedicated to controlling the loading and storing of read and reference data to and from the hardware. The hardware mapper is based on PerM [90], a software tool that uses periodic spaced seeds to significantly improve mapping efficiency for large reference genomes. Meshes of processing elements are implemented on Virtex-5 LX330 to take advantage of the spatial parallelism of the FPGA. Experiments show that such an accelerator can deliver a 22.2× to 42.9× speedup versus PerM on a six-core Xeon (Westmere) CPU.
Other efforts mentioned above, such as [86] and [88], are also tightly coupled with a software environment and presented as hybrid systems to accelerate short-read alignment.
Acceleration of BWA-MEM - In addition to the above implementations, some research efforts are devoted to accelerating specific alignment software. In particular, BWA-MEM has been widely studied and accelerated by FPGA researchers because of the accuracy and improved efficiency of the software [68].
The BWA-MEM algorithm consists of three main procedures which are executed in succession for each read in the input: (a) SMEM (i.e. seed) Generation, (b) Seed Extension, and (c) Output Generation.
In [80], Chen et al. propose an acceleration engine for BWA-MEM by offloading the seed extension, which is the computational bottleneck, onto a Virtex VC707 FPGA. The authors develop an efficient Smith-Waterman implementation that supports massive task-level parallelism, sharply varied input sizes, and software-pruning strategies. Compared to the BWA-MEM software on a 6-core CPU with 24 threads, the proposed design demonstrates a 26.4× improvement in execution latency.
The authors continue their work by offloading SMEM generation onto the FPGA in the latest Intel-Altera HARP system [81]. With a 16-PE accelerator engine, the seed generation is accelerated by 4×, and the overall SMEM seeding stage by 26%, when compared with 16-thread CPU execution.
Houtgast et al. [78], [79] also implement a hardware aligner based on BWA-MEM, in which a systolic array architecture accelerates the seed extension kernel with Smith-Waterman. By offloading the computational bottleneck onto a Virtex-7 XC7VX690T-2 FPGA, the entire system can deliver a total acceleration of about 45%. This work is later extended by Ahmed et al. [91], where a hardware suffix array is used to partially accelerate SMEM generation, which enables a total application acceleration of 2.6× compared to the original software version.
IV. CONCLUSION
This paper reviews recent work on reconfigurable acceleration of genetic sequence alignment by classifying it into four main categories. Within these high-level categories, we elaborate on and compare each work based on its features and the corresponding performance. We show that FPGA-based solutions are promising candidates for the discussed topic, and we believe future research should push forward with the design portability and usability of accelerators, such as the concept of an RC-accelerated aligner in the cloud. As such, we hope this survey can provide guidance on the choice of accelerator for genetic sequence alignment, and hence promote the use of FPGAs among the life sciences community.
