Abstract: Genome analysis helps to reveal the genomic variants that cause diseases and the evolution of species. Unfortunately, high throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments (called short reads) that incur a significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome through a computationally expensive process. Due to sequencing errors and genomic variants, the similarity measurement between a read and "candidate" locations in the reference genome, called alignment, is formulated as an approximate string-matching problem, which is solved using quadratic-time dynamic programming algorithms. In practice, the majority of candidate locations do not align with a given read due to high dissimilarity. Verifying such candidate locations occupies most of a modern read mapper's execution time.
I. INTRODUCTION
High throughput sequencing (HTS) technologies are capable of generating a tremendous amount of sequencing data. For example, the Illumina HiSeq4000 platform can generate more than 1.5 trillion basepairs (bp) in less than four days. This flood of sequenced data continues to overwhelm the processing capacity of existing algorithms and hardware [1] . The success of the medical and genetic applications of HTS technologies relies on the existence of sufficient computational resources, which can quickly analyze the overwhelming amounts of data that the sequencers generate.
An HTS machine produces short reads (typically 75-150 bp) sampled randomly from DNA. In the presence of a reference genome, the short reads are first mapped to the long reference sequence. During this process, which is called read mapping, each short read is mapped onto one or more possible locations in the reference genome based on the similarity between the short read and the reference sequence segment at that location. Optimal alignment between the read and the reference segment could be calculated using the Smith-Waterman local alignment algorithm [2] . However, this approach is infeasible as it requires O(mn) running time, where m is the read length (100-150 bp for Illumina) and n is the reference length (~3.2 billion bp for human), for each read in the data set (hundreds of millions to billions). Therefore, read mapping algorithms apply heuristics to first find candidate map locations (seed locations) of subsequences of the reads using hash tables [3] [4] [5] [6] [7] or BWT-FM indices [8] [9] [10] [11] , and then align the read in full only to those seed locations. Although the strategies for finding seed locations vary among different read mapping algorithms, seed location identification is typically followed by a verification step, which compares the read to the reference segment at the seed location to see if the read maps to that location in the genome. The verification step typically occupies over 90% of the mapper's execution time [3] and involves quadratic-time algorithms such as Levenshtein's edit distance [12] , Smith-Waterman [2] and Needleman-Wunsch [13] . Edit distance is defined as the minimum number of edits (i.e. insertions, deletions, or substitutions) needed to make the read exactly match the reference segment [12] . 
If the edit-distance score is greater than a user-defined error threshold (usually less than 5% of the read length [14] [15] [16]), then the mapping is considered invalid (i.e., the read does not match the segment at the seed location) and is rejected.
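The verification step described above can be illustrated with a minimal sketch of the classic quadratic-time edit-distance computation followed by the threshold test. The function names `edit_distance` and `passes_verification` are illustrative assumptions and are not taken from any of the cited mappers.

```python
def edit_distance(read: str, segment: str) -> int:
    """Levenshtein distance via the textbook O(m*n) dynamic program."""
    m, n = len(read), len(segment)
    prev = list(range(n + 1))  # distances from the empty read prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if read[i - 1] == segment[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # delete a base from the read
                          curr[j - 1] + 1,     # insert a base into the read
                          prev[j - 1] + cost)  # match or substitution
        prev = curr
    return prev[n]


def passes_verification(read: str, segment: str, threshold: int) -> bool:
    """A mapping is valid only if the edit distance is within the threshold."""
    return edit_distance(read, segment) <= threshold
```

Every candidate location pays the full m-by-n cost here, even when the read is obviously dissimilar, which is the cost that a pre-alignment filter tries to avoid.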
Definition 1. Given a set of short reads R, a candidate read r of length m, where r ∈ R and |r| = m, a reference sequence f of length n, and an edit-distance threshold E, the alignment problem asks for all short reads in R that share with f a common subsequence of length l, where m-E ≤ l ≤ m+E.
Recent work found that an overwhelming majority (>98%) of the seed locations exhibit more errors than the threshold [3, 16]. These seed locations impose a large computational burden, as verifying these incorrect mappings wastes 90% of the mapper's execution time [3]. To tackle these challenges and bridge the widening gap between the execution time of the mappers and the huge amount of sequencing data, most existing works fall into two approaches: (1) designing hardware accelerators to accelerate alignment [17] [18] [19] [20] [21] [22], and (2) building alignment filters (i.e., filters that aim to minimize the number of locations on which alignment is performed), such as SeqAn [23], FastHASH [3], and Shifted Hamming Distance (SHD) [16]. Such filters quickly compute an estimate of the alignment score between a read and a seed location on the reference, which serves as a lower bound on the number of errors. If this lower bound exceeds the error threshold, indicating that the read and the segment at the seed location do not align, the seed location is eliminated and no alignment is performed. We discuss both approaches in Section II. This paper aims to combine these two promising approaches in a novel way towards building the fastest read mapper. To this end, we introduce a new FPGA-based fast alignment filtering technique (called GateKeeper) that acts as a pre-alignment step in read mapping. To our knowledge, this is the first work that provides a new pre-alignment algorithm and architecture for reconfigurable hardware platforms. A fast filter designed on a specialized hardware platform can drastically expedite alignment verification by reducing the number of locations that must be verified via dynamic programming. This avoids performing expensive computations unnecessarily, which in turn improves the overall running time. Our filtering technique improves and accelerates the SHD filtering approach [16] using new mechanisms and FPGAs.
We build upon the SHD filtering algorithm as it is the fastest and most accurate state-of-the-art filter [16]. Our new filtering algorithm has two properties that make it suitable for an FPGA-based implementation: 1) it is highly parallel, and 2) it relies heavily on bitwise operations such as shift, XOR, and AND. Because the architecture of FPGAs is well suited to parallel and bitwise processing, our design achieves more than an order of magnitude speedup compared to the best prior approach, as our comprehensive evaluation shows (Section IV). Our architecture discards the incorrect mappings from the candidate mapping pool in a streaming fashion: data is processed as it is transferred from the host system. Filtering the mappings in a streaming fashion makes it possible to integrate our filter with any mapper that performs alignment, such as Bowtie2 [11] and BWA-MEM [24].
Contributions. We make the following contributions:
 We introduce the first FPGA-friendly alignment filtering architecture, called GateKeeper, for reducing the need for alignment verification in DNA read mapping. We show that developing a hardware-based alignment filtering algorithm and architecture is both very feasible and effective.
 We provide the design and implementation of a complete FPGA system designed specifically to act as a pre-alignment step. A key result is that our design on a single FPGA chip provides a 3.7 to 17.3 fold acceleration over the current state-of-the-art filter, SHD [16] . It can also demonstrate significant speedups (more than two orders of magnitude) over other [8] , BWT-SW [9] , Bowtie [10] and Bowtie2 [11] ) is efficient at finding the best mappings of a read (i.e., the mappings with the fewest errors), and hence we refer to them as best-mappers. Mappers in this category use aggressive algorithms to optimize the candidate location pools to find closest matches, and therefore may not find many potentially-correct mappings [25] . Their performance degrades as either the sequencing error rate increases or the genetic differences between the subject and the reference genome are more likely to occur [8] . This is due to the nature of BWT-FM as it entails a global alignment (i.e., alignment from the first base to the last one) with respect to the sequenced reads.
The second approach, seed-and-extend mappers (also referred to as hash-based mappers), such as FastHASH [3], mrFAST/mrsFAST [4, 5], SHRiMP2 [6], and BFAST [7], build a very comprehensive but overly large candidate location pool and rely on filters and local alignment techniques to remove incorrect mappings from consideration in the verification step. Mappers in this category are able to find all correct mappings of a read, but waste computational resources on identifying and rejecting incorrect mappings. As a result, they are slower than BWT-FM-based mappers. A hybrid method that incorporates the advantages of each approach, such as BWA-MEM [24], can also be used for long read alignment (i.e., up to a few million bases). Fig. 1 illustrates the existing read mappers implemented on various platforms. A majority of read mappers are based on machines equipped with general-purpose central processing units (CPUs). While HTS platforms generate half a trillion bp per day, state-of-the-art CPU-based read mappers can align only a few billion of them against the human genome [10, 24]. As long as the gap between CPU computing speed and the very large amount of sequenced data widens, CPU-based mappers become less favorable due to their limitations in accessing data [17] [18] [19] [20] [21] [22]. To tackle this challenge, many attempts have been made to accelerate the operations of read mapping. Most existing works can be divided into two main approaches: (1) designing hardware accelerators, and (2) developing efficient alignment filters.
Hardware accelerators for read mapping are becoming increasingly popular as viable solutions for expediting the operations of existing mappers using various new processing platforms, such as GPUs [18, 19, 26] and FPGAs [17, 19, 21, 22, 26-30]. FPGA acceleration platforms seem to yield the highest performance gain [21, 22, 27, 31, 32], especially for applications with unpredictable and highly irregular memory access patterns such as BWT-based search, which poses difficult challenges for efficient implementation on CPUs or GPUs [27]. FPGA-based read mappers often demonstrate one to two orders of magnitude speedups against their GPU-based counterparts [20, 28, 29].
Past works used hardware platforms only to accelerate the dynamic programming algorithms (e.g., the Smith-Waterman algorithm), as these algorithms contribute significantly to the overall running time of read mappers [26, 33]. Benkrid et al. [26] compared the Smith-Waterman method implemented on FPGA, GPU, Cell BE, and CPU platforms. The FPGA implementation outperforms all other accelerated implementations, particularly in terms of execution time. FPGAs will likely continue to be the best choice as they enable performing large numbers of computations in parallel. Comprehensive surveys on hardware acceleration for computational genomics appear in [1, 14, 31]. Note that there is no prior work on the hardware acceleration of alignment filtering techniques, which we discuss next.
The second approach to accelerating read mapping is to incorporate a filtering technique within the read mapper. As we discussed above, this filter is responsible for quickly excluding incorrect mappings at an early stage (i.e., as a pre-alignment step) to reduce the number of locations that must be verified via dynamic programming. Existing filtering techniques include: (1) SeqAn [23], an implementation of Gene Myers's bit-vector algorithm [34], and its SIMD-accelerated implementation in [16], (2) Hamming distance calculation [35], (3) the locality-based filtering mechanism FastHASH [3], and (4) Shifted Hamming Distance [16]. The first three are either slow or inaccurate in filtering. SeqAn requires time quadratic in the length of the read. Hamming distance and FastHASH have high false positive rates (i.e., the number of incorrect mappings that pass the filter) for edit-distance thresholds higher than 3 errors. Next, we discuss the fourth technique in detail.
Shifted Hamming Distance (SHD) filter. In recent work, Xin et al. [16] present a fast and comprehensive filter that can quickly identify most incorrect mappings while preserving all the correct ones. The heart of the algorithm is a new approach called Shifted Hamming Distance (SHD), which is based on the pigeonhole principle. SHD observes that if there are no more than e mismatches between the read and the reference, then the read and the reference share at least a single identical (i.e., error-free) section among e+1 non-overlapping sections, where 0 ≤ e. The more mismatches there are between a read and the reference, the fewer contiguous stretches of exact matches they share. This is because e mismatches divide the read into at most e+1 identical sections, each matching its corresponding region in the reference.
Lemma 1. If two reads differ by e errors, then by the pigeonhole principle they share at least a single identical section (free of errors) among e+1 non-overlapping sections.
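Lemma 1 can be checked on a small example. The sketch below is only an illustration under a simplifying assumption: it models substitutions only, so that read and reference positions line up one-to-one (indels require the shifting discussed next). The helper names `split_sections` and `error_free_sections` are hypothetical.

```python
def split_sections(length: int, parts: int) -> list[tuple[int, int]]:
    """Split [0, length) into `parts` contiguous, non-overlapping ranges."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for k in range(parts):
        end = start + base + (1 if k < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def error_free_sections(read: str, ref: str, e: int) -> list[tuple[int, int]]:
    """Return those of the e+1 sections on which read and ref are identical.

    Assumption: only substitutions are modeled, so both strings have equal length.
    """
    assert len(read) == len(ref)
    return [(s, t) for s, t in split_sections(len(read), e + 1)
            if read[s:t] == ref[s:t]]
```

With two substitutions and three sections, at most two sections can contain an error, so at least one section must survive intact; for example, comparing "ACGAACGTACCT" against "ACGTACGTACGT" with e = 2 leaves the middle section untouched.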
SHD relies on identifying these e+1 identical sections as a proxy for the edit distance calculations. If there are no more than E errors between the read and the reference (where E is the user-defined error threshold), then each non-erroneous segment in the read can be matched to its corresponding region in the reference within E shifts of its position to the right or to the left. Shifting is necessary in order to skip the erroneous bases (especially in the case of insertions and deletions).
Definition 2. Let r and f be query and reference sequences, respectively, each of length m, and let E be an edit-distance threshold. The SHD paradigm states that all matched characters of the two sequences can be aligned in at most E shifts.
SHD is implemented using Streaming SIMD Extensions (SSE) [16]. Experimentally, SHD is able to filter out 86 billion potential mappings within 40 minutes with a maximum edit distance of 1, which is 3x faster than SeqAn when both are implemented on the same platform and tested under the same conditions. On average, SHD requires the same execution time as FastHASH. However, SHD produces far fewer (4x fewer on average) false positives than FastHASH. This makes SHD the best-performing alignment filter to our knowledge.
Our goal in this paper is to design a new filtering algorithm (building upon SHD) and a new hardware architecture that accelerates it by taking advantage of the computational capabilities of FPGAs. To our knowledge, this is the first work that takes advantage of novel hardware architectures to accelerate alignment filtering techniques.
III. ACCELERATOR ARCHITECTURE
We design a novel hardware architecture to accelerate an improved alignment filtering algorithm using FPGAs.
Overview of Our Accelerator Architecture. In view of the discussion in Section II, we introduce the first FPGA-based alignment filter. The use of FPGAs can yield significant performance improvements, especially for massively parallel algorithms. An FPGA chip can be programmed to include a very large number of execution units that are custom-tailored to the problem at hand. To cope with the drastic increase in the amount of sequenced data, we propose two approaches:
 First, we build a specialized FPGA-friendly hardware architecture for a new filtering algorithm. We design our filtering algorithm to examine the alignment between a read and a reference segment in a fast and efficient way (in terms of the required resources, e.g. hardware logic blocks, power consumption, clocking).
 Second, we take advantage of the fact that alignment filtering of one read is inherently independent of filtering another read. We therefore can examine all reads in a parallel fashion. Therefore, instead of handling the reads one by one in a sequential manner, as the CPU-based filter (e.g., SHD) does, we can process a number of reads at the same time by integrating as many filtering processing cores as possible in the FPGA chip. Each processing core is a complete alignment filter and can handle a single read at a time. We use the term "processing core" in this paper to refer to the entire operation of the filtering process involved in GateKeeper. Processing cores are part of our architecture and unrelated to the term "CPU cores". Fig. 2 shows the overall architecture of our FPGA-based accelerator, GateKeeper, which consists of an FPGA engine as an essential component and a CPU. The latter is responsible for acquiring and encoding the short reads and transferring the data to-and from the FPGA. The FPGA engine is equipped with PCIe transceivers, Read Controller, Mapping Controller, and group of processing cores that are responsible for examining the read alignment. The workflow of the accelerator starts with reading a repository of short reads and seed locations. All reads are then converted into their binary representation that can be understood by the FPGA engine. Encoding the reads is a preprocessing step and accomplished through a Read Encoder at the host before transmitting the reads to the FPGA chip. Next, the encoded reads are transmitted and processed in a streaming fashion through the fastest communication medium available on the FPGA board (i.e. PCIe). We use RIFFA 2.2 [36] to perform the host-FPGA communication. The output results are transferred back to the CPU side in the same order as the input stream in a streaming fashion and then saved in the repository. 
We designed our system to perform alignment filtering in a streaming fashion: the accelerator receives a continual stream of short reads, examines each alignment in parallel with others and returns the decision (i.e., whether the alignment is accepted or rejected) instantaneously upon processing.
Read Controller. The Read Controller on the FPGA side is responsible for two main tasks. First, it permanently assigns the first data chunk as a reference sequence for all processing cores. Second, it manages the subsequent data chunks and distributes them to the processing cores. The first processing core receives the first read sequence and the second core receives the second sequence and so on, up to the last core. It iterates the data chunk management task until no more reads are left in the repository.
Mapping Controller. Following similar principles as the Read Controller, the Mapping Controller gathers the output results of the processing cores. Both the Read and Mapping Controllers preserve the original order of reads as in the repository (i.e., at the host). This is critical to ensure that each read will receive its own alignment filtering result. The results are transmitted back to the CPU side in a streaming fashion and then saved in the repository.
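The interplay of the two controllers can be modeled in software as follows. This is only a behavioral sketch, not the hardware design: the function name `run_controllers`, the batch-wise loop, and the `core` callback (standing in for one processing core) are assumptions introduced for illustration.

```python
from typing import Callable, Iterable


def run_controllers(chunks: Iterable[str], num_cores: int,
                    core: Callable[[str, str], bool]) -> list[bool]:
    """Model of the Read and Mapping Controllers for one reference segment."""
    stream = iter(chunks)
    reference = next(stream)       # first chunk is latched as the reference
    results: list[bool] = []
    batch: list[str] = []
    for read in stream:
        batch.append(read)         # read k of a batch goes to core k
        if len(batch) == num_cores:
            # In hardware the cores work concurrently; collecting the
            # outputs core by core keeps them in input order.
            results.extend(core(r, reference) for r in batch)
            batch.clear()
    results.extend(core(r, reference) for r in batch)  # last partial batch
    return results
```

Because results are gathered in the same core order in which reads were handed out, the i-th decision always belongs to the i-th read of the input stream.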
Original SHD Algorithm. We briefly describe the original SHD algorithm [16], which inspires our new algorithm. As we discuss in Section II (Lemma 1 and Definition 2), SHD relies on two key observations. Based on these observations, the SHD algorithm has two main steps: (1) Shifted Hamming Mask-Set (SHM) and (2) Speculative Removal of Short-Matches (SRS). SHM separately identifies all bp matches by calculating a set of Hamming masks. Each Hamming mask is a bit-vector of '0's and '1's representing the comparison of the read and the reference, where a '0' represents a bp match and a '1' represents a bp mismatch. Each mask is generated by incrementally shifting the candidate read against the reference and performing a pairwise comparison (i.e., a pairwise XOR operation). Based on Definition 2, we need to perform E incremental shifts in the left direction for any read that has E deletions and E incremental shifts in the opposite direction for insertions, where E is the edit-distance threshold. An additional Hamming mask is required for detecting error-free reads or error-free segments that are located before the first mismatch. This last mask is produced by performing a bitwise XOR between the original read (i.e., with no shift) and the reference. Thus, in general, we perform 2E shifts for any read regardless of the source of the mismatch (i.e., insertions, deletions, or substitutions). The last step in SHM is to merge all 2E+1 Hamming masks using 2E bitwise AND operations. This step tells us where the matching and mismatching regions reside in the presence of errors in the read compared to the reference. Identical regions are identified in each shifted Hamming mask as streaks of continuous '0's. In SHM, a '0' at any position in the 2E+1 Hamming masks leads to a '0' in the resulting final bit-vector at the same position.
These '0's are critical: some Hamming masks may show a mismatch at a given position, but a '0' in any other mask dominates and shows a match, which tends to confuse the filter and eventually causes some incorrect mappings to pass. To overcome this issue, SHD applies a refinement step, called SRS, on all Hamming masks. SRS removes short streaks of '0's in individual masks, as they have a high probability of being generated by random noise in the DNA sequence. Such short streaks (fewer than three bits) do not represent genuinely identical sections, as they are not part of the correct alignment of the mapping (i.e., the alignment produced by the local alignment computation).
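The two SHD steps can be sketched in software as follows. This is a simplified model, not the SSE implementation of [16]. Several details are assumptions made for illustration: masks are plain lists of 0/1 values, positions that fall outside the reference after a shift are treated as mismatches so that they stay neutral under AND, and the final decision simply counts the '1's in the merged mask. The names `hamming_mask`, `amend`, and `shd_filter` are hypothetical.

```python
def hamming_mask(read: str, ref: str, shift: int) -> list[int]:
    """0 = match, 1 = mismatch, comparing read[i] with ref[i + shift]."""
    m = len(read)
    mask = []
    for i in range(m):
        j = i + shift
        if 0 <= j < m:
            mask.append(0 if read[i] == ref[j] else 1)
        else:
            mask.append(1)  # assumption: out-of-range positions count as mismatch
    return mask


def amend(mask: list[int]) -> list[int]:
    """SRS: turn streaks of one or two 0s that are enclosed by 1s into 1s."""
    out, n, i = list(mask), len(mask), 0
    while i < n:
        if mask[i] == 1:
            i += 1
            continue
        j = i
        while j < n and mask[j] == 0:
            j += 1
        if i > 0 and j < n and (j - i) < 3:  # 101 -> 111, 1001 -> 1111
            out[i:j] = [1] * (j - i)
        i = j
    return out


def shd_filter(read: str, ref: str, E: int) -> bool:
    """SHM (2E+1 masks merged with AND) followed by a simplified accept rule."""
    masks = [amend(hamming_mask(read, ref, s)) for s in range(-E, E + 1)]
    final = [min(bits) for bits in zip(*masks)]  # AND over 0/1 values
    return sum(final) <= E
```

In this model a read that equals the reference shifted by one base passes with E = 1, because the mask generated with the matching shift is almost entirely '0' and dominates the AND.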
The crucial observation is that SHD performs expensive computations unnecessarily when examining each mapping. SHD uses the same amount of computation regardless of the type of mismatches, and hence requires a constant execution time. Next, we show how GateKeeper exploits particular cases, in which mismatch detection is computationally inexpensive compared to applying the general approach, to improve the overall running time and reduce the number of operations.
GateKeeper Processing Core. Our primary purpose is to enhance the SHD alignment filter such that we can greatly accelerate pre-alignment by taking advantage of the capabilities and parallelism of FPGAs. To achieve our goal, we design an algorithm inspired by SHD to reduce both the utilized resources and the execution time. These optimizations enable us to integrate more processing cores within the FPGA chip and hence examine more alignments.
We present three new methods that are applied in each GateKeeper processing core to improve execution time. Our first method is a new algorithmic step that handles two cases much faster than the original SHD: (1) error-free alignments and (2) alignments with one or more base substitutions. Our second method addresses the resource overheads introduced especially by the Read Encoder. Our third method replaces the amending process of the SRS step (discussed above) with a new, very efficient hardware design.
1) Error-Free and Low-Error Alignment Detection.
The SHM method identifies whether the alignment locations of a read are valid by shifting individual bases. However, depending on the type of mismatches in each mapping, the shifting process is not always needed. Fig. 3 illustrates the effect of mismatches on the alignment process. Each insertion or deletion can shift multiple trailing bases and create multiple mismatches in the Hamming masks. On the other hand, for error-free alignments and alignments with one or more base substitutions, the matching and mismatching regions can be accurately determined using the Hamming distance, which can be calculated with a pairwise comparison (bitwise XOR) between the bases of the read and the reference segment. As substitutions have no effect on the subsequent bases, the number of mismatches is equal to the number of '1's in the resulting Hamming mask. Hence, the first key improvement of our new FPGA-friendly filtering algorithm (over SHM) is to check whether the read matches the reference segment exactly or with an "acceptable" number of differences (i.e., equal to or below the threshold) before generating the 2E Hamming masks. This step can be completed with only a few additional computations, as the Hamming mask of the Hamming distance computation is already produced as part of the SHM approach. We only need to count the occurrences of '1' in the Hamming mask: if their total number is less than or equal to the user-defined error threshold, the mapping is considered valid and the read passes the filter. Otherwise, if the total number of mismatches exceeds the error threshold, we cannot be certain whether this is due to the content of the read or to insertions and/or deletions, and hence we generate the 2E Hamming masks and follow the original SHM approach. The pseudocode of our new FPGA-friendly filtering algorithm is shown in Algorithm 1.
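The early-exit check described above can be sketched as follows. This is a minimal illustration and not a transcription of Algorithm 1, which is not reproduced here. The names `hamming_distance` and `gatekeeper_decision` are hypothetical, and `slow_path` is an assumed callback standing in for the full procedure with 2E shifted masks.

```python
from typing import Callable


def hamming_distance(read: str, ref: str) -> int:
    """Number of '1's in the unshifted Hamming mask (equal-length inputs)."""
    return sum(a != b for a, b in zip(read, ref))


def gatekeeper_decision(read: str, ref: str, E: int,
                        slow_path: Callable[[str, str, int], bool]) -> bool:
    """Accept early on exact or substitution-only matches within E."""
    if hamming_distance(read, ref) <= E:
        return True  # no shifted masks are needed
    # Too many mismatches: could be real dissimilarity or indels,
    # so fall back to the general shifted-mask procedure.
    return slow_path(read, ref, E)
```

The design choice is that the unshifted mask is needed anyway, so the early check adds only a population count and one comparison to the common case.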
2) Handling Resource Overheads.
Encoding the reads into a binary representation introduces overhead to accommodate the encoded read and to apply certain operations on it. For instance, encoding a read sequence of length m results in a 2m-bit word, which requires 2m bitwise XOR operations instead of m operations (Fig. 4). To reduce the complexity of the subsequent operations on the Hamming masks, we propose a new solution. Comparing a pair of DNA nucleotides is equivalent to comparing their binary representations. Hence, each pair of bits of the binary mask is correlated and represents one of two meanings: either a match or a mismatch. Once the Hamming masks are generated, we no longer need two bits to represent each DNA nucleotide. We propose to further encode each two bits of the Hamming mask into a single bit of '0' or '1' using OR operations in parallel. A bit of value zero represents a match and a bit of value one represents a mismatch between two bases. This makes the length of each Hamming mask equal to the length of the original sequence without affecting the meaning of each bit of the mask. The modified Hamming masks are then merged together using 2E bitwise AND operations.
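The reduction from a 2m-bit XOR result to an m-bit Hamming mask can be sketched as follows. The specific 2-bit code in `ENCODING` is an assumption made for illustration (any one-to-one 2-bit code behaves the same way), and the function names are hypothetical.

```python
# Assumed 2-bit code; the source does not specify the mapping.
ENCODING = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base, first base highest."""
    word = 0
    for base in seq:
        word = (word << 2) | ENCODING[base]
    return word


def reduced_hamming_mask(read: str, ref: str) -> int:
    """XOR the 2m-bit words, then OR each bit pair into one bit per base."""
    m = len(read)
    xor = encode(read) ^ encode(ref)       # 2m-bit mask
    mask = 0
    for i in range(m):                     # independent per base, parallel in hardware
        pair = (xor >> (2 * i)) & 0b11
        bit = (pair >> 1) | (pair & 1)     # 1 if either bit of the pair differs
        mask |= bit << i
    return mask
```

Two bases are equal exactly when both bits of their XOR are zero, so OR-ing each pair keeps the match/mismatch meaning while halving the mask width.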
3) New Hardware-Based Amending Process.
Using the SRS step of the SHD algorithm, all streaks of '0's in the Hamming masks that are shorter than three bits and surrounded by '1's are replaced by '1's. As a result, bit patterns such as 101 and 1001 are replaced with 111 and 1111, respectively. The SRS process in the original SHD work [16] is accomplished using a 4-bit packed shuffle (a SIMD parallel table-lookup instruction), shift, and OR operations. The number of computations needed is four packed shuffles, 4m bitwise OR operations, and four shift operations for each Hamming mask, that is, (8+4m)(2E+1) operations in total. We find that this is very inefficient.
To reduce the number of operations, we propose using dedicated hardware components in FPGA slices. More precisely, rather than shifting the read and then performing packed shuffles to replace patterns of 101 or 1001 with 111 or 1111, respectively, we perform only the packed shuffle (i.e., a table lookup) for each bit, independently and concurrently for all bits of each Hamming mask. As illustrated in Fig. 5, the proposed architecture of the amend operations contains one 5-input look-up table (LUT) dedicated to each output bit, except the first and last output bits. The first output bit copies the value of the first input bit regardless of its value: even if it is zero, it is not amended, as it does not contribute to a 101 or 1001 pattern. The same holds for the last output bit. Thus, the total number of look-up tables needed is equal to the length of the short read in bases minus 2 (for the first and last bits). In each look-up table, we consider a single bit of the Hamming mask together with its two right neighboring bits and its two left neighboring bits.
If the input bit corresponding to an output bit has a value of one, then the output copies the value of that input bit (as we only amend zeros). Otherwise, using the previous two bits and the following two bits with respect to the input bit, we can replace any zero of the "101" or "1001" patterns independently of the other output bits (details are given in Algorithm 2). All bits of the amended masks are generated at the same time, as the propagation delay through an FPGA look-up table is independent of the implemented function [37]. Thus, we can process all masks in parallel without affecting the correctness of the filtering decision. Using this dedicated architecture, we eliminate the four shift operations and perform the whole operation concurrently for all bits of any Hamming mask. Thus, the required number of operations is only (2E+1) instead of (8+4m)(2E+1) for a total of (2E+1) Hamming masks. This saves a considerable amount of the filtering time. Finally, we count the number of ones (i.e., errors) in the final bit-vector mask, and if it respects the error threshold, the filter accepts the mapping.
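The per-bit amending logic can be modeled in software as a 32-entry table indexed by a bit and its four neighbors. This sketch is not a transcription of Algorithm 2, which is not reproduced here. The names `lut_entry`, `LUT`, and `amend_mask` are hypothetical, and treating neighbors beyond the mask boundary as '0' is an assumption.

```python
def lut_entry(l2: int, l1: int, c: int, r1: int, r2: int) -> int:
    """Output for center bit c given two left and two right neighbors."""
    if c == 1:
        return 1                              # ones are never amended
    if l1 == 1 and r1 == 1:
        return 1                              # 1 [0] 1
    if l1 == 1 and r1 == 0 and r2 == 1:
        return 1                              # 1 [0] 0 1
    if l2 == 1 and l1 == 0 and r1 == 1:
        return 1                              # 1 0 [0] 1
    return 0


# Truth table of one 5-input LUT, addressed by (l2, l1, c, r1, r2).
LUT = [lut_entry((a >> 4) & 1, (a >> 3) & 1, (a >> 2) & 1, (a >> 1) & 1, a & 1)
       for a in range(32)]


def amend_mask(mask: list[int]) -> list[int]:
    """Amend every interior bit from the ORIGINAL mask; ends are copied."""
    n = len(mask)

    def get(i: int) -> int:
        return mask[i] if 0 <= i < n else 0   # assumption: pad with zeros

    out = list(mask)
    for i in range(1, n - 1):                 # each output is independent
        addr = ((get(i - 2) << 4) | (get(i - 1) << 3) | (mask[i] << 2)
                | (get(i + 1) << 1) | get(i + 2))
        out[i] = LUT[addr]
    return out
```

Because every output bit reads only the original mask, the loop iterations have no dependencies on each other, which is what lets the hardware evaluate all of them at once.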
Summary of Benefits over SHD. Each GateKeeper processing core performs all the operations defined in the GateKeeper algorithm. Table 1 summarizes the relative benefits gained by each of the aforementioned optimization methods, where E is the user-defined error threshold and m is the read length. When a read matches the reference exactly or with few substitutions, GateKeeper requires only 2m bitwise XOR operations, providing a substantial speedup compared to the original SHD, which performs a much greater number of operations. However, this is not the only benefit we gain from our first proposed method (i.e., Error-Free and Low-Error Alignment Detection). As this method provides an accurate examination of alignments with only substitutions (i.e., no deletions or insertions), we can directly skip calculating their optimal alignment using the computationally expensive alignment algorithms.
For more general cases such as deletions and insertions, GateKeeper still requires far fewer operations (4x fewer, as shown in Table 1) than the original SHD filter, due to the optimization methods outlined above. Our improvements over SHD help drastically reduce the execution time of the filtering process. The alignments rejected by our GateKeeper filter are not examined further by the dynamic programming algorithms. Thus, GateKeeper accelerates the entire read mapping process, as our evaluation quantitatively shows (Section IV).
IV. EVALUATION
We use the Xilinx VC709 board [38] , which features the Virtex-7 XC7VX690T-2FFG1761C FPGA [39] , and a 3.6GHz Intel i7-3820 CPU with 8GB RAM as the host to implement and evaluate GateKeeper. We build the FPGA design with Vivado 2014.3 in Verilog. We configure RIFFA 2.2 as Gen3 4-lane PCIe. The operating frequency of the accelerator is 250MHz.
A. Theoretical Speedup.
We first examine the maximum speedup theoretically possible with our architecture, assuming the only constraint in the system is the FPGA logic. To this end, we calculate the number of mappings our accelerator board can potentially examine in parallel using as many GateKeeper processing cores as possible. The entire process of examining a mapping takes a single cycle to be completed on a single GateKeeper processing core. Table 2 shows the resource utilization of a single processing core for a read length of 100 bp, with different error thresholds. Based on the resource report, we estimate that we can fit up to 300 GateKeeper processing cores on the VC709 FPGA, such that all processing cores together can process up to 300 alignments of 100 bp reads in parallel. This results in a 140x to 300x speedup over the original SHD filter design [16] (depending on the error threshold used). The bottleneck in this idealized system is transferring a total of 60,000 bits in a single clock cycle into the FPGA, which is not practical for any of the existing PCIe drivers that supply data to the FPGA. Using an offline approach, such as transferring all the reads to the internal memory of the FPGA board before processing them, would allow us to achieve an even greater speedup (as more data will be available to be processed and hence more processing cores can be integrated). However, this strategy would not improve overall performance due to the memory initialization overhead. We conclude that the theoretical speedup provided by GateKeeper is extremely large, but practical speedup, which we will examine next, is limited by the data transfer rate into the accelerator. 
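The 60,000-bit figure follows from the numbers above, counting only the encoded reads at two bits per base. The sketch below reproduces that arithmetic and derives the sustained bandwidth such an idealized design would need. The function names are hypothetical, and using the 250MHz accelerator frequency as the clock of the idealized system is an assumption.

```python
def bits_per_cycle(cores: int, read_len_bp: int, bits_per_base: int = 2) -> int:
    """Input bits needed each cycle if every core gets a new read per cycle."""
    return cores * read_len_bp * bits_per_base


def required_bandwidth_bytes_per_s(cores: int, read_len_bp: int,
                                   clock_hz: int, bits_per_base: int = 2) -> float:
    """Sustained input bandwidth implied by feeding all cores every cycle."""
    return bits_per_cycle(cores, read_len_bp, bits_per_base) * clock_hz / 8


if __name__ == "__main__":
    print(bits_per_cycle(300, 100))                               # 60000
    print(required_bandwidth_bytes_per_s(300, 100, 250_000_000))  # 1875000000000.0
```

Under these assumptions the idealized design would need roughly 1.9 TB/s of input bandwidth, which makes clear why data transfer, not FPGA logic, bounds the practical speedup.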
B. Experimental Speedup.
Throughput and Resource Analysis. Our system operates in two synchronous clock domains: the main system clock runs at 250MHz, and the GateKeeper processing cores run at 50MHz. This setup allows us to integrate up to five GateKeeper processing cores, because processing is done in a streaming fashion and the number of cores is limited by the data available at each clock cycle. Five cores are sufficient to saturate the memory bandwidth that supplies data to the cores, so using more than five cores does not improve performance. Table 3 lists the resource utilization of the entire design, including the PCIe communication logic. We find that as the error threshold increases, more resources are occupied. This is expected since the number of operations of GateKeeper is proportional to the error threshold, E, as shown in Table 1. Our design can execute five alignment filtering operations concurrently on a single FPGA chip. We observed a throughput of nearly 3.3GB/s, which corresponds to ~13.3 billion bases per second, nearly reaching the maximum throughput provided by the RIFFA 2.2 communication channel that feeds data into the FPGA.
Speedup vs. SHD. We now evaluate the execution time of GateKeeper against SHD. We use a popular seed-and-extend all-mapper, mrFAST [4], to retrieve all potential mappings (read-reference pairs) from two read sets (accession numbers ERR240726 and ERR240727) from the 1000 Genomes Project Phase I [40]. Each data set contains about 4 million reads of length 100 bp. Table 4 compares the original SHD and GateKeeper in terms of run time. The GateKeeper run time includes the host-FPGA communication time in both directions. Our accelerator architecture provides a 17.3x speedup over the original SHD when we align reads of length 100 bp from ERR240727, with a maximum edit distance of 5.
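The core count and base throughput reported under Throughput and Resource Analysis can be checked with simple arithmetic. The sketch below is one reading of the text, not the authors' sizing procedure: it assumes the five-core limit follows from the ratio of the two clock frequencies, and that bases are packed at 2 bits each. The function names are hypothetical.

```python
def cores_fed_by_stream(system_clock_hz, core_clock_hz):
    """Cores that one input stream can keep busy, assuming each core takes
    one new mapping per core-clock cycle and the stream delivers one mapping
    per system-clock cycle (an interpretation of the text)."""
    return int(system_clock_hz // core_clock_hz)


def bases_per_second(bytes_per_second, bits_per_base=2):
    """Convert a byte rate into a base rate under 2-bit base encoding."""
    return bytes_per_second * 8 / bits_per_base


n_cores = cores_fed_by_stream(250e6, 50e6)  # five cores
base_rate = bases_per_second(3.3e9)         # roughly 13.2e9 bases per second
print(n_cores, base_rate)
```

At exactly 3.3GB/s this gives about 13.2 billion bases per second, in line with the ~13.3 billion reported, given that 3.3GB/s is itself a rounded figure.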
GateKeeper can process reads of any length, but it transmits a read in "packages" of 128 bits per clock cycle if the read is longer than 64 bp. This constraint, imposed by the RIFFA data transfer protocol into the FPGA, makes the data transfer time proportional to the read length and creates a bottleneck. Consequently, the acceleration potential depends largely on the read length. As Table 4 shows, our system can achieve as high as 27.03x speedup for shorter reads (64 bp). We conclude that GateKeeper greatly improves pre-alignment performance, by more than an order of magnitude, over the best previous pre-alignment mechanism, SHD.
Sensitivity to Error Threshold. Fig. 6(a) shows the number of potential mappings that are processed by both GateKeeper and SHD within 40 minutes, with different error thresholds (i.e., E, varied from 1 to 5 errors) across multiple read sets. As the error threshold increases, GateKeeper's speedup also increases. Under different error thresholds, GateKeeper shows a constant execution time at the expense of additional FPGA LUTs used, as shown in Fig. 6(b). This is because our architecture performs all computations on the Hamming masks, as well as the amending process, in parallel (as we explained when we described our three new methods in the GateKeeper core). We conclude that our new accelerator architecture is very effective in handling more errors in reads, much more so than the best previous pre-alignment mechanism, SHD.
Fraction of Alignments Filtered out by GateKeeper. In Fig. 7, we show the importance of integrating our GateKeeper filter with a sophisticated local alignment algorithm (i.e., Smith-Waterman [2]) by reporting two observations: (1) GateKeeper rejects a significant number of incorrect mappings (e.g., up to 80% of the mappings for E=1), which therefore need not be examined by local alignment algorithms.
(2) GateKeeper accurately calculates the optimal alignment for a considerable number of correct mappings (e.g., up to 50% of the mappings for E=5) using our first method, Error-Free and Low-Error Detection. These correct mappings can proceed to the rest of the read mapping process without being examined by any sophisticated alignment algorithm. We conclude that GateKeeper enables read mappers to save up to 97% of their alignment verification workload.
Comparison to Other Read Mapping Accelerators. We also compare GateKeeper with other existing approaches that aim to accelerate read mapping using various architectures. Table 5 summarizes the results. We report the run time of different tools using 100 bp reads with at most 2 mismatches (unless otherwise mentioned). To provide a fair comparison for the FPGA-based architectures, we report the run time of using only a single FPGA chip. In all the studies we surveyed, FPGAs outperform all other accelerator platforms in terms of run time. GateKeeper also achieves a 62x speedup over the SeqAn filter.
We observe that our accelerator architecture can achieve more than two orders of magnitude speedup over the fastest acceleration of a hash-based mapper [28] (133,264,884/316,455 = 421x speedup) and an order of magnitude speedup over the fastest acceleration of a BWT-FM implementation [20]. We note, however, that GateKeeper is a pre-alignment filter, while the BFAST and BWT-FM accelerators are full mappers, so they are not directly comparable. Nevertheless, since the GateKeeper cores have only a small footprint on the FPGA, we can combine our architecture with any of the FPGA-based accelerations of BWT-FM or hash-based mapping techniques on a single FPGA chip. The end result of such a combination would be an efficient and fast multi-layer mapping system: alignments that pass GateKeeper, our pre-alignment filter, can be further verified using a local alignment algorithm within the same chip. We leave this combination for future work, but conclude that the results we present in Table 5 show significant promise for placing both our pre-alignment accelerator and a separate alignment accelerator on the same FPGA chip.
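The multi-layer organization described above (a cheap filter in front of an expensive verifier) can be illustrated in software. The sketch below is a minimal stand-in, not the GateKeeper algorithm: the filter is a simple base-count lower bound on edit distance, chosen only because it never rejects a correct mapping, and the verifier is a textbook Levenshtein dynamic program. All function names are hypothetical.

```python
from collections import Counter


def count_lower_bound(read, ref):
    """Stand-in filter: lower bound on edit distance from base counts.

    One substitution changes the summed count difference by at most 2 and
    one insertion or deletion by 1, so edit distance >= ceil(diff / 2).
    """
    a, b = Counter(read), Counter(ref)
    diff = sum(abs(a[ch] - b[ch]) for ch in set(a) | set(b))
    return (diff + 1) // 2


def edit_distance(a, b):
    """Textbook Levenshtein distance, quadratic time, two rows of memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def map_pairs(pairs, e):
    """Filter first, then verify only the pairs that survive the filter."""
    mapped, n_filtered, n_verified = [], 0, 0
    for read, ref in pairs:
        if count_lower_bound(read, ref) > e:
            n_filtered += 1   # rejected cheaply, verifier never runs
            continue
        n_verified += 1       # expensive quadratic-time check
        if edit_distance(read, ref) <= e:
            mapped.append((read, ref))
    return mapped, n_filtered, n_verified
```

The returned counters make the division of labor explicit: every pair counted in `n_filtered` is one quadratic-time verification that the second layer never has to perform.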
