Motivation: High throughput DNA sequencing (HTS) technologies generate an excessive number of small DNA segments -called short reads-that cause significant computational burden. To analyze the entire genome, each of the billions of short reads must be mapped to a reference genome based on the similarity between a read and "candidate" locations in that reference genome. The similarity measurement, called alignment, formulated as an approximate string matching problem, is the computational bottleneck because: (1) it is implemented using quadratic-time dynamic programming algorithms, and (2) the majority of candidate locations in the reference genome do not align with a given read due to high dissimilarity. Calculating the alignment of such incorrect candidate locations consumes an overwhelming majority of a modern read mapper's execution time. Therefore, it is crucial to develop a fast and effective filter that can detect incorrect candidate locations and eliminate them before using computationally costly alignment operations. Results: We propose GateKeeper, a new hardware accelerator that functions as a pre-alignment step that quickly filters out most incorrect candidate locations. GateKeeper is the first design to accelerate pre-alignment using Field-Programmable Gate Arrays (FPGAs), which can perform pre-alignment much faster than software. GateKeeper can be integrated with any mapper that performs sequence alignment for verification. When implemented on a single FPGA chip, GateKeeper maintains high accuracy (on average >96%) while providing up to 105-fold and 215-fold speedup over the state-of-the-art software pre-alignment techniques, Adjacency Filter and Shifted Hamming Distance (SHD), respectively.
Introduction
High throughput sequencing (HTS) technologies are capable of generating a tremendous amount of sequencing data. For example, the Illumina HiSeq4000 platform can generate more than 1.5 trillion base pairs (bp) in less than four days. This flood of sequenced data continues to overwhelm the processing capacity of existing algorithms and hardware . The success of the medical and genetic applications of HTS technologies relies on the existence of sufficient computational resources, which can quickly analyze the overwhelming amounts of data that the sequencers generate. An HTS instrument produces short reads (typically 75-150 bp) sampled randomly from DNA. In the presence of a reference genome, the short reads are first mapped to the long reference sequence. During this process, called read mapping, each short read is mapped onto one or more possible locations in the reference genome based on the similarity between the short read and the reference sequence segment at that location. Optimal alignment of the read and the reference segment could be calculated using the Smith-Waterman local alignment algorithm (Smith and Waterman 1981) . However, this approach is infeasible as it requires O(mn) running time, where m is the read length (100-150 bp for Illumina) and n is the reference length (~3.2 billion bp for human genome), for each read in the data set (hundreds of millions to billions). Therefore, read mapping algorithms apply heuristics to first find candidate map locations (seed locations) of subsequences of the reads using hash tables or BWT-FM indices , and then align the read in full only to those seed locations. Although the strategies for finding seed locations vary among different read mapping algorithms, seed location identification is typically followed by a verification step, which compares the read to the reference segment at the seed location to check if the read aligns to that location in the genome with fewer differences than a threshold. The verification step is the dominant part of the whole execution time in current mappers (over 90% of the running time) . It calculates edit distance using quadratic-time algorithms such as Levenshtein's edit distance (Levenshtein 1966) , Smith-Waterman (Smith and Waterman 1981) and NeedlemanWunsch (Needleman and Wunsch 1970) . Edit distance is defined as the minimum number of edits (i.e. insertions, deletions, or substitutions) needed to make the read exactly match the reference segment (Levenshtein 1966) . If the edit distance score is greater than a user-defined edit distance threshold (usually less than 5% of the read length ), then the mapping is considered to be invalid (i.e., read does not match the segment at seed location) and thus is rejected.
Definition 1. Given a candidate read r, a reference segment f, and an edit distance threshold E, the pairwise alignment problem is to identify a set of matches of r in f, where the read aligns with edit distance ≤ E.
Recent work found that an overwhelming majority (>98%) of the seed locations exhibit more edits than the threshold . These particular seed locations impose a large computational burden as they waste 90% of the mapper's execution time in verifying these incorrect mappings . To tackle these challenges and bridge the widening gap between the execution time of the mappers and the increasing amount of sequencing data, most existing works fall into two approaches: (1) Design hardware accelerators to accelerate the verification step . (2) Build software-based alignment filters before the verification step Rasmussen, Stoye et al. 2006; . Such filters aim to minimize the number of candidate locations on which alignment is performed. They calculate a best guess estimate for the alignment score between a read and a seed location on the reference. If the lower bound exceeds a certain number of edits, indicating that the read and the segment at the seed location do not align, the seed location is eliminated such that no alignment is performed. Unfortunately, existing filtering techniques are either slow, such as Shifted Hamming distance (SHD) , or inaccurate in filtering, such as the Adjacency Filter ) (implemented as part of FastHASH ) and mrsFAST-Ultra ). While mrsFASTUltra is able to detect only substitutions, FastHASH is unable to tolerate substitutions efficiently. We provide full descriptions of the key principles underlying each strategy in Supplementary Materials.
Our goal, in this work, is to minimize the mapper time spent on accurate filtering. To this end, we introduce a new FPGA-based fast alignment filtering technique (called GateKeeper) that acts as a pre-alignment step in read mapping. To our knowledge, this is the first work that provides a new pre-alignment algorithm and architecture for reconfigurable hardware platforms. A fast filter designed on a specialized hardware platform can drastically expedite alignment by reducing the number of locations that must be verified via dynamic programming. This eliminates many unnecessary expensive computations, thereby greatly improving overall run time.
Our filtering technique improves and accelerates the state-of-the-art SHD filtering algorithm using new mechanisms and FPGAs. We build upon the SHD algorithm as it is the fastest and most accurate filter . Our new filtering algorithm has two properties that make it suitable for an FPGA-based implementation: (1) it is highly parallel, (2) it heavily relies on bitwise operations such as shift, XOR, and AND. Due to the highly-parallel and bitwise-processing-friendly architecture of modern FPGAs, our design achieves more than two orders of magnitude speedup compared to the best prior software-based filtering approaches (SHD and Adjacency Filter), as our comprehensive evaluation shows (Section 3). Our architecture discards the incorrect mappings from the candidate mapping pool in a streaming fashion -data is processed as it is transferred from the host system. Filtering the mappings in a streaming fashion gives the ability to integrate our filter with any mapper that performs alignment, such as Bowtie2 and BWA-MEM .
Contributions. We make the following contributions:
 We introduce the first hardware acceleration system for alignment filtering, called GateKeeper, which greatly reduces the need for alignment verification in DNA read mapping. To this end, we develop both a hardware-acceleration-friendly filtering algorithm and a highly-parallel hardware accelerator design. We show that developing a hardware-based alignment filtering algorithm and architecture together is both feasible and effective by building our accelerator on a modern FPGA system.  We comprehensively evaluate GateKeeper and compare it to two state-of-the-art software-based alignment filtering algorithms. A key result is that our design on a single FPGA chip provides up to 105-fold and 215-fold speedup over the state-of-the-art filters, Adjacency Filter and SHD , respectively. Experimental results on both simulated and real data sets demonstrate that GateKeeper has a low false positive rate (the rate of incorrect mappings that are accepted by the filter) of 4% on average.  We provide the design and implementation of a complete FPGA system and release its source code. To our knowledge, GateKeeper is the first open-source, freely-available FPGA based alignment filter for genome analysis.
Gatekeeper Architecture

Overview of Our Accelerator Architecture.
Based on the discussion provided in the Supplementary Materials, Section 1.2, we introduce the first specialized FPGA-friendly hardware architecture for a new filtering algorithm. Our current filter implementation relies on several layers of optimization to create a robust and efficient filtering approach. At both the design and implementation stages, we satisfy several requirements: (1) Ensuring a lossless filtering algorithm by preserving all correct mappings. (2) Supporting both Hamming distance and edit distance. The Hamming distance is a special case of the editdistance. It is defined as the minimum number of substitutions required to change the read into the reference segment. The Hamming distance is computed in linear time. (3) Examining the alignment between a read and a reference segment in a fast and efficient way (in terms of execution time and required resources).
Parallelization.
GateKeeper is designed to utilize the large amounts of parallelism offered by FPGA architectures (Herbordt, VanCourt et al. 2007; Trimberger 2015) . The use of FPGAs can yield significant performance improvements, especially for massively parallel algorithms. FPGAs are the most commonly used form of reconfigurable hardware engines today, and their computational capabilities are greatly increasing every generation due to increased number of transistors on the FPGA chip. An FPGA chip can be programmed (i.e., configured) to include a very large number of hardware execution units that are custom-tailored to the problem at hand. We take advantage of the fact that alignment filtering of one read is inherently independent of filtering another read. We therefore can examine many reads in a parallel fashion. In particular, instead of handling each read in a sequential manner, as CPU-based filters (e.g., SHD) do, we can process a large number of reads at the same time by integrating as many hardware filtering processing cores as possible (constrained by chip area) in the FPGA chip. Each processing core is a complete alignment filter and can handle a single read at a time. We use the term "processing core" in this paper to refer to the entire operation of the filtering process involved in GateKeeper. Processing cores are part of our architecture and unrelated to the term "CPU cores" or "threads".
GateKeeper Processing Core.
Our primary purpose is to enhance the state-of-the-art SHD alignment filter such that we can greatly accelerate pre-alignment by taking advantage of the capabilities and parallelism of FPGAs. To achieve our goal, we design an algorithm inspired by SHD to reduce both the utilized resources and the execution time. These optimizations enable us to integrate more processing cores within the FPGA chip and hence examine many alignments at the same time. We present three new methods that we use in each GateKeeper processing core to improve execution time. Our first method introduces a new algorithmic method for performing alignment very rapidly compared to the original SHD. This method provides: (1) fast detection for exact matching alignment and (2) handling of one or more base-substitutions. Our second method supports calculating the edit distance with a new, very efficient hardware design. Our third method addresses the problem of hardware resource overheads introduced due to the use of FPGA as an acceleration platform. All features are implemented within the filtering processing core hardware and thus are performed highly efficiently.
Fast Approximate String Matching.
We first discuss how to examine the alignment of reads against the reference sequence with a given Hamming distance threshold, and later extend our solution to support edit distance. Our first method aims to quickly detect the obviously correct alignments that contain no edits or only few substitutions (i.e., less than the user-defined threshold). Our observation is that by detecting the correct alignments, we can skip both the unnecessary filtering steps and the subsequent computationally expensive alignment algorithms (e.g. edit distance algorithms). A read is mappable if the Hamming distance between the read and its seed location does not exceed the given Hamming distance threshold. Hence, the first step is to identify all bp matches by calculating what we call a Hamming mask. The Hamming mask is a bit-vector of '0's and '1's representing the comparison of the read and the reference, where a '0' represents a bp match and a '1' represents a bp mismatch. We need to count only occurrences of '1' in the Hamming mask and examine whether their total number is equal to or less than the user-defined Hamming distance threshold. If so, the mapping is considered to be valid and the read passes the filter. Similarly, if the total number of '1' is greater than the Hamming distance threshold then we cannot be certain whether this is because of the high number of substitutions, or there exist insertions and/or deletions; hence, we need to follow the reset of our algorithm. Our filter can detect not only substitutions but also insertions and deletions in an efficient way, as we discuss next.
Insertion and Deletion (Indel) Detection.
Our indel detection algorithm is inspired by the original SHD algorithm presented in . If the substitution detection rejects an alignment, then GateKeeper checks if an insertion or deletion may cause the violation (i.e. high number of edits). Fig. 1 illustrates the effect of occurrence of edits on the alignment process. If there are one or more base-substitutions or the alignment is exact matching, the matching and mismatching regions can be accurately determined using Hamming distance. As the substitutions have no effect on the alignment of subsequent bases, the number of edits is equivalent to the number of '1's in the resulting Hamming mask. On the other hand, each insertion and deletion can shift multiple trailing bases and create multiple edits in the Hamming mask. Thus, pairwise comparison (bitwise XOR) between the bases of the read and the reference segment is not sufficient. Our indel detection method identifies whether the alignment locations of a read are valid, by shifting individual bases. We need to perform E incremental shifts to the right direction to detect any read that has E deletions, where E is the edit distance threshold. The right shift process guarantees to cancel the effect of deletion. Similarly, we need to perform E incremental shifts to the left direction to detect any read that has E insertions. As we do not have prior knowledge about whether there exist insertions or deletions or both, we need to test for every possible case in our algorithm. Thus, GateKeeper generates 2E Hamming masks regardless the source of the edit. Each mask is generated after incrementally shifting the candidate read against the reference and performing pairwise comparison (i.e., pairwise XOR operation).
Deletion:
Exact Matching: A segment of consecutive matches in one-step right shifted mask indicates that there is a single deletion that occurred in the read sequence. Since deletions and insertions affect only the trailing bases, we need to have an additional Hamming mask that is generated with no shifts. This mask helps detect the matches that are located before the first indel. However, this mask is already generated as part of the first method of the algorithm (i.e., Fast Approximate String Matching). The last step is to merge all the 2E+1 Hamming masks using a bitwise AND operation. This step tells us where the relevant matching and mismatching regions reside in the presence of edits in the read compared to the reference segment. Identical regions are then identified in each shifted Hamming mask as streak of continuous '0's. As we use a bitwise AND operation, a zero at any position in the 2E+1 Hamming masks leads to a '0' in the resulting final bit-vector at the same position. Hence, even if some Hamming masks show a mismatch at that position, a zero in some other masks leads to a match ('0') at the same position. This tends to underestimate the actual number of edits and eventually causes some incorrect mappings to pass. To fix this issue, we build a new hardware-based amending process. The amending process is first presented in the original SHD filter ) that actually amends (or flips) short streaks of '0's (single or double zeros) in each mask into '1's such that they do not mask out '1's in other Hamming masks. Short streaks of '0's do not represent identical sections and thus they are useless. As a result, bit streams such as 101, 1001 are replaced with 111 and 1111, respectively. In SHD, amending process is accomplished using a 4-bit packed shuffle (SIMD parallel table-lookup instruction), shift, and OR operations. The number of computations needed is 4 packed shuffle, 4m bitwise OR, and three shift operations for each Hamming mask, which is (7+4m)(2E+1) operations, where m is the read length. We find that this is very inefficient for FPGA implementation. To reduce the number of operations, we propose using dedicated hardware components in FPGA slices. More precisely, rather than shifting the read and then performing packed shuffle to replace patterns of 101 or 1001 to 111 or 1111 respectively, we perform only packed shuffle independently and concurrently for each bit of each Hamming mask. As illustrated in Fig. 2 , the proposed architecture for amendment operations contains one 5-input look-up table (LUT) dedicated for each output bit, except the first and last output bits. We provide full details of our amending architecture in the Supplementary Materials (Section 1.3). Using this dedicated architecture, we are able to get rid of the four shifting operations and perform the amending process concurrently for all bits of any Hamming mask. Thus, the required number of operations is only (2E+1) instead of (7+4m)(2E+1) for a total of (2E+1) Hamming masks. This saves a considerable amount of the filtering time, reducing it by 407x for a read that is 100bp long. 
5-input LUT
Minimizing Hardware Resource Overheads.
The short reads are composed of a string of nucleotides from the DNA alphabet ∑= {A, C, G, T}. Since the reads are processed in an FPGA platform, the symbols have to be encoded to a unique binary representation. We need 2 bits (log2|∑| bits) to encode each symbol. Hence encoding a read sequence of length m results in a 2m-bit word. Encoding the reads into a binary representation introduces overhead to accommodate not only the encoded reads but also the Hamming masks as their lengths also become double (i.e., 2m). The issue introduced by encoding the read can be even worse when we apply certain operations on these Hamming masks. For example, the number of LUTs required for performing the amending process on the Hamming masks will be doubled, mainly due to encoding the read. To reduce the complexity of the subsequent operations on the Hamming masks and save about half of the required number of FPGA resource, we propose a new solution. We observe that comparing a pair of DNA nucleotides is similar to comparing their binary representation. Hence, each two bits of the binary mask are correlated and represent one of two meanings; either match or mismatch. Once the Hamming masks are generated, we no longer need the two bits to represent each DNA nucleotide. As explained in Fig. 3 , we propose to further encode each two bits of the Hamming mask into a single bit of '0' or '1' using OR operations in a parallel fashion. The bit value '0' represents a matching region and the bit of value '1' means mismatch between two bases. This makes the length of each Hamming mask equivalent to the length of the original read, without affecting the meaning of each bit of the mask. The modified Hamming masks are then merged together in 2E bitwise AND operations. Finally, we count the number of ones (i.e., edits) in the final bit-vector mask; if the count is less than the edit distance threshold, the filter accepts the mapping. 
TCCAT TCCAG
Verification.
GateKeeper is a standalone filter and can be easily integrated with any existing reference-based mapper. GateKeeper does not replace the local/global alignment algorithms (e.g., Smith-Waterman (Smith and Waterman 1981) and Needleman-Wunsch (Needleman and Wunsch 1970) ). GateKeeper should be followed by an alignment verification step, which precisely verifies the alignment that passes our filter. Verification step is accurate and shows zero false positive rate. It also allows specifying a cost to each edit (i.e., a scoring system). However, such integration is mapper-specific and will be explored in our future research. In this work, we mainly focus on and deeply evaluate the benefits and downsides of our filtering algorithm and architecture independently of any mapper it can be combined with.
Novelty.
GateKeeper is the only read mapping filter that takes advantage of the parallelism offered by FPGA architectures in order to expedite the alignment filtering process. GateKeeper supports both Hamming distance and edit distance in a fast and efficient way. Each GateKeeper processing core performs all operations defined in the GateKeeper algorithm (Supplementary Materials, Section 1.3, Algorithm 1). Table 1 summarizes the relative benefits gained by each of the aforementioned optimization methods over the best previous filter, SHD (E is the user-defined edit distance threshold and m is the read length). When a read matches the reference exactly, or with few substitutions, GateKeeper requires only 2m bitwise XOR operations, providing substantial speedup compared to SHD, which performs a much greater number of operations. However, this is not the only benefit we gain from our first proposed method (i.e., Fast Approximate String Matching). As this method provides an accurate examination for alignments with only substitutions (i.e., no deletions or insertions), we can directly skip calculating their optimal alignment using the computationally expensive alignment algorithms (i.e. verification step). For more general cases such as deletions and insertions, GateKeeper still requires far fewer operations (4x fewer, as shown in Table 1 ) than the original SHD filter, due to the optimization methods outlined above. Our improvements over SHD help drastically reduce the execution time of the filtering process. The rejected alignments by our GateKeeper filter are not further examined by a verification step. Thus, GateKeeper leads to the acceleration of the entire read mapping process, as our evaluation quantitatively shows (Section 3). 
Evaluation
To implement and evaluate GateKeeper, we use a Xilinx VC709 board , which features a Virtex-7 XC7VX690T-2FFG1761C FPGA (Xilinx 2015) , and a 3.6 GHz Intel i7-3820 CPU with 8 GB RAM as the host. We build the FPGA design with Vivado 2014.4 in Verilog. We use RIFFA 2.2 (Jacobsen, Richmond et al. 2015) to perform the host-FPGA PCIe communication. We configure RIFFA 2.2 as Gen3 4-lane PCIe. The operating frequency of the accelerator is 250MHz.
Theoretical Speedup.
We first examine the maximum speedup theoretically possible with our architecture, assuming the only constraint in the system is the FPGA logic.
To this end, we calculate the number of mappings that our accelerator board can potentially examine in parallel using as many GateKeeper processing cores as possible. The entire process of examining a mapping takes a single cycle to be completed on a single GateKeeper processing core. Table 2 shows the resource utilization of a single processing core for a read length of 100 bp, with different edit distance thresholds. Based on the resource report, we estimate that we can fit up to 250 GateKeeper processing cores, when E=2, on the VC709 FPGA, such that all processing cores together can process up to 250 alignments of 100 bp reads in parallel. This results in a 140x to 250x speedup over the original SHD filter design , for E=5 and E=2, respectively. The bottleneck in this idealized system is transferring a total of 50,000 (250 alignment x 100 bp x 2 bits for encoding) bits in a single clock cycle into the FPGA, which is not practical for any of the existing PCIe drivers that supply data to the FPGA. Using an offline approach, such as transferring all the reads to the internal memory of the FPGA board before processing them, would allow us to achieve an even greater speedup (as more data will be available to be processed and hence more processing cores can be integrated). However, this strategy likely will not improve overall performance due to the memory initialization overhead. We conclude that the theoretical speedup provided by GateKeeper is extremely large, but practical speedup, which we will examine next, is limited by the data transfer rate into the accelerator. 
Experimental Speedup.
Throughput and Resource Analysis. Our system operates in two synchronous clock domains. The main system clock runs at 250MHz, and the GateKeeper processing cores at 50MHz. This setup allows us to integrate up to five GateKeeper processing cores. This is because the number of cores in our design is limited by the data available at each clock cycle, as the processing is accomplished in a streaming manner. Five cores are sufficient to saturate the memory bandwidth to supply the short reads into the cores, and thus increasing the number of cores to more than five does not improve performance. We observe a data throughput of nearly 3.3 GB/s, which corresponds to ~13.3 billion bases per second, nearly reaching the maximum throughput provided by the RIFFA2.2 communication channel that feeds data into the FPGA (Jacobsen, Richmond et al. 2015) .
To further improve the throughput of GateKeeper, we configure GateKeeper to align each of the five transferred reads against multiple reference segments concurrently. We find that we can align each read against 16 different reference segments without violating the timing constraints (e.g., maximum operating frequency). This configuration substantially increases the number of alignments executed concurrently on a single FPGA chip from 5 alignments to 80 alignments. Table 3 lists the resource utilization of the entire design including the PCIe communication logic. We find that as edit distance threshold increases, more resources are occupied. This is expected since the number of operations of GateKeeper is proportional to the edit distance threshold, E, as shown in Table 1 . Speedup vs. Existing Filters. We now evaluate the execution time of GateKeeper compared to the best existing filters. We use mrFAST ) mapper to retrieve all potential mappings (read-reference pairs) from a real data set (ERR240727) from the 1000 Genomes Project Phase I (Consortium 2012) . The data set contains about 4 million reads, each of length 100 bp. Fig. 4 shows the number of potential mappings that are processed by GateKeeper, SHD, and Adjacency Filter within 40 minutes. Under different edit distance thresholds (E varied between 1 to 5 edits), GateKeeper provides consistently good performance: it verifies about 4 trillion mappings per 40 mins using a single FPGA. This is because our architecture offers the ability to perform all computations in a parallel fashion (as we explained when we described our three new methods in the GateKeeper core). However, this feature comes with the expense of additional FPGA LUTs used. The GateKeeper running time includes the host-FPGA communication time in both directions. On average, GateKeeper is 130x faster than SHD and 90x faster than Adjacency Filter. As edit distance threshold increases, GateKeeper's speedup over SHD and Adjacency Filter also increases (e.g., up to 105X and 215X faster than Adjacency Filter and SHD, respectively, when E=5). Note that Adjacency Filter becomes faster than SHD as E increases, but at the expense of accuracy, as we will show soon. We conclude that GateKeeper greatly improves the performance of alignment filtering by more than two orders of magnitude. Our new accelerator architecture is very fast in handling more edits in reads, much more so than the best previous prealignment mechanisms.
Filtering Accuracy. An ideal filter should be both fast and accurate. We evaluate the accuracy of GateKeeper by computing its false positive and false negative rates. We also compare the accuracy of our filter with SHD and Adjacency Filter using both simulated and real data. We simulated reads from the human genome using the mason simulator (http://packages.seqan.de/mason/). The configuration and parameters used in our experiment are provided in Supplementary Materials (Section 1.4). We generate five sets, each of which contains 400,000 Illumina-like reads. Each set has an equal number of reads of length 64, 100, 150, and 300 bp. While two sets have a low number of different types of edits, the other three sets have a high number of substitutions, insertions, and deletions. The purpose of simulating the low-edits reads is that we want most of the reads to have edits less than the allowed threshold. This enables us to quantify the false negatives (i.e., correct mappings that are rejected by the filter) of the three filters with different read length. On the other hand, we use the editrich reads to evaluate the robustness of the three filters to incorrect mappings. We use the Needleman-Wunsch algorithm to benchmark the three filters as this algorithm has both zero false positive and zero false negative rates.
We make four main observations. (1) Using the low-edits reads, we observe that the three filters never filter out correct mappings; hence, they provide a lossless filtering mechanism with a false negative rate of zero. (2) Fig. 5(a) shows the average false positive rate of the three filters using the three simulated edit-rich sets. We observe that both GateKeeper and SHD have the same false positive rates. (3) GateKeeper produces far fewer (on average, 0.25x) false positives than Adjacency Filter. The tolerance of Adjacency Filter to substitutions and indels diminishes when E becomes larger than 3%. (4) Adjacency Filter is more robust handling indels than handling substitutions. This is expected as the presence of one or more substitutions in any seed is counted by the Adjacency Filter as a single mismatch. The detailed results for each of the three edit-rich sets are provided in the Supplementary Materials (Section 1.4). We also consider a more realistic scenario in which reads can have a combination of substitutions and indels. We use the first 30 million 100 bp reads of the data set ERR240727_1 mapped to the human genome to evaluate the false positive rate of the three filters, as shown in Fig. 5(b) . Based on these results, we make three observations. (1) We find that GateKeeper is very effective and superior to Adjacency Filter at both substitutions and indels detection. (2) On average, GateKeeper produces a false positive rate of 4%. (3) GateKeeper rejects a significant fraction of incorrect mappings (e.g., 85% to 99% of the mappings, depending on the edit distance threshold used) and thus avoids expensive verification computations by alignment algorithms. We conclude that GateKeeper's accuracy is as good as the best previous filter, SHD, and much better than the Adjacency Filter yet GateKeeper is much faster than both SHD and Adjacency Filter (as we showed earlier in this section). Hence, GateKeeper is an extremely fast and accurate. 
Future Work
GateKeeper shows that there is a great benefit in designing an alignment filtering accelerator to handle the flood of sequenced data. Since a single core GateKeeper has only a small footprint on the FPGA, we can combine our architecture with any of the FPGA-based accelerators for BWT-FM or hash-based mapping techniques on a single FPGA chip. With such a combination, the end result would be an efficient and fast multi-layer mapping system: alignments that pass GateKeeper can be further verified using a dynamic programing based alignment algorithm within the same chip. We leave this combination for future work. Due to its streaming nature, GateKeeper can be extended to perform real-time read mapping (Lindner, Strauch et al. 2016 ) and can be integrated into sequencing machines, such that we have a single machine that can perform both sequencing and mapping. This approach has two benefits. First, it can hide the complexity and details of the underlying hardware from users who are not necessarily aware about FPGAs (e.g., biologists and mathematicians). Second, it allows a significant reduction in total genome analysis time by starting read mapping while still sequencing (Lindner, Strauch et al. 2016 ).
Our next efforts will also focus on investigating the sources of the false positives and explore the possibility of eliminating them to achieve a dynamic-programming-free alignment approach.
Summary
In this paper, we propose the first hardware accelerator architecture for prealignment in genome read mapping. , Bowtie ) and Bowtie2 ) is efficient at finding the best mappings of a read (i.e., the mappings with the fewest number of edits), and hence we refer to them as best-mappers. Mappers in this category use aggressive algorithms to optimize the candidate location pools to find closest matches, and therefore may not find many potentially-correct mappings (Firtina and Alkan 2016) . Their performance degrades as either the sequencing error rate increases or the genetic differences between the subject and the reference genome are more likely to occur ). This is due to the nature of BWT-FM as it entails a global alignment (i.e., alignment from the first base to the last one) with respect to the sequenced reads. The second approach, seed-and-extend mappers (also referred to as hash-based mappers), such as FastHASH , mrFAST/mrsFAST , SHRiMP2 , and BFAST ), build very comprehensive but overly large candidate location pool and rely on filters and local alignment techniques to remove incorrect mappings from consideration in the verification step. Mappers in this category are able to find all correct mappings of a read, but waste computational resources for identifying and rejecting incorrect mappings. As a result, they are slower than BWT-FMbased mappers. A hybrid method that incorporates the advantages of each approach can be also utilized for long read alignment (i.e. up to few million bases), such as BWA-MEM (Li 2013).
Accelerating Read Mappers
A majority of read mappers are based on machines equipped with general-purpose central processing units (CPUs). While the HTS platforms generate half a trillion bp per day, the state-of-the-art CPUbased read mappers can align only a few billion of them against the human genome . As long as the gap between the CPU computing speed and the very large amount of sequenced data widens, CPU-based mappers become less favorable due to their limitations in accessing data . To tackle this challenge, many attempts were made to accelerate the operations of read mapping. Most existing works can be divided into two main approaches: (1) designing hardware accelerators, (2) developing efficient alignment filters.
Using Hardware Accelerators
Hardware accelerators for read mapping are becoming increasingly popular as viable solutions for expediting the operations of existing mappers using various new processing platforms, such as GPUs (Benkrid, Akoglu et al. 2012; and FPGAs (Benkrid, Akoglu et al. 2012; Chen, Schmidt et al. 2013; Sogabe and Maruyama 2013; Sogabe and Maruyama 2014; Waidyasooriya and Hariyama 2015; . Fig. 6 illustrates the existing read mappers implemented in various platforms. FPGA acceleration platforms seem to yield the highest performance gain Waidyasooriya and Hariyama 2015) , especially for applications with unpredictable and highly irregular memory access patterns such as BWT-based search, which poses difficult challenges for the efficient implementation in CPUs or GPUs (Waidyasooriya and Hariyama 2015) . FPGA-based read mappers often demonstrate one to two orders of magnitude speedups against their GPU-based counterparts Sogabe and Maruyama 2013; Sogabe and Maruyama 2014) . Past works used hardware platforms to only accelerate the dynamic programming algorithms (e.g., Smith-Waterman algorithm), as these algorithms contributed significantly to the overall running time of read mappers (Szalkowski, Ledergerber et al. 2008; Benkrid, Akoglu et al. 2012) . Recently, researchers start paying more attention to integrate the FPGA accelerated Smith-Waterman algorithm into big-data computing framework -such as Apache Spark-for accelerating the BWA-MEM . By this integration in (Chen, Cong et al. 2016) , they achieve 2.6x speedup over a cloud-based implementation (Chen, Cong et al. 2015) (without FPGA acceleration) of the same mapper. Benkrid et al. (Benkrid, Akoglu et al. 2012 ) compared the Smith-Waterman method implemented on the FPGA, GPU, Cell BE, and CPU platforms. The FPGA implementation outperforms all other accelerated implementations, particularly in terms of execution time. FPGAs will likely continue to be the best choice as they enable performing large numbers of computations in a parallel fashion. Comprehensive surveys on hardware acceleration for computational genomics appeared in . Note that there is no work on the hardware acceleration of alignment filtering techniques, which we discuss next.
Years
Using Alignment Filtering Techniques.
The second approach to accelerate read mapping is to incorporate a filtering technique within the read mapper and before the verification step. This filter is responsible for quickly excluding incorrect mappings in an early stage (i.e., as a pre-alignment step) to reduce the number of locations that must be verified via dynamic programming. Existing filtering techniques include:
Hamming Distance. Hamming distance ) between a pair of sequences is defined as the number of positions at which the corresponding symbols are different. As such, the Hamming distance measures the pairwise differences between sequences of equal length. Hence, it can only find substitutions. Calculating Hamming distance is relatively easy and supported by various hardware platforms. Bitwise operations (mainly XORs) between an appropriate binary representation of read and genomic region can be employed to obtain a bit vector that indicates edits. The Hamming distance can be calculated simply through a linear scan counting the number of pairwise differences. Key drawback of this filtering approach is that Hamming distance cannot correctly find insertions and deletions (indels). Allowing indels is important because both sequencing errors and genetic variations can result in the deletion/insertion of bases and the chance of this happening increases as reads become longer. However, finding indels is computationally more challenging. mrsFAST ) and mrsFASTUltra are examples of read mapper that uses Hamming distance as a filtering strategy.
Seed Filtering. Filtering the incorrect mapping using seeds is the basic principle of nearly all seed-and-extend mappers (such as: GEM (MarcoSola, Sammeth et al. 2012) , RaserS ), RazerS 3 , Hobbes , FastHASH , BitMapper ). Instead of considering one long string, seed filters examine all its possible substrings of length q (which are called seeds). The seed filtering is based on the observation that if two sequences are potentially similar, then they share a certain number of seeds. The seed is sometimes called q-gram or k-mer. Seeds are used as indices into the reference genome to reduce the search space and speed up the mapping process. The performance and accuracy of seed-and-extend mappers depend on how the seeds are selected in the first stage. Mappers should select a large number of nonoverlapping seeds while keeping each seed as infrequent as possible (Kiełbasa, Wan et al. 2011; Xin, Nahar et al. 2015) . There is also a significant advantage to selecting seeds with unequal lengths, as possible seeds of equal lengths can have drastically different levels of frequencies. Finding the optimal set of seeds from read sequences is challenging and complex step; primarily because the associated search space is large and it grows exponentially as the number of seeds increases. There are other variants of seed filtering based on the pigeonhole principle , non-overlapping seeds , gapped seeds (Egidi and Manzini 2013) , variable-length seeds (Xin, Nahar et al. 2015) , or random permutation of subsequences (Lederman 2013) . We select Adjacency Filtering (AF) from FastHASH ) as a representative example.
Shifted Hamming Distance (SHD). Another recent filtering technique is called Shifted Hamming Distance (SHD) . It is based on the pigeonhole principle. That is, if E items are put into E+1 boxes, then one or more boxes would be empty. It can be applied in context of sequence alignment as follows: if two reads differ by E edits, then they should share at least a single identical section (free of edits) among E+1 non-overlapping sections, where E is the threshold of edit distance. This is due to the fact that the E mismatches would result in dividing the read into E+1 identical sections in accordance with their correspondences in the reference. The more edits involved between a read and the reference, the less contiguous stretches of exact matches they share. SHD relies on identifying these E+1 identical sections as a proxy for the edit distance calculations. If there are no more than E edits between the read and the reference, then each non-erroneous segment in the read can be matched to its corresponding region in the reference within E shifts from its position to the right or left direction. The shifting process is inevitable in order to skip the erroneous bases (especially in case of insertions and deletions). The crucial observation is that SHD examines each mapping, throughout the filtering process, by performing expensive computations unnecessarily. SHD uses the same amount of computations regardless the type of edit, hence SHD requires a constant execution time. In particular, substitutions can be directly measured by the Hamming distance and does not require the shifting process. Thus, SHD is not suitable for applications that aim at finding the optimal alignment of reads against a reference where indels are not allowed, for example identifying Single Nucleotide Polymorphisms (SNPs) that are associated with diseases ); aiming at discovering and developing an efficient drugs.
Conclusions. While Hamming distance simply counts the number of substitutions implied by mismatched symbols of two strings at the same position, edit distance additionally accounts for inserted or deleted symbols in one string with respect to the other. Calculating Hamming distance is relatively easy and has been well solved. On the other hand, calculating edit distance efficiently is still the focus of ongoing research. FastHASH has high false positive rates for edit distance thresholds higher than 3 edits . On average as shown experimentally in , SHD requires the same execution time as the Adjacency Filter of FastHASH (seed filtering approach). However, SHD produces far fewer (4X fewer) false positives compared to the Adjacency Filter.
Comparison between Filtering Techniques and
Hardware Accelerators. We provide a comparison of existing filters with other existing hardware accelerated read mappers using various platforms. Table 4 summaries the results. We report the running time of different tools using 100 bp reads with at most 2 edits (unless otherwise mentioned). To provide a fair comparison for FPGA-based architectures, we report the running time of using only a single FPGA chip. In all these studies we surveyed, FPGAs outperform all other accelerator platforms in terms of running time. The best-performing alignment filter (i.e. SHD filter) is only 3x faster than the fastest FPGA-based mapper (i.e. the work presented in ). Though they are not directly comparable, but an ideal filter should have both high accuracy and low running time to compensate the computation overhead introduced by its filtering technique. Our goal in this paper is to design a new fast filtering algorithm (building upon SHD) and a new hardware architecture that accelerates it by taking advantage of the computational capabilities of FPGAs. To our knowledge, this is the first work that takes advantage of novel hardware architectures to accelerate alignment filtering techniques.
GateKeeper FPGA Implementation
In this section, we discuss the algorithmic and implementation details of GateKeeper. Fig. 7 shows the overall architecture of our FPGA-based accelerator, GateKeeper, which consists of an FPGA engine as an essential component and a CPU. The latter is responsible for acquiring and encoding the short reads and transferring the data to-and from the FPGA. The FPGA engine is equipped with PCIe transceivers, Read Controller, Mapping Controller, and group of processing cores that are responsible for examining the read alignment. The workflow of the accelerator starts with reading a repository of short reads and seed locations. All reads are then converted into their binary representation that can be understood by the FPGA engine. Encoding the reads is a preprocessing step and accomplished through a Read Encoder at the host before transmitting the reads to the FPGA chip. Next, the encoded reads are transmitted and processed in a streaming fashion through the fastest communication medium available on the FPGA board (i.e. PCIe). GateKeeper can process reads of any length, but RIFFA transmits the reads into the FPGA in "packages" of 128 bits per clock cycle. The output results are transferred back to the CPU side in the same order as the input stream in a streaming fashion and then saved in the repository. We designed our system to perform alignment filtering in a streaming fashion: the accelerator receives a continual stream of short reads, examines each alignment in parallel with others and returns the decision (i.e., whether the alignment is accepted or rejected) instantaneously upon processing. The pseudocode of our new FPGA-friendly filtering algorithm is shown in Algorithm 1.
Overview
Read Controller.
The Read Controller on the FPGA side is responsible for two main tasks. First, it permanently assigns the first data chunk as a reference sequence for all processing cores. Second, it manages the subsequent data chunks and distributes them to the processing cores. The first processing core receives the first read sequence and the second core receives the second sequence and so on, up to the last core. It iterates the data chunk management task until no more reads are left in the repository.
Mapping Controller.
Following similar principles as the Read Controller, the Mapping Controller gathers the output results of the processing cores. Both the Read and Mapping Controllers preserve the original order of reads as in the repository (i.e., at the host). This is critical to ensure that each read will receive its own alignment filtering result. The results are transmitted back to the CPU side in a streaming fashion and then saved in the repository. case (FinalMask[i, i+1, i+2, i+3 
New Hardware-Based Amending Process for supporting
Insertions and Deletions Detection. In order to replace all patterns of 101 or 1001 to 111 or 1111 respectively, we use a single 5-input look-up table for each bit of the Hamming mask. The first look-up table copies the bit value of the first input regardless its value, even if it is zero it will not be amended as it is not contributing to the 101 or 1001 pattern. Likewise for the last look-up table. Thus, the total number of look-up tables needed is equal to the length of the short read in bases minus 2 for the first and last bit. In each look-up table, we consider a single bit of the Hamming mask and two of its right neighboring bits and two of its left neighboring bits. If the input corresponds to the output has a bit value of one, then the output copies the value of that input bit (as we only amend zeros). Otherwise, using the previous two bits and the following two bits in respect to the input bit, we can replace any zero of the "101" or "1001" patterns independently from other output bits (details are given in Algorithm 2). All bits of the amended masks are generated in the same time, as the propagation delay through an FPGA look-up table is independent of the implemented function (Xilinx November 17, 2014 ). Thus we can process all masks in a parallel fashion without affecting the correctness of the filtering decision.
mrFAST and mason Configurations
In our experiments, we generate the real read set and five simulated sets of short and long illumine-like reads using the command lines presented in Table 5 . The first and third sets have a number of base-substitutions that are drawn from a Gaussian distribution with mean of 3% (of read length) and 16%, respectively. The second, fourth, and fifth sets have a number of indels that are drawn from a Gaussian distribution with mean of 3% , 16%, and 16%, respectively. Fig. 8 presents the false positive rates of GateKeeper, SHD, and Adjacency Filter, using simulated reads from three different data sets, namely, Set3, Set4, and Set5. illumina -N 100000 -i -o Set1_64.fasta -n 64 -f -snN -nN -pmm 0.03 -pi 0 -pd 0 human_g1k_v37.fasta mason illumina -N 100000 -i -o Set1_100.fasta -n 100 -f -snN -nN -pmm 0.03 -pi 0 -pd 0 human_g1k_v37.fasta mason illumina -N 100000 -i -o Set1_150.fasta -n 150 -f -snN -nN -pmm 0.03 -pi 0 -pd 0 human_g1k_v37.fasta mason illumina -N 100000 -i -o Set1_300.fasta -n 300 -f -snN -nN -pmm 0.03 -pi 0 -pd 0 human_g1k_v37 .fasta Set 2: (400,000 low-indel reads) mason illumina -N 100000 -i -o Set2_64.fasta -n 64 -f -snN -nN -pmm 0.01 -pi 0.01 -pd 0.01 human_g1k_v37.fasta mason illumina -N 100000 -i -o Set2_100.fasta -n 100 -f -snN -nN -pmm 0.01 -pi 0.01 -pd 0.01 human_g1k_v37.fasta mason illumina -N 100000 -i -o Set2_150.fasta -n 150 -f -snN -nN -pmm 0.01 -pi 0.01 -pd 0.01 human_g1k_v37.fasta mason illumina -N 100000 -i -o Set2_300.fasta -n 300 -f -snN -nN -pmm 0.01 -pi 0.01 -pd 0.01 human_g1k_v37.fasta Set 3: (400,000 substitution-rich reads) mason illumina -N 100000 -i -o Set3_64.fasta -n 64 -f -snN -nN -pmm 0.16 -pi 0 -pd 0 human_g1k_v37.fasta mason illumina -N 100000 -i -o Set3_100.fasta -n 100 -f -snN -nN -pmm 0.16 -pi 0 -pd 0 human_g1k_v37.fasta mason illumina -N 100000 -i -o Set3_150.fasta -n 150 -f -snN -nN -pmm 0.16 -pi 0 -pd 0 human_g1k_v37.fasta mason illumina -N 100000 -i -o Set3_300.fasta -n 300 -f -snN -nN -pmm 0.16 -pi 0 -pd 
