The ability to generate massive amounts of sequencing data continues to overwhelm the processing capacity of existing algorithms and compute infrastructures. Calculating the similarities between a pair of genomic sequences is one of the most fundamental computational steps in genomic analysis. This step -called sequence alignment-is formulated as an approximate string matching (ASM) problem, which is typically solved using computationally expensive dynamic programming algorithms. In this work, we introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for the computationally costly sequence alignment step. The key idea of SneakySnake is to provide fast and highly accurate filtering by reducing the ASM problem to the single net routing (SNR) problem in VLSI chip layout. In the SNR problem, we are interested in only finding the path that connects two terminals with the least routing cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly and optimally solves the SNR problem and uses the found optimal path to decide whether performing sequence alignment is necessary. We also build two new hardware accelerator designs, Snake-on-Chip and Snake-on-GPU, that adopts modern FPGA (field-programmable gate array) and GPU (graphics processing unit) architectures, respectively, to further boost the performance of our algorithm.
Introduction
The ability to quickly analyze the overwhelming amounts of sequencing data enables several scientific advancements in personalized medicine and understanding genomic contributions to health across population (Hindorff et al., 2018) . One of the most fundamental computational steps in most genomic analyses is sequence alignment. This step is formulated as an approximate string matching problem (Navarro, 2001) and it calculates three key information: (1) edit distance between two given sequences, (2) type of each edit (i.e., insertion, deletion, or substitution), and (3) location of each edit in one of the two given sequences. Edit distance is defined as the minimum number of edits needed to convert one sequence into the other (Levenshtein, 1966) . These edits result from both sequencing errors (Fox et al., 2014) and genetic variations (McKernan et al., 2009) . Edits can have different weights, based on a user-defined scoring function, to allow favoring one edit type over another . Sequence alignment involves a backtracking step, which calculates an ordered list of characters representing the location and type of each possible edit operation required to change one of the two given sequences into the other. As any two sequences can have several different arrangements of the edit operations, we need to examine all possible prefixes of the two input sequences and keep track of the pairs of prefixes that provide a minimum edit distance. Therefore, sequence alignment approaches are typically implemented as dynamic programming algorithms to avoid re-examining the same prefixes many times (Eddy, 2004) . Dynamic programming based sequence alignment algorithms, such as Levenshtein distance (Levenshtein, 1966) , Smith-Waterman (Smith and Waterman, 1981) , and Needleman-Wunsch (Needleman and Wunsch, 1970) , are computationally expensive as they have quadratic time and space complexity (i.e., O(m 2 ) for a sequence length of m). Many attempts were made to boost the performance of existing sequence aligners. Recent works tend to follow one of two key directions: (1) Accelerating the dynamic programming algorithms using hardware accelerators and (2) Developing pre-alignment filtering heuristics that reduce the need for the dynamic programming algorithms, given an edit distance threshold. Hardware accelerators include building aligners that use 1) multi-core and SIMD (single instruction multiple data) capable central processing units (CPUs) , such as Edlib (Šošić anď Sikić, 2017) and Parasail , 2) graphics processing units (GPUs), such as GSWABE and CUDASW++ 3.0 , 3) field-programmable gate arrays (FPGAs), such as FPGASW , or 4) processing-in-memory architectures that enable performing computations inside the memory chip and alleviate the need for transferring the data to the CPU cores, such as RADAR . However, many of these efforts either simplify the scoring function, or only take into account accelerating the computation of the dynamic programming matrix without performing the backtracking step as in Nishimura et al., 2017; Chen et al., 2014) . Different and more sophisticated scoring functions are typically needed to better quantify the similarity between two sequences . The backtracking step involves unpredictable and irregular memory access patterns, which poses a difficult challenge for efficient hardware implementation.
Pre-alignment filtering heuristics aim to quickly eliminate some of the dissimilar sequences before using the computationally-expensive optimal alignment algorithms. Existing pre-alignment filtering techniques are either: 1) slow and they suffer from a limited sequence length, such as SHD , or 2) inaccurate after some edit distance threshold, such as GateKeeper and MAGNET . Shouji is currently the best-performing FPGA pre-alignment filter in terms of both accuracy and execution time.
We provide full descriptions of these two key directions of accelerating sequence alignment in Supplementary Material, Section 5.
Our goal in this work is to significantly reduce the time spent on calculating the sequence alignment of short sequences using very fast and highly accurate pre-alignment filtering. To this end, we introduce three new, fast, and very accurate pre-alignment filters, called SneakySnake, Snake-on-Chip, and Snake-on-GPU for different computing platforms. The key idea of SneakySnake is to provide highly-accurate pre-alignment filtering algorithm that remarkably accelerates the computation of sequence alignment by reducing the ASM problem to the single net routing (SNR) problem (Lee et al., 1976) . The SNR problem is to find the shortest routing path that interconnects two terminals on the boundaries of VLSI chip layout and passes through the minimum number of obstacles. Solving the SNR problem is faster than solving the ASM problem, as calculating the routing path after facing an obstacle is independent of the calculated path before this obstacle. This obviates the need for using computationally costly dynamic programming algorithms to keep track of the subpath that provides optimal solution. The key idea of Snake-on-Chip and Snake-on-GPU is to judiciously leverage the parallelism-friendly architecture of modern FPGAs and GPUs, respectively, to greatly speed up the SneakySnake algorithm. The contributions of this paper are as follows:
• We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter, which 1) reduces the ASM problem to the SNR problem, 2) efficiently and optimally solves the SNR problem, 3) uses the SNR solution to quickly identify dissimilar sequences, and 4) avoids performing computationally costly sequence alignment for such dissimilar sequences.
• We demonstrate that the SneakySnake algorithm is 1) correct and optimal in solving the SNR problem and 2) it runs in linear time with respect to the sequence length and the edit distance threshold.
• We demonstrate that the SneakySnake algorithm significantly improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to Shouji , GateKeeper and SHD .
• We demonstrate that SneakySnake accelerates the state-of-the-art CPU-based sequence aligners, Edlib (Šošić andŠikić, 2017) and Parasail , by 1.23×-37.6× and 1.23×-43.9×, respectively.
• We introduce a hardware accelerator design that tailors modern FPGA architectures to boost the performance of the SneakySnake algorithm. We call this design Snake-on-Chip.
• We introduce the first pre-alignment filter that leverages GPU architectures to boost the performance of the SneakySnake algorithm while providing the flexibility for users to change input parameters, such as edit distance threshold. We call this design Snake-on-GPU.
• We demonstrate that integrating Snake-on-Chip and Snake-on-GPU with four state-of-the-art sequence aligners, designed for different computing platforms, reduces their execution time by 1.6×-536× and 1.6×-689×, depending on the edit distance threshold used. Figure 1 : Chip layout with processing elements and two layers of metal routing tracks. In this example, the chip layout has 7 horizontal routing tracks (HRTs) located on the first layer and another 12 vertical routing tracks (VRTs) located on the second layer. We show only a single VRT out of the 12 VRTs for simplicity of illustration. The optimal signal net that is calculated using the SneakySnake algorithm (the solution to the single net routing problem) is highlighted in red using three escape segments. The first escape segment is connected to the second escape segment using an VRT through vias. The second escape segment is connected to the third escape segment without passing through a VRT as both are located on the same HRT. The optimal signal net passes through three processing elements and hence the signal net has a total delay of 3 × t obstacle .
If two genomic sequences differ by more than the edit distance threshold, then the two sequences are identified as dissimilar sequences and hence identifying the location and the type of each edit is not needed. The edit distance estimated by the SneakySnake algorithm should always be less than or equal to the actual edit distance value so that the SneakySnake algorithm keeps all similar sequences and ensures reliable filtering. To quickly estimate the edit distance between two sequences, we reduce the ASM problem to the SNR problem (Lee et al., 1976) . That is, instead of calculating the sequence alignment, the SneakySnake algorithm finds the routing path that interconnects two terminals and passes through the minimum number of obstacles on VLSI chip. The number of obstacles faced throughout the routing path represents a lower bound on the edit distance between two sequences and hence this number of obstacles can be used for the filtering decision of the SneakySnake algorithm. Solving the SNR problem is easier and faster than solving the ASM problem (as we show in Section 2.3). Next We explain the SNR problem.
Single Net Routing (SNR) Problem
The SNR problem (Lee et al., 1976) in VLSI chip layout refers to the problem of optimally interconnecting two terminals on a special grid graph while respecting constraints. We present an example of a VLSI chip layout in Fig. 1 . The goal is to find the optimal path -called signal net-that connects the source and the destination terminals through the chip layout. We describe the special grid graph of the SNR problem and define such optimal signal net as follows:
• The chip layout has two layers of evenly spaced metal routing tracks. While the first layer allows traversing the chip horizontally through dedicated horizontal routing tracks (HRTs), the second layer allows traversing the chip vertically using dedicated vertical routing tracks (VRTs).
• The horizontal and vertical routing tracks induce a two dimensional uniform grid over the chip layout. Each HRT can be obstructed by some obstacles (e.g., processing elements in the chip). For simplicity, we assume that VRTs can not be obstructed by obstacles. These obstacles allow the signal to pass horizontally through HRTs, but they induce a signal delay on the passed signal. Each obstacle induces a fixed propagation delay, t obstacle , on the victim signal that passes through the obstacle in the corresponding HRT.
• A signal net often uses a sequence of alternating horizontal and vertical segments that are parts of the routing tracks. Adjacent horizontal and vertical segments in the signal net are connected by an inter-layer via. We call a signal net optimal if it is both the shortest and the fastest routing path (i.e., passes through the minimum number of obstacles).
• Alternating between horizontal and vertical segments is restricted by passing a single obstacle. Thus, segment alternating strictly delays the signal by t obstacle time.
• The terminals can be any of the I/O pads that are located on the right-hand and left-hand boundaries of the chip layout. The source terminal always lies on the opposite side of the destination terminal.
The general goal of this SNR problem is to find an optimal signal net in the grid graph of the chip layout. For the simplicity of developing a solution, we call a horizontal segment that ends with at most an obstacle an escape segment. The escape segment can also be a single obstacle only. Also for simplicity, we call the right-hand side of an escape segment as a checkpoint, and the chip layout area as routing region. We relax this definition later in Section 2.5 to allow partitioning the single net routing problem into smaller subproblems that can be solved independently and in parallel. Next, we present how we can reduce the ASM problem to the SNR problem.
Reducing the Approximate String Matching (ASM) Problem to the Single Net Routing (SNR) Problem
Our goal is to calculate the sequence alignment accurately and quickly. There are two challenges against improving today's sequence alignment algorithms that we need to tackle. (1) How to build a data structure that is simpler than the dynamic programming table (i.e., data-dependency free) and yet it accurately represents the similarities and the differences between two sequences. (2) How to compute the optimal number of differences between the two sequences using this new data structure. To this end, we reduce the problem of finding the similarities and differences between two genomic sequences to that of finding the optimal signal net in a VLSI chip layout. Reducing the ASM problem to the SNR problem requires two key steps: (1) replacing the dynamic programming table used by the sequence alignment algorithm to a special grid graph called chip maze and (2) finding the number of differences between two genomic sequences in the chip maze by solving the SNR problem. We replace the (m + 1) × (m + 1) dynamic programming table with our chip maze, Z, as we show in Fig. 2 , where m is the sequence length (for simplicity, we assume that we have a pair of equal-length sequences but we relax this assumption towards the end of this section). The chip maze is a (2E + 1) × m grid graph, where E is the edit distance threshold, (2E + 1) is the number of HRTs, and m is the number of VRTs. The chip maze is an abstract layout for the VLSI chip layout, as we show in Fig. 2(d) for the same chip layout of Fig. 1 . Each entry of the chip maze represents the pairwise comparison result of a character of one sequence with another character of the other sequence. A pairwise match is represented by a white shaded entry in the chip maze and similarly a black shaded entry represents a pairwise mismatch.
Building the chip maze requires three main steps, as we illustrate in Fig. 2. (1) We start by building an m × m binary matrix, B, which visualizes the pairwise matches and mismatches between two sequences given an edit distance threshold of E characters. (2) We transform the m × m binary matrix into a more compact (2E + 1) × m binary matrix in order to simplify the computation of the optimal number of edits between two sequences. (3) We convert the (2E + 1) × m binary matrix into a chip maze, a grid of white and black shaded entries, which replaces the dynamic programming table.
Next, we show how to build the chip maze using these three steps in detail. Given two genomic sequences, a reference sequence R[1 . . . m] and a query sequence Q[1 . . . m], and an edit distance threshold E, we first build the m × m binary matrix that represents the comparison result of the i th character of Q with the j th character of R, where i and j satisfy 1 ≤ i ≤ m and i − E ≤ j ≤ i + E. We calculate the entry B[i, j] of the binary matrix as follows:
The entry B[i, j] is set to zero if the i th character of the query sequence matches the j th character of the reference sequence. Otherwise, it is set to one. We present in Fig. 2(b) an example of the m × m matrix for two sequences, where a query sequence Q differs from a reference sequence R by three edits. In order to simplify the computation over the m × m binary matrix, B, we transform it into a more compact binary matrix, where each computed diagonal vector of the m × m binary matrix is transformed into a row in the compact binary matrix. As the diagonal vectors of the binary matrix, B, have different lengths, we transform conservatively the diagonal vectors into rows, preserving the correct order of the comparison result between the corresponding characters of the two sequences. We dedicate each row of the compact matrix to a diagonal vector of the m × m binary matrix, based on their order from the top right corner to the bottom left corner of the B matrix, such that the E + 1 th row of the compact matrix represents the main diagonal of the B matrix and the r th row above or below the E + 1 th row represents the r th diagonal above or below the main diagonal, respectively, where 1 ≤ r ≤ E. We store each entry of a diagonal vector into its corresponding entry of the assigned row of the compact matrix, where the source and destination entries have the same column index value. For example, we store the entry B[1, 4] of the B matrix at the 4 th entry, from left-hand side, of the first row of the compact matrix. We fill the remaining empty entries of each row by ones to indicate that there is no match between the corresponding characters, as we show in Fig. 2(c) . The last step is to change the representation of a pairwise match (i.e., zero) and mismatch (i.e., one) to a white shaded entry and black shaded entry, respectively. This results in generating the (2E + 1) × m grid graph that is the chip maze, as we show in Fig. 2(d) .
The way we build our chip maze ensures achieving two key properties. The chip maze is (1) a datadependency free data structure and (2) a comprehensive representation of all pairwise matches and mismatches between two sequences with the existence of at most E edits. Achieving these two key properties addresses the first challenge against improving today's sequence alignment algorithms (as we discuss earlier in this subsection). Figure 2 : Steps of replacing the dynamic programming table in (a) with the chip maze, Z, in (d), for a reference sequence R = 'GGTGCAGAGCTC', a query sequence Q = 'GGTGAGAGTTGT', a sequence length (m) of 12, and an edit distance threshold (E) of 3. This includes three main steps to build the chip maze: (b) building an m × m binary matrix that represents the pairwise matches and mismatches between the i th character of Q and the j th character of R, where 1 ≤ i ≤ m and i − E ≤ j ≤ i + E, (c) transforming conservatively each diagonal vector of the m × m binary matrix into a row in the (2E + 1) × m binary matrix, (d) changing the representation of each entry of value one in the (2E + 1) × m binary matrix into a black shaded entry, otherwise we represent the entry of value zero as a white shaded entry.
The chip maze is a data-dependency free data structure as computing each of its entries is independent of every other and thus the entire grid graph can be computed all at once in a parallel fashion. Hence, our chip maze is well suited for both sequential and highly-parallel computing platforms (Seshadri et al., 2017) .
The chip maze is a comprehensive representation of all pairwise matches and mismatches between two sequences because each of its columns stores the result of comparing the j th character of the reference sequence R with each of its corresponding 2E + 1 characters of the query sequence, Q. These 2E + 1 characters of the query sequence, Q, are as follows: the j th character of the query sequence, Q, the E right-hand neighboring characters of the j th character, and the E left-hand neighboring characters of the j th character. This is essential to maintain an accurate detection of deleted and inserted characters in one or both given sequences. Each insertion and deletion can shift multiple trailing characters (e.g., deleting the character 'N' from 'GENOME' shifts the last three characters to the left direction, making it 'GEOME'). Hence, we need to compare a character of the reference sequence R with the neighboring characters of its corresponding character of the query sequence, Q, to cancel the effect of deletion/insertion and correctly detect the common subsequences between two sequences.
After replacing the dynamic programming table with a more efficient data representation for solving the SNR problem, the challenge becomes calculating the minimum number of edits between two sequences using the chip maze. As the white shaded entry of the chip maze represents a pairwise match, a sequence of horizontally consecutive white shaded entries forms a common subsequence between two sequences. A set of one or more non-overlapping sequences of horizontally consecutive white shaded entries forms a sequence of pairwise matches. If there is more than one sequence in this set, then there should be a black shaded entry that is located at the end of each sequence of consecutive white shaded entries (except probably the last sequence). Hence, similar to the SNR problem, we call each sequence of consecutive white shaded entries including at most a single black shaded entry that is located right after the sequence as an escape segment. These escape segments should also be non-overlapping as each entry indicates that the corresponding characters of both sequences are either similar or dissimilar. The more the total number of these white shaded entries in a set the less the total number of black shaded entries.
We observe that the backtracking step (of a global alignment algorithm) finds the optimal sequence of edit operations between two sequences by examining the value of the entries from the bottom-right entry of the dynamic programming table to the top-left entry (as the red arrows show in Fig. 2(a) ). Similar to a global alignment algorithm, the first escape segment in the optimal solution set should start from any entry of the first column of the chip maze. Similarly, the last escape segment in the set should end at any entry of the last column of the chip maze. Now, the problem becomes finding an optimal set (i.e., signal net) of non-overlapping escape segments. As we discuss in Section 2.2, a set of escape segments is optimal if there is no other set solves the SNR problem and has both less number of escape segments and less number of black shaded entries (i.e., obstacles). Once we find such an optimal set of escape segments, we can compute the minimum number of edits between two sequences as the total number of black shaded entries along the computed optimal set. Different from existing sequence alignment algorithms, we do not consider the vertical distance (i.e., the number of rows) between two escape segments in the calculation of the minimum number of edits. This tends to underestimate the actual number of edits between two sequences. Slightly underestimating the number of edits while achieving fast computation is acceptable as long as we do not overestimate the number of edits. We can justify this observation by using our fast computation method as a pre-alignment filtering step to decide whether sequence alignment computation is needed. If two sequences have more edits than the edit distance threshold, E, then we do not need computationally expensive algorithms to conclude that the two sequences have unacceptable number of edits. But if the number of edits is less than or equal the edit distance threshold, then our filtering step should be followed by accurate sequence alignment algorithms. This way ensures achieving two key properties: (1) allowing sequence alignment to be calculated only for similar (or nearly similar) sequences and (2) accelerating the sequence alignment algorithms without changing (or replacing) their algorithmic method and hence preserving all the capabilities of the sequence alignment algorithms.
Sequence alignment can be performed as a global alignment, where two sequences of the same length are aligned end-to-end, or a local alignment, where subsequences of the two given sequences are aligned. It can also be performed as a semi-global alignment (called glocal), where the entirety of one sequence is aligned towards one of the ends of the other sequence. To ensure a correct reduction of the ASM problem, we need to apply some changes to the way we count the black shaded entries along the optimal solution set. This means that if an optimal alignment algorithm performs a local alignment, then we should not consider the leading and trailing black shaded entries in the total count of edits between two given sequences. Similarly for a semi-global alignment, we should not consider the leading or the trailing black shaded entries. For the rest of the paper, we consider only the global alignment as the general case since it is more challenging and includes more computations (as it examines the similarity end-to-end).
Considering the chip maze as a chip layout where the rows represent the HRTs and the columns represent the VRTs, we observe that we can reduce the ASM problem to the SNR problem. The goal now is to find an optimal signal net set in the chip maze that has both the minimum length and the minimum number of obstacles. Next, we present an efficient algorithm that solves this SNR problem.
Solving the Single Net Routing Problem
The primary purpose of the SneakySnake algorithm is to solve the SNR problem by providing an optimal signal net. Solving the SNR problem requires achieving two key objectives: 1) achieving the lowest possible latency by finding the minimum number of escape segments that are sufficient to link the source terminal to the destination terminal and 2) achieving the shortest length of the signal net by considering each escape segment just once and in monotonically increasing order of their start index (or end index). The first objective is based on a key observation that a signal net with fewer escape segments has always fewer obstacles, as each escape segment has at most a single obstacle (based on our definition in Section 2.2). This key observation leads to a signal net that has the least possible total propagation delay. The second objective restricts the SneakySnake algorithm from ever searching backward for the longest escape segment. This leads to a signal net that has non-overlapping escape segments.
To achieve these two key objectives, the SneakySnake algorithm applies five simple and effective steps. (1) It first constructs the chip maze as we explain in the two previous sections. (2) At each new checkpoint, the SneakySnake algorithm always selects the longest escape segment that allows the signal to travel as far forward as possible until it reaches an obstacle. For each row of the chip maze, it computes the length of the first horizontal segment of consecutive white shaded entries that starts from a checkpoint and ends at an obstacle or at the row end. The SneakySnake algorithm compares the length of all 2E + 1 computed horizontal segments, selects the longest one, and considers it along with its first following obstacle as an escape segment. If the SneakySnake algorithm is unable to find a horizontal segment (i.e., all rows starts after a checkpoint with an obstacle), it considers one of the obstacles as the longest escape segment. (3) It creates a new checkpoint after the longest escape segment. (4) It repeats the first three steps until either the signal net reaches a destination terminal, or the total propagation delay exceeds the allowed propagation delay threshold (i.e., E × t obstacle ). (5) If SneakySnake finds the optimal net using the previous steps, then sequence alignment (e.g., exact number of edits, type of each edit, and location of each edit) between two sequences is calculated using user's favourite sequence alignment algorithm. Otherwise, the SneakySnake algorithm terminates without performing computationally expensive sequence alignment. We provide the SneakySnake algorithm along with analysis of its computational complexity (asymptotic run time and space complexity) in Supplementary Materials, Section 7. By achieving these two key objectives, the SneakySnake algorithm is both correct and optimal. The SneakySnake algorithm is correct as it always provides a signal net (if it exists) that interconnects the source terminal and the destination terminal. In other words, it does not lead to routing failure as signal will eventually reach its destination.
Theorem 1. The SneakySnake algorithm guarantees to find a signal net that interconnects the source terminal and the destination terminal when one exists. We provide the correctness proof for Theorem 1 in Supplementary Materials, Section 6.1. The SneakySnake algorithm is also optimal as it guarantees to find an optimal signal net that links the source terminal to destination terminal when one exists. Such an optimal signal net always ensures that the signal arrives the destination terminal with the least possible total propagation delay.
Theorem 2. When a signal net exists between the source terminal and the destination terminal, using the SneakySnake algorithm, a signal from the source terminal reaches the destination terminal with the minimum possible latency. We provide the optimality proof for Theorem 2 in Supplementary Materials, Section 6.2. Next, we explain in detail the SneakySnake algorithm.
Next, we discuss an efficient implementation of the SneakySnake algorithm. Instead of building the chip maze explicitly, we use an implicit representation of the chip layout. That is, we compute each row of the chip maze on-the-fly. The computation of the chip maze consists of two steps: (1) gradually shifting the query sequence, Q, and (2) performing pairwise comparison of the shifted version of Q with the reference sequence, R. The SneakySnake algorithm shifts the query sequence by r steps to construct the r th row that is located above or below the E + 1 th row (called main diagonal in Fig. 2(d) ) of the chip maze, where 1 ≤ r ≤ E. The shift direction is performed in the right-hand direction if the row is located above the E + 1 th row of the chip maze. Otherwise, the shift direction is performed in the left-hand direction. At each checkpoint, the SneakySnake algorithm starts computing on-the-fly one entry after another for each row until it faces an obstacle (the character of Q at the current index mismatches its corresponding character of R) or it reaches the end of the row. Thus, the entries that are actually calculated for each row of the chip maze are the entries that are located only between each checkpoint and the first obstacle, in each row, following this checkpoint. This significantly reduces the number of computations needed for the SneakySnake algorithm as we discuss in detail in Supplementary Materials, Section 7.
Snake-on-Chip Hardware Architecture
We introduce an FPGA-friendly architecture for the SneakySnake algorithm, called Snake-on-Chip. The main idea behind the hardware architecture of Snake-on-Chip is to divide the SNR problem into smaller non-overlapping subproblems. Each subproblem has a width of t VRTs and a height of 2E + 1 HRTs, where 1 < t ≤ m. We then solve each subproblem independently from the other subproblems. This approach results in three key benefits. (1) Downsizing the search space into a reasonably small grid graph with a known dimension at the design time limits the number of all possible solutions for that subproblem. This reduces the size of the look-up tables (LUTs) required to build the architecture and simplifies the overall design. (2) Dividing the SNR problem into subproblems helps to maintain a modular and scalable architecture that can be implemented for any sequence length and edit distance threshold. (3) All the smaller subproblems can be solved independently and rapidly with a high parallelism. This reduces the execution time of the overall algorithm as the SneakySnake algorithm does not need to evaluate the entire chip maze.
However, these three key benefits come at the cost of accuracy degradation. As we demonstrate in Theorem 2, the SneakySnake algorithm guarantees to find an optimal solution to the SNR problem. However, the solution for each subproblem is not necessarily part of the optimal solution for the main problem (with the original size of (2E + 1) × m). This is because the source and destination terminals of these subproblems are not necessarily the same. The source and destination terminals should be located at any of the 2E + 1 entries of the first and the last VRTs, respectively, of each subproblem, but the SneakySnake algorithm determines the exact location of the source and destination terminals for each subproblem based on its individual optimal solution. This causes to underestimate the total number of obstacles found along each signal net of each SNR subproblem. This is still acceptable as long as it solves the SNR problem quickly and without overestimating the number of obstacles. We provide the details of our hardware architecture of Snake-on-Chip in Supplementary Materials, Section 8.
Snake-on-GPU Parallel Implementation
We now introduce our GPU implementation of the SneakySnake algorithm, called Snake-on-GPU. The main idea of Snake-on-GPU is to exploit the large number (typically few thousands) of GPU threads provided by modern GPUs to solve a large number of SNR problems rapidly and concurrently. In Snake-on-Chip, we explicitly divide the SNR problem into smaller non-overlapping subproblems and then solve all subproblems concurrently and independently using our specialized hardware. In Snake-on-GPU, we follow a different approach than that of Snake-on-Chip by keeping the same size of the original SNR problem and solving a massive number of these SNR problems at the same time. Snake-on-GPU uses one single GPU thread to solve one SNR problem (i.e., comparing one query sequence to one reference sequence at a time). This granularity of computation fits well the amount of resources (e.g., registers) that are available to each GPU thread and avoids the need for synchronizing several threads working on the same SNR problem. GPUs offer more flexibility to the users to change the values of some input parameters of Snake-on-Chip without the need to build a new design as in FPGAs. Given the large size of the sequence pair dataset that the GPU threads need to access, we carefully design Snake-on-GPU to efficiently 1) copy the input dataset of query and reference sequences into the GPU global memory, which is the off-chip DRAM memory of GPUs (NVIDIA, 2019a) and it typically fits a few GB of data and 2) allow each thread to store its own query and reference sequences using the on-chip register file to avoid unnecessary accesses to the off-chip global memory. Each thread solves the complete SNR problem for a single query sequence and a single reference sequence. We provide the details of our hardware architecture of Snake-on-Chip in Supplementary Materials, Section 9.
Results
We now evaluate 1) the filtering accuracy, 2) the filtering time, and 3) the benefits of combining our three new pre-alignment filters with state-of-the-art aligners. For each experiment, we compare the performance of SneakySnake, Snake-on-Chip, and Snake-on-GPU to the existing state-of-the-art pre-alignment filters, Shouji , MAGNET , GateKeeper , and SHD . We run all experiments using a 3.3 GHz Intel E3-1225 CPU with 32 GB RAM. We use a Xilinx Virtex 7 VC709 board (Xilinx, 2013) to implement Snake-on-Chip and other existing accelerator architectures (for Shouji, MAGNET, and GateKeeper). We build the FPGA design using Vivado 2015.4 in synthesizable Verilog. We use a NVIDIA GeForce RTX 2080Ti card (NVIDIA, 2019b) with a global memory of 11 GB DDR6 to implement Snake-on-GPU. Both Snake-on-Chip and Snake-on-GPU are independent of the specific FPGA and GPU platforms as they do not rely on any vendor-specific computing elements (e.g. intellectual property cores).
Dataset Description
Our experimental evaluation uses 4 different real datasets. Each dataset contains 30 million real sequence pairs (text and query pairs). We obtain two different read sets ERR240727 1 and SRR826471 1) of the whole human genome that include two different read lengths (100 bp and 250 bp). We download these two read sets from EMBL-ENA (www.ebi.ac.uk/ena). We map each read set to the human reference genome (GRCh37) using mrFAST mapper. We obtain the human reference genome from the 1000 Genomes Project (1000 Genomes Project Consortium and others, 2012), ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/. For each read set, we use two different maximum numbers of allowed edits (2 and 40 for m =100 bp and 8 and 100 for m = 250 bp) using the e parameter of mrFAST to generate four real datasets in total. Each dataset contains the sequence pairs that are generated by mrFAST mapper before the read alignment step of mrFAST, such that we allow each dataset to contain both similar (i.e., having edits fewer than or equal to the edit distance threshold) and dissimilar (i.e., having more edits than the edit distance threshold) sequences over a wide range of edit distance thresholds. We provide the details of these four datasets in the Supplementary Materials, Section 10.1. For the reader's convenience, we refer to these datasets as Set 1, Set 2, Set 3, and Set 4.
Filtering Accuracy
We evaluate the accuracy of pre-alignment filter by computing its rate of falsely-accepted and falsely-rejected sequences before performing sequence alignment. The false accept rate is the ratio of the number of dissimilar sequences that are falsely accepted by the filter and the number of dissimilar sequences that are rejected by the sequence alignment algorithm. The false reject rate is the ratio of the number of similar sequences that are rejected by the filter and the number of similar sequences that are accepted by the sequence alignment algorithm. A reliable pre-alignment filter should always ensure both a 0% false reject rate and minimum false accept rate to maintain the correctness of overall pipeline and maximize the number of dissimilar sequences that are eliminated.
We first assess the false accept rate of SneakySnake, Shouji , MAGNET , GateKeeper , and SHD across different four real datasets and edit distance thresholds of 0% − 10% of the sequence length. In Fig. 3 , we provide the false accept rate of each of the five filters (we provide the exact values in Supplementary Materials, Section 10.2).
We use Edlib (Šošić andŠikić, 2017) to identify the ground truth truly-accepted sequences for each edit distance threshold. Based on Fig. 3 , we make four key observations. (1) We observe that all the five prealignment filters are less accurate in examining Set 1 and Set 3 than the other datasets, Set 2 and Set 4. (2) GateKeeper and SHD become ineffective for edit distance thresholds of greater than 8% and 3% for sequence lengths of 100 and 250 characters, respectively, as they accept all the input sequence pairs. This causes them to examine each sequence pair unnecessarily twice (i.e. once by GateKeeper or SHD and once by the sequence alignment algorithm). (3) SneakySnake provides the lowest false accept rate compared to all the four state-of-the-art pre-alignment filters. SneakySnake provides up to 31412×, 20603×, and 64.1× less number of falsely-accepted sequences compared to GateKeeper/SHD (using Set 4, E= 10%), Shouji (using Set 4, E= 10%), and MAGNET (using Set 1, E= 1%), respectively. (4) MAGNET provides the second lowest false accept rate. It provides up to 25552× and 16760× less number of False Accept Rate Figure 3 : The false accept rate of SneakySnake, Shouji, MAGNET, SHD, and GateKeeper across 4 real datasets. We use a wide range of edit distance thresholds (0% − 10% of the sequence length) for sequence lengths of 100 and 250 bp.
falsely-accepted sequences compared to GateKeeper/SHD (using Set 4, E= 10%) and Shouji (using Set 4, E= 10%), respectively.
Second, we assess the false reject rates of the five pre-alignment filters in Supplementary Materials, Section 10.2. We demonstrate that SneakySnake, Shouji, SHD, and GateKeeper all have a 0% false reject rate. We also observe that MAGNET provides a very low false reject rate of less than 0.00045% for E > 3% of the sequence length using Set 1 and Set 3.
Hence, we conclude that SneakySnake improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters. We also conclude that SneakySnake is the most effective pre-alignment filter, with a very low false accept rate and a 0% false reject rate across a wide range of both edit distance thresholds and sequence lengths.
Effects of SneakySnake on Sequence Alignment
We analyze the benefits of integrating CPU-based pre-alignment filters, SneakySnake and SHD with the state-of-the-art CPU-based sequence aligners, Edlib (Šošić andŠikić, 2017) and Parasail . We evaluate all tools using a single CPU core and single thread environment. Fig. 4 presents the normalized end-to-end execution time of SneakySnake and SHD each combined with Edlib and Parasail, using our four real datasets over edit distance thresholds of 0% − 10% of the sequence length. We make four key observations. (1) SneakySnake is up to 42.96× (using Set 3, E= 0%) and 39.43× (using Set 4, E= 5%) faster than Edlib and Parasail, respectively, in examining the sequence pairs. (2) The addition of SneakySnake as a pre-alignment filtering step reduces significantly the execution time of Edlib (Šošić anď Sikić, 2017) and Parasail by up to 37.6× (using Set 4, E= 0%) and 43.9× (using Set 4, E =2%), respectively. (3) The addition of SHD as a pre-alignment step reduces the execution time of Edlib and Parasail for some of the edit distance thresholds by up to 17.2× (using Set 2, E = 0%) and 34.86× (using Set 4, E= 3%), respectively.
However, for most of the edit distance thresholds, we observe that Edlib and Parasail are faster alone than with SHD combined as a pre-alignment filtering step. This is expected as SHD becomes ineffective in filtering for E > 8% and E > 3% for m= 100 bp and m= 250 bp, respectively, (as we show earlier in Section 3.2). (4) SneakySnake provides up to 8.92× (using Set 4, E= 4%) and 40× (using Set 4, E= 5%) more speedup to the end-to-end execution time of Edlib and Parasail compared to SHD. This is expected as SHD produces a high false accept rate (as we show earlier in Section 3.2).
We conclude that SneakySnake is the best-performing CPU-based pre-alignment filter in terms of both speed and accuracy. Integrating SneakySnake with sequence alignment algorithms is always beneficial and reduces the end-to-end execution time by up to an order of magnitude without the need for hardware accelerators. We also conclude that SneakySnake's performance also scales very well over a wide range of both edit distance thresholds and sequence lengths. Figure 4 : Normalized end-to-end execution time of SneakySnake and SHD combined with Edlib (upper plot) and Parasail (lower plot). We use four datasets over a wide range of edit distance thresholds (E= 0%-10% of the sequence length) for sequence lengths (m) of 100 bp and 250 bp. We present two speedup rates for each edit distance threshold of 0%, 5%, and 10% of the sequence length. The upper speedup rate represents the end-to-end speedup that is gained from combining the pre-alignment step with the alignment step. It is calculated as A/(B + C), where A is the execution time of the sequence aligner before adding SneakySnake, B is the execution time of SneakySnake, and C is the execution time of the sequence aligner after adding SneakySnake. The lower speedup rate is calculated as A/B.
Effects of Snake-on-Chip and Snake-on-GPU on Sequence Alignment
We analyze the benefits of integrating Snake-on-Chip and Snake-on-GPU with the state-of-the-art sequence aligners, designed for different computing platforms in Fig. 5 . We design the hardware architecture of Snake-on-Chip for a sub-maze's width of 8 VRTs (t= 8) and 3 replications (y= 3) per each sub-maze. We select this design choice as it allows for low FPGA resource utilization while maintaining low false accept rate. We analyze the effect of choosing different y and t values on the false accept rate of Snake-on-Chip in Supplementary Materials, Section 10.5. In this analysis, we compare the effect of combining Snakeon-Chip and Snake-on-GPU with existing sequence aligner with that of two state-of-the-art FPGA-based pre-alignment filters, Shouji and GateKeeper . We also select four state-of-the-art sequence aligners that are implemented for CPU (Edlib (Šošić andŠikić, 2017) and Parasail ), GPU (GSWABE ), and FPGA (FPGASW ). We use Set 1 and Set 2 in this analysis. GSWABE and FPGASW are not open-source and not available to us. Therefore, we scale the reported number of computed entries of the dynamic programming matrix in a second (i.e. GCUPS) as follows: 60000000/(GCUPS/100 2 ).
Based on Fig. 5 , we make three key observations. (1) The execution time of Edlib and Parasail reduces by up to 321× (using Set 2 and E = 5%) and 536× (using Set 2 and E = 5%), respectively, after the addition of Snake-on-Chip as a pre-alignment filtering step and by up to 368.3× (using Set 2 and E = 5%) and 689× (using Set 2 and E = 5%), respectively, after the addition of Snake-on-GPU as a pre-alignment filtering step. That is 40× to 51× more speedup compared to that provided by adding SneakySnake as a pre-alignment filter, using Set 2 and E = 5%. It is also up to 2× more speedup compared to that provided by adding Shouji and GateKeeper as a pre-alignment filter, using Set 1 and E=5% for Snake-on-Chip and using Set 2 and E=5% fot Snake-on-GPU. (2) FPGAs and GPUs based sequence aligners follow a similar trend to that we observe in the CPU implementations. However, the speedup ratios are reduced compared to that observed in the CPU based aligners. This is due to the low execution time of these hardware accelerated aligners. Snake-on-GPU provides up to 27.7× (using Set 2 and E = 5%) and 5.1× (using Set 2 and E = 5%) reduction in the end-to-end execution time of GSWABE and FPGASW, respectively. This is up to 1.3 more speedup compared to that provided by Snake-on-Chip, using Set 2. That is also up to 1.7× more speedup compared to that provided by adding Shouji and GateKeeper as a pre-alignment filter.
We conclude that both Snake-on-Chip and Snake-on-GPU provide the highest speedup ratio (up to two orders of magnitude) compared to the state-of-the-art CPU, FPGA, and GPU based sequence aligners over edit distance thresholds of 0%-5% of the sequence length. Figure 5 : Normalized end-to-end execution time of a pre-alignment filter (Snake-on-Chip, Snakeon-GPU, Shouji, and GateKeeper) combined with a sequence aligner (Edlib, Parasail, GSWABE, and FPGASW). We use two datasets, Set 1 (upper plot) and Set 2 (lower plot), over a wide range of edit distance thresholds (0%-5% of the sequence length, 100 bp). We present two speedup rates for edit distance thresholds of 0% and 5%. The upper speedup rate is the speedup gained from integrating Snake-on-GPU with the corresponding sequence aligner. The lower speedup rate represents the speedup gained from integrating Snake-on-Chip with the corresponding sequence aligner.
Discussion and Future Work
In this work, we introduce the single net routing problem and we show how to convert an approximate string matching problem into an instance of the single net routing problem. Subsequently, we propose a new algorithm that solves the single net routing problem and acts as a new pre-alignment filtering algorithm, which we call it SneakySnake. We demonstrate that the concept of pre-alignment filtering provides substantial benefits to the existing and future sequence alignment algorithms. Many of the existing acceleration efforts either simplify the scoring function, or only take into account accelerating the computation of the dynamic programming matrix without supporting the backtracking step. SneakySnake offers the ability to make the best use of existing aligners without sacrificing any of their capabilities (e.g., configurable scoring and backtracking), as it does not modify or replace the alignment step. Our algorithm does not exploit any SIMD-enabled CPU instructions or vendor-specific processor. This makes it attractive and cost-effective given a limited resources environment. SneakySnake improves the accuracy of pre-alignment filtering by up to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper, and SHD. The addition of SneakySnake as a pre-alignment filtering step reduces significantly the execution time of state-of-the-art CPU-based sequence aligners by up to an order of magnitude. We also explore the use of hardware/software co-design and hardware accelerations to further accelerate our SneakySnake algorithm. We introduce Snake-on-Chip and Snake-on-GPU, efficient and scalable FPGA and GPU based pre-laignment filters, respectively. Snake-on-Chip and Snake-on-GPU achieve up to two orders of magnitude speedup to the state-of-the-art sequence aligners. One direction to further improve the performance of Snake-on-Chip and Snake-on-GPU is to discover the possibility of performing the SneakySnake calculations where the huge amount of genomic data resides. Conventional computing requires the movement of genomic sequence pairs from the memory to the CPU processing cores (or to the FPGA chip), using slow and energy-hungry buses, such that cores can apply sequence alignment algorithm on the sequence pairs. Performing SneakySnake inside modern memory devices can alleviate this high communication cost by enabling simple arithmetic/logic operations very close to where the data resides, with high bandwidth and low latency . However, this requires re-designing the hardware architecture of Snake-on-Chip to leverage the supported operations in such modern memory devices .
A second potential target of our research is to explore the possibility of accelerating sequence alignment algorithms for longer sequences (few tens of thousands of characters) using our pre-alignment filters. Longer sequences pose two challenges. First, we need to transfer more data to the FPGA chip to be able process a single pair of sequences which is mainly limited by the data transfer rate of the communication link (i.e. PCIe) . Second, typical edit distance threshold used for sequence alignment is 5% of the sequence length. For considerably long sequences, edit distance threshold is around few hundreds of characters (Senol Cali et al., 2018; Firtina et al., 2019; Alser, 2019) . A large edit distance threshold leads to calculating a large number of horizontal routing tracks (i.e., 2E+1 tracks, where E is the edit distance thresholds). This makes random zeros (matches resulted from comparing each character of a given sequence to the corresponding neighboring characters of the other given sequence) to occur more frequently in the horizontal routing tracks as we show in . This would negatively affect the performance and accuracy of SneakySnake. We will investigate this effect and explore new pre-alignment filtering approaches for the sequencing data produced by third-generation sequence machines.
A third research direction is to enable the use of cloud computing for performing pre-alignment filtering at scale. Cloud computing offers access to a large number of advanced FPGA chips that can be used concurrently via a simple user-friendly interface. However, such a scenario requires the development of privacy-preserving pre-alignment filters due to privacy and legal concerns (Alser et al., 2015) . Our next efforts will focus on exploring privacy-preserving cloud-baased pre-alignment filtering. 
Supplementary Materials

Related Works
Recent works tend to follow one of two key directions to boost the performance of sequence alignment implementations: (1) Accelerating the dynamic programming algorithms using hardware accelerators and (2) Developing filtering heuristics that reduce the need for the dynamic programming algorithms, given an edit distance threshold.
However, many of these efforts either simplify the scoring function, or only take into account accelerating the computation of the dynamic programming matrix without providing the optimal alignment as in Liu et al., 2013; Nishimura et al., 2017) . Different and more sophisticated scoring functions are typically needed to better quantify the similarity between two sequences (Henikoff and Henikoff, 1992; Wang et al., 2011) . The backtracking step required for the optimal alignment computation involves unpredictable and irregular memory access patterns, which poses a difficult challenge for efficient hardware implementation.
Pre-alignment filtering heuristics aim to quickly eliminate some of the dissimilar sequences before using the computationally-expensive optimal alignment algorithms. There are a few existing filtering techniques such as the Adjacency Filter (Xin et al., 2013) , which is implemented for standard CPUs as part of FastHASH (Xin et al., 2013) . SHD is a SIMD-friendly bit-vector filter that provides higher filtering accuracy compared to the Adjacency Filter. To our knowledge, SHD is currently the best CPU-based pre-alignment filter, but it suffers from limited sequence length (up to only 128 characters) due to the SIMD register size used for its implementation. GRIM-Filter exploits the high memory bandwidth and the logic layer of 3D-stacked memory to perform highly-parallel filtering in the DRAM chip itself. GateKeeper is the first pre-alignment filter designed to utilize the large amounts of parallelism offered by FPGA architectures. GateKeeper ) provides a high filtering speed but suffers from relatively high number of falsely-accepted sequence pairs. MAGNET reduces significantly the number of falsely-accepted sequence pairs of GateKeeper but provides a very low number of falsely-rejected sequence pairs. Recently, Shouji introduces, to our knowledge, the most accurate and the fastest pre-alignment filter using new algorithm and new FPGA architecture.
In this work, we introduce the first GPU-based pre-alignment filter, called Snake-on-GPU. We also provide the most accurate and the fastest (as our experimental evaluation demonstrates) CPU and FPGA based prealignment filters, SneakySnake and Snake-on-Chip, respectively.
Proofs of the Correctness and Optimality of the SneakySnake Algorithm
As the propagation delay of a signal net is mainly affected by the number of obstacles that are considered in the horizontal escape segments of the selected path, for simplicity, we do not consider the vertical segments in our proof.
Correctness proof
PROOF. We prove Theorem 1 by contradiction. Let A = {s1, s2, …, sn} be the signal net that connects the source terminal to the destination terminal using n escape segments that are part of the horizontal routing tracks within a routing region. The escape segments are sorted by their start position (i.e., s1 starts before s2 and ends at s2). Assume that SneakySnake algorithm is not able to find this signal net A that reaches the destination terminal. This means that SneakySnake algorithm finds an escape segment, sk, but it fails to find the next escape segment, sk+1. Since there is a signal net that connects s1 to sn, there exists an escape segment that starts before sk+1 and ends at sk+1. This escape segment is not reachable from sk (as we assume that SneakySnake algorithm terminates the solution after finding sk), so it should be reachable from another escape segment, st, where t < k. This indicates that sk+1 is not reachable from sk and sk is not reachable from st. This contradicts the assumption that sk+1 is reachable and it is part of the solution. Thus, our assumption that SneakySnake algorithm is not able to find a signal net is wrong. ◼
Optimality proof
PROOF. We prove Theorem 2 by induction. Suppose you have a set of n candidate horizontal segments {1, 2, …, n} that are part of the horizontal routing tracks within a routing region. Each horizontal segment has a desired pair of start and end positions (s(i), f(i)). SneakySnake algorithm determines a signal net with the minimum total propagation delay by repeatedly selecting from the available horizontal segments the one that starts at the current location and has the farthest end location, and removing all overlapping horizontal segments from the set. Let A = {x1, x2, …, xk} be the solution (set of escape segments) to S provided by SneakySnake algorithm. The escape segments are sorted by their start position (i.e., x1 starts before x2 and ends at x2). Let B = {y1, y2, …, ym} be the optimal solution for the same problem. Let k = |A| and m = |B| denote the number of escape segments in A and B, respectively. The proof is by induction on the number of escape segments. We will compare A and B by their segments' end positions. We will show that for all r ≤ k, f(xr) ≥ f(yr). As the base case, we take k = m = 1. Since SneakySnake and optimal algorithm select the longest escape segment that start at the beginning of a horizontal routing track, it certainly must be the case that f(x1) ≥ f(y1). For r > 1, assume the statement is true for r − 1 and we will prove it for r. The induction hypothesis states that f(xr-1) ≥ f(yr-1), and so any horizontal segment that is not overlapping with the first r − 1 escape segments in the optimal solution are certainly not overlapping with the first r − 1 escape segments of SneakySnake algorithm. Therefore, we can add yr to SneakySnake solution, and since SneakySnake algorithm always considers the longest escape segments, it must be the case that f(xr) ≥ f(yr). So we have that for all r ≤ k, f(xr) ≥ f(yr). In particular, f(xk) ≥ f(yk). If A is not optimal, then it must be the case that m < k, and so there is an escape segment xm+1 in A that is not in B. This escape segment must start after A's m th escape segment ends, and hence after f(ym). But then this segment is not overlapping with all the escape segments in B, and so it should be part of the solution in B. This contradicts the assumption that m<k, and thus A has as many elements as B. So SneakySnake algorithm always produces an optimal solution. ◼
Run Time and Space Complexity Analysis of the SneakySnake Algorithm
We now analyze the asymptotic run time and space complexity of the SneakySnake algorithm. We provide the pseudocode of SneakySnake in Algorithm 1. The SneakySnake algorithm builds the chip maze on-thefly by constructing partially each horizontal routing track starting from each new checkpoint until it reaches an obstacle in each horizontal routing track. The SneakySnake algorithm does not necessarily construct the entire chip maze. At each new checkpoint, the SneakySnake algorithm examines if the signal net does not reach the destination terminal nor exceed the allowed propagation delay before it iterates (as we explain in Algorithm 1, line 4). It then uses the function UpperHRT() (Algorithm 2) to construct the first escape segment, after the current checkpoint, of each of the upper HRTs (as we explain in Algorithm 1, line 6). After constructing the escape segments, it computes their length and returns the length of the longest escape segment. Note that during the first iteration of the SneakySnake algorithm, the function UpperHRT() returns a value of 1, which is the length of a single obstacle. This is because all upper HRTs start with an obstacle. The SneakySnake algorithm performs the same steps as in the function UpperHRT() for the main HRT (Algorithm 1, line 7) and the lower HRTs (Algorithm 1, line 12) , by calling the two functions: MainHRT() (Algorithm 3) and LowerHRT() (Algorithm 4). Finally, we update the position of the checkpoint and the current propagation delay of the found signal net through Algorithm 1, line 15-18. Once the signal net exceeds the allowed propagation delay, the SneakySnake algorithm terminates (as we show in Algorithm 1, line 4 and line [19] [20] . Otherwise, the SneakySnake algorithm allows computationally expensive edit distance or pairwise alignment algorithms to compute their output based on the user-defined parameters (as we show in Algorithm 1, line 21-23).
On the one hand, the lower-bound on the time complexity of the SneakySnake algorithm is O(m), which is achieved when the SneakySnake algorithm reaches the destination terminal of the maze without facing any obstacle along the signal net. For example, when a pattern sequence matches exactly a text sequence, the SneakySnake algorithm traverses only through the E+1 th HRT (i.e., main HRT) and then allows the edit distance or alignment algorithm to perform its computation.
On the other hand, the upper-bound on the run time complexity of the SneakySnake algorithm is a result of constructing the entire chip maze. As we have 2E+1 horizontal routing tracks, each of which is m characters long, the upper-bound run time complexity is O((2E+1)m). However, it is unrealistic to construct the entire chip maze, as in this case, all the horizontal routing tracks should be identical in terms of the number and the location of all obstacles. Consider a pair of random pattern and text sequences, where each character is generated completely randomly (having 1/4 probability of being either A, C, G, or T). The probability that a character of the pattern sequence does not match any neighboring character of the text sequence during constructing any of the 2E+1 horizontal routing tracks is (3/4) 2E+1 , which decreases exponentially as E increases. Therefore, this upper-bound on the run time complexity is loose. We illustrate in Fig. 6 a typical scenario where the SneakySnake algorithm constructs only small fragments of the horizontal routing tracks. Fig. 6 : An example of the chip maze after applying the SneakySnake algorithm for a reference sequence R = "GGTGCAGAGCTC", a query sequence Q = "GGTGAGAGTTGT", a sequence length m=12, and an edit distance threshold E=3. The SneakySnake algorithm constructs only the needed portion of each of the horizontal routing track. The white area is the fragments of each horizontal routing track that is not constructed. 
Snake-on-Chip Hardware Architecture
Next, we present the details of our hardware architecture of Snake-on-Chip in four key steps. (1) Snake-on-Chip constructs the entire chip maze of each subproblem. Each chip maze has 2E+1 bit-vectors (rows) and each bit-vector is t bits long. This is now different from the CPU implementation of the SneakySnake algorithm, as the number of entries computed in each row is no longer limited to the entries that are located only between a checkpoint and the first following obstacle. This is due to the fundamental difference between CPU core (sequential execution) and FPGA chip (parallel processing). We want to concurrently compute all bits of all bit-vectors beforehand so that we can exploit massive bitwise parallelism provided by FPGA and perform computations on all bit-vectors in a parallel fashion.
(2) It computes the length of the first horizontal segment of consecutive zeros for each bit-vector (i.e., each HRT) using a leading-zero counter (LZC). Snake-on-Chip uses the LZC design proposed in (Dimitrakopoulos et al., 2008) as it requires low number of both logic gates and logic levels. It counts the number of consecutive zeros that appear in a t-bit input vector before the first more significant bit that is equal to one.
(3) Snake-on-Chip finds the bit-vector (i.e., HRT) that has the largest number of leading zeros. Snake-on-Chip implements a hierarchical comparator structure with ⌈ & (2 + 1)⌉ levels. Each comparator compares the output of two LZCs and finds the largest value. That is, we need 2E+2 comparators, each of which is a (⌊ & ⌋ + 1)-bit comparator, for comparing the leading zero counts of 2E+1 t-bit LZCs and finding the largest leading zero count. Consider that we choose t, E, and m to be 8 columns, 5 edits (i.e., 11 rows), and 100 characters, respectively. This results in partitioning the chip maze of size 11 × 100 into 13 subproblems, each of size 11 × 8. We need 11 LZCs and 12 comparators. We arrange the 12 LZC comparators into 4 levels: the first level that is closer to the LZC has 6 LZC comparators, the second level has 3 LZC comparators, the third level has 2 LZC comparators, and the last level has a single LZC comparators. This hierarchical comparator structure compares the 11 escape segments of a subproblem and produces the length of the longest escape segment. We provide the overall architecture of the 4-level LZC comparator tree including the 11 LZC block diagrams in Fig. 7. (4) After computing the length of the longest segment (the largest leading-zero count), Snake-on-Chip creates a new checkpoint in order to iterate over the HRTs once again to find the next optimal escape segment. Snake-on-Chip achieves this by shifting the bits of each row (i.e., HRT) to the right-hand direction (assuming the least significant bit starts from the right-hand side). The shift amount is equal to x bits, where x is the length of the previously-found longest escape segment of the consecutive zeros. To skip the obstacle that exists at the end of the longest escape segment, Snake-on-Chip shifts the bits of each row by an additional single step to the right-hand direction. This guarantees to exclude the previouslyfound longest escape segment along with a single obstacle from the new search round.
(5) As now, Snake-on-Chip preprocesses all the 2E+1 rows and make them ready for the next search round. Snake-on-Chip repeats the previous four steps in order to find the next optimal escape segment starting from the least significant bit all the way to the most significant bit. Repeating the previous steps for each iteration requires building a new replication for the architecture design of all the four previous steps. The number of replications needed depends on the desired accuracy of the SneakySnake algorithm. If our target is to find an optimal signal net that has at most a single obstacle within each subproblem, then we need to build two replications for the three previous steps (steps 2, 3, and 4). For example, let A be "00010000", where t = 8. The first replication computes the value of x as four zeros and updates the bits of A to "11111000". The second replication computes the value of x as three zeros and updates the bits of A to "11111111".
(6) The last step is to calculate the total number of obstacles faced along the entire optimal signal net in each subproblem. For each subproblem, Snake-on-Chip calculates the total number of obstacles as follows:
( , − ∑ 9 : 9;< )
( 2) where y is the total number of replications included in the architecture of Snake-on-Chip and xk is the length of the longest segment of consecutive zeros found by the replication of index $k$. Hence, the total number of obstacles for the original problem of size $(2E+1) \times m$ is simply the summation of the total number of obstacles faced along the optimal signal net of all subproblems. 
Snake-on-GPU Parallel Implementation
Snake-on-GPU makes three key assumptions that help with providing an efficient GPU implementation.
(1) The entire input dataset of query and reference sequences fits into the GPU global memory, which is the off-chip DRAM memory of GPUs (NVIDIA, 2019) and it typically fits a few GB of data.
(2) We copy the entire input dataset from the CPU main memory to the GPU global memory before the GPU kernel execution starts. This enables massively-parallel computation by making large number of input sequences available in the GPU global memory.
(3) We copy back the pre-alignment filtering results from the GPU global memory to the CPU main memory only after the GPU kernel completes the computation. If the size of the input dataset exceeds the size of the GPU global memory, we divide the dataset into independent smaller datasets, each of which can fit the capacity of the GPU global memory. This approach also helps to overlap the computation performed on one small dataset with the transfer of another small dataset between the CPU memory and GPU memory (Gómez-Luna et al., 2012) .
Given the large size of the input dataset that the GPU threads need to access from the GPU global memory, we carefully design Snake-on-GPU to efficiently use the on-chip register file to store the query and the reference sequences and avoid unnecessary accesses to the off-chip global memory. The workflow of Snake-on-GPU includes two key steps, as we provide in Fig. 8. 1 ) Each thread copies a single reference sequence and another single query sequence from global memory to the on-chip registers. Assuming the maximum length of a query (or reference) sequence is m (i.e., the maximum number of VRTs), we need 2m bits to encode each character of the query (or reference) sequence to a unique binary representation.
Since the size of a register is 4 bytes (32 bits), each thread needs = ? &@ A& B registers to store an entire query/reference sequence. For example, for a maximum length of m = 128, R = 8. This way, 16 registers are enough to store both query and reference sequence. This number is much lower than the maximum of 256 registers that each thread can use in current NVIDIA GPUs. Thus, the resources of a GPU core are not exhausted and more threads can run concurrently. 2) Each thread solves the complete SNR problem for a single query sequence and a single reference sequence. Each GPU thread applies the same computation of the SneakySnake algorithm to solve the SNR problem. 
Fig. 8: Workflow of Snake-on-GPU. It includes two key steps: (1) each GPU thread loads a single reference sequence and another query sequence into registers, (2) the thread solves a single SNR
Supplementary Evaluation
Dataset Description
Our experimental evaluation uses 4 different real datasets. We summarize the details of these four datasets in Table 1 . We provide the source used to obtain the read sets, the read length in each read set, and the configuration used for the e parameter of mrFAST ) for our real 4 datasets. We use Edlib (Šošić and Šikić, 2017) to assess the number of similar (i.e., having edits fewer than or equal to the edit distance threshold) and dissimilar (i.e., having more edits than the edit distance threshold) pairs for each of the 4 datasets across different user-defined edit distance thresholds. We provide these details for Set_1, Set_2, Set_3, and Set_4 in Table 2 . 
Evaluating the Number of Falsely-Accepted Sequence Pairs and Falsely-Rejected Sequence Pairs
We evaluate the number of falsely-accepted pairs and falsely-rejected pairs for SneakySnake, Shouji , MAGNET , SHD , and GateKeeper . We list the number of falsely-accepted and falsely-rejected sequences in Table 3 and Table 4 , respectively, across a wide range of edit distance thresholds of E= 0% to E= 10% of the sequence length. Table 4 : Number of falsely-rejected sequences of SneakySnake, Shouji, MAGNET, SHD, and GateKeeper across 4 real datasets. We use a wide range of edit distance thresholds (0%-10% of the sequence length) for sequence lengths of 100 and 250.
Evaluating the Filtering Speed of SneakySnake
We analyze the execution time of SneakySnake compared to the best performing existing CPU-based prealignment filter, SHD and two state-of-the-art CPU implementations of sequence alignment algorithms, Edlib (Šošić and Šikić, 2017) and Parasail . We evaluate them using a single CPU core and single thread environment on the same machine. SHD supports a sequence length of up to only 128 characters (due to the SIMD register size). To ensure as fair a comparison as possible, we allow SHD to divide the long sequences into batches of 128 characters, examine each batch individually, and then sum up the results. We configure Edlib to work as a banded global alignment tool by choosing the following parameters: (1) Edlib's k parameter (maximum number of diagonals computed for the dynamic programming table) to be equal to E, (2) EDLIB_MODE_NW, and (3) EDLIB_TASK_PATH. This enables Edlib to act as a sequence aligner where it provides the alignment path (backtracking), alignment score, and location of each edit. Note that the execution time of Edlib provided in Table 5 does not include the execution time required to generate CIGAR string, which is performed using another separate function, called edlibAlignmentToCigar(). We also configure Parasail to work as a banded global alignment tool by choosing the parameter nw_banded. Similar to Edlib, the execution time of Parasail provided in Table 5 Truly does not include the time spent in generating the CIGAR string, which is performed using another separate function, called parasail_result_get_cigar(). 
Effects of Pre-Alignment Filtering on Sequence Alignment
In Table 6 , we analyze the benefits of integrating our proposed pre-alignment filter, SneakySnake, and best performing CPU-based pre-alignment filter, SHD with state-of-the-art CPU implementations of sequence alignment algorithms, Edlib (Šošić and Šikić, 2017) and Parasail . Table 6 : End-to-end execution time (in seconds) of SneakySnake and SHD combined with Edlib and Parasail. We use four datasets over a wide range of edit distance thresholds (0%-10% of the sequence length) for sequence lengths of 100 and 250 characters. The green scale represents the end-to-end execution time of pipelines that include Edlib and the blue scale represents the end-toend execution time of pipelines that include Parasail.
Evaluating the Accuracy of Snake-on-Chip
We examine the feasibility of reducing the search space of the SneakySnake algorithm without causing falsely-rejected sequences. As we discuss in the main manuscript, Section 2.5, we column-wise partition the chip maze of the SneakySnake algorithm into adjacent non-overlapping smaller chip mazes, each of which has a size of 2E+1 by t, where t is the number of columns in each small chip maze. We provide the effects of this partitioning on the number of falsely-accepted sequences of our SneakySnake algorithm in Table 7 . We also provide the effects of this partitioning on the end-to-end execution time of SneakySnake combined with Edlib and Parasail in Table 8 . Next, we assess the effect of the number of iterations within a routing region (we call it as a replication in our Snake-on-Chip design, and hence we refer to it as a replication in this section for consistency) on the filtering accuracy of the SneakySnake algorithm. The number of replications affects the number of obstacles that can be detected within the routing region, as each replication is a single iteration of the SneakySnake algorithm that starts from a checkpoint and ends by finding a single longest escape segment. We first partition the chip maze into smaller chip mazes, each of which is a routing region of size 2E+1 by t. We vary the width of the routing region (t) to be one of the three values, 8, 16, and 32 columns. In this evaluation, we select the width value to be a power of 2 for the sake of simplicity of evaluating the hardware design, Snake-on-Chip. Note that the value of t has no limitation as long as it is larger than 1. We vary the number of replications from 3 up to 32 (with an increment of 3). We evaluate this effect on the number of falsely-accepted sequences in Table 9 . We make two observations. (1) We observe that increasing the number of the replications in the design improves the filtering accuracy of the SneakySnake algorithm. This observation is in accord with our expectation as each replication detects at most a single edit within each (2) Increasing the number of replications beyond half of the width value of the routing region (e.g., y>4 for t=8) has only slight or no effect. number of replications (y) , the number of obstacles that can be avoided, within a search window (t) of the SneakySnake algorithm on its filtering accuracy. The green scale and the red scale represent the low and high numbers of falsely-accepted sequences, respectively.
Next, we evaluate the effect of increasing the number of replications (that causes a decrease in the number of falsely-accepted sequences) on the overall execution time of the SneakySnake algorithm combined with each of Edlib (Šošić and Šikić, 2017) and Parasail . We evaluate this effect in Table 10 for Edlib and in Table 11 for Parasail, using the same values of t and y that we use in Table  9 . We make two key observations. (1) The addition of the SneakySnake algorithm, with a controlled number of replications, as a pre-alignment filtering step reduces significantly the execution time of Edlib (Šošić and Šikić, 2017) by up to 23.8x. It also reduces the end-to-end execution time of Parasail by up to 40.7x. (2) Changing the number of replications (y) for the same edit distance threshold provides slight or no effect on the overall execution time of the SneakySnake algorithm combined with Edlib and Parasail. 
Evaluating Resource Analysis and Execution time of Snake-on-Chip
We now examine the FPGA resource utilization for the hardware implementation of GateKeeper, Shouji, MAGNET, and Snake-on-Chip pre-alignment filters. We build the FPGA implementation of Snake-on-Chip using a sub-matrix's width of 8 columns (t=8) and we include 3 replications in the design. We evaluate our four pre-alignment filters using a single FPGA chip. We use 60 million sequence pairs, each of which is 100 bp long, from Set_1 and Set_2. We provide several hardware designs for two commonly used edit distance thresholds, 2 bp and 5 bp, for a sequence length of 100 bp. The VC709 FPGA chip contains 433,200 slice LUTs (look-up tables) and 866,400 slice registers (flip-flops). Table 12 lists the FPGA resource utilization for a single filtering unit. We make five main observations. (1) The design for a single MAGNET filtering unit requires about 10.5% and 37.8% of the available LUTs for edit distance thresholds of 2 bp and 5 bp, respectively. Hence, MAGNET can process 8 and 2 sequence pairs concurrently for edit distance thresholds of 2 bp and 5 bp, respectively, without violating the timing constraints of our hardware accelerator. (2) (4) Snake-on-Chip requires 15.4×-26.6× less LUTs compared to MAGNET. While Snake-on-Chip requires a slightly less LUTs compared to Shouji, it requires about 2× more LUTs compared to GateKeeper. Snake-on-Chip can also examine up to 16 sequence pairs concurrently. (5) We observe that the hardware implementations of Shouji, MAGNET, and Snake-on-Chip require pipelining the design (i.e., shortening the critical path delay of each processing core by dividing it into stages or smaller tasks) to enable meeting the timing constraints and achieve more parallelism. We also analyze the execution time of our hardware pre-alignment filters, GateKeeper, MAGNET, Shouji, and Snake-on-Chip. For a single filtering unit, each of the four pre-alignment filters takes about 0.7233 seconds to complete examining Set_1 and Set_2, regardless the edit distance threshold used (we test it for E = 0% to 5% of the sequence length). This is due to the fact that these hardware architectures utilize a 250 MHz clock signal that synchronizes the entire computation. That is, increasing the edit distance threshold directly increases the number of HRTs for each SNR subproblem but not necessarily increases the execution time as FPGA provides large number of LUTs that operate in parallel. This is only limited by the available FPGA resource and the operating frequency. This is clear from the FPGA resource usage that is correlated with the filtering accuracy and the edit distance threshold. For example, the least accurate filter, GateKeeper, occupies the least FPGA resource that can be integrated into the FPGA.
We conclude that Snake-on-Chip requires reasonably small number of LUTs, which allows for integrating large number of filtering units that can examine large number of sequence pairs in parallel.
Evaluating Accuracy and Execution time of Snake-on-GPU
We now examine 1) the execution time of Snake-on-Chip and 2) the number of sequence pairs that are accepted/rejected using Set_1 and Set_2 datasets. We use cudaEventElapsedTime() function to measure the total execution time as we provide in Table 13 . 
