SneakySnake: A Fast and Accurate Universal Genome Pre-Alignment Filter
  for CPUs, GPUs, and FPGAs by Alser, Mohammed et al.
Mohammed Alser 1,3, Taha Shahroodi 1, Juan Gómez-Luna 1, Can Alkan 3, and
Onur Mutlu 1,2,3
1Department of Computer Science, ETH Zurich, Zurich 8006, Switzerland
2Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh 15213, PA, USA
3Department of Computer Engineering, Bilkent University, Ankara 06800, Turkey
Abstract
The ability to generate massive amounts of sequencing data continues to overwhelm the processing capacity
of existing algorithms and compute infrastructures. Calculating the similarities between a pair of genomic
sequences is one of the most fundamental computational steps in genomic analysis. This step –called sequence
alignment– is formulated as an approximate string matching (ASM) problem, which is typically solved using
computationally expensive dynamic programming algorithms. In this work, we introduce SneakySnake, a
highly parallel and highly accurate pre-alignment filter that remarkably reduces the need for the computa-
tionally costly sequence alignment step. The key idea of SneakySnake is to provide fast and highly accurate
filtering by reducing the ASM problem to the single net routing (SNR) problem in VLSI chip layout. In the
SNR problem, we are interested in only finding the path that connects two terminals with the least routing
cost on a special grid layout that contains obstacles. The SneakySnake algorithm quickly and optimally
solves the SNR problem and uses the found optimal path to decide whether performing sequence alignment
is necessary. We also build two new hardware accelerator designs, Snake-on-Chip and Snake-on-GPU, that
adopt modern FPGA (field-programmable gate array) and GPU (graphics processing unit) architectures,
respectively, to further boost the performance of our algorithm.
SneakySnake significantly improves the accuracy of pre-alignment filtering by up to four orders of mag-
nitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper, and SHD. SneakySnake
accelerates the state-of-the-art CPU-based sequence aligners, Edlib and Parasail, by up to 37.6× and 43.9×,
respectively, without requiring hardware acceleration. The addition of Snake-on-Chip and Snake-on-GPU
as a pre-alignment filter reduces the execution time of four state-of-the-art sequence aligners, designed for
different computing platforms, by up to 689× (101× on average).
To our knowledge, SneakySnake is the fastest and most accurate pre-alignment filtering mechanism
that greatly speeds up genome sequence alignment while preserving its accuracy. It is
the only pre-alignment filtering mechanism that is universal, as it works on all modern high-performance
computing architectures, i.e., CPUs, GPUs, and FPGAs, by having software as well as software/hardware
co-designed versions. Unlike most existing works that aim to accelerate sequence alignment, SneakySnake
does not sacrifice any of the aligner capabilities (i.e., scoring and backtracking), as it does not modify
or replace the aligner. The three versions of SneakySnake are open source and freely available online at
https://github.com/CMU-SAFARI/SneakySnake/.
1 Introduction
The ability to quickly analyze the overwhelming amounts of sequencing data enables several scientific
advancements in personalized medicine and in understanding genomic contributions to health across populations
(Hindorff et al., 2018). One of the most fundamental computational steps in most genomic analyses is
sequence alignment. This step is formulated as an approximate string matching problem (Navarro, 2001)
and it calculates three key pieces of information: (1) edit distance between two given sequences, (2) type of each edit
(i.e., insertion, deletion, or substitution), and (3) location of each edit in one of the two given sequences.
Edit distance is defined as the minimum number of edits needed to convert one sequence into the other
(Levenshtein, 1966). These edits result from both sequencing errors (Fox et al., 2014) and genetic variations
(McKernan et al., 2009). Edits can have different weights, based on a user-defined scoring function, to
allow favoring one edit type over another (Wang et al., 2011). Sequence alignment involves a backtrack-
ing step, which calculates an ordered list of characters representing the location and type of each possible
edit operation required to change one of the two given sequences into the other. As any two sequences
can have several different arrangements of the edit operations, we need to examine all possible prefixes
of the two input sequences and keep track of the pairs of prefixes that provide a minimum edit distance.
Therefore, sequence alignment approaches are typically implemented as dynamic programming algorithms
to avoid re-examining the same prefixes many times (Eddy, 2004). Dynamic programming based sequence
alignment algorithms, such as Levenshtein distance (Levenshtein, 1966), Smith-Waterman (Smith and Wa-
terman, 1981), and Needleman-Wunsch (Needleman and Wunsch, 1970), are computationally expensive as
they have quadratic time and space complexity (i.e., O(m^2) for a sequence length of m). Many attempts
were made to boost the performance of existing sequence aligners. Recent works tend to follow one of two
key directions: (1) Accelerating the dynamic programming algorithms using hardware accelerators and (2)
Developing pre-alignment filtering heuristics that reduce the need for the dynamic programming algorithms,
arXiv:1910.09020v1 [q-bio.GN] 20 Oct 2019
given an edit distance threshold. Hardware accelerators include building aligners that use 1) multi-core and
SIMD (single instruction multiple data) capable central processing units (CPUs), such as Edlib (Šošić and
Šikić, 2017) and Parasail (Daily, 2016), 2) graphics processing units (GPUs), such as GSWABE (Liu and
Schmidt, 2015) and CUDASW++ 3.0 (Liu et al., 2013), 3) field-programmable gate arrays (FPGAs), such
as FPGASW (Fei et al., 2018), or 4) processing-in-memory architectures that enable performing computa-
tions inside the memory chip and alleviate the need for transferring the data to the CPU cores, such as
RADAR (Huangfu et al., 2018). However, many of these efforts either simplify the scoring function, or only
take into account accelerating the computation of the dynamic programming matrix without performing
the backtracking step as in (Liu et al., 2013; Nishimura et al., 2017; Chen et al., 2014). Different and more
sophisticated scoring functions are typically needed to better quantify the similarity between two sequences
(Wang et al., 2011). The backtracking step involves unpredictable and irregular memory access patterns,
which poses a difficult challenge for efficient hardware implementation.
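For reference, the quadratic-time dynamic programming computation of edit distance discussed above can be sketched as follows. This is a minimal textbook illustration with unit edit costs, not the implementation of any aligner cited here; the function name `edit_distance` is our own. It keeps only two rows of the table, which is enough for the distance but not for the backtracking step, which needs the full table.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (textbook dynamic programming)."""
    # prev[j] is the distance between the current prefix of a (one row back)
    # and the first j characters of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]
```

Every pair of prefixes is visited once, so the running time grows with the product of the two sequence lengths.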
Pre-alignment filtering heuristics aim to quickly eliminate some of the dissimilar sequences before using
the computationally-expensive optimal alignment algorithms. Existing pre-alignment filtering techniques
are either 1) slow and limited in supported sequence length, such as SHD (Xin et al., 2015), or 2)
inaccurate after some edit distance threshold, such as GateKeeper (Alser et al., 2017a) and MAGNET (Alser
et al., 2017b). Shouji (Alser et al., 2019) is currently the best-performing FPGA pre-alignment filter in terms
of both accuracy and execution time.
We provide full descriptions of these two key directions of accelerating sequence alignment in Supple-
mentary Material, Section 5.
Our goal in this work is to significantly reduce the time spent on calculating the sequence alignment of
short sequences using very fast and highly accurate pre-alignment filtering. To this end, we introduce three
new, fast, and very accurate pre-alignment filters, called SneakySnake, Snake-on-Chip, and Snake-on-GPU
for different computing platforms. The key idea of SneakySnake is to provide a highly accurate pre-alignment
filtering algorithm that remarkably accelerates the computation of sequence alignment by reducing the ASM
problem to the single net routing (SNR) problem (Lee et al., 1976). The SNR problem is to find the shortest
routing path that interconnects two terminals on the boundaries of a VLSI chip layout and passes through
the minimum number of obstacles. Solving the SNR problem is faster than solving the ASM problem,
as calculating the routing path after facing an obstacle is independent of the calculated path before this
obstacle. This obviates the need for using computationally costly dynamic programming algorithms to keep
track of the subpath that provides the optimal solution. The key idea of Snake-on-Chip and Snake-on-GPU
is to judiciously leverage the parallelism-friendly architecture of modern FPGAs and GPUs, respectively, to
greatly speed up the SneakySnake algorithm. The contributions of this paper are as follows:
• We introduce SneakySnake, a highly parallel and highly accurate pre-alignment filter, which 1) reduces
the ASM problem to the SNR problem, 2) efficiently and optimally solves the SNR problem, 3) uses
the SNR solution to quickly identify dissimilar sequences, and 4) avoids performing computationally
costly sequence alignment for such dissimilar sequences.
• We demonstrate that the SneakySnake algorithm 1) is correct and optimal in solving the SNR problem
and 2) runs in linear time with respect to the sequence length and the edit distance threshold.
• We demonstrate that the SneakySnake algorithm significantly improves the accuracy of pre-alignment
filtering by up to four orders of magnitude compared to Shouji (Alser et al., 2019), GateKeeper (Alser
et al., 2017a) and SHD (Xin et al., 2015).
• We demonstrate that SneakySnake accelerates the state-of-the-art CPU-based sequence aligners, Edlib
(Šošić and Šikić, 2017) and Parasail (Daily, 2016), by 1.23×-37.6× and 1.23×-43.9×, respectively.
• We introduce a hardware accelerator design that tailors modern FPGA architectures to boost the
performance of the SneakySnake algorithm. We call this design Snake-on-Chip.
• We introduce the first pre-alignment filter that leverages GPU architectures to boost the performance
of the SneakySnake algorithm while providing the flexibility for users to change input parameters, such
as edit distance threshold. We call this design Snake-on-GPU.
• We demonstrate that integrating Snake-on-Chip and Snake-on-GPU with four state-of-the-art sequence
aligners, designed for different computing platforms, reduces their execution time by 1.6×-536× and
1.6×-689×, depending on the edit distance threshold used.
2 Methods
2.1 Overview
The primary purpose of SneakySnake is to accelerate sequence alignment calculation by providing fast and
accurate pre-alignment filtering. The key idea behind the SneakySnake algorithm is to quickly examine each
sequence pair before applying sequence alignment algorithms and decide whether the computationally expensive
sequence alignment step is needed between two genomic sequences. This filtering decision of the SneakySnake
algorithm is made based on accurately estimating the number of edits between two given sequences.
[Figure 1 labels: I/O pad; processing element (obstacle); connection point (via); vertical routing track (VRT); horizontal routing track (HRT); escape segment; signal net (the solution)]
Figure 1: Chip layout with processing elements and two layers of metal routing tracks. In this
example, the chip layout has 7 horizontal routing tracks (HRTs) located on the first layer and
another 12 vertical routing tracks (VRTs) located on the second layer. We show only a single VRT
out of the 12 VRTs for simplicity of illustration. The optimal signal net that is calculated using
the SneakySnake algorithm (the solution to the single net routing problem) is highlighted in red
using three escape segments. The first escape segment is connected to the second escape segment
using a VRT through vias. The second escape segment is connected to the third escape segment
without passing through a VRT as both are located on the same HRT. The optimal signal net
passes through three processing elements and hence the signal net has a total delay of 3 × t_obstacle.
If two genomic sequences differ by more than the edit distance threshold, then the two sequences are
identified as dissimilar sequences and hence identifying the location and the type of each edit is not needed.
The edit distance estimated by the SneakySnake algorithm should always be less than or equal to the ac-
tual edit distance value so that the SneakySnake algorithm keeps all similar sequences and ensures reliable
filtering. To quickly estimate the edit distance between two sequences, we reduce the ASM problem to the
SNR problem (Lee et al., 1976). That is, instead of calculating the sequence alignment, the SneakySnake
algorithm finds the routing path that interconnects two terminals and passes through the minimum number
of obstacles on a VLSI chip. The number of obstacles faced throughout the routing path represents a lower
bound on the edit distance between two sequences and hence this number of obstacles can be used for the
filtering decision of the SneakySnake algorithm. Solving the SNR problem is easier and faster than solving
the ASM problem (as we show in Section 2.3). Next, we explain the SNR problem.
2.2 Single Net Routing (SNR) Problem
The SNR problem (Lee et al., 1976) in VLSI chip layout refers to the problem of optimally interconnecting
two terminals on a special grid graph while respecting constraints. We present an example of a VLSI chip
layout in Fig. 1. The goal is to find the optimal path –called signal net– that connects the source and the
destination terminals through the chip layout. We describe the special grid graph of the SNR problem and
define such an optimal signal net as follows:
• The chip layout has two layers of evenly spaced metal routing tracks. While the first layer allows
traversing the chip horizontally through dedicated horizontal routing tracks (HRTs), the second layer
allows traversing the chip vertically using dedicated vertical routing tracks (VRTs).
• The horizontal and vertical routing tracks induce a two dimensional uniform grid over the chip layout.
Each HRT can be obstructed by some obstacles (e.g., processing elements in the chip). For simplicity,
we assume that VRTs cannot be obstructed by obstacles. These obstacles allow the signal to pass
horizontally through HRTs, but they induce a signal delay on the passed signal. Each obstacle induces
a fixed propagation delay, t_obstacle, on the victim signal that passes through the obstacle in the
corresponding HRT.
• A signal net often uses a sequence of alternating horizontal and vertical segments that are parts of
the routing tracks. Adjacent horizontal and vertical segments in the signal net are connected by an
inter-layer via. We call a signal net optimal if it is both the shortest and the fastest routing path (i.e.,
passes through the minimum number of obstacles).
• Alternating between horizontal and vertical segments is possible only by passing a single obstacle. Thus,
each alternation between segments delays the signal by exactly t_obstacle.
• The terminals can be any of the I/O pads that are located on the right-hand and left-hand boundaries
of the chip layout. The source terminal always lies on the opposite side of the destination terminal.
The general goal of this SNR problem is to find an optimal signal net in the grid graph of the chip
layout. To simplify developing a solution, we call a horizontal segment that ends with at most one
obstacle an escape segment. The escape segment can also be a single obstacle only. Also for simplicity, we
call the right-hand side of an escape segment a checkpoint, and the chip layout area the routing region.
We relax this definition later in Section 2.5 to allow partitioning the single net routing problem into smaller
subproblems that can be solved independently and in parallel. Next, we present how we can reduce the
ASM problem to the SNR problem.
2.3 Reducing the Approximate String Matching (ASM) Problem to the
Single Net Routing (SNR) Problem
Our goal is to calculate the sequence alignment accurately and quickly. There are two challenges against
improving today’s sequence alignment algorithms that we need to tackle. (1) How to build a data structure
that is simpler than the dynamic programming table (i.e., data-dependency free) and yet it accurately
represents the similarities and the differences between two sequences. (2) How to compute the optimal
number of differences between the two sequences using this new data structure. To this end, we reduce
the problem of finding the similarities and differences between two genomic sequences to that of finding the
optimal signal net in a VLSI chip layout. Reducing the ASM problem to the SNR problem requires two key
steps: (1) replacing the dynamic programming table used by the sequence alignment algorithm with a special
grid graph called chip maze and (2) finding the number of differences between two genomic sequences in the
chip maze by solving the SNR problem. We replace the (m + 1) × (m + 1) dynamic programming table with
our chip maze, Z, as we show in Fig. 2, where m is the sequence length (for simplicity, we assume that we
have a pair of equal-length sequences but we relax this assumption towards the end of this section). The
chip maze is a (2E + 1)×m grid graph, where E is the edit distance threshold, (2E + 1) is the number of
HRTs, and m is the number of VRTs. The chip maze is an abstraction of the VLSI chip layout, as we
show in Fig. 2(d) for the same chip layout of Fig. 1. Each entry of the chip maze represents the pairwise
comparison result of a character of one sequence with another character of the other sequence. A pairwise
match is represented by a white shaded entry in the chip maze and similarly a black shaded entry represents
a pairwise mismatch.
Building the chip maze requires three main steps, as we illustrate in Fig. 2. (1) We start by building
an m×m binary matrix, B, which visualizes the pairwise matches and mismatches between two sequences
given an edit distance threshold of E characters. (2) We transform the m ×m binary matrix into a more
compact (2E + 1) ×m binary matrix in order to simplify the computation of the optimal number of edits
between two sequences. (3) We convert the (2E + 1) ×m binary matrix into a chip maze, a grid of white
and black shaded entries, which replaces the dynamic programming table.
Next, we show how to build the chip maze using these three steps in detail. Given two genomic sequences,
a reference sequence R[1...m] and a query sequence Q[1...m], and an edit distance threshold E, we first
build the m×m binary matrix that represents the comparison result of the ith character of Q with the jth
character of R, where i and j satisfy 1 ≤ i ≤ m and i−E ≤ j ≤ i+E. We calculate the entry B[i, j] of the
binary matrix as follows:
\[
B[i, j] =
\begin{cases}
0, & \text{if } Q[i] = R[j] \\
1, & \text{if } Q[i] \neq R[j]
\end{cases}
\tag{1}
\]
The entry B[i, j] is set to zero if the ith character of the query sequence matches the jth character of the
reference sequence. Otherwise, it is set to one. We present in Fig. 2(b) an example of the m ×m matrix
for two sequences, where a query sequence Q differs from a reference sequence R by three edits. In order
to simplify the computation over the m×m binary matrix, B, we transform it into a more compact binary
matrix, where each computed diagonal vector of the m×m binary matrix is transformed into a row in the
compact binary matrix. As the diagonal vectors of the binary matrix, B, have different lengths, we conservatively
transform the diagonal vectors into rows, preserving the correct order of the comparison result between
the corresponding characters of the two sequences. We dedicate each row of the compact matrix to a diagonal
vector of the m×m binary matrix, based on their order from the top right corner to the bottom left corner
of the B matrix, such that the (E + 1)th row of the compact matrix represents the main diagonal of the B
matrix and the rth row above or below the (E + 1)th row represents the rth diagonal above or below the main
diagonal, respectively, where 1 ≤ r ≤ E. We store each entry of a diagonal vector into its corresponding
entry of the assigned row of the compact matrix, where the source and destination entries have the same
column index value. For example, we store the entry B[1, 4] of the B matrix at the 4th entry, from the left-hand
side, of the first row of the compact matrix. We fill the remaining empty entries of each row with ones to
indicate that there is no match between the corresponding characters, as we show in Fig. 2(c). The last step
is to change the representation of a pairwise match (i.e., zero) and mismatch (i.e., one) to a white shaded
entry and black shaded entry, respectively. This results in generating the (2E + 1) ×m grid graph that is
the chip maze, as we show in Fig. 2(d).
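The construction above can be sketched in a few lines. This is a minimal illustration under the equal-length assumption, not the authors' implementation; the function name `build_chip_maze` and the use of 0-based indices are our own choices. It fuses steps (1) and (2) by writing each diagonal of B directly into its row of the (2E + 1) × m matrix, keeping the column index of the reference character and padding with ones.

```python
def build_chip_maze(ref: str, query: str, e: int) -> list[list[int]]:
    """Return the (2E + 1) x m binary matrix: 0 = match, 1 = mismatch/padding."""
    m = len(ref)
    if len(query) != m:
        raise ValueError("this sketch assumes equal-length sequences")
    maze = []
    # Rows go from the E-th upper diagonal (d = +e) to the E-th lower (d = -e),
    # so row index e (0-based) holds the main diagonal.
    for d in range(e, -e - 1, -1):
        row = []
        for j in range(m):   # 0-based column = index into the reference
            i = j - d        # 0-based index into the query on diagonal d
            if 0 <= i < m and query[i] == ref[j]:
                row.append(0)
            else:
                row.append(1)  # mismatch, or empty entry filled with one
        maze.append(row)
    return maze


# Sequences and threshold of Fig. 2.
maze = build_chip_maze("GGTGCAGAGCTC", "GGTGAGAGTTGT", 3)
```

For the sequences of Fig. 2, the rows returned by this sketch coincide with the matrix of Fig. 2(c). Each entry depends only on one character of each sequence, so the two loops can be evaluated in any order.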
The way we build our chip maze ensures achieving two key properties. The chip maze is (1) a data-
dependency free data structure and (2) a comprehensive representation of all pairwise matches and mis-
matches between two sequences with the existence of at most E edits. Achieving these two key properties
addresses the first challenge against improving today’s sequence alignment algorithms (as we discuss earlier
in this subsection).
[Figure 2, panels (a)-(d): (a) dynamic programming table, (b) m × m binary matrix, (c) (2E + 1) × m binary matrix, (d) chip maze. Entries of panel (c):]
column               1  2  3  4  5  6  7  8  9 10 11 12
3rd Upper Diagonal   1  1  1  0  1  1  0  0  0  1  1  1
2nd Upper Diagonal   1  1  1  0  1  1  1  1  1  1  0  1
1st Upper Diagonal   1  0  1  1  1  0  0  0  0  1  0  1
Main Diagonal        0  0  0  0  1  1  1  1  1  1  1  1
1st Lower Diagonal   0  1  1  1  1  0  0  1  1  1  0  1
2nd Lower Diagonal   1  0  1  0  1  1  1  1  0  1  1  1
3rd Lower Diagonal   0  1  1  1  1  1  1  1  1  1  1  1
Figure 2: Steps of replacing the dynamic programming table in (a) with the chip maze, Z, in (d),
for a reference sequence R = ‘GGTGCAGAGCTC’, a query sequence Q = ‘GGTGAGAGTTGT’,
a sequence length (m) of 12, and an edit distance threshold (E) of 3. This includes three main steps
to build the chip maze: (b) building an m×m binary matrix that represents the pairwise matches
and mismatches between the ith character of Q and the jth character of R, where 1 ≤ i ≤ m and
i−E ≤ j ≤ i+E, (c) transforming conservatively each diagonal vector of the m×m binary matrix
into a row in the (2E + 1) × m binary matrix, (d) changing the representation of each entry of
value one in the (2E + 1)×m binary matrix into a black shaded entry, otherwise we represent the
entry of value zero as a white shaded entry.
The chip maze is a data-dependency free data structure as computing each of its entries is
independent of every other and thus the entire grid graph can be computed all at once in a parallel fashion.
Hence, our chip maze is well suited for both sequential and highly-parallel computing platforms (Seshadri
et al., 2017).
The chip maze is a comprehensive representation of all pairwise matches and mismatches
between two sequences because each of its columns stores the result of comparing the jth character of
the reference sequence R with each of its corresponding 2E + 1 characters of the query sequence, Q. These
2E + 1 characters of the query sequence, Q, are as follows: the jth character of the query sequence, Q, the
E right-hand neighboring characters of the jth character, and the E left-hand neighboring characters of the
jth character. This is essential to maintain an accurate detection of deleted and inserted characters in one
or both given sequences. Each insertion and deletion can shift multiple trailing characters (e.g., deleting the
character ‘N’ from ‘GENOME’ shifts the last three characters to the left, making it ‘GEOME’).
Hence, we need to compare a character of the reference sequence R with the neighboring characters of its
corresponding character of the query sequence, Q, to cancel the effect of deletion/insertion and correctly
detect the common subsequences between two sequences.
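The effect of a single deletion can be checked with a small sketch. The helper `diagonal` below is hypothetical and, for brevity, indexes entries by the query position rather than by the reference column used in the chip maze. It compares each query character with the reference character at a fixed offset, where an offset of zero corresponds to the main diagonal and an offset of one to the first neighboring diagonal.

```python
def diagonal(ref: str, query: str, shift: int) -> list[int]:
    """0 where query[i] matches ref[i + shift], 1 otherwise (or out of range)."""
    return [
        0 if 0 <= i + shift < len(ref) and query[i] == ref[i + shift] else 1
        for i in range(len(query))
    ]


ref, query = "GENOME", "GEOME"      # 'N' is deleted from the reference
main = diagonal(ref, query, 0)      # [0, 0, 1, 1, 1]
neighbor = diagonal(ref, query, 1)  # [1, 1, 0, 0, 0]
```

The main diagonal matches only the two characters before the deletion, while the neighboring diagonal recovers the three shifted characters. Examining both is what allows the common subsequences ‘GE’ and ‘OME’ to be detected.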
After replacing the dynamic programming table with a more efficient data representation for solving
the SNR problem, the challenge becomes calculating the minimum number of edits between two sequences
using the chip maze. As the white shaded entry of the chip maze represents a pairwise match, a sequence
of horizontally consecutive white shaded entries forms a common subsequence between two sequences. A
set of one or more non-overlapping sequences of horizontally consecutive white shaded entries forms a
sequence of pairwise matches. If there is more than one sequence in this set, then there should be a
black shaded entry that is located at the end of each sequence of consecutive white shaded entries (except
possibly the last sequence). Hence, similar to the SNR problem, we call each sequence of consecutive white
shaded entries, together with at most a single black shaded entry located right after the sequence, an
escape segment. These escape segments should also be non-overlapping as each entry indicates that the
corresponding characters of both sequences are either similar or dissimilar. The larger the total number of
white shaded entries in a set, the smaller the total number of black shaded entries.
We observe that the backtracking step (of a global alignment algorithm) finds the optimal sequence of
edit operations between two sequences by examining the value of the entries from the bottom-right entry of
the dynamic programming table to the top-left entry (as the red arrows show in Fig. 2(a)). Similar to a
global alignment algorithm, the first escape segment in the optimal solution set should start from any entry
of the first column of the chip maze. Similarly, the last escape segment in the set should end at any entry
of the last column of the chip maze. Now, the problem becomes finding an optimal set (i.e., signal net) of
non-overlapping escape segments. As we discuss in Section 2.2, a set of escape segments is optimal if no
other set that solves the SNR problem has both fewer escape segments and fewer
black shaded entries (i.e., obstacles). Once we find such an optimal set of escape segments, we can compute
the minimum number of edits between two sequences as the total number of black shaded entries along the
computed optimal set.
Different from existing sequence alignment algorithms, we do not consider the vertical distance (i.e.,
the number of rows) between two escape segments in the calculation of the minimum number of edits.
This tends to underestimate the actual number of edits between two sequences. Slightly underestimating
the number of edits while achieving fast computation is acceptable as long as we do not overestimate the
number of edits. We can justify this observation by using our fast computation method as a pre-alignment
filtering step to decide whether sequence alignment computation is needed. If two sequences have more edits
than the edit distance threshold, E, then we do not need computationally expensive algorithms to conclude
that the two sequences have an unacceptable number of edits. But if the number of edits is less than or
equal to the edit distance threshold, then our filtering step should be followed by accurate sequence alignment
algorithms. This approach achieves two key properties: (1) allowing sequence alignment to be calculated
only for similar (or nearly similar) sequences and (2) accelerating the sequence alignment algorithms without
changing (or replacing) their algorithmic method and hence preserving all the capabilities of the sequence
alignment algorithms.
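The reason underestimation is safe while overestimation is not can be written compactly. The symbols are our own notation and do not appear elsewhere in the text: let e denote the actual edit distance of a sequence pair and \hat{e} the number of edits estimated by the filter.

```latex
\[
\hat{e} \le e
\quad\Longrightarrow\quad
\bigl( e \le E \;\Rightarrow\; \hat{e} \le E \bigr)
\quad\text{and}\quad
\bigl( \hat{e} > E \;\Rightarrow\; e > E \bigr)
\]
```

Hence rejecting a pair when \hat{e} > E never discards a pair whose actual edit distance is within the threshold. A pair with \hat{e} \le E < e passes the filter, and the subsequent sequence alignment step identifies it as dissimilar.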
Sequence alignment can be performed as a global alignment, where two sequences of the same length are
aligned end-to-end, or a local alignment, where subsequences of the two given sequences are aligned. It can
also be performed as a semi-global alignment (called glocal), where the entirety of one sequence is aligned
towards one of the ends of the other sequence. To ensure a correct reduction of the ASM problem, we need
to apply some changes to the way we count the black shaded entries along the optimal solution set. This
means that if an optimal alignment algorithm performs a local alignment, then we should not consider the
leading and trailing black shaded entries in the total count of edits between two given sequences. Similarly
for a semi-global alignment, we should not consider the leading or the trailing black shaded entries. For the
rest of the paper, we consider only the global alignment as the general case since it is more challenging and
includes more computations (as it examines the similarity end-to-end).
Considering the chip maze as a chip layout where the rows represent the HRTs and the columns represent
the VRTs, we observe that we can reduce the ASM problem to the SNR problem. The goal now is to find
an optimal signal net set in the chip maze that has both the minimum length and the minimum number of
obstacles. Next, we present an efficient algorithm that solves this SNR problem.
2.4 Solving the Single Net Routing Problem
The primary purpose of the SneakySnake algorithm is to solve the SNR problem by providing an optimal
signal net. Solving the SNR problem requires achieving two key objectives: 1) achieving the lowest possible
latency by finding the minimum number of escape segments that are sufficient to link the source terminal
to the destination terminal and 2) achieving the shortest length of the signal net by considering each escape
segment just once and in monotonically increasing order of their start index (or end index). The first
objective is based on the key observation that a signal net with fewer escape segments always has fewer
obstacles, as each escape segment has at most a single obstacle (based on our definition in Section 2.2).
This key observation leads to a signal net that has the least possible total propagation delay. The second
objective restricts the SneakySnake algorithm from ever searching backward for the longest escape segment.
This leads to a signal net that has non-overlapping escape segments.
To achieve these two key objectives, the SneakySnake algorithm applies five simple and effective steps.
(1) It first constructs the chip maze as we explain in the two previous sections. (2) At each new checkpoint,
the SneakySnake algorithm always selects the longest escape segment that allows the signal to travel as far
forward as possible until it reaches an obstacle. For each row of the chip maze, it computes the length of
the first horizontal segment of consecutive white shaded entries that starts from a checkpoint and ends at
an obstacle or at the row end. The SneakySnake algorithm compares the length of all 2E + 1 computed
horizontal segments, selects the longest one, and considers it along with its first following obstacle as an
escape segment. If the SneakySnake algorithm is unable to find a horizontal segment (i.e., every row has
an obstacle immediately after the checkpoint), it considers one of the obstacles as the longest escape segment. (3) It
creates a new checkpoint after the longest escape segment. (4) It repeats the first three steps until either
the signal net reaches a destination terminal, or the total propagation delay exceeds the allowed propagation
delay threshold (i.e., E × t_obstacle). (5) If SneakySnake finds the optimal net using the previous steps, then
sequence alignment (e.g., exact number of edits, type of each edit, and location of each edit) between the two
sequences is calculated using the user's preferred sequence alignment algorithm. Otherwise, the SneakySnake
algorithm terminates without performing computationally expensive sequence alignment. We provide the
SneakySnake algorithm along with analysis of its computational complexity (asymptotic run time and space
complexity) in Supplementary Materials, Section 7. By achieving these two key objectives, the SneakySnake
algorithm is both correct and optimal. The SneakySnake algorithm is correct as it always provides a signal
net (if it exists) that interconnects the source terminal and the destination terminal. In other words, it does
not lead to routing failure, as the signal will eventually reach its destination.
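The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code (which is given in the Supplementary Materials): the function names are ours, the rows are ordered by shift from -E to +E, entries for which the shifted query falls outside its bounds are treated as obstacles, and each obstacle is assumed to contribute one unit of propagation delay.

```python
def build_chip_maze(reference, query, E):
    """Step 1: explicit chip maze with 2E + 1 rows; 0 = match, 1 = obstacle."""
    maze = []
    for shift in range(-E, E + 1):        # shift 0 is the main diagonal row
        row = []
        for col in range(len(reference)):
            q = col + shift               # index into the shifted query
            match = 0 <= q < len(query) and query[q] == reference[col]
            row.append(0 if match else 1)
        maze.append(row)
    return maze


def sneaky_snake(reference, query, E):
    """Return True if the pair should be passed on to sequence alignment."""
    maze = build_chip_maze(reference, query, E)
    m = len(reference)
    checkpoint = 0
    obstacles = 0
    while checkpoint < m:                 # step 4: repeat until the destination
        longest = 0
        for row in maze:                  # step 2: longest run of zeros per row
            length = 0
            while checkpoint + length < m and row[checkpoint + length] == 0:
                length += 1
            longest = max(longest, length)
        checkpoint += longest
        if checkpoint < m:                # the escape segment ends at an obstacle
            obstacles += 1
            if obstacles > E:             # delay threshold exceeded: reject
                return False
            checkpoint += 1               # step 3: new checkpoint after obstacle
    return True                           # step 5: alignment is still needed
```

For example, `sneaky_snake("ACGAACGT", "ACGTACGT", 1)` accepts the pair (one substitution), while the same pair is rejected for `E = 0`.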
Theorem 1. The SneakySnake algorithm guarantees to find a signal net that interconnects the source
terminal and the destination terminal when one exists.
We provide the correctness proof for Theorem 1 in Supplementary Materials, Section 6.1. The SneakySnake
algorithm is also optimal as it guarantees to find an optimal signal net that links the source terminal to
the destination terminal when one exists. Such an optimal signal net always ensures that the signal arrives at the
destination terminal with the least possible total propagation delay.
Theorem 2. When a signal net exists between the source terminal and the destination terminal, using
the SneakySnake algorithm, a signal from the source terminal reaches the destination terminal with the
minimum possible latency.
We provide the optimality proof for Theorem 2 in Supplementary Materials, Section 6.2. Next, we explain
in detail the SneakySnake algorithm.
Next, we discuss an efficient implementation of the SneakySnake algorithm. Instead of building the chip
maze explicitly, we use an implicit representation of the chip layout. That is, we compute each row of the
chip maze on-the-fly. The computation of the chip maze consists of two steps: (1) gradually shifting the
query sequence, Q, and (2) performing pairwise comparison of the shifted version of Q with the reference
sequence, R. The SneakySnake algorithm shifts the query sequence by r steps to construct the rth row
that is located above or below the (E + 1)th row (called main diagonal in Fig. 2(d)) of the chip maze, where
1 ≤ r ≤ E. The query sequence is shifted to the right if the row is located above the
(E + 1)th row of the chip maze. Otherwise, it is shifted to the left. At each
checkpoint, the SneakySnake algorithm starts computing on-the-fly one entry after another for each row until
it faces an obstacle (the character of Q at the current index mismatches its corresponding character of R) or
it reaches the end of the row. Thus, the entries that are actually calculated for each row of the chip maze are
the entries that are located only between each checkpoint and the first obstacle, in each row, following this
checkpoint. This significantly reduces the number of computations needed for the SneakySnake algorithm
as we discuss in detail in Supplementary Materials, Section 7.
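The implicit representation can be sketched in Python as below. This is an illustrative sketch, not the authors' implementation: the name `sneaky_snake_on_the_fly` and the returned comparison counter are ours, a negative shift stands for a right shift of Q and a positive shift for a left shift, and positions where the shifted query runs past its bounds are treated as obstacles without a character comparison.

```python
def sneaky_snake_on_the_fly(reference, query, E):
    """Filter a pair without storing the chip maze.

    Returns (accepted, compared), where `compared` is the number of character
    comparisons performed, i.e. the maze entries that were actually computed.
    """
    m = len(reference)
    checkpoint = 0
    obstacles = 0
    compared = 0
    while checkpoint < m:
        longest = 0
        for shift in range(-E, E + 1):    # one implicit row per shift of Q
            length = 0
            col = checkpoint
            while col < m:
                q = col + shift
                if q < 0 or q >= len(query):
                    break                 # out of bounds: obstacle, no comparison
                compared += 1
                if query[q] != reference[col]:
                    break                 # first obstacle after the checkpoint
                length += 1
                col += 1
            longest = max(longest, length)
        checkpoint += longest
        if checkpoint < m:
            obstacles += 1
            if obstacles > E:
                return False, compared
            checkpoint += 1
    return True, compared
```

For two identical 8-character sequences `"ACGTACGT"` and `E = 2`, this sketch performs 10 comparisons, whereas an explicit maze would hold (2E + 1) × m = 40 entries.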
2.5 Snake-on-Chip Hardware Architecture
We introduce an FPGA-friendly architecture for the SneakySnake algorithm, called Snake-on-Chip. The
main idea behind the hardware architecture of Snake-on-Chip is to divide the SNR problem into smaller
non-overlapping subproblems. Each subproblem has a width of t VRTs and a height of 2E+ 1 HRTs, where
1 < t ≤ m. We then solve each subproblem independently from the other subproblems. This approach
results in three key benefits. (1) Downsizing the search space into a reasonably small grid graph with
a known dimension at the design time limits the number of all possible solutions for that subproblem.
This reduces the size of the look-up tables (LUTs) required to build the architecture and simplifies the
overall design. (2) Dividing the SNR problem into subproblems helps to maintain a modular and scalable
architecture that can be implemented for any sequence length and edit distance threshold. (3) All the smaller
subproblems can be solved independently and rapidly with a high parallelism. This reduces the execution
time of the overall algorithm as the SneakySnake algorithm does not need to evaluate the entire chip maze.
However, these three key benefits come at the cost of accuracy degradation. As we demonstrate in
Theorem 2, the SneakySnake algorithm guarantees to find an optimal solution to the SNR problem. However,
the solution for each subproblem is not necessarily part of the optimal solution for the main problem (with
the original size of (2E+1)×m). This is because the source and destination terminals of these subproblems
are not necessarily the same. The source and destination terminals should be located at any of the 2E + 1
entries of the first and the last VRTs, respectively, of each subproblem, but the SneakySnake algorithm
determines the exact location of the source and destination terminals for each subproblem based on its
individual optimal solution. This leads to underestimating the total number of obstacles found along each
signal net of each SNR subproblem. This is still acceptable as long as it solves the SNR problem quickly
and without overestimating the number of obstacles. We provide the details of our hardware architecture of
Snake-on-Chip in Supplementary Materials, Section 8.
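The effect of splitting the chip maze into independent subproblems can be illustrated in software. The Python sketch below is only a model of the idea, not of the Snake-on-Chip hardware: the names `count_obstacles` and `windowed_obstacles` are ours, each window of t columns is solved greedily on its own, and out-of-bounds entries of the shifted query are treated as obstacles.

```python
def count_obstacles(reference, query, E, start, end):
    """Greedy obstacle count restricted to maze columns [start, end)."""
    def run_length(shift, col):
        # Number of consecutive matches starting at `col` in the row `shift`.
        n = 0
        while col + n < end:
            q = col + n + shift
            if q < 0 or q >= len(query) or query[q] != reference[col + n]:
                break
            n += 1
        return n

    obstacles = 0
    checkpoint = start
    while checkpoint < end:
        checkpoint += max(run_length(s, checkpoint) for s in range(-E, E + 1))
        if checkpoint < end:              # the segment stopped at an obstacle
            obstacles += 1
            checkpoint += 1
    return obstacles


def windowed_obstacles(reference, query, E, t):
    """Sum of the per-window obstacle counts for windows of t columns."""
    m = len(reference)
    return sum(count_obstacles(reference, query, E, s, min(s + t, m))
               for s in range(0, m, t))
```

For `reference = "ACGTACGT"`, `query = "ACGTAACG"`, and `E = 1`, solving the whole maze gives `count_obstacles(reference, query, 1, 0, 8) == 2`, while `windowed_obstacles(reference, query, 1, 4) == 1`. The windowed count underestimates the number of obstacles, so this pair would be falsely accepted for E = 1, which illustrates the accuracy degradation described above.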
2.6 Snake-on-GPU Parallel Implementation
We now introduce our GPU implementation of the SneakySnake algorithm, called Snake-on-GPU. The main
idea of Snake-on-GPU is to exploit the large number (typically a few thousand) of GPU threads provided by
modern GPUs to solve a large number of SNR problems rapidly and concurrently. In Snake-on-Chip, we
explicitly divide the SNR problem into smaller non-overlapping subproblems and then solve all subproblems
concurrently and independently using our specialized hardware. In Snake-on-GPU, we follow a different
approach than that of Snake-on-Chip by keeping the same size of the original SNR problem and solving
a massive number of these SNR problems at the same time. Snake-on-GPU uses one single GPU thread
to solve one SNR problem (i.e., comparing one query sequence to one reference sequence at a time). This
granularity of computation fits well the amount of resources (e.g., registers) that are available to each GPU
thread and avoids the need for synchronizing several threads working on the same SNR problem. GPUs
offer more flexibility to the users to change the values of some input parameters of Snake-on-Chip without
the need to build a new design as in FPGAs. Given the large size of the sequence pair dataset that the
GPU threads need to access, we carefully design Snake-on-GPU to efficiently 1) copy the input dataset
of query and reference sequences into the GPU global memory, which is the off-chip DRAM memory of
GPUs (NVIDIA, 2019a) and typically fits a few GB of data, and 2) allow each thread to store its own
query and reference sequences using the on-chip register file to avoid unnecessary accesses to the off-chip
global memory. Each thread solves the complete SNR problem for a single query sequence and a single
reference sequence. We provide the details of Snake-on-GPU in Supplementary
Materials, Section 9.
3 Results
We now evaluate 1) the filtering accuracy, 2) the filtering time, and 3) the benefits of combining our three
new pre-alignment filters with state-of-the-art aligners. For each experiment, we compare the performance
of SneakySnake, Snake-on-Chip, and Snake-on-GPU to the existing state-of-the-art pre-alignment filters,
Shouji (Alser et al., 2019), MAGNET (Alser et al., 2017b), GateKeeper (Alser et al., 2017a), and SHD
(Xin et al., 2015). We run all experiments using a 3.3 GHz Intel E3-1225 CPU with 32 GB RAM. We use
a Xilinx Virtex 7 VC709 board (Xilinx, 2013) to implement Snake-on-Chip and other existing accelerator
architectures (for Shouji, MAGNET, and GateKeeper). We build the FPGA design using Vivado 2015.4 in
synthesizable Verilog. We use an NVIDIA GeForce RTX 2080Ti card (NVIDIA, 2019b) with a global memory
of 11 GB GDDR6 to implement Snake-on-GPU. Both Snake-on-Chip and Snake-on-GPU are independent of
the specific FPGA and GPU platforms as they do not rely on any vendor-specific computing elements (e.g.,
intellectual property cores).
3.1 Dataset Description
Our experimental evaluation uses 4 different real datasets. Each dataset contains 30 million real sequence
pairs (text and query pairs). We obtain two different read sets (ERR240727_1 and SRR826471_1)
of the whole human genome that include two different read lengths (100 bp and 250 bp). We download
these two read sets from EMBL-ENA (www.ebi.ac.uk/ena). We map each read set to the human
reference genome (GRCh37) using the mrFAST (Alkan et al., 2009) mapper. We obtain the human reference
genome from the 1000 Genomes Project (1000 Genomes Project Consortium and others, 2012),
ftp://ftp.ncbi.nlm.nih.gov/1000genomes/ftp/technical/reference/. For each read set, we use two different
maximum numbers of allowed edits (2 and 40 for m = 100 bp and 8 and 100 for m = 250 bp) using the
e parameter of mrFAST to generate four real datasets in total. Each dataset contains the sequence pairs
that are generated by mrFAST mapper before the read alignment step of mrFAST, such that we allow each
dataset to contain both similar (i.e., having edits fewer than or equal to the edit distance threshold) and
dissimilar (i.e., having more edits than the edit distance threshold) sequences over a wide range of edit
distance thresholds. We provide the details of these four datasets in the Supplementary Materials, Section
10.1. For the reader’s convenience, we refer to these datasets as Set 1, Set 2, Set 3, and Set 4.
3.2 Filtering Accuracy
We evaluate the accuracy of a pre-alignment filter by computing its rate of falsely-accepted and falsely-rejected
sequences before performing sequence alignment. The false accept rate is the ratio of the number of dissimilar
sequences that are falsely accepted by the filter and the number of dissimilar sequences that are rejected by
the sequence alignment algorithm. The false reject rate is the ratio of the number of similar sequences that
are rejected by the filter and the number of similar sequences that are accepted by the sequence alignment
algorithm. A reliable pre-alignment filter should always ensure both a 0% false reject rate and a minimal false
accept rate to maintain the correctness of the overall pipeline and maximize the number of dissimilar sequences
that are eliminated.
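The two rates defined above can be computed as in the following Python sketch. The function name `filter_accuracy` and the boolean-list inputs are assumptions for illustration; the ground truth (`aligner_accepts`) would come from an exact aligner, and the values in the usage example are made up.

```python
def filter_accuracy(filter_accepts, aligner_accepts):
    """Return (false_accept_rate, false_reject_rate) as fractions.

    filter_accepts[i]  -- True if the pre-alignment filter accepts pair i
    aligner_accepts[i] -- True if pair i is truly similar (edits <= E)
    """
    pairs = list(zip(filter_accepts, aligner_accepts))
    dissimilar = sum(1 for _, truth in pairs if not truth)
    similar = sum(1 for _, truth in pairs if truth)
    false_accepts = sum(1 for f, truth in pairs if f and not truth)
    false_rejects = sum(1 for f, truth in pairs if not f and truth)
    false_accept_rate = false_accepts / dissimilar if dissimilar else 0.0
    false_reject_rate = false_rejects / similar if similar else 0.0
    return false_accept_rate, false_reject_rate
```

For instance, with filter decisions `[True, True, True, False, False]` and ground truth `[True, True, False, False, False]`, one of the three dissimilar pairs is falsely accepted and no similar pair is rejected, giving a false accept rate of 1/3 and a false reject rate of 0.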
We first assess the false accept rate of SneakySnake, Shouji (Alser et al., 2019), MAGNET (Alser et al.,
2017b), GateKeeper (Alser et al., 2017a), and SHD (Xin et al., 2015) across our four real datasets and
edit distance thresholds of 0%− 10% of the sequence length. In Fig. 3, we provide the false accept rate of
each of the five filters (we provide the exact values in Supplementary Materials, Section 10.2).
We use Edlib (Šošić and Šikić, 2017) to identify the ground truth truly-accepted sequences for each edit
distance threshold. Based on Fig. 3, we make four key observations. (1) We observe that all the five pre-alignment
filters are less accurate in examining Set 1 and Set 3 than the other datasets, Set 2 and Set 4. (2)
GateKeeper (Alser et al., 2017a) and SHD (Xin et al., 2015) become ineffective for edit distance thresholds
of greater than 8% and 3% for sequence lengths of 100 and 250 characters, respectively, as they accept all
the input sequence pairs. This causes each sequence pair to be examined unnecessarily twice (i.e., once by
GateKeeper or SHD and once by the sequence alignment algorithm). (3) SneakySnake provides the lowest
false accept rate compared to all the four state-of-the-art pre-alignment filters. SneakySnake provides up to
31412×, 20603×, and 64.1× fewer falsely-accepted sequences compared to GateKeeper/SHD (using
Set 4, E= 10%), Shouji (using Set 4, E= 10%), and MAGNET (using Set 1, E= 1%), respectively. (4)
MAGNET provides the second lowest false accept rate. It provides up to 25552× and 16760× fewer
falsely-accepted sequences compared to GateKeeper/SHD (using Set 4, E= 10%) and Shouji (using Set 4,
E= 10%), respectively.
[Figure 3 plot: False Accept Rate (y-axis) versus E = 0% to 10% (x-axis) for SHD, GateKeeper, Shouji, MAGNET, and SneakySnake; panels: Set_1 (read length = 100bp), Set_2 (read length = 100bp), Set_3 (read length = 250bp), Set_4 (read length = 250bp).]
Figure 3: The false accept rate of SneakySnake, Shouji, MAGNET, SHD, and GateKeeper across
4 real datasets. We use a wide range of edit distance thresholds (0%−10% of the sequence length)
for sequence lengths of 100 and 250 bp.
Second, we assess the false reject rates of the five pre-alignment filters in Supplementary Materials, Section
10.2. We demonstrate that SneakySnake, Shouji, SHD, and GateKeeper all have a 0% false reject rate.
We also observe that MAGNET provides a very low false reject rate of less than 0.00045% for E > 3% of
the sequence length using Set 1 and Set 3.
Hence, we conclude that SneakySnake improves the accuracy of pre-alignment filtering by up to four
orders of magnitude compared to the state-of-the-art pre-alignment filters. We also conclude that SneakySnake
is the most effective pre-alignment filter, with a very low false accept rate and a 0% false reject rate
across a wide range of both edit distance thresholds and sequence lengths.
3.3 Effects of SneakySnake on Sequence Alignment
We analyze the benefits of integrating CPU-based pre-alignment filters, SneakySnake and SHD (Xin et al.,
2015), with the state-of-the-art CPU-based sequence aligners, Edlib (Šošić and Šikić, 2017) and Parasail
(Daily, 2016). We evaluate all tools using a single CPU core and single thread environment. Fig. 4 presents
the normalized end-to-end execution time of SneakySnake and SHD each combined with Edlib and Parasail,
using our four real datasets over edit distance thresholds of 0% − 10% of the sequence length. We make
four key observations. (1) SneakySnake is up to 42.96× (using Set 3, E= 0%) and 39.43× (using Set 4,
E= 5%) faster than Edlib and Parasail, respectively, in examining the sequence pairs. (2) The addition of
SneakySnake as a pre-alignment filtering step significantly reduces the execution time of Edlib (Šošić and
Šikić, 2017) and Parasail (Daily, 2016) by up to 37.6× (using Set 4, E= 0%) and 43.9× (using Set 4, E
= 2%), respectively. (3) The addition of SHD as a pre-alignment step reduces the execution time of Edlib
and Parasail for some of the edit distance thresholds by up to 17.2× (using Set 2, E = 0%) and 34.86×
(using Set 4, E= 3%), respectively.
However, for most of the edit distance thresholds, we observe that Edlib and Parasail are faster alone
than with SHD combined as a pre-alignment filtering step. This is expected as SHD becomes ineffective
in filtering for E > 8% and E > 3% for m= 100 bp and m= 250 bp, respectively, (as we show earlier in
Section 3.2). (4) SneakySnake provides up to 8.92× (using Set 4, E= 4%) and 40× (using Set 4, E= 5%)
more speedup to the end-to-end execution time of Edlib and Parasail compared to SHD. This is expected
as SHD produces a high false accept rate (as we show earlier in Section 3.2).
We conclude that SneakySnake is the best-performing CPU-based pre-alignment filter in terms of both
speed and accuracy. Integrating SneakySnake with sequence alignment algorithms is always beneficial and
reduces the end-to-end execution time by up to an order of magnitude without the need for hardware
accelerators. We also conclude that SneakySnake's performance scales very well over a wide range of
both edit distance thresholds and sequence lengths.
[Figure 4 plots: Normalized Runtime (y-axis) versus E = 0% to 10% (x-axis) for Set_1 (m=100 bp), Set_2 (m=100 bp), Set_3 (m=250 bp), and Set_4 (m=250 bp); upper plot legend: SneakySnake, Edlib, SHD; lower plot legend: SneakySnake, Parasail, SHD.]
Figure 4: Normalized end-to-end execution time of SneakySnake and SHD combined with Edlib
(upper plot) and Parasail (lower plot). We use four datasets over a wide range of edit distance
thresholds (E= 0%-10% of the sequence length) for sequence lengths (m) of 100 bp and 250 bp.
We present two speedup rates for each edit distance threshold of 0%, 5%, and 10% of the sequence
length. The upper speedup rate represents the end-to-end speedup that is gained from combining
the pre-alignment step with the alignment step. It is calculated as A/(B + C), where A is the
execution time of the sequence aligner before adding SneakySnake, B is the execution time of
SneakySnake, and C is the execution time of the sequence aligner after adding SneakySnake. The
lower speedup rate is calculated as A/B.
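The two speedup rates defined in the caption of Fig. 4 amount to the following arithmetic, shown here as a small Python sketch. The function and parameter names are ours, and the numbers in the usage example are hypothetical, not measurements from the paper.

```python
def speedups(aligner_alone, filter_time, aligner_after_filter):
    """Return (end_to_end_speedup, filter_vs_aligner_speedup).

    aligner_alone        -- A: aligner time on all pairs, without a filter
    filter_time          -- B: time taken by the pre-alignment filter
    aligner_after_filter -- C: aligner time on the pairs the filter accepts
    """
    end_to_end = aligner_alone / (filter_time + aligner_after_filter)  # A/(B + C)
    filter_only = aligner_alone / filter_time                          # A/B
    return end_to_end, filter_only
```

With hypothetical times A = 100 s, B = 4 s, and C = 6 s, the end-to-end speedup is 10× and the filter alone is 25× faster than the aligner. The filter pays off only when B + C < A, which explains why a filter that accepts nearly every pair (C close to A) slows the pipeline down.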
3.4 Effects of Snake-on-Chip and Snake-on-GPU on Sequence Alignment
We analyze the benefits of integrating Snake-on-Chip and Snake-on-GPU with the state-of-the-art sequence
aligners, designed for different computing platforms, in Fig. 5. We design the hardware architecture of
Snake-on-Chip for a sub-maze width of 8 VRTs (t= 8) and 3 replications (y= 3) for each sub-maze. We
select this design choice as it allows for low FPGA resource utilization while maintaining a low false accept
rate. We analyze the effect of choosing different y and t values on the false accept rate of Snake-on-Chip
in Supplementary Materials, Section 10.5. In this analysis, we compare the effect of combining Snake-on-Chip
and Snake-on-GPU with existing sequence aligners with that of two state-of-the-art FPGA-based
pre-alignment filters, Shouji (Alser et al., 2019) and GateKeeper (Alser et al., 2017a). We also select four
state-of-the-art sequence aligners that are implemented for CPU (Edlib (Šošić and Šikić, 2017) and Parasail
(Daily, 2016)), GPU (GSWABE (Liu and Schmidt, 2015)), and FPGA (FPGASW (Fei et al., 2018)). We
use Set 1 and Set 2 in this analysis. GSWABE and FPGASW are not open-source and not available to
us. Therefore, we scale the reported number of computed entries of the dynamic programming matrix in a
second (i.e., GCUPS) as follows: 60000000/(GCUPS/100^2).
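The scaling above can be read as an estimate of the time an aligner with a reported throughput needs to process all sequence pairs. The Python sketch below spells out that reading under our own assumptions: the throughput is expressed in dynamic programming cell updates per second, each pair of length m fills an m × m matrix (100² cells for m = 100 bp), and 60,000,000 is taken to be the combined number of pairs in Set 1 and Set 2. The function name and the throughput in the usage example are hypothetical.

```python
def estimated_alignment_time(pairs, read_length, cell_updates_per_second):
    """Estimated seconds needed to align `pairs` sequence pairs.

    Assumes every pair fills a full read_length x read_length matrix.
    """
    cells_per_pair = read_length ** 2
    pairs_per_second = cell_updates_per_second / cells_per_pair
    return pairs / pairs_per_second
```

For a hypothetical aligner that sustains 1e11 cell updates per second, `estimated_alignment_time(60_000_000, 100, 1e11)` evaluates to 6.0 seconds.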
Based on Fig. 5, we make three key observations. (1) The execution time of Edlib and Parasail reduces
by up to 321× (using Set 2 and E = 5%) and 536× (using Set 2 and E = 5%), respectively, after the
addition of Snake-on-Chip as a pre-alignment filtering step and by up to 368.3× (using Set 2 and E = 5%)
and 689× (using Set 2 and E = 5%), respectively, after the addition of Snake-on-GPU as a pre-alignment
filtering step. That is 40× to 51× more speedup compared to that provided by adding SneakySnake as a
pre-alignment filter, using Set 2 and E = 5%. It is also up to 2× more speedup compared to that provided
by adding Shouji and GateKeeper as a pre-alignment filter, using Set 1 and E = 5% for Snake-on-Chip and
using Set 2 and E = 5% for Snake-on-GPU. (2) FPGA- and GPU-based sequence aligners follow a similar
trend to the one we observe in the CPU implementations. However, the speedup ratios are reduced compared to
those observed for the CPU-based aligners. This is due to the low execution time of these hardware-accelerated
aligners. Snake-on-GPU provides up to 27.7× (using Set 2 and E = 5%) and 5.1× (using Set 2 and E =
5%) reduction in the end-to-end execution time of GSWABE and FPGASW, respectively. This is up to
1.3× more speedup compared to that provided by Snake-on-Chip, using Set 2. That is also up to 1.7× more
speedup compared to that provided by adding Shouji and GateKeeper as a pre-alignment filter.
We conclude that both Snake-on-Chip and Snake-on-GPU provide the highest speedup ratio (up to two
orders of magnitude) compared to the state-of-the-art CPU, FPGA, and GPU based sequence aligners over
edit distance thresholds of 0%-5% of the sequence length.
[Figure 5 plots, panels (a) and (b): Normalized Runtime (y-axis) for Edlib, Parasail, GSWABE, and FPGASW at E = 0%, 3%, and 5% (x-axis); legend: w/ Snake-on-Chip, w/ Snake-on-GPU, w/ Shouji, w/ GateKeeper.]
Figure 5: Normalized end-to-end execution time of a pre-alignment filter (Snake-on-Chip, Snake-
on-GPU, Shouji, and GateKeeper) combined with a sequence aligner (Edlib, Parasail, GSWABE,
and FPGASW). We use two datasets, Set 1 (upper plot) and Set 2 (lower plot), over a wide range
of edit distance thresholds (0%-5% of the sequence length, 100 bp). We present two speedup
rates for edit distance thresholds of 0% and 5%. The upper speedup rate is the speedup gained
from integrating Snake-on-GPU with the corresponding sequence aligner. The lower speedup rate
represents the speedup gained from integrating Snake-on-Chip with the corresponding sequence
aligner.
4 Discussion and Future Work
In this work, we introduce the single net routing problem and we show how to convert an approximate
string matching problem into an instance of the single net routing problem. Subsequently, we propose a new
algorithm that solves the single net routing problem and acts as a new pre-alignment filtering algorithm,
which we call SneakySnake. We demonstrate that the concept of pre-alignment filtering provides substantial
benefits to the existing and future sequence alignment algorithms. Many of the existing acceleration
efforts either simplify the scoring function, or only take into account accelerating the computation of the
dynamic programming matrix without supporting the backtracking step. SneakySnake offers the ability to
make the best use of existing aligners without sacrificing any of their capabilities (e.g., configurable scoring
and backtracking), as it does not modify or replace the alignment step. Our algorithm does not exploit any
SIMD-enabled CPU instructions or vendor-specific processor. This makes it attractive and cost-effective
in resource-limited environments. SneakySnake improves the accuracy of pre-alignment filtering by up
to four orders of magnitude compared to the state-of-the-art pre-alignment filters, Shouji, GateKeeper, and
SHD. The addition of SneakySnake as a pre-alignment filtering step significantly reduces the execution time
of state-of-the-art CPU-based sequence aligners by up to an order of magnitude. We also explore the use of
hardware/software co-design and hardware accelerations to further accelerate our SneakySnake algorithm.
We introduce Snake-on-Chip and Snake-on-GPU, efficient and scalable FPGA and GPU based pre-alignment
filters, respectively. Snake-on-Chip and Snake-on-GPU achieve up to two orders of magnitude speedup over
the state-of-the-art sequence aligners.
One direction to further improve the performance of Snake-on-Chip and Snake-on-GPU is to explore
the possibility of performing the SneakySnake calculations where the huge amount of genomic data resides.
Conventional computing requires the movement of genomic sequence pairs from the memory to the CPU
processing cores (or to the FPGA chip), using slow and energy-hungry buses, such that the cores can apply
sequence alignment algorithms on the sequence pairs. Performing SneakySnake inside modern memory devices
can alleviate this high communication cost by enabling simple arithmetic/logic operations very close to where
the data resides, with high bandwidth and low latency (Kim et al., 2018). However, this requires re-designing
the hardware architecture of Snake-on-Chip to leverage the supported operations in such modern memory
devices (Kim et al., 2017).
A second potential target of our research is to explore the possibility of accelerating sequence alignment
algorithms for longer sequences (a few tens of thousands of characters) using our pre-alignment filters. Longer
sequences pose two challenges. First, we need to transfer more data to the FPGA chip to be able to process
a single pair of sequences, which is mainly limited by the data transfer rate of the communication link (i.e.,
PCIe) (Alser et al., 2018). Second, the typical edit distance threshold used for sequence alignment is 5% of
the sequence length. For considerably long sequences, the edit distance threshold is around a few hundred
characters (Senol Cali et al., 2018; Firtina et al., 2019; Alser, 2019). A large edit distance threshold leads
to calculating a large number of horizontal routing tracks (i.e., 2E+1 tracks, where E is the edit distance
threshold). This makes random zeros (matches that result from comparing each character of a given sequence
to the corresponding neighboring characters of the other given sequence) occur more frequently in the
horizontal routing tracks, as we show in (Alser et al., 2017b). This would negatively affect the performance and
accuracy of SneakySnake. We will investigate this effect and explore new pre-alignment filtering approaches
for the sequencing data produced by third-generation sequencing machines.
A third research direction is to enable the use of cloud computing for performing pre-alignment filtering
at scale. Cloud computing offers access to a large number of advanced FPGA chips that can be used
concurrently via a simple user-friendly interface. However, such a scenario requires the development of
privacy-preserving pre-alignment filters due to privacy and legal concerns (Alser et al., 2015). Our next
efforts will focus on exploring privacy-preserving cloud-based pre-alignment filtering.
Funding
This work is supported by gifts from Intel [to O.M.]; VMware [to O.M.]; and an EMBO Installation Grant
[IG-2521 to C.A.].
References
1000 Genomes Project Consortium and others. An integrated map of genetic variation from 1,092 human
genomes. Nature, 491(7422):56, 2012.
C. Alkan, J. M. Kidd, T. Marques-Bonet, G. Aksay, F. Antonacci, F. Hormozdiari, J. O. Kitzman, C. Baker,
M. Malig, O. Mutlu, et al. Personalized copy number and segmental duplication maps using next-
generation sequencing. Nature genetics, 41(10):1061, 2009.
M. Alser, N. Almadhoun, A. Nouri, C. Alkan, and E. Ayday. Can you Really Anonymize the Donors of
Genomic Data in Today’s Digital World? In Data Privacy Management, and Security Assurance, pages
237–244. Springer, 2015.
M. Alser, H. Hassan, H. Xin, O. Ergin, O. Mutlu, and C. Alkan. GateKeeper: a new hardware architecture
for accelerating pre-alignment in DNA short read mapping. Bioinformatics, 33(21):3355–3363, 2017a.
M. Alser, O. Mutlu, and C. Alkan. MAGNET: Understanding and improving the accuracy of genome
pre-alignment filtering. Transactions on Internet Research, 13(2):33–42, 2017b.
M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan. SLIDER: Fast and Efficient Computation of
Banded Sequence Alignment. arXiv preprint arXiv:1809.07858, 2018.
M. Alser, H. Hassan, A. Kumar, O. Mutlu, and C. Alkan. Shouji: A Fast and Efficient Pre-Alignment Filter
for Sequence Alignment. Bioinformatics, 2019.
M. H. Alser. Accelerating the Understanding of Life’s Code Through Better Algorithms and Hardware
Design. arXiv preprint arXiv:1910.03936, 2019.
P. Chen, C. Wang, X. Li, and X. Zhou. Accelerating the next generation long read mapping with the
FPGA-based system. IEEE/ACM transactions on computational biology and bioinformatics, 11(5):840–
852, 2014.
J. Daily. Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments. BMC
bioinformatics, 17(1):81, 2016.
S. R. Eddy. What is dynamic programming? Nature biotechnology, 22(7):909, 2004.
X. Fei, Z. Dan, L. Lina, M. Xin, and Z. Chunlei. FPGASW: Accelerating Large-Scale Smith–Waterman
Sequence Alignment Application with Backtracking on FPGA Linear Systolic Array. Interdisciplinary
Sciences: Computational Life Sciences, 10(1):176–188, 2018.
C. Firtina, J. S. Kim, M. Alser, D. S. Cali, A. E. Cicek, C. Alkan, and O. Mutlu. Apollo: A
Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm. arXiv
preprint arXiv:1902.04341, 2019.
E. J. Fox, K. S. Reid-Bayliss, M. J. Emond, and L. A. Loeb. Accuracy of next generation sequencing
platforms. Next generation, sequencing & applications, 1, 2014.
L. A. Hindorff, V. L. Bonham, L. C. Brody, M. E. Ginoza, C. M. Hutter, T. A. Manolio, and E. D. Green.
Prioritizing diversity in human genomics research. Nature Reviews Genetics, 19(3):175, 2018.
W. Huangfu, S. Li, X. Hu, and Y. Xie. RADAR: a 3D-reRAM based DNA alignment accelerator architecture.
In 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), pages 1–6. IEEE, 2018.
J. S. Kim, D. Senol, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin, C. Alkan, and O. Mutlu.
GRIM-Filter: fast seed filtering in read mapping using emerging memory technologies. arXiv preprint
arXiv:1708.04329, 2017.
J. S. Kim, D. S. Cali, H. Xin, D. Lee, S. Ghose, M. Alser, H. Hassan, O. Ergin, C. Alkan, and O. Mutlu.
GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies.
BMC Genomics, 19(2):89, 2018.
J. Lee, N. Bose, and F. Hwang. Use of Steiner’s problem in suboptimal routing in rectilinear metric. IEEE
Transactions on Circuits and Systems, 23(7):470–476, 1976.
V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics-
Doklady, volume 10, pages 707–710, 1966.
Y. Liu and B. Schmidt. GSWABE: faster GPU-accelerated sequence alignment with optimal alignment
retrieval for short DNA sequences. Concurrency and Computation: Practice and Experience, 27(4):958–
972, 2015.
Y. Liu, A. Wirawan, and B. Schmidt. CUDASW++ 3.0: accelerating Smith-Waterman protein database
search by coupling CPU and GPU SIMD instructions. BMC bioinformatics, 14(1):117, 2013.
K. J. McKernan, H. E. Peckham, G. L. Costa, S. F. McLaughlin, Y. Fu, E. F. Tsung, C. R. Clouser,
C. Duncan, J. K. Ichikawa, C. C. Lee, et al. Sequence and structural variation in a human genome
uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome research,
19(9):1527–1541, 2009.
G. Navarro. A guided tour to approximate string matching. ACM computing surveys (CSUR), 33(1):31–88,
2001.
S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino
acid sequence of two proteins. Journal of molecular biology, 48(3):443–453, 1970.
T. Nishimura, J. L. Bordim, Y. Ito, and K. Nakano. Accelerating the Smith-Waterman Algorithm Using Bit-
wise Parallel Bulk Computation Technique on GPU. In 2017 IEEE International Parallel and Distributed
Processing Symposium Workshops (IPDPSW), pages 932–941. IEEE, 2017.
NVIDIA. CUDA C Programming Guide. 2019a. URL https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
NVIDIA. NVIDIA GeForce RTX 2080 Ti User Guide. 2019b.
D. Senol Cali, J. S. Kim, S. Ghose, C. Alkan, and O. Mutlu. Nanopore sequencing technology and tools for
genome assembly: computational analysis of the current state, bottlenecks and future directions. Briefings
in bioinformatics, 20(4):1542–1559, 2018.
V. Seshadri, D. Lee, T. Mullins, H. Hassan, A. Boroumand, J. Kim, M. A. Kozuch, O. Mutlu, P. B. Gibbons,
and T. C. Mowry. Ambit: In-memory accelerator for bulk bitwise operations using commodity DRAM
technology. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture,
pages 273–287. ACM, 2017.
T. Smith and M. Waterman. Identification of common molecular subsequences. Journal of molecular biology,
147:195–197, 1981.
M. Šošić and M. Šikić. Edlib: a C/C++ library for fast, exact sequence alignment using edit distance.
Bioinformatics, 33(9):1394–1395, 2017.
C. Wang, R.-X. Yan, X.-F. Wang, J.-N. Si, and Z. Zhang. Comparison of linear gap penalties and profile-
based variable gap penalties in profile–profile alignments. Computational biology and chemistry, 35(5):
308–318, 2011.
Xilinx. Virtex-7 XT VC709 Connectivity Kit. 2013.
H. Xin, J. Greth, J. Emmons, G. Pekhimenko, C. Kingsford, C. Alkan, and O. Mutlu. Shifted Hamming
distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping.
Bioinformatics, 31(10):1553–1560, 2015.
Supplementary Materials 
5. Related Works 
Recent works tend to follow one of two key directions to boost the performance of sequence alignment 
implementations: (1) Accelerating the dynamic programming algorithms using hardware accelerators and 
(2) Developing filtering heuristics that reduce the need for the dynamic programming algorithms, given an 
edit distance threshold. 
 
Hardware accelerators are becoming increasingly popular for speeding up the computationally-
expensive sequence alignment algorithms (Al Kawam et al., 2017; Aluru and Jammula, 2014; Ng et 
al., 2017; Sandes et al., 2016). Hardware accelerators include four main directions. 1) Using multi-core 
and SIMD (single instruction multiple data) capable central processing units (CPUs), such as Edlib (Šošić 
and Šikić, 2017) and Parasail (Daily, 2016). 2) Using graphics processing units (GPUs), such as GSWABE 
(Liu and Schmidt, 2015) and CUDASW++ 3.0 (Liu et al., 2013). 3) Using field-programmable gate arrays 
(FPGAs), such as FPGASW (Fei et al., 2018). 4) Using processing-in-memory architectures, such as 
RADAR (Huangfu et al., 2018). The classical dynamic programming algorithms are typically accelerated 
by computing only the necessary regions (i.e., diagonal vectors) of the dynamic programming matrix rather 
than the entire matrix, as proposed in Ukkonen’s banded algorithm (Ukkonen, 1985). The number of the 
diagonal bands required for computing the dynamic programming matrix is 2E+1, where E is a user-defined 
edit distance threshold. The banded algorithm remains beneficial even in recent sequential
implementations such as Edlib (Šošić and Šikić, 2017). Edlib is implemented in C for standard
CPUs and calculates the banded Levenshtein distance. Parasail (Daily, 2016) exploits both Ukkonen’s
banded algorithm and SIMD-capable CPUs to compute a banded alignment for a sequence pair with a user-
defined scoring function. SIMD instructions offer significant parallelism to the matrix computation by 
executing the same vector operation on multiple operands at once. The multi-core architecture of CPUs and
GPUs provides the ability to compute alignments of many sequence pairs independently and concurrently 
(Georganas et al., 2015; Liu and Schmidt, 2015). GSWABE (Liu and Schmidt, 2015) exploits GPUs (Tesla 
K40) for highly-parallel computation of global alignment with a user-defined scoring function. 
CUDASW++ 3.0 (Liu et al., 2013) exploits the SIMD capability of both CPUs and GPUs (GTX690) to 
accelerate the computation of the Smith-Waterman algorithm with a user-defined scoring function. 
CUDASW++ 3.0 provides only the optimal score, not the optimal alignment (i.e., no backtracking step). 
Other designs, for instance FPGASW (Fei et al., 2018), exploit the very large number of hardware execution 
units in FPGAs (Xilinx VC707) to form a linear systolic array (Kung, 1982). Each execution unit in the 
systolic array is responsible for computing the value of a single entry of the dynamic programming matrix. 
The systolic array computes a single vector of the matrix at a time. The data dependencies between the 
entries restrict the systolic array to computing the vectors sequentially (e.g., top-to-bottom, left-to-right, or 
in an anti-diagonal manner). FPGA accelerators seem to yield the highest performance gain compared to 
the other hardware accelerators (Banerjee et al., 2018; Chen et al., 2016; Fei et al., 2018; Waidyasooriya 
and Hariyama, 2015). Recently, a few processing-in-memory architectures, such as RADAR (Huangfu et al., 2018),
have been proposed to exploit the ability to perform computations inside the memory chip. The
main benefit of such architectures is their high energy efficiency, as they alleviate the need to transfer
the data back and forth between the main memory and the CPU cores through slow and energy-hungry
buses (Mutlu et al., 2019).
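
To make the banded computation concrete, the following minimal Python sketch fills only the 2E+1 diagonals around the main diagonal of the dynamic programming matrix. It is an illustration under our own assumptions (unit edit costs, a hypothetical function name banded_edit_distance, and a return value of -1 when the distance exceeds E); it is not the implementation used by Edlib or Parasail.

```python
def banded_edit_distance(pattern, text, E):
    """Return the edit distance if it is <= E, otherwise -1.

    Only cells within E diagonals of the main diagonal are computed,
    i.e., at most 2E+1 cells per row of the matrix.
    """
    m, n = len(pattern), len(text)
    if abs(m - n) > E:
        return -1  # the end cell lies outside the band
    INF = E + 1  # any value above the threshold means "too far"
    prev = [j if j <= E else INF for j in range(n + 1)]
    for i in range(1, m + 1):
        curr = [INF] * (n + 1)
        lo, hi = max(0, i - E), min(n, i + E)
        if lo == 0:
            curr[0] = i  # deleting the first i pattern characters
        for j in range(max(1, lo), hi + 1):
            cost = 0 if pattern[i - 1] == text[j - 1] else 1
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          INF)                 # cap values above E
        prev = curr
    return prev[n] if prev[n] <= E else -1
```

Capping every cell at E+1 keeps the values outside the band from influencing the result, which is why the distance is exact whenever it does not exceed E.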
However, many of these efforts either simplify the scoring function, or only take into account accelerating 
the computation of the dynamic programming matrix without providing the optimal alignment as in (Chen 
et al., 2014; Liu et al., 2013; Nishimura et al., 2017). Different and more sophisticated scoring functions 
are typically needed to better quantify the similarity between two sequences (Henikoff and Henikoff, 1992; 
Wang et al., 2011). The backtracking step required for the optimal alignment computation involves 
unpredictable and irregular memory access patterns, which poses a difficult challenge for efficient hardware 
implementation. 
 
Pre-alignment filtering heuristics aim to quickly eliminate some of the dissimilar sequences before 
using the computationally-expensive optimal alignment algorithms. There are a few existing filtering 
techniques such as the Adjacency Filter (Xin et al., 2013), which is implemented for standard CPUs as part 
of FastHASH (Xin et al., 2013). SHD (Xin et al., 2015) is a SIMD-friendly bit-vector filter that provides 
higher filtering accuracy compared to the Adjacency Filter. To our knowledge, SHD is currently the best 
CPU-based pre-alignment filter, but it suffers from limited sequence length (up to only 128 characters) due 
to the SIMD register size used for its implementation. GRIM-Filter (Kim et al., 2018) exploits the high 
memory bandwidth and the logic layer of 3D-stacked memory to perform highly-parallel filtering in the 
DRAM chip itself. GateKeeper (Alser et al., 2017a) is the first pre-alignment filter designed to utilize the 
large amounts of parallelism offered by FPGA architectures. GateKeeper provides a
high filtering speed but suffers from a relatively high number of falsely-accepted sequence pairs. MAGNET
(Alser et al., 2017b) significantly reduces the number of falsely-accepted sequence pairs of GateKeeper but
introduces a very low number of falsely-rejected sequence pairs. Recently, Shouji (Alser et al., 2019)
introduced, to our knowledge, the most accurate and the fastest pre-alignment filter using a new algorithm
and a new FPGA architecture.
 
In this work, we introduce the first GPU-based pre-alignment filter, called Snake-on-GPU. We also provide 
the most accurate and the fastest (as our experimental evaluation demonstrates) CPU and FPGA based pre-
alignment filters, SneakySnake and Snake-on-Chip, respectively. 
6. Proofs of the Correctness and Optimality of the SneakySnake Algorithm 
As the propagation delay of a signal net is mainly affected by the number of obstacles that are considered 
in the horizontal escape segments of the selected path, for simplicity, we do not consider the vertical 
segments in our proof.  
6.1. Correctness proof 
PROOF. We prove Theorem 1 by contradiction. Let A = {s1, s2, …, sn} be the signal net that connects the 
source terminal to the destination terminal using n escape segments that are part of the horizontal routing 
tracks within a routing region. The escape segments are sorted by their start position (i.e., s1 starts before s2 
and ends at s2). Assume that SneakySnake algorithm is not able to find this signal net A that reaches the 
destination terminal. This means that SneakySnake algorithm finds an escape segment, sk, but it fails to find 
the next escape segment, sk+1. Since there is a signal net that connects s1 to sn, there exists an escape segment 
that starts before sk+1 and ends at sk+1. This escape segment is not reachable from sk (as we assume that 
SneakySnake algorithm terminates the solution after finding sk), so it should be reachable from another 
escape segment, st, where t < k. This indicates that sk+1 is not reachable from sk and sk is not reachable from 
st. This contradicts the assumption that sk+1 is reachable and it is part of the solution. Thus, our assumption 
that SneakySnake algorithm is not able to find a signal net is wrong. ◼ 
6.2. Optimality proof 
PROOF. We prove Theorem 2 by induction. Suppose you have a set of n candidate horizontal segments {1, 
2, …, n} that are part of the horizontal routing tracks within a routing region. Each horizontal segment has 
a desired pair of start and end positions (s(i), f(i)). SneakySnake algorithm determines a signal net with the 
minimum total propagation delay by repeatedly selecting from the available horizontal segments the one 
that starts at the current location and has the farthest end location, and removing all overlapping horizontal 
segments from the set. 
Let A = {x1, x2, …, xk} be the solution (set of escape segments) to S provided by SneakySnake algorithm. 
The escape segments are sorted by their start position (i.e., x1 starts before x2 and ends at x2). Let B = {y1, 
y2, …, ym} be the optimal solution for the same problem. Let k = |A| and m = |B| denote the number of escape 
segments in A and B, respectively. 
The proof is by induction on the number of escape segments. We will compare A and B by their segments’ 
end positions. We will show that for all r ≤ k, f(xr) ≥ f(yr). 
As the base case, we take k = m = 1. Since both SneakySnake and the optimal algorithm select the longest escape
segment that starts at the beginning of a horizontal routing track, it certainly must be the case that f(x1) ≥
f(y1).
For r > 1, assume the statement is true for r − 1 and we will prove it for r. The induction hypothesis states
that f(xr-1) ≥ f(yr-1), and so any horizontal segment that does not overlap with the first r − 1 escape segments
in the optimal solution certainly does not overlap with the first r − 1 escape segments of the SneakySnake
algorithm. Therefore, we can add yr to SneakySnake solution, and since SneakySnake algorithm always 
considers the longest escape segments, it must be the case that f(xr) ≥ f(yr). So we have that for all r ≤ k, 
f(xr) ≥ f(yr). In particular, f(xk) ≥ f(yk).  
If A is not optimal, then it must be the case that m < k, and so there is an escape segment xm+1 in A that is 
not in B. This escape segment must start after A’s mth escape segment ends, and hence after f(ym). But then 
this segment is not overlapping with all the escape segments in B, and so it should be part of the solution 
in B. This contradicts the assumption that m<k, and thus A has as many elements as B. So SneakySnake 
algorithm always produces an optimal solution. ◼	
7. Run Time and Space Complexity Analysis of the SneakySnake Algorithm 
We now analyze the asymptotic run time and space complexity of the SneakySnake algorithm. We provide 
the pseudocode of SneakySnake in Algorithm 1. The SneakySnake algorithm builds the chip maze on-the-
fly by constructing partially each horizontal routing track starting from each new checkpoint until it reaches 
an obstacle in each horizontal routing track. The SneakySnake algorithm does not necessarily construct the 
entire chip maze. At each new checkpoint, the SneakySnake algorithm checks that the signal net has neither
reached the destination terminal nor exceeded the allowed propagation delay before it iterates (as we explain in
Algorithm 1, line 4). It then uses the function UpperHRT() (Algorithm 2) to construct the first escape 
segment, after the current checkpoint, of each of the upper HRTs (as we explain in Algorithm 1, line 6). 
After constructing the escape segments, it computes their length and returns the length of the longest escape 
segment. Note that during the first iteration of the SneakySnake algorithm, the function UpperHRT() returns 
a value of 1, which is the length of a single obstacle. This is because all upper HRTs start with an obstacle. 
The SneakySnake algorithm performs the same steps as in the function UpperHRT() for the main HRT 
(Algorithm 1, line 7) and the lower HRTs (Algorithm 1, line 12), by calling the two functions: MainHRT() 
(Algorithm 3) and LowerHRT() (Algorithm 4). Finally, we update the position of the checkpoint and the 
current propagation delay of the found signal net (Algorithm 1, lines 15-18). Once the signal net
exceeds the allowed propagation delay, the SneakySnake algorithm terminates (as we show in Algorithm
1, line 4 and lines 19-20). Otherwise, the SneakySnake algorithm allows computationally expensive edit
distance or pairwise alignment algorithms to compute their output based on the user-defined parameters (as
we show in Algorithm 1, lines 21-23).
 
On the one hand, the lower-bound on the time complexity of the SneakySnake algorithm is O(m), which is 
achieved when the SneakySnake algorithm reaches the destination terminal of the maze without facing any 
obstacle along the signal net. For example, when a pattern sequence matches exactly a text sequence, the 
SneakySnake algorithm traverses only through the (E+1)th HRT (i.e., the main HRT) and then allows the edit
distance or alignment algorithm to perform its computation.  
 
On the other hand, the upper-bound on the run time complexity of the SneakySnake algorithm is a result of 
constructing the entire chip maze. As we have 2E+1 horizontal routing tracks, each of which is m characters 
long, the upper-bound run time complexity is O((2E+1)m). However, it is unrealistic to construct the entire 
chip maze, as in this case, all the horizontal routing tracks should be identical in terms of the number and 
the location of all obstacles. Consider a pair of random pattern and text sequences, where each character is 
generated completely randomly (having 1/4 probability of being either A, C, G, or T). The probability that 
a character of the pattern sequence does not match any neighboring character of the text sequence during 
constructing any of the 2E+1 horizontal routing tracks is $(3/4)^{2E+1}$, which decreases exponentially as E
increases. Therefore, this upper-bound on the run time complexity is loose. We illustrate in Fig. 6 a typical 
scenario where the SneakySnake algorithm constructs only small fragments of the horizontal routing tracks.  
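
As a small numerical illustration of why the upper bound is loose, the sketch below evaluates the probability stated above for a few thresholds, together with the worst-case number of maze entries, (2E+1)m. The function names are hypothetical and the uniform random-sequence model is the one assumed in the text.

```python
def prob_no_match_in_any_track(E):
    """Probability that a pattern character mismatches in all 2E+1 tracks,
    assuming independent, uniformly random characters over {A, C, G, T}."""
    return (3 / 4) ** (2 * E + 1)


def max_maze_entries(E, m):
    """Worst-case number of entries constructed: 2E+1 tracks of m characters."""
    return (2 * E + 1) * m


if __name__ == "__main__":
    for E in (0, 1, 3, 5, 10):
        print(E, prob_no_match_in_any_track(E), max_maze_entries(E, 100))
```

The probability shrinks by a factor of 9/16 each time E grows by one, so under this model a column that blocks every track becomes increasingly rare as the threshold increases.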
 
 
Fig. 6: An example of the chip maze after applying the SneakySnake algorithm for a reference
sequence R = “GGTGCAGAGCTC”, a query sequence Q = “GGTGAGAGTTGT”, a sequence
length m=12, and an edit distance threshold E=3. The SneakySnake algorithm constructs only the
needed portion of each horizontal routing track. The white areas are the fragments of each
horizontal routing track that are not constructed.
 
[Fig. 6 layout: columns 1 to 12; rows, from top to bottom: 3rd, 2nd, and 1st Upper HRT, Main HRT, 1st, 2nd, and 3rd Lower HRT; checkpoints 1, 2, and 3 are marked along the column axis.]
 
Algorithm 1: SneakySnake 
Input: pattern (P), text (T), and edit distance threshold (E) 
Output: -1 for dissimilar sequences / EditDistance() or Alignment() 
Functions: UpperHRT(), MainHRT(), and LowerHRT() construct the first escape
segment of each of the E upper, the main, and the E lower horizontal routing tracks,
respectively, and return the length of the longest escape segment
Pseudocode:
1:  checkpoint = 0
2:  PropagationDelay = 0
3:  m = length(P)
4:  while checkpoint < m and PropagationDelay <= E do
5:        count = 0
6:        longest_es = UpperHRT(P[checkpoint:m], T[checkpoint:m], E)
7:        count = MainHRT(P[checkpoint:m], T[checkpoint:m])
8:        if count > longest_es then
9:              longest_es = count
10:       if longest_es == m then
11:             return EditDistance() or Alignment()
12:       count = LowerHRT(P[checkpoint:m], T[checkpoint:m], E)
13:       if count > longest_es then
14:             longest_es = count
15:       checkpoint = checkpoint + longest_es
16:       if checkpoint < m then
17:             PropagationDelay++
18:             checkpoint++
19: if PropagationDelay > E then
20:       return -1
21: else
22:       //depends on user’s requirement
23:       return EditDistance() or Alignment()
 
Algorithm 2: UpperHRT 
Input: pattern (P[checkpoint:m]), text (T[checkpoint:m]), and edit distance 
threshold (E) 
Output: length of the longest escape segment of the upper horizontal routing 
tracks 
Pseudocode: 
1:  longest_es = 0
2:  for r = E to 1 do
3:        count = 0
4:        for n = 0 to length(P)-1 do
5:                if n < r then
6:                          goto EXIT
7:                else if P[n-r] != T[n] then
8:                          goto EXIT
9:                else if P[n-r] == T[n] then
10:                         count++
11: EXIT:
12:       if count > longest_es
13:               longest_es = count
14: return longest_es
 
Algorithm 3: MainHRT 
Input: pattern (P[checkpoint:m]) and text (T[checkpoint:m]) 
Output: length of the longest escape segment of the main horizontal routing 
track 
Pseudocode: 
1:  count = 0
2:  for n = 0 to length(P)-1 do
3:        if P[n] != T[n] then
4:              return count
5:        else if P[n] == T[n] then
6:              count = count + 1
7:  return count
 
 
Algorithm 4: LowerHRT 
Input: pattern (P[checkpoint:m]), text (T[checkpoint:m]), and edit distance 
threshold (E) 
Output: length of the longest escape segment of the lower horizontal routing 
tracks 
Pseudocode: 
1:  longest_es = 0
2:  for r = 1 to E do
3:        count = 0
4:        for n = 0 to length(P)-1 do
5:                if n > m-r-1 then
6:                          goto EXIT
7:                else if P[n+r] != T[n] then
8:                          goto EXIT
9:                else if P[n+r] == T[n] then
10:                         count++
11: EXIT:
12:       if count > longest_es
13:               longest_es = count
14: return longest_es
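
The four listings above can be condensed into the following runnable Python sketch. It is one possible reading of Algorithms 1 to 4 rather than the released implementation, and it makes several assumptions that we state explicitly: both sequences have the same length m, indices are absolute positions in the full sequences (so an upper track may look at pattern characters located before the checkpoint), a single helper with a signed shift replaces UpperHRT(), MainHRT(), and LowerHRT(), and the function returns a Boolean accept/reject decision instead of invoking EditDistance() or Alignment(). The names longest_escape_segment and sneaky_snake are hypothetical.

```python
def longest_escape_segment(P, T, checkpoint, shift):
    """Length of the run of matches P[n + shift] == T[n] starting at column
    `checkpoint`. shift < 0 is an upper track, 0 the main track, and
    shift > 0 a lower track. Out-of-range positions count as obstacles."""
    m = len(P)
    count = 0
    for n in range(checkpoint, m):
        k = n + shift
        if k < 0 or k >= m or P[k] != T[n]:
            break
        count += 1
    return count


def sneaky_snake(P, T, E):
    """Return True if the signal net reaches the destination with at most
    E obstacles (the pair should proceed to alignment), False otherwise."""
    m = len(P)
    checkpoint = 0
    propagation_delay = 0
    while checkpoint < m and propagation_delay <= E:
        # Longest first escape segment over all 2E+1 tracks.
        longest_es = max(
            longest_escape_segment(P, T, checkpoint, shift)
            for shift in range(-E, E + 1)
        )
        checkpoint += longest_es
        if checkpoint < m:
            propagation_delay += 1  # pass through one obstacle
            checkpoint += 1
    return propagation_delay <= E


if __name__ == "__main__":
    R = "GGTGCAGAGCTC"  # reference sequence of Fig. 6, used as the text T
    Q = "GGTGAGAGTTGT"  # query sequence of Fig. 6, used as the pattern P
    print(sneaky_snake(Q, R, 3), sneaky_snake(Q, R, 2))
```

Under these assumptions, the sequence pair of Fig. 6 passes through three obstacles, so the sketch accepts it for E = 3 and rejects it for E = 2.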
 
 
 
8. Snake-on-Chip Hardware Architecture 
Next, we present the details of the hardware architecture of Snake-on-Chip in the following key steps. (1) Snake-on-
Chip constructs the entire chip maze of each subproblem. Each chip maze has 2E+1 bit-vectors (rows) and
each bit-vector is t bits long. This differs from the CPU implementation of the SneakySnake
algorithm, in which the entries computed in each row are limited to those located
between a checkpoint and the first following obstacle. The reason is the fundamental difference
between a CPU core (sequential execution) and an FPGA chip (parallel processing). We want to concurrently
compute all bits of all bit-vectors beforehand so that we can exploit the massive bitwise parallelism provided
by the FPGA and perform computations on all bit-vectors in parallel.
 
(2) It computes the length of the first horizontal segment of consecutive zeros for each bit-vector (i.e., each 
HRT) using a leading-zero counter (LZC). Snake-on-Chip uses the LZC design proposed in 
(Dimitrakopoulos et al., 2008) as it requires a low number of both logic gates and logic levels. It counts the
number of consecutive zeros that appear in a t-bit input vector before the first more significant bit that is 
equal to one.  
 
(3) Snake-on-Chip finds the bit-vector (i.e., HRT) that has the largest number of leading zeros. Snake-on-
Chip implements a hierarchical comparator structure with $\lceil \log_2(2E+1) \rceil$ levels. Each comparator
compares the output of two LZCs and finds the largest value. That is, we need 2E+2 comparators, each of
which is a $(\lfloor \log_2 t \rfloor + 1)$-bit comparator, for comparing the leading zero counts of 2E+1 t-bit LZCs and
finding the largest leading zero count. Consider that we choose t, E, and m to be 8 columns, 5 edits (i.e., 11
rows), and 100 characters, respectively. This results in partitioning the chip maze of size 11 × 100 into 13
subproblems, each of size 11 × 8. We need 11 LZCs and 12 comparators. We arrange the 12 LZC
comparators into 4 levels: the first level, which is closest to the LZCs, has 6 LZC comparators, the second level
has 3 LZC comparators, the third level has 2 LZC comparators, and the last level has a single LZC
comparator. This hierarchical comparator structure compares the 11 escape segments of a subproblem and
produces the length of the longest escape segment. We provide the overall architecture of the 4-level LZC
comparator tree including the 11 LZC block diagrams in Fig. 7.
 
(4) After computing the length of the longest segment (the largest leading-zero count), Snake-on-Chip 
creates a new checkpoint in order to iterate over the HRTs once again to find the next optimal escape 
segment. Snake-on-Chip achieves this by shifting the bits of each row (i.e., HRT) to the right-hand direction 
(assuming the least significant bit starts from the right-hand side). The shift amount is equal to x bits, where 
x is the length of the previously-found longest escape segment of the consecutive zeros. 
To skip the obstacle that exists at the end of the longest escape segment, Snake-on-Chip shifts the bits of
each row by an additional single step to the right-hand direction. This guarantees that the previously-found
longest escape segment, along with a single obstacle, is excluded from the new search round.
 
(5) At this point, Snake-on-Chip has preprocessed all the 2E+1 rows and made them ready for the next search round.
Snake-on-Chip repeats the previous four steps in order to find the next optimal escape segment, starting
from the least significant bit all the way to the most significant bit. Repeating the previous steps for each
iteration requires building a new replication of the architecture design of all four previous steps. The
number of replications needed depends on the desired accuracy of the SneakySnake algorithm. If our target 
is to find an optimal signal net that has at most a single obstacle within each subproblem, then we need to 
build two replications for the three previous steps (steps 2, 3, and 4). For example, let A be “00010000”, 
where t = 8. The first replication computes the value of x as four zeros and updates the bits of A to 
“11111000”. The second replication computes the value of x as three zeros and updates the bits of A to 
“11111111”.  
 
(6) The last step is to calculate the total number of obstacles faced along the entire optimal signal net in 
each subproblem. For each subproblem, Snake-on-Chip calculates the total number of obstacles as follows: 
$\min\left(y,\; t - \sum_{k=1}^{y} x_k\right)$                                                             (2)
where y is the total number of replications included in the architecture of Snake-on-Chip and $x_k$ is the length
of the longest segment of consecutive zeros found by the replication of index $k$. Hence, the total number
of obstacles for the original problem of size $(2E+1) \times m$ is simply the summation of the total number 
of obstacles faced along the optimal signal net of all subproblems. 
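
The per-subproblem steps (2) to (6) can be mimicked in software with the short Python sketch below. It models each bit-vector as a string of '0' and '1' characters and assumes, based on the worked example with A = “00010000”, that zeros are counted starting from the rightmost (least significant) bit and that the vacated positions are filled with ones after each shift. All function names are hypothetical, and the sketch models the behavior only, not the hardware design. The helper sizing() reproduces the arithmetic of step (3).

```python
def count_zeros_from_lsb(bits):
    """Step (2): number of consecutive '0's starting at the rightmost bit."""
    count = 0
    for bit in reversed(bits):
        if bit == "1":
            break
        count += 1
    return count


def shift_right_fill_ones(bits, amount):
    """Step (4): shift right by `amount`, filling vacated positions with '1'
    (assumed from the example in step (5))."""
    amount = min(amount, len(bits))
    return "1" * amount + bits[: len(bits) - amount]


def obstacles_in_subproblem(rows, y):
    """Apply y replications to the 2E+1 t-bit rows of one subproblem and
    evaluate Eq. (2): min(y, t - sum of x_k)."""
    t = len(rows[0])
    total_x = 0
    for _ in range(y):
        x = max(count_zeros_from_lsb(row) for row in rows)  # steps (2)-(3)
        total_x += x
        rows = [shift_right_fill_ones(row, x + 1) for row in rows]
    return min(y, t - total_x)


def sizing(E, t, m):
    """Comparator-tree levels, comparator width in bits, and subproblem count."""
    levels = (2 * E).bit_length()     # ceil(log2(2E+1)) for E >= 1
    comparator_bits = t.bit_length()  # floor(log2 t) + 1
    subproblems = -(-m // t)          # ceil(m / t)
    return levels, comparator_bits, subproblems
```

For the example of step (5), obstacles_in_subproblem(["00010000"], 2) finds x_1 = 4 and x_2 = 3 and therefore reports min(2, 8 − 7) = 1 obstacle. Summing this value over all subproblems gives the total for the original problem.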
 
 
 
 
 
 
Fig. 7: Block diagram of the 11 LZCs and the hierarchical LZC comparator tree for computing 
the largest number of leading zeros in 11 rows. 
 
9. Snake-on-GPU Parallel Implementation 
Snake-on-GPU makes three key assumptions that help provide an efficient GPU implementation.
(1) The entire input dataset of query and reference sequences fits into the GPU global memory, which is
the off-chip DRAM memory of GPUs (NVIDIA, 2019a) and typically holds a few GB of data. (2) We copy
the entire input dataset from the CPU main memory to the GPU global memory before the GPU kernel
execution starts. This enables massively-parallel computation by making a large number of input sequences
available in the GPU global memory. (3) We copy back the pre-alignment filtering results from the GPU
global memory to the CPU main memory only after the GPU kernel completes the computation. If the size
of the input dataset exceeds the size of the GPU global memory, we divide the dataset into independent
smaller datasets, each of which fits within the capacity of the GPU global memory. This approach also helps to
overlap the computation performed on one small dataset with the transfer of another small dataset between 
the CPU memory and GPU memory (Gómez-Luna et al., 2012). 
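
The splitting of an oversized dataset can be sketched on the host side as follows. This is a simplified illustration in Python rather than GPU code: the function name split_into_batches, the fixed per-pair footprint bytes_per_pair, and the capacity value are assumptions of ours, and the actual transfer and overlap logic is not modeled.

```python
def split_into_batches(pairs, capacity_bytes, bytes_per_pair):
    """Divide a list of (query, reference) pairs into independent batches,
    each small enough to fit into a memory of `capacity_bytes`."""
    pairs_per_batch = max(1, capacity_bytes // bytes_per_pair)
    return [
        pairs[start:start + pairs_per_batch]
        for start in range(0, len(pairs), pairs_per_batch)
    ]


if __name__ == "__main__":
    dataset = [("ACGT", "ACGA")] * 10
    # Hypothetical numbers: 10 bytes per pair and room for 40 bytes.
    for batch in split_into_batches(dataset, 40, 10):
        print(len(batch))
```

Because the batches are independent, one batch can be filtered while the next one is being transferred, which is the overlap described above.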
Given the large size of the input dataset that the GPU threads need to access from the GPU global memory, 
we carefully design Snake-on-GPU to efficiently use the on-chip register file to store the query and the 
reference sequences and avoid unnecessary accesses to the off-chip global memory. The workflow of 
Snake-on-GPU includes two key steps, as we provide in Fig. 8. 1) Each thread copies a single reference 
sequence and a single query sequence from global memory to the on-chip registers. Assuming the
maximum length of a query (or reference) sequence is m (i.e., the maximum number of VRTs), we need
2m bits to store the query (or reference) sequence, as we encode each character with a unique 2-bit binary representation.
Since the size of a register is 4 bytes (32 bits), each thread needs $R = \lceil 2m/32 \rceil$ registers to store an entire
query/reference sequence. For example, for a maximum length of m = 128, R = 8. This way, 16 registers
are enough to store both the query and the reference sequences. This number is much lower than the maximum of
256 registers that each thread can use in current NVIDIA GPUs. Thus, the resources of a GPU core are not
exhausted and more threads can run concurrently. 2) Each thread solves the complete SNR problem for a
single query sequence and a single reference sequence. Each GPU thread applies the same computation of 
the SneakySnake algorithm to solve the SNR problem. 
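
The register budget of step 1 can be checked with the following Python sketch, which packs a sequence into 32-bit words using two bits per character. The particular mapping (A=0, C=1, G=2, T=3), the packing order, and the function names are assumptions made for illustration; the text only states that each character receives a unique binary representation.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def registers_needed(m, register_bits=32):
    """R = ceil(2m / register_bits) registers for a sequence of length m."""
    return -(-(2 * m) // register_bits)


def pack_sequence(seq, register_bits=32):
    """Pack a DNA string into a list of integers, each holding up to
    register_bits / 2 characters (first character in the highest bits used)."""
    chars_per_register = register_bits // 2
    registers = []
    for start in range(0, len(seq), chars_per_register):
        word = 0
        for ch in seq[start:start + chars_per_register]:
            word = (word << 2) | CODE[ch]
        registers.append(word)
    return registers
```

For m = 128 the sketch yields 8 words per sequence, which matches the value of R given above.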
 
 
 
Fig. 8: Workflow of Snake-on-GPU. It includes two key steps: (1) each GPU thread loads a single 
reference sequence and another query sequence into registers, (2) the thread solves a single SNR 
problem for the two sequences. 
 
 
10. Supplementary Evaluation 
10.1. Dataset Description 
Our experimental evaluation uses four different real datasets. We summarize the details of these four datasets
in Table 1. We provide the source used to obtain the read sets, the read length in each read set, and the
configuration used for the e parameter of mrFAST (Alkan et al., 2009) for our four real datasets. We use Edlib
(Šošić and Šikić, 2017) to assess the number of similar (i.e., having edits fewer than or equal to the edit 
distance threshold) and dissimilar (i.e., having more edits than the edit distance threshold) pairs for each of 
the 4 datasets across different user-defined edit distance thresholds. We provide these details for Set_1, 
Set_2, Set_3, and Set_4 in Table 2. 
Table 1: Benchmark Illumina-like datasets (read-reference pairs). We map each read set to the
human reference genome in order to generate four datasets using different mappers’ edit distance
thresholds (using the e parameter).
Dataset            Set_1                 Set_2                 Set_3                 Set_4
Accession no.      ERR240727_1           ERR240727_1           SRR826471_1           SRR826471_1
Sequence Length    100                   100                   250                   250
HTS                Illumina HiSeq 2000   Illumina HiSeq 2000   Illumina HiSeq 2000   Illumina HiSeq 2000
mrFAST e           2                     40                    8                     100
Amount of Edits    Low-edit              High-edit             Low-edit              High-edit
Source (ERR240727_1): https://www.ebi.ac.uk/ena/data/view/ERR240727
Source (SRR826471_1): https://www.ebi.ac.uk/ena/data/view/SRR826471
 
Table 2: Details of evaluating the number of similar and dissimilar sequences in each of our four 
datasets using Edlib over a wide range of edit distance thresholds of E= 0% up to E= 10% of the 
sequence length. Each dataset contains 30 million sequence pairs. 
     Set_1                     Set_2                          Set_3                    Set_4
E    Similar     Dissimilar    Similar  Dissimilar      E     Similar    Dissimilar    Similar  Dissimilar
0    381,901     29,618,099    11       29,999,989      0     707,517    29,292,483    49       29,999,951
1    1,345,842   28,654,158    18       29,999,982      2     1,462,242  28,537,758    163      29,999,837
2    3,266,455   26,733,545    24       29,999,976      5     1,973,835  28,026,165    301      29,999,699
3    5,595,596   24,404,404    27       29,999,973      7     2,361,418  27,638,582    375      29,999,625
4    7,825,272   22,174,728    29       29,999,971      10    3,183,271  26,816,729    472      29,999,528
5    9,821,308   20,178,692    34       29,999,966      12    3,862,776  26,137,224    520      29,999,480
6    11,650,490  18,349,510    83       29,999,917      15    4,915,346  25,084,654    575      29,999,425
7    13,407,801  16,592,199    177      29,999,823      17    5,550,869  24,449,131    623      29,999,377
8    15,152,501  14,847,499    333      29,999,667      20    6,404,832  23,595,168    718      29,999,282
9    16,894,680  13,105,320    711      29,999,289      22    6,959,616  23,040,384    842      29,999,158
10   18,610,897  11,389,103    1,627    29,998,373      25    7,857,750  22,142,250    1,133    29,998,867
 
10.2. Evaluating the Number of Falsely-Accepted Sequence Pairs and Falsely-Rejected Sequence Pairs
We evaluate the number of falsely-accepted pairs and falsely-rejected pairs for SneakySnake, Shouji (Alser 
et al., 2019), MAGNET (Alser et al., 2017b), SHD (Xin et al., 2015), and GateKeeper (Alser et al., 2017a). 
We list the number of falsely-accepted and falsely-rejected sequences in Table 3 and Table 4, respectively, 
across a wide range of edit distance thresholds of E= 0% to E= 10% of the sequence length. 
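The two error counts can be sketched as follows. This is a minimal sketch, assuming that the ground truth is the exact edit distance of each pair (as computed by Edlib in this evaluation); the function name and the toy inputs are assumptions made for illustration.

```python
def count_filter_errors(filter_accepts, true_distances, E):
    """Return (falsely_accepted, falsely_rejected) for one filter at threshold E.

    filter_accepts[i] is True if the filter accepts pair i;
    true_distances[i] is the exact edit distance of pair i.
    """
    falsely_accepted = 0
    falsely_rejected = 0
    for accepted, distance in zip(filter_accepts, true_distances):
        similar = distance <= E
        if accepted and not similar:
            falsely_accepted += 1   # dissimilar pair passed on to alignment
        elif not accepted and similar:
            falsely_rejected += 1   # similar pair wrongly discarded
    return falsely_accepted, falsely_rejected


# Toy example (hypothetical decisions and distances).
errors = count_filter_errors([True, True, False, False], [0, 5, 1, 9], E=2)
```

A falsely-accepted pair only costs an unnecessary alignment, whereas a falsely-rejected pair discards a valid result, which is why the two counts are reported in separate tables.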
Table 3: Number of falsely-accepted sequences of SneakySnake, Shouji, MAGNET, SHD, and 
GateKeeper across 4 real datasets. We use a wide range of edit distance thresholds (0%-10% of the 
sequence length) for sequence lengths of 100 and 250. The red scale represents the numbers of 
falsely-accepted sequences. 
 
 
 
The Edlib column reports the number of truly-accepted pairs; all other columns report the number of
falsely-accepted pairs.
E Edlib SHD GateKeeper Shouji MAGNET SneakySnake
Set_1
0 381,901 10 0 0 963,941 0
1 1,345,842 783,185 783,185 333,320 800,099 12,473
2 3,266,455 2,704,128 2,704,128 1,283,004 1,876,518 77,165
3 5,595,596 5,237,529 5,237,529 2,674,876 2,428,301 234,003
4 7,825,272 8,231,507 8,231,507 4,399,886 2,662,902 484,179
5 9,821,308 11,195,124 11,195,124 6,452,280 2,916,838 795,582
6 11,650,490 13,781,651 13,781,651 9,373,309 3,406,303 1,240,276
7 13,407,801 14,283,519 14,283,519 11,113,616 4,026,433 1,815,478
8 15,152,501 13,814,295 13,814,295 11,990,529 4,745,672 2,567,290
9 16,894,680 13,105,305 13,105,305 11,693,396 5,319,627 3,331,944
10 18,610,897 11,389,103 11,389,103 10,664,722 5,673,172 4,020,164
Set_2
0 11 0 0 0 7 0
1 18 14 14 2 5 0
2 24 155 155 15 2 0
3 27 1,196 1,196 216 4 1
4 29 7,436 7,436 1,986 13 3
5 34 32,792 32,792 10,551 82 13
6 83 155,134 155,134 57,258 298 69
7 177 417,444 417,444 214,005 1,030 289
8 333 1,031,480 1,031,480 675,029 3,129 1,081
9 711 29,997,022 29,997,022 1,742,476 8,234 3,563
10 1,627 29,998,373 29,998,373 3,902,535 19,013 9,698
Set_3
0 707,517 0 0 0 479,104 0
2 1,462,242 238,368 238,368 174,366 143,066 12,319
5 1,973,835 1,546,126 1,546,126 1,071,218 226,864 38,814
7 2,361,418 3,933,916 3,933,916 2,775,419 347,819 79,246
10 3,183,271 26,816,729 26,816,729 6,669,084 624,927 235,689
12 3,862,776 26,137,224 26,137,224 11,147,373 825,468 407,799
15 4,915,346 25,084,654 25,084,654 18,406,823 1,066,633 705,904
17 5,550,869 24,449,131 24,449,131 20,971,826 1,235,999 914,730
20 6,404,832 23,595,168 23,595,168 22,223,170 1,695,351 1,364,891
22 6,959,616 23,040,384 23,040,384 22,271,215 2,241,984 1,879,428
25 7,857,750 22,142,250 22,142,250 21,849,454 3,514,515 3,134,474
Set_4
0 49 0 0 0 53 0
2 163 71 71 55 44 2
5 301 249 249 161 49 6
7 375 698 698 212 48 6
10 472 29,999,528 29,999,528 5,627 42 14
12 520 29,999,480 29,999,480 64,225 45 22
15 575 29,999,425 29,999,425 775,314 82 47
17 623 29,999,377 29,999,377 2,052,498 175 106
20 718 29,999,282 29,999,282 5,679,869 417 326
22 842 29,999,158 29,999,158 10,277,297 593 495
25 1,133 29,998,867 29,998,867 19,676,652 1,174 955
Table 4: Number of falsely-rejected sequences of SneakySnake, Shouji, MAGNET, SHD, and
GateKeeper across 4 real datasets. We use a wide range of edit distance thresholds (0%-10% of the
sequence length) for sequence lengths of 100 and 250.
The Edlib column reports the number of truly-rejected pairs; all other columns report the number of
falsely-rejected pairs.
E Edlib SHD GateKeeper Shouji MAGNET SneakySnake
Set_1
0 29,618,099 0 0 0 0 0
1 28,654,158 0 0 0 0 0
2 26,733,545 0 0 0 0 0
3 24,404,404 0 0 0 0 0
4 22,174,728 0 0 0 1 0
5 20,178,692 0 0 0 0 0
6 18,349,510 0 0 0 4 0
7 16,592,199 0 0 0 19 0
8 14,847,499 0 0 0 27 0
9 13,105,320 0 0 0 41 0
10 11,389,103 0 0 0 31 0
Set_2
0 29,999,989 0 0 0 0 0
1 29,999,982 0 0 0 0 0
2 29,999,976 0 0 0 0 0
3 29,999,973 0 0 0 0 0
4 29,999,971 0 0 0 0 0
5 29,999,966 0 0 0 0 0
6 29,999,917 0 0 0 0 0
7 29,999,823 0 0 0 0 0
8 29,999,667 0 0 0 0 0
9 29,999,289 0 0 0 0 0
10 29,998,373 0 0 0 0 0
Set_3
0 29,292,483 0 0 0 0 0
2 28,537,758 0 0 0 0 0
5 28,026,165 0 0 0 0 0
7 27,638,582 0 0 0 1 0
10 26,816,729 0 0 0 1 0
12 26,137,224 0 0 0 9 0
15 25,084,654 0 0 0 14 0
17 24,449,131 0 0 0 23 0
20 23,595,168 0 0 0 35 0
22 23,040,384 0 0 0 42 0
25 22,142,250 0 0 0 54 0
Set_4
0 29,999,951 0 0 0 0 0
2 29,999,837 0 0 0 0 0
5 29,999,699 0 0 0 0 0
7 29,999,625 0 0 0 0 0
10 29,999,528 0 0 0 0 0
12 29,999,480 0 0 0 0 0
15 29,999,425 0 0 0 0 0
17 29,999,377 0 0 0 0 0
20 29,999,282 0 0 0 0 0
22 29,999,158 0 0 0 0 0
25 29,998,867 0 0 0 0 0

10.3. Evaluating the Filtering Speed of SneakySnake
We analyze the execution time of SneakySnake compared to the best-performing existing CPU-based
pre-alignment filter, SHD (Xin et al., 2015), and two state-of-the-art CPU implementations of sequence
alignment algorithms, Edlib (Šošić and Šikić, 2017) and Parasail (Daily, 2016). We evaluate them using a
single CPU core and a single thread on the same machine. SHD supports a sequence length of
up to only 128 characters (due to the SIMD register size). To ensure as fair a comparison as possible, we
allow SHD to divide the long sequences into batches of 128 characters, examine each batch individually,
and then sum up the results. We configure Edlib to work as a banded global alignment tool by choosing the
following parameters: (1) Edlib’s k parameter (maximum number of diagonals computed for the dynamic
programming table) to be equal to E, (2) EDLIB_MODE_NW, and (3) EDLIB_TASK_PATH. This enables
Edlib to act as a sequence aligner that provides the alignment path (backtracking), alignment score, and
location of each edit. Note that the execution time of Edlib provided in Table 5 does not include the
execution time required to generate the CIGAR string, which is performed by a separate function,
edlibAlignmentToCigar(). We also configure Parasail to work as a banded global alignment tool by
choosing the parameter nw_banded. Similar to Edlib, the execution time of Parasail provided in Table 5
does not include the time spent in generating the CIGAR string, which is performed by a separate
function, parasail_result_get_cigar().
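The batching applied to SHD for sequences longer than 128 characters can be pictured as follows. This is only a sketch under stated assumptions: the helper names are hypothetical, and a simple mismatch count stands in for the per-batch filter, since the internals of SHD are not reproduced here.

```python
def split_into_batches(seq, batch_size=128):
    """Cut a sequence into consecutive batches of at most batch_size characters."""
    return [seq[i:i + batch_size] for i in range(0, len(seq), batch_size)]


def mismatch_count(a, b):
    """Stand-in per-batch estimator (assumption): number of mismatching positions."""
    return sum(x != y for x, y in zip(a, b))


def batched_estimate(read, ref, estimate=mismatch_count, batch_size=128):
    """Examine each batch individually and sum up the per-batch results."""
    read_batches = split_into_batches(read, batch_size)
    ref_batches = split_into_batches(ref, batch_size)
    return sum(estimate(r, f) for r, f in zip(read_batches, ref_batches))


# A 250-character toy pair: one mismatch falls in each of the two batches.
read = "A" * 250
ref = "A" * 127 + "C" + "A" * 121 + "G"
total = batched_estimate(read, ref)
```

A 250-character sequence yields two batches, of 128 and 122 characters.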
Table 5: Execution time (in seconds) of SneakySnake and SHD compared to that of Edlib and 
Parasail across 4 real datasets. We use a wide range of edit distance thresholds (0%-10% of the 
sequence length) for sequence lengths of 100 bp and 250 bp. The green scale represents the 
execution time. 
 
 
 
 
 
 
E Parasail Edlib SHD SneakySnake
Set_1
0 69.00 225.10 12.17 17.71
1 161.72 270.74 11.16 18.27
2 222.93 325.10 11.93 19.58
3 289.13 384.26 12.70 22.13
4 352.91 440.17 13.52 23.95
5 407.96 489.80 14.34 26.67
6 487.16 535.24 15.14 31.31
7 546.82 577.64 15.84 34.13
8 618.78 619.33 16.45 37.71
9 661.80 659.80 17.28 38.95
10 720.45 697.85 18.22 41.75
Set_2
0 78.54 212.78 12.36 17.23
1 139.68 220.87 18.67 18.04
2 197.57 224.24 20.22 18.39
3 261.37 226.25 20.99 20.83
4 330.11 228.77 22.08 26.42
5 386.53 231.12 24.00 28.87
6 448.75 233.50 24.90 31.79
7 511.77 235.96 26.20 37.16
8 574.12 238.21 27.13 38.95
9 636.17 239.37 28.38 45.89
10 706.12 242.33 29.16 53.32
Set_3
0 197.15 414.10 29.08 9.64
2 541.63 494.41 36.52 12.18
5 1005.10 544.12 34.89 21.28
7 1347.73 580.12 36.41 30.44
10 1804.81 648.19 39.48 48.48
12 2071.57 698.63 40.38 63.52
15 2536.76 776.64 43.62 89.26
17 2838.03 823.72 45.17 108.31
20 3299.15 893.10 48.14 140.74
22 3598.94 936.31 49.86 163.26
25 4035.22 1004.11 56.93 201.29
Set_4
0 153.13 360.55 28.68 9.57
2 473.30 377.83 32.09 12.14
5 954.77 386.27 36.12 21.74
7 1272.73 391.28 36.47 30.39
10 1992.27 396.95 39.47 48.91
12 2521.69 402.55 41.08 63.96
15 3094.02 408.62 44.97 90.31
17 2840.88 413.41 45.51 113.49
20 3265.87 416.67 53.68 149.54
22 3550.71 423.11 50.73 177.08
25 3998.50 434.02 70.46 221.23
10.4. Effects of Pre-Alignment Filtering on Sequence Alignment  
In Table 6, we analyze the benefits of integrating our proposed pre-alignment filter, SneakySnake, and the
best-performing CPU-based pre-alignment filter, SHD (Xin et al., 2015), with state-of-the-art CPU
implementations of sequence alignment algorithms, Edlib (Šošić and Šikić, 2017) and Parasail (Daily,
2016).
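The integration of a pre-alignment filter with an aligner can be sketched as follows. This is a minimal sketch with stand-ins: count_filter is a hypothetical character-count filter (a lower bound on the number of edits, so it never rejects a similar pair) and edit_distance is a plain dynamic-programming aligner; neither is SneakySnake, SHD, Edlib, or Parasail.

```python
from collections import Counter


def count_filter(read, ref, E):
    """Hypothetical stand-in filter: accept if a lower bound on the edits is <= E."""
    surplus = Counter(read) - Counter(ref)  # characters of read with no partner in ref
    return sum(surplus.values()) <= E


def edit_distance(a, b):
    """Stand-in aligner: Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def run_pipeline(pairs, E, pre_filter, aligner):
    """Align only the pairs that the filter accepts; rejected pairs yield None."""
    results = []
    aligned = 0
    for read, ref in pairs:
        if pre_filter(read, ref, E):
            aligned += 1
            results.append(aligner(read, ref))
        else:
            results.append(None)
    return results, aligned


toy_pairs = [("ACGT", "ACGT"), ("ACGT", "TTTT"), ("ACGT", "ACGA")]
results, aligned = run_pipeline(toy_pairs, E=1, pre_filter=count_filter, aligner=edit_distance)
```

The end-to-end time of such a pipeline is the filtering time for all pairs plus the alignment time for the accepted pairs only, so a filter pays off when it is much cheaper than alignment and rejects most dissimilar pairs.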
 
 
Table 6: End-to-end execution time (in seconds) of SneakySnake and SHD combined with Edlib
and Parasail. We use four datasets over a wide range of edit distance thresholds (0%-10% of the
sequence length) for sequence lengths of 100 and 250 characters. The green scale represents the
end-to-end execution time of pipelines that include Edlib and the blue scale represents the end-to-
end execution time of pipelines that include Parasail.
Columns (left to right): E, followed by six values for the first dataset and six values for the second
dataset of each block, in this order: SneakySnake w/ Edlib, SneakySnake w/ Parasail, SHD w/ Edlib,
SHD w/ Parasail, Edlib alone, Parasail alone.
Set_1 (first six values) and Set_2 (last six values)
0 20.58 18.59 13.76 12.65 225.10 69.00 17.23 17.23 12.36 12.36 212.78 78.54
1 30.53 25.59 30.37 22.64 270.74 161.72 18.04 18.04 18.67 18.67 220.87 139.68
2 55.81 44.43 76.59 56.27 325.10 222.93 18.39 18.39 20.22 20.22 224.24 197.57
3 96.80 78.31 151.36 117.03 384.26 289.13 20.83 20.83 21.00 21.00 226.25 261.37
4 145.87 121.70 248.91 202.24 440.17 352.91 26.42 26.42 22.13 22.16 228.77 330.11
5 200.01 171.04 357.20 299.91 489.80 407.96 28.87 28.87 24.26 24.43 231.12 386.53
6 261.30 240.64 468.67 427.94 535.24 487.16 31.79 31.79 26.10 27.21 233.50 448.75
7 327.25 311.61 548.89 520.45 577.64 546.82 37.16 37.17 29.47 33.29 235.96 511.77
8 403.53 403.20 614.40 613.87 619.33 618.78 38.96 38.98 35.30 46.82 238.21 574.12
9 483.80 485.15 677.08 679.08 659.80 661.80 45.92 45.98 267.73 664.51 239.37 636.17
10 568.19 585.23 716.08 738.67 697.85 720.45 53.41 53.59 271.50 735.28 242.33 706.12
Set_3 (first six values) and Set_4 (last six values)
0 19.41 14.29 38.85 33.73 414.10 197.15 9.57 9.57 28.68 28.68 360.55 153.13
2 36.48 38.80 64.54 67.22 494.41 541.63 12.14 12.14 32.10 32.10 377.83 473.30
5 57.78 88.71 98.73 152.82 544.12 1005.10 21.74 21.75 36.13 36.14 386.27 954.77
7 77.64 140.09 158.15 319.23 580.12 1347.73 30.39 30.41 36.48 36.51 391.28 1272.73
10 122.35 254.17 687.67 1844.29 648.19 1804.81 48.92 48.94 436.42 2031.75 396.95 1992.27
12 162.97 358.41 739.01 2111.95 698.63 2071.57 63.97 64.01 443.62 2562.77 402.55 2521.69
15 234.78 564.58 820.26 2580.38 776.64 2536.76 90.32 90.37 453.59 3138.99 408.62 3094.02
17 285.84 719.96 868.89 2883.19 823.72 2838.03 113.50 113.56 458.92 2886.39 413.41 2840.88
20 372.04 995.19 941.24 3347.30 893.10 3299.15 149.55 149.65 470.35 3319.55 416.67 3265.87
22 439.13 1223.63 986.17 3648.81 936.31 3598.94 177.10 177.24 473.84 3601.43 423.11 3550.71
25 569.20 1679.82 1061.04 4092.15 1004.11 4035.22 221.26 221.51 504.48 4068.96 434.02 3998.50

10.5. Evaluating the Accuracy of Snake-on-Chip
We examine the feasibility of reducing the search space of the SneakySnake algorithm without causing
falsely-rejected sequences. As we discuss in the main manuscript, Section 2.5, we column-wise partition
the chip maze of the SneakySnake algorithm into adjacent non-overlapping smaller chip mazes, each of
which has a size of 2E+1 by t, where t is the number of columns in each small chip maze. We provide the
effects of this partitioning on the number of falsely-accepted sequences of our SneakySnake algorithm in
Table 7. We also provide the effects of this partitioning on the end-to-end execution time of SneakySnake
combined with Edlib and Parasail in Table 8.
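The column-wise partitioning of the chip maze described in Section 10.5 can be sketched as index arithmetic. This is a minimal sketch: the function names are hypothetical, m denotes the number of columns of the full chip maze (the read length), and letting the last region be narrower when t does not divide m is an assumption of this example.

```python
def partition_columns(m, t):
    """Split columns 0..m-1 into adjacent, non-overlapping [start, end) ranges of width t."""
    return [(start, min(start + t, m)) for start in range(0, m, t)]


def region_shapes(m, E, t):
    """Shape (rows, columns) of each small chip maze: 2E+1 rows by at most t columns."""
    return [(2 * E + 1, end - start) for start, end in partition_columns(m, t)]


# A 100-column maze with E = 5 and t = 25 gives four 11-by-25 regions.
regions = partition_columns(100, 25)
shapes = region_shapes(100, 5, 25)
```

Choosing t equal to the read length yields a single region, which corresponds to the default (unpartitioned) chip maze.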
 
 
 Table 7: Effects of column-wise partitioning the search space of the SneakySnake algorithm on the 
number of falsely-accepted sequences. Besides the default size (equals the read length) of the 
SneakySnake's chip maze, we choose partition sizes (t) of 5, 10, 25, and 50 columns. The green scale 
represents the numbers of falsely-accepted sequences. 
 
 
 
 
 
The t columns report the number of falsely-accepted sequences for each partition size; the Edlib and
Parasail columns report the number of truly-accepted sequences. The largest t equals the read length.
E t=5 t=10 t=25 t=50 t=100 Edlib Parasail
Set_1
0 0 0 0 0 0 381,901 381,901
1 76,020 37,353 19,769 14,832 12,473 1,345,842 1,345,842
2 380,613 195,879 111,870 88,475 77,165 3,266,455 3,266,455
3 944,492 517,385 318,868 260,617 234,003 5,595,596 5,595,596
4 1,700,916 966,068 632,872 530,638 484,179 7,825,272 7,825,272
5 2,597,654 1,507,277 1,016,349 864,585 795,582 9,821,308 9,821,308
6 3,913,638 2,296,303 1,564,784 1,343,365 1,240,276 11,650,490 11,650,490
7 5,284,834 3,221,727 2,248,326 1,954,763 1,815,478 13,407,801 13,407,801
8 6,764,416 4,344,591 3,120,867 2,748,009 2,567,290 15,152,501 15,152,501
9 7,704,679 5,306,849 3,958,802 3,535,364 3,331,944 16,894,680 16,894,680
10 7,993,033 5,962,287 4,656,672 4,227,978 4,020,164 18,610,897 18,610,897
Set_2
0 0 0 0 0 0 11 11
1 2 2 0 0 0 18 18
2 1 0 0 0 0 24 24
3 11 1 1 1 1 27 27
4 91 11 3 3 3 29 29
5 721 61 16 13 13 34 34
6 3,479 341 86 75 69 83 83
7 14,161 1,504 413 312 289 177 177
8 39,386 5,540 1,643 1,182 1,081 333 333
9 88,157 15,390 5,269 3,833 3,563 711 711
10 170,450 37,709 13,820 10,204 9,698 1,627 1,627
E t=5 t=10 t=25 t=50 t=250 Edlib Parasail
Set_3
0 0 0 0 0 0 707,517 707,517
2 39,738 23,472 16,411 13,872 12,319 1,462,242 1,462,242
5 130,941 71,364 49,959 42,900 38,814 1,973,835 1,973,835
7 318,051 160,786 106,600 89,780 79,246 2,361,418 2,361,418
10 926,659 474,698 316,435 268,644 235,689 3,183,271 3,183,271
12 1,429,713 783,967 534,873 460,416 407,799 3,862,776 3,862,776
15 2,178,385 1,255,483 887,762 780,642 705,904 4,915,346 4,915,346
17 2,983,917 1,627,962 1,138,255 1,006,711 914,730 5,550,869 5,550,869
20 5,279,238 2,644,478 1,738,675 1,516,330 1,364,891 6,404,832 6,404,832
22 7,644,306 3,799,309 2,437,154 2,103,779 1,879,428 6,959,616 6,959,616
25 12,076,413 6,481,318 4,111,781 3,526,761 3,134,474 7,857,750 7,857,750
Set_4
0 0 0 0 0 0 49 49
2 14 7 2 2 2 163 163
5 26 20 7 6 6 301 301
7 42 20 10 6 6 375 375
10 58 39 23 19 14 472 472
12 80 41 26 22 22 520 520
15 329 134 73 52 47 575 575
17 565 265 159 129 106 623 623
20 2,099 593 429 355 326 718 718
22 7,758 1,027 641 539 495 842 842
25 65,739 4,292 1,576 1,161 955 1,133 1,133
Table 8: Effects of column-wise partitioning the search space of the SneakySnake algorithm on the
end-to-end execution time (in seconds) of Edlib and Parasail. Besides the default size (equals the
read length) of the SneakySnake's chip maze, we choose partition sizes (t) of 5, 10, 25, and 50
columns. The red scale represents the end-to-end execution time.
Columns (left to right): E; SneakySnake w/ Edlib for each partition size t; Edlib alone; SneakySnake
w/ Parasail for each partition size t; Parasail alone. The largest t equals the read length.
E t=5 t=10 t=25 t=50 t=100 Edlib t=5 t=10 t=25 t=50 t=100 Parasail
Set_1
0 21.14 20.17 20.56 20.62 20.58 225.10 19.15 18.18 18.57 18.63 18.59 69.00
1 32.03 31.25 31.23 30.75 30.53 270.74 26.86 26.23 26.27 25.80 25.59 161.72
2 60.63 58.70 56.64 56.29 55.81 325.10 48.21 46.91 45.13 44.86 44.43 222.93
3 107.12 101.81 97.29 96.76 96.80 384.26 86.38 82.42 78.53 78.19 78.31 289.13
4 166.66 156.48 148.47 146.63 145.87 440.17 138.95 130.91 123.87 122.33 121.70 352.91
5 232.60 215.65 204.75 202.34 200.01 489.80 198.72 184.74 175.19 173.19 171.04 407.96
6 312.89 282.28 265.20 262.30 261.30 535.24 287.95 259.93 244.02 241.47 240.64 487.16
7 398.06 357.30 334.42 328.26 327.25 577.64 378.86 340.21 318.34 312.48 311.61 546.82
8 493.42 444.77 414.24 407.08 403.53 619.33 493.02 444.41 413.91 406.75 403.20 618.78
9 585.39 532.25 498.75 487.73 483.80 659.80 587.03 533.73 500.14 489.09 485.15 661.80
10 665.44 620.74 585.64 571.98 568.19 697.85 685.48 639.24 603.16 589.18 585.23 720.45
Set_2
0 17.97 19.65 17.48 17.26 17.23 212.78 17.97 19.65 17.48 17.26 17.23 78.54
1 18.07 20.53 18.37 17.49 18.04 220.87 18.07 20.53 18.37 17.49 18.04 139.68
2 20.09 20.87 20.04 19.28 18.39 224.24 20.09 20.87 20.04 19.28 18.39 197.57
3 22.74 24.36 21.59 21.10 20.83 226.25 22.74 24.36 21.59 21.10 20.83 261.37
4 27.18 27.64 24.00 23.22 26.42 228.77 27.18 27.64 24.00 23.22 26.42 330.11
5 33.71 31.48 27.45 26.40 28.87 231.12 33.71 31.48 27.45 26.40 28.87 386.53
6 40.26 39.46 30.31 31.13 31.79 233.50 40.28 39.47 30.31 31.13 31.79 448.75
7 47.06 42.65 35.31 34.58 37.16 235.96 47.19 42.67 35.32 34.59 37.17 511.77
8 57.62 48.43 41.67 40.61 38.96 238.21 58.06 48.49 41.69 40.63 38.98 574.12
9 65.96 54.72 50.33 45.06 45.92 239.37 67.13 54.93 50.41 45.12 45.98 636.17
10 76.22 63.23 59.23 55.06 53.41 242.33 78.88 63.84 59.47 55.24 53.59 706.12
E t=5 t=10 t=25 t=50 t=250 Edlib t=5 t=10 t=25 t=50 t=250 Parasail
Set_3
0 20.55 19.71 19.48 20.02 19.41 414.10 15.43 14.59 14.36 14.90 14.29 197.15
2 39.74 38.60 37.90 37.74 36.48 494.41 42.11 40.93 40.23 40.06 38.80 541.63
5 68.30 64.49 60.60 59.93 57.78 544.12 100.64 95.92 91.69 90.92 88.71 1005.10
7 99.10 88.97 83.85 81.98 77.64 580.12 167.66 153.51 147.00 144.70 140.09 1347.73
10 168.70 148.81 135.86 129.92 122.35 648.19 327.15 289.84 270.78 263.01 254.17 1804.81
12 230.12 200.85 181.65 171.67 162.97 698.63 472.33 413.51 382.91 369.52 358.41 2071.57
15 336.63 292.98 263.82 247.70 234.78 776.64 752.83 655.03 604.29 581.88 564.58 2536.76
17 420.14 360.48 324.10 302.64 285.84 823.72 993.20 842.49 773.23 742.94 719.96 2838.03
20 585.73 480.17 420.02 394.81 372.04 893.10 1522.82 1205.94 1073.15 1030.10 995.19 3299.15
22 728.80 580.52 502.16 468.06 439.13 936.31 2024.97 1535.42 1336.17 1272.48 1223.63 3598.94
25 983.85 778.52 654.78 608.19 569.20 1004.11 2997.93 2227.30 1864.14 1758.45 1679.82 4035.22
Set_4
0 10.11 9.35 9.91 9.36 9.57 360.55 10.11 9.35 9.91 9.36 9.57 153.13
2 12.75 12.27 12.09 12.11 12.14 377.83 12.75 12.27 12.09 12.11 12.14 473.30
5 25.65 22.67 23.30 21.91 21.74 386.27 25.66 22.68 23.31 21.92 21.75 954.77
7 39.26 34.84 32.71 30.82 30.39 391.28 39.27 34.85 32.72 30.84 30.41 1272.73
10 68.67 57.64 52.82 51.40 48.92 396.95 68.70 57.66 52.84 51.42 48.94 1992.27
12 93.77 77.51 69.56 68.50 63.97 402.55 93.81 77.55 69.60 68.54 64.01 2521.69
15 141.68 113.95 100.82 96.72 90.32 408.62 141.76 114.01 100.88 96.77 90.37 3094.02
17 180.02 142.00 125.56 117.34 113.50 413.41 180.11 142.07 125.62 117.40 113.56 2840.88
20 243.66 190.33 169.16 157.48 149.55 416.67 243.93 190.45 169.26 157.59 149.65 3265.87
22 293.32 228.85 199.85 188.68 177.10 423.11 294.22 229.04 200.01 188.82 177.24 3550.71
25 380.81 295.01 250.05 233.50 221.26 434.02 388.75 295.65 250.37 233.78 221.51 3998.50

Next, we assess the effect of the number of iterations within a routing region on the filtering accuracy of
the SneakySnake algorithm. We call each such iteration a replication in our Snake-on-Chip design, and we
use the same term in this section for consistency. The number of replications affects the number of obstacles
that can be detected within the routing region, as each replication is a single iteration of the SneakySnake
algorithm that starts from a checkpoint and ends by finding a single longest escape segment. We first
partition the chip maze into smaller chip mazes, each of which is a routing region of size 2E+1 by t. We
vary the width of the routing region (t) over three values: 8, 16, and 32 columns. In this
evaluation, we select the width to be a power of 2 to simplify the evaluation of the hardware
design, Snake-on-Chip. Note that the value of t has no limitation as long as it is larger than 1. We vary the
number of replications from 3 up to 32 (with an increment of 3). We evaluate this effect on the number of
falsely-accepted sequences in Table 9. We make two observations. (1) Increasing the
number of replications in the design improves the filtering accuracy of the SneakySnake algorithm. This
observation is in accord with our expectation, as each replication detects at most a single edit within each
routing region. (2) Increasing the number of replications beyond half of the width of the routing
region (e.g., y>4 for t=8) has only a slight effect or no effect.
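The interplay between the routing region width (t) and the number of replications (y) can be sketched as follows. This is a simplified software sketch under stated assumptions, not the Snake-on-Chip design: the function names are hypothetical, the chip maze is assumed to hold one row per diagonal shift from -E to +E with 0 for a matching cell and 1 for an obstacle, and cells that fall outside the reference are assumed to be obstacles.

```python
def build_chip_maze(query, ref, E):
    """Assumed construction: (2E+1) rows, one per diagonal shift; 0 = match, 1 = obstacle."""
    maze = []
    for shift in range(-E, E + 1):
        row = []
        for j, ch in enumerate(query):
            k = j + shift
            row.append(0 if 0 <= k < len(ref) and ch == ref[k] else 1)
        maze.append(row)
    return maze


def estimate_edits(query, ref, E, t, y):
    """Count obstacles that cannot be avoided, with at most y replications per region."""
    maze = build_chip_maze(query, ref, E)
    m = len(query)
    edits = 0
    for start in range(0, m, t):             # one routing region of width t at a time
        end = min(start + t, m)
        checkpoint = start
        for _ in range(y):                   # each replication finds one escape segment
            longest = 0
            for row in maze:
                run = 0
                while checkpoint + run < end and row[checkpoint + run] == 0:
                    run += 1
                longest = max(longest, run)
            if checkpoint + longest >= end:  # reached the region end without an obstacle
                break
            edits += 1                       # pass through one obstacle
            checkpoint += longest + 1
            if checkpoint >= end:
                break
    return edits


# Two mismatches inside one 8-column region, main diagonal only (E = 0).
query, ref = "AAAAAAAA", "ATATAAAA"
by_replications = [estimate_edits(query, ref, E=0, t=8, y=y) for y in (1, 2, 3)]
```

In this toy case a single replication detects only one of the two obstacles, two replications detect both, and a third adds nothing, which mirrors the saturation described in observation (2).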
 
Table 9: Effects of the number of replications (y), the number of obstacles that can be avoided, 
within a search window (t) of the SneakySnake algorithm on its filtering accuracy. The green scale 
and the red scale represent the low and high numbers of falsely-accepted sequences, respectively. 
 
Next, we evaluate the effect of increasing the number of replications (which decreases the
number of falsely-accepted sequences) on the overall execution time of the SneakySnake algorithm
combined with each of Edlib (Šošić and Šikić, 2017) and Parasail (Daily, 2016). We evaluate this effect in
Table 10 for Edlib and in Table 11 for Parasail, using the same values of t and y that we use in Table
9.
We make two key observations. (1) Adding the SneakySnake algorithm, with a controlled
number of replications, as a pre-alignment filtering step significantly reduces the end-to-end execution
time of Edlib (Šošić and Šikić, 2017), by up to 23.8x. It also reduces the end-to-end execution time of
Parasail by up to 40.7x. (2) Changing the number of replications (y) for the same edit distance threshold
has little or no effect on the overall execution time of the SneakySnake algorithm combined with Edlib and
Parasail.
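A reduction factor of this kind is the ratio between the execution time of the aligner alone and the end-to-end execution time of the filter combined with the aligner. The sketch below illustrates the arithmetic with values taken from Table 6 (Set_1, E = 0), not from Table 10 or Table 11; the function name is an assumption made for illustration.

```python
def speedup(aligner_alone_s, filter_plus_aligner_s):
    """Ratio of the aligner-only time to the end-to-end time with pre-alignment filtering."""
    return aligner_alone_s / filter_plus_aligner_s


# Set_1, E = 0 (Table 6): Edlib alone vs. SneakySnake w/ Edlib,
# and Parasail alone vs. SneakySnake w/ Parasail.
edlib_speedup = speedup(225.10, 20.58)
parasail_speedup = speedup(69.00, 18.59)
```

For these two entries the ratios are roughly 10.9x and 3.7x, respectively.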
 
Table 10: Effects of the number of replications (y), the number of obstacles that can be avoided, 
within a search window (t) of the SneakySnake algorithm on the end-to-end execution of Edlib 
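Column groups (left to right): t=8 (y=3, 6, 9); t=16 (y=3, 6, 9, 12); t=32 (y=3, 6, 9, 12, 15, 18, 21, 32);
truly-accepted sequences (Edlib, Parasail).
E y=3 y=6 y=9 y=3 y=6 y=9 y=12 y=3 y=6 y=9 y=12 y=15 y=18 y=21 y=32 Edlib Parasail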
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11                      11                     
1 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 18                      18                     
2 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 24                      24                     
3 2                     2                     2                     1                     1                     1                     1                     1                     1                     1                     1                     1                     1                     1                     1                     27                      27                     
4 11                   11                   11                   3                     3                     3                     3                     6                     3                     3                     3                     3                     3                     3                     3                     29                      29                     
5 92                   92                   92                   23                   22                   22                   22                   36                   14                   14                   14                   14                   14                   14                   14                   34                      34                     
6 397                 397                 397                 112                 108                 108                 108                 268                 78                   78                   78                   78                   78                   78                   78                   83                      83                     
7 1,628             1,625             1,625             477                 467                 467                 467                 1,390             357                 355                 355                 355                 355                 355                 355                 177                    177                   
8 5,745             5,738             5,738             1,837             1,782             1,781             1,781             30,738           1,401             1,387             1,387             1,387             1,387             1,387             1,387             333                    333                   
9 16,398           16,394           16,393           6,321             6,079             6,079             6,079             322,772        4,784             4,725             4,725             4,725             4,725             4,725             4,725             711                    711                   
10 39,070           39,048           39,047           17,529           16,476           16,475           16,475           1,678,167     13,085           12,815           12,815           12,815           12,815           12,815           12,815           1,627                1,627               
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 381,901           381,901           
1 42,398           42,398           42,398           26,397           26,397           26,397           26,397           17,697           17,697           17,697           17,697           17,697           17,697           17,697           17,697           1,345,842        1,345,842       
2 225,156        225,156        225,156        141,805        141,805        141,805        141,805        102,400        102,400        102,400        102,400        102,400        102,400        102,400        102,400        3,266,455        3,266,455       
3 591,426        591,426        591,426        387,017        387,017        387,017        387,017        292,175        292,175        292,175        292,175        292,175        292,175        292,175        292,175        5,595,596        5,595,596       
4 1,101,267     1,099,250     1,099,250     759,985        740,230        740,230        740,230        757,770        578,887        578,887        578,887        578,887        578,887        578,887        578,887        7,825,272        7,825,272       
5 1,721,547     1,714,525     1,714,525     1,247,295     1,175,893     1,175,893     1,175,893     1,522,555     931,646        931,646        931,646        931,646        931,646        931,646        931,646        9,821,308        9,821,308       
6 2,616,243     2,601,129     2,601,129     1,953,902     1,800,058     1,800,058     1,800,058     2,700,966     1,438,590     1,438,590     1,438,590     1,438,590     1,438,590     1,438,590     1,438,590     11,650,490     11,650,490     
7 3,648,450     3,624,632     3,624,601     2,831,240     2,566,985     2,565,576     2,565,576     4,242,055     2,124,632     2,080,481     2,080,481     2,080,481     2,080,481     2,080,481     2,080,481     13,407,801     13,407,801     
8 4,870,677     4,834,918     4,834,673     3,933,604     3,529,317     3,524,886     3,524,886     6,391,161     3,041,705     2,907,134     2,907,134     2,907,134     2,907,134     2,907,134     2,907,134     15,152,501     15,152,501     
9 5,868,610     5,821,249     5,820,698     4,971,500     4,422,972     4,414,613     4,414,613     8,227,253     3,975,020     3,719,631     3,719,631     3,719,631     3,719,631     3,719,631     3,719,631     16,894,680     16,894,680     
10 6,480,934     6,424,523     6,423,557     5,770,611     5,121,908     5,108,084     5,107,996     9,106,931     4,801,844     4,426,115     4,415,178     4,415,178     4,415,178     4,415,178     4,415,178     18,610,897     18,610,897     
E
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 707517 707517
2 25842 25842 25842 18856 18856 18856 18856 14970 14970 14970 14970 14970 14970 14970 14970 1462242 1462242
5 80904 80524 80524 70150 55875 55875 55875 126851 45323 45323 45323 45323 45323 45323 45323 1973835 1973835
7 185802 185413 185410 143974 124436 122491 122491 231575 115835 96169 96169 96169 96169 96169 96169 2361418 2361418
10 556026 555588 555582 404540 367202 364563 364492 634081 339499 295454 287497 287497 287497 287497 287497 3183271 3183271
12 910977 910479 910474 677047 616018 612552 612475 1108978 577417 506285 490642 490642 490642 490642 490642 3862776 3862776
15 1441092 1440062 1440039 1132791 1013413 1006383 1006265 2046458 1022793 865029 832175 826468 826468 826468 826468 4915346 4915346
17 1889539 1887498 1887457 1483773 1301600 1290591 1290424 3067152 1373136 1131158 1073177 1063348 1062418 1062418 1062418 5550869 5550869
20 3148641 3143777 3143707 2395557 2023731 2005314 2004972 6398123 2251883 1746041 1633929 1608986 1607401 1607355 1607355 6404832 6404832
22 4559381 4551852 4551764 3455042 2862576 2832754 2832182 10418923 3303898 2460112 2280141 2240523 2237430 2237359 2237351 6959616 6959616
25 7763580 7752037 7751822 5922490 4864824 4808221 4806701 17705160 5815944 4216921 3839936 3763350 3757568 3757372 3757350 7857750 7857750
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 49                      49                     
2 6                     6                     6                     3                     3                     3                     3                     2                     2                     2                     2                     2                     2                     2                     2                     163                    163                   
5 17                   17                   17                   15                   11                   11                   11                   41                   7                     7                     7                     7                     7                     7                     7                     301                    301                   
7 22                   22                   22                   19                   12                   12                   12                   62                   18                   9                     9                     9                     9                     9                     9                     375                    375                   
10 38                   38                   38                   32                   23                   23                   23                   96                   35                   20                   17                   17                   17                   17                   17                   472                    472                   
12 52                   52                   52                   62                   34                   32                   32                   172                 47                   32                   25                   25                   25                   25                   25                   520                    520                   
15 185                 185                 185                 206                 111                 93                   93                   509                 144                 67                   63                   61                   61                   61                   61                   575                    575                   
17 359                 358                 358                 364                 218                 180                 180                 1,022             305                 157                 140                 139                 139                 139                 139                 623                    623                   
20 789                 789                 789                 722                 555                 487                 487                 6,008             698                 453                 395                 391                 391                 391                 391                 718                    718                   
22 1,595             1,590             1,590             1,052             803                 736                 731                 36,499           1,037             672                 575                 562                 562                 562                 562                 842                    842                   
25 8,935             8,928             8,927             3,308             2,127             2,088             2,079             1,079,392     3,877             1,405             1,272             1,248             1,245             1,245             1,245             1,133                1,133               
Set_3
Set_4
t=8 t=16 t=32
Falsely-accepted sequences Truly-accepted
Set_1
Set_2
Falsely-accepted sequences Truly-accepted
Edlib Parasail
 18 
combined with the SneakySnake algorithm. The green scale and the red scale represent the low and 
high execution time, respectively. 
 
 
 
 
 
 
 
 
 
 
Execution time
E | SneakySnake w/ Edlib, t=8: y=3 y=6 y=9 | t=16: y=3 y=6 y=9 y=12 | t=32: y=3 y=6 y=9 y=12 y=15 y=18 y=21 y=32 | Edlib
Set_1
0 | 13.59 12.72 12.32 | 13.28 12.84 13.05 12.59 | 12.80 12.35 12.63 12.51 12.58 12.78 12.36 13.23 | 225.10
1 | 24.90 24.25 24.46 | 24.36 24.37 23.37 23.40 | 23.74 23.46 23.84 23.47 23.06 23.28 23.72 23.22 | 270.74
2 | 53.47 52.50 52.58 | 50.76 50.95 50.73 50.89 | 50.09 49.78 50.10 50.45 49.75 49.71 50.28 49.82 | 325.10
3 | 97.94 97.57 97.12 | 93.56 93.60 93.51 93.81 | 91.55 90.98 90.53 91.30 91.41 91.64 90.96 91.38 | 384.26
4 | 153.02 152.56 152.37 | 146.51 145.81 145.65 145.86 | 144.88 141.85 141.90 141.52 142.16 142.26 142.15 142.24 | 440.17
5 | 214.98 214.05 214.37 | 205.14 203.71 203.74 203.37 | 207.95 197.51 197.74 197.57 197.98 196.88 197.44 197.64 | 489.80
6 | 286.20 284.79 285.52 | 272.03 267.93 268.29 267.74 | 282.32 259.37 259.19 259.29 259.37 259.06 259.08 259.30 | 535.24
7 | 365.18 363.64 363.45 | 345.85 340.56 340.00 339.18 | 368.69 328.76 327.24 328.36 327.71 327.80 327.97 328.10 | 577.64
8 | 453.76 453.07 453.38 | 431.09 422.83 422.35 422.62 | 476.93 409.39 406.86 407.54 406.67 406.34 406.17 407.22 | 619.33
9 | 545.33 544.33 545.29 | 522.54 510.35 510.58 510.89 | 587.83 496.44 491.48 490.17 491.78 491.07 491.33 491.71 | 659.80
10 | 633.14 631.50 631.33 | 612.82 597.34 597.11 597.18 | 683.34 585.64 576.77 578.12 577.18 576.90 577.13 578.42 | 697.85
Set_2
0 | 18.69 16.18 16.38 | 16.69 16.39 15.44 24.20 | 17.02 16.08 22.33 17.12 17.22 17.64 17.54 16.94 | 212.784
1 | 19.20 17.21 16.72 | 17.21 16.77 16.38 24.45 | 19.02 16.61 22.18 18.10 17.96 18.53 18.60 18.52 | 220.8721
2 | 21.44 18.18 20.79 | 18.03 17.82 18.10 28.12 | 19.84 18.41 23.59 19.73 19.84 18.96 19.72 19.72 | 224.2372
3 | 22.92 20.70 24.06 | 20.27 20.52 20.31 31.46 | 21.56 20.00 26.25 21.56 22.06 20.94 21.60 22.29 | 226.2495
4 | 26.63 24.60 29.03 | 23.96 23.84 23.74 36.08 | 24.95 22.80 31.06 24.54 24.88 24.24 24.72 25.41 | 228.7664
5 | 30.66 28.28 29.60 | 28.55 27.22 29.18 41.85 | 27.99 26.44 37.40 29.09 29.01 28.90 28.65 28.95 | 231.1207
6 | 36.63 32.95 33.03 | 32.26 31.59 35.25 46.94 | 32.40 31.00 39.75 33.81 33.32 33.79 33.55 33.23 | 233.5021
7 | 42.24 37.45 38.29 | 37.24 36.26 40.64 54.41 | 38.09 34.67 45.38 38.30 38.11 38.07 38.19 38.35 | 235.955
8 | 48.63 43.33 43.77 | 42.48 41.11 55.01 63.65 | 39.82 39.27 56.45 44.17 43.30 43.42 43.00 43.24 | 238.2126
9 | 57.27 50.58 52.56 | 48.25 47.73 69.37 59.22 | 48.12 45.73 75.57 49.72 50.34 50.14 50.33 50.22 | 239.3653
10 | 66.61 57.64 59.63 | 55.27 53.73 65.39 59.61 | 63.97 53.80 79.31 56.43 56.31 56.02 56.80 56.07 | 242.3329
Set_3
0 | 18.99 20.60 20.16 | 19.55 20.00 19.70 21.72 | 19.79 19.82 20.76 20.41 22.20 21.86 19.33 19.05 | 414.10
2 | 37.81 39.24 39.32 | 38.08 37.77 38.66 39.04 | 37.43 37.79 38.35 38.27 40.36 39.88 36.90 36.81 | 494.41
5 | 62.22 64.95 64.74 | 62.71 61.84 61.88 62.16 | 60.50 59.90 60.21 60.79 62.73 60.56 58.45 57.99 | 544.12
7 | 86.13 90.02 89.71 | 85.99 83.55 84.58 85.91 | 81.79 81.25 81.30 85.15 86.40 86.52 78.76 77.78 | 580.12
10 | 142.38 146.83 148.04 | 136.67 135.79 135.50 137.87 | 132.68 129.30 126.67 137.48 136.54 136.91 124.33 123.41 | 648.19
12 | 192.16 199.14 199.81 | 183.05 182.06 180.72 180.77 | 181.85 173.01 169.92 182.03 181.36 181.01 165.42 164.17 | 698.63
15 | 278.81 288.05 289.08 | 265.50 262.49 262.34 262.53 | 270.38 249.47 245.96 260.26 251.33 245.91 237.35 236.16 | 776.64
17 | 344.68 357.16 356.66 | 323.11 321.55 320.81 321.96 | 341.55 307.30 294.24 317.82 317.95 291.54 288.43 288.65 | 823.72
20 | 471.99 481.89 480.95 | 432.98 424.45 421.60 428.15 | 521.74 409.13 382.27 415.00 414.42 377.64 377.16 377.63 | 893.10
22 | 589.37 587.53 586.54 | 513.95 507.66 503.44 519.98 | 701.40 493.20 456.22 491.27 491.11 447.85 447.56 447.79 | 936.31
25 | 798.97 796.34 794.75 | 689.28 670.98 667.49 684.50 | 1038.97 668.20 607.08 640.47 637.73 585.16 585.53 583.79 | 1004.11
Set_4
0 | 18.40 15.14 17.29 | 17.72 21.98 17.88 18.61 | 16.12 17.31 17.14 17.54 15.63 15.81 20.30 17.55 | 360.5509
2 | 20.36 18.04 20.10 | 18.59 25.70 20.22 20.08 | 17.89 20.14 18.64 19.89 17.88 17.97 24.59 19.56 | 377.8348
5 | 30.51 27.97 30.63 | 26.72 39.65 28.62 28.80 | 26.01 27.54 27.23 27.41 25.91 26.00 38.90 28.56 | 386.2661
7 | 41.08 38.49 41.40 | 35.37 54.01 39.10 39.33 | 34.18 36.42 34.17 36.12 33.71 34.61 55.11 36.79 | 391.2792
10 | 63.78 58.21 63.04 | 51.36 83.29 57.68 58.48 | 49.36 52.50 51.05 50.81 49.56 50.52 74.38 55.10 | 396.9489
12 | 84.15 77.06 82.28 | 64.46 95.23 74.06 73.99 | 61.88 66.32 62.85 63.17 62.97 63.39 103.19 69.48 | 402.547
15 | 114.74 121.05 116.78 | 88.57 102.04 102.64 104.20 | 83.17 90.64 94.79 88.24 87.20 88.91 130.48 95.15 | 408.621
17 | 134.67 173.88 143.63 | 106.84 124.66 118.33 119.48 | 101.71 112.69 115.23 105.19 106.10 117.66 152.10 115.65 | 413.4121
20 | 175.59 219.13 188.12 | 140.55 163.72 148.67 153.98 | 135.46 145.44 147.17 137.57 137.35 167.21 173.56 151.08 | 416.6745
22 | 209.16 255.33 217.33 | 165.58 193.13 181.58 180.57 | 183.68 170.96 167.67 162.41 164.36 205.72 176.08 176.29 | 423.1108
25 | 278.20 370.41 276.04 | 205.99 241.12 228.37 234.25 | 208.07 210.72 199.94 200.97 224.94 255.94 220.08 219.02 | 434.0238

Table 11: Effects of the number of replications (y), the number of obstacles that can be avoided,
within a search window (t) of the SneakySnake algorithm on the end-to-end execution of Parasail
combined with the SneakySnake algorithm. The green scale and the red scale represent the low and
high execution time, respectively.

Execution time
E | SneakySnake w/ Parasail, t=8: y=3 y=6 y=9 | t=16: y=3 y=6 y=9 y=12 | t=32: y=3 y=6 y=9 y=12 y=15 y=18 y=21 y=32 | Parasail
Set_1
0 | 11.60 10.73 10.33 | 11.29 10.85 11.06 10.60 | 10.81 10.36 10.64 10.52 10.59 10.79 10.37 11.24 | 69.00
1 | 19.85 19.20 19.41 | 19.38 19.39 18.39 18.42 | 18.78 18.50 18.88 18.51 18.10 18.32 18.76 18.26 | 161.72
2 | 41.58 40.61 40.69 | 39.16 39.35 39.13 39.29 | 38.61 38.30 38.62 38.97 38.27 38.23 38.80 38.34 | 222.93
3 | 78.32 77.95 77.50 | 74.59 74.63 74.54 74.84 | 72.87 72.30 71.85 72.62 72.73 72.96 72.28 72.70 | 289.13
4 | 127.06 126.60 126.41 | 121.53 120.89 120.73 120.94 | 119.92 117.40 117.45 117.07 117.71 117.81 117.70 117.79 | 352.91
5 | 183.49 182.58 182.90 | 174.95 173.71 173.74 173.37 | 177.00 168.18 168.41 168.24 168.65 167.55 168.11 168.31 | 407.96
6 | 263.33 261.95 262.68 | 250.23 246.37 246.73 246.18 | 259.32 238.39 238.21 238.31 238.39 238.08 238.10 238.32 | 487.16
7 | 347.66 346.15 345.96 | 329.17 324.15 323.59 322.77 | 350.56 312.81 311.33 312.45 311.80 311.89 312.06 312.19 | 546.82
8 | 453.39 452.70 453.02 | 430.74 422.48 422.00 422.27 | 476.53 409.06 406.53 407.21 406.34 406.01 405.84 406.89 | 618.78
9 | 546.85 545.84 546.80 | 524.00 511.77 512.00 512.31 | 589.50 497.83 492.85 491.54 493.15 492.44 492.70 493.08 | 661.80
10 | 652.04 650.35 650.18 | 631.18 615.21 614.97 615.05 | 704.21 603.28 594.12 595.46 594.52 594.24 594.47 595.76 | 720.45
Set_2
0 | 18.69 16.18 16.38 | 16.69 16.39 15.44 24.20 | 17.02 16.08 22.33 17.12 17.22 17.64 17.54 16.94 | 78.5394
1 | 19.20 17.21 16.72 | 17.21 16.77 16.38 24.45 | 19.02 16.61 22.18 18.10 17.96 18.53 18.60 18.52 | 139.684
2 | 21.44 18.18 20.79 | 18.03 17.82 18.10 28.12 | 19.84 18.41 23.59 19.73 19.84 18.96 19.72 19.72 | 197.5745
3 | 22.92 20.70 24.06 | 20.27 20.52 20.31 31.46 | 21.56 20.00 26.25 21.56 22.06 20.94 21.60 22.29 | 261.3701
4 | 26.63 24.60 29.03 | 23.96 23.84 23.74 36.08 | 24.95 22.80 31.06 24.54 24.88 24.24 24.72 25.41 | 330.1078
5 | 30.66 28.28 29.60 | 28.55 27.22 29.18 41.85 | 27.99 26.44 37.40 29.09 29.01 28.90 28.65 28.95 | 386.5291
6 | 36.64 32.96 33.04 | 32.26 31.59 35.25 46.94 | 32.41 31.00 39.75 33.81 33.32 33.79 33.55 33.23 | 448.7464
7 | 42.26 37.47 38.31 | 37.24 36.26 40.64 54.41 | 38.11 34.68 45.39 38.31 38.12 38.08 38.20 38.36 | 511.774
8 | 48.70 43.40 43.84 | 42.50 41.13 55.03 63.67 | 40.16 39.29 56.47 44.19 43.32 43.44 43.02 43.26 | 574.1217
9 | 57.49 50.80 52.78 | 48.34 47.82 69.46 59.31 | 52.40 45.81 75.65 49.80 50.42 50.22 50.41 50.30 | 636.1744
10 | 67.24 58.27 60.26 | 55.57 54.01 65.67 59.89 | 89.94 54.03 79.53 56.65 56.53 56.24 57.02 56.29 | 706.1189
Set_3
0 | 13.87 15.48 15.04 | 14.43 14.88 14.58 16.60 | 14.67 14.70 15.64 15.29 17.08 16.74 14.21 13.93 | 197.15
2 | 40.16 41.59 41.67 | 40.41 40.10 40.99 41.37 | 39.76 40.12 40.68 40.60 42.69 42.21 39.23 39.14 | 541.63
5 | 93.79 96.52 96.31 | 94.12 93.03 93.07 93.35 | 92.78 90.93 91.24 91.82 93.76 91.59 89.48 89.02 | 1005.10
7 | 151.30 155.19 154.87 | 150.09 147.16 148.14 149.47 | 148.14 144.64 144.19 148.04 149.29 149.41 141.65 140.67 | 1347.73
10 | 286.55 290.98 292.19 | 274.99 272.68 272.28 274.65 | 279.85 265.12 260.79 271.29 270.35 270.72 258.14 257.22 | 1804.81
12 | 410.63 417.58 418.25 | 390.81 387.03 385.53 385.58 | 409.38 376.22 369.86 381.26 380.59 380.24 364.65 363.40 | 2071.57
15 | 651.74 660.92 661.95 | 620.35 610.34 609.77 609.95 | 678.83 597.86 585.10 597.47 588.21 582.79 574.23 573.04 | 2536.76
17 | 844.26 856.60 856.09 | 795.44 781.65 780.17 781.31 | 920.19 772.20 742.90 762.58 762.05 735.58 732.47 732.69 | 2838.03
20 | 1238.19 1247.71 1246.76 | 1138.78 1100.43 1096.11 1102.63 | 1548.56 1103.41 1035.99 1059.73 1057.14 1020.24 1019.75 1020.22 | 3299.15
22 | 1611.73 1609.22 1608.23 | 1438.30 1379.43 1372.56 1389.05 | 2243.82 1404.13 1292.27 1311.35 1307.66 1264.13 1263.83 1264.06 | 3598.94
25 | 2377.30 2373.50 2371.89 | 2081.59 1956.43 1947.22 1964.08 | 3621.77 2049.75 1827.07 1822.36 1811.89 1758.74 1759.09 1757.35 | 4035.22
Set_4
0 | 18.40 15.14 17.29 | 17.72 21.98 17.88 18.61 | 16.12 17.31 17.14 17.54 15.63 15.81 20.30 17.55 | 153.1312
2 | 20.36 18.04 20.10 | 18.59 25.70 20.22 20.08 | 17.89 20.14 18.64 19.89 17.88 17.97 24.59 19.56 | 473.2962
5 | 30.52 27.98 30.64 | 26.73 39.66 28.63 28.81 | 26.02 27.55 27.24 27.42 25.92 26.01 38.91 28.57 | 954.771
7 | 41.09 38.50 41.41 | 35.38 54.02 39.11 39.34 | 34.19 36.43 34.18 36.13 33.72 34.62 55.12 36.80 | 1272.7334
10 | 63.80 58.23 63.06 | 51.38 83.31 57.70 58.50 | 49.39 52.52 51.07 50.83 49.58 50.54 74.40 55.12 | 1992.2721
12 | 84.19 77.10 82.32 | 64.50 95.27 74.10 74.03 | 61.93 66.36 62.89 63.21 63.01 63.43 103.23 69.52 | 2521.6944
15 | 114.81 121.12 116.85 | 88.64 102.10 102.70 104.26 | 83.27 90.70 94.85 88.30 87.26 88.97 130.54 95.21 | 3094.0229
17 | 134.75 173.96 143.71 | 106.92 124.73 118.40 119.55 | 101.85 112.77 115.29 105.25 106.16 117.72 152.16 115.71 | 2840.8838
20 | 175.73 219.27 188.26 | 140.69 163.84 148.78 154.09 | 136.10 145.57 147.28 137.67 137.45 167.31 173.66 151.18 | 3265.8746
22 | 209.42 255.59 217.59 | 165.77 193.30 181.75 180.74 | 187.57 171.15 167.83 162.56 164.51 205.87 176.23 176.44 | 3550.7063
25 | 279.39 371.60 277.23 | 206.52 241.50 228.75 234.63 | 336.46 211.32 200.24 201.26 225.23 256.23 220.37 219.31 | 3998.5033

10.6. Evaluating Resource Analysis and Execution time of Snake-on-Chip
We now examine the FPGA resource utilization for the hardware implementation of the GateKeeper, Shouji,
MAGNET, and Snake-on-Chip pre-alignment filters. We build the FPGA implementation of Snake-on-Chip
using a sub-matrix width of 8 columns (t=8) and we include 3 replications in the design. We evaluate
our four pre-alignment filters using a single FPGA chip. We use 60 million sequence pairs, each of which
is 100 bp long, from Set_1 and Set_2. We provide several hardware designs for two commonly used edit
distance thresholds, 2 bp and 5 bp, for a sequence length of 100 bp. The VC709 FPGA chip contains 433,200
slice LUTs (look-up tables) and 866,400 slice registers (flip-flops). Table 12 lists the FPGA resource
utilization for a single filtering unit. We make five main observations. (1) The design for a single MAGNET
filtering unit requires about 10.5% and 37.8% of the available LUTs for edit distance thresholds of 2 bp
and 5 bp, respectively. Hence, MAGNET can process 8 and 2 sequence pairs concurrently for edit distance
thresholds of 2 bp and 5 bp, respectively, without violating the timing constraints of our hardware
accelerator. (2) The design for a single Shouji filtering unit requires about 15×-21.9× fewer LUTs than
MAGNET. This enables Shouji to achieve more parallelism than the MAGNET design, as it can fit 16
filtering units within the same FPGA chip. (3) GateKeeper requires about 26.9×-53× and 1.7×-2.4× fewer
LUTs than MAGNET and Shouji, respectively. GateKeeper can also examine up to 16 sequence
pairs at the same time. (4) Snake-on-Chip requires 15.4×-26.6× fewer LUTs than MAGNET. While
Snake-on-Chip requires slightly fewer LUTs than Shouji, it requires about 2× more LUTs than
GateKeeper. Snake-on-Chip can also examine up to 16 sequence pairs concurrently. (5) We observe that
the hardware implementations of Shouji, MAGNET, and Snake-on-Chip require pipelining the design (i.e.,
shortening the critical path delay of each processing core by dividing it into stages or smaller tasks) to
meet the timing constraints and achieve more parallelism.
 
Table 12: FPGA resource usage for a single filtering unit of GateKeeper, Shouji, MAGNET, and 
Snake-on-Chip for a sequence length of 100 and under different edit distance thresholds (E). 
 
Filter         E (bp)  Slice LUT  Slice Register  No. of Filtering Units
GateKeeper     2       0.39%      0.01%           16
GateKeeper     5       0.71%      0.01%           16
Shouji         2       0.69%      0.08%           16
Shouji         5       1.72%      0.16%           16
MAGNET         2       10.50%     0.80%           8
MAGNET         5       37.80%     2.30%           2
Snake-on-Chip  2       0.68%      0.16%           16
Snake-on-Chip  5       1.42%      0.34%           16
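
The LUT ratios quoted in the observations can be recomputed from the Slice LUT column of Table 12. The Python sketch below does so; the dictionary layout and the helper name `lut_ratio` are ours, for illustration only, and the printed ratios agree with the ranges in the text only up to rounding.

```python
# Slice LUT utilization (% of the FPGA's LUTs) for one filtering unit, from Table 12.
# Inner keys are the edit distance threshold E in bp.
lut_percent = {
    "GateKeeper":    {2: 0.39, 5: 0.71},
    "Shouji":        {2: 0.69, 5: 1.72},
    "MAGNET":        {2: 10.50, 5: 37.80},
    "Snake-on-Chip": {2: 0.68, 5: 1.42},
}

def lut_ratio(bigger, smaller, e):
    """How many times more LUTs `bigger` needs than `smaller` at threshold e (bp)."""
    return lut_percent[bigger][e] / lut_percent[smaller][e]

comparisons = [
    ("MAGNET", "Shouji"),
    ("MAGNET", "GateKeeper"),
    ("Shouji", "GateKeeper"),
    ("MAGNET", "Snake-on-Chip"),
    ("Snake-on-Chip", "GateKeeper"),
]

for bigger, smaller in comparisons:
    ratios = ", ".join(f"E={e}: {lut_ratio(bigger, smaller, e):.1f}x" for e in (2, 5))
    print(f"{bigger} vs {smaller}: {ratios}")
```

Note that LUT utilization alone does not determine the number of filtering units: a MAGNET unit at E = 2 bp uses 10.5% of the LUTs, yet only 8 units are instantiated, which the text attributes to the timing constraints of the accelerator.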
 
We also analyze the execution time of our hardware pre-alignment filters, GateKeeper, MAGNET, Shouji,
and Snake-on-Chip. For a single filtering unit, each of the four pre-alignment filters takes about 0.7233
seconds to examine Set_1 and Set_2, regardless of the edit distance threshold used (we test
E = 0% to 5% of the sequence length). This is because these hardware architectures use a 250
MHz clock signal that synchronizes the entire computation. That is, increasing the edit distance threshold
directly increases the number of HRTs for each SNR subproblem, but it does not necessarily increase the
execution time, as the FPGA provides a large number of LUTs that operate in parallel. This parallelism is
limited only by the available FPGA resources and the operating frequency.
This is clear from the FPGA resource usage, which is correlated with the filtering accuracy and the edit
distance threshold. For example, the least accurate filter, GateKeeper, occupies the fewest FPGA resources
of the four filters.
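
A back-of-the-envelope reading of these numbers, as a Python sketch. It assumes that the 0.7233 seconds covers all 60 million pairs of Set_1 and Set_2 and that the design runs at exactly 250 MHz; any host-to-FPGA transfer overhead that may be included in the measurement is ignored, so the cycle count is only an upper bound on the compute cost per pair. The variable names are ours.

```python
CLOCK_HZ = 250e6        # clock frequency stated in the text
ELAPSED_S = 0.7233      # time for one filtering unit to examine Set_1 and Set_2
NUM_PAIRS = 60_000_000  # assumption: all 60 million pairs fall within that time

total_cycles = CLOCK_HZ * ELAPSED_S
cycles_per_pair = total_cycles / NUM_PAIRS
pairs_per_second = NUM_PAIRS / ELAPSED_S

print(f"{cycles_per_pair:.2f} cycles per pair, "
      f"{pairs_per_second / 1e6:.1f} million pairs per second")
```

Under these assumptions each pair costs a fixed budget of about three clock cycles, independent of E, which is consistent with the observation that a larger threshold consumes more LUTs rather than more time.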
 
We conclude that Snake-on-Chip requires a reasonably small number of LUTs, which allows for integrating
a large number of filtering units that can examine a large number of sequence pairs in parallel.
 
10.7. Evaluating Accuracy and Execution time of Snake-on-GPU
We now examine 1) the execution time of Snake-on-GPU and 2) the number of sequence pairs that are
accepted/rejected using the Set_1 and Set_2 datasets. We use the cudaEventElapsedTime() function to
measure the total execution time, which we report in Table 13.
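
To put the timing breakdown of Table 13 in perspective, the Python sketch below computes the share of the end-to-end time spent on data transfer for three rows copied from the first block of that table (E = 0, 5, and 10). The helper name `transfer_share` and the choice of rows are ours, for illustration only.

```python
# (E, computation, data transfer, end-to-end), all in seconds,
# copied from the first block of Table 13.
rows = [
    (0, 0.0903, 0.4818, 0.5722),
    (5, 0.1251, 0.4529, 0.5781),
    (10, 0.1815, 0.4636, 0.6451),
]

def transfer_share(transfer_s, end_to_end_s):
    """Fraction of the end-to-end time spent on data transfer."""
    return transfer_s / end_to_end_s

shares = {}
for e, compute_s, transfer_s, total_s in rows:
    # The reported end-to-end time equals computation + transfer up to rounding.
    assert abs((compute_s + transfer_s) - total_s) < 2e-4
    shares[e] = transfer_share(transfer_s, total_s)
    print(f"E={e}: data transfer is {shares[e]:.0%} of the end-to-end time")
```

For these rows, data transfer accounts for well over half of the end-to-end time, while the computation time grows only modestly with E.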
Table 13: The execution time (in seconds) of Snake-on-GPU, using NVIDIA GeForce RTX 2080Ti
card, under different edit distance thresholds. We use Set_1 and Set_2 with a read length of 100.

Set_1
E   Computation (sec)  Data Transfer (sec)  end-to-end (sec)  Accepted    Rejected
0   0.0903             0.4818               0.5722            653,408     29,346,106
1   0.1004             0.4529               0.5534            2,065,683   27,932,871
2   0.1050             0.4530               0.5581            4,665,768   25,331,194
3   0.1097             0.4558               0.5655            7,601,344   22,393,785
4   0.1173             0.4519               0.5692            10,460,264  19,533,122
5   0.1251             0.4529               0.5781            13,202,659  16,789,361
6   0.1320             0.4597               0.5918            16,029,917  13,960,784
7   0.1579             0.6049               0.7628            18,836,982  11,152,303
8   0.1560             0.5354               0.6914            21,604,033  8,383,825
9   0.1681             0.4727               0.6408            24,019,045  5,967,465
10  0.1815             0.4636               0.6451            25,994,473  3,990,988

Set_2
E   Computation (sec)  Data Transfer (sec)  end-to-end (sec)  Accepted    Rejected
0   0.0877             0.4900               0.5777            11          29,999,989
1   0.1002             0.4533               0.5535            22          29,999,978
2   0.1017             0.4518               0.5534            29          29,999,971
3   0.1024             0.4483               0.5507            34          29,999,966
4   0.1047             0.4494               0.5540            61          29,999,939
5   0.1080             0.4492               0.5572            292         29,999,708
6   0.1078             0.4548               0.5626            1,287       29,998,713
7   0.1324             0.6449               0.7773            4,233       29,995,767
8   0.1233             0.5221               0.6453            12,039      29,987,961
9   0.1302             0.4522               0.5824            30,176      29,969,824
10  0.1393             0.4537               0.5931            68,791      29,931,209

11. References:
Al Kawam, A., Khatri, S. and Datta, A. (2017) A Survey of Software and Hardware Approaches to Performing
Read Alignment in Next Generation Sequencing, IEEE/ACM Transactions on Computational Biology and
Bioinformatics (TCBB), 14, 1202-1213.
Alkan, C., Kidd, J. M., Marques-Bonet, T., Aksay, G., Antonacci, F., Hormozdiari, F., Kitzman, J. O., Baker,
C., Malig, M. and Mutlu, O. (2009) Personalized copy number and segmental duplication maps using
next-generation sequencing, Nature genetics, 41, 1061-1067.
Alser, M., Hassan, H., Kumar, A., Mutlu, O. and Alkan, C. (2019) Shouji: a fast and efficient pre-alignment
filter for sequence alignment, Bioinformatics.
Alser, M., Hassan, H., Xin, H., Ergin, O., Mutlu, O. and Alkan, C. (2017a) GateKeeper: a new hardware
architecture for accelerating pre-alignment in DNA short read mapping, Bioinformatics, 33, 3355-3363.
Alser, M., Mutlu, O. and Alkan, C. (2017b) MAGNET: understanding and improving the accuracy of genome
pre-alignment filtering, Transactions on Internet Research, 13, 33-42.
Aluru, S. and Jammula, N. (2014) A review of hardware acceleration for computational genomics, Design
& Test, IEEE, 31, 19-30.
Banerjee, S. S., El-Hadedy, M., Lim, J. B., Kalbarczyk, Z. T., Chen, D., Lumetta, S. and Iyer, R. K. (2018) ASAP:
Accelerated Short-Read Alignment on Programmable Hardware, arXiv preprint arXiv:1803.02657.
Chen, P., Wang, C., Li, X. and Zhou, X. (2014) Accelerating the next generation long read mapping with the
FPGA-based system, IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 11,
840-852.
Chen, Y.-T., Cong, J., Fang, Z., Lei, J. and Wei, P. (2016) When spark meets FPGAs: a case study for next-
generation DNA sequencing acceleration. Field-Programmable Custom Computing Machines (FCCM), 
2016 IEEE 24th Annual International Symposium on. IEEE, pp. 29-29. 
Daily, J. (2016) Parasail: SIMD C library for global, semi-global, and local pairwise sequence alignments, 
BMC bioinformatics, 17, 81. 
Dimitrakopoulos, G., Galanopoulos, K., Mavrokefalidis, C. and Nikolos, D. (2008) Low-power leading-zero 
counting and anticipation logic for high-speed floating point units, IEEE transactions on very large scale 
integration (VLSI) systems, 16, 837-850. 
Fei, X., Dan, Z., Lina, L., Xin, M. and Chunlei, Z. (2018) FPGASW: Accelerating Large-Scale Smith–Waterman 
Sequence Alignment Application with Backtracking on FPGA Linear Systolic Array, Interdisciplinary 
Sciences: Computational Life Sciences, 10, 176-188. 
Georganas, E., Buluç, A., Chapman, J., Oliker, L., Rokhsar, D. and Yelick, K. (2015) meraligner: A fully parallel 
sequence aligner. Parallel and Distributed Processing Symposium (IPDPS), 2015 IEEE International. 
IEEE, pp. 561-570. 
Gómez-Luna, J., González-Linares, J. M., Benavides, J. I. and Guil, N. (2012) Performance models for
asynchronous data transfers on consumer graphics processing units, Journal of Parallel and Distributed 
Computing, 72, 1117-1126. 
Henikoff, S. and Henikoff, J. G. (1992) Amino acid substitution matrices from protein blocks, Proceedings 
of the National Academy of Sciences, 89, 10915-10919. 
Huangfu, W., Li, S., Hu, X. and Xie, Y. (2018) RADAR: a 3D-reRAM based DNA alignment accelerator 
architecture. 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC). IEEE, pp. 1-6. 
Kim, J. S., Cali, D. S., Xin, H., Lee, D., Ghose, S., Alser, M., Hassan, H., Ergin, O., Alkan, C. and Mutlu, O. 
(2018) GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory 
technologies, BMC genomics, 19, 89. 
Kung, H.-T. (1982) Why systolic architectures?, IEEE computer, 15, 37-46. 
Liu, Y. and Schmidt, B. (2015) GSWABE: faster GPU-accelerated sequence alignment with optimal 
alignment retrieval for short DNA sequences, Concurrency and Computation: Practice and Experience, 
27, 958-972. 
Liu, Y., Wirawan, A. and Schmidt, B. (2013) CUDASW++ 3.0: accelerating Smith-Waterman protein 
database search by coupling CPU and GPU SIMD instructions, BMC bioinformatics, 14, 117. 
Mutlu, O., Ghose, S., Gómez-Luna, J. and Ausavarungnirun, R. (2019) Processing data where it makes 
sense: Enabling in-memory computation, Microprocessors and Microsystems, 67, 28-41. 
Ng, H.-C., Liu, S. and Luk, W. (2017) Reconfigurable acceleration of genetic sequence alignment: A survey 
of two decades of efforts. Field Programmable Logic and Applications (FPL), 2017 27th International 
Conference on. IEEE, pp. 1-8. 
Nishimura, T., Bordim, J. L., Ito, Y. and Nakano, K. (2017) Accelerating the Smith-Waterman Algorithm 
Using Bitwise Parallel Bulk Computation Technique on GPU. Parallel and Distributed Processing 
Symposium Workshops (IPDPSW), 2017 IEEE International. IEEE, pp. 932-941. 
NVIDIA (2019) CUDA C Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-
guide/index.html. 
Sandes, E. F. D. O., Boukerche, A. and Melo, A. C. M. A. D. (2016) Parallel optimal pairwise biological 
sequence comparison: Algorithms, platforms, and classification, ACM Computing Surveys (CSUR), 48, 
63. 
Šošić, M. and Šikić, M. (2017) Edlib: a C/C++ library for fast, exact sequence alignment using edit distance, 
Bioinformatics, 33, 1394-1395. 
Ukkonen, E. (1985) Algorithms for approximate string matching, Information and control, 64, 100-118. 
Waidyasooriya, H. and Hariyama, M. (2015) Hardware-Acceleration of Short-read Alignment Based on the 
Burrows-Wheeler Transform, Parallel and Distributed Systems, IEEE Transactions on, PP, 1-1. 
Wang, C., Yan, R.-X., Wang, X.-F., Si, J.-N. and Zhang, Z. (2011) Comparison of linear gap penalties and 
profile-based variable gap penalties in profile–profile alignments, Computational biology and 
chemistry, 35, 308-318. 
Xin, H., Greth, J., Emmons, J., Pekhimenko, G., Kingsford, C., Alkan, C. and Mutlu, O. (2015) Shifted 
Hamming Distance: A Fast and Accurate SIMD-Friendly Filter to Accelerate Alignment Verification in 
Read Mapping, Bioinformatics, 31, 1553-1560. 
Xin, H., Lee, D., Hormozdiari, F., Yedkar, S., Mutlu, O. and Alkan, C. (2013) Accelerating read mapping with 
FastHASH, BMC genomics, 14, S13. 
 
 
